Message ID | 20230125155557.37816-2-mjguzik@gmail.com (mailing list archive)
---|---
State | Handled Elsewhere
Series | [v3,1/2] capability: add cap_isidentical
On Wed, Jan 25, 2023 at 7:56 AM Mateusz Guzik <mjguzik@gmail.com> wrote: > > Turns out for typical consumers the resulting creds would be identical > and this can be checked upfront, avoiding the hard work. I've applied this v3 of the two patches. Normally it would go through Al, but he's clearly been under the weather and is drowning in email. Besides, I'm comfortable with this particular set of patches anyway as I was involved in the previous round of access() overhead avoidance with the whole RCU grace period thing. So I think we're all better off having Al look at any iov_iter issues. Anybody holler if there are issues, Linus
On Mon, Feb 27, 2023 at 04:44:06PM -0800, Linus Torvalds wrote: > On Wed, Jan 25, 2023 at 7:56 AM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > Turns out for typical consumers the resulting creds would be identical > > and this can be checked upfront, avoiding the hard work. > > I've applied this v3 of the two patches. > > Normally it would go through Al, but he's clearly been under the > weather and is drowning in email. Besides, I'm comfortable with this > particular set of patches anyway as I was involved in the previous > round of access() overhead avoidance with the whole RCU grace period > thing. > > So I think we're all better off having Al look at any iov_iter issues. > > Anybody holler if there are issues, Fwiw, as long as you, Al, and others are fine with it and I'm aware of it I'm happy to pick up more stuff like this. I've done it before and have worked in this area so I'm happy to help with some of the load. Christian
On Thu, Mar 2, 2023 at 12:30 AM Christian Brauner <brauner@kernel.org> wrote: > > Fwiw, as long as you, Al, and others are fine with it and I'm aware of > it I'm happy to pick up more stuff like this. I've done it before and > have worked in this area so I'm happy to help with some of the load. Yeah, that would work. We've actually had discussions of vfs maintenance in general. In this case it really wasn't an issue, with this being just two fairly straightforward patches for code that I was familiar with. Linus
On Thu, Mar 02, 2023 at 09:30:25AM +0100, Christian Brauner wrote: > On Mon, Feb 27, 2023 at 04:44:06PM -0800, Linus Torvalds wrote: > > On Wed, Jan 25, 2023 at 7:56 AM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > > > Turns out for typical consumers the resulting creds would be identical > > > and this can be checked upfront, avoiding the hard work. > > > > I've applied this v3 of the two patches. > > > > Normally it would go through Al, but he's clearly been under the > > weather and is drowning in email. Besides, I'm comfortable with this > > particular set of patches anyway as I was involved in the previous > > round of access() overhead avoidance with the whole RCU grace period > > thing. > > > > So I think we're all better off having Al look at any iov_iter issues. > > > > Anybody holler if there are issues, > > Fwiw, as long as you, Al, and others are fine with it and I'm aware of > it I'm happy to pick up more stuff like this. I've done it before and > have worked in this area so I'm happy to help with some of the load. TBH, I've missed that series; patches look sane, so consider them belatedly ACKed. And I've no problem with sharing the load - you have been doing that with idmapping stuff anyway. As far as I'm concerned, I generally trust your taste; it doesn't mean that I won't disagree with specific decisions, of course, but...
On 3/2/23, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Mar 2, 2023 at 12:30 AM Christian Brauner <brauner@kernel.org> > wrote: >> >> Fwiw, as long as you, Al, and others are fine with it and I'm aware of >> it I'm happy to pick up more stuff like this. I've done it before and >> have worked in this area so I'm happy to help with some of the load. > > Yeah, that would work. We've actually had discussions of vfs > maintenance in general. > > In this case it really wasn't an issue, with this being just two > fairly straightforward patches for code that I was familiar with. > Well on that note I intend to write a patch which would add a flag to the dentry cache. There is this thing named CONFIG_INIT_ON_ALLOC_DEFAULT_ON which is enabled at least on debian, ubuntu and arch. It results in mandatory memset on the obj before it gets returned from kmalloc, for hardening purposes. Now the problem is that dentry cache allocates bufs 4096 bytes in size, so this is an equivalent of a clear_page call and it happens *every time* there is a path lookup. Given how dentry cache is used, I'm confident there is 0 hardening benefit for it. So the plan would be to add a flag on cache creation to exempt it from the mandatory memset treatment and use it with dentry. I don't have numbers handy but as you can imagine it gave me a nice bump :) Whatever you think about the idea aside, the q is: can something like the above go in without Al approving it? If so, I'll try to find time to hack it up this or next week. I had some other patches to write too, but $reallife.
On Thu, Mar 02, 2023 at 07:14:24PM +0100, Mateusz Guzik wrote: > On 3/2/23, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, Mar 2, 2023 at 12:30 AM Christian Brauner <brauner@kernel.org> > > wrote: > >> > >> Fwiw, as long as you, Al, and others are fine with it and I'm aware of > >> it I'm happy to pick up more stuff like this. I've done it before and > >> have worked in this area so I'm happy to help with some of the load. > > > > Yeah, that would work. We've actually had discussions of vfs > > maintenance in general. > > > > In this case it really wasn't an issue, with this being just two > > fairly straightforward patches for code that I was familiar with. > > > > Well on that note I intend to write a patch which would add a flag to > the dentry cache. > > There is this thing named CONFIG_INIT_ON_ALLOC_DEFAULT_ON which is > enabled at least on debian, ubuntu and arch. It results in mandatory > memset on the obj before it gets returned from kmalloc, for hardening > purposes. > > Now the problem is that dentry cache allocates bufs 4096 bytes in > size, so this is an equivalent of a clear_page call and it happens > *every time* there is a path lookup. Huh? Quite a few path lookups don't end up allocating any dentries; what exactly are you talking about? > Given how dentry cache is used, I'm confident there is 0 hardening > benefit for it. > > So the plan would be to add a flag on cache creation to exempt it from > the mandatory memset treatment and use it with dentry. > > I don't have numbers handy but as you can imagine it gave me a nice bump :) > > Whatever you think about the idea aside, the q is: can something like > the above go in without Al approving it? That one I would really like to take a look at.
On 3/2/23, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Thu, Mar 02, 2023 at 07:14:24PM +0100, Mateusz Guzik wrote: >> On 3/2/23, Linus Torvalds <torvalds@linux-foundation.org> wrote: >> > On Thu, Mar 2, 2023 at 12:30 AM Christian Brauner <brauner@kernel.org> >> > wrote: >> >> >> >> Fwiw, as long as you, Al, and others are fine with it and I'm aware of >> >> it I'm happy to pick up more stuff like this. I've done it before and >> >> have worked in this area so I'm happy to help with some of the load. >> > >> > Yeah, that would work. We've actually had discussions of vfs >> > maintenance in general. >> > >> > In this case it really wasn't an issue, with this being just two >> > fairly straightforward patches for code that I was familiar with. >> > >> >> Well on that note I intend to write a patch which would add a flag to >> the dentry cache. >> >> There is this thing named CONFIG_INIT_ON_ALLOC_DEFAULT_ON which is >> enabled at least on debian, ubuntu and arch. It results in mandatory >> memset on the obj before it gets returned from kmalloc, for hardening >> purposes. >> >> Now the problem is that dentry cache allocates bufs 4096 bytes in >> size, so this is an equivalent of a clear_page call and it happens >> *every time* there is a path lookup. > > Huh? Quite a few path lookups don't end up allocating any dentries; > what exactly are you talking about? > Ops, I meant "names_cache", here: names_cachep = kmem_cache_create_usercopy("names_cache", PATH_MAX, 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, 0, PATH_MAX, NULL); it is fs/dcache.c and I brainfarted into the above. >> Given how dentry cache is used, I'm confident there is 0 hardening >> benefit for it. >> >> So the plan would be to add a flag on cache creation to exempt it from >> the mandatory memset treatment and use it with dentry. >> >> I don't have numbers handy but as you can imagine it gave me a nice bump >> :) >> >> Whatever you think about the idea aside, the q is: can something like >> the above go in without Al approving it? > > That one I would really like to take a look at. > allright
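A minimal sketch of the cache-creation opt-out being discussed, applied to the names_cache creation quoted above. SLAB_SKIP_ZERO is a made-up name for illustration, not an existing kernel flag, and the hook placement is only approximate:

#include <linux/fs.h>
#include <linux/limits.h>
#include <linux/slab.h>

/* Hypothetical flag: caches created with it would be exempt from
 * init_on_alloc pre-zeroing.  The bit value assumes a free slab flag. */
#define SLAB_SKIP_ZERO	((slab_flags_t)0x20000000U)

void names_cache_init_sketch(void)
{
	names_cachep = kmem_cache_create_usercopy("names_cache", PATH_MAX, 0,
			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_SKIP_ZERO,
			0, PATH_MAX, NULL);
}

/* and, roughly, in the slab post-alloc hook:
 *	if (want_init_on_alloc(flags) && !(s->flags & SLAB_SKIP_ZERO))
 *		memset(obj, 0, s->object_size);
 */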
On Thu, Mar 02, 2023 at 06:18:32PM +0000, Al Viro wrote: > On Thu, Mar 02, 2023 at 07:14:24PM +0100, Mateusz Guzik wrote: > > On 3/2/23, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Thu, Mar 2, 2023 at 12:30 AM Christian Brauner <brauner@kernel.org> > > > wrote: > > >> > > >> Fwiw, as long as you, Al, and others are fine with it and I'm aware of > > >> it I'm happy to pick up more stuff like this. I've done it before and > > >> have worked in this area so I'm happy to help with some of the load. > > > > > > Yeah, that would work. We've actually had discussions of vfs > > > maintenance in general. > > > > > > In this case it really wasn't an issue, with this being just two > > > fairly straightforward patches for code that I was familiar with. > > > > > > > Well on that note I intend to write a patch which would add a flag to > > the dentry cache. > > > > There is this thing named CONFIG_INIT_ON_ALLOC_DEFAULT_ON which is > > enabled at least on debian, ubuntu and arch. It results in mandatory > > memset on the obj before it gets returned from kmalloc, for hardening > > purposes. > > > > Now the problem is that dentry cache allocates bufs 4096 bytes in > > size, so this is an equivalent of a clear_page call and it happens > > *every time* there is a path lookup. > > Huh? Quite a few path lookups don't end up allocating any dentries; > what exactly are you talking about? > > > Given how dentry cache is used, I'm confident there is 0 hardening > > benefit for it. > > > > So the plan would be to add a flag on cache creation to exempt it from > > the mandatory memset treatment and use it with dentry. > > > > I don't have numbers handy but as you can imagine it gave me a nice bump :) Hmm... Let's see what __d_alloc() will explicitly store into: [*] unsigned int d_flags; [*] seqcount_spinlock_t d_seq; [*] struct hlist_bl_node d_hash; [*] struct dentry *d_parent; [*] struct qstr d_name; [*] struct inode *d_inode; unsigned char d_iname[DNAME_INLINE_LEN]; [*] struct lockref d_lockref; [*] const struct dentry_operations *d_op; [*] struct super_block *d_sb; unsigned long d_time; [*] void *d_fsdata; union { [*] struct list_head d_lru; wait_queue_head_t *d_wait; }; [*] struct list_head d_child; [*] struct list_head d_subdirs; union { struct hlist_node d_alias; struct hlist_bl_node d_in_lookup_hash; struct rcu_head d_rcu; } d_u; These are straightforward "overwrite no matter what". ->d_time is not initialized at all - it's fs-private, and unused for the majority of filesystems (kernfs, nfs and vboxsf are the only users). ->d_alias/->d_in_lookup_hash/->d_rcu are also uninitialized - those are valid only on some stages of dentry lifecycle (which is why they can share space) and initialization is responsibility of the places that do the corresponding state transitions. ->d_iname is the only somewhat delicate one. I think we are OK as it is (i.e. that having all of it zeroed out won't buy anything), but that's not trivial. Note that the last byte does need to be initialized, despite the memcpy(dname, name->name, name->len); dname[name->len] = 0; following it. I'm not saying that getting rid of pre-zeroing the entire underlying page is the wrong thing to do; it's just that it needs some analysis in commit message.
On Thu, Mar 02, 2023 at 07:22:17PM +0100, Mateusz Guzik wrote: > Ops, I meant "names_cache", here: > names_cachep = kmem_cache_create_usercopy("names_cache", PATH_MAX, 0, > SLAB_HWCACHE_ALIGN|SLAB_PANIC, 0, PATH_MAX, NULL); > > it is fs/dcache.c and I brainfarted into the above. So you mean __getname() stuff?
On 3/2/23, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Thu, Mar 02, 2023 at 07:22:17PM +0100, Mateusz Guzik wrote: > >> Ops, I meant "names_cache", here: >> names_cachep = kmem_cache_create_usercopy("names_cache", PATH_MAX, 0, >> SLAB_HWCACHE_ALIGN|SLAB_PANIC, 0, PATH_MAX, NULL); >> >> it is fs/dcache.c and I brainfarted into the above. > > So you mean __getname() stuff? > yes. do some lookups in a loop on a kernel built with CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y (there may be a runtime switch for it?) and you will see memset using most time in perf top
On Thu, Mar 02, 2023 at 06:43:39PM +0000, Al Viro wrote: > On Thu, Mar 02, 2023 at 07:22:17PM +0100, Mateusz Guzik wrote: > > > Ops, I meant "names_cache", here: > > names_cachep = kmem_cache_create_usercopy("names_cache", PATH_MAX, 0, > > SLAB_HWCACHE_ALIGN|SLAB_PANIC, 0, PATH_MAX, NULL); > > > > it is fs/dcache.c and I brainfarted into the above. > > So you mean __getname() stuff? The thing is, getname_flags()/getname_kernel() is not the only user of that thing; grep and you'll see (and keep in mind that cifs alloc_dentry_path() is a __getname() wrapper, with its own callers). We might have bugs papered over^W^Whardened away^W^Wpapered over in some of those users. I agree that getname_flags()/getname_kernel()/sys_getcwd() have no need of pre-zeroing; fw_get_filesystem_firmware(), ceph_mdsc_build_path(), [hostfs]dentry_name() and ima_d_path() seem to be safe. So's vboxsf_path_from_dentry() (I think). But with this bunch I'd need a review before I'd be willing to say "this security theatre buys us nothing here": fs/cifs/cifsproto.h:67: return __getname(); fs/exfat/dir.c:195: nb->lfn = __getname(); fs/fat/dir.c:287: *unicode = __getname(); fs/fat/namei_vfat.c:598: uname = __getname(); fs/ntfs3/dir.c:394: name = __getname(); fs/ntfs3/inode.c:1289: new_de = __getname(); fs/ntfs3/inode.c:1694: de = __getname(); fs/ntfs3/inode.c:1732: de = __getname(); fs/ntfs3/namei.c:71: struct cpu_str *uni = __getname(); fs/ntfs3/namei.c:286: de = __getname(); fs/ntfs3/namei.c:355: struct cpu_str *uni = __getname(); fs/ntfs3/namei.c:494: uni = __getname(); fs/ntfs3/namei.c:555: uni1 = __getname(); fs/ntfs3/xattr.c:532: buf = __getname(); fs/cifs/cifs_dfs_ref.c:168: page = alloc_dentry_path(); fs/cifs/cifsacl.c:1697: page = alloc_dentry_path(); fs/cifs/cifsacl.c:1760: page = alloc_dentry_path(); fs/cifs/cifsproto.h:65:static inline void *alloc_dentry_path(void) fs/cifs/dir.c:187: void *page = alloc_dentry_path(); fs/cifs/dir.c:604: page = alloc_dentry_path(); fs/cifs/dir.c:664: page = alloc_dentry_path(); fs/cifs/file.c:594: page = alloc_dentry_path(); fs/cifs/file.c:796: page = alloc_dentry_path(); fs/cifs/file.c:2223: void *page = alloc_dentry_path(); fs/cifs/file.c:2255: void *page = alloc_dentry_path(); fs/cifs/inode.c:1663: page = alloc_dentry_path(); fs/cifs/inode.c:1938: page = alloc_dentry_path(); fs/cifs/inode.c:2001: void *page = alloc_dentry_path(); fs/cifs/inode.c:2170: page1 = alloc_dentry_path(); fs/cifs/inode.c:2171: page2 = alloc_dentry_path(); fs/cifs/inode.c:2446: page = alloc_dentry_path(); fs/cifs/inode.c:2738: void *page = alloc_dentry_path(); fs/cifs/inode.c:2893: void *page = alloc_dentry_path(); fs/cifs/ioctl.c:34: void *page = alloc_dentry_path(); fs/cifs/link.c:491: page1 = alloc_dentry_path(); fs/cifs/link.c:492: page2 = alloc_dentry_path(); fs/cifs/misc.c:803: page = alloc_dentry_path(); fs/cifs/readdir.c:1071: void *page = alloc_dentry_path(); fs/cifs/smb2ops.c:2059: void *page = alloc_dentry_path(); fs/cifs/xattr.c:112: page = alloc_dentry_path(); fs/cifs/xattr.c:277: page = alloc_dentry_path(); fs/cifs/xattr.c:382: page = alloc_dentry_path();
On Thu, Mar 2, 2023 at 10:22 AM Mateusz Guzik <mjguzik@gmail.com> wrote: > > Ops, I meant "names_cache", here: > names_cachep = kmem_cache_create_usercopy("names_cache", PATH_MAX, 0, > SLAB_HWCACHE_ALIGN|SLAB_PANIC, 0, PATH_MAX, NULL); That code just needs a __GFP_SKIP_ZERO. It might be best if we actually exposed it as a SLAB_SKIP_ZERO thing, just to make it possible to say - exactly in situations like this - that this particular slab cache has no advantage from pre-zeroing. This doesn't sound like a vfs issue, this is a hardening issue where apparently people now use that INIT_ON_ALLOC_DEFAULT_ON in "real use" and then you notice how horrid the performance impact can be. But there might also be some possible interactions with KASAN etc. Adding some hardening people to the cc. Linus
On Thu, Mar 2, 2023 at 11:03 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > It might be best if we actually exposed it as a SLAB_SKIP_ZERO thing, > just to make it possible to say - exactly in situations like this - > that this particular slab cache has no advantage from pre-zeroing. Actually, maybe it's just as well to keep it per-allocation, and just special-case getname_flags() itself. We could replace the __getname() there with just a kmem_cache_alloc(names_cachep, GFP_KERNEL | __GFP_SKIP_ZERO); we're going to overwrite the beginning of the buffer with the path we copy from user space, and then we'd have to make people comfortable with the fact that even with zero initialization hardening on, the space after the filename wouldn't be initialized... Linus
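A sketch of what that special-casing could look like, following the shape of getname_flags() in fs/namei.c; __GFP_SKIP_ZERO as a generally usable opt-out is only a proposal in this thread, so treat the flag and the exact placement as assumptions:

	struct filename *result;
	char *kname;
	int len;

	/* proposed: skip init_on_alloc zeroing for this one allocation */
	result = kmem_cache_alloc(names_cachep, GFP_KERNEL | __GFP_SKIP_ZERO);
	if (unlikely(!result))
		return ERR_PTR(-ENOMEM);

	kname = (char *)result->iname;
	result->name = kname;

	/* the copy from user space overwrites the front of the buffer,
	 * so only the bytes past the terminating NUL stay uninitialized */
	len = strncpy_from_user(kname, filename, EMBEDDED_NAME_MAX);
	if (unlikely(len < 0)) {
		__putname(result);
		return ERR_PTR(len);
	}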
On Thu, Mar 02, 2023 at 07:02:10PM +0000, Al Viro wrote: > On Thu, Mar 02, 2023 at 06:43:39PM +0000, Al Viro wrote: > > On Thu, Mar 02, 2023 at 07:22:17PM +0100, Mateusz Guzik wrote: > > > > > Ops, I meant "names_cache", here: > > > names_cachep = kmem_cache_create_usercopy("names_cache", PATH_MAX, 0, > > > SLAB_HWCACHE_ALIGN|SLAB_PANIC, 0, PATH_MAX, NULL); > > > > > > it is fs/dcache.c and I brainfarted into the above. > > > > So you mean __getname() stuff? > > The thing is, getname_flags()/getname_kernel() is not the only user of that > thing; grep and you'll see (and keep in mind that cifs alloc_dentry_path() > is a __getname() wrapper, with its own callers). We might have bugs papered > over^W^Whardened away^W^Wpapered over in some of those users. > > I agree that getname_flags()/getname_kernel()/sys_getcwd() have no need of > pre-zeroing; fw_get_filesystem_firmware(), ceph_mdsc_build_path(), > [hostfs]dentry_name() and ima_d_path() seem to be safe. So's > vboxsf_path_from_dentry() (I think). But with this bunch I'd need > a review before I'd be willing to say "this security theatre buys us > nothing here": [snip the list] PS: ripping this bandaid off might very well be the right thing to do, it's just that "I'm confident there is 0 hardening benefit for it" needs a code review is some moderately grotty places. It's not too awful (e.g. in case of cifs most of the callers are immediately followed by build_path_from_dentry(), which stores the pathname in the end of page and returns the pointer to beginning of initialized part; verifying that after that allocation + build_path we only access the parts past the returned pointer until it's time to free the buffer is not hard), but it's worth doing.
On Thu, Mar 02, 2023 at 11:10:03AM -0800, Linus Torvalds wrote: > On Thu, Mar 2, 2023 at 11:03 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > It might be best if we actually exposed it as a SLAB_SKIP_ZERO thing, > > just to make it possible to say - exactly in situations like this - > > that this particular slab cache has no advantage from pre-zeroing. > > Actually, maybe it's just as well to keep it per-allocation, and just > special-case getname_flags() itself. > > We could replace the __getname() there with just a > > kmem_cache_alloc(names_cachep, GFP_KERNEL | __GFP_SKIP_ZERO); > > we're going to overwrite the beginning of the buffer with the path we > copy from user space, and then we'd have to make people comfortable > with the fact that even with zero initialization hardening on, the > space after the filename wouldn't be initialized... ACK; same in getname_kernel() and sys_getcwd(), at the very least.
On Thu, Mar 02, 2023 at 11:10:03AM -0800, Linus Torvalds wrote: > On Thu, Mar 2, 2023 at 11:03 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > It might be best if we actually exposed it as a SLAB_SKIP_ZERO thing, > > just to make it possible to say - exactly in situations like this - > > that this particular slab cache has no advantage from pre-zeroing. > > Actually, maybe it's just as well to keep it per-allocation, and just > special-case getname_flags() itself. > > We could replace the __getname() there with just a > > kmem_cache_alloc(names_cachep, GFP_KERNEL | __GFP_SKIP_ZERO); > > we're going to overwrite the beginning of the buffer with the path we > copy from user space, and then we'd have to make people comfortable > with the fact that even with zero initialization hardening on, the > space after the filename wouldn't be initialized... Yeah, I'd love to have a way to safely opt-out of always-zero. The discussion[1] when we originally did this devolved into a guessing game on performance since no one could actually point to workloads that were affected by it, beyond skbuff[2]. So in the interest of not over-engineering a solution to an unknown problem, the plan was once someone found a problem, we could find a sensible solution at that time. And so here we are! :) I'd always wanted to avoid a "don't zero" flag and instead adjust APIs so the allocation could include a callback to do the memory content filling that would return a size-that-was-initialized result. That way we don't end up in the situations we've seen so many times with drivers, etc, where an uninit buffer is handed off and some path fails to actually fill it with anything. However, in practice, I think this kind of API change becomes really hard to do. -Kees [1] https://lore.kernel.org/all/20190514143537.10435-4-glider@google.com/ [2] https://lore.kernel.org/all/20190514143537.10435-5-glider@google.com/
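As an aside, a toy sketch of the "fill callback" shape described above; none of these helpers exist, and the names and signature are invented purely to illustrate the idea of pairing an allocation with a guaranteed initializer:

#include <linux/slab.h>
#include <linux/string.h>

/* hypothetical: the initializer reports how many bytes it actually wrote */
typedef size_t (*kmem_fill_t)(void *buf, size_t size, void *ctx);

static inline void *kmem_cache_alloc_filled(struct kmem_cache *s, gfp_t flags,
					    kmem_fill_t fill, void *ctx)
{
	size_t size = kmem_cache_size(s);
	void *obj = kmem_cache_alloc(s, flags);
	size_t used;

	if (!obj)
		return NULL;
	used = fill(obj, size, ctx);	/* always runs, so no uninit buffer is handed off */
	if (used < size)
		memset((char *)obj + used, 0, size - used);	/* only the tail needs clearing */
	return obj;
}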
On Thu, Mar 02, 2023 at 11:38:50AM -0800, Kees Cook wrote: > On Thu, Mar 02, 2023 at 11:10:03AM -0800, Linus Torvalds wrote: > > On Thu, Mar 2, 2023 at 11:03 AM Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > It might be best if we actually exposed it as a SLAB_SKIP_ZERO thing, > > > just to make it possible to say - exactly in situations like this - > > > that this particular slab cache has no advantage from pre-zeroing. > > > > Actually, maybe it's just as well to keep it per-allocation, and just > > special-case getname_flags() itself. > > > > We could replace the __getname() there with just a > > > > kmem_cache_alloc(names_cachep, GFP_KERNEL | __GFP_SKIP_ZERO); > > > > we're going to overwrite the beginning of the buffer with the path we > > copy from user space, and then we'd have to make people comfortable > > with the fact that even with zero initialization hardening on, the > > space after the filename wouldn't be initialized... > > Yeah, I'd love to have a way to safely opt-out of always-zero. The > discussion[1] when we originally did this devolved into a guessing > game on performance since no one could actually point to workloads > that were affected by it, beyond skbuff[2]. So in the interest of not > over-engineering a solution to an unknown problem, the plan was once > someone found a problem, we could find a sensible solution at that > time. And so here we are! :) > > I'd always wanted to avoid a "don't zero" flag and instead adjust APIs so > the allocation could include a callback to do the memory content filling > that would return a size-that-was-initialized result. That way we don't > end up in the situations we've seen so many times with drivers, etc, > where an uninit buffer is handed off and some path fails to actually > fill it with anything. However, in practice, I think this kind of API > change becomes really hard to do. > Having not been following init_on_alloc very closely myself, I'm a bit surprised that an opt-out flag never made it into the final version. Was names_cachep considered in those earlier discussions? I think that's a pretty obvious use case for an opt-out. Every syscall that operates on a path allocates a 4K buffer from names_cachep. - Eric
On Thu, Mar 02, 2023 at 07:19:49PM +0000, Al Viro wrote: > On Thu, Mar 02, 2023 at 11:10:03AM -0800, Linus Torvalds wrote: > > On Thu, Mar 2, 2023 at 11:03 AM Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > It might be best if we actually exposed it as a SLAB_SKIP_ZERO thing, > > > just to make it possible to say - exactly in situations like this - > > > that this particular slab cache has no advantage from pre-zeroing. > > > > Actually, maybe it's just as well to keep it per-allocation, and just > > special-case getname_flags() itself. > > > > We could replace the __getname() there with just a > > > > kmem_cache_alloc(names_cachep, GFP_KERNEL | __GFP_SKIP_ZERO); > > > > we're going to overwrite the beginning of the buffer with the path we > > copy from user space, and then we'd have to make people comfortable > > with the fact that even with zero initialization hardening on, the > > space after the filename wouldn't be initialized... > > ACK; same in getname_kernel() and sys_getcwd(), at the very least. FWIW, much earlier analysis suggested opting out these kmem caches: buffer_head names_cache mm_struct anon_vma skbuff_head_cache skbuff_fclone_cache Alexander's analysis more recently[2] of skbuff went a bit further, I think, and allowed opt-out for non-kmem cache page allocations too. -Kees [1] https://lore.kernel.org/all/20190514143537.10435-5-glider@google.com/
On Thu, Mar 02, 2023 at 11:54:03AM -0800, Kees Cook wrote: > On Thu, Mar 02, 2023 at 07:19:49PM +0000, Al Viro wrote: > > On Thu, Mar 02, 2023 at 11:10:03AM -0800, Linus Torvalds wrote: > > > On Thu, Mar 2, 2023 at 11:03 AM Linus Torvalds > > > <torvalds@linux-foundation.org> wrote: > > > > > > > > It might be best if we actually exposed it as a SLAB_SKIP_ZERO thing, > > > > just to make it possible to say - exactly in situations like this - > > > > that this particular slab cache has no advantage from pre-zeroing. > > > > > > Actually, maybe it's just as well to keep it per-allocation, and just > > > special-case getname_flags() itself. > > > > > > We could replace the __getname() there with just a > > > > > > kmem_cache_alloc(names_cachep, GFP_KERNEL | __GFP_SKIP_ZERO); > > > > > > we're going to overwrite the beginning of the buffer with the path we > > > copy from user space, and then we'd have to make people comfortable > > > with the fact that even with zero initialization hardening on, the > > > space after the filename wouldn't be initialized... > > > > ACK; same in getname_kernel() and sys_getcwd(), at the very least. > > FWIW, much earlier analysis suggested opting out these kmem caches: > > buffer_head > names_cache > mm_struct > anon_vma > skbuff_head_cache > skbuff_fclone_cache I would probably add dentry_cache to it; the only subtle part is ->d_iname and I'm convinced that explicit "make sure there's a NUL at the very end" is enough.
On Thu, Mar 02, 2023 at 06:11:07PM +0000, Al Viro wrote: > On Thu, Mar 02, 2023 at 09:30:25AM +0100, Christian Brauner wrote: > > On Mon, Feb 27, 2023 at 04:44:06PM -0800, Linus Torvalds wrote: > > > On Wed, Jan 25, 2023 at 7:56 AM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > > > > > Turns out for typical consumers the resulting creds would be identical > > > > and this can be checked upfront, avoiding the hard work. > > > > > > I've applied this v3 of the two patches. > > > > > > Normally it would go through Al, but he's clearly been under the > > > weather and is drowning in email. Besides, I'm comfortable with this > > > particular set of patches anyway as I was involved in the previous > > > round of access() overhead avoidance with the whole RCU grace period > > > thing. > > > > > > So I think we're all better off having Al look at any iov_iter issues. > > > > > > Anybody holler if there are issues, > > > > Fwiw, as long as you, Al, and others are fine with it and I'm aware of > > it I'm happy to pick up more stuff like this. I've done it before and > > have worked in this area so I'm happy to help with some of the load. > > TBH, I've missed that series; patches look sane, so consider them > belatedly ACKed. > > And I've no problem with sharing the load - you have been doing that with > idmapping stuff anyway. As far as I'm concerned, I generally trust your > taste; it doesn't matter that I won't disagree with specific decisions, > of course, but... Thanks, I appreciate that! Sure, you won't agree with everything. That's kinda expected for reasons of preference alone but also simply because there's a lot of stuff that probably only you know atm. In general, I never pick up any stuff I can't review myself unless it's in-your-face obvious or it already has acks from the subject matter experts. But it is much easier to help maintain vfs stuff now that you made it clear that you're ok with this. So I'm happy to share more of the vfs maintenance workload.
On Thu, Mar 02, 2023 at 09:51:48AM -0800, Linus Torvalds wrote: > On Thu, Mar 2, 2023 at 12:30 AM Christian Brauner <brauner@kernel.org> wrote: > > > > Fwiw, as long as you, Al, and others are fine with it and I'm aware of > > it I'm happy to pick up more stuff like this. I've done it before and > > have worked in this area so I'm happy to help with some of the load. > > Yeah, that would work. We've actually had discussions of vfs > maintenance in general. As I've said to Al in the other thread I'm happy to help more. > > In this case it really wasn't an issue, with this being just two > fairly straightforward patches for code that I was familiar with. Yup, I understand.
On Thu, Mar 2, 2023 at 9:11 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Thu, Mar 02, 2023 at 11:54:03AM -0800, Kees Cook wrote: > > On Thu, Mar 02, 2023 at 07:19:49PM +0000, Al Viro wrote: > > > On Thu, Mar 02, 2023 at 11:10:03AM -0800, Linus Torvalds wrote: > > > > On Thu, Mar 2, 2023 at 11:03 AM Linus Torvalds > > > > <torvalds@linux-foundation.org> wrote: > > > > > > > > > > It might be best if we actually exposed it as a SLAB_SKIP_ZERO thing, > > > > > just to make it possible to say - exactly in situations like this - > > > > > that this particular slab cache has no advantage from pre-zeroing. > > > > > > > > Actually, maybe it's just as well to keep it per-allocation, and just > > > > special-case getname_flags() itself. > > > > > > > > We could replace the __getname() there with just a > > > > > > > > kmem_cache_alloc(names_cachep, GFP_KERNEL | __GFP_SKIP_ZERO); > > > > > > > > we're going to overwrite the beginning of the buffer with the path we > > > > copy from user space, and then we'd have to make people comfortable > > > > with the fact that even with zero initialization hardening on, the > > > > space after the filename wouldn't be initialized... > > > > > > ACK; same in getname_kernel() and sys_getcwd(), at the very least. > > > > FWIW, much earlier analysis suggested opting out these kmem caches: > > > > buffer_head > > names_cache > > mm_struct > > anon_vma > > skbuff_head_cache > > skbuff_fclone_cache > > I would probably add dentry_cache to it; the only subtle part is > ->d_iname and I'm convinced that explicit "make sure there's a NUL > at the very end" is enough. FWIW, a couple of years ago I was looking into implementing the following scheme for opt-out that I also discussed with Kees: 1. Add a ___GFP_SKIP_ZERO flag that is not intended to be used explicitly (e.g. disallow passing it to kmalloc(), add a checkpatch.pl warning). Explicitly passing an opt-out flag to allocation functions was considered harmful previously: https://lore.kernel.org/kernel-hardening/20190423083148.GF25106@dhcp22.suse.cz/ 2. Define new allocation API that will allow opt-out: struct page *alloc_pages_uninit(gfp_t gfp, unsigned int order, const char *key); void *kmalloc_uninit(size_t size, gfp_t flags, const char *key); void *kmem_cache_alloc_uninit(struct kmem_cache *, gfp_t flags, const char *key); , where @key is an arbitrary string that identifies a single allocation or a group of allocations. 3. Provide a boot-time flag that holds a comma-separated list of opt-out keys that actually take effect: init_on_alloc.skip="xyz-camera-driver,some_big_buffer". The rationale behind this two-step mechanism is that despite certain allocations may be appealing opt-out targets for performance reasons, some vendors may choose to be on the safe side and explicitly list the allocations that should not be zeroed. Several possible enhancements include: 1. A SLAB_NOINIT memcache flag that prohibits cache merging and disables initialization. Because the number of caches is relatively small, it might be fine to explicitly pass SLAB_NOINIT at cache creation sites. Again, if needed, we could only use this flag as a hint that needs to be acknowledged by a boot-time option. 2. No opt-out for kmalloc() - instead advise users to promote the anonymous allocations they want to opt-out to memory caches with SLAB_NOINIT. 3. Provide an emergency brake that completely disables ___GFP_SKIP_ZERO if a major security breach is discovered. 
Extending this idea of per-callsite opt-out we could generate unique keys for every allocation in the kernel (or e.g. group them together by the caller name) and decide at runtime if we want to opt them out. But I am not sure anyone would want this level of control in their distro.
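A compressed sketch of how the keyed variant could look at the allocation end, with the command-line parsing stubbed out; every identifier here comes from the proposal above and nothing like it exists in the tree:

#include <linux/slab.h>

/* stub: would consult the init_on_alloc.skip= list parsed at boot */
static bool init_skip_key_allowed(const char *key)
{
	return false;
}

/* proposed helper: @key identifies the allocation site for the opt-out list */
static inline void *kmem_cache_alloc_uninit(struct kmem_cache *s, gfp_t flags,
					    const char *key)
{
	if (init_skip_key_allowed(key))
		flags |= ___GFP_SKIP_ZERO;	/* the opt-out flag from the proposal above */
	return kmem_cache_alloc(s, flags);
}

/* a call site would then be annotated with its key, e.g.
 *	buf = kmem_cache_alloc_uninit(names_cachep, GFP_KERNEL, "names_cache");
 */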
On 3/3/23, Alexander Potapenko <glider@google.com> wrote: > On Thu, Mar 2, 2023 at 9:11 PM Al Viro <viro@zeniv.linux.org.uk> wrote: >> >> On Thu, Mar 02, 2023 at 11:54:03AM -0800, Kees Cook wrote: >> > On Thu, Mar 02, 2023 at 07:19:49PM +0000, Al Viro wrote: >> > > On Thu, Mar 02, 2023 at 11:10:03AM -0800, Linus Torvalds wrote: >> > > > On Thu, Mar 2, 2023 at 11:03 AM Linus Torvalds >> > > > <torvalds@linux-foundation.org> wrote: >> > > > > >> > > > > It might be best if we actually exposed it as a SLAB_SKIP_ZERO >> > > > > thing, >> > > > > just to make it possible to say - exactly in situations like this >> > > > > - >> > > > > that this particular slab cache has no advantage from >> > > > > pre-zeroing. >> > > > >> > > > Actually, maybe it's just as well to keep it per-allocation, and >> > > > just >> > > > special-case getname_flags() itself. >> > > > >> > > > We could replace the __getname() there with just a >> > > > >> > > > kmem_cache_alloc(names_cachep, GFP_KERNEL | >> > > > __GFP_SKIP_ZERO); >> > > > >> > > > we're going to overwrite the beginning of the buffer with the path >> > > > we >> > > > copy from user space, and then we'd have to make people comfortable >> > > > with the fact that even with zero initialization hardening on, the >> > > > space after the filename wouldn't be initialized... >> > > >> > > ACK; same in getname_kernel() and sys_getcwd(), at the very least. >> > >> > FWIW, much earlier analysis suggested opting out these kmem caches: >> > >> > buffer_head >> > names_cache >> > mm_struct >> > anon_vma >> > skbuff_head_cache >> > skbuff_fclone_cache >> >> I would probably add dentry_cache to it; the only subtle part is >> ->d_iname and I'm convinced that explicit "make sure there's a NUL >> at the very end" is enough. > > FWIW, a couple of years ago I was looking into implementing the > following scheme for opt-out that I also discussed with Kees: > > 1. Add a ___GFP_SKIP_ZERO flag that is not intended to be used > explicitly (e.g. disallow passing it to kmalloc(), add a checkpatch.pl > warning). Explicitly passing an opt-out flag to allocation functions > was considered harmful previously: > https://lore.kernel.org/kernel-hardening/20190423083148.GF25106@dhcp22.suse.cz/ > > 2. Define new allocation API that will allow opt-out: > > struct page *alloc_pages_uninit(gfp_t gfp, unsigned int order, const > char *key); > void *kmalloc_uninit(size_t size, gfp_t flags, const char *key); > void *kmem_cache_alloc_uninit(struct kmem_cache *, gfp_t flags, > const char *key); > > , where @key is an arbitrary string that identifies a single > allocation or a group of allocations. > > 3. Provide a boot-time flag that holds a comma-separated list of > opt-out keys that actually take effect: > > init_on_alloc.skip="xyz-camera-driver,some_big_buffer". > > The rationale behind this two-step mechanism is that despite certain > allocations may be appealing opt-out targets for performance reasons, > some vendors may choose to be on the safe side and explicitly list the > allocations that should not be zeroed. > > Several possible enhancements include: > 1. A SLAB_NOINIT memcache flag that prohibits cache merging and > disables initialization. Because the number of caches is relatively > small, it might be fine to explicitly pass SLAB_NOINIT at cache > creation sites. > Again, if needed, we could only use this flag as a hint that needs to > be acknowledged by a boot-time option. > 2. 
No opt-out for kmalloc() - instead advise users to promote the > anonymous allocations they want to opt-out to memory caches with > SLAB_NOINIT. > 3. Provide an emergency brake that completely disables > ___GFP_SKIP_ZERO if a major security breach is discovered. > > Extending this idea of per-callsite opt-out we could generate unique > keys for every allocation in the kernel (or e.g. group them together > by the caller name) and decide at runtime if we want to opt them out. > But I am not sure anyone would want this level of control in their distro. > I intended to write a longer e-mail concerning the entire init-on-alloc situation along with some patches in not-so-distant future, but the bare minimum I wrote previously already got numerous people involved (unsurprisingly now that I look at it). I don't have time to write the thing I wanted at the moment, but now that there is traffic I think I should at least mention one other important bit. I'm not going to argue for any particular level of granularity -- I'm a happy camper as long as I can end up with names_cachep allocations excluded without disabling the entire thing. So the key is: memset is underperforming at least on x86-64 for certain sizes and the init-on-alloc thingy makes it used significantly more, exacerbating the problem. Fixing it is worthwhile on its own and will reduce the impact of the option, but the need for some form of opt-out will remain. I don't have any numbers handy nor time to produce them, so the mail will be a little handwave-y. I only write it in case someone decides to bench now what would warrant exclusion with the mechanism under discussion. Should any of the claims below be challenged, I can respond some time late next week with hard data(tm). Onto the issue: Most cpus in use today have the ERMS bit, in which case the routine is: SYM_FUNC_START_LOCAL(memset_erms) movq %rdi,%r9 movb %sil,%al movq %rdx,%rcx rep stosb movq %r9,%rax RET SYM_FUNC_END(memset_erms) The problem here is that instructions with the rep prefix have a very high startup latency. Afair this remains true even on cpus with FSRM in case of rep *stos* (what is helped by FSRM is rep *mov*, whereas stos remains unaffected). Interestingly looks like the problem was recognized in general -- memcpy and copy_{to,from}_user have hand rolled smaller copies. Apart from that __clear_user relatively recently got a treatment of that sort but it somehow never got implemented in memset itself. If memory serves right, the least slow way to do it is to *NOT* use rep stos below 128 bytes of size (and this number is even higher the better the cpu). Instead, a 32-byte loop (as in 4 times movsq) would do it as long as there is space, along with overlapping stores to take care of whatever < 32 bytes. __clear_user got rep stos if FSRM is present and 64 byte non-rep treatment, with an 8 byte loop and 1 byte loop to cover the tail. As noted above, this leaves perf on the table. Even then, it would be an improvement for memset if transplanted over. Maybe this was mentioned in the discussion concerning the func, I had not read the thread -- I only see that memset remains unpatched. memset, even with init-on-alloc disabled, is used *a lot* with very small sizes. For that bit I do have data collected over 'make' in the kernel directory. 
bpftrace -e 'kprobe:memset { @ = lhist(arg2, 0, 128, 8); }'

[0, 8)       9126030 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[8, 16)       515728 |@@ |
[16, 24)      621902 |@@ |
[24, 32)      110822 | |
[32, 40)    11003451 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[40, 48)         488 | |
[48, 56)         164 | |
[56, 64)     1493851 |@@@@@@ |
[64, 72)      214613 | |
[72, 80)       10468 | |
[80, 88)       10524 | |
[88, 96)         121 | |
[96, 104)      81591 | |
[104, 112)   1659699 |@@@@@@@ |
[112, 120)      3240 | |
[120, 128)      9058 | |
[128, ...)  11287204 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

note: i failed to figure out how to attach to memset on stock kernel, thus the kernel was booted with the crappery below:

diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S
index 6143b1a6fa2c..141d899d5f1d 100644
--- a/arch/x86/lib/memset_64.S
+++ b/arch/x86/lib/memset_64.S
@@ -45,9 +45,6 @@ SYM_FUNC_START(__memset)
 SYM_FUNC_END(__memset)
 EXPORT_SYMBOL(__memset)
 
-SYM_FUNC_ALIAS(memset, __memset)
-EXPORT_SYMBOL(memset)
-
 /*
  * ISO C memset - set a memory block to a byte value. This function uses
  * enhanced rep stosb to override the fast string function.
diff --git a/fs/open.c b/fs/open.c
index 4401a73d4032..6e11e95ad9c3 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1564,3 +1564,11 @@ int stream_open(struct inode *inode, struct file *filp)
 }
 
 EXPORT_SYMBOL(stream_open);
+
+void *(memset)(void *s, int c, size_t n)
+{
+	return __memset(s, c, n);
+}
+
+EXPORT_SYMBOL(memset);
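To make the small-size point concrete, here is a C-level sketch of the kind of dispatch being described, using the 128-byte threshold quoted earlier; real implementations do this in assembly with wider, overlapping stores rather than a byte tail loop, so this is illustrative only:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* small sizes: plain stores, no rep prefix; large sizes: the rep stosb path */
static void *memset_sketch(void *s, int c, size_t n)
{
	unsigned char *p = s;
	uint64_t pattern = 0x0101010101010101ULL * (unsigned char)c;

	if (n >= 128)
		return memset(s, c, n);	/* the startup cost of rep stosb amortizes here */

	while (n >= 8) {
		memcpy(p, &pattern, sizeof(pattern));	/* one 64-bit store after inlining */
		p += 8;
		n -= 8;
	}
	while (n--)
		*p++ = (unsigned char)c;
	return s;
}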
On Fri, Mar 3, 2023 at 9:39 AM Mateusz Guzik <mjguzik@gmail.com> wrote: > > So the key is: memset is underperforming at least on x86-64 for > certain sizes and the init-on-alloc thingy makes it used significantly > more, exacerbating the problem One reason that the kernel memset isn't as optimized as memcpy, is simply because under normal circumstances it shouldn't be *used* that much outside of page clearing and constant-sized structure initialization. Page clearing is fine, and constant-sized structure inits are also generally fine (ie the compiler does the small ones directly). So this is literally a problem with pointless automated memset, introduced by that hardening option. And hardening people generally simply don't care about performance, and the people who _do _care about performance usually don't enable the known-expensive crazy stuff. Honestly, I think the INIT_ONCE stuff is actively detrimental, and only hides issues (and in this case adds its own). So I can't but help to say "don't do that then". I think it's literally stupid to clear allocations by default. I'm not opposed to improving memset, but honestly, if the argument is based on the stupid hardening behavior, I really think *that* needs to be fixed first. Linus
On 3/3/23, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, Mar 3, 2023 at 9:39 AM Mateusz Guzik <mjguzik@gmail.com> wrote: >> >> So the key is: memset is underperforming at least on x86-64 for >> certain sizes and the init-on-alloc thingy makes it used significantly >> more, exacerbating the problem > > One reason that the kernel memset isn't as optimized as memcpy, is > simply because under normal circumstances it shouldn't be *used* that > much outside of page clearing and constant-sized structure > initialization. > I mentioned in the previous e-mail that memset is used a lot even without the problematic opt and even have shown size distribution of what's getting passed there. You made me curious how usage compares to memcpy, so I checked 'make' in kernel dir once again. I stress the init-on-alloc opt is *OFF*: # CONFIG_INIT_ON_ALLOC_DEFAULT_ON is not set # CONFIG_INIT_ON_FREE_DEFAULT_ON is not set # bpftrace -e 'kprobe:memset,kprobe:memcpy { @[probe] = count(); }' [snip] @[kprobe:memcpy]: 6948498 @[kprobe:memset]: 36150671 iow memset is used about 7 times as much as memcpy when building the kernel. If it matters I'm building on top of 2eb29d59ddf02e39774abfb60b2030b0b7e27c1f (reasonably fresh master). > So this is literally a problem with pointless automated memset, > introduced by that hardening option. And hardening people generally > simply don't care about performance, and the people who _do _care > about performance usually don't enable the known-expensive crazy > stuff. > > Honestly, I think the INIT_ONCE stuff is actively detrimental, and > only hides issues (and in this case adds its own). So I can't but help > to say "don't do that then". I think it's literally stupid to clear > allocations by default. > Questioning usability of the entire thing was on the menu in what I intended to write, but I did not read entire public discussion so perhaps I missed a crucial insight. > I'm not opposed to improving memset, but honestly, if the argument is > based on the stupid hardening behavior, I really think *that* needs to > be fixed first. > It is not. The argument is that this is a core primitive, used a lot with sizes where rep stos is detrimental to its performance. There is no urgency, but eventually someone(tm) should sort it out. For $reasons I don't want to do it myself. I do bring it up in the context of the init-on-alloc machinery because it disfigures whatever results are obtained testing on x86-64 -- they can get exactly the same effect for much smaller cost, consequently they should have interest in sorting this out. General point though was that the work should have sanity-checked performance of the primitive it relies on, instead of assuming it is ~optimal. I'm guilty of this mistake myself, so not going to throw stones.
On 3/3/23, Mateusz Guzik <mjguzik@gmail.com> wrote: > On 3/3/23, Linus Torvalds <torvalds@linux-foundation.org> wrote: >> On Fri, Mar 3, 2023 at 9:39 AM Mateusz Guzik <mjguzik@gmail.com> wrote: >>> >>> So the key is: memset is underperforming at least on x86-64 for >>> certain sizes and the init-on-alloc thingy makes it used significantly >>> more, exacerbating the problem >> >> One reason that the kernel memset isn't as optimized as memcpy, is >> simply because under normal circumstances it shouldn't be *used* that >> much outside of page clearing and constant-sized structure >> initialization. >> > > I mentioned in the previous e-mail that memset is used a lot even > without the problematic opt and even have shown size distribution of > what's getting passed there. > > You made me curious how usage compares to memcpy, so I checked 'make' > in kernel dir once again. > > I stress the init-on-alloc opt is *OFF*: > # CONFIG_INIT_ON_ALLOC_DEFAULT_ON is not set > # CONFIG_INIT_ON_FREE_DEFAULT_ON is not set > > # bpftrace -e 'kprobe:memset,kprobe:memcpy { @[probe] = count(); }' > [snip] > @[kprobe:memcpy]: 6948498 > @[kprobe:memset]: 36150671 > > iow memset is used about 7 times as much as memcpy when building the > kernel. If it matters I'm building on top of > 2eb29d59ddf02e39774abfb60b2030b0b7e27c1f (reasonably fresh master). > >> So this is literally a problem with pointless automated memset, >> introduced by that hardening option. And hardening people generally >> simply don't care about performance, and the people who _do _care >> about performance usually don't enable the known-expensive crazy >> stuff. >> >> Honestly, I think the INIT_ONCE stuff is actively detrimental, and >> only hides issues (and in this case adds its own). So I can't but help >> to say "don't do that then". I think it's literally stupid to clear >> allocations by default. >> > > Questioning usability of the entire thing was on the menu in what I > intended to write, but I did not read entire public discussion so > perhaps I missed a crucial insight. > >> I'm not opposed to improving memset, but honestly, if the argument is >> based on the stupid hardening behavior, I really think *that* needs to >> be fixed first. >> > > It is not. The argument is that this is a core primitive, used a lot > with sizes where rep stos is detrimental to its performance. There is > no urgency, but eventually someone(tm) should sort it out. For > $reasons I don't want to do it myself. > > I do bring it up in the context of the init-on-alloc machinery because > it disfigures whatever results are obtained testing on x86-64 -- they > can get exactly the same effect for much smaller cost, consequently > they should have interest in sorting this out. > > General point though was that the work should have sanity-checked > performance of the primitive it relies on, instead of assuming it is > ~optimal. I'm guilty of this mistake myself, so not going to throw > stones. > Forgot to paste the crapper I used to make both visible to bpftrace. I'm guessing there is a sensible way, I just don't see it and would love an instruction :) diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S index a64017602010..c40084308d58 100644 --- a/arch/x86/lib/memcpy_64.S +++ b/arch/x86/lib/memcpy_64.S @@ -43,9 +43,6 @@ SYM_TYPED_FUNC_START(__memcpy) SYM_FUNC_END(__memcpy) EXPORT_SYMBOL(__memcpy) -SYM_FUNC_ALIAS(memcpy, __memcpy) -EXPORT_SYMBOL(memcpy) - /* * memcpy_erms() - enhanced fast string memcpy. This is faster and * simpler than memcpy. 
Use memcpy_erms when possible. diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S index 6143b1a6fa2c..141d899d5f1d 100644 --- a/arch/x86/lib/memset_64.S +++ b/arch/x86/lib/memset_64.S @@ -45,9 +45,6 @@ SYM_FUNC_START(__memset) SYM_FUNC_END(__memset) EXPORT_SYMBOL(__memset) -SYM_FUNC_ALIAS(memset, __memset) -EXPORT_SYMBOL(memset) - /* * ISO C memset - set a memory block to a byte value. This function uses * enhanced rep stosb to override the fast string function. diff --git a/fs/open.c b/fs/open.c index 4401a73d4032..a3383391bd17 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1564,3 +1564,19 @@ int stream_open(struct inode *inode, struct file *filp) } EXPORT_SYMBOL(stream_open); + +void *(memset)(void *s, int c, size_t n) +{ + return __memset(s, c, n); +} + +EXPORT_SYMBOL(memset); + + +void *(memcpy)(void *d, const void *s, size_t n) +{ + return __memcpy(d, s, n); +} + +EXPORT_SYMBOL(memcpy);
On Fri, Mar 3, 2023 at 11:37 AM Mateusz Guzik <mjguzik@gmail.com> wrote: > > I mentioned in the previous e-mail that memset is used a lot even > without the problematic opt and even have shown size distribution of > what's getting passed there. Well, I *have* been pushing Intel to try to fix memcpy and memset for over two decades by now, but with FSRM I haven't actually seen the huge problems any more. It may just be that the loads I look at don't have issues (or the machines I've done profiles on don't tend to show them as much). Hmm. Just re-did my usual kernel profile. It may also be that something has changed. I do see "clear_page" at the top, but yes, memset is higher up than I remembered. I actually used to have the reverse of your hack for this - I've had various hacks over the year that made memcpy and memset be inlined "rep movs/stos", which (along with inlined spinlocks) is a great way to see the _culprit_ (without having to deal with the call chains - which always get done the wrong way around imnsho). Linus
On 3/3/23, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, Mar 3, 2023 at 11:37 AM Mateusz Guzik <mjguzik@gmail.com> wrote: >> >> I mentioned in the previous e-mail that memset is used a lot even >> without the problematic opt and even have shown size distribution of >> what's getting passed there. > > Well, I *have* been pushing Intel to try to fix memcpy and memset for > over two decades by now, but with FSRM I haven't actually seen the > huge problems any more. > rep *stos* remains crap with FSRM, but I don't have sensible tests handy nor the ice lake cpu i tested on at the moment > I actually used to have the reverse of your hack for this - I've had > various hacks over the year that made memcpy and memset be inlined > "rep movs/stos", which (along with inlined spinlocks) is a great way > to see the _culprit_ (without having to deal with the call chains - > which always get done the wrong way around imnsho). > that's all hackery which makes sense pre tooling like bpftrace, people can do better now (see the second part of the email) I think there is a systemic problem which comes with the kzalloc API, consider: static struct file *__alloc_file(int flags, const struct cred *cred) { struct file *f; int error; f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL); if (unlikely(!f)) return ERR_PTR(-ENOMEM); [bunch of the struct gets initialized here] the allocation routine does not have any information about the size available at compilation time, so has to resort to a memset call at runtime. Instead, should this be: f = kmem_cache_alloc(...); memset(f, 0, sizeof(*f)); ... the compiler could in principle inititalize stuff as indicated by code and emit zerofill for the rest. Interestingly, last I checked neither clang nor gcc knew how to do it, they instead resort to a full sized memset anyway, which is quite a bummer. Personally i grew up on dtrace, bpftrace I can barely use and don't know how to specifically get the caller, but kstack(2) seems like a good enough workaround. as an example here is a one-liner to show crappers which do 0-sized ops: bpftrace -e 'kprobe:memset,kprobe:memcpy /arg2 == 0/ { @[probe, kstack(2)] = count(); }' one can trace all kinds of degeneracy like that without recompiling anything, provided funcs are visible to bpftrace sample result from the above one-liner while doing 'make clean' in the kernel dir: @[kprobe:memcpy, memcpy+5 realloc_array+78 ]: 1 @[kprobe:memcpy, memcpy+5 push_jmp_history+125 ]: 1 @[kprobe:memset, memset+5 blk_mq_dispatch_rq_list+687 ]: 3 @[kprobe:memcpy, memcpy+5 mix_interrupt_randomness+192 ]: 4 @[kprobe:memcpy, memcpy+5 d_alloc_pseudo+18 ]: 59 @[kprobe:memcpy, memcpy+5 add_device_randomness+111 ]: 241 @[kprobe:memcpy, memcpy+5 add_device_randomness+93 ]: 527 @[kprobe:memset, memset+5 copy_process+2904 ]: 2054 @[kprobe:memset, memset+5 dup_fd+283 ]: 6162
On Fri, Mar 3, 2023 at 12:39 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > I think there is a systemic problem which comes with the kzalloc API Well, it's not necessarily the API that is bad, but the implementation. We could easily make kzalloc() with a constant size just expand to kmalloc+memset, and get the behavior you want. We already do magical things for "find the right slab bucket" part of kmalloc too for constant sizes. It's changed over the years, but that policy goes back a long long time. See https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=95203fe78007f9ab3aebb96606473ae18c00a5a8 from the BK history tree. Exactly because some things are worth optimizing for when the size is known at compile time. Maybe just extending kzalloc() similarly? Trivial and entirely untested patch: --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -717,6 +717,12 @@ static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags) */ static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags) { + if (__builtin_constant_p(size)) { + void *ret = kmalloc(size, flags); + if (ret) + memset(ret, 0, size); + return ret; + } return kmalloc(size, flags | __GFP_ZERO); } This may well be part of what has changed over the years. People have done a *lot* of pseudo-automated "kmalloc+memset -> kzalloc" code simplification. And in the process we've lost a lot of good optimizations. I used to do profiling religiously, but these days I only do it for particular areas (usually just the walking part of pathname lookup) Linus
On 3/3/23, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, Mar 3, 2023 at 12:39 PM Mateusz Guzik <mjguzik@gmail.com> wrote: >> >> I think there is a systemic problem which comes with the kzalloc API > > Well, it's not necessarily the API that is bad, but the implementation. > > We could easily make kzalloc() with a constant size just expand to > kmalloc+memset, and get the behavior you want. > > We already do magical things for "find the right slab bucket" part of > kmalloc too for constant sizes. It's changed over the years, but that > policy goes back a long long time. See > > > https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=95203fe78007f9ab3aebb96606473ae18c00a5a8 > > from the BK history tree. > > Exactly because some things are worth optimizing for when the size is > known at compile time. > > Maybe just extending kzalloc() similarly? Trivial and entirely untested > patch: > > --- a/include/linux/slab.h > +++ b/include/linux/slab.h > @@ -717,6 +717,12 @@ static inline void *kmem_cache_zalloc(struct > kmem_cache *k, gfp_t flags) > */ > static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags) > { > + if (__builtin_constant_p(size)) { > + void *ret = kmalloc(size, flags); > + if (ret) > + memset(ret, 0, size); > + return ret; > + } > return kmalloc(size, flags | __GFP_ZERO); > } > > This may well be part of what has changed over the years. People have > done a *lot* of pseudo-automated "kmalloc+memset -> kzalloc" code > simplification. And in the process we've lost a lot of good > optimizations. I was about to write that kzalloc can use automagic treatment. I made a change of similar sort years back in FreeBSD https://cgit.freebsd.org/src/commit/?id=34c538c3560591a3856e85988b0b5eefdde53b0c The crux of the comment though was not about kzalloc (another brainfart, apologies), but kmem_cache_zalloc -- that one is kind of screwed as is. Perhaps it would be unscrewed if calls could be converted to something in the lines of kmem_cache_zalloc_ptr(cachep, flags, returnobj); so this from __alloc_file: struct file *f; int error; f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL); if (unlikely(!f)) return ERR_PTR(-ENOMEM); could be: if (unlikely(!kmem_cache_zalloc_ptr(filp_cachep, GFP_KERNEL, f)) return ERR_PTR(-ENOMEM); ... where the macro rolls with similar treatment to the one you pasted for kzalloc. and assigns to f. if this sounds acceptable coccinelle could be used to do a sweep I don't have a good name for it.
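A rough cut of the macro shape being suggested; kmem_cache_zalloc_ptr() does not exist, and the only point of the sketch is that sizeof(*(ptr)) hands the compiler a constant size it can fold into the field initialization that follows:

#include <linux/slab.h>
#include <linux/string.h>

/* hypothetical helper: allocate from @cachep, zero by the pointee's size
 * (a compile-time constant), assign to @ptr, evaluate to "did it succeed" */
#define kmem_cache_zalloc_ptr(cachep, flags, ptr)		\
({								\
	(ptr) = kmem_cache_alloc((cachep), (flags));		\
	if ((ptr))						\
		memset((ptr), 0, sizeof(*(ptr)));		\
	(ptr) != NULL;						\
})

/* usage as sketched in the mail above:
 *	struct file *f;
 *
 *	if (unlikely(!kmem_cache_zalloc_ptr(filp_cachep, GFP_KERNEL, f)))
 *		return ERR_PTR(-ENOMEM);
 */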
On Fri, Mar 3, 2023 at 12:39 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > as an example here is a one-liner to show crappers which do 0-sized ops: > bpftrace -e 'kprobe:memset,kprobe:memcpy /arg2 == 0/ { @[probe, > kstack(2)] = count(); }' Looking not at 0-sized ops, but crazy small memset() ops, at least some of these seem to have grown from the bitmap "optimizations". In particular, 'cpumask_clear()' should just zero the cpumask, and on the config I use, I have CONFIG_NR_CPUS=64 so it should literally just be a single "store zero to cpumask word". And that's what it used to be. But then we had commit aa47a7c215e7 ("lib/cpumask: deprecate nr_cpumask_bits") and suddenly 'nr_cpumask_bits' isn't a simple constant any more for the "small mask that fits on stack" case, and instead you end up with code like movl nr_cpu_ids(%rip), %edx addq $63, %rdx shrq $3, %rdx andl $-8, %edx .. callq memset@PLT that does a 8-byte memset because I have 32 cores and 64 threads. Now, at least some distro kernels seem to be built with CONFIG_MAXSMP, so CONFIG_NR_CPUS is something insane (namely 8192), and then it is indeed better to calculate some minimum size instead of doing a 1kB memset(). But yes, it does look like we've had several regressions in various areas, where we do insane "memset()" calls with variable size that should never have been that. Both with kzalloc() and with cpumasks. There are probably lots of other cases, I just happened to notice the cpumask one when looking at "why is there a memset call in a function that shouldn't have any at all". I don't think this is a "rep stos" problem in general. Obviously it might be in some cases, but I do think that the deeper problem is that those small memset() calls generally should not have existed at all, and were actual bugs. That cpumask_clear() should never have been a function call in the first place for the "clear one or a couple of words" case. The kernel very seldom acts on byte data, so a memset() or memcpy() that looks like individual bytes is some very odd stuff. We have lots of often fairly small structures with fixed sizes, though. And maybe there are other cases like that bogus "let's try to optimize the bitmap range"... Of course, then we *do* have distros that do actively detrimental things. You have Ubuntu with that CONFIG_INIT_ON_ALLOC_DEFAULT_ON, and if I didn't just rebuild my kernel with sane config options, I'd have Fedora with CONFIG_MAXSMP that is meant to be a "check worst case" option, not something you *use*. Oh well. Linus
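A minimal userspace sketch of the code generation difference being described, assuming NR_CPUS=64 for the small case; with a compile-time size the memset() typically collapses to a single store, while a size derived from a runtime variable (standing in for nr_cpu_ids) stays an out-of-line call:

#include <string.h>

static unsigned long small_mask[1];	/* NR_CPUS == 64: one word */
static unsigned long big_mask[128];	/* MAXSMP-style 8192-bit mask */
static unsigned long runtime_nbits;	/* stands in for nr_cpu_ids */

static void clear_small(void)
{
	/* constant size: compilers typically emit a single "movq $0, small_mask(%rip)" */
	memset(small_mask, 0, sizeof(small_mask));
}

static void clear_runtime(void)
{
	/* runtime size: this stays a memset()/rep stos call */
	memset(big_mask, 0, (runtime_nbits + 7) / 8);
}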
[...] > In particular, 'cpumask_clear()' should just zero the cpumask, and on > the config I use, I have > > CONFIG_NR_CPUS=64 > > so it should literally just be a single "store zero to cpumask word". > And that's what it used to be. > > But then we had commit aa47a7c215e7 ("lib/cpumask: deprecate > nr_cpumask_bits") and suddenly 'nr_cpumask_bits' isn't a simple > constant any more for the "small mask that fits on stack" case, and > instead you end up with code like > > movl nr_cpu_ids(%rip), %edx > addq $63, %rdx > shrq $3, %rdx > andl $-8, %edx > .. > callq memset@PLT > > that does a 8-byte memset because I have 32 cores and 64 threads. Did you enable CONFIG_FORCE_NR_CPUS? If you pick it, the kernel will bind nr_cpu_ids to NR_CPUS at compile time, and the memset() call should disappear. Depending on your compiler you might want to apply this patch as well: https://lore.kernel.org/lkml/20221027043810.350460-2-yury.norov@gmail.com/ > Now, at least some distro kernels seem to be built with CONFIG_MAXSMP, > so CONFIG_NR_CPUS is something insane (namely 8192), and then it is > indeed better to calculate some minimum size instead of doing a 1kB > memset(). Ubuntu too. That was one of the reasons for the patch. Thanks, Yury
On Fri, Mar 3, 2023 at 7:25 PM Yury Norov <yury.norov@gmail.com> wrote: > > Did you enable CONFIG_FORCE_NR_CPUS? If you pick it, the kernel will > bind nr_cpu_ids to NR_CPUS at compile time, and the memset() call > should disappear. I do not believe CONFIG_FORCE_NR_CPUS makes any sense, and I think I told you so at the time. This all used to just work *without* some kind of config thing, First removing the automatic "do the right thing", and then adding a config option to "force" doing the right thing seems more than a bit silly to me. I think CONFIG_FORCE_NR_CPUS should go away, and - once more - become just the "is the cpumask small enough to be just allocated directly" thing. Of course, the problem for others remain that distros will do that CONFIG_CPUMASK_OFFSTACK thing, and then things will suck regardless. I was *so* happy with our clever "you can have large cpumasks, and we'll just allocate them off the stack" long long ago, because it meant that we could have one single source tree where this was all cleanly abstracted away, and we even had nice types and type safety for it all. That meant that we could support all the fancy SGI machines with several thousand cores, and it all "JustWorked(tm)", and didn't make the normal case any worse. I didn't expect distros to then go "ooh, we want that too", and enable it all by default, and make all our clever "you only see this indirection if you need it" go away, and now the normal case is the *bad* case, unless you just build your own kernel and pick sane defaults. Oh well. Linus
On Fri, Mar 03, 2023 at 07:42:36PM -0800, Linus Torvalds wrote: > On Fri, Mar 3, 2023 at 7:25 PM Yury Norov <yury.norov@gmail.com> wrote: > > > > Did you enable CONFIG_FORCE_NR_CPUS? If you pick it, the kernel will > > bind nr_cpu_ids to NR_CPUS at compile time, and the memset() call > > should disappear. > > I do not believe CONFIG_FORCE_NR_CPUS makes any sense, and I think I > told you so at the time. At that time you were OK with CONFIG_FORCE_NR_CPUS, only suggested to hide it behind CONFIG_EXPERT: https://lore.kernel.org/all/Yzx4fSmmr8bh6gdl@yury-laptop/T/#m92d405527636154c3b2000e0105379170d988315 > This all used to just work *without* some kind of config thing, First > removing the automatic "do the right thing", and then adding a config > option to "force" doing the right thing seems more than a bit silly to > me. > > I think CONFIG_FORCE_NR_CPUS should go away, and - once more - become > just the "is the cpumask small enough to be just allocated directly" > thing. This all was just broken. For example, as I mentioned in the commit message, cpumask_full() was broken. I know because I wrote a test. There was not a single user of the function, and nobody complained. Now we have one in BPF code. So if we simply revert aa47a7c215e, it will hurt real users. The pre-CONFIG_FORCE_NR_CPUS cpumask machinery would work only if you set NR_CPUS to a number that matches the actual number of CPUs as detected at boot time. In your example, if you have NR_CPUS == 64, and for some reason disable hyper threading, nr_cpumask_bits will be set to 64 at compile time, but nr_cpu_ids will be set to 32 at boot time, assuming CONFIG_CPUMASK_OFFSTACK is disabled. And the following code will be broken: cpumask_t m1, m2; cpumask_setall(m1); // m1 is ffff ffff ffff ffff because it uses // compile-time optimized nr_cpumask_bits for_each_cpu(cpu, m1) // 32 iterations because it relied on nr_cpu_ids cpumask_set_cpu(cpu, m2); // m2 is ffff ffff XXXX XXXX BUG_ON(!cpumask_equal(m1, m2)); // Bug because it will test all 64 bits Today with CONFIG_FORCE_NR_CPUS disabled, the kernel consistently relies on the boot-time defined nr_cpu_ids in functions like cpumask_equal(), with the cost of disabled runtime optimizations. If CONFIG_FORCE_NR_CPUS is enabled, it wires nr_cpu_ids to NR_CPUS at compile time, which allows compile-time optimization. If CONFIG_FORCE_NR_CPUS is enabled, but the actual number of CPUs doesn't match NR_CPUS, the kernel throws a warning at boot time - better than nothing. I'm not happy bothering people with a new config parameter in such a simple case. I just don't know how to fix it better. Is there a safe way to teach the compiler to optimize against NR_CPUS other than telling it explicitly? > Of course, the problem for others remain that distros will do that > CONFIG_CPUMASK_OFFSTACK thing, and then things will suck regardless. > > I was *so* happy with our clever "you can have large cpumasks, and > we'll just allocate them off the stack" long long ago, because it > meant that we could have one single source tree where this was all > cleanly abstracted away, and we even had nice types and type safety > for it all. > > That meant that we could support all the fancy SGI machines with > several thousand cores, and it all "JustWorked(tm)", and didn't make > the normal case any worse.
> > I didn't expect distros to then go "ooh, we want that too", and enable > it all by default, and make all our clever "you only see this > indirection if you need it" go away, and now the normal case is the > *bad* case, unless you just build your own kernel and pick sane > defaults. > > Oh well. From distro people's perspective, 'one size fits all' is the best approach. It's hard to blame them. Thanks, Yury
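The mismatch Yury describes above is easy to reproduce outside the kernel with plain bitmaps; a self-contained sketch (illustrative only, not kernel code) where 64 stands in for the compile-time nr_cpumask_bits and 32 for the boot-time nr_cpu_ids:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	unsigned runtime_ids = 32;	/* stands in for nr_cpu_ids */
	uint64_t m1, m2 = 0;

	m1 = ~0ULL;			/* "setall" fills all 64 compile-time bits */
	for (unsigned cpu = 0; cpu < runtime_ids; cpu++)
		m2 |= 1ULL << cpu;	/* iteration bounded by the runtime count */

	/* comparing over the full 64 bits reports a difference */
	printf("masks equal: %s\n", m1 == m2 ? "yes" : "no");
	return 0;
}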
On Fri, Mar 03, 2023 at 09:51:41PM -0800, Yury Norov wrote: > On Fri, Mar 03, 2023 at 07:42:36PM -0800, Linus Torvalds wrote: > > ing: quoted-printable > > Status: O > > Content-Length: 1593 > > Lines: 41 > > > > On Fri, Mar 3, 2023 at 7:25 PM Yury Norov <yury.norov@gmail.com> wrote: > > > > > > Did you enable CONFIG_FORCE_NR_CPUS? If you pick it, the kernel will > > > bind nr_cpu_ids to NR_CPUS at compile time, and the memset() call > > > should disappear. > > > > I do not believe CONFIG_FORCE_NR_CPUS makes any sense, and I think I > > told you so at the time. > > At that time you was OK with CONFIG_FORCE_NR_CPUS, only suggested to > hide it behind CONFIG_EXPERT: > > https://lore.kernel.org/all/Yzx4fSmmr8bh6gdl@yury-laptop/T/#m92d405527636154c3b2000e0105379170d988315 > > > This all used to just work *without* some kind of config thing, First > > removing the automatic "do the right thing", and then adding a config > > option to "force" doing the right thing seems more than a bit silly to > > me. > > > > I think CONFIG_FORCE_NR_CPUS should go away, and - once more - become > > just the "is the cpumask small enough to be just allocated directly" > > thing. > > This all was just broken. For example, as I mentioned in commit message, > cpumask_full() was broken. I know because I wrote a test. There were no > a single user for the function, and nobody complained. Now we have one > in BPF code. So if we simply revert the aa47a7c215e, it will hurt real > users. FWIW, we can remove bpf_cpumask_full() and any other affected bpf_cpumask kfuncs if we need to. kfuncs have no strict stability guarantees (they're kernel symbols meant for use by kernel programs, see [0]), so we can remove them without worrying about user space breakages or stability issues. They were also added relatively recently, and as far as I know, Tejun and I are thus far the only people using them. [0]: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/Documentation/bpf/kfuncs.rst#n350 Of course, I'd prefer that we didn't remove any of them, but if we have to then BPF won't get in the way. As you pointed out below, there are scenarios beyond cpumask_full() where (many) non-BPF callers could trip on this as well, so IMO what we have today makes sense (assuming there's not some other clever way to correctly optimize for NR_CPUS without the extra config option, as you also said below). Thanks, David > The pre-CONFIG_FORCE_NR_CPUS cpumask machinery would work only if you > set NR_CPUS to the number that matches to the actual number of CPUs as > detected at boot time. > > In your example, if you have NR_CPUS == 64, and for some reason disable > hyper threading, nr_cpumask_bits will be set to 64 at compile time, but > nr_cpu_ids will be set to 32 at boot time, assuming > CONFIG_CPUMASK_OFFSTACK is disabled. > > And the following code will be broken: > > cpumask_t m1, m2; > > cpumask_setall(m1); // m1 is ffff ffff ffff ffff because it uses > // compile-time optimized nr_cpumask_bits > > for_each_cpu(cpu, m1) // 32 iterations because it relied on nr_cpu_ids > cpumask_set_cpu(cpu, m2); // m2 is ffff ffff XXXX XXXX > > BUG_ON(!cpumask_equal(m1, m2)); // Bug because it will test all 64 bits > > Today with CONFIG_FORCE_NR_CPUS disabled, kernel consistently relies > on boot-time defined nr_cpu_ids in functions like cpumask_equal() > with the cost of disabled runtime optimizations. > > If CONFIG_FORCE_NR_CPUS is enabled, it wires nr_cpu_ids to NR_CPUS > at compile time, which allows compile-time optimization. 
> > If CONFIG_FORCE_NR_CPUS is enabled, but actual number of CPUs doesn't > match to NR_CPUS, the kernel throws a warning at boot time - better > than nothing. > > I'm not happy bothering people with a new config parameter in such a > simple case. I just don't know how to fix it better. Is there a safe > way to teach compiler to optimize against NR_CPUS other than telling > it explicitly? > > > Of course, the problem for others remain that distros will do that > > CONFIG_CPUMASK_OFFSTACK thing, and then things will suck regardless. > > > > I was *so* happy with our clever "you can have large cpumasks, and > > we'll just allocate them off the stack" long long ago, because it > > meant that we could have one single source tree where this was all > > cleanly abstracted away, and we even had nice types and type safety > > for it all. > > > > That meant that we could support all the fancy SGI machines with > > several thousand cores, and it all "JustWorked(tm)", and didn't make > > the normal case any worse. > > > > I didn't expect distros to then go "ooh, we want that too", and enable > > it all by default, and make all our clever "you only see this > > indirection if you need it" go away, and now the normal case is the > > *bad* case, unless you just build your own kernel and pick sane > > defaults. > > > > Oh well. > > From distro people's perspective, 'one size fits all' is the best > approach. It's hard to blame them. > > Thanks, > Yury
On 3/3/23, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, Mar 3, 2023 at 12:39 PM Mateusz Guzik <mjguzik@gmail.com> wrote: >> >> I think there is a systemic problem which comes with the kzalloc API > > Well, it's not necessarily the API that is bad, but the implementation. > > We could easily make kzalloc() with a constant size just expand to > kmalloc+memset, and get the behavior you want. > > We already do magical things for "find the right slab bucket" part of > kmalloc too for constant sizes. It's changed over the years, but that > policy goes back a long long time. See > > > https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=95203fe78007f9ab3aebb96606473ae18c00a5a8 > > from the BK history tree. > > Exactly because some things are worth optimizing for when the size is > known at compile time. > > Maybe just extending kzalloc() similarly? Trivial and entirely untested > patch: > > --- a/include/linux/slab.h > +++ b/include/linux/slab.h > @@ -717,6 +717,12 @@ static inline void *kmem_cache_zalloc(struct > kmem_cache *k, gfp_t flags) > */ > static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags) > { > + if (__builtin_constant_p(size)) { > + void *ret = kmalloc(size, flags); > + if (ret) > + memset(ret, 0, size); > + return ret; > + } > return kmalloc(size, flags | __GFP_ZERO); > } > So I played with this and have a rather nasty summary. Bullet points: 1. patched kzalloc does not reduce memsets calls during kernel build 2. patched kmem_cache_zalloc_ptr + 2 consumers converted *does* drop it significantly (36150671 -> 14414454) 3. ... inline memset generated by gcc sucks by resorting to rep stosq around 48 bytes 4. memsets not sorted out have sizes not known at compilation time and are not necessarily perf bugs on their own [read: would benefit from faster memset] Onto the the meat: I patched the kernel with a slightly tweaked version of the above: diff --git a/include/linux/slab.h b/include/linux/slab.h index 45af70315a94..7abb5490690f 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -717,6 +717,12 @@ static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags) */ static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags) { + if (__builtin_constant_p(size)) { + void *ret = kmalloc(size, flags); + if (likely(ret)) + memset(ret, 0, size); + return ret; + } return kmalloc(size, flags | __GFP_ZERO); } and verified it indeed zeroes inline: void kztest(void) { void *ptr; ptr = kzalloc(32, GFP_KERNEL); if (unlikely(!ptr)) return; memsettest_rec(ptr); } $ objdump --disassemble=kztest vmlinux [snip] call ffffffff8135e130 <kmalloc_trace> test %rax,%rax je ffffffff81447d5f <kztest+0x4f> movq $0x0,(%rax) mov %rax,%rdi movq $0x0,0x8(%rax) movq $0x0,0x10(%rax) movq $0x0,0x18(%rax) call ffffffff81454060 <memsettest_rec> [snip] This did *NOT* lead to reduction of memset calls when building the kernel. I verified few cases by hand, it is all either kmem_cache_zalloc or explicitly added memsets with sizes not known at compilation time. 
Two most frequent callers: @[ memset+5 __alloc_file+40 alloc_empty_file+73 path_openat+77 do_filp_open+182 do_sys_openat2+159 __x64_sys_openat+89 do_syscall_64+93 entry_SYSCALL_64_after_hwframe+114 ]: 11028994 @[ memset+5 security_file_alloc+45 __alloc_file+89 alloc_empty_file+73 path_openat+77 do_filp_open+182 do_sys_openat2+159 __x64_sys_openat+89 do_syscall_64+93 entry_SYSCALL_64_after_hwframe+114 ]: 11028994 My wip addition is: diff --git a/include/linux/slab.h b/include/linux/slab.h index 45af70315a94..12b5b02ef3d3 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -710,6 +710,17 @@ static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags) return kmem_cache_alloc(k, flags | __GFP_ZERO); } +#define kmem_cache_zalloc_ptr(k, f, retp) ({ \ + __typeof(retp) _retp = kmem_cache_alloc(k, f); \ + bool _rv = false; \ + retp = _retp; \ + if (likely(_retp)) { \ + memset(_retp, 0, sizeof(*_retp)); \ + _rv = true; \ + } \ + _rv; \ +}) + diff --git a/security/security.c b/security/security.c index cf6cc576736f..0f769ede0e54 100644 --- a/security/security.c +++ b/security/security.c @@ -600,8 +600,7 @@ static int lsm_file_alloc(struct file *file) return 0; } - file->f_security = kmem_cache_zalloc(lsm_file_cache, GFP_KERNEL); - if (file->f_security == NULL) + if (!kmem_cache_zalloc_ptr(lsm_file_cache, GFP_KERNEL, file->f_security)) return -ENOMEM; return 0; } diff --git a/fs/file_table.c b/fs/file_table.c index 372653b92617..8e0dabf9530e 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -136,8 +136,7 @@ static struct file *__alloc_file(int flags, const struct cred *cred) struct file *f; int error; - f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL); - if (unlikely(!f)) + if (!kmem_cache_zalloc_ptr(filp_cachep, GFP_KERNEL, f)) return ERR_PTR(-ENOMEM); f->f_cred = get_cred(cred); As mentioned above it cuts total calls in more than half. The problem is it is it rolls with rep stosq way too easily, partially defeating the point of inlining anything. clang does not have this problem. Take a look at __alloc_file: [snip] mov 0x19cab05(%rip),%rdi # ffffffff82df4318 <filp_cachep> call ffffffff813dd610 <kmem_cache_alloc> test %rax,%rax je ffffffff814298b7 <__alloc_file+0xc7> mov %rax,%r12 mov $0x1d,%ecx xor %eax,%eax mov %r12,%rdi rep stos %rax,%es:(%rdi) [/snip] Here is a sample consumer which can't help but have a variable size -- select, used by gmake: @[ memset+5 do_pselect.constprop.0+202 __x64_sys_pselect6+101 do_syscall_64+93 entry_SYSCALL_64_after_hwframe+114 ]: 13160 In conclusion: 1. fixing up memsets is worthwhile regardless of what happens to its current consumers -- not all of them are necessarily doing something wrong 2. inlining memset can be a pessimization vs non-plain-erms memset as evidenced above. will have to figure out how to convince gcc to be less eager to use it. Sometimes I hate computers.
On Fri, Mar 3, 2023 at 9:51 PM Yury Norov <yury.norov@gmail.com> wrote: > > At that time you were OK with CONFIG_FORCE_NR_CPUS, only suggested to > hide it behind CONFIG_EXPERT: I think there was a mis-communication. I was violently *not* ok with that question at all. I think our Kconfig phase is really annoying and nasty, and we should not ask people questions that they don't know what they mean. So putting it behind EXPERT was a "at that point, the question is gone", and I'm ok with the config variable existing. But.. I am *not* ok with you then thinking that "oh, the config variable exists, so our default code generation can be complete crap". The kernel should do well by default. No amount of "but you could go into magic config variables and then force options that might be ok for embedded systems to make it generate ok code". I think it's completely crazy that the distros enable MAXSMP. But that's their choice. A *sane* distro should not do that, and then we limit the normal kernel to something sane like a couple of hundred CPUs rather than thousands of them (and the associated huge overhead). At that point, things like cpumask_clear() should be a couple of stores - not a call-out to a variable-sized memset(). Notice? Simple and understandable Kconfig questions like "do you *really* need to support thousands of CPU's". Not that FORCE_NR_CPUS, which is _completely_ useless to any distribution and thus completely unacceptable as a regular question. See the difference? Linus
On Fri, Mar 3, 2023 at 9:51 PM Yury Norov <yury.norov@gmail.com> wrote: > > And the following code will be broken: > > cpumask_t m1, m2; > > cpumask_setall(m1); // m1 is ffff ffff ffff ffff because it uses > // compile-time optimized nr_cpumask_bits > > for_each_cpu(cpu, m1) // 32 iterations because it relied on nr_cpu_ids > cpumask_set_cpu(cpu, m2); // m2 is ffff ffff XXXX XXXX So honestly, it looks like you picked an example of something very unusual to then make everything else slower. Rather than commit aa47a7c215e7, we should just have fixed 'setall()' and 'for_each_cpu()' to use nr_cpu_ids, and then the rest would continue to use nr_cpumask_bits. That particular code sequence is arguably broken to begin with. setall() should really only be used as a mask, most definitely not as some kind of "all possible cpus". The latter is "cpu_possible_mask", which is very different indeed (and often what you want is "cpu_online_mask") But I'd certainly be ok with using nr_cpu_ids for setall, partly exactly because it's so rare. It would probably be better to remove it entirely, but whatever. Linus
On Fri, Mar 03, 2023 at 09:39:11PM +0100, Mateusz Guzik wrote: > the allocation routine does not have any information about the size > available at compilation time, so has to resort to a memset call at > runtime. Instead, should this be: > > f = kmem_cache_alloc(...); > memset(f, 0, sizeof(*f)); > > ... the compiler could in principle initialize stuff as indicated by > code and emit zerofill for the rest. Interestingly, last I checked > neither clang nor gcc knew how to do it, they instead resort to a full > sized memset anyway, which is quite a bummer. For struct file I wouldn't expect a win from that, TBH.
On 3/4/23, Mateusz Guzik <mjguzik@gmail.com> wrote: > On 3/3/23, Linus Torvalds <torvalds@linux-foundation.org> wrote: >> On Fri, Mar 3, 2023 at 12:39 PM Mateusz Guzik <mjguzik@gmail.com> wrote: >>> >>> I think there is a systemic problem which comes with the kzalloc API >> >> Well, it's not necessarily the API that is bad, but the implementation. >> >> We could easily make kzalloc() with a constant size just expand to >> kmalloc+memset, and get the behavior you want. >> >> We already do magical things for "find the right slab bucket" part of >> kmalloc too for constant sizes. It's changed over the years, but that >> policy goes back a long long time. See >> >> >> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=95203fe78007f9ab3aebb96606473ae18c00a5a8 >> >> from the BK history tree. >> >> Exactly because some things are worth optimizing for when the size is >> known at compile time. >> >> Maybe just extending kzalloc() similarly? Trivial and entirely untested >> patch: >> >> --- a/include/linux/slab.h >> +++ b/include/linux/slab.h >> @@ -717,6 +717,12 @@ static inline void *kmem_cache_zalloc(struct >> kmem_cache *k, gfp_t flags) >> */ >> static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags) >> { >> + if (__builtin_constant_p(size)) { >> + void *ret = kmalloc(size, flags); >> + if (ret) >> + memset(ret, 0, size); >> + return ret; >> + } >> return kmalloc(size, flags | __GFP_ZERO); >> } >> > > So I played with this and have a rather nasty summary. Bullet points: > 1. patched kzalloc does not reduce memsets calls during kernel build > 2. patched kmem_cache_zalloc_ptr + 2 consumers converted *does* drop > it significantly (36150671 -> 14414454) > 3. ... inline memset generated by gcc sucks by resorting to rep stosq > around 48 bytes > 4. memsets not sorted out have sizes not known at compilation time and > are not necessarily perf bugs on their own [read: would benefit from > faster memset] > > Onto the the meat: > > I patched the kernel with a slightly tweaked version of the above: > diff --git a/include/linux/slab.h b/include/linux/slab.h > index 45af70315a94..7abb5490690f 100644 > --- a/include/linux/slab.h > +++ b/include/linux/slab.h > @@ -717,6 +717,12 @@ static inline void *kmem_cache_zalloc(struct > kmem_cache *k, gfp_t flags) > */ > static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags) > { > + if (__builtin_constant_p(size)) { > + void *ret = kmalloc(size, flags); > + if (likely(ret)) > + memset(ret, 0, size); > + return ret; > + } > return kmalloc(size, flags | __GFP_ZERO); > } > > and verified it indeed zeroes inline: > > void kztest(void) > { > void *ptr; > > ptr = kzalloc(32, GFP_KERNEL); > if (unlikely(!ptr)) > return; > memsettest_rec(ptr); > } > > $ objdump --disassemble=kztest vmlinux > [snip] > call ffffffff8135e130 <kmalloc_trace> > test %rax,%rax > je ffffffff81447d5f <kztest+0x4f> > movq $0x0,(%rax) > mov %rax,%rdi > movq $0x0,0x8(%rax) > movq $0x0,0x10(%rax) > movq $0x0,0x18(%rax) > call ffffffff81454060 <memsettest_rec> > [snip] > > This did *NOT* lead to reduction of memset calls when building the kernel. > > I verified few cases by hand, it is all either kmem_cache_zalloc or > explicitly added memsets with sizes not known at compilation time. 
> > Two most frequent callers: > @[ > memset+5 > __alloc_file+40 > alloc_empty_file+73 > path_openat+77 > do_filp_open+182 > do_sys_openat2+159 > __x64_sys_openat+89 > do_syscall_64+93 > entry_SYSCALL_64_after_hwframe+114 > ]: 11028994 > @[ > memset+5 > security_file_alloc+45 > __alloc_file+89 > alloc_empty_file+73 > path_openat+77 > do_filp_open+182 > do_sys_openat2+159 > __x64_sys_openat+89 > do_syscall_64+93 > entry_SYSCALL_64_after_hwframe+114 > ]: 11028994 > > My wip addition is: > > diff --git a/include/linux/slab.h b/include/linux/slab.h > index 45af70315a94..12b5b02ef3d3 100644 > --- a/include/linux/slab.h > +++ b/include/linux/slab.h > @@ -710,6 +710,17 @@ static inline void *kmem_cache_zalloc(struct > kmem_cache *k, gfp_t flags) > return kmem_cache_alloc(k, flags | __GFP_ZERO); > } > > +#define kmem_cache_zalloc_ptr(k, f, retp) ({ \ > + __typeof(retp) _retp = kmem_cache_alloc(k, f); \ > + bool _rv = false; \ > + retp = _retp; \ > + if (likely(_retp)) { \ > + memset(_retp, 0, sizeof(*_retp)); \ > + _rv = true; \ > + } \ > + _rv; \ > +}) > + > diff --git a/security/security.c b/security/security.c > index cf6cc576736f..0f769ede0e54 100644 > --- a/security/security.c > +++ b/security/security.c > @@ -600,8 +600,7 @@ static int lsm_file_alloc(struct file *file) > return 0; > } > > - file->f_security = kmem_cache_zalloc(lsm_file_cache, GFP_KERNEL); > - if (file->f_security == NULL) > + if (!kmem_cache_zalloc_ptr(lsm_file_cache, GFP_KERNEL, > file->f_security)) > return -ENOMEM; > return 0; > } This one is actually buggy -- f_security is a void * pointer and sizeof(*file->f_security) returns just 1. The macro does not have any safety belts against that -- it should probably check for void * at compilation time and get a BUG_ON for runtime mismatch. Does not affect the idea though. Good news: gcc provides a lot of control as to how it inlines string ops, most notably: -mstringop-strategy=alg Override the internal decision heuristic for the particular algorithm to use for inlining string operations. The allowed values for alg are: rep_byte rep_4byte rep_8byte Expand using i386 "rep" prefix of the specified size. byte_loop loop unrolled_loop Expand into an inline loop. libcall Always use a library call. I'm going to play with it and send something more presentable. > diff --git a/fs/file_table.c b/fs/file_table.c > index 372653b92617..8e0dabf9530e 100644 > --- a/fs/file_table.c > +++ b/fs/file_table.c > @@ -136,8 +136,7 @@ static struct file *__alloc_file(int flags, const > struct cred *cred) > struct file *f; > int error; > > - f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL); > - if (unlikely(!f)) > + if (!kmem_cache_zalloc_ptr(filp_cachep, GFP_KERNEL, f)) > return ERR_PTR(-ENOMEM); > > f->f_cred = get_cred(cred); > > As mentioned above it cuts total calls in more than half. > > The problem is it is it rolls with rep stosq way too easily, partially > defeating the point of inlining anything. clang does not have this > problem. > > Take a look at __alloc_file: > [snip] > mov 0x19cab05(%rip),%rdi # ffffffff82df4318 <filp_cachep> > call ffffffff813dd610 <kmem_cache_alloc> > test %rax,%rax > je ffffffff814298b7 <__alloc_file+0xc7> > mov %rax,%r12 > mov $0x1d,%ecx > xor %eax,%eax > mov %r12,%rdi > rep stos %rax,%es:(%rdi) > [/snip] > > Here is a sample consumer which can't help but have a variable size -- > select, used by gmake: > @[ > memset+5 > do_pselect.constprop.0+202 > __x64_sys_pselect6+101 > do_syscall_64+93 > entry_SYSCALL_64_after_hwframe+114 > ]: 13160 > > In conclusion: > 1. 
fixing up memsets is worthwhile regardless of what happens to its > current consumers -- not all of them are necessarily doing something > wrong > 2. inlining memset can be a pessimization vs non-plain-erms memset as > evidenced above. will have to figure out how to convince gcc to be > less eager to use it. > > Sometimes I hate computers. > > -- > Mateusz Guzik <mjguzik gmail.com> >
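On the missing safety belt mentioned above: one possible guard, assuming the kmem_cache_zalloc_ptr() macro sketched earlier in the thread (a proposal, not an existing API), is to reject void * targets at compile time, where sizeof(*retp) degenerates to 1. A rough sketch:

#define kmem_cache_zalloc_ptr(k, f, retp) ({				\
	__typeof(retp) _retp;						\
	bool _rv = false;						\
	/* refuse void * targets, for which sizeof(*retp) is just 1 */	\
	BUILD_BUG_ON(__builtin_types_compatible_p(__typeof(retp), void *)); \
	_retp = kmem_cache_alloc(k, f);					\
	retp = _retp;							\
	if (likely(_retp)) {						\
		memset(_retp, 0, sizeof(*_retp));			\
		_rv = true;						\
	}								\
	_rv;								\
})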
On Sat, Mar 4, 2023 at 11:19 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Rather than commit aa47a7c215e7, we should just have fixed 'setall()' > and 'for_each_cpu()' to use nr_cpu_ids, and then the rest would > continue to use nr_cpumask_bits. Looking around, there's a few more "might set high bits in the bitmask" cases - notably cpumask_complement(). It uses bitmap_complement(), which in turn can set bits past the end of the bit range. Of course, that function is then used in exactly one place (ia64 ACPI): cpumask_complement(&tmp_map, cpu_present_mask); cpu = cpumask_first(&tmp_map); it's basically a nasty way or writing cpu = cpumask_first_zero(cpu_present_mask); so the "fix" is to just do that cleanup, and get rid of "cpumask_complement()" entirely. So I would suggest we simply do something like the attached patch. NOTE! This is *entirely* untested. See the _intent_ behind the patch in the big comment above the 'nr_cpumask_bits' #define. So this patch is more of a "maybe something like this?" And no, nothing here helps the MAXSMP case. I don't think it's entirely unfixable, but it's close. Some very involved static jump infrastructure *might* make the MAXSMP case be something we could generate good code for too, but the whole "we potentially have thousands of CPUs" case really shouldn't have ever been used on normal machines. It is what it is. I think the best we can do is to try to generate good code for a distribution that cares about good code. Once the distro maintainers go "let's enable MAXSMP even though the kernel Kconfig help file tells us not to", there's very little we as kernel maintainers can do. Linus
On 3/4/23, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Fri, Mar 03, 2023 at 09:39:11PM +0100, Mateusz Guzik wrote: > >> the allocation routine does not have any information about the size >> available at compilation time, so has to resort to a memset call at >> runtime. Instead, should this be: >> >> f = kmem_cache_alloc(...); >> memset(f, 0, sizeof(*f)); >> >> ... the compiler could in principle initialize stuff as indicated by >> code and emit zerofill for the rest. Interestingly, last I checked >> neither clang nor gcc knew how to do it, they instead resort to a full >> sized memset anyway, which is quite a bummer. > > For struct file I wouldn't expect a win from that, TBH. > That was mostly for illustrative purposes, but you are right -- turns out the slab is 256 bytes in size per object and only a small fraction of it is initialized in the allocation routine. A good candidate to always punt to memset in the allocator, as it happens now. Bummer though, it is also one of the two most frequently memset'ed objects during a kernel build. The other one allocated at the same rate is lsm_file_cache and that's only 32 bytes in size, so it will get a win.
On Sat, Mar 4, 2023 at 12:31 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > Good news: gcc provides a lot of control as to how it inlines string > ops, most notably: > -mstringop-strategy=alg Note that any static decision is always going to be crap somewhere. You can make it do the "optimal" thing for any particular machine, but I consider that to be just garbage. What I would actually like to see is the compiler always generate an out-of-line call for the "big enough to not just do inline trivially" case, but do so with the "rep stosb/movsb" calling convention. Then we'd just mark those with objdump, and patch it up dynamically to either use the right out-of-line memset/memcpy function, *or* just replace it entirely with 'rep stosb' inline. Because the cores that do this right *do* exist, despite your hatred of the rep string instructions. At least Borislav claims that the modern AMD cores do better with 'rep stosb'. In particular, see what we do for 'clear_user()', where we effectively can do the above (because unlike memset, we control it entirely). See commit 0db7058e8e23 ("x86/clear_user: Make it faster"). Once we'd have that kind of infrastructure, we could then control exactly what 'memset()' does. And I note that we should probably have added Borislav to the cc when memset came up, exactly because he's been looking at it anyway. Even if AMD seems to have slightly different optimization rules than Intel cores probably do. But again, that only emphasizes the whole "we should not have a static choice here". Linus
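For context, the "rep stosb calling convention" idea boils down to the call site loading the same registers the instruction itself consumes, so the call can later be patched into a bare "rep stosb" on cores where that wins. A rough sketch of the register contract (illustrative only, not the clear_user() implementation referenced above):

#include <stddef.h>

static inline void zero_rep_stosb(void *dst, size_t len)
{
	/* rdi = destination, rcx = count, al = fill byte */
	asm volatile("rep stosb"
		     : "+D" (dst), "+c" (len)
		     : "a" (0)
		     : "memory");
}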
On Sat, Mar 04, 2023 at 11:19:54AM -0800, Linus Torvalds wrote: > On Fri, Mar 3, 2023 at 9:51 PM Yury Norov <yury.norov@gmail.com> wrote: > > > > And the following code will be broken: > > > > cpumask_t m1, m2; > > > > cpumask_setall(m1); // m1 is ffff ffff ffff ffff because it uses > > // compile-time optimized nr_cpumask_bits > > > > for_each_cpu(cpu, m1) // 32 iterations because it relied on nr_cpu_ids > > cpumask_set_cpu(cpu, m2); // m2 is ffff ffff XXXX XXXX > > So honestly, it looks like you picked an example of something very > unusual to then make everything else slower. What about bootable sticks? > Rather than commit aa47a7c215e7, we should just have fixed 'setall()' > and 'for_each_cpu()' to use nr_cpu_ids, and then the rest would > continue to use nr_cpumask_bits. No, that wouldn't work for all. > That particular code sequence is arguably broken to begin with. > setall() should really only be used as a mask, most definitely not as > some kind of "all possible cpus". Sorry, don't understand this. > The latter is "cpu_possible_mask", which is very different indeed (and > often what you want is "cpu_online_mask") Don't understand this possible vs online argument, but OK. What about this? if (cpumask_setall_is_fixed) cpumask_setall(mask); else { for_each_online_cpu(cpu) cpumask_set_cpu(cpu, mask); } // You forgot to 'fix' cpumask_equal() BUG_ON(!cpumask_equal(cpu_online_mask, mask)); Or this: cpumask_clear(mask); for_each_cpu_not(cpu, mask) cpumask_set_cpu(cpu, mask); BUG_ON(!cpumask_full(mask)); The root of the problem is that half of cpumask API relies (relied) on compile-time nr_cpumask_bits, and the other - on boot-time nr_cpu_ids. So, if you consistently propagate your 'fix', you'll get rid of nr_cpumask_bits entirely with all that associate overhead. That's what I actually did. And even tried to minimize the overhead the best way I can think of. > But I'd certainly be ok with using nr_cpu_ids for setall, partly > exactly because it's so rare. It would probably be better to remove it > entirely, but whatever. > > Linus
On Sat, Mar 4, 2023 at 12:51 PM Yury Norov <yury.norov@gmail.com> wrote: > > > That particular code sequence is arguably broken to begin with. > > setall() should really only be used as a mask, most definitely not as > > some kind of "all possible cpus". > > Sorry, don't understand this. See the example patch I sent out. Literally just make the rule be "we play games with cpumasks in that they have two different 'sizes', so just make sure the bits in the bigger and faster size are always clear". That simple rule just means that we can then use that bigger constant size in all cases where "upper bits zero" just don't matter. Which is basically all of them. Your for_each_cpu_not() example is actually a great example: it should damn well not exist at all. I hadn't even noticed how broken it was. Exactly like the other broken case (that I *did* notice - cpumask_complement), it has no actual valid users. It _literally_ only exists as a pointless test-case. So this is *literally* what I'm talking about: you are making up silly cases that then act as "arguments" for making all the _real_ cases slower. Stop it. Silly useless cases are just that - silly and useless. They should not be arguments for the real cases then being optimized and simplified. Updated patch to remove 'for_each_cpu_not()' attached. It's still completely untested. Treat this very much as a "Let's make the common cases faster, at least for !MAXSMP". Linus
On Sat, Mar 4, 2023 at 1:01 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Silly useless cases are just that - silly and useless. They should not > be arguments for the real cases then being optimized and simplified. There's a missing "not" above, that was hopefully obvious from the context: "They should not be arguments for the real cases then NOT being optimized and simplified" Linus
On Sat, Mar 4, 2023 at 1:01 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > It's still completely untested. Treat this very much as a "Let's make > the common cases faster, at least for !MAXSMP". Ok, so I started "testing" it in the sense that I actually looked at the code it generated, and went all "why didn't it make any difference". And that's because the patch had the #ifdef CONFIG_CPUMASK_OFFSTACK condition exactly the wrong way around. So if somebody actually wants to play with that patch, you need to change that to be #ifndef CONFIG_CPUMASK_OFFSTACK (and then you obviously need to have a kernel config that does *not* have MAXSMP set). That at least simplifies some of the code generation when I look at it. Whether the end result _works_ or not, I still haven't checked. That patch is still very much meant as a "look, something like this should make our cpumask handling much more efficient" Linus
On Sat, Mar 4, 2023 at 1:10 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Whether the end result _works_ or not, I still haven't checked. Well, this particular patch at least boots for me for my normal config. Not that I've run any extensive tests, but I'm writing this email while running this patch, so .. Linus
On Sat, Mar 4, 2023 at 3:08 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Well, this particular patch at least boots for me for my normal > config. Not that I've run any extensive tests, but I'm writing this > email while running this patch, so .. Hmm. I enabled the KUNIT tests, and used an odd CONFIG_NR_CPUS to test this a bit more. So in my situation, I have 64 threads, and so nr_cpu_ids is 64, and CONFIG_NR_CPUS is 150. Then one cpumask KUNIT test fails with # test_cpumask_weight: EXPECTATION FAILED at lib/cpumask_kunit.c:70 Expected ((unsigned int)150) == cpumask_weight(&mask_all), but ((unsigned int)150) == 150 (0x96) cpumask_weight(&mask_all) == 64 (0x40) &mask_all contains CPUs 0-63 but I think that's actually a KUNIT test bug. The KUNIT test there is KUNIT_EXPECT_EQ_MSG(test, nr_cpumask_bits, cpumask_weight(&mask_all), MASK_MSG(&mask_all)); and it should *not* expect the cpumask weight to be nr_cpumask_bits, it should expect it to be nr_cpu_ids. That only matters now that nr_cpumask_bits isn't the same as nr_cpu_ids./ Anyway, I still think that patch of mine is fine, and I think this test failure only ends up being about the test, not the patch. Linus
From: Linus Torvalds > Sent: 04 March 2023 20:48 > > On Sat, Mar 4, 2023 at 12:31 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > Good news: gcc provides a lot of control as to how it inlines string > > ops, most notably: > > -mstringop-strategy=alg > > Note that any static decision is always going to be crap somewhere. > You can make it do the "optimal" thing for any particular machine, but > I consider that to be just garbage. > > What I would actually like to see is the compiler always generate an > out-of-line call for the "big enough to not just do inline trivially" > case, but do so with the "rep stosb/movsb" calling convention. I think you also want it to differentiate between requests that are known to be a whole number of words and ones that might be byte sized. For the kmalloc+memzero case you know you can zero a whole number of words - so all the checks memset has to do for byte length/alignment can be removed. The same is true for memcpy() calls used for structure copies. The compiler knows that aligned full-word copies can be done. So it shouldn't be calling a function that has to redo the tests. David
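A tiny sketch of that point: when the caller can guarantee an aligned, whole-number-of-words length, the zeroing loop needs none of the head/tail byte handling a general memset() carries (illustrative only):

#include <stddef.h>

static inline void zero_words(unsigned long *p, size_t nwords)
{
	/* no alignment or remainder checks: the contract is whole words */
	while (nwords--)
		*p++ = 0;
}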
On Sun, Mar 5, 2023 at 1:26 AM Sedat Dilek <sedat.dilek@gmail.com> wrote: > > can you share your "normal config", please? Well, I just have CONFIG_NR_CPUS set to 64. That happens to be the number of actual cores (well, threads) I have on my desktop, but it's what I use for my laptop too (that has 8 cores). Basically, I think CONFIG_NR_CPUS=64 is likely the "sweet spot" for code generation while still covering 99+% of all machines out there. Now, MAXSMP is great for (a) coverage testing and for (b) being able to handle pretty much *anything* out there, but it really was originally meant for the SGI kind of hardware: not exactly off-the-shelf. So I use MAXSMP for compile testing (by virtue of "allmodconfig"), and it's great for that. But unless you have more than several hundred cpus in your machine, you should never use it. There are a few main issues with MAXSMP: - the simple ("common") embedded cpu masks end up being big (ie any data structure that has a "cpumask_t" in it will be huge, just because the static size of 'struct cpumask' is 8192 bits, ie 1kB) - the fancy case of using a "cpumask_var_t" will use a pointer and a dynamic allocation (which is then sized to be appropriate to the *actual* number of CPU's, so that you don't have to allocate 8192 bits for everything). - the code generation ends up inevitably being about variable-sized loops, because nobody wants to traverse those kinds of data structures In contrast, if you use CONFIG_NR_CPUS=64, both the embedded and "fancy" versions will be just a single 64-bit word. No extra pointer overhead, no indirection through said pointers, and no need for loops (ok, there will still be loops for walking the bits in the word, but a lot of them will actually be about using instructions like "bsf" etc). So you end up with better code, smaller data structures, and less pointer chasing. So those two situations are generally the two "sane" configurations: a good reasonable NR_CPUS that works for just about everybody, and then another extreme config for the 1% (or - more likely - 0.01%). Now, it's not like 64 is somehow magical. Picking something like NR_CPUS of 192 is perfectly fine too - it will use three words for the bitmap, it will still avoid the pointer indirection, it will have a few small fixed-size loops. It's certainly not *wrong*. It will cover bigger HEDT machines, but I feel like the HEDT people probably are special enough that they could probably just use the MAXSMP case, or - if they care - just build their own kernels. So you *can* certainly pick other values. We used to have special UP vs SMP kernel builds, and that clearly no longer makes sense. Nobody cares about UP on x86-64. But I do feel like MAXSMP is simply not a great config for 99.9% of all people, and if you are willing to have two configs, then that "64 or MAXSMP" seems to be the sane split. And with that split, there will be *very* few people who actually use MAXSMP. Linus
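The two models being contrasted here map to how cpumask_var_t is declared; simplified from include/linux/cpumask.h (the kernel also has a debug variant not shown):

#ifdef CONFIG_CPUMASK_OFFSTACK
typedef struct cpumask	*cpumask_var_t;		/* pointer + separate allocation */
#else
typedef struct cpumask	cpumask_var_t[1];	/* embedded, NR_CPUS bits inline */
#endif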
On Sun, Mar 5, 2023 at 10:17 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > There are a few main issues with MAXSMP: It's probably worth noting that most architectures don't even support MAXSMP at all. Only x86-64 does. For example, ia64 and sparc64, which both did technically support a lot of cores, just made "cpumask_t" huge, and had no support for the whole "use a pointer to an indirect allocation". That ends up meaning that you allocate those huge structures on the stack or just make other structures enormous when they contain a CPU mask, but it mostly works. It's a horrid, horrid model, though. But at least ia64 had 64kB stacks anyway, and in the book of "bad engineering decisions of Itanium", this is all just a footnote. arm64 also has that "range 2 4096" for number of CPUs but defaults to a much saner 256 cpus. I suspect (and sincerely hope) that nobody actually tries to use an arm64 build with that 4k cpu build. If/when arm64 actually does get up to that "thousands of cores" situation, they'll hopefully enable the MAXSMP kind of indirection and off-stack cpu mask arrays. So MAXSMP and the whole CPUMASK_OFFSTACK option is an architecture choice, and you don't have to do it the way x86-64 does it. But the x86 choice is likely the best tested and thought out by far. For example, POWERPC technically supports CPUMASK_OFFSTACK too, but really only in theory. On powerpc, you have config NR_CPUS range 2 8192 if SMP default "32" if PPC64 so while the configurable range is technically up to 8k CPUs, I doubt people use that value very much. And we have select CPUMASK_OFFSTACK if NR_CPUS >= 8192 so it only uses that OFFSTACK one if you pick exactly 8192 CPUs (which presumably nobody does in real life outside of build testing - it's not the default, and I think most of the POWER range tops out in the 192 core range, eg E980 with 16 sockets of 12 cores each). So I suspect that x86-64 is the *only* one to actually use this widely, and I think distros have been *much* too eager to do so. The fact that most distros default to CONFIG_MAXSMP=y CONFIG_NR_CPUS=8192 seems pretty crazy, when I have a hard time finding anything with more than 192 cores. I'm sure they exist. But do they _really_ run unmodified vendor kernels? Linus
On Sat, Mar 04, 2023 at 03:08:49PM -0800, Linus Torvalds wrote: > On Sat, Mar 4, 2023 at 1:10 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Whether the end result _works_ or not, I still haven't checked. > > Well, this particular patch at least boots for me for my normal > config. Not that I've run any extensive tests, but I'm writing this > email while running this patch, so .. > > Linus I didn't test it properly, but the approach looks good. Need some time to think on implications of the new rule. At the first glance, there should be no major impact on cpumask machinery. It should be very well tested on arm and m68k because they implement their own bitmap functions. Please see comments inline. Thanks, Yury [...] > diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h > index 10c92bd9b807..bd9576e8d856 100644 > --- a/include/linux/cpumask.h > +++ b/include/linux/cpumask.h > @@ -50,8 +50,30 @@ static inline void set_nr_cpu_ids(unsigned int nr) > #endif > } > > -/* Deprecated. Always use nr_cpu_ids. */ > -#define nr_cpumask_bits nr_cpu_ids > +/* > + * The difference between nr_cpumask_bits and nr_cpu_ids is that > + * 'nr_cpu_ids' is the actual number of CPU ids in the system, while > + * nr_cpumask_bits is a "reasonable upper value" that is often more > + * efficient because it can be a fixed constant. > + * > + * So when clearing or traversing a cpumask, use 'nr_cpumask_bits', > + * but when checking exact limits (and when _setting_ bits), use the > + * tighter exact limit of 'nr_cpu_ids'. > + * > + * NOTE! The code depends on any exyta bits in nr_cpumask_bits a always s/exyta/extra ? s/a always/as always ? > + * being (a) allocated and (b) zero, so that the only effect of using > + * 'nr_cpumask_bits' is that we might return a higher maximum CPU value > + * (which is why we have that pattern of > + * > + * Returns >= nr_cpu_ids if no cpus set. > + * > + * for many of the functions - they can return that higher value). > + */ > +#ifndef CONFIG_CPUMASK_OFFSTACK > + #define nr_cpumask_bits ((unsigned int)NR_CPUS) > +#else > + #define nr_cpumask_bits nr_cpu_ids > +#endif > > /* > * The following particular system cpumasks and operations manage > @@ -114,7 +136,7 @@ static __always_inline void cpu_max_bits_warn(unsigned int cpu, unsigned int bit > /* verify cpu argument to cpumask_* operators */ > static __always_inline unsigned int cpumask_check(unsigned int cpu) > { > - cpu_max_bits_warn(cpu, nr_cpumask_bits); > + cpu_max_bits_warn(cpu, nr_cpu_ids); > return cpu; > } > > @@ -248,16 +270,6 @@ unsigned int cpumask_next_and(int n, const struct cpumask *src1p, > #define for_each_cpu(cpu, mask) \ > for_each_set_bit(cpu, cpumask_bits(mask), nr_cpumask_bits) > > -/** > - * for_each_cpu_not - iterate over every cpu in a complemented mask > - * @cpu: the (optionally unsigned) integer iterator > - * @mask: the cpumask pointer > - * > - * After the loop, cpu is >= nr_cpu_ids. 
> - */ > -#define for_each_cpu_not(cpu, mask) \ > - for_each_clear_bit(cpu, cpumask_bits(mask), nr_cpumask_bits) > - We can do it like: for ((bit) = 0; (bit) = find_next_zero_bit((addr), nr_cpumask_bits, (bit)), (bit) < nr_cpu_ids; (bit)++) > #if NR_CPUS == 1 > static inline > unsigned int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool wrap) > @@ -495,10 +507,14 @@ static __always_inline bool cpumask_test_and_clear_cpu(int cpu, struct cpumask * > /** > * cpumask_setall - set all cpus (< nr_cpu_ids) in a cpumask > * @dstp: the cpumask pointer > + * > + * Note: since we set bits, we should use the tighter 'bitmap_set()' with > + * the eact number of bits, not 'bitmap_fill()' that will fill past the s/eact/exact > + * end. > */ > static inline void cpumask_setall(struct cpumask *dstp) > { > - bitmap_fill(cpumask_bits(dstp), nr_cpumask_bits); > + bitmap_set(cpumask_bits(dstp), 0, nr_cpu_ids); > } It should be like: + bitmap_set(cpumask_bits(dstp), 0, nr_cpu_ids); + bitmap_clear(cpumask_bits(dstp), nr_cpu_ids, nr_cpumask_bits); Because bitmap_set() will not zero memory beyond round_up(nr_cpu_ids, 64).
diff --git a/fs/open.c b/fs/open.c index 82c1a28b3308..2afed058250c 100644 --- a/fs/open.c +++ b/fs/open.c @@ -367,7 +367,37 @@ COMPAT_SYSCALL_DEFINE6(fallocate, int, fd, int, mode, compat_arg_u64_dual(offset * access() needs to use the real uid/gid, not the effective uid/gid. * We do this by temporarily clearing all FS-related capabilities and * switching the fsuid/fsgid around to the real ones. + * + * Creating new credentials is expensive, so we try to skip doing it, + * which we can if the result would match what we already got. */ +static bool access_need_override_creds(int flags) +{ + const struct cred *cred; + + if (flags & AT_EACCESS) + return false; + + cred = current_cred(); + if (!uid_eq(cred->fsuid, cred->uid) || + !gid_eq(cred->fsgid, cred->gid)) + return true; + + if (!issecure(SECURE_NO_SETUID_FIXUP)) { + kuid_t root_uid = make_kuid(cred->user_ns, 0); + if (!uid_eq(cred->uid, root_uid)) { + if (!cap_isclear(cred->cap_effective)) + return true; + } else { + if (!cap_isidentical(cred->cap_effective, + cred->cap_permitted)) + return true; + } + } + + return false; +} + static const struct cred *access_override_creds(void) { const struct cred *old_cred; @@ -377,6 +407,12 @@ static const struct cred *access_override_creds(void) if (!override_cred) return NULL; + /* + * XXX access_need_override_creds performs checks in hopes of skipping + * this work. Make sure it stays in sync if making any changes in this + * routine. + */ + override_cred->fsuid = override_cred->uid; override_cred->fsgid = override_cred->gid; @@ -436,7 +472,7 @@ static long do_faccessat(int dfd, const char __user *filename, int mode, int fla if (flags & AT_EMPTY_PATH) lookup_flags |= LOOKUP_EMPTY; - if (!(flags & AT_EACCESS)) { + if (access_need_override_creds(flags)) { old_cred = access_override_creds(); if (!old_cred) return -ENOMEM;
access(2) remains commonly used, for example on exec: access("/etc/ld.so.preload", R_OK) or when running gcc: strace -c gcc empty.c % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 0.00 0.000000 0 42 26 access It falls down to do_faccessat without the AT_EACCESS flag, which in turn results in allocation of new creds in order to modify fsuid/fsgid and caps. This is a very expensive process single-threaded and most notably multi-threaded, with numerous structures getting refed and unrefed on imminent new cred destruction. Turns out for typical consumers the resulting creds would be identical and this can be checked upfront, avoiding the hard work. An access benchmark plugged into will-it-scale running on Cascade Lake shows: test proc before after access1 1 1310582 2908735 (+121%) # distinct files access1 24 4716491 63822173 (+1353%) # distinct files access2 24 2378041 5370335 (+125%) # same file The above benchmarks are not integrated into will-it-scale, but can be found in a pull request: https://github.com/antonblanchard/will-it-scale/pull/36/files Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> v3: - add a comment warning about changing access_override_creds v2: - fix current->cred usage warn reported by the kernel test robot Link: https://lore.kernel.org/all/202301150709.9EC6UKBT-lkp@intel.com/ --- fs/open.c | 38 +++++++++++++++++++++++++++++++++++++- 1 file changed, 37 insertions(+), 1 deletion(-)