[RESEND] cred: separate the refcount from frequently read fields

Message ID 20240822131542.785546-1-mjguzik@gmail.com (mailing list archive)
State New
Delegated to: Paul Moore
Series [RESEND] cred: separate the refcount from frequently read fields

Commit Message

Mateusz Guzik Aug. 22, 2024, 1:15 p.m. UTC
The refcount shares the cacheline with uid, gid and other frequently
read fields.

Said counter gets modified a lot (for example every time a file is
closed or opened) in turn causing avoidable bounces when another thread
tries to do permission checks. Said bouncing can be avoided without
growing the struct by reordering the fields -- keyring (enabled by
default) is comparatively rarely used and can suffer bouncing instead.

An additional store is performed to clear the non_rcu flag. Since the
flag is rarely set (a special case in the access(2) system call) and
transitions at most once, it can get placed in a read-mostly area and is
only conditionally written to.
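
The conditional write in question, taken from the get_cred_many() and
get_cred_rcu() hunks at the bottom of this page, boils down to:

	/* Dirty the read-mostly line only when the flag is actually set. */
	if (unlikely(nonconst_cred->non_rcu))
		WRITE_ONCE(nonconst_cred->non_rcu, false);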

With this in place regular permission checks no longer bounce cachelines
in the face of refcount changes.

Validated with a simple test where one thread opens and closes a file
(dirtying creds twice), while another keeps re-reading *another* file in
a loop (ops/s):
before:	4353763
after:	4742792 (+9%)
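
The exact benchmark is not included in this posting; a minimal userspace
reconstruction of the setup described above could look like the sketch below
(the file names, the use of stat() on the read side and the fixed iteration
count are assumptions, not the original test):

/*
 * Hypothetical reconstruction of the described test.  Both threads share
 * one struct cred, which is what exposes the refcount/uid cacheline
 * sharing.  Build with: gcc -O2 -pthread bench.c
 * Error handling is omitted for brevity.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static volatile int stop;

/* Open/close in a loop: each iteration takes and drops a cred reference. */
static void *cred_dirtier(void *arg)
{
	(void)arg;
	while (!stop) {
		int fd = open("/tmp/file-a", O_RDONLY);
		if (fd >= 0)
			close(fd);
	}
	return NULL;
}

int main(void)
{
	pthread_t t;
	struct stat st;
	unsigned long ops;

	pthread_create(&t, NULL, cred_dirtier, NULL);

	/*
	 * Path lookup on *another* file does permission checks that read
	 * uid/gid/caps from the same shared cred; time this loop (e.g. with
	 * time(1)) to derive ops/s.
	 */
	for (ops = 0; ops < 10000000; ops++)
		stat("/tmp/file-b", &st);

	stop = 1;
	pthread_join(&t, NULL);
	printf("%lu lookups done\n", ops);
	return 0;
}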

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---

This is a resend with a reworded commit message and added Linus since he
wrote the non_rcu thing.

Note each process has its own creds, so this is not causing bounces
globally.

Even so, there is stuff I plan to do in the area and this patch can be
considered prep (only one store to non_rcu).

I'll also note I don't see a way to *whack* non_rcu either. :)

0 rush

 fs/open.c            |  2 +-
 include/linux/cred.h | 31 +++++++++++++++----------------
 kernel/cred.c        |  6 +++---
 3 files changed, 19 insertions(+), 20 deletions(-)

Comments

Paul Moore Aug. 22, 2024, 8:58 p.m. UTC | #1
On Thu, Aug 22, 2024 at 9:15 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> The refcount shares the cacheline with uid, gid and other frequently
> read fields.
>
> Said counter gets modified a lot (for example every time a file is
> closed or opened) in turn causing avoidable bounces when another thread
> tries to do permission checks. Said bouncing can be avoided without
> growing the struct by reordering the fields -- keyring (enabled by
> default) is comparatively rarely used and can suffer bouncing instead.
>
> An additional store is performed to clear the non_rcu flag. Since the
> flag is rarely set (a special case in the access(2) system call) and
> transitions at most once, it can get placed in a read-mostly area and is
> only conditionally written to.
>
> With this in place regular permission checks no longer bounce cachelines
> in the face of refcount changes.
>
> Validated with a simple test where one thread opens and closes a file
> (dirtying creds twice), while another keeps re-reading *another* file in
> a loop (ops/s):
> before: 4353763
> after:  4742792 (+9%)
>
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---
>
> This is a resend with a reworded commit message and added Linus since he
> wrote the non_rcu thing.
>
> Note each process has its own creds, so this is not causing bounces
> globally.
>
> Even so, there is stuff I plan to do in the area and this patch can be
> considered prep (only one store to non_rcu).
>
> I'll also note I don't see a way to *whack* non_rcu either. :)
>
> 0 rush
>
>  fs/open.c            |  2 +-
>  include/linux/cred.h | 31 +++++++++++++++----------------
>  kernel/cred.c        |  6 +++---
>  3 files changed, 19 insertions(+), 20 deletions(-)

[NOTE: no comment on the patch context yet, just process comments below]

FWIW, I really haven't commented on these last couple of cred patches
mostly because I've been assuming someone else would emerge from the
shadows as the cred maintainer, or at least someone who felt strongly
enough about these changes would merge them.  Unfortunately, I worry
that isn't really going to happen and I'd hate for some of the cred
improvements to die on the lists.

If no one starts grabbing your cred patches I can start taking cred
patches as part of the LSM tree; I've done it a couple of times in the
past and Linus didn't seem to mind.
Mateusz Guzik Aug. 22, 2024, 9:31 p.m. UTC | #2
On Thu, Aug 22, 2024 at 10:58 PM Paul Moore <paul@paul-moore.com> wrote:
>
> On Thu, Aug 22, 2024 at 9:15 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > The refcount shares the cacheline with uid, gid and other frequently
> > read fields.
> >
> > Said counter gets modified a lot (for example every time a file is
> > closed or opened) in turn causing avoidable bounces when another thread
> > tries to do permission checks. Said bouncing can be avoided without
> > growing the struct by reordering the fields -- keyring (enabled by
> > default) is comparatively rarely used and can suffer bouncing instead.
> >
> > An additional store is performed to clear the non_rcu flag. Since the
> > flag is rarely set (a special case in the access(2) system call) and
> > transitions at most once, it can get placed in a read-mostly area and is
> > only conditionally written to.
> >
> > With this in place regular permission checks no longer bounce cachelines
> > in the face of refcount changes.
> >
> > Validated with a simple test where one thread opens and closes a file
> > (dirtying creds twice), while another keeps re-reading *another* file in
> > a loop (ops/s):
> > before: 4353763
> > after:  4742792 (+9%)
> >
> > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> > ---
> >
> > This is a resend with a reworded commit message and added Linus since he
> > wrote the non_rcu thing.
> >
> > Note each process has its own creds, so this is not causing bounces
> > globally.
> >
> > Even so, there is stuff I plan to do in the area and this patch can be
> > considered prep (only one store to non_rcu).
> >
> > I'll also note I don't see a way to *whack* non_rcu either. :)
> >
> > 0 rush
> >
> >  fs/open.c            |  2 +-
> >  include/linux/cred.h | 31 +++++++++++++++----------------
> >  kernel/cred.c        |  6 +++---
> >  3 files changed, 19 insertions(+), 20 deletions(-)
>
> [NOTE: no comment on the patch context yet, just process comments below]
>
> FWIW, I really haven't commented on these last couple of cred patches
> mostly because I've been assuming someone else would emerge from the
> shadows as the cred maintainer, or at least someone who felt strongly
> enough about these changes would merge them.  Unfortunately, I worry
> that isn't really going to happen and I'd hate for some of the cred
> improvements to die on the lists.
>
> If no one starts grabbing your cred patches I can start taking cred
> patches as part of the LSM tree; I've done it a couple of times in the
> past and Linus didn't seem to mind.
>

Yeah, I remember trouble getting any response in the past.

FWIW I expect to write one more patch (maybe split into two) to
distribute the refcount into task_struct; then, modulo a bugfix should
someone find a problem, hopefully I won't be submitting any more cred
stuff :)
Linus Torvalds Aug. 23, 2024, 12:05 a.m. UTC | #3
On Thu, 22 Aug 2024 at 21:15, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> The refcount shares the cacheline with uid, gid and other frequently
> read fields.

So moving the refcount around looks sensible, but I don't see why you
moved 'non_rcu' away from the rcu head union.

Isn't 'non_rcu' accessed exactly when the refcount is accessed too? So
putting it in the same cacheline with ->usage would seem to make
sense, and you literally moved the RCU head there.

Why not move it as a union, and keep the non-rcu bit with the RCU head?

Yes, it is rarely actually written to and as such can be "mostly
read-only", but since it is both read and written next to refcounts,
why do that?

Did I miss some common use?

               Linus
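
For reference, the alternative suggested here amounts to keeping the pre-patch
union (see the removed include/linux/cred.h hunk at the bottom of this page)
and relocating it next to the refcount -- a sketch of that layout, not what the
posted patch does:

struct cred {
	atomic_long_t	usage;
	/* RCU deletion */
	union {
		int		non_rcu;	/* Can we skip RCU deletion? */
		struct rcu_head	rcu;		/* RCU deletion hook */
	};
	/* ... uid, gid and the remaining fields unchanged ... */
} __randomize_layout;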
Mateusz Guzik Aug. 23, 2024, 12:33 p.m. UTC | #4
On Fri, Aug 23, 2024 at 2:06 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 22 Aug 2024 at 21:15, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > The refcount shares the cacheline with uid, gid and other frequently
> > read fields.
>
> So moving the refcount around looks sensible, but I don't see why you
> moved 'non_rcu' away from the rcu head union.
>
> Isn't 'non_rcu' accessed exactly when the refcount is accessed too? So
> putting it in the same cacheline with ->usage would seem to make
> sense, and you literally moved the RCU head there.
>
> Why not move it as a union, and keep the non-rcu bit with the RCU head?
>
> Yes, it is rarely actually written to and as such can be "mostly
> read-only", but since it is both read and written next to refcounts,
> why do that?
>
> Did I miss some common use?
>

It gets looked at every time you grab a ref.

The layout with the change is this:
        /* --- cacheline 1 boundary (64 bytes) --- */
        kuid_t                     uid __attribute__((__aligned__(64))); /*    64     4 */
        kgid_t                     gid;                  /*    68     4 */
        kuid_t                     suid;                 /*    72     4 */
        kgid_t                     sgid;                 /*    76     4 */
        kuid_t                     euid;                 /*    80     4 */
        kgid_t                     egid;                 /*    84     4 */
        kuid_t                     fsuid;                /*    88     4 */
        kgid_t                     fsgid;                /*    92     4 */
        unsigned int               securebits;           /*    96     4 */
        bool                       non_rcu;              /*   100     1 */

        /* XXX 3 bytes hole, try to pack */

        kernel_cap_t               cap_inheritable;      /*   104     8 */
        kernel_cap_t               cap_permitted;        /*   112     8 */
        kernel_cap_t               cap_effective;        /*   120     8 */
        /* --- cacheline 2 boundary (128 bytes) --- */

Thus consumers which grab the ref and then look at the most commonly
used fields get the non_rcu + rest combo "for free".

Consumers which already had a ref don't suffer any extra misses.
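
For completeness, the part of the patched struct that sits ahead of that
boundary groups the write-hot counter with the RCU hook and the keyring
pointers (condensed from the include/linux/cred.h hunk at the bottom of this
page; CONFIG_KEYS=y assumed):

struct cred {
	atomic_long_t	usage;		/* dirtied by every get/put */
	struct rcu_head	rcu;		/* RCU deletion hook */
#ifdef CONFIG_KEYS
	unsigned char	jit_keyring;
	struct key	*session_keyring;
	struct key	*process_keyring;
	struct key	*thread_keyring;
	struct key	*request_key_auth;
#endif
	/* read-mostly fields start on the next cacheline */
	kuid_t		uid ____cacheline_aligned_in_smp;
	/* ... */
};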
Linus Torvalds Aug. 23, 2024, 3:25 p.m. UTC | #5
On Fri, 23 Aug 2024 at 20:33, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Fri, Aug 23, 2024 at 2:06 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> >
> > Yes, it is rarely actually written to and as such can be "mostly
> > read-only", but since it is both read and written next to refcounts,
> > why do that?
> >
> > Did I miss some common use?
> >
>
> It gets looked at every time you grab a ref.

Mateusz - read my email. That's what I said.

But the *ref* is already in cacheline 0. With your change it looked like this:

   struct cred {
        atomic_long_t   usage;
        struct rcu_head rcu;            /* RCU deletion hook */

and if you had kept the union with that 'struct rcu_head', then
'non_rcu' would be RIGHT THERE.

> Thus consumers which grab the ref and then look at the most commonly
> used fields get the non_rcu + rest combo "for free".

They'd get it for free JUST BECAUSE IT'S NEXT TO THE REF. In cacheline
0 - that is already dirtied by the reference count. Which makes a
*store* cheaper.

And you also wouldn't waste separate space for it.

> consumers which already had a ref don't suffer any extra misses

Consumers that already had a ref don't touch 'non_rcu' at all as far
as I can see, so they have no reason to have it next to those "most
commonly used fields".

See my argument? You seem to have pointlessly separated out the
'non_rcu' from being together with the rcu_head, and thus wasted
potentially useful space in the structure.

Your own email even had that:

>        bool                       non_rcu;              /*   100     1 */
>
>        /* XXX 3 bytes hole, try to pack */

which would have been a /* 4 byte hole, try to pack */

A 4-byte hole is more useful than a 3-byte one. A 3-byte one is much
harder to use.

In fact, even without the union, I think your current cacheline 0 ends
up having a 3-byte hole due to that

        unsigned char   jit_keyring;    /* default keyring to attach requested

with CONFIG_KEYS.

Without CONFIG_KEYS, you have something like a 40-byte hole there due to the

    kuid_t uid ____cacheline_aligned_in_smp;

which seems very wasteful, but I guess CONFIG_KEYS is the common case.
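
The 40-byte figure follows from common 64-bit sizes (an assumption here --
the exact numbers depend on the configuration):

/*
 * CONFIG_KEYS=n, 64-bit:
 *   atomic_long_t usage  ->  8 bytes
 *   struct rcu_head rcu  -> 16 bytes (two pointers)
 *   total                -> 24 bytes
 * uid is then forced to the next 64-byte boundary, leaving
 * 64 - 24 = 40 bytes of padding in cacheline 0.
 */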

So I repeat: what did I miss?

       Linus
Mateusz Guzik Aug. 23, 2024, 3:42 p.m. UTC | #6
On Fri, Aug 23, 2024 at 5:26 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, 23 Aug 2024 at 20:33, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > On Fri, Aug 23, 2024 at 2:06 AM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > >
> > > Yes, it is rarely actually written to and as such can be "mostly
> > > read-only", but since it is both read and written next to refcounts,
> > > why do that?
> > >
> > > Did I miss some common use?
> > >
> >
> > It gets looked at every time you grab a ref.
>
> Mateusz - read my email. That's what I daid.
>
> But the *ref* is already in cacheline 0. With your change it looked like this:
>
>    struct cred {
>         atomic_long_t   usage;
>         struct rcu_head rcu;            /* RCU deletion hook */
>
> and if you had kept the union with that 'struct rcu_head', then
> 'non_rcu' would be RIGHT THERE.
>

Huh, it's been a while since I wrote this patch. The rcu thing used to
be at the end and I forgot I moved *that* var as well (apart from just
the keys stuff), but I still support non_rcu being elsewhere (see below).

> > Thus consumers which grab the ref and then look at the most commonly
> > used fields get the non_rcu + rest combo "for free".
>
> They'd get it for free JUST BECAUSE IT'S NEXT TO THE REF. In cacheline
> 0 - that is already dirtied by the reference count. Which makes a
> *store* cheaper.
>

The store to the non_rcu var happens at most once.

Suppose multiple threads open and close file objects; this results in a
stream of atomics on ->usage. With the proposed layout the only accesses
to that line are the rmws from those atomics.

If non_rcu hangs around in the same cacheline there is additional
ping-pong to check the field.

> And you also wouldn't waste separate space for it.
>

I agree space is getting wasted (or rather a hole is there when it
does not have to be); I just don't think this matters with the current
layout.

I am not going to die on this hill though; if you insist on non_rcu
going back to the union I'll send a v2, but as is this would come with
*some* loss.

Patch

diff --git a/fs/open.c b/fs/open.c
index 22adbef7ecc2..930e22fe8dba 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -455,7 +455,7 @@  static const struct cred *access_override_creds(void)
 	 * cred accesses will keep things non-racy to avoid RCU
 	 * freeing.
 	 */
-	override_cred->non_rcu = 1;
+	override_cred->non_rcu = true;
 
 	old_cred = override_creds(override_cred);
 
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 2976f534a7a3..3127eaad4140 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -110,7 +110,16 @@  static inline int groups_search(const struct group_info *group_info, kgid_t grp)
  */
 struct cred {
 	atomic_long_t	usage;
-	kuid_t		uid;		/* real UID of the task */
+	struct rcu_head	rcu;		/* RCU deletion hook */
+#ifdef CONFIG_KEYS
+	unsigned char	jit_keyring;	/* default keyring to attach requested
+					 * keys to */
+	struct key	*session_keyring; /* keyring inherited over fork */
+	struct key	*process_keyring; /* keyring private to this process */
+	struct key	*thread_keyring; /* keyring private to this thread */
+	struct key	*request_key_auth; /* assumed request_key authority */
+#endif
+	kuid_t		uid ____cacheline_aligned_in_smp;/* real UID of the task */
 	kgid_t		gid;		/* real GID of the task */
 	kuid_t		suid;		/* saved UID of the task */
 	kgid_t		sgid;		/* saved GID of the task */
@@ -119,19 +128,12 @@  struct cred {
 	kuid_t		fsuid;		/* UID for VFS ops */
 	kgid_t		fsgid;		/* GID for VFS ops */
 	unsigned	securebits;	/* SUID-less security management */
+	bool		non_rcu;	/* Can we skip RCU deletion? */
 	kernel_cap_t	cap_inheritable; /* caps our children can inherit */
 	kernel_cap_t	cap_permitted;	/* caps we're permitted */
 	kernel_cap_t	cap_effective;	/* caps we can actually use */
 	kernel_cap_t	cap_bset;	/* capability bounding set */
 	kernel_cap_t	cap_ambient;	/* Ambient capability set */
-#ifdef CONFIG_KEYS
-	unsigned char	jit_keyring;	/* default keyring to attach requested
-					 * keys to */
-	struct key	*session_keyring; /* keyring inherited over fork */
-	struct key	*process_keyring; /* keyring private to this process */
-	struct key	*thread_keyring; /* keyring private to this thread */
-	struct key	*request_key_auth; /* assumed request_key authority */
-#endif
 #ifdef CONFIG_SECURITY
 	void		*security;	/* LSM security */
 #endif
@@ -139,11 +141,6 @@  struct cred {
 	struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
 	struct ucounts *ucounts;
 	struct group_info *group_info;	/* supplementary groups for euid/fsgid */
-	/* RCU deletion */
-	union {
-		int non_rcu;			/* Can we skip RCU deletion? */
-		struct rcu_head	rcu;		/* RCU deletion hook */
-	};
 } __randomize_layout;
 
 extern void __put_cred(struct cred *);
@@ -217,7 +214,8 @@  static inline const struct cred *get_cred_many(const struct cred *cred, int nr)
 	struct cred *nonconst_cred = (struct cred *) cred;
 	if (!cred)
 		return cred;
-	nonconst_cred->non_rcu = 0;
+	if (unlikely(nonconst_cred->non_rcu))
+		WRITE_ONCE(nonconst_cred->non_rcu, false);
 	return get_new_cred_many(nonconst_cred, nr);
 }
 
@@ -242,7 +240,8 @@  static inline const struct cred *get_cred_rcu(const struct cred *cred)
 		return NULL;
 	if (!atomic_long_inc_not_zero(&nonconst_cred->usage))
 		return NULL;
-	nonconst_cred->non_rcu = 0;
+	if (unlikely(nonconst_cred->non_rcu))
+		WRITE_ONCE(nonconst_cred->non_rcu, false);
 	return cred;
 }
 
diff --git a/kernel/cred.c b/kernel/cred.c
index 075cfa7c896f..23b73ee2ec63 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -104,7 +104,7 @@  void __put_cred(struct cred *cred)
 	BUG_ON(cred == current->cred);
 	BUG_ON(cred == current->real_cred);
 
-	if (cred->non_rcu)
+	if (unlikely(cred->non_rcu))
 		put_cred_rcu(&cred->rcu);
 	else
 		call_rcu(&cred->rcu, put_cred_rcu);
@@ -218,7 +218,7 @@  struct cred *prepare_creds(void)
 	old = task->cred;
 	memcpy(new, old, sizeof(struct cred));
 
-	new->non_rcu = 0;
+	new->non_rcu = false;
 	atomic_long_set(&new->usage, 1);
 	get_group_info(new->group_info);
 	get_uid(new->user);
@@ -643,7 +643,7 @@  struct cred *prepare_kernel_cred(struct task_struct *daemon)
 	old = get_task_cred(daemon);
 
 	*new = *old;
-	new->non_rcu = 0;
+	new->non_rcu = false;
 	atomic_long_set(&new->usage, 1);
 	get_uid(new->user);
 	get_user_ns(new->user_ns);