mbox series

[v6,0/1] proc: Allow pid_revalidate() during LOOKUP_RCU

Message ID 20211004175629.292270-1-stephen.s.brennan@oracle.com (mailing list archive)
Headers show
Series proc: Allow pid_revalidate() during LOOKUP_RCU | expand

Message

Stephen Brennan Oct. 4, 2021, 5:56 p.m. UTC
Hello all,

I'm resending this patch, adding Andrew Morton since I notice many
fs/proc changes seem to go through his tree. Andrew, please consider
this patch if you have the time to look. The patch is the same as the v5
I've sent a few times, but with a refreshed analysis based on v5.15-rc3.

Problem Description:

When running running ~128 parallel instances of "TZ=/etc/localtime ps
-fe >/dev/null" on a 128CPU machine, the %sys utilization reaches 97%,
and perf shows the following code path as being responsible for heavy
contention on the d_lockref spinlock:

      walk_component()
        lookup_fast()
          d_revalidate()
            pid_revalidate() // returns -ECHILD
          unlazy_child()
            lockref_get_not_dead(&nd->path.dentry->d_lockref) <-- contention

The reason is that pid_revalidate() is triggering a drop from RCU to ref
path walk mode. All concurrent path lookups thus try to grab a reference
to the dentry for /proc/, before re-executing pid_revalidate() and then
stepping into the /proc/$pid directory. Thus there is huge spinlock
contention. This patch allows pid_revalidate() to execute in RCU mode,
meaning that the path lookup can successfully enter the /proc/$pid
directory while still in RCU mode. Later on, the path lookup may still
drop into ref mode, but the contention will be much reduced at this
point.

By applying this patch, %sys utilization falls to around 85% under the
same workload, and the number of ps processes executed per unit time
increases by 3x-4x. Although this particular workload is a bit
contrived, we have seen some large collections of eager monitoring
scripts which produced similarly high %sys time due to contention in the
/proc directory.

As a result this patch, Al noted that several procfs methods which were
only called in ref-walk mode could now be called from RCU mode. To
ensure that this patch is safe, I audited all the inode get_link and
permission() implementations, as well as dentry d_revalidate()
implementations, in fs/proc. The purpose here is to ensure that they
either are safe to call in RCU (i.e. don't sleep) or correctly bail out
of RCU mode if they don't support it. My analysis shows that all at-risk
procfs methods are safe to call under RCU, and thus this patch is safe.

Procfs RCU-walk Analysis:

This analysis is up-to-date with 5.15-rc3. When called under RCU mode,
these functions have arguments as follows:

* get_link() receives a NULL dentry pointer when called in RCU mode.
* permission() receives MAY_NOT_BLOCK in the mode parameter when called
  from RCU.
* d_revalidate() receives LOOKUP_RCU in flags.

For the following functions, either they are trivially RCU safe, or they
explicitly bail at the beginning of the function when they run:

proc_ns_get_link       (bails out)
proc_get_link          (RCU safe)
proc_pid_get_link      (bails out)
map_files_d_revalidate (bails out)
map_misc_d_revalidate  (bails out)
proc_net_d_revalidate  (RCU safe)
proc_sys_revalidate    (bails out, also not under /proc/$pid)
tid_fd_revalidate      (bails out)
proc_sys_permission    (not under /proc/$pid)

The remainder of the functions require a bit more detail:

* proc_fd_permission: RCU safe. All of the body of this function is
  under rcu_read_lock(), except generic_permission() which declares
  itself RCU safe in its documentation string.
* proc_self_get_link uses GFP_ATOMIC in the RCU case, so it is RCU aware
  and otherwise looks safe. The same is true of proc_thread_self_get_link.
* proc_map_files_get_link: calls ns_capable, which calls capable(), and
  thus calls into the audit code (see note #1 below). The remainder is
  just a call to the trivially safe proc_pid_get_link().
* proc_pid_permission: calls ptrace_may_access(), which appears RCU
  safe, although it does call into the "security_ptrace_access_check()"
  hook, which looks safe under smack and selinux. Just the audit code is
  of concern. Also uses get_task_struct() and put_task_struct(), see
  note #2 below.
* proc_tid_comm_permission: Appears safe, though calls put_task_struct
  (see note #2 below).

Note #1:
  Most of the concern of RCU safety has centered around the audit code.
  However, since b17ec22fb339 ("selinux: slow_avc_audit has become
  non-blocking"), it's safe to call this code under RCU. So all of the
  above are safe by my estimation.

Note #2: get_task_struct() and put_task_struct():
  The majority of get_task_struct() is under RCU read lock, and in any
  case it is a simple increment. But put_task_struct() is complex, given
  that it could at some point free the task struct, and this process has
  many steps which I couldn't manually verify. However, several other
  places call put_task_struct() under RCU, so it appears safe to use
  here too (see kernel/hung_task.c:165 or rcu/tree-stall.h:296)

Stephen Brennan (1):
  proc: Allow pid_revalidate() during LOOKUP_RCU

 fs/proc/base.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

Comments

Matthew Wilcox Oct. 4, 2021, 6:20 p.m. UTC | #1
On Mon, Oct 04, 2021 at 10:56:28AM -0700, Stephen Brennan wrote:
> Problem Description:
> 
> When running running ~128 parallel instances of "TZ=/etc/localtime ps
> -fe >/dev/null" on a 128CPU machine, the %sys utilization reaches 97%,
> and perf shows the following code path as being responsible for heavy
> contention on the d_lockref spinlock:
> 
>       walk_component()
>         lookup_fast()
>           d_revalidate()
>             pid_revalidate() // returns -ECHILD
>           unlazy_child()
>             lockref_get_not_dead(&nd->path.dentry->d_lockref) <-- contention
> 
> The reason is that pid_revalidate() is triggering a drop from RCU to ref
> path walk mode. All concurrent path lookups thus try to grab a reference
> to the dentry for /proc/, before re-executing pid_revalidate() and then
> stepping into the /proc/$pid directory. Thus there is huge spinlock
> contention. This patch allows pid_revalidate() to execute in RCU mode,
> meaning that the path lookup can successfully enter the /proc/$pid
> directory while still in RCU mode. Later on, the path lookup may still
> drop into ref mode, but the contention will be much reduced at this
> point.
> 
> By applying this patch, %sys utilization falls to around 85% under the
> same workload, and the number of ps processes executed per unit time
> increases by 3x-4x. Although this particular workload is a bit
> contrived, we have seen some large collections of eager monitoring
> scripts which produced similarly high %sys time due to contention in the
> /proc directory.

I think it's perhaps also worth noting that this is a performance
regression relative to ... v5.4?  v4.14?  I forget the details; do you
have those to hand, Stephen?

(Yes, this is a stupid workload.  Yes, a customer really does have
this workload.)
Matthew Wilcox Oct. 4, 2021, 6:41 p.m. UTC | #2
On Mon, Oct 04, 2021 at 07:20:14PM +0100, Matthew Wilcox wrote:
> On Mon, Oct 04, 2021 at 10:56:28AM -0700, Stephen Brennan wrote:
> > Problem Description:
> > 
> > When running running ~128 parallel instances of "TZ=/etc/localtime ps
> > -fe >/dev/null" on a 128CPU machine, the %sys utilization reaches 97%,
> > and perf shows the following code path as being responsible for heavy
> > contention on the d_lockref spinlock:
> > 
> >       walk_component()
> >         lookup_fast()
> >           d_revalidate()
> >             pid_revalidate() // returns -ECHILD
> >           unlazy_child()
> >             lockref_get_not_dead(&nd->path.dentry->d_lockref) <-- contention
> > 
> > The reason is that pid_revalidate() is triggering a drop from RCU to ref
> > path walk mode. All concurrent path lookups thus try to grab a reference
> > to the dentry for /proc/, before re-executing pid_revalidate() and then
> > stepping into the /proc/$pid directory. Thus there is huge spinlock
> > contention. This patch allows pid_revalidate() to execute in RCU mode,
> > meaning that the path lookup can successfully enter the /proc/$pid
> > directory while still in RCU mode. Later on, the path lookup may still
> > drop into ref mode, but the contention will be much reduced at this
> > point.
> > 
> > By applying this patch, %sys utilization falls to around 85% under the
> > same workload, and the number of ps processes executed per unit time
> > increases by 3x-4x. Although this particular workload is a bit
> > contrived, we have seen some large collections of eager monitoring
> > scripts which produced similarly high %sys time due to contention in the
> > /proc directory.
> 
> I think it's perhaps also worth noting that this is a performance
> regression relative to ... v5.4?  v4.14?  I forget the details; do you
> have those to hand, Stephen?
> 
> (Yes, this is a stupid workload.  Yes, a customer really does have
> this workload.)

OK, it's not a performance regression.  My apologies; I misremembered
the ticket.  What happens with 4.14 is that the tasks all sleep on
the directory's i_mutex.  When i_mutex became i_rwsem, all the lookups
would now contend on the dentry spinlock.  That turns "lots of processes
sleeping" into "lots of processes spinning", which looks like a regression
if you're asking "Why has my system time increased a lot?"