[0/3] proc: improve root readdir latency with many threads

Message ID 20220614180949.102914-1-bfoster@redhat.com (mailing list archive)

Brian Foster June 14, 2022, 6:09 p.m. UTC
Hi all,

We have a user who has reported performance problems related to
(presumably) custom task monitoring on Linux systems when running
processes with large numbers of threads. Unfortunately I don't have much
information about the practical workload and observations, only
that the problem had been narrowed down to excessive readdir latency of
the /proc root dir in the presence of large numbers of threads in the
associated pid namespace.

This latency boils down to the inefficient pid_namespace walk down in
the proc_pid_readdir() path. More specifically, every thread/task
allocates an associated struct pid, and the procfs next_tgid()
implementation walks every pid in the namespace looking for those with
an associated PIDTYPE_TGID task to fill into the directory listing.

Given that ids are part of the idr radix-tree, it seemed fairly logical
that this could be improved using an internal tree tag. I started
playing around with an approach that tagged and untagged ids based on
actual task association (i.e., attach_pid() and friends), but after some
thought and feedback came to the realization that this could probably be
simplified to just tag the pid once on allocation and allow procfs to
use it as a hint for root dir population. This works because post-fork
tgid task disassociation (without an exit() and freeing the pid) seems
to be uncommon. The only tool I've seen in my testing so far that leaves
around a tagged, non-TGID pid is chronyd, which appears to do a fork()
-> setsid() -> fork() pattern where the intermediate task exits but the
associated pid hangs around for the lifetime of the process due to the
PIDTYPE_SID association.
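
For reference, the fork() -> setsid() -> fork() pattern described above
can be sketched in userspace as follows. This is a minimal illustration
of the generic double-fork idiom, not chronyd's actual code; the names
demo_double_fork() and struct detach_info are hypothetical, and the
grandchild exits immediately so the demo leaves nothing running.

```c
#define _XOPEN_SOURCE 700
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical result record for the demo. */
struct detach_info {
	pid_t intermediate;	/* task that called setsid(), then exited */
	pid_t survivor;		/* grandchild that outlives the intermediate */
	pid_t survivor_sid;	/* session id as seen by the grandchild */
};

/* Returns 0 on success, -1 on failure. */
int demo_double_fork(struct detach_info *out)
{
	int fds[2];
	pid_t child;
	pid_t report[2];
	ssize_t n;

	if (pipe(fds) < 0)
		return -1;

	child = fork();
	if (child < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}

	if (child == 0) {
		/* Intermediate task: become session leader, fork again. */
		pid_t grandchild;

		close(fds[0]);
		if (setsid() < 0)
			_exit(1);

		grandchild = fork();
		if (grandchild < 0)
			_exit(1);
		if (grandchild == 0) {
			/* Survivor: report own pid and inherited session id. */
			pid_t msg[2] = { getpid(), getsid(0) };

			if (write(fds[1], msg, sizeof(msg)) != (ssize_t)sizeof(msg))
				_exit(1);
			_exit(0);
		}
		/* The intermediate exits; its pid number remains the sid. */
		_exit(0);
	}

	/* Original caller: collect the report and reap the intermediate. */
	close(fds[1]);
	n = read(fds[0], report, sizeof(report));
	close(fds[0]);
	waitpid(child, NULL, 0);
	if (n != (ssize_t)sizeof(report))
		return -1;

	out->intermediate = child;
	out->survivor = report[0];
	out->survivor_sid = report[1];
	return 0;
}
```

In this sketch the survivor's session id equals the pid of the exited
intermediate task, which is the kind of lingering association described
above for a daemon that stays running.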

Therefore, this series implements this tgid tag hinting approach. Patch
1 includes a couple of tweaks to the idr tree to support traditional
radix-tree tag propagation. Patch 2 defines the new tag and sets it on
pid allocation. Patch 3 updates procfs to use the tag for the readdir
pid_namespace traversal.
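
To illustrate why a propagated tag helps the traversal, here is a toy
userspace model. It is only a sketch under loud assumptions: it is a
fixed two-level bitmap structure, not the kernel idr or radix-tree, and
every name in it (toy_idr, toy_alloc(), walk_all(), walk_tagged()) is
hypothetical. It also treats the tag as exact, whereas the series uses
it only as a hint.

```c
#include <stdbool.h>
#include <stdint.h>

#define SLOTS	64	/* ids per leaf */
#define LEAVES	64	/* leaves under the root; ids 0..4095 */

struct toy_leaf {
	uint64_t present;	/* bit i set: id is allocated */
	uint64_t tagged;	/* bit i set: id carries the tgid hint */
};

struct toy_idr {
	struct toy_leaf leaf[LEAVES];
	uint64_t leaf_tagged;	/* bit n set: leaf n holds a tagged id */
};

/* Allocate an id, optionally tagging it and propagating the tag up. */
bool toy_alloc(struct toy_idr *idr, unsigned int id, bool tgid_hint)
{
	unsigned int n = id / SLOTS;
	uint64_t bit = UINT64_C(1) << (id % SLOTS);

	if (id >= SLOTS * LEAVES)
		return false;

	idr->leaf[n].present |= bit;
	if (tgid_hint) {
		idr->leaf[n].tagged |= bit;
		idr->leaf_tagged |= UINT64_C(1) << n;
	}
	return true;
}

/* Untagged walk: inspect every allocated id and test each one. */
unsigned int walk_all(const struct toy_idr *idr, unsigned long *inspected)
{
	unsigned int found = 0;

	for (unsigned int n = 0; n < LEAVES; n++) {
		const struct toy_leaf *l = &idr->leaf[n];

		for (unsigned int i = 0; i < SLOTS; i++) {
			uint64_t bit = UINT64_C(1) << i;

			if (!(l->present & bit))
				continue;
			(*inspected)++;
			if (l->tagged & bit)
				found++;
		}
	}
	return found;
}

/* Tagged walk: skip untagged leaves, visit only tagged ids. */
unsigned int walk_tagged(const struct toy_idr *idr, unsigned long *inspected)
{
	unsigned int found = 0;

	for (unsigned int n = 0; n < LEAVES; n++) {
		uint64_t bits;

		if (!(idr->leaf_tagged & (UINT64_C(1) << n)))
			continue;
		for (bits = idr->leaf[n].tagged; bits; bits &= bits - 1) {
			(*inspected)++;	/* one tagged id per set bit */
			found++;
		}
	}
	return found;
}
```

With many allocated ids and only a few tagged ones, both walks find the
same set, but walk_all() inspects every allocated id while
walk_tagged() inspects only the tagged ones.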

As far as testing goes, I've thrown this at fstests (not for filesystem
testing purposes, but more so just because I had the test env handy and
it's a longish running task creation workload), LTP and some of the
kernel internal tests in tools/testing/selftests (clone, proc,
pid_namespace) without any obvious regressions. From the performance
angle, the user who reported this problem has provided some synthetic
tools to create dummy tasks/threads and run repeated readdir iterations
of /proc, which is what they've been using to compare results on Linux
kernels with some $other OS. These tools show a notable improvement in
terms of the number of /proc readdir iterations possible per-second. For
example, on 5.19.0-rc2 running on a mostly idle system with an active
100k thread process, readdirs-per-second improves from a baseline of
~285 to ~7.3k with the series applied. More detailed getdents() latency
numbers are included in the commit log of patch 3.
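
A readdirs-per-second measurement of this kind can be approximated with
a small helper such as the one below. This is not the reporter's tool,
only a hedged sketch: list_dir_once() and readdir_rate() are
hypothetical names, and the helper only times repeated full directory
listings. It does not create the thread-heavy process itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stddef.h>
#include <time.h>

/* One full pass over a directory; returns entry count, -1 on error. */
long list_dir_once(const char *path)
{
	DIR *dir = opendir(path);
	long entries = 0;

	if (!dir)
		return -1;
	while (readdir(dir) != NULL)
		entries++;
	closedir(dir);
	return entries;
}

/* Full listings per second over 'passes' passes; -1.0 on error. */
double readdir_rate(const char *path, int passes)
{
	struct timespec t0, t1;
	double secs;

	if (passes <= 0)
		return -1.0;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < passes; i++) {
		if (list_dir_once(path) < 0)
			return -1.0;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	secs = (double)(t1.tv_sec - t0.tv_sec) +
	       (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
	return secs > 0.0 ? (double)passes / secs : -1.0;
}
```

A comparison run would call readdir_rate("/proc", ...) on a baseline
and a patched kernel while a separate process holds a large number of
idle threads in the same pid namespace.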

Thoughts, reviews, flames appreciated.

Brian

Brian Foster (3):
  radix-tree: propagate all tags in idr tree
  pid: use idr tag to hint pids associated with group leader tasks
  proc: use idr tgid tag hint to iterate pids in readdir

 fs/proc/base.c      |  2 +-
 include/linux/idr.h | 25 +++++++++++++++++++++++++
 include/linux/pid.h |  2 +-
 kernel/fork.c       |  2 +-
 kernel/pid.c        |  9 ++++++++-
 lib/radix-tree.c    | 26 +++++++++++++++-----------
 6 files changed, 51 insertions(+), 15 deletions(-)