mbox series

[v6,0/6] ioctl()-based API to query VMAs from /proc/<pid>/maps

Message ID 20240627170900.1672542-1-andrii@kernel.org (mailing list archive)
Headers show
Series ioctl()-based API to query VMAs from /proc/<pid>/maps | expand

Message

Andrii Nakryiko June 27, 2024, 5:08 p.m. UTC
Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
applications to query VMA information more efficiently than reading *all* VMAs
nonselectively through text-based interface of /proc/<pid>/maps file.

Patch #2 goes into a lot of details and background on some common patterns of
using /proc/<pid>/maps in the area of performance profiling and subsequent
symbolization of captured stack traces. As mentioned in that patch, patterns
of VMA querying can differ depending on specific use case, but can generally
be grouped into two main categories: the need to query a small subset of VMAs
covering a given batch of addresses, or reading/storing/caching all
(typically, executable) VMAs upfront for later processing.

The new PROCMAP_QUERY ioctl() API added in this patch set was motivated by the
former pattern of usage. Earlier revisions had a patch adding a tool that
faithfully reproduces an efficient VMA matching pass of a symbolizer,
collecting a subset of covering VMAs for a given set of addresses as
efficiently as possible. This tool served both as a testing ground, as well as
a benchmarking tool. It implements everything both for currently existing
text-based /proc/<pid>/maps interface, as well as for newly-added
PROCMAP_QUERY ioctl(). This revision dropped the tool from the patch set and,
once the API lands upstream, this tool might be added separately on Github as
an example.

Based on discussion on earlier revisions of this patch set, it turned out
that this ioctl() API is competitive with highly-optimized text-based
pre-processing pattern that perf tool is using. Based on perf discussion, this
revision adds more flexibility in specifying a subset of VMAs that are of
interest. Now it's possible to specify desired permissions of VMAs (e.g.,
request only executable ones) and/or restrict to only a subset of VMAs that
have file backing. This further improves the efficiency when using this new
API thanks to more selective (executable VMAs only) querying.

In addition to a custom benchmarking tool, and experimental perf integration
(available at [0]), Daniel Mueller has since also implemented an experimental
integration into blazesym (see [1]), a library used for stack trace
symbolization by our server fleet-wide profiler and another on-device profiler
agent that runs on weaker ARM devices. The latter ARM-based device profiler is
especially sensitive to performance, and so we benchmarked and compared
text-based /proc/<pid>/maps solution to the equivalent one using PROCMAP_QUERY
ioctl().

Results are very encouraging, giving us 5x improvement for end-to-end
so-called "address normalization" pass, which is the part of the symbolization
process that happens locally on ARM device, before being sent out for further
heavier-weight processing on more powerful remote server. Note that this is
not an artificial microbenchmark. It's a full end-to-end API call being
measured with real-world data on real-world device.

  TEXT-BASED
  ==========
  Benchmarking main/normalize_process_no_build_ids_uncached_maps
  main/normalize_process_no_build_ids_uncached_maps
	  time:   [49.777 µs 49.982 µs 50.250 µs]

  IOCTL-BASED
  ===========
  Benchmarking main/normalize_process_no_build_ids_uncached_maps
  main/normalize_process_no_build_ids_uncached_maps
	  time:   [10.328 µs 10.391 µs 10.457 µs]
	  change: [−79.453% −79.304% −79.166%] (p = 0.00 < 0.02)
	  Performance has improved.

You can see above that we see the drop from 50µs down to 10µs for exactly
the same amount of work, with the same data and target process.

With the aforementioned custom tool, we see about ~40x improvement (it might
vary a bit, depending on a specific captured set of addresses). And even for
perf-based benchmark it's on par or slightly ahead when using permission-based
filtering (fetching only executable VMAs).

Earlier revisions attempted to use per-VMA locking, if kernel was compiled
with CONFIG_PER_VMA_LOCK=y, but it turned out that anon_vma_name() is not yet
compatible with per-VMA locking and assumes mmap_lock to be taken, which makes
the use of per-VMA locking for this API premature. It was agreed ([2]) to
continue for now with just mmap_lock, but the code structure is such that it
should be easy to add per-VMA locking support once all the pieces are ready.

One thing that did not change was basing this new API as an ioctl() command
on /proc/<pid>/maps file. An ioctl-based API on top of pidfd was considered,
but has its own downsides. Implementing ioctl() directly on pidfd will cause
access permission checks on every single ioctl(), which leads to performance
concerns and potential spam of capable() audit messages. It also prevents
a nice pattern, possible with /proc/<pid>/maps, in which application opens
/proc/self/maps FD (requiring no additional capabilities) and passed this FD
to profiling agent for querying. To achieve similar pattern, a new file would
have to be created from pidf just for VMA querying, which is considered to be
inferior to just querying /proc/<pid>/maps FD as proposed in current approach.
These aspects were discussed in the hallway track at recent LSF/MM/BPF 2024
and sticking to procfs ioctl() was the final agreement we arrived at.

This patch set is based on top of mm-unstable branch in mm tree.

  [0] https://github.com/anakryiko/linux/commits/procfs-proc-maps-ioctl-v2/
  [1] https://github.com/libbpf/blazesym/pull/675
  [2] https://lore.kernel.org/bpf/7rm3izyq2vjp5evdjc7c6z4crdd3oerpiknumdnmmemwyiwx7t@hleldw7iozi3/

v5->v6:
  - make vma_page_size an __u64 field (Liam);
v4->v5:
  - added tests in selftests/proc (Andrew);
  - added vma_page_size field (Liam);
  - dropped the benchmark tool and BPF selftests parts, I'll send them
    directly to bpf-next once this API makes it into that tree;
v3->v4:
  - drop per-VMA locking changes for now, we'll need anon_vma_name() to be
    compatible with vma->vm_lock approach (Suren, Liam);
v2->v3:
  - drop mmap_lock aggressively under CONFIG_PER_VMA_LOCK (Liam);
  - code massaging to abstract per-VMA vs mmap_lock differences (Liam);
v1->v2:
  - per-VMA lock is used, if possible (Liam, Suren);
  - added file-backed VMA querying (perf folks);
  - added permission-based VMA querying (perf folks);
  - split out build ID into separate patch (Suren);
  - better documented API, added mention of ioctl() into procfs docs (Greg).

Andrii Nakryiko (6):
  fs/procfs: extract logic for getting VMA name constituents
  fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
  fs/procfs: add build ID fetching to PROCMAP_QUERY API
  docs/procfs: call out ioctl()-based PROCMAP_QUERY command existence
  tools: sync uapi/linux/fs.h header into tools subdir
  selftests/proc: add PROCMAP_QUERY ioctl tests

 Documentation/filesystems/proc.rst         |   9 +
 fs/proc/task_mmu.c                         | 383 ++++++++++++++++++---
 include/uapi/linux/fs.h                    | 158 ++++++++-
 tools/include/uapi/linux/fs.h              | 184 +++++++++-
 tools/testing/selftests/proc/Makefile      |   1 +
 tools/testing/selftests/proc/proc-pid-vm.c |  86 +++++
 6 files changed, 754 insertions(+), 67 deletions(-)

Comments

Andrew Morton June 27, 2024, 7:59 p.m. UTC | #1
On Thu, 27 Jun 2024 10:08:52 -0700 Andrii Nakryiko <andrii@kernel.org> wrote:

> Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> applications to query VMA information more efficiently than reading *all* VMAs
> nonselectively through text-based interface of /proc/<pid>/maps file.

I appreciate the usefulness for monitoring large fleets, so I'll add
this version to mm-unstable.  As we're almost at -rc6 I'll await
further review before deciding on the next steps.

Is it possible/sensible to make this feature Kconfigurable so that people who
don't need it can omit it?
Andrii Nakryiko June 27, 2024, 8:50 p.m. UTC | #2
On Thu, Jun 27, 2024 at 12:59 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Thu, 27 Jun 2024 10:08:52 -0700 Andrii Nakryiko <andrii@kernel.org> wrote:
>
> > Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> > applications to query VMA information more efficiently than reading *all* VMAs
> > nonselectively through text-based interface of /proc/<pid>/maps file.
>
> I appreciate the usefulness for monitoring large fleets, so I'll add
> this version to mm-unstable.  As we're almost at -rc6 I'll await
> further review before deciding on the next steps.
>
> Is it possible/sensible to make this feature Kconfigurable so that people who
> don't need it can omit it?

It's just a matter of #ifdef/#endif, so not hard, technically
speaking. But I'm wondering what's the concern? This is mostly newly
added code (except factoring out get_vma_name logic, which won't be
#ifdef'ed anyways), so if no one is using this new API, then it should
cause no issue.

Generally speaking, I'd say if we don't *have to* add the Kconfig
option, I'd prefer that. But if you feel strongly, it's not hard for
me to do, of course.

Or are you concerned with the vmlinux code size increase? It doesn't
seem to be large enough to warrant a Kconfig, IMO (from
bloat-o-meter):

do_procmap_query                               -    1308   +1308
get_vma_name                                   -     283    +283
procfs_procmap_ioctl                           -      47     +47
show_map_vma                                 444     274    -170

But again, do let me know if you insist.
Andrew Morton June 27, 2024, 9:11 p.m. UTC | #3
On Thu, 27 Jun 2024 13:50:22 -0700 Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> On Thu, Jun 27, 2024 at 12:59 PM Andrew Morton
> > Is it possible/sensible to make this feature Kconfigurable so that people who
> > don't need it can omit it?
> 
> It's just a matter of #ifdef/#endif, so not hard, technically
> speaking. But I'm wondering what's the concern? This is mostly newly
> added code (except factoring out get_vma_name logic, which won't be
> #ifdef'ed anyways), so if no one is using this new API, then it should
> cause no issue.
> 
> Generally speaking, I'd say if we don't *have to* add the Kconfig
> option, I'd prefer that. But if you feel strongly, it's not hard for
> me to do, of course.
> 
> Or are you concerned with the vmlinux code size increase? It doesn't
> seem to be large enough to warrant a Kconfig, IMO (from
> bloat-o-meter):
> 
> do_procmap_query                               -    1308   +1308
> get_vma_name                                   -     283    +283
> procfs_procmap_ioctl                           -      47     +47
> show_map_vma                                 444     274    -170
> 
> But again, do let me know if you insist.

Yes, I'm thinking about being nice to small systems ("make
tinyconfig"!).  The kernel just gets bigger and bigger over time,
little bit by little bit.

It's a judgment call - if making it configurable is ugly and/or adds
maintenance overhead then no.
Andrii Nakryiko June 28, 2024, 4:42 p.m. UTC | #4
On Thu, Jun 27, 2024 at 2:11 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 27 Jun 2024 13:50:22 -0700 Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> > On Thu, Jun 27, 2024 at 12:59 PM Andrew Morton
> > > Is it possible/sensible to make this feature Kconfigurable so that people who
> > > don't need it can omit it?
> >
> > It's just a matter of #ifdef/#endif, so not hard, technically
> > speaking. But I'm wondering what's the concern? This is mostly newly
> > added code (except factoring out get_vma_name logic, which won't be
> > #ifdef'ed anyways), so if no one is using this new API, then it should
> > cause no issue.
> >
> > Generally speaking, I'd say if we don't *have to* add the Kconfig
> > option, I'd prefer that. But if you feel strongly, it's not hard for
> > me to do, of course.
> >
> > Or are you concerned with the vmlinux code size increase? It doesn't
> > seem to be large enough to warrant a Kconfig, IMO (from
> > bloat-o-meter):
> >
> > do_procmap_query                               -    1308   +1308
> > get_vma_name                                   -     283    +283
> > procfs_procmap_ioctl                           -      47     +47
> > show_map_vma                                 444     274    -170
> >
> > But again, do let me know if you insist.
>
> Yes, I'm thinking about being nice to small systems ("make
> tinyconfig"!).  The kernel just gets bigger and bigger over time,
> little bit by little bit.
>
> It's a judgment call - if making it configurable is ugly and/or adds
> maintenance overhead then no.
>

I see, thanks for clarifying. I'd vote to not add extra Kconfig to
keep things simple and less surprising. All this code is conditional
on CONFIG_PROC_FS=y anyways, and there is plenty of code for procfs
already. I think this do_procmap_query is just a small addition here
that doesn't fundamentally regress anything.
Andrew Morton July 10, 2024, 6:32 p.m. UTC | #5
On Thu, 27 Jun 2024 10:08:52 -0700 Andrii Nakryiko <andrii@kernel.org> wrote:

> Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> applications to query VMA information more efficiently than reading *all* VMAs
> nonselectively through text-based interface of /proc/<pid>/maps file.

I haven't seen any acks or reviewed-by's for this series.  lgtm, so I
plan to move it into mm-stable later this week.
Andrii Nakryiko July 10, 2024, 6:41 p.m. UTC | #6
On Wed, Jul 10, 2024 at 11:32 AM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Thu, 27 Jun 2024 10:08:52 -0700 Andrii Nakryiko <andrii@kernel.org> wrote:
>
> > Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> > applications to query VMA information more efficiently than reading *all* VMAs
> > nonselectively through text-based interface of /proc/<pid>/maps file.
>
> I haven't seen any acks or reviewed-by's for this series.  lgtm, so I
> plan to move it into mm-stable later this week.

great, thanks!
Liam R. Howlett July 11, 2024, 6:07 p.m. UTC | #7
* Andrii Nakryiko <andrii@kernel.org> [240627 13:09]:
> Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> applications to query VMA information more efficiently than reading *all* VMAs
> nonselectively through text-based interface of /proc/<pid>/maps file.
> 

Thanks for doing this Andrii.  It looks to be a step forward for a lot
of use cases.

Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Alexey Dobriyan July 24, 2024, 4:32 p.m. UTC | #8
On Thu, Jul 11, 2024 at 02:07:18PM -0400, Liam R. Howlett wrote:
> * Andrii Nakryiko <andrii@kernel.org> [240627 13:09]:
> > Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> > applications to query VMA information more efficiently than reading *all* VMAs
> > nonselectively through text-based interface of /proc/<pid>/maps file.
> > 
> 
> Thanks for doing this Andrii.  It looks to be a step forward for a lot
> of use cases.

Yes, looks like ioctl on text files are the way to go. :-)