diff mbox series

[5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

Message ID 20240504003006.3303334-6-andrii@kernel.org (mailing list archive)
State New
Headers show
Series ioctl()-based API to query VMAs from /proc/<pid>/maps | expand

Commit Message

Andrii Nakryiko May 4, 2024, 12:30 a.m. UTC
Implement a simple tool/benchmark for comparing address "resolution"
logic based on textual /proc/<pid>/maps interface and new binary
ioctl-based PROCFS_PROCMAP_QUERY command.

The tool expects a file with a list of hex addresses, relevant PID, and
then provides control over whether textual or binary ioctl-based ways to
process VMAs should be used.

The overall logic implements as efficient way to do batched processing
of a given set of (unsorted) addresses. We first sort them in increasing
order (remembering their original position to restore original order, if
necessary), and then process all VMAs from /proc/<pid>/maps, matching
addresses to VMAs and calculating file offsets, if matched. For
ioctl-based approach the idea is similar, but is implemented even more
efficiently, requesting only VMAs that cover all given addresses,
skipping all the irrelevant VMAs altogether.

To be able to compare efficiency of both APIs tool has "benchark" mode.
User provides a number of processing runs to run in a tight loop, timing
specifically /proc/<pid>/maps parsing and processing parts of the logic
only. Address sorting and re-sorting is excluded. This gives a more
direct way to compare ioctl- vs text-based APIs.

We used a medium-sized production application to do representative
benchmark. A bunch of stack traces were captured, resulting in 4435
user space addresses (699 unique ones, but we didn't deduplicate them).
Application itself had 702 VMAs reported in /proc/<pid>/maps.

Averaging time taken to process all addresses 10000 times, showed that:
  - text-based approach took 380 microseconds *per one batch run*;
  - ioctl-based approach took 10 microseconds *per identical batch run*.

This gives about ~35x speed up to do exactly the same amoun of work
(build IDs were not fetched for ioctl-based benchmark; fetching build
IDs resulted in 2x slowdown compared to no-build-ID case).

I also did an strace run of both cases. In text-based one the tool did
68 read() syscalls, fetching up to 4KB of data in one go. In comparison,
ioctl-based implementation had to do only 6 ioctl() calls to fetch all
relevant VMAs.

It is projected that savings from processing big production applications
would only widen the gap in favor of binary-based querying ioctl API, as
bigger applications will tend to have even more non-executable VMA
mappings relative to executable ones.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 tools/testing/selftests/bpf/.gitignore     |   1 +
 tools/testing/selftests/bpf/Makefile       |   2 +-
 tools/testing/selftests/bpf/procfs_query.c | 366 +++++++++++++++++++++
 3 files changed, 368 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/procfs_query.c

Comments

Greg Kroah-Hartman May 4, 2024, 3:29 p.m. UTC | #1
On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> Implement a simple tool/benchmark for comparing address "resolution"
> logic based on textual /proc/<pid>/maps interface and new binary
> ioctl-based PROCFS_PROCMAP_QUERY command.

Of course an artificial benchmark of "read a whole file" vs. "a tiny
ioctl" is going to be different, but step back and show how this is
going to be used in the real world overall.  Pounding on this file is
not a normal operation, right?

thanks,

greg k-h
Greg Kroah-Hartman May 4, 2024, 3:32 p.m. UTC | #2
On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> I also did an strace run of both cases. In text-based one the tool did
> 68 read() syscalls, fetching up to 4KB of data in one go.

Why not fetch more at once?

And I have a fun 'readfile()' syscall implementation around here that
needs justification to get merged (I try so every other year or so) that
can do the open/read/close loop in one call, with the buffer size set by
userspace if you really are saying this is a "hot path" that needs that
kind of speedup.  But in the end, io_uring usually is the proper api for
that instead, why not use that here instead of slow open/read/close if
you care about speed?

> In comparison,
> ioctl-based implementation had to do only 6 ioctl() calls to fetch all
> relevant VMAs.
> 
> It is projected that savings from processing big production applications
> would only widen the gap in favor of binary-based querying ioctl API, as
> bigger applications will tend to have even more non-executable VMA
> mappings relative to executable ones.

Define "bigger applications" please.  Is this some "large database
company workload" type of thing, or something else?

thanks,

greg k-h
Andrii Nakryiko May 4, 2024, 9:57 p.m. UTC | #3
On Sat, May 4, 2024 at 8:29 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > Implement a simple tool/benchmark for comparing address "resolution"
> > logic based on textual /proc/<pid>/maps interface and new binary
> > ioctl-based PROCFS_PROCMAP_QUERY command.
>
> Of course an artificial benchmark of "read a whole file" vs. "a tiny
> ioctl" is going to be different, but step back and show how this is
> going to be used in the real world overall.  Pounding on this file is
> not a normal operation, right?
>

It's not artificial at all. It's *exactly* what, say, blazesym library
is doing (see [0], it's Rust and part of the overall library API, I
think C code in this patch is way easier to follow for someone not
familiar with implementation of blazesym, but both implementations are
doing exactly the same sequence of steps). You can do it even less
efficiently by parsing the whole file, building an in-memory lookup
table, then looking up addresses one by one. But that's even slower
and more memory-hungry. So I didn't even bother implementing that, it
would put /proc/<pid>/maps at even more disadvantage.

Other applications that deal with stack traces (including perf) would
be doing one of those two approaches, depending on circumstances and
level of sophistication of code (and sensitivity to performance).

  [0] https://github.com/libbpf/blazesym/blob/ee9b48a80c0b4499118a1e8e5d901cddb2b33ab1/src/normalize/user.rs#L193

> thanks,
>
> greg k-h
Andrii Nakryiko May 4, 2024, 10:13 p.m. UTC | #4
On Sat, May 4, 2024 at 8:32 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > I also did an strace run of both cases. In text-based one the tool did
> > 68 read() syscalls, fetching up to 4KB of data in one go.
>
> Why not fetch more at once?
>

I didn't expect to be interrogated so much on the performance of the
text parsing front, sorry. :) You can probably tune this, but where is
the reasonable limit? 64KB? 256KB? 1MB? See below for some more
production numbers.

> And I have a fun 'readfile()' syscall implementation around here that
> needs justification to get merged (I try so every other year or so) that
> can do the open/read/close loop in one call, with the buffer size set by
> userspace if you really are saying this is a "hot path" that needs that
> kind of speedup.  But in the end, io_uring usually is the proper api for
> that instead, why not use that here instead of slow open/read/close if
> you care about speed?
>

I'm not sure what I need to say here. I'm sure it will be useful, but
as I already explained, it's not about the text file or not, it's
about having to read too much information that's completely
irrelevant. Again, see below for another data point.

> > In comparison,
> > ioctl-based implementation had to do only 6 ioctl() calls to fetch all
> > relevant VMAs.
> >
> > It is projected that savings from processing big production applications
> > would only widen the gap in favor of binary-based querying ioctl API, as
> > bigger applications will tend to have even more non-executable VMA
> > mappings relative to executable ones.
>
> Define "bigger applications" please.  Is this some "large database
> company workload" type of thing, or something else?

I don't have a definition. But I had in mind, as one example, an
ads-serving service we use internally (it's a pretty large application
by pretty much any metric you can come up with). I just randomly
picked one of the production hosts, found one instance of that
service, and looked at its /proc/<pid>/maps file. Hopefully it will
satisfy your need for specifics.

# cat /proc/1126243/maps | wc -c
1570178
# cat /proc/1126243/maps | wc -l
28875
# cat /proc/1126243/maps | grep ' ..x. ' | wc -l
7347

You can see that maps file itself is about 1.5MB of text (which means
single-shot reading of its entire contents is a bit unrealistic,
though, sure, why not). The process contains 28875 VMAs, out of which
only 7347 are executable.

This means if we were to profile this process (and normally we profile
entire system, so it's almost never single /proc/<pid>/maps file that
needs to be open and processed), we'd need *at most* (absolute worst
case!) 7347/28875 = 25.5% of entries. In reality, most code will be
concentrated in a much smaller number of executable VMAs, of course.
But no, I don't have specific numbers at hand, sorry.

It matters less whether it's text or binary (though binary undoubtedly
will be faster, it's strange to even argue about this), it's the
ability to fetch only relevant VMAs that is the point here.

>
> thanks,
>
> greg k-h
Ian Rogers May 5, 2024, 5:09 a.m. UTC | #5
On Sat, May 4, 2024 at 2:57 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Sat, May 4, 2024 at 8:29 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> >
> > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > Implement a simple tool/benchmark for comparing address "resolution"
> > > logic based on textual /proc/<pid>/maps interface and new binary
> > > ioctl-based PROCFS_PROCMAP_QUERY command.
> >
> > Of course an artificial benchmark of "read a whole file" vs. "a tiny
> > ioctl" is going to be different, but step back and show how this is
> > going to be used in the real world overall.  Pounding on this file is
> > not a normal operation, right?
> >
>
> It's not artificial at all. It's *exactly* what, say, blazesym library
> is doing (see [0], it's Rust and part of the overall library API, I
> think C code in this patch is way easier to follow for someone not
> familiar with implementation of blazesym, but both implementations are
> doing exactly the same sequence of steps). You can do it even less
> efficiently by parsing the whole file, building an in-memory lookup
> table, then looking up addresses one by one. But that's even slower
> and more memory-hungry. So I didn't even bother implementing that, it
> would put /proc/<pid>/maps at even more disadvantage.
>
> Other applications that deal with stack traces (including perf) would
> be doing one of those two approaches, depending on circumstances and
> level of sophistication of code (and sensitivity to performance).

The code in perf doing this is here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/synthetic-events.c#n440
The code is using the api/io.h code:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/api/io.h
Using perf to profile perf it was observed time was spent allocating
buffers and locale related activities when using stdio, so io is a
lighter weight alternative, albeit with more verbose code than fscanf.
You could add this as an alternate /proc/<pid>/maps reader, we have a
similar benchmark in `perf bench internals synthesize`.

Thanks,
Ian

>   [0] https://github.com/libbpf/blazesym/blob/ee9b48a80c0b4499118a1e8e5d901cddb2b33ab1/src/normalize/user.rs#L193
>
> > thanks,
> >
> > greg k-h
>
Andrii Nakryiko May 6, 2024, 6:32 p.m. UTC | #6
On Sat, May 4, 2024 at 10:09 PM Ian Rogers <irogers@google.com> wrote:
>
> On Sat, May 4, 2024 at 2:57 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Sat, May 4, 2024 at 8:29 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > > Implement a simple tool/benchmark for comparing address "resolution"
> > > > logic based on textual /proc/<pid>/maps interface and new binary
> > > > ioctl-based PROCFS_PROCMAP_QUERY command.
> > >
> > > Of course an artificial benchmark of "read a whole file" vs. "a tiny
> > > ioctl" is going to be different, but step back and show how this is
> > > going to be used in the real world overall.  Pounding on this file is
> > > not a normal operation, right?
> > >
> >
> > It's not artificial at all. It's *exactly* what, say, blazesym library
> > is doing (see [0], it's Rust and part of the overall library API, I
> > think C code in this patch is way easier to follow for someone not
> > familiar with implementation of blazesym, but both implementations are
> > doing exactly the same sequence of steps). You can do it even less
> > efficiently by parsing the whole file, building an in-memory lookup
> > table, then looking up addresses one by one. But that's even slower
> > and more memory-hungry. So I didn't even bother implementing that, it
> > would put /proc/<pid>/maps at even more disadvantage.
> >
> > Other applications that deal with stack traces (including perf) would
> > be doing one of those two approaches, depending on circumstances and
> > level of sophistication of code (and sensitivity to performance).
>
> The code in perf doing this is here:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/synthetic-events.c#n440
> The code is using the api/io.h code:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/api/io.h
> Using perf to profile perf it was observed time was spent allocating
> buffers and locale related activities when using stdio, so io is a
> lighter weight alternative, albeit with more verbose code than fscanf.
> You could add this as an alternate /proc/<pid>/maps reader, we have a
> similar benchmark in `perf bench internals synthesize`.
>

If I add a new implementation using this ioctl() into
perf_event__synthesize_mmap_events(), will it be tested from this
`perf bench internals synthesize`? I'm not too familiar with perf code
organization, sorry if it's a stupid question. If not, where exactly
is the code that would be triggered from benchmark?

> Thanks,
> Ian
>
> >   [0] https://github.com/libbpf/blazesym/blob/ee9b48a80c0b4499118a1e8e5d901cddb2b33ab1/src/normalize/user.rs#L193
> >
> > > thanks,
> > >
> > > greg k-h
> >
Ian Rogers May 6, 2024, 6:43 p.m. UTC | #7
On Mon, May 6, 2024 at 11:32 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Sat, May 4, 2024 at 10:09 PM Ian Rogers <irogers@google.com> wrote:
> >
> > On Sat, May 4, 2024 at 2:57 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Sat, May 4, 2024 at 8:29 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > >
> > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > > > Implement a simple tool/benchmark for comparing address "resolution"
> > > > > logic based on textual /proc/<pid>/maps interface and new binary
> > > > > ioctl-based PROCFS_PROCMAP_QUERY command.
> > > >
> > > > Of course an artificial benchmark of "read a whole file" vs. "a tiny
> > > > ioctl" is going to be different, but step back and show how this is
> > > > going to be used in the real world overall.  Pounding on this file is
> > > > not a normal operation, right?
> > > >
> > >
> > > It's not artificial at all. It's *exactly* what, say, blazesym library
> > > is doing (see [0], it's Rust and part of the overall library API, I
> > > think C code in this patch is way easier to follow for someone not
> > > familiar with implementation of blazesym, but both implementations are
> > > doing exactly the same sequence of steps). You can do it even less
> > > efficiently by parsing the whole file, building an in-memory lookup
> > > table, then looking up addresses one by one. But that's even slower
> > > and more memory-hungry. So I didn't even bother implementing that, it
> > > would put /proc/<pid>/maps at even more disadvantage.
> > >
> > > Other applications that deal with stack traces (including perf) would
> > > be doing one of those two approaches, depending on circumstances and
> > > level of sophistication of code (and sensitivity to performance).
> >
> > The code in perf doing this is here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/synthetic-events.c#n440
> > The code is using the api/io.h code:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/api/io.h
> > Using perf to profile perf it was observed time was spent allocating
> > buffers and locale related activities when using stdio, so io is a
> > lighter weight alternative, albeit with more verbose code than fscanf.
> > You could add this as an alternate /proc/<pid>/maps reader, we have a
> > similar benchmark in `perf bench internals synthesize`.
> >
>
> If I add a new implementation using this ioctl() into
> perf_event__synthesize_mmap_events(), will it be tested from this
> `perf bench internals synthesize`? I'm not too familiar with perf code
> organization, sorry if it's a stupid question. If not, where exactly
> is the code that would be triggered from benchmark?

Yes it would be triggered :-)

Thanks,
Ian

> > Thanks,
> > Ian
> >
> > >   [0] https://github.com/libbpf/blazesym/blob/ee9b48a80c0b4499118a1e8e5d901cddb2b33ab1/src/normalize/user.rs#L193
> > >
> > > > thanks,
> > > >
> > > > greg k-h
> > >
Andrii Nakryiko May 7, 2024, 5:06 a.m. UTC | #8
On Mon, May 6, 2024 at 11:43 AM Ian Rogers <irogers@google.com> wrote:
>
> On Mon, May 6, 2024 at 11:32 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Sat, May 4, 2024 at 10:09 PM Ian Rogers <irogers@google.com> wrote:
> > >
> > > On Sat, May 4, 2024 at 2:57 PM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > On Sat, May 4, 2024 at 8:29 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > > >
> > > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > > > > Implement a simple tool/benchmark for comparing address "resolution"
> > > > > > logic based on textual /proc/<pid>/maps interface and new binary
> > > > > > ioctl-based PROCFS_PROCMAP_QUERY command.
> > > > >
> > > > > Of course an artificial benchmark of "read a whole file" vs. "a tiny
> > > > > ioctl" is going to be different, but step back and show how this is
> > > > > going to be used in the real world overall.  Pounding on this file is
> > > > > not a normal operation, right?
> > > > >
> > > >
> > > > It's not artificial at all. It's *exactly* what, say, blazesym library
> > > > is doing (see [0], it's Rust and part of the overall library API, I
> > > > think C code in this patch is way easier to follow for someone not
> > > > familiar with implementation of blazesym, but both implementations are
> > > > doing exactly the same sequence of steps). You can do it even less
> > > > efficiently by parsing the whole file, building an in-memory lookup
> > > > table, then looking up addresses one by one. But that's even slower
> > > > and more memory-hungry. So I didn't even bother implementing that, it
> > > > would put /proc/<pid>/maps at even more disadvantage.
> > > >
> > > > Other applications that deal with stack traces (including perf) would
> > > > be doing one of those two approaches, depending on circumstances and
> > > > level of sophistication of code (and sensitivity to performance).
> > >
> > > The code in perf doing this is here:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/synthetic-events.c#n440
> > > The code is using the api/io.h code:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/api/io.h
> > > Using perf to profile perf it was observed time was spent allocating
> > > buffers and locale related activities when using stdio, so io is a
> > > lighter weight alternative, albeit with more verbose code than fscanf.
> > > You could add this as an alternate /proc/<pid>/maps reader, we have a
> > > similar benchmark in `perf bench internals synthesize`.
> > >
> >
> > If I add a new implementation using this ioctl() into
> > perf_event__synthesize_mmap_events(), will it be tested from this
> > `perf bench internals synthesize`? I'm not too familiar with perf code
> > organization, sorry if it's a stupid question. If not, where exactly
> > is the code that would be triggered from benchmark?
>
> Yes it would be triggered :-)

Ok, I don't exactly know how to interpret the results (and what the
benchmark is doing), but numbers don't seem to be worse. They actually
seem to be a bit better.

I pushed my code that adds perf integration to [0]. That commit has
results, but I'll post them here (with invocation parameters).
perf-ioctl is the version with ioctl()-based implementation,
perf-parse is, logically, text-parsing version. Here are the results
(and see my notes below the results as well):

TEXT-BASED
==========

# ./perf-parse bench internals synthesize
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
  Average synthesis took: 80.311 usec (+- 0.077 usec)
  Average num. events: 32.000 (+- 0.000)
  Average time per event 2.510 usec
  Average data synthesis took: 84.429 usec (+- 0.066 usec)
  Average num. events: 179.000 (+- 0.000)
  Average time per event 0.472 usec

# ./perf-parse bench internals synthesize
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
  Average synthesis took: 79.900 usec (+- 0.077 usec)
  Average num. events: 32.000 (+- 0.000)
  Average time per event 2.497 usec
  Average data synthesis took: 84.832 usec (+- 0.074 usec)
  Average num. events: 180.000 (+- 0.000)
  Average time per event 0.471 usec

# ./perf-parse bench internals synthesize --mt -M 8
# Running 'internals/synthesize' benchmark:
Computing performance of multi threaded perf event synthesis by
synthesizing events on CPU 0:
  Number of synthesis threads: 1
    Average synthesis took: 36338.100 usec (+- 406.091 usec)
    Average num. events: 14091.300 (+- 7.433)
    Average time per event 2.579 usec
  Number of synthesis threads: 2
    Average synthesis took: 37071.200 usec (+- 746.498 usec)
    Average num. events: 14085.900 (+- 1.900)
    Average time per event 2.632 usec
  Number of synthesis threads: 3
    Average synthesis took: 33932.300 usec (+- 626.861 usec)
    Average num. events: 14085.900 (+- 1.900)
    Average time per event 2.409 usec
  Number of synthesis threads: 4
    Average synthesis took: 33822.700 usec (+- 506.290 usec)
    Average num. events: 14099.200 (+- 8.761)
    Average time per event 2.399 usec
  Number of synthesis threads: 5
    Average synthesis took: 33348.200 usec (+- 389.771 usec)
    Average num. events: 14085.900 (+- 1.900)
    Average time per event 2.367 usec
  Number of synthesis threads: 6
    Average synthesis took: 33269.600 usec (+- 350.341 usec)
    Average num. events: 14084.000 (+- 0.000)
    Average time per event 2.362 usec
  Number of synthesis threads: 7
    Average synthesis took: 32663.900 usec (+- 338.870 usec)
    Average num. events: 14085.900 (+- 1.900)
    Average time per event 2.319 usec
  Number of synthesis threads: 8
    Average synthesis took: 32748.400 usec (+- 285.450 usec)
    Average num. events: 14085.900 (+- 1.900)
    Average time per event 2.325 usec

IOCTL-BASED
===========
# ./perf-ioctl bench internals synthesize
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
  Average synthesis took: 72.996 usec (+- 0.076 usec)
  Average num. events: 31.000 (+- 0.000)
  Average time per event 2.355 usec
  Average data synthesis took: 79.067 usec (+- 0.074 usec)
  Average num. events: 178.000 (+- 0.000)
  Average time per event 0.444 usec

# ./perf-ioctl bench internals synthesize
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
  Average synthesis took: 73.921 usec (+- 0.073 usec)
  Average num. events: 31.000 (+- 0.000)
  Average time per event 2.385 usec
  Average data synthesis took: 80.545 usec (+- 0.070 usec)
  Average num. events: 178.000 (+- 0.000)
  Average time per event 0.453 usec

# ./perf-ioctl bench internals synthesize --mt -M 8
# Running 'internals/synthesize' benchmark:
Computing performance of multi threaded perf event synthesis by
synthesizing events on CPU 0:
  Number of synthesis threads: 1
    Average synthesis took: 35609.500 usec (+- 428.576 usec)
    Average num. events: 14040.700 (+- 1.700)
    Average time per event 2.536 usec
  Number of synthesis threads: 2
    Average synthesis took: 34293.800 usec (+- 453.811 usec)
    Average num. events: 14040.700 (+- 1.700)
    Average time per event 2.442 usec
  Number of synthesis threads: 3
    Average synthesis took: 32385.200 usec (+- 363.106 usec)
    Average num. events: 14040.700 (+- 1.700)
    Average time per event 2.307 usec
  Number of synthesis threads: 4
    Average synthesis took: 33113.100 usec (+- 553.931 usec)
    Average num. events: 14054.500 (+- 11.469)
    Average time per event 2.356 usec
  Number of synthesis threads: 5
    Average synthesis took: 31600.600 usec (+- 297.349 usec)
    Average num. events: 14012.500 (+- 4.590)
    Average time per event 2.255 usec
  Number of synthesis threads: 6
    Average synthesis took: 32309.900 usec (+- 472.225 usec)
    Average num. events: 14004.000 (+- 0.000)
    Average time per event 2.307 usec
  Number of synthesis threads: 7
    Average synthesis took: 31400.100 usec (+- 206.261 usec)
    Average num. events: 14004.800 (+- 0.800)
    Average time per event 2.242 usec
  Number of synthesis threads: 8
    Average synthesis took: 31601.400 usec (+- 303.350 usec)
    Average num. events: 14005.700 (+- 1.700)
    Average time per event 2.256 usec

I also double-checked (using strace) that it does what it is supposed
to do, and it seems like everything checks out. Here's text-based
strace log:

openat(AT_FDCWD, "/proc/35876/task/35876/maps", O_RDONLY) = 3
read(3, "00400000-0040c000 r--p 00000000 "..., 8192) = 3997
read(3, "7f519d4d3000-7f519d516000 r--p 0"..., 8192) = 4025
read(3, "7f519dc3d000-7f519dc44000 r-xp 0"..., 8192) = 4048
read(3, "7f519dd2d000-7f519dd2f000 r--p 0"..., 8192) = 4017
read(3, "7f519dff6000-7f519dff8000 r--p 0"..., 8192) = 2744
read(3, "", 8192)                       = 0
close(3)                                = 0


BTW, note how the kernel doesn't serve more than 4KB of data, even
though perf provides 8KB buffer (that's to Greg's question about
optimizing using bigger buffers, I suspect without seq_file changes,
it won't work).

And here's an abbreviated log for ioctl version, it has lots more (but
much faster) ioctl() syscalls, given it dumps everything:

openat(AT_FDCWD, "/proc/36380/task/36380/maps", O_RDONLY) = 3
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0

 ... 195 ioctl() calls in total ...

ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50)
= -1 ENOENT (No such file or directory)
close(3)                                = 0


So, it's not the optimal usage of this API, and yet it's still better
(or at least not worse) than text-based API.

  [0] https://github.com/anakryiko/linux/commit/0841fe675ed30f5605c5b228e18f5612ea253b35

>
> Thanks,
> Ian
>
> > > Thanks,
> > > Ian
> > >
> > > >   [0] https://github.com/libbpf/blazesym/blob/ee9b48a80c0b4499118a1e8e5d901cddb2b33ab1/src/normalize/user.rs#L193
> > > >
> > > > > thanks,
> > > > >
> > > > > greg k-h
> > > >
Liam R. Howlett May 7, 2024, 3:48 p.m. UTC | #9
.. Adding Suren & Willy to the Cc

* Andrii Nakryiko <andrii.nakryiko@gmail.com> [240504 18:14]:
> On Sat, May 4, 2024 at 8:32 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> >
> > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > I also did an strace run of both cases. In text-based one the tool did
> > > 68 read() syscalls, fetching up to 4KB of data in one go.
> >
> > Why not fetch more at once?
> >
> 
> I didn't expect to be interrogated so much on the performance of the
> text parsing front, sorry. :) You can probably tune this, but where is
> the reasonable limit? 64KB? 256KB? 1MB? See below for some more
> production numbers.

The reason the file reads are limited to 4KB is because this file is
used for monitoring processes.  We have a significant number of
organisations polling this file so frequently that the mmap lock
contention becomes an issue. (reading a file is free, right?)  People
also tend to try to figure out why a process is slow by reading this
file - which amplifies the lock contention.

What happens today is that the lock is yielded after 4KB to allow time
for mmap writes to happen.  This also means your data may be
inconsistent from one 4KB block to the next (the write may be around
this boundary).

This new interface also takes the lock in do_procmap_query() and does
the 4kb blocks as well.  Extending this size means more time spent
blocking mmap writes, but a more consistent view of the world (less
"tearing" of the addresses).

We are working to reduce these issues by switching the /proc/<pid>/maps
file to use rcu lookup.  I would recommend we do not proceed with this
interface using the old method and instead, implement it using rcu from
the start - if it fits your use case (or we can make it fit your use
case).

At least, for most page faults, we can work around the lock contention
(since v6.6), but not all and not on all archs.

...

> 
> > > In comparison,
> > > ioctl-based implementation had to do only 6 ioctl() calls to fetch all
> > > relevant VMAs.
> > >
> > > It is projected that savings from processing big production applications
> > > would only widen the gap in favor of binary-based querying ioctl API, as
> > > bigger applications will tend to have even more non-executable VMA
> > > mappings relative to executable ones.
> >
> > Define "bigger applications" please.  Is this some "large database
> > company workload" type of thing, or something else?
> 
> I don't have a definition. But I had in mind, as one example, an
> ads-serving service we use internally (it's a pretty large application
> by pretty much any metric you can come up with). I just randomly
> picked one of the production hosts, found one instance of that
> service, and looked at its /proc/<pid>/maps file. Hopefully it will
> satisfy your need for specifics.
> 
> # cat /proc/1126243/maps | wc -c
> 1570178
> # cat /proc/1126243/maps | wc -l
> 28875
> # cat /proc/1126243/maps | grep ' ..x. ' | wc -l
> 7347

We have distributions increasing the map_count to an insane number to
allow games to work [1].  It is, unfortunately, only a matter of time until
this is regularly an issue as it is being normalised and allowed by an
increased number of distributions (fedora, arch, ubuntu).  So, despite
my email address, I am not talking about large database companies here.

Also, note that applications that use guard VMAs double the number for
the guards.  Fun stuff.

We are really doing a lot in the VMA area to reduce the mmap locking
contention and it seems you have a use case for a new interface that can
leverage these changes.

We have at least two talks around this area at LSF if you are attending.

Thanks,
Liam

[1] https://lore.kernel.org/linux-mm/8f6e2d69-b4df-45f3-aed4-5190966e2dea@valvesoftware.com/
Matthew Wilcox (Oracle) May 7, 2024, 4:10 p.m. UTC | #10
On Tue, May 07, 2024 at 11:48:44AM -0400, Liam R. Howlett wrote:
> .. Adding Suren & Willy to the Cc

I've been staying out of this disaster.  i thought steven rostedt was
going to do all of this in the kernel anyway.  wasn't thre a session on
that at lsfmm in vancouver last year?
Liam R. Howlett May 7, 2024, 4:18 p.m. UTC | #11
* Matthew Wilcox <willy@infradead.org> [240507 12:10]:
> On Tue, May 07, 2024 at 11:48:44AM -0400, Liam R. Howlett wrote:
> > .. Adding Suren & Willy to the Cc
> 
> I've been staying out of this disaster.  i thought steven rostedt was
> going to do all of this in the kernel anyway.  wasn't thre a session on
> that at lsfmm in vancouver last year?

sframes?  The only other one that comes to mind is the one where he and
kent were yelling at each other.
Andrii Nakryiko May 7, 2024, 4:27 p.m. UTC | #12
On Tue, May 7, 2024 at 8:49 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> .. Adding Suren & Willy to the Cc
>
> * Andrii Nakryiko <andrii.nakryiko@gmail.com> [240504 18:14]:
> > On Sat, May 4, 2024 at 8:32 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > > I also did an strace run of both cases. In text-based one the tool did
> > > > 68 read() syscalls, fetching up to 4KB of data in one go.
> > >
> > > Why not fetch more at once?
> > >
> >
> > I didn't expect to be interrogated so much on the performance of the
> > text parsing front, sorry. :) You can probably tune this, but where is
> > the reasonable limit? 64KB? 256KB? 1MB? See below for some more
> > production numbers.
>
> The reason the file reads are limited to 4KB is because this file is
> used for monitoring processes.  We have a significant number of
> organisations polling this file so frequently that the mmap lock
> contention becomes an issue. (reading a file is free, right?)  People
> also tend to try to figure out why a process is slow by reading this
> file - which amplifies the lock contention.
>
> What happens today is that the lock is yielded after 4KB to allow time
> for mmap writes to happen.  This also means your data may be
> inconsistent from one 4KB block to the next (the write may be around
> this boundary).
>
> This new interface also takes the lock in do_procmap_query() and does
> the 4kb blocks as well.  Extending this size means more time spent
> blocking mmap writes, but a more consistent view of the world (less
> "tearing" of the addresses).

Hold on. There is no 4KB in the new ioctl-based API I'm adding. It
does a single VMA look up (presumably O(logN) operation) using a
single vma_iter_init(addr) + vma_next() call on vma_iterator.

As for the mmap_read_lock_killable() (is that what we are talking
about?), I'm happy to use anything else available, please give me a
pointer. But I suspect given how fast and small this new API is,
mmap_read_lock_killable() in it is not comparable to holding it for
producing /proc/<pid>/maps contents.

>
> We are working to reduce these issues by switching the /proc/<pid>/maps
> file to use rcu lookup.  I would recommend we do not proceed with this
> interface using the old method and instead, implement it using rcu from
> the start - if it fits your use case (or we can make it fit your use
> case).
>
> At least, for most page faults, we can work around the lock contention
> (since v6.6), but not all and not on all archs.
>
> ...
>
> >
> > > > In comparison,
> > > > ioctl-based implementation had to do only 6 ioctl() calls to fetch all
> > > > relevant VMAs.
> > > >
> > > > It is projected that savings from processing big production applications
> > > > would only widen the gap in favor of binary-based querying ioctl API, as
> > > > bigger applications will tend to have even more non-executable VMA
> > > > mappings relative to executable ones.
> > >
> > > Define "bigger applications" please.  Is this some "large database
> > > company workload" type of thing, or something else?
> >
> > I don't have a definition. But I had in mind, as one example, an
> > ads-serving service we use internally (it's a pretty large application
> > by pretty much any metric you can come up with). I just randomly
> > picked one of the production hosts, found one instance of that
> > service, and looked at its /proc/<pid>/maps file. Hopefully it will
> > satisfy your need for specifics.
> >
> > # cat /proc/1126243/maps | wc -c
> > 1570178
> > # cat /proc/1126243/maps | wc -l
> > 28875
> > # cat /proc/1126243/maps | grep ' ..x. ' | wc -l
> > 7347
>
> We have distributions increasing the map_count to an insane number to
> allow games to work [1].  It is, unfortunately, only a matter of time until
> this is regularly an issue as it is being normalised and allowed by an
> increased number of distributions (fedora, arch, ubuntu).  So, despite
> my email address, I am not talking about large database companies here.
>
> Also, note that applications that use guard VMAs double the number for
> the guards.  Fun stuff.
>
> We are really doing a lot in the VMA area to reduce the mmap locking
> contention and it seems you have a use case for a new interface that can
> leverage these changes.
>
> We have at least two talks around this area at LSF if you are attending.

I am attending LSFMM, yes, I'll try to not miss them.

>
> Thanks,
> Liam
>
> [1] https://lore.kernel.org/linux-mm/8f6e2d69-b4df-45f3-aed4-5190966e2dea@valvesoftware.com/
>
Andrii Nakryiko May 7, 2024, 5:29 p.m. UTC | #13
On Mon, May 6, 2024 at 10:06 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, May 6, 2024 at 11:43 AM Ian Rogers <irogers@google.com> wrote:
> >
> > On Mon, May 6, 2024 at 11:32 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Sat, May 4, 2024 at 10:09 PM Ian Rogers <irogers@google.com> wrote:
> > > >
> > > > On Sat, May 4, 2024 at 2:57 PM Andrii Nakryiko
> > > > <andrii.nakryiko@gmail.com> wrote:
> > > > >
> > > > > On Sat, May 4, 2024 at 8:29 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > > > >
> > > > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > > > > > Implement a simple tool/benchmark for comparing address "resolution"
> > > > > > > logic based on textual /proc/<pid>/maps interface and new binary
> > > > > > > ioctl-based PROCFS_PROCMAP_QUERY command.
> > > > > >
> > > > > > Of course an artificial benchmark of "read a whole file" vs. "a tiny
> > > > > > ioctl" is going to be different, but step back and show how this is
> > > > > > going to be used in the real world overall.  Pounding on this file is
> > > > > > not a normal operation, right?
> > > > > >
> > > > >
> > > > > It's not artificial at all. It's *exactly* what, say, blazesym library
> > > > > is doing (see [0], it's Rust and part of the overall library API, I
> > > > > think C code in this patch is way easier to follow for someone not
> > > > > familiar with implementation of blazesym, but both implementations are
> > > > > doing exactly the same sequence of steps). You can do it even less
> > > > > efficiently by parsing the whole file, building an in-memory lookup
> > > > > table, then looking up addresses one by one. But that's even slower
> > > > > and more memory-hungry. So I didn't even bother implementing that, it
> > > > > would put /proc/<pid>/maps at even more disadvantage.
> > > > >
> > > > > Other applications that deal with stack traces (including perf) would
> > > > > be doing one of those two approaches, depending on circumstances and
> > > > > level of sophistication of code (and sensitivity to performance).
> > > >
> > > > The code in perf doing this is here:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/synthetic-events.c#n440
> > > > The code is using the api/io.h code:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/api/io.h
> > > > Using perf to profile perf it was observed time was spent allocating
> > > > buffers and locale related activities when using stdio, so io is a
> > > > lighter weight alternative, albeit with more verbose code than fscanf.
> > > > You could add this as an alternate /proc/<pid>/maps reader, we have a
> > > > similar benchmark in `perf bench internals synthesize`.
> > > >
> > >
> > > If I add a new implementation using this ioctl() into
> > > perf_event__synthesize_mmap_events(), will it be tested from this
> > > `perf bench internals synthesize`? I'm not too familiar with perf code
> > > organization, sorry if it's a stupid question. If not, where exactly
> > > is the code that would be triggered from benchmark?
> >
> > Yes it would be triggered :-)
>
> Ok, I don't exactly know how to interpret the results (and what the
> benchmark is doing), but numbers don't seem to be worse. They actually
> seem to be a bit better.
>
> I pushed my code that adds perf integration to [0]. That commit has
> results, but I'll post them here (with invocation parameters).
> perf-ioctl is the version with ioctl()-based implementation,
> perf-parse is, logically, text-parsing version. Here are the results
> (and see my notes below the results as well):
>
> TEXT-BASED
> ==========
>
> # ./perf-parse bench internals synthesize
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
>   Average synthesis took: 80.311 usec (+- 0.077 usec)
>   Average num. events: 32.000 (+- 0.000)
>   Average time per event 2.510 usec
>   Average data synthesis took: 84.429 usec (+- 0.066 usec)
>   Average num. events: 179.000 (+- 0.000)
>   Average time per event 0.472 usec
>
> # ./perf-parse bench internals synthesize
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
>   Average synthesis took: 79.900 usec (+- 0.077 usec)
>   Average num. events: 32.000 (+- 0.000)
>   Average time per event 2.497 usec
>   Average data synthesis took: 84.832 usec (+- 0.074 usec)
>   Average num. events: 180.000 (+- 0.000)
>   Average time per event 0.471 usec
>
> # ./perf-parse bench internals synthesize --mt -M 8
> # Running 'internals/synthesize' benchmark:
> Computing performance of multi threaded perf event synthesis by
> synthesizing events on CPU 0:
>   Number of synthesis threads: 1
>     Average synthesis took: 36338.100 usec (+- 406.091 usec)
>     Average num. events: 14091.300 (+- 7.433)
>     Average time per event 2.579 usec
>   Number of synthesis threads: 2
>     Average synthesis took: 37071.200 usec (+- 746.498 usec)
>     Average num. events: 14085.900 (+- 1.900)
>     Average time per event 2.632 usec
>   Number of synthesis threads: 3
>     Average synthesis took: 33932.300 usec (+- 626.861 usec)
>     Average num. events: 14085.900 (+- 1.900)
>     Average time per event 2.409 usec
>   Number of synthesis threads: 4
>     Average synthesis took: 33822.700 usec (+- 506.290 usec)
>     Average num. events: 14099.200 (+- 8.761)
>     Average time per event 2.399 usec
>   Number of synthesis threads: 5
>     Average synthesis took: 33348.200 usec (+- 389.771 usec)
>     Average num. events: 14085.900 (+- 1.900)
>     Average time per event 2.367 usec
>   Number of synthesis threads: 6
>     Average synthesis took: 33269.600 usec (+- 350.341 usec)
>     Average num. events: 14084.000 (+- 0.000)
>     Average time per event 2.362 usec
>   Number of synthesis threads: 7
>     Average synthesis took: 32663.900 usec (+- 338.870 usec)
>     Average num. events: 14085.900 (+- 1.900)
>     Average time per event 2.319 usec
>   Number of synthesis threads: 8
>     Average synthesis took: 32748.400 usec (+- 285.450 usec)
>     Average num. events: 14085.900 (+- 1.900)
>     Average time per event 2.325 usec
>
> IOCTL-BASED
> ===========
> # ./perf-ioctl bench internals synthesize
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
>   Average synthesis took: 72.996 usec (+- 0.076 usec)
>   Average num. events: 31.000 (+- 0.000)
>   Average time per event 2.355 usec
>   Average data synthesis took: 79.067 usec (+- 0.074 usec)
>   Average num. events: 178.000 (+- 0.000)
>   Average time per event 0.444 usec
>
> # ./perf-ioctl bench internals synthesize
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
>   Average synthesis took: 73.921 usec (+- 0.073 usec)
>   Average num. events: 31.000 (+- 0.000)
>   Average time per event 2.385 usec
>   Average data synthesis took: 80.545 usec (+- 0.070 usec)
>   Average num. events: 178.000 (+- 0.000)
>   Average time per event 0.453 usec
>
> # ./perf-ioctl bench internals synthesize --mt -M 8
> # Running 'internals/synthesize' benchmark:
> Computing performance of multi threaded perf event synthesis by
> synthesizing events on CPU 0:
>   Number of synthesis threads: 1
>     Average synthesis took: 35609.500 usec (+- 428.576 usec)
>     Average num. events: 14040.700 (+- 1.700)
>     Average time per event 2.536 usec
>   Number of synthesis threads: 2
>     Average synthesis took: 34293.800 usec (+- 453.811 usec)
>     Average num. events: 14040.700 (+- 1.700)
>     Average time per event 2.442 usec
>   Number of synthesis threads: 3
>     Average synthesis took: 32385.200 usec (+- 363.106 usec)
>     Average num. events: 14040.700 (+- 1.700)
>     Average time per event 2.307 usec
>   Number of synthesis threads: 4
>     Average synthesis took: 33113.100 usec (+- 553.931 usec)
>     Average num. events: 14054.500 (+- 11.469)
>     Average time per event 2.356 usec
>   Number of synthesis threads: 5
>     Average synthesis took: 31600.600 usec (+- 297.349 usec)
>     Average num. events: 14012.500 (+- 4.590)
>     Average time per event 2.255 usec
>   Number of synthesis threads: 6
>     Average synthesis took: 32309.900 usec (+- 472.225 usec)
>     Average num. events: 14004.000 (+- 0.000)
>     Average time per event 2.307 usec
>   Number of synthesis threads: 7
>     Average synthesis took: 31400.100 usec (+- 206.261 usec)
>     Average num. events: 14004.800 (+- 0.800)
>     Average time per event 2.242 usec
>   Number of synthesis threads: 8
>     Average synthesis took: 31601.400 usec (+- 303.350 usec)
>     Average num. events: 14005.700 (+- 1.700)
>     Average time per event 2.256 usec
>
> I also double-checked (using strace) that it does what it is supposed
> to do, and it seems like everything checks out. Here's text-based
> strace log:
>
> openat(AT_FDCWD, "/proc/35876/task/35876/maps", O_RDONLY) = 3
> read(3, "00400000-0040c000 r--p 00000000 "..., 8192) = 3997
> read(3, "7f519d4d3000-7f519d516000 r--p 0"..., 8192) = 4025
> read(3, "7f519dc3d000-7f519dc44000 r-xp 0"..., 8192) = 4048
> read(3, "7f519dd2d000-7f519dd2f000 r--p 0"..., 8192) = 4017
> read(3, "7f519dff6000-7f519dff8000 r--p 0"..., 8192) = 2744
> read(3, "", 8192)                       = 0
> close(3)                                = 0
>
>
> BTW, note how the kernel doesn't serve more than 4KB of data, even
> though perf provides 8KB buffer (that's to Greg's question about
> optimizing using bigger buffers, I suspect without seq_file changes,
> it won't work).
>
> And here's an abbreviated log for ioctl version, it has lots more (but
> much faster) ioctl() syscalls, given it dumps everything:
>
> openat(AT_FDCWD, "/proc/36380/task/36380/maps", O_RDONLY) = 3
> ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
> ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
>
>  ... 195 ioctl() calls in total ...
>
> ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
> ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
> ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
> ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50)
> = -1 ENOENT (No such file or directory)
> close(3)                                = 0
>
>
> So, it's not the optimal usage of this API, and yet it's still better
> (or at least not worse) than text-based API.
>

In another reply to Arnaldo on patch #2 I mentioned the idea of
allowing to iterate only file-backed VMAs (as it seems like what
symbolizers would only care about, but I might be wrong here). So I
tried that quickly, given it's a trivial addition to my code. See
results below (it is slightly faster, but not much, because most of
VMAs in that benchmark seem to be indeed file-backed anyways), just
for completeness. I'm not sure if that would be useful/used by perf,
so please let me know.

As I mentioned above, it's not radically faster in this perf
benchmark, because we still request about 170 VMAs (vs ~195 if we
iterate *all* of them), so not a big change. The ratio will vary
depending on what the process is doing, of course. Anyways, just for
completeness, I'm not sure if I have to add this "filter" to the
actual implementation.

# ./perf-filebacked bench internals synthesize
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
  Average synthesis took: 65.759 usec (+- 0.063 usec)
  Average num. events: 30.000 (+- 0.000)
  Average time per event 2.192 usec
  Average data synthesis took: 73.840 usec (+- 0.080 usec)
  Average num. events: 153.000 (+- 0.000)
  Average time per event 0.483 usec

# ./perf-filebacked bench internals synthesize
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
  Average synthesis took: 66.245 usec (+- 0.059 usec)
  Average num. events: 30.000 (+- 0.000)
  Average time per event 2.208 usec
  Average data synthesis took: 70.627 usec (+- 0.074 usec)
  Average num. events: 153.000 (+- 0.000)
  Average time per event 0.462 usec

# ./perf-filebacked bench internals synthesize --mt -M 8
# Running 'internals/synthesize' benchmark:
Computing performance of multi threaded perf event synthesis by
synthesizing events on CPU 0:
  Number of synthesis threads: 1
    Average synthesis took: 33477.500 usec (+- 556.102 usec)
    Average num. events: 10125.700 (+- 1.620)
    Average time per event 3.306 usec
  Number of synthesis threads: 2
    Average synthesis took: 30473.700 usec (+- 221.933 usec)
    Average num. events: 10127.000 (+- 0.000)
    Average time per event 3.009 usec
  Number of synthesis threads: 3
    Average synthesis took: 29775.200 usec (+- 315.212 usec)
    Average num. events: 10128.700 (+- 0.667)
    Average time per event 2.940 usec
  Number of synthesis threads: 4
    Average synthesis took: 29477.100 usec (+- 621.258 usec)
    Average num. events: 10129.000 (+- 0.000)
    Average time per event 2.910 usec
  Number of synthesis threads: 5
    Average synthesis took: 29777.900 usec (+- 294.710 usec)
    Average num. events: 10144.700 (+- 11.597)
    Average time per event 2.935 usec
  Number of synthesis threads: 6
    Average synthesis took: 27774.700 usec (+- 357.569 usec)
    Average num. events: 10158.500 (+- 14.710)
    Average time per event 2.734 usec
  Number of synthesis threads: 7
    Average synthesis took: 27437.200 usec (+- 233.626 usec)
    Average num. events: 10135.700 (+- 2.700)
    Average time per event 2.707 usec
  Number of synthesis threads: 8
    Average synthesis took: 28784.600 usec (+- 477.630 usec)
    Average num. events: 10133.000 (+- 0.000)
    Average time per event 2.841 usec

>   [0] https://github.com/anakryiko/linux/commit/0841fe675ed30f5605c5b228e18f5612ea253b35
>
> >
> > Thanks,
> > Ian
> >
> > > > Thanks,
> > > > Ian
> > > >
> > > > >   [0] https://github.com/libbpf/blazesym/blob/ee9b48a80c0b4499118a1e8e5d901cddb2b33ab1/src/normalize/user.rs#L193
> > > > >
> > > > > > thanks,
> > > > > >
> > > > > > greg k-h
> > > > >
Liam R. Howlett May 7, 2024, 6:06 p.m. UTC | #14
* Andrii Nakryiko <andrii.nakryiko@gmail.com> [240507 12:28]:
> On Tue, May 7, 2024 at 8:49 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > .. Adding Suren & Willy to the Cc
> >
> > * Andrii Nakryiko <andrii.nakryiko@gmail.com> [240504 18:14]:
> > > On Sat, May 4, 2024 at 8:32 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > >
> > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > > > I also did an strace run of both cases. In text-based one the tool did
> > > > > 68 read() syscalls, fetching up to 4KB of data in one go.
> > > >
> > > > Why not fetch more at once?
> > > >
> > >
> > > I didn't expect to be interrogated so much on the performance of the
> > > text parsing front, sorry. :) You can probably tune this, but where is
> > > the reasonable limit? 64KB? 256KB? 1MB? See below for some more
> > > production numbers.
> >
> > The reason the file reads are limited to 4KB is because this file is
> > used for monitoring processes.  We have a significant number of
> > organisations polling this file so frequently that the mmap lock
> > contention becomes an issue. (reading a file is free, right?)  People
> > also tend to try to figure out why a process is slow by reading this
> > file - which amplifies the lock contention.
> >
> > What happens today is that the lock is yielded after 4KB to allow time
> > for mmap writes to happen.  This also means your data may be
> > inconsistent from one 4KB block to the next (the write may be around
> > this boundary).
> >
> > This new interface also takes the lock in do_procmap_query() and does
> > the 4kb blocks as well.  Extending this size means more time spent
> > blocking mmap writes, but a more consistent view of the world (less
> > "tearing" of the addresses).
> 
> Hold on. There is no 4KB in the new ioctl-based API I'm adding. It
> does a single VMA look up (presumably O(logN) operation) using a
> single vma_iter_init(addr) + vma_next() call on vma_iterator.

Sorry, I read this:

+	if (usize > PAGE_SIZE)
+		return -E2BIG;

And thought you were going to return many vmas in that buffer.  I see
now that you are doing one copy at a time.

> 
> As for the mmap_read_lock_killable() (is that what we are talking
> about?), I'm happy to use anything else available, please give me a
> pointer. But I suspect given how fast and small this new API is,
> mmap_read_lock_killable() in it is not comparable to holding it for
> producing /proc/<pid>/maps contents.

Yes, mmap_read_lock_killable() is the mmap lock (formally known as the
mmap sem).

You can see examples of avoiding the mmap lock by use of rcu in
mm/memory.c lock_vma_under_rcu() which is used in the fault path.
userfaultfd has an example as well. But again, remember that not all
archs have this functionality, so you'd need to fall back to full mmap
locking.

Certainly a single lookup and copy will be faster than a 4k buffer
filling copy, but you will be walking the tree O(n) times, where n is
the vma count.  This isn't as efficient as multiple lookups in a row as
we will re-walk from the top of the tree. You will also need to contend
with the fact that the chance of the vmas changing between calls is much
higher here too - if that's an issue. Neither of these issues go away
with use of the rcu locking instead of the mmap lock, but we can be
quite certain that we won't cause locking contention.

Thanks,
Liam
Andrii Nakryiko May 7, 2024, 7 p.m. UTC | #15
On Tue, May 7, 2024 at 11:06 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Andrii Nakryiko <andrii.nakryiko@gmail.com> [240507 12:28]:
> > On Tue, May 7, 2024 at 8:49 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > >
> > > .. Adding Suren & Willy to the Cc
> > >
> > > * Andrii Nakryiko <andrii.nakryiko@gmail.com> [240504 18:14]:
> > > > On Sat, May 4, 2024 at 8:32 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > > >
> > > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > > > > I also did an strace run of both cases. In text-based one the tool did
> > > > > > 68 read() syscalls, fetching up to 4KB of data in one go.
> > > > >
> > > > > Why not fetch more at once?
> > > > >
> > > >
> > > > I didn't expect to be interrogated so much on the performance of the
> > > > text parsing front, sorry. :) You can probably tune this, but where is
> > > > the reasonable limit? 64KB? 256KB? 1MB? See below for some more
> > > > production numbers.
> > >
> > > The reason the file reads are limited to 4KB is because this file is
> > > used for monitoring processes.  We have a significant number of
> > > organisations polling this file so frequently that the mmap lock
> > > contention becomes an issue. (reading a file is free, right?)  People
> > > also tend to try to figure out why a process is slow by reading this
> > > file - which amplifies the lock contention.
> > >
> > > What happens today is that the lock is yielded after 4KB to allow time
> > > for mmap writes to happen.  This also means your data may be
> > > inconsistent from one 4KB block to the next (the write may be around
> > > this boundary).
> > >
> > > This new interface also takes the lock in do_procmap_query() and does
> > > the 4kb blocks as well.  Extending this size means more time spent
> > > blocking mmap writes, but a more consistent view of the world (less
> > > "tearing" of the addresses).
> >
> > Hold on. There is no 4KB in the new ioctl-based API I'm adding. It
> > does a single VMA look up (presumably O(logN) operation) using a
> > single vma_iter_init(addr) + vma_next() call on vma_iterator.
>
> Sorry, I read this:
>
> +       if (usize > PAGE_SIZE)
> +               return -E2BIG;
>
> And thought you were going to return many vmas in that buffer.  I see
> now that you are doing one copy at a time.
>
> >
> > As for the mmap_read_lock_killable() (is that what we are talking
> > about?), I'm happy to use anything else available, please give me a
> > pointer. But I suspect given how fast and small this new API is,
> > mmap_read_lock_killable() in it is not comparable to holding it for
> > producing /proc/<pid>/maps contents.
>
> Yes, mmap_read_lock_killable() is the mmap lock (formally known as the
> mmap sem).
>
> You can see examples of avoiding the mmap lock by use of rcu in
> mm/memory.c lock_vma_under_rcu() which is used in the fault path.
> userfaultfd has an example as well. But again, remember that not all
> archs have this functionality, so you'd need to fall back to full mmap
> locking.

Thanks for the pointer (didn't see email when replying on the other thread).

I looked at lock_vma_under_rcu() quickly, and seems like it's designed
to find VMA that covers given address, but not the next closest one.
So it's a bit problematic for the API I'm adding, as
PROCFS_PROCMAP_EXACT_OR_NEXT_VMA (which I can rename to
COVERING_OR_NEXT_VMA, if necessary), is quite important for the use
cases we have. But maybe some variation of lock_vma_under_rcu() can be
added that would fit this case?

>
> Certainly a single lookup and copy will be faster than a 4k buffer
> filling copy, but you will be walking the tree O(n) times, where n is
> the vma count.  This isn't as efficient as multiple lookups in a row as
> we will re-walk from the top of the tree. You will also need to contend
> with the fact that the chance of the vmas changing between calls is much
> higher here too - if that's an issue. Neither of these issues go away
> with use of the rcu locking instead of the mmap lock, but we can be
> quite certain that we won't cause locking contention.

You are right about O(n) times, but note that for symbolization cases
I'm describing, this n will be, generally, *much* smaller than a total
number of VMAs within the process. It's a huge speed up in practice.
This is because we pre-sort addresses in user-space, and then we query
VMA for the first address, but then we quickly skip all the other
addresses that are already covered by this VMA, and so the next
request will query a new VMA that covers another subset of addresses.
This way we'll get the minimal number of VMAs that cover captured
addresses (which in the case of stack traces would be a few VMAs
belonging to executable sections of process' binary plus a bunch of
shared libraries).

>
> Thanks,
> Liam
>
Namhyung Kim May 7, 2024, 10:27 p.m. UTC | #16
On Tue, May 7, 2024 at 10:29 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, May 6, 2024 at 10:06 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Mon, May 6, 2024 at 11:43 AM Ian Rogers <irogers@google.com> wrote:
> > >
> > > On Mon, May 6, 2024 at 11:32 AM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > On Sat, May 4, 2024 at 10:09 PM Ian Rogers <irogers@google.com> wrote:
> > > > >
> > > > > On Sat, May 4, 2024 at 2:57 PM Andrii Nakryiko
> > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > >
> > > > > > On Sat, May 4, 2024 at 8:29 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > > > > >
> > > > > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > > > > > > Implement a simple tool/benchmark for comparing address "resolution"
> > > > > > > > logic based on textual /proc/<pid>/maps interface and new binary
> > > > > > > > ioctl-based PROCFS_PROCMAP_QUERY command.
> > > > > > >
> > > > > > > Of course an artificial benchmark of "read a whole file" vs. "a tiny
> > > > > > > ioctl" is going to be different, but step back and show how this is
> > > > > > > going to be used in the real world overall.  Pounding on this file is
> > > > > > > not a normal operation, right?
> > > > > > >
> > > > > >
> > > > > > It's not artificial at all. It's *exactly* what, say, blazesym library
> > > > > > is doing (see [0], it's Rust and part of the overall library API, I
> > > > > > think C code in this patch is way easier to follow for someone not
> > > > > > familiar with implementation of blazesym, but both implementations are
> > > > > > doing exactly the same sequence of steps). You can do it even less
> > > > > > efficiently by parsing the whole file, building an in-memory lookup
> > > > > > table, then looking up addresses one by one. But that's even slower
> > > > > > and more memory-hungry. So I didn't even bother implementing that, it
> > > > > > would put /proc/<pid>/maps at even more disadvantage.
> > > > > >
> > > > > > Other applications that deal with stack traces (including perf) would
> > > > > > be doing one of those two approaches, depending on circumstances and
> > > > > > level of sophistication of code (and sensitivity to performance).
> > > > >
> > > > > The code in perf doing this is here:
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/synthetic-events.c#n440
> > > > > The code is using the api/io.h code:
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/api/io.h
> > > > > Using perf to profile perf it was observed time was spent allocating
> > > > > buffers and locale related activities when using stdio, so io is a
> > > > > lighter weight alternative, albeit with more verbose code than fscanf.
> > > > > You could add this as an alternate /proc/<pid>/maps reader, we have a
> > > > > similar benchmark in `perf bench internals synthesize`.
> > > > >
> > > >
> > > > If I add a new implementation using this ioctl() into
> > > > perf_event__synthesize_mmap_events(), will it be tested from this
> > > > `perf bench internals synthesize`? I'm not too familiar with perf code
> > > > organization, sorry if it's a stupid question. If not, where exactly
> > > > is the code that would be triggered from benchmark?
> > >
> > > Yes it would be triggered :-)
> >
> > Ok, I don't exactly know how to interpret the results (and what the
> > benchmark is doing), but numbers don't seem to be worse. They actually
> > seem to be a bit better.
> >
> > I pushed my code that adds perf integration to [0]. That commit has
> > results, but I'll post them here (with invocation parameters).
> > perf-ioctl is the version with ioctl()-based implementation,
> > perf-parse is, logically, text-parsing version. Here are the results
> > (and see my notes below the results as well):
> >
> > TEXT-BASED
> > ==========
> >
> > # ./perf-parse bench internals synthesize
> > # Running 'internals/synthesize' benchmark:
> > Computing performance of single threaded perf event synthesis by
> > synthesizing events on the perf process itself:
> >   Average synthesis took: 80.311 usec (+- 0.077 usec)
> >   Average num. events: 32.000 (+- 0.000)
> >   Average time per event 2.510 usec
> >   Average data synthesis took: 84.429 usec (+- 0.066 usec)
> >   Average num. events: 179.000 (+- 0.000)
> >   Average time per event 0.472 usec
> >
> > # ./perf-parse bench internals synthesize
> > # Running 'internals/synthesize' benchmark:
> > Computing performance of single threaded perf event synthesis by
> > synthesizing events on the perf process itself:
> >   Average synthesis took: 79.900 usec (+- 0.077 usec)
> >   Average num. events: 32.000 (+- 0.000)
> >   Average time per event 2.497 usec
> >   Average data synthesis took: 84.832 usec (+- 0.074 usec)
> >   Average num. events: 180.000 (+- 0.000)
> >   Average time per event 0.471 usec
> >
> > # ./perf-parse bench internals synthesize --mt -M 8
> > # Running 'internals/synthesize' benchmark:
> > Computing performance of multi threaded perf event synthesis by
> > synthesizing events on CPU 0:
> >   Number of synthesis threads: 1
> >     Average synthesis took: 36338.100 usec (+- 406.091 usec)
> >     Average num. events: 14091.300 (+- 7.433)
> >     Average time per event 2.579 usec
> >   Number of synthesis threads: 2
> >     Average synthesis took: 37071.200 usec (+- 746.498 usec)
> >     Average num. events: 14085.900 (+- 1.900)
> >     Average time per event 2.632 usec
> >   Number of synthesis threads: 3
> >     Average synthesis took: 33932.300 usec (+- 626.861 usec)
> >     Average num. events: 14085.900 (+- 1.900)
> >     Average time per event 2.409 usec
> >   Number of synthesis threads: 4
> >     Average synthesis took: 33822.700 usec (+- 506.290 usec)
> >     Average num. events: 14099.200 (+- 8.761)
> >     Average time per event 2.399 usec
> >   Number of synthesis threads: 5
> >     Average synthesis took: 33348.200 usec (+- 389.771 usec)
> >     Average num. events: 14085.900 (+- 1.900)
> >     Average time per event 2.367 usec
> >   Number of synthesis threads: 6
> >     Average synthesis took: 33269.600 usec (+- 350.341 usec)
> >     Average num. events: 14084.000 (+- 0.000)
> >     Average time per event 2.362 usec
> >   Number of synthesis threads: 7
> >     Average synthesis took: 32663.900 usec (+- 338.870 usec)
> >     Average num. events: 14085.900 (+- 1.900)
> >     Average time per event 2.319 usec
> >   Number of synthesis threads: 8
> >     Average synthesis took: 32748.400 usec (+- 285.450 usec)
> >     Average num. events: 14085.900 (+- 1.900)
> >     Average time per event 2.325 usec
> >
> > IOCTL-BASED
> > ===========
> > # ./perf-ioctl bench internals synthesize
> > # Running 'internals/synthesize' benchmark:
> > Computing performance of single threaded perf event synthesis by
> > synthesizing events on the perf process itself:
> >   Average synthesis took: 72.996 usec (+- 0.076 usec)
> >   Average num. events: 31.000 (+- 0.000)
> >   Average time per event 2.355 usec
> >   Average data synthesis took: 79.067 usec (+- 0.074 usec)
> >   Average num. events: 178.000 (+- 0.000)
> >   Average time per event 0.444 usec
> >
> > # ./perf-ioctl bench internals synthesize
> > # Running 'internals/synthesize' benchmark:
> > Computing performance of single threaded perf event synthesis by
> > synthesizing events on the perf process itself:
> >   Average synthesis took: 73.921 usec (+- 0.073 usec)
> >   Average num. events: 31.000 (+- 0.000)
> >   Average time per event 2.385 usec
> >   Average data synthesis took: 80.545 usec (+- 0.070 usec)
> >   Average num. events: 178.000 (+- 0.000)
> >   Average time per event 0.453 usec
> >
> > # ./perf-ioctl bench internals synthesize --mt -M 8
> > # Running 'internals/synthesize' benchmark:
> > Computing performance of multi threaded perf event synthesis by
> > synthesizing events on CPU 0:
> >   Number of synthesis threads: 1
> >     Average synthesis took: 35609.500 usec (+- 428.576 usec)
> >     Average num. events: 14040.700 (+- 1.700)
> >     Average time per event 2.536 usec
> >   Number of synthesis threads: 2
> >     Average synthesis took: 34293.800 usec (+- 453.811 usec)
> >     Average num. events: 14040.700 (+- 1.700)
> >     Average time per event 2.442 usec
> >   Number of synthesis threads: 3
> >     Average synthesis took: 32385.200 usec (+- 363.106 usec)
> >     Average num. events: 14040.700 (+- 1.700)
> >     Average time per event 2.307 usec
> >   Number of synthesis threads: 4
> >     Average synthesis took: 33113.100 usec (+- 553.931 usec)
> >     Average num. events: 14054.500 (+- 11.469)
> >     Average time per event 2.356 usec
> >   Number of synthesis threads: 5
> >     Average synthesis took: 31600.600 usec (+- 297.349 usec)
> >     Average num. events: 14012.500 (+- 4.590)
> >     Average time per event 2.255 usec
> >   Number of synthesis threads: 6
> >     Average synthesis took: 32309.900 usec (+- 472.225 usec)
> >     Average num. events: 14004.000 (+- 0.000)
> >     Average time per event 2.307 usec
> >   Number of synthesis threads: 7
> >     Average synthesis took: 31400.100 usec (+- 206.261 usec)
> >     Average num. events: 14004.800 (+- 0.800)
> >     Average time per event 2.242 usec
> >   Number of synthesis threads: 8
> >     Average synthesis took: 31601.400 usec (+- 303.350 usec)
> >     Average num. events: 14005.700 (+- 1.700)
> >     Average time per event 2.256 usec
> >
> > I also double-checked (using strace) that it does what it is supposed
> > to do, and it seems like everything checks out. Here's text-based
> > strace log:
> >
> > openat(AT_FDCWD, "/proc/35876/task/35876/maps", O_RDONLY) = 3
> > read(3, "00400000-0040c000 r--p 00000000 "..., 8192) = 3997
> > read(3, "7f519d4d3000-7f519d516000 r--p 0"..., 8192) = 4025
> > read(3, "7f519dc3d000-7f519dc44000 r-xp 0"..., 8192) = 4048
> > read(3, "7f519dd2d000-7f519dd2f000 r--p 0"..., 8192) = 4017
> > read(3, "7f519dff6000-7f519dff8000 r--p 0"..., 8192) = 2744
> > read(3, "", 8192)                       = 0
> > close(3)                                = 0
> >
> >
> > BTW, note how the kernel doesn't serve more than 4KB of data, even
> > though perf provides 8KB buffer (that's to Greg's question about
> > optimizing using bigger buffers, I suspect without seq_file changes,
> > it won't work).
> >
> > And here's an abbreviated log for ioctl version, it has lots more (but
> > much faster) ioctl() syscalls, given it dumps everything:
> >
> > openat(AT_FDCWD, "/proc/36380/task/36380/maps", O_RDONLY) = 3
> > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
> > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
> >
> >  ... 195 ioctl() calls in total ...
> >
> > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
> > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
> > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
> > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50)
> > = -1 ENOENT (No such file or directory)
> > close(3)                                = 0
> >
> >
> > So, it's not the optimal usage of this API, and yet it's still better
> > (or at least not worse) than text-based API.

It's surprising that more ioctl is cheaper than less read and parse.

> >
>
> In another reply to Arnaldo on patch #2 I mentioned the idea of
> allowing to iterate only file-backed VMAs (as it seems like what
> symbolizers would only care about, but I might be wrong here). So I

Yep, I think it's enough to get file-backed VMAs only.


> tried that quickly, given it's a trivial addition to my code. See
> results below (it is slightly faster, but not much, because most of
> VMAs in that benchmark seem to be indeed file-backed anyways), just
> for completeness. I'm not sure if that would be useful/used by perf,
> so please let me know.

Thanks for doing this.  It'd be useful as it provides better synthesizing
performance.  The startup latency of perf record is a problem, I need
to take a look if it can be improved more.

Thanks,
Namhyung


>
> As I mentioned above, it's not radically faster in this perf
> benchmark, because we still request about 170 VMAs (vs ~195 if we
> iterate *all* of them), so not a big change. The ratio will vary
> depending on what the process is doing, of course. Anyways, just for
> completeness, I'm not sure if I have to add this "filter" to the
> actual implementation.
>
> # ./perf-filebacked bench internals synthesize
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
>   Average synthesis took: 65.759 usec (+- 0.063 usec)
>   Average num. events: 30.000 (+- 0.000)
>   Average time per event 2.192 usec
>   Average data synthesis took: 73.840 usec (+- 0.080 usec)
>   Average num. events: 153.000 (+- 0.000)
>   Average time per event 0.483 usec
>
> # ./perf-filebacked bench internals synthesize
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
>   Average synthesis took: 66.245 usec (+- 0.059 usec)
>   Average num. events: 30.000 (+- 0.000)
>   Average time per event 2.208 usec
>   Average data synthesis took: 70.627 usec (+- 0.074 usec)
>   Average num. events: 153.000 (+- 0.000)
>   Average time per event 0.462 usec
>
> # ./perf-filebacked bench internals synthesize --mt -M 8
> # Running 'internals/synthesize' benchmark:
> Computing performance of multi threaded perf event synthesis by
> synthesizing events on CPU 0:
>   Number of synthesis threads: 1
>     Average synthesis took: 33477.500 usec (+- 556.102 usec)
>     Average num. events: 10125.700 (+- 1.620)
>     Average time per event 3.306 usec
>   Number of synthesis threads: 2
>     Average synthesis took: 30473.700 usec (+- 221.933 usec)
>     Average num. events: 10127.000 (+- 0.000)
>     Average time per event 3.009 usec
>   Number of synthesis threads: 3
>     Average synthesis took: 29775.200 usec (+- 315.212 usec)
>     Average num. events: 10128.700 (+- 0.667)
>     Average time per event 2.940 usec
>   Number of synthesis threads: 4
>     Average synthesis took: 29477.100 usec (+- 621.258 usec)
>     Average num. events: 10129.000 (+- 0.000)
>     Average time per event 2.910 usec
>   Number of synthesis threads: 5
>     Average synthesis took: 29777.900 usec (+- 294.710 usec)
>     Average num. events: 10144.700 (+- 11.597)
>     Average time per event 2.935 usec
>   Number of synthesis threads: 6
>     Average synthesis took: 27774.700 usec (+- 357.569 usec)
>     Average num. events: 10158.500 (+- 14.710)
>     Average time per event 2.734 usec
>   Number of synthesis threads: 7
>     Average synthesis took: 27437.200 usec (+- 233.626 usec)
>     Average num. events: 10135.700 (+- 2.700)
>     Average time per event 2.707 usec
>   Number of synthesis threads: 8
>     Average synthesis took: 28784.600 usec (+- 477.630 usec)
>     Average num. events: 10133.000 (+- 0.000)
>     Average time per event 2.841 usec
>
> >   [0] https://github.com/anakryiko/linux/commit/0841fe675ed30f5605c5b228e18f5612ea253b35
> >
> > >
> > > Thanks,
> > > Ian
> > >
> > > > > Thanks,
> > > > > Ian
> > > > >
> > > > > >   [0] https://github.com/libbpf/blazesym/blob/ee9b48a80c0b4499118a1e8e5d901cddb2b33ab1/src/normalize/user.rs#L193
> > > > > >
> > > > > > > thanks,
> > > > > > >
> > > > > > > greg k-h
> > > > > >
>
Andrii Nakryiko May 7, 2024, 10:56 p.m. UTC | #17
On Tue, May 7, 2024 at 3:27 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Tue, May 7, 2024 at 10:29 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Mon, May 6, 2024 at 10:06 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Mon, May 6, 2024 at 11:43 AM Ian Rogers <irogers@google.com> wrote:
> > > >
> > > > On Mon, May 6, 2024 at 11:32 AM Andrii Nakryiko
> > > > <andrii.nakryiko@gmail.com> wrote:
> > > > >
> > > > > On Sat, May 4, 2024 at 10:09 PM Ian Rogers <irogers@google.com> wrote:
> > > > > >
> > > > > > On Sat, May 4, 2024 at 2:57 PM Andrii Nakryiko
> > > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > > >
> > > > > > > On Sat, May 4, 2024 at 8:29 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > > > > > >
> > > > > > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > > > > > > > Implement a simple tool/benchmark for comparing address "resolution"
> > > > > > > > > logic based on textual /proc/<pid>/maps interface and new binary
> > > > > > > > > ioctl-based PROCFS_PROCMAP_QUERY command.
> > > > > > > >
> > > > > > > > Of course an artificial benchmark of "read a whole file" vs. "a tiny
> > > > > > > > ioctl" is going to be different, but step back and show how this is
> > > > > > > > going to be used in the real world overall.  Pounding on this file is
> > > > > > > > not a normal operation, right?
> > > > > > > >
> > > > > > >
> > > > > > > It's not artificial at all. It's *exactly* what, say, blazesym library
> > > > > > > is doing (see [0], it's Rust and part of the overall library API, I
> > > > > > > think C code in this patch is way easier to follow for someone not
> > > > > > > familiar with implementation of blazesym, but both implementations are
> > > > > > > doing exactly the same sequence of steps). You can do it even less
> > > > > > > efficiently by parsing the whole file, building an in-memory lookup
> > > > > > > table, then looking up addresses one by one. But that's even slower
> > > > > > > and more memory-hungry. So I didn't even bother implementing that, it
> > > > > > > would put /proc/<pid>/maps at even more disadvantage.
> > > > > > >
> > > > > > > Other applications that deal with stack traces (including perf) would
> > > > > > > be doing one of those two approaches, depending on circumstances and
> > > > > > > level of sophistication of code (and sensitivity to performance).
> > > > > >
> > > > > > The code in perf doing this is here:
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/synthetic-events.c#n440
> > > > > > The code is using the api/io.h code:
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/api/io.h
> > > > > > Using perf to profile perf it was observed time was spent allocating
> > > > > > buffers and locale related activities when using stdio, so io is a
> > > > > > lighter weight alternative, albeit with more verbose code than fscanf.
> > > > > > You could add this as an alternate /proc/<pid>/maps reader, we have a
> > > > > > similar benchmark in `perf bench internals synthesize`.
> > > > > >
> > > > >
> > > > > If I add a new implementation using this ioctl() into
> > > > > perf_event__synthesize_mmap_events(), will it be tested from this
> > > > > `perf bench internals synthesize`? I'm not too familiar with perf code
> > > > > organization, sorry if it's a stupid question. If not, where exactly
> > > > > is the code that would be triggered from benchmark?
> > > >
> > > > Yes it would be triggered :-)
> > >
> > > Ok, I don't exactly know how to interpret the results (and what the
> > > benchmark is doing), but numbers don't seem to be worse. They actually
> > > seem to be a bit better.
> > >
> > > I pushed my code that adds perf integration to [0]. That commit has
> > > results, but I'll post them here (with invocation parameters).
> > > perf-ioctl is the version with ioctl()-based implementation,
> > > perf-parse is, logically, text-parsing version. Here are the results
> > > (and see my notes below the results as well):
> > >
> > > TEXT-BASED
> > > ==========
> > >
> > > # ./perf-parse bench internals synthesize
> > > # Running 'internals/synthesize' benchmark:
> > > Computing performance of single threaded perf event synthesis by
> > > synthesizing events on the perf process itself:
> > >   Average synthesis took: 80.311 usec (+- 0.077 usec)
> > >   Average num. events: 32.000 (+- 0.000)
> > >   Average time per event 2.510 usec
> > >   Average data synthesis took: 84.429 usec (+- 0.066 usec)
> > >   Average num. events: 179.000 (+- 0.000)
> > >   Average time per event 0.472 usec
> > >
> > > # ./perf-parse bench internals synthesize
> > > # Running 'internals/synthesize' benchmark:
> > > Computing performance of single threaded perf event synthesis by
> > > synthesizing events on the perf process itself:
> > >   Average synthesis took: 79.900 usec (+- 0.077 usec)
> > >   Average num. events: 32.000 (+- 0.000)
> > >   Average time per event 2.497 usec
> > >   Average data synthesis took: 84.832 usec (+- 0.074 usec)
> > >   Average num. events: 180.000 (+- 0.000)
> > >   Average time per event 0.471 usec
> > >
> > > # ./perf-parse bench internals synthesize --mt -M 8
> > > # Running 'internals/synthesize' benchmark:
> > > Computing performance of multi threaded perf event synthesis by
> > > synthesizing events on CPU 0:
> > >   Number of synthesis threads: 1
> > >     Average synthesis took: 36338.100 usec (+- 406.091 usec)
> > >     Average num. events: 14091.300 (+- 7.433)
> > >     Average time per event 2.579 usec
> > >   Number of synthesis threads: 2
> > >     Average synthesis took: 37071.200 usec (+- 746.498 usec)
> > >     Average num. events: 14085.900 (+- 1.900)
> > >     Average time per event 2.632 usec
> > >   Number of synthesis threads: 3
> > >     Average synthesis took: 33932.300 usec (+- 626.861 usec)
> > >     Average num. events: 14085.900 (+- 1.900)
> > >     Average time per event 2.409 usec
> > >   Number of synthesis threads: 4
> > >     Average synthesis took: 33822.700 usec (+- 506.290 usec)
> > >     Average num. events: 14099.200 (+- 8.761)
> > >     Average time per event 2.399 usec
> > >   Number of synthesis threads: 5
> > >     Average synthesis took: 33348.200 usec (+- 389.771 usec)
> > >     Average num. events: 14085.900 (+- 1.900)
> > >     Average time per event 2.367 usec
> > >   Number of synthesis threads: 6
> > >     Average synthesis took: 33269.600 usec (+- 350.341 usec)
> > >     Average num. events: 14084.000 (+- 0.000)
> > >     Average time per event 2.362 usec
> > >   Number of synthesis threads: 7
> > >     Average synthesis took: 32663.900 usec (+- 338.870 usec)
> > >     Average num. events: 14085.900 (+- 1.900)
> > >     Average time per event 2.319 usec
> > >   Number of synthesis threads: 8
> > >     Average synthesis took: 32748.400 usec (+- 285.450 usec)
> > >     Average num. events: 14085.900 (+- 1.900)
> > >     Average time per event 2.325 usec
> > >
> > > IOCTL-BASED
> > > ===========
> > > # ./perf-ioctl bench internals synthesize
> > > # Running 'internals/synthesize' benchmark:
> > > Computing performance of single threaded perf event synthesis by
> > > synthesizing events on the perf process itself:
> > >   Average synthesis took: 72.996 usec (+- 0.076 usec)
> > >   Average num. events: 31.000 (+- 0.000)
> > >   Average time per event 2.355 usec
> > >   Average data synthesis took: 79.067 usec (+- 0.074 usec)
> > >   Average num. events: 178.000 (+- 0.000)
> > >   Average time per event 0.444 usec
> > >
> > > # ./perf-ioctl bench internals synthesize
> > > # Running 'internals/synthesize' benchmark:
> > > Computing performance of single threaded perf event synthesis by
> > > synthesizing events on the perf process itself:
> > >   Average synthesis took: 73.921 usec (+- 0.073 usec)
> > >   Average num. events: 31.000 (+- 0.000)
> > >   Average time per event 2.385 usec
> > >   Average data synthesis took: 80.545 usec (+- 0.070 usec)
> > >   Average num. events: 178.000 (+- 0.000)
> > >   Average time per event 0.453 usec
> > >
> > > # ./perf-ioctl bench internals synthesize --mt -M 8
> > > # Running 'internals/synthesize' benchmark:
> > > Computing performance of multi threaded perf event synthesis by
> > > synthesizing events on CPU 0:
> > >   Number of synthesis threads: 1
> > >     Average synthesis took: 35609.500 usec (+- 428.576 usec)
> > >     Average num. events: 14040.700 (+- 1.700)
> > >     Average time per event 2.536 usec
> > >   Number of synthesis threads: 2
> > >     Average synthesis took: 34293.800 usec (+- 453.811 usec)
> > >     Average num. events: 14040.700 (+- 1.700)
> > >     Average time per event 2.442 usec
> > >   Number of synthesis threads: 3
> > >     Average synthesis took: 32385.200 usec (+- 363.106 usec)
> > >     Average num. events: 14040.700 (+- 1.700)
> > >     Average time per event 2.307 usec
> > >   Number of synthesis threads: 4
> > >     Average synthesis took: 33113.100 usec (+- 553.931 usec)
> > >     Average num. events: 14054.500 (+- 11.469)
> > >     Average time per event 2.356 usec
> > >   Number of synthesis threads: 5
> > >     Average synthesis took: 31600.600 usec (+- 297.349 usec)
> > >     Average num. events: 14012.500 (+- 4.590)
> > >     Average time per event 2.255 usec
> > >   Number of synthesis threads: 6
> > >     Average synthesis took: 32309.900 usec (+- 472.225 usec)
> > >     Average num. events: 14004.000 (+- 0.000)
> > >     Average time per event 2.307 usec
> > >   Number of synthesis threads: 7
> > >     Average synthesis took: 31400.100 usec (+- 206.261 usec)
> > >     Average num. events: 14004.800 (+- 0.800)
> > >     Average time per event 2.242 usec
> > >   Number of synthesis threads: 8
> > >     Average synthesis took: 31601.400 usec (+- 303.350 usec)
> > >     Average num. events: 14005.700 (+- 1.700)
> > >     Average time per event 2.256 usec
> > >
> > > I also double-checked (using strace) that it does what it is supposed
> > > to do, and it seems like everything checks out. Here's text-based
> > > strace log:
> > >
> > > openat(AT_FDCWD, "/proc/35876/task/35876/maps", O_RDONLY) = 3
> > > read(3, "00400000-0040c000 r--p 00000000 "..., 8192) = 3997
> > > read(3, "7f519d4d3000-7f519d516000 r--p 0"..., 8192) = 4025
> > > read(3, "7f519dc3d000-7f519dc44000 r-xp 0"..., 8192) = 4048
> > > read(3, "7f519dd2d000-7f519dd2f000 r--p 0"..., 8192) = 4017
> > > read(3, "7f519dff6000-7f519dff8000 r--p 0"..., 8192) = 2744
> > > read(3, "", 8192)                       = 0
> > > close(3)                                = 0
> > >
> > >
> > > BTW, note how the kernel doesn't serve more than 4KB of data, even
> > > though perf provides 8KB buffer (that's to Greg's question about
> > > optimizing using bigger buffers, I suspect without seq_file changes,
> > > it won't work).
> > >
> > > And here's an abbreviated log for ioctl version, it has lots more (but
> > > much faster) ioctl() syscalls, given it dumps everything:
> > >
> > > openat(AT_FDCWD, "/proc/36380/task/36380/maps", O_RDONLY) = 3
> > > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
> > > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
> > >
> > >  ... 195 ioctl() calls in total ...
> > >
> > > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
> > > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
> > > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
> > > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50)
> > > = -1 ENOENT (No such file or directory)
> > > close(3)                                = 0
> > >
> > >
> > > So, it's not the optimal usage of this API, and yet it's still better
> > > (or at least not worse) than text-based API.
>
> It's surprising that more ioctl is cheaper than less read and parse.

I encourage you to try this locally, just in case I missed something
([0]). But it does seem this way. I have mitigations and retpoline
off, so syscall switch is pretty fast (under 0.5 microsecond).

  [0] https://github.com/anakryiko/linux/tree/procfs-proc-maps-ioctl
>
> > >
> >
> > In another reply to Arnaldo on patch #2 I mentioned the idea of
> > allowing to iterate only file-backed VMAs (as it seems like what
> > symbolizers would only care about, but I might be wrong here). So I
>
> Yep, I think it's enough to get file-backed VMAs only.
>

Ok, I guess I'll keep this functionality for v2 then, it's a pretty
trivial extension to existing logic.

>
> > tried that quickly, given it's a trivial addition to my code. See
> > results below (it is slightly faster, but not much, because most of
> > VMAs in that benchmark seem to be indeed file-backed anyways), just
> > for completeness. I'm not sure if that would be useful/used by perf,
> > so please let me know.
>
> Thanks for doing this.  It'd be useful as it provides better synthesizing
> performance.  The startup latency of perf record is a problem, I need
> to take a look if it can be improved more.
>
> Thanks,
> Namhyung
>
>
> >
> > As I mentioned above, it's not radically faster in this perf
> > benchmark, because we still request about 170 VMAs (vs ~195 if we
> > iterate *all* of them), so not a big change. The ratio will vary
> > depending on what the process is doing, of course. Anyways, just for
> > completeness, I'm not sure if I have to add this "filter" to the
> > actual implementation.
> >
> > # ./perf-filebacked bench internals synthesize
> > # Running 'internals/synthesize' benchmark:
> > Computing performance of single threaded perf event synthesis by
> > synthesizing events on the perf process itself:
> >   Average synthesis took: 65.759 usec (+- 0.063 usec)
> >   Average num. events: 30.000 (+- 0.000)
> >   Average time per event 2.192 usec
> >   Average data synthesis took: 73.840 usec (+- 0.080 usec)
> >   Average num. events: 153.000 (+- 0.000)
> >   Average time per event 0.483 usec
> >
> > # ./perf-filebacked bench internals synthesize
> > # Running 'internals/synthesize' benchmark:
> > Computing performance of single threaded perf event synthesis by
> > synthesizing events on the perf process itself:
> >   Average synthesis took: 66.245 usec (+- 0.059 usec)
> >   Average num. events: 30.000 (+- 0.000)
> >   Average time per event 2.208 usec
> >   Average data synthesis took: 70.627 usec (+- 0.074 usec)
> >   Average num. events: 153.000 (+- 0.000)
> >   Average time per event 0.462 usec
> >
> > # ./perf-filebacked bench internals synthesize --mt -M 8
> > # Running 'internals/synthesize' benchmark:
> > Computing performance of multi threaded perf event synthesis by
> > synthesizing events on CPU 0:
> >   Number of synthesis threads: 1
> >     Average synthesis took: 33477.500 usec (+- 556.102 usec)
> >     Average num. events: 10125.700 (+- 1.620)
> >     Average time per event 3.306 usec
> >   Number of synthesis threads: 2
> >     Average synthesis took: 30473.700 usec (+- 221.933 usec)
> >     Average num. events: 10127.000 (+- 0.000)
> >     Average time per event 3.009 usec
> >   Number of synthesis threads: 3
> >     Average synthesis took: 29775.200 usec (+- 315.212 usec)
> >     Average num. events: 10128.700 (+- 0.667)
> >     Average time per event 2.940 usec
> >   Number of synthesis threads: 4
> >     Average synthesis took: 29477.100 usec (+- 621.258 usec)
> >     Average num. events: 10129.000 (+- 0.000)
> >     Average time per event 2.910 usec
> >   Number of synthesis threads: 5
> >     Average synthesis took: 29777.900 usec (+- 294.710 usec)
> >     Average num. events: 10144.700 (+- 11.597)
> >     Average time per event 2.935 usec
> >   Number of synthesis threads: 6
> >     Average synthesis took: 27774.700 usec (+- 357.569 usec)
> >     Average num. events: 10158.500 (+- 14.710)
> >     Average time per event 2.734 usec
> >   Number of synthesis threads: 7
> >     Average synthesis took: 27437.200 usec (+- 233.626 usec)
> >     Average num. events: 10135.700 (+- 2.700)
> >     Average time per event 2.707 usec
> >   Number of synthesis threads: 8
> >     Average synthesis took: 28784.600 usec (+- 477.630 usec)
> >     Average num. events: 10133.000 (+- 0.000)
> >     Average time per event 2.841 usec
> >
> > >   [0] https://github.com/anakryiko/linux/commit/0841fe675ed30f5605c5b228e18f5612ea253b35
> > >
> > > >
> > > > Thanks,
> > > > Ian
> > > >
> > > > > > Thanks,
> > > > > > Ian
> > > > > >
> > > > > > >   [0] https://github.com/libbpf/blazesym/blob/ee9b48a80c0b4499118a1e8e5d901cddb2b33ab1/src/normalize/user.rs#L193
> > > > > > >
> > > > > > > > thanks,
> > > > > > > >
> > > > > > > > greg k-h
> > > > > > >
> >
Arnaldo Carvalho de Melo May 8, 2024, 12:36 a.m. UTC | #18
On Tue, May 07, 2024 at 03:56:40PM -0700, Andrii Nakryiko wrote:
> On Tue, May 7, 2024 at 3:27 PM Namhyung Kim <namhyung@kernel.org> wrote:
> > On Tue, May 7, 2024 at 10:29 AM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> > > In another reply to Arnaldo on patch #2 I mentioned the idea of
> > > allowing to iterate only file-backed VMAs (as it seems like what
> > > symbolizers would only care about, but I might be wrong here). So I

> > Yep, I think it's enough to get file-backed VMAs only.
 
> Ok, I guess I'll keep this functionality for v2 then, it's a pretty
> trivial extension to existing logic.

Maps for JITed code, for isntance, aren't backed by files:

commit 578c03c86fadcc6fd7319ddf41dd4d1d88aab77a
Author: Namhyung Kim <namhyung@kernel.org>
Date:   Thu Jan 16 10:49:31 2014 +0900

    perf symbols: Fix JIT symbol resolution on heap
    
    Gaurav reported that perf cannot profile JIT program if it executes the
    code on heap.  This was because current map__new() only handle JIT on
    anon mappings - extends it to handle no_dso (heap, stack) case too.
    
    This patch assumes JIT profiling only provides dynamic function symbols
    so check the mapping type to distinguish the case.  It'd provide no
    symbols for data mapping - if we need to support symbols on data
    mappings later it should be changed.
    
    Reported-by: Gaurav Jain <gjain@fb.com>
    Signed-off-by: Namhyung Kim <namhyung@kernel.org>
    Tested-by: Gaurav Jain <gjain@fb.com>

⬢[acme@toolbox perf-tools-next]$ git show 89365e6c9ad4c0e090e4c6a4b67a3ce319381d89
commit 89365e6c9ad4c0e090e4c6a4b67a3ce319381d89
Author: Andi Kleen <ak@linux.intel.com>
Date:   Wed Apr 24 17:03:02 2013 -0700

    perf tools: Handle JITed code in shared memory
    
    Need to check for /dev/zero.
    
    Most likely more strings are missing too.
    
    Signed-off-by: Andi Kleen <ak@linux.intel.com>
    Link: http://lkml.kernel.org/r/1366848182-30449-1-git-send-email-andi@firstfloor.org
    Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

diff --git a/tools/perf/util/map.c b/tools/perf/util/map.c
index 6fcb9de623401b8a..8bcdf9e54089acaf 100644
--- a/tools/perf/util/map.c
+++ b/tools/perf/util/map.c
@@ -21,6 +21,7 @@ const char *map_type__name[MAP__NR_TYPES] = {
 static inline int is_anon_memory(const char *filename)
 {
        return !strcmp(filename, "//anon") ||
+              !strcmp(filename, "/dev/zero (deleted)") ||
               !strcmp(filename, "/anon_hugepage (deleted)");
 }

etc.

- Arnaldo
Liam R. Howlett May 8, 2024, 1:20 a.m. UTC | #19
* Andrii Nakryiko <andrii.nakryiko@gmail.com> [240507 15:01]:
> On Tue, May 7, 2024 at 11:06 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
...
> > >
> > > As for the mmap_read_lock_killable() (is that what we are talking
> > > about?), I'm happy to use anything else available, please give me a
> > > pointer. But I suspect given how fast and small this new API is,
> > > mmap_read_lock_killable() in it is not comparable to holding it for
> > > producing /proc/<pid>/maps contents.
> >
> > Yes, mmap_read_lock_killable() is the mmap lock (formally known as the
> > mmap sem).
> >
> > You can see examples of avoiding the mmap lock by use of rcu in
> > mm/memory.c lock_vma_under_rcu() which is used in the fault path.
> > userfaultfd has an example as well. But again, remember that not all
> > archs have this functionality, so you'd need to fall back to full mmap
> > locking.
> 
> Thanks for the pointer (didn't see email when replying on the other thread).
> 
> I looked at lock_vma_under_rcu() quickly, and seems like it's designed
> to find VMA that covers given address, but not the next closest one.
> So it's a bit problematic for the API I'm adding, as
> PROCFS_PROCMAP_EXACT_OR_NEXT_VMA (which I can rename to
> COVERING_OR_NEXT_VMA, if necessary), is quite important for the use
> cases we have. But maybe some variation of lock_vma_under_rcu() can be
> added that would fit this case?

Yes, as long as we have the rcu read lock, we can use the same
vma_next() calls you use today.  We will have to be careful not to use
the vma while it's being altered, but per-vma locking should provide
that functionality for you.

> 
> >
> > Certainly a single lookup and copy will be faster than a 4k buffer
> > filling copy, but you will be walking the tree O(n) times, where n is
> > the vma count.  This isn't as efficient as multiple lookups in a row as
> > we will re-walk from the top of the tree. You will also need to contend
> > with the fact that the chance of the vmas changing between calls is much
> > higher here too - if that's an issue. Neither of these issues go away
> > with use of the rcu locking instead of the mmap lock, but we can be
> > quite certain that we won't cause locking contention.
> 
> You are right about O(n) times, but note that for symbolization cases
> I'm describing, this n will be, generally, *much* smaller than a total
> number of VMAs within the process. It's a huge speed up in practice.
> This is because we pre-sort addresses in user-space, and then we query
> VMA for the first address, but then we quickly skip all the other
> addresses that are already covered by this VMA, and so the next
> request will query a new VMA that covers another subset of addresses.
> This way we'll get the minimal number of VMAs that cover captured
> addresses (which in the case of stack traces would be a few VMAs
> belonging to executable sections of process' binary plus a bunch of
> shared libraries).

This also implies you won't have to worry about shifting addresses?  I'd
think that the reference to the mm means none of these are going to be
changing at the point of the calls (not exiting).

Given your usecase, I'm surprised you're looking for the next vma at
all.

Thanks,
Liam
diff mbox series

Patch

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index f1aebabfb017..7eaa8f417278 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -45,6 +45,7 @@  test_cpp
 /veristat
 /sign-file
 /uprobe_multi
+/procfs_query
 *.ko
 *.tmp
 xskxceiver
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index ba28d42b74db..07e17bb89767 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -131,7 +131,7 @@  TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
 	flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
 	test_lirc_mode2_user xdping test_cpp runqslower bench bpf_testmod.ko \
 	xskxceiver xdp_redirect_multi xdp_synproxy veristat xdp_hw_metadata \
-	xdp_features bpf_test_no_cfi.ko
+	xdp_features bpf_test_no_cfi.ko procfs_query
 
 TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi
 
diff --git a/tools/testing/selftests/bpf/procfs_query.c b/tools/testing/selftests/bpf/procfs_query.c
new file mode 100644
index 000000000000..8ca3978244ad
--- /dev/null
+++ b/tools/testing/selftests/bpf/procfs_query.c
@@ -0,0 +1,366 @@ 
+// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
+#include <argp.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdlib.h>
+#include <time.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <sys/ioctl.h>
+#include <linux/fs.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <time.h>
+
+static bool verbose;
+static bool quiet;
+static bool use_ioctl;
+static bool request_build_id;
+static char *addrs_path;
+static int pid;
+static int bench_runs;
+
+const char *argp_program_version = "procfs_query 0.0";
+const char *argp_program_bug_address = "<bpf@vger.kernel.org>";
+
+static inline uint64_t get_time_ns(void)
+{
+	struct timespec t;
+
+	clock_gettime(CLOCK_MONOTONIC, &t);
+
+	return (uint64_t)t.tv_sec * 1000000000 + t.tv_nsec;
+}
+
+static const struct argp_option opts[] = {
+	{ "verbose", 'v', NULL, 0, "Verbose mode" },
+	{ "quiet", 'q', NULL, 0, "Quiet mode (no output)" },
+	{ "pid", 'p', "PID", 0, "PID of the process" },
+	{ "addrs-path", 'f', "PATH", 0, "File with addresses to resolve" },
+	{ "benchmark", 'B', "RUNS", 0, "Benchmark mode" },
+	{ "query", 'Q', NULL, 0, "Use ioctl()-based point query API (by default text parsing is done)" },
+	{ "build-id", 'b', NULL, 0, "Fetch build ID, if available (only for ioctl mode)" },
+	{},
+};
+
+static error_t parse_arg(int key, char *arg, struct argp_state *state)
+{
+	switch (key) {
+	case 'v':
+		verbose = true;
+		break;
+	case 'q':
+		quiet = true;
+		break;
+	case 'i':
+		use_ioctl = true;
+		break;
+	case 'b':
+		request_build_id = true;
+		break;
+	case 'p':
+		pid = strtol(arg, NULL, 10);
+		break;
+	case 'f':
+		addrs_path = strdup(arg);
+		break;
+	case 'B':
+		bench_runs = strtol(arg, NULL, 10);
+		if (bench_runs <= 0) {
+			fprintf(stderr, "Invalid benchmark run count: %s\n", arg);
+			return -EINVAL;
+		}
+		break;
+	case ARGP_KEY_ARG:
+		argp_usage(state);
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+	return 0;
+}
+
+static const struct argp argp = {
+	.options = opts,
+	.parser = parse_arg,
+};
+
+struct addr {
+	unsigned long long addr;
+	int idx;
+};
+
+static struct addr *addrs;
+static size_t addr_cnt, addr_cap;
+
+struct resolved_addr {
+	unsigned long long file_off;
+	const char *vma_name;
+	int build_id_sz;
+	char build_id[20];
+};
+
+static struct resolved_addr *resolved;
+
+static int resolve_addrs_ioctl(void)
+{
+	char buf[32], build_id_buf[20], vma_name[PATH_MAX];
+	struct procfs_procmap_query q;
+	int fd, err, i;
+	struct addr *a = &addrs[0];
+	struct resolved_addr *r;
+
+	snprintf(buf, sizeof(buf), "/proc/%d/maps", pid);
+	fd = open(buf, O_RDONLY);
+	if (fd < 0) {
+		err = -errno;
+		fprintf(stderr, "Failed to open process map file (%s): %d\n", buf, err);
+		return err;
+	}
+
+	memset(&q, 0, sizeof(q));
+	q.size = sizeof(q);
+	q.query_flags = PROCFS_PROCMAP_EXACT_OR_NEXT_VMA;
+	q.vma_name_addr = (__u64)vma_name;
+	if (request_build_id)
+		q.build_id_addr = (__u64)build_id_buf;
+
+	for (i = 0; i < addr_cnt; ) {
+		char *name = NULL;
+
+		q.query_addr = (__u64)a->addr;
+		q.vma_name_size = sizeof(vma_name);
+		if (request_build_id)
+			q.build_id_size = sizeof(build_id_buf);
+
+		err = ioctl(fd, PROCFS_PROCMAP_QUERY, &q);
+		if (err < 0 && errno == ENOTTY) {
+			close(fd);
+			fprintf(stderr, "PROCFS_PROCMAP_QUERY ioctl() command is not supported on this kernel!\n");
+			return -EOPNOTSUPP; /* ioctl() not implemented yet */
+		}
+		if (err < 0 && errno == ENOENT) {
+			fprintf(stderr, "ENOENT\n");
+			i++;
+			a++;
+			continue; /* unresolved address */
+		}
+		if (err < 0) {
+			err = -errno;
+			close(fd);
+			fprintf(stderr, "PROCFS_PROCMAP_QUERY ioctl() returned error: %d\n", err);
+			return err;
+		}
+
+		/* skip addrs falling before current VMA */
+		for (; i < addr_cnt && a->addr < q.vma_start; i++, a++) {
+		}
+		/* process addrs covered by current VMA */
+		for (; i < addr_cnt && a->addr < q.vma_end; i++, a++) {
+			r = &resolved[a->idx];
+			r->file_off = a->addr - q.vma_start + q.vma_offset;
+
+			/* reuse name, if it was already strdup()'ed */
+			if (q.vma_name_size)
+				name = name ?: strdup(vma_name);
+			r->vma_name = name;
+
+			if (q.build_id_size) {
+				r->build_id_sz = q.build_id_size;
+				memcpy(r->build_id, build_id_buf, q.build_id_size);
+			}
+		}
+	}
+
+	close(fd);
+	return 0;
+}
+
+static int resolve_addrs_parse(void)
+{
+	size_t vma_start, vma_end, vma_offset, ino;
+	uint32_t dev_major, dev_minor;
+	char perms[4], buf[32], vma_name[PATH_MAX];
+	FILE *f;
+	int err, idx = 0;
+	struct addr *a = &addrs[idx];
+	struct resolved_addr *r;
+
+	snprintf(buf, sizeof(buf), "/proc/%d/maps", pid);
+	f = fopen(buf, "r");
+	if (!f) {
+		err = -errno;
+		fprintf(stderr, "Failed to open process map file (%s): %d\n", buf, err);
+		return err;
+	}
+
+	while ((err = fscanf(f, "%zx-%zx %c%c%c%c %zx %x:%x %zu %[^\n]\n",
+			     &vma_start, &vma_end,
+			     &perms[0], &perms[1], &perms[2], &perms[3],
+			     &vma_offset, &dev_major, &dev_minor, &ino, vma_name)) >= 10) {
+		const char *name = NULL;
+
+		/* skip addrs before current vma, they stay unresolved */
+		for (; idx < addr_cnt && a->addr < vma_start; idx++, a++) {
+		}
+
+		/* resolve all addrs within current vma now */
+		for (; idx < addr_cnt && a->addr < vma_end; idx++, a++) {
+			r = &resolved[a->idx];
+			r->file_off = a->addr - vma_start + vma_offset;
+
+			/* reuse name, if it was already strdup()'ed */
+			if (err > 10)
+				name = name ?: strdup(vma_name);
+			else
+				name = NULL;
+			r->vma_name = name;
+		}
+
+		/* ran out of addrs to resolve, stop early */
+		if (idx >= addr_cnt)
+			break;
+	}
+
+	fclose(f);
+	return 0;
+}
+
+static int cmp_by_addr(const void *a, const void *b)
+{
+	const struct addr *x = a, *y = b;
+
+	if (x->addr != y->addr)
+		return x->addr < y->addr ? -1 : 1;
+	return x->idx < y->idx ? -1 : 1;
+}
+
+static int cmp_by_idx(const void *a, const void *b)
+{
+	const struct addr *x = a, *y = b;
+
+	return x->idx < y->idx ? -1 : 1;
+}
+
+int main(int argc, char **argv)
+{
+	FILE* f;
+	int err, i;
+	unsigned long long addr;
+	uint64_t start_ns;
+	double total_ns;
+
+	/* Parse command line arguments */
+	err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
+	if (err)
+		return err;
+
+	if (pid <= 0 || !addrs_path) {
+		fprintf(stderr, "Please provide PID and file with addresses to process!\n");
+		exit(1);
+	}
+
+	if (verbose) {
+		fprintf(stderr, "PID: %d\n", pid);
+		fprintf(stderr, "PATH: %s\n", addrs_path);
+	}
+
+	f = fopen(addrs_path, "r");
+	if (!f) {
+		err = -errno;
+		fprintf(stderr, "Failed to open '%s': %d\n", addrs_path, err);
+		goto out;
+	}
+
+	while ((err = fscanf(f, "%llx\n", &addr)) == 1) {
+		if (addr_cnt == addr_cap) {
+			addr_cap = addr_cap == 0 ? 16 : (addr_cap * 3 / 2);
+			addrs = realloc(addrs, sizeof(*addrs) * addr_cap);
+			memset(addrs + addr_cnt, 0, (addr_cap - addr_cnt) * sizeof(*addrs));
+		}
+
+		addrs[addr_cnt].addr = addr;
+		addrs[addr_cnt].idx = addr_cnt;
+
+		addr_cnt++;
+	}
+	if (verbose)
+		fprintf(stderr, "READ %zu addrs!\n", addr_cnt);
+	if (!feof(f)) {
+		fprintf(stderr, "Failure parsing full list of addresses at '%s'!\n", addrs_path);
+		err = -EINVAL;
+		fclose(f);
+		goto out;
+	}
+	fclose(f);
+	if (addr_cnt == 0) {
+		fprintf(stderr, "No addresses provided, bailing out!\n");
+		err = -ENOENT;
+		goto out;
+	}
+
+	resolved = calloc(addr_cnt, sizeof(*resolved));
+
+	qsort(addrs, addr_cnt, sizeof(*addrs), cmp_by_addr);
+	if (verbose) {
+		fprintf(stderr, "SORTED ADDRS (%zu):\n", addr_cnt);
+		for (i = 0; i < addr_cnt; i++) {
+			fprintf(stderr, "ADDR #%d: %#llx\n", addrs[i].idx, addrs[i].addr);
+		}
+	}
+
+	start_ns = get_time_ns();
+	for (i = bench_runs ?: 1; i > 0; i--) {
+		if (use_ioctl) {
+			err = resolve_addrs_ioctl();
+		} else {
+			err = resolve_addrs_parse();
+		}
+		if (err) {
+			fprintf(stderr, "Failed to resolve addrs: %d!\n", err);
+			goto out;
+		}
+	}
+	total_ns = get_time_ns() - start_ns;
+
+	if (bench_runs) {
+		fprintf(stderr, "BENCHMARK MODE. RUNS: %d TOTAL TIME (ms): %.3lf TIME/RUN (ms): %.3lf TIME/ADDR (us): %.3lf\n",
+			bench_runs, total_ns / 1000000.0, total_ns / bench_runs / 1000000.0,
+			total_ns / bench_runs / addr_cnt / 1000.0);
+	}
+
+	/* sort them back into the original order */
+	qsort(addrs, addr_cnt, sizeof(*addrs), cmp_by_idx);
+
+	if (!quiet) {
+		printf("RESOLVED ADDRS (%zu):\n", addr_cnt);
+		for (i = 0; i < addr_cnt; i++) {
+			const struct addr *a = &addrs[i];
+			const struct resolved_addr *r = &resolved[a->idx];
+
+			if (r->file_off) {
+				printf("RESOLVED   #%d: %#llx -> OFF %#llx",
+					a->idx, a->addr, r->file_off);
+				if (r->vma_name)
+					printf(" NAME %s", r->vma_name);
+				if (r->build_id_sz) {
+					char build_id_str[41];
+					int j;
+
+					for (j = 0; j < r->build_id_sz; j++)
+						sprintf(&build_id_str[j * 2], "%02hhx", r->build_id[j]);
+					printf(" BUILDID %s", build_id_str);
+				}
+				printf("\n");
+			} else {
+				printf("UNRESOLVED #%d: %#llx\n", a->idx, a->addr);
+			}
+		}
+	}
+out:
+	free(addrs);
+	free(addrs_path);
+	free(resolved);
+
+	return err < 0 ? -err : 0;
+}