Message ID | 20240504003006.3303334-6-andrii@kernel.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | ioctl()-based API to query VMAs from /proc/<pid>/maps | expand |
On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote: > Implement a simple tool/benchmark for comparing address "resolution" > logic based on textual /proc/<pid>/maps interface and new binary > ioctl-based PROCFS_PROCMAP_QUERY command. Of course an artificial benchmark of "read a whole file" vs. "a tiny ioctl" is going to be different, but step back and show how this is going to be used in the real world overall. Pounding on this file is not a normal operation, right? thanks, greg k-h
On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote: > I also did an strace run of both cases. In text-based one the tool did > 68 read() syscalls, fetching up to 4KB of data in one go. Why not fetch more at once? And I have a fun 'readfile()' syscall implementation around here that needs justification to get merged (I try so every other year or so) that can do the open/read/close loop in one call, with the buffer size set by userspace if you really are saying this is a "hot path" that needs that kind of speedup. But in the end, io_uring usually is the proper api for that instead, why not use that here instead of slow open/read/close if you care about speed? > In comparison, > ioctl-based implementation had to do only 6 ioctl() calls to fetch all > relevant VMAs. > > It is projected that savings from processing big production applications > would only widen the gap in favor of binary-based querying ioctl API, as > bigger applications will tend to have even more non-executable VMA > mappings relative to executable ones. Define "bigger applications" please. Is this some "large database company workload" type of thing, or something else? thanks, greg k-h
On Sat, May 4, 2024 at 8:29 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote: > > Implement a simple tool/benchmark for comparing address "resolution" > > logic based on textual /proc/<pid>/maps interface and new binary > > ioctl-based PROCFS_PROCMAP_QUERY command. > > Of course an artificial benchmark of "read a whole file" vs. "a tiny > ioctl" is going to be different, but step back and show how this is > going to be used in the real world overall. Pounding on this file is > not a normal operation, right? > It's not artificial at all. It's *exactly* what, say, blazesym library is doing (see [0], it's Rust and part of the overall library API, I think C code in this patch is way easier to follow for someone not familiar with implementation of blazesym, but both implementations are doing exactly the same sequence of steps). You can do it even less efficiently by parsing the whole file, building an in-memory lookup table, then looking up addresses one by one. But that's even slower and more memory-hungry. So I didn't even bother implementing that, it would put /proc/<pid>/maps at even more disadvantage. Other applications that deal with stack traces (including perf) would be doing one of those two approaches, depending on circumstances and level of sophistication of code (and sensitivity to performance). [0] https://github.com/libbpf/blazesym/blob/ee9b48a80c0b4499118a1e8e5d901cddb2b33ab1/src/normalize/user.rs#L193 > thanks, > > greg k-h
On Sat, May 4, 2024 at 8:32 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote: > > I also did an strace run of both cases. In text-based one the tool did > > 68 read() syscalls, fetching up to 4KB of data in one go. > > Why not fetch more at once? > I didn't expect to be interrogated so much on the performance of the text parsing front, sorry. :) You can probably tune this, but where is the reasonable limit? 64KB? 256KB? 1MB? See below for some more production numbers. > And I have a fun 'readfile()' syscall implementation around here that > needs justification to get merged (I try so every other year or so) that > can do the open/read/close loop in one call, with the buffer size set by > userspace if you really are saying this is a "hot path" that needs that > kind of speedup. But in the end, io_uring usually is the proper api for > that instead, why not use that here instead of slow open/read/close if > you care about speed? > I'm not sure what I need to say here. I'm sure it will be useful, but as I already explained, it's not about the text file or not, it's about having to read too much information that's completely irrelevant. Again, see below for another data point. > > In comparison, > > ioctl-based implementation had to do only 6 ioctl() calls to fetch all > > relevant VMAs. > > > > It is projected that savings from processing big production applications > > would only widen the gap in favor of binary-based querying ioctl API, as > > bigger applications will tend to have even more non-executable VMA > > mappings relative to executable ones. > > Define "bigger applications" please. Is this some "large database > company workload" type of thing, or something else? I don't have a definition. But I had in mind, as one example, an ads-serving service we use internally (it's a pretty large application by pretty much any metric you can come up with). I just randomly picked one of the production hosts, found one instance of that service, and looked at its /proc/<pid>/maps file. Hopefully it will satisfy your need for specifics. # cat /proc/1126243/maps | wc -c 1570178 # cat /proc/1126243/maps | wc -l 28875 # cat /proc/1126243/maps | grep ' ..x. ' | wc -l 7347 You can see that maps file itself is about 1.5MB of text (which means single-shot reading of its entire contents is a bit unrealistic, though, sure, why not). The process contains 28875 VMAs, out of which only 7347 are executable. This means if we were to profile this process (and normally we profile entire system, so it's almost never single /proc/<pid>/maps file that needs to be open and processed), we'd need *at most* (absolute worst case!) 7347/28875 = 25.5% of entries. In reality, most code will be concentrated in a much smaller number of executable VMAs, of course. But no, I don't have specific numbers at hand, sorry. It matters less whether it's text or binary (though binary undoubtedly will be faster, it's strange to even argue about this), it's the ability to fetch only relevant VMAs that is the point here. > > thanks, > > greg k-h
On Sat, May 4, 2024 at 2:57 PM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > On Sat, May 4, 2024 at 8:29 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote: > > > Implement a simple tool/benchmark for comparing address "resolution" > > > logic based on textual /proc/<pid>/maps interface and new binary > > > ioctl-based PROCFS_PROCMAP_QUERY command. > > > > Of course an artificial benchmark of "read a whole file" vs. "a tiny > > ioctl" is going to be different, but step back and show how this is > > going to be used in the real world overall. Pounding on this file is > > not a normal operation, right? > > > > It's not artificial at all. It's *exactly* what, say, blazesym library > is doing (see [0], it's Rust and part of the overall library API, I > think C code in this patch is way easier to follow for someone not > familiar with implementation of blazesym, but both implementations are > doing exactly the same sequence of steps). You can do it even less > efficiently by parsing the whole file, building an in-memory lookup > table, then looking up addresses one by one. But that's even slower > and more memory-hungry. So I didn't even bother implementing that, it > would put /proc/<pid>/maps at even more disadvantage. > > Other applications that deal with stack traces (including perf) would > be doing one of those two approaches, depending on circumstances and > level of sophistication of code (and sensitivity to performance). The code in perf doing this is here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/synthetic-events.c#n440 The code is using the api/io.h code: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/api/io.h Using perf to profile perf it was observed time was spent allocating buffers and locale related activities when using stdio, so io is a lighter weight alternative, albeit with more verbose code than fscanf. You could add this as an alternate /proc/<pid>/maps reader, we have a similar benchmark in `perf bench internals synthesize`. Thanks, Ian > [0] https://github.com/libbpf/blazesym/blob/ee9b48a80c0b4499118a1e8e5d901cddb2b33ab1/src/normalize/user.rs#L193 > > > thanks, > > > > greg k-h >
On Sat, May 4, 2024 at 10:09 PM Ian Rogers <irogers@google.com> wrote: > > On Sat, May 4, 2024 at 2:57 PM Andrii Nakryiko > <andrii.nakryiko@gmail.com> wrote: > > > > On Sat, May 4, 2024 at 8:29 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote: > > > > Implement a simple tool/benchmark for comparing address "resolution" > > > > logic based on textual /proc/<pid>/maps interface and new binary > > > > ioctl-based PROCFS_PROCMAP_QUERY command. > > > > > > Of course an artificial benchmark of "read a whole file" vs. "a tiny > > > ioctl" is going to be different, but step back and show how this is > > > going to be used in the real world overall. Pounding on this file is > > > not a normal operation, right? > > > > > > > It's not artificial at all. It's *exactly* what, say, blazesym library > > is doing (see [0], it's Rust and part of the overall library API, I > > think C code in this patch is way easier to follow for someone not > > familiar with implementation of blazesym, but both implementations are > > doing exactly the same sequence of steps). You can do it even less > > efficiently by parsing the whole file, building an in-memory lookup > > table, then looking up addresses one by one. But that's even slower > > and more memory-hungry. So I didn't even bother implementing that, it > > would put /proc/<pid>/maps at even more disadvantage. > > > > Other applications that deal with stack traces (including perf) would > > be doing one of those two approaches, depending on circumstances and > > level of sophistication of code (and sensitivity to performance). > > The code in perf doing this is here: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/synthetic-events.c#n440 > The code is using the api/io.h code: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/api/io.h > Using perf to profile perf it was observed time was spent allocating > buffers and locale related activities when using stdio, so io is a > lighter weight alternative, albeit with more verbose code than fscanf. > You could add this as an alternate /proc/<pid>/maps reader, we have a > similar benchmark in `perf bench internals synthesize`. > If I add a new implementation using this ioctl() into perf_event__synthesize_mmap_events(), will it be tested from this `perf bench internals synthesize`? I'm not too familiar with perf code organization, sorry if it's a stupid question. If not, where exactly is the code that would be triggered from benchmark? > Thanks, > Ian > > > [0] https://github.com/libbpf/blazesym/blob/ee9b48a80c0b4499118a1e8e5d901cddb2b33ab1/src/normalize/user.rs#L193 > > > > > thanks, > > > > > > greg k-h > >
On Mon, May 6, 2024 at 11:32 AM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > On Sat, May 4, 2024 at 10:09 PM Ian Rogers <irogers@google.com> wrote: > > > > On Sat, May 4, 2024 at 2:57 PM Andrii Nakryiko > > <andrii.nakryiko@gmail.com> wrote: > > > > > > On Sat, May 4, 2024 at 8:29 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote: > > > > > Implement a simple tool/benchmark for comparing address "resolution" > > > > > logic based on textual /proc/<pid>/maps interface and new binary > > > > > ioctl-based PROCFS_PROCMAP_QUERY command. > > > > > > > > Of course an artificial benchmark of "read a whole file" vs. "a tiny > > > > ioctl" is going to be different, but step back and show how this is > > > > going to be used in the real world overall. Pounding on this file is > > > > not a normal operation, right? > > > > > > > > > > It's not artificial at all. It's *exactly* what, say, blazesym library > > > is doing (see [0], it's Rust and part of the overall library API, I > > > think C code in this patch is way easier to follow for someone not > > > familiar with implementation of blazesym, but both implementations are > > > doing exactly the same sequence of steps). You can do it even less > > > efficiently by parsing the whole file, building an in-memory lookup > > > table, then looking up addresses one by one. But that's even slower > > > and more memory-hungry. So I didn't even bother implementing that, it > > > would put /proc/<pid>/maps at even more disadvantage. > > > > > > Other applications that deal with stack traces (including perf) would > > > be doing one of those two approaches, depending on circumstances and > > > level of sophistication of code (and sensitivity to performance). > > > > The code in perf doing this is here: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/synthetic-events.c#n440 > > The code is using the api/io.h code: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/api/io.h > > Using perf to profile perf it was observed time was spent allocating > > buffers and locale related activities when using stdio, so io is a > > lighter weight alternative, albeit with more verbose code than fscanf. > > You could add this as an alternate /proc/<pid>/maps reader, we have a > > similar benchmark in `perf bench internals synthesize`. > > > > If I add a new implementation using this ioctl() into > perf_event__synthesize_mmap_events(), will it be tested from this > `perf bench internals synthesize`? I'm not too familiar with perf code > organization, sorry if it's a stupid question. If not, where exactly > is the code that would be triggered from benchmark? Yes it would be triggered :-) Thanks, Ian > > Thanks, > > Ian > > > > > [0] https://github.com/libbpf/blazesym/blob/ee9b48a80c0b4499118a1e8e5d901cddb2b33ab1/src/normalize/user.rs#L193 > > > > > > > thanks, > > > > > > > > greg k-h > > >
On Mon, May 6, 2024 at 11:43 AM Ian Rogers <irogers@google.com> wrote: > > On Mon, May 6, 2024 at 11:32 AM Andrii Nakryiko > <andrii.nakryiko@gmail.com> wrote: > > > > On Sat, May 4, 2024 at 10:09 PM Ian Rogers <irogers@google.com> wrote: > > > > > > On Sat, May 4, 2024 at 2:57 PM Andrii Nakryiko > > > <andrii.nakryiko@gmail.com> wrote: > > > > > > > > On Sat, May 4, 2024 at 8:29 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > > > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote: > > > > > > Implement a simple tool/benchmark for comparing address "resolution" > > > > > > logic based on textual /proc/<pid>/maps interface and new binary > > > > > > ioctl-based PROCFS_PROCMAP_QUERY command. > > > > > > > > > > Of course an artificial benchmark of "read a whole file" vs. "a tiny > > > > > ioctl" is going to be different, but step back and show how this is > > > > > going to be used in the real world overall. Pounding on this file is > > > > > not a normal operation, right? > > > > > > > > > > > > > It's not artificial at all. It's *exactly* what, say, blazesym library > > > > is doing (see [0], it's Rust and part of the overall library API, I > > > > think C code in this patch is way easier to follow for someone not > > > > familiar with implementation of blazesym, but both implementations are > > > > doing exactly the same sequence of steps). You can do it even less > > > > efficiently by parsing the whole file, building an in-memory lookup > > > > table, then looking up addresses one by one. But that's even slower > > > > and more memory-hungry. So I didn't even bother implementing that, it > > > > would put /proc/<pid>/maps at even more disadvantage. > > > > > > > > Other applications that deal with stack traces (including perf) would > > > > be doing one of those two approaches, depending on circumstances and > > > > level of sophistication of code (and sensitivity to performance). > > > > > > The code in perf doing this is here: > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/synthetic-events.c#n440 > > > The code is using the api/io.h code: > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/api/io.h > > > Using perf to profile perf it was observed time was spent allocating > > > buffers and locale related activities when using stdio, so io is a > > > lighter weight alternative, albeit with more verbose code than fscanf. > > > You could add this as an alternate /proc/<pid>/maps reader, we have a > > > similar benchmark in `perf bench internals synthesize`. > > > > > > > If I add a new implementation using this ioctl() into > > perf_event__synthesize_mmap_events(), will it be tested from this > > `perf bench internals synthesize`? I'm not too familiar with perf code > > organization, sorry if it's a stupid question. If not, where exactly > > is the code that would be triggered from benchmark? > > Yes it would be triggered :-) Ok, I don't exactly know how to interpret the results (and what the benchmark is doing), but numbers don't seem to be worse. They actually seem to be a bit better. I pushed my code that adds perf integration to [0]. That commit has results, but I'll post them here (with invocation parameters). perf-ioctl is the version with ioctl()-based implementation, perf-parse is, logically, text-parsing version. Here are the results (and see my notes below the results as well): TEXT-BASED ========== # ./perf-parse bench internals synthesize # Running 'internals/synthesize' benchmark: Computing performance of single threaded perf event synthesis by synthesizing events on the perf process itself: Average synthesis took: 80.311 usec (+- 0.077 usec) Average num. events: 32.000 (+- 0.000) Average time per event 2.510 usec Average data synthesis took: 84.429 usec (+- 0.066 usec) Average num. events: 179.000 (+- 0.000) Average time per event 0.472 usec # ./perf-parse bench internals synthesize # Running 'internals/synthesize' benchmark: Computing performance of single threaded perf event synthesis by synthesizing events on the perf process itself: Average synthesis took: 79.900 usec (+- 0.077 usec) Average num. events: 32.000 (+- 0.000) Average time per event 2.497 usec Average data synthesis took: 84.832 usec (+- 0.074 usec) Average num. events: 180.000 (+- 0.000) Average time per event 0.471 usec # ./perf-parse bench internals synthesize --mt -M 8 # Running 'internals/synthesize' benchmark: Computing performance of multi threaded perf event synthesis by synthesizing events on CPU 0: Number of synthesis threads: 1 Average synthesis took: 36338.100 usec (+- 406.091 usec) Average num. events: 14091.300 (+- 7.433) Average time per event 2.579 usec Number of synthesis threads: 2 Average synthesis took: 37071.200 usec (+- 746.498 usec) Average num. events: 14085.900 (+- 1.900) Average time per event 2.632 usec Number of synthesis threads: 3 Average synthesis took: 33932.300 usec (+- 626.861 usec) Average num. events: 14085.900 (+- 1.900) Average time per event 2.409 usec Number of synthesis threads: 4 Average synthesis took: 33822.700 usec (+- 506.290 usec) Average num. events: 14099.200 (+- 8.761) Average time per event 2.399 usec Number of synthesis threads: 5 Average synthesis took: 33348.200 usec (+- 389.771 usec) Average num. events: 14085.900 (+- 1.900) Average time per event 2.367 usec Number of synthesis threads: 6 Average synthesis took: 33269.600 usec (+- 350.341 usec) Average num. events: 14084.000 (+- 0.000) Average time per event 2.362 usec Number of synthesis threads: 7 Average synthesis took: 32663.900 usec (+- 338.870 usec) Average num. events: 14085.900 (+- 1.900) Average time per event 2.319 usec Number of synthesis threads: 8 Average synthesis took: 32748.400 usec (+- 285.450 usec) Average num. events: 14085.900 (+- 1.900) Average time per event 2.325 usec IOCTL-BASED =========== # ./perf-ioctl bench internals synthesize # Running 'internals/synthesize' benchmark: Computing performance of single threaded perf event synthesis by synthesizing events on the perf process itself: Average synthesis took: 72.996 usec (+- 0.076 usec) Average num. events: 31.000 (+- 0.000) Average time per event 2.355 usec Average data synthesis took: 79.067 usec (+- 0.074 usec) Average num. events: 178.000 (+- 0.000) Average time per event 0.444 usec # ./perf-ioctl bench internals synthesize # Running 'internals/synthesize' benchmark: Computing performance of single threaded perf event synthesis by synthesizing events on the perf process itself: Average synthesis took: 73.921 usec (+- 0.073 usec) Average num. events: 31.000 (+- 0.000) Average time per event 2.385 usec Average data synthesis took: 80.545 usec (+- 0.070 usec) Average num. events: 178.000 (+- 0.000) Average time per event 0.453 usec # ./perf-ioctl bench internals synthesize --mt -M 8 # Running 'internals/synthesize' benchmark: Computing performance of multi threaded perf event synthesis by synthesizing events on CPU 0: Number of synthesis threads: 1 Average synthesis took: 35609.500 usec (+- 428.576 usec) Average num. events: 14040.700 (+- 1.700) Average time per event 2.536 usec Number of synthesis threads: 2 Average synthesis took: 34293.800 usec (+- 453.811 usec) Average num. events: 14040.700 (+- 1.700) Average time per event 2.442 usec Number of synthesis threads: 3 Average synthesis took: 32385.200 usec (+- 363.106 usec) Average num. events: 14040.700 (+- 1.700) Average time per event 2.307 usec Number of synthesis threads: 4 Average synthesis took: 33113.100 usec (+- 553.931 usec) Average num. events: 14054.500 (+- 11.469) Average time per event 2.356 usec Number of synthesis threads: 5 Average synthesis took: 31600.600 usec (+- 297.349 usec) Average num. events: 14012.500 (+- 4.590) Average time per event 2.255 usec Number of synthesis threads: 6 Average synthesis took: 32309.900 usec (+- 472.225 usec) Average num. events: 14004.000 (+- 0.000) Average time per event 2.307 usec Number of synthesis threads: 7 Average synthesis took: 31400.100 usec (+- 206.261 usec) Average num. events: 14004.800 (+- 0.800) Average time per event 2.242 usec Number of synthesis threads: 8 Average synthesis took: 31601.400 usec (+- 303.350 usec) Average num. events: 14005.700 (+- 1.700) Average time per event 2.256 usec I also double-checked (using strace) that it does what it is supposed to do, and it seems like everything checks out. Here's text-based strace log: openat(AT_FDCWD, "/proc/35876/task/35876/maps", O_RDONLY) = 3 read(3, "00400000-0040c000 r--p 00000000 "..., 8192) = 3997 read(3, "7f519d4d3000-7f519d516000 r--p 0"..., 8192) = 4025 read(3, "7f519dc3d000-7f519dc44000 r-xp 0"..., 8192) = 4048 read(3, "7f519dd2d000-7f519dd2f000 r--p 0"..., 8192) = 4017 read(3, "7f519dff6000-7f519dff8000 r--p 0"..., 8192) = 2744 read(3, "", 8192) = 0 close(3) = 0 BTW, note how the kernel doesn't serve more than 4KB of data, even though perf provides 8KB buffer (that's to Greg's question about optimizing using bigger buffers, I suspect without seq_file changes, it won't work). And here's an abbreviated log for ioctl version, it has lots more (but much faster) ioctl() syscalls, given it dumps everything: openat(AT_FDCWD, "/proc/36380/task/36380/maps", O_RDONLY) = 3 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0 ... 195 ioctl() calls in total ... ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = -1 ENOENT (No such file or directory) close(3) = 0 So, it's not the optimal usage of this API, and yet it's still better (or at least not worse) than text-based API. [0] https://github.com/anakryiko/linux/commit/0841fe675ed30f5605c5b228e18f5612ea253b35 > > Thanks, > Ian > > > > Thanks, > > > Ian > > > > > > > [0] https://github.com/libbpf/blazesym/blob/ee9b48a80c0b4499118a1e8e5d901cddb2b33ab1/src/normalize/user.rs#L193 > > > > > > > > > thanks, > > > > > > > > > > greg k-h > > > >
.. Adding Suren & Willy to the Cc * Andrii Nakryiko <andrii.nakryiko@gmail.com> [240504 18:14]: > On Sat, May 4, 2024 at 8:32 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote: > > > I also did an strace run of both cases. In text-based one the tool did > > > 68 read() syscalls, fetching up to 4KB of data in one go. > > > > Why not fetch more at once? > > > > I didn't expect to be interrogated so much on the performance of the > text parsing front, sorry. :) You can probably tune this, but where is > the reasonable limit? 64KB? 256KB? 1MB? See below for some more > production numbers. The reason the file reads are limited to 4KB is because this file is used for monitoring processes. We have a significant number of organisations polling this file so frequently that the mmap lock contention becomes an issue. (reading a file is free, right?) People also tend to try to figure out why a process is slow by reading this file - which amplifies the lock contention. What happens today is that the lock is yielded after 4KB to allow time for mmap writes to happen. This also means your data may be inconsistent from one 4KB block to the next (the write may be around this boundary). This new interface also takes the lock in do_procmap_query() and does the 4kb blocks as well. Extending this size means more time spent blocking mmap writes, but a more consistent view of the world (less "tearing" of the addresses). We are working to reduce these issues by switching the /proc/<pid>/maps file to use rcu lookup. I would recommend we do not proceed with this interface using the old method and instead, implement it using rcu from the start - if it fits your use case (or we can make it fit your use case). At least, for most page faults, we can work around the lock contention (since v6.6), but not all and not on all archs. ... > > > > In comparison, > > > ioctl-based implementation had to do only 6 ioctl() calls to fetch all > > > relevant VMAs. > > > > > > It is projected that savings from processing big production applications > > > would only widen the gap in favor of binary-based querying ioctl API, as > > > bigger applications will tend to have even more non-executable VMA > > > mappings relative to executable ones. > > > > Define "bigger applications" please. Is this some "large database > > company workload" type of thing, or something else? > > I don't have a definition. But I had in mind, as one example, an > ads-serving service we use internally (it's a pretty large application > by pretty much any metric you can come up with). I just randomly > picked one of the production hosts, found one instance of that > service, and looked at its /proc/<pid>/maps file. Hopefully it will > satisfy your need for specifics. > > # cat /proc/1126243/maps | wc -c > 1570178 > # cat /proc/1126243/maps | wc -l > 28875 > # cat /proc/1126243/maps | grep ' ..x. ' | wc -l > 7347 We have distributions increasing the map_count to an insane number to allow games to work [1]. It is, unfortunately, only a matter of time until this is regularly an issue as it is being normalised and allowed by an increased number of distributions (fedora, arch, ubuntu). So, despite my email address, I am not talking about large database companies here. Also, note that applications that use guard VMAs double the number for the guards. Fun stuff. We are really doing a lot in the VMA area to reduce the mmap locking contention and it seems you have a use case for a new interface that can leverage these changes. We have at least two talks around this area at LSF if you are attending. Thanks, Liam [1] https://lore.kernel.org/linux-mm/8f6e2d69-b4df-45f3-aed4-5190966e2dea@valvesoftware.com/
On Tue, May 07, 2024 at 11:48:44AM -0400, Liam R. Howlett wrote:
> .. Adding Suren & Willy to the Cc
I've been staying out of this disaster. i thought steven rostedt was
going to do all of this in the kernel anyway. wasn't thre a session on
that at lsfmm in vancouver last year?
* Matthew Wilcox <willy@infradead.org> [240507 12:10]: > On Tue, May 07, 2024 at 11:48:44AM -0400, Liam R. Howlett wrote: > > .. Adding Suren & Willy to the Cc > > I've been staying out of this disaster. i thought steven rostedt was > going to do all of this in the kernel anyway. wasn't thre a session on > that at lsfmm in vancouver last year? sframes? The only other one that comes to mind is the one where he and kent were yelling at each other.
On Tue, May 7, 2024 at 8:49 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > .. Adding Suren & Willy to the Cc > > * Andrii Nakryiko <andrii.nakryiko@gmail.com> [240504 18:14]: > > On Sat, May 4, 2024 at 8:32 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote: > > > > I also did an strace run of both cases. In text-based one the tool did > > > > 68 read() syscalls, fetching up to 4KB of data in one go. > > > > > > Why not fetch more at once? > > > > > > > I didn't expect to be interrogated so much on the performance of the > > text parsing front, sorry. :) You can probably tune this, but where is > > the reasonable limit? 64KB? 256KB? 1MB? See below for some more > > production numbers. > > The reason the file reads are limited to 4KB is because this file is > used for monitoring processes. We have a significant number of > organisations polling this file so frequently that the mmap lock > contention becomes an issue. (reading a file is free, right?) People > also tend to try to figure out why a process is slow by reading this > file - which amplifies the lock contention. > > What happens today is that the lock is yielded after 4KB to allow time > for mmap writes to happen. This also means your data may be > inconsistent from one 4KB block to the next (the write may be around > this boundary). > > This new interface also takes the lock in do_procmap_query() and does > the 4kb blocks as well. Extending this size means more time spent > blocking mmap writes, but a more consistent view of the world (less > "tearing" of the addresses). Hold on. There is no 4KB in the new ioctl-based API I'm adding. It does a single VMA look up (presumably O(logN) operation) using a single vma_iter_init(addr) + vma_next() call on vma_iterator. As for the mmap_read_lock_killable() (is that what we are talking about?), I'm happy to use anything else available, please give me a pointer. But I suspect given how fast and small this new API is, mmap_read_lock_killable() in it is not comparable to holding it for producing /proc/<pid>/maps contents. > > We are working to reduce these issues by switching the /proc/<pid>/maps > file to use rcu lookup. I would recommend we do not proceed with this > interface using the old method and instead, implement it using rcu from > the start - if it fits your use case (or we can make it fit your use > case). > > At least, for most page faults, we can work around the lock contention > (since v6.6), but not all and not on all archs. > > ... > > > > > > > In comparison, > > > > ioctl-based implementation had to do only 6 ioctl() calls to fetch all > > > > relevant VMAs. > > > > > > > > It is projected that savings from processing big production applications > > > > would only widen the gap in favor of binary-based querying ioctl API, as > > > > bigger applications will tend to have even more non-executable VMA > > > > mappings relative to executable ones. > > > > > > Define "bigger applications" please. Is this some "large database > > > company workload" type of thing, or something else? > > > > I don't have a definition. But I had in mind, as one example, an > > ads-serving service we use internally (it's a pretty large application > > by pretty much any metric you can come up with). I just randomly > > picked one of the production hosts, found one instance of that > > service, and looked at its /proc/<pid>/maps file. Hopefully it will > > satisfy your need for specifics. > > > > # cat /proc/1126243/maps | wc -c > > 1570178 > > # cat /proc/1126243/maps | wc -l > > 28875 > > # cat /proc/1126243/maps | grep ' ..x. ' | wc -l > > 7347 > > We have distributions increasing the map_count to an insane number to > allow games to work [1]. It is, unfortunately, only a matter of time until > this is regularly an issue as it is being normalised and allowed by an > increased number of distributions (fedora, arch, ubuntu). So, despite > my email address, I am not talking about large database companies here. > > Also, note that applications that use guard VMAs double the number for > the guards. Fun stuff. > > We are really doing a lot in the VMA area to reduce the mmap locking > contention and it seems you have a use case for a new interface that can > leverage these changes. > > We have at least two talks around this area at LSF if you are attending. I am attending LSFMM, yes, I'll try to not miss them. > > Thanks, > Liam > > [1] https://lore.kernel.org/linux-mm/8f6e2d69-b4df-45f3-aed4-5190966e2dea@valvesoftware.com/ >
On Mon, May 6, 2024 at 10:06 PM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > On Mon, May 6, 2024 at 11:43 AM Ian Rogers <irogers@google.com> wrote: > > > > On Mon, May 6, 2024 at 11:32 AM Andrii Nakryiko > > <andrii.nakryiko@gmail.com> wrote: > > > > > > On Sat, May 4, 2024 at 10:09 PM Ian Rogers <irogers@google.com> wrote: > > > > > > > > On Sat, May 4, 2024 at 2:57 PM Andrii Nakryiko > > > > <andrii.nakryiko@gmail.com> wrote: > > > > > > > > > > On Sat, May 4, 2024 at 8:29 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > > > > > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote: > > > > > > > Implement a simple tool/benchmark for comparing address "resolution" > > > > > > > logic based on textual /proc/<pid>/maps interface and new binary > > > > > > > ioctl-based PROCFS_PROCMAP_QUERY command. > > > > > > > > > > > > Of course an artificial benchmark of "read a whole file" vs. "a tiny > > > > > > ioctl" is going to be different, but step back and show how this is > > > > > > going to be used in the real world overall. Pounding on this file is > > > > > > not a normal operation, right? > > > > > > > > > > > > > > > > It's not artificial at all. It's *exactly* what, say, blazesym library > > > > > is doing (see [0], it's Rust and part of the overall library API, I > > > > > think C code in this patch is way easier to follow for someone not > > > > > familiar with implementation of blazesym, but both implementations are > > > > > doing exactly the same sequence of steps). You can do it even less > > > > > efficiently by parsing the whole file, building an in-memory lookup > > > > > table, then looking up addresses one by one. But that's even slower > > > > > and more memory-hungry. So I didn't even bother implementing that, it > > > > > would put /proc/<pid>/maps at even more disadvantage. > > > > > > > > > > Other applications that deal with stack traces (including perf) would > > > > > be doing one of those two approaches, depending on circumstances and > > > > > level of sophistication of code (and sensitivity to performance). > > > > > > > > The code in perf doing this is here: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/synthetic-events.c#n440 > > > > The code is using the api/io.h code: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/api/io.h > > > > Using perf to profile perf it was observed time was spent allocating > > > > buffers and locale related activities when using stdio, so io is a > > > > lighter weight alternative, albeit with more verbose code than fscanf. > > > > You could add this as an alternate /proc/<pid>/maps reader, we have a > > > > similar benchmark in `perf bench internals synthesize`. > > > > > > > > > > If I add a new implementation using this ioctl() into > > > perf_event__synthesize_mmap_events(), will it be tested from this > > > `perf bench internals synthesize`? I'm not too familiar with perf code > > > organization, sorry if it's a stupid question. If not, where exactly > > > is the code that would be triggered from benchmark? > > > > Yes it would be triggered :-) > > Ok, I don't exactly know how to interpret the results (and what the > benchmark is doing), but numbers don't seem to be worse. They actually > seem to be a bit better. > > I pushed my code that adds perf integration to [0]. That commit has > results, but I'll post them here (with invocation parameters). > perf-ioctl is the version with ioctl()-based implementation, > perf-parse is, logically, text-parsing version. Here are the results > (and see my notes below the results as well): > > TEXT-BASED > ========== > > # ./perf-parse bench internals synthesize > # Running 'internals/synthesize' benchmark: > Computing performance of single threaded perf event synthesis by > synthesizing events on the perf process itself: > Average synthesis took: 80.311 usec (+- 0.077 usec) > Average num. events: 32.000 (+- 0.000) > Average time per event 2.510 usec > Average data synthesis took: 84.429 usec (+- 0.066 usec) > Average num. events: 179.000 (+- 0.000) > Average time per event 0.472 usec > > # ./perf-parse bench internals synthesize > # Running 'internals/synthesize' benchmark: > Computing performance of single threaded perf event synthesis by > synthesizing events on the perf process itself: > Average synthesis took: 79.900 usec (+- 0.077 usec) > Average num. events: 32.000 (+- 0.000) > Average time per event 2.497 usec > Average data synthesis took: 84.832 usec (+- 0.074 usec) > Average num. events: 180.000 (+- 0.000) > Average time per event 0.471 usec > > # ./perf-parse bench internals synthesize --mt -M 8 > # Running 'internals/synthesize' benchmark: > Computing performance of multi threaded perf event synthesis by > synthesizing events on CPU 0: > Number of synthesis threads: 1 > Average synthesis took: 36338.100 usec (+- 406.091 usec) > Average num. events: 14091.300 (+- 7.433) > Average time per event 2.579 usec > Number of synthesis threads: 2 > Average synthesis took: 37071.200 usec (+- 746.498 usec) > Average num. events: 14085.900 (+- 1.900) > Average time per event 2.632 usec > Number of synthesis threads: 3 > Average synthesis took: 33932.300 usec (+- 626.861 usec) > Average num. events: 14085.900 (+- 1.900) > Average time per event 2.409 usec > Number of synthesis threads: 4 > Average synthesis took: 33822.700 usec (+- 506.290 usec) > Average num. events: 14099.200 (+- 8.761) > Average time per event 2.399 usec > Number of synthesis threads: 5 > Average synthesis took: 33348.200 usec (+- 389.771 usec) > Average num. events: 14085.900 (+- 1.900) > Average time per event 2.367 usec > Number of synthesis threads: 6 > Average synthesis took: 33269.600 usec (+- 350.341 usec) > Average num. events: 14084.000 (+- 0.000) > Average time per event 2.362 usec > Number of synthesis threads: 7 > Average synthesis took: 32663.900 usec (+- 338.870 usec) > Average num. events: 14085.900 (+- 1.900) > Average time per event 2.319 usec > Number of synthesis threads: 8 > Average synthesis took: 32748.400 usec (+- 285.450 usec) > Average num. events: 14085.900 (+- 1.900) > Average time per event 2.325 usec > > IOCTL-BASED > =========== > # ./perf-ioctl bench internals synthesize > # Running 'internals/synthesize' benchmark: > Computing performance of single threaded perf event synthesis by > synthesizing events on the perf process itself: > Average synthesis took: 72.996 usec (+- 0.076 usec) > Average num. events: 31.000 (+- 0.000) > Average time per event 2.355 usec > Average data synthesis took: 79.067 usec (+- 0.074 usec) > Average num. events: 178.000 (+- 0.000) > Average time per event 0.444 usec > > # ./perf-ioctl bench internals synthesize > # Running 'internals/synthesize' benchmark: > Computing performance of single threaded perf event synthesis by > synthesizing events on the perf process itself: > Average synthesis took: 73.921 usec (+- 0.073 usec) > Average num. events: 31.000 (+- 0.000) > Average time per event 2.385 usec > Average data synthesis took: 80.545 usec (+- 0.070 usec) > Average num. events: 178.000 (+- 0.000) > Average time per event 0.453 usec > > # ./perf-ioctl bench internals synthesize --mt -M 8 > # Running 'internals/synthesize' benchmark: > Computing performance of multi threaded perf event synthesis by > synthesizing events on CPU 0: > Number of synthesis threads: 1 > Average synthesis took: 35609.500 usec (+- 428.576 usec) > Average num. events: 14040.700 (+- 1.700) > Average time per event 2.536 usec > Number of synthesis threads: 2 > Average synthesis took: 34293.800 usec (+- 453.811 usec) > Average num. events: 14040.700 (+- 1.700) > Average time per event 2.442 usec > Number of synthesis threads: 3 > Average synthesis took: 32385.200 usec (+- 363.106 usec) > Average num. events: 14040.700 (+- 1.700) > Average time per event 2.307 usec > Number of synthesis threads: 4 > Average synthesis took: 33113.100 usec (+- 553.931 usec) > Average num. events: 14054.500 (+- 11.469) > Average time per event 2.356 usec > Number of synthesis threads: 5 > Average synthesis took: 31600.600 usec (+- 297.349 usec) > Average num. events: 14012.500 (+- 4.590) > Average time per event 2.255 usec > Number of synthesis threads: 6 > Average synthesis took: 32309.900 usec (+- 472.225 usec) > Average num. events: 14004.000 (+- 0.000) > Average time per event 2.307 usec > Number of synthesis threads: 7 > Average synthesis took: 31400.100 usec (+- 206.261 usec) > Average num. events: 14004.800 (+- 0.800) > Average time per event 2.242 usec > Number of synthesis threads: 8 > Average synthesis took: 31601.400 usec (+- 303.350 usec) > Average num. events: 14005.700 (+- 1.700) > Average time per event 2.256 usec > > I also double-checked (using strace) that it does what it is supposed > to do, and it seems like everything checks out. Here's text-based > strace log: > > openat(AT_FDCWD, "/proc/35876/task/35876/maps", O_RDONLY) = 3 > read(3, "00400000-0040c000 r--p 00000000 "..., 8192) = 3997 > read(3, "7f519d4d3000-7f519d516000 r--p 0"..., 8192) = 4025 > read(3, "7f519dc3d000-7f519dc44000 r-xp 0"..., 8192) = 4048 > read(3, "7f519dd2d000-7f519dd2f000 r--p 0"..., 8192) = 4017 > read(3, "7f519dff6000-7f519dff8000 r--p 0"..., 8192) = 2744 > read(3, "", 8192) = 0 > close(3) = 0 > > > BTW, note how the kernel doesn't serve more than 4KB of data, even > though perf provides 8KB buffer (that's to Greg's question about > optimizing using bigger buffers, I suspect without seq_file changes, > it won't work). > > And here's an abbreviated log for ioctl version, it has lots more (but > much faster) ioctl() syscalls, given it dumps everything: > > openat(AT_FDCWD, "/proc/36380/task/36380/maps", O_RDONLY) = 3 > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0 > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0 > > ... 195 ioctl() calls in total ... > > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0 > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0 > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0 > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) > = -1 ENOENT (No such file or directory) > close(3) = 0 > > > So, it's not the optimal usage of this API, and yet it's still better > (or at least not worse) than text-based API. > In another reply to Arnaldo on patch #2 I mentioned the idea of allowing to iterate only file-backed VMAs (as it seems like what symbolizers would only care about, but I might be wrong here). So I tried that quickly, given it's a trivial addition to my code. See results below (it is slightly faster, but not much, because most of VMAs in that benchmark seem to be indeed file-backed anyways), just for completeness. I'm not sure if that would be useful/used by perf, so please let me know. As I mentioned above, it's not radically faster in this perf benchmark, because we still request about 170 VMAs (vs ~195 if we iterate *all* of them), so not a big change. The ratio will vary depending on what the process is doing, of course. Anyways, just for completeness, I'm not sure if I have to add this "filter" to the actual implementation. # ./perf-filebacked bench internals synthesize # Running 'internals/synthesize' benchmark: Computing performance of single threaded perf event synthesis by synthesizing events on the perf process itself: Average synthesis took: 65.759 usec (+- 0.063 usec) Average num. events: 30.000 (+- 0.000) Average time per event 2.192 usec Average data synthesis took: 73.840 usec (+- 0.080 usec) Average num. events: 153.000 (+- 0.000) Average time per event 0.483 usec # ./perf-filebacked bench internals synthesize # Running 'internals/synthesize' benchmark: Computing performance of single threaded perf event synthesis by synthesizing events on the perf process itself: Average synthesis took: 66.245 usec (+- 0.059 usec) Average num. events: 30.000 (+- 0.000) Average time per event 2.208 usec Average data synthesis took: 70.627 usec (+- 0.074 usec) Average num. events: 153.000 (+- 0.000) Average time per event 0.462 usec # ./perf-filebacked bench internals synthesize --mt -M 8 # Running 'internals/synthesize' benchmark: Computing performance of multi threaded perf event synthesis by synthesizing events on CPU 0: Number of synthesis threads: 1 Average synthesis took: 33477.500 usec (+- 556.102 usec) Average num. events: 10125.700 (+- 1.620) Average time per event 3.306 usec Number of synthesis threads: 2 Average synthesis took: 30473.700 usec (+- 221.933 usec) Average num. events: 10127.000 (+- 0.000) Average time per event 3.009 usec Number of synthesis threads: 3 Average synthesis took: 29775.200 usec (+- 315.212 usec) Average num. events: 10128.700 (+- 0.667) Average time per event 2.940 usec Number of synthesis threads: 4 Average synthesis took: 29477.100 usec (+- 621.258 usec) Average num. events: 10129.000 (+- 0.000) Average time per event 2.910 usec Number of synthesis threads: 5 Average synthesis took: 29777.900 usec (+- 294.710 usec) Average num. events: 10144.700 (+- 11.597) Average time per event 2.935 usec Number of synthesis threads: 6 Average synthesis took: 27774.700 usec (+- 357.569 usec) Average num. events: 10158.500 (+- 14.710) Average time per event 2.734 usec Number of synthesis threads: 7 Average synthesis took: 27437.200 usec (+- 233.626 usec) Average num. events: 10135.700 (+- 2.700) Average time per event 2.707 usec Number of synthesis threads: 8 Average synthesis took: 28784.600 usec (+- 477.630 usec) Average num. events: 10133.000 (+- 0.000) Average time per event 2.841 usec > [0] https://github.com/anakryiko/linux/commit/0841fe675ed30f5605c5b228e18f5612ea253b35 > > > > > Thanks, > > Ian > > > > > > Thanks, > > > > Ian > > > > > > > > > [0] https://github.com/libbpf/blazesym/blob/ee9b48a80c0b4499118a1e8e5d901cddb2b33ab1/src/normalize/user.rs#L193 > > > > > > > > > > > thanks, > > > > > > > > > > > > greg k-h > > > > >
* Andrii Nakryiko <andrii.nakryiko@gmail.com> [240507 12:28]: > On Tue, May 7, 2024 at 8:49 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > > > .. Adding Suren & Willy to the Cc > > > > * Andrii Nakryiko <andrii.nakryiko@gmail.com> [240504 18:14]: > > > On Sat, May 4, 2024 at 8:32 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote: > > > > > I also did an strace run of both cases. In text-based one the tool did > > > > > 68 read() syscalls, fetching up to 4KB of data in one go. > > > > > > > > Why not fetch more at once? > > > > > > > > > > I didn't expect to be interrogated so much on the performance of the > > > text parsing front, sorry. :) You can probably tune this, but where is > > > the reasonable limit? 64KB? 256KB? 1MB? See below for some more > > > production numbers. > > > > The reason the file reads are limited to 4KB is because this file is > > used for monitoring processes. We have a significant number of > > organisations polling this file so frequently that the mmap lock > > contention becomes an issue. (reading a file is free, right?) People > > also tend to try to figure out why a process is slow by reading this > > file - which amplifies the lock contention. > > > > What happens today is that the lock is yielded after 4KB to allow time > > for mmap writes to happen. This also means your data may be > > inconsistent from one 4KB block to the next (the write may be around > > this boundary). > > > > This new interface also takes the lock in do_procmap_query() and does > > the 4kb blocks as well. Extending this size means more time spent > > blocking mmap writes, but a more consistent view of the world (less > > "tearing" of the addresses). > > Hold on. There is no 4KB in the new ioctl-based API I'm adding. It > does a single VMA look up (presumably O(logN) operation) using a > single vma_iter_init(addr) + vma_next() call on vma_iterator. Sorry, I read this: + if (usize > PAGE_SIZE) + return -E2BIG; And thought you were going to return many vmas in that buffer. I see now that you are doing one copy at a time. > > As for the mmap_read_lock_killable() (is that what we are talking > about?), I'm happy to use anything else available, please give me a > pointer. But I suspect given how fast and small this new API is, > mmap_read_lock_killable() in it is not comparable to holding it for > producing /proc/<pid>/maps contents. Yes, mmap_read_lock_killable() is the mmap lock (formally known as the mmap sem). You can see examples of avoiding the mmap lock by use of rcu in mm/memory.c lock_vma_under_rcu() which is used in the fault path. userfaultfd has an example as well. But again, remember that not all archs have this functionality, so you'd need to fall back to full mmap locking. Certainly a single lookup and copy will be faster than a 4k buffer filling copy, but you will be walking the tree O(n) times, where n is the vma count. This isn't as efficient as multiple lookups in a row as we will re-walk from the top of the tree. You will also need to contend with the fact that the chance of the vmas changing between calls is much higher here too - if that's an issue. Neither of these issues go away with use of the rcu locking instead of the mmap lock, but we can be quite certain that we won't cause locking contention. Thanks, Liam
On Tue, May 7, 2024 at 11:06 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > * Andrii Nakryiko <andrii.nakryiko@gmail.com> [240507 12:28]: > > On Tue, May 7, 2024 at 8:49 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > > > > > .. Adding Suren & Willy to the Cc > > > > > > * Andrii Nakryiko <andrii.nakryiko@gmail.com> [240504 18:14]: > > > > On Sat, May 4, 2024 at 8:32 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > > > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote: > > > > > > I also did an strace run of both cases. In text-based one the tool did > > > > > > 68 read() syscalls, fetching up to 4KB of data in one go. > > > > > > > > > > Why not fetch more at once? > > > > > > > > > > > > > I didn't expect to be interrogated so much on the performance of the > > > > text parsing front, sorry. :) You can probably tune this, but where is > > > > the reasonable limit? 64KB? 256KB? 1MB? See below for some more > > > > production numbers. > > > > > > The reason the file reads are limited to 4KB is because this file is > > > used for monitoring processes. We have a significant number of > > > organisations polling this file so frequently that the mmap lock > > > contention becomes an issue. (reading a file is free, right?) People > > > also tend to try to figure out why a process is slow by reading this > > > file - which amplifies the lock contention. > > > > > > What happens today is that the lock is yielded after 4KB to allow time > > > for mmap writes to happen. This also means your data may be > > > inconsistent from one 4KB block to the next (the write may be around > > > this boundary). > > > > > > This new interface also takes the lock in do_procmap_query() and does > > > the 4kb blocks as well. Extending this size means more time spent > > > blocking mmap writes, but a more consistent view of the world (less > > > "tearing" of the addresses). > > > > Hold on. There is no 4KB in the new ioctl-based API I'm adding. It > > does a single VMA look up (presumably O(logN) operation) using a > > single vma_iter_init(addr) + vma_next() call on vma_iterator. > > Sorry, I read this: > > + if (usize > PAGE_SIZE) > + return -E2BIG; > > And thought you were going to return many vmas in that buffer. I see > now that you are doing one copy at a time. > > > > > As for the mmap_read_lock_killable() (is that what we are talking > > about?), I'm happy to use anything else available, please give me a > > pointer. But I suspect given how fast and small this new API is, > > mmap_read_lock_killable() in it is not comparable to holding it for > > producing /proc/<pid>/maps contents. > > Yes, mmap_read_lock_killable() is the mmap lock (formally known as the > mmap sem). > > You can see examples of avoiding the mmap lock by use of rcu in > mm/memory.c lock_vma_under_rcu() which is used in the fault path. > userfaultfd has an example as well. But again, remember that not all > archs have this functionality, so you'd need to fall back to full mmap > locking. Thanks for the pointer (didn't see email when replying on the other thread). I looked at lock_vma_under_rcu() quickly, and seems like it's designed to find VMA that covers given address, but not the next closest one. So it's a bit problematic for the API I'm adding, as PROCFS_PROCMAP_EXACT_OR_NEXT_VMA (which I can rename to COVERING_OR_NEXT_VMA, if necessary), is quite important for the use cases we have. But maybe some variation of lock_vma_under_rcu() can be added that would fit this case? > > Certainly a single lookup and copy will be faster than a 4k buffer > filling copy, but you will be walking the tree O(n) times, where n is > the vma count. This isn't as efficient as multiple lookups in a row as > we will re-walk from the top of the tree. You will also need to contend > with the fact that the chance of the vmas changing between calls is much > higher here too - if that's an issue. Neither of these issues go away > with use of the rcu locking instead of the mmap lock, but we can be > quite certain that we won't cause locking contention. You are right about O(n) times, but note that for symbolization cases I'm describing, this n will be, generally, *much* smaller than a total number of VMAs within the process. It's a huge speed up in practice. This is because we pre-sort addresses in user-space, and then we query VMA for the first address, but then we quickly skip all the other addresses that are already covered by this VMA, and so the next request will query a new VMA that covers another subset of addresses. This way we'll get the minimal number of VMAs that cover captured addresses (which in the case of stack traces would be a few VMAs belonging to executable sections of process' binary plus a bunch of shared libraries). > > Thanks, > Liam >
On Tue, May 7, 2024 at 10:29 AM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > On Mon, May 6, 2024 at 10:06 PM Andrii Nakryiko > <andrii.nakryiko@gmail.com> wrote: > > > > On Mon, May 6, 2024 at 11:43 AM Ian Rogers <irogers@google.com> wrote: > > > > > > On Mon, May 6, 2024 at 11:32 AM Andrii Nakryiko > > > <andrii.nakryiko@gmail.com> wrote: > > > > > > > > On Sat, May 4, 2024 at 10:09 PM Ian Rogers <irogers@google.com> wrote: > > > > > > > > > > On Sat, May 4, 2024 at 2:57 PM Andrii Nakryiko > > > > > <andrii.nakryiko@gmail.com> wrote: > > > > > > > > > > > > On Sat, May 4, 2024 at 8:29 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > > > > > > > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote: > > > > > > > > Implement a simple tool/benchmark for comparing address "resolution" > > > > > > > > logic based on textual /proc/<pid>/maps interface and new binary > > > > > > > > ioctl-based PROCFS_PROCMAP_QUERY command. > > > > > > > > > > > > > > Of course an artificial benchmark of "read a whole file" vs. "a tiny > > > > > > > ioctl" is going to be different, but step back and show how this is > > > > > > > going to be used in the real world overall. Pounding on this file is > > > > > > > not a normal operation, right? > > > > > > > > > > > > > > > > > > > It's not artificial at all. It's *exactly* what, say, blazesym library > > > > > > is doing (see [0], it's Rust and part of the overall library API, I > > > > > > think C code in this patch is way easier to follow for someone not > > > > > > familiar with implementation of blazesym, but both implementations are > > > > > > doing exactly the same sequence of steps). You can do it even less > > > > > > efficiently by parsing the whole file, building an in-memory lookup > > > > > > table, then looking up addresses one by one. But that's even slower > > > > > > and more memory-hungry. So I didn't even bother implementing that, it > > > > > > would put /proc/<pid>/maps at even more disadvantage. > > > > > > > > > > > > Other applications that deal with stack traces (including perf) would > > > > > > be doing one of those two approaches, depending on circumstances and > > > > > > level of sophistication of code (and sensitivity to performance). > > > > > > > > > > The code in perf doing this is here: > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/synthetic-events.c#n440 > > > > > The code is using the api/io.h code: > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/api/io.h > > > > > Using perf to profile perf it was observed time was spent allocating > > > > > buffers and locale related activities when using stdio, so io is a > > > > > lighter weight alternative, albeit with more verbose code than fscanf. > > > > > You could add this as an alternate /proc/<pid>/maps reader, we have a > > > > > similar benchmark in `perf bench internals synthesize`. > > > > > > > > > > > > > If I add a new implementation using this ioctl() into > > > > perf_event__synthesize_mmap_events(), will it be tested from this > > > > `perf bench internals synthesize`? I'm not too familiar with perf code > > > > organization, sorry if it's a stupid question. If not, where exactly > > > > is the code that would be triggered from benchmark? > > > > > > Yes it would be triggered :-) > > > > Ok, I don't exactly know how to interpret the results (and what the > > benchmark is doing), but numbers don't seem to be worse. They actually > > seem to be a bit better. > > > > I pushed my code that adds perf integration to [0]. That commit has > > results, but I'll post them here (with invocation parameters). > > perf-ioctl is the version with ioctl()-based implementation, > > perf-parse is, logically, text-parsing version. Here are the results > > (and see my notes below the results as well): > > > > TEXT-BASED > > ========== > > > > # ./perf-parse bench internals synthesize > > # Running 'internals/synthesize' benchmark: > > Computing performance of single threaded perf event synthesis by > > synthesizing events on the perf process itself: > > Average synthesis took: 80.311 usec (+- 0.077 usec) > > Average num. events: 32.000 (+- 0.000) > > Average time per event 2.510 usec > > Average data synthesis took: 84.429 usec (+- 0.066 usec) > > Average num. events: 179.000 (+- 0.000) > > Average time per event 0.472 usec > > > > # ./perf-parse bench internals synthesize > > # Running 'internals/synthesize' benchmark: > > Computing performance of single threaded perf event synthesis by > > synthesizing events on the perf process itself: > > Average synthesis took: 79.900 usec (+- 0.077 usec) > > Average num. events: 32.000 (+- 0.000) > > Average time per event 2.497 usec > > Average data synthesis took: 84.832 usec (+- 0.074 usec) > > Average num. events: 180.000 (+- 0.000) > > Average time per event 0.471 usec > > > > # ./perf-parse bench internals synthesize --mt -M 8 > > # Running 'internals/synthesize' benchmark: > > Computing performance of multi threaded perf event synthesis by > > synthesizing events on CPU 0: > > Number of synthesis threads: 1 > > Average synthesis took: 36338.100 usec (+- 406.091 usec) > > Average num. events: 14091.300 (+- 7.433) > > Average time per event 2.579 usec > > Number of synthesis threads: 2 > > Average synthesis took: 37071.200 usec (+- 746.498 usec) > > Average num. events: 14085.900 (+- 1.900) > > Average time per event 2.632 usec > > Number of synthesis threads: 3 > > Average synthesis took: 33932.300 usec (+- 626.861 usec) > > Average num. events: 14085.900 (+- 1.900) > > Average time per event 2.409 usec > > Number of synthesis threads: 4 > > Average synthesis took: 33822.700 usec (+- 506.290 usec) > > Average num. events: 14099.200 (+- 8.761) > > Average time per event 2.399 usec > > Number of synthesis threads: 5 > > Average synthesis took: 33348.200 usec (+- 389.771 usec) > > Average num. events: 14085.900 (+- 1.900) > > Average time per event 2.367 usec > > Number of synthesis threads: 6 > > Average synthesis took: 33269.600 usec (+- 350.341 usec) > > Average num. events: 14084.000 (+- 0.000) > > Average time per event 2.362 usec > > Number of synthesis threads: 7 > > Average synthesis took: 32663.900 usec (+- 338.870 usec) > > Average num. events: 14085.900 (+- 1.900) > > Average time per event 2.319 usec > > Number of synthesis threads: 8 > > Average synthesis took: 32748.400 usec (+- 285.450 usec) > > Average num. events: 14085.900 (+- 1.900) > > Average time per event 2.325 usec > > > > IOCTL-BASED > > =========== > > # ./perf-ioctl bench internals synthesize > > # Running 'internals/synthesize' benchmark: > > Computing performance of single threaded perf event synthesis by > > synthesizing events on the perf process itself: > > Average synthesis took: 72.996 usec (+- 0.076 usec) > > Average num. events: 31.000 (+- 0.000) > > Average time per event 2.355 usec > > Average data synthesis took: 79.067 usec (+- 0.074 usec) > > Average num. events: 178.000 (+- 0.000) > > Average time per event 0.444 usec > > > > # ./perf-ioctl bench internals synthesize > > # Running 'internals/synthesize' benchmark: > > Computing performance of single threaded perf event synthesis by > > synthesizing events on the perf process itself: > > Average synthesis took: 73.921 usec (+- 0.073 usec) > > Average num. events: 31.000 (+- 0.000) > > Average time per event 2.385 usec > > Average data synthesis took: 80.545 usec (+- 0.070 usec) > > Average num. events: 178.000 (+- 0.000) > > Average time per event 0.453 usec > > > > # ./perf-ioctl bench internals synthesize --mt -M 8 > > # Running 'internals/synthesize' benchmark: > > Computing performance of multi threaded perf event synthesis by > > synthesizing events on CPU 0: > > Number of synthesis threads: 1 > > Average synthesis took: 35609.500 usec (+- 428.576 usec) > > Average num. events: 14040.700 (+- 1.700) > > Average time per event 2.536 usec > > Number of synthesis threads: 2 > > Average synthesis took: 34293.800 usec (+- 453.811 usec) > > Average num. events: 14040.700 (+- 1.700) > > Average time per event 2.442 usec > > Number of synthesis threads: 3 > > Average synthesis took: 32385.200 usec (+- 363.106 usec) > > Average num. events: 14040.700 (+- 1.700) > > Average time per event 2.307 usec > > Number of synthesis threads: 4 > > Average synthesis took: 33113.100 usec (+- 553.931 usec) > > Average num. events: 14054.500 (+- 11.469) > > Average time per event 2.356 usec > > Number of synthesis threads: 5 > > Average synthesis took: 31600.600 usec (+- 297.349 usec) > > Average num. events: 14012.500 (+- 4.590) > > Average time per event 2.255 usec > > Number of synthesis threads: 6 > > Average synthesis took: 32309.900 usec (+- 472.225 usec) > > Average num. events: 14004.000 (+- 0.000) > > Average time per event 2.307 usec > > Number of synthesis threads: 7 > > Average synthesis took: 31400.100 usec (+- 206.261 usec) > > Average num. events: 14004.800 (+- 0.800) > > Average time per event 2.242 usec > > Number of synthesis threads: 8 > > Average synthesis took: 31601.400 usec (+- 303.350 usec) > > Average num. events: 14005.700 (+- 1.700) > > Average time per event 2.256 usec > > > > I also double-checked (using strace) that it does what it is supposed > > to do, and it seems like everything checks out. Here's text-based > > strace log: > > > > openat(AT_FDCWD, "/proc/35876/task/35876/maps", O_RDONLY) = 3 > > read(3, "00400000-0040c000 r--p 00000000 "..., 8192) = 3997 > > read(3, "7f519d4d3000-7f519d516000 r--p 0"..., 8192) = 4025 > > read(3, "7f519dc3d000-7f519dc44000 r-xp 0"..., 8192) = 4048 > > read(3, "7f519dd2d000-7f519dd2f000 r--p 0"..., 8192) = 4017 > > read(3, "7f519dff6000-7f519dff8000 r--p 0"..., 8192) = 2744 > > read(3, "", 8192) = 0 > > close(3) = 0 > > > > > > BTW, note how the kernel doesn't serve more than 4KB of data, even > > though perf provides 8KB buffer (that's to Greg's question about > > optimizing using bigger buffers, I suspect without seq_file changes, > > it won't work). > > > > And here's an abbreviated log for ioctl version, it has lots more (but > > much faster) ioctl() syscalls, given it dumps everything: > > > > openat(AT_FDCWD, "/proc/36380/task/36380/maps", O_RDONLY) = 3 > > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0 > > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0 > > > > ... 195 ioctl() calls in total ... > > > > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0 > > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0 > > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0 > > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) > > = -1 ENOENT (No such file or directory) > > close(3) = 0 > > > > > > So, it's not the optimal usage of this API, and yet it's still better > > (or at least not worse) than text-based API. It's surprising that more ioctl is cheaper than less read and parse. > > > > In another reply to Arnaldo on patch #2 I mentioned the idea of > allowing to iterate only file-backed VMAs (as it seems like what > symbolizers would only care about, but I might be wrong here). So I Yep, I think it's enough to get file-backed VMAs only. > tried that quickly, given it's a trivial addition to my code. See > results below (it is slightly faster, but not much, because most of > VMAs in that benchmark seem to be indeed file-backed anyways), just > for completeness. I'm not sure if that would be useful/used by perf, > so please let me know. Thanks for doing this. It'd be useful as it provides better synthesizing performance. The startup latency of perf record is a problem, I need to take a look if it can be improved more. Thanks, Namhyung > > As I mentioned above, it's not radically faster in this perf > benchmark, because we still request about 170 VMAs (vs ~195 if we > iterate *all* of them), so not a big change. The ratio will vary > depending on what the process is doing, of course. Anyways, just for > completeness, I'm not sure if I have to add this "filter" to the > actual implementation. > > # ./perf-filebacked bench internals synthesize > # Running 'internals/synthesize' benchmark: > Computing performance of single threaded perf event synthesis by > synthesizing events on the perf process itself: > Average synthesis took: 65.759 usec (+- 0.063 usec) > Average num. events: 30.000 (+- 0.000) > Average time per event 2.192 usec > Average data synthesis took: 73.840 usec (+- 0.080 usec) > Average num. events: 153.000 (+- 0.000) > Average time per event 0.483 usec > > # ./perf-filebacked bench internals synthesize > # Running 'internals/synthesize' benchmark: > Computing performance of single threaded perf event synthesis by > synthesizing events on the perf process itself: > Average synthesis took: 66.245 usec (+- 0.059 usec) > Average num. events: 30.000 (+- 0.000) > Average time per event 2.208 usec > Average data synthesis took: 70.627 usec (+- 0.074 usec) > Average num. events: 153.000 (+- 0.000) > Average time per event 0.462 usec > > # ./perf-filebacked bench internals synthesize --mt -M 8 > # Running 'internals/synthesize' benchmark: > Computing performance of multi threaded perf event synthesis by > synthesizing events on CPU 0: > Number of synthesis threads: 1 > Average synthesis took: 33477.500 usec (+- 556.102 usec) > Average num. events: 10125.700 (+- 1.620) > Average time per event 3.306 usec > Number of synthesis threads: 2 > Average synthesis took: 30473.700 usec (+- 221.933 usec) > Average num. events: 10127.000 (+- 0.000) > Average time per event 3.009 usec > Number of synthesis threads: 3 > Average synthesis took: 29775.200 usec (+- 315.212 usec) > Average num. events: 10128.700 (+- 0.667) > Average time per event 2.940 usec > Number of synthesis threads: 4 > Average synthesis took: 29477.100 usec (+- 621.258 usec) > Average num. events: 10129.000 (+- 0.000) > Average time per event 2.910 usec > Number of synthesis threads: 5 > Average synthesis took: 29777.900 usec (+- 294.710 usec) > Average num. events: 10144.700 (+- 11.597) > Average time per event 2.935 usec > Number of synthesis threads: 6 > Average synthesis took: 27774.700 usec (+- 357.569 usec) > Average num. events: 10158.500 (+- 14.710) > Average time per event 2.734 usec > Number of synthesis threads: 7 > Average synthesis took: 27437.200 usec (+- 233.626 usec) > Average num. events: 10135.700 (+- 2.700) > Average time per event 2.707 usec > Number of synthesis threads: 8 > Average synthesis took: 28784.600 usec (+- 477.630 usec) > Average num. events: 10133.000 (+- 0.000) > Average time per event 2.841 usec > > > [0] https://github.com/anakryiko/linux/commit/0841fe675ed30f5605c5b228e18f5612ea253b35 > > > > > > > > Thanks, > > > Ian > > > > > > > > Thanks, > > > > > Ian > > > > > > > > > > > [0] https://github.com/libbpf/blazesym/blob/ee9b48a80c0b4499118a1e8e5d901cddb2b33ab1/src/normalize/user.rs#L193 > > > > > > > > > > > > > thanks, > > > > > > > > > > > > > > greg k-h > > > > > > >
On Tue, May 7, 2024 at 3:27 PM Namhyung Kim <namhyung@kernel.org> wrote: > > On Tue, May 7, 2024 at 10:29 AM Andrii Nakryiko > <andrii.nakryiko@gmail.com> wrote: > > > > On Mon, May 6, 2024 at 10:06 PM Andrii Nakryiko > > <andrii.nakryiko@gmail.com> wrote: > > > > > > On Mon, May 6, 2024 at 11:43 AM Ian Rogers <irogers@google.com> wrote: > > > > > > > > On Mon, May 6, 2024 at 11:32 AM Andrii Nakryiko > > > > <andrii.nakryiko@gmail.com> wrote: > > > > > > > > > > On Sat, May 4, 2024 at 10:09 PM Ian Rogers <irogers@google.com> wrote: > > > > > > > > > > > > On Sat, May 4, 2024 at 2:57 PM Andrii Nakryiko > > > > > > <andrii.nakryiko@gmail.com> wrote: > > > > > > > > > > > > > > On Sat, May 4, 2024 at 8:29 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > > > > > > > > > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote: > > > > > > > > > Implement a simple tool/benchmark for comparing address "resolution" > > > > > > > > > logic based on textual /proc/<pid>/maps interface and new binary > > > > > > > > > ioctl-based PROCFS_PROCMAP_QUERY command. > > > > > > > > > > > > > > > > Of course an artificial benchmark of "read a whole file" vs. "a tiny > > > > > > > > ioctl" is going to be different, but step back and show how this is > > > > > > > > going to be used in the real world overall. Pounding on this file is > > > > > > > > not a normal operation, right? > > > > > > > > > > > > > > > > > > > > > > It's not artificial at all. It's *exactly* what, say, blazesym library > > > > > > > is doing (see [0], it's Rust and part of the overall library API, I > > > > > > > think C code in this patch is way easier to follow for someone not > > > > > > > familiar with implementation of blazesym, but both implementations are > > > > > > > doing exactly the same sequence of steps). You can do it even less > > > > > > > efficiently by parsing the whole file, building an in-memory lookup > > > > > > > table, then looking up addresses one by one. But that's even slower > > > > > > > and more memory-hungry. So I didn't even bother implementing that, it > > > > > > > would put /proc/<pid>/maps at even more disadvantage. > > > > > > > > > > > > > > Other applications that deal with stack traces (including perf) would > > > > > > > be doing one of those two approaches, depending on circumstances and > > > > > > > level of sophistication of code (and sensitivity to performance). > > > > > > > > > > > > The code in perf doing this is here: > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/synthetic-events.c#n440 > > > > > > The code is using the api/io.h code: > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/api/io.h > > > > > > Using perf to profile perf it was observed time was spent allocating > > > > > > buffers and locale related activities when using stdio, so io is a > > > > > > lighter weight alternative, albeit with more verbose code than fscanf. > > > > > > You could add this as an alternate /proc/<pid>/maps reader, we have a > > > > > > similar benchmark in `perf bench internals synthesize`. > > > > > > > > > > > > > > > > If I add a new implementation using this ioctl() into > > > > > perf_event__synthesize_mmap_events(), will it be tested from this > > > > > `perf bench internals synthesize`? I'm not too familiar with perf code > > > > > organization, sorry if it's a stupid question. If not, where exactly > > > > > is the code that would be triggered from benchmark? > > > > > > > > Yes it would be triggered :-) > > > > > > Ok, I don't exactly know how to interpret the results (and what the > > > benchmark is doing), but numbers don't seem to be worse. They actually > > > seem to be a bit better. > > > > > > I pushed my code that adds perf integration to [0]. That commit has > > > results, but I'll post them here (with invocation parameters). > > > perf-ioctl is the version with ioctl()-based implementation, > > > perf-parse is, logically, text-parsing version. Here are the results > > > (and see my notes below the results as well): > > > > > > TEXT-BASED > > > ========== > > > > > > # ./perf-parse bench internals synthesize > > > # Running 'internals/synthesize' benchmark: > > > Computing performance of single threaded perf event synthesis by > > > synthesizing events on the perf process itself: > > > Average synthesis took: 80.311 usec (+- 0.077 usec) > > > Average num. events: 32.000 (+- 0.000) > > > Average time per event 2.510 usec > > > Average data synthesis took: 84.429 usec (+- 0.066 usec) > > > Average num. events: 179.000 (+- 0.000) > > > Average time per event 0.472 usec > > > > > > # ./perf-parse bench internals synthesize > > > # Running 'internals/synthesize' benchmark: > > > Computing performance of single threaded perf event synthesis by > > > synthesizing events on the perf process itself: > > > Average synthesis took: 79.900 usec (+- 0.077 usec) > > > Average num. events: 32.000 (+- 0.000) > > > Average time per event 2.497 usec > > > Average data synthesis took: 84.832 usec (+- 0.074 usec) > > > Average num. events: 180.000 (+- 0.000) > > > Average time per event 0.471 usec > > > > > > # ./perf-parse bench internals synthesize --mt -M 8 > > > # Running 'internals/synthesize' benchmark: > > > Computing performance of multi threaded perf event synthesis by > > > synthesizing events on CPU 0: > > > Number of synthesis threads: 1 > > > Average synthesis took: 36338.100 usec (+- 406.091 usec) > > > Average num. events: 14091.300 (+- 7.433) > > > Average time per event 2.579 usec > > > Number of synthesis threads: 2 > > > Average synthesis took: 37071.200 usec (+- 746.498 usec) > > > Average num. events: 14085.900 (+- 1.900) > > > Average time per event 2.632 usec > > > Number of synthesis threads: 3 > > > Average synthesis took: 33932.300 usec (+- 626.861 usec) > > > Average num. events: 14085.900 (+- 1.900) > > > Average time per event 2.409 usec > > > Number of synthesis threads: 4 > > > Average synthesis took: 33822.700 usec (+- 506.290 usec) > > > Average num. events: 14099.200 (+- 8.761) > > > Average time per event 2.399 usec > > > Number of synthesis threads: 5 > > > Average synthesis took: 33348.200 usec (+- 389.771 usec) > > > Average num. events: 14085.900 (+- 1.900) > > > Average time per event 2.367 usec > > > Number of synthesis threads: 6 > > > Average synthesis took: 33269.600 usec (+- 350.341 usec) > > > Average num. events: 14084.000 (+- 0.000) > > > Average time per event 2.362 usec > > > Number of synthesis threads: 7 > > > Average synthesis took: 32663.900 usec (+- 338.870 usec) > > > Average num. events: 14085.900 (+- 1.900) > > > Average time per event 2.319 usec > > > Number of synthesis threads: 8 > > > Average synthesis took: 32748.400 usec (+- 285.450 usec) > > > Average num. events: 14085.900 (+- 1.900) > > > Average time per event 2.325 usec > > > > > > IOCTL-BASED > > > =========== > > > # ./perf-ioctl bench internals synthesize > > > # Running 'internals/synthesize' benchmark: > > > Computing performance of single threaded perf event synthesis by > > > synthesizing events on the perf process itself: > > > Average synthesis took: 72.996 usec (+- 0.076 usec) > > > Average num. events: 31.000 (+- 0.000) > > > Average time per event 2.355 usec > > > Average data synthesis took: 79.067 usec (+- 0.074 usec) > > > Average num. events: 178.000 (+- 0.000) > > > Average time per event 0.444 usec > > > > > > # ./perf-ioctl bench internals synthesize > > > # Running 'internals/synthesize' benchmark: > > > Computing performance of single threaded perf event synthesis by > > > synthesizing events on the perf process itself: > > > Average synthesis took: 73.921 usec (+- 0.073 usec) > > > Average num. events: 31.000 (+- 0.000) > > > Average time per event 2.385 usec > > > Average data synthesis took: 80.545 usec (+- 0.070 usec) > > > Average num. events: 178.000 (+- 0.000) > > > Average time per event 0.453 usec > > > > > > # ./perf-ioctl bench internals synthesize --mt -M 8 > > > # Running 'internals/synthesize' benchmark: > > > Computing performance of multi threaded perf event synthesis by > > > synthesizing events on CPU 0: > > > Number of synthesis threads: 1 > > > Average synthesis took: 35609.500 usec (+- 428.576 usec) > > > Average num. events: 14040.700 (+- 1.700) > > > Average time per event 2.536 usec > > > Number of synthesis threads: 2 > > > Average synthesis took: 34293.800 usec (+- 453.811 usec) > > > Average num. events: 14040.700 (+- 1.700) > > > Average time per event 2.442 usec > > > Number of synthesis threads: 3 > > > Average synthesis took: 32385.200 usec (+- 363.106 usec) > > > Average num. events: 14040.700 (+- 1.700) > > > Average time per event 2.307 usec > > > Number of synthesis threads: 4 > > > Average synthesis took: 33113.100 usec (+- 553.931 usec) > > > Average num. events: 14054.500 (+- 11.469) > > > Average time per event 2.356 usec > > > Number of synthesis threads: 5 > > > Average synthesis took: 31600.600 usec (+- 297.349 usec) > > > Average num. events: 14012.500 (+- 4.590) > > > Average time per event 2.255 usec > > > Number of synthesis threads: 6 > > > Average synthesis took: 32309.900 usec (+- 472.225 usec) > > > Average num. events: 14004.000 (+- 0.000) > > > Average time per event 2.307 usec > > > Number of synthesis threads: 7 > > > Average synthesis took: 31400.100 usec (+- 206.261 usec) > > > Average num. events: 14004.800 (+- 0.800) > > > Average time per event 2.242 usec > > > Number of synthesis threads: 8 > > > Average synthesis took: 31601.400 usec (+- 303.350 usec) > > > Average num. events: 14005.700 (+- 1.700) > > > Average time per event 2.256 usec > > > > > > I also double-checked (using strace) that it does what it is supposed > > > to do, and it seems like everything checks out. Here's text-based > > > strace log: > > > > > > openat(AT_FDCWD, "/proc/35876/task/35876/maps", O_RDONLY) = 3 > > > read(3, "00400000-0040c000 r--p 00000000 "..., 8192) = 3997 > > > read(3, "7f519d4d3000-7f519d516000 r--p 0"..., 8192) = 4025 > > > read(3, "7f519dc3d000-7f519dc44000 r-xp 0"..., 8192) = 4048 > > > read(3, "7f519dd2d000-7f519dd2f000 r--p 0"..., 8192) = 4017 > > > read(3, "7f519dff6000-7f519dff8000 r--p 0"..., 8192) = 2744 > > > read(3, "", 8192) = 0 > > > close(3) = 0 > > > > > > > > > BTW, note how the kernel doesn't serve more than 4KB of data, even > > > though perf provides 8KB buffer (that's to Greg's question about > > > optimizing using bigger buffers, I suspect without seq_file changes, > > > it won't work). > > > > > > And here's an abbreviated log for ioctl version, it has lots more (but > > > much faster) ioctl() syscalls, given it dumps everything: > > > > > > openat(AT_FDCWD, "/proc/36380/task/36380/maps", O_RDONLY) = 3 > > > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0 > > > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0 > > > > > > ... 195 ioctl() calls in total ... > > > > > > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0 > > > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0 > > > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0 > > > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) > > > = -1 ENOENT (No such file or directory) > > > close(3) = 0 > > > > > > > > > So, it's not the optimal usage of this API, and yet it's still better > > > (or at least not worse) than text-based API. > > It's surprising that more ioctl is cheaper than less read and parse. I encourage you to try this locally, just in case I missed something ([0]). But it does seem this way. I have mitigations and retpoline off, so syscall switch is pretty fast (under 0.5 microsecond). [0] https://github.com/anakryiko/linux/tree/procfs-proc-maps-ioctl > > > > > > > > In another reply to Arnaldo on patch #2 I mentioned the idea of > > allowing to iterate only file-backed VMAs (as it seems like what > > symbolizers would only care about, but I might be wrong here). So I > > Yep, I think it's enough to get file-backed VMAs only. > Ok, I guess I'll keep this functionality for v2 then, it's a pretty trivial extension to existing logic. > > > tried that quickly, given it's a trivial addition to my code. See > > results below (it is slightly faster, but not much, because most of > > VMAs in that benchmark seem to be indeed file-backed anyways), just > > for completeness. I'm not sure if that would be useful/used by perf, > > so please let me know. > > Thanks for doing this. It'd be useful as it provides better synthesizing > performance. The startup latency of perf record is a problem, I need > to take a look if it can be improved more. > > Thanks, > Namhyung > > > > > > As I mentioned above, it's not radically faster in this perf > > benchmark, because we still request about 170 VMAs (vs ~195 if we > > iterate *all* of them), so not a big change. The ratio will vary > > depending on what the process is doing, of course. Anyways, just for > > completeness, I'm not sure if I have to add this "filter" to the > > actual implementation. > > > > # ./perf-filebacked bench internals synthesize > > # Running 'internals/synthesize' benchmark: > > Computing performance of single threaded perf event synthesis by > > synthesizing events on the perf process itself: > > Average synthesis took: 65.759 usec (+- 0.063 usec) > > Average num. events: 30.000 (+- 0.000) > > Average time per event 2.192 usec > > Average data synthesis took: 73.840 usec (+- 0.080 usec) > > Average num. events: 153.000 (+- 0.000) > > Average time per event 0.483 usec > > > > # ./perf-filebacked bench internals synthesize > > # Running 'internals/synthesize' benchmark: > > Computing performance of single threaded perf event synthesis by > > synthesizing events on the perf process itself: > > Average synthesis took: 66.245 usec (+- 0.059 usec) > > Average num. events: 30.000 (+- 0.000) > > Average time per event 2.208 usec > > Average data synthesis took: 70.627 usec (+- 0.074 usec) > > Average num. events: 153.000 (+- 0.000) > > Average time per event 0.462 usec > > > > # ./perf-filebacked bench internals synthesize --mt -M 8 > > # Running 'internals/synthesize' benchmark: > > Computing performance of multi threaded perf event synthesis by > > synthesizing events on CPU 0: > > Number of synthesis threads: 1 > > Average synthesis took: 33477.500 usec (+- 556.102 usec) > > Average num. events: 10125.700 (+- 1.620) > > Average time per event 3.306 usec > > Number of synthesis threads: 2 > > Average synthesis took: 30473.700 usec (+- 221.933 usec) > > Average num. events: 10127.000 (+- 0.000) > > Average time per event 3.009 usec > > Number of synthesis threads: 3 > > Average synthesis took: 29775.200 usec (+- 315.212 usec) > > Average num. events: 10128.700 (+- 0.667) > > Average time per event 2.940 usec > > Number of synthesis threads: 4 > > Average synthesis took: 29477.100 usec (+- 621.258 usec) > > Average num. events: 10129.000 (+- 0.000) > > Average time per event 2.910 usec > > Number of synthesis threads: 5 > > Average synthesis took: 29777.900 usec (+- 294.710 usec) > > Average num. events: 10144.700 (+- 11.597) > > Average time per event 2.935 usec > > Number of synthesis threads: 6 > > Average synthesis took: 27774.700 usec (+- 357.569 usec) > > Average num. events: 10158.500 (+- 14.710) > > Average time per event 2.734 usec > > Number of synthesis threads: 7 > > Average synthesis took: 27437.200 usec (+- 233.626 usec) > > Average num. events: 10135.700 (+- 2.700) > > Average time per event 2.707 usec > > Number of synthesis threads: 8 > > Average synthesis took: 28784.600 usec (+- 477.630 usec) > > Average num. events: 10133.000 (+- 0.000) > > Average time per event 2.841 usec > > > > > [0] https://github.com/anakryiko/linux/commit/0841fe675ed30f5605c5b228e18f5612ea253b35 > > > > > > > > > > > Thanks, > > > > Ian > > > > > > > > > > Thanks, > > > > > > Ian > > > > > > > > > > > > > [0] https://github.com/libbpf/blazesym/blob/ee9b48a80c0b4499118a1e8e5d901cddb2b33ab1/src/normalize/user.rs#L193 > > > > > > > > > > > > > > > thanks, > > > > > > > > > > > > > > > > greg k-h > > > > > > > > >
On Tue, May 07, 2024 at 03:56:40PM -0700, Andrii Nakryiko wrote: > On Tue, May 7, 2024 at 3:27 PM Namhyung Kim <namhyung@kernel.org> wrote: > > On Tue, May 7, 2024 at 10:29 AM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > > In another reply to Arnaldo on patch #2 I mentioned the idea of > > > allowing to iterate only file-backed VMAs (as it seems like what > > > symbolizers would only care about, but I might be wrong here). So I > > Yep, I think it's enough to get file-backed VMAs only. > Ok, I guess I'll keep this functionality for v2 then, it's a pretty > trivial extension to existing logic. Maps for JITed code, for isntance, aren't backed by files: commit 578c03c86fadcc6fd7319ddf41dd4d1d88aab77a Author: Namhyung Kim <namhyung@kernel.org> Date: Thu Jan 16 10:49:31 2014 +0900 perf symbols: Fix JIT symbol resolution on heap Gaurav reported that perf cannot profile JIT program if it executes the code on heap. This was because current map__new() only handle JIT on anon mappings - extends it to handle no_dso (heap, stack) case too. This patch assumes JIT profiling only provides dynamic function symbols so check the mapping type to distinguish the case. It'd provide no symbols for data mapping - if we need to support symbols on data mappings later it should be changed. Reported-by: Gaurav Jain <gjain@fb.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Gaurav Jain <gjain@fb.com> ⬢[acme@toolbox perf-tools-next]$ git show 89365e6c9ad4c0e090e4c6a4b67a3ce319381d89 commit 89365e6c9ad4c0e090e4c6a4b67a3ce319381d89 Author: Andi Kleen <ak@linux.intel.com> Date: Wed Apr 24 17:03:02 2013 -0700 perf tools: Handle JITed code in shared memory Need to check for /dev/zero. Most likely more strings are missing too. Signed-off-by: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/1366848182-30449-1-git-send-email-andi@firstfloor.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> diff --git a/tools/perf/util/map.c b/tools/perf/util/map.c index 6fcb9de623401b8a..8bcdf9e54089acaf 100644 --- a/tools/perf/util/map.c +++ b/tools/perf/util/map.c @@ -21,6 +21,7 @@ const char *map_type__name[MAP__NR_TYPES] = { static inline int is_anon_memory(const char *filename) { return !strcmp(filename, "//anon") || + !strcmp(filename, "/dev/zero (deleted)") || !strcmp(filename, "/anon_hugepage (deleted)"); } etc. - Arnaldo
* Andrii Nakryiko <andrii.nakryiko@gmail.com> [240507 15:01]: > On Tue, May 7, 2024 at 11:06 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: ... > > > > > > As for the mmap_read_lock_killable() (is that what we are talking > > > about?), I'm happy to use anything else available, please give me a > > > pointer. But I suspect given how fast and small this new API is, > > > mmap_read_lock_killable() in it is not comparable to holding it for > > > producing /proc/<pid>/maps contents. > > > > Yes, mmap_read_lock_killable() is the mmap lock (formally known as the > > mmap sem). > > > > You can see examples of avoiding the mmap lock by use of rcu in > > mm/memory.c lock_vma_under_rcu() which is used in the fault path. > > userfaultfd has an example as well. But again, remember that not all > > archs have this functionality, so you'd need to fall back to full mmap > > locking. > > Thanks for the pointer (didn't see email when replying on the other thread). > > I looked at lock_vma_under_rcu() quickly, and seems like it's designed > to find VMA that covers given address, but not the next closest one. > So it's a bit problematic for the API I'm adding, as > PROCFS_PROCMAP_EXACT_OR_NEXT_VMA (which I can rename to > COVERING_OR_NEXT_VMA, if necessary), is quite important for the use > cases we have. But maybe some variation of lock_vma_under_rcu() can be > added that would fit this case? Yes, as long as we have the rcu read lock, we can use the same vma_next() calls you use today. We will have to be careful not to use the vma while it's being altered, but per-vma locking should provide that functionality for you. > > > > > Certainly a single lookup and copy will be faster than a 4k buffer > > filling copy, but you will be walking the tree O(n) times, where n is > > the vma count. This isn't as efficient as multiple lookups in a row as > > we will re-walk from the top of the tree. You will also need to contend > > with the fact that the chance of the vmas changing between calls is much > > higher here too - if that's an issue. Neither of these issues go away > > with use of the rcu locking instead of the mmap lock, but we can be > > quite certain that we won't cause locking contention. > > You are right about O(n) times, but note that for symbolization cases > I'm describing, this n will be, generally, *much* smaller than a total > number of VMAs within the process. It's a huge speed up in practice. > This is because we pre-sort addresses in user-space, and then we query > VMA for the first address, but then we quickly skip all the other > addresses that are already covered by this VMA, and so the next > request will query a new VMA that covers another subset of addresses. > This way we'll get the minimal number of VMAs that cover captured > addresses (which in the case of stack traces would be a few VMAs > belonging to executable sections of process' binary plus a bunch of > shared libraries). This also implies you won't have to worry about shifting addresses? I'd think that the reference to the mm means none of these are going to be changing at the point of the calls (not exiting). Given your usecase, I'm surprised you're looking for the next vma at all. Thanks, Liam
diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore index f1aebabfb017..7eaa8f417278 100644 --- a/tools/testing/selftests/bpf/.gitignore +++ b/tools/testing/selftests/bpf/.gitignore @@ -45,6 +45,7 @@ test_cpp /veristat /sign-file /uprobe_multi +/procfs_query *.ko *.tmp xskxceiver diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index ba28d42b74db..07e17bb89767 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -131,7 +131,7 @@ TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \ flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \ test_lirc_mode2_user xdping test_cpp runqslower bench bpf_testmod.ko \ xskxceiver xdp_redirect_multi xdp_synproxy veristat xdp_hw_metadata \ - xdp_features bpf_test_no_cfi.ko + xdp_features bpf_test_no_cfi.ko procfs_query TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi diff --git a/tools/testing/selftests/bpf/procfs_query.c b/tools/testing/selftests/bpf/procfs_query.c new file mode 100644 index 000000000000..8ca3978244ad --- /dev/null +++ b/tools/testing/selftests/bpf/procfs_query.c @@ -0,0 +1,366 @@ +// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) +#include <argp.h> +#include <stdio.h> +#include <string.h> +#include <stdlib.h> +#include <time.h> +#include <stdbool.h> +#include <stdint.h> +#include <sys/ioctl.h> +#include <linux/fs.h> +#include <fcntl.h> +#include <unistd.h> +#include <time.h> + +static bool verbose; +static bool quiet; +static bool use_ioctl; +static bool request_build_id; +static char *addrs_path; +static int pid; +static int bench_runs; + +const char *argp_program_version = "procfs_query 0.0"; +const char *argp_program_bug_address = "<bpf@vger.kernel.org>"; + +static inline uint64_t get_time_ns(void) +{ + struct timespec t; + + clock_gettime(CLOCK_MONOTONIC, &t); + + return (uint64_t)t.tv_sec * 1000000000 + t.tv_nsec; +} + +static const struct argp_option opts[] = { + { "verbose", 'v', NULL, 0, "Verbose mode" }, + { "quiet", 'q', NULL, 0, "Quiet mode (no output)" }, + { "pid", 'p', "PID", 0, "PID of the process" }, + { "addrs-path", 'f', "PATH", 0, "File with addresses to resolve" }, + { "benchmark", 'B', "RUNS", 0, "Benchmark mode" }, + { "query", 'Q', NULL, 0, "Use ioctl()-based point query API (by default text parsing is done)" }, + { "build-id", 'b', NULL, 0, "Fetch build ID, if available (only for ioctl mode)" }, + {}, +}; + +static error_t parse_arg(int key, char *arg, struct argp_state *state) +{ + switch (key) { + case 'v': + verbose = true; + break; + case 'q': + quiet = true; + break; + case 'i': + use_ioctl = true; + break; + case 'b': + request_build_id = true; + break; + case 'p': + pid = strtol(arg, NULL, 10); + break; + case 'f': + addrs_path = strdup(arg); + break; + case 'B': + bench_runs = strtol(arg, NULL, 10); + if (bench_runs <= 0) { + fprintf(stderr, "Invalid benchmark run count: %s\n", arg); + return -EINVAL; + } + break; + case ARGP_KEY_ARG: + argp_usage(state); + break; + default: + return ARGP_ERR_UNKNOWN; + } + return 0; +} + +static const struct argp argp = { + .options = opts, + .parser = parse_arg, +}; + +struct addr { + unsigned long long addr; + int idx; +}; + +static struct addr *addrs; +static size_t addr_cnt, addr_cap; + +struct resolved_addr { + unsigned long long file_off; + const char *vma_name; + int build_id_sz; + char build_id[20]; +}; + +static struct resolved_addr *resolved; + +static int resolve_addrs_ioctl(void) +{ + char buf[32], build_id_buf[20], vma_name[PATH_MAX]; + struct procfs_procmap_query q; + int fd, err, i; + struct addr *a = &addrs[0]; + struct resolved_addr *r; + + snprintf(buf, sizeof(buf), "/proc/%d/maps", pid); + fd = open(buf, O_RDONLY); + if (fd < 0) { + err = -errno; + fprintf(stderr, "Failed to open process map file (%s): %d\n", buf, err); + return err; + } + + memset(&q, 0, sizeof(q)); + q.size = sizeof(q); + q.query_flags = PROCFS_PROCMAP_EXACT_OR_NEXT_VMA; + q.vma_name_addr = (__u64)vma_name; + if (request_build_id) + q.build_id_addr = (__u64)build_id_buf; + + for (i = 0; i < addr_cnt; ) { + char *name = NULL; + + q.query_addr = (__u64)a->addr; + q.vma_name_size = sizeof(vma_name); + if (request_build_id) + q.build_id_size = sizeof(build_id_buf); + + err = ioctl(fd, PROCFS_PROCMAP_QUERY, &q); + if (err < 0 && errno == ENOTTY) { + close(fd); + fprintf(stderr, "PROCFS_PROCMAP_QUERY ioctl() command is not supported on this kernel!\n"); + return -EOPNOTSUPP; /* ioctl() not implemented yet */ + } + if (err < 0 && errno == ENOENT) { + fprintf(stderr, "ENOENT\n"); + i++; + a++; + continue; /* unresolved address */ + } + if (err < 0) { + err = -errno; + close(fd); + fprintf(stderr, "PROCFS_PROCMAP_QUERY ioctl() returned error: %d\n", err); + return err; + } + + /* skip addrs falling before current VMA */ + for (; i < addr_cnt && a->addr < q.vma_start; i++, a++) { + } + /* process addrs covered by current VMA */ + for (; i < addr_cnt && a->addr < q.vma_end; i++, a++) { + r = &resolved[a->idx]; + r->file_off = a->addr - q.vma_start + q.vma_offset; + + /* reuse name, if it was already strdup()'ed */ + if (q.vma_name_size) + name = name ?: strdup(vma_name); + r->vma_name = name; + + if (q.build_id_size) { + r->build_id_sz = q.build_id_size; + memcpy(r->build_id, build_id_buf, q.build_id_size); + } + } + } + + close(fd); + return 0; +} + +static int resolve_addrs_parse(void) +{ + size_t vma_start, vma_end, vma_offset, ino; + uint32_t dev_major, dev_minor; + char perms[4], buf[32], vma_name[PATH_MAX]; + FILE *f; + int err, idx = 0; + struct addr *a = &addrs[idx]; + struct resolved_addr *r; + + snprintf(buf, sizeof(buf), "/proc/%d/maps", pid); + f = fopen(buf, "r"); + if (!f) { + err = -errno; + fprintf(stderr, "Failed to open process map file (%s): %d\n", buf, err); + return err; + } + + while ((err = fscanf(f, "%zx-%zx %c%c%c%c %zx %x:%x %zu %[^\n]\n", + &vma_start, &vma_end, + &perms[0], &perms[1], &perms[2], &perms[3], + &vma_offset, &dev_major, &dev_minor, &ino, vma_name)) >= 10) { + const char *name = NULL; + + /* skip addrs before current vma, they stay unresolved */ + for (; idx < addr_cnt && a->addr < vma_start; idx++, a++) { + } + + /* resolve all addrs within current vma now */ + for (; idx < addr_cnt && a->addr < vma_end; idx++, a++) { + r = &resolved[a->idx]; + r->file_off = a->addr - vma_start + vma_offset; + + /* reuse name, if it was already strdup()'ed */ + if (err > 10) + name = name ?: strdup(vma_name); + else + name = NULL; + r->vma_name = name; + } + + /* ran out of addrs to resolve, stop early */ + if (idx >= addr_cnt) + break; + } + + fclose(f); + return 0; +} + +static int cmp_by_addr(const void *a, const void *b) +{ + const struct addr *x = a, *y = b; + + if (x->addr != y->addr) + return x->addr < y->addr ? -1 : 1; + return x->idx < y->idx ? -1 : 1; +} + +static int cmp_by_idx(const void *a, const void *b) +{ + const struct addr *x = a, *y = b; + + return x->idx < y->idx ? -1 : 1; +} + +int main(int argc, char **argv) +{ + FILE* f; + int err, i; + unsigned long long addr; + uint64_t start_ns; + double total_ns; + + /* Parse command line arguments */ + err = argp_parse(&argp, argc, argv, 0, NULL, NULL); + if (err) + return err; + + if (pid <= 0 || !addrs_path) { + fprintf(stderr, "Please provide PID and file with addresses to process!\n"); + exit(1); + } + + if (verbose) { + fprintf(stderr, "PID: %d\n", pid); + fprintf(stderr, "PATH: %s\n", addrs_path); + } + + f = fopen(addrs_path, "r"); + if (!f) { + err = -errno; + fprintf(stderr, "Failed to open '%s': %d\n", addrs_path, err); + goto out; + } + + while ((err = fscanf(f, "%llx\n", &addr)) == 1) { + if (addr_cnt == addr_cap) { + addr_cap = addr_cap == 0 ? 16 : (addr_cap * 3 / 2); + addrs = realloc(addrs, sizeof(*addrs) * addr_cap); + memset(addrs + addr_cnt, 0, (addr_cap - addr_cnt) * sizeof(*addrs)); + } + + addrs[addr_cnt].addr = addr; + addrs[addr_cnt].idx = addr_cnt; + + addr_cnt++; + } + if (verbose) + fprintf(stderr, "READ %zu addrs!\n", addr_cnt); + if (!feof(f)) { + fprintf(stderr, "Failure parsing full list of addresses at '%s'!\n", addrs_path); + err = -EINVAL; + fclose(f); + goto out; + } + fclose(f); + if (addr_cnt == 0) { + fprintf(stderr, "No addresses provided, bailing out!\n"); + err = -ENOENT; + goto out; + } + + resolved = calloc(addr_cnt, sizeof(*resolved)); + + qsort(addrs, addr_cnt, sizeof(*addrs), cmp_by_addr); + if (verbose) { + fprintf(stderr, "SORTED ADDRS (%zu):\n", addr_cnt); + for (i = 0; i < addr_cnt; i++) { + fprintf(stderr, "ADDR #%d: %#llx\n", addrs[i].idx, addrs[i].addr); + } + } + + start_ns = get_time_ns(); + for (i = bench_runs ?: 1; i > 0; i--) { + if (use_ioctl) { + err = resolve_addrs_ioctl(); + } else { + err = resolve_addrs_parse(); + } + if (err) { + fprintf(stderr, "Failed to resolve addrs: %d!\n", err); + goto out; + } + } + total_ns = get_time_ns() - start_ns; + + if (bench_runs) { + fprintf(stderr, "BENCHMARK MODE. RUNS: %d TOTAL TIME (ms): %.3lf TIME/RUN (ms): %.3lf TIME/ADDR (us): %.3lf\n", + bench_runs, total_ns / 1000000.0, total_ns / bench_runs / 1000000.0, + total_ns / bench_runs / addr_cnt / 1000.0); + } + + /* sort them back into the original order */ + qsort(addrs, addr_cnt, sizeof(*addrs), cmp_by_idx); + + if (!quiet) { + printf("RESOLVED ADDRS (%zu):\n", addr_cnt); + for (i = 0; i < addr_cnt; i++) { + const struct addr *a = &addrs[i]; + const struct resolved_addr *r = &resolved[a->idx]; + + if (r->file_off) { + printf("RESOLVED #%d: %#llx -> OFF %#llx", + a->idx, a->addr, r->file_off); + if (r->vma_name) + printf(" NAME %s", r->vma_name); + if (r->build_id_sz) { + char build_id_str[41]; + int j; + + for (j = 0; j < r->build_id_sz; j++) + sprintf(&build_id_str[j * 2], "%02hhx", r->build_id[j]); + printf(" BUILDID %s", build_id_str); + } + printf("\n"); + } else { + printf("UNRESOLVED #%d: %#llx\n", a->idx, a->addr); + } + } + } +out: + free(addrs); + free(addrs_path); + free(resolved); + + return err < 0 ? -err : 0; +}
Implement a simple tool/benchmark for comparing address "resolution" logic based on textual /proc/<pid>/maps interface and new binary ioctl-based PROCFS_PROCMAP_QUERY command. The tool expects a file with a list of hex addresses, relevant PID, and then provides control over whether textual or binary ioctl-based ways to process VMAs should be used. The overall logic implements as efficient way to do batched processing of a given set of (unsorted) addresses. We first sort them in increasing order (remembering their original position to restore original order, if necessary), and then process all VMAs from /proc/<pid>/maps, matching addresses to VMAs and calculating file offsets, if matched. For ioctl-based approach the idea is similar, but is implemented even more efficiently, requesting only VMAs that cover all given addresses, skipping all the irrelevant VMAs altogether. To be able to compare efficiency of both APIs tool has "benchark" mode. User provides a number of processing runs to run in a tight loop, timing specifically /proc/<pid>/maps parsing and processing parts of the logic only. Address sorting and re-sorting is excluded. This gives a more direct way to compare ioctl- vs text-based APIs. We used a medium-sized production application to do representative benchmark. A bunch of stack traces were captured, resulting in 4435 user space addresses (699 unique ones, but we didn't deduplicate them). Application itself had 702 VMAs reported in /proc/<pid>/maps. Averaging time taken to process all addresses 10000 times, showed that: - text-based approach took 380 microseconds *per one batch run*; - ioctl-based approach took 10 microseconds *per identical batch run*. This gives about ~35x speed up to do exactly the same amoun of work (build IDs were not fetched for ioctl-based benchmark; fetching build IDs resulted in 2x slowdown compared to no-build-ID case). I also did an strace run of both cases. In text-based one the tool did 68 read() syscalls, fetching up to 4KB of data in one go. In comparison, ioctl-based implementation had to do only 6 ioctl() calls to fetch all relevant VMAs. It is projected that savings from processing big production applications would only widen the gap in favor of binary-based querying ioctl API, as bigger applications will tend to have even more non-executable VMA mappings relative to executable ones. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> --- tools/testing/selftests/bpf/.gitignore | 1 + tools/testing/selftests/bpf/Makefile | 2 +- tools/testing/selftests/bpf/procfs_query.c | 366 +++++++++++++++++++++ 3 files changed, 368 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/bpf/procfs_query.c