Message ID | 20230406210611.1622492-1-namhyung@kernel.org |
---|---|
Series | perf lock contention: Improve performance if map is full (v1) |
On Thu, Apr 6, 2023 at 2:06 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> Hello,
>
> I got a report that the overhead of perf lock contention is too big in
> some cases. It was running the task aggregation mode (-t) at the time
> and there were lots of tasks contending with each other.
>
> It turned out that the hash map update is the problem. The result is
> saved in the lock_stat hash map, which is pre-allocated. The BPF program
> never deletes data from the map, it only adds entries. But once the map
> is full, trying to update it becomes a very heavy operation, since it
> needs to check every CPU's freelist to get a new node to save the result.
> And we know the update will fail when the map is full, so there is no
> need to even try.
>
> I've checked it on my 64 CPU machine with this:
>
>   $ perf bench sched messaging -g 1000
>   # Running 'sched/messaging' benchmark:
>   # 20 sender and receiver processes per group
>   # 1000 groups == 40000 processes run
>
>        Total time: 2.825 [sec]
>
> And I used the task mode, so that the map is guaranteed to be full:
> the default map size is 16K entries and this workload has 40K tasks.
>
> Before:
>
>   $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
>   # Running 'sched/messaging' benchmark:
>   # 20 sender and receiver processes per group
>   # 1000 groups == 40000 processes run
>
>        Total time: 11.299 [sec]
>
>    contended   total wait     max wait     avg wait          pid  comm
>        19284       3.51 s      3.70 ms    181.91 us      1305863  sched-messaging
>          243     84.09 ms    466.67 us    346.04 us      1336608  sched-messaging
>          177     66.35 ms     12.08 ms    374.88 us      1220416  node
>
> After:
>
>   $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
>   # Running 'sched/messaging' benchmark:
>   # 20 sender and receiver processes per group
>   # 1000 groups == 40000 processes run
>
>        Total time: 3.044 [sec]
>
>    contended   total wait     max wait     avg wait          pid  comm
>        18743    591.92 ms    442.96 us     31.58 us      1431454  sched-messaging
>           51    210.64 ms    207.45 ms      4.13 ms      1468724  sched-messaging
>           81     68.61 ms     65.79 ms    847.07 us      1463183  sched-messaging
>
>   === output for debug ===
>
>   bad: 1164137, total: 2253341
>   bad rate: 51.66 %
>   histogram of failure reasons
>      task: 0
>     stack: 0
>      time: 0
>      data: 1164137
>
> The first few patches are small cleanups and fixes. You can get the code
> from the 'perf/lock-map-v1' branch in
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
>
> Thanks,
> Namhyung
>
> Namhyung Kim (7):
>   perf lock contention: Simplify parse_lock_type()
>   perf lock contention: Use -M for --map-nr-entries
>   perf lock contention: Update default map size to 16384
>   perf lock contention: Add data failure stat
>   perf lock contention: Update total/bad stats for hidden entries
>   perf lock contention: Revise needs_callstack() condition
>   perf lock contention: Do not try to update if hash map is full

Series:
Acked-by: Ian Rogers <irogers@google.com>

Thanks,
Ian

>  tools/perf/Documentation/perf-lock.txt        |  4 +-
>  tools/perf/builtin-lock.c                     | 64 ++++++++-----------
>  tools/perf/util/bpf_lock_contention.c         |  7 +-
>  .../perf/util/bpf_skel/lock_contention.bpf.c  | 29 +++++++--
>  tools/perf/util/bpf_skel/lock_data.h          |  3 +
>  tools/perf/util/lock-contention.h             |  2 +
>  6 files changed, 60 insertions(+), 49 deletions(-)
>
>
> base-commit: e5116f46d44b72ede59a6923829f68a8b8f84e76
> --
> 2.40.0.577.gac1e443424-goog
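The core of the fix described above (the last patch of the series) is to stop calling bpf_map_update_elem() once the pre-allocated lock_stat hash map is known to be full, and to only count the dropped result in the "data" failure counter. Below is a minimal, self-contained sketch of that idea, not the actual tools/perf/util/bpf_skel/lock_contention.bpf.c: the key/value layout, the update_stat() helper and the data_map_full/data_fail names are illustrative assumptions.

// SPDX-License-Identifier: GPL-2.0
/* Sketch only: skip hash map updates once the map is known to be full.
 * Map layout and variable names are assumptions, not the real
 * lock_contention.bpf.c.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#ifndef E2BIG
#define E2BIG 7		/* returned (negated) by bpf_map_update_elem() when a hash map is full */
#endif

struct contention_key {
	__u64 lock_addr;
};

struct contention_data {
	__u64 total_time;
	__u64 count;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);	/* new default size from this series */
	__type(key, struct contention_key);
	__type(value, struct contention_data);
} lock_stat SEC(".maps");

int data_map_full;	/* set once lock_stat is known to be full */
__u64 data_fail;	/* results dropped because of that ("data" in the histogram) */

static void update_stat(struct contention_key *key, __u64 delta)
{
	struct contention_data *data, first = {
		.total_time = delta,
		.count = 1,
	};
	long err;

	data = bpf_map_lookup_elem(&lock_stat, key);
	if (data) {
		/* Existing entry: cheap in-place update, no freelist walk. */
		__sync_fetch_and_add(&data->total_time, delta);
		__sync_fetch_and_add(&data->count, 1);
		return;
	}

	if (data_map_full) {
		/* Map already full: a new update is doomed, don't pay for it. */
		__sync_fetch_and_add(&data_fail, 1);
		return;
	}

	err = bpf_map_update_elem(&lock_stat, key, &first, BPF_NOEXIST);
	if (err == -E2BIG) {
		/* Remember the failure so later events take the cheap path above. */
		data_map_full = 1;
		__sync_fetch_and_add(&data_fail, 1);
	}
}

SEC("tp_btf/contention_end")
int contention_end(u64 *ctx)
{
	struct contention_key key = {
		.lock_addr = ctx[0],	/* first tracepoint argument: the lock address */
	};

	/* The real tool computes the wait time from a matching
	 * contention_begin event; a fixed delta keeps the sketch short. */
	update_stat(&key, 1);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";

With the flag checked before the update, a full map costs one failed lookup and one atomic add per event instead of a cross-CPU freelist scan, which is where the drop from 11.299 to 3.044 seconds in the cover letter comes from.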
On Thu, Apr 06, 2023 at 02:06:04PM -0700, Namhyung Kim wrote:
> Hello,
>
> I got a report that the overhead of perf lock contention is too big in
> some cases. It was running the task aggregation mode (-t) at the time
> and there were lots of tasks contending with each other.
>
> It turned out that the hash map update is the problem. The result is
> saved in the lock_stat hash map, which is pre-allocated. The BPF program
> never deletes data from the map, it only adds entries. But once the map
> is full, trying to update it becomes a very heavy operation, since it
> needs to check every CPU's freelist to get a new node to save the result.
> And we know the update will fail when the map is full, so there is no
> need to even try.

Thanks, applied.

- Arnaldo
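For reference, the "bad rate" in the debug output quoted earlier is just the sum of the failure counters over the total number of events. A small user-space sketch of that accounting follows; the struct and field names mirror the histogram labels and are assumptions, not the actual builtin-lock.c code.

#include <stdio.h>

/* Assumed counters, one per failure reason shown in the histogram. */
struct lock_contention_fails {
	unsigned long task;
	unsigned long stack;
	unsigned long time;
	unsigned long data;	/* new counter added by the "data failure stat" patch */
};

static void print_bpf_stats(unsigned long total, const struct lock_contention_fails *f)
{
	unsigned long bad = f->task + f->stack + f->time + f->data;

	printf("bad: %lu, total: %lu\n", bad, total);
	printf("bad rate: %.2f %%\n", total ? 100.0 * bad / total : 0.0);
	printf("histogram of failure reasons\n");
	printf("   task: %lu\n", f->task);
	printf("  stack: %lu\n", f->stack);
	printf("   time: %lu\n", f->time);
	printf("   data: %lu\n", f->data);
}

int main(void)
{
	/* Numbers from the cover letter: 1164137 / 2253341 ~= 51.66 % */
	struct lock_contention_fails fails = { .data = 1164137 };

	print_bpf_stats(2253341, &fails);
	return 0;
}

Compiled with a plain cc, this reproduces the same summary block shown in the cover letter's debug output.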