Message ID | 20210106152120.31279-6-andrey.gruzdev@virtuozzo.com
---|---
State | New, archived |
Series | UFFD write-tracking migration/snapshots
On Wed, Jan 06, 2021 at 06:21:20PM +0300, Andrey Gruzdev wrote:
> Add BCC/eBPF script to analyze userfaultfd write fault latency distribution.
>
> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> Acked-by: Peter Xu <peterx@redhat.com>

(This seems to be the last patch that lacks a r-b ... Let's see whether I could
convert my a-b into an r-b... :)

> +BPF_HASH(ev_start, struct ev_desc, u64);
> +BPF_HASH(ctx_handle_userfault, u64, u64);

IMHO we only need one hash here instead of two:

  BPF_HASH(ev_start, u32, u64);

Where we use the tid as the key (u32), and timestamp as the value (u64).  The
thing is we don't really need the address for current statistics, IMHO.

> +/* KPROBE for handle_userfault(). */
> +int probe_handle_userfault(struct pt_regs *ctx, struct vm_fault *vmf,
> +                           unsigned long reason)
> +{
> +    /* Trace only UFFD write faults. */
> +    if (reason & VM_UFFD_WP) {

Better with comment:

        /* Using "(u32)" to drop group ID which is upper 32 bits */

If even better, we'd want a get_current_tid() helper and call it here and below
(bpf_get_current_pid_tgid() will return tid|gid<<32 I think, so I'm a bit
confused why bcc people called it pid at the first place...).

> +        u64 pid = (u32) bpf_get_current_pid_tgid();
> +        u64 addr = vmf->address;
> +
> +        do_event_start(pid, addr);
> +        ctx_handle_userfault.update(&pid, &addr);
> +    }
> +    return 0;
> +}
> +
> +/* KRETPROBE for handle_userfault(). */
> +int retprobe_handle_userfault(struct pt_regs *ctx)
> +{
> +    u64 pid = (u32) bpf_get_current_pid_tgid();
> +    u64 *addr_p;
> +
> +    /*
> +     * Here we just ignore the return value. In case of spurious wakeup
> +     * or pending signal we'll still get (at least for v5.8.0 kernel)
> +     * VM_FAULT_RETRY or (VM_FAULT_RETRY | VM_FAULT_MAJOR) here.
> +     * Anyhow, handle_userfault() would be re-entered if such case happens,
> +     * keeping initial timestamp unchanged for the faulting thread.

AFAIU this comment is not matching what the code does.  But I agree it's not a
big problem because we won't miss any long delays (because the one long delayed
sample will just be split into two or multiple delays, which will still be
reflected in the histogram at last).  Or am I wrong?

> +     */
> +    addr_p = ctx_handle_userfault.lookup(&pid);
> +    if (addr_p) {
> +        do_event_end(pid, *addr_p);
> +        ctx_handle_userfault.delete(&pid);
> +    }
> +    return 0;
> +}
> +"""

Other than that, the rest looks good to me.

I'd think it's fine to even merge the current version since it actually works
nicely.  Andrey, if you agree with any of my above comments, feel free to
repost this patch (since I see Dave provided the rest r-bs).  Then I think I
can r-b this one too.  Thanks!
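[For reference, a minimal sketch of the single-hash scheme suggested above,
including the get_current_tid() helper Peter mentions. The helper name and the
reworked probe bodies are illustrative, not from the posted patch; the sketch
assumes the ev_delta_hist histogram from the patch stays unchanged:]

    /* Illustrative helper: drop the group ID in the upper 32 bits of
     * bpf_get_current_pid_tgid(), keeping only the thread ID. */
    static u32 get_current_tid(void)
    {
        return (u32) bpf_get_current_pid_tgid();
    }

    /* Single hash: fault-start timestamp (ns), keyed by tid alone. */
    BPF_HASH(ev_start, u32, u64);

    /* KPROBE for handle_userfault(): record when the wait started. */
    int probe_handle_userfault(struct pt_regs *ctx, struct vm_fault *vmf,
                               unsigned long reason)
    {
        /* Trace only UFFD write faults. */
        if (reason & VM_UFFD_WP) {
            u32 tid = get_current_tid();
            u64 ts = bpf_ktime_get_ns();

            ev_start.insert(&tid, &ts);
        }
        return 0;
    }

    /* KRETPROBE for handle_userfault(): account the delta, drop the entry. */
    int retprobe_handle_userfault(struct pt_regs *ctx)
    {
        u32 tid = get_current_tid();
        u64 *tsp = ev_start.lookup(&tid);

        if (tsp) {
            u64 delta = bpf_ktime_get_ns() - (*tsp);

            /* Same ms-scaled log2 histogram as in the patch. */
            ev_delta_hist.increment(bpf_log2l(delta / 1000000));
            ev_start.delete(&tid);
        }
        return 0;
    }

[Keying on the tid alone works because a thread can only be blocked in
handle_userfault() on one fault at a time, which is exactly why the address
in the key is redundant for these statistics.]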
On 20.01.2021 00:01, Peter Xu wrote:
> On Wed, Jan 06, 2021 at 06:21:20PM +0300, Andrey Gruzdev wrote:
>> Add BCC/eBPF script to analyze userfaultfd write fault latency distribution.
>>
>> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
>> Acked-by: Peter Xu <peterx@redhat.com>
> (This seems to be the last patch that lacks a r-b ... Let's see whether I could
> convert my a-b into an r-b... :)
>
>> +BPF_HASH(ev_start, struct ev_desc, u64);
>> +BPF_HASH(ctx_handle_userfault, u64, u64);
> IMHO we only need one hash here instead of two:
>
>   BPF_HASH(ev_start, u32, u64);
>
> Where we use the tid as the key (u32), and timestamp as the value (u64).  The
> thing is we don't really need the address for current statistics, IMHO.

Agree, that's a more appropriate way to do that. Address here is not really
needed. Thanks!

>> +/* KPROBE for handle_userfault(). */
>> +int probe_handle_userfault(struct pt_regs *ctx, struct vm_fault *vmf,
>> +                           unsigned long reason)
>> +{
>> +    /* Trace only UFFD write faults. */
>> +    if (reason & VM_UFFD_WP) {
> Better with comment:
>
>         /* Using "(u32)" to drop group ID which is upper 32 bits */

Yep.

> If even better, we'd want a get_current_tid() helper and call it here and below
> (bpf_get_current_pid_tgid() will return tid|gid<<32 I think, so I'm a bit
> confused why bcc people called it pid at the first place...).
>
>> +        u64 pid = (u32) bpf_get_current_pid_tgid();
>> +        u64 addr = vmf->address;
>> +
>> +        do_event_start(pid, addr);
>> +        ctx_handle_userfault.update(&pid, &addr);
>> +    }
>> +    return 0;
>> +}
>> +
>> +/* KRETPROBE for handle_userfault(). */
>> +int retprobe_handle_userfault(struct pt_regs *ctx)
>> +{
>> +    u64 pid = (u32) bpf_get_current_pid_tgid();
>> +    u64 *addr_p;
>> +
>> +    /*
>> +     * Here we just ignore the return value. In case of spurious wakeup
>> +     * or pending signal we'll still get (at least for v5.8.0 kernel)
>> +     * VM_FAULT_RETRY or (VM_FAULT_RETRY | VM_FAULT_MAJOR) here.
>> +     * Anyhow, handle_userfault() would be re-entered if such case happens,
>> +     * keeping initial timestamp unchanged for the faulting thread.
> AFAIU this comment is not matching what the code does.  But I agree it's not a
> big problem because we won't miss any long delays (because the one long delayed
> sample will just be split into two or multiple delays, which will still be
> reflected in the histogram at last).  Or am I wrong?

Mm, not really sure about comment.. I need to read kernel code again.

>> +     */
>> +    addr_p = ctx_handle_userfault.lookup(&pid);
>> +    if (addr_p) {
>> +        do_event_end(pid, *addr_p);
>> +        ctx_handle_userfault.delete(&pid);
>> +    }
>> +    return 0;
>> +}
>> +"""
> Other than that, the rest looks good to me.
>
> I'd think it's fine to even merge the current version since it actually works
> nicely.  Andrey, if you agree with any of my above comments, feel free to
> repost this patch (since I see Dave provided the rest r-bs).  Then I think I
> can r-b this one too.  Thanks!

Thanks!
On Thu, Jan 21, 2021 at 04:12:23PM +0300, Andrey Gruzdev wrote:
> > > +/* KRETPROBE for handle_userfault(). */
> > > +int retprobe_handle_userfault(struct pt_regs *ctx)
> > > +{
> > > +    u64 pid = (u32) bpf_get_current_pid_tgid();
> > > +    u64 *addr_p;
> > > +
> > > +    /*
> > > +     * Here we just ignore the return value. In case of spurious wakeup
> > > +     * or pending signal we'll still get (at least for v5.8.0 kernel)
> > > +     * VM_FAULT_RETRY or (VM_FAULT_RETRY | VM_FAULT_MAJOR) here.
> > > +     * Anyhow, handle_userfault() would be re-entered if such case happens,
> > > +     * keeping initial timestamp unchanged for the faulting thread.
> > AFAIU this comment is not matching what the code does.  But I agree it's not a
> > big problem because we won't miss any long delays (because the one long delayed
> > sample will just be split into two or multiple delays, which will still be
> > reflected in the histogram at last).  Or am I wrong?
>
> Mm, not really sure about comment.. I need to read kernel code again.

Not relevant to kernel; I was only talking about the last sentence where we
won't "keeping initial timestamp unchanged" but we'll do the statistic anyways.
Because exactly as you said we'll get VM_FAULT_RETRY unconditionally while we
won't be able to identify whether the page fault request is resolved or not.
On 21.01.2021 18:37, Peter Xu wrote:
> On Thu, Jan 21, 2021 at 04:12:23PM +0300, Andrey Gruzdev wrote:
>>>> +/* KRETPROBE for handle_userfault(). */
>>>> +int retprobe_handle_userfault(struct pt_regs *ctx)
>>>> +{
>>>> +    u64 pid = (u32) bpf_get_current_pid_tgid();
>>>> +    u64 *addr_p;
>>>> +
>>>> +    /*
>>>> +     * Here we just ignore the return value. In case of spurious wakeup
>>>> +     * or pending signal we'll still get (at least for v5.8.0 kernel)
>>>> +     * VM_FAULT_RETRY or (VM_FAULT_RETRY | VM_FAULT_MAJOR) here.
>>>> +     * Anyhow, handle_userfault() would be re-entered if such case happens,
>>>> +     * keeping initial timestamp unchanged for the faulting thread.
>>> AFAIU this comment is not matching what the code does.  But I agree it's not a
>>> big problem because we won't miss any long delays (because the one long delayed
>>> sample will just be split into two or multiple delays, which will still be
>>> reflected in the histogram at last).  Or am I wrong?
>> Mm, not really sure about comment.. I need to read kernel code again.
> Not relevant to kernel; I was only talking about the last sentence where we
> won't "keeping initial timestamp unchanged" but we'll do the statistic anyways.
> Because exactly as you said we'll get VM_FAULT_RETRY unconditionally while we
> won't be able to identify whether the page fault request is resolved or not.

Yep, agree.

My point is also that trying to get the timestamp of real PF resolution is
complicated and not very reasonable, since a sequence of soft-bit and hard-bit
modifications to the in-memory paging structures occurs that is not immediately
reflected in the particular TLB entry. But for our statistics this level of
accuracy is OK, I think.
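[To make the splitting effect discussed above concrete, with made-up numbers:
a single 9 ms write fault whose wait is interrupted once by a spurious wakeup
at the 4 ms mark shows up in the script as two samples, 4 ms and 5 ms. Since
bpf_log2l(delta / 1000000) buckets by log2 of milliseconds, both land in the
4 -> 7 ms bucket rather than one count in the 8 -> 15 ms bucket; no latency
is lost, it is just spread across lower buckets.]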
diff --git a/scripts/userfaultfd-wrlat.py b/scripts/userfaultfd-wrlat.py
new file mode 100755
index 0000000000..5ffd3c6c9a
--- /dev/null
+++ b/scripts/userfaultfd-wrlat.py
@@ -0,0 +1,148 @@
+#!/usr/bin/python3
+#
+# userfaultfd-wrlat  Summarize userfaultfd write fault latencies.
+#                    Events are continuously accumulated for the
+#                    run, while latency distribution histogram is
+#                    dumped each 'interval' seconds.
+#
+#                    For Linux, uses BCC, eBPF.
+#
+# USAGE: userfaultfd-lat [interval [count]]
+#
+# Copyright Virtuozzo GmbH, 2020
+#
+# Authors:
+#   Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
+#
+# This work is licensed under the terms of the GNU GPL, version 2 or
+# later.  See the COPYING file in the top-level directory.
+
+from __future__ import print_function
+from bcc import BPF
+from ctypes import c_ushort, c_int, c_ulonglong
+from time import sleep
+from sys import argv
+
+def usage():
+    print("USAGE: %s [interval [count]]" % argv[0])
+    exit()
+
+# define BPF program
+bpf_text = """
+#include <uapi/linux/ptrace.h>
+#include <linux/mm.h>
+
+/*
+ * UFFD page fault event descriptor.
+ * Used as a key to BPF_HASH table.
+ */
+struct ev_desc {
+    u64 pid;
+    u64 addr;
+};
+
+BPF_HASH(ev_start, struct ev_desc, u64);
+BPF_HASH(ctx_handle_userfault, u64, u64);
+BPF_HISTOGRAM(ev_delta_hist, u64);
+
+/* Trace UFFD page fault start event. */
+static void do_event_start(u64 pid, u64 address)
+{
+    struct ev_desc desc = { .pid = pid, .addr = address };
+    u64 ts = bpf_ktime_get_ns();
+
+    ev_start.insert(&desc, &ts);
+}
+
+/* Trace UFFD page fault end event. */
+static void do_event_end(u64 pid, u64 address)
+{
+    struct ev_desc desc = { .pid = pid, .addr = address };
+    u64 ts = bpf_ktime_get_ns();
+    u64 *tsp;
+
+    tsp = ev_start.lookup(&desc);
+    if (tsp) {
+        u64 delta = ts - (*tsp);
+        /* Transform time delta to milliseconds */
+        ev_delta_hist.increment(bpf_log2l(delta / 1000000));
+        ev_start.delete(&desc);
+    }
+}
+
+/* KPROBE for handle_userfault(). */
+int probe_handle_userfault(struct pt_regs *ctx, struct vm_fault *vmf,
+                           unsigned long reason)
+{
+    /* Trace only UFFD write faults. */
+    if (reason & VM_UFFD_WP) {
+        u64 pid = (u32) bpf_get_current_pid_tgid();
+        u64 addr = vmf->address;
+
+        do_event_start(pid, addr);
+        ctx_handle_userfault.update(&pid, &addr);
+    }
+    return 0;
+}
+
+/* KRETPROBE for handle_userfault(). */
+int retprobe_handle_userfault(struct pt_regs *ctx)
+{
+    u64 pid = (u32) bpf_get_current_pid_tgid();
+    u64 *addr_p;
+
+    /*
+     * Here we just ignore the return value. In case of spurious wakeup
+     * or pending signal we'll still get (at least for v5.8.0 kernel)
+     * VM_FAULT_RETRY or (VM_FAULT_RETRY | VM_FAULT_MAJOR) here.
+     * Anyhow, handle_userfault() would be re-entered if such case happens,
+     * keeping initial timestamp unchanged for the faulting thread.
+     */
+    addr_p = ctx_handle_userfault.lookup(&pid);
+    if (addr_p) {
+        do_event_end(pid, *addr_p);
+        ctx_handle_userfault.delete(&pid);
+    }
+    return 0;
+}
+"""
+
+# arguments
+interval = 10
+count = -1
+if len(argv) > 1:
+    try:
+        interval = int(argv[1])
+        if interval == 0:
+            raise
+        if len(argv) > 2:
+            count = int(argv[2])
+    except:    # also catches -h, --help
+        usage()
+
+# load BPF program
+b = BPF(text=bpf_text)
+# attach KRPOBEs
+b.attach_kprobe(event="handle_userfault", fn_name="probe_handle_userfault")
+b.attach_kretprobe(event="handle_userfault",
+                   fn_name="retprobe_handle_userfault")
+
+# header
+print("Tracing UFFD-WP write fault latency... Hit Ctrl-C to end.")
+
+# output
+loop = 0
+do_exit = 0
+while (1):
+    if count > 0:
+        loop += 1
+        if loop > count:
+            exit()
+    try:
+        sleep(interval)
+    except KeyboardInterrupt:
+        pass; do_exit = 1
+
+    print()
+    b["ev_delta_hist"].print_log2_hist("msecs")
+    if do_exit:
+        exit()