Message ID | 20220610112648.29695-2-quentin@isovalent.com (mailing list archive) |
---|---|
State | Accepted |
Commit | 6b4384ff108874cf336fe2fb1633313c2c7620bf |
Delegated to: | BPF |
Headers | show |
Series | bpftool: Restore memlock rlimit bump | expand |
On 06/10, Quentin Monnet wrote: > This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8. > In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of > RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the > kernel has switched to memcg-based memory accounting. Thanks to the > LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility > with other systems and ask libbpf to raise the limit for us if > necessary. > How do we know if memcg-based accounting is supported? There is a probe > in libbpf to check this. But this probe currently relies on the > availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which > landed in the same kernel version as the memory accounting change. This > works in the generic case, but it may fail, for example, if the helper > function has been backported to an older kernel. This has been observed > for Google Cloud's Container-Optimized OS (COS), where the helper is > available but rlimit is still in use. The probe succeeds, the rlimit is > not raised, and probing features with bpftool, for example, fails. > A patch was submitted [0] to update this probe in libbpf, based on what > the cilium/ebpf Go library does [1]. It would lower the soft rlimit to > 0, attempt to load a BPF object, and reset the rlimit. But it may induce > some hard-to-debug flakiness if another process starts, or the current > application is killed, while the rlimit is reduced, and the approach was > discarded. > As a workaround to ensure that the rlimit bump does not depend on the > availability of a given helper, we restore the unconditional rlimit bump > in bpftool for now. > [0] > https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/ > [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39 > Cc: Yafang Shao <laoar.shao@gmail.com> > Signed-off-by: Quentin Monnet <quentin@isovalent.com> > --- > tools/bpf/bpftool/common.c | 8 ++++++++ > tools/bpf/bpftool/feature.c | 2 ++ > tools/bpf/bpftool/main.c | 6 +++--- > tools/bpf/bpftool/main.h | 2 ++ > tools/bpf/bpftool/map.c | 2 ++ > tools/bpf/bpftool/pids.c | 1 + > tools/bpf/bpftool/prog.c | 3 +++ > tools/bpf/bpftool/struct_ops.c | 2 ++ > 8 files changed, 23 insertions(+), 3 deletions(-) > diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c > index a45b42ee8ab0..a0d4acd7c54a 100644 > --- a/tools/bpf/bpftool/common.c > +++ b/tools/bpf/bpftool/common.c > @@ -17,6 +17,7 @@ > #include <linux/magic.h> > #include <net/if.h> > #include <sys/mount.h> > +#include <sys/resource.h> > #include <sys/stat.h> > #include <sys/vfs.h> > @@ -72,6 +73,13 @@ static bool is_bpffs(char *path) > return (unsigned long)st_fs.f_type == BPF_FS_MAGIC; > } > +void set_max_rlimit(void) > +{ > + struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY }; > + > + setrlimit(RLIMIT_MEMLOCK, &rinf); Do you think it might make sense to print to stderr some warning if we actually happen to adjust this limit? if (getrlimit(MEMLOCK) != RLIM_INFINITY) { fprintf(stderr, "Warning: resetting MEMLOCK rlimit to infinity!\n"); setrlimit(RLIMIT_MEMLOCK, &rinf); } ? Because while it's nice that we automatically do this, this might still lead to surprises for some users. OTOH, not sure whether people actually read those warnings? :-/ > +} > + > static int > mnt_fs(const char *target, const char *type, char *buff, size_t bufflen) > { > diff --git a/tools/bpf/bpftool/feature.c b/tools/bpf/bpftool/feature.c > index cc9e4df8c58e..bac4ef428a02 100644 > --- a/tools/bpf/bpftool/feature.c > +++ b/tools/bpf/bpftool/feature.c > @@ -1167,6 +1167,8 @@ static int do_probe(int argc, char **argv) > __u32 ifindex = 0; > char *ifname; > + set_max_rlimit(); > + > while (argc) { > if (is_prefix(*argv, "kernel")) { > if (target != COMPONENT_UNSPEC) { > diff --git a/tools/bpf/bpftool/main.c b/tools/bpf/bpftool/main.c > index 9062ef2b8767..e81227761f5d 100644 > --- a/tools/bpf/bpftool/main.c > +++ b/tools/bpf/bpftool/main.c > @@ -507,9 +507,9 @@ int main(int argc, char **argv) > * It will still be rejected if users use LIBBPF_STRICT_ALL > * mode for loading generated skeleton. > */ > - libbpf_set_strict_mode(LIBBPF_STRICT_ALL & > ~LIBBPF_STRICT_MAP_DEFINITIONS); > - } else { > - libbpf_set_strict_mode(LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK); > + ret = libbpf_set_strict_mode(LIBBPF_STRICT_ALL & > ~LIBBPF_STRICT_MAP_DEFINITIONS); > + if (ret) > + p_err("failed to enable libbpf strict mode: %d", ret); > } > argc -= optind; > diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h > index 6c311f47147e..589cb76b227a 100644 > --- a/tools/bpf/bpftool/main.h > +++ b/tools/bpf/bpftool/main.h > @@ -96,6 +96,8 @@ int detect_common_prefix(const char *arg, ...); > void fprint_hex(FILE *f, void *arg, unsigned int n, const char *sep); > void usage(void) __noreturn; > +void set_max_rlimit(void); > + > int mount_tracefs(const char *target); > struct obj_ref { > diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c > index 800834be1bcb..38b6bc9c26c3 100644 > --- a/tools/bpf/bpftool/map.c > +++ b/tools/bpf/bpftool/map.c > @@ -1326,6 +1326,8 @@ static int do_create(int argc, char **argv) > goto exit; > } > + set_max_rlimit(); > + > fd = bpf_map_create(map_type, map_name, key_size, value_size, > max_entries, &attr); > if (fd < 0) { > p_err("map create failed: %s", strerror(errno)); > diff --git a/tools/bpf/bpftool/pids.c b/tools/bpf/bpftool/pids.c > index e2d00d3cd868..bb6c969a114a 100644 > --- a/tools/bpf/bpftool/pids.c > +++ b/tools/bpf/bpftool/pids.c > @@ -108,6 +108,7 @@ int build_obj_refs_table(struct hashmap **map, enum > bpf_obj_type type) > p_err("failed to create hashmap for PID references"); > return -1; > } > + set_max_rlimit(); > skel = pid_iter_bpf__open(); > if (!skel) { > diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c > index e71f0b2da50b..f081de398b60 100644 > --- a/tools/bpf/bpftool/prog.c > +++ b/tools/bpf/bpftool/prog.c > @@ -1590,6 +1590,8 @@ static int load_with_options(int argc, char **argv, > bool first_prog_only) > } > } > + set_max_rlimit(); > + > if (verifier_logs) > /* log_level1 + log_level2 + stats, but not stable UAPI */ > open_opts.kernel_log_level = 1 + 2 + 4; > @@ -2287,6 +2289,7 @@ static int do_profile(int argc, char **argv) > } > } > + set_max_rlimit(); > err = profiler_bpf__load(profile_obj); > if (err) { > p_err("failed to load profile_obj"); > diff --git a/tools/bpf/bpftool/struct_ops.c > b/tools/bpf/bpftool/struct_ops.c > index 2535f079ed67..e08a6ff2866c 100644 > --- a/tools/bpf/bpftool/struct_ops.c > +++ b/tools/bpf/bpftool/struct_ops.c > @@ -501,6 +501,8 @@ static int do_register(int argc, char **argv) > if (libbpf_get_error(obj)) > return -1; > + set_max_rlimit(); > + > if (bpf_object__load(obj)) { > bpf_object__close(obj); > return -1; > -- > 2.34.1
2022-06-10 09:07 UTC-0700 ~ sdf@google.com > On 06/10, Quentin Monnet wrote: >> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8. > >> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of >> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the >> kernel has switched to memcg-based memory accounting. Thanks to the >> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility >> with other systems and ask libbpf to raise the limit for us if >> necessary. > >> How do we know if memcg-based accounting is supported? There is a probe >> in libbpf to check this. But this probe currently relies on the >> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which >> landed in the same kernel version as the memory accounting change. This >> works in the generic case, but it may fail, for example, if the helper >> function has been backported to an older kernel. This has been observed >> for Google Cloud's Container-Optimized OS (COS), where the helper is >> available but rlimit is still in use. The probe succeeds, the rlimit is >> not raised, and probing features with bpftool, for example, fails. > >> A patch was submitted [0] to update this probe in libbpf, based on what >> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to >> 0, attempt to load a BPF object, and reset the rlimit. But it may induce >> some hard-to-debug flakiness if another process starts, or the current >> application is killed, while the rlimit is reduced, and the approach was >> discarded. > >> As a workaround to ensure that the rlimit bump does not depend on the >> availability of a given helper, we restore the unconditional rlimit bump >> in bpftool for now. > >> [0] >> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/ >> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39 > >> Cc: Yafang Shao <laoar.shao@gmail.com> >> Signed-off-by: Quentin Monnet <quentin@isovalent.com> >> --- >> tools/bpf/bpftool/common.c | 8 ++++++++ >> tools/bpf/bpftool/feature.c | 2 ++ >> tools/bpf/bpftool/main.c | 6 +++--- >> tools/bpf/bpftool/main.h | 2 ++ >> tools/bpf/bpftool/map.c | 2 ++ >> tools/bpf/bpftool/pids.c | 1 + >> tools/bpf/bpftool/prog.c | 3 +++ >> tools/bpf/bpftool/struct_ops.c | 2 ++ >> 8 files changed, 23 insertions(+), 3 deletions(-) > >> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c >> index a45b42ee8ab0..a0d4acd7c54a 100644 >> --- a/tools/bpf/bpftool/common.c >> +++ b/tools/bpf/bpftool/common.c >> @@ -17,6 +17,7 @@ >> #include <linux/magic.h> >> #include <net/if.h> >> #include <sys/mount.h> >> +#include <sys/resource.h> >> #include <sys/stat.h> >> #include <sys/vfs.h> > >> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path) >> return (unsigned long)st_fs.f_type == BPF_FS_MAGIC; >> } > >> +void set_max_rlimit(void) >> +{ >> + struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY }; >> + >> + setrlimit(RLIMIT_MEMLOCK, &rinf); > > Do you think it might make sense to print to stderr some warning if > we actually happen to adjust this limit? > > if (getrlimit(MEMLOCK) != RLIM_INFINITY) { > fprintf(stderr, "Warning: resetting MEMLOCK rlimit to > infinity!\n"); > setrlimit(RLIMIT_MEMLOCK, &rinf); > } > > ? > > Because while it's nice that we automatically do this, this might still > lead to surprises for some users. OTOH, not sure whether people > actually read those warnings? :-/ I'm not strictly opposed to a warning, but I'm not completely sure this is desirable. Bpftool has raised the rlimit for a long time, it changed only in April, so I don't think it would come up as a surprise for people who have used it for a while. I think this is also something that several other BPF-related applications (BCC I think?, bpftrace, Cilium come to mind) have been doing too. For new users, I agree the warning may be helpful. But then the message is likely to appear the very first time a user runs the command - likely as root - and I fear this might worry people not familiar with rlimits, who would wonder if they just broke something on their system? Maybe with a different phrasing. Alternatively we could document it in the relevant man pages (not that people would see it better, but at least it would be mentioned somewhere if people take the time to read the docs)? What do you think? Quentin
On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@isovalent.com> wrote: > > 2022-06-10 09:07 UTC-0700 ~ sdf@google.com > > On 06/10, Quentin Monnet wrote: > >> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8. > > > >> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of > >> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the > >> kernel has switched to memcg-based memory accounting. Thanks to the > >> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility > >> with other systems and ask libbpf to raise the limit for us if > >> necessary. > > > >> How do we know if memcg-based accounting is supported? There is a probe > >> in libbpf to check this. But this probe currently relies on the > >> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which > >> landed in the same kernel version as the memory accounting change. This > >> works in the generic case, but it may fail, for example, if the helper > >> function has been backported to an older kernel. This has been observed > >> for Google Cloud's Container-Optimized OS (COS), where the helper is > >> available but rlimit is still in use. The probe succeeds, the rlimit is > >> not raised, and probing features with bpftool, for example, fails. > > > >> A patch was submitted [0] to update this probe in libbpf, based on what > >> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to > >> 0, attempt to load a BPF object, and reset the rlimit. But it may induce > >> some hard-to-debug flakiness if another process starts, or the current > >> application is killed, while the rlimit is reduced, and the approach was > >> discarded. > > > >> As a workaround to ensure that the rlimit bump does not depend on the > >> availability of a given helper, we restore the unconditional rlimit bump > >> in bpftool for now. > > > >> [0] > >> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/ > >> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39 > > > >> Cc: Yafang Shao <laoar.shao@gmail.com> > >> Signed-off-by: Quentin Monnet <quentin@isovalent.com> > >> --- > >> tools/bpf/bpftool/common.c | 8 ++++++++ > >> tools/bpf/bpftool/feature.c | 2 ++ > >> tools/bpf/bpftool/main.c | 6 +++--- > >> tools/bpf/bpftool/main.h | 2 ++ > >> tools/bpf/bpftool/map.c | 2 ++ > >> tools/bpf/bpftool/pids.c | 1 + > >> tools/bpf/bpftool/prog.c | 3 +++ > >> tools/bpf/bpftool/struct_ops.c | 2 ++ > >> 8 files changed, 23 insertions(+), 3 deletions(-) > > > >> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c > >> index a45b42ee8ab0..a0d4acd7c54a 100644 > >> --- a/tools/bpf/bpftool/common.c > >> +++ b/tools/bpf/bpftool/common.c > >> @@ -17,6 +17,7 @@ > >> #include <linux/magic.h> > >> #include <net/if.h> > >> #include <sys/mount.h> > >> +#include <sys/resource.h> > >> #include <sys/stat.h> > >> #include <sys/vfs.h> > > > >> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path) > >> return (unsigned long)st_fs.f_type == BPF_FS_MAGIC; > >> } > > > >> +void set_max_rlimit(void) > >> +{ > >> + struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY }; > >> + > >> + setrlimit(RLIMIT_MEMLOCK, &rinf); > > > > Do you think it might make sense to print to stderr some warning if > > we actually happen to adjust this limit? > > > > if (getrlimit(MEMLOCK) != RLIM_INFINITY) { > > fprintf(stderr, "Warning: resetting MEMLOCK rlimit to > > infinity!\n"); > > setrlimit(RLIMIT_MEMLOCK, &rinf); > > } > > > > ? > > > > Because while it's nice that we automatically do this, this might still > > lead to surprises for some users. OTOH, not sure whether people > > actually read those warnings? :-/ > > I'm not strictly opposed to a warning, but I'm not completely sure this > is desirable. > > Bpftool has raised the rlimit for a long time, it changed only in April, > so I don't think it would come up as a surprise for people who have used > it for a while. I think this is also something that several other > BPF-related applications (BCC I think?, bpftrace, Cilium come to mind) > have been doing too. In this case ignore me and let's continue doing that :-) Btw, eventually we'd still like to stop doing that I'd presume? Should we at some point follow up with something like: if (kernel_version >= 5.11) { don't touch memlock; } ? I guess we care only about <5.11 because of the backports, but 5.11+ kernels are guaranteed to have memcg. I'm not sure whether memlock is used out there in the distros (and especially for root/bpf_capable), so I'm also not sure whether we really care or not. > For new users, I agree the warning may be helpful. But then the message > is likely to appear the very first time a user runs the command - likely > as root - and I fear this might worry people not familiar with rlimits, > who would wonder if they just broke something on their system? Maybe > with a different phrasing. > > Alternatively we could document it in the relevant man pages (not that > people would see it better, but at least it would be mentioned somewhere > if people take the time to read the docs)? What do you think? > > Quentin
2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@google.com> > On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@isovalent.com> wrote: >> >> 2022-06-10 09:07 UTC-0700 ~ sdf@google.com >>> On 06/10, Quentin Monnet wrote: >>>> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8. >>> >>>> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of >>>> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the >>>> kernel has switched to memcg-based memory accounting. Thanks to the >>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility >>>> with other systems and ask libbpf to raise the limit for us if >>>> necessary. >>> >>>> How do we know if memcg-based accounting is supported? There is a probe >>>> in libbpf to check this. But this probe currently relies on the >>>> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which >>>> landed in the same kernel version as the memory accounting change. This >>>> works in the generic case, but it may fail, for example, if the helper >>>> function has been backported to an older kernel. This has been observed >>>> for Google Cloud's Container-Optimized OS (COS), where the helper is >>>> available but rlimit is still in use. The probe succeeds, the rlimit is >>>> not raised, and probing features with bpftool, for example, fails. >>> >>>> A patch was submitted [0] to update this probe in libbpf, based on what >>>> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to >>>> 0, attempt to load a BPF object, and reset the rlimit. But it may induce >>>> some hard-to-debug flakiness if another process starts, or the current >>>> application is killed, while the rlimit is reduced, and the approach was >>>> discarded. >>> >>>> As a workaround to ensure that the rlimit bump does not depend on the >>>> availability of a given helper, we restore the unconditional rlimit bump >>>> in bpftool for now. >>> >>>> [0] >>>> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/ >>>> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39 >>> >>>> Cc: Yafang Shao <laoar.shao@gmail.com> >>>> Signed-off-by: Quentin Monnet <quentin@isovalent.com> >>>> --- >>>> tools/bpf/bpftool/common.c | 8 ++++++++ >>>> tools/bpf/bpftool/feature.c | 2 ++ >>>> tools/bpf/bpftool/main.c | 6 +++--- >>>> tools/bpf/bpftool/main.h | 2 ++ >>>> tools/bpf/bpftool/map.c | 2 ++ >>>> tools/bpf/bpftool/pids.c | 1 + >>>> tools/bpf/bpftool/prog.c | 3 +++ >>>> tools/bpf/bpftool/struct_ops.c | 2 ++ >>>> 8 files changed, 23 insertions(+), 3 deletions(-) >>> >>>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c >>>> index a45b42ee8ab0..a0d4acd7c54a 100644 >>>> --- a/tools/bpf/bpftool/common.c >>>> +++ b/tools/bpf/bpftool/common.c >>>> @@ -17,6 +17,7 @@ >>>> #include <linux/magic.h> >>>> #include <net/if.h> >>>> #include <sys/mount.h> >>>> +#include <sys/resource.h> >>>> #include <sys/stat.h> >>>> #include <sys/vfs.h> >>> >>>> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path) >>>> return (unsigned long)st_fs.f_type == BPF_FS_MAGIC; >>>> } >>> >>>> +void set_max_rlimit(void) >>>> +{ >>>> + struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY }; >>>> + >>>> + setrlimit(RLIMIT_MEMLOCK, &rinf); >>> >>> Do you think it might make sense to print to stderr some warning if >>> we actually happen to adjust this limit? >>> >>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) { >>> fprintf(stderr, "Warning: resetting MEMLOCK rlimit to >>> infinity!\n"); >>> setrlimit(RLIMIT_MEMLOCK, &rinf); >>> } >>> >>> ? >>> >>> Because while it's nice that we automatically do this, this might still >>> lead to surprises for some users. OTOH, not sure whether people >>> actually read those warnings? :-/ >> >> I'm not strictly opposed to a warning, but I'm not completely sure this >> is desirable. >> >> Bpftool has raised the rlimit for a long time, it changed only in April, >> so I don't think it would come up as a surprise for people who have used >> it for a while. I think this is also something that several other >> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind) >> have been doing too. > > In this case ignore me and let's continue doing that :-) > > Btw, eventually we'd still like to stop doing that I'd presume? Agreed. I was thinking either finding a way to improve the probe in libbpf, or waiting for some more time until 5.11 gets old, but this may take years :/ > Should > we at some point follow up with something like: > > if (kernel_version >= 5.11) { don't touch memlock; } > > ? > > I guess we care only about <5.11 because of the backports, but 5.11+ > kernels are guaranteed to have memcg. You mean from uname() and parsing the release? Yes I suppose we could do that, can do as a follow-up. > > I'm not sure whether memlock is used out there in the distros (and > especially for root/bpf_capable), so I'm also not sure whether we > really care or not. Not sure either. For what it's worth, I've never seen complaints so far from users about the rlimit being raised (from bpftool or other BPF apps).
On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@isovalent.com> wrote: > > 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@google.com> > > On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@isovalent.com> wrote: > >> > >> 2022-06-10 09:07 UTC-0700 ~ sdf@google.com > >>> On 06/10, Quentin Monnet wrote: > >>>> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8. > >>> > >>>> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of > >>>> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the > >>>> kernel has switched to memcg-based memory accounting. Thanks to the > >>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility > >>>> with other systems and ask libbpf to raise the limit for us if > >>>> necessary. > >>> > >>>> How do we know if memcg-based accounting is supported? There is a probe > >>>> in libbpf to check this. But this probe currently relies on the > >>>> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which > >>>> landed in the same kernel version as the memory accounting change. This > >>>> works in the generic case, but it may fail, for example, if the helper > >>>> function has been backported to an older kernel. This has been observed > >>>> for Google Cloud's Container-Optimized OS (COS), where the helper is > >>>> available but rlimit is still in use. The probe succeeds, the rlimit is > >>>> not raised, and probing features with bpftool, for example, fails. > >>> > >>>> A patch was submitted [0] to update this probe in libbpf, based on what > >>>> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to > >>>> 0, attempt to load a BPF object, and reset the rlimit. But it may induce > >>>> some hard-to-debug flakiness if another process starts, or the current > >>>> application is killed, while the rlimit is reduced, and the approach was > >>>> discarded. > >>> > >>>> As a workaround to ensure that the rlimit bump does not depend on the > >>>> availability of a given helper, we restore the unconditional rlimit bump > >>>> in bpftool for now. > >>> > >>>> [0] > >>>> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/ > >>>> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39 > >>> > >>>> Cc: Yafang Shao <laoar.shao@gmail.com> > >>>> Signed-off-by: Quentin Monnet <quentin@isovalent.com> > >>>> --- > >>>> tools/bpf/bpftool/common.c | 8 ++++++++ > >>>> tools/bpf/bpftool/feature.c | 2 ++ > >>>> tools/bpf/bpftool/main.c | 6 +++--- > >>>> tools/bpf/bpftool/main.h | 2 ++ > >>>> tools/bpf/bpftool/map.c | 2 ++ > >>>> tools/bpf/bpftool/pids.c | 1 + > >>>> tools/bpf/bpftool/prog.c | 3 +++ > >>>> tools/bpf/bpftool/struct_ops.c | 2 ++ > >>>> 8 files changed, 23 insertions(+), 3 deletions(-) > >>> > >>>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c > >>>> index a45b42ee8ab0..a0d4acd7c54a 100644 > >>>> --- a/tools/bpf/bpftool/common.c > >>>> +++ b/tools/bpf/bpftool/common.c > >>>> @@ -17,6 +17,7 @@ > >>>> #include <linux/magic.h> > >>>> #include <net/if.h> > >>>> #include <sys/mount.h> > >>>> +#include <sys/resource.h> > >>>> #include <sys/stat.h> > >>>> #include <sys/vfs.h> > >>> > >>>> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path) > >>>> return (unsigned long)st_fs.f_type == BPF_FS_MAGIC; > >>>> } > >>> > >>>> +void set_max_rlimit(void) > >>>> +{ > >>>> + struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY }; > >>>> + > >>>> + setrlimit(RLIMIT_MEMLOCK, &rinf); > >>> > >>> Do you think it might make sense to print to stderr some warning if > >>> we actually happen to adjust this limit? > >>> > >>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) { > >>> fprintf(stderr, "Warning: resetting MEMLOCK rlimit to > >>> infinity!\n"); > >>> setrlimit(RLIMIT_MEMLOCK, &rinf); > >>> } > >>> > >>> ? > >>> > >>> Because while it's nice that we automatically do this, this might still > >>> lead to surprises for some users. OTOH, not sure whether people > >>> actually read those warnings? :-/ > >> > >> I'm not strictly opposed to a warning, but I'm not completely sure this > >> is desirable. > >> > >> Bpftool has raised the rlimit for a long time, it changed only in April, > >> so I don't think it would come up as a surprise for people who have used > >> it for a while. I think this is also something that several other > >> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind) > >> have been doing too. > > > > In this case ignore me and let's continue doing that :-) > > > > Btw, eventually we'd still like to stop doing that I'd presume? > > Agreed. I was thinking either finding a way to improve the probe in > libbpf, or waiting for some more time until 5.11 gets old, but this may > take years :/ > > > Should > > we at some point follow up with something like: > > > > if (kernel_version >= 5.11) { don't touch memlock; } > > > > ? > > > > I guess we care only about <5.11 because of the backports, but 5.11+ > > kernels are guaranteed to have memcg. > > You mean from uname() and parsing the release? Yes I suppose we could do > that, can do as a follow-up. Yeah, uname-based, I don't think we can do better? Given that probing is problematic as well :-( But idk, up to you. > > I'm not sure whether memlock is used out there in the distros (and > > especially for root/bpf_capable), so I'm also not sure whether we > > really care or not. > > Not sure either. For what it's worth, I've never seen complaints so far > from users about the rlimit being raised (from bpftool or other BPF apps).
On Sat, Jun 11, 2022 at 1:17 AM Stanislav Fomichev <sdf@google.com> wrote: > > On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@isovalent.com> wrote: > > > > 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@google.com> > > > On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@isovalent.com> wrote: > > >> > > >> 2022-06-10 09:07 UTC-0700 ~ sdf@google.com > > >>> On 06/10, Quentin Monnet wrote: > > >>>> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8. > > >>> > > >>>> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of > > >>>> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the > > >>>> kernel has switched to memcg-based memory accounting. Thanks to the > > >>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility > > >>>> with other systems and ask libbpf to raise the limit for us if > > >>>> necessary. > > >>> > > >>>> How do we know if memcg-based accounting is supported? There is a probe > > >>>> in libbpf to check this. But this probe currently relies on the > > >>>> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which > > >>>> landed in the same kernel version as the memory accounting change. This > > >>>> works in the generic case, but it may fail, for example, if the helper > > >>>> function has been backported to an older kernel. This has been observed > > >>>> for Google Cloud's Container-Optimized OS (COS), where the helper is > > >>>> available but rlimit is still in use. The probe succeeds, the rlimit is > > >>>> not raised, and probing features with bpftool, for example, fails. > > >>> > > >>>> A patch was submitted [0] to update this probe in libbpf, based on what > > >>>> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to > > >>>> 0, attempt to load a BPF object, and reset the rlimit. But it may induce > > >>>> some hard-to-debug flakiness if another process starts, or the current > > >>>> application is killed, while the rlimit is reduced, and the approach was > > >>>> discarded. > > >>> > > >>>> As a workaround to ensure that the rlimit bump does not depend on the > > >>>> availability of a given helper, we restore the unconditional rlimit bump > > >>>> in bpftool for now. > > >>> > > >>>> [0] > > >>>> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/ > > >>>> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39 > > >>> > > >>>> Cc: Yafang Shao <laoar.shao@gmail.com> > > >>>> Signed-off-by: Quentin Monnet <quentin@isovalent.com> > > >>>> --- > > >>>> tools/bpf/bpftool/common.c | 8 ++++++++ > > >>>> tools/bpf/bpftool/feature.c | 2 ++ > > >>>> tools/bpf/bpftool/main.c | 6 +++--- > > >>>> tools/bpf/bpftool/main.h | 2 ++ > > >>>> tools/bpf/bpftool/map.c | 2 ++ > > >>>> tools/bpf/bpftool/pids.c | 1 + > > >>>> tools/bpf/bpftool/prog.c | 3 +++ > > >>>> tools/bpf/bpftool/struct_ops.c | 2 ++ > > >>>> 8 files changed, 23 insertions(+), 3 deletions(-) > > >>> > > >>>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c > > >>>> index a45b42ee8ab0..a0d4acd7c54a 100644 > > >>>> --- a/tools/bpf/bpftool/common.c > > >>>> +++ b/tools/bpf/bpftool/common.c > > >>>> @@ -17,6 +17,7 @@ > > >>>> #include <linux/magic.h> > > >>>> #include <net/if.h> > > >>>> #include <sys/mount.h> > > >>>> +#include <sys/resource.h> > > >>>> #include <sys/stat.h> > > >>>> #include <sys/vfs.h> > > >>> > > >>>> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path) > > >>>> return (unsigned long)st_fs.f_type == BPF_FS_MAGIC; > > >>>> } > > >>> > > >>>> +void set_max_rlimit(void) > > >>>> +{ > > >>>> + struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY }; > > >>>> + > > >>>> + setrlimit(RLIMIT_MEMLOCK, &rinf); > > >>> > > >>> Do you think it might make sense to print to stderr some warning if > > >>> we actually happen to adjust this limit? > > >>> > > >>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) { > > >>> fprintf(stderr, "Warning: resetting MEMLOCK rlimit to > > >>> infinity!\n"); > > >>> setrlimit(RLIMIT_MEMLOCK, &rinf); > > >>> } > > >>> > > >>> ? > > >>> > > >>> Because while it's nice that we automatically do this, this might still > > >>> lead to surprises for some users. OTOH, not sure whether people > > >>> actually read those warnings? :-/ > > >> > > >> I'm not strictly opposed to a warning, but I'm not completely sure this > > >> is desirable. > > >> > > >> Bpftool has raised the rlimit for a long time, it changed only in April, > > >> so I don't think it would come up as a surprise for people who have used > > >> it for a while. I think this is also something that several other > > >> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind) > > >> have been doing too. > > > > > > In this case ignore me and let's continue doing that :-) > > > > > > Btw, eventually we'd still like to stop doing that I'd presume? > > > > Agreed. I was thinking either finding a way to improve the probe in > > libbpf, or waiting for some more time until 5.11 gets old, but this may > > take years :/ > > > > > Should > > > we at some point follow up with something like: > > > > > > if (kernel_version >= 5.11) { don't touch memlock; } > > > > > > ? > > > > > > I guess we care only about <5.11 because of the backports, but 5.11+ > > > kernels are guaranteed to have memcg. > > > > You mean from uname() and parsing the release? Yes I suppose we could do > > that, can do as a follow-up. > > Yeah, uname-based, I don't think we can do better? Given that probing > is problematic as well :-( > But idk, up to you. > Agreed with the uname-based solution. Another possible solution is to probe the member 'memcg' in struct bpf_map, in case someone may backport memcg-based memory accounting, but that will be a little over-engineering. The uname-based solution is simple and can work.
2022-06-14 20:37 UTC+0800 ~ Yafang Shao <laoar.shao@gmail.com> > On Sat, Jun 11, 2022 at 1:17 AM Stanislav Fomichev <sdf@google.com> wrote: >> >> On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@isovalent.com> wrote: >>> >>> 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@google.com> >>>> On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@isovalent.com> wrote: >>>>> >>>>> 2022-06-10 09:07 UTC-0700 ~ sdf@google.com >>>>>> On 06/10, Quentin Monnet wrote: >>>>>>> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8. >>>>>> >>>>>>> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of >>>>>>> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the >>>>>>> kernel has switched to memcg-based memory accounting. Thanks to the >>>>>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility >>>>>>> with other systems and ask libbpf to raise the limit for us if >>>>>>> necessary. >>>>>> >>>>>>> How do we know if memcg-based accounting is supported? There is a probe >>>>>>> in libbpf to check this. But this probe currently relies on the >>>>>>> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which >>>>>>> landed in the same kernel version as the memory accounting change. This >>>>>>> works in the generic case, but it may fail, for example, if the helper >>>>>>> function has been backported to an older kernel. This has been observed >>>>>>> for Google Cloud's Container-Optimized OS (COS), where the helper is >>>>>>> available but rlimit is still in use. The probe succeeds, the rlimit is >>>>>>> not raised, and probing features with bpftool, for example, fails. >>>>>> >>>>>>> A patch was submitted [0] to update this probe in libbpf, based on what >>>>>>> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to >>>>>>> 0, attempt to load a BPF object, and reset the rlimit. But it may induce >>>>>>> some hard-to-debug flakiness if another process starts, or the current >>>>>>> application is killed, while the rlimit is reduced, and the approach was >>>>>>> discarded. >>>>>> >>>>>>> As a workaround to ensure that the rlimit bump does not depend on the >>>>>>> availability of a given helper, we restore the unconditional rlimit bump >>>>>>> in bpftool for now. >>>>>> >>>>>>> [0] >>>>>>> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/ >>>>>>> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39 >>>>>> >>>>>>> Cc: Yafang Shao <laoar.shao@gmail.com> >>>>>>> Signed-off-by: Quentin Monnet <quentin@isovalent.com> >>>>>>> --- >>>>>>> tools/bpf/bpftool/common.c | 8 ++++++++ >>>>>>> tools/bpf/bpftool/feature.c | 2 ++ >>>>>>> tools/bpf/bpftool/main.c | 6 +++--- >>>>>>> tools/bpf/bpftool/main.h | 2 ++ >>>>>>> tools/bpf/bpftool/map.c | 2 ++ >>>>>>> tools/bpf/bpftool/pids.c | 1 + >>>>>>> tools/bpf/bpftool/prog.c | 3 +++ >>>>>>> tools/bpf/bpftool/struct_ops.c | 2 ++ >>>>>>> 8 files changed, 23 insertions(+), 3 deletions(-) >>>>>> >>>>>>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c >>>>>>> index a45b42ee8ab0..a0d4acd7c54a 100644 >>>>>>> --- a/tools/bpf/bpftool/common.c >>>>>>> +++ b/tools/bpf/bpftool/common.c >>>>>>> @@ -17,6 +17,7 @@ >>>>>>> #include <linux/magic.h> >>>>>>> #include <net/if.h> >>>>>>> #include <sys/mount.h> >>>>>>> +#include <sys/resource.h> >>>>>>> #include <sys/stat.h> >>>>>>> #include <sys/vfs.h> >>>>>> >>>>>>> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path) >>>>>>> return (unsigned long)st_fs.f_type == BPF_FS_MAGIC; >>>>>>> } >>>>>> >>>>>>> +void set_max_rlimit(void) >>>>>>> +{ >>>>>>> + struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY }; >>>>>>> + >>>>>>> + setrlimit(RLIMIT_MEMLOCK, &rinf); >>>>>> >>>>>> Do you think it might make sense to print to stderr some warning if >>>>>> we actually happen to adjust this limit? >>>>>> >>>>>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) { >>>>>> fprintf(stderr, "Warning: resetting MEMLOCK rlimit to >>>>>> infinity!\n"); >>>>>> setrlimit(RLIMIT_MEMLOCK, &rinf); >>>>>> } >>>>>> >>>>>> ? >>>>>> >>>>>> Because while it's nice that we automatically do this, this might still >>>>>> lead to surprises for some users. OTOH, not sure whether people >>>>>> actually read those warnings? :-/ >>>>> >>>>> I'm not strictly opposed to a warning, but I'm not completely sure this >>>>> is desirable. >>>>> >>>>> Bpftool has raised the rlimit for a long time, it changed only in April, >>>>> so I don't think it would come up as a surprise for people who have used >>>>> it for a while. I think this is also something that several other >>>>> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind) >>>>> have been doing too. >>>> >>>> In this case ignore me and let's continue doing that :-) >>>> >>>> Btw, eventually we'd still like to stop doing that I'd presume? >>> >>> Agreed. I was thinking either finding a way to improve the probe in >>> libbpf, or waiting for some more time until 5.11 gets old, but this may >>> take years :/ >>> >>>> Should >>>> we at some point follow up with something like: >>>> >>>> if (kernel_version >= 5.11) { don't touch memlock; } >>>> >>>> ? >>>> >>>> I guess we care only about <5.11 because of the backports, but 5.11+ >>>> kernels are guaranteed to have memcg. >>> >>> You mean from uname() and parsing the release? Yes I suppose we could do >>> that, can do as a follow-up. >> >> Yeah, uname-based, I don't think we can do better? Given that probing >> is problematic as well :-( >> But idk, up to you. >> > > Agreed with the uname-based solution. Another possible solution is to > probe the member 'memcg' in struct bpf_map, in case someone may > backport memcg-based memory accounting, but that will be a little > over-engineering. The uname-based solution is simple and can work. > Thanks! Yes, memcg would be more complex: the struct is not exposed to user space, and BTF is not a hard dependency for bpftool. I'll work on the uname-based test as a follow-up to this set. Quentin
On 6/14/22 4:20 PM, Quentin Monnet wrote: > 2022-06-14 20:37 UTC+0800 ~ Yafang Shao <laoar.shao@gmail.com> >> On Sat, Jun 11, 2022 at 1:17 AM Stanislav Fomichev <sdf@google.com> wrote: >>> On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@isovalent.com> wrote: >>>> 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@google.com> >>>>> On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@isovalent.com> wrote: >>>>>> 2022-06-10 09:07 UTC-0700 ~ sdf@google.com >>>>>>> On 06/10, Quentin Monnet wrote: >>>>>>>> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8. >>>>>>> >>>>>>>> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of >>>>>>>> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the >>>>>>>> kernel has switched to memcg-based memory accounting. Thanks to the >>>>>>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility >>>>>>>> with other systems and ask libbpf to raise the limit for us if >>>>>>>> necessary. >>>>>>> >>>>>>>> How do we know if memcg-based accounting is supported? There is a probe >>>>>>>> in libbpf to check this. But this probe currently relies on the >>>>>>>> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which >>>>>>>> landed in the same kernel version as the memory accounting change. This >>>>>>>> works in the generic case, but it may fail, for example, if the helper >>>>>>>> function has been backported to an older kernel. This has been observed >>>>>>>> for Google Cloud's Container-Optimized OS (COS), where the helper is >>>>>>>> available but rlimit is still in use. The probe succeeds, the rlimit is >>>>>>>> not raised, and probing features with bpftool, for example, fails. >>>>>>> >>>>>>>> A patch was submitted [0] to update this probe in libbpf, based on what >>>>>>>> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to >>>>>>>> 0, attempt to load a BPF object, and reset the rlimit. But it may induce >>>>>>>> some hard-to-debug flakiness if another process starts, or the current >>>>>>>> application is killed, while the rlimit is reduced, and the approach was >>>>>>>> discarded. >>>>>>> >>>>>>>> As a workaround to ensure that the rlimit bump does not depend on the >>>>>>>> availability of a given helper, we restore the unconditional rlimit bump >>>>>>>> in bpftool for now. >>>>>>> >>>>>>>> [0] >>>>>>>> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/ >>>>>>>> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39 >>>>>>> >>>>>>>> Cc: Yafang Shao <laoar.shao@gmail.com> >>>>>>>> Signed-off-by: Quentin Monnet <quentin@isovalent.com> >>>>>>>> --- >>>>>>>> tools/bpf/bpftool/common.c | 8 ++++++++ >>>>>>>> tools/bpf/bpftool/feature.c | 2 ++ >>>>>>>> tools/bpf/bpftool/main.c | 6 +++--- >>>>>>>> tools/bpf/bpftool/main.h | 2 ++ >>>>>>>> tools/bpf/bpftool/map.c | 2 ++ >>>>>>>> tools/bpf/bpftool/pids.c | 1 + >>>>>>>> tools/bpf/bpftool/prog.c | 3 +++ >>>>>>>> tools/bpf/bpftool/struct_ops.c | 2 ++ >>>>>>>> 8 files changed, 23 insertions(+), 3 deletions(-) >>>>>>> >>>>>>>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c >>>>>>>> index a45b42ee8ab0..a0d4acd7c54a 100644 >>>>>>>> --- a/tools/bpf/bpftool/common.c >>>>>>>> +++ b/tools/bpf/bpftool/common.c >>>>>>>> @@ -17,6 +17,7 @@ >>>>>>>> #include <linux/magic.h> >>>>>>>> #include <net/if.h> >>>>>>>> #include <sys/mount.h> >>>>>>>> +#include <sys/resource.h> >>>>>>>> #include <sys/stat.h> >>>>>>>> #include <sys/vfs.h> >>>>>>> >>>>>>>> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path) >>>>>>>> return (unsigned long)st_fs.f_type == BPF_FS_MAGIC; >>>>>>>> } >>>>>>> >>>>>>>> +void set_max_rlimit(void) >>>>>>>> +{ >>>>>>>> + struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY }; >>>>>>>> + >>>>>>>> + setrlimit(RLIMIT_MEMLOCK, &rinf); >>>>>>> >>>>>>> Do you think it might make sense to print to stderr some warning if >>>>>>> we actually happen to adjust this limit? >>>>>>> >>>>>>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) { >>>>>>> fprintf(stderr, "Warning: resetting MEMLOCK rlimit to >>>>>>> infinity!\n"); >>>>>>> setrlimit(RLIMIT_MEMLOCK, &rinf); >>>>>>> } >>>>>>> >>>>>>> ? >>>>>>> >>>>>>> Because while it's nice that we automatically do this, this might still >>>>>>> lead to surprises for some users. OTOH, not sure whether people >>>>>>> actually read those warnings? :-/ >>>>>> >>>>>> I'm not strictly opposed to a warning, but I'm not completely sure this >>>>>> is desirable. >>>>>> >>>>>> Bpftool has raised the rlimit for a long time, it changed only in April, >>>>>> so I don't think it would come up as a surprise for people who have used >>>>>> it for a while. I think this is also something that several other >>>>>> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind) >>>>>> have been doing too. >>>>> >>>>> In this case ignore me and let's continue doing that :-) >>>>> >>>>> Btw, eventually we'd still like to stop doing that I'd presume? >>>> >>>> Agreed. I was thinking either finding a way to improve the probe in >>>> libbpf, or waiting for some more time until 5.11 gets old, but this may >>>> take years :/ >>>> >>>>> Should >>>>> we at some point follow up with something like: >>>>> >>>>> if (kernel_version >= 5.11) { don't touch memlock; } >>>>> >>>>> ? >>>>> >>>>> I guess we care only about <5.11 because of the backports, but 5.11+ >>>>> kernels are guaranteed to have memcg. >>>> >>>> You mean from uname() and parsing the release? Yes I suppose we could do >>>> that, can do as a follow-up. >>> >>> Yeah, uname-based, I don't think we can do better? Given that probing >>> is problematic as well :-( >>> But idk, up to you. >> >> Agreed with the uname-based solution. Another possible solution is to >> probe the member 'memcg' in struct bpf_map, in case someone may >> backport memcg-based memory accounting, but that will be a little >> over-engineering. The uname-based solution is simple and can work. > > Thanks! Yes, memcg would be more complex: the struct is not exposed to > user space, and BTF is not a hard dependency for bpftool. I'll work on > the uname-based test as a follow-up to this set. How would this work for things like RHEL? Maybe two potential workarounds ... 1) We could use a different helper for the probing and see how far we get with it.. though not great, probably still rare enough that we would run into it again. 2) Maybe we could create a temp memcg and check whether we get accounted against it on prog load (e.g. despite high rlimit)? Thanks, Daniel
On Tue, Jun 14, 2022 at 1:34 PM Daniel Borkmann <daniel@iogearbox.net> wrote: > > On 6/14/22 4:20 PM, Quentin Monnet wrote: > > 2022-06-14 20:37 UTC+0800 ~ Yafang Shao <laoar.shao@gmail.com> > >> On Sat, Jun 11, 2022 at 1:17 AM Stanislav Fomichev <sdf@google.com> wrote: > >>> On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@isovalent.com> wrote: > >>>> 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@google.com> > >>>>> On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@isovalent.com> wrote: > >>>>>> 2022-06-10 09:07 UTC-0700 ~ sdf@google.com > >>>>>>> On 06/10, Quentin Monnet wrote: > >>>>>>>> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8. > >>>>>>> > >>>>>>>> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of > >>>>>>>> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the > >>>>>>>> kernel has switched to memcg-based memory accounting. Thanks to the > >>>>>>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility > >>>>>>>> with other systems and ask libbpf to raise the limit for us if > >>>>>>>> necessary. > >>>>>>> > >>>>>>>> How do we know if memcg-based accounting is supported? There is a probe > >>>>>>>> in libbpf to check this. But this probe currently relies on the > >>>>>>>> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which > >>>>>>>> landed in the same kernel version as the memory accounting change. This > >>>>>>>> works in the generic case, but it may fail, for example, if the helper > >>>>>>>> function has been backported to an older kernel. This has been observed > >>>>>>>> for Google Cloud's Container-Optimized OS (COS), where the helper is > >>>>>>>> available but rlimit is still in use. The probe succeeds, the rlimit is > >>>>>>>> not raised, and probing features with bpftool, for example, fails. > >>>>>>> > >>>>>>>> A patch was submitted [0] to update this probe in libbpf, based on what > >>>>>>>> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to > >>>>>>>> 0, attempt to load a BPF object, and reset the rlimit. But it may induce > >>>>>>>> some hard-to-debug flakiness if another process starts, or the current > >>>>>>>> application is killed, while the rlimit is reduced, and the approach was > >>>>>>>> discarded. > >>>>>>> > >>>>>>>> As a workaround to ensure that the rlimit bump does not depend on the > >>>>>>>> availability of a given helper, we restore the unconditional rlimit bump > >>>>>>>> in bpftool for now. > >>>>>>> > >>>>>>>> [0] > >>>>>>>> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/ > >>>>>>>> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39 > >>>>>>> > >>>>>>>> Cc: Yafang Shao <laoar.shao@gmail.com> > >>>>>>>> Signed-off-by: Quentin Monnet <quentin@isovalent.com> > >>>>>>>> --- > >>>>>>>> tools/bpf/bpftool/common.c | 8 ++++++++ > >>>>>>>> tools/bpf/bpftool/feature.c | 2 ++ > >>>>>>>> tools/bpf/bpftool/main.c | 6 +++--- > >>>>>>>> tools/bpf/bpftool/main.h | 2 ++ > >>>>>>>> tools/bpf/bpftool/map.c | 2 ++ > >>>>>>>> tools/bpf/bpftool/pids.c | 1 + > >>>>>>>> tools/bpf/bpftool/prog.c | 3 +++ > >>>>>>>> tools/bpf/bpftool/struct_ops.c | 2 ++ > >>>>>>>> 8 files changed, 23 insertions(+), 3 deletions(-) > >>>>>>> > >>>>>>>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c > >>>>>>>> index a45b42ee8ab0..a0d4acd7c54a 100644 > >>>>>>>> --- a/tools/bpf/bpftool/common.c > >>>>>>>> +++ b/tools/bpf/bpftool/common.c > >>>>>>>> @@ -17,6 +17,7 @@ > >>>>>>>> #include <linux/magic.h> > >>>>>>>> #include <net/if.h> > >>>>>>>> #include <sys/mount.h> > >>>>>>>> +#include <sys/resource.h> > >>>>>>>> #include <sys/stat.h> > >>>>>>>> #include <sys/vfs.h> > >>>>>>> > >>>>>>>> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path) > >>>>>>>> return (unsigned long)st_fs.f_type == BPF_FS_MAGIC; > >>>>>>>> } > >>>>>>> > >>>>>>>> +void set_max_rlimit(void) > >>>>>>>> +{ > >>>>>>>> + struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY }; > >>>>>>>> + > >>>>>>>> + setrlimit(RLIMIT_MEMLOCK, &rinf); > >>>>>>> > >>>>>>> Do you think it might make sense to print to stderr some warning if > >>>>>>> we actually happen to adjust this limit? > >>>>>>> > >>>>>>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) { > >>>>>>> fprintf(stderr, "Warning: resetting MEMLOCK rlimit to > >>>>>>> infinity!\n"); > >>>>>>> setrlimit(RLIMIT_MEMLOCK, &rinf); > >>>>>>> } > >>>>>>> > >>>>>>> ? > >>>>>>> > >>>>>>> Because while it's nice that we automatically do this, this might still > >>>>>>> lead to surprises for some users. OTOH, not sure whether people > >>>>>>> actually read those warnings? :-/ > >>>>>> > >>>>>> I'm not strictly opposed to a warning, but I'm not completely sure this > >>>>>> is desirable. > >>>>>> > >>>>>> Bpftool has raised the rlimit for a long time, it changed only in April, > >>>>>> so I don't think it would come up as a surprise for people who have used > >>>>>> it for a while. I think this is also something that several other > >>>>>> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind) > >>>>>> have been doing too. > >>>>> > >>>>> In this case ignore me and let's continue doing that :-) > >>>>> > >>>>> Btw, eventually we'd still like to stop doing that I'd presume? > >>>> > >>>> Agreed. I was thinking either finding a way to improve the probe in > >>>> libbpf, or waiting for some more time until 5.11 gets old, but this may > >>>> take years :/ > >>>> > >>>>> Should > >>>>> we at some point follow up with something like: > >>>>> > >>>>> if (kernel_version >= 5.11) { don't touch memlock; } > >>>>> > >>>>> ? > >>>>> > >>>>> I guess we care only about <5.11 because of the backports, but 5.11+ > >>>>> kernels are guaranteed to have memcg. > >>>> > >>>> You mean from uname() and parsing the release? Yes I suppose we could do > >>>> that, can do as a follow-up. > >>> > >>> Yeah, uname-based, I don't think we can do better? Given that probing > >>> is problematic as well :-( > >>> But idk, up to you. > >> > >> Agreed with the uname-based solution. Another possible solution is to > >> probe the member 'memcg' in struct bpf_map, in case someone may > >> backport memcg-based memory accounting, but that will be a little > >> over-engineering. The uname-based solution is simple and can work. > > > > Thanks! Yes, memcg would be more complex: the struct is not exposed to > > user space, and BTF is not a hard dependency for bpftool. I'll work on > > the uname-based test as a follow-up to this set. > > How would this work for things like RHEL? Maybe two potential workarounds ... > 1) We could use a different helper for the probing and see how far we get with > it.. though not great, probably still rare enough that we would run into it > again. 2) Maybe we could create a temp memcg and check whether we get accounted > against it on prog load (e.g. despite high rlimit)? This might be dangerous as well: we don't want to move to a different cgroup and move back; if we are in a multithreaded application, other threads might potentially create resources in that temp directory. And I don't think we can fork as well since we are in a library :-(
On Tue, Jun 14, 2022 at 10:20 PM Quentin Monnet <quentin@isovalent.com> wrote: > > 2022-06-14 20:37 UTC+0800 ~ Yafang Shao <laoar.shao@gmail.com> > > On Sat, Jun 11, 2022 at 1:17 AM Stanislav Fomichev <sdf@google.com> wrote: > >> > >> On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@isovalent.com> wrote: > >>> > >>> 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@google.com> > >>>> On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@isovalent.com> wrote: > >>>>> > >>>>> 2022-06-10 09:07 UTC-0700 ~ sdf@google.com > >>>>>> On 06/10, Quentin Monnet wrote: > >>>>>>> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8. > >>>>>> > >>>>>>> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of > >>>>>>> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the > >>>>>>> kernel has switched to memcg-based memory accounting. Thanks to the > >>>>>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility > >>>>>>> with other systems and ask libbpf to raise the limit for us if > >>>>>>> necessary. > >>>>>> > >>>>>>> How do we know if memcg-based accounting is supported? There is a probe > >>>>>>> in libbpf to check this. But this probe currently relies on the > >>>>>>> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which > >>>>>>> landed in the same kernel version as the memory accounting change. This > >>>>>>> works in the generic case, but it may fail, for example, if the helper > >>>>>>> function has been backported to an older kernel. This has been observed > >>>>>>> for Google Cloud's Container-Optimized OS (COS), where the helper is > >>>>>>> available but rlimit is still in use. The probe succeeds, the rlimit is > >>>>>>> not raised, and probing features with bpftool, for example, fails. > >>>>>> > >>>>>>> A patch was submitted [0] to update this probe in libbpf, based on what > >>>>>>> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to > >>>>>>> 0, attempt to load a BPF object, and reset the rlimit. But it may induce > >>>>>>> some hard-to-debug flakiness if another process starts, or the current > >>>>>>> application is killed, while the rlimit is reduced, and the approach was > >>>>>>> discarded. > >>>>>> > >>>>>>> As a workaround to ensure that the rlimit bump does not depend on the > >>>>>>> availability of a given helper, we restore the unconditional rlimit bump > >>>>>>> in bpftool for now. > >>>>>> > >>>>>>> [0] > >>>>>>> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/ > >>>>>>> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39 > >>>>>> > >>>>>>> Cc: Yafang Shao <laoar.shao@gmail.com> > >>>>>>> Signed-off-by: Quentin Monnet <quentin@isovalent.com> > >>>>>>> --- > >>>>>>> tools/bpf/bpftool/common.c | 8 ++++++++ > >>>>>>> tools/bpf/bpftool/feature.c | 2 ++ > >>>>>>> tools/bpf/bpftool/main.c | 6 +++--- > >>>>>>> tools/bpf/bpftool/main.h | 2 ++ > >>>>>>> tools/bpf/bpftool/map.c | 2 ++ > >>>>>>> tools/bpf/bpftool/pids.c | 1 + > >>>>>>> tools/bpf/bpftool/prog.c | 3 +++ > >>>>>>> tools/bpf/bpftool/struct_ops.c | 2 ++ > >>>>>>> 8 files changed, 23 insertions(+), 3 deletions(-) > >>>>>> > >>>>>>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c > >>>>>>> index a45b42ee8ab0..a0d4acd7c54a 100644 > >>>>>>> --- a/tools/bpf/bpftool/common.c > >>>>>>> +++ b/tools/bpf/bpftool/common.c > >>>>>>> @@ -17,6 +17,7 @@ > >>>>>>> #include <linux/magic.h> > >>>>>>> #include <net/if.h> > >>>>>>> #include <sys/mount.h> > >>>>>>> +#include <sys/resource.h> > >>>>>>> #include <sys/stat.h> > >>>>>>> #include <sys/vfs.h> > >>>>>> > >>>>>>> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path) > >>>>>>> return (unsigned long)st_fs.f_type == BPF_FS_MAGIC; > >>>>>>> } > >>>>>> > >>>>>>> +void set_max_rlimit(void) > >>>>>>> +{ > >>>>>>> + struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY }; > >>>>>>> + > >>>>>>> + setrlimit(RLIMIT_MEMLOCK, &rinf); > >>>>>> > >>>>>> Do you think it might make sense to print to stderr some warning if > >>>>>> we actually happen to adjust this limit? > >>>>>> > >>>>>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) { > >>>>>> fprintf(stderr, "Warning: resetting MEMLOCK rlimit to > >>>>>> infinity!\n"); > >>>>>> setrlimit(RLIMIT_MEMLOCK, &rinf); > >>>>>> } > >>>>>> > >>>>>> ? > >>>>>> > >>>>>> Because while it's nice that we automatically do this, this might still > >>>>>> lead to surprises for some users. OTOH, not sure whether people > >>>>>> actually read those warnings? :-/ > >>>>> > >>>>> I'm not strictly opposed to a warning, but I'm not completely sure this > >>>>> is desirable. > >>>>> > >>>>> Bpftool has raised the rlimit for a long time, it changed only in April, > >>>>> so I don't think it would come up as a surprise for people who have used > >>>>> it for a while. I think this is also something that several other > >>>>> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind) > >>>>> have been doing too. > >>>> > >>>> In this case ignore me and let's continue doing that :-) > >>>> > >>>> Btw, eventually we'd still like to stop doing that I'd presume? > >>> > >>> Agreed. I was thinking either finding a way to improve the probe in > >>> libbpf, or waiting for some more time until 5.11 gets old, but this may > >>> take years :/ > >>> > >>>> Should > >>>> we at some point follow up with something like: > >>>> > >>>> if (kernel_version >= 5.11) { don't touch memlock; } > >>>> > >>>> ? > >>>> > >>>> I guess we care only about <5.11 because of the backports, but 5.11+ > >>>> kernels are guaranteed to have memcg. > >>> > >>> You mean from uname() and parsing the release? Yes I suppose we could do > >>> that, can do as a follow-up. > >> > >> Yeah, uname-based, I don't think we can do better? Given that probing > >> is problematic as well :-( > >> But idk, up to you. > >> > > > > Agreed with the uname-based solution. Another possible solution is to > > probe the member 'memcg' in struct bpf_map, in case someone may > > backport memcg-based memory accounting, but that will be a little > > over-engineering. The uname-based solution is simple and can work. > > > > Thanks! Yes, memcg would be more complex: the struct is not exposed to > user space, and BTF is not a hard dependency for bpftool. I'll work on > the uname-based test as a follow-up to this set. > After a second thought, the uname-based test may not work, because CONFIG_MEMCG_KMEM can be disabled. Maybe we can probe the member 'memcg' in struct bpf_map by parsing /sys/kernel/btf/vmlinux: [8584] STRUCT 'bpf_map' size=256 vlen=27 'ops' type_id=8659 bits_offset=0 'inner_map_meta' type_id=8587 bits_offset=64 'security' type_id=93 bits_offset=128 'map_type' type_id=8532 bits_offset=192 'key_size' type_id=36 bits_offset=224 'value_size' type_id=36 bits_offset=256 'max_entries' type_id=36 bits_offset=288 'map_extra' type_id=38 bits_offset=320 'map_flags' type_id=36 bits_offset=384 'spin_lock_off' type_id=21 bits_offset=416 'timer_off' type_id=21 bits_offset=448 'id' type_id=36 bits_offset=480 'numa_node' type_id=21 bits_offset=512 'btf_key_type_id' type_id=36 bits_offset=544 'btf_value_type_id' type_id=36 bits_offset=576 'btf_vmlinux_value_type_id' type_id=36 bits_offset=608 'btf' type_id=8660 bits_offset=640 'memcg' type_id=687 bits_offset=704 <<<< here 'name' type_id=337 bits_offset=768 'bypass_spec_v1' type_id=63 bits_offset=896 'frozen' type_id=63 bits_offset=904 'refcnt' type_id=81 bits_offset=1024 'usercnt' type_id=81 bits_offset=1088 'work' type_id=484 bits_offset=1152 'freeze_mutex' type_id=443 bits_offset=1408 'writecnt' type_id=81 bits_offset=1664 'owner' type_id=8658 bits_offset=1728 If 'memcg' exists, it is memcg-based, otherwise it is rlimit-based. WDYT?
On Wed, Jun 15, 2022 at 6:23 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > On Tue, Jun 14, 2022 at 10:20 PM Quentin Monnet <quentin@isovalent.com> wrote: > > > > 2022-06-14 20:37 UTC+0800 ~ Yafang Shao <laoar.shao@gmail.com> > > > On Sat, Jun 11, 2022 at 1:17 AM Stanislav Fomichev <sdf@google.com> wrote: > > >> > > >> On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@isovalent.com> wrote: > > >>> > > >>> 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@google.com> > > >>>> On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@isovalent.com> wrote: > > >>>>> > > >>>>> 2022-06-10 09:07 UTC-0700 ~ sdf@google.com > > >>>>>> On 06/10, Quentin Monnet wrote: > > >>>>>>> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8. > > >>>>>> > > >>>>>>> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of > > >>>>>>> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the > > >>>>>>> kernel has switched to memcg-based memory accounting. Thanks to the > > >>>>>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility > > >>>>>>> with other systems and ask libbpf to raise the limit for us if > > >>>>>>> necessary. > > >>>>>> > > >>>>>>> How do we know if memcg-based accounting is supported? There is a probe > > >>>>>>> in libbpf to check this. But this probe currently relies on the > > >>>>>>> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which > > >>>>>>> landed in the same kernel version as the memory accounting change. This > > >>>>>>> works in the generic case, but it may fail, for example, if the helper > > >>>>>>> function has been backported to an older kernel. This has been observed > > >>>>>>> for Google Cloud's Container-Optimized OS (COS), where the helper is > > >>>>>>> available but rlimit is still in use. The probe succeeds, the rlimit is > > >>>>>>> not raised, and probing features with bpftool, for example, fails. > > >>>>>> > > >>>>>>> A patch was submitted [0] to update this probe in libbpf, based on what > > >>>>>>> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to > > >>>>>>> 0, attempt to load a BPF object, and reset the rlimit. But it may induce > > >>>>>>> some hard-to-debug flakiness if another process starts, or the current > > >>>>>>> application is killed, while the rlimit is reduced, and the approach was > > >>>>>>> discarded. > > >>>>>> > > >>>>>>> As a workaround to ensure that the rlimit bump does not depend on the > > >>>>>>> availability of a given helper, we restore the unconditional rlimit bump > > >>>>>>> in bpftool for now. > > >>>>>> > > >>>>>>> [0] > > >>>>>>> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/ > > >>>>>>> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39 > > >>>>>> > > >>>>>>> Cc: Yafang Shao <laoar.shao@gmail.com> > > >>>>>>> Signed-off-by: Quentin Monnet <quentin@isovalent.com> > > >>>>>>> --- > > >>>>>>> tools/bpf/bpftool/common.c | 8 ++++++++ > > >>>>>>> tools/bpf/bpftool/feature.c | 2 ++ > > >>>>>>> tools/bpf/bpftool/main.c | 6 +++--- > > >>>>>>> tools/bpf/bpftool/main.h | 2 ++ > > >>>>>>> tools/bpf/bpftool/map.c | 2 ++ > > >>>>>>> tools/bpf/bpftool/pids.c | 1 + > > >>>>>>> tools/bpf/bpftool/prog.c | 3 +++ > > >>>>>>> tools/bpf/bpftool/struct_ops.c | 2 ++ > > >>>>>>> 8 files changed, 23 insertions(+), 3 deletions(-) > > >>>>>> > > >>>>>>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c > > >>>>>>> index a45b42ee8ab0..a0d4acd7c54a 100644 > > >>>>>>> --- a/tools/bpf/bpftool/common.c > > >>>>>>> +++ b/tools/bpf/bpftool/common.c > > >>>>>>> @@ -17,6 +17,7 @@ > > >>>>>>> #include <linux/magic.h> > > >>>>>>> #include <net/if.h> > > >>>>>>> #include <sys/mount.h> > > >>>>>>> +#include <sys/resource.h> > > >>>>>>> #include <sys/stat.h> > > >>>>>>> #include <sys/vfs.h> > > >>>>>> > > >>>>>>> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path) > > >>>>>>> return (unsigned long)st_fs.f_type == BPF_FS_MAGIC; > > >>>>>>> } > > >>>>>> > > >>>>>>> +void set_max_rlimit(void) > > >>>>>>> +{ > > >>>>>>> + struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY }; > > >>>>>>> + > > >>>>>>> + setrlimit(RLIMIT_MEMLOCK, &rinf); > > >>>>>> > > >>>>>> Do you think it might make sense to print to stderr some warning if > > >>>>>> we actually happen to adjust this limit? > > >>>>>> > > >>>>>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) { > > >>>>>> fprintf(stderr, "Warning: resetting MEMLOCK rlimit to > > >>>>>> infinity!\n"); > > >>>>>> setrlimit(RLIMIT_MEMLOCK, &rinf); > > >>>>>> } > > >>>>>> > > >>>>>> ? > > >>>>>> > > >>>>>> Because while it's nice that we automatically do this, this might still > > >>>>>> lead to surprises for some users. OTOH, not sure whether people > > >>>>>> actually read those warnings? :-/ > > >>>>> > > >>>>> I'm not strictly opposed to a warning, but I'm not completely sure this > > >>>>> is desirable. > > >>>>> > > >>>>> Bpftool has raised the rlimit for a long time, it changed only in April, > > >>>>> so I don't think it would come up as a surprise for people who have used > > >>>>> it for a while. I think this is also something that several other > > >>>>> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind) > > >>>>> have been doing too. > > >>>> > > >>>> In this case ignore me and let's continue doing that :-) > > >>>> > > >>>> Btw, eventually we'd still like to stop doing that I'd presume? > > >>> > > >>> Agreed. I was thinking either finding a way to improve the probe in > > >>> libbpf, or waiting for some more time until 5.11 gets old, but this may > > >>> take years :/ > > >>> > > >>>> Should > > >>>> we at some point follow up with something like: > > >>>> > > >>>> if (kernel_version >= 5.11) { don't touch memlock; } > > >>>> > > >>>> ? > > >>>> > > >>>> I guess we care only about <5.11 because of the backports, but 5.11+ > > >>>> kernels are guaranteed to have memcg. > > >>> > > >>> You mean from uname() and parsing the release? Yes I suppose we could do > > >>> that, can do as a follow-up. > > >> > > >> Yeah, uname-based, I don't think we can do better? Given that probing > > >> is problematic as well :-( > > >> But idk, up to you. > > >> > > > > > > Agreed with the uname-based solution. Another possible solution is to > > > probe the member 'memcg' in struct bpf_map, in case someone may > > > backport memcg-based memory accounting, but that will be a little > > > over-engineering. The uname-based solution is simple and can work. > > > > > > > Thanks! Yes, memcg would be more complex: the struct is not exposed to > > user space, and BTF is not a hard dependency for bpftool. I'll work on > > the uname-based test as a follow-up to this set. > > > > After a second thought, the uname-based test may not work, because > CONFIG_MEMCG_KMEM can be disabled. Does it matter? Regardless of whether there is memcg or not, we shouldn't touch ulimit on 5.11+ If there is no memcg, there is no bpf memory enforcement.
On Wed, Jun 15, 2022 at 11:52 PM Stanislav Fomichev <sdf@google.com> wrote: > > On Wed, Jun 15, 2022 at 6:23 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > On Tue, Jun 14, 2022 at 10:20 PM Quentin Monnet <quentin@isovalent.com> wrote: > > > > > > 2022-06-14 20:37 UTC+0800 ~ Yafang Shao <laoar.shao@gmail.com> > > > > On Sat, Jun 11, 2022 at 1:17 AM Stanislav Fomichev <sdf@google.com> wrote: > > > >> > > > >> On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@isovalent.com> wrote: > > > >>> > > > >>> 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@google.com> > > > >>>> On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@isovalent.com> wrote: > > > >>>>> > > > >>>>> 2022-06-10 09:07 UTC-0700 ~ sdf@google.com > > > >>>>>> On 06/10, Quentin Monnet wrote: > > > >>>>>>> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8. > > > >>>>>> > > > >>>>>>> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of > > > >>>>>>> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the > > > >>>>>>> kernel has switched to memcg-based memory accounting. Thanks to the > > > >>>>>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility > > > >>>>>>> with other systems and ask libbpf to raise the limit for us if > > > >>>>>>> necessary. > > > >>>>>> > > > >>>>>>> How do we know if memcg-based accounting is supported? There is a probe > > > >>>>>>> in libbpf to check this. But this probe currently relies on the > > > >>>>>>> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which > > > >>>>>>> landed in the same kernel version as the memory accounting change. This > > > >>>>>>> works in the generic case, but it may fail, for example, if the helper > > > >>>>>>> function has been backported to an older kernel. This has been observed > > > >>>>>>> for Google Cloud's Container-Optimized OS (COS), where the helper is > > > >>>>>>> available but rlimit is still in use. The probe succeeds, the rlimit is > > > >>>>>>> not raised, and probing features with bpftool, for example, fails. > > > >>>>>> > > > >>>>>>> A patch was submitted [0] to update this probe in libbpf, based on what > > > >>>>>>> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to > > > >>>>>>> 0, attempt to load a BPF object, and reset the rlimit. But it may induce > > > >>>>>>> some hard-to-debug flakiness if another process starts, or the current > > > >>>>>>> application is killed, while the rlimit is reduced, and the approach was > > > >>>>>>> discarded. > > > >>>>>> > > > >>>>>>> As a workaround to ensure that the rlimit bump does not depend on the > > > >>>>>>> availability of a given helper, we restore the unconditional rlimit bump > > > >>>>>>> in bpftool for now. > > > >>>>>> > > > >>>>>>> [0] > > > >>>>>>> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/ > > > >>>>>>> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39 > > > >>>>>> > > > >>>>>>> Cc: Yafang Shao <laoar.shao@gmail.com> > > > >>>>>>> Signed-off-by: Quentin Monnet <quentin@isovalent.com> > > > >>>>>>> --- > > > >>>>>>> tools/bpf/bpftool/common.c | 8 ++++++++ > > > >>>>>>> tools/bpf/bpftool/feature.c | 2 ++ > > > >>>>>>> tools/bpf/bpftool/main.c | 6 +++--- > > > >>>>>>> tools/bpf/bpftool/main.h | 2 ++ > > > >>>>>>> tools/bpf/bpftool/map.c | 2 ++ > > > >>>>>>> tools/bpf/bpftool/pids.c | 1 + > > > >>>>>>> tools/bpf/bpftool/prog.c | 3 +++ > > > >>>>>>> tools/bpf/bpftool/struct_ops.c | 2 ++ > > > >>>>>>> 8 files changed, 23 insertions(+), 3 deletions(-) > > > >>>>>> > > > >>>>>>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c > > > >>>>>>> index a45b42ee8ab0..a0d4acd7c54a 100644 > > > >>>>>>> --- a/tools/bpf/bpftool/common.c > > > >>>>>>> +++ b/tools/bpf/bpftool/common.c > > > >>>>>>> @@ -17,6 +17,7 @@ > > > >>>>>>> #include <linux/magic.h> > > > >>>>>>> #include <net/if.h> > > > >>>>>>> #include <sys/mount.h> > > > >>>>>>> +#include <sys/resource.h> > > > >>>>>>> #include <sys/stat.h> > > > >>>>>>> #include <sys/vfs.h> > > > >>>>>> > > > >>>>>>> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path) > > > >>>>>>> return (unsigned long)st_fs.f_type == BPF_FS_MAGIC; > > > >>>>>>> } > > > >>>>>> > > > >>>>>>> +void set_max_rlimit(void) > > > >>>>>>> +{ > > > >>>>>>> + struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY }; > > > >>>>>>> + > > > >>>>>>> + setrlimit(RLIMIT_MEMLOCK, &rinf); > > > >>>>>> > > > >>>>>> Do you think it might make sense to print to stderr some warning if > > > >>>>>> we actually happen to adjust this limit? > > > >>>>>> > > > >>>>>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) { > > > >>>>>> fprintf(stderr, "Warning: resetting MEMLOCK rlimit to > > > >>>>>> infinity!\n"); > > > >>>>>> setrlimit(RLIMIT_MEMLOCK, &rinf); > > > >>>>>> } > > > >>>>>> > > > >>>>>> ? > > > >>>>>> > > > >>>>>> Because while it's nice that we automatically do this, this might still > > > >>>>>> lead to surprises for some users. OTOH, not sure whether people > > > >>>>>> actually read those warnings? :-/ > > > >>>>> > > > >>>>> I'm not strictly opposed to a warning, but I'm not completely sure this > > > >>>>> is desirable. > > > >>>>> > > > >>>>> Bpftool has raised the rlimit for a long time, it changed only in April, > > > >>>>> so I don't think it would come up as a surprise for people who have used > > > >>>>> it for a while. I think this is also something that several other > > > >>>>> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind) > > > >>>>> have been doing too. > > > >>>> > > > >>>> In this case ignore me and let's continue doing that :-) > > > >>>> > > > >>>> Btw, eventually we'd still like to stop doing that I'd presume? > > > >>> > > > >>> Agreed. I was thinking either finding a way to improve the probe in > > > >>> libbpf, or waiting for some more time until 5.11 gets old, but this may > > > >>> take years :/ > > > >>> > > > >>>> Should > > > >>>> we at some point follow up with something like: > > > >>>> > > > >>>> if (kernel_version >= 5.11) { don't touch memlock; } > > > >>>> > > > >>>> ? > > > >>>> > > > >>>> I guess we care only about <5.11 because of the backports, but 5.11+ > > > >>>> kernels are guaranteed to have memcg. > > > >>> > > > >>> You mean from uname() and parsing the release? Yes I suppose we could do > > > >>> that, can do as a follow-up. > > > >> > > > >> Yeah, uname-based, I don't think we can do better? Given that probing > > > >> is problematic as well :-( > > > >> But idk, up to you. > > > >> > > > > > > > > Agreed with the uname-based solution. Another possible solution is to > > > > probe the member 'memcg' in struct bpf_map, in case someone may > > > > backport memcg-based memory accounting, but that will be a little > > > > over-engineering. The uname-based solution is simple and can work. > > > > > > > > > > Thanks! Yes, memcg would be more complex: the struct is not exposed to > > > user space, and BTF is not a hard dependency for bpftool. I'll work on > > > the uname-based test as a follow-up to this set. > > > > > > > After a second thought, the uname-based test may not work, because > > CONFIG_MEMCG_KMEM can be disabled. > > Does it matter? Regardless of whether there is memcg or not, we > shouldn't touch ulimit on 5.11+ > If there is no memcg, there is no bpf memory enforcement. Right, rlimit-based accounting is totally removed, that is not the same with what I thought before, while I thought it will fallback to rlimit-based if kmemcg is disabled.
2022-06-16 00:05 UTC+0800 ~ Yafang Shao <laoar.shao@gmail.com> > On Wed, Jun 15, 2022 at 11:52 PM Stanislav Fomichev <sdf@google.com> wrote: >> >> On Wed, Jun 15, 2022 at 6:23 AM Yafang Shao <laoar.shao@gmail.com> wrote: >>> >>> On Tue, Jun 14, 2022 at 10:20 PM Quentin Monnet <quentin@isovalent.com> wrote: >>>> >>>> 2022-06-14 20:37 UTC+0800 ~ Yafang Shao <laoar.shao@gmail.com> >>>>> On Sat, Jun 11, 2022 at 1:17 AM Stanislav Fomichev <sdf@google.com> wrote: >>>>>> >>>>>> On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@isovalent.com> wrote: >>>>>>> >>>>>>> 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@google.com> >>>>>>>> On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@isovalent.com> wrote: >>>>>>>>> >>>>>>>>> 2022-06-10 09:07 UTC-0700 ~ sdf@google.com >>>>>>>>>> On 06/10, Quentin Monnet wrote: >>>>>>>>>>> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8. >>>>>>>>>> >>>>>>>>>>> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of >>>>>>>>>>> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the >>>>>>>>>>> kernel has switched to memcg-based memory accounting. Thanks to the >>>>>>>>>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility >>>>>>>>>>> with other systems and ask libbpf to raise the limit for us if >>>>>>>>>>> necessary. >>>>>>>>>> >>>>>>>>>>> How do we know if memcg-based accounting is supported? There is a probe >>>>>>>>>>> in libbpf to check this. But this probe currently relies on the >>>>>>>>>>> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which >>>>>>>>>>> landed in the same kernel version as the memory accounting change. This >>>>>>>>>>> works in the generic case, but it may fail, for example, if the helper >>>>>>>>>>> function has been backported to an older kernel. This has been observed >>>>>>>>>>> for Google Cloud's Container-Optimized OS (COS), where the helper is >>>>>>>>>>> available but rlimit is still in use. The probe succeeds, the rlimit is >>>>>>>>>>> not raised, and probing features with bpftool, for example, fails. >>>>>>>>>> >>>>>>>>>>> A patch was submitted [0] to update this probe in libbpf, based on what >>>>>>>>>>> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to >>>>>>>>>>> 0, attempt to load a BPF object, and reset the rlimit. But it may induce >>>>>>>>>>> some hard-to-debug flakiness if another process starts, or the current >>>>>>>>>>> application is killed, while the rlimit is reduced, and the approach was >>>>>>>>>>> discarded. >>>>>>>>>> >>>>>>>>>>> As a workaround to ensure that the rlimit bump does not depend on the >>>>>>>>>>> availability of a given helper, we restore the unconditional rlimit bump >>>>>>>>>>> in bpftool for now. >>>>>>>>>> >>>>>>>>>>> [0] >>>>>>>>>>> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/ >>>>>>>>>>> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39 >>>>>>>>>> >>>>>>>>>>> Cc: Yafang Shao <laoar.shao@gmail.com> >>>>>>>>>>> Signed-off-by: Quentin Monnet <quentin@isovalent.com> >>>>>>>>>>> --- >>>>>>>>>>> tools/bpf/bpftool/common.c | 8 ++++++++ >>>>>>>>>>> tools/bpf/bpftool/feature.c | 2 ++ >>>>>>>>>>> tools/bpf/bpftool/main.c | 6 +++--- >>>>>>>>>>> tools/bpf/bpftool/main.h | 2 ++ >>>>>>>>>>> tools/bpf/bpftool/map.c | 2 ++ >>>>>>>>>>> tools/bpf/bpftool/pids.c | 1 + >>>>>>>>>>> tools/bpf/bpftool/prog.c | 3 +++ >>>>>>>>>>> tools/bpf/bpftool/struct_ops.c | 2 ++ >>>>>>>>>>> 8 files changed, 23 insertions(+), 3 deletions(-) >>>>>>>>>> >>>>>>>>>>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c >>>>>>>>>>> index a45b42ee8ab0..a0d4acd7c54a 100644 >>>>>>>>>>> --- a/tools/bpf/bpftool/common.c >>>>>>>>>>> +++ b/tools/bpf/bpftool/common.c >>>>>>>>>>> @@ -17,6 +17,7 @@ >>>>>>>>>>> #include <linux/magic.h> >>>>>>>>>>> #include <net/if.h> >>>>>>>>>>> #include <sys/mount.h> >>>>>>>>>>> +#include <sys/resource.h> >>>>>>>>>>> #include <sys/stat.h> >>>>>>>>>>> #include <sys/vfs.h> >>>>>>>>>> >>>>>>>>>>> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path) >>>>>>>>>>> return (unsigned long)st_fs.f_type == BPF_FS_MAGIC; >>>>>>>>>>> } >>>>>>>>>> >>>>>>>>>>> +void set_max_rlimit(void) >>>>>>>>>>> +{ >>>>>>>>>>> + struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY }; >>>>>>>>>>> + >>>>>>>>>>> + setrlimit(RLIMIT_MEMLOCK, &rinf); >>>>>>>>>> >>>>>>>>>> Do you think it might make sense to print to stderr some warning if >>>>>>>>>> we actually happen to adjust this limit? >>>>>>>>>> >>>>>>>>>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) { >>>>>>>>>> fprintf(stderr, "Warning: resetting MEMLOCK rlimit to >>>>>>>>>> infinity!\n"); >>>>>>>>>> setrlimit(RLIMIT_MEMLOCK, &rinf); >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> ? >>>>>>>>>> >>>>>>>>>> Because while it's nice that we automatically do this, this might still >>>>>>>>>> lead to surprises for some users. OTOH, not sure whether people >>>>>>>>>> actually read those warnings? :-/ >>>>>>>>> >>>>>>>>> I'm not strictly opposed to a warning, but I'm not completely sure this >>>>>>>>> is desirable. >>>>>>>>> >>>>>>>>> Bpftool has raised the rlimit for a long time, it changed only in April, >>>>>>>>> so I don't think it would come up as a surprise for people who have used >>>>>>>>> it for a while. I think this is also something that several other >>>>>>>>> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind) >>>>>>>>> have been doing too. >>>>>>>> >>>>>>>> In this case ignore me and let's continue doing that :-) >>>>>>>> >>>>>>>> Btw, eventually we'd still like to stop doing that I'd presume? >>>>>>> >>>>>>> Agreed. I was thinking either finding a way to improve the probe in >>>>>>> libbpf, or waiting for some more time until 5.11 gets old, but this may >>>>>>> take years :/ >>>>>>> >>>>>>>> Should >>>>>>>> we at some point follow up with something like: >>>>>>>> >>>>>>>> if (kernel_version >= 5.11) { don't touch memlock; } >>>>>>>> >>>>>>>> ? >>>>>>>> >>>>>>>> I guess we care only about <5.11 because of the backports, but 5.11+ >>>>>>>> kernels are guaranteed to have memcg. >>>>>>> >>>>>>> You mean from uname() and parsing the release? Yes I suppose we could do >>>>>>> that, can do as a follow-up. >>>>>> >>>>>> Yeah, uname-based, I don't think we can do better? Given that probing >>>>>> is problematic as well :-( >>>>>> But idk, up to you. >>>>>> >>>>> >>>>> Agreed with the uname-based solution. Another possible solution is to >>>>> probe the member 'memcg' in struct bpf_map, in case someone may >>>>> backport memcg-based memory accounting, but that will be a little >>>>> over-engineering. The uname-based solution is simple and can work. >>>>> >>>> >>>> Thanks! Yes, memcg would be more complex: the struct is not exposed to >>>> user space, and BTF is not a hard dependency for bpftool. I'll work on >>>> the uname-based test as a follow-up to this set. >>>> >>> >>> After a second thought, the uname-based test may not work, because >>> CONFIG_MEMCG_KMEM can be disabled. >> >> Does it matter? Regardless of whether there is memcg or not, we >> shouldn't touch ulimit on 5.11+ >> If there is no memcg, there is no bpf memory enforcement. > > Right, rlimit-based accounting is totally removed, that is not the > same with what I thought before, while I thought it will fallback to > rlimit-based if kmemcg is disabled. Agreed, and so I've got a patch ready for the uname-based probe. But talking about this with Daniel, we were wondering if it would make sense instead to have the probe I had initially submitted (lower the rlimit to 0, attempt to load a program, reset rlimit - see [0]), but only for bpftool instead of libbpf? My understanding is that the memlock rlimit is per-process, right? So this shouldn't affect any other process, and because bpftool is not multithreaded, nothing other than probing would happen while the rlimit is at zero? Or is it just simpler, if less accurate, to stick to the uname probe? Quentin [0] https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/
On Thu, Jun 16, 2022 at 9:59 PM Quentin Monnet <quentin@isovalent.com> wrote: > > 2022-06-16 00:05 UTC+0800 ~ Yafang Shao <laoar.shao@gmail.com> > > On Wed, Jun 15, 2022 at 11:52 PM Stanislav Fomichev <sdf@google.com> wrote: > >> > >> On Wed, Jun 15, 2022 at 6:23 AM Yafang Shao <laoar.shao@gmail.com> wrote: > >>> > >>> On Tue, Jun 14, 2022 at 10:20 PM Quentin Monnet <quentin@isovalent.com> wrote: > >>>> > >>>> 2022-06-14 20:37 UTC+0800 ~ Yafang Shao <laoar.shao@gmail.com> > >>>>> On Sat, Jun 11, 2022 at 1:17 AM Stanislav Fomichev <sdf@google.com> wrote: > >>>>>> > >>>>>> On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@isovalent.com> wrote: > >>>>>>> > >>>>>>> 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@google.com> > >>>>>>>> On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@isovalent.com> wrote: > >>>>>>>>> > >>>>>>>>> 2022-06-10 09:07 UTC-0700 ~ sdf@google.com > >>>>>>>>>> On 06/10, Quentin Monnet wrote: > >>>>>>>>>>> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8. > >>>>>>>>>> > >>>>>>>>>>> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of > >>>>>>>>>>> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the > >>>>>>>>>>> kernel has switched to memcg-based memory accounting. Thanks to the > >>>>>>>>>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility > >>>>>>>>>>> with other systems and ask libbpf to raise the limit for us if > >>>>>>>>>>> necessary. > >>>>>>>>>> > >>>>>>>>>>> How do we know if memcg-based accounting is supported? There is a probe > >>>>>>>>>>> in libbpf to check this. But this probe currently relies on the > >>>>>>>>>>> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which > >>>>>>>>>>> landed in the same kernel version as the memory accounting change. This > >>>>>>>>>>> works in the generic case, but it may fail, for example, if the helper > >>>>>>>>>>> function has been backported to an older kernel. This has been observed > >>>>>>>>>>> for Google Cloud's Container-Optimized OS (COS), where the helper is > >>>>>>>>>>> available but rlimit is still in use. The probe succeeds, the rlimit is > >>>>>>>>>>> not raised, and probing features with bpftool, for example, fails. > >>>>>>>>>> > >>>>>>>>>>> A patch was submitted [0] to update this probe in libbpf, based on what > >>>>>>>>>>> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to > >>>>>>>>>>> 0, attempt to load a BPF object, and reset the rlimit. But it may induce > >>>>>>>>>>> some hard-to-debug flakiness if another process starts, or the current > >>>>>>>>>>> application is killed, while the rlimit is reduced, and the approach was > >>>>>>>>>>> discarded. > >>>>>>>>>> > >>>>>>>>>>> As a workaround to ensure that the rlimit bump does not depend on the > >>>>>>>>>>> availability of a given helper, we restore the unconditional rlimit bump > >>>>>>>>>>> in bpftool for now. > >>>>>>>>>> > >>>>>>>>>>> [0] > >>>>>>>>>>> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/ > >>>>>>>>>>> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39 > >>>>>>>>>> > >>>>>>>>>>> Cc: Yafang Shao <laoar.shao@gmail.com> > >>>>>>>>>>> Signed-off-by: Quentin Monnet <quentin@isovalent.com> > >>>>>>>>>>> --- > >>>>>>>>>>> tools/bpf/bpftool/common.c | 8 ++++++++ > >>>>>>>>>>> tools/bpf/bpftool/feature.c | 2 ++ > >>>>>>>>>>> tools/bpf/bpftool/main.c | 6 +++--- > >>>>>>>>>>> tools/bpf/bpftool/main.h | 2 ++ > >>>>>>>>>>> tools/bpf/bpftool/map.c | 2 ++ > >>>>>>>>>>> tools/bpf/bpftool/pids.c | 1 + > >>>>>>>>>>> tools/bpf/bpftool/prog.c | 3 +++ > >>>>>>>>>>> tools/bpf/bpftool/struct_ops.c | 2 ++ > >>>>>>>>>>> 8 files changed, 23 insertions(+), 3 deletions(-) > >>>>>>>>>> > >>>>>>>>>>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c > >>>>>>>>>>> index a45b42ee8ab0..a0d4acd7c54a 100644 > >>>>>>>>>>> --- a/tools/bpf/bpftool/common.c > >>>>>>>>>>> +++ b/tools/bpf/bpftool/common.c > >>>>>>>>>>> @@ -17,6 +17,7 @@ > >>>>>>>>>>> #include <linux/magic.h> > >>>>>>>>>>> #include <net/if.h> > >>>>>>>>>>> #include <sys/mount.h> > >>>>>>>>>>> +#include <sys/resource.h> > >>>>>>>>>>> #include <sys/stat.h> > >>>>>>>>>>> #include <sys/vfs.h> > >>>>>>>>>> > >>>>>>>>>>> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path) > >>>>>>>>>>> return (unsigned long)st_fs.f_type == BPF_FS_MAGIC; > >>>>>>>>>>> } > >>>>>>>>>> > >>>>>>>>>>> +void set_max_rlimit(void) > >>>>>>>>>>> +{ > >>>>>>>>>>> + struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY }; > >>>>>>>>>>> + > >>>>>>>>>>> + setrlimit(RLIMIT_MEMLOCK, &rinf); > >>>>>>>>>> > >>>>>>>>>> Do you think it might make sense to print to stderr some warning if > >>>>>>>>>> we actually happen to adjust this limit? > >>>>>>>>>> > >>>>>>>>>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) { > >>>>>>>>>> fprintf(stderr, "Warning: resetting MEMLOCK rlimit to > >>>>>>>>>> infinity!\n"); > >>>>>>>>>> setrlimit(RLIMIT_MEMLOCK, &rinf); > >>>>>>>>>> } > >>>>>>>>>> > >>>>>>>>>> ? > >>>>>>>>>> > >>>>>>>>>> Because while it's nice that we automatically do this, this might still > >>>>>>>>>> lead to surprises for some users. OTOH, not sure whether people > >>>>>>>>>> actually read those warnings? :-/ > >>>>>>>>> > >>>>>>>>> I'm not strictly opposed to a warning, but I'm not completely sure this > >>>>>>>>> is desirable. > >>>>>>>>> > >>>>>>>>> Bpftool has raised the rlimit for a long time, it changed only in April, > >>>>>>>>> so I don't think it would come up as a surprise for people who have used > >>>>>>>>> it for a while. I think this is also something that several other > >>>>>>>>> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind) > >>>>>>>>> have been doing too. > >>>>>>>> > >>>>>>>> In this case ignore me and let's continue doing that :-) > >>>>>>>> > >>>>>>>> Btw, eventually we'd still like to stop doing that I'd presume? > >>>>>>> > >>>>>>> Agreed. I was thinking either finding a way to improve the probe in > >>>>>>> libbpf, or waiting for some more time until 5.11 gets old, but this may > >>>>>>> take years :/ > >>>>>>> > >>>>>>>> Should > >>>>>>>> we at some point follow up with something like: > >>>>>>>> > >>>>>>>> if (kernel_version >= 5.11) { don't touch memlock; } > >>>>>>>> > >>>>>>>> ? > >>>>>>>> > >>>>>>>> I guess we care only about <5.11 because of the backports, but 5.11+ > >>>>>>>> kernels are guaranteed to have memcg. > >>>>>>> > >>>>>>> You mean from uname() and parsing the release? Yes I suppose we could do > >>>>>>> that, can do as a follow-up. > >>>>>> > >>>>>> Yeah, uname-based, I don't think we can do better? Given that probing > >>>>>> is problematic as well :-( > >>>>>> But idk, up to you. > >>>>>> > >>>>> > >>>>> Agreed with the uname-based solution. Another possible solution is to > >>>>> probe the member 'memcg' in struct bpf_map, in case someone may > >>>>> backport memcg-based memory accounting, but that will be a little > >>>>> over-engineering. The uname-based solution is simple and can work. > >>>>> > >>>> > >>>> Thanks! Yes, memcg would be more complex: the struct is not exposed to > >>>> user space, and BTF is not a hard dependency for bpftool. I'll work on > >>>> the uname-based test as a follow-up to this set. > >>>> > >>> > >>> After a second thought, the uname-based test may not work, because > >>> CONFIG_MEMCG_KMEM can be disabled. > >> > >> Does it matter? Regardless of whether there is memcg or not, we > >> shouldn't touch ulimit on 5.11+ > >> If there is no memcg, there is no bpf memory enforcement. > > > > Right, rlimit-based accounting is totally removed, that is not the > > same with what I thought before, while I thought it will fallback to > > rlimit-based if kmemcg is disabled. > > Agreed, and so I've got a patch ready for the uname-based probe. > > But talking about this with Daniel, we were wondering if it would make > sense instead to have the probe I had initially submitted (lower the > rlimit to 0, attempt to load a program, reset rlimit - see [0]), but > only for bpftool instead of libbpf? My understanding is that the memlock > rlimit is per-process, right? So this shouldn't affect any other > process, and because bpftool is not multithreaded, nothing other than > probing would happen while the rlimit is at zero? Makes sense. It is safe to do the probe within bpftool. > Or is it just simpler, > if less accurate, to stick to the uname probe? > > Quentin > > [0] > https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/
On Thu, Jun 16, 2022 at 7:54 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > On Thu, Jun 16, 2022 at 9:59 PM Quentin Monnet <quentin@isovalent.com> wrote: > > > > 2022-06-16 00:05 UTC+0800 ~ Yafang Shao <laoar.shao@gmail.com> > > > On Wed, Jun 15, 2022 at 11:52 PM Stanislav Fomichev <sdf@google.com> wrote: > > >> > > >> On Wed, Jun 15, 2022 at 6:23 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > >>> > > >>> On Tue, Jun 14, 2022 at 10:20 PM Quentin Monnet <quentin@isovalent.com> wrote: > > >>>> > > >>>> 2022-06-14 20:37 UTC+0800 ~ Yafang Shao <laoar.shao@gmail.com> > > >>>>> On Sat, Jun 11, 2022 at 1:17 AM Stanislav Fomichev <sdf@google.com> wrote: > > >>>>>> > > >>>>>> On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@isovalent.com> wrote: > > >>>>>>> > > >>>>>>> 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@google.com> > > >>>>>>>> On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@isovalent.com> wrote: > > >>>>>>>>> > > >>>>>>>>> 2022-06-10 09:07 UTC-0700 ~ sdf@google.com > > >>>>>>>>>> On 06/10, Quentin Monnet wrote: > > >>>>>>>>>>> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8. > > >>>>>>>>>> > > >>>>>>>>>>> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of > > >>>>>>>>>>> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the > > >>>>>>>>>>> kernel has switched to memcg-based memory accounting. Thanks to the > > >>>>>>>>>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility > > >>>>>>>>>>> with other systems and ask libbpf to raise the limit for us if > > >>>>>>>>>>> necessary. > > >>>>>>>>>> > > >>>>>>>>>>> How do we know if memcg-based accounting is supported? There is a probe > > >>>>>>>>>>> in libbpf to check this. But this probe currently relies on the > > >>>>>>>>>>> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which > > >>>>>>>>>>> landed in the same kernel version as the memory accounting change. This > > >>>>>>>>>>> works in the generic case, but it may fail, for example, if the helper > > >>>>>>>>>>> function has been backported to an older kernel. This has been observed > > >>>>>>>>>>> for Google Cloud's Container-Optimized OS (COS), where the helper is > > >>>>>>>>>>> available but rlimit is still in use. The probe succeeds, the rlimit is > > >>>>>>>>>>> not raised, and probing features with bpftool, for example, fails. > > >>>>>>>>>> > > >>>>>>>>>>> A patch was submitted [0] to update this probe in libbpf, based on what > > >>>>>>>>>>> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to > > >>>>>>>>>>> 0, attempt to load a BPF object, and reset the rlimit. But it may induce > > >>>>>>>>>>> some hard-to-debug flakiness if another process starts, or the current > > >>>>>>>>>>> application is killed, while the rlimit is reduced, and the approach was > > >>>>>>>>>>> discarded. > > >>>>>>>>>> > > >>>>>>>>>>> As a workaround to ensure that the rlimit bump does not depend on the > > >>>>>>>>>>> availability of a given helper, we restore the unconditional rlimit bump > > >>>>>>>>>>> in bpftool for now. > > >>>>>>>>>> > > >>>>>>>>>>> [0] > > >>>>>>>>>>> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/ > > >>>>>>>>>>> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39 > > >>>>>>>>>> > > >>>>>>>>>>> Cc: Yafang Shao <laoar.shao@gmail.com> > > >>>>>>>>>>> Signed-off-by: Quentin Monnet <quentin@isovalent.com> > > >>>>>>>>>>> --- > > >>>>>>>>>>> tools/bpf/bpftool/common.c | 8 ++++++++ > > >>>>>>>>>>> tools/bpf/bpftool/feature.c | 2 ++ > > >>>>>>>>>>> tools/bpf/bpftool/main.c | 6 +++--- > > >>>>>>>>>>> tools/bpf/bpftool/main.h | 2 ++ > > >>>>>>>>>>> tools/bpf/bpftool/map.c | 2 ++ > > >>>>>>>>>>> tools/bpf/bpftool/pids.c | 1 + > > >>>>>>>>>>> tools/bpf/bpftool/prog.c | 3 +++ > > >>>>>>>>>>> tools/bpf/bpftool/struct_ops.c | 2 ++ > > >>>>>>>>>>> 8 files changed, 23 insertions(+), 3 deletions(-) > > >>>>>>>>>> > > >>>>>>>>>>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c > > >>>>>>>>>>> index a45b42ee8ab0..a0d4acd7c54a 100644 > > >>>>>>>>>>> --- a/tools/bpf/bpftool/common.c > > >>>>>>>>>>> +++ b/tools/bpf/bpftool/common.c > > >>>>>>>>>>> @@ -17,6 +17,7 @@ > > >>>>>>>>>>> #include <linux/magic.h> > > >>>>>>>>>>> #include <net/if.h> > > >>>>>>>>>>> #include <sys/mount.h> > > >>>>>>>>>>> +#include <sys/resource.h> > > >>>>>>>>>>> #include <sys/stat.h> > > >>>>>>>>>>> #include <sys/vfs.h> > > >>>>>>>>>> > > >>>>>>>>>>> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path) > > >>>>>>>>>>> return (unsigned long)st_fs.f_type == BPF_FS_MAGIC; > > >>>>>>>>>>> } > > >>>>>>>>>> > > >>>>>>>>>>> +void set_max_rlimit(void) > > >>>>>>>>>>> +{ > > >>>>>>>>>>> + struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY }; > > >>>>>>>>>>> + > > >>>>>>>>>>> + setrlimit(RLIMIT_MEMLOCK, &rinf); > > >>>>>>>>>> > > >>>>>>>>>> Do you think it might make sense to print to stderr some warning if > > >>>>>>>>>> we actually happen to adjust this limit? > > >>>>>>>>>> > > >>>>>>>>>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) { > > >>>>>>>>>> fprintf(stderr, "Warning: resetting MEMLOCK rlimit to > > >>>>>>>>>> infinity!\n"); > > >>>>>>>>>> setrlimit(RLIMIT_MEMLOCK, &rinf); > > >>>>>>>>>> } > > >>>>>>>>>> > > >>>>>>>>>> ? > > >>>>>>>>>> > > >>>>>>>>>> Because while it's nice that we automatically do this, this might still > > >>>>>>>>>> lead to surprises for some users. OTOH, not sure whether people > > >>>>>>>>>> actually read those warnings? :-/ > > >>>>>>>>> > > >>>>>>>>> I'm not strictly opposed to a warning, but I'm not completely sure this > > >>>>>>>>> is desirable. > > >>>>>>>>> > > >>>>>>>>> Bpftool has raised the rlimit for a long time, it changed only in April, > > >>>>>>>>> so I don't think it would come up as a surprise for people who have used > > >>>>>>>>> it for a while. I think this is also something that several other > > >>>>>>>>> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind) > > >>>>>>>>> have been doing too. > > >>>>>>>> > > >>>>>>>> In this case ignore me and let's continue doing that :-) > > >>>>>>>> > > >>>>>>>> Btw, eventually we'd still like to stop doing that I'd presume? > > >>>>>>> > > >>>>>>> Agreed. I was thinking either finding a way to improve the probe in > > >>>>>>> libbpf, or waiting for some more time until 5.11 gets old, but this may > > >>>>>>> take years :/ > > >>>>>>> > > >>>>>>>> Should > > >>>>>>>> we at some point follow up with something like: > > >>>>>>>> > > >>>>>>>> if (kernel_version >= 5.11) { don't touch memlock; } > > >>>>>>>> > > >>>>>>>> ? > > >>>>>>>> > > >>>>>>>> I guess we care only about <5.11 because of the backports, but 5.11+ > > >>>>>>>> kernels are guaranteed to have memcg. > > >>>>>>> > > >>>>>>> You mean from uname() and parsing the release? Yes I suppose we could do > > >>>>>>> that, can do as a follow-up. > > >>>>>> > > >>>>>> Yeah, uname-based, I don't think we can do better? Given that probing > > >>>>>> is problematic as well :-( > > >>>>>> But idk, up to you. > > >>>>>> > > >>>>> > > >>>>> Agreed with the uname-based solution. Another possible solution is to > > >>>>> probe the member 'memcg' in struct bpf_map, in case someone may > > >>>>> backport memcg-based memory accounting, but that will be a little > > >>>>> over-engineering. The uname-based solution is simple and can work. > > >>>>> > > >>>> > > >>>> Thanks! Yes, memcg would be more complex: the struct is not exposed to > > >>>> user space, and BTF is not a hard dependency for bpftool. I'll work on > > >>>> the uname-based test as a follow-up to this set. > > >>>> > > >>> > > >>> After a second thought, the uname-based test may not work, because > > >>> CONFIG_MEMCG_KMEM can be disabled. > > >> > > >> Does it matter? Regardless of whether there is memcg or not, we > > >> shouldn't touch ulimit on 5.11+ > > >> If there is no memcg, there is no bpf memory enforcement. > > > > > > Right, rlimit-based accounting is totally removed, that is not the > > > same with what I thought before, while I thought it will fallback to > > > rlimit-based if kmemcg is disabled. > > > > Agreed, and so I've got a patch ready for the uname-based probe. > > > > But talking about this with Daniel, we were wondering if it would make > > sense instead to have the probe I had initially submitted (lower the > > rlimit to 0, attempt to load a program, reset rlimit - see [0]), but > > only for bpftool instead of libbpf? My understanding is that the memlock > > rlimit is per-process, right? So this shouldn't affect any other > > process, and because bpftool is not multithreaded, nothing other than > > probing would happen while the rlimit is at zero? > > Makes sense. > It is safe to do the probe within bpftool. +1, seems to be safe to continue doing that in bpftool.
On Thu, Jun 16, 2022 at 11:08 AM Stanislav Fomichev <sdf@google.com> wrote: > > On Thu, Jun 16, 2022 at 7:54 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > On Thu, Jun 16, 2022 at 9:59 PM Quentin Monnet <quentin@isovalent.com> wrote: > > > > > > 2022-06-16 00:05 UTC+0800 ~ Yafang Shao <laoar.shao@gmail.com> > > > > On Wed, Jun 15, 2022 at 11:52 PM Stanislav Fomichev <sdf@google.com> wrote: > > > >> > > > >> On Wed, Jun 15, 2022 at 6:23 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > > >>> > > > >>> On Tue, Jun 14, 2022 at 10:20 PM Quentin Monnet <quentin@isovalent.com> wrote: > > > >>>> > > > >>>> 2022-06-14 20:37 UTC+0800 ~ Yafang Shao <laoar.shao@gmail.com> > > > >>>>> On Sat, Jun 11, 2022 at 1:17 AM Stanislav Fomichev <sdf@google.com> wrote: > > > >>>>>> > > > >>>>>> On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@isovalent.com> wrote: > > > >>>>>>> > > > >>>>>>> 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@google.com> > > > >>>>>>>> On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@isovalent.com> wrote: > > > >>>>>>>>> > > > >>>>>>>>> 2022-06-10 09:07 UTC-0700 ~ sdf@google.com > > > >>>>>>>>>> On 06/10, Quentin Monnet wrote: > > > >>>>>>>>>>> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8. > > > >>>>>>>>>> > > > >>>>>>>>>>> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of > > > >>>>>>>>>>> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the > > > >>>>>>>>>>> kernel has switched to memcg-based memory accounting. Thanks to the > > > >>>>>>>>>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility > > > >>>>>>>>>>> with other systems and ask libbpf to raise the limit for us if > > > >>>>>>>>>>> necessary. > > > >>>>>>>>>> > > > >>>>>>>>>>> How do we know if memcg-based accounting is supported? There is a probe > > > >>>>>>>>>>> in libbpf to check this. But this probe currently relies on the > > > >>>>>>>>>>> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which > > > >>>>>>>>>>> landed in the same kernel version as the memory accounting change. This > > > >>>>>>>>>>> works in the generic case, but it may fail, for example, if the helper > > > >>>>>>>>>>> function has been backported to an older kernel. This has been observed > > > >>>>>>>>>>> for Google Cloud's Container-Optimized OS (COS), where the helper is > > > >>>>>>>>>>> available but rlimit is still in use. The probe succeeds, the rlimit is > > > >>>>>>>>>>> not raised, and probing features with bpftool, for example, fails. > > > >>>>>>>>>> > > > >>>>>>>>>>> A patch was submitted [0] to update this probe in libbpf, based on what > > > >>>>>>>>>>> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to > > > >>>>>>>>>>> 0, attempt to load a BPF object, and reset the rlimit. But it may induce > > > >>>>>>>>>>> some hard-to-debug flakiness if another process starts, or the current > > > >>>>>>>>>>> application is killed, while the rlimit is reduced, and the approach was > > > >>>>>>>>>>> discarded. > > > >>>>>>>>>> > > > >>>>>>>>>>> As a workaround to ensure that the rlimit bump does not depend on the > > > >>>>>>>>>>> availability of a given helper, we restore the unconditional rlimit bump > > > >>>>>>>>>>> in bpftool for now. > > > >>>>>>>>>> > > > >>>>>>>>>>> [0] > > > >>>>>>>>>>> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/ > > > >>>>>>>>>>> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39 > > > >>>>>>>>>> > > > >>>>>>>>>>> Cc: Yafang Shao <laoar.shao@gmail.com> > > > >>>>>>>>>>> Signed-off-by: Quentin Monnet <quentin@isovalent.com> > > > >>>>>>>>>>> --- > > > >>>>>>>>>>> tools/bpf/bpftool/common.c | 8 ++++++++ > > > >>>>>>>>>>> tools/bpf/bpftool/feature.c | 2 ++ > > > >>>>>>>>>>> tools/bpf/bpftool/main.c | 6 +++--- > > > >>>>>>>>>>> tools/bpf/bpftool/main.h | 2 ++ > > > >>>>>>>>>>> tools/bpf/bpftool/map.c | 2 ++ > > > >>>>>>>>>>> tools/bpf/bpftool/pids.c | 1 + > > > >>>>>>>>>>> tools/bpf/bpftool/prog.c | 3 +++ > > > >>>>>>>>>>> tools/bpf/bpftool/struct_ops.c | 2 ++ > > > >>>>>>>>>>> 8 files changed, 23 insertions(+), 3 deletions(-) > > > >>>>>>>>>> > > > >>>>>>>>>>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c > > > >>>>>>>>>>> index a45b42ee8ab0..a0d4acd7c54a 100644 > > > >>>>>>>>>>> --- a/tools/bpf/bpftool/common.c > > > >>>>>>>>>>> +++ b/tools/bpf/bpftool/common.c > > > >>>>>>>>>>> @@ -17,6 +17,7 @@ > > > >>>>>>>>>>> #include <linux/magic.h> > > > >>>>>>>>>>> #include <net/if.h> > > > >>>>>>>>>>> #include <sys/mount.h> > > > >>>>>>>>>>> +#include <sys/resource.h> > > > >>>>>>>>>>> #include <sys/stat.h> > > > >>>>>>>>>>> #include <sys/vfs.h> > > > >>>>>>>>>> > > > >>>>>>>>>>> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path) > > > >>>>>>>>>>> return (unsigned long)st_fs.f_type == BPF_FS_MAGIC; > > > >>>>>>>>>>> } > > > >>>>>>>>>> > > > >>>>>>>>>>> +void set_max_rlimit(void) > > > >>>>>>>>>>> +{ > > > >>>>>>>>>>> + struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY }; > > > >>>>>>>>>>> + > > > >>>>>>>>>>> + setrlimit(RLIMIT_MEMLOCK, &rinf); > > > >>>>>>>>>> > > > >>>>>>>>>> Do you think it might make sense to print to stderr some warning if > > > >>>>>>>>>> we actually happen to adjust this limit? > > > >>>>>>>>>> > > > >>>>>>>>>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) { > > > >>>>>>>>>> fprintf(stderr, "Warning: resetting MEMLOCK rlimit to > > > >>>>>>>>>> infinity!\n"); > > > >>>>>>>>>> setrlimit(RLIMIT_MEMLOCK, &rinf); > > > >>>>>>>>>> } > > > >>>>>>>>>> > > > >>>>>>>>>> ? > > > >>>>>>>>>> > > > >>>>>>>>>> Because while it's nice that we automatically do this, this might still > > > >>>>>>>>>> lead to surprises for some users. OTOH, not sure whether people > > > >>>>>>>>>> actually read those warnings? :-/ > > > >>>>>>>>> > > > >>>>>>>>> I'm not strictly opposed to a warning, but I'm not completely sure this > > > >>>>>>>>> is desirable. > > > >>>>>>>>> > > > >>>>>>>>> Bpftool has raised the rlimit for a long time, it changed only in April, > > > >>>>>>>>> so I don't think it would come up as a surprise for people who have used > > > >>>>>>>>> it for a while. I think this is also something that several other > > > >>>>>>>>> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind) > > > >>>>>>>>> have been doing too. > > > >>>>>>>> > > > >>>>>>>> In this case ignore me and let's continue doing that :-) > > > >>>>>>>> > > > >>>>>>>> Btw, eventually we'd still like to stop doing that I'd presume? > > > >>>>>>> > > > >>>>>>> Agreed. I was thinking either finding a way to improve the probe in > > > >>>>>>> libbpf, or waiting for some more time until 5.11 gets old, but this may > > > >>>>>>> take years :/ > > > >>>>>>> > > > >>>>>>>> Should > > > >>>>>>>> we at some point follow up with something like: > > > >>>>>>>> > > > >>>>>>>> if (kernel_version >= 5.11) { don't touch memlock; } > > > >>>>>>>> > > > >>>>>>>> ? > > > >>>>>>>> > > > >>>>>>>> I guess we care only about <5.11 because of the backports, but 5.11+ > > > >>>>>>>> kernels are guaranteed to have memcg. > > > >>>>>>> > > > >>>>>>> You mean from uname() and parsing the release? Yes I suppose we could do > > > >>>>>>> that, can do as a follow-up. > > > >>>>>> > > > >>>>>> Yeah, uname-based, I don't think we can do better? Given that probing > > > >>>>>> is problematic as well :-( > > > >>>>>> But idk, up to you. > > > >>>>>> > > > >>>>> > > > >>>>> Agreed with the uname-based solution. Another possible solution is to > > > >>>>> probe the member 'memcg' in struct bpf_map, in case someone may > > > >>>>> backport memcg-based memory accounting, but that will be a little > > > >>>>> over-engineering. The uname-based solution is simple and can work. > > > >>>>> > > > >>>> > > > >>>> Thanks! Yes, memcg would be more complex: the struct is not exposed to > > > >>>> user space, and BTF is not a hard dependency for bpftool. I'll work on > > > >>>> the uname-based test as a follow-up to this set. > > > >>>> > > > >>> > > > >>> After a second thought, the uname-based test may not work, because > > > >>> CONFIG_MEMCG_KMEM can be disabled. > > > >> > > > >> Does it matter? Regardless of whether there is memcg or not, we > > > >> shouldn't touch ulimit on 5.11+ > > > >> If there is no memcg, there is no bpf memory enforcement. > > > > > > > > Right, rlimit-based accounting is totally removed, that is not the > > > > same with what I thought before, while I thought it will fallback to > > > > rlimit-based if kmemcg is disabled. > > > > > > Agreed, and so I've got a patch ready for the uname-based probe. > > > > > > But talking about this with Daniel, we were wondering if it would make > > > sense instead to have the probe I had initially submitted (lower the > > > rlimit to 0, attempt to load a program, reset rlimit - see [0]), but > > > only for bpftool instead of libbpf? My understanding is that the memlock > > > rlimit is per-process, right? So this shouldn't affect any other > > > process, and because bpftool is not multithreaded, nothing other than > > > probing would happen while the rlimit is at zero? > > > > Makes sense. > > It is safe to do the probe within bpftool. > > +1, seems to be safe to continue doing that in bpftool. Agree, doing it in the *application* (which bpftool is for libbpf) seems totally safe and fine.
diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c index a45b42ee8ab0..a0d4acd7c54a 100644 --- a/tools/bpf/bpftool/common.c +++ b/tools/bpf/bpftool/common.c @@ -17,6 +17,7 @@ #include <linux/magic.h> #include <net/if.h> #include <sys/mount.h> +#include <sys/resource.h> #include <sys/stat.h> #include <sys/vfs.h> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path) return (unsigned long)st_fs.f_type == BPF_FS_MAGIC; } +void set_max_rlimit(void) +{ + struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY }; + + setrlimit(RLIMIT_MEMLOCK, &rinf); +} + static int mnt_fs(const char *target, const char *type, char *buff, size_t bufflen) { diff --git a/tools/bpf/bpftool/feature.c b/tools/bpf/bpftool/feature.c index cc9e4df8c58e..bac4ef428a02 100644 --- a/tools/bpf/bpftool/feature.c +++ b/tools/bpf/bpftool/feature.c @@ -1167,6 +1167,8 @@ static int do_probe(int argc, char **argv) __u32 ifindex = 0; char *ifname; + set_max_rlimit(); + while (argc) { if (is_prefix(*argv, "kernel")) { if (target != COMPONENT_UNSPEC) { diff --git a/tools/bpf/bpftool/main.c b/tools/bpf/bpftool/main.c index 9062ef2b8767..e81227761f5d 100644 --- a/tools/bpf/bpftool/main.c +++ b/tools/bpf/bpftool/main.c @@ -507,9 +507,9 @@ int main(int argc, char **argv) * It will still be rejected if users use LIBBPF_STRICT_ALL * mode for loading generated skeleton. */ - libbpf_set_strict_mode(LIBBPF_STRICT_ALL & ~LIBBPF_STRICT_MAP_DEFINITIONS); - } else { - libbpf_set_strict_mode(LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK); + ret = libbpf_set_strict_mode(LIBBPF_STRICT_ALL & ~LIBBPF_STRICT_MAP_DEFINITIONS); + if (ret) + p_err("failed to enable libbpf strict mode: %d", ret); } argc -= optind; diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h index 6c311f47147e..589cb76b227a 100644 --- a/tools/bpf/bpftool/main.h +++ b/tools/bpf/bpftool/main.h @@ -96,6 +96,8 @@ int detect_common_prefix(const char *arg, ...); void fprint_hex(FILE *f, void *arg, unsigned int n, const char *sep); void usage(void) __noreturn; +void set_max_rlimit(void); + int mount_tracefs(const char *target); struct obj_ref { diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c index 800834be1bcb..38b6bc9c26c3 100644 --- a/tools/bpf/bpftool/map.c +++ b/tools/bpf/bpftool/map.c @@ -1326,6 +1326,8 @@ static int do_create(int argc, char **argv) goto exit; } + set_max_rlimit(); + fd = bpf_map_create(map_type, map_name, key_size, value_size, max_entries, &attr); if (fd < 0) { p_err("map create failed: %s", strerror(errno)); diff --git a/tools/bpf/bpftool/pids.c b/tools/bpf/bpftool/pids.c index e2d00d3cd868..bb6c969a114a 100644 --- a/tools/bpf/bpftool/pids.c +++ b/tools/bpf/bpftool/pids.c @@ -108,6 +108,7 @@ int build_obj_refs_table(struct hashmap **map, enum bpf_obj_type type) p_err("failed to create hashmap for PID references"); return -1; } + set_max_rlimit(); skel = pid_iter_bpf__open(); if (!skel) { diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c index e71f0b2da50b..f081de398b60 100644 --- a/tools/bpf/bpftool/prog.c +++ b/tools/bpf/bpftool/prog.c @@ -1590,6 +1590,8 @@ static int load_with_options(int argc, char **argv, bool first_prog_only) } } + set_max_rlimit(); + if (verifier_logs) /* log_level1 + log_level2 + stats, but not stable UAPI */ open_opts.kernel_log_level = 1 + 2 + 4; @@ -2287,6 +2289,7 @@ static int do_profile(int argc, char **argv) } } + set_max_rlimit(); err = profiler_bpf__load(profile_obj); if (err) { p_err("failed to load profile_obj"); diff --git a/tools/bpf/bpftool/struct_ops.c b/tools/bpf/bpftool/struct_ops.c index 2535f079ed67..e08a6ff2866c 100644 --- a/tools/bpf/bpftool/struct_ops.c +++ b/tools/bpf/bpftool/struct_ops.c @@ -501,6 +501,8 @@ static int do_register(int argc, char **argv) if (libbpf_get_error(obj)) return -1; + set_max_rlimit(); + if (bpf_object__load(obj)) { bpf_object__close(obj); return -1;
This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8. In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the kernel has switched to memcg-based memory accounting. Thanks to the LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility with other systems and ask libbpf to raise the limit for us if necessary. How do we know if memcg-based accounting is supported? There is a probe in libbpf to check this. But this probe currently relies on the availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which landed in the same kernel version as the memory accounting change. This works in the generic case, but it may fail, for example, if the helper function has been backported to an older kernel. This has been observed for Google Cloud's Container-Optimized OS (COS), where the helper is available but rlimit is still in use. The probe succeeds, the rlimit is not raised, and probing features with bpftool, for example, fails. A patch was submitted [0] to update this probe in libbpf, based on what the cilium/ebpf Go library does [1]. It would lower the soft rlimit to 0, attempt to load a BPF object, and reset the rlimit. But it may induce some hard-to-debug flakiness if another process starts, or the current application is killed, while the rlimit is reduced, and the approach was discarded. As a workaround to ensure that the rlimit bump does not depend on the availability of a given helper, we restore the unconditional rlimit bump in bpftool for now. [0] https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/ [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39 Cc: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Quentin Monnet <quentin@isovalent.com> --- tools/bpf/bpftool/common.c | 8 ++++++++ tools/bpf/bpftool/feature.c | 2 ++ tools/bpf/bpftool/main.c | 6 +++--- tools/bpf/bpftool/main.h | 2 ++ tools/bpf/bpftool/map.c | 2 ++ tools/bpf/bpftool/pids.c | 1 + tools/bpf/bpftool/prog.c | 3 +++ tools/bpf/bpftool/struct_ops.c | 2 ++ 8 files changed, 23 insertions(+), 3 deletions(-)