diff mbox series

[bpf-next,1/2] Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"

Message ID 20220610112648.29695-2-quentin@isovalent.com (mailing list archive)
State Accepted
Commit 6b4384ff108874cf336fe2fb1633313c2c7620bf
Delegated to: BPF
Headers show
Series bpftool: Restore memlock rlimit bump | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for bpf-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 12 maintainers not CCed: niklas.soderlund@corigine.com paul@isovalent.com songliubraving@fb.com deso@posteo.net hengqi.chen@gmail.com ramasha@fb.com davemarchevsky@fb.com yhs@fb.com john.fastabend@gmail.com kafai@fb.com sdf@google.com kpsingh@kernel.org
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch warning WARNING: line length of 97 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-1 success Logs for Kernel LATEST on ubuntu-latest with gcc
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Kernel LATEST on ubuntu-latest with llvm-15
bpf/vmtest-bpf-next-VM_Test-3 success Logs for Kernel LATEST on z15 with gcc
bpf/vmtest-bpf-next-PR fail merge-conflict

Commit Message

Quentin Monnet June 10, 2022, 11:26 a.m. UTC
This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8.

In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of
RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the
kernel had switched to memcg-based memory accounting. Thanks to the
LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK libbpf option, we attempted to keep
compatibility with older systems by asking libbpf to raise the limit for
us when necessary.

How do we know if memcg-based accounting is supported? There is a probe
in libbpf to check this. But this probe currently relies on the
availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which
landed in the same kernel version as the memory accounting change. This
works in the generic case, but it may fail, for example, if the helper
function has been backported to an older kernel. This has been observed
for Google Cloud's Container-Optimized OS (COS), where the helper is
available but rlimit is still in use. The probe succeeds, the rlimit is
not raised, and probing features with bpftool, for example, fails.
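
To make the failure mode concrete, the probe boils down to something like
the sketch below: load a minimal program that calls the helper, and treat
a successful load as proof that memcg-based accounting is in place. This
is a simplified illustration of the idea, not a copy of libbpf's internal
code:

#include <unistd.h>
#include <linux/bpf.h>
#include <bpf/bpf.h>

/* Returns 1 if the kernel knows bpf_ktime_get_coarse_ns(), which the probe
 * takes as a proxy for memcg-based accounting being available.
 */
static int probe_memcg_via_helper(void)
{
	struct bpf_insn insns[] = {
		/* call bpf_ktime_get_coarse_ns(); its return value lands in r0 */
		{ .code = BPF_JMP | BPF_CALL, .imm = BPF_FUNC_ktime_get_coarse_ns },
		/* exit with r0 as the program's return value */
		{ .code = BPF_JMP | BPF_EXIT },
	};
	int fd;

	fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, NULL, "GPL",
			   insns, sizeof(insns) / sizeof(insns[0]), NULL);
	if (fd < 0)
		return 0;
	close(fd);
	return 1;
}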

A patch was submitted [0] to update this probe in libbpf, based on what
the cilium/ebpf Go library does [1]. It would lower the soft rlimit to
0, attempt to load a BPF object, and then reset the rlimit. But it could
induce some hard-to-debug flakiness if another process started, or if the
current application was killed, while the rlimit was lowered, so the
approach was discarded.
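
For reference, a rough sketch of that discarded approach (the function
name probe_memcg_via_rlimit() and the use of a trivial map creation are
illustrative assumptions, not the exact code from [0]):

#include <stdbool.h>
#include <unistd.h>
#include <sys/resource.h>
#include <linux/bpf.h>
#include <bpf/bpf.h>

static bool probe_memcg_via_rlimit(void)
{
	struct rlimit old, zero = { 0, 0 };
	int fd;

	if (getrlimit(RLIMIT_MEMLOCK, &old))
		return false;
	zero.rlim_max = old.rlim_max;	/* drop only the soft limit */
	if (setrlimit(RLIMIT_MEMLOCK, &zero))
		return false;

	/* Window where the process runs with a zero soft limit: this is what
	 * makes the approach flaky for other threads or child processes.
	 */
	fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, NULL, sizeof(int),
			    sizeof(int), 1, NULL);

	setrlimit(RLIMIT_MEMLOCK, &old);
	if (fd < 0)
		return false;	/* creation failed: rlimit-based accounting */
	close(fd);
	return true;		/* creation succeeded: memcg-based accounting */
}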

As a workaround to ensure that the rlimit bump does not depend on the
availability of a given helper, we restore the unconditional rlimit bump
in bpftool for now.

[0] https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/
[1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39

Cc: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
---
 tools/bpf/bpftool/common.c     | 8 ++++++++
 tools/bpf/bpftool/feature.c    | 2 ++
 tools/bpf/bpftool/main.c       | 6 +++---
 tools/bpf/bpftool/main.h       | 2 ++
 tools/bpf/bpftool/map.c        | 2 ++
 tools/bpf/bpftool/pids.c       | 1 +
 tools/bpf/bpftool/prog.c       | 3 +++
 tools/bpf/bpftool/struct_ops.c | 2 ++
 8 files changed, 23 insertions(+), 3 deletions(-)

Comments

Stanislav Fomichev June 10, 2022, 4:07 p.m. UTC | #1
On 06/10, Quentin Monnet wrote:
> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8.

> [...]

> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
> index a45b42ee8ab0..a0d4acd7c54a 100644
> --- a/tools/bpf/bpftool/common.c
> +++ b/tools/bpf/bpftool/common.c
> @@ -17,6 +17,7 @@
>   #include <linux/magic.h>
>   #include <net/if.h>
>   #include <sys/mount.h>
> +#include <sys/resource.h>
>   #include <sys/stat.h>
>   #include <sys/vfs.h>

> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path)
>   	return (unsigned long)st_fs.f_type == BPF_FS_MAGIC;
>   }

> +void set_max_rlimit(void)
> +{
> +	struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
> +
> +	setrlimit(RLIMIT_MEMLOCK, &rinf);

Do you think it might make sense to print to stderr some warning if
we actually happen to adjust this limit?

if (getrlimit(MEMLOCK) != RLIM_INFINITY) {
	fprintf(stderr, "Warning: resetting MEMLOCK rlimit to infinity!\n");
	setrlimit(RLIMIT_MEMLOCK, &rinf);
}

?

Because while it's nice that we automatically do this, this might still
lead to surprises for some users. OTOH, not sure whether people
actually read those warnings? :-/
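
A compilable variant of that idea, assuming <stdio.h> and <sys/resource.h>
are included as in common.c (this is an illustration of the suggestion,
not the code that was merged):

void set_max_rlimit(void)
{
	struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
	struct rlimit cur;

	/* getrlimit() fills a struct, so check rlim_cur explicitly and warn
	 * only when the limit is actually about to change.
	 */
	if (!getrlimit(RLIMIT_MEMLOCK, &cur) && cur.rlim_cur != RLIM_INFINITY)
		fprintf(stderr, "Warning: resetting RLIMIT_MEMLOCK to infinity!\n");

	setrlimit(RLIMIT_MEMLOCK, &rinf);
}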

> +}
> +
>   static int
>   mnt_fs(const char *target, const char *type, char *buff, size_t bufflen)
>   {
> diff --git a/tools/bpf/bpftool/feature.c b/tools/bpf/bpftool/feature.c
> index cc9e4df8c58e..bac4ef428a02 100644
> --- a/tools/bpf/bpftool/feature.c
> +++ b/tools/bpf/bpftool/feature.c
> @@ -1167,6 +1167,8 @@ static int do_probe(int argc, char **argv)
>   	__u32 ifindex = 0;
>   	char *ifname;

> +	set_max_rlimit();
> +
>   	while (argc) {
>   		if (is_prefix(*argv, "kernel")) {
>   			if (target != COMPONENT_UNSPEC) {
> diff --git a/tools/bpf/bpftool/main.c b/tools/bpf/bpftool/main.c
> index 9062ef2b8767..e81227761f5d 100644
> --- a/tools/bpf/bpftool/main.c
> +++ b/tools/bpf/bpftool/main.c
> @@ -507,9 +507,9 @@ int main(int argc, char **argv)
>   		 * It will still be rejected if users use LIBBPF_STRICT_ALL
>   		 * mode for loading generated skeleton.
>   		 */
> -		libbpf_set_strict_mode(LIBBPF_STRICT_ALL &  
> ~LIBBPF_STRICT_MAP_DEFINITIONS);
> -	} else {
> -		libbpf_set_strict_mode(LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK);
> +		ret = libbpf_set_strict_mode(LIBBPF_STRICT_ALL &  
> ~LIBBPF_STRICT_MAP_DEFINITIONS);
> +		if (ret)
> +			p_err("failed to enable libbpf strict mode: %d", ret);
>   	}

>   	argc -= optind;
> diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
> index 6c311f47147e..589cb76b227a 100644
> --- a/tools/bpf/bpftool/main.h
> +++ b/tools/bpf/bpftool/main.h
> @@ -96,6 +96,8 @@ int detect_common_prefix(const char *arg, ...);
>   void fprint_hex(FILE *f, void *arg, unsigned int n, const char *sep);
>   void usage(void) __noreturn;

> +void set_max_rlimit(void);
> +
>   int mount_tracefs(const char *target);

>   struct obj_ref {
> diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
> index 800834be1bcb..38b6bc9c26c3 100644
> --- a/tools/bpf/bpftool/map.c
> +++ b/tools/bpf/bpftool/map.c
> @@ -1326,6 +1326,8 @@ static int do_create(int argc, char **argv)
>   		goto exit;
>   	}

> +	set_max_rlimit();
> +
>   	fd = bpf_map_create(map_type, map_name, key_size, value_size,  
> max_entries, &attr);
>   	if (fd < 0) {
>   		p_err("map create failed: %s", strerror(errno));
> diff --git a/tools/bpf/bpftool/pids.c b/tools/bpf/bpftool/pids.c
> index e2d00d3cd868..bb6c969a114a 100644
> --- a/tools/bpf/bpftool/pids.c
> +++ b/tools/bpf/bpftool/pids.c
> @@ -108,6 +108,7 @@ int build_obj_refs_table(struct hashmap **map, enum  
> bpf_obj_type type)
>   		p_err("failed to create hashmap for PID references");
>   		return -1;
>   	}
> +	set_max_rlimit();

>   	skel = pid_iter_bpf__open();
>   	if (!skel) {
> diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
> index e71f0b2da50b..f081de398b60 100644
> --- a/tools/bpf/bpftool/prog.c
> +++ b/tools/bpf/bpftool/prog.c
> @@ -1590,6 +1590,8 @@ static int load_with_options(int argc, char **argv,  
> bool first_prog_only)
>   		}
>   	}

> +	set_max_rlimit();
> +
>   	if (verifier_logs)
>   		/* log_level1 + log_level2 + stats, but not stable UAPI */
>   		open_opts.kernel_log_level = 1 + 2 + 4;
> @@ -2287,6 +2289,7 @@ static int do_profile(int argc, char **argv)
>   		}
>   	}

> +	set_max_rlimit();
>   	err = profiler_bpf__load(profile_obj);
>   	if (err) {
>   		p_err("failed to load profile_obj");
> diff --git a/tools/bpf/bpftool/struct_ops.c  
> b/tools/bpf/bpftool/struct_ops.c
> index 2535f079ed67..e08a6ff2866c 100644
> --- a/tools/bpf/bpftool/struct_ops.c
> +++ b/tools/bpf/bpftool/struct_ops.c
> @@ -501,6 +501,8 @@ static int do_register(int argc, char **argv)
>   	if (libbpf_get_error(obj))
>   		return -1;

> +	set_max_rlimit();
> +
>   	if (bpf_object__load(obj)) {
>   		bpf_object__close(obj);
>   		return -1;
> --
> 2.34.1
Quentin Monnet June 10, 2022, 4:34 p.m. UTC | #2
2022-06-10 09:07 UTC-0700 ~ sdf@google.com
> On 06/10, Quentin Monnet wrote:
>> [...]
> 
>> +void set_max_rlimit(void)
>> +{
>> +    struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
>> +
>> +    setrlimit(RLIMIT_MEMLOCK, &rinf);
> 
> Do you think it might make sense to print to stderr some warning if
> we actually happen to adjust this limit?
> 
> if (getrlimit(MEMLOCK) != RLIM_INFINITY) {
>     fprintf(stderr, "Warning: resetting MEMLOCK rlimit to
>     infinity!\n");
>     setrlimit(RLIMIT_MEMLOCK, &rinf);
> }
> 
> ?
> 
> Because while it's nice that we automatically do this, this might still
> lead to surprises for some users. OTOH, not sure whether people
> actually read those warnings? :-/

I'm not strictly opposed to a warning, but I'm not completely sure this
is desirable.

Bpftool has raised the rlimit for a long time; it changed only in April,
so I don't think it would come as a surprise to people who have used it
for a while. I think this is also something that several other
BPF-related applications (BCC I think, bpftrace, and Cilium come to mind)
have been doing too.

For new users, I agree the warning may be helpful. But then the message
is likely to appear the very first time a user runs the command - likely
as root - and I fear this might worry people not familiar with rlimits,
who would wonder whether they just broke something on their system. Maybe
with a different phrasing it could work.

Alternatively, we could document it in the relevant man pages (not that
people would be more likely to see it there, but at least it would be
mentioned somewhere for those who take the time to read the docs). What
do you think?

Quentin
Stanislav Fomichev June 10, 2022, 4:46 p.m. UTC | #3
On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@isovalent.com> wrote:
>
> 2022-06-10 09:07 UTC-0700 ~ sdf@google.com
> > On 06/10, Quentin Monnet wrote:
> >> [...]
> >
> >> +void set_max_rlimit(void)
> >> +{
> >> +    struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
> >> +
> >> +    setrlimit(RLIMIT_MEMLOCK, &rinf);
> >
> > Do you think it might make sense to print to stderr some warning if
> > we actually happen to adjust this limit?
> >
> > if (getrlimit(MEMLOCK) != RLIM_INFINITY) {
> >     fprintf(stderr, "Warning: resetting MEMLOCK rlimit to
> >     infinity!\n");
> >     setrlimit(RLIMIT_MEMLOCK, &rinf);
> > }
> >
> > ?
> >
> > Because while it's nice that we automatically do this, this might still
> > lead to surprises for some users. OTOH, not sure whether people
> > actually read those warnings? :-/
>
> I'm not strictly opposed to a warning, but I'm not completely sure this
> is desirable.
>
> Bpftool has raised the rlimit for a long time, it changed only in April,
> so I don't think it would come up as a surprise for people who have used
> it for a while. I think this is also something that several other
> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind)
> have been doing too.

In this case ignore me and let's continue doing that :-)

Btw, eventually we'd still like to stop doing that, I'd presume? Should
we at some point follow up with something like:

if (kernel_version >= 5.11) { don't touch memlock; }

?

I guess we care only about <5.11 because of the backports, but 5.11+
kernels are guaranteed to have memcg.

I'm not sure whether memlock is used out there in the distros (and
especially for root/bpf_capable), so I'm also not sure whether we
really care or not.

> For new users, I agree the warning may be helpful. But then the message
> is likely to appear the very first time a user runs the command - likely
> as root - and I fear this might worry people not familiar with rlimits,
> who would wonder if they just broke something on their system? Maybe
> with a different phrasing.
>
> Alternatively we could document it in the relevant man pages (not that
> people would see it better, but at least it would be mentioned somewhere
> if people take the time to read the docs)? What do you think?
>
> Quentin
Quentin Monnet June 10, 2022, 5 p.m. UTC | #4
2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@google.com>
> On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@isovalent.com> wrote:
>>
>> 2022-06-10 09:07 UTC-0700 ~ sdf@google.com
>>> On 06/10, Quentin Monnet wrote:
>>>> [...]
>>>
>>>> +void set_max_rlimit(void)
>>>> +{
>>>> +    struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
>>>> +
>>>> +    setrlimit(RLIMIT_MEMLOCK, &rinf);
>>>
>>> Do you think it might make sense to print to stderr some warning if
>>> we actually happen to adjust this limit?
>>>
>>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) {
>>>     fprintf(stderr, "Warning: resetting MEMLOCK rlimit to
>>>     infinity!\n");
>>>     setrlimit(RLIMIT_MEMLOCK, &rinf);
>>> }
>>>
>>> ?
>>>
>>> Because while it's nice that we automatically do this, this might still
>>> lead to surprises for some users. OTOH, not sure whether people
>>> actually read those warnings? :-/
>>
>> I'm not strictly opposed to a warning, but I'm not completely sure this
>> is desirable.
>>
>> Bpftool has raised the rlimit for a long time, it changed only in April,
>> so I don't think it would come up as a surprise for people who have used
>> it for a while. I think this is also something that several other
>> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind)
>> have been doing too.
> 
> In this case ignore me and let's continue doing that :-)
> 
> Btw, eventually we'd still like to stop doing that I'd presume?

Agreed. I was thinking of either finding a way to improve the probe in
libbpf, or waiting for some more time until 5.11 gets old enough, but
this may take years :/

> Should
> we at some point follow up with something like:
> 
> if (kernel_version >= 5.11) { don't touch memlock; }
> 
> ?
> 
> I guess we care only about <5.11 because of the backports, but 5.11+
> kernels are guaranteed to have memcg.

You mean from uname() and parsing the release? Yes I suppose we could do
that, can do as a follow-up.

> 
> I'm not sure whether memlock is used out there in the distros (and
> especially for root/bpf_capable), so I'm also not sure whether we
> really care or not.

Not sure either. For what it's worth, I've never seen complaints so far
from users about the rlimit being raised (from bpftool or other BPF apps).
Stanislav Fomichev June 10, 2022, 5:17 p.m. UTC | #5
On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@isovalent.com> wrote:
>
> 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@google.com>
> > On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@isovalent.com> wrote:
> >>
> >> 2022-06-10 09:07 UTC-0700 ~ sdf@google.com
> >>> On 06/10, Quentin Monnet wrote:
> >>>> [...]
> >>>
> >>>> +void set_max_rlimit(void)
> >>>> +{
> >>>> +    struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
> >>>> +
> >>>> +    setrlimit(RLIMIT_MEMLOCK, &rinf);
> >>>
> >>> Do you think it might make sense to print to stderr some warning if
> >>> we actually happen to adjust this limit?
> >>>
> >>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) {
> >>>     fprintf(stderr, "Warning: resetting MEMLOCK rlimit to
> >>>     infinity!\n");
> >>>     setrlimit(RLIMIT_MEMLOCK, &rinf);
> >>> }
> >>>
> >>> ?
> >>>
> >>> Because while it's nice that we automatically do this, this might still
> >>> lead to surprises for some users. OTOH, not sure whether people
> >>> actually read those warnings? :-/
> >>
> >> I'm not strictly opposed to a warning, but I'm not completely sure this
> >> is desirable.
> >>
> >> Bpftool has raised the rlimit for a long time, it changed only in April,
> >> so I don't think it would come up as a surprise for people who have used
> >> it for a while. I think this is also something that several other
> >> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind)
> >> have been doing too.
> >
> > In this case ignore me and let's continue doing that :-)
> >
> > Btw, eventually we'd still like to stop doing that I'd presume?
>
> Agreed. I was thinking either finding a way to improve the probe in
> libbpf, or waiting for some more time until 5.11 gets old, but this may
> take years :/
>
> > Should
> > we at some point follow up with something like:
> >
> > if (kernel_version >= 5.11) { don't touch memlock; }
> >
> > ?
> >
> > I guess we care only about <5.11 because of the backports, but 5.11+
> > kernels are guaranteed to have memcg.
>
> You mean from uname() and parsing the release? Yes I suppose we could do
> that, can do as a follow-up.

Yeah, uname-based; I don't think we can do better, given that probing
is problematic as well :-(
But idk, up to you.

> > I'm not sure whether memlock is used out there in the distros (and
> > especially for root/bpf_capable), so I'm also not sure whether we
> > really care or not.
>
> Not sure either. For what it's worth, I've never seen complaints so far
> from users about the rlimit being raised (from bpftool or other BPF apps).
Yafang Shao June 14, 2022, 12:37 p.m. UTC | #6
On Sat, Jun 11, 2022 at 1:17 AM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@isovalent.com> wrote:
> >
> > 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@google.com>
> > > On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@isovalent.com> wrote:
> > >>
> > >> 2022-06-10 09:07 UTC-0700 ~ sdf@google.com
> > >>> On 06/10, Quentin Monnet wrote:
> > >>>> [...]
> > >>>
> > >>>> +void set_max_rlimit(void)
> > >>>> +{
> > >>>> +    struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
> > >>>> +
> > >>>> +    setrlimit(RLIMIT_MEMLOCK, &rinf);
> > >>>
> > >>> Do you think it might make sense to print to stderr some warning if
> > >>> we actually happen to adjust this limit?
> > >>>
> > >>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) {
> > >>>     fprintf(stderr, "Warning: resetting MEMLOCK rlimit to
> > >>>     infinity!\n");
> > >>>     setrlimit(RLIMIT_MEMLOCK, &rinf);
> > >>> }
> > >>>
> > >>> ?
> > >>>
> > >>> Because while it's nice that we automatically do this, this might still
> > >>> lead to surprises for some users. OTOH, not sure whether people
> > >>> actually read those warnings? :-/
> > >>
> > >> I'm not strictly opposed to a warning, but I'm not completely sure this
> > >> is desirable.
> > >>
> > >> Bpftool has raised the rlimit for a long time, it changed only in April,
> > >> so I don't think it would come up as a surprise for people who have used
> > >> it for a while. I think this is also something that several other
> > >> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind)
> > >> have been doing too.
> > >
> > > In this case ignore me and let's continue doing that :-)
> > >
> > > Btw, eventually we'd still like to stop doing that I'd presume?
> >
> > Agreed. I was thinking either finding a way to improve the probe in
> > libbpf, or waiting for some more time until 5.11 gets old, but this may
> > take years :/
> >
> > > Should
> > > we at some point follow up with something like:
> > >
> > > if (kernel_version >= 5.11) { don't touch memlock; }
> > >
> > > ?
> > >
> > > I guess we care only about <5.11 because of the backports, but 5.11+
> > > kernels are guaranteed to have memcg.
> >
> > You mean from uname() and parsing the release? Yes I suppose we could do
> > that, can do as a follow-up.
>
> Yeah, uname-based, I don't think we can do better? Given that probing
> is problematic as well :-(
> But idk, up to you.
>

Agreed with the uname-based solution. Another possible solution is to
probe for the 'memcg' member in struct bpf_map, in case someone backports
memcg-based memory accounting, but that would be a little
over-engineered. The uname-based solution is simple and can work.
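
A rough sketch of what such a uname()-based check could look like (the
function name kernel_has_memcg_accounting() and the exact fallback policy
are assumptions for illustration, not the follow-up patch itself):

#include <stdbool.h>
#include <stdio.h>
#include <sys/utsname.h>

static bool kernel_has_memcg_accounting(void)
{
	struct utsname uts;
	unsigned int major, minor;

	if (uname(&uts) || sscanf(uts.release, "%u.%u", &major, &minor) != 2)
		return false;	/* be conservative: keep bumping the rlimit */

	/* memcg-based accounting for BPF objects landed in 5.11 */
	return major > 5 || (major == 5 && minor >= 11);
}

set_max_rlimit() could then return early when this reports true.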
Quentin Monnet June 14, 2022, 2:20 p.m. UTC | #7
2022-06-14 20:37 UTC+0800 ~ Yafang Shao <laoar.shao@gmail.com>
> On Sat, Jun 11, 2022 at 1:17 AM Stanislav Fomichev <sdf@google.com> wrote:
>>
>> On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@isovalent.com> wrote:
>>>
>>> 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@google.com>
>>>> On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@isovalent.com> wrote:
>>>>>
>>>>> 2022-06-10 09:07 UTC-0700 ~ sdf@google.com
>>>>>> On 06/10, Quentin Monnet wrote:
>>>>>>> [...]
>>>>>>
>>>>>>> +void set_max_rlimit(void)
>>>>>>> +{
>>>>>>> +    struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
>>>>>>> +
>>>>>>> +    setrlimit(RLIMIT_MEMLOCK, &rinf);
>>>>>>
>>>>>> Do you think it might make sense to print to stderr some warning if
>>>>>> we actually happen to adjust this limit?
>>>>>>
>>>>>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) {
>>>>>>     fprintf(stderr, "Warning: resetting MEMLOCK rlimit to
>>>>>>     infinity!\n");
>>>>>>     setrlimit(RLIMIT_MEMLOCK, &rinf);
>>>>>> }
>>>>>>
>>>>>> ?
>>>>>>
>>>>>> Because while it's nice that we automatically do this, this might still
>>>>>> lead to surprises for some users. OTOH, not sure whether people
>>>>>> actually read those warnings? :-/
>>>>>
>>>>> I'm not strictly opposed to a warning, but I'm not completely sure this
>>>>> is desirable.
>>>>>
>>>>> Bpftool has raised the rlimit for a long time, it changed only in April,
>>>>> so I don't think it would come up as a surprise for people who have used
>>>>> it for a while. I think this is also something that several other
>>>>> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind)
>>>>> have been doing too.
>>>>
>>>> In this case ignore me and let's continue doing that :-)
>>>>
>>>> Btw, eventually we'd still like to stop doing that I'd presume?
>>>
>>> Agreed. I was thinking either finding a way to improve the probe in
>>> libbpf, or waiting for some more time until 5.11 gets old, but this may
>>> take years :/
>>>
>>>> Should
>>>> we at some point follow up with something like:
>>>>
>>>> if (kernel_version >= 5.11) { don't touch memlock; }
>>>>
>>>> ?
>>>>
>>>> I guess we care only about <5.11 because of the backports, but 5.11+
>>>> kernels are guaranteed to have memcg.
>>>
>>> You mean from uname() and parsing the release? Yes I suppose we could do
>>> that, can do as a follow-up.
>>
>> Yeah, uname-based, I don't think we can do better? Given that probing
>> is problematic as well :-(
>> But idk, up to you.
>>
> 
> Agreed with the uname-based solution. Another possible solution is to
> probe the member 'memcg' in struct bpf_map, in case someone may
> backport memcg-based  memory accounting, but that will be a little
> over-engineering. The uname-based solution is simple and can work.
> 

Thanks! Yes, memcg would be more complex: the struct is not exposed to
user space, and BTF is not a hard dependency for bpftool. I'll work on
the uname-based test as a follow-up to this set.
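
For context, a BTF-based probe along the lines suggested above could look
roughly like the sketch below; the function name is made up, and it
assumes libbpf 1.0 semantics (NULL on error) plus a kernel built with BTF,
which is exactly the extra dependency mentioned here:

#include <string.h>
#include <stdbool.h>
#include <bpf/btf.h>

static bool bpf_map_has_memcg_member(void)
{
	const struct btf_member *m;
	const struct btf_type *t;
	struct btf *btf;
	bool found = false;
	__s32 id;
	int i;

	btf = btf__load_vmlinux_btf();
	if (!btf)
		return false;

	id = btf__find_by_name_kind(btf, "bpf_map", BTF_KIND_STRUCT);
	if (id < 0)
		goto out;

	/* Walk the members of struct bpf_map, looking for 'memcg'. */
	t = btf__type_by_id(btf, id);
	for (i = 0, m = btf_members(t); i < btf_vlen(t); i++, m++) {
		if (!strcmp(btf__name_by_offset(btf, m->name_off), "memcg")) {
			found = true;
			break;
		}
	}
out:
	btf__free(btf);
	return found;
}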

Quentin
Daniel Borkmann June 14, 2022, 8:34 p.m. UTC | #8
On 6/14/22 4:20 PM, Quentin Monnet wrote:
> 2022-06-14 20:37 UTC+0800 ~ Yafang Shao <laoar.shao@gmail.com>
>> On Sat, Jun 11, 2022 at 1:17 AM Stanislav Fomichev <sdf@google.com> wrote:
>>> On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@isovalent.com> wrote:
>>>> 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@google.com>
>>>>> On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@isovalent.com> wrote:
>>>>>> 2022-06-10 09:07 UTC-0700 ~ sdf@google.com
>>>>>>> On 06/10, Quentin Monnet wrote:
>>>>>>>> [...]
>>>>>>>
>>>>>>>> +void set_max_rlimit(void)
>>>>>>>> +{
>>>>>>>> +    struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
>>>>>>>> +
>>>>>>>> +    setrlimit(RLIMIT_MEMLOCK, &rinf);
>>>>>>>
>>>>>>> Do you think it might make sense to print to stderr some warning if
>>>>>>> we actually happen to adjust this limit?
>>>>>>>
>>>>>>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) {
>>>>>>>      fprintf(stderr, "Warning: resetting MEMLOCK rlimit to
>>>>>>>      infinity!\n");
>>>>>>>      setrlimit(RLIMIT_MEMLOCK, &rinf);
>>>>>>> }
>>>>>>>
>>>>>>> ?
>>>>>>>
>>>>>>> Because while it's nice that we automatically do this, this might still
>>>>>>> lead to surprises for some users. OTOH, not sure whether people
>>>>>>> actually read those warnings? :-/
>>>>>>
>>>>>> I'm not strictly opposed to a warning, but I'm not completely sure this
>>>>>> is desirable.
>>>>>>
>>>>>> Bpftool has raised the rlimit for a long time, it changed only in April,
>>>>>> so I don't think it would come up as a surprise for people who have used
>>>>>> it for a while. I think this is also something that several other
>>>>>> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind)
>>>>>> have been doing too.
>>>>>
>>>>> In this case ignore me and let's continue doing that :-)
>>>>>
>>>>> Btw, eventually we'd still like to stop doing that I'd presume?
>>>>
>>>> Agreed. I was thinking either finding a way to improve the probe in
>>>> libbpf, or waiting for some more time until 5.11 gets old, but this may
>>>> take years :/
>>>>
>>>>> Should
>>>>> we at some point follow up with something like:
>>>>>
>>>>> if (kernel_version >= 5.11) { don't touch memlock; }
>>>>>
>>>>> ?
>>>>>
>>>>> I guess we care only about <5.11 because of the backports, but 5.11+
>>>>> kernels are guaranteed to have memcg.
>>>>
>>>> You mean from uname() and parsing the release? Yes I suppose we could do
>>>> that, can do as a follow-up.
>>>
>>> Yeah, uname-based, I don't think we can do better? Given that probing
>>> is problematic as well :-(
>>> But idk, up to you.
>>
>> Agreed with the uname-based solution. Another possible solution is to
>> probe the member 'memcg' in struct bpf_map, in case someone may
>> backport memcg-based  memory accounting, but that will be a little
>> over-engineering. The uname-based solution is simple and can work.
> 
> Thanks! Yes, memcg would be more complex: the struct is not exposed to
> user space, and BTF is not a hard dependency for bpftool. I'll work on
> the uname-based test as a follow-up to this set.

How would this work for things like RHEL? Maybe two potential workarounds...
1) We could use a different helper for the probing and see how far we get with
it... not great though, and we would probably still run into the same issue
again at some point. 2) Maybe we could create a temp memcg and check whether
we get accounted against it on prog load (e.g. despite a high rlimit)?

Thanks,
Daniel
Stanislav Fomichev June 14, 2022, 9:01 p.m. UTC | #9
On Tue, Jun 14, 2022 at 1:34 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 6/14/22 4:20 PM, Quentin Monnet wrote:
> > 2022-06-14 20:37 UTC+0800 ~ Yafang Shao <laoar.shao@gmail.com>
> >> On Sat, Jun 11, 2022 at 1:17 AM Stanislav Fomichev <sdf@google.com> wrote:
> >>> On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@isovalent.com> wrote:
> >>>> 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@google.com>
> >>>>> On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@isovalent.com> wrote:
> >>>>>> 2022-06-10 09:07 UTC-0700 ~ sdf@google.com
> >>>>>>> On 06/10, Quentin Monnet wrote:
> >>>>>>>> [...]
> >>>>>>>
> >>>>>>>> +void set_max_rlimit(void)
> >>>>>>>> +{
> >>>>>>>> +    struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
> >>>>>>>> +
> >>>>>>>> +    setrlimit(RLIMIT_MEMLOCK, &rinf);
> >>>>>>>
> >>>>>>> Do you think it might make sense to print to stderr some warning if
> >>>>>>> we actually happen to adjust this limit?
> >>>>>>>
> >>>>>>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) {
> >>>>>>>      fprintf(stderr, "Warning: resetting MEMLOCK rlimit to
> >>>>>>>      infinity!\n");
> >>>>>>>      setrlimit(RLIMIT_MEMLOCK, &rinf);
> >>>>>>> }
> >>>>>>>
> >>>>>>> ?
> >>>>>>>
> >>>>>>> Because while it's nice that we automatically do this, this might still
> >>>>>>> lead to surprises for some users. OTOH, not sure whether people
> >>>>>>> actually read those warnings? :-/
> >>>>>>
> >>>>>> I'm not strictly opposed to a warning, but I'm not completely sure this
> >>>>>> is desirable.
> >>>>>>
> >>>>>> Bpftool has raised the rlimit for a long time, it changed only in April,
> >>>>>> so I don't think it would come up as a surprise for people who have used
> >>>>>> it for a while. I think this is also something that several other
> >>>>>> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind)
> >>>>>> have been doing too.
> >>>>>
> >>>>> In this case ignore me and let's continue doing that :-)
> >>>>>
> >>>>> Btw, eventually we'd still like to stop doing that I'd presume?
> >>>>
> >>>> Agreed. I was thinking either finding a way to improve the probe in
> >>>> libbpf, or waiting for some more time until 5.11 gets old, but this may
> >>>> take years :/
> >>>>
> >>>>> Should
> >>>>> we at some point follow up with something like:
> >>>>>
> >>>>> if (kernel_version >= 5.11) { don't touch memlock; }
> >>>>>
> >>>>> ?
> >>>>>
> >>>>> I guess we care only about <5.11 because of the backports, but 5.11+
> >>>>> kernels are guaranteed to have memcg.
> >>>>
> >>>> You mean from uname() and parsing the release? Yes I suppose we could do
> >>>> that, can do as a follow-up.
> >>>
> >>> Yeah, uname-based, I don't think we can do better? Given that probing
> >>> is problematic as well :-(
> >>> But idk, up to you.
> >>
> >> Agreed with the uname-based solution. Another possible solution is to
> >> probe the member 'memcg' in struct bpf_map, in case someone may
> >> backport memcg-based  memory accounting, but that will be a little
> >> over-engineering. The uname-based solution is simple and can work.
> >
> > Thanks! Yes, memcg would be more complex: the struct is not exposed to
> > user space, and BTF is not a hard dependency for bpftool. I'll work on
> > the uname-based test as a follow-up to this set.
>
> How would this work for things like RHEL? Maybe two potential workarounds ...
> 1) We could use a different helper for the probing and see how far we get with
> it.. though not great, probably still rare enough that we would run into it
> again. 2) Maybe we could create a temp memcg and check whether we get accounted
> against it on prog load (e.g. despite high rlimit)?

This might be dangerous as well: we don't want to move to a different
cgroup and move back; if we are in a multithreaded application, other
threads might create resources in that temporary memcg in the meantime.
And I don't think we can fork either, since we are in a library :-(
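
For reference, a minimal sketch of the conditional warning suggested
earlier in the thread could look like the snippet below. This is an
illustration only, not code from the patch set: the merged patch keeps the
unconditional setrlimit() shown in the diff at the bottom of this page,
and the function name here is made up.

#include <stdio.h>
#include <sys/resource.h>

/* Sketch only: warn when the MEMLOCK limit actually needs raising. */
void set_max_rlimit_with_warning(void)
{
	struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
	struct rlimit cur;

	if (!getrlimit(RLIMIT_MEMLOCK, &cur) && cur.rlim_cur != RLIM_INFINITY)
		fprintf(stderr, "Warning: raising RLIMIT_MEMLOCK to infinity\n");

	setrlimit(RLIMIT_MEMLOCK, &rinf);
}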
Yafang Shao June 15, 2022, 1:22 p.m. UTC | #10
On Tue, Jun 14, 2022 at 10:20 PM Quentin Monnet <quentin@isovalent.com> wrote:
>
> [...]
>
> > Agreed with the uname-based solution. Another possible solution is to
> > probe the member 'memcg' in struct bpf_map, in case someone may
> > backport memcg-based  memory accounting, but that will be a little
> > over-engineering. The uname-based solution is simple and can work.
> >
>
> Thanks! Yes, memcg would be more complex: the struct is not exposed to
> user space, and BTF is not a hard dependency for bpftool. I'll work on
> the uname-based test as a follow-up to this set.
>

On second thought, the uname-based test may not work, because
CONFIG_MEMCG_KMEM can be disabled.
Maybe we could instead probe for the 'memcg' member of struct bpf_map by
parsing /sys/kernel/btf/vmlinux:
[8584] STRUCT 'bpf_map' size=256 vlen=27
        'ops' type_id=8659 bits_offset=0
        'inner_map_meta' type_id=8587 bits_offset=64
        'security' type_id=93 bits_offset=128
        'map_type' type_id=8532 bits_offset=192
        'key_size' type_id=36 bits_offset=224
        'value_size' type_id=36 bits_offset=256
        'max_entries' type_id=36 bits_offset=288
        'map_extra' type_id=38 bits_offset=320
        'map_flags' type_id=36 bits_offset=384
        'spin_lock_off' type_id=21 bits_offset=416
        'timer_off' type_id=21 bits_offset=448
        'id' type_id=36 bits_offset=480
        'numa_node' type_id=21 bits_offset=512
        'btf_key_type_id' type_id=36 bits_offset=544
        'btf_value_type_id' type_id=36 bits_offset=576
        'btf_vmlinux_value_type_id' type_id=36 bits_offset=608
        'btf' type_id=8660 bits_offset=640
        'memcg' type_id=687 bits_offset=704                       <<<< here
        'name' type_id=337 bits_offset=768
        'bypass_spec_v1' type_id=63 bits_offset=896
        'frozen' type_id=63 bits_offset=904
        'refcnt' type_id=81 bits_offset=1024
        'usercnt' type_id=81 bits_offset=1088
        'work' type_id=484 bits_offset=1152
        'freeze_mutex' type_id=443 bits_offset=1408
        'writecnt' type_id=81 bits_offset=1664
        'owner' type_id=8658 bits_offset=1728

If the 'memcg' member exists, accounting is memcg-based; otherwise it is
rlimit-based.

WDYT?
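
A rough sketch of such a probe, using libbpf's BTF API rather than parsing
the file by hand, could look like the following. It is not code from the
patch set: it assumes vmlinux BTF is available (CONFIG_DEBUG_INFO_BTF),
which, as noted earlier in the thread, is not a hard dependency for
bpftool, and the function name is illustrative.

#include <stdbool.h>
#include <string.h>
#include <bpf/btf.h>
#include <bpf/libbpf.h>

/* Sketch: report whether 'struct bpf_map' has a 'memcg' member in
 * vmlinux BTF, as a hint that memcg-based accounting is compiled in.
 */
static bool kernel_map_has_memcg_member(void)
{
	const struct btf_member *m;
	const struct btf_type *t;
	bool found = false;
	struct btf *btf;
	int id, i;

	btf = btf__load_vmlinux_btf();
	if (libbpf_get_error(btf))
		return false;	/* no vmlinux BTF: cannot tell */

	id = btf__find_by_name_kind(btf, "bpf_map", BTF_KIND_STRUCT);
	if (id < 0)
		goto out;

	t = btf__type_by_id(btf, id);
	m = btf_members(t);
	for (i = 0; i < btf_vlen(t); i++, m++) {
		if (!strcmp(btf__name_by_offset(btf, m->name_off), "memcg")) {
			found = true;
			break;
		}
	}
out:
	btf__free(btf);
	return found;
}

When BTF is unavailable, such a probe would have to fall back to something
else (the uname check, or bumping the rlimit unconditionally).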
Stanislav Fomichev June 15, 2022, 3:52 p.m. UTC | #11
On Wed, Jun 15, 2022 at 6:23 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> [...]
>
> > Thanks! Yes, memcg would be more complex: the struct is not exposed to
> > user space, and BTF is not a hard dependency for bpftool. I'll work on
> > the uname-based test as a follow-up to this set.
> >
>
> After a second thought, the uname-based test may not work, because
> CONFIG_MEMCG_KMEM can be disabled.

Does it matter? Regardless of whether there is memcg or not, we
shouldn't touch ulimit on 5.11+.
If there is no memcg, there is no BPF memory enforcement.
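
For reference, the uname-based check discussed above could be as simple as
the sketch below. The helper name and fallback behaviour are illustrative,
not part of any submitted patch, and as pointed out in the thread such a
check cannot see backports.

#include <stdbool.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Sketch: true if the running kernel release is at least major.minor. */
static bool kernel_release_is_at_least(int want_major, int want_minor)
{
	struct utsname uts;
	int major, minor;

	if (uname(&uts) || sscanf(uts.release, "%d.%d", &major, &minor) != 2)
		return false;	/* be conservative: keep bumping the rlimit */

	return major > want_major ||
	       (major == want_major && minor >= want_minor);
}

/* Usage sketch:
 *	if (!kernel_release_is_at_least(5, 11))
 *		set_max_rlimit();
 */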
Yafang Shao June 15, 2022, 4:05 p.m. UTC | #12
On Wed, Jun 15, 2022 at 11:52 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> [...]
>
> > After a second thought, the uname-based test may not work, because
> > CONFIG_MEMCG_KMEM can be disabled.
>
> Does it matter? Regardless of whether there is memcg or not, we
> shouldn't touch ulimit on 5.11+
> If there is no memcg, there is no bpf memory enforcement.

Right, rlimit-based accounting has been removed entirely. That is not
what I thought before; I assumed it would fall back to rlimit-based
accounting if kmemcg were disabled.
Quentin Monnet June 16, 2022, 1:59 p.m. UTC | #13
2022-06-16 00:05 UTC+0800 ~ Yafang Shao <laoar.shao@gmail.com>
> [...]
>
>>> After a second thought, the uname-based test may not work, because
>>> CONFIG_MEMCG_KMEM can be disabled.
>>
>> Does it matter? Regardless of whether there is memcg or not, we
>> shouldn't touch ulimit on 5.11+
>> If there is no memcg, there is no bpf memory enforcement.
> 
> Right, rlimit-based accounting is totally removed, that is not the
> same with what I thought before, while I thought it will fallback to
> rlimit-based if kmemcg is disabled.

Agreed, and so I've got a patch ready for the uname-based probe.

But talking about this with Daniel, we were wondering if it would make
sense instead to have the probe I had initially submitted (lower the
rlimit to 0, attempt to load a program, reset rlimit - see [0]), but
only for bpftool instead of libbpf? My understanding is that the memlock
rlimit is per-process, right? So this shouldn't affect any other
process, and because bpftool is not multithreaded, nothing other than
probing would happen while the rlimit is at zero? Or is it just simpler,
if less accurate, to stick to the uname probe?

Quentin

[0]
https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/
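
A sketch of that probe, kept inside bpftool and using the raw bpf(2)
syscall so that libbpf's own rlimit handling stays out of the way, might
look like the following. The function name and return convention are
illustrative, based on the description in [0] and in this thread, not on
code that was merged.

#include <stdbool.h>
#include <string.h>
#include <unistd.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Sketch: drop the soft memlock limit to 0, try to load a trivial
 * program, then restore the limit. If the load still succeeds, the
 * kernel does not enforce RLIMIT_MEMLOCK for BPF (memcg-based
 * accounting), so bumping the rlimit is unnecessary.
 */
static bool probe_memcg_accounting(void)
{
	/* "r0 = 0; exit" -- a minimal program that always loads */
	struct bpf_insn insns[2] = {
		{ .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0 },
		{ .code = BPF_JMP | BPF_EXIT },
	};
	struct rlimit rold, rzero;
	union bpf_attr attr;
	int fd;

	if (getrlimit(RLIMIT_MEMLOCK, &rold))
		return false;
	rzero = rold;
	rzero.rlim_cur = 0;
	if (setrlimit(RLIMIT_MEMLOCK, &rzero))
		return false;

	memset(&attr, 0, sizeof(attr));
	attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
	attr.insns = (__u64)(unsigned long)insns;
	attr.insn_cnt = 2;
	attr.license = (__u64)(unsigned long)"GPL";

	fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));

	/* Restore the original limit before looking at the result. */
	setrlimit(RLIMIT_MEMLOCK, &rold);

	if (fd < 0)
		return false;	/* likely EPERM: rlimit-based accounting */
	close(fd);
	return true;		/* loaded with memlock at 0: memcg-based */
}

Treating a failed load as "rlimit-based" is the conservative choice here:
bpftool would then bump the limit, as it does after this revert.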
Yafang Shao June 16, 2022, 2:54 p.m. UTC | #14
On Thu, Jun 16, 2022 at 9:59 PM Quentin Monnet <quentin@isovalent.com> wrote:
>
> [...]
>
>
> Agreed, and so I've got a patch ready for the uname-based probe.
>
> But talking about this with Daniel, we were wondering if it would make
> sense instead to have the probe I had initially submitted (lower the
> rlimit to 0, attempt to load a program, reset rlimit - see [0]), but
> only for bpftool instead of libbpf? My understanding is that the memlock
> rlimit is per-process, right? So this shouldn't affect any other
> process, and because bpftool is not multithreaded, nothing other than
> probing would happen while the rlimit is at zero?

Makes sense.
It is safe to do the probe within bpftool.

> Or is it just simpler,
> if less accurate, to stick to the uname probe?
>
> Quentin
>
> [0]
> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/
Stanislav Fomichev June 16, 2022, 6:07 p.m. UTC | #15
On Thu, Jun 16, 2022 at 7:54 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> [...]
>
> >
> > Agreed, and so I've got a patch ready for the uname-based probe.
> >
> > But talking about this with Daniel, we were wondering if it would make
> > sense instead to have the probe I had initially submitted (lower the
> > rlimit to 0, attempt to load a program, reset rlimit - see [0]), but
> > only for bpftool instead of libbpf? My understanding is that the memlock
> > rlimit is per-process, right? So this shouldn't affect any other
> > process, and because bpftool is not multithreaded, nothing other than
> > probing would happen while the rlimit is at zero?
>
> Makes sense.
> It is safe to do the probe within bpftool.

+1, seems to be safe to continue doing that in bpftool.
Andrii Nakryiko June 16, 2022, 8:40 p.m. UTC | #16
On Thu, Jun 16, 2022 at 11:08 AM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Thu, Jun 16, 2022 at 7:54 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Thu, Jun 16, 2022 at 9:59 PM Quentin Monnet <quentin@isovalent.com> wrote:
> > >
> > > 2022-06-16 00:05 UTC+0800 ~ Yafang Shao <laoar.shao@gmail.com>
> > > > On Wed, Jun 15, 2022 at 11:52 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > >>
> > > >> On Wed, Jun 15, 2022 at 6:23 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >>>
> > > >>> On Tue, Jun 14, 2022 at 10:20 PM Quentin Monnet <quentin@isovalent.com> wrote:
> > > >>>>
> > > >>>> 2022-06-14 20:37 UTC+0800 ~ Yafang Shao <laoar.shao@gmail.com>
> > > >>>>> On Sat, Jun 11, 2022 at 1:17 AM Stanislav Fomichev <sdf@google.com> wrote:
> > > >>>>>>
> > > >>>>>> On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@isovalent.com> wrote:
> > > >>>>>>>
> > > >>>>>>> 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@google.com>
> > > >>>>>>>> On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@isovalent.com> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>> 2022-06-10 09:07 UTC-0700 ~ sdf@google.com
> > > >>>>>>>>>> On 06/10, Quentin Monnet wrote:
> > > >>>>>>>>>>> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8.
> > > >>>>>>>>>>
> > > >>>>>>>>>>> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of
> > > >>>>>>>>>>> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the
> > > >>>>>>>>>>> kernel has switched to memcg-based memory accounting. Thanks to the
> > > >>>>>>>>>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility
> > > >>>>>>>>>>> with other systems and ask libbpf to raise the limit for us if
> > > >>>>>>>>>>> necessary.
> > > >>>>>>>>>>
> > > >>>>>>>>>>> How do we know if memcg-based accounting is supported? There is a probe
> > > >>>>>>>>>>> in libbpf to check this. But this probe currently relies on the
> > > >>>>>>>>>>> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which
> > > >>>>>>>>>>> landed in the same kernel version as the memory accounting change. This
> > > >>>>>>>>>>> works in the generic case, but it may fail, for example, if the helper
> > > >>>>>>>>>>> function has been backported to an older kernel. This has been observed
> > > >>>>>>>>>>> for Google Cloud's Container-Optimized OS (COS), where the helper is
> > > >>>>>>>>>>> available but rlimit is still in use. The probe succeeds, the rlimit is
> > > >>>>>>>>>>> not raised, and probing features with bpftool, for example, fails.
> > > >>>>>>>>>>
> > > >>>>>>>>>>> A patch was submitted [0] to update this probe in libbpf, based on what
> > > >>>>>>>>>>> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to
> > > >>>>>>>>>>> 0, attempt to load a BPF object, and reset the rlimit. But it may induce
> > > >>>>>>>>>>> some hard-to-debug flakiness if another process starts, or the current
> > > >>>>>>>>>>> application is killed, while the rlimit is reduced, and the approach was
> > > >>>>>>>>>>> discarded.
> > > >>>>>>>>>>
> > > >>>>>>>>>>> As a workaround to ensure that the rlimit bump does not depend on the
> > > >>>>>>>>>>> availability of a given helper, we restore the unconditional rlimit bump
> > > >>>>>>>>>>> in bpftool for now.
> > > >>>>>>>>>>
> > > >>>>>>>>>>> [0]
> > > >>>>>>>>>>> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/
> > > >>>>>>>>>>> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39
> > > >>>>>>>>>>
> > > >>>>>>>>>>> Cc: Yafang Shao <laoar.shao@gmail.com>
> > > >>>>>>>>>>> Signed-off-by: Quentin Monnet <quentin@isovalent.com>
> > > >>>>>>>>>>> ---
> > > >>>>>>>>>>>   tools/bpf/bpftool/common.c     | 8 ++++++++
> > > >>>>>>>>>>>   tools/bpf/bpftool/feature.c    | 2 ++
> > > >>>>>>>>>>>   tools/bpf/bpftool/main.c       | 6 +++---
> > > >>>>>>>>>>>   tools/bpf/bpftool/main.h       | 2 ++
> > > >>>>>>>>>>>   tools/bpf/bpftool/map.c        | 2 ++
> > > >>>>>>>>>>>   tools/bpf/bpftool/pids.c       | 1 +
> > > >>>>>>>>>>>   tools/bpf/bpftool/prog.c       | 3 +++
> > > >>>>>>>>>>>   tools/bpf/bpftool/struct_ops.c | 2 ++
> > > >>>>>>>>>>>   8 files changed, 23 insertions(+), 3 deletions(-)
> > > >>>>>>>>>>
> > > >>>>>>>>>>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
> > > >>>>>>>>>>> index a45b42ee8ab0..a0d4acd7c54a 100644
> > > >>>>>>>>>>> --- a/tools/bpf/bpftool/common.c
> > > >>>>>>>>>>> +++ b/tools/bpf/bpftool/common.c
> > > >>>>>>>>>>> @@ -17,6 +17,7 @@
> > > >>>>>>>>>>>   #include <linux/magic.h>
> > > >>>>>>>>>>>   #include <net/if.h>
> > > >>>>>>>>>>>   #include <sys/mount.h>
> > > >>>>>>>>>>> +#include <sys/resource.h>
> > > >>>>>>>>>>>   #include <sys/stat.h>
> > > >>>>>>>>>>>   #include <sys/vfs.h>
> > > >>>>>>>>>>
> > > >>>>>>>>>>> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path)
> > > >>>>>>>>>>>       return (unsigned long)st_fs.f_type == BPF_FS_MAGIC;
> > > >>>>>>>>>>>   }
> > > >>>>>>>>>>
> > > >>>>>>>>>>> +void set_max_rlimit(void)
> > > >>>>>>>>>>> +{
> > > >>>>>>>>>>> +    struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
> > > >>>>>>>>>>> +
> > > >>>>>>>>>>> +    setrlimit(RLIMIT_MEMLOCK, &rinf);
> > > >>>>>>>>>>
> > > >>>>>>>>>> Do you think it might make sense to print to stderr some warning if
> > > >>>>>>>>>> we actually happen to adjust this limit?
> > > >>>>>>>>>>
> > > >>>>>>>>>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) {
> > > >>>>>>>>>>     fprintf(stderr, "Warning: resetting MEMLOCK rlimit to
> > > >>>>>>>>>>     infinity!\n");
> > > >>>>>>>>>>     setrlimit(RLIMIT_MEMLOCK, &rinf);
> > > >>>>>>>>>> }
> > > >>>>>>>>>>
> > > >>>>>>>>>> ?
> > > >>>>>>>>>>
> > > >>>>>>>>>> Because while it's nice that we automatically do this, this might still
> > > >>>>>>>>>> lead to surprises for some users. OTOH, not sure whether people
> > > >>>>>>>>>> actually read those warnings? :-/
> > > >>>>>>>>>
> > > >>>>>>>>> I'm not strictly opposed to a warning, but I'm not completely sure this
> > > >>>>>>>>> is desirable.
> > > >>>>>>>>>
> > > >>>>>>>>> Bpftool has raised the rlimit for a long time, it changed only in April,
> > > >>>>>>>>> so I don't think it would come up as a surprise for people who have used
> > > >>>>>>>>> it for a while. I think this is also something that several other
> > > >>>>>>>>> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind)
> > > >>>>>>>>> have been doing too.
> > > >>>>>>>>
> > > >>>>>>>> In this case ignore me and let's continue doing that :-)
> > > >>>>>>>>
> > > >>>>>>>> Btw, eventually we'd still like to stop doing that I'd presume?
> > > >>>>>>>
> > > >>>>>>> Agreed. I was thinking either finding a way to improve the probe in
> > > >>>>>>> libbpf, or waiting for some more time until 5.11 gets old, but this may
> > > >>>>>>> take years :/
> > > >>>>>>>
> > > >>>>>>>> Should
> > > >>>>>>>> we at some point follow up with something like:
> > > >>>>>>>>
> > > >>>>>>>> if (kernel_version >= 5.11) { don't touch memlock; }
> > > >>>>>>>>
> > > >>>>>>>> ?
> > > >>>>>>>>
> > > >>>>>>>> I guess we care only about <5.11 because of the backports, but 5.11+
> > > >>>>>>>> kernels are guaranteed to have memcg.
> > > >>>>>>>
> > > >>>>>>> You mean from uname() and parsing the release? Yes, I suppose we could do
> > > >>>>>>> that; I can do it as a follow-up.
> > > >>>>>>
> > > >>>>>> Yeah, uname-based, I don't think we can do better? Given that probing
> > > >>>>>> is problematic as well :-(
> > > >>>>>> But idk, up to you.
> > > >>>>>>
> > > >>>>>
> > > >>>>> Agreed with the uname-based solution. Another possible solution is to
> > > >>>>> probe for the member 'memcg' in struct bpf_map, in case someone
> > > >>>>> backports memcg-based memory accounting, but that would be a little
> > > >>>>> over-engineered. The uname-based solution is simple and can work.
> > > >>>>>
> > > >>>>
> > > >>>> Thanks! Yes, memcg would be more complex: the struct is not exposed to
> > > >>>> user space, and BTF is not a hard dependency for bpftool. I'll work on
> > > >>>> the uname-based test as a follow-up to this set.
> > > >>>>
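A uname()-based check along these lines might look like the sketch below (the
helper name and parsing are illustrative, not from the patch; and as the rest
of the thread settles, the kernel version alone is enough even with
CONFIG_MEMCG_KMEM disabled, since rlimit-based accounting is gone entirely
from 5.11 on):

#include <stdbool.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Sketch only: decide whether to bump RLIMIT_MEMLOCK based on the running
 * kernel version. Kernels 5.11+ account BPF memory via memcg, not rlimit.
 */
static bool kernel_needs_rlimit_bump(void)
{
	unsigned int major, minor;
	struct utsname uts;

	if (uname(&uts))
		return true; /* be conservative: keep bumping the limit */
	if (sscanf(uts.release, "%u.%u", &major, &minor) != 2)
		return true;
	return major < 5 || (major == 5 && minor < 11);
}

With such a helper, set_max_rlimit() could simply return early when it
reports false.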
> > > >>>
> > > >>> On second thought, the uname-based test may not work, because
> > > >>> CONFIG_MEMCG_KMEM can be disabled.
> > > >>
> > > >> Does it matter? Regardless of whether there is memcg or not, we
> > > >> shouldn't touch the memlock rlimit on 5.11+.
> > > >> If there is no memcg, there is no BPF memory enforcement.
> > > >
> > > > Right, rlimit-based accounting is totally removed. That is not what I
> > > > thought before: I assumed it would fall back to rlimit-based accounting
> > > > if kmemcg is disabled.
> > >
> > > Agreed, and so I've got a patch ready for the uname-based probe.
> > >
> > > But talking about this with Daniel, we were wondering if it would make
> > > sense instead to have the probe I had initially submitted (lower the
> > > rlimit to 0, attempt to load a program, reset rlimit - see [0]), but
> > > only for bpftool instead of libbpf? My understanding is that the memlock
> > > rlimit is per-process, right? So this shouldn't affect any other
> > > process, and because bpftool is not multithreaded, nothing other than
> > > probing would happen while the rlimit is at zero?
> >
> > Makes sense.
> > It is safe to do the probe within bpftool.
>
> +1, seems to be safe to continue doing that in bpftool.

Agreed, doing it in the *application* (which is what bpftool is, from
libbpf's point of view) seems totally safe and fine.
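
For illustration, the bpftool-local probe discussed above could look roughly
like the sketch below (assumptions: the function and variable names are made
up, error handling is minimal, and the trivial program is loaded as a socket
filter; this is not the code from [0]):

#include <stdbool.h>
#include <unistd.h>
#include <sys/resource.h>

#include <linux/bpf.h>
#include <bpf/bpf.h>

/* Sketch only: detect memcg-based accounting by trying to load a trivial
 * program while the MEMLOCK soft limit is zero. If the load succeeds, the
 * kernel does not charge BPF memory against RLIMIT_MEMLOCK, so bumping the
 * limit is unnecessary.
 */
static bool probe_memcg_accounting(void)
{
	struct bpf_insn insns[] = {
		{ .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0 },
		{ .code = BPF_JMP | BPF_EXIT },
	};
	struct rlimit old, zero;
	bool memcg = false;
	int fd;

	if (getrlimit(RLIMIT_MEMLOCK, &old))
		return false;
	zero.rlim_cur = 0;            /* lower only the soft limit... */
	zero.rlim_max = old.rlim_max; /* ...so that it can be restored */
	if (setrlimit(RLIMIT_MEMLOCK, &zero))
		return false;

	fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, NULL, "GPL",
			   insns, sizeof(insns) / sizeof(insns[0]), NULL);
	if (fd >= 0) {
		memcg = true;
		close(fd);
	}

	/* Restore the original limit before doing anything else. */
	setrlimit(RLIMIT_MEMLOCK, &old);
	return memcg;
}

Because bpftool is single-threaded and nothing else runs in the process while
the soft limit is at zero, briefly lowering it is harmless here, which is what
the thread above relies on; a library cannot make the same assumption about
its callers.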

Patch

diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
index a45b42ee8ab0..a0d4acd7c54a 100644
--- a/tools/bpf/bpftool/common.c
+++ b/tools/bpf/bpftool/common.c
@@ -17,6 +17,7 @@ 
 #include <linux/magic.h>
 #include <net/if.h>
 #include <sys/mount.h>
+#include <sys/resource.h>
 #include <sys/stat.h>
 #include <sys/vfs.h>
 
@@ -72,6 +73,13 @@  static bool is_bpffs(char *path)
 	return (unsigned long)st_fs.f_type == BPF_FS_MAGIC;
 }
 
+void set_max_rlimit(void)
+{
+	struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
+
+	setrlimit(RLIMIT_MEMLOCK, &rinf);
+}
+
 static int
 mnt_fs(const char *target, const char *type, char *buff, size_t bufflen)
 {
diff --git a/tools/bpf/bpftool/feature.c b/tools/bpf/bpftool/feature.c
index cc9e4df8c58e..bac4ef428a02 100644
--- a/tools/bpf/bpftool/feature.c
+++ b/tools/bpf/bpftool/feature.c
@@ -1167,6 +1167,8 @@  static int do_probe(int argc, char **argv)
 	__u32 ifindex = 0;
 	char *ifname;
 
+	set_max_rlimit();
+
 	while (argc) {
 		if (is_prefix(*argv, "kernel")) {
 			if (target != COMPONENT_UNSPEC) {
diff --git a/tools/bpf/bpftool/main.c b/tools/bpf/bpftool/main.c
index 9062ef2b8767..e81227761f5d 100644
--- a/tools/bpf/bpftool/main.c
+++ b/tools/bpf/bpftool/main.c
@@ -507,9 +507,9 @@  int main(int argc, char **argv)
 		 * It will still be rejected if users use LIBBPF_STRICT_ALL
 		 * mode for loading generated skeleton.
 		 */
-		libbpf_set_strict_mode(LIBBPF_STRICT_ALL & ~LIBBPF_STRICT_MAP_DEFINITIONS);
-	} else {
-		libbpf_set_strict_mode(LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK);
+		ret = libbpf_set_strict_mode(LIBBPF_STRICT_ALL & ~LIBBPF_STRICT_MAP_DEFINITIONS);
+		if (ret)
+			p_err("failed to enable libbpf strict mode: %d", ret);
 	}
 
 	argc -= optind;
diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
index 6c311f47147e..589cb76b227a 100644
--- a/tools/bpf/bpftool/main.h
+++ b/tools/bpf/bpftool/main.h
@@ -96,6 +96,8 @@  int detect_common_prefix(const char *arg, ...);
 void fprint_hex(FILE *f, void *arg, unsigned int n, const char *sep);
 void usage(void) __noreturn;
 
+void set_max_rlimit(void);
+
 int mount_tracefs(const char *target);
 
 struct obj_ref {
diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 800834be1bcb..38b6bc9c26c3 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -1326,6 +1326,8 @@  static int do_create(int argc, char **argv)
 		goto exit;
 	}
 
+	set_max_rlimit();
+
 	fd = bpf_map_create(map_type, map_name, key_size, value_size, max_entries, &attr);
 	if (fd < 0) {
 		p_err("map create failed: %s", strerror(errno));
diff --git a/tools/bpf/bpftool/pids.c b/tools/bpf/bpftool/pids.c
index e2d00d3cd868..bb6c969a114a 100644
--- a/tools/bpf/bpftool/pids.c
+++ b/tools/bpf/bpftool/pids.c
@@ -108,6 +108,7 @@  int build_obj_refs_table(struct hashmap **map, enum bpf_obj_type type)
 		p_err("failed to create hashmap for PID references");
 		return -1;
 	}
+	set_max_rlimit();
 
 	skel = pid_iter_bpf__open();
 	if (!skel) {
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index e71f0b2da50b..f081de398b60 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -1590,6 +1590,8 @@  static int load_with_options(int argc, char **argv, bool first_prog_only)
 		}
 	}
 
+	set_max_rlimit();
+
 	if (verifier_logs)
 		/* log_level1 + log_level2 + stats, but not stable UAPI */
 		open_opts.kernel_log_level = 1 + 2 + 4;
@@ -2287,6 +2289,7 @@  static int do_profile(int argc, char **argv)
 		}
 	}
 
+	set_max_rlimit();
 	err = profiler_bpf__load(profile_obj);
 	if (err) {
 		p_err("failed to load profile_obj");
diff --git a/tools/bpf/bpftool/struct_ops.c b/tools/bpf/bpftool/struct_ops.c
index 2535f079ed67..e08a6ff2866c 100644
--- a/tools/bpf/bpftool/struct_ops.c
+++ b/tools/bpf/bpftool/struct_ops.c
@@ -501,6 +501,8 @@  static int do_register(int argc, char **argv)
 	if (libbpf_get_error(obj))
 		return -1;
 
+	set_max_rlimit();
+
 	if (bpf_object__load(obj)) {
 		bpf_object__close(obj);
 		return -1;