diff mbox series

[bpf-next] bpftool: Probe for memcg-based accounting before bumping rlimit

Message ID 20220628164529.80050-1-quentin@isovalent.com (mailing list archive)
State Superseded
Delegated to: BPF
Headers show
Series [bpf-next] bpftool: Probe for memcg-based accounting before bumping rlimit | expand

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-3 success Logs for Kernel LATEST on z15 with gcc
bpf/vmtest-bpf-next-VM_Test-1 success Logs for Kernel LATEST on ubuntu-latest with gcc
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Kernel LATEST on ubuntu-latest with llvm-15
netdev/tree_selection success Clearly marked for bpf-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 2 this patch: 2
netdev/cc_maintainers warning 9 maintainers not CCed: songliubraving@fb.com rppt@kernel.org willy@infradead.org yhs@fb.com john.fastabend@gmail.com karolinadrobnik@gmail.com kafai@fb.com skhan@linuxfoundation.org kpsingh@kernel.org
netdev/build_clang success Errors and warnings before: 979 this patch: 979
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 3475 this patch: 3475
netdev/checkpatch warning WARNING: line length of 88 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Quentin Monnet June 28, 2022, 4:45 p.m. UTC
Bpftool used to bump the memlock rlimit to make sure to be able to load
BPF objects. After the kernel has switched to memcg-based memory
accounting [0] in 5.11, bpftool has relied on libbpf to probe the system
for memcg-based accounting support and for raising the rlimit if
necessary [1]. But this was later reverted, because the probe would
sometimes fail, resulting in bpftool not being able to load all required
objects [2].

Here we add a more efficient probe, in bpftool itself. We first lower
the rlimit to 0, then we attempt to load a BPF object (and finally reset
the rlimit): if the load succeeds, then memcg-based memory accounting is
supported.

This approach was earlier proposed for the probe in libbpf itself [3],
but given that the library may be used in multithreaded applications,
the probe could have undesirable consequences if one thread attempts to
lock kernel memory while memlock rlimit is at 0. Since bpftool is
single-threaded and the rlimit is process-based, this is fine to do in
bpftool itself.

This probe was inspired by the similar one from the cilium/ebpf Go
library [4].

[0] commit 97306be45fbe ("Merge branch 'switch to memcg-based memory accounting'")
[1] commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK")
[2] commit 6b4384ff1088 ("Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"")
[3] https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/t/#u
[4] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39

Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
---
 tools/bpf/bpftool/common.c   | 71 ++++++++++++++++++++++++++++++++++--
 tools/include/linux/kernel.h |  5 +++
 2 files changed, 73 insertions(+), 3 deletions(-)

Comments

Stanislav Fomichev June 28, 2022, 5:53 p.m. UTC | #1
On Tue, Jun 28, 2022 at 9:45 AM Quentin Monnet <quentin@isovalent.com> wrote:
>
> Bpftool used to bump the memlock rlimit to make sure to be able to load
> BPF objects. After the kernel has switched to memcg-based memory
> accounting [0] in 5.11, bpftool has relied on libbpf to probe the system
> for memcg-based accounting support and for raising the rlimit if
> necessary [1]. But this was later reverted, because the probe would
> sometimes fail, resulting in bpftool not being able to load all required
> objects [2].
>
> Here we add a more efficient probe, in bpftool itself. We first lower
> the rlimit to 0, then we attempt to load a BPF object (and finally reset
> the rlimit): if the load succeeds, then memcg-based memory accounting is
> supported.
>
> This approach was earlier proposed for the probe in libbpf itself [3],
> but given that the library may be used in multithreaded applications,
> the probe could have undesirable consequences if one thread attempts to
> lock kernel memory while memlock rlimit is at 0. Since bpftool is
> single-threaded and the rlimit is process-based, this is fine to do in
> bpftool itself.
>
> This probe was inspired by the similar one from the cilium/ebpf Go
> library [4].
>
> [0] commit 97306be45fbe ("Merge branch 'switch to memcg-based memory accounting'")
> [1] commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK")
> [2] commit 6b4384ff1088 ("Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"")
> [3] https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/t/#u
> [4] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39
>
> Cc: Stanislav Fomichev <sdf@google.com>
> Cc: Yafang Shao <laoar.shao@gmail.com>
> Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Quentin Monnet <quentin@isovalent.com>
> ---
>  tools/bpf/bpftool/common.c   | 71 ++++++++++++++++++++++++++++++++++--
>  tools/include/linux/kernel.h |  5 +++
>  2 files changed, 73 insertions(+), 3 deletions(-)
>
> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
> index a0d4acd7c54a..e07769802f76 100644
> --- a/tools/bpf/bpftool/common.c
> +++ b/tools/bpf/bpftool/common.c
> @@ -13,14 +13,17 @@
>  #include <stdlib.h>
>  #include <string.h>
>  #include <unistd.h>
> -#include <linux/limits.h>
> -#include <linux/magic.h>
>  #include <net/if.h>
>  #include <sys/mount.h>
>  #include <sys/resource.h>
>  #include <sys/stat.h>
>  #include <sys/vfs.h>
>
> +#include <linux/filter.h>
> +#include <linux/limits.h>
> +#include <linux/magic.h>
> +#include <linux/unistd.h>
> +
>  #include <bpf/bpf.h>
>  #include <bpf/hashmap.h>
>  #include <bpf/libbpf.h> /* libbpf_num_possible_cpus */
> @@ -73,11 +76,73 @@ static bool is_bpffs(char *path)
>         return (unsigned long)st_fs.f_type == BPF_FS_MAGIC;
>  }
>
> +/* Probe whether kernel switched from memlock-based (RLIMIT_MEMLOCK) to
> + * memcg-based memory accounting for BPF maps and programs. This was done in
> + * commit 97306be45fbe ("Merge branch 'switch to memcg-based memory
> + * accounting'"), in Linux 5.11.
> + *
> + * Libbpf also offers to probe for memcg-based accounting vs rlimit, but does
> + * so by checking for the availability of a given BPF helper and this has
> + * failed on some kernels with backports in the past, see commit 6b4384ff1088
> + * ("Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"").
> + * Instead, we can probe by lowering the process-based rlimit to 0, trying to
> + * load a BPF object, and resetting the rlimit. If the load succeeds then
> + * memcg-based accounting is supported.
> + *
> + * This would be too dangerous to do in the library, because multithreaded
> + * applications might attempt to load items while the rlimit is at 0. Given
> + * that bpftool is single-threaded, this is fine to do here.
> + */
> +static bool known_to_need_rlimit(void)
> +{
> +       const size_t prog_load_attr_sz = offsetofend(union bpf_attr, attach_btf_obj_fd);

nit:
Any specific reason you're hard coding this sz via offseofend? Why not
use sizeof(bpf_attr) directly as a syscall/memset size?
The kernel should handle all these cases where bpftool has extra zero
padding, right?

> +       struct bpf_insn insns[] = {
> +               BPF_EXIT_INSN(),
> +       };
> +       struct rlimit rlim_init, rlim_cur_zero = {};
> +       size_t insn_cnt = ARRAY_SIZE(insns);
> +       union bpf_attr attr;
> +       int prog_fd, err;
> +
> +       memset(&attr, 0, prog_load_attr_sz);
> +       attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
> +       attr.insns = ptr_to_u64(insns);
> +       attr.insn_cnt = insn_cnt;
> +       attr.license = ptr_to_u64("GPL");
> +
> +       if (getrlimit(RLIMIT_MEMLOCK, &rlim_init))
> +               return false;
> +
> +       /* Drop the soft limit to zero. We maintain the hard limit to its
> +        * current value, because lowering it would be a permanent operation
> +        * for unprivileged users.
> +        */
> +       rlim_cur_zero.rlim_max = rlim_init.rlim_max;
> +       if (setrlimit(RLIMIT_MEMLOCK, &rlim_cur_zero))
> +               return false;
> +
> +       /* Do not use bpf_prog_load() from libbpf here, because it calls
> +        * bump_rlimit_memlock(), interfering with the current probe.
> +        */
> +       prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, prog_load_attr_sz);
> +       err = errno;
> +
> +       /* reset soft rlimit to its initial value */
> +       setrlimit(RLIMIT_MEMLOCK, &rlim_init);
> +
> +       if (prog_fd < 0)
> +               return err == EPERM;
> +
> +       close(prog_fd);
> +       return false;
> +}
> +
>  void set_max_rlimit(void)
>  {
>         struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
>
> -       setrlimit(RLIMIT_MEMLOCK, &rinf);
> +       if (known_to_need_rlimit())
> +               setrlimit(RLIMIT_MEMLOCK, &rinf);
>  }
>
>  static int
> diff --git a/tools/include/linux/kernel.h b/tools/include/linux/kernel.h
> index 4b0673bf52c2..5c90d65cc2d3 100644
> --- a/tools/include/linux/kernel.h
> +++ b/tools/include/linux/kernel.h
> @@ -24,6 +24,11 @@
>  #define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
>  #endif
>
> +#ifndef offsetofend
> +# define offsetofend(TYPE, FIELD) \
> +       (offsetof(TYPE, FIELD) + sizeof(((TYPE *)0)->FIELD))
> +#endif
> +
>  #ifndef container_of
>  /**
>   * container_of - cast a member of a structure out to the containing structure
> --
> 2.34.1
>
Quentin Monnet June 29, 2022, 11:11 a.m. UTC | #2
On 28/06/2022 18:53, Stanislav Fomichev wrote:
> On Tue, Jun 28, 2022 at 9:45 AM Quentin Monnet <quentin@isovalent.com> wrote:
>>
>> Bpftool used to bump the memlock rlimit to make sure to be able to load
>> BPF objects. After the kernel has switched to memcg-based memory
>> accounting [0] in 5.11, bpftool has relied on libbpf to probe the system
>> for memcg-based accounting support and for raising the rlimit if
>> necessary [1]. But this was later reverted, because the probe would
>> sometimes fail, resulting in bpftool not being able to load all required
>> objects [2].
>>
>> Here we add a more efficient probe, in bpftool itself. We first lower
>> the rlimit to 0, then we attempt to load a BPF object (and finally reset
>> the rlimit): if the load succeeds, then memcg-based memory accounting is
>> supported.
>>
>> This approach was earlier proposed for the probe in libbpf itself [3],
>> but given that the library may be used in multithreaded applications,
>> the probe could have undesirable consequences if one thread attempts to
>> lock kernel memory while memlock rlimit is at 0. Since bpftool is
>> single-threaded and the rlimit is process-based, this is fine to do in
>> bpftool itself.
>>
>> This probe was inspired by the similar one from the cilium/ebpf Go
>> library [4].
>>
>> [0] commit 97306be45fbe ("Merge branch 'switch to memcg-based memory accounting'")
>> [1] commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK")
>> [2] commit 6b4384ff1088 ("Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"")
>> [3] https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/t/#u
>> [4] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39
>>
>> Cc: Stanislav Fomichev <sdf@google.com>
>> Cc: Yafang Shao <laoar.shao@gmail.com>
>> Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
>> Signed-off-by: Quentin Monnet <quentin@isovalent.com>
>> ---
>>  tools/bpf/bpftool/common.c   | 71 ++++++++++++++++++++++++++++++++++--
>>  tools/include/linux/kernel.h |  5 +++
>>  2 files changed, 73 insertions(+), 3 deletions(-)
>>
>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
>> index a0d4acd7c54a..e07769802f76 100644
>> --- a/tools/bpf/bpftool/common.c
>> +++ b/tools/bpf/bpftool/common.c
>> @@ -13,14 +13,17 @@
>>  #include <stdlib.h>
>>  #include <string.h>
>>  #include <unistd.h>
>> -#include <linux/limits.h>
>> -#include <linux/magic.h>
>>  #include <net/if.h>
>>  #include <sys/mount.h>
>>  #include <sys/resource.h>
>>  #include <sys/stat.h>
>>  #include <sys/vfs.h>
>>
>> +#include <linux/filter.h>
>> +#include <linux/limits.h>
>> +#include <linux/magic.h>
>> +#include <linux/unistd.h>
>> +
>>  #include <bpf/bpf.h>
>>  #include <bpf/hashmap.h>
>>  #include <bpf/libbpf.h> /* libbpf_num_possible_cpus */
>> @@ -73,11 +76,73 @@ static bool is_bpffs(char *path)
>>         return (unsigned long)st_fs.f_type == BPF_FS_MAGIC;
>>  }
>>
>> +/* Probe whether kernel switched from memlock-based (RLIMIT_MEMLOCK) to
>> + * memcg-based memory accounting for BPF maps and programs. This was done in
>> + * commit 97306be45fbe ("Merge branch 'switch to memcg-based memory
>> + * accounting'"), in Linux 5.11.
>> + *
>> + * Libbpf also offers to probe for memcg-based accounting vs rlimit, but does
>> + * so by checking for the availability of a given BPF helper and this has
>> + * failed on some kernels with backports in the past, see commit 6b4384ff1088
>> + * ("Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"").
>> + * Instead, we can probe by lowering the process-based rlimit to 0, trying to
>> + * load a BPF object, and resetting the rlimit. If the load succeeds then
>> + * memcg-based accounting is supported.
>> + *
>> + * This would be too dangerous to do in the library, because multithreaded
>> + * applications might attempt to load items while the rlimit is at 0. Given
>> + * that bpftool is single-threaded, this is fine to do here.
>> + */
>> +static bool known_to_need_rlimit(void)
>> +{
>> +       const size_t prog_load_attr_sz = offsetofend(union bpf_attr, attach_btf_obj_fd);
> 
> nit:
> Any specific reason you're hard coding this sz via offseofend? Why not
> use sizeof(bpf_attr) directly as a syscall/memset size?
> The kernel should handle all these cases where bpftool has extra zero
> padding, right?

No particular reason. Good point, I'll send a v2 to address this.

Thanks,
Quentin
diff mbox series

Patch

diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
index a0d4acd7c54a..e07769802f76 100644
--- a/tools/bpf/bpftool/common.c
+++ b/tools/bpf/bpftool/common.c
@@ -13,14 +13,17 @@ 
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
-#include <linux/limits.h>
-#include <linux/magic.h>
 #include <net/if.h>
 #include <sys/mount.h>
 #include <sys/resource.h>
 #include <sys/stat.h>
 #include <sys/vfs.h>
 
+#include <linux/filter.h>
+#include <linux/limits.h>
+#include <linux/magic.h>
+#include <linux/unistd.h>
+
 #include <bpf/bpf.h>
 #include <bpf/hashmap.h>
 #include <bpf/libbpf.h> /* libbpf_num_possible_cpus */
@@ -73,11 +76,73 @@  static bool is_bpffs(char *path)
 	return (unsigned long)st_fs.f_type == BPF_FS_MAGIC;
 }
 
+/* Probe whether kernel switched from memlock-based (RLIMIT_MEMLOCK) to
+ * memcg-based memory accounting for BPF maps and programs. This was done in
+ * commit 97306be45fbe ("Merge branch 'switch to memcg-based memory
+ * accounting'"), in Linux 5.11.
+ *
+ * Libbpf also offers to probe for memcg-based accounting vs rlimit, but does
+ * so by checking for the availability of a given BPF helper and this has
+ * failed on some kernels with backports in the past, see commit 6b4384ff1088
+ * ("Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"").
+ * Instead, we can probe by lowering the process-based rlimit to 0, trying to
+ * load a BPF object, and resetting the rlimit. If the load succeeds then
+ * memcg-based accounting is supported.
+ *
+ * This would be too dangerous to do in the library, because multithreaded
+ * applications might attempt to load items while the rlimit is at 0. Given
+ * that bpftool is single-threaded, this is fine to do here.
+ */
+static bool known_to_need_rlimit(void)
+{
+	const size_t prog_load_attr_sz = offsetofend(union bpf_attr, attach_btf_obj_fd);
+	struct bpf_insn insns[] = {
+		BPF_EXIT_INSN(),
+	};
+	struct rlimit rlim_init, rlim_cur_zero = {};
+	size_t insn_cnt = ARRAY_SIZE(insns);
+	union bpf_attr attr;
+	int prog_fd, err;
+
+	memset(&attr, 0, prog_load_attr_sz);
+	attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
+	attr.insns = ptr_to_u64(insns);
+	attr.insn_cnt = insn_cnt;
+	attr.license = ptr_to_u64("GPL");
+
+	if (getrlimit(RLIMIT_MEMLOCK, &rlim_init))
+		return false;
+
+	/* Drop the soft limit to zero. We maintain the hard limit to its
+	 * current value, because lowering it would be a permanent operation
+	 * for unprivileged users.
+	 */
+	rlim_cur_zero.rlim_max = rlim_init.rlim_max;
+	if (setrlimit(RLIMIT_MEMLOCK, &rlim_cur_zero))
+		return false;
+
+	/* Do not use bpf_prog_load() from libbpf here, because it calls
+	 * bump_rlimit_memlock(), interfering with the current probe.
+	 */
+	prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, prog_load_attr_sz);
+	err = errno;
+
+	/* reset soft rlimit to its initial value */
+	setrlimit(RLIMIT_MEMLOCK, &rlim_init);
+
+	if (prog_fd < 0)
+		return err == EPERM;
+
+	close(prog_fd);
+	return false;
+}
+
 void set_max_rlimit(void)
 {
 	struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
 
-	setrlimit(RLIMIT_MEMLOCK, &rinf);
+	if (known_to_need_rlimit())
+		setrlimit(RLIMIT_MEMLOCK, &rinf);
 }
 
 static int
diff --git a/tools/include/linux/kernel.h b/tools/include/linux/kernel.h
index 4b0673bf52c2..5c90d65cc2d3 100644
--- a/tools/include/linux/kernel.h
+++ b/tools/include/linux/kernel.h
@@ -24,6 +24,11 @@ 
 #define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
 #endif
 
+#ifndef offsetofend
+# define offsetofend(TYPE, FIELD) \
+	(offsetof(TYPE, FIELD) + sizeof(((TYPE *)0)->FIELD))
+#endif
+
 #ifndef container_of
 /**
  * container_of - cast a member of a structure out to the containing structure