| Message ID | 20220128234517.3503701-8-song@kernel.org (mailing list archive) |
|---|---|
| State | Superseded |
| Delegated to: | BPF |
| Series | bpf_prog_pack allocator |
| Context | Check | Description |
|---|---|---|
| bpf/vmtest-bpf-next-PR | fail | merge-conflict |
| netdev/tree_selection | success | Clearly marked for bpf-next |
| netdev/apply | fail | Patch does not apply to bpf-next |
On 1/29/22 12:45 AM, Song Liu wrote:
> Most BPF programs are small, but they consume a page each. For systems
> with busy traffic and many BPF programs, this could add significant
> pressure to instruction TLB.
>
> Introduce bpf_prog_pack allocator to pack multiple BPF programs in a huge
> page. The memory is then allocated in 64 byte chunks.
>
> Memory allocated by bpf_prog_pack allocator is RO protected after initial
> allocation. To write to it, the user (jit engine) need to use text poke
> API.

Did you benchmark the program load times under this API, e.g. how much
overhead is expected for very large programs?

> Signed-off-by: Song Liu <song@kernel.org>
> ---
>   kernel/bpf/core.c | 127 ++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 127 insertions(+)
>
[...]
> +	mutex_lock(&pack_mutex);
> +	list_for_each_entry(pack, &pack_list, list) {
> +		pos = bitmap_find_next_zero_area(pack->bitmap, BPF_PROG_CHUNK_COUNT, 0,
> +						 nbits, 0);
> +		if (pos < BPF_PROG_CHUNK_COUNT)
> +			goto found_free_area;
> +	}
> +
> +	pack = alloc_new_pack();
> +	if (!pack)
> +		goto out;

Will this effectively disable the JIT for all bpf_prog_pack_alloc requests <=
BPF_PROG_MAX_PACK_PROG_SIZE when vmap_allow_huge is false (e.g. boot param via
nohugevmalloc) ?

[...]
On Mon, Jan 31, 2022 at 4:06 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 1/29/22 12:45 AM, Song Liu wrote:
> > Most BPF programs are small, but they consume a page each. For systems
> > with busy traffic and many BPF programs, this could add significant
> > pressure to instruction TLB.
> >
> > Introduce bpf_prog_pack allocator to pack multiple BPF programs in a huge
> > page. The memory is then allocated in 64 byte chunks.
> >
> > Memory allocated by bpf_prog_pack allocator is RO protected after initial
> > allocation. To write to it, the user (jit engine) need to use text poke
> > API.
>
> Did you benchmark the program load times under this API, e.g. how much
> overhead is expected for very large programs?

For the two scale tests in test_verifier:

./test_verifier 965 966
#965/p scale: scale test 1 OK
#966/p scale: scale test 2 OK

The runtime is about 0.6 second before the set and 0.7 second after.
Is this a good benchmark?

>
> > Signed-off-by: Song Liu <song@kernel.org>
> > ---
> >   kernel/bpf/core.c | 127 ++++++++++++++++++++++++++++++++++++++++++++++
> >   1 file changed, 127 insertions(+)
>
> [...]
> > +	}
> > +	mutex_lock(&pack_mutex);
> > +	list_for_each_entry(pack, &pack_list, list) {
> > +		pos = bitmap_find_next_zero_area(pack->bitmap, BPF_PROG_CHUNK_COUNT, 0,
> > +						 nbits, 0);
> > +		if (pos < BPF_PROG_CHUNK_COUNT)
> > +			goto found_free_area;
> > +	}
> > +
> > +	pack = alloc_new_pack();
> > +	if (!pack)
> > +		goto out;
>
> Will this effectively disable the JIT for all bpf_prog_pack_alloc requests <=
> BPF_PROG_MAX_PACK_PROG_SIZE when vmap_allow_huge is false (e.g. boot param via
> nohugevmalloc) ?

This won't disable JIT. It will just allocate 512x 4k pages for a 2MB
pack. We will mark the whole 2MB RO, same as a 2MB huge page. We still
benefit from this as this avoids poking the linear mapping (1GB pages)
to 4kB pages with set_memory_ro().

Thanks,
Song
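For concreteness on the fallback Song describes: whether module_alloc() hands back one 2 MB huge-page mapping or 512 discrete 4 kB pages, the pack is still flipped read-only and executable with a single set_memory_ro()/set_memory_x() pair covering BPF_PROG_PACK_SIZE / PAGE_SIZE pages. A tiny user-space sketch of that arithmetic follows; the constants mirror the patch for the usual x86 values, and nothing here is kernel API.

```c
#include <stdio.h>

/* Constants mirroring the patch, assuming the usual x86 page sizes. */
#define PAGE_SIZE            4096UL
#define HPAGE_PMD_SIZE       (2UL * 1024 * 1024)
#define BPF_PROG_PACK_SIZE   HPAGE_PMD_SIZE
#define BPF_PROG_CHUNK_SIZE  64UL

int main(void)
{
	/* One set_memory_ro()/set_memory_x() pair covers this many base pages ... */
	printf("base pages per pack: %lu\n", BPF_PROG_PACK_SIZE / PAGE_SIZE);          /* 512 */
	/* ... and the pack is carved into this many allocatable 64-byte chunks. */
	printf("chunks per pack:     %lu\n", BPF_PROG_PACK_SIZE / BPF_PROG_CHUNK_SIZE); /* 32768 */
	return 0;
}
```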
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index dc0142e20c72..25e34caa9a95 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -805,6 +805,133 @@ int bpf_jit_add_poke_descriptor(struct bpf_prog *prog,
 	return slot;
 }
 
+/*
+ * BPF program pack allocator.
+ *
+ * Most BPF programs are pretty small. Allocating a hole page for each
+ * program is sometime a waste. Many small bpf program also adds pressure
+ * to instruction TLB. To solve this issue, we introduce a BPF program pack
+ * allocator. The prog_pack allocator uses HPAGE_PMD_SIZE page (2MB on x86)
+ * to host BPF programs.
+ */
+#define BPF_PROG_PACK_SIZE	HPAGE_PMD_SIZE
+#define BPF_PROG_CHUNK_SHIFT	6
+#define BPF_PROG_CHUNK_SIZE	(1 << BPF_PROG_CHUNK_SHIFT)
+#define BPF_PROG_CHUNK_MASK	(~(BPF_PROG_CHUNK_SIZE - 1))
+#define BPF_PROG_CHUNK_COUNT	(BPF_PROG_PACK_SIZE / BPF_PROG_CHUNK_SIZE)
+
+struct bpf_prog_pack {
+	struct list_head list;
+	void *ptr;
+	unsigned long bitmap[BITS_TO_LONGS(BPF_PROG_CHUNK_COUNT)];
+};
+
+#define BPF_PROG_MAX_PACK_PROG_SIZE	HPAGE_PMD_SIZE
+#define BPF_PROG_SIZE_TO_NBITS(size)	(round_up(size, BPF_PROG_CHUNK_SIZE) / BPF_PROG_CHUNK_SIZE)
+
+static DEFINE_MUTEX(pack_mutex);
+static LIST_HEAD(pack_list);
+
+static struct bpf_prog_pack *alloc_new_pack(void)
+{
+	struct bpf_prog_pack *pack;
+
+	pack = kzalloc(sizeof(*pack), GFP_KERNEL);
+	if (!pack)
+		return NULL;
+	pack->ptr = module_alloc(BPF_PROG_PACK_SIZE);
+	if (!pack->ptr) {
+		kfree(pack);
+		return NULL;
+	}
+	bitmap_zero(pack->bitmap, BPF_PROG_PACK_SIZE / BPF_PROG_CHUNK_SIZE);
+	list_add_tail(&pack->list, &pack_list);
+
+	set_vm_flush_reset_perms(pack->ptr);
+	set_memory_ro((unsigned long)pack->ptr, BPF_PROG_PACK_SIZE / PAGE_SIZE);
+	set_memory_x((unsigned long)pack->ptr, BPF_PROG_PACK_SIZE / PAGE_SIZE);
+	return pack;
+}
+
+static void *bpf_prog_pack_alloc(u32 size)
+{
+	unsigned int nbits = BPF_PROG_SIZE_TO_NBITS(size);
+	struct bpf_prog_pack *pack;
+	unsigned long pos;
+	void *ptr = NULL;
+
+	if (size > BPF_PROG_MAX_PACK_PROG_SIZE) {
+		size = round_up(size, PAGE_SIZE);
+		ptr = module_alloc(size);
+		if (ptr) {
+			set_vm_flush_reset_perms(ptr);
+			set_memory_ro((unsigned long)ptr, size / PAGE_SIZE);
+			set_memory_x((unsigned long)ptr, size / PAGE_SIZE);
+		}
+		return ptr;
+	}
+	mutex_lock(&pack_mutex);
+	list_for_each_entry(pack, &pack_list, list) {
+		pos = bitmap_find_next_zero_area(pack->bitmap, BPF_PROG_CHUNK_COUNT, 0,
+						 nbits, 0);
+		if (pos < BPF_PROG_CHUNK_COUNT)
+			goto found_free_area;
+	}
+
+	pack = alloc_new_pack();
+	if (!pack)
+		goto out;
+
+	pos = 0;
+
+found_free_area:
+	bitmap_set(pack->bitmap, pos, nbits);
+	ptr = (void *)(pack->ptr) + (pos << BPF_PROG_CHUNK_SHIFT);
+
+out:
+	mutex_unlock(&pack_mutex);
+	return ptr;
+}
+
+static void bpf_prog_pack_free(struct bpf_binary_header *hdr)
+{
+	struct bpf_prog_pack *pack = NULL, *tmp;
+	unsigned int nbits;
+	unsigned long pos;
+	void *pack_ptr;
+
+	if (hdr->size > BPF_PROG_MAX_PACK_PROG_SIZE) {
+		module_memfree(hdr);
+		return;
+	}
+
+	pack_ptr = (void *)((unsigned long)hdr & ~(BPF_PROG_PACK_SIZE - 1));
+	mutex_lock(&pack_mutex);
+
+	list_for_each_entry(tmp, &pack_list, list) {
+		if (tmp->ptr == pack_ptr) {
+			pack = tmp;
+			break;
+		}
+	}
+
+	if (WARN_ONCE(!pack, "bpf_prog_pack bug\n"))
+		goto out;
+
+	nbits = BPF_PROG_SIZE_TO_NBITS(hdr->size);
+	pos = ((unsigned long)hdr - (unsigned long)pack_ptr) >> BPF_PROG_CHUNK_SHIFT;
+
+	bitmap_clear(pack->bitmap, pos, nbits);
+	if (bitmap_find_next_zero_area(pack->bitmap, BPF_PROG_CHUNK_COUNT, 0,
+				       BPF_PROG_CHUNK_COUNT, 0) == 0) {
+		list_del(&pack->list);
+		module_memfree(pack->ptr);
+		kfree(pack);
+	}
+out:
+	mutex_unlock(&pack_mutex);
+}
+
 static atomic_long_t bpf_jit_current;
 
 /* Can be overridden by an arch's JIT compiler if it has a custom,
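To make the chunk bookkeeping in bpf_prog_pack_alloc()/bpf_prog_pack_free() easier to follow, here is a small stand-alone user-space simulation of the same idea: one bit of state per 64-byte chunk, a first-fit search for a free run, and a free path that clears the run so the space can be reused. This is a sketch only; the names (pack_alloc, find_free_area, etc.), the byte-per-chunk array, and plain malloc() backing are illustrative simplifications, not the kernel's API, which uses module_alloc(), a real bitmap with bitmap_find_next_zero_area(), pack_mutex, and the RO/X permission changes shown in the patch.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PACK_SIZE    (2 * 1024 * 1024)         /* stands in for HPAGE_PMD_SIZE */
#define CHUNK_SHIFT  6
#define CHUNK_SIZE   (1 << CHUNK_SHIFT)        /* 64-byte allocation granularity */
#define CHUNK_COUNT  (PACK_SIZE / CHUNK_SIZE)  /* 32768 chunks per pack */

struct pack {
	unsigned char *ptr;                    /* backing memory (module_alloc in the patch) */
	unsigned char  bitmap[CHUNK_COUNT];    /* one byte per chunk for simplicity */
};

/* Round a program size up to whole chunks, like BPF_PROG_SIZE_TO_NBITS(). */
static size_t size_to_nchunks(size_t size)
{
	return (size + CHUNK_SIZE - 1) / CHUNK_SIZE;
}

/* First-fit search for a run of 'nchunks' free chunks; returns chunk index or -1. */
static long find_free_area(const struct pack *p, size_t nchunks)
{
	size_t run = 0;

	for (size_t i = 0; i < CHUNK_COUNT; i++) {
		run = p->bitmap[i] ? 0 : run + 1;
		if (run == nchunks)
			return (long)(i + 1 - nchunks);
	}
	return -1;
}

static void *pack_alloc(struct pack *p, size_t size)
{
	size_t nchunks = size_to_nchunks(size);
	long pos = find_free_area(p, nchunks);

	if (pos < 0)
		return NULL;                   /* the kernel code allocates a new pack here */
	memset(&p->bitmap[pos], 1, nchunks);   /* mark the run used, like bitmap_set() */
	return p->ptr + ((size_t)pos << CHUNK_SHIFT);
}

static void pack_free(struct pack *p, void *addr, size_t size)
{
	size_t pos = (size_t)((unsigned char *)addr - p->ptr) >> CHUNK_SHIFT;

	memset(&p->bitmap[pos], 0, size_to_nchunks(size));  /* like bitmap_clear() */
}

int main(void)
{
	struct pack p = { .ptr = malloc(PACK_SIZE) };  /* bitmap starts zeroed */
	void *a, *b, *c;

	a = pack_alloc(&p, 300);   /* 300 B rounds up to 5 chunks: 0..4 */
	b = pack_alloc(&p, 4096);  /* 4 KB is 64 chunks: 5..68         */
	printf("a -> chunk %td, b -> chunk %td\n",
	       ((unsigned char *)a - p.ptr) >> CHUNK_SHIFT,
	       ((unsigned char *)b - p.ptr) >> CHUNK_SHIFT);

	pack_free(&p, a, 300);     /* chunks 0..4 become reusable            */
	c = pack_alloc(&p, 128);   /* 2 chunks, first fit lands back at 0    */
	printf("c -> chunk %td\n", ((unsigned char *)c - p.ptr) >> CHUNK_SHIFT);

	free(p.ptr);
	return 0;
}
```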
Most BPF programs are small, but they consume a page each. For systems
with busy traffic and many BPF programs, this could add significant
pressure to the instruction TLB.

Introduce the bpf_prog_pack allocator to pack multiple BPF programs into
a huge page. The memory is then allocated in 64-byte chunks.

Memory allocated by the bpf_prog_pack allocator is RO protected after
initial allocation. To write to it, the user (JIT engine) needs to use
the text poke API.

Signed-off-by: Song Liu <song@kernel.org>
---
 kernel/bpf/core.c | 127 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 127 insertions(+)
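For a feel of what the 64-byte granularity buys, the snippet below applies the same rounding as BPF_PROG_SIZE_TO_NBITS() to a few example image sizes (the sizes are made up for illustration, not measurements from this series): a 300-byte image ends up occupying 320 bytes of a shared pack instead of its own 4 kB page.

```c
#include <stdio.h>

#define CHUNK_SIZE 64u    /* BPF_PROG_CHUNK_SIZE in the patch */
#define PAGE_SIZE  4096u  /* usual x86 base page size */

/* Same rounding as BPF_PROG_SIZE_TO_NBITS(): whole 64-byte chunks per program. */
static unsigned int size_to_nbits(unsigned int size)
{
	return (size + CHUNK_SIZE - 1) / CHUNK_SIZE;
}

int main(void)
{
	unsigned int sizes[] = { 64, 300, 1500, 4096 };  /* example JITed image sizes in bytes */

	for (unsigned int i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		unsigned int n = size_to_nbits(sizes[i]);

		printf("%4u-byte program: %3u chunks (%5u bytes) vs one %u-byte page\n",
		       sizes[i], n, n * CHUNK_SIZE, PAGE_SIZE);
	}
	return 0;
}
```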