[v2,bpf-next,0/7] bpf_prog_pack allocator

Message ID 20211215060102.3793196-1-song@kernel.org (mailing list archive)

Message

Song Liu Dec. 15, 2021, 6 a.m. UTC
Changes v1 => v2:
1. Use text_poke instead of writing through linear mapping. (Peter)
2. Avoid making changes to non-x86_64 code.

Most BPF programs are small, but each one consumes a full page. On
systems with busy traffic and many BPF programs, this can also put
significant pressure on the instruction TLB.

This set tries to solve this problem with a customized allocator that
packs multiple programs into a huge page.

Patches 1-5 prepare the work. Patch 6 contains the key logic of the
allocator. Patch 7 uses this allocator in the x86_64 JIT compiler.
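
The packing idea can be illustrated with a small user-space sketch. This
is not the code from patch 6: the names (prog_pack, pack_alloc,
pack_free), the 2 MiB pack size and the 64-byte chunk size are all
assumptions chosen for illustration, and the sketch leaves out everything
kernel-specific (huge-page vmalloc, page permissions, text_poke). It only
shows how a bitmap can hand out small, contiguous slices of one large
region so that many programs share it instead of taking a page each.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sizes (assumptions, not taken from the patch set). */
#define PACK_SIZE  (2u * 1024 * 1024)   /* one 2 MiB huge page */
#define CHUNK_SIZE 64u                  /* allocation granularity */
#define NCHUNKS    (PACK_SIZE / CHUNK_SIZE)

/* Hypothetical pack descriptor: one region plus a chunk bitmap. */
struct prog_pack {
    uint8_t *base;              /* start of the huge-page-sized region */
    uint8_t used[NCHUNKS / 8];  /* one bit per chunk, 1 = in use */
};

/* Number of chunks needed to hold `size` bytes (rounded up). */
static size_t chunks_for(size_t size)
{
    return (size + CHUNK_SIZE - 1) / CHUNK_SIZE;
}

static bool chunk_used(const struct prog_pack *p, size_t i)
{
    return (p->used[i / 8] & (1u << (i % 8))) != 0;
}

static void chunk_set(struct prog_pack *p, size_t i, bool used)
{
    if (used)
        p->used[i / 8] |= (uint8_t)(1u << (i % 8));
    else
        p->used[i / 8] &= (uint8_t)~(1u << (i % 8));
}

/* Start with an empty pack backed by a caller-provided region. */
void pack_init(struct prog_pack *p, void *region)
{
    p->base = region;
    memset(p->used, 0, sizeof(p->used));
}

/* First-fit search for a run of free chunks large enough for `size`. */
void *pack_alloc(struct prog_pack *p, size_t size)
{
    size_t need = chunks_for(size), run = 0;

    if (need == 0 || need > NCHUNKS)
        return NULL;
    for (size_t i = 0; i < NCHUNKS; i++) {
        run = chunk_used(p, i) ? 0 : run + 1;
        if (run == need) {
            size_t start = i + 1 - need;

            for (size_t j = start; j <= i; j++)
                chunk_set(p, j, true);
            return p->base + start * CHUNK_SIZE;
        }
    }
    return NULL;    /* no free run is long enough */
}

/* Release the chunks that back an earlier allocation of `size` bytes. */
void pack_free(struct prog_pack *p, void *ptr, size_t size)
{
    size_t off = (size_t)((uint8_t *)ptr - p->base);
    size_t start = off / CHUNK_SIZE;

    assert(off % CHUNK_SIZE == 0 && start + chunks_for(size) <= NCHUNKS);
    for (size_t j = start; j < start + chunks_for(size); j++)
        chunk_set(p, j, false);
}
```

With these sizes a 100-byte program takes two chunks (128 bytes) of the
shared region rather than a whole 4 KiB page, and freed chunks are reused
by later allocations.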

Song Liu (7):
  x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP
  bpf: use bytes instead of pages for bpf_jit_[charge|uncharge]_modmem
  bpf: use size instead of pages in bpf_binary_header
  bpf: add a pointer of bpf_binary_header to bpf_prog
  x86/alternative: introduce text_poke_jit
  bpf: introduce bpf_prog_pack allocator
  bpf, x86_64: use bpf_prog_pack allocator

 arch/x86/Kconfig                     |   1 +
 arch/x86/include/asm/text-patching.h |   1 +
 arch/x86/kernel/alternative.c        |  28 ++++
 arch/x86/net/bpf_jit_comp.c          |  93 ++++++++++--
 include/linux/bpf.h                  |   4 +-
 include/linux/filter.h               |  23 ++-
 kernel/bpf/core.c                    | 213 ++++++++++++++++++++++++---
 kernel/bpf/trampoline.c              |   6 +-
 8 files changed, 328 insertions(+), 41 deletions(-)

--
2.30.2

Comments

Andrii Nakryiko Dec. 16, 2021, 8:06 p.m. UTC | #1
On Tue, Dec 14, 2021 at 10:01 PM Song Liu <song@kernel.org> wrote:
>
> Changes v1 => v2:
> 1. Use text_poke instead of writing through linear mapping. (Peter)
> 2. Avoid making changes to non-x86_64 code.
>
> Most BPF programs are small, but they consume a page each. For systems
> with busy traffic and many BPF programs, this could also add significant
> pressure to instruction TLB.
>
> This set tries to solve this problem with customized allocator that pack
> multiple programs into a huge page.
>
> Patches 1-5 prepare the work. Patch 6 contains key logic of the allocator.
> Patch 7 uses this allocator in x86_64 jit compiler.
>

There are test failures; please see [0]. I was also wondering whether
an explicit selftest could be added to validate that all this huge
page machinery is actually activated and working as expected?

  [0] https://github.com/kernel-patches/bpf/runs/4530372387?check_suite_focus=true

Song Liu Dec. 17, 2021, 1:53 a.m. UTC | #2
> On Dec 16, 2021, at 12:06 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> There are test failures, please see [0]. But I was also wondering if
> there could be an explicit selftest added to validate that all this
> huge page machinery is actually activated and working as expected?

We can enable some debug option that dumps the page table. Then from the
page table, we can confirm the programs are running on a huge page. This 
only works on x86_64 though. WDYT?
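
A rough sketch of what such a check could look like, assuming the debug
option is the x86 page-table dump exposed through debugfs (typically
/sys/kernel/debug/page_tables/kernel) and that its lines have the shape
"start-end size attrs level", with PSE marking a large-page mapping. The
helper name and the sample lines in the usage note are illustrative
assumptions, not output captured from a real system:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: returns true if one line of the page-table dump
 * describes a range that contains `addr` and carries the PSE attribute.
 * The line layout is an assumption modelled on the x86 dump format.
 */
bool line_maps_addr_huge(const char *line, unsigned long long addr)
{
    unsigned long long start, end;

    assert(line != NULL);
    /* %llx accepts an optional 0x prefix; the '-' separates the bounds. */
    if (sscanf(line, "%llx-%llx", &start, &end) != 2)
        return false;
    if (addr < start || addr >= end)
        return false;
    return strstr(line, " PSE ") != NULL;
}
```

A test could read the dump line by line and call the helper with the
address of a JITed program: a line such as
"0xffffffffc0000000-0xffffffffc0200000 2M ro PSE x pmd" would match an
address inside that range, while a 4K "pte" line without PSE would not.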

Thanks,
Song


Andrii Nakryiko Dec. 17, 2021, 4:42 p.m. UTC | #3
On Thu, Dec 16, 2021 at 5:53 PM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Dec 16, 2021, at 12:06 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >
> > There are test failures, please see [0]. But I was also wondering if
> > there could be an explicit selftest added to validate that all this
> > huge page machinery is actually activated and working as expected?
>
> We can enable some debug option that dumps the page table. Then from the
> page table, we can confirm the programs are running on a huge page. This
> only works on x86_64 though. WDYT?
>

I don't know what exactly is involved, so it's hard to say. Ideally
whatever we do doesn't complicate our CI setup. Can we use BPF tracing
magic to check this from inside the kernel somehow?

Andrii Nakryiko Dec. 17, 2021, 4:43 p.m. UTC | #4
On Fri, Dec 17, 2021 at 8:42 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Dec 16, 2021 at 5:53 PM Song Liu <songliubraving@fb.com> wrote:
> >
> >
> >
> > > On Dec 16, 2021, at 12:06 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> > >
> > > There are test failures, please see [0]. But I was also wondering if
> > > there could be an explicit selftest added to validate that all this
> > > huge page machinery is actually activated and working as expected?
> >
> > We can enable some debug option that dumps the page table. Then from the
> > page table, we can confirm the programs are running on a huge page. This
> > only works on x86_64 though. WDYT?
> >
>
> I don't know what exactly is involved, so it's hard to say. Ideally
> whatever we do doesn't complicate our CI setup. Can we use BPF tracing
> magic to check this from inside the kernel somehow?
>

But I don't feel strongly about this. If it's hard to detect, it's
fine not to have a specific test (especially since it's very
architecture-specific).

Song Liu Dec. 17, 2021, 5:13 p.m. UTC | #5
> On Dec 17, 2021, at 8:43 AM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> 
> On Fri, Dec 17, 2021 at 8:42 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
>> 
>> On Thu, Dec 16, 2021 at 5:53 PM Song Liu <songliubraving@fb.com> wrote:
>>> 
>>> 
>>> 
>>>> On Dec 16, 2021, at 12:06 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>>>> 
>>>> There are test failures, please see [0]. But I was also wondering if
>>>> there could be an explicit selftest added to validate that all this
>>>> huge page machinery is actually activated and working as expected?
>>> 
>>> We can enable some debug option that dumps the page table. Then from the
>>> page table, we can confirm the programs are running on a huge page. This
>>> only works on x86_64 though. WDYT?
>>> 
>> 
>> I don't know what exactly is involved, so it's hard to say. Ideally
>> whatever we do doesn't complicate our CI setup. Can we use BPF tracing
>> magic to check this from inside the kernel somehow?
>> 
> 
> But I don't feel strongly about this, if it's hard to detect, it's
> fine to not have a specific test (especially that it's very
> architecture-specific)

It will be more or less architecture-specific, as we need to somehow
walk the page table (with a debug option or with a BPF iterator). I
will try something.

Thanks,
Song


Andrii Nakryiko Dec. 17, 2021, 5:16 p.m. UTC | #6
On Fri, Dec 17, 2021 at 9:13 AM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Dec 17, 2021, at 8:43 AM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >
> > On Fri, Dec 17, 2021 at 8:42 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> >>
> >> On Thu, Dec 16, 2021 at 5:53 PM Song Liu <songliubraving@fb.com> wrote:
> >>>
> >>>
> >>>
> >>>> On Dec 16, 2021, at 12:06 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >>>>
> >>>> There are test failures, please see [0]. But I was also wondering if
> >>>> there could be an explicit selftest added to validate that all this
> >>>> huge page machinery is actually activated and working as expected?
> >>>
> >>> We can enable some debug option that dumps the page table. Then from the
> >>> page table, we can confirm the programs are running on a huge page. This
> >>> only works on x86_64 though. WDYT?
> >>>
> >>
> >> I don't know what exactly is involved, so it's hard to say. Ideally
> >> whatever we do doesn't complicate our CI setup. Can we use BPF tracing
> >> magic to check this from inside the kernel somehow?
> >>
> >
> > But I don't feel strongly about this, if it's hard to detect, it's
> > fine to not have a specific test (especially that it's very
> > architecture-specific)
>
> It will be more or less architecture-specific, as we need somehow walk
> the page table (with debug option or with BPF iterator). I will try
> something.

If the BPF iterator approach works, that would be great!
