[bpf,v2,0/3] bpf: invalidate unused part of bpf_prog_pack

Message ID	20220425203947.3311308-1-song@kernel.org (mailing list archive)
Headers	show Return-Path: <bpf-owner@kernel.org> From: Song Liu <song@kernel.org> To: <bpf@vger.kernel.org>, <netdev@vger.kernel.org>, <linux-kernel@vger.kernel.org> CC: <andrii@kernel.org>, <ast@kernel.org>, <daniel@iogearbox.net>, <kernel-team@fb.com>, <hch@infradead.org>, <peterz@infradead.org>, <torvalds@linux-foundation.org>, Song Liu <song@kernel.org> Subject: [PATCH bpf v2 0/3] bpf: invalidate unused part of bpf_prog_pack Date: Mon, 25 Apr 2022 13:39:44 -0700 Message-ID: <20220425203947.3311308-1-song@kernel.org> Content-Type: text/plain Content-Transfer-Encoding: 8BIT MIME-Version: 1.0 Precedence: bulk
Series	bpf: invalidate unused part of bpf_prog_pack \| expand [bpf,v2,0/3] bpf: invalidate unused part of bpf_prog_pack [v2,bpf,1/3] bpf: fill new bpf_prog_pack with illegal instructions [v2,bpf,2/3] x86/alternative: introduce text_poke_set [v2,bpf,3/3] bpf: introduce bpf_arch_text_invalidate for bpf_prog_pack

Song Liu April 25, 2022, 8:39 p.m. UTC

Note: while prefixed with "bpf", this set is based on Linus' master branch.

This is v2 of [1]. It is revised based on Peter's feedback. The patch is
also split into 3.

[1] https://lore.kernel.org/linux-mm/20220421072212.608884-1-song@kernel.org/

Song Liu (3):
  bpf: fill new bpf_prog_pack with illegal instructions
  x86/alternative: introduce text_poke_set
  bpf: introduce bpf_arch_text_invalidate for bpf_prog_pack

 arch/x86/include/asm/text-patching.h |  1 +
 arch/x86/kernel/alternative.c        | 70 ++++++++++++++++++++++++----
 arch/x86/net/bpf_jit_comp.c          |  5 ++
 include/linux/bpf.h                  |  1 +
 kernel/bpf/core.c                    | 18 +++++--
 5 files changed, 81 insertions(+), 14 deletions(-)

--
2.30.2

Song Liu April 27, 2022, 10:10 p.m. UTC | #1

Hi Linus,

> On Apr 25, 2022, at 1:39 PM, Song Liu <song@kernel.org> wrote:
> 
> Note: while prefixed with "bpf", this set is based on Linus' master branch.
> 
> This is v2 of [1]. It is revised based on Peter's feedback. The patch is
> also split into 3.
> 
> [1] https://lore.kernel.org/linux-mm/20220421072212.608884-1-song@kernel.org/

Could you please share your suggestions on this set? Shall we ship it 
with 5.18?

Thanks,
Song

> 
> Song Liu (3):
>  bpf: fill new bpf_prog_pack with illegal instructions
>  x86/alternative: introduce text_poke_set
>  bpf: introduce bpf_arch_text_invalidate for bpf_prog_pack
> 
> arch/x86/include/asm/text-patching.h |  1 +
> arch/x86/kernel/alternative.c        | 70 ++++++++++++++++++++++++----
> arch/x86/net/bpf_jit_comp.c          |  5 ++
> include/linux/bpf.h                  |  1 +
> kernel/bpf/core.c                    | 18 +++++--
> 5 files changed, 81 insertions(+), 14 deletions(-)
> 
> --
> 2.30.2

Linus Torvalds April 28, 2022, 1:45 a.m. UTC | #2

On Wed, Apr 27, 2022 at 3:24 PM Song Liu <songliubraving@fb.com> wrote:
>
> Could you please share your suggestions on this set? Shall we ship it
> with 5.18?

I'd personally prefer to just not do the prog_pack thing at all, since
I don't think it was actually in a "ready to ship" state for this
merge window, and the hugepage mapping protection games I'm still
leery of.

Yes, the hugepage protection things probably do work from what I saw
when I looked through them, but that x86 vmalloc hugepage code was
really designed for another use (non-refcounted device pages), so the
fact that it all actually seems surprisingly ok certainly wasn't
because the code was designed to do that new case.

Does the prog_pack thing work with small pages?

Yes. But that wasn't what it was designed for or its selling point, so
it all is a bit suspect to me.

And I'm looking at those set_memory_xyz() calls, and I'm going "yeah,
I think it works on x86, but on ppc I look at it and I see

  static inline int set_memory_ro(unsigned long addr, int numpages)
  {
        return change_memory_attr(addr, numpages, SET_MEMORY_RO);
  }

and then in change_memory_attr() it does

        if (WARN_ON_ONCE(is_vmalloc_or_module_addr((void *)addr) &&
                         is_vm_area_hugepages((void *)addr)))
                return -EINVAL;

and I'm "this is all supposedly generic code, but I'm not seeing how
it works cross-architecture".

I *think* it's actually because this is all basically x86-specific due
to x86 being the only architecture that implements
bpf_arch_text_copy(), and everybody else then ends up erroring out and
not using the prog_pack thing after all.

And then one of the two places that use bpf_arch_text_copy() doesn't
even check the returned error code.

So this all ends up just depending on "only x86 will actually succeed
in bpf_jit_binary_pack_finalize(), everybody else will fail after
having done all the common setup".

End result: it all seems a bit broken right now. The "generic" code
only works on x86, and on other architectures it goes through the
motions and then fails at the end. And even on x86 I worry about
actually enabling it fully.

I'm not having the warm and fuzzies about this all, in other words.

Maybe people can convince me otherwise, but I think you need to work at it.

And even for 5.19+ kind of timeframes, I'd actually like the x86
people who maintain arch/x86/mm/pat/set_memory.c also sign off on
using that code for hugepage vmalloc mappings too.

I *think* it does. But that code has various subtle things going on.

I see PeterZ is cc'd (presumably because of the text_poke() stuff, not
because of the whole "call set_memory_ro() on virtually mapped kernel
largepage memory".

Did people even talk to x86 people about this, or did the whole "it
works, except it turns out set_vm_flush_reset_perms() doesn't work"
mean that the authors of that code never got involved?

               Linus

Song Liu April 28, 2022, 6:48 a.m. UTC | #3

Hi Linus, 

Thanks for your thorough analysis of the situation, which make a lot of
sense. 

> On Apr 27, 2022, at 6:45 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> On Wed, Apr 27, 2022 at 3:24 PM Song Liu <songliubraving@fb.com> wrote:
>> 
>> Could you please share your suggestions on this set? Shall we ship it
>> with 5.18?
> 
> I'd personally prefer to just not do the prog_pack thing at all, since
> I don't think it was actually in a "ready to ship" state for this
> merge window, and the hugepage mapping protection games I'm still
> leery of.
> 
> Yes, the hugepage protection things probably do work from what I saw
> when I looked through them, but that x86 vmalloc hugepage code was
> really designed for another use (non-refcounted device pages), so the
> fact that it all actually seems surprisingly ok certainly wasn't
> because the code was designed to do that new case.
> 
> Does the prog_pack thing work with small pages?
> 
> Yes. But that wasn't what it was designed for or its selling point, so
> it all is a bit suspect to me.

prog_pack on small pages can also reduce the direct map fragmentation.
This is because libbpf uses tiny BPF programs to probe kernel features. 
Before prog_pack, all these BPF programs can fragment the direct map.
For example, runqslower (tools/bpf/runqslower/) loads total 7 BPF programs 
(3 actual programs and 4 tiny probe programs). All these programs may 
cause direct map fragmentation. With prog_pack, OTOH, these BPF programs 
would fit in a single page (or even share pages with other tools). 

> 
> And I'm looking at those set_memory_xyz() calls, and I'm going "yeah,
> I think it works on x86, but on ppc I look at it and I see
> 
>  static inline int set_memory_ro(unsigned long addr, int numpages)
>  {
>        return change_memory_attr(addr, numpages, SET_MEMORY_RO);
>  }
> 
> and then in change_memory_attr() it does
> 
>        if (WARN_ON_ONCE(is_vmalloc_or_module_addr((void *)addr) &&
>                         is_vm_area_hugepages((void *)addr)))
>                return -EINVAL;
> 
> and I'm "this is all supposedly generic code, but I'm not seeing how
> it works cross-architecture".

AFAICT, we have spent the time and effort to design bpf_prog_pack to 
work cross-architecture. However, since prog_pack requires relatively 
new building blocks in multiple layers (vmalloc, set_memory_XXX, 
bpf_jit, etc.), non-x86 architectures have more missing pieces. 

The fact that we cannot rely on set_vm_flush_reset_perms() for huge 
pages on x86 does leak some architectural details to generic code. 
But I guess we don’t really need the hack (by not using 
set_vm_flush_reset_perms, but calling set_memory_ manually) for 
prog_pack on small pages? 

> 
> I *think* it's actually because this is all basically x86-specific due
> to x86 being the only architecture that implements
> bpf_arch_text_copy(), and everybody else then ends up erroring out and
> not using the prog_pack thing after all.
> 
> And then one of the two places that use bpf_arch_text_copy() doesn't
> even check the returned error code.
> 
> So this all ends up just depending on "only x86 will actually succeed
> in bpf_jit_binary_pack_finalize(), everybody else will fail after
> having done all the common setup".
> 
> End result: it all seems a bit broken right now. The "generic" code
> only works on x86, and on other architectures it goes through the
> motions and then fails at the end. And even on x86 I worry about
> actually enabling it fully.
> 
> I'm not having the warm and fuzzies about this all, in other words.
> 
> Maybe people can convince me otherwise, but I think you need to work at it.
> 
> And even for 5.19+ kind of timeframes, I'd actually like the x86
> people who maintain arch/x86/mm/pat/set_memory.c also sign off on
> using that code for hugepage vmalloc mappings too.
> 
> I *think* it does. But that code has various subtle things going on.
> 
> I see PeterZ is cc'd (presumably because of the text_poke() stuff, not
> because of the whole "call set_memory_ro() on virtually mapped kernel
> largepage memory".
> 
> Did people even talk to x86 people about this, or did the whole "it
> works, except it turns out set_vm_flush_reset_perms() doesn't work"
> mean that the authors of that code never got involved?

We have CC'ed x86 folks (at least on some versions). But we haven’t
got much feedbacks until recently. We should definitely do better 
with this in the future. 

set_vm_flush_reset_perms is clearly a problem here. But if we don’t
use huge page (current upstream + this set), we shouldn’t need that
workaround. 

Overall, we do hope to eliminate (or reduce) system slowdown caused 
by direct map fragmentation. Ideally, we want achieve this with 
small number of huge pages. If huge pages don’t work here, small 
pages would also help. Given we already do set_memory_() with BPF
programs (in bpf_jit_binary_lock_ro), I think the risk with small 
pages is pretty low. And we should still see non-trivial performance
improvement from 4kB bpf_prog_pack (I don’t have full benchmark
results at the moment). 

Thanks again,
Song

Song Liu May 7, 2022, 6:50 a.m. UTC | #4

> On Apr 27, 2022, at 11:48 PM, Song Liu <songliubraving@fb.com> wrote:
> 
> Hi Linus, 
> 
> Thanks for your thorough analysis of the situation, which make a lot of
> sense. 
> 
>> On Apr 27, 2022, at 6:45 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> 
>> On Wed, Apr 27, 2022 at 3:24 PM Song Liu <songliubraving@fb.com> wrote:
>>> 
>>> Could you please share your suggestions on this set? Shall we ship it
>>> with 5.18?
>> 
>> I'd personally prefer to just not do the prog_pack thing at all, since
>> I don't think it was actually in a "ready to ship" state for this
>> merge window, and the hugepage mapping protection games I'm still
>> leery of.
>> 
>> Yes, the hugepage protection things probably do work from what I saw
>> when I looked through them, but that x86 vmalloc hugepage code was
>> really designed for another use (non-refcounted device pages), so the
>> fact that it all actually seems surprisingly ok certainly wasn't
>> because the code was designed to do that new case.
>> 
>> Does the prog_pack thing work with small pages?
>> 
>> Yes. But that wasn't what it was designed for or its selling point, so
>> it all is a bit suspect to me.
> 
> prog_pack on small pages can also reduce the direct map fragmentation.
> This is because libbpf uses tiny BPF programs to probe kernel features. 
> Before prog_pack, all these BPF programs can fragment the direct map.
> For example, runqslower (tools/bpf/runqslower/) loads total 7 BPF programs 
> (3 actual programs and 4 tiny probe programs). All these programs may 
> cause direct map fragmentation. With prog_pack, OTOH, these BPF programs 
> would fit in a single page (or even share pages with other tools). 

Here are some performance data from our web service production benchmark, 
which is the biggest service in our fleet. We compare 3 kernels:    

  nopack: no bpf_prog_pack; IOW, the same behavior as 5.17
  4kpack: use bpf_prog_pack on 4kB pages (same as 5.18-rc5)
  2mpack: use bpf_prog_pack on 2MB pages

The benchmark measures system throughput under latency constraints. 
4kpack provides 0.5% to 0.7% more throughput than nopack. 
2mpack provides 0.6% to 0.9% more throughput than nopack. 

So the data has confirmed:
1. Direct map fragmentation has non-trivial impact on system performance;
2. While 2MB pages are preferred, bpf_prog_pack on 4kB pages also gives 
   Significant performance improvements.  

Thanks,
Song

Song Liu May 7, 2022, 7:36 p.m. UTC | #5

> On May 6, 2022, at 11:50 PM, Song Liu <songliubraving@fb.com> wrote:
> 
> 
> 
>> On Apr 27, 2022, at 11:48 PM, Song Liu <songliubraving@fb.com> wrote:
>> 
>> Hi Linus, 
>> 
>> Thanks for your thorough analysis of the situation, which make a lot of
>> sense. 
>> 
>>> On Apr 27, 2022, at 6:45 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>> 
>>> On Wed, Apr 27, 2022 at 3:24 PM Song Liu <songliubraving@fb.com> wrote:
>>>> 
>>>> Could you please share your suggestions on this set? Shall we ship it
>>>> with 5.18?
>>> 
>>> I'd personally prefer to just not do the prog_pack thing at all, since
>>> I don't think it was actually in a "ready to ship" state for this
>>> merge window, and the hugepage mapping protection games I'm still
>>> leery of.
>>> 
>>> Yes, the hugepage protection things probably do work from what I saw
>>> when I looked through them, but that x86 vmalloc hugepage code was
>>> really designed for another use (non-refcounted device pages), so the
>>> fact that it all actually seems surprisingly ok certainly wasn't
>>> because the code was designed to do that new case.
>>> 
>>> Does the prog_pack thing work with small pages?
>>> 
>>> Yes. But that wasn't what it was designed for or its selling point, so
>>> it all is a bit suspect to me.
>> 
>> prog_pack on small pages can also reduce the direct map fragmentation.
>> This is because libbpf uses tiny BPF programs to probe kernel features. 
>> Before prog_pack, all these BPF programs can fragment the direct map.
>> For example, runqslower (tools/bpf/runqslower/) loads total 7 BPF programs 
>> (3 actual programs and 4 tiny probe programs). All these programs may 
>> cause direct map fragmentation. With prog_pack, OTOH, these BPF programs 
>> would fit in a single page (or even share pages with other tools). 
> 
> Here are some performance data from our web service production benchmark, 
> which is the biggest service in our fleet. We compare 3 kernels:    
> 
>  nopack: no bpf_prog_pack; IOW, the same behavior as 5.17
>  4kpack: use bpf_prog_pack on 4kB pages (same as 5.18-rc5)
>  2mpack: use bpf_prog_pack on 2MB pages
> 
> The benchmark measures system throughput under latency constraints. 
> 4kpack provides 0.5% to 0.7% more throughput than nopack. 
> 2mpack provides 0.6% to 0.9% more throughput than nopack. 
> 
> So the data has confirmed:
> 1. Direct map fragmentation has non-trivial impact on system performance;
> 2. While 2MB pages are preferred, bpf_prog_pack on 4kB pages also gives 
>   Significant performance improvements.  

Please note that 0.5% is a huge improvement for our fleet. I believe this
is also significant for other companies with many thousand servers. 

Thanks,
Song

[bpf,v2,0/3] bpf: invalidate unused part of bpf_prog_pack

Message

Comments