Message ID | 20220425203947.3311308-1-song@kernel.org (mailing list archive) |
---|---|
Headers | show |
Series | bpf: invalidate unused part of bpf_prog_pack | expand |
Hi Linus, > On Apr 25, 2022, at 1:39 PM, Song Liu <song@kernel.org> wrote: > > Note: while prefixed with "bpf", this set is based on Linus' master branch. > > This is v2 of [1]. It is revised based on Peter's feedback. The patch is > also split into 3. > > [1] https://lore.kernel.org/linux-mm/20220421072212.608884-1-song@kernel.org/ Could you please share your suggestions on this set? Shall we ship it with 5.18? Thanks, Song > > Song Liu (3): > bpf: fill new bpf_prog_pack with illegal instructions > x86/alternative: introduce text_poke_set > bpf: introduce bpf_arch_text_invalidate for bpf_prog_pack > > arch/x86/include/asm/text-patching.h | 1 + > arch/x86/kernel/alternative.c | 70 ++++++++++++++++++++++++---- > arch/x86/net/bpf_jit_comp.c | 5 ++ > include/linux/bpf.h | 1 + > kernel/bpf/core.c | 18 +++++-- > 5 files changed, 81 insertions(+), 14 deletions(-) > > -- > 2.30.2
On Wed, Apr 27, 2022 at 3:24 PM Song Liu <songliubraving@fb.com> wrote: > > Could you please share your suggestions on this set? Shall we ship it > with 5.18? I'd personally prefer to just not do the prog_pack thing at all, since I don't think it was actually in a "ready to ship" state for this merge window, and the hugepage mapping protection games I'm still leery of. Yes, the hugepage protection things probably do work from what I saw when I looked through them, but that x86 vmalloc hugepage code was really designed for another use (non-refcounted device pages), so the fact that it all actually seems surprisingly ok certainly wasn't because the code was designed to do that new case. Does the prog_pack thing work with small pages? Yes. But that wasn't what it was designed for or its selling point, so it all is a bit suspect to me. And I'm looking at those set_memory_xyz() calls, and I'm going "yeah, I think it works on x86, but on ppc I look at it and I see static inline int set_memory_ro(unsigned long addr, int numpages) { return change_memory_attr(addr, numpages, SET_MEMORY_RO); } and then in change_memory_attr() it does if (WARN_ON_ONCE(is_vmalloc_or_module_addr((void *)addr) && is_vm_area_hugepages((void *)addr))) return -EINVAL; and I'm "this is all supposedly generic code, but I'm not seeing how it works cross-architecture". I *think* it's actually because this is all basically x86-specific due to x86 being the only architecture that implements bpf_arch_text_copy(), and everybody else then ends up erroring out and not using the prog_pack thing after all. And then one of the two places that use bpf_arch_text_copy() doesn't even check the returned error code. So this all ends up just depending on "only x86 will actually succeed in bpf_jit_binary_pack_finalize(), everybody else will fail after having done all the common setup". End result: it all seems a bit broken right now. The "generic" code only works on x86, and on other architectures it goes through the motions and then fails at the end. And even on x86 I worry about actually enabling it fully. I'm not having the warm and fuzzies about this all, in other words. Maybe people can convince me otherwise, but I think you need to work at it. And even for 5.19+ kind of timeframes, I'd actually like the x86 people who maintain arch/x86/mm/pat/set_memory.c also sign off on using that code for hugepage vmalloc mappings too. I *think* it does. But that code has various subtle things going on. I see PeterZ is cc'd (presumably because of the text_poke() stuff, not because of the whole "call set_memory_ro() on virtually mapped kernel largepage memory". Did people even talk to x86 people about this, or did the whole "it works, except it turns out set_vm_flush_reset_perms() doesn't work" mean that the authors of that code never got involved? Linus
Hi Linus, Thanks for your thorough analysis of the situation, which make a lot of sense. > On Apr 27, 2022, at 6:45 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Wed, Apr 27, 2022 at 3:24 PM Song Liu <songliubraving@fb.com> wrote: >> >> Could you please share your suggestions on this set? Shall we ship it >> with 5.18? > > I'd personally prefer to just not do the prog_pack thing at all, since > I don't think it was actually in a "ready to ship" state for this > merge window, and the hugepage mapping protection games I'm still > leery of. > > Yes, the hugepage protection things probably do work from what I saw > when I looked through them, but that x86 vmalloc hugepage code was > really designed for another use (non-refcounted device pages), so the > fact that it all actually seems surprisingly ok certainly wasn't > because the code was designed to do that new case. > > Does the prog_pack thing work with small pages? > > Yes. But that wasn't what it was designed for or its selling point, so > it all is a bit suspect to me. prog_pack on small pages can also reduce the direct map fragmentation. This is because libbpf uses tiny BPF programs to probe kernel features. Before prog_pack, all these BPF programs can fragment the direct map. For example, runqslower (tools/bpf/runqslower/) loads total 7 BPF programs (3 actual programs and 4 tiny probe programs). All these programs may cause direct map fragmentation. With prog_pack, OTOH, these BPF programs would fit in a single page (or even share pages with other tools). > > And I'm looking at those set_memory_xyz() calls, and I'm going "yeah, > I think it works on x86, but on ppc I look at it and I see > > static inline int set_memory_ro(unsigned long addr, int numpages) > { > return change_memory_attr(addr, numpages, SET_MEMORY_RO); > } > > and then in change_memory_attr() it does > > if (WARN_ON_ONCE(is_vmalloc_or_module_addr((void *)addr) && > is_vm_area_hugepages((void *)addr))) > return -EINVAL; > > and I'm "this is all supposedly generic code, but I'm not seeing how > it works cross-architecture". AFAICT, we have spent the time and effort to design bpf_prog_pack to work cross-architecture. However, since prog_pack requires relatively new building blocks in multiple layers (vmalloc, set_memory_XXX, bpf_jit, etc.), non-x86 architectures have more missing pieces. The fact that we cannot rely on set_vm_flush_reset_perms() for huge pages on x86 does leak some architectural details to generic code. But I guess we don’t really need the hack (by not using set_vm_flush_reset_perms, but calling set_memory_ manually) for prog_pack on small pages? > > I *think* it's actually because this is all basically x86-specific due > to x86 being the only architecture that implements > bpf_arch_text_copy(), and everybody else then ends up erroring out and > not using the prog_pack thing after all. > > And then one of the two places that use bpf_arch_text_copy() doesn't > even check the returned error code. > > So this all ends up just depending on "only x86 will actually succeed > in bpf_jit_binary_pack_finalize(), everybody else will fail after > having done all the common setup". > > End result: it all seems a bit broken right now. The "generic" code > only works on x86, and on other architectures it goes through the > motions and then fails at the end. And even on x86 I worry about > actually enabling it fully. > > I'm not having the warm and fuzzies about this all, in other words. > > Maybe people can convince me otherwise, but I think you need to work at it. > > And even for 5.19+ kind of timeframes, I'd actually like the x86 > people who maintain arch/x86/mm/pat/set_memory.c also sign off on > using that code for hugepage vmalloc mappings too. > > I *think* it does. But that code has various subtle things going on. > > I see PeterZ is cc'd (presumably because of the text_poke() stuff, not > because of the whole "call set_memory_ro() on virtually mapped kernel > largepage memory". > > Did people even talk to x86 people about this, or did the whole "it > works, except it turns out set_vm_flush_reset_perms() doesn't work" > mean that the authors of that code never got involved? We have CC'ed x86 folks (at least on some versions). But we haven’t got much feedbacks until recently. We should definitely do better with this in the future. set_vm_flush_reset_perms is clearly a problem here. But if we don’t use huge page (current upstream + this set), we shouldn’t need that workaround. Overall, we do hope to eliminate (or reduce) system slowdown caused by direct map fragmentation. Ideally, we want achieve this with small number of huge pages. If huge pages don’t work here, small pages would also help. Given we already do set_memory_() with BPF programs (in bpf_jit_binary_lock_ro), I think the risk with small pages is pretty low. And we should still see non-trivial performance improvement from 4kB bpf_prog_pack (I don’t have full benchmark results at the moment). Thanks again, Song
> On Apr 27, 2022, at 11:48 PM, Song Liu <songliubraving@fb.com> wrote: > > Hi Linus, > > Thanks for your thorough analysis of the situation, which make a lot of > sense. > >> On Apr 27, 2022, at 6:45 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: >> >> On Wed, Apr 27, 2022 at 3:24 PM Song Liu <songliubraving@fb.com> wrote: >>> >>> Could you please share your suggestions on this set? Shall we ship it >>> with 5.18? >> >> I'd personally prefer to just not do the prog_pack thing at all, since >> I don't think it was actually in a "ready to ship" state for this >> merge window, and the hugepage mapping protection games I'm still >> leery of. >> >> Yes, the hugepage protection things probably do work from what I saw >> when I looked through them, but that x86 vmalloc hugepage code was >> really designed for another use (non-refcounted device pages), so the >> fact that it all actually seems surprisingly ok certainly wasn't >> because the code was designed to do that new case. >> >> Does the prog_pack thing work with small pages? >> >> Yes. But that wasn't what it was designed for or its selling point, so >> it all is a bit suspect to me. > > prog_pack on small pages can also reduce the direct map fragmentation. > This is because libbpf uses tiny BPF programs to probe kernel features. > Before prog_pack, all these BPF programs can fragment the direct map. > For example, runqslower (tools/bpf/runqslower/) loads total 7 BPF programs > (3 actual programs and 4 tiny probe programs). All these programs may > cause direct map fragmentation. With prog_pack, OTOH, these BPF programs > would fit in a single page (or even share pages with other tools). Here are some performance data from our web service production benchmark, which is the biggest service in our fleet. We compare 3 kernels: nopack: no bpf_prog_pack; IOW, the same behavior as 5.17 4kpack: use bpf_prog_pack on 4kB pages (same as 5.18-rc5) 2mpack: use bpf_prog_pack on 2MB pages The benchmark measures system throughput under latency constraints. 4kpack provides 0.5% to 0.7% more throughput than nopack. 2mpack provides 0.6% to 0.9% more throughput than nopack. So the data has confirmed: 1. Direct map fragmentation has non-trivial impact on system performance; 2. While 2MB pages are preferred, bpf_prog_pack on 4kB pages also gives Significant performance improvements. Thanks, Song
> On May 6, 2022, at 11:50 PM, Song Liu <songliubraving@fb.com> wrote: > > > >> On Apr 27, 2022, at 11:48 PM, Song Liu <songliubraving@fb.com> wrote: >> >> Hi Linus, >> >> Thanks for your thorough analysis of the situation, which make a lot of >> sense. >> >>> On Apr 27, 2022, at 6:45 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: >>> >>> On Wed, Apr 27, 2022 at 3:24 PM Song Liu <songliubraving@fb.com> wrote: >>>> >>>> Could you please share your suggestions on this set? Shall we ship it >>>> with 5.18? >>> >>> I'd personally prefer to just not do the prog_pack thing at all, since >>> I don't think it was actually in a "ready to ship" state for this >>> merge window, and the hugepage mapping protection games I'm still >>> leery of. >>> >>> Yes, the hugepage protection things probably do work from what I saw >>> when I looked through them, but that x86 vmalloc hugepage code was >>> really designed for another use (non-refcounted device pages), so the >>> fact that it all actually seems surprisingly ok certainly wasn't >>> because the code was designed to do that new case. >>> >>> Does the prog_pack thing work with small pages? >>> >>> Yes. But that wasn't what it was designed for or its selling point, so >>> it all is a bit suspect to me. >> >> prog_pack on small pages can also reduce the direct map fragmentation. >> This is because libbpf uses tiny BPF programs to probe kernel features. >> Before prog_pack, all these BPF programs can fragment the direct map. >> For example, runqslower (tools/bpf/runqslower/) loads total 7 BPF programs >> (3 actual programs and 4 tiny probe programs). All these programs may >> cause direct map fragmentation. With prog_pack, OTOH, these BPF programs >> would fit in a single page (or even share pages with other tools). > > Here are some performance data from our web service production benchmark, > which is the biggest service in our fleet. We compare 3 kernels: > > nopack: no bpf_prog_pack; IOW, the same behavior as 5.17 > 4kpack: use bpf_prog_pack on 4kB pages (same as 5.18-rc5) > 2mpack: use bpf_prog_pack on 2MB pages > > The benchmark measures system throughput under latency constraints. > 4kpack provides 0.5% to 0.7% more throughput than nopack. > 2mpack provides 0.6% to 0.9% more throughput than nopack. > > So the data has confirmed: > 1. Direct map fragmentation has non-trivial impact on system performance; > 2. While 2MB pages are preferred, bpf_prog_pack on 4kB pages also gives > Significant performance improvements. Please note that 0.5% is a huge improvement for our fleet. I believe this is also significant for other companies with many thousand servers. Thanks, Song