
[v4,bpf-next,0/8] bpf_prog_pack followup

Message ID: 20220520235758.1858153-1-song@kernel.org

Message

Song Liu May 20, 2022, 11:57 p.m. UTC
Changes v3 => v4:
1. Shorten CC list on 4/8, so it is not dropped by the mail list.

Changes v2 => v3:
1. Fix issues reported by kernel test robot <lkp@intel.com>.

Changes v1 => v2:
1. Add WARN to set_vm_flush_reset_perms() on huge pages. (Rick Edgecombe)
2. Simplify select_bpf_prog_pack_size. (Rick Edgecombe)

As of 5.18-rc6, x86_64 uses bpf_prog_pack on 4kB pages. This set contains
a few followups:
  1/8 - 3/8 fill the unused part of bpf_prog_pack with illegal instructions
      (a rough sketch of the idea follows below).
  4/8 - 5/8 enable bpf_prog_pack on 2MB pages.
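
The point of the illegal-instruction fill is that a stray jump into unused
pack space should trap instead of executing whatever bytes happen to be
there. A rough sketch of the idea (the helper name below is made up for
illustration; see 1/8 - 3/8 for the real code):

#include <linux/string.h>

/*
 * Illustrative sketch only.  On x86, 0xcc is the single-byte INT3
 * opcode, so unused pack space filled with it traps on execution.
 */
static void example_fill_ill_insns(void *area, unsigned int size)
{
	memset(area, 0xcc, size);	/* INT3 on x86 */
}

Note that once the pack is mapped read-only and executable, plain writes
no longer work; that is what text_poke_set() in 2/8 and
bpf_arch_text_invalidate() in 3/8 are for.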

The primary goal of bpf_prog_pack is to reduce iTLB miss rate and reduce
direct memory mapping fragmentation. This leads to non-trivial performance
improvements.

For our web service production benchmark, bpf_prog_pack on 4kB pages
gives 0.5% to 0.7% more throughput than not using bpf_prog_pack.
bpf_prog_pack on 2MB pages gives 0.6% to 0.9% more throughput than not
using bpf_prog_pack. Note that 0.5% is a huge improvement for our fleet.
I believe this is also significant for other companies with many
thousands of servers.

bpf_prog_pack on 2MB pages may use slightly more memory on systems
without many BPF programs. However, such memory waste (<2MB) is within
the noise for modern x86_64 systems.
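
Patches 4/8 - 5/8 get the 2MB backing by asking vmalloc for huge-page
backed memory. As a rough sketch of the idea (simplified; the real
module_alloc_huge() in 4/8 may use different flags and checks), on
x86_64 this is roughly:

#include <linux/moduleloader.h>
#include <linux/vmalloc.h>

/*
 * Simplified sketch.  The key piece is VM_ALLOW_HUGE_VMAP, which lets
 * vmalloc back the area with 2MB pages when size and alignment permit,
 * instead of the usual 4kB pages.
 */
void *module_alloc_huge(unsigned long size)
{
	return __vmalloc_node_range(size, MODULE_ALIGN,
				    MODULES_VADDR, MODULES_END,
				    GFP_KERNEL, PAGE_KERNEL,
				    VM_ALLOW_HUGE_VMAP, NUMA_NO_NODE,
				    __builtin_return_address(0));
}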

Song Liu (8):
  bpf: fill new bpf_prog_pack with illegal instructions
  x86/alternative: introduce text_poke_set
  bpf: introduce bpf_arch_text_invalidate for bpf_prog_pack
  module: introduce module_alloc_huge
  bpf: use module_alloc_huge for bpf_prog_pack
  vmalloc: WARN for set_vm_flush_reset_perms() on huge pages
  vmalloc: introduce huge_vmalloc_supported
  bpf: simplify select_bpf_prog_pack_size

 arch/x86/include/asm/text-patching.h |  1 +
 arch/x86/kernel/alternative.c        | 67 +++++++++++++++++++++++-----
 arch/x86/kernel/module.c             | 21 +++++++++
 arch/x86/net/bpf_jit_comp.c          |  5 +++
 include/linux/bpf.h                  |  1 +
 include/linux/moduleloader.h         |  5 +++
 include/linux/vmalloc.h              |  7 +++
 kernel/bpf/core.c                    | 43 ++++++++++--------
 kernel/module.c                      |  8 ++++
 mm/vmalloc.c                         |  5 +++
 10 files changed, 134 insertions(+), 29 deletions(-)

--
2.30.2

Comments

patchwork-bot+netdevbpf@kernel.org May 23, 2022, 9:20 p.m. UTC | #1
Hello:

This series was applied to bpf/bpf-next.git (master)
by Daniel Borkmann <daniel@iogearbox.net>:

On Fri, 20 May 2022 16:57:50 -0700 you wrote:
> Changes v3 => v4:
> 1. Shorten CC list on 4/8, so it is not dropped by the mail list.
> 
> Changes v2 => v3:
> 1. Fix issues reported by kernel test robot <lkp@intel.com>.
> 
> Changes v1 => v2:
> 1. Add WARN to set_vm_flush_reset_perms() on huge pages. (Rick Edgecombe)
> 2. Simplify select_bpf_prog_pack_size. (Rick Edgecombe)
> 
> [...]

Here is the summary with links:
  - [v4,bpf-next,1/8] bpf: fill new bpf_prog_pack with illegal instructions
    https://git.kernel.org/bpf/bpf-next/c/d88bb5eed04c
  - [v4,bpf-next,2/8] x86/alternative: introduce text_poke_set
    https://git.kernel.org/bpf/bpf-next/c/aadd1b678ebe
  - [v4,bpf-next,3/8] bpf: introduce bpf_arch_text_invalidate for bpf_prog_pack
    https://git.kernel.org/bpf/bpf-next/c/fe736565efb7
  - [v4,bpf-next,4/8] module: introduce module_alloc_huge
    (no matching commit)
  - [v4,bpf-next,5/8] bpf: use module_alloc_huge for bpf_prog_pack
    (no matching commit)
  - [v4,bpf-next,6/8] vmalloc: WARN for set_vm_flush_reset_perms() on huge pages
    (no matching commit)
  - [v4,bpf-next,7/8] vmalloc: introduce huge_vmalloc_supported
    (no matching commit)
  - [v4,bpf-next,8/8] bpf: simplify select_bpf_prog_pack_size
    (no matching commit)

You are awesome, thank you!
Aaron Lu June 20, 2022, 11:11 a.m. UTC | #2
Hi Song,

On Fri, May 20, 2022 at 04:57:50PM -0700, Song Liu wrote:

... ...

> The primary goal of bpf_prog_pack is to reduce iTLB miss rate and reduce
> direct memory mapping fragmentation. This leads to non-trivial performance
> improvements.
>
> For our web service production benchmark, bpf_prog_pack on 4kB pages
> gives 0.5% to 0.7% more throughput than not using bpf_prog_pack.
> bpf_prog_pack on 2MB pages 0.6% to 0.9% more throughput than not using
> bpf_prog_pack. Note that 0.5% is a huge improvement for our fleet. I
> believe this is also significant for other companies with many thousand
> servers.
>

I'm evaluating the performance impact due to direct memory mapping
fragmentation and seeing the above, I wonder: is the performance improvement
mostly due to prog pack and hugepage instead of less direct mapping
fragmentation?

I can understand that when progs are packed together, iTLB miss rate will
be reduced and thus, performance can be improved. But I don't see
immediately how direct mapping fragmentation can impact performance since
the bpf code is running from the module alias addresses, not the direct
mapping addresses IIUC?

I'd appreciate it if you could shed some light on the performance impact
direct mapping fragmentation can cause, thanks.
Song Liu June 20, 2022, 4:03 p.m. UTC | #3
Hi Aaron,

On Mon, Jun 20, 2022 at 4:12 AM Aaron Lu <aaron.lu@intel.com> wrote:
>
> Hi Song,
>
> On Fri, May 20, 2022 at 04:57:50PM -0700, Song Liu wrote:
>
> ... ...
>
> > The primary goal of bpf_prog_pack is to reduce iTLB miss rate and reduce
> > direct memory mapping fragmentation. This leads to non-trivial performance
> > improvements.
> >
> > For our web service production benchmark, bpf_prog_pack on 4kB pages
> > gives 0.5% to 0.7% more throughput than not using bpf_prog_pack.
> > bpf_prog_pack on 2MB pages 0.6% to 0.9% more throughput than not using
> > bpf_prog_pack. Note that 0.5% is a huge improvement for our fleet. I
> > believe this is also significant for other companies with many thousand
> > servers.
> >
>
> I'm evaluating the performance impact due to direct memory mapping
> fragmentation and seeing the above, I wonder: is the performance improvement
> mostly due to prog pack and hugepage instead of less direct mapping
> fragmentation?
>
> I can understand that when progs are packed together, iTLB miss rate will
> be reduced and thus, performance can be improved. But I don't see
> immediately how direct mapping fragmentation can impact performance since
> the bpf code are running from the module alias addresses, not the direct
> mapping addresses IIUC?

You are right that BPF code runs from module alias addresses. However, to
protect text from overwrites, we use set_memory_x() and set_memory_ro()
for the BPF code. These two functions will set permissions for all aliases
of the memory, including the direct map, and thus cause fragmentation of
the direct map. Does this make sense?
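
To make that concrete, here is a minimal sketch (not the exact kernel
code path; the function name is made up) of the permission changes
applied to JIT'd text:

#include <linux/set_memory.h>

/*
 * Minimal sketch: lock down JIT'd text.  set_memory_ro() and
 * set_memory_x() update every mapping alias of the underlying pages,
 * including the kernel's direct map, so large (2MB/1GB) direct-map
 * entries covering these pages get split into 4kB PTEs.
 */
static int example_lock_jit_text(void *image, int numpages)
{
	unsigned long addr = (unsigned long)image;
	int err;

	err = set_memory_ro(addr, numpages);	/* drop write permission */
	if (err)
		return err;
	return set_memory_x(addr, numpages);	/* allow execution */
}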

Thanks,
Song
Luis Chamberlain June 20, 2022, 6:31 p.m. UTC | #4
On Mon, Jun 20, 2022 at 07:11:45PM +0800, Aaron Lu wrote:
> Hi Song,
> 
> On Fri, May 20, 2022 at 04:57:50PM -0700, Song Liu wrote:
> 
> ... ...
> 
> > The primary goal of bpf_prog_pack is to reduce iTLB miss rate and reduce
> > direct memory mapping fragmentation. This leads to non-trivial performance
> > improvements.
> >
> > For our web service production benchmark, bpf_prog_pack on 4kB pages
> > gives 0.5% to 0.7% more throughput than not using bpf_prog_pack.
> > bpf_prog_pack on 2MB pages 0.6% to 0.9% more throughput than not using
> > bpf_prog_pack. Note that 0.5% is a huge improvement for our fleet. I
> > believe this is also significant for other companies with many thousand
> > servers.
> >
> 
> I'm evaluating the performance impact due to direct memory mapping
> fragmentation 

BTW how exactly are you doing this?

  Luis

> and seeing the above, I wonder: is the performance improvement
> mostly due to prog pack and hugepage instead of less direct mapping
> fragmentation?
> 
> I can understand that when progs are packed together, iTLB miss rate will
> be reduced and thus, performance can be improved. But I don't see
> immediately how direct mapping fragmentation can impact performance since
> the bpf code are running from the module alias addresses, not the direct
> mapping addresses IIUC?
> 
> I appreciate it if you can shed some light on performance impact direct
> mapping fragmentation can cause, thanks.
Aaron Lu June 21, 2022, 1:31 a.m. UTC | #5
On Mon, Jun 20, 2022 at 09:03:52AM -0700, Song Liu wrote:
> Hi Aaron,
> 
> On Mon, Jun 20, 2022 at 4:12 AM Aaron Lu <aaron.lu@intel.com> wrote:
> >
> > Hi Song,
> >
> > On Fri, May 20, 2022 at 04:57:50PM -0700, Song Liu wrote:
> >
> > ... ...
> >
> > > The primary goal of bpf_prog_pack is to reduce iTLB miss rate and reduce
> > > direct memory mapping fragmentation. This leads to non-trivial performance
> > > improvements.
> > >
> > > For our web service production benchmark, bpf_prog_pack on 4kB pages
> > > gives 0.5% to 0.7% more throughput than not using bpf_prog_pack.
> > > bpf_prog_pack on 2MB pages 0.6% to 0.9% more throughput than not using
> > > bpf_prog_pack. Note that 0.5% is a huge improvement for our fleet. I
> > > believe this is also significant for other companies with many thousand
> > > servers.
> > >
> >
> > I'm evaluating the performance impact due to direct memory mapping
> > fragmentation and seeing the above, I wonder: is the performance improvement
> > mostly due to prog pack and hugepage instead of less direct mapping
> > fragmentation?
> >
> > I can understand that when progs are packed together, iTLB miss rate will
> > be reduced and thus, performance can be improved. But I don't see
> > immediately how direct mapping fragmentation can impact performance since
> > the bpf code are running from the module alias addresses, not the direct
> > mapping addresses IIUC?
> 
> You are right that BPF code runs from module alias addresses. However, to
> protect text from overwrites, we use set_memory_x() and set_memory_ro()
> for the BPF code. These two functions will set permissions for all aliases
> of the memory, including the direct map, and thus cause fragmentation of
> the direct map. Does this make sense?

Guess I didn't make it clear.

I understand that set_memory_XXX() will cause the direct mapping to be
split and thus fragmented. What is not clear to me is how much impact
direct mapping fragmentation has on performance, in your case and in
general.

In your case, I guess the performance gain comes from the code getting
packed together and iTLB misses being reduced. When there is a lot of
code, packing it together on a hugepage is a further gain. Meanwhile,
the direct mapping split (or not) seems to be a side effect of this
packing; it doesn't have a direct impact on performance.

One thing I can imagine is that when an area of the direct mapping gets
split for permission reasons, then once that reason is gone (like module
unload or bpf code unload), those areas remain fragmented. That can
cause later operations that touch these same areas to use more dTLB
entries, which can be bad for performance, but it's hard to say how much
impact this causes.

Regards,
Aaron
Aaron Lu June 21, 2022, 1:45 a.m. UTC | #6
On Mon, Jun 20, 2022 at 11:31:39AM -0700, Luis Chamberlain wrote:
> On Mon, Jun 20, 2022 at 07:11:45PM +0800, Aaron Lu wrote:
> > Hi Song,
> > 
> > On Fri, May 20, 2022 at 04:57:50PM -0700, Song Liu wrote:
> > 
> > ... ...
> > 
> > > The primary goal of bpf_prog_pack is to reduce iTLB miss rate and reduce
> > > direct memory mapping fragmentation. This leads to non-trivial performance
> > > improvements.
> > >
> > > For our web service production benchmark, bpf_prog_pack on 4kB pages
> > > gives 0.5% to 0.7% more throughput than not using bpf_prog_pack.
> > > bpf_prog_pack on 2MB pages 0.6% to 0.9% more throughput than not using
> > > bpf_prog_pack. Note that 0.5% is a huge improvement for our fleet. I
> > > believe this is also significant for other companies with many thousand
> > > servers.
> > >
> > 
> > I'm evaluationg performance impact due to direct memory mapping
> > fragmentation 
> 
> BTW how exactly are you doing this?

Right now I'm mostly collecting materials from the web :-)

Zhengjun ran some extensive microbenchmarks with different page sizes
for the direct mapping on different server machines a while ago; here
is his report:
https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/
Quoting part of the conclusion:
"
This leads us to conclude that although 1G mappings are a 
good default choice, there is no compelling evidence that it must be the 
only choice, or that folks deriving benefits (like hardening) from 
smaller mapping sizes should avoid the smaller mapping sizes.
"

I searched the archive and found there is a performance problem when the
kernel text huge mapping gets split:
https://lore.kernel.org/lkml/20190823052335.572133-1-songliubraving@fb.com/

But I haven't found a report complaining about direct mapping fragmentation yet.
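
For what it's worth, a quick way to watch how split the direct map
currently is on an x86_64 box is the DirectMap* counters in
/proc/meminfo. A small user-space sketch (illustrative only):

#include <stdio.h>
#include <string.h>

/*
 * Print the DirectMap4k/2M/1G counters from /proc/meminfo (x86_64).
 * DirectMap4k growing while DirectMap2M/1G shrink is a rough sign that
 * the direct map is getting split up.
 */
int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f) {
		perror("fopen /proc/meminfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "DirectMap", strlen("DirectMap")))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}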
Song Liu June 21, 2022, 2:51 a.m. UTC | #7
On Mon, Jun 20, 2022 at 6:32 PM Aaron Lu <aaron.lu@intel.com> wrote:
>
> On Mon, Jun 20, 2022 at 09:03:52AM -0700, Song Liu wrote:
> > Hi Aaron,
> >
> > On Mon, Jun 20, 2022 at 4:12 AM Aaron Lu <aaron.lu@intel.com> wrote:
> > >
> > > Hi Song,
> > >
> > > On Fri, May 20, 2022 at 04:57:50PM -0700, Song Liu wrote:
> > >
> > > ... ...
> > >
> > > > The primary goal of bpf_prog_pack is to reduce iTLB miss rate and reduce
> > > > direct memory mapping fragmentation. This leads to non-trivial performance
> > > > improvements.
> > > >
> > > > For our web service production benchmark, bpf_prog_pack on 4kB pages
> > > > gives 0.5% to 0.7% more throughput than not using bpf_prog_pack.
> > > > bpf_prog_pack on 2MB pages 0.6% to 0.9% more throughput than not using
> > > > bpf_prog_pack. Note that 0.5% is a huge improvement for our fleet. I
> > > > believe this is also significant for other companies with many thousand
> > > > servers.
> > > >
> > >
> > > I'm evaluating the performance impact due to direct memory mapping
> > > fragmentation and seeing the above, I wonder: is the performance improvement
> > > mostly due to prog pack and hugepage instead of less direct mapping
> > > fragmentation?
> > >
> > > I can understand that when progs are packed together, iTLB miss rate will
> > > be reduced and thus, performance can be improved. But I don't see
> > > immediately how direct mapping fragmentation can impact performance since
> > > the bpf code are running from the module alias addresses, not the direct
> > > mapping addresses IIUC?
> >
> > You are right that BPF code runs from module alias addresses. However, to
> > protect text from overwrites, we use set_memory_x() and set_memory_ro()
> > for the BPF code. These two functions will set permissions for all aliases
> > of the memory, including the direct map, and thus cause fragmentation of
> > the direct map. Does this make sense?
>
> Guess I didn't make it clear.
>
> I understand that set_memory_XXX() will cause direct mapping split and
> thus, fragmented. What is not clear to me is, how much impact does
> direct mapping fragmentation have on performance, in your case and in
> general?
>
> In your case, I guess the performance gain is due to code gets packed
> together and iTLB gets reduced. When code are a lot, packing them
> together as a hugepage is a further gain. In the meantime, direct
> mapping split (or not) seems to be a side effect of this packing, but it
> doesn't have a direct impact on performance.
>
> One thing I can imagine is, when an area of direct mapping gets splited
> due to permission reason, when that reason is gone(like module unload
> or bpf code unload), those areas will remain fragmented and that can
> cause later operations that touch these same areas using more dTLBs
> and that can be bad for performance, but it's hard to say how much
> impact this can cause though.

Yes, we have data showing that the direct mapping remaining fragmented
can cause non-trivial performance degradation. For our web workload,
the difference is on the order of 1%.

Thanks,
Song
Aaron Lu June 21, 2022, 3:25 a.m. UTC | #8
On Mon, Jun 20, 2022 at 07:51:24PM -0700, Song Liu wrote:
> On Mon, Jun 20, 2022 at 6:32 PM Aaron Lu <aaron.lu@intel.com> wrote:
> >
> > On Mon, Jun 20, 2022 at 09:03:52AM -0700, Song Liu wrote:
> > > Hi Aaron,
> > >
> > > On Mon, Jun 20, 2022 at 4:12 AM Aaron Lu <aaron.lu@intel.com> wrote:
> > > >
> > > > Hi Song,
> > > >
> > > > On Fri, May 20, 2022 at 04:57:50PM -0700, Song Liu wrote:
> > > >
> > > > ... ...
> > > >
> > > > > The primary goal of bpf_prog_pack is to reduce iTLB miss rate and reduce
> > > > > direct memory mapping fragmentation. This leads to non-trivial performance
> > > > > improvements.
> > > > >
> > > > > For our web service production benchmark, bpf_prog_pack on 4kB pages
> > > > > gives 0.5% to 0.7% more throughput than not using bpf_prog_pack.
> > > > > bpf_prog_pack on 2MB pages 0.6% to 0.9% more throughput than not using
> > > > > bpf_prog_pack. Note that 0.5% is a huge improvement for our fleet. I
> > > > > believe this is also significant for other companies with many thousand
> > > > > servers.
> > > > >
> > > >
> > > > I'm evaluating the performance impact due to direct memory mapping
> > > > fragmentation and seeing the above, I wonder: is the performance improvement
> > > > mostly due to prog pack and hugepage instead of less direct mapping
> > > > fragmentation?
> > > >
> > > > I can understand that when progs are packed together, iTLB miss rate will
> > > > be reduced and thus, performance can be improved. But I don't see
> > > > immediately how direct mapping fragmentation can impact performance since
> > > > the bpf code are running from the module alias addresses, not the direct
> > > > mapping addresses IIUC?
> > >
> > > You are right that BPF code runs from module alias addresses. However, to
> > > protect text from overwrites, we use set_memory_x() and set_memory_ro()
> > > for the BPF code. These two functions will set permissions for all aliases
> > > of the memory, including the direct map, and thus cause fragmentation of
> > > the direct map. Does this make sense?
> >
> > Guess I didn't make it clear.
> >
> > I understand that set_memory_XXX() will cause direct mapping split and
> > thus, fragmented. What is not clear to me is, how much impact does
> > direct mapping fragmentation have on performance, in your case and in
> > general?
> >
> > In your case, I guess the performance gain is due to code gets packed
> > together and iTLB gets reduced. When code are a lot, packing them
> > together as a hugepage is a further gain. In the meantime, direct
> > mapping split (or not) seems to be a side effect of this packing, but it
> > doesn't have a direct impact on performance.
> >
> > One thing I can imagine is, when an area of direct mapping gets splited
> > due to permission reason, when that reason is gone(like module unload
> > or bpf code unload), those areas will remain fragmented and that can
> > cause later operations that touch these same areas using more dTLBs
> > and that can be bad for performance, but it's hard to say how much
> > impact this can cause though.
> 
> Yes, we have data showing the direct mapping remaining fragmented
> can cause non-trivial performance degradation. For our web workload,
> the difference is in the order of 1%.

Many thanks for the info, really appreciate it.

Regards,
Aaron