[bpf-next,v4,0/6] execmem_alloc for BPF programs

Message ID: 20221117202322.944661-1-song@kernel.org

Message

Song Liu Nov. 17, 2022, 8:23 p.m. UTC
This patchset tries to address the following issues:

1. Direct map fragmentation

On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be also
RO+X. These set_memory_* calls cause 1GB page table entries to be split
into 2MB and 4kB ones. This fragmentation of the direct map results in
bigger and slower page tables, and pressure on both the instruction and
data TLBs.

Our previous work on bpf_prog_pack tries to address this issue from the
BPF program side. Based on the experiments by Aaron Lu [4], bpf_prog_pack
has greatly reduced direct map fragmentation from BPF programs.

2. iTLB pressure from BPF programs

Dynamic kernel text such as modules and BPF programs (even with the current
bpf_prog_pack) uses 4kB pages on x86. When the total size of modules and
BPF programs is large, we can see a visible performance drop caused by a
high iTLB miss rate.

3. TLB shootdown for short-lived BPF programs

Before bpf_prog_pack, loading and unloading BPF programs required a global
TLB shootdown. This patchset (like bpf_prog_pack) replaces it with a local
TLB flush.

4. Reduce memory usage by BPF programs (in some cases)

Most BPF programs and various trampolines are small, yet they often
occupy a whole page. On a random server in our fleet, 50% of the
loaded BPF programs are less than 500 bytes in size, and 75% of them are
less than 2kB. Allowing these BPF programs to share 2MB pages
would yield some memory savings for systems with many BPF programs. For
systems with only a small number of BPF programs, this patchset may waste
a little memory by allocating one 2MB page but using only part of it.

5. Introduce a unified API to allocate memory with special permissions.

This will help get rid of set_vm_flush_reset_perms calls from users of
vmalloc, module_alloc, etc.
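
For illustration, here is a minimal sketch of the pattern such a unified API
would replace. The helpers below are existing kernel APIs, but the function
around them is hypothetical, not code from this set:

  /* Hypothetical legacy-style text allocation: each caller grabs whole
   * pages, flips permissions itself, and thereby splits 1GB/2MB direct
   * map entries and forces TLB flushes.
   */
  static void *legacy_alloc_exec(void *insns, size_t size)
  {
          void *ptr = module_alloc(size);

          if (!ptr)
                  return NULL;

          /* Have vfree() reset permissions and flush TLBs on free. */
          set_vm_flush_reset_perms(ptr);

          /* Write the text while the mapping is still writable. */
          memcpy(ptr, insns, size);

          /* Each of these calls may split large direct map entries. */
          set_memory_ro((unsigned long)ptr, DIV_ROUND_UP(size, PAGE_SIZE));
          set_memory_x((unsigned long)ptr, DIV_ROUND_UP(size, PAGE_SIZE));

          return ptr;
  }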


Based on our experiments [5], we measured a ~0.6% performance improvement
from bpf_prog_pack. This patchset further boosts the improvement to ~0.8%.
The difference is because bpf_prog_pack uses 512x 4kB pages instead of
1x 2MB page, so bpf_prog_pack as-is doesn't resolve #2 above.

This patchset replaces bpf_prog_pack with a better API and makes it
available for other dynamic kernel text, such as modules, ftrace, and
kprobes.


This set enables BPF programs and BPF dispatchers to share huge pages with
the new APIs:
  execmem_alloc()
  execmem_free()
  execmem_fill()

The idea is similar to Peter's suggestion in [1].

execmem_alloc() manages a set of PMD_SIZE RO+X memory regions and allocates
this memory to its users. execmem_free() is used to free memory allocated
by execmem_alloc(). execmem_fill() is used to update memory allocated by
execmem_alloc().

Memory allocated by execmem_alloc() is RO+X, so this does not violate W^X.
The caller has to update the content with a text_poke()-like mechanism.
Specifically, execmem_fill() is provided to update memory allocated by
execmem_alloc(), and it also makes sure the update stays within the
boundary of one chunk allocated by execmem_alloc(). Please refer to patch
1/6 for more details.
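
Putting the three calls together, usage looks roughly like the sketch below.
The exact signatures and the error-return convention of execmem_fill() are
defined in patch 1/6; the ones assumed here are for illustration only:

  /* Sketch only: signatures and error conventions are assumed. */
  static void *load_text(void *insns, size_t size)
  {
          /* Carve an RO+X chunk out of a shared PMD_SIZE region. */
          void *image = execmem_alloc(size);

          if (!image)
                  return NULL;

          /*
           * image is already RO+X and cannot be written directly.
           * execmem_fill() copies via a text_poke()-like mechanism
           * and rejects updates that cross the chunk's boundary.
           */
          if (IS_ERR(execmem_fill(image, insns, size))) {
                  execmem_free(image);
                  return NULL;
          }

          return image;
  }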

Patch 4/6 uses these new APIs in BPF programs and the BPF dispatcher.

Patches 5/6 and 6/6 allow static kernel text (_stext to _etext) to share
PMD_SIZE pages with dynamic kernel text on x86_64. This is achieved by
allocating PMD_SIZE pages up to roundup(_etext, PMD_SIZE), and then using
the range from _etext to roundup(_etext, PMD_SIZE) for dynamic kernel text.
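
In other words, the range arithmetic looks roughly like this (a sketch only;
the actual signature of register_text_tail_vm() is defined in patch 5/6 and
is assumed here):

  /* Hand the unused tail of the last PMD_SIZE text page to the
   * dynamic kernel text allocator instead of wasting it.
   */
  static void __init reuse_text_tail(void)
  {
          unsigned long start = (unsigned long)_etext;
          unsigned long end = roundup(start, PMD_SIZE);

          /* [start, end) becomes usable for dynamic kernel text. */
          register_text_tail_vm(start, end);
  }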

[1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/
[2] RFC v1: https://lore.kernel.org/linux-mm/20220818224218.2399791-3-song@kernel.org/T/
[3] v1: https://lore.kernel.org/bpf/20221031222541.1773452-1-song@kernel.org/
[4] https://lore.kernel.org/bpf/Y2ioTodn+mBXdIqp@ziqianlu-desk2/
[5] https://lore.kernel.org/bpf/20220707223546.4124919-1-song@kernel.org/

Changes v3 => v4:
1. Fixed a bug found with test_vmalloc.

Changes v2 => v3:
1. Add selftests in 4/6. (Luis Chamberlain)
2. Add more motivation and test results. (Luis Chamberlain)
3. Fix error handling in execmem_alloc().

Changes PATCH v1 => v2:
1. Rename the APIs as execmem_* (Christoph Hellwig)
2. Add more information about the motivation of this work (and follow up
   works in for kernel modules, various trampolines, etc).
   (Luis Chamberlain, Rick Edgecombe, Mike Rapoport, Aaron Lu)
3. Include experimental results from previous bpf_prog_pack and the
   community. (Aaron Lu, Luis Chamberlain, Rick Edgecombe)

Changes RFC v2 => PATCH v1:
1. Add vcopy_exec(), which updates memory allocated by vmalloc_exec(). It
   also ensures vcopy_exec() is only used to update memory from one single
   vmalloc_exec() call. (Christoph Hellwig)
2. Add arch_vcopy_exec() and arch_invalidate_exec() as wrapper for the
   text_poke() like logic.
3. Drop changes for kernel modules and focus on BPF side changes.

Changes RFC v1 => RFC v2:
1. Major rewrite of the logic of vmalloc_exec and vfree_exec. They now
   work fine with BPF programs (patches 1, 2, 4), but the module side
   (patch 3) still needs some work.

Song Liu (6):
  vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill
  x86/alternative: support execmem_alloc() and execmem_free()
  selftests/vm: extend test_vmalloc to test execmem_* APIs
  bpf: use execmem_alloc for bpf program and bpf dispatcher
  vmalloc: introduce register_text_tail_vm()
  x86: use register_text_tail_vm

 Documentation/x86/x86_64/mm.rst         |   4 +-
 arch/x86/include/asm/pgtable_64_types.h |   1 +
 arch/x86/kernel/alternative.c           |  12 +
 arch/x86/mm/init_64.c                   |   4 +-
 arch/x86/net/bpf_jit_comp.c             |  23 +-
 include/linux/bpf.h                     |   3 -
 include/linux/filter.h                  |   5 -
 include/linux/vmalloc.h                 |   9 +
 kernel/bpf/core.c                       | 180 +-----------
 kernel/bpf/dispatcher.c                 |  11 +-
 lib/test_vmalloc.c                      |  30 ++
 mm/nommu.c                              |  12 +
 mm/vmalloc.c                            | 362 ++++++++++++++++++++++++
 13 files changed, 452 insertions(+), 204 deletions(-)

--
2.30.2

Comments

Luis Chamberlain Nov. 21, 2022, 8:12 p.m. UTC | #1
On Thu, Nov 17, 2022 at 12:23:16PM -0800, Song Liu wrote:
> This patchset tries to address the following issues:
> 
> 1. Direct map fragmentation
> 
> On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be also
> RO+X. These set_memory_* calls cause 1GB page table entries to be split
> into 2MB and 4kB ones. This fragmentation of the direct map results in
> bigger and slower page tables, and pressure on both the instruction and
> data TLBs.
> 
> Our previous work on bpf_prog_pack tries to address this issue from the
> BPF program side. Based on the experiments by Aaron Lu [4], bpf_prog_pack
> has greatly reduced direct map fragmentation from BPF programs.

This value is clear, but I'd like to see at least another new user, and
the respective commit log showing the gains as Aaron Lu did.

> 2. iTLB pressure from BPF program
> 
> Dynamic kernel text such as modules and BPF programs (even with the current
> bpf_prog_pack) uses 4kB pages on x86. When the total size of modules and
> BPF programs is large, we can see a visible performance drop caused by a
> high iTLB miss rate.

As suggested by Mike Rapoport, "benchmarking iTLB performance on an idle
system is not very representative. TLB is a scarce resource, so it'd be
interesting to see this benchmark on a loaded system."

This would also help pave the way to measuring this for more possible
future callers like modules. Therein lies the true value of this
consideration.

Also, you mention your perf stats are run on a VM. I am curious what
things you need to get TLB stats to be properly measured on the VM, and if
this is really reliable data vs. bare metal. I haven't yet been successful
at getting perf stat for TLB to work on a VM, and based on what I've read
I have been cautious about the results.

So curious if you'd see something different on bare metal.

[0] https://lkml.kernel.org/r/Y3YA2mRZDJkB4lmP@kernel.org

> 3. TLB shootdown for short-lived BPF programs
>
> Before bpf_prog_pack, loading and unloading BPF programs required a global
> TLB shootdown. This patchset (like bpf_prog_pack) replaces it with a local
> TLB flush.

If this is all done on the BPF code replacement, then the commit log
should clarify that, as it allows future users not to be surprised
if they don't see these gains, since this is specific to the way BPF
code used bpf_prog_pack. Also, you can measure the shootdowns and show
the differences with perf stat -e tlb:tlb_flush.

> 4. Reduce memory usage by BPF programs (in some cases)
> 
> Most BPF programs and various trampolines are small, yet they often
> occupy a whole page. On a random server in our fleet, 50% of the
> loaded BPF programs are less than 500 bytes in size, and 75% of them are
> less than 2kB. Allowing these BPF programs to share 2MB pages
> would yield some memory savings for systems with many BPF programs. For
> systems with only a small number of BPF programs, this patchset may waste
> a little memory by allocating one 2MB page but using only part of it.
> 
> 5. Introduce a unified API to allocate memory with special permissions.
> 
> This will help get rid of set_vm_flush_reset_perms calls from users of
> vmalloc, module_alloc, etc.

And *this* is one of the reasons I'm so eager to see a proper solution
drawn up. This would be a huge win for modules; however, since some of
the complexities in special permissions with modules lie in all the
cross-architecture hanky panky, I'd prefer to see this merged
*iff* we have modules converted as well, as it would give us a clearer
picture of whether the solution covers the bases. And we'd get proper
testing on this, rather than it being a special thing for BPF.

> Based on our experiments [5], we measured ~0.6% performance improvement
> from bpf_prog_pack. This patchset further boosts the improvement to ~0.8%.

I'd prefer we leave out arbitrary performance data, as it does not help much.

> The difference is because bpf_prog_pack uses 512x 4kB pages instead of
> 1x 2MB page, so bpf_prog_pack as-is doesn't resolve #2 above.
> 
> This patchset replaces bpf_prog_pack with a better API and makes it
> available for other dynamic kernel text, such as modules, ftrace, kprobe.

Let's see that through; then I think the series builds confidence in
the implementation.

  Luis
Luis Chamberlain Nov. 21, 2022, 8:20 p.m. UTC | #2
On Mon, Nov 21, 2022 at 12:12:49PM -0800, Luis Chamberlain wrote:
> On Thu, Nov 17, 2022 at 12:23:16PM -0800, Song Liu wrote:
> > Based on our experiments [5], we measured ~0.6% performance improvement
> > from bpf_prog_pack. This patchset further boosts the improvement to ~0.8%.
> 
> I'd prefer we leave out arbitrary performance data, as it does not help much.

I'd like to clarify what I mean by this. As Linus has suggested before, you
are missing the opportunity to present "actual numbers on real loads. (not
some artificial benchmark)"

[0] https://lkml.kernel.org/r/CAHk-=wiF1KnM1_paB3jCONR9Mh1D_RCsnXKBau1K7XLG-mwwTQ@mail.gmail.com

  Luis
Song Liu Nov. 22, 2022, 2:28 a.m. UTC | #3
On Mon, Nov 21, 2022 at 1:12 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Thu, Nov 17, 2022 at 12:23:16PM -0800, Song Liu wrote:
> > This patchset tries to address the following issues:
> >
> > 1. Direct map fragmentation
> >
> > On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be also
> > RO+X. These set_memory_* calls cause 1GB page table entries to be split
> > into 2MB and 4kB ones. This fragmentation of the direct map results in
> > bigger and slower page tables, and pressure on both the instruction and
> > data TLBs.
> >
> > Our previous work on bpf_prog_pack tries to address this issue from the
> > BPF program side. Based on the experiments by Aaron Lu [4], bpf_prog_pack
> > has greatly reduced direct map fragmentation from BPF programs.
>
> This value is clear, but I'd like to see at least another new user and
> the respective commit log show the gains as Aaron Lu showed.
>
> > 2. iTLB pressure from BPF program
> >
> > Dynamic kernel text such as modules and BPF programs (even with the current
> > bpf_prog_pack) uses 4kB pages on x86. When the total size of modules and
> > BPF programs is large, we can see a visible performance drop caused by a
> > high iTLB miss rate.
>
> As suggested by Mike Rapoport, "benchmarking iTLB performance on an idle
> system is not very representative. TLB is a scarce resource, so it'd be
> interesting to see this benchmark on a loaded system."
>
> This would also help pave the way to measure this for more possible
> future callers like modules. There in lies true value to this
> consideration.
>
> Also, you mention your perf stats are run on a VM. I am curious what
> things you need to get TLB stats to be properly measured on the VM, and if
> this is really reliable data vs. bare metal. I haven't yet been successful
> at getting perf stat for TLB to work on a VM, and based on what I've read
> I have been cautious about the results.

To make these perf counters work in a VM, we need a newer host kernel
(my system is running a 5.6 based kernel, but I am not sure what the
minimum required version is). Then we need to run qemu with the option
"-cpu host" (both host and guest are x86_64).

>
> So curious if you'd see something different on bare metal.

Once the above is all worked out, the VM runs the same as bare metal
from the perf counters' point of view.

>
> [0] https://lkml.kernel.org/r/Y3YA2mRZDJkB4lmP@kernel.org
>
> > 3. TLB shootdown for short-lived BPF programs
> >
> > Before bpf_prog_pack, loading and unloading BPF programs required a global
> > TLB shootdown. This patchset (like bpf_prog_pack) replaces it with a local
> > TLB flush.
>
> If this is all done on the BPF code replacement, then the commit log
> should clarify that, as it allows future users not to be surprised
> if they don't see these gains, since this is specific to the way BPF
> code used bpf_prog_pack. Also, you can measure the shootdowns and show
> the differences with perf stat -e tlb:tlb_flush.
>
> > 4. Reduce memory usage by BPF programs (in some cases)
> >
> > Most BPF programs and various trampolines are small, yet they often
> > occupy a whole page. On a random server in our fleet, 50% of the
> > loaded BPF programs are less than 500 bytes in size, and 75% of them are
> > less than 2kB. Allowing these BPF programs to share 2MB pages
> > would yield some memory savings for systems with many BPF programs. For
> > systems with only a small number of BPF programs, this patchset may waste
> > a little memory by allocating one 2MB page but using only part of it.
> >
> > 5. Introduce a unified API to allocate memory with special permissions.
> >
> > This will help get rid of set_vm_flush_reset_perms calls from users of
> > vmalloc, module_alloc, etc.
>
> And *this* is one of the reasons I'm so eager to see a proper solution
> drawn up. This would be a huge win for modules; however, since some of
> the complexities in special permissions with modules lie in all the
> cross-architecture hanky panky, I'd prefer to see this merged
> *iff* we have modules converted as well, as it would give us a clearer
> picture of whether the solution covers the bases. And we'd get proper
> testing on this, rather than it being a special thing for BPF.
>
> > Based on our experiments [5], we measured ~0.6% performance improvement
> > from bpf_prog_pack. This patchset further boosts the improvement to ~0.8%.
>
> I'd prefer we leave out arbitrary performance data, as it does not help much.

This really bothers me. With real workloads, we are talking about a
performance difference of ~1%. I don't think there is any open source
benchmark that can show this level of performance difference. In our case,
we used an A/B test with 80 hosts (40 vs. 40) running for many hours to
confidently show a 1% performance difference.

This exact benchmark has a very good record of reporting smallish
performance regressions. For example, this commit

  commit 7af0145067bc ("x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text")

fixes a bug that split the page table (from 2MB to 4kB) for the WHOLE kernel
text. The bug stayed in the kernel for almost a year. None of the available
open source benchmarks had caught it before this specific benchmark did.

We have used this benchmark to demonstrate the performance benefits of many
optimizations. I don't understand why it suddenly becomes "arbitrary
performance data".

Song

>
> > The difference is because bpf_prog_pack uses 512x 4kB pages instead of
> > 1x 2MB page, so bpf_prog_pack as-is doesn't resolve #2 above.
> >
> > This patchset replaces bpf_prog_pack with a better API and makes it
> > available for other dynamic kernel text, such as modules, ftrace, kprobe.
>
> Let's see that through, then I think the series builds confidence in
> implementation.
Song Liu Nov. 22, 2022, 2:36 a.m. UTC | #4
On Mon, Nov 21, 2022 at 1:20 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Mon, Nov 21, 2022 at 12:12:49PM -0800, Luis Chamberlain wrote:
> > On Thu, Nov 17, 2022 at 12:23:16PM -0800, Song Liu wrote:
> > > Based on our experiments [5], we measured ~0.6% performance improvement
> > > from bpf_prog_pack. This patchset further boosts the improvement to ~0.8%.
> >
> > I'd prefer we leave out arbitrary performance data, as it does not help much.
>
> I'd like to clarify what I mean by this, as Linus has suggested before, you are
> missing the opportunity to present "actual numbers on real loads. (not
> some artificial benchmark)"
>
> [0] https://lkml.kernel.org/r/CAHk-=wiF1KnM1_paB3jCONR9Mh1D_RCsnXKBau1K7XLG-mwwTQ@mail.gmail.com

Unless I made some serious mistakes, Linus' concern with performance was
addressed by the exact performance results above [5].

Thanks,
Song

[5] https://lore.kernel.org/bpf/20220707223546.4124919-1-song@kernel.org/
Song Liu Nov. 22, 2022, 2:55 a.m. UTC | #5
On Mon, Nov 21, 2022 at 1:12 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Thu, Nov 17, 2022 at 12:23:16PM -0800, Song Liu wrote:

[...]

> > 5. Introduce a unified API to allocate memory with special permissions.
> >
> > This will help get rid of set_vm_flush_reset_perms calls from users of
> > vmalloc, module_alloc, etc.
>
> And *this* is one of the reasons I'm so eager to see a proper solution
> drawn up. This would be a huge win for modules; however, since some of
> the complexities in special permissions with modules lie in all the
> cross-architecture hanky panky, I'd prefer to see this merged
> *iff* we have modules converted as well, as it would give us a clearer
> picture of whether the solution covers the bases. And we'd get proper
> testing on this, rather than it being a special thing for BPF.

Module code is clearly the most difficult to migrate. (It has to work on
almost all archs, and it contains 3 allocations: core, data, init.) If we
want an actionable path towards fixing all of these, I don't think we
should use module code as the bar for the very first set. (Of course, if
Andrew or Linus insists, I will rethink this.)

PS: I don't quite understand why there is such a strong concern about
adding this to the core mm API, especially when there is also an argument
that this is only for BPF.

IIUC, the real concern is with a core API that
   1. is easy to use, and has many users;
   2. has a horrible internal implementation (maybe bpf_prog_pack
      falls in here, but it is not easy to use).

Such an API will cause a lot of problems, and it is also hard to
remove. The execmem_* APIs are quite the opposite: they are hard to use,
and they have a decent internal implementation (at least better than
bpf_prog_pack).

In patch 4/6 of this set, we easily removed all the bpf_prog_pack code
and used execmem_* instead. If execmem_* turns out to be horrible, and
only useful for BPF, we can easily migrate it to the next good API,
right?

Thanks,
Song
Luis Chamberlain Nov. 23, 2022, 12:21 a.m. UTC | #6
On Mon, Nov 21, 2022 at 07:28:36PM -0700, Song Liu wrote:
> On Mon, Nov 21, 2022 at 1:12 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > On Thu, Nov 17, 2022 at 12:23:16PM -0800, Song Liu wrote:
> > > This patchset tries to address the following issues:
> > >
> > > Based on our experiments [5], we measured ~0.6% performance improvement
> > > from bpf_prog_pack. This patchset further boosts the improvement to ~0.8%.
> >
> > I'd prefer we leave out arbitrary performance data, as it does not help much.
> 
> This really bothers me. With real workloads, we are talking about a
> performance difference of ~1%. I don't think there is any open source
> benchmark that can show this level of performance difference.

I *highly* doubt that.

> In our case, we used an A/B test with 80 hosts (40 vs. 40) running for
> many hours to confidently show a 1% performance difference. This exact
> benchmark has a very good record of reporting smallish performance
> regressions.

As per Wikipedia, "A/B tests are useful for understanding user
engagement and satisfaction of online features like a new feature or
product". Let us disregard what is going on with user experience and
consider evaluating the performance instead of what goes on behind the
scenes.

> For example, this commit
>
>   commit 7af0145067bc ("x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text")
>
> fixes a bug that split the page table (from 2MB to 4kB) for the WHOLE kernel
> text. The bug stayed in the kernel for almost a year. None of the available
> open source benchmarks had caught it before this specific benchmark did.

That doesn't mean enterprise-level testing would not have caught it, and
enterprise kernels run on ancient kernels, so they would not catch up that
fast. RHEL uses even more ancient kernels than SUSE, so let's consider
where SUSE was during this regression. The commit you mentioned as the fix,
7af0145067bc, went upstream in v5.3-rc7~4^2, and that was in August 2019.
The bug was introduced through commit 585948f4f695 ("x86/mm/cpa: Avoid
the 4k pages check completely"), and that was in v4.20-rc1~159^2~41,
around September 2018. Around September 2018, the time the regression was
committed, the most bleeding-edge Enterprise Linux kernel in the industry
was that of SLE15, i.e. v4.12, so there is no way in hell the performance
team at SUSE, for instance, would have even come close to evaluating code
with that regression. In fact, they wouldn't come across it in testing until
SLE15-SP2 on the v5.3 kernel, but by then the regression would have been fixed.

Yes, 0-day does *some* performance testing, but it does not do any
justice to the monumental effort that goes into performance testing at
Enterprise Linux distributions. The gap that leaves should perhaps be
solved in the community long term; however, that's a separate problem.

But to suggest that there is *nothing* like what you have is probably
pretty inaccurate.

> We have used this benchmark to demonstrate performance benefits of many
> optimizations. I don't understand why it suddenly becomes "arbitrary
> performance data".

It's because typically you'd want a benchmark you can reproduce something with,
and some "A/B testing" reference really doesn't help future developers who are
evaluating performance regressions, or who would want to provide critical
feedback to you on things you may have overlooked when selling a generic
performance improvement into the kernel.

  Luis
Song Liu Nov. 23, 2022, 5:06 a.m. UTC | #7
On Tue, Nov 22, 2022 at 5:21 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Mon, Nov 21, 2022 at 07:28:36PM -0700, Song Liu wrote:
> > On Mon, Nov 21, 2022 at 1:12 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > >
[...]
> > fixes a bug that split the page table (from 2MB to 4kB) for the WHOLE kernel
> > text. The bug stayed in the kernel for almost a year. None of the available
> > open source benchmarks had caught it before this specific benchmark did.
>
> That doesn't mean enterprise-level testing would not have caught it, and
> enterprise kernels run on ancient kernels, so they would not catch up that
> fast. RHEL uses even more ancient kernels than SUSE, so let's consider
> where SUSE was during this regression. The commit you mentioned as the fix,
> 7af0145067bc, went upstream in v5.3-rc7~4^2, and that was in August 2019.
> The bug was introduced through commit 585948f4f695 ("x86/mm/cpa: Avoid
> the 4k pages check completely"), and that was in v4.20-rc1~159^2~41,
> around September 2018. Around September 2018, the time the regression was
> committed, the most bleeding-edge Enterprise Linux kernel in the industry
> was that of SLE15, i.e. v4.12, so there is no way in hell the performance
> team at SUSE, for instance, would have even come close to evaluating code
> with that regression. In fact, they wouldn't come across it in testing until
> SLE15-SP2 on the v5.3 kernel, but by then the regression would have been fixed.

Can you refer me to one enterprise performance report with an open source
benchmark that shows a ~1% performance regression? If one is available, I
am more than happy to try it out. Note that we need some BPF programs to
show the benefit of this set. On most production hosts, network-related BPF
programs are the busiest. Therefore, single-host benchmarks will not show
the benefit.

Thanks,
Song

PS: Data in [1] is full of noise:

"""
2. For each benchmark/system combination, the 1G mapping had the highest
performance for 45% of the tests, 2M for ~30%, and 4k for~20%.

3. From the average delta, among 1G/2M/4K, 4K gets the lowest
performance in all the 4 test machines, while 1G gets the best
performance on 2 test machines and 2M gets the best performance on the
other 2 machines.
"""

There is no way we can get a consistent result of 1% performance
improvement from experiments like those.


[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/
Mike Rapoport Nov. 30, 2022, 9:41 a.m. UTC | #8
On Mon, Nov 21, 2022 at 07:28:36PM -0700, Song Liu wrote:
> On Mon, Nov 21, 2022 at 1:12 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > Also, you mention your perf stats are run on a VM. I am curious what
> > things you need to get TLB stats to be properly measured on the VM, and if
> > this is really reliable data vs. bare metal. I haven't yet been successful
> > at getting perf stat for TLB to work on a VM, and based on what I've read
> > I have been cautious about the results.
> 
> To make these perf counters work in a VM, we need a newer host kernel
> (my system is running a 5.6 based kernel, but I am not sure what the
> minimum required version is). Then we need to run qemu with the option
> "-cpu host" (both host and guest are x86_64).
> 
> >
> > So curious if you'd see something different on bare metal.
> 
> Once the above is all worked out, the VM runs the same as bare metal
> from the perf counters' point of view.

TLBs behave differently because of EPT.
Mike Rapoport Nov. 30, 2022, 9:53 a.m. UTC | #9
On Tue, Nov 22, 2022 at 10:06:06PM -0700, Song Liu wrote:
> On Tue, Nov 22, 2022 at 5:21 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > On Mon, Nov 21, 2022 at 07:28:36PM -0700, Song Liu wrote:
> > > On Mon, Nov 21, 2022 at 1:12 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > > >
> [...]
> > > fixes a bug that split the page table (from 2MB to 4kB) for the WHOLE kernel
> > > text. The bug stayed in the kernel for almost a year. None of the available
> > > open source benchmarks had caught it before this specific benchmark did.
> >
> > That doesn't mean enterprise-level testing would not have caught it, and
> > enterprise kernels run on ancient kernels, so they would not catch up that
> > fast. RHEL uses even more ancient kernels than SUSE, so let's consider
> > where SUSE was during this regression. The commit you mentioned as the fix,
> > 7af0145067bc, went upstream in v5.3-rc7~4^2, and that was in August 2019.
> > The bug was introduced through commit 585948f4f695 ("x86/mm/cpa: Avoid
> > the 4k pages check completely"), and that was in v4.20-rc1~159^2~41,
> > around September 2018. Around September 2018, the time the regression was
> > committed, the most bleeding-edge Enterprise Linux kernel in the industry
> > was that of SLE15, i.e. v4.12, so there is no way in hell the performance
> > team at SUSE, for instance, would have even come close to evaluating code
> > with that regression. In fact, they wouldn't come across it in testing until
> > SLE15-SP2 on the v5.3 kernel, but by then the regression would have been fixed.
> 
> Can you refer me to one enterprise performance report with an open source
> benchmark that shows a ~1% performance regression? If one is available, I
> am more than happy to try it out. Note that we need some BPF programs to
> show the benefit of this set. On most production hosts, network-related BPF
> programs are the busiest. Therefore, single-host benchmarks will not show
> the benefit.
> 
> Thanks,
> Song
> 
> PS: Data in [1] is full of noise:
> 
> """
> 2. For each benchmark/system combination, the 1G mapping had the highest
> performance for 45% of the tests, 2M for ~30%, and 4k for~20%.
> 
> 3. From the average delta, among 1G/2M/4K, 4K gets the lowest
> performance in all the 4 test machines, while 1G gets the best
> performance on 2 test machines and 2M gets the best performance on the
> other 2 machines.
> """

I don't think it's noise. IMO, this means that performance degradation
caused by the fragmentation of the direct map highly depends on workload
and microarchitecture.
 
> There is no way we can get a consistent result of 1% performance
> improvement from experiments like those.

Experiments like those show how a change in the kernel behaviour affects
different workloads, and not just a single benchmark. Having a performance
improvement in a single benchmark does not necessarily mean other
benchmarks won't regress.
 
> [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/
Luis Chamberlain Dec. 8, 2022, 2:48 a.m. UTC | #10
On Mon, Nov 21, 2022 at 07:36:21PM -0700, Song Liu wrote:
> On Mon, Nov 21, 2022 at 1:20 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > On Mon, Nov 21, 2022 at 12:12:49PM -0800, Luis Chamberlain wrote:
> > > On Thu, Nov 17, 2022 at 12:23:16PM -0800, Song Liu wrote:
> > > > Based on our experiments [5], we measured ~0.6% performance improvement
> > > > from bpf_prog_pack. This patchset further boosts the improvement to ~0.8%.
> > >
> > > I'd prefer we leave out arbitrary performance data, as it does not help much.
> >
> > I'd like to clarify what I mean by this, as Linus has suggested before, you are
> > missing the opportunity to present "actual numbers on real loads. (not
> > some artificial benchmark)"
> >
> > [0] https://lkml.kernel.org/r/CAHk-=wiF1KnM1_paB3jCONR9Mh1D_RCsnXKBau1K7XLG-mwwTQ@mail.gmail.com
> 
> Unless I made some serious mistakes, Linus' concern with performance was
> addressed by the exact performance results above. [5]
> 
> [5] https://lore.kernel.org/bpf/20220707223546.4124919-1-song@kernel.org/

AFAICT no, that is all still artificial.

  Luis