Message ID: 20221117202322.944661-1-song@kernel.org (mailing list archive)
Series: execmem_alloc for BPF programs
On Thu, Nov 17, 2022 at 12:23:16PM -0800, Song Liu wrote:
> This patchset tries to address the following issues:
>
> 1. Direct map fragmentation
>
> On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be also
> RO+X. These set_memory_* calls cause 1GB page table entries to be split
> into 2MB and 4kB ones. This fragmentation in the direct map results in a
> bigger and slower page table, and pressure for both instruction and data TLB.
>
> Our previous work in bpf_prog_pack tries to address this issue from the BPF
> program side. Based on the experiments by Aaron Lu [4], bpf_prog_pack has
> greatly reduced direct map fragmentation from BPF programs.

This value is clear, but I'd like to see at least another new user and
the respective commit log show the gains as Aaron Lu showed.

> 2. iTLB pressure from BPF programs
>
> Dynamic kernel text such as modules and BPF programs (even with the current
> bpf_prog_pack) use 4kB pages on x86. When the total size of modules and
> BPF programs is big, we can see a visible performance drop caused by a high
> iTLB miss rate.

As suggested by Mike Rapoport, "benchmarking iTLB performance on an idle
system is not very representative. TLB is a scarce resource, so it'd be
interesting to see this benchmark on a loaded system."

This would also help pave the way to measure this for more possible
future callers like modules. Therein lies true value in this
consideration.

Also, you mention your perf stats are run on a VM. I am curious what
things you need to get TLB to be properly measured on the VM, and if
this is really reliable data vs. bare metal. I haven't yet been successful
at getting perf stat for TLB to work on a VM, and based on what I've read
I have been cautious about the results. So I am curious if you'd see
something different on bare metal.

[0] https://lkml.kernel.org/r/Y3YA2mRZDJkB4lmP@kernel.org

> 3. TLB shootdown for short-lived BPF programs
>
> Before bpf_prog_pack, loading and unloading BPF programs required a global
> TLB shootdown. This patchset (and bpf_prog_pack) replaces it with a local
> TLB flush.

If this is all done on the bpf code replacement then the commit log
should clarify that, as it allows future users to not be surprised if
they don't see these gains: this is specific to the way bpf code used
bpf_prog_pack. Also, you can measure the shootdowns and show the
differences with perf stat -e tlb:tlb_flush.

> 4. Reduce memory usage by BPF programs (in some cases)
>
> Most BPF programs and various trampolines are small, and they often
> occupy a whole page. From a random server in our fleet, 50% of the
> loaded BPF programs are less than 500 bytes in size, and 75% of them are
> less than 2kB in size. Allowing these BPF programs to share 2MB pages
> would yield some memory saving for systems with many BPF programs. For
> systems with only a small number of BPF programs, this patch may waste a
> little memory by allocating one 2MB page, but using only part of it.
>
> 5. Introduce a unified API to allocate memory with special permissions.
>
> This will help get rid of set_vm_flush_reset_perms calls from users of
> vmalloc, module_alloc, etc.

And *this* is one of the reasons I'm so eager to see a proper solution
drawn up. This would be a huge win for modules. However, since some of
the complexities in special permissions with modules lie in all the
cross-architecture hanky panky, I'd prefer to see this through merged
*iff* we have modules converted as well, as it would give us a clearer
picture of whether the solution covers the bases, and we'd get proper
testing on this, rather than it being a special thing for BPF.

> Based on our experiments [5], we measured ~0.6% performance improvement
> from bpf_prog_pack. This patchset further boosts the improvement to ~0.8%.

I'd prefer we leave out arbitrary performance data, as it does not help much.

> The difference is because bpf_prog_pack uses 512x 4kB pages instead of
> 1x 2MB page, bpf_prog_pack as-is doesn't resolve #2 above.
>
> This patchset replaces bpf_prog_pack with a better API and makes it
> available for other dynamic kernel text, such as modules, ftrace, kprobe.

Let's see that through, then I think the series builds confidence in
the implementation.

  Luis
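The memory-saving arithmetic behind point 4 above can be sanity-checked with a toy calculation: small programs each rounded up to their own 4kB page, versus packed back-to-back into shared 2MB pages. This is an illustrative sketch only, with made-up sizes loosely matching the quoted fleet distribution; it is not the kernel allocator.

```python
# Hypothetical sketch of the memory saving described in point 4:
# small programs each rounded up to a 4kB page, versus packed
# back-to-back (alignment-rounded) inside shared 2MB pages.
PAGE_4K = 4 << 10
PAGE_2M = 2 << 20

def pages_one_each(sizes):
    """Each program occupies at least one whole 4kB page."""
    return sum(-(-s // PAGE_4K) for s in sizes) * PAGE_4K

def pages_packed(sizes, align=64):
    """Programs bump-allocated into shared 2MB pages."""
    used, offset = 1, 0
    for s in sizes:
        s = -(-s // align) * align      # round up to alignment
        if offset + s > PAGE_2M:        # doesn't fit: open a new 2MB page
            used += 1
            offset = 0
        offset += s
    return used * PAGE_2M

# Made-up sizes roughly matching the quoted distribution:
# 50% under 500 bytes, 25% between 500B and 2kB, 25% larger.
sizes = [400] * 500 + [1500] * 250 + [8000] * 250
print(pages_one_each(sizes) // 1024, "kB unpacked")
print(pages_packed(sizes) // 1024, "kB packed")
```

It also illustrates the trade-off stated at the end of point 4: a single tiny program still pins a whole 2MB page in the packed scheme.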
On Mon, Nov 21, 2022 at 12:12:49PM -0800, Luis Chamberlain wrote:
> On Thu, Nov 17, 2022 at 12:23:16PM -0800, Song Liu wrote:
> > Based on our experiments [5], we measured ~0.6% performance improvement
> > from bpf_prog_pack. This patchset further boosts the improvement to ~0.8%.
>
> I'd prefer we leave out arbitrary performance data, as it does not help much.

I'd like to clarify what I mean by this. As Linus has suggested before, you are
missing the opportunity to present "actual numbers on real loads. (not
some artificial benchmark)"

[0] https://lkml.kernel.org/r/CAHk-=wiF1KnM1_paB3jCONR9Mh1D_RCsnXKBau1K7XLG-mwwTQ@mail.gmail.com

  Luis
On Mon, Nov 21, 2022 at 1:12 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Thu, Nov 17, 2022 at 12:23:16PM -0800, Song Liu wrote:
> > This patchset tries to address the following issues:
> >
> > 1. Direct map fragmentation
> >
> > On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be also
> > RO+X. These set_memory_* calls cause 1GB page table entries to be split
> > into 2MB and 4kB ones. This fragmentation in the direct map results in a
> > bigger and slower page table, and pressure for both instruction and data TLB.
> >
> > Our previous work in bpf_prog_pack tries to address this issue from the BPF
> > program side. Based on the experiments by Aaron Lu [4], bpf_prog_pack has
> > greatly reduced direct map fragmentation from BPF programs.
>
> This value is clear, but I'd like to see at least another new user and
> the respective commit log show the gains as Aaron Lu showed.
>
> > 2. iTLB pressure from BPF programs
> >
> > Dynamic kernel text such as modules and BPF programs (even with the current
> > bpf_prog_pack) use 4kB pages on x86. When the total size of modules and
> > BPF programs is big, we can see a visible performance drop caused by a high
> > iTLB miss rate.
>
> As suggested by Mike Rapoport, "benchmarking iTLB performance on an idle
> system is not very representative. TLB is a scarce resource, so it'd be
> interesting to see this benchmark on a loaded system."
>
> This would also help pave the way to measure this for more possible
> future callers like modules. Therein lies true value in this
> consideration.
>
> Also, you mention your perf stats are run on a VM. I am curious what
> things you need to get TLB to be properly measured on the VM, and if
> this is really reliable data vs. bare metal. I haven't yet been successful
> at getting perf stat for TLB to work on a VM, and based on what I've read
> I have been cautious about the results.

To make these perf counters work on a VM, we need a newer host kernel
(my system is running a 5.6 based kernel, but I am not sure what is the
minimum required version). Then we need to run qemu with the option
"-cpu host" (both host and guest are x86_64).

> So I am curious if you'd see something different on bare metal.

Once the above is all worked out, a VM runs the same as bare metal from
the perf counters' point of view.

> [0] https://lkml.kernel.org/r/Y3YA2mRZDJkB4lmP@kernel.org
>
> > 3. TLB shootdown for short-lived BPF programs
> >
> > Before bpf_prog_pack, loading and unloading BPF programs required a global
> > TLB shootdown. This patchset (and bpf_prog_pack) replaces it with a local
> > TLB flush.
>
> If this is all done on the bpf code replacement then the commit log
> should clarify that, as it allows future users to not be surprised if
> they don't see these gains: this is specific to the way bpf code used
> bpf_prog_pack. Also, you can measure the shootdowns and show the
> differences with perf stat -e tlb:tlb_flush.
>
> > 4. Reduce memory usage by BPF programs (in some cases)
> >
> > Most BPF programs and various trampolines are small, and they often
> > occupy a whole page. From a random server in our fleet, 50% of the
> > loaded BPF programs are less than 500 bytes in size, and 75% of them are
> > less than 2kB in size. Allowing these BPF programs to share 2MB pages
> > would yield some memory saving for systems with many BPF programs. For
> > systems with only a small number of BPF programs, this patch may waste a
> > little memory by allocating one 2MB page, but using only part of it.
> >
> > 5. Introduce a unified API to allocate memory with special permissions.
> >
> > This will help get rid of set_vm_flush_reset_perms calls from users of
> > vmalloc, module_alloc, etc.
>
> And *this* is one of the reasons I'm so eager to see a proper solution
> drawn up. This would be a huge win for modules. However, since some of
> the complexities in special permissions with modules lie in all the
> cross-architecture hanky panky, I'd prefer to see this through merged
> *iff* we have modules converted as well, as it would give us a clearer
> picture of whether the solution covers the bases, and we'd get proper
> testing on this, rather than it being a special thing for BPF.
>
> > Based on our experiments [5], we measured ~0.6% performance improvement
> > from bpf_prog_pack. This patchset further boosts the improvement to ~0.8%.
>
> I'd prefer we leave out arbitrary performance data, as it does not help much.

This really bothers me. With a real workload, we are talking about a
performance difference of ~1%. I don't think there is any open source
benchmark that can show this level of performance difference.

In our case, we used an A/B test with 80 hosts (40 vs. 40) that runs for
many hours to confidently show a 1% performance difference. This exact
benchmark has a very good record of reporting smallish performance
regressions.

For example, this commit

  commit 7af0145067bc ("x86/mm/cpa: Prevent large page split when ftrace
  flips RW on kernel text")

fixes a bug that splits the page table (from 2MB to 4kB) for the WHOLE kernel
text. The bug stayed in the kernel for almost a year. None of the available
open source benchmarks had caught it before this specific benchmark did.

We have used this benchmark to demonstrate the performance benefits of many
optimizations. I don't understand why it suddenly becomes "arbitrary
performance data".

Song

> > The difference is because bpf_prog_pack uses 512x 4kB pages instead of
> > 1x 2MB page, bpf_prog_pack as-is doesn't resolve #2 above.
> >
> > This patchset replaces bpf_prog_pack with a better API and makes it
> > available for other dynamic kernel text, such as modules, ftrace, kprobe.
>
> Let's see that through, then I think the series builds confidence in
> implementation.
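As an aside, the scale of the A/B setup Song describes (80 hosts, many hours) is roughly what classical power analysis predicts for resolving a ~1% mean shift. The sketch below is a back-of-the-envelope illustration only; the 2% per-sample noise figure is a made-up assumption, not a number from this thread.

```python
# Back-of-the-envelope power analysis for an A/B test: how many host
# samples per arm are needed to detect a small mean shift?  The 2%
# run-to-run noise used below is a hypothetical assumption.
from math import sqrt, erf

def z(p):
    """Inverse standard normal CDF via bisection (avoids scipy)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if (1 + erf(mid / sqrt(2))) / 2 < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def samples_per_arm(delta, sigma, alpha=0.05, power=0.8):
    """Two-sample test: n per arm to detect a mean shift `delta`
    given per-sample standard deviation `sigma`."""
    return 2 * ((z(1 - alpha / 2) + z(power)) * sigma / delta) ** 2

# Detecting a 1% throughput shift when per-sample noise is 2%:
print(round(samples_per_arm(delta=0.01, sigma=0.02)), "samples per arm")
```

With these assumed numbers the required sample count per arm lands in the tens of hosts, which is consistent with why a handful of single-host benchmark runs cannot resolve an effect this small.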
On Mon, Nov 21, 2022 at 1:20 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Mon, Nov 21, 2022 at 12:12:49PM -0800, Luis Chamberlain wrote:
> > On Thu, Nov 17, 2022 at 12:23:16PM -0800, Song Liu wrote:
> > > Based on our experiments [5], we measured ~0.6% performance improvement
> > > from bpf_prog_pack. This patchset further boosts the improvement to ~0.8%.
> >
> > I'd prefer we leave out arbitrary performance data, as it does not help much.
>
> I'd like to clarify what I mean by this. As Linus has suggested before, you are
> missing the opportunity to present "actual numbers on real loads. (not
> some artificial benchmark)"
>
> [0] https://lkml.kernel.org/r/CAHk-=wiF1KnM1_paB3jCONR9Mh1D_RCsnXKBau1K7XLG-mwwTQ@mail.gmail.com

Unless I made some serious mistakes, Linus' concern with performance was
addressed by the exact performance results above. [5]

Thanks,
Song

[5] https://lore.kernel.org/bpf/20220707223546.4124919-1-song@kernel.org/
On Mon, Nov 21, 2022 at 1:12 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Thu, Nov 17, 2022 at 12:23:16PM -0800, Song Liu wrote:
[...]
> > 5. Introduce a unified API to allocate memory with special permissions.
> >
> > This will help get rid of set_vm_flush_reset_perms calls from users of
> > vmalloc, module_alloc, etc.
>
> And *this* is one of the reasons I'm so eager to see a proper solution
> drawn up. This would be a huge win for modules. However, since some of
> the complexities in special permissions with modules lie in all the
> cross-architecture hanky panky, I'd prefer to see this through merged
> *iff* we have modules converted as well, as it would give us a clearer
> picture of whether the solution covers the bases, and we'd get proper
> testing on this, rather than it being a special thing for BPF.

Module code is clearly the most difficult to migrate. (It has to work on
almost all archs, and it contains 3 allocations: core, data, init.) If we
want an actionable path towards fixing all of these, I don't think we
should use module code as the bar for the very first set. (Of course, if
Andrew or Linus insists on it, I will rethink this.)

PS: I don't quite understand why there is such a strong concern about
adding this to the core mm API, especially given the argument that this
is only for BPF. IIUC, the real concern is with a core API that 1. is
easy to use and has many users; and 2. has a horrible internal
implementation (maybe bpf_prog_pack falls in here, but it is not easy to
use). Such an API will cause a lot of problems, and it is also very hard
to remove.

The execmem_* APIs are quite the opposite. They are hard to use, and they
have a decent internal implementation (at least better than
bpf_prog_pack). In patch 4/5 of the set, we easily reverted all the
bpf_prog_pack code and used execmem_* instead. If the execmem_* APIs turn
out to be horrible, and only useful for BPF, we can easily migrate to the
next good API, right?

Thanks,
Song
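To make the bpf_prog_pack/execmem_* idea being debated here concrete, a toy userspace model of the pack-allocator shape follows: carve small allocations out of shared 2MB regions with a bump pointer, and release a region once all of its allocations are freed. All names are hypothetical and the logic is deliberately simplified (no reuse of freed space inside a live pack, no permissions handling); it does not mirror the kernel implementation.

```python
# Toy model of a pack-style allocator: small allocations share 2MB
# "packs"; a pack is released when its last allocation is freed.
# Illustrative only; not the kernel's execmem_*/bpf_prog_pack code.
PACK_SIZE = 2 << 20  # 2MB

class Pack:
    def __init__(self):
        self.offset = 0   # bump pointer into this pack
        self.live = 0     # number of live allocations

class ExecMem:
    def __init__(self):
        self.packs = []

    def alloc(self, size):
        """Return (pack, offset) for `size` bytes, opening a new pack if needed."""
        for p in self.packs:
            if p.offset + size <= PACK_SIZE:
                break
        else:
            p = Pack()
            self.packs.append(p)
        off = p.offset
        p.offset += size
        p.live += 1
        return p, off

    def free(self, pack):
        """Drop one allocation; release the whole pack when it empties."""
        pack.live -= 1
        if pack.live == 0:
            self.packs.remove(pack)

em = ExecMem()
a, _ = em.alloc(500)
b, _ = em.alloc(1500)
print(len(em.packs))  # the two small programs share one pack
em.free(a)
em.free(b)
print(len(em.packs))  # the empty pack has been released
```

The point of the sketch is Song's argument above: the allocator's internals are self-contained behind a tiny alloc/free surface, so callers could be migrated to a different backing implementation without changing their code.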
On Mon, Nov 21, 2022 at 07:28:36PM -0700, Song Liu wrote:
> On Mon, Nov 21, 2022 at 1:12 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > On Thu, Nov 17, 2022 at 12:23:16PM -0800, Song Liu wrote:
> > > This patchset tries to address the following issues:
> > >
> > > Based on our experiments [5], we measured ~0.6% performance improvement
> > > from bpf_prog_pack. This patchset further boosts the improvement to ~0.8%.
> >
> > I'd prefer we leave out arbitrary performance data, as it does not help much.
>
> This really bothers me. With a real workload, we are talking about a
> performance difference of ~1%. I don't think there is any open source
> benchmark that can show this level of performance difference.

I *highly* doubt that.

> In our case, we used an A/B test with 80 hosts (40 vs. 40) that runs for
> many hours to confidently show a 1% performance difference. This exact
> benchmark has a very good record of reporting smallish performance
> regressions.

As per Wikipedia, "A/B tests are useful for understanding user engagement
and satisfaction of online features like a new feature or product". Let us
disregard what is going on with user experience and consider evaluating
the performance of what goes on behind the scenes instead.

> For example, this commit
>
>   commit 7af0145067bc ("x86/mm/cpa: Prevent large page split when ftrace
>   flips RW on kernel text")
>
> fixes a bug that splits the page table (from 2MB to 4kB) for the WHOLE kernel
> text. The bug stayed in the kernel for almost a year. None of the available
> open source benchmarks had caught it before this specific benchmark did.

That doesn't mean enterprise-level testing would not have caught it, and
enterprise kernels run on ancient kernels, so they would not catch up that
fast. RHEL uses even more ancient kernels than SUSE, so let's consider
where SUSE was during this regression. The fix you mentioned,
7af0145067bc, went upstream in v5.3-rc7~4^2, and that was in August 2019.
The bug was introduced through commit 585948f4f695 ("x86/mm/cpa: Avoid
the 4k pages check completely"), and that was in v4.20-rc1~159^2~41,
around September 2018. Around September 2018, the time the regression was
committed, the most bleeding edge Enterprise Linux kernel in the industry
was that on SLE15, and so v4.12, and so there is no way in hell the
performance team at SUSE, for instance, would have even come close to
evaluating code with that regression. In fact, they wouldn't come across
it in testing until SLE15-SP2 on the v5.3 kernel, but by then the
regression would have been fixed.

Yes, 0-day does *some* performance testing, but it does not do any
justice to the monumental effort that goes into performance testing at
Enterprise Linux distributions. The gap that leaves should perhaps be
solved in the community long term; however, that's a separate problem.
But to suggest that there is *nothing* like what you have is probably
pretty inaccurate.

> We have used this benchmark to demonstrate the performance benefits of many
> optimizations. I don't understand why it suddenly becomes "arbitrary
> performance data".

It's because typically you'd want a benchmark you can reproduce something
with, and some "A/B testing" reference really doesn't help future
developers who are evaluating performance regressions, or who would want
to provide critical feedback to you on things you may have overlooked
when selling a generic performance improvement into the kernel.

  Luis
On Tue, Nov 22, 2022 at 5:21 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Mon, Nov 21, 2022 at 07:28:36PM -0700, Song Liu wrote:
> > On Mon, Nov 21, 2022 at 1:12 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > >
[...]
> > fixes a bug that splits the page table (from 2MB to 4kB) for the WHOLE kernel
> > text. The bug stayed in the kernel for almost a year. None of the available
> > open source benchmarks had caught it before this specific benchmark did.
>
> That doesn't mean enterprise-level testing would not have caught it, and
> enterprise kernels run on ancient kernels, so they would not catch up that
> fast. RHEL uses even more ancient kernels than SUSE, so let's consider
> where SUSE was during this regression. The fix you mentioned,
> 7af0145067bc, went upstream in v5.3-rc7~4^2, and that was in August 2019.
> The bug was introduced through commit 585948f4f695 ("x86/mm/cpa: Avoid
> the 4k pages check completely"), and that was in v4.20-rc1~159^2~41,
> around September 2018. Around September 2018, the time the regression was
> committed, the most bleeding edge Enterprise Linux kernel in the industry
> was that on SLE15, and so v4.12, and so there is no way in hell the
> performance team at SUSE, for instance, would have even come close to
> evaluating code with that regression. In fact, they wouldn't come across
> it in testing until SLE15-SP2 on the v5.3 kernel, but by then the
> regression would have been fixed.

Can you refer me to one enterprise performance report with an open source
benchmark that shows a ~1% performance regression? If one is available, I
am more than happy to try it out. Note that we need some BPF programs to
show the benefit of this set. In most production hosts, network related
BPF programs are the busiest; therefore, single host benchmarks will not
show the benefit.

Thanks,
Song

PS: The data in [1] is full of noise:

"""
2. For each benchmark/system combination, the 1G mapping had the highest
performance for 45% of the tests, 2M for ~30%, and 4k for ~20%.

3. From the average delta, among 1G/2M/4K, 4K gets the lowest
performance in all the 4 test machines, while 1G gets the best
performance on 2 test machines and 2M gets the best performance on the
other 2 machines.
"""

There is no way we can get a consistent result of 1% performance
improvement from experiments like those.

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/
On Mon, Nov 21, 2022 at 07:28:36PM -0700, Song Liu wrote:
> On Mon, Nov 21, 2022 at 1:12 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > Also, you mention your perf stats are run on a VM. I am curious what
> > things you need to get TLB to be properly measured on the VM, and if
> > this is really reliable data vs. bare metal. I haven't yet been successful
> > at getting perf stat for TLB to work on a VM, and based on what I've read
> > I have been cautious about the results.
>
> To make these perf counters work on a VM, we need a newer host kernel
> (my system is running a 5.6 based kernel, but I am not sure what is the
> minimum required version). Then we need to run qemu with the option
> "-cpu host" (both host and guest are x86_64).
>
> > So I am curious if you'd see something different on bare metal.
>
> Once the above is all worked out, a VM runs the same as bare metal from
> the perf counters' point of view.

TLBs behave differently because of EPT.
On Tue, Nov 22, 2022 at 10:06:06PM -0700, Song Liu wrote:
> On Tue, Nov 22, 2022 at 5:21 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > On Mon, Nov 21, 2022 at 07:28:36PM -0700, Song Liu wrote:
> > > On Mon, Nov 21, 2022 at 1:12 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > > >
[...]
> > > fixes a bug that splits the page table (from 2MB to 4kB) for the WHOLE kernel
> > > text. The bug stayed in the kernel for almost a year. None of the available
> > > open source benchmarks had caught it before this specific benchmark did.
> >
> > That doesn't mean enterprise-level testing would not have caught it, and
> > enterprise kernels run on ancient kernels, so they would not catch up that
> > fast. RHEL uses even more ancient kernels than SUSE, so let's consider
> > where SUSE was during this regression. The fix you mentioned,
> > 7af0145067bc, went upstream in v5.3-rc7~4^2, and that was in August 2019.
> > The bug was introduced through commit 585948f4f695 ("x86/mm/cpa: Avoid
> > the 4k pages check completely"), and that was in v4.20-rc1~159^2~41,
> > around September 2018. Around September 2018, the time the regression was
> > committed, the most bleeding edge Enterprise Linux kernel in the industry
> > was that on SLE15, and so v4.12, and so there is no way in hell the
> > performance team at SUSE, for instance, would have even come close to
> > evaluating code with that regression. In fact, they wouldn't come across
> > it in testing until SLE15-SP2 on the v5.3 kernel, but by then the
> > regression would have been fixed.
>
> Can you refer me to one enterprise performance report with an open source
> benchmark that shows a ~1% performance regression? If one is available, I
> am more than happy to try it out. Note that we need some BPF programs to
> show the benefit of this set. In most production hosts, network related
> BPF programs are the busiest; therefore, single host benchmarks will not
> show the benefit.
>
> Thanks,
> Song
>
> PS: The data in [1] is full of noise:
>
> """
> 2. For each benchmark/system combination, the 1G mapping had the highest
> performance for 45% of the tests, 2M for ~30%, and 4k for ~20%.
>
> 3. From the average delta, among 1G/2M/4K, 4K gets the lowest
> performance in all the 4 test machines, while 1G gets the best
> performance on 2 test machines and 2M gets the best performance on the
> other 2 machines.
> """

I don't think it's noise. IMO, this means that the performance degradation
caused by the fragmentation of the direct map highly depends on the
workload and the microarchitecture.

> There is no way we can get a consistent result of 1% performance
> improvement from experiments like those.

Experiments like those show how a change in kernel behaviour affects
different workloads, and not a single benchmark. Having a performance
improvement in a single benchmark does not necessarily mean other
benchmarks won't regress.

> [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/
On Mon, Nov 21, 2022 at 07:36:21PM -0700, Song Liu wrote:
> On Mon, Nov 21, 2022 at 1:20 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > On Mon, Nov 21, 2022 at 12:12:49PM -0800, Luis Chamberlain wrote:
> > > On Thu, Nov 17, 2022 at 12:23:16PM -0800, Song Liu wrote:
> > > > Based on our experiments [5], we measured ~0.6% performance improvement
> > > > from bpf_prog_pack. This patchset further boosts the improvement to ~0.8%.
> > >
> > > I'd prefer we leave out arbitrary performance data, as it does not help much.
> >
> > I'd like to clarify what I mean by this. As Linus has suggested before, you are
> > missing the opportunity to present "actual numbers on real loads. (not
> > some artificial benchmark)"
> >
> > [0] https://lkml.kernel.org/r/CAHk-=wiF1KnM1_paB3jCONR9Mh1D_RCsnXKBau1K7XLG-mwwTQ@mail.gmail.com
>
> Unless I made some serious mistakes, Linus' concern with performance was
> addressed by the exact performance results above. [5]
>
> [5] https://lore.kernel.org/bpf/20220707223546.4124919-1-song@kernel.org/

AFAICT no, that is all still artificial.

  Luis