Message ID | 20221031222541.1773452-2-song@kernel.org
---|---
State | Changes Requested
Delegated to | BPF
Series | vmalloc_exec for modules and BPF programs
On Mon, Oct 31, 2022 at 03:25:37PM -0700, Song Liu wrote:
> vmalloc_exec is used to allocate memory to host dynamic kernel text
> (modules, BPF programs, etc.) with huge pages. This is similar to the
> proposal by Peter in [1].

This is all great but we need to clarify *why* we would go through the
trouble. So if folks are not too excited about this series, that's
probably why. IMHO it lacks substance for rationale, **and** implies a few
gains without any *clear* performance metrics. I have 0 experience with
mm so I'd like others' feedback on this -- I'm just trying to decipher
the rationale from the prior "bpf prog pack" efforts.

I'm sensing that the cables are a bit crossed in the messaging here: we
need to paint a fuller picture of the rationale, and because that is
missing this work is being undersold. If my assessment is accurate
though, the bpf prog pack strategy of sharing huge pages may prove useful
long term for more than just modules / ftrace / kprobes.

I was surprised to see that this entire patch series, now upgraded from
RFC to proper PATCH form, completely fails to mention any of the original
motivations behind the "BPF prog pack", which you are making a truly
heroic effort to generalize, as the problem is hard. Let me try to help
with that. The rationale for the old BPF prog pack is documented as
follows:

 * Most BPF programs are pretty small. Allocating a hole page for each
 * program is sometime a waste. Many small bpf program also adds pressure
 * to instruction TLB. To solve this issue, we introduce a BPF program pack
 * allocator. The prog_pack allocator uses HPAGE_PMD_SIZE page (2MB on x86)
 * to host BPF programs.

Previously you have also stated in earlier versions of this patch set:

"Most BPF programs are small, but they consume a page each. For systems
with busy traffic and many BPF programs, this could also add significant
pressure to instruction TLB. High iTLB pressure usually causes slow down
for the whole system, which includes visible performance degradation for
production workloads."

So it is implied here that one of the benefits is to help reduce iTLB
misses, but that's it. We have no visible numbers to look at, and for
what... Reducing iTLB misses doesn't always correlate directly with
improving things, but if the code change is small enough it obviously
makes sense to apply. If the change is a bit more intrusive, as in this
patch series, a bit more rationale should be provided.

Other than the "performance aspects" of your patchset, the *main* reason
I am engaged and like it is that it reduces the nasty mess of semantics
for dealing with special permissions on pages, which we see in modules
and a few other places that today completely open code it. That proves
error prone and I'm glad to see efforts to generalize that nastiness. So
please ensure this is added as part of the documented rationale. Even if
the iTLB miss ratio improvement is not astronomical, I believe the gains
in sanity from improving the semantics of special pages and from sharing
code make it well worthwhile. The iTLB miss ratio improvement is just a
small cherry on top.

Going back to performance aspects, when Linus poked for more details
about this you elaborated further:

"we have seen direct map fragmentation causing visible
performance drop for our major services. This is the shadow
production benchmark, so it is not possible to run it out of
our data centers. Tracing showed that BPF program was the top
trigger of these direct map splits."
And the only other metric we have is:

"For our web service production benchmark, bpf_prog_pack on 4kB pages
gives 0.5% to 0.7% more throughput than not using bpf_prog_pack."

These metrics are completely arbitrary and opaque to us. We need
something tangible and reproducible, and I have been suggesting that
from early on...

I'm under the impression that the real missed, undocumented, major
value-add here is that the old "BPF prog pack" strategy helps to reduce
the direct map fragmentation caused by heavy use of eBPF JIT programs,
and this in turn helps your overall system performance (regardless of
what it is you do). As I see it then, the eBPF prog pack is just one
strategy to try to mitigate memory fragmentation on the direct map
caused by the eBPF JIT programs, so the "slow down" your team has
observed should be due to the eventual fragmentation caused on the
direct map *while* eBPF programs get heavily used.

Mike Rapoport presented on the direct map fragmentation problem at
Plumbers 2021 [0], and clearly mentioned modules / BPF / ftrace /
kprobes as possible sources of it. Then Xing Zhengjun's 2021 performance
evaluation of whether aggressively using 2M/1G pages for the kernel
direct map helps performance [1] ends up generally recommending huge
pages. The work by Xing though was about using huge pages *alone*, not
about using a strategy such as the "bpf prog pack" to share one 2 MiB
huge page for *all* small eBPF programs, and that I think is the real
golden nugget here.

I contend therefore that the theoretical reduction of iTLB misses from
using huge pages for the "bpf prog pack" is not what gets your systems
to perform better. It should simply be that it reduces fragmentation,
and *this* generally can help with performance long term. If this is
accurate then let's please separate the two aspects.

There are two aspects I would like to see, from a performance
perspective, actually mentioned in the commit logs:

1) iTLB miss ratio with "bpf prog pack" or this generalized solution
   vs. not using it at all:

   I'd tried the below but didn't see much difference:

   perf stat --repeat 10 \
       -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,iTLB-loads,iTLB-load-misses,page-faults \
       --pre 'modprobe -r test_bpf' -- modprobe test_bpf

   This is likely because the system I have might have a huge cache and
   I was not contending it with another heavy memory intensive workload.
   So it may be worth running something like the above *as* some other
   memory intensive benchmark kicks into gear in a loop. Another reason
   may be that test_bpf runs tests sequentially instead of in parallel,
   so it would be good to get eBPF folks' feedback on what else could be
   done to really eBPF JIT the hell out of a system. That is why I had
   suggested long ago maybe a custom new selftest which stresses only
   the eBPF JIT, and had given an example multithreaded selftest (see
   the sketch after this message).

   If we ever get modules support, I was hoping to see whether something
   like the below (if the fs module .text section does end up shared)
   *may* reduce the iTLB miss ratio:

   perf stat --repeat 10 \
       -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,iTLB-loads,iTLB-load-misses,page-faults \
       --pre 'make -s mrproper defconfig' -- make -s -j$(nproc) bzImage

   Maybe something as simple as systemd-analyze may show a time
   reduction if the iTLB miss ratio is reduced by sharing most module
   code in a huge page.
2) Estimate of the reduction in direct map fragmentation from using the
   "bpf prog pack" or this generalized solution:

   For this I'd expect a benchmark similar to the workload you guys run,
   or something memory intensive, while eBPF JITs are heavily used, and
   after a certain amount of time somehow compute how fragmented memory
   is. The only sensible thing I can think of to measure memory
   fragmentation is to look at the memory compaction index
   /sys/kernel/debug/extfrag/extfrag_index, but I highly welcome others'
   ideas as I'm a mm n00b.

[0] https://lpc.events/event/11/contributions/1127/attachments/922/1792/LPC21%20Direct%20map%20management%20.pdf
[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

> +static void move_vmap_to_free_text_tree(void *addr)
> +{
> +	struct vmap_area *va;
> +
> +	/* remove from vmap_area_root */
> +	spin_lock(&vmap_area_lock);
> +	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
> +	if (WARN_ON_ONCE(!va)) {

This seems hopeful, but we purposely are allowing the allocated memory
to be used elsewhere, are we not? Can you trigger the WARN_ON_ONCE()
with stress-ng as described in commit 82dd23e84be3e ("mm/vmalloc.c:
preload a CPU with one object for split purpose") while using the eBPF
JIT or loading/unloading a module in a loop?

> +void *vmalloc_exec(unsigned long size, unsigned long align)
> +{
> +	struct vmap_area *va, *tmp;
> +	unsigned long addr;
> +	enum fit_type type;
> +	int ret;
> +
> +	va = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, NUMA_NO_NODE);
> +	if (unlikely(!va))
> +		return NULL;
> +
> +again:
> +	preload_this_cpu_lock(&free_text_area_lock, GFP_KERNEL, NUMA_NO_NODE);
> +	tmp = find_vmap_lowest_match(&free_text_area_root, size, align, 1, false);
> +
> +	if (!tmp) {
> +		unsigned long alloc_size;
> +		void *ptr;
> +
> +		spin_unlock(&free_text_area_lock);
> +
> +		/*
> +		 * Not enough continuous space in free_text_area_root, try
> +		 * allocate more memory. The memory is first added to
> +		 * vmap_area_root, and then moved to free_text_area_root.
> +		 */
> +		alloc_size = roundup(size, PMD_SIZE * num_online_nodes());
> +		ptr = __vmalloc_node_range(alloc_size, PMD_SIZE, VMALLOC_EXEC_START,
> +					   VMALLOC_EXEC_END, GFP_KERNEL, PAGE_KERNEL,
> +					   VM_ALLOW_HUGE_VMAP | VM_NO_GUARD,
> +					   NUMA_NO_NODE, __builtin_return_address(0));

I thought Peter had suggested keeping the guard?

Once you get modules going you may want to update the comment about
modules on __vmalloc_node_range().

  Luis
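For reference, the kind of multithreaded eBPF JIT stress selftest suggested under point 1) could start from a sketch like the one below. This is only an illustration, not part of the series: it loads trivial two-instruction socket-filter programs through raw bpf(2) calls and keeps the fds open so many JITed programs stay resident at once. Thread and program counts are arbitrary, it typically needs root, and net.core.bpf_jit_enable=1 (on kernels without BPF_JIT_ALWAYS_ON) so every load actually reaches the JIT.

/*
 * Illustrative eBPF JIT stress tool: each thread loads NPROGS trivial
 * "return 0" socket filters and holds them open for a while.
 * Bump RLIMIT_NOFILE (ulimit -n) before raising the counts.
 */
#include <linux/bpf.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#define NTHREADS	8
#define NPROGS		100	/* per thread; 8 * 100 fds fits the default ulimit */

static int load_tiny_prog(void)
{
	struct bpf_insn insns[2] = {
		{ .code = 0xb7, .dst_reg = 0, .imm = 0 },	/* r0 = 0 */
		{ .code = 0x95 },				/* exit   */
	};
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
	attr.insn_cnt = 2;
	attr.insns = (__u64)(unsigned long)insns;
	attr.license = (__u64)(unsigned long)"GPL";
	return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}

static void *worker(void *arg)
{
	int fds[NPROGS], i;

	for (i = 0; i < NPROGS; i++)
		fds[i] = load_tiny_prog();	/* each successful load is JITed */
	sleep(10);				/* keep the programs resident while measuring */
	for (i = 0; i < NPROGS; i++)
		if (fds[i] >= 0)
			close(fds[i]);
	return NULL;
}

int main(void)
{
	pthread_t th[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&th[i], NULL, worker, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(th[i], NULL);
	return 0;
}

Run in a loop, in tandem with a memory-intensive benchmark and under the perf stat invocation above, this is roughly the experiment being asked for.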
Hi Luis, Thanks for looping me in. On Wed, Nov 02, 2022 at 04:41:59PM -0700, Luis Chamberlain wrote: > On Mon, Oct 31, 2022 at 03:25:37PM -0700, Song Liu wrote: > > vmalloc_exec is used to allocate memory to host dynamic kernel text > > (modules, BPF programs, etc.) with huge pages. This is similar to the > > proposal by Peter in [1]. > > This is allg reat but we need to clarify *why* we would go through the > trouble. So if folks are not to excited about this series, that's > probably why. IMHO it lacks substance for rationale, **and** implies a few > gains without any *clear* performance metrics. I have 0 experience with > mm so I'd like other's feedback on my this -- I'm just trying to do > decipher rationale from prior "bpf prog pack" efforts. > > I'm sensing that the cables in messaging are a bit crossed here and we need > to provide a bit better full picture for rationale and this is being > completely missed and this work is being undersold. If my assessment is > accurate though, the bpf prog pack strategy with sharing huge pages may prove > useful long term for other things than just modules / ftrace / kprobes. > > I was surprised to see this entire patch series upgrade from RFC to proper > PATCH form now completely fails to mention any of the original motivations > behind the "BPF prog pack", which you are doing a true heroic effort to try to > generalize as the problem is hard. Let me try to help with that. The rationale > for the old BPF prog pack is documented as follows: > > * Most BPF programs are pretty small. Allocating a hole page for each > * program is sometime a waste. Many small bpf program also adds pressure > * to instruction TLB. To solve this issue, we introduce a BPF program pack > * allocator. The prog_pack allocator uses HPAGE_PMD_SIZE page (2MB on x86) > * to host BPF programs. > > Previously you have also stated in earlier versions of this patch set: > > "Most BPF programs are small, but they consume a page each. For systems > with busy traffic and many BPF programs, this could also add significant > pressure to instruction TLB. High iTLB pressure usually causes slow down > for the whole system, which includes visible performance > degradation for production workloads." > > So it is implied here that one of the benefits is to help reduce iTLB misses. > But that's it. We have no visible numbers to look at and for what... But > reducing iTLB misses doesn't always have a complete direct correlation > with improving things, but if the code change is small enough it obviously > makes sense to apply. If the change is a bit more intrusive, as in this > patch series a bit more rationale should be provided. > > Other than the "performance aspects" of your patchset, the *main* reason > I am engaged and like it is it reduces the nasty mess of semantics on > dealing with special permissions on pages which we see in modules and a > few other places which today completely open code it. That proves error > prone and I'm glad to see efforts to generalize that nastiness. So > please ensure this is added as part of the documented rationale. Even > if the iTLB miss ratio improvement is not astronomical I believe that > the gains in sanity on improving semantics on special pages and sharing code > make it well worthwhile. The iTLB miss ratio improvement is just a small > cherry on top. 
> > Going back to performance aspects, when Linus had poked for more details > about this your have elaborated further: > > "we have seen direct map fragmentation causing visible > performance drop for our major services. This is the shadow > production benchmark, so it is not possible to run it out of > our data centers. Tracing showed that BPF program was the top > trigger of these direct map splits." > > And the only other metric we have is: > > "For our web service production benchmark, bpf_prog_pack on 4kB pages > gives 0.5% to 0.7% more throughput than not using bpf_prog_pack." > > These metrics are completely arbitrary and opaque to us. We need > something tangible and reproducible and I have been suggesting that > from early on... > > I'm under the impression that the real missed, undocumented, major value-add > here is that the old "BPF prog pack" strategy helps to reduce the direct map > fragmentation caused by heavy use of the eBPF JIT programs and this in > turn helps your overall random system performance (regardless of what > it is you do). As I see it then the eBPF prog pack is just one strategy to > try to mitigate memory fragmentation on the direct map caused by the the eBPF > JIT programs, so the "slow down" your team has obvserved should be due to the > eventual fragmentation caused on the direct map *while* eBPF programs > get heavily used. I believe that while the eBPF prog pack is helpful in mitigation of the direct map fragmentation caused by the eBPF JIT programs, the same strategy of allocating a large page, splitting its PMD entry and then reusing the memory for smaller allocations can be (and should be) generalized to other use cases that require non-default permissions in the page table. Most prominent use cases are those that allocate memory for code, but the same approach is relevant for other cases, like secretmem or page table protection with PKS. A while ago I've suggested to handle such caching of large pages at the page allocator level, but when we discussed it at LSF/MM/BPF, prevailing opinion was that added value does not justify changes to the page allocator and it was suggested to handle such caching elsewhere. I had to put this project on a backburner for $VARIOUS_REASONS, but I still think that we need a generic allocator for memory with non-default permissions in the direct map and that code allocation should build on that allocator. All that said, the direct map fragmentation problem is currently relevant only to x86 because it's the only architecture that supports splitting of the large pages in the direct map. > Mike Rapoport had presented about the Direct map fragmentation problem > at Plumbers 2021 [0], and clearly mentioned modules / BPF / ftrace / > kprobes as possible sources for this. Then Xing Zhengjun's 2021 performance > evaluation on whether using 2M/1G pages aggressively for the kernel direct map > help performance [1] ends up generally recommending huge pages. The work by Xing > though was about using huge pages *alone*, not using a strategy such as in the > "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF programs, > and that I think is the real golden nugget here. > > I contend therefore that the theoretical reduction of iTLB misses by using > huge pages for "bpf prog pack" is not what gets your systems to perform > somehow better. It should be simply that it reduces fragmentation and > *this* generally can help with performance long term. If this is accurate > then let's please separate the two aspects to this. 
The direct map fragmentation is the reason for higher TLB miss rate, both for iTLB and dTLB. Whenever a large page in the direct map is split, all kernel accesses via the direct map will use small pages which requires dealing with 512 page table entries instead of one for 2M range. Since small pages in the direct map are never collapsed back to large pages, long living system that heavily uses eBPF programs will have its direct map severely fragmented, higher TLB miss rate and worse overall performance. > There's two aspects to what I would like to see from a performance > perspective then actually mentioned in the commit logs: > > 1) iTLB miss loss ratio with "bpf prog pack" or this generalized solution > Vs not using it at all: ... > 2) Estimate in reduction on direct map fragmentation by using the "bpf > prog pack" or this generalized solution: > > For this I'd expect a benchmark similar to the workload you guys > run or something memory intensive, as eBPF JITs are heavily used, > and after a certain amount of time somehow compute how fragmented > memory is. The only sensible thing I can think to measure memory > fragmentation is to look at the memory compaction index > /sys/kernel/debug/extfrag/extfrag_index , but I highly welcome other's > ideas as I'm a mm n00b. The direct map fragmentation can be tracked with grep DirectMap /proc/meminfo grep direct_map /proc/vmstat and by looking at /sys/kernel/debug/page_tables/kernel > [0] https://lpc.events/event/11/contributions/1127/attachments/922/1792/LPC21%20Direct%20map%20management%20.pdf > [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/ > > Luis
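As a reference for turning those counters into a before/after comparison, a small harness along these lines (a sketch only; the workload command is whatever you want to stress, e.g. a BPF-heavy test) prints the DirectMap* lines of /proc/meminfo around an arbitrary command, so splits caused by that workload show up as a delta:

/* directmap-delta: show DirectMap* lines before and after running a command. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static void show_directmap(const char *tag)
{
	FILE *f = fopen("/proc/meminfo", "r");
	char line[256];

	if (!f)
		return;
	printf("--- %s ---\n", tag);
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "DirectMap", 9))
			fputs(line, stdout);
	fclose(f);
}

int main(int argc, char **argv)
{
	pid_t pid;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
		return 1;
	}
	show_directmap("before");
	pid = fork();
	if (pid == 0) {
		execvp(argv[1], &argv[1]);
		_exit(127);
	}
	if (pid > 0)
		waitpid(pid, NULL, 0);
	show_directmap("after");
	return 0;
}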
On Thu, Nov 03, 2022 at 05:51:57PM +0200, Mike Rapoport wrote: > Hi Luis, > > Thanks for looping me in. > > On Wed, Nov 02, 2022 at 04:41:59PM -0700, Luis Chamberlain wrote: > > On Mon, Oct 31, 2022 at 03:25:37PM -0700, Song Liu wrote: > > > vmalloc_exec is used to allocate memory to host dynamic kernel text > > > (modules, BPF programs, etc.) with huge pages. This is similar to the > > > proposal by Peter in [1]. > > > > This is allg reat but we need to clarify *why* we would go through the > > trouble. So if folks are not to excited about this series, that's > > probably why. IMHO it lacks substance for rationale, **and** implies a few > > gains without any *clear* performance metrics. I have 0 experience with > > mm so I'd like other's feedback on my this -- I'm just trying to do > > decipher rationale from prior "bpf prog pack" efforts. > > > > I'm sensing that the cables in messaging are a bit crossed here and we need > > to provide a bit better full picture for rationale and this is being > > completely missed and this work is being undersold. If my assessment is > > accurate though, the bpf prog pack strategy with sharing huge pages may prove > > useful long term for other things than just modules / ftrace / kprobes. > > > > I was surprised to see this entire patch series upgrade from RFC to proper > > PATCH form now completely fails to mention any of the original motivations > > behind the "BPF prog pack", which you are doing a true heroic effort to try to > > generalize as the problem is hard. Let me try to help with that. The rationale > > for the old BPF prog pack is documented as follows: > > > > * Most BPF programs are pretty small. Allocating a hole page for each > > * program is sometime a waste. Many small bpf program also adds pressure > > * to instruction TLB. To solve this issue, we introduce a BPF program pack > > * allocator. The prog_pack allocator uses HPAGE_PMD_SIZE page (2MB on x86) > > * to host BPF programs. > > > > Previously you have also stated in earlier versions of this patch set: > > > > "Most BPF programs are small, but they consume a page each. For systems > > with busy traffic and many BPF programs, this could also add significant > > pressure to instruction TLB. High iTLB pressure usually causes slow down > > for the whole system, which includes visible performance > > degradation for production workloads." > > > > So it is implied here that one of the benefits is to help reduce iTLB misses. > > But that's it. We have no visible numbers to look at and for what... But > > reducing iTLB misses doesn't always have a complete direct correlation > > with improving things, but if the code change is small enough it obviously > > makes sense to apply. If the change is a bit more intrusive, as in this > > patch series a bit more rationale should be provided. > > > > Other than the "performance aspects" of your patchset, the *main* reason > > I am engaged and like it is it reduces the nasty mess of semantics on > > dealing with special permissions on pages which we see in modules and a > > few other places which today completely open code it. That proves error > > prone and I'm glad to see efforts to generalize that nastiness. So > > please ensure this is added as part of the documented rationale. Even > > if the iTLB miss ratio improvement is not astronomical I believe that > > the gains in sanity on improving semantics on special pages and sharing code > > make it well worthwhile. The iTLB miss ratio improvement is just a small > > cherry on top. 
> > > > Going back to performance aspects, when Linus had poked for more details > > about this your have elaborated further: > > > > "we have seen direct map fragmentation causing visible > > performance drop for our major services. This is the shadow > > production benchmark, so it is not possible to run it out of > > our data centers. Tracing showed that BPF program was the top > > trigger of these direct map splits." > > > > And the only other metric we have is: > > > > "For our web service production benchmark, bpf_prog_pack on 4kB pages > > gives 0.5% to 0.7% more throughput than not using bpf_prog_pack." > > > > These metrics are completely arbitrary and opaque to us. We need > > something tangible and reproducible and I have been suggesting that > > from early on... > > > > I'm under the impression that the real missed, undocumented, major value-add > > here is that the old "BPF prog pack" strategy helps to reduce the direct map > > fragmentation caused by heavy use of the eBPF JIT programs and this in > > turn helps your overall random system performance (regardless of what > > it is you do). As I see it then the eBPF prog pack is just one strategy to > > try to mitigate memory fragmentation on the direct map caused by the the eBPF > > JIT programs, so the "slow down" your team has obvserved should be due to the > > eventual fragmentation caused on the direct map *while* eBPF programs > > get heavily used. > > I believe that while the eBPF prog pack is helpful in mitigation of the > direct map fragmentation caused by the eBPF JIT programs, the same strategy > of allocating a large page, splitting its PMD entry and then reusing the > memory for smaller allocations can be (and should be) generalized to other > use cases that require non-default permissions in the page table. Most > prominent use cases are those that allocate memory for code, but the same > approach is relevant for other cases, like secretmem or page table > protection with PKS. > > A while ago I've suggested to handle such caching of large pages at the > page allocator level, but when we discussed it at LSF/MM/BPF, prevailing > opinion was that added value does not justify changes to the page > allocator and it was suggested to handle such caching elsewhere. I saw that on the lwn coverage. > I had to put this project on a backburner for $VARIOUS_REASONS, but I still > think that we need a generic allocator for memory with non-default > permissions in the direct map and that code allocation should build on that > allocator. It seems this generalization of the bpf prog pack to possibly be used for modules / kprobes / ftrace is a small step in that direction. > All that said, the direct map fragmentation problem is currently relevant > only to x86 because it's the only architecture that supports splitting of > the large pages in the direct map. I was thinking even more long term too, using this as a proof of concept. If this practice in general helps with fragmentation, could it be used for experimetnation with compound pages later, as a way to reduce possible fragmentation. > > Mike Rapoport had presented about the Direct map fragmentation problem > > at Plumbers 2021 [0], and clearly mentioned modules / BPF / ftrace / > > kprobes as possible sources for this. Then Xing Zhengjun's 2021 performance > > evaluation on whether using 2M/1G pages aggressively for the kernel direct map > > help performance [1] ends up generally recommending huge pages. 
The work by Xing > > though was about using huge pages *alone*, not using a strategy such as in the > > "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF programs, > > and that I think is the real golden nugget here. > > > > I contend therefore that the theoretical reduction of iTLB misses by using > > huge pages for "bpf prog pack" is not what gets your systems to perform > > somehow better. It should be simply that it reduces fragmentation and > > *this* generally can help with performance long term. If this is accurate > > then let's please separate the two aspects to this. > > The direct map fragmentation is the reason for higher TLB miss rate, both > for iTLB and dTLB. OK so then whatever benchmark is running in tandem as eBPF JIT is hammered should *also* be measured with perf for iTLB and dTLB. ie, the patch can provide such results as a justifications. > Whenever a large page in the direct map is split, all > kernel accesses via the direct map will use small pages which requires > dealing with 512 page table entries instead of one for 2M range. > > Since small pages in the direct map are never collapsed back to large > pages, long living system that heavily uses eBPF programs will have its > direct map severely fragmented, higher TLB miss rate and worse overall > performance. Shouldn't compaction help with those situations? > > There's two aspects to what I would like to see from a performance > > perspective then actually mentioned in the commit logs: > > > > 1) iTLB miss loss ratio with "bpf prog pack" or this generalized solution > > Vs not using it at all: > > ... > > > 2) Estimate in reduction on direct map fragmentation by using the "bpf > > prog pack" or this generalized solution: > > > > For this I'd expect a benchmark similar to the workload you guys > > run or something memory intensive, as eBPF JITs are heavily used, > > and after a certain amount of time somehow compute how fragmented > > memory is. The only sensible thing I can think to measure memory > > fragmentation is to look at the memory compaction index > > /sys/kernel/debug/extfrag/extfrag_index , but I highly welcome other's > > ideas as I'm a mm n00b. > > The direct map fragmentation can be tracked with > > grep DirectMap /proc/meminfo > grep direct_map /proc/vmstat > > and by looking at /sys/kernel/debug/page_tables/kernel Thanks! Luis
On Thu, 2022-11-03 at 11:59 -0700, Luis Chamberlain wrote:
> > > Mike Rapoport had presented about the Direct map fragmentation problem
> > > at Plumbers 2021 [0], and clearly mentioned modules / BPF / ftrace /
> > > kprobes as possible sources for this. Then Xing Zhengjun's 2021
> > > performance evaluation on whether using 2M/1G pages aggressively for
> > > the kernel direct map help performance [1] ends up generally
> > > recommending huge pages. The work by Xing though was about using huge
> > > pages *alone*, not using a strategy such as in the "bpf prog pack" to
> > > share one 2 MiB huge page for *all* small eBPF programs, and that I
> > > think is the real golden nugget here.
> > >
> > > I contend therefore that the theoretical reduction of iTLB misses by
> > > using huge pages for "bpf prog pack" is not what gets your systems to
> > > perform somehow better. It should be simply that it reduces
> > > fragmentation and *this* generally can help with performance long
> > > term. If this is accurate then let's please separate the two aspects
> > > to this.
> >
> > The direct map fragmentation is the reason for higher TLB miss rate,
> > both for iTLB and dTLB.
>
> OK so then whatever benchmark is running in tandem as eBPF JIT is
> hammered should *also* be measured with perf for iTLB and dTLB. ie, the
> patch can provide such results as a justifications.

Song had done some tests on the old prog pack version that to me seemed
to indicate most (or possibly all) of the benefit was direct map
fragmentation reduction. This surprised me, since 2MB kernel text has
been shown to be beneficial.

Otherwise +1 to all these comments. This should be clear about what the
benefits are. I would add that this is also much nicer about TLB
shootdowns than the existing way of loading text, and saves some memory.

So I think there are sort of four areas of improvement:

1. Direct map fragmentation reduction (dTLB miss improvements). This
   series sort of does it as a side effect, and the solution Mike is
   talking about is a more general, probably better one.
2. 2MB mapped JITs. This is the iTLB side. I think this is a decent
   solution for this, but surprisingly it doesn't seem to be useful for
   JITs. (modules testing TBD)
3. Loading text into a reused allocation with per-cpu mappings. This
   reduces TLB shootdowns, which are a short term load and teardown time
   performance drag. My understanding is this is more of a problem on
   bigger systems with many CPUs. This series does a decent job at this,
   but the solution is not compatible with modules. Maybe ok since
   modules don't load as often as JITs.
4. Having BPF progs share pages. This saves memory. This series could
   probably easily get a number for how much.
On Thu, Nov 3, 2022 at 2:19 PM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote: > > On Thu, 2022-11-03 at 11:59 -0700, Luis Chamberlain wrote: > > > > Mike Rapoport had presented about the Direct map fragmentation > > > > problem > > > > at Plumbers 2021 [0], and clearly mentioned modules / BPF / > > > > ftrace / > > > > kprobes as possible sources for this. Then Xing Zhengjun's 2021 > > > > performance > > > > evaluation on whether using 2M/1G pages aggressively for the > > > > kernel direct map > > > > help performance [1] ends up generally recommending huge pages. > > > > The work by Xing > > > > though was about using huge pages *alone*, not using a strategy > > > > such as in the > > > > "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF > > > > programs, > > > > and that I think is the real golden nugget here. > > > > > > > > I contend therefore that the theoretical reduction of iTLB misses > > > > by using > > > > huge pages for "bpf prog pack" is not what gets your systems to > > > > perform > > > > somehow better. It should be simply that it reduces fragmentation > > > > and > > > > *this* generally can help with performance long term. If this is > > > > accurate > > > > then let's please separate the two aspects to this. > > > > > > The direct map fragmentation is the reason for higher TLB miss > > > rate, both > > > for iTLB and dTLB. > > > > OK so then whatever benchmark is running in tandem as eBPF JIT is > > hammered > > should *also* be measured with perf for iTLB and dTLB. ie, the patch > > can > > provide such results as a justifications. > > Song had done some tests on the old prog pack version that to me seemed > to indicate most (or possibly all) of the benefit was direct map > fragmentation reduction. This was surprised me, since 2MB kernel text > has shown to be beneficial. > > Otherwise +1 to all these comments. This should be clear about what the > benefits are. I would add, that this is also much nicer about TLB > shootdowns than the existing way of loading text and saves some memory. > > So I think there are sort of four areas of improvements: > 1. Direct map fragmentation reduction (dTLB miss improvements). This > sort of does it as a side effect in this series, and the solution Mike > is talking about is a more general, probably better one. > 2. 2MB mapped JITs. This is the iTLB side. I think this is a decent > solution for this, but surprisingly it doesn't seem to be useful for > JITs. (modules testing TBD) > 3. Loading text to reused allocation with per-cpu mappings. This > reduces TLB shootdowns, which are a short term load and teardown time > performance drag. My understanding is this is more of a problem on > bigger systems with many CPUs. This series does a decent job at this, > but the solution is not compatible with modules. Maybe ok since modules > don't load as often as JITs. > 4. Having BPF progs share pages. This saves memory. This series could > probably easily get a number for how much. > Hi Luis, Rick, and Mike, Thanks a lot for helping me organize this information. Totally agree with all these comments. I will add more data to the next revision. Besides the motivation improvement, could you please also share your comments on: 1. The logic/design of the vmalloc_exec() et. al. APIs; 2. The naming of these functions. Does execmem_[alloc|free|fill|cpy] (as suggested by Chritoph) sound good? Thanks, Song
On Thu, Nov 03, 2022 at 02:41:42PM -0700, Song Liu wrote:
> On Thu, Nov 3, 2022 at 2:19 PM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> Besides the motivation improvement, could you please also share your
> comments on:
> 1. The logic/design of the vmalloc_exec() et. al. APIs;

I've provided the feedback that I can so far, as I'm new to mm. The best
I can do is provide a better rationale so that mm folks can understand
the motivation.

> 2. The naming of these functions. Does execmem_[alloc|free|fill|cpy]
> (as suggested by Chritoph) sound good?

That seems sensible. There may be other users later, a secmm_alloc() for
instance, so we just gotta keep that in mind.

  Luis
On Thu, Nov 03, 2022 at 09:19:25PM +0000, Edgecombe, Rick P wrote: > On Thu, 2022-11-03 at 11:59 -0700, Luis Chamberlain wrote: > > > > Mike Rapoport had presented about the Direct map fragmentation > > > > problem > > > > at Plumbers 2021 [0], and clearly mentioned modules / BPF / > > > > ftrace / > > > > kprobes as possible sources for this. Then Xing Zhengjun's 2021 > > > > performance > > > > evaluation on whether using 2M/1G pages aggressively for the > > > > kernel direct map > > > > help performance [1] ends up generally recommending huge pages. > > > > The work by Xing > > > > though was about using huge pages *alone*, not using a strategy > > > > such as in the > > > > "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF > > > > programs, > > > > and that I think is the real golden nugget here. > > > > > > > > I contend therefore that the theoretical reduction of iTLB misses > > > > by using > > > > huge pages for "bpf prog pack" is not what gets your systems to > > > > perform > > > > somehow better. It should be simply that it reduces fragmentation > > > > and > > > > *this* generally can help with performance long term. If this is > > > > accurate > > > > then let's please separate the two aspects to this. > > > > > > The direct map fragmentation is the reason for higher TLB miss > > > rate, both > > > for iTLB and dTLB. > > > > OK so then whatever benchmark is running in tandem as eBPF JIT is > > hammered > > should *also* be measured with perf for iTLB and dTLB. ie, the patch > > can > > provide such results as a justifications. > > Song had done some tests on the old prog pack version that to me seemed > to indicate most (or possibly all) of the benefit was direct map > fragmentation reduction. Matches my observations but I also provided quite a bit of hints as to *why* I think that is. I suggested lib/test_kmod.c as an example beefy multithreaded selftests which really kicks the hell out of the kernel with whatever crap you want to run. That is precicely how I uncovered some odd kmod bug lingering for years. > This was surprised me, since 2MB kernel text > has shown to be beneficial. > > Otherwise +1 to all these comments. This should be clear about what the > benefits are. I would add, that this is also much nicer about TLB > shootdowns than the existing way of loading text and saves some memory. > > So I think there are sort of four areas of improvements: > 1. Direct map fragmentation reduction (dTLB miss improvements). The dTLB gains should be on the benchmark which runs in tandem to the ebpf-JIT-monster-selftest, not on the ebpf-JIT-monster-selftest, right? > This > sort of does it as a side effect in this series, and the solution Mike > is talking about is a more general, probably better one. > 2. 2MB mapped JITs. This is the iTLB side. I think this is a decent > solution for this, but surprisingly it doesn't seem to be useful for > JITs. (modules testing TBD) Yes I'm super eager to get this tested. In fact I wonder if one can boot Linux with less memory too... > 3. Loading text to reused allocation with per-cpu mappings. This > reduces TLB shootdowns, which are a short term load and teardown time > performance drag. My understanding is this is more of a problem on > bigger systems with many CPUs. This series does a decent job at this, > but the solution is not compatible with modules. Maybe ok since modules > don't load as often as JITs. There are some tests like fstests which make heavy use of module removal. 
But as a side effect, indeed I like to reboot to have a fresh system
before running fstests. I guess fstests should also be run with heavily
fragmented memory, as a corner case.

> 4. Having BPF progs share pages. This saves memory. This series could
> probably easily get a number for how much.

Once that does hit modules / kprobes / ftrace, the impact is much much
greater obviously.

  Luis
On Thu, Nov 03, 2022 at 05:18:51PM -0700, Luis Chamberlain wrote: > On Thu, Nov 03, 2022 at 09:19:25PM +0000, Edgecombe, Rick P wrote: > > On Thu, 2022-11-03 at 11:59 -0700, Luis Chamberlain wrote: > > > > > Mike Rapoport had presented about the Direct map fragmentation > > > > > problem > > > > > at Plumbers 2021 [0], and clearly mentioned modules / BPF / > > > > > ftrace / > > > > > kprobes as possible sources for this. Then Xing Zhengjun's 2021 > > > > > performance > > > > > evaluation on whether using 2M/1G pages aggressively for the > > > > > kernel direct map > > > > > help performance [1] ends up generally recommending huge pages. > > > > > The work by Xing > > > > > though was about using huge pages *alone*, not using a strategy > > > > > such as in the > > > > > "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF > > > > > programs, > > > > > and that I think is the real golden nugget here. > > > > > > > > > > I contend therefore that the theoretical reduction of iTLB misses > > > > > by using > > > > > huge pages for "bpf prog pack" is not what gets your systems to > > > > > perform > > > > > somehow better. It should be simply that it reduces fragmentation > > > > > and > > > > > *this* generally can help with performance long term. If this is > > > > > accurate > > > > > then let's please separate the two aspects to this. > > > > > > > > The direct map fragmentation is the reason for higher TLB miss > > > > rate, both > > > > for iTLB and dTLB. > > > > > > OK so then whatever benchmark is running in tandem as eBPF JIT is > > > hammered > > > should *also* be measured with perf for iTLB and dTLB. ie, the patch > > > can > > > provide such results as a justifications. > > > > Song had done some tests on the old prog pack version that to me seemed > > to indicate most (or possibly all) of the benefit was direct map > > fragmentation reduction. > > Matches my observations but I also provided quite a bit of hints as to > *why* I think that is. I suggested lib/test_kmod.c as an example beefy > multithreaded selftests which really kicks the hell out of the kernel > with whatever crap you want to run. That is precicely how I uncovered > some odd kmod bug lingering for years. *and*, *perhaps*... it may be that you need another memory intensive benchmark to run in tandem, one which mimics the behaviour of the internal "shadow production benchmark", whatever that is. Luis
Hello, On Wed, Nov 02, 2022 at 04:41:59PM -0700, Luis Chamberlain wrote: ... ... > I'm under the impression that the real missed, undocumented, major value-add > here is that the old "BPF prog pack" strategy helps to reduce the direct map > fragmentation caused by heavy use of the eBPF JIT programs and this in > turn helps your overall random system performance (regardless of what > it is you do). As I see it then the eBPF prog pack is just one strategy to > try to mitigate memory fragmentation on the direct map caused by the the eBPF > JIT programs, so the "slow down" your team has obvserved should be due to the > eventual fragmentation caused on the direct map *while* eBPF programs > get heavily used. > > Mike Rapoport had presented about the Direct map fragmentation problem > at Plumbers 2021 [0], and clearly mentioned modules / BPF / ftrace / > kprobes as possible sources for this. Then Xing Zhengjun's 2021 performance > evaluation on whether using 2M/1G pages aggressively for the kernel direct map > help performance [1] ends up generally recommending huge pages. The work by Xing > though was about using huge pages *alone*, not using a strategy such as in the > "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF programs, > and that I think is the real golden nugget here. I'm interested in how this patchset (further) improves direct map fragmentation so would like to evaluate it to see if my previous work to merge small mappings back in architecture layer[1] is still necessary. I tried to apply this patchset on v6.1-rc3/2/1 and v6.0 but all failed, so I took one step back and evaluated the existing bpf_prog_pack. I'm aware of this patchset would make things even better by using order-9 page to backup the vmalloced range. I used the sample bpf prog: sample/bpf/sockex1 because it looks easy to run, feel free to let me know a better way to evaluate this. - In kernels before bpf_prog_pack(v5.17 and earlier), this prog would cause 3 pages to change protection to RO+X from RW+NX; And if the three pages are far apart, each would cause a level 3 split then a level 2 split; Reality is, allocated pages tend to stay close physically so actual result will not be this bad. - After bpf_prog_pack, the load of this prog will most likely requires no new page protection change as long as the existing pack pool has space for it(the common case). The actual space required for this bpf prog that needs special protection is: 64 * 2 + 192 bytes, it would need 6144 such progs to use up the cache and trigger another 2MB alloc which can potentially cause a direct map split. 6144 seems to be pretty large number to me so I think the direct map split due to bpf is greatly improved (if not totally solved). Here are test results on a 8G x86_64 VM. (on x86_64, 4k is PTE mapping, 2M is PMD mapping and 1G is PUD mapping) - v5.17 1) right after boot $ grep Direct /proc/meminfo DirectMap4k: 87900 kB DirectMap2M: 5154816 kB DirectMap1G: 5242880 kB 2) after running 512 sockex1 instances concurrently $ grep Direct /proc/meminfo DirectMap4k: 462684 kB DirectMap2M: 4780032 kB DirectMap1G: 5242880 kB PUD mapping survived, some PMD mappings are splitted. 3) after running 1024 sockex1 instances concurrently $ grep Direct /proc/meminfo DirectMap4k: 884572 kB DirectMap2M: 6455296 kB DirectMap1G: 3145728 kB 2 PUD mappings and some PMD mappings splitted. 
4) after run 2048 sockex1 instances concurrently $ grep Direct /proc/meminfo DirectMap4k: 1654620 kB DirectMap2M: 6733824 kB DirectMap1G: 2097152 kB Another PUD mapping and some PMD mappings spliited. At the end, 2 PUD mappings survived. - v6.1-rc3 The direct map number doesn't change for "right after boot", "after running 512/1024/2048" sockex1 instances. I also tried running 5120 instances but it would consume all available memory when it runs about 4000 instances and even then, the direct map number doesn't change, i.e. not a single split happened. This is understandable because as I calculated above, it would need 6144 such progs to cause another alloc/split. Here is its number: $ grep Direct /proc/meminfo DirectMap4k: 22364 kB DirectMap2M: 3123200 kB DirectMap1G: 7340032 kB Consider that production system will have memory mostly consumed and when CPU allocates pages, the page can be far apart than systems right after boot, causing more mapping split, so I also tested to firstly consume all memory to page cache by reading some sparse files and then run this test. I expect this time v5.17 would become worse. Here it is: - v5.17 1) right after boot $ grep Direct /proc/meminfo DirectMap4k: 94044 kB DirectMap2M: 4100096 kB DirectMap1G: 6291456 kB More mappings are in PUD for this boot. 2) after run 512 sockex1 instances concurrently $ grep Direct /proc/meminfo DirectMap4k: 538460 kB DirectMap2M: 7849984 kB DirectMap1G: 2097152 kB 4 PUD mappings and some PMD mappings are splitted this time, more than last time. 3) after run 1024 sockex1 instances concurrently $ grep Direct /proc/meminfo DirectMap4k: 1083228 kB DirectMap2M: 7305216 kB DirectMap1G: 2097152 kB Some PMD mappings split. 4) after running 2048 sockex1 instances concurrently $ grep Direct /proc/meminfo DirectMap4k: 2340700 kB DirectMap2M: 6047744 kB DirectMap1G: 2097152 kB The end result is about the same as before. - v6.1-rc3 There is no difference because I can't trigger another pack alloc before system is OOMed. Conclusion: I think bpf_prog_pack is very good at reducing direct map fragmentation and this patchset can further improve this situation on large machines(with huge amount of memory) or with more large bpf progs loaded etc. Some imperfect things I can think of are(not related to this patchset): 1 Once a split happened, it remains happened. This may not be a big deal now with bpf_prog_pack and this patchset because the need to allocate a new order-9 page and thus cause a potential split should happen much much less; 2 When a new order-9 page has to be allocated, there is no way to tell the allocator to allocate this order-9 page from an already splitted PUD range to avoid another PUD mapping split; 3 As Mike and others have mentioned, there are other users that can also cause direct map split. [1]: https://lore.kernel.org/lkml/20220808145649.2261258-1-aaron.lu@intel.com/ Regards, Aaron
On Thu, Nov 03, 2022 at 11:59:48AM -0700, Luis Chamberlain wrote: > On Thu, Nov 03, 2022 at 05:51:57PM +0200, Mike Rapoport wrote: > > > I had to put this project on a backburner for $VARIOUS_REASONS, but I still > > think that we need a generic allocator for memory with non-default > > permissions in the direct map and that code allocation should build on that > > allocator. > > It seems this generalization of the bpf prog pack to possibly be used > for modules / kprobes / ftrace is a small step in that direction. > > > All that said, the direct map fragmentation problem is currently relevant > > only to x86 because it's the only architecture that supports splitting of > > the large pages in the direct map. > > I was thinking even more long term too, using this as a proof of concept. If > this practice in general helps with fragmentation, could it be used for > experimetnation with compound pages later, as a way to reduce possible > fragmentation. As Rick already mentioned, these patches help with the direct map fragmentation only indirectly. With these patches memory is freed in PMD_SIZE chunks and this makes the changes to the direct map in vm_remove_mappings() to happen in in PMD_SIZE units and this is pretty much the only effect of this series on the direct map layout. A bit unrelated, but I'm wondering now if we want to have the direct map alias of the pages allocated for code also to be read-only... > > Whenever a large page in the direct map is split, all > > kernel accesses via the direct map will use small pages which requires > > dealing with 512 page table entries instead of one for 2M range. > > > > Since small pages in the direct map are never collapsed back to large > > pages, long living system that heavily uses eBPF programs will have its > > direct map severely fragmented, higher TLB miss rate and worse overall > > performance. > > Shouldn't compaction help with those situations? Compaction helps to reduce fragmentation of the physical memory, it tries to bring free physical pages next to each other to create large contiguous chunks, but it does not change the virtual addresses the users of the underlying data see. Changing permissions of a small page in the direct map causes "discontinuity" in the virtual space. E.g. if we have 2M mapped RW with a single PMD changing several page in the middle of those 2M to R+X will require to remap that range with 512 PTEs. > Thanks! > > Luis
On Mon, Nov 07, 2022 at 08:58:17AM +0200, Mike Rapoport wrote: > On Thu, Nov 03, 2022 at 11:59:48AM -0700, Luis Chamberlain wrote: > > On Thu, Nov 03, 2022 at 05:51:57PM +0200, Mike Rapoport wrote: > > > > > I had to put this project on a backburner for $VARIOUS_REASONS, but I still > > > think that we need a generic allocator for memory with non-default > > > permissions in the direct map and that code allocation should build on that > > > allocator. > > > > It seems this generalization of the bpf prog pack to possibly be used > > for modules / kprobes / ftrace is a small step in that direction. > > > > > All that said, the direct map fragmentation problem is currently relevant > > > only to x86 because it's the only architecture that supports splitting of > > > the large pages in the direct map. > > > > I was thinking even more long term too, using this as a proof of concept. If > > this practice in general helps with fragmentation, could it be used for > > experimetnation with compound pages later, as a way to reduce possible > > fragmentation. > > As Rick already mentioned, these patches help with the direct map > fragmentation only indirectly. With these patches memory is freed in > PMD_SIZE chunks and this makes the changes to the direct map in > vm_remove_mappings() to happen in in PMD_SIZE units and this is pretty much > the only effect of this series on the direct map layout. I understand that is what *this* series does. I was wondering is similar scheme may be useful to study to see if it helps with aggregating say something like 32 x 64 kb for compound page allocations of order 16 (64 Kib) to see if it may help with possible fragmentation concerns for that world where that may be useful in the future (completely unrelated to page permissions). > A bit unrelated, but I'm wondering now if we want to have the direct map > alias of the pages allocated for code also to be read-only... > > > > Whenever a large page in the direct map is split, all > > > kernel accesses via the direct map will use small pages which requires > > > dealing with 512 page table entries instead of one for 2M range. > > > > > > Since small pages in the direct map are never collapsed back to large > > > pages, long living system that heavily uses eBPF programs will have its > > > direct map severely fragmented, higher TLB miss rate and worse overall > > > performance. > > > > Shouldn't compaction help with those situations? > > Compaction helps to reduce fragmentation of the physical memory, it tries > to bring free physical pages next to each other to create large contiguous > chunks, but it does not change the virtual addresses the users of the > underlying data see. Sorry I understood that 'bpf prog pack' only only used *one* 2 MiB huge page for *all* eBPF JIT programs, and so the fragmentation issue prior to 'bpf prog pack' I thought was the fact that as eBPF programs are still alive, we have fragmentation. In the world without 'bpf prog pack' I thought no huge pages were used. > Changing permissions of a small page in the direct map causes > "discontinuity" in the virtual space. E.g. if we have 2M mapped RW with a > single PMD changing several page in the middle of those 2M to R+X will > require to remap that range with 512 PTEs. Makes sense, thanks. Luis
On Mon, Nov 07, 2022 at 02:40:14PM +0800, Aaron Lu wrote: > Hello, > > On Wed, Nov 02, 2022 at 04:41:59PM -0700, Luis Chamberlain wrote: > > ... ... > > > I'm under the impression that the real missed, undocumented, major value-add > > here is that the old "BPF prog pack" strategy helps to reduce the direct map > > fragmentation caused by heavy use of the eBPF JIT programs and this in > > turn helps your overall random system performance (regardless of what > > it is you do). As I see it then the eBPF prog pack is just one strategy to > > try to mitigate memory fragmentation on the direct map caused by the the eBPF > > JIT programs, so the "slow down" your team has obvserved should be due to the > > eventual fragmentation caused on the direct map *while* eBPF programs > > get heavily used. > > > > Mike Rapoport had presented about the Direct map fragmentation problem > > at Plumbers 2021 [0], and clearly mentioned modules / BPF / ftrace / > > kprobes as possible sources for this. Then Xing Zhengjun's 2021 performance > > evaluation on whether using 2M/1G pages aggressively for the kernel direct map > > help performance [1] ends up generally recommending huge pages. The work by Xing > > though was about using huge pages *alone*, not using a strategy such as in the > > "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF programs, > > and that I think is the real golden nugget here. > > I'm interested in how this patchset (further) improves direct map > fragmentation so would like to evaluate it to see if my previous work to > merge small mappings back in architecture layer[1] is still necessary. You gotta apply it to 6.0.5 which had a large change go in for eBPF which was not present on 6.0. > Conclusion: I think bpf_prog_pack is very good at reducing direct map > fragmentation and this patchset can further improve this situation on > large machines(with huge amount of memory) or with more large bpf progs > loaded etc. Fantastic. Thanks for the analysis, so yet another set of metrics which I'd hope can be applied to this patch set as this effort is generalized. Now imagine the effort in cosnideration also of modules / ftrace / kprobes. > Some imperfect things I can think of are(not related to this patchset): > 1 Once a split happened, it remains happened. This may not be a big deal > now with bpf_prog_pack and this patchset because the need to allocate a > new order-9 page and thus cause a potential split should happen much much > less; Not sure I follow, are you suggesting a order-9 (512 bytes) allocation would trigger a split of the reserved say 2 MiB huge page? > 2 When a new order-9 page has to be allocated, there is no way to tell > the allocator to allocate this order-9 page from an already splitted PUD > range to avoid another PUD mapping split; > 3 As Mike and others have mentioned, there are other users that can also > cause direct map split. Hence the effort to generalize. Luis
Hi Aaron, On Sun, Nov 6, 2022 at 10:40 PM Aaron Lu <aaron.lu@intel.com> wrote: > [...] > > and that I think is the real golden nugget here. > > I'm interested in how this patchset (further) improves direct map > fragmentation so would like to evaluate it to see if my previous work to > merge small mappings back in architecture layer[1] is still necessary. > > I tried to apply this patchset on v6.1-rc3/2/1 and v6.0 but all failed, > so I took one step back and evaluated the existing bpf_prog_pack. I'm > aware of this patchset would make things even better by using order-9 > page to backup the vmalloced range. The patchset was based on bpf-next tree: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/ > > I used the sample bpf prog: sample/bpf/sockex1 because it looks easy to > run, feel free to let me know a better way to evaluate this. > > - In kernels before bpf_prog_pack(v5.17 and earlier), this prog would > cause 3 pages to change protection to RO+X from RW+NX; And if the three > pages are far apart, each would cause a level 3 split then a level 2 > split; Reality is, allocated pages tend to stay close physically so > actual result will not be this bad. [...] > > - v6.1-rc3 > There is no difference because I can't trigger another pack alloc before > system is OOMed. > > Conclusion: I think bpf_prog_pack is very good at reducing direct map > fragmentation and this patchset can further improve this situation on > large machines(with huge amount of memory) or with more large bpf progs > loaded etc. Thanks a lot for these experiments! I will include the data in the next version of the set. > > Some imperfect things I can think of are(not related to this patchset): > 1 Once a split happened, it remains happened. This may not be a big deal > now with bpf_prog_pack and this patchset because the need to allocate a > new order-9 page and thus cause a potential split should happen much much > less; I think we will need to regroup the direct map for some scenarios. But I am not aware of such workloads. > 2 When a new order-9 page has to be allocated, there is no way to tell > the allocator to allocate this order-9 page from an already splitted PUD > range to avoid another PUD mapping split; This would be a good improvement. > 3 As Mike and others have mentioned, there are other users that can also > cause direct map split. Thanks, Song
On Mon, Nov 7, 2022 at 9:39 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > Some imperfect things I can think of are (not related to this patchset):
> > 1 Once a split happened, it stays split. This may not be a big deal
> > now with bpf_prog_pack and this patchset because the need to allocate a
> > new order-9 page, and thus cause a potential split, should happen much much
> > less;
>
> Not sure I follow, are you suggesting an order-9 (512 bytes) allocation would
> trigger a split of the reserved, say, 2 MiB huge page?

I think by "order-9 allocation" Aaron meant 2MiB huge pages.

The issue is that a direct map split is a one-way operation. If we
set_memory_x() on one 4kB page out of a 1GB direct map, we will split it
into 511x 2MiB pages and 512x 4kB pages. There is currently no way to
regroup the 1GB page after set_memory_nx() on the page.

Thanks,
Song

> > 2 When a new order-9 page has to be allocated, there is no way to tell
> > the allocator to allocate this order-9 page from an already split PUD
> > range to avoid another PUD mapping split;
> > 3 As Mike and others have mentioned, there are other users that can also
> > cause direct map splits.
>
> Hence the effort to generalize.
>
>   Luis
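[Editor's note: to make the arithmetic in Song's reply concrete, here is a tiny standalone sketch (not kernel code) that computes the mapping counts after a single 4kB page inside a 1GB direct mapping needs different permissions: the PUD-sized mapping is replaced by 511 2MiB mappings plus 512 4kB mappings, and nothing merges them back today.]

/*
 * Worked example of the one-way split: 1GB = 512 x 2MiB; one of those
 * 2MiB regions is further split into 512 x 4kB entries so that a single
 * page can get RO+X permissions.
 */
#include <stdio.h>

int main(void)
{
	const unsigned long gb   = 1UL << 30;
	const unsigned long pmd  = 2UL << 20;  /* 2 MiB */
	const unsigned long page = 4UL << 10;  /* 4 kB  */

	unsigned long pmd_mappings = gb / pmd - 1; /* 511 stay 2MiB */
	unsigned long pte_mappings = pmd / page;   /* 512 become 4kB */

	printf("1GB mapping after one 4kB RO+X page: %lux 2MiB + %lux 4kB\n",
	       pmd_mappings, pte_mappings);
	return 0;
}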
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 096d48aa3437..9b2042313c12 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -154,6 +154,11 @@ extern void *__vmalloc_node_range(unsigned long size, unsigned long align, void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask, int node, const void *caller) __alloc_size(1); void *vmalloc_huge(unsigned long size, gfp_t gfp_mask) __alloc_size(1); +void *vmalloc_exec(unsigned long size, unsigned long align) __alloc_size(1); +void *vcopy_exec(void *dst, void *src, size_t len); +void vfree_exec(void *addr); +void *arch_vcopy_exec(void *dst, void *src, size_t len); +int arch_invalidate_exec(void *ptr, size_t len); extern void *__vmalloc_array(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2); extern void *vmalloc_array(size_t n, size_t size) __alloc_size(1, 2); diff --git a/mm/nommu.c b/mm/nommu.c index 214c70e1d059..8a1317247ef0 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -371,6 +371,18 @@ int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages, } EXPORT_SYMBOL(vm_map_pages_zero); +void *vmalloc_exec(unsigned long size, unsigned long align) +{ + return NULL; +} + +void *vcopy_exec(void *dst, void *src, size_t len) +{ + return ERR_PTR(-EOPNOTSUPP); +} + +void vfree_exec(const void *addr) { } + /* * sys_brk() for the most part doesn't need the global kernel * lock, except when an application is doing something nasty diff --git a/mm/vmalloc.c b/mm/vmalloc.c index ccaa461998f3..6f4c73e67191 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -72,6 +72,9 @@ early_param("nohugevmalloc", set_nohugevmalloc); static const bool vmap_allow_huge = false; #endif /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */ +#define PMD_ALIGN(addr) ALIGN(addr, PMD_SIZE) +#define PMD_ALIGN_DOWN(addr) ALIGN_DOWN(addr, PMD_SIZE) + bool is_vmalloc_addr(const void *x) { unsigned long addr = (unsigned long)kasan_reset_tag(x); @@ -769,6 +772,38 @@ static LIST_HEAD(free_vmap_area_list); */ static struct rb_root free_vmap_area_root = RB_ROOT; +/* + * free_text_area for vmalloc_exec() + */ +static DEFINE_SPINLOCK(free_text_area_lock); +/* + * This linked list is used in pair with free_text_area_root. + * It gives O(1) access to prev/next to perform fast coalescing. + */ +static LIST_HEAD(free_text_area_list); + +/* + * This augment red-black tree represents the free text space. + * All vmap_area objects in this tree are sorted by va->va_start + * address. It is used for allocation and merging when a vmap + * object is released. + * + * Each vmap_area node contains a maximum available free block + * of its sub-tree, right or left. Therefore it is possible to + * find a lowest match of free area. + * + * vmap_area in this tree are backed by RO+X memory, but they do + * not have valid vm pointer (because we need subtree_max_size). + * The vm for these vmap_area are stored in all_text_vm. + */ +static struct rb_root free_text_area_root = RB_ROOT; + +/* + * List of vm_struct for free_text_area_root. This list is rarely + * accessed, so the O(N) complexity is not likely a real issue. + */ +struct vm_struct *all_text_vm; + /* * Preload a CPU with one object for "no edge" split case. 
The * aim is to get rid of allocations from the atomic context, thus @@ -3313,6 +3348,289 @@ void *vmalloc(unsigned long size) } EXPORT_SYMBOL(vmalloc); +#if defined(CONFIG_MODULES) && defined(MODULES_VADDR) +#define VMALLOC_EXEC_START MODULES_VADDR +#define VMALLOC_EXEC_END MODULES_END +#else +#define VMALLOC_EXEC_START VMALLOC_START +#define VMALLOC_EXEC_END VMALLOC_END +#endif + +static void move_vmap_to_free_text_tree(void *addr) +{ + struct vmap_area *va; + + /* remove from vmap_area_root */ + spin_lock(&vmap_area_lock); + va = __find_vmap_area((unsigned long)addr, &vmap_area_root); + if (WARN_ON_ONCE(!va)) { + spin_unlock(&vmap_area_lock); + return; + } + unlink_va(va, &vmap_area_root); + spin_unlock(&vmap_area_lock); + + /* make the memory RO+X */ + memset(addr, 0, va->va_end - va->va_start); + set_memory_ro(va->va_start, (va->va_end - va->va_start) >> PAGE_SHIFT); + set_memory_x(va->va_start, (va->va_end - va->va_start) >> PAGE_SHIFT); + + /* add to all_text_vm */ + va->vm->next = all_text_vm; + all_text_vm = va->vm; + + /* add to free_text_area_root */ + spin_lock(&free_text_area_lock); + merge_or_add_vmap_area_augment(va, &free_text_area_root, &free_text_area_list); + spin_unlock(&free_text_area_lock); +} + +/** + * vmalloc_exec - allocate virtually contiguous RO+X memory + * @size: allocation size + * + * This is used to allocate dynamic kernel text, such as module text, BPF + * programs, etc. User need to use text_poke to update the memory allocated + * by vmalloc_exec. + * + * Return: pointer to the allocated memory or %NULL on error + */ +void *vmalloc_exec(unsigned long size, unsigned long align) +{ + struct vmap_area *va, *tmp; + unsigned long addr; + enum fit_type type; + int ret; + + va = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, NUMA_NO_NODE); + if (unlikely(!va)) + return NULL; + +again: + preload_this_cpu_lock(&free_text_area_lock, GFP_KERNEL, NUMA_NO_NODE); + tmp = find_vmap_lowest_match(&free_text_area_root, size, align, 1, false); + + if (!tmp) { + unsigned long alloc_size; + void *ptr; + + spin_unlock(&free_text_area_lock); + + /* + * Not enough continuous space in free_text_area_root, try + * allocate more memory. The memory is first added to + * vmap_area_root, and then moved to free_text_area_root. 
+ */ + alloc_size = roundup(size, PMD_SIZE * num_online_nodes()); + ptr = __vmalloc_node_range(alloc_size, PMD_SIZE, VMALLOC_EXEC_START, + VMALLOC_EXEC_END, GFP_KERNEL, PAGE_KERNEL, + VM_ALLOW_HUGE_VMAP | VM_NO_GUARD, + NUMA_NO_NODE, __builtin_return_address(0)); + if (unlikely(!ptr)) + goto err_out; + + move_vmap_to_free_text_tree(ptr); + goto again; + } + + addr = roundup(tmp->va_start, align); + type = classify_va_fit_type(tmp, addr, size); + if (WARN_ON_ONCE(type == NOTHING_FIT)) + goto err_out; + + ret = adjust_va_to_fit_type(&free_text_area_root, &free_text_area_list, + tmp, addr, size); + if (ret) + goto err_out; + + spin_unlock(&free_text_area_lock); + + va->va_start = addr; + va->va_end = addr + size; + va->vm = NULL; + + spin_lock(&vmap_area_lock); + insert_vmap_area(va, &vmap_area_root, &vmap_area_list); + spin_unlock(&vmap_area_lock); + + return (void *)addr; + +err_out: + spin_unlock(&free_text_area_lock); + kmem_cache_free(vmap_area_cachep, va); + return NULL; +} + +void __weak *arch_vcopy_exec(void *dst, void *src, size_t len) +{ + return ERR_PTR(-EOPNOTSUPP); +} + +int __weak arch_invalidate_exec(void *ptr, size_t len) +{ + return -EOPNOTSUPP; +} + +/** + * vcopy_exec - Copy text to RO+X memory allocated by vmalloc_exec() + * @dst: pointer to memory allocated by vmalloc_exec() + * @src: pointer to data being copied from + * @len: number of bytes to be copied + * + * vcopy_exec() will only update memory allocated by a single vcopy_exec() + * call. If dst + len goes beyond the boundary of one allocation, + * vcopy_exec() is aborted. + * + * If @addr is NULL, no operation is performed. + */ +void *vcopy_exec(void *dst, void *src, size_t len) +{ + struct vmap_area *va; + + spin_lock(&vmap_area_lock); + va = __find_vmap_area((unsigned long)dst, &vmap_area_root); + + /* + * If no va, or va has a vm attached, this memory is not allocated + * by vmalloc_exec(). + */ + if (WARN_ON_ONCE(!va) || WARN_ON_ONCE(va->vm)) + goto err_out; + if (WARN_ON_ONCE((unsigned long)dst + len > va->va_end)) + goto err_out; + + spin_unlock(&vmap_area_lock); + + return arch_vcopy_exec(dst, src, len); + +err_out: + spin_unlock(&vmap_area_lock); + return ERR_PTR(-EINVAL); +} + +static struct vm_struct *find_and_unlink_text_vm(unsigned long start, unsigned long end) +{ + struct vm_struct *vm, *prev_vm; + + lockdep_assert_held(&free_text_area_lock); + + vm = all_text_vm; + while (vm) { + unsigned long vm_addr = (unsigned long)vm->addr; + + /* vm is within this free space, we can free it */ + if ((vm_addr >= start) && ((vm_addr + vm->size) <= end)) + goto unlink_vm; + vm = vm->next; + } + return NULL; + +unlink_vm: + if (all_text_vm == vm) { + all_text_vm = vm->next; + } else { + prev_vm = all_text_vm; + while (prev_vm->next != vm) + prev_vm = prev_vm->next; + prev_vm = vm->next; + } + return vm; +} + +/** + * vfree_exec - Release memory allocated by vmalloc_exec() + * @addr: Memory base address + * + * If @addr is NULL, no operation is performed. 
+ */ +void vfree_exec(void *addr) +{ + unsigned long free_start, free_end, free_addr; + struct vm_struct *vm; + struct vmap_area *va; + + might_sleep(); + + if (!addr) + return; + + spin_lock(&vmap_area_lock); + va = __find_vmap_area((unsigned long)addr, &vmap_area_root); + if (WARN_ON_ONCE(!va)) { + spin_unlock(&vmap_area_lock); + return; + } + WARN_ON_ONCE(va->vm); + + unlink_va(va, &vmap_area_root); + spin_unlock(&vmap_area_lock); + + /* Invalidate text in the region */ + arch_invalidate_exec(addr, va->va_end - va->va_start); + + spin_lock(&free_text_area_lock); + va = merge_or_add_vmap_area_augment(va, + &free_text_area_root, &free_text_area_list); + + if (WARN_ON_ONCE(!va)) + goto out; + + free_start = PMD_ALIGN(va->va_start); + free_end = PMD_ALIGN_DOWN(va->va_end); + + /* + * Only try to free vm when there is at least one PMD_SIZE free + * continuous memory. + */ + if (free_start >= free_end) + goto out; + + /* + * TODO: It is possible that multiple vm are ready to be freed + * after one vfree_exec(). But we free at most one vm for now. + */ + vm = find_and_unlink_text_vm(free_start, free_end); + if (!vm) + goto out; + + va = kmem_cache_alloc_node(vmap_area_cachep, GFP_ATOMIC, NUMA_NO_NODE); + if (unlikely(!va)) + goto out_save_vm; + + free_addr = __alloc_vmap_area(&free_text_area_root, &free_text_area_list, + vm->size, 1, (unsigned long)vm->addr, + (unsigned long)vm->addr + vm->size); + + if (WARN_ON_ONCE(free_addr != (unsigned long)vm->addr)) + goto out_save_vm; + + va->va_start = (unsigned long)vm->addr; + va->va_end = va->va_start + vm->size; + va->vm = vm; + spin_unlock(&free_text_area_lock); + + set_memory_nx(va->va_start, vm->size >> PAGE_SHIFT); + set_memory_rw(va->va_start, vm->size >> PAGE_SHIFT); + + /* put the va to vmap_area_root, and then free it with vfree */ + spin_lock(&vmap_area_lock); + insert_vmap_area(va, &vmap_area_root, &vmap_area_list); + spin_unlock(&vmap_area_lock); + + vfree(vm->addr); + return; + +out_save_vm: + /* + * vm is removed from all_text_vm, but not freed. Add it back, + * so that we can use or free it later. + */ + vm->next = all_text_vm; + all_text_vm = vm; +out: + spin_unlock(&free_text_area_lock); +} + /** * vmalloc_huge - allocate virtually contiguous memory, allow huge pages * @size: allocation size
vmalloc_exec is used to allocate memory to host dynamic kernel text
(modules, BPF programs, etc.) with huge pages. This is similar to the
proposal by Peter in [1].

A new tree of vmap_area, the free_text_area_* tree, is introduced in
addition to free_vmap_area_* and vmap_area_*. vmalloc_exec allocates
pages from free_text_area_*. When there isn't enough space left in
free_text_area_*, new PMD_SIZE page(s) are allocated from
free_vmap_area_* and added to free_text_area_*. To be more accurate, the
vmap_area is first added to the vmap_area_* tree and then moved to
free_text_area_*. This extra move simplifies the logic of vmalloc_exec.

vmap_area in the free_text_area_* tree are backed with memory, but we
need subtree_max_size for tree operations. Therefore, the vm_struct for
these vmap_area are stored in a separate list, all_text_vm.

The new tree allows separate handling of < PAGE_SIZE allocations, as the
current vmalloc code mostly assumes PAGE_SIZE aligned allocations. This
version of vmalloc_exec can handle BPF programs, which use 64-byte
aligned allocations, and modules, which use PAGE_SIZE aligned
allocations.

Memory allocated by vmalloc_exec() is set to RO+X before returning to
the caller. Therefore, the caller cannot write directly to the memory.
Instead, the caller is required to use vcopy_exec() to update the
memory. For the safety and security of X memory, vcopy_exec() checks
that the data being updated always lies within the memory allocated by
one vmalloc_exec() call. vcopy_exec() uses a text_poke-like mechanism
and requires arch support. Specifically, the arch needs to implement
arch_vcopy_exec().

In vfree_exec(), the memory is first erased with arch_invalidate_exec().
Then, the memory is added to free_text_area_*. If this free creates a
big enough contiguous free space (> PMD_SIZE), vfree_exec() will try to
free the backing vm_struct.

[1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/

Signed-off-by: Song Liu <song@kernel.org>
---
 include/linux/vmalloc.h |   5 +
 mm/nommu.c              |  12 ++
 mm/vmalloc.c            | 318 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 335 insertions(+)
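[Editor's note: as a rough illustration of the API contract described in the changelog, the sketch below shows how a hypothetical JIT might publish an image through vmalloc_exec()/vcopy_exec(). The function name, the RW staging buffer, and the 64-byte alignment are made up for illustration; only the three exported calls and their signatures come from this patch, and error handling is simplified.]

/*
 * Hypothetical caller of the new API: build the code in a plain RW
 * buffer, allocate RO+X memory, and copy the image in via vcopy_exec()
 * because the RO+X memory cannot be written directly.
 */
#include <linux/vmalloc.h>
#include <linux/err.h>

void *publish_jit_image(void *image, size_t len)
{
	void *ro_x;
	void *ret;

	/* RO+X memory; 64-byte alignment as used for BPF programs */
	ro_x = vmalloc_exec(len, 64);
	if (!ro_x)
		return NULL;

	/* must stay within one vmalloc_exec() allocation */
	ret = vcopy_exec(ro_x, image, len);
	if (IS_ERR(ret)) {
		vfree_exec(ro_x);
		return NULL;
	}

	return ro_x;
}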