Message ID: 20220707223546.4124919-1-song@kernel.org (mailing list archive)
Series: bpf_prog_pack followup
On Thu, Jul 07, 2022 at 03:35:41PM -0700, Song Liu wrote:
> This set is the second half of v4 [1].
>
> Changes v5 => v6:
> 1. Rebase and extend CC list.

Why post a new iteration so soon without completing the discussion we
had? It seems like we were at least going somewhere. If it's just
to include mm as I requested, sure, that's fine, but this does not
provide context as to what we last were talking about.

Luis
> On Jul 7, 2022, at 3:59 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> Why post a new iteration so soon without completing the discussion we
> had? It seems like we were at least going somewhere. If it's just
> to include mm as I requested, sure, that's fine, but this does not
> provide context as to what we last were talking about.

Sorry for sending v6 too soon. The primary reason was to extend the CC
list and add it back to patchwork (v5 somehow got archived).

Also, I think the vmalloc_exec_* work would be a separate project, while
this set is the followup work of bpf_prog_pack. Does this make sense?

Btw, the vmalloc_exec_* work could be a good topic for LPC. It will be
much more efficient to discuss this in person.

Thanks,
Song
On Thu, Jul 07, 2022 at 11:52:58PM +0000, Song Liu wrote:
> Sorry for sending v6 too soon. The primary reason was to extend the CC
> list and add it back to patchwork (v5 somehow got archived).
>
> Also, I think the vmalloc_exec_* work would be a separate project, while
> this set is the followup work of bpf_prog_pack. Does this make sense?
>
> Btw, the vmalloc_exec_* work could be a good topic for LPC. It will be
> much more efficient to discuss this in person.

What we need is input from mm / arch folks. What is not done here is to
spell out what that stuff we're talking about actually is, so mm folks
can't guess. My preference is to address that.

I don't think an in-person discussion is needed if the only folks
discussing this topic so far are just you and me.

Luis
> On Jul 7, 2022, at 5:53 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> What we need is input from mm / arch folks. What is not done here is to
> spell out what that stuff we're talking about actually is, so mm folks
> can't guess. My preference is to address that.
>
> I don't think an in-person discussion is needed if the only folks
> discussing this topic so far are just you and me.

How about we start a thread with mm / arch folks for the vmalloc_exec_*
topic? I will summarize previous discussions and include pointers to
them. If necessary, we can continue the discussion at LPC.

OTOH, I guess the outcome of that discussion should not change this set?
If we have concerns about module_alloc_huge(), maybe we can have the bpf
code call vmalloc directly (until we have vmalloc_exec_*)?

What do you think about this plan?

Thanks,
Song
On Fri, Jul 08, 2022 at 01:36:25AM +0000, Song Liu wrote:
> How about we start a thread with mm / arch folks for the vmalloc_exec_*
> topic? I will summarize previous discussions and include pointers to
> them. If necessary, we can continue the discussion at LPC.

This sounds like a nice thread to use, as this is why we are talking
about that topic.

> OTOH, I guess the outcome of that discussion should not change this set?

If the above is done right, then actually I think it would show similar
considerations for a respective free for module_alloc_huge().

> If we have concerns about module_alloc_huge(), maybe we can have the bpf
> code call vmalloc directly (until we have vmalloc_exec_*)?

You'd need to then still open code, in a similar way, the same things
which we are trying to reach consensus on.

> What do you think about this plan?

I think we should strive to not be lazy and sloppy, and prevent growth
of sloppy code. So long as we do that, I think this is all reasonable.

Luis
> On Jul 8, 2022, at 8:58 AM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> I think we should strive to not be lazy and sloppy, and prevent growth
> of sloppy code. So long as we do that, I think this is all reasonable.

Let me try to understand your concerns here. Say we want the module code
to be a temporary home for module_alloc_huge() before we move it to mm
code. Would you consider it ready to ship if we:

1) Rename module_alloc_huge() as module_alloc_text_huge();
2) Add module_free_text_huge();
3) Move the set_memory_* and fill_ill_insn logic into
   module_alloc_text_huge() and module_free_text_huge().

Are these in the right direction? Did I miss anything important?

Thanks,
Song
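For illustration, here is a minimal sketch of what the interface proposed in 1)-3) could look like. The bodies are assumptions about where the set_memory_*() and fill_ill_insn logic would land, not code from the series:

```
/* Hypothetical sketch only; module_alloc_huge() is the function from
 * this series, the rest is an assumption about where the logic moves. */
void *module_alloc_text_huge(unsigned long size)
{
	void *p = module_alloc_huge(size);	/* 2 MiB backed mapping */

	if (!p)
		return NULL;

	/* Hand out RO+X memory; callers then write via text_poke().
	 * Assumes a page-aligned size. */
	set_vm_flush_reset_perms(p);
	set_memory_ro((unsigned long)p, size >> PAGE_SHIFT);
	set_memory_x((unsigned long)p, size >> PAGE_SHIFT);
	return p;
}

void module_free_text_huge(void *p, unsigned long size)
{
	/* The fill_ill_insn step would overwrite stale instructions here,
	 * so freed-but-still-mapped code cannot be executed by mistake. */
	vfree(p);	/* perms are reset thanks to the flag armed above */
}
```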
On Fri, Jul 08, 2022 at 07:58:44PM +0000, Song Liu wrote:
> Let me try to understand your concerns here. Say we want the module code
> to be a temporary home for module_alloc_huge() before we move it to mm
> code. Would you consider it ready to ship if we:

Please CC Christoph and linux-modules@vger.kernel.org on future patches
and discussions around this, and all others now CC'd.

> 1) Rename module_alloc_huge() as module_alloc_text_huge();

module_alloc_text_huge() is too long, but I've suggested names before
which are short and generic, and also suggested that if modules are
not the only users this needs to go outside of modules, and so
vmalloc_text_huge() or whatever.

To do this right it begs the question why we don't do that for the
existing module_alloc(), as the users of this code are well outside of
modules now. Last time a similar generic name was used, all the special
arch stuff was left to be done by the module code, but non-modules were
still using that allocator. From my perspective the right thing to do is
to deal with all the arch stuff as well in the generic handler, and have
the module code *and* the other users which use module_alloc() use that
new caller as well.

> 2) Add module_free_text_huge();

Right, we have special handling for how we free this special code for
regular module_alloc(), and so similar considerations would be needed
here for the huge stuff.

> 3) Move the set_memory_* and fill_ill_insn logic into
>    module_alloc_text_huge() and module_free_text_huge().

Yes, that's a bit hairy now, and so a saner and consistent way to do
this would be best.

> Are these in the right direction? Did I miss anything important?

I've also hinted before that another way to help here is to draw up a
simple lib/test_vmalloc_text.c or something like that, which would
enable a selftest to ensure correctness of this code on different archs
and maybe even let you do performance analysis using perf [0]. You have
good reasons to move to the huge allocator and the performance metrics
are an abstract load, however perf measurements can also give you real
raw data which you can reproduce, and enable others to do similar
comparisons later.

The last thing I'd ask is just to ensure you Cc folks who have already
been in these discussions.

[0] https://lkml.kernel.org/r/Yog+d+oR5TtPp2cs@bombadil.infradead.org

Luis
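As a rough idea of the selftest being suggested, a skeleton in the spirit of the existing lib/test_vmalloc.c might look like the following. Everything here is hypothetical, including the file name and the module_alloc_text_huge()/module_free_text_huge() entry points sketched above:

```
// SPDX-License-Identifier: GPL-2.0
/* Hypothetical lib/test_vmalloc_text.c skeleton, modeled loosely on
 * lib/test_vmalloc.c; the allocator entry points are the API under
 * discussion, not existing kernel symbols. */
#include <linux/module.h>
#include <linux/moduleloader.h>
#include <linux/ktime.h>

static int nr_allocs = 128;
module_param(nr_allocs, int, 0644);

static int __init test_vmalloc_text_init(void)
{
	ktime_t start = ktime_get();
	int i;

	for (i = 0; i < nr_allocs; i++) {
		void *p = module_alloc_text_huge(PAGE_SIZE);

		if (!p)
			return -ENOMEM;
		module_free_text_huge(p, PAGE_SIZE);
	}
	pr_info("%d alloc/free cycles in %lld ns\n", nr_allocs,
		ktime_to_ns(ktime_sub(ktime_get(), start)));
	return 0;
}

static void __exit test_vmalloc_text_exit(void) { }

module_init(test_vmalloc_text_init);
module_exit(test_vmalloc_text_exit);
MODULE_LICENSE("GPL");
```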
> On Jul 8, 2022, at 3:24 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> Please CC Christoph and linux-modules@vger.kernel.org on future patches
> and discussions around this, and all others now CC'd.

Sometimes vger drops my patch because the CC list is too long. That's
the reason I often trim the CC list. I will try to keep folks in this
thread CC'ed.

> module_alloc_text_huge() is too long, but I've suggested names before
> which are short and generic, and also suggested that if modules are
> not the only users this needs to go outside of modules, and so
> vmalloc_text_huge() or whatever.
>
> To do this right it begs the question why we don't do that for the
> existing module_alloc(), as the users of this code are well outside of
> modules now. Last time a similar generic name was used, all the special
> arch stuff was left to be done by the module code, but non-modules were
> still using that allocator. From my perspective the right thing to do is
> to deal with all the arch stuff as well in the generic handler, and have
> the module code *and* the other users which use module_alloc() use that
> new caller as well.

The key difference between module_alloc() and the new API is that the
new API will return RO+X memory, and the user needs a text-poke like API
to modify this buffer. Archs that do not support text-poke will not be
able to use the new API. Does this sound like a reasonable design?

> Right, we have special handling for how we free this special code for
> regular module_alloc(), and so similar considerations would be needed
> here for the huge stuff.
>
> Yes, that's a bit hairy now, and so a saner and consistent way to do
> this would be best.

Thanks for the information. I will try to go in this direction.

> I've also hinted before that another way to help here is to draw up a
> simple lib/test_vmalloc_text.c or something like that, which would
> enable a selftest to ensure correctness of this code on different archs
> and maybe even let you do performance analysis using perf [0].

Let me see how we can test it.

Thanks,
Song
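To make the RO+X contrast concrete, here is a sketch of how a caller would fill such a buffer on x86, where text_poke_copy() already exists; treating module_alloc_text_huge() as the entry point is an assumption carried over from the sketch above:

```
/* Hypothetical caller; module_alloc_text_huge() is the proposed API. */
static void *install_text(const void *insns, size_t len)
{
	void *buf = module_alloc_text_huge(len);

	/* buf comes back RO+X, so a plain memcpy() would fault; the write
	 * goes through the writable alias text_poke_copy() sets up. */
	if (buf)
		text_poke_copy(buf, insns, len);	/* x86 primitive today */
	return buf;
}
```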
On Sat, Jul 09, 2022 at 01:14:23AM +0000, Song Liu wrote:
> The key difference between module_alloc() and the new API is that the
> new API will return RO+X memory, and the user needs a text-poke like API
> to modify this buffer. Archs that do not support text-poke will not be
> able to use the new API. Does this sound like a reasonable design?

I'm adding kprobe + ftrace folks. I can't see why we need to *require*
text_poke for just a module_alloc_huge(). Enhancements on module_alloc()
are just enhancements, not requirements. So we have these for instance:

```
from arch/Kconfig

config ARCH_OPTIONAL_KERNEL_RWX
	def_bool n

config ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
	def_bool n

config ARCH_HAS_STRICT_KERNEL_RWX
	def_bool n

config STRICT_KERNEL_RWX
	bool "Make kernel text and rodata read-only" if ARCH_OPTIONAL_KERNEL_RWX
	depends on ARCH_HAS_STRICT_KERNEL_RWX
	default !ARCH_OPTIONAL_KERNEL_RWX || ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
	help
	  If this is set, kernel text and rodata memory will be made read-only,
	  and non-text memory will be made non-executable. This provides
	  protection against certain security exploits (e.g. executing the heap
	  or modifying text).

	  These features are considered standard security practice these days.
	  You should say Y here in almost all cases.

config ARCH_HAS_STRICT_MODULE_RWX
	def_bool n

config STRICT_MODULE_RWX
	bool "Set loadable kernel module data as NX and text as RO" if ARCH_OPTIONAL_KERNEL_RWX
	depends on ARCH_HAS_STRICT_MODULE_RWX && MODULES
	default !ARCH_OPTIONAL_KERNEL_RWX || ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
	help
	  If this is set, module text and rodata memory will be made read-only,
	  and non-text memory will be made non-executable. This provides
	  protection against certain security exploits (e.g. writing to text).
```

With module_alloc() we have the above symbols to tell us when we *can*
support strict module rwx. So the way the kernel's modules are allocated
and used is:

  for each module section:
	module_alloc()
  module_enable_ro()
  module_enable_nx()
  module_enable_x()

The above can be read in the code as:

  load_module()
    --> layout_and_allocate()
    --> complete_formation()

Then there is the consideration of set_vm_flush_reset_perms() for
freeing. In the module code we use this for the RO+X stuff (core_layout,
init_layout), but now that is a bit obfuscated due to the placement of
the call. It would seem the other users use it for the same:

  * ebpf
  * kprobes
  * ftrace

I believe you are mentioning requiring text_poke() because the way
eBPF code uses module_alloc() is different. Correct me if I'm wrong,
but from what I gather you use text_poke_copy() as the data is already
RO+X, contrary to module_alloc() use cases. You do this since your
bpf_prog_pack_alloc() calls set_memory_ro() and set_memory_x() after
module_alloc() and before you can use this memory. This is a different
type of allocator. And, again please correct me if I'm wrong, but now
you want to share *one* 2 MiB huge-page for multiple BPF programs to
help with the impact of TLB misses.

A vmalloc_ro_exec() by definition would imply a text_poke().

Can kprobes, ftrace and modules use it too? It would be nice so as not
to have to deal with the loose semantics of the user having to use
set_vm_flush_reset_perms() on ro+x later, but I think this can be
addressed separately on a case by case basis.

But a vmalloc_ro_exec() with a respective free can remove the
requirement to do set_vm_flush_reset_perms().

Luis
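For reference, the loose convention referred to above (what RO+X users of module_alloc() are expected to do today) goes roughly like this; it is a sketch of the pattern, not any single caller verbatim:

```
/* Sketch of today's convention; each RO+X user of module_alloc()
 * must remember every one of these steps itself. */
static void *legacy_ro_x_alloc(const void *code, size_t size)
{
	size_t npages = size >> PAGE_SHIFT;	/* assumes page-aligned size */
	void *p = module_alloc(size);

	if (!p)
		return NULL;

	set_vm_flush_reset_perms(p);	/* arm vfree() to reset perms */
	memcpy(p, code, size);		/* still writable at this point */
	set_memory_ro((unsigned long)p, npages);
	set_memory_x((unsigned long)p, npages);
	return p;			/* freed later with a plain vfree() */
}
```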
On Mon, Jul 11, 2022 at 09:18:53PM -0700, Luis Chamberlain wrote:
> A vmalloc_ro_exec() by definition would imply a text_poke().
>
> Can kprobes, ftrace and modules use it too? It would be nice so as not
> to have to deal with the loose semantics of the user having to use
> set_vm_flush_reset_perms() on ro+x later, but I think this can be
> addressed separately on a case by case basis.

And who knows, if they can, and can also share a huge page allocator,
then they may also share similar performance improvements.

Luis
> On Jul 11, 2022, at 9:18 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> [...]
>
> I believe you are mentioning requiring text_poke() because the way
> eBPF code uses module_alloc() is different. Correct me if I'm wrong,
> but from what I gather you use text_poke_copy() as the data is already
> RO+X, contrary to module_alloc() use cases. You do this since your
> bpf_prog_pack_alloc() calls set_memory_ro() and set_memory_x() after
> module_alloc() and before you can use this memory. This is a different
> type of allocator. And, again please correct me if I'm wrong, but now
> you want to share *one* 2 MiB huge-page for multiple BPF programs to
> help with the impact of TLB misses.

Yes, sharing 1x 2MiB huge page is the main reason to require text_poke.
OTOH, 2MiB huge pages without sharing are not really useful. Both kprobe
and ftrace only use a fraction of a 4kB page. Most BPF programs and
modules cannot use 2MiB either. Therefore, a vmalloc_rw_exec() doesn't
add much value on top of the current module_alloc().

> A vmalloc_ro_exec() by definition would imply a text_poke().
>
> Can kprobes, ftrace and modules use it too? It would be nice so as not
> to have to deal with the loose semantics of the user having to use
> set_vm_flush_reset_perms() on ro+x later, but I think this can be
> addressed separately on a case by case basis.

I am pretty confident that kprobe and ftrace can share huge pages with
BPF programs. I haven't looked into all the details with modules, but
given CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC, I think it is also
possible.

Once this is done, a regular system (without huge BPF programs or huge
modules) will just use 1x 2MiB page for text from modules, ftrace,
kprobe, and bpf programs.

> But a vmalloc_ro_exec() with a respective free can remove the
> requirement to do set_vm_flush_reset_perms().

Removing the requirement to set_vm_flush_reset_perms() is the other
reason to go directly to vmalloc_ro_exec().

My current version looks like this:

  void *vmalloc_exec(unsigned long size);
  void vfree_exec(void *ptr, unsigned int size);

ro is eliminated from the names as there is no rw version of the API.
The ugly part is @size for vfree_exec(); we need it to share huge pages.

Under the hood, it looks similar to the current bpf_prog_pack_alloc()
and bpf_prog_pack_free().

Thanks,
Song
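A sketch of how a JIT would consume that pair, assuming writes go through text_poke_copy() since the memory comes back RO+X; insns and image_size stand in for a hypothetical JIT's output:

```
/* Hypothetical consumer of the proposed vmalloc_exec()/vfree_exec(). */
static int jit_install(const void *insns, unsigned long image_size)
{
	void *image = vmalloc_exec(image_size);

	if (!image)
		return -ENOMEM;

	text_poke_copy(image, insns, image_size);  /* RO+X: no memcpy */
	/* ... the program runs from image, and eventually ... */
	vfree_exec(image, image_size);	/* @size needed: pages are shared */
	return 0;
}
```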
On Tue, Jul 12, 2022 at 05:49:32AM +0000, Song Liu wrote:
> Yes, sharing 1x 2MiB huge page is the main reason to require text_poke.
> OTOH, 2MiB huge pages without sharing are not really useful. Both kprobe
> and ftrace only use a fraction of a 4kB page. Most BPF programs and
> modules cannot use 2MiB either. Therefore, a vmalloc_rw_exec() doesn't
> add much value on top of the current module_alloc().

Thanks for the clarification.

> I am pretty confident that kprobe and ftrace can share huge pages with
> BPF programs.

Then wonderful, we know where to go in terms of a new API, as it can be
shared in the future for sure and there are gains.

> I haven't looked into all the details with modules, but
> given CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC, I think it is also
> possible.

Sure.

> Once this is done, a regular system (without huge BPF programs or huge
> modules) will just use 1x 2MiB page for text from modules, ftrace,
> kprobe, and bpf programs.

That would be nice, if possible, however modules will likely require
their own thing; on my system I see about 57 MiB used on coresize alone:

  lsmod | grep -v Module | cut -f1 -d ' ' | \
	xargs sudo modinfo | grep filename | \
	grep -o '/.*' | xargs stat -c "%s - %n" | \
	awk 'BEGIN {sum=0} {sum+=$1} END {print sum}'
  60001272

And so perhaps we need such a pool size to be configurable.

> Removing the requirement to set_vm_flush_reset_perms() is the other
> reason to go directly to vmalloc_ro_exec().

Yes, fantastic.

> My current version looks like this:
>
>   void *vmalloc_exec(unsigned long size);
>   void vfree_exec(void *ptr, unsigned int size);
>
> ro is eliminated from the names as there is no rw version of the API.

Alright.

I am not sure if 2 MiB will suffice given what I mentioned above, and
what to do to ensure this grows at a reasonable pace. Then, at least for
usage on all architectures, since not all will support text_poke(), we
will want to consider a way to make it easy for users to use non-huge
page fallbacks, but that would be up to those users, so we can wait for
that.

> The ugly part is @size for vfree_exec(); we need it to share huge
> pages.

I suppose this will become evident during patch review.

> Under the hood, it looks similar to the current bpf_prog_pack_alloc()
> and bpf_prog_pack_free().

Groovy.

Luis
> On Jul 12, 2022, at 12:04 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> [...]
>
> I am not sure if 2 MiB will suffice given what I mentioned above, and
> what to do to ensure this grows at a reasonable pace. Then, at least for
> usage on all architectures, since not all will support text_poke(), we
> will want to consider a way to make it easy for users to use non-huge
> page fallbacks, but that would be up to those users, so we can wait for
> that.

We are not limited to 2MiB total. The logic is like:

1. Anything bigger than 2MiB gets its own allocation.
2. We maintain a list of 2MiB pages, and bitmaps showing which parts of
   these pages are in use.
3. For objects smaller than 2MiB, we will try to fit each of them into
   one of these pages.
3a. If there isn't a page with big enough contiguous free space, we
    will allocate a new 2MiB page.

(For a system with n NUMA nodes, multiply the 2MiB above by n.)

So, if we have 100 kernel modules using 1MiB each, they will share 50x
2MiB pages.

Thanks,
Song
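A minimal sketch of the bitmap-over-2MiB-pages bookkeeping in steps 2 through 3a; the names are invented, and the 64-byte slot granularity is borrowed from bpf_prog_pack, so treat the whole thing as an assumption:

```
#include <linux/bitmap.h>
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/sizes.h>

#define PACK_SIZE	SZ_2M
#define SLOT_SIZE	64			/* as in bpf_prog_pack */
#define NR_SLOTS	(PACK_SIZE / SLOT_SIZE)

struct exec_pack {				/* one shared 2MiB page */
	struct list_head list;
	void *base;				/* RO+X 2MiB region */
	DECLARE_BITMAP(slots, NR_SLOTS);
};

static LIST_HEAD(pack_list);
static DEFINE_MUTEX(pack_mutex);

/* Step 3: try to fit a sub-2MiB object into an existing pack; on
 * failure the caller allocates a fresh 2MiB pack (step 3a). */
static void *pack_fit(unsigned long size)
{
	unsigned int nslots = DIV_ROUND_UP(size, SLOT_SIZE);
	struct exec_pack *pack;
	unsigned long pos;

	mutex_lock(&pack_mutex);
	list_for_each_entry(pack, &pack_list, list) {
		pos = bitmap_find_next_zero_area(pack->slots, NR_SLOTS,
						 0, nslots, 0);
		if (pos < NR_SLOTS) {
			bitmap_set(pack->slots, pos, nslots);
			mutex_unlock(&pack_mutex);
			return pack->base + pos * SLOT_SIZE;
		}
	}
	mutex_unlock(&pack_mutex);
	return NULL;
}
```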
On Tue, Jul 12, 2022 at 11:12:22PM +0000, Song Liu wrote:
> We are not limited to 2MiB total. The logic is like:
>
> 1. Anything bigger than 2MiB gets its own allocation.

And does that allocation get split up into a few huge 2 MiB pages?
When freed, does that go back into the pool of available 2 MiB pages?

> 2. We maintain a list of 2MiB pages, and bitmaps showing which parts of
>    these pages are in use.

How many 2 MiB huge pages are allocated initially? Do we have a cap?

> 3. For objects smaller than 2MiB, we will try to fit each of them into
>    one of these pages.
> 3a. If there isn't a page with big enough contiguous free space, we
>     will allocate a new 2MiB page.
>
> (For a system with n NUMA nodes, multiply the 2MiB above by n.)
>
> So, if we have 100 kernel modules using 1MiB each, they will share 50x
> 2MiB pages.

  lsmod | grep -v Module | cut -f1 -d ' ' | \
	xargs sudo modinfo | grep filename | \
	grep -o '/.*' | xargs stat -c "%s - %n" | \
	awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
  271.273

On average my system's modules are 271 KiB.

Then I only have 6 out of 216 modules which use more than 2 MiB of
memory for coresize. So roughly 97% of my modules would be covered with
this. Not bad.

The monsters:

  lsmod | grep -v Module | cut -f1 -d ' ' | xargs sudo modinfo | \
	grep filename | grep -o '/.*' | xargs stat -c "%s %n" | \
	sort -n -k 1 -r | head -10 | \
	awk '{print $1/1024/1024" "$2}'
  6.50775 /lib/modules/5.17.0-1-amd64/kernel/drivers/gpu/drm/i915/i915.ko
  3.6847 /lib/modules/5.17.0-1-amd64/kernel/fs/xfs/xfs.ko
  3.34252 /lib/modules/5.17.0-1-amd64/kernel/fs/btrfs/btrfs.ko
  2.37677 /lib/modules/5.17.0-1-amd64/kernel/net/mac80211/mac80211.ko
  2.2972 /lib/modules/5.17.0-1-amd64/kernel/net/wireless/cfg80211.ko
  2.05754 /lib/modules/5.17.0-1-amd64/kernel/arch/x86/kvm/kvm.ko
  1.96126 /lib/modules/5.17.0-1-amd64/kernel/net/bluetooth/bluetooth.ko
  1.83429 /lib/modules/5.17.0-1-amd64/kernel/fs/ext4/ext4.ko
  1.7724 /lib/modules/5.17.0-1-amd64/kernel/fs/nfsd/nfsd.ko
  1.60539 /lib/modules/5.17.0-1-amd64/kernel/net/sunrpc/sunrpc.ko

On a big iron server I have 149 modules and the situation is better
there:

  3.69791 /lib/modules/5.16.0-6-amd64/kernel/fs/xfs/xfs.ko
  3.35575 /lib/modules/5.16.0-6-amd64/kernel/fs/btrfs/btrfs.ko
  3.21056 /lib/modules/5.16.0-6-amd64/kernel/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko
  2.02773 /lib/modules/5.16.0-6-amd64/kernel/arch/x86/kvm/kvm.ko
  1.82574 /lib/modules/5.16.0-6-amd64/kernel/fs/ext4/ext4.ko
  1.36571 /lib/modules/5.16.0-6-amd64/kernel/net/sunrpc/sunrpc.ko
  1.32686 /lib/modules/5.16.0-6-amd64/kernel/fs/nfsd/nfsd.ko
  1.12648 /lib/modules/5.16.0-6-amd64/kernel/drivers/gpu/drm/drm.ko
  0.898623 /lib/modules/5.16.0-6-amd64/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko
  0.86922 /lib/modules/5.16.0-6-amd64/kernel/drivers/infiniband/core/ib_core.ko

So this may just work nicely.

Luis
> On Jul 12, 2022, at 4:42 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> [...]
>
>> 1. Anything bigger than 2MiB gets its own allocation.
>
> And does that allocation get split up into a few huge 2 MiB pages?
> When freed, does that go back into the pool of available 2 MiB pages?

This would have some 2MiB pages and some 4kB pages. For example, if we
need 4MiB + 5kB, it will allocate 2x 2MiB pages and 2x 4kB pages (5kB
rounded up to 8kB). On free, these will not go into the pool. Instead,
they will be vfree()'ed.

>> 2. We maintain a list of 2MiB pages, and bitmaps showing which parts of
>>    these pages are in use.
>
> How many 2 MiB huge pages are allocated initially? Do we have a cap?

The current logic just allocates 1 huge page at a time. No cap.

> On average my system's modules are 271 KiB.
>
> Then I only have 6 out of 216 modules which use more than 2 MiB of
> memory for coresize. So roughly 97% of my modules would be covered with
> this. Not bad.

Are these all the modules we have in tree? ;)

Thanks,
Song
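To pin down the size split in the 4MiB + 5kB example above, a tiny sketch, with an invented helper name and assuming a 4kB PAGE_SIZE:

```
#include <linux/kernel.h>	/* DIV_ROUND_UP */
#include <linux/sizes.h>	/* SZ_2M */

/* For 4MiB + 5kB: nr_huge = 2 (2x 2MiB) and nr_small = 2, since the
 * 5kB tail rounds up to 2x 4kB pages (8kB), matching the example. */
static void split_exec_request(unsigned long size,
			       unsigned long *nr_huge,
			       unsigned long *nr_small)
{
	*nr_huge  = size / SZ_2M;
	*nr_small = DIV_ROUND_UP(size % SZ_2M, PAGE_SIZE);
}
```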