Message ID: 20240209040608.98927-5-alexei.starovoitov@gmail.com (mailing list archive)
State:      New
Series:     bpf: Introduce BPF arena.
NAK. Please

On Thu, Feb 08, 2024 at 08:05:52PM -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> BPF would like to use the vmap API to implement a lazily-populated
> memory space which can be shared by multiple userspace threads.
> The vmap API is generally public and has functions to request and

What is "the vmap API"?

> For example, there is the public ioremap_page_range(), which is used
> to map device memory into addressable kernel space.

It's not really public. It's a helper for the ioremap implementation,
which really should not be arch specific to start with and is in the
process of being consolidated into common code.

> The new BPF code needs the functionality of vmap_pages_range() in
> order to incrementally map privately managed arrays of pages into its
> vmap area. Indeed this function used to be public, but became private
> when usecases other than vmalloc happened to disappear.

Yes, for a freaking good reason. The vmap area is not for general abuse
by random callers. We have a few of those left, but we need to get rid
of that and not add more.
On Wed, Feb 14, 2024 at 12:36 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> NAK. Please

What is the alternative?
Remember, maintainers cannot tell developers "go away".
They must suggest a different path.

> On Thu, Feb 08, 2024 at 08:05:52PM -0800, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > BPF would like to use the vmap API to implement a lazily-populated
> > memory space which can be shared by multiple userspace threads.
> > The vmap API is generally public and has functions to request and
>
> What is "the vmap API"?

I mean an API that manages kernel virtual address space:

. get_vm_area - external
. free_vm_area - EXPORT_SYMBOL_GPL
. vunmap_range - external
. vmalloc_to_page - EXPORT_SYMBOL
. apply_to_page_range - EXPORT_SYMBOL_GPL

and the last one is pretty much equivalent to vmap_pages_range,
hence I'm surprised by the push back against making vmap_pages_range
available to bpf.

> > For example, there is the public ioremap_page_range(), which is used
> > to map device memory into addressable kernel space.
>
> It's not really public. It's a helper for the ioremap implementation,
> which really should not be arch specific to start with and is in the
> process of being consolidated into common code.

Any link to such consolidation of ioremap? I couldn't find one.
I surely don't want bpf_arena to cause headaches to mm folks.

Anyway, ioremap_page_range() was just an example.
I could have used vmap() as an equivalent example.
vmap is EXPORT_SYMBOL, btw.

What bpf_arena needs is pretty much vmap(), but instead of
allocating all pages in advance, it allocates and inserts them on demand.

As you saw in the next patch, bpf_arena does:

  get_vm_area(4Gbyte, VM_MAP | VM_USERMAP);

and then alloc_page + vmap_pages_range into this region on demand.
Nothing fancy.

> > The new BPF code needs the functionality of vmap_pages_range() in
> > order to incrementally map privately managed arrays of pages into its
> > vmap area. Indeed this function used to be public, but became private
> > when usecases other than vmalloc happened to disappear.
>
> Yes, for a freaking good reason. The vmap area is not for general abuse
> by random callers. We have a few of those left, but we need to get rid
> of that and not add more.

What do you mean by "vmap area"? The vmalloc virtual region?
Are you suggesting that bpf_arena should reserve its own virtual region
of kernel memory instead of the vmalloc region?
That's doable, but I don't quite see the point.
Instead of VMALLOC_START/END we can carve a bpf-specific region and do
__get_vm_area_node() from there, but why? The vmalloc region fits best.
bpf_arena's mm manipulations don't interfere with kasan either.

Or did you mean vm_map_ram()? Don't care about those. bpf_arena doesn't
touch that.
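[For reference, a minimal sketch of the flow described above, assuming
vmap_pages_range() were callable from bpf code (it is a static helper in
mm/vmalloc.c at this point in the thread). The arena_* names, ARENA_SIZE,
and the missing error/teardown paths are illustrative only, not taken
from the patch set.]

#include <linux/vmalloc.h>
#include <linux/gfp.h>
#include <linux/mm.h>

#define ARENA_SIZE	(1ull << 32)	/* 4 GB of kernel virtual address space */

static struct vm_struct *arena;

/* Reserve the virtual range up front; no pages are allocated yet. */
static int arena_reserve(void)
{
	arena = get_vm_area(ARENA_SIZE, VM_MAP | VM_USERMAP);
	return arena ? 0 : -ENOMEM;
}

/* Allocate one page and map it at a byte offset into the reserved range. */
static int arena_map_one(unsigned long off)
{
	unsigned long addr = (unsigned long)arena->addr + off;
	struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);

	if (!page)
		return -ENOMEM;
	return vmap_pages_range(addr, addr + PAGE_SIZE, PAGE_KERNEL,
				&page, PAGE_SHIFT);
}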
On Wed, Feb 14, 2024 at 12:53:42PM -0800, Alexei Starovoitov wrote:
> On Wed, Feb 14, 2024 at 12:36 AM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > NAK. Please
>
> What is the alternative?
> Remember, maintainers cannot tell developers "go away".
> They must suggest a different path.

That criterion is something you've made up. Telling that something is
not ok is the most important job of not just maintainers but all
developers. Maybe start with a description of the problem you're
solving and why you think it matters and needs different APIs.

> . get_vm_area - external
> . free_vm_area - EXPORT_SYMBOL_GPL
> . vunmap_range - external
> . vmalloc_to_page - EXPORT_SYMBOL
> . apply_to_page_range - EXPORT_SYMBOL_GPL
>
> and the last one is pretty much equivalent to vmap_pages_range,
> hence I'm surprised by the push back against making vmap_pages_range
> available to bpf.

And the last one we've been trying to get rid of for ages because we
don't want random modules to

> > > For example, there is the public ioremap_page_range(), which is used
> > > to map device memory into addressable kernel space.
> >
> > It's not really public. It's a helper for the ioremap implementation,
> > which really should not be arch specific to start with and is in the
> > process of being consolidated into common code.
>
> Any link to such consolidation of ioremap? I couldn't find one.

Second hit on google:

https://lore.kernel.org/lkml/20230609075528.9390-1-bhe@redhat.com/T/

> I surely don't want bpf_arena to cause headaches to mm folks.
>
> Anyway, ioremap_page_range() was just an example.
> I could have used vmap() as an equivalent example.
> vmap is EXPORT_SYMBOL, btw.

vmap is a good, well-defined API. vmap_pages_range is not.

> What bpf_arena needs is pretty much vmap(), but instead of
> allocating all pages in advance, it allocates and inserts them on demand.

So propose an API that does that instead of exposing random low-level
details.
On Wed, Feb 14, 2024 at 10:58 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Feb 14, 2024 at 12:53:42PM -0800, Alexei Starovoitov wrote:
> > On Wed, Feb 14, 2024 at 12:36 AM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > NAK. Please
> >
> > What is the alternative?
> > Remember, maintainers cannot tell developers "go away".
> > They must suggest a different path.
>
> That criterion is something you've made up.

I didn't invent it. I internalized it based on the feedback received.

> Telling that something is not ok is the most important job of not just
> maintainers but all developers.

I'm not saying that maintainers should not say "no".
I'm saying that maintainers should say "no", understand the problem
being solved, and suggest an alternative.

> Maybe start with a description of the problem you're solving and why
> you think it matters and needs different APIs.

bpf_arena doesn't need a different api. These 5 api-s below are enough.

I'm saying that vmap_pages_range() is equivalent to
apply_to_page_range() for all practical purposes.
So, since apply_to_page_range() is available to the kernel
(xen, gpu, kasan, etc) then I see no reason why
vmap_pages_range() shouldn't be available as well, since:

struct vmap_ctx {
	struct page **pages;
	int idx;
};

static int __for_each_pte(pte_t *ptep, unsigned long addr, void *data)
{
	struct vmap_ctx *ctx = data;
	struct page *page = ctx->pages[ctx->idx++];

	/* TODO: sanity checks here */
	set_pte_at(&init_mm, addr, ptep, mk_pte(page, PAGE_KERNEL));
	return 0;
}

static int vmap_pages_range_hack(unsigned long addr, unsigned long end,
				 struct page **pages)
{
	struct vmap_ctx ctx = { .pages = pages };

	return apply_to_page_range(&init_mm, addr, end - addr,
				   __for_each_pte, &ctx);
}

Anything I miss?

> > . get_vm_area - external
> > . free_vm_area - EXPORT_SYMBOL_GPL
> > . vunmap_range - external
> > . vmalloc_to_page - EXPORT_SYMBOL
> > . apply_to_page_range - EXPORT_SYMBOL_GPL
> >
> > and the last one is pretty much equivalent to vmap_pages_range,
> > hence I'm surprised by the push back against making vmap_pages_range
> > available to bpf.
>
> And the last one we've been trying to get rid of for ages because we
> don't want random modules to

Get rid of EXPORT_SYMBOL from it? Fine by me.
Or are you saying that you have a plan to replace apply_to_page_range()
with something else? With what?

> > > > For example, there is the public ioremap_page_range(), which is used
> > > > to map device memory into addressable kernel space.
> > >
> > > It's not really public. It's a helper for the ioremap implementation,
> > > which really should not be arch specific to start with and is in the
> > > process of being consolidated into common code.
> >
> > Any link to such consolidation of ioremap? I couldn't find one.
>
> Second hit on google:
>
> https://lore.kernel.org/lkml/20230609075528.9390-1-bhe@redhat.com/T/

Thanks. It sounded like you were referring to some future work.
The series that landed was a good cleanup. No questions about it.

> > I surely don't want bpf_arena to cause headaches to mm folks.
> >
> > Anyway, ioremap_page_range() was just an example.
> > I could have used vmap() as an equivalent example.
> > vmap is EXPORT_SYMBOL, btw.
>
> vmap is a good, well-defined API. vmap_pages_range is not.

Since vmap() is nothing but get_vm_area() + vmap_pages_range() and a few
checks... I'm missing the point. Pls elaborate.

> > What bpf_arena needs is pretty much vmap(), but instead of
> > allocating all pages in advance, it allocates and inserts them on demand.
>
> So propose an API that does that instead of exposing random low-level
> details.

The generic_ioremap_prot() and vmap() APIs make sense for the cases when
phys memory exists with a known size. It needs to be vmap-ed and not
touched after.

The bpf_arena use case is similar to kasan, which reserves a giant
virtual memory region and then does apply_to_page_range() to populate
certain pte-s with pages in that region, and later
apply_to_existing_page_range() to free pages in kasan's region.

bpf_arena is very similar, except it currently calls get_vm_area() to
get a 4Gb+guard_pages region, then vmap_pages_range() to populate a page
in it, and vunmap_range() to remove a page.

These existing api-s work, so I'm not sure what you're requesting.
I can guess many different things, but pls clarify to reduce this back
and forth.

Are you worried about range checking? That vmap_pages_range() can
accidentally hit an unintended range?

btw the cover letter and patch 5 explain the higher-level motivation
from the bpf pov in detail. There was a bunch of feedback on that patch,
which was addressed, and the latest version is here:

https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=arena&id=a752b4122071adb5307d7ab3ae6736a9a0e45317
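[The unmap side of the flow described above, sketched the same way as the
earlier arena_* example; vunmap_range() is already declared in
include/linux/vmalloc.h, and freeing the struct page afterwards is left
to the caller's own bookkeeping. The naming is again only illustrative.]

#include <linux/vmalloc.h>
#include <linux/mm.h>

/* Tear down a single previously mapped page inside the reserved region. */
static void arena_unmap_one(struct vm_struct *area, unsigned long off)
{
	unsigned long addr = (unsigned long)area->addr + off;

	vunmap_range(addr, addr + PAGE_SIZE);
}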
On Thu, 15 Feb 2024 at 12:51, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> I didn't invent it. I internalized it based on the feedback received.

No. It's not up to maintainers to suggest alternatives.

Sometimes it's simply enough to explain *why* something isn't
acceptable. A plain "no" without explanation isn't sufficient.
NAKs need a good reason. But they don't need more than that.

The onus of coming up with an acceptable solution is on the person who
needs something new.

            Linus
On Thu, Feb 15, 2024 at 12:50:55PM -0800, Alexei Starovoitov wrote:
> So, since apply_to_page_range() is available to the kernel
> (xen, gpu, kasan, etc) then I see no reason why
> vmap_pages_range() shouldn't be available as well, since:

In case it wasn't clear before: apply_to_page_range is a bad API to be
exported. We've been working on removing it but it stalled. Exposing
something that allows a module to change arbitrary page table bits is
not a good idea.

Please take a step back and think of how to expose a vmalloc-like
allocation that grows only when used as a proper abstraction. I could
actually think of various other uses for it.
On Fri, Feb 16, 2024 at 1:31 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Feb 15, 2024 at 12:50:55PM -0800, Alexei Starovoitov wrote:
> > So, since apply_to_page_range() is available to the kernel
> > (xen, gpu, kasan, etc) then I see no reason why
> > vmap_pages_range() shouldn't be available as well, since:
>
> In case it wasn't clear before: apply_to_page_range is a bad API to be
> exported. We've been working on removing it but it stalled. Exposing
> something that allows a module to change arbitrary page table bits is
> not a good idea.

I never said that a module should do that.

> Please take a step back and think of how to expose a vmalloc-like
> allocation that grows only when used as a proper abstraction. I could
> actually think of various other uses for it.

A "vmalloc-like allocation that grows" is not what I'm after.
I need a 4G+guard region at the start.
Please read my earlier email and reply to my questions and api proposals.
Replying to half of a sentence, and out of context, is not a productive
discussion.
On Fri, Feb 16, 2024 at 08:54:08AM -0800, Alexei Starovoitov wrote:
> On Fri, Feb 16, 2024 at 1:31 AM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Thu, Feb 15, 2024 at 12:50:55PM -0800, Alexei Starovoitov wrote:
> > > So, since apply_to_page_range() is available to the kernel
> > > (xen, gpu, kasan, etc) then I see no reason why
> > > vmap_pages_range() shouldn't be available as well, since:
> >
> > In case it wasn't clear before: apply_to_page_range is a bad API to be
> > exported. We've been working on removing it but it stalled. Exposing
> > something that allows a module to change arbitrary page table bits is
> > not a good idea.
>
> I never said that a module should do that.
>
> > Please take a step back and think of how to expose a vmalloc-like
> > allocation that grows only when used as a proper abstraction. I could
> > actually think of various other uses for it.
>
> A "vmalloc-like allocation that grows" is not what I'm after.
> I need a 4G+guard region at the start.
> Please read my earlier email and reply to my questions and api proposals.
> Replying to half of a sentence, and out of context, is not a productive
> discussion.

1. The concern here is that this interface, which you would like to add,
   exposes the "addr" and "end" to the upper layer, so fake values can
   easily be passed to vmap internals.

2. Other users can start using this API/function, which is hidden now
   and is not supposed to be used outside of vmap code, because it is a
   static helper.

3. It opens new dependencies which we would like to avoid. As a second
   step someone wants to dump "such a region (4G+guard region)" over
   vmallocinfo to see what is mapped, which requires a certain tracking.

--
Uladzislau Rezki
On Fri, Feb 16, 2024 at 9:18 AM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> On Fri, Feb 16, 2024 at 08:54:08AM -0800, Alexei Starovoitov wrote:
> > On Fri, Feb 16, 2024 at 1:31 AM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Thu, Feb 15, 2024 at 12:50:55PM -0800, Alexei Starovoitov wrote:
> > > > So, since apply_to_page_range() is available to the kernel
> > > > (xen, gpu, kasan, etc) then I see no reason why
> > > > vmap_pages_range() shouldn't be available as well, since:
> > >
> > > In case it wasn't clear before: apply_to_page_range is a bad API to be
> > > exported. We've been working on removing it but it stalled. Exposing
> > > something that allows a module to change arbitrary page table bits is
> > > not a good idea.
> >
> > I never said that a module should do that.
> >
> > > Please take a step back and think of how to expose a vmalloc-like
> > > allocation that grows only when used as a proper abstraction. I could
> > > actually think of various other uses for it.
> >
> > A "vmalloc-like allocation that grows" is not what I'm after.
> > I need a 4G+guard region at the start.
> > Please read my earlier email and reply to my questions and api proposals.
> > Replying to half of a sentence, and out of context, is not a productive
> > discussion.
>
> 1. The concern here is that this interface, which you would like to add,
>    exposes the "addr" and "end" to the upper layer, so fake values can
>    easily be passed to vmap internals.
>
> 2. Other users can start using this API/function, which is hidden now
>    and is not supposed to be used outside of vmap code, because it is a
>    static helper.

I suspect you're replying to the original patch that just makes
vmap_pages_range() external. It was discarded already.
The request for feedback is for the vm_area_map_pages proposal upthread:

+int vm_area_map_pages(struct vm_struct *area, unsigned long addr,
+                      unsigned int count, struct page **pages)

There is no "end", and "addr" is range checked.

> 3. It opens new dependencies which we would like to avoid. As a second
>    step someone wants to dump "such a region (4G+guard region)" over
>    vmallocinfo to see what is mapped, which requires a certain tracking.

What do you mean by "dump over /proc/vmallocinfo"?
Privileged user space can see the start/end of every region.
And if some regions have all pages mapped and others don't, so what?
vmallocinfo is racy. By the time user space sees the range it can be
unmapped already.
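[For illustration, a rough sketch of how a wrapper with the signature
quoted above could do that range check, assuming it lives in mm/vmalloc.c
next to the static vmap_pages_range(). This only illustrates the proposal
as quoted in this thread; it is not a claim about what eventually landed
upstream.]

int vm_area_map_pages(struct vm_struct *area, unsigned long addr,
		      unsigned int count, struct page **pages)
{
	unsigned long size = (unsigned long)count << PAGE_SHIFT;
	unsigned long end = addr + size;

	/* Reject overflow and anything outside the caller's own vm_struct. */
	if (end < addr ||
	    addr < (unsigned long)area->addr ||
	    end > (unsigned long)area->addr + get_vm_area_size(area))
		return -ERANGE;

	return vmap_pages_range(addr, end, PAGE_KERNEL, pages, PAGE_SHIFT);
}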
On Fri, Feb 16, 2024 at 08:54:08AM -0800, Alexei Starovoitov wrote:
> A "vmalloc-like allocation that grows" is not what I'm after.
> I need a 4G+guard region at the start.
> Please read my earlier email and reply to my questions and api proposals.
> Replying to half of a sentence, and out of context, is not a productive
> discussion.

If you can't explain what you are trying to do to the reviewers, you're
doing something wrong.
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c720be70c8dd..bafb87c69e3d 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -233,6 +233,8 @@ static inline bool is_vm_area_hugepages(const void *addr)
 #ifdef CONFIG_MMU
 void vunmap_range(unsigned long addr, unsigned long end);
+int vmap_pages_range(unsigned long addr, unsigned long end,
+		pgprot_t prot, struct page **pages, unsigned int page_shift);
 static inline void set_vm_flush_reset_perms(void *addr)
 {
 	struct vm_struct *vm = find_vm_area(addr);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d12a17fc0c17..eae93d575d1b 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -625,8 +625,8 @@ int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
  * RETURNS:
  * 0 on success, -errno on failure.
  */
-static int vmap_pages_range(unsigned long addr, unsigned long end,
-		pgprot_t prot, struct page **pages, unsigned int page_shift)
+int vmap_pages_range(unsigned long addr, unsigned long end,
+		pgprot_t prot, struct page **pages, unsigned int page_shift)
 {
 	int err;