
[v2,bpf-next,04/20] mm: Expose vmap_pages_range() to the rest of the kernel.

Message ID 20240209040608.98927-5-alexei.starovoitov@gmail.com (mailing list archive)
State New
Series bpf: Introduce BPF arena.

Commit Message

Alexei Starovoitov Feb. 9, 2024, 4:05 a.m. UTC
From: Alexei Starovoitov <ast@kernel.org>

BPF would like to use the vmap API to implement a lazily-populated
memory space which can be shared by multiple userspace threads.

The vmap API is generally public and has functions to request and
release areas of kernel address space, as well as functions to map
various types of backing memory into that space.

For example, there is the public ioremap_page_range(), which is used
to map device memory into addressable kernel space.

The new BPF code needs the functionality of vmap_pages_range() in
order to incrementally map privately managed arrays of pages into its
vmap area. Indeed this function used to be public, but became private
when usecases other than vmalloc happened to disappear.

Make it public again for the new external user.

The next commits will introduce bpf_arena, which is a sparsely populated shared
memory region between a bpf program and a user space process. It will map
privately-managed pages into an existing vm area. It's the same pattern and
layer of abstraction as ioremap_page_range().

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/vmalloc.h | 2 ++
 mm/vmalloc.c            | 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

Comments

Christoph Hellwig Feb. 14, 2024, 8:36 a.m. UTC | #1
NAK.  Please

On Thu, Feb 08, 2024 at 08:05:52PM -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> BPF would like to use the vmap API to implement a lazily-populated
> memory space which can be shared by multiple userspace threads.
> The vmap API is generally public and has functions to request and

What is "the vmap API"?

> For example, there is the public ioremap_page_range(), which is used
> to map device memory into addressable kernel space.

It's not really public.  It's a helper for the ioremap implementation,
which really should not be arch specific to start with and is in the
process of being consolidated into common code.

> The new BPF code needs the functionality of vmap_pages_range() in
> order to incrementally map privately managed arrays of pages into its
> vmap area. Indeed this function used to be public, but became private
> when usecases other than vmalloc happened to disappear.

Yes, for a freaking good reason.  The vmap area is not for general abuse
by random callers.  We have a few of those left, but we need to get rid
of that and not add more.
Alexei Starovoitov Feb. 14, 2024, 8:53 p.m. UTC | #2
On Wed, Feb 14, 2024 at 12:36 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> NAK.  Please

What is the alternative?
Remember, maintainers cannot tell developers "go away".
They must suggest a different path.

> On Thu, Feb 08, 2024 at 08:05:52PM -0800, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > BPF would like to use the vmap API to implement a lazily-populated
> > memory space which can be shared by multiple userspace threads.
> > The vmap API is generally public and has functions to request and
>
> What is "the vmap API"?

I mean an API that manages kernel virtual address space:

. get_vm_area - external
. free_vm_area - EXPORT_SYMBOL_GPL
. vunmap_range - external
. vmalloc_to_page - EXPORT_SYMBOL
. apply_to_page_range - EXPORT_SYMBOL_GPL

and the last one is pretty much equivalent to vmap_pages_range,
hence I'm surprised by push back to make vmap_pages_range available to bpf.

> > For example, there is the public ioremap_page_range(), which is used
> > to map device memory into addressable kernel space.
>
> It's not really public.  It's a helper for the ioremap implementation,
> which really should not be arch specific to start with and is in the
> process of being consolidated into common code.

Any link to such consolidation of ioremap ? I couldn't find one.
I surely don't want bpf_arena to cause headaches to mm folks.

Anyway, ioremap_page_range() was just an example.
I could have used vmap() as an equivalent example.
vmap is EXPORT_SYMBOL, btw.

What bpf_arena needs is pretty much vmap(), but instead of
allocating all pages in advance, allocate them and insert on demand.

As you saw in the next patch bpf_arena does:
get_vm_area(4Gbyte, VM_MAP | VM_USERMAP);
and then alloc_page + vmap_pages_range into this region on demand.
Nothing fancy.
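
To make that concrete, here is a minimal sketch of the on-demand population
flow described above, assuming the vmap_pages_range() signature exported by
this patch; the arena_* helpers, the guard-page sizing and the flags are
illustrative only, not the actual bpf_arena code:

#include <linux/vmalloc.h>
#include <linux/gfp.h>
#include <linux/sizes.h>

static struct vm_struct *arena;	/* 4Gb + guard pages, reserved once */

static int arena_reserve(void)
{
	arena = get_vm_area(SZ_4G + 2 * PAGE_SIZE, VM_MAP | VM_USERMAP);
	return arena ? 0 : -ENOMEM;
}

/* Map a single page at byte offset @off only when it is first needed. */
static int arena_populate_page(unsigned long off)
{
	unsigned long addr = (unsigned long)arena->addr + off;
	struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);

	if (!page)
		return -ENOMEM;
	/* Relies on vmap_pages_range() being visible outside mm/vmalloc.c,
	 * which is exactly what this patch proposes.
	 */
	return vmap_pages_range(addr, addr + PAGE_SIZE, PAGE_KERNEL,
				&page, PAGE_SHIFT);
}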

> > The new BPF code needs the functionality of vmap_pages_range() in
> > order to incrementally map privately managed arrays of pages into its
> > vmap area. Indeed this function used to be public, but became private
> > when usecases other than vmalloc happened to disappear.
>
> Yes, for a freaking good reason.  The vmap area is not for general abuse
> by random callers.  We have a few of those left, but we need to get rid
> of that and not add more.

What do you mean by "vmap area" ? The vmalloc virtual region ?
Are you suggesting that bpf_arena should reserve its own virtual region of
kernel memory instead of vmalloc region ?
That's doable, but I don't quite see the point.
Instead of VMALLOC_START/END we can carve a bpf specific region and
do __get_vm_area_node() from there, but why?
vmalloc region fits the best.
bpf_arena's mm manipulations don't interfere with kasan either.

Or you meant vm_map_ram() ? Don't care about those. bpf_arena doesn't
touch that.
Christoph Hellwig Feb. 15, 2024, 6:58 a.m. UTC | #3
On Wed, Feb 14, 2024 at 12:53:42PM -0800, Alexei Starovoitov wrote:
> On Wed, Feb 14, 2024 at 12:36 AM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > NAK.  Please
> 
> What is the alternative?
> Remember, maintainers cannot tell developers "go away".
> They must suggest a different path.

That criterion is something you've made up.  Telling that something
is not ok is the most important job of not just maintainers but all
developers.  Maybe start with a description of the problem you're
solving and why you think it matters and needs different APIs.

> . get_vm_area - external
> . free_vm_area - EXPORT_SYMBOL_GPL
> . vunmap_range - external
> . vmalloc_to_page - EXPORT_SYMBOL
> . apply_to_page_range - EXPORT_SYMBOL_GPL
> 
> and the last one is pretty much equivalent to vmap_pages_range,
> hence I'm surprised by push back to make vmap_pages_range available to bpf.

And the last one we've been trying to get rid of for ages because we don't
want random modules to

> > > For example, there is the public ioremap_page_range(), which is used
> > > to map device memory into addressable kernel space.
> >
> > It's not really public.  It's a helper for the ioremap implementation,
> > which really should not be arch specific to start with and is in the
> > process of being consolidated into common code.
> 
> Any link to such consolidation of ioremap ? I couldn't find one.

Second hit on google:

https://lore.kernel.org/lkml/20230609075528.9390-1-bhe@redhat.com/T/

> I surely don't want bpf_arena to cause headaches to mm folks.
> 
> Anyway, ioremap_page_range() was just an example.
> I could have used vmap() as an equivalent example.
> vmap is EXPORT_SYMBOL, btw.

vmap is a good well defined API.  vmap_pages_range is not.

> What bpf_arena needs is pretty much vmap(), but instead of
> allocating all pages in advance, allocate them and insert on demand.

So propose an API that does that instead of exposing random low-level
details.
Alexei Starovoitov Feb. 15, 2024, 8:50 p.m. UTC | #4
On Wed, Feb 14, 2024 at 10:58 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Feb 14, 2024 at 12:53:42PM -0800, Alexei Starovoitov wrote:
> > On Wed, Feb 14, 2024 at 12:36 AM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > NAK.  Please
> >
> > What is the alternative?
> > Remember, maintainers cannot tell developers "go away".
> > They must suggest a different path.
>
> That criterion is something you've made up.

I didn't invent it. I internalized it based on the feedback received.

> Telling that something
> is not ok is the most important job of not just maintainers but all
> developers.

I'm not saying that maintainers should not say "no",
I'm saying that maintainers should say "no", understand the problem
being solved, and suggest an alternative.

> Maybe start with a description of the problem you're
> solving and why you think it matters and needs different APIs.

bpf_arena doesn't need a different api. These 5 api-s below are enough.
I'm saying that vmap_pages_range() is equivalent to apply_to_page_range()
for all practical purposes.
So, since apply_to_page_range() is available to the kernel
(xen, gpu, kasan, etc) then I see no reason why
vmap_pages_range() shouldn't be available as well, since:

struct vmap_ctx {
     struct page **pages;
     int idx;
};

static int __for_each_pte(pte_t *ptep, unsigned long addr, void *data)
{
     struct vmap_ctx *ctx = data;
     struct page *page = ctx->pages[ctx->idx++];

     /* TODO: sanity checks here */
     set_pte_at(&init_mm, addr, ptep, mk_pte(page, PAGE_KERNEL));
     return 0;
}

static int vmap_pages_range_hack(unsigned long addr, unsigned long end,
                                 struct page **pages)
{
    struct vmap_ctx ctx = { .pages = pages };

    return apply_to_page_range(&init_mm, addr, end - addr,
                               __for_each_pte, &ctx);
}

Anything I miss?

> > . get_vm_area - external
> > . free_vm_area - EXPORT_SYMBOL_GPL
> > . vunmap_range - external
> > . vmalloc_to_page - EXPORT_SYMBOL
> > . apply_to_page_range - EXPORT_SYMBOL_GPL
> >
> > and the last one is pretty much equivalent to vmap_pages_range,
> > hence I'm surprised by push back to make vmap_pages_range available to bpf.
>
> And the last one we've been trying to get rid of for ages because we don't
> want random modules to

Get rid of EXPORT_SYMBOL from it? Fine by me.
Or you're saying that you have a plan to replace apply_to_page_range()
with something else ? With what ?

> > > > For example, there is the public ioremap_page_range(), which is used
> > > > to map device memory into addressable kernel space.
> > >
> > > It's not really public.  It's a helper for the ioremap implementation,
> > > which really should not be arch specific to start with and is in the
> > > process of being consolidated into common code.
> >
> > Any link to such consolidation of ioremap ? I couldn't find one.
>
> Second hit on google:
>
> https://lore.kernel.org/lkml/20230609075528.9390-1-bhe@redhat.com/T/

Thanks.
It sounded like you were referring to some future work.
The series that landed was a good cleanup.
No questions about it.

> > I surely don't want bpf_arena to cause headaches to mm folks.
> >
> > Anyway, ioremap_page_range() was just an example.
> > I could have used vmap() as an equivalent example.
> > vmap is EXPORT_SYMBOL, btw.
>
> vmap is a good well defined API.  vmap_pages_range is not.

since vmap() is nothing but get_vm_area() + vmap_pages_range()
and a few checks... I'm missing the point.
Pls elaborate.

> > What bpf_arena needs is pretty much vmap(), but instead of
> > allocating all pages in advance, allocate them and insert on demand.
>
> So propose an API that does that instead of exposing random low-level
> details.

The generic_ioremap_prot() and vmap() APIs make sense for the cases
when phys memory exists with a known size. It needs to be vmap-ed and
not touched after.
bpf_arena use case is similar to kasan which
reserves a giant virtual memory region, and then
does apply_to_page_range() to populate certain pte-s with pages in that region,
and later apply_to_existing_page_range() to free pages in kasan's region.

bpf_arena is very similar, except it currently calls get_vm_area()
to get a 4Gb+guard_pages region, and then vmap_pages_range() to
populate a page in it, and vunmap_range() to remove a page.

These existing api-s work, so not sure what you're requesting.
I can guess many different things, but pls clarify to reduce
this back and forth.
Are you worried about range checking? That vmap_pages_range()
can accidentally hit an unintended range?

btw the cover letter and patch 5 explain the higher level motivation
from bpf pov in detail.
There was a bunch of feedback on that patch, which was addressed,
and the latest version is here:
https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=arena&id=a752b4122071adb5307d7ab3ae6736a9a0e45317
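
For the removal path mentioned above (vunmap_range() to drop a page), a
similarly hedged sketch, reusing the hypothetical arena helper from the
earlier sketch, could look like:

/* Unmap and free one previously populated page at byte offset @off.
 * Look the page up before the pte is cleared, then release it.
 */
static void arena_unpopulate_page(unsigned long off)
{
	unsigned long addr = (unsigned long)arena->addr + off;
	struct page *page = vmalloc_to_page((void *)addr);

	vunmap_range(addr, addr + PAGE_SIZE);
	if (page)
		__free_page(page);
}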
Linus Torvalds Feb. 15, 2024, 9:26 p.m. UTC | #5
On Thu, 15 Feb 2024 at 12:51, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> I didn't invent it. I internalized it based on the feedback received.

No. It's not up to maintainers to suggest alternatives. Sometimes it's
simply enough to explain *why* something isn't acceptable.

A plain "no" without explanation isn't sufficient. NAKs need a good
reason. But they don't need more than that.

The onus of coming up with an acceptable solution is on the person who
needs something new.

          Linus
Christoph Hellwig Feb. 16, 2024, 9:31 a.m. UTC | #6
On Thu, Feb 15, 2024 at 12:50:55PM -0800, Alexei Starovoitov wrote:
> So, since apply_to_page_range() is available to the kernel
> (xen, gpu, kasan, etc) then I see no reason why
> vmap_pages_range() shouldn't be available as well, since:

In case it wasn't clear before:  apply_to_page_range is a bad API to
be exported.  We've been working on removing it but it stalled.
Exposing something that allows a module to change arbitrary page table
bits is not a good idea.

Please take a step back and think of how to expose a vmalloc like
allocation that grows only when used as a proper abstraction.  I could
actually think of various other uses for it.
Alexei Starovoitov Feb. 16, 2024, 4:54 p.m. UTC | #7
On Fri, Feb 16, 2024 at 1:31 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Feb 15, 2024 at 12:50:55PM -0800, Alexei Starovoitov wrote:
> > So, since apply_to_page_range() is available to the kernel
> > (xen, gpu, kasan, etc) then I see no reason why
> > vmap_pages_range() shouldn't be available as well, since:
>
> In case it wasn't clear before:  apply_to_page_range is a bad API to
> be exported.  We've been working on removing it but it stalled.
> Exposing something that allows a module to change arbitrary page table
> bits is not a good idea.

I never said that that module should do that.

> Please take a step back and think of how to expose a vmalloc like
> allocation that grows only when used as a proper abstraction.  I could
> actually think of various other uses for it.

"vmalloc like allocation that grows" is not what I'm after.
I need 4G+guard region at the start.
Please read my earlier email and reply to my questions and api proposals.
Replying to half of the sentence, and out of context, is not a
productive discussion.
Uladzislau Rezki Feb. 16, 2024, 5:18 p.m. UTC | #8
On Fri, Feb 16, 2024 at 08:54:08AM -0800, Alexei Starovoitov wrote:
> On Fri, Feb 16, 2024 at 1:31 AM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Thu, Feb 15, 2024 at 12:50:55PM -0800, Alexei Starovoitov wrote:
> > > So, since apply_to_page_range() is available to the kernel
> > > (xen, gpu, kasan, etc) then I see no reason why
> > > vmap_pages_range() shouldn't be available as well, since:
> >
> > In case it wasn't clear before:  apply_to_page_range is a bad API to
> > be exported.  We've been working on removing it but it stalled.
> > Exposing something that allows a module to change arbitrary page table
> > bits is not a good idea.
> 
> I never said that that module should do that.
> 
> > Please take a step back and think of how to expose a vmalloc like
> > allocation that grows only when used as a proper abstraction.  I could
> > actually think of various other uses for it.
> 
> "vmalloc like allocation that grows" is not what I'm after.
> I need 4G+guard region at the start.
> Please read my earlier email and reply to my questions and api proposals.
> Replying to half of the sentence, and out of context, is not a
> productive discussion.
>
1. The concern here is that this interface, which you would like to add,
exposes "addr" and "end" to the upper layer, so fake values can easily be
passed to vmap internals.

2. Other users can start using this API/function, which is currently hidden
and not supposed to be used outside of vmap code, because it is a static
helper.

3. It opens new dependencies which we would like to avoid. As a second
step, someone will want to dump such a region (4G + guard region) over
vmallocinfo to see what is mapped, which requires certain tracking.

--
Uladzislau Rezki
Alexei Starovoitov Feb. 18, 2024, 2:06 a.m. UTC | #9
On Fri, Feb 16, 2024 at 9:18 AM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> On Fri, Feb 16, 2024 at 08:54:08AM -0800, Alexei Starovoitov wrote:
> > On Fri, Feb 16, 2024 at 1:31 AM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Thu, Feb 15, 2024 at 12:50:55PM -0800, Alexei Starovoitov wrote:
> > > > So, since apply_to_page_range() is available to the kernel
> > > > (xen, gpu, kasan, etc) then I see no reason why
> > > > vmap_pages_range() shouldn't be available as well, since:
> > >
> > > In case it wasn't clear before:  apply_to_page_range is a bad API to
> > > be exported.  We've been working on removing it but it stalled.
> > > Exposing something that allows a module to change arbitrary page table
> > > bits is not a good idea.
> >
> > I never said that that module should do that.
> >
> > > Please take a step back and think of how to expose a vmalloc like
> > > allocation that grows only when used as a proper abstraction.  I could
> > > actually think of various other uses for it.
> >
> > "vmalloc like allocation that grows" is not what I'm after.
> > I need 4G+guard region at the start.
> > Please read my earlier email and reply to my questions and api proposals.
> > Replying to half of the sentence, and out of context, is not a
> > productive discussion.
> >
> 1. The concern here is that this interface, which you would like to add,
> exposes "addr" and "end" to the upper layer, so fake values can easily be
> passed to vmap internals.
>
> 2. Other users can start using this API/function, which is currently hidden
> and not supposed to be used outside of vmap code, because it is a static
> helper.

I suspect you're replying to the original patch that just
makes vmap_pages_range() external.
It was discarded already.
The request for feedback is for vm_area_map_pages proposal upthread:

+int vm_area_map_pages(struct vm_struct *area, unsigned long addr,
+                      unsigned int count, struct page **pages)

There is no "end" and "addr" is range checked.
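
One plausible shape of such a wrapper, given purely as an illustration of the
range checking being discussed and assuming get_vm_area_size() plus the
vmap_pages_range() from this patch (the real proposal may differ):

int vm_area_map_pages(struct vm_struct *area, unsigned long addr,
		      unsigned int count, struct page **pages)
{
	unsigned long size = (unsigned long)count << PAGE_SHIFT;
	unsigned long end = addr + size;

	/* Refuse to map anything outside the caller's own vm area. */
	if (addr < (unsigned long)area->addr ||
	    end > (unsigned long)area->addr + get_vm_area_size(area))
		return -ERANGE;

	return vmap_pages_range(addr, end, PAGE_KERNEL, pages, PAGE_SHIFT);
}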

> 3. It opens new dependencies which we would like to avoid. As a second
> step, someone will want to dump such a region (4G + guard region) over
> vmallocinfo to see what is mapped, which requires certain tracking.

What do you mean by "dump over /proc/vmallocinfo" ?
Privileged user space can see the start/end of every region.
And if some regions have all pages mapped and others don't, so?
vmallocinfo is racy. By the time user space sees the range
it can be unmapped already.
Christoph Hellwig Feb. 20, 2024, 6:57 a.m. UTC | #10
On Fri, Feb 16, 2024 at 08:54:08AM -0800, Alexei Starovoitov wrote:
> "vmalloc like allocation that grows" is not what I'm after.
> I need 4G+guard region at the start.
> Please read my earlier email and reply to my questions and api proposals.
> Replying to half of the sentence, and out of context, is not a
> productive discussion.

If you can't explain what you are trying to do to the reviewers, you're
doing something wrong.

Patch

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c720be70c8dd..bafb87c69e3d 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -233,6 +233,8 @@  static inline bool is_vm_area_hugepages(const void *addr)
 
 #ifdef CONFIG_MMU
 void vunmap_range(unsigned long addr, unsigned long end);
+int vmap_pages_range(unsigned long addr, unsigned long end,
+		     pgprot_t prot, struct page **pages, unsigned int page_shift);
 static inline void set_vm_flush_reset_perms(void *addr)
 {
 	struct vm_struct *vm = find_vm_area(addr);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d12a17fc0c17..eae93d575d1b 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -625,8 +625,8 @@  int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
  * RETURNS:
  * 0 on success, -errno on failure.
  */
-static int vmap_pages_range(unsigned long addr, unsigned long end,
-		pgprot_t prot, struct page **pages, unsigned int page_shift)
+int vmap_pages_range(unsigned long addr, unsigned long end,
+		     pgprot_t prot, struct page **pages, unsigned int page_shift)
 {
 	int err;