diff mbox series

[bpf-next,03/16] mm: Expose vmap_pages_range() to the rest of the kernel.

Message ID 20240206220441.38311-4-alexei.starovoitov@gmail.com (mailing list archive)
State New
Headers show
Series bpf: Introduce BPF arena. | expand

Commit Message

Alexei Starovoitov Feb. 6, 2024, 10:04 p.m. UTC
From: Alexei Starovoitov <ast@kernel.org>

The next commit will introduce bpf_arena which is a sparsely populated shared
memory region between bpf program and user space process.
It will function similar to vmalloc()/vm_map_ram():
- get_vm_area()
- alloc_pages()
- vmap_pages_range()

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/vmalloc.h | 2 ++
 mm/vmalloc.c            | 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

Comments

Lorenzo Stoakes Feb. 7, 2024, 9:07 p.m. UTC | #1
I don't know what conventions you bpf guys follow, but it's common courtesy
in the rest of the kernel to do a get_maintainers.pl check and figure out
who the maintainers/reviews of a part of the kernel you change are,
and include them in your mailing list.

I've done this for you.

On Tue, Feb 06, 2024 at 02:04:28PM -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> The next commit will introduce bpf_arena which is a sparsely populated shared
> memory region between bpf program and user space process.
> It will function similar to vmalloc()/vm_map_ram():
> - get_vm_area()
> - alloc_pages()
> - vmap_pages_range()

This tells me absolutely nothing about why it is justified to expose this
internal interface. You need to put more explanation here along the lines
of 'we had no other means of achieving what we needed from vmalloc because
X, Y, Z and are absolutely convinced it poses no risk of breaking anything'.

I mean I see a lot of checks in vmap() that aren't in vmap_pages_range()
for instance. We good to expose that, not only for you but for any other
core kernel users?

>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  include/linux/vmalloc.h | 2 ++
>  mm/vmalloc.c            | 4 ++--
>  2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index c720be70c8dd..bafb87c69e3d 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -233,6 +233,8 @@ static inline bool is_vm_area_hugepages(const void *addr)
>
>  #ifdef CONFIG_MMU
>  void vunmap_range(unsigned long addr, unsigned long end);
> +int vmap_pages_range(unsigned long addr, unsigned long end,
> +		     pgprot_t prot, struct page **pages, unsigned int page_shift);
>  static inline void set_vm_flush_reset_perms(void *addr)
>  {
>  	struct vm_struct *vm = find_vm_area(addr);
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d12a17fc0c17..eae93d575d1b 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -625,8 +625,8 @@ int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>   * RETURNS:
>   * 0 on success, -errno on failure.
>   */
> -static int vmap_pages_range(unsigned long addr, unsigned long end,
> -		pgprot_t prot, struct page **pages, unsigned int page_shift)
> +int vmap_pages_range(unsigned long addr, unsigned long end,
> +		     pgprot_t prot, struct page **pages, unsigned int page_shift)
>  {
>  	int err;
>
> --
> 2.34.1
>
Alexei Starovoitov Feb. 7, 2024, 10:56 p.m. UTC | #2
On Wed, Feb 7, 2024 at 1:10 PM Lorenzo Stoakes <lstoakes@gmail.com> wrote:
>
> I don't know what conventions you bpf guys follow, but it's common courtesy
> in the rest of the kernel to do a get_maintainers.pl check and figure out
> who the maintainers/reviews of a part of the kernel you change are,
> and include them in your mailing list.

linux-mm and Johannes was cc-ed.

> On Tue, Feb 06, 2024 at 02:04:28PM -0800, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > The next commit will introduce bpf_arena which is a sparsely populated shared
> > memory region between bpf program and user space process.
> > It will function similar to vmalloc()/vm_map_ram():
> > - get_vm_area()
> > - alloc_pages()
> > - vmap_pages_range()
>
> This tells me absolutely nothing about why it is justified to expose this
> internal interface. You need to put more explanation here along the lines
> of 'we had no other means of achieving what we needed from vmalloc because
> X, Y, Z and are absolutely convinced it poses no risk of breaking anything'.

Full motivation and details are in the cover letter and in the next commit as
commit log of this patch says.
Everyone subscribed to linux-mm has all patches in their mailboxes.

The commit log also mentioned that the next patch does pretty much
what vm_map_ram() does.
Any further details you're looking for?

What 'risk of breaking' are you talking about?

> I mean I see a lot of checks in vmap() that aren't in vmap_pages_range()
> for instance. We good to expose that, not only for you but for any other
> core kernel users?

I could have overlooked something. What specific checks do you have in mind?
Johannes Weiner Feb. 8, 2024, 5:44 a.m. UTC | #3
On Wed, Feb 07, 2024 at 09:07:51PM +0000, Lorenzo Stoakes wrote:
> On Tue, Feb 06, 2024 at 02:04:28PM -0800, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > The next commit will introduce bpf_arena which is a sparsely populated shared
> > memory region between bpf program and user space process.
> > It will function similar to vmalloc()/vm_map_ram():
> > - get_vm_area()
> > - alloc_pages()
> > - vmap_pages_range()
> 
> This tells me absolutely nothing about why it is justified to expose this
> internal interface. You need to put more explanation here along the lines
> of 'we had no other means of achieving what we needed from vmalloc because
> X, Y, Z and are absolutely convinced it poses no risk of breaking anything'.

How about this:

---

BPF would like to use the vmap API to implement a lazily-populated
memory space which can be shared by multiple userspace threads.

The vmap API is generally public and has functions to request and
release areas of kernel address space, as well as functions to map
various types of backing memory into that space.

For example, there is the public ioremap_page_range(), which is used
to map device memory into addressable kernel space.

The new BPF code needs the functionality of vmap_pages_range() in
order to incrementally map privately managed arrays of pages into its
vmap area. Indeed this function used to be public, but became private
when usecases other than vmalloc happened to disappear.

Make it public again for the new external user.

---

> I mean I see a lot of checks in vmap() that aren't in vmap_pages_range()
> for instance. We good to expose that, not only for you but for any other
> core kernel users?

Those are applicable only to the higher-level vmap/vmalloc usecases:
controlling the implied call to get_vm_area; managing the area with
vfree(). They're not relevant for mapping privately-managed pages into
an existing vm area. It's the same pattern and layer of abstraction as
ioremap_pages_range(), which doesn't have any of those checks either.
Alexei Starovoitov Feb. 8, 2024, 11:55 p.m. UTC | #4
On Wed, Feb 7, 2024 at 9:44 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Feb 07, 2024 at 09:07:51PM +0000, Lorenzo Stoakes wrote:
> > On Tue, Feb 06, 2024 at 02:04:28PM -0800, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > The next commit will introduce bpf_arena which is a sparsely populated shared
> > > memory region between bpf program and user space process.
> > > It will function similar to vmalloc()/vm_map_ram():
> > > - get_vm_area()
> > > - alloc_pages()
> > > - vmap_pages_range()
> >
> > This tells me absolutely nothing about why it is justified to expose this
> > internal interface. You need to put more explanation here along the lines
> > of 'we had no other means of achieving what we needed from vmalloc because
> > X, Y, Z and are absolutely convinced it poses no risk of breaking anything'.
>
> How about this:
>
> ---
>
> BPF would like to use the vmap API to implement a lazily-populated
> memory space which can be shared by multiple userspace threads.
>
> The vmap API is generally public and has functions to request and
> release areas of kernel address space, as well as functions to map
> various types of backing memory into that space.
>
> For example, there is the public ioremap_page_range(), which is used
> to map device memory into addressable kernel space.
>
> The new BPF code needs the functionality of vmap_pages_range() in
> order to incrementally map privately managed arrays of pages into its
> vmap area. Indeed this function used to be public, but became private
> when usecases other than vmalloc happened to disappear.
>
> Make it public again for the new external user.

Thank you Johannes!
You've said it better than I ever could.
I'll replace my cryptic commit log with the above in v2.

>
> ---
>
> > I mean I see a lot of checks in vmap() that aren't in vmap_pages_range()
> > for instance. We good to expose that, not only for you but for any other
> > core kernel users?
>
> Those are applicable only to the higher-level vmap/vmalloc usecases:
> controlling the implied call to get_vm_area; managing the area with
> vfree(). They're not relevant for mapping privately-managed pages into
> an existing vm area. It's the same pattern and layer of abstraction as
> ioremap_pages_range(), which doesn't have any of those checks either.
Lorenzo Stoakes Feb. 9, 2024, 6:36 a.m. UTC | #5
On Thu, Feb 08, 2024 at 06:44:35AM +0100, Johannes Weiner wrote:
> On Wed, Feb 07, 2024 at 09:07:51PM +0000, Lorenzo Stoakes wrote:
> > On Tue, Feb 06, 2024 at 02:04:28PM -0800, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > The next commit will introduce bpf_arena which is a sparsely populated shared
> > > memory region between bpf program and user space process.
> > > It will function similar to vmalloc()/vm_map_ram():
> > > - get_vm_area()
> > > - alloc_pages()
> > > - vmap_pages_range()
> >
> > This tells me absolutely nothing about why it is justified to expose this
> > internal interface. You need to put more explanation here along the lines
> > of 'we had no other means of achieving what we needed from vmalloc because
> > X, Y, Z and are absolutely convinced it poses no risk of breaking anything'.
>
> How about this:
>
> ---
>
> BPF would like to use the vmap API to implement a lazily-populated
> memory space which can be shared by multiple userspace threads.
>
> The vmap API is generally public and has functions to request and
> release areas of kernel address space, as well as functions to map
> various types of backing memory into that space.
>
> For example, there is the public ioremap_page_range(), which is used
> to map device memory into addressable kernel space.
>
> The new BPF code needs the functionality of vmap_pages_range() in
> order to incrementally map privately managed arrays of pages into its
> vmap area. Indeed this function used to be public, but became private
> when usecases other than vmalloc happened to disappear.
>
> Make it public again for the new external user.

Thanks yes this is much better!

>
> ---
>
> > I mean I see a lot of checks in vmap() that aren't in vmap_pages_range()
> > for instance. We good to expose that, not only for you but for any other
> > core kernel users?
>
> Those are applicable only to the higher-level vmap/vmalloc usecases:
> controlling the implied call to get_vm_area; managing the area with
> vfree(). They're not relevant for mapping privately-managed pages into
> an existing vm area. It's the same pattern and layer of abstraction as
> ioremap_pages_range(), which doesn't have any of those checks either.

OK that makes more sense re: comparison to ioremap_page_range(). My concern
arises from a couple things - firstly to avoid the exposure of an interface
that might be misinterpreted as acting as if it were a standard vmap() when
it instead skips a lot of checks (e.g. count > totalram_pages()).

Secondly my concern is that this side-steps metadata tracking the use of
the vmap range doesn't it? So there is nothing something coming along and
remapping some other vmalloc memory into that range later right?

It feels like exposing page table code that sits outside of the whole
vmalloc mechanism for other users.

On the other hand... since we already expose ioremap_page_range() and that
has the exact same issue I guess it's moot anyway?
Christoph Hellwig Feb. 14, 2024, 8:31 a.m. UTC | #6
On Wed, Feb 07, 2024 at 09:07:51PM +0000, Lorenzo Stoakes wrote:
> > memory region between bpf program and user space process.
> > It will function similar to vmalloc()/vm_map_ram():
> > - get_vm_area()
> > - alloc_pages()
> > - vmap_pages_range()
> 
> This tells me absolutely nothing about why it is justified to expose this
> internal interface. You need to put more explanation here along the lines
> of 'we had no other means of achieving what we needed from vmalloc because
> X, Y, Z and are absolutely convinced it poses no risk of breaking anything'.
> 
> I mean I see a lot of checks in vmap() that aren't in vmap_pages_range()
> for instance. We good to expose that, not only for you but for any other
> core kernel users?

And as someone who has reviewed these same thing before:

hard NAK.  We need to keep vmalloc internals internal and not start
poking holes into the abstractions after we've got them roughly into
shape.
diff mbox series

Patch

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c720be70c8dd..bafb87c69e3d 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -233,6 +233,8 @@  static inline bool is_vm_area_hugepages(const void *addr)
 
 #ifdef CONFIG_MMU
 void vunmap_range(unsigned long addr, unsigned long end);
+int vmap_pages_range(unsigned long addr, unsigned long end,
+		     pgprot_t prot, struct page **pages, unsigned int page_shift);
 static inline void set_vm_flush_reset_perms(void *addr)
 {
 	struct vm_struct *vm = find_vm_area(addr);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d12a17fc0c17..eae93d575d1b 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -625,8 +625,8 @@  int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
  * RETURNS:
  * 0 on success, -errno on failure.
  */
-static int vmap_pages_range(unsigned long addr, unsigned long end,
-		pgprot_t prot, struct page **pages, unsigned int page_shift)
+int vmap_pages_range(unsigned long addr, unsigned long end,
+		     pgprot_t prot, struct page **pages, unsigned int page_shift)
 {
 	int err;