mbox series

[RFC,0/3] Group pages on the direct map for permissioned vmallocs

Message ID 20210405203711.1095940-1-rick.p.edgecombe@intel.com (mailing list archive)
Headers show
Series Group pages on the direct map for permissioned vmallocs | expand

Message

Edgecombe, Rick P April 5, 2021, 8:37 p.m. UTC
Hi,

This is a followup to the previous attempt to overhaul how vmalloc permissions
are done:
https://lore.kernel.org/lkml/20201120202426.18009-1-rick.p.edgecombe@intel.com/

In working on a next version it dawned on me that we can significantly reduce
direct map breakage on x86 with a much less invasive change, so I thought maybe
I would start there in the meantime.

In a test of booting fedora and running the BPF unit tests, this reduced 4k
direct map pages by 98%.

It simply uses pages for x86 module_alloc() mappings from a cache created out of
2MB pages. This results in all the later breakage clustering in 2MB blocks on
the direct map. The trade-off is colder pages are used for these allocations.
All module_alloc() users (modules, ebpf jit, ftrace, kprobes) get this behavior.

Potentially this behavior should be enabled for eBPF byte code allocations as
well in the case of !CONFIG_BPF_JIT_ALWAYS_ON.

The new APIs and invasive changes in the callers can happen after vmalloc huge
pages bring more benefits. Although, I can post shootdown reduction changes with
previous comments integrated if anyone disagrees.

Based on v5.11.

Thanks,

Rick


Rick Edgecombe (3):
  list: Support getting most recent element in list_lru
  vmalloc: Support grouped page allocations
  x86/module: Use VM_GROUP_PAGES flag

 arch/x86/Kconfig         |   1 +
 arch/x86/kernel/module.c |   2 +-
 include/linux/list_lru.h |  13 +++
 include/linux/vmalloc.h  |   1 +
 mm/Kconfig               |   9 ++
 mm/list_lru.c            |  28 +++++
 mm/vmalloc.c             | 215 +++++++++++++++++++++++++++++++++++++--
 7 files changed, 257 insertions(+), 12 deletions(-)

Comments

Dave Hansen April 5, 2021, 9:01 p.m. UTC | #1
On 4/5/21 1:37 PM, Rick Edgecombe wrote:
> +static void __dispose_pages(struct list_head *head)
> +{
> +	struct list_head *cur, *next;
> +
> +	list_for_each_safe(cur, next, head) {
> +		list_del(cur);
> +
> +		/* The list head is stored at the start of the page */
> +		free_page((unsigned long)cur);
> +	}
> +}

This is interesting.

While the page is in the allocator, you're using the page contents
themselves to store the list_head.  It took me a minute to figure out
what you were doing here because: "start of the page" is a bit
ambiguous.  It could mean:

 * the first 16 bytes in 'struct page'
or
 * the first 16 bytes in the page itself, aka *page_address(page)

The fact that this doesn't work on higmem systems makes this an OK thing
to do, but it is a bit weird.  It's also doubly susceptible to bugs
where there's a page_to_virt() or virt_to_page() screwup.

I was *hoping* there was still sufficient space in 'struct page' for
this second list_head in addition to page->lru.  I think there *should*
be.  That would at least make this allocator a bit more "normal" in not
caring about page contents while the page is free in the allocator.  If
you were able to do that you could do things like kmemcheck or page
alloc debugging while the page is in the allocator.

Anyway, I think I'd prefer that you *try* to use 'struct page' alone.
But, if that doesn't work out, please comment the snot out of this thing
because it _is_ weird.
Matthew Wilcox April 5, 2021, 9:32 p.m. UTC | #2
On Mon, Apr 05, 2021 at 02:01:58PM -0700, Dave Hansen wrote:
> On 4/5/21 1:37 PM, Rick Edgecombe wrote:
> > +static void __dispose_pages(struct list_head *head)
> > +{
> > +	struct list_head *cur, *next;
> > +
> > +	list_for_each_safe(cur, next, head) {
> > +		list_del(cur);
> > +
> > +		/* The list head is stored at the start of the page */
> > +		free_page((unsigned long)cur);
> > +	}
> > +}
> 
> This is interesting.
> 
> While the page is in the allocator, you're using the page contents
> themselves to store the list_head.  It took me a minute to figure out
> what you were doing here because: "start of the page" is a bit
> ambiguous.  It could mean:
> 
>  * the first 16 bytes in 'struct page'
> or
>  * the first 16 bytes in the page itself, aka *page_address(page)
> 
> The fact that this doesn't work on higmem systems makes this an OK thing
> to do, but it is a bit weird.  It's also doubly susceptible to bugs
> where there's a page_to_virt() or virt_to_page() screwup.
> 
> I was *hoping* there was still sufficient space in 'struct page' for
> this second list_head in addition to page->lru.  I think there *should*
> be.  That would at least make this allocator a bit more "normal" in not
> caring about page contents while the page is free in the allocator.  If
> you were able to do that you could do things like kmemcheck or page
> alloc debugging while the page is in the allocator.
> 
> Anyway, I think I'd prefer that you *try* to use 'struct page' alone.
> But, if that doesn't work out, please comment the snot out of this thing
> because it _is_ weird.

Hi!  Current closest-thing-we-have-to-an-expert-on-struct-page here!

I haven't read over these patches yet.  If these pages are in use by
vmalloc, they can't use mapping+index because get_user_pages() will call
page_mapping() and the list_head will confuse it.  I think it could use
index+private for a list_head.

If the pages are in the buddy, I _think_ mapping+index are free.  private
is in use for buddy order.  But I haven't read through the buddy code
in a while.

Does it need to be a doubly linked list?  Can it be an hlist?
Edgecombe, Rick P April 5, 2021, 9:38 p.m. UTC | #3
On Mon, 2021-04-05 at 14:01 -0700, Dave Hansen wrote:
> On 4/5/21 1:37 PM, Rick Edgecombe wrote:
> > +static void __dispose_pages(struct list_head *head)
> > +{
> > +       struct list_head *cur, *next;
> > +
> > +       list_for_each_safe(cur, next, head) {
> > +               list_del(cur);
> > +
> > +               /* The list head is stored at the start of the page
> > */
> > +               free_page((unsigned long)cur);
> > +       }
> > +}
> 
> This is interesting.
> 
> While the page is in the allocator, you're using the page contents
> themselves to store the list_head.  It took me a minute to figure out
> what you were doing here because: "start of the page" is a bit
> ambiguous.  It could mean:
> 
>  * the first 16 bytes in 'struct page'
> or
>  * the first 16 bytes in the page itself, aka *page_address(page)
> 
> The fact that this doesn't work on higmem systems makes this an OK
> thing
> to do, but it is a bit weird.  It's also doubly susceptible to bugs
> where there's a page_to_virt() or virt_to_page() screwup.
> 
> I was *hoping* there was still sufficient space in 'struct page' for
> this second list_head in addition to page->lru.  I think there
> *should*
> be.  That would at least make this allocator a bit more "normal" in
> not
> caring about page contents while the page is free in the allocator. 
> If
> you were able to do that you could do things like kmemcheck or page
> alloc debugging while the page is in the allocator.
> 
> Anyway, I think I'd prefer that you *try* to use 'struct page' alone.
> But, if that doesn't work out, please comment the snot out of this
> thing
> because it _is_ weird.

Yes sorry, that deserved more explanation. I tried putting it in struct
page actually. The problem was list_lru automatically determines the
node id from the list_head provided to it via
page_to_nid(virt_to_page(head)). I guess it assumes the list_head is on
the actual item. I started adding another list_lru function that let
the node id be passed in separately, but I remembered this trick from
the deferred free list in vmalloc.

If this ever expands to handle direct map unmapped pages (which would
probably be the next step), the list_head will have to be moved out of
the actual page anyway. But in the meantime it resulted in the smallest
change.

I can try the other way if it's still too weird.
Edgecombe, Rick P April 5, 2021, 9:49 p.m. UTC | #4
On Mon, 2021-04-05 at 22:32 +0100, Matthew Wilcox wrote:
> On Mon, Apr 05, 2021 at 02:01:58PM -0700, Dave Hansen wrote:
> > On 4/5/21 1:37 PM, Rick Edgecombe wrote:
> > > +static void __dispose_pages(struct list_head *head)
> > > +{
> > > +       struct list_head *cur, *next;
> > > +
> > > +       list_for_each_safe(cur, next, head) {
> > > +               list_del(cur);
> > > +
> > > +               /* The list head is stored at the start of the
> > > page */
> > > +               free_page((unsigned long)cur);
> > > +       }
> > > +}
> > 
> > This is interesting.
> > 
> > While the page is in the allocator, you're using the page contents
> > themselves to store the list_head.  It took me a minute to figure
> > out
> > what you were doing here because: "start of the page" is a bit
> > ambiguous.  It could mean:
> > 
> >  * the first 16 bytes in 'struct page'
> > or
> >  * the first 16 bytes in the page itself, aka *page_address(page)
> > 
> > The fact that this doesn't work on higmem systems makes this an OK
> > thing
> > to do, but it is a bit weird.  It's also doubly susceptible to bugs
> > where there's a page_to_virt() or virt_to_page() screwup.
> > 
> > I was *hoping* there was still sufficient space in 'struct page'
> > for
> > this second list_head in addition to page->lru.  I think there
> > *should*
> > be.  That would at least make this allocator a bit more "normal" in
> > not
> > caring about page contents while the page is free in the
> > allocator.  If
> > you were able to do that you could do things like kmemcheck or page
> > alloc debugging while the page is in the allocator.
> > 
> > Anyway, I think I'd prefer that you *try* to use 'struct page'
> > alone.
> > But, if that doesn't work out, please comment the snot out of this
> > thing
> > because it _is_ weird.
> 
> Hi!  Current closest-thing-we-have-to-an-expert-on-struct-page here!
> 
> I haven't read over these patches yet.  If these pages are in use by
> vmalloc, they can't use mapping+index because get_user_pages() will
> call
> page_mapping() and the list_head will confuse it.  I think it could
> use
> index+private for a list_head.
> 
> If the pages are in the buddy, I _think_ mapping+index are free. 
> private
> is in use for buddy order.  But I haven't read through the buddy code
> in a while.
> 
> Does it need to be a doubly linked list?  Can it be an hlist?

It does need to be a doubly linked list. I think they should never be
mapped to userspace. As far as the page allocator is concerned these
pages are not free. And they are not compound.

Originally I was just using the lru member. Would it be ok in that
case?