Message ID | 20220415164413.2727220-1-song@kernel.org (mailing list archive)
Series | vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP
On Fri, Apr 15, 2022 at 09:44:09AM -0700, Song Liu wrote: > Changes v3 => v4: > 1. Fix __weak module_alloc_huge; remove unused vmalloc_huge; rename > __vmalloc_huge => vmalloc_huge. (Christoph Hellwig) > 2. Use vzalloc (as it was before vmalloc_no_huge) and clean up comments in > kvm_s390_pv_alloc_vm. > > Changes v2 => v3: > 1. Use __vmalloc_huge in alloc_large_system_hash. > 2. Use EXPORT_SYMBOL_GPL for new functions. (Christoph Hellwig) > 3. Add more description about the issues and changes.(Christoph Hellwig, > Rick Edgecombe). > > Changes v1 => v2: > 1. Add vmalloc_huge(). (Christoph Hellwig) > 2. Add module_alloc_huge(). (Christoph Hellwig) > 3. Add Fixes tag and Link tag. (Thorsten Leemhuis) > > Enabling HAVE_ARCH_HUGE_VMALLOC on x86_64 and use it for bpf_prog_pack has > caused some issues [1], as many users of vmalloc are not yet ready to > handle huge pages. To enable a more smooth transition to use huge page > backed vmalloc memory, this set replaces VM_NO_HUGE_VMAP flag with an new > opt-in flag, VM_ALLOW_HUGE_VMAP. More discussions about this topic can be > found at [2]. > > Patch 1 removes VM_NO_HUGE_VMAP and adds VM_ALLOW_HUGE_VMAP. > Patch 2 uses VM_ALLOW_HUGE_VMAP in bpf_prog_pack. > > [1] https://lore.kernel.org/lkml/20220204185742.271030-1-song@kernel.org/ > [2] https://lore.kernel.org/linux-mm/20220330225642.1163897-1-song@kernel.org/ Looks good except for that I think this should just wait for v5.19. The fixes are so large I can't see why this needs to be rushed in other than the first assumptions of the optimizations had some flaws addressed here. If this goes through v5.19 expect conflicts on modules unless you use modules-testing. Luis
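As a rough sketch of what the opt-in described in the cover letter looks like (reconstructed from the changelog, not copied from the patch itself, so the exact body in 1/4 may differ), huge-page backing now has to be requested explicitly:

```c
/*
 * Sketch of the opt-in: huge-page backing is only attempted when the caller
 * passes VM_ALLOW_HUGE_VMAP, instead of being the default that callers had
 * to turn off with VM_NO_HUGE_VMAP.
 */
void *vmalloc_huge(unsigned long size, gfp_t gfp_mask)
{
	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
				    gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
				    NUMA_NO_NODE, __builtin_return_address(0));
}
EXPORT_SYMBOL_GPL(vmalloc_huge);
```

Callers that never opt in keep getting PAGE_SIZE ptes, which is what makes the change a candidate for backporting.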
On Fri, Apr 15, 2022 at 12:05 PM Luis Chamberlain <mcgrof@kernel.org> wrote: > > On Fri, Apr 15, 2022 at 09:44:09AM -0700, Song Liu wrote: > > Changes v3 => v4: > > 1. Fix __weak module_alloc_huge; remove unused vmalloc_huge; rename > > __vmalloc_huge => vmalloc_huge. (Christoph Hellwig) > > 2. Use vzalloc (as it was before vmalloc_no_huge) and clean up comments in > > kvm_s390_pv_alloc_vm. > > > > Changes v2 => v3: > > 1. Use __vmalloc_huge in alloc_large_system_hash. > > 2. Use EXPORT_SYMBOL_GPL for new functions. (Christoph Hellwig) > > 3. Add more description about the issues and changes.(Christoph Hellwig, > > Rick Edgecombe). > > > > Changes v1 => v2: > > 1. Add vmalloc_huge(). (Christoph Hellwig) > > 2. Add module_alloc_huge(). (Christoph Hellwig) > > 3. Add Fixes tag and Link tag. (Thorsten Leemhuis) > > > > Enabling HAVE_ARCH_HUGE_VMALLOC on x86_64 and use it for bpf_prog_pack has > > caused some issues [1], as many users of vmalloc are not yet ready to > > handle huge pages. To enable a more smooth transition to use huge page > > backed vmalloc memory, this set replaces VM_NO_HUGE_VMAP flag with an new > > opt-in flag, VM_ALLOW_HUGE_VMAP. More discussions about this topic can be > > found at [2]. > > > > Patch 1 removes VM_NO_HUGE_VMAP and adds VM_ALLOW_HUGE_VMAP. > > Patch 2 uses VM_ALLOW_HUGE_VMAP in bpf_prog_pack. > > > > [1] https://lore.kernel.org/lkml/20220204185742.271030-1-song@kernel.org/ > > [2] https://lore.kernel.org/linux-mm/20220330225642.1163897-1-song@kernel.org/ > > Looks good except for that I think this should just wait for v5.19. The > fixes are so large I can't see why this needs to be rushed in other than > the first assumptions of the optimizations had some flaws addressed here. We need these changes to fix issues like [3]. Note that there might still be some undiscovered issues with huge page backed vmalloc memory on powerpc, which had HAVE_ARCH_HUGE_VMALLOC enabled since the 5.15 kernel. As we agreed, the new opt-in flag is a safer approach here. We probably should have 1/4 and 2/4 back ported to stable. Therefore, I think shipping this set now would give us a more reliable 5.18 release. Does this make sense? Thanks, Song [3] https://lore.kernel.org/lkml/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/
On Fri, Apr 15, 2022 at 06:34:16PM -0700, Song Liu wrote: > On Fri, Apr 15, 2022 at 12:05 PM Luis Chamberlain <mcgrof@kernel.org> wrote: > > > > On Fri, Apr 15, 2022 at 09:44:09AM -0700, Song Liu wrote: > > > Changes v3 => v4: > > > 1. Fix __weak module_alloc_huge; remove unused vmalloc_huge; rename > > > __vmalloc_huge => vmalloc_huge. (Christoph Hellwig) > > > 2. Use vzalloc (as it was before vmalloc_no_huge) and clean up comments in > > > kvm_s390_pv_alloc_vm. > > > > > > Changes v2 => v3: > > > 1. Use __vmalloc_huge in alloc_large_system_hash. > > > 2. Use EXPORT_SYMBOL_GPL for new functions. (Christoph Hellwig) > > > 3. Add more description about the issues and changes.(Christoph Hellwig, > > > Rick Edgecombe). > > > > > > Changes v1 => v2: > > > 1. Add vmalloc_huge(). (Christoph Hellwig) > > > 2. Add module_alloc_huge(). (Christoph Hellwig) > > > 3. Add Fixes tag and Link tag. (Thorsten Leemhuis) > > > > > > Enabling HAVE_ARCH_HUGE_VMALLOC on x86_64 and use it for bpf_prog_pack has > > > caused some issues [1], as many users of vmalloc are not yet ready to > > > handle huge pages. To enable a more smooth transition to use huge page > > > backed vmalloc memory, this set replaces VM_NO_HUGE_VMAP flag with an new > > > opt-in flag, VM_ALLOW_HUGE_VMAP. More discussions about this topic can be > > > found at [2]. > > > > > > Patch 1 removes VM_NO_HUGE_VMAP and adds VM_ALLOW_HUGE_VMAP. > > > Patch 2 uses VM_ALLOW_HUGE_VMAP in bpf_prog_pack. > > > > > > [1] https://lore.kernel.org/lkml/20220204185742.271030-1-song@kernel.org/ > > > [2] https://lore.kernel.org/linux-mm/20220330225642.1163897-1-song@kernel.org/ > > > > Looks good except for that I think this should just wait for v5.19. The > > fixes are so large I can't see why this needs to be rushed in other than > > the first assumptions of the optimizations had some flaws addressed here. > > We need these changes to fix issues like [3]. Note that there might > still be some > undiscovered issues with huge page backed vmalloc memory on powerpc, which > had HAVE_ARCH_HUGE_VMALLOC enabled since the 5.15 kernel. As we > agreed, the new opt-in flag is a safer approach here. We probably should have > 1/4 and 2/4 back ported to stable. Therefore, I think shipping this > set now would > give us a more reliable 5.18 release. > > Does this make sense? Yes absolutely, but that sounds like that optimization should just be reverted completely from v5.18 isntead then no? Or if one can skip the optimizations for v5.18 with a small patch, and then enable for v5.19. Luis
On Fri, Apr 15, 2022 at 06:42:27PM -0700, Luis Chamberlain wrote: > On Fri, Apr 15, 2022 at 06:34:16PM -0700, Song Liu wrote: > > On Fri, Apr 15, 2022 at 12:05 PM Luis Chamberlain <mcgrof@kernel.org> wrote: > > > > > > On Fri, Apr 15, 2022 at 09:44:09AM -0700, Song Liu wrote: > > > > Changes v3 => v4: > > > > 1. Fix __weak module_alloc_huge; remove unused vmalloc_huge; rename > > > > __vmalloc_huge => vmalloc_huge. (Christoph Hellwig) > > > > 2. Use vzalloc (as it was before vmalloc_no_huge) and clean up comments in > > > > kvm_s390_pv_alloc_vm. > > > > > > > > Changes v2 => v3: > > > > 1. Use __vmalloc_huge in alloc_large_system_hash. > > > > 2. Use EXPORT_SYMBOL_GPL for new functions. (Christoph Hellwig) > > > > 3. Add more description about the issues and changes.(Christoph Hellwig, > > > > Rick Edgecombe). > > > > > > > > Changes v1 => v2: > > > > 1. Add vmalloc_huge(). (Christoph Hellwig) > > > > 2. Add module_alloc_huge(). (Christoph Hellwig) > > > > 3. Add Fixes tag and Link tag. (Thorsten Leemhuis) > > > > > > > > Enabling HAVE_ARCH_HUGE_VMALLOC on x86_64 and use it for bpf_prog_pack has > > > > caused some issues [1], as many users of vmalloc are not yet ready to > > > > handle huge pages. To enable a more smooth transition to use huge page > > > > backed vmalloc memory, this set replaces VM_NO_HUGE_VMAP flag with an new > > > > opt-in flag, VM_ALLOW_HUGE_VMAP. More discussions about this topic can be > > > > found at [2]. > > > > > > > > Patch 1 removes VM_NO_HUGE_VMAP and adds VM_ALLOW_HUGE_VMAP. > > > > Patch 2 uses VM_ALLOW_HUGE_VMAP in bpf_prog_pack. > > > > > > > > [1] https://lore.kernel.org/lkml/20220204185742.271030-1-song@kernel.org/ > > > > [2] https://lore.kernel.org/linux-mm/20220330225642.1163897-1-song@kernel.org/ > > > > > > Looks good except for that I think this should just wait for v5.19. The > > > fixes are so large I can't see why this needs to be rushed in other than > > > the first assumptions of the optimizations had some flaws addressed here. > > > > We need these changes to fix issues like [3]. Note that there might > > still be some > > undiscovered issues with huge page backed vmalloc memory on powerpc, which > > had HAVE_ARCH_HUGE_VMALLOC enabled since the 5.15 kernel. As we > > agreed, the new opt-in flag is a safer approach here. We probably should have > > 1/4 and 2/4 back ported to stable. Therefore, I think shipping this > > set now would > > give us a more reliable 5.18 release. > > > > Does this make sense? > > Yes absolutely, but that sounds like that optimization should just be > reverted completely from v5.18 isntead then no? Or if one can skip the > optimizations for v5.18 with a small patch, and then enable for v5.19. In any case it is up to Linus, I already chimed in with my opinion. Luis
On Fri, Apr 15, 2022 at 12:05:42PM -0700, Luis Chamberlain wrote: > Looks good except for that I think this should just wait for v5.19. The > fixes are so large I can't see why this needs to be rushed in other than > the first assumptions of the optimizations had some flaws addressed here. Patches 1 and 2 are bug fixes for regressions caused by using huge page backed vmalloc by default. So I think we do need it for 5.18. The other two do look like candidates for 5.19, though.
> On Apr 15, 2022, at 10:08 PM, Christoph Hellwig <hch@infradead.org> wrote: > > On Fri, Apr 15, 2022 at 12:05:42PM -0700, Luis Chamberlain wrote: >> Looks good except for that I think this should just wait for v5.19. The >> fixes are so large I can't see why this needs to be rushed in other than >> the first assumptions of the optimizations had some flaws addressed here. > > Patches 1 and 2 are bug fixes for regressions caused by using huge page > backed vmalloc by default. So I think we do need it for 5.18. The > other two do look like candidates for 5.19, though.

Thanks Luis and Christoph for your kind inputs on the set. Here is my analysis after thinking about it overnight.

We can discuss the users of vmalloc in 4 categories: module_alloc, BPF programs, alloc_large_system_hash, and others; and there are two archs involved here: x86_64 and powerpc. With the whole set, the behavior is:

                  |        x86_64        |        powerpc
    --------------+----------------------+----------------------
    module_alloc  |             use small pages
    --------------+----------------------+----------------------
    BPF programs  | use 2MB pages        | use small pages
    --------------+----------------------+----------------------
    large hash    |   use huge pages when size > PMD_SIZE
    --------------+----------------------+----------------------
    other-vmalloc |             use small pages

Patch 1/4 fixes the behavior of module_alloc and other-vmalloc. Without 1/4, both these users may get huge pages for size > PMD_SIZE allocations, which may be troublesome ([3], for example).

Patches 3/4 and 4/4, together with 1/4, allow BPF programs to use 2MB pages. This is the same behavior as before 5.18-rc1, which has been tested in bpf-next and linux-next. Therefore, I don't think we need to hold them until 5.19.

Patch 2/4 enables huge pages for large hash. Large hash has been using huge pages on powerpc since 5.15, but this is new for x86_64. If we ship 2/4, this is a performance improvement for x86_64, but it is less tested on x86_64 (it didn't go through linux-next). If we ship 1/4 but not 2/4 with 5.18, we will see a small performance regression for powerpc.

Based on this analysis, I think we should either 1) ship the whole set with 5.18; or 2) ship 1/4, 3/4, and 4/4 with 5.18, and 2/4 with 5.19. With option 1), we enable huge pages for large hash on x86_64 without going through linux-next. With option 2), we take a small performance regression with 5.18 on powerpc. Of course, we can ship a hybrid solution by gating 2/4 for powerpc only in 5.18, and enabling it for x86_64 in 5.19.

Does this make sense? Please let me know your comments and suggestions on this.

Thanks, Song

[3] https://lore.kernel.org/lkml/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/
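For reference, the "large hash" row corresponds to the hashdist path in alloc_large_system_hash(); patch 2/4 presumably amounts to something like the following (a sketch based on the changelog, with the surrounding context recalled from memory and possibly differing slightly):

```c
	} else if (get_order(size) >= MAX_ORDER || hashdist) {
		/*
		 * Boot-time hash tables (dentry/inode caches and friends) can
		 * be many megabytes, so they are an easy win for huge pages.
		 */
		table = vmalloc_huge(size, gfp_flags);
		virt = true;
		if (table)
			huge = is_vm_area_hugepages(table);
	}
```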
On Sat, Apr 16, 2022 at 12:55 PM Song Liu <songliubraving@fb.com> wrote: > > Based on this analysis, I think we should either > 1) ship the whole set with 5.18; or > 2) ship 1/4, 3/4, and 4/4 with 5.18, and 2/4 with 5.19. Honestly, I think the proper thing to do is - apply #1, because yes, that "use huge pages" should be an opt-in. - but just disable hugepages for now. I think those games with set_memory_nx() and friends just show how rough this all is right now. In fact, I personally think that the whole bpf 'prog_pack' stuff should probably be disabled. It looks incredibly broken to me right now. Unless I mis-read it, it does a "module_alloc()" to allocate the vmap area, and then just marks it executable without having even initialized the pages. Am I missing something? So now we have random kernel memory that is marked executable. Sure, it's also marked RO, but who cares? It's random data that is now executable. Maybe I am missing something, but I really don't think this is ready for prime-time. We should effectively disable it all, and have people think through it a lot more. Linus
Hi Linus, Thanks a lot for your kind reply. > On Apr 16, 2022, at 1:30 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Sat, Apr 16, 2022 at 12:55 PM Song Liu <songliubraving@fb.com> wrote: >> >> Based on this analysis, I think we should either >> 1) ship the whole set with 5.18; or >> 2) ship 1/4, 3/4, and 4/4 with 5.18, and 2/4 with 5.19. > > Honestly, I think the proper thing to do is > > - apply #1, because yes, that "use huge pages" should be an opt-in. > > - but just disable hugepages for now. Hmm.. This sure is an option... > > I think those games with set_memory_nx() and friends just show how > rough this all is right now. > > In fact, I personally think that the whole bpf 'prog_pack' stuff > should probably be disabled. It looks incredibly broken to me right > now. > > Unless I mis-read it, it does a "module_alloc()" to allocate the vmap > area, and then just marks it executable without having even > initialized the pages. Am I missing something? So now we have random > kernel memory that is marked executable. > > Sure, it's also marked RO, but who cares? It's random data that is now > executable. While this is a serious issue (my fault), it is relatively easy to fix. We can fill the whole space with invalid instructions when the page is allocated and when a BPF program is freed. Both these operations are on the slow paths, so the overhead is minimal. > > Maybe I am missing something, but I really don't think this is ready > for prime-time. We should effectively disable it all, and have people > think through it a lot more. This has been discussed on lwn.net: https://lwn.net/Articles/883454/. AFAICT, the biggest concern is whether reserving a minimum of 2MB for BPF programs is a good trade-off for memory usage. This is again my fault for not stating the motivation clearly: the primary gain comes from less page table fragmentation and thus better iTLB efficiency. Other folks (in the recent thread on this topic and offline in other discussions) also showed strong interest in using a similar technique for the text of kernel modules. So I would really like to learn your opinion on this. There are many details we can optimize, but I guess the general mechanism has to be something like: - allocate a huge page, make it safe, and set it as executable; - as users (BPF, kernel module, etc.) request memory for text, give a chunk of the huge page to the user. - use some mechanism to update the chunk of memory safely. As you correctly noted, I missed the "make it safe" step (which should be easy to fix). But the rest of the flow is more or less the best solution to me. Did I totally miss some better option here? Or is this really not a viable option (IOW, bpf programs and kernel modules have to use regular pages)? Thanks again, Song
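The alloc-time half of what Song proposes could be as simple as the sketch below, assuming the usual x86 convention of padding executable memory with INT3 so stray execution traps; the free-time half would have to go through the text-poking machinery once the region is already read-only (a sketch of that appears later in the thread). The function name here is illustrative, not taken from the patch:

```c
/*
 * Sketch only: fill a freshly allocated region with INT3 (0xcc) before it is
 * ever mapped executable, so unused space traps instead of running whatever
 * the page allocator handed back. Other architectures would use their own
 * trap/illegal instruction pattern.
 */
static void fill_with_ill_insns(void *area, unsigned int size)
{
	memset(area, 0xcc, size);
}
```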
Hi, On Sat, Apr 16, 2022 at 10:26:08PM +0000, Song Liu wrote: > > On Apr 16, 2022, at 1:30 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > Maybe I am missing something, but I really don't think this is ready > > for prime-time. We should effectively disable it all, and have people > > think through it a lot more. > > This has been discussed on lwn.net: https://lwn.net/Articles/883454/. > AFAICT, the biggest concern is whether reserving minimal 2MB for BPF > programs is a good trade-off for memory usage. This is again my fault > not to state the motivation clearly: the primary gain comes from less > page table fragmentation and thus better iTLB efficiency. Reserving 2MB pages for BPF programs will indeed reduce the fragmentation, but OTOH it will reduce memory utilization. If for large systems this may not be an issue, on smaller machines trading off memory for iTLB performance may be not that obvious. > Other folks (in recent thread on this topic and offline in other > discussions) also showed strong interests in using similar technical > for text of kernel modules. So I would really like to learn your > opinion on this. There are many details we can optimize, but I guess > the general mechanism has to be something like: > - allocate a huge page, make it safe, and set it as executable; > - as users (BPF, kernel module, etc.) request memory for text, give > a chunk of the huge page to the user. > - use some mechanism to update the chunk of memory safely. There are use-cases that require 4K pages with non-default permissions in the direct map and the pages not necessarily should be executable. There were several suggestions to implement caches of 4K pages backed by 2M pages. I believe that "allocate huge page and split it to basic pages to hand out to users" concept should be implemented at page allocator level and I posted and RFC for this a while ago: https://lore.kernel.org/all/20220127085608.306306-1-rppt@kernel.org/
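To make the "cache of 4K pages backed by 2M pages" idea concrete, here is a purely hypothetical sketch of the bookkeeping such a cache implies; none of these names (huge_backed_cache, cache_alloc_4k) come from Mike's RFC, which instead does the work inside the page allocator:

```c
#include <linux/bitmap.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

/* Hypothetical illustration only; not the API from the linked RFC. */
struct huge_backed_cache {
	spinlock_t	lock;
	void		*base;	/* one 2 MiB chunk; direct map adjusted once */
	DECLARE_BITMAP(used, PMD_SIZE / PAGE_SIZE);
};

static void *cache_alloc_4k(struct huge_backed_cache *c)
{
	unsigned long slot;
	void *p = NULL;

	spin_lock(&c->lock);
	slot = find_first_zero_bit(c->used, PMD_SIZE / PAGE_SIZE);
	if (slot < PMD_SIZE / PAGE_SIZE) {
		__set_bit(slot, c->used);
		p = c->base + slot * PAGE_SIZE;
	}
	spin_unlock(&c->lock);
	return p;	/* NULL: the cache needs a fresh 2 MiB backing chunk */
}
```

The point of pushing this into the page allocator, as the RFC does, is that the expensive direct-map permission change is paid once per 2 MiB chunk rather than once per 4 KiB allocation.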
On Mon, Apr 18, 2022 at 01:06:36PM +0300, Mike Rapoport wrote: > Hi, > > On Sat, Apr 16, 2022 at 10:26:08PM +0000, Song Liu wrote: > > > On Apr 16, 2022, at 1:30 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > > > Maybe I am missing something, but I really don't think this is ready > > > for prime-time. We should effectively disable it all, and have people > > > think through it a lot more. > > > > This has been discussed on lwn.net: https://lwn.net/Articles/883454/. > > AFAICT, the biggest concern is whether reserving minimal 2MB for BPF > > programs is a good trade-off for memory usage. This is again my fault > > not to state the motivation clearly: the primary gain comes from less > > page table fragmentation and thus better iTLB efficiency. > > Reserving 2MB pages for BPF programs will indeed reduce the fragmentation, > but OTOH it will reduce memory utilization. If for large systems this may > not be an issue, on smaller machines trading off memory for iTLB > performance may be not that obvious. So the current optimization at best should be a kconfig option? > > Other folks (in recent thread on this topic and offline in other > > discussions) also showed strong interests in using similar technical > > for text of kernel modules. So I would really like to learn your > > opinion on this. There are many details we can optimize, but I guess > > the general mechanism has to be something like: > > - allocate a huge page, make it safe, and set it as executable; > > - as users (BPF, kernel module, etc.) request memory for text, give > > a chunk of the huge page to the user. > > - use some mechanism to update the chunk of memory safely. > > There are use-cases that require 4K pages with non-default permissions in > the direct map and the pages not necessarily should be executable. There > were several suggestions to implement caches of 4K pages backed by 2M > pages. Even if we just focus on the executable side of the story... there may be users who can share this too. I've gone down memory lane now at least down to year 2005 in kprobes to see why the heck module_alloc() was used. At first glance there are some old comments about being within the 2 GiB text kernel range... But some old tribal knowledge is still lost. The real hints come from kprobe work since commit 9ec4b1f356b3 ("[PATCH] kprobes: fix single-step out of line - take2"), so that the "For the %rip-relative displacement fixups to be doable"... but this got me wondering, would other users who *do* want similar funcionality benefit from a cache. If the space is limited then using a cache makes sense. Specially if architectures tend to require hacks for some of this to all work. Then, since it seems since the vmalloc area was not initialized, wouldn't that break the old JIT spray fixes, refer to commit 314beb9bcabfd ("x86: bpf_jit_comp: secure bpf jit against spraying attacks")? Is that sort of work not needed anymore? If in doubt I at least made the old proof of concept JIT spray stuff compile on recent kernels [0], but I haven't tried out your patches yet. If this is not needed anymore, why not? The collection of tribal knowedge around these sorts of things would be good to not loose and if we can share, even better. 
> I believe that "allocate huge page and split it to basic pages to hand out > to users" concept should be implemented at page allocator level and I > posted and RFC for this a while ago: > > https://lore.kernel.org/all/20220127085608.306306-1-rppt@kernel.org/ Neat, so although eBPF is a big user, are there some use cases outside that immediately benefit? [0] https://github.com/mcgrof/jit-spray-poc-for-ksp LUis
On Mon, 2022-04-18 at 17:44 -0700, Luis Chamberlain wrote: > > There are use-cases that require 4K pages with non-default > > permissions in > > the direct map and the pages not necessarily should be executable. > > There > > were several suggestions to implement caches of 4K pages backed by > > 2M > > pages. > > Even if we just focus on the executable side of the story... there > may > be users who can share this too. > > I've gone down memory lane now at least down to year 2005 in kprobes > to see why the heck module_alloc() was used. At first glance there > are > some old comments about being within the 2 GiB text kernel range... > But > some old tribal knowledge is still lost. The real hints come from > kprobe work > since commit 9ec4b1f356b3 ("[PATCH] kprobes: fix single-step out of > line > - take2"), so that the "For the %rip-relative displacement fixups to > be > doable"... but this got me wondering, would other users who *do* want > similar funcionality benefit from a cache. If the space is limited > then > using a cache makes sense. Specially if architectures tend to require > hacks for some of this to all work. Yea, that was my understanding. X86 modules have to be linked within 2GB of the kernel text, also eBPF x86 JIT generates code that expects to be within 2GB of the kernel text. I think of two types of caches we could have: caches of unmapped pages on the direct map and caches of virtual memory mappings. Caches of pages on the direct map reduce breakage of the large pages (and is somewhat x86 specific problem). Caches of virtual memory mappings reduce shootdowns, and are also required to share huge pages. I'll plug my old RFC, where I tried to work towards enabling both: https://lore.kernel.org/lkml/20201120202426.18009-1-rick.p.edgecombe@intel.com/ Since then Mike has taken a lot further the direct map cache piece. Yea, probably a lot of JIT's are way smaller than a page, but there is also hopefully some performance benefit of reduced ITLB pressure and TLB shootdowns. I think kprobes/ftrace (or at least one of them) keeps its own cache of a page for putting very small trampolines. > > Then, since it seems since the vmalloc area was not initialized, > wouldn't that break the old JIT spray fixes, refer to commit > 314beb9bcabfd ("x86: bpf_jit_comp: secure bpf jit against spraying > attacks")? Hmm, yea it might be a way to get around the ebpf jit rlimit. The allocator could just text_poke() invalid instructions on "free" of the jit. > > Is that sort of work not needed anymore? If in doubt I at least made > the > old proof of concept JIT spray stuff compile on recent kernels [0], > but > I haven't tried out your patches yet. If this is not needed anymore, > why not? IIRC this got addressed in two ways, randomizing of the jit offset inside the vmalloc allocation, and "constant blinding", such that the specific attack of inserting unaligned instructions as immediate instruction data did not work. Neither of those mitigations seem unworkable with a large page caching allocator. > > The collection of tribal knowedge around these sorts of things would > be > good to not loose and if we can share, even better. Totally agree here. I think the abstraction I was exploring in that RFC could remove some of the special permission memory tribal knowledge that is lurking in in the cross-arch module.c. I wonder if you have any thoughts on something like that? The normal modules proved the hardest.
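Rick's "text_poke() invalid instructions on free" idea could look roughly like this on x86, complementing the alloc-time fill sketched earlier (invalidate_jit_region is a made-up name; text_poke_copy() is the existing helper that writes through a temporary mapping, so the shared region never has to become writable):

```c
#include <linux/minmax.h>
#include <linux/string.h>
#include <linux/types.h>
#include <asm/text-patching.h>

/*
 * Sketch: stamp INT3 over a JIT image being freed from a shared RO+X region,
 * without ever flipping the region's permissions.
 */
static void invalidate_jit_region(void *addr, size_t len)
{
	u8 ill[64];

	memset(ill, 0xcc, sizeof(ill));
	while (len) {
		size_t chunk = min(len, sizeof(ill));

		text_poke_copy(addr, ill, chunk);
		addr += chunk;
		len -= chunk;
	}
}
```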
Hi Mike, Luis, and Rick, Thanks for sharing your work and findings in the space. I didn't realize we were looking at the same set of problems. > On Apr 18, 2022, at 6:56 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote: > > On Mon, 2022-04-18 at 17:44 -0700, Luis Chamberlain wrote: >>> There are use-cases that require 4K pages with non-default >>> permissions in >>> the direct map and the pages not necessarily should be executable. >>> There >>> were several suggestions to implement caches of 4K pages backed by >>> 2M >>> pages. >> >> Even if we just focus on the executable side of the story... there >> may >> be users who can share this too. >> >> I've gone down memory lane now at least down to year 2005 in kprobes >> to see why the heck module_alloc() was used. At first glance there >> are >> some old comments about being within the 2 GiB text kernel range... >> But >> some old tribal knowledge is still lost. The real hints come from >> kprobe work >> since commit 9ec4b1f356b3 ("[PATCH] kprobes: fix single-step out of >> line >> - take2"), so that the "For the %rip-relative displacement fixups to >> be >> doable"... but this got me wondering, would other users who *do* want >> similar funcionality benefit from a cache. If the space is limited >> then >> using a cache makes sense. Specially if architectures tend to require >> hacks for some of this to all work. > > Yea, that was my understanding. X86 modules have to be linked within > 2GB of the kernel text, also eBPF x86 JIT generates code that expects > to be within 2GB of the kernel text. > > > I think of two types of caches we could have: caches of unmapped pages > on the direct map and caches of virtual memory mappings. Caches of > pages on the direct map reduce breakage of the large pages (and is > somewhat x86 specific problem). Caches of virtual memory mappings > reduce shootdowns, and are also required to share huge pages. I'll plug > my old RFC, where I tried to work towards enabling both: > > https://lore.kernel.org/lkml/20201120202426.18009-1-rick.p.edgecombe@intel.com/ > > Since then Mike has taken a lot further the direct map cache piece. These are really interesting work. With this landed, we won't need the bpf_prog_pack work at all (I think). OTOH, this looks like a long term project, as some of the work in bpf_prog_pack took quite some time to discuss/debate, and that was just a subset of the whole thing. I really like the two types of cache concept. But there are some details I cannot figure out about them: 1. Is "caches of unmapped pages on direct map" (cache #1) sufficient to fix all direct map fragmentation? IIUC, pages in the cache may still be used by other allocation (with some memory pressure). If the system runs for long enough, there may be a lot of direct map fragmentation. Is this right? 2. If we have "cache of virtual memory mappings" (cache #2), do we still need cache #1? I know cache #2 alone may waste some memory, but I still think 2MB within noise for modern systems. 3. If we do need both caches, what would be the right APIs? Thanks, Song > Yea, probably a lot of JIT's are way smaller than a page, but there is > also hopefully some performance benefit of reduced ITLB pressure and > TLB shootdowns. I think kprobes/ftrace (or at least one of them) keeps > its own cache of a page for putting very small trampolines. 
> >> >> Then, since it seems since the vmalloc area was not initialized, >> wouldn't that break the old JIT spray fixes, refer to commit >> 314beb9bcabfd ("x86: bpf_jit_comp: secure bpf jit against spraying >> attacks")? > > Hmm, yea it might be a way to get around the ebpf jit rlimit. The > allocator could just text_poke() invalid instructions on "free" of the > jit. > >> >> Is that sort of work not needed anymore? If in doubt I at least made >> the >> old proof of concept JIT spray stuff compile on recent kernels [0], >> but >> I haven't tried out your patches yet. If this is not needed anymore, >> why not? > > IIRC this got addressed in two ways, randomizing of the jit offset > inside the vmalloc allocation, and "constant blinding", such that the > specific attack of inserting unaligned instructions as immediate > instruction data did not work. Neither of those mitigations seem > unworkable with a large page caching allocator. > >> >> The collection of tribal knowedge around these sorts of things would >> be >> good to not loose and if we can share, even better. > > Totally agree here. I think the abstraction I was exploring in that RFC > could remove some of the special permission memory tribal knowledge > that is lurking in in the cross-arch module.c. I wonder if you have any > thoughts on something like that? The normal modules proved the hardest. >
On Mon, Apr 18, 2022 at 05:44:19PM -0700, Luis Chamberlain wrote: > On Mon, Apr 18, 2022 at 01:06:36PM +0300, Mike Rapoport wrote: > > Hi, > > > > On Sat, Apr 16, 2022 at 10:26:08PM +0000, Song Liu wrote: > > > > On Apr 16, 2022, at 1:30 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > > > > > Maybe I am missing something, but I really don't think this is ready > > > > for prime-time. We should effectively disable it all, and have people > > > > think through it a lot more. > > > > > > This has been discussed on lwn.net: https://lwn.net/Articles/883454/. > > > AFAICT, the biggest concern is whether reserving minimal 2MB for BPF > > > programs is a good trade-off for memory usage. This is again my fault > > > not to state the motivation clearly: the primary gain comes from less > > > page table fragmentation and thus better iTLB efficiency. > > > > Reserving 2MB pages for BPF programs will indeed reduce the fragmentation, > > but OTOH it will reduce memory utilization. If for large systems this may > > not be an issue, on smaller machines trading off memory for iTLB > > performance may be not that obvious. > > So the current optimization at best should be a kconfig option? Maybe not and it'll be fine on smaller systems, but from what I see the bpf_prog_pack implementation didn't consider them. And if we move the caches from BPF to vmalloc or page allocator that would be much less of an issue. > > I believe that "allocate huge page and split it to basic pages to hand out > > to users" concept should be implemented at page allocator level and I > > posted and RFC for this a while ago: > > > > https://lore.kernel.org/all/20220127085608.306306-1-rppt@kernel.org/ > > Neat, so although eBPF is a big user, are there some use cases outside > that immediately benefit? Anything that uses set_memory APIs could benefit from this. Except eBPF and other module_alloc() users, there is secretmem that also fractures the direct map and actually that was my initial use case for these patches. Another possible use-case can be protection of page tables with PKS: https://lore.kernel.org/lkml/20210505003032.489164-1-rick.p.edgecombe@intel.com/ Vlastimil also mentioned that SEV-SNP could use such caching mechanism, but I don't know the details. > LUis
Hi, On Tue, Apr 19, 2022 at 05:36:45AM +0000, Song Liu wrote: > Hi Mike, Luis, and Rick, > > Thanks for sharing your work and findings in the space. I didn't > realize we were looking at the same set of problems. > > > On Apr 18, 2022, at 6:56 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote: > > > > On Mon, 2022-04-18 at 17:44 -0700, Luis Chamberlain wrote: > >>> There are use-cases that require 4K pages with non-default > >>> permissions in > >>> the direct map and the pages not necessarily should be executable. > >>> There > >>> were several suggestions to implement caches of 4K pages backed by > >>> 2M > >>> pages. > >> > >> Even if we just focus on the executable side of the story... there > >> may > >> be users who can share this too. > >> > >> I've gone down memory lane now at least down to year 2005 in kprobes > >> to see why the heck module_alloc() was used. At first glance there > >> are > >> some old comments about being within the 2 GiB text kernel range... > >> But > >> some old tribal knowledge is still lost. The real hints come from > >> kprobe work > >> since commit 9ec4b1f356b3 ("[PATCH] kprobes: fix single-step out of > >> line > >> - take2"), so that the "For the %rip-relative displacement fixups to > >> be > >> doable"... but this got me wondering, would other users who *do* want > >> similar funcionality benefit from a cache. If the space is limited > >> then > >> using a cache makes sense. Specially if architectures tend to require > >> hacks for some of this to all work. > > > > Yea, that was my understanding. X86 modules have to be linked within > > 2GB of the kernel text, also eBPF x86 JIT generates code that expects > > to be within 2GB of the kernel text. > > > > > > I think of two types of caches we could have: caches of unmapped pages > > on the direct map and caches of virtual memory mappings. Caches of > > pages on the direct map reduce breakage of the large pages (and is > > somewhat x86 specific problem). Caches of virtual memory mappings > > reduce shootdowns, and are also required to share huge pages. I'll plug > > my old RFC, where I tried to work towards enabling both: > > > > https://lore.kernel.org/lkml/20201120202426.18009-1-rick.p.edgecombe@intel.com/ > > > > Since then Mike has taken a lot further the direct map cache piece. > > These are really interesting work. With this landed, we won't need > the bpf_prog_pack work at all (I think). OTOH, this looks like a > long term project, as some of the work in bpf_prog_pack took quite > some time to discuss/debate, and that was just a subset of the > whole thing. I'd say that bpf_prog_pack was a cure for symptoms and this project tries to address more general problem. But you are right, it'll take some time and won't land in 5.19. > I really like the two types of cache concept. But there are some > details I cannot figure out about them: After some discussions we decided to try moving the caching of large pages to the page allocator and see if the second cache will be needed at all. But I've got distracted after posting the RFC and that work didn't have real progress since then. > 1. Is "caches of unmapped pages on direct map" (cache #1) > sufficient to fix all direct map fragmentation? IIUC, pages in > the cache may still be used by other allocation (with some > memory pressure). If the system runs for long enough, there > may be a lot of direct map fragmentation. Is this right? 
If the system runs long enough, it may run out of high-order free pages regardless of the way the caches are implemented. Then we either fail the allocation because it is impossible to refill the cache with large pages or fall back to 4k pages and fragment direct map. I don't see how can we avoid direct map fragmentation entirely and still be able to allocate memory for users of set_memory APIs. > 2. If we have "cache of virtual memory mappings" (cache #2), do we > still need cache #1? I know cache #2 alone may waste some > memory, but I still think 2MB within noise for modern systems. I presume that by cache #1 you mean the cache in the page allocator. In that case cache #2 is probably not needed at all, because the cache at page allocator level will be used by vmalloc() and friends to provide what Rick called "permissioned allocations". > Thanks, > Song
On Tue, Apr 19, 2022 at 11:42 AM Mike Rapoport <rppt@kernel.org> wrote: > > I'd say that bpf_prog_pack was a cure for symptoms and this project tries > to address more general problem. > But you are right, it'll take some time and won't land in 5.19. Just to update people: I've just applied Song's [1/4] patch, which means that the whole current hugepage vmalloc thing is effectively disabled (because nothing opts in). And I suspect that will be the status for 5.18, unless somebody comes up with some very strong arguments for (re-)starting using huge pages. Linus
On Tue, Apr 19, 2022 at 01:56:03AM +0000, Edgecombe, Rick P wrote: > Yea, that was my understanding. X86 modules have to be linked within > 2GB of the kernel text, also eBPF x86 JIT generates code that expects > to be within 2GB of the kernel text. And kprobes / live patching / ftrace. Another architectural fun fact, powerpc book3s/32 requires executability to be set per 256 Mbytes segments. Some architectures like this one will want to also optimize how they use the module alloc area. Even though today the use cases might be limited, we don't exactly know how much memory a target device has a well, and so treating memory failures for "special memory" request as regular memory failures seems a bit odd, and users could get confused. For instance slapping on extra memory on a system won't resolve any issues if the limit for a special type of memory is already hit. Very likely not a problem at all today, given how small modules / eBPF jit programs are / etc, but conceptually it would seem wrong to just say -ENOMEM when in fact it's a special type of required memory which cannot be allocated and the issue cannot possibly be fixed. I don't think we have an option but to use -ENOMEM but at least hinting of the special failure would have seem desirable. Do we have other type of architectural limitations for "special memory" other than executable? Do we have *new* types of special memory we should consider which might be similar / limited in nature? And can / could / should these architectural limitations hopefully be disappear in newer CPUs? I see vmalloc_pks() as you pointed out [0] . Anything else? > I think of two types of caches we could have: caches of unmapped pages > on the direct map and caches of virtual memory mappings. Caches of > pages on the direct map reduce breakage of the large pages (and is > somewhat x86 specific problem). Caches of virtual memory mappings > reduce shootdowns, and are also required to share huge pages. I'll plug > my old RFC, where I tried to work towards enabling both: > > https://lore.kernel.org/lkml/20201120202426.18009-1-rick.p.edgecombe@intel.com/ > > Since then Mike has taken a lot further the direct map cache piece. > > Yea, probably a lot of JIT's are way smaller than a page, but there is > also hopefully some performance benefit of reduced ITLB pressure and > TLB shootdowns. I think kprobes/ftrace (or at least one of them) keeps > its own cache of a page for putting very small trampolines. The reason I looked into *why* module_alloc() was used was particularly because it seemed a bit odd to have such ITLB enhancements for such a niche use case and we couldn't have desired this elsewhere before. > > Then, since it seems since the vmalloc area was not initialized, > > wouldn't that break the old JIT spray fixes, refer to commit > > 314beb9bcabfd ("x86: bpf_jit_comp: secure bpf jit against spraying > > attacks")? > > Hmm, yea it might be a way to get around the ebpf jit rlimit. The > allocator could just text_poke() invalid instructions on "free" of the > jit. > > > > > Is that sort of work not needed anymore? If in doubt I at least made > > the > > old proof of concept JIT spray stuff compile on recent kernels [0], > > but > > I haven't tried out your patches yet. If this is not needed anymore, > > why not? > > IIRC this got addressed in two ways, randomizing of the jit offset > inside the vmalloc allocation, and "constant blinding", such that the > specific attack of inserting unaligned instructions as immediate > instruction data did not work. 
Neither of those mitigations seem > unworkable with a large page caching allocator. Got it, but was it *also* considerd in the fixes posted recently? > > The collection of tribal knowedge around these sorts of things would > > be > > good to not loose and if we can share, even better. > > Totally agree here. I think the abstraction I was exploring in that RFC > could remove some of the special permission memory tribal knowledge > that is lurking in in the cross-arch module.c. I wonder if you have any > thoughts on something like that? The normal modules proved the hardest. Yeah modules will be harder now with the new ARCH_WANTS_MODULES_DATA_IN_VMALLOC which Christophe Leroy added (queued in my modules-next). At a quick glance it seems like an API in the right direction, but you just need more architecture folks other than the usual x86 suspects to review. Perhaps time for a new spin? [0] https://lore.kernel.org/lkml/20201009201410.3209180-2-ira.weiny@intel.com/ Luis
On Tue, 2022-04-19 at 14:24 -0700, Luis Chamberlain wrote: > On Tue, Apr 19, 2022 at 01:56:03AM +0000, Edgecombe, Rick P wrote: > > Yea, that was my understanding. X86 modules have to be linked > > within > > 2GB of the kernel text, also eBPF x86 JIT generates code that > > expects > > to be within 2GB of the kernel text. > > And kprobes / live patching / ftrace. > > Another architectural fun fact, powerpc book3s/32 requires > executability > to be set per 256 Mbytes segments. Some architectures like this one > will want to also optimize how they use the module alloc area. > > Even though today the use cases might be limited, we don't exactly > know > how much memory a target device has a well, and so treating memory > failures for "special memory" request as regular memory failures > seems > a bit odd, and users could get confused. For instance slapping on > extra memory on a system won't resolve any issues if the limit for a > special type of memory is already hit. Very likely not a problem at > all today, > given how small modules / eBPF jit programs are / etc, but > conceptually it > would seem wrong to just say -ENOMEM when in fact it's a special type > of > required memory which cannot be allocated and the issue cannot > possibly be > fixed. I don't think we have an option but to use -ENOMEM but at > least > hinting of the special failure would have seem desirable. ENOMEM doesn't always mean out of physical memory though right? Could be hitting some other memory limit. Not sure where this discussion of the error code is coming from though. As far as the problem of eating a whole 2MB on small systems, this makes sense to me to worry about. Erratas limit what we can do here with swapping page sizes on the fly. Probably a sensible solution would be to decide whether to try based on system properties like boot memory size. Even without 2MB pages though, there are still improvements from these types of changes. > > Do we have other type of architectural limitations for "special > memory" > other than executable? Do we have *new* types of special memory we > should consider which might be similar / limited in nature? And can / > could / > should these architectural limitations hopefully be disappear in > newer CPUs? > I see vmalloc_pks() as you pointed out [0] . Anything else? Hmm, shadow stack permission memory could pop up in vmalloc someday. Not sure what you mean by architectural limitations. The relative addressing? If so, those other usages shouldn't be restricted by that. > > > I think of two types of caches we could have: caches of unmapped > > pages > > on the direct map and caches of virtual memory mappings. Caches of > > pages on the direct map reduce breakage of the large pages (and is > > somewhat x86 specific problem). Caches of virtual memory mappings > > reduce shootdowns, and are also required to share huge pages. I'll > > plug > > my old RFC, where I tried to work towards enabling both: > > > > > > https://lore.kernel.org/lkml/20201120202426.18009-1-rick.p.edgecombe@intel.com/ > > > > Since then Mike has taken a lot further the direct map cache piece. > > > > Yea, probably a lot of JIT's are way smaller than a page, but there > > is > > also hopefully some performance benefit of reduced ITLB pressure > > and > > TLB shootdowns. I think kprobes/ftrace (or at least one of them) > > keeps > > its own cache of a page for putting very small trampolines. 
> > The reason I looked into *why* module_alloc() was used was > particularly > because it seemed a bit odd to have such ITLB enhancements for such > a niche use case and we couldn't have desired this elsewhere before. I think in general it is the only cross-arch way to get an allocation in the arch's executable compatible area (which some need as you say). module_alloc() is probably misnamed at this point with so many users that are not modules. > > > > Then, since it seems since the vmalloc area was not initialized, > > > wouldn't that break the old JIT spray fixes, refer to commit > > > 314beb9bcabfd ("x86: bpf_jit_comp: secure bpf jit against > > > spraying > > > attacks")? > > > > Hmm, yea it might be a way to get around the ebpf jit rlimit. The > > allocator could just text_poke() invalid instructions on "free" of > > the > > jit. > > > > > > > > Is that sort of work not needed anymore? If in doubt I at least > > > made > > > the > > > old proof of concept JIT spray stuff compile on recent kernels > > > [0], > > > but > > > I haven't tried out your patches yet. If this is not needed > > > anymore, > > > why not? > > > > IIRC this got addressed in two ways, randomizing of the jit offset > > inside the vmalloc allocation, and "constant blinding", such that > > the > > specific attack of inserting unaligned instructions as immediate > > instruction data did not work. Neither of those mitigations seem > > unworkable with a large page caching allocator. > > Got it, but was it *also* considerd in the fixes posted recently? I didn't read any discussion about it. But if it wasn't clear, I'm just an onlooker on bpf_prog_pack. I didn't see it until it was already upstream. Maybe Song can say. > > > > The collection of tribal knowedge around these sorts of things > > > would > > > be > > > good to not loose and if we can share, even better. > > > > Totally agree here. I think the abstraction I was exploring in that > > RFC > > could remove some of the special permission memory tribal knowledge > > that is lurking in in the cross-arch module.c. I wonder if you have > > any > > thoughts on something like that? The normal modules proved the > > hardest. > > Yeah modules will be harder now with the new > ARCH_WANTS_MODULES_DATA_IN_VMALLOC > which Christophe Leroy added (queued in my modules-next). Part of that work was separating out each module into 4 allocations, so it might make it easier. > At a quick > glance it seems like an API in the right direction, but you just need > more architecture folks other than the usual x86 suspects to review. > > Perhaps time for a new spin? I would very much like to, but I am currently too busy with another project. As such, I am mostly just trying to contribute ideas and my personal collection of the hidden knowledge. It sounded like from Song, someone might want to tackle it before I can get back to it. And, yes other arch's review would be critical to making sure it actually is a better interface and not just another one. > > [0] > https://lore.kernel.org/lkml/20201009201410.3209180-2-ira.weiny@intel.com/ > > Luis
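The 2 GiB constraint Rick and Luis describe is visible in the x86-64 allocator itself; a simplified sketch of module_alloc() follows (the KASLR load offset, extra VM_* flags and KASAN/kmemleak handling of the real function are omitted):

```c
/*
 * Simplified sketch of x86-64 module_alloc(): allocations are confined to the
 * dedicated modules area, which the memory layout places within 2 GiB of the
 * kernel text so 32-bit RIP-relative references from modules and JITed code
 * still resolve.
 */
void *module_alloc(unsigned long size)
{
	if (PAGE_ALIGN(size) > MODULES_LEN)
		return NULL;

	return __vmalloc_node_range(size, MODULE_ALIGN, MODULES_VADDR,
				    MODULES_END, GFP_KERNEL, PAGE_KERNEL,
				    0, NUMA_NO_NODE,
				    __builtin_return_address(0));
}
```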
On Tue, Apr 19, 2022 at 12:20:39PM -0700, Linus Torvalds wrote: > On Tue, Apr 19, 2022 at 11:42 AM Mike Rapoport <rppt@kernel.org> wrote: > > > > I'd say that bpf_prog_pack was a cure for symptoms and this project tries > > to address more general problem. > > But you are right, it'll take some time and won't land in 5.19. > > Just to update people: I've just applied Song's [1/4] patch, which > means that the whole current hugepage vmalloc thing is effectively > disabled (because nothing opts in). > > And I suspect that will be the status for 5.18, unless somebody comes > up with some very strong arguments for (re-)starting using huge pages. Here is the quote from Song's cover letter for bpf_prog_pack series: Most BPF programs are small, but they consume a page each. For systems with busy traffic and many BPF programs, this could also add significant pressure to instruction TLB. High iTLB pressure usually causes slow down for the whole system, which includes visible performance degradation for production workloads. The last sentence is the key. We've added this feature not because of bpf programs themselves. So calling this feature an optimization is not quite correct. The number of bpf programs on the production server doesn't matter. The programs come and go all the time. That is the key here. The 4k module_alloc() plus set_memory_ro/x done by the JIT break down huge pages and increase TLB pressure on the kernel code. That creates visible performance degradation for normal user space workloads that are not doing anything bpf related. mm folks can fill in the details here. My understanding it's something to do with identity mapping. So we're not trying to improve bpf performance. We're trying to make sure that bpf program load/unload doesn't affect the speed of the kernel. Generalizing bpf_prog_alloc to modules would be nice, but it's not clear what benefits such optimization might have. It's orthogonal here. So I argue that all 4 Song's fixes are necessary in 5.18. We need an additional zeroing patch too, of course, to make sure huge page doesn't have garbage at alloc time and it's cleaned after prog is unloaded. Regarding JIT spraying and other concerns. Short answer: nothing changed. JIT spraying was mitigated with start address randomization and invalid instruction padding. Both features are still present. Constant blinding is also fully functional. Any kind of generalization of bpf_prog_pack into general mm feature would be nice, but it cannot be done as opportunistic cache. We need a guarantee that bpf prog/unload won't recreate the issue with kernel performance degradation. I suspect we would need bpf_prog_pack in the current form for foreseeable future.
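A hedged illustration of the per-program pattern Alexei is describing (legacy_jit_image is an illustrative name; the calls are the real APIs): each program gets its own pages, and every permission flip also updates the direct-map alias of those pages, splitting huge direct-map entries and degrading kernel TLB behaviour system-wide.

```c
#include <linux/moduleloader.h>
#include <linux/set_memory.h>
#include <linux/vmalloc.h>

/* Illustrative sketch of the old per-program pattern that splits the direct map. */
static void *legacy_jit_image(unsigned int size, unsigned int npages)
{
	void *hdr = module_alloc(size);		/* private pages per program */

	if (!hdr)
		return NULL;
	set_vm_flush_reset_perms(hdr);
	set_memory_ro((unsigned long)hdr, npages);	/* CPA also splits the direct-map alias */
	set_memory_x((unsigned long)hdr, npages);
	return hdr;
}
```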
On Tue, Apr 19, 2022 at 7:03 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > Here is the quote from Song's cover letter for bpf_prog_pack series: I care about performance as much as the next person, but I care about correctness too. That large-page code was a disaster, and was buggy and broken. And even with those four patches, it's still broken. End result: there's no way that thing gets re-enabled without the correctness being in place. At a minimum, to re-enable it, it needs (a) that zeroing and (b) actual numbers on real loads (not some artificial benchmark). Because without (a) there's no way in hell I'll enable it. And without (b), "performance" isn't actually an argument. Linus
On Tue 2022-04-19 14:24:38, Luis Chamberlain wrote: > On Tue, Apr 19, 2022 at 01:56:03AM +0000, Edgecombe, Rick P wrote: > > Yea, that was my understanding. X86 modules have to be linked within > > 2GB of the kernel text, also eBPF x86 JIT generates code that expects > > to be within 2GB of the kernel text. > > And kprobes / live patching / ftrace. > > Another architectural fun fact, powerpc book3s/32 requires executability > to be set per 256 Mbytes segments. Some architectures like this one > will want to also optimize how they use the module alloc area. > > Even though today the use cases might be limited, we don't exactly know > how much memory a target device has a well, and so treating memory > failures for "special memory" request as regular memory failures seems > a bit odd, and users could get confused. For instance slapping on > extra memory on a system won't resolve any issues if the limit for a > special type of memory is already hit. Very likely not a problem at all today, > given how small modules / eBPF jit programs are / etc, but conceptually it > would seem wrong to just say -ENOMEM when in fact it's a special type of > required memory which cannot be allocated and the issue cannot possibly be > fixed. I don't think we have an option but to use -ENOMEM but at least > hinting of the special failure would have seem desirable. I am not mm expert but I think that this is common problem. There are many types of "special memory". And mm provides many details via procfs, e.g. /proc/meminfo, /proc/slabinfo, /proc/vmstat. Best Regards, Petr
Hi Linus, > On Apr 19, 2022, at 7:18 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Tue, Apr 19, 2022 at 7:03 PM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: >> >> Here is the quote from Song's cover letter for bpf_prog_pack series: > > I care about performance as much as the next person, but I care about > correctness too. > > That large-page code was a disaster, and was buggy and broken. > > And even with those four patches, it's still broken. > > End result: there's no way that thigh gets re-enabled without the > correctness being in place. > > At a minimum, to re-enable it, it needs (a) that zeroing and (b) > actual numbers on real loads. (not some artificial benchmark). > > Because without (a) there's no way in hell I'll enable it. > > And without (b), "performance" isn't actually an argument. I will send patch to do (a) later this week. For (b), we have seen direct map fragmentation causing visible performance drop for our major services. This is the shadow production benchmark, so it is not possible to run it out of our data centers. Tracing showed that BPF program was the top trigger of these direct map splits. Thanks, Song
On Wed, Apr 20, 2022 at 02:42:37PM +0000, Song Liu wrote: > For (b), we have seen direct map fragmentation causing visible > performance drop for our major services. This is the shadow > production benchmark, so it is not possible to run it out of > our data centers. Tracing showed that BPF program was the top > trigger of these direct map splits. It's often not easy to reproduce issues like these, but I've run into that before for other proof-of-concept issues, and the solution has been a Linux selftest. For instance, "multithreaded" bombing of kmod can be triggered with lib/test_kmod.c and tools/testing/selftests/kmod/kmod.sh. Would designing a selftest to abuse the eBPF JIT be a possible way to reproduce the issue? Luis
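A userspace sketch of the kind of selftest being suggested: churn many trivial programs through the JIT via the raw bpf(2) syscall (kept raw here so the sketch is not tied to a particular libbpf version; it assumes the JIT is enabled, e.g. net.core.bpf_jit_enable=1, and sufficient privileges to load programs):

```c
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int load_trivial_prog(void)
{
	/* "r0 = 0; exit" - the smallest program the verifier will accept. */
	struct bpf_insn insns[] = {
		{ .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 },
		{ .code = BPF_JMP | BPF_EXIT },
	};
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
	attr.insns = (uintptr_t)insns;
	attr.insn_cnt = 2;
	attr.license = (uintptr_t)"GPL";

	return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}

int main(void)
{
	/* Load and unload many tiny programs to churn the JIT allocator. */
	for (int i = 0; i < 100000; i++) {
		int fd = load_trivial_prog();

		if (fd >= 0)
			close(fd);
	}
	return 0;
}
```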
Excerpts from Christoph Hellwig's message of April 16, 2022 3:08 pm: > On Fri, Apr 15, 2022 at 12:05:42PM -0700, Luis Chamberlain wrote: >> Looks good except for that I think this should just wait for v5.19. The >> fixes are so large I can't see why this needs to be rushed in other than >> the first assumptions of the optimizations had some flaws addressed here. > > Patches 1 and 2 are bug fixes for regressions caused by using huge page > backed vmalloc by default. So I think we do need it for 5.18. No, the huge vmap patch should just be reverted because that caused the regression, rather than adding another hack on top of it. All the breakage is in arch/x86/, it doesn't make sense to change this code and APIs outside x86 to work around it. And once they are fixed these shouldn't be needed. Thanks, Nick
Excerpts from Linus Torvalds's message of April 20, 2022 5:20 am: > On Tue, Apr 19, 2022 at 11:42 AM Mike Rapoport <rppt@kernel.org> wrote: >> >> I'd say that bpf_prog_pack was a cure for symptoms and this project tries >> to address more general problem. >> But you are right, it'll take some time and won't land in 5.19. > > Just to update people: I've just applied Song's [1/4] patch, which > means that the whole current hugepage vmalloc thing is effectively > disabled (because nothing opts in). > > And I suspect that will be the status for 5.18, unless somebody comes > up with some very strong arguments for (re-)starting using huge pages. Why not just revert fac54e2bfb5b ? Thanks, Nick
On Wed, Apr 20, 2022 at 8:25 PM Nicholas Piggin <npiggin@gmail.com> wrote: > > Why not just revert fac54e2bfb5b ? That would be stupid, with no sane way forward. The fact is, HUGE_VMALLOC was badly misdesigned, and enabling it on x86 only ended up showing the problems. It wasn't fac54e2bfb5b that was the fundamental issue. It was the whole "oh, we should never have done it that way to begin with". The whole initial notion that HAVE_ARCH_HUGE_VMALLOC means that there must be no PAGE_SIZE pte assumptions was simply broken. There were actual real cases that had those assumptions, and the whole "let's just change vmalloc behavior by default and then people who don't like it can opt out" was just fundamentally a really bad idea. Power had that random "oh, we don't want to do this for module_alloc", which you had a comment about "more testing" for. And s390 had a case of hardware limitations where it didn't work for some cases. And then enabling it on x86 turned up more issues. So yes, commit fac54e2bfb5b _exposed_ things to a much larger audience. But all it made clear was that your original notion of "let's change behavior and randomly disable it as things turn up" was just broken. Including "small" details like the fact that apparently VM_FLUSH_RESET_PERMS didn't work correctly any more for this, which caused issues for bpf, and that [PATCH 4/4]. And yes, there was a half-arsed comment ("may require extra work") to that effect in the powerpc __module_alloc() function, but it had been left to others to notice separately. So no. We're not going back to that completely broken model. The largepage thing needs to be opt-in, and needs a lot more care. Linus
On Wed, Apr 20, 2022 at 10:48 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > The largepage thing needs to be opt-in, and needs a lot more care. Side note: part of the opt-in really should be about the performance impact. It clearly can be quite noticeable, as outlined by that powerpc case in commit 8abddd968a30 ("powerpc/64s/radix: Enable huge vmalloc mappings"), but it presumably is some _particular_ case that actually matters. But it's equally clearly not the module code/data case, since __module_alloc() explicitly disables largepages on powerpc. At a guess, it's one or more of the large hash-table allocations. And it would actually be interesting to hear *which* one. From the 'git diff' workload, I'd expect it to be the dentry lookup hash table - I can't think of anything else that would be vmalloc'ed that would be remotely interesting - but who knows. So I think the whole "opt in" isn't _purely_ about the "oh, random cases are broken for odd reasons, so let's not enable it by default". I think it would actually be good to literally mark the cases that matter (and have the performance numbers for those cases). Linus
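To put rough numbers on that guess (illustrative only; the real dentry hash size scales with system memory and can be overridden on the kernel command line), the win is simply in how many TLB entries it takes to cover a multi-megabyte table:

#include <stdio.h>

int main(void)
{
	/* assume a 32 MB dentry hash table, purely for illustration */
	unsigned long bytes = 32UL << 20;

	printf("4K PTE mappings: %lu TLB entries\n", bytes >> 12);
	printf("2M PMD mappings: %lu TLB entries\n", bytes >> 21);
	return 0;
}

With 4K pages that table needs 8192 translations; with 2M pages it needs 16, which is why a hash-walk-heavy workload like 'git diff' can notice the difference.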
> On Apr 20, 2022, at 7:42 AM, Song Liu <songliubraving@fb.com> wrote: > > Hi Linus, > >> On Apr 19, 2022, at 7:18 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: >> >> On Tue, Apr 19, 2022 at 7:03 PM Alexei Starovoitov >> <alexei.starovoitov@gmail.com> wrote: >>> >>> Here is the quote from Song's cover letter for bpf_prog_pack series: >> >> I care about performance as much as the next person, but I care about >> correctness too. >> >> That large-page code was a disaster, and was buggy and broken. >> >> And even with those four patches, it's still broken. >> >> End result: there's no way that thing gets re-enabled without the >> correctness being in place. >> >> At a minimum, to re-enable it, it needs (a) that zeroing and (b) >> actual numbers on real loads. (not some artificial benchmark). >> >> Because without (a) there's no way in hell I'll enable it. >> >> And without (b), "performance" isn't actually an argument. > > I will send a patch to do (a) later this week. > > For (b), we have seen direct map fragmentation causing visible > performance drop for our major services. This is the shadow > production benchmark, so it is not possible to run it out of > our data centers. Tracing showed that BPF program was the top > trigger of these direct map splits. Attached is the patch for (a). I also sent it to the mailing list. Thanks, Song
Excerpts from Linus Torvalds's message of April 21, 2022 3:48 pm: > On Wed, Apr 20, 2022 at 8:25 PM Nicholas Piggin <npiggin@gmail.com> wrote: >> >> Why not just revert fac54e2bfb5b ? > > That would be stupid, with no sane way forward. > > The fact is, HUGE_VMALLOC was badly misdesigned, and enabling it on > x86 only ended up showing the problems. > > It wasn't fac54e2bfb5b that was the fundamental issue. It was the > whole "oh, we should never have done it that way to begin with". > > The whole initial notion that HAVE_ARCH_HUGE_VMALLOC means that there > must be no PAGE_SIZE pte assumptions was simply broken. It didn't have that requirement so much as required it to be accounted for if the arch enabled it. > There were > actual real cases that had those assumptions, and the whole "let's > just change vmalloc behavior by default and then people who don't like > it can opt out" was just fundamentally a really bad idea. > > Power had that random "oh, we don't want to do this for module_alloc", > which you had a comment about "more testing" for. > > And s390 had a case of hardware limitations where it didn't work for some cases. > > And then enabling it on x86 turned up more issues. Those were (AFAIKS) all in arch code though. The patch was the fundamental issue for x86 because it had bugs. I don't quite see what your objection is to power and s390's working implementations. Some parts of the arch code could not cope with hue PTEs so they used small. Switching the API around to expect non-arch code to know whether or not it can use huge mappings is much worse. How is alloc_large_system_hash expected to know whether it may use huge pages on any given half-broken arch like x86? It's the same like we have huge iomap for a long time. No driver should be expect to have to understand that. > So yes, commit fac54e2bfb5b _exposed_ things to a much larger > audience. But all it just made clear was that your original notion of > "let's change behavior and randomly disable it as things turn up" was > just broken. > > Including "small" details like the fact that apparently > VM_FLUSH_RESET_PERMS didn't work correctly any more for this, which > caused issues for bpf, and that [PATCH 4/4]. Which is another arch detail. > And yes, there was a > half-arsed comment ("may require extra work") to that effect in the > powerpc __module_alloc() function, but it had been left to others to > notice separately. It had a comment in arch/Kconfig about it. Combing through the details of every arch is left to others who choose to opt-in though. > So no. We're not going back to that completely broken model. The > lagepage thing needs to be opt-in, and needs a lot more care. I don't think it should be opt-in at the caller level (at least not outside arch/). As I said earlier maybe we end up finding fragmentation to be a problem that can't be solved with simple heuristics tweaking so we could think about adding something to give size/speed hint tradeoff, but as for "can this caller use huge vmap backed memory", it should not be something drivers or core code has to think about. Thanks, Nick
Excerpts from Linus Torvalds's message of April 21, 2022 4:02 pm: > On Wed, Apr 20, 2022 at 10:48 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> The lagepage thing needs to be opt-in, and needs a lot more care. > > Side note: part of the opt-in really should be about the performance impact. > > It clearly can be quite noticeable, as outlined by that powerpc case > in commit 8abddd968a30 ("powerpc/64s/radix: Enable huge vmalloc > mappings"), but it presumably is some _particular_ case that actually > matters. > > But it's equalyl clearly not the module code/data case, since > __module_alloc() explicitly disables largepages on powerpc. > > At a guess, it's one or more of the large hash-table allocations. The changelog is explicit it is the vfs hashes. > And it would actually be interesting to hear *which*one*. From the > 'git diff' workload, I'd expect it to be the dentry lookup hash table > - I can't think of anything else that would be vmalloc'ed that would > be remotely interesting - but who knows. I didn't measure dentry/inode separately but it should mostly (~entirely?) be the dentry hash, yes. > So I think the whole "opt in" isn't _purely_ about the "oh, random > cases are broken for odd reasons, so let's not enable it by default". The whole concept is totally broken upstream now though. Core code absolutely can not mark any allocation as able to use huge pages because x86 is in some crazy half-working state. Can we use hugepage dentry cache with x86 with hibernation? With BPF? Who knows. > I think it would actually be good to literally mark the cases that > matter (and have the performance numbers for those cases). As per previous comment, not for correctness but possibly to help guide some heuristic. I don't see it being too big a deal though, a multi-MB vmalloc that can use hugepages probably wants to, quite small downside (fragmentation being about the only one, but there aren't a vast number of such allocations in the kernel to have been noticed as yet). Thanks, Nick
Excerpts from Linus Torvalds's message of April 21, 2022 3:48 pm: > On Wed, Apr 20, 2022 at 8:25 PM Nicholas Piggin <npiggin@gmail.com> wrote: >> >> Why not just revert fac54e2bfb5b ? > > That would be stupid, with no sane way forward. Oh I missed this comment. Now I'm completely confused. Reverting the patch which caused the breakage *is* the sane way forward. Your tree is not broken after that so you're done. And if x86 wanted to select HUGE_VMALLOC in future then it can do so after fixing the issues it has with it, so that's that the sane way forward for that. What you have now is what's insane. HAVE_ARCH_HUGE_VMALLOC now means "you can ask for huge pages but it might crash in undocumented arch-specific circumstances so good luck". Thanks, Nick
On Thu, Apr 21, 2022 at 1:57 AM Nicholas Piggin <npiggin@gmail.com> wrote: > > Those were (AFAIKS) all in arch code though. No Nick, they really weren't. The bpf issue with VM_FLUSH_RESET_PERMS means that all your arguments are invalid, because this affected non-architecture code. So the bpf case had two independent issues: one was just bpf doing a really bad job at making sure the executable mapping was sanely initialized. But the other was an actual bug in that hugepage case for vmalloc. And that bug was an issue on power too. So your "this is purely an x86 issue" argument is simply wrong. Because I'm very much looking at that power code that says "oh, __module_alloc() needs more work". Notice? Can these be fixed? Yes. But they can't be fixed by saying "oh, let's disable it on x86". Although it's probably true that at that point, some of the issues would no longer be nearly as noticeable. Linus
On Thu, 2022-04-21 at 18:57 +1000, Nicholas Piggin wrote: > Those were (AFAIKS) all in arch code though. The patch was the > fundamental issue for x86 because it had bugs. This wasn't root caused to arch code: "BUG: Bad page state in process systemd-udevd" https://lore.kernel.org/all/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/ In fact it wasn't root caused at all. But on the surface it seemed like it didn't have to do with virtual page size assumptions. I wonder if it might have to do with the vmalloc huge pages using compound pages, then some caller doing vmalloc_to_page() and getting surprised with what they could get away with in the struct page. But, regardless there was an assumption, not proven, that there was some lurking cross-arch issue that could show up for any vmalloc huge page user. There is another good reason to opt-in to the current vmalloc huge page size implementation - vmalloc will round up to huge page size once the size exceeds the huge page size. Only the callers can know if the allocation is worth using extra memory for.
On Thu, Apr 21, 2022 at 8:47 AM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote: > > I wonder if it > might have to do with the vmalloc huge pages using compound pages, then > some caller doing vmalloc_to_page() and getting surprised with what > they could get away with in the struct page. Very likely. We have 100+ users of vmalloc_to_page() in random drivers, and the gpu code does show up on that list. And is very much another case of "it's always been broken, but enabling it on x86 made the breakage actually show up in real life". Linus
Excerpts from Linus Torvalds's message of April 22, 2022 1:44 am: > On Thu, Apr 21, 2022 at 1:57 AM Nicholas Piggin <npiggin@gmail.com> wrote: >> >> Those were (AFAIKS) all in arch code though. > > No Nick, they really weren't. > > The bpf issue with VM_FLUSH_RESET_PERMS means that all your arguments > are invalid, because this affected non-architecture code. VM_FLUSH_RESET_PERMS was because bpf uses the arch module allocation code which was not capable of dealing with huge pages in the arch specific direct map manipulation stuff was unable to deal with it. An x86 bug. > So the bpf case had two independent issues: one was just bpf doing a > really bad job at making sure the executable mapping was sanely > initialized. > > But the other was an actual bug in that hugepage case for vmalloc. > > And that bug was an issue on power too. I missed it, which bug was that? > > So your "this is purely an x86 issue" argument is simply wrong. > Because I'm very much looking at that power code that says "oh, > __module_alloc() needs more work". > > Notice? No I don't notice. More work to support huge allocations for executable mappings, sure. But the arch's implementation explicitly does not support that yet. That doesn't make huge vmalloc broken! Ridiculous. It works fine. > > Can these be fixed? Yes. But they can't be fixed by saying "oh, let's > disable it on x86". You did just effectively disable it on x86 though. And why can't it be reverted on x86 until it's fixed on x86?? > Although it's probably true that at that point, some of the issues > would no longer be nearly as noticeable. There really aren't all these "issues" you're imagining. They aren't noticable now, on power or s390, because they have non-buggy HAVE_ARCH_HUGE_VMALLOC implementations. If you're really going to insist on this will you apply this to fix (some of) the performance regressions it introduced? Thanks, Nick diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6e5b4488a0c5..b555f17e84d5 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -8919,7 +8919,10 @@ void *__init alloc_large_system_hash(const char *tablename, table = memblock_alloc_raw(size, SMP_CACHE_BYTES); } else if (get_order(size) >= MAX_ORDER || hashdist) { - table = __vmalloc(size, gfp_flags); + if (IS_ENABLED(CONFIG_PPC) || IS_ENABLED(CONFIG_S390)) + table = vmalloc_huge(size, gfp_flags); + else + table = __vmalloc(size, gfp_flags); virt = true; if (table) huge = is_vm_area_hugepages(table);
Excerpts from Linus Torvalds's message of April 22, 2022 2:15 am: > On Thu, Apr 21, 2022 at 8:47 AM Edgecombe, Rick P > <rick.p.edgecombe@intel.com> wrote: >> >> I wonder if it >> might have to do with the vmalloc huge pages using compound pages, then >> some caller doing vmalloc_to_page() and getting surprised with what >> they could get away with in the struct page. > > Very likely. We have 100+ users of vmalloc_to_page() in random > drivers, and the gpu code does show up on that list. > > And is very much another case of "it's always been broken, but > enabling it on x86 made the breakage actually show up in real life". Okay that looks like a valid breakage. *Possibly* fb_deferred_io_fault() using pages vmalloced to screen_buffer? Or a couple of the gpu drivers are playing with page->mapping as well, not sure if they're vmalloced. But the fix is this (untested at the moment). It's not some fundamental reason why any driver should care about allocation size, it's a simple bug in my code that missed that case. The whole point of the design is that it's transparent to callers! Thanks, Nick diff --git a/mm/vmalloc.c b/mm/vmalloc.c index e163372d3967..70933f4ed069 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2925,12 +2925,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid, if (nr != nr_pages_request) break; } - } else - /* - * Compound pages required for remap_vmalloc_page if - * high-order pages. - */ - gfp |= __GFP_COMP; + } /* High-order pages or fallback path if "bulk" fails. */ @@ -2944,6 +2939,13 @@ vm_area_alloc_pages(gfp_t gfp, int nid, page = alloc_pages_node(nid, gfp, order); if (unlikely(!page)) break; + /* + * Higher order allocations must be able to be treated as + * indepdenent small pages by callers (as they can with + * small page allocs). + */ + if (order) + split_page(page, order); /* * Careful, we allocate and map page-order pages, but
On Thu, Apr 21, 2022 at 4:30 PM Nicholas Piggin <npiggin@gmail.com> wrote: > > VM_FLUSH_RESET_PERMS was because bpf uses the arch module allocation > code which was not capable of dealing with huge pages in the arch > specific direct map manipulation stuff was unable to deal with it. > An x86 bug. .. and a power one? The only thing really special in __module_alloc() on power is that same VM_FLUSH_RESET_PERMS. Why had you otherwise disabled it there on powerpc too? > > And that bug was an issue on power too. > > I missed it, which bug was that? See above. At least it's very strongly implied by how the powerpc __module_alloc() function also used VM_FLUSH_RESET_PERMS | VM_NO_HUGE_VMAP, because there isn't much else going on. > No I don't notice. More work to support huge allocations for > executable mappings, sure. But the arch's implementation explicitly > does not support that yet. That doesn't make huge vmalloc broken! > Ridiculous. It works fine. There are several other reports of problems that weren't related to permissions (at least not obviously so). You were pointed at one of them in this thread: https://lore.kernel.org/all/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/ and yes, it also happened on x86-64, but my point this whole time has been that x86-64 gets A *LOT* MORE TEST COVERAGE. See the difference? The fact that it has workedc for you on powerpc doesn't mean that it's fine on powerpc. It only means that powerpc gets about one thousandth of the test coverage that x86-64 gets. > You did just effectively disable it on x86 though. I disabled it *EVERYWHERE*. What is so hard to understand about that? Why are you so convinced this is about x86? It's not. > There really aren't all these "issues" you're imagining. They > aren't noticable now, on power or s390, because they have > non-buggy HAVE_ARCH_HUGE_VMALLOC implementations. So I really think you've not tested it. How many of those powerpc or s390 machines do you think test drivers that do vmalloc_to_page() and then do something with that 'struct page *'? Seriously. Why are you so convinced that "oh, any vmalloc() can be converted to large pages"? I really think the only reason you think that is because it ran on machines that basically have almost no drivers in use, and are all very homogenous, and just didn't happen to hit the bugs. IOW, I think you are probably entirely right that x86 has its own set of excitement (the bpf thread has this thing about how x86 does RO differently from other architectures), and there are likely x86 specific bugs *TOO*. But let's just pick a random driver that uses vmalloc (literally random - I just grepped for it and started looking at it): drivers/infiniband/hw/qib/qib_file_ops.c and it unquestionably does absolutely disgusting things, and if looking at the code makes you go "nobody should do that", then I won't disagree all that much. But as an example of what it does, it basically does things like this: rcd->subctxt_uregbase = vmalloc_user(...); and then you can mmap it in user space in mmap_kvaddr(): addr = rcd->subctxt_uregbase; size = PAGE_SIZE * subctxt_cnt; ... 
vma->vm_pgoff = (unsigned long) addr >> PAGE_SHIFT; vma->vm_ops = &qib_file_vm_ops; and then the page fault routine is: static const struct vm_operations_struct qib_file_vm_ops = { .fault = qib_file_vma_fault, }; and that function qib_file_vma_fault() does things like this: page = vmalloc_to_page((void *)(vmf->pgoff << PAGE_SHIFT)); if (!page) return VM_FAULT_SIGBUS; get_page(page); vmf->page = page; return 0; and let me just claim (a) I bet you have never EVER tested this kind of insane code on powerpc (b) do you think this will work great if vmalloc() allocates large pages? Can you now see what I'm saying? Can you now admit that the whole "nmake vmalloc() do large pages without explicit opt-in" was a HORRIBLE MISTAKE. > If you're really going to insist on this will you apply this to fix > (some of) the performance regressions it introduced? No. That patch is disgusting and there is no excuse for applying something crap like that. What I applied was the first in a series of patches that do it sanely. That whole "a sane way forward" thing. See https://lore.kernel.org/all/20220415164413.2727220-3-song@kernel.org/ for [PATCH 2/4] in the series for this particular issue. But I'm not applying anything else than the "disable this mess" before we have more discussion and consensus. And dammit, you had better just admit that this wasn't some x86-only thing. Powerpc and s390 were BROKEN GARBAGE AND JUST DIDN'T HAVE THE TEST COVERAGE. Seriously. And the thing is, your opt-out approach was just POINTLESS. The actual cases that are performance-critical are likely in the single digits. It's probably just that big-hash case and maybe a *couple* of other cases (ie the bpf jit really wants to use it). So opt-in is clearly the correct thing to do. Do you now understand why I applied that "users must opt-in" patch? Do you now understand that this is not some "x86" thing? Linus
Excerpts from Linus Torvalds's message of April 22, 2022 10:49 am: > On Thu, Apr 21, 2022 at 4:30 PM Nicholas Piggin <npiggin@gmail.com> wrote: >> >> VM_FLUSH_RESET_PERMS was because bpf uses the arch module allocation >> code which was not capable of dealing with huge pages in the arch >> specific direct map manipulation stuff was unable to deal with it. >> An x86 bug. > > .. and a power one? The only thing really special in __module_alloc() > on power is that same VM_FLUSH_RESET_PERMS. > > Why had you otherwise disabled it there on powerpc too? Right, it is for that (and possibly other similar ro->nx stuff for module loading and instruction patching etc). That's why we don't do huge vmalloc for executable mappings (yet). I don't understand why you're trying to claim that limitation is a bug in power or warrants disabling the whole thing for data mappings as well. > >> > And that bug was an issue on power too. >> >> I missed it, which bug was that? > > See above. At least it's very strongly implied by how the powerpc > __module_alloc() function also used > > VM_FLUSH_RESET_PERMS | VM_NO_HUGE_VMAP, > > because there isn't much else going on. So that's not a bug though. To be clear, VM_FLUSH_RESET_PERMS is clearly never to be used in driver code, it's a low-level arch specific detail. So the flag itself does not have to be usable by any old random code. > >> No I don't notice. More work to support huge allocations for >> executable mappings, sure. But the arch's implementation explicitly >> does not support that yet. That doesn't make huge vmalloc broken! >> Ridiculous. It works fine. > > There are several other reports of problems that weren't related to > permissions (at least not obviously so). > > You were pointed at one of them in this thread: > > https://lore.kernel.org/all/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/ Right, I agree see the patch I posted. > > and yes, it also happened on x86-64, but my point this whole time has > been that x86-64 gets A *LOT* MORE TEST COVERAGE. > > See the difference? The fact that it has workedc for you on powerpc > doesn't mean that it's fine on powerpc. It only means that powerpc > gets about one thousandth of the test coverage that x86-64 gets. Surely it's not unreasonable that there will be *some* bugs when coverage is expanded. I'm not saying the code I write is perfect and bug-free when I say it's fine on power, I'm saying that in concept it is fine. > >> You did just effectively disable it on x86 though. > > I disabled it *EVERYWHERE*. > > What is so hard to understand about that? > > Why are you so convinced this is about x86? Well the x86 bugs in its direct mapping stuff seem to be. If they go away when not using huge mappings for modules then fine, I'm just not really across all the issues but it sounded like there were more than just this module case. > > It's not. > >> There really aren't all these "issues" you're imagining. They >> aren't noticable now, on power or s390, because they have >> non-buggy HAVE_ARCH_HUGE_VMALLOC implementations. > > So I really think you've not tested it. > > How many of those powerpc or s390 machines do you think test drivers > that do vmalloc_to_page() and then do something with that 'struct page > *'? Probably quite a few, but the intersection of those with ones that also allocate huge vmalloc *and* modify struct page fields to trip that bug was probably empty. > Seriously. Why are you so convinced that "oh, any vmalloc() can be > converted to large pages"? Because it can be transparent. 
The bug was (stupidly) using compound pages when it should have just used split higher order pages. But the whole idea of it is that it's supposed to be transparent to drivers. I don't get why you're freaking out about vmalloc_to_page(), a huge mapping is pretty much just like a small mapping except the pages are contiguous and the page tables are a bit different. > I really think the only reason you think that is because it ran on > machines that basically have almost no drivers in use, and are all > very homogenous, and just didn't happen to hit the bugs. > > IOW, I think you are probably entirely right that x86 has its own set > of excitement (the bpf thread has this thing about how x86 does RO > differently from other architectures), and there are likely x86 > specific bugs *TOO*. > > But let's just pick a random driver that uses vmalloc (literally > random - I just grepped for it and started looking at it): > > drivers/infiniband/hw/qib/qib_file_ops.c > > and it unquestionably does absolutely disgusting things, and if > looking at the code makes you go "nobody should do that", then I won't > disagree all that much. > > But as an example of what it does, it basically does things like this: > > rcd->subctxt_uregbase = vmalloc_user(...); > > and then you can mmap it in user space in mmap_kvaddr(): > > addr = rcd->subctxt_uregbase; > size = PAGE_SIZE * subctxt_cnt; > ... > vma->vm_pgoff = (unsigned long) addr >> PAGE_SHIFT; > vma->vm_ops = &qib_file_vm_ops; > > and then the page fault routine is: > > static const struct vm_operations_struct qib_file_vm_ops = { > .fault = qib_file_vma_fault, > }; > > and that function qib_file_vma_fault() does things like this: > > page = vmalloc_to_page((void *)(vmf->pgoff << PAGE_SHIFT)); > if (!page) > return VM_FAULT_SIGBUS; > > get_page(page); > vmf->page = page; > > return 0; > > and let me just claim > > (a) I bet you have never EVER tested this kind of insane code on powerpc > > (b) do you think this will work great if vmalloc() allocates large pages? Yes of course it will, with said 2-line bugfix. > Can you now see what I'm saying? > > Can you now admit that the whole "nmake vmalloc() do large pages > without explicit opt-in" was a HORRIBLE MISTAKE. No, can you admit that vmalloc_to_page is not some some crazy problem? The drivers are doing fairly horrible things sure, but *nothing* can look at the PTEs without going through accessors (except arch code obviously, which has to adapt before enabling). vmalloc_to_page() just gives you a struct page! Why would a driver care that it now happens to be contiguous with the next one or not?? >> If you're really going to insist on this will you apply this to fix >> (some of) the performance regressions it introduced? > > No. > > That patch is disgusting and there is no excuse for applying something > crap like that. > > What I applied was the first in a series of patches that do it sanely. > That whole "a sane way forward" thing. > > See > > https://lore.kernel.org/all/20220415164413.2727220-3-song@kernel.org/ > > for [PATCH 2/4] in the series for this particular issue. I was being facetious. The problem is you can't do ^ because x86 is buggy. > > But I'm not applying anything else than the "disable this mess" before > we have more discussion and consensus. > > And dammit, you had better just admit that this wasn't some x86-only thing. > > Powerpc and s390 were BROKEN GARBAGE AND JUST DIDN'T HAVE THE TEST COVERAGE. Everybody writes bugs, nobody has 100% test coverage. I missed something, big whoop. 
That doesn't invalidate the entire concept of the feature. > > Seriously. > > And the thing is, your opt-out approach was just POINTLESS. The actual > cases that are performance-critical are likely in the single digits. > > It's probably just that big-hash case and maybe a *couple* of other > cases (ie the bpf jit really wants to use it). > > So opt-in is clearly the correct thing to do. > > Do you now understand why I applied that "users must opt-in" patch? No, opt-in for correctness is stupid and unecessary because callers shouldn't have to know about it. That's the whole point of the huge vmalloc is that it's supposed to be transparent to non-arch code. That drm screen buffer thing involved in the bug -- why wouldn't that want to use huge pages? > > Do you now understand that this is not some "x86" thing? > If the only bug is that compound page thing and x86 is otherwise fine then sure, and patch 1 was unnecessary to begin with. Thanks, Nick
On Fri, 2022-04-22 at 10:12 +1000, Nicholas Piggin wrote: > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > index e163372d3967..70933f4ed069 100644 > --- a/mm/vmalloc.c > +++ b/mm/vmalloc.c > @@ -2925,12 +2925,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid, > if (nr != nr_pages_request) > break; > } > - } else > - /* > - * Compound pages required for remap_vmalloc_page if > - * high-order pages. > - */ > - gfp |= __GFP_COMP; > + } > > /* High-order pages or fallback path if "bulk" fails. */ > > @@ -2944,6 +2939,13 @@ vm_area_alloc_pages(gfp_t gfp, int nid, > page = alloc_pages_node(nid, gfp, order); > if (unlikely(!page)) > break; > + /* > + * Higher order allocations must be able to be > treated as > + * indepdenent small pages by callers (as they can > with > + * small page allocs). > + */ > + if (order) > + split_page(page, order); > > /* > * Careful, we allocate and map page-order pages, but FWIW, I like this direction. I think it needs to free them differently though? Since currently assumes they are high order pages in that path. I also wonder if we wouldn't need vm_struct->page_order anymore, and all the places that would percolates out to. Basically all the places where it iterates through vm_struct->pages with page_order stepping. Besides fixing the bisected issue (hopefully), it also more cleanly separates the mapping from the backing allocation logic. And then since all the pages are 4k (from the page allocator perspective), it would be easier to support non-huge page aligned sizes. i.e. not use up a whole additional 2MB page if you only need 4k more of allocation size.
On Thu, Apr 21, 2022 at 6:51 PM Nicholas Piggin <npiggin@gmail.com> wrote: > > > See > > > > https://lore.kernel.org/all/20220415164413.2727220-3-song@kernel.org/ > > > > for [PATCH 2/4] in the series for this particular issue. > > I was being facetious. The problem is you can't do ^ because x86 is > buggy. No, we probably *can* do that PATCH 2/4. I suspect x86 really isn't that buggy. The bugs are elsewhere (including other vmalloc_huge() uses). Really. Why can't you just admit that the major bug was in the hugepage code itself? You claim: > Because it can be transparent. The bug was (stupidly) using compound > pages when it should have just used split higher order pages. but we're in -rc3 for 5.18, and you seem to be entirely ignoring the fact that that stupid bug has been around for a *YEAR* now. Guess what? It was reported within *days* of the code having been enabled on x86. But for about a year, youv'e been convinced that powerpc is fine, because nobody ever reported it. And you *still* try to make this about how it's some "x86 bug", despite that bug not having been x86-specific AT ALL. Nick, please take a long look at yourself in the mirror. And stop this whole mindless "it's x86". The *ONLY* thing x86-64 did was to show that the code that had been enabled on powerpc for a year had gotten almost no testing there. And don't bother mentioning s390. It got even less coverage there. So exactly *because* bugs were uncovered in days by x86 enabling this, I'm not rushing to re-enable it until I think it's gone through more thinking and testing. And in particular, I really *really* want to limit the fallout. For example, your "two-liner fix" is not at all obvious. That broken code case used to have a comment that remap_vmalloc_page() required compound pages, and you just removed that whole thing as if it didn't matter, and split the page. (I also think the comment meant 'vmap_pages_range()', but whatever). And the thing is, I'm not entirely convinced that comment was wrong and could just be ignored. The freeing code in __vunmap() will do int i, step = 1U << page_order; for (i = 0; i < area->nr_pages; i += step) { struct page *page = area->pages[i]; BUG_ON(!page); mod_memcg_page_state(page, MEMCG_VMALLOC, -step); __free_pages(page, page_order); which now looks VERY VERY wrong. You've split the pages, they may be used as individual pages (possibly by other things), and then you now at freeing time treat them as a single compound page after all.. So your "trivial two-liner" that tried to fix a bug that has been there for a year now, itself seems quite questionable. Maybe it works, maybe it doesn't. My bet is "it doesn't". And guess what? I bet it worked just fine in your testing on powerpc, because you probably didn't actually have any real huge-page vmalloc cases except for those filesystem big-hash cases that never get free'd. So that "this code was completely buggy for a year on powerpc" never seemed to teach you anything about the code. And again - none of this is at all x86-specific. NOT AT ALL. So how about you admit you were wrong to begin with. That hugepage code needs more care before we re-enable it. Your two-liner wasn't so obvious after all, was it? I really think we're much safer saying "hugepage mappings only matter for a couple of things, and those things will *not* do sub-page games, so it's simple and safe". .. and that requires that opt-in model. Because your "it's transparent" argument has never ever actually been true, now has it? Linus
On Thu, Apr 21, 2022 at 7:29 PM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote: > > FWIW, I like this direction. I think it needs to free them differently > though? Very much so. > Besides fixing the bisected issue (hopefully), it also more cleanly > separates the mapping from the backing allocation logic. And then since > all the pages are 4k (from the page allocator perspective), it would be > easier to support non-huge page aligned sizes. i.e. not use up a whole > additional 2MB page if you only need 4k more of allocation size. I don't disagree, but I think the real problem is that the whole "oen page_order per vmalloc() area" itself is a bit broken. For example, AMD already does this "automatic TLB size" thing for when you have multiple contiguous PTE entries (shades of the old alpha "page size hint" thing, except it's automatic and doesn't have explicit hints). And I'm hoping Intel will do something similar in the future. End result? It would actually be really good to just map contiguous pages, but it doesn't have anything to do with the 2MB PMD size. And there's no "fixed order" needed either. If you have mapping that is 17 pages in size, it would still be good to allocate them as a block of 16 pages ("page_order = 4") and as a single page, because just laying them out in the page tables that way will already allow AMD to use a 64kB TLB entry for that 16-page block. But it would also work to just do the allocations as a set of 8, 4, 4 and 1. But the whole "one page order for one vmalloc" means that doesn't work very well. Where I disagree (violently) with Nick is his contention that (a) this is x86-specific and (b) this is somehow trivial to fix. Let's face it - the current code is broken. I think the sub-page issue is not entirely trivial, and the current design isn't even very good for it. But the *easy* cases are the ones that simply don't care - the ones that powerpc has actually been testing. So for 5.18, I think it's quite likely reasonable to re-enable large-page vmalloc for the easy case (ie those big hash tables). Re-enabling it *all*, considering how broken it has been, and how little testing it has clearly gotten? And potentially not enabling it on x86 because x86 is so much better at showing issues? That's not what I want to do. If the code is so broken that it can't be used on x86, then it's too broken to be enabled on powerpc and s390 too. Never mind that those architectures might have so limited use that they never realized how broken they were.. Linus
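A minimal sketch of the kind of decomposition Linus describes (illustrative userspace code, not a proposal for vmalloc itself): greedily split the request into descending power-of-two blocks, so a 17-page mapping becomes 16 + 1 and the contiguous runs are still laid out in a way hardware TLB coalescing can exploit, without tying the whole area to one fixed order.

#include <stdio.h>

/* Greedy power-of-two decomposition of an allocation request. */
static void decompose(unsigned long nr_pages)
{
	printf("%lu pages ->", nr_pages);
	while (nr_pages) {
		/* largest 2^order block that still fits */
		unsigned int order = 63 - __builtin_clzl(nr_pages);

		printf(" %lu", 1UL << order);
		nr_pages -= 1UL << order;
	}
	printf("\n");
}

int main(void)
{
	decompose(17);	/* -> 16 1 */
	decompose(513);	/* -> 512 1, i.e. one 2MB block plus a single 4K page */
	return 0;
}

The point of the sketch is only the shape of the allocation pattern; per-block bookkeeping in vm_struct, rather than a single page_order, is the part that would actually need design work.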
Excerpts from Linus Torvalds's message of April 22, 2022 12:31 pm: > On Thu, Apr 21, 2022 at 6:51 PM Nicholas Piggin <npiggin@gmail.com> wrote: >> >> > See >> > >> > https://lore.kernel.org/all/20220415164413.2727220-3-song@kernel.org/ >> > >> > for [PATCH 2/4] in the series for this particular issue. >> >> I was being facetious. The problem is you can't do ^ because x86 is >> buggy. > > No, we probably *can* do that PATCH 2/4. I suspect x86 really isn't > that buggy. The bugs are elsewhere (including other vmalloc_huge() > uses). I thought Rick seemed to be concerned about other issues but if not then great, I would happily admit to being mistaken. > > Really. Why can't you just admit that the major bug was in the > hugepage code itself? Uh, I did. But not the concept. You admit you were going stupid and freaking about about vmalloc_to_page which is really not some insurmountable problem. > > You claim: > >> Because it can be transparent. The bug was (stupidly) using compound >> pages when it should have just used split higher order pages. > > but we're in -rc3 for 5.18, and you seem to be entirely ignoring the > fact that that stupid bug has been around for a *YEAR* now. I'm not ignoring that at all. Again I completely agree that path is unlikely to have been tested (or at least not reported) on powerpc or s390. > Guess what? It was reported within *days* of the code having been > enabled on x86. Is that supposed to be a gotcha? If I was cc'ed on it I might have fixed it a month ago, but why would it be unusual for major new coverage of such a big feature to expose a bug? > But for about a year, youv'e been convinced that powerpc is fine, > because nobody ever reported it. > > And you *still* try to make this about how it's some "x86 bug", > despite that bug not having been x86-specific AT ALL. I think we're talking past one another. I never said that bug was x86 specific, it was only just brought to my attention now and I (presume that I) fixed it within 10 minutes of looking at it. That's not what I understood to be the only x86 problem though, but again as I said I'll admit I misunderstood that if it's not the case. [snip] > For example, your "two-liner fix" is not at all obvious. It is pretty obvious. Something wanted to use a tail page and ran afoul of the compound page thing. Splitting the page just gives us an array of conguous small pages. > That broken code case used to have a comment that remap_vmalloc_page() > required compound pages, and you just removed that whole thing as if > it didn't matter, and split the page. The new comment is a superset, I agree the old comment is under-done. remap_vmalloc_page() wanted to get_page the sub pages and without compound or split huge page they have a ref count of 0 so that blows up. split page just makes the pages behave exactly as they would if allocated individually (i.e., the small page case). Yes a stupid thinko on my part when I added that, but that's all it is, it's an obvious thing. > (I also think the comment meant 'vmap_pages_range()', but whatever). remap_vmalloc_range I think. But clearly the same issue applies to any caller which may use the struct pages for stuff. > And the thing is, I'm not entirely convinced that comment was wrong > and could just be ignored. 
The freeing code in __vunmap() will do > > int i, step = 1U << page_order; > > for (i = 0; i < area->nr_pages; i += step) { > struct page *page = area->pages[i]; > > BUG_ON(!page); > mod_memcg_page_state(page, MEMCG_VMALLOC, -step); > __free_pages(page, page_order); > > which now looks VERY VERY wrong. > > You've split the pages, they may be used as individual pages (possibly > by other things), and then you now at freeing time treat them as a > single compound page after all.. Oh yeah I'll take a look at the freeing side too, as I said totally untested but that's obviously the issue. > > So your "trivial two-liner" that tried to fix a bug that has been > there for a year now, itself seems quite questionable. > > Maybe it works, maybe it doesn't. My bet is "it doesn't". > > And guess what? I bet it worked just fine in your testing on powerpc, > because you probably didn't actually have any real huge-page vmalloc > cases except for those filesystem big-hash cases that never get > free'd. > > So that "this code was completely buggy for a year on powerpc" never > seemed to teach you anything about the code. > > And again - none of this is at all x86-specific. NOT AT ALL. > > So how about you admit you were wrong to begin with. > > That hugepage code needs more care before we re-enable it. > > Your two-liner wasn't so obvious after all, was it? > > I really think we're much safer saying "hugepage mappings only matter > for a couple of things, and those things will *not* do sub-page games, > so it's simple and safe". > > .. and that requires that opt-in model. > > Because your "it's transparent" argument has never ever actually been > true, now has it? Of course it has. You admit you didn't understand the design and freaked out about vmalloc_to_page when drivers playing games with struct page is a non issue. I'm not a moron, clearly if it wasn't intended to be transparent to callers it could not have just been enabled unconditionally. What's so difficult to you about the concept that the arch code is really the only thing that has to know about its page table arrangement? You're now fixating on this bug as though that invalidates the whole concept of huge vmalloc being transparent to callers. That's being disingenous though, it's just a bug. That's all it is. You merge fixes for hundreds of bugs everywhere in the tree every day. You're using it as a crutch because you can't admit you were wrong about it being transparent to drivers and thinking it has to be opt-in. Anyway blah whatever. I admit to everything, I did it, I'm bad I did a huge booboo a year ago, huge vmalloc is stupid, x86 is wonderful, everything else is garbage, Linus is great. Okay? I'll polish up the patch (*and* test it with a specifically written test case just for your), and then we can see if it solves that drm bug and we can go from there. Thanks, Nick
Excerpts from Edgecombe, Rick P's message of April 22, 2022 12:29 pm: > On Fri, 2022-04-22 at 10:12 +1000, Nicholas Piggin wrote: >> diff --git a/mm/vmalloc.c b/mm/vmalloc.c >> index e163372d3967..70933f4ed069 100644 >> --- a/mm/vmalloc.c >> +++ b/mm/vmalloc.c >> @@ -2925,12 +2925,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid, >> if (nr != nr_pages_request) >> break; >> } >> - } else >> - /* >> - * Compound pages required for remap_vmalloc_page if >> - * high-order pages. >> - */ >> - gfp |= __GFP_COMP; >> + } >> >> /* High-order pages or fallback path if "bulk" fails. */ >> >> @@ -2944,6 +2939,13 @@ vm_area_alloc_pages(gfp_t gfp, int nid, >> page = alloc_pages_node(nid, gfp, order); >> if (unlikely(!page)) >> break; >> + /* >> + * Higher order allocations must be able to be >> treated as >> + * indepdenent small pages by callers (as they can >> with >> + * small page allocs). >> + */ >> + if (order) >> + split_page(page, order); >> >> /* >> * Careful, we allocate and map page-order pages, but > > FWIW, I like this direction. I think it needs to free them differently > though? Since currently assumes they are high order pages in that path. Yeah I got a bit excited there, but fairly sure that's the bug. I'll do a proper patch. > I also wonder if we wouldn't need vm_struct->page_order anymore, and > all the places that would percolates out to. Basically all the places > where it iterates through vm_struct->pages with page_order stepping. > > Besides fixing the bisected issue (hopefully), it also more cleanly > separates the mapping from the backing allocation logic. And then since > all the pages are 4k (from the page allocator perspective), it would be > easier to support non-huge page aligned sizes. i.e. not use up a whole > additional 2MB page if you only need 4k more of allocation size. Somewhat. I'm not sure about freeing pages back for general use when they have page tables pointing to them. That will be a whole different kettle of fish I suspect. Honestly I don't think fragmentation has proven to be that bad of a problem yet that we'd want to do something like that yet. But there are other reasons to decople them in general. Some ppc32 I think had a different page size that was not a huge page which it could use to map vmalloc memory, for example. In any case I have no objections to making that part of it more flexible. Thanks, Nick
Excerpts from Edgecombe, Rick P's message of April 22, 2022 12:29 pm: > On Fri, 2022-04-22 at 10:12 +1000, Nicholas Piggin wrote: >> diff --git a/mm/vmalloc.c b/mm/vmalloc.c >> index e163372d3967..70933f4ed069 100644 >> --- a/mm/vmalloc.c >> +++ b/mm/vmalloc.c >> @@ -2925,12 +2925,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid, >> if (nr != nr_pages_request) >> break; >> } >> - } else >> - /* >> - * Compound pages required for remap_vmalloc_page if >> - * high-order pages. >> - */ >> - gfp |= __GFP_COMP; >> + } >> >> /* High-order pages or fallback path if "bulk" fails. */ >> >> @@ -2944,6 +2939,13 @@ vm_area_alloc_pages(gfp_t gfp, int nid, >> page = alloc_pages_node(nid, gfp, order); >> if (unlikely(!page)) >> break; >> + /* >> + * Higher order allocations must be able to be >> treated as >> + * indepdenent small pages by callers (as they can >> with >> + * small page allocs). >> + */ >> + if (order) >> + split_page(page, order); >> >> /* >> * Careful, we allocate and map page-order pages, but > > FWIW, I like this direction. I think it needs to free them differently > though? Since currently assumes they are high order pages in that path. > I also wonder if we wouldn't need vm_struct->page_order anymore, and > all the places that would percolates out to. Basically all the places > where it iterates through vm_struct->pages with page_order stepping. To respond to this, I do like having the huge page indicator in /proc/vmallocinfo. Which is really the only use of it. I might just keep it in for now. Thanks, Nick
Excerpts from Nicholas Piggin's message of April 22, 2022 1:08 pm: > Excerpts from Edgecombe, Rick P's message of April 22, 2022 12:29 pm: >> On Fri, 2022-04-22 at 10:12 +1000, Nicholas Piggin wrote: >>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c >>> index e163372d3967..70933f4ed069 100644 >>> --- a/mm/vmalloc.c >>> +++ b/mm/vmalloc.c >>> @@ -2925,12 +2925,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid, >>> if (nr != nr_pages_request) >>> break; >>> } >>> - } else >>> - /* >>> - * Compound pages required for remap_vmalloc_page if >>> - * high-order pages. >>> - */ >>> - gfp |= __GFP_COMP; >>> + } >>> >>> /* High-order pages or fallback path if "bulk" fails. */ >>> >>> @@ -2944,6 +2939,13 @@ vm_area_alloc_pages(gfp_t gfp, int nid, >>> page = alloc_pages_node(nid, gfp, order); >>> if (unlikely(!page)) >>> break; >>> + /* >>> + * Higher order allocations must be able to be >>> treated as >>> + * indepdenent small pages by callers (as they can >>> with >>> + * small page allocs). >>> + */ >>> + if (order) >>> + split_page(page, order); >>> >>> /* >>> * Careful, we allocate and map page-order pages, but >> >> FWIW, I like this direction. I think it needs to free them differently >> though? Since currently assumes they are high order pages in that path. > > Yeah I got a bit excited there, but fairly sure that's the bug. > I'll do a proper patch. So here's the patch on top of the revert. Only tested on a lowly powerpc machine, but it does fix this simple test case that does what the drm driver is obviously doing: size_t sz = PMD_SIZE; void *mem = vmalloc(sz); struct page *p = vmalloc_to_page(mem + PAGE_SIZE*3); p->mapping = NULL; p->index = 0; INIT_LIST_HEAD(&p->lru); vfree(mem); Without the below fix the same exact problem reproduces: BUG: Bad page state in process swapper/0 pfn:00743 page:(____ptrval____) refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x743 flags: 0x7ffff000000000(node=0|zone=0|lastcpupid=0x7ffff) raw: 007ffff000000000 c00c00000001d0c8 c00c00000001d0c8 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: corrupted mapping in tail page Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.18.0-rc3-00082-gfc6fff4a7ce1-dirty #2810 Call Trace: [c000000002383940] [c0000000006ebb00] dump_stack_lvl+0x74/0xa8 (unreliable) [c000000002383980] [c0000000003dabdc] bad_page+0x12c/0x170 [c000000002383a00] [c0000000003dad08] free_tail_pages_check+0xe8/0x190 [c000000002383a30] [c0000000003dc45c] free_pcp_prepare+0x31c/0x4e0 [c000000002383a90] [c0000000003df9f0] free_unref_page+0x40/0x1b0 [c000000002383ad0] [c0000000003d7fc8] __vunmap+0x1d8/0x420 [c000000002383b70] [c00000000102e0d8] proc_vmalloc_init+0xdc/0x108 [c000000002383bf0] [c000000000011f80] do_one_initcall+0x60/0x2c0 [c000000002383cc0] [c000000001001658] kernel_init_freeable+0x32c/0x3cc [c000000002383da0] [c000000000012564] kernel_init+0x34/0x1a0 [c000000002383e10] [c00000000000ce64] ret_from_kernel_thread+0x5c/0x64 Any other concerns with the fix? Thanks, Nick -- mm/vmalloc: huge vmalloc backing pages should be split rather than compound Huge vmalloc higher-order backing pages were allocated with __GFP_COMP in order to allow the sub-pages to be refcounted by callers such as "remap_vmalloc_page [sic]" (remap_vmalloc_range). However a similar problem exists for other struct page fields callers use, for example fb_deferred_io_fault() takes a vmalloc'ed page and not only refcounts it but uses ->lru, ->mapping, ->index. 
This is not compatible with compound sub-pages. The correct approach is to use split high-order pages for the huge vmalloc backing. These allow callers to treat them in exactly the same way as individually-allocated order-0 pages. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> --- mm/vmalloc.c | 36 +++++++++++++++++++++--------------- 1 file changed, 21 insertions(+), 15 deletions(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 0b17498a34f1..09470361dc03 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2653,15 +2653,18 @@ static void __vunmap(const void *addr, int deallocate_pages) vm_remove_mappings(area, deallocate_pages); if (deallocate_pages) { - unsigned int page_order = vm_area_page_order(area); - int i, step = 1U << page_order; + int i; - for (i = 0; i < area->nr_pages; i += step) { + for (i = 0; i < area->nr_pages; i++) { struct page *page = area->pages[i]; BUG_ON(!page); - mod_memcg_page_state(page, MEMCG_VMALLOC, -step); - __free_pages(page, page_order); + mod_memcg_page_state(page, MEMCG_VMALLOC, -1); + /* + * High-order allocs for huge vmallocs are split, so + * can be freed as an array of order-0 allocations + */ + __free_pages(page, 0); cond_resched(); } atomic_long_sub(area->nr_pages, &nr_vmalloc_pages); @@ -2914,12 +2917,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid, if (nr != nr_pages_request) break; } - } else - /* - * Compound pages required for remap_vmalloc_page if - * high-order pages. - */ - gfp |= __GFP_COMP; + } /* High-order pages or fallback path if "bulk" fails. */ @@ -2933,6 +2931,15 @@ vm_area_alloc_pages(gfp_t gfp, int nid, page = alloc_pages_node(nid, gfp, order); if (unlikely(!page)) break; + /* + * Higher order allocations must be able to be treated as + * indepdenent small pages by callers (as they can with + * small-page vmallocs). Some drivers do their own refcounting + * on vmalloc_to_page() pages, some use page->mapping, + * page->lru, etc. + */ + if (order) + split_page(page, order); /* * Careful, we allocate and map page-order pages, but @@ -2992,11 +2999,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, atomic_long_add(area->nr_pages, &nr_vmalloc_pages); if (gfp_mask & __GFP_ACCOUNT) { - int i, step = 1U << page_order; + int i; - for (i = 0; i < area->nr_pages; i += step) - mod_memcg_page_state(area->pages[i], MEMCG_VMALLOC, - step); + for (i = 0; i < area->nr_pages; i++) + mod_memcg_page_state(area->pages[i], MEMCG_VMALLOC, 1); } /*
On Thu, 2022-04-21 at 19:47 -0700, Linus Torvalds wrote: > I don't disagree, but I think the real problem is that the whole "oen > page_order per vmalloc() area" itself is a bit broken. Yea. It is the main reason it has to round up to huge page sizes AFAICT. I'd really like to see it use memory a little more efficiently if it is going to be an opt-out thing again. > > For example, AMD already does this "automatic TLB size" thing for > when > you have multiple contiguous PTE entries (shades of the old alpha > "page size hint" thing, except it's automatic and doesn't have > explicit hints). > > And I'm hoping Intel will do something similar in the future. > > End result? It would actually be really good to just map contiguous > pages, but it doesn't have anything to do with the 2MB PMD size. > > And there's no "fixed order" needed either. If you have mapping that > is 17 pages in size, it would still be good to allocate them as a > block of 16 pages ("page_order = 4") and as a single page, because > just laying them out in the page tables that way will already allow > AMD to use a 64kB TLB entry for that 16-page block. > > But it would also work to just do the allocations as a set of 8, 4, 4 > and 1. Hmm, that's neat. > > But the whole "one page order for one vmalloc" means that doesn't > work > very well. > > Where I disagree (violently) with Nick is his contention that (a) > this > is x86-specific and (b) this is somehow trivial to fix. > > Let's face it - the current code is broken. I think the sub-page > issue > is not entirely trivial, and the current design isn't even very good > for it. > > But the *easy* cases are the ones that simply don't care - the ones > that powerpc has actually been testing. > > So for 5.18, I think it's quite likely reasonable to re-enable > large-page vmalloc for the easy case (ie those big hash tables). > > Re-enabling it *all*, considering how broken it has been, and how > little testing it has clearly gotten? And potentially not enabling it > on x86 because x86 is so much better at showing issues? That's not > what I want to do. > > If the code is so broken that it can't be used on x86, then it's too > broken to be enabled on powerpc and s390 too. Never mind that those > architectures might have so limited use that they never realized how > broken they were.. I think there is another cross-arch issue here that we shouldn't lose sight of. There are not enough warnings in the code about the assumptions made on the arch's. The other issues are x86 specific in terms of who gets affected in rc1, but I dug up this prophetic assessment: https://lore.kernel.org/lkml/4488d39f-0698-7bfd-b81c-1e609821818f@intel.com/ That is pretty much what happened. Song came along and, in its current state, took it as a knob that could just be flipped. Seems pretty reasonable that it could happen again. So IMHO, the other general issue is the lack of guard rails or warnings for the next arch that comes along. Probably VM_FLUSH_RESET_PERMS should get some warnings as well. I kind of like the idea in that thread of making functions or configs for arch's to be forced to declare they have specific properties. Does it seem reasonable at this point? Probably not necessary as a 5.18 fix.
On Fri, 2022-04-22 at 14:31 +1000, Nicholas Piggin wrote:
> Any other concerns with the fix?
I had another concern with the vmalloc huge pages. Possibly getting
into paranoid territory here, but it's around the new behavior of
freeing vmalloc page tables. At least on x86 this doesn't happen very
often. AFAICT it all happens during boot or during memory hot unplug.
The risk this entails is racing against all the software page table
walks and walking a freed table. At least on x86, the walks of the
kernel tables are done with no locks, which works because the PTE
updates are atomic and the tables are pretty much never freed. Some of
the kernel page table walks in the fault handler are actually
triggerable from userspace.
So it's in the category of "there is already a bug" or, at most, a little
trouble someone could cause with a lot of effort, at least until PKS
vmalloc memory shows up. Anyway, as long as we are going over all of this
again, I thought it was worth bringing up.
On Fri, 2022-04-22 at 10:10 -0700, Edgecombe, Richard P wrote:
> The risk this entails is racing against all the software page table
> walks and walking a freed table. At least on x86, the walks of the
> kernel tables are done with no locks, which works because the PTE
> updates are atomic and the tables are pretty much never freed. Some of
> the kernel page table walks in the fault handler are actually
> triggerable from userspace.

Argh, please ignore this. I guess that, with interrupts disabled in the
fault handler, the freeing of a page table is forced to synchronize with
the fault handler's walks via the TLB flush IPIs that precede the free.
Sorry for the noise.
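Spelling out the pattern described above, as a hedged sketch rather than code
from the thread (demo_free_kernel_pte_table is a made-up name): the lockless
walker runs with interrupts off, and the freeing side issues an IPI-based
kernel TLB flush before the table page is returned to the allocator, so the
free cannot overtake an in-progress walk.

#include <linux/mm.h>
#include <linux/pgtable.h>
#include <asm/tlbflush.h>

/*
 * Simplified illustration. The walker side is, e.g., the x86 fault handler,
 * which runs with interrupts disabled and therefore cannot be in the middle
 * of a walk at the moment it acknowledges the flush IPI below.
 */
static void demo_free_kernel_pte_table(struct page *pte_page,
				       unsigned long addr)
{
	/*
	 * On x86 this flush is carried out with IPIs to all CPUs, and each
	 * CPU can only take the IPI once it is running with interrupts
	 * enabled again, i.e. after any lockless walk has finished.
	 */
	flush_tlb_kernel_range(addr, addr + PMD_SIZE);

	/* Only now is it safe to return the table page to the allocator. */
	__free_page(pte_page);
}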
[ I see that you posted a new version of the series, but I wasn't cc'd
  on that one, so I'm replying to the old thread instead ]

On Sat, Apr 16, 2022 at 12:55 PM Song Liu <songliubraving@fb.com> wrote:
>
> Patch 2/4 enables huge pages for large hash.

I decided that for 5.18, we want to at least fix the performance
regression on powerpc, so I've applied the 2/4 patch to enable huge
pages for the large hashes.

I also enabled them for kvmalloc(), since that seemed like the one
ObviouslySafe(tm) case of vmalloc use (famous last words, maybe I'll
be informed of somebody who still did odd protection games on the
result, but that really sounds invalid with the whole SLUB component).

I'm not touching the bpf parts. I think that's a 5.19 issue by now,
and since it's new, there's no equivalent performance regression
issue.

               Linus
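As a rough illustration of what enabling huge pages for kvmalloc() with the
opt-in flag amounts to (a sketch under the assumption that the vmalloc
fallback simply passes VM_ALLOW_HUGE_VMAP through __vmalloc_node_range();
demo_kvmalloc_fallback is a made-up name, not the committed diff):

#include <linux/mm.h>
#include <linux/vmalloc.h>

/* Hypothetical stand-in for kvmalloc_node()'s vmalloc fallback path. */
static void *demo_kvmalloc_fallback(size_t size, gfp_t flags, int node)
{
	/*
	 * VM_ALLOW_HUGE_VMAP is the new opt-in; the remaining arguments are
	 * the usual vmalloc fallback parameters.
	 */
	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
				    flags, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
				    node, __builtin_return_address(0));
}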
Hi Linus,

> On Apr 24, 2022, at 10:43 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> [ I see that you posted a new version of the series, but I wasn't cc'd
>   on that one, so I'm replying to the old thread instead ]

Thanks for fixing these up, and sorry for messing up the CC list.

>
> On Sat, Apr 16, 2022 at 12:55 PM Song Liu <songliubraving@fb.com> wrote:
>>
>> Patch 2/4 enables huge pages for large hash.
>
> I decided that for 5.18, we want to at least fix the performance
> regression on powerpc, so I've applied the 2/4 patch to enable huge
> pages for the large hashes.
>
> I also enabled them for kvmalloc(), since that seemed like the one
> ObviouslySafe(tm) case of vmalloc use (famous last words, maybe I'll
> be informed of somebody who still did odd protection games on the
> result, but that really sounds invalid with the whole SLUB component).
>
> I'm not touching the bpf parts. I think that's a 5.19 issue by now,
> and since it's new, there's no equivalent performance regression
> issue.

With 5.18-rc4, bpf programs on x86_64 are using 4kB bpf_prog_pack, so
multiple small BPF programs will share a 4kB page. We still need to
initialize each 4kB bpf_prog_pack with illegal instructions, with
something similar to [1]. I will revise it based on Peter's suggestion
[2].

Thanks,
Song

[1] https://lore.kernel.org/all/20220422051813.1989257-2-song@kernel.org/
[2] https://lore.kernel.org/linux-mm/20220422073118.GR2731@worktop.programming.kicks-ass.net/
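For context on initializing each pack with illegal instructions: on x86 the
usual choice is to fill unused JIT space with the single-byte INT3 opcode
(0xcc), so a stray jump into the shared page traps instead of executing
leftover bytes. A minimal sketch under that assumption (demo_pack_init is a
made-up name, not the actual bpf_prog_pack code):

#include <string.h>

#define DEMO_PACK_SIZE	4096
#define X86_INT3	0xcc	/* single-byte trapping instruction */

/*
 * Poison a freshly allocated pack before any program is copied into it, so
 * the unused space between small programs sharing the page traps on
 * execution.
 */
static void demo_pack_init(void *pack)
{
	memset(pack, X86_INT3, DEMO_PACK_SIZE);
}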