mbox series

[RFC,0/7] mm: Get rid of vmalloc_sync_(un)mappings()

Message ID 20200508144043.13893-1-joro@8bytes.org (mailing list archive)
Headers show
Series mm: Get rid of vmalloc_sync_(un)mappings() | expand

Message

Joerg Roedel May 8, 2020, 2:40 p.m. UTC
Hi,

after the recent issue with vmalloc and tracing code[1] on x86 and a
long history of previous issues related to the vmalloc_sync_mappings()
interface, I thought the time has come to remove it. Please
see [2], [3], and [4] for some other issues in the past.

The patches are based on v5.7-rc4 and add tracking of page-table
directory changes to the vmalloc and ioremap code. Depending on which
page-table levels changes have been made, a new per-arch function is
called: arch_sync_kernel_mappings().

On x86-64 with 4-level paging, this function will not be called more
than 64 times in a systems runtime (because vmalloc-space takes 64 PGD
entries which are only populated, but never cleared).

As a side effect this also allows to get rid of vmalloc faults on x86,
making it safe to touch vmalloc'ed memory in the page-fault handler.
Note that this potentially includes per-cpu memory.

The code is tested on x86-64, x86-32 with and without PAE. It also fixes
the issue described in [1]. Additionally this code has been
compile-tested on almost all architectures supported by Linux. I
couldn't find working compilers for hexagon and unicore32, so these are
not tested.

Please review.

Regards,

	Joerg

[1] https://lore.kernel.org/lkml/20200430141120.GA8135@suse.de/
[2] https://lore.kernel.org/lkml/20191009124418.8286-1-joro@8bytes.org/
[3] https://lore.kernel.org/lkml/20190719184652.11391-1-joro@8bytes.org/
[4] https://lore.kernel.org/lkml/20191126111119.GA110513@gmail.com/

Joerg Roedel (7):
  mm: Add functions to track page directory modifications
  mm/vmalloc: Track which page-table levels were modified
  mm/ioremap: Track which page-table levels were modified
  x86/mm/64: Implement arch_sync_kernel_mappings()
  x86/mm/32: Implement arch_sync_kernel_mappings()
  mm: Remove vmalloc_sync_(un)mappings()
  x86/mm: Remove vmalloc faulting

 arch/x86/include/asm/pgtable-2level_types.h |   2 +
 arch/x86/include/asm/pgtable-3level_types.h |   2 +
 arch/x86/include/asm/pgtable_64_types.h     |   2 +
 arch/x86/include/asm/switch_to.h            |  23 ---
 arch/x86/kernel/setup_percpu.c              |   6 +-
 arch/x86/mm/fault.c                         | 176 +-------------------
 arch/x86/mm/init_64.c                       |   5 +
 arch/x86/mm/pti.c                           |   8 +-
 drivers/acpi/apei/ghes.c                    |   6 -
 include/asm-generic/5level-fixup.h          |   5 +-
 include/asm-generic/pgtable.h               |  23 +++
 include/linux/mm.h                          |  46 +++++
 include/linux/vmalloc.h                     |  13 +-
 kernel/notifier.c                           |   1 -
 lib/ioremap.c                               |  46 +++--
 mm/nommu.c                                  |  12 --
 mm/vmalloc.c                                | 109 +++++++-----
 17 files changed, 199 insertions(+), 286 deletions(-)

Comments

Peter Zijlstra May 8, 2020, 7:20 p.m. UTC | #1
On Fri, May 08, 2020 at 04:40:36PM +0200, Joerg Roedel wrote:

> Joerg Roedel (7):
>   mm: Add functions to track page directory modifications
>   mm/vmalloc: Track which page-table levels were modified
>   mm/ioremap: Track which page-table levels were modified
>   x86/mm/64: Implement arch_sync_kernel_mappings()
>   x86/mm/32: Implement arch_sync_kernel_mappings()
>   mm: Remove vmalloc_sync_(un)mappings()
>   x86/mm: Remove vmalloc faulting
> 
>  arch/x86/include/asm/pgtable-2level_types.h |   2 +
>  arch/x86/include/asm/pgtable-3level_types.h |   2 +
>  arch/x86/include/asm/pgtable_64_types.h     |   2 +
>  arch/x86/include/asm/switch_to.h            |  23 ---
>  arch/x86/kernel/setup_percpu.c              |   6 +-
>  arch/x86/mm/fault.c                         | 176 +-------------------
>  arch/x86/mm/init_64.c                       |   5 +
>  arch/x86/mm/pti.c                           |   8 +-
>  drivers/acpi/apei/ghes.c                    |   6 -
>  include/asm-generic/5level-fixup.h          |   5 +-
>  include/asm-generic/pgtable.h               |  23 +++
>  include/linux/mm.h                          |  46 +++++
>  include/linux/vmalloc.h                     |  13 +-
>  kernel/notifier.c                           |   1 -
>  lib/ioremap.c                               |  46 +++--
>  mm/nommu.c                                  |  12 --
>  mm/vmalloc.c                                | 109 +++++++-----
>  17 files changed, 199 insertions(+), 286 deletions(-)

The only concern I have is the pgd_lock lock hold times.

By not doing on-demand faults anymore, and consistently calling
sync_global_*(), we iterate that pgd_list thing much more often than
before if I'm not mistaken.
Andy Lutomirski May 8, 2020, 9:33 p.m. UTC | #2
On Fri, May 8, 2020 at 7:40 AM Joerg Roedel <joro@8bytes.org> wrote:
>
> Hi,
>
> after the recent issue with vmalloc and tracing code[1] on x86 and a
> long history of previous issues related to the vmalloc_sync_mappings()
> interface, I thought the time has come to remove it. Please
> see [2], [3], and [4] for some other issues in the past.
>
> The patches are based on v5.7-rc4 and add tracking of page-table
> directory changes to the vmalloc and ioremap code. Depending on which
> page-table levels changes have been made, a new per-arch function is
> called: arch_sync_kernel_mappings().
>
> On x86-64 with 4-level paging, this function will not be called more
> than 64 times in a systems runtime (because vmalloc-space takes 64 PGD
> entries which are only populated, but never cleared).

What's the maximum on other system types?  It might make more sense to
take the memory hit and pre-populate all the tables at boot so we
never have to sync them.
Joerg Roedel May 8, 2020, 9:34 p.m. UTC | #3
Hi Peter,

thanks for reviewing this!

On Fri, May 08, 2020 at 09:20:00PM +0200, Peter Zijlstra wrote:
> The only concern I have is the pgd_lock lock hold times.
> 
> By not doing on-demand faults anymore, and consistently calling
> sync_global_*(), we iterate that pgd_list thing much more often than
> before if I'm not mistaken.

Should not be a problem, from what I have seen this function is not
called often on x86-64.  The vmalloc area needs to be synchronized at
the top-level there, which is by now P4D (with 4-level paging). And the
vmalloc area takes 64 entries, when all of them are populated the
function will not be called again.

On 32bit it might be called more often, because synchronization happens
on the PMD level, which is also used for large-page mapped ioremap
regions. But these don't happen very often and there are also no VMAP
stacks on x86-32 which could cause this function to be called
frequently.


	Joerg
Joerg Roedel May 8, 2020, 9:36 p.m. UTC | #4
On Fri, May 08, 2020 at 02:33:19PM -0700, Andy Lutomirski wrote:
> On Fri, May 8, 2020 at 7:40 AM Joerg Roedel <joro@8bytes.org> wrote:

> What's the maximum on other system types?  It might make more sense to
> take the memory hit and pre-populate all the tables at boot so we
> never have to sync them.

Need to look it up for 5-level paging, with 4-level paging its 64 pages
to pre-populate the vmalloc area.

But that would not solve the problem on x86-32, which needs to
synchronize unmappings on the PMD level.


	Joerg
Andy Lutomirski May 8, 2020, 11:49 p.m. UTC | #5
On Fri, May 8, 2020 at 2:36 PM Joerg Roedel <jroedel@suse.de> wrote:
>
> On Fri, May 08, 2020 at 02:33:19PM -0700, Andy Lutomirski wrote:
> > On Fri, May 8, 2020 at 7:40 AM Joerg Roedel <joro@8bytes.org> wrote:
>
> > What's the maximum on other system types?  It might make more sense to
> > take the memory hit and pre-populate all the tables at boot so we
> > never have to sync them.
>
> Need to look it up for 5-level paging, with 4-level paging its 64 pages
> to pre-populate the vmalloc area.
>
> But that would not solve the problem on x86-32, which needs to
> synchronize unmappings on the PMD level.

What changes in this series with x86-32?  We already do that
synchronization, right?  IOW, in the cases where the vmalloc *fault*
code does anything at all, we should have a small bound for how much
memory to preallocate and, if we preallocate it, then there is nothing
to sync and nothing to fault.  And we have the benefit that we never
need to sync anything on 64-bit, which is kind of nice.

Do we actually need PMD-level things for 32-bit?  What if we just
outlawed huge pages in the vmalloc space on 32-bit non-PAE?

Or maybe the net result isn't much of a cleanup after all given the
need to support 32-bit.

>
>
>         Joerg
Peter Zijlstra May 9, 2020, 9:25 a.m. UTC | #6
On Fri, May 08, 2020 at 11:34:07PM +0200, Joerg Roedel wrote:
> Hi Peter,
> 
> thanks for reviewing this!
> 
> On Fri, May 08, 2020 at 09:20:00PM +0200, Peter Zijlstra wrote:
> > The only concern I have is the pgd_lock lock hold times.
> > 
> > By not doing on-demand faults anymore, and consistently calling
> > sync_global_*(), we iterate that pgd_list thing much more often than
> > before if I'm not mistaken.
> 
> Should not be a problem, from what I have seen this function is not
> called often on x86-64.  The vmalloc area needs to be synchronized at
> the top-level there, which is by now P4D (with 4-level paging). And the
> vmalloc area takes 64 entries, when all of them are populated the
> function will not be called again.

Right; it's just that the moment you do trigger it, it'll iterate that
pgd_list and that is potentially huge. Then again, that's not a new
problem.

I suppose we can deal with it if/when it becomes an actual problem.

I had a quick look and I think it might be possible to make it an RCU
managed list. We'd have to remove the pgd_list entry in
put_task_struct_rcu_user(). Then we can play games in sync_global_*()
and use RCU iteration. New tasks (which can be missed in the RCU
iteration) will already have a full clone of the PGD anyway.

But like said, something for later I suppose.
Joerg Roedel May 9, 2020, 5:52 p.m. UTC | #7
On Fri, May 08, 2020 at 04:49:17PM -0700, Andy Lutomirski wrote:
> On Fri, May 8, 2020 at 2:36 PM Joerg Roedel <jroedel@suse.de> wrote:
> >
> > On Fri, May 08, 2020 at 02:33:19PM -0700, Andy Lutomirski wrote:
> > > On Fri, May 8, 2020 at 7:40 AM Joerg Roedel <joro@8bytes.org> wrote:
> >
> > > What's the maximum on other system types?  It might make more sense to
> > > take the memory hit and pre-populate all the tables at boot so we
> > > never have to sync them.
> >
> > Need to look it up for 5-level paging, with 4-level paging its 64 pages
> > to pre-populate the vmalloc area.
> >
> > But that would not solve the problem on x86-32, which needs to
> > synchronize unmappings on the PMD level.
> 
> What changes in this series with x86-32?

This series sets ARCH_PAGE_TABLE_SYNC_MASK to PGTBL_PMD_MODIFIED, so
that the synchronization happens every time PMD(s) in the vmalloc areas
are changed. Before this series this synchronization only happened at
arbitrary places calling vmalloc_sync_(un)mappings().

> We already do that synchronization, right?  IOW, in the cases where
> the vmalloc *fault* code does anything at all, we should have a small
> bound for how much memory to preallocate and, if we preallocate it,
> then there is nothing to sync and nothing to fault.  And we have the
> benefit that we never need to sync anything on 64-bit, which is kind
> of nice.

Don't really get you here, what is pre-allocated and why is there no
need to sync and fault then?

> Do we actually need PMD-level things for 32-bit?  What if we just
> outlawed huge pages in the vmalloc space on 32-bit non-PAE?

Disallowing huge-pages would at least remove the need to sync
unmappings, but we still need to sync new PMD entries. Remember that the
size of the vmalloc area on 32 bit is dynamic and depends on the VM-split
and the actual amount of RAM on the system.

A machine wit 512MB of RAM and a 1G/3G split will have around 2.5G of
VMALLOC address space. And if we want to avoid vmalloc-faults there, we
need to pre-allocate all PTE pages for that area (and the amount of PTE
pages needed increases when RAM decreases).

On a machine with 512M of RAM we would need ca. 1270+ PTE pages, which
is around 5M (or 1% of total system memory).

Regards,

	Joerg
Joerg Roedel May 9, 2020, 5:54 p.m. UTC | #8
On Sat, May 09, 2020 at 11:25:16AM +0200, Peter Zijlstra wrote:
> Right; it's just that the moment you do trigger it, it'll iterate that
> pgd_list and that is potentially huge. Then again, that's not a new
> problem.
> 
> I suppose we can deal with it if/when it becomes an actual problem.
> 
> I had a quick look and I think it might be possible to make it an RCU
> managed list. We'd have to remove the pgd_list entry in
> put_task_struct_rcu_user(). Then we can play games in sync_global_*()
> and use RCU iteration. New tasks (which can be missed in the RCU
> iteration) will already have a full clone of the PGD anyway.

Right, it should not be that difficult to rcu-protect the pgd-list, but
we can try that when it actually becomes a problem.


	Joerg
Andy Lutomirski May 9, 2020, 7:05 p.m. UTC | #9
On Sat, May 9, 2020 at 10:52 AM Joerg Roedel <jroedel@suse.de> wrote:
>
> On Fri, May 08, 2020 at 04:49:17PM -0700, Andy Lutomirski wrote:
> > On Fri, May 8, 2020 at 2:36 PM Joerg Roedel <jroedel@suse.de> wrote:
> > >
> > > On Fri, May 08, 2020 at 02:33:19PM -0700, Andy Lutomirski wrote:
> > > > On Fri, May 8, 2020 at 7:40 AM Joerg Roedel <joro@8bytes.org> wrote:
> > >
> > > > What's the maximum on other system types?  It might make more sense to
> > > > take the memory hit and pre-populate all the tables at boot so we
> > > > never have to sync them.
> > >
> > > Need to look it up for 5-level paging, with 4-level paging its 64 pages
> > > to pre-populate the vmalloc area.
> > >
> > > But that would not solve the problem on x86-32, which needs to
> > > synchronize unmappings on the PMD level.
> >
> > What changes in this series with x86-32?
>
> This series sets ARCH_PAGE_TABLE_SYNC_MASK to PGTBL_PMD_MODIFIED, so
> that the synchronization happens every time PMD(s) in the vmalloc areas
> are changed. Before this series this synchronization only happened at
> arbitrary places calling vmalloc_sync_(un)mappings().
>
> > We already do that synchronization, right?  IOW, in the cases where
> > the vmalloc *fault* code does anything at all, we should have a small
> > bound for how much memory to preallocate and, if we preallocate it,
> > then there is nothing to sync and nothing to fault.  And we have the
> > benefit that we never need to sync anything on 64-bit, which is kind
> > of nice.
>
> Don't really get you here, what is pre-allocated and why is there no
> need to sync and fault then?
>
> > Do we actually need PMD-level things for 32-bit?  What if we just
> > outlawed huge pages in the vmalloc space on 32-bit non-PAE?
>
> Disallowing huge-pages would at least remove the need to sync
> unmappings, but we still need to sync new PMD entries. Remember that the
> size of the vmalloc area on 32 bit is dynamic and depends on the VM-split
> and the actual amount of RAM on the system.
>
> A machine wit 512MB of RAM and a 1G/3G split will have around 2.5G of
> VMALLOC address space. And if we want to avoid vmalloc-faults there, we
> need to pre-allocate all PTE pages for that area (and the amount of PTE
> pages needed increases when RAM decreases).
>
> On a machine with 512M of RAM we would need ca. 1270+ PTE pages, which
> is around 5M (or 1% of total system memory).

I can never remember which P?D name goes with which level and which
machine type, but I don't think I agree with your math regardless.  On
x86, there are two fundamental situations that can occur:

1. Non-PAE.  There is a single 4k top-level page table per mm, and
this table contains either 512 or 1024 entries total. Of those
entries, some fraction (half or less) control the kernel address
space, and some fraction of *that* is for vmalloc space.  Those
entries are the *only* thing that needs syncing -- all mms will either
have null (not present) in those slots or will have pointers to the
*same* next-level-down directories.

2. PAE.  Depending on your perspective, there could be a grand total
of four top-level paging pointers, of which one (IIRC) is for the
kernel.  That points to the same place for all mms.  Or, if you look
at it the other way, PAE is just like #1 except that the top-level
table has only four entries and only one points to VMALLOC space.

So, unless I'm missing something here, there is an absolute maximum of
512 top-level entries that ever need to be synchronized.

Now, there's an additional complication.  On x86_64, we have a rule:
those entries that need to be synced start out null and may, during
the lifetime of the system, change *once*.  They are never unmapped or
modified after being allocated.  This means that those entries can
only ever point to a page *table* and not to a ginormous page.  So,
even if the hardware were to support ginormous pages (which, IIRC, it
doesn't), we would be limited to merely immense and not ginormous
pages in the vmalloc range.  On x86_32, I don't think we have this
rule right now.  And this means that it's possible for one of these
pages to be unmapped or modified.

So my suggestion is that just apply the x86_64 rule to x86_32 as well.
The practical effect will be that 2-level-paging systems will not be
able to use huge pages in the vmalloc range, since the rule will be
that the vmalloc-relevant entries in the top-level table must point to
page *tables* instead of huge pages.

On top of this, if we preallocate these entries, then the maximum
amount of memory we can possibly waste is 4k * (entries pointing to
vmalloc space - entries actually used for vmalloc space).  I don't
know what this number typically is, but I don't think it's very large.
Preallocating means that vmalloc faults *and* synchronization go away
entirely.  All of the page tables used for vmalloc will be entirely
shared by all mms, so all that's needed to modify vmalloc mappings is
to update init_mm and, if needed, flush TLBs.  No other page tables
will need modification at all.

On x86_64, the only real advantage is that the handful of corner cases
that make vmalloc faults unpleasant (mostly relating to vmap stacks)
go away.  On x86_32, a bunch of mind-bending stuff (everything your
series deletes but also almost everything your series *adds*) goes
away.  There may be a genuine tiny performance hit on 2-level systems
due to the loss of huge pages in vmalloc space, but I'm not sure I
care or that we use them anyway on these systems.  And PeterZ can stop
even thinking about RCU.

Am I making sense?


(Aside: I *hate* the PMD, etc terminology.  Even the kernel's C types
can't keep track of whether pmd_t* points to an entire paging
directory or to a single entry.  Similarly, everyone knows that a
pte_t is a "page table entry", except that pte_t* might instead be a
pointer to an array of 512 or 1024 page table entries.)
Joerg Roedel May 9, 2020, 9:57 p.m. UTC | #10
Hi Andy,

On Sat, May 09, 2020 at 12:05:29PM -0700, Andy Lutomirski wrote:
> 1. Non-PAE.  There is a single 4k top-level page table per mm, and
> this table contains either 512 or 1024 entries total. Of those
> entries, some fraction (half or less) control the kernel address
> space, and some fraction of *that* is for vmalloc space.  Those
> entries are the *only* thing that needs syncing -- all mms will either
> have null (not present) in those slots or will have pointers to the
> *same* next-level-down directories.

Not entirely correct, on 32-bit non-PAE there are two levels with 1024
entries each. In Linux these map to PMD and PTE levels. With 1024
entries each PMD maps 4MB of address space.

How much of these 1024 top-level entries are used for the kernel is
configuration dependent in Linux, it can be 768, 512, or 256.

> 2. PAE.  Depending on your perspective, there could be a grand total
> of four top-level paging pointers, of which one (IIRC) is for the
> kernel.

That is again configuration dependent, up to 3 of the top-level pointers
could be used for kernel-space.

> That points to the same place for all mms.  Or, if you look at it the
> other way, PAE is just like #1 except that the top-level table has
> only four entries and only one points to VMALLOC space.

There are more differences. In PAE an entry is 64-bit, which means each
PMD-entry only maps 2MB of address space, which means you need two PTE
pages to map the same amount of address space you could map with one
page without PAE paging.
And as noted above, all 3 of the top-level pointers can point to vmalloc
space (not exclusivly).
 
> So, unless I'm missing something here, there is an absolute maximum of
> 512 top-level entries that ever need to be synchronized.

And here is where your assumption is wrong. On 32-bit PAE systems it is
not the top-level entries that need to be synchronized for vmalloc, but
the second-level entries. And dependent on the kernel configuration,
there are (in total, not only vmalloc) 1536, 1024, or 512 of these
second-level entries. How much of them are actually used for vmalloc
depends on the size of the system RAM (but is at least 64), because
the vmalloc area begins after the kernel direct-mapping (with an 8MB
unmapped hole).

With 512MB of RAM you need 256 entries for the kernel direct-mapping (I
ignore the ISA hole here). After that there are 4 unmapped entries and
the rest is vmalloc, minus some fixmap and ldt mappings at the end of
the address space. I havn't looked up the exact number, but 4 entries
(8MB) for this should be close to the real value.

	  1536 total PMD entries
	-  256 for direct-mapping
	-    4 for ISA hole
	-    4 for stuff at the end of the address space

	= 1272 PMD entries for VMALLOC space which need synchronization.

With less RAM it is even more, and to get rid of faulting and
synchronization you need to pre-allocate a 4k PTE page for each of these
PMD entries.

> Now, there's an additional complication.  On x86_64, we have a rule:
> those entries that need to be synced start out null and may, during
> the lifetime of the system, change *once*.  They are never unmapped or
> modified after being allocated.  This means that those entries can
> only ever point to a page *table* and not to a ginormous page.  So,
> even if the hardware were to support ginormous pages (which, IIRC, it
> doesn't), we would be limited to merely immense and not ginormous
> pages in the vmalloc range.  On x86_32, I don't think we have this
> rule right now.  And this means that it's possible for one of these
> pages to be unmapped or modified.

The reason for x86-32 being different is that the address space is
orders of magnitude smaller than on x86-64. We just have 4 top-level
entries with PAE paging and can't afford to partition kernel-adress
space on that level like we do on x86-64. That is the reason the address
space is partitioned on the second (PMD) level, which is also the reason
vmalloc synchronization needs to happen on that level. And because
that's not enough yet, its also the page-table level to map huge-pages.

> So my suggestion is that just apply the x86_64 rule to x86_32 as well.
> The practical effect will be that 2-level-paging systems will not be
> able to use huge pages in the vmalloc range, since the rule will be
> that the vmalloc-relevant entries in the top-level table must point to
> page *tables* instead of huge pages.

I could very well live with prohibiting huge-page ioremap mappings for
x86-32. But as I wrote before, this doesn't solve the problems I am
trying to address with this patch-set, or would only address them if
significant amount of total system memory is used.

The pre-allocation solution would work for x86-64, it would only need
256kb of preallocated memory for the vmalloc range to never synchronize
or fault again. I have thought about that and did the math before
writing this patch-set, but doing the math for 32 bit drove me away from
it for reasons written above.

And since a lot of the vmalloc_sync_(un)mappings problems I debugged
were actually related to 32-bit, I want a solution that works for 32 and
64-bit x86 (at least until support for x86-32 is removed). And I think
this patch-set provides a solution that works well for both.


	Joerg
Matthew Wilcox May 10, 2020, 1:11 a.m. UTC | #11
On Sat, May 09, 2020 at 11:25:16AM +0200, Peter Zijlstra wrote:
> On Fri, May 08, 2020 at 11:34:07PM +0200, Joerg Roedel wrote:
> > On Fri, May 08, 2020 at 09:20:00PM +0200, Peter Zijlstra wrote:
> > > The only concern I have is the pgd_lock lock hold times.
> > > 
> > > By not doing on-demand faults anymore, and consistently calling
> > > sync_global_*(), we iterate that pgd_list thing much more often than
> > > before if I'm not mistaken.
> > 
> > Should not be a problem, from what I have seen this function is not
> > called often on x86-64.  The vmalloc area needs to be synchronized at
> > the top-level there, which is by now P4D (with 4-level paging). And the
> > vmalloc area takes 64 entries, when all of them are populated the
> > function will not be called again.
> 
> Right; it's just that the moment you do trigger it, it'll iterate that
> pgd_list and that is potentially huge. Then again, that's not a new
> problem.
> 
> I suppose we can deal with it if/when it becomes an actual problem.
> 
> I had a quick look and I think it might be possible to make it an RCU
> managed list. We'd have to remove the pgd_list entry in
> put_task_struct_rcu_user(). Then we can play games in sync_global_*()
> and use RCU iteration. New tasks (which can be missed in the RCU
> iteration) will already have a full clone of the PGD anyway.

One of the things on my long-term todo list is to replace mm_struct.mmlist
with an XArray containing all mm_structs.  Then we can use one mark
to indicate maybe-swapped and another mark to indicate ... whatever it
is pgd_list indicates.  Iterating an XArray (whether the entire thing
or with marks) is RCU-safe and faster than iterating a linked list,
so this should solve the problem?
Andy Lutomirski May 10, 2020, 5:05 a.m. UTC | #12
On Sat, May 9, 2020 at 2:57 PM Joerg Roedel <joro@8bytes.org> wrote:
>
> Hi Andy,
>
> On Sat, May 09, 2020 at 12:05:29PM -0700, Andy Lutomirski wrote:

> > So, unless I'm missing something here, there is an absolute maximum of
> > 512 top-level entries that ever need to be synchronized.
>
> And here is where your assumption is wrong. On 32-bit PAE systems it is
> not the top-level entries that need to be synchronized for vmalloc, but
> the second-level entries. And dependent on the kernel configuration,
> there are (in total, not only vmalloc) 1536, 1024, or 512 of these
> second-level entries. How much of them are actually used for vmalloc
> depends on the size of the system RAM (but is at least 64), because
> the vmalloc area begins after the kernel direct-mapping (with an 8MB
> unmapped hole).

I spent some time looking at the code, and I'm guessing you're talking
about the 3-level !SHARED_KERNEL_PMD case.  I can't quite figure out
what's going on.

Can you explain what is actually going on that causes different
mms/pgds to have top-level entries in the kernel range that point to
different tables?  Because I'm not seeing why this makes any sense.

>
> > Now, there's an additional complication.  On x86_64, we have a rule:
> > those entries that need to be synced start out null and may, during
> > the lifetime of the system, change *once*.  They are never unmapped or
> > modified after being allocated.  This means that those entries can
> > only ever point to a page *table* and not to a ginormous page.  So,
> > even if the hardware were to support ginormous pages (which, IIRC, it
> > doesn't), we would be limited to merely immense and not ginormous
> > pages in the vmalloc range.  On x86_32, I don't think we have this
> > rule right now.  And this means that it's possible for one of these
> > pages to be unmapped or modified.
>
> The reason for x86-32 being different is that the address space is
> orders of magnitude smaller than on x86-64. We just have 4 top-level
> entries with PAE paging and can't afford to partition kernel-adress
> space on that level like we do on x86-64. That is the reason the address
> space is partitioned on the second (PMD) level, which is also the reason
> vmalloc synchronization needs to happen on that level. And because
> that's not enough yet, its also the page-table level to map huge-pages.

Why does it need to be partitioned at all?  The only thing that comes
to mind is that the LDT range is per-mm.  So I can imagine that the
PAE case with a 3G user / 1G kernel split has to have the vmalloc
range and the LDT range in the same top-level entry.  Yuck.

>
> > So my suggestion is that just apply the x86_64 rule to x86_32 as well.
> > The practical effect will be that 2-level-paging systems will not be
> > able to use huge pages in the vmalloc range, since the rule will be
> > that the vmalloc-relevant entries in the top-level table must point to
> > page *tables* instead of huge pages.
>
> I could very well live with prohibiting huge-page ioremap mappings for
> x86-32. But as I wrote before, this doesn't solve the problems I am
> trying to address with this patch-set, or would only address them if
> significant amount of total system memory is used.
>
> The pre-allocation solution would work for x86-64, it would only need
> 256kb of preallocated memory for the vmalloc range to never synchronize
> or fault again. I have thought about that and did the math before
> writing this patch-set, but doing the math for 32 bit drove me away from
> it for reasons written above.
>

If it's *just* the LDT that's a problem, we could plausibly shrink the
user address range a little bit and put the LDT in the user portion.
I suppose this could end up creating its own set of problems involving
tracking which code owns which page tables.

> And since a lot of the vmalloc_sync_(un)mappings problems I debugged
> were actually related to 32-bit, I want a solution that works for 32 and
> 64-bit x86 (at least until support for x86-32 is removed). And I think
> this patch-set provides a solution that works well for both.

I'm not fundamentally objecting to your patch set, but I do want to
understand what's going on that needs this stuff.

>
>
>         Joerg
Joerg Roedel May 10, 2020, 8:15 a.m. UTC | #13
On Sat, May 09, 2020 at 10:05:43PM -0700, Andy Lutomirski wrote:
> On Sat, May 9, 2020 at 2:57 PM Joerg Roedel <joro@8bytes.org> wrote:
> I spent some time looking at the code, and I'm guessing you're talking
> about the 3-level !SHARED_KERNEL_PMD case.  I can't quite figure out
> what's going on.
> 
> Can you explain what is actually going on that causes different
> mms/pgds to have top-level entries in the kernel range that point to
> different tables?  Because I'm not seeing why this makes any sense.

There are three cases where the PMDs are not shared on x86-32:

	1) With non-PAE the top-level is already the PMD level, because
	   the page-table only has two levels. Since the top-level can't
	   be shared, the PMDs are also not shared.

	2) For some reason Xen-PV also does not share kernel PMDs on PAE
	   systems, but I havn't looked into why.

	3) On 32-bit PAE with PTI enabled the kernel address space
	   contains the LDT mapping, which is also different per-PGD.
	   There is one PMD entry reserved for the LDT, giving it 2MB of
	   address space. I implemented it this way to keep the 32-bit
	   implementation of PTI mostly similar to the 64-bit one.

> Why does it need to be partitioned at all?  The only thing that comes
> to mind is that the LDT range is per-mm.  So I can imagine that the
> PAE case with a 3G user / 1G kernel split has to have the vmalloc
> range and the LDT range in the same top-level entry.  Yuck.

PAE with 3G user / 1G kernel has _all_ of the kernel mappings in one
top-level entry (direct-mapping, vmalloc, ldt, fixmap).

> If it's *just* the LDT that's a problem, we could plausibly shrink the
> user address range a little bit and put the LDT in the user portion.
> I suppose this could end up creating its own set of problems involving
> tracking which code owns which page tables.

Yeah, for the PTI case it is only the LDT that causes the unshared
kernel PMDs, but even if we move the LDT somewhere else we still have
two-level paging and the xen-pv case.


	Joerg
Peter Zijlstra May 11, 2020, 7:31 a.m. UTC | #14
On Sat, May 09, 2020 at 06:11:57PM -0700, Matthew Wilcox wrote:
> Iterating an XArray (whether the entire thing
> or with marks) is RCU-safe and faster than iterating a linked list,
> so this should solve the problem?

It can hardly be faster if you want all elements -- which is I think the
case here. We only call into this if we change an entry, and then we
need to propagate that change to all.
Peter Zijlstra May 11, 2020, 7:42 a.m. UTC | #15
On Sat, May 09, 2020 at 12:05:29PM -0700, Andy Lutomirski wrote:

> On x86_64, the only real advantage is that the handful of corner cases
> that make vmalloc faults unpleasant (mostly relating to vmap stacks)
> go away.  On x86_32, a bunch of mind-bending stuff (everything your
> series deletes but also almost everything your series *adds*) goes
> away.  There may be a genuine tiny performance hit on 2-level systems
> due to the loss of huge pages in vmalloc space, but I'm not sure I
> care or that we use them anyway on these systems.  And PeterZ can stop
> even thinking about RCU.
> 
> Am I making sense?

I think it'll work for x86_64 and that is really all I care about :-)
Andy Lutomirski May 11, 2020, 3:36 p.m. UTC | #16
On Mon, May 11, 2020 at 12:42 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sat, May 09, 2020 at 12:05:29PM -0700, Andy Lutomirski wrote:
>
> > On x86_64, the only real advantage is that the handful of corner cases
> > that make vmalloc faults unpleasant (mostly relating to vmap stacks)
> > go away.  On x86_32, a bunch of mind-bending stuff (everything your
> > series deletes but also almost everything your series *adds*) goes
> > away.  There may be a genuine tiny performance hit on 2-level systems
> > due to the loss of huge pages in vmalloc space, but I'm not sure I
> > care or that we use them anyway on these systems.  And PeterZ can stop
> > even thinking about RCU.
> >
> > Am I making sense?
>
> I think it'll work for x86_64 and that is really all I care about :-)

Sadly, I think that Joerg has convinced my that this doesn't really
work for 32-bit unless we rework the LDT code or drop support for
something that we might not want to drop support for.  So, last try --
maybe we can start defeaturing 32-bit:

What if we make 32-bit PTI depend on PAE?  And drop 32-bit Xen PV
support?  And make 32-bit huge pages depend on PAE?  Then 32-bit
non-PAE can use the direct-mapped LDT, 32-bit PTI (and optionally PAE
non-PTI) can use the evil virtually mapped LDT.  And 32-bit non-PAE
(the 2-level case) will only have pointers to page tables at the top
level.  And then we can preallocate.

Or maybe we don't want to defeature this much, or maybe the memory hit
from this preallocation will hurt little 2-level 32-bit systems too
much.

(Xen 32-bit PV support seems to be on its way out upstream.)
Matthew Wilcox May 11, 2020, 3:52 p.m. UTC | #17
On Mon, May 11, 2020 at 09:31:34AM +0200, Peter Zijlstra wrote:
> On Sat, May 09, 2020 at 06:11:57PM -0700, Matthew Wilcox wrote:
> > Iterating an XArray (whether the entire thing
> > or with marks) is RCU-safe and faster than iterating a linked list,
> > so this should solve the problem?
> 
> It can hardly be faster if you want all elements -- which is I think the
> case here. We only call into this if we change an entry, and then we
> need to propagate that change to all.

Of course it can be faster.  Iterating an array is faster than iterating
a linked list because caches.  While an XArray is a segmented array
(so slower than a plain array), it's plainly going to be faster than
iterating a linked list.
Matthew Wilcox May 11, 2020, 4:08 p.m. UTC | #18
On Mon, May 11, 2020 at 08:52:04AM -0700, Matthew Wilcox wrote:
> On Mon, May 11, 2020 at 09:31:34AM +0200, Peter Zijlstra wrote:
> > On Sat, May 09, 2020 at 06:11:57PM -0700, Matthew Wilcox wrote:
> > > Iterating an XArray (whether the entire thing
> > > or with marks) is RCU-safe and faster than iterating a linked list,
> > > so this should solve the problem?
> > 
> > It can hardly be faster if you want all elements -- which is I think the
> > case here. We only call into this if we change an entry, and then we
> > need to propagate that change to all.
> 
> Of course it can be faster.  Iterating an array is faster than iterating
> a linked list because caches.  While an XArray is a segmented array
> (so slower than a plain array), it's plainly going to be faster than
> iterating a linked list.

Quantifying this:

$ ./array-vs-list 
walked sequential array in 0.002039s
walked sequential list in 0.002807s
walked sequential array in 0.002017s
walked shuffled list in 0.102367s
walked shuffled array in 0.012114s

Attached is the source code; above results on a Kaby Lake with
CFLAGS="-O2 -W -Wall -g".
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

unsigned long count = 1000 * 1000;
unsigned int verbose;

struct object {
	struct object *next;
	struct object *prev;
	unsigned int value;
};

#define printv(level, fmt, ...) \
	if (level <= verbose) { printf(fmt, ##__VA_ARGS__); }

void check_total(unsigned long total)
{
	if (total * 2 != count * (count + 1))
		printf("Check your staging! (%lu %lu)\n", total, count);
}

void alloc_objs(struct object **array)
{
	unsigned long i;

	for (i = 0; i < count; i++) {
		struct object *obj = malloc(sizeof(*obj));

		obj->value = i + 1;
		/* Add to the array */
		array[i] = obj;
	}
}

void shuffle(struct object **array, unsigned long seed)
{
	unsigned long i;

	printv(1, "random seed %lu\n", seed);
	srand48(seed);

	/* Shuffle the array */
	for (i = 1; i < count; i++) {
		struct object *obj;
		unsigned long j = (unsigned int)mrand48() % (i + 1);

		if (i == j)
			continue;
		obj = array[j];
		array[j] = array[i];
		array[i] = obj;
	}
}

void create_list(struct object **array, struct object *list)
{
	unsigned long i;

	list->next = list;
	list->prev = list;

	for (i = 0; i < count; i++) {
		struct object *obj = array[i];
		/* Add to the tail of the list */
		obj->next = list;
		obj->prev = list->prev;
		list->prev->next = obj;
		list->prev = obj;
	}
}

void walk_list(struct object *list)
{
	unsigned long total = 0;
	struct object *obj;

	for (obj = list->next; obj != list; obj = obj->next) {
		total += obj->value;
	}

	check_total(total);
}

void walk_array(struct object **array)
{
	unsigned long total = 0;
	unsigned long i;

	for (i = 0; i < count; i++) {
		total += array[i]->value;
	}

	check_total(total);
}

/* time2 - time1 */
double diff_time(struct timespec *time1, struct timespec *time2)
{
	double result = time2->tv_nsec - time1->tv_nsec;

	return time2->tv_sec - time1->tv_sec + result / 1000 / 1000 / 1000;
}

int main(int argc, char **argv)
{
	int opt;
	unsigned long seed = time(NULL);
	struct object **array;
	struct object list;
	struct timespec time1, time2;

	while ((opt = getopt(argc, argv, "c:s:v")) != -1) {
		if (opt == 'c')
			count *= strtoul(optarg, NULL, 0);
		else if (opt == 's')
			seed = strtoul(optarg, NULL, 0);
		else if (opt == 'v')
			verbose++;
	}

	clock_gettime(CLOCK_MONOTONIC, &time1);
	array = calloc(count, sizeof(void *));
	alloc_objs(array);
	clock_gettime(CLOCK_MONOTONIC, &time2);
	printv(1, "allocated %lu items in %fs\n", count,
			diff_time(&time1, &time2));

	clock_gettime(CLOCK_MONOTONIC, &time1);
	walk_array(array);
	clock_gettime(CLOCK_MONOTONIC, &time2);
	printf("walked sequential array in %fs\n",
			diff_time(&time1, &time2));

	clock_gettime(CLOCK_MONOTONIC, &time1);
	create_list(array, &list);
	clock_gettime(CLOCK_MONOTONIC, &time2);
	printv(1, "created list in %fs\n",
			diff_time(&time1, &time2));

	clock_gettime(CLOCK_MONOTONIC, &time1);
	walk_list(&list);
	clock_gettime(CLOCK_MONOTONIC, &time2);
	printf("walked sequential list in %fs\n",
			diff_time(&time1, &time2));

	clock_gettime(CLOCK_MONOTONIC, &time1);
	walk_array(array);
	clock_gettime(CLOCK_MONOTONIC, &time2);
	printf("walked sequential array in %fs\n",
			diff_time(&time1, &time2));

	clock_gettime(CLOCK_MONOTONIC, &time1);
	shuffle(array, seed);
	clock_gettime(CLOCK_MONOTONIC, &time2);
	printv(1, "shuffled array in %fs\n",
			diff_time(&time1, &time2));

	clock_gettime(CLOCK_MONOTONIC, &time1);
	create_list(array, &list);
	clock_gettime(CLOCK_MONOTONIC, &time2);
	printv(1, "created list in %fs\n",
			diff_time(&time1, &time2));

	clock_gettime(CLOCK_MONOTONIC, &time1);
	walk_list(&list);
	clock_gettime(CLOCK_MONOTONIC, &time2);
	printf("walked shuffled list in %fs\n",
			diff_time(&time1, &time2));

	clock_gettime(CLOCK_MONOTONIC, &time1);
	walk_array(array);
	clock_gettime(CLOCK_MONOTONIC, &time2);
	printf("walked shuffled array in %fs\n",
			diff_time(&time1, &time2));

	return 0;
}
Peter Zijlstra May 11, 2020, 5:02 p.m. UTC | #19
On Mon, May 11, 2020 at 08:52:04AM -0700, Matthew Wilcox wrote:
> On Mon, May 11, 2020 at 09:31:34AM +0200, Peter Zijlstra wrote:
> > On Sat, May 09, 2020 at 06:11:57PM -0700, Matthew Wilcox wrote:
> > > Iterating an XArray (whether the entire thing
> > > or with marks) is RCU-safe and faster than iterating a linked list,
> > > so this should solve the problem?
> > 
> > It can hardly be faster if you want all elements -- which is I think the
> > case here. We only call into this if we change an entry, and then we
> > need to propagate that change to all.
> 
> Of course it can be faster.  Iterating an array is faster than iterating
> a linked list because caches.  While an XArray is a segmented array
> (so slower than a plain array), it's plainly going to be faster than
> iterating a linked list.

Fair enough, mostly also because the actual work (setting a single PTE)
doesn't dominate in this case.
Peter Zijlstra May 11, 2020, 5:06 p.m. UTC | #20
On Mon, May 11, 2020 at 08:36:31AM -0700, Andy Lutomirski wrote:
> On Mon, May 11, 2020 at 12:42 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Sat, May 09, 2020 at 12:05:29PM -0700, Andy Lutomirski wrote:
> >
> > > On x86_64, the only real advantage is that the handful of corner cases
> > > that make vmalloc faults unpleasant (mostly relating to vmap stacks)
> > > go away.  On x86_32, a bunch of mind-bending stuff (everything your
> > > series deletes but also almost everything your series *adds*) goes
> > > away.  There may be a genuine tiny performance hit on 2-level systems
> > > due to the loss of huge pages in vmalloc space, but I'm not sure I
> > > care or that we use them anyway on these systems.  And PeterZ can stop
> > > even thinking about RCU.
> > >
> > > Am I making sense?
> >
> > I think it'll work for x86_64 and that is really all I care about :-)
> 
> Sadly, I think that Joerg has convinced my that this doesn't really
> work for 32-bit unless we rework the LDT code or drop support for
> something that we might not want to drop support for.

I was thinking keep these patches for 32bit and fix 64bit 'proper'. But
sure, if we can get rid of it all by stripping 32bit features I'm not
going to object either.
Joerg Roedel May 11, 2020, 7:14 p.m. UTC | #21
On Mon, May 11, 2020 at 08:36:31AM -0700, Andy Lutomirski wrote:
> What if we make 32-bit PTI depend on PAE?

It already does, PTI support for legacy paging had to be removed because
there were memory corruption problems with THP. The reason was that huge
PTEs in the user-space area were mapped in two page-tables (kernel and
user), but A/D bits were only fetched from the kernel part. To not make
things more complicated we agreed on just not supporting PTI without
PAE.

> And drop 32-bit Xen PV support?  And make 32-bit huge pages depend on
> PAE?  Then 32-bit non-PAE can use the direct-mapped LDT, 32-bit PTI
> (and optionally PAE non-PTI) can use the evil virtually mapped LDT.
> And 32-bit non-PAE (the 2-level case) will only have pointers to page
> tables at the top level.  And then we can preallocate.

Not sure I can follow you here. How can 32-bit PTI with PAE use the LDT
from the direct mapping? I am guessing you want to get rid of the
SHARED_KERNEL_PMD==0 case for PAE kernels. This would indeed make
syncing unneccessary on PAE, but pre-allocation would still be needed
for 2-level paging. Just the amount of memory needed for the
pre-allocated PTE pages is half as big as it would be with PAE.

> Or maybe we don't want to defeature this much, or maybe the memory hit
> from this preallocation will hurt little 2-level 32-bit systems too
> much.

It will certainly make Linux less likely to boot on low-memory x86-32
systems, whoever will be affected by this.


	Joerg
Andy Lutomirski May 11, 2020, 7:36 p.m. UTC | #22
> On May 11, 2020, at 12:14 PM, Joerg Roedel <jroedel@suse.de> wrote:
> 
> On Mon, May 11, 2020 at 08:36:31AM -0700, Andy Lutomirski wrote:
>> What if we make 32-bit PTI depend on PAE?
> 
> It already does, PTI support for legacy paging had to be removed because
> there were memory corruption problems with THP. The reason was that huge
> PTEs in the user-space area were mapped in two page-tables (kernel and
> user), but A/D bits were only fetched from the kernel part. To not make
> things more complicated we agreed on just not supporting PTI without
> PAE.
> 
>> And drop 32-bit Xen PV support?  And make 32-bit huge pages depend on
>> PAE?  Then 32-bit non-PAE can use the direct-mapped LDT, 32-bit PTI
>> (and optionally PAE non-PTI) can use the evil virtually mapped LDT.
>> And 32-bit non-PAE (the 2-level case) will only have pointers to page
>> tables at the top level.  And then we can preallocate.
> 
> Not sure I can follow you here. How can 32-bit PTI with PAE use the LDT
> from the direct mapping? I am guessing you want to get rid of the
> SHARED_KERNEL_PMD==0 case for PAE kernels.

I wrote nonsense. I mean bite off a piece of the *user* portion of the address space and stick the LDT there.

I contemplated doing this when I wrote the 64-bit code, but I decided we had so much address space to throw around that I liked my solution better.

> This would indeed make
> syncing unneccessary on PAE, but pre-allocation would still be needed
> for 2-level paging. Just the amount of memory needed for the
> pre-allocated PTE pages is half as big as it would be with PAE.
> 
>> Or maybe we don't want to defeature this much, or maybe the memory hit
>> from this preallocation will hurt little 2-level 32-bit systems too
>> much.
> 
> It will certainly make Linux less likely to boot on low-memory x86-32
> systems, whoever will be affected by this.
> 
> 

I’m guessing the right solution is either your series or your series plus preallocation on 64-bit. I’m just grumpy about it...
Steven Rostedt May 11, 2020, 8:50 p.m. UTC | #23
On Mon, 11 May 2020 21:14:14 +0200
Joerg Roedel <jroedel@suse.de> wrote:

> > Or maybe we don't want to defeature this much, or maybe the memory hit
> > from this preallocation will hurt little 2-level 32-bit systems too
> > much.  
> 
> It will certainly make Linux less likely to boot on low-memory x86-32
> systems, whoever will be affected by this.

That may be an issue, as I'm guessing the biggest users of x86-32, is
probably small systems that uses Intel's attempts at the embedded space.

-- Steve
Joerg Roedel May 12, 2020, 3:02 p.m. UTC | #24
On Mon, May 11, 2020 at 12:36:19PM -0700, Andy Lutomirski wrote:
> I’m guessing the right solution is either your series or your series
> plus preallocation on 64-bit. I’m just grumpy about it...

Okay, so we can do the pre-allocation when it turns out the pgd_list
lock-times become a problem on x86-64. The tracking code in vmalloc.c is
needed anyway for 32-bit and there is no reason why 64-bit shouldn't use
it as well for now.
I don't think that taking the lock _will_ be a problem, as it is only
taken when a new PGD/P4D entry is populated. And it is pretty unlikely
that a system will populate all 64 of them, with 4-level paging each of
these entries will map 512GB of address space. But if I am wrong here
pre-allocating is still an option.


	Joerg
Steven Rostedt May 12, 2020, 3:13 p.m. UTC | #25
On Tue, 12 May 2020 17:02:50 +0200
Joerg Roedel <jroedel@suse.de> wrote:

> On Mon, May 11, 2020 at 12:36:19PM -0700, Andy Lutomirski wrote:
> > I’m guessing the right solution is either your series or your series
> > plus preallocation on 64-bit. I’m just grumpy about it...  
> 
> Okay, so we can do the pre-allocation when it turns out the pgd_list
> lock-times become a problem on x86-64. The tracking code in vmalloc.c is
> needed anyway for 32-bit and there is no reason why 64-bit shouldn't use
> it as well for now.
> I don't think that taking the lock _will_ be a problem, as it is only
> taken when a new PGD/P4D entry is populated. And it is pretty unlikely
> that a system will populate all 64 of them, with 4-level paging each of
> these entries will map 512GB of address space. But if I am wrong here
> pre-allocating is still an option.
> 

256TB of RAM isn't too far in the future ;-)

-- Steve