[v2,0/6] mm/arm64: re-enable HVO

Message ID 20241107202033.2721681-1-yuzhao@google.com (mailing list archive)

Message

Yu Zhao Nov. 7, 2024, 8:20 p.m. UTC
HVO was disabled by commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable
HUGETLB_PAGE_OPTIMIZE_VMEMMAP") due to the following reason:

  This is deemed UNPREDICTABLE by the Arm architecture without a
  break-before-make sequence (make the PTE invalid, TLBI, write the
  new valid PTE). However, such sequence is not possible since the
  vmemmap may be concurrently accessed by the kernel.

This series presents one of the previously discussed approaches to
re-enable HugeTLB Vmemmap Optimization (HVO) on arm64. Other
approaches that have been discussed include:
  A. Handle kernel PF while doing BBM [1],
  B. Use stop_machine() while doing BBM [2], and,
  C. Enable FEAT_BBM level 2 and keep the memory contents at the old
     and new output addresses unchanged to avoid BBM (D8.16.1-2) [3].

A quick comparison between this approach (D) and the above approaches:
  --+------------------------------+-----------------------------+
    |              Pros            |             Cons            |
  --+------------------------------+-----------------------------+
  A | Low latency, h/w independent | Predictability concerns [4] |
  B | Predictable, h/w independent | High latency                |
  C | Predictable, low latency     | H/w dependent, complex      |
  D | Predictable, h/w independent | Medium latency              |
  --+------------------------------+-----------------------------+

This approach is being tested on Google's production systems, where
the "cons" above are generally acceptable, making it the preferred
trade-off for our use cases:
  +------------------------------+------------+----------+--------+
  | HugeTLB operations           | Before [0] | After    | Change |
  +------------------------------+------------+----------+--------+
  | Alloc 600 1GB                | 0m3.526s   | 0m3.649s | +4%    |
  | Free 600 1GB                 | 0m0.880s   | 0m0.917s | +4%    |
  | Demote 600 1GB to 307200 2MB | 0m1.575s   | 0m3.640s | +131%  |
  | Free 307200 2MB              | 0m0.946s   | 0m2.921s | +209%  |
  +------------------------------+------------+----------+--------+

[0] For comparison purposes, this only includes the last patch in the
    series, i.e., CONFIG_ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP=y.
[1] https://lore.kernel.org/20240113094436.2506396-1-sunnanyong@huawei.com/
[2] https://lore.kernel.org/ZbKjHHeEdFYY1xR5@arm.com/
[3] https://lore.kernel.org/Zo68DP6siXfb6ZBR@arm.com/
[4] https://lore.kernel.org/20240326125409.GA9552@willie-the-truck/

Major changes from v1, based on Marc Zyngier's help:
  1. Switched from CPU masks to a counter when pausing remote CPUs.
  2. Removed unnecessary memory barriers.

Yu Zhao (6):
  mm/hugetlb_vmemmap: batch-update PTEs
  mm/hugetlb_vmemmap: add arch-independent helpers
  irqchip/gic-v3: support SGI broadcast
  arm64: broadcast IPIs to pause remote CPUs
  arm64: pause remote CPUs to update vmemmap
  arm64: select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP

 arch/arm64/Kconfig               |   1 +
 arch/arm64/include/asm/pgalloc.h |  69 ++++++++
 arch/arm64/include/asm/smp.h     |   3 +
 arch/arm64/kernel/smp.c          |  85 +++++++++-
 drivers/irqchip/irq-gic-v3.c     |  31 +++-
 include/linux/mm_types.h         |   7 +
 mm/hugetlb_vmemmap.c             | 262 +++++++++++++++++++++----------
 7 files changed, 362 insertions(+), 96 deletions(-)


base-commit: 80fb25341631b75f57b84f99cc35b95ca2aad329

Comments

Will Deacon Nov. 25, 2024, 3:22 p.m. UTC | #1
Hi Yu Zhao,

On Thu, Nov 07, 2024 at 01:20:27PM -0700, Yu Zhao wrote:
> HVO was disabled by commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable
> HUGETLB_PAGE_OPTIMIZE_VMEMMAP") due to the following reason:
> 
>   This is deemed UNPREDICTABLE by the Arm architecture without a
>   break-before-make sequence (make the PTE invalid, TLBI, write the
>   new valid PTE). However, such sequence is not possible since the
>   vmemmap may be concurrently accessed by the kernel.
> 
> This series presents one of the previously discussed approaches to
> re-enable HugeTLB Vmemmap Optimization (HVO) on arm64.

Before jumping into the new mechanisms here, I'd really like to
understand how the current code is intended to work in the relatively
simple case where the vmemmap is page-mapped to start with (i.e. when we
don't need to worry about block-splitting).

In that case, who are the concurrent users of the vmemmap that we need
to worry about? Is it solely speculative references via
page_ref_add_unless() or are there others?

Looking at page_ref_add_unless(), what serialises that against
__hugetlb_vmemmap_restore_folio()? I see there's a synchronize_rcu()
call in the latter, but what prevents an RCU reader coming in
immediately after that?

Even if we resolve the BBM issues, we still need to get the
synchronisation right so that we don't e.g. attempt a cmpxchg() to a
read-only mapping, as the CAS instruction requires write permission on
arm64 even if the comparison ultimately fails.

So please help me to understand the basics of HVO before we get bogged
down by the block-splitting on arm64.

Cheers,

Will
Yu Zhao Nov. 25, 2024, 10:22 p.m. UTC | #2
On Mon, Nov 25, 2024 at 8:22 AM Will Deacon <will@kernel.org> wrote:
>
> Hi Yu Zhao,
>
> On Thu, Nov 07, 2024 at 01:20:27PM -0700, Yu Zhao wrote:
> > HVO was disabled by commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable
> > HUGETLB_PAGE_OPTIMIZE_VMEMMAP") due to the following reason:
> >
> >   This is deemed UNPREDICTABLE by the Arm architecture without a
> >   break-before-make sequence (make the PTE invalid, TLBI, write the
> >   new valid PTE). However, such sequence is not possible since the
> >   vmemmap may be concurrently accessed by the kernel.
> >
> > This series presents one of the previously discussed approaches to
> > re-enable HugeTLB Vmemmap Optimization (HVO) on arm64.
>
> Before jumping into the new mechanisms here, I'd really like to
> understand how the current code is intended to work in the relatively
> simple case where the vmemmap is page-mapped to start with (i.e. when we
> don't need to worry about block-splitting).
>
> In that case, who are the concurrent users of the vmemmap that we need
> to worry about?

Any speculative PFN walkers who either only read `struct page[]` or
attempt to increment page->_refcount if it's not zero.

> Is it solely speculative references via
> page_ref_add_unless() or are there others?

page_ref_add_unless() needs to be successful before writes can follow;
speculative reads are always allowed.
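
For illustration, a speculative walker has roughly the following shape.
This is only a sketch with a made-up function name, not code from any
particular caller; the point is that get_page_unless_zero() is simply
page_ref_add_unless(page, 1, 0), and that the walker is expected to run
inside an RCU read-side critical section, which is what the
synchronize_rcu() on the remapping side waits for:

  #include <linux/mm.h>
  #include <linux/rcupdate.h>

  /* Sketch only: assumes pfn_valid(pfn). */
  static void walk_pfn_speculatively(unsigned long pfn)
  {
          struct page *page = pfn_to_page(pfn);

          rcu_read_lock();
          if (!get_page_unless_zero(page)) {
                  /* Refcount frozen at zero (or a fake head): reads only. */
                  rcu_read_unlock();
                  return;
          }
          rcu_read_unlock();

          /* A reference is held; writes to the struct page are allowed. */
          put_page(page);
  }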

> Looking at page_ref_add_unless(), what serialises that against
> __hugetlb_vmemmap_restore_folio()? I see there's a synchronize_rcu()
> call in the latter, but what prevents an RCU reader coming in
> immediately after that?

In page_ref_add_unless(), the condition `!page_is_fake_head(page) &&
page_ref_count(page)` returns false before a PTE becomes RO.

For HVO, i.e., a PTE being switched from RW to RO, page_ref_count() is
frozen (remains zero), followed by synchronize_rcu(). After the
switch, page_is_fake_head() is true and it appears before
page_ref_count() is unfrozen (becomes non-zero), so the condition
remains false.

For de-HVO, i.e., a PTE being switched from RO to RW, page_ref_count()
again is frozen, followed by synchronize_rcu(). Only this time
page_is_fake_head() is false after the switch, and again it appears
before page_ref_count() is unfrozen. To answer your question, readers
coming in immediately after that won't be able to see non-zero
page_ref_count() before they see page_is_fake_head() being false. IOW,
regarding whether it is RW, the condition can be false negative but
never false positive.
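
To put that into a rough code shape (this paraphrases the sequence just
described; the freeze/remap/unfreeze helpers below are made-up names,
not the actual mm/hugetlb_vmemmap.c functions):

  static void hvo_transition_sketch(bool optimize)
  {
          freeze_page_refs();              /* page_ref_count() stays zero         */
          synchronize_rcu();               /* drain in-flight speculative walkers */

          if (optimize)                    /* HVO: PTEs go RW -> RO               */
                  remap_tails_readonly();  /* page_is_fake_head() becomes true    */
          else                             /* de-HVO: PTEs go RO -> RW            */
                  remap_tails_writable();  /* page_is_fake_head() becomes false   */

          unfreeze_page_refs();            /* page_ref_count() becomes non-zero   */
  }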

> Even if we resolve the BBM issues, we still need to get the
> synchronisation right so that we don't e.g. attempt a cmpxchg() to a
> read-only mapping, as the CAS instruction requires write permission on
> arm64 even if the comparison ultimately fails.

Correct. This applies to x86 as well, i.e., CAS on RO memory crashes
the kernel, even if CAS would fail otherwise.
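
As a minimal sketch of the hazard (made-up helper, not real kernel
code): reaching the CAS at all is the problem, regardless of its
outcome.

  static bool ref_inc_sketch(struct page *page, int expected)
  {
          /*
           * The CAS needs write permission on page->_refcount even when
           * the comparison fails, so this faults if the struct page is
           * backed by a read-only vmemmap PTE.
           */
          return atomic_cmpxchg(&page->_refcount, expected,
                                expected + 1) == expected;
  }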

> So please help me to understand the basics of HVO before we get bogged
> down by the block-splitting on arm64.

Gladly. Please let me know if anything from the core MM side is unclear.
Will Deacon Nov. 28, 2024, 2:20 p.m. UTC | #3
On Mon, Nov 25, 2024 at 03:22:47PM -0700, Yu Zhao wrote:
> On Mon, Nov 25, 2024 at 8:22 AM Will Deacon <will@kernel.org> wrote:
> > On Thu, Nov 07, 2024 at 01:20:27PM -0700, Yu Zhao wrote:
> > > HVO was disabled by commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable
> > > HUGETLB_PAGE_OPTIMIZE_VMEMMAP") due to the following reason:
> > >
> > >   This is deemed UNPREDICTABLE by the Arm architecture without a
> > >   break-before-make sequence (make the PTE invalid, TLBI, write the
> > >   new valid PTE). However, such sequence is not possible since the
> > >   vmemmap may be concurrently accessed by the kernel.
> > >
> > > This series presents one of the previously discussed approaches to
> > > re-enable HugeTLB Vmemmap Optimization (HVO) on arm64.
> >
> > Before jumping into the new mechanisms here, I'd really like to
> > understand how the current code is intended to work in the relatively
> > simple case where the vmemmap is page-mapped to start with (i.e. when we
> > don't need to worry about block-splitting).
> >
> > In that case, who are the concurrent users of the vmemmap that we need
> > to worry about?
> 
> Any speculative PFN walkers who either only read `struct page[]` or
> attempt to increment page->_refcount if it's not zero.
> 
> > Is it solely speculative references via
> > page_ref_add_unless() or are there others?
> 
> page_ref_add_unless() needs to be successful before writes can follow;
> speculative reads are always allowed.
> 
> > Looking at page_ref_add_unless(), what serialises that against
> > __hugetlb_vmemmap_restore_folio()? I see there's a synchronize_rcu()
> > call in the latter, but what prevents an RCU reader coming in
> > immediately after that?
> 
> In page_ref_add_unless(), the condition `!page_is_fake_head(page) &&
> page_ref_count(page)` returns false before a PTE becomes RO.
> 
> For HVO, i.e., a PTE being switched from RW to RO, page_ref_count() is
> frozen (remains zero), followed by synchronize_rcu(). After the
> switch, page_is_fake_head() is true and it appears before
> page_ref_count() is unfrozen (becomes non-zero), so the condition
> remains false.
> 
> For de-HVO, i.e., a PTE being switched from RO to RW, page_ref_count()
> again is frozen, followed by synchronize_rcu(). Only this time
> page_is_fake_head() is false after the switch, and again it appears
> before page_ref_count() is unfrozen. To answer your question, readers
> coming in immediately after that won't be able to see non-zero
> page_ref_count() before they see page_is_fake_head() being false. IOW,
> regarding whether it is RW, the condition can be false negative but
> never false positive.

Thanks, but I'm still not seeing how this works. When you say "appears
before", I don't see any memory barriers in page_ref_add_unless() that
enforce that e.g. the refcount and the flags are checked in order and
I can't see how the synchronize_rcu() helps either as it's called really
early (I think that's just there for the static key).

If page_is_fake_head() is reliable, then I'm thinking we could use that
to steer page_ref_add_unless() away from the tail pages during the
remapping operations and it would be fine to use a break-before-make
sequence.

Will
Yu Zhao Jan. 7, 2025, 6:07 a.m. UTC | #4
On Thu, Nov 28, 2024 at 7:20 AM Will Deacon <will@kernel.org> wrote:
>
> On Mon, Nov 25, 2024 at 03:22:47PM -0700, Yu Zhao wrote:
> > On Mon, Nov 25, 2024 at 8:22 AM Will Deacon <will@kernel.org> wrote:
> > > On Thu, Nov 07, 2024 at 01:20:27PM -0700, Yu Zhao wrote:
> > > > HVO was disabled by commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable
> > > > HUGETLB_PAGE_OPTIMIZE_VMEMMAP") due to the following reason:
> > > >
> > > >   This is deemed UNPREDICTABLE by the Arm architecture without a
> > > >   break-before-make sequence (make the PTE invalid, TLBI, write the
> > > >   new valid PTE). However, such sequence is not possible since the
> > > >   vmemmap may be concurrently accessed by the kernel.
> > > >
> > > > This series presents one of the previously discussed approaches to
> > > > re-enable HugeTLB Vmemmap Optimization (HVO) on arm64.
> > >
> > > Before jumping into the new mechanisms here, I'd really like to
> > > understand how the current code is intended to work in the relatively
> > > simple case where the vmemmap is page-mapped to start with (i.e. when we
> > > don't need to worry about block-splitting).
> > >
> > > In that case, who are the concurrent users of the vmemmap that we need
> > > to worry about?
> >
> > Any speculative PFN walkers who either only read `struct page[]` or
> > attempt to increment page->_refcount if it's not zero.
> >
> > > Is it solely speculative references via
> > > page_ref_add_unless() or are there others?
> >
> > page_ref_add_unless() needs to be successful before writes can follow;
> > speculative reads are always allowed.
> >
> > > Looking at page_ref_add_unless(), what serialises that against
> > > __hugetlb_vmemmap_restore_folio()? I see there's a synchronize_rcu()
> > > call in the latter, but what prevents an RCU reader coming in
> > > immediately after that?
> >
> > In page_ref_add_unless(), the condition `!page_is_fake_head(page) &&
> > page_ref_count(page)` returns false before a PTE becomes RO.
> >
> > For HVO, i.e., a PTE being switched from RW to RO, page_ref_count() is
> > frozen (remains zero), followed by synchronize_rcu(). After the
> > switch, page_is_fake_head() is true and it appears before
> > page_ref_count() is unfrozen (becomes non-zero), so the condition
> > remains false.
> >
> > For de-HVO, i.e., a PTE being switched from RO to RW, page_ref_count()
> > again is frozen, followed by synchronize_rcu(). Only this time
> > page_is_fake_head() is false after the switch, and again it appears
> > before page_ref_count() is unfrozen. To answer your question, readers
> > coming in immediately after that won't be able to see non-zero
> > page_ref_count() before they see page_is_fake_head() being false. IOW,
> > regarding whether it is RW, the condition can be false negative but
> > never false positive.
>
> Thanks, but I'm still not seeing how this works. When you say "appears
> before", I don't see any memory barriers in page_ref_add_unless() that
> enforce that e.g. the refcount and the flags are checked in order and

Right, there is a missing barrier in page_ref_add_unless() and the
order of those two checks, i.e., page_is_fake_head() and then
page_ref_count(), is wrong.

I posted a fix here [1].

[1] https://lore.kernel.org/20250107043505.351925-1-yuzhao@google.com/

> I can't see how the synchronize_rcu() helps either as it's called really
> early (I think that's just there for the static key).

That fix makes sure no speculative PFN walkers will try to modify
page->_refcount during the transition from the counter being frozen to
modifiable. synchronize_rcu() makes sure something similar won't
happen during the transition from the counter being modifiable to
frozen.
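
Roughly, the idea looks like the sketch below (my illustration of the
ordering only, not the patch itself; see [1] for the actual code). The
reader checks the refcount first and the fake head second, with a read
barrier in between, mirroring the remapping side, which publishes the
new mapping before unfreezing the refcount:

  static inline bool page_ref_add_unless_sketch(struct page *page,
                                                int nr, int u)
  {
          bool ret = false;

          /* The remapping side's synchronize_rcu() waits for this section. */
          rcu_read_lock();

          /* Refuse while the refcount is still frozen at u (i.e. zero). */
          if (page_ref_count(page) != u) {
                  /* Order the refcount load before the fake-head check. */
                  smp_rmb();
                  /* Refuse tails already remapped to the r/o fake head. */
                  if (!page_is_fake_head(page))
                          ret = atomic_add_unless(&page->_refcount, nr, u);
          }
          rcu_read_unlock();

          return ret;
  }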

> If page_is_fake_head() is reliable, then I'm thinking we could use that
> to steer page_ref_add_unless() away from the tail pages during the
> remapping operations and it would be fine to use a break-before-make
> sequence.

The struct page pointer passed into page_is_fake_head() would become
inaccessible during BBM. So it would just crash there. That's why I
think we either have to handle kernel PFs or pause other CPUs.

(page_is_fake_head() works by detecting whether it's accessing the
original struct page or a remapped (r/o) one; the latter carries a
signature that lets it tell the difference.)