Message ID | 20241107202033.2721681-1-yuzhao@google.com (mailing list archive) |
---|---|
Series | mm/arm64: re-enable HVO |
Hi Yu Zhao,

On Thu, Nov 07, 2024 at 01:20:27PM -0700, Yu Zhao wrote:
> HVO was disabled by commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable
> HUGETLB_PAGE_OPTIMIZE_VMEMMAP") due to the following reason:
>
>   This is deemed UNPREDICTABLE by the Arm architecture without a
>   break-before-make sequence (make the PTE invalid, TLBI, write the
>   new valid PTE). However, such sequence is not possible since the
>   vmemmap may be concurrently accessed by the kernel.
>
> This series presents one of the previously discussed approaches to
> re-enable HugeTLB Vmemmap Optimization (HVO) on arm64.

Before jumping into the new mechanisms here, I'd really like to
understand how the current code is intended to work in the relatively
simple case where the vmemmap is page-mapped to start with (i.e. when
we don't need to worry about block-splitting).

In that case, who are the concurrent users of the vmemmap that we need
to worry about? Is it solely speculative references via
page_ref_add_unless() or are there others?

Looking at page_ref_add_unless(), what serialises that against
__hugetlb_vmemmap_restore_folio()? I see there's a synchronize_rcu()
call in the latter, but what prevents an RCU reader coming in
immediately after that?

Even if we resolve the BBM issues, we still need to get the
synchronisation right so that we don't e.g. attempt a cmpxchg() to a
read-only mapping, as the CAS instruction requires write permission on
arm64 even if the comparison ultimately fails.

So please help me to understand the basics of HVO before we get bogged
down by the block-splitting on arm64.

Cheers,

Will
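For reference, a minimal sketch of the break-before-make sequence quoted
above, applied to a single vmemmap PTE. This is illustrative only: the
function name is hypothetical, the generic pte_clear()/
flush_tlb_kernel_range()/set_pte_at() helpers are assumed rather than
what the series actually uses, and nothing here addresses the concurrent
vmemmap accesses being discussed.

	/*
	 * Sketch of break-before-make for one vmemmap PTE, using generic
	 * kernel helpers. Does not handle concurrent accessors.
	 */
	static void vmemmap_remap_pte_bbm(pte_t *ptep, unsigned long addr,
					  pte_t newpte)
	{
		pte_clear(&init_mm, addr, ptep);                /* 1. make the PTE invalid  */
		flush_tlb_kernel_range(addr, addr + PAGE_SIZE); /* 2. TLBI for the old entry */
		set_pte_at(&init_mm, addr, ptep, newpte);       /* 3. write the new valid PTE */
	}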
On Mon, Nov 25, 2024 at 8:22 AM Will Deacon <will@kernel.org> wrote:
>
> Hi Yu Zhao,
>
> On Thu, Nov 07, 2024 at 01:20:27PM -0700, Yu Zhao wrote:
> > HVO was disabled by commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable
> > HUGETLB_PAGE_OPTIMIZE_VMEMMAP") due to the following reason:
> >
> >   This is deemed UNPREDICTABLE by the Arm architecture without a
> >   break-before-make sequence (make the PTE invalid, TLBI, write the
> >   new valid PTE). However, such sequence is not possible since the
> >   vmemmap may be concurrently accessed by the kernel.
> >
> > This series presents one of the previously discussed approaches to
> > re-enable HugeTLB Vmemmap Optimization (HVO) on arm64.
>
> Before jumping into the new mechanisms here, I'd really like to
> understand how the current code is intended to work in the relatively
> simple case where the vmemmap is page-mapped to start with (i.e. when
> we don't need to worry about block-splitting).
>
> In that case, who are the concurrent users of the vmemmap that we need
> to worry about?

Any speculative PFN walkers who either only read `struct page[]` or
attempt to increment page->_refcount if it's not zero.

> Is it solely speculative references via
> page_ref_add_unless() or are there others?

page_ref_add_unless() needs to be successful before writes can follow;
speculative reads are always allowed.

> Looking at page_ref_add_unless(), what serialises that against
> __hugetlb_vmemmap_restore_folio()? I see there's a synchronize_rcu()
> call in the latter, but what prevents an RCU reader coming in
> immediately after that?

In page_ref_add_unless(), the condition `!page_is_fake_head(page) &&
page_ref_count(page)` returns false before a PTE becomes RO.

For HVO, i.e., a PTE being switched from RW to RO, page_ref_count() is
frozen (remains zero), followed by synchronize_rcu(). After the switch,
page_is_fake_head() is true, and it appears before page_ref_count() is
unfrozen (becomes non-zero), so the condition remains false.

For de-HVO, i.e., a PTE being switched from RO to RW, page_ref_count()
again is frozen, followed by synchronize_rcu(). Only this time
page_is_fake_head() is false after the switch, and again it appears
before page_ref_count() is unfrozen. To answer your question, readers
coming in immediately after that won't be able to see a non-zero
page_ref_count() before they see page_is_fake_head() being false. IOW,
regarding whether the mapping is RW, the condition can be a false
negative but never a false positive.

> Even if we resolve the BBM issues, we still need to get the
> synchronisation right so that we don't e.g. attempt a cmpxchg() to a
> read-only mapping, as the CAS instruction requires write permission on
> arm64 even if the comparison ultimately fails.

Correct. This applies to x86 as well, i.e., CAS on RO memory crashes
the kernel, even if the CAS would otherwise fail.

> So please help me to understand the basics of HVO before we get bogged
> down by the block-splitting on arm64.

Gladly. Please let me know if anything from the core MM side is
unclear.
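To make the ordering argument above concrete, here is a rough sketch,
not the exact kernel code: the reader side paraphrases the
page_ref_add_unless() condition quoted in the reply, the helper name is
hypothetical, and the writer-side steps restate the HVO/de-HVO sequence
described above.

	/*
	 * Reader side: roughly what a speculative PFN walker does via
	 * page_ref_add_unless() (simplified sketch).
	 */
	static inline bool speculative_get_page(struct page *page, int nr)
	{
		bool ret = false;

		rcu_read_lock();
		/* both loads must observe the writer's updates in order */
		if (!page_is_fake_head(page) && page_ref_count(page))
			ret = atomic_add_unless(&page->_refcount, nr, 0);
		rcu_read_unlock();

		return ret;
	}

	/*
	 * Writer side, HVO (RW -> RO), as described above:
	 *
	 *   1. freeze page_ref_count() at zero
	 *   2. synchronize_rcu()          -- wait out readers that saw RW
	 *   3. remap the tail vmemmap pages read-only (fake heads appear)
	 *   4. unfreeze page_ref_count()  -- it becomes non-zero again
	 *
	 * De-HVO (RO -> RW) is the mirror image: freeze, synchronize_rcu(),
	 * remap read-write (fake heads disappear), then unfreeze.
	 */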
On Mon, Nov 25, 2024 at 03:22:47PM -0700, Yu Zhao wrote:
> On Mon, Nov 25, 2024 at 8:22 AM Will Deacon <will@kernel.org> wrote:
> > On Thu, Nov 07, 2024 at 01:20:27PM -0700, Yu Zhao wrote:
> > > HVO was disabled by commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable
> > > HUGETLB_PAGE_OPTIMIZE_VMEMMAP") due to the following reason:
> > >
> > >   This is deemed UNPREDICTABLE by the Arm architecture without a
> > >   break-before-make sequence (make the PTE invalid, TLBI, write the
> > >   new valid PTE). However, such sequence is not possible since the
> > >   vmemmap may be concurrently accessed by the kernel.
> > >
> > > This series presents one of the previously discussed approaches to
> > > re-enable HugeTLB Vmemmap Optimization (HVO) on arm64.
> >
> > Before jumping into the new mechanisms here, I'd really like to
> > understand how the current code is intended to work in the relatively
> > simple case where the vmemmap is page-mapped to start with (i.e. when
> > we don't need to worry about block-splitting).
> >
> > In that case, who are the concurrent users of the vmemmap that we need
> > to worry about?
>
> Any speculative PFN walkers who either only read `struct page[]` or
> attempt to increment page->_refcount if it's not zero.
>
> > Is it solely speculative references via
> > page_ref_add_unless() or are there others?
>
> page_ref_add_unless() needs to be successful before writes can follow;
> speculative reads are always allowed.
>
> > Looking at page_ref_add_unless(), what serialises that against
> > __hugetlb_vmemmap_restore_folio()? I see there's a synchronize_rcu()
> > call in the latter, but what prevents an RCU reader coming in
> > immediately after that?
>
> In page_ref_add_unless(), the condition `!page_is_fake_head(page) &&
> page_ref_count(page)` returns false before a PTE becomes RO.
>
> For HVO, i.e., a PTE being switched from RW to RO, page_ref_count() is
> frozen (remains zero), followed by synchronize_rcu(). After the switch,
> page_is_fake_head() is true, and it appears before page_ref_count() is
> unfrozen (becomes non-zero), so the condition remains false.
>
> For de-HVO, i.e., a PTE being switched from RO to RW, page_ref_count()
> again is frozen, followed by synchronize_rcu(). Only this time
> page_is_fake_head() is false after the switch, and again it appears
> before page_ref_count() is unfrozen. To answer your question, readers
> coming in immediately after that won't be able to see a non-zero
> page_ref_count() before they see page_is_fake_head() being false. IOW,
> regarding whether the mapping is RW, the condition can be a false
> negative but never a false positive.

Thanks, but I'm still not seeing how this works. When you say "appears
before", I don't see any memory barriers in page_ref_add_unless() that
enforce that e.g. the refcount and the flags are checked in order, and
I can't see how the synchronize_rcu() helps either, as it's called
really early (I think that's just there for the static key).

If page_is_fake_head() is reliable, then I'm thinking we could use that
to steer page_ref_add_unless() away from the tail pages during the
remapping operations and it would be fine to use a break-before-make
sequence.

Will
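A hypothetical illustration of the reordering concern raised above (the
function name is made up and this is not the actual kernel code): with
two plain loads and only a control dependency between them, a weakly
ordered CPU can pair a pre-remap refcount with a post-remap fake-head
state, or vice versa.

	/*
	 * Illustration only: nothing orders the two loads, so the CPU may
	 * satisfy the refcount load before the fake-head load, even though
	 * program order (and the && short-circuit) suggests otherwise.
	 */
	static bool unordered_check(struct page *page)
	{
		bool ret;

		rcu_read_lock();
		/*
		 * Load 1: the fake-head signature (page[1].compound_head etc.).
		 * Load 2: page->_refcount.
		 * Without an acquire or smp_rmb() between them, load 2 can be
		 * observed before load 1; a control dependency only orders
		 * loads against later stores, not later loads.
		 */
		ret = !page_is_fake_head(page) && page_ref_count(page);
		rcu_read_unlock();

		return ret;
	}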
On Thu, Nov 28, 2024 at 7:20 AM Will Deacon <will@kernel.org> wrote:
>
> On Mon, Nov 25, 2024 at 03:22:47PM -0700, Yu Zhao wrote:
> > On Mon, Nov 25, 2024 at 8:22 AM Will Deacon <will@kernel.org> wrote:
> > > On Thu, Nov 07, 2024 at 01:20:27PM -0700, Yu Zhao wrote:
> > > > HVO was disabled by commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable
> > > > HUGETLB_PAGE_OPTIMIZE_VMEMMAP") due to the following reason:
> > > >
> > > >   This is deemed UNPREDICTABLE by the Arm architecture without a
> > > >   break-before-make sequence (make the PTE invalid, TLBI, write the
> > > >   new valid PTE). However, such sequence is not possible since the
> > > >   vmemmap may be concurrently accessed by the kernel.
> > > >
> > > > This series presents one of the previously discussed approaches to
> > > > re-enable HugeTLB Vmemmap Optimization (HVO) on arm64.
> > >
> > > Before jumping into the new mechanisms here, I'd really like to
> > > understand how the current code is intended to work in the relatively
> > > simple case where the vmemmap is page-mapped to start with (i.e. when
> > > we don't need to worry about block-splitting).
> > >
> > > In that case, who are the concurrent users of the vmemmap that we need
> > > to worry about?
> >
> > Any speculative PFN walkers who either only read `struct page[]` or
> > attempt to increment page->_refcount if it's not zero.
> >
> > > Is it solely speculative references via
> > > page_ref_add_unless() or are there others?
> >
> > page_ref_add_unless() needs to be successful before writes can follow;
> > speculative reads are always allowed.
> >
> > > Looking at page_ref_add_unless(), what serialises that against
> > > __hugetlb_vmemmap_restore_folio()? I see there's a synchronize_rcu()
> > > call in the latter, but what prevents an RCU reader coming in
> > > immediately after that?
> >
> > In page_ref_add_unless(), the condition `!page_is_fake_head(page) &&
> > page_ref_count(page)` returns false before a PTE becomes RO.
> >
> > For HVO, i.e., a PTE being switched from RW to RO, page_ref_count() is
> > frozen (remains zero), followed by synchronize_rcu(). After the switch,
> > page_is_fake_head() is true, and it appears before page_ref_count() is
> > unfrozen (becomes non-zero), so the condition remains false.
> >
> > For de-HVO, i.e., a PTE being switched from RO to RW, page_ref_count()
> > again is frozen, followed by synchronize_rcu(). Only this time
> > page_is_fake_head() is false after the switch, and again it appears
> > before page_ref_count() is unfrozen. To answer your question, readers
> > coming in immediately after that won't be able to see a non-zero
> > page_ref_count() before they see page_is_fake_head() being false. IOW,
> > regarding whether the mapping is RW, the condition can be a false
> > negative but never a false positive.
>
> Thanks, but I'm still not seeing how this works. When you say "appears
> before", I don't see any memory barriers in page_ref_add_unless() that
> enforce that e.g. the refcount and the flags are checked in order, and

Right, there is a missing barrier in page_ref_add_unless(), and the
order of those two checks, i.e., page_is_fake_head() and then
page_ref_count(), is wrong. I posted a fix here [1].

[1] https://lore.kernel.org/20250107043505.351925-1-yuzhao@google.com/

> I can't see how the synchronize_rcu() helps either, as it's called
> really early (I think that's just there for the static key).

That fix makes sure no speculative PFN walkers will try to modify
page->_refcount during the transition from the counter being frozen to
modifiable.

synchronize_rcu() makes sure something similar won't happen during the
transition from the counter being modifiable to frozen.

> If page_is_fake_head() is reliable, then I'm thinking we could use that
> to steer page_ref_add_unless() away from the tail pages during the
> remapping operations and it would be fine to use a break-before-make
> sequence.

The struct page pointer passed into page_is_fake_head() would become
inaccessible during BBM, so it would just crash there. That's why I
think we either have to handle kernel PFs or pause other CPUs.

(page_is_fake_head() works by detecting whether it's accessing the
original struct page or a remapped (RO) one; the latter carries a
signature that lets it tell the two apart.)
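The "signature" mentioned in the last paragraph is, roughly, the check
sketched below. This is a simplified paraphrase of the kernel's
page_fixed_fake_head()/page_is_fake_head() with the static-key gate and
other details omitted, and the function name is made up: after HVO the
tail vmemmap pages are aliased read-only to the head vmemmap page, so
reading a page-aligned struct page through one of those aliases shows
PG_head, but its neighbour page[1] is a tail whose compound_head points
at the real head elsewhere; that mismatch identifies a fake head.

	/*
	 * Simplified sketch of the fake-head signature check.
	 */
	static inline bool fake_head_sketch(const struct page *page)
	{
		if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
		    test_bit(PG_head, &page->flags)) {
			unsigned long head = READ_ONCE(page[1].compound_head);

			/* bit 0 set: page[1] is a tail; its head is (head - 1) */
			if (head & 1)
				return (const struct page *)(head - 1) != page;
		}
		return false;
	}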