Message ID | 20250108074822.722696-1-yuzhao@google.com (mailing list archive) |
---|---|
State | New |
Series | [mm-unstable,v2] mm/hugetlb_vmemmap: fix memory loads ordering |
On 08.01.25 08:48, Yu Zhao wrote:
> Using x86_64 as an example, for a 32KB struct page[] area describing a
> 2MB hugeTLB, HVO reduces the area to 4KB by the following steps:
>
> 1. Split the (r/w vmemmap) PMD mapping the area into 512 (r/w) PTEs;
> 2. For the 8 PTEs mapping the area, remap PTE 1-7 to the page mapped
>    by PTE 0, and at the same time change the permission from r/w to
>    r/o;
> 3. Free the pages PTE 1-7 used to map, hence the reduction from 32KB
>    to 4KB.
>
> However, the following race can happen due to improper ordering of
> memory loads:
>
>   CPU 1 (HVO)                     CPU 2 (speculative PFN walker)
>
>   page_ref_freeze()
>   synchronize_rcu()
>                                   rcu_read_lock()
>                                   page_is_fake_head() is false
>   vmemmap_remap_pte()
>   XXX: struct page[] becomes r/o
>
>   page_ref_unfreeze()
>                                   page_ref_count() is not zero
>
>                                   atomic_add_unless(&page->_refcount)
>                                   XXX: try to modify r/o struct page[]
>
> Specifically, page_is_fake_head() must be ordered after
> page_ref_count() on CPU 2 so that it can only return true for this
> case, to avoid the later attempt to modify r/o struct page[].
>
> This patch adds the missing memory barrier and performs the tests on
> page_is_fake_head() and page_ref_count() in the proper order.
>
> Fixes: bd225530a4c7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> Reported-by: Will Deacon <will@kernel.org>
> Closes: https://lore.kernel.org/20241128142028.GA3506@willie-the-truck/
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> ---
>  include/linux/page-flags.h | 37 +++++++++++++++++++++++++++++++++++++
>  include/linux/page_ref.h   |  2 +-
>  2 files changed, 38 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 691506bdf2c5..16fa8f0cea02 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -225,11 +225,48 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
>          }
>          return page;
>  }
> +
> +static __always_inline bool page_count_writable(const struct page *page, int u)
> +{
> +        if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
> +                return true;
> +
> +        /*
> +         * The refcount check is ordered before the fake-head check to prevent
> +         * the following race:
> +         *   CPU 1 (HVO)                     CPU 2 (speculative PFN walker)
> +         *
> +         *   page_ref_freeze()
> +         *   synchronize_rcu()
> +         *                                   rcu_read_lock()
> +         *                                   page_is_fake_head() is false
> +         *   vmemmap_remap_pte()
> +         *   XXX: struct page[] becomes r/o
> +         *
> +         *   page_ref_unfreeze()
> +         *                                   page_ref_count() is not zero
> +         *
> +         *                                   atomic_add_unless(&page->_refcount)
> +         *                                   XXX: try to modify r/o struct page[]
> +         *
> +         * The refcount check also prevents modification attempts to other (r/o)
> +         * tail pages that are not fake heads.
> +         */
> +        if (atomic_read_acquire(&page->_refcount) == u)
> +                return false;
> +
> +        return page_fixed_fake_head(page) == page;
> +}
>  #else
>  static inline const struct page *page_fixed_fake_head(const struct page *page)
>  {
>          return page;
>  }
> +
> +static inline bool page_count_writable(const struct page *page, int u)
> +{
> +        return true;
> +}
>  #endif
>
>  static __always_inline int page_is_fake_head(const struct page *page)
> diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
> index 8c236c651d1d..544150d1d5fd 100644
> --- a/include/linux/page_ref.h
> +++ b/include/linux/page_ref.h
> @@ -234,7 +234,7 @@ static inline bool page_ref_add_unless(struct page *page, int nr, int u)
>
>          rcu_read_lock();
>          /* avoid writing to the vmemmap area being remapped */
> -        if (!page_is_fake_head(page) && page_ref_count(page) != u)
> +        if (page_count_writable(page, u))
>                  ret = atomic_add_unless(&page->_refcount, nr, u);
>          rcu_read_unlock();
>

LGTM, thanks!

Reviewed-by: David Hildenbrand <david@redhat.com>
> On Jan 8, 2025, at 15:48, Yu Zhao <yuzhao@google.com> wrote:
>
> Using x86_64 as an example, for a 32KB struct page[] area describing a
> 2MB hugeTLB, HVO reduces the area to 4KB by the following steps:
>
> 1. Split the (r/w vmemmap) PMD mapping the area into 512 (r/w) PTEs;
> 2. For the 8 PTEs mapping the area, remap PTE 1-7 to the page mapped
>    by PTE 0, and at the same time change the permission from r/w to
>    r/o;
> 3. Free the pages PTE 1-7 used to map, hence the reduction from 32KB
>    to 4KB.
>
> However, the following race can happen due to improper ordering of
> memory loads:
>
>   CPU 1 (HVO)                     CPU 2 (speculative PFN walker)
>
>   page_ref_freeze()
>   synchronize_rcu()
>                                   rcu_read_lock()
>                                   page_is_fake_head() is false
>   vmemmap_remap_pte()
>   XXX: struct page[] becomes r/o
>
>   page_ref_unfreeze()
>                                   page_ref_count() is not zero
>
>                                   atomic_add_unless(&page->_refcount)
>                                   XXX: try to modify r/o struct page[]
>
> Specifically, page_is_fake_head() must be ordered after
> page_ref_count() on CPU 2 so that it can only return true for this
> case, to avoid the later attempt to modify r/o struct page[].
>
> This patch adds the missing memory barrier and performs the tests on
> page_is_fake_head() and page_ref_count() in the proper order.
>
> Fixes: bd225530a4c7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> Reported-by: Will Deacon <will@kernel.org>
> Closes: https://lore.kernel.org/20241128142028.GA3506@willie-the-truck/
> Signed-off-by: Yu Zhao <yuzhao@google.com>

Reviewed-by: Muchun Song <muchun.song@linux.dev>

Thanks.
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 691506bdf2c5..16fa8f0cea02 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -225,11 +225,48 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
         }
         return page;
 }
+
+static __always_inline bool page_count_writable(const struct page *page, int u)
+{
+        if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
+                return true;
+
+        /*
+         * The refcount check is ordered before the fake-head check to prevent
+         * the following race:
+         *   CPU 1 (HVO)                     CPU 2 (speculative PFN walker)
+         *
+         *   page_ref_freeze()
+         *   synchronize_rcu()
+         *                                   rcu_read_lock()
+         *                                   page_is_fake_head() is false
+         *   vmemmap_remap_pte()
+         *   XXX: struct page[] becomes r/o
+         *
+         *   page_ref_unfreeze()
+         *                                   page_ref_count() is not zero
+         *
+         *                                   atomic_add_unless(&page->_refcount)
+         *                                   XXX: try to modify r/o struct page[]
+         *
+         * The refcount check also prevents modification attempts to other (r/o)
+         * tail pages that are not fake heads.
+         */
+        if (atomic_read_acquire(&page->_refcount) == u)
+                return false;
+
+        return page_fixed_fake_head(page) == page;
+}
 #else
 static inline const struct page *page_fixed_fake_head(const struct page *page)
 {
         return page;
 }
+
+static inline bool page_count_writable(const struct page *page, int u)
+{
+        return true;
+}
 #endif
 
 static __always_inline int page_is_fake_head(const struct page *page)
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 8c236c651d1d..544150d1d5fd 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -234,7 +234,7 @@ static inline bool page_ref_add_unless(struct page *page, int nr, int u)
 
         rcu_read_lock();
         /* avoid writing to the vmemmap area being remapped */
-        if (!page_is_fake_head(page) && page_ref_count(page) != u)
+        if (page_count_writable(page, u))
                 ret = atomic_add_unless(&page->_refcount, nr, u);
         rcu_read_unlock();
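To make the acquire/release pairing behind the new atomic_read_acquire() concrete, below is a minimal userspace C11 sketch of the fixed load ordering; it is illustrative only, not kernel code. The two globals stand in for page->_refcount and the fake-head ("vmemmap remapped read-only") state, the release store models page_ref_unfreeze() (which is a release store in mainline), and rcu_read_lock()/synchronize_rcu() are deliberately elided, so only the load-ordering aspect of the scheme is shown.

/* Minimal userspace model of the fixed load ordering; illustrative only,
 * not kernel code. Compile with: cc -std=c11 -pthread model.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* A struct page reduced to the two pieces of state that matter here. */
static atomic_int  refcount = 2;       /* stands in for page->_refcount      */
static atomic_bool remapped = false;   /* "vmemmap remapped r/o (fake head)" */

/* Rough model of atomic_add_unless(): add @a unless the value equals @u. */
static bool add_unless(atomic_int *v, int a, int u)
{
        int c = atomic_load(v);

        do {
                if (c == u)
                        return false;
        } while (!atomic_compare_exchange_weak(v, &c, c + a));
        return true;
}

static void *hvo(void *arg)
{
        int expected = 2;

        /* page_ref_freeze(): give up if an extra reference exists. */
        if (!atomic_compare_exchange_strong(&refcount, &expected, 0))
                return NULL;
        /* synchronize_rcu() elided: it is what excludes walkers whose
         * checks completed before the freeze. */

        /* vmemmap_remap_pte(): struct page[] becomes r/o. */
        atomic_store_explicit(&remapped, true, memory_order_relaxed);

        /* page_ref_unfreeze(): release store, pairing with the walker's
         * acquire load below. */
        atomic_store_explicit(&refcount, 2, memory_order_release);
        return NULL;
}

static void *walker(void *arg)
{
        /* The fix: test the refcount first, with acquire ordering.
         * (The bug was testing the fake-head state first, with no
         * ordering against the later refcount load.) */
        if (atomic_load_explicit(&refcount, memory_order_acquire) == 0)
                return NULL;            /* frozen by HVO: back off        */

        /* Seeing the unfrozen refcount implies seeing remapped == true,
         * so a remapped (read-only) page is never written below. */
        if (atomic_load_explicit(&remapped, memory_order_relaxed))
                return NULL;            /* fake head: struct page is r/o  */

        add_unless(&refcount, 1, 0);    /* take the speculative reference */
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, hvo, NULL);
        pthread_create(&b, NULL, walker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("refcount=%d remapped=%d\n",
               atomic_load(&refcount), (int)atomic_load(&remapped));
        return 0;
}

With this order, the walker either sees the frozen (zero) refcount and backs off, or it sees the unfrozen refcount and is then guaranteed to also see the remap and back off; the broken order allowed it to pass both checks and write to the read-only struct page.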
Using x86_64 as an example, for a 32KB struct page[] area describing a
2MB hugeTLB, HVO reduces the area to 4KB by the following steps:

1. Split the (r/w vmemmap) PMD mapping the area into 512 (r/w) PTEs;
2. For the 8 PTEs mapping the area, remap PTE 1-7 to the page mapped
   by PTE 0, and at the same time change the permission from r/w to
   r/o;
3. Free the pages PTE 1-7 used to map, hence the reduction from 32KB
   to 4KB.

However, the following race can happen due to improper ordering of
memory loads:

  CPU 1 (HVO)                     CPU 2 (speculative PFN walker)

  page_ref_freeze()
  synchronize_rcu()
                                  rcu_read_lock()
                                  page_is_fake_head() is false
  vmemmap_remap_pte()
  XXX: struct page[] becomes r/o

  page_ref_unfreeze()
                                  page_ref_count() is not zero

                                  atomic_add_unless(&page->_refcount)
                                  XXX: try to modify r/o struct page[]

Specifically, page_is_fake_head() must be ordered after
page_ref_count() on CPU 2 so that it can only return true for this
case, to avoid the later attempt to modify r/o struct page[].

This patch adds the missing memory barrier and performs the tests on
page_is_fake_head() and page_ref_count() in the proper order.

Fixes: bd225530a4c7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
Reported-by: Will Deacon <will@kernel.org>
Closes: https://lore.kernel.org/20241128142028.GA3506@willie-the-truck/
Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 include/linux/page-flags.h | 37 +++++++++++++++++++++++++++++++++++++
 include/linux/page_ref.h   |  2 +-
 2 files changed, 38 insertions(+), 1 deletion(-)
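As an aside on the 32KB and 4KB figures above: they follow from 4KB base pages and a 64-byte struct page, both typical on x86_64 but configuration dependent (neither value is stated in the patch). A quick back-of-the-envelope check:

/* Back-of-the-envelope check of the HVO savings described above.
 * Assumes 4KB base pages and sizeof(struct page) == 64; both are
 * typical on x86_64 but not guaranteed by the patch itself. */
#include <stdio.h>

int main(void)
{
        const long base_page   = 4096;               /* base page size in bytes     */
        const long struct_page = 64;                 /* assumed sizeof(struct page) */
        const long huge_page   = 2L * 1024 * 1024;   /* 2MB hugeTLB page            */

        long nr_struct_pages = huge_page / base_page;         /* 512 entries        */
        long vmemmap_bytes   = nr_struct_pages * struct_page; /* 32KB               */
        long vmemmap_pages   = vmemmap_bytes / base_page;     /* 8 pages, 8 PTEs    */
        long after_hvo       = 1 * base_page;                 /* only PTE 0 remains */

        printf("struct page[]: %ld entries, %ldKB across %ld vmemmap pages\n",
               nr_struct_pages, vmemmap_bytes / 1024, vmemmap_pages);
        printf("after HVO: %ldKB kept, %ld pages (PTE 1-7) freed\n",
               after_hvo / 1024, vmemmap_pages - 1);
        return 0;
}

This reproduces the 512 struct pages, the 8 vmemmap PTEs, and the 32KB-to-4KB reduction referred to in steps 1-3.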