Message ID | 1603968305-8026-9-git-send-email-alex.shi@linux.alibaba.com (mailing list archive)
---|---
State | New, archived
Series | per memcg lru lock
On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote: > From: Hugh Dickins <hughd@google.com> > > It is necessary for page_idle_get_page() to recheck PageLRU() after > get_page_unless_zero(), but holding lru_lock around that serves no > useful purpose, and adds to lru_lock contention: delete it. > > See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the > discussion that led to lru_lock there; but __page_set_anon_rmap() now > uses WRITE_ONCE(), That doesn't seem to be the case in Linus's or Andrew's tree. Am I missing a dependent patch series? > and I see no other risk in page_idle_clear_pte_refs() using > rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but > not entirely prevented by page_count() check in ksm.c's > write_protect_page(): that risk being shared with page_referenced() > and not helped by lru_lock). Isn't it possible, as per Minchan's description, for page->mapping to point to a struct anon_vma without PAGE_MAPPING_ANON set, and rmap thinking it's looking at a struct address_space?
On Mon, Nov 02, 2020 at 09:41:10AM -0500, Johannes Weiner wrote: > On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote: > > From: Hugh Dickins <hughd@google.com> > > > > It is necessary for page_idle_get_page() to recheck PageLRU() after > > get_page_unless_zero(), but holding lru_lock around that serves no > > useful purpose, and adds to lru_lock contention: delete it. > > > > See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the > > discussion that led to lru_lock there; but __page_set_anon_rmap() now > > uses WRITE_ONCE(), > > That doesn't seem to be the case in Linus's or Andrew's tree. Am I > missing a dependent patch series? > > > and I see no other risk in page_idle_clear_pte_refs() using > > rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but > > not entirely prevented by page_count() check in ksm.c's > > write_protect_page(): that risk being shared with page_referenced() > > and not helped by lru_lock). > > Isn't it possible, as per Minchan's description, for page->mapping to > point to a struct anon_vma without PAGE_MAPPING_ANON set, and rmap > thinking it's looking at a struct address_space? I don't think it can point to an anon_vma without the ANON bit set. Minchan's concern in that email was that it might still be NULL.
On Mon, Nov 02, 2020 at 02:49:27PM +0000, Matthew Wilcox wrote: > On Mon, Nov 02, 2020 at 09:41:10AM -0500, Johannes Weiner wrote: > > On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote: > > > From: Hugh Dickins <hughd@google.com> > > > > > > It is necessary for page_idle_get_page() to recheck PageLRU() after > > > get_page_unless_zero(), but holding lru_lock around that serves no > > > useful purpose, and adds to lru_lock contention: delete it. > > > > > > See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the > > > discussion that led to lru_lock there; but __page_set_anon_rmap() now > > > uses WRITE_ONCE(), > > > > That doesn't seem to be the case in Linus's or Andrew's tree. Am I > > missing a dependent patch series? > > > > > and I see no other risk in page_idle_clear_pte_refs() using > > > rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but > > > not entirely prevented by page_count() check in ksm.c's > > > write_protect_page(): that risk being shared with page_referenced() > > > and not helped by lru_lock). > > > > Isn't it possible, as per Minchan's description, for page->mapping to > > point to a struct anon_vma without PAGE_MAPPING_ANON set, and rmap > > thinking it's looking at a struct address_space? > > I don't think it can point to an anon_vma without the ANON bit set. > Minchan's concern in that email was that it might still be NULL. Hm, no, the thread is a lengthy discussion about whether the store could be split such that page->mapping is actually pointing to something invalid (anon_vma without the PageAnon bit). From his email: CPU 0 CPU 1 do_anonymous_page __page_set_anon_rmap /* out of order happened so SetPageLRU is done ahead */ SetPageLRU(page) /* Compilr changed store operation like below */ page->mapping = (struct address_space *) anon_vma; /* Big stall happens */ /* idletacking judged it as LRU page so pass the page in page_reference */ page_refernced page_rmapping return true because page->mapping has some vaule but not complete so it calls rmap_walk_file. it's okay to pass non-completed anon page in rmap_walk_file?
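For context on why a half-written page->mapping is dangerous here: rmap_walk() chooses its walker purely from the low type bits of that pointer, so an anon_vma address without PAGE_MAPPING_ANON set is indistinguishable from a struct address_space pointer. A rough sketch of the relevant checks (paraphrased from include/linux/page-flags.h and mm/rmap.c of this era; exact code may differ between kernel versions):

	/* Low bits of page->mapping encode what the pointer actually is. */
	#define PAGE_MAPPING_ANON	0x1
	#define PAGE_MAPPING_MOVABLE	0x2
	#define PAGE_MAPPING_KSM	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
	#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)

	static __always_inline int PageAnon(struct page *page)
	{
		page = compound_head(page);
		return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
	}

	void rmap_walk(struct page *page, struct rmap_walk_control *rwc)
	{
		if (unlikely(PageKsm(page)))
			rmap_walk_ksm(page, rwc);
		else if (PageAnon(page))
			rmap_walk_anon(page, rwc, false);
		else
			/* an anon_vma pointer missing its type bit ends up here */
			rmap_walk_file(page, rwc);
	}

If a reader observes the bare anon_vma pointer before the PAGE_MAPPING_ANON bit, PageAnon() returns false and rmap_walk_file() dereferences the anon_vma as if it were an address_space.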
在 2020/11/3 上午4:20, Johannes Weiner 写道: > On Mon, Nov 02, 2020 at 02:49:27PM +0000, Matthew Wilcox wrote: >> On Mon, Nov 02, 2020 at 09:41:10AM -0500, Johannes Weiner wrote: >>> On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote: >>>> From: Hugh Dickins <hughd@google.com> >>>> >>>> It is necessary for page_idle_get_page() to recheck PageLRU() after >>>> get_page_unless_zero(), but holding lru_lock around that serves no >>>> useful purpose, and adds to lru_lock contention: delete it. >>>> >>>> See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the >>>> discussion that led to lru_lock there; but __page_set_anon_rmap() now >>>> uses WRITE_ONCE(), >>> >>> That doesn't seem to be the case in Linus's or Andrew's tree. Am I >>> missing a dependent patch series? >>> >>>> and I see no other risk in page_idle_clear_pte_refs() using >>>> rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but >>>> not entirely prevented by page_count() check in ksm.c's >>>> write_protect_page(): that risk being shared with page_referenced() >>>> and not helped by lru_lock). >>> >>> Isn't it possible, as per Minchan's description, for page->mapping to >>> point to a struct anon_vma without PAGE_MAPPING_ANON set, and rmap >>> thinking it's looking at a struct address_space? >> >> I don't think it can point to an anon_vma without the ANON bit set. >> Minchan's concern in that email was that it might still be NULL. > > Hm, no, the thread is a lengthy discussion about whether the store > could be split such that page->mapping is actually pointing to > something invalid (anon_vma without the PageAnon bit). > > From his email: > > CPU 0 CPU 1 > > do_anonymous_page > __page_set_anon_rmap > /* out of order happened so SetPageLRU is done ahead */ > SetPageLRU(page) This SetPageLRU done in __pagevec_lru_add_fn() which under the lru_lock protection, so the original memory barrier or lock concern isn't correct. that means, the SetPageLRU isn't possible to be here. And then no warry on right side 'CPU 1' problem. > /* Compilr changed store operation like below */ > page->mapping = (struct address_space *) anon_vma; > /* Big stall happens */ > /* idletacking judged it as LRU page so pass the page > in page_reference */ > page_refernced page_referenced should be page_idle_clear_pte_refs_one here? > page_rmapping return true because > page->mapping has some vaule but not complete > so it calls rmap_walk_file. > it's okay to pass non-completed anon page in rmap_walk_file? > For this patch, According to comments of page_idle_get_page() PageLRU just used to judge if the page is in user using. for this purpose, a unguard PageLRU check is good enough. So this patch should be fine. Thanks Alex
On Wed, Nov 04, 2020 at 07:27:21PM +0800, Alex Shi wrote: > 在 2020/11/3 上午4:20, Johannes Weiner 写道: > > On Mon, Nov 02, 2020 at 02:49:27PM +0000, Matthew Wilcox wrote: > >> On Mon, Nov 02, 2020 at 09:41:10AM -0500, Johannes Weiner wrote: > >>> On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote: > >>>> From: Hugh Dickins <hughd@google.com> > >>>> > >>>> It is necessary for page_idle_get_page() to recheck PageLRU() after > >>>> get_page_unless_zero(), but holding lru_lock around that serves no > >>>> useful purpose, and adds to lru_lock contention: delete it. > >>>> > >>>> See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the > >>>> discussion that led to lru_lock there; but __page_set_anon_rmap() now > >>>> uses WRITE_ONCE(), > >>> > >>> That doesn't seem to be the case in Linus's or Andrew's tree. Am I > >>> missing a dependent patch series? > >>> > >>>> and I see no other risk in page_idle_clear_pte_refs() using > >>>> rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but > >>>> not entirely prevented by page_count() check in ksm.c's > >>>> write_protect_page(): that risk being shared with page_referenced() > >>>> and not helped by lru_lock). > >>> > >>> Isn't it possible, as per Minchan's description, for page->mapping to > >>> point to a struct anon_vma without PAGE_MAPPING_ANON set, and rmap > >>> thinking it's looking at a struct address_space? > >> > >> I don't think it can point to an anon_vma without the ANON bit set. > >> Minchan's concern in that email was that it might still be NULL. > > > > Hm, no, the thread is a lengthy discussion about whether the store > > could be split such that page->mapping is actually pointing to > > something invalid (anon_vma without the PageAnon bit). > > > > From his email: > > > > CPU 0 CPU 1 > > > > do_anonymous_page > > __page_set_anon_rmap > > /* out of order happened so SetPageLRU is done ahead */ > > SetPageLRU(page) > > This SetPageLRU done in __pagevec_lru_add_fn() which under the lru_lock > protection, so the original memory barrier or lock concern isn't > correct. that means, the SetPageLRU isn't possible to be here. > And then no warry on right side 'CPU 1' problem. The SetPageLRU is done under lru_lock, but the store to page->mapping is not, so what ensures ordering between them? And what prevents the compiler from tearing the store to page->mapping? The writer does this: CPU 0 page_add_new_anon_rmap() page->mapping = anon_vma + PAGE_MAPPING_ANON lru_cache_add_inactive_or_unevictable() spin_lock(lruvec->lock) SetPageLRU() spin_unlock(lruvec->lock) The concern is what CPU 1 will observe at page->mapping after observing PageLRU set on the page. 1. anon_vma + PAGE_MAPPING_ANON That's the in-order scenario and is fine. 2. NULL That's possible if the page->mapping store gets reordered to occur after SetPageLRU. That's fine too because we check for it. 3. anon_vma without the PAGE_MAPPING_ANON bit That would be a problem and could lead to all kinds of undesirable behavior including crashes and data corruption. Is it possible? AFAICT the compiler is allowed to tear the store to page->mapping and I don't see anything that would prevent it. That said, I also don't see how the reader testing PageLRU under the lru_lock would prevent that in the first place. AFAICT we need that WRITE_ONCE() around the page->mapping assignment that's referenced in the changelog of this patch but I cannot find in any tree or patch.
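To make the tearing concern in scenario 3 concrete, here is a hypothetical, compiler-dependent illustration (not taken from any real compiler output) of how the plain assignment in __page_set_anon_rmap() could legally be emitted, and what the missing WRITE_ONCE() is meant to rule out:

	/* Source: one logical store */
	page->mapping = (struct address_space *)((void *)anon_vma + PAGE_MAPPING_ANON);

	/*
	 * A compiler may split ("tear") a plain store into smaller pieces,
	 * e.g. store the pointer first and fold the type bit in afterwards:
	 */
	page->mapping = (struct address_space *)anon_vma;	/* window: anon_vma, no ANON bit */
	*(unsigned long *)&page->mapping |= PAGE_MAPPING_ANON;	/* final: complete value */

	/*
	 * WRITE_ONCE() performs a single volatile, word-sized store, which
	 * compilers are expected not to tear for aligned accesses, so a
	 * racing reader sees either the old value or the complete new one:
	 */
	WRITE_ONCE(page->mapping,
		   (struct address_space *)((void *)anon_vma + PAGE_MAPPING_ANON));

It is that intermediate "anon_vma without the bit" window that page_idle_clear_pte_refs() could sample.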
在 2020/11/5 上午1:46, Johannes Weiner 写道: > On Wed, Nov 04, 2020 at 07:27:21PM +0800, Alex Shi wrote: >> 在 2020/11/3 上午4:20, Johannes Weiner 写道: >>> On Mon, Nov 02, 2020 at 02:49:27PM +0000, Matthew Wilcox wrote: >>>> On Mon, Nov 02, 2020 at 09:41:10AM -0500, Johannes Weiner wrote: >>>>> On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote: >>>>>> From: Hugh Dickins <hughd@google.com> >>>>>> >>>>>> It is necessary for page_idle_get_page() to recheck PageLRU() after >>>>>> get_page_unless_zero(), but holding lru_lock around that serves no >>>>>> useful purpose, and adds to lru_lock contention: delete it. >>>>>> >>>>>> See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the >>>>>> discussion that led to lru_lock there; but __page_set_anon_rmap() now >>>>>> uses WRITE_ONCE(), >>>>> >>>>> That doesn't seem to be the case in Linus's or Andrew's tree. Am I >>>>> missing a dependent patch series? >>>>> >>>>>> and I see no other risk in page_idle_clear_pte_refs() using >>>>>> rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but >>>>>> not entirely prevented by page_count() check in ksm.c's >>>>>> write_protect_page(): that risk being shared with page_referenced() >>>>>> and not helped by lru_lock). >>>>> >>>>> Isn't it possible, as per Minchan's description, for page->mapping to >>>>> point to a struct anon_vma without PAGE_MAPPING_ANON set, and rmap >>>>> thinking it's looking at a struct address_space? >>>> >>>> I don't think it can point to an anon_vma without the ANON bit set. >>>> Minchan's concern in that email was that it might still be NULL. >>> >>> Hm, no, the thread is a lengthy discussion about whether the store >>> could be split such that page->mapping is actually pointing to >>> something invalid (anon_vma without the PageAnon bit). >>> >>> From his email: >>> >>> CPU 0 CPU 1 >>> >>> do_anonymous_page >>> __page_set_anon_rmap >>> /* out of order happened so SetPageLRU is done ahead */ >>> SetPageLRU(page) >> >> This SetPageLRU done in __pagevec_lru_add_fn() which under the lru_lock >> protection, so the original memory barrier or lock concern isn't >> correct. that means, the SetPageLRU isn't possible to be here. >> And then no warry on right side 'CPU 1' problem. > > The SetPageLRU is done under lru_lock, but the store to page->mapping > is not, so what ensures ordering between them? And what prevents the > compiler from tearing the store to page->mapping? > Right, I misunderstand the spin_lock in memory barrier. Thanks a lot for point out this. So, is this patch fine to fix the problem? From 8427121da195a6a386d1ce71abcb41b31211e21f Mon Sep 17 00:00:00 2001 From: Alex Shi <alex.shi@linux.alibaba.com> Date: Thu, 5 Nov 2020 11:38:24 +0800 Subject: [PATCH] mm/rmap: stop store reordering issue on page->mapping Hugh Dickins and Minchan Kim observed a long time issue which discussed here, but actully the mentioned fix missed. 
https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop/ The store reordering may cause problem in the scenario: CPU 0 CPU1 do_anonymous_page page_add_new_anon_rmap() page->mapping = anon_vma + PAGE_MAPPING_ANON lru_cache_add_inactive_or_unevictable() spin_lock(lruvec->lock) SetPageLRU() spin_unlock(lruvec->lock) /* idletacking judged it as LRU * page so pass the page in * page_idle_clear_pte_refs */ page_idle_clear_pte_refs rmap_walk if PageAnon(page) Johannes give detailed examples how the store reordering could cause a trouble: The concern is the SetPageLRU may get reorder before 'page->mapping' setting, That would make CPU 1 will observe at page->mapping after observing PageLRU set on the page. 1. anon_vma + PAGE_MAPPING_ANON That's the in-order scenario and is fine. 2. NULL That's possible if the page->mapping store gets reordered to occur after SetPageLRU. That's fine too because we check for it. 3. anon_vma without the PAGE_MAPPING_ANON bit That would be a problem and could lead to all kinds of undesirable behavior including crashes and data corruption. Is it possible? AFAICT the compiler is allowed to tear the store to page->mapping and I don't see anything that would prevent it. That said, I also don't see how the reader testing PageLRU under the lru_lock would prevent that in the first place. AFAICT we need that WRITE_ONCE() around the page->mapping assignment. Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- mm/rmap.c | 21 ++++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-) diff --git a/mm/rmap.c b/mm/rmap.c index c050dab2ae65..56af18aa57de 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1054,8 +1054,27 @@ static void __page_set_anon_rmap(struct page *page, if (!exclusive) anon_vma = anon_vma->root; + /* + * w/o the WRITE_ONCE here the following scenario may happens due to + * store reordering. + * + * CPU 0 CPU 1 + * + * do_anonymous_page page_idle_clear_pte_refs + * __page_set_anon_rmap + * page->mapping = anon_vma + PAGE_MAPPING_ANON + * lru_cache_add_inactive_or_unevictable() + * SetPageLRU(page) + * rmap_walk + * if PageAnon(page) + * + * The 'SetPageLRU' may reordered before page->mapping setting, and + * page->mapping may set with anon_vma, w/o anon bit, then rmap_walk + * may goes to rmap_walk_file() for a anon page. + */ + anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; - page->mapping = (struct address_space *) anon_vma; + WRITE_ONCE(page->mapping, (struct address_space *) anon_vma); page->index = linear_page_index(vma, address); }
On Thu, Nov 05, 2020 at 12:52:05PM +0800, Alex Shi wrote: > @@ -1054,8 +1054,27 @@ static void __page_set_anon_rmap(struct page *page, > if (!exclusive) > anon_vma = anon_vma->root; > > + /* > + * w/o the WRITE_ONCE here the following scenario may happens due to > + * store reordering. > + * > + * CPU 0 CPU 1 > + * > + * do_anonymous_page page_idle_clear_pte_refs > + * __page_set_anon_rmap > + * page->mapping = anon_vma + PAGE_MAPPING_ANON > + * lru_cache_add_inactive_or_unevictable() > + * SetPageLRU(page) > + * rmap_walk > + * if PageAnon(page) > + * > + * The 'SetPageLRU' may reordered before page->mapping setting, and > + * page->mapping may set with anon_vma, w/o anon bit, then rmap_walk > + * may goes to rmap_walk_file() for a anon page. > + */ > + > anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; > - page->mapping = (struct address_space *) anon_vma; > + WRITE_ONCE(page->mapping, (struct address_space *) anon_vma); > page->index = linear_page_index(vma, address); > } I don't like these verbose comments with detailed descriptions in the source code. They're fine in changelogs, but they clutter the code, and they get outdated really quickly. My preference is for something more brief: /* * Prevent page->mapping from pointing to an anon_vma without * the PAGE_MAPPING_ANON bit set. This could happen if the * compiler stores anon_vma and then adds PAGE_MAPPING_ANON to it. */
在 2020/11/5 下午12:57, Matthew Wilcox 写道: > On Thu, Nov 05, 2020 at 12:52:05PM +0800, Alex Shi wrote: >> @@ -1054,8 +1054,27 @@ static void __page_set_anon_rmap(struct page *page, >> if (!exclusive) >> anon_vma = anon_vma->root; >> >> + /* >> + * w/o the WRITE_ONCE here the following scenario may happens due to >> + * store reordering. >> + * >> + * CPU 0 CPU 1 >> + * >> + * do_anonymous_page page_idle_clear_pte_refs >> + * __page_set_anon_rmap >> + * page->mapping = anon_vma + PAGE_MAPPING_ANON >> + * lru_cache_add_inactive_or_unevictable() >> + * SetPageLRU(page) >> + * rmap_walk >> + * if PageAnon(page) >> + * >> + * The 'SetPageLRU' may reordered before page->mapping setting, and >> + * page->mapping may set with anon_vma, w/o anon bit, then rmap_walk >> + * may goes to rmap_walk_file() for a anon page. >> + */ >> + >> anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; >> - page->mapping = (struct address_space *) anon_vma; >> + WRITE_ONCE(page->mapping, (struct address_space *) anon_vma); >> page->index = linear_page_index(vma, address); >> } > > I don't like these verbose comments with detailed descriptions in > the source code. They're fine in changelogs, but they clutter the > code, and they get outdated really quickly. My preference is for > something more brief: > > /* > * Prevent page->mapping from pointing to an anon_vma without > * the PAGE_MAPPING_ANON bit set. This could happen if the > * compiler stores anon_vma and then adds PAGE_MAPPING_ANON to it. > */ > Yes, it's reansonble. So is the following fine? From f166f0d5df350c5eae1218456b9e6e1bd43434e7 Mon Sep 17 00:00:00 2001 From: Alex Shi <alex.shi@linux.alibaba.com> Date: Thu, 5 Nov 2020 11:38:24 +0800 Subject: [PATCH] mm/rmap: stop store reordering issue on page->mapping Hugh Dickins and Minchan Kim observed a long time issue which discussed here, but actully the mentioned fix missed. https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop/ The store reordering may cause problem in the scenario: CPU 0 CPU1 do_anonymous_page page_add_new_anon_rmap() page->mapping = anon_vma + PAGE_MAPPING_ANON lru_cache_add_inactive_or_unevictable() spin_lock(lruvec->lock) SetPageLRU() spin_unlock(lruvec->lock) /* idletacking judged it as LRU * page so pass the page in * page_idle_clear_pte_refs */ page_idle_clear_pte_refs rmap_walk if PageAnon(page) Johannes give detailed examples how the store reordering could cause a trouble: The concern is the SetPageLRU may get reorder before 'page->mapping' setting, That would make CPU 1 will observe at page->mapping after observing PageLRU set on the page. 1. anon_vma + PAGE_MAPPING_ANON That's the in-order scenario and is fine. 2. NULL That's possible if the page->mapping store gets reordered to occur after SetPageLRU. That's fine too because we check for it. 3. anon_vma without the PAGE_MAPPING_ANON bit That would be a problem and could lead to all kinds of undesirable behavior including crashes and data corruption. Is it possible? AFAICT the compiler is allowed to tear the store to page->mapping and I don't see anything that would prevent it. That said, I also don't see how the reader testing PageLRU under the lru_lock would prevent that in the first place. AFAICT we need that WRITE_ONCE() around the page->mapping assignment. 
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- mm/rmap.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/mm/rmap.c b/mm/rmap.c index c050dab2ae65..73788505aa0a 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1054,8 +1054,13 @@ static void __page_set_anon_rmap(struct page *page, if (!exclusive) anon_vma = anon_vma->root; + /* + * Prevent page->mapping from pointing to an anon_vma without + * the PAGE_MAPPING_ANON bit set. This could happen if the + * compiler stores anon_vma and then adds PAGE_MAPPING_ANON to it. + */ anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; - page->mapping = (struct address_space *) anon_vma; + WRITE_ONCE(page->mapping, (struct address_space *) anon_vma); page->index = linear_page_index(vma, address); }
On Thu, Nov 05, 2020 at 01:03:18PM +0800, Alex Shi wrote: > > > 在 2020/11/5 下午12:57, Matthew Wilcox 写道: > > On Thu, Nov 05, 2020 at 12:52:05PM +0800, Alex Shi wrote: > >> @@ -1054,8 +1054,27 @@ static void __page_set_anon_rmap(struct page *page, > >> if (!exclusive) > >> anon_vma = anon_vma->root; > >> > >> + /* > >> + * w/o the WRITE_ONCE here the following scenario may happens due to > >> + * store reordering. > >> + * > >> + * CPU 0 CPU 1 > >> + * > >> + * do_anonymous_page page_idle_clear_pte_refs > >> + * __page_set_anon_rmap > >> + * page->mapping = anon_vma + PAGE_MAPPING_ANON > >> + * lru_cache_add_inactive_or_unevictable() > >> + * SetPageLRU(page) > >> + * rmap_walk > >> + * if PageAnon(page) > >> + * > >> + * The 'SetPageLRU' may reordered before page->mapping setting, and > >> + * page->mapping may set with anon_vma, w/o anon bit, then rmap_walk > >> + * may goes to rmap_walk_file() for a anon page. > >> + */ > >> + > >> anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; > >> - page->mapping = (struct address_space *) anon_vma; > >> + WRITE_ONCE(page->mapping, (struct address_space *) anon_vma); > >> page->index = linear_page_index(vma, address); > >> } > > > > I don't like these verbose comments with detailed descriptions in > > the source code. They're fine in changelogs, but they clutter the > > code, and they get outdated really quickly. My preference is for > > something more brief: > > > > /* > > * Prevent page->mapping from pointing to an anon_vma without > > * the PAGE_MAPPING_ANON bit set. This could happen if the > > * compiler stores anon_vma and then adds PAGE_MAPPING_ANON to it. > > */ > > Yeah, I don't think this scenario warrants the full race diagram in the code itself. But the code is highly specific - synchronizing one struct page member for one particular use case. Let's keep at least a reference to what we are synchronizing against. There is a non-zero chance that if the comment goes out of date, so does the code. How about this? /* * page_idle does a lockless/optimistic rmap scan on page->mapping. * Make sure the compiler doesn't split the stores of anon_vma and * the PAGE_MAPPING_ANON type identifier, otherwise the rmap code * could mistake the mapping for a struct address_space and crash. */
On Thu, Nov 05, 2020 at 10:36:49AM -0500, Johannes Weiner wrote: > But the code is highly specific - synchronizing one struct page member > for one particular use case. Let's keep at least a reference to what > we are synchronizing against. There is a non-zero chance that if the > comment goes out of date, so does the code. How about this? > > /* > * page_idle does a lockless/optimistic rmap scan on page->mapping. > * Make sure the compiler doesn't split the stores of anon_vma and > * the PAGE_MAPPING_ANON type identifier, otherwise the rmap code > * could mistake the mapping for a struct address_space and crash. > */ Fine by me! There may be other cases where seeing a split store would be bad, so I didn't want to call out page_idle explicitly. But if you want to, I'm happy with this comment.
On 2020/11/5 11:36 PM, Johannes Weiner wrote: >>> */ >>> > Yeah, I don't think this scenario warrants the full race diagram in the code itself. > > But the code is highly specific - synchronizing one struct page member > for one particular use case. Let's keep at least a reference to what > we are synchronizing against. There is a non-zero chance that if the > comment goes out of date, so does the code. How about this? > > /* > * page_idle does a lockless/optimistic rmap scan on page->mapping. > * Make sure the compiler doesn't split the stores of anon_vma and > * the PAGE_MAPPING_ANON type identifier, otherwise the rmap code > * could mistake the mapping for a struct address_space and crash. > */ Thanks a lot to you all. I will fold this comment into the v21 patch. Alex
On Mon, 2 Nov 2020, Johannes Weiner wrote: > On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote: > > From: Hugh Dickins <hughd@google.com> > > > > It is necessary for page_idle_get_page() to recheck PageLRU() after > > get_page_unless_zero(), but holding lru_lock around that serves no > > useful purpose, and adds to lru_lock contention: delete it. > > > > See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the > > discussion that led to lru_lock there; but __page_set_anon_rmap() now > > uses WRITE_ONCE(), > > That doesn't seem to be the case in Linus's or Andrew's tree. Am I > missing a dependent patch series? Sorry, I was out of action, then slower than ever, for a while. Many thanks for calling out my falsehood there, Johannes. What led me to write that? It has baffled me, but at last I see: this patch to page_idle_get_page() was 0002 in my lru_lock patchset against v5.3 last year, and 0001 was the patch which made it true. Then when I checked against mainline, I must have got confused by the similar WRITE_ONCE in page_move_anon_rmap(). Appended below, but not rediffed, and let's not hold up Alex's set for the rest of it: it is all theoretical until the kernel gets to be built with a suitably malicious compiler; but I'll follow up with a fresh version of the below after his set is safely in. From a1abcbc2aac70c6ba47b8991992bb85b86b4a160 Mon Sep 17 00:00:00 2001 From: Hugh Dickins <hughd@google.com> Date: Thu, 22 Aug 2019 15:49:44 -0700 Subject: [PATCH 1/9] mm: more WRITE_ONCE and READ_ONCE on page->mapping v4.2 commit 414e2fb8ce5a ("rmap: fix theoretical race between do_wp_page and shrink_active_list") added a WRITE_ONCE() where page_move_anon_rmap() composes page->mapping from anon_vma pointer and PAGE_MAPPING_ANON. Now do the same where __page_set_anon_rmap() does the same, and where compaction.c applies PAGE_MAPPING_MOVABLE, and ksm.c PAGE_MAPPING_KSM. rmap.c already uses READ_ONCE(page->mapping), but util.c should too: add READ_ONCE() in page_rmapping(), page_anon_vma() and page_mapping(). Delete the then unused helper __page_rmapping(). I doubt that this commit fixes anything, but it's harmless and unintrusive, and makes reasoning about page mapping flags easier. What if a compiler implements "page->mapping = mapping" in other places by, say, first assigning the odd bits of mapping, then adding in the even bits? Then we shall not build the kernel with such a compiler. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> --- mm/compaction.c | 7 ++++--- mm/ksm.c | 2 +- mm/rmap.c | 7 ++++++- mm/util.c | 24 ++++++++++-------------- 4 files changed, 21 insertions(+), 19 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index 952dc2fb24e5..c405f4362624 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -113,7 +113,8 @@ void __SetPageMovable(struct page *page, struct address_space *mapping) { VM_BUG_ON_PAGE(!PageLocked(page), page); VM_BUG_ON_PAGE((unsigned long)mapping & PAGE_MAPPING_MOVABLE, page); - page->mapping = (void *)((unsigned long)mapping | PAGE_MAPPING_MOVABLE); + WRITE_ONCE(page->mapping, + (unsigned long)mapping | PAGE_MAPPING_MOVABLE); } EXPORT_SYMBOL(__SetPageMovable); @@ -126,8 +127,8 @@ void __ClearPageMovable(struct page *page) * flag so that VM can catch up released page by driver after isolation. * With it, VM migration doesn't try to put it back. 
*/ - page->mapping = (void *)((unsigned long)page->mapping & - PAGE_MAPPING_MOVABLE); + WRITE_ONCE(page->mapping, + (unsigned long)page->mapping & PAGE_MAPPING_MOVABLE); } EXPORT_SYMBOL(__ClearPageMovable); diff --git a/mm/ksm.c b/mm/ksm.c index 3dc4346411e4..426b6a40ea41 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -865,7 +865,7 @@ static inline struct stable_node *page_stable_node(struct page *page) static inline void set_page_stable_node(struct page *page, struct stable_node *stable_node) { - page->mapping = (void *)((unsigned long)stable_node | PAGE_MAPPING_KSM); + WRITE_ONCE(page->mapping, (unsigned long)stable_node | PAGE_MAPPING_KSM); } #ifdef CONFIG_SYSFS diff --git a/mm/rmap.c b/mm/rmap.c index 003377e24232..9480df437edc 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1044,7 +1044,12 @@ static void __page_set_anon_rmap(struct page *page, anon_vma = anon_vma->root; anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; - page->mapping = (struct address_space *) anon_vma; + /* + * Ensure that anon_vma and the PAGE_MAPPING_ANON bit are written + * simultaneously, so a concurrent reader (eg page_referenced()'s + * PageAnon()) will not see one without the other. + */ + WRITE_ONCE(page->mapping, (struct address_space *) anon_vma); page->index = linear_page_index(vma, address); } diff --git a/mm/util.c b/mm/util.c index e6351a80f248..09b9fcbedac3 100644 --- a/mm/util.c +++ b/mm/util.c @@ -489,21 +489,14 @@ void kvfree(const void *addr) } EXPORT_SYMBOL(kvfree); -static inline void *__page_rmapping(struct page *page) -{ - unsigned long mapping; - - mapping = (unsigned long)page->mapping; - mapping &= ~PAGE_MAPPING_FLAGS; - - return (void *)mapping; -} - /* Neutral page->mapping pointer to address_space or anon_vma or other */ void *page_rmapping(struct page *page) { + unsigned long mapping; + page = compound_head(page); - return __page_rmapping(page); + mapping = (unsigned long)READ_ONCE(page->mapping); + return (void *)(mapping & ~PAGE_MAPPING_FLAGS); } /* @@ -534,10 +527,11 @@ struct anon_vma *page_anon_vma(struct page *page) unsigned long mapping; page = compound_head(page); - mapping = (unsigned long)page->mapping; + mapping = (unsigned long)READ_ONCE(page->mapping); + /* Return NULL if file or PageMovable or PageKsm */ if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON) return NULL; - return __page_rmapping(page); + return (struct anon_vma *)(mapping & ~PAGE_MAPPING_FLAGS); } struct address_space *page_mapping(struct page *page) @@ -557,10 +551,12 @@ struct address_space *page_mapping(struct page *page) return swap_address_space(entry); } - mapping = page->mapping; + mapping = READ_ONCE(page->mapping); + /* Return NULL if PageAnon (including PageKsm) */ if ((unsigned long)mapping & PAGE_MAPPING_ANON) return NULL; + /* Return struct address_space pointer if file or PageMovable */ return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS); } EXPORT_SYMBOL(page_mapping);
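The reader side of this patch pairs with the writer side: a plain load of page->mapping can likewise be torn or refetched by the compiler, hence the READ_ONCE() added to page_rmapping(), page_anon_vma() and page_mapping() above. A minimal sketch of the paired pattern, with do_anon_walk() as a hypothetical stand-in for the real rmap walkers:

	/* writer (__page_set_anon_rmap) */
	WRITE_ONCE(page->mapping,
		   (struct address_space *)((void *)anon_vma + PAGE_MAPPING_ANON));

	/* reader (rmap / page_idle side) */
	unsigned long mapping = (unsigned long)READ_ONCE(page->mapping);

	if ((mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_ANON)
		/* plain anon page; KSM pages carry both type bits */
		do_anon_walk((struct anon_vma *)(mapping & ~PAGE_MAPPING_FLAGS));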
diff --git a/mm/page_idle.c b/mm/page_idle.c index 057c61df12db..64e5344a992c 100644 --- a/mm/page_idle.c +++ b/mm/page_idle.c @@ -32,19 +32,15 @@ static struct page *page_idle_get_page(unsigned long pfn) { struct page *page = pfn_to_online_page(pfn); - pg_data_t *pgdat; if (!page || !PageLRU(page) || !get_page_unless_zero(page)) return NULL; - pgdat = page_pgdat(page); - spin_lock_irq(&pgdat->lru_lock); if (unlikely(!PageLRU(page))) { put_page(page); page = NULL; } - spin_unlock_irq(&pgdat->lru_lock); return page; }
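Applying the hunk above, the function reduces to roughly the following (reconstructed from the diff context; the second PageLRU() test is the recheck that the changelog says must be kept):

	static struct page *page_idle_get_page(unsigned long pfn)
	{
		struct page *page = pfn_to_online_page(pfn);

		if (!page || !PageLRU(page) || !get_page_unless_zero(page))
			return NULL;

		/* Recheck PageLRU() now that we hold a reference; no lru_lock. */
		if (unlikely(!PageLRU(page))) {
			put_page(page);
			page = NULL;
		}
		return page;
	}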