
[v20,08/20] mm: page_idle_get_page() does not need lru_lock

Message ID 1603968305-8026-9-git-send-email-alex.shi@linux.alibaba.com (mailing list archive)
State New, archived
Series per memcg lru lock

Commit Message

Alex Shi Oct. 29, 2020, 10:44 a.m. UTC
From: Hugh Dickins <hughd@google.com>

It is necessary for page_idle_get_page() to recheck PageLRU() after
get_page_unless_zero(), but holding lru_lock around that serves no
useful purpose, and adds to lru_lock contention: delete it.

See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
discussion that led to lru_lock there; but __page_set_anon_rmap() now
uses WRITE_ONCE(), and I see no other risk in page_idle_clear_pte_refs()
using rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly
but not entirely prevented by page_count() check in ksm.c's
write_protect_page(): that risk being shared with page_referenced() and
not helped by lru_lock).

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/page_idle.c | 4 ----
 1 file changed, 4 deletions(-)

Comments

Johannes Weiner Nov. 2, 2020, 2:41 p.m. UTC | #1
On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote:
> From: Hugh Dickins <hughd@google.com>
> 
> It is necessary for page_idle_get_page() to recheck PageLRU() after
> get_page_unless_zero(), but holding lru_lock around that serves no
> useful purpose, and adds to lru_lock contention: delete it.
> 
> See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
> discussion that led to lru_lock there; but __page_set_anon_rmap() now
> uses WRITE_ONCE(),

That doesn't seem to be the case in Linus's or Andrew's tree. Am I
missing a dependent patch series?

> and I see no other risk in page_idle_clear_pte_refs() using
> rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but
> not entirely prevented by page_count() check in ksm.c's
> write_protect_page(): that risk being shared with page_referenced()
> and not helped by lru_lock).

Isn't it possible, as per Minchan's description, for page->mapping to
point to a struct anon_vma without PAGE_MAPPING_ANON set, and rmap
thinking it's looking at a struct address_space?
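
For context (a sketch for reference, not taken from the mails above or from any particular tree): whether rmap treats page->mapping as a struct address_space or as an anon_vma is decided purely by the low type bits of the pointer. Roughly as these definitions read in kernels of this period (exact header and helper spelling may differ):

	#define PAGE_MAPPING_ANON	0x1
	#define PAGE_MAPPING_MOVABLE	0x2
	#define PAGE_MAPPING_KSM	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
	#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)

	/* PageAnon() looks only at the low bit of page->mapping */
	static __always_inline int PageAnon(struct page *page)
	{
		page = compound_head(page);
		return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
	}

So a reader that observes an anon_vma pointer in page->mapping with that low bit clear gets PageAnon() == false, and the pointer is then handed to code that expects a struct address_space.
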
Matthew Wilcox Nov. 2, 2020, 2:49 p.m. UTC | #2
On Mon, Nov 02, 2020 at 09:41:10AM -0500, Johannes Weiner wrote:
> On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote:
> > From: Hugh Dickins <hughd@google.com>
> > 
> > It is necessary for page_idle_get_page() to recheck PageLRU() after
> > get_page_unless_zero(), but holding lru_lock around that serves no
> > useful purpose, and adds to lru_lock contention: delete it.
> > 
> > See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
> > discussion that led to lru_lock there; but __page_set_anon_rmap() now
> > uses WRITE_ONCE(),
> 
> That doesn't seem to be the case in Linus's or Andrew's tree. Am I
> missing a dependent patch series?
> 
> > and I see no other risk in page_idle_clear_pte_refs() using
> > rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but
> > not entirely prevented by page_count() check in ksm.c's
> > write_protect_page(): that risk being shared with page_referenced()
> > and not helped by lru_lock).
> 
> Isn't it possible, as per Minchan's description, for page->mapping to
> point to a struct anon_vma without PAGE_MAPPING_ANON set, and rmap
> thinking it's looking at a struct address_space?

I don't think it can point to an anon_vma without the ANON bit set.
Minchan's concern in that email was that it might still be NULL.
Johannes Weiner Nov. 2, 2020, 8:20 p.m. UTC | #3
On Mon, Nov 02, 2020 at 02:49:27PM +0000, Matthew Wilcox wrote:
> On Mon, Nov 02, 2020 at 09:41:10AM -0500, Johannes Weiner wrote:
> > On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote:
> > > From: Hugh Dickins <hughd@google.com>
> > > 
> > > It is necessary for page_idle_get_page() to recheck PageLRU() after
> > > get_page_unless_zero(), but holding lru_lock around that serves no
> > > useful purpose, and adds to lru_lock contention: delete it.
> > > 
> > > See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
> > > discussion that led to lru_lock there; but __page_set_anon_rmap() now
> > > uses WRITE_ONCE(),
> > 
> > That doesn't seem to be the case in Linus's or Andrew's tree. Am I
> > missing a dependent patch series?
> > 
> > > and I see no other risk in page_idle_clear_pte_refs() using
> > > rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but
> > > not entirely prevented by page_count() check in ksm.c's
> > > write_protect_page(): that risk being shared with page_referenced()
> > > and not helped by lru_lock).
> > 
> > Isn't it possible, as per Minchan's description, for page->mapping to
> > point to a struct anon_vma without PAGE_MAPPING_ANON set, and rmap
> > thinking it's looking at a struct address_space?
> 
> I don't think it can point to an anon_vma without the ANON bit set.
> Minchan's concern in that email was that it might still be NULL.

Hm, no, the thread is a lengthy discussion about whether the store
could be split such that page->mapping is actually pointing to
something invalid (anon_vma without the PageAnon bit).

From his email:

        CPU 0                                                                           CPU 1

do_anonymous_page
  __page_set_anon_rmap
  /* out of order happened so SetPageLRU is done ahead */
  SetPageLRU(page)
  /* Compiler changed the store operation like below */
  page->mapping = (struct address_space *) anon_vma;
  /* Big stall happens */
                                                                /* idle tracking judged it as an LRU page so pass the page
                                                                   in page_referenced */
                                                                page_referenced
                                                                        page_rmapping returns true because
                                                                        page->mapping has some value but is not complete,
                                                                        so it calls rmap_walk_file.
                                                                        is it okay to pass a non-completed anon page to rmap_walk_file?
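
For reference (a simplified sketch of mm/rmap.c from around this time, not part of the mail above; locking and corner cases elided): the dispatch that makes the scenario above dangerous is rmap_walk(), which picks the walker from the page->mapping type bits, so a mapping value carrying an anon_vma pointer without PAGE_MAPPING_ANON visible is walked as a file page:

	void rmap_walk(struct page *page, struct rmap_walk_control *rwc)
	{
		if (unlikely(PageKsm(page)))
			rmap_walk_ksm(page, rwc);
		else if (PageAnon(page))
			rmap_walk_anon(page, rwc, false);
		else
			/* interprets page->mapping as a struct address_space */
			rmap_walk_file(page, rwc);
	}
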
Alex Shi Nov. 4, 2020, 11:27 a.m. UTC | #4
On 2020/11/3 4:20 AM, Johannes Weiner wrote:
> On Mon, Nov 02, 2020 at 02:49:27PM +0000, Matthew Wilcox wrote:
>> On Mon, Nov 02, 2020 at 09:41:10AM -0500, Johannes Weiner wrote:
>>> On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote:
>>>> From: Hugh Dickins <hughd@google.com>
>>>>
>>>> It is necessary for page_idle_get_page() to recheck PageLRU() after
>>>> get_page_unless_zero(), but holding lru_lock around that serves no
>>>> useful purpose, and adds to lru_lock contention: delete it.
>>>>
>>>> See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
>>>> discussion that led to lru_lock there; but __page_set_anon_rmap() now
>>>> uses WRITE_ONCE(),
>>>
>>> That doesn't seem to be the case in Linus's or Andrew's tree. Am I
>>> missing a dependent patch series?
>>>
>>>> and I see no other risk in page_idle_clear_pte_refs() using
>>>> rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but
>>>> not entirely prevented by page_count() check in ksm.c's
>>>> write_protect_page(): that risk being shared with page_referenced()
>>>> and not helped by lru_lock).
>>>
>>> Isn't it possible, as per Minchan's description, for page->mapping to
>>> point to a struct anon_vma without PAGE_MAPPING_ANON set, and rmap
>>> thinking it's looking at a struct address_space?
>>
>> I don't think it can point to an anon_vma without the ANON bit set.
>> Minchan's concern in that email was that it might still be NULL.
> 
> Hm, no, the thread is a lengthy discussion about whether the store
> could be split such that page->mapping is actually pointing to
> something invalid (anon_vma without the PageAnon bit).
> 
> From his email:
> 
>         CPU 0                                                                           CPU 1
> 
> do_anonymous_page
>   __page_set_anon_rmap
>   /* out of order happened so SetPageLRU is done ahead */
>   SetPageLRU(page)

This SetPageLRU is done in __pagevec_lru_add_fn(), which runs under lru_lock
protection, so the original memory barrier or lock concern isn't correct:
that means the SetPageLRU can't be reordered to this point, and there is
then no worry about the 'CPU 1' side of the problem.

>   /* Compiler changed the store operation like below */
>   page->mapping = (struct address_space *) anon_vma;
>   /* Big stall happens */
>                                                                 /* idle tracking judged it as an LRU page so pass the page
>                                                                    in page_referenced */
>                                                                 page_referenced
								  page_referenced should be page_idle_clear_pte_refs_one here?	
>                                                                         page_rmapping returns true because
>                                                                         page->mapping has some value but is not complete,
>                                                                         so it calls rmap_walk_file.
>                                                                         is it okay to pass a non-completed anon page to rmap_walk_file?
> 


As for this patch: according to the comments of page_idle_get_page(),
PageLRU is only used to judge whether the page is a user page still in
use. For that purpose an unguarded PageLRU check is good enough, so this
patch should be fine.

Thanks
Alex
Johannes Weiner Nov. 4, 2020, 5:46 p.m. UTC | #5
On Wed, Nov 04, 2020 at 07:27:21PM +0800, Alex Shi wrote:
> On 2020/11/3 4:20 AM, Johannes Weiner wrote:
> > On Mon, Nov 02, 2020 at 02:49:27PM +0000, Matthew Wilcox wrote:
> >> On Mon, Nov 02, 2020 at 09:41:10AM -0500, Johannes Weiner wrote:
> >>> On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote:
> >>>> From: Hugh Dickins <hughd@google.com>
> >>>>
> >>>> It is necessary for page_idle_get_page() to recheck PageLRU() after
> >>>> get_page_unless_zero(), but holding lru_lock around that serves no
> >>>> useful purpose, and adds to lru_lock contention: delete it.
> >>>>
> >>>> See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
> >>>> discussion that led to lru_lock there; but __page_set_anon_rmap() now
> >>>> uses WRITE_ONCE(),
> >>>
> >>> That doesn't seem to be the case in Linus's or Andrew's tree. Am I
> >>> missing a dependent patch series?
> >>>
> >>>> and I see no other risk in page_idle_clear_pte_refs() using
> >>>> rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but
> >>>> not entirely prevented by page_count() check in ksm.c's
> >>>> write_protect_page(): that risk being shared with page_referenced()
> >>>> and not helped by lru_lock).
> >>>
> >>> Isn't it possible, as per Minchan's description, for page->mapping to
> >>> point to a struct anon_vma without PAGE_MAPPING_ANON set, and rmap
> >>> thinking it's looking at a struct address_space?
> >>
> >> I don't think it can point to an anon_vma without the ANON bit set.
> >> Minchan's concern in that email was that it might still be NULL.
> > 
> > Hm, no, the thread is a lengthy discussion about whether the store
> > could be split such that page->mapping is actually pointing to
> > something invalid (anon_vma without the PageAnon bit).
> > 
> > From his email:
> > 
> >         CPU 0                                                                           CPU 1
> > 
> > do_anonymous_page
> >   __page_set_anon_rmap
> >   /* out of order happened so SetPageLRU is done ahead */
> >   SetPageLRU(page)
> 
> This SetPageLRU is done in __pagevec_lru_add_fn(), which runs under lru_lock
> protection, so the original memory barrier or lock concern isn't correct:
> that means the SetPageLRU can't be reordered to this point, and there is
> then no worry about the 'CPU 1' side of the problem.

The SetPageLRU is done under lru_lock, but the store to page->mapping
is not, so what ensures ordering between them? And what prevents the
compiler from tearing the store to page->mapping?

The writer does this:

	CPU 0
	page_add_new_anon_rmap()
	  page->mapping = anon_vma + PAGE_MAPPING_ANON
	lru_cache_add_inactive_or_unevictable()
	  spin_lock(lruvec->lock)
	  SetPageLRU()
	  spin_unlock(lruvec->lock)

The concern is what CPU 1 will observe at page->mapping after
observing PageLRU set on the page.

1. anon_vma + PAGE_MAPPING_ANON

   That's the in-order scenario and is fine.

2. NULL

   That's possible if the page->mapping store gets reordered to occur
   after SetPageLRU. That's fine too because we check for it.

3. anon_vma without the PAGE_MAPPING_ANON bit

   That would be a problem and could lead to all kinds of undesirable
   behavior including crashes and data corruption.

   Is it possible? AFAICT the compiler is allowed to tear the store to
   page->mapping and I don't see anything that would prevent it.

That said, I also don't see how the reader testing PageLRU under the
lru_lock would prevent that in the first place. AFAICT we need that
WRITE_ONCE() around the page->mapping assignment that's referenced in
the changelog of this patch but I cannot find in any tree or patch.
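
To make the tearing concern concrete (an illustrative sketch only, not a quote of any tree): __page_set_anon_rmap() logically stores "anon_vma | PAGE_MAPPING_ANON" into page->mapping, and without WRITE_ONCE() the compiler may emit that plain assignment as more than one store:

	/* What the function logically stores: */
	page->mapping = (struct address_space *)
			((unsigned long) anon_vma | PAGE_MAPPING_ANON);

	/*
	 * One code generation the compiler is allowed to pick for the plain
	 * store: write the bare pointer first, then add the type bit.  A
	 * reader between the two stores sees case 3 above, an anon_vma
	 * without PAGE_MAPPING_ANON, and rmap takes the file path.
	 */
	page->mapping = (struct address_space *) anon_vma;	/* bit not set yet */
	page->mapping = (struct address_space *)
			((unsigned long) page->mapping | PAGE_MAPPING_ANON);

	/* WRITE_ONCE() forces a single, non-torn store of the final value: */
	WRITE_ONCE(page->mapping, (struct address_space *)
			((unsigned long) anon_vma | PAGE_MAPPING_ANON));
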
Alex Shi Nov. 5, 2020, 4:52 a.m. UTC | #6
On 2020/11/5 1:46 AM, Johannes Weiner wrote:
> On Wed, Nov 04, 2020 at 07:27:21PM +0800, Alex Shi wrote:
>> On 2020/11/3 4:20 AM, Johannes Weiner wrote:
>>> On Mon, Nov 02, 2020 at 02:49:27PM +0000, Matthew Wilcox wrote:
>>>> On Mon, Nov 02, 2020 at 09:41:10AM -0500, Johannes Weiner wrote:
>>>>> On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote:
>>>>>> From: Hugh Dickins <hughd@google.com>
>>>>>>
>>>>>> It is necessary for page_idle_get_page() to recheck PageLRU() after
>>>>>> get_page_unless_zero(), but holding lru_lock around that serves no
>>>>>> useful purpose, and adds to lru_lock contention: delete it.
>>>>>>
>>>>>> See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
>>>>>> discussion that led to lru_lock there; but __page_set_anon_rmap() now
>>>>>> uses WRITE_ONCE(),
>>>>>
>>>>> That doesn't seem to be the case in Linus's or Andrew's tree. Am I
>>>>> missing a dependent patch series?
>>>>>
>>>>>> and I see no other risk in page_idle_clear_pte_refs() using
>>>>>> rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but
>>>>>> not entirely prevented by page_count() check in ksm.c's
>>>>>> write_protect_page(): that risk being shared with page_referenced()
>>>>>> and not helped by lru_lock).
>>>>>
>>>>> Isn't it possible, as per Minchan's description, for page->mapping to
>>>>> point to a struct anon_vma without PAGE_MAPPING_ANON set, and rmap
>>>>> thinking it's looking at a struct address_space?
>>>>
>>>> I don't think it can point to an anon_vma without the ANON bit set.
>>>> Minchan's concern in that email was that it might still be NULL.
>>>
>>> Hm, no, the thread is a lengthy discussion about whether the store
>>> could be split such that page->mapping is actually pointing to
>>> something invalid (anon_vma without the PageAnon bit).
>>>
>>> From his email:
>>>
>>>         CPU 0                                                                           CPU 1
>>>
>>> do_anonymous_page
>>>   __page_set_anon_rmap
>>>   /* out of order happened so SetPageLRU is done ahead */
>>>   SetPageLRU(page)
>>
>> This SetPageLRU is done in __pagevec_lru_add_fn(), which runs under lru_lock
>> protection, so the original memory barrier or lock concern isn't correct:
>> that means the SetPageLRU can't be reordered to this point, and there is
>> then no worry about the 'CPU 1' side of the problem.
> 
> The SetPageLRU is done under lru_lock, but the store to page->mapping
> is not, so what ensures ordering between them? And what prevents the
> compiler from tearing the store to page->mapping?
> 

Right, I misunderstood the memory-barrier effect of the spin_lock. Thanks
a lot for pointing this out.
So, is the following patch fine to fix the problem?

From 8427121da195a6a386d1ce71abcb41b31211e21f Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Thu, 5 Nov 2020 11:38:24 +0800
Subject: [PATCH] mm/rmap: stop store reordering issue on page->mapping

Hugh Dickins and Minchan Kim observed a long-standing issue that was
discussed here, but the fix mentioned there was never applied:
https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop/
The store reordering may cause a problem in the following scenario:

	CPU 0						CPU1
   do_anonymous_page
	page_add_new_anon_rmap()
	  page->mapping = anon_vma + PAGE_MAPPING_ANON
	lru_cache_add_inactive_or_unevictable()
	  spin_lock(lruvec->lock)
	  SetPageLRU()
	  spin_unlock(lruvec->lock)
						/* idle tracking judged it as LRU
						 * page so pass the page in
						 * page_idle_clear_pte_refs
						 */
						page_idle_clear_pte_refs
						  rmap_walk
						    if PageAnon(page)

Johannes gave a detailed example of how the store reordering could
cause trouble: the concern is that SetPageLRU may get reordered before
the 'page->mapping' store, so CPU 1 could observe page->mapping only
after observing PageLRU set on the page, and might then see:

1. anon_vma + PAGE_MAPPING_ANON

   That's the in-order scenario and is fine.

2. NULL

   That's possible if the page->mapping store gets reordered to occur
   after SetPageLRU. That's fine too because we check for it.

3. anon_vma without the PAGE_MAPPING_ANON bit

   That would be a problem and could lead to all kinds of undesirable
   behavior including crashes and data corruption.

   Is it possible? AFAICT the compiler is allowed to tear the store to
   page->mapping and I don't see anything that would prevent it.

That said, I also don't see how the reader testing PageLRU under the
lru_lock would prevent that in the first place. AFAICT we need that
WRITE_ONCE() around the page->mapping assignment.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/rmap.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index c050dab2ae65..56af18aa57de 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1054,8 +1054,27 @@ static void __page_set_anon_rmap(struct page *page,
 	if (!exclusive)
 		anon_vma = anon_vma->root;
 
+	/*
+	 * Without the WRITE_ONCE here, the following scenario may happen due to
+	 * store reordering.
+	 *
+	 *      CPU 0                                          CPU 1
+	 *
+	 * do_anonymous_page				page_idle_clear_pte_refs
+	 *   __page_set_anon_rmap
+	 *     page->mapping = anon_vma + PAGE_MAPPING_ANON
+	 *   lru_cache_add_inactive_or_unevictable()
+	 *     SetPageLRU(page)
+	 *                                               rmap_walk
+	 *                                                if PageAnon(page)
+	 *
+	 *  The SetPageLRU may be reordered before the page->mapping store, so
+	 *  page->mapping may be seen with the anon_vma but without the anon
+	 *  bit, and rmap_walk may then go to rmap_walk_file() for an anon page.
+	 */
+
 	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
-	page->mapping = (struct address_space *) anon_vma;
+	WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
 	page->index = linear_page_index(vma, address);
 }
Matthew Wilcox Nov. 5, 2020, 4:57 a.m. UTC | #7
On Thu, Nov 05, 2020 at 12:52:05PM +0800, Alex Shi wrote:
> @@ -1054,8 +1054,27 @@ static void __page_set_anon_rmap(struct page *page,
>  	if (!exclusive)
>  		anon_vma = anon_vma->root;
>  
> +	/*
> +	 * Without the WRITE_ONCE here, the following scenario may happen due to
> +	 * store reordering.
> +	 *
> +	 *      CPU 0                                          CPU 1
> +	 *
> +	 * do_anonymous_page				page_idle_clear_pte_refs
> +	 *   __page_set_anon_rmap
> +	 *     page->mapping = anon_vma + PAGE_MAPPING_ANON
> +	 *   lru_cache_add_inactive_or_unevictable()
> +	 *     SetPageLRU(page)
> +	 *                                               rmap_walk
> +	 *                                                if PageAnon(page)
> +	 *
> +	 *  The SetPageLRU may be reordered before the page->mapping store, so
> +	 *  page->mapping may be seen with the anon_vma but without the anon
> +	 *  bit, and rmap_walk may then go to rmap_walk_file() for an anon page.
> +	 */
> +
>  	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
> -	page->mapping = (struct address_space *) anon_vma;
> +	WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
>  	page->index = linear_page_index(vma, address);
>  }

I don't like these verbose comments with detailed descriptions in
the source code.  They're fine in changelogs, but they clutter the
code, and they get outdated really quickly.  My preference is for
something more brief:

	/*
	 * Prevent page->mapping from pointing to an anon_vma without
	 * the PAGE_MAPPING_ANON bit set.  This could happen if the
	 * compiler stores anon_vma and then adds PAGE_MAPPING_ANON to it.
	 */
Alex Shi Nov. 5, 2020, 5:03 a.m. UTC | #8
On 2020/11/5 12:57 PM, Matthew Wilcox wrote:
> On Thu, Nov 05, 2020 at 12:52:05PM +0800, Alex Shi wrote:
>> @@ -1054,8 +1054,27 @@ static void __page_set_anon_rmap(struct page *page,
>>  	if (!exclusive)
>>  		anon_vma = anon_vma->root;
>>  
>> +	/*
>> +	 * Without the WRITE_ONCE here, the following scenario may happen due to
>> +	 * store reordering.
>> +	 *
>> +	 *      CPU 0                                          CPU 1
>> +	 *
>> +	 * do_anonymous_page				page_idle_clear_pte_refs
>> +	 *   __page_set_anon_rmap
>> +	 *     page->mapping = anon_vma + PAGE_MAPPING_ANON
>> +	 *   lru_cache_add_inactive_or_unevictable()
>> +	 *     SetPageLRU(page)
>> +	 *                                               rmap_walk
>> +	 *                                                if PageAnon(page)
>> +	 *
>> +	 *  The SetPageLRU may be reordered before the page->mapping store, so
>> +	 *  page->mapping may be seen with the anon_vma but without the anon
>> +	 *  bit, and rmap_walk may then go to rmap_walk_file() for an anon page.
>> +	 */
>> +
>>  	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
>> -	page->mapping = (struct address_space *) anon_vma;
>> +	WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
>>  	page->index = linear_page_index(vma, address);
>>  }
> 
> I don't like these verbose comments with detailed descriptions in
> the source code.  They're fine in changelogs, but they clutter the
> code, and they get outdated really quickly.  My preference is for
> something more brief:
> 
> 	/*
> 	 * Prevent page->mapping from pointing to an anon_vma without
> 	 * the PAGE_MAPPING_ANON bit set.  This could happen if the
> 	 * compiler stores anon_vma and then adds PAGE_MAPPING_ANON to it.
> 	 */
> 

Yes, that's reasonable. So is the following fine?

From f166f0d5df350c5eae1218456b9e6e1bd43434e7 Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Thu, 5 Nov 2020 11:38:24 +0800
Subject: [PATCH] mm/rmap: stop store reordering issue on page->mapping

Hugh Dickins and Minchan Kim observed a long-standing issue that was
discussed here, but the fix mentioned there was never applied:
https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop/
The store reordering may cause a problem in the following scenario:

	CPU 0						CPU1
   do_anonymous_page
	page_add_new_anon_rmap()
	  page->mapping = anon_vma + PAGE_MAPPING_ANON
	lru_cache_add_inactive_or_unevictable()
	  spin_lock(lruvec->lock)
	  SetPageLRU()
	  spin_unlock(lruvec->lock)
						/* idle tracking judged it as LRU
						 * page so pass the page in
						 * page_idle_clear_pte_refs
						 */
						page_idle_clear_pte_refs
						  rmap_walk
						    if PageAnon(page)

Johannes gave a detailed example of how the store reordering could
cause trouble: the concern is that SetPageLRU may get reordered before
the 'page->mapping' store, so CPU 1 could observe page->mapping only
after observing PageLRU set on the page, and might then see:

1. anon_vma + PAGE_MAPPING_ANON

   That's the in-order scenario and is fine.

2. NULL

   That's possible if the page->mapping store gets reordered to occur
   after SetPageLRU. That's fine too because we check for it.

3. anon_vma without the PAGE_MAPPING_ANON bit

   That would be a problem and could lead to all kinds of undesirable
   behavior including crashes and data corruption.

   Is it possible? AFAICT the compiler is allowed to tear the store to
   page->mapping and I don't see anything that would prevent it.

That said, I also don't see how the reader testing PageLRU under the
lru_lock would prevent that in the first place. AFAICT we need that
WRITE_ONCE() around the page->mapping assignment.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/rmap.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index c050dab2ae65..73788505aa0a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1054,8 +1054,13 @@ static void __page_set_anon_rmap(struct page *page,
 	if (!exclusive)
 		anon_vma = anon_vma->root;
 
+	/*
+	 * Prevent page->mapping from pointing to an anon_vma without
+	 * the PAGE_MAPPING_ANON bit set.  This could happen if the
+	 * compiler stores anon_vma and then adds PAGE_MAPPING_ANON to it.
+	 */
 	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
-	page->mapping = (struct address_space *) anon_vma;
+	WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
 	page->index = linear_page_index(vma, address);
 }
Johannes Weiner Nov. 5, 2020, 3:36 p.m. UTC | #9
On Thu, Nov 05, 2020 at 01:03:18PM +0800, Alex Shi wrote:
> 
> 
> On 2020/11/5 12:57 PM, Matthew Wilcox wrote:
> > On Thu, Nov 05, 2020 at 12:52:05PM +0800, Alex Shi wrote:
> >> @@ -1054,8 +1054,27 @@ static void __page_set_anon_rmap(struct page *page,
> >>  	if (!exclusive)
> >>  		anon_vma = anon_vma->root;
> >>  
> >> +	/*
> >> +	 * Without the WRITE_ONCE here, the following scenario may happen due to
> >> +	 * store reordering.
> >> +	 *
> >> +	 *      CPU 0                                          CPU 1
> >> +	 *
> >> +	 * do_anonymous_page				page_idle_clear_pte_refs
> >> +	 *   __page_set_anon_rmap
> >> +	 *     page->mapping = anon_vma + PAGE_MAPPING_ANON
> >> +	 *   lru_cache_add_inactive_or_unevictable()
> >> +	 *     SetPageLRU(page)
> >> +	 *                                               rmap_walk
> >> +	 *                                                if PageAnon(page)
> >> +	 *
> >> +	 *  The SetPageLRU may be reordered before the page->mapping store, so
> >> +	 *  page->mapping may be seen with the anon_vma but without the anon
> >> +	 *  bit, and rmap_walk may then go to rmap_walk_file() for an anon page.
> >> +	 */
> >> +
> >>  	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
> >> -	page->mapping = (struct address_space *) anon_vma;
> >> +	WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
> >>  	page->index = linear_page_index(vma, address);
> >>  }
> > 
> > I don't like these verbose comments with detailed descriptions in
> > the source code.  They're fine in changelogs, but they clutter the
> > code, and they get outdated really quickly.  My preference is for
> > something more brief:
> > 
> > 	/*
> > 	 * Prevent page->mapping from pointing to an anon_vma without
> > 	 * the PAGE_MAPPING_ANON bit set.  This could happen if the
> > 	 * compiler stores anon_vma and then adds PAGE_MAPPING_ANON to it.
> > 	 */
> > 

Yeah, I don't think this scenario warrants the full race diagram in
the code itself.

But the code is highly specific - synchronizing one struct page member
for one particular use case. Let's keep at least a reference to what
we are synchronizing against. There is a non-zero chance that if the
comment goes out of date, so does the code. How about this?

	/*
	 * page_idle does a lockless/optimistic rmap scan on page->mapping.
	 * Make sure the compiler doesn't split the stores of anon_vma and
	 * the PAGE_MAPPING_ANON type identifier, otherwise the rmap code
	 * could mistake the mapping for a struct address_space and crash.
	 */
Matthew Wilcox Nov. 5, 2020, 3:43 p.m. UTC | #10
On Thu, Nov 05, 2020 at 10:36:49AM -0500, Johannes Weiner wrote:
> But the code is highly specific - synchronizing one struct page member
> for one particular use case. Let's keep at least a reference to what
> we are synchronizing against. There is a non-zero chance that if the
> comment goes out of date, so does the code. How about this?
> 
> 	/*
> 	 * page_idle does a lockless/optimistic rmap scan on page->mapping.
> 	 * Make sure the compiler doesn't split the stores of anon_vma and
> 	 * the PAGE_MAPPING_ANON type identifier, otherwise the rmap code
> 	 * could mistake the mapping for a struct address_space and crash.
> 	 */

Fine by me!  There may be other cases where seeing a split store would
be bad, so I didn't want to call out page_idle explicitly.  But if you
want to, I'm happy with this comment.
Alex Shi Nov. 6, 2020, 1:11 a.m. UTC | #11
On 2020/11/5 11:36 PM, Johannes Weiner wrote:
>>> 	 */
>>>
> Yeah, I don't think this scenario warrants the full race diagram in
> the code itself.
> 
> But the code is highly specific - synchronizing one struct page member
> for one particular use case. Let's keep at least a reference to what
> we are synchronizing against. There is a non-zero chance that if the
> comment goes out of date, so does the code. How about this?
> 
> 	/*
> 	 * page_idle does a lockless/optimistic rmap scan on page->mapping.
> 	 * Make sure the compiler doesn't split the stores of anon_vma and
> 	 * the PAGE_MAPPING_ANON type identifier, otherwise the rmap code
> 	 * could mistake the mapping for a struct address_space and crash.
> 	 */

Thanks a lot to you all. I will update this in the v21 patchset.

Alex
Hugh Dickins Nov. 11, 2020, 7:27 a.m. UTC | #12
On Mon, 2 Nov 2020, Johannes Weiner wrote:
> On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote:
> > From: Hugh Dickins <hughd@google.com>
> > 
> > It is necessary for page_idle_get_page() to recheck PageLRU() after
> > get_page_unless_zero(), but holding lru_lock around that serves no
> > useful purpose, and adds to lru_lock contention: delete it.
> > 
> > See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
> > discussion that led to lru_lock there; but __page_set_anon_rmap() now
> > uses WRITE_ONCE(),
> 
> That doesn't seem to be the case in Linus's or Andrew's tree. Am I
> missing a dependent patch series?

Sorry, I was out of action, then slower than ever, for a while.

Many thanks for calling out my falsehood there, Johannes.

What led me to write that?  It has baffled me, but at last I see:
this patch to page_idle_get_page() was 0002 in my lru_lock patchset
against v5.3 last year, and 0001 was the patch which made it true.
Then when I checked against mainline, I must have got confused by
the similar WRITE_ONCE in page_move_anon_rmap().

Appended below, but not rediffed, and let's not hold up Alex's set
for the rest of it: it is all theoretical until the kernel gets to
be built with a suitably malicious compiler; but I'll follow up
with a fresh version of the below after his set is safely in.

From a1abcbc2aac70c6ba47b8991992bb85b86b4a160 Mon Sep 17 00:00:00 2001
From: Hugh Dickins <hughd@google.com>
Date: Thu, 22 Aug 2019 15:49:44 -0700
Subject: [PATCH 1/9] mm: more WRITE_ONCE and READ_ONCE on page->mapping

v4.2 commit 414e2fb8ce5a ("rmap: fix theoretical race between do_wp_page
and shrink_active_list") added a WRITE_ONCE() where page_move_anon_rmap()
composes page->mapping from anon_vma pointer and PAGE_MAPPING_ANON.

Now do the same where __page_set_anon_rmap() does the same, and where
compaction.c applies PAGE_MAPPING_MOVABLE, and ksm.c PAGE_MAPPING_KSM.

rmap.c already uses READ_ONCE(page->mapping), but util.c should too:
add READ_ONCE() in page_rmapping(), page_anon_vma() and page_mapping().
Delete the then unused helper __page_rmapping().

I doubt that this commit fixes anything, but it's harmless and
unintrusive, and makes reasoning about page mapping flags easier.

What if a compiler implements "page->mapping = mapping" in other places
by, say, first assigning the odd bits of mapping, then adding in the
even bits?  Then we shall not build the kernel with such a compiler.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
---
 mm/compaction.c |  7 ++++---
 mm/ksm.c        |  2 +-
 mm/rmap.c       |  7 ++++++-
 mm/util.c       | 24 ++++++++++--------------
 4 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 952dc2fb24e5..c405f4362624 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -113,7 +113,8 @@ void __SetPageMovable(struct page *page, struct address_space *mapping)
 {
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE((unsigned long)mapping & PAGE_MAPPING_MOVABLE, page);
-	page->mapping = (void *)((unsigned long)mapping | PAGE_MAPPING_MOVABLE);
+	WRITE_ONCE(page->mapping,
+		   (unsigned long)mapping | PAGE_MAPPING_MOVABLE);
 }
 EXPORT_SYMBOL(__SetPageMovable);
 
@@ -126,8 +127,8 @@ void __ClearPageMovable(struct page *page)
 	 * flag so that VM can catch up released page by driver after isolation.
 	 * With it, VM migration doesn't try to put it back.
 	 */
-	page->mapping = (void *)((unsigned long)page->mapping &
-				PAGE_MAPPING_MOVABLE);
+	WRITE_ONCE(page->mapping,
+		   (unsigned long)page->mapping & PAGE_MAPPING_MOVABLE);
 }
 EXPORT_SYMBOL(__ClearPageMovable);
 
diff --git a/mm/ksm.c b/mm/ksm.c
index 3dc4346411e4..426b6a40ea41 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -865,7 +865,7 @@ static inline struct stable_node *page_stable_node(struct page *page)
 static inline void set_page_stable_node(struct page *page,
 					struct stable_node *stable_node)
 {
-	page->mapping = (void *)((unsigned long)stable_node | PAGE_MAPPING_KSM);
+	WRITE_ONCE(page->mapping, (unsigned long)stable_node | PAGE_MAPPING_KSM);
 }
 
 #ifdef CONFIG_SYSFS
diff --git a/mm/rmap.c b/mm/rmap.c
index 003377e24232..9480df437edc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1044,7 +1044,12 @@ static void __page_set_anon_rmap(struct page *page,
 		anon_vma = anon_vma->root;
 
 	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
-	page->mapping = (struct address_space *) anon_vma;
+	/*
+	 * Ensure that anon_vma and the PAGE_MAPPING_ANON bit are written
+	 * simultaneously, so a concurrent reader (eg page_referenced()'s
+	 * PageAnon()) will not see one without the other.
+	 */
+	WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
 	page->index = linear_page_index(vma, address);
 }
 
diff --git a/mm/util.c b/mm/util.c
index e6351a80f248..09b9fcbedac3 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -489,21 +489,14 @@ void kvfree(const void *addr)
 }
 EXPORT_SYMBOL(kvfree);
 
-static inline void *__page_rmapping(struct page *page)
-{
-	unsigned long mapping;
-
-	mapping = (unsigned long)page->mapping;
-	mapping &= ~PAGE_MAPPING_FLAGS;
-
-	return (void *)mapping;
-}
-
 /* Neutral page->mapping pointer to address_space or anon_vma or other */
 void *page_rmapping(struct page *page)
 {
+	unsigned long mapping;
+
 	page = compound_head(page);
-	return __page_rmapping(page);
+	mapping = (unsigned long)READ_ONCE(page->mapping);
+	return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
 }
 
 /*
@@ -534,10 +527,11 @@ struct anon_vma *page_anon_vma(struct page *page)
 	unsigned long mapping;
 
 	page = compound_head(page);
-	mapping = (unsigned long)page->mapping;
+	mapping = (unsigned long)READ_ONCE(page->mapping);
+	/* Return NULL if file or PageMovable or PageKsm */
 	if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
 		return NULL;
-	return __page_rmapping(page);
+	return (struct anon_vma *)(mapping & ~PAGE_MAPPING_FLAGS);
 }
 
 struct address_space *page_mapping(struct page *page)
@@ -557,10 +551,12 @@ struct address_space *page_mapping(struct page *page)
 		return swap_address_space(entry);
 	}
 
-	mapping = page->mapping;
+	mapping = READ_ONCE(page->mapping);
+	/* Return NULL if PageAnon (including PageKsm) */
 	if ((unsigned long)mapping & PAGE_MAPPING_ANON)
 		return NULL;
 
+	/* Return struct address_space pointer if file or PageMovable */
 	return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
 }
 EXPORT_SYMBOL(page_mapping);

Patch

diff --git a/mm/page_idle.c b/mm/page_idle.c
index 057c61df12db..64e5344a992c 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -32,19 +32,15 @@ 
 static struct page *page_idle_get_page(unsigned long pfn)
 {
 	struct page *page = pfn_to_online_page(pfn);
-	pg_data_t *pgdat;
 
 	if (!page || !PageLRU(page) ||
 	    !get_page_unless_zero(page))
 		return NULL;
 
-	pgdat = page_pgdat(page);
-	spin_lock_irq(&pgdat->lru_lock);
 	if (unlikely(!PageLRU(page))) {
 		put_page(page);
 		page = NULL;
 	}
-	spin_unlock_irq(&pgdat->lru_lock);
 	return page;
 }