
[v2] mm: thp: grab the lock before manipulating the defer list

Message ID 20200109143054.13203-1-richardw.yang@linux.intel.com (mailing list archive)
State New, archived
Series [v2] mm: thp: grab the lock before manipulating the defer list

Commit Message

Wei Yang Jan. 9, 2020, 2:30 p.m. UTC
As in all the other places, grab the lock before manipulating the defer list.
The current implementation may face a race condition.

For example, the potential race would be:

    CPU1                      CPU2
    mem_cgroup_move_account   split_huge_page_to_list
      !list_empty
                                lock
                                !list_empty
                                list_del
                                unlock
      lock
      # !list_empty might not hold anymore
      list_del_init
      unlock

When this sequence happens, the list_del_init() in
mem_cgroup_move_account() would crash if CONFIG_DEBUG_LIST is enabled, since
the page has already been removed by list_del() in split_huge_page_to_list().

Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")

Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: David Rientjes <rientjes@google.com>

---
v2:
  * move the check on compound outside, as suggested by Alexander
  * add an example of the race condition, as suggested by Michal
---
 mm/memcontrol.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)
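
To see the problem outside the kernel, here is a minimal userspace sketch of
the interleaving above (an illustration only: the list helpers are toy
re-implementations, not the kernel's, and the two CPUs' steps are replayed
sequentially so the outcome is deterministic). In the kernel, CONFIG_DEBUG_LIST
would catch the second deletion; here the visible damage is the queue length
counter wrapping around.

#include <stdio.h>

struct node { struct node *prev, *next; };

static struct node queue = { &queue, &queue };  /* the deferred split queue */
static struct node page  = { &page,  &page  };  /* page_deferred_list(page) */
static unsigned long queue_len;                 /* split_queue_len          */

static int list_empty(const struct node *n) { return n->next == n; }

static void list_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_del_init(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

int main(void)
{
	/* The huge page starts out on the deferred split queue. */
	list_add_tail(&page, &queue);
	queue_len = 1;

	/* CPU1 (mem_cgroup_move_account): checks outside the lock. */
	int cpu1_saw_nonempty = !list_empty(&page);

	/* CPU2 (split_huge_page_to_list): checks and deletes under the lock. */
	if (!list_empty(&page)) {
		list_del_init(&page);
		queue_len--;
	}

	/* CPU1: takes the lock and acts on the now-stale check. */
	if (cpu1_saw_nonempty) {
		list_del_init(&page);	/* second removal of the same entry */
		queue_len--;		/* 0 - 1: unsigned underflow        */
	}

	printf("queue_len after the race: %lu\n", queue_len);
	return 0;
}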

Comments

David Rientjes Jan. 9, 2020, 6:52 p.m. UTC | #1
On Thu, 9 Jan 2020, Wei Yang wrote:

> As all the other places, we grab the lock before manipulate the defer list.
> Current implementation may face a race condition.
> 
> For example, the potential race would be:
> 
>     CPU1                      CPU2
>     mem_cgroup_move_account   split_huge_page_to_list
>       !list_empty
>                                 lock
>                                 !list_empty
>                                 list_del
>                                 unlock
>       lock
>       # !list_empty might not hold anymore
>       list_del_init
>       unlock
> 
> When this sequence happens, the list_del_init() in
> mem_cgroup_move_account() would crash if CONFIG_DEBUG_LIST since the
> page is already been removed by list_del in split_huge_page_to_list().
> 
> Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
> 
> Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
> Acked-by: David Rientjes <rientjes@google.com>

Thanks Wei!

Andrew, I'd also suggest:

Cc: stable@vger.kernel.org # 5.4+
Michal Hocko Jan. 9, 2020, 6:58 p.m. UTC | #2
On Thu 09-01-20 22:30:54, Wei Yang wrote:
> As all the other places, we grab the lock before manipulate the defer list.
> Current implementation may face a race condition.
> 
> For example, the potential race would be:
> 
>     CPU1                      CPU2
>     mem_cgroup_move_account   split_huge_page_to_list
>       !list_empty
>                                 lock
>                                 !list_empty
>                                 list_del
>                                 unlock
>       lock
>       # !list_empty might not hold anymore
>       list_del_init
>       unlock
> 
> When this sequence happens, the list_del_init() in
> mem_cgroup_move_account() would crash if CONFIG_DEBUG_LIST since the
> page is already been removed by list_del in split_huge_page_to_list().
> 
> Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
> 
> Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
> Acked-by: David Rientjes <rientjes@google.com>

Thanks a lot for the changelog improvements!
Acked-by: Michal Hocko <mhocko@suse.com>

> 
> ---
> v2:
>   * move check on compound outside suggested by Alexander
>   * an example of the race condition, suggested by Michal
> ---
>  mm/memcontrol.c | 18 +++++++++++-------
>  1 file changed, 11 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index bc01423277c5..1492eefe4f3c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5368,10 +5368,12 @@ static int mem_cgroup_move_account(struct page *page,
>  	}
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	if (compound && !list_empty(page_deferred_list(page))) {
> +	if (compound) {
>  		spin_lock(&from->deferred_split_queue.split_queue_lock);
> -		list_del_init(page_deferred_list(page));
> -		from->deferred_split_queue.split_queue_len--;
> +		if (!list_empty(page_deferred_list(page))) {
> +			list_del_init(page_deferred_list(page));
> +			from->deferred_split_queue.split_queue_len--;
> +		}
>  		spin_unlock(&from->deferred_split_queue.split_queue_lock);
>  	}
>  #endif
> @@ -5385,11 +5387,13 @@ static int mem_cgroup_move_account(struct page *page,
>  	page->mem_cgroup = to;
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	if (compound && list_empty(page_deferred_list(page))) {
> +	if (compound) {
>  		spin_lock(&to->deferred_split_queue.split_queue_lock);
> -		list_add_tail(page_deferred_list(page),
> -			      &to->deferred_split_queue.split_queue);
> -		to->deferred_split_queue.split_queue_len++;
> +		if (list_empty(page_deferred_list(page))) {
> +			list_add_tail(page_deferred_list(page),
> +				      &to->deferred_split_queue.split_queue);
> +			to->deferred_split_queue.split_queue_len++;
> +		}
>  		spin_unlock(&to->deferred_split_queue.split_queue_lock);
>  	}
>  #endif
> -- 
> 2.17.1
Kirill A. Shutemov Jan. 11, 2020, 12:03 a.m. UTC | #3
On Thu, Jan 09, 2020 at 10:30:54PM +0800, Wei Yang wrote:
> As all the other places, we grab the lock before manipulate the defer list.
> Current implementation may face a race condition.
> 
> For example, the potential race would be:
> 
>     CPU1                      CPU2
>     mem_cgroup_move_account   split_huge_page_to_list
>       !list_empty
>                                 lock
>                                 !list_empty
>                                 list_del
>                                 unlock
>       lock
>       # !list_empty might not hold anymore
>       list_del_init
>       unlock

I don't think this particular race is possible. Both parties take the page
lock before messing with the deferred queue, but anyway:

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

> 
> When this sequence happens, the list_del_init() in
> mem_cgroup_move_account() would crash if CONFIG_DEBUG_LIST since the
> page is already been removed by list_del in split_huge_page_to_list().
> 
> Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
> 
> Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
> Acked-by: David Rientjes <rientjes@google.com>
> 
> ---
> v2:
>   * move check on compound outside suggested by Alexander
>   * an example of the race condition, suggested by Michal
> ---
>  mm/memcontrol.c | 18 +++++++++++-------
>  1 file changed, 11 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index bc01423277c5..1492eefe4f3c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5368,10 +5368,12 @@ static int mem_cgroup_move_account(struct page *page,
>  	}
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	if (compound && !list_empty(page_deferred_list(page))) {
> +	if (compound) {
>  		spin_lock(&from->deferred_split_queue.split_queue_lock);
> -		list_del_init(page_deferred_list(page));
> -		from->deferred_split_queue.split_queue_len--;
> +		if (!list_empty(page_deferred_list(page))) {
> +			list_del_init(page_deferred_list(page));
> +			from->deferred_split_queue.split_queue_len--;
> +		}
>  		spin_unlock(&from->deferred_split_queue.split_queue_lock);
>  	}
>  #endif
> @@ -5385,11 +5387,13 @@ static int mem_cgroup_move_account(struct page *page,
>  	page->mem_cgroup = to;
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	if (compound && list_empty(page_deferred_list(page))) {
> +	if (compound) {
>  		spin_lock(&to->deferred_split_queue.split_queue_lock);
> -		list_add_tail(page_deferred_list(page),
> -			      &to->deferred_split_queue.split_queue);
> -		to->deferred_split_queue.split_queue_len++;
> +		if (list_empty(page_deferred_list(page))) {
> +			list_add_tail(page_deferred_list(page),
> +				      &to->deferred_split_queue.split_queue);
> +			to->deferred_split_queue.split_queue_len++;
> +		}
>  		spin_unlock(&to->deferred_split_queue.split_queue_lock);
>  	}
>  #endif
> -- 
> 2.17.1
> 
>
Wei Yang Jan. 12, 2020, 2:28 a.m. UTC | #4
On Sat, Jan 11, 2020 at 03:03:52AM +0300, Kirill A. Shutemov wrote:
>On Thu, Jan 09, 2020 at 10:30:54PM +0800, Wei Yang wrote:
>> As all the other places, we grab the lock before manipulate the defer list.
>> Current implementation may face a race condition.
>> 
>> For example, the potential race would be:
>> 
>>     CPU1                      CPU2
>>     mem_cgroup_move_account   split_huge_page_to_list
>>       !list_empty
>>                                 lock
>>                                 !list_empty
>>                                 list_del
>>                                 unlock
>>       lock
>>       # !list_empty might not hold anymore
>>       list_del_init
>>       unlock
>
>I don't think this particular race is possible. Both parties take page
>lock before messing with deferred queue, but anytway:

I am afraid not. The page lock is per page, while the defer queue is per
pgdat or per memcg.

It is possible for two pages in the same pgdat or memcg to grab their page
locks respectively and then access the same defer queue concurrently.

>
>Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>
>> 
>> When this sequence happens, the list_del_init() in
>> mem_cgroup_move_account() would crash if CONFIG_DEBUG_LIST since the
>> page is already been removed by list_del in split_huge_page_to_list().
>> 
>> Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
>> 
>> Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
>> Acked-by: David Rientjes <rientjes@google.com>
>> 
>> ---
>> v2:
>>   * move check on compound outside suggested by Alexander
>>   * an example of the race condition, suggested by Michal
>> ---
>>  mm/memcontrol.c | 18 +++++++++++-------
>>  1 file changed, 11 insertions(+), 7 deletions(-)
>> 
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index bc01423277c5..1492eefe4f3c 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -5368,10 +5368,12 @@ static int mem_cgroup_move_account(struct page *page,
>>  	}
>>  
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> -	if (compound && !list_empty(page_deferred_list(page))) {
>> +	if (compound) {
>>  		spin_lock(&from->deferred_split_queue.split_queue_lock);
>> -		list_del_init(page_deferred_list(page));
>> -		from->deferred_split_queue.split_queue_len--;
>> +		if (!list_empty(page_deferred_list(page))) {
>> +			list_del_init(page_deferred_list(page));
>> +			from->deferred_split_queue.split_queue_len--;
>> +		}
>>  		spin_unlock(&from->deferred_split_queue.split_queue_lock);
>>  	}
>>  #endif
>> @@ -5385,11 +5387,13 @@ static int mem_cgroup_move_account(struct page *page,
>>  	page->mem_cgroup = to;
>>  
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> -	if (compound && list_empty(page_deferred_list(page))) {
>> +	if (compound) {
>>  		spin_lock(&to->deferred_split_queue.split_queue_lock);
>> -		list_add_tail(page_deferred_list(page),
>> -			      &to->deferred_split_queue.split_queue);
>> -		to->deferred_split_queue.split_queue_len++;
>> +		if (list_empty(page_deferred_list(page))) {
>> +			list_add_tail(page_deferred_list(page),
>> +				      &to->deferred_split_queue.split_queue);
>> +			to->deferred_split_queue.split_queue_len++;
>> +		}
>>  		spin_unlock(&to->deferred_split_queue.split_queue_lock);
>>  	}
>>  #endif
>> -- 
>> 2.17.1
>> 
>> 
>
>-- 
> Kirill A. Shutemov
Kirill A. Shutemov Jan. 12, 2020, 10:57 p.m. UTC | #5
On Sun, Jan 12, 2020 at 10:28:58AM +0800, Wei Yang wrote:
> On Sat, Jan 11, 2020 at 03:03:52AM +0300, Kirill A. Shutemov wrote:
> >On Thu, Jan 09, 2020 at 10:30:54PM +0800, Wei Yang wrote:
> >> As all the other places, we grab the lock before manipulate the defer list.
> >> Current implementation may face a race condition.
> >> 
> >> For example, the potential race would be:
> >> 
> >>     CPU1                      CPU2
> >>     mem_cgroup_move_account   split_huge_page_to_list
> >>       !list_empty
> >>                                 lock
> >>                                 !list_empty
> >>                                 list_del
> >>                                 unlock
> >>       lock
> >>       # !list_empty might not hold anymore
> >>       list_del_init
> >>       unlock
> >
> >I don't think this particular race is possible. Both parties take page
> >lock before messing with deferred queue, but anytway:
> 
> I am afraid not. Page lock is per page, while defer queue is per pgdate or
> memcg.
> 
> It is possible two page in the same pgdate or memcg grab page lock
> respectively and then access the same defer queue concurrently.

Look closer at the list_empty() argument. It's a list_head local to the
page. Two different pages can be handled in parallel without any problem in
this particular scenario, as long as we modify it under the lock.

That said, the page lock here was somewhat accidental and I still believe we
need to move the check under the lock anyway.
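
For reference, the list_head in question is embedded in the compound page
itself; the helper looked roughly like this in kernels of that era (quoted
from memory of include/linux/huge_mm.h, so treat it as approximate rather
than authoritative):

static inline struct list_head *page_deferred_list(struct page *page)
{
	/* the deferred_list field lives in the second tail page of the THP */
	return &page[2].deferred_list;
}

So the list_empty() test reads state that belongs to the page, even though
the lock and the length counter belong to the per-node or per-memcg queue.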
Wei Yang Jan. 13, 2020, 12:44 a.m. UTC | #6
On Mon, Jan 13, 2020 at 01:57:18AM +0300, Kirill A. Shutemov wrote:
>On Sun, Jan 12, 2020 at 10:28:58AM +0800, Wei Yang wrote:
>> On Sat, Jan 11, 2020 at 03:03:52AM +0300, Kirill A. Shutemov wrote:
>> >On Thu, Jan 09, 2020 at 10:30:54PM +0800, Wei Yang wrote:
>> >> As all the other places, we grab the lock before manipulate the defer list.
>> >> Current implementation may face a race condition.
>> >> 
>> >> For example, the potential race would be:
>> >> 
>> >>     CPU1                      CPU2
>> >>     mem_cgroup_move_account   split_huge_page_to_list
>> >>       !list_empty
>> >>                                 lock
>> >>                                 !list_empty
>> >>                                 list_del
>> >>                                 unlock
>> >>       lock
>> >>       # !list_empty might not hold anymore
>> >>       list_del_init
>> >>       unlock
>> >
>> >I don't think this particular race is possible. Both parties take page
>> >lock before messing with deferred queue, but anytway:
>> 
>> I am afraid not. Page lock is per page, while defer queue is per pgdate or
>> memcg.
>> 
>> It is possible two page in the same pgdate or memcg grab page lock
>> respectively and then access the same defer queue concurrently.
>
>Look closer on the list_empty() argument. It's list_head local to the
>page. Too different pages can be handled in parallel without any problem
>in this particular scenario. As long as we as we modify it under the lock.
>
>Said that, page lock here was somewhat accidential and I still belive we
>need to move the check under the lock anyway.
>

If my understanding is correct, do you agree with my statement?

>-- 
> Kirill A. Shutemov
Kirill A. Shutemov Jan. 13, 2020, 7:36 a.m. UTC | #7
On Mon, Jan 13, 2020 at 08:44:57AM +0800, Wei Yang wrote:
> >> It is possible two page in the same pgdate or memcg grab page lock
> >> respectively and then access the same defer queue concurrently.
> 
> If my understanding is correct, you agree with my statement?

Which one? If the one above, then no. list_empty() only accesses the
list_head of the struct page, not the queue.
Wei Yang Jan. 13, 2020, 8:23 a.m. UTC | #8
On Mon, Jan 13, 2020 at 10:36:14AM +0300, Kirill A. Shutemov wrote:
>On Mon, Jan 13, 2020 at 08:44:57AM +0800, Wei Yang wrote:
>> >> It is possible two page in the same pgdate or memcg grab page lock
>> >> respectively and then access the same defer queue concurrently.
>> 
>> If my understanding is correct, you agree with my statement?
>
>Which one? If the one above then no. list_empty only accesses list_head
>for the struct page, not the queue.
>

Ah, I get your point.

>-- 
> Kirill A. Shutemov
Michal Hocko Jan. 14, 2020, 9:31 a.m. UTC | #9
On Sat 11-01-20 03:03:52, Kirill A. Shutemov wrote:
> On Thu, Jan 09, 2020 at 10:30:54PM +0800, Wei Yang wrote:
> > As all the other places, we grab the lock before manipulate the defer list.
> > Current implementation may face a race condition.
> > 
> > For example, the potential race would be:
> > 
> >     CPU1                      CPU2
> >     mem_cgroup_move_account   split_huge_page_to_list
> >       !list_empty
> >                                 lock
> >                                 !list_empty
> >                                 list_del
> >                                 unlock
> >       lock
> >       # !list_empty might not hold anymore
> >       list_del_init
> >       unlock
> 
> I don't think this particular race is possible. Both parties take page
> lock before messing with deferred queue, but anytway:
> 
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

I am confused: if the above race is not possible, then what would be a
real race? We really do not want to have a patch with a misleading
changelog, do we?
Kirill A. Shutemov Jan. 14, 2020, 10:31 a.m. UTC | #10
On Tue, Jan 14, 2020 at 10:31:22AM +0100, Michal Hocko wrote:
> On Sat 11-01-20 03:03:52, Kirill A. Shutemov wrote:
> > On Thu, Jan 09, 2020 at 10:30:54PM +0800, Wei Yang wrote:
> > > As all the other places, we grab the lock before manipulate the defer list.
> > > Current implementation may face a race condition.
> > > 
> > > For example, the potential race would be:
> > > 
> > >     CPU1                      CPU2
> > >     mem_cgroup_move_account   split_huge_page_to_list
> > >       !list_empty
> > >                                 lock
> > >                                 !list_empty
> > >                                 list_del
> > >                                 unlock
> > >       lock
> > >       # !list_empty might not hold anymore
> > >       list_del_init
> > >       unlock
> > 
> > I don't think this particular race is possible. Both parties take page
> > lock before messing with deferred queue, but anytway:
> > 
> > Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> 
> I am confused, if the above race is not possible then what would be a
> real race? We really do not want to have a patch with a misleading
> changelog, do we?

The alternative is to make sure that all page_deferred_list() users are
called with the page lock taken.

I'll look into it.
Kirill A. Shutemov Jan. 14, 2020, 10:59 a.m. UTC | #11
On Tue, Jan 14, 2020 at 01:31:12PM +0300, Kirill A. Shutemov wrote:
> On Tue, Jan 14, 2020 at 10:31:22AM +0100, Michal Hocko wrote:
> > On Sat 11-01-20 03:03:52, Kirill A. Shutemov wrote:
> > > On Thu, Jan 09, 2020 at 10:30:54PM +0800, Wei Yang wrote:
> > > > As all the other places, we grab the lock before manipulate the defer list.
> > > > Current implementation may face a race condition.
> > > > 
> > > > For example, the potential race would be:
> > > > 
> > > >     CPU1                      CPU2
> > > >     mem_cgroup_move_account   split_huge_page_to_list
> > > >       !list_empty
> > > >                                 lock
> > > >                                 !list_empty
> > > >                                 list_del
> > > >                                 unlock
> > > >       lock
> > > >       # !list_empty might not hold anymore
> > > >       list_del_init
> > > >       unlock
> > > 
> > > I don't think this particular race is possible. Both parties take page
> > > lock before messing with deferred queue, but anytway:
> > > 
> > > Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > 
> > I am confused, if the above race is not possible then what would be a
> > real race? We really do not want to have a patch with a misleading
> > changelog, do we?
> 
> The alternative is to make sure that all page_deferred_list() called with
> page lock taken.
> 
> I'll look into it.

split_huge_page_to_list() has page lock taken.

free_transhuge_page() is in the free path and isn't susceptible to the
race.

deferred_split_scan() is trickier. list_move() should be safe against
list_empty() as it will not produce a false-positive list_empty().
list_del_init() *should* (correct me if I'm wrong) be safe because the page
is being freed and memcg will not touch the page anymore.

deferred_split_huge_page() is the problematic one. It is called from the
page_remove_rmap() path, which doesn't require the page lock. I don't see any
obvious way to exclude a race with mem_cgroup_move_account() here.
Anybody else?

Wei, could you rewrite the commit message with deferred_split_huge_page()
as a race source instead of split_huge_page_to_list()?
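
For what it's worth, such an interleaving could be sketched in the same style
as the changelog diagram (illustrative only, assuming the pre-patch code where
the list_empty() check in the second hunk happens before the lock is taken):

    CPU1                        CPU2
    mem_cgroup_move_account     deferred_split_huge_page
      page->mem_cgroup = to
      list_empty           -> true
                                lock
                                list_empty      -> true
                                list_add_tail
                                split_queue_len++
                                unlock
      lock
      # stale check: the page is already on the queue
      list_add_tail
      split_queue_len++
      unlock

Here the second list_add_tail() corrupts the list and split_queue_len is
bumped twice for a single page.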
David Rientjes Jan. 14, 2020, 8:57 p.m. UTC | #12
On Tue, 14 Jan 2020, Kirill A. Shutemov wrote:

> split_huge_page_to_list() has page lock taken.
> 
> free_transhuge_page() is in the free path and doesn't susceptible to the
> race.
> 
> deferred_split_scan() is trickier. list_move() should be safe against
> list_empty() as it will not produce false-positive list_empty().
> list_del_init() *should* (correct me if I'm wrong) be safe because the page
> is freeing and memcg will not touch the page anymore.
> 
> deferred_split_huge_page() is a problematic one. It called from
> page_remove_rmap() path witch does require page lock. I don't see any
> obvious way to exclude race with mem_cgroup_move_account() here.
> Anybody else?
> 
> Wei, could you rewrite the commit message with deferred_split_huge_page()
> as a race source instead of split_huge_page_to_list()?
> 

I think describing the race in terms of deferred_split_huge_page() makes
the most sense and I'd prefer a cc to stable for 5.4+.  Even getting
split_queue_len, which is an unsigned long, to underflow because of a
list_empty(page_deferred_list()) check that is no longer accurate after
the lock is taken would be a significant issue for shrinkers.
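
To make the underflow concrete (a trivial userspace illustration, not kernel
code): because the counter is unsigned, one extra decrement leaves it near
2^64 rather than at -1, which is roughly the amount of outstanding work a
shrinker basing its count on this field would then see.

#include <stdio.h>

int main(void)
{
	unsigned long split_queue_len = 0;

	split_queue_len--;	/* one stale-check decrement too many */
	printf("%lu\n", split_queue_len);	/* 18446744073709551615 on 64-bit */
	return 0;
}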
Wei Yang Jan. 15, 2020, 1:07 a.m. UTC | #13
On Tue, Jan 14, 2020 at 01:59:21PM +0300, Kirill A. Shutemov wrote:
>On Tue, Jan 14, 2020 at 01:31:12PM +0300, Kirill A. Shutemov wrote:
>> On Tue, Jan 14, 2020 at 10:31:22AM +0100, Michal Hocko wrote:
>> > On Sat 11-01-20 03:03:52, Kirill A. Shutemov wrote:
>> > > On Thu, Jan 09, 2020 at 10:30:54PM +0800, Wei Yang wrote:
>> > > > As all the other places, we grab the lock before manipulate the defer list.
>> > > > Current implementation may face a race condition.
>> > > > 
>> > > > For example, the potential race would be:
>> > > > 
>> > > >     CPU1                      CPU2
>> > > >     mem_cgroup_move_account   split_huge_page_to_list
>> > > >       !list_empty
>> > > >                                 lock
>> > > >                                 !list_empty
>> > > >                                 list_del
>> > > >                                 unlock
>> > > >       lock
>> > > >       # !list_empty might not hold anymore
>> > > >       list_del_init
>> > > >       unlock
>> > > 
>> > > I don't think this particular race is possible. Both parties take page
>> > > lock before messing with deferred queue, but anytway:
>> > > 
>> > > Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> > 
>> > I am confused, if the above race is not possible then what would be a
>> > real race? We really do not want to have a patch with a misleading
>> > changelog, do we?
>> 
>> The alternative is to make sure that all page_deferred_list() called with
>> page lock taken.
>> 
>> I'll look into it.
>
>split_huge_page_to_list() has page lock taken.
>
>free_transhuge_page() is in the free path and doesn't susceptible to the
>race.
>
>deferred_split_scan() is trickier. list_move() should be safe against
>list_empty() as it will not produce false-positive list_empty().
>list_del_init() *should* (correct me if I'm wrong) be safe because the page
>is freeing and memcg will not touch the page anymore.
>
>deferred_split_huge_page() is a problematic one. It called from
>page_remove_rmap() path witch does require page lock. I don't see any
>obvious way to exclude race with mem_cgroup_move_account() here.
>Anybody else?

If my understanding is correct, the reason is that deferred_split_huge_page()
doesn't have the page lock taken, right?

>
>Wei, could you rewrite the commit message with deferred_split_huge_page()
>as a race source instead of split_huge_page_to_list()?
>
>-- 
> Kirill A. Shutemov
Wei Yang Jan. 15, 2020, 1:19 a.m. UTC | #14
On Tue, Jan 14, 2020 at 12:57:22PM -0800, David Rientjes wrote:
>On Tue, 14 Jan 2020, Kirill A. Shutemov wrote:
>
>> split_huge_page_to_list() has page lock taken.
>> 
>> free_transhuge_page() is in the free path and doesn't susceptible to the
>> race.
>> 
>> deferred_split_scan() is trickier. list_move() should be safe against
>> list_empty() as it will not produce false-positive list_empty().
>> list_del_init() *should* (correct me if I'm wrong) be safe because the page
>> is freeing and memcg will not touch the page anymore.
>> 
>> deferred_split_huge_page() is a problematic one. It called from
>> page_remove_rmap() path witch does require page lock. I don't see any
>> obvious way to exclude race with mem_cgroup_move_account() here.
>> Anybody else?
>> 
>> Wei, could you rewrite the commit message with deferred_split_huge_page()
>> as a race source instead of split_huge_page_to_list()?
>> 
>
>I think describing the race in terms of deferred_split_huge_page() makes 
>the most sense and I'd prefer a cc to stable for 5.4+.  Even getting the 
>split_queue_len, which is unsigned long, to underflow because of a 
>list_empty(page_deferred_list()) check that is no longer accurate after 
>the lock is taken would be a significant issue for shrinkers.

Oh, you are right. Even if the list is not corrupted between
deferred_split_scan() and mem_cgroup_move_account(), split_queue_len would be
left in a wrong state.

Hmm... to some extent, the page lock complicates the picture a little here,
even though it helps in some cases.
David Rientjes Jan. 15, 2020, 3:26 a.m. UTC | #15
On Wed, 15 Jan 2020, Wei Yang wrote:

> >split_huge_page_to_list() has page lock taken.
> >
> >free_transhuge_page() is in the free path and doesn't susceptible to the
> >race.
> >
> >deferred_split_scan() is trickier. list_move() should be safe against
> >list_empty() as it will not produce false-positive list_empty().
> >list_del_init() *should* (correct me if I'm wrong) be safe because the page
> >is freeing and memcg will not touch the page anymore.
> >
> >deferred_split_huge_page() is a problematic one. It called from
> >page_remove_rmap() path witch does require page lock. I don't see any
> >obvious way to exclude race with mem_cgroup_move_account() here.
> >Anybody else?
> 
> If my understanding is correct, the reason is deferred_split_huge_page()
> doesn't has page lock taken, right?
> 

I think the fix that you have proposed has inspired some deeper looks at
the locking around the deferred split queue, and the hope was that perhaps
this could be protected by the page lock, but it was found that in at least
one path it isn't taken.  So I believe your fix is still needed and any
possible optimizations in this area can be proposed on top.

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bc01423277c5..1492eefe4f3c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5368,10 +5368,12 @@  static int mem_cgroup_move_account(struct page *page,
 	}
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (compound && !list_empty(page_deferred_list(page))) {
+	if (compound) {
 		spin_lock(&from->deferred_split_queue.split_queue_lock);
-		list_del_init(page_deferred_list(page));
-		from->deferred_split_queue.split_queue_len--;
+		if (!list_empty(page_deferred_list(page))) {
+			list_del_init(page_deferred_list(page));
+			from->deferred_split_queue.split_queue_len--;
+		}
 		spin_unlock(&from->deferred_split_queue.split_queue_lock);
 	}
 #endif
@@ -5385,11 +5387,13 @@  static int mem_cgroup_move_account(struct page *page,
 	page->mem_cgroup = to;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (compound && list_empty(page_deferred_list(page))) {
+	if (compound) {
 		spin_lock(&to->deferred_split_queue.split_queue_lock);
-		list_add_tail(page_deferred_list(page),
-			      &to->deferred_split_queue.split_queue);
-		to->deferred_split_queue.split_queue_len++;
+		if (list_empty(page_deferred_list(page))) {
+			list_add_tail(page_deferred_list(page),
+				      &to->deferred_split_queue.split_queue);
+			to->deferred_split_queue.split_queue_len++;
+		}
 		spin_unlock(&to->deferred_split_queue.split_queue_lock);
 	}
 #endif