diff mbox series

mm, thp: relax migration wait when failed to get tail page

Message ID bc8567d7a2c08ab6fdbb8e94008157265d5d28a3.1622564942.git.xuyu@linux.alibaba.com (mailing list archive)
State New, archived
Headers show
Series mm, thp: relax migration wait when failed to get tail page | expand

Commit Message

Xu Yu June 1, 2021, 4:31 p.m. UTC
We notice that hung task happens in a conner but practical scenario when
CONFIG_PREEMPT_NONE is enabled, as follows.

Process 0                       Process 1                     Process 2..Inf
split_huge_page_to_list
    unmap_page
        split_huge_pmd_address
                                __migration_entry_wait(head)
                                                              __migration_entry_wait(tail)
    remap_page (roll back)
        remove_migration_ptes
            rmap_walk_anon
                cond_resched

Where __migration_entry_wait(tail) is occurred in kernel space, e.g.,
copy_to_user, which will immediately fault again without rescheduling,
and thus occupy the cpu fully.

When there are too many processes performing __migration_entry_wait on
tail page, remap_page will never be done after cond_resched.

This relaxes __migration_entry_wait on tail page, thus gives remap_page
a chance to complete.

Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
---
 mm/migrate.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Comments

Matthew Wilcox June 1, 2021, 4:49 p.m. UTC | #1
On Wed, Jun 02, 2021 at 12:31:41AM +0800, Xu Yu wrote:
> We notice that hung task happens in a conner but practical scenario when
> CONFIG_PREEMPT_NONE is enabled, as follows.
> 
> Process 0                       Process 1                     Process 2..Inf
> split_huge_page_to_list
>     unmap_page
>         split_huge_pmd_address
>                                 __migration_entry_wait(head)
>                                                               __migration_entry_wait(tail)
>     remap_page (roll back)
>         remove_migration_ptes
>             rmap_walk_anon
>                 cond_resched
> 
> Where __migration_entry_wait(tail) is occurred in kernel space, e.g.,
> copy_to_user, which will immediately fault again without rescheduling,
> and thus occupy the cpu fully.
> 
> When there are too many processes performing __migration_entry_wait on
> tail page, remap_page will never be done after cond_resched.
> 
> This relaxes __migration_entry_wait on tail page, thus gives remap_page
> a chance to complete.

I don't like this at all.  If only we could call wait_on_page_locked()
without having a reference to the page.

I /think/ the only obstacle to doing that is that Slab may reuse
the PG_locked bit without following the pagecache wakeup protocol.
That wouldn't be a hard thing to change; I think it's only slub that
uses that bit, and it really could use any bit in the page flags.

> Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
> ---
>  mm/migrate.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index b234c3f3acb7..df2dc39fe566 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -301,8 +301,11 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
>  	 * is zero; but we must not call put_and_wait_on_page_locked() without
>  	 * a ref. Use get_page_unless_zero(), and just fault again if it fails.
>  	 */
> -	if (!get_page_unless_zero(page))
> -		goto out;
> +	if (!get_page_unless_zero(page)) {
> +		pte_unmap_unlock(ptep, ptl);
> +		cond_resched();
> +		return;
> +	}
>  	pte_unmap_unlock(ptep, ptl);
>  	put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE);
>  	return;
> -- 
> 2.20.1.2432.ga663e714
> 
>
Hugh Dickins June 1, 2021, 4:55 p.m. UTC | #2
On Wed, 2 Jun 2021, Xu Yu wrote:

> We notice that hung task happens in a conner but practical scenario when
> CONFIG_PREEMPT_NONE is enabled, as follows.
> 
> Process 0                       Process 1                     Process 2..Inf
> split_huge_page_to_list
>     unmap_page
>         split_huge_pmd_address
>                                 __migration_entry_wait(head)
>                                                               __migration_entry_wait(tail)
>     remap_page (roll back)
>         remove_migration_ptes
>             rmap_walk_anon
>                 cond_resched
> 
> Where __migration_entry_wait(tail) is occurred in kernel space, e.g.,
> copy_to_user, which will immediately fault again without rescheduling,
> and thus occupy the cpu fully.
> 
> When there are too many processes performing __migration_entry_wait on
> tail page, remap_page will never be done after cond_resched.
> 
> This relaxes __migration_entry_wait on tail page, thus gives remap_page
> a chance to complete.
> 
> Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>

Well caught: you're absolutely right that there's a bug there.
But isn't cond_resched() just papering over the real bug, and
what it should do is a "page = compound_head(page);" before the
get_page_unless_zero()? How does that work out in your testing?

Hugh

> ---
>  mm/migrate.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index b234c3f3acb7..df2dc39fe566 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -301,8 +301,11 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
>  	 * is zero; but we must not call put_and_wait_on_page_locked() without
>  	 * a ref. Use get_page_unless_zero(), and just fault again if it fails.
>  	 */
> -	if (!get_page_unless_zero(page))
> -		goto out;
> +	if (!get_page_unless_zero(page)) {
> +		pte_unmap_unlock(ptep, ptl);
> +		cond_resched();
> +		return;
> +	}
>  	pte_unmap_unlock(ptep, ptl);
>  	put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE);
>  	return;
> -- 
> 2.20.1.2432.ga663e714
> 
> 
>
Matthew Wilcox June 1, 2021, 5:11 p.m. UTC | #3
On Tue, Jun 01, 2021 at 09:55:56AM -0700, Hugh Dickins wrote:
> On Wed, 2 Jun 2021, Xu Yu wrote:
> 
> > We notice that hung task happens in a conner but practical scenario when
> > CONFIG_PREEMPT_NONE is enabled, as follows.
> > 
> > Process 0                       Process 1                     Process 2..Inf
> > split_huge_page_to_list
> >     unmap_page
> >         split_huge_pmd_address
> >                                 __migration_entry_wait(head)
> >                                                               __migration_entry_wait(tail)
> >     remap_page (roll back)
> >         remove_migration_ptes
> >             rmap_walk_anon
> >                 cond_resched
> > 
> > Where __migration_entry_wait(tail) is occurred in kernel space, e.g.,
> > copy_to_user, which will immediately fault again without rescheduling,
> > and thus occupy the cpu fully.
> > 
> > When there are too many processes performing __migration_entry_wait on
> > tail page, remap_page will never be done after cond_resched.
> > 
> > This relaxes __migration_entry_wait on tail page, thus gives remap_page
> > a chance to complete.
> > 
> > Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
> > Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
> 
> Well caught: you're absolutely right that there's a bug there.
> But isn't cond_resched() just papering over the real bug, and
> what it should do is a "page = compound_head(page);" before the
> get_page_unless_zero()? How does that work out in your testing?

You do realise you're strengthening my case for folios by suggesting
that, don't you?  ;-)

I was going to suggest that it won't make any difference because the
page reference count is frozen, but the freezing happens after the call
to unmap_page(), so it may make a difference.
Hugh Dickins June 1, 2021, 7:10 p.m. UTC | #4
On Tue, 1 Jun 2021, Matthew Wilcox wrote:
> On Tue, Jun 01, 2021 at 09:55:56AM -0700, Hugh Dickins wrote:
> > 
> > Well caught: you're absolutely right that there's a bug there.
> > But isn't cond_resched() just papering over the real bug, and
> > what it should do is a "page = compound_head(page);" before the
> > get_page_unless_zero()? How does that work out in your testing?
> 
> You do realise you're strengthening my case for folios by suggesting
> that, don't you?  ;-)

Hah! Well, I do realize that I'm offering you a marketing opportunity.
And you won't believe how many patches I dread to post for fear of that ;-)

But I'm not so sure that it strengthens your case: apparently folios
had not detected this?  Or do you have a hoard of folio-detected fixes
waiting for the day, and a folio-kit for each of the stable releases?

> 
> I was going to suggest that it won't make any difference because the
> page reference count is frozen, but the freezing happens after the call
> to unmap_page(), so it may make a difference.

I think that's a good point: I may have just jumped on the missing
compound_head(), without thinking it through as far as you have.

I'm having trouble remembering the dynamics now; but I think there
are cond_resched()s in the unmap_page() part, so the splitter may
get preempted even on a non-preempt kernel; whereas the frozen
part is all done expeditiously, with interrupts disabled.

Greg discovered the same issue recently, but we all got sidetracked,
and I don't know where his investigation ended up.  He was in favour
of cond_resched(), I was in favour of compound_head(); and I think I
was coming to the conclusion that if cond_resched() is needed, it
should not be there in __migration_entry_wait(), but somewhere up
in mm/gup.c, so that other faults that retry, expecting to reschedule
on return to userspace, do not get trapped in kernelspace this way.

Waiting on migration entries from THP splitting is an egregious
example, but others may be suffering too.

Hugh
Matthew Wilcox June 1, 2021, 8:27 p.m. UTC | #5
On Tue, Jun 01, 2021 at 12:10:48PM -0700, Hugh Dickins wrote:
> On Tue, 1 Jun 2021, Matthew Wilcox wrote:
> > On Tue, Jun 01, 2021 at 09:55:56AM -0700, Hugh Dickins wrote:
> > > 
> > > Well caught: you're absolutely right that there's a bug there.
> > > But isn't cond_resched() just papering over the real bug, and
> > > what it should do is a "page = compound_head(page);" before the
> > > get_page_unless_zero()? How does that work out in your testing?
> > 
> > You do realise you're strengthening my case for folios by suggesting
> > that, don't you?  ;-)
> 
> Hah! Well, I do realize that I'm offering you a marketing opportunity.
> And you won't believe how many patches I dread to post for fear of that ;-)
> 
> But I'm not so sure that it strengthens your case: apparently folios
> had not detected this?  Or do you have a hoard of folio-detected fixes
> waiting for the day, and a folio-kit for each of the stable releases?

Oh, I wish!  I haven't been working on converting the
migration code to use folios.  If I'd taken the step to convert
put_and_wait_on_page_locked() to folio_put_and_wait_locked(), I would
have fixed the bug, but I'm not sure I would have noticed that it was
fixing a bug.  I would have probably converted migration_entry_to_page()
to be migration_entry_to_folio() and just inadvertently fixed it.

> > I was going to suggest that it won't make any difference because the
> > page reference count is frozen, but the freezing happens after the call
> > to unmap_page(), so it may make a difference.
> 
> I think that's a good point: I may have just jumped on the missing
> compound_head(), without thinking it through as far as you have.
> 
> I'm having trouble remembering the dynamics now; but I think there
> are cond_resched()s in the unmap_page() part, so the splitter may
> get preempted even on a non-preempt kernel; whereas the frozen
> part is all done expeditiously, with interrupts disabled.

That would certainly make a difference.
Xu Yu June 2, 2021, 3:27 a.m. UTC | #6
On 6/2/21 3:10 AM, Hugh Dickins wrote:
> On Tue, 1 Jun 2021, Matthew Wilcox wrote:
>> On Tue, Jun 01, 2021 at 09:55:56AM -0700, Hugh Dickins wrote:
>>>
>>> Well caught: you're absolutely right that there's a bug there.
>>> But isn't cond_resched() just papering over the real bug, and
>>> what it should do is a "page = compound_head(page);" before the
>>> get_page_unless_zero()? How does that work out in your testing?
>>
>> You do realise you're strengthening my case for folios by suggesting
>> that, don't you?  ;-)
> 
> Hah! Well, I do realize that I'm offering you a marketing opportunity.
> And you won't believe how many patches I dread to post for fear of that ;-)
> 
> But I'm not so sure that it strengthens your case: apparently folios
> had not detected this?  Or do you have a hoard of folio-detected fixes
> waiting for the day, and a folio-kit for each of the stable releases?
> 
>>
>> I was going to suggest that it won't make any difference because the
>> page reference count is frozen, but the freezing happens after the call
>> to unmap_page(), so it may make a difference.
> 
> I think that's a good point: I may have just jumped on the missing
> compound_head(), without thinking it through as far as you have.
> 
> I'm having trouble remembering the dynamics now; but I think there
> are cond_resched()s in the unmap_page() part, so the splitter may
> get preempted even on a non-preempt kernel; whereas the frozen
> part is all done expeditiously, with interrupts disabled.
> 
> Greg discovered the same issue recently, but we all got sidetracked,
> and I don't know where his investigation ended up.  He was in favour
> of cond_resched(), I was in favour of compound_head(); and I think I

I ever considered about using compound_head, but isn't there another
race that, the following put_and_wait_on_page_locked operates on the
"tail page" which has been split and is now a single page?

Anyway, I will test and verify compound_head.

> was coming to the conclusion that if cond_resched() is needed, it
> should not be there in __migration_entry_wait(), but somewhere up
> in mm/gup.c, so that other faults that retry, expecting to reschedule
> on return to userspace, do not get trapped in kernelspace this way.

Agreed. I will send v2, if cond_resched is still an option.

> 
> Waiting on migration entries from THP splitting is an egregious
> example, but others may be suffering too.
> 
> Hugh
>
Matthew Wilcox June 2, 2021, 11:58 a.m. UTC | #7
On Wed, Jun 02, 2021 at 11:27:47AM +0800, Yu Xu wrote:
> On 6/2/21 3:10 AM, Hugh Dickins wrote:
> > On Tue, 1 Jun 2021, Matthew Wilcox wrote:
> > > On Tue, Jun 01, 2021 at 09:55:56AM -0700, Hugh Dickins wrote:
> > > > 
> > > > Well caught: you're absolutely right that there's a bug there.
> > > > But isn't cond_resched() just papering over the real bug, and
> > > > what it should do is a "page = compound_head(page);" before the
> > > > get_page_unless_zero()? How does that work out in your testing?
> > > 
> > > You do realise you're strengthening my case for folios by suggesting
> > > that, don't you?  ;-)
> > 
> > Hah! Well, I do realize that I'm offering you a marketing opportunity.
> > And you won't believe how many patches I dread to post for fear of that ;-)
> > 
> > But I'm not so sure that it strengthens your case: apparently folios
> > had not detected this?  Or do you have a hoard of folio-detected fixes
> > waiting for the day, and a folio-kit for each of the stable releases?
> > 
> > > 
> > > I was going to suggest that it won't make any difference because the
> > > page reference count is frozen, but the freezing happens after the call
> > > to unmap_page(), so it may make a difference.
> > 
> > I think that's a good point: I may have just jumped on the missing
> > compound_head(), without thinking it through as far as you have.
> > 
> > I'm having trouble remembering the dynamics now; but I think there
> > are cond_resched()s in the unmap_page() part, so the splitter may
> > get preempted even on a non-preempt kernel; whereas the frozen
> > part is all done expeditiously, with interrupts disabled.
> > 
> > Greg discovered the same issue recently, but we all got sidetracked,
> > and I don't know where his investigation ended up.  He was in favour
> > of cond_resched(), I was in favour of compound_head(); and I think I
> 
> I ever considered about using compound_head, but isn't there another
> race that, the following put_and_wait_on_page_locked operates on the
> "tail page" which has been split and is now a single page?

No, having your own reference on a page prevents the page from being
split.  But that's a good question to ask.
Xu Yu June 2, 2021, 12:59 p.m. UTC | #8
On 6/2/21 7:58 PM, Matthew Wilcox wrote:
> On Wed, Jun 02, 2021 at 11:27:47AM +0800, Yu Xu wrote:
>> On 6/2/21 3:10 AM, Hugh Dickins wrote:
>>> On Tue, 1 Jun 2021, Matthew Wilcox wrote:
>>>> On Tue, Jun 01, 2021 at 09:55:56AM -0700, Hugh Dickins wrote:
>>>>>
>>>>> Well caught: you're absolutely right that there's a bug there.
>>>>> But isn't cond_resched() just papering over the real bug, and
>>>>> what it should do is a "page = compound_head(page);" before the
>>>>> get_page_unless_zero()? How does that work out in your testing?
>>>>
>>>> You do realise you're strengthening my case for folios by suggesting
>>>> that, don't you?  ;-)
>>>
>>> Hah! Well, I do realize that I'm offering you a marketing opportunity.
>>> And you won't believe how many patches I dread to post for fear of that ;-)
>>>
>>> But I'm not so sure that it strengthens your case: apparently folios
>>> had not detected this?  Or do you have a hoard of folio-detected fixes
>>> waiting for the day, and a folio-kit for each of the stable releases?
>>>
>>>>
>>>> I was going to suggest that it won't make any difference because the
>>>> page reference count is frozen, but the freezing happens after the call
>>>> to unmap_page(), so it may make a difference.
>>>
>>> I think that's a good point: I may have just jumped on the missing
>>> compound_head(), without thinking it through as far as you have.
>>>
>>> I'm having trouble remembering the dynamics now; but I think there
>>> are cond_resched()s in the unmap_page() part, so the splitter may
>>> get preempted even on a non-preempt kernel; whereas the frozen
>>> part is all done expeditiously, with interrupts disabled.
>>>
>>> Greg discovered the same issue recently, but we all got sidetracked,
>>> and I don't know where his investigation ended up.  He was in favour
>>> of cond_resched(), I was in favour of compound_head(); and I think I
>>
>> I ever considered about using compound_head, but isn't there another
>> race that, the following put_and_wait_on_page_locked operates on the
>> "tail page" which has been split and is now a single page?
> 
> No, having your own reference on a page prevents the page from being
> split.  But that's a good question to ask.

Thanks. I have recalled that head page is frozen when splitting THP.
Xu Yu June 2, 2021, 1:20 p.m. UTC | #9
On 6/2/21 12:55 AM, Hugh Dickins wrote:
> On Wed, 2 Jun 2021, Xu Yu wrote:
> 
>> We notice that hung task happens in a conner but practical scenario when
>> CONFIG_PREEMPT_NONE is enabled, as follows.
>>
>> Process 0                       Process 1                     Process 2..Inf
>> split_huge_page_to_list
>>      unmap_page
>>          split_huge_pmd_address
>>                                  __migration_entry_wait(head)
>>                                                                __migration_entry_wait(tail)
>>      remap_page (roll back)
>>          remove_migration_ptes
>>              rmap_walk_anon
>>                  cond_resched
>>
>> Where __migration_entry_wait(tail) is occurred in kernel space, e.g.,
>> copy_to_user, which will immediately fault again without rescheduling,
>> and thus occupy the cpu fully.
>>
>> When there are too many processes performing __migration_entry_wait on
>> tail page, remap_page will never be done after cond_resched.
>>
>> This relaxes __migration_entry_wait on tail page, thus gives remap_page
>> a chance to complete.
>>
>> Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
>> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
> 
> Well caught: you're absolutely right that there's a bug there.
> But isn't cond_resched() just papering over the real bug, and
> what it should do is a "page = compound_head(page);" before the
> get_page_unless_zero()? How does that work out in your testing?

compound_head works. The patched kernel is alive for hours under
our reproducer, which usually makes the vanilla kernel hung after
tens of minutes at most.

If we use compound_head, the behavior of __migration_entry_wait(tail)
changes from "retry fault" to "prevent THP from being split". Is that
right?  Then which is preferred? If it were me, I would prefer "retry
fault".


> 
> Hugh
> 
>> ---
>>   mm/migrate.c | 7 +++++--
>>   1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index b234c3f3acb7..df2dc39fe566 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -301,8 +301,11 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
>>   	 * is zero; but we must not call put_and_wait_on_page_locked() without
>>   	 * a ref. Use get_page_unless_zero(), and just fault again if it fails.
>>   	 */
>> -	if (!get_page_unless_zero(page))
>> -		goto out;
>> +	if (!get_page_unless_zero(page)) {
>> +		pte_unmap_unlock(ptep, ptl);
>> +		cond_resched();
>> +		return;
>> +	}
>>   	pte_unmap_unlock(ptep, ptl);
>>   	put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE);
>>   	return;
>> -- 
>> 2.20.1.2432.ga663e714
>>
>>
>>
Hugh Dickins June 2, 2021, 3:57 p.m. UTC | #10
On Wed, 2 Jun 2021, Yu Xu wrote:
> On 6/2/21 12:55 AM, Hugh Dickins wrote:
> > On Wed, 2 Jun 2021, Xu Yu wrote:
> > 
> > > We notice that hung task happens in a conner but practical scenario when
> > > CONFIG_PREEMPT_NONE is enabled, as follows.
> > > 
> > > Process 0                       Process 1                     Process
> > > 2..Inf
> > > split_huge_page_to_list
> > >      unmap_page
> > >          split_huge_pmd_address
> > >                                  __migration_entry_wait(head)
> > >                                                                __migration_entry_wait(tail)
> > >      remap_page (roll back)
> > >          remove_migration_ptes
> > >              rmap_walk_anon
> > >                  cond_resched
> > > 
> > > Where __migration_entry_wait(tail) is occurred in kernel space, e.g.,
> > > copy_to_user, which will immediately fault again without rescheduling,
> > > and thus occupy the cpu fully.
> > > 
> > > When there are too many processes performing __migration_entry_wait on
> > > tail page, remap_page will never be done after cond_resched.
> > > 
> > > This relaxes __migration_entry_wait on tail page, thus gives remap_page
> > > a chance to complete.
> > > 
> > > Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
> > > Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
> > 
> > Well caught: you're absolutely right that there's a bug there.
> > But isn't cond_resched() just papering over the real bug, and
> > what it should do is a "page = compound_head(page);" before the
> > get_page_unless_zero()? How does that work out in your testing?
> 
> compound_head works. The patched kernel is alive for hours under
> our reproducer, which usually makes the vanilla kernel hung after
> tens of minutes at most.

Oh, that's good news, thanks.

(It's still likely that a well-placed cond_resched() somewhere in
mm/gup.c would also be a good idea, but none of us have yet got
around to identifying where.)

> 
> If we use compound_head, the behavior of __migration_entry_wait(tail)
> changes from "retry fault" to "prevent THP from being split". Is that
> right?  Then which is preferred? If it were me, I would prefer "retry
> fault".

As Matthew remarked, you are asking very good questions, and split
migration entries are difficult to think about.  But I believe you'll
find it works out okay.

The point of *put_and_* wait_on_page_locked() is that it does drop
the page reference you acquired with get_page_unless_zero, as soon
as the page is on the wait queue, before actually waiting.

So splitting the THP is only prevented for a brief interval.  Now,
it's true that if there are very many tasks faulting on portions
of the huge page, in that interval between inserting the migration
entries and freezing the huge page's refcount to 0, they can reduce
the chance of splitting considerably.  But that's not an excuse for 
for doing get_page_unless_zero() on the wrong thing, as it was doing.

Hugh
Xu Yu June 7, 2021, 7:24 a.m. UTC | #11
On 6/2/21 11:57 PM, Hugh Dickins wrote:
> On Wed, 2 Jun 2021, Yu Xu wrote:
>> On 6/2/21 12:55 AM, Hugh Dickins wrote:
>>> On Wed, 2 Jun 2021, Xu Yu wrote:
>>>
>>>> We notice that hung task happens in a conner but practical scenario when
>>>> CONFIG_PREEMPT_NONE is enabled, as follows.
>>>>
>>>> Process 0                       Process 1                     Process
>>>> 2..Inf
>>>> split_huge_page_to_list
>>>>       unmap_page
>>>>           split_huge_pmd_address
>>>>                                   __migration_entry_wait(head)
>>>>                                                                 __migration_entry_wait(tail)
>>>>       remap_page (roll back)
>>>>           remove_migration_ptes
>>>>               rmap_walk_anon
>>>>                   cond_resched
>>>>
>>>> Where __migration_entry_wait(tail) is occurred in kernel space, e.g.,
>>>> copy_to_user, which will immediately fault again without rescheduling,
>>>> and thus occupy the cpu fully.
>>>>
>>>> When there are too many processes performing __migration_entry_wait on
>>>> tail page, remap_page will never be done after cond_resched.
>>>>
>>>> This relaxes __migration_entry_wait on tail page, thus gives remap_page
>>>> a chance to complete.
>>>>
>>>> Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
>>>> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
>>>
>>> Well caught: you're absolutely right that there's a bug there.
>>> But isn't cond_resched() just papering over the real bug, and
>>> what it should do is a "page = compound_head(page);" before the
>>> get_page_unless_zero()? How does that work out in your testing?
>>
>> compound_head works. The patched kernel is alive for hours under
>> our reproducer, which usually makes the vanilla kernel hung after
>> tens of minutes at most.
> 
> Oh, that's good news, thanks.
> 
> (It's still likely that a well-placed cond_resched() somewhere in
> mm/gup.c would also be a good idea, but none of us have yet got
> around to identifying where.)

We neither. If really have to do it outside of __migration_entry_wait,
return value of __migration_entry_wait is needed, and many related
functions have to updated, which may be undesirable.

> 
>>
>> If we use compound_head, the behavior of __migration_entry_wait(tail)
>> changes from "retry fault" to "prevent THP from being split". Is that
>> right?  Then which is preferred? If it were me, I would prefer "retry
>> fault".
> 
> As Matthew remarked, you are asking very good questions, and split
> migration entries are difficult to think about.  But I believe you'll
> find it works out okay.
> 
> The point of *put_and_* wait_on_page_locked() is that it does drop
> the page reference you acquired with get_page_unless_zero, as soon
> as the page is on the wait queue, before actually waiting.
> 
> So splitting the THP is only prevented for a brief interval.  Now,
> it's true that if there are very many tasks faulting on portions
> of the huge page, in that interval between inserting the migration
> entries and freezing the huge page's refcount to 0, they can reduce
> the chance of splitting considerably.  But that's not an excuse for
> for doing get_page_unless_zero() on the wrong thing, as it was doing.

We finally come to your solution, i.e., compound_head.

In that case, who should resend the compound_head patch to this issue?
shall we do with your s.o.b?

> 
> Hugh
>
Hugh Dickins June 8, 2021, 4:44 a.m. UTC | #12
On Mon, 7 Jun 2021, Yu Xu wrote:
> On 6/2/21 11:57 PM, Hugh Dickins wrote:
> > On Wed, 2 Jun 2021, Yu Xu wrote:
> >> On 6/2/21 12:55 AM, Hugh Dickins wrote:
> >>> On Wed, 2 Jun 2021, Xu Yu wrote:
> >>>
> >>>> We notice that hung task happens in a conner but practical scenario when
> >>>> CONFIG_PREEMPT_NONE is enabled, as follows.
> >>>>
> >>>> Process 0                       Process 1                     Process
> >>>> 2..Inf
> >>>> split_huge_page_to_list
> >>>>       unmap_page
> >>>>           split_huge_pmd_address
> >>>>                                   __migration_entry_wait(head)
> >>>>                                                                 __migration_entry_wait(tail)
> >>>>       remap_page (roll back)
> >>>>           remove_migration_ptes
> >>>>               rmap_walk_anon
> >>>>                   cond_resched
> >>>>
> >>>> Where __migration_entry_wait(tail) is occurred in kernel space, e.g.,
> >>>> copy_to_user, which will immediately fault again without rescheduling,
> >>>> and thus occupy the cpu fully.
> >>>>
> >>>> When there are too many processes performing __migration_entry_wait on
> >>>> tail page, remap_page will never be done after cond_resched.
> >>>>
> >>>> This relaxes __migration_entry_wait on tail page, thus gives remap_page
> >>>> a chance to complete.
> >>>>
> >>>> Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
> >>>> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
> >>>
> >>> Well caught: you're absolutely right that there's a bug there.
> >>> But isn't cond_resched() just papering over the real bug, and
> >>> what it should do is a "page = compound_head(page);" before the
> >>> get_page_unless_zero()? How does that work out in your testing?
> >>
> >> compound_head works. The patched kernel is alive for hours under
> >> our reproducer, which usually makes the vanilla kernel hung after
> >> tens of minutes at most.
> > 
> > Oh, that's good news, thanks.
> > 
> > (It's still likely that a well-placed cond_resched() somewhere in
> > mm/gup.c would also be a good idea, but none of us have yet got
> > around to identifying where.)
> 
> We neither. If really have to do it outside of __migration_entry_wait,
> return value of __migration_entry_wait is needed, and many related
> functions have to updated, which may be undesirable.

No, it would not be necessary to plumb through a return value from
__migration_entry_wait(): I didn't mean that this GUP cond_resched()
should be done only for the migration case, but (I guess) on any path
where handle_mm_fault() returns "success" for a retry, yet the retry
of follow_page_mask() fails.

But now that I look, I see there is already a cond_resched() there!

So I'm puzzled as to how your cond_resched() in __migration_entry_wait()
appeared to help - well, you never actually said that it helped, but I
assume that it did, or you wouldn't have bothered to send that patch?

It's irrelevant, now that we've admitted there should be a
"page = compound_head(page)" in there, and you have said that helps,
and that's the patch we want to send now.  But it troubles me, to be
unable to explain it.  Two cond_resched()s are not twice as good as one.

> 
> > 
> >>
> >> If we use compound_head, the behavior of __migration_entry_wait(tail)
> >> changes from "retry fault" to "prevent THP from being split". Is that
> >> right?  Then which is preferred? If it were me, I would prefer "retry
> >> fault".
> > 
> > As Matthew remarked, you are asking very good questions, and split
> > migration entries are difficult to think about.  But I believe you'll
> > find it works out okay.
> > 
> > The point of *put_and_* wait_on_page_locked() is that it does drop
> > the page reference you acquired with get_page_unless_zero, as soon
> > as the page is on the wait queue, before actually waiting.
> > 
> > So splitting the THP is only prevented for a brief interval.  Now,
> > it's true that if there are very many tasks faulting on portions
> > of the huge page, in that interval between inserting the migration
> > entries and freezing the huge page's refcount to 0, they can reduce
> > the chance of splitting considerably.  But that's not an excuse for
> > for doing get_page_unless_zero() on the wrong thing, as it was doing.
> 
> We finally come to your solution, i.e., compound_head.
> 
> In that case, who should resend the compound_head patch to this issue?
> shall we do with your s.o.b?

I was rather expecting you to send the patch: with your s.o.b, not mine.
You could say "Suggested-by: Hugh Dickins <hughd@google.com>" if you like.

And I suggest that you put that "page = compound_head(page);" line
immediately after the "page = migration_entry_to_page(entry);" line,
so as not to interfere with the comment above get_page_unless_zero().

(No need for a comment on the compound_head(): it's self-explanatory.)

I did meanwhile research other callers of migration_entry_to_page():
it had been on my mind, that others might need a compound_head() too,
and perhaps it should be done inside migration_entry_to_page() itself.

But so far as I can tell (I don't really know about the s390 one),
the others are okay, and it would just be unnecessary overhead
(in particular, the mm_counter() stuff looks correct on a tail).

I *think* the right Fixes tag would be
Fixes: ba98828088ad ("thp: add option to setup migration entries during PMD split")
though I'm not sure of that; it's probably good enough.

(With all this direction, I did wonder if it would be kinder just to
send a patch myself, but using some of your comments: but I didn't
understand "conner" in your description, so couldn't do that.)

Thanks!
Hugh
Xu Yu June 8, 2021, 5:43 a.m. UTC | #13
On 6/8/21 12:44 PM, Hugh Dickins wrote:
> On Mon, 7 Jun 2021, Yu Xu wrote:
>> On 6/2/21 11:57 PM, Hugh Dickins wrote:
>>> On Wed, 2 Jun 2021, Yu Xu wrote:
>>>> On 6/2/21 12:55 AM, Hugh Dickins wrote:
>>>>> On Wed, 2 Jun 2021, Xu Yu wrote:
>>>>>
>>>>>> We notice that hung task happens in a conner but practical scenario when
>>>>>> CONFIG_PREEMPT_NONE is enabled, as follows.
>>>>>>
>>>>>> Process 0                       Process 1                     Process
>>>>>> 2..Inf
>>>>>> split_huge_page_to_list
>>>>>>        unmap_page
>>>>>>            split_huge_pmd_address
>>>>>>                                    __migration_entry_wait(head)
>>>>>>                                                                  __migration_entry_wait(tail)
>>>>>>        remap_page (roll back)
>>>>>>            remove_migration_ptes
>>>>>>                rmap_walk_anon
>>>>>>                    cond_resched
>>>>>>
>>>>>> Where __migration_entry_wait(tail) is occurred in kernel space, e.g.,
>>>>>> copy_to_user, which will immediately fault again without rescheduling,
>>>>>> and thus occupy the cpu fully.
>>>>>>
>>>>>> When there are too many processes performing __migration_entry_wait on
>>>>>> tail page, remap_page will never be done after cond_resched.
>>>>>>
>>>>>> This relaxes __migration_entry_wait on tail page, thus gives remap_page
>>>>>> a chance to complete.
>>>>>>
>>>>>> Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
>>>>>> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
>>>>>
>>>>> Well caught: you're absolutely right that there's a bug there.
>>>>> But isn't cond_resched() just papering over the real bug, and
>>>>> what it should do is a "page = compound_head(page);" before the
>>>>> get_page_unless_zero()? How does that work out in your testing?
>>>>
>>>> compound_head works. The patched kernel is alive for hours under
>>>> our reproducer, which usually makes the vanilla kernel hung after
>>>> tens of minutes at most.
>>>
>>> Oh, that's good news, thanks.
>>>
>>> (It's still likely that a well-placed cond_resched() somewhere in
>>> mm/gup.c would also be a good idea, but none of us have yet got
>>> around to identifying where.)
>>
>> We neither. If really have to do it outside of __migration_entry_wait,
>> return value of __migration_entry_wait is needed, and many related
>> functions have to updated, which may be undesirable.
> 
> No, it would not be necessary to plumb through a return value from
> __migration_entry_wait(): I didn't mean that this GUP cond_resched()
> should be done only for the migration case, but (I guess) on any path
> where handle_mm_fault() returns "success" for a retry, yet the retry
> of follow_page_mask() fails.
> 
> But now that I look, I see there is already a cond_resched() there!

Do you mean might_sleep in mmap_read_trylock within do_user_addr_fault?

If so, our environment has CONFIG_PREEMPT_NONE is enabled, and the
__migration_entry_wait happens in kernel when do something like
copy_to_user (e.g., fstat).


> 
> So I'm puzzled as to how your cond_resched() in __migration_entry_wait()
> appeared to help - well, you never actually said that it helped, but I
> assume that it did, or you wouldn't have bothered to send that patch?
> 
> It's irrelevant, now that we've admitted there should be a
> "page = compound_head(page)" in there, and you have said that helps,
> and that's the patch we want to send now.  But it troubles me, to be
> unable to explain it.  Two cond_resched()s are not twice as good as one.
> 
>>
>>>
>>>>
>>>> If we use compound_head, the behavior of __migration_entry_wait(tail)
>>>> changes from "retry fault" to "prevent THP from being split". Is that
>>>> right?  Then which is preferred? If it were me, I would prefer "retry
>>>> fault".
>>>
>>> As Matthew remarked, you are asking very good questions, and split
>>> migration entries are difficult to think about.  But I believe you'll
>>> find it works out okay.
>>>
>>> The point of *put_and_* wait_on_page_locked() is that it does drop
>>> the page reference you acquired with get_page_unless_zero, as soon
>>> as the page is on the wait queue, before actually waiting.
>>>
>>> So splitting the THP is only prevented for a brief interval.  Now,
>>> it's true that if there are very many tasks faulting on portions
>>> of the huge page, in that interval between inserting the migration
>>> entries and freezing the huge page's refcount to 0, they can reduce
>>> the chance of splitting considerably.  But that's not an excuse for
>>> for doing get_page_unless_zero() on the wrong thing, as it was doing.
>>
>> We finally come to your solution, i.e., compound_head.
>>
>> In that case, who should resend the compound_head patch to this issue?
>> shall we do with your s.o.b?
> 
> I was rather expecting you to send the patch: with your s.o.b, not mine.
> You could say "Suggested-by: Hugh Dickins <hughd@google.com>" if you like.
> 
> And I suggest that you put that "page = compound_head(page);" line
> immediately after the "page = migration_entry_to_page(entry);" line,
> so as not to interfere with the comment above get_page_unless_zero().
> 
> (No need for a comment on the compound_head(): it's self-explanatory.)
> 
> I did meanwhile research other callers of migration_entry_to_page():
> it had been on my mind, that others might need a compound_head() too,
> and perhaps it should be done inside migration_entry_to_page() itself.
> 
> But so far as I can tell (I don't really know about the s390 one),
> the others are okay, and it would just be unnecessary overhead
> (in particular, the mm_counter() stuff looks correct on a tail).
> 
> I *think* the right Fixes tag would be
> Fixes: ba98828088ad ("thp: add option to setup migration entries during PMD split")
> though I'm not sure of that; it's probably good enough.
> 
> (With all this direction, I did wonder if it would be kinder just to
> send a patch myself, but using some of your comments: but I didn't
> understand "conner" in your description, so couldn't do that.)
> 
> Thanks!
> Hugh
>
Hugh Dickins June 8, 2021, 6:53 a.m. UTC | #14
On Tue, 8 Jun 2021, Yu Xu wrote:
> On 6/8/21 12:44 PM, Hugh Dickins wrote:
> > On Mon, 7 Jun 2021, Yu Xu wrote:
> >> On 6/2/21 11:57 PM, Hugh Dickins wrote:
> >>> On Wed, 2 Jun 2021, Yu Xu wrote:
> >>>> On 6/2/21 12:55 AM, Hugh Dickins wrote:
> >>>>> On Wed, 2 Jun 2021, Xu Yu wrote:
> >>>>>
> >>>>>> We notice that hung task happens in a conner but practical scenario
> >>>>>> when
> >>>>>> CONFIG_PREEMPT_NONE is enabled, as follows.
> >>>>>>
> >>>>>> Process 0                       Process 1                     Process
> >>>>>> 2..Inf
> >>>>>> split_huge_page_to_list
> >>>>>>        unmap_page
> >>>>>>            split_huge_pmd_address
> >>>>>>                                    __migration_entry_wait(head)
> >>>>>>                                                                  __migration_entry_wait(tail)
> >>>>>>        remap_page (roll back)
> >>>>>>            remove_migration_ptes
> >>>>>>                rmap_walk_anon
> >>>>>>                    cond_resched
> >>>>>>
> >>>>>> Where __migration_entry_wait(tail) is occurred in kernel space, e.g.,
> >>>>>> copy_to_user, which will immediately fault again without rescheduling,
> >>>>>> and thus occupy the cpu fully.
> >>>>>>
> >>>>>> When there are too many processes performing __migration_entry_wait on
> >>>>>> tail page, remap_page will never be done after cond_resched.
> >>>>>>
> >>>>>> This relaxes __migration_entry_wait on tail page, thus gives remap_page
> >>>>>> a chance to complete.
> >>>>>>
> >>>>>> Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
> >>>>>> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
> >>>>>
> >>>>> Well caught: you're absolutely right that there's a bug there.
> >>>>> But isn't cond_resched() just papering over the real bug, and
> >>>>> what it should do is a "page = compound_head(page);" before the
> >>>>> get_page_unless_zero()? How does that work out in your testing?
> >>>>
> >>>> compound_head works. The patched kernel is alive for hours under
> >>>> our reproducer, which usually makes the vanilla kernel hung after
> >>>> tens of minutes at most.
> >>>
> >>> Oh, that's good news, thanks.
> >>>
> >>> (It's still likely that a well-placed cond_resched() somewhere in
> >>> mm/gup.c would also be a good idea, but none of us have yet got
> >>> around to identifying where.)
> >>
> >> We neither. If really have to do it outside of __migration_entry_wait,
> >> return value of __migration_entry_wait is needed, and many related
> >> functions have to updated, which may be undesirable.
> > 
> > No, it would not be necessary to plumb through a return value from
> > __migration_entry_wait(): I didn't mean that this GUP cond_resched()
> > should be done only for the migration case, but (I guess) on any path
> > where handle_mm_fault() returns "success" for a retry, yet the retry
> > of follow_page_mask() fails.
> > 
> > But now that I look, I see there is already a cond_resched() there!
> 
> Do you mean might_sleep in mmap_read_trylock within do_user_addr_fault?
> 
> If so, our environment has CONFIG_PREEMPT_NONE is enabled, and the
> __migration_entry_wait happens in kernel when do something like
> copy_to_user (e.g., fstat).

Oh, I am sorry: now I see that you did mention copy_to_user() in your
original post, but I'm afraid I was fixated on get_user_pages() all
along: a different way in which the kernel handles a fault on user
address space without returning to userspace immediately afterwards.

So, the GUP case has its cond_resched() and is okay, but the
arch/whatever/mm/fault.c case is the one which probably deserves a
cond_resched() somewhere (on the architecture in question anyway - x86?).

I was reluctant to suggest where to place it in GUP, I am even more
reluctant to say where in arch/whatever/mm/fault.c: I haven't thought
through that code in years.  x86, somewhere in do_user_addr_fault(),
probably yes; but it's better to cond_resched() without holding a
lock; and better to avoid it on first entry too.

But we don't need to decide that, if the compound_head() is a
satisfactory solution for you in practice.  Sorry for confusing
you with my own confusion, and thank you for clearing it up.

Hugh
diff mbox series

Patch

diff --git a/mm/migrate.c b/mm/migrate.c
index b234c3f3acb7..df2dc39fe566 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -301,8 +301,11 @@  void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 	 * is zero; but we must not call put_and_wait_on_page_locked() without
 	 * a ref. Use get_page_unless_zero(), and just fault again if it fails.
 	 */
-	if (!get_page_unless_zero(page))
-		goto out;
+	if (!get_page_unless_zero(page)) {
+		pte_unmap_unlock(ptep, ptl);
+		cond_resched();
+		return;
+	}
 	pte_unmap_unlock(ptep, ptl);
 	put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE);
 	return;