[v2] mm, thp: use head page in __migration_entry_wait

Message ID	b9836c1dd522e903891760af9f0c86a2cce987eb.1623144009.git.xuyu@linux.alibaba.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=dVmx=LC=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org D51FB6124B From: Xu Yu <xuyu@linux.alibaba.com> To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, hughd@google.com Cc: akpm@linux-foundation.org, gavin.dg@linux.alibaba.com Subject: [PATCH v2] mm, thp: use head page in __migration_entry_wait Date: Tue, 8 Jun 2021 17:22:39 +0800 Message-Id: <b9836c1dd522e903891760af9f0c86a2cce987eb.1623144009.git.xuyu@linux.alibaba.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	[v2] mm, thp: use head page in __migration_entry_wait \| expand [v2] mm, thp: use head page in __migration_entry_wait

Xu Yu June 8, 2021, 9:22 a.m. UTC

We notice that hung task happens in a conner but practical scenario when
CONFIG_PREEMPT_NONE is enabled, as follows.

Process 0                       Process 1                     Process 2..Inf
split_huge_page_to_list
    unmap_page
        split_huge_pmd_address
                                __migration_entry_wait(head)
                                                              __migration_entry_wait(tail)
    remap_page (roll back)
        remove_migration_ptes
            rmap_walk_anon
                cond_resched

Where __migration_entry_wait(tail) is occurred in kernel space, e.g.,
copy_to_user in fstat, which will immediately fault again without
rescheduling, and thus occupy the cpu fully.

When there are too many processes performing __migration_entry_wait on
tail page, remap_page will never be done after cond_resched.

This makes __migration_entry_wait operate on the compound head page,
thus waits for remap_page to complete, whether the THP is split
successfully or roll back.

Note that put_and_wait_on_page_locked helps to drop the page reference
acquired with get_page_unless_zero, as soon as the page is on the wait
queue, before actually waiting. So splitting the THP is only prevented
for a brief interval.

Fixes: ba98828088ad ("thp: add option to setup migration entries during PMD split")
Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
---
 mm/migrate.c | 1 +
 1 file changed, 1 insertion(+)

Kirill A. Shutemov June 8, 2021, noon UTC | #1

On Tue, Jun 08, 2021 at 05:22:39PM +0800, Xu Yu wrote:
> We notice that hung task happens in a conner but practical scenario when
> CONFIG_PREEMPT_NONE is enabled, as follows.
> 
> Process 0                       Process 1                     Process 2..Inf
> split_huge_page_to_list
>     unmap_page
>         split_huge_pmd_address
>                                 __migration_entry_wait(head)
>                                                               __migration_entry_wait(tail)
>     remap_page (roll back)
>         remove_migration_ptes
>             rmap_walk_anon
>                 cond_resched
> 
> Where __migration_entry_wait(tail) is occurred in kernel space, e.g.,
> copy_to_user in fstat, which will immediately fault again without
> rescheduling, and thus occupy the cpu fully.
> 
> When there are too many processes performing __migration_entry_wait on
> tail page, remap_page will never be done after cond_resched.
> 
> This makes __migration_entry_wait operate on the compound head page,
> thus waits for remap_page to complete, whether the THP is split
> successfully or roll back.
> 
> Note that put_and_wait_on_page_locked helps to drop the page reference
> acquired with get_page_unless_zero, as soon as the page is on the wait
> queue, before actually waiting. So splitting the THP is only prevented
> for a brief interval.
> 
> Fixes: ba98828088ad ("thp: add option to setup migration entries during PMD split")
> Suggested-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>

Looks good to me:

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

But there's one quirk: if split succeed we effectively wait on wrong
page to be unlocked. And it may take indefinite time if split_huge_page()
was called on the head page.

Maybe we should consider waking up head waiter on head page, even if it is
still locked after split?

Something like this (untested):

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 63ed6b25deaa..f79a38e21e53 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2535,6 +2535,9 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		 */
 		put_page(subpage);
 	}
+
+	if (page == head)
+		wake_up_page_bit(page, PG_locked);
 }

 int total_mapcount(struct page *page)

Matthew Wilcox (Oracle) June 8, 2021, 12:32 p.m. UTC | #2

On Tue, Jun 08, 2021 at 03:00:26PM +0300, Kirill A. Shutemov wrote:
> But there's one quirk: if split succeed we effectively wait on wrong
> page to be unlocked. And it may take indefinite time if split_huge_page()
> was called on the head page.

Hardly indefinite time ... callers of split_huge_page_to_list() usually
unlock the page soon after.  Actually, I can't find one that doesn't call
unlock_page() within a few lines of calling split_huge_page_to_list().

Kirill A. Shutemov June 8, 2021, 12:58 p.m. UTC | #3

On Tue, Jun 08, 2021 at 01:32:21PM +0100, Matthew Wilcox wrote:
> On Tue, Jun 08, 2021 at 03:00:26PM +0300, Kirill A. Shutemov wrote:
> > But there's one quirk: if split succeed we effectively wait on wrong
> > page to be unlocked. And it may take indefinite time if split_huge_page()
> > was called on the head page.
> 
> Hardly indefinite time ... callers of split_huge_page_to_list() usually
> unlock the page soon after.  Actually, I can't find one that doesn't call
> unlock_page() within a few lines of calling split_huge_page_to_list().

I didn't check all callers, but it's not guaranteed by the interface and
it's not hard to imagine a future situation when a page got split on the
way to IO and kept locked until IO is complete.

The wake up shouldn't have much overhead as in most cases split going to
be called on the head page.

Xu Yu June 8, 2021, 1:22 p.m. UTC | #4

On 6/8/21 8:00 PM, Kirill A. Shutemov wrote:
> On Tue, Jun 08, 2021 at 05:22:39PM +0800, Xu Yu wrote:
>> We notice that hung task happens in a conner but practical scenario when
>> CONFIG_PREEMPT_NONE is enabled, as follows.
>>
>> Process 0                       Process 1                     Process 2..Inf
>> split_huge_page_to_list
>>      unmap_page
>>          split_huge_pmd_address
>>                                  __migration_entry_wait(head)
>>                                                                __migration_entry_wait(tail)
>>      remap_page (roll back)
>>          remove_migration_ptes
>>              rmap_walk_anon
>>                  cond_resched
>>
>> Where __migration_entry_wait(tail) is occurred in kernel space, e.g.,
>> copy_to_user in fstat, which will immediately fault again without
>> rescheduling, and thus occupy the cpu fully.
>>
>> When there are too many processes performing __migration_entry_wait on
>> tail page, remap_page will never be done after cond_resched.
>>
>> This makes __migration_entry_wait operate on the compound head page,
>> thus waits for remap_page to complete, whether the THP is split
>> successfully or roll back.
>>
>> Note that put_and_wait_on_page_locked helps to drop the page reference
>> acquired with get_page_unless_zero, as soon as the page is on the wait
>> queue, before actually waiting. So splitting the THP is only prevented
>> for a brief interval.
>>
>> Fixes: ba98828088ad ("thp: add option to setup migration entries during PMD split")
>> Suggested-by: Hugh Dickins <hughd@google.com>
>> Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
>> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
> 
> Looks good to me:
> 
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> 
> But there's one quirk: if split succeed we effectively wait on wrong
> page to be unlocked. And it may take indefinite time if split_huge_page()
> was called on the head page.

Inspired by you, I look into the codes, and have a new question (nothing
to do with this patch).

If we split_huge_page_to_list on *tail* page (in fact, I haven't seen
that used yet),

mm/huge_memory.c:2666 checks "VM_BUG_ON_PAGE(!PageLocked(head), head);"
in split_huge_page_to_list(), while

mm/huge_memory.c:2497 does "unlock_page(subpage)", where subpage can
be head in this scenario, in __split_huge_page().

My confusion is
1) how the pin on the @subpage is got outside split_huge_page_to_list()?
    can we ever get tail page?

2) head page is locked outside split_huge_page_to_list(), but unlocked
    in __split_huge_page()?


> 
> Maybe we should consider waking up head waiter on head page, even if it is
> still locked after split?
> 
> Something like this (untested):
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 63ed6b25deaa..f79a38e21e53 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2535,6 +2535,9 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>   		 */
>   		put_page(subpage);
>   	}
> +
> +	if (page == head)
> +		wake_up_page_bit(page, PG_locked);
>   }
> 
>   int total_mapcount(struct page *page)
>

Matthew Wilcox (Oracle) June 8, 2021, 1:35 p.m. UTC | #5

On Tue, Jun 08, 2021 at 03:58:38PM +0300, Kirill A. Shutemov wrote:
> On Tue, Jun 08, 2021 at 01:32:21PM +0100, Matthew Wilcox wrote:
> > On Tue, Jun 08, 2021 at 03:00:26PM +0300, Kirill A. Shutemov wrote:
> > > But there's one quirk: if split succeed we effectively wait on wrong
> > > page to be unlocked. And it may take indefinite time if split_huge_page()
> > > was called on the head page.
> > 
> > Hardly indefinite time ... callers of split_huge_page_to_list() usually
> > unlock the page soon after.  Actually, I can't find one that doesn't call
> > unlock_page() within a few lines of calling split_huge_page_to_list().
> 
> I didn't check all callers, but it's not guaranteed by the interface and
> it's not hard to imagine a future situation when a page got split on the
> way to IO and kept locked until IO is complete.

I would say that can't happen.  Pages are locked when added to the page
cache and are !Uptodate.  You can't put a PTE in a process page table
until it's Uptodate, and once it's Uptodate, the page is unlocked.  So
any subsequent locks are transient, and not for the purposes of IO
(writebacks only take the page lock transiently).

> The wake up shouldn't have much overhead as in most cases split going to
> be called on the head page.

I'm not convinced about that.  We go out of our way to not wake up pages
(eg PageWaiters), and we've had some impressively long lists in the past
(which is why we now have the bookmarks).

Hugh Dickins June 8, 2021, 8 p.m. UTC | #6

On Tue, 8 Jun 2021, Xu Yu wrote:

> We notice that hung task happens in a conner but practical scenario when

But I still don't understand what you mean by "conner":
common, corner, something else? Maybe just delete "conner but ".

> CONFIG_PREEMPT_NONE is enabled, as follows.
> 
> Process 0                       Process 1                     Process 2..Inf
> split_huge_page_to_list
>     unmap_page
>         split_huge_pmd_address
>                                 __migration_entry_wait(head)
>                                                               __migration_entry_wait(tail)
>     remap_page (roll back)
>         remove_migration_ptes
>             rmap_walk_anon
>                 cond_resched
> 
> Where __migration_entry_wait(tail) is occurred in kernel space, e.g.,
> copy_to_user in fstat, which will immediately fault again without
> rescheduling, and thus occupy the cpu fully.
> 
> When there are too many processes performing __migration_entry_wait on
> tail page, remap_page will never be done after cond_resched.
> 
> This makes __migration_entry_wait operate on the compound head page,
> thus waits for remap_page to complete, whether the THP is split
> successfully or roll back.
> 
> Note that put_and_wait_on_page_locked helps to drop the page reference
> acquired with get_page_unless_zero, as soon as the page is on the wait
> queue, before actually waiting. So splitting the THP is only prevented
> for a brief interval.
> 
> Fixes: ba98828088ad ("thp: add option to setup migration entries during PMD split")
> Suggested-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>

Thanks:
Acked-by: Hugh Dickins <hughd@google.com>

And I hope Andrew will add Cc stable when it goes into his tree.

I'll leave the (independent) discussion of optimal wakeup strategy
to Kirill and Matthew: no strong opinion from me, it works as it is.

> ---
>  mm/migrate.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index b234c3f3acb7..41ff2c9896c4 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -295,6 +295,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
>  		goto out;
>  
>  	page = migration_entry_to_page(entry);
> +	page = compound_head(page);
>  
>  	/*
>  	 * Once page cache replacement of page migration started, page_count
> -- 
> 2.20.1.2432.ga663e714
> 
>

Kirill A. Shutemov June 9, 2021, 9:59 a.m. UTC | #7

On Tue, Jun 08, 2021 at 02:35:23PM +0100, Matthew Wilcox wrote:
> On Tue, Jun 08, 2021 at 03:58:38PM +0300, Kirill A. Shutemov wrote:
> > On Tue, Jun 08, 2021 at 01:32:21PM +0100, Matthew Wilcox wrote:
> > > On Tue, Jun 08, 2021 at 03:00:26PM +0300, Kirill A. Shutemov wrote:
> > > > But there's one quirk: if split succeed we effectively wait on wrong
> > > > page to be unlocked. And it may take indefinite time if split_huge_page()
> > > > was called on the head page.
> > > 
> > > Hardly indefinite time ... callers of split_huge_page_to_list() usually
> > > unlock the page soon after.  Actually, I can't find one that doesn't call
> > > unlock_page() within a few lines of calling split_huge_page_to_list().
> > 
> > I didn't check all callers, but it's not guaranteed by the interface and
> > it's not hard to imagine a future situation when a page got split on the
> > way to IO and kept locked until IO is complete.
> 
> I would say that can't happen.  Pages are locked when added to the page
> cache and are !Uptodate.  You can't put a PTE in a process page table
> until it's Uptodate, and once it's Uptodate, the page is unlocked.  So
> any subsequent locks are transient, and not for the purposes of IO
> (writebacks only take the page lock transiently).

Documentation/filesystems/locking.rst:

	Note, if the filesystem needs the page to be locked during writeout, that
	is ok, too, the page is allowed to be unlocked at any point in time
	between the calls to set_page_writeback() and end_page_writeback().

I probably misinterpret what is written here. I know very little about
writeback path.

> > The wake up shouldn't have much overhead as in most cases split going to
> > be called on the head page.
> 
> I'm not convinced about that.  We go out of our way to not wake up pages
> (eg PageWaiters), and we've had some impressively long lists in the past
> (which is why we now have the bookmarks).

Maybe we should be smarter on when to wake up, I donno.

I just notice that with the change we have /potential/ to wait long time
on the wrong page to be unlocked. split_huge_page() interface doesn't
enforce that the page gets split soon after split is complete.

Kirill A. Shutemov June 9, 2021, 10:08 a.m. UTC | #8

On Tue, Jun 08, 2021 at 09:22:28PM +0800, Yu Xu wrote:
> On 6/8/21 8:00 PM, Kirill A. Shutemov wrote:
> > On Tue, Jun 08, 2021 at 05:22:39PM +0800, Xu Yu wrote:
> > > We notice that hung task happens in a conner but practical scenario when
> > > CONFIG_PREEMPT_NONE is enabled, as follows.
> > > 
> > > Process 0                       Process 1                     Process 2..Inf
> > > split_huge_page_to_list
> > >      unmap_page
> > >          split_huge_pmd_address
> > >                                  __migration_entry_wait(head)
> > >                                                                __migration_entry_wait(tail)
> > >      remap_page (roll back)
> > >          remove_migration_ptes
> > >              rmap_walk_anon
> > >                  cond_resched
> > > 
> > > Where __migration_entry_wait(tail) is occurred in kernel space, e.g.,
> > > copy_to_user in fstat, which will immediately fault again without
> > > rescheduling, and thus occupy the cpu fully.
> > > 
> > > When there are too many processes performing __migration_entry_wait on
> > > tail page, remap_page will never be done after cond_resched.
> > > 
> > > This makes __migration_entry_wait operate on the compound head page,
> > > thus waits for remap_page to complete, whether the THP is split
> > > successfully or roll back.
> > > 
> > > Note that put_and_wait_on_page_locked helps to drop the page reference
> > > acquired with get_page_unless_zero, as soon as the page is on the wait
> > > queue, before actually waiting. So splitting the THP is only prevented
> > > for a brief interval.
> > > 
> > > Fixes: ba98828088ad ("thp: add option to setup migration entries during PMD split")
> > > Suggested-by: Hugh Dickins <hughd@google.com>
> > > Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
> > > Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
> > 
> > Looks good to me:
> > 
> > Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > 
> > But there's one quirk: if split succeed we effectively wait on wrong
> > page to be unlocked. And it may take indefinite time if split_huge_page()
> > was called on the head page.
> 
> Inspired by you, I look into the codes, and have a new question (nothing
> to do with this patch).
> 
> If we split_huge_page_to_list on *tail* page (in fact, I haven't seen
> that used yet),

See ksm code for instance.

> mm/huge_memory.c:2666 checks "VM_BUG_ON_PAGE(!PageLocked(head), head);"
> in split_huge_page_to_list(), while
> 
> mm/huge_memory.c:2497 does "unlock_page(subpage)", where subpage can
> be head in this scenario, in __split_huge_page().
> 
> My confusion is
> 1) how the pin on the @subpage is got outside split_huge_page_to_list()?
>    can we ever get tail page?

This loop:

	for (i = 0; i < nr; i++) {
		struct page *subpage = head + i;
		if (subpage == page)
			continue;
		unlock_page(subpage);

		/*
		 * Subpages may be freed if there wasn't any mapping
		 * like if add_to_swap() is running on a lru page that
		 * had its mapping zapped. And freeing these pages
		 * requires taking the lru_lock so we do the put_page
		 * of the tail pages after the split is complete.
		 */
		put_page(subpage);
	}

We skip unlocking and unpinning the page split_huge_page() got called for.

> 
> 2) head page is locked outside split_huge_page_to_list(), but unlocked
>    in __split_huge_page()?

If called on tail page, yes.

[v2] mm, thp: use head page in __migration_entry_wait

Commit Message

Comments

Patch