Message ID | b9836c1dd522e903891760af9f0c86a2cce987eb.1623144009.git.xuyu@linux.alibaba.com (mailing list archive)
---|---
State | New, archived
Series | [v2] mm, thp: use head page in __migration_entry_wait
On Tue, Jun 08, 2021 at 05:22:39PM +0800, Xu Yu wrote:
> We notice that hung task happens in a conner but practical scenario when
> CONFIG_PREEMPT_NONE is enabled, as follows.
>
> Process 0                       Process 1                     Process 2..Inf
> split_huge_page_to_list
>     unmap_page
>         split_huge_pmd_address
>                                 __migration_entry_wait(head)
>                                                               __migration_entry_wait(tail)
>     remap_page (roll back)
>         remove_migration_ptes
>             rmap_walk_anon
>                 cond_resched
>
> Where __migration_entry_wait(tail) is occurred in kernel space, e.g.,
> copy_to_user in fstat, which will immediately fault again without
> rescheduling, and thus occupy the cpu fully.
>
> When there are too many processes performing __migration_entry_wait on
> tail page, remap_page will never be done after cond_resched.
>
> This makes __migration_entry_wait operate on the compound head page,
> thus waits for remap_page to complete, whether the THP is split
> successfully or roll back.
>
> Note that put_and_wait_on_page_locked helps to drop the page reference
> acquired with get_page_unless_zero, as soon as the page is on the wait
> queue, before actually waiting. So splitting the THP is only prevented
> for a brief interval.
>
> Fixes: ba98828088ad ("thp: add option to setup migration entries during PMD split")
> Suggested-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>

Looks good to me:

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

But there's one quirk: if split succeed we effectively wait on wrong
page to be unlocked. And it may take indefinite time if split_huge_page()
was called on the head page.

Maybe we should consider waking up head waiter on head page, even if it is
still locked after split?
Something like this (untested):

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 63ed6b25deaa..f79a38e21e53 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2535,6 +2535,9 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		 */
 		put_page(subpage);
 	}
+
+	if (page == head)
+		wake_up_page_bit(page, PG_locked);
 }
 
 int total_mapcount(struct page *page)
On Tue, Jun 08, 2021 at 03:00:26PM +0300, Kirill A. Shutemov wrote:
> But there's one quirk: if split succeed we effectively wait on wrong
> page to be unlocked. And it may take indefinite time if split_huge_page()
> was called on the head page.

Hardly indefinite time ... callers of split_huge_page_to_list() usually
unlock the page soon after. Actually, I can't find one that doesn't call
unlock_page() within a few lines of calling split_huge_page_to_list().
On Tue, Jun 08, 2021 at 01:32:21PM +0100, Matthew Wilcox wrote:
> On Tue, Jun 08, 2021 at 03:00:26PM +0300, Kirill A. Shutemov wrote:
> > But there's one quirk: if split succeed we effectively wait on wrong
> > page to be unlocked. And it may take indefinite time if split_huge_page()
> > was called on the head page.
>
> Hardly indefinite time ... callers of split_huge_page_to_list() usually
> unlock the page soon after. Actually, I can't find one that doesn't call
> unlock_page() within a few lines of calling split_huge_page_to_list().

I didn't check all callers, but it's not guaranteed by the interface and
it's not hard to imagine a future situation when a page got split on the
way to IO and kept locked until IO is complete.

The wake up shouldn't have much overhead as in most cases split going to
be called on the head page.
On 6/8/21 8:00 PM, Kirill A. Shutemov wrote:
> On Tue, Jun 08, 2021 at 05:22:39PM +0800, Xu Yu wrote:
>> We notice that hung task happens in a conner but practical scenario when
>> CONFIG_PREEMPT_NONE is enabled, as follows.
>>
>> Process 0                       Process 1                     Process 2..Inf
>> split_huge_page_to_list
>>     unmap_page
>>         split_huge_pmd_address
>>                                 __migration_entry_wait(head)
>>                                                               __migration_entry_wait(tail)
>>     remap_page (roll back)
>>         remove_migration_ptes
>>             rmap_walk_anon
>>                 cond_resched
>>
>> Where __migration_entry_wait(tail) is occurred in kernel space, e.g.,
>> copy_to_user in fstat, which will immediately fault again without
>> rescheduling, and thus occupy the cpu fully.
>>
>> When there are too many processes performing __migration_entry_wait on
>> tail page, remap_page will never be done after cond_resched.
>>
>> This makes __migration_entry_wait operate on the compound head page,
>> thus waits for remap_page to complete, whether the THP is split
>> successfully or roll back.
>>
>> Note that put_and_wait_on_page_locked helps to drop the page reference
>> acquired with get_page_unless_zero, as soon as the page is on the wait
>> queue, before actually waiting. So splitting the THP is only prevented
>> for a brief interval.
>>
>> Fixes: ba98828088ad ("thp: add option to setup migration entries during PMD split")
>> Suggested-by: Hugh Dickins <hughd@google.com>
>> Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
>> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
>
> Looks good to me:
>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>
> But there's one quirk: if split succeed we effectively wait on wrong
> page to be unlocked. And it may take indefinite time if split_huge_page()
> was called on the head page.

Inspired by you, I look into the codes, and have a new question (nothing
to do with this patch).

If we split_huge_page_to_list on *tail* page (in fact, I haven't seen
that used yet),

mm/huge_memory.c:2666 checks "VM_BUG_ON_PAGE(!PageLocked(head), head);"
in split_huge_page_to_list(), while

mm/huge_memory.c:2497 does "unlock_page(subpage)", where subpage can
be head in this scenario, in __split_huge_page().

My confusion is
1) how the pin on the @subpage is got outside split_huge_page_to_list()?
   can we ever get tail page?
2) head page is locked outside split_huge_page_to_list(), but unlocked
   in __split_huge_page()?

> Maybe we should consider waking up head waiter on head page, even if it is
> still locked after split?
>
> Something like this (untested):
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 63ed6b25deaa..f79a38e21e53 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2535,6 +2535,9 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  		 */
>  		put_page(subpage);
>  	}
> +
> +	if (page == head)
> +		wake_up_page_bit(page, PG_locked);
>  }
>
>  int total_mapcount(struct page *page)
On Tue, Jun 08, 2021 at 03:58:38PM +0300, Kirill A. Shutemov wrote:
> On Tue, Jun 08, 2021 at 01:32:21PM +0100, Matthew Wilcox wrote:
> > On Tue, Jun 08, 2021 at 03:00:26PM +0300, Kirill A. Shutemov wrote:
> > > But there's one quirk: if split succeed we effectively wait on wrong
> > > page to be unlocked. And it may take indefinite time if split_huge_page()
> > > was called on the head page.
> >
> > Hardly indefinite time ... callers of split_huge_page_to_list() usually
> > unlock the page soon after. Actually, I can't find one that doesn't call
> > unlock_page() within a few lines of calling split_huge_page_to_list().
>
> I didn't check all callers, but it's not guaranteed by the interface and
> it's not hard to imagine a future situation when a page got split on the
> way to IO and kept locked until IO is complete.

I would say that can't happen. Pages are locked when added to the page
cache and are !Uptodate. You can't put a PTE in a process page table
until it's Uptodate, and once it's Uptodate, the page is unlocked. So
any subsequent locks are transient, and not for the purposes of IO
(writebacks only take the page lock transiently).

> The wake up shouldn't have much overhead as in most cases split going to
> be called on the head page.

I'm not convinced about that. We go out of our way to not wake up pages
(eg PageWaiters), and we've had some impressively long lists in the past
(which is why we now have the bookmarks).
On Tue, 8 Jun 2021, Xu Yu wrote:
> We notice that hung task happens in a conner but practical scenario when

But I still don't understand what you mean by "conner": common, corner,
something else? Maybe just delete "conner but ".

> CONFIG_PREEMPT_NONE is enabled, as follows.
>
> Process 0                       Process 1                     Process 2..Inf
> split_huge_page_to_list
>     unmap_page
>         split_huge_pmd_address
>                                 __migration_entry_wait(head)
>                                                               __migration_entry_wait(tail)
>     remap_page (roll back)
>         remove_migration_ptes
>             rmap_walk_anon
>                 cond_resched
>
> Where __migration_entry_wait(tail) is occurred in kernel space, e.g.,
> copy_to_user in fstat, which will immediately fault again without
> rescheduling, and thus occupy the cpu fully.
>
> When there are too many processes performing __migration_entry_wait on
> tail page, remap_page will never be done after cond_resched.
>
> This makes __migration_entry_wait operate on the compound head page,
> thus waits for remap_page to complete, whether the THP is split
> successfully or roll back.
>
> Note that put_and_wait_on_page_locked helps to drop the page reference
> acquired with get_page_unless_zero, as soon as the page is on the wait
> queue, before actually waiting. So splitting the THP is only prevented
> for a brief interval.
>
> Fixes: ba98828088ad ("thp: add option to setup migration entries during PMD split")
> Suggested-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>

Thanks:

Acked-by: Hugh Dickins <hughd@google.com>

And I hope Andrew will add Cc stable when it goes into his tree.

I'll leave the (independent) discussion of optimal wakeup strategy to
Kirill and Matthew: no strong opinion from me, it works as it is.

> ---
>  mm/migrate.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index b234c3f3acb7..41ff2c9896c4 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -295,6 +295,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
>  		goto out;
>
>  	page = migration_entry_to_page(entry);
> +	page = compound_head(page);
>
>  	/*
>  	 * Once page cache replacement of page migration started, page_count
> --
> 2.20.1.2432.ga663e714
On Tue, Jun 08, 2021 at 02:35:23PM +0100, Matthew Wilcox wrote:
> On Tue, Jun 08, 2021 at 03:58:38PM +0300, Kirill A. Shutemov wrote:
> > On Tue, Jun 08, 2021 at 01:32:21PM +0100, Matthew Wilcox wrote:
> > > On Tue, Jun 08, 2021 at 03:00:26PM +0300, Kirill A. Shutemov wrote:
> > > > But there's one quirk: if split succeed we effectively wait on wrong
> > > > page to be unlocked. And it may take indefinite time if split_huge_page()
> > > > was called on the head page.
> > >
> > > Hardly indefinite time ... callers of split_huge_page_to_list() usually
> > > unlock the page soon after. Actually, I can't find one that doesn't call
> > > unlock_page() within a few lines of calling split_huge_page_to_list().
> >
> > I didn't check all callers, but it's not guaranteed by the interface and
> > it's not hard to imagine a future situation when a page got split on the
> > way to IO and kept locked until IO is complete.
>
> I would say that can't happen. Pages are locked when added to the page
> cache and are !Uptodate. You can't put a PTE in a process page table
> until it's Uptodate, and once it's Uptodate, the page is unlocked. So
> any subsequent locks are transient, and not for the purposes of IO
> (writebacks only take the page lock transiently).

Documentation/filesystems/locking.rst:

	Note, if the filesystem needs the page to be locked during writeout,
	that is ok, too, the page is allowed to be unlocked at any point in
	time between the calls to set_page_writeback() and
	end_page_writeback().

I probably misinterpret what is written here. I know very little about
writeback path.

> > The wake up shouldn't have much overhead as in most cases split going to
> > be called on the head page.
>
> I'm not convinced about that. We go out of our way to not wake up pages
> (eg PageWaiters), and we've had some impressively long lists in the past
> (which is why we now have the bookmarks).

Maybe we should be smarter on when to wake up, I donno.
I just noticed that with the change we have /potential/ to wait a long
time on the wrong page to be unlocked. The split_huge_page() interface
doesn't enforce that the page gets unlocked soon after the split is
complete.
On Tue, Jun 08, 2021 at 09:22:28PM +0800, Yu Xu wrote:
> On 6/8/21 8:00 PM, Kirill A. Shutemov wrote:
> > On Tue, Jun 08, 2021 at 05:22:39PM +0800, Xu Yu wrote:
> > > We notice that hung task happens in a conner but practical scenario when
> > > CONFIG_PREEMPT_NONE is enabled, as follows.
> > >
> > > Process 0                       Process 1                     Process 2..Inf
> > > split_huge_page_to_list
> > >     unmap_page
> > >         split_huge_pmd_address
> > >                                 __migration_entry_wait(head)
> > >                                                               __migration_entry_wait(tail)
> > >     remap_page (roll back)
> > >         remove_migration_ptes
> > >             rmap_walk_anon
> > >                 cond_resched
> > >
> > > Where __migration_entry_wait(tail) is occurred in kernel space, e.g.,
> > > copy_to_user in fstat, which will immediately fault again without
> > > rescheduling, and thus occupy the cpu fully.
> > >
> > > When there are too many processes performing __migration_entry_wait on
> > > tail page, remap_page will never be done after cond_resched.
> > >
> > > This makes __migration_entry_wait operate on the compound head page,
> > > thus waits for remap_page to complete, whether the THP is split
> > > successfully or roll back.
> > >
> > > Note that put_and_wait_on_page_locked helps to drop the page reference
> > > acquired with get_page_unless_zero, as soon as the page is on the wait
> > > queue, before actually waiting. So splitting the THP is only prevented
> > > for a brief interval.
> > >
> > > Fixes: ba98828088ad ("thp: add option to setup migration entries during PMD split")
> > > Suggested-by: Hugh Dickins <hughd@google.com>
> > > Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
> > > Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
> >
> > Looks good to me:
> >
> > Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >
> > But there's one quirk: if split succeed we effectively wait on wrong
> > page to be unlocked. And it may take indefinite time if split_huge_page()
> > was called on the head page.
>
> Inspired by you, I look into the codes, and have a new question (nothing
> to do with this patch).
>
> If we split_huge_page_to_list on *tail* page (in fact, I haven't seen
> that used yet),

See ksm code for instance.

> mm/huge_memory.c:2666 checks "VM_BUG_ON_PAGE(!PageLocked(head), head);"
> in split_huge_page_to_list(), while
>
> mm/huge_memory.c:2497 does "unlock_page(subpage)", where subpage can
> be head in this scenario, in __split_huge_page().
>
> My confusion is
> 1) how the pin on the @subpage is got outside split_huge_page_to_list()?
>    can we ever get tail page?

This loop:

	for (i = 0; i < nr; i++) {
		struct page *subpage = head + i;
		if (subpage == page)
			continue;
		unlock_page(subpage);

		/*
		 * Subpages may be freed if there wasn't any mapping
		 * like if add_to_swap() is running on a lru page that
		 * had its mapping zapped. And freeing these pages
		 * requires taking the lru_lock so we do the put_page
		 * of the tail pages after the split is complete.
		 */
		put_page(subpage);
	}

We skip unlocking and unpinning the page split_huge_page() got called for.

> 2) head page is locked outside split_huge_page_to_list(), but unlocked
>    in __split_huge_page()?

If called on tail page, yes.
diff --git a/mm/migrate.c b/mm/migrate.c
index b234c3f3acb7..41ff2c9896c4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -295,6 +295,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 		goto out;
 
 	page = migration_entry_to_page(entry);
+	page = compound_head(page);
 
 	/*
 	 * Once page cache replacement of page migration started, page_count