Message ID | bc8567d7a2c08ab6fdbb8e94008157265d5d28a3.1622564942.git.xuyu@linux.alibaba.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mm, thp: relax migration wait when failed to get tail page | expand |
On Wed, Jun 02, 2021 at 12:31:41AM +0800, Xu Yu wrote: > We notice that hung task happens in a conner but practical scenario when > CONFIG_PREEMPT_NONE is enabled, as follows. > > Process 0 Process 1 Process 2..Inf > split_huge_page_to_list > unmap_page > split_huge_pmd_address > __migration_entry_wait(head) > __migration_entry_wait(tail) > remap_page (roll back) > remove_migration_ptes > rmap_walk_anon > cond_resched > > Where __migration_entry_wait(tail) is occurred in kernel space, e.g., > copy_to_user, which will immediately fault again without rescheduling, > and thus occupy the cpu fully. > > When there are too many processes performing __migration_entry_wait on > tail page, remap_page will never be done after cond_resched. > > This relaxes __migration_entry_wait on tail page, thus gives remap_page > a chance to complete. I don't like this at all. If only we could call wait_on_page_locked() without having a reference to the page. I /think/ the only obstacle to doing that is that Slab may reuse the PG_locked bit without following the pagecache wakeup protocol. That wouldn't be a hard thing to change; I think it's only slub that uses that bit, and it really could use any bit in the page flags. > Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com> > Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> > --- > mm/migrate.c | 7 +++++-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/mm/migrate.c b/mm/migrate.c > index b234c3f3acb7..df2dc39fe566 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -301,8 +301,11 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep, > * is zero; but we must not call put_and_wait_on_page_locked() without > * a ref. Use get_page_unless_zero(), and just fault again if it fails. > */ > - if (!get_page_unless_zero(page)) > - goto out; > + if (!get_page_unless_zero(page)) { > + pte_unmap_unlock(ptep, ptl); > + cond_resched(); > + return; > + } > pte_unmap_unlock(ptep, ptl); > put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE); > return; > -- > 2.20.1.2432.ga663e714 > >
On Wed, 2 Jun 2021, Xu Yu wrote: > We notice that hung task happens in a conner but practical scenario when > CONFIG_PREEMPT_NONE is enabled, as follows. > > Process 0 Process 1 Process 2..Inf > split_huge_page_to_list > unmap_page > split_huge_pmd_address > __migration_entry_wait(head) > __migration_entry_wait(tail) > remap_page (roll back) > remove_migration_ptes > rmap_walk_anon > cond_resched > > Where __migration_entry_wait(tail) is occurred in kernel space, e.g., > copy_to_user, which will immediately fault again without rescheduling, > and thus occupy the cpu fully. > > When there are too many processes performing __migration_entry_wait on > tail page, remap_page will never be done after cond_resched. > > This relaxes __migration_entry_wait on tail page, thus gives remap_page > a chance to complete. > > Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com> > Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> Well caught: you're absolutely right that there's a bug there. But isn't cond_resched() just papering over the real bug, and what it should do is a "page = compound_head(page);" before the get_page_unless_zero()? How does that work out in your testing? Hugh > --- > mm/migrate.c | 7 +++++-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/mm/migrate.c b/mm/migrate.c > index b234c3f3acb7..df2dc39fe566 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -301,8 +301,11 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep, > * is zero; but we must not call put_and_wait_on_page_locked() without > * a ref. Use get_page_unless_zero(), and just fault again if it fails. > */ > - if (!get_page_unless_zero(page)) > - goto out; > + if (!get_page_unless_zero(page)) { > + pte_unmap_unlock(ptep, ptl); > + cond_resched(); > + return; > + } > pte_unmap_unlock(ptep, ptl); > put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE); > return; > -- > 2.20.1.2432.ga663e714 > > >
On Tue, Jun 01, 2021 at 09:55:56AM -0700, Hugh Dickins wrote: > On Wed, 2 Jun 2021, Xu Yu wrote: > > > We notice that hung task happens in a conner but practical scenario when > > CONFIG_PREEMPT_NONE is enabled, as follows. > > > > Process 0 Process 1 Process 2..Inf > > split_huge_page_to_list > > unmap_page > > split_huge_pmd_address > > __migration_entry_wait(head) > > __migration_entry_wait(tail) > > remap_page (roll back) > > remove_migration_ptes > > rmap_walk_anon > > cond_resched > > > > Where __migration_entry_wait(tail) is occurred in kernel space, e.g., > > copy_to_user, which will immediately fault again without rescheduling, > > and thus occupy the cpu fully. > > > > When there are too many processes performing __migration_entry_wait on > > tail page, remap_page will never be done after cond_resched. > > > > This relaxes __migration_entry_wait on tail page, thus gives remap_page > > a chance to complete. > > > > Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com> > > Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> > > Well caught: you're absolutely right that there's a bug there. > But isn't cond_resched() just papering over the real bug, and > what it should do is a "page = compound_head(page);" before the > get_page_unless_zero()? How does that work out in your testing? You do realise you're strengthening my case for folios by suggesting that, don't you? ;-) I was going to suggest that it won't make any difference because the page reference count is frozen, but the freezing happens after the call to unmap_page(), so it may make a difference.
On Tue, 1 Jun 2021, Matthew Wilcox wrote: > On Tue, Jun 01, 2021 at 09:55:56AM -0700, Hugh Dickins wrote: > > > > Well caught: you're absolutely right that there's a bug there. > > But isn't cond_resched() just papering over the real bug, and > > what it should do is a "page = compound_head(page);" before the > > get_page_unless_zero()? How does that work out in your testing? > > You do realise you're strengthening my case for folios by suggesting > that, don't you? ;-) Hah! Well, I do realize that I'm offering you a marketing opportunity. And you won't believe how many patches I dread to post for fear of that ;-) But I'm not so sure that it strengthens your case: apparently folios had not detected this? Or do you have a hoard of folio-detected fixes waiting for the day, and a folio-kit for each of the stable releases? > > I was going to suggest that it won't make any difference because the > page reference count is frozen, but the freezing happens after the call > to unmap_page(), so it may make a difference. I think that's a good point: I may have just jumped on the missing compound_head(), without thinking it through as far as you have. I'm having trouble remembering the dynamics now; but I think there are cond_resched()s in the unmap_page() part, so the splitter may get preempted even on a non-preempt kernel; whereas the frozen part is all done expeditiously, with interrupts disabled. Greg discovered the same issue recently, but we all got sidetracked, and I don't know where his investigation ended up. He was in favour of cond_resched(), I was in favour of compound_head(); and I think I was coming to the conclusion that if cond_resched() is needed, it should not be there in __migration_entry_wait(), but somewhere up in mm/gup.c, so that other faults that retry, expecting to reschedule on return to userspace, do not get trapped in kernelspace this way. Waiting on migration entries from THP splitting is an egregious example, but others may be suffering too. Hugh
On Tue, Jun 01, 2021 at 12:10:48PM -0700, Hugh Dickins wrote: > On Tue, 1 Jun 2021, Matthew Wilcox wrote: > > On Tue, Jun 01, 2021 at 09:55:56AM -0700, Hugh Dickins wrote: > > > > > > Well caught: you're absolutely right that there's a bug there. > > > But isn't cond_resched() just papering over the real bug, and > > > what it should do is a "page = compound_head(page);" before the > > > get_page_unless_zero()? How does that work out in your testing? > > > > You do realise you're strengthening my case for folios by suggesting > > that, don't you? ;-) > > Hah! Well, I do realize that I'm offering you a marketing opportunity. > And you won't believe how many patches I dread to post for fear of that ;-) > > But I'm not so sure that it strengthens your case: apparently folios > had not detected this? Or do you have a hoard of folio-detected fixes > waiting for the day, and a folio-kit for each of the stable releases? Oh, I wish! I haven't been working on converting the migration code to use folios. If I'd taken the step to convert put_and_wait_on_page_locked() to folio_put_and_wait_locked(), I would have fixed the bug, but I'm not sure I would have noticed that it was fixing a bug. I would have probably converted migration_entry_to_page() to be migration_entry_to_folio() and just inadvertently fixed it. > > I was going to suggest that it won't make any difference because the > > page reference count is frozen, but the freezing happens after the call > > to unmap_page(), so it may make a difference. > > I think that's a good point: I may have just jumped on the missing > compound_head(), without thinking it through as far as you have. > > I'm having trouble remembering the dynamics now; but I think there > are cond_resched()s in the unmap_page() part, so the splitter may > get preempted even on a non-preempt kernel; whereas the frozen > part is all done expeditiously, with interrupts disabled. That would certainly make a difference.
On 6/2/21 3:10 AM, Hugh Dickins wrote: > On Tue, 1 Jun 2021, Matthew Wilcox wrote: >> On Tue, Jun 01, 2021 at 09:55:56AM -0700, Hugh Dickins wrote: >>> >>> Well caught: you're absolutely right that there's a bug there. >>> But isn't cond_resched() just papering over the real bug, and >>> what it should do is a "page = compound_head(page);" before the >>> get_page_unless_zero()? How does that work out in your testing? >> >> You do realise you're strengthening my case for folios by suggesting >> that, don't you? ;-) > > Hah! Well, I do realize that I'm offering you a marketing opportunity. > And you won't believe how many patches I dread to post for fear of that ;-) > > But I'm not so sure that it strengthens your case: apparently folios > had not detected this? Or do you have a hoard of folio-detected fixes > waiting for the day, and a folio-kit for each of the stable releases? > >> >> I was going to suggest that it won't make any difference because the >> page reference count is frozen, but the freezing happens after the call >> to unmap_page(), so it may make a difference. > > I think that's a good point: I may have just jumped on the missing > compound_head(), without thinking it through as far as you have. > > I'm having trouble remembering the dynamics now; but I think there > are cond_resched()s in the unmap_page() part, so the splitter may > get preempted even on a non-preempt kernel; whereas the frozen > part is all done expeditiously, with interrupts disabled. > > Greg discovered the same issue recently, but we all got sidetracked, > and I don't know where his investigation ended up. He was in favour > of cond_resched(), I was in favour of compound_head(); and I think I I ever considered about using compound_head, but isn't there another race that, the following put_and_wait_on_page_locked operates on the "tail page" which has been split and is now a single page? Anyway, I will test and verify compound_head. > was coming to the conclusion that if cond_resched() is needed, it > should not be there in __migration_entry_wait(), but somewhere up > in mm/gup.c, so that other faults that retry, expecting to reschedule > on return to userspace, do not get trapped in kernelspace this way. Agreed. I will send v2, if cond_resched is still an option. > > Waiting on migration entries from THP splitting is an egregious > example, but others may be suffering too. > > Hugh >
On Wed, Jun 02, 2021 at 11:27:47AM +0800, Yu Xu wrote: > On 6/2/21 3:10 AM, Hugh Dickins wrote: > > On Tue, 1 Jun 2021, Matthew Wilcox wrote: > > > On Tue, Jun 01, 2021 at 09:55:56AM -0700, Hugh Dickins wrote: > > > > > > > > Well caught: you're absolutely right that there's a bug there. > > > > But isn't cond_resched() just papering over the real bug, and > > > > what it should do is a "page = compound_head(page);" before the > > > > get_page_unless_zero()? How does that work out in your testing? > > > > > > You do realise you're strengthening my case for folios by suggesting > > > that, don't you? ;-) > > > > Hah! Well, I do realize that I'm offering you a marketing opportunity. > > And you won't believe how many patches I dread to post for fear of that ;-) > > > > But I'm not so sure that it strengthens your case: apparently folios > > had not detected this? Or do you have a hoard of folio-detected fixes > > waiting for the day, and a folio-kit for each of the stable releases? > > > > > > > > I was going to suggest that it won't make any difference because the > > > page reference count is frozen, but the freezing happens after the call > > > to unmap_page(), so it may make a difference. > > > > I think that's a good point: I may have just jumped on the missing > > compound_head(), without thinking it through as far as you have. > > > > I'm having trouble remembering the dynamics now; but I think there > > are cond_resched()s in the unmap_page() part, so the splitter may > > get preempted even on a non-preempt kernel; whereas the frozen > > part is all done expeditiously, with interrupts disabled. > > > > Greg discovered the same issue recently, but we all got sidetracked, > > and I don't know where his investigation ended up. He was in favour > > of cond_resched(), I was in favour of compound_head(); and I think I > > I ever considered about using compound_head, but isn't there another > race that, the following put_and_wait_on_page_locked operates on the > "tail page" which has been split and is now a single page? No, having your own reference on a page prevents the page from being split. But that's a good question to ask.
On 6/2/21 7:58 PM, Matthew Wilcox wrote: > On Wed, Jun 02, 2021 at 11:27:47AM +0800, Yu Xu wrote: >> On 6/2/21 3:10 AM, Hugh Dickins wrote: >>> On Tue, 1 Jun 2021, Matthew Wilcox wrote: >>>> On Tue, Jun 01, 2021 at 09:55:56AM -0700, Hugh Dickins wrote: >>>>> >>>>> Well caught: you're absolutely right that there's a bug there. >>>>> But isn't cond_resched() just papering over the real bug, and >>>>> what it should do is a "page = compound_head(page);" before the >>>>> get_page_unless_zero()? How does that work out in your testing? >>>> >>>> You do realise you're strengthening my case for folios by suggesting >>>> that, don't you? ;-) >>> >>> Hah! Well, I do realize that I'm offering you a marketing opportunity. >>> And you won't believe how many patches I dread to post for fear of that ;-) >>> >>> But I'm not so sure that it strengthens your case: apparently folios >>> had not detected this? Or do you have a hoard of folio-detected fixes >>> waiting for the day, and a folio-kit for each of the stable releases? >>> >>>> >>>> I was going to suggest that it won't make any difference because the >>>> page reference count is frozen, but the freezing happens after the call >>>> to unmap_page(), so it may make a difference. >>> >>> I think that's a good point: I may have just jumped on the missing >>> compound_head(), without thinking it through as far as you have. >>> >>> I'm having trouble remembering the dynamics now; but I think there >>> are cond_resched()s in the unmap_page() part, so the splitter may >>> get preempted even on a non-preempt kernel; whereas the frozen >>> part is all done expeditiously, with interrupts disabled. >>> >>> Greg discovered the same issue recently, but we all got sidetracked, >>> and I don't know where his investigation ended up. He was in favour >>> of cond_resched(), I was in favour of compound_head(); and I think I >> >> I ever considered about using compound_head, but isn't there another >> race that, the following put_and_wait_on_page_locked operates on the >> "tail page" which has been split and is now a single page? > > No, having your own reference on a page prevents the page from being > split. But that's a good question to ask. Thanks. I have recalled that head page is frozen when splitting THP.
On 6/2/21 12:55 AM, Hugh Dickins wrote: > On Wed, 2 Jun 2021, Xu Yu wrote: > >> We notice that hung task happens in a conner but practical scenario when >> CONFIG_PREEMPT_NONE is enabled, as follows. >> >> Process 0 Process 1 Process 2..Inf >> split_huge_page_to_list >> unmap_page >> split_huge_pmd_address >> __migration_entry_wait(head) >> __migration_entry_wait(tail) >> remap_page (roll back) >> remove_migration_ptes >> rmap_walk_anon >> cond_resched >> >> Where __migration_entry_wait(tail) is occurred in kernel space, e.g., >> copy_to_user, which will immediately fault again without rescheduling, >> and thus occupy the cpu fully. >> >> When there are too many processes performing __migration_entry_wait on >> tail page, remap_page will never be done after cond_resched. >> >> This relaxes __migration_entry_wait on tail page, thus gives remap_page >> a chance to complete. >> >> Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com> >> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> > > Well caught: you're absolutely right that there's a bug there. > But isn't cond_resched() just papering over the real bug, and > what it should do is a "page = compound_head(page);" before the > get_page_unless_zero()? How does that work out in your testing? compound_head works. The patched kernel is alive for hours under our reproducer, which usually makes the vanilla kernel hung after tens of minutes at most. If we use compound_head, the behavior of __migration_entry_wait(tail) changes from "retry fault" to "prevent THP from being split". Is that right? Then which is preferred? If it were me, I would prefer "retry fault". > > Hugh > >> --- >> mm/migrate.c | 7 +++++-- >> 1 file changed, 5 insertions(+), 2 deletions(-) >> >> diff --git a/mm/migrate.c b/mm/migrate.c >> index b234c3f3acb7..df2dc39fe566 100644 >> --- a/mm/migrate.c >> +++ b/mm/migrate.c >> @@ -301,8 +301,11 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep, >> * is zero; but we must not call put_and_wait_on_page_locked() without >> * a ref. Use get_page_unless_zero(), and just fault again if it fails. >> */ >> - if (!get_page_unless_zero(page)) >> - goto out; >> + if (!get_page_unless_zero(page)) { >> + pte_unmap_unlock(ptep, ptl); >> + cond_resched(); >> + return; >> + } >> pte_unmap_unlock(ptep, ptl); >> put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE); >> return; >> -- >> 2.20.1.2432.ga663e714 >> >> >>
On Wed, 2 Jun 2021, Yu Xu wrote: > On 6/2/21 12:55 AM, Hugh Dickins wrote: > > On Wed, 2 Jun 2021, Xu Yu wrote: > > > > > We notice that hung task happens in a conner but practical scenario when > > > CONFIG_PREEMPT_NONE is enabled, as follows. > > > > > > Process 0 Process 1 Process > > > 2..Inf > > > split_huge_page_to_list > > > unmap_page > > > split_huge_pmd_address > > > __migration_entry_wait(head) > > > __migration_entry_wait(tail) > > > remap_page (roll back) > > > remove_migration_ptes > > > rmap_walk_anon > > > cond_resched > > > > > > Where __migration_entry_wait(tail) is occurred in kernel space, e.g., > > > copy_to_user, which will immediately fault again without rescheduling, > > > and thus occupy the cpu fully. > > > > > > When there are too many processes performing __migration_entry_wait on > > > tail page, remap_page will never be done after cond_resched. > > > > > > This relaxes __migration_entry_wait on tail page, thus gives remap_page > > > a chance to complete. > > > > > > Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com> > > > Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> > > > > Well caught: you're absolutely right that there's a bug there. > > But isn't cond_resched() just papering over the real bug, and > > what it should do is a "page = compound_head(page);" before the > > get_page_unless_zero()? How does that work out in your testing? > > compound_head works. The patched kernel is alive for hours under > our reproducer, which usually makes the vanilla kernel hung after > tens of minutes at most. Oh, that's good news, thanks. (It's still likely that a well-placed cond_resched() somewhere in mm/gup.c would also be a good idea, but none of us have yet got around to identifying where.) > > If we use compound_head, the behavior of __migration_entry_wait(tail) > changes from "retry fault" to "prevent THP from being split". Is that > right? Then which is preferred? If it were me, I would prefer "retry > fault". As Matthew remarked, you are asking very good questions, and split migration entries are difficult to think about. But I believe you'll find it works out okay. The point of *put_and_* wait_on_page_locked() is that it does drop the page reference you acquired with get_page_unless_zero, as soon as the page is on the wait queue, before actually waiting. So splitting the THP is only prevented for a brief interval. Now, it's true that if there are very many tasks faulting on portions of the huge page, in that interval between inserting the migration entries and freezing the huge page's refcount to 0, they can reduce the chance of splitting considerably. But that's not an excuse for for doing get_page_unless_zero() on the wrong thing, as it was doing. Hugh
On 6/2/21 11:57 PM, Hugh Dickins wrote: > On Wed, 2 Jun 2021, Yu Xu wrote: >> On 6/2/21 12:55 AM, Hugh Dickins wrote: >>> On Wed, 2 Jun 2021, Xu Yu wrote: >>> >>>> We notice that hung task happens in a conner but practical scenario when >>>> CONFIG_PREEMPT_NONE is enabled, as follows. >>>> >>>> Process 0 Process 1 Process >>>> 2..Inf >>>> split_huge_page_to_list >>>> unmap_page >>>> split_huge_pmd_address >>>> __migration_entry_wait(head) >>>> __migration_entry_wait(tail) >>>> remap_page (roll back) >>>> remove_migration_ptes >>>> rmap_walk_anon >>>> cond_resched >>>> >>>> Where __migration_entry_wait(tail) is occurred in kernel space, e.g., >>>> copy_to_user, which will immediately fault again without rescheduling, >>>> and thus occupy the cpu fully. >>>> >>>> When there are too many processes performing __migration_entry_wait on >>>> tail page, remap_page will never be done after cond_resched. >>>> >>>> This relaxes __migration_entry_wait on tail page, thus gives remap_page >>>> a chance to complete. >>>> >>>> Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com> >>>> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> >>> >>> Well caught: you're absolutely right that there's a bug there. >>> But isn't cond_resched() just papering over the real bug, and >>> what it should do is a "page = compound_head(page);" before the >>> get_page_unless_zero()? How does that work out in your testing? >> >> compound_head works. The patched kernel is alive for hours under >> our reproducer, which usually makes the vanilla kernel hung after >> tens of minutes at most. > > Oh, that's good news, thanks. > > (It's still likely that a well-placed cond_resched() somewhere in > mm/gup.c would also be a good idea, but none of us have yet got > around to identifying where.) We neither. If really have to do it outside of __migration_entry_wait, return value of __migration_entry_wait is needed, and many related functions have to updated, which may be undesirable. > >> >> If we use compound_head, the behavior of __migration_entry_wait(tail) >> changes from "retry fault" to "prevent THP from being split". Is that >> right? Then which is preferred? If it were me, I would prefer "retry >> fault". > > As Matthew remarked, you are asking very good questions, and split > migration entries are difficult to think about. But I believe you'll > find it works out okay. > > The point of *put_and_* wait_on_page_locked() is that it does drop > the page reference you acquired with get_page_unless_zero, as soon > as the page is on the wait queue, before actually waiting. > > So splitting the THP is only prevented for a brief interval. Now, > it's true that if there are very many tasks faulting on portions > of the huge page, in that interval between inserting the migration > entries and freezing the huge page's refcount to 0, they can reduce > the chance of splitting considerably. But that's not an excuse for > for doing get_page_unless_zero() on the wrong thing, as it was doing. We finally come to your solution, i.e., compound_head. In that case, who should resend the compound_head patch to this issue? shall we do with your s.o.b? > > Hugh >
On Mon, 7 Jun 2021, Yu Xu wrote: > On 6/2/21 11:57 PM, Hugh Dickins wrote: > > On Wed, 2 Jun 2021, Yu Xu wrote: > >> On 6/2/21 12:55 AM, Hugh Dickins wrote: > >>> On Wed, 2 Jun 2021, Xu Yu wrote: > >>> > >>>> We notice that hung task happens in a conner but practical scenario when > >>>> CONFIG_PREEMPT_NONE is enabled, as follows. > >>>> > >>>> Process 0 Process 1 Process > >>>> 2..Inf > >>>> split_huge_page_to_list > >>>> unmap_page > >>>> split_huge_pmd_address > >>>> __migration_entry_wait(head) > >>>> __migration_entry_wait(tail) > >>>> remap_page (roll back) > >>>> remove_migration_ptes > >>>> rmap_walk_anon > >>>> cond_resched > >>>> > >>>> Where __migration_entry_wait(tail) is occurred in kernel space, e.g., > >>>> copy_to_user, which will immediately fault again without rescheduling, > >>>> and thus occupy the cpu fully. > >>>> > >>>> When there are too many processes performing __migration_entry_wait on > >>>> tail page, remap_page will never be done after cond_resched. > >>>> > >>>> This relaxes __migration_entry_wait on tail page, thus gives remap_page > >>>> a chance to complete. > >>>> > >>>> Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com> > >>>> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> > >>> > >>> Well caught: you're absolutely right that there's a bug there. > >>> But isn't cond_resched() just papering over the real bug, and > >>> what it should do is a "page = compound_head(page);" before the > >>> get_page_unless_zero()? How does that work out in your testing? > >> > >> compound_head works. The patched kernel is alive for hours under > >> our reproducer, which usually makes the vanilla kernel hung after > >> tens of minutes at most. > > > > Oh, that's good news, thanks. > > > > (It's still likely that a well-placed cond_resched() somewhere in > > mm/gup.c would also be a good idea, but none of us have yet got > > around to identifying where.) > > We neither. If really have to do it outside of __migration_entry_wait, > return value of __migration_entry_wait is needed, and many related > functions have to updated, which may be undesirable. No, it would not be necessary to plumb through a return value from __migration_entry_wait(): I didn't mean that this GUP cond_resched() should be done only for the migration case, but (I guess) on any path where handle_mm_fault() returns "success" for a retry, yet the retry of follow_page_mask() fails. But now that I look, I see there is already a cond_resched() there! So I'm puzzled as to how your cond_resched() in __migration_entry_wait() appeared to help - well, you never actually said that it helped, but I assume that it did, or you wouldn't have bothered to send that patch? It's irrelevant, now that we've admitted there should be a "page = compound_head(page)" in there, and you have said that helps, and that's the patch we want to send now. But it troubles me, to be unable to explain it. Two cond_resched()s are not twice as good as one. > > > > >> > >> If we use compound_head, the behavior of __migration_entry_wait(tail) > >> changes from "retry fault" to "prevent THP from being split". Is that > >> right? Then which is preferred? If it were me, I would prefer "retry > >> fault". > > > > As Matthew remarked, you are asking very good questions, and split > > migration entries are difficult to think about. But I believe you'll > > find it works out okay. > > > > The point of *put_and_* wait_on_page_locked() is that it does drop > > the page reference you acquired with get_page_unless_zero, as soon > > as the page is on the wait queue, before actually waiting. > > > > So splitting the THP is only prevented for a brief interval. Now, > > it's true that if there are very many tasks faulting on portions > > of the huge page, in that interval between inserting the migration > > entries and freezing the huge page's refcount to 0, they can reduce > > the chance of splitting considerably. But that's not an excuse for > > for doing get_page_unless_zero() on the wrong thing, as it was doing. > > We finally come to your solution, i.e., compound_head. > > In that case, who should resend the compound_head patch to this issue? > shall we do with your s.o.b? I was rather expecting you to send the patch: with your s.o.b, not mine. You could say "Suggested-by: Hugh Dickins <hughd@google.com>" if you like. And I suggest that you put that "page = compound_head(page);" line immediately after the "page = migration_entry_to_page(entry);" line, so as not to interfere with the comment above get_page_unless_zero(). (No need for a comment on the compound_head(): it's self-explanatory.) I did meanwhile research other callers of migration_entry_to_page(): it had been on my mind, that others might need a compound_head() too, and perhaps it should be done inside migration_entry_to_page() itself. But so far as I can tell (I don't really know about the s390 one), the others are okay, and it would just be unnecessary overhead (in particular, the mm_counter() stuff looks correct on a tail). I *think* the right Fixes tag would be Fixes: ba98828088ad ("thp: add option to setup migration entries during PMD split") though I'm not sure of that; it's probably good enough. (With all this direction, I did wonder if it would be kinder just to send a patch myself, but using some of your comments: but I didn't understand "conner" in your description, so couldn't do that.) Thanks! Hugh
On 6/8/21 12:44 PM, Hugh Dickins wrote: > On Mon, 7 Jun 2021, Yu Xu wrote: >> On 6/2/21 11:57 PM, Hugh Dickins wrote: >>> On Wed, 2 Jun 2021, Yu Xu wrote: >>>> On 6/2/21 12:55 AM, Hugh Dickins wrote: >>>>> On Wed, 2 Jun 2021, Xu Yu wrote: >>>>> >>>>>> We notice that hung task happens in a conner but practical scenario when >>>>>> CONFIG_PREEMPT_NONE is enabled, as follows. >>>>>> >>>>>> Process 0 Process 1 Process >>>>>> 2..Inf >>>>>> split_huge_page_to_list >>>>>> unmap_page >>>>>> split_huge_pmd_address >>>>>> __migration_entry_wait(head) >>>>>> __migration_entry_wait(tail) >>>>>> remap_page (roll back) >>>>>> remove_migration_ptes >>>>>> rmap_walk_anon >>>>>> cond_resched >>>>>> >>>>>> Where __migration_entry_wait(tail) is occurred in kernel space, e.g., >>>>>> copy_to_user, which will immediately fault again without rescheduling, >>>>>> and thus occupy the cpu fully. >>>>>> >>>>>> When there are too many processes performing __migration_entry_wait on >>>>>> tail page, remap_page will never be done after cond_resched. >>>>>> >>>>>> This relaxes __migration_entry_wait on tail page, thus gives remap_page >>>>>> a chance to complete. >>>>>> >>>>>> Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com> >>>>>> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> >>>>> >>>>> Well caught: you're absolutely right that there's a bug there. >>>>> But isn't cond_resched() just papering over the real bug, and >>>>> what it should do is a "page = compound_head(page);" before the >>>>> get_page_unless_zero()? How does that work out in your testing? >>>> >>>> compound_head works. The patched kernel is alive for hours under >>>> our reproducer, which usually makes the vanilla kernel hung after >>>> tens of minutes at most. >>> >>> Oh, that's good news, thanks. >>> >>> (It's still likely that a well-placed cond_resched() somewhere in >>> mm/gup.c would also be a good idea, but none of us have yet got >>> around to identifying where.) >> >> We neither. If really have to do it outside of __migration_entry_wait, >> return value of __migration_entry_wait is needed, and many related >> functions have to updated, which may be undesirable. > > No, it would not be necessary to plumb through a return value from > __migration_entry_wait(): I didn't mean that this GUP cond_resched() > should be done only for the migration case, but (I guess) on any path > where handle_mm_fault() returns "success" for a retry, yet the retry > of follow_page_mask() fails. > > But now that I look, I see there is already a cond_resched() there! Do you mean might_sleep in mmap_read_trylock within do_user_addr_fault? If so, our environment has CONFIG_PREEMPT_NONE is enabled, and the __migration_entry_wait happens in kernel when do something like copy_to_user (e.g., fstat). > > So I'm puzzled as to how your cond_resched() in __migration_entry_wait() > appeared to help - well, you never actually said that it helped, but I > assume that it did, or you wouldn't have bothered to send that patch? > > It's irrelevant, now that we've admitted there should be a > "page = compound_head(page)" in there, and you have said that helps, > and that's the patch we want to send now. But it troubles me, to be > unable to explain it. Two cond_resched()s are not twice as good as one. > >> >>> >>>> >>>> If we use compound_head, the behavior of __migration_entry_wait(tail) >>>> changes from "retry fault" to "prevent THP from being split". Is that >>>> right? Then which is preferred? If it were me, I would prefer "retry >>>> fault". >>> >>> As Matthew remarked, you are asking very good questions, and split >>> migration entries are difficult to think about. But I believe you'll >>> find it works out okay. >>> >>> The point of *put_and_* wait_on_page_locked() is that it does drop >>> the page reference you acquired with get_page_unless_zero, as soon >>> as the page is on the wait queue, before actually waiting. >>> >>> So splitting the THP is only prevented for a brief interval. Now, >>> it's true that if there are very many tasks faulting on portions >>> of the huge page, in that interval between inserting the migration >>> entries and freezing the huge page's refcount to 0, they can reduce >>> the chance of splitting considerably. But that's not an excuse for >>> for doing get_page_unless_zero() on the wrong thing, as it was doing. >> >> We finally come to your solution, i.e., compound_head. >> >> In that case, who should resend the compound_head patch to this issue? >> shall we do with your s.o.b? > > I was rather expecting you to send the patch: with your s.o.b, not mine. > You could say "Suggested-by: Hugh Dickins <hughd@google.com>" if you like. > > And I suggest that you put that "page = compound_head(page);" line > immediately after the "page = migration_entry_to_page(entry);" line, > so as not to interfere with the comment above get_page_unless_zero(). > > (No need for a comment on the compound_head(): it's self-explanatory.) > > I did meanwhile research other callers of migration_entry_to_page(): > it had been on my mind, that others might need a compound_head() too, > and perhaps it should be done inside migration_entry_to_page() itself. > > But so far as I can tell (I don't really know about the s390 one), > the others are okay, and it would just be unnecessary overhead > (in particular, the mm_counter() stuff looks correct on a tail). > > I *think* the right Fixes tag would be > Fixes: ba98828088ad ("thp: add option to setup migration entries during PMD split") > though I'm not sure of that; it's probably good enough. > > (With all this direction, I did wonder if it would be kinder just to > send a patch myself, but using some of your comments: but I didn't > understand "conner" in your description, so couldn't do that.) > > Thanks! > Hugh >
On Tue, 8 Jun 2021, Yu Xu wrote: > On 6/8/21 12:44 PM, Hugh Dickins wrote: > > On Mon, 7 Jun 2021, Yu Xu wrote: > >> On 6/2/21 11:57 PM, Hugh Dickins wrote: > >>> On Wed, 2 Jun 2021, Yu Xu wrote: > >>>> On 6/2/21 12:55 AM, Hugh Dickins wrote: > >>>>> On Wed, 2 Jun 2021, Xu Yu wrote: > >>>>> > >>>>>> We notice that hung task happens in a conner but practical scenario > >>>>>> when > >>>>>> CONFIG_PREEMPT_NONE is enabled, as follows. > >>>>>> > >>>>>> Process 0 Process 1 Process > >>>>>> 2..Inf > >>>>>> split_huge_page_to_list > >>>>>> unmap_page > >>>>>> split_huge_pmd_address > >>>>>> __migration_entry_wait(head) > >>>>>> __migration_entry_wait(tail) > >>>>>> remap_page (roll back) > >>>>>> remove_migration_ptes > >>>>>> rmap_walk_anon > >>>>>> cond_resched > >>>>>> > >>>>>> Where __migration_entry_wait(tail) is occurred in kernel space, e.g., > >>>>>> copy_to_user, which will immediately fault again without rescheduling, > >>>>>> and thus occupy the cpu fully. > >>>>>> > >>>>>> When there are too many processes performing __migration_entry_wait on > >>>>>> tail page, remap_page will never be done after cond_resched. > >>>>>> > >>>>>> This relaxes __migration_entry_wait on tail page, thus gives remap_page > >>>>>> a chance to complete. > >>>>>> > >>>>>> Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com> > >>>>>> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> > >>>>> > >>>>> Well caught: you're absolutely right that there's a bug there. > >>>>> But isn't cond_resched() just papering over the real bug, and > >>>>> what it should do is a "page = compound_head(page);" before the > >>>>> get_page_unless_zero()? How does that work out in your testing? > >>>> > >>>> compound_head works. The patched kernel is alive for hours under > >>>> our reproducer, which usually makes the vanilla kernel hung after > >>>> tens of minutes at most. > >>> > >>> Oh, that's good news, thanks. > >>> > >>> (It's still likely that a well-placed cond_resched() somewhere in > >>> mm/gup.c would also be a good idea, but none of us have yet got > >>> around to identifying where.) > >> > >> We neither. If really have to do it outside of __migration_entry_wait, > >> return value of __migration_entry_wait is needed, and many related > >> functions have to updated, which may be undesirable. > > > > No, it would not be necessary to plumb through a return value from > > __migration_entry_wait(): I didn't mean that this GUP cond_resched() > > should be done only for the migration case, but (I guess) on any path > > where handle_mm_fault() returns "success" for a retry, yet the retry > > of follow_page_mask() fails. > > > > But now that I look, I see there is already a cond_resched() there! > > Do you mean might_sleep in mmap_read_trylock within do_user_addr_fault? > > If so, our environment has CONFIG_PREEMPT_NONE is enabled, and the > __migration_entry_wait happens in kernel when do something like > copy_to_user (e.g., fstat). Oh, I am sorry: now I see that you did mention copy_to_user() in your original post, but I'm afraid I was fixated on get_user_pages() all along: a different way in which the kernel handles a fault on user address space without returning to userspace immediately afterwards. So, the GUP case has its cond_resched() and is okay, but the arch/whatever/mm/fault.c case is the one which probably deserves a cond_resched() somewhere (on the architecture in question anyway - x86?). I was reluctant to suggest where to place it in GUP, I am even more reluctant to say where in arch/whatever/mm/fault.c: I haven't thought through that code in years. x86, somewhere in do_user_addr_fault(), probably yes; but it's better to cond_resched() without holding a lock; and better to avoid it on first entry too. But we don't need to decide that, if the compound_head() is a satisfactory solution for you in practice. Sorry for confusing you with my own confusion, and thank you for clearing it up. Hugh
diff --git a/mm/migrate.c b/mm/migrate.c index b234c3f3acb7..df2dc39fe566 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -301,8 +301,11 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep, * is zero; but we must not call put_and_wait_on_page_locked() without * a ref. Use get_page_unless_zero(), and just fault again if it fails. */ - if (!get_page_unless_zero(page)) - goto out; + if (!get_page_unless_zero(page)) { + pte_unmap_unlock(ptep, ptl); + cond_resched(); + return; + } pte_unmap_unlock(ptep, ptl); put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE); return;