Message ID | 20230425084627.3573866-2-fengwei.yin@intel.com (mailing list archive)
---|---
State | New
Series | Reduce lock contention related with large folio
On Tue, Apr 25, 2023 at 04:46:26PM +0800, Yin Fengwei wrote: > free_transhuge_page() acquires split queue lock then check > whether the THP was added to deferred list or not. > > It's safe to check whether the THP is in deferred list or not. > When code hit free_transhuge_page(), there is no one tries > to update the folio's _deferred_list. > > If folio is not in deferred_list, it's safe to check without > acquiring lock. > > If folio is in deferred_list, the other node in deferred_list > adding/deleteing doesn't impact the return value of > list_epmty(@folio->_deferred_list). Typo. > > Running page_fault1 of will-it-scale + order 2 folio for anonymous > mapping with 96 processes on an Ice Lake 48C/96T test box, we could > see the 61% split_queue_lock contention: > - 71.28% 0.35% page_fault1_pro [kernel.kallsyms] [k] > release_pages > - 70.93% release_pages > - 61.42% free_transhuge_page > + 60.77% _raw_spin_lock_irqsave > > With this patch applied, the split_queue_lock contention is less > than 1%. > > Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> > Tested-by: Ryan Roberts <ryan.roberts@arm.com> > --- > mm/huge_memory.c | 19 ++++++++++++++++--- > 1 file changed, 16 insertions(+), 3 deletions(-) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 032fb0ef9cd1..c620f1f12247 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -2799,12 +2799,25 @@ void free_transhuge_page(struct page *page) > struct deferred_split *ds_queue = get_deferred_split_queue(folio); > unsigned long flags; > > - spin_lock_irqsave(&ds_queue->split_queue_lock, flags); > - if (!list_empty(&folio->_deferred_list)) { > + /* > + * At this point, there is no one trying to queue the folio > + * to deferred_list. folio->_deferred_list is not possible > + * being updated. > + * > + * If folio is already added to deferred_list, add/delete to/from > + * deferred_list will not impact list_empty(&folio->_deferred_list). 
> + * It's safe to check list_empty(&folio->_deferred_list) without > + * acquiring the lock. > + * > + * If folio is not in deferred_list, it's safe to check without > + * acquiring the lock. > + */ > + if (data_race(!list_empty(&folio->_deferred_list))) { > + spin_lock_irqsave(&ds_queue->split_queue_lock, flags); Recheck under lock? > ds_queue->split_queue_len--; > list_del(&folio->_deferred_list); > + spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); > } > - spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); > free_compound_page(page); > } > > -- > 2.30.2 > >
Yin Fengwei <fengwei.yin@intel.com> writes: > free_transhuge_page() acquires split queue lock then check > whether the THP was added to deferred list or not. > > It's safe to check whether the THP is in deferred list or not. > When code hit free_transhuge_page(), there is no one tries > to update the folio's _deferred_list. I think that it's clearer to enumerate all places pages are added and removed from deferred list. Then we can find out whether there's code path that may race with this. Take a glance at the search result of `grep split_queue_lock -r mm`. It seems that deferred_split_scan() may race with free_transhuge_page(), so we need to recheck with the lock held as Kirill pointed out. Best Regards, Huang, Ying > If folio is not in deferred_list, it's safe to check without > acquiring lock. > > If folio is in deferred_list, the other node in deferred_list > adding/deleteing doesn't impact the return value of > list_epmty(@folio->_deferred_list). > > Running page_fault1 of will-it-scale + order 2 folio for anonymous > mapping with 96 processes on an Ice Lake 48C/96T test box, we could > see the 61% split_queue_lock contention: > - 71.28% 0.35% page_fault1_pro [kernel.kallsyms] [k] > release_pages > - 70.93% release_pages > - 61.42% free_transhuge_page > + 60.77% _raw_spin_lock_irqsave > > With this patch applied, the split_queue_lock contention is less > than 1%. 
> > Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> > Tested-by: Ryan Roberts <ryan.roberts@arm.com> > --- > mm/huge_memory.c | 19 ++++++++++++++++--- > 1 file changed, 16 insertions(+), 3 deletions(-) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 032fb0ef9cd1..c620f1f12247 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -2799,12 +2799,25 @@ void free_transhuge_page(struct page *page) > struct deferred_split *ds_queue = get_deferred_split_queue(folio); > unsigned long flags; > > - spin_lock_irqsave(&ds_queue->split_queue_lock, flags); > - if (!list_empty(&folio->_deferred_list)) { > + /* > + * At this point, there is no one trying to queue the folio > + * to deferred_list. folio->_deferred_list is not possible > + * being updated. > + * > + * If folio is already added to deferred_list, add/delete to/from > + * deferred_list will not impact list_empty(&folio->_deferred_list). > + * It's safe to check list_empty(&folio->_deferred_list) without > + * acquiring the lock. > + * > + * If folio is not in deferred_list, it's safe to check without > + * acquiring the lock. > + */ > + if (data_race(!list_empty(&folio->_deferred_list))) { > + spin_lock_irqsave(&ds_queue->split_queue_lock, flags); > ds_queue->split_queue_len--; > list_del(&folio->_deferred_list); > + spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); > } > - spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); > free_compound_page(page); > }
On 4/25/23 20:38, Kirill A. Shutemov wrote: > On Tue, Apr 25, 2023 at 04:46:26PM +0800, Yin Fengwei wrote: >> free_transhuge_page() acquires split queue lock then check >> whether the THP was added to deferred list or not. >> >> It's safe to check whether the THP is in deferred list or not. >> When code hit free_transhuge_page(), there is no one tries >> to update the folio's _deferred_list. >> >> If folio is not in deferred_list, it's safe to check without >> acquiring lock. >> >> If folio is in deferred_list, the other node in deferred_list >> adding/deleteing doesn't impact the return value of >> list_epmty(@folio->_deferred_list). > > Typo. Oops. Will correct it in next version. > >> >> Running page_fault1 of will-it-scale + order 2 folio for anonymous >> mapping with 96 processes on an Ice Lake 48C/96T test box, we could >> see the 61% split_queue_lock contention: >> - 71.28% 0.35% page_fault1_pro [kernel.kallsyms] [k] >> release_pages >> - 70.93% release_pages >> - 61.42% free_transhuge_page >> + 60.77% _raw_spin_lock_irqsave >> >> With this patch applied, the split_queue_lock contention is less >> than 1%. >> >> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> >> Tested-by: Ryan Roberts <ryan.roberts@arm.com> >> --- >> mm/huge_memory.c | 19 ++++++++++++++++--- >> 1 file changed, 16 insertions(+), 3 deletions(-) >> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >> index 032fb0ef9cd1..c620f1f12247 100644 >> --- a/mm/huge_memory.c >> +++ b/mm/huge_memory.c >> @@ -2799,12 +2799,25 @@ void free_transhuge_page(struct page *page) >> struct deferred_split *ds_queue = get_deferred_split_queue(folio); >> unsigned long flags; >> >> - spin_lock_irqsave(&ds_queue->split_queue_lock, flags); >> - if (!list_empty(&folio->_deferred_list)) { >> + /* >> + * At this point, there is no one trying to queue the folio >> + * to deferred_list. folio->_deferred_list is not possible >> + * being updated. 
>> + * >> + * If folio is already added to deferred_list, add/delete to/from >> + * deferred_list will not impact list_empty(&folio->_deferred_list). >> + * It's safe to check list_empty(&folio->_deferred_list) without >> + * acquiring the lock. >> + * >> + * If folio is not in deferred_list, it's safe to check without >> + * acquiring the lock. >> + */ >> + if (data_race(!list_empty(&folio->_deferred_list))) { >> + spin_lock_irqsave(&ds_queue->split_queue_lock, flags); > > Recheck under lock?

My understanding is that even if there is a race, it doesn't affect the correctness of list_empty(&folio->_deferred_list):

- If the folio is not in the deferred_list, list_empty() always returns true.
- If the folio is in the deferred_list, list_empty() always returns false, even while a neighbouring element is being added to or removed from the deferred_list.

There is one precondition: no other user adds/removes the folio itself to/from the deferred_list concurrently. I think that holds for free_transhuge_page(), so rechecking after taking the lock is not necessary.

Thanks.

Regards
Yin, Fengwei

> >> ds_queue->split_queue_len--; >> list_del(&folio->_deferred_list); >> + spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); >> } >> - spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); >> free_compound_page(page); >> } >> >> -- >> 2.30.2 >> >> >
On 4/26/23 09:13, Huang, Ying wrote: > Yin Fengwei <fengwei.yin@intel.com> writes: > >> free_transhuge_page() acquires split queue lock then check >> whether the THP was added to deferred list or not. >> >> It's safe to check whether the THP is in deferred list or not. >> When code hit free_transhuge_page(), there is no one tries >> to update the folio's _deferred_list. > > I think that it's clearer to enumerate all places pages are added and > removed from deferred list. Then we can find out whether there's code > path that may race with this. > > Take a glance at the search result of `grep split_queue_lock -r mm`. It > seems that deferred_split_scan() may race with free_transhuge_page(), so > we need to recheck with the lock held as Kirill pointed out.

My understanding is that the check after taking the lock is not necessary. See my reply to Kirill. Thanks.

Regards
Yin, Fengwei

> > Best Regards, > Huang, Ying > >> If folio is not in deferred_list, it's safe to check without >> acquiring lock. >> >> If folio is in deferred_list, the other node in deferred_list >> adding/deleteing doesn't impact the return value of >> list_epmty(@folio->_deferred_list). >> >> Running page_fault1 of will-it-scale + order 2 folio for anonymous >> mapping with 96 processes on an Ice Lake 48C/96T test box, we could >> see the 61% split_queue_lock contention: >> - 71.28% 0.35% page_fault1_pro [kernel.kallsyms] [k] >> release_pages >> - 70.93% release_pages >> - 61.42% free_transhuge_page >> + 60.77% _raw_spin_lock_irqsave >> >> With this patch applied, the split_queue_lock contention is less >> than 1%. 
>> >> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> >> Tested-by: Ryan Roberts <ryan.roberts@arm.com> >> --- >> mm/huge_memory.c | 19 ++++++++++++++++--- >> 1 file changed, 16 insertions(+), 3 deletions(-) >> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >> index 032fb0ef9cd1..c620f1f12247 100644 >> --- a/mm/huge_memory.c >> +++ b/mm/huge_memory.c >> @@ -2799,12 +2799,25 @@ void free_transhuge_page(struct page *page) >> struct deferred_split *ds_queue = get_deferred_split_queue(folio); >> unsigned long flags; >> >> - spin_lock_irqsave(&ds_queue->split_queue_lock, flags); >> - if (!list_empty(&folio->_deferred_list)) { >> + /* >> + * At this point, there is no one trying to queue the folio >> + * to deferred_list. folio->_deferred_list is not possible >> + * being updated. >> + * >> + * If folio is already added to deferred_list, add/delete to/from >> + * deferred_list will not impact list_empty(&folio->_deferred_list). >> + * It's safe to check list_empty(&folio->_deferred_list) without >> + * acquiring the lock. >> + * >> + * If folio is not in deferred_list, it's safe to check without >> + * acquiring the lock. >> + */ >> + if (data_race(!list_empty(&folio->_deferred_list))) { >> + spin_lock_irqsave(&ds_queue->split_queue_lock, flags); >> ds_queue->split_queue_len--; >> list_del(&folio->_deferred_list); >> + spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); >> } >> - spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); >> free_compound_page(page); >> }
On 4/25/23 20:38, Kirill A. Shutemov wrote: > On Tue, Apr 25, 2023 at 04:46:26PM +0800, Yin Fengwei wrote: >> free_transhuge_page() acquires split queue lock then check >> whether the THP was added to deferred list or not. >> >> It's safe to check whether the THP is in deferred list or not. >> When code hit free_transhuge_page(), there is no one tries >> to update the folio's _deferred_list. >> >> If folio is not in deferred_list, it's safe to check without >> acquiring lock. >> >> If folio is in deferred_list, the other node in deferred_list >> adding/deleteing doesn't impact the return value of >> list_epmty(@folio->_deferred_list). > > Typo. > >> >> Running page_fault1 of will-it-scale + order 2 folio for anonymous >> mapping with 96 processes on an Ice Lake 48C/96T test box, we could >> see the 61% split_queue_lock contention: >> - 71.28% 0.35% page_fault1_pro [kernel.kallsyms] [k] >> release_pages >> - 70.93% release_pages >> - 61.42% free_transhuge_page >> + 60.77% _raw_spin_lock_irqsave >> >> With this patch applied, the split_queue_lock contention is less >> than 1%. >> >> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> >> Tested-by: Ryan Roberts <ryan.roberts@arm.com> >> --- >> mm/huge_memory.c | 19 ++++++++++++++++--- >> 1 file changed, 16 insertions(+), 3 deletions(-) >> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >> index 032fb0ef9cd1..c620f1f12247 100644 >> --- a/mm/huge_memory.c >> +++ b/mm/huge_memory.c >> @@ -2799,12 +2799,25 @@ void free_transhuge_page(struct page *page) >> struct deferred_split *ds_queue = get_deferred_split_queue(folio); >> unsigned long flags; >> >> - spin_lock_irqsave(&ds_queue->split_queue_lock, flags); >> - if (!list_empty(&folio->_deferred_list)) { >> + /* >> + * At this point, there is no one trying to queue the folio >> + * to deferred_list. folio->_deferred_list is not possible >> + * being updated. 
>> + * >> + * If folio is already added to deferred_list, add/delete to/from >> + * deferred_list will not impact list_empty(&folio->_deferred_list). >> + * It's safe to check list_empty(&folio->_deferred_list) without >> + * acquiring the lock. >> + * >> + * If folio is not in deferred_list, it's safe to check without >> + * acquiring the lock. >> + */ >> + if (data_race(!list_empty(&folio->_deferred_list))) { >> + spin_lock_irqsave(&ds_queue->split_queue_lock, flags); > > Recheck under lock? Huang Ying pointed out the race with deferred_split_scan(). And Yes. Need recheck under lock. Will update in next version. Regards Yin, Fengwei > >> ds_queue->split_queue_len--; >> list_del(&folio->_deferred_list); >> + spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); >> } >> - spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); >> free_compound_page(page); >> } >> >> -- >> 2.30.2 >> >> >
On 25/04/2023 09:46, Yin Fengwei wrote: > free_transhuge_page() acquires split queue lock then check > whether the THP was added to deferred list or not. > > It's safe to check whether the THP is in deferred list or not. > When code hit free_transhuge_page(), there is no one tries > to update the folio's _deferred_list. > > If folio is not in deferred_list, it's safe to check without > acquiring lock. > > If folio is in deferred_list, the other node in deferred_list > adding/deleteing doesn't impact the return value of > list_epmty(@folio->_deferred_list). > > Running page_fault1 of will-it-scale + order 2 folio for anonymous > mapping with 96 processes on an Ice Lake 48C/96T test box, we could > see the 61% split_queue_lock contention: > - 71.28% 0.35% page_fault1_pro [kernel.kallsyms] [k] > release_pages > - 70.93% release_pages > - 61.42% free_transhuge_page > + 60.77% _raw_spin_lock_irqsave > > With this patch applied, the split_queue_lock contention is less > than 1%. > > Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> > Tested-by: Ryan Roberts <ryan.roberts@arm.com> > --- > mm/huge_memory.c | 19 ++++++++++++++++--- > 1 file changed, 16 insertions(+), 3 deletions(-) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 032fb0ef9cd1..c620f1f12247 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -2799,12 +2799,25 @@ void free_transhuge_page(struct page *page) > struct deferred_split *ds_queue = get_deferred_split_queue(folio); > unsigned long flags; > > - spin_lock_irqsave(&ds_queue->split_queue_lock, flags); > - if (!list_empty(&folio->_deferred_list)) { > + /* > + * At this point, there is no one trying to queue the folio > + * to deferred_list. folio->_deferred_list is not possible > + * being updated. > + * > + * If folio is already added to deferred_list, add/delete to/from > + * deferred_list will not impact list_empty(&folio->_deferred_list). 
> + * It's safe to check list_empty(&folio->_deferred_list) without > + * acquiring the lock. > + * > + * If folio is not in deferred_list, it's safe to check without > + * acquiring the lock. > + */ > + if (data_race(!list_empty(&folio->_deferred_list))) { > + spin_lock_irqsave(&ds_queue->split_queue_lock, flags); > ds_queue->split_queue_len--; > list_del(&folio->_deferred_list); I wonder if there is a race here? Could the folio have been in the deferred list when checking, but then something removed it from the list before the lock is taken? In this case, I guess split_queue_len would be out of sync with the number of folios in the queue? Perhaps recheck list_empty() after taking the lock? Thanks, Ryan > + spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); > } > - spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); > free_compound_page(page); > } >
On 26/04/2023 03:08, Yin Fengwei wrote: > > > On 4/25/23 20:38, Kirill A. Shutemov wrote: >> On Tue, Apr 25, 2023 at 04:46:26PM +0800, Yin Fengwei wrote: >>> free_transhuge_page() acquires split queue lock then check >>> whether the THP was added to deferred list or not. >>> >>> It's safe to check whether the THP is in deferred list or not. >>> When code hit free_transhuge_page(), there is no one tries >>> to update the folio's _deferred_list. >>> >>> If folio is not in deferred_list, it's safe to check without >>> acquiring lock. >>> >>> If folio is in deferred_list, the other node in deferred_list >>> adding/deleteing doesn't impact the return value of >>> list_epmty(@folio->_deferred_list). >> >> Typo. >> >>> >>> Running page_fault1 of will-it-scale + order 2 folio for anonymous >>> mapping with 96 processes on an Ice Lake 48C/96T test box, we could >>> see the 61% split_queue_lock contention: >>> - 71.28% 0.35% page_fault1_pro [kernel.kallsyms] [k] >>> release_pages >>> - 70.93% release_pages >>> - 61.42% free_transhuge_page >>> + 60.77% _raw_spin_lock_irqsave >>> >>> With this patch applied, the split_queue_lock contention is less >>> than 1%. >>> >>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> >>> Tested-by: Ryan Roberts <ryan.roberts@arm.com> >>> --- >>> mm/huge_memory.c | 19 ++++++++++++++++--- >>> 1 file changed, 16 insertions(+), 3 deletions(-) >>> >>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >>> index 032fb0ef9cd1..c620f1f12247 100644 >>> --- a/mm/huge_memory.c >>> +++ b/mm/huge_memory.c >>> @@ -2799,12 +2799,25 @@ void free_transhuge_page(struct page *page) >>> struct deferred_split *ds_queue = get_deferred_split_queue(folio); >>> unsigned long flags; >>> >>> - spin_lock_irqsave(&ds_queue->split_queue_lock, flags); >>> - if (!list_empty(&folio->_deferred_list)) { >>> + /* >>> + * At this point, there is no one trying to queue the folio >>> + * to deferred_list. folio->_deferred_list is not possible >>> + * being updated. 
>>> + * >>> + * If folio is already added to deferred_list, add/delete to/from >>> + * deferred_list will not impact list_empty(&folio->_deferred_list). >>> + * It's safe to check list_empty(&folio->_deferred_list) without >>> + * acquiring the lock. >>> + * >>> + * If folio is not in deferred_list, it's safe to check without >>> + * acquiring the lock. >>> + */ >>> + if (data_race(!list_empty(&folio->_deferred_list))) { >>> + spin_lock_irqsave(&ds_queue->split_queue_lock, flags); >> >> Recheck under lock? > Huang Ying pointed out the race with deferred_split_scan(). And Yes. Need > recheck under lock. Will update in next version. Oops sorry - I see this was already pointed out. Disregard my previous mail. Thanks, Ryan > > > Regards > Yin, Fengwei > >> >>> ds_queue->split_queue_len--; >>> list_del(&folio->_deferred_list); >>> + spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); >>> } >>> - spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); >>> free_compound_page(page); >>> } >>> >>> -- >>> 2.30.2 >>> >>> >>
Hi Kirill, On 4/25/2023 8:38 PM, Kirill A. Shutemov wrote: > On Tue, Apr 25, 2023 at 04:46:26PM +0800, Yin Fengwei wrote: >> free_transhuge_page() acquires split queue lock then check >> whether the THP was added to deferred list or not. >> >> It's safe to check whether the THP is in deferred list or not. >> When code hit free_transhuge_page(), there is no one tries >> to update the folio's _deferred_list. >> >> If folio is not in deferred_list, it's safe to check without >> acquiring lock. >> >> If folio is in deferred_list, the other node in deferred_list >> adding/deleteing doesn't impact the return value of >> list_epmty(@folio->_deferred_list). > > Typo. > >> >> Running page_fault1 of will-it-scale + order 2 folio for anonymous >> mapping with 96 processes on an Ice Lake 48C/96T test box, we could >> see the 61% split_queue_lock contention: >> - 71.28% 0.35% page_fault1_pro [kernel.kallsyms] [k] >> release_pages >> - 70.93% release_pages >> - 61.42% free_transhuge_page >> + 60.77% _raw_spin_lock_irqsave >> >> With this patch applied, the split_queue_lock contention is less >> than 1%. >> >> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> >> Tested-by: Ryan Roberts <ryan.roberts@arm.com> >> --- >> mm/huge_memory.c | 19 ++++++++++++++++--- >> 1 file changed, 16 insertions(+), 3 deletions(-) >> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >> index 032fb0ef9cd1..c620f1f12247 100644 >> --- a/mm/huge_memory.c >> +++ b/mm/huge_memory.c >> @@ -2799,12 +2799,25 @@ void free_transhuge_page(struct page *page) >> struct deferred_split *ds_queue = get_deferred_split_queue(folio); >> unsigned long flags; >> >> - spin_lock_irqsave(&ds_queue->split_queue_lock, flags); >> - if (!list_empty(&folio->_deferred_list)) { >> + /* >> + * At this point, there is no one trying to queue the folio >> + * to deferred_list. folio->_deferred_list is not possible >> + * being updated. 
>> + * >> + * If folio is already added to deferred_list, add/delete to/from >> + * deferred_list will not impact list_empty(&folio->_deferred_list). >> + * It's safe to check list_empty(&folio->_deferred_list) without >> + * acquiring the lock. >> + * >> + * If folio is not in deferred_list, it's safe to check without >> + * acquiring the lock. >> + */ >> + if (data_race(!list_empty(&folio->_deferred_list))) { >> + spin_lock_irqsave(&ds_queue->split_queue_lock, flags); > > Recheck under lock?

In function deferred_split_scan(), there is the following code block:

    if (folio_try_get(folio)) {
        list_move(&folio->_deferred_list, &list);
    } else {
        /* We lost race with folio_put() */
        list_del_init(&folio->_deferred_list);
        ds_queue->split_queue_len--;
    }

I am wondering what this "lost race with folio_put()" can be. My understanding is that it's not necessary to handle this case here because free_transhuge_page() will handle it once the folio's refcount reaches zero. But I must be missing something here. Thanks.

Regards
Yin, Fengwei

> >> ds_queue->split_queue_len--; >> list_del(&folio->_deferred_list); >> + spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); >> } >> - spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); >> free_compound_page(page); >> } >> >> -- >> 2.30.2 >> >> >
On Fri, Apr 28, 2023 at 02:28:07PM +0800, Yin, Fengwei wrote: > Hi Kirill, > > On 4/25/2023 8:38 PM, Kirill A. Shutemov wrote: > > On Tue, Apr 25, 2023 at 04:46:26PM +0800, Yin Fengwei wrote: > >> free_transhuge_page() acquires split queue lock then check > >> whether the THP was added to deferred list or not. > >> > >> It's safe to check whether the THP is in deferred list or not. > >> When code hit free_transhuge_page(), there is no one tries > >> to update the folio's _deferred_list. > >> > >> If folio is not in deferred_list, it's safe to check without > >> acquiring lock. > >> > >> If folio is in deferred_list, the other node in deferred_list > >> adding/deleteing doesn't impact the return value of > >> list_epmty(@folio->_deferred_list). > > > > Typo. > > > >> > >> Running page_fault1 of will-it-scale + order 2 folio for anonymous > >> mapping with 96 processes on an Ice Lake 48C/96T test box, we could > >> see the 61% split_queue_lock contention: > >> - 71.28% 0.35% page_fault1_pro [kernel.kallsyms] [k] > >> release_pages > >> - 70.93% release_pages > >> - 61.42% free_transhuge_page > >> + 60.77% _raw_spin_lock_irqsave > >> > >> With this patch applied, the split_queue_lock contention is less > >> than 1%. 
> >> > >> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> > >> Tested-by: Ryan Roberts <ryan.roberts@arm.com> > >> --- > >> mm/huge_memory.c | 19 ++++++++++++++++--- > >> 1 file changed, 16 insertions(+), 3 deletions(-) > >> > >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c > >> index 032fb0ef9cd1..c620f1f12247 100644 > >> --- a/mm/huge_memory.c > >> +++ b/mm/huge_memory.c > >> @@ -2799,12 +2799,25 @@ void free_transhuge_page(struct page *page) > >> struct deferred_split *ds_queue = get_deferred_split_queue(folio); > >> unsigned long flags; > >> > >> - spin_lock_irqsave(&ds_queue->split_queue_lock, flags); > >> - if (!list_empty(&folio->_deferred_list)) { > >> + /* > >> + * At this point, there is no one trying to queue the folio > >> + * to deferred_list. folio->_deferred_list is not possible > >> + * being updated. > >> + * > >> + * If folio is already added to deferred_list, add/delete to/from > >> + * deferred_list will not impact list_empty(&folio->_deferred_list). > >> + * It's safe to check list_empty(&folio->_deferred_list) without > >> + * acquiring the lock. > >> + * > >> + * If folio is not in deferred_list, it's safe to check without > >> + * acquiring the lock. > >> + */ > >> + if (data_race(!list_empty(&folio->_deferred_list))) { > >> + spin_lock_irqsave(&ds_queue->split_queue_lock, flags); > > > > Recheck under lock? > In function deferred_split_scan(), there is following code block: > if (folio_try_get(folio)) { > list_move(&folio->_deferred_list, &list); > } else { > /* We lost race with folio_put() */ > list_del_init(&folio->_deferred_list); > ds_queue->split_queue_len--; > } > > I am wondering what kind of "lost race with folio_put()" can be. > > My understanding is that it's not necessary to handle this case here > because free_transhuge_page() will handle it once folio get zero ref. > But I must miss something here. Thanks.

free_transhuge_page() is called when the refcount is already zero. Both deferred_split_scan() and free_transhuge_page() can see the page with zero refcount. The check makes deferred_split_scan() leave the page to free_transhuge_page().
Hi Kirill, On 4/28/2023 10:02 PM, Kirill A. Shutemov wrote: > On Fri, Apr 28, 2023 at 02:28:07PM +0800, Yin, Fengwei wrote: >> Hi Kirill, >> >> On 4/25/2023 8:38 PM, Kirill A. Shutemov wrote: >>> On Tue, Apr 25, 2023 at 04:46:26PM +0800, Yin Fengwei wrote: >>>> free_transhuge_page() acquires split queue lock then check >>>> whether the THP was added to deferred list or not. >>>> >>>> It's safe to check whether the THP is in deferred list or not. >>>> When code hit free_transhuge_page(), there is no one tries >>>> to update the folio's _deferred_list. >>>> >>>> If folio is not in deferred_list, it's safe to check without >>>> acquiring lock. >>>> >>>> If folio is in deferred_list, the other node in deferred_list >>>> adding/deleteing doesn't impact the return value of >>>> list_epmty(@folio->_deferred_list). >>> >>> Typo. >>> >>>> >>>> Running page_fault1 of will-it-scale + order 2 folio for anonymous >>>> mapping with 96 processes on an Ice Lake 48C/96T test box, we could >>>> see the 61% split_queue_lock contention: >>>> - 71.28% 0.35% page_fault1_pro [kernel.kallsyms] [k] >>>> release_pages >>>> - 70.93% release_pages >>>> - 61.42% free_transhuge_page >>>> + 60.77% _raw_spin_lock_irqsave >>>> >>>> With this patch applied, the split_queue_lock contention is less >>>> than 1%. 
>>>> >>>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> >>>> Tested-by: Ryan Roberts <ryan.roberts@arm.com> >>>> --- >>>> mm/huge_memory.c | 19 ++++++++++++++++--- >>>> 1 file changed, 16 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >>>> index 032fb0ef9cd1..c620f1f12247 100644 >>>> --- a/mm/huge_memory.c >>>> +++ b/mm/huge_memory.c >>>> @@ -2799,12 +2799,25 @@ void free_transhuge_page(struct page *page) >>>> struct deferred_split *ds_queue = get_deferred_split_queue(folio); >>>> unsigned long flags; >>>> >>>> - spin_lock_irqsave(&ds_queue->split_queue_lock, flags); >>>> - if (!list_empty(&folio->_deferred_list)) { >>>> + /* >>>> + * At this point, there is no one trying to queue the folio >>>> + * to deferred_list. folio->_deferred_list is not possible >>>> + * being updated. >>>> + * >>>> + * If folio is already added to deferred_list, add/delete to/from >>>> + * deferred_list will not impact list_empty(&folio->_deferred_list). >>>> + * It's safe to check list_empty(&folio->_deferred_list) without >>>> + * acquiring the lock. >>>> + * >>>> + * If folio is not in deferred_list, it's safe to check without >>>> + * acquiring the lock. >>>> + */ >>>> + if (data_race(!list_empty(&folio->_deferred_list))) { >>>> + spin_lock_irqsave(&ds_queue->split_queue_lock, flags); >>> >>> Recheck under lock? >> In function deferred_split_scan(), there is following code block: >> if (folio_try_get(folio)) { >> list_move(&folio->_deferred_list, &list); >> } else { >> /* We lost race with folio_put() */ >> list_del_init(&folio->_deferred_list); >> ds_queue->split_queue_len--; >> } >> >> I am wondering what kind of "lost race with folio_put()" can be. >> >> My understanding is that it's not necessary to handle this case here >> because free_transhuge_page() will handle it once folio get zero ref. >> But I must miss something here. Thanks. > > free_transhuge_page() got when refcount is already zero. 
Both > deferred_split_scan() and free_transhuge_page() can see the page with zero > refcount. The check makes deferred_split_scan() to leave the page to the > free_transhuge_page(). >

If deferred_split_scan() leaves the page to free_transhuge_page(), is it necessary to do

    list_del_init(&folio->_deferred_list);
    ds_queue->split_queue_len--;

Can these two lines be left to free_transhuge_page() as well? Thanks.

Regards
Yin, Fengwei
On Sat, Apr 29, 2023 at 04:32:34PM +0800, Yin, Fengwei wrote:
> Hi Kirill,
>
> On 4/28/2023 10:02 PM, Kirill A. Shutemov wrote:
> > On Fri, Apr 28, 2023 at 02:28:07PM +0800, Yin, Fengwei wrote:
> >> Hi Kirill,
> >>
> >> On 4/25/2023 8:38 PM, Kirill A. Shutemov wrote:
> >>> On Tue, Apr 25, 2023 at 04:46:26PM +0800, Yin Fengwei wrote:
[snip duplicated commit message and patch]
> >>>> +	if (data_race(!list_empty(&folio->_deferred_list))) {
> >>>> +		spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> >>>
> >>> Recheck under lock?
> >> In function deferred_split_scan(), there is following code block:
> >> 	if (folio_try_get(folio)) {
> >> 		list_move(&folio->_deferred_list, &list);
> >> 	} else {
> >> 		/* We lost race with folio_put() */
> >> 		list_del_init(&folio->_deferred_list);
> >> 		ds_queue->split_queue_len--;
> >> 	}
> >>
> >> I am wondering what kind of "lost race with folio_put()" can be.
> >>
> >> My understanding is that it's not necessary to handle this case here
> >> because free_transhuge_page() will handle it once folio get zero ref.
> >> But I must miss something here. Thanks.
> >
> > free_transhuge_page() got when refcount is already zero. Both
> > deferred_split_scan() and free_transhuge_page() can see the page with zero
> > refcount. The check makes deferred_split_scan() to leave the page to the
> > free_transhuge_page().
> >
> If deferred_split_scan() leaves the page to free_transhuge_page(), is it
> necessary to do
> 	list_del_init(&folio->_deferred_list);
> 	ds_queue->split_queue_len--;
>
> Can these two line be left to free_transhuge_page() either? Thanks.

I *think* (my cache is cold on deferred split) we can. But since we
already hold the lock, why not take care of it? It makes your change more
efficient.
Hi Kirill,

On 4/29/2023 4:46 PM, Kirill A. Shutemov wrote:
> On Sat, Apr 29, 2023 at 04:32:34PM +0800, Yin, Fengwei wrote:
>> Hi Kirill,
>>
>> On 4/28/2023 10:02 PM, Kirill A. Shutemov wrote:
[snip duplicated quotes]
>> If deferred_split_scan() leaves the page to free_transhuge_page(), is it
>> necessary to do
>> 	list_del_init(&folio->_deferred_list);
>> 	ds_queue->split_queue_len--;
>>
>> Can these two line be left to free_transhuge_page() either? Thanks.
>
> I *think* (my cache is cold on deferred split) we can. But since we
> already hold the lock, why not take care of it? It makes your change more
> efficient.
Thanks a lot for your confirmation. I just wanted to make sure I
understand the race here correctly (I didn't notice this part of the code
before Ying pointed it out).


Regards
Yin, Fengwei
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 032fb0ef9cd1..c620f1f12247 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2799,12 +2799,25 @@ void free_transhuge_page(struct page *page)
 	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
 	unsigned long flags;
 
-	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-	if (!list_empty(&folio->_deferred_list)) {
+	/*
+	 * At this point, there is no one trying to queue the folio
+	 * to deferred_list. folio->_deferred_list is not possible
+	 * being updated.
+	 *
+	 * If folio is already added to deferred_list, add/delete to/from
+	 * deferred_list will not impact list_empty(&folio->_deferred_list).
+	 * It's safe to check list_empty(&folio->_deferred_list) without
+	 * acquiring the lock.
+	 *
+	 * If folio is not in deferred_list, it's safe to check without
+	 * acquiring the lock.
+	 */
+	if (data_race(!list_empty(&folio->_deferred_list))) {
+		spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 		ds_queue->split_queue_len--;
 		list_del(&folio->_deferred_list);
+		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 	}
-	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 	free_compound_page(page);
 }