
[v2] mm/rmap: do not add fully unmapped large folio to deferred split list

Message ID 20240424211031.475756-1-zi.yan@sent.com (mailing list archive)

Commit Message

Zi Yan April 24, 2024, 9:10 p.m. UTC
From: Zi Yan <ziy@nvidia.com>

In __folio_remove_rmap(), a large folio is added to the deferred split
list whenever any page in the folio loses its final mapping. But when
the folio is fully unmapped, there is no need to add it to the deferred
split list at all. Fix this by checking folio->_nr_pages_mapped before
adding the folio to the deferred split list. If the folio is already on
the deferred split list, it will be skipped.

Commit 98046944a159 ("mm: huge_memory: add the missing
folio_test_pmd_mappable() for THP split statistics") tried to exclude
mTHP deferred split stats from THP_DEFERRED_SPLIT_PAGE, but it did not
fix everything. A fully unmapped PTE-mapped order-9 THP was still added
to the deferred split list and counted as THP_DEFERRED_SPLIT_PAGE, since
nr is 512 (non-zero), level is RMAP_LEVEL_PTE, and inside
deferred_split_folio() the order-9 folio passes
folio_test_pmd_mappable(). However, this miscount existed even before
that commit, because PTEs were unmapped individually, and the first PTE
unmapping alone added the THP to the deferred split list.

With commit b06dc281aa99 ("mm/rmap: introduce
folio_remove_rmap_[pte|ptes|pmd]()"), the kernel is able to unmap a
PTE-mapped folio in one shot without causing the miscount, hence this
patch.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/rmap.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)


base-commit: 2541ee5668b019c486dd3e815114130e35c1495d

Comments

Yang Shi April 24, 2024, 10:32 p.m. UTC | #1
On Wed, Apr 24, 2024 at 2:10 PM Zi Yan <zi.yan@sent.com> wrote:
>
> From: Zi Yan <ziy@nvidia.com>
>
> In __folio_remove_rmap(), a large folio is added to deferred split list
> if any page in a folio loses its final mapping. It is possible that
> the folio is unmapped fully, but it is unnecessary to add the folio
> to deferred split list at all. Fix it by checking folio->_nr_pages_mapped
> before adding a folio to deferred split list. If the folio is already
> on the deferred split list, it will be skipped.
>
> Commit 98046944a159 ("mm: huge_memory: add the missing
> folio_test_pmd_mappable() for THP split statistics") tried to exclude
> mTHP deferred split stats from THP_DEFERRED_SPLIT_PAGE, but it does not
> fix everything. A fully unmapped PTE-mapped order-9 THP was also added to
> deferred split list and counted as THP_DEFERRED_SPLIT_PAGE, since nr is
> 512 (non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio()
> the order-9 folio is folio_test_pmd_mappable(). However, this miscount
> was present even earlier due to implementation, since PTEs are unmapped
> individually and first PTE unmapping adds the THP into the deferred split
> list.

Should you mention the miscounting for mTHP too? There is another patch
series adding counter support for mTHP.

>
> With commit b06dc281aa99 ("mm/rmap: introduce
> folio_remove_rmap_[pte|ptes|pmd]()"), kernel is able to unmap PTE-mapped
> folios in one shot without causing the miscount, hence this patch.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  mm/rmap.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index a7913a454028..220ad8a83589 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>                  * page of the folio is unmapped and at least one page
>                  * is still mapped.
>                  */
> -               if (folio_test_large(folio) && folio_test_anon(folio))
> -                       if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> -                               deferred_split_folio(folio);
> +               if (folio_test_large(folio) && folio_test_anon(folio) &&
> +                   list_empty(&folio->_deferred_list) &&

Do we really need this check? deferred_split_folio() does the same
check too. Bailing out earlier sounds ok too, but there may not be too
much gain.

> +                   ((level == RMAP_LEVEL_PTE && atomic_read(mapped)) ||
> +                    (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped)))

IIUC, this line is used to cover the case which has both partial
PTE-mapping and PMD-mapping, then PMD mapping is unmapped fully. IIRC
this case was not handled correctly before, the THP actually skipped
deferred split queue. If so please add some description in the commit
log.

Otherwise the patch looks good to me. Reviewed-by: Yang Shi
<shy828301@gmail.com>

> +                       deferred_split_folio(folio);
>         }
>
>         /*
>
> base-commit: 2541ee5668b019c486dd3e815114130e35c1495d
> --
> 2.43.0
>
Zi Yan April 24, 2024, 10:39 p.m. UTC | #2
On 24 Apr 2024, at 18:32, Yang Shi wrote:

> On Wed, Apr 24, 2024 at 2:10 PM Zi Yan <zi.yan@sent.com> wrote:
>>
>> From: Zi Yan <ziy@nvidia.com>
>>
>> In __folio_remove_rmap(), a large folio is added to deferred split list
>> if any page in a folio loses its final mapping. It is possible that
>> the folio is unmapped fully, but it is unnecessary to add the folio
>> to deferred split list at all. Fix it by checking folio->_nr_pages_mapped
>> before adding a folio to deferred split list. If the folio is already
>> on the deferred split list, it will be skipped.
>>
>> Commit 98046944a159 ("mm: huge_memory: add the missing
>> folio_test_pmd_mappable() for THP split statistics") tried to exclude
>> mTHP deferred split stats from THP_DEFERRED_SPLIT_PAGE, but it does not
>> fix everything. A fully unmapped PTE-mapped order-9 THP was also added to
>> deferred split list and counted as THP_DEFERRED_SPLIT_PAGE, since nr is
>> 512 (non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio()
>> the order-9 folio is folio_test_pmd_mappable(). However, this miscount
>> was present even earlier due to implementation, since PTEs are unmapped
>> individually and first PTE unmapping adds the THP into the deferred split
>> list.
>
> Shall you mention the miscounting for mTHP too? There is another patch
> series adding the counter support for mTHP.

OK, will add it.
>
>>
>> With commit b06dc281aa99 ("mm/rmap: introduce
>> folio_remove_rmap_[pte|ptes|pmd]()"), kernel is able to unmap PTE-mapped
>> folios in one shot without causing the miscount, hence this patch.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>  mm/rmap.c | 8 +++++---
>>  1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index a7913a454028..220ad8a83589 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>                  * page of the folio is unmapped and at least one page
>>                  * is still mapped.
>>                  */
>> -               if (folio_test_large(folio) && folio_test_anon(folio))
>> -                       if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
>> -                               deferred_split_folio(folio);
>> +               if (folio_test_large(folio) && folio_test_anon(folio) &&
>> +                   list_empty(&folio->_deferred_list) &&
>
> Do we really need this check? deferred_split_folio() does the same
> check too. Bailing out earlier sounds ok too, but there may not be too
> much gain.

Sure, I can remove it.

>
>> +                   ((level == RMAP_LEVEL_PTE && atomic_read(mapped)) ||
>> +                    (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped)))
>
> IIUC, this line is used to cover the case which has both partial
> PTE-mapping and PMD-mapping, then PMD mapping is unmapped fully. IIRC
> this case was not handled correctly before, the THP actually skipped
> deferred split queue. If so please add some description in the commit
> log.

It was handled properly before: the original condition is
(level == RMAP_LEVEL_PTE || nr < nr_pmdmapped), meaning the folio is
added to the deferred split list if either level is RMAP_LEVEL_PTE or
(level == RMAP_LEVEL_PMD && nr < nr_pmdmapped). So only the
level == RMAP_LEVEL_PTE part of the logic needs to be fixed.

>
> Otherwise the patch looks good to me. Reviewed-by: Yang Shi
> <shy828301@gmail.com>
>

Thanks.
>> +                       deferred_split_folio(folio);
>>         }
>>
>>         /*
>>
>> base-commit: 2541ee5668b019c486dd3e815114130e35c1495d
>> --
>> 2.43.0
>>


--
Best Regards,
Yan, Zi
Yang Shi April 24, 2024, 10:53 p.m. UTC | #3
On Wed, Apr 24, 2024 at 3:39 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 24 Apr 2024, at 18:32, Yang Shi wrote:
>
> > On Wed, Apr 24, 2024 at 2:10 PM Zi Yan <zi.yan@sent.com> wrote:
> >>
> >> From: Zi Yan <ziy@nvidia.com>
> >>
> >> In __folio_remove_rmap(), a large folio is added to deferred split list
> >> if any page in a folio loses its final mapping. It is possible that
> >> the folio is unmapped fully, but it is unnecessary to add the folio
> >> to deferred split list at all. Fix it by checking folio->_nr_pages_mapped
> >> before adding a folio to deferred split list. If the folio is already
> >> on the deferred split list, it will be skipped.
> >>
> >> Commit 98046944a159 ("mm: huge_memory: add the missing
> >> folio_test_pmd_mappable() for THP split statistics") tried to exclude
> >> mTHP deferred split stats from THP_DEFERRED_SPLIT_PAGE, but it does not
> >> fix everything. A fully unmapped PTE-mapped order-9 THP was also added to
> >> deferred split list and counted as THP_DEFERRED_SPLIT_PAGE, since nr is
> >> 512 (non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio()
> >> the order-9 folio is folio_test_pmd_mappable(). However, this miscount
> >> was present even earlier due to implementation, since PTEs are unmapped
> >> individually and first PTE unmapping adds the THP into the deferred split
> >> list.
> >
> > Shall you mention the miscounting for mTHP too? There is another patch
> > series adding the counter support for mTHP.
>
> OK, will add it.
> >
> >>
> >> With commit b06dc281aa99 ("mm/rmap: introduce
> >> folio_remove_rmap_[pte|ptes|pmd]()"), kernel is able to unmap PTE-mapped
> >> folios in one shot without causing the miscount, hence this patch.
> >>
> >> Signed-off-by: Zi Yan <ziy@nvidia.com>
> >> ---
> >>  mm/rmap.c | 8 +++++---
> >>  1 file changed, 5 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index a7913a454028..220ad8a83589 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>                  * page of the folio is unmapped and at least one page
> >>                  * is still mapped.
> >>                  */
> >> -               if (folio_test_large(folio) && folio_test_anon(folio))
> >> -                       if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> >> -                               deferred_split_folio(folio);
> >> +               if (folio_test_large(folio) && folio_test_anon(folio) &&
> >> +                   list_empty(&folio->_deferred_list) &&
> >
> > Do we really need this check? deferred_split_folio() does the same
> > check too. Bailing out earlier sounds ok too, but there may not be too
> > much gain.
>
> Sure, I can remove it.
>
> >
> >> +                   ((level == RMAP_LEVEL_PTE && atomic_read(mapped)) ||
> >> +                    (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped)))
> >
> > IIUC, this line is used to cover the case which has both partial
> > PTE-mapping and PMD-mapping, then PMD mapping is unmapped fully. IIRC
> > this case was not handled correctly before, the THP actually skipped
> > deferred split queue. If so please add some description in the commit
> > log.
>
> It is properly handled before, since the original code is
> (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped), meaning
> if either level is RMAP_LEVEL_PTE or
> (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped), the folio
> is added to the deferred split list. So only level == RMAP_LEVEL_PTE
> part of logic needs to be fixed.

Oh, yes. I misread "||" to "&&". Thanks for correcting me and fixing
the problem.

>
> >
> > Otherwise the patch looks good to me. Reviewed-by: Yang Shi
> > <shy828301@gmail.com>
> >
>
> Thanks.
> >> +                       deferred_split_folio(folio);
> >>         }
> >>
> >>         /*
> >>
> >> base-commit: 2541ee5668b019c486dd3e815114130e35c1495d
> >> --
> >> 2.43.0
> >>
>
>
> --
> Best Regards,
> Yan, Zi
David Hildenbrand April 25, 2024, 7:15 a.m. UTC | #4
On 25.04.24 00:39, Zi Yan wrote:
> On 24 Apr 2024, at 18:32, Yang Shi wrote:
> 
>> On Wed, Apr 24, 2024 at 2:10 PM Zi Yan <zi.yan@sent.com> wrote:
>>>
>>> From: Zi Yan <ziy@nvidia.com>
>>>
>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>> if any page in a folio loses its final mapping. It is possible that
>>> the folio is unmapped fully, but it is unnecessary to add the folio
>>> to deferred split list at all. Fix it by checking folio->_nr_pages_mapped
>>> before adding a folio to deferred split list. If the folio is already
>>> on the deferred split list, it will be skipped.
>>>
>>> Commit 98046944a159 ("mm: huge_memory: add the missing
>>> folio_test_pmd_mappable() for THP split statistics") tried to exclude
>>> mTHP deferred split stats from THP_DEFERRED_SPLIT_PAGE, but it does not
>>> fix everything. A fully unmapped PTE-mapped order-9 THP was also added to
>>> deferred split list and counted as THP_DEFERRED_SPLIT_PAGE, since nr is
>>> 512 (non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio()
>>> the order-9 folio is folio_test_pmd_mappable(). However, this miscount
>>> was present even earlier due to implementation, since PTEs are unmapped
>>> individually and first PTE unmapping adds the THP into the deferred split
>>> list.
>>
>> Shall you mention the miscounting for mTHP too? There is another patch
>> series adding the counter support for mTHP.
> 
> OK, will add it.

I thought I made it clear: this patch won't "fix" it. Misaccounting will 
still happen. Just less frequently.

Please spell that out.

>>
>>>
>>> With commit b06dc281aa99 ("mm/rmap: introduce
>>> folio_remove_rmap_[pte|ptes|pmd]()"), kernel is able to unmap PTE-mapped
>>> folios in one shot without causing the miscount, hence this patch.
>>>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>> ---
>>>   mm/rmap.c | 8 +++++---
>>>   1 file changed, 5 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index a7913a454028..220ad8a83589 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>                   * page of the folio is unmapped and at least one page
>>>                   * is still mapped.
>>>                   */
>>> -               if (folio_test_large(folio) && folio_test_anon(folio))
>>> -                       if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
>>> -                               deferred_split_folio(folio);
>>> +               if (folio_test_large(folio) && folio_test_anon(folio) &&
>>> +                   list_empty(&folio->_deferred_list) &&
>>
>> Do we really need this check? deferred_split_folio() does the same
>> check too. Bailing out earlier sounds ok too, but there may not be too
>> much gain.
> 
> Sure, I can remove it.

Please leave it. It's a function call that cannot be optimized out 
otherwise.
Zi Yan April 25, 2024, 2:50 p.m. UTC | #5
On 25 Apr 2024, at 3:15, David Hildenbrand wrote:

> On 25.04.24 00:39, Zi Yan wrote:
>> On 24 Apr 2024, at 18:32, Yang Shi wrote:
>>
>>> On Wed, Apr 24, 2024 at 2:10 PM Zi Yan <zi.yan@sent.com> wrote:
>>>>
>>>> From: Zi Yan <ziy@nvidia.com>
>>>>
>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>>> if any page in a folio loses its final mapping. It is possible that
>>>> the folio is unmapped fully, but it is unnecessary to add the folio
>>>> to deferred split list at all. Fix it by checking folio->_nr_pages_mapped
>>>> before adding a folio to deferred split list. If the folio is already
>>>> on the deferred split list, it will be skipped.
>>>>
>>>> Commit 98046944a159 ("mm: huge_memory: add the missing
>>>> folio_test_pmd_mappable() for THP split statistics") tried to exclude
>>>> mTHP deferred split stats from THP_DEFERRED_SPLIT_PAGE, but it does not
>>>> fix everything. A fully unmapped PTE-mapped order-9 THP was also added to
>>>> deferred split list and counted as THP_DEFERRED_SPLIT_PAGE, since nr is
>>>> 512 (non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio()
>>>> the order-9 folio is folio_test_pmd_mappable(). However, this miscount
>>>> was present even earlier due to implementation, since PTEs are unmapped
>>>> individually and first PTE unmapping adds the THP into the deferred split
>>>> list.
>>>
>>> Shall you mention the miscounting for mTHP too? There is another patch
>>> series adding the counter support for mTHP.
>>
>> OK, will add it.
>
> I thought I made it clear: this patch won't "fix" it. Misaccounting will still happen. Just less frequently.
>
> Please spell that out.

Sure. Sorry I did not make that clear.


>
>>>
>>>>
>>>> With commit b06dc281aa99 ("mm/rmap: introduce
>>>> folio_remove_rmap_[pte|ptes|pmd]()"), kernel is able to unmap PTE-mapped
>>>> folios in one shot without causing the miscount, hence this patch.
>>>>
>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>> ---
>>>>   mm/rmap.c | 8 +++++---
>>>>   1 file changed, 5 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index a7913a454028..220ad8a83589 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>                   * page of the folio is unmapped and at least one page
>>>>                   * is still mapped.
>>>>                   */
>>>> -               if (folio_test_large(folio) && folio_test_anon(folio))
>>>> -                       if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
>>>> -                               deferred_split_folio(folio);
>>>> +               if (folio_test_large(folio) && folio_test_anon(folio) &&
>>>> +                   list_empty(&folio->_deferred_list) &&
>>>
>>> Do we really need this check? deferred_split_folio() does the same
>>> check too. Bailing out earlier sounds ok too, but there may not be too
>>> much gain.
>>
>> Sure, I can remove it.
>
> Please leave it. It's a function call that cannot be optimized out otherwise.

OK. If you think it is worth optimizing that function call, I will keep it.


--
Best Regards,
Yan, Zi

Patch

diff --git a/mm/rmap.c b/mm/rmap.c
index a7913a454028..220ad8a83589 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1553,9 +1553,11 @@  static __always_inline void __folio_remove_rmap(struct folio *folio,
 		 * page of the folio is unmapped and at least one page
 		 * is still mapped.
 		 */
-		if (folio_test_large(folio) && folio_test_anon(folio))
-			if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
-				deferred_split_folio(folio);
+		if (folio_test_large(folio) && folio_test_anon(folio) &&
+		    list_empty(&folio->_deferred_list) &&
+		    ((level == RMAP_LEVEL_PTE && atomic_read(mapped)) ||
+		     (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped)))
+			deferred_split_folio(folio);
 	}
 
 	/*