
[v5] mm/rmap: do not add fully unmapped large folio to deferred split list

Message ID 20240426190253.541419-1-zi.yan@sent.com (mailing list archive)
State New
Series [v5] mm/rmap: do not add fully unmapped large folio to deferred split list

Commit Message

Zi Yan April 26, 2024, 7:02 p.m. UTC
From: Zi Yan <ziy@nvidia.com>

In __folio_remove_rmap(), a large folio is added to deferred split list
if any page in a folio loses its final mapping. But it is possible that
the folio is fully unmapped and adding it to deferred split list is
unnecessary.

For PMD-mapped THPs, that was not really an issue, because removing the
last PMD mapping in the absence of PTE mappings would not have added the
folio to the deferred split queue.

However, for PTE-mapped THPs, which are now more prominent due to mTHP,
they are always added to the deferred split queue. One side effect
is that the THP_DEFERRED_SPLIT_PAGE stat for a PTE-mapped folio can be
unintentionally increased, making it look like there are many partially
mapped folios -- although the whole folio is fully unmapped stepwise.

Core-mm now tries batch-unmapping consecutive PTEs of PTE-mapped THPs
where possible starting from commit b06dc281aa99 ("mm/rmap: introduce
folio_remove_rmap_[pte|ptes|pmd]()"). When it happens, a whole PTE-mapped
folio is unmapped in one go and can avoid being added to deferred split
list, reducing the THP_DEFERRED_SPLIT_PAGE noise. But there will still be
noise when we cannot batch-unmap a complete PTE-mapped folio in one go
-- or where this type of batching is not implemented yet, e.g., migration.

To avoid the unnecessary addition, folio->_nr_pages_mapped is checked
to tell if the whole folio is unmapped. If the folio is already on
deferred split list, it will be skipped, too.

Note: commit 98046944a159 ("mm: huge_memory: add the missing
folio_test_pmd_mappable() for THP split statistics") tried to exclude
mTHP deferred split stats from THP_DEFERRED_SPLIT_PAGE, but it does not
fix the above issue. A fully unmapped PTE-mapped order-9 THP was still
added to deferred split list and counted as THP_DEFERRED_SPLIT_PAGE,
since nr is 512 (non zero), level is RMAP_LEVEL_PTE, and inside
deferred_split_folio() the order-9 folio is folio_test_pmd_mappable().

Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/rmap.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)


base-commit: 3dba658670af22074cc6f26dc92efe0013ac3359
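
To make the note above concrete, below is a simplified userspace sketch of the decision flow the commit message describes; deferred_split_folio_old(), pmd_mappable() and the plain counters are illustrative stand-ins, not the kernel implementation.

---
#include <stdbool.h>
#include <stdio.h>

#define HPAGE_PMD_ORDER 9

static unsigned long thp_deferred_split_page;	/* the stat in question */

/* stand-in for folio_test_pmd_mappable(): true for order-9 folios */
static bool pmd_mappable(int order)
{
	return order >= HPAGE_PMD_ORDER;
}

/* pre-patch behaviour: called whenever nr != 0 at RMAP_LEVEL_PTE */
static void deferred_split_folio_old(int order)
{
	/* commit 98046944a159 only filters the stat by pmd-mappability */
	if (pmd_mappable(order))
		thp_deferred_split_page++;
	/* ...the folio would also be queued on the deferred split list */
}

int main(void)
{
	int order = 9;
	int nr = 512;		/* all 512 PTE mappings removed in one call */
	int still_mapped = 0;	/* folio->_nr_pages_mapped after the removal */

	/* old condition: any nonzero nr at PTE level queues the folio */
	if (nr)
		deferred_split_folio_old(order);

	/* new condition from this patch: only if it stays partially mapped */
	bool partially_mapped = nr && still_mapped;

	printf("old: THP_DEFERRED_SPLIT_PAGE = %lu (counted although fully unmapped)\n",
	       thp_deferred_split_page);
	printf("new: partially_mapped = %d -> folio not queued\n", partially_mapped);
	return 0;
}
---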

Comments

David Hildenbrand April 26, 2024, 7:08 p.m. UTC | #1
On 26.04.24 21:02, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> In __folio_remove_rmap(), a large folio is added to deferred split list
> if any page in a folio loses its final mapping. But it is possible that
> the folio is fully unmapped and adding it to deferred split list is
> unnecessary.
> 
> For PMD-mapped THPs, that was not really an issue, because removing the
> last PMD mapping in the absence of PTE mappings would not have added the
> folio to the deferred split queue.
> 
> However, for PTE-mapped THPs, which are now more prominent due to mTHP,
> they are always added to the deferred split queue. One side effect
> is that the THP_DEFERRED_SPLIT_PAGE stat for a PTE-mapped folio can be
> unintentionally increased, making it look like there are many partially
> mapped folios -- although the whole folio is fully unmapped stepwise.
> 
> Core-mm now tries batch-unmapping consecutive PTEs of PTE-mapped THPs
> where possible starting from commit b06dc281aa99 ("mm/rmap: introduce
> folio_remove_rmap_[pte|ptes|pmd]()"). When it happens, a whole PTE-mapped
> folio is unmapped in one go and can avoid being added to deferred split
> list, reducing the THP_DEFERRED_SPLIT_PAGE noise. But there will still be
> noise when we cannot batch-unmap a complete PTE-mapped folio in one go
> -- or where this type of batching is not implemented yet, e.g., migration.
> 
> To avoid the unnecessary addition, folio->_nr_pages_mapped is checked
> to tell if the whole folio is unmapped. If the folio is already on
> deferred split list, it will be skipped, too.
> 
> Note: commit 98046944a159 ("mm: huge_memory: add the missing
> folio_test_pmd_mappable() for THP split statistics") tried to exclude
> mTHP deferred split stats from THP_DEFERRED_SPLIT_PAGE, but it does not
> fix the above issue. A fully unmapped PTE-mapped order-9 THP was still
> added to deferred split list and counted as THP_DEFERRED_SPLIT_PAGE,
> since nr is 512 (non zero), level is RMAP_LEVEL_PTE, and inside
> deferred_split_folio() the order-9 folio is folio_test_pmd_mappable().
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>   mm/rmap.c | 12 +++++++++---
>   1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2608c40dffad..a9bd64ebdd9a 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1495,6 +1495,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>   {
>   	atomic_t *mapped = &folio->_nr_pages_mapped;
>   	int last, nr = 0, nr_pmdmapped = 0;
> +	bool partially_mapped = false;
>   	enum node_stat_item idx;
>   
>   	__folio_rmap_sanity_checks(folio, page, nr_pages, level);
> @@ -1515,6 +1516,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>   					nr++;
>   			}
>   		} while (page++, --nr_pages > 0);
> +
> +		partially_mapped = !!nr && !!atomic_read(mapped);

Nit: The && should remove the need for both !!.

Reviewed-by: David Hildenbrand <david@redhat.com>
Zi Yan April 26, 2024, 7:20 p.m. UTC | #2
On 26 Apr 2024, at 15:08, David Hildenbrand wrote:

> On 26.04.24 21:02, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> In __folio_remove_rmap(), a large folio is added to deferred split list
>> if any page in a folio loses its final mapping. But it is possible that
>> the folio is fully unmapped and adding it to deferred split list is
>> unnecessary.
>>
>> For PMD-mapped THPs, that was not really an issue, because removing the
>> last PMD mapping in the absence of PTE mappings would not have added the
>> folio to the deferred split queue.
>>
>> However, for PTE-mapped THPs, which are now more prominent due to mTHP,
>> they are always added to the deferred split queue. One side effect
>> is that the THP_DEFERRED_SPLIT_PAGE stat for a PTE-mapped folio can be
>> unintentionally increased, making it look like there are many partially
>> mapped folios -- although the whole folio is fully unmapped stepwise.
>>
>> Core-mm now tries batch-unmapping consecutive PTEs of PTE-mapped THPs
>> where possible starting from commit b06dc281aa99 ("mm/rmap: introduce
>> folio_remove_rmap_[pte|ptes|pmd]()"). When it happens, a whole PTE-mapped
>> folio is unmapped in one go and can avoid being added to deferred split
>> list, reducing the THP_DEFERRED_SPLIT_PAGE noise. But there will still be
>> noise when we cannot batch-unmap a complete PTE-mapped folio in one go
>> -- or where this type of batching is not implemented yet, e.g., migration.
>>
>> To avoid the unnecessary addition, folio->_nr_pages_mapped is checked
>> to tell if the whole folio is unmapped. If the folio is already on
>> deferred split list, it will be skipped, too.
>>
>> Note: commit 98046944a159 ("mm: huge_memory: add the missing
>> folio_test_pmd_mappable() for THP split statistics") tried to exclude
>> mTHP deferred split stats from THP_DEFERRED_SPLIT_PAGE, but it does not
>> fix the above issue. A fully unmapped PTE-mapped order-9 THP was still
>> added to deferred split list and counted as THP_DEFERRED_SPLIT_PAGE,
>> since nr is 512 (non zero), level is RMAP_LEVEL_PTE, and inside
>> deferred_split_folio() the order-9 folio is folio_test_pmd_mappable().
>>
>> Suggested-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>   mm/rmap.c | 12 +++++++++---
>>   1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 2608c40dffad..a9bd64ebdd9a 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1495,6 +1495,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>   {
>>   	atomic_t *mapped = &folio->_nr_pages_mapped;
>>   	int last, nr = 0, nr_pmdmapped = 0;
>> +	bool partially_mapped = false;
>>   	enum node_stat_item idx;
>>    	__folio_rmap_sanity_checks(folio, page, nr_pages, level);
>> @@ -1515,6 +1516,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>   					nr++;
>>   			}
>>   		} while (page++, --nr_pages > 0);
>> +
>> +		partially_mapped = !!nr && !!atomic_read(mapped);
>
> Nit: The && should remove the need for both !!.

My impression was that !! is needed to convert from int to bool and I do
find "!!int && !!int" use in the kernel. If this is unnecessary, Andrew
can apply the fixup below. I can send a new version if it is really needed.

diff --git a/mm/rmap.c b/mm/rmap.c
index a9bd64ebdd9a..c1fd5828409b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1517,7 +1517,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
                        }
                } while (page++, --nr_pages > 0);

-               partially_mapped = !!nr && !!atomic_read(mapped);
+               partially_mapped = nr && atomic_read(mapped);
                break;
        case RMAP_LEVEL_PMD:
                atomic_dec(&folio->_large_mapcount);

>
> Reviewed-by: David Hildenbrand <david@redhat.com>

Thanks.

--
Best Regards,
Yan, Zi
David Hildenbrand April 26, 2024, 8:15 p.m. UTC | #3
On 26.04.24 21:20, Zi Yan wrote:
> On 26 Apr 2024, at 15:08, David Hildenbrand wrote:
> 
>> On 26.04.24 21:02, Zi Yan wrote:
>>> From: Zi Yan <ziy@nvidia.com>
>>>
>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>> if any page in a folio loses its final mapping. But it is possible that
>>> the folio is fully unmapped and adding it to deferred split list is
>>> unnecessary.
>>>
>>> For PMD-mapped THPs, that was not really an issue, because removing the
>>> last PMD mapping in the absence of PTE mappings would not have added the
>>> folio to the deferred split queue.
>>>
>>> However, for PTE-mapped THPs, which are now more prominent due to mTHP,
>>> they are always added to the deferred split queue. One side effect
>>> is that the THP_DEFERRED_SPLIT_PAGE stat for a PTE-mapped folio can be
>>> unintentionally increased, making it look like there are many partially
>>> mapped folios -- although the whole folio is fully unmapped stepwise.
>>>
>>> Core-mm now tries batch-unmapping consecutive PTEs of PTE-mapped THPs
>>> where possible starting from commit b06dc281aa99 ("mm/rmap: introduce
>>> folio_remove_rmap_[pte|ptes|pmd]()"). When it happens, a whole PTE-mapped
>>> folio is unmapped in one go and can avoid being added to deferred split
>>> list, reducing the THP_DEFERRED_SPLIT_PAGE noise. But there will still be
>>> noise when we cannot batch-unmap a complete PTE-mapped folio in one go
>>> -- or where this type of batching is not implemented yet, e.g., migration.
>>>
>>> To avoid the unnecessary addition, folio->_nr_pages_mapped is checked
>>> to tell if the whole folio is unmapped. If the folio is already on
>>> deferred split list, it will be skipped, too.
>>>
>>> Note: commit 98046944a159 ("mm: huge_memory: add the missing
>>> folio_test_pmd_mappable() for THP split statistics") tried to exclude
>>> mTHP deferred split stats from THP_DEFERRED_SPLIT_PAGE, but it does not
>>> fix the above issue. A fully unmapped PTE-mapped order-9 THP was still
>>> added to deferred split list and counted as THP_DEFERRED_SPLIT_PAGE,
>>> since nr is 512 (non zero), level is RMAP_LEVEL_PTE, and inside
>>> deferred_split_folio() the order-9 folio is folio_test_pmd_mappable().
>>>
>>> Suggested-by: David Hildenbrand <david@redhat.com>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>> ---
>>>    mm/rmap.c | 12 +++++++++---
>>>    1 file changed, 9 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 2608c40dffad..a9bd64ebdd9a 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1495,6 +1495,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>    {
>>>    	atomic_t *mapped = &folio->_nr_pages_mapped;
>>>    	int last, nr = 0, nr_pmdmapped = 0;
>>> +	bool partially_mapped = false;
>>>    	enum node_stat_item idx;
>>>     	__folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>> @@ -1515,6 +1516,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>    					nr++;
>>>    			}
>>>    		} while (page++, --nr_pages > 0);
>>> +
>>> +		partially_mapped = !!nr && !!atomic_read(mapped);
>>
>> Nit: The && should remove the need for both !!.
> 
> My impression was that !! is needed to convert from int to bool and I do
> find "!!int && !!int" use in the kernel. 

I might be wrong about this, but if you wouldn't write

	if (!!nr && !!atomic_read(mapped))

then

bool partially_mapped = nr && atomic_read(mapped);

is sufficient.

&& would make sure that the result is either 0 or 1, which
you can store safely in a bool, no matter which underlying type
is used to store that value.

But I *think* nowadays, the compiler will always handle that
correctly, even without the "&&" (ever since C99 added _Bool).

Likely, also

	bool partially_mapped = nr & atomic_read(mapped);

Would nowadays work, but looks stupid.


Related: https://lkml.org/lkml/2013/8/31/138

---
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>
#include <inttypes.h>

volatile uint64_t a = 0x8000000000000000ull;

void main (void) {
         printf("uint64_t a = a: 0x%" PRIx64 "\n", a);

         int i1 = a;
         printf("int i1 = a: %d\n", i1);

         int i2 = !!a;
         printf("int i2 = !!a: %d\n", i2);

         bool b1 = a;
         printf("bool b1 = a: %d\n", b1);

         bool b2 = !!a;
         printf("bool b2 = !!a: %d\n", b2);
}
---
$ ./test
uint64_t a = a: 0x8000000000000000
int i1 = a: 0
int i2 = !!a: 1
bool b1 = a: 1
bool b2 = !!a: 1
---

Note that if bool would be defined as "int", you would need the !!, otherwise you
would lose information.

But even for b1, the gcc generates now:

  40118c:       48 8b 05 7d 2e 00 00    mov    0x2e7d(%rip),%rax        # 404010 <a>
  401193:       48 85 c0                test   %rax,%rax
  401196:       0f 95 c0                setne  %al


My stdbool.h contains

#define bool	_Bool

And I think C99 added _Bool that makes that work.

But I didn't read the standard, and it's time for the weekend :)
Zi Yan April 26, 2024, 8:22 p.m. UTC | #4
On 26 Apr 2024, at 16:15, David Hildenbrand wrote:

> On 26.04.24 21:20, Zi Yan wrote:
>> On 26 Apr 2024, at 15:08, David Hildenbrand wrote:
>>
>>> On 26.04.24 21:02, Zi Yan wrote:
>>>> From: Zi Yan <ziy@nvidia.com>
>>>>
>>>> In __folio_remove_rmap(), a large folio is added to deferred split list
>>>> if any page in a folio loses its final mapping. But it is possible that
>>>> the folio is fully unmapped and adding it to deferred split list is
>>>> unnecessary.
>>>>
>>>> For PMD-mapped THPs, that was not really an issue, because removing the
>>>> last PMD mapping in the absence of PTE mappings would not have added the
>>>> folio to the deferred split queue.
>>>>
>>>> However, for PTE-mapped THPs, which are now more prominent due to mTHP,
>>>> they are always added to the deferred split queue. One side effect
>>>> is that the THP_DEFERRED_SPLIT_PAGE stat for a PTE-mapped folio can be
>>>> unintentionally increased, making it look like there are many partially
>>>> mapped folios -- although the whole folio is fully unmapped stepwise.
>>>>
>>>> Core-mm now tries batch-unmapping consecutive PTEs of PTE-mapped THPs
>>>> where possible starting from commit b06dc281aa99 ("mm/rmap: introduce
>>>> folio_remove_rmap_[pte|ptes|pmd]()"). When it happens, a whole PTE-mapped
>>>> folio is unmapped in one go and can avoid being added to deferred split
>>>> list, reducing the THP_DEFERRED_SPLIT_PAGE noise. But there will still be
>>>> noise when we cannot batch-unmap a complete PTE-mapped folio in one go
>>>> -- or where this type of batching is not implemented yet, e.g., migration.
>>>>
>>>> To avoid the unnecessary addition, folio->_nr_pages_mapped is checked
>>>> to tell if the whole folio is unmapped. If the folio is already on
>>>> deferred split list, it will be skipped, too.
>>>>
>>>> Note: commit 98046944a159 ("mm: huge_memory: add the missing
>>>> folio_test_pmd_mappable() for THP split statistics") tried to exclude
>>>> mTHP deferred split stats from THP_DEFERRED_SPLIT_PAGE, but it does not
>>>> fix the above issue. A fully unmapped PTE-mapped order-9 THP was still
>>>> added to deferred split list and counted as THP_DEFERRED_SPLIT_PAGE,
>>>> since nr is 512 (non zero), level is RMAP_LEVEL_PTE, and inside
>>>> deferred_split_folio() the order-9 folio is folio_test_pmd_mappable().
>>>>
>>>> Suggested-by: David Hildenbrand <david@redhat.com>
>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>> ---
>>>>    mm/rmap.c | 12 +++++++++---
>>>>    1 file changed, 9 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 2608c40dffad..a9bd64ebdd9a 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1495,6 +1495,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>    {
>>>>    	atomic_t *mapped = &folio->_nr_pages_mapped;
>>>>    	int last, nr = 0, nr_pmdmapped = 0;
>>>> +	bool partially_mapped = false;
>>>>    	enum node_stat_item idx;
>>>>     	__folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>>> @@ -1515,6 +1516,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>>    					nr++;
>>>>    			}
>>>>    		} while (page++, --nr_pages > 0);
>>>> +
>>>> +		partially_mapped = !!nr && !!atomic_read(mapped);
>>>
>>> Nit: The && should remove the need for both !!.
>>
>> My impression was that !! is needed to convert from int to bool and I do
>> find "!!int && !!int" use in the kernel.
>
> I might be wrong about this, but if you wouldn't write
>
> 	if (!!nr && !!atomic_read(mapped))
>
> then
>
> bool partially_mapped = nr && atomic_read(mapped);
>
> is sufficient.
>
> && would make sure that the result is either 0 or 1, which
> you can store safely in a bool, no matter which underlying type
> is used to store that value.
>
> But I *think* nowadays, the compiler will always handle that
> correctly, even without the "&&" (ever since C99 added _Bool).
>
> Likely, also
>
> 	bool partially_mapped = nr & atomic_read(mapped);
>
> Would nowadays work, but looks stupid.
>
>
> Related: https://lkml.org/lkml/2013/8/31/138
>
> ---
> #include <stdio.h>
> #include <stdbool.h>
> #include <stdint.h>
> #include <inttypes.h>
>
> volatile uint64_t a = 0x8000000000000000ull;
>
> void main (void) {
>         printf("uint64_t a = a: 0x%" PRIx64 "\n", a);
>
>         int i1 = a;
>         printf("int i1 = a: %d\n", i1);
>
>         int i2 = !!a;
>         printf("int i2 = !!a: %d\n", i2);
>
>         bool b1 = a;
>         printf("bool b1 = a: %d\n", b1);
>
>         bool b2 = !!a;
>         printf("bool b2 = !!a: %d\n", b2);
> }
> ---
> $ ./test
> uint64_t a = a: 0x8000000000000000
> int i1 = a: 0
> int i2 = !!a: 1
> bool b1 = a: 1
> bool b2 = !!a: 1
> ---
>
> Note that if bool would be defined as "int", you would need the !!, otherwise you
> would lose information.

Thank you for all above. Really appreciate it! And you are right about && and !!.
My memory on !! must be from the old days and now is refreshed. :)

>
> But even for b1, the gcc generates now:
>
>  40118c:       48 8b 05 7d 2e 00 00    mov    0x2e7d(%rip),%rax        # 404010 <a>
>  401193:       48 85 c0                test   %rax,%rax
>  401196:       0f 95 c0                setne  %al
>
>
> My stdbool.h contains
>
> #define bool	_Bool
>
> And I think C99 added _Bool that makes that work.
>
> But I didn't read the standard, and it's time for the weekend :)

Have a good weekend!

--
Best Regards,
Yan, Zi
Yang Shi April 26, 2024, 8:42 p.m. UTC | #5
On Fri, Apr 26, 2024 at 12:02 PM Zi Yan <zi.yan@sent.com> wrote:
>
> From: Zi Yan <ziy@nvidia.com>
>
> In __folio_remove_rmap(), a large folio is added to deferred split list
> if any page in a folio loses its final mapping. But it is possible that
> the folio is fully unmapped and adding it to deferred split list is
> unnecessary.
>
> For PMD-mapped THPs, that was not really an issue, because removing the
> last PMD mapping in the absence of PTE mappings would not have added the
> folio to the deferred split queue.
>
> However, for PTE-mapped THPs, which are now more prominent due to mTHP,
> they are always added to the deferred split queue. One side effect
> is that the THP_DEFERRED_SPLIT_PAGE stat for a PTE-mapped folio can be
> unintentionally increased, making it look like there are many partially
> mapped folios -- although the whole folio is fully unmapped stepwise.
>
> Core-mm now tries batch-unmapping consecutive PTEs of PTE-mapped THPs
> where possible starting from commit b06dc281aa99 ("mm/rmap: introduce
> folio_remove_rmap_[pte|ptes|pmd]()"). When it happens, a whole PTE-mapped
> folio is unmapped in one go and can avoid being added to deferred split
> list, reducing the THP_DEFERRED_SPLIT_PAGE noise. But there will still be
> noise when we cannot batch-unmap a complete PTE-mapped folio in one go
> -- or where this type of batching is not implemented yet, e.g., migration.
>
> To avoid the unnecessary addition, folio->_nr_pages_mapped is checked
> to tell if the whole folio is unmapped. If the folio is already on
> deferred split list, it will be skipped, too.
>
> Note: commit 98046944a159 ("mm: huge_memory: add the missing
> folio_test_pmd_mappable() for THP split statistics") tried to exclude
> mTHP deferred split stats from THP_DEFERRED_SPLIT_PAGE, but it does not
> fix the above issue. A fully unmapped PTE-mapped order-9 THP was still
> added to deferred split list and counted as THP_DEFERRED_SPLIT_PAGE,
> since nr is 512 (non zero), level is RMAP_LEVEL_PTE, and inside
> deferred_split_folio() the order-9 folio is folio_test_pmd_mappable().
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Zi Yan <ziy@nvidia.com>

Reviewed-by: Yang Shi <shy828301@gmail.com>

> ---
>  mm/rmap.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2608c40dffad..a9bd64ebdd9a 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1495,6 +1495,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>  {
>         atomic_t *mapped = &folio->_nr_pages_mapped;
>         int last, nr = 0, nr_pmdmapped = 0;
> +       bool partially_mapped = false;
>         enum node_stat_item idx;
>
>         __folio_rmap_sanity_checks(folio, page, nr_pages, level);
> @@ -1515,6 +1516,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>                                         nr++;
>                         }
>                 } while (page++, --nr_pages > 0);
> +
> +               partially_mapped = !!nr && !!atomic_read(mapped);
>                 break;
>         case RMAP_LEVEL_PMD:
>                 atomic_dec(&folio->_large_mapcount);
> @@ -1532,6 +1535,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>                                 nr = 0;
>                         }
>                 }
> +
> +               partially_mapped = nr < nr_pmdmapped;
>                 break;
>         }
>
> @@ -1553,9 +1558,10 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>                  * page of the folio is unmapped and at least one page
>                  * is still mapped.
>                  */
> -               if (folio_test_large(folio) && folio_test_anon(folio))
> -                       if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> -                               deferred_split_folio(folio);
> +               if (folio_test_anon(folio) &&
> +                   list_empty(&folio->_deferred_list) &&
> +                   partially_mapped)
> +                       deferred_split_folio(folio);
>         }
>
>         /*
>
> base-commit: 3dba658670af22074cc6f26dc92efe0013ac3359
> --
> 2.43.0
>
Lance Yang April 27, 2024, 4:06 a.m. UTC | #6
On Sat, Apr 27, 2024 at 4:16 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 26.04.24 21:20, Zi Yan wrote:
> > On 26 Apr 2024, at 15:08, David Hildenbrand wrote:
[...]
> >>> +   bool partially_mapped = false;
[...]
> >>> +
> >>> +           partially_mapped = !!nr && !!atomic_read(mapped);
> >>
> >> Nit: The && should remove the need for both !!.
> >
> > My impression was that !! is needed to convert from int to bool and I do
> > find "!!int && !!int" use in the kernel.
>
> I might be wrong about this, but if you wouldn't write

I think you're correct.

>
>         if (!!nr && !!atomic_read(mapped))
>
> then
>
> bool partially_mapped = nr && atomic_read(mapped);
>
> is sufficient.

+1

>
> && would make sure that the result is either 0 or 1, which
> you can store safely in a bool, no matter which underlying type
> is used to store that value.
>
> But I *think* nowadays, the compiler will always handle that
> correctly, even without the "&&" (ever since C99 added _Bool).
>
> Likely, also
>
>         bool partially_mapped = nr & atomic_read(mapped);
>
> Would nowadays work, but looks stupid.
>
>
> Related: https://lkml.org/lkml/2013/8/31/138
>
> ---
> #include <stdio.h>
> #include <stdbool.h>
> #include <stdint.h>
> #include <inttypes.h>
>
> volatile uint64_t a = 0x8000000000000000ull;
>
> void main (void) {
>          printf("uint64_t a = a: 0x%" PRIx64 "\n", a);
>
>          int i1 = a;
>          printf("int i1 = a: %d\n", i1);
>
>          int i2 = !!a;
>          printf("int i2 = !!a: %d\n", i2);
>
>          bool b1 = a;
>          printf("bool b1 = a: %d\n", b1);
>
>          bool b2 = !!a;
>          printf("bool b2 = !!a: %d\n", b2);
> }
> ---
> $ ./test
> uint64_t a = a: 0x8000000000000000
> int i1 = a: 0
> int i2 = !!a: 1
> bool b1 = a: 1
> bool b2 = !!a: 1
> ---
>
> Note that if bool would be defined as "int", you would need the !!, otherwise you
> would lose information.

Agreed. We need to be careful in this case.

>
> But even for b1, the gcc generates now:
>
>   40118c:       48 8b 05 7d 2e 00 00    mov    0x2e7d(%rip),%rax        # 404010 <a>
>   401193:       48 85 c0                test   %rax,%rax
>   401196:       0f 95 c0                setne  %al
>
>
> My stdbool.h contains
>
> #define bool    _Bool
>
> And I think C99 added _Bool that makes that work.
>
> But I didn't read the standard, and it's time for the weekend :)

I just read the C99 standard and found some interesting information as follows:

6.3.1.2 Boolean type
    When any *scalar value* is converted to _Bool, the result is 0 if the
    value compares equal to 0; otherwise, the result is 1.

6.2.5 Types
    21. Arithmetic types and pointer types are collectively called *scalar
    types*. Array and structure types are collectively called aggregate types.

6.5.13 Logical AND operator
    Semantics
    The && operator shall yield 1 if both of its operands compare unequal to
    0; otherwise, it yields 0. The result has type int.

6.5.10 Bitwise AND operator
    Constraints
    Each of the operands shall have integer type.
    Semantics
    The result of the binary & operator is the bitwise AND of the operands
    (that is, each bit in the result is set if and only if each of the
corresponding
    bits in the converted operands is set).

&& would ensure that the result is either 0 or 1, as David said, so no worries.

We defined partially_mapped as a bool (_Bool). IIUC, "partially_mapped = int & int;"
would work correctly as well. However, "partially_mapped = long & int;" might not.

Using && would be nicer as David suggested :p
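
As a quick standalone check of those clauses (plain userspace C; nr and mapped
below are just ints standing in for the kernel values):

---
#include <stdbool.h>
#include <stdio.h>

int main(void)
{
	int nr = 512;		/* stands in for the nr counter */
	int mapped = 3;		/* stands in for atomic_read(mapped) */

	/* 6.5.13: && yields an int that is exactly 0 or 1 */
	int  as_int  = nr && mapped;

	/* 6.3.1.2: any nonzero scalar converts to _Bool as 1, so the
	 * explicit !! adds nothing when the destination is bool */
	bool without = nr && mapped;
	bool with    = !!nr && !!mapped;

	printf("%d %d %d\n", as_int, without, with);	/* prints: 1 1 1 */
	return 0;
}
---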

Thanks,
Lance

>
> --
> Cheers,
>
> David / dhildenb
>
Lance Yang April 27, 2024, 4:09 a.m. UTC | #7
On Sat, Apr 27, 2024 at 3:02 AM Zi Yan <zi.yan@sent.com> wrote:
>
> From: Zi Yan <ziy@nvidia.com>
>
> In __folio_remove_rmap(), a large folio is added to deferred split list
> if any page in a folio loses its final mapping. But it is possible that
> the folio is fully unmapped and adding it to deferred split list is
> unnecessary.
>
> For PMD-mapped THPs, that was not really an issue, because removing the
> last PMD mapping in the absence of PTE mappings would not have added the
> folio to the deferred split queue.
>
> However, for PTE-mapped THPs, which are now more prominent due to mTHP,
> they are always added to the deferred split queue. One side effect
> is that the THP_DEFERRED_SPLIT_PAGE stat for a PTE-mapped folio can be
> unintentionally increased, making it look like there are many partially
> mapped folios -- although the whole folio is fully unmapped stepwise.
>
> Core-mm now tries batch-unmapping consecutive PTEs of PTE-mapped THPs
> where possible starting from commit b06dc281aa99 ("mm/rmap: introduce
> folio_remove_rmap_[pte|ptes|pmd]()"). When it happens, a whole PTE-mapped
> folio is unmapped in one go and can avoid being added to deferred split
> list, reducing the THP_DEFERRED_SPLIT_PAGE noise. But there will still be
> noise when we cannot batch-unmap a complete PTE-mapped folio in one go
> -- or where this type of batching is not implemented yet, e.g., migration.
>
> To avoid the unnecessary addition, folio->_nr_pages_mapped is checked
> to tell if the whole folio is unmapped. If the folio is already on
> deferred split list, it will be skipped, too.
>
> Note: commit 98046944a159 ("mm: huge_memory: add the missing
> folio_test_pmd_mappable() for THP split statistics") tried to exclude
> mTHP deferred split stats from THP_DEFERRED_SPLIT_PAGE, but it does not
> fix the above issue. A fully unmapped PTE-mapped order-9 THP was still
> added to deferred split list and counted as THP_DEFERRED_SPLIT_PAGE,
> since nr is 512 (non zero), level is RMAP_LEVEL_PTE, and inside
> deferred_split_folio() the order-9 folio is folio_test_pmd_mappable().
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Zi Yan <ziy@nvidia.com>

Reviewed-by: Lance Yang <ioworker0@gmail.com>

Thanks,
Lance

> ---
>  mm/rmap.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2608c40dffad..a9bd64ebdd9a 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1495,6 +1495,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>  {
>         atomic_t *mapped = &folio->_nr_pages_mapped;
>         int last, nr = 0, nr_pmdmapped = 0;
> +       bool partially_mapped = false;
>         enum node_stat_item idx;
>
>         __folio_rmap_sanity_checks(folio, page, nr_pages, level);
> @@ -1515,6 +1516,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>                                         nr++;
>                         }
>                 } while (page++, --nr_pages > 0);
> +
> +               partially_mapped = !!nr && !!atomic_read(mapped);
>                 break;
>         case RMAP_LEVEL_PMD:
>                 atomic_dec(&folio->_large_mapcount);
> @@ -1532,6 +1535,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>                                 nr = 0;
>                         }
>                 }
> +
> +               partially_mapped = nr < nr_pmdmapped;
>                 break;
>         }
>
> @@ -1553,9 +1558,10 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>                  * page of the folio is unmapped and at least one page
>                  * is still mapped.
>                  */
> -               if (folio_test_large(folio) && folio_test_anon(folio))
> -                       if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> -                               deferred_split_folio(folio);
> +               if (folio_test_anon(folio) &&
> +                   list_empty(&folio->_deferred_list) &&
> +                   partially_mapped)
> +                       deferred_split_folio(folio);
>         }
>
>         /*
>
> base-commit: 3dba658670af22074cc6f26dc92efe0013ac3359
> --
> 2.43.0
>
David Hildenbrand April 27, 2024, 6:51 a.m. UTC | #8
On 27.04.24 06:06, Lance Yang wrote:
> On Sat, Apr 27, 2024 at 4:16 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 26.04.24 21:20, Zi Yan wrote:
>>> On 26 Apr 2024, at 15:08, David Hildenbrand wrote:
> [...]
>>>>> +   bool partially_mapped = false;
> [...]
>>>>> +
>>>>> +           partially_mapped = !!nr && !!atomic_read(mapped);
>>>>
>>>> Nit: The && should remove the need for both !!.
>>>
>>> My impression was that !! is needed to convert from int to bool and I do
>>> find "!!int && !!int" use in the kernel.
>>
>> I might be wrong about this, but if you wouldn't write
> 
> I think you're correct.
> 
>>
>>          if (!!nr && !!atomic_read(mapped))
>>
>> then
>>
>> bool partially_mapped = nr && atomic_read(mapped);
>>
>> is sufficient.
> 
> +1
> 
>>
>> && would make sure that the result is either 0 or 1, which
>> you can store safely in a bool, no matter which underlying type
>> is used to store that value.
>>
>> But I *think* nowadays, the compiler will always handle that
>> correctly, even without the "&&" (ever since C99 added _Bool).
>>
>> Likely, also
>>
>>          bool partially_mapped = nr & atomic_read(mapped);
>>
>> Would nowadays work, but looks stupid.
>>
>>
>> Related: https://lkml.org/lkml/2013/8/31/138
>>
>> ---
>> #include <stdio.h>
>> #include <stdbool.h>
>> #include <stdint.h>
>> #include <inttypes.h>
>>
>> volatile uint64_t a = 0x8000000000000000ull;
>>
>> void main (void) {
>>           printf("uint64_t a = a: 0x%" PRIx64 "\n", a);
>>
>>           int i1 = a;
>>           printf("int i1 = a: %d\n", i1);
>>
>>           int i2 = !!a;
>>           printf("int i2 = !!a: %d\n", i2);
>>
>>           bool b1 = a;
>>           printf("bool b1 = a: %d\n", b1);
>>
>>           bool b2 = !!a;
>>           printf("bool b2 = !!a: %d\n", b2);
>> }
>> ---
>> $ ./test
>> uint64_t a = a: 0x8000000000000000
>> int i1 = a: 0
>> int i2 = !!a: 1
>> bool b1 = a: 1
>> bool b2 = !!a: 1
>> ---
>>
>> Note that if bool would be defined as "int", you would need the !!, otherwise you
>> would lose information.
> 
> Agreed. We need to be careful in this case.
> 
>>
>> But even for b1, the gcc generates now:
>>
>>    40118c:       48 8b 05 7d 2e 00 00    mov    0x2e7d(%rip),%rax        # 404010 <a>
>>    401193:       48 85 c0                test   %rax,%rax
>>    401196:       0f 95 c0                setne  %al
>>
>>
>> My stdbool.h contains
>>
>> #define bool    _Bool
>>
>> And I think C99 added _Bool that makes that work.
>>
>> But I didn't read the standard, and it's time for the weekend :)
> 
> I just read the C99 and found some interesting information as follows:
> 
> 6.3.1.2 Boolean type
>      When any *scalar value* is converted to _Bool, the result is 0 if the
>      value compares equal to 0; otherwise, the result is 1.
> 
> 6.2.5 Types
>      21. Arithmetic types and pointer types are collectively called *scalar
>      types*. Array and structure types are collectively called aggregate types.
> 
> 6.5.13 Logical AND operator
>      Semantics
>      The && operator shall yield 1 if both of its operands compare unequal to
>      0; otherwise, it yields 0. The result has type int.
> 
> 6.5.10 Bitwise AND operator
>      Constraints
>      Each of the operands shall have integer type.
>      Semantics
>      The result of the binary & operator is the bitwise AND of the operands
>      (that is, each bit in the result is set if and only if each of the
> corresponding
>      bits in the converted operands is set).
> 
> && would ensure that the result is either 0 or 1, as David said, so no worries.
> 

My example was flawed: I wanted to express that "if any bit is set, the 
bool value will be 1". That works for "| vs ||" but not for "& vs &&", 
obviously :)

> We defined partially_mapped as a bool(_Bool). IIUC, "partially_mapped
> = int & int;"
> would work correctly as well. However, "partially_mapped = long &
> int;" might not.

Implicit type conversion would convert "long & int" to "long & long" 
first, which should work just fine. I think really most concerns 
regarding the bool type are due to < C99 not supporting _Bool.

Great weekend everybody!
Barry Song April 27, 2024, 9:32 a.m. UTC | #9
On Sat, Apr 27, 2024 at 3:20 AM Zi Yan <ziy@nvidia.com> wrote:
>
> On 26 Apr 2024, at 15:08, David Hildenbrand wrote:
>
> > On 26.04.24 21:02, Zi Yan wrote:
> >> From: Zi Yan <ziy@nvidia.com>
> >>
> >> In __folio_remove_rmap(), a large folio is added to deferred split list
> >> if any page in a folio loses its final mapping. But it is possible that
> >> the folio is fully unmapped and adding it to deferred split list is
> >> unnecessary.
> >>
> >> For PMD-mapped THPs, that was not really an issue, because removing the
> >> last PMD mapping in the absence of PTE mappings would not have added the
> >> folio to the deferred split queue.
> >>
> >> However, for PTE-mapped THPs, which are now more prominent due to mTHP,
> >> they are always added to the deferred split queue. One side effect
> >> is that the THP_DEFERRED_SPLIT_PAGE stat for a PTE-mapped folio can be
> >> unintentionally increased, making it look like there are many partially
> >> mapped folios -- although the whole folio is fully unmapped stepwise.
> >>
> >> Core-mm now tries batch-unmapping consecutive PTEs of PTE-mapped THPs
> >> where possible starting from commit b06dc281aa99 ("mm/rmap: introduce
> >> folio_remove_rmap_[pte|ptes|pmd]()"). When it happens, a whole PTE-mapped
> >> folio is unmapped in one go and can avoid being added to deferred split
> >> list, reducing the THP_DEFERRED_SPLIT_PAGE noise. But there will still be
> >> noise when we cannot batch-unmap a complete PTE-mapped folio in one go
> >> -- or where this type of batching is not implemented yet, e.g., migration.
> >>
> >> To avoid the unnecessary addition, folio->_nr_pages_mapped is checked
> >> to tell if the whole folio is unmapped. If the folio is already on
> >> deferred split list, it will be skipped, too.
> >>
> >> Note: commit 98046944a159 ("mm: huge_memory: add the missing
> >> folio_test_pmd_mappable() for THP split statistics") tried to exclude
> >> mTHP deferred split stats from THP_DEFERRED_SPLIT_PAGE, but it does not
> >> fix the above issue. A fully unmapped PTE-mapped order-9 THP was still
> >> added to deferred split list and counted as THP_DEFERRED_SPLIT_PAGE,
> >> since nr is 512 (non zero), level is RMAP_LEVEL_PTE, and inside
> >> deferred_split_folio() the order-9 folio is folio_test_pmd_mappable().
> >>
> >> Suggested-by: David Hildenbrand <david@redhat.com>
> >> Signed-off-by: Zi Yan <ziy@nvidia.com>
> >> ---
> >>   mm/rmap.c | 12 +++++++++---
> >>   1 file changed, 9 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index 2608c40dffad..a9bd64ebdd9a 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -1495,6 +1495,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>   {
> >>      atomic_t *mapped = &folio->_nr_pages_mapped;
> >>      int last, nr = 0, nr_pmdmapped = 0;
> >> +    bool partially_mapped = false;
> >>      enum node_stat_item idx;
> >>      __folio_rmap_sanity_checks(folio, page, nr_pages, level);
> >> @@ -1515,6 +1516,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>                                      nr++;
> >>                      }
> >>              } while (page++, --nr_pages > 0);
> >> +
> >> +            partially_mapped = !!nr && !!atomic_read(mapped);
> >
> > Nit: The && should remove the need for both !!.
>
> My impression was that !! is needed to convert from int to bool and I do
> find "!!int && !!int" use in the kernel. If this is unnecessary, Andrew
> can apply the fixup below. I can send a new version if it is really needed.
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index a9bd64ebdd9a..c1fd5828409b 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1517,7 +1517,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>                         }
>                 } while (page++, --nr_pages > 0);
>
> -               partially_mapped = !!nr && !!atomic_read(mapped);
> +               partially_mapped = nr && atomic_read(mapped);

Reviewed-by: Barry Song <baohua@kernel.org>

>                 break;
>         case RMAP_LEVEL_PMD:
>                 atomic_dec(&folio->_large_mapcount);
>
> >
> > Reviewed-by: David Hildenbrand <david@redhat.com>
>
> Thanks.
>
> --
> Best Regards,
> Yan, Zi
Alexander Gordeev May 1, 2024, 1:24 p.m. UTC | #10
On Fri, Apr 26, 2024 at 03:02:53PM -0400, Zi Yan wrote:

Hi Zi,

It increasingly looks like this commit is crashing on s390 since
2024-04-30 in linux-next - if I am not missing something, that is since
it was included in mm-everything.

> @@ -1553,9 +1558,10 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>  		 * page of the folio is unmapped and at least one page
>  		 * is still mapped.
>  		 */
> -		if (folio_test_large(folio) && folio_test_anon(folio))
> -			if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> -				deferred_split_folio(folio);
> +		if (folio_test_anon(folio) &&
> +		    list_empty(&folio->_deferred_list) &&

An attempt to reference folio->_deferred_list causes the crash below.

> +		    partially_mapped)
> +			deferred_split_folio(folio);
>  	}
>  
>  	/*

[  507.227423] Unable to handle kernel pointer dereference in virtual kernel address space
[  507.227432] Failing address: 000001d689000000 TEID: 000001d689000803
[  507.227435] Fault in home space mode while using kernel ASCE.
[  507.227439] AS:0000000180788007 R3:00000001fe2cc007 S:0000000000000020 
[  507.227492] Oops: 0010 ilc:3 [#1] SMP 
[  507.227497] Modules linked in: vmur(E) kvm(E) algif_hash(E) af_alg(E) binfmt_misc(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) ip_set(E) nf_tables(E) nfnetlink(E) dm_service_time(E) s390_trng(E) vfio_ccw(E) mdev(E) vfio_iommu_type1(E) vfio(E) sch_fq_codel(E) loop(E) configfs(E) lcs(E) ctcm(E) fsm(E) zfcp(E) scsi_transport_fc(E) ghash_s390(E) prng(E) chacha_s390(E) libchacha(E) aes_s390(E) des_s390(E) libdes(E) sha3_512_s390(E) sha3_256_s390(E) sha512_s390(E) sha256_s390(E) sha1_s390(E) sha_common(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) pkey(E) zcrypt(E) rng_core(E) dm_multipath(E) autofs4(E)
[  507.227546] Unloaded tainted modules: dcssblk(E):2 [last unloaded: dcssblk(E)]
[  507.230569] CPU: 0 PID: 36783 Comm: pahole Tainted: G            E      6.9.0-20240430.rc6.git237.d04466706db5.300.fc39.s390x+next #1
[  507.230574] Hardware name: IBM 3931 A01 703 (z/VM 7.3.0)
[  507.230576] Krnl PSW : 0704f00180000000 0000025e1092a430 (folio_remove_rmap_ptes+0xe0/0x140)
[  507.230588]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3
[  507.230592] Krnl GPRS: ffffffffffffe377 0000000000000000 0000025e122075b8 0000000000000000
[  507.230595]            ffffffffffffffff 0000025d8f613288 8800000000000000 00000157a38b8700
[  507.230598]            000000023fffe13f 0000000000000000 000001579ccd75c0 000001d688ffff80
[  507.230602]            000003ffb9cacf98 000001d688ffff80 0000025e1092a428 000001de11fab878
[  507.230610] Krnl Code: 0000025e1092a422: c0e500039f47        brasl   %r14,0000025e1099e2b0
[  507.230610]            0000025e1092a428: 9101b01f            tm      31(%r11),1
[  507.230610]           #0000025e1092a42c: a784ffb9            brc     8,0000025e1092a39e
[  507.230610]           >0000025e1092a430: e340b0900004        lg      %r4,144(%r11)
[  507.230610]            0000025e1092a436: 4150b090            la      %r5,144(%r11)
[  507.230610]            0000025e1092a43a: ec45ffb26064        cgrj    %r4,%r5,6,0000025e1092a39e
[  507.230610]            0000025e1092a440: a7910001            tmll    %r9,1
[  507.230610]            0000025e1092a444: a784ffad            brc     8,0000025e1092a39e
[  507.230672] Call Trace:
[  507.230678]  [<0000025e1092a430>] folio_remove_rmap_ptes+0xe0/0x140 
[  507.230682] ([<0000025e1092a428>] folio_remove_rmap_ptes+0xd8/0x140)
[  507.230685]  [<0000025e1090d76a>] zap_present_ptes.isra.0+0x222/0x918 
[  507.230689]  [<0000025e1090e008>] zap_pte_range+0x1a8/0x4e8 
[  507.230692]  [<0000025e1090e58c>] zap_p4d_range+0x244/0x480 
[  507.230695]  [<0000025e1090eb22>] unmap_page_range+0xea/0x2c0 
[  507.230698]  [<0000025e1090ed92>] unmap_single_vma.isra.0+0x9a/0xf0 
[  507.230701]  [<0000025e1090ee9e>] unmap_vmas+0xb6/0x1a0 
[  507.230705]  [<0000025e1091e0d4>] exit_mmap+0xc4/0x3d0 
[  507.230709]  [<0000025e10675c64>] __mmput+0x54/0x150 
[  507.230714]  [<0000025e1067f3ba>] exit_mm+0xca/0x138 
[  507.230717]  [<0000025e1067f690>] do_exit+0x268/0x520 
[  507.230721]  [<0000025e1067fb38>] do_group_exit+0x40/0xb8 
[  507.230725]  [<0000025e1067fc0e>] __s390x_sys_exit_group+0x2e/0x30 
[  507.230729]  [<0000025e1136ba4e>] __do_syscall+0x216/0x2d0 
[  507.230736]  [<0000025e1137c848>] system_call+0x70/0x98
[  507.230780] Last Breaking-Event-Address:
[  507.230783]  [<0000025e1099e32a>] __lruvec_stat_mod_folio+0x7a/0xb0
[  507.230789] Kernel panic - not syncing: Fatal exception: panic_on_oops
00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 0000025E 10630B56

Thanks!
Zi Yan May 1, 2024, 1:38 p.m. UTC | #11
On 1 May 2024, at 9:24, Alexander Gordeev wrote:

> On Fri, Apr 26, 2024 at 03:02:53PM -0400, Zi Yan wrote:
>
> Hi Zi,
>
> It increasingly looks like this commit is crashing on s390 since
> 2024-04-30 in linux-next. If I do not miss something - since it
> was included in mm-everything.
>
>> @@ -1553,9 +1558,10 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>  		 * page of the folio is unmapped and at least one page
>>  		 * is still mapped.
>>  		 */
>> -		if (folio_test_large(folio) && folio_test_anon(folio))
>> -			if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
>> -				deferred_split_folio(folio);
>> +		if (folio_test_anon(folio) &&
>> +		    list_empty(&folio->_deferred_list) &&
>
> An attempt to reference folio->_deferred_list causes the crash below.

So if you remove this line, the crash no longer happens? It looks strange to
me that referencing an anonymous folio's _deferred_list would cause a crash.
Hmm, unless the folio is order-0.

Can you try the patch below and see if it fixes the crash? It moves partially_mapped
ahead to exclude order-0 folios.

diff --git a/mm/rmap.c b/mm/rmap.c
index 087a79f1f611..2d27c92bb6d5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1557,9 +1557,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
                 * page of the folio is unmapped and at least one page
                 * is still mapped.
                 */
-               if (folio_test_anon(folio) &&
-                   list_empty(&folio->_deferred_list) &&
-                   partially_mapped)
+               if (folio_test_anon(folio) && partially_mapped &&
+                   list_empty(&folio->_deferred_list))
                        deferred_split_folio(folio);
        }


>
>> +		    partially_mapped)
>> +			deferred_split_folio(folio);
>>  	}
>>
>>  	/*
>
> [  507.227423] Unable to handle kernel pointer dereference in virtual kernel address space
> [  507.227432] Failing address: 000001d689000000 TEID: 000001d689000803
> [  507.227435] Fault in home space mode while using kernel ASCE.
> [  507.227439] AS:0000000180788007 R3:00000001fe2cc007 S:0000000000000020
> [  507.227492] Oops: 0010 ilc:3 [#1] SMP
> [  507.227497] Modules linked in: vmur(E) kvm(E) algif_hash(E) af_alg(E) binfmt_misc(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) ip_set(E) nf_tables(E) nfnetlink(E) dm_service_time(E) s390_trng(E) vfio_ccw(E) mdev(E) vfio_iommu_type1(E) vfio(E) sch_fq_codel(E) loop(E) configfs(E) lcs(E) ctcm(E) fsm(E) zfcp(E) scsi_transport_fc(E) ghash_s390(E) prng(E) chacha_s390(E) libchacha(E) aes_s390(E) des_s390(E) libdes(E) sha3_512_s390(E) sha3_256_s390(E) sha512_s390(E) sha256_s390(E) sha1_s390(E) sha_common(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) pkey(E) zcrypt(E) rng_core(E) dm_multipath(E) autofs4(E)
> [  507.227546] Unloaded tainted modules: dcssblk(E):2 [last unloaded: dcssblk(E)]
> [  507.230569] CPU: 0 PID: 36783 Comm: pahole Tainted: G            E      6.9.0-20240430.rc6.git237.d04466706db5.300.fc39.s390x+next #1
> [  507.230574] Hardware name: IBM 3931 A01 703 (z/VM 7.3.0)
> [  507.230576] Krnl PSW : 0704f00180000000 0000025e1092a430 (folio_remove_rmap_ptes+0xe0/0x140)
> [  507.230588]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3
> [  507.230592] Krnl GPRS: ffffffffffffe377 0000000000000000 0000025e122075b8 0000000000000000
> [  507.230595]            ffffffffffffffff 0000025d8f613288 8800000000000000 00000157a38b8700
> [  507.230598]            000000023fffe13f 0000000000000000 000001579ccd75c0 000001d688ffff80
> [  507.230602]            000003ffb9cacf98 000001d688ffff80 0000025e1092a428 000001de11fab878
> [  507.230610] Krnl Code: 0000025e1092a422: c0e500039f47        brasl   %r14,0000025e1099e2b0
> [  507.230610]            0000025e1092a428: 9101b01f            tm      31(%r11),1
> [  507.230610]           #0000025e1092a42c: a784ffb9            brc     8,0000025e1092a39e
> [  507.230610]           >0000025e1092a430: e340b0900004        lg      %r4,144(%r11)
> [  507.230610]            0000025e1092a436: 4150b090            la      %r5,144(%r11)
> [  507.230610]            0000025e1092a43a: ec45ffb26064        cgrj    %r4,%r5,6,0000025e1092a39e
> [  507.230610]            0000025e1092a440: a7910001            tmll    %r9,1
> [  507.230610]            0000025e1092a444: a784ffad            brc     8,0000025e1092a39e
> [  507.230672] Call Trace:
> [  507.230678]  [<0000025e1092a430>] folio_remove_rmap_ptes+0xe0/0x140
> [  507.230682] ([<0000025e1092a428>] folio_remove_rmap_ptes+0xd8/0x140)
> [  507.230685]  [<0000025e1090d76a>] zap_present_ptes.isra.0+0x222/0x918
> [  507.230689]  [<0000025e1090e008>] zap_pte_range+0x1a8/0x4e8
> [  507.230692]  [<0000025e1090e58c>] zap_p4d_range+0x244/0x480
> [  507.230695]  [<0000025e1090eb22>] unmap_page_range+0xea/0x2c0
> [  507.230698]  [<0000025e1090ed92>] unmap_single_vma.isra.0+0x9a/0xf0
> [  507.230701]  [<0000025e1090ee9e>] unmap_vmas+0xb6/0x1a0
> [  507.230705]  [<0000025e1091e0d4>] exit_mmap+0xc4/0x3d0
> [  507.230709]  [<0000025e10675c64>] __mmput+0x54/0x150
> [  507.230714]  [<0000025e1067f3ba>] exit_mm+0xca/0x138
> [  507.230717]  [<0000025e1067f690>] do_exit+0x268/0x520
> [  507.230721]  [<0000025e1067fb38>] do_group_exit+0x40/0xb8
> [  507.230725]  [<0000025e1067fc0e>] __s390x_sys_exit_group+0x2e/0x30
> [  507.230729]  [<0000025e1136ba4e>] __do_syscall+0x216/0x2d0
> [  507.230736]  [<0000025e1137c848>] system_call+0x70/0x98
> [  507.230780] Last Breaking-Event-Address:
> [  507.230783]  [<0000025e1099e32a>] __lruvec_stat_mod_folio+0x7a/0xb0
> [  507.230789] Kernel panic - not syncing: Fatal exception: panic_on_oops
> 00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 0000025E 10630B56
>
> Thanks!

--
Best Regards,
Yan, Zi
David Hildenbrand May 1, 2024, 3:54 p.m. UTC | #12
On 01.05.24 15:38, Zi Yan wrote:
> On 1 May 2024, at 9:24, Alexander Gordeev wrote:
> 
>> On Fri, Apr 26, 2024 at 03:02:53PM -0400, Zi Yan wrote:
>>
>> Hi Zi,
>>
>> It increasingly looks like this commit is crashing on s390 since
>> 2024-04-30 in linux-next. If I do not miss something - since it
>> was included in mm-everything.
>>
>>> @@ -1553,9 +1558,10 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>>   		 * page of the folio is unmapped and at least one page
>>>   		 * is still mapped.
>>>   		 */
>>> -		if (folio_test_large(folio) && folio_test_anon(folio))
>>> -			if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
>>> -				deferred_split_folio(folio);
>>> +		if (folio_test_anon(folio) &&
>>> +		    list_empty(&folio->_deferred_list) &&
>>
>> An attempt to reference folio->_deferred_list causes the crash below.
> 
> So if you remove this line, the crash no longer happens? It looks strange to
> me that referencing an anonymous folio's _deferred_list would cause a crash.
> Hmm, unless the folio is order-0.
> 
> Can you try the patch below and see if it fixes the crash? It moves partially_mapped
> ahead to exclude order-0 folios.
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 087a79f1f611..2d27c92bb6d5 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1557,9 +1557,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>                   * page of the folio is unmapped and at least one page
>                   * is still mapped.
>                   */
> -               if (folio_test_anon(folio) &&
> -                   list_empty(&folio->_deferred_list) &&
> -                   partially_mapped)
> +               if (folio_test_anon(folio) && partially_mapped &&
> +                   list_empty(&folio->_deferred_list))
>                          deferred_split_folio(folio);

Yes, that should fix it and is the right thing to do. For small folios, 
partially_mapped will always be false.
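
A minimal sketch of why the ordering matters (an illustrative userspace model,
not kernel code: folio_model, list_empty_model and remove_rmap_tail are made-up
names, and a NULL pointer stands in for the tail page that an order-0 folio
simply does not have):

---
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

struct folio_model {
	int order;
	struct list_head *deferred_list;	/* NULL for order-0: no tail page */
};

static bool list_empty_model(const struct list_head *head)
{
	return head->next == head;		/* faults if head is bogus */
}

static void remove_rmap_tail(struct folio_model *folio, bool partially_mapped)
{
	/*
	 * The order of the && operands is what matters: partially_mapped
	 * is always false for small folios, so the short-circuit guarantees
	 * deferred_list is never dereferenced for them.
	 */
	if (partially_mapped && list_empty_model(folio->deferred_list))
		printf("order-%d folio queued for deferred split\n", folio->order);
}

int main(void)
{
	struct list_head lh = { &lh, &lh };
	struct folio_model small = { .order = 0, .deferred_list = NULL };
	struct folio_model large = { .order = 9, .deferred_list = &lh };

	remove_rmap_tail(&small, false);	/* safe: the list is never touched */
	remove_rmap_tail(&large, true);		/* queued */
	return 0;
}
---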
Alexander Gordeev May 2, 2024, 1:18 p.m. UTC | #13
On Wed, May 01, 2024 at 09:38:24AM -0400, Zi Yan wrote:
Hi Zi,
> @@ -1557,9 +1557,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>                  * page of the folio is unmapped and at least one page
>                  * is still mapped.
>                  */
> -               if (folio_test_anon(folio) &&
> -                   list_empty(&folio->_deferred_list) &&
> -                   partially_mapped)
> +               if (folio_test_anon(folio) && partially_mapped &&
> +                   list_empty(&folio->_deferred_list))
>                         deferred_split_folio(folio);
>         }

That helps.

> Best Regards,
> Yan, Zi

Thanks!
Zi Yan May 2, 2024, 1:20 p.m. UTC | #14
On 2 May 2024, at 9:18, Alexander Gordeev wrote:

> On Wed, May 01, 2024 at 09:38:24AM -0400, Zi Yan wrote:
> Hi Zi,
>> @@ -1557,9 +1557,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>                  * page of the folio is unmapped and at least one page
>>                  * is still mapped.
>>                  */
>> -               if (folio_test_anon(folio) &&
>> -                   list_empty(&folio->_deferred_list) &&
>> -                   partially_mapped)
>> +               if (folio_test_anon(folio) && partially_mapped &&
>> +                   list_empty(&folio->_deferred_list))
>>                         deferred_split_folio(folio);
>>         }
>
> That helps.
>
>> Best Regards,
>> Yan, Zi
>
> Thanks!

Great! I will send a v6.

--
Best Regards,
Yan, Zi

Patch

diff --git a/mm/rmap.c b/mm/rmap.c
index 2608c40dffad..a9bd64ebdd9a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1495,6 +1495,7 @@  static __always_inline void __folio_remove_rmap(struct folio *folio,
 {
 	atomic_t *mapped = &folio->_nr_pages_mapped;
 	int last, nr = 0, nr_pmdmapped = 0;
+	bool partially_mapped = false;
 	enum node_stat_item idx;
 
 	__folio_rmap_sanity_checks(folio, page, nr_pages, level);
@@ -1515,6 +1516,8 @@  static __always_inline void __folio_remove_rmap(struct folio *folio,
 					nr++;
 			}
 		} while (page++, --nr_pages > 0);
+
+		partially_mapped = !!nr && !!atomic_read(mapped);
 		break;
 	case RMAP_LEVEL_PMD:
 		atomic_dec(&folio->_large_mapcount);
@@ -1532,6 +1535,8 @@  static __always_inline void __folio_remove_rmap(struct folio *folio,
 				nr = 0;
 			}
 		}
+
+		partially_mapped = nr < nr_pmdmapped;
 		break;
 	}
 
@@ -1553,9 +1558,10 @@  static __always_inline void __folio_remove_rmap(struct folio *folio,
 		 * page of the folio is unmapped and at least one page
 		 * is still mapped.
 		 */
-		if (folio_test_large(folio) && folio_test_anon(folio))
-			if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
-				deferred_split_folio(folio);
+		if (folio_test_anon(folio) &&
+		    list_empty(&folio->_deferred_list) &&
+		    partially_mapped)
+			deferred_split_folio(folio);
 	}
 
 	/*