diff mbox series

[v2] arm64/mm: pmd_mkinvalid() must handle swap pmds

Message ID 20240430133138.732088-1-ryan.roberts@arm.com (mailing list archive)
State New
Headers show
Series [v2] arm64/mm: pmd_mkinvalid() must handle swap pmds | expand

Commit Message

Ryan Roberts April 30, 2024, 1:31 p.m. UTC
__split_huge_pmd_locked() can be called for a present THP, devmap or
(non-present) migration entry. It calls pmdp_invalidate()
unconditionally on the pmdp and only determines if it is present or not
based on the returned old pmd.

But arm64's pmd_mkinvalid(), called by pmdp_invalidate(),
unconditionally sets the PMD_PRESENT_INVALID flag, which causes future
pmd_present() calls to return true - even for a swap pmd. Therefore any
lockless pgtable walker could see the migration entry pmd in this state
and start interpretting the fields (e.g. pmd_pfn()) as if it were
present, leading to BadThings (TM). GUP-fast appears to be one such
lockless pgtable walker.

While the obvious fix is for core-mm to avoid such calls for non-present
pmds (pmdp_invalidate() will also issue TLBI which is not necessary for
this case either), all other arches that implement pmd_mkinvalid() do it
in such a way that it is robust to being called with a non-present pmd.
So it is simpler and safer to make arm64 robust too. This approach means
we can even add tests to debug_vm_pgtable.c to validate the required
behaviour.

This is a theoretical bug found during code review. I don't have any
test case to trigger it in practice.

Cc: stable@vger.kernel.org
Fixes: 53fa117bb33c ("arm64/mm: Enable THP migration")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---

Hi all,

v1 of this fix [1] took the approach of fixing core-mm to never call
pmdp_invalidate() on a non-present pmd. But Zi Yan highlighted that only arm64
suffers this problem; all other arches are robust. So his suggestion was to
instead make arm64 robust in the same way and add tests to validate it. Despite
my stated reservations in the context of the v1 discussion, having thought on it
for a bit, I now agree with Zi Yan. Hence this post.

Andrew has v1 in mm-unstable at the moment, so probably the best thing to do is
remove it from there and have this go in through the arm64 tree? Assuming there
is agreement that this approach is right one.

This applies on top of v6.9-rc5. Passes all the mm selftests on arm64.

[1] https://lore.kernel.org/linux-mm/20240425170704.3379492-1-ryan.roberts@arm.com/

Thanks,
Ryan


 arch/arm64/include/asm/pgtable.h | 12 +++++--
 mm/debug_vm_pgtable.c            | 61 ++++++++++++++++++++++++++++++++
 2 files changed, 71 insertions(+), 2 deletions(-)

--
2.25.1

Comments

Will Deacon April 30, 2024, 1:55 p.m. UTC | #1
On Tue, Apr 30, 2024 at 02:31:38PM +0100, Ryan Roberts wrote:
> __split_huge_pmd_locked() can be called for a present THP, devmap or
> (non-present) migration entry. It calls pmdp_invalidate()
> unconditionally on the pmdp and only determines if it is present or not
> based on the returned old pmd.
> 
> But arm64's pmd_mkinvalid(), called by pmdp_invalidate(),
> unconditionally sets the PMD_PRESENT_INVALID flag, which causes future
> pmd_present() calls to return true - even for a swap pmd. Therefore any
> lockless pgtable walker could see the migration entry pmd in this state
> and start interpretting the fields (e.g. pmd_pfn()) as if it were
> present, leading to BadThings (TM). GUP-fast appears to be one such
> lockless pgtable walker.
> 
> While the obvious fix is for core-mm to avoid such calls for non-present
> pmds (pmdp_invalidate() will also issue TLBI which is not necessary for
> this case either), all other arches that implement pmd_mkinvalid() do it
> in such a way that it is robust to being called with a non-present pmd.
> So it is simpler and safer to make arm64 robust too. This approach means
> we can even add tests to debug_vm_pgtable.c to validate the required
> behaviour.
> 
> This is a theoretical bug found during code review. I don't have any
> test case to trigger it in practice.
> 
> Cc: stable@vger.kernel.org
> Fixes: 53fa117bb33c ("arm64/mm: Enable THP migration")
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> 
> Hi all,
> 
> v1 of this fix [1] took the approach of fixing core-mm to never call
> pmdp_invalidate() on a non-present pmd. But Zi Yan highlighted that only arm64
> suffers this problem; all other arches are robust. So his suggestion was to
> instead make arm64 robust in the same way and add tests to validate it. Despite
> my stated reservations in the context of the v1 discussion, having thought on it
> for a bit, I now agree with Zi Yan. Hence this post.
> 
> Andrew has v1 in mm-unstable at the moment, so probably the best thing to do is
> remove it from there and have this go in through the arm64 tree? Assuming there
> is agreement that this approach is right one.
> 
> This applies on top of v6.9-rc5. Passes all the mm selftests on arm64.
> 
> [1] https://lore.kernel.org/linux-mm/20240425170704.3379492-1-ryan.roberts@arm.com/
> 
> Thanks,
> Ryan
> 
> 
>  arch/arm64/include/asm/pgtable.h | 12 +++++--
>  mm/debug_vm_pgtable.c            | 61 ++++++++++++++++++++++++++++++++
>  2 files changed, 71 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index afdd56d26ad7..7d580271a46d 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -511,8 +511,16 @@ static inline int pmd_trans_huge(pmd_t pmd)
> 
>  static inline pmd_t pmd_mkinvalid(pmd_t pmd)
>  {
> -	pmd = set_pmd_bit(pmd, __pgprot(PMD_PRESENT_INVALID));
> -	pmd = clear_pmd_bit(pmd, __pgprot(PMD_SECT_VALID));
> +	/*
> +	 * If not valid then either we are already present-invalid or we are
> +	 * not-present (i.e. none or swap entry). We must not convert
> +	 * not-present to present-invalid. Unbelievably, the core-mm may call
> +	 * pmd_mkinvalid() for a swap entry and all other arches can handle it.
> +	 */
> +	if (pmd_valid(pmd)) {
> +		pmd = set_pmd_bit(pmd, __pgprot(PMD_PRESENT_INVALID));
> +		pmd = clear_pmd_bit(pmd, __pgprot(PMD_SECT_VALID));
> +	}
> 
>  	return pmd;
>  }

Acked-by: Will Deacon <will@kernel.org>

But it might be worth splitting the tests from the fix to make backporting
easier.

Catalin -- I assume you'll pick this up, but please shout if you want me
to take it instead.

Will
Ryan Roberts April 30, 2024, 2:04 p.m. UTC | #2
On 30/04/2024 14:55, Will Deacon wrote:
> On Tue, Apr 30, 2024 at 02:31:38PM +0100, Ryan Roberts wrote:
>> __split_huge_pmd_locked() can be called for a present THP, devmap or
>> (non-present) migration entry. It calls pmdp_invalidate()
>> unconditionally on the pmdp and only determines if it is present or not
>> based on the returned old pmd.
>>
>> But arm64's pmd_mkinvalid(), called by pmdp_invalidate(),
>> unconditionally sets the PMD_PRESENT_INVALID flag, which causes future
>> pmd_present() calls to return true - even for a swap pmd. Therefore any
>> lockless pgtable walker could see the migration entry pmd in this state
>> and start interpretting the fields (e.g. pmd_pfn()) as if it were
>> present, leading to BadThings (TM). GUP-fast appears to be one such
>> lockless pgtable walker.
>>
>> While the obvious fix is for core-mm to avoid such calls for non-present
>> pmds (pmdp_invalidate() will also issue TLBI which is not necessary for
>> this case either), all other arches that implement pmd_mkinvalid() do it
>> in such a way that it is robust to being called with a non-present pmd.
>> So it is simpler and safer to make arm64 robust too. This approach means
>> we can even add tests to debug_vm_pgtable.c to validate the required
>> behaviour.
>>
>> This is a theoretical bug found during code review. I don't have any
>> test case to trigger it in practice.
>>
>> Cc: stable@vger.kernel.org
>> Fixes: 53fa117bb33c ("arm64/mm: Enable THP migration")
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>
>> Hi all,
>>
>> v1 of this fix [1] took the approach of fixing core-mm to never call
>> pmdp_invalidate() on a non-present pmd. But Zi Yan highlighted that only arm64
>> suffers this problem; all other arches are robust. So his suggestion was to
>> instead make arm64 robust in the same way and add tests to validate it. Despite
>> my stated reservations in the context of the v1 discussion, having thought on it
>> for a bit, I now agree with Zi Yan. Hence this post.
>>
>> Andrew has v1 in mm-unstable at the moment, so probably the best thing to do is
>> remove it from there and have this go in through the arm64 tree? Assuming there
>> is agreement that this approach is right one.
>>
>> This applies on top of v6.9-rc5. Passes all the mm selftests on arm64.
>>
>> [1] https://lore.kernel.org/linux-mm/20240425170704.3379492-1-ryan.roberts@arm.com/
>>
>> Thanks,
>> Ryan
>>
>>
>>  arch/arm64/include/asm/pgtable.h | 12 +++++--
>>  mm/debug_vm_pgtable.c            | 61 ++++++++++++++++++++++++++++++++
>>  2 files changed, 71 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index afdd56d26ad7..7d580271a46d 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -511,8 +511,16 @@ static inline int pmd_trans_huge(pmd_t pmd)
>>
>>  static inline pmd_t pmd_mkinvalid(pmd_t pmd)
>>  {
>> -	pmd = set_pmd_bit(pmd, __pgprot(PMD_PRESENT_INVALID));
>> -	pmd = clear_pmd_bit(pmd, __pgprot(PMD_SECT_VALID));
>> +	/*
>> +	 * If not valid then either we are already present-invalid or we are
>> +	 * not-present (i.e. none or swap entry). We must not convert
>> +	 * not-present to present-invalid. Unbelievably, the core-mm may call
>> +	 * pmd_mkinvalid() for a swap entry and all other arches can handle it.
>> +	 */
>> +	if (pmd_valid(pmd)) {
>> +		pmd = set_pmd_bit(pmd, __pgprot(PMD_PRESENT_INVALID));
>> +		pmd = clear_pmd_bit(pmd, __pgprot(PMD_SECT_VALID));
>> +	}
>>
>>  	return pmd;
>>  }
> 
> Acked-by: Will Deacon <will@kernel.org>

Thanks

> 
> But it might be worth splitting the tests from the fix to make backporting
> easier.

Yes good point. I'll leave this hanging for today to see if any more comments
come in, and will re-post tomorrow as 2 patches. I assume we need to go fast to
catch 6.9.

> 
> Catalin -- I assume you'll pick this up, but please shout if you want me
> to take it instead.
> 
> Will
Zi Yan April 30, 2024, 3 p.m. UTC | #3
On 30 Apr 2024, at 9:31, Ryan Roberts wrote:

> __split_huge_pmd_locked() can be called for a present THP, devmap or
> (non-present) migration entry. It calls pmdp_invalidate()
> unconditionally on the pmdp and only determines if it is present or not
> based on the returned old pmd.
>
> But arm64's pmd_mkinvalid(), called by pmdp_invalidate(),
> unconditionally sets the PMD_PRESENT_INVALID flag, which causes future
> pmd_present() calls to return true - even for a swap pmd. Therefore any
> lockless pgtable walker could see the migration entry pmd in this state
> and start interpretting the fields (e.g. pmd_pfn()) as if it were
> present, leading to BadThings (TM). GUP-fast appears to be one such
> lockless pgtable walker.
>
> While the obvious fix is for core-mm to avoid such calls for non-present
> pmds (pmdp_invalidate() will also issue TLBI which is not necessary for
> this case either), all other arches that implement pmd_mkinvalid() do it
> in such a way that it is robust to being called with a non-present pmd.
> So it is simpler and safer to make arm64 robust too. This approach means
> we can even add tests to debug_vm_pgtable.c to validate the required
> behaviour.
>
> This is a theoretical bug found during code review. I don't have any
> test case to trigger it in practice.
>
> Cc: stable@vger.kernel.org
> Fixes: 53fa117bb33c ("arm64/mm: Enable THP migration")
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> Hi all,
>
> v1 of this fix [1] took the approach of fixing core-mm to never call
> pmdp_invalidate() on a non-present pmd. But Zi Yan highlighted that only arm64
> suffers this problem; all other arches are robust. So his suggestion was to
> instead make arm64 robust in the same way and add tests to validate it. Despite
> my stated reservations in the context of the v1 discussion, having thought on it
> for a bit, I now agree with Zi Yan. Hence this post.
>
> Andrew has v1 in mm-unstable at the moment, so probably the best thing to do is
> remove it from there and have this go in through the arm64 tree? Assuming there
> is agreement that this approach is right one.
>
> This applies on top of v6.9-rc5. Passes all the mm selftests on arm64.
>
> [1] https://lore.kernel.org/linux-mm/20240425170704.3379492-1-ryan.roberts@arm.com/
>
> Thanks,
> Ryan
>
>
>  arch/arm64/include/asm/pgtable.h | 12 +++++--
>  mm/debug_vm_pgtable.c            | 61 ++++++++++++++++++++++++++++++++
>  2 files changed, 71 insertions(+), 2 deletions(-)
>
LGTM. Thanks. Feel free to add Reviewed-by: Zi Yan <ziy@nvidia.com>.

--
Best Regards,
Yan, Zi
Catalin Marinas April 30, 2024, 4:23 p.m. UTC | #4
On Tue, Apr 30, 2024 at 03:04:32PM +0100, Ryan Roberts wrote:
> On 30/04/2024 14:55, Will Deacon wrote:
> > But it might be worth splitting the tests from the fix to make backporting
> > easier.
> 
> Yes good point. I'll leave this hanging for today to see if any more comments
> come in, and will re-post tomorrow as 2 patches. I assume we need to go fast to
> catch 6.9.

Yes, I'll pick it up for 6.9. I can drop the tests from the patch (and
their mention in the log) and you can post your tests separately to go
via Andrew's tree.
Ryan Roberts April 30, 2024, 4:25 p.m. UTC | #5
On 30/04/2024 17:23, Catalin Marinas wrote:
> On Tue, Apr 30, 2024 at 03:04:32PM +0100, Ryan Roberts wrote:
>> On 30/04/2024 14:55, Will Deacon wrote:
>>> But it might be worth splitting the tests from the fix to make backporting
>>> easier.
>>
>> Yes good point. I'll leave this hanging for today to see if any more comments
>> come in, and will re-post tomorrow as 2 patches. I assume we need to go fast to
>> catch 6.9.
> 
> Yes, I'll pick it up for 6.9. I can drop the tests from the patch (and
> their mention in the log) and you can post your tests separately to go
> via Andrew's tree.
> 

Yep that works for me! thanks.
Catalin Marinas April 30, 2024, 5:57 p.m. UTC | #6
On Tue, 30 Apr 2024 14:31:38 +0100, Ryan Roberts wrote:
> __split_huge_pmd_locked() can be called for a present THP, devmap or
> (non-present) migration entry. It calls pmdp_invalidate()
> unconditionally on the pmdp and only determines if it is present or not
> based on the returned old pmd.
> 
> But arm64's pmd_mkinvalid(), called by pmdp_invalidate(),
> unconditionally sets the PMD_PRESENT_INVALID flag, which causes future
> pmd_present() calls to return true - even for a swap pmd. Therefore any
> lockless pgtable walker could see the migration entry pmd in this state
> and start interpretting the fields (e.g. pmd_pfn()) as if it were
> present, leading to BadThings (TM). GUP-fast appears to be one such
> lockless pgtable walker.
> 
> [...]

Applied to arm64 (for-next/fixes), thanks! It should land in 6.9-rc7. I
removed the debug/test code, please send it as a separate patch for
6.10.

[1/1] arm64/mm: pmd_mkinvalid() must handle swap pmds
      https://git.kernel.org/arm64/c/e783331c7720
Ryan Roberts May 1, 2024, 8:05 a.m. UTC | #7
On 30/04/2024 18:57, Catalin Marinas wrote:
> On Tue, 30 Apr 2024 14:31:38 +0100, Ryan Roberts wrote:
>> __split_huge_pmd_locked() can be called for a present THP, devmap or
>> (non-present) migration entry. It calls pmdp_invalidate()
>> unconditionally on the pmdp and only determines if it is present or not
>> based on the returned old pmd.
>>
>> But arm64's pmd_mkinvalid(), called by pmdp_invalidate(),
>> unconditionally sets the PMD_PRESENT_INVALID flag, which causes future
>> pmd_present() calls to return true - even for a swap pmd. Therefore any
>> lockless pgtable walker could see the migration entry pmd in this state
>> and start interpretting the fields (e.g. pmd_pfn()) as if it were
>> present, leading to BadThings (TM). GUP-fast appears to be one such
>> lockless pgtable walker.
>>
>> [...]
> 
> Applied to arm64 (for-next/fixes), thanks! It should land in 6.9-rc7. I
> removed the debug/test code, please send it as a separate patch for
> 6.10.

Thanks Catalin! I'm guessing this will turn up in today's linux-next, so if I
send the tests today and Andrew puts them straight in mm-unstable (which will
goto linux-next) there is no risk that the tests are there without the fix? Or
do I need to hold off until the fix is in v6.9-rc7?

> 
> [1/1] arm64/mm: pmd_mkinvalid() must handle swap pmds
>       https://git.kernel.org/arm64/c/e783331c7720
>
Catalin Marinas May 1, 2024, 10:04 a.m. UTC | #8
On Wed, May 01, 2024 at 09:05:17AM +0100, Ryan Roberts wrote:
> On 30/04/2024 18:57, Catalin Marinas wrote:
> > On Tue, 30 Apr 2024 14:31:38 +0100, Ryan Roberts wrote:
> >> __split_huge_pmd_locked() can be called for a present THP, devmap or
> >> (non-present) migration entry. It calls pmdp_invalidate()
> >> unconditionally on the pmdp and only determines if it is present or not
> >> based on the returned old pmd.
> >>
> >> But arm64's pmd_mkinvalid(), called by pmdp_invalidate(),
> >> unconditionally sets the PMD_PRESENT_INVALID flag, which causes future
> >> pmd_present() calls to return true - even for a swap pmd. Therefore any
> >> lockless pgtable walker could see the migration entry pmd in this state
> >> and start interpretting the fields (e.g. pmd_pfn()) as if it were
> >> present, leading to BadThings (TM). GUP-fast appears to be one such
> >> lockless pgtable walker.
> >>
> >> [...]
> > 
> > Applied to arm64 (for-next/fixes), thanks! It should land in 6.9-rc7. I
> > removed the debug/test code, please send it as a separate patch for
> > 6.10.
> 
> Thanks Catalin! I'm guessing this will turn up in today's linux-next, so if I
> send the tests today and Andrew puts them straight in mm-unstable (which will
> goto linux-next) there is no risk that the tests are there without the fix? Or
> do I need to hold off until the fix is in v6.9-rc7?

It looks like we don't push for-next/fixes to linux-next, it's
short-lived usually, it ends up upstream quickly. I can send the pull
request later today, should turn up in mainline by tomorrow. You can add
a note to your patch for Andrew that it will fail on arm64 until the fix
ends up upstream. It's a matter of a couple of days anyway.
Ryan Roberts May 1, 2024, 10:13 a.m. UTC | #9
On 01/05/2024 11:04, Catalin Marinas wrote:
> On Wed, May 01, 2024 at 09:05:17AM +0100, Ryan Roberts wrote:
>> On 30/04/2024 18:57, Catalin Marinas wrote:
>>> On Tue, 30 Apr 2024 14:31:38 +0100, Ryan Roberts wrote:
>>>> __split_huge_pmd_locked() can be called for a present THP, devmap or
>>>> (non-present) migration entry. It calls pmdp_invalidate()
>>>> unconditionally on the pmdp and only determines if it is present or not
>>>> based on the returned old pmd.
>>>>
>>>> But arm64's pmd_mkinvalid(), called by pmdp_invalidate(),
>>>> unconditionally sets the PMD_PRESENT_INVALID flag, which causes future
>>>> pmd_present() calls to return true - even for a swap pmd. Therefore any
>>>> lockless pgtable walker could see the migration entry pmd in this state
>>>> and start interpretting the fields (e.g. pmd_pfn()) as if it were
>>>> present, leading to BadThings (TM). GUP-fast appears to be one such
>>>> lockless pgtable walker.
>>>>
>>>> [...]
>>>
>>> Applied to arm64 (for-next/fixes), thanks! It should land in 6.9-rc7. I
>>> removed the debug/test code, please send it as a separate patch for
>>> 6.10.
>>
>> Thanks Catalin! I'm guessing this will turn up in today's linux-next, so if I
>> send the tests today and Andrew puts them straight in mm-unstable (which will
>> goto linux-next) there is no risk that the tests are there without the fix? Or
>> do I need to hold off until the fix is in v6.9-rc7?
> 
> It looks like we don't push for-next/fixes to linux-next, it's
> short-lived usually, it ends up upstream quickly. I can send the pull
> request later today, should turn up in mainline by tomorrow. You can add
> a note to your patch for Andrew that it will fail on arm64 until the fix
> ends up upstream. It's a matter of a couple of days anyway.

OK, I just didn't want to send it only for our CI to start failing. I'll do as
you suggest.
Ryan Roberts May 1, 2024, 11:35 a.m. UTC | #10
Zi Yan, I'm hoping you might have some input on the below...


On 30/04/2024 14:31, Ryan Roberts wrote:
> __split_huge_pmd_locked() can be called for a present THP, devmap or
> (non-present) migration entry. It calls pmdp_invalidate()
> unconditionally on the pmdp and only determines if it is present or not
> based on the returned old pmd.
> 
> But arm64's pmd_mkinvalid(), called by pmdp_invalidate(),
> unconditionally sets the PMD_PRESENT_INVALID flag, which causes future
> pmd_present() calls to return true - even for a swap pmd. Therefore any
> lockless pgtable walker could see the migration entry pmd in this state
> and start interpretting the fields (e.g. pmd_pfn()) as if it were
> present, leading to BadThings (TM). GUP-fast appears to be one such
> lockless pgtable walker.
> 
> While the obvious fix is for core-mm to avoid such calls for non-present
> pmds (pmdp_invalidate() will also issue TLBI which is not necessary for
> this case either), all other arches that implement pmd_mkinvalid() do it
> in such a way that it is robust to being called with a non-present pmd.

OK the plot thickens; The tests I wrote to check that pmd_mkinvalid() is safe for swap entries fails on x86_64. See below...

> So it is simpler and safer to make arm64 robust too. This approach means
> we can even add tests to debug_vm_pgtable.c to validate the required
> behaviour.
> 
> This is a theoretical bug found during code review. I don't have any
> test case to trigger it in practice.
> 
> Cc: stable@vger.kernel.org
> Fixes: 53fa117bb33c ("arm64/mm: Enable THP migration")
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> 
> Hi all,
> 
> v1 of this fix [1] took the approach of fixing core-mm to never call
> pmdp_invalidate() on a non-present pmd. But Zi Yan highlighted that only arm64
> suffers this problem; all other arches are robust. So his suggestion was to
> instead make arm64 robust in the same way and add tests to validate it. Despite
> my stated reservations in the context of the v1 discussion, having thought on it
> for a bit, I now agree with Zi Yan. Hence this post.
> 
> Andrew has v1 in mm-unstable at the moment, so probably the best thing to do is
> remove it from there and have this go in through the arm64 tree? Assuming there
> is agreement that this approach is right one.
> 
> This applies on top of v6.9-rc5. Passes all the mm selftests on arm64.
> 
> [1] https://lore.kernel.org/linux-mm/20240425170704.3379492-1-ryan.roberts@arm.com/
> 
> Thanks,
> Ryan
> 
> 
>  arch/arm64/include/asm/pgtable.h | 12 +++++--
>  mm/debug_vm_pgtable.c            | 61 ++++++++++++++++++++++++++++++++
>  2 files changed, 71 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index afdd56d26ad7..7d580271a46d 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -511,8 +511,16 @@ static inline int pmd_trans_huge(pmd_t pmd)
> 
>  static inline pmd_t pmd_mkinvalid(pmd_t pmd)
>  {
> -	pmd = set_pmd_bit(pmd, __pgprot(PMD_PRESENT_INVALID));
> -	pmd = clear_pmd_bit(pmd, __pgprot(PMD_SECT_VALID));
> +	/*
> +	 * If not valid then either we are already present-invalid or we are
> +	 * not-present (i.e. none or swap entry). We must not convert
> +	 * not-present to present-invalid. Unbelievably, the core-mm may call
> +	 * pmd_mkinvalid() for a swap entry and all other arches can handle it.
> +	 */
> +	if (pmd_valid(pmd)) {
> +		pmd = set_pmd_bit(pmd, __pgprot(PMD_PRESENT_INVALID));
> +		pmd = clear_pmd_bit(pmd, __pgprot(PMD_SECT_VALID));
> +	}
> 
>  	return pmd;
>  }
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index 65c19025da3d..7e9c387d06b0 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -956,6 +956,65 @@ static void __init hugetlb_basic_tests(struct pgtable_debug_args *args) { }
>  #endif /* CONFIG_HUGETLB_PAGE */
> 
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#if !defined(__HAVE_ARCH_PMDP_INVALIDATE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATION)
> +static void __init swp_pmd_mkinvalid_tests(struct pgtable_debug_args *args)
> +{

Printing various values at different locations in this function for debug:

> +	unsigned long max_swap_offset;
> +	swp_entry_t swp_set, swp_clear, swp_convert;
> +	pmd_t pmd_set, pmd_clear;
> +
> +	/*
> +	 * See generic_max_swapfile_size(): probe the maximum offset, then
> +	 * create swap entry will all possible bits set and a swap entry will
> +	 * all bits clear.
> +	 */
> +	max_swap_offset = swp_offset(pmd_to_swp_entry(swp_entry_to_pmd(swp_entry(0, ~0UL))));
> +	swp_set = swp_entry((1 << MAX_SWAPFILES_SHIFT) - 1, max_swap_offset);
> +	swp_clear = swp_entry(0, 0);
> +
> +	/* Convert to pmd. */
> +	pmd_set = swp_entry_to_pmd(swp_set);
> +	pmd_clear = swp_entry_to_pmd(swp_clear);

[    0.702163] debug_vm_pgtable: [swp_pmd_mkinvalid_tests  ]: valid: pmd_set=f800000000000000, pmd_clear=7fffffffffffe00

> +
> +	/*
> +	 * Sanity check that the pmds are not-present, not-huge and swap entry
> +	 * is recoverable without corruption.
> +	 */
> +	WARN_ON(pmd_present(pmd_set));
> +	WARN_ON(pmd_trans_huge(pmd_set));
> +	swp_convert = pmd_to_swp_entry(pmd_set);
> +	WARN_ON(swp_type(swp_set) != swp_type(swp_convert));
> +	WARN_ON(swp_offset(swp_set) != swp_offset(swp_convert));
> +	WARN_ON(pmd_present(pmd_clear));
> +	WARN_ON(pmd_trans_huge(pmd_clear));
> +	swp_convert = pmd_to_swp_entry(pmd_clear);
> +	WARN_ON(swp_type(swp_clear) != swp_type(swp_convert));
> +	WARN_ON(swp_offset(swp_clear) != swp_offset(swp_convert));
> +
> +	/* Now invalidate the pmd. */
> +	pmd_set = pmd_mkinvalid(pmd_set);
> +	pmd_clear = pmd_mkinvalid(pmd_clear);

[    0.704452] debug_vm_pgtable: [swp_pmd_mkinvalid_tests  ]: invalid: pmd_set=f800000000000000, pmd_clear=7ffffffffe00e00

> +
> +	/*
> +	 * Since its a swap pmd, invalidation should effectively be a noop and
> +	 * the checks we already did should give the same answer. Check the
> +	 * invalidation didn't corrupt any fields.
> +	 */
> +	WARN_ON(pmd_present(pmd_set));
> +	WARN_ON(pmd_trans_huge(pmd_set));
> +	swp_convert = pmd_to_swp_entry(pmd_set);

[    0.706461] debug_vm_pgtable: [swp_pmd_mkinvalid_tests  ]: set: swp=7c03ffffffffffff (1f, 3ffffffffffff), convert=7c03ffffffffffff (1f, 3ffffffffffff)

> +	WARN_ON(swp_type(swp_set) != swp_type(swp_convert));
> +	WARN_ON(swp_offset(swp_set) != swp_offset(swp_convert));
> +	WARN_ON(pmd_present(pmd_clear));
> +	WARN_ON(pmd_trans_huge(pmd_clear));
> +	swp_convert = pmd_to_swp_entry(pmd_clear);

[    0.708841] debug_vm_pgtable: [swp_pmd_mkinvalid_tests  ]: clear: swp=0 (0, 0), convert=ff8 (0, ff8)

> +	WARN_ON(swp_type(swp_clear) != swp_type(swp_convert));
> +	WARN_ON(swp_offset(swp_clear) != swp_offset(swp_convert));

This line fails on x86_64.

The logs show that the offset is indeed being corrupted by pmd_mkinvalid(); 0 -> 0xff8.

I think this is due to x86's pmd_mkinvalid() assuming the pmd is present; pmd_flags() and pmd_pfn() do all sorts of weird and wonderful things.

So does this take us full circle? Are we now back to modifying the core-mm to never call pmd_mkinvalid() on a non-present entry? If so, then I guess we should remove the arm64 fix from for-next/fixes.

> +}
> +#else
> +static void __init swp_pmd_mkinvalid_tests(struct pgtable_debug_args *args) { }
> +#endif /* !__HAVE_ARCH_PMDP_INVALIDATE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
> +
>  static void __init pmd_thp_tests(struct pgtable_debug_args *args)
>  {
>  	pmd_t pmd;
> @@ -982,6 +1041,8 @@ static void __init pmd_thp_tests(struct pgtable_debug_args *args)
>  	WARN_ON(!pmd_trans_huge(pmd_mkinvalid(pmd_mkhuge(pmd))));
>  	WARN_ON(!pmd_present(pmd_mkinvalid(pmd_mkhuge(pmd))));
>  #endif /* __HAVE_ARCH_PMDP_INVALIDATE */
> +
> +	swp_pmd_mkinvalid_tests(args);
>  }
> 
>  #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> --
> 2.25.1
>
Ryan Roberts May 1, 2024, 11:38 a.m. UTC | #11
Pulling in David, who may be able to advise...


On 01/05/2024 12:35, Ryan Roberts wrote:
> Zi Yan, I'm hoping you might have some input on the below...
> 
> 
> On 30/04/2024 14:31, Ryan Roberts wrote:
>> __split_huge_pmd_locked() can be called for a present THP, devmap or
>> (non-present) migration entry. It calls pmdp_invalidate()
>> unconditionally on the pmdp and only determines if it is present or not
>> based on the returned old pmd.
>>
>> But arm64's pmd_mkinvalid(), called by pmdp_invalidate(),
>> unconditionally sets the PMD_PRESENT_INVALID flag, which causes future
>> pmd_present() calls to return true - even for a swap pmd. Therefore any
>> lockless pgtable walker could see the migration entry pmd in this state
>> and start interpretting the fields (e.g. pmd_pfn()) as if it were
>> present, leading to BadThings (TM). GUP-fast appears to be one such
>> lockless pgtable walker.
>>
>> While the obvious fix is for core-mm to avoid such calls for non-present
>> pmds (pmdp_invalidate() will also issue TLBI which is not necessary for
>> this case either), all other arches that implement pmd_mkinvalid() do it
>> in such a way that it is robust to being called with a non-present pmd.
> 
> OK the plot thickens; The tests I wrote to check that pmd_mkinvalid() is safe for swap entries fails on x86_64. See below...
> 
>> So it is simpler and safer to make arm64 robust too. This approach means
>> we can even add tests to debug_vm_pgtable.c to validate the required
>> behaviour.
>>
>> This is a theoretical bug found during code review. I don't have any
>> test case to trigger it in practice.
>>
>> Cc: stable@vger.kernel.org
>> Fixes: 53fa117bb33c ("arm64/mm: Enable THP migration")
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>
>> Hi all,
>>
>> v1 of this fix [1] took the approach of fixing core-mm to never call
>> pmdp_invalidate() on a non-present pmd. But Zi Yan highlighted that only arm64
>> suffers this problem; all other arches are robust. So his suggestion was to
>> instead make arm64 robust in the same way and add tests to validate it. Despite
>> my stated reservations in the context of the v1 discussion, having thought on it
>> for a bit, I now agree with Zi Yan. Hence this post.
>>
>> Andrew has v1 in mm-unstable at the moment, so probably the best thing to do is
>> remove it from there and have this go in through the arm64 tree? Assuming there
>> is agreement that this approach is right one.
>>
>> This applies on top of v6.9-rc5. Passes all the mm selftests on arm64.
>>
>> [1] https://lore.kernel.org/linux-mm/20240425170704.3379492-1-ryan.roberts@arm.com/
>>
>> Thanks,
>> Ryan
>>
>>
>>  arch/arm64/include/asm/pgtable.h | 12 +++++--
>>  mm/debug_vm_pgtable.c            | 61 ++++++++++++++++++++++++++++++++
>>  2 files changed, 71 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index afdd56d26ad7..7d580271a46d 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -511,8 +511,16 @@ static inline int pmd_trans_huge(pmd_t pmd)
>>
>>  static inline pmd_t pmd_mkinvalid(pmd_t pmd)
>>  {
>> -	pmd = set_pmd_bit(pmd, __pgprot(PMD_PRESENT_INVALID));
>> -	pmd = clear_pmd_bit(pmd, __pgprot(PMD_SECT_VALID));
>> +	/*
>> +	 * If not valid then either we are already present-invalid or we are
>> +	 * not-present (i.e. none or swap entry). We must not convert
>> +	 * not-present to present-invalid. Unbelievably, the core-mm may call
>> +	 * pmd_mkinvalid() for a swap entry and all other arches can handle it.
>> +	 */
>> +	if (pmd_valid(pmd)) {
>> +		pmd = set_pmd_bit(pmd, __pgprot(PMD_PRESENT_INVALID));
>> +		pmd = clear_pmd_bit(pmd, __pgprot(PMD_SECT_VALID));
>> +	}
>>
>>  	return pmd;
>>  }
>> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
>> index 65c19025da3d..7e9c387d06b0 100644
>> --- a/mm/debug_vm_pgtable.c
>> +++ b/mm/debug_vm_pgtable.c
>> @@ -956,6 +956,65 @@ static void __init hugetlb_basic_tests(struct pgtable_debug_args *args) { }
>>  #endif /* CONFIG_HUGETLB_PAGE */
>>
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +#if !defined(__HAVE_ARCH_PMDP_INVALIDATE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATION)
>> +static void __init swp_pmd_mkinvalid_tests(struct pgtable_debug_args *args)
>> +{
> 
> Printing various values at different locations in this function for debug:
> 
>> +	unsigned long max_swap_offset;
>> +	swp_entry_t swp_set, swp_clear, swp_convert;
>> +	pmd_t pmd_set, pmd_clear;
>> +
>> +	/*
>> +	 * See generic_max_swapfile_size(): probe the maximum offset, then
>> +	 * create swap entry will all possible bits set and a swap entry will
>> +	 * all bits clear.
>> +	 */
>> +	max_swap_offset = swp_offset(pmd_to_swp_entry(swp_entry_to_pmd(swp_entry(0, ~0UL))));
>> +	swp_set = swp_entry((1 << MAX_SWAPFILES_SHIFT) - 1, max_swap_offset);
>> +	swp_clear = swp_entry(0, 0);
>> +
>> +	/* Convert to pmd. */
>> +	pmd_set = swp_entry_to_pmd(swp_set);
>> +	pmd_clear = swp_entry_to_pmd(swp_clear);
> 
> [    0.702163] debug_vm_pgtable: [swp_pmd_mkinvalid_tests  ]: valid: pmd_set=f800000000000000, pmd_clear=7fffffffffffe00
> 
>> +
>> +	/*
>> +	 * Sanity check that the pmds are not-present, not-huge and swap entry
>> +	 * is recoverable without corruption.
>> +	 */
>> +	WARN_ON(pmd_present(pmd_set));
>> +	WARN_ON(pmd_trans_huge(pmd_set));
>> +	swp_convert = pmd_to_swp_entry(pmd_set);
>> +	WARN_ON(swp_type(swp_set) != swp_type(swp_convert));
>> +	WARN_ON(swp_offset(swp_set) != swp_offset(swp_convert));
>> +	WARN_ON(pmd_present(pmd_clear));
>> +	WARN_ON(pmd_trans_huge(pmd_clear));
>> +	swp_convert = pmd_to_swp_entry(pmd_clear);
>> +	WARN_ON(swp_type(swp_clear) != swp_type(swp_convert));
>> +	WARN_ON(swp_offset(swp_clear) != swp_offset(swp_convert));
>> +
>> +	/* Now invalidate the pmd. */
>> +	pmd_set = pmd_mkinvalid(pmd_set);
>> +	pmd_clear = pmd_mkinvalid(pmd_clear);
> 
> [    0.704452] debug_vm_pgtable: [swp_pmd_mkinvalid_tests  ]: invalid: pmd_set=f800000000000000, pmd_clear=7ffffffffe00e00
> 
>> +
>> +	/*
>> +	 * Since its a swap pmd, invalidation should effectively be a noop and
>> +	 * the checks we already did should give the same answer. Check the
>> +	 * invalidation didn't corrupt any fields.
>> +	 */
>> +	WARN_ON(pmd_present(pmd_set));
>> +	WARN_ON(pmd_trans_huge(pmd_set));
>> +	swp_convert = pmd_to_swp_entry(pmd_set);
> 
> [    0.706461] debug_vm_pgtable: [swp_pmd_mkinvalid_tests  ]: set: swp=7c03ffffffffffff (1f, 3ffffffffffff), convert=7c03ffffffffffff (1f, 3ffffffffffff)
> 
>> +	WARN_ON(swp_type(swp_set) != swp_type(swp_convert));
>> +	WARN_ON(swp_offset(swp_set) != swp_offset(swp_convert));
>> +	WARN_ON(pmd_present(pmd_clear));
>> +	WARN_ON(pmd_trans_huge(pmd_clear));
>> +	swp_convert = pmd_to_swp_entry(pmd_clear);
> 
> [    0.708841] debug_vm_pgtable: [swp_pmd_mkinvalid_tests  ]: clear: swp=0 (0, 0), convert=ff8 (0, ff8)
> 
>> +	WARN_ON(swp_type(swp_clear) != swp_type(swp_convert));
>> +	WARN_ON(swp_offset(swp_clear) != swp_offset(swp_convert));
> 
> This line fails on x86_64.
> 
> The logs show that the offset is indeed being corrupted by pmd_mkinvalid(); 0 -> 0xff8.
> 
> I think this is due to x86's pmd_mkinvalid() assuming the pmd is present; pmd_flags() and pmd_pfn() do all sorts of weird and wonderful things.
> 
> So does this take us full circle? Are we now back to modifying the core-mm to never call pmd_mkinvalid() on a non-present entry? If so, then I guess we should remove the arm64 fix from for-next/fixes.
> 
>> +}
>> +#else
>> +static void __init swp_pmd_mkinvalid_tests(struct pgtable_debug_args *args) { }
>> +#endif /* !__HAVE_ARCH_PMDP_INVALIDATE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
>> +
>>  static void __init pmd_thp_tests(struct pgtable_debug_args *args)
>>  {
>>  	pmd_t pmd;
>> @@ -982,6 +1041,8 @@ static void __init pmd_thp_tests(struct pgtable_debug_args *args)
>>  	WARN_ON(!pmd_trans_huge(pmd_mkinvalid(pmd_mkhuge(pmd))));
>>  	WARN_ON(!pmd_present(pmd_mkinvalid(pmd_mkhuge(pmd))));
>>  #endif /* __HAVE_ARCH_PMDP_INVALIDATE */
>> +
>> +	swp_pmd_mkinvalid_tests(args);
>>  }
>>
>>  #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> --
>> 2.25.1
>>
>
Zi Yan May 1, 2024, 12:07 p.m. UTC | #12
On 1 May 2024, at 7:38, Ryan Roberts wrote:

> Pulling in David, who may be able to advise...
>
>
> On 01/05/2024 12:35, Ryan Roberts wrote:
>> Zi Yan, I'm hoping you might have some input on the below...
>>
>>
>> On 30/04/2024 14:31, Ryan Roberts wrote:
>>> __split_huge_pmd_locked() can be called for a present THP, devmap or
>>> (non-present) migration entry. It calls pmdp_invalidate()
>>> unconditionally on the pmdp and only determines if it is present or not
>>> based on the returned old pmd.
>>>
>>> But arm64's pmd_mkinvalid(), called by pmdp_invalidate(),
>>> unconditionally sets the PMD_PRESENT_INVALID flag, which causes future
>>> pmd_present() calls to return true - even for a swap pmd. Therefore any
>>> lockless pgtable walker could see the migration entry pmd in this state
>>> and start interpretting the fields (e.g. pmd_pfn()) as if it were
>>> present, leading to BadThings (TM). GUP-fast appears to be one such
>>> lockless pgtable walker.
>>>
>>> While the obvious fix is for core-mm to avoid such calls for non-present
>>> pmds (pmdp_invalidate() will also issue TLBI which is not necessary for
>>> this case either), all other arches that implement pmd_mkinvalid() do it
>>> in such a way that it is robust to being called with a non-present pmd.
>>
>> OK the plot thickens; The tests I wrote to check that pmd_mkinvalid() is safe for swap entries fails on x86_64. See below...
>>
>>> So it is simpler and safer to make arm64 robust too. This approach means
>>> we can even add tests to debug_vm_pgtable.c to validate the required
>>> behaviour.
>>>
>>> This is a theoretical bug found during code review. I don't have any
>>> test case to trigger it in practice.
>>>
>>> Cc: stable@vger.kernel.org
>>> Fixes: 53fa117bb33c ("arm64/mm: Enable THP migration")
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>
>>> Hi all,
>>>
>>> v1 of this fix [1] took the approach of fixing core-mm to never call
>>> pmdp_invalidate() on a non-present pmd. But Zi Yan highlighted that only arm64
>>> suffers this problem; all other arches are robust. So his suggestion was to
>>> instead make arm64 robust in the same way and add tests to validate it. Despite
>>> my stated reservations in the context of the v1 discussion, having thought on it
>>> for a bit, I now agree with Zi Yan. Hence this post.
>>>
>>> Andrew has v1 in mm-unstable at the moment, so probably the best thing to do is
>>> remove it from there and have this go in through the arm64 tree? Assuming there
>>> is agreement that this approach is right one.
>>>
>>> This applies on top of v6.9-rc5. Passes all the mm selftests on arm64.
>>>
>>> [1] https://lore.kernel.org/linux-mm/20240425170704.3379492-1-ryan.roberts@arm.com/
>>>
>>> Thanks,
>>> Ryan
>>>
>>>
>>>  arch/arm64/include/asm/pgtable.h | 12 +++++--
>>>  mm/debug_vm_pgtable.c            | 61 ++++++++++++++++++++++++++++++++
>>>  2 files changed, 71 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index afdd56d26ad7..7d580271a46d 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -511,8 +511,16 @@ static inline int pmd_trans_huge(pmd_t pmd)
>>>
>>>  static inline pmd_t pmd_mkinvalid(pmd_t pmd)
>>>  {
>>> -	pmd = set_pmd_bit(pmd, __pgprot(PMD_PRESENT_INVALID));
>>> -	pmd = clear_pmd_bit(pmd, __pgprot(PMD_SECT_VALID));
>>> +	/*
>>> +	 * If not valid then either we are already present-invalid or we are
>>> +	 * not-present (i.e. none or swap entry). We must not convert
>>> +	 * not-present to present-invalid. Unbelievably, the core-mm may call
>>> +	 * pmd_mkinvalid() for a swap entry and all other arches can handle it.
>>> +	 */
>>> +	if (pmd_valid(pmd)) {
>>> +		pmd = set_pmd_bit(pmd, __pgprot(PMD_PRESENT_INVALID));
>>> +		pmd = clear_pmd_bit(pmd, __pgprot(PMD_SECT_VALID));
>>> +	}
>>>
>>>  	return pmd;
>>>  }
>>> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
>>> index 65c19025da3d..7e9c387d06b0 100644
>>> --- a/mm/debug_vm_pgtable.c
>>> +++ b/mm/debug_vm_pgtable.c
>>> @@ -956,6 +956,65 @@ static void __init hugetlb_basic_tests(struct pgtable_debug_args *args) { }
>>>  #endif /* CONFIG_HUGETLB_PAGE */
>>>
>>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> +#if !defined(__HAVE_ARCH_PMDP_INVALIDATE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATION)
>>> +static void __init swp_pmd_mkinvalid_tests(struct pgtable_debug_args *args)
>>> +{
>>
>> Printing various values at different locations in this function for debug:
>>
>>> +	unsigned long max_swap_offset;
>>> +	swp_entry_t swp_set, swp_clear, swp_convert;
>>> +	pmd_t pmd_set, pmd_clear;
>>> +
>>> +	/*
>>> +	 * See generic_max_swapfile_size(): probe the maximum offset, then
>>> +	 * create swap entry will all possible bits set and a swap entry will
>>> +	 * all bits clear.
>>> +	 */
>>> +	max_swap_offset = swp_offset(pmd_to_swp_entry(swp_entry_to_pmd(swp_entry(0, ~0UL))));
>>> +	swp_set = swp_entry((1 << MAX_SWAPFILES_SHIFT) - 1, max_swap_offset);
>>> +	swp_clear = swp_entry(0, 0);
>>> +
>>> +	/* Convert to pmd. */
>>> +	pmd_set = swp_entry_to_pmd(swp_set);
>>> +	pmd_clear = swp_entry_to_pmd(swp_clear);
>>
>> [    0.702163] debug_vm_pgtable: [swp_pmd_mkinvalid_tests  ]: valid: pmd_set=f800000000000000, pmd_clear=7fffffffffffe00
>>
>>> +
>>> +	/*
>>> +	 * Sanity check that the pmds are not-present, not-huge and swap entry
>>> +	 * is recoverable without corruption.
>>> +	 */
>>> +	WARN_ON(pmd_present(pmd_set));
>>> +	WARN_ON(pmd_trans_huge(pmd_set));
>>> +	swp_convert = pmd_to_swp_entry(pmd_set);
>>> +	WARN_ON(swp_type(swp_set) != swp_type(swp_convert));
>>> +	WARN_ON(swp_offset(swp_set) != swp_offset(swp_convert));
>>> +	WARN_ON(pmd_present(pmd_clear));
>>> +	WARN_ON(pmd_trans_huge(pmd_clear));
>>> +	swp_convert = pmd_to_swp_entry(pmd_clear);
>>> +	WARN_ON(swp_type(swp_clear) != swp_type(swp_convert));
>>> +	WARN_ON(swp_offset(swp_clear) != swp_offset(swp_convert));
>>> +
>>> +	/* Now invalidate the pmd. */
>>> +	pmd_set = pmd_mkinvalid(pmd_set);
>>> +	pmd_clear = pmd_mkinvalid(pmd_clear);
>>
>> [    0.704452] debug_vm_pgtable: [swp_pmd_mkinvalid_tests  ]: invalid: pmd_set=f800000000000000, pmd_clear=7ffffffffe00e00
>>
>>> +
>>> +	/*
>>> +	 * Since its a swap pmd, invalidation should effectively be a noop and
>>> +	 * the checks we already did should give the same answer. Check the
>>> +	 * invalidation didn't corrupt any fields.
>>> +	 */
>>> +	WARN_ON(pmd_present(pmd_set));
>>> +	WARN_ON(pmd_trans_huge(pmd_set));
>>> +	swp_convert = pmd_to_swp_entry(pmd_set);
>>
>> [    0.706461] debug_vm_pgtable: [swp_pmd_mkinvalid_tests  ]: set: swp=7c03ffffffffffff (1f, 3ffffffffffff), convert=7c03ffffffffffff (1f, 3ffffffffffff)
>>
>>> +	WARN_ON(swp_type(swp_set) != swp_type(swp_convert));
>>> +	WARN_ON(swp_offset(swp_set) != swp_offset(swp_convert));
>>> +	WARN_ON(pmd_present(pmd_clear));
>>> +	WARN_ON(pmd_trans_huge(pmd_clear));
>>> +	swp_convert = pmd_to_swp_entry(pmd_clear);
>>
>> [    0.708841] debug_vm_pgtable: [swp_pmd_mkinvalid_tests  ]: clear: swp=0 (0, 0), convert=ff8 (0, ff8)
>>
>>> +	WARN_ON(swp_type(swp_clear) != swp_type(swp_convert));
>>> +	WARN_ON(swp_offset(swp_clear) != swp_offset(swp_convert));
>>
>> This line fails on x86_64.
>>
>> The logs show that the offset is indeed being corrupted by pmd_mkinvalid(); 0 -> 0xff8.
>>
>> I think this is due to x86's pmd_mkinvalid() assuming the pmd is present; pmd_flags() and pmd_pfn() do all sorts of weird and wonderful things.
>>
>> So does this take us full circle? Are we now back to modifying the core-mm to never call pmd_mkinvalid() on a non-present entry? If so, then I guess we should remove the arm64 fix from for-next/fixes.

If x86_64's pmd_mkinvalid() also corrupts swap entries, yes, your original fix
is better. I will dig into the x86 code more to figure out what goes wrong.
Last time, I only checked PAGE_* bits in these pmd|pte_* operations.
Sorry for the misinformation.

>>
>>> +}
>>> +#else
>>> +static void __init swp_pmd_mkinvalid_tests(struct pgtable_debug_args *args) { }
>>> +#endif /* !__HAVE_ARCH_PMDP_INVALIDATE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
>>> +
>>>  static void __init pmd_thp_tests(struct pgtable_debug_args *args)
>>>  {
>>>  	pmd_t pmd;
>>> @@ -982,6 +1041,8 @@ static void __init pmd_thp_tests(struct pgtable_debug_args *args)
>>>  	WARN_ON(!pmd_trans_huge(pmd_mkinvalid(pmd_mkhuge(pmd))));
>>>  	WARN_ON(!pmd_present(pmd_mkinvalid(pmd_mkhuge(pmd))));
>>>  #endif /* __HAVE_ARCH_PMDP_INVALIDATE */
>>> +
>>> +	swp_pmd_mkinvalid_tests(args);
>>>  }
>>>
>>>  #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>> --
>>> 2.25.1
>>>
>>


--
Best Regards,
Yan, Zi
Ryan Roberts May 1, 2024, 12:58 p.m. UTC | #13
On 01/05/2024 13:07, Zi Yan wrote:
> On 1 May 2024, at 7:38, Ryan Roberts wrote:
> 
>> Pulling in David, who may be able to advise...
>>
>>
>> On 01/05/2024 12:35, Ryan Roberts wrote:
>>> Zi Yan, I'm hoping you might have some input on the below...
>>>
>>>
>>> On 30/04/2024 14:31, Ryan Roberts wrote:
>>>> __split_huge_pmd_locked() can be called for a present THP, devmap or
>>>> (non-present) migration entry. It calls pmdp_invalidate()
>>>> unconditionally on the pmdp and only determines if it is present or not
>>>> based on the returned old pmd.
>>>>
>>>> But arm64's pmd_mkinvalid(), called by pmdp_invalidate(),
>>>> unconditionally sets the PMD_PRESENT_INVALID flag, which causes future
>>>> pmd_present() calls to return true - even for a swap pmd. Therefore any
>>>> lockless pgtable walker could see the migration entry pmd in this state
>>>> and start interpretting the fields (e.g. pmd_pfn()) as if it were
>>>> present, leading to BadThings (TM). GUP-fast appears to be one such
>>>> lockless pgtable walker.
>>>>
>>>> While the obvious fix is for core-mm to avoid such calls for non-present
>>>> pmds (pmdp_invalidate() will also issue TLBI which is not necessary for
>>>> this case either), all other arches that implement pmd_mkinvalid() do it
>>>> in such a way that it is robust to being called with a non-present pmd.
>>>
>>> OK the plot thickens; The tests I wrote to check that pmd_mkinvalid() is safe for swap entries fails on x86_64. See below...
>>>
>>>> So it is simpler and safer to make arm64 robust too. This approach means
>>>> we can even add tests to debug_vm_pgtable.c to validate the required
>>>> behaviour.
>>>>
>>>> This is a theoretical bug found during code review. I don't have any
>>>> test case to trigger it in practice.
>>>>
>>>> Cc: stable@vger.kernel.org
>>>> Fixes: 53fa117bb33c ("arm64/mm: Enable THP migration")
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>
>>>> Hi all,
>>>>
>>>> v1 of this fix [1] took the approach of fixing core-mm to never call
>>>> pmdp_invalidate() on a non-present pmd. But Zi Yan highlighted that only arm64
>>>> suffers this problem; all other arches are robust. So his suggestion was to
>>>> instead make arm64 robust in the same way and add tests to validate it. Despite
>>>> my stated reservations in the context of the v1 discussion, having thought on it
>>>> for a bit, I now agree with Zi Yan. Hence this post.
>>>>
>>>> Andrew has v1 in mm-unstable at the moment, so probably the best thing to do is
>>>> remove it from there and have this go in through the arm64 tree? Assuming there
>>>> is agreement that this approach is right one.
>>>>
>>>> This applies on top of v6.9-rc5. Passes all the mm selftests on arm64.
>>>>
>>>> [1] https://lore.kernel.org/linux-mm/20240425170704.3379492-1-ryan.roberts@arm.com/
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>
>>>>  arch/arm64/include/asm/pgtable.h | 12 +++++--
>>>>  mm/debug_vm_pgtable.c            | 61 ++++++++++++++++++++++++++++++++
>>>>  2 files changed, 71 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>> index afdd56d26ad7..7d580271a46d 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -511,8 +511,16 @@ static inline int pmd_trans_huge(pmd_t pmd)
>>>>
>>>>  static inline pmd_t pmd_mkinvalid(pmd_t pmd)
>>>>  {
>>>> -	pmd = set_pmd_bit(pmd, __pgprot(PMD_PRESENT_INVALID));
>>>> -	pmd = clear_pmd_bit(pmd, __pgprot(PMD_SECT_VALID));
>>>> +	/*
>>>> +	 * If not valid then either we are already present-invalid or we are
>>>> +	 * not-present (i.e. none or swap entry). We must not convert
>>>> +	 * not-present to present-invalid. Unbelievably, the core-mm may call
>>>> +	 * pmd_mkinvalid() for a swap entry and all other arches can handle it.
>>>> +	 */
>>>> +	if (pmd_valid(pmd)) {
>>>> +		pmd = set_pmd_bit(pmd, __pgprot(PMD_PRESENT_INVALID));
>>>> +		pmd = clear_pmd_bit(pmd, __pgprot(PMD_SECT_VALID));
>>>> +	}
>>>>
>>>>  	return pmd;
>>>>  }
>>>> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
>>>> index 65c19025da3d..7e9c387d06b0 100644
>>>> --- a/mm/debug_vm_pgtable.c
>>>> +++ b/mm/debug_vm_pgtable.c
>>>> @@ -956,6 +956,65 @@ static void __init hugetlb_basic_tests(struct pgtable_debug_args *args) { }
>>>>  #endif /* CONFIG_HUGETLB_PAGE */
>>>>
>>>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>> +#if !defined(__HAVE_ARCH_PMDP_INVALIDATE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATION)
>>>> +static void __init swp_pmd_mkinvalid_tests(struct pgtable_debug_args *args)
>>>> +{
>>>
>>> Printing various values at different locations in this function for debug:
>>>
>>>> +	unsigned long max_swap_offset;
>>>> +	swp_entry_t swp_set, swp_clear, swp_convert;
>>>> +	pmd_t pmd_set, pmd_clear;
>>>> +
>>>> +	/*
>>>> +	 * See generic_max_swapfile_size(): probe the maximum offset, then
>>>> +	 * create swap entry will all possible bits set and a swap entry will
>>>> +	 * all bits clear.
>>>> +	 */
>>>> +	max_swap_offset = swp_offset(pmd_to_swp_entry(swp_entry_to_pmd(swp_entry(0, ~0UL))));
>>>> +	swp_set = swp_entry((1 << MAX_SWAPFILES_SHIFT) - 1, max_swap_offset);
>>>> +	swp_clear = swp_entry(0, 0);
>>>> +
>>>> +	/* Convert to pmd. */
>>>> +	pmd_set = swp_entry_to_pmd(swp_set);
>>>> +	pmd_clear = swp_entry_to_pmd(swp_clear);
>>>
>>> [    0.702163] debug_vm_pgtable: [swp_pmd_mkinvalid_tests  ]: valid: pmd_set=f800000000000000, pmd_clear=7fffffffffffe00
>>>
>>>> +
>>>> +	/*
>>>> +	 * Sanity check that the pmds are not-present, not-huge and swap entry
>>>> +	 * is recoverable without corruption.
>>>> +	 */
>>>> +	WARN_ON(pmd_present(pmd_set));
>>>> +	WARN_ON(pmd_trans_huge(pmd_set));
>>>> +	swp_convert = pmd_to_swp_entry(pmd_set);
>>>> +	WARN_ON(swp_type(swp_set) != swp_type(swp_convert));
>>>> +	WARN_ON(swp_offset(swp_set) != swp_offset(swp_convert));
>>>> +	WARN_ON(pmd_present(pmd_clear));
>>>> +	WARN_ON(pmd_trans_huge(pmd_clear));
>>>> +	swp_convert = pmd_to_swp_entry(pmd_clear);
>>>> +	WARN_ON(swp_type(swp_clear) != swp_type(swp_convert));
>>>> +	WARN_ON(swp_offset(swp_clear) != swp_offset(swp_convert));
>>>> +
>>>> +	/* Now invalidate the pmd. */
>>>> +	pmd_set = pmd_mkinvalid(pmd_set);
>>>> +	pmd_clear = pmd_mkinvalid(pmd_clear);
>>>
>>> [    0.704452] debug_vm_pgtable: [swp_pmd_mkinvalid_tests  ]: invalid: pmd_set=f800000000000000, pmd_clear=7ffffffffe00e00
>>>
>>>> +
>>>> +	/*
>>>> +	 * Since its a swap pmd, invalidation should effectively be a noop and
>>>> +	 * the checks we already did should give the same answer. Check the
>>>> +	 * invalidation didn't corrupt any fields.
>>>> +	 */
>>>> +	WARN_ON(pmd_present(pmd_set));
>>>> +	WARN_ON(pmd_trans_huge(pmd_set));
>>>> +	swp_convert = pmd_to_swp_entry(pmd_set);
>>>
>>> [    0.706461] debug_vm_pgtable: [swp_pmd_mkinvalid_tests  ]: set: swp=7c03ffffffffffff (1f, 3ffffffffffff), convert=7c03ffffffffffff (1f, 3ffffffffffff)
>>>
>>>> +	WARN_ON(swp_type(swp_set) != swp_type(swp_convert));
>>>> +	WARN_ON(swp_offset(swp_set) != swp_offset(swp_convert));
>>>> +	WARN_ON(pmd_present(pmd_clear));
>>>> +	WARN_ON(pmd_trans_huge(pmd_clear));
>>>> +	swp_convert = pmd_to_swp_entry(pmd_clear);
>>>
>>> [    0.708841] debug_vm_pgtable: [swp_pmd_mkinvalid_tests  ]: clear: swp=0 (0, 0), convert=ff8 (0, ff8)
>>>
>>>> +	WARN_ON(swp_type(swp_clear) != swp_type(swp_convert));
>>>> +	WARN_ON(swp_offset(swp_clear) != swp_offset(swp_convert));
>>>
>>> This line fails on x86_64.
>>>
>>> The logs show that the offset is indeed being corrupted by pmd_mkinvalid(); 0 -> 0xff8.
>>>
>>> I think this is due to x86's pmd_mkinvalid() assuming the pmd is present; pmd_flags() and pmd_pfn() do all sorts of weird and wonderful things.
>>>
>>> So does this take us full circle? Are we now back to modifying the core-mm to never call pmd_mkinvalid() on a non-present entry? If so, then I guess we should remove the arm64 fix from for-next/fixes.
> 
> If x86_64's pmd_mkinvalid() also corrupts swap entries, yes, your original fix
> is better. I will dig into the x86 code more to figure out what goes wrong.
> Last time, I only checked PAGE_* bits in these pmd|pte_* operations.
> Sorry for the misinformation.

No worries, I'll do the amends we originally agreed for the original fix and resend.

> 
>>>
>>>> +}
>>>> +#else
>>>> +static void __init swp_pmd_mkinvalid_tests(struct pgtable_debug_args *args) { }
>>>> +#endif /* !__HAVE_ARCH_PMDP_INVALIDATE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
>>>> +
>>>>  static void __init pmd_thp_tests(struct pgtable_debug_args *args)
>>>>  {
>>>>  	pmd_t pmd;
>>>> @@ -982,6 +1041,8 @@ static void __init pmd_thp_tests(struct pgtable_debug_args *args)
>>>>  	WARN_ON(!pmd_trans_huge(pmd_mkinvalid(pmd_mkhuge(pmd))));
>>>>  	WARN_ON(!pmd_present(pmd_mkinvalid(pmd_mkhuge(pmd))));
>>>>  #endif /* __HAVE_ARCH_PMDP_INVALIDATE */
>>>> +
>>>> +	swp_pmd_mkinvalid_tests(args);
>>>>  }
>>>>
>>>>  #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> --
>>>> 2.25.1
>>>>
>>>
> 
> 
> --
> Best Regards,
> Yan, Zi
Catalin Marinas May 2, 2024, 6 p.m. UTC | #14
On Tue, Apr 30, 2024 at 06:57:52PM +0100, Catalin Marinas wrote:
> On Tue, 30 Apr 2024 14:31:38 +0100, Ryan Roberts wrote:
> > __split_huge_pmd_locked() can be called for a present THP, devmap or
> > (non-present) migration entry. It calls pmdp_invalidate()
> > unconditionally on the pmdp and only determines if it is present or not
> > based on the returned old pmd.
> > 
> > But arm64's pmd_mkinvalid(), called by pmdp_invalidate(),
> > unconditionally sets the PMD_PRESENT_INVALID flag, which causes future
> > pmd_present() calls to return true - even for a swap pmd. Therefore any
> > lockless pgtable walker could see the migration entry pmd in this state
> > and start interpretting the fields (e.g. pmd_pfn()) as if it were
> > present, leading to BadThings (TM). GUP-fast appears to be one such
> > lockless pgtable walker.
> > 
> > [...]
> 
> Applied to arm64 (for-next/fixes), thanks! It should land in 6.9-rc7. I
> removed the debug/test code, please send it as a separate patch for
> 6.10.
> 
> [1/1] arm64/mm: pmd_mkinvalid() must handle swap pmds
>       https://git.kernel.org/arm64/c/e783331c7720

Since Andrew merged the generic mm fix, I dropped this patch from the
arm64 for-next/fixes branch.
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index afdd56d26ad7..7d580271a46d 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -511,8 +511,16 @@  static inline int pmd_trans_huge(pmd_t pmd)

 static inline pmd_t pmd_mkinvalid(pmd_t pmd)
 {
-	pmd = set_pmd_bit(pmd, __pgprot(PMD_PRESENT_INVALID));
-	pmd = clear_pmd_bit(pmd, __pgprot(PMD_SECT_VALID));
+	/*
+	 * If not valid then either we are already present-invalid or we are
+	 * not-present (i.e. none or swap entry). We must not convert
+	 * not-present to present-invalid. Unbelievably, the core-mm may call
+	 * pmd_mkinvalid() for a swap entry and all other arches can handle it.
+	 */
+	if (pmd_valid(pmd)) {
+		pmd = set_pmd_bit(pmd, __pgprot(PMD_PRESENT_INVALID));
+		pmd = clear_pmd_bit(pmd, __pgprot(PMD_SECT_VALID));
+	}

 	return pmd;
 }
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 65c19025da3d..7e9c387d06b0 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -956,6 +956,65 @@  static void __init hugetlb_basic_tests(struct pgtable_debug_args *args) { }
 #endif /* CONFIG_HUGETLB_PAGE */

 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if !defined(__HAVE_ARCH_PMDP_INVALIDATE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATION)
+static void __init swp_pmd_mkinvalid_tests(struct pgtable_debug_args *args)
+{
+	unsigned long max_swap_offset;
+	swp_entry_t swp_set, swp_clear, swp_convert;
+	pmd_t pmd_set, pmd_clear;
+
+	/*
+	 * See generic_max_swapfile_size(): probe the maximum offset, then
+	 * create swap entry will all possible bits set and a swap entry will
+	 * all bits clear.
+	 */
+	max_swap_offset = swp_offset(pmd_to_swp_entry(swp_entry_to_pmd(swp_entry(0, ~0UL))));
+	swp_set = swp_entry((1 << MAX_SWAPFILES_SHIFT) - 1, max_swap_offset);
+	swp_clear = swp_entry(0, 0);
+
+	/* Convert to pmd. */
+	pmd_set = swp_entry_to_pmd(swp_set);
+	pmd_clear = swp_entry_to_pmd(swp_clear);
+
+	/*
+	 * Sanity check that the pmds are not-present, not-huge and swap entry
+	 * is recoverable without corruption.
+	 */
+	WARN_ON(pmd_present(pmd_set));
+	WARN_ON(pmd_trans_huge(pmd_set));
+	swp_convert = pmd_to_swp_entry(pmd_set);
+	WARN_ON(swp_type(swp_set) != swp_type(swp_convert));
+	WARN_ON(swp_offset(swp_set) != swp_offset(swp_convert));
+	WARN_ON(pmd_present(pmd_clear));
+	WARN_ON(pmd_trans_huge(pmd_clear));
+	swp_convert = pmd_to_swp_entry(pmd_clear);
+	WARN_ON(swp_type(swp_clear) != swp_type(swp_convert));
+	WARN_ON(swp_offset(swp_clear) != swp_offset(swp_convert));
+
+	/* Now invalidate the pmd. */
+	pmd_set = pmd_mkinvalid(pmd_set);
+	pmd_clear = pmd_mkinvalid(pmd_clear);
+
+	/*
+	 * Since its a swap pmd, invalidation should effectively be a noop and
+	 * the checks we already did should give the same answer. Check the
+	 * invalidation didn't corrupt any fields.
+	 */
+	WARN_ON(pmd_present(pmd_set));
+	WARN_ON(pmd_trans_huge(pmd_set));
+	swp_convert = pmd_to_swp_entry(pmd_set);
+	WARN_ON(swp_type(swp_set) != swp_type(swp_convert));
+	WARN_ON(swp_offset(swp_set) != swp_offset(swp_convert));
+	WARN_ON(pmd_present(pmd_clear));
+	WARN_ON(pmd_trans_huge(pmd_clear));
+	swp_convert = pmd_to_swp_entry(pmd_clear);
+	WARN_ON(swp_type(swp_clear) != swp_type(swp_convert));
+	WARN_ON(swp_offset(swp_clear) != swp_offset(swp_convert));
+}
+#else
+static void __init swp_pmd_mkinvalid_tests(struct pgtable_debug_args *args) { }
+#endif /* !__HAVE_ARCH_PMDP_INVALIDATE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
+
 static void __init pmd_thp_tests(struct pgtable_debug_args *args)
 {
 	pmd_t pmd;
@@ -982,6 +1041,8 @@  static void __init pmd_thp_tests(struct pgtable_debug_args *args)
 	WARN_ON(!pmd_trans_huge(pmd_mkinvalid(pmd_mkhuge(pmd))));
 	WARN_ON(!pmd_present(pmd_mkinvalid(pmd_mkhuge(pmd))));
 #endif /* __HAVE_ARCH_PMDP_INVALIDATE */
+
+	swp_pmd_mkinvalid_tests(args);
 }

 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD