
[RFC,v1,1/4] mm: Introduce ptep_get_lockless_norecency()

Message ID: 20240215121756.2734131-2-ryan.roberts@arm.com (mailing list archive)
State: New, archived
Series: Reduce cost of ptep_get_lockless on arm64

Commit Message

Ryan Roberts Feb. 15, 2024, 12:17 p.m. UTC
With the introduction of contpte mapping support for arm64, that
architecture's implementation of ptep_get_lockless() has become very
complex due to the need to gather access and dirty bits from across all
of the ptes in the contpte block. This requires careful implementation
to ensure the returned value is consistent (because it's not possible to
read all ptes atomically), but even in the common case when there is no
racing modification, we have to read all ptes, which gives an ~O(n^2)
cost if the core-mm is iterating over a range and performing a
ptep_get_lockless() on each pte.

Solve this by introducing ptep_get_lockless_norecency(), which does not
make any guarantees about access and dirty bits. Therefore it can simply
read the single target pte.

At the same time, convert all call sites that previously used
ptep_get_lockless() but don't care about access and dirty state.

We may want to do something similar for ptep_get() (i.e.
ptep_get_norecency()) in future; it doesn't suffer from the consistency
problem because the PTL serializes it with any modifications, but does
suffer the same O(n^2) cost.
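
As a rough illustration (a sketch only, not part of the patch; the walk
shape and names such as 'nr' are illustrative), the pattern this is
aimed at:

	/*
	 * Illustrative only: walk 'nr' ptes of a range without the PTL.
	 * With arm64 contpte, each ptep_get_lockless() call gathers
	 * young/dirty from every pte of the contpte block, so the walk
	 * is ~O(nr^2); the norecency variant reads only the target pte,
	 * making the walk O(nr).
	 */
	static void walk_example(pte_t *ptep, unsigned int nr)
	{
		unsigned int i;

		for (i = 0; i < nr; i++) {
			pte_t pte = ptep_get_lockless_norecency(ptep + i);

			/* young/dirty in pte may be stale */
			if (pte_none(pte))
				continue;
			/* ... */
		}
	}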

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/pgtable.h | 37 ++++++++++++++++++++++++++++++++++---
 kernel/events/core.c    |  2 +-
 mm/hugetlb.c            |  2 +-
 mm/khugepaged.c         |  2 +-
 mm/memory.c             |  2 +-
 mm/swap_state.c         |  2 +-
 mm/swapfile.c           |  2 +-
 7 files changed, 40 insertions(+), 9 deletions(-)

--
2.25.1

Comments

David Hildenbrand March 26, 2024, 4:27 p.m. UTC | #1
On 15.02.24 13:17, Ryan Roberts wrote:
> With the introduction of contpte mapping support for arm64, that
> architecture's implementation of ptep_get_lockless() has become very
> complex due to the need to gather access and dirty bits from across all
> of the ptes in the contpte block. This requires careful implementation
> to ensure the returned value is consistent (because it's not possible to
> read all ptes atomically), but even in the common case when there is no
> racing modification, we have to read all ptes, which gives an ~O(n^2)
> cost if the core-mm is iterating over a range and performing a
> ptep_get_lockless() on each pte.
> 
> Solve this by introducing ptep_get_lockless_norecency(), which does not
> make any guarantees about access and dirty bits. Therefore it can simply
> read the single target pte.
> 
> At the same time, convert all call sites that previously used
> ptep_get_lockless() but don't care about access and dirty state.
> 

I'd probably split that part off.

> We may want to do something similar for ptep_get() (i.e.
> ptep_get_norecency()) in future; it doesn't suffer from the consistency
> problem because the PTL serializes it with any modifications, but does
> suffer the same O(n^2) cost.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   include/linux/pgtable.h | 37 ++++++++++++++++++++++++++++++++++---
>   kernel/events/core.c    |  2 +-
>   mm/hugetlb.c            |  2 +-
>   mm/khugepaged.c         |  2 +-
>   mm/memory.c             |  2 +-
>   mm/swap_state.c         |  2 +-
>   mm/swapfile.c           |  2 +-
>   7 files changed, 40 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index a36cf4e124b0..9dd40fdbd825 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -528,16 +528,47 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
>   #endif /* CONFIG_PGTABLE_LEVELS > 2 */
>   #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
> 
> -/*
> - * We require that the PTE can be read atomically.
> - */
>   #ifndef ptep_get_lockless
> +/**
> + * ptep_get_lockless - Get a pte without holding the page table lock. Young and
> + *                     dirty bits are guaranteed to accurately reflect the state
> + *                     of the pte at the time of the call.
> + * @ptep: Page table pointer for pte to get.
> + *
> + * If young and dirty information is not required, use
> + * ptep_get_lockless_norecency() which can be faster on some architectures.
> + *
> + * May be overridden by the architecture; otherwise, implemented using
> + * ptep_get(), on the assumption that it is atomic.
> + *
> + * Context: Any.
> + */

I think we usually say "Any context.". But I would just do it like idr.h:

"Any context. It is safe to call this function without locking in your 
code."

... but is this true? We really want to say "without page table lock". 
Because there must be some way to protect against concurrent page table 
freeing. For example, GUP-fast disables IRQs, whereby page table freeing 
code frees using RCU.
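
A minimal sketch of that kind of caller (hypothetical function and
name; the IRQ trick is the one GUP-fast uses):

	/*
	 * Hypothetical caller: no page table lock is held, but IRQs are
	 * disabled across the access, as in GUP-fast. Page table freeing
	 * that waits for an RCU grace period (or an IPI) cannot complete
	 * while this CPU has IRQs off, so the pte page stays valid.
	 */
	static pte_t read_pte_no_ptl(pte_t *ptep)
	{
		unsigned long flags;
		pte_t pte;

		local_irq_save(flags);
		pte = ptep_get_lockless(ptep);
		local_irq_restore(flags);

		return pte;
	}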

>   static inline pte_t ptep_get_lockless(pte_t *ptep)
>   {
>   	return ptep_get(ptep);
>   }
>   #endif
> 
> +#ifndef ptep_get_lockless_norecency
> +/**
> + * ptep_get_lockless_norecency - Get a pte without holding the page table lock.
> + *				 Young and dirty bits may not be accurate.
> + * @ptep: Page table pointer for pte to get.
> + *
> + * Prefer this over ptep_get_lockless() when young and dirty information is not
> + * required since it can be faster on some architectures.
> + *
> + * May be overridden by the architecture; otherwise, implemented using the more
> + * precise ptep_get_lockless().
> + *
> + * Context: Any.

Same comment.

> + */
> +static inline pte_t ptep_get_lockless_norecency(pte_t *ptep)
> +{
> +	return ptep_get_lockless(ptep);
> +}
> +#endif

[...]

> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 68283e54c899..41dc44eb8454 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -7517,7 +7517,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>   	}
> 
>   	if (pte) {
> -		pte_t pteval = ptep_get_lockless(pte);
> +		pte_t pteval = ptep_get_lockless_norecency(pte);
> 
>   		BUG_ON(pte_present(pteval) && !pte_huge(pteval));
>   	}
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2771fc043b3b..1a6c9ed8237a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1019,7 +1019,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>   			}
>   		}
> 
> -		vmf.orig_pte = ptep_get_lockless(pte);
> +		vmf.orig_pte = ptep_get_lockless_norecency(pte);
>   		if (!is_swap_pte(vmf.orig_pte))
>   			continue;


Hm, I think you mentioned that we want to be careful with vmf.orig_pte.
Ryan Roberts March 26, 2024, 4:39 p.m. UTC | #2
On 26/03/2024 16:27, David Hildenbrand wrote:
> On 15.02.24 13:17, Ryan Roberts wrote:
>> With the introduction of contpte mapping support for arm64, that
>> architecture's implementation of ptep_get_lockless() has become very
>> complex due to the need to gather access and dirty bits from across all
>> of the ptes in the contpte block. This requires careful implementation
>> to ensure the returned value is consistent (because it's not possible to
>> read all ptes atomically), but even in the common case when there is no
>> racing modification, we have to read all ptes, which gives an ~O(n^2)
>> cost if the core-mm is iterating over a range and performing a
>> ptep_get_lockless() on each pte.
>>
>> Solve this by introducing ptep_get_lockless_norecency(), which does not
>> make any guarantees about access and dirty bits. Therefore it can simply
>> read the single target pte.
>>
>> At the same time, convert all call sites that previously used
>> ptep_get_lockless() but don't care about access and dirty state.
>>
> 
> I'd probably split that part off.

I thought the general guidance was to introduce new APIs in the same patch they are
first used in? If I split this off, I'll have one patch for a new (unused) API,
then another for the first users.

> 
>> We may want to do something similar for ptep_get() (i.e.
>> ptep_get_norecency()) in future; it doesn't suffer from the consistency
>> problem because the PTL serializes it with any modifications, but does
>> suffer the same O(n^2) cost.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>   include/linux/pgtable.h | 37 ++++++++++++++++++++++++++++++++++---
>>   kernel/events/core.c    |  2 +-
>>   mm/hugetlb.c            |  2 +-
>>   mm/khugepaged.c         |  2 +-
>>   mm/memory.c             |  2 +-
>>   mm/swap_state.c         |  2 +-
>>   mm/swapfile.c           |  2 +-
>>   7 files changed, 40 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index a36cf4e124b0..9dd40fdbd825 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -528,16 +528,47 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
>>   #endif /* CONFIG_PGTABLE_LEVELS > 2 */
>>   #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
>>
>> -/*
>> - * We require that the PTE can be read atomically.
>> - */
>>   #ifndef ptep_get_lockless
>> +/**
>> + * ptep_get_lockless - Get a pte without holding the page table lock. Young and
>> + *                     dirty bits are guaranteed to accurately reflect the state
>> + *                     of the pte at the time of the call.
>> + * @ptep: Page table pointer for pte to get.
>> + *
>> + * If young and dirty information is not required, use
>> + * ptep_get_lockless_norecency() which can be faster on some architectures.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented using
>> + * ptep_get(), on the assumption that it is atomic.
>> + *
>> + * Context: Any.
>> + */
> 
> I think we usually say "Any context.". But I would just do it like idr.h:
> 
> "Any context. It is safe to call this function without locking in your code."
> 
> ... but is this true? We really want to say "without page table lock". Because
> there must be some way to protect against concurrent page table freeing. For
> example, GUP-fast disables IRQs, whereby page table freeing code frees using RCU.

How about:

"
Context: Any context that guarrantees the page table can't be freed
concurrently. The page table lock is not required.
"

> 
>>   static inline pte_t ptep_get_lockless(pte_t *ptep)
>>   {
>>       return ptep_get(ptep);
>>   }
>>   #endif
>>
>> +#ifndef ptep_get_lockless_norecency
>> +/**
>> + * ptep_get_lockless_norecency - Get a pte without holding the page table lock.
>> + *                 Young and dirty bits may not be accurate.
>> + * @ptep: Page table pointer for pte to get.
>> + *
>> + * Prefer this over ptep_get_lockless() when young and dirty information is not
>> + * required since it can be faster on some architectures.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented using the more
>> + * precise ptep_get_lockless().
>> + *
>> + * Context: Any.
> 
> Same comment.
> 
>> + */
>> +static inline pte_t ptep_get_lockless_norecency(pte_t *ptep)
>> +{
>> +    return ptep_get_lockless(ptep);
>> +}
>> +#endif
> 
> [...]
> 
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 68283e54c899..41dc44eb8454 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -7517,7 +7517,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct
>> vm_area_struct *vma,
>>       }
>>
>>       if (pte) {
>> -        pte_t pteval = ptep_get_lockless(pte);
>> +        pte_t pteval = ptep_get_lockless_norecency(pte);
>>
>>           BUG_ON(pte_present(pteval) && !pte_huge(pteval));
>>       }
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 2771fc043b3b..1a6c9ed8237a 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1019,7 +1019,7 @@ static int __collapse_huge_page_swapin(struct mm_struct
>> *mm,
>>               }
>>           }
>>
>> -        vmf.orig_pte = ptep_get_lockless(pte);
>> +        vmf.orig_pte = ptep_get_lockless_norecency(pte);
>>           if (!is_swap_pte(vmf.orig_pte))
>>               continue;
> 
> 
> Hm, I think you mentioned that we want to be careful with vmf.orig_pte.

Yeah good point. So I guess this should move to patch 3 (which may be dropped -
tbd)?
David Hildenbrand March 27, 2024, 9:28 a.m. UTC | #3
On 26.03.24 17:39, Ryan Roberts wrote:
> On 26/03/2024 16:27, David Hildenbrand wrote:
>> On 15.02.24 13:17, Ryan Roberts wrote:
>>> With the introduction of contpte mapping support for arm64, that
>>> architecture's implementation of ptep_get_lockless() has become very
>>> complex due to the need to gather access and dirty bits from across all
>>> of the ptes in the contpte block. This requires careful implementation
>>> to ensure the returned value is consistent (because it's not possible to
>>> read all ptes atomically), but even in the common case when there is no
>>> racing modification, we have to read all ptes, which gives an ~O(n^2)
>>> cost if the core-mm is iterating over a range and performing a
>>> ptep_get_lockless() on each pte.
>>>
>>> Solve this by introducing ptep_get_lockless_norecency(), which does not
>>> make any guarantees about access and dirty bits. Therefore it can simply
>>> read the single target pte.
>>>
>>> At the same time, convert all call sites that previously used
>>> ptep_get_lockless() but don't care about access and dirty state.
>>>
>>
>> I'd probably split that part off.
> 
> I thought the general guidance was to introduce new APIs in the same patch they are
> first used in? If I split this off, I'll have one patch for a new (unused) API,
> then another for the first users.

I don't know what exact guidance there is, but I tend to leave "non-trivial 
changes" to separate patches.

Some of the changes here are rather trivial (mm/hugetlb.c), and I agree 
that we can perform them here.

At least the "vmf.orig_pte" looked "non-trivial" to me, thus my comment.

> 
>>
>>> We may want to do something similar for ptep_get() (i.e.
>>> ptep_get_norecency()) in future; it doesn't suffer from the consistency
>>> problem because the PTL serializes it with any modifications, but does
>>> suffer the same O(n^2) cost.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>    include/linux/pgtable.h | 37 ++++++++++++++++++++++++++++++++++---
>>>    kernel/events/core.c    |  2 +-
>>>    mm/hugetlb.c            |  2 +-
>>>    mm/khugepaged.c         |  2 +-
>>>    mm/memory.c             |  2 +-
>>>    mm/swap_state.c         |  2 +-
>>>    mm/swapfile.c           |  2 +-
>>>    7 files changed, 40 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index a36cf4e124b0..9dd40fdbd825 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -528,16 +528,47 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
>>>    #endif /* CONFIG_PGTABLE_LEVELS > 2 */
>>>    #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
>>>
>>> -/*
>>> - * We require that the PTE can be read atomically.
>>> - */
>>>    #ifndef ptep_get_lockless
>>> +/**
>>> + * ptep_get_lockless - Get a pte without holding the page table lock. Young and
>>> + *                     dirty bits are guaranteed to accurately reflect the state
>>> + *                     of the pte at the time of the call.
>>> + * @ptep: Page table pointer for pte to get.
>>> + *
>>> + * If young and dirty information is not required, use
>>> + * ptep_get_lockless_norecency() which can be faster on some architectures.
>>> + *
>>> + * May be overridden by the architecture; otherwise, implemented using
>>> + * ptep_get(), on the assumption that it is atomic.
>>> + *
>>> + * Context: Any.
>>> + */
>>
>> I think we usually say "Any context.". But I would just do it like idr.h:
>>
>> "Any context. It is safe to call this function without locking in your code."
>>
>> ... but is this true? We really want to say "without page table lock". Because
>> there must be some way to protect against concurrent page table freeing. For
>> example, GUP-fast disables IRQs, whereby page table freeing code frees using RCU.
> 
> How about:
> 
> "
> Context: Any context that guarrantees the page table can't be freed

s/guarrantees/guarantees/

> concurrently. The page table lock is not required.
> "
> 

Sounds good.

>>
>>>    static inline pte_t ptep_get_lockless(pte_t *ptep)
>>>    {
>>>        return ptep_get(ptep);
>>>    }
>>>    #endif
>>>
>>> +#ifndef ptep_get_lockless_norecency
>>> +/**
>>> + * ptep_get_lockless_norecency - Get a pte without holding the page table lock.
>>> + *                 Young and dirty bits may not be accurate.
>>> + * @ptep: Page table pointer for pte to get.
>>> + *
>>> + * Prefer this over ptep_get_lockless() when young and dirty information is not
>>> + * required since it can be faster on some architectures.
>>> + *
>>> + * May be overridden by the architecture; otherwise, implemented using the more
>>> + * precise ptep_get_lockless().
>>> + *
>>> + * Context: Any.
>>
>> Same comment.
>>
>>> + */
>>> +static inline pte_t ptep_get_lockless_norecency(pte_t *ptep)
>>> +{
>>> +    return ptep_get_lockless(ptep);
>>> +}
>>> +#endif
>>
>> [...]
>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 68283e54c899..41dc44eb8454 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -7517,7 +7517,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct
>>> vm_area_struct *vma,
>>>        }
>>>
>>>        if (pte) {
>>> -        pte_t pteval = ptep_get_lockless(pte);
>>> +        pte_t pteval = ptep_get_lockless_norecency(pte);
>>>
>>>            BUG_ON(pte_present(pteval) && !pte_huge(pteval));
>>>        }
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 2771fc043b3b..1a6c9ed8237a 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1019,7 +1019,7 @@ static int __collapse_huge_page_swapin(struct mm_struct
>>> *mm,
>>>                }
>>>            }
>>>
>>> -        vmf.orig_pte = ptep_get_lockless(pte);
>>> +        vmf.orig_pte = ptep_get_lockless_norecency(pte);
>>>            if (!is_swap_pte(vmf.orig_pte))
>>>                continue;
>>
>>
>> Hm, I think you mentioned that we want to be careful with vmf.orig_pte.
> 
> Yeah good point. So I guess this should move to patch 3 (which may be dropped -
> tbd)?
> 

Yes. Or a separate one where you explain in detail why do_swap_page() 
can handle it just fine.
Ryan Roberts March 27, 2024, 9:57 a.m. UTC | #4
On 27/03/2024 09:28, David Hildenbrand wrote:
> On 26.03.24 17:39, Ryan Roberts wrote:
>> On 26/03/2024 16:27, David Hildenbrand wrote:
>>> On 15.02.24 13:17, Ryan Roberts wrote:
>>>> With the introduction of contpte mapping support for arm64, that
>>>> architecture's implementation of ptep_get_lockless() has become very
>>>> complex due to the need to gather access and dirty bits from across all
>>>> of the ptes in the contpte block. This requires careful implementation
>>>> to ensure the returned value is consistent (because it's not possible to
>>>> read all ptes atomically), but even in the common case when there is no
>>>> racing modification, we have to read all ptes, which gives an ~O(n^2)
>>>> cost if the core-mm is iterating over a range and performing a
>>>> ptep_get_lockless() on each pte.
>>>>
>>>> Solve this by introducing ptep_get_lockless_norecency(), which does not
>>>> make any guarantees about access and dirty bits. Therefore it can simply
>>>> read the single target pte.
>>>>
>>>> At the same time, convert all call sites that previously used
>>>> ptep_get_lockless() but don't care about access and dirty state.
>>>>
>>>
>>> I'd probably split that part off.
>>
>> I thought the general guidance was to introduce new APIs in the same patch they are
>> first used in? If I split this off, I'll have one patch for a new (unused) API,
>> then another for the first users.
> 
> I don't know what exact guidance there is, but I tend to leave "non-trivial
> changes" to separate patches.
> 
> Some of the changes here are rather trivial (mm/hugetlb.c), and I agree that we
> can perform them here.
> 
> At least the "vmf.orig_pte" looked "non-trivial" to me, thus my comment.

got it.


> 
>>
>>>
>>>> We may want to do something similar for ptep_get() (i.e.
>>>> ptep_get_norecency()) in future; it doesn't suffer from the consistency
>>>> problem because the PTL serializes it with any modifications, but does
>>>> suffer the same O(n^2) cost.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>    include/linux/pgtable.h | 37 ++++++++++++++++++++++++++++++++++---
>>>>    kernel/events/core.c    |  2 +-
>>>>    mm/hugetlb.c            |  2 +-
>>>>    mm/khugepaged.c         |  2 +-
>>>>    mm/memory.c             |  2 +-
>>>>    mm/swap_state.c         |  2 +-
>>>>    mm/swapfile.c           |  2 +-
>>>>    7 files changed, 40 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index a36cf4e124b0..9dd40fdbd825 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -528,16 +528,47 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
>>>>    #endif /* CONFIG_PGTABLE_LEVELS > 2 */
>>>>    #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
>>>>
>>>> -/*
>>>> - * We require that the PTE can be read atomically.
>>>> - */
>>>>    #ifndef ptep_get_lockless
>>>> +/**
>>>> + * ptep_get_lockless - Get a pte without holding the page table lock. Young
>>>> and
>>>> + *                     dirty bits are guaranteed to accurately reflect the
>>>> state
>>>> + *                     of the pte at the time of the call.
>>>> + * @ptep: Page table pointer for pte to get.
>>>> + *
>>>> + * If young and dirty information is not required, use
>>>> + * ptep_get_lockless_norecency() which can be faster on some architectures.
>>>> + *
>>>> + * May be overridden by the architecture; otherwise, implemented using
>>>> + * ptep_get(), on the assumption that it is atomic.
>>>> + *
>>>> + * Context: Any.
>>>> + */
>>>
>>> I think we usually say "Any context.". But I would just do it like idr.h:
>>>
>>> "Any context. It is safe to call this function without locking in your code."
>>>
>>> ... but is this true? We really want to say "without page table lock". Because
>>> there must be some way to protect against concurrent page table freeing. For
>>> example, GUP-fast disables IRQs, whereby page table freeing code frees using
>>> RCU.
>>
>> How about:
>>
>> "
>> Context: Any context that guarrantees the page table can't be freed
> 
> s/guarrantees/guarantees/
> 
>> concurrently. The page table lock is not required.
>> "
>>
> 
> Sounds good.

Great!

> 
>>>
>>>>    static inline pte_t ptep_get_lockless(pte_t *ptep)
>>>>    {
>>>>        return ptep_get(ptep);
>>>>    }
>>>>    #endif
>>>>
>>>> +#ifndef ptep_get_lockless_norecency
>>>> +/**
>>>> + * ptep_get_lockless_norecency - Get a pte without holding the page table
>>>> lock.
>>>> + *                 Young and dirty bits may not be accurate.
>>>> + * @ptep: Page table pointer for pte to get.
>>>> + *
>>>> + * Prefer this over ptep_get_lockless() when young and dirty information is
>>>> not
>>>> + * required since it can be faster on some architectures.
>>>> + *
>>>> + * May be overridden by the architecture; otherwise, implemented using the
>>>> more
>>>> + * precise ptep_get_lockless().
>>>> + *
>>>> + * Context: Any.
>>>
>>> Same comment.
>>>
>>>> + */
>>>> +static inline pte_t ptep_get_lockless_norecency(pte_t *ptep)
>>>> +{
>>>> +    return ptep_get_lockless(ptep);
>>>> +}
>>>> +#endif
>>>
>>> [...]
>>>
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index 68283e54c899..41dc44eb8454 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -7517,7 +7517,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct
>>>> vm_area_struct *vma,
>>>>        }
>>>>
>>>>        if (pte) {
>>>> -        pte_t pteval = ptep_get_lockless(pte);
>>>> +        pte_t pteval = ptep_get_lockless_norecency(pte);
>>>>
>>>>            BUG_ON(pte_present(pteval) && !pte_huge(pteval));
>>>>        }
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index 2771fc043b3b..1a6c9ed8237a 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -1019,7 +1019,7 @@ static int __collapse_huge_page_swapin(struct mm_struct
>>>> *mm,
>>>>                }
>>>>            }
>>>>
>>>> -        vmf.orig_pte = ptep_get_lockless(pte);
>>>> +        vmf.orig_pte = ptep_get_lockless_norecency(pte);
>>>>            if (!is_swap_pte(vmf.orig_pte))
>>>>                continue;
>>>
>>>
>>> Hm, I think you mentioned that we want to be careful with vmf.orig_pte.
>>
>> Yeah good point. So I guess this should move to patch 3 (which may be dropped -
>> tbd)?
>>
> 
> Yes. Or a separate one where you explain in detail why do_swap_page() can handle
> it just fine.

Ahh no wait - I remember now; the reason I believe this is a "trivial" case is
because we only leak vmf.orig_pte to the rest of the world if it's a swap entry.
And if it's a swap entry, then ptep_get_lockless_norecency() is equivalent to
ptep_get_lockless() - the pte is not present so there are no access or dirty
bits. So I think this can stay here?
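
In code terms (annotating the hunk above; the comments are mine, not
from the patch):

	vmf.orig_pte = ptep_get_lockless_norecency(pte);
	if (!is_swap_pte(vmf.orig_pte))
		continue;	/* stale young/dirty never escapes this loop */

	/*
	 * Only non-present (swap) entries get past the check, and a swap
	 * entry carries no hardware young/dirty bits, so this value is
	 * identical to what ptep_get_lockless() would have returned.
	 */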
David Hildenbrand March 27, 2024, 5:02 p.m. UTC | #5
>>>>
>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>> index 68283e54c899..41dc44eb8454 100644
>>>>> --- a/mm/hugetlb.c
>>>>> +++ b/mm/hugetlb.c
>>>>> @@ -7517,7 +7517,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct
>>>>> vm_area_struct *vma,
>>>>>         }
>>>>>
>>>>>         if (pte) {
>>>>> -        pte_t pteval = ptep_get_lockless(pte);
>>>>> +        pte_t pteval = ptep_get_lockless_norecency(pte);
>>>>>
>>>>>             BUG_ON(pte_present(pteval) && !pte_huge(pteval));
>>>>>         }
>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>>> index 2771fc043b3b..1a6c9ed8237a 100644
>>>>> --- a/mm/khugepaged.c
>>>>> +++ b/mm/khugepaged.c
>>>>> @@ -1019,7 +1019,7 @@ static int __collapse_huge_page_swapin(struct mm_struct
>>>>> *mm,
>>>>>                 }
>>>>>             }
>>>>>
>>>>> -        vmf.orig_pte = ptep_get_lockless(pte);
>>>>> +        vmf.orig_pte = ptep_get_lockless_norecency(pte);
>>>>>             if (!is_swap_pte(vmf.orig_pte))
>>>>>                 continue;
>>>>
>>>>
>>>> Hm, I think you mentioned that we want to be careful with vmf.orig_pte.
>>>
>>> Yeah good point. So I guess this should move to patch 3 (which may be dropped -
>>> tbd)?
>>>
>>
>> Yes. Or a separate one where you explain in detail why do_swap_page() can handle
>> it just fine.
> 
> Ahh no wait - I remember now; the reason I believe this is a "trivial" case is
> because we only leak vmf.orig_pte to the rest of the world if it's a swap entry.

Ugh, yes!

Patch

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a36cf4e124b0..9dd40fdbd825 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -528,16 +528,47 @@  static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
 #endif /* CONFIG_PGTABLE_LEVELS > 2 */
 #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */

-/*
- * We require that the PTE can be read atomically.
- */
 #ifndef ptep_get_lockless
+/**
+ * ptep_get_lockless - Get a pte without holding the page table lock. Young and
+ *                     dirty bits are guaranteed to accurately reflect the state
+ *                     of the pte at the time of the call.
+ * @ptep: Page table pointer for pte to get.
+ *
+ * If young and dirty information is not required, use
+ * ptep_get_lockless_norecency() which can be faster on some architectures.
+ *
+ * May be overridden by the architecture; otherwise, implemented using
+ * ptep_get(), on the assumption that it is atomic.
+ *
+ * Context: Any.
+ */
 static inline pte_t ptep_get_lockless(pte_t *ptep)
 {
 	return ptep_get(ptep);
 }
 #endif

+#ifndef ptep_get_lockless_norecency
+/**
+ * ptep_get_lockless_norecency - Get a pte without holding the page table lock.
+ *				 Young and dirty bits may not be accurate.
+ * @ptep: Page table pointer for pte to get.
+ *
+ * Prefer this over ptep_get_lockless() when young and dirty information is not
+ * required since it can be faster on some architectures.
+ *
+ * May be overridden by the architecture; otherwise, implemented using the more
+ * precise ptep_get_lockless().
+ *
+ * Context: Any.
+ */
+static inline pte_t ptep_get_lockless_norecency(pte_t *ptep)
+{
+	return ptep_get_lockless(ptep);
+}
+#endif
+
 #ifndef pmdp_get_lockless
 static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
 {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f0f0f71213a1..27991312d635 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7583,7 +7583,7 @@  static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
 	if (!ptep)
 		goto again;

-	pte = ptep_get_lockless(ptep);
+	pte = ptep_get_lockless_norecency(ptep);
 	if (pte_present(pte))
 		size = pte_leaf_size(pte);
 	pte_unmap(ptep);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 68283e54c899..41dc44eb8454 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7517,7 +7517,7 @@  pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 	}

 	if (pte) {
-		pte_t pteval = ptep_get_lockless(pte);
+		pte_t pteval = ptep_get_lockless_norecency(pte);

 		BUG_ON(pte_present(pteval) && !pte_huge(pteval));
 	}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2771fc043b3b..1a6c9ed8237a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1019,7 +1019,7 @@  static int __collapse_huge_page_swapin(struct mm_struct *mm,
 			}
 		}

-		vmf.orig_pte = ptep_get_lockless(pte);
+		vmf.orig_pte = ptep_get_lockless_norecency(pte);
 		if (!is_swap_pte(vmf.orig_pte))
 			continue;

diff --git a/mm/memory.c b/mm/memory.c
index 4dd8e35b593a..8e65fa1884f1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4353,7 +4353,7 @@  static bool pte_range_none(pte_t *pte, int nr_pages)
 	int i;

 	for (i = 0; i < nr_pages; i++) {
-		if (!pte_none(ptep_get_lockless(pte + i)))
+		if (!pte_none(ptep_get_lockless_norecency(pte + i)))
 			return false;
 	}

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2f540748f7c0..061c6c16c7ff 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -837,7 +837,7 @@  static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 			if (!pte)
 				break;
 		}
-		pentry = ptep_get_lockless(pte);
+		pentry = ptep_get_lockless_norecency(pte);
 		if (!is_swap_pte(pentry))
 			continue;
 		entry = pte_to_swp_entry(pentry);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index d1bd8d1e17bd..c414dd238814 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1857,7 +1857,7 @@  static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				break;
 		}

-		ptent = ptep_get_lockless(pte);
+		ptent = ptep_get_lockless_norecency(pte);

 		if (!is_swap_pte(ptent))
 			continue;
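
For reference, the arm64 override this series builds towards might look
like the sketch below (hypothetical at this point in the series; the
real implementation lands in a later patch, and __ptep_get() is assumed
to be arm64's raw single-pte read):

	#define ptep_get_lockless_norecency ptep_get_lockless_norecency
	static inline pte_t ptep_get_lockless_norecency(pte_t *ptep)
	{
		/*
		 * One atomic read of the target pte only; the other ptes
		 * of a contpte block are not consulted, so young/dirty may
		 * be stale but all other fields are accurate.
		 */
		return __ptep_get(ptep);
	}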