mmu_notifiers: Notify on pte permission upgrades

Message ID	20230522063725.284686-1-apopple@nvidia.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <kvm-owner@vger.kernel.org> From: Alistair Popple <apopple@nvidia.com> To: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, robin.murphy@arm.com, will@kernel.org, nicolinc@nvidia.com, linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org, kvmarm@lists.cs.columbia.edu, jgg@nvidia.com, John Hubbard <jhubbard@nvidia.com>, Alistair Popple <apopple@nvidia.com> Subject: [PATCH] mmu_notifiers: Notify on pte permission upgrades Date: Mon, 22 May 2023 16:37:25 +1000 Message-Id: <20230522063725.284686-1-apopple@nvidia.com> Content-Transfer-Encoding: 8bit Content-Type: text/plain MIME-Version: 1.0 Precedence: bulk
Series	mmu_notifiers: Notify on pte permission upgrades \| expand mmu_notifiers: Notify on pte permission upgrades

Alistair Popple May 22, 2023, 6:37 a.m. UTC

Some architectures, specifically ARM and perhaps Sparc and IA64,
require TLB invalidates when upgrading pte permission from read-only
to read-write.

The current mmu_notifier implementation assumes that upgrades do not
need notifications. Typically though mmu_notifiers are used to
implement TLB invalidations for secondary MMUs that comply with the
main CPU architecture.

Therefore if the main CPU architecture requires an invalidation for
permission upgrade the secondary MMU will as well and an mmu_notifier
should be sent for the upgrade.

Currently CPU invalidations for permission upgrade occur in
ptep_set_access_flags(). Unfortunately MMU notifiers cannot be called
directly from this architecture specific code as the notifier
callbacks can sleep, and ptep_set_access_flags() is usually called
whilst holding the PTL spinlock. Therefore add the notifier calls
after the PTL is dropped and only if the PTE actually changed. This
will allow secondary MMUs to obtain an updated PTE with appropriate
permissions.

This problem was discovered during testing of an ARM SMMU
implementation that does not support broadcast TLB maintenance
(BTM). In this case the SMMU driver uses notifiers to issue TLB
invalidates. For read-only to read-write pte upgrades the SMMU
continually returned a read-only PTE to the device, even though the
CPU had a read-write PTE installed.

Sending a mmu notifier event to the SMMU driver fixes the problem by
flushing secondary TLB entries. A new notifier event type is added so
drivers may filter out these invalidations if not required. Note a
driver should never upgrade or install a PTE in response to this mmu
notifier event as it is not synchronised against other PTE operations.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 include/linux/mmu_notifier.h |  6 +++++
 mm/hugetlb.c                 | 24 ++++++++++++++++++-
 mm/memory.c                  | 45 ++++++++++++++++++++++++++++++++++--
 3 files changed, 72 insertions(+), 3 deletions(-)

Qi Zheng May 22, 2023, 7:15 a.m. UTC | #1

Hi Alistair,

On 2023/5/22 14:37, Alistair Popple wrote:
> Some architectures, specifically ARM and perhaps Sparc and IA64,
> require TLB invalidates when upgrading pte permission from read-only
> to read-write.
> 
> The current mmu_notifier implementation assumes that upgrades do not
> need notifications. Typically though mmu_notifiers are used to
> implement TLB invalidations for secondary MMUs that comply with the
> main CPU architecture.
> 
> Therefore if the main CPU architecture requires an invalidation for
> permission upgrade the secondary MMU will as well and an mmu_notifier
> should be sent for the upgrade.
> 
> Currently CPU invalidations for permission upgrade occur in
> ptep_set_access_flags(). Unfortunately MMU notifiers cannot be called
> directly from this architecture specific code as the notifier
> callbacks can sleep, and ptep_set_access_flags() is usually called
> whilst holding the PTL spinlock. Therefore add the notifier calls
> after the PTL is dropped and only if the PTE actually changed. This
> will allow secondary MMUs to obtain an updated PTE with appropriate
> permissions.
> 
> This problem was discovered during testing of an ARM SMMU
> implementation that does not support broadcast TLB maintenance
> (BTM). In this case the SMMU driver uses notifiers to issue TLB
> invalidates. For read-only to read-write pte upgrades the SMMU
> continually returned a read-only PTE to the device, even though the
> CPU had a read-write PTE installed.
> 
> Sending a mmu notifier event to the SMMU driver fixes the problem by
> flushing secondary TLB entries. A new notifier event type is added so
> drivers may filter out these invalidations if not required. Note a
> driver should never upgrade or install a PTE in response to this mmu
> notifier event as it is not synchronised against other PTE operations.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> ---
>   include/linux/mmu_notifier.h |  6 +++++
>   mm/hugetlb.c                 | 24 ++++++++++++++++++-
>   mm/memory.c                  | 45 ++++++++++++++++++++++++++++++++++--
>   3 files changed, 72 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index d6c06e140277..f14d68f119d8 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -31,6 +31,11 @@ struct mmu_interval_notifier;
>    * pages in the range so to mirror those changes the user must inspect the CPU
>    * page table (from the end callback).
>    *
> + * @MMU_NOTIFY_PROTECTION_UPGRAGE: update is due to a change from read-only to
> + * read-write for pages in the range. This must not be used to upgrade
> + * permissions on secondary PTEs, rather it should only be used to invalidate
> + * caches such as secondary TLBs that may cache old read-only entries.
> + *
>    * @MMU_NOTIFY_SOFT_DIRTY: soft dirty accounting (still same page and same
>    * access flags). User should soft dirty the page in the end callback to make
>    * sure that anyone relying on soft dirtiness catch pages that might be written
> @@ -53,6 +58,7 @@ enum mmu_notifier_event {
>   	MMU_NOTIFY_CLEAR,
>   	MMU_NOTIFY_PROTECTION_VMA,
>   	MMU_NOTIFY_PROTECTION_PAGE,
> +	MMU_NOTIFY_PROTECTION_UPGRADE,
>   	MMU_NOTIFY_SOFT_DIRTY,
>   	MMU_NOTIFY_RELEASE,
>   	MMU_NOTIFY_MIGRATE,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index bdbfeb6fb393..e5d467c7bff7 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5987,6 +5987,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   	vm_fault_t ret;
>   	u32 hash;
>   	pgoff_t idx;
> +	bool changed = false;
>   	struct page *page = NULL;
>   	struct page *pagecache_page = NULL;
>   	struct hstate *h = hstate_vma(vma);
> @@ -6122,6 +6123,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   		if (!huge_pte_write(entry)) {
>   			ret = hugetlb_wp(mm, vma, address, ptep, flags,
>   					 pagecache_page, ptl);
> +			if (!ret)
> +				changed = true;
> +
>   			goto out_put_page;
>   		} else if (likely(flags & FAULT_FLAG_WRITE)) {
>   			entry = huge_pte_mkdirty(entry);
> @@ -6129,8 +6133,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   	}
>   	entry = pte_mkyoung(entry);
>   	if (huge_ptep_set_access_flags(vma, haddr, ptep, entry,
> -						flags & FAULT_FLAG_WRITE))
> +						flags & FAULT_FLAG_WRITE)) {
>   		update_mmu_cache(vma, haddr, ptep);
> +		changed = true;
> +	}
> +
>   out_put_page:
>   	if (page != pagecache_page)
>   		unlock_page(page);
> @@ -6138,6 +6145,21 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   out_ptl:
>   	spin_unlock(ptl);
>   
> +	if (changed) {
> +		struct mmu_notifier_range range;
> +		unsigned long hpage_mask = huge_page_mask(h);
> +		unsigned long hpage_size = huge_page_size(h);
> +
> +		update_mmu_cache(vma, haddr, ptep);
> +
> +		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_UPGRADE,
> +					0, vma, mm, haddr & hpage_mask,
> +					(haddr & hpage_mask) + hpage_size);
> +		mmu_notifier_invalidate_range_start(&range);
> +		mmu_notifier_invalidate_range_end(&range);
> +	}
> +
> +
>   	if (pagecache_page) {
>   		unlock_page(pagecache_page);
>   		put_page(pagecache_page);
> diff --git a/mm/memory.c b/mm/memory.c
> index f526b9152bef..0ac78c6a232c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2098,6 +2098,7 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
>   	struct mm_struct *mm = vma->vm_mm;
>   	pte_t *pte, entry;
>   	spinlock_t *ptl;
> +	bool changed = false;
>   
>   	pte = get_locked_pte(mm, addr, &ptl);
>   	if (!pte)
> @@ -2120,8 +2121,10 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
>   			}
>   			entry = pte_mkyoung(*pte);
>   			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> -			if (ptep_set_access_flags(vma, addr, pte, entry, 1))
> +			if (ptep_set_access_flags(vma, addr, pte, entry, 1)) {
>   				update_mmu_cache(vma, addr, pte);
> +				changed = true;
> +			}
>   		}
>   		goto out_unlock;
>   	}
> @@ -2142,6 +2145,17 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
>   
>   out_unlock:
>   	pte_unmap_unlock(pte, ptl);
> +
> +	if (changed) {
> +		struct mmu_notifier_range range;
> +
> +		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_UPGRADE,
> +					0, vma, mm, addr & PAGE_MASK,
> +					(addr & PAGE_MASK) + PAGE_SIZE);
> +		mmu_notifier_invalidate_range_start(&range);
> +		mmu_notifier_invalidate_range_end(&range);
> +	}
> +
>   	return VM_FAULT_NOPAGE;
>   }
>   
> @@ -2820,6 +2834,7 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
>   	struct vm_area_struct *vma = vmf->vma;
>   	struct mm_struct *mm = vma->vm_mm;
>   	unsigned long addr = vmf->address;
> +	bool changed = false;
>   
>   	if (likely(src)) {
>   		if (copy_mc_user_highpage(dst, src, addr, vma)) {
> @@ -2858,8 +2873,10 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
>   		}
>   
>   		entry = pte_mkyoung(vmf->orig_pte);
> -		if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0))
> +		if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0)) {
>   			update_mmu_cache(vma, addr, vmf->pte);
> +			changed = true;
> +		}
>   	}
>   
>   	/*
> @@ -2897,6 +2914,16 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
>   		}
>   	}
>   
> +	if (changed) {
> +		struct mmu_notifier_range range;
> +
> +		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_UPGRADE,
> +					0, vma, vma->vm_mm, addr & PAGE_MASK,
> +					(addr & PAGE_MASK) + PAGE_SIZE);
> +		mmu_notifier_invalidate_range_start(&range);
> +		mmu_notifier_invalidate_range_end(&range);
> +	}
> +
>   	ret = 0;
>   
>   pte_unlock:
> @@ -4877,6 +4904,7 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
>   static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>   {
>   	pte_t entry;
> +	bool changed = false;
>   
>   	if (unlikely(pmd_none(*vmf->pmd))) {
>   		/*
> @@ -4957,6 +4985,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>   	if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
>   				vmf->flags & FAULT_FLAG_WRITE)) {
>   		update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
> +		changed = true;
>   	} else {
>   		/* Skip spurious TLB flush for retried page fault */
>   		if (vmf->flags & FAULT_FLAG_TRIED)
> @@ -4972,6 +5001,18 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>   	}
>   unlock:
>   	pte_unmap_unlock(vmf->pte, vmf->ptl);
> +
> +	if (changed) {
> +		struct mmu_notifier_range range;
> +
> +		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_UPGRADE,
> +					0, vmf->vma, vmf->vma->vm_mm,
> +					vmf->address & PAGE_MASK,
> +					(vmf->address & PAGE_MASK) + PAGE_SIZE);
> +		mmu_notifier_invalidate_range_start(&range);
> +		mmu_notifier_invalidate_range_end(&range);
> +	}

There are four similar patterns, can we introduce a helper function to
deduplicate them?

> +
>   	return 0;
>   }
>

Alistair Popple May 22, 2023, 7:45 a.m. UTC | #2

Qi Zheng <qi.zheng@linux.dev> writes:

> Hi Alistair,
>
> On 2023/5/22 14:37, Alistair Popple wrote:

[...]

>> +	if (changed) {
>> +		struct mmu_notifier_range range;
>> +
>> +		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_UPGRADE,
>> +					0, vmf->vma, vmf->vma->vm_mm,
>> +					vmf->address & PAGE_MASK,
>> +					(vmf->address & PAGE_MASK) + PAGE_SIZE);
>> +		mmu_notifier_invalidate_range_start(&range);
>> +		mmu_notifier_invalidate_range_end(&range);
>> +	}
>
> There are four similar patterns, can we introduce a helper function to
> deduplicate them?

For sure. How about something like this?

void mmu_notifier_range_start_end(enum mmu_notifier_event event,
                                  struct vm_area_struct *vma,
                                  struct mm_struct *mm,
                                  unsigned long start,
                                  unsigned long end)

As an aside I didn't just use mmu_notifier_invalidate_range() as that
doesn't allow an event type to be set for interval notifiers which may
want to filter this.

>> +
>>   	return 0;
>>   }
>>

Qi Zheng May 22, 2023, 8:28 a.m. UTC | #3

On 2023/5/22 15:45, Alistair Popple wrote:
> 
> Qi Zheng <qi.zheng@linux.dev> writes:
> 
>> Hi Alistair,
>>
>> On 2023/5/22 14:37, Alistair Popple wrote:
> 
> [...]
> 
>>> +	if (changed) {
>>> +		struct mmu_notifier_range range;
>>> +
>>> +		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_UPGRADE,
>>> +					0, vmf->vma, vmf->vma->vm_mm,
>>> +					vmf->address & PAGE_MASK,
>>> +					(vmf->address & PAGE_MASK) + PAGE_SIZE);
>>> +		mmu_notifier_invalidate_range_start(&range);
>>> +		mmu_notifier_invalidate_range_end(&range);
>>> +	}
>>
>> There are four similar patterns, can we introduce a helper function to
>> deduplicate them?
> 
> For sure. How about something like this?
> 
> void mmu_notifier_range_start_end(enum mmu_notifier_event event,
>                                    struct vm_area_struct *vma,
>                                    struct mm_struct *mm,
>                                    unsigned long start,
>                                    unsigned long end)

LGTM. :)

> 
> As an aside I didn't just use mmu_notifier_invalidate_range() as that
> doesn't allow an event type to be set for interval notifiers which may
> want to filter this.
> 
>>> +
>>>    	return 0;
>>>    }
>>>    
>

Catalin Marinas May 22, 2023, 3:09 p.m. UTC | #4

On Mon, May 22, 2023 at 04:37:25PM +1000, Alistair Popple wrote:
> diff --git a/mm/memory.c b/mm/memory.c
> index f526b9152bef..0ac78c6a232c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2098,6 +2098,7 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
>  	struct mm_struct *mm = vma->vm_mm;
>  	pte_t *pte, entry;
>  	spinlock_t *ptl;
> +	bool changed = false;
>  
>  	pte = get_locked_pte(mm, addr, &ptl);
>  	if (!pte)
> @@ -2120,8 +2121,10 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
>  			}
>  			entry = pte_mkyoung(*pte);
>  			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> -			if (ptep_set_access_flags(vma, addr, pte, entry, 1))
> +			if (ptep_set_access_flags(vma, addr, pte, entry, 1)) {
>  				update_mmu_cache(vma, addr, pte);
> +				changed = true;
> +			}
>  		}
>  		goto out_unlock;
>  	}

I haven't checked all the corner cases but can we not have a
ptep_set_access_flags_notify() that handles this (and the huge
equivalent)? It matches the other API like ptep_clear_flush_notify().

Sean Christopherson May 22, 2023, 6:34 p.m. UTC | #5

On Mon, May 22, 2023, Alistair Popple wrote:
> Some architectures, specifically ARM and perhaps Sparc and IA64,
> require TLB invalidates when upgrading pte permission from read-only
> to read-write.
> 
> The current mmu_notifier implementation assumes that upgrades do not
> need notifications. Typically though mmu_notifiers are used to
> implement TLB invalidations for secondary MMUs that comply with the
> main CPU architecture.
> 
> Therefore if the main CPU architecture requires an invalidation for
> permission upgrade the secondary MMU will as well and an mmu_notifier
> should be sent for the upgrade.
> 
> Currently CPU invalidations for permission upgrade occur in
> ptep_set_access_flags(). Unfortunately MMU notifiers cannot be called
> directly from this architecture specific code as the notifier
> callbacks can sleep, and ptep_set_access_flags() is usually called
> whilst holding the PTL spinlock. Therefore add the notifier calls
> after the PTL is dropped and only if the PTE actually changed. This
> will allow secondary MMUs to obtain an updated PTE with appropriate
> permissions.
> 
> This problem was discovered during testing of an ARM SMMU
> implementation that does not support broadcast TLB maintenance
> (BTM). In this case the SMMU driver uses notifiers to issue TLB
> invalidates. For read-only to read-write pte upgrades the SMMU
> continually returned a read-only PTE to the device, even though the
> CPU had a read-write PTE installed.
> 
> Sending a mmu notifier event to the SMMU driver fixes the problem by
> flushing secondary TLB entries. A new notifier event type is added so
> drivers may filter out these invalidations if not required. Note a
> driver should never upgrade or install a PTE in response to this mmu
> notifier event as it is not synchronised against other PTE operations.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> ---
>  include/linux/mmu_notifier.h |  6 +++++
>  mm/hugetlb.c                 | 24 ++++++++++++++++++-
>  mm/memory.c                  | 45 ++++++++++++++++++++++++++++++++++--
>  3 files changed, 72 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index d6c06e140277..f14d68f119d8 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -31,6 +31,11 @@ struct mmu_interval_notifier;
>   * pages in the range so to mirror those changes the user must inspect the CPU
>   * page table (from the end callback).
>   *
> + * @MMU_NOTIFY_PROTECTION_UPGRAGE: update is due to a change from read-only to
> + * read-write for pages in the range. This must not be used to upgrade
> + * permissions on secondary PTEs, rather it should only be used to invalidate
> + * caches such as secondary TLBs that may cache old read-only entries.

This is a poor fit for invalidate_range_{start,end}().  All other uses bookend
the primary MMU update, i.e. call start() _before_ changing PTEs.  The comments
are somewhat stale as they talk only about "unmapped", but the contract between
the primary MMU and the secondary MMU is otherwise quite clear on when the primary
MMU will invoke start() and end().

	 * invalidate_range_start() is called when all pages in the
	 * range are still mapped and have at least a refcount of one.
	 *
	 * invalidate_range_end() is called when all pages in the
	 * range have been unmapped and the pages have been freed by
	 * the VM.

I'm also confused as to how this actually fixes ARM's SMMU.  Unless I'm looking
at the wrong SMMU implementation, the SMMU implemenents only invalidate_range(),
not the start()/end() variants.

	static const struct mmu_notifier_ops arm_smmu_mmu_notifier_ops = {
		.invalidate_range	= arm_smmu_mm_invalidate_range,
		.release		= arm_smmu_mm_release,
		.free_notifier		= arm_smmu_mmu_notifier_free,
	};

Again from include/linux/mmu_notifier.h, not implementing the start()/end() hooks
is perfectly valid.  And AFAICT, the existing invalidate_range() hook is pretty
much a perfect fit for what you want to achieve.

	 * If invalidate_range() is used to manage a non-CPU TLB with
	 * shared page-tables, it not necessary to implement the
	 * invalidate_range_start()/end() notifiers, as
	 * invalidate_range() already catches the points in time when an
	 * external TLB range needs to be flushed. For more in depth
	 * discussion on this see Documentation/mm/mmu_notifier.rst

Even worse, this change may silently regress performance for secondary MMUs that
haven't yet taken advantage of the event type, e.g. KVM will zap all of KVM's PTEs
in response to the upgrade, instead of waiting until the guest actually tries to
utilize the new protections.

Alistair Popple May 22, 2023, 11:50 p.m. UTC | #6

Sean Christopherson <seanjc@google.com> writes:

> On Mon, May 22, 2023, Alistair Popple wrote:
>> Some architectures, specifically ARM and perhaps Sparc and IA64,
>> require TLB invalidates when upgrading pte permission from read-only
>> to read-write.
>> 
>> The current mmu_notifier implementation assumes that upgrades do not
>> need notifications. Typically though mmu_notifiers are used to
>> implement TLB invalidations for secondary MMUs that comply with the
>> main CPU architecture.
>> 
>> Therefore if the main CPU architecture requires an invalidation for
>> permission upgrade the secondary MMU will as well and an mmu_notifier
>> should be sent for the upgrade.
>> 
>> Currently CPU invalidations for permission upgrade occur in
>> ptep_set_access_flags(). Unfortunately MMU notifiers cannot be called
>> directly from this architecture specific code as the notifier
>> callbacks can sleep, and ptep_set_access_flags() is usually called
>> whilst holding the PTL spinlock. Therefore add the notifier calls
>> after the PTL is dropped and only if the PTE actually changed. This
>> will allow secondary MMUs to obtain an updated PTE with appropriate
>> permissions.
>> 
>> This problem was discovered during testing of an ARM SMMU
>> implementation that does not support broadcast TLB maintenance
>> (BTM). In this case the SMMU driver uses notifiers to issue TLB
>> invalidates. For read-only to read-write pte upgrades the SMMU
>> continually returned a read-only PTE to the device, even though the
>> CPU had a read-write PTE installed.
>> 
>> Sending a mmu notifier event to the SMMU driver fixes the problem by
>> flushing secondary TLB entries. A new notifier event type is added so
>> drivers may filter out these invalidations if not required. Note a
>> driver should never upgrade or install a PTE in response to this mmu
>> notifier event as it is not synchronised against other PTE operations.
>> 
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> ---
>>  include/linux/mmu_notifier.h |  6 +++++
>>  mm/hugetlb.c                 | 24 ++++++++++++++++++-
>>  mm/memory.c                  | 45 ++++++++++++++++++++++++++++++++++--
>>  3 files changed, 72 insertions(+), 3 deletions(-)
>> 
>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> index d6c06e140277..f14d68f119d8 100644
>> --- a/include/linux/mmu_notifier.h
>> +++ b/include/linux/mmu_notifier.h
>> @@ -31,6 +31,11 @@ struct mmu_interval_notifier;
>>   * pages in the range so to mirror those changes the user must inspect the CPU
>>   * page table (from the end callback).
>>   *
>> + * @MMU_NOTIFY_PROTECTION_UPGRAGE: update is due to a change from read-only to
>> + * read-write for pages in the range. This must not be used to upgrade
>> + * permissions on secondary PTEs, rather it should only be used to invalidate
>> + * caches such as secondary TLBs that may cache old read-only entries.
>
> This is a poor fit for invalidate_range_{start,end}().  All other uses bookend
> the primary MMU update, i.e. call start() _before_ changing PTEs.  The comments
> are somewhat stale as they talk only about "unmapped", but the contract between
> the primary MMU and the secondary MMU is otherwise quite clear on when the primary
> MMU will invoke start() and end().
>
> 	 * invalidate_range_start() is called when all pages in the
> 	 * range are still mapped and have at least a refcount of one.
> 	 *
> 	 * invalidate_range_end() is called when all pages in the
> 	 * range have been unmapped and the pages have been freed by
> 	 * the VM.
>
> I'm also confused as to how this actually fixes ARM's SMMU.  Unless I'm looking
> at the wrong SMMU implementation, the SMMU implemenents only invalidate_range(),
> not the start()/end() variants.

mmu_invalidate_range_end() calls the invalidate_range() callback if
the start()/end() variants aren't set.

> 	static const struct mmu_notifier_ops arm_smmu_mmu_notifier_ops = {
> 		.invalidate_range	= arm_smmu_mm_invalidate_range,
> 		.release		= arm_smmu_mm_release,
> 		.free_notifier		= arm_smmu_mmu_notifier_free,
> 	};
>
> Again from include/linux/mmu_notifier.h, not implementing the start()/end() hooks
> is perfectly valid.  And AFAICT, the existing invalidate_range() hook is pretty
> much a perfect fit for what you want to achieve.

Right, I didn't take that approach because it doesn't allow an event
type to be passed which would allow them to be filtered on platforms
which don't require this.

I had also assumed the invalidate_range() callbacks were allowed to
sleep, hence couldn't be called under PTL. That's certainly true of mmu
interval notifier callbacks, but Catalin reminded me that calls such as
ptep_clear_flush_notify() already call invalidate_range() callback under
PTL so I guess we already assume drivers don't sleep in their
invalidate_range() callbacks. I will update the comments to reflect
that.

> 	 * If invalidate_range() is used to manage a non-CPU TLB with
> 	 * shared page-tables, it not necessary to implement the
> 	 * invalidate_range_start()/end() notifiers, as
> 	 * invalidate_range() already catches the points in time when an
> 	 * external TLB range needs to be flushed. For more in depth
> 	 * discussion on this see Documentation/mm/mmu_notifier.rst
>
> Even worse, this change may silently regress performance for secondary MMUs that
> haven't yet taken advantage of the event type, e.g. KVM will zap all of KVM's PTEs
> in response to the upgrade, instead of waiting until the guest actually tries to
> utilize the new protections.

Yeah, I like the idea of introducing a
ptep_set_access_flags_notify(). That way this won't regress performance
on platforms that don't need it. Note this isn't a new feature but
rather a bugfix. It's unclear to me why KVM on ARM hasn't already run
into this issue, but I'm no KVM expert. Thanks for the feedback.

 - Alistair

Alistair Popple May 22, 2023, 11:52 p.m. UTC | #7

Catalin Marinas <catalin.marinas@arm.com> writes:

> On Mon, May 22, 2023 at 04:37:25PM +1000, Alistair Popple wrote:
>> diff --git a/mm/memory.c b/mm/memory.c
>> index f526b9152bef..0ac78c6a232c 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -2098,6 +2098,7 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
>>  	struct mm_struct *mm = vma->vm_mm;
>>  	pte_t *pte, entry;
>>  	spinlock_t *ptl;
>> +	bool changed = false;
>>  
>>  	pte = get_locked_pte(mm, addr, &ptl);
>>  	if (!pte)
>> @@ -2120,8 +2121,10 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
>>  			}
>>  			entry = pte_mkyoung(*pte);
>>  			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>> -			if (ptep_set_access_flags(vma, addr, pte, entry, 1))
>> +			if (ptep_set_access_flags(vma, addr, pte, entry, 1)) {
>>  				update_mmu_cache(vma, addr, pte);
>> +				changed = true;
>> +			}
>>  		}
>>  		goto out_unlock;
>>  	}
>
> I haven't checked all the corner cases but can we not have a
> ptep_set_access_flags_notify() that handles this (and the huge
> equivalent)? It matches the other API like ptep_clear_flush_notify().

Good suggestion, thanks! I can make the implementations architecture
specific too which helps with filtering on platforms that don't need
it. I had assumed the invalidate_range() callbacks could sleep and
therefore couldn't be called under PTL. However
ptep_clear_flush_notify() already calls invalidate_range() under PTL, so
we already assume drivers don't sleep in the invalidate_range()
callbacks.

Sean Christopherson May 23, 2023, 12:06 a.m. UTC | #8

On Tue, May 23, 2023, Alistair Popple wrote:
> 
> Sean Christopherson <seanjc@google.com> writes:
> >> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> >> index d6c06e140277..f14d68f119d8 100644
> >> --- a/include/linux/mmu_notifier.h
> >> +++ b/include/linux/mmu_notifier.h
> >> @@ -31,6 +31,11 @@ struct mmu_interval_notifier;
> >>   * pages in the range so to mirror those changes the user must inspect the CPU
> >>   * page table (from the end callback).
> >>   *
> >> + * @MMU_NOTIFY_PROTECTION_UPGRAGE: update is due to a change from read-only to
> >> + * read-write for pages in the range. This must not be used to upgrade
> >> + * permissions on secondary PTEs, rather it should only be used to invalidate
> >> + * caches such as secondary TLBs that may cache old read-only entries.
> >
> > This is a poor fit for invalidate_range_{start,end}().  All other uses bookend
> > the primary MMU update, i.e. call start() _before_ changing PTEs.  The comments
> > are somewhat stale as they talk only about "unmapped", but the contract between
> > the primary MMU and the secondary MMU is otherwise quite clear on when the primary
> > MMU will invoke start() and end().
> >
> > 	 * invalidate_range_start() is called when all pages in the
> > 	 * range are still mapped and have at least a refcount of one.
> > 	 *
> > 	 * invalidate_range_end() is called when all pages in the
> > 	 * range have been unmapped and the pages have been freed by
> > 	 * the VM.
> >
> > I'm also confused as to how this actually fixes ARM's SMMU.  Unless I'm looking
> > at the wrong SMMU implementation, the SMMU implemenents only invalidate_range(),
> > not the start()/end() variants.
> 
> mmu_invalidate_range_end() calls the invalidate_range() callback if
> the start()/end() variants aren't set.

Ahh, thanks.

> > 	static const struct mmu_notifier_ops arm_smmu_mmu_notifier_ops = {
> > 		.invalidate_range	= arm_smmu_mm_invalidate_range,
> > 		.release		= arm_smmu_mm_release,
> > 		.free_notifier		= arm_smmu_mmu_notifier_free,
> > 	};
> >
> > Again from include/linux/mmu_notifier.h, not implementing the start()/end() hooks
> > is perfectly valid.  And AFAICT, the existing invalidate_range() hook is pretty
> > much a perfect fit for what you want to achieve.
> 
> Right, I didn't take that approach because it doesn't allow an event
> type to be passed which would allow them to be filtered on platforms
> which don't require this.
> 
> I had also assumed the invalidate_range() callbacks were allowed to
> sleep, hence couldn't be called under PTL. That's certainly true of mmu
> interval notifier callbacks, but Catalin reminded me that calls such as
> ptep_clear_flush_notify() already call invalidate_range() callback under
> PTL so I guess we already assume drivers don't sleep in their
> invalidate_range() callbacks. I will update the comments to reflect
> that.
> 
> > 	 * If invalidate_range() is used to manage a non-CPU TLB with
> > 	 * shared page-tables, it not necessary to implement the
> > 	 * invalidate_range_start()/end() notifiers, as
> > 	 * invalidate_range() already catches the points in time when an
> > 	 * external TLB range needs to be flushed. For more in depth
> > 	 * discussion on this see Documentation/mm/mmu_notifier.rst
> >
> > Even worse, this change may silently regress performance for secondary MMUs that
> > haven't yet taken advantage of the event type, e.g. KVM will zap all of KVM's PTEs
> > in response to the upgrade, instead of waiting until the guest actually tries to
> > utilize the new protections.
> 
> Yeah, I like the idea of introducing a
> ptep_set_access_flags_notify(). That way this won't regress performance
> on platforms that don't need it. Note this isn't a new feature but
> rather a bugfix. It's unclear to me why KVM on ARM hasn't already run
> into this issue, but I'm no KVM expert. Thanks for the feedback.

KVM manages its own page tables and so does its own TLB invalidations as needed,
e.g. KVM can and does change KVM's stage-2 PTEs from read-only to read/write
irrespective of mmu_notifiers.  I assume the SMMU issue arises only because the
SMMU is reusing the host kernel's (stage-1?) page tables.

Alistair Popple May 23, 2023, 12:43 a.m. UTC | #9

Sean Christopherson <seanjc@google.com> writes:

> On Tue, May 23, 2023, Alistair Popple wrote:
>> 
>> Sean Christopherson <seanjc@google.com> writes:

[...]

>> > 	 * If invalidate_range() is used to manage a non-CPU TLB with
>> > 	 * shared page-tables, it not necessary to implement the
>> > 	 * invalidate_range_start()/end() notifiers, as
>> > 	 * invalidate_range() already catches the points in time when an
>> > 	 * external TLB range needs to be flushed. For more in depth
>> > 	 * discussion on this see Documentation/mm/mmu_notifier.rst
>> >
>> > Even worse, this change may silently regress performance for secondary MMUs that
>> > haven't yet taken advantage of the event type, e.g. KVM will zap all of KVM's PTEs
>> > in response to the upgrade, instead of waiting until the guest actually tries to
>> > utilize the new protections.
>> 
>> Yeah, I like the idea of introducing a
>> ptep_set_access_flags_notify(). That way this won't regress performance
>> on platforms that don't need it. Note this isn't a new feature but
>> rather a bugfix. It's unclear to me why KVM on ARM hasn't already run
>> into this issue, but I'm no KVM expert. Thanks for the feedback.
>
> KVM manages its own page tables and so does its own TLB invalidations as needed,
> e.g. KVM can and does change KVM's stage-2 PTEs from read-only to read/write
> irrespective of mmu_notifiers.  I assume the SMMU issue arises only because the
> SMMU is reusing the host kernel's (stage-1?) page tables.

Argh, thanks. That makes sense. The SMMU issue arises because it is not
snooping CPU TLB invalidations and therefore relies entirely on notifier
callbacks to invalidate it's TLB. If it was snooping invalidations it
would observe the TLB invalidation ARM64 does in
ptep_set_access_flags()[1]. Now that I've figured out we can call
invalidate_range() under the PTL I think I can just add the notifier
call there.

[1] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/mm/fault.c?id=ae8373a5add4ea39f032563cf12a02946d1e3546#n229

John Hubbard May 23, 2023, 12:55 a.m. UTC | #10

On 5/21/23 23:37, Alistair Popple wrote:
> Some architectures, specifically ARM and perhaps Sparc and IA64,
> require TLB invalidates when upgrading pte permission from read-only
> to read-write.
> 
> The current mmu_notifier implementation assumes that upgrades do not
> need notifications. Typically though mmu_notifiers are used to
> implement TLB invalidations for secondary MMUs that comply with the
> main CPU architecture.
> 
> Therefore if the main CPU architecture requires an invalidation for
> permission upgrade the secondary MMU will as well and an mmu_notifier
> should be sent for the upgrade.
> 
> Currently CPU invalidations for permission upgrade occur in
> ptep_set_access_flags(). Unfortunately MMU notifiers cannot be called
> directly from this architecture specific code as the notifier
> callbacks can sleep, and ptep_set_access_flags() is usually called
> whilst holding the PTL spinlock. Therefore add the notifier calls
> after the PTL is dropped and only if the PTE actually changed. This
> will allow secondary MMUs to obtain an updated PTE with appropriate
> permissions.
> 
> This problem was discovered during testing of an ARM SMMU
> implementation that does not support broadcast TLB maintenance
> (BTM). In this case the SMMU driver uses notifiers to issue TLB
> invalidates. For read-only to read-write pte upgrades the SMMU
> continually returned a read-only PTE to the device, even though the
> CPU had a read-write PTE installed.
> 
> Sending a mmu notifier event to the SMMU driver fixes the problem by
> flushing secondary TLB entries. A new notifier event type is added so
> drivers may filter out these invalidations if not required. Note a
> driver should never upgrade or install a PTE in response to this mmu
> notifier event as it is not synchronised against other PTE operations.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> ---
>   include/linux/mmu_notifier.h |  6 +++++
>   mm/hugetlb.c                 | 24 ++++++++++++++++++-
>   mm/memory.c                  | 45 ++++++++++++++++++++++++++++++++++--
>   3 files changed, 72 insertions(+), 3 deletions(-)

Point of order:

What is this based on? It would really help if you would either use --base
with git-format-patch, or else just mention the base somewhere. Otherwise,
actually applying the patch takes some hunting around...in particular for
older stuff. This is from Feb, 2023, before hugetlb.c got converted to
folios, right?

thanks,

Alistair Popple May 23, 2023, 1:12 a.m. UTC | #11

John Hubbard <jhubbard@nvidia.com> writes:

> On 5/21/23 23:37, Alistair Popple wrote:
>> Some architectures, specifically ARM and perhaps Sparc and IA64,
>> require TLB invalidates when upgrading pte permission from read-only
>> to read-write.
>> The current mmu_notifier implementation assumes that upgrades do not
>> need notifications. Typically though mmu_notifiers are used to
>> implement TLB invalidations for secondary MMUs that comply with the
>> main CPU architecture.
>> Therefore if the main CPU architecture requires an invalidation for
>> permission upgrade the secondary MMU will as well and an mmu_notifier
>> should be sent for the upgrade.
>> Currently CPU invalidations for permission upgrade occur in
>> ptep_set_access_flags(). Unfortunately MMU notifiers cannot be called
>> directly from this architecture specific code as the notifier
>> callbacks can sleep, and ptep_set_access_flags() is usually called
>> whilst holding the PTL spinlock. Therefore add the notifier calls
>> after the PTL is dropped and only if the PTE actually changed. This
>> will allow secondary MMUs to obtain an updated PTE with appropriate
>> permissions.
>> This problem was discovered during testing of an ARM SMMU
>> implementation that does not support broadcast TLB maintenance
>> (BTM). In this case the SMMU driver uses notifiers to issue TLB
>> invalidates. For read-only to read-write pte upgrades the SMMU
>> continually returned a read-only PTE to the device, even though the
>> CPU had a read-write PTE installed.
>> Sending a mmu notifier event to the SMMU driver fixes the problem by
>> flushing secondary TLB entries. A new notifier event type is added so
>> drivers may filter out these invalidations if not required. Note a
>> driver should never upgrade or install a PTE in response to this mmu
>> notifier event as it is not synchronised against other PTE operations.
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> ---
>>   include/linux/mmu_notifier.h |  6 +++++
>>   mm/hugetlb.c                 | 24 ++++++++++++++++++-
>>   mm/memory.c                  | 45 ++++++++++++++++++++++++++++++++++--
>>   3 files changed, 72 insertions(+), 3 deletions(-)
>
> Point of order:
>
> What is this based on? It would really help if you would either use --base
> with git-format-patch, or else just mention the base somewhere. Otherwise,
> actually applying the patch takes some hunting around...in particular for
> older stuff. This is from Feb, 2023, before hugetlb.c got converted to
> folios, right?

Yes, my bad. This is based on v6.2. I had meant to rebase it prior to
sending but forgot. Based on the other review comments though I will be
reworking this to put the notifier call in ptep_set_access_flags() on
Arm anyway.

> thanks,

John Hubbard May 23, 2023, 1:13 a.m. UTC | #12

On 5/22/23 16:50, Alistair Popple wrote:
...
>> Again from include/linux/mmu_notifier.h, not implementing the start()/end() hooks
>> is perfectly valid.  And AFAICT, the existing invalidate_range() hook is pretty
>> much a perfect fit for what you want to achieve.
> 
> Right, I didn't take that approach because it doesn't allow an event
> type to be passed which would allow them to be filtered on platforms
> which don't require this.
> 
> I had also assumed the invalidate_range() callbacks were allowed to
> sleep, hence couldn't be called under PTL. That's certainly true of mmu
> interval notifier callbacks, but Catalin reminded me that calls such as
> ptep_clear_flush_notify() already call invalidate_range() callback under
> PTL so I guess we already assume drivers don't sleep in their
> invalidate_range() callbacks. I will update the comments to reflect

This line of reasoning feels very fragile. The range notifiers generally
do allow sleeping. They are using srcu (sleepable RCU) protection, btw.

The fact that existing callers are calling these under PTL just means
that so far, that has sorta worked. And yes, we can probably make this
all work. That's not really the ideal way to deduce the API rules, though,
and it would be good to clarify what they really are.

Aside from those use cases, I don't see anything justifying a "not allowed
to sleep" rule for .invalidate_range(), right?

thanks,

Alistair Popple May 23, 2023, 4:35 a.m. UTC | #13

John Hubbard <jhubbard@nvidia.com> writes:

> On 5/22/23 16:50, Alistair Popple wrote:
> ...
>>> Again from include/linux/mmu_notifier.h, not implementing the start()/end() hooks
>>> is perfectly valid.  And AFAICT, the existing invalidate_range() hook is pretty
>>> much a perfect fit for what you want to achieve.
>> Right, I didn't take that approach because it doesn't allow an event
>> type to be passed which would allow them to be filtered on platforms
>> which don't require this.
>> I had also assumed the invalidate_range() callbacks were allowed to
>> sleep, hence couldn't be called under PTL. That's certainly true of mmu
>> interval notifier callbacks, but Catalin reminded me that calls such as
>> ptep_clear_flush_notify() already call invalidate_range() callback under
>> PTL so I guess we already assume drivers don't sleep in their
>> invalidate_range() callbacks. I will update the comments to reflect
>
> This line of reasoning feels very fragile. The range notifiers generally
> do allow sleeping. They are using srcu (sleepable RCU) protection, btw.

Regardless of how well documented this is or isn't (it isn't currently,
but it used to be) it certainly seems to be a well established rule that
the .invalidate_range() callback cannot sleep. The vast majority of
callers do call this holding the PTL, and comments make it explicit that
this is somewhat expected:

Eg: In rmap.c:

		 * No need to call mmu_notifier_invalidate_range() it has be
		 * done above for all cases requiring it to happen under page
		 * table lock before mmu_notifier_invalidate_range_end()

> The fact that existing callers are calling these under PTL just means
> that so far, that has sorta worked. And yes, we can probably make this
> all work. That's not really the ideal way to deduce the API rules, though,
> and it would be good to clarify what they really are.

Of course not. I will update the documentation to clarify this, but see
below for some history which clarifies how we got here.

> Aside from those use cases, I don't see anything justifying a "not allowed
> to sleep" rule for .invalidate_range(), right?

Except that "those use cases" are approximately all of the use cases
AFAICT, and as it turns out this was actually a rule when
.invalidate_range() was added.

Commit 0f0a327fa12c ("mmu_notifier: add the callback for
mmu_notifier_invalidate_range()") included this in the documentation:

        * The invalidate_range() function is called under the ptl
        * spin-lock and not allowed to sleep.

This was later removed in 5ff7091f5a2c ("mm, mmu_notifier: annotate mmu
notifiers with blockable invalidate callbacks") which introduced the
MMU_INVALIDATE_DOES_NOT_BLOCK flag:

        * If this [invalidate_range()] callback cannot block, and invalidate_range_{start,end}
        * cannot block, mmu_notifier_ops.flags should have
        * MMU_INVALIDATE_DOES_NOT_BLOCK set.

However the removal of the original comment seems wrong -
invalidate_range() was still getting called under the ptl and therefore
could not block regardless of if MMU_INVALIDATE_DOES_NOT_BLOCK was set
or not.

Of course the flag and related documentation was removed shortly after
by 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu
notifiers") and 4e15a073a168 ("Revert "mm, mmu_notifier: annotate mmu
notifiers with blockable invalidate callbacks"")

None of those changes actually made it safe for .invalidate_range()
callbacks to sleep, nor was that their goal. They were all about making
sure it was ok for .invalidate_range_start() to sleep AFAICT.

So I think it's perfectly fine to require .invalidate_range() callbacks
to be non-blocking, and if they are that's a driver bug. Note that this
isn't talking about mmu *interval* notifiers - they are slightly
different and don't hook into the mmu_notifier_invalidate_range() call.
They use start()/end() and as such are allowed to sleep.

 - Alistair

> thanks,

Christoph Hellwig May 23, 2023, 6:45 a.m. UTC | #14

On Mon, May 22, 2023 at 04:37:25PM +1000, Alistair Popple wrote:
> This problem was discovered during testing of an ARM SMMU
> implementation that does not support broadcast TLB maintenance
> (BTM). In this case the SMMU driver uses notifiers to issue TLB
> invalidates. For read-only to read-write pte upgrades the SMMU
> continually returned a read-only PTE to the device, even though the
> CPU had a read-write PTE installed.
> 
> Sending a mmu notifier event to the SMMU driver fixes the problem by
> flushing secondary TLB entries. A new notifier event type is added so
> drivers may filter out these invalidations if not required. Note a
> driver should never upgrade or install a PTE in response to this mmu
> notifier event as it is not synchronised against other PTE operations.

I don't see these SMMU driver changes anywhere.  I.e. you're adding dead
code as far as I can tell.

mmu_notifiers: Notify on pte permission upgrades

Commit Message

Comments

Patch