
[v2,2/3] mm: rmap: Move the cache flushing to the correct place for hugetlb PMD sharing

Message ID 4f7ae6dfdc838ab71e1655188b657c032ff1f28f.1651056365.git.baolin.wang@linux.alibaba.com (mailing list archive)
State New
Series Fix cache flush issues considering PMD sharing

Commit Message

Baolin Wang April 27, 2022, 10:52 a.m. UTC
The cache-level flush must always come first when changing an existing
virtual->physical mapping to a new value, since this allows us to
properly handle systems whose caches are strict and require a
virtual->physical translation to exist for a virtual address. So we
should move the cache flushing before huge_pmd_unshare().

As Muchun pointed out[1], the architectures that currently support
hugetlb PMD sharing have no cache flush issues in practice. But I think
we should still follow the cache/TLB flushing rules when changing a
valid virtual address mapping, in case of potential issues in the
future.

[1] https://lore.kernel.org/all/YmT%2F%2FhuUbFX+KHcy@FVFYT0MHHV2J.usts.net/
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/rmap.c | 40 ++++++++++++++++++++++------------------
 1 file changed, 22 insertions(+), 18 deletions(-)

Comments

Muchun Song April 28, 2022, 5:55 a.m. UTC | #1
On Wed, Apr 27, 2022 at 06:52:06PM +0800, Baolin Wang wrote:
> The cache-level flush must always come first when changing an existing
> virtual->physical mapping to a new value, since this allows us to
> properly handle systems whose caches are strict and require a
> virtual->physical translation to exist for a virtual address. So we
> should move the cache flushing before huge_pmd_unshare().
>

Right.

> As Muchun pointed out[1], the architectures that currently support
> hugetlb PMD sharing have no cache flush issues in practice. But I think
> we should still follow the cache/TLB flushing rules when changing a
> valid virtual address mapping, in case of potential issues in the
> future.

Right. One point I need to clarify: I do not object to this change, but
I want you to note in the commit log that this is not an issue in
practice, so others know they do not need to backport it.

> 
> [1] https://lore.kernel.org/all/YmT%2F%2FhuUbFX+KHcy@FVFYT0MHHV2J.usts.net/
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  mm/rmap.c | 40 ++++++++++++++++++++++------------------
>  1 file changed, 22 insertions(+), 18 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 61e63db..4f0d115 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1535,15 +1535,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  			 * do this outside rmap routines.
>  			 */
>  			VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> +			/*
> +			 * huge_pmd_unshare may unmap an entire PMD page.
> +			 * There is no way of knowing exactly which PMDs may
> +			 * be cached for this mm, so we must flush them all.
> +			 * start/end were already adjusted above to cover this
> +			 * range.
> +			 */
> +			flush_cache_range(vma, range.start, range.end);
> +

flush_cache_range() is always called even if we do not need to flush.
How about introducing a new helper like hugetlb_pmd_shared() which
returns true for shared PMD? Then:

	if (hugetlb_pmd_shared(mm, vma, pvmw.pte)) {
		flush_cache_range(vma, range.start, range.end);
		huge_pmd_unshare(mm, vma, &address, pvmw.pte);
		flush_tlb_range(vma, range.start, range.end);
	}

The code could be a little simpler. Right?
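
A rough sketch of what such a helper could look like (just an
illustration, not a final implementation; it reuses the PMD page
refcount check that huge_pmd_unshare() already relies on, and only the
pte argument would really be needed for the check):

	static bool hugetlb_pmd_shared(pte_t *pte)
	{
		/*
		 * The PMD page table page gets an extra reference for
		 * each sharer, so a count > 1 means it is shared.
		 */
		return page_count(virt_to_page(pte)) > 1;
	}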

Thanks.
Baolin Wang April 28, 2022, 7:06 a.m. UTC | #2
On 4/28/2022 1:55 PM, Muchun Song wrote:
> On Wed, Apr 27, 2022 at 06:52:06PM +0800, Baolin Wang wrote:
>> The cache-level flush must always come first when changing an existing
>> virtual->physical mapping to a new value, since this allows us to
>> properly handle systems whose caches are strict and require a
>> virtual->physical translation to exist for a virtual address. So we
>> should move the cache flushing before huge_pmd_unshare().
>>
> 
> Right.
> 
>> As Muchun pointed out[1], the architectures that currently support
>> hugetlb PMD sharing have no cache flush issues in practice. But I think
>> we should still follow the cache/TLB flushing rules when changing a
>> valid virtual address mapping, in case of potential issues in the
>> future.
> 
> Right. One point I need to clarify: I do not object to this change, but
> I want you to note in the commit log that this is not an issue in
> practice, so others know they do not need to backport it.
> 
>>
>> [1] https://lore.kernel.org/all/YmT%2F%2FhuUbFX+KHcy@FVFYT0MHHV2J.usts.net/
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>>   mm/rmap.c | 40 ++++++++++++++++++++++------------------
>>   1 file changed, 22 insertions(+), 18 deletions(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 61e63db..4f0d115 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1535,15 +1535,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>   			 * do this outside rmap routines.
>>   			 */
>>   			VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
>> +			/*
>> +			 * huge_pmd_unshare may unmap an entire PMD page.
>> +			 * There is no way of knowing exactly which PMDs may
>> +			 * be cached for this mm, so we must flush them all.
>> +			 * start/end were already adjusted above to cover this
>> +			 * range.
>> +			 */
>> +			flush_cache_range(vma, range.start, range.end);
>> +
> 
> flush_cache_range() is always called even if we do not need to flush.

Right, this is intended. In the original code, if it is not a shared 
PMD, we use flush_cache_page() to do the cache flushing. However, 
flush_cache_page() cannot cover the whole size of a hugetlb page on 
some architectures, which is fixed by patch 3.
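
Just to illustrate what "the whole size of a hugetlb page" means here
(a sketch only, not the actual patch 3 hunk): on those architectures
the flush would have to cover the full huge page rather than a single
PAGE_SIZE page, e.g.:

	/*
	 * flush_cache_page() only covers PAGE_SIZE at 'address'; a hugetlb
	 * mapping spans huge_page_size(), so flush the full range instead.
	 */
	flush_cache_range(vma, address,
			  address + huge_page_size(hstate_vma(vma)));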

> How about introducing a new helper like hugetlb_pmd_shared() which
> returns true for shared PMD? Then:
> 
> 	if (hugetlb_pmd_shared(mm, vma, pvmw.pte)) {
> 		flush_cache_range(vma, range.start, range.end);
> 		huge_pmd_unshare(mm, vma, &address, pvmw.pte);
> 		flush_tlb_range(vma, range.start, range.end);
> 	}
> 
> The code could be a little simpler. Right?

IMHO, after patch 3 the code will be changed as below, so there seems to 
be no need to separate the shared-PMD check out of huge_pmd_unshare() 
into a new function.

if (folio_test_hugetlb(folio)) {
	flush_cache_range(vma, range.start, range.end);

	if (!folio_test_anon(folio)) {
		VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));

		if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
			flush_tlb_range(vma, range.start, range.end);
			......
			break;
		}
	}
} else {
	......
}
Mike Kravetz May 3, 2022, 6:42 p.m. UTC | #3
On 4/27/22 22:55, Muchun Song wrote:
> On Wed, Apr 27, 2022 at 06:52:06PM +0800, Baolin Wang wrote:
>> The cache-level flush must always come first when changing an existing
>> virtual->physical mapping to a new value, since this allows us to
>> properly handle systems whose caches are strict and require a
>> virtual->physical translation to exist for a virtual address. So we
>> should move the cache flushing before huge_pmd_unshare().
>>
> 
> Right.
> 
>> As Muchun pointed out[1], the architectures that currently support
>> hugetlb PMD sharing have no cache flush issues in practice. But I think
>> we should still follow the cache/TLB flushing rules when changing a
>> valid virtual address mapping, in case of potential issues in the
>> future.
> 
> Right. One point I need to clarify: I do not object to this change, but
> I want you to note in the commit log that this is not an issue in
> practice, so others know they do not need to backport it.
> 
>>
>> [1] https://lore.kernel.org/all/YmT%2F%2FhuUbFX+KHcy@FVFYT0MHHV2J.usts.net/
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>>  mm/rmap.c | 40 ++++++++++++++++++++++------------------
>>  1 file changed, 22 insertions(+), 18 deletions(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 61e63db..4f0d115 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1535,15 +1535,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>  			 * do this outside rmap routines.
>>  			 */
>>  			VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
>> +			/*
>> +			 * huge_pmd_unshare may unmap an entire PMD page.
>> +			 * There is no way of knowing exactly which PMDs may
>> +			 * be cached for this mm, so we must flush them all.
>> +			 * start/end were already adjusted above to cover this
>> +			 * range.
>> +			 */
>> +			flush_cache_range(vma, range.start, range.end);
>> +
> 
> flush_cache_range() is always called even if we do not need to flush.
> How about introducing a new helper like hugetlb_pmd_shared() which
> returns true for shared PMD? Then:
> 
> 	if (hugetlb_pmd_shared(mm, vma, pvmw.pte)) {
> 		flush_cache_range(vma, range.start, range.end);
> 		huge_pmd_unshare(mm, vma, &address, pvmw.pte);
> 		flush_tlb_range(vma, range.start, range.end);
> 	}
> 
> The code could be a little simpler. Right?
> 
> Thanks.
> 

I thought about adding a 'hugetlb_pmd_shared()' interface for another use.
I believe it could even be used earlier in this call sequence.  Since we
hold i_mmap_rwsem, we could even test for shared BEFORE calling
adjust_range_if_pmd_sharing_possible.  We cannot make an authoritative test
in adjust range... because not all callers will be holding i_mmap_rwsem.

I think we COULD optimize to minimize the flush range.  However, I think
that would complicate this code even more, and it is difficult enough to
follow.

My preference would be to over-flush, as is done here, for correctness and
simplification.  We can optimize later if desired.

With Muchun's comment that this is not an issue in practice today,
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Baolin Wang May 4, 2022, 2:50 a.m. UTC | #4
On 5/4/2022 2:42 AM, Mike Kravetz wrote:
> On 4/27/22 22:55, Muchun Song wrote:
>> On Wed, Apr 27, 2022 at 06:52:06PM +0800, Baolin Wang wrote:
>>> The cache-level flush must always come first when changing an existing
>>> virtual->physical mapping to a new value, since this allows us to
>>> properly handle systems whose caches are strict and require a
>>> virtual->physical translation to exist for a virtual address. So we
>>> should move the cache flushing before huge_pmd_unshare().
>>>
>>
>> Right.
>>
>>> As Muchun pointed out[1], the architectures that currently support
>>> hugetlb PMD sharing have no cache flush issues in practice. But I think
>>> we should still follow the cache/TLB flushing rules when changing a
>>> valid virtual address mapping, in case of potential issues in the
>>> future.
>>
>> Right. One point I need to clarify: I do not object to this change, but
>> I want you to note in the commit log that this is not an issue in
>> practice, so others know they do not need to backport it.
>>
>>>
>>> [1] https://lore.kernel.org/all/YmT%2F%2FhuUbFX+KHcy@FVFYT0MHHV2J.usts.net/
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> ---
>>>   mm/rmap.c | 40 ++++++++++++++++++++++------------------
>>>   1 file changed, 22 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 61e63db..4f0d115 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1535,15 +1535,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>   			 * do this outside rmap routines.
>>>   			 */
>>>   			VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
>>> +			/*
>>> +			 * huge_pmd_unshare may unmap an entire PMD page.
>>> +			 * There is no way of knowing exactly which PMDs may
>>> +			 * be cached for this mm, so we must flush them all.
>>> +			 * start/end were already adjusted above to cover this
>>> +			 * range.
>>> +			 */
>>> +			flush_cache_range(vma, range.start, range.end);
>>> +
>>
>> flush_cache_range() is always called even if we do not need to flush.
>> How about introducing a new helper like hugetlb_pmd_shared() which
>> returns true for shared PMD? Then:
>>
>> 	if (hugetlb_pmd_shared(mm, vma, pvmw.pte)) {
>> 		flush_cache_range(vma, range.start, range.end);
>> 		huge_pmd_unshare(mm, vma, &address, pvmw.pte);
>> 		flush_tlb_range(vma, range.start, range.end);
>> 	}
>>
>> The code could be a little simpler. Right?
>>
>> Thanks.
>>
> 
> I thought about adding a 'hugetlb_pmd_shared()' interface for another use.
> I believe it could even be used earlier in this call sequence.  Since we
> hold i_mmap_rwsem, we would even test for shared BEFORE calling
> adjust_range_if_pmd_sharing_possible.  We can not make an authoritative test
> in adjust range... because not all callers will be holding i_mmap_rwsem.
> 
> I think we COULD optimize to minimize the flush range.  However, I think
> that would complicate this code even more, and it is difficult enough to
> follow.
> 
> My preference would be to over flush as is done here for correctness and
> simplification.  We can optimize later if desired.

OK. Agree.

> 
> With Muchun's comment that this is not an issue in practice today,
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

Thanks.

Patch

diff --git a/mm/rmap.c b/mm/rmap.c
index 61e63db..4f0d115 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1535,15 +1535,16 @@  static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			 * do this outside rmap routines.
 			 */
 			VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
+			/*
+			 * huge_pmd_unshare may unmap an entire PMD page.
+			 * There is no way of knowing exactly which PMDs may
+			 * be cached for this mm, so we must flush them all.
+			 * start/end were already adjusted above to cover this
+			 * range.
+			 */
+			flush_cache_range(vma, range.start, range.end);
+
 			if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
-				/*
-				 * huge_pmd_unshare unmapped an entire PMD
-				 * page.  There is no way of knowing exactly
-				 * which PMDs may be cached for this mm, so
-				 * we must flush them all.  start/end were
-				 * already adjusted above to cover this range.
-				 */
-				flush_cache_range(vma, range.start, range.end);
 				flush_tlb_range(vma, range.start, range.end);
 				mmu_notifier_invalidate_range(mm, range.start,
 							      range.end);
@@ -1560,13 +1561,14 @@  static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				page_vma_mapped_walk_done(&pvmw);
 				break;
 			}
+		} else {
+			flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
 		}
 
 		/*
 		 * Nuke the page table entry. When having to clear
 		 * PageAnonExclusive(), we always have to flush.
 		 */
-		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
 		if (should_defer_flush(mm, flags) && !anon_exclusive) {
 			/*
 			 * We clear the PTE but do not flush so potentially
@@ -1890,15 +1892,16 @@  static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			 * do this outside rmap routines.
 			 */
 			VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
+			/*
+			 * huge_pmd_unshare may unmap an entire PMD page.
+			 * There is no way of knowing exactly which PMDs may
+			 * be cached for this mm, so we must flush them all.
+			 * start/end were already adjusted above to cover this
+			 * range.
+			 */
+			flush_cache_range(vma, range.start, range.end);
+
 			if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
-				/*
-				 * huge_pmd_unshare unmapped an entire PMD
-				 * page.  There is no way of knowing exactly
-				 * which PMDs may be cached for this mm, so
-				 * we must flush them all.  start/end were
-				 * already adjusted above to cover this range.
-				 */
-				flush_cache_range(vma, range.start, range.end);
 				flush_tlb_range(vma, range.start, range.end);
 				mmu_notifier_invalidate_range(mm, range.start,
 							      range.end);
@@ -1915,10 +1918,11 @@  static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 				page_vma_mapped_walk_done(&pvmw);
 				break;
 			}
+		} else {
+			flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
 		}
 
 		/* Nuke the page table entry. */
-		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
 		pteval = ptep_clear_flush(vma, address, pvmw.pte);
 
 		/* Set the dirty flag on the folio now the pte is gone. */