
[v3,2/3] mm/migrate_device.c: Copy pte dirty bit to page

Message ID ffbc824af5daa2c44b91c66834a341894fba4ce6.1661309831.git-series.apopple@nvidia.com (mailing list archive)
State: New
Series: [v3,1/3] mm/migrate_device.c: Flush TLB while holding PTL

Commit Message

Alistair Popple Aug. 24, 2022, 3:03 a.m. UTC
migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
installs migration entries directly if it can lock the migrating page.
When removing a dirty pte, the dirty bit is supposed to be carried over
to the underlying page to prevent it from being lost.

Currently migrate_vma_*() can only be used for private anonymous
mappings. That means loss of the dirty bit usually doesn't result in
data loss because these pages are typically not file-backed. However,
pages may be backed by swap storage, which can result in data loss if an
attempt is made to migrate a dirty page that doesn't yet have the
PageDirty flag set.

In this case migration will fail due to unexpected references, but the
dirty pte bit will be lost. If the page is later reclaimed, data
won't be written back to swap storage as it is considered uptodate,
resulting in data loss if the page is subsequently accessed.

Prevent this by copying the dirty bit to the page when removing the pte
to match what try_to_migrate_one() does.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reported-by: Huang Ying <ying.huang@intel.com>
Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
Cc: stable@vger.kernel.org

---

Changes for v3:

 - Defer TLB flushing
 - Split a TLB flushing fix into a separate change.

Changes for v2:

 - Fixed up Reported-by tag.
 - Added Peter's Acked-by.
 - Atomically read and clear the pte to prevent the dirty bit getting
   set after reading it.
 - Added fixes tag
---
 mm/migrate_device.c |  9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)
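
For readers skimming the thread, the core of the change inside
migrate_vma_collect_pmd() can be condensed to the sketch below (an
illustration only; the complete hunk is in the "Patch" section at the
bottom of this page):

	/* Before: the cleared pte value was thrown away, so a
	 * hardware-set dirty bit never reached the folio. */
	ptep_get_and_clear(mm, addr, ptep);

	/* After: capture the pte while clearing it and transfer the
	 * dirty bit, so reclaim knows the data must be written back. */
	pte = ptep_get_and_clear(mm, addr, ptep);
	if (pte_dirty(pte))
		folio_mark_dirty(page_folio(page));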

Comments

Peter Xu Aug. 24, 2022, 3:39 p.m. UTC | #1
On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote:
> [...]
> @@ -196,7 +197,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  			anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
>  			if (anon_exclusive) {
>  				flush_cache_page(vma, addr, pte_pfn(*ptep));
> -				ptep_clear_flush(vma, addr, ptep);
> +				pte = ptep_clear_flush(vma, addr, ptep);
>  
>  				if (page_try_share_anon_rmap(page)) {
>  					set_pte_at(mm, addr, ptep, pte);
> @@ -206,11 +207,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  					goto next;
>  				}
>  			} else {
> -				ptep_get_and_clear(mm, addr, ptep);
> +				pte = ptep_get_and_clear(mm, addr, ptep);
>  			}

I remember that in v2 both flush_cache_page() and ptep_get_and_clear() were
moved above the condition check so they're called unconditionally.  Could
you explain the rationale for why it's changed back (since I think v2 was the
correct approach)?

The other question is if we want to split the patch, would it be better to
move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?

Alistair Popple Aug. 25, 2022, 10:21 p.m. UTC | #2
Peter Xu <peterx@redhat.com> writes:

> On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote:
>> [...]
>
> I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are
> moved above the condition check so they're called unconditionally.  Could
> you explain the rational on why it's changed back (since I think v2 was the
> correct approach)?

Mainly because I agree with your original comments that it would be
better to keep the batching of TLB flushing if possible. After the
discussion I don't think there are any issues with HW pte dirty bits
here. There are already other cases where HW needs to get that right
anyway (eg. zap_pte_range; see the sketch below).

> The other question is if we want to split the patch, would it be better to
> move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?

Isn't that already the case? Patch 1 moves the TLB flush before the PTL
as suggested, patch 2 atomically copies the dirty bit without changing
any TLB flushing.
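
For reference, the zap_pte_range() case mentioned above handles the same
situation with a deferred (batched) TLB flush while still carrying the
hardware dirty bit over to the page. Roughly, paraphrased from
mm/memory.c and simplified (the real code has extra conditions around
this):

	ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
	tlb_remove_tlb_entry(tlb, pte, addr);	/* TLB flush is batched */
	if (pte_dirty(ptent))
		set_page_dirty(page);		/* dirty bit is not lost */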

Peter Xu Aug. 25, 2022, 11:27 p.m. UTC | #3
On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote:
> 
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote:
> >> [...]
> >
> > I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are
> > moved above the condition check so they're called unconditionally.  Could
> > you explain the rational on why it's changed back (since I think v2 was the
> > correct approach)?
> 
> Mainly because I agree with your original comments, that it would be
> better to keep the batching of TLB flushing if possible. After the
> discussion I don't think there is any issues with HW pte dirty bits
> here. There are already other cases where HW needs to get that right
> anyway (eg. zap_pte_range).

Yes, TLB batching was kept, thanks for doing it that way.  Though if we only
apply patch 1 we'll have both ptep_clear_flush() and the batched flush, which
seems redundant.

> 
> > The other question is if we want to split the patch, would it be better to
> > move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?
> 
> Isn't that already the case? Patch 1 moves the TLB flush before the PTL
> as suggested, patch 2 atomically copies the dirty bit without changing
> any TLB flushing.

IMHO it's cleaner to have patch 1 fix the batched flush, replace
ptep_clear_flush() with ptep_get_and_clear(), and update the pte properly.

No strong opinions on the layout, but I still think we should drop the
redundant ptep_clear_flush() above and meanwhile add flush_cache_page()
properly for the !exclusive case too.

Thanks,
Alistair Popple Aug. 26, 2022, 1:02 a.m. UTC | #4
Peter Xu <peterx@redhat.com> writes:

> On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote:
>>
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote:
>> >> [...]
>> >
>> > I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are
>> > moved above the condition check so they're called unconditionally.  Could
>> > you explain the rational on why it's changed back (since I think v2 was the
>> > correct approach)?
>>
>> Mainly because I agree with your original comments, that it would be
>> better to keep the batching of TLB flushing if possible. After the
>> discussion I don't think there is any issues with HW pte dirty bits
>> here. There are already other cases where HW needs to get that right
>> anyway (eg. zap_pte_range).
>
> Yes tlb batching was kept, thanks for doing that way.  Though if only apply
> patch 1 we'll have both ptep_clear_flush() and batched flush which seems to
> be redundant.
>
>>
>> > The other question is if we want to split the patch, would it be better to
>> > move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?
>>
>> Isn't that already the case? Patch 1 moves the TLB flush before the PTL
>> as suggested, patch 2 atomically copies the dirty bit without changing
>> any TLB flushing.
>
> IMHO it's cleaner to have patch 1 fix batch flush, replace
> ptep_clear_flush() with ptep_get_and_clear() and update pte properly.

Which ptep_clear_flush() are you referring to? This one?

			if (anon_exclusive) {
				flush_cache_page(vma, addr, pte_pfn(*ptep));
				ptep_clear_flush(vma, addr, ptep);

My understanding is that we need to do a flush for anon_exclusive.

> No strong opinions on the layout, but I still think we should drop the
> redundant ptep_clear_flush() above, meanwhile add the flush_cache_page()
> properly for !exclusive case too.

Good point, we need flush_cache_page() for !exclusive. Will add.

> Thanks,
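
A minimal sketch of the follow-up Alistair mentions above (an assumption
about the direction only, not the change that was actually posted):

	/* Hypothetical: do the cache flush unconditionally before
	 * clearing the pte so the !exclusive path is covered too. */
	flush_cache_page(vma, addr, pte_pfn(*ptep));
	anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
	if (anon_exclusive) {
		pte = ptep_clear_flush(vma, addr, ptep);
		/* page_try_share_anon_rmap() handling unchanged */
	} else {
		pte = ptep_get_and_clear(mm, addr, ptep);
	}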
Huang, Ying Aug. 26, 2022, 1:14 a.m. UTC | #5
Alistair Popple <apopple@nvidia.com> writes:

> Peter Xu <peterx@redhat.com> writes:
>
>> On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote:
>>>
>>> Peter Xu <peterx@redhat.com> writes:
>>>
>>> > On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote:
>>> >> [...]
>>> >
>>> > I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are
>>> > moved above the condition check so they're called unconditionally.  Could
>>> > you explain the rational on why it's changed back (since I think v2 was the
>>> > correct approach)?
>>>
>>> Mainly because I agree with your original comments, that it would be
>>> better to keep the batching of TLB flushing if possible. After the
>>> discussion I don't think there is any issues with HW pte dirty bits
>>> here. There are already other cases where HW needs to get that right
>>> anyway (eg. zap_pte_range).
>>
>> Yes tlb batching was kept, thanks for doing that way.  Though if only apply
>> patch 1 we'll have both ptep_clear_flush() and batched flush which seems to
>> be redundant.
>>
>>>
>>> > The other question is if we want to split the patch, would it be better to
>>> > move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?
>>>
>>> Isn't that already the case? Patch 1 moves the TLB flush before the PTL
>>> as suggested, patch 2 atomically copies the dirty bit without changing
>>> any TLB flushing.
>>
>> IMHO it's cleaner to have patch 1 fix batch flush, replace
>> ptep_clear_flush() with ptep_get_and_clear() and update pte properly.
>
> Which ptep_clear_flush() are you referring to? This one?
>
> 			if (anon_exclusive) {
> 				flush_cache_page(vma, addr, pte_pfn(*ptep));
> 				ptep_clear_flush(vma, addr, ptep);
>
> My understanding is that we need to do a flush for anon_exclusive.
>
>> No strong opinions on the layout, but I still think we should drop the
>> redundant ptep_clear_flush() above, meanwhile add the flush_cache_page()
>> properly for !exclusive case too.
>
> Good point, we need flush_cache_page() for !exclusive. Will add.

That can be in another patch.  For the patch itself, it looks good to
me.  Feel free to add,

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

Best Regards,
Huang, Ying
Peter Xu Aug. 26, 2022, 2:32 p.m. UTC | #6
On Fri, Aug 26, 2022 at 11:02:58AM +1000, Alistair Popple wrote:
> 
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote:
> >>
> >> Peter Xu <peterx@redhat.com> writes:
> >>
> >> > On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote:
> >> >> [...]
> >> >
> >> > I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are
> >> > moved above the condition check so they're called unconditionally.  Could
> >> > you explain the rational on why it's changed back (since I think v2 was the
> >> > correct approach)?
> >>
> >> Mainly because I agree with your original comments, that it would be
> >> better to keep the batching of TLB flushing if possible. After the
> >> discussion I don't think there is any issues with HW pte dirty bits
> >> here. There are already other cases where HW needs to get that right
> >> anyway (eg. zap_pte_range).
> >
> > Yes tlb batching was kept, thanks for doing that way.  Though if only apply
> > patch 1 we'll have both ptep_clear_flush() and batched flush which seems to
> > be redundant.
> >
> >>
> >> > The other question is if we want to split the patch, would it be better to
> >> > move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?
> >>
> >> Isn't that already the case? Patch 1 moves the TLB flush before the PTL
> >> as suggested, patch 2 atomically copies the dirty bit without changing
> >> any TLB flushing.
> >
> > IMHO it's cleaner to have patch 1 fix batch flush, replace
> > ptep_clear_flush() with ptep_get_and_clear() and update pte properly.
> 
> Which ptep_clear_flush() are you referring to? This one?
> 
> 			if (anon_exclusive) {
> 				flush_cache_page(vma, addr, pte_pfn(*ptep));
> 				ptep_clear_flush(vma, addr, ptep);

Correct.

> 
> My understanding is that we need to do a flush for anon_exclusive.

To me anon exclusive only shows that this mm exclusively owns this page. I
didn't quickly figure out why that requires different handling of TLB
flushes.  Did I perhaps miss something?

Thanks,
David Hildenbrand Aug. 26, 2022, 2:47 p.m. UTC | #7
On 26.08.22 16:32, Peter Xu wrote:
> On Fri, Aug 26, 2022 at 11:02:58AM +1000, Alistair Popple wrote:
>>
>> Peter Xu <peterx@redhat.com> writes:
>>
>>> On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote:
>>>>
>>>> Peter Xu <peterx@redhat.com> writes:
>>>>
>>>>> On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote:
>>>>>> [...]
>>>>>
>>>>> I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are
>>>>> moved above the condition check so they're called unconditionally.  Could
>>>>> you explain the rational on why it's changed back (since I think v2 was the
>>>>> correct approach)?
>>>>
>>>> Mainly because I agree with your original comments, that it would be
>>>> better to keep the batching of TLB flushing if possible. After the
>>>> discussion I don't think there is any issues with HW pte dirty bits
>>>> here. There are already other cases where HW needs to get that right
>>>> anyway (eg. zap_pte_range).
>>>
>>> Yes tlb batching was kept, thanks for doing that way.  Though if only apply
>>> patch 1 we'll have both ptep_clear_flush() and batched flush which seems to
>>> be redundant.
>>>
>>>>
>>>>> The other question is if we want to split the patch, would it be better to
>>>>> move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?
>>>>
>>>> Isn't that already the case? Patch 1 moves the TLB flush before the PTL
>>>> as suggested, patch 2 atomically copies the dirty bit without changing
>>>> any TLB flushing.
>>>
>>> IMHO it's cleaner to have patch 1 fix batch flush, replace
>>> ptep_clear_flush() with ptep_get_and_clear() and update pte properly.
>>
>> Which ptep_clear_flush() are you referring to? This one?
>>
>> 			if (anon_exclusive) {
>> 				flush_cache_page(vma, addr, pte_pfn(*ptep));
>> 				ptep_clear_flush(vma, addr, ptep);
> 
> Correct.
> 
>>
>> My understanding is that we need to do a flush for anon_exclusive.
> 
> To me anon exclusive only shows this mm exclusively owns this page. I
> didn't quickly figure out why that requires different handling on tlb
> flushs.  Did I perhaps miss something?

GUP-fast is the magic bit; we have to make sure that we won't see new
GUP pins, hence the TLB flush.

include/linux/mm.h:gup_must_unshare() contains documentation.

Without GUP-fast, some things would be significantly easier to handle.
Peter Xu Aug. 26, 2022, 3:55 p.m. UTC | #8
On Fri, Aug 26, 2022 at 04:47:22PM +0200, David Hildenbrand wrote:
> > To me anon exclusive only shows this mm exclusively owns this page. I
> > didn't quickly figure out why that requires different handling on tlb
> > flushs.  Did I perhaps miss something?
> 
> GUP-fast is the magic bit, we have to make sure that we won't see new
> GUP pins, thus the TLB flush.
> 
> include/linux/mm.h:gup_must_unshare() contains documentation.

Hmm.. Shouldn't ptep_get_and_clear() (e.g., xchg() on x86_64) already
guarantee that no other process/thread will see this pte anymore
afterwards?
David Hildenbrand Aug. 26, 2022, 4:46 p.m. UTC | #9
On 26.08.22 17:55, Peter Xu wrote:
> On Fri, Aug 26, 2022 at 04:47:22PM +0200, David Hildenbrand wrote:
>>> To me anon exclusive only shows this mm exclusively owns this page. I
>>> didn't quickly figure out why that requires different handling on tlb
>>> flushs.  Did I perhaps miss something?
>>
>> GUP-fast is the magic bit, we have to make sure that we won't see new
>> GUP pins, thus the TLB flush.
>>
>> include/linux/mm.h:gup_must_unshare() contains documentation.
> 
> Hmm.. Shouldn't ptep_get_and_clear() (e.g., xchg() on x86_64) already
> guarantees that no other process/thread will see this pte anymore
> afterwards?

You could have a GUP-fast thread that just looked up the PTE and is
going to pin the page afterwards, after the ptep_get_and_clear()
returned. You'll have to wait until that thread has finished.

Another user that relies on this interaction between GUP-fast and TLB
flushing is, for example, mm/ksm.c:write_protect_page().

There is a comment in there explaining the interaction in a bit more detail.

Maybe we'll be able to handle this differently in the future (maybe once
this turns out to be an actual performance problem). Unfortunately,
mm->write_protect_seq isn't easily usable because we'd have to make
sure we're the exclusive writer.


For now, it's not too complicated. For PTEs:
* try_to_migrate_one() already uses ptep_clear_flush().
* try_to_unmap_one() already conditionally used ptep_clear_flush().
* migrate_vma_collect_pmd() was the one case that didn't use it already
 (and I wonder why it's different than try_to_migrate_one()).
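
For reference, the try_to_migrate_one() behaviour listed above is roughly
the following (paraphrased from mm/rmap.c, not the exact kernel code):

	flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
	pteval = ptep_clear_flush(vma, address, pvmw.pte);
	/* carry the dirty bit over to the folio, which is what this
	 * series now also does in migrate_vma_collect_pmd() */
	if (pte_dirty(pteval))
		folio_mark_dirty(folio);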
Peter Xu Aug. 26, 2022, 9:37 p.m. UTC | #10
On Fri, Aug 26, 2022 at 06:46:02PM +0200, David Hildenbrand wrote:
> On 26.08.22 17:55, Peter Xu wrote:
> > On Fri, Aug 26, 2022 at 04:47:22PM +0200, David Hildenbrand wrote:
> >>> To me anon exclusive only shows this mm exclusively owns this page. I
> >>> didn't quickly figure out why that requires different handling on tlb
> >>> flushs.  Did I perhaps miss something?
> >>
> >> GUP-fast is the magic bit, we have to make sure that we won't see new
> >> GUP pins, thus the TLB flush.
> >>
> >> include/linux/mm.h:gup_must_unshare() contains documentation.
> > 
> > Hmm.. Shouldn't ptep_get_and_clear() (e.g., xchg() on x86_64) already
> > guarantees that no other process/thread will see this pte anymore
> > afterwards?
> 
> You could have a GUP-fast thread that just looked up the PTE and is
> going to pin the page afterwards, after the ptep_get_and_clear()
> returned. You'll have to wait until that thread finished.

IIUC the early TLB flush won't protect concurrent fast-gup from happening,
but I think it's safe because fast-gup will check the pte after pinning, so
either:

  (1) fast-gup runs before ptep_get_and_clear(), in which case
      page_try_share_anon_rmap() will fail properly, or,

  (2) fast-gup runs during or after ptep_get_and_clear(), in which case
      fast-gup will see that the pte is either none or changed, and the
      fast-gup itself will fail.
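
For context, the recheck described in (2) corresponds roughly to the
following pattern in the fast-gup path (paraphrased from mm/gup.c, not the
exact kernel code):

	pte_t pte = ptep_get_lockless(ptep);
	/* ... pfn/permission checks elided ... */
	folio = try_grab_folio(page, 1, flags);
	if (!folio)
		goto pte_unmap;
	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
		/* pte was cleared or changed under us: drop the pin */
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}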

> 
> Another user that relies on this interaction between GUP-fast and TLB
> flushing is for example mm/ksm.c:write_protect_page()
> 
> There is a comment in there explaining the interaction a bit more detailed.
> 
> Maybe we'll be able to handle this differently in the future (maybe once
> this turns out to be an actual performance problem). Unfortunately,
> mm->write_protect_seq isn't easily usable because we'd need have to make
> sure we're the exclusive writer.
> 
> 
> For now, it's not too complicated. For PTEs:
> * try_to_migrate_one() already uses ptep_clear_flush().
> * try_to_unmap_one() already conditionally used ptep_clear_flush().
> * migrate_vma_collect_pmd() was the one case that didn't use it already
>  (and I wonder why it's different than try_to_migrate_one()).

I'm not sure whether I fully get the point, but one major difference here
is that all the rest handle a single page, so a TLB flush alongside the pte
clear sounds reasonable.  Even so, try_to_unmap_one() was modified to use
TLB batching, but then I see that anon exclusive made that batching
conditional.  I also have a question there on whether we can keep using the
TLB batching even with anon exclusive pages.

In general, I still don't see how a stale TLB could affect anon exclusive
pages racing with fast-gup, because the only side effect of a stale TLB is
an unwanted page update IIUC; the problem is that fast-gup doesn't even use
the TLB, AFAICT.

Thanks,
David Hildenbrand Aug. 26, 2022, 10:19 p.m. UTC | #11
On 26.08.22 23:37, Peter Xu wrote:
> On Fri, Aug 26, 2022 at 06:46:02PM +0200, David Hildenbrand wrote:
>> On 26.08.22 17:55, Peter Xu wrote:
>>> On Fri, Aug 26, 2022 at 04:47:22PM +0200, David Hildenbrand wrote:
>>>>> To me anon exclusive only shows this mm exclusively owns this page. I
>>>>> didn't quickly figure out why that requires different handling on tlb
>>>>> flushs.  Did I perhaps miss something?
>>>>
>>>> GUP-fast is the magic bit, we have to make sure that we won't see new
>>>> GUP pins, thus the TLB flush.
>>>>
>>>> include/linux/mm.h:gup_must_unshare() contains documentation.
>>>
>>> Hmm.. Shouldn't ptep_get_and_clear() (e.g., xchg() on x86_64) already
>>> guarantees that no other process/thread will see this pte anymore
>>> afterwards?
>>
>> You could have a GUP-fast thread that just looked up the PTE and is
>> going to pin the page afterwards, after the ptep_get_and_clear()
>> returned. You'll have to wait until that thread finished.
> 

Good that we're talking about it, very helpful! If that's actually not
required -- good.


What I learned about how GUP-fast and TLB flushes interact is the following:

GUP-fast disables local interrupts. A TLB flush will send an IPI and
wait until it has been processed. This implies that once the TLB flush
succeeded, the interrupt was handled and GUP-fast cannot be running
anymore.

BUT, there is the new RCU variant nowadays, and the TLB flush itself
should not actually perform such a sync. They merely protect the page
tables from getting freed and the THP from getting split IIUC. And
you're correct that that wouldn't help.


> IIUC the early tlb flush won't protect concurrent fast-gup from happening,
> but I think it's safe because fast-gup will check pte after pinning, so
> either:
> 
>   (1) fast-gup runs before ptep_get_and_clear(), then
>       page_try_share_anon_rmap() will fail properly, or,
> 
>   (2) fast-gup runs during or after ptep_get_and_clear(), then fast-gup
>       will see that either the pte is none or changed, then it'll fail the
>       fast-gup itself.

I think you're right and I might have managed to confuse myself with the
write_protect_page() comment. I placed the gup_must_unshare() check
explicitly after the "pte changed" check for this reason. So once the
PTE was cleared, GUP-fast would undo any pin performed on a now-stale PTE.

> 
>>
>> Another user that relies on this interaction between GUP-fast and TLB
>> flushing is for example mm/ksm.c:write_protect_page()
>>
>> There is a comment in there explaining the interaction a bit more detailed.
>>
>> Maybe we'll be able to handle this differently in the future (maybe once
>> this turns out to be an actual performance problem). Unfortunately,
>> mm->write_protect_seq isn't easily usable because we'd need have to make
>> sure we're the exclusive writer.
>>
>>
>> For now, it's not too complicated. For PTEs:
>> * try_to_migrate_one() already uses ptep_clear_flush().
>> * try_to_unmap_one() already conditionally used ptep_clear_flush().
>> * migrate_vma_collect_pmd() was the one case that didn't use it already
>>  (and I wonder why it's different than try_to_migrate_one()).
> 
> I'm not sure whether I fully get the point, but here one major difference
> is all the rest handles one page, so a tlb flush alongside with the pte
> clear sounds reasonable.  Even if so try_to_unmap_one() was modified to use
> tlb batching, but then I see that anon exclusive made that batching
> conditional.  I also have question there on whether we can keep using the
> tlb batching even with anon exclusive pages there.
> 
> In general, I still don't see how stall tlb could affect anon exclusive
> pages on racing with fast-gup, because the only side effect of a stall tlb
> is unwanted page update iiuc, the problem is fast-gup doesn't even use tlb,
> afaict..

I have the gut feeling that the comment in write_protect_page() is
indeed stale, and that clearing PageAnonExclusive doesn't strictly need
the TLB flush.

I'll try to refresh my memory over the weekend on whether there was any
other case that I had to handle.

Thanks!

Patch

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 6a5ef9f..51d9afa 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -7,6 +7,7 @@ 
 #include <linux/export.h>
 #include <linux/memremap.h>
 #include <linux/migrate.h>
+#include <linux/mm.h>
 #include <linux/mm_inline.h>
 #include <linux/mmu_notifier.h>
 #include <linux/oom.h>
@@ -196,7 +197,7 @@  static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
 			if (anon_exclusive) {
 				flush_cache_page(vma, addr, pte_pfn(*ptep));
-				ptep_clear_flush(vma, addr, ptep);
+				pte = ptep_clear_flush(vma, addr, ptep);
 
 				if (page_try_share_anon_rmap(page)) {
 					set_pte_at(mm, addr, ptep, pte);
@@ -206,11 +207,15 @@  static int migrate_vma_collect_pmd(pmd_t *pmdp,
 					goto next;
 				}
 			} else {
-				ptep_get_and_clear(mm, addr, ptep);
+				pte = ptep_get_and_clear(mm, addr, ptep);
 			}
 
 			migrate->cpages++;
 
+			/* Set the dirty flag on the folio now the pte is gone. */
+			if (pte_dirty(pte))
+				folio_mark_dirty(page_folio(page));
+
 			/* Setup special migration page table entry */
 			if (mpfn & MIGRATE_PFN_WRITE)
 				entry = make_writable_migration_entry(