Message ID | ffbc824af5daa2c44b91c66834a341894fba4ce6.1661309831.git-series.apopple@nvidia.com
---|---
State | New
Series | [v3,1/3] mm/migrate_device.c: Flush TLB while holding PTL
On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote: > migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that > installs migration entries directly if it can lock the migrating page. > When removing a dirty pte the dirty bit is supposed to be carried over > to the underlying page to prevent it being lost. > > Currently migrate_vma_*() can only be used for private anonymous > mappings. That means loss of the dirty bit usually doesn't result in > data loss because these pages are typically not file-backed. However > pages may be backed by swap storage which can result in data loss if an > attempt is made to migrate a dirty page that doesn't yet have the > PageDirty flag set. > > In this case migration will fail due to unexpected references but the > dirty pte bit will be lost. If the page is subsequently reclaimed data > won't be written back to swap storage as it is considered uptodate, > resulting in data loss if the page is subsequently accessed. > > Prevent this by copying the dirty bit to the page when removing the pte > to match what try_to_migrate_one() does. > > Signed-off-by: Alistair Popple <apopple@nvidia.com> > Acked-by: Peter Xu <peterx@redhat.com> > Reported-by: Huang Ying <ying.huang@intel.com> > Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages") > Cc: stable@vger.kernel.org > > --- > > Changes for v3: > > - Defer TLB flushing > - Split a TLB flushing fix into a separate change. > > Changes for v2: > > - Fixed up Reported-by tag. > - Added Peter's Acked-by. > - Atomically read and clear the pte to prevent the dirty bit getting > set after reading it. > - Added fixes tag > --- > mm/migrate_device.c | 9 +++++++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/mm/migrate_device.c b/mm/migrate_device.c > index 6a5ef9f..51d9afa 100644 > --- a/mm/migrate_device.c > +++ b/mm/migrate_device.c > @@ -7,6 +7,7 @@ > #include <linux/export.h> > #include <linux/memremap.h> > #include <linux/migrate.h> > +#include <linux/mm.h> > #include <linux/mm_inline.h> > #include <linux/mmu_notifier.h> > #include <linux/oom.h> > @@ -196,7 +197,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > anon_exclusive = PageAnon(page) && PageAnonExclusive(page); > if (anon_exclusive) { > flush_cache_page(vma, addr, pte_pfn(*ptep)); > - ptep_clear_flush(vma, addr, ptep); > + pte = ptep_clear_flush(vma, addr, ptep); > > if (page_try_share_anon_rmap(page)) { > set_pte_at(mm, addr, ptep, pte); > @@ -206,11 +207,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > goto next; > } > } else { > - ptep_get_and_clear(mm, addr, ptep); > + pte = ptep_get_and_clear(mm, addr, ptep); > } I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are moved above the condition check so they're called unconditionally. Could you explain the rational on why it's changed back (since I think v2 was the correct approach)? The other question is if we want to split the patch, would it be better to move the tlb changes to patch 1, and leave the dirty bit fix in patch 2? > > migrate->cpages++; > > + /* Set the dirty flag on the folio now the pte is gone. */ > + if (pte_dirty(pte)) > + folio_mark_dirty(page_folio(page)); > + > /* Setup special migration page table entry */ > if (mpfn & MIGRATE_PFN_WRITE) > entry = make_writable_migration_entry( > -- > git-series 0.9.1 >
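For reference, the try_to_migrate_one() behaviour that the commit message above says it is matching looks roughly like the following. This is a condensed, from-memory sketch of the mm/rmap.c pattern, not a verbatim quote; variable names such as pteval and pvmw are only illustrative:

	/* The pte (and its dirty bit) is read and cleared atomically
	 * under the PTL before the migration entry is installed. */
	pteval = ptep_clear_flush(vma, address, pvmw.pte);

	/* Set the dirty flag on the folio now the pte is gone, so the
	 * data cannot be lost if the folio is later reclaimed. */
	if (pte_dirty(pteval))
		folio_mark_dirty(folio);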
Peter Xu <peterx@redhat.com> writes: > On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote: >> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that >> installs migration entries directly if it can lock the migrating page. >> When removing a dirty pte the dirty bit is supposed to be carried over >> to the underlying page to prevent it being lost. >> >> Currently migrate_vma_*() can only be used for private anonymous >> mappings. That means loss of the dirty bit usually doesn't result in >> data loss because these pages are typically not file-backed. However >> pages may be backed by swap storage which can result in data loss if an >> attempt is made to migrate a dirty page that doesn't yet have the >> PageDirty flag set. >> >> In this case migration will fail due to unexpected references but the >> dirty pte bit will be lost. If the page is subsequently reclaimed data >> won't be written back to swap storage as it is considered uptodate, >> resulting in data loss if the page is subsequently accessed. >> >> Prevent this by copying the dirty bit to the page when removing the pte >> to match what try_to_migrate_one() does. >> >> Signed-off-by: Alistair Popple <apopple@nvidia.com> >> Acked-by: Peter Xu <peterx@redhat.com> >> Reported-by: Huang Ying <ying.huang@intel.com> >> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages") >> Cc: stable@vger.kernel.org >> >> --- >> >> Changes for v3: >> >> - Defer TLB flushing >> - Split a TLB flushing fix into a separate change. >> >> Changes for v2: >> >> - Fixed up Reported-by tag. >> - Added Peter's Acked-by. >> - Atomically read and clear the pte to prevent the dirty bit getting >> set after reading it. >> - Added fixes tag >> --- >> mm/migrate_device.c | 9 +++++++-- >> 1 file changed, 7 insertions(+), 2 deletions(-) >> >> diff --git a/mm/migrate_device.c b/mm/migrate_device.c >> index 6a5ef9f..51d9afa 100644 >> --- a/mm/migrate_device.c >> +++ b/mm/migrate_device.c >> @@ -7,6 +7,7 @@ >> #include <linux/export.h> >> #include <linux/memremap.h> >> #include <linux/migrate.h> >> +#include <linux/mm.h> >> #include <linux/mm_inline.h> >> #include <linux/mmu_notifier.h> >> #include <linux/oom.h> >> @@ -196,7 +197,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >> anon_exclusive = PageAnon(page) && PageAnonExclusive(page); >> if (anon_exclusive) { >> flush_cache_page(vma, addr, pte_pfn(*ptep)); >> - ptep_clear_flush(vma, addr, ptep); >> + pte = ptep_clear_flush(vma, addr, ptep); >> >> if (page_try_share_anon_rmap(page)) { >> set_pte_at(mm, addr, ptep, pte); >> @@ -206,11 +207,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >> goto next; >> } >> } else { >> - ptep_get_and_clear(mm, addr, ptep); >> + pte = ptep_get_and_clear(mm, addr, ptep); >> } > > I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are > moved above the condition check so they're called unconditionally. Could > you explain the rational on why it's changed back (since I think v2 was the > correct approach)? Mainly because I agree with your original comments, that it would be better to keep the batching of TLB flushing if possible. After the discussion I don't think there is any issues with HW pte dirty bits here. There are already other cases where HW needs to get that right anyway (eg. zap_pte_range). > The other question is if we want to split the patch, would it be better to > move the tlb changes to patch 1, and leave the dirty bit fix in patch 2? Isn't that already the case? 
Patch 1 moves the TLB flush before the PTL
as suggested, patch 2 atomically copies the dirty bit without changing
any TLB flushing.

>>
>>  	migrate->cpages++;
>>
>> +	/* Set the dirty flag on the folio now the pte is gone. */
>> +	if (pte_dirty(pte))
>> +		folio_mark_dirty(page_folio(page));
>> +
>>  	/* Setup special migration page table entry */
>>  	if (mpfn & MIGRATE_PFN_WRITE)
>>  		entry = make_writable_migration_entry(
>> --
>> git-series 0.9.1
>>
On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote: > > Peter Xu <peterx@redhat.com> writes: > > > On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote: > >> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that > >> installs migration entries directly if it can lock the migrating page. > >> When removing a dirty pte the dirty bit is supposed to be carried over > >> to the underlying page to prevent it being lost. > >> > >> Currently migrate_vma_*() can only be used for private anonymous > >> mappings. That means loss of the dirty bit usually doesn't result in > >> data loss because these pages are typically not file-backed. However > >> pages may be backed by swap storage which can result in data loss if an > >> attempt is made to migrate a dirty page that doesn't yet have the > >> PageDirty flag set. > >> > >> In this case migration will fail due to unexpected references but the > >> dirty pte bit will be lost. If the page is subsequently reclaimed data > >> won't be written back to swap storage as it is considered uptodate, > >> resulting in data loss if the page is subsequently accessed. > >> > >> Prevent this by copying the dirty bit to the page when removing the pte > >> to match what try_to_migrate_one() does. > >> > >> Signed-off-by: Alistair Popple <apopple@nvidia.com> > >> Acked-by: Peter Xu <peterx@redhat.com> > >> Reported-by: Huang Ying <ying.huang@intel.com> > >> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages") > >> Cc: stable@vger.kernel.org > >> > >> --- > >> > >> Changes for v3: > >> > >> - Defer TLB flushing > >> - Split a TLB flushing fix into a separate change. > >> > >> Changes for v2: > >> > >> - Fixed up Reported-by tag. > >> - Added Peter's Acked-by. > >> - Atomically read and clear the pte to prevent the dirty bit getting > >> set after reading it. > >> - Added fixes tag > >> --- > >> mm/migrate_device.c | 9 +++++++-- > >> 1 file changed, 7 insertions(+), 2 deletions(-) > >> > >> diff --git a/mm/migrate_device.c b/mm/migrate_device.c > >> index 6a5ef9f..51d9afa 100644 > >> --- a/mm/migrate_device.c > >> +++ b/mm/migrate_device.c > >> @@ -7,6 +7,7 @@ > >> #include <linux/export.h> > >> #include <linux/memremap.h> > >> #include <linux/migrate.h> > >> +#include <linux/mm.h> > >> #include <linux/mm_inline.h> > >> #include <linux/mmu_notifier.h> > >> #include <linux/oom.h> > >> @@ -196,7 +197,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > >> anon_exclusive = PageAnon(page) && PageAnonExclusive(page); > >> if (anon_exclusive) { > >> flush_cache_page(vma, addr, pte_pfn(*ptep)); > >> - ptep_clear_flush(vma, addr, ptep); > >> + pte = ptep_clear_flush(vma, addr, ptep); > >> > >> if (page_try_share_anon_rmap(page)) { > >> set_pte_at(mm, addr, ptep, pte); > >> @@ -206,11 +207,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > >> goto next; > >> } > >> } else { > >> - ptep_get_and_clear(mm, addr, ptep); > >> + pte = ptep_get_and_clear(mm, addr, ptep); > >> } > > > > I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are > > moved above the condition check so they're called unconditionally. Could > > you explain the rational on why it's changed back (since I think v2 was the > > correct approach)? > > Mainly because I agree with your original comments, that it would be > better to keep the batching of TLB flushing if possible. After the > discussion I don't think there is any issues with HW pte dirty bits > here. 
There are already other cases where HW needs to get that right
> anyway (eg. zap_pte_range).

Yes tlb batching was kept, thanks for doing that way. Though if only apply
patch 1 we'll have both ptep_clear_flush() and batched flush which seems to
be redundant.

>
> > The other question is if we want to split the patch, would it be better to
> > move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?
>
> Isn't that already the case? Patch 1 moves the TLB flush before the PTL
> as suggested, patch 2 atomically copies the dirty bit without changing
> any TLB flushing.

IMHO it's cleaner to have patch 1 fix batch flush, replace
ptep_clear_flush() with ptep_get_and_clear() and update pte properly.

No strong opinions on the layout, but I still think we should drop the
redundant ptep_clear_flush() above, meanwhile add the flush_cache_page()
properly for !exclusive case too.

Thanks,
Peter Xu <peterx@redhat.com> writes: > On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote: >> >> Peter Xu <peterx@redhat.com> writes: >> >> > On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote: >> >> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that >> >> installs migration entries directly if it can lock the migrating page. >> >> When removing a dirty pte the dirty bit is supposed to be carried over >> >> to the underlying page to prevent it being lost. >> >> >> >> Currently migrate_vma_*() can only be used for private anonymous >> >> mappings. That means loss of the dirty bit usually doesn't result in >> >> data loss because these pages are typically not file-backed. However >> >> pages may be backed by swap storage which can result in data loss if an >> >> attempt is made to migrate a dirty page that doesn't yet have the >> >> PageDirty flag set. >> >> >> >> In this case migration will fail due to unexpected references but the >> >> dirty pte bit will be lost. If the page is subsequently reclaimed data >> >> won't be written back to swap storage as it is considered uptodate, >> >> resulting in data loss if the page is subsequently accessed. >> >> >> >> Prevent this by copying the dirty bit to the page when removing the pte >> >> to match what try_to_migrate_one() does. >> >> >> >> Signed-off-by: Alistair Popple <apopple@nvidia.com> >> >> Acked-by: Peter Xu <peterx@redhat.com> >> >> Reported-by: Huang Ying <ying.huang@intel.com> >> >> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages") >> >> Cc: stable@vger.kernel.org >> >> >> >> --- >> >> >> >> Changes for v3: >> >> >> >> - Defer TLB flushing >> >> - Split a TLB flushing fix into a separate change. >> >> >> >> Changes for v2: >> >> >> >> - Fixed up Reported-by tag. >> >> - Added Peter's Acked-by. >> >> - Atomically read and clear the pte to prevent the dirty bit getting >> >> set after reading it. >> >> - Added fixes tag >> >> --- >> >> mm/migrate_device.c | 9 +++++++-- >> >> 1 file changed, 7 insertions(+), 2 deletions(-) >> >> >> >> diff --git a/mm/migrate_device.c b/mm/migrate_device.c >> >> index 6a5ef9f..51d9afa 100644 >> >> --- a/mm/migrate_device.c >> >> +++ b/mm/migrate_device.c >> >> @@ -7,6 +7,7 @@ >> >> #include <linux/export.h> >> >> #include <linux/memremap.h> >> >> #include <linux/migrate.h> >> >> +#include <linux/mm.h> >> >> #include <linux/mm_inline.h> >> >> #include <linux/mmu_notifier.h> >> >> #include <linux/oom.h> >> >> @@ -196,7 +197,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >> >> anon_exclusive = PageAnon(page) && PageAnonExclusive(page); >> >> if (anon_exclusive) { >> >> flush_cache_page(vma, addr, pte_pfn(*ptep)); >> >> - ptep_clear_flush(vma, addr, ptep); >> >> + pte = ptep_clear_flush(vma, addr, ptep); >> >> >> >> if (page_try_share_anon_rmap(page)) { >> >> set_pte_at(mm, addr, ptep, pte); >> >> @@ -206,11 +207,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >> >> goto next; >> >> } >> >> } else { >> >> - ptep_get_and_clear(mm, addr, ptep); >> >> + pte = ptep_get_and_clear(mm, addr, ptep); >> >> } >> > >> > I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are >> > moved above the condition check so they're called unconditionally. Could >> > you explain the rational on why it's changed back (since I think v2 was the >> > correct approach)? >> >> Mainly because I agree with your original comments, that it would be >> better to keep the batching of TLB flushing if possible. 
After the
>> discussion I don't think there is any issues with HW pte dirty bits
>> here. There are already other cases where HW needs to get that right
>> anyway (eg. zap_pte_range).
>
> Yes tlb batching was kept, thanks for doing that way. Though if only apply
> patch 1 we'll have both ptep_clear_flush() and batched flush which seems to
> be redundant.
>
>>
>> > The other question is if we want to split the patch, would it be better to
>> > move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?
>>
>> Isn't that already the case? Patch 1 moves the TLB flush before the PTL
>> as suggested, patch 2 atomically copies the dirty bit without changing
>> any TLB flushing.
>
> IMHO it's cleaner to have patch 1 fix batch flush, replace
> ptep_clear_flush() with ptep_get_and_clear() and update pte properly.

Which ptep_clear_flush() are you referring to? This one?

	if (anon_exclusive) {
		flush_cache_page(vma, addr, pte_pfn(*ptep));
		ptep_clear_flush(vma, addr, ptep);

My understanding is that we need to do a flush for anon_exclusive.

> No strong opinions on the layout, but I still think we should drop the
> redundant ptep_clear_flush() above, meanwhile add the flush_cache_page()
> properly for !exclusive case too.

Good point, we need flush_cache_page() for !exclusive. Will add.

> Thanks,
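A rough sketch of what the follow-up discussed here could look like (a hypothetical illustration only, not the patch that was actually posted): flush the cache before clearing the pte on both paths, and keep the immediate TLB flush only for the anon-exclusive case while the other path stays batched:

	flush_cache_page(vma, addr, pte_pfn(*ptep));
	anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
	if (anon_exclusive) {
		/* Immediate flush so GUP-fast cannot take new pins. */
		pte = ptep_clear_flush(vma, addr, ptep);
		if (page_try_share_anon_rmap(page)) {
			set_pte_at(mm, addr, ptep, pte);
			/* ... unlock and put the page, skip this pte ... */
			goto next;
		}
	} else {
		/* TLB flush is deferred and batched once the PTL is dropped. */
		pte = ptep_get_and_clear(mm, addr, ptep);
	}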
Alistair Popple <apopple@nvidia.com> writes: > Peter Xu <peterx@redhat.com> writes: > >> On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote: >>> >>> Peter Xu <peterx@redhat.com> writes: >>> >>> > On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote: >>> >> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that >>> >> installs migration entries directly if it can lock the migrating page. >>> >> When removing a dirty pte the dirty bit is supposed to be carried over >>> >> to the underlying page to prevent it being lost. >>> >> >>> >> Currently migrate_vma_*() can only be used for private anonymous >>> >> mappings. That means loss of the dirty bit usually doesn't result in >>> >> data loss because these pages are typically not file-backed. However >>> >> pages may be backed by swap storage which can result in data loss if an >>> >> attempt is made to migrate a dirty page that doesn't yet have the >>> >> PageDirty flag set. >>> >> >>> >> In this case migration will fail due to unexpected references but the >>> >> dirty pte bit will be lost. If the page is subsequently reclaimed data >>> >> won't be written back to swap storage as it is considered uptodate, >>> >> resulting in data loss if the page is subsequently accessed. >>> >> >>> >> Prevent this by copying the dirty bit to the page when removing the pte >>> >> to match what try_to_migrate_one() does. >>> >> >>> >> Signed-off-by: Alistair Popple <apopple@nvidia.com> >>> >> Acked-by: Peter Xu <peterx@redhat.com> >>> >> Reported-by: Huang Ying <ying.huang@intel.com> >>> >> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages") >>> >> Cc: stable@vger.kernel.org >>> >> >>> >> --- >>> >> >>> >> Changes for v3: >>> >> >>> >> - Defer TLB flushing >>> >> - Split a TLB flushing fix into a separate change. >>> >> >>> >> Changes for v2: >>> >> >>> >> - Fixed up Reported-by tag. >>> >> - Added Peter's Acked-by. >>> >> - Atomically read and clear the pte to prevent the dirty bit getting >>> >> set after reading it. >>> >> - Added fixes tag >>> >> --- >>> >> mm/migrate_device.c | 9 +++++++-- >>> >> 1 file changed, 7 insertions(+), 2 deletions(-) >>> >> >>> >> diff --git a/mm/migrate_device.c b/mm/migrate_device.c >>> >> index 6a5ef9f..51d9afa 100644 >>> >> --- a/mm/migrate_device.c >>> >> +++ b/mm/migrate_device.c >>> >> @@ -7,6 +7,7 @@ >>> >> #include <linux/export.h> >>> >> #include <linux/memremap.h> >>> >> #include <linux/migrate.h> >>> >> +#include <linux/mm.h> >>> >> #include <linux/mm_inline.h> >>> >> #include <linux/mmu_notifier.h> >>> >> #include <linux/oom.h> >>> >> @@ -196,7 +197,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >>> >> anon_exclusive = PageAnon(page) && PageAnonExclusive(page); >>> >> if (anon_exclusive) { >>> >> flush_cache_page(vma, addr, pte_pfn(*ptep)); >>> >> - ptep_clear_flush(vma, addr, ptep); >>> >> + pte = ptep_clear_flush(vma, addr, ptep); >>> >> >>> >> if (page_try_share_anon_rmap(page)) { >>> >> set_pte_at(mm, addr, ptep, pte); >>> >> @@ -206,11 +207,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >>> >> goto next; >>> >> } >>> >> } else { >>> >> - ptep_get_and_clear(mm, addr, ptep); >>> >> + pte = ptep_get_and_clear(mm, addr, ptep); >>> >> } >>> > >>> > I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are >>> > moved above the condition check so they're called unconditionally. Could >>> > you explain the rational on why it's changed back (since I think v2 was the >>> > correct approach)? 
>>> >>> Mainly because I agree with your original comments, that it would be >>> better to keep the batching of TLB flushing if possible. After the >>> discussion I don't think there is any issues with HW pte dirty bits >>> here. There are already other cases where HW needs to get that right >>> anyway (eg. zap_pte_range). >> >> Yes tlb batching was kept, thanks for doing that way. Though if only apply >> patch 1 we'll have both ptep_clear_flush() and batched flush which seems to >> be redundant. >> >>> >>> > The other question is if we want to split the patch, would it be better to >>> > move the tlb changes to patch 1, and leave the dirty bit fix in patch 2? >>> >>> Isn't that already the case? Patch 1 moves the TLB flush before the PTL >>> as suggested, patch 2 atomically copies the dirty bit without changing >>> any TLB flushing. >> >> IMHO it's cleaner to have patch 1 fix batch flush, replace >> ptep_clear_flush() with ptep_get_and_clear() and update pte properly. > > Which ptep_clear_flush() are you referring to? This one? > > if (anon_exclusive) { > flush_cache_page(vma, addr, pte_pfn(*ptep)); > ptep_clear_flush(vma, addr, ptep); > > My understanding is that we need to do a flush for anon_exclusive. > >> No strong opinions on the layout, but I still think we should drop the >> redundant ptep_clear_flush() above, meanwhile add the flush_cache_page() >> properly for !exclusive case too. > > Good point, we need flush_cache_page() for !exclusive. Will add. That can be in another patch. For the patch itself, it looks good to me. Feel free to add, Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Best Regards, Huang, Ying
On Fri, Aug 26, 2022 at 11:02:58AM +1000, Alistair Popple wrote: > > Peter Xu <peterx@redhat.com> writes: > > > On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote: > >> > >> Peter Xu <peterx@redhat.com> writes: > >> > >> > On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote: > >> >> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that > >> >> installs migration entries directly if it can lock the migrating page. > >> >> When removing a dirty pte the dirty bit is supposed to be carried over > >> >> to the underlying page to prevent it being lost. > >> >> > >> >> Currently migrate_vma_*() can only be used for private anonymous > >> >> mappings. That means loss of the dirty bit usually doesn't result in > >> >> data loss because these pages are typically not file-backed. However > >> >> pages may be backed by swap storage which can result in data loss if an > >> >> attempt is made to migrate a dirty page that doesn't yet have the > >> >> PageDirty flag set. > >> >> > >> >> In this case migration will fail due to unexpected references but the > >> >> dirty pte bit will be lost. If the page is subsequently reclaimed data > >> >> won't be written back to swap storage as it is considered uptodate, > >> >> resulting in data loss if the page is subsequently accessed. > >> >> > >> >> Prevent this by copying the dirty bit to the page when removing the pte > >> >> to match what try_to_migrate_one() does. > >> >> > >> >> Signed-off-by: Alistair Popple <apopple@nvidia.com> > >> >> Acked-by: Peter Xu <peterx@redhat.com> > >> >> Reported-by: Huang Ying <ying.huang@intel.com> > >> >> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages") > >> >> Cc: stable@vger.kernel.org > >> >> > >> >> --- > >> >> > >> >> Changes for v3: > >> >> > >> >> - Defer TLB flushing > >> >> - Split a TLB flushing fix into a separate change. > >> >> > >> >> Changes for v2: > >> >> > >> >> - Fixed up Reported-by tag. > >> >> - Added Peter's Acked-by. > >> >> - Atomically read and clear the pte to prevent the dirty bit getting > >> >> set after reading it. > >> >> - Added fixes tag > >> >> --- > >> >> mm/migrate_device.c | 9 +++++++-- > >> >> 1 file changed, 7 insertions(+), 2 deletions(-) > >> >> > >> >> diff --git a/mm/migrate_device.c b/mm/migrate_device.c > >> >> index 6a5ef9f..51d9afa 100644 > >> >> --- a/mm/migrate_device.c > >> >> +++ b/mm/migrate_device.c > >> >> @@ -7,6 +7,7 @@ > >> >> #include <linux/export.h> > >> >> #include <linux/memremap.h> > >> >> #include <linux/migrate.h> > >> >> +#include <linux/mm.h> > >> >> #include <linux/mm_inline.h> > >> >> #include <linux/mmu_notifier.h> > >> >> #include <linux/oom.h> > >> >> @@ -196,7 +197,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > >> >> anon_exclusive = PageAnon(page) && PageAnonExclusive(page); > >> >> if (anon_exclusive) { > >> >> flush_cache_page(vma, addr, pte_pfn(*ptep)); > >> >> - ptep_clear_flush(vma, addr, ptep); > >> >> + pte = ptep_clear_flush(vma, addr, ptep); > >> >> > >> >> if (page_try_share_anon_rmap(page)) { > >> >> set_pte_at(mm, addr, ptep, pte); > >> >> @@ -206,11 +207,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > >> >> goto next; > >> >> } > >> >> } else { > >> >> - ptep_get_and_clear(mm, addr, ptep); > >> >> + pte = ptep_get_and_clear(mm, addr, ptep); > >> >> } > >> > > >> > I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are > >> > moved above the condition check so they're called unconditionally. 
Could > >> > you explain the rational on why it's changed back (since I think v2 was the > >> > correct approach)? > >> > >> Mainly because I agree with your original comments, that it would be > >> better to keep the batching of TLB flushing if possible. After the > >> discussion I don't think there is any issues with HW pte dirty bits > >> here. There are already other cases where HW needs to get that right > >> anyway (eg. zap_pte_range). > > > > Yes tlb batching was kept, thanks for doing that way. Though if only apply > > patch 1 we'll have both ptep_clear_flush() and batched flush which seems to > > be redundant. > > > >> > >> > The other question is if we want to split the patch, would it be better to > >> > move the tlb changes to patch 1, and leave the dirty bit fix in patch 2? > >> > >> Isn't that already the case? Patch 1 moves the TLB flush before the PTL > >> as suggested, patch 2 atomically copies the dirty bit without changing > >> any TLB flushing. > > > > IMHO it's cleaner to have patch 1 fix batch flush, replace > > ptep_clear_flush() with ptep_get_and_clear() and update pte properly. > > Which ptep_clear_flush() are you referring to? This one? > > if (anon_exclusive) { > flush_cache_page(vma, addr, pte_pfn(*ptep)); > ptep_clear_flush(vma, addr, ptep); Correct. > > My understanding is that we need to do a flush for anon_exclusive. To me anon exclusive only shows this mm exclusively owns this page. I didn't quickly figure out why that requires different handling on tlb flushs. Did I perhaps miss something? Thanks,
On 26.08.22 16:32, Peter Xu wrote: > On Fri, Aug 26, 2022 at 11:02:58AM +1000, Alistair Popple wrote: >> >> Peter Xu <peterx@redhat.com> writes: >> >>> On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote: >>>> >>>> Peter Xu <peterx@redhat.com> writes: >>>> >>>>> On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote: >>>>>> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that >>>>>> installs migration entries directly if it can lock the migrating page. >>>>>> When removing a dirty pte the dirty bit is supposed to be carried over >>>>>> to the underlying page to prevent it being lost. >>>>>> >>>>>> Currently migrate_vma_*() can only be used for private anonymous >>>>>> mappings. That means loss of the dirty bit usually doesn't result in >>>>>> data loss because these pages are typically not file-backed. However >>>>>> pages may be backed by swap storage which can result in data loss if an >>>>>> attempt is made to migrate a dirty page that doesn't yet have the >>>>>> PageDirty flag set. >>>>>> >>>>>> In this case migration will fail due to unexpected references but the >>>>>> dirty pte bit will be lost. If the page is subsequently reclaimed data >>>>>> won't be written back to swap storage as it is considered uptodate, >>>>>> resulting in data loss if the page is subsequently accessed. >>>>>> >>>>>> Prevent this by copying the dirty bit to the page when removing the pte >>>>>> to match what try_to_migrate_one() does. >>>>>> >>>>>> Signed-off-by: Alistair Popple <apopple@nvidia.com> >>>>>> Acked-by: Peter Xu <peterx@redhat.com> >>>>>> Reported-by: Huang Ying <ying.huang@intel.com> >>>>>> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages") >>>>>> Cc: stable@vger.kernel.org >>>>>> >>>>>> --- >>>>>> >>>>>> Changes for v3: >>>>>> >>>>>> - Defer TLB flushing >>>>>> - Split a TLB flushing fix into a separate change. >>>>>> >>>>>> Changes for v2: >>>>>> >>>>>> - Fixed up Reported-by tag. >>>>>> - Added Peter's Acked-by. >>>>>> - Atomically read and clear the pte to prevent the dirty bit getting >>>>>> set after reading it. >>>>>> - Added fixes tag >>>>>> --- >>>>>> mm/migrate_device.c | 9 +++++++-- >>>>>> 1 file changed, 7 insertions(+), 2 deletions(-) >>>>>> >>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c >>>>>> index 6a5ef9f..51d9afa 100644 >>>>>> --- a/mm/migrate_device.c >>>>>> +++ b/mm/migrate_device.c >>>>>> @@ -7,6 +7,7 @@ >>>>>> #include <linux/export.h> >>>>>> #include <linux/memremap.h> >>>>>> #include <linux/migrate.h> >>>>>> +#include <linux/mm.h> >>>>>> #include <linux/mm_inline.h> >>>>>> #include <linux/mmu_notifier.h> >>>>>> #include <linux/oom.h> >>>>>> @@ -196,7 +197,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >>>>>> anon_exclusive = PageAnon(page) && PageAnonExclusive(page); >>>>>> if (anon_exclusive) { >>>>>> flush_cache_page(vma, addr, pte_pfn(*ptep)); >>>>>> - ptep_clear_flush(vma, addr, ptep); >>>>>> + pte = ptep_clear_flush(vma, addr, ptep); >>>>>> >>>>>> if (page_try_share_anon_rmap(page)) { >>>>>> set_pte_at(mm, addr, ptep, pte); >>>>>> @@ -206,11 +207,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >>>>>> goto next; >>>>>> } >>>>>> } else { >>>>>> - ptep_get_and_clear(mm, addr, ptep); >>>>>> + pte = ptep_get_and_clear(mm, addr, ptep); >>>>>> } >>>>> >>>>> I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are >>>>> moved above the condition check so they're called unconditionally. 
Could >>>>> you explain the rational on why it's changed back (since I think v2 was the >>>>> correct approach)? >>>> >>>> Mainly because I agree with your original comments, that it would be >>>> better to keep the batching of TLB flushing if possible. After the >>>> discussion I don't think there is any issues with HW pte dirty bits >>>> here. There are already other cases where HW needs to get that right >>>> anyway (eg. zap_pte_range). >>> >>> Yes tlb batching was kept, thanks for doing that way. Though if only apply >>> patch 1 we'll have both ptep_clear_flush() and batched flush which seems to >>> be redundant. >>> >>>> >>>>> The other question is if we want to split the patch, would it be better to >>>>> move the tlb changes to patch 1, and leave the dirty bit fix in patch 2? >>>> >>>> Isn't that already the case? Patch 1 moves the TLB flush before the PTL >>>> as suggested, patch 2 atomically copies the dirty bit without changing >>>> any TLB flushing. >>> >>> IMHO it's cleaner to have patch 1 fix batch flush, replace >>> ptep_clear_flush() with ptep_get_and_clear() and update pte properly. >> >> Which ptep_clear_flush() are you referring to? This one? >> >> if (anon_exclusive) { >> flush_cache_page(vma, addr, pte_pfn(*ptep)); >> ptep_clear_flush(vma, addr, ptep); > > Correct. > >> >> My understanding is that we need to do a flush for anon_exclusive. > > To me anon exclusive only shows this mm exclusively owns this page. I > didn't quickly figure out why that requires different handling on tlb > flushs. Did I perhaps miss something? GUP-fast is the magic bit, we have to make sure that we won't see new GUP pins, thus the TLB flush. include/linux/mm.h:gup_must_unshare() contains documentation. Without GUP-fast, some things would be significantly easier to handle.
On Fri, Aug 26, 2022 at 04:47:22PM +0200, David Hildenbrand wrote:
> > To me anon exclusive only shows this mm exclusively owns this page. I
> > didn't quickly figure out why that requires different handling on tlb
> > flushs. Did I perhaps miss something?
>
> GUP-fast is the magic bit, we have to make sure that we won't see new
> GUP pins, thus the TLB flush.
>
> include/linux/mm.h:gup_must_unshare() contains documentation.

Hmm.. Shouldn't ptep_get_and_clear() (e.g., xchg() on x86_64) already
guarantees that no other process/thread will see this pte anymore
afterwards?
On 26.08.22 17:55, Peter Xu wrote:
> On Fri, Aug 26, 2022 at 04:47:22PM +0200, David Hildenbrand wrote:
>>> To me anon exclusive only shows this mm exclusively owns this page. I
>>> didn't quickly figure out why that requires different handling on tlb
>>> flushs. Did I perhaps miss something?
>>
>> GUP-fast is the magic bit, we have to make sure that we won't see new
>> GUP pins, thus the TLB flush.
>>
>> include/linux/mm.h:gup_must_unshare() contains documentation.
>
> Hmm.. Shouldn't ptep_get_and_clear() (e.g., xchg() on x86_64) already
> guarantees that no other process/thread will see this pte anymore
> afterwards?

You could have a GUP-fast thread that just looked up the PTE and is
going to pin the page afterwards, after the ptep_get_and_clear()
returned. You'll have to wait until that thread finished.

Another user that relies on this interaction between GUP-fast and TLB
flushing is for example mm/ksm.c:write_protect_page()

There is a comment in there explaining the interaction a bit more detailed.

Maybe we'll be able to handle this differently in the future (maybe once
this turns out to be an actual performance problem). Unfortunately,
mm->write_protect_seq isn't easily usable because we'd need have to make
sure we're the exclusive writer.

For now, it's not too complicated. For PTEs:
* try_to_migrate_one() already uses ptep_clear_flush().
* try_to_unmap_one() already conditionally used ptep_clear_flush().
* migrate_vma_collect_pmd() was the one case that didn't use it already
  (and I wonder why it's different than try_to_migrate_one()).
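The mm/ksm.c:write_protect_page() interaction referred to above is roughly the following. This is a condensed, from-memory sketch rather than a verbatim quote of the kernel source, and the surrounding locals (swapped, entry, pvmw) are only shown for illustration:

	swapped = PageSwapCache(page);
	flush_cache_page(vma, pvmw.address, page_to_pfn(page));
	/*
	 * Clear the pte and flush the TLB before comparing mapcount and
	 * refcount, so a concurrent GUP-fast / O_DIRECT user cannot take a
	 * new reference in the middle of the check.
	 */
	entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
	if (page_mapcount(page) + 1 + swapped != page_count(page)) {
		/* Someone may still be using the page; restore the pte and back off. */
		set_pte_at(mm, pvmw.address, pvmw.pte, entry);
		goto out_unlock;
	}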
On Fri, Aug 26, 2022 at 06:46:02PM +0200, David Hildenbrand wrote:
> On 26.08.22 17:55, Peter Xu wrote:
> > On Fri, Aug 26, 2022 at 04:47:22PM +0200, David Hildenbrand wrote:
> >>> To me anon exclusive only shows this mm exclusively owns this page. I
> >>> didn't quickly figure out why that requires different handling on tlb
> >>> flushs. Did I perhaps miss something?
> >>
> >> GUP-fast is the magic bit, we have to make sure that we won't see new
> >> GUP pins, thus the TLB flush.
> >>
> >> include/linux/mm.h:gup_must_unshare() contains documentation.
> >
> > Hmm.. Shouldn't ptep_get_and_clear() (e.g., xchg() on x86_64) already
> > guarantees that no other process/thread will see this pte anymore
> > afterwards?
>
> You could have a GUP-fast thread that just looked up the PTE and is
> going to pin the page afterwards, after the ptep_get_and_clear()
> returned. You'll have to wait until that thread finished.

IIUC the early tlb flush won't protect concurrent fast-gup from happening,
but I think it's safe because fast-gup will check pte after pinning, so
either:

  (1) fast-gup runs before ptep_get_and_clear(), then
      page_try_share_anon_rmap() will fail properly, or,

  (2) fast-gup runs during or after ptep_get_and_clear(), then fast-gup
      will see that either the pte is none or changed, then it'll fail the
      fast-gup itself.

>
> Another user that relies on this interaction between GUP-fast and TLB
> flushing is for example mm/ksm.c:write_protect_page()
>
> There is a comment in there explaining the interaction a bit more detailed.
>
> Maybe we'll be able to handle this differently in the future (maybe once
> this turns out to be an actual performance problem). Unfortunately,
> mm->write_protect_seq isn't easily usable because we'd need have to make
> sure we're the exclusive writer.
>
>
> For now, it's not too complicated. For PTEs:
> * try_to_migrate_one() already uses ptep_clear_flush().
> * try_to_unmap_one() already conditionally used ptep_clear_flush().
> * migrate_vma_collect_pmd() was the one case that didn't use it already
>   (and I wonder why it's different than try_to_migrate_one()).

I'm not sure whether I fully get the point, but here one major difference
is all the rest handles one page, so a tlb flush alongside with the pte
clear sounds reasonable. Even if so try_to_unmap_one() was modified to use
tlb batching, but then I see that anon exclusive made that batching
conditional. I also have question there on whether we can keep using the
tlb batching even with anon exclusive pages there.

In general, I still don't see how stall tlb could affect anon exclusive
pages on racing with fast-gup, because the only side effect of a stall tlb
is unwanted page update iiuc, the problem is fast-gup doesn't even use tlb,
afaict..

Thanks,
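Case (2) above corresponds to the recheck GUP-fast performs after taking its reference. A condensed, paraphrased sketch of the mm/gup.c gup_pte_range() logic of this era follows; the helper names (ptep_get_lockless, try_grab_folio, gup_put_folio) are recalled from memory and not quoted from the thread:

	pte_t pte = ptep_get_lockless(ptep);
	...
	/* Grab a reference first, then revalidate the pte. */
	folio = try_grab_folio(page, 1, flags);
	if (!folio)
		goto pte_unmap;

	/* The pte was cleared or changed under us: drop the pin and bail. */
	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}

	/* This is also where gup_must_unshare() rejects non-exclusive anon pages. */
	if (!pte_write(pte) && gup_must_unshare(flags, page)) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}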
On 26.08.22 23:37, Peter Xu wrote:
> On Fri, Aug 26, 2022 at 06:46:02PM +0200, David Hildenbrand wrote:
>> On 26.08.22 17:55, Peter Xu wrote:
>>> On Fri, Aug 26, 2022 at 04:47:22PM +0200, David Hildenbrand wrote:
>>>>> To me anon exclusive only shows this mm exclusively owns this page. I
>>>>> didn't quickly figure out why that requires different handling on tlb
>>>>> flushs. Did I perhaps miss something?
>>>>
>>>> GUP-fast is the magic bit, we have to make sure that we won't see new
>>>> GUP pins, thus the TLB flush.
>>>>
>>>> include/linux/mm.h:gup_must_unshare() contains documentation.
>>>
>>> Hmm.. Shouldn't ptep_get_and_clear() (e.g., xchg() on x86_64) already
>>> guarantees that no other process/thread will see this pte anymore
>>> afterwards?
>>
>> You could have a GUP-fast thread that just looked up the PTE and is
>> going to pin the page afterwards, after the ptep_get_and_clear()
>> returned. You'll have to wait until that thread finished.
>

Good that we're talking about it, very helpful! If that's actually not
required -- good.

What I learned how GUP-fast and TLB flushes interact is the following:
GUP-fast disables local interrupts. A TLB flush will send an IPI and wait
until it has been processed. This implies, that once the TLB flush
succeeded, that the interrupt was handled and GUP-fast cannot be running
anymore.

BUT, there is the new RCU variant nowadays, and the TLB flush itself
should not actually perform such a sync. They merely protect the page
tables from getting freed and the THP from getting split IIUC. And you're
correct that that wouldn't help.

> IIUC the early tlb flush won't protect concurrent fast-gup from happening,
> but I think it's safe because fast-gup will check pte after pinning, so
> either:
>
> (1) fast-gup runs before ptep_get_and_clear(), then
>     page_try_share_anon_rmap() will fail properly, or,
>
> (2) fast-gup runs during or after ptep_get_and_clear(), then fast-gup
>     will see that either the pte is none or changed, then it'll fail the
>     fast-gup itself.

I think you're right and I might have managed to confuse myself with the
write_protect_page() comment. I placed the gup_must_unshare() check
explicitly after the "pte changed" check for this reason. So once the PTE
was cleared, GUP-fast would undo any pin performed on a now-stale PTE.

>
>>
>> Another user that relies on this interaction between GUP-fast and TLB
>> flushing is for example mm/ksm.c:write_protect_page()
>>
>> There is a comment in there explaining the interaction a bit more detailed.
>>
>> Maybe we'll be able to handle this differently in the future (maybe once
>> this turns out to be an actual performance problem). Unfortunately,
>> mm->write_protect_seq isn't easily usable because we'd need have to make
>> sure we're the exclusive writer.
>>
>>
>> For now, it's not too complicated. For PTEs:
>> * try_to_migrate_one() already uses ptep_clear_flush().
>> * try_to_unmap_one() already conditionally used ptep_clear_flush().
>> * migrate_vma_collect_pmd() was the one case that didn't use it already
>>    (and I wonder why it's different than try_to_migrate_one()).
>
> I'm not sure whether I fully get the point, but here one major difference
> is all the rest handles one page, so a tlb flush alongside with the pte
> clear sounds reasonable. Even if so try_to_unmap_one() was modified to use
> tlb batching, but then I see that anon exclusive made that batching
> conditional. I also have question there on whether we can keep using the
> tlb batching even with anon exclusive pages there.
>
> In general, I still don't see how stall tlb could affect anon exclusive
> pages on racing with fast-gup, because the only side effect of a stall tlb
> is unwanted page update iiuc, the problem is fast-gup doesn't even use tlb,
> afaict..

I have the gut feeling that the comment in write_protect_page() is indeed
stale, and that clearing PageAnonExclusive doesn't strictly need the TLB
flush. I'll try to refresh my memory if there was any other case that I
had to handle over the weekend.

Thanks!
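For completeness, the interrupt disabling that the classic IPI-based argument above relies on lives in the GUP-fast entry path. A rough, from-memory sketch of the mm/gup.c lockless_pages_from_mm() core (the helper gup_pgd_range() and the variable names are recalled, not quoted from this thread):

	unsigned long flags;

	/*
	 * Walk the page tables with local interrupts disabled. On
	 * architectures that broadcast TLB flushes with IPIs this is what
	 * the "wait until that thread finished" reasoning relies on; with
	 * the RCU/broadcast-free variants it only keeps the page tables
	 * from being freed under the walker, as noted above.
	 */
	local_irq_save(flags);
	gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
	local_irq_restore(flags);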
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 6a5ef9f..51d9afa 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -7,6 +7,7 @@
 #include <linux/export.h>
 #include <linux/memremap.h>
 #include <linux/migrate.h>
+#include <linux/mm.h>
 #include <linux/mm_inline.h>
 #include <linux/mmu_notifier.h>
 #include <linux/oom.h>
@@ -196,7 +197,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
 			if (anon_exclusive) {
 				flush_cache_page(vma, addr, pte_pfn(*ptep));
-				ptep_clear_flush(vma, addr, ptep);
+				pte = ptep_clear_flush(vma, addr, ptep);
 
 				if (page_try_share_anon_rmap(page)) {
 					set_pte_at(mm, addr, ptep, pte);
@@ -206,11 +207,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 					goto next;
 				}
 			} else {
-				ptep_get_and_clear(mm, addr, ptep);
+				pte = ptep_get_and_clear(mm, addr, ptep);
 			}
 
 			migrate->cpages++;
 
+			/* Set the dirty flag on the folio now the pte is gone. */
+			if (pte_dirty(pte))
+				folio_mark_dirty(page_folio(page));
+
 			/* Setup special migration page table entry */
 			if (mpfn & MIGRATE_PFN_WRITE)
 				entry = make_writable_migration_entry(