Message ID | ea51bb69-189c-229b-fc0-9d3e7be5d6b@google.com
---|---
State | New
Series | mm: allow pte_offset_map[_lock]() to fail
Hugh Dickins <hughd@google.com> writes:

> migrate_vma_collect_pmd(): remove the pmd_trans_unstable() handling after
> splitting huge zero pmd, and the pmd_none() handling after successfully
> splitting huge page: those are now managed inside pte_offset_map_lock(),
> and by "goto again" when it fails.
>
> But the skip after unsuccessful split_huge_page() must stay: it avoids an
> endless loop. The skip when pmd_bad()? Remove that: it will be treated
> as a hole rather than a skip once cleared by pte_offset_map_lock(), but
> with different timing that would be so anyway; and it's arguably best to
> leave the pmd_bad() handling centralized there.

So for a pmd_bad() the sequence would be:

1. pte_offset_map_lock() would return NULL and clear the PMD.
2. goto again marks the page as a migrating hole.
3. In migrate_vma_insert_page() a new PMD is created by pmd_alloc().
4. This leads to a new zero page getting mapped for the previously
   pmd_bad() mapping.

I'm not entirely sure what the pmd_bad() case is used for, but is that
ok? I understand that previously it was all a matter of timing, but I
wouldn't rely on the previous code being correct in this regard either.

> migrate_vma_insert_page(): remove comment on the old pte_offset_map()
> and old locking limitations; remove the pmd_trans_unstable() check and
> just proceed to pte_offset_map_lock(), aborting when it fails (page has
> now been charged to memcg, but that's so in other cases, and presumably
> uncharged later).

Correct, the non-migrating page will be freed later via put_page(), which
will uncharge the page.

> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  mm/migrate_device.c | 31 ++++---------------------------
>  1 file changed, 4 insertions(+), 27 deletions(-)
>
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index d30c9de60b0d..a14af6b12b04 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -83,9 +83,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  		if (is_huge_zero_page(page)) {
>  			spin_unlock(ptl);
>  			split_huge_pmd(vma, pmdp, addr);
> -			if (pmd_trans_unstable(pmdp))
> -				return migrate_vma_collect_skip(start, end,
> -								walk);
>  		} else {
>  			int ret;
>
> @@ -100,16 +97,12 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  			if (ret)
>  				return migrate_vma_collect_skip(start, end,
>  								walk);
> -			if (pmd_none(*pmdp))
> -				return migrate_vma_collect_hole(start, end, -1,
> -								walk);
>  		}
>  	}
>
> -	if (unlikely(pmd_bad(*pmdp)))
> -		return migrate_vma_collect_skip(start, end, walk);
> -
>  	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> +	if (!ptep)
> +		goto again;
>  	arch_enter_lazy_mmu_mode();
>
>  	for (; addr < end; addr += PAGE_SIZE, ptep++) {
> @@ -595,27 +588,10 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
>  	pmdp = pmd_alloc(mm, pudp, addr);
>  	if (!pmdp)
>  		goto abort;
> -
>  	if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp))
>  		goto abort;
> -
> -	/*
> -	 * Use pte_alloc() instead of pte_alloc_map(). We can't run
> -	 * pte_offset_map() on pmds where a huge pmd might be created
> -	 * from a different thread.
> -	 *
> -	 * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when
> -	 * parallel threads are excluded by other means.
> -	 *
> -	 * Here we only have mmap_read_lock(mm).
> -	 */
>  	if (pte_alloc(mm, pmdp))
>  		goto abort;
> -
> -	/* See the comment in pte_alloc_one_map() */
> -	if (unlikely(pmd_trans_unstable(pmdp)))
> -		goto abort;
> -
>  	if (unlikely(anon_vma_prepare(vma)))
>  		goto abort;
>  	if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL))
> @@ -650,7 +626,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
>  	}
>
>  	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> -
> +	if (!ptep)
> +		goto abort;
>  	if (check_stable_address_space(mm))
>  		goto unlock_abort;
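To make the collect-side sequence discussed above concrete, here is a
condensed sketch of migrate_vma_collect_pmd() with this patch applied
(simplified: the huge-pmd splitting and the pte scan loop are elided,
and the shape is reconstructed from the diff rather than copied
verbatim):

	/*
	 * Sketch of the retry pattern: a NULL from pte_offset_map_lock()
	 * means the pmd changed underneath us - it may have become none,
	 * huge, or been cleared as bad - so re-evaluate it from the top.
	 */
again:
	if (pmd_none(*pmdp))
		return migrate_vma_collect_hole(start, end, -1, walk);

	/* ... split or skip a huge pmd, as in the hunks above ... */

	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
	if (!ptep)
		goto again;	/* pmd raced: previously-bad pmds land here too */
	arch_enter_lazy_mmu_mode();

	/* ... scan ptes from addr to end, collecting migration candidates ... */

	arch_leave_lazy_mmu_mode();
	pte_unmap_unlock(ptep - 1, ptl);
	return 0;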
On Tue, 23 May 2023, Alistair Popple wrote:
> Hugh Dickins <hughd@google.com> writes:
>
> > migrate_vma_collect_pmd(): remove the pmd_trans_unstable() handling after
> > splitting huge zero pmd, and the pmd_none() handling after successfully
> > splitting huge page: those are now managed inside pte_offset_map_lock(),
> > and by "goto again" when it fails.
> >
> > But the skip after unsuccessful split_huge_page() must stay: it avoids an
> > endless loop. The skip when pmd_bad()? Remove that: it will be treated
> > as a hole rather than a skip once cleared by pte_offset_map_lock(), but
> > with different timing that would be so anyway; and it's arguably best to
> > leave the pmd_bad() handling centralized there.
>
> So for a pmd_bad() the sequence would be:
>
> 1. pte_offset_map_lock() would return NULL and clear the PMD.
> 2. goto again marks the page as a migrating hole.
> 3. In migrate_vma_insert_page() a new PMD is created by pmd_alloc().
> 4. This leads to a new zero page getting mapped for the previously
>    pmd_bad() mapping.

Agreed.

> I'm not entirely sure what the pmd_bad() case is used for, but is that
> ok? I understand that previously it was all a matter of timing, but I
> wouldn't rely on the previous code being correct in this regard either.

The pmd_bad() case is for when the pmd table got corrupted (overwritten,
cosmic rays, whatever), and that pmd entry is easily recognized as
nonsense: we try not to crash on it, but user data may have got lost.

My "timing" remark may not be accurate: I seem to be living in the past,
when we had a lot more "pmd_none_or_clear_bad()"s around than today - I
was thinking that any one of them could be racily changing the bad to none.
Though I suppose I am now making my timing remark accurate, by changing
the bad to none more often again.

Since data is liable to be lost anyway (unless the corrupted entry was
actually none before it got corrupted), it doesn't matter greatly what
we do with it (some would definitely prefer a crash, but traditionally
we don't): issuing a "pmd bad" message and not getting stuck in a loop
is the main thing.

> > migrate_vma_insert_page(): remove comment on the old pte_offset_map()
> > and old locking limitations; remove the pmd_trans_unstable() check and
> > just proceed to pte_offset_map_lock(), aborting when it fails (page has
> > now been charged to memcg, but that's so in other cases, and presumably
> > uncharged later).
>
> Correct, the non-migrating page will be freed later via put_page(), which
> will uncharge the page.

Thanks for confirming: yes, it was more difficult once upon a time,
but nowadays it is just a matter of reaching the final put_page().

Hugh
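For readers who want to see what "centralized there" means in practice,
the bad-pmd handling inside the pte_offset_map() path has roughly this
shape (a simplified sketch, not the exact kernel source):

	/*
	 * Sketch: the map attempt itself absorbs a corrupted pmd.
	 * The "bad pmd" report and the clear happen once, here; every
	 * caller then sees NULL and treats the range as a hole (or
	 * retries), instead of each caller carrying its own pmd_bad()
	 * check, as migrate_vma_collect_pmd() used to.
	 */
	pmd_t pmdval = pmdp_get_lockless(pmdp);

	if (pmd_none(pmdval))
		return NULL;		/* hole: no page table here */
	if (unlikely(pmd_bad(pmdval))) {
		pmd_clear_bad(pmdp);	/* logs "bad pmd" and clears it to none */
		return NULL;		/* from now on, indistinguishable from a hole */
	}
	/* ... otherwise map the pte page and return a pte pointer ... */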
Hugh Dickins <hughd@google.com> writes:

> On Tue, 23 May 2023, Alistair Popple wrote:
>> Hugh Dickins <hughd@google.com> writes:
>>
>> > migrate_vma_collect_pmd(): remove the pmd_trans_unstable() handling after
>> > splitting huge zero pmd, and the pmd_none() handling after successfully
>> > splitting huge page: those are now managed inside pte_offset_map_lock(),
>> > and by "goto again" when it fails.
>> >
>> > But the skip after unsuccessful split_huge_page() must stay: it avoids an
>> > endless loop. The skip when pmd_bad()? Remove that: it will be treated
>> > as a hole rather than a skip once cleared by pte_offset_map_lock(), but
>> > with different timing that would be so anyway; and it's arguably best to
>> > leave the pmd_bad() handling centralized there.
>>
>> So for a pmd_bad() the sequence would be:
>>
>> 1. pte_offset_map_lock() would return NULL and clear the PMD.
>> 2. goto again marks the page as a migrating hole.
>> 3. In migrate_vma_insert_page() a new PMD is created by pmd_alloc().
>> 4. This leads to a new zero page getting mapped for the previously
>>    pmd_bad() mapping.
>
> Agreed.
>
>> I'm not entirely sure what the pmd_bad() case is used for, but is that
>> ok? I understand that previously it was all a matter of timing, but I
>> wouldn't rely on the previous code being correct in this regard either.
>
> The pmd_bad() case is for when the pmd table got corrupted (overwritten,
> cosmic rays, whatever), and that pmd entry is easily recognized as
> nonsense: we try not to crash on it, but user data may have got lost.
>
> My "timing" remark may not be accurate: I seem to be living in the past,
> when we had a lot more "pmd_none_or_clear_bad()"s around than today - I
> was thinking that any one of them could be racily changing the bad to none.
> Though I suppose I am now making my timing remark accurate, by changing
> the bad to none more often again.
>
> Since data is liable to be lost anyway (unless the corrupted entry was
> actually none before it got corrupted), it doesn't matter greatly what
> we do with it (some would definitely prefer a crash, but traditionally
> we don't): issuing a "pmd bad" message and not getting stuck in a loop
> is the main thing.

Thanks for the background. Either skipping it or marking it as a hole as
you've done here will avoid a loop, so feel free to add:

Reviewed-by: Alistair Popple <apopple@nvidia.com>

>> > migrate_vma_insert_page(): remove comment on the old pte_offset_map()
>> > and old locking limitations; remove the pmd_trans_unstable() check and
>> > just proceed to pte_offset_map_lock(), aborting when it fails (page has
>> > now been charged to memcg, but that's so in other cases, and presumably
>> > uncharged later).
>>
>> Correct, the non-migrating page will be freed later via put_page(), which
>> will uncharge the page.
>
> Thanks for confirming: yes, it was more difficult once upon a time,
> but nowadays it is just a matter of reaching the final put_page().
>
> Hugh
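The charge/uncharge point confirmed above can be illustrated with a
condensed sketch of the abort handling in migrate_vma_insert_page()
(assembled from the hunks in the patch below; the body of the abort
label is recalled from the surrounding kernel source rather than shown
in the diff, and unrelated steps are elided):

	if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL))
		goto abort;

	/* ... build the pte entry for the new page ... */

	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
	if (!ptep)
		goto abort;	/* page stays charged; the final put_page() uncharges it */

	/* ... checks, set_pte_at(), pte_unmap_unlock() ... */
	return;

abort:
	*src &= ~MIGRATE_PFN_MIGRATE;	/* caller sees a non-migrating page and puts it */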
migrate_vma_collect_pmd(): remove the pmd_trans_unstable() handling after
splitting huge zero pmd, and the pmd_none() handling after successfully
splitting huge page: those are now managed inside pte_offset_map_lock(),
and by "goto again" when it fails.

But the skip after unsuccessful split_huge_page() must stay: it avoids an
endless loop. The skip when pmd_bad()? Remove that: it will be treated
as a hole rather than a skip once cleared by pte_offset_map_lock(), but
with different timing that would be so anyway; and it's arguably best to
leave the pmd_bad() handling centralized there.

migrate_vma_insert_page(): remove comment on the old pte_offset_map()
and old locking limitations; remove the pmd_trans_unstable() check and
just proceed to pte_offset_map_lock(), aborting when it fails (page has
now been charged to memcg, but that's so in other cases, and presumably
uncharged later).

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/migrate_device.c | 31 ++++---------------------------
 1 file changed, 4 insertions(+), 27 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index d30c9de60b0d..a14af6b12b04 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -83,9 +83,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		if (is_huge_zero_page(page)) {
 			spin_unlock(ptl);
 			split_huge_pmd(vma, pmdp, addr);
-			if (pmd_trans_unstable(pmdp))
-				return migrate_vma_collect_skip(start, end,
-								walk);
 		} else {
 			int ret;

@@ -100,16 +97,12 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			if (ret)
 				return migrate_vma_collect_skip(start, end,
 								walk);
-			if (pmd_none(*pmdp))
-				return migrate_vma_collect_hole(start, end, -1,
-								walk);
 		}
 	}

-	if (unlikely(pmd_bad(*pmdp)))
-		return migrate_vma_collect_skip(start, end, walk);
-
 	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+	if (!ptep)
+		goto again;
 	arch_enter_lazy_mmu_mode();

 	for (; addr < end; addr += PAGE_SIZE, ptep++) {
@@ -595,27 +588,10 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	pmdp = pmd_alloc(mm, pudp, addr);
 	if (!pmdp)
 		goto abort;
-
 	if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp))
 		goto abort;
-
-	/*
-	 * Use pte_alloc() instead of pte_alloc_map(). We can't run
-	 * pte_offset_map() on pmds where a huge pmd might be created
-	 * from a different thread.
-	 *
-	 * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when
-	 * parallel threads are excluded by other means.
-	 *
-	 * Here we only have mmap_read_lock(mm).
-	 */
 	if (pte_alloc(mm, pmdp))
 		goto abort;
-
-	/* See the comment in pte_alloc_one_map() */
-	if (unlikely(pmd_trans_unstable(pmdp)))
-		goto abort;
-
 	if (unlikely(anon_vma_prepare(vma)))
 		goto abort;
 	if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL))
@@ -650,7 +626,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	}

 	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
-
+	if (!ptep)
+		goto abort;
 	if (check_stable_address_space(mm))
 		goto unlock_abort;
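In summary, the two call sites changed by this patch adopt different
recovery policies when pte_offset_map_lock() fails (condensed from the
hunks above):

	/* migrate_vma_collect_pmd(): re-examine the pmd and try again */
	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
	if (!ptep)
		goto again;

	/* migrate_vma_insert_page(): give up on migrating this page */
	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
	if (!ptep)
		goto abort;

The collect side can retry because it re-reads the pmd from scratch
under mmap_lock; the insert side has no retry loop, and abandoning the
migration of a single page is a safe fallback, since the non-migrating
page is simply freed (and uncharged) later.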