Message ID | 20240801081657.1386743-1-dev.jain@arm.com (mailing list archive) |
---|---|
State | New, archived |
Series | Race condition observed between page migration and page fault handling on arm64 machines |
On 01.08.24 10:16, Dev Jain wrote:
> Ryan and I had a discussion and we thought it would be best to get feedback
> from the community.
>
> The migration mm selftest currently fails on arm64 for shared anon mappings,
> due to the following race:

Do you mean MAP_SHARED|MAP_ANON or MAP_PRIVATE|MAP_ANON + fork? Because you
note shmem below, I assume you mean MAP_SHARED|MAP_ANON.

>
> Migration:                          Page fault:
>
> try_to_migrate_one():               handle_pte_fault():
> 1. Nuke the PTE                     PTE has been deleted => do_pte_missing()
> 2. Mark the PTE for migration       PTE has not been deleted but is just not
>                                     "present" => do_swap_page()

In filemap_fault_recheck_pte_none() we recheck under the PTL to make sure that
an observed pte_none() really was a persistent pte_none() and not just a
temporary one.

Should we do something similar in do_fault()? I see that we already do
something like that on the "!vma->vm_ops->fault" path.

But of course, there is a tradeoff between letting migration (temporarily)
fail and grabbing the PTL during page faults.
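For reference, the scenario the failing selftest exercises can be sketched in a
few lines of userspace C: a MAP_SHARED|MAP_ANONYMOUS page is kept hot by a
forked child while the parent bounces it between two NUMA nodes with
move_pages(). The program below is illustrative only (it is not the actual
tools/testing/selftests/mm/migration.c) and assumes libnuma headers, linking
with -lnuma, and at least two NUMA nodes.

/* Illustrative reproducer sketch, not the real selftest. */
#include <numaif.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	uint64_t *ptr = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
			     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	pid_t pid;
	int n1 = 0, n2 = 1;

	if (ptr == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	*ptr = 0;

	pid = fork();
	if (pid == 0) {
		/* Child: keep touching the page so its page faults race with
		 * the parent's migration attempts. */
		for (;;)
			(*ptr)++;
	}

	for (int i = 0; i < 1000; i++) {
		int status = 0, node = n2, tmp;
		long ret = move_pages(0, 1, (void **)&ptr, &node, &status,
				      MPOL_MF_MOVE_ALL);

		if (ret > 0) {
			printf("iteration %d: %ld page(s) not migrated\n", i, ret);
			break;
		} else if (ret < 0) {
			perror("move_pages");
			break;
		}
		/* Bounce the page back and forth between the two nodes. */
		tmp = n2;
		n2 = n1;
		n1 = tmp;
	}

	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);
	return 0;
}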
On 8/1/24 14:12, David Hildenbrand wrote: > On 01.08.24 10:16, Dev Jain wrote: >> I and Ryan had a discussion and we thought it would be best to get >> feedback >> from the community. >> >> The migration mm selftest currently fails on arm64 for shared anon >> mappings, >> due to the following race: > > Do you mean MAP_SHARED|MAP_ANON or MAP_PRIVATE|MAP_ANON_fork? Because > you note shmem below, I assume you mean MAP_SHARED|MAP_ANON Yes. > >> >> Migration: Page fault: >> try_to_migrate_one(): handle_pte_fault(): >> 1. Nuke the PTE PTE has been deleted => >> do_pte_missing() >> 2. Mark the PTE for migration PTE has not been deleted >> but is just not "present" => do_swap_page() >> > > In filemap_fault_recheck_pte_none() we recheck under PTL to make sure > that a temporary pte_none() really was persistent pte_none() and not a > temporary pte_none() under PTL. > > Should we do something similar in do_fault()? I see that we already do > something like that on the "!vma->vm_ops->fault" path. > > But of course, there is a tradeoff between letting migration > (temporarily) fail and grabbing the PTL during page faults. To dampen the tradeoff, we could do this in shmem_fault() instead? But then, this would mean that we do this in all kinds of vma->vm_ops->fault, only when we discover another reference count race condition :) Doing this in do_fault() should solve this once and for all. In fact, do_pte_missing() may call do_anonymous_page() or do_fault(), and I just noticed that the former already checks this using vmf_pte_changed().
On 01.08.24 11:38, Dev Jain wrote: > > On 8/1/24 14:12, David Hildenbrand wrote: >> On 01.08.24 10:16, Dev Jain wrote: >>> I and Ryan had a discussion and we thought it would be best to get >>> feedback >>> from the community. >>> >>> The migration mm selftest currently fails on arm64 for shared anon >>> mappings, >>> due to the following race: >> >> Do you mean MAP_SHARED|MAP_ANON or MAP_PRIVATE|MAP_ANON_fork? Because >> you note shmem below, I assume you mean MAP_SHARED|MAP_ANON > > > Yes. > >> >>> >>> Migration: Page fault: >>> try_to_migrate_one(): handle_pte_fault(): >>> 1. Nuke the PTE PTE has been deleted => >>> do_pte_missing() >>> 2. Mark the PTE for migration PTE has not been deleted >>> but is just not "present" => do_swap_page() >>> >> >> In filemap_fault_recheck_pte_none() we recheck under PTL to make sure >> that a temporary pte_none() really was persistent pte_none() and not a >> temporary pte_none() under PTL. >> >> Should we do something similar in do_fault()? I see that we already do >> something like that on the "!vma->vm_ops->fault" path. >> >> But of course, there is a tradeoff between letting migration >> (temporarily) fail and grabbing the PTL during page faults. > > > To dampen the tradeoff, we could do this in shmem_fault() instead? But > then, this would mean that we do this in all > > kinds of vma->vm_ops->fault, only when we discover another reference > count race condition :) Doing this in do_fault() > > should solve this once and for all. In fact, do_pte_missing() may call > do_anonymous_page() or do_fault(), and I just > > noticed that the former already checks this using vmf_pte_changed(). What I am still missing is why this is (a) arm64 only; and (b) if this is something we should really worry about. There are other reasons (e.g., speculative references) why migration could temporarily fail, does it happen that often that it is really something we have to worry about?
On 8/1/24 15:11, David Hildenbrand wrote: > On 01.08.24 11:38, Dev Jain wrote: >> >> On 8/1/24 14:12, David Hildenbrand wrote: >>> On 01.08.24 10:16, Dev Jain wrote: >>>> I and Ryan had a discussion and we thought it would be best to get >>>> feedback >>>> from the community. >>>> >>>> The migration mm selftest currently fails on arm64 for shared anon >>>> mappings, >>>> due to the following race: >>> >>> Do you mean MAP_SHARED|MAP_ANON or MAP_PRIVATE|MAP_ANON_fork? Because >>> you note shmem below, I assume you mean MAP_SHARED|MAP_ANON >> >> >> Yes. >> >>> >>>> >>>> Migration: Page fault: >>>> try_to_migrate_one(): handle_pte_fault(): >>>> 1. Nuke the PTE PTE has been deleted => >>>> do_pte_missing() >>>> 2. Mark the PTE for migration PTE has not been deleted >>>> but is just not "present" => do_swap_page() >>>> >>> >>> In filemap_fault_recheck_pte_none() we recheck under PTL to make sure >>> that a temporary pte_none() really was persistent pte_none() and not a >>> temporary pte_none() under PTL. >>> >>> Should we do something similar in do_fault()? I see that we already do >>> something like that on the "!vma->vm_ops->fault" path. >>> >>> But of course, there is a tradeoff between letting migration >>> (temporarily) fail and grabbing the PTL during page faults. >> >> >> To dampen the tradeoff, we could do this in shmem_fault() instead? But >> then, this would mean that we do this in all >> >> kinds of vma->vm_ops->fault, only when we discover another reference >> count race condition :) Doing this in do_fault() >> >> should solve this once and for all. In fact, do_pte_missing() may call >> do_anonymous_page() or do_fault(), and I just >> >> noticed that the former already checks this using vmf_pte_changed(). > > What I am still missing is why this is (a) arm64 only; and (b) if this > is something we should really worry about. There are other reasons > (e.g., speculative references) why migration could temporarily fail, > does it happen that often that it is really something we have to worry > about? (a) See discussion at [1]; I guess it passes on x86, which is quite strange since the race is clearly arch-independent. (b) On my machine, on an average in under 10 iterations of move_pages(), it fails, which seems problematic to me assuming it is passing on other arches, since 10 iterations means this is failing very quickly. [1]: https://lore.kernel.org/all/9de42ace-dab1-5f60-af8a-26045ada27b9@arm.com/
On 01/08/2024 10:41, David Hildenbrand wrote: > On 01.08.24 11:38, Dev Jain wrote: >> >> On 8/1/24 14:12, David Hildenbrand wrote: >>> On 01.08.24 10:16, Dev Jain wrote: >>>> I and Ryan had a discussion and we thought it would be best to get >>>> feedback >>>> from the community. >>>> >>>> The migration mm selftest currently fails on arm64 for shared anon >>>> mappings, >>>> due to the following race: >>> >>> Do you mean MAP_SHARED|MAP_ANON or MAP_PRIVATE|MAP_ANON_fork? Because >>> you note shmem below, I assume you mean MAP_SHARED|MAP_ANON >> >> >> Yes. >> >>> >>>> >>>> Migration: Page fault: >>>> try_to_migrate_one(): handle_pte_fault(): >>>> 1. Nuke the PTE PTE has been deleted => >>>> do_pte_missing() >>>> 2. Mark the PTE for migration PTE has not been deleted >>>> but is just not "present" => do_swap_page() >>>> >>> >>> In filemap_fault_recheck_pte_none() we recheck under PTL to make sure >>> that a temporary pte_none() really was persistent pte_none() and not a >>> temporary pte_none() under PTL. >>> >>> Should we do something similar in do_fault()? I see that we already do >>> something like that on the "!vma->vm_ops->fault" path. >>> >>> But of course, there is a tradeoff between letting migration >>> (temporarily) fail and grabbing the PTL during page faults. >> >> >> To dampen the tradeoff, we could do this in shmem_fault() instead? But >> then, this would mean that we do this in all >> >> kinds of vma->vm_ops->fault, only when we discover another reference >> count race condition :) Doing this in do_fault() >> >> should solve this once and for all. In fact, do_pte_missing() may call >> do_anonymous_page() or do_fault(), and I just >> >> noticed that the former already checks this using vmf_pte_changed(). > > What I am still missing is why this is (a) arm64 only; and (b) if this is > something we should really worry about. There are other reasons (e.g., > speculative references) why migration could temporarily fail, does it happen > that often that it is really something we have to worry about? The test fails consistently on arm64. It's my rough understanding that it's failing due to migration backing off because the fault handler has raised the ref count? (Dev correct me if I'm wrong). So the real question is, is it a valid test in the first place? Should we just delete the test or do we need to strengthen the kernel's guarrantees around migration success?
>>> To dampen the tradeoff, we could do this in shmem_fault() instead? But >>> then, this would mean that we do this in all >>> >>> kinds of vma->vm_ops->fault, only when we discover another reference >>> count race condition :) Doing this in do_fault() >>> >>> should solve this once and for all. In fact, do_pte_missing() may call >>> do_anonymous_page() or do_fault(), and I just >>> >>> noticed that the former already checks this using vmf_pte_changed(). >> >> What I am still missing is why this is (a) arm64 only; and (b) if this >> is something we should really worry about. There are other reasons >> (e.g., speculative references) why migration could temporarily fail, >> does it happen that often that it is really something we have to worry >> about? > > > (a) See discussion at [1]; I guess it passes on x86, which is quite > strange since the race is clearly arch-independent. Yes, I think this is what we have to understand. Is the race simply less likely to trigger on x86? I would assume that it would trigger on any arch. I just ran it on a x86 VM with 2 NUMA nodes and it also seems to work here. Is this maybe related to deferred flushing? Such that the other CPU will by accident just observe the !pte_none a little less likely? But arm64 also usually defers flushes, right? At least unless ARM64_WORKAROUND_REPEAT_TLBI is around. With that we never do deferred flushes. > > (b) On my machine, on an average in under 10 iterations of move_pages(), > it fails, which seems problematic to Yes, it's a big difference compared to what I encounter.
>> What I am still missing is why this is (a) arm64 only; and (b) if this is
>> something we should really worry about. There are other reasons (e.g.,
>> speculative references) why migration could temporarily fail, does it happen
>> that often that it is really something we have to worry about?
>
> The test fails consistently on arm64. It's my rough understanding that it's
> failing due to migration backing off because the fault handler has raised the
> ref count? (Dev correct me if I'm wrong).
>
> So the real question is, is it a valid test in the first place? Should we just
> delete the test or do we need to strengthen the kernel's guarantees around
> migration success?

I think the test should retry migration a number of times in case it fails.
But if it is a persistent migration failure, the test should fail.
On 01.08.24 15:13, David Hildenbrand wrote:
>>>> To dampen the tradeoff, we could do this in shmem_fault() instead? But
>>>> then, this would mean that we do this in all
>>>>
>>>> kinds of vma->vm_ops->fault, only when we discover another reference
>>>> count race condition :) Doing this in do_fault()
>>>>
>>>> should solve this once and for all. In fact, do_pte_missing() may call
>>>> do_anonymous_page() or do_fault(), and I just
>>>>
>>>> noticed that the former already checks this using vmf_pte_changed().
>>>
>>> What I am still missing is why this is (a) arm64 only; and (b) if this
>>> is something we should really worry about. There are other reasons
>>> (e.g., speculative references) why migration could temporarily fail,
>>> does it happen that often that it is really something we have to worry
>>> about?
>>
>>
>> (a) See discussion at [1]; I guess it passes on x86, which is quite
>> strange since the race is clearly arch-independent.
>
> Yes, I think this is what we have to understand. Is the race simply less
> likely to trigger on x86?
>
> I would assume that it would trigger on any arch.
>
> I just ran it on a x86 VM with 2 NUMA nodes and it also seems to work here.
>
> Is this maybe related to deferred flushing? Such that the other CPU will
> by accident just observe the !pte_none a little less likely?
>
> But arm64 also usually defers flushes, right? At least unless
> ARM64_WORKAROUND_REPEAT_TLBI is around. With that we never do deferred
> flushes.

Bingo!

diff --git a/mm/rmap.c b/mm/rmap.c
index e51ed44f8b53..ce94b810586b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -718,10 +718,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
  */
 static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 {
-	if (!(flags & TTU_BATCH_FLUSH))
-		return false;
-
-	return arch_tlbbatch_should_defer(mm);
+	return false;
 }


On x86:

# ./migration
TAP version 13
1..1
# Starting 1 tests from 1 test cases.
#  RUN           migration.shared_anon ...
Didn't migrate 1 pages
# migration.c:170:shared_anon:Expected migrate(ptr, self->n1, self->n2) (-2) == 0 (0)
# shared_anon: Test terminated by assertion
#          FAIL  migration.shared_anon
not ok 1 migration.shared_anon


It fails all of the time!
On Thu, Aug 01, 2024 at 03:26:57PM +0200, David Hildenbrand wrote: > On 01.08.24 15:13, David Hildenbrand wrote: > > > > > To dampen the tradeoff, we could do this in shmem_fault() instead? But > > > > > then, this would mean that we do this in all > > > > > > > > > > kinds of vma->vm_ops->fault, only when we discover another reference > > > > > count race condition :) Doing this in do_fault() > > > > > > > > > > should solve this once and for all. In fact, do_pte_missing() may call > > > > > do_anonymous_page() or do_fault(), and I just > > > > > > > > > > noticed that the former already checks this using vmf_pte_changed(). > > > > > > > > What I am still missing is why this is (a) arm64 only; and (b) if this > > > > is something we should really worry about. There are other reasons > > > > (e.g., speculative references) why migration could temporarily fail, > > > > does it happen that often that it is really something we have to worry > > > > about? > > > > > > > > > (a) See discussion at [1]; I guess it passes on x86, which is quite > > > strange since the race is clearly arch-independent. > > > > Yes, I think this is what we have to understand. Is the race simply less > > likely to trigger on x86? > > > > I would assume that it would trigger on any arch. > > > > I just ran it on a x86 VM with 2 NUMA nodes and it also seems to work here. > > > > Is this maybe related to deferred flushing? Such that the other CPU will > > by accident just observe the !pte_none a little less likely? > > > > But arm64 also usually defers flushes, right? At least unless > > ARM64_WORKAROUND_REPEAT_TLBI is around. With that we never do deferred > > flushes. > > Bingo! > > diff --git a/mm/rmap.c b/mm/rmap.c > index e51ed44f8b53..ce94b810586b 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -718,10 +718,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct > *mm, pte_t pteval, > */ > static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags) > { > - if (!(flags & TTU_BATCH_FLUSH)) > - return false; > - > - return arch_tlbbatch_should_defer(mm); > + return false; > } > > > On x86: > > # ./migration > TAP version 13 > 1..1 > # Starting 1 tests from 1 test cases. > # RUN migration.shared_anon ... > Didn't migrate 1 pages > # migration.c:170:shared_anon:Expected migrate(ptr, self->n1, self->n2) (-2) > == 0 (0) > # shared_anon: Test terminated by assertion > # FAIL migration.shared_anon > not ok 1 migration.shared_anon > > > It fails all of the time! Nice work! I suppose that makes sense as, with the eager TLB invalidation, the window between the other CPU faulting and the migration entry being written is fairly wide. Not sure about a fix though :/ It feels a bit overkill to add a new invalid pte encoding just for this. Will
On 01.08.24 15:43, Will Deacon wrote: > On Thu, Aug 01, 2024 at 03:26:57PM +0200, David Hildenbrand wrote: >> On 01.08.24 15:13, David Hildenbrand wrote: >>>>>> To dampen the tradeoff, we could do this in shmem_fault() instead? But >>>>>> then, this would mean that we do this in all >>>>>> >>>>>> kinds of vma->vm_ops->fault, only when we discover another reference >>>>>> count race condition :) Doing this in do_fault() >>>>>> >>>>>> should solve this once and for all. In fact, do_pte_missing() may call >>>>>> do_anonymous_page() or do_fault(), and I just >>>>>> >>>>>> noticed that the former already checks this using vmf_pte_changed(). >>>>> >>>>> What I am still missing is why this is (a) arm64 only; and (b) if this >>>>> is something we should really worry about. There are other reasons >>>>> (e.g., speculative references) why migration could temporarily fail, >>>>> does it happen that often that it is really something we have to worry >>>>> about? >>>> >>>> >>>> (a) See discussion at [1]; I guess it passes on x86, which is quite >>>> strange since the race is clearly arch-independent. >>> >>> Yes, I think this is what we have to understand. Is the race simply less >>> likely to trigger on x86? >>> >>> I would assume that it would trigger on any arch. >>> >>> I just ran it on a x86 VM with 2 NUMA nodes and it also seems to work here. >>> >>> Is this maybe related to deferred flushing? Such that the other CPU will >>> by accident just observe the !pte_none a little less likely? >>> >>> But arm64 also usually defers flushes, right? At least unless >>> ARM64_WORKAROUND_REPEAT_TLBI is around. With that we never do deferred >>> flushes. >> >> Bingo! >> >> diff --git a/mm/rmap.c b/mm/rmap.c >> index e51ed44f8b53..ce94b810586b 100644 >> --- a/mm/rmap.c >> +++ b/mm/rmap.c >> @@ -718,10 +718,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct >> *mm, pte_t pteval, >> */ >> static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags) >> { >> - if (!(flags & TTU_BATCH_FLUSH)) >> - return false; >> - >> - return arch_tlbbatch_should_defer(mm); >> + return false; >> } >> >> >> On x86: >> >> # ./migration >> TAP version 13 >> 1..1 >> # Starting 1 tests from 1 test cases. >> # RUN migration.shared_anon ... >> Didn't migrate 1 pages >> # migration.c:170:shared_anon:Expected migrate(ptr, self->n1, self->n2) (-2) >> == 0 (0) >> # shared_anon: Test terminated by assertion >> # FAIL migration.shared_anon >> not ok 1 migration.shared_anon >> >> >> It fails all of the time! > > Nice work! I suppose that makes sense as, with the eager TLB > invalidation, the window between the other CPU faulting and the > migration entry being written is fairly wide. > > Not sure about a fix though :/ It feels a bit overkill to add a new > invalid pte encoding just for this. 
Something like that might make the test happy in most cases:

diff --git a/tools/testing/selftests/mm/migration.c b/tools/testing/selftests/mm/migration.c
index 6908569ef406..4c18bfc13b94 100644
--- a/tools/testing/selftests/mm/migration.c
+++ b/tools/testing/selftests/mm/migration.c
@@ -65,6 +65,7 @@ int migrate(uint64_t *ptr, int n1, int n2)
 	int ret, tmp;
 	int status = 0;
 	struct timespec ts1, ts2;
+	int errors = 0;
 
 	if (clock_gettime(CLOCK_MONOTONIC, &ts1))
 		return -1;
@@ -79,12 +80,17 @@ int migrate(uint64_t *ptr, int n1, int n2)
 		ret = move_pages(0, 1, (void **) &ptr, &n2, &status,
 				 MPOL_MF_MOVE_ALL);
 		if (ret) {
-			if (ret > 0)
+			if (ret > 0) {
+				if (++errors < 100)
+					continue;
 				printf("Didn't migrate %d pages\n", ret);
-			else
+			} else {
 				perror("Couldn't migrate pages");
+			}
 			return -2;
 		}
+		/* Progress! */
+		errors = 0;
 
 		tmp = n2;
 		n2 = n1;


[root@localhost mm]# ./migration
TAP version 13
1..1
# Starting 1 tests from 1 test cases.
#  RUN           migration.shared_anon ...
#            OK  migration.shared_anon
ok 1 migration.shared_anon
# PASSED: 1 / 1 tests passed.
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
On 01/08/2024 14:48, David Hildenbrand wrote: > On 01.08.24 15:43, Will Deacon wrote: >> On Thu, Aug 01, 2024 at 03:26:57PM +0200, David Hildenbrand wrote: >>> On 01.08.24 15:13, David Hildenbrand wrote: >>>>>>> To dampen the tradeoff, we could do this in shmem_fault() instead? But >>>>>>> then, this would mean that we do this in all >>>>>>> >>>>>>> kinds of vma->vm_ops->fault, only when we discover another reference >>>>>>> count race condition :) Doing this in do_fault() >>>>>>> >>>>>>> should solve this once and for all. In fact, do_pte_missing() may call >>>>>>> do_anonymous_page() or do_fault(), and I just >>>>>>> >>>>>>> noticed that the former already checks this using vmf_pte_changed(). >>>>>> >>>>>> What I am still missing is why this is (a) arm64 only; and (b) if this >>>>>> is something we should really worry about. There are other reasons >>>>>> (e.g., speculative references) why migration could temporarily fail, >>>>>> does it happen that often that it is really something we have to worry >>>>>> about? >>>>> >>>>> >>>>> (a) See discussion at [1]; I guess it passes on x86, which is quite >>>>> strange since the race is clearly arch-independent. >>>> >>>> Yes, I think this is what we have to understand. Is the race simply less >>>> likely to trigger on x86? >>>> >>>> I would assume that it would trigger on any arch. >>>> >>>> I just ran it on a x86 VM with 2 NUMA nodes and it also seems to work here. >>>> >>>> Is this maybe related to deferred flushing? Such that the other CPU will >>>> by accident just observe the !pte_none a little less likely? >>>> >>>> But arm64 also usually defers flushes, right? At least unless >>>> ARM64_WORKAROUND_REPEAT_TLBI is around. With that we never do deferred >>>> flushes. >>> >>> Bingo! >>> >>> diff --git a/mm/rmap.c b/mm/rmap.c >>> index e51ed44f8b53..ce94b810586b 100644 >>> --- a/mm/rmap.c >>> +++ b/mm/rmap.c >>> @@ -718,10 +718,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct >>> *mm, pte_t pteval, >>> */ >>> static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags) >>> { >>> - if (!(flags & TTU_BATCH_FLUSH)) >>> - return false; >>> - >>> - return arch_tlbbatch_should_defer(mm); >>> + return false; >>> } >>> >>> >>> On x86: >>> >>> # ./migration >>> TAP version 13 >>> 1..1 >>> # Starting 1 tests from 1 test cases. >>> # RUN migration.shared_anon ... >>> Didn't migrate 1 pages >>> # migration.c:170:shared_anon:Expected migrate(ptr, self->n1, self->n2) (-2) >>> == 0 (0) >>> # shared_anon: Test terminated by assertion >>> # FAIL migration.shared_anon >>> not ok 1 migration.shared_anon >>> >>> >>> It fails all of the time! >> >> Nice work! I suppose that makes sense as, with the eager TLB >> invalidation, the window between the other CPU faulting and the >> migration entry being written is fairly wide. >> >> Not sure about a fix though :/ It feels a bit overkill to add a new >> invalid pte encoding just for this. If we want to fix the kernel, I think Dev's option 2 could work? 2. Prepare the migration swap entry and do an atomic exchange to fill the PTE (this currently seems the best option to me, but I have not tried it out). We could create a new helper, ptep_get_and_set_not_present() then we can atomically change the pte to the migration entry so the fault handler never sees pte_none(). On arm64 we are already using an atomic in ptep_get_and_clear() so its no extra work. We could implement a non-atomic default version for other arches. (Although I guess if we fix for arm64 we would want to fix for all). 
That wouldn't require any new encoding, just refactoring the migration code to call the helper. But if we can convince ourselves that changing the test as David suggests below is good enough and we won't still get spurious failures, then that seems simplest. Dev, could you see if you can get the retry-the-test approach to fail? Thanks, Ryan > > Something like that might make the test happy in most cases: > > diff --git a/tools/testing/selftests/mm/migration.c > b/tools/testing/selftests/mm/migration.c > index 6908569ef406..4c18bfc13b94 100644 > --- a/tools/testing/selftests/mm/migration.c > +++ b/tools/testing/selftests/mm/migration.c > @@ -65,6 +65,7 @@ int migrate(uint64_t *ptr, int n1, int n2) > int ret, tmp; > int status = 0; > struct timespec ts1, ts2; > + int errors = 0; > > if (clock_gettime(CLOCK_MONOTONIC, &ts1)) > return -1; > @@ -79,12 +80,17 @@ int migrate(uint64_t *ptr, int n1, int n2) > ret = move_pages(0, 1, (void **) &ptr, &n2, &status, > MPOL_MF_MOVE_ALL); > if (ret) { > - if (ret > 0) > + if (ret > 0) { > + if (++errors < 100) > + continue; > printf("Didn't migrate %d pages\n", ret); > - else > + } else { > perror("Couldn't migrate pages"); > + } > return -2; > } > + /* Progress! */ > + errors = 0; > > tmp = n2; > n2 = n1; > > > [root@localhost mm]# ./migration > TAP version 13 > 1..1 > # Starting 1 tests from 1 test cases. > # RUN migration.shared_anon ... > # OK migration.shared_anon > ok 1 migration.shared_anon > # PASSED: 1 / 1 tests passed. > # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0 > >
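A userspace C11 model of the idea behind the suggested helper (not kernel
code: ptep_get_and_set_not_present() does not exist today, and the PTE
encodings below are invented purely for illustration). It contrasts "clear the
PTE, then write the migration entry" with a single atomic exchange, which
removes the window in which a lockless reader can observe a none PTE.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define PTE_NONE       ((uint64_t)0)
#define PTE_PRESENT    ((uint64_t)0x1)         /* made-up "valid" encoding  */
#define MIGRATION_SWP  ((uint64_t)0xdead0000)  /* made-up migration entry   */

static _Atomic uint64_t pte = PTE_PRESENT;

/* Roughly what the migration path does today: clear the PTE, then later
 * write the migration entry; a concurrent reader can see PTE_NONE in
 * between and take the do_pte_missing() path. */
static uint64_t nuke_then_mark(void)
{
	uint64_t old = atomic_exchange(&pte, PTE_NONE);

	/* <-- window: a reader here observes a "none" PTE */
	atomic_store(&pte, MIGRATION_SWP);
	return old;
}

/* The proposed shape of the helper: one exchange, so readers see either
 * the old present PTE or the migration entry, never "none". */
static uint64_t get_and_set_not_present(uint64_t newval)
{
	return atomic_exchange(&pte, newval);
}

int main(void)
{
	uint64_t old;

	old = nuke_then_mark();
	printf("racy path:   old=%#llx now=%#llx\n",
	       (unsigned long long)old, (unsigned long long)atomic_load(&pte));

	atomic_store(&pte, PTE_PRESENT);
	old = get_and_set_not_present(MIGRATION_SWP);
	printf("atomic path: old=%#llx now=%#llx\n",
	       (unsigned long long)old, (unsigned long long)atomic_load(&pte));
	return 0;
}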
On 8/1/24 19:18, David Hildenbrand wrote: > On 01.08.24 15:43, Will Deacon wrote: >> On Thu, Aug 01, 2024 at 03:26:57PM +0200, David Hildenbrand wrote: >>> On 01.08.24 15:13, David Hildenbrand wrote: >>>>>>> To dampen the tradeoff, we could do this in shmem_fault() >>>>>>> instead? But >>>>>>> then, this would mean that we do this in all >>>>>>> >>>>>>> kinds of vma->vm_ops->fault, only when we discover another >>>>>>> reference >>>>>>> count race condition :) Doing this in do_fault() >>>>>>> >>>>>>> should solve this once and for all. In fact, do_pte_missing() >>>>>>> may call >>>>>>> do_anonymous_page() or do_fault(), and I just >>>>>>> >>>>>>> noticed that the former already checks this using >>>>>>> vmf_pte_changed(). >>>>>> >>>>>> What I am still missing is why this is (a) arm64 only; and (b) if >>>>>> this >>>>>> is something we should really worry about. There are other reasons >>>>>> (e.g., speculative references) why migration could temporarily fail, >>>>>> does it happen that often that it is really something we have to >>>>>> worry >>>>>> about? >>>>> >>>>> >>>>> (a) See discussion at [1]; I guess it passes on x86, which is quite >>>>> strange since the race is clearly arch-independent. >>>> >>>> Yes, I think this is what we have to understand. Is the race simply >>>> less >>>> likely to trigger on x86? >>>> >>>> I would assume that it would trigger on any arch. >>>> >>>> I just ran it on a x86 VM with 2 NUMA nodes and it also seems to >>>> work here. >>>> >>>> Is this maybe related to deferred flushing? Such that the other CPU >>>> will >>>> by accident just observe the !pte_none a little less likely? >>>> >>>> But arm64 also usually defers flushes, right? At least unless >>>> ARM64_WORKAROUND_REPEAT_TLBI is around. With that we never do deferred >>>> flushes. >>> >>> Bingo! >>> >>> diff --git a/mm/rmap.c b/mm/rmap.c >>> index e51ed44f8b53..ce94b810586b 100644 >>> --- a/mm/rmap.c >>> +++ b/mm/rmap.c >>> @@ -718,10 +718,7 @@ static void set_tlb_ubc_flush_pending(struct >>> mm_struct >>> *mm, pte_t pteval, >>> */ >>> static bool should_defer_flush(struct mm_struct *mm, enum >>> ttu_flags flags) >>> { >>> - if (!(flags & TTU_BATCH_FLUSH)) >>> - return false; >>> - >>> - return arch_tlbbatch_should_defer(mm); >>> + return false; >>> } >>> >>> >>> On x86: >>> >>> # ./migration >>> TAP version 13 >>> 1..1 >>> # Starting 1 tests from 1 test cases. >>> # RUN migration.shared_anon ... >>> Didn't migrate 1 pages >>> # migration.c:170:shared_anon:Expected migrate(ptr, self->n1, >>> self->n2) (-2) >>> == 0 (0) >>> # shared_anon: Test terminated by assertion >>> # FAIL migration.shared_anon >>> not ok 1 migration.shared_anon >>> >>> >>> It fails all of the time! >> >> Nice work! I suppose that makes sense as, with the eager TLB >> invalidation, the window between the other CPU faulting and the >> migration entry being written is fairly wide. >> >> Not sure about a fix though :/ It feels a bit overkill to add a new >> invalid pte encoding just for this. 
>
> Something like that might make the test happy in most cases:
>
> diff --git a/tools/testing/selftests/mm/migration.c
> b/tools/testing/selftests/mm/migration.c
> index 6908569ef406..4c18bfc13b94 100644
> --- a/tools/testing/selftests/mm/migration.c
> +++ b/tools/testing/selftests/mm/migration.c
> @@ -65,6 +65,7 @@ int migrate(uint64_t *ptr, int n1, int n2)
>          int ret, tmp;
>          int status = 0;
>          struct timespec ts1, ts2;
> +        int errors = 0;
>
>          if (clock_gettime(CLOCK_MONOTONIC, &ts1))
>                  return -1;
> @@ -79,12 +80,17 @@ int migrate(uint64_t *ptr, int n1, int n2)
>                  ret = move_pages(0, 1, (void **) &ptr, &n2, &status,
>                                   MPOL_MF_MOVE_ALL);
>                  if (ret) {
> -                        if (ret > 0)
> +                        if (ret > 0) {
> +                                if (++errors < 100)
> +                                        continue;
>                                  printf("Didn't migrate %d pages\n", ret);
> -                        else
> +                        } else {
>                                  perror("Couldn't migrate pages");
> +                        }
>                          return -2;
>                  }
> +                /* Progress! */
> +                errors = 0;
>
>                  tmp = n2;
>                  n2 = n1;
>
>
> [root@localhost mm]# ./migration
> TAP version 13
> 1..1
> # Starting 1 tests from 1 test cases.
> #  RUN           migration.shared_anon ...
> #            OK  migration.shared_anon
> ok 1 migration.shared_anon
> # PASSED: 1 / 1 tests passed.
> # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0


This does make the test pass, to my surprise, since what you are doing from
userspace should have been done by the kernel, because it retries folio
unmapping and moving NR_MAX_MIGRATE_(A)SYNC_RETRY times; I had already tested
pumping up these macros and the original test was still failing. Now, I dug in
more, and, if the following assertion is correct:

Any thread having a reference on a folio will end up calling folio_lock()

then it seems to me that the retry for loop wrapped around
migrate_folio_move(), inside migrate_pages_batch(), is useless; if
migrate_folio_move() fails on the first iteration, it is going to fail for all
iterations since, if I am reading the code path correctly, the only way it
fails is when the actual refcount is not equal to the expected refcount (in
folio_migrate_mapping()), and there is no way that the extra refcount is going
to get released since the migration path has the folio lock.

And therefore, this begs the question: isn't it logical to assert the actual
refcount against the expected refcount, just after we have changed the PTEs,
so that if this assertion fails, we can go to the next iteration of the for
loop for migrate_folio_unmap() inside migrate_pages_batch(), by calling
migrate_folio_undo_src()/dst() to restore the old state? I am trying to
implement this but it is not as straightforward as it seemed to me this
morning.
On 05.08.24 11:51, Dev Jain wrote: > > On 8/1/24 19:18, David Hildenbrand wrote: >> On 01.08.24 15:43, Will Deacon wrote: >>> On Thu, Aug 01, 2024 at 03:26:57PM +0200, David Hildenbrand wrote: >>>> On 01.08.24 15:13, David Hildenbrand wrote: >>>>>>>> To dampen the tradeoff, we could do this in shmem_fault() >>>>>>>> instead? But >>>>>>>> then, this would mean that we do this in all >>>>>>>> >>>>>>>> kinds of vma->vm_ops->fault, only when we discover another >>>>>>>> reference >>>>>>>> count race condition :) Doing this in do_fault() >>>>>>>> >>>>>>>> should solve this once and for all. In fact, do_pte_missing() >>>>>>>> may call >>>>>>>> do_anonymous_page() or do_fault(), and I just >>>>>>>> >>>>>>>> noticed that the former already checks this using >>>>>>>> vmf_pte_changed(). >>>>>>> >>>>>>> What I am still missing is why this is (a) arm64 only; and (b) if >>>>>>> this >>>>>>> is something we should really worry about. There are other reasons >>>>>>> (e.g., speculative references) why migration could temporarily fail, >>>>>>> does it happen that often that it is really something we have to >>>>>>> worry >>>>>>> about? >>>>>> >>>>>> >>>>>> (a) See discussion at [1]; I guess it passes on x86, which is quite >>>>>> strange since the race is clearly arch-independent. >>>>> >>>>> Yes, I think this is what we have to understand. Is the race simply >>>>> less >>>>> likely to trigger on x86? >>>>> >>>>> I would assume that it would trigger on any arch. >>>>> >>>>> I just ran it on a x86 VM with 2 NUMA nodes and it also seems to >>>>> work here. >>>>> >>>>> Is this maybe related to deferred flushing? Such that the other CPU >>>>> will >>>>> by accident just observe the !pte_none a little less likely? >>>>> >>>>> But arm64 also usually defers flushes, right? At least unless >>>>> ARM64_WORKAROUND_REPEAT_TLBI is around. With that we never do deferred >>>>> flushes. >>>> >>>> Bingo! >>>> >>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>> index e51ed44f8b53..ce94b810586b 100644 >>>> --- a/mm/rmap.c >>>> +++ b/mm/rmap.c >>>> @@ -718,10 +718,7 @@ static void set_tlb_ubc_flush_pending(struct >>>> mm_struct >>>> *mm, pte_t pteval, >>>> */ >>>> static bool should_defer_flush(struct mm_struct *mm, enum >>>> ttu_flags flags) >>>> { >>>> - if (!(flags & TTU_BATCH_FLUSH)) >>>> - return false; >>>> - >>>> - return arch_tlbbatch_should_defer(mm); >>>> + return false; >>>> } >>>> >>>> >>>> On x86: >>>> >>>> # ./migration >>>> TAP version 13 >>>> 1..1 >>>> # Starting 1 tests from 1 test cases. >>>> # RUN migration.shared_anon ... >>>> Didn't migrate 1 pages >>>> # migration.c:170:shared_anon:Expected migrate(ptr, self->n1, >>>> self->n2) (-2) >>>> == 0 (0) >>>> # shared_anon: Test terminated by assertion >>>> # FAIL migration.shared_anon >>>> not ok 1 migration.shared_anon >>>> >>>> >>>> It fails all of the time! >>> >>> Nice work! I suppose that makes sense as, with the eager TLB >>> invalidation, the window between the other CPU faulting and the >>> migration entry being written is fairly wide. >>> >>> Not sure about a fix though :/ It feels a bit overkill to add a new >>> invalid pte encoding just for this. 
>> >> Something like that might make the test happy in most cases: >> >> diff --git a/tools/testing/selftests/mm/migration.c >> b/tools/testing/selftests/mm/migration.c >> index 6908569ef406..4c18bfc13b94 100644 >> --- a/tools/testing/selftests/mm/migration.c >> +++ b/tools/testing/selftests/mm/migration.c >> @@ -65,6 +65,7 @@ int migrate(uint64_t *ptr, int n1, int n2) >> int ret, tmp; >> int status = 0; >> struct timespec ts1, ts2; >> + int errors = 0; >> >> if (clock_gettime(CLOCK_MONOTONIC, &ts1)) >> return -1; >> @@ -79,12 +80,17 @@ int migrate(uint64_t *ptr, int n1, int n2) >> ret = move_pages(0, 1, (void **) &ptr, &n2, &status, >> MPOL_MF_MOVE_ALL); >> if (ret) { >> - if (ret > 0) >> + if (ret > 0) { >> + if (++errors < 100) >> + continue; >> printf("Didn't migrate %d pages\n", ret); >> - else >> + } else { >> perror("Couldn't migrate pages"); >> + } >> return -2; >> } >> + /* Progress! */ >> + errors = 0; >> >> tmp = n2; >> n2 = n1; >> >> >> [root@localhost mm]# ./migration >> TAP version 13 >> 1..1 >> # Starting 1 tests from 1 test cases. >> # RUN migration.shared_anon ... >> # OK migration.shared_anon >> ok 1 migration.shared_anon >> # PASSED: 1 / 1 tests passed. >> # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0 > > > This does make the test pass, to my surprise, since what you are doing > from userspace > > should have been done by the kernel, because it retries folio unmapping > and moving > > NR_MAX_MIGRATE_(A)SYNC_RETRY times; I had already tested pumping up these > > macros and the original test was still failing. Now, I digged in more, > and, if the > > following assertion is correct: > > > Any thread having a reference on a folio will end up calling folio_lock() > Good point. I suspect concurrent things like read/write would also be able to trigger this (did not check, though). > > then it seems to me that the retry for loop wrapped around > migrate_folio_move(), inside > > migrate_pages_batch(), is useless; if migrate_folio_move() fails on the > first iteration, it is > > going to fail for all iterations since, if I am reading the code path > correctly, the only way it > > fails is when the actual refcount is not equal to expected refcount (in > folio_migrate_mapping()), > > and there is no way that the extra refcount is going to get released > since the migration path > > has the folio lock. > > And therefore, this begs the question: isn't it logical to assert the > actual refcount against the > > expected refcount, just after we have changed the PTEs, so that if this > assertion fails, we can > > go to the next iteration of the for loop for migrate_folio_unmap() > inside migrate_pages_batch() > > by calling migrate_folio_undo_src()/dst() to restore the old state? I am > trying to implement > > this but is not as straightforward as it seemed to me this morning. I agree with your assessment that migration code currently doesn't handle the case well when some other thread does an unconditional folio_lock(). folio_trylock() users would be handled, but that's not what we want with FGP_LOCK I assume. So IIUC, your idea would be to unlock the folio in migration code and try again their. Sounds reasonable, without looking into the details :)
On 8/5/24 16:16, David Hildenbrand wrote: > On 05.08.24 11:51, Dev Jain wrote: >> >> On 8/1/24 19:18, David Hildenbrand wrote: >>> On 01.08.24 15:43, Will Deacon wrote: >>>> On Thu, Aug 01, 2024 at 03:26:57PM +0200, David Hildenbrand wrote: >>>>> On 01.08.24 15:13, David Hildenbrand wrote: >>>>>>>>> To dampen the tradeoff, we could do this in shmem_fault() >>>>>>>>> instead? But >>>>>>>>> then, this would mean that we do this in all >>>>>>>>> >>>>>>>>> kinds of vma->vm_ops->fault, only when we discover another >>>>>>>>> reference >>>>>>>>> count race condition :) Doing this in do_fault() >>>>>>>>> >>>>>>>>> should solve this once and for all. In fact, do_pte_missing() >>>>>>>>> may call >>>>>>>>> do_anonymous_page() or do_fault(), and I just >>>>>>>>> >>>>>>>>> noticed that the former already checks this using >>>>>>>>> vmf_pte_changed(). >>>>>>>> >>>>>>>> What I am still missing is why this is (a) arm64 only; and (b) if >>>>>>>> this >>>>>>>> is something we should really worry about. There are other reasons >>>>>>>> (e.g., speculative references) why migration could temporarily >>>>>>>> fail, >>>>>>>> does it happen that often that it is really something we have to >>>>>>>> worry >>>>>>>> about? >>>>>>> >>>>>>> >>>>>>> (a) See discussion at [1]; I guess it passes on x86, which is quite >>>>>>> strange since the race is clearly arch-independent. >>>>>> >>>>>> Yes, I think this is what we have to understand. Is the race simply >>>>>> less >>>>>> likely to trigger on x86? >>>>>> >>>>>> I would assume that it would trigger on any arch. >>>>>> >>>>>> I just ran it on a x86 VM with 2 NUMA nodes and it also seems to >>>>>> work here. >>>>>> >>>>>> Is this maybe related to deferred flushing? Such that the other CPU >>>>>> will >>>>>> by accident just observe the !pte_none a little less likely? >>>>>> >>>>>> But arm64 also usually defers flushes, right? At least unless >>>>>> ARM64_WORKAROUND_REPEAT_TLBI is around. With that we never do >>>>>> deferred >>>>>> flushes. >>>>> >>>>> Bingo! >>>>> >>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>> index e51ed44f8b53..ce94b810586b 100644 >>>>> --- a/mm/rmap.c >>>>> +++ b/mm/rmap.c >>>>> @@ -718,10 +718,7 @@ static void set_tlb_ubc_flush_pending(struct >>>>> mm_struct >>>>> *mm, pte_t pteval, >>>>> */ >>>>> static bool should_defer_flush(struct mm_struct *mm, enum >>>>> ttu_flags flags) >>>>> { >>>>> - if (!(flags & TTU_BATCH_FLUSH)) >>>>> - return false; >>>>> - >>>>> - return arch_tlbbatch_should_defer(mm); >>>>> + return false; >>>>> } >>>>> >>>>> >>>>> On x86: >>>>> >>>>> # ./migration >>>>> TAP version 13 >>>>> 1..1 >>>>> # Starting 1 tests from 1 test cases. >>>>> # RUN migration.shared_anon ... >>>>> Didn't migrate 1 pages >>>>> # migration.c:170:shared_anon:Expected migrate(ptr, self->n1, >>>>> self->n2) (-2) >>>>> == 0 (0) >>>>> # shared_anon: Test terminated by assertion >>>>> # FAIL migration.shared_anon >>>>> not ok 1 migration.shared_anon >>>>> >>>>> >>>>> It fails all of the time! >>>> >>>> Nice work! I suppose that makes sense as, with the eager TLB >>>> invalidation, the window between the other CPU faulting and the >>>> migration entry being written is fairly wide. >>>> >>>> Not sure about a fix though :/ It feels a bit overkill to add a new >>>> invalid pte encoding just for this. 
>>> >>> Something like that might make the test happy in most cases: >>> >>> diff --git a/tools/testing/selftests/mm/migration.c >>> b/tools/testing/selftests/mm/migration.c >>> index 6908569ef406..4c18bfc13b94 100644 >>> --- a/tools/testing/selftests/mm/migration.c >>> +++ b/tools/testing/selftests/mm/migration.c >>> @@ -65,6 +65,7 @@ int migrate(uint64_t *ptr, int n1, int n2) >>> int ret, tmp; >>> int status = 0; >>> struct timespec ts1, ts2; >>> + int errors = 0; >>> >>> if (clock_gettime(CLOCK_MONOTONIC, &ts1)) >>> return -1; >>> @@ -79,12 +80,17 @@ int migrate(uint64_t *ptr, int n1, int n2) >>> ret = move_pages(0, 1, (void **) &ptr, &n2, &status, >>> MPOL_MF_MOVE_ALL); >>> if (ret) { >>> - if (ret > 0) >>> + if (ret > 0) { >>> + if (++errors < 100) >>> + continue; >>> printf("Didn't migrate %d pages\n", >>> ret); >>> - else >>> + } else { >>> perror("Couldn't migrate pages"); >>> + } >>> return -2; >>> } >>> + /* Progress! */ >>> + errors = 0; >>> >>> tmp = n2; >>> n2 = n1; >>> >>> >>> [root@localhost mm]# ./migration >>> TAP version 13 >>> 1..1 >>> # Starting 1 tests from 1 test cases. >>> # RUN migration.shared_anon ... >>> # OK migration.shared_anon >>> ok 1 migration.shared_anon >>> # PASSED: 1 / 1 tests passed. >>> # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0 >> >> >> This does make the test pass, to my surprise, since what you are doing >> from userspace >> >> should have been done by the kernel, because it retries folio unmapping >> and moving >> >> NR_MAX_MIGRATE_(A)SYNC_RETRY times; I had already tested pumping up >> these >> >> macros and the original test was still failing. Now, I digged in more, >> and, if the >> >> following assertion is correct: >> >> >> Any thread having a reference on a folio will end up calling >> folio_lock() >> > > Good point. I suspect concurrent things like read/write would also be > able to trigger this (did not check, though). > >> >> then it seems to me that the retry for loop wrapped around >> migrate_folio_move(), inside >> >> migrate_pages_batch(), is useless; if migrate_folio_move() fails on the >> first iteration, it is >> >> going to fail for all iterations since, if I am reading the code path >> correctly, the only way it >> >> fails is when the actual refcount is not equal to expected refcount (in >> folio_migrate_mapping()), >> >> and there is no way that the extra refcount is going to get released >> since the migration path >> >> has the folio lock. >> >> And therefore, this begs the question: isn't it logical to assert the >> actual refcount against the >> >> expected refcount, just after we have changed the PTEs, so that if this >> assertion fails, we can >> >> go to the next iteration of the for loop for migrate_folio_unmap() >> inside migrate_pages_batch() >> >> by calling migrate_folio_undo_src()/dst() to restore the old state? I am >> trying to implement >> >> this but is not as straightforward as it seemed to me this morning. > > I agree with your assessment that migration code currently doesn't > handle the case well when some other thread does an unconditional > folio_lock(). folio_trylock() users would be handled, but that's not > what we want with FGP_LOCK I assume. > > So IIUC, your idea would be to unlock the folio in migration code and > try again their. 
Sounds reasonable, without looking into the details :)

(Adding Mel if at all he has any comments for a compaction use-case)

What I was trying to say is this (forgive me for the hard-coded value):

diff --git a/mm/migrate.c b/mm/migrate.c
index a8c6f466e33a..404af46dd661 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1262,6 +1262,8 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
 	}
 
 	if (!folio_mapped(src)) {
+		if (folio_ref_count(src) != 2)
+			goto out;
 		__migrate_folio_record(dst, old_page_state, anon_vma);
 		return MIGRATEPAGE_UNMAP;
 	}

This will give us more chances to win the race. On average, the test now fails
after around 100 iterations of move_pages(). If you multiply the
NR_MAX_MIGRATE_(A)SYNC_RETRY macros by 3, the average goes up to above 2000.
But if the consensus is that this is just pleasing the test without any real
use-case (compaction?) then I guess I am alright with making the change in the
test.
On 05.08.24 16:14, Dev Jain wrote: > > On 8/5/24 16:16, David Hildenbrand wrote: >> On 05.08.24 11:51, Dev Jain wrote: >>> >>> On 8/1/24 19:18, David Hildenbrand wrote: >>>> On 01.08.24 15:43, Will Deacon wrote: >>>>> On Thu, Aug 01, 2024 at 03:26:57PM +0200, David Hildenbrand wrote: >>>>>> On 01.08.24 15:13, David Hildenbrand wrote: >>>>>>>>>> To dampen the tradeoff, we could do this in shmem_fault() >>>>>>>>>> instead? But >>>>>>>>>> then, this would mean that we do this in all >>>>>>>>>> >>>>>>>>>> kinds of vma->vm_ops->fault, only when we discover another >>>>>>>>>> reference >>>>>>>>>> count race condition :) Doing this in do_fault() >>>>>>>>>> >>>>>>>>>> should solve this once and for all. In fact, do_pte_missing() >>>>>>>>>> may call >>>>>>>>>> do_anonymous_page() or do_fault(), and I just >>>>>>>>>> >>>>>>>>>> noticed that the former already checks this using >>>>>>>>>> vmf_pte_changed(). >>>>>>>>> >>>>>>>>> What I am still missing is why this is (a) arm64 only; and (b) if >>>>>>>>> this >>>>>>>>> is something we should really worry about. There are other reasons >>>>>>>>> (e.g., speculative references) why migration could temporarily >>>>>>>>> fail, >>>>>>>>> does it happen that often that it is really something we have to >>>>>>>>> worry >>>>>>>>> about? >>>>>>>> >>>>>>>> >>>>>>>> (a) See discussion at [1]; I guess it passes on x86, which is quite >>>>>>>> strange since the race is clearly arch-independent. >>>>>>> >>>>>>> Yes, I think this is what we have to understand. Is the race simply >>>>>>> less >>>>>>> likely to trigger on x86? >>>>>>> >>>>>>> I would assume that it would trigger on any arch. >>>>>>> >>>>>>> I just ran it on a x86 VM with 2 NUMA nodes and it also seems to >>>>>>> work here. >>>>>>> >>>>>>> Is this maybe related to deferred flushing? Such that the other CPU >>>>>>> will >>>>>>> by accident just observe the !pte_none a little less likely? >>>>>>> >>>>>>> But arm64 also usually defers flushes, right? At least unless >>>>>>> ARM64_WORKAROUND_REPEAT_TLBI is around. With that we never do >>>>>>> deferred >>>>>>> flushes. >>>>>> >>>>>> Bingo! >>>>>> >>>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>>> index e51ed44f8b53..ce94b810586b 100644 >>>>>> --- a/mm/rmap.c >>>>>> +++ b/mm/rmap.c >>>>>> @@ -718,10 +718,7 @@ static void set_tlb_ubc_flush_pending(struct >>>>>> mm_struct >>>>>> *mm, pte_t pteval, >>>>>> */ >>>>>> static bool should_defer_flush(struct mm_struct *mm, enum >>>>>> ttu_flags flags) >>>>>> { >>>>>> - if (!(flags & TTU_BATCH_FLUSH)) >>>>>> - return false; >>>>>> - >>>>>> - return arch_tlbbatch_should_defer(mm); >>>>>> + return false; >>>>>> } >>>>>> >>>>>> >>>>>> On x86: >>>>>> >>>>>> # ./migration >>>>>> TAP version 13 >>>>>> 1..1 >>>>>> # Starting 1 tests from 1 test cases. >>>>>> # RUN migration.shared_anon ... >>>>>> Didn't migrate 1 pages >>>>>> # migration.c:170:shared_anon:Expected migrate(ptr, self->n1, >>>>>> self->n2) (-2) >>>>>> == 0 (0) >>>>>> # shared_anon: Test terminated by assertion >>>>>> # FAIL migration.shared_anon >>>>>> not ok 1 migration.shared_anon >>>>>> >>>>>> >>>>>> It fails all of the time! >>>>> >>>>> Nice work! I suppose that makes sense as, with the eager TLB >>>>> invalidation, the window between the other CPU faulting and the >>>>> migration entry being written is fairly wide. >>>>> >>>>> Not sure about a fix though :/ It feels a bit overkill to add a new >>>>> invalid pte encoding just for this. 
>>>> >>>> Something like that might make the test happy in most cases: >>>> >>>> diff --git a/tools/testing/selftests/mm/migration.c >>>> b/tools/testing/selftests/mm/migration.c >>>> index 6908569ef406..4c18bfc13b94 100644 >>>> --- a/tools/testing/selftests/mm/migration.c >>>> +++ b/tools/testing/selftests/mm/migration.c >>>> @@ -65,6 +65,7 @@ int migrate(uint64_t *ptr, int n1, int n2) >>>> int ret, tmp; >>>> int status = 0; >>>> struct timespec ts1, ts2; >>>> + int errors = 0; >>>> >>>> if (clock_gettime(CLOCK_MONOTONIC, &ts1)) >>>> return -1; >>>> @@ -79,12 +80,17 @@ int migrate(uint64_t *ptr, int n1, int n2) >>>> ret = move_pages(0, 1, (void **) &ptr, &n2, &status, >>>> MPOL_MF_MOVE_ALL); >>>> if (ret) { >>>> - if (ret > 0) >>>> + if (ret > 0) { >>>> + if (++errors < 100) >>>> + continue; >>>> printf("Didn't migrate %d pages\n", >>>> ret); >>>> - else >>>> + } else { >>>> perror("Couldn't migrate pages"); >>>> + } >>>> return -2; >>>> } >>>> + /* Progress! */ >>>> + errors = 0; >>>> >>>> tmp = n2; >>>> n2 = n1; >>>> >>>> >>>> [root@localhost mm]# ./migration >>>> TAP version 13 >>>> 1..1 >>>> # Starting 1 tests from 1 test cases. >>>> # RUN migration.shared_anon ... >>>> # OK migration.shared_anon >>>> ok 1 migration.shared_anon >>>> # PASSED: 1 / 1 tests passed. >>>> # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0 >>> >>> >>> This does make the test pass, to my surprise, since what you are doing >>> from userspace >>> >>> should have been done by the kernel, because it retries folio unmapping >>> and moving >>> >>> NR_MAX_MIGRATE_(A)SYNC_RETRY times; I had already tested pumping up >>> these >>> >>> macros and the original test was still failing. Now, I digged in more, >>> and, if the >>> >>> following assertion is correct: >>> >>> >>> Any thread having a reference on a folio will end up calling >>> folio_lock() >>> >> >> Good point. I suspect concurrent things like read/write would also be >> able to trigger this (did not check, though). >> >>> >>> then it seems to me that the retry for loop wrapped around >>> migrate_folio_move(), inside >>> >>> migrate_pages_batch(), is useless; if migrate_folio_move() fails on the >>> first iteration, it is >>> >>> going to fail for all iterations since, if I am reading the code path >>> correctly, the only way it >>> >>> fails is when the actual refcount is not equal to expected refcount (in >>> folio_migrate_mapping()), >>> >>> and there is no way that the extra refcount is going to get released >>> since the migration path >>> >>> has the folio lock. >>> >>> And therefore, this begs the question: isn't it logical to assert the >>> actual refcount against the >>> >>> expected refcount, just after we have changed the PTEs, so that if this >>> assertion fails, we can >>> >>> go to the next iteration of the for loop for migrate_folio_unmap() >>> inside migrate_pages_batch() >>> >>> by calling migrate_folio_undo_src()/dst() to restore the old state? I am >>> trying to implement >>> >>> this but is not as straightforward as it seemed to me this morning. >> >> I agree with your assessment that migration code currently doesn't >> handle the case well when some other thread does an unconditional >> folio_lock(). folio_trylock() users would be handled, but that's not >> what we want with FGP_LOCK I assume. >> >> So IIUC, your idea would be to unlock the folio in migration code and >> try again their. 
Sounds reasonable, without looking into the details :) > > BTW, I was trying to find the spot that would do the folio_lock(), but filemap_fault() does the lock_folio_maybe_drop_mmap() where we do a folio_trylock(). Where exactly is the folio_lock() on the fault path that would prohibit us from making progress? > (Adding Mel if at all he has any comments for a compaction use-case) > > What I was trying to say is this (forgive me for the hard-coded value): The hard-coded 2 is wrong indeed :) > > diff --git a/mm/migrate.c b/mm/migrate.c > index a8c6f466e33a..404af46dd661 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -1262,6 +1262,8 @@ static int migrate_folio_unmap(new_folio_t > get_new_folio, > } > if (!folio_mapped(src)) { > + if (folio_ref_count(src) != 2) > + goto out; > __migrate_folio_record(dst, old_page_state, anon_vma); > return MIGRATEPAGE_UNMAP; > } > This will give us more chances to win the race. On an average, now > the test fails on 100 iterations of move_pages(). If you multiply > the NR_MAX_PAGES_(A)SYNC_RETRY macros by 3, the average goes above > to 2000. > But if the consensus is that this is just pleasing the test without > any real use-case (compaction?) then I guess I am alright with making > the change in the test. I'd be curious if other activity (besides fault, like concurrent read()/write()/...) can similarly make migration fail. I suspect it can.
On 8/7/24 17:09, David Hildenbrand wrote:
> On 05.08.24 16:14, Dev Jain wrote:

[...]

> BTW, I was trying to find the spot that would do the folio_lock(), but
> filemap_fault() does the lock_folio_maybe_drop_mmap() where we do a
> folio_trylock().
>
> Where exactly is the folio_lock() on the fault path that would
> prohibit us from making progress?

Not filemap_fault(); it enters shmem_fault() which eventually calls
shmem_get_folio_gfp(), retrieving the folio from the pagecache, and
calling folio_lock().

[...]
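To make that path concrete, here is a condensed sketch of the page-cache-hit branch in shmem_get_folio_gfp() (paraphrased, not verbatim kernel source; the hypothetical name, return codes, swap-entry handling and huge-page cases are simplifications):

/*
 * Condensed sketch of shmem_fault() -> shmem_get_folio_gfp() on a
 * page-cache hit (assumption: details vary across kernel versions).
 */
static int shmem_get_folio_gfp_sketch(struct inode *inode, pgoff_t index,
				      struct folio **foliop)
{
	struct folio *folio;

	/* lookup takes a reference on the folio */
	folio = filemap_get_entry(inode->i_mapping, index);
	if (folio) {
		/*
		 * Unconditional sleeping lock: while migration holds the
		 * folio lock, the faulting thread parks here with its extra
		 * reference, so migration's refcount check keeps failing
		 * for as long as it retries.
		 */
		folio_lock(folio);

		/* recheck the folio was not truncated meanwhile */
		if (unlikely(folio->mapping != inode->i_mapping)) {
			folio_unlock(folio);
			folio_put(folio);
			return -EAGAIN;
		}
		*foliop = folio;
		return 0;
	}

	/* ... otherwise allocate a new folio and add it to the page cache ... */
	return -ENOENT;
}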
On 07.08.24 14:58, Dev Jain wrote:
> On 8/7/24 17:09, David Hildenbrand wrote:

[...]

>> BTW, I was trying to find the spot that would do the folio_lock(), but
>> filemap_fault() does the lock_folio_maybe_drop_mmap() where we do a
>> folio_trylock().
>>
>> Where exactly is the folio_lock() on the fault path that would
>> prohibit us from making progress?
>
> Not filemap_fault(); it enters shmem_fault() which eventually calls
> shmem_get_folio_gfp(), retrieving the folio from the pagecache, and
> calling folio_lock().

Ah, thanks!

... which raises the question if we should handle it similar to
filemap_fault(), essentially drop the reference and retry using
VM_FAULT_RETRY. Hmmmmm
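If the shmem path were taught to behave that way, the shape of it might look roughly like the sketch below. This is purely illustrative under that assumption, not a patch reviewed in this thread; the helper name is hypothetical and the real filemap_fault() machinery additionally pins the struct file and honours FAULT_FLAG_RETRY_NOWAIT:

/*
 * Illustrative sketch only: try the folio lock, and if that fails drop
 * our reference and ask the caller to retry the fault instead of
 * sleeping with the reference held.
 */
static vm_fault_t shmem_lock_or_retry_sketch(struct vm_fault *vmf,
					     struct folio *folio)
{
	if (folio_trylock(folio))
		return 0;			/* got the lock, continue the fault */

	if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
		/* give up: drop the reference so migration can make progress */
		folio_put(folio);
		mmap_read_unlock(vmf->vma->vm_mm);
		return VM_FAULT_RETRY;
	}

	/* caller cannot retry: fall back to the blocking lock */
	folio_lock(folio);
	return 0;
}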
On 09.08.24 15:23, David Hildenbrand wrote:
> On 07.08.24 14:58, Dev Jain wrote:

[...]

>> Not filemap_fault(); it enters shmem_fault() which eventually calls
>> shmem_get_folio_gfp(), retrieving the folio from the pagecache, and
>> calling folio_lock().
>
> Ah, thanks!
>
> ... which raises the question if we should handle it similar to
> filemap_fault(), essentially drop the reference and retry using
> VM_FAULT_RETRY. Hmmmmm

... just had another look at lock_folio_maybe_drop_mmap(), and we also
usually end up in __folio_lock() / __folio_lock_killable(),
folio_trylock() is only used for the fast path.

So no, that wouldn't change that much.
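For readers following along, the structure being referred to is roughly the following (condensed sketch paraphrased from memory rather than quoted from mm/filemap.c; the real function also pins the struct file and juggles mmap_lock via a helper):

/*
 * Condensed sketch of the lock_folio_maybe_drop_mmap() logic: the trylock
 * is only the fast path, so a fault racing with migration still ends up
 * sleeping in __folio_lock*() while holding its folio reference.
 */
static int lock_folio_maybe_drop_mmap_sketch(struct vm_fault *vmf,
					     struct folio *folio)
{
	if (folio_trylock(folio))
		return 1;			/* fast path: got the lock without sleeping */

	if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
		return 0;			/* caller must not sleep: bail out */

	/* slow path (mmap_lock bookkeeping omitted): block on the folio lock */
	if (vmf->flags & FAULT_FLAG_KILLABLE) {
		if (__folio_lock_killable(folio))
			return 0;		/* interrupted by a fatal signal */
	} else {
		__folio_lock(folio);
	}
	return 1;
}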
diff --git a/mm/rmap.c b/mm/rmap.c
index e8fc5ecb59b2..bb10b3376595 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2126,7 +2126,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			 * transition on a cached TLB entry is written through
 			 * and traps if the PTE is unmapped.
 			 */
-			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
+			pteval = pte_mkinvalid(*(pvmw.pte));
+			*(pvmw.pte) = pteval;
 			set_tlb_ubc_flush_pending(mm, pteval, address);
 		} else {
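For context on the diff above: the idea is to avoid leaving a pte_none() hole in the window before the migration entry is written. Assuming the arm64 helper introduced for the "present but invalid" encoding, a rough sketch of what pte_mkinvalid() does there (bit names are an assumption and may differ from the actual arch code) is:

/*
 * Rough sketch of the arm64-style helper (assumption, not verbatim):
 * clear the hardware VALID bit so any access faults, but keep the entry
 * non-empty so generic code still sees it as present, unlike
 * ptep_get_and_clear(), which leaves a pte_none() entry behind.
 */
static inline pte_t pte_mkinvalid_sketch(pte_t pte)
{
	pte = set_pte_bit(pte, __pgprot(PTE_PRESENT_INVALID));
	pte = clear_pte_bit(pte, __pgprot(PTE_VALID));
	return pte;
}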