
Race condition observed between page migration and page fault handling on arm64 machines

Message ID 20240801081657.1386743-1-dev.jain@arm.com (mailing list archive)
State New, archived

Commit Message

Dev Jain Aug. 1, 2024, 8:16 a.m. UTC
Ryan and I had a discussion and we thought it would be best to get feedback
from the community.

The migration mm selftest currently fails on arm64 for shared anon mappings,
due to the following race:

Migration:						Page fault:
try_to_migrate_one():					handle_pte_fault():
1. Nuke the PTE						PTE has been deleted => do_pte_missing()
2. Mark the PTE for migration				PTE has not been deleted but is just not "present" => do_swap_page()

In the test, the parent thread migrates a single page between nodes, and the
children threads read on that page.
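
Roughly, the scenario looks like the following in user space (a minimal
sketch for illustration, not the actual selftest; node numbers 0/1, the
thread count and the iteration count are assumptions, and MPOL_MF_MOVE_ALL
needs CAP_SYS_NICE; build with: gcc repro.c -o repro -pthread -lnuma):

/* repro.c: readers keep touching a MAP_SHARED|MAP_ANONYMOUS page while the
 * main thread bounces it between two NUMA nodes with move_pages(). */
#include <numaif.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile uint64_t *page;

static void *reader(void *arg)
{
	(void)arg;
	for (;;)			/* keep reading (and re-faulting) the page */
		(void)*page;
	return NULL;
}

int main(void)
{
	int nodes[2] = { 0, 1 };	/* assumed to exist and have memory */
	pthread_t threads[4];
	int status, i;

	page = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (page == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	*page = 1;			/* fault the page in */

	for (i = 0; i < 4; i++)
		pthread_create(&threads[i], NULL, reader, NULL);

	for (i = 0; i < 100; i++) {
		void *p = (void *)page;
		/* move_pages() returns the number of pages it could not
		 * migrate (> 0) when migration backs off. */
		long ret = move_pages(0, 1, &p, &nodes[i % 2], &status,
				      MPOL_MF_MOVE_ALL);
		if (ret > 0)
			printf("iteration %d: didn't migrate %ld page(s)\n",
			       i, ret);
		else if (ret < 0)
			perror("move_pages");
	}
	return 0;
}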

The race happens when the PTE has been nuked and, before it gets marked for
migration, the reader faults, finds the PTE missing, and calls do_pte_missing()
instead of the correct do_swap_page(); the latter would have made the reader
call migration_entry_wait() and wait for the PTE to be fixed up. The former
path instead follows do_fault() -> do_read_fault() -> __do_fault() ->
vma->vm_ops->fault, which is shmem_fault() -> shmem_get_folio_gfp() ->
filemap_get_entry(), which then calls folio_try_get() and takes a reference
on the folio that is under migration. On the way back, the reader blocks on
folio_lock() since the migration path holds the lock, and migration ends up
failing in folio_migrate_mapping(), which expects a reference count of 2 on
the folio.

The following hack makes the race vanish:

mm/rmap.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)


It's a hack because the invalidation is non-atomic, and pte_mkinvalid() is
defined only on arm64. I could think of the following solutions:

1. Introduce (atomic)_pte_mkinvalid() for all arches and call the arch-API if
defined, else call ptep_get_and_clear(). The theoretical race would still exist
for other arches.

2. Prepare the migration swap entry and do an atomic exchange to fill the PTE
(this currently seems the best option to me, but I have not tried it out; see
the sketch after this list).

3. In try_to_migrate_one(), before the nuking, freeze the refcount of the folio.
This would solve the race, but then we will have to defer all the reference
releases till later, and I don't know whether that's correct or feasible.

4. Since the extra refcount being taken in filemap_get_entry() is due to loading
the folio from the pagecache, delete/invalidate the folio in the pagecache
until migration is complete. I tried with some helpers present in mm/filemap.c
to do that but I guess I am not aware of the subtleties, and I end up triggering
a BUG() somewhere.

5. In do_fault(), under the if block, we are already checking whether this is
just a temporary clearing of the PTE. We can take that out of the if block, but
then that would be a performance bottleneck since we are taking the PTL?
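
To illustrate why option 2 would close the window, here is a user-space
analogy (an illustration only, not kernel code; the "PTE" values and the
iteration count are made up; compile with -pthread): with "clear, then write
the migration value", as in the nuke/mark sequence above, a concurrent reader
can observe the intermediate cleared ("none") state, while a single atomic
exchange straight to the final value leaves no such window:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define PRESENT		1UL
#define MIGRATION	2UL

static _Atomic unsigned long pte = PRESENT;
static _Atomic unsigned long saw_none;
static _Atomic int stop;

static void *reader(void *arg)
{
	(void)arg;
	while (!atomic_load(&stop))
		if (atomic_load(&pte) == 0)	/* observed "pte_none()" */
			atomic_fetch_add(&saw_none, 1);
	return NULL;
}

static void run(const char *name, int use_xchg)
{
	pthread_t t;

	atomic_store(&saw_none, 0);
	atomic_store(&stop, 0);
	pthread_create(&t, NULL, reader, NULL);

	for (int i = 0; i < 1000000; i++) {
		if (use_xchg) {
			/* option 2: one atomic transition, no "none" window */
			atomic_exchange(&pte, MIGRATION);
		} else {
			/* current sequence: nuke, then mark for migration */
			atomic_store(&pte, 0);
			atomic_store(&pte, MIGRATION);
		}
		atomic_store(&pte, PRESENT);	/* "remove the migration entry" */
	}

	atomic_store(&stop, 1);
	pthread_join(t, NULL);
	printf("%s: reader saw the cleared state %lu times\n",
	       name, atomic_load(&saw_none));
}

int main(void)
{
	run("clear-then-mark", 0);
	run("atomic-exchange", 1);
	return 0;
}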

Thanks,
Dev

Comments

David Hildenbrand Aug. 1, 2024, 8:42 a.m. UTC | #1
On 01.08.24 10:16, Dev Jain wrote:
> I and Ryan had a discussion and we thought it would be best to get feedback
> from the community.
> 
> The migration mm selftest currently fails on arm64 for shared anon mappings,
> due to the following race:

Do you mean MAP_SHARED|MAP_ANON or MAP_PRIVATE|MAP_ANON_fork? Because 
you note shmem below, I assume you mean MAP_SHARED|MAP_ANON

> 
> Migration:						Page fault:
> try_to_migrate_one():					handle_pte_fault():
> 1. Nuke the PTE						PTE has been deleted => do_pte_missing()
> 2. Mark the PTE for migration				PTE has not been deleted but is just not "present" => do_swap_page()
>

In filemap_fault_recheck_pte_none() we recheck under the PTL to make sure
that an observed pte_none() really is a persistent pte_none() and not just a
temporary one.

Should we do something similar in do_fault()? I see that we already do 
something like that on the "!vma->vm_ops->fault" path.

But of course, there is a tradeoff between letting migration 
(temporarily) fail and grabbing the PTL during page faults.
Dev Jain Aug. 1, 2024, 9:38 a.m. UTC | #2
On 8/1/24 14:12, David Hildenbrand wrote:
> On 01.08.24 10:16, Dev Jain wrote:
>> I and Ryan had a discussion and we thought it would be best to get 
>> feedback
>> from the community.
>>
>> The migration mm selftest currently fails on arm64 for shared anon 
>> mappings,
>> due to the following race:
>
> Do you mean MAP_SHARED|MAP_ANON or MAP_PRIVATE|MAP_ANON_fork? Because 
> you note shmem below, I assume you mean MAP_SHARED|MAP_ANON


Yes.

>
>>
>> Migration:                        Page fault:
>> try_to_migrate_one():                    handle_pte_fault():
>> 1. Nuke the PTE                        PTE has been deleted => 
>> do_pte_missing()
>> 2. Mark the PTE for migration                PTE has not been deleted 
>> but is just not "present" => do_swap_page()
>>
>
> In filemap_fault_recheck_pte_none() we recheck under PTL to make sure 
> that a temporary pte_none() really was persistent pte_none() and not a 
> temporary pte_none() under PTL.
>
> Should we do something similar in do_fault()? I see that we already do 
> something like that on the "!vma->vm_ops->fault" path.
>
> But of course, there is a tradeoff between letting migration 
> (temporarily) fail and grabbing the PTL during page faults.


To dampen the tradeoff, we could do this in shmem_fault() instead? But then,
this would mean that we do this in all kinds of vma->vm_ops->fault, only when
we discover another reference count race condition :) Doing this in do_fault()
should solve this once and for all. In fact, do_pte_missing() may call
do_anonymous_page() or do_fault(), and I just noticed that the former already
checks this using vmf_pte_changed().
David Hildenbrand Aug. 1, 2024, 9:41 a.m. UTC | #3
On 01.08.24 11:38, Dev Jain wrote:
> 
> On 8/1/24 14:12, David Hildenbrand wrote:
>> On 01.08.24 10:16, Dev Jain wrote:
>>> I and Ryan had a discussion and we thought it would be best to get
>>> feedback
>>> from the community.
>>>
>>> The migration mm selftest currently fails on arm64 for shared anon
>>> mappings,
>>> due to the following race:
>>
>> Do you mean MAP_SHARED|MAP_ANON or MAP_PRIVATE|MAP_ANON_fork? Because
>> you note shmem below, I assume you mean MAP_SHARED|MAP_ANON
> 
> 
> Yes.
> 
>>
>>>
>>> Migration:                        Page fault:
>>> try_to_migrate_one():                    handle_pte_fault():
>>> 1. Nuke the PTE                        PTE has been deleted =>
>>> do_pte_missing()
>>> 2. Mark the PTE for migration                PTE has not been deleted
>>> but is just not "present" => do_swap_page()
>>>
>>
>> In filemap_fault_recheck_pte_none() we recheck under PTL to make sure
>> that a temporary pte_none() really was persistent pte_none() and not a
>> temporary pte_none() under PTL.
>>
>> Should we do something similar in do_fault()? I see that we already do
>> something like that on the "!vma->vm_ops->fault" path.
>>
>> But of course, there is a tradeoff between letting migration
>> (temporarily) fail and grabbing the PTL during page faults.
> 
> 
> To dampen the tradeoff, we could do this in shmem_fault() instead? But
> then, this would mean that we do this in all
> 
> kinds of vma->vm_ops->fault, only when we discover another reference
> count race condition :) Doing this in do_fault()
> 
> should solve this once and for all. In fact, do_pte_missing() may call
> do_anonymous_page() or do_fault(), and I just
> 
> noticed that the former already checks this using vmf_pte_changed().

What I am still missing is why this is (a) arm64 only; and (b) if this 
is something we should really worry about. There are other reasons 
(e.g., speculative references) why migration could temporarily fail, 
does it happen that often that it is really something we have to worry 
about?
Dev Jain Aug. 1, 2024, 10:05 a.m. UTC | #4
On 8/1/24 15:11, David Hildenbrand wrote:
> On 01.08.24 11:38, Dev Jain wrote:
>>
>> On 8/1/24 14:12, David Hildenbrand wrote:
>>> On 01.08.24 10:16, Dev Jain wrote:
>>>> I and Ryan had a discussion and we thought it would be best to get
>>>> feedback
>>>> from the community.
>>>>
>>>> The migration mm selftest currently fails on arm64 for shared anon
>>>> mappings,
>>>> due to the following race:
>>>
>>> Do you mean MAP_SHARED|MAP_ANON or MAP_PRIVATE|MAP_ANON_fork? Because
>>> you note shmem below, I assume you mean MAP_SHARED|MAP_ANON
>>
>>
>> Yes.
>>
>>>
>>>>
>>>> Migration:                        Page fault:
>>>> try_to_migrate_one():                    handle_pte_fault():
>>>> 1. Nuke the PTE                        PTE has been deleted =>
>>>> do_pte_missing()
>>>> 2. Mark the PTE for migration                PTE has not been deleted
>>>> but is just not "present" => do_swap_page()
>>>>
>>>
>>> In filemap_fault_recheck_pte_none() we recheck under PTL to make sure
>>> that a temporary pte_none() really was persistent pte_none() and not a
>>> temporary pte_none() under PTL.
>>>
>>> Should we do something similar in do_fault()? I see that we already do
>>> something like that on the "!vma->vm_ops->fault" path.
>>>
>>> But of course, there is a tradeoff between letting migration
>>> (temporarily) fail and grabbing the PTL during page faults.
>>
>>
>> To dampen the tradeoff, we could do this in shmem_fault() instead? But
>> then, this would mean that we do this in all
>>
>> kinds of vma->vm_ops->fault, only when we discover another reference
>> count race condition :) Doing this in do_fault()
>>
>> should solve this once and for all. In fact, do_pte_missing() may call
>> do_anonymous_page() or do_fault(), and I just
>>
>> noticed that the former already checks this using vmf_pte_changed().
>
> What I am still missing is why this is (a) arm64 only; and (b) if this 
> is something we should really worry about. There are other reasons 
> (e.g., speculative references) why migration could temporarily fail, 
> does it happen that often that it is really something we have to worry 
> about?


(a) See discussion at [1]; I guess it passes on x86, which is quite strange
since the race is clearly arch-independent.

(b) On my machine, the test fails on average in under 10 iterations of
move_pages(), which seems problematic to me assuming it is passing on other
arches, since 10 iterations means this is failing very quickly.


[1]: 
https://lore.kernel.org/all/9de42ace-dab1-5f60-af8a-26045ada27b9@arm.com/
Ryan Roberts Aug. 1, 2024, 10:06 a.m. UTC | #5
On 01/08/2024 10:41, David Hildenbrand wrote:
> On 01.08.24 11:38, Dev Jain wrote:
>>
>> On 8/1/24 14:12, David Hildenbrand wrote:
>>> On 01.08.24 10:16, Dev Jain wrote:
>>>> I and Ryan had a discussion and we thought it would be best to get
>>>> feedback
>>>> from the community.
>>>>
>>>> The migration mm selftest currently fails on arm64 for shared anon
>>>> mappings,
>>>> due to the following race:
>>>
>>> Do you mean MAP_SHARED|MAP_ANON or MAP_PRIVATE|MAP_ANON_fork? Because
>>> you note shmem below, I assume you mean MAP_SHARED|MAP_ANON
>>
>>
>> Yes.
>>
>>>
>>>>
>>>> Migration:                        Page fault:
>>>> try_to_migrate_one():                    handle_pte_fault():
>>>> 1. Nuke the PTE                        PTE has been deleted =>
>>>> do_pte_missing()
>>>> 2. Mark the PTE for migration                PTE has not been deleted
>>>> but is just not "present" => do_swap_page()
>>>>
>>>
>>> In filemap_fault_recheck_pte_none() we recheck under PTL to make sure
>>> that a temporary pte_none() really was persistent pte_none() and not a
>>> temporary pte_none() under PTL.
>>>
>>> Should we do something similar in do_fault()? I see that we already do
>>> something like that on the "!vma->vm_ops->fault" path.
>>>
>>> But of course, there is a tradeoff between letting migration
>>> (temporarily) fail and grabbing the PTL during page faults.
>>
>>
>> To dampen the tradeoff, we could do this in shmem_fault() instead? But
>> then, this would mean that we do this in all
>>
>> kinds of vma->vm_ops->fault, only when we discover another reference
>> count race condition :) Doing this in do_fault()
>>
>> should solve this once and for all. In fact, do_pte_missing() may call
>> do_anonymous_page() or do_fault(), and I just
>>
>> noticed that the former already checks this using vmf_pte_changed().
> 
> What I am still missing is why this is (a) arm64 only; and (b) if this is
> something we should really worry about. There are other reasons (e.g.,
> speculative references) why migration could temporarily fail, does it happen
> that often that it is really something we have to worry about?

The test fails consistently on arm64. It's my rough understanding that it's
failing due to migration backing off because the fault handler has raised the
ref count? (Dev correct me if I'm wrong).

So the real question is, is it a valid test in the first place? Should we just
delete the test or do we need to strengthen the kernel's guarantees around
migration success?
David Hildenbrand Aug. 1, 2024, 1:13 p.m. UTC | #6
>>> To dampen the tradeoff, we could do this in shmem_fault() instead? But
>>> then, this would mean that we do this in all
>>>
>>> kinds of vma->vm_ops->fault, only when we discover another reference
>>> count race condition :) Doing this in do_fault()
>>>
>>> should solve this once and for all. In fact, do_pte_missing() may call
>>> do_anonymous_page() or do_fault(), and I just
>>>
>>> noticed that the former already checks this using vmf_pte_changed().
>>
>> What I am still missing is why this is (a) arm64 only; and (b) if this
>> is something we should really worry about. There are other reasons
>> (e.g., speculative references) why migration could temporarily fail,
>> does it happen that often that it is really something we have to worry
>> about?
> 
> 
> (a) See discussion at [1]; I guess it passes on x86, which is quite
> strange since the race is clearly arch-independent.

Yes, I think this is what we have to understand. Is the race simply less 
likely to trigger on x86?

I would assume that it would trigger on any arch.

I just ran it on a x86 VM with 2 NUMA nodes and it also seems to work here.

Is this maybe related to deferred flushing? Such that the other CPU will 
by accident just observe the !pte_none a little less likely?

But arm64 also usually defers flushes, right? At least unless 
ARM64_WORKAROUND_REPEAT_TLBI is around. With that we never do deferred 
flushes.

> 
> (b) On my machine, on an average in under 10 iterations of move_pages(),
> it fails, which seems problematic to

Yes, it's a big difference compared to what I encounter.
David Hildenbrand Aug. 1, 2024, 1:15 p.m. UTC | #7
>> What I am still missing is why this is (a) arm64 only; and (b) if this is
>> something we should really worry about. There are other reasons (e.g.,
>> speculative references) why migration could temporarily fail, does it happen
>> that often that it is really something we have to worry about?
> 
> The test fails consistently on arm64. It's my rough understanding that it's
> failing due to migration backing off because the fault handler has raised the
> ref count? (Dev correct me if I'm wrong).
> 
> So the real question is, is it a valid test in the first place? Should we just
> delete the test or do we need to strengthen the kernel's guarantees around
> migration success?

I think the test should retry migration a number of times in case it 
fails. But if it is a persistent migration failure, the test should fail.
David Hildenbrand Aug. 1, 2024, 1:26 p.m. UTC | #8
On 01.08.24 15:13, David Hildenbrand wrote:
>>>> To dampen the tradeoff, we could do this in shmem_fault() instead? But
>>>> then, this would mean that we do this in all
>>>>
>>>> kinds of vma->vm_ops->fault, only when we discover another reference
>>>> count race condition :) Doing this in do_fault()
>>>>
>>>> should solve this once and for all. In fact, do_pte_missing() may call
>>>> do_anonymous_page() or do_fault(), and I just
>>>>
>>>> noticed that the former already checks this using vmf_pte_changed().
>>>
>>> What I am still missing is why this is (a) arm64 only; and (b) if this
>>> is something we should really worry about. There are other reasons
>>> (e.g., speculative references) why migration could temporarily fail,
>>> does it happen that often that it is really something we have to worry
>>> about?
>>
>>
>> (a) See discussion at [1]; I guess it passes on x86, which is quite
>> strange since the race is clearly arch-independent.
> 
> Yes, I think this is what we have to understand. Is the race simply less
> likely to trigger on x86?
> 
> I would assume that it would trigger on any arch.
> 
> I just ran it on a x86 VM with 2 NUMA nodes and it also seems to work here.
> 
> Is this maybe related to deferred flushing? Such that the other CPU will
> by accident just observe the !pte_none a little less likely?
> 
> But arm64 also usually defers flushes, right? At least unless
> ARM64_WORKAROUND_REPEAT_TLBI is around. With that we never do deferred
> flushes.

Bingo!

diff --git a/mm/rmap.c b/mm/rmap.c
index e51ed44f8b53..ce94b810586b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -718,10 +718,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
   */
  static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
  {
-       if (!(flags & TTU_BATCH_FLUSH))
-               return false;
-
-       return arch_tlbbatch_should_defer(mm);
+       return false;
  }


On x86:

# ./migration
TAP version 13
1..1
# Starting 1 tests from 1 test cases.
#  RUN           migration.shared_anon ...
Didn't migrate 1 pages
# migration.c:170:shared_anon:Expected migrate(ptr, self->n1, self->n2) 
(-2) == 0 (0)
# shared_anon: Test terminated by assertion
#          FAIL  migration.shared_anon
not ok 1 migration.shared_anon


It fails all of the time!
Will Deacon Aug. 1, 2024, 1:43 p.m. UTC | #9
On Thu, Aug 01, 2024 at 03:26:57PM +0200, David Hildenbrand wrote:
> On 01.08.24 15:13, David Hildenbrand wrote:
> > > > > To dampen the tradeoff, we could do this in shmem_fault() instead? But
> > > > > then, this would mean that we do this in all
> > > > > 
> > > > > kinds of vma->vm_ops->fault, only when we discover another reference
> > > > > count race condition :) Doing this in do_fault()
> > > > > 
> > > > > should solve this once and for all. In fact, do_pte_missing() may call
> > > > > do_anonymous_page() or do_fault(), and I just
> > > > > 
> > > > > noticed that the former already checks this using vmf_pte_changed().
> > > > 
> > > > What I am still missing is why this is (a) arm64 only; and (b) if this
> > > > is something we should really worry about. There are other reasons
> > > > (e.g., speculative references) why migration could temporarily fail,
> > > > does it happen that often that it is really something we have to worry
> > > > about?
> > > 
> > > 
> > > (a) See discussion at [1]; I guess it passes on x86, which is quite
> > > strange since the race is clearly arch-independent.
> > 
> > Yes, I think this is what we have to understand. Is the race simply less
> > likely to trigger on x86?
> > 
> > I would assume that it would trigger on any arch.
> > 
> > I just ran it on a x86 VM with 2 NUMA nodes and it also seems to work here.
> > 
> > Is this maybe related to deferred flushing? Such that the other CPU will
> > by accident just observe the !pte_none a little less likely?
> > 
> > But arm64 also usually defers flushes, right? At least unless
> > ARM64_WORKAROUND_REPEAT_TLBI is around. With that we never do deferred
> > flushes.
> 
> Bingo!
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index e51ed44f8b53..ce94b810586b 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -718,10 +718,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct
> *mm, pte_t pteval,
>   */
>  static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
>  {
> -       if (!(flags & TTU_BATCH_FLUSH))
> -               return false;
> -
> -       return arch_tlbbatch_should_defer(mm);
> +       return false;
>  }
> 
> 
> On x86:
> 
> # ./migration
> TAP version 13
> 1..1
> # Starting 1 tests from 1 test cases.
> #  RUN           migration.shared_anon ...
> Didn't migrate 1 pages
> # migration.c:170:shared_anon:Expected migrate(ptr, self->n1, self->n2) (-2)
> == 0 (0)
> # shared_anon: Test terminated by assertion
> #          FAIL  migration.shared_anon
> not ok 1 migration.shared_anon
> 
> 
> It fails all of the time!

Nice work! I suppose that makes sense as, with the eager TLB
invalidation, the window between the other CPU faulting and the
migration entry being written is fairly wide.

Not sure about a fix though :/ It feels a bit overkill to add a new
invalid pte encoding just for this.

Will
David Hildenbrand Aug. 1, 2024, 1:48 p.m. UTC | #10
On 01.08.24 15:43, Will Deacon wrote:
> On Thu, Aug 01, 2024 at 03:26:57PM +0200, David Hildenbrand wrote:
>> On 01.08.24 15:13, David Hildenbrand wrote:
>>>>>> To dampen the tradeoff, we could do this in shmem_fault() instead? But
>>>>>> then, this would mean that we do this in all
>>>>>>
>>>>>> kinds of vma->vm_ops->fault, only when we discover another reference
>>>>>> count race condition :) Doing this in do_fault()
>>>>>>
>>>>>> should solve this once and for all. In fact, do_pte_missing() may call
>>>>>> do_anonymous_page() or do_fault(), and I just
>>>>>>
>>>>>> noticed that the former already checks this using vmf_pte_changed().
>>>>>
>>>>> What I am still missing is why this is (a) arm64 only; and (b) if this
>>>>> is something we should really worry about. There are other reasons
>>>>> (e.g., speculative references) why migration could temporarily fail,
>>>>> does it happen that often that it is really something we have to worry
>>>>> about?
>>>>
>>>>
>>>> (a) See discussion at [1]; I guess it passes on x86, which is quite
>>>> strange since the race is clearly arch-independent.
>>>
>>> Yes, I think this is what we have to understand. Is the race simply less
>>> likely to trigger on x86?
>>>
>>> I would assume that it would trigger on any arch.
>>>
>>> I just ran it on a x86 VM with 2 NUMA nodes and it also seems to work here.
>>>
>>> Is this maybe related to deferred flushing? Such that the other CPU will
>>> by accident just observe the !pte_none a little less likely?
>>>
>>> But arm64 also usually defers flushes, right? At least unless
>>> ARM64_WORKAROUND_REPEAT_TLBI is around. With that we never do deferred
>>> flushes.
>>
>> Bingo!
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index e51ed44f8b53..ce94b810586b 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -718,10 +718,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct
>> *mm, pte_t pteval,
>>    */
>>   static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
>>   {
>> -       if (!(flags & TTU_BATCH_FLUSH))
>> -               return false;
>> -
>> -       return arch_tlbbatch_should_defer(mm);
>> +       return false;
>>   }
>>
>>
>> On x86:
>>
>> # ./migration
>> TAP version 13
>> 1..1
>> # Starting 1 tests from 1 test cases.
>> #  RUN           migration.shared_anon ...
>> Didn't migrate 1 pages
>> # migration.c:170:shared_anon:Expected migrate(ptr, self->n1, self->n2) (-2)
>> == 0 (0)
>> # shared_anon: Test terminated by assertion
>> #          FAIL  migration.shared_anon
>> not ok 1 migration.shared_anon
>>
>>
>> It fails all of the time!
> 
> Nice work! I suppose that makes sense as, with the eager TLB
> invalidation, the window between the other CPU faulting and the
> migration entry being written is fairly wide.
> 
> Not sure about a fix though :/ It feels a bit overkill to add a new
> invalid pte encoding just for this.

Something like that might make the test happy in most cases:

diff --git a/tools/testing/selftests/mm/migration.c b/tools/testing/selftests/mm/migration.c
index 6908569ef406..4c18bfc13b94 100644
--- a/tools/testing/selftests/mm/migration.c
+++ b/tools/testing/selftests/mm/migration.c
@@ -65,6 +65,7 @@ int migrate(uint64_t *ptr, int n1, int n2)
         int ret, tmp;
         int status = 0;
         struct timespec ts1, ts2;
+       int errors = 0;
  
         if (clock_gettime(CLOCK_MONOTONIC, &ts1))
                 return -1;
@@ -79,12 +80,17 @@ int migrate(uint64_t *ptr, int n1, int n2)
                 ret = move_pages(0, 1, (void **) &ptr, &n2, &status,
                                 MPOL_MF_MOVE_ALL);
                 if (ret) {
-                       if (ret > 0)
+                       if (ret > 0) {
+                               if (++errors < 100)
+                                       continue;
                                 printf("Didn't migrate %d pages\n", ret);
-                       else
+                       } else {
                                 perror("Couldn't migrate pages");
+                       }
                         return -2;
                 }
+               /* Progress! */
+               errors = 0;
  
                 tmp = n2;
                 n2 = n1;


[root@localhost mm]# ./migration
TAP version 13
1..1
# Starting 1 tests from 1 test cases.
#  RUN           migration.shared_anon ...
#            OK  migration.shared_anon
ok 1 migration.shared_anon
# PASSED: 1 / 1 tests passed.
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
Ryan Roberts Aug. 1, 2024, 2:25 p.m. UTC | #11
On 01/08/2024 14:48, David Hildenbrand wrote:
> On 01.08.24 15:43, Will Deacon wrote:
>> On Thu, Aug 01, 2024 at 03:26:57PM +0200, David Hildenbrand wrote:
>>> On 01.08.24 15:13, David Hildenbrand wrote:
>>>>>>> To dampen the tradeoff, we could do this in shmem_fault() instead? But
>>>>>>> then, this would mean that we do this in all
>>>>>>>
>>>>>>> kinds of vma->vm_ops->fault, only when we discover another reference
>>>>>>> count race condition :) Doing this in do_fault()
>>>>>>>
>>>>>>> should solve this once and for all. In fact, do_pte_missing() may call
>>>>>>> do_anonymous_page() or do_fault(), and I just
>>>>>>>
>>>>>>> noticed that the former already checks this using vmf_pte_changed().
>>>>>>
>>>>>> What I am still missing is why this is (a) arm64 only; and (b) if this
>>>>>> is something we should really worry about. There are other reasons
>>>>>> (e.g., speculative references) why migration could temporarily fail,
>>>>>> does it happen that often that it is really something we have to worry
>>>>>> about?
>>>>>
>>>>>
>>>>> (a) See discussion at [1]; I guess it passes on x86, which is quite
>>>>> strange since the race is clearly arch-independent.
>>>>
>>>> Yes, I think this is what we have to understand. Is the race simply less
>>>> likely to trigger on x86?
>>>>
>>>> I would assume that it would trigger on any arch.
>>>>
>>>> I just ran it on a x86 VM with 2 NUMA nodes and it also seems to work here.
>>>>
>>>> Is this maybe related to deferred flushing? Such that the other CPU will
>>>> by accident just observe the !pte_none a little less likely?
>>>>
>>>> But arm64 also usually defers flushes, right? At least unless
>>>> ARM64_WORKAROUND_REPEAT_TLBI is around. With that we never do deferred
>>>> flushes.
>>>
>>> Bingo!
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index e51ed44f8b53..ce94b810586b 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -718,10 +718,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct
>>> *mm, pte_t pteval,
>>>    */
>>>   static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
>>>   {
>>> -       if (!(flags & TTU_BATCH_FLUSH))
>>> -               return false;
>>> -
>>> -       return arch_tlbbatch_should_defer(mm);
>>> +       return false;
>>>   }
>>>
>>>
>>> On x86:
>>>
>>> # ./migration
>>> TAP version 13
>>> 1..1
>>> # Starting 1 tests from 1 test cases.
>>> #  RUN           migration.shared_anon ...
>>> Didn't migrate 1 pages
>>> # migration.c:170:shared_anon:Expected migrate(ptr, self->n1, self->n2) (-2)
>>> == 0 (0)
>>> # shared_anon: Test terminated by assertion
>>> #          FAIL  migration.shared_anon
>>> not ok 1 migration.shared_anon
>>>
>>>
>>> It fails all of the time!
>>
>> Nice work! I suppose that makes sense as, with the eager TLB
>> invalidation, the window between the other CPU faulting and the
>> migration entry being written is fairly wide.
>>
>> Not sure about a fix though :/ It feels a bit overkill to add a new
>> invalid pte encoding just for this.

If we want to fix the kernel, I think Dev's option 2 could work?

2. Prepare the migration swap entry and do an atomic exchange to fill the PTE
(this currently seems the best option to me, but I have not tried it out).

We could create a new helper, ptep_get_and_set_not_present(); then we can
atomically change the pte to the migration entry so the fault handler never sees
pte_none(). On arm64 we are already using an atomic in ptep_get_and_clear(), so
it's no extra work. We could implement a non-atomic default version for other
arches. (Although I guess if we fix this for arm64 we would want to fix it for all.)
That wouldn't require any new encoding, just refactoring the migration code to
call the helper.

But if we can convince ourselves that changing the test as David suggests below
is good enough and we won't still get spurious failures, then that seems
simplest. Dev, could you see if you can get the retry-the-test approach to fail?

Thanks,
Ryan


> 
> Something like that might make the test happy in most cases:
> 
> diff --git a/tools/testing/selftests/mm/migration.c
> b/tools/testing/selftests/mm/migration.c
> index 6908569ef406..4c18bfc13b94 100644
> --- a/tools/testing/selftests/mm/migration.c
> +++ b/tools/testing/selftests/mm/migration.c
> @@ -65,6 +65,7 @@ int migrate(uint64_t *ptr, int n1, int n2)
>         int ret, tmp;
>         int status = 0;
>         struct timespec ts1, ts2;
> +       int errors = 0;
>  
>         if (clock_gettime(CLOCK_MONOTONIC, &ts1))
>                 return -1;
> @@ -79,12 +80,17 @@ int migrate(uint64_t *ptr, int n1, int n2)
>                 ret = move_pages(0, 1, (void **) &ptr, &n2, &status,
>                                 MPOL_MF_MOVE_ALL);
>                 if (ret) {
> -                       if (ret > 0)
> +                       if (ret > 0) {
> +                               if (++errors < 100)
> +                                       continue;
>                                 printf("Didn't migrate %d pages\n", ret);
> -                       else
> +                       } else {
>                                 perror("Couldn't migrate pages");
> +                       }
>                         return -2;
>                 }
> +               /* Progress! */
> +               errors = 0;
>  
>                 tmp = n2;
>                 n2 = n1;
> 
> 
> [root@localhost mm]# ./migration
> TAP version 13
> 1..1
> # Starting 1 tests from 1 test cases.
> #  RUN           migration.shared_anon ...
> #            OK  migration.shared_anon
> ok 1 migration.shared_anon
> # PASSED: 1 / 1 tests passed.
> # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
> 
>
Dev Jain Aug. 5, 2024, 9:51 a.m. UTC | #12
On 8/1/24 19:18, David Hildenbrand wrote:
> On 01.08.24 15:43, Will Deacon wrote:
>> On Thu, Aug 01, 2024 at 03:26:57PM +0200, David Hildenbrand wrote:
>>> On 01.08.24 15:13, David Hildenbrand wrote:
>>>>>>> To dampen the tradeoff, we could do this in shmem_fault() 
>>>>>>> instead? But
>>>>>>> then, this would mean that we do this in all
>>>>>>>
>>>>>>> kinds of vma->vm_ops->fault, only when we discover another 
>>>>>>> reference
>>>>>>> count race condition :) Doing this in do_fault()
>>>>>>>
>>>>>>> should solve this once and for all. In fact, do_pte_missing() 
>>>>>>> may call
>>>>>>> do_anonymous_page() or do_fault(), and I just
>>>>>>>
>>>>>>> noticed that the former already checks this using 
>>>>>>> vmf_pte_changed().
>>>>>>
>>>>>> What I am still missing is why this is (a) arm64 only; and (b) if 
>>>>>> this
>>>>>> is something we should really worry about. There are other reasons
>>>>>> (e.g., speculative references) why migration could temporarily fail,
>>>>>> does it happen that often that it is really something we have to 
>>>>>> worry
>>>>>> about?
>>>>>
>>>>>
>>>>> (a) See discussion at [1]; I guess it passes on x86, which is quite
>>>>> strange since the race is clearly arch-independent.
>>>>
>>>> Yes, I think this is what we have to understand. Is the race simply 
>>>> less
>>>> likely to trigger on x86?
>>>>
>>>> I would assume that it would trigger on any arch.
>>>>
>>>> I just ran it on a x86 VM with 2 NUMA nodes and it also seems to 
>>>> work here.
>>>>
>>>> Is this maybe related to deferred flushing? Such that the other CPU 
>>>> will
>>>> by accident just observe the !pte_none a little less likely?
>>>>
>>>> But arm64 also usually defers flushes, right? At least unless
>>>> ARM64_WORKAROUND_REPEAT_TLBI is around. With that we never do deferred
>>>> flushes.
>>>
>>> Bingo!
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index e51ed44f8b53..ce94b810586b 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -718,10 +718,7 @@ static void set_tlb_ubc_flush_pending(struct 
>>> mm_struct
>>> *mm, pte_t pteval,
>>>    */
>>>   static bool should_defer_flush(struct mm_struct *mm, enum 
>>> ttu_flags flags)
>>>   {
>>> -       if (!(flags & TTU_BATCH_FLUSH))
>>> -               return false;
>>> -
>>> -       return arch_tlbbatch_should_defer(mm);
>>> +       return false;
>>>   }
>>>
>>>
>>> On x86:
>>>
>>> # ./migration
>>> TAP version 13
>>> 1..1
>>> # Starting 1 tests from 1 test cases.
>>> #  RUN           migration.shared_anon ...
>>> Didn't migrate 1 pages
>>> # migration.c:170:shared_anon:Expected migrate(ptr, self->n1, 
>>> self->n2) (-2)
>>> == 0 (0)
>>> # shared_anon: Test terminated by assertion
>>> #          FAIL  migration.shared_anon
>>> not ok 1 migration.shared_anon
>>>
>>>
>>> It fails all of the time!
>>
>> Nice work! I suppose that makes sense as, with the eager TLB
>> invalidation, the window between the other CPU faulting and the
>> migration entry being written is fairly wide.
>>
>> Not sure about a fix though :/ It feels a bit overkill to add a new
>> invalid pte encoding just for this.
>
> Something like that might make the test happy in most cases:
>
> diff --git a/tools/testing/selftests/mm/migration.c 
> b/tools/testing/selftests/mm/migration.c
> index 6908569ef406..4c18bfc13b94 100644
> --- a/tools/testing/selftests/mm/migration.c
> +++ b/tools/testing/selftests/mm/migration.c
> @@ -65,6 +65,7 @@ int migrate(uint64_t *ptr, int n1, int n2)
>         int ret, tmp;
>         int status = 0;
>         struct timespec ts1, ts2;
> +       int errors = 0;
>
>         if (clock_gettime(CLOCK_MONOTONIC, &ts1))
>                 return -1;
> @@ -79,12 +80,17 @@ int migrate(uint64_t *ptr, int n1, int n2)
>                 ret = move_pages(0, 1, (void **) &ptr, &n2, &status,
>                                 MPOL_MF_MOVE_ALL);
>                 if (ret) {
> -                       if (ret > 0)
> +                       if (ret > 0) {
> +                               if (++errors < 100)
> +                                       continue;
>                                 printf("Didn't migrate %d pages\n", ret);
> -                       else
> +                       } else {
>                                 perror("Couldn't migrate pages");
> +                       }
>                         return -2;
>                 }
> +               /* Progress! */
> +               errors = 0;
>
>                 tmp = n2;
>                 n2 = n1;
>
>
> [root@localhost mm]# ./migration
> TAP version 13
> 1..1
> # Starting 1 tests from 1 test cases.
> #  RUN           migration.shared_anon ...
> #            OK  migration.shared_anon
> ok 1 migration.shared_anon
> # PASSED: 1 / 1 tests passed.
> # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0


This does make the test pass, to my surprise, since what you are doing from
userspace should have been done by the kernel, because it retries folio
unmapping and moving NR_MAX_MIGRATE_(A)SYNC_RETRY times; I had already tested
pumping up these macros and the original test was still failing. Now, I dug in
more, and, if the following assertion is correct:

Any thread having a reference on a folio will end up calling folio_lock()

then it seems to me that the retry for loop wrapped around
migrate_folio_move(), inside migrate_pages_batch(), is useless; if
migrate_folio_move() fails on the first iteration, it is going to fail for all
iterations since, if I am reading the code path correctly, the only way it
fails is when the actual refcount is not equal to the expected refcount (in
folio_migrate_mapping()), and there is no way that the extra refcount is going
to get released since the migration path has the folio lock.

And therefore, this begs the question: isn't it logical to assert the actual
refcount against the expected refcount, just after we have changed the PTEs,
so that if this assertion fails, we can go to the next iteration of the for
loop for migrate_folio_unmap() inside migrate_pages_batch(), by calling
migrate_folio_undo_src()/dst() to restore the old state? I am trying to
implement this, but it is not as straightforward as it seemed to me this
morning.
David Hildenbrand Aug. 5, 2024, 10:46 a.m. UTC | #13
On 05.08.24 11:51, Dev Jain wrote:
> 
> On 8/1/24 19:18, David Hildenbrand wrote:
>> On 01.08.24 15:43, Will Deacon wrote:
>>> On Thu, Aug 01, 2024 at 03:26:57PM +0200, David Hildenbrand wrote:
>>>> On 01.08.24 15:13, David Hildenbrand wrote:
>>>>>>>> To dampen the tradeoff, we could do this in shmem_fault()
>>>>>>>> instead? But
>>>>>>>> then, this would mean that we do this in all
>>>>>>>>
>>>>>>>> kinds of vma->vm_ops->fault, only when we discover another
>>>>>>>> reference
>>>>>>>> count race condition :) Doing this in do_fault()
>>>>>>>>
>>>>>>>> should solve this once and for all. In fact, do_pte_missing()
>>>>>>>> may call
>>>>>>>> do_anonymous_page() or do_fault(), and I just
>>>>>>>>
>>>>>>>> noticed that the former already checks this using
>>>>>>>> vmf_pte_changed().
>>>>>>>
>>>>>>> What I am still missing is why this is (a) arm64 only; and (b) if
>>>>>>> this
>>>>>>> is something we should really worry about. There are other reasons
>>>>>>> (e.g., speculative references) why migration could temporarily fail,
>>>>>>> does it happen that often that it is really something we have to
>>>>>>> worry
>>>>>>> about?
>>>>>>
>>>>>>
>>>>>> (a) See discussion at [1]; I guess it passes on x86, which is quite
>>>>>> strange since the race is clearly arch-independent.
>>>>>
>>>>> Yes, I think this is what we have to understand. Is the race simply
>>>>> less
>>>>> likely to trigger on x86?
>>>>>
>>>>> I would assume that it would trigger on any arch.
>>>>>
>>>>> I just ran it on a x86 VM with 2 NUMA nodes and it also seems to
>>>>> work here.
>>>>>
>>>>> Is this maybe related to deferred flushing? Such that the other CPU
>>>>> will
>>>>> by accident just observe the !pte_none a little less likely?
>>>>>
>>>>> But arm64 also usually defers flushes, right? At least unless
>>>>> ARM64_WORKAROUND_REPEAT_TLBI is around. With that we never do deferred
>>>>> flushes.
>>>>
>>>> Bingo!
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index e51ed44f8b53..ce94b810586b 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -718,10 +718,7 @@ static void set_tlb_ubc_flush_pending(struct
>>>> mm_struct
>>>> *mm, pte_t pteval,
>>>>     */
>>>>    static bool should_defer_flush(struct mm_struct *mm, enum
>>>> ttu_flags flags)
>>>>    {
>>>> -       if (!(flags & TTU_BATCH_FLUSH))
>>>> -               return false;
>>>> -
>>>> -       return arch_tlbbatch_should_defer(mm);
>>>> +       return false;
>>>>    }
>>>>
>>>>
>>>> On x86:
>>>>
>>>> # ./migration
>>>> TAP version 13
>>>> 1..1
>>>> # Starting 1 tests from 1 test cases.
>>>> #  RUN           migration.shared_anon ...
>>>> Didn't migrate 1 pages
>>>> # migration.c:170:shared_anon:Expected migrate(ptr, self->n1,
>>>> self->n2) (-2)
>>>> == 0 (0)
>>>> # shared_anon: Test terminated by assertion
>>>> #          FAIL  migration.shared_anon
>>>> not ok 1 migration.shared_anon
>>>>
>>>>
>>>> It fails all of the time!
>>>
>>> Nice work! I suppose that makes sense as, with the eager TLB
>>> invalidation, the window between the other CPU faulting and the
>>> migration entry being written is fairly wide.
>>>
>>> Not sure about a fix though :/ It feels a bit overkill to add a new
>>> invalid pte encoding just for this.
>>
>> Something like that might make the test happy in most cases:
>>
>> diff --git a/tools/testing/selftests/mm/migration.c
>> b/tools/testing/selftests/mm/migration.c
>> index 6908569ef406..4c18bfc13b94 100644
>> --- a/tools/testing/selftests/mm/migration.c
>> +++ b/tools/testing/selftests/mm/migration.c
>> @@ -65,6 +65,7 @@ int migrate(uint64_t *ptr, int n1, int n2)
>>          int ret, tmp;
>>          int status = 0;
>>          struct timespec ts1, ts2;
>> +       int errors = 0;
>>
>>          if (clock_gettime(CLOCK_MONOTONIC, &ts1))
>>                  return -1;
>> @@ -79,12 +80,17 @@ int migrate(uint64_t *ptr, int n1, int n2)
>>                  ret = move_pages(0, 1, (void **) &ptr, &n2, &status,
>>                                  MPOL_MF_MOVE_ALL);
>>                  if (ret) {
>> -                       if (ret > 0)
>> +                       if (ret > 0) {
>> +                               if (++errors < 100)
>> +                                       continue;
>>                                  printf("Didn't migrate %d pages\n", ret);
>> -                       else
>> +                       } else {
>>                                  perror("Couldn't migrate pages");
>> +                       }
>>                          return -2;
>>                  }
>> +               /* Progress! */
>> +               errors = 0;
>>
>>                  tmp = n2;
>>                  n2 = n1;
>>
>>
>> [root@localhost mm]# ./migration
>> TAP version 13
>> 1..1
>> # Starting 1 tests from 1 test cases.
>> #  RUN           migration.shared_anon ...
>> #            OK  migration.shared_anon
>> ok 1 migration.shared_anon
>> # PASSED: 1 / 1 tests passed.
>> # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
> 
> 
> This does make the test pass, to my surprise, since what you are doing
> from userspace
> 
> should have been done by the kernel, because it retries folio unmapping
> and moving
> 
> NR_MAX_MIGRATE_(A)SYNC_RETRY times; I had already tested pumping up these
> 
> macros and the original test was still failing. Now, I digged in more,
> and, if the
> 
> following assertion is correct:
> 
> 
> Any thread having a reference on a folio will end up calling folio_lock()
> 

Good point. I suspect concurrent things like read/write would also be 
able to trigger this (did not check, though).

> 
> then it seems to me that the retry for loop wrapped around
> migrate_folio_move(), inside
> 
> migrate_pages_batch(), is useless; if migrate_folio_move() fails on the
> first iteration, it is
> 
> going to fail for all iterations since, if I am reading the code path
> correctly, the only way it
> 
> fails is when the actual refcount is not equal to expected refcount (in
> folio_migrate_mapping()),
> 
> and there is no way that the extra refcount is going to get released
> since the migration path
> 
> has the folio lock.
> 
> And therefore, this begs the question: isn't it logical to assert the
> actual refcount against the
> 
> expected refcount, just after we have changed the PTEs, so that if this
> assertion fails, we can
> 
> go to the next iteration of the for loop for migrate_folio_unmap()
> inside migrate_pages_batch()
> 
> by calling migrate_folio_undo_src()/dst() to restore the old state? I am
> trying to implement
> 
> this but is not as straightforward as it seemed to me this morning.

I agree with your assessment that migration code currently doesn't 
handle the case well when some other thread does an unconditional 
folio_lock(). folio_trylock() users would be handled, but that's not 
what we want with FGP_LOCK I assume.

So IIUC, your idea would be to unlock the folio in migration code and 
try again there. Sounds reasonable, without looking into the details :)
Dev Jain Aug. 5, 2024, 2:14 p.m. UTC | #14
On 8/5/24 16:16, David Hildenbrand wrote:
> On 05.08.24 11:51, Dev Jain wrote:
>>
>> On 8/1/24 19:18, David Hildenbrand wrote:
>>> On 01.08.24 15:43, Will Deacon wrote:
>>>> On Thu, Aug 01, 2024 at 03:26:57PM +0200, David Hildenbrand wrote:
>>>>> On 01.08.24 15:13, David Hildenbrand wrote:
>>>>>>>>> To dampen the tradeoff, we could do this in shmem_fault()
>>>>>>>>> instead? But
>>>>>>>>> then, this would mean that we do this in all
>>>>>>>>>
>>>>>>>>> kinds of vma->vm_ops->fault, only when we discover another
>>>>>>>>> reference
>>>>>>>>> count race condition :) Doing this in do_fault()
>>>>>>>>>
>>>>>>>>> should solve this once and for all. In fact, do_pte_missing()
>>>>>>>>> may call
>>>>>>>>> do_anonymous_page() or do_fault(), and I just
>>>>>>>>>
>>>>>>>>> noticed that the former already checks this using
>>>>>>>>> vmf_pte_changed().
>>>>>>>>
>>>>>>>> What I am still missing is why this is (a) arm64 only; and (b) if
>>>>>>>> this
>>>>>>>> is something we should really worry about. There are other reasons
>>>>>>>> (e.g., speculative references) why migration could temporarily 
>>>>>>>> fail,
>>>>>>>> does it happen that often that it is really something we have to
>>>>>>>> worry
>>>>>>>> about?
>>>>>>>
>>>>>>>
>>>>>>> (a) See discussion at [1]; I guess it passes on x86, which is quite
>>>>>>> strange since the race is clearly arch-independent.
>>>>>>
>>>>>> Yes, I think this is what we have to understand. Is the race simply
>>>>>> less
>>>>>> likely to trigger on x86?
>>>>>>
>>>>>> I would assume that it would trigger on any arch.
>>>>>>
>>>>>> I just ran it on a x86 VM with 2 NUMA nodes and it also seems to
>>>>>> work here.
>>>>>>
>>>>>> Is this maybe related to deferred flushing? Such that the other CPU
>>>>>> will
>>>>>> by accident just observe the !pte_none a little less likely?
>>>>>>
>>>>>> But arm64 also usually defers flushes, right? At least unless
>>>>>> ARM64_WORKAROUND_REPEAT_TLBI is around. With that we never do 
>>>>>> deferred
>>>>>> flushes.
>>>>>
>>>>> Bingo!
>>>>>
>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>> index e51ed44f8b53..ce94b810586b 100644
>>>>> --- a/mm/rmap.c
>>>>> +++ b/mm/rmap.c
>>>>> @@ -718,10 +718,7 @@ static void set_tlb_ubc_flush_pending(struct
>>>>> mm_struct
>>>>> *mm, pte_t pteval,
>>>>>     */
>>>>>    static bool should_defer_flush(struct mm_struct *mm, enum
>>>>> ttu_flags flags)
>>>>>    {
>>>>> -       if (!(flags & TTU_BATCH_FLUSH))
>>>>> -               return false;
>>>>> -
>>>>> -       return arch_tlbbatch_should_defer(mm);
>>>>> +       return false;
>>>>>    }
>>>>>
>>>>>
>>>>> On x86:
>>>>>
>>>>> # ./migration
>>>>> TAP version 13
>>>>> 1..1
>>>>> # Starting 1 tests from 1 test cases.
>>>>> #  RUN           migration.shared_anon ...
>>>>> Didn't migrate 1 pages
>>>>> # migration.c:170:shared_anon:Expected migrate(ptr, self->n1,
>>>>> self->n2) (-2)
>>>>> == 0 (0)
>>>>> # shared_anon: Test terminated by assertion
>>>>> #          FAIL  migration.shared_anon
>>>>> not ok 1 migration.shared_anon
>>>>>
>>>>>
>>>>> It fails all of the time!
>>>>
>>>> Nice work! I suppose that makes sense as, with the eager TLB
>>>> invalidation, the window between the other CPU faulting and the
>>>> migration entry being written is fairly wide.
>>>>
>>>> Not sure about a fix though :/ It feels a bit overkill to add a new
>>>> invalid pte encoding just for this.
>>>
>>> Something like that might make the test happy in most cases:
>>>
>>> diff --git a/tools/testing/selftests/mm/migration.c
>>> b/tools/testing/selftests/mm/migration.c
>>> index 6908569ef406..4c18bfc13b94 100644
>>> --- a/tools/testing/selftests/mm/migration.c
>>> +++ b/tools/testing/selftests/mm/migration.c
>>> @@ -65,6 +65,7 @@ int migrate(uint64_t *ptr, int n1, int n2)
>>>          int ret, tmp;
>>>          int status = 0;
>>>          struct timespec ts1, ts2;
>>> +       int errors = 0;
>>>
>>>          if (clock_gettime(CLOCK_MONOTONIC, &ts1))
>>>                  return -1;
>>> @@ -79,12 +80,17 @@ int migrate(uint64_t *ptr, int n1, int n2)
>>>                  ret = move_pages(0, 1, (void **) &ptr, &n2, &status,
>>>                                  MPOL_MF_MOVE_ALL);
>>>                  if (ret) {
>>> -                       if (ret > 0)
>>> +                       if (ret > 0) {
>>> +                               if (++errors < 100)
>>> +                                       continue;
>>>                                  printf("Didn't migrate %d pages\n", 
>>> ret);
>>> -                       else
>>> +                       } else {
>>>                                  perror("Couldn't migrate pages");
>>> +                       }
>>>                          return -2;
>>>                  }
>>> +               /* Progress! */
>>> +               errors = 0;
>>>
>>>                  tmp = n2;
>>>                  n2 = n1;
>>>
>>>
>>> [root@localhost mm]# ./migration
>>> TAP version 13
>>> 1..1
>>> # Starting 1 tests from 1 test cases.
>>> #  RUN           migration.shared_anon ...
>>> #            OK  migration.shared_anon
>>> ok 1 migration.shared_anon
>>> # PASSED: 1 / 1 tests passed.
>>> # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
>>
>>
>> This does make the test pass, to my surprise, since what you are doing
>> from userspace
>>
>> should have been done by the kernel, because it retries folio unmapping
>> and moving
>>
>> NR_MAX_MIGRATE_(A)SYNC_RETRY times; I had already tested pumping up 
>> these
>>
>> macros and the original test was still failing. Now, I digged in more,
>> and, if the
>>
>> following assertion is correct:
>>
>>
>> Any thread having a reference on a folio will end up calling 
>> folio_lock()
>>
>
> Good point. I suspect concurrent things like read/write would also be 
> able to trigger this (did not check, though).
>
>>
>> then it seems to me that the retry for loop wrapped around
>> migrate_folio_move(), inside
>>
>> migrate_pages_batch(), is useless; if migrate_folio_move() fails on the
>> first iteration, it is
>>
>> going to fail for all iterations since, if I am reading the code path
>> correctly, the only way it
>>
>> fails is when the actual refcount is not equal to expected refcount (in
>> folio_migrate_mapping()),
>>
>> and there is no way that the extra refcount is going to get released
>> since the migration path
>>
>> has the folio lock.
>>
>> And therefore, this begs the question: isn't it logical to assert the
>> actual refcount against the
>>
>> expected refcount, just after we have changed the PTEs, so that if this
>> assertion fails, we can
>>
>> go to the next iteration of the for loop for migrate_folio_unmap()
>> inside migrate_pages_batch()
>>
>> by calling migrate_folio_undo_src()/dst() to restore the old state? I am
>> trying to implement
>>
>> this but is not as straightforward as it seemed to me this morning.
>
> I agree with your assessment that migration code currently doesn't 
> handle the case well when some other thread does an unconditional 
> folio_lock(). folio_trylock() users would be handled, but that's not 
> what we want with FGP_LOCK I assume.
>
> So IIUC, your idea would be to unlock the folio in migration code and 
> try again there. Sounds reasonable, without looking into the details :)


(Adding Mel if at all he has any comments for a compaction use-case)

What I was trying to say is this (forgive me for the hard-coded value):

diff --git a/mm/migrate.c b/mm/migrate.c
index a8c6f466e33a..404af46dd661 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1262,6 +1262,8 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
 	}
 
 	if (!folio_mapped(src)) {
+		if (folio_ref_count(src) != 2)
+			goto out;
 		__migrate_folio_record(dst, old_page_state, anon_vma);
 		return MIGRATEPAGE_UNMAP;
 	}

This will give us more chances to win the race. On average, the test now
fails after around 100 iterations of move_pages(). If you multiply the
NR_MAX_MIGRATE_(A)SYNC_RETRY macros by 3, the average goes above 2000.
But if the consensus is that this is just pleasing the test without any
real use-case (compaction?), then I guess I am alright with making the
change in the test.
David Hildenbrand Aug. 7, 2024, 11:39 a.m. UTC | #15
On 05.08.24 16:14, Dev Jain wrote:
> 
> On 8/5/24 16:16, David Hildenbrand wrote:
>> On 05.08.24 11:51, Dev Jain wrote:
>>>
>>> On 8/1/24 19:18, David Hildenbrand wrote:
>>>> On 01.08.24 15:43, Will Deacon wrote:
>>>>> On Thu, Aug 01, 2024 at 03:26:57PM +0200, David Hildenbrand wrote:
>>>>>> On 01.08.24 15:13, David Hildenbrand wrote:
>>>>>>>>>> To dampen the tradeoff, we could do this in shmem_fault()
>>>>>>>>>> instead? But
>>>>>>>>>> then, this would mean that we do this in all
>>>>>>>>>>
>>>>>>>>>> kinds of vma->vm_ops->fault, only when we discover another
>>>>>>>>>> reference
>>>>>>>>>> count race condition :) Doing this in do_fault()
>>>>>>>>>>
>>>>>>>>>> should solve this once and for all. In fact, do_pte_missing()
>>>>>>>>>> may call
>>>>>>>>>> do_anonymous_page() or do_fault(), and I just
>>>>>>>>>>
>>>>>>>>>> noticed that the former already checks this using
>>>>>>>>>> vmf_pte_changed().
>>>>>>>>>
>>>>>>>>> What I am still missing is why this is (a) arm64 only; and (b) if
>>>>>>>>> this
>>>>>>>>> is something we should really worry about. There are other reasons
>>>>>>>>> (e.g., speculative references) why migration could temporarily
>>>>>>>>> fail,
>>>>>>>>> does it happen that often that it is really something we have to
>>>>>>>>> worry
>>>>>>>>> about?
>>>>>>>>
>>>>>>>>
>>>>>>>> (a) See discussion at [1]; I guess it passes on x86, which is quite
>>>>>>>> strange since the race is clearly arch-independent.
>>>>>>>
>>>>>>> Yes, I think this is what we have to understand. Is the race simply
>>>>>>> less
>>>>>>> likely to trigger on x86?
>>>>>>>
>>>>>>> I would assume that it would trigger on any arch.
>>>>>>>
>>>>>>> I just ran it on a x86 VM with 2 NUMA nodes and it also seems to
>>>>>>> work here.
>>>>>>>
>>>>>>> Is this maybe related to deferred flushing? Such that the other CPU
>>>>>>> will
>>>>>>> by accident just observe the !pte_none a little less likely?
>>>>>>>
>>>>>>> But arm64 also usually defers flushes, right? At least unless
>>>>>>> ARM64_WORKAROUND_REPEAT_TLBI is around. With that we never do
>>>>>>> deferred
>>>>>>> flushes.
>>>>>>
>>>>>> Bingo!
>>>>>>
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index e51ed44f8b53..ce94b810586b 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -718,10 +718,7 @@ static void set_tlb_ubc_flush_pending(struct
>>>>>> mm_struct
>>>>>> *mm, pte_t pteval,
>>>>>>      */
>>>>>>     static bool should_defer_flush(struct mm_struct *mm, enum
>>>>>> ttu_flags flags)
>>>>>>     {
>>>>>> -       if (!(flags & TTU_BATCH_FLUSH))
>>>>>> -               return false;
>>>>>> -
>>>>>> -       return arch_tlbbatch_should_defer(mm);
>>>>>> +       return false;
>>>>>>     }
>>>>>>
>>>>>>
>>>>>> On x86:
>>>>>>
>>>>>> # ./migration
>>>>>> TAP version 13
>>>>>> 1..1
>>>>>> # Starting 1 tests from 1 test cases.
>>>>>> #  RUN           migration.shared_anon ...
>>>>>> Didn't migrate 1 pages
>>>>>> # migration.c:170:shared_anon:Expected migrate(ptr, self->n1,
>>>>>> self->n2) (-2)
>>>>>> == 0 (0)
>>>>>> # shared_anon: Test terminated by assertion
>>>>>> #          FAIL  migration.shared_anon
>>>>>> not ok 1 migration.shared_anon
>>>>>>
>>>>>>
>>>>>> It fails all of the time!
>>>>>
>>>>> Nice work! I suppose that makes sense as, with the eager TLB
>>>>> invalidation, the window between the other CPU faulting and the
>>>>> migration entry being written is fairly wide.
>>>>>
>>>>> Not sure about a fix though :/ It feels a bit overkill to add a new
>>>>> invalid pte encoding just for this.
>>>>
>>>> Something like that might make the test happy in most cases:
>>>>
>>>> diff --git a/tools/testing/selftests/mm/migration.c
>>>> b/tools/testing/selftests/mm/migration.c
>>>> index 6908569ef406..4c18bfc13b94 100644
>>>> --- a/tools/testing/selftests/mm/migration.c
>>>> +++ b/tools/testing/selftests/mm/migration.c
>>>> @@ -65,6 +65,7 @@ int migrate(uint64_t *ptr, int n1, int n2)
>>>>           int ret, tmp;
>>>>           int status = 0;
>>>>           struct timespec ts1, ts2;
>>>> +       int errors = 0;
>>>>
>>>>           if (clock_gettime(CLOCK_MONOTONIC, &ts1))
>>>>                   return -1;
>>>> @@ -79,12 +80,17 @@ int migrate(uint64_t *ptr, int n1, int n2)
>>>>                   ret = move_pages(0, 1, (void **) &ptr, &n2, &status,
>>>>                                   MPOL_MF_MOVE_ALL);
>>>>                   if (ret) {
>>>> -                       if (ret > 0)
>>>> +                       if (ret > 0) {
>>>> +                               if (++errors < 100)
>>>> +                                       continue;
>>>>                                   printf("Didn't migrate %d pages\n",
>>>> ret);
>>>> -                       else
>>>> +                       } else {
>>>>                                   perror("Couldn't migrate pages");
>>>> +                       }
>>>>                           return -2;
>>>>                   }
>>>> +               /* Progress! */
>>>> +               errors = 0;
>>>>
>>>>                   tmp = n2;
>>>>                   n2 = n1;
>>>>
>>>>
>>>> [root@localhost mm]# ./migration
>>>> TAP version 13
>>>> 1..1
>>>> # Starting 1 tests from 1 test cases.
>>>> #  RUN           migration.shared_anon ...
>>>> #            OK  migration.shared_anon
>>>> ok 1 migration.shared_anon
>>>> # PASSED: 1 / 1 tests passed.
>>>> # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
>>>
>>>
>>> This does make the test pass, to my surprise, since what you are doing
>>> from userspace should have been done by the kernel, because it retries
>>> folio unmapping and moving NR_MAX_MIGRATE_(A)SYNC_RETRY times; I had
>>> already tested pumping up these macros and the original test was still
>>> failing. Now, I dug in more, and, if the following assertion is correct:
>>>
>>> Any thread having a reference on a folio will end up calling
>>> folio_lock()
>>>
>>
>> Good point. I suspect concurrent things like read/write would also be
>> able to trigger this (did not check, though).
>>
>>>
>>> then it seems to me that the retry for loop wrapped around
>>> migrate_folio_move(), inside migrate_pages_batch(), is useless; if
>>> migrate_folio_move() fails on the first iteration, it is going to fail
>>> for all iterations since, if I am reading the code path correctly, the
>>> only way it fails is when the actual refcount is not equal to the
>>> expected refcount (in folio_migrate_mapping()), and there is no way
>>> that the extra refcount is going to get released since the migration
>>> path has the folio lock.
>>>
>>> And therefore, this begs the question: isn't it logical to assert the
>>> actual refcount against the expected refcount, just after we have
>>> changed the PTEs, so that if this assertion fails, we can go to the
>>> next iteration of the for loop for migrate_folio_unmap() inside
>>> migrate_pages_batch(), by calling migrate_folio_undo_src()/dst() to
>>> restore the old state? I am trying to implement this, but it is not as
>>> straightforward as it seemed to me this morning.
>>
>> I agree with your assessment that migration code currently doesn't
>> handle the case well when some other thread does an unconditional
>> folio_lock(). folio_trylock() users would be handled, but that's not
>> what we want with FGP_LOCK I assume.
>>
>> So IIUC, your idea would be to unlock the folio in migration code and
>> try again there. Sounds reasonable, without looking into the details :)
> 
> 

BTW, I was trying to find the spot that would do the folio_lock(), but 
filemap_fault() does the lock_folio_maybe_drop_mmap() where we do a 
folio_trylock().

Where exactly is the folio_lock() on the fault path that would prohibit 
us from making progress?

> (Adding Mel if at all he has any comments for a compaction use-case)
> 
> What I was trying to say is this (forgive me for the hard-coded value):

The hard-coded 2 is wrong indeed :)

> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index a8c6f466e33a..404af46dd661 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1262,6 +1262,8 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>  	}
>  
>  	if (!folio_mapped(src)) {
> +		if (folio_ref_count(src) != 2)
> +			goto out;
>  		__migrate_folio_record(dst, old_page_state, anon_vma);
>  		return MIGRATEPAGE_UNMAP;
>  	}
> This will give us more chances to win the race. On average, the test
> now fails after about 100 iterations of move_pages(). If you multiply
> the NR_MAX_MIGRATE_(A)SYNC_RETRY macros by 3, the average goes up to
> about 2000.
> But if the consensus is that this is just pleasing the test without
> any real use case (compaction?), then I guess I am alright with making
> the change in the test.
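
For reference, a non-hardcoded form of the refcount check above could reuse
the same expectation that folio_migrate_mapping() enforces later. A rough
sketch (untested; folio_expected_refs() is the static helper that already
lives in mm/migrate.c):

	if (!folio_mapped(src)) {
		/*
		 * Mirror folio_migrate_mapping()'s expectation instead of
		 * hard-coding 2: folio_expected_refs() counts the pagecache
		 * and private references plus the reference held by the
		 * migration caller, and copes with a NULL mapping for anon
		 * folios.
		 */
		if (folio_ref_count(src) !=
		    folio_expected_refs(folio_mapping(src), src))
			goto out;
		__migrate_folio_record(dst, old_page_state, anon_vma);
		return MIGRATEPAGE_UNMAP;
	}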

I'd be curious if other activity (besides fault, like concurrent 
read()/write()/...) can similarly make migration fail. I suspect it can.
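
One way to poke at that suspicion from userspace might be to back the shared
mapping with a memfd, so that a second thread can hit the same folio through
pread() while the migration loop runs. A minimal sketch, assuming the
migrate() helper from migration.c is reused as-is (error handling omitted):

#define _GNU_SOURCE
#include <pthread.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static int fd;
static volatile int stop;

/* Each pread() goes through the pagecache and briefly holds a folio reference. */
static void *read_hammer(void *arg)
{
	char buf[64];

	while (!stop)
		if (pread(fd, buf, sizeof(buf), 0) < 0)
			break;
	return NULL;
}

int main(void)
{
	const size_t len = 4096;
	pthread_t t;
	uint64_t *ptr;

	fd = memfd_create("migration-read", 0);	/* shmem-backed, like MAP_SHARED|MAP_ANON */
	ftruncate(fd, len);
	ptr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	memset(ptr, 0xab, len);

	pthread_create(&t, NULL, read_hammer, NULL);
	/* ... run the existing migrate(ptr, n1, n2) loop from migration.c here ... */
	stop = 1;
	pthread_join(t, NULL);
	return 0;
}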
Dev Jain Aug. 7, 2024, 12:58 p.m. UTC | #16
On 8/7/24 17:09, David Hildenbrand wrote:
> On 05.08.24 16:14, Dev Jain wrote:
>>
>> On 8/5/24 16:16, David Hildenbrand wrote:
>>> On 05.08.24 11:51, Dev Jain wrote:
>>>>
>>>> On 8/1/24 19:18, David Hildenbrand wrote:
>>>>> On 01.08.24 15:43, Will Deacon wrote:
>>>>>> On Thu, Aug 01, 2024 at 03:26:57PM +0200, David Hildenbrand wrote:
>>>>>>> On 01.08.24 15:13, David Hildenbrand wrote:
>>>>>>>>>>> To dampen the tradeoff, we could do this in shmem_fault()
>>>>>>>>>>> instead? But
>>>>>>>>>>> then, this would mean that we do this in all
>>>>>>>>>>>
>>>>>>>>>>> kinds of vma->vm_ops->fault, only when we discover another
>>>>>>>>>>> reference
>>>>>>>>>>> count race condition :) Doing this in do_fault()
>>>>>>>>>>>
>>>>>>>>>>> should solve this once and for all. In fact, do_pte_missing()
>>>>>>>>>>> may call
>>>>>>>>>>> do_anonymous_page() or do_fault(), and I just
>>>>>>>>>>>
>>>>>>>>>>> noticed that the former already checks this using
>>>>>>>>>>> vmf_pte_changed().
>>>>>>>>>>
>>>>>>>>>> What I am still missing is why this is (a) arm64 only; and 
>>>>>>>>>> (b) if
>>>>>>>>>> this
>>>>>>>>>> is something we should really worry about. There are other 
>>>>>>>>>> reasons
>>>>>>>>>> (e.g., speculative references) why migration could temporarily
>>>>>>>>>> fail,
>>>>>>>>>> does it happen that often that it is really something we have to
>>>>>>>>>> worry
>>>>>>>>>> about?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> (a) See discussion at [1]; I guess it passes on x86, which is 
>>>>>>>>> quite
>>>>>>>>> strange since the race is clearly arch-independent.
>>>>>>>>
>>>>>>>> Yes, I think this is what we have to understand. Is the race 
>>>>>>>> simply
>>>>>>>> less
>>>>>>>> likely to trigger on x86?
>>>>>>>>
>>>>>>>> I would assume that it would trigger on any arch.
>>>>>>>>
>>>>>>>> I just ran it on a x86 VM with 2 NUMA nodes and it also seems to
>>>>>>>> work here.
>>>>>>>>
>>>>>>>> Is this maybe related to deferred flushing? Such that the other 
>>>>>>>> CPU
>>>>>>>> will
>>>>>>>> by accident just observe the !pte_none a little less likely?
>>>>>>>>
>>>>>>>> But arm64 also usually defers flushes, right? At least unless
>>>>>>>> ARM64_WORKAROUND_REPEAT_TLBI is around. With that we never do
>>>>>>>> deferred
>>>>>>>> flushes.
>>>>>>>
>>>>>>> Bingo!
>>>>>>>
>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>>> index e51ed44f8b53..ce94b810586b 100644
>>>>>>> --- a/mm/rmap.c
>>>>>>> +++ b/mm/rmap.c
>>>>>>> @@ -718,10 +718,7 @@ static void set_tlb_ubc_flush_pending(struct
>>>>>>> mm_struct
>>>>>>> *mm, pte_t pteval,
>>>>>>>      */
>>>>>>>     static bool should_defer_flush(struct mm_struct *mm, enum
>>>>>>> ttu_flags flags)
>>>>>>>     {
>>>>>>> -       if (!(flags & TTU_BATCH_FLUSH))
>>>>>>> -               return false;
>>>>>>> -
>>>>>>> -       return arch_tlbbatch_should_defer(mm);
>>>>>>> +       return false;
>>>>>>>     }
>>>>>>>
>>>>>>>
>>>>>>> On x86:
>>>>>>>
>>>>>>> # ./migration
>>>>>>> TAP version 13
>>>>>>> 1..1
>>>>>>> # Starting 1 tests from 1 test cases.
>>>>>>> #  RUN           migration.shared_anon ...
>>>>>>> Didn't migrate 1 pages
>>>>>>> # migration.c:170:shared_anon:Expected migrate(ptr, self->n1,
>>>>>>> self->n2) (-2)
>>>>>>> == 0 (0)
>>>>>>> # shared_anon: Test terminated by assertion
>>>>>>> #          FAIL  migration.shared_anon
>>>>>>> not ok 1 migration.shared_anon
>>>>>>>
>>>>>>>
>>>>>>> It fails all of the time!
>>>>>>
>>>>>> Nice work! I suppose that makes sense as, with the eager TLB
>>>>>> invalidation, the window between the other CPU faulting and the
>>>>>> migration entry being written is fairly wide.
>>>>>>
>>>>>> Not sure about a fix though :/ It feels a bit overkill to add a new
>>>>>> invalid pte encoding just for this.
>>>>>
>>>>> Something like that might make the test happy in most cases:
>>>>>
>>>>> diff --git a/tools/testing/selftests/mm/migration.c
>>>>> b/tools/testing/selftests/mm/migration.c
>>>>> index 6908569ef406..4c18bfc13b94 100644
>>>>> --- a/tools/testing/selftests/mm/migration.c
>>>>> +++ b/tools/testing/selftests/mm/migration.c
>>>>> @@ -65,6 +65,7 @@ int migrate(uint64_t *ptr, int n1, int n2)
>>>>>           int ret, tmp;
>>>>>           int status = 0;
>>>>>           struct timespec ts1, ts2;
>>>>> +       int errors = 0;
>>>>>
>>>>>           if (clock_gettime(CLOCK_MONOTONIC, &ts1))
>>>>>                   return -1;
>>>>> @@ -79,12 +80,17 @@ int migrate(uint64_t *ptr, int n1, int n2)
>>>>>                   ret = move_pages(0, 1, (void **) &ptr, &n2, 
>>>>> &status,
>>>>>                                   MPOL_MF_MOVE_ALL);
>>>>>                   if (ret) {
>>>>> -                       if (ret > 0)
>>>>> +                       if (ret > 0) {
>>>>> +                               if (++errors < 100)
>>>>> +                                       continue;
>>>>>                                   printf("Didn't migrate %d pages\n",
>>>>> ret);
>>>>> -                       else
>>>>> +                       } else {
>>>>>                                   perror("Couldn't migrate pages");
>>>>> +                       }
>>>>>                           return -2;
>>>>>                   }
>>>>> +               /* Progress! */
>>>>> +               errors = 0;
>>>>>
>>>>>                   tmp = n2;
>>>>>                   n2 = n1;
>>>>>
>>>>>
>>>>> [root@localhost mm]# ./migration
>>>>> TAP version 13
>>>>> 1..1
>>>>> # Starting 1 tests from 1 test cases.
>>>>> #  RUN           migration.shared_anon ...
>>>>> #            OK  migration.shared_anon
>>>>> ok 1 migration.shared_anon
>>>>> # PASSED: 1 / 1 tests passed.
>>>>> # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
>>>>
>>>>
>>>> This does make the test pass, to my surprise, since what you are doing
>>>> from userspace
>>>>
>>>> should have been done by the kernel, because it retries folio 
>>>> unmapping
>>>> and moving
>>>>
>>>> NR_MAX_MIGRATE_(A)SYNC_RETRY times; I had already tested pumping up
>>>> these
>>>>
>>>> macros and the original test was still failing. Now, I digged in more,
>>>> and, if the
>>>>
>>>> following assertion is correct:
>>>>
>>>>
>>>> Any thread having a reference on a folio will end up calling
>>>> folio_lock()
>>>>
>>>
>>> Good point. I suspect concurrent things like read/write would also be
>>> able to trigger this (did not check, though).
>>>
>>>>
>>>> then it seems to me that the retry for loop wrapped around
>>>> migrate_folio_move(), inside
>>>>
>>>> migrate_pages_batch(), is useless; if migrate_folio_move() fails on 
>>>> the
>>>> first iteration, it is
>>>>
>>>> going to fail for all iterations since, if I am reading the code path
>>>> correctly, the only way it
>>>>
>>>> fails is when the actual refcount is not equal to expected refcount 
>>>> (in
>>>> folio_migrate_mapping()),
>>>>
>>>> and there is no way that the extra refcount is going to get released
>>>> since the migration path
>>>>
>>>> has the folio lock.
>>>>
>>>> And therefore, this begs the question: isn't it logical to assert the
>>>> actual refcount against the
>>>>
>>>> expected refcount, just after we have changed the PTEs, so that if 
>>>> this
>>>> assertion fails, we can
>>>>
>>>> go to the next iteration of the for loop for migrate_folio_unmap()
>>>> inside migrate_pages_batch()
>>>>
>>>> by calling migrate_folio_undo_src()/dst() to restore the old state? 
>>>> I am
>>>> trying to implement
>>>>
>>>> this but is not as straightforward as it seemed to me this morning.
>>>
>>> I agree with your assessment that migration code currently doesn't
>>> handle the case well when some other thread does an unconditional
>>> folio_lock(). folio_trylock() users would be handled, but that's not
>>> what we want with FGP_LOCK I assume.
>>>
>>> So IIUC, your idea would be to unlock the folio in migration code and
>>> try again their. Sounds reasonable, without looking into the details :)
>>
>>
>
> BTW, I was trying to find the spot that would do the folio_lock(), but 
> filemap_fault() does the lock_folio_maybe_drop_mmap() where we do a 
> folio_trylock().
>
> Where exactly is the folio_lock() on the fault path that would 
> prohibit us from making progress?

Not filemap_fault(); it enters shmem_fault() which eventually calls 
shmem_get_folio_gfp(), retrieving the folio from the pagecache, and 
calling folio_lock().
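
For illustration, the shape of that path is roughly the following (a heavily
condensed paraphrase of mm/shmem.c / mm/filemap.c, not the exact upstream
code):

	/* shmem_fault() -> shmem_get_folio_gfp(), condensed: */
	folio = filemap_get_entry(inode->i_mapping, index);
	/* filemap_get_entry() took a reference via folio_try_get() ... */
	if (folio) {
		/*
		 * ... and we then sleep in folio_lock() while still holding
		 * that reference.  Migration owns the folio lock, so the
		 * fault blocks here, and the extra reference makes
		 * folio_migrate_mapping() see more references than it
		 * expects and fail with -EAGAIN.
		 */
		folio_lock(folio);
	}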

>
>> (Adding Mel if at all he has any comments for a compaction use-case)
>>
>> What I was trying to say is this (forgive me for the hard-coded value):
>
> The hard-coded 2 is wrong indeed :)
>
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index a8c6f466e33a..404af46dd661 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1262,6 +1262,8 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>  	}
>>  
>>  	if (!folio_mapped(src)) {
>> +		if (folio_ref_count(src) != 2)
>> +			goto out;
>>  		__migrate_folio_record(dst, old_page_state, anon_vma);
>>  		return MIGRATEPAGE_UNMAP;
>>  	}
>> This will give us more chances to win the race. On average, the test
>> now fails after about 100 iterations of move_pages(). If you multiply
>> the NR_MAX_MIGRATE_(A)SYNC_RETRY macros by 3, the average goes up to
>> about 2000.
>> But if the consensus is that this is just pleasing the test without
>> any real use case (compaction?), then I guess I am alright with making
>> the change in the test.
>
> I'd be curious if other activity (besides fault, like concurrent 
> read()/write()/...) can similarly make migration fail. I suspect it can.
>
David Hildenbrand Aug. 9, 2024, 1:23 p.m. UTC | #17
On 07.08.24 14:58, Dev Jain wrote:
> 
> On 8/7/24 17:09, David Hildenbrand wrote:
>> On 05.08.24 16:14, Dev Jain wrote:
>>>
>>> On 8/5/24 16:16, David Hildenbrand wrote:
>>>> On 05.08.24 11:51, Dev Jain wrote:
>>>>>
>>>>> On 8/1/24 19:18, David Hildenbrand wrote:
>>>>>> On 01.08.24 15:43, Will Deacon wrote:
>>>>>>> On Thu, Aug 01, 2024 at 03:26:57PM +0200, David Hildenbrand wrote:
>>>>>>>> On 01.08.24 15:13, David Hildenbrand wrote:
>>>>>>>>>>>> To dampen the tradeoff, we could do this in shmem_fault()
>>>>>>>>>>>> instead? But
>>>>>>>>>>>> then, this would mean that we do this in all
>>>>>>>>>>>>
>>>>>>>>>>>> kinds of vma->vm_ops->fault, only when we discover another
>>>>>>>>>>>> reference
>>>>>>>>>>>> count race condition :) Doing this in do_fault()
>>>>>>>>>>>>
>>>>>>>>>>>> should solve this once and for all. In fact, do_pte_missing()
>>>>>>>>>>>> may call
>>>>>>>>>>>> do_anonymous_page() or do_fault(), and I just
>>>>>>>>>>>>
>>>>>>>>>>>> noticed that the former already checks this using
>>>>>>>>>>>> vmf_pte_changed().
>>>>>>>>>>>
>>>>>>>>>>> What I am still missing is why this is (a) arm64 only; and
>>>>>>>>>>> (b) if
>>>>>>>>>>> this
>>>>>>>>>>> is something we should really worry about. There are other
>>>>>>>>>>> reasons
>>>>>>>>>>> (e.g., speculative references) why migration could temporarily
>>>>>>>>>>> fail,
>>>>>>>>>>> does it happen that often that it is really something we have to
>>>>>>>>>>> worry
>>>>>>>>>>> about?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> (a) See discussion at [1]; I guess it passes on x86, which is
>>>>>>>>>> quite
>>>>>>>>>> strange since the race is clearly arch-independent.
>>>>>>>>>
>>>>>>>>> Yes, I think this is what we have to understand. Is the race
>>>>>>>>> simply
>>>>>>>>> less
>>>>>>>>> likely to trigger on x86?
>>>>>>>>>
>>>>>>>>> I would assume that it would trigger on any arch.
>>>>>>>>>
>>>>>>>>> I just ran it on a x86 VM with 2 NUMA nodes and it also seems to
>>>>>>>>> work here.
>>>>>>>>>
>>>>>>>>> Is this maybe related to deferred flushing? Such that the other
>>>>>>>>> CPU
>>>>>>>>> will
>>>>>>>>> by accident just observe the !pte_none a little less likely?
>>>>>>>>>
>>>>>>>>> But arm64 also usually defers flushes, right? At least unless
>>>>>>>>> ARM64_WORKAROUND_REPEAT_TLBI is around. With that we never do
>>>>>>>>> deferred
>>>>>>>>> flushes.
>>>>>>>>
>>>>>>>> Bingo!
>>>>>>>>
>>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>>>> index e51ed44f8b53..ce94b810586b 100644
>>>>>>>> --- a/mm/rmap.c
>>>>>>>> +++ b/mm/rmap.c
>>>>>>>> @@ -718,10 +718,7 @@ static void set_tlb_ubc_flush_pending(struct
>>>>>>>> mm_struct
>>>>>>>> *mm, pte_t pteval,
>>>>>>>>       */
>>>>>>>>      static bool should_defer_flush(struct mm_struct *mm, enum
>>>>>>>> ttu_flags flags)
>>>>>>>>      {
>>>>>>>> -       if (!(flags & TTU_BATCH_FLUSH))
>>>>>>>> -               return false;
>>>>>>>> -
>>>>>>>> -       return arch_tlbbatch_should_defer(mm);
>>>>>>>> +       return false;
>>>>>>>>      }
>>>>>>>>
>>>>>>>>
>>>>>>>> On x86:
>>>>>>>>
>>>>>>>> # ./migration
>>>>>>>> TAP version 13
>>>>>>>> 1..1
>>>>>>>> # Starting 1 tests from 1 test cases.
>>>>>>>> #  RUN           migration.shared_anon ...
>>>>>>>> Didn't migrate 1 pages
>>>>>>>> # migration.c:170:shared_anon:Expected migrate(ptr, self->n1,
>>>>>>>> self->n2) (-2)
>>>>>>>> == 0 (0)
>>>>>>>> # shared_anon: Test terminated by assertion
>>>>>>>> #          FAIL  migration.shared_anon
>>>>>>>> not ok 1 migration.shared_anon
>>>>>>>>
>>>>>>>>
>>>>>>>> It fails all of the time!
>>>>>>>
>>>>>>> Nice work! I suppose that makes sense as, with the eager TLB
>>>>>>> invalidation, the window between the other CPU faulting and the
>>>>>>> migration entry being written is fairly wide.
>>>>>>>
>>>>>>> Not sure about a fix though :/ It feels a bit overkill to add a new
>>>>>>> invalid pte encoding just for this.
>>>>>>
>>>>>> Something like that might make the test happy in most cases:
>>>>>>
>>>>>> diff --git a/tools/testing/selftests/mm/migration.c
>>>>>> b/tools/testing/selftests/mm/migration.c
>>>>>> index 6908569ef406..4c18bfc13b94 100644
>>>>>> --- a/tools/testing/selftests/mm/migration.c
>>>>>> +++ b/tools/testing/selftests/mm/migration.c
>>>>>> @@ -65,6 +65,7 @@ int migrate(uint64_t *ptr, int n1, int n2)
>>>>>>            int ret, tmp;
>>>>>>            int status = 0;
>>>>>>            struct timespec ts1, ts2;
>>>>>> +       int errors = 0;
>>>>>>
>>>>>>            if (clock_gettime(CLOCK_MONOTONIC, &ts1))
>>>>>>                    return -1;
>>>>>> @@ -79,12 +80,17 @@ int migrate(uint64_t *ptr, int n1, int n2)
>>>>>>                    ret = move_pages(0, 1, (void **) &ptr, &n2,
>>>>>> &status,
>>>>>>                                    MPOL_MF_MOVE_ALL);
>>>>>>                    if (ret) {
>>>>>> -                       if (ret > 0)
>>>>>> +                       if (ret > 0) {
>>>>>> +                               if (++errors < 100)
>>>>>> +                                       continue;
>>>>>>                                    printf("Didn't migrate %d pages\n",
>>>>>> ret);
>>>>>> -                       else
>>>>>> +                       } else {
>>>>>>                                    perror("Couldn't migrate pages");
>>>>>> +                       }
>>>>>>                            return -2;
>>>>>>                    }
>>>>>> +               /* Progress! */
>>>>>> +               errors = 0;
>>>>>>
>>>>>>                    tmp = n2;
>>>>>>                    n2 = n1;
>>>>>>
>>>>>>
>>>>>> [root@localhost mm]# ./migration
>>>>>> TAP version 13
>>>>>> 1..1
>>>>>> # Starting 1 tests from 1 test cases.
>>>>>> #  RUN           migration.shared_anon ...
>>>>>> #            OK  migration.shared_anon
>>>>>> ok 1 migration.shared_anon
>>>>>> # PASSED: 1 / 1 tests passed.
>>>>>> # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
>>>>>
>>>>>
>>>>> This does make the test pass, to my surprise, since what you are doing
>>>>> from userspace
>>>>>
>>>>> should have been done by the kernel, because it retries folio
>>>>> unmapping
>>>>> and moving
>>>>>
>>>>> NR_MAX_MIGRATE_(A)SYNC_RETRY times; I had already tested pumping up
>>>>> these
>>>>>
>>>>> macros and the original test was still failing. Now, I digged in more,
>>>>> and, if the
>>>>>
>>>>> following assertion is correct:
>>>>>
>>>>>
>>>>> Any thread having a reference on a folio will end up calling
>>>>> folio_lock()
>>>>>
>>>>
>>>> Good point. I suspect concurrent things like read/write would also be
>>>> able to trigger this (did not check, though).
>>>>
>>>>>
>>>>> then it seems to me that the retry for loop wrapped around
>>>>> migrate_folio_move(), inside
>>>>>
>>>>> migrate_pages_batch(), is useless; if migrate_folio_move() fails on
>>>>> the
>>>>> first iteration, it is
>>>>>
>>>>> going to fail for all iterations since, if I am reading the code path
>>>>> correctly, the only way it
>>>>>
>>>>> fails is when the actual refcount is not equal to expected refcount
>>>>> (in
>>>>> folio_migrate_mapping()),
>>>>>
>>>>> and there is no way that the extra refcount is going to get released
>>>>> since the migration path
>>>>>
>>>>> has the folio lock.
>>>>>
>>>>> And therefore, this begs the question: isn't it logical to assert the
>>>>> actual refcount against the
>>>>>
>>>>> expected refcount, just after we have changed the PTEs, so that if
>>>>> this
>>>>> assertion fails, we can
>>>>>
>>>>> go to the next iteration of the for loop for migrate_folio_unmap()
>>>>> inside migrate_pages_batch()
>>>>>
>>>>> by calling migrate_folio_undo_src()/dst() to restore the old state?
>>>>> I am
>>>>> trying to implement
>>>>>
>>>>> this but is not as straightforward as it seemed to me this morning.
>>>>
>>>> I agree with your assessment that migration code currently doesn't
>>>> handle the case well when some other thread does an unconditional
>>>> folio_lock(). folio_trylock() users would be handled, but that's not
>>>> what we want with FGP_LOCK I assume.
>>>>
>>>> So IIUC, your idea would be to unlock the folio in migration code and
>>>> try again their. Sounds reasonable, without looking into the details :)
>>>
>>>
>>
>> BTW, I was trying to find the spot that would do the folio_lock(), but
>> filemap_fault() does the lock_folio_maybe_drop_mmap() where we do a
>> folio_trylock().
>>
>> Where exactly is the folio_lock() on the fault path that would
>> prohibit us from making progress?
> 
> Not filemap_fault(); it enters shmem_fault() which eventually calls
> shmem_get_folio_gfp(), retrieving the folio from the pagecache, and
> calling folio_lock().

Ah, thanks!

... which raises the question of whether we should handle it similarly to
filemap_fault(): essentially, drop the reference and retry using
VM_FAULT_RETRY. Hmmmmm
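
A minimal sketch of that idea (hypothetical and untested; it assumes the
fault flags and the fault_type out-parameter that shmem_get_folio_gfp()
already uses for VM_FAULT_MAJOR are both reachable at the point where the
folio is locked, and it glosses over dropping the mmap/VMA lock and over not
handing the folio back to shmem_fault()):

	if (!folio_trylock(folio)) {
		if (fault_type && (vmf->flags & FAULT_FLAG_ALLOW_RETRY)) {
			/* drop the reference migration is waiting on */
			folio_put(folio);
			*fault_type |= VM_FAULT_RETRY;
			return 0;
		}
		folio_lock(folio);
	}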
David Hildenbrand Aug. 9, 2024, 1:26 p.m. UTC | #18
On 09.08.24 15:23, David Hildenbrand wrote:
> On 07.08.24 14:58, Dev Jain wrote:
>>
>> On 8/7/24 17:09, David Hildenbrand wrote:
>>> On 05.08.24 16:14, Dev Jain wrote:
>>>>
>>>> On 8/5/24 16:16, David Hildenbrand wrote:
>>>>> On 05.08.24 11:51, Dev Jain wrote:
>>>>>>
>>>>>> On 8/1/24 19:18, David Hildenbrand wrote:
>>>>>>> On 01.08.24 15:43, Will Deacon wrote:
>>>>>>>> On Thu, Aug 01, 2024 at 03:26:57PM +0200, David Hildenbrand wrote:
>>>>>>>>> On 01.08.24 15:13, David Hildenbrand wrote:
>>>>>>>>>>>>> To dampen the tradeoff, we could do this in shmem_fault()
>>>>>>>>>>>>> instead? But
>>>>>>>>>>>>> then, this would mean that we do this in all
>>>>>>>>>>>>>
>>>>>>>>>>>>> kinds of vma->vm_ops->fault, only when we discover another
>>>>>>>>>>>>> reference
>>>>>>>>>>>>> count race condition :) Doing this in do_fault()
>>>>>>>>>>>>>
>>>>>>>>>>>>> should solve this once and for all. In fact, do_pte_missing()
>>>>>>>>>>>>> may call
>>>>>>>>>>>>> do_anonymous_page() or do_fault(), and I just
>>>>>>>>>>>>>
>>>>>>>>>>>>> noticed that the former already checks this using
>>>>>>>>>>>>> vmf_pte_changed().
>>>>>>>>>>>>
>>>>>>>>>>>> What I am still missing is why this is (a) arm64 only; and
>>>>>>>>>>>> (b) if
>>>>>>>>>>>> this
>>>>>>>>>>>> is something we should really worry about. There are other
>>>>>>>>>>>> reasons
>>>>>>>>>>>> (e.g., speculative references) why migration could temporarily
>>>>>>>>>>>> fail,
>>>>>>>>>>>> does it happen that often that it is really something we have to
>>>>>>>>>>>> worry
>>>>>>>>>>>> about?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> (a) See discussion at [1]; I guess it passes on x86, which is
>>>>>>>>>>> quite
>>>>>>>>>>> strange since the race is clearly arch-independent.
>>>>>>>>>>
>>>>>>>>>> Yes, I think this is what we have to understand. Is the race
>>>>>>>>>> simply
>>>>>>>>>> less
>>>>>>>>>> likely to trigger on x86?
>>>>>>>>>>
>>>>>>>>>> I would assume that it would trigger on any arch.
>>>>>>>>>>
>>>>>>>>>> I just ran it on a x86 VM with 2 NUMA nodes and it also seems to
>>>>>>>>>> work here.
>>>>>>>>>>
>>>>>>>>>> Is this maybe related to deferred flushing? Such that the other
>>>>>>>>>> CPU
>>>>>>>>>> will
>>>>>>>>>> by accident just observe the !pte_none a little less likely?
>>>>>>>>>>
>>>>>>>>>> But arm64 also usually defers flushes, right? At least unless
>>>>>>>>>> ARM64_WORKAROUND_REPEAT_TLBI is around. With that we never do
>>>>>>>>>> deferred
>>>>>>>>>> flushes.
>>>>>>>>>
>>>>>>>>> Bingo!
>>>>>>>>>
>>>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>>>>> index e51ed44f8b53..ce94b810586b 100644
>>>>>>>>> --- a/mm/rmap.c
>>>>>>>>> +++ b/mm/rmap.c
>>>>>>>>> @@ -718,10 +718,7 @@ static void set_tlb_ubc_flush_pending(struct
>>>>>>>>> mm_struct
>>>>>>>>> *mm, pte_t pteval,
>>>>>>>>>        */
>>>>>>>>>       static bool should_defer_flush(struct mm_struct *mm, enum
>>>>>>>>> ttu_flags flags)
>>>>>>>>>       {
>>>>>>>>> -       if (!(flags & TTU_BATCH_FLUSH))
>>>>>>>>> -               return false;
>>>>>>>>> -
>>>>>>>>> -       return arch_tlbbatch_should_defer(mm);
>>>>>>>>> +       return false;
>>>>>>>>>       }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On x86:
>>>>>>>>>
>>>>>>>>> # ./migration
>>>>>>>>> TAP version 13
>>>>>>>>> 1..1
>>>>>>>>> # Starting 1 tests from 1 test cases.
>>>>>>>>> #  RUN           migration.shared_anon ...
>>>>>>>>> Didn't migrate 1 pages
>>>>>>>>> # migration.c:170:shared_anon:Expected migrate(ptr, self->n1,
>>>>>>>>> self->n2) (-2)
>>>>>>>>> == 0 (0)
>>>>>>>>> # shared_anon: Test terminated by assertion
>>>>>>>>> #          FAIL  migration.shared_anon
>>>>>>>>> not ok 1 migration.shared_anon
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It fails all of the time!
>>>>>>>>
>>>>>>>> Nice work! I suppose that makes sense as, with the eager TLB
>>>>>>>> invalidation, the window between the other CPU faulting and the
>>>>>>>> migration entry being written is fairly wide.
>>>>>>>>
>>>>>>>> Not sure about a fix though :/ It feels a bit overkill to add a new
>>>>>>>> invalid pte encoding just for this.
>>>>>>>
>>>>>>> Something like that might make the test happy in most cases:
>>>>>>>
>>>>>>> diff --git a/tools/testing/selftests/mm/migration.c
>>>>>>> b/tools/testing/selftests/mm/migration.c
>>>>>>> index 6908569ef406..4c18bfc13b94 100644
>>>>>>> --- a/tools/testing/selftests/mm/migration.c
>>>>>>> +++ b/tools/testing/selftests/mm/migration.c
>>>>>>> @@ -65,6 +65,7 @@ int migrate(uint64_t *ptr, int n1, int n2)
>>>>>>>             int ret, tmp;
>>>>>>>             int status = 0;
>>>>>>>             struct timespec ts1, ts2;
>>>>>>> +       int errors = 0;
>>>>>>>
>>>>>>>             if (clock_gettime(CLOCK_MONOTONIC, &ts1))
>>>>>>>                     return -1;
>>>>>>> @@ -79,12 +80,17 @@ int migrate(uint64_t *ptr, int n1, int n2)
>>>>>>>                     ret = move_pages(0, 1, (void **) &ptr, &n2,
>>>>>>> &status,
>>>>>>>                                     MPOL_MF_MOVE_ALL);
>>>>>>>                     if (ret) {
>>>>>>> -                       if (ret > 0)
>>>>>>> +                       if (ret > 0) {
>>>>>>> +                               if (++errors < 100)
>>>>>>> +                                       continue;
>>>>>>>                                     printf("Didn't migrate %d pages\n",
>>>>>>> ret);
>>>>>>> -                       else
>>>>>>> +                       } else {
>>>>>>>                                     perror("Couldn't migrate pages");
>>>>>>> +                       }
>>>>>>>                             return -2;
>>>>>>>                     }
>>>>>>> +               /* Progress! */
>>>>>>> +               errors = 0;
>>>>>>>
>>>>>>>                     tmp = n2;
>>>>>>>                     n2 = n1;
>>>>>>>
>>>>>>>
>>>>>>> [root@localhost mm]# ./migration
>>>>>>> TAP version 13
>>>>>>> 1..1
>>>>>>> # Starting 1 tests from 1 test cases.
>>>>>>> #  RUN           migration.shared_anon ...
>>>>>>> #            OK  migration.shared_anon
>>>>>>> ok 1 migration.shared_anon
>>>>>>> # PASSED: 1 / 1 tests passed.
>>>>>>> # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
>>>>>>
>>>>>>
>>>>>> This does make the test pass, to my surprise, since what you are doing
>>>>>> from userspace
>>>>>>
>>>>>> should have been done by the kernel, because it retries folio
>>>>>> unmapping
>>>>>> and moving
>>>>>>
>>>>>> NR_MAX_MIGRATE_(A)SYNC_RETRY times; I had already tested pumping up
>>>>>> these
>>>>>>
>>>>>> macros and the original test was still failing. Now, I digged in more,
>>>>>> and, if the
>>>>>>
>>>>>> following assertion is correct:
>>>>>>
>>>>>>
>>>>>> Any thread having a reference on a folio will end up calling
>>>>>> folio_lock()
>>>>>>
>>>>>
>>>>> Good point. I suspect concurrent things like read/write would also be
>>>>> able to trigger this (did not check, though).
>>>>>
>>>>>>
>>>>>> then it seems to me that the retry for loop wrapped around
>>>>>> migrate_folio_move(), inside
>>>>>>
>>>>>> migrate_pages_batch(), is useless; if migrate_folio_move() fails on
>>>>>> the
>>>>>> first iteration, it is
>>>>>>
>>>>>> going to fail for all iterations since, if I am reading the code path
>>>>>> correctly, the only way it
>>>>>>
>>>>>> fails is when the actual refcount is not equal to expected refcount
>>>>>> (in
>>>>>> folio_migrate_mapping()),
>>>>>>
>>>>>> and there is no way that the extra refcount is going to get released
>>>>>> since the migration path
>>>>>>
>>>>>> has the folio lock.
>>>>>>
>>>>>> And therefore, this begs the question: isn't it logical to assert the
>>>>>> actual refcount against the
>>>>>>
>>>>>> expected refcount, just after we have changed the PTEs, so that if
>>>>>> this
>>>>>> assertion fails, we can
>>>>>>
>>>>>> go to the next iteration of the for loop for migrate_folio_unmap()
>>>>>> inside migrate_pages_batch()
>>>>>>
>>>>>> by calling migrate_folio_undo_src()/dst() to restore the old state?
>>>>>> I am
>>>>>> trying to implement
>>>>>>
>>>>>> this but is not as straightforward as it seemed to me this morning.
>>>>>
>>>>> I agree with your assessment that migration code currently doesn't
>>>>> handle the case well when some other thread does an unconditional
>>>>> folio_lock(). folio_trylock() users would be handled, but that's not
>>>>> what we want with FGP_LOCK I assume.
>>>>>
>>>>> So IIUC, your idea would be to unlock the folio in migration code and
>>>>> try again their. Sounds reasonable, without looking into the details :)
>>>>
>>>>
>>>
>>> BTW, I was trying to find the spot that would do the folio_lock(), but
>>> filemap_fault() does the lock_folio_maybe_drop_mmap() where we do a
>>> folio_trylock().
>>>
>>> Where exactly is the folio_lock() on the fault path that would
>>> prohibit us from making progress?
>>
>> Not filemap_fault(); it enters shmem_fault() which eventually calls
>> shmem_get_folio_gfp(), retrieving the folio from the pagecache, and
>> calling folio_lock().
> 
> Ah, thanks!
> 
> ... which raises the question of whether we should handle it similarly to
> filemap_fault(): essentially, drop the reference and retry using
> VM_FAULT_RETRY. Hmmmmm

... just had another look at lock_folio_maybe_drop_mmap(): there, too, we
usually end up in __folio_lock() / __folio_lock_killable();
folio_trylock() is only used for the fast path. So no, that wouldn't
change much.
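
For context, the fallback in question looks roughly like this (paraphrased
from lock_folio_maybe_drop_mmap() in mm/filemap.c; error handling trimmed):

	if (folio_trylock(folio))
		return 1;				/* fast path */
	if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
		return 0;				/* NOWAIT callers bail out instead of sleeping */
	*fpin = maybe_unlock_mmap_for_io(vmf, *fpin);	/* drop the mmap lock before sleeping */
	if (vmf->flags & FAULT_FLAG_KILLABLE) {
		if (__folio_lock_killable(folio))
			return 0;			/* interrupted by a fatal signal */
	} else {
		__folio_lock(folio);			/* sleeping slow path */
	}
	return 1;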
diff mbox series

Patch

diff --git a/mm/rmap.c b/mm/rmap.c
index e8fc5ecb59b2..bb10b3376595 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2126,7 +2126,9 @@  static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 				 * transition on a cached TLB entry is written through
 				 * and traps if the PTE is unmapped.
 				 */
-				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
+				pteval = pte_mkinvalid(*(pvmw.pte));
+				*(pvmw.pte) = pteval;
 
 				set_tlb_ubc_flush_pending(mm, pteval, address);
 			} else {