
[v1] mm: swap: Fix race between free_swap_and_cache() and swapoff()

Message ID 20240305151349.3781428-1-ryan.roberts@arm.com (mailing list archive)
State New
Series [v1] mm: swap: Fix race between free_swap_and_cache() and swapoff()

Commit Message

Ryan Roberts March 5, 2024, 3:13 p.m. UTC
There was previously a theoretical window where swapoff() could run and
teardown a swap_info_struct while a call to free_swap_and_cache() was
running in another thread. This could cause, amongst other bad
possibilities, swap_page_trans_huge_swapped() (called by
free_swap_and_cache()) to access the freed memory for swap_map.

This is a theoretical problem and I haven't been able to provoke it from
a test case. But there has been agreement based on code review that this
is possible (see link below).

Fix it by using get_swap_device()/put_swap_device(), which will stall
swapoff(). There was an extra check in _swap_info_get() to confirm that
the swap entry was valid. This wasn't present in get_swap_device() so
I've added it. I couldn't find any existing get_swap_device() call sites
where this extra check would cause any false alarms.
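As an illustration of the pattern (a minimal sketch with a hypothetical
helper, swap_count_of(); not part of this patch):

	/*
	 * Hypothetical example, for illustration only: read one
	 * swap_map count without racing against a concurrent swapoff().
	 */
	static unsigned char swap_count_of(swp_entry_t entry)
	{
		struct swap_info_struct *si;
		unsigned char count = 0;

		si = get_swap_device(entry);	/* stalls swapoff() */
		if (si) {
			count = READ_ONCE(si->swap_map[swp_offset(entry)]);
			put_swap_device(si);	/* si may go stale after this */
		}
		return count;
	}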

Details of how to provoke one possible issue (thanks to David Hilenbrand
for deriving this):

--8<-----

__swap_entry_free() might be the last user and result in
"count == SWAP_HAS_CACHE".

swapoff->try_to_unuse() will stop as soon as si->inuse_pages==0.

So the question is: could someone reclaim the folio and turn
si->inuse_pages==0, before we completed swap_page_trans_huge_swapped()?

Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are
still referenced by swap entries.

Process 1 still references subpage 0 via swap entry.
Process 2 still references subpage 1 via swap entry.

Process 1 quits. Calls free_swap_and_cache().
-> count == SWAP_HAS_CACHE
[then, preempted in the hypervisor etc.]

Process 2 quits. Calls free_swap_and_cache().
-> count == SWAP_HAS_CACHE

Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
__try_to_reclaim_swap().

__try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->
put_swap_folio()->free_swap_slot()->swapcache_free_entries()->
swap_entry_free()->swap_range_free()->
...
WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);

What stops swapoff from succeeding after process 2 reclaimed the swap cache
but before process 1 finished its call to swap_page_trans_huge_swapped()?

--8<-----

Fixes: 7c00bafee87c ("mm/swap: free swap slots in batch")
Closes: https://lore.kernel.org/linux-mm/65a66eb9-41f8-4790-8db2-0c70ea15979f@redhat.com/
Cc: stable@vger.kernel.org
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---

Applies on top of v6.8-rc6 and mm-unstable (b38c34939fe4).

Thanks,
Ryan

 mm/swapfile.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

--
2.25.1

Comments

David Hildenbrand March 5, 2024, 3:50 p.m. UTC | #1
On 05.03.24 16:13, Ryan Roberts wrote:
> There was previously a theoretical window where swapoff() could run and
> teardown a swap_info_struct while a call to free_swap_and_cache() was
> running in another thread. This could cause, amongst other bad
> possibilities, swap_page_trans_huge_swapped() (called by
> free_swap_and_cache()) to access the freed memory for swap_map.
> 
> This is a theoretical problem and I haven't been able to provoke it from
> a test case. But there has been agreement based on code review that this
> is possible (see link below).
> 
> Fix it by using get_swap_device()/put_swap_device(), which will stall
> swapoff(). There was an extra check in _swap_info_get() to confirm that
> the swap entry was valid. This wasn't present in get_swap_device() so
> I've added it. I couldn't find any existing get_swap_device() call sites
> where this extra check would cause any false alarms.
> 
> Details of how to provoke one possible issue (thanks to David Hilenbrand
> for deriving this):

Almost

"s/Hilenbrand/Hildenbrand/" :)

> 
> --8<-----
> 
> __swap_entry_free() might be the last user and result in
> "count == SWAP_HAS_CACHE".
> 
> swapoff->try_to_unuse() will stop as soon as si->inuse_pages==0.
> 
> So the question is: could someone reclaim the folio and turn
> si->inuse_pages==0, before we completed swap_page_trans_huge_swapped()?
> 
> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are
> still referenced by swap entries.
> 
> Process 1 still references subpage 0 via swap entry.
> Process 2 still references subpage 1 via swap entry.
> 
> Process 1 quits. Calls free_swap_and_cache().
> -> count == SWAP_HAS_CACHE
> [then, preempted in the hypervisor etc.]
> 
> Process 2 quits. Calls free_swap_and_cache().
> -> count == SWAP_HAS_CACHE
> 
> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
> __try_to_reclaim_swap().
> 
> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->
> put_swap_folio()->free_swap_slot()->swapcache_free_entries()->
> swap_entry_free()->swap_range_free()->
> ...
> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
> 
> What stops swapoff from succeeding after process 2 reclaimed the swap cache
> but before process 1 finished its call to swap_page_trans_huge_swapped()?
> 
> --8<-----
> 
> Fixes: 7c00bafee87c ("mm/swap: free swap slots in batch")
> Closes: https://lore.kernel.org/linux-mm/65a66eb9-41f8-4790-8db2-0c70ea15979f@redhat.com/
> Cc: stable@vger.kernel.org
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> 
> Applies on top of v6.8-rc6 and mm-unstable (b38c34939fe4).
> 
> Thanks,
> Ryan
> 
>   mm/swapfile.c | 14 +++++++++++---
>   1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 2b3a2d85e350..f580e6abc674 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1281,7 +1281,9 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
>   	smp_rmb();
>   	offset = swp_offset(entry);
>   	if (offset >= si->max)
> -		goto put_out;
> +		goto bad_offset;
> +	if (data_race(!si->swap_map[swp_offset(entry)]))
> +		goto bad_free;
> 
>   	return si;
>   bad_nofile:
> @@ -1289,9 +1291,14 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
>   out:
>   	return NULL;
>   put_out:
> -	pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
>   	percpu_ref_put(&si->users);
>   	return NULL;
> +bad_offset:
> +	pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
> +	goto put_out;
> +bad_free:
> +	pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val);
> +	goto put_out;
>   }
> 
>   static unsigned char __swap_entry_free(struct swap_info_struct *p,
> @@ -1609,13 +1616,14 @@ int free_swap_and_cache(swp_entry_t entry)
>   	if (non_swap_entry(entry))
>   		return 1;
> 
> -	p = _swap_info_get(entry);
> +	p = get_swap_device(entry);
>   	if (p) {
>   		count = __swap_entry_free(p, entry);
>   		if (count == SWAP_HAS_CACHE &&
>   		    !swap_page_trans_huge_swapped(p, entry))
>   			__try_to_reclaim_swap(p, swp_offset(entry),
>   					      TTRS_UNMAPPED | TTRS_FULL);
> +		put_swap_device(p);
>   	}
>   	return p != NULL;
>   }
> --
> 2.25.1
> 

LGTM

Are you planning on sending a doc extension for get_swap_device()?
Ryan Roberts March 5, 2024, 4:33 p.m. UTC | #2
On 05/03/2024 15:50, David Hildenbrand wrote:
> On 05.03.24 16:13, Ryan Roberts wrote:
>> There was previously a theoretical window where swapoff() could run and
>> teardown a swap_info_struct while a call to free_swap_and_cache() was
>> running in another thread. This could cause, amongst other bad
>> possibilities, swap_page_trans_huge_swapped() (called by
>> free_swap_and_cache()) to access the freed memory for swap_map.
>>
>> This is a theoretical problem and I haven't been able to provoke it from
>> a test case. But there has been agreement based on code review that this
>> is possible (see link below).
>>
>> Fix it by using get_swap_device()/put_swap_device(), which will stall
>> swapoff(). There was an extra check in _swap_info_get() to confirm that
>> the swap entry was valid. This wasn't present in get_swap_device() so
>> I've added it. I couldn't find any existing get_swap_device() call sites
>> where this extra check would cause any false alarms.
>>
>> Details of how to provoke one possible issue (thanks to David Hilenbrand
>> for deriving this):
> 
> Almost
> 
> "s/Hilenbrand/Hildenbrand/" :)

Ahh sorry... I even explicitly checked it against your name on emails... fat
fingers...

> 
>>
>> --8<-----
>>
>> __swap_entry_free() might be the last user and result in
>> "count == SWAP_HAS_CACHE".
>>
>> swapoff->try_to_unuse() will stop as soon as si->inuse_pages==0.
>>
>> So the question is: could someone reclaim the folio and turn
>> si->inuse_pages==0, before we completed swap_page_trans_huge_swapped()?
>>
>> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are
>> still referenced by swap entries.
>>
>> Process 1 still references subpage 0 via swap entry.
>> Process 2 still references subpage 1 via swap entry.
>>
>> Process 1 quits. Calls free_swap_and_cache().
>> -> count == SWAP_HAS_CACHE
>> [then, preempted in the hypervisor etc.]
>>
>> Process 2 quits. Calls free_swap_and_cache().
>> -> count == SWAP_HAS_CACHE
>>
>> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
>> __try_to_reclaim_swap().
>>
>> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->
>> put_swap_folio()->free_swap_slot()->swapcache_free_entries()->
>> swap_entry_free()->swap_range_free()->
>> ...
>> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>>
>> What stops swapoff from succeeding after process 2 reclaimed the swap cache
>> but before process 1 finished its call to swap_page_trans_huge_swapped()?
>>
>> --8<-----
>>
>> Fixes: 7c00bafee87c ("mm/swap: free swap slots in batch")
>> Closes: https://lore.kernel.org/linux-mm/65a66eb9-41f8-4790-8db2-0c70ea15979f@redhat.com/
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>
>> Applies on top of v6.8-rc6 and mm-unstable (b38c34939fe4).
>>
>> Thanks,
>> Ryan
>>
>>   mm/swapfile.c | 14 +++++++++++---
>>   1 file changed, 11 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 2b3a2d85e350..f580e6abc674 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -1281,7 +1281,9 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
>>       smp_rmb();
>>       offset = swp_offset(entry);
>>       if (offset >= si->max)
>> -        goto put_out;
>> +        goto bad_offset;
>> +    if (data_race(!si->swap_map[swp_offset(entry)]))
>> +        goto bad_free;
>>
>>       return si;
>>   bad_nofile:
>> @@ -1289,9 +1291,14 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
>>   out:
>>       return NULL;
>>   put_out:
>> -    pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
>>       percpu_ref_put(&si->users);
>>       return NULL;
>> +bad_offset:
>> +    pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
>> +    goto put_out;
>> +bad_free:
>> +    pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val);
>> +    goto put_out;
>>   }
>>
>>   static unsigned char __swap_entry_free(struct swap_info_struct *p,
>> @@ -1609,13 +1616,14 @@ int free_swap_and_cache(swp_entry_t entry)
>>       if (non_swap_entry(entry))
>>           return 1;
>>
>> -    p = _swap_info_get(entry);
>> +    p = get_swap_device(entry);
>>       if (p) {
>>           count = __swap_entry_free(p, entry);
>>           if (count == SWAP_HAS_CACHE &&
>>               !swap_page_trans_huge_swapped(p, entry))
>>               __try_to_reclaim_swap(p, swp_offset(entry),
>>                             TTRS_UNMAPPED | TTRS_FULL);
>> +        put_swap_device(p);
>>       }
>>       return p != NULL;
>>   }
>> -- 
>> 2.25.1
>>
> 
> LGTM
> 
> Are you planning on sending a doc extension for get_swap_device()?

I saw your comment:

We should likely update the documentation of get_swap_device(), that after
decrementing the refcount, the SI might become stale and should not be touched
without a prior get_swap_device().

But when I went to make the changes, I saw the documentation already said:

...we need to enclose all swap related functions with get_swap_device() and
put_swap_device()... Notice that swapoff ... can still happen before the
percpu_ref_tryget_live() in get_swap_device() or after the percpu_ref_put() in
put_swap_device()... The caller must be prepared for that.

I thought that already covered it? I'm sure as usual, I've misunderstood your
point. Happy to respin if you have something in mind?
David Hildenbrand March 5, 2024, 10:05 p.m. UTC | #3
On 05.03.24 17:33, Ryan Roberts wrote:
> On 05/03/2024 15:50, David Hildenbrand wrote:
>> On 05.03.24 16:13, Ryan Roberts wrote:
>>> There was previously a theoretical window where swapoff() could run and
>>> teardown a swap_info_struct while a call to free_swap_and_cache() was
>>> running in another thread. This could cause, amongst other bad
>>> possibilities, swap_page_trans_huge_swapped() (called by
>>> free_swap_and_cache()) to access the freed memory for swap_map.
>>>
>>> This is a theoretical problem and I haven't been able to provoke it from
>>> a test case. But there has been agreement based on code review that this
>>> is possible (see link below).
>>>
>>> Fix it by using get_swap_device()/put_swap_device(), which will stall
>>> swapoff(). There was an extra check in _swap_info_get() to confirm that
>>> the swap entry was valid. This wasn't present in get_swap_device() so
>>> I've added it. I couldn't find any existing get_swap_device() call sites
>>> where this extra check would cause any false alarms.
>>>
>>> Details of how to provoke one possible issue (thanks to David Hilenbrand
>>> for deriving this):
>>
>> Almost
>>
>> "s/Hilenbrand/Hildenbrand/" :)
> 
> Ahh sorry... I even explicitly checked it against your name on emails... fat
> fingers...

No need to be sorry. Even your average German person would get it wrong,
because there are other (more common) variants :)

[...]

>>>
>>
>> LGTM
>>
>> Are you planning on sending a doc extension for get_swap_device()?
> 
> I saw your comment:
> 
> We should likely update the documentation of get_swap_device(), that after
> decrementing the refcount, the SI might become stale and should not be touched
> without a prior get_swap_device().
> 
> But when I went to make the changes, I saw the documentation already said:
> 
> ...we need to enclose all swap related functions with get_swap_device() and
> put_swap_device()... Notice that swapoff ... can still happen before the
> percpu_ref_tryget_live() in get_swap_device() or after the percpu_ref_put() in
> put_swap_device()... The caller must be prepared for that.
> 
> I thought that already covered it? I'm sure as usual, I've misunderstood your
> point. Happy to respin if you have something in mind?

No need to respin, we could clarify on top, if we decide it makes sense.

I was thinking about something like this, making it clearer that the PTL
discussion above does not express the corner case we discovered:

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2b3a2d85e350b..646a436581eee 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1232,6 +1232,11 @@ static unsigned char __swap_entry_free_locked(struct swap_info_struct *p,
   * with get_swap_device() and put_swap_device(), unless the swap
   * functions call get/put_swap_device() by themselves.
   *
+ * Note that when only holding the PTL, swapoff might succeed immediately
+ * after freeing a swap entry. Therefore, immediately after
+ * __swap_entry_free(), the swap info might become stale and should not
+ * be touched without a prior get_swap_device().
+ *
   * Check whether swap entry is valid in the swap device.  If so,
   * return pointer to swap_info_struct, and keep the swap entry valid
   * via preventing the swap device from being swapoff, until
Johannes Weiner March 6, 2024, 2:19 a.m. UTC | #4
Hi Ryan,

On Tue, Mar 05, 2024 at 03:13:49PM +0000, Ryan Roberts wrote:
> There was previously a theoretical window where swapoff() could run and
> teardown a swap_info_struct while a call to free_swap_and_cache() was
> running in another thread. This could cause, amongst other bad
> possibilities, swap_page_trans_huge_swapped() (called by
> free_swap_and_cache()) to access the freed memory for swap_map.
> 
> This is a theoretical problem and I haven't been able to provoke it from
> a test case. But there has been agreement based on code review that this
> is possible (see link below).
> 
> Fix it by using get_swap_device()/put_swap_device(), which will stall
> swapoff(). There was an extra check in _swap_info_get() to confirm that
> the swap entry was valid. This wasn't present in get_swap_device() so
> I've added it. I couldn't find any existing get_swap_device() call sites
> where this extra check would cause any false alarms.

Unfortunately, I found one, testing current mm-everything:

[  189.420777] get_swap_device: Unused swap offset entry 000641ae
[  189.426648] ------------[ cut here ]------------
[  189.431290] WARNING: CPU: 3 PID: 369 at mm/swapfile.c:1301 get_swap_device+0x2da/0x3f0
[  189.439242] CPU: 3 PID: 369 Comm: cachish Not tainted 6.8.0-rc5-00527-g19d98776f227-dirty #30
[  189.447791] Hardware name: Micro-Star International Co., Ltd. MS-7B98/Z390-A PRO (MS-7B98), BIOS 1.80 12/25/2019
[  189.457998] RIP: 0010:get_swap_device+0x2da/0x3f0
[  189.462721] Code: a8 03 75 2a 65 48 ff 08 e9 36 ff ff ff 4c 89 e9 48 c7 c2 40 fd 91 83 48 c7 c6 c0 f9 91 83 48 c7 c7 60 ee 91 83 e8 26 2f af ff <0f> 0b eb af 4c 8d 6b 08 48 b8 00 00 00 00 00 fc ff df 4c 89 ea 48
[  189.481497] RSP: 0000:ffffc90000cff8a8 EFLAGS: 00010282
[  189.486760] RAX: 0000000000000032 RBX: ffff8881262eee00 RCX: 0000000000000000
[  189.493909] RDX: 0000000000000001 RSI: ffffffff83a1e620 RDI: 0000000000000001
[  189.501054] RBP: 1ffff9200019ff15 R08: 0000000000000001 R09: fffff5200019fee1
[  189.508202] R10: ffffc90000cff70f R11: 0000000000000001 R12: ffffc900018d51ae
[  189.515346] R13: 00000000000641ae R14: 0000000000000000 R15: 00000000000641af
[  189.522494] FS:  00007f7120263680(0000) GS:ffff88841c380000(0000) knlGS:0000000000000000
[  189.530591] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  189.536373] CR2: 00007f6e659a2ea3 CR3: 0000000046860004 CR4: 00000000003706f0
[  189.543516] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  189.550661] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  189.557811] Call Trace:
[  189.560276]  <TASK>
[  189.562393]  ? __warn+0xc4/0x250
[  189.565647]  ? get_swap_device+0x2da/0x3f0
[  189.569761]  ? report_bug+0x348/0x440
[  189.573444]  ? handle_bug+0x6d/0x90
[  189.576951]  ? exc_invalid_op+0x13/0x40
[  189.580810]  ? asm_exc_invalid_op+0x16/0x20
[  189.585019]  ? get_swap_device+0x2da/0x3f0
[  189.589142]  ? get_swap_device+0x2da/0x3f0
[  189.593255]  ? __pfx_get_swap_device+0x10/0x10
[  189.597717]  __read_swap_cache_async+0x9f/0x630
[  189.602281]  ? __pfx___read_swap_cache_async+0x10/0x10
[  189.607439]  ? __mod_memcg_lruvec_state+0x238/0x4f0
[  189.612344]  ? __pfx_swp_swap_info+0x10/0x10
[  189.616652]  swap_cluster_readahead+0x2cd/0x510
[  189.621206]  ? __pfx_swap_cluster_readahead+0x10/0x10
[  189.626279]  ? swap_cache_get_folio+0xcd/0x360
[  189.630760]  ? __count_memcg_events+0x10a/0x370
[  189.635318]  shmem_swapin_folio+0x2f2/0xc60
[  189.639525]  ? __pfx__raw_spin_lock+0x10/0x10
[  189.643908]  ? __pte_offset_map+0x19/0x1d0
[  189.648024]  shmem_get_folio_gfp+0x307/0xe30
[  189.652323]  ? __schedule+0x9f0/0x1fe0
[  189.656096]  ? __pfx_shmem_get_folio_gfp+0x10/0x10
[  189.660923]  ? filemap_map_pages+0x999/0xe60
[  189.665211]  shmem_fault+0x1d9/0x810
[  189.668834]  ? __pfx_shmem_fault+0x10/0x10
[  189.672954]  ? __pfx_filemap_map_pages+0x10/0x10
[  189.677590]  __do_fault+0xed/0x390
[  189.681012]  __handle_mm_fault+0x1ba1/0x2e80
[  189.685297]  ? __pfx___handle_mm_fault+0x10/0x10
[  189.689933]  ? __pfx_down_read_trylock+0x10/0x10
[  189.694570]  ? __pfx_hrtimer_nanosleep+0x10/0x10
[  189.699215]  handle_mm_fault+0xe0/0x560
[  189.703074]  ? __pfx_restore_fpregs_from_fpstate+0x10/0x10
[  189.708620]  do_user_addr_fault+0x2ba/0x9d0
[  189.712828]  exc_page_fault+0x54/0x90
[  189.716508]  asm_exc_page_fault+0x22/0x30
[  189.720535] RIP: 0033:0x5640dc2d72b5
[  189.724131] Code: 98 48 ba 00 00 00 00 03 00 00 00 48 89 d6 48 2b 75 a0 ba 00 00 00 00 48 f7 f6 48 89 d1 48 89 ca 48 8b 45 a0 48 01 d0 48 01 d8 <0f> b6 00 bf e8 03 00 00 e8 1e fe ff ff 48 83 45 a8 01 48 8d 45 d0
[  189.742922] RSP: 002b:00007ffc227e3f60 EFLAGS: 00010206
[  189.748165] RAX: 00007f6e659a2ea3 RBX: 00007f6e2007e000 RCX: 0000000045924ea3
[  189.755311] RDX: 0000000045924ea3 RSI: 0000000300000000 RDI: 00007f71202586a0
[  189.762483] RBP: 00007ffc227e3fe0 R08: 00007f7120258074 R09: 00007f71202580a0
[  189.769633] R10: 0000000000019458 R11: 00000000008aa400 R12: 0000000000000000
[  189.776781] R13: 00007ffc227e4128 R14: 00007f712029d000 R15: 00005640dc2d9dd8
[  189.783928]  </TASK>
[  189.786126] ---[ end trace 0000000000000000 ]---
[  285.827888] get_swap_device: Unused swap offset entry 0018403f
[  320.699306] get_swap_device: Unused swap offset entry 000b001b
[  354.031339] get_swap_device: Unused swap offset entry 000681a9
[  364.958435] get_swap_device: Unused swap offset entry 001f4055
[  364.976235] get_swap_device: Unused swap offset entry 001f4057
[  365.530174] get_swap_device: Unused swap offset entry 000d415c
[  394.223792] get_swap_device: Unused swap offset entry 001540d0
[  394.317299] get_swap_device: Unused swap offset entry 000341d9
[  394.341727] get_swap_device: Unused swap offset entry 0006c07e
[  396.062365] get_swap_device: Unused swap offset entry 000541a4
[  396.068262] get_swap_device: Unused swap offset entry 000541a7
[  402.629551] get_swap_device: Unused swap offset entry 00294021
[  436.740622] get_swap_device: Unused swap offset entry 00334155
[  436.758527] get_swap_device: Unused swap offset entry 001b417c

swap_cluster_readahead() calls __read_swap_cache_async() on a range of
made-up swap entries around the faulting slot. The device and the
range (si->max) are valid, but the specific entry might not be in
use. __read_swap_cache_async() instead relies on swapcache_prepare()
returning -ENOENT to catch this and skip gracefully.

Confirmed that reverting the patch makes the warnings go away.
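In outline, the readahead path tolerates unused slots like this
(abbreviated sketch; readahead_one() and its callees are invented here
for illustration and are not verbatim kernel code):

	static struct folio *readahead_one(swp_entry_t entry, gfp_t gfp)
	{
		int err;

		/*
		 * swapcache_prepare() tries to set SWAP_HAS_CACHE on the
		 * slot. For a made-up entry whose swap count is zero it
		 * fails with -ENOENT, and the caller skips the slot
		 * gracefully instead of warning.
		 */
		err = swapcache_prepare(entry);
		if (err == -EEXIST)
			return lookup_swapcache(entry);	/* raced: already cached */
		if (err)
			return NULL;	/* e.g. -ENOENT: slot not in use, skip */

		return alloc_and_read(entry, gfp);	/* hypothetical helper */
	}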
Huang, Ying March 6, 2024, 2:39 a.m. UTC | #5
David Hildenbrand <david@redhat.com> writes:

> On 05.03.24 17:33, Ryan Roberts wrote:
>> On 05/03/2024 15:50, David Hildenbrand wrote:
>>> On 05.03.24 16:13, Ryan Roberts wrote:
>>>> There was previously a theoretical window where swapoff() could run and
>>>> teardown a swap_info_struct while a call to free_swap_and_cache() was
>>>> running in another thread. This could cause, amongst other bad
>>>> possibilities, swap_page_trans_huge_swapped() (called by
>>>> free_swap_and_cache()) to access the freed memory for swap_map.
>>>>
>>>> This is a theoretical problem and I haven't been able to provoke it from
>>>> a test case. But there has been agreement based on code review that this
>>>> is possible (see link below).
>>>>
>>>> Fix it by using get_swap_device()/put_swap_device(), which will stall
>>>> swapoff(). There was an extra check in _swap_info_get() to confirm that
>>>> the swap entry was valid. This wasn't present in get_swap_device() so
>>>> I've added it. I couldn't find any existing get_swap_device() call sites
>>>> where this extra check would cause any false alarms.
>>>>
>>>> Details of how to provoke one possible issue (thanks to David Hilenbrand
>>>> for deriving this):
>>>
>>> Almost
>>>
>>> "s/Hilenbrand/Hildenbrand/" :)
>> 
>> Ahh sorry... I even explicitly checked it against your name on emails... fat
>> fingers...
>
> No need to be sorry. Even your average German person would get it wrong,
> because there are other (more common) variants :)
>
> [...]
>
>>>>
>>>
>>> LGTM
>>>
>>> Are you planning on sending a doc extension for get_swap_device()?
>> 
>> I saw your comment:
>> 
>> We should likely update the documentation of get_swap_device(), that after
>> decrementing the refcount, the SI might become stale and should not be touched
>> without a prior get_swap_device().
>> 
>> But when I went to make the changes, I saw the documentation already said:
>> 
>> ...we need to enclose all swap related functions with get_swap_device() and
>> put_swap_device()... Notice that swapoff ... can still happen before the
>> percpu_ref_tryget_live() in get_swap_device() or after the percpu_ref_put() in
>> put_swap_device()... The caller must be prepared for that.
>> 
>> I thought that already covered it? I'm sure as usual, I've misunderstood your
>> point. Happy to respin if you have something in mind?
>
> No need to respin, we could clarify on top, if we decide it makes sense.
>
> I was thinking about something like this, making it clearer that the PTL
> discussion above does not express the corner case we discovered:
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 2b3a2d85e350b..646a436581eee 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1232,6 +1232,11 @@ static unsigned char __swap_entry_free_locked(struct swap_info_struct *p,
>    * with get_swap_device() and put_swap_device(), unless the swap
>    * functions call get/put_swap_device() by themselves.
>    *
> + * Note that when only holding the PTL, swapoff might succeed immediately
> + * after freeing a swap entry. Therefore, immediately after
> + * __swap_entry_free(), the swap info might become stale and should not
> + * be touched without a prior get_swap_device().
> + *
>    * Check whether swap entry is valid in the swap device.  If so,
>    * return pointer to swap_info_struct, and keep the swap entry valid
>    * via preventing the swap device from being swapoff, until

LGTM.  Thanks!

--
Best Regards,
Huang, Ying
Huang, Ying March 6, 2024, 2:52 a.m. UTC | #6
Ryan Roberts <ryan.roberts@arm.com> writes:

> There was previously a theoretical window where swapoff() could run and
> teardown a swap_info_struct while a call to free_swap_and_cache() was
> running in another thread. This could cause, amongst other bad
> possibilities, swap_page_trans_huge_swapped() (called by
> free_swap_and_cache()) to access the freed memory for swap_map.
>
> This is a theoretical problem and I haven't been able to provoke it from
> a test case. But there has been agreement based on code review that this
> is possible (see link below).
>
> Fix it by using get_swap_device()/put_swap_device(), which will stall
> swapoff(). There was an extra check in _swap_info_get() to confirm that
> the swap entry was valid. This wasn't present in get_swap_device() so
> I've added it. I couldn't find any existing get_swap_device() call sites
> where this extra check would cause any false alarms.
>
> Details of how to provoke one possible issue (thanks to David Hilenbrand
> for deriving this):
>
> --8<-----
>
> __swap_entry_free() might be the last user and result in
> "count == SWAP_HAS_CACHE".
>
> swapoff->try_to_unuse() will stop as soon as si->inuse_pages==0.
>
> So the question is: could someone reclaim the folio and turn
> si->inuse_pages==0, before we completed swap_page_trans_huge_swapped()?
>
> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are
> still referenced by swap entries.
>
> Process 1 still references subpage 0 via swap entry.
> Process 2 still references subpage 1 via swap entry.
>
> Process 1 quits. Calls free_swap_and_cache().
> -> count == SWAP_HAS_CACHE
> [then, preempted in the hypervisor etc.]
>
> Process 2 quits. Calls free_swap_and_cache().
> -> count == SWAP_HAS_CACHE
>
> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
> __try_to_reclaim_swap().
>
> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->
> put_swap_folio()->free_swap_slot()->swapcache_free_entries()->
> swap_entry_free()->swap_range_free()->
> ...
> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>
> What stops swapoff from succeeding after process 2 reclaimed the swap cache
> but before process 1 finished its call to swap_page_trans_huge_swapped()?
>
> --8<-----

I think that this can be simplified.  Even for a 4K folio, this could
happen.

CPU0                                     CPU1
----                                     ----

zap_pte_range
  free_swap_and_cache
  __swap_entry_free
  /* swap count become 0 */
                                         swapoff
                                           try_to_unuse
                                             filemap_get_folio
                                             folio_free_swap
                                             /* remove swap cache */
                                           /* free si->swap_map[] */

  swap_page_trans_huge_swapped <-- access freed si->swap_map !!!

> Fixes: 7c00bafee87c ("mm/swap: free swap slots in batch")
> Closes: https://lore.kernel.org/linux-mm/65a66eb9-41f8-4790-8db2-0c70ea15979f@redhat.com/
> Cc: stable@vger.kernel.org
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> Applies on top of v6.8-rc6 and mm-unstable (b38c34939fe4).
>
> Thanks,
> Ryan
>
>  mm/swapfile.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 2b3a2d85e350..f580e6abc674 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1281,7 +1281,9 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
>  	smp_rmb();
>  	offset = swp_offset(entry);
>  	if (offset >= si->max)
> -		goto put_out;
> +		goto bad_offset;
> +	if (data_race(!si->swap_map[swp_offset(entry)]))
> +		goto bad_free;
>
>  	return si;
>  bad_nofile:
> @@ -1289,9 +1291,14 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
>  out:
>  	return NULL;
>  put_out:
> -	pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
>  	percpu_ref_put(&si->users);
>  	return NULL;
> +bad_offset:
> +	pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
> +	goto put_out;
> +bad_free:
> +	pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val);
> +	goto put_out;
>  }

I don't think that it's a good idea to warn for bad free entries.
get_swap_device() could be called without enough lock to prevent
parallel swap operations on entry.  So, false positives are possible
there.  I think that it's good to add more checks in general, for
example, in free_swap_and_cache(), we can check more because we are sure
the swap entry will not be freed by parallel swap operations.  Just
don't put the check in general get_swap_device().  We can add another
helper to check that.

I found that there are other checks in get_swap_device() already.  I
think that we may need to revert,

commit 23b230ba8ac3 ("mm/swap: print bad swap offset entry in get_swap_device")

which introduced it, and add checks in appropriate places.
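Concretely, the direction suggested above might look roughly like this
(a sketch only, based on the patch below with the check moved into
free_swap_and_cache(); not final code):

	int free_swap_and_cache(swp_entry_t entry)
	{
		struct swap_info_struct *p;
		unsigned char count;

		if (non_swap_entry(entry))
			return 1;

		p = get_swap_device(entry);
		if (p) {
			/*
			 * Here the caller holds the PTL, so the entry
			 * cannot be freed by a parallel swap operation
			 * and an unused entry is a real bug, unlike in
			 * the racy readahead paths.
			 */
			if (WARN_ON(data_race(!p->swap_map[swp_offset(entry)]))) {
				put_swap_device(p);
				return 0;
			}
			count = __swap_entry_free(p, entry);
			if (count == SWAP_HAS_CACHE &&
			    !swap_page_trans_huge_swapped(p, entry))
				__try_to_reclaim_swap(p, swp_offset(entry),
						      TTRS_UNMAPPED | TTRS_FULL);
			put_swap_device(p);
		}
		return p != NULL;
	}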

--
Best Regards,
Huang, Ying

>  static unsigned char __swap_entry_free(struct swap_info_struct *p,
> @@ -1609,13 +1616,14 @@ int free_swap_and_cache(swp_entry_t entry)
>  	if (non_swap_entry(entry))
>  		return 1;
>
> -	p = _swap_info_get(entry);
> +	p = get_swap_device(entry);
>  	if (p) {
>  		count = __swap_entry_free(p, entry);
>  		if (count == SWAP_HAS_CACHE &&
>  		    !swap_page_trans_huge_swapped(p, entry))
>  			__try_to_reclaim_swap(p, swp_offset(entry),
>  					      TTRS_UNMAPPED | TTRS_FULL);
> +		put_swap_device(p);
>  	}
>  	return p != NULL;
>  }
> --
> 2.25.1
Ryan Roberts March 6, 2024, 8:10 a.m. UTC | #7
On 05/03/2024 22:05, David Hildenbrand wrote:
> On 05.03.24 17:33, Ryan Roberts wrote:
>> On 05/03/2024 15:50, David Hildenbrand wrote:
>>> On 05.03.24 16:13, Ryan Roberts wrote:
>>>> There was previously a theoretical window where swapoff() could run and
>>>> teardown a swap_info_struct while a call to free_swap_and_cache() was
>>>> running in another thread. This could cause, amongst other bad
>>>> possibilities, swap_page_trans_huge_swapped() (called by
>>>> free_swap_and_cache()) to access the freed memory for swap_map.
>>>>
>>>> This is a theoretical problem and I haven't been able to provoke it from
>>>> a test case. But there has been agreement based on code review that this
>>>> is possible (see link below).
>>>>
>>>> Fix it by using get_swap_device()/put_swap_device(), which will stall
>>>> swapoff(). There was an extra check in _swap_info_get() to confirm that
>>>> the swap entry was valid. This wasn't present in get_swap_device() so
>>>> I've added it. I couldn't find any existing get_swap_device() call sites
>>>> where this extra check would cause any false alarms.
>>>>
>>>> Details of how to provoke one possible issue (thanks to David Hilenbrand
>>>> for deriving this):
>>>
>>> Almost
>>>
>>> "s/Hilenbrand/Hildenbrand/" :)
>>
>> Ahh sorry... I even explicitly checked it against your name on emails... fat
>> fingers...
> 
> No need to be sorry. Even your average German person would get it wrong,
> because there are other (more common) variants :)
> 
> [...]
> 
>>>>
>>>
>>> LGTM
>>>
>>> Are you planning on sending a doc extension for get_swap_device()?
>>
>> I saw your comment:
>>
>> We should likely update the documentation of get_swap_device(), that after
>> decrementing the refcount, the SI might become stale and should not be touched
>> without a prior get_swap_device().
>>
>> But when I went to make the changes, I saw the documentation already said:
>>
>> ...we need to enclose all swap related functions with get_swap_device() and
>> put_swap_device()... Notice that swapoff ... can still happen before the
>> percpu_ref_tryget_live() in get_swap_device() or after the percpu_ref_put() in
>> put_swap_device()... The caller must be prepared for that.
>>
>> I thought that already covered it? I'm sure as usual, I've misunderstood your
>> point. Happy to respin if you have something in mind?
> 
> No need to respin, we could clarify on top, if we decide it makes sense.
> 
> I was thinking about something like this, making it clearer that the PTL
> discussion above does not express the corner case we discovered:
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 2b3a2d85e350b..646a436581eee 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1232,6 +1232,11 @@ static unsigned char __swap_entry_free_locked(struct swap_info_struct *p,
>   * with get_swap_device() and put_swap_device(), unless the swap
>   * functions call get/put_swap_device() by themselves.
>   *
> + * Note that when only holding the PTL, swapoff might succeed immediately
> + * after freeing a swap entry. Therefore, immediately after
> + * __swap_entry_free(), the swap info might become stale and should not
> + * be touched without a prior get_swap_device().
> + *

Ah yes, this is useful. I'm going to have to respin anyway, so will include it
in the next version. Thanks!


>   * Check whether swap entry is valid in the swap device.  If so,
>   * return pointer to swap_info_struct, and keep the swap entry valid
>   * via preventing the swap device from being swapoff, until
> 
>
Ryan Roberts March 6, 2024, 8:23 a.m. UTC | #8
On 06/03/2024 02:19, Johannes Weiner wrote:
> Hi Ryan,
> 
> On Tue, Mar 05, 2024 at 03:13:49PM +0000, Ryan Roberts wrote:
>> There was previously a theoretical window where swapoff() could run and
>> teardown a swap_info_struct while a call to free_swap_and_cache() was
>> running in another thread. This could cause, amongst other bad
>> possibilities, swap_page_trans_huge_swapped() (called by
>> free_swap_and_cache()) to access the freed memory for swap_map.
>>
>> This is a theoretical problem and I haven't been able to provoke it from
>> a test case. But there has been agreement based on code review that this
>> is possible (see link below).
>>
>> Fix it by using get_swap_device()/put_swap_device(), which will stall
>> swapoff(). There was an extra check in _swap_info_get() to confirm that
>> the swap entry was valid. This wasn't present in get_swap_device() so
>> I've added it. I couldn't find any existing get_swap_device() call sites
>> where this extra check would cause any false alarms.
> 
> Unfortunately, I found one, testing current mm-everything:
> 
> [  189.420777] get_swap_device: Unused swap offset entry 000641ae
> [  189.426648] ------------[ cut here ]------------
> [  189.431290] WARNING: CPU: 3 PID: 369 at mm/swapfile.c:1301 get_swap_device+0x2da/0x3f0
> [  189.439242] CPU: 3 PID: 369 Comm: cachish Not tainted 6.8.0-rc5-00527-g19d98776f227-dirty #30
> [  189.447791] Hardware name: Micro-Star International Co., Ltd. MS-7B98/Z390-A PRO (MS-7B98), BIOS 1.80 12/25/2019
> [  189.457998] RIP: 0010:get_swap_device+0x2da/0x3f0
> [  189.462721] Code: a8 03 75 2a 65 48 ff 08 e9 36 ff ff ff 4c 89 e9 48 c7 c2 40 fd 91 83 48 c7 c6 c0 f9 91 83 48 c7 c7 60 ee 91 83 e8 26 2f af ff <0f> 0b eb af 4c 8d 6b 08 48 b8 00 00 00 00 00 fc ff df 4c 89 ea 48
> [  189.481497] RSP: 0000:ffffc90000cff8a8 EFLAGS: 00010282
> [  189.486760] RAX: 0000000000000032 RBX: ffff8881262eee00 RCX: 0000000000000000
> [  189.493909] RDX: 0000000000000001 RSI: ffffffff83a1e620 RDI: 0000000000000001
> [  189.501054] RBP: 1ffff9200019ff15 R08: 0000000000000001 R09: fffff5200019fee1
> [  189.508202] R10: ffffc90000cff70f R11: 0000000000000001 R12: ffffc900018d51ae
> [  189.515346] R13: 00000000000641ae R14: 0000000000000000 R15: 00000000000641af
> [  189.522494] FS:  00007f7120263680(0000) GS:ffff88841c380000(0000) knlGS:0000000000000000
> [  189.530591] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  189.536373] CR2: 00007f6e659a2ea3 CR3: 0000000046860004 CR4: 00000000003706f0
> [  189.543516] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  189.550661] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  189.557811] Call Trace:
> [  189.560276]  <TASK>
> [  189.562393]  ? __warn+0xc4/0x250
> [  189.565647]  ? get_swap_device+0x2da/0x3f0
> [  189.569761]  ? report_bug+0x348/0x440
> [  189.573444]  ? handle_bug+0x6d/0x90
> [  189.576951]  ? exc_invalid_op+0x13/0x40
> [  189.580810]  ? asm_exc_invalid_op+0x16/0x20
> [  189.585019]  ? get_swap_device+0x2da/0x3f0
> [  189.589142]  ? get_swap_device+0x2da/0x3f0
> [  189.593255]  ? __pfx_get_swap_device+0x10/0x10
> [  189.597717]  __read_swap_cache_async+0x9f/0x630
> [  189.602281]  ? __pfx___read_swap_cache_async+0x10/0x10
> [  189.607439]  ? __mod_memcg_lruvec_state+0x238/0x4f0
> [  189.612344]  ? __pfx_swp_swap_info+0x10/0x10
> [  189.616652]  swap_cluster_readahead+0x2cd/0x510
> [  189.621206]  ? __pfx_swap_cluster_readahead+0x10/0x10
> [  189.626279]  ? swap_cache_get_folio+0xcd/0x360
> [  189.630760]  ? __count_memcg_events+0x10a/0x370
> [  189.635318]  shmem_swapin_folio+0x2f2/0xc60
> [  189.639525]  ? __pfx__raw_spin_lock+0x10/0x10
> [  189.643908]  ? __pte_offset_map+0x19/0x1d0
> [  189.648024]  shmem_get_folio_gfp+0x307/0xe30
> [  189.652323]  ? __schedule+0x9f0/0x1fe0
> [  189.656096]  ? __pfx_shmem_get_folio_gfp+0x10/0x10
> [  189.660923]  ? filemap_map_pages+0x999/0xe60
> [  189.665211]  shmem_fault+0x1d9/0x810
> [  189.668834]  ? __pfx_shmem_fault+0x10/0x10
> [  189.672954]  ? __pfx_filemap_map_pages+0x10/0x10
> [  189.677590]  __do_fault+0xed/0x390
> [  189.681012]  __handle_mm_fault+0x1ba1/0x2e80
> [  189.685297]  ? __pfx___handle_mm_fault+0x10/0x10
> [  189.689933]  ? __pfx_down_read_trylock+0x10/0x10
> [  189.694570]  ? __pfx_hrtimer_nanosleep+0x10/0x10
> [  189.699215]  handle_mm_fault+0xe0/0x560
> [  189.703074]  ? __pfx_restore_fpregs_from_fpstate+0x10/0x10
> [  189.708620]  do_user_addr_fault+0x2ba/0x9d0
> [  189.712828]  exc_page_fault+0x54/0x90
> [  189.716508]  asm_exc_page_fault+0x22/0x30
> [  189.720535] RIP: 0033:0x5640dc2d72b5
> [  189.724131] Code: 98 48 ba 00 00 00 00 03 00 00 00 48 89 d6 48 2b 75 a0 ba 00 00 00 00 48 f7 f6 48 89 d1 48 89 ca 48 8b 45 a0 48 01 d0 48 01 d8 <0f> b6 00 bf e8 03 00 00 e8 1e fe ff ff 48 83 45 a8 01 48 8d 45 d0
> [  189.742922] RSP: 002b:00007ffc227e3f60 EFLAGS: 00010206
> [  189.748165] RAX: 00007f6e659a2ea3 RBX: 00007f6e2007e000 RCX: 0000000045924ea3
> [  189.755311] RDX: 0000000045924ea3 RSI: 0000000300000000 RDI: 00007f71202586a0
> [  189.762483] RBP: 00007ffc227e3fe0 R08: 00007f7120258074 R09: 00007f71202580a0
> [  189.769633] R10: 0000000000019458 R11: 00000000008aa400 R12: 0000000000000000
> [  189.776781] R13: 00007ffc227e4128 R14: 00007f712029d000 R15: 00005640dc2d9dd8
> [  189.783928]  </TASK>
> [  189.786126] ---[ end trace 0000000000000000 ]---
> [  285.827888] get_swap_device: Unused swap offset entry 0018403f
> [  320.699306] get_swap_device: Unused swap offset entry 000b001b
> [  354.031339] get_swap_device: Unused swap offset entry 000681a9
> [  364.958435] get_swap_device: Unused swap offset entry 001f4055
> [  364.976235] get_swap_device: Unused swap offset entry 001f4057
> [  365.530174] get_swap_device: Unused swap offset entry 000d415c
> [  394.223792] get_swap_device: Unused swap offset entry 001540d0
> [  394.317299] get_swap_device: Unused swap offset entry 000341d9
> [  394.341727] get_swap_device: Unused swap offset entry 0006c07e
> [  396.062365] get_swap_device: Unused swap offset entry 000541a4
> [  396.068262] get_swap_device: Unused swap offset entry 000541a7
> [  402.629551] get_swap_device: Unused swap offset entry 00294021
> [  436.740622] get_swap_device: Unused swap offset entry 00334155
> [  436.758527] get_swap_device: Unused swap offset entry 001b417c
> 
> swap_cluster_readahead() calls __read_swap_cache_async() on a range of
> made-up swap entries around the faulting slot. The device and the
> range (si->max) are valid, but the specific entry might not be in
> use. __read_swap_cache_async() instead relies on swapcache_prepare()
> returning -ENOENT to catch this and skip gracefully.

Sorry this got in your way, but thanks for the report! Clearly I didn't review
the call sites thoroughly enough. I'll move the extra check to
free_swap_and_cache(), retest and hopefully repost later today.

> 
> Confirmed that reverting the patch makes the warnings go away.
Miaohe Lin March 6, 2024, 8:51 a.m. UTC | #9
On 2024/3/6 10:52, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> There was previously a theoretical window where swapoff() could run and
>> teardown a swap_info_struct while a call to free_swap_and_cache() was
>> running in another thread. This could cause, amongst other bad
>> possibilities, swap_page_trans_huge_swapped() (called by
>> free_swap_and_cache()) to access the freed memory for swap_map.
>>
>> This is a theoretical problem and I haven't been able to provoke it from
>> a test case. But there has been agreement based on code review that this
>> is possible (see link below).
>>
>> Fix it by using get_swap_device()/put_swap_device(), which will stall
>> swapoff(). There was an extra check in _swap_info_get() to confirm that
>> the swap entry was valid. This wasn't present in get_swap_device() so
>> I've added it. I couldn't find any existing get_swap_device() call sites
>> where this extra check would cause any false alarms.
>>
>> Details of how to provoke one possible issue (thanks to David Hilenbrand
>> for deriving this):
>>
>> --8<-----
>>
>> __swap_entry_free() might be the last user and result in
>> "count == SWAP_HAS_CACHE".
>>
>> swapoff->try_to_unuse() will stop as soon as si->inuse_pages==0.
>>
>> So the question is: could someone reclaim the folio and turn
>> si->inuse_pages==0, before we completed swap_page_trans_huge_swapped()?
>>
>> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are
>> still referenced by swap entries.
>>
>> Process 1 still references subpage 0 via swap entry.
>> Process 2 still references subpage 1 via swap entry.
>>
>> Process 1 quits. Calls free_swap_and_cache().
>> -> count == SWAP_HAS_CACHE
>> [then, preempted in the hypervisor etc.]
>>
>> Process 2 quits. Calls free_swap_and_cache().
>> -> count == SWAP_HAS_CACHE
>>
>> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
>> __try_to_reclaim_swap().
>>
>> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->
>> put_swap_folio()->free_swap_slot()->swapcache_free_entries()->
>> swap_entry_free()->swap_range_free()->
>> ...
>> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>>
>> What stops swapoff from succeeding after process 2 reclaimed the swap cache
>> but before process 1 finished its call to swap_page_trans_huge_swapped()?
>>
>> --8<-----
> 
> I think that this can be simplified.  Even for a 4K folio, this could
> happen.
> 
> CPU0                                     CPU1
> ----                                     ----
> 
> zap_pte_range
>   free_swap_and_cache
>   __swap_entry_free
>   /* swap count become 0 */
>                                          swapoff
>                                            try_to_unuse
>                                              filemap_get_folio
>                                              folio_free_swap
>                                              /* remove swap cache */
>                                            /* free si->swap_map[] */
> 
>   swap_page_trans_huge_swapped <-- access freed si->swap_map !!!

Sorry for jumping into the discussion here. IMHO, free_swap_and_cache is called with the pte lock held.
So synchronize_rcu (called by swapoff) will wait for zap_pte_range to release the pte lock. So this
theoretical problem can't happen. Or am I missing something?

CPU0                                     CPU1
----                                     ----

zap_pte_range
  pte_offset_map_lock -- spin_lock is held.
  free_swap_and_cache
   __swap_entry_free
   /* swap count become 0 */
                                         swapoff
                                           try_to_unuse
                                             filemap_get_folio
                                             folio_free_swap
                                             /* remove swap cache */
					    percpu_ref_kill(&p->users);
   swap_page_trans_huge_swapped
  pte_unmap_unlock -- spin_lock is released.
					    synchronize_rcu();  --> Will wait for pte_unmap_unlock to be called?
                                           /* free si->swap_map[] */

Thanks.
Ryan Roberts March 6, 2024, 9:03 a.m. UTC | #10
On 06/03/2024 02:52, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> There was previously a theoretical window where swapoff() could run and
>> teardown a swap_info_struct while a call to free_swap_and_cache() was
>> running in another thread. This could cause, amongst other bad
>> possibilities, swap_page_trans_huge_swapped() (called by
>> free_swap_and_cache()) to access the freed memory for swap_map.
>>
>> This is a theoretical problem and I haven't been able to provoke it from
>> a test case. But there has been agreement based on code review that this
>> is possible (see link below).
>>
>> Fix it by using get_swap_device()/put_swap_device(), which will stall
>> swapoff(). There was an extra check in _swap_info_get() to confirm that
>> the swap entry was valid. This wasn't present in get_swap_device() so
>> I've added it. I couldn't find any existing get_swap_device() call sites
>> where this extra check would cause any false alarms.
>>
>> Details of how to provoke one possible issue (thanks to David Hilenbrand
>> for deriving this):
>>
>> --8<-----
>>
>> __swap_entry_free() might be the last user and result in
>> "count == SWAP_HAS_CACHE".
>>
>> swapoff->try_to_unuse() will stop as soon as si->inuse_pages==0.
>>
>> So the question is: could someone reclaim the folio and turn
>> si->inuse_pages==0, before we completed swap_page_trans_huge_swapped()?
>>
>> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are
>> still referenced by swap entries.
>>
>> Process 1 still references subpage 0 via swap entry.
>> Process 2 still references subpage 1 via swap entry.
>>
>> Process 1 quits. Calls free_swap_and_cache().
>> -> count == SWAP_HAS_CACHE
>> [then, preempted in the hypervisor etc.]
>>
>> Process 2 quits. Calls free_swap_and_cache().
>> -> count == SWAP_HAS_CACHE
>>
>> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
>> __try_to_reclaim_swap().
>>
>> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->
>> put_swap_folio()->free_swap_slot()->swapcache_free_entries()->
>> swap_entry_free()->swap_range_free()->
>> ...
>> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>>
>> What stops swapoff from succeeding after process 2 reclaimed the swap cache
>> but before process 1 finished its call to swap_page_trans_huge_swapped()?
>>
>> --8<-----
> 
> I think that this can be simplified.  Even for a 4K folio, this could
> happen.

I'm not so sure...

> 
> CPU0                                     CPU1
> ----                                     ----
> 
> zap_pte_range
>   free_swap_and_cache
>   __swap_entry_free
>   /* swap count become 0 */
>                                          swapoff
>                                            try_to_unuse
>                                              filemap_get_folio
>                                              folio_free_swap
>                                              /* remove swap cache */
>                                            /* free si->swap_map[] */
> 
>   swap_page_trans_huge_swapped <-- access freed si->swap_map !!!

I don't think si->inuse_pages is decremented until __try_to_reclaim_swap() is
called (per David, above), which is called after swap_page_trans_huge_swapped()
has executed. So in CPU1, try_to_unuse() wouldn't see si->inuse_pages being zero
until after CPU0 has completed accessing si->swap_map, so if swapoff starts
where you have put it, it would get stalled waiting for the PTL which CPU0 has.

I'm sure there are other ways that this could be racy for 4K folios, but I don't
think this is one of them? e.g. if CPU0 does something with si after it has
decremented si->inuse_pages, then there is a problem.
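For reference, try_to_unuse() cannot return before the in-use count
reaches zero; its main loop is roughly (condensed from mm/swapfile.c,
not verbatim):

	while (READ_ONCE(si->inuse_pages) &&
	       !signal_pending(current) &&
	       (i = find_next_to_unuse(si, i)) != 0) {
		/* ... bring each in-use entry back in and unuse it ... */
	}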

> 
>> Fixes: 7c00bafee87c ("mm/swap: free swap slots in batch")
>> Closes: https://lore.kernel.org/linux-mm/65a66eb9-41f8-4790-8db2-0c70ea15979f@redhat.com/
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>
>> Applies on top of v6.8-rc6 and mm-unstable (b38c34939fe4).
>>
>> Thanks,
>> Ryan
>>
>>  mm/swapfile.c | 14 +++++++++++---
>>  1 file changed, 11 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 2b3a2d85e350..f580e6abc674 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -1281,7 +1281,9 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
>>  	smp_rmb();
>>  	offset = swp_offset(entry);
>>  	if (offset >= si->max)
>> -		goto put_out;
>> +		goto bad_offset;
>> +	if (data_race(!si->swap_map[swp_offset(entry)]))
>> +		goto bad_free;
>>
>>  	return si;
>>  bad_nofile:
>> @@ -1289,9 +1291,14 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
>>  out:
>>  	return NULL;
>>  put_out:
>> -	pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
>>  	percpu_ref_put(&si->users);
>>  	return NULL;
>> +bad_offset:
>> +	pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
>> +	goto put_out;
>> +bad_free:
>> +	pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val);
>> +	goto put_out;
>>  }
> 
> I don't think that it's a good idea to warn for bad free entries.
> get_swap_device() could be called without enough lock to prevent
> parallel swap operations on entry.  So, false positives are possible
> there.  I think that it's good to add more checks in general, for
> example, in free_swap_and_cache(), we can check more because we are sure
> the swap entry will not be freed by parallel swap operations.

Yes, agreed. Johannes also reported that he is seeing false alarms due to this.
I'm going to remove it and add it to free_swap_and_cache() as you suggest. This
also fits well for the batched version I'm working on where we want to check the
global si things once, but need to check !free for every entry in the loop
(aiming to post that this week).

> Just
> don't put the check in general get_swap_device().  We can add another
> helper to check that.
> 
> I found that there are other checks in get_swap_device() already.  I
> think that we may need to revert,
> 
> commit 23b230ba8ac3 ("mm/swap: print bad swap offset entry in get_swap_device")

Yes agree this should be reverted.

> 
> which introduced it, and add checks in appropriate places.

I'm not quite sure what the "appropriate places" are. Looking at the call sites
for get_swap_device(), it looks like they are all racy except
free_swap_and_cache(), which is called with the PTL held. So the check should only really
go there?

But... free_swap_and_cache() is called without the PTL by shmem (in 2 places -
see shmem_free_swap() wrapper). It looks like in one of those places, the folio
lock is held, so I guess this has a similar effect to holding the PTL. But the
other shmem_free_swap() call site doesn't appear to have the folio lock. Am I
missing something, or does this mean that for this path, free_swap_and_cache()
is racy and therefore we shouldn't be doing the `offset >= si->max` or
`!swap_map[offset]` checks in free_swap_and_cache() at all?

> 
> --
> Best Regards,
> Huang, Ying
> 
>>  static unsigned char __swap_entry_free(struct swap_info_struct *p,
>> @@ -1609,13 +1616,14 @@ int free_swap_and_cache(swp_entry_t entry)
>>  	if (non_swap_entry(entry))
>>  		return 1;
>>
>> -	p = _swap_info_get(entry);
>> +	p = get_swap_device(entry);
>>  	if (p) {
>>  		count = __swap_entry_free(p, entry);
>>  		if (count == SWAP_HAS_CACHE &&
>>  		    !swap_page_trans_huge_swapped(p, entry))
>>  			__try_to_reclaim_swap(p, swp_offset(entry),
>>  					      TTRS_UNMAPPED | TTRS_FULL);
>> +		put_swap_device(p);
>>  	}
>>  	return p != NULL;
>>  }
>> --
>> 2.25.1
Ryan Roberts March 6, 2024, 9:31 a.m. UTC | #11
On 06/03/2024 08:51, Miaohe Lin wrote:
> On 2024/3/6 10:52, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>
>>> There was previously a theoretical window where swapoff() could run and
>>> teardown a swap_info_struct while a call to free_swap_and_cache() was
>>> running in another thread. This could cause, amongst other bad
>>> possibilities, swap_page_trans_huge_swapped() (called by
>>> free_swap_and_cache()) to access the freed memory for swap_map.
>>>
>>> This is a theoretical problem and I haven't been able to provoke it from
>>> a test case. But there has been agreement based on code review that this
>>> is possible (see link below).
>>>
>>> Fix it by using get_swap_device()/put_swap_device(), which will stall
>>> swapoff(). There was an extra check in _swap_info_get() to confirm that
>>> the swap entry was valid. This wasn't present in get_swap_device() so
>>> I've added it. I couldn't find any existing get_swap_device() call sites
>>> where this extra check would cause any false alarms.
>>>
>>> Details of how to provoke one possible issue (thanks to David Hilenbrand
>>> for deriving this):
>>>
>>> --8<-----
>>>
>>> __swap_entry_free() might be the last user and result in
>>> "count == SWAP_HAS_CACHE".
>>>
>>> swapoff->try_to_unuse() will stop as soon as si->inuse_pages==0.
>>>
>>> So the question is: could someone reclaim the folio and turn
>>> si->inuse_pages==0, before we completed swap_page_trans_huge_swapped()?
>>>
>>> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are
>>> still referenced by swap entries.
>>>
>>> Process 1 still references subpage 0 via swap entry.
>>> Process 2 still references subpage 1 via swap entry.
>>>
>>> Process 1 quits. Calls free_swap_and_cache().
>>> -> count == SWAP_HAS_CACHE
>>> [then, preempted in the hypervisor etc.]
>>>
>>> Process 2 quits. Calls free_swap_and_cache().
>>> -> count == SWAP_HAS_CACHE
>>>
>>> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
>>> __try_to_reclaim_swap().
>>>
>>> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->
>>> put_swap_folio()->free_swap_slot()->swapcache_free_entries()->
>>> swap_entry_free()->swap_range_free()->
>>> ...
>>> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>>>
>>> What stops swapoff from succeeding after process 2 reclaimed the swap cache
>>> but before process 1 finished its call to swap_page_trans_huge_swapped()?
>>>
>>> --8<-----
>>
>> I think that this can be simplified.  Even for a 4K folio, this could
>> happen.
>>
>> CPU0                                     CPU1
>> ----                                     ----
>>
>> zap_pte_range
>>   free_swap_and_cache
>>   __swap_entry_free
>>   /* swap count become 0 */
>>                                          swapoff
>>                                            try_to_unuse
>>                                              filemap_get_folio
>>                                              folio_free_swap
>>                                              /* remove swap cache */
>>                                            /* free si->swap_map[] */
>>
>>   swap_page_trans_huge_swapped <-- access freed si->swap_map !!!
> 
> Sorry for jumping the discussion here. IMHO, free_swap_and_cache is called with pte lock held.

I don't believe it has the PTL when called by shmem.

> So synchronize_rcu (called by swapoff) will wait zap_pte_range to release the pte lock. So this
> theoretical problem can't happen. Or am I miss something?

For Huang Ying's example, I agree this can't happen because try_to_unuse() will
be waiting for the PTL (see the reply I just sent).

> 
> CPU0                                     CPU1
> ----                                     ----
> 
> zap_pte_range
>   pte_offset_map_lock -- spin_lock is held.
>   free_swap_and_cache
>    __swap_entry_free
>    /* swap count become 0 */
>                                          swapoff
>                                            try_to_unuse
>                                              filemap_get_folio
>                                              folio_free_swap
>                                              /* remove swap cache */
> 					    percpu_ref_kill(&p->users);
>    swap_page_trans_huge_swapped
>   pte_unmap_unlock -- spin_lock is released.
> 					    synchronize_rcu();  --> Will wait pte_unmap_unlock to be called?

Perhaps you can educate me here; I thought that synchronize_rcu() will only wait
for RCU critical sections to complete. The PTL is a spin lock, so why would
synchronize_rcu() wait for the PTL to become unlocked?


>                                            /* free si->swap_map[] */
> 
> Thanks.
> 
>
Miaohe Lin March 7, 2024, 2:38 a.m. UTC | #12
On 2024/3/6 17:31, Ryan Roberts wrote:
> On 06/03/2024 08:51, Miaohe Lin wrote:
>> On 2024/3/6 10:52, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> There was previously a theoretical window where swapoff() could run and
>>>> teardown a swap_info_struct while a call to free_swap_and_cache() was
>>>> running in another thread. This could cause, amongst other bad
>>>> possibilities, swap_page_trans_huge_swapped() (called by
>>>> free_swap_and_cache()) to access the freed memory for swap_map.
>>>>
>>>> This is a theoretical problem and I haven't been able to provoke it from
>>>> a test case. But there has been agreement based on code review that this
>>>> is possible (see link below).
>>>>
>>>> Fix it by using get_swap_device()/put_swap_device(), which will stall
>>>> swapoff(). There was an extra check in _swap_info_get() to confirm that
>>>> the swap entry was valid. This wasn't present in get_swap_device() so
>>>> I've added it. I couldn't find any existing get_swap_device() call sites
>>>> where this extra check would cause any false alarms.
>>>>
>>>> Details of how to provoke one possible issue (thanks to David Hilenbrand
>>>> for deriving this):
>>>>
>>>> --8<-----
>>>>
>>>> __swap_entry_free() might be the last user and result in
>>>> "count == SWAP_HAS_CACHE".
>>>>
>>>> swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0.
>>>>
>>>> So the question is: could someone reclaim the folio and turn
>>>> si->inuse_pages==0, before we completed swap_page_trans_huge_swapped().
>>>>
>>>> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are
>>>> still references by swap entries.
>>>>
>>>> Process 1 still references subpage 0 via swap entry.
>>>> Process 2 still references subpage 1 via swap entry.
>>>>
>>>> Process 1 quits. Calls free_swap_and_cache().
>>>> -> count == SWAP_HAS_CACHE
>>>> [then, preempted in the hypervisor etc.]
>>>>
>>>> Process 2 quits. Calls free_swap_and_cache().
>>>> -> count == SWAP_HAS_CACHE
>>>>
>>>> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
>>>> __try_to_reclaim_swap().
>>>>
>>>> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->
>>>> put_swap_folio()->free_swap_slot()->swapcache_free_entries()->
>>>> swap_entry_free()->swap_range_free()->
>>>> ...
>>>> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>>>>
>>>> What stops swapoff to succeed after process 2 reclaimed the swap cache
>>>> but before process1 finished its call to swap_page_trans_huge_swapped()?
>>>>
>>>> --8<-----
>>>
>>> I think that this can be simplified.  Even for a 4K folio, this could
>>> happen.
>>>
>>> CPU0                                     CPU1
>>> ----                                     ----
>>>
>>> zap_pte_range
>>>   free_swap_and_cache
>>>   __swap_entry_free
>>>   /* swap count become 0 */
>>>                                          swapoff
>>>                                            try_to_unuse
>>>                                              filemap_get_folio
>>>                                              folio_free_swap
>>>                                              /* remove swap cache */
>>>                                            /* free si->swap_map[] */
>>>
>>>   swap_page_trans_huge_swapped <-- access freed si->swap_map !!!
>>
>> Sorry for jumping the discussion here. IMHO, free_swap_and_cache is called with pte lock held.
> 
> I don't beleive it has the PTL when called by shmem.

In the case of shmem, folio_lock is used to guard against the race.

> 
>> So synchronize_rcu (called by swapoff) will wait zap_pte_range to release the pte lock. So this
>> theoretical problem can't happen. Or am I miss something?
> 
> For Huang Ying's example, I agree this can't happen because try_to_unuse() will
> be waiting for the PTL (see the reply I just sent).

Do you mean the below message?
"
I don't think si->inuse_pages is decremented until __try_to_reclaim_swap() is
called (per David, above), which is called after swap_page_trans_huge_swapped()
has executed. So in CPU1, try_to_unuse() wouldn't see si->inuse_pages being zero
until after CPU0 has completed accessing si->swap_map, so if swapoff starts
where you have put it, it would get stalled waiting for the PTL which CPU0 has.
"

I agree try_to_unuse() will wait for si->inuse_pages to become zero. But why would it wait
for the PTL? It seems the PTL is not used to protect si->inuse_pages. Or am I missing something?

> 
>>
>> CPU0                                     CPU1
>> ----                                     ----
>>
>> zap_pte_range
>>   pte_offset_map_lock -- spin_lock is held.
>>   free_swap_and_cache
>>    __swap_entry_free
>>    /* swap count become 0 */
>>                                          swapoff
>>                                            try_to_unuse
>>                                              filemap_get_folio
>>                                              folio_free_swap
>>                                              /* remove swap cache */
>> 					    percpu_ref_kill(&p->users);
>>    swap_page_trans_huge_swapped
>>   pte_unmap_unlock -- spin_lock is released.
>> 					    synchronize_rcu();  --> Will wait pte_unmap_unlock to be called?
> 
> Perhaps you can educate me here; I thought that synchronize_rcu() will only wait
> for RCU critical sections to complete. The PTL is a spin lock, so why would
> synchronize_rcu() wait for the PTL to become unlocked?

I assume the PTL always disables preemption, which holds off a grace period until the PTL is released.
But this might be fragile and I'm not really sure; I might be wrong.

Thanks.
> 
> 
>>                                            /* free si->swap_map[] */
>>
>> Thanks.
>>
>>
> 
> .
>
Huang, Ying March 7, 2024, 4:37 a.m. UTC | #13
Ryan Roberts <ryan.roberts@arm.com> writes:

> On 06/03/2024 08:51, Miaohe Lin wrote:
>> On 2024/3/6 10:52, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> There was previously a theoretical window where swapoff() could run and
>>>> teardown a swap_info_struct while a call to free_swap_and_cache() was
>>>> running in another thread. This could cause, amongst other bad
>>>> possibilities, swap_page_trans_huge_swapped() (called by
>>>> free_swap_and_cache()) to access the freed memory for swap_map.
>>>>
>>>> This is a theoretical problem and I haven't been able to provoke it from
>>>> a test case. But there has been agreement based on code review that this
>>>> is possible (see link below).
>>>>
>>>> Fix it by using get_swap_device()/put_swap_device(), which will stall
>>>> swapoff(). There was an extra check in _swap_info_get() to confirm that
>>>> the swap entry was valid. This wasn't present in get_swap_device() so
>>>> I've added it. I couldn't find any existing get_swap_device() call sites
>>>> where this extra check would cause any false alarms.
>>>>
>>>> Details of how to provoke one possible issue (thanks to David Hilenbrand
>>>> for deriving this):
>>>>
>>>> --8<-----
>>>>
>>>> __swap_entry_free() might be the last user and result in
>>>> "count == SWAP_HAS_CACHE".
>>>>
>>>> swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0.
>>>>
>>>> So the question is: could someone reclaim the folio and turn
>>>> si->inuse_pages==0, before we completed swap_page_trans_huge_swapped().
>>>>
>>>> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are
>>>> still references by swap entries.
>>>>
>>>> Process 1 still references subpage 0 via swap entry.
>>>> Process 2 still references subpage 1 via swap entry.
>>>>
>>>> Process 1 quits. Calls free_swap_and_cache().
>>>> -> count == SWAP_HAS_CACHE
>>>> [then, preempted in the hypervisor etc.]
>>>>
>>>> Process 2 quits. Calls free_swap_and_cache().
>>>> -> count == SWAP_HAS_CACHE
>>>>
>>>> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
>>>> __try_to_reclaim_swap().
>>>>
>>>> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->
>>>> put_swap_folio()->free_swap_slot()->swapcache_free_entries()->
>>>> swap_entry_free()->swap_range_free()->
>>>> ...
>>>> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>>>>
>>>> What stops swapoff to succeed after process 2 reclaimed the swap cache
>>>> but before process1 finished its call to swap_page_trans_huge_swapped()?
>>>>
>>>> --8<-----
>>>
>>> I think that this can be simplified.  Even for a 4K folio, this could
>>> happen.
>>>
>>> CPU0                                     CPU1
>>> ----                                     ----
>>>
>>> zap_pte_range
>>>   free_swap_and_cache
>>>   __swap_entry_free
>>>   /* swap count become 0 */
>>>                                          swapoff
>>>                                            try_to_unuse
>>>                                              filemap_get_folio
>>>                                              folio_free_swap
>>>                                              /* remove swap cache */
>>>                                            /* free si->swap_map[] */
>>>
>>>   swap_page_trans_huge_swapped <-- access freed si->swap_map !!!
>> 
>> Sorry for jumping the discussion here. IMHO, free_swap_and_cache is called with pte lock held.
>
> I don't beleive it has the PTL when called by shmem.

Yes, we don't hold the PTL there.

After checking the code again, I think that there may be a race condition
as above without the PTL. But I may have missed something, again.

>> So synchronize_rcu (called by swapoff) will wait zap_pte_range to release the pte lock. So this
>> theoretical problem can't happen. Or am I miss something?
>
> For Huang Ying's example, I agree this can't happen because try_to_unuse() will
> be waiting for the PTL (see the reply I just sent).
>
>> 
>> CPU0                                     CPU1
>> ----                                     ----
>> 
>> zap_pte_range
>>   pte_offset_map_lock -- spin_lock is held.
>>   free_swap_and_cache
>>    __swap_entry_free
>>    /* swap count become 0 */
>>                                          swapoff
>>                                            try_to_unuse
>>                                              filemap_get_folio
>>                                              folio_free_swap
>>                                              /* remove swap cache */
>> 					    percpu_ref_kill(&p->users);
>>    swap_page_trans_huge_swapped
>>   pte_unmap_unlock -- spin_lock is released.
>> 					    synchronize_rcu();  --> Will wait pte_unmap_unlock to be called?
>
> Perhaps you can educate me here; I thought that synchronize_rcu() will only wait
> for RCU critical sections to complete. The PTL is a spin lock, so why would
> synchronize_rcu() wait for the PTL to become unlocked?

Please take a look at the following link,

https://www.kernel.org/doc/html/next/RCU/whatisRCU.html#rcu-read-lock

"
Note that anything that disables bottom halves, preemption, or
interrupts also enters an RCU read-side critical section. Acquiring a
spinlock also enters an RCU read-side critical sections, even for
spinlocks that do not disable preemption, as is the case in kernels
built with CONFIG_PREEMPT_RT=y. Sleeplocks do not enter RCU read-side
critical sections.
"

--
Best Regards,
Huang, Ying

>
>>                                            /* free si->swap_map[] */
>> 
>> Thanks.
>> 
>>
Huang, Ying March 7, 2024, 4:39 a.m. UTC | #14
Miaohe Lin <linmiaohe@huawei.com> writes:

> On 2024/3/6 10:52, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> There was previously a theoretical window where swapoff() could run and
>>> teardown a swap_info_struct while a call to free_swap_and_cache() was
>>> running in another thread. This could cause, amongst other bad
>>> possibilities, swap_page_trans_huge_swapped() (called by
>>> free_swap_and_cache()) to access the freed memory for swap_map.
>>>
>>> This is a theoretical problem and I haven't been able to provoke it from
>>> a test case. But there has been agreement based on code review that this
>>> is possible (see link below).
>>>
>>> Fix it by using get_swap_device()/put_swap_device(), which will stall
>>> swapoff(). There was an extra check in _swap_info_get() to confirm that
>>> the swap entry was valid. This wasn't present in get_swap_device() so
>>> I've added it. I couldn't find any existing get_swap_device() call sites
>>> where this extra check would cause any false alarms.
>>>
>>> Details of how to provoke one possible issue (thanks to David Hilenbrand
>>> for deriving this):
>>>
>>> --8<-----
>>>
>>> __swap_entry_free() might be the last user and result in
>>> "count == SWAP_HAS_CACHE".
>>>
>>> swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0.
>>>
>>> So the question is: could someone reclaim the folio and turn
>>> si->inuse_pages==0, before we completed swap_page_trans_huge_swapped().
>>>
>>> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are
>>> still references by swap entries.
>>>
>>> Process 1 still references subpage 0 via swap entry.
>>> Process 2 still references subpage 1 via swap entry.
>>>
>>> Process 1 quits. Calls free_swap_and_cache().
>>> -> count == SWAP_HAS_CACHE
>>> [then, preempted in the hypervisor etc.]
>>>
>>> Process 2 quits. Calls free_swap_and_cache().
>>> -> count == SWAP_HAS_CACHE
>>>
>>> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
>>> __try_to_reclaim_swap().
>>>
>>> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->
>>> put_swap_folio()->free_swap_slot()->swapcache_free_entries()->
>>> swap_entry_free()->swap_range_free()->
>>> ...
>>> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>>>
>>> What stops swapoff to succeed after process 2 reclaimed the swap cache
>>> but before process1 finished its call to swap_page_trans_huge_swapped()?
>>>
>>> --8<-----
>> 
>> I think that this can be simplified.  Even for a 4K folio, this could
>> happen.
>> 
>> CPU0                                     CPU1
>> ----                                     ----
>> 
>> zap_pte_range
>>   free_swap_and_cache
>>   __swap_entry_free
>>   /* swap count become 0 */
>>                                          swapoff
>>                                            try_to_unuse
>>                                              filemap_get_folio
>>                                              folio_free_swap
>>                                              /* remove swap cache */
>>                                            /* free si->swap_map[] */
>> 
>>   swap_page_trans_huge_swapped <-- access freed si->swap_map !!!
>
> Sorry for jumping the discussion here. IMHO, free_swap_and_cache is called with pte lock held.
> So synchronize_rcu (called by swapoff) will wait zap_pte_range to release the pte lock. So this
> theoretical problem can't happen. Or am I miss something?
>
> CPU0                                     CPU1
> ----                                     ----
>
> zap_pte_range
>   pte_offset_map_lock -- spin_lock is held.
>   free_swap_and_cache
>    __swap_entry_free
>    /* swap count become 0 */
>                                          swapoff
>                                            try_to_unuse
>                                              filemap_get_folio
>                                              folio_free_swap
>                                              /* remove swap cache */
> 					    percpu_ref_kill(&p->users);
>    swap_page_trans_huge_swapped
>   pte_unmap_unlock -- spin_lock is released.
> 					    synchronize_rcu();  --> Will wait pte_unmap_unlock to be called?
>                                            /* free si->swap_map[] */

I think that you are right.  We are safe if PTL is held.  Thanks a lot
for pointing this out!

--
Best Regards,
Huang, Ying
Huang, Ying March 7, 2024, 5:48 a.m. UTC | #15
Ryan Roberts <ryan.roberts@arm.com> writes:

> On 06/03/2024 02:52, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> There was previously a theoretical window where swapoff() could run and
>>> teardown a swap_info_struct while a call to free_swap_and_cache() was
>>> running in another thread. This could cause, amongst other bad
>>> possibilities, swap_page_trans_huge_swapped() (called by
>>> free_swap_and_cache()) to access the freed memory for swap_map.
>>>
>>> This is a theoretical problem and I haven't been able to provoke it from
>>> a test case. But there has been agreement based on code review that this
>>> is possible (see link below).
>>>
>>> Fix it by using get_swap_device()/put_swap_device(), which will stall
>>> swapoff(). There was an extra check in _swap_info_get() to confirm that
>>> the swap entry was valid. This wasn't present in get_swap_device() so
>>> I've added it. I couldn't find any existing get_swap_device() call sites
>>> where this extra check would cause any false alarms.
>>>
>>> Details of how to provoke one possible issue (thanks to David Hilenbrand
>>> for deriving this):
>>>
>>> --8<-----
>>>
>>> __swap_entry_free() might be the last user and result in
>>> "count == SWAP_HAS_CACHE".
>>>
>>> swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0.
>>>
>>> So the question is: could someone reclaim the folio and turn
>>> si->inuse_pages==0, before we completed swap_page_trans_huge_swapped().
>>>
>>> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are
>>> still references by swap entries.
>>>
>>> Process 1 still references subpage 0 via swap entry.
>>> Process 2 still references subpage 1 via swap entry.
>>>
>>> Process 1 quits. Calls free_swap_and_cache().
>>> -> count == SWAP_HAS_CACHE
>>> [then, preempted in the hypervisor etc.]
>>>
>>> Process 2 quits. Calls free_swap_and_cache().
>>> -> count == SWAP_HAS_CACHE
>>>
>>> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
>>> __try_to_reclaim_swap().
>>>
>>> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->
>>> put_swap_folio()->free_swap_slot()->swapcache_free_entries()->
>>> swap_entry_free()->swap_range_free()->
>>> ...
>>> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>>>
>>> What stops swapoff to succeed after process 2 reclaimed the swap cache
>>> but before process1 finished its call to swap_page_trans_huge_swapped()?
>>>
>>> --8<-----
>> 
>> I think that this can be simplified.  Even for a 4K folio, this could
>> happen.
>
> I'm not so sure...
>
>> 
>> CPU0                                     CPU1
>> ----                                     ----
>> 
>> zap_pte_range
>>   free_swap_and_cache
>>   __swap_entry_free
>>   /* swap count become 0 */
>>                                          swapoff
>>                                            try_to_unuse
>>                                              filemap_get_folio
>>                                              folio_free_swap
>>                                              /* remove swap cache */
>>                                            /* free si->swap_map[] */
>> 
>>   swap_page_trans_huge_swapped <-- access freed si->swap_map !!!
>
> I don't think si->inuse_pages is decremented until __try_to_reclaim_swap() is
> called (per David, above), which is called after swap_page_trans_huge_swapped()
> has executed. So in CPU1, try_to_unuse() wouldn't see si->inuse_pages being zero

try_to_unuse() can decrease si->inuse_pages by itself via the following
loop,

	while (READ_ONCE(si->inuse_pages) &&
	       !signal_pending(current) &&
	       (i = find_next_to_unuse(si, i)) != 0) {

		entry = swp_entry(type, i);
		folio = filemap_get_folio(swap_address_space(entry), i);
		if (IS_ERR(folio))
			continue;

		/*
		 * It is conceivable that a racing task removed this folio from
		 * swap cache just before we acquired the page lock. The folio
		 * might even be back in swap cache on another swap area. But
		 * that is okay, folio_free_swap() only removes stale folios.
		 */
		folio_lock(folio);
		folio_wait_writeback(folio);
		folio_free_swap(folio);
		folio_unlock(folio);
		folio_put(folio);
	}

Then it can proceed to swapoff. But as Miaohe pointed out, the PTL
prevents swapoff from freeing the data structures later, via synchronize_rcu().
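
For reference, the stabilization protocol involved looks roughly like the
sketch below (paraphrased from my reading of mm/swapfile.c; the two wrapper
names are made up and all error handling is elided):

#include <linux/completion.h>
#include <linux/percpu-refcount.h>
#include <linux/rcupdate.h>
#include <linux/swap.h>

/* Reader side: pin the device so si->swap_map stays allocated. */
static struct swap_info_struct *pin_swap_device(swp_entry_t entry)
{
        struct swap_info_struct *si = swp_swap_info(entry);

        if (!si || !percpu_ref_tryget_live(&si->users))
                return NULL;            /* swapoff already in flight */
        smp_rmb();                      /* pairs with enable_swap_info() */
        return si;                      /* dropped via percpu_ref_put() */
}

/* swapoff side: kill the ref, then wait for every reader to drain. */
static void swapoff_teardown_sketch(struct swap_info_struct *si)
{
        percpu_ref_kill(&si->users);    /* new tryget_live() calls fail */
        synchronize_rcu();              /* also waits out PTL holders */
        wait_for_completion(&si->comp); /* existing pins are released */
        /* only now is it safe to free si->swap_map and friends */
}

Once percpu_ref_tryget_live() succeeds, swapoff() cannot get past
wait_for_completion() until the matching percpu_ref_put(), which is
exactly the stall the patch relies on.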

> until after CPU0 has completed accessing si->swap_map, so if swapoff starts
> where you have put it, it would get stalled waiting for the PTL which CPU0 has.
>
> I'm sure there are other ways that this could be racy for 4K folios, but I don't
> think this is one of them? e.g. if CPU0 does something with si after it has
> decremented si->inuse_pages, then there is a problem.
>
>> 
>>> Fixes: 7c00bafee87c ("mm/swap: free swap slots in batch")
>>> Closes: https://lore.kernel.org/linux-mm/65a66eb9-41f8-4790-8db2-0c70ea15979f@redhat.com/
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>
>>> Applies on top of v6.8-rc6 and mm-unstable (b38c34939fe4).
>>>
>>> Thanks,
>>> Ryan
>>>
>>>  mm/swapfile.c | 14 +++++++++++---
>>>  1 file changed, 11 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>> index 2b3a2d85e350..f580e6abc674 100644
>>> --- a/mm/swapfile.c
>>> +++ b/mm/swapfile.c
>>> @@ -1281,7 +1281,9 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
>>>  	smp_rmb();
>>>  	offset = swp_offset(entry);
>>>  	if (offset >= si->max)
>>> -		goto put_out;
>>> +		goto bad_offset;
>>> +	if (data_race(!si->swap_map[swp_offset(entry)]))
>>> +		goto bad_free;
>>>
>>>  	return si;
>>>  bad_nofile:
>>> @@ -1289,9 +1291,14 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
>>>  out:
>>>  	return NULL;
>>>  put_out:
>>> -	pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
>>>  	percpu_ref_put(&si->users);
>>>  	return NULL;
>>> +bad_offset:
>>> +	pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
>>> +	goto put_out;
>>> +bad_free:
>>> +	pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val);
>>> +	goto put_out;
>>>  }
>> 
>> I don't think that it's a good idea to warn for bad free entries.
>> get_swap_device() could be called without enough lock to prevent
>> parallel swap operations on entry.  So, false positive is possible
>> there.  I think that it's good to add more checks in general, for
>> example, in free_swap_and_cache(), we can check more because we are sure
>> the swap entry will not be freed by parallel swap operations.
>
> Yes, agreed. Johannes also reported that he is seeing false alarms due to this.
> I'm going to remove it and add it to free_swap_and_cache() as you suggest. This
> also fits well for the batched version I'm working on where we want to check the
> global si things once, but need to check !free for every entry in the loop
> (aiming to post that this week).
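
Something like the shape below, perhaps (purely illustrative; the actual
follow-up may well differ):

int free_swap_and_cache(swp_entry_t entry)
{
        struct swap_info_struct *p;
        unsigned char count;

        if (non_swap_entry(entry))
                return 1;

        p = get_swap_device(entry);
        if (p) {
                /*
                 * The caller holds the PTL (or atomically cleared the
                 * shmem mapping), so a zero count here is a real bug
                 * rather than a racy false alarm.
                 */
                if (WARN_ON(data_race(!p->swap_map[swp_offset(entry)]))) {
                        put_swap_device(p);
                        return 0;
                }
                count = __swap_entry_free(p, entry);
                if (count == SWAP_HAS_CACHE &&
                    !swap_page_trans_huge_swapped(p, entry))
                        __try_to_reclaim_swap(p, swp_offset(entry),
                                              TTRS_UNMAPPED | TTRS_FULL);
                put_swap_device(p);
        }
        return p != NULL;
}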
>
>>   Just
>> don't put the check in general get_swap_device().  We can add another
>> helper to check that.
>> 
>> I found that there are other checks in get_swap_device() already.  I
>> think that we may need to revert,
>> 
>> commit 23b230ba8ac3 ("mm/swap: print bad swap offset entry in get_swap_device")
>
> Yes agree this should be reverted.
>
>> 
>> which introduces it.  And add check in appropriate places.
>
> I'm not quite sure what the "appropriate places" are. Looking at the call sites
> for get_swap_device(), it looks like they are all racy except
> free_swap_and_cache() which is called with the PTL. So check should only really
> go there?
>
> But... free_swap_and_cache() is called without the PTL by shmem (in 2 places -
> see shmem_free_swap() wrapper). It looks like in one of those places, the folio
> lock is held, so I guess this has a similar effect to holding the PTL. But the
> other shmem_free_swap() call site doesn't appear to have the folio lock. Am I
> missing something, or does this mean that for this path, free_swap_and_cache()
> is racy and therefore we shouldn't be doing either the `offset >= si->max` or
> the `!swap_map[offset]` in free_swap_and_cache() either?

In shmem_free_swap(), xa_cmpxchg_irq() is used to get the swap entry and
clear the original mapping. So, no other code path can free the swap
count held by the mapping. But after we drop that swap count, it appears
that it's not safe.
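
That pattern looks roughly like this (paraphrased from my reading of
mm/shmem.c; simplified, return-value details elided):

static long shmem_free_swap(struct address_space *mapping,
                            pgoff_t index, void *radswap)
{
        void *old;

        /* Atomically steal the entry: only one caller can win. */
        old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0);
        if (old != radswap)
                return 0;
        /*
         * We won, so nobody else frees this swap count, but note
         * that nothing here pins the swap device while it runs.
         */
        free_swap_and_cache(radix_to_swp_entry(radswap));
        return 1;
}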

--
Best Regards,
Huang, Ying

>> 
>> --
>> Best Regards,
>> Huang, Ying
>> 
>>>  static unsigned char __swap_entry_free(struct swap_info_struct *p,
>>> @@ -1609,13 +1616,14 @@ int free_swap_and_cache(swp_entry_t entry)
>>>  	if (non_swap_entry(entry))
>>>  		return 1;
>>>
>>> -	p = _swap_info_get(entry);
>>> +	p = get_swap_device(entry);
>>>  	if (p) {
>>>  		count = __swap_entry_free(p, entry);
>>>  		if (count == SWAP_HAS_CACHE &&
>>>  		    !swap_page_trans_huge_swapped(p, entry))
>>>  			__try_to_reclaim_swap(p, swp_offset(entry),
>>>  					      TTRS_UNMAPPED | TTRS_FULL);
>>> +		put_swap_device(p);
>>>  	}
>>>  	return p != NULL;
>>>  }
>>> --
>>> 2.25.1
Huang, Ying March 7, 2024, 5:56 a.m. UTC | #16
Miaohe Lin <linmiaohe@huawei.com> writes:

> On 2024/3/6 17:31, Ryan Roberts wrote:
>> On 06/03/2024 08:51, Miaohe Lin wrote:
>>> On 2024/3/6 10:52, Huang, Ying wrote:
>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>
>>>>> There was previously a theoretical window where swapoff() could run and
>>>>> teardown a swap_info_struct while a call to free_swap_and_cache() was
>>>>> running in another thread. This could cause, amongst other bad
>>>>> possibilities, swap_page_trans_huge_swapped() (called by
>>>>> free_swap_and_cache()) to access the freed memory for swap_map.
>>>>>
>>>>> This is a theoretical problem and I haven't been able to provoke it from
>>>>> a test case. But there has been agreement based on code review that this
>>>>> is possible (see link below).
>>>>>
>>>>> Fix it by using get_swap_device()/put_swap_device(), which will stall
>>>>> swapoff(). There was an extra check in _swap_info_get() to confirm that
>>>>> the swap entry was valid. This wasn't present in get_swap_device() so
>>>>> I've added it. I couldn't find any existing get_swap_device() call sites
>>>>> where this extra check would cause any false alarms.
>>>>>
>>>>> Details of how to provoke one possible issue (thanks to David Hilenbrand
>>>>> for deriving this):
>>>>>
>>>>> --8<-----
>>>>>
>>>>> __swap_entry_free() might be the last user and result in
>>>>> "count == SWAP_HAS_CACHE".
>>>>>
>>>>> swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0.
>>>>>
>>>>> So the question is: could someone reclaim the folio and turn
>>>>> si->inuse_pages==0, before we completed swap_page_trans_huge_swapped().
>>>>>
>>>>> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are
>>>>> still references by swap entries.
>>>>>
>>>>> Process 1 still references subpage 0 via swap entry.
>>>>> Process 2 still references subpage 1 via swap entry.
>>>>>
>>>>> Process 1 quits. Calls free_swap_and_cache().
>>>>> -> count == SWAP_HAS_CACHE
>>>>> [then, preempted in the hypervisor etc.]
>>>>>
>>>>> Process 2 quits. Calls free_swap_and_cache().
>>>>> -> count == SWAP_HAS_CACHE
>>>>>
>>>>> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
>>>>> __try_to_reclaim_swap().
>>>>>
>>>>> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->
>>>>> put_swap_folio()->free_swap_slot()->swapcache_free_entries()->
>>>>> swap_entry_free()->swap_range_free()->
>>>>> ...
>>>>> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>>>>>
>>>>> What stops swapoff to succeed after process 2 reclaimed the swap cache
>>>>> but before process1 finished its call to swap_page_trans_huge_swapped()?
>>>>>
>>>>> --8<-----
>>>>
>>>> I think that this can be simplified.  Even for a 4K folio, this could
>>>> happen.
>>>>
>>>> CPU0                                     CPU1
>>>> ----                                     ----
>>>>
>>>> zap_pte_range
>>>>   free_swap_and_cache
>>>>   __swap_entry_free
>>>>   /* swap count become 0 */
>>>>                                          swapoff
>>>>                                            try_to_unuse
>>>>                                              filemap_get_folio
>>>>                                              folio_free_swap
>>>>                                              /* remove swap cache */
>>>>                                            /* free si->swap_map[] */
>>>>
>>>>   swap_page_trans_huge_swapped <-- access freed si->swap_map !!!
>>>
>>> Sorry for jumping the discussion here. IMHO, free_swap_and_cache is called with pte lock held.
>> 
>> I don't beleive it has the PTL when called by shmem.
>
> In the case of shmem, folio_lock is used to guard against the race.

I don't find that the folio is locked for shmem.  find_lock_entries() will
only lock the folio if (!xa_is_value()), that is, if it is not a swap
entry.  Can you point out where the folio is locked for shmem?

--
Best Regards,
Huang, Ying

>> 
>>> So synchronize_rcu (called by swapoff) will wait zap_pte_range to release the pte lock. So this
>>> theoretical problem can't happen. Or am I miss something?
>> 
>> For Huang Ying's example, I agree this can't happen because try_to_unuse() will
>> be waiting for the PTL (see the reply I just sent).
>
> Do you mean the below message?
> "
> I don't think si->inuse_pages is decremented until __try_to_reclaim_swap() is
> called (per David, above), which is called after swap_page_trans_huge_swapped()
> has executed. So in CPU1, try_to_unuse() wouldn't see si->inuse_pages being zero
> until after CPU0 has completed accessing si->swap_map, so if swapoff starts
> where you have put it, it would get stalled waiting for the PTL which CPU0 has.
> "
>
> I agree try_to_unuse() will wait for si->inuse_pages being zero. But why will it waits
> for the PTL? It seems PTL is not used to protect si->inuse_pages. Or am I miss something?
>
>> 
>>>
>>> CPU0                                     CPU1
>>> ----                                     ----
>>>
>>> zap_pte_range
>>>   pte_offset_map_lock -- spin_lock is held.
>>>   free_swap_and_cache
>>>    __swap_entry_free
>>>    /* swap count become 0 */
>>>                                          swapoff
>>>                                            try_to_unuse
>>>                                              filemap_get_folio
>>>                                              folio_free_swap
>>>                                              /* remove swap cache */
>>> 					    percpu_ref_kill(&p->users);
>>>    swap_page_trans_huge_swapped
>>>   pte_unmap_unlock -- spin_lock is released.
>>> 					    synchronize_rcu();  --> Will wait pte_unmap_unlock to be called?
>> 
>> Perhaps you can educate me here; I thought that synchronize_rcu() will only wait
>> for RCU critical sections to complete. The PTL is a spin lock, so why would
>> synchronize_rcu() wait for the PTL to become unlocked?
>
> I assume PTL will always disable preemption which disables a grace period until PTL is released.
> But this might be fragile and I'm not really sure. I might be wrong.
>
> Thanks.
>> 
>> 
>>>                                            /* free si->swap_map[] */
>>>
>>> Thanks.
>>>
>>>
>> 
>> .
>>
Miaohe Lin March 7, 2024, 6:50 a.m. UTC | #17
On 2024/3/7 13:56, Huang, Ying wrote:
> Miaohe Lin <linmiaohe@huawei.com> writes:
> 
>> On 2024/3/6 17:31, Ryan Roberts wrote:
>>> On 06/03/2024 08:51, Miaohe Lin wrote:
>>>> On 2024/3/6 10:52, Huang, Ying wrote:
>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>>
>>>>>> There was previously a theoretical window where swapoff() could run and
>>>>>> teardown a swap_info_struct while a call to free_swap_and_cache() was
>>>>>> running in another thread. This could cause, amongst other bad
>>>>>> possibilities, swap_page_trans_huge_swapped() (called by
>>>>>> free_swap_and_cache()) to access the freed memory for swap_map.
>>>>>>
>>>>>> This is a theoretical problem and I haven't been able to provoke it from
>>>>>> a test case. But there has been agreement based on code review that this
>>>>>> is possible (see link below).
>>>>>>
>>>>>> Fix it by using get_swap_device()/put_swap_device(), which will stall
>>>>>> swapoff(). There was an extra check in _swap_info_get() to confirm that
>>>>>> the swap entry was valid. This wasn't present in get_swap_device() so
>>>>>> I've added it. I couldn't find any existing get_swap_device() call sites
>>>>>> where this extra check would cause any false alarms.
>>>>>>
>>>>>> Details of how to provoke one possible issue (thanks to David Hilenbrand
>>>>>> for deriving this):
>>>>>>
>>>>>> --8<-----
>>>>>>
>>>>>> __swap_entry_free() might be the last user and result in
>>>>>> "count == SWAP_HAS_CACHE".
>>>>>>
>>>>>> swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0.
>>>>>>
>>>>>> So the question is: could someone reclaim the folio and turn
>>>>>> si->inuse_pages==0, before we completed swap_page_trans_huge_swapped().
>>>>>>
>>>>>> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are
>>>>>> still references by swap entries.
>>>>>>
>>>>>> Process 1 still references subpage 0 via swap entry.
>>>>>> Process 2 still references subpage 1 via swap entry.
>>>>>>
>>>>>> Process 1 quits. Calls free_swap_and_cache().
>>>>>> -> count == SWAP_HAS_CACHE
>>>>>> [then, preempted in the hypervisor etc.]
>>>>>>
>>>>>> Process 2 quits. Calls free_swap_and_cache().
>>>>>> -> count == SWAP_HAS_CACHE
>>>>>>
>>>>>> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
>>>>>> __try_to_reclaim_swap().
>>>>>>
>>>>>> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->
>>>>>> put_swap_folio()->free_swap_slot()->swapcache_free_entries()->
>>>>>> swap_entry_free()->swap_range_free()->
>>>>>> ...
>>>>>> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>>>>>>
>>>>>> What stops swapoff to succeed after process 2 reclaimed the swap cache
>>>>>> but before process1 finished its call to swap_page_trans_huge_swapped()?
>>>>>>
>>>>>> --8<-----
>>>>>
>>>>> I think that this can be simplified.  Even for a 4K folio, this could
>>>>> happen.
>>>>>
>>>>> CPU0                                     CPU1
>>>>> ----                                     ----
>>>>>
>>>>> zap_pte_range
>>>>>   free_swap_and_cache
>>>>>   __swap_entry_free
>>>>>   /* swap count become 0 */
>>>>>                                          swapoff
>>>>>                                            try_to_unuse
>>>>>                                              filemap_get_folio
>>>>>                                              folio_free_swap
>>>>>                                              /* remove swap cache */
>>>>>                                            /* free si->swap_map[] */
>>>>>
>>>>>   swap_page_trans_huge_swapped <-- access freed si->swap_map !!!
>>>>
>>>> Sorry for jumping the discussion here. IMHO, free_swap_and_cache is called with pte lock held.
>>>
>>> I don't beleive it has the PTL when called by shmem.
>>
>> In the case of shmem, folio_lock is used to guard against the race.
> 
> I don't find folio is lock for shmem.  find_lock_entries() will only
> lock the folio if (!xa_is_value()), that is, not swap entry.  Can you
> point out where the folio is locked for shmem?

You're right, the folio is only locked if it is not a swap entry. That's my mistake. But it seems the above race is still nonexistent.
shmem_unuse() will first be called to read all the shared memory data that resides in the swap device back into
memory when doing swapoff. In that case, all the swapped pages are moved to the page cache, so there won't be any
xa_is_value(folio) cases when calling shmem_undo_range(). free_swap_and_cache() won't even be called from
shmem_undo_range() after shmem_unuse(). Or am I missing something?

Thanks.

> 
> --
> Best Regards,
> Huang, Ying
>
Huang, Ying March 7, 2024, 7:34 a.m. UTC | #18
Miaohe Lin <linmiaohe@huawei.com> writes:

> On 2024/3/7 13:56, Huang, Ying wrote:
>> Miaohe Lin <linmiaohe@huawei.com> writes:
>> 
>>> On 2024/3/6 17:31, Ryan Roberts wrote:
>>>> On 06/03/2024 08:51, Miaohe Lin wrote:
>>>>> On 2024/3/6 10:52, Huang, Ying wrote:
>>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>>>
>>>>>>> There was previously a theoretical window where swapoff() could run and
>>>>>>> teardown a swap_info_struct while a call to free_swap_and_cache() was
>>>>>>> running in another thread. This could cause, amongst other bad
>>>>>>> possibilities, swap_page_trans_huge_swapped() (called by
>>>>>>> free_swap_and_cache()) to access the freed memory for swap_map.
>>>>>>>
>>>>>>> This is a theoretical problem and I haven't been able to provoke it from
>>>>>>> a test case. But there has been agreement based on code review that this
>>>>>>> is possible (see link below).
>>>>>>>
>>>>>>> Fix it by using get_swap_device()/put_swap_device(), which will stall
>>>>>>> swapoff(). There was an extra check in _swap_info_get() to confirm that
>>>>>>> the swap entry was valid. This wasn't present in get_swap_device() so
>>>>>>> I've added it. I couldn't find any existing get_swap_device() call sites
>>>>>>> where this extra check would cause any false alarms.
>>>>>>>
>>>>>>> Details of how to provoke one possible issue (thanks to David Hilenbrand
>>>>>>> for deriving this):
>>>>>>>
>>>>>>> --8<-----
>>>>>>>
>>>>>>> __swap_entry_free() might be the last user and result in
>>>>>>> "count == SWAP_HAS_CACHE".
>>>>>>>
>>>>>>> swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0.
>>>>>>>
>>>>>>> So the question is: could someone reclaim the folio and turn
>>>>>>> si->inuse_pages==0, before we completed swap_page_trans_huge_swapped().
>>>>>>>
>>>>>>> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are
>>>>>>> still references by swap entries.
>>>>>>>
>>>>>>> Process 1 still references subpage 0 via swap entry.
>>>>>>> Process 2 still references subpage 1 via swap entry.
>>>>>>>
>>>>>>> Process 1 quits. Calls free_swap_and_cache().
>>>>>>> -> count == SWAP_HAS_CACHE
>>>>>>> [then, preempted in the hypervisor etc.]
>>>>>>>
>>>>>>> Process 2 quits. Calls free_swap_and_cache().
>>>>>>> -> count == SWAP_HAS_CACHE
>>>>>>>
>>>>>>> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
>>>>>>> __try_to_reclaim_swap().
>>>>>>>
>>>>>>> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->
>>>>>>> put_swap_folio()->free_swap_slot()->swapcache_free_entries()->
>>>>>>> swap_entry_free()->swap_range_free()->
>>>>>>> ...
>>>>>>> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>>>>>>>
>>>>>>> What stops swapoff to succeed after process 2 reclaimed the swap cache
>>>>>>> but before process1 finished its call to swap_page_trans_huge_swapped()?
>>>>>>>
>>>>>>> --8<-----
>>>>>>
>>>>>> I think that this can be simplified.  Even for a 4K folio, this could
>>>>>> happen.
>>>>>>
>>>>>> CPU0                                     CPU1
>>>>>> ----                                     ----
>>>>>>
>>>>>> zap_pte_range
>>>>>>   free_swap_and_cache
>>>>>>   __swap_entry_free
>>>>>>   /* swap count become 0 */
>>>>>>                                          swapoff
>>>>>>                                            try_to_unuse
>>>>>>                                              filemap_get_folio
>>>>>>                                              folio_free_swap
>>>>>>                                              /* remove swap cache */
>>>>>>                                            /* free si->swap_map[] */
>>>>>>
>>>>>>   swap_page_trans_huge_swapped <-- access freed si->swap_map !!!
>>>>>
>>>>> Sorry for jumping the discussion here. IMHO, free_swap_and_cache is called with pte lock held.
>>>>
>>>> I don't beleive it has the PTL when called by shmem.
>>>
>>> In the case of shmem, folio_lock is used to guard against the race.
>> 
>> I don't find folio is lock for shmem.  find_lock_entries() will only
>> lock the folio if (!xa_is_value()), that is, not swap entry.  Can you
>> point out where the folio is locked for shmem?
>
> You're right, folio is locked if not swap entry. That's my mistake. But it seems above race is still nonexistent.
> shmem_unuse() will first be called to read all the shared memory data that resides in the swap device back into
> memory when doing swapoff. In that case, all the swapped pages are moved to page cache thus there won't be any
> xa_is_value(folio) cases when calling shmem_undo_range(). free_swap_and_cache() even won't be called from
> shmem_undo_range() after shmem_unuse(). Or am I miss something?

I think the following situation is possible.  Right?

CPU0                               CPU1
----                               ----
shmem_undo_range
  shmem_free_swap
    xa_cmpxchg_irq
    free_swap_and_cache
      __swap_entry_free
      /* swap count become 0 */
                                   swapoff
                                     try_to_unuse
                                       shmem_unuse /* cannot find swap entry */
                                       find_next_to_unuse
                                       filemap_get_folio
                                       folio_free_swap
                                       /* remove swap cache */
                                       /* free si->swap_map[] */
      swap_page_trans_huge_swapped <-- access freed si->swap_map !!!

shmem_undo_range can run earlier.

--
Best Regards,
Huang, Ying
Ryan Roberts March 7, 2024, 7:48 a.m. UTC | #19
On 07/03/2024 07:34, Huang, Ying wrote:
> Miaohe Lin <linmiaohe@huawei.com> writes:
> 
>> On 2024/3/7 13:56, Huang, Ying wrote:
>>> Miaohe Lin <linmiaohe@huawei.com> writes:
>>>
>>>> On 2024/3/6 17:31, Ryan Roberts wrote:
>>>>> On 06/03/2024 08:51, Miaohe Lin wrote:
>>>>>> On 2024/3/6 10:52, Huang, Ying wrote:
>>>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>>>>
>>>>>>>> There was previously a theoretical window where swapoff() could run and
>>>>>>>> teardown a swap_info_struct while a call to free_swap_and_cache() was
>>>>>>>> running in another thread. This could cause, amongst other bad
>>>>>>>> possibilities, swap_page_trans_huge_swapped() (called by
>>>>>>>> free_swap_and_cache()) to access the freed memory for swap_map.
>>>>>>>>
>>>>>>>> This is a theoretical problem and I haven't been able to provoke it from
>>>>>>>> a test case. But there has been agreement based on code review that this
>>>>>>>> is possible (see link below).
>>>>>>>>
>>>>>>>> Fix it by using get_swap_device()/put_swap_device(), which will stall
>>>>>>>> swapoff(). There was an extra check in _swap_info_get() to confirm that
>>>>>>>> the swap entry was valid. This wasn't present in get_swap_device() so
>>>>>>>> I've added it. I couldn't find any existing get_swap_device() call sites
>>>>>>>> where this extra check would cause any false alarms.
>>>>>>>>
>>>>>>>> Details of how to provoke one possible issue (thanks to David Hilenbrand
>>>>>>>> for deriving this):
>>>>>>>>
>>>>>>>> --8<-----
>>>>>>>>
>>>>>>>> __swap_entry_free() might be the last user and result in
>>>>>>>> "count == SWAP_HAS_CACHE".
>>>>>>>>
>>>>>>>> swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0.
>>>>>>>>
>>>>>>>> So the question is: could someone reclaim the folio and turn
>>>>>>>> si->inuse_pages==0, before we completed swap_page_trans_huge_swapped().
>>>>>>>>
>>>>>>>> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are
>>>>>>>> still references by swap entries.
>>>>>>>>
>>>>>>>> Process 1 still references subpage 0 via swap entry.
>>>>>>>> Process 2 still references subpage 1 via swap entry.
>>>>>>>>
>>>>>>>> Process 1 quits. Calls free_swap_and_cache().
>>>>>>>> -> count == SWAP_HAS_CACHE
>>>>>>>> [then, preempted in the hypervisor etc.]
>>>>>>>>
>>>>>>>> Process 2 quits. Calls free_swap_and_cache().
>>>>>>>> -> count == SWAP_HAS_CACHE
>>>>>>>>
>>>>>>>> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
>>>>>>>> __try_to_reclaim_swap().
>>>>>>>>
>>>>>>>> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->
>>>>>>>> put_swap_folio()->free_swap_slot()->swapcache_free_entries()->
>>>>>>>> swap_entry_free()->swap_range_free()->
>>>>>>>> ...
>>>>>>>> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>>>>>>>>
>>>>>>>> What stops swapoff to succeed after process 2 reclaimed the swap cache
>>>>>>>> but before process1 finished its call to swap_page_trans_huge_swapped()?
>>>>>>>>
>>>>>>>> --8<-----
>>>>>>>
>>>>>>> I think that this can be simplified.  Even for a 4K folio, this could
>>>>>>> happen.
>>>>>>>
>>>>>>> CPU0                                     CPU1
>>>>>>> ----                                     ----
>>>>>>>
>>>>>>> zap_pte_range
>>>>>>>   free_swap_and_cache
>>>>>>>   __swap_entry_free
>>>>>>>   /* swap count become 0 */
>>>>>>>                                          swapoff
>>>>>>>                                            try_to_unuse
>>>>>>>                                              filemap_get_folio
>>>>>>>                                              folio_free_swap
>>>>>>>                                              /* remove swap cache */
>>>>>>>                                            /* free si->swap_map[] */
>>>>>>>
>>>>>>>   swap_page_trans_huge_swapped <-- access freed si->swap_map !!!
>>>>>>
>>>>>> Sorry for jumping the discussion here. IMHO, free_swap_and_cache is called with pte lock held.
>>>>>
>>>>> I don't beleive it has the PTL when called by shmem.
>>>>
>>>> In the case of shmem, folio_lock is used to guard against the race.
>>>
>>> I don't find folio is lock for shmem.  find_lock_entries() will only
>>> lock the folio if (!xa_is_value()), that is, not swap entry.  Can you
>>> point out where the folio is locked for shmem?
>>
>> You're right, folio is locked if not swap entry. That's my mistake. But it seems above race is still nonexistent.
>> shmem_unuse() will first be called to read all the shared memory data that resides in the swap device back into
>> memory when doing swapoff. In that case, all the swapped pages are moved to page cache thus there won't be any
>> xa_is_value(folio) cases when calling shmem_undo_range(). free_swap_and_cache() even won't be called from
>> shmem_undo_range() after shmem_unuse(). Or am I miss something?
> 
> I think the following situation is possible.  Right?
> 
> CPU0                               CPU1
> ----                               ----
> shmem_undo_range
>   shmem_free_swap
>     xa_cmpxchg_irq
>     free_swap_and_cache
>       __swap_entry_free
>       /* swap count become 0 */
>                                    swapoff
>                                      try_to_unuse
>                                        shmem_unuse /* cannot find swap entry */
>                                        find_next_to_unuse
>                                        filemap_get_folio
>                                        folio_free_swap
>                                        /* remove swap cache */
>                                        /* free si->swap_map[] */
>       swap_page_trans_huge_swapped <-- access freed si->swap_map !!!
> 
> shmem_undo_range can run earlier.

Yes, that's the shmem problem I've been trying to convey. Perhaps there are
other (extremely subtle) mechanisms that make this impossible; I don't know.

Either way, given the length of this discussion and the subtleties in the
synchronization mechanisms that have so far been identified, I think the safest
thing to do is just apply the patch. Then we have explicit synchronization that
we can trivially reason about.

Thanks,
Ryan

P.S. Thanks for the explanation about spinlocks being RCU read-side critical
sections - I didn't know that.

> 
> --
> Best Regards,
> Huang, Ying
Miaohe Lin March 7, 2024, 8:50 a.m. UTC | #20
On 2024/3/7 15:34, Huang, Ying wrote:
> Miaohe Lin <linmiaohe@huawei.com> writes:
> 
>> On 2024/3/7 13:56, Huang, Ying wrote:
>>> Miaohe Lin <linmiaohe@huawei.com> writes:
>>>
>>>> On 2024/3/6 17:31, Ryan Roberts wrote:
>>>>> On 06/03/2024 08:51, Miaohe Lin wrote:
>>>>>> On 2024/3/6 10:52, Huang, Ying wrote:
>>>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>>>>
>>>>>>>> There was previously a theoretical window where swapoff() could run and
>>>>>>>> teardown a swap_info_struct while a call to free_swap_and_cache() was
>>>>>>>> running in another thread. This could cause, amongst other bad
>>>>>>>> possibilities, swap_page_trans_huge_swapped() (called by
>>>>>>>> free_swap_and_cache()) to access the freed memory for swap_map.
>>>>>>>>
>>>>>>>> This is a theoretical problem and I haven't been able to provoke it from
>>>>>>>> a test case. But there has been agreement based on code review that this
>>>>>>>> is possible (see link below).
>>>>>>>>
>>>>>>>> Fix it by using get_swap_device()/put_swap_device(), which will stall
>>>>>>>> swapoff(). There was an extra check in _swap_info_get() to confirm that
>>>>>>>> the swap entry was valid. This wasn't present in get_swap_device() so
>>>>>>>> I've added it. I couldn't find any existing get_swap_device() call sites
>>>>>>>> where this extra check would cause any false alarms.
>>>>>>>>
>>>>>>>> Details of how to provoke one possible issue (thanks to David Hilenbrand
>>>>>>>> for deriving this):
>>>>>>>>
>>>>>>>> --8<-----
>>>>>>>>
>>>>>>>> __swap_entry_free() might be the last user and result in
>>>>>>>> "count == SWAP_HAS_CACHE".
>>>>>>>>
>>>>>>>> swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0.
>>>>>>>>
>>>>>>>> So the question is: could someone reclaim the folio and turn
>>>>>>>> si->inuse_pages==0, before we completed swap_page_trans_huge_swapped().
>>>>>>>>
>>>>>>>> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are
>>>>>>>> still references by swap entries.
>>>>>>>>
>>>>>>>> Process 1 still references subpage 0 via swap entry.
>>>>>>>> Process 2 still references subpage 1 via swap entry.
>>>>>>>>
>>>>>>>> Process 1 quits. Calls free_swap_and_cache().
>>>>>>>> -> count == SWAP_HAS_CACHE
>>>>>>>> [then, preempted in the hypervisor etc.]
>>>>>>>>
>>>>>>>> Process 2 quits. Calls free_swap_and_cache().
>>>>>>>> -> count == SWAP_HAS_CACHE
>>>>>>>>
>>>>>>>> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
>>>>>>>> __try_to_reclaim_swap().
>>>>>>>>
>>>>>>>> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->
>>>>>>>> put_swap_folio()->free_swap_slot()->swapcache_free_entries()->
>>>>>>>> swap_entry_free()->swap_range_free()->
>>>>>>>> ...
>>>>>>>> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>>>>>>>>
>>>>>>>> What stops swapoff to succeed after process 2 reclaimed the swap cache
>>>>>>>> but before process1 finished its call to swap_page_trans_huge_swapped()?
>>>>>>>>
>>>>>>>> --8<-----
>>>>>>>
>>>>>>> I think that this can be simplified.  Even for a 4K folio, this could
>>>>>>> happen.
>>>>>>>
>>>>>>> CPU0                                     CPU1
>>>>>>> ----                                     ----
>>>>>>>
>>>>>>> zap_pte_range
>>>>>>>   free_swap_and_cache
>>>>>>>   __swap_entry_free
>>>>>>>   /* swap count become 0 */
>>>>>>>                                          swapoff
>>>>>>>                                            try_to_unuse
>>>>>>>                                              filemap_get_folio
>>>>>>>                                              folio_free_swap
>>>>>>>                                              /* remove swap cache */
>>>>>>>                                            /* free si->swap_map[] */
>>>>>>>
>>>>>>>   swap_page_trans_huge_swapped <-- access freed si->swap_map !!!
>>>>>>
>>>>>> Sorry for jumping the discussion here. IMHO, free_swap_and_cache is called with pte lock held.
>>>>>
>>>>> I don't beleive it has the PTL when called by shmem.
>>>>
>>>> In the case of shmem, folio_lock is used to guard against the race.
>>>
>>> I don't find folio is lock for shmem.  find_lock_entries() will only
>>> lock the folio if (!xa_is_value()), that is, not swap entry.  Can you
>>> point out where the folio is locked for shmem?
>>
>> You're right, folio is locked if not swap entry. That's my mistake. But it seems above race is still nonexistent.
>> shmem_unuse() will first be called to read all the shared memory data that resides in the swap device back into
>> memory when doing swapoff. In that case, all the swapped pages are moved to page cache thus there won't be any
>> xa_is_value(folio) cases when calling shmem_undo_range(). free_swap_and_cache() even won't be called from
>> shmem_undo_range() after shmem_unuse(). Or am I miss something?
> 
> I think the following situation is possible.  Right?
> 
> CPU0                               CPU1
> ----                               ----
> shmem_undo_range
>   shmem_free_swap
>     xa_cmpxchg_irq
>     free_swap_and_cache
>       __swap_entry_free
>       /* swap count become 0 */
>                                    swapoff
>                                      try_to_unuse
>                                        shmem_unuse /* cannot find swap entry */
>                                        find_next_to_unuse
>                                        filemap_get_folio
>                                        folio_free_swap
>                                        /* remove swap cache */
>                                        /* free si->swap_map[] */
>       swap_page_trans_huge_swapped <-- access freed si->swap_map !!!
> 
> shmem_undo_range can run earlier.

Considering the above case, I tend to agree it's possible. I can't figure out a mechanism that would make it impossible yet.

Thanks.

> 
> --
> Best Regards,
> Huang, Ying
> .
>
Huang, Ying March 7, 2024, 8:54 a.m. UTC | #21
Ryan Roberts <ryan.roberts@arm.com> writes:

> [...]
>
> Yes, that's the shmem problem I've been trying to convey. Perhaps there are other
> (extremely subtle) mechanisms that make this impossible; I don't know.
>
> Either way, given the length of this discussion, and the subtleties in the
> synchronization mechanisms that have so far been identified, I think the safest
> thing to do is just apply the patch. Then we have explicit synchronization that
> we can trivially reason about.

Yes.  This is tricky, and we can improve it.  So I suggest that we:

- Revise the patch description to use the shmem race as the example,
  unless someone finds that it's impossible.

- Revise the comments of get_swap_device() to explain that an RCU
  reader-side lock (including IRQ off, spinlock, etc.) can prevent
  swapoff via the synchronize_rcu() in swapoff().

- Revise the comments of the synchronize_rcu() in swapoff(), which can
  prevent swapoff from running in parallel with RCU reader-side locks,
  including those taken around swap cache operations, etc.
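
To make (2) and (3) concrete, here is a rough sketch of the caller
pattern those comments would document, with possible wording as the
comments.  This is illustrative only, not proposed final wording, and
the pattern matches what the patch below already does:

	/*
	 * Possible get_swap_device() comment: in addition to taking a
	 * reference here, any RCU reader-side critical section
	 * (rcu_read_lock(), a held spinlock, IRQs or preemption
	 * disabled) entered while the entry is known to be in use also
	 * prevents swapoff(), because swapoff() calls synchronize_rcu()
	 * before freeing si->swap_map[].
	 */
	struct swap_info_struct *si;

	si = get_swap_device(entry);
	if (si) {
		/* si and si->swap_map[] are stable until the put */
		/* ... access si->swap_map[swp_offset(entry)] ... */
		put_swap_device(si);
	}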

--
Best Regards,
Huang, Ying
Ryan Roberts March 7, 2024, 9:19 a.m. UTC | #22
On 07/03/2024 08:54, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> [...]
> 
> Yes.  This is tricky, and we can improve it.  So I suggest that we:
> 
> - Revise the patch description to use the shmem race as the example,
>   unless someone finds that it's impossible.
> 
> - Revise the comments of get_swap_device() to explain that an RCU
>   reader-side lock (including IRQ off, spinlock, etc.) can prevent
>   swapoff via the synchronize_rcu() in swapoff().
> 
> - Revise the comments of the synchronize_rcu() in swapoff(), which can
>   prevent swapoff from running in parallel with RCU reader-side locks,
>   including those taken around swap cache operations, etc.

The only problem with this is that Andrew has already put my v2 into mm-*stable* :-|

So (1) from that list isn't possible. I could do a patch for (2) and (3), but to
be honest, I think you would do a better job of writing it up than I would - any
chance you could post the patch?




> 
> --
> Best Regards,
> Huang, Ying
Huang, Ying March 8, 2024, 12:55 a.m. UTC | #23
Ryan Roberts <ryan.roberts@arm.com> writes:

> On 07/03/2024 08:54, Huang, Ying wrote:
>> [...]
>> 
>> Yes.  This is tricky, and we can improve it.  So I suggest that we:
>> 
>> - Revise the patch description to use the shmem race as the example,
>>   unless someone finds that it's impossible.
>> 
>> - Revise the comments of get_swap_device() to explain that an RCU
>>   reader-side lock (including IRQ off, spinlock, etc.) can prevent
>>   swapoff via the synchronize_rcu() in swapoff().
>> 
>> - Revise the comments of the synchronize_rcu() in swapoff(), which can
>>   prevent swapoff from running in parallel with RCU reader-side locks,
>>   including those taken around swap cache operations, etc.
>
> The only problem with this is that Andrew has already put my v2 into mm-*stable* :-|
>
> So (1) from that list isn't possible. I could do a patch for (2) and (3), but to
> be honest, I think you would do a better job of writing it up than I would - any
> chance you could post the patch?
>

Sure.  I will do that.

--
Best Regards,
Huang, Ying
diff mbox series

Patch

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2b3a2d85e350..f580e6abc674 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1281,7 +1281,9 @@  struct swap_info_struct *get_swap_device(swp_entry_t entry)
 	smp_rmb();
 	offset = swp_offset(entry);
 	if (offset >= si->max)
-		goto put_out;
+		goto bad_offset;
+	if (data_race(!si->swap_map[swp_offset(entry)]))
+		goto bad_free;

 	return si;
 bad_nofile:
@@ -1289,9 +1291,14 @@  struct swap_info_struct *get_swap_device(swp_entry_t entry)
 out:
 	return NULL;
 put_out:
-	pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
 	percpu_ref_put(&si->users);
 	return NULL;
+bad_offset:
+	pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
+	goto put_out;
+bad_free:
+	pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val);
+	goto put_out;
 }

 static unsigned char __swap_entry_free(struct swap_info_struct *p,
@@ -1609,13 +1616,14 @@  int free_swap_and_cache(swp_entry_t entry)
 	if (non_swap_entry(entry))
 		return 1;

-	p = _swap_info_get(entry);
+	p = get_swap_device(entry);
 	if (p) {
 		count = __swap_entry_free(p, entry);
 		if (count == SWAP_HAS_CACHE &&
 		    !swap_page_trans_huge_swapped(p, entry))
 			__try_to_reclaim_swap(p, swp_offset(entry),
 					      TTRS_UNMAPPED | TTRS_FULL);
+		put_swap_device(p);
 	}
 	return p != NULL;
 }
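
For readers following the diff, this is how free_swap_and_cache() reads
with the patch applied.  The local variable declarations are not visible
in the hunk and are reconstructed here, so treat them as an assumption:

int free_swap_and_cache(swp_entry_t entry)
{
	struct swap_info_struct *p;
	unsigned char count;

	if (non_swap_entry(entry))
		return 1;

	/* Pins the device and stalls swapoff() until the put below. */
	p = get_swap_device(entry);
	if (p) {
		count = __swap_entry_free(p, entry);
		if (count == SWAP_HAS_CACHE &&
		    !swap_page_trans_huge_swapped(p, entry))
			__try_to_reclaim_swap(p, swp_offset(entry),
					      TTRS_UNMAPPED | TTRS_FULL);
		put_swap_device(p);
	}
	return p != NULL;
}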