
[v2] mm/swap: fix race when skipping swapcache

Message ID 20240206182559.32264-1-ryncsn@gmail.com (mailing list archive)
State New

Commit Message

Kairui Song Feb. 6, 2024, 6:25 p.m. UTC
From: Kairui Song <kasong@tencent.com>

When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads
swapin the same entry at the same time, they get different pages (A, B).
Before one thread (T0) finishes the swapin and installs page (A)
to the PTE, another thread (T1) could finish swapin of page (B),
swap_free the entry, then swap out the possibly modified page
reusing the same entry. This breaks the pte_same check in (T0) because
the PTE value is unchanged, causing an ABA problem. Thread (T0) will
install a stale page (A) into the PTE and cause data corruption.

One possible callstack is like this:

CPU0                                 CPU1
----                                 ----
do_swap_page()                       do_swap_page() with same entry
<direct swapin path>                 <direct swapin path>
<alloc page A>                       <alloc page B>
swap_read_folio() <- read to page A  swap_read_folio() <- read to page B
<slow on later locks or interrupt>   <finished swapin first>
...                                  set_pte_at()
                                     swap_free() <- entry is free
                                     <write to page B, now page A is stale>
                                     <swap out page B to same swap entry>
pte_same() <- Check pass, PTE seems
              unchanged, but page A
              is stale!
swap_free() <- page B content lost!
set_pte_at() <- stale page A installed!

Besides, for ZRAM, swap_free() allows the swap device to discard
the entry content, so even if page (B) is not modified, data loss
can still occur if swap_read_folio() on CPU0 happens later than
swap_free() on CPU1.

To fix this, reuse swapcache_prepare, which pins the swap entry using
the cache flag and allows only one thread to pin it. Release the pin
after the page table lock is dropped. Racers will simply busy-wait,
since this is a rare and very short event.
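
In short, the fast path becomes the following (a simplified sketch of
the flow this patch introduces; see the full diff below for the real
code, which also checks the swapcache flag before clearing):

    if (swapcache_prepare(entry))   /* sets SWAP_HAS_CACHE, fails if already set */
            goto out;               /* a racer holds the pin, let it finish */
    /* ... allocate the folio, swap_read_folio(), recheck pte_same() ... */
unlock:
    pte_unmap_unlock(vmf->pte, vmf->ptl);
    swapcache_clear(si, entry);     /* release the pin only after PTL unlock */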

Other methods, like increasing the swap count, don't seem to be a good
idea after some tests; they cause racers to fall back to using the
swap cache again. Parallel swapin using different methods leads to
a much more complex scenario.

Reproducer:

This race issue can be triggered easily using a well-constructed
reproducer and a patched brd (with a delay in the read path) [1]:

With the latest 6.8 mainline, race-caused data loss can be observed easily:
$ gcc -g -lpthread test-thread-swap-race.c && ./a.out
  Polulating 32MB of memory region...
  Keep swapping out...
  Starting round 0...
  Spawning 65536 workers...
  32746 workers spawned, wait for done...
  Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss!
  Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss!
  Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss!
  Round 0 Failed, 15 data loss!

This reproducer spawns multiple threads sharing the same memory region
backed by a small swap device. Every two threads update mapped pages one
by one in opposite directions, trying to create a race, with one
dedicated thread constantly swapping the data out using madvise.
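
A hypothetical outline of the reproducer's core loops (the real test,
with full setup and result checking, is at [1]; NPAGES, PAGE_SIZE and
done are placeholders here):

    #include <pthread.h>
    #include <sys/mman.h>

    static char *region;            /* mmap-ed region sized to the small swap device */

    static void *writer(void *arg)
    {
            long dir = (long)arg;   /* +1 or -1: each pair walks in opposite directions */
            for (long i = 0; i < NPAGES; i++) {
                    long pg = dir > 0 ? i : NPAGES - 1 - i;
                    region[pg * PAGE_SIZE]++;   /* swap the page in, then dirty it */
            }
            return NULL;
    }

    static void *swapper(void *arg)
    {
            while (!done)           /* keep evicting the region to the swap device */
                    madvise(region, NPAGES * PAGE_SIZE, MADV_PAGEOUT);
            return NULL;
    }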

The reproducer triggered the race about once every 5 minutes, so the
race should be entirely possible in production.

After this patch, I ran the reproducer for a few hundred rounds and
observed no data loss.

Performance overhead is minimal; a microbenchmark swapping in 10G from
a 32G zram device shows:

Before:     10934698 us
After:      11157121 us
Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag)

Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device")
Reported-by: "Huang, Ying" <ying.huang@intel.com>
Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/
Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1]
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Yu Zhao <yuzhao@google.com>

---
Update from V1:
- Add some words on the ZRAM case: it discards swap content on swap_free, so the race window is a bit different, but the cure is the same. [Barry Song]
- Update comments to make them cleaner. [Huang, Ying]
- Add a function placeholder to fix the CONFIG_SWAP=n build. [SeongJae Park]
- Update the commit message and summary; refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path". [Yu Zhao]
- Update commit message.
- Collect Reviews and Acks.

 include/linux/swap.h |  5 +++++
 mm/memory.c          | 15 +++++++++++++++
 mm/swap.h            |  5 +++++
 mm/swapfile.c        | 13 +++++++++++++
 4 files changed, 38 insertions(+)

Comments

SeongJae Park Feb. 6, 2024, 6:44 p.m. UTC | #1
Hi Kairui,

On Wed,  7 Feb 2024 02:25:59 +0800 Kairui Song <ryncsn@gmail.com> wrote:

> From: Kairui Song <kasong@tencent.com>
> 
[...]
> ---
> Update from V1:
> - Add some words on the ZRAM case: it discards swap content on swap_free, so the race window is a bit different, but the cure is the same. [Barry Song]
> - Update comments to make them cleaner. [Huang, Ying]
> - Add a function placeholder to fix the CONFIG_SWAP=n build. [SeongJae Park]

I confirm the build issue[1] that I reported yesterday is gone with this
version.  Thank you for fixing it!

[1] https://lore.kernel.org/linux-mm/20240206022409.202536-1-sj@kernel.org


Thanks,
SJ

[...]
Minchan Kim Feb. 6, 2024, 11:02 p.m. UTC | #2
On Wed, Feb 07, 2024 at 02:25:59AM +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> [...]
> 
> One possible callstack is like this:
> 
> CPU0                                 CPU1
> ----                                 ----
> do_swap_page()                       do_swap_page() with same entry
> <direct swapin path>                 <direct swapin path>
> <alloc page A>                       <alloc page B>
> swap_read_folio() <- read to page A  swap_read_folio() <- read to page B
> <slow on later locks or interrupt>   <finished swapin first>
> ...                                  set_pte_at()
>                                      swap_free() <- entry is free

                                       ^^^
nit: In the recent code, I see swap_free() is called before set_pte_at()


>                                      <write to page B, now page A is stale>
>                                      <swap out page B to same swap entry>
> pte_same() <- Check pass, PTE seems
>               unchanged, but page A
>               is stale!
> swap_free() <- page B content lost!
> set_pte_at() <- stale page A installed!
> 
> [...]

Thanks for catching the issue, folks!

> [...]
> diff --git a/mm/memory.c b/mm/memory.c
> index 7e1f4849463a..1749c700823d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	if (!folio) {
>  		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>  		    __swap_count(entry) == 1) {
> +			/*
> +			 * Prevent parallel swapin from proceeding with
> +			 * the cache flag. Otherwise, another thread may
> +			 * finish swapin first, free the entry, and swapout
> +			 * reusing the same entry. It's undetectable as
> +			 * pte_same() returns true due to entry reuse.
> +			 */
> +			if (swapcache_prepare(entry))
> +				goto out;
> +
>  			/* skip swapcache */
>  			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
>  						vma, vmf->address, false);
> @@ -4116,6 +4126,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  unlock:
>  	if (vmf->pte)
>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
> +	/* Clear the swap cache pin for direct swapin after PTL unlock */
> +	if (folio && !swapcache)
> +		swapcache_clear(si, entry);
>  out:
>  	if (si)
>  		put_swap_device(si);
> @@ -4124,6 +4137,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	if (vmf->pte)
>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
>  out_page:
> +	if (!swapcache)
> +		swapcache_clear(si, entry);
>  	folio_unlock(folio);
>  out_release:
>  	folio_put(folio);

What happens?

do_swap_page()
  ..
  swapcache_prepare() <- turned the cache flag on

  folio = vma_alloc_folio() <- failed to allocate the folio
  page = &folio->page; <- crash, but it's out of scope for this patch

  ..
  if (!folio)
     goto unlock;

..
unlock:
  swapcache_clear(si, entry) <- skipped this time, since folio is NULL,
                                leaving the entry pinned

Can we simply introduce a boolean flag to mark the special case and
clear the cache state based on the flag?

if (swapcache_prepare(entry))
        goto out;

need_clear_cache = true;

out_path:
        if (need_clear_cache)
                swapcache_clear(si, entry);
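
Fleshed out slightly, the suggestion might look like this inside
do_swap_page() (a sketch of the idea, not a tested patch;
need_clear_cache would be a new local initialized to false):

    if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
        __swap_count(entry) == 1) {
            if (swapcache_prepare(entry))
                    goto out;
            need_clear_cache = true;    /* we own the SWAP_HAS_CACHE pin now */
            /* skip swapcache: allocate the folio, read it in, ... */
    }
    ...
out:
    /* reached on every exit path, even when the folio allocation fails */
    if (need_clear_cache)
            swapcache_clear(si, entry);
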
Chris Li Feb. 6, 2024, 11:10 p.m. UTC | #3
Hi Kairui,

Sorry for replying to your patch V1 late; I will reply on the V2 thread.

On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> [...]
> diff --git a/mm/memory.c b/mm/memory.c
> index 7e1f4849463a..1749c700823d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         if (!folio) {
>                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>                     __swap_count(entry) == 1) {
> +                       /*
> +                        * Prevent parallel swapin from proceeding with
> +                        * the cache flag. Otherwise, another thread may
> +                        * finish swapin first, free the entry, and swapout
> +                        * reusing the same entry. It's undetectable as
> +                        * pte_same() returns true due to entry reuse.
> +                        */
> +                       if (swapcache_prepare(entry))
> +                               goto out;
> +

I am puzzled by this "goto out". If I understand this correctly, you
have two threads, CPU1 and CPU2, racing to set the flag SWAP_HAS_CACHE.
CPU1 will succeed in adding the flag and CPU2 will get
"-EEXIST" from "swapcache_prepare(entry)". Am I understanding it
correctly so far?

Then the goto out seems wrong to me. For CPU2, the page fault will
return *unhandled*. Even worse, the "-EEXIST" error is not preserved;
CPU2 does not even know the page fault was not handled, so it will
resume from the faulting instruction, possibly generating another page
fault at the exact same location. That page fault loop will repeat
until CPU1 installs the new PTE on that faulting virtual address and
it is picked up by CPU2.

Am I missing something obvious there?

I just re-read your comment: "Racers will simply busy wait since it's
a rare and very short event." That might be referring to the above
CPU2 page fault looping situation. I consider the page fault looping
on CPU2 not acceptable. For one, it will mess up the page fault
statistics. In my mind, having an explicit loop on CPU2 waiting for
the PTE to show up is still better than this page fault loop. You can
have more CPU-power-friendly loops.

This behavior needs more discussion.

Chris


> [...]
Barry Song Feb. 6, 2024, 11:40 p.m. UTC | #4
On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Kairui,
>
> Sorry for replying to your patch V1 late; I will reply on the V2 thread.
>
> On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > [...]
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 7e1f4849463a..1749c700823d 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >         if (!folio) {
> >                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >                     __swap_count(entry) == 1) {
> > +                       /*
> > +                        * Prevent parallel swapin from proceeding with
> > +                        * the cache flag. Otherwise, another thread may
> > +                        * finish swapin first, free the entry, and swapout
> > +                        * reusing the same entry. It's undetectable as
> > +                        * pte_same() returns true due to entry reuse.
> > +                        */
> > +                       if (swapcache_prepare(entry))
> > +                               goto out;
> > +
>
> > I am puzzled by this "goto out". If I understand this correctly, you
> > have two threads, CPU1 and CPU2, racing to set the flag SWAP_HAS_CACHE.
> > CPU1 will succeed in adding the flag and CPU2 will get
> > "-EEXIST" from "swapcache_prepare(entry)". Am I understanding it
> > correctly so far?
> >
> > Then the goto out seems wrong to me. For CPU2, the page fault will
> > return *unhandled*. Even worse, the "-EEXIST" error is not preserved;
> > CPU2 does not even know the page fault was not handled, so it will
> > resume from the faulting instruction, possibly generating another
> > page fault at the exact same location. That page fault loop will
> > repeat until CPU1 installs the new PTE on that faulting virtual
> > address and it is picked up by CPU2.
> >
> > Am I missing something obvious there?

I feel you are right. Any concurrent page faults at the same PTE
will now bump the page fault counters several times.

>
> I just re-read your comment: "Racers will simply busy wait since it's
> a rare and very short event." That might be referring to the above
> CPU2 page fault looping situation. I consider the page fault looping
> on CPU2 not acceptable. For one, it will mess up the page fault
> statistics. In my mind, having an explicit loop on CPU2 waiting for
> the PTE to show up is still better than this page fault loop. You can
> have more CPU-power-friendly loops.

I assume you mean something like

while (!pte_same())
    cpu_relax();

but then we still have a chance to miss the change to B.

For example, if another thread is changing the PTE A->B->A, our loop
can miss B. Thus we will be trapped in an infinite loop; this is even
worse.

Is it possible to loop on the success of swapcache_prepare(entry)
instead?
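
Something like this, presumably (a hypothetical sketch of the idea;
swapcache_prepare() keeps failing while another thread holds the
SWAP_HAS_CACHE pin):

    while (swapcache_prepare(entry))    /* fails while a racer holds the pin */
            cpu_relax();
    /* we now hold SWAP_HAS_CACHE, proceed with the direct swapin */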

> [...]
Thanks
Barry
Barry Song Feb. 7, 2024, 1:52 a.m. UTC | #5
On Wed, Feb 7, 2024 at 10:21 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Wed, Feb 7, 2024 at 10:03 AM Chris Li <chrisl@kernel.org> wrote:
> >
> > On Tue, Feb 6, 2024 at 4:43 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote:
> > > >
> > [...]
>
> Hi Chris and Barry,
>
> Thanks for the comments!
>

Hi Kairui,

> The worst thing I know of returning from do_swap_page without
> handling the swap is an increase of some statistic counters; note it
> will not cause the major page fault counter to grow, only things like
> perf counters and VMA lock statistics are affected.

I don't understand :-) If it is goto out, userspace may re-execute
the instruction. What happens next is a totally new page fault.

>
> And actually there are already multiple return points in
> do_swap_page that return without handling the fault, which may
> re-trigger the page fault.

I feel this case is different from the other returns.
The other returns have probably held the PTL or page lock before
checking the entry/PTE, and another thread has likely changed the
PTE/swap entry before those locks are released, so it is likely there
is no following page fault.

But our case is different: we are unconditionally taking one or more
extra page faults. I think making the count "right" is important, as
we depend heavily on it to debug performance issues.

> When do_swap_page is called, many pre-checks have been applied, and
> they could all be invalidated if something raced. Simply looping
> inside here could miss a lot of corner cases, so we have to go
> through all of that again.

I agree.

>
> This patch did increase the chance of falsely bumping some counters;
> maybe something like returning VM_FAULT_RETRY could make it better,
> but the code is more complex and will cause other counters to grow.

right.

Thanks
Barry
Chris Li Feb. 7, 2024, 2:03 a.m. UTC | #6
On Tue, Feb 6, 2024 at 4:43 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote:
> >
> > Hi Kairui,
> >
> > Sorry for replying to your patch V1 late; I will reply on the V2 thread.
> >
> > On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > [...]
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 7e1f4849463a..1749c700823d 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >         if (!folio) {
> > >                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > >                     __swap_count(entry) == 1) {
> > > +                       /*
> > > +                        * Prevent parallel swapin from proceeding with
> > > +                        * the cache flag. Otherwise, another thread may
> > > +                        * finish swapin first, free the entry, and swapout
> > > +                        * reusing the same entry. It's undetectable as
> > > +                        * pte_same() returns true due to entry reuse.
> > > +                        */
> > > +                       if (swapcache_prepare(entry))
> > > +                               goto out;
> > > +
> > [...]
>
> I assume you mean something like
>
> while (!pte_same())
>     cpu_relax();
>
> but then we still have a chance to miss the change to B.
>
> For example, if another thread is changing the PTE A->B->A, our loop
> can miss B. Thus we will be trapped in an infinite loop; this is even
> worse.

Yes. You are right, it is worse. Thanks for catching that. That is
why I said this needs more discussion; I haven't fully thought it
through :-)

>
> Is it possible to loop on the success of swapcache_prepare(entry)
> instead?

It looks like it can work. It is using SWAP_HAS_CACHE like a
per-swap-entry spinlock.
Chris

> [...]
Huang, Ying Feb. 7, 2024, 2:08 a.m. UTC | #7
Barry Song <21cnbao@gmail.com> writes:

> On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote:
>>
>> Hi Kairui,
>>
>> Sorry replying to your patch V1 late, I will reply on the V2 thread.
>>
>> On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote:
>> >
>> > [...]
>> > diff --git a/mm/memory.c b/mm/memory.c
>> > index 7e1f4849463a..1749c700823d 100644
>> > --- a/mm/memory.c
>> > +++ b/mm/memory.c
>> > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >         if (!folio) {
>> >                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>> >                     __swap_count(entry) == 1) {
>> > +                       /*
>> > +                        * Prevent parallel swapin from proceeding with
>> > +                        * the cache flag. Otherwise, another thread may
>> > +                        * finish swapin first, free the entry, and swapout
>> > +                        * reusing the same entry. It's undetectable as
>> > +                        * pte_same() returns true due to entry reuse.
>> > +                        */
>> > +                       if (swapcache_prepare(entry))
>> > +                               goto out;
>> > +
>>
>> I am puzzled by this "goto out". If I understand this correctly, you
>> have two threads, CPU1 and CPU2, racing to set the flag SWAP_HAS_CACHE.
>> CPU1 will succeed in adding the flag and CPU2 will get "-EEXIST" from
>> "swapcache_prepare(entry)". Am I understanding it correctly so far?
>>
>> Then the goto out seems wrong to me. For CPU2, the page fault will
>> return *unhandled*. Even worse, the "-EEXIST" error is not preserved;
>> CPU2 does not even know the page fault was not handled, so it will
>> resume from the faulting instruction and possibly generate another page
>> fault at the exact same location. That page fault loop will repeat
>> until CPU1 installs the new PTE at that faulting virtual address and it
>> is picked up by CPU2.
>>
>> Am I missing something obvious there?
>
> I feel you are right. Any concurrent page faults at the same PTE
> will now bump the page fault counters several times.
>
>>
>> I just re-read your comment: "Racers will simply busy wait since it's
>> a rare and very short event." That might be referring to the above
>> CPU2 page fault looping situation. I consider the page fault looping
>> on CPU2 unacceptable. For one, it will mess up the page fault
>> statistics.
>> In my mind, having an explicit loop for CPU2 waiting for the PTE to
>> show up is still better than this page fault loop. You could use more
>> CPU-power-friendly loops.
>
> I assume you mean something like
>
> while (!pte_same())
>         cpu_relax();
>
> then we still have a chance to miss the change to B.
>
> For example, if another thread changes the PTE A->B->A, our loop can
> miss B. Thus we will be trapped in an infinite loop; this is even worse.
>
> Is it possible to loop on the success of swapcache_prepare(entry)
> instead?

This doesn't work either.  The swap count can increase to > 1, and the
entry can stay in the swap cache for a long time.

Another possibility is to move swapcache_prepare() after
vma_alloc_folio() to reduce the race window.
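
Something like the following, that is (an untested sketch of the idea
only; the error handling is elided):

    /* skip swapcache */
    folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
                            vma, vmf->address, false);
    if (folio && swapcache_prepare(entry)) {
            /* Lost the race; a racer holds the entry pin. */
            folio_put(folio);
            goto out;
    }

Then the SWAP_HAS_CACHE pin is not held across the folio allocation,
which may be slow under memory pressure.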

--
Best Regards,
Huang, Ying

Kairui Song Feb. 7, 2024, 2:20 a.m. UTC | #8
On Wed, Feb 7, 2024 at 10:03 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Tue, Feb 6, 2024 at 4:43 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote:
> > >
> > > Hi Kairui,
> > >
> > > Sorry replying to your patch V1 late, I will reply on the V2 thread.
> > >
> > > On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > [...]
> >
> > I feel you are right. Any concurrent page faults at the same PTE
> > will now bump the page fault counters several times.
> >
> > >
> > > I just re-read your comment: "Racers will simply busy wait since it's
> > > a rare and very short event." That might be referring to the above
> > > CPU2 page fault looping situation. I consider the page fault looping
> > > on CPU2 unacceptable. For one, it will mess up the page fault
> > > statistics.
> > > In my mind, having an explicit loop for CPU2 waiting for the PTE to
> > > show up is still better than this page fault loop. You could use more
> > > CPU-power-friendly loops.
> >
> > I assume you mean something like
> >
> > while (!pte_same())
> >         cpu_relax();
> >
> > then we still have a chance to miss the change to B.
> >
> > For example, if another thread changes the PTE A->B->A, our loop can
> > miss B. Thus we will be trapped in an infinite loop; this is even worse.
>
> Yes, you are right, it is worse. Thanks for catching that. That is why
> I said this needs more discussion; I haven't fully thought it through
> :-)

Hi Chris and Barry,

Thanks for the comments!

The worst consequence I know of returning from do_swap_page without
handling the swap is an increase of some statistic counters. Note it
will not cause the major page fault counters to grow; only things like
perf counters and the VMA lock statistics are affected.

And actually there are multiple already existing return points in
do_swap_page that return without handling the fault, which may
re-trigger the page fault.
By the time do_swap_page is called, many pre-checks have been applied,
and they could all be invalidated if something raced; simply looping
inside here could miss a lot of corner cases, so we have to go through
them again.

This patch does increase the chance of false-positive bumps of some
counters. Maybe something like returning VM_FAULT_RETRY could make it
better, but the code would be more complex and it would cause other
counters to grow.
Kairui Song Feb. 7, 2024, 2:28 a.m. UTC | #9
On Wed, Feb 7, 2024 at 10:10 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote:
> >>
> >> Hi Kairui,
> >>
> >> Sorry replying to your patch V1 late, I will reply on the V2 thread.
> >>
> >> On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote:
> >> >
> >> > [...]
> > Is it possible to loop on the success of swapcache_prepare(entry)
> > instead?
>
> This doesn't work either.  The swap count can increase to > 1, and the
> entry can stay in the swap cache for a long time.
>
> Another possibility is to move swapcache_prepare() after
> vma_alloc_folio() to reduce the race window.

Reducing the race window seems like a good approach. Or maybe we can
just add a cpu_relax() so raced swapins simply slow down and won't
loop for too long, so the side effects (counters or power consumption)
should be much smaller?
Kairui Song Feb. 7, 2024, 3:21 a.m. UTC | #10
On Wed, Feb 7, 2024 at 10:55 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Feb 7, 2024 at 10:21 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Wed, Feb 7, 2024 at 10:03 AM Chris Li <chrisl@kernel.org> wrote:
> > >
> > > On Tue, Feb 6, 2024 at 4:43 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote:
> > > > >
> > > > > Hi Kairui,
> > > > >
> > > > > Sorry replying to your patch V1 late, I will reply on the V2 thread.
> > > > >
> > > > > On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > > > >
> > > > > > [...]
> >
> > Hi Chris and Barry,
> >
> > Thanks for the comments!
> >
>
> Hi Kairui,

Hi Barry,

>
> > The worst consequence I know of returning from do_swap_page without
> > handling the swap is an increase of some statistic counters. Note it
> > will not cause the major page fault counters to grow; only things like
> > perf counters and the VMA lock statistics are affected.
>
> I don't understand :-) If it is "goto out", userspace may re-execute
> the faulting instruction.
> What happens then is a totally new page fault.

Right, it's a new page fault; I mean that if the same PTE is still not
present, it's handled in the same way again.

> >
> > And actually there are multiple already existing return points in
> > do_swap_page that return without handling the fault, which may
> > re-trigger the page fault.
>
> I feel this case is different from the other returns.
> The other returns have probably held the PTL or the page lock before
> checking the entry/PTE, and another thread has likely changed the
> PTE/swap entry before those locks are released;
> then it is likely there is no following page fault.

Right, the other return points expect the next try to just succeed
without triggering another fault. But here SWP_SYNCHRONOUS_IO is
usually very fast, so the next attempt will also likely succeed
without triggering a fault.

So, to ensure that, maybe we can just wait a bit before returning?
That way the following page fault should be very unlikely to happen.

I'm not sure there is any better way to wait than simply adding a
cpu_relax: waiting on the swap cache flag may wait forever, since it's
entirely possible that some other thread added the entry to the swap
cache, and waiting for a PTE change runs into the same ABA issue.
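
For example, something like below (a sketch only; whether cpu_relax()
or a short sleep such as schedule_timeout_uninterruptible(1) is the
right kind of delay is exactly the open question here):

    if (swapcache_prepare(entry)) {
            /*
             * The racer holding the entry pin will release it very
             * soon, so wait a bit before returning; the retried fault
             * will then most likely find the PTE already installed.
             */
            schedule_timeout_uninterruptible(1);
            goto out;
    }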
Kairui Song Feb. 7, 2024, 3:22 a.m. UTC | #11
On Wed, Feb 7, 2024 at 7:02 AM Minchan Kim <minchan@kernel.org> wrote:
>
> On Wed, Feb 07, 2024 at 02:25:59AM +0800, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > [...]
> > One possible callstack is like this:
> >
> > CPU0                                 CPU1
> > ----                                 ----
> > do_swap_page()                       do_swap_page() with same entry
> > <direct swapin path>                 <direct swapin path>
> > <alloc page A>                       <alloc page B>
> > swap_read_folio() <- read to page A  swap_read_folio() <- read to page B
> > <slow on later locks or interrupt>   <finished swapin first>
> > ...                                  set_pte_at()
> >                                      swap_free() <- entry is free
>
>                                        ^^^
> nit: From the recent code, I see swap_free is called earlier than set_pte_at

Thanks, will update the message.

>
>
> >                                      <write to page B, now page A stalled>
> >                                      <swap out page B to same swap entry>
> > pte_same() <- Check pass, PTE seems
> >               unchanged, but page A
> >               is stalled!
> > swap_free() <- page B content lost!
> > set_pte_at() <- staled page A installed!
> >
> > And besides, for ZRAM, swap_free() allows the swap device to discard
> > the entry content, so even if page (B) is not modified, if
> > swap_read_folio() on CPU0 happens later than swap_free() on CPU1,
> > it may also cause data loss.
>
> Thanks for catching the issue, folks!
>
> > [...]
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 7e1f4849463a..1749c700823d 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       if (!folio) {
> >               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >                   __swap_count(entry) == 1) {
> > +                     /*
> > +                      * Prevent parallel swapin from proceeding with
> > +                      * the cache flag. Otherwise, another thread may
> > +                      * finish swapin first, free the entry, and swapout
> > +                      * reusing the same entry. It's undetectable as
> > +                      * pte_same() returns true due to entry reuse.
> > +                      */
> > +                     if (swapcache_prepare(entry))
> > +                             goto out;
> > +
> >                       /* skip swapcache */
> >                       folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> >                                               vma, vmf->address, false);
> > @@ -4116,6 +4126,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  unlock:
> >       if (vmf->pte)
> >               pte_unmap_unlock(vmf->pte, vmf->ptl);
> > +     /* Clear the swap cache pin for direct swapin after PTL unlock */
> > +     if (folio && !swapcache)
> > +             swapcache_clear(si, entry);
> >  out:
> >       if (si)
> >               put_swap_device(si);
> > @@ -4124,6 +4137,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       if (vmf->pte)
> >               pte_unmap_unlock(vmf->pte, vmf->ptl);
> >  out_page:
> > +     if (!swapcache)
> > +             swapcache_clear(si, entry);
> >       folio_unlock(folio);
> >  out_release:
> >       folio_put(folio);
>
> What happens?
>
> do_swap_page
>   ..
>   swapcache_prepare() <- turned the cache flag on
>
>   folio = vma_alloc_folio <- failed to allocate the folio
>   page = &folio->page; <- crash, but it's out of scope for this patch
>
>   ..
>   if (!folio)
>      goto unlock;
>
> ..
> unlock:
>   swapcache_clear(si, entry) <- it's skipped this time.
>
>
> Can we simply introduce a boolean flag to track this special case and
> clear the cache state based on the flag?

Good idea, that should make the code easier to understand.
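
Something like this, I suppose (a sketch only; the flag name is just
illustrative):

    bool need_clear_cache = false;
    ...
    if (swapcache_prepare(entry))
            goto out;
    need_clear_cache = true;        /* we now own the SWAP_HAS_CACHE pin */

    folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
                            vma, vmf->address, false);
    /* even if the allocation fails, every exit path then runs: */
    ...
    if (need_clear_cache)
            swapcache_clear(si, entry);

That way the cleanup no longer depends on whether the folio was
successfully allocated.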
Huang, Ying Feb. 7, 2024, 3:44 a.m. UTC | #12
Kairui Song <ryncsn@gmail.com> writes:

> On Wed, Feb 7, 2024 at 10:10 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote:
>> >>
>> >> Hi Kairui,
>> >>
>> >> Sorry replying to your patch V1 late, I will reply on the V2 thread.
>> >>
>> >> On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote:
>> >> >
>> >> > [...]
>> >>
>> >> I am puzzled by this "goto out". If I understand this correctly, you
>> >> have two threads, CPU1 and CPU2, racing to set the flag SWAP_HAS_CACHE.
>> >> CPU1 will succeed in adding the flag and CPU2 will get "-EEXIST"
>> >> from "swapcache_prepare(entry)". Am I understanding it correctly so
>> >> far?
>> >>
>> >> Then the goto out seems wrong to me. For CPU2, the page fault will
>> >> return *unhandled*. Even worse, the "-EEXIST" error is not preserved;
>> >> CPU2 does not even know the page fault was not handled, so it will
>> >> resume from the page fault instruction, possibly generating another
>> >> page fault at the exact same location. That page fault loop will
>> >> repeat until CPU1 installs the new PTE on the faulting virtual
>> >> address and CPU2 picks it up.
>> >>
>> >> Am I missing something obvious there?
>> >
>> > I feel you are right. Any concurrent page fault at the same PTE
>> > will now bump the page fault count several times.
>> >
>> >>
>> >> I just re-read your comment: "Racers will simply busy wait since it's
>> >> a rare and very short event." That might be referring to the above
>> >> CPU2 page fault looping situation. I consider the page fault looping
>> >> on CPU2 not acceptable; for one, it will mess up the page fault
>> >> statistics.
>> >> In my mind, having an explicit loop for CPU2 waiting for the PTE to
>> >> show up is still better than this page fault loop. You can have more
>> >> CPU-power-friendly loops.
>> >
>> > I assume you mean something like
>> >
>> > while (!pte_same(...))
>> >         cpu_relax();
>> >
>> > Then we still have a chance to miss the change to B.
>> >
>> > For example, if another thread changes the PTE A->B->A, our loop can
>> > miss B. Thus we will be trapped in an infinite loop; this is even worse.
>> >
>> > Is it possible to loop on the success of swapcache_prepare(entry)
>> > instead?
>>
>> This doesn't work either.  The swap count can increase to > 1 and the
>> entry can be put in the swap cache for a long time.
>>
>> Another possibility is to move swapcache_prepare() after
>> vma_alloc_folio() to reduce the race window.
>
> Reducing the race window seems like a good way. Or maybe we can just
> add a cpu_relax() so raced swapins will simply slow down and won't loop
> for too long, so the side effects (counters or power consumption)
> should be much smaller?

cpu_relax() only makes sense in a tight loop.

--
Best Regards,
Huang, Ying
Barry Song Feb. 7, 2024, 3:45 a.m. UTC | #13
On Wed, Feb 7, 2024 at 3:29 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Wed, Feb 7, 2024 at 10:10 AM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Barry Song <21cnbao@gmail.com> writes:
> >
> > > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote:
> > >>
> > >> Hi Kairui,
> > >>
> > >> Sorry replying to your patch V1 late, I will reply on the V2 thread.
> > >>
> > >> On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >> >
[snip quoted commit message, patch, and earlier rounds of discussion]
> >
> > This doesn't work either.  The swap count can increase to > 1 and the
> > entry can be put in the swap cache for a long time.
> >
> > Another possibility is to move swapcache_prepare() after
> > vma_alloc_folio() to reduce the race window.

What about we let everything go as it is? I mean, we only need to
record that swapcache_prepare() failed, but we don't goto out:

bool swapcache_prepare_failed = swapcache_prepare(entry);
.... /* don't change any other code */

and we only change the final step that sets the PTE from

   take ptl
   if (pte_same())
       set_pte_at()

to

   take ptl
   if (pte_same() && !swapcache_prepare_failed)
       set_pte_at()

As the chance of racing is close to 0%, the increase in the fault
count should be very minor.
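
Roughly, as a sketch only (the locals and the out_nomap path below just
follow the shape of the current do_swap_page(); this is untested,
illustrative code, not the actual patch):

	bool swapcache_prepare_failed = false;

	if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
	    __swap_count(entry) == 1)
		/* Record the lost race instead of returning early. */
		swapcache_prepare_failed = swapcache_prepare(entry);

	/* ... allocation and swap_read_folio() stay unchanged ... */

	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
				       vmf->address, &vmf->ptl);
	/* Refuse to install the PTE if we lost the race on the entry. */
	if (!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte) ||
	    swapcache_prepare_failed)
		goto out_nomap;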


>
> Reducing the race window seems like a good way. Or maybe we can just
> add a cpu_relax() so raced swapins will simply slow down and won't loop
> for too long, so the side effects (counters or power consumption)
> should be much smaller?

Thanks
Barry
Chris Li Feb. 7, 2024, 4:01 a.m. UTC | #14
On Tue, Feb 6, 2024 at 6:21 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Wed, Feb 7, 2024 at 10:03 AM Chris Li <chrisl@kernel.org> wrote:
> >
> > On Tue, Feb 6, 2024 at 4:43 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote:
> > > >
> > > > Hi Kairui,
> > > >
> > > > Sorry replying to your patch V1 late, I will reply on the V2 thread.
> > > >
> > > > On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > > >
[snip quoted commit message, patch, and earlier rounds of discussion]
> >
> > Yes. You are right, it is worse. Thanks for catching that. That is why
> > I say this needs more discussion, I haven't fully thought it through
> > :-)
>
> Hi Chris and Barry,
>
> Thanks for the comments!
>
> The worst thing I know of about returning from do_swap_page without
> handling the swap is an increase of some statistic counters; note it
> will not cause the major page fault counters to grow, only things like
> perf counters and the vma lock statistics are affected.
>
> And actually there are multiple already existing return points in
> do_swap_page that return without handling the fault, which may
> re-trigger the page fault.

Thanks for pointing that out. I took a look at those, and they seem
different from the case here.  In those cases, it truly cannot make
forward progress.
Here we actually have all the data needed to complete the page
fault; only a data synchronization issue is preventing forward
progress.
Ideally we could have some clever data structure to solve the
synchronization issue and make forward progress.

> When do_swap_page is called, many pre-checks have already been applied,
> and they could all be invalidated if something raced; simply looping
> here could miss a lot of corner cases, so we would have to go through
> them again.

Actually, having thought about it, looping here seems worse in the
sense that we are already holding some locks. Returning and retrying
the page fault at least releases those locks and lets others have a
chance to make progress.

>
> This patch does increase the chance of false-positive increases of some
> counters; maybe something like returning VM_FAULT_RETRY could make
> it better, but the code would be more complex and would cause other
> counters to grow.

This is certainly not ideal. It might upset the feedback loops that
use the swap fault statistics as input to adjust the swapping
behavior.

Chris
Kairui Song Feb. 7, 2024, 4:06 a.m. UTC | #15
On Wed, Feb 7, 2024 at 12:02 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Tue, Feb 6, 2024 at 6:21 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Wed, Feb 7, 2024 at 10:03 AM Chris Li <chrisl@kernel.org> wrote:
> > >
> > > On Tue, Feb 6, 2024 at 4:43 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote:
> > > > >
> > > > > Hi Kairui,
> > > > >
> > > > > Sorry replying to your patch V1 late, I will reply on the V2 thread.
> > > > >
> > > > > On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > > > >
[snip quoted commit message, patch, and earlier rounds of discussion]
>
> Thanks for pointing that out. I took a look at those, and they seem
> different from the case here.  In those cases, it truly cannot make
> forward progress.
> Here we actually have all the data needed to complete the page
> fault; only a data synchronization issue is preventing forward
> progress.
> Ideally we could have some clever data structure to solve the
> synchronization issue and make forward progress.
>
> > When do_swap_page is called, many pre-checks have already been applied,
> > and they could all be invalidated if something raced; simply looping
> > here could miss a lot of corner cases, so we would have to go through
> > them again.
>
> Actually, having thought about it, looping here seems worse in the
> sense that we are already holding some locks. Returning and retrying
> the page fault at least releases those locks and lets others have a
> chance to make progress.
>
> >
> > This patch does increase the chance of false-positive increases of some
> > counters; maybe something like returning VM_FAULT_RETRY could make
> > it better, but the code would be more complex and would cause other
> > counters to grow.
>
> This is certainly not ideal. It might upset the feedback loops that
> use the swap fault statistics as input to adjust the swapping
> behavior.
>
> Chris

Hi Chris,

Thanks for the reply.

So I think the thing is: it's getting complex because this patch
wanted to keep things simple and just reuse the swap cache flag.

Then maybe it's time to introduce a new swap map value (e.g.
SWAP_MAP_LOCK; not an ideal name, open to any suggestion on this...).
Then we could just wait for that value to be gone, with no risk of
loops and repeated page faults; we would just need to tidy some flag
set/retrieval helpers.
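
Something like the following, purely as an illustration (SWAP_MAP_LOCK
and the two helpers below are hypothetical and don't exist in the tree;
the real thing would have to fit into the existing swap_map count
encoding and locking):

	#define SWAP_MAP_LOCK	0x3e	/* hypothetical marker value */

	/* Hypothetical helper: nonzero means someone else holds it. */
	if (swap_map_lock_entry(si, entry)) {
		/*
		 * Lost the race: wait for the winner to install the PTE
		 * and drop the marker, then let the fault be retried,
		 * like the current "goto out". We must not be holding
		 * the PT lock while waiting here.
		 */
		while (swap_map_entry_locked(si, entry))  /* hypothetical */
			schedule_timeout_uninterruptible(1);
		return 0;
	}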
Huang, Ying Feb. 7, 2024, 4:16 a.m. UTC | #16
Barry Song <21cnbao@gmail.com> writes:

> On Wed, Feb 7, 2024 at 3:29 PM Kairui Song <ryncsn@gmail.com> wrote:
>>
>> On Wed, Feb 7, 2024 at 10:10 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >
>> > Barry Song <21cnbao@gmail.com> writes:
>> >
>> > > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote:
>> > >>
>> > >> Hi Kairui,
>> > >>
>> > >> Sorry replying to your patch V1 late, I will reply on the V2 thread.
>> > >>
>> > >> On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote:
>> > >> >
[snip quoted commit message, patch, and earlier rounds of discussion]
>
> What about we let everything go as it is? I mean, we only need to
> record that swapcache_prepare() failed, but we don't goto out:
>
> bool swapcache_prepare_failed = swapcache_prepare(entry);
> .... /* don't change any other code */
>
> and we only change the final step that sets the PTE from
>
>    take ptl
>    if (pte_same())
>        set_pte_at()
>
> to
>
>    take ptl
>    if (pte_same() && !swapcache_prepare_failed)
>        set_pte_at()
>
> As the chance of racing is close to 0%, the increase in the fault
> count should be very minor.

IIUC, if swapcache_prepare_failed is set, the fault will always fail in
the end.  If so, why bother wasting CPU cycles?  If you return directly,
and the first thread runs quickly enough, you can use the installed PTE
directly.

--
Best Regards,
Huang, Ying

>>
>> Reducing the race window seems like a good way. Or maybe we can just
>> add a cpu_relax() so raced swapins will simply slow down and won't loop
>> for too long, so the side effects (counters or power consumption)
>> should be much smaller?
>
> Thanks
> Barry
Barry Song Feb. 7, 2024, 4:24 a.m. UTC | #17
On Wed, Feb 7, 2024 at 5:18 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Wed, Feb 7, 2024 at 3:29 PM Kairui Song <ryncsn@gmail.com> wrote:
> >>
> >> On Wed, Feb 7, 2024 at 10:10 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >
> >> > Barry Song <21cnbao@gmail.com> writes:
> >> >
> >> > > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote:
> >> > >>
> >> > >> Hi Kairui,
> >> > >>
> >> > >> Sorry replying to your patch V1 late, I will reply on the V2 thread.
> >> > >>
> >> > >> On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote:
> >> > >> >
> >> > >> > From: Kairui Song <kasong@tencent.com>
> >> > >> >
> >> > >> > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads
> >> > >> > swapin the same entry at the same time, they get different pages (A, B).
> >> > >> > Before one thread (T0) finishes the swapin and installs page (A)
> >> > >> > to the PTE, another thread (T1) could finish swapin of page (B),
> >> > >> > swap_free the entry, then swap out the possibly modified page
> >> > >> > reusing the same entry. It breaks the pte_same check in (T0) because
> >> > >> > PTE value is unchanged, causing ABA problem. Thread (T0) will
> >> > >> > install a stalled page (A) into the PTE and cause data corruption.
> >> > >> >
> >> > >> > One possible callstack is like this:
> >> > >> >
> >> > >> > CPU0                                 CPU1
> >> > >> > ----                                 ----
> >> > >> > do_swap_page()                       do_swap_page() with same entry
> >> > >> > <direct swapin path>                 <direct swapin path>
> >> > >> > <alloc page A>                       <alloc page B>
> >> > >> > swap_read_folio() <- read to page A  swap_read_folio() <- read to page B
> >> > >> > <slow on later locks or interrupt>   <finished swapin first>
> >> > >> > ...                                  set_pte_at()
> >> > >> >                                      swap_free() <- entry is free
> >> > >> >                                      <write to page B, now page A stalled>
> >> > >> >                                      <swap out page B to same swap entry>
> >> > >> > pte_same() <- Check pass, PTE seems
> >> > >> >               unchanged, but page A
> >> > >> >               is stalled!
> >> > >> > swap_free() <- page B content lost!
> >> > >> > set_pte_at() <- staled page A installed!
> >> > >> >
> >> > >> > And besides, for ZRAM, swap_free() allows the swap device to discard
> >> > >> > the entry content, so even if page (B) is not modified, if
> >> > >> > swap_read_folio() on CPU0 happens later than swap_free() on CPU1,
> >> > >> > it may also cause data loss.
> >> > >> >
> >> > >> > To fix this, reuse swapcache_prepare which will pin the swap entry using
> >> > >> > the cache flag, and allow only one thread to pin it. Release the pin
> >> > >> > after PT unlocked. Racers will simply busy wait since it's a rare
> >> > >> > and very short event.
> >> > >> >
> >> > >> > Other methods like increasing the swap count don't seem to be a good
> >> > >> > idea after some tests, that will cause racers to fall back to use the
> >> > >> > swap cache again. Parallel swapin using different methods leads to
> >> > >> > a much more complex scenario.
> >> > >> >
> >> > >> > [snip]
> >> > >> > diff --git a/mm/memory.c b/mm/memory.c
> >> > >> > index 7e1f4849463a..1749c700823d 100644
> >> > >> > --- a/mm/memory.c
> >> > >> > +++ b/mm/memory.c
> >> > >> > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> > >> >         if (!folio) {
> >> > >> >                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >> > >> >                     __swap_count(entry) == 1) {
> >> > >> > +                       /*
> >> > >> > +                        * Prevent parallel swapin from proceeding with
> >> > >> > +                        * the cache flag. Otherwise, another thread may
> >> > >> > +                        * finish swapin first, free the entry, and swapout
> >> > >> > +                        * reusing the same entry. It's undetectable as
> >> > >> > +                        * pte_same() returns true due to entry reuse.
> >> > >> > +                        */
> >> > >> > +                       if (swapcache_prepare(entry))
> >> > >> > +                               goto out;
> >> > >> > +
> >> > >>
> >> > >> I am puzzled by this "goto out". If I understand this correctly, you
> >> > >> have two threads CPU1 and CPU2 racing to set the flag SWAP_HAS_CACHE.
> >> > >> CPU1 will succeed in adding the flag and CPU2 will get
> >> > >> "-EEXIST" from "swapcache_prepare(entry)".  Am I understanding it
> >> > >> correctly so far?
> >> > >>
> >> > >> Then the goto out seems wrong to me. For CPU2, the page fault will
> >> > >> return *unhandled*. Even worse, the "-EEXIST" error is not preserved;
> >> > >> CPU2 does not even know the page fault was not handled, so it will
> >> > >> resume from the faulting instruction and possibly generate another page
> >> > >> fault at the exact same location. That page fault loop will repeat
> >> > >> until CPU1 installs the new pte on that faulting virtual address and it
> >> > >> is picked up by CPU2.
> >> > >>
> >> > >> Am I missing something obvious there?
> >> > >
> >> > > I feel you are right. Any concurrent page fault at the same pte
> >> > > will now increase the page fault counters a couple of extra times.
> >> > >
> >> > >>
> >> > >> I just re-read your comment: "Racers will simply busy wait since it's
> >> > >> a rare and very short event." That might be referring to the above
> >> > >> CPU2 page fault looping situation. I consider the page fault looping
> >> > >> on CPU2 not acceptable. For one, it will mess up the page fault
> >> > >> statistics.
> >> > >> In my mind, having an explicit loop for CPU2 waiting for the PTE to
> >> > >> show up is still better than this page fault loop, and such a loop can
> >> > >> be made more CPU-power friendly.
> >> > >
> >> > > I assume you mean something like
> >> > >
> >> > > while(!pte_same())
> >> > >    cpu_relax();
> >> > >
> >> > > then we still have a chance to miss the change of B.
> >> > >
> >> > > For example, if another thread changes the pte A->B->A, our loop can
> >> > > miss B. Thus we will be trapped in an infinite loop. This is even worse.
> >> > >
> >> > > is it possible to loop for the success of swapcache_prepare(entry)
> >> > > instead?
> >> >
> >> > This doesn't work either.  The swap count can increase to > 1, and the
> >> > entry can stay in the swap cache for a long time.
> >> >
> >> > Another possibility is to move swapcache_prepare() after
> >> > vma_alloc_folio() to reduce the race window.
> >
> > What about we let everything go as it is? I mean, we only need to
> > record that we have failed on swapcache_prepare, but we don't goto out:
> >
> > bool swapcache_prepare_failed = swapcache_prepare(entry);
> > .... // don't change any other code
> >
> > and we only change the last code that sets the pte from
> >
> >    ptl
> >    if (pte_same)
> >        set_pte
> >
> > to
> >
> >    ptl
> >    if (pte_same && !swapcache_prepare_failed)
> >        set_pte
> >
> > As the chance of the race is close to 0%, the increase of the counters
> > should be very minor.
>
> IIUC, for the thread that failed swapcache_prepare(), that check will
> always fail.  If so, why bother wasting CPU cycles on the swapin?  If you
> return directly, and the first thread runs quickly enough, you can use
> the installed PTE directly.
>

You are right. I thought the actual probability of the race this patch is
trying to fix was 0%, so we could just fix the final place that sets the
pte and stop setting it once we detect a race. But I was wrong, because
the thread that fails on swapcache_prepare could be the one that gets the
PTL first.

> --
> Best Regards,
> Huang, Ying
>
> >>
> >> Reducing the race window seems like a good way. Or maybe we can just
> >> add a cpu_relax() so raced swapins will just slow down, and won't loop
> >> too much time and so the side effect (counter or power consumption)
> >> should be much smaller?
> >
Thanks
Barry
Minchan Kim Feb. 7, 2024, 6:31 p.m. UTC | #18
On Wed, Feb 07, 2024 at 12:06:15PM +0800, Kairui Song wrote:
> On Wed, Feb 7, 2024 at 12:02 PM Chris Li <chrisl@kernel.org> wrote:
> >
> > On Tue, Feb 6, 2024 at 6:21 PM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > On Wed, Feb 7, 2024 at 10:03 AM Chris Li <chrisl@kernel.org> wrote:
> > > >
> > > > On Tue, Feb 6, 2024 at 4:43 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote:
> > > > > >
> > > > > > Hi Kairui,
> > > > > >
> > > > > > Sorry replying to your patch V1 late, I will reply on the V2 thread.
> > > > > >
> > > > > > On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > > > > >
> > > > > > > [snip]
> > > >
> > > > Yes. You are right, it is worse. Thanks for catching that. That is why
> > > > I say this needs more discussion, I haven't fully thought it through
> > > > :-)
> > >
> > > Hi Chris and Barry,
> > >
> > > Thanks for the comments!
> > >
> > > The worst thing I know of about returning from do_swap_page without
> > > handling the swap is an increase of some statistic counters. Note it
> > > will not cause the major page fault counters to grow; only things like
> > > the perf counters and the vma lock statistics are affected.
> > >
> > > And actually there are multiple already existing return points in
> > > do_swap_page that will return without handling it, which may
> > > re-trigger the page fault.
> >
> > Thanks for pointing that out. I took a look at those, and they seem
> > different from the case here.  In those cases, it truly cannot make
> > forward progress.
> > Here we actually have all the data we need to complete the page
> > fault; just a data synchronization issue is preventing forward
> > progress.
> > Ideally we could have some clever data structure to solve the
> > synchronization issue and make forward progress.
> >
> > > When do_swap_page is called, many pre-checks have already been applied,
> > > and they could all be invalidated if something raced. Simply looping
> > > inside here could miss a lot of corner cases, so we would have to go
> > > through all of that again.
> >
> > Actually, thinking about it, looping here seems worse in the sense
> > that we are already holding some locks. Returning and retrying the page
> > fault at least releases those locks and lets others have a chance to
> > make progress.
> >
> > >
> > > This patch does increase the chance of false-positive increases of some
> > > counters. Maybe something like returning VM_FAULT_RETRY could make
> > > it better, but the code would be more complex and it would cause other
> > > counters to grow.
> >
> > This is certainly not ideal. It might upset the feedback loop that
> > uses the swap fault statistic as input to adjust the swapping
> > behavior.
> >
> > Chris
> 
> Hi Chris,
> 
> Thanks for the reply.
> 
> So I think the thing is, it's getting complex because this patch
> wanted to make it simple and just reuse the swap cache flags.

I agree that a simple fix would be the important thing at this point.

Considering your description, here's my understanding of the other idea:
other methods, such as increasing the swap count, haven't proven effective
in your tests, and that approach risks forcing racers to rely on the swap
cache again, with a potential performance loss in the race scenario.

While I understand that simplicity is important, and the performance loss
in this case may be infrequent, I believe the swap_count approach could be
a suitable solution. What do you think?
Kairui Song Feb. 8, 2024, 6:04 a.m. UTC | #19
On Thu, Feb 8, 2024 at 2:31 AM Minchan Kim <minchan@kernel.org> wrote:
>
> On Wed, Feb 07, 2024 at 12:06:15PM +0800, Kairui Song wrote:
> > On Wed, Feb 7, 2024 at 12:02 PM Chris Li <chrisl@kernel.org> wrote:
> > >
> > > On Tue, Feb 6, 2024 at 6:21 PM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > On Wed, Feb 7, 2024 at 10:03 AM Chris Li <chrisl@kernel.org> wrote:
> > > > >
> > > > > On Tue, Feb 6, 2024 at 4:43 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote:
> > > > > > >
> > > > > > > Hi Kairui,
> > > > > > >
> > > > > > > Sorry replying to your patch V1 late, I will reply on the V2 thread.
> > > > > > >
> > > > > > > On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > > > > > >
> > > > > > > > [snip]
> >
> > Hi Chris,
> >
> > Thanks for the reply.
> >
> > So I think the thing is, it's getting complex because this patch
> > wanted to make it simple and just reuse the swap cache flags.
>
> I agree that a simple fix would be the important thing at this point.
>
> Considering your description, here's my understanding of the other idea:
> other methods, such as increasing the swap count, haven't proven effective
> in your tests, and that approach risks forcing racers to rely on the swap
> cache again, with a potential performance loss in the race scenario.
>
> While I understand that simplicity is important, and the performance loss
> in this case may be infrequent, I believe the swap_count approach could be
> a suitable solution. What do you think?

Hi Minchan

Yes, my main concern was about simplicity and performance.

Increasing swap_count here will also race with another process
releasing swap_count to 0 (the swapcache was able to sync callers on
other call paths, but we skipped the swapcache here).
So the right steps are: 1. lock the cluster/swap lock; 2. check that
swap_count is still 1, bail out if not; 3. set it to 2.
__swap_duplicate can be modified to support this; it's similar to the
existing logic for SWAP_HAS_CACHE.

And the swap freeing path will have to do more: swapcache cleanup needs
to be handled even in the bypassing path, since the racer may add the
entry to the swapcache.

Reusing SWAP_HAS_CACHE seems to make it much simpler and avoids much of
that overhead, so I used that way in this patch; the only issue is
potentially repeated page faults now.

I'm currently trying to add a SWAP_MAP_LOCK (or SWAP_MAP_SYNC, I'm bad
at naming it) special value, so any racer can just spin on it to avoid
all these problems. What do you think about this?
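
For illustration only, a rough and untested sketch of that count-pinning
idea, modeled on the existing __swap_duplicate() logic (swap_pin_entry()
is a made-up name here, and the locking helpers are used in the style of
mm/swapfile.c):

static int swap_pin_entry(swp_entry_t entry)
{
        struct swap_info_struct *si = swp_swap_info(entry);
        unsigned long offset = swp_offset(entry);
        struct swap_cluster_info *ci;
        int err = 0;

        /* 1. Lock the cluster (or the device) covering this entry. */
        ci = lock_cluster_or_swap_info(si, offset);
        /*
         * 2. Bail out unless the raw count is still exactly 1, i.e. we
         * hold the only reference and SWAP_HAS_CACHE is not set.
         */
        if (si->swap_map[offset] != 1)
                err = -EEXIST;
        else
                /* 3. Take an extra reference to pin the entry. */
                si->swap_map[offset] = 2;
        unlock_cluster_or_swap_info(si, ci);
        return err;
}

The matching release on the swap free side would then need to drop this
extra reference and also handle the swapcache cleanup mentioned above.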
Huang, Ying Feb. 8, 2024, 6:34 a.m. UTC | #20
Kairui Song <ryncsn@gmail.com> writes:

> On Thu, Feb 8, 2024 at 2:31 AM Minchan Kim <minchan@kernel.org> wrote:
>>
>> On Wed, Feb 07, 2024 at 12:06:15PM +0800, Kairui Song wrote:

[snip]

>> >
>> > So I think the thing is, it's getting complex because this patch
>> > wanted to make it simple and just reuse the swap cache flags.
>>
>> I agree that a simple fix would be the important thing at this point.
>>
>> Considering your description, here's my understanding of the other idea:
>> other methods, such as increasing the swap count, haven't proven effective
>> in your tests, and that approach risks forcing racers to rely on the swap
>> cache again, with a potential performance loss in the race scenario.
>>
>> While I understand that simplicity is important, and the performance loss
>> in this case may be infrequent, I believe the swap_count approach could be
>> a suitable solution. What do you think?
>
> Hi Minchan
>
> Yes, my main concern was about simplicity and performance.
>
> Increasing swap_count here will also race with another process
> releasing swap_count to 0 (the swapcache was able to sync callers on
> other call paths, but we skipped the swapcache here).

What is the consequence of the race condition?

> So the right steps are: 1. lock the cluster/swap lock; 2. check that
> swap_count is still 1, bail out if not; 3. set it to 2.
> __swap_duplicate can be modified to support this; it's similar to the
> existing logic for SWAP_HAS_CACHE.
>
> And the swap freeing path will have to do more: swapcache cleanup needs
> to be handled even in the bypassing path, since the racer may add the
> entry to the swapcache.
>
> Reusing SWAP_HAS_CACHE seems to make it much simpler and avoids much of
> that overhead, so I used that way in this patch; the only issue is
> potentially repeated page faults now.
>
> I'm currently trying to add a SWAP_MAP_LOCK (or SWAP_MAP_SYNC, I'm bad
> at naming it) special value, so any racer can just spin on it to avoid
> all these problems. What do you think about this?

Let's try some simpler method firstly.

--
Best Regards,
Huang, Ying
Barry Song Feb. 8, 2024, 7:16 a.m. UTC | #21
On Thu, Feb 8, 2024 at 2:05 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Thu, Feb 8, 2024 at 2:31 AM Minchan Kim <minchan@kernel.org> wrote:
> >
> > On Wed, Feb 07, 2024 at 12:06:15PM +0800, Kairui Song wrote:
> > > On Wed, Feb 7, 2024 at 12:02 PM Chris Li <chrisl@kernel.org> wrote:
> > > >
> > > > On Tue, Feb 6, 2024 at 6:21 PM Kairui Song <ryncsn@gmail.com> wrote:
> > > > >
> > > > > On Wed, Feb 7, 2024 at 10:03 AM Chris Li <chrisl@kernel.org> wrote:
> > > > > >
> > > > > > On Tue, Feb 6, 2024 at 4:43 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > >
> > > > > > > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote:
> > > > > > > >
> > > > > > > > Hi Kairui,
> > > > > > > >
> > > > > > > > Sorry replying to your patch V1 late, I will reply on the V2 thread.
> > > > > > > >
> > > > > > > > On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > [snip]
>
> Hi Minchan
>
> Yes, my main concern was about simplicity and performance.
>
> Increasing swap_count here will also race with another process
> releasing swap_count to 0 (the swapcache was able to sync callers on
> other call paths, but we skipped the swapcache here).

Just like a refcount, releasing to 0 should be ok; the last one who sets
the refcount to 0 will finally release the swap entry.

Increasing swap_count from 1 to 2 might cause concurrent threads to
go into the non-sync-IO path (the full swapcache path), which is what
Minchan's original patch wanted to avoid. I am not sure if this will
negatively affect performance. If not, it seems a feasible way.
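
For reference, the gate that decides which path a fault takes is the
check quoted in the diff earlier in the thread; a racer that sees the
count already raised to 2 would simply take the other branch (sketch,
details elided):

        if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
            __swap_count(entry) == 1) {
                /* swapcache-bypassing path: alloc folio, swap_read_folio() */
        } else {
                /*
                 * A racer that sees swap_count == 2 lands here and goes
                 * through swapin_readahead() and the full swapcache path.
                 */
        }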

Thanks
Barry
Kairui Song Feb. 8, 2024, 7:01 p.m. UTC | #22
On Thu, Feb 8, 2024 at 2:36 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > On Thu, Feb 8, 2024 at 2:31 AM Minchan Kim <minchan@kernel.org> wrote:
> >>
> >> On Wed, Feb 07, 2024 at 12:06:15PM +0800, Kairui Song wrote:
>
> [snip]
>
> >> >
> >> > So I think the thing is, it's getting complex because this patch
> >> > wanted to make it simple and just reuse the swap cache flags.
> >>
> >> I agree that a simple fix would be the important thing at this point.
> >>
> >> Considering your description, here's my understanding of the other idea:
> >> other methods, such as increasing the swap count, haven't proven effective
> >> in your tests, and that approach risks forcing racers to rely on the swap
> >> cache again, with a potential performance loss in the race scenario.
> >>
> >> While I understand that simplicity is important, and the performance loss
> >> in this case may be infrequent, I believe the swap_count approach could be
> >> a suitable solution. What do you think?
> >
> > Hi Minchan
> >
> > Yes, my main concern was about simplicity and performance.
> >
> > Increasing swap_count here will also race with another process
> > releasing swap_count to 0 (the swapcache was able to sync callers on
> > other call paths, but we skipped the swapcache here).
>
> What is the consequence of the race condition?

Hi Ying,

It will increase the swap count of an already freed entry. This races
with the swap free/alloc logic that checks whether count ==
SWAP_HAS_CACHE or sets the count to zero, and with repeated frees of an
entry; all of these result in random corruption of the swap map. This
happened a lot during stress testing.

>
> > So the right steps are: 1. lock the cluster/swap lock; 2. check that
> > swap_count is still 1, bail out if not; 3. set it to 2.
> > __swap_duplicate can be modified to support this; it's similar to the
> > existing logic for SWAP_HAS_CACHE.
> >
> > And the swap freeing path will have to do more: swapcache cleanup needs
> > to be handled even in the bypassing path, since the racer may add the
> > entry to the swapcache.
> >
> > Reusing SWAP_HAS_CACHE seems to make it much simpler and avoids much of
> > that overhead, so I used that way in this patch; the only issue is
> > potentially repeated page faults now.
> >
> > I'm currently trying to add a SWAP_MAP_LOCK (or SWAP_MAP_SYNC, I'm bad
> > at naming it) special value, so any racer can just spin on it to avoid
> > all these problems. What do you think about this?
>
> Let's try some simpler method firstly.

Another, simpler idea: add a schedule() or
schedule_timeout_uninterruptible(1) in the swapcache_prepare failure
path before the goto out (just like __read_swap_cache_async does). I
think this should ensure that, in almost all cases, the PTE is ready
after we return, and it also yields the CPU.
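
Concretely, on top of the hunk quoted earlier in the thread, that would
be something like this (untested sketch):

        if (swapcache_prepare(entry)) {
                /*
                 * The racer that set SWAP_HAS_CACHE is doing the swapin;
                 * give it a tick so the PTE is likely installed before
                 * this fault returns and gets retried.
                 */
                schedule_timeout_uninterruptible(1);
                goto out;
        }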
Chris Li Feb. 8, 2024, 7:42 p.m. UTC | #23
On Thu, Feb 8, 2024 at 11:01 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Thu, Feb 8, 2024 at 2:36 PM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Kairui Song <ryncsn@gmail.com> writes:
> >
> > > On Thu, Feb 8, 2024 at 2:31 AM Minchan Kim <minchan@kernel.org> wrote:
> > >>
> > >> On Wed, Feb 07, 2024 at 12:06:15PM +0800, Kairui Song wrote:
> >
> > [snip]
> >
> > >> >
> > >> > So I think the thing is, it's getting complex because this patch
> > >> > wanted to make it simple and just reuse the swap cache flags.
> > >>
> > >> I agree that a simple fix would be the important thing at this point.
> > >>
> > >> Considering your description, here's my understanding of the other idea:
> > >> other methods, such as increasing the swap count, haven't proven effective
> > >> in your tests, and that approach risks forcing racers to rely on the swap
> > >> cache again, with a potential performance loss in the race scenario.
> > >>
> > >> While I understand that simplicity is important, and the performance loss
> > >> in this case may be infrequent, I believe the swap_count approach could be
> > >> a suitable solution. What do you think?
> > >
> > > Hi Minchan
> > >
> > > Yes, my main concern was about simplicity and performance.
> > >
> > > Increasing swap_count here will also race with another process
> > > releasing swap_count to 0 (the swapcache was able to sync callers on
> > > other call paths, but we skipped the swapcache here).
> >
> > What is the consequence of the race condition?
>
> Hi Ying,
>
> It will increase the swap count of an already freed entry. This races
> with the swap free/alloc logic that checks whether count ==
> SWAP_HAS_CACHE or sets the count to zero, and with repeated frees of an
> entry; all of these result in random corruption of the swap map. This
> happened a lot during stress testing.

In theory, the original code before your patch can get into a
situation similar to what you try to avoid.
CPU1 enters the do_swap_page() with entry swap count == 1 and
continues handling the swap fault without swap cache.  Then some
operation bumps up the swap entry count and CPU2 enters the
do_swap_page() racing with CPU1 with swap count == 2. CPU2 will need
to go through the swap cache case.  We still need to handle this
situation correctly.
So the complexity is already there.

If we can make sure the above case works correctly, then one way to
avoid the problem is to just send CPU2 through the swap cache (without
the swapcache bypassing).



>
> >
> > > [snip]
> >
> > Let's try some simpler method firstly.
>
> Another, simpler idea: add a schedule() or
> schedule_timeout_uninterruptible(1) in the swapcache_prepare failure
> path before the goto out (just like __read_swap_cache_async does). I
> think this should ensure that, in almost all cases, the PTE is ready
> after we return, and it also yields the CPU.

I mentioned this in my earlier email, and Ying pointed it out as well.
Looping and waiting inside do_swap_page() is bad because it is holding
other locks. Sorry I started this idea, but it seems to be no good. If we
can have CPU2 make forward progress without retrying the page fault, that
would be best.

Chris
Kairui Song Feb. 9, 2024, 5:30 a.m. UTC | #24
On Fri, Feb 9, 2024 at 3:42 AM Chris Li <chrisl@kernel.org> wrote:
>
> [snip]
>
> In theory, the original code before your patch can get into a
> situation similar to what you try to avoid.
> CPU1 enters the do_swap_page() with entry swap count == 1 and
> continues handling the swap fault without swap cache.  Then some
> operation bumps up the swap entry count and CPU2 enters the
> do_swap_page() racing with CPU1 with swap count == 2. CPU2 will need
> to go through the swap cache case.  We still need to handle this
> situation correctly.

Hi Chris,

This won't happen; nothing can bump the swap entry count unless the
entry is swapped in and freed. There are only two places that call
swap_duplicate: unmap and fork. Unmap needs the page mapped and the
entry allocated, so it won't happen unless we hit the entry reuse
issue, and fork needs the VMA lock, which we hold during the page fault.

> So the complexity is already there.
>
> If we can make sure the above case works correctly, then one way to
> avoid the problem is just to send CPU2 to use the swap cache (without
> the swap cache bypassing).

Yes, more auditing of the existing code and more explanation are needed
to ensure things won't go wrong; that's the reason I tried to keep
things from getting too complex...

> > > > [snip]
> > >
> > > Let's try some simpler method first.
> >
> > Another simpler idea is to add a schedule() or
> > schedule_timeout_uninterruptible(1) in the swapcache_prepare failure
> > path before the goto out (just like __read_swap_cache_async). I think
> > this should ensure that in almost all cases the PTE is ready after it
> > returns, and it also yields the CPU.
>
> I mentioned this in my earlier email, and Ying pointed it out as well.
> Looping and waiting inside do_swap_page() is bad because it is holding
> other locks.

It's not looping here though, just a tiny delay; SWP_SYNCHRONOUS_IO
devices are supposed to be super fast, so a tiny delay should be
enough.

> Sorry I started this idea, but it seems to be no good.

Not at all, more reviewing helps to find a better solution :)

> If we can have CPU2 make forward progress without retrying the page
> fault, that would be best.

Yes, making CPU2 fall back to the cached swapin path is doable after
careful auditing. But CPU2 is usually slower than CPU1 due to cache and
timing, so whatever it does will most likely be in vain and need to be
reverted, causing more work in both code logic and CPU time. The case
of the race (CPU2 being faster) is very rare.

I'm not against the idea of bumping the count; it would be better if it
could be done without introducing too much noise. I will come back
after more tests and work on this.
Kairui Song Feb. 12, 2024, 7:53 p.m. UTC | #25
On Fri, Feb 9, 2024 at 1:30 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> [snip]

Hi,

After some more testing, I still think bumping the swap count is a bad
idea here. Besides the issues I mentioned above, adding the folio into
the swap cache is fundamentally in conflict with the problem we are
trying to solve: a folio in the swapcache can be swapped out without
allocating a new entry, so the entry reuse issue would be much more
serious.

If we want to avoid this, we have to release the cache unconditionally
before folio unlock, because we can't tell whether a parallel
cache-bypassing swapin is going on or not. This is unacceptable; it
breaks normal usage of the swap cache. There are also other potential
issues, like readahead, or races causing stale folios to remain in the
swap cache, so I don't think this is a good approach.

Instead, if we add a schedule() or schedule_timeout_uninterruptible(1),
it seems quite enough to avoid repeated page faults. I did the
following test with the reproducer I provided in the commit message:

Using ZRAM instead of brd for a more realistic workload:
$ perf stat -e minor-faults,major-faults -I 30000 ./a.out

Unpatched kernel:
#           time             counts unit events
    30.000096504             33,089      minor-faults:u
    30.000096504            958,745      major-faults:u
    60.000375308             22,150      minor-faults:u
    60.000375308          1,221,665      major-faults:u
    90.001062176             24,840      minor-faults:u
    90.001062176          1,005,427      major-faults:u
   120.004233268             23,634      minor-faults:u
   120.004233268          1,045,596      major-faults:u
   150.004530091             22,358      minor-faults:u
   150.004530091          1,097,871      major-faults:u

Patched kernel:
#           time             counts unit events
    30.000069753            280,094      minor-faults:u
    30.000069753          1,198,871      major-faults:u
    60.000415915            436,862      minor-faults:u
    60.000415915            919,860      major-faults:u
    90.000651670            359,176      minor-faults:u
    90.000651670            955,340      major-faults:u
   120.001088135            289,137      minor-faults:u
   120.001088135            992,317      major-faults:u

Indeed there is a huge increase in minor page faults, which indicates
the raced path returned without handling the fault. This reproducer
tries to maximize the race chance, so the numbers look more alarming
than usual.

But after adding a schedule()/schedule_timeout_uninterruptible(1) after
swapcache_prepare fails, still using the same test:

Patched kernel (adding a schedule() before goto out):
#           time             counts unit events
    30.000335901             43,066      minor-faults:u
    30.000335901          1,163,232      major-faults:u
    60.034629687             35,652      minor-faults:u
    60.034629687            844,188      major-faults:u
    90.034941472             34,957      minor-faults:u
    90.034941472          1,218,174      major-faults:u
   120.035255386             36,241      minor-faults:u
   120.035255386          1,073,395      major-faults:u
   150.035535786             33,057      minor-faults:u
   150.035535786          1,171,684      major-faults:u

Patched kernel (adding a schedule_timeout_uninterruptible(1) before goto out):
#           time             counts unit events
    30.000044452             26,748      minor-faults:u
    30.000044452          1,202,064      major-faults:u
    60.000317379             19,883      minor-faults:u
    60.000317379          1,186,152      major-faults:u
    90.000568779             18,333      minor-faults:u
    90.000568779          1,151,015      major-faults:u
   120.000787253             22,844      minor-faults:u
   120.000787253            887,531      major-faults:u
   150.001309065             18,900      minor-faults:u
   150.001309065          1,211,894      major-faults:u

The minor faults are basically the same as before, or even lower since
other races are also reduced, with no observable performance
regression so far.
If the vague guarantee provided by this schedule call is still
concerning, I think another approach is just a new swap map value for
parallel cache-bypassed swapin to loop on.
Minchan Kim Feb. 15, 2024, 12:44 a.m. UTC | #26
Hi Kairui,

On Tue, Feb 13, 2024 at 03:53:09AM +0800, Kairui Song wrote:
> [snip]
> 
> Hi,
> 
> After some more testing, I still think bumping the swap count is a bad
> idea here. Besides the issues I mentioned above, adding the folio into
> the swap cache is fundamentally in conflict with the problem we are
> trying to solve: a folio in the swapcache can be swapped out without
> allocating a new entry, so the entry reuse issue would be much more
> serious.
> 
> If we want to avoid this, we have to release the cache unconditionally
> before folio unlock, because we can't tell whether a parallel
> cache-bypassing swapin is going on or not. This is unacceptable; it
> breaks normal usage of the swap cache. There are also other potential
> issues, like readahead, or races causing stale folios to remain in the
> swap cache, so I don't think this is a good approach.

I also played with the swap refcount ideas and encountered a lot of
problems (e.g., kernel crashes and/or data loss).

Now I realize your original solution would be the best fix given how
small the change is.

Folks, please chime in if you have another idea.


> 
> Instead, if we add a schedule() or schedule_timeout_uninterruptible(1),

How does your workload behave if we use schedule() instead of
schedule_timeout_uninterruptible(1)? If that doesn't increase the
statistics a lot, I prefer schedule().
(TBH, I don't care much about the stat since it would be
very rare.)


> [snip]
> 
> The minor faults are basically the same as before, or even lower since
> other races are also reduced, with no observable performance
> regression so far.
> If the vague guarantee provided by this schedule call is still
> concerning, I think another approach is just a new swap map value for
> parallel cache-bypassed swapin to loop on.

Long term, the swap map value would be good, but for now I prefer
your original solution with some delay via schedule.

Thanks, Kairui!
David Hildenbrand Feb. 15, 2024, 3:36 p.m. UTC | #27
On 06.02.24 19:25, Kairui Song wrote:
> [snip]
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 4db00ddad261..8d28f6091a32 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp)
>   	return 0;
>   }
>   
> +static inline int swapcache_prepare(swp_entry_t swp)
> +{
> +	return 0;
> +}
> +
>   static inline void swap_free(swp_entry_t swp)
>   {
>   }
> diff --git a/mm/memory.c b/mm/memory.c
> index 7e1f4849463a..1749c700823d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   	if (!folio) {
>   		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>   		    __swap_count(entry) == 1) {
> +			/*
> +			 * Prevent parallel swapin from proceeding with
> +			 * the cache flag. Otherwise, another thread may
> +			 * finish swapin first, free the entry, and swapout
> +			 * reusing the same entry. It's undetectable as
> +			 * pte_same() returns true due to entry reuse.
> +			 */
> +			if (swapcache_prepare(entry))
> +				goto out;
> +

Is there anything that guarantees that the following won't
happen concurrently, and if it would happen, could it be a problem?

Other thread MADV_DONTNEED's the swap entry, swap slot is freed.
Some other page gets swapped out, reuses that swap slot.

We call swapcache_prepare() on that slot that is being reused
elsewhere. (Possibly some other thread, in the context of the reused
swap slot, might do the same!)

We would detect later that the PTE changed, but we would temporarily
mess with a swap slot that we might no longer "own".


I was thinking about alternatives; it's tricky because of the concurrent
MADV_DONTNEED possibility. Something with another fake-swap entry type
(similar to migration entries) might work, but would require more changes.
Kairui Song Feb. 15, 2024, 6:49 p.m. UTC | #28
On Thu, Feb 15, 2024 at 11:36 PM David Hildenbrand <david@redhat.com> wrote:
> On 06.02.24 19:25, Kairui Song wrote:
> > [snip]
>
> Is there anything that guarantees that the following won't
> happen concurrently, and if it would happen, could it be a problem?
>
> Other thread MADV_DONTNEED's the swap entry, swap slot is freed.
> Some other page gets swapped out, reuses that swap slot.

Hi David,

Thanks for adding more comments and sharing your thoughts!

I'm not sure what you mean by "reuses that swap slot" here; I think
you just mean reusing that swap entry (and the actual content on the
swap device)?

> We call swapcache_prepare() on that slot that is being reused
> elsewhere. (possibly some other thread in the context of the reuses
> swap slot might do the same!)

I think this kind of swapcache_prepare() or false swapin read is
already happening quite frequently with swap readahead. I've seen swap
cluster readahead mess up working set activation and memory policy
already. Swap cluster readahead simply reads in the entries near the
target entry, regardless of whether they are owned by the reader or
not.

Similar issues also exist for this patch. I think it only hurts
performance, and that's a really rare thing to happen, so it should
not be a problem.

>
> We would detect later, that the PTE changed, but we would temporarily
> mess with that swap slot that we might no longer "own".
>
> I was thinking about alternatives, it's tricky because of the concurrent
> MADV_DONTNEED possibility. Something with another fake-swap entry type
> (similar to migration entries) might work, but would require more changes.

Yeah, in the long term I also think more work is needed for the swap subsystem.

In my opinion, for this particular issue, or for cache-bypassed swapin
in general, a new swap map value similar to SWAP_MAP_BAD/SWAP_MAP_SHMEM
might be needed. That may even help simplify the swap count release
routine for cache-bypassed swapin, and improve performance.
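
For reference, the special values being referred to live in the first
byte of each swap_map slot in include/linux/swap.h; a sketch of where a
new one could sit (SWAP_MAP_SYNC is purely hypothetical here, both the
name and the chosen value are made up for illustration):

	#define SWAP_MAP_MAX	0x3e	/* Max duplication count */
	#define SWAP_MAP_BAD	0x3f	/* Note page is bad */
	#define SWAP_HAS_CACHE	0x40	/* Flag page is cached */
	#define SWAP_MAP_SHMEM	0xbf	/* Owned by shmem/tmpfs */

	/* Hypothetical: entry pinned by a cache-bypassing swapin. */
	#define SWAP_MAP_SYNC	0xbe

A racer that reads SWAP_MAP_SYNC from the swap map could then wait for
the owner directly instead of returning and taking the fault again.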
Kairui Song Feb. 15, 2024, 7:07 p.m. UTC | #29
On Thu, Feb 15, 2024 at 8:44 AM Minchan Kim <minchan@kernel.org> wrote:
>
> Hi Kairui,
>

Hi, Minchan,

>
> I also played with the swap refcount ideas and encountered a lot of
> problems (e.g., kernel crashes and/or data loss).
>
> Now I realize your original solution would be the best fix given how
> small the change is.
>
> Folks, please chime in if you have another idea.
>
> >
> > Instead, if we add a schedule() or schedule_timeout_uninterruptible(1),
>
> How does your workload behave if we use schedule() instead of
> schedule_timeout_uninterruptible(1)? If that doesn't increase the
> statistics a lot, I prefer schedule().
> (TBH, I don't care much about the stat since it would be
> very rare.)

I've tested both schedule() and schedule_timeout_uninterruptible(1)
locally using the reproducer, as well as some other cases, and all
looked good.

The reproducer I provided demonstrates an extreme case, so I think
schedule() is good enough already.

I currently don't have any more benchmarks or tests, as the change is
very small for most workloads. I'll use schedule() here in V3 if no
one else has other suggestions.

I remember Barry's series about large folio swapin may change the ZRAM
read time depending on folio size, and cause strange races that are
different from the reproducer. I'm not sure if I can test using some
known race cases; that could help verify this fix and show whether
schedule() or schedule_timeout_uninterruptible(1) is better here.

> > [snip]
> >
> > The minor faults are basically the same as before, or even lower since
> > other races are also reduced, with no observable performance
> > regression so far.
> > If the vague guarantee provided by this schedule call is still
> > concerning, I think another approach is just a new swap map value for
> > parallel cache-bypassed swapin to loop on.
>
> Long term, the swap map value would be good, but for now I prefer
> your original solution with some delay via schedule.

Yes, that's also what I have in mind.

With a swap map value, I actually think the cache-bypassed path may
even get faster, as it can simplify some of the swap map value
transitions, but that requires quite a bit of extra work and change; it
could be introduced later as an optimization.

> Thanks, Kairui!

Thanks for the comments!
David Hildenbrand Feb. 15, 2024, 8:03 p.m. UTC | #30
On 15.02.24 19:49, Kairui Song wrote:
> On Thu, Feb 15, 2024 at 11:36 PM David Hildenbrand <david@redhat.com> wrote:
>> On 06.02.24 19:25, Kairui Song wrote:
>>> [snip]
>>
>> Is there anything that guarantees that the following won't
>> happen concurrently, and if it would happen, could it be a problem?
>>
>> Other thread MADV_DONTNEED's the swap entry, swap slot is freed.
>> Some other page gets swapped out, reuses that swap slot.
> 
> Hi David,
> 
> Thanks for adding more comments and sharing your thoughts!
> 
> I'm not sure what you mean by "reuses that swap slot" here; I think
> you just mean reusing that swap entry (and the actual content on the
> swap device)?

I tried to be precise but I think that caused more confusion :)

entry = device + offset into swap_map

I called a "swap slot" the metadata that is identified by device + 
offset (swap_map).

I recall that's what we used to call these things [1], e.g., "Allocating 
swap slots ... All page sized slots are tracked by the array 
swap_info_struct→swap_map."

[1] https://www.kernel.org/doc/gorman/html/understand/understand014.html

Anyhow, you got what I mean :)

> 
>> We call swapcache_prepare() on that slot that is being reused
>> elsewhere. (Possibly some other thread, in the context of the reused
>> swap slot, might do the same!)
> 
> I think this kind of swapcache_prepare() or false swapin read is
> already happening quite frequently with swap readahead. I've seen swap
> cluster readahead mess up working set activation and memory policy
> already. Swap cluster readahead simply reads in the entries near the
> target entry, regardless of whether they are owned by the reader or
> not.

Okay, thanks for the confirmation!
> 
> Similar issues also exist for this patch. I think it only hurts
> performance, and that's a really rare thing to happen, so it should
> not be a problem.

Okay, good.

> 
>>
>> We would detect later that the PTE changed, but we would temporarily
>> mess with a swap slot that we might no longer "own".
>>
>> I was thinking about alternatives; it's tricky because of the concurrent
>> MADV_DONTNEED possibility. Something with another fake-swap entry type
>> (similar to migration entries) might work, but would require more changes.
> 
> Yeah, in the long term I also think more work is needed for the swap subsystem.
> 
> In my opinion, for this particular issue, or for cache-bypassed swapin
> in general, a new swap map value similar to SWAP_MAP_BAD/SWAP_MAP_SHMEM
> might be needed. That may even help simplify the swap count release
> routine for cache-bypassed swapin, and improve performance.

The question is whether we really want to track that in the swapcache
rather than in the page table.

Imagine the following:

(1) allocate the folio and lock it (we do that already)

(2) take the page table lock. If the PTE is still the same, insert a new 
"swapin_in_process" fake swp entry that references the locked folio.

(3) read the folio from swap. This will unlock the folio IIUC. (we do 
that already)

(4) relock the folio. (we do that already, might not want to fail)

(5) take the PTE lock. If the PTE did not change, turn it into a present
PTE entry. Otherwise, clean up.


Any concurrent swap-in users would spot the new "swapin_in_process" fake 
swp entry and wait for the page lock (just like we do with migration 
entries).

Zap code would mostly only clear the "swapin_in_process" fake swp entry 
and leave the cleanup to (5) above. Fortunately, concurrent fork() is 
impossible as that cannot race with page faults.

There might be one minor thing to optimize with the folio lock above. 
But in essence, it would work just like migration entries, just that 
they are installed only while we actually do read the content from disk etc.
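
To make that concrete, here is a rough sketch. Every name below 
(SWP_SWAPIN_IN_PROCESS, make_swapin_entry(), is_swapin_entry(), 
swapin_entry_wait()) is invented for illustration; nothing like it 
exists in the tree today:

  /* swapops.h: a new software swap type, next to the migration types */
  static inline swp_entry_t make_swapin_entry(struct folio *folio)
  {
          return swp_entry(SWP_SWAPIN_IN_PROCESS, folio_pfn(folio));
  }

  static inline bool is_swapin_entry(swp_entry_t entry)
  {
          return swp_type(entry) == SWP_SWAPIN_IN_PROCESS;
  }

  /* do_swap_page(): racers sleep on the folio lock, as with migration */
  if (unlikely(non_swap_entry(entry))) {
          if (is_swapin_entry(entry)) {
                  swapin_entry_wait(vma->vm_mm, vmf->pmd, vmf->address);
                  goto out;
          }
          ...
  }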
Minchan Kim Feb. 15, 2024, 8:55 p.m. UTC | #31
Hi David,

On Thu, Feb 15, 2024 at 09:03:28PM +0100, David Hildenbrand wrote:

< snip >

> > > 
> > > We would detect later, that the PTE changed, but we would temporarily
> > > mess with that swap slot that we might no longer "own".
> > > 
> > > I was thinking about alternatives, it's tricky because of the concurrent
> > > MADV_DONTNEED possibility. Something with another fake-swap entry type
> > > (similar to migration entries) might work, but would require more changes.
> > 
> > Yeah, in the long term I also think more work is needed for the swap subsystem.
> > 
> > In my opinion, for this particular issue, or for cache-bypassed
> > swapin in general, a new swap map value similar to SWAP_MAP_BAD/SWAP_MAP_SHMEM
> > might be needed; that may even help simplify the swap count release
> > routine for cache-bypassed swapin, and improve performance.
> 
> The question is whether we really want to track that in the swapcache
> rather than in the page table.
> 
> Imagine the following:
> 
> (1) allocate the folio and lock it (we do that already)
> 
> (2) take the page table lock. If the PTE is still the same, insert a new
> "swapin_in_process" fake swp entry that references the locked folio.
> 
> (3) read the folio from swap. This will unlock the folio IIUC. (we do that
> already)
> 
> (4) relock the folio. (we do that already, might not want to fail)
> 
> (5) take the PTE lock. If the PTE did not change, turn it into a present PTE
> entry. Otherwise, cleanup.
> 
> 
> Any concurrent swap-in users would spot the new "swapin_in_process" fake swp
> entry and wait for the page lock (just like we do with migration entries).
> 
> Zap code would mostly only clear the "swapin_in_process" fake swp entry and
> leave the cleanup to (5) above. Fortunately, concurrent fork() is impossible
> as that cannot race with page faults.
> 
> There might be one minor thing to optimize with the folio lock above. But in
> essence, it would work just like migration entries, just that they are
> installed only while we actually do read the content from disk etc.

That's a great idea. I was thinking of having the synchronization in the
page table but couldn't arrive at the non_swap_entry idea.

My only concern with the approach is that it would be harder to get the
fix into the stable tree. If there isn't a strong objection, I prefer
Kairui's original solution (with some tweak of the scheduler if
necessary) first, and then pursuing your idea on the latest tree.
Andrew Morton Feb. 15, 2024, 10:58 p.m. UTC | #32
On Thu, 15 Feb 2024 12:55:11 -0800 Minchan Kim <minchan@kernel.org> wrote:

> My only concern with the approach is that it would be harder to get the
> fix into the stable tree. If there isn't a strong objection, I prefer
> Kairui's original solution (with some tweak of the scheduler if
> necessary) first, and then pursuing your idea on the latest tree.

Do we agree that this fix is needed in -stable?  For some reason I
don't have a cc:stable in the changelog.

I'd like to move this patch into mm-hotfixes-stable in order to get it
upstream very soon.  Are we all agreeable with that?  I don't have an
Acked-by: David?

I have a note (which dates back to an earlier version) that Barry would
be performing runtime testing.  Has that occurred?

Thanks.


Here's what is presently in mm-hotfixes-unstable:



From: Kairui Song <kasong@tencent.com>
Subject: mm/swap: fix race when skipping swapcache
Date: Wed, 7 Feb 2024 02:25:59 +0800

When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads
swapin the same entry at the same time, they get different pages (A, B). 
Before one thread (T0) finishes the swapin and installs page (A) to the
PTE, another thread (T1) could finish swapin of page (B), swap_free the
entry, then swap out the possibly modified page reusing the same entry. 
It breaks the pte_same check in (T0) because PTE value is unchanged,
causing ABA problem.  Thread (T0) will install a stalled page (A) into the
PTE and cause data corruption.

One possible callstack is like this:

CPU0                                 CPU1
----                                 ----
do_swap_page()                       do_swap_page() with same entry
<direct swapin path>                 <direct swapin path>
<alloc page A>                       <alloc page B>
swap_read_folio() <- read to page A  swap_read_folio() <- read to page B
<slow on later locks or interrupt>   <finished swapin first>
...                                  set_pte_at()
                                     swap_free() <- entry is free
                                     <write to page B, now page A stalled>
                                     <swap out page B to same swap entry>
pte_same() <- Check pass, PTE seems
              unchanged, but page A
              is stalled!
swap_free() <- page B content lost!
set_pte_at() <- staled page A installed!

And besides, for ZRAM, swap_free() allows the swap device to discard the
entry content, so even if page (B) is not modified, if swap_read_folio()
on CPU0 happens later than swap_free() on CPU1, it may also cause data
loss.

To fix this, reuse swapcache_prepare which will pin the swap entry using
the cache flag, and allow only one thread to pin it.  Release the pin
after PT unlocked.  Racers will simply busy wait since it's a rare and
very short event.

Other methods like increasing the swap count don't seem to be a good idea
after some tests, that will cause racers to fall back to use the swap
cache again.  Parallel swapin using different methods leads to a much more
complex scenario.

Reproducer:

This race issue can be triggered easily using a well constructed
reproducer and patched brd (with a delay in read path) [1]:

With latest 6.8 mainline, race caused data loss can be observed easily:
$ gcc -g -lpthread test-thread-swap-race.c && ./a.out
  Polulating 32MB of memory region...
  Keep swapping out...
  Starting round 0...
  Spawning 65536 workers...
  32746 workers spawned, wait for done...
  Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss!
  Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss!
  Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss!
  Round 0 Failed, 15 data loss!

This reproducer spawns multiple threads sharing the same memory region
using a small swap device. Every two threads updates mapped pages one by
one in opposite direction trying to create a race, with one dedicated
thread keep swapping out the data out using madvise.

The reproducer created a reproduce rate of about once every 5 minutes,
so the race should be totally possible in production.

After this patch, I ran the reproducer for over a few hundred rounds
and no data loss observed.

Performance overhead is minimal, microbenchmark swapin 10G from 32G
zram:

Before:     10934698 us
After:      11157121 us
Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag)

Link: https://lkml.kernel.org/r/20240206182559.32264-1-ryncsn@gmail.com
Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device")
Reported-by: "Huang, Ying" <ying.huang@intel.com>
Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/
Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1]
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Barry Song <21cnbao@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/swap.h |    5 +++++
 mm/memory.c          |   15 +++++++++++++++
 mm/swap.h            |    5 +++++
 mm/swapfile.c        |   13 +++++++++++++
 4 files changed, 38 insertions(+)

--- a/include/linux/swap.h~mm-swap-fix-race-when-skipping-swapcache
+++ a/include/linux/swap.h
@@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_ent
 	return 0;
 }
 
+static inline int swapcache_prepare(swp_entry_t swp)
+{
+	return 0;
+}
+
 static inline void swap_free(swp_entry_t swp)
 {
 }
--- a/mm/memory.c~mm-swap-fix-race-when-skipping-swapcache
+++ a/mm/memory.c
@@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault
 	if (!folio) {
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
 		    __swap_count(entry) == 1) {
+			/*
+			 * Prevent parallel swapin from proceeding with
+			 * the cache flag. Otherwise, another thread may
+			 * finish swapin first, free the entry, and swapout
+			 * reusing the same entry. It's undetectable as
+			 * pte_same() returns true due to entry reuse.
+			 */
+			if (swapcache_prepare(entry))
+				goto out;
+
 			/* skip swapcache */
 			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
 						vma, vmf->address, false);
@@ -4116,6 +4126,9 @@ vm_fault_t do_swap_page(struct vm_fault
 unlock:
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
+	/* Clear the swap cache pin for direct swapin after PTL unlock */
+	if (folio && !swapcache)
+		swapcache_clear(si, entry);
 out:
 	if (si)
 		put_swap_device(si);
@@ -4124,6 +4137,8 @@ out_nomap:
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 out_page:
+	if (!swapcache)
+		swapcache_clear(si, entry);
 	folio_unlock(folio);
 out_release:
 	folio_put(folio);
--- a/mm/swapfile.c~mm-swap-fix-race-when-skipping-swapcache
+++ a/mm/swapfile.c
@@ -3365,6 +3365,19 @@ int swapcache_prepare(swp_entry_t entry)
 	return __swap_duplicate(entry, SWAP_HAS_CACHE);
 }
 
+void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
+{
+	struct swap_cluster_info *ci;
+	unsigned long offset = swp_offset(entry);
+	unsigned char usage;
+
+	ci = lock_cluster_or_swap_info(si, offset);
+	usage = __swap_entry_free_locked(si, offset, SWAP_HAS_CACHE);
+	unlock_cluster_or_swap_info(si, ci);
+	if (!usage)
+		free_swap_slot(entry);
+}
+
 struct swap_info_struct *swp_swap_info(swp_entry_t entry)
 {
 	return swap_type_to_swap_info(swp_type(entry));
--- a/mm/swap.h~mm-swap-fix-race-when-skipping-swapcache
+++ a/mm/swap.h
@@ -41,6 +41,7 @@ void __delete_from_swap_cache(struct fol
 void delete_from_swap_cache(struct folio *folio);
 void clear_shadow_from_swap_cache(int type, unsigned long begin,
 				  unsigned long end);
+void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry);
 struct folio *swap_cache_get_folio(swp_entry_t entry,
 		struct vm_area_struct *vma, unsigned long addr);
 struct folio *filemap_get_incore_folio(struct address_space *mapping,
@@ -97,6 +98,10 @@ static inline int swap_writepage(struct
 	return 0;
 }
 
+static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
+{
+}
+
 static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
 		struct vm_area_struct *vma, unsigned long addr)
 {
Kairui Song Feb. 16, 2024, 7:11 a.m. UTC | #33
On Fri, Feb 16, 2024 at 4:03 AM David Hildenbrand <david@redhat.com> wrote:
> On 15.02.24 19:49, Kairui Song wrote:
> > On Thu, Feb 15, 2024 at 11:36 PM David Hildenbrand <david@redhat.com> wrote:
> >> On 06.02.24 19:25, Kairui Song wrote:
> >>> From: Kairui Song <kasong@tencent.com>
> >>>
> >>> When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads
> >>> swapin the same entry at the same time, they get different pages (A, B).
> >>> Before one thread (T0) finishes the swapin and installs page (A)
> >>> to the PTE, another thread (T1) could finish swapin of page (B),
> >>> swap_free the entry, then swap out the possibly modified page
> >>> reusing the same entry. It breaks the pte_same check in (T0) because
> >>> PTE value is unchanged, causing ABA problem. Thread (T0) will
> >>> install a stalled page (A) into the PTE and cause data corruption.
> >>>
> >>> One possible callstack is like this:
> >>>
> >>> CPU0                                 CPU1
> >>> ----                                 ----
> >>> do_swap_page()                       do_swap_page() with same entry
> >>> <direct swapin path>                 <direct swapin path>
> >>> <alloc page A>                       <alloc page B>
> >>> swap_read_folio() <- read to page A  swap_read_folio() <- read to page B
> >>> <slow on later locks or interrupt>   <finished swapin first>
> >>> ...                                  set_pte_at()
> >>>                                        swap_free() <- entry is free
> >>>                                        <write to page B, now page A stalled>
> >>>                                        <swap out page B to same swap entry>
> >>> pte_same() <- Check pass, PTE seems
> >>>                 unchanged, but page A
> >>>                 is stalled!
> >>> swap_free() <- page B content lost!
> >>> set_pte_at() <- staled page A installed!
> >>>
> >>> And besides, for ZRAM, swap_free() allows the swap device to discard
> >>> the entry content, so even if page (B) is not modified, if
> >>> swap_read_folio() on CPU0 happens later than swap_free() on CPU1,
> >>> it may also cause data loss.
> >>>
> >>> To fix this, reuse swapcache_prepare which will pin the swap entry using
> >>> the cache flag, and allow only one thread to pin it. Release the pin
> >>> after PT unlocked. Racers will simply busy wait since it's a rare
> >>> and very short event.
> >>>
> >>> Other methods like increasing the swap count don't seem to be a good
> >>> idea after some tests, that will cause racers to fall back to use the
> >>> swap cache again. Parallel swapin using different methods leads to
> >>> a much more complex scenario.
> >>>
> >>> Reproducer:
> >>>
> >>> This race issue can be triggered easily using a well constructed
> >>> reproducer and patched brd (with a delay in read path) [1]:
> >>>
> >>> With latest 6.8 mainline, race caused data loss can be observed easily:
> >>> $ gcc -g -lpthread test-thread-swap-race.c && ./a.out
> >>>     Polulating 32MB of memory region...
> >>>     Keep swapping out...
> >>>     Starting round 0...
> >>>     Spawning 65536 workers...
> >>>     32746 workers spawned, wait for done...
> >>>     Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss!
> >>>     Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss!
> >>>     Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss!
> >>>     Round 0 Failed, 15 data loss!
> >>>
> >>> This reproducer spawns multiple threads sharing the same memory region
> >>> using a small swap device. Every two threads updates mapped pages one by
> >>> one in opposite direction trying to create a race, with one dedicated
> >>> thread keep swapping out the data out using madvise.
> >>>
> >>> The reproducer created a reproduce rate of about once every 5 minutes,
> >>> so the race should be totally possible in production.
> >>>
> >>> After this patch, I ran the reproducer for over a few hundred rounds
> >>> and no data loss observed.
> >>>
> >>> Performance overhead is minimal, microbenchmark swapin 10G from 32G
> >>> zram:
> >>>
> >>> Before:     10934698 us
> >>> After:      11157121 us
> >>> Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag)
> >>>
> >>> Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device")
> >>> Reported-by: "Huang, Ying" <ying.huang@intel.com>
> >>> Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/
> >>> Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1]
> >>> Signed-off-by: Kairui Song <kasong@tencent.com>
> >>> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> >>> Acked-by: Yu Zhao <yuzhao@google.com>
> >>>
> >>> ---
> >>> Update from V1:
> >>> - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song]
> >>> - Update comments make it cleaner [Huang, Ying]
> >>> - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park]
> >>> - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path" [Yu Zhao]
> >>> - Update commit message.
> >>> - Collect Review and Acks.
> >>>
> >>>    include/linux/swap.h |  5 +++++
> >>>    mm/memory.c          | 15 +++++++++++++++
> >>>    mm/swap.h            |  5 +++++
> >>>    mm/swapfile.c        | 13 +++++++++++++
> >>>    4 files changed, 38 insertions(+)
> >>>
> >>> diff --git a/include/linux/swap.h b/include/linux/swap.h
> >>> index 4db00ddad261..8d28f6091a32 100644
> >>> --- a/include/linux/swap.h
> >>> +++ b/include/linux/swap.h
> >>> @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp)
> >>>        return 0;
> >>>    }
> >>>
> >>> +static inline int swapcache_prepare(swp_entry_t swp)
> >>> +{
> >>> +     return 0;
> >>> +}
> >>> +
> >>>    static inline void swap_free(swp_entry_t swp)
> >>>    {
> >>>    }
> >>> diff --git a/mm/memory.c b/mm/memory.c
> >>> index 7e1f4849463a..1749c700823d 100644
> >>> --- a/mm/memory.c
> >>> +++ b/mm/memory.c
> >>> @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>        if (!folio) {
> >>>                if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >>>                    __swap_count(entry) == 1) {
> >>> +                     /*
> >>> +                      * Prevent parallel swapin from proceeding with
> >>> +                      * the cache flag. Otherwise, another thread may
> >>> +                      * finish swapin first, free the entry, and swapout
> >>> +                      * reusing the same entry. It's undetectable as
> >>> +                      * pte_same() returns true due to entry reuse.
> >>> +                      */
> >>> +                     if (swapcache_prepare(entry))
> >>> +                             goto out;
> >>> +
> >>
> >> Is there anything that guarantees that the following won't
> >> happen concurrently, and if it would happen, could it be a problem?
> >>
> >> Other thread MADV_DONTNEED's the swap entry, swap slot is freed.
> >> Some other page gets swapped out, reuses that swap slot.
> >
> > Hi David,
> >
> > Thanks for adding more comments and sharing your thoughts!
> >
> > I'm not sure what you mean by "reuses that swap slot" here; I think
> > you just mean reusing that swap entry (and the actual content on the
> > swap device)?
>
> I tried to be precise but I think that caused more confusion :)
>
> entry = device + offset into swap_map
>
> I called a "swap slot" the metadata that is identified by device +
> offset (swap_map).
>
> I recall that's what we used to call these things [1], e.g., "Allocating
> swap slots ... All page sized slots are tracked by the array
> swap_info_struct→swap_map."
>
> [1] https://www.kernel.org/doc/gorman/html/understand/understand014.html
>
> Anyhow, you got what I mean :)

Hi David,

That's good; I was confused about the naming of swap-related things :)

> >
> >> We call swapcache_prepare() on that slot that is being reused
> >> elsewhere. (possibly some other thread in the context of the reused
> >> swap slot might do the same!)
> >
> > I think this kind of swapcache_prepare() or false swapin read is
> > already happening quite frequently due to swap readahead. I've seen swap
> > cluster readahead mess up working set activation and memory policy
> > already. Swap cluster readahead simply reads in nearby entries of the
> > target entry, regardless of whether they are owned by the reader or
> > not.
>
> Okay, thanks for the confirmation!
> >
> > For this patch, similar issues also exist; I think it only hurts
> > performance, but that's a really rare thing to happen, so it should not
> > be a problem.
>
> Okay, good.
>
> >
> >>
> >> We would detect later, that the PTE changed, but we would temporarily
> >> mess with that swap slot that we might no longer "own".
> >>
> >> I was thinking about alternatives, it's tricky because of the concurrent
> >> MADV_DONTNEED possibility. Something with another fake-swap entry type
> >> (similar to migration entries) might work, but would require more changes.
> >
> > Yeah, in the long term I also think more work is needed for the swap subsystem.
> >
> > In my opinion, for this particular issue, or for cache-bypassed
> > swapin in general, a new swap map value similar to SWAP_MAP_BAD/SWAP_MAP_SHMEM
> > might be needed; that may even help simplify the swap count release
> > routine for cache-bypassed swapin, and improve performance.
>
> The question is whether we really want to track that in the swapcache
> rather than in the page table.
>
> Imagine the following:
>
> (1) allocate the folio and lock it (we do that already)
>
> (2) take the page table lock. If the PTE is still the same, insert a new
> "swapin_in_process" fake swp entry that references the locked folio.
>
> (3) read the folio from swap. This will unlock the folio IIUC. (we do
> that already)
>
> (4) relock the folio. (we do that already, might not want to fail)
>
> (5) take the PTE lock. If the PTE did not change, turn it into a present
> PTE entry. Otherwise, cleanup.

Very interesting idea!

I'm just not sure what actual benefit it brings. The only concern
about reusing swapcache_prepare so far is repeated page faults that
may hurt performance or statistics; this issue is basically gone after
adding a schedule().

We can't drop all the operations around the swap cache and swap map
anyway: the fault path doesn't know whether it should skip the
swapcache until the swapcache lookup and swap count lookup are both
done. So I think it can be done more naturally here with a special
value, making things simpler and more robust, and improving
performance a bit more.

And in another series [1] I'm working on making shmem use
cache-bypassed swapin as well; following this approach, I'd have to
implement another shmem-map-based synchronization too.

After all, it's only a rare race, so I think a simpler solution might be better.

[1] https://lore.kernel.org/all/20240129175423.1987-1-ryncsn@gmail.com/
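
For illustration, these are the existing special values in
include/linux/swap.h that such a new value would sit next to; the last
define is purely hypothetical, both the name and the encoding are made up:

  #define SWAP_MAP_MAX    0x3e    /* max count */
  #define SWAP_MAP_BAD    0x3f    /* note page is bad */
  #define SWAP_MAP_SHMEM  0xbf    /* owned by shmem/tmpfs */
  /* hypothetical: slot pinned by an in-flight cache-bypassed swapin */
  #define SWAP_MAP_SYNC   0xbe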
Kairui Song Feb. 16, 2024, 10:01 a.m. UTC | #34
On Fri, Feb 16, 2024 at 6:58 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> On Thu, 15 Feb 2024 12:55:11 -0800 Minchan Kim <minchan@kernel.org> wrote:
> > My only concern with the approach is that it would be harder to get the
> > fix into the stable tree. If there isn't a strong objection, I prefer
> > Kairui's original solution (with some tweak of the scheduler if
> > necessary) first, and then pursuing your idea on the latest tree.
>
> Do we agree that this fix is needed in -stable?  For some reason I
> don't have a cc:stable in the changelog.
>
> I'd like to move this patch into mm-hotfixes-stable in order to get it
> upstream very soon.  Are we all agreeable with that?  I don't have an
> Acked-by: David?
>
> I have a note (which dates back to an earlier version) that Barry would
> be performing runtime testing.  Has that occurred?
>
> Thanks.
>

Hi Andrew,

I think this fix is needed for stable. I've sent V3; if we agree on
that, please consider V3, which includes the changes we have discussed
so far. Thanks!
David Hildenbrand Feb. 16, 2024, 4:16 p.m. UTC | #35
>> (4) relock the folio. (we do that already, might not want to fail)
>>
>> (5) take the PTE lock. If the PTE did not change, turn it into a present
>> PTE entry. Otherwise, cleanup.
> 
> Very interesting idea!
> 
> I'm just not sure what actual benefit it brings. The only concern
> about reusing swapcache_prepare so far is repeated page faults that
> may hurt performance or statistics; this issue is basically gone after
> adding a schedule().

I think you know that slapping in random schedule() calls is not a 
proper way to wait for an event to happen :) It's pretty much 
unpredictable how long the schedule() will take and whether there even 
is anything to schedule to!

With what I propose, just like with page migration, you really do wait 
for the event (page read and unlocked, only the PTE has to be updated) 
to happen before you try anything else.
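
For reference, this is roughly the pattern do_swap_page() already uses 
for migration entries (condensed from mm/memory.c):

  entry = pte_to_swp_entry(vmf->orig_pte);
  if (unlikely(non_swap_entry(entry))) {
          if (is_migration_entry(entry)) {
                  /* sleeps until the migration finishes, then retries */
                  migration_entry_wait(vma->vm_mm, vmf->pmd,
                                       vmf->address);
          }
          ...
          goto out;
  }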

Now, the difference is most likely that the case here happens much less 
frequently than page migration. Still, you could have all threads 
faulting on the same page and all would do the same dance here.

> 
> We can't drop all the operations around the swap cache and swap map
> anyway: the fault path doesn't know whether it should skip the
> swapcache until the swapcache lookup and swap count lookup are both
> done. So I think it can be done more naturally here with a special
> value, making things simpler and more robust, and improving
> performance a bit more.
> 

The issue will always be that someone can zap the PTE concurrently, 
which would free up the swap cache. With what I propose, that cannot 
happen in the sync swapin case we are discussing here.

If someone were to zap the PTE in the meantime, it would only remove 
the special non-swap entry, indicating to the swapin code that concurrent 
zapping happened. The swapin code would handle the swapcache freeing 
instead, after finishing reading the content.

So the swapcache would not get freed concurrently anymore, if I am not wrong.

At least in theory; I didn't prototype it yet.

> And in another series [1] I'm working on making shmem use
> cache-bypassed swapin as well; following this approach, I'd have to
> implement another shmem-map-based synchronization too.
> 

I'd have to look further into whether something like that could 
similarly apply to shmem. But yes, it's not using PTEs, so a PTE-based 
sync mechanism definitely does not apply.

> After all, it's only a rare race, so I think a simpler solution might be better.

I'm not sure that simpler means better here. Simpler might be better for 
a backport, though.

The "use schedule() to wait" looks odd, maybe it's a common primitive 
that I simply didn't stumble over yet. (I doubt it but it could be possible)
Huang, Ying Feb. 19, 2024, 5:42 a.m. UTC | #36
Kairui Song <ryncsn@gmail.com> writes:

> On Thu, Feb 8, 2024 at 2:36 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Kairui Song <ryncsn@gmail.com> writes:
>>
>> > On Thu, Feb 8, 2024 at 2:31 AM Minchan Kim <minchan@kernel.org> wrote:
>> >>
>> >> On Wed, Feb 07, 2024 at 12:06:15PM +0800, Kairui Song wrote:
>>
>> [snip]
>>
>> >> >
>> >> > So I think the thing is, it's getting complex because this patch
>> >> > wanted to make it simple and just reuse the swap cache flags.
>> >>
>> >> I agree that a simple fix would be the important at this point.
>> >>
>> >> Considering your description, here's my understanding of the other idea:
>> >> Other methods, such as increasing the swap count, haven't proven effective
>> >> in your tests. That approach risks forcing racers to rely on the swap cache
>> >> again, with a potential performance loss in the race scenario.
>> >>
>> >> While I understand that simplicity is important, and the performance loss
>> >> in this case may be infrequent, I believe the swap_count approach could be
>> >> a suitable solution. What do you think?
>> >
>> > Hi Minchan
>> >
>> > Yes, my main concern was about simplicity and performance.
>> >
>> > Increasing the swap_count here will also race with another process
>> > releasing the swap_count to 0 (the swapcache was able to synchronize
>> > callers on other call paths, but we skipped the swapcache here).
>>
>> What is the consequence of the race condition?
>
> Hi Ying,
>
> It will increase the swap count of an already freed entry; this races
> with the swap free/alloc logic that checks whether count ==
> SWAP_HAS_CACHE or sets the count to zero, or with repeated frees of an
> entry, all resulting in random corruption of the swap map. This happens
> a lot during stress testing.

You are right!  Thanks for the explanation.

--
Best Regards,
Huang, Ying

>>
>> > So the right steps are: 1. Lock the cluster/swap lock; 2. Check if we
>> > still have swap_count == 1, and bail out if not; 3. Set it to 2.
>> > __swap_duplicate can be modified to support this; it's similar to the
>> > existing logic for SWAP_HAS_CACHE.
>> >
>> > And the swap freeing path will need to do more: swapcache cleanup needs
>> > to be handled even in the bypassing path, since a racer may add the
>> > entry to the swapcache.
>> >
>> > Reusing SWAP_HAS_CACHE seems to make it much simpler and avoids a lot
>> > of overhead, so I went that way in this patch; the only issue now is
>> > potentially repeated page faults.
>> >
>> > I'm currently trying to add a SWAP_MAP_LOCK (or SWAP_MAP_SYNC, I'm bad
>> > at naming it) special value, so any racer can just spin on it to avoid
>> > all these problems. What do you think about this?
>>
>> Let's try a simpler method first.
>
> Another, simpler idea: add a schedule() or
> schedule_timeout_uninterruptible(1) in the swapcache_prepare failure
> path before the goto out (just like __read_swap_cache_async). I think
> this should ensure that in almost all cases the PTE is ready after it
> returns, and it also yields more CPU.
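
i.e. something like this on top of the hunk quoted above (a sketch of 
the suggested tweak, not the final patch):

  if (swapcache_prepare(entry)) {
          /*
           * Relax a bit to prevent rapid repeated page faults
           * while the racing thread finishes its swapin.
           */
          schedule_timeout_uninterruptible(1);
          goto out;
  }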

Patch

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4db00ddad261..8d28f6091a32 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -549,6 +549,11 @@  static inline int swap_duplicate(swp_entry_t swp)
 	return 0;
 }
 
+static inline int swapcache_prepare(swp_entry_t swp)
+{
+	return 0;
+}
+
 static inline void swap_free(swp_entry_t swp)
 {
 }
diff --git a/mm/memory.c b/mm/memory.c
index 7e1f4849463a..1749c700823d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3867,6 +3867,16 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (!folio) {
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
 		    __swap_count(entry) == 1) {
+			/*
+			 * Prevent parallel swapin from proceeding with
+			 * the cache flag. Otherwise, another thread may
+			 * finish swapin first, free the entry, and swapout
+			 * reusing the same entry. It's undetectable as
+			 * pte_same() returns true due to entry reuse.
+			 */
+			if (swapcache_prepare(entry))
+				goto out;
+
 			/* skip swapcache */
 			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
 						vma, vmf->address, false);
@@ -4116,6 +4126,9 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 unlock:
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
+	/* Clear the swap cache pin for direct swapin after PTL unlock */
+	if (folio && !swapcache)
+		swapcache_clear(si, entry);
 out:
 	if (si)
 		put_swap_device(si);
@@ -4124,6 +4137,8 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 out_page:
+	if (!swapcache)
+		swapcache_clear(si, entry);
 	folio_unlock(folio);
 out_release:
 	folio_put(folio);
diff --git a/mm/swap.h b/mm/swap.h
index 758c46ca671e..fc2f6ade7f80 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -41,6 +41,7 @@  void __delete_from_swap_cache(struct folio *folio,
 void delete_from_swap_cache(struct folio *folio);
 void clear_shadow_from_swap_cache(int type, unsigned long begin,
 				  unsigned long end);
+void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry);
 struct folio *swap_cache_get_folio(swp_entry_t entry,
 		struct vm_area_struct *vma, unsigned long addr);
 struct folio *filemap_get_incore_folio(struct address_space *mapping,
@@ -97,6 +98,10 @@  static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
 	return 0;
 }
 
+static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
+{
+}
+
 static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
 		struct vm_area_struct *vma, unsigned long addr)
 {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 556ff7347d5f..746aa9da5302 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3365,6 +3365,19 @@  int swapcache_prepare(swp_entry_t entry)
 	return __swap_duplicate(entry, SWAP_HAS_CACHE);
 }
 
+void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
+{
+	struct swap_cluster_info *ci;
+	unsigned long offset = swp_offset(entry);
+	unsigned char usage;
+
+	ci = lock_cluster_or_swap_info(si, offset);
+	usage = __swap_entry_free_locked(si, offset, SWAP_HAS_CACHE);
+	unlock_cluster_or_swap_info(si, ci);
+	if (!usage)
+		free_swap_slot(entry);
+}
+
 struct swap_info_struct *swp_swap_info(swp_entry_t entry)
 {
 	return swap_type_to_swap_info(swp_type(entry));