Message ID | 20240206182559.32264-1-ryncsn@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v2] mm/swap: fix race when skipping swapcache | expand |
Hi Kairui, On Wed, 7 Feb 2024 02:25:59 +0800 Kairui Song <ryncsn@gmail.com> wrote: > From: Kairui Song <kasong@tencent.com> > [...] > --- > Update from V1: > - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song] > - Update comments make it cleaner [Huang, Ying] > - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park] I confirm the build issue[1] that I reported yesterday is gone with this version. Thank you for fixing it! [1] https://lore.kernel.org/linux-mm/20240206022409.202536-1-sj@kernel.org Thanks, SJ [...]
On Wed, Feb 07, 2024 at 02:25:59AM +0800, Kairui Song wrote: > From: Kairui Song <kasong@tencent.com> > > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads > swapin the same entry at the same time, they get different pages (A, B). > Before one thread (T0) finishes the swapin and installs page (A) > to the PTE, another thread (T1) could finish swapin of page (B), > swap_free the entry, then swap out the possibly modified page > reusing the same entry. It breaks the pte_same check in (T0) because > PTE value is unchanged, causing ABA problem. Thread (T0) will > install a stalled page (A) into the PTE and cause data corruption. > > One possible callstack is like this: > > CPU0 CPU1 > ---- ---- > do_swap_page() do_swap_page() with same entry > <direct swapin path> <direct swapin path> > <alloc page A> <alloc page B> > swap_read_folio() <- read to page A swap_read_folio() <- read to page B > <slow on later locks or interrupt> <finished swapin first> > ... set_pte_at() > swap_free() <- entry is free ^^^ nit: From the recent code, I see swap_free is called earlier than set_pte_at > <write to page B, now page A stalled> > <swap out page B to same swap entry> > pte_same() <- Check pass, PTE seems > unchanged, but page A > is stalled! > swap_free() <- page B content lost! > set_pte_at() <- staled page A installed! > > And besides, for ZRAM, swap_free() allows the swap device to discard > the entry content, so even if page (B) is not modified, if > swap_read_folio() on CPU0 happens later than swap_free() on CPU1, > it may also cause data loss. Thanks for catching the issue, folks! > > To fix this, reuse swapcache_prepare which will pin the swap entry using > the cache flag, and allow only one thread to pin it. Release the pin > after PT unlocked. Racers will simply busy wait since it's a rare > and very short event. > > Other methods like increasing the swap count don't seem to be a good > idea after some tests, that will cause racers to fall back to use the > swap cache again. Parallel swapin using different methods leads to > a much more complex scenario. > > Reproducer: > > This race issue can be triggered easily using a well constructed > reproducer and patched brd (with a delay in read path) [1]: > > With latest 6.8 mainline, race caused data loss can be observed easily: > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out > Polulating 32MB of memory region... > Keep swapping out... > Starting round 0... > Spawning 65536 workers... > 32746 workers spawned, wait for done... > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! > Round 0 Failed, 15 data loss! > > This reproducer spawns multiple threads sharing the same memory region > using a small swap device. Every two threads updates mapped pages one by > one in opposite direction trying to create a race, with one dedicated > thread keep swapping out the data out using madvise. > > The reproducer created a reproduce rate of about once every 5 minutes, > so the race should be totally possible in production. > > After this patch, I ran the reproducer for over a few hundred rounds > and no data loss observed. > > Performance overhead is minimal, microbenchmark swapin 10G from 32G > zram: > > Before: 10934698 us > After: 11157121 us > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) > > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") > Reported-by: "Huang, Ying" <ying.huang@intel.com> > Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] > Signed-off-by: Kairui Song <kasong@tencent.com> > Reviewed-by: "Huang, Ying" <ying.huang@intel.com> > Acked-by: Yu Zhao <yuzhao@google.com> > > --- > Update from V1: > - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song] > - Update comments make it cleaner [Huang, Ying] > - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park] > - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path" [Yu Zhao] > - Update commit message. > - Collect Review and Acks. > > include/linux/swap.h | 5 +++++ > mm/memory.c | 15 +++++++++++++++ > mm/swap.h | 5 +++++ > mm/swapfile.c | 13 +++++++++++++ > 4 files changed, 38 insertions(+) > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 4db00ddad261..8d28f6091a32 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) > return 0; > } > > +static inline int swapcache_prepare(swp_entry_t swp) > +{ > + return 0; > +} > + > static inline void swap_free(swp_entry_t swp) > { > } > diff --git a/mm/memory.c b/mm/memory.c > index 7e1f4849463a..1749c700823d 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > if (!folio) { > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > __swap_count(entry) == 1) { > + /* > + * Prevent parallel swapin from proceeding with > + * the cache flag. Otherwise, another thread may > + * finish swapin first, free the entry, and swapout > + * reusing the same entry. It's undetectable as > + * pte_same() returns true due to entry reuse. > + */ > + if (swapcache_prepare(entry)) > + goto out; > + > /* skip swapcache */ > folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, > vma, vmf->address, false); > @@ -4116,6 +4126,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > unlock: > if (vmf->pte) > pte_unmap_unlock(vmf->pte, vmf->ptl); > + /* Clear the swap cache pin for direct swapin after PTL unlock */ > + if (folio && !swapcache) > + swapcache_clear(si, entry); > out: > if (si) > put_swap_device(si); > @@ -4124,6 +4137,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > if (vmf->pte) > pte_unmap_unlock(vmf->pte, vmf->ptl); > out_page: > + if (!swapcache) > + swapcache_clear(si, entry); > folio_unlock(folio); > out_release: > folio_put(folio); What happens? do_swap_page .. swapcache_prepare() <- tured the cache flag on folio = vma_alloc_folio <- failed to allocate the folio page = &foio->page; <- crash but it's out of scope from this patch .. if (!folio) goto unlock; .. unlock: swapcache_clear(si, entry) <- it's skipped this time. Can we simply introduce a boolean flag to state the special case and clear the cache state based on the flag? if (swapcache_prepare()) goto out; need_clear_cache = true; out_path: if (need_clear_cache) swapcache_clear
Hi Kairui, Sorry replying to your patch V1 late, I will reply on the V2 thread. On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote: > > From: Kairui Song <kasong@tencent.com> > > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads > swapin the same entry at the same time, they get different pages (A, B). > Before one thread (T0) finishes the swapin and installs page (A) > to the PTE, another thread (T1) could finish swapin of page (B), > swap_free the entry, then swap out the possibly modified page > reusing the same entry. It breaks the pte_same check in (T0) because > PTE value is unchanged, causing ABA problem. Thread (T0) will > install a stalled page (A) into the PTE and cause data corruption. > > One possible callstack is like this: > > CPU0 CPU1 > ---- ---- > do_swap_page() do_swap_page() with same entry > <direct swapin path> <direct swapin path> > <alloc page A> <alloc page B> > swap_read_folio() <- read to page A swap_read_folio() <- read to page B > <slow on later locks or interrupt> <finished swapin first> > ... set_pte_at() > swap_free() <- entry is free > <write to page B, now page A stalled> > <swap out page B to same swap entry> > pte_same() <- Check pass, PTE seems > unchanged, but page A > is stalled! > swap_free() <- page B content lost! > set_pte_at() <- staled page A installed! > > And besides, for ZRAM, swap_free() allows the swap device to discard > the entry content, so even if page (B) is not modified, if > swap_read_folio() on CPU0 happens later than swap_free() on CPU1, > it may also cause data loss. > > To fix this, reuse swapcache_prepare which will pin the swap entry using > the cache flag, and allow only one thread to pin it. Release the pin > after PT unlocked. Racers will simply busy wait since it's a rare > and very short event. > > Other methods like increasing the swap count don't seem to be a good > idea after some tests, that will cause racers to fall back to use the > swap cache again. Parallel swapin using different methods leads to > a much more complex scenario. > > Reproducer: > > This race issue can be triggered easily using a well constructed > reproducer and patched brd (with a delay in read path) [1]: > > With latest 6.8 mainline, race caused data loss can be observed easily: > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out > Polulating 32MB of memory region... > Keep swapping out... > Starting round 0... > Spawning 65536 workers... > 32746 workers spawned, wait for done... > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! > Round 0 Failed, 15 data loss! > > This reproducer spawns multiple threads sharing the same memory region > using a small swap device. Every two threads updates mapped pages one by > one in opposite direction trying to create a race, with one dedicated > thread keep swapping out the data out using madvise. > > The reproducer created a reproduce rate of about once every 5 minutes, > so the race should be totally possible in production. > > After this patch, I ran the reproducer for over a few hundred rounds > and no data loss observed. > > Performance overhead is minimal, microbenchmark swapin 10G from 32G > zram: > > Before: 10934698 us > After: 11157121 us > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) > > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") > Reported-by: "Huang, Ying" <ying.huang@intel.com> > Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] > Signed-off-by: Kairui Song <kasong@tencent.com> > Reviewed-by: "Huang, Ying" <ying.huang@intel.com> > Acked-by: Yu Zhao <yuzhao@google.com> > > --- > Update from V1: > - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song] > - Update comments make it cleaner [Huang, Ying] > - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park] > - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path" [Yu Zhao] > - Update commit message. > - Collect Review and Acks. > > include/linux/swap.h | 5 +++++ > mm/memory.c | 15 +++++++++++++++ > mm/swap.h | 5 +++++ > mm/swapfile.c | 13 +++++++++++++ > 4 files changed, 38 insertions(+) > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 4db00ddad261..8d28f6091a32 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) > return 0; > } > > +static inline int swapcache_prepare(swp_entry_t swp) > +{ > + return 0; > +} > + > static inline void swap_free(swp_entry_t swp) > { > } > diff --git a/mm/memory.c b/mm/memory.c > index 7e1f4849463a..1749c700823d 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > if (!folio) { > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > __swap_count(entry) == 1) { > + /* > + * Prevent parallel swapin from proceeding with > + * the cache flag. Otherwise, another thread may > + * finish swapin first, free the entry, and swapout > + * reusing the same entry. It's undetectable as > + * pte_same() returns true due to entry reuse. > + */ > + if (swapcache_prepare(entry)) > + goto out; > + I am puzzled by this "goto out". If I understand this correctly, you have two threads CPU1 and CPU2 racing to set the flag SWAP_HAS_CACHE. The CPU1 will succeed in adding the flag and the CPU2 will get "-EEXIST" from "swapcache_prepare(entry)". Am I understanding it correctly so far? Then the goto out seems wrong to me. For the CPU2, the page fault will return *unhandled*. Even worse, the "-EEXIST" error is not preserved, CPU2 does not even know the page fault is not handled, it will resume from the page fault instruction, possibly generate another page fault at the exact same location. That page fault loop will repeat until CPU1 install the new pte on that faulting virtual address and pick up by CPU2. Am I missing something obvious there? I just re-read your comment: "Racers will simply busy wait since it's a rare and very short event." That might be referring to the above CPU2 page fault looping situation. I consider the page fault looping on CPU2 not acceptable. For one it will mess up the page fault statistics. In my mind, having an explicit loop for CPU2 waiting for the PTE to show up is still better than this page fault loop. You can have more CPU power friendly loops. This behavior needs more discussion. Chris Chris > /* skip swapcache */ > folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, > vma, vmf->address, false); > @@ -4116,6 +4126,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > unlock: > if (vmf->pte) > pte_unmap_unlock(vmf->pte, vmf->ptl); > + /* Clear the swap cache pin for direct swapin after PTL unlock */ > + if (folio && !swapcache) > + swapcache_clear(si, entry); > out: > if (si) > put_swap_device(si); > @@ -4124,6 +4137,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > if (vmf->pte) > pte_unmap_unlock(vmf->pte, vmf->ptl); > out_page: > + if (!swapcache) > + swapcache_clear(si, entry); > folio_unlock(folio); > out_release: > folio_put(folio); > diff --git a/mm/swap.h b/mm/swap.h > index 758c46ca671e..fc2f6ade7f80 100644 > --- a/mm/swap.h > +++ b/mm/swap.h > @@ -41,6 +41,7 @@ void __delete_from_swap_cache(struct folio *folio, > void delete_from_swap_cache(struct folio *folio); > void clear_shadow_from_swap_cache(int type, unsigned long begin, > unsigned long end); > +void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry); > struct folio *swap_cache_get_folio(swp_entry_t entry, > struct vm_area_struct *vma, unsigned long addr); > struct folio *filemap_get_incore_folio(struct address_space *mapping, > @@ -97,6 +98,10 @@ static inline int swap_writepage(struct page *p, struct writeback_control *wbc) > return 0; > } > > +static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry) > +{ > +} > + > static inline struct folio *swap_cache_get_folio(swp_entry_t entry, > struct vm_area_struct *vma, unsigned long addr) > { > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 556ff7347d5f..746aa9da5302 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -3365,6 +3365,19 @@ int swapcache_prepare(swp_entry_t entry) > return __swap_duplicate(entry, SWAP_HAS_CACHE); > } > > +void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry) > +{ > + struct swap_cluster_info *ci; > + unsigned long offset = swp_offset(entry); > + unsigned char usage; > + > + ci = lock_cluster_or_swap_info(si, offset); > + usage = __swap_entry_free_locked(si, offset, SWAP_HAS_CACHE); > + unlock_cluster_or_swap_info(si, ci); > + if (!usage) > + free_swap_slot(entry); > +} > + > struct swap_info_struct *swp_swap_info(swp_entry_t entry) > { > return swap_type_to_swap_info(swp_type(entry)); > -- > 2.43.0 > >
On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote: > > Hi Kairui, > > Sorry replying to your patch V1 late, I will reply on the V2 thread. > > On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote: > > > > From: Kairui Song <kasong@tencent.com> > > > > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads > > swapin the same entry at the same time, they get different pages (A, B). > > Before one thread (T0) finishes the swapin and installs page (A) > > to the PTE, another thread (T1) could finish swapin of page (B), > > swap_free the entry, then swap out the possibly modified page > > reusing the same entry. It breaks the pte_same check in (T0) because > > PTE value is unchanged, causing ABA problem. Thread (T0) will > > install a stalled page (A) into the PTE and cause data corruption. > > > > One possible callstack is like this: > > > > CPU0 CPU1 > > ---- ---- > > do_swap_page() do_swap_page() with same entry > > <direct swapin path> <direct swapin path> > > <alloc page A> <alloc page B> > > swap_read_folio() <- read to page A swap_read_folio() <- read to page B > > <slow on later locks or interrupt> <finished swapin first> > > ... set_pte_at() > > swap_free() <- entry is free > > <write to page B, now page A stalled> > > <swap out page B to same swap entry> > > pte_same() <- Check pass, PTE seems > > unchanged, but page A > > is stalled! > > swap_free() <- page B content lost! > > set_pte_at() <- staled page A installed! > > > > And besides, for ZRAM, swap_free() allows the swap device to discard > > the entry content, so even if page (B) is not modified, if > > swap_read_folio() on CPU0 happens later than swap_free() on CPU1, > > it may also cause data loss. > > > > To fix this, reuse swapcache_prepare which will pin the swap entry using > > the cache flag, and allow only one thread to pin it. Release the pin > > after PT unlocked. Racers will simply busy wait since it's a rare > > and very short event. > > > > Other methods like increasing the swap count don't seem to be a good > > idea after some tests, that will cause racers to fall back to use the > > swap cache again. Parallel swapin using different methods leads to > > a much more complex scenario. > > > > Reproducer: > > > > This race issue can be triggered easily using a well constructed > > reproducer and patched brd (with a delay in read path) [1]: > > > > With latest 6.8 mainline, race caused data loss can be observed easily: > > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out > > Polulating 32MB of memory region... > > Keep swapping out... > > Starting round 0... > > Spawning 65536 workers... > > 32746 workers spawned, wait for done... > > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! > > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! > > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! > > Round 0 Failed, 15 data loss! > > > > This reproducer spawns multiple threads sharing the same memory region > > using a small swap device. Every two threads updates mapped pages one by > > one in opposite direction trying to create a race, with one dedicated > > thread keep swapping out the data out using madvise. > > > > The reproducer created a reproduce rate of about once every 5 minutes, > > so the race should be totally possible in production. > > > > After this patch, I ran the reproducer for over a few hundred rounds > > and no data loss observed. > > > > Performance overhead is minimal, microbenchmark swapin 10G from 32G > > zram: > > > > Before: 10934698 us > > After: 11157121 us > > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) > > > > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") > > Reported-by: "Huang, Ying" <ying.huang@intel.com> > > Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ > > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] > > Signed-off-by: Kairui Song <kasong@tencent.com> > > Reviewed-by: "Huang, Ying" <ying.huang@intel.com> > > Acked-by: Yu Zhao <yuzhao@google.com> > > > > --- > > Update from V1: > > - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song] > > - Update comments make it cleaner [Huang, Ying] > > - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park] > > - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path" [Yu Zhao] > > - Update commit message. > > - Collect Review and Acks. > > > > include/linux/swap.h | 5 +++++ > > mm/memory.c | 15 +++++++++++++++ > > mm/swap.h | 5 +++++ > > mm/swapfile.c | 13 +++++++++++++ > > 4 files changed, 38 insertions(+) > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h > > index 4db00ddad261..8d28f6091a32 100644 > > --- a/include/linux/swap.h > > +++ b/include/linux/swap.h > > @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) > > return 0; > > } > > > > +static inline int swapcache_prepare(swp_entry_t swp) > > +{ > > + return 0; > > +} > > + > > static inline void swap_free(swp_entry_t swp) > > { > > } > > diff --git a/mm/memory.c b/mm/memory.c > > index 7e1f4849463a..1749c700823d 100644 > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > if (!folio) { > > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > > __swap_count(entry) == 1) { > > + /* > > + * Prevent parallel swapin from proceeding with > > + * the cache flag. Otherwise, another thread may > > + * finish swapin first, free the entry, and swapout > > + * reusing the same entry. It's undetectable as > > + * pte_same() returns true due to entry reuse. > > + */ > > + if (swapcache_prepare(entry)) > > + goto out; > > + > > I am puzzled by this "goto out". If I understand this correctly, you > have two threads CPU1 and CPU2 racing to set the flag SWAP_HAS_CACHE. > The CPU1 will succeed in adding the flag and the CPU2 will get > "-EEXIST" from "swapcache_prepare(entry)". Am I understanding it > correctly so far? > > Then the goto out seems wrong to me. For the CPU2, the page fault will > return *unhandled*. Even worse, the "-EEXIST" error is not preserved, > CPU2 does not even know the page fault is not handled, it will resume > from the page fault instruction, possibly generate another page fault > at the exact same location. That page fault loop will repeat until > CPU1 install the new pte on that faulting virtual address and pick up > by CPU2. > > Am I missing something obvious there? I feel you are right. any concurrent page faults at the same pte will increase the count of page faults for a couple of times now. > > I just re-read your comment: "Racers will simply busy wait since it's > a rare and very short event." That might be referring to the above > CPU2 page fault looping situation. I consider the page fault looping > on CPU2 not acceptable. For one it will mess up the page fault > statistics. > In my mind, having an explicit loop for CPU2 waiting for the PTE to > show up is still better than this page fault loop. You can have more > CPU power friendly loops. I assume you mean something like while(!pte_same()) cpu_relax(); then we still have a chance to miss the change of B. For example, another thread is changing pte to A->B->A, our loop can miss B. Thus we will trap into an infinite loop. this is even worse. is it possible to loop for the success of swapcache_prepare(entry) instead? > > This behavior needs more discussion. > > Chris > > > Chris > > > > /* skip swapcache */ > > folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, > > vma, vmf->address, false); > > @@ -4116,6 +4126,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > unlock: > > if (vmf->pte) > > pte_unmap_unlock(vmf->pte, vmf->ptl); > > + /* Clear the swap cache pin for direct swapin after PTL unlock */ > > + if (folio && !swapcache) > > + swapcache_clear(si, entry); > > out: > > if (si) > > put_swap_device(si); > > @@ -4124,6 +4137,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > if (vmf->pte) > > pte_unmap_unlock(vmf->pte, vmf->ptl); > > out_page: > > + if (!swapcache) > > + swapcache_clear(si, entry); > > folio_unlock(folio); > > out_release: > > folio_put(folio); > > diff --git a/mm/swap.h b/mm/swap.h > > index 758c46ca671e..fc2f6ade7f80 100644 > > --- a/mm/swap.h > > +++ b/mm/swap.h > > @@ -41,6 +41,7 @@ void __delete_from_swap_cache(struct folio *folio, > > void delete_from_swap_cache(struct folio *folio); > > void clear_shadow_from_swap_cache(int type, unsigned long begin, > > unsigned long end); > > +void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry); > > struct folio *swap_cache_get_folio(swp_entry_t entry, > > struct vm_area_struct *vma, unsigned long addr); > > struct folio *filemap_get_incore_folio(struct address_space *mapping, > > @@ -97,6 +98,10 @@ static inline int swap_writepage(struct page *p, struct writeback_control *wbc) > > return 0; > > } > > > > +static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry) > > +{ > > +} > > + > > static inline struct folio *swap_cache_get_folio(swp_entry_t entry, > > struct vm_area_struct *vma, unsigned long addr) > > { > > diff --git a/mm/swapfile.c b/mm/swapfile.c > > index 556ff7347d5f..746aa9da5302 100644 > > --- a/mm/swapfile.c > > +++ b/mm/swapfile.c > > @@ -3365,6 +3365,19 @@ int swapcache_prepare(swp_entry_t entry) > > return __swap_duplicate(entry, SWAP_HAS_CACHE); > > } > > > > +void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry) > > +{ > > + struct swap_cluster_info *ci; > > + unsigned long offset = swp_offset(entry); > > + unsigned char usage; > > + > > + ci = lock_cluster_or_swap_info(si, offset); > > + usage = __swap_entry_free_locked(si, offset, SWAP_HAS_CACHE); > > + unlock_cluster_or_swap_info(si, ci); > > + if (!usage) > > + free_swap_slot(entry); > > +} > > + > > struct swap_info_struct *swp_swap_info(swp_entry_t entry) > > { > > return swap_type_to_swap_info(swp_type(entry)); > > -- > > 2.43.0 Thanks Barry
On Wed, Feb 7, 2024 at 10:21 AM Kairui Song <ryncsn@gmail.com> wrote: > > On Wed, Feb 7, 2024 at 10:03 AM Chris Li <chrisl@kernel.org> wrote: > > > > On Tue, Feb 6, 2024 at 4:43 PM Barry Song <21cnbao@gmail.com> wrote: > > > > > > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote: > > > > > > > > Hi Kairui, > > > > > > > > Sorry replying to your patch V1 late, I will reply on the V2 thread. > > > > > > > > On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote: > > > > > > > > > > From: Kairui Song <kasong@tencent.com> > > > > > > > > > > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads > > > > > swapin the same entry at the same time, they get different pages (A, B). > > > > > Before one thread (T0) finishes the swapin and installs page (A) > > > > > to the PTE, another thread (T1) could finish swapin of page (B), > > > > > swap_free the entry, then swap out the possibly modified page > > > > > reusing the same entry. It breaks the pte_same check in (T0) because > > > > > PTE value is unchanged, causing ABA problem. Thread (T0) will > > > > > install a stalled page (A) into the PTE and cause data corruption. > > > > > > > > > > One possible callstack is like this: > > > > > > > > > > CPU0 CPU1 > > > > > ---- ---- > > > > > do_swap_page() do_swap_page() with same entry > > > > > <direct swapin path> <direct swapin path> > > > > > <alloc page A> <alloc page B> > > > > > swap_read_folio() <- read to page A swap_read_folio() <- read to page B > > > > > <slow on later locks or interrupt> <finished swapin first> > > > > > ... set_pte_at() > > > > > swap_free() <- entry is free > > > > > <write to page B, now page A stalled> > > > > > <swap out page B to same swap entry> > > > > > pte_same() <- Check pass, PTE seems > > > > > unchanged, but page A > > > > > is stalled! > > > > > swap_free() <- page B content lost! > > > > > set_pte_at() <- staled page A installed! > > > > > > > > > > And besides, for ZRAM, swap_free() allows the swap device to discard > > > > > the entry content, so even if page (B) is not modified, if > > > > > swap_read_folio() on CPU0 happens later than swap_free() on CPU1, > > > > > it may also cause data loss. > > > > > > > > > > To fix this, reuse swapcache_prepare which will pin the swap entry using > > > > > the cache flag, and allow only one thread to pin it. Release the pin > > > > > after PT unlocked. Racers will simply busy wait since it's a rare > > > > > and very short event. > > > > > > > > > > Other methods like increasing the swap count don't seem to be a good > > > > > idea after some tests, that will cause racers to fall back to use the > > > > > swap cache again. Parallel swapin using different methods leads to > > > > > a much more complex scenario. > > > > > > > > > > Reproducer: > > > > > > > > > > This race issue can be triggered easily using a well constructed > > > > > reproducer and patched brd (with a delay in read path) [1]: > > > > > > > > > > With latest 6.8 mainline, race caused data loss can be observed easily: > > > > > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out > > > > > Polulating 32MB of memory region... > > > > > Keep swapping out... > > > > > Starting round 0... > > > > > Spawning 65536 workers... > > > > > 32746 workers spawned, wait for done... > > > > > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! > > > > > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! > > > > > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! > > > > > Round 0 Failed, 15 data loss! > > > > > > > > > > This reproducer spawns multiple threads sharing the same memory region > > > > > using a small swap device. Every two threads updates mapped pages one by > > > > > one in opposite direction trying to create a race, with one dedicated > > > > > thread keep swapping out the data out using madvise. > > > > > > > > > > The reproducer created a reproduce rate of about once every 5 minutes, > > > > > so the race should be totally possible in production. > > > > > > > > > > After this patch, I ran the reproducer for over a few hundred rounds > > > > > and no data loss observed. > > > > > > > > > > Performance overhead is minimal, microbenchmark swapin 10G from 32G > > > > > zram: > > > > > > > > > > Before: 10934698 us > > > > > After: 11157121 us > > > > > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) > > > > > > > > > > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") > > > > > Reported-by: "Huang, Ying" <ying.huang@intel.com> > > > > > Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ > > > > > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] > > > > > Signed-off-by: Kairui Song <kasong@tencent.com> > > > > > Reviewed-by: "Huang, Ying" <ying.huang@intel.com> > > > > > Acked-by: Yu Zhao <yuzhao@google.com> > > > > > > > > > > --- > > > > > Update from V1: > > > > > - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song] > > > > > - Update comments make it cleaner [Huang, Ying] > > > > > - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park] > > > > > - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path" [Yu Zhao] > > > > > - Update commit message. > > > > > - Collect Review and Acks. > > > > > > > > > > include/linux/swap.h | 5 +++++ > > > > > mm/memory.c | 15 +++++++++++++++ > > > > > mm/swap.h | 5 +++++ > > > > > mm/swapfile.c | 13 +++++++++++++ > > > > > 4 files changed, 38 insertions(+) > > > > > > > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h > > > > > index 4db00ddad261..8d28f6091a32 100644 > > > > > --- a/include/linux/swap.h > > > > > +++ b/include/linux/swap.h > > > > > @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) > > > > > return 0; > > > > > } > > > > > > > > > > +static inline int swapcache_prepare(swp_entry_t swp) > > > > > +{ > > > > > + return 0; > > > > > +} > > > > > + > > > > > static inline void swap_free(swp_entry_t swp) > > > > > { > > > > > } > > > > > diff --git a/mm/memory.c b/mm/memory.c > > > > > index 7e1f4849463a..1749c700823d 100644 > > > > > --- a/mm/memory.c > > > > > +++ b/mm/memory.c > > > > > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > > if (!folio) { > > > > > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > > > > > __swap_count(entry) == 1) { > > > > > + /* > > > > > + * Prevent parallel swapin from proceeding with > > > > > + * the cache flag. Otherwise, another thread may > > > > > + * finish swapin first, free the entry, and swapout > > > > > + * reusing the same entry. It's undetectable as > > > > > + * pte_same() returns true due to entry reuse. > > > > > + */ > > > > > + if (swapcache_prepare(entry)) > > > > > + goto out; > > > > > + > > > > > > > > I am puzzled by this "goto out". If I understand this correctly, you > > > > have two threads CPU1 and CPU2 racing to set the flag SWAP_HAS_CACHE. > > > > The CPU1 will succeed in adding the flag and the CPU2 will get > > > > "-EEXIST" from "swapcache_prepare(entry)". Am I understanding it > > > > correctly so far? > > > > > > > > Then the goto out seems wrong to me. For the CPU2, the page fault will > > > > return *unhandled*. Even worse, the "-EEXIST" error is not preserved, > > > > CPU2 does not even know the page fault is not handled, it will resume > > > > from the page fault instruction, possibly generate another page fault > > > > at the exact same location. That page fault loop will repeat until > > > > CPU1 install the new pte on that faulting virtual address and pick up > > > > by CPU2. > > > > > > > > Am I missing something obvious there? > > > > > > I feel you are right. any concurrent page faults at the same pte > > > will increase the count of page faults for a couple of times now. > > > > > > > > > > > I just re-read your comment: "Racers will simply busy wait since it's > > > > a rare and very short event." That might be referring to the above > > > > CPU2 page fault looping situation. I consider the page fault looping > > > > on CPU2 not acceptable. For one it will mess up the page fault > > > > statistics. > > > > In my mind, having an explicit loop for CPU2 waiting for the PTE to > > > > show up is still better than this page fault loop. You can have more > > > > CPU power friendly loops. > > > > > > I assume you mean something like > > > > > > while(!pte_same()) > > > cpu_relax(); > > > > > > then we still have a chance to miss the change of B. > > > > > > For example, another thread is changing pte to A->B->A, our loop can > > > miss B. Thus we will trap into an infinite loop. this is even worse. > > > > Yes. You are right, it is worse. Thanks for catching that. That is why > > I say this needs more discussion, I haven't fully thought it through > > :-) > > Hi Chris and Barry, > > Thanks for the comments! > Hi Kairui, > The worst thing I know of returning in do_swap_page without handling > the swap, is an increase of some statistic counters, note it will not > cause major page fault counters to grow, only things like perf counter > and vma lock statistic are affected. I don't understand :-) if it is goto out, userspace may re-execute the instruction. What is going to happen is a totally new page fault. > > And actually there are multiple already existing return points in > do_swap_page that will return without handling it, which may > re-trigger the page fault. I feel this case is different from other returns. other returns probably have held ptl or page lock before checking entry/pte, and another thread has likely changed the PTE/swap entry before those locks are released. then it is likely there is no following page fault. but our case is different, we are unconditionally having one or more page faults. I think making the count "right" is important as we highly depend on it to debug performance issues. > When do_swap_page is called, many pre-checks have been applied, and > they could all be invalidated if something raced, simply looping > inside here could miss a lot of corner cases, so we have to go through > that again. I agree. > > This patch did increase the chance of false positive increase of some > counters, maybe something like returning a VM_FAULT_RETRY could make > it better, but code is more complex and will cause other counters to > grow. right. Thanks Barry
On Tue, Feb 6, 2024 at 4:43 PM Barry Song <21cnbao@gmail.com> wrote: > > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote: > > > > Hi Kairui, > > > > Sorry replying to your patch V1 late, I will reply on the V2 thread. > > > > On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote: > > > > > > From: Kairui Song <kasong@tencent.com> > > > > > > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads > > > swapin the same entry at the same time, they get different pages (A, B). > > > Before one thread (T0) finishes the swapin and installs page (A) > > > to the PTE, another thread (T1) could finish swapin of page (B), > > > swap_free the entry, then swap out the possibly modified page > > > reusing the same entry. It breaks the pte_same check in (T0) because > > > PTE value is unchanged, causing ABA problem. Thread (T0) will > > > install a stalled page (A) into the PTE and cause data corruption. > > > > > > One possible callstack is like this: > > > > > > CPU0 CPU1 > > > ---- ---- > > > do_swap_page() do_swap_page() with same entry > > > <direct swapin path> <direct swapin path> > > > <alloc page A> <alloc page B> > > > swap_read_folio() <- read to page A swap_read_folio() <- read to page B > > > <slow on later locks or interrupt> <finished swapin first> > > > ... set_pte_at() > > > swap_free() <- entry is free > > > <write to page B, now page A stalled> > > > <swap out page B to same swap entry> > > > pte_same() <- Check pass, PTE seems > > > unchanged, but page A > > > is stalled! > > > swap_free() <- page B content lost! > > > set_pte_at() <- staled page A installed! > > > > > > And besides, for ZRAM, swap_free() allows the swap device to discard > > > the entry content, so even if page (B) is not modified, if > > > swap_read_folio() on CPU0 happens later than swap_free() on CPU1, > > > it may also cause data loss. > > > > > > To fix this, reuse swapcache_prepare which will pin the swap entry using > > > the cache flag, and allow only one thread to pin it. Release the pin > > > after PT unlocked. Racers will simply busy wait since it's a rare > > > and very short event. > > > > > > Other methods like increasing the swap count don't seem to be a good > > > idea after some tests, that will cause racers to fall back to use the > > > swap cache again. Parallel swapin using different methods leads to > > > a much more complex scenario. > > > > > > Reproducer: > > > > > > This race issue can be triggered easily using a well constructed > > > reproducer and patched brd (with a delay in read path) [1]: > > > > > > With latest 6.8 mainline, race caused data loss can be observed easily: > > > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out > > > Polulating 32MB of memory region... > > > Keep swapping out... > > > Starting round 0... > > > Spawning 65536 workers... > > > 32746 workers spawned, wait for done... > > > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! > > > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! > > > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! > > > Round 0 Failed, 15 data loss! > > > > > > This reproducer spawns multiple threads sharing the same memory region > > > using a small swap device. Every two threads updates mapped pages one by > > > one in opposite direction trying to create a race, with one dedicated > > > thread keep swapping out the data out using madvise. > > > > > > The reproducer created a reproduce rate of about once every 5 minutes, > > > so the race should be totally possible in production. > > > > > > After this patch, I ran the reproducer for over a few hundred rounds > > > and no data loss observed. > > > > > > Performance overhead is minimal, microbenchmark swapin 10G from 32G > > > zram: > > > > > > Before: 10934698 us > > > After: 11157121 us > > > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) > > > > > > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") > > > Reported-by: "Huang, Ying" <ying.huang@intel.com> > > > Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ > > > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] > > > Signed-off-by: Kairui Song <kasong@tencent.com> > > > Reviewed-by: "Huang, Ying" <ying.huang@intel.com> > > > Acked-by: Yu Zhao <yuzhao@google.com> > > > > > > --- > > > Update from V1: > > > - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song] > > > - Update comments make it cleaner [Huang, Ying] > > > - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park] > > > - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path" [Yu Zhao] > > > - Update commit message. > > > - Collect Review and Acks. > > > > > > include/linux/swap.h | 5 +++++ > > > mm/memory.c | 15 +++++++++++++++ > > > mm/swap.h | 5 +++++ > > > mm/swapfile.c | 13 +++++++++++++ > > > 4 files changed, 38 insertions(+) > > > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h > > > index 4db00ddad261..8d28f6091a32 100644 > > > --- a/include/linux/swap.h > > > +++ b/include/linux/swap.h > > > @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) > > > return 0; > > > } > > > > > > +static inline int swapcache_prepare(swp_entry_t swp) > > > +{ > > > + return 0; > > > +} > > > + > > > static inline void swap_free(swp_entry_t swp) > > > { > > > } > > > diff --git a/mm/memory.c b/mm/memory.c > > > index 7e1f4849463a..1749c700823d 100644 > > > --- a/mm/memory.c > > > +++ b/mm/memory.c > > > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > if (!folio) { > > > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > > > __swap_count(entry) == 1) { > > > + /* > > > + * Prevent parallel swapin from proceeding with > > > + * the cache flag. Otherwise, another thread may > > > + * finish swapin first, free the entry, and swapout > > > + * reusing the same entry. It's undetectable as > > > + * pte_same() returns true due to entry reuse. > > > + */ > > > + if (swapcache_prepare(entry)) > > > + goto out; > > > + > > > > I am puzzled by this "goto out". If I understand this correctly, you > > have two threads CPU1 and CPU2 racing to set the flag SWAP_HAS_CACHE. > > The CPU1 will succeed in adding the flag and the CPU2 will get > > "-EEXIST" from "swapcache_prepare(entry)". Am I understanding it > > correctly so far? > > > > Then the goto out seems wrong to me. For the CPU2, the page fault will > > return *unhandled*. Even worse, the "-EEXIST" error is not preserved, > > CPU2 does not even know the page fault is not handled, it will resume > > from the page fault instruction, possibly generate another page fault > > at the exact same location. That page fault loop will repeat until > > CPU1 install the new pte on that faulting virtual address and pick up > > by CPU2. > > > > Am I missing something obvious there? > > I feel you are right. any concurrent page faults at the same pte > will increase the count of page faults for a couple of times now. > > > > > I just re-read your comment: "Racers will simply busy wait since it's > > a rare and very short event." That might be referring to the above > > CPU2 page fault looping situation. I consider the page fault looping > > on CPU2 not acceptable. For one it will mess up the page fault > > statistics. > > In my mind, having an explicit loop for CPU2 waiting for the PTE to > > show up is still better than this page fault loop. You can have more > > CPU power friendly loops. > > I assume you mean something like > > while(!pte_same()) > cpu_relax(); > > then we still have a chance to miss the change of B. > > For example, another thread is changing pte to A->B->A, our loop can > miss B. Thus we will trap into an infinite loop. this is even worse. Yes. You are right, it is worse. Thanks for catching that. That is why I say this needs more discussion, I haven't fully thought it through :-) > > is it possible to loop for the success of swapcache_prepare(entry) > instead? It looks like it can work. It is using the SWAP_HAS_CACHE like a per swap entry spin lock. Chris > > > > > This behavior needs more discussion. > > > > Chris > > > > > > Chris > > > > > > > /* skip swapcache */ > > > folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, > > > vma, vmf->address, false); > > > @@ -4116,6 +4126,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > unlock: > > > if (vmf->pte) > > > pte_unmap_unlock(vmf->pte, vmf->ptl); > > > + /* Clear the swap cache pin for direct swapin after PTL unlock */ > > > + if (folio && !swapcache) > > > + swapcache_clear(si, entry); > > > out: > > > if (si) > > > put_swap_device(si); > > > @@ -4124,6 +4137,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > if (vmf->pte) > > > pte_unmap_unlock(vmf->pte, vmf->ptl); > > > out_page: > > > + if (!swapcache) > > > + swapcache_clear(si, entry); > > > folio_unlock(folio); > > > out_release: > > > folio_put(folio); > > > diff --git a/mm/swap.h b/mm/swap.h > > > index 758c46ca671e..fc2f6ade7f80 100644 > > > --- a/mm/swap.h > > > +++ b/mm/swap.h > > > @@ -41,6 +41,7 @@ void __delete_from_swap_cache(struct folio *folio, > > > void delete_from_swap_cache(struct folio *folio); > > > void clear_shadow_from_swap_cache(int type, unsigned long begin, > > > unsigned long end); > > > +void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry); > > > struct folio *swap_cache_get_folio(swp_entry_t entry, > > > struct vm_area_struct *vma, unsigned long addr); > > > struct folio *filemap_get_incore_folio(struct address_space *mapping, > > > @@ -97,6 +98,10 @@ static inline int swap_writepage(struct page *p, struct writeback_control *wbc) > > > return 0; > > > } > > > > > > +static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry) > > > +{ > > > +} > > > + > > > static inline struct folio *swap_cache_get_folio(swp_entry_t entry, > > > struct vm_area_struct *vma, unsigned long addr) > > > { > > > diff --git a/mm/swapfile.c b/mm/swapfile.c > > > index 556ff7347d5f..746aa9da5302 100644 > > > --- a/mm/swapfile.c > > > +++ b/mm/swapfile.c > > > @@ -3365,6 +3365,19 @@ int swapcache_prepare(swp_entry_t entry) > > > return __swap_duplicate(entry, SWAP_HAS_CACHE); > > > } > > > > > > +void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry) > > > +{ > > > + struct swap_cluster_info *ci; > > > + unsigned long offset = swp_offset(entry); > > > + unsigned char usage; > > > + > > > + ci = lock_cluster_or_swap_info(si, offset); > > > + usage = __swap_entry_free_locked(si, offset, SWAP_HAS_CACHE); > > > + unlock_cluster_or_swap_info(si, ci); > > > + if (!usage) > > > + free_swap_slot(entry); > > > +} > > > + > > > struct swap_info_struct *swp_swap_info(swp_entry_t entry) > > > { > > > return swap_type_to_swap_info(swp_type(entry)); > > > -- > > > 2.43.0 > > Thanks > Barry >
Barry Song <21cnbao@gmail.com> writes: > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote: >> >> Hi Kairui, >> >> Sorry replying to your patch V1 late, I will reply on the V2 thread. >> >> On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote: >> > >> > From: Kairui Song <kasong@tencent.com> >> > >> > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads >> > swapin the same entry at the same time, they get different pages (A, B). >> > Before one thread (T0) finishes the swapin and installs page (A) >> > to the PTE, another thread (T1) could finish swapin of page (B), >> > swap_free the entry, then swap out the possibly modified page >> > reusing the same entry. It breaks the pte_same check in (T0) because >> > PTE value is unchanged, causing ABA problem. Thread (T0) will >> > install a stalled page (A) into the PTE and cause data corruption. >> > >> > One possible callstack is like this: >> > >> > CPU0 CPU1 >> > ---- ---- >> > do_swap_page() do_swap_page() with same entry >> > <direct swapin path> <direct swapin path> >> > <alloc page A> <alloc page B> >> > swap_read_folio() <- read to page A swap_read_folio() <- read to page B >> > <slow on later locks or interrupt> <finished swapin first> >> > ... set_pte_at() >> > swap_free() <- entry is free >> > <write to page B, now page A stalled> >> > <swap out page B to same swap entry> >> > pte_same() <- Check pass, PTE seems >> > unchanged, but page A >> > is stalled! >> > swap_free() <- page B content lost! >> > set_pte_at() <- staled page A installed! >> > >> > And besides, for ZRAM, swap_free() allows the swap device to discard >> > the entry content, so even if page (B) is not modified, if >> > swap_read_folio() on CPU0 happens later than swap_free() on CPU1, >> > it may also cause data loss. >> > >> > To fix this, reuse swapcache_prepare which will pin the swap entry using >> > the cache flag, and allow only one thread to pin it. Release the pin >> > after PT unlocked. Racers will simply busy wait since it's a rare >> > and very short event. >> > >> > Other methods like increasing the swap count don't seem to be a good >> > idea after some tests, that will cause racers to fall back to use the >> > swap cache again. Parallel swapin using different methods leads to >> > a much more complex scenario. >> > >> > Reproducer: >> > >> > This race issue can be triggered easily using a well constructed >> > reproducer and patched brd (with a delay in read path) [1]: >> > >> > With latest 6.8 mainline, race caused data loss can be observed easily: >> > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out >> > Polulating 32MB of memory region... >> > Keep swapping out... >> > Starting round 0... >> > Spawning 65536 workers... >> > 32746 workers spawned, wait for done... >> > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! >> > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! >> > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! >> > Round 0 Failed, 15 data loss! >> > >> > This reproducer spawns multiple threads sharing the same memory region >> > using a small swap device. Every two threads updates mapped pages one by >> > one in opposite direction trying to create a race, with one dedicated >> > thread keep swapping out the data out using madvise. >> > >> > The reproducer created a reproduce rate of about once every 5 minutes, >> > so the race should be totally possible in production. >> > >> > After this patch, I ran the reproducer for over a few hundred rounds >> > and no data loss observed. >> > >> > Performance overhead is minimal, microbenchmark swapin 10G from 32G >> > zram: >> > >> > Before: 10934698 us >> > After: 11157121 us >> > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) >> > >> > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") >> > Reported-by: "Huang, Ying" <ying.huang@intel.com> >> > Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ >> > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] >> > Signed-off-by: Kairui Song <kasong@tencent.com> >> > Reviewed-by: "Huang, Ying" <ying.huang@intel.com> >> > Acked-by: Yu Zhao <yuzhao@google.com> >> > >> > --- >> > Update from V1: >> > - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song] >> > - Update comments make it cleaner [Huang, Ying] >> > - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park] >> > - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path" [Yu Zhao] >> > - Update commit message. >> > - Collect Review and Acks. >> > >> > include/linux/swap.h | 5 +++++ >> > mm/memory.c | 15 +++++++++++++++ >> > mm/swap.h | 5 +++++ >> > mm/swapfile.c | 13 +++++++++++++ >> > 4 files changed, 38 insertions(+) >> > >> > diff --git a/include/linux/swap.h b/include/linux/swap.h >> > index 4db00ddad261..8d28f6091a32 100644 >> > --- a/include/linux/swap.h >> > +++ b/include/linux/swap.h >> > @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) >> > return 0; >> > } >> > >> > +static inline int swapcache_prepare(swp_entry_t swp) >> > +{ >> > + return 0; >> > +} >> > + >> > static inline void swap_free(swp_entry_t swp) >> > { >> > } >> > diff --git a/mm/memory.c b/mm/memory.c >> > index 7e1f4849463a..1749c700823d 100644 >> > --- a/mm/memory.c >> > +++ b/mm/memory.c >> > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >> > if (!folio) { >> > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && >> > __swap_count(entry) == 1) { >> > + /* >> > + * Prevent parallel swapin from proceeding with >> > + * the cache flag. Otherwise, another thread may >> > + * finish swapin first, free the entry, and swapout >> > + * reusing the same entry. It's undetectable as >> > + * pte_same() returns true due to entry reuse. >> > + */ >> > + if (swapcache_prepare(entry)) >> > + goto out; >> > + >> >> I am puzzled by this "goto out". If I understand this correctly, you >> have two threads CPU1 and CPU2 racing to set the flag SWAP_HAS_CACHE. >> The CPU1 will succeed in adding the flag and the CPU2 will get >> "-EEXIST" from "swapcache_prepare(entry)". Am I understanding it >> correctly so far? >> >> Then the goto out seems wrong to me. For the CPU2, the page fault will >> return *unhandled*. Even worse, the "-EEXIST" error is not preserved, >> CPU2 does not even know the page fault is not handled, it will resume >> from the page fault instruction, possibly generate another page fault >> at the exact same location. That page fault loop will repeat until >> CPU1 install the new pte on that faulting virtual address and pick up >> by CPU2. >> >> Am I missing something obvious there? > > I feel you are right. any concurrent page faults at the same pte > will increase the count of page faults for a couple of times now. > >> >> I just re-read your comment: "Racers will simply busy wait since it's >> a rare and very short event." That might be referring to the above >> CPU2 page fault looping situation. I consider the page fault looping >> on CPU2 not acceptable. For one it will mess up the page fault >> statistics. >> In my mind, having an explicit loop for CPU2 waiting for the PTE to >> show up is still better than this page fault loop. You can have more >> CPU power friendly loops. > > I assume you mean something like > > while(!pte_same()) > cpu_relax(); > > then we still have a chance to miss the change of B. > > For example, another thread is changing pte to A->B->A, our loop can > miss B. Thus we will trap into an infinite loop. this is even worse. > > is it possible to loop for the success of swapcache_prepare(entry) > instead? This doesn't work too. The swap count can increase to > 1 and be put in swap cache for long time. Another possibility is to move swapcache_prepare() after vma_alloc_folio() to reduce the race window. -- Best Regards, Huang, Ying >> >> This behavior needs more discussion. >> >> Chris >> >> >> Chris >> >> >> > /* skip swapcache */ >> > folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, >> > vma, vmf->address, false); >> > @@ -4116,6 +4126,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >> > unlock: >> > if (vmf->pte) >> > pte_unmap_unlock(vmf->pte, vmf->ptl); >> > + /* Clear the swap cache pin for direct swapin after PTL unlock */ >> > + if (folio && !swapcache) >> > + swapcache_clear(si, entry); >> > out: >> > if (si) >> > put_swap_device(si); >> > @@ -4124,6 +4137,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >> > if (vmf->pte) >> > pte_unmap_unlock(vmf->pte, vmf->ptl); >> > out_page: >> > + if (!swapcache) >> > + swapcache_clear(si, entry); >> > folio_unlock(folio); >> > out_release: >> > folio_put(folio); >> > diff --git a/mm/swap.h b/mm/swap.h >> > index 758c46ca671e..fc2f6ade7f80 100644 >> > --- a/mm/swap.h >> > +++ b/mm/swap.h >> > @@ -41,6 +41,7 @@ void __delete_from_swap_cache(struct folio *folio, >> > void delete_from_swap_cache(struct folio *folio); >> > void clear_shadow_from_swap_cache(int type, unsigned long begin, >> > unsigned long end); >> > +void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry); >> > struct folio *swap_cache_get_folio(swp_entry_t entry, >> > struct vm_area_struct *vma, unsigned long addr); >> > struct folio *filemap_get_incore_folio(struct address_space *mapping, >> > @@ -97,6 +98,10 @@ static inline int swap_writepage(struct page *p, struct writeback_control *wbc) >> > return 0; >> > } >> > >> > +static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry) >> > +{ >> > +} >> > + >> > static inline struct folio *swap_cache_get_folio(swp_entry_t entry, >> > struct vm_area_struct *vma, unsigned long addr) >> > { >> > diff --git a/mm/swapfile.c b/mm/swapfile.c >> > index 556ff7347d5f..746aa9da5302 100644 >> > --- a/mm/swapfile.c >> > +++ b/mm/swapfile.c >> > @@ -3365,6 +3365,19 @@ int swapcache_prepare(swp_entry_t entry) >> > return __swap_duplicate(entry, SWAP_HAS_CACHE); >> > } >> > >> > +void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry) >> > +{ >> > + struct swap_cluster_info *ci; >> > + unsigned long offset = swp_offset(entry); >> > + unsigned char usage; >> > + >> > + ci = lock_cluster_or_swap_info(si, offset); >> > + usage = __swap_entry_free_locked(si, offset, SWAP_HAS_CACHE); >> > + unlock_cluster_or_swap_info(si, ci); >> > + if (!usage) >> > + free_swap_slot(entry); >> > +} >> > + >> > struct swap_info_struct *swp_swap_info(swp_entry_t entry) >> > { >> > return swap_type_to_swap_info(swp_type(entry)); >> > -- >> > 2.43.0 > > Thanks > Barry
On Wed, Feb 7, 2024 at 10:03 AM Chris Li <chrisl@kernel.org> wrote: > > On Tue, Feb 6, 2024 at 4:43 PM Barry Song <21cnbao@gmail.com> wrote: > > > > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote: > > > > > > Hi Kairui, > > > > > > Sorry replying to your patch V1 late, I will reply on the V2 thread. > > > > > > On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote: > > > > > > > > From: Kairui Song <kasong@tencent.com> > > > > > > > > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads > > > > swapin the same entry at the same time, they get different pages (A, B). > > > > Before one thread (T0) finishes the swapin and installs page (A) > > > > to the PTE, another thread (T1) could finish swapin of page (B), > > > > swap_free the entry, then swap out the possibly modified page > > > > reusing the same entry. It breaks the pte_same check in (T0) because > > > > PTE value is unchanged, causing ABA problem. Thread (T0) will > > > > install a stalled page (A) into the PTE and cause data corruption. > > > > > > > > One possible callstack is like this: > > > > > > > > CPU0 CPU1 > > > > ---- ---- > > > > do_swap_page() do_swap_page() with same entry > > > > <direct swapin path> <direct swapin path> > > > > <alloc page A> <alloc page B> > > > > swap_read_folio() <- read to page A swap_read_folio() <- read to page B > > > > <slow on later locks or interrupt> <finished swapin first> > > > > ... set_pte_at() > > > > swap_free() <- entry is free > > > > <write to page B, now page A stalled> > > > > <swap out page B to same swap entry> > > > > pte_same() <- Check pass, PTE seems > > > > unchanged, but page A > > > > is stalled! > > > > swap_free() <- page B content lost! > > > > set_pte_at() <- staled page A installed! > > > > > > > > And besides, for ZRAM, swap_free() allows the swap device to discard > > > > the entry content, so even if page (B) is not modified, if > > > > swap_read_folio() on CPU0 happens later than swap_free() on CPU1, > > > > it may also cause data loss. > > > > > > > > To fix this, reuse swapcache_prepare which will pin the swap entry using > > > > the cache flag, and allow only one thread to pin it. Release the pin > > > > after PT unlocked. Racers will simply busy wait since it's a rare > > > > and very short event. > > > > > > > > Other methods like increasing the swap count don't seem to be a good > > > > idea after some tests, that will cause racers to fall back to use the > > > > swap cache again. Parallel swapin using different methods leads to > > > > a much more complex scenario. > > > > > > > > Reproducer: > > > > > > > > This race issue can be triggered easily using a well constructed > > > > reproducer and patched brd (with a delay in read path) [1]: > > > > > > > > With latest 6.8 mainline, race caused data loss can be observed easily: > > > > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out > > > > Polulating 32MB of memory region... > > > > Keep swapping out... > > > > Starting round 0... > > > > Spawning 65536 workers... > > > > 32746 workers spawned, wait for done... > > > > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! > > > > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! > > > > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! > > > > Round 0 Failed, 15 data loss! > > > > > > > > This reproducer spawns multiple threads sharing the same memory region > > > > using a small swap device. Every two threads updates mapped pages one by > > > > one in opposite direction trying to create a race, with one dedicated > > > > thread keep swapping out the data out using madvise. > > > > > > > > The reproducer created a reproduce rate of about once every 5 minutes, > > > > so the race should be totally possible in production. > > > > > > > > After this patch, I ran the reproducer for over a few hundred rounds > > > > and no data loss observed. > > > > > > > > Performance overhead is minimal, microbenchmark swapin 10G from 32G > > > > zram: > > > > > > > > Before: 10934698 us > > > > After: 11157121 us > > > > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) > > > > > > > > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") > > > > Reported-by: "Huang, Ying" <ying.huang@intel.com> > > > > Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ > > > > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] > > > > Signed-off-by: Kairui Song <kasong@tencent.com> > > > > Reviewed-by: "Huang, Ying" <ying.huang@intel.com> > > > > Acked-by: Yu Zhao <yuzhao@google.com> > > > > > > > > --- > > > > Update from V1: > > > > - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song] > > > > - Update comments make it cleaner [Huang, Ying] > > > > - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park] > > > > - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path" [Yu Zhao] > > > > - Update commit message. > > > > - Collect Review and Acks. > > > > > > > > include/linux/swap.h | 5 +++++ > > > > mm/memory.c | 15 +++++++++++++++ > > > > mm/swap.h | 5 +++++ > > > > mm/swapfile.c | 13 +++++++++++++ > > > > 4 files changed, 38 insertions(+) > > > > > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h > > > > index 4db00ddad261..8d28f6091a32 100644 > > > > --- a/include/linux/swap.h > > > > +++ b/include/linux/swap.h > > > > @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) > > > > return 0; > > > > } > > > > > > > > +static inline int swapcache_prepare(swp_entry_t swp) > > > > +{ > > > > + return 0; > > > > +} > > > > + > > > > static inline void swap_free(swp_entry_t swp) > > > > { > > > > } > > > > diff --git a/mm/memory.c b/mm/memory.c > > > > index 7e1f4849463a..1749c700823d 100644 > > > > --- a/mm/memory.c > > > > +++ b/mm/memory.c > > > > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > if (!folio) { > > > > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > > > > __swap_count(entry) == 1) { > > > > + /* > > > > + * Prevent parallel swapin from proceeding with > > > > + * the cache flag. Otherwise, another thread may > > > > + * finish swapin first, free the entry, and swapout > > > > + * reusing the same entry. It's undetectable as > > > > + * pte_same() returns true due to entry reuse. > > > > + */ > > > > + if (swapcache_prepare(entry)) > > > > + goto out; > > > > + > > > > > > I am puzzled by this "goto out". If I understand this correctly, you > > > have two threads CPU1 and CPU2 racing to set the flag SWAP_HAS_CACHE. > > > The CPU1 will succeed in adding the flag and the CPU2 will get > > > "-EEXIST" from "swapcache_prepare(entry)". Am I understanding it > > > correctly so far? > > > > > > Then the goto out seems wrong to me. For the CPU2, the page fault will > > > return *unhandled*. Even worse, the "-EEXIST" error is not preserved, > > > CPU2 does not even know the page fault is not handled, it will resume > > > from the page fault instruction, possibly generate another page fault > > > at the exact same location. That page fault loop will repeat until > > > CPU1 install the new pte on that faulting virtual address and pick up > > > by CPU2. > > > > > > Am I missing something obvious there? > > > > I feel you are right. any concurrent page faults at the same pte > > will increase the count of page faults for a couple of times now. > > > > > > > > I just re-read your comment: "Racers will simply busy wait since it's > > > a rare and very short event." That might be referring to the above > > > CPU2 page fault looping situation. I consider the page fault looping > > > on CPU2 not acceptable. For one it will mess up the page fault > > > statistics. > > > In my mind, having an explicit loop for CPU2 waiting for the PTE to > > > show up is still better than this page fault loop. You can have more > > > CPU power friendly loops. > > > > I assume you mean something like > > > > while(!pte_same()) > > cpu_relax(); > > > > then we still have a chance to miss the change of B. > > > > For example, another thread is changing pte to A->B->A, our loop can > > miss B. Thus we will trap into an infinite loop. this is even worse. > > Yes. You are right, it is worse. Thanks for catching that. That is why > I say this needs more discussion, I haven't fully thought it through > :-) Hi Chris and Barry, Thanks for the comments! The worst thing I know of returning in do_swap_page without handling the swap, is an increase of some statistic counters, note it will not cause major page fault counters to grow, only things like perf counter and vma lock statistic are affected. And actually there are multiple already existing return points in do_swap_page that will return without handling it, which may re-trigger the page fault. When do_swap_page is called, many pre-checks have been applied, and they could all be invalidated if something raced, simply looping inside here could miss a lot of corner cases, so we have to go through that again. This patch did increase the chance of false positive increase of some counters, maybe something like returning a VM_FAULT_RETRY could make it better, but code is more complex and will cause other counters to grow.
On Wed, Feb 7, 2024 at 10:10 AM Huang, Ying <ying.huang@intel.com> wrote: > > Barry Song <21cnbao@gmail.com> writes: > > > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote: > >> > >> Hi Kairui, > >> > >> Sorry replying to your patch V1 late, I will reply on the V2 thread. > >> > >> On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote: > >> > > >> > From: Kairui Song <kasong@tencent.com> > >> > > >> > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads > >> > swapin the same entry at the same time, they get different pages (A, B). > >> > Before one thread (T0) finishes the swapin and installs page (A) > >> > to the PTE, another thread (T1) could finish swapin of page (B), > >> > swap_free the entry, then swap out the possibly modified page > >> > reusing the same entry. It breaks the pte_same check in (T0) because > >> > PTE value is unchanged, causing ABA problem. Thread (T0) will > >> > install a stalled page (A) into the PTE and cause data corruption. > >> > > >> > One possible callstack is like this: > >> > > >> > CPU0 CPU1 > >> > ---- ---- > >> > do_swap_page() do_swap_page() with same entry > >> > <direct swapin path> <direct swapin path> > >> > <alloc page A> <alloc page B> > >> > swap_read_folio() <- read to page A swap_read_folio() <- read to page B > >> > <slow on later locks or interrupt> <finished swapin first> > >> > ... set_pte_at() > >> > swap_free() <- entry is free > >> > <write to page B, now page A stalled> > >> > <swap out page B to same swap entry> > >> > pte_same() <- Check pass, PTE seems > >> > unchanged, but page A > >> > is stalled! > >> > swap_free() <- page B content lost! > >> > set_pte_at() <- staled page A installed! > >> > > >> > And besides, for ZRAM, swap_free() allows the swap device to discard > >> > the entry content, so even if page (B) is not modified, if > >> > swap_read_folio() on CPU0 happens later than swap_free() on CPU1, > >> > it may also cause data loss. > >> > > >> > To fix this, reuse swapcache_prepare which will pin the swap entry using > >> > the cache flag, and allow only one thread to pin it. Release the pin > >> > after PT unlocked. Racers will simply busy wait since it's a rare > >> > and very short event. > >> > > >> > Other methods like increasing the swap count don't seem to be a good > >> > idea after some tests, that will cause racers to fall back to use the > >> > swap cache again. Parallel swapin using different methods leads to > >> > a much more complex scenario. > >> > > >> > Reproducer: > >> > > >> > This race issue can be triggered easily using a well constructed > >> > reproducer and patched brd (with a delay in read path) [1]: > >> > > >> > With latest 6.8 mainline, race caused data loss can be observed easily: > >> > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out > >> > Polulating 32MB of memory region... > >> > Keep swapping out... > >> > Starting round 0... > >> > Spawning 65536 workers... > >> > 32746 workers spawned, wait for done... > >> > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! > >> > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! > >> > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! > >> > Round 0 Failed, 15 data loss! > >> > > >> > This reproducer spawns multiple threads sharing the same memory region > >> > using a small swap device. Every two threads updates mapped pages one by > >> > one in opposite direction trying to create a race, with one dedicated > >> > thread keep swapping out the data out using madvise. > >> > > >> > The reproducer created a reproduce rate of about once every 5 minutes, > >> > so the race should be totally possible in production. > >> > > >> > After this patch, I ran the reproducer for over a few hundred rounds > >> > and no data loss observed. > >> > > >> > Performance overhead is minimal, microbenchmark swapin 10G from 32G > >> > zram: > >> > > >> > Before: 10934698 us > >> > After: 11157121 us > >> > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) > >> > > >> > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") > >> > Reported-by: "Huang, Ying" <ying.huang@intel.com> > >> > Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ > >> > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] > >> > Signed-off-by: Kairui Song <kasong@tencent.com> > >> > Reviewed-by: "Huang, Ying" <ying.huang@intel.com> > >> > Acked-by: Yu Zhao <yuzhao@google.com> > >> > > >> > --- > >> > Update from V1: > >> > - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song] > >> > - Update comments make it cleaner [Huang, Ying] > >> > - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park] > >> > - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path" [Yu Zhao] > >> > - Update commit message. > >> > - Collect Review and Acks. > >> > > >> > include/linux/swap.h | 5 +++++ > >> > mm/memory.c | 15 +++++++++++++++ > >> > mm/swap.h | 5 +++++ > >> > mm/swapfile.c | 13 +++++++++++++ > >> > 4 files changed, 38 insertions(+) > >> > > >> > diff --git a/include/linux/swap.h b/include/linux/swap.h > >> > index 4db00ddad261..8d28f6091a32 100644 > >> > --- a/include/linux/swap.h > >> > +++ b/include/linux/swap.h > >> > @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) > >> > return 0; > >> > } > >> > > >> > +static inline int swapcache_prepare(swp_entry_t swp) > >> > +{ > >> > + return 0; > >> > +} > >> > + > >> > static inline void swap_free(swp_entry_t swp) > >> > { > >> > } > >> > diff --git a/mm/memory.c b/mm/memory.c > >> > index 7e1f4849463a..1749c700823d 100644 > >> > --- a/mm/memory.c > >> > +++ b/mm/memory.c > >> > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > >> > if (!folio) { > >> > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > >> > __swap_count(entry) == 1) { > >> > + /* > >> > + * Prevent parallel swapin from proceeding with > >> > + * the cache flag. Otherwise, another thread may > >> > + * finish swapin first, free the entry, and swapout > >> > + * reusing the same entry. It's undetectable as > >> > + * pte_same() returns true due to entry reuse. > >> > + */ > >> > + if (swapcache_prepare(entry)) > >> > + goto out; > >> > + > >> > >> I am puzzled by this "goto out". If I understand this correctly, you > >> have two threads CPU1 and CPU2 racing to set the flag SWAP_HAS_CACHE. > >> The CPU1 will succeed in adding the flag and the CPU2 will get > >> "-EEXIST" from "swapcache_prepare(entry)". Am I understanding it > >> correctly so far? > >> > >> Then the goto out seems wrong to me. For the CPU2, the page fault will > >> return *unhandled*. Even worse, the "-EEXIST" error is not preserved, > >> CPU2 does not even know the page fault is not handled, it will resume > >> from the page fault instruction, possibly generate another page fault > >> at the exact same location. That page fault loop will repeat until > >> CPU1 install the new pte on that faulting virtual address and pick up > >> by CPU2. > >> > >> Am I missing something obvious there? > > > > I feel you are right. any concurrent page faults at the same pte > > will increase the count of page faults for a couple of times now. > > > >> > >> I just re-read your comment: "Racers will simply busy wait since it's > >> a rare and very short event." That might be referring to the above > >> CPU2 page fault looping situation. I consider the page fault looping > >> on CPU2 not acceptable. For one it will mess up the page fault > >> statistics. > >> In my mind, having an explicit loop for CPU2 waiting for the PTE to > >> show up is still better than this page fault loop. You can have more > >> CPU power friendly loops. > > > > I assume you mean something like > > > > while(!pte_same()) > > cpu_relax(); > > > > then we still have a chance to miss the change of B. > > > > For example, another thread is changing pte to A->B->A, our loop can > > miss B. Thus we will trap into an infinite loop. this is even worse. > > > > is it possible to loop for the success of swapcache_prepare(entry) > > instead? > > This doesn't work too. The swap count can increase to > 1 and be put in > swap cache for long time. > > Another possibility is to move swapcache_prepare() after > vma_alloc_folio() to reduce the race window. Reducing the race window seems like a good way. Or maybe we can just add a cpu_relax() so raced swapins will just slow down, and won't loop too much time and so the side effect (counter or power consumption) should be much smaller?
On Wed, Feb 7, 2024 at 10:55 AM Barry Song <21cnbao@gmail.com> wrote: > > On Wed, Feb 7, 2024 at 10:21 AM Kairui Song <ryncsn@gmail.com> wrote: > > > > On Wed, Feb 7, 2024 at 10:03 AM Chris Li <chrisl@kernel.org> wrote: > > > > > > On Tue, Feb 6, 2024 at 4:43 PM Barry Song <21cnbao@gmail.com> wrote: > > > > > > > > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote: > > > > > > > > > > Hi Kairui, > > > > > > > > > > Sorry replying to your patch V1 late, I will reply on the V2 thread. > > > > > > > > > > On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote: > > > > > > > > > > > > From: Kairui Song <kasong@tencent.com> > > > > > > > > > > > > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads > > > > > > swapin the same entry at the same time, they get different pages (A, B). > > > > > > Before one thread (T0) finishes the swapin and installs page (A) > > > > > > to the PTE, another thread (T1) could finish swapin of page (B), > > > > > > swap_free the entry, then swap out the possibly modified page > > > > > > reusing the same entry. It breaks the pte_same check in (T0) because > > > > > > PTE value is unchanged, causing ABA problem. Thread (T0) will > > > > > > install a stalled page (A) into the PTE and cause data corruption. > > > > > > > > > > > > One possible callstack is like this: > > > > > > > > > > > > CPU0 CPU1 > > > > > > ---- ---- > > > > > > do_swap_page() do_swap_page() with same entry > > > > > > <direct swapin path> <direct swapin path> > > > > > > <alloc page A> <alloc page B> > > > > > > swap_read_folio() <- read to page A swap_read_folio() <- read to page B > > > > > > <slow on later locks or interrupt> <finished swapin first> > > > > > > ... set_pte_at() > > > > > > swap_free() <- entry is free > > > > > > <write to page B, now page A stalled> > > > > > > <swap out page B to same swap entry> > > > > > > pte_same() <- Check pass, PTE seems > > > > > > unchanged, but page A > > > > > > is stalled! > > > > > > swap_free() <- page B content lost! > > > > > > set_pte_at() <- staled page A installed! > > > > > > > > > > > > And besides, for ZRAM, swap_free() allows the swap device to discard > > > > > > the entry content, so even if page (B) is not modified, if > > > > > > swap_read_folio() on CPU0 happens later than swap_free() on CPU1, > > > > > > it may also cause data loss. > > > > > > > > > > > > To fix this, reuse swapcache_prepare which will pin the swap entry using > > > > > > the cache flag, and allow only one thread to pin it. Release the pin > > > > > > after PT unlocked. Racers will simply busy wait since it's a rare > > > > > > and very short event. > > > > > > > > > > > > Other methods like increasing the swap count don't seem to be a good > > > > > > idea after some tests, that will cause racers to fall back to use the > > > > > > swap cache again. Parallel swapin using different methods leads to > > > > > > a much more complex scenario. > > > > > > > > > > > > Reproducer: > > > > > > > > > > > > This race issue can be triggered easily using a well constructed > > > > > > reproducer and patched brd (with a delay in read path) [1]: > > > > > > > > > > > > With latest 6.8 mainline, race caused data loss can be observed easily: > > > > > > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out > > > > > > Polulating 32MB of memory region... > > > > > > Keep swapping out... > > > > > > Starting round 0... > > > > > > Spawning 65536 workers... > > > > > > 32746 workers spawned, wait for done... > > > > > > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! > > > > > > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! > > > > > > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! > > > > > > Round 0 Failed, 15 data loss! > > > > > > > > > > > > This reproducer spawns multiple threads sharing the same memory region > > > > > > using a small swap device. Every two threads updates mapped pages one by > > > > > > one in opposite direction trying to create a race, with one dedicated > > > > > > thread keep swapping out the data out using madvise. > > > > > > > > > > > > The reproducer created a reproduce rate of about once every 5 minutes, > > > > > > so the race should be totally possible in production. > > > > > > > > > > > > After this patch, I ran the reproducer for over a few hundred rounds > > > > > > and no data loss observed. > > > > > > > > > > > > Performance overhead is minimal, microbenchmark swapin 10G from 32G > > > > > > zram: > > > > > > > > > > > > Before: 10934698 us > > > > > > After: 11157121 us > > > > > > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) > > > > > > > > > > > > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") > > > > > > Reported-by: "Huang, Ying" <ying.huang@intel.com> > > > > > > Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ > > > > > > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] > > > > > > Signed-off-by: Kairui Song <kasong@tencent.com> > > > > > > Reviewed-by: "Huang, Ying" <ying.huang@intel.com> > > > > > > Acked-by: Yu Zhao <yuzhao@google.com> > > > > > > > > > > > > --- > > > > > > Update from V1: > > > > > > - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song] > > > > > > - Update comments make it cleaner [Huang, Ying] > > > > > > - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park] > > > > > > - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path" [Yu Zhao] > > > > > > - Update commit message. > > > > > > - Collect Review and Acks. > > > > > > > > > > > > include/linux/swap.h | 5 +++++ > > > > > > mm/memory.c | 15 +++++++++++++++ > > > > > > mm/swap.h | 5 +++++ > > > > > > mm/swapfile.c | 13 +++++++++++++ > > > > > > 4 files changed, 38 insertions(+) > > > > > > > > > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h > > > > > > index 4db00ddad261..8d28f6091a32 100644 > > > > > > --- a/include/linux/swap.h > > > > > > +++ b/include/linux/swap.h > > > > > > @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) > > > > > > return 0; > > > > > > } > > > > > > > > > > > > +static inline int swapcache_prepare(swp_entry_t swp) > > > > > > +{ > > > > > > + return 0; > > > > > > +} > > > > > > + > > > > > > static inline void swap_free(swp_entry_t swp) > > > > > > { > > > > > > } > > > > > > diff --git a/mm/memory.c b/mm/memory.c > > > > > > index 7e1f4849463a..1749c700823d 100644 > > > > > > --- a/mm/memory.c > > > > > > +++ b/mm/memory.c > > > > > > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > > > if (!folio) { > > > > > > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > > > > > > __swap_count(entry) == 1) { > > > > > > + /* > > > > > > + * Prevent parallel swapin from proceeding with > > > > > > + * the cache flag. Otherwise, another thread may > > > > > > + * finish swapin first, free the entry, and swapout > > > > > > + * reusing the same entry. It's undetectable as > > > > > > + * pte_same() returns true due to entry reuse. > > > > > > + */ > > > > > > + if (swapcache_prepare(entry)) > > > > > > + goto out; > > > > > > + > > > > > > > > > > I am puzzled by this "goto out". If I understand this correctly, you > > > > > have two threads CPU1 and CPU2 racing to set the flag SWAP_HAS_CACHE. > > > > > The CPU1 will succeed in adding the flag and the CPU2 will get > > > > > "-EEXIST" from "swapcache_prepare(entry)". Am I understanding it > > > > > correctly so far? > > > > > > > > > > Then the goto out seems wrong to me. For the CPU2, the page fault will > > > > > return *unhandled*. Even worse, the "-EEXIST" error is not preserved, > > > > > CPU2 does not even know the page fault is not handled, it will resume > > > > > from the page fault instruction, possibly generate another page fault > > > > > at the exact same location. That page fault loop will repeat until > > > > > CPU1 install the new pte on that faulting virtual address and pick up > > > > > by CPU2. > > > > > > > > > > Am I missing something obvious there? > > > > > > > > I feel you are right. any concurrent page faults at the same pte > > > > will increase the count of page faults for a couple of times now. > > > > > > > > > > > > > > I just re-read your comment: "Racers will simply busy wait since it's > > > > > a rare and very short event." That might be referring to the above > > > > > CPU2 page fault looping situation. I consider the page fault looping > > > > > on CPU2 not acceptable. For one it will mess up the page fault > > > > > statistics. > > > > > In my mind, having an explicit loop for CPU2 waiting for the PTE to > > > > > show up is still better than this page fault loop. You can have more > > > > > CPU power friendly loops. > > > > > > > > I assume you mean something like > > > > > > > > while(!pte_same()) > > > > cpu_relax(); > > > > > > > > then we still have a chance to miss the change of B. > > > > > > > > For example, another thread is changing pte to A->B->A, our loop can > > > > miss B. Thus we will trap into an infinite loop. this is even worse. > > > > > > Yes. You are right, it is worse. Thanks for catching that. That is why > > > I say this needs more discussion, I haven't fully thought it through > > > :-) > > > > Hi Chris and Barry, > > > > Thanks for the comments! > > > > Hi Kairui, Hi Barry, > > > The worst thing I know of returning in do_swap_page without handling > > the swap, is an increase of some statistic counters, note it will not > > cause major page fault counters to grow, only things like perf counter > > and vma lock statistic are affected. > > I don't understand :-) if it is goto out, userspace may re-execute the > instruction. > What is going to happen is a totally new page fault. Right, it's a new page fault, I mean if the same PTE is still not present, it's handled in the same way again. > > > > And actually there are multiple already existing return points in > > do_swap_page that will return without handling it, which may > > re-trigger the page fault. > > I feel this case is different from other returns. > other returns probably have held ptl or page lock before checking entry/pte, > and another thread has likely changed the PTE/swap entry before those > locks are released. > then it is likely there is no following page fault. Right, in other return points, they expect the next try will just success without triggering another fault. But here SWP_SYNCHRONOUS_IO is usually very fast, so next attempt here will also likely succeed without triggering a fault. So, to ensure that, maybe we can just wait a bit before returning? That way the following page fault should be very unlikely to happen. I'm not sure if there is any better way to wait than simply adding a cpu_relax, waiting on swap cache flag may cause it to wait forever, since it's totally possible some other threads added the entry to swapcache. Waiting for PTE change will result in the same issue of ABA.
On Wed, Feb 7, 2024 at 7:02 AM Minchan Kim <minchan@kernel.org> wrote: > > On Wed, Feb 07, 2024 at 02:25:59AM +0800, Kairui Song wrote: > > From: Kairui Song <kasong@tencent.com> > > > > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads > > swapin the same entry at the same time, they get different pages (A, B). > > Before one thread (T0) finishes the swapin and installs page (A) > > to the PTE, another thread (T1) could finish swapin of page (B), > > swap_free the entry, then swap out the possibly modified page > > reusing the same entry. It breaks the pte_same check in (T0) because > > PTE value is unchanged, causing ABA problem. Thread (T0) will > > install a stalled page (A) into the PTE and cause data corruption. > > > > One possible callstack is like this: > > > > CPU0 CPU1 > > ---- ---- > > do_swap_page() do_swap_page() with same entry > > <direct swapin path> <direct swapin path> > > <alloc page A> <alloc page B> > > swap_read_folio() <- read to page A swap_read_folio() <- read to page B > > <slow on later locks or interrupt> <finished swapin first> > > ... set_pte_at() > > swap_free() <- entry is free > > ^^^ > nit: From the recent code, I see swap_free is called earlier than set_pte_at Thanks, will update the message. > > > > <write to page B, now page A stalled> > > <swap out page B to same swap entry> > > pte_same() <- Check pass, PTE seems > > unchanged, but page A > > is stalled! > > swap_free() <- page B content lost! > > set_pte_at() <- staled page A installed! > > > > And besides, for ZRAM, swap_free() allows the swap device to discard > > the entry content, so even if page (B) is not modified, if > > swap_read_folio() on CPU0 happens later than swap_free() on CPU1, > > it may also cause data loss. > > Thanks for catching the issue, folks! > > > > > To fix this, reuse swapcache_prepare which will pin the swap entry using > > the cache flag, and allow only one thread to pin it. Release the pin > > after PT unlocked. Racers will simply busy wait since it's a rare > > and very short event. > > > > Other methods like increasing the swap count don't seem to be a good > > idea after some tests, that will cause racers to fall back to use the > > swap cache again. Parallel swapin using different methods leads to > > a much more complex scenario. > > > > Reproducer: > > > > This race issue can be triggered easily using a well constructed > > reproducer and patched brd (with a delay in read path) [1]: > > > > With latest 6.8 mainline, race caused data loss can be observed easily: > > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out > > Polulating 32MB of memory region... > > Keep swapping out... > > Starting round 0... > > Spawning 65536 workers... > > 32746 workers spawned, wait for done... > > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! > > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! > > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! > > Round 0 Failed, 15 data loss! > > > > This reproducer spawns multiple threads sharing the same memory region > > using a small swap device. Every two threads updates mapped pages one by > > one in opposite direction trying to create a race, with one dedicated > > thread keep swapping out the data out using madvise. > > > > The reproducer created a reproduce rate of about once every 5 minutes, > > so the race should be totally possible in production. > > > > After this patch, I ran the reproducer for over a few hundred rounds > > and no data loss observed. > > > > Performance overhead is minimal, microbenchmark swapin 10G from 32G > > zram: > > > > Before: 10934698 us > > After: 11157121 us > > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) > > > > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") > > Reported-by: "Huang, Ying" <ying.huang@intel.com> > > Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ > > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] > > Signed-off-by: Kairui Song <kasong@tencent.com> > > Reviewed-by: "Huang, Ying" <ying.huang@intel.com> > > Acked-by: Yu Zhao <yuzhao@google.com> > > > > --- > > Update from V1: > > - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song] > > - Update comments make it cleaner [Huang, Ying] > > - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park] > > - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path" [Yu Zhao] > > - Update commit message. > > - Collect Review and Acks. > > > > include/linux/swap.h | 5 +++++ > > mm/memory.c | 15 +++++++++++++++ > > mm/swap.h | 5 +++++ > > mm/swapfile.c | 13 +++++++++++++ > > 4 files changed, 38 insertions(+) > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h > > index 4db00ddad261..8d28f6091a32 100644 > > --- a/include/linux/swap.h > > +++ b/include/linux/swap.h > > @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) > > return 0; > > } > > > > +static inline int swapcache_prepare(swp_entry_t swp) > > +{ > > + return 0; > > +} > > + > > static inline void swap_free(swp_entry_t swp) > > { > > } > > diff --git a/mm/memory.c b/mm/memory.c > > index 7e1f4849463a..1749c700823d 100644 > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > if (!folio) { > > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > > __swap_count(entry) == 1) { > > + /* > > + * Prevent parallel swapin from proceeding with > > + * the cache flag. Otherwise, another thread may > > + * finish swapin first, free the entry, and swapout > > + * reusing the same entry. It's undetectable as > > + * pte_same() returns true due to entry reuse. > > + */ > > + if (swapcache_prepare(entry)) > > + goto out; > > + > > /* skip swapcache */ > > folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, > > vma, vmf->address, false); > > @@ -4116,6 +4126,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > unlock: > > if (vmf->pte) > > pte_unmap_unlock(vmf->pte, vmf->ptl); > > + /* Clear the swap cache pin for direct swapin after PTL unlock */ > > + if (folio && !swapcache) > > + swapcache_clear(si, entry); > > out: > > if (si) > > put_swap_device(si); > > @@ -4124,6 +4137,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > if (vmf->pte) > > pte_unmap_unlock(vmf->pte, vmf->ptl); > > out_page: > > + if (!swapcache) > > + swapcache_clear(si, entry); > > folio_unlock(folio); > > out_release: > > folio_put(folio); > > What happens? > > do_swap_page > .. > swapcache_prepare() <- tured the cache flag on > > folio = vma_alloc_folio <- failed to allocate the folio > page = &foio->page; <- crash but it's out of scope from this patch > > .. > if (!folio) > goto unlock; > > .. > unlock: > swapcache_clear(si, entry) <- it's skipped this time. > > > Can we simply introduce a boolean flag to state the special case and > clear the cache state based on the flag? Good idea, that should make the code easier to understand.
Kairui Song <ryncsn@gmail.com> writes: > On Wed, Feb 7, 2024 at 10:10 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> Barry Song <21cnbao@gmail.com> writes: >> >> > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote: >> >> >> >> Hi Kairui, >> >> >> >> Sorry replying to your patch V1 late, I will reply on the V2 thread. >> >> >> >> On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote: >> >> > >> >> > From: Kairui Song <kasong@tencent.com> >> >> > >> >> > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads >> >> > swapin the same entry at the same time, they get different pages (A, B). >> >> > Before one thread (T0) finishes the swapin and installs page (A) >> >> > to the PTE, another thread (T1) could finish swapin of page (B), >> >> > swap_free the entry, then swap out the possibly modified page >> >> > reusing the same entry. It breaks the pte_same check in (T0) because >> >> > PTE value is unchanged, causing ABA problem. Thread (T0) will >> >> > install a stalled page (A) into the PTE and cause data corruption. >> >> > >> >> > One possible callstack is like this: >> >> > >> >> > CPU0 CPU1 >> >> > ---- ---- >> >> > do_swap_page() do_swap_page() with same entry >> >> > <direct swapin path> <direct swapin path> >> >> > <alloc page A> <alloc page B> >> >> > swap_read_folio() <- read to page A swap_read_folio() <- read to page B >> >> > <slow on later locks or interrupt> <finished swapin first> >> >> > ... set_pte_at() >> >> > swap_free() <- entry is free >> >> > <write to page B, now page A stalled> >> >> > <swap out page B to same swap entry> >> >> > pte_same() <- Check pass, PTE seems >> >> > unchanged, but page A >> >> > is stalled! >> >> > swap_free() <- page B content lost! >> >> > set_pte_at() <- staled page A installed! >> >> > >> >> > And besides, for ZRAM, swap_free() allows the swap device to discard >> >> > the entry content, so even if page (B) is not modified, if >> >> > swap_read_folio() on CPU0 happens later than swap_free() on CPU1, >> >> > it may also cause data loss. >> >> > >> >> > To fix this, reuse swapcache_prepare which will pin the swap entry using >> >> > the cache flag, and allow only one thread to pin it. Release the pin >> >> > after PT unlocked. Racers will simply busy wait since it's a rare >> >> > and very short event. >> >> > >> >> > Other methods like increasing the swap count don't seem to be a good >> >> > idea after some tests, that will cause racers to fall back to use the >> >> > swap cache again. Parallel swapin using different methods leads to >> >> > a much more complex scenario. >> >> > >> >> > Reproducer: >> >> > >> >> > This race issue can be triggered easily using a well constructed >> >> > reproducer and patched brd (with a delay in read path) [1]: >> >> > >> >> > With latest 6.8 mainline, race caused data loss can be observed easily: >> >> > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out >> >> > Polulating 32MB of memory region... >> >> > Keep swapping out... >> >> > Starting round 0... >> >> > Spawning 65536 workers... >> >> > 32746 workers spawned, wait for done... >> >> > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! >> >> > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! >> >> > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! >> >> > Round 0 Failed, 15 data loss! >> >> > >> >> > This reproducer spawns multiple threads sharing the same memory region >> >> > using a small swap device. Every two threads updates mapped pages one by >> >> > one in opposite direction trying to create a race, with one dedicated >> >> > thread keep swapping out the data out using madvise. >> >> > >> >> > The reproducer created a reproduce rate of about once every 5 minutes, >> >> > so the race should be totally possible in production. >> >> > >> >> > After this patch, I ran the reproducer for over a few hundred rounds >> >> > and no data loss observed. >> >> > >> >> > Performance overhead is minimal, microbenchmark swapin 10G from 32G >> >> > zram: >> >> > >> >> > Before: 10934698 us >> >> > After: 11157121 us >> >> > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) >> >> > >> >> > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") >> >> > Reported-by: "Huang, Ying" <ying.huang@intel.com> >> >> > Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ >> >> > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] >> >> > Signed-off-by: Kairui Song <kasong@tencent.com> >> >> > Reviewed-by: "Huang, Ying" <ying.huang@intel.com> >> >> > Acked-by: Yu Zhao <yuzhao@google.com> >> >> > >> >> > --- >> >> > Update from V1: >> >> > - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song] >> >> > - Update comments make it cleaner [Huang, Ying] >> >> > - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park] >> >> > - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path" [Yu Zhao] >> >> > - Update commit message. >> >> > - Collect Review and Acks. >> >> > >> >> > include/linux/swap.h | 5 +++++ >> >> > mm/memory.c | 15 +++++++++++++++ >> >> > mm/swap.h | 5 +++++ >> >> > mm/swapfile.c | 13 +++++++++++++ >> >> > 4 files changed, 38 insertions(+) >> >> > >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h >> >> > index 4db00ddad261..8d28f6091a32 100644 >> >> > --- a/include/linux/swap.h >> >> > +++ b/include/linux/swap.h >> >> > @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) >> >> > return 0; >> >> > } >> >> > >> >> > +static inline int swapcache_prepare(swp_entry_t swp) >> >> > +{ >> >> > + return 0; >> >> > +} >> >> > + >> >> > static inline void swap_free(swp_entry_t swp) >> >> > { >> >> > } >> >> > diff --git a/mm/memory.c b/mm/memory.c >> >> > index 7e1f4849463a..1749c700823d 100644 >> >> > --- a/mm/memory.c >> >> > +++ b/mm/memory.c >> >> > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >> >> > if (!folio) { >> >> > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && >> >> > __swap_count(entry) == 1) { >> >> > + /* >> >> > + * Prevent parallel swapin from proceeding with >> >> > + * the cache flag. Otherwise, another thread may >> >> > + * finish swapin first, free the entry, and swapout >> >> > + * reusing the same entry. It's undetectable as >> >> > + * pte_same() returns true due to entry reuse. >> >> > + */ >> >> > + if (swapcache_prepare(entry)) >> >> > + goto out; >> >> > + >> >> >> >> I am puzzled by this "goto out". If I understand this correctly, you >> >> have two threads CPU1 and CPU2 racing to set the flag SWAP_HAS_CACHE. >> >> The CPU1 will succeed in adding the flag and the CPU2 will get >> >> "-EEXIST" from "swapcache_prepare(entry)". Am I understanding it >> >> correctly so far? >> >> >> >> Then the goto out seems wrong to me. For the CPU2, the page fault will >> >> return *unhandled*. Even worse, the "-EEXIST" error is not preserved, >> >> CPU2 does not even know the page fault is not handled, it will resume >> >> from the page fault instruction, possibly generate another page fault >> >> at the exact same location. That page fault loop will repeat until >> >> CPU1 install the new pte on that faulting virtual address and pick up >> >> by CPU2. >> >> >> >> Am I missing something obvious there? >> > >> > I feel you are right. any concurrent page faults at the same pte >> > will increase the count of page faults for a couple of times now. >> > >> >> >> >> I just re-read your comment: "Racers will simply busy wait since it's >> >> a rare and very short event." That might be referring to the above >> >> CPU2 page fault looping situation. I consider the page fault looping >> >> on CPU2 not acceptable. For one it will mess up the page fault >> >> statistics. >> >> In my mind, having an explicit loop for CPU2 waiting for the PTE to >> >> show up is still better than this page fault loop. You can have more >> >> CPU power friendly loops. >> > >> > I assume you mean something like >> > >> > while(!pte_same()) >> > cpu_relax(); >> > >> > then we still have a chance to miss the change of B. >> > >> > For example, another thread is changing pte to A->B->A, our loop can >> > miss B. Thus we will trap into an infinite loop. this is even worse. >> > >> > is it possible to loop for the success of swapcache_prepare(entry) >> > instead? >> >> This doesn't work too. The swap count can increase to > 1 and be put in >> swap cache for long time. >> >> Another possibility is to move swapcache_prepare() after >> vma_alloc_folio() to reduce the race window. > > Reducing the race window seems like a good way. Or maybe we can just > add a cpu_relax() so raced swapins will just slow down, and won't loop > too much time and so the side effect (counter or power consumption) > should be much smaller? cpu_relax() only makes sense in tight loop. -- Best Regards, Huang, Ying
On Wed, Feb 7, 2024 at 3:29 PM Kairui Song <ryncsn@gmail.com> wrote: > > On Wed, Feb 7, 2024 at 10:10 AM Huang, Ying <ying.huang@intel.com> wrote: > > > > Barry Song <21cnbao@gmail.com> writes: > > > > > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote: > > >> > > >> Hi Kairui, > > >> > > >> Sorry replying to your patch V1 late, I will reply on the V2 thread. > > >> > > >> On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote: > > >> > > > >> > From: Kairui Song <kasong@tencent.com> > > >> > > > >> > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads > > >> > swapin the same entry at the same time, they get different pages (A, B). > > >> > Before one thread (T0) finishes the swapin and installs page (A) > > >> > to the PTE, another thread (T1) could finish swapin of page (B), > > >> > swap_free the entry, then swap out the possibly modified page > > >> > reusing the same entry. It breaks the pte_same check in (T0) because > > >> > PTE value is unchanged, causing ABA problem. Thread (T0) will > > >> > install a stalled page (A) into the PTE and cause data corruption. > > >> > > > >> > One possible callstack is like this: > > >> > > > >> > CPU0 CPU1 > > >> > ---- ---- > > >> > do_swap_page() do_swap_page() with same entry > > >> > <direct swapin path> <direct swapin path> > > >> > <alloc page A> <alloc page B> > > >> > swap_read_folio() <- read to page A swap_read_folio() <- read to page B > > >> > <slow on later locks or interrupt> <finished swapin first> > > >> > ... set_pte_at() > > >> > swap_free() <- entry is free > > >> > <write to page B, now page A stalled> > > >> > <swap out page B to same swap entry> > > >> > pte_same() <- Check pass, PTE seems > > >> > unchanged, but page A > > >> > is stalled! > > >> > swap_free() <- page B content lost! > > >> > set_pte_at() <- staled page A installed! > > >> > > > >> > And besides, for ZRAM, swap_free() allows the swap device to discard > > >> > the entry content, so even if page (B) is not modified, if > > >> > swap_read_folio() on CPU0 happens later than swap_free() on CPU1, > > >> > it may also cause data loss. > > >> > > > >> > To fix this, reuse swapcache_prepare which will pin the swap entry using > > >> > the cache flag, and allow only one thread to pin it. Release the pin > > >> > after PT unlocked. Racers will simply busy wait since it's a rare > > >> > and very short event. > > >> > > > >> > Other methods like increasing the swap count don't seem to be a good > > >> > idea after some tests, that will cause racers to fall back to use the > > >> > swap cache again. Parallel swapin using different methods leads to > > >> > a much more complex scenario. > > >> > > > >> > Reproducer: > > >> > > > >> > This race issue can be triggered easily using a well constructed > > >> > reproducer and patched brd (with a delay in read path) [1]: > > >> > > > >> > With latest 6.8 mainline, race caused data loss can be observed easily: > > >> > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out > > >> > Polulating 32MB of memory region... > > >> > Keep swapping out... > > >> > Starting round 0... > > >> > Spawning 65536 workers... > > >> > 32746 workers spawned, wait for done... > > >> > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! > > >> > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! > > >> > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! > > >> > Round 0 Failed, 15 data loss! > > >> > > > >> > This reproducer spawns multiple threads sharing the same memory region > > >> > using a small swap device. Every two threads updates mapped pages one by > > >> > one in opposite direction trying to create a race, with one dedicated > > >> > thread keep swapping out the data out using madvise. > > >> > > > >> > The reproducer created a reproduce rate of about once every 5 minutes, > > >> > so the race should be totally possible in production. > > >> > > > >> > After this patch, I ran the reproducer for over a few hundred rounds > > >> > and no data loss observed. > > >> > > > >> > Performance overhead is minimal, microbenchmark swapin 10G from 32G > > >> > zram: > > >> > > > >> > Before: 10934698 us > > >> > After: 11157121 us > > >> > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) > > >> > > > >> > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") > > >> > Reported-by: "Huang, Ying" <ying.huang@intel.com> > > >> > Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ > > >> > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] > > >> > Signed-off-by: Kairui Song <kasong@tencent.com> > > >> > Reviewed-by: "Huang, Ying" <ying.huang@intel.com> > > >> > Acked-by: Yu Zhao <yuzhao@google.com> > > >> > > > >> > --- > > >> > Update from V1: > > >> > - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song] > > >> > - Update comments make it cleaner [Huang, Ying] > > >> > - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park] > > >> > - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path" [Yu Zhao] > > >> > - Update commit message. > > >> > - Collect Review and Acks. > > >> > > > >> > include/linux/swap.h | 5 +++++ > > >> > mm/memory.c | 15 +++++++++++++++ > > >> > mm/swap.h | 5 +++++ > > >> > mm/swapfile.c | 13 +++++++++++++ > > >> > 4 files changed, 38 insertions(+) > > >> > > > >> > diff --git a/include/linux/swap.h b/include/linux/swap.h > > >> > index 4db00ddad261..8d28f6091a32 100644 > > >> > --- a/include/linux/swap.h > > >> > +++ b/include/linux/swap.h > > >> > @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) > > >> > return 0; > > >> > } > > >> > > > >> > +static inline int swapcache_prepare(swp_entry_t swp) > > >> > +{ > > >> > + return 0; > > >> > +} > > >> > + > > >> > static inline void swap_free(swp_entry_t swp) > > >> > { > > >> > } > > >> > diff --git a/mm/memory.c b/mm/memory.c > > >> > index 7e1f4849463a..1749c700823d 100644 > > >> > --- a/mm/memory.c > > >> > +++ b/mm/memory.c > > >> > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > >> > if (!folio) { > > >> > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > > >> > __swap_count(entry) == 1) { > > >> > + /* > > >> > + * Prevent parallel swapin from proceeding with > > >> > + * the cache flag. Otherwise, another thread may > > >> > + * finish swapin first, free the entry, and swapout > > >> > + * reusing the same entry. It's undetectable as > > >> > + * pte_same() returns true due to entry reuse. > > >> > + */ > > >> > + if (swapcache_prepare(entry)) > > >> > + goto out; > > >> > + > > >> > > >> I am puzzled by this "goto out". If I understand this correctly, you > > >> have two threads CPU1 and CPU2 racing to set the flag SWAP_HAS_CACHE. > > >> The CPU1 will succeed in adding the flag and the CPU2 will get > > >> "-EEXIST" from "swapcache_prepare(entry)". Am I understanding it > > >> correctly so far? > > >> > > >> Then the goto out seems wrong to me. For the CPU2, the page fault will > > >> return *unhandled*. Even worse, the "-EEXIST" error is not preserved, > > >> CPU2 does not even know the page fault is not handled, it will resume > > >> from the page fault instruction, possibly generate another page fault > > >> at the exact same location. That page fault loop will repeat until > > >> CPU1 install the new pte on that faulting virtual address and pick up > > >> by CPU2. > > >> > > >> Am I missing something obvious there? > > > > > > I feel you are right. any concurrent page faults at the same pte > > > will increase the count of page faults for a couple of times now. > > > > > >> > > >> I just re-read your comment: "Racers will simply busy wait since it's > > >> a rare and very short event." That might be referring to the above > > >> CPU2 page fault looping situation. I consider the page fault looping > > >> on CPU2 not acceptable. For one it will mess up the page fault > > >> statistics. > > >> In my mind, having an explicit loop for CPU2 waiting for the PTE to > > >> show up is still better than this page fault loop. You can have more > > >> CPU power friendly loops. > > > > > > I assume you mean something like > > > > > > while(!pte_same()) > > > cpu_relax(); > > > > > > then we still have a chance to miss the change of B. > > > > > > For example, another thread is changing pte to A->B->A, our loop can > > > miss B. Thus we will trap into an infinite loop. this is even worse. > > > > > > is it possible to loop for the success of swapcache_prepare(entry) > > > instead? > > > > This doesn't work too. The swap count can increase to > 1 and be put in > > swap cache for long time. > > > > Another possibility is to move swapcache_prepare() after > > vma_alloc_folio() to reduce the race window. what about we make everything go as it is. I mean, we only need to record we have failed on swapcache_prepare, but we don't goto out. bool swapcache_prepare_failed = swapcache_prepare(); .... // don't change any code but we only change the last code to set pte from the below ptl if(pte_same) set_pte to ptl if(pte_same && !swapcache_prepare_failed) set_pte as the chance is close to 0%, the increased count should be very minor. > > Reducing the race window seems like a good way. Or maybe we can just > add a cpu_relax() so raced swapins will just slow down, and won't loop > too much time and so the side effect (counter or power consumption) > should be much smaller? Thanks Barry
On Tue, Feb 6, 2024 at 6:21 PM Kairui Song <ryncsn@gmail.com> wrote: > > On Wed, Feb 7, 2024 at 10:03 AM Chris Li <chrisl@kernel.org> wrote: > > > > On Tue, Feb 6, 2024 at 4:43 PM Barry Song <21cnbao@gmail.com> wrote: > > > > > > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote: > > > > > > > > Hi Kairui, > > > > > > > > Sorry replying to your patch V1 late, I will reply on the V2 thread. > > > > > > > > On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote: > > > > > > > > > > From: Kairui Song <kasong@tencent.com> > > > > > > > > > > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads > > > > > swapin the same entry at the same time, they get different pages (A, B). > > > > > Before one thread (T0) finishes the swapin and installs page (A) > > > > > to the PTE, another thread (T1) could finish swapin of page (B), > > > > > swap_free the entry, then swap out the possibly modified page > > > > > reusing the same entry. It breaks the pte_same check in (T0) because > > > > > PTE value is unchanged, causing ABA problem. Thread (T0) will > > > > > install a stalled page (A) into the PTE and cause data corruption. > > > > > > > > > > One possible callstack is like this: > > > > > > > > > > CPU0 CPU1 > > > > > ---- ---- > > > > > do_swap_page() do_swap_page() with same entry > > > > > <direct swapin path> <direct swapin path> > > > > > <alloc page A> <alloc page B> > > > > > swap_read_folio() <- read to page A swap_read_folio() <- read to page B > > > > > <slow on later locks or interrupt> <finished swapin first> > > > > > ... set_pte_at() > > > > > swap_free() <- entry is free > > > > > <write to page B, now page A stalled> > > > > > <swap out page B to same swap entry> > > > > > pte_same() <- Check pass, PTE seems > > > > > unchanged, but page A > > > > > is stalled! > > > > > swap_free() <- page B content lost! > > > > > set_pte_at() <- staled page A installed! > > > > > > > > > > And besides, for ZRAM, swap_free() allows the swap device to discard > > > > > the entry content, so even if page (B) is not modified, if > > > > > swap_read_folio() on CPU0 happens later than swap_free() on CPU1, > > > > > it may also cause data loss. > > > > > > > > > > To fix this, reuse swapcache_prepare which will pin the swap entry using > > > > > the cache flag, and allow only one thread to pin it. Release the pin > > > > > after PT unlocked. Racers will simply busy wait since it's a rare > > > > > and very short event. > > > > > > > > > > Other methods like increasing the swap count don't seem to be a good > > > > > idea after some tests, that will cause racers to fall back to use the > > > > > swap cache again. Parallel swapin using different methods leads to > > > > > a much more complex scenario. > > > > > > > > > > Reproducer: > > > > > > > > > > This race issue can be triggered easily using a well constructed > > > > > reproducer and patched brd (with a delay in read path) [1]: > > > > > > > > > > With latest 6.8 mainline, race caused data loss can be observed easily: > > > > > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out > > > > > Polulating 32MB of memory region... > > > > > Keep swapping out... > > > > > Starting round 0... > > > > > Spawning 65536 workers... > > > > > 32746 workers spawned, wait for done... > > > > > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! > > > > > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! > > > > > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! > > > > > Round 0 Failed, 15 data loss! > > > > > > > > > > This reproducer spawns multiple threads sharing the same memory region > > > > > using a small swap device. Every two threads updates mapped pages one by > > > > > one in opposite direction trying to create a race, with one dedicated > > > > > thread keep swapping out the data out using madvise. > > > > > > > > > > The reproducer created a reproduce rate of about once every 5 minutes, > > > > > so the race should be totally possible in production. > > > > > > > > > > After this patch, I ran the reproducer for over a few hundred rounds > > > > > and no data loss observed. > > > > > > > > > > Performance overhead is minimal, microbenchmark swapin 10G from 32G > > > > > zram: > > > > > > > > > > Before: 10934698 us > > > > > After: 11157121 us > > > > > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) > > > > > > > > > > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") > > > > > Reported-by: "Huang, Ying" <ying.huang@intel.com> > > > > > Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ > > > > > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] > > > > > Signed-off-by: Kairui Song <kasong@tencent.com> > > > > > Reviewed-by: "Huang, Ying" <ying.huang@intel.com> > > > > > Acked-by: Yu Zhao <yuzhao@google.com> > > > > > > > > > > --- > > > > > Update from V1: > > > > > - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song] > > > > > - Update comments make it cleaner [Huang, Ying] > > > > > - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park] > > > > > - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path" [Yu Zhao] > > > > > - Update commit message. > > > > > - Collect Review and Acks. > > > > > > > > > > include/linux/swap.h | 5 +++++ > > > > > mm/memory.c | 15 +++++++++++++++ > > > > > mm/swap.h | 5 +++++ > > > > > mm/swapfile.c | 13 +++++++++++++ > > > > > 4 files changed, 38 insertions(+) > > > > > > > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h > > > > > index 4db00ddad261..8d28f6091a32 100644 > > > > > --- a/include/linux/swap.h > > > > > +++ b/include/linux/swap.h > > > > > @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) > > > > > return 0; > > > > > } > > > > > > > > > > +static inline int swapcache_prepare(swp_entry_t swp) > > > > > +{ > > > > > + return 0; > > > > > +} > > > > > + > > > > > static inline void swap_free(swp_entry_t swp) > > > > > { > > > > > } > > > > > diff --git a/mm/memory.c b/mm/memory.c > > > > > index 7e1f4849463a..1749c700823d 100644 > > > > > --- a/mm/memory.c > > > > > +++ b/mm/memory.c > > > > > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > > if (!folio) { > > > > > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > > > > > __swap_count(entry) == 1) { > > > > > + /* > > > > > + * Prevent parallel swapin from proceeding with > > > > > + * the cache flag. Otherwise, another thread may > > > > > + * finish swapin first, free the entry, and swapout > > > > > + * reusing the same entry. It's undetectable as > > > > > + * pte_same() returns true due to entry reuse. > > > > > + */ > > > > > + if (swapcache_prepare(entry)) > > > > > + goto out; > > > > > + > > > > > > > > I am puzzled by this "goto out". If I understand this correctly, you > > > > have two threads CPU1 and CPU2 racing to set the flag SWAP_HAS_CACHE. > > > > The CPU1 will succeed in adding the flag and the CPU2 will get > > > > "-EEXIST" from "swapcache_prepare(entry)". Am I understanding it > > > > correctly so far? > > > > > > > > Then the goto out seems wrong to me. For the CPU2, the page fault will > > > > return *unhandled*. Even worse, the "-EEXIST" error is not preserved, > > > > CPU2 does not even know the page fault is not handled, it will resume > > > > from the page fault instruction, possibly generate another page fault > > > > at the exact same location. That page fault loop will repeat until > > > > CPU1 install the new pte on that faulting virtual address and pick up > > > > by CPU2. > > > > > > > > Am I missing something obvious there? > > > > > > I feel you are right. any concurrent page faults at the same pte > > > will increase the count of page faults for a couple of times now. > > > > > > > > > > > I just re-read your comment: "Racers will simply busy wait since it's > > > > a rare and very short event." That might be referring to the above > > > > CPU2 page fault looping situation. I consider the page fault looping > > > > on CPU2 not acceptable. For one it will mess up the page fault > > > > statistics. > > > > In my mind, having an explicit loop for CPU2 waiting for the PTE to > > > > show up is still better than this page fault loop. You can have more > > > > CPU power friendly loops. > > > > > > I assume you mean something like > > > > > > while(!pte_same()) > > > cpu_relax(); > > > > > > then we still have a chance to miss the change of B. > > > > > > For example, another thread is changing pte to A->B->A, our loop can > > > miss B. Thus we will trap into an infinite loop. this is even worse. > > > > Yes. You are right, it is worse. Thanks for catching that. That is why > > I say this needs more discussion, I haven't fully thought it through > > :-) > > Hi Chris and Barry, > > Thanks for the comments! > > The worst thing I know of returning in do_swap_page without handling > the swap, is an increase of some statistic counters, note it will not > cause major page fault counters to grow, only things like perf counter > and vma lock statistic are affected. > > And actually there are multiple already existing return points in > do_swap_page that will return without handling it, which may > re-trigger the page fault. Thanks for pointing that out. I take a look at those, which seems different than the case here. In those cases, it truely can not make forward progress. Here we actually have all the data it needs to complete the page fault. Just a data synchronization issue preventing making forward progress. Ideally we can have some clever data structure to solve the synchronization issue and make forward progress. > When do_swap_page is called, many pre-checks have been applied, and > they could all be invalidated if something raced, simply looping > inside here could miss a lot of corner cases, so we have to go through > that again. Actually, I think about it. Looping it here seems worse in the sense that it is already holding some locks. Return and retry the page fault at least release those locks and let others have a chance to make progress. > > This patch did increase the chance of false positive increase of some > counters, maybe something like returning a VM_FAULT_RETRY could make > it better, but code is more complex and will cause other counters to > grow. This is certainly not ideal. It might upset the feedback loop that uses the swap fault statistic as input to adjust the swapping behavior. Chris
On Wed, Feb 7, 2024 at 12:02 PM Chris Li <chrisl@kernel.org> wrote: > > On Tue, Feb 6, 2024 at 6:21 PM Kairui Song <ryncsn@gmail.com> wrote: > > > > On Wed, Feb 7, 2024 at 10:03 AM Chris Li <chrisl@kernel.org> wrote: > > > > > > On Tue, Feb 6, 2024 at 4:43 PM Barry Song <21cnbao@gmail.com> wrote: > > > > > > > > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote: > > > > > > > > > > Hi Kairui, > > > > > > > > > > Sorry replying to your patch V1 late, I will reply on the V2 thread. > > > > > > > > > > On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote: > > > > > > > > > > > > From: Kairui Song <kasong@tencent.com> > > > > > > > > > > > > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads > > > > > > swapin the same entry at the same time, they get different pages (A, B). > > > > > > Before one thread (T0) finishes the swapin and installs page (A) > > > > > > to the PTE, another thread (T1) could finish swapin of page (B), > > > > > > swap_free the entry, then swap out the possibly modified page > > > > > > reusing the same entry. It breaks the pte_same check in (T0) because > > > > > > PTE value is unchanged, causing ABA problem. Thread (T0) will > > > > > > install a stalled page (A) into the PTE and cause data corruption. > > > > > > > > > > > > One possible callstack is like this: > > > > > > > > > > > > CPU0 CPU1 > > > > > > ---- ---- > > > > > > do_swap_page() do_swap_page() with same entry > > > > > > <direct swapin path> <direct swapin path> > > > > > > <alloc page A> <alloc page B> > > > > > > swap_read_folio() <- read to page A swap_read_folio() <- read to page B > > > > > > <slow on later locks or interrupt> <finished swapin first> > > > > > > ... set_pte_at() > > > > > > swap_free() <- entry is free > > > > > > <write to page B, now page A stalled> > > > > > > <swap out page B to same swap entry> > > > > > > pte_same() <- Check pass, PTE seems > > > > > > unchanged, but page A > > > > > > is stalled! > > > > > > swap_free() <- page B content lost! > > > > > > set_pte_at() <- staled page A installed! > > > > > > > > > > > > And besides, for ZRAM, swap_free() allows the swap device to discard > > > > > > the entry content, so even if page (B) is not modified, if > > > > > > swap_read_folio() on CPU0 happens later than swap_free() on CPU1, > > > > > > it may also cause data loss. > > > > > > > > > > > > To fix this, reuse swapcache_prepare which will pin the swap entry using > > > > > > the cache flag, and allow only one thread to pin it. Release the pin > > > > > > after PT unlocked. Racers will simply busy wait since it's a rare > > > > > > and very short event. > > > > > > > > > > > > Other methods like increasing the swap count don't seem to be a good > > > > > > idea after some tests, that will cause racers to fall back to use the > > > > > > swap cache again. Parallel swapin using different methods leads to > > > > > > a much more complex scenario. > > > > > > > > > > > > Reproducer: > > > > > > > > > > > > This race issue can be triggered easily using a well constructed > > > > > > reproducer and patched brd (with a delay in read path) [1]: > > > > > > > > > > > > With latest 6.8 mainline, race caused data loss can be observed easily: > > > > > > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out > > > > > > Polulating 32MB of memory region... > > > > > > Keep swapping out... > > > > > > Starting round 0... > > > > > > Spawning 65536 workers... > > > > > > 32746 workers spawned, wait for done... > > > > > > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! > > > > > > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! > > > > > > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! > > > > > > Round 0 Failed, 15 data loss! > > > > > > > > > > > > This reproducer spawns multiple threads sharing the same memory region > > > > > > using a small swap device. Every two threads updates mapped pages one by > > > > > > one in opposite direction trying to create a race, with one dedicated > > > > > > thread keep swapping out the data out using madvise. > > > > > > > > > > > > The reproducer created a reproduce rate of about once every 5 minutes, > > > > > > so the race should be totally possible in production. > > > > > > > > > > > > After this patch, I ran the reproducer for over a few hundred rounds > > > > > > and no data loss observed. > > > > > > > > > > > > Performance overhead is minimal, microbenchmark swapin 10G from 32G > > > > > > zram: > > > > > > > > > > > > Before: 10934698 us > > > > > > After: 11157121 us > > > > > > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) > > > > > > > > > > > > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") > > > > > > Reported-by: "Huang, Ying" <ying.huang@intel.com> > > > > > > Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ > > > > > > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] > > > > > > Signed-off-by: Kairui Song <kasong@tencent.com> > > > > > > Reviewed-by: "Huang, Ying" <ying.huang@intel.com> > > > > > > Acked-by: Yu Zhao <yuzhao@google.com> > > > > > > > > > > > > --- > > > > > > Update from V1: > > > > > > - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song] > > > > > > - Update comments make it cleaner [Huang, Ying] > > > > > > - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park] > > > > > > - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path" [Yu Zhao] > > > > > > - Update commit message. > > > > > > - Collect Review and Acks. > > > > > > > > > > > > include/linux/swap.h | 5 +++++ > > > > > > mm/memory.c | 15 +++++++++++++++ > > > > > > mm/swap.h | 5 +++++ > > > > > > mm/swapfile.c | 13 +++++++++++++ > > > > > > 4 files changed, 38 insertions(+) > > > > > > > > > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h > > > > > > index 4db00ddad261..8d28f6091a32 100644 > > > > > > --- a/include/linux/swap.h > > > > > > +++ b/include/linux/swap.h > > > > > > @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) > > > > > > return 0; > > > > > > } > > > > > > > > > > > > +static inline int swapcache_prepare(swp_entry_t swp) > > > > > > +{ > > > > > > + return 0; > > > > > > +} > > > > > > + > > > > > > static inline void swap_free(swp_entry_t swp) > > > > > > { > > > > > > } > > > > > > diff --git a/mm/memory.c b/mm/memory.c > > > > > > index 7e1f4849463a..1749c700823d 100644 > > > > > > --- a/mm/memory.c > > > > > > +++ b/mm/memory.c > > > > > > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > > > if (!folio) { > > > > > > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > > > > > > __swap_count(entry) == 1) { > > > > > > + /* > > > > > > + * Prevent parallel swapin from proceeding with > > > > > > + * the cache flag. Otherwise, another thread may > > > > > > + * finish swapin first, free the entry, and swapout > > > > > > + * reusing the same entry. It's undetectable as > > > > > > + * pte_same() returns true due to entry reuse. > > > > > > + */ > > > > > > + if (swapcache_prepare(entry)) > > > > > > + goto out; > > > > > > + > > > > > > > > > > I am puzzled by this "goto out". If I understand this correctly, you > > > > > have two threads CPU1 and CPU2 racing to set the flag SWAP_HAS_CACHE. > > > > > The CPU1 will succeed in adding the flag and the CPU2 will get > > > > > "-EEXIST" from "swapcache_prepare(entry)". Am I understanding it > > > > > correctly so far? > > > > > > > > > > Then the goto out seems wrong to me. For the CPU2, the page fault will > > > > > return *unhandled*. Even worse, the "-EEXIST" error is not preserved, > > > > > CPU2 does not even know the page fault is not handled, it will resume > > > > > from the page fault instruction, possibly generate another page fault > > > > > at the exact same location. That page fault loop will repeat until > > > > > CPU1 install the new pte on that faulting virtual address and pick up > > > > > by CPU2. > > > > > > > > > > Am I missing something obvious there? > > > > > > > > I feel you are right. any concurrent page faults at the same pte > > > > will increase the count of page faults for a couple of times now. > > > > > > > > > > > > > > I just re-read your comment: "Racers will simply busy wait since it's > > > > > a rare and very short event." That might be referring to the above > > > > > CPU2 page fault looping situation. I consider the page fault looping > > > > > on CPU2 not acceptable. For one it will mess up the page fault > > > > > statistics. > > > > > In my mind, having an explicit loop for CPU2 waiting for the PTE to > > > > > show up is still better than this page fault loop. You can have more > > > > > CPU power friendly loops. > > > > > > > > I assume you mean something like > > > > > > > > while(!pte_same()) > > > > cpu_relax(); > > > > > > > > then we still have a chance to miss the change of B. > > > > > > > > For example, another thread is changing pte to A->B->A, our loop can > > > > miss B. Thus we will trap into an infinite loop. this is even worse. > > > > > > Yes. You are right, it is worse. Thanks for catching that. That is why > > > I say this needs more discussion, I haven't fully thought it through > > > :-) > > > > Hi Chris and Barry, > > > > Thanks for the comments! > > > > The worst thing I know of returning in do_swap_page without handling > > the swap, is an increase of some statistic counters, note it will not > > cause major page fault counters to grow, only things like perf counter > > and vma lock statistic are affected. > > > > And actually there are multiple already existing return points in > > do_swap_page that will return without handling it, which may > > re-trigger the page fault. > > Thanks for pointing that out. I take a look at those, which seems > different than the case here. In those cases, it truely can not make > forward progress. > Here we actually have all the data it needs to complete the page > fault. Just a data synchronization issue preventing making forward > progress. > Ideally we can have some clever data structure to solve the > synchronization issue and make forward progress. > > > When do_swap_page is called, many pre-checks have been applied, and > > they could all be invalidated if something raced, simply looping > > inside here could miss a lot of corner cases, so we have to go through > > that again. > > Actually, I think about it. Looping it here seems worse in the sense > that it is already holding some locks. Return and retry the page fault > at least release those locks and let others have a chance to make > progress. > > > > > This patch did increase the chance of false positive increase of some > > counters, maybe something like returning a VM_FAULT_RETRY could make > > it better, but code is more complex and will cause other counters to > > grow. > > This is certainly not ideal. It might upset the feedback loop that > uses the swap fault statistic as input to adjust the swapping > behavior. > > Chris Hi Chris, Thanks for the reply. So I think the thing is, it's getting complex because this patch wanted to make it simple and just reuse the swap cache flags. Then maybe time to introduce a new swap map value (eg. SWAP_MAP_LOCK, not an ideal name, open to any suggestion on this...) then we can just wait for that value to be gone, no risk of loops and repeated page faults, just need to tidy some flag set/retrieving helpers.
Barry Song <21cnbao@gmail.com> writes: > On Wed, Feb 7, 2024 at 3:29 PM Kairui Song <ryncsn@gmail.com> wrote: >> >> On Wed, Feb 7, 2024 at 10:10 AM Huang, Ying <ying.huang@intel.com> wrote: >> > >> > Barry Song <21cnbao@gmail.com> writes: >> > >> > > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote: >> > >> >> > >> Hi Kairui, >> > >> >> > >> Sorry replying to your patch V1 late, I will reply on the V2 thread. >> > >> >> > >> On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote: >> > >> > >> > >> > From: Kairui Song <kasong@tencent.com> >> > >> > >> > >> > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads >> > >> > swapin the same entry at the same time, they get different pages (A, B). >> > >> > Before one thread (T0) finishes the swapin and installs page (A) >> > >> > to the PTE, another thread (T1) could finish swapin of page (B), >> > >> > swap_free the entry, then swap out the possibly modified page >> > >> > reusing the same entry. It breaks the pte_same check in (T0) because >> > >> > PTE value is unchanged, causing ABA problem. Thread (T0) will >> > >> > install a stalled page (A) into the PTE and cause data corruption. >> > >> > >> > >> > One possible callstack is like this: >> > >> > >> > >> > CPU0 CPU1 >> > >> > ---- ---- >> > >> > do_swap_page() do_swap_page() with same entry >> > >> > <direct swapin path> <direct swapin path> >> > >> > <alloc page A> <alloc page B> >> > >> > swap_read_folio() <- read to page A swap_read_folio() <- read to page B >> > >> > <slow on later locks or interrupt> <finished swapin first> >> > >> > ... set_pte_at() >> > >> > swap_free() <- entry is free >> > >> > <write to page B, now page A stalled> >> > >> > <swap out page B to same swap entry> >> > >> > pte_same() <- Check pass, PTE seems >> > >> > unchanged, but page A >> > >> > is stalled! >> > >> > swap_free() <- page B content lost! >> > >> > set_pte_at() <- staled page A installed! >> > >> > >> > >> > And besides, for ZRAM, swap_free() allows the swap device to discard >> > >> > the entry content, so even if page (B) is not modified, if >> > >> > swap_read_folio() on CPU0 happens later than swap_free() on CPU1, >> > >> > it may also cause data loss. >> > >> > >> > >> > To fix this, reuse swapcache_prepare which will pin the swap entry using >> > >> > the cache flag, and allow only one thread to pin it. Release the pin >> > >> > after PT unlocked. Racers will simply busy wait since it's a rare >> > >> > and very short event. >> > >> > >> > >> > Other methods like increasing the swap count don't seem to be a good >> > >> > idea after some tests, that will cause racers to fall back to use the >> > >> > swap cache again. Parallel swapin using different methods leads to >> > >> > a much more complex scenario. >> > >> > >> > >> > Reproducer: >> > >> > >> > >> > This race issue can be triggered easily using a well constructed >> > >> > reproducer and patched brd (with a delay in read path) [1]: >> > >> > >> > >> > With latest 6.8 mainline, race caused data loss can be observed easily: >> > >> > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out >> > >> > Polulating 32MB of memory region... >> > >> > Keep swapping out... >> > >> > Starting round 0... >> > >> > Spawning 65536 workers... >> > >> > 32746 workers spawned, wait for done... >> > >> > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! >> > >> > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! >> > >> > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! >> > >> > Round 0 Failed, 15 data loss! >> > >> > >> > >> > This reproducer spawns multiple threads sharing the same memory region >> > >> > using a small swap device. Every two threads updates mapped pages one by >> > >> > one in opposite direction trying to create a race, with one dedicated >> > >> > thread keep swapping out the data out using madvise. >> > >> > >> > >> > The reproducer created a reproduce rate of about once every 5 minutes, >> > >> > so the race should be totally possible in production. >> > >> > >> > >> > After this patch, I ran the reproducer for over a few hundred rounds >> > >> > and no data loss observed. >> > >> > >> > >> > Performance overhead is minimal, microbenchmark swapin 10G from 32G >> > >> > zram: >> > >> > >> > >> > Before: 10934698 us >> > >> > After: 11157121 us >> > >> > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) >> > >> > >> > >> > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") >> > >> > Reported-by: "Huang, Ying" <ying.huang@intel.com> >> > >> > Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ >> > >> > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] >> > >> > Signed-off-by: Kairui Song <kasong@tencent.com> >> > >> > Reviewed-by: "Huang, Ying" <ying.huang@intel.com> >> > >> > Acked-by: Yu Zhao <yuzhao@google.com> >> > >> > >> > >> > --- >> > >> > Update from V1: >> > >> > - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song] >> > >> > - Update comments make it cleaner [Huang, Ying] >> > >> > - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park] >> > >> > - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path" [Yu Zhao] >> > >> > - Update commit message. >> > >> > - Collect Review and Acks. >> > >> > >> > >> > include/linux/swap.h | 5 +++++ >> > >> > mm/memory.c | 15 +++++++++++++++ >> > >> > mm/swap.h | 5 +++++ >> > >> > mm/swapfile.c | 13 +++++++++++++ >> > >> > 4 files changed, 38 insertions(+) >> > >> > >> > >> > diff --git a/include/linux/swap.h b/include/linux/swap.h >> > >> > index 4db00ddad261..8d28f6091a32 100644 >> > >> > --- a/include/linux/swap.h >> > >> > +++ b/include/linux/swap.h >> > >> > @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) >> > >> > return 0; >> > >> > } >> > >> > >> > >> > +static inline int swapcache_prepare(swp_entry_t swp) >> > >> > +{ >> > >> > + return 0; >> > >> > +} >> > >> > + >> > >> > static inline void swap_free(swp_entry_t swp) >> > >> > { >> > >> > } >> > >> > diff --git a/mm/memory.c b/mm/memory.c >> > >> > index 7e1f4849463a..1749c700823d 100644 >> > >> > --- a/mm/memory.c >> > >> > +++ b/mm/memory.c >> > >> > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >> > >> > if (!folio) { >> > >> > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && >> > >> > __swap_count(entry) == 1) { >> > >> > + /* >> > >> > + * Prevent parallel swapin from proceeding with >> > >> > + * the cache flag. Otherwise, another thread may >> > >> > + * finish swapin first, free the entry, and swapout >> > >> > + * reusing the same entry. It's undetectable as >> > >> > + * pte_same() returns true due to entry reuse. >> > >> > + */ >> > >> > + if (swapcache_prepare(entry)) >> > >> > + goto out; >> > >> > + >> > >> >> > >> I am puzzled by this "goto out". If I understand this correctly, you >> > >> have two threads CPU1 and CPU2 racing to set the flag SWAP_HAS_CACHE. >> > >> The CPU1 will succeed in adding the flag and the CPU2 will get >> > >> "-EEXIST" from "swapcache_prepare(entry)". Am I understanding it >> > >> correctly so far? >> > >> >> > >> Then the goto out seems wrong to me. For the CPU2, the page fault will >> > >> return *unhandled*. Even worse, the "-EEXIST" error is not preserved, >> > >> CPU2 does not even know the page fault is not handled, it will resume >> > >> from the page fault instruction, possibly generate another page fault >> > >> at the exact same location. That page fault loop will repeat until >> > >> CPU1 install the new pte on that faulting virtual address and pick up >> > >> by CPU2. >> > >> >> > >> Am I missing something obvious there? >> > > >> > > I feel you are right. any concurrent page faults at the same pte >> > > will increase the count of page faults for a couple of times now. >> > > >> > >> >> > >> I just re-read your comment: "Racers will simply busy wait since it's >> > >> a rare and very short event." That might be referring to the above >> > >> CPU2 page fault looping situation. I consider the page fault looping >> > >> on CPU2 not acceptable. For one it will mess up the page fault >> > >> statistics. >> > >> In my mind, having an explicit loop for CPU2 waiting for the PTE to >> > >> show up is still better than this page fault loop. You can have more >> > >> CPU power friendly loops. >> > > >> > > I assume you mean something like >> > > >> > > while(!pte_same()) >> > > cpu_relax(); >> > > >> > > then we still have a chance to miss the change of B. >> > > >> > > For example, another thread is changing pte to A->B->A, our loop can >> > > miss B. Thus we will trap into an infinite loop. this is even worse. >> > > >> > > is it possible to loop for the success of swapcache_prepare(entry) >> > > instead? >> > >> > This doesn't work too. The swap count can increase to > 1 and be put in >> > swap cache for long time. >> > >> > Another possibility is to move swapcache_prepare() after >> > vma_alloc_folio() to reduce the race window. > > what about we make everything go as it is. I mean, we only need to > record we have failed on swapcache_prepare, but we don't goto out. > > bool swapcache_prepare_failed = swapcache_prepare(); > .... // don't change any code > > > but we only change the last code to set pte from the below > ptl > if(pte_same) > set_pte > > to > > ptl > if(pte_same && !swapcache_prepare_failed) > set_pte > > as the chance is close to 0%, the increased count should be very minor. IIUC, if (!swapcache_prepare_failed), it will always fail. If so, why bother wasting CPU cycles? If you return directly, and the first thread runs quickly enough, you can use the installed PTE directly. -- Best Regards, Huang, Ying >> >> Reducing the race window seems like a good way. Or maybe we can just >> add a cpu_relax() so raced swapins will just slow down, and won't loop >> too much time and so the side effect (counter or power consumption) >> should be much smaller? > > Thanks > Barry
On Wed, Feb 7, 2024 at 5:18 PM Huang, Ying <ying.huang@intel.com> wrote: > > Barry Song <21cnbao@gmail.com> writes: > > > On Wed, Feb 7, 2024 at 3:29 PM Kairui Song <ryncsn@gmail.com> wrote: > >> > >> On Wed, Feb 7, 2024 at 10:10 AM Huang, Ying <ying.huang@intel.com> wrote: > >> > > >> > Barry Song <21cnbao@gmail.com> writes: > >> > > >> > > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote: > >> > >> > >> > >> Hi Kairui, > >> > >> > >> > >> Sorry replying to your patch V1 late, I will reply on the V2 thread. > >> > >> > >> > >> On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote: > >> > >> > > >> > >> > From: Kairui Song <kasong@tencent.com> > >> > >> > > >> > >> > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads > >> > >> > swapin the same entry at the same time, they get different pages (A, B). > >> > >> > Before one thread (T0) finishes the swapin and installs page (A) > >> > >> > to the PTE, another thread (T1) could finish swapin of page (B), > >> > >> > swap_free the entry, then swap out the possibly modified page > >> > >> > reusing the same entry. It breaks the pte_same check in (T0) because > >> > >> > PTE value is unchanged, causing ABA problem. Thread (T0) will > >> > >> > install a stalled page (A) into the PTE and cause data corruption. > >> > >> > > >> > >> > One possible callstack is like this: > >> > >> > > >> > >> > CPU0 CPU1 > >> > >> > ---- ---- > >> > >> > do_swap_page() do_swap_page() with same entry > >> > >> > <direct swapin path> <direct swapin path> > >> > >> > <alloc page A> <alloc page B> > >> > >> > swap_read_folio() <- read to page A swap_read_folio() <- read to page B > >> > >> > <slow on later locks or interrupt> <finished swapin first> > >> > >> > ... set_pte_at() > >> > >> > swap_free() <- entry is free > >> > >> > <write to page B, now page A stalled> > >> > >> > <swap out page B to same swap entry> > >> > >> > pte_same() <- Check pass, PTE seems > >> > >> > unchanged, but page A > >> > >> > is stalled! > >> > >> > swap_free() <- page B content lost! > >> > >> > set_pte_at() <- staled page A installed! > >> > >> > > >> > >> > And besides, for ZRAM, swap_free() allows the swap device to discard > >> > >> > the entry content, so even if page (B) is not modified, if > >> > >> > swap_read_folio() on CPU0 happens later than swap_free() on CPU1, > >> > >> > it may also cause data loss. > >> > >> > > >> > >> > To fix this, reuse swapcache_prepare which will pin the swap entry using > >> > >> > the cache flag, and allow only one thread to pin it. Release the pin > >> > >> > after PT unlocked. Racers will simply busy wait since it's a rare > >> > >> > and very short event. > >> > >> > > >> > >> > Other methods like increasing the swap count don't seem to be a good > >> > >> > idea after some tests, that will cause racers to fall back to use the > >> > >> > swap cache again. Parallel swapin using different methods leads to > >> > >> > a much more complex scenario. > >> > >> > > >> > >> > Reproducer: > >> > >> > > >> > >> > This race issue can be triggered easily using a well constructed > >> > >> > reproducer and patched brd (with a delay in read path) [1]: > >> > >> > > >> > >> > With latest 6.8 mainline, race caused data loss can be observed easily: > >> > >> > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out > >> > >> > Polulating 32MB of memory region... > >> > >> > Keep swapping out... > >> > >> > Starting round 0... > >> > >> > Spawning 65536 workers... > >> > >> > 32746 workers spawned, wait for done... > >> > >> > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! > >> > >> > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! > >> > >> > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! > >> > >> > Round 0 Failed, 15 data loss! > >> > >> > > >> > >> > This reproducer spawns multiple threads sharing the same memory region > >> > >> > using a small swap device. Every two threads updates mapped pages one by > >> > >> > one in opposite direction trying to create a race, with one dedicated > >> > >> > thread keep swapping out the data out using madvise. > >> > >> > > >> > >> > The reproducer created a reproduce rate of about once every 5 minutes, > >> > >> > so the race should be totally possible in production. > >> > >> > > >> > >> > After this patch, I ran the reproducer for over a few hundred rounds > >> > >> > and no data loss observed. > >> > >> > > >> > >> > Performance overhead is minimal, microbenchmark swapin 10G from 32G > >> > >> > zram: > >> > >> > > >> > >> > Before: 10934698 us > >> > >> > After: 11157121 us > >> > >> > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) > >> > >> > > >> > >> > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") > >> > >> > Reported-by: "Huang, Ying" <ying.huang@intel.com> > >> > >> > Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ > >> > >> > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] > >> > >> > Signed-off-by: Kairui Song <kasong@tencent.com> > >> > >> > Reviewed-by: "Huang, Ying" <ying.huang@intel.com> > >> > >> > Acked-by: Yu Zhao <yuzhao@google.com> > >> > >> > > >> > >> > --- > >> > >> > Update from V1: > >> > >> > - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song] > >> > >> > - Update comments make it cleaner [Huang, Ying] > >> > >> > - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park] > >> > >> > - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path" [Yu Zhao] > >> > >> > - Update commit message. > >> > >> > - Collect Review and Acks. > >> > >> > > >> > >> > include/linux/swap.h | 5 +++++ > >> > >> > mm/memory.c | 15 +++++++++++++++ > >> > >> > mm/swap.h | 5 +++++ > >> > >> > mm/swapfile.c | 13 +++++++++++++ > >> > >> > 4 files changed, 38 insertions(+) > >> > >> > > >> > >> > diff --git a/include/linux/swap.h b/include/linux/swap.h > >> > >> > index 4db00ddad261..8d28f6091a32 100644 > >> > >> > --- a/include/linux/swap.h > >> > >> > +++ b/include/linux/swap.h > >> > >> > @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) > >> > >> > return 0; > >> > >> > } > >> > >> > > >> > >> > +static inline int swapcache_prepare(swp_entry_t swp) > >> > >> > +{ > >> > >> > + return 0; > >> > >> > +} > >> > >> > + > >> > >> > static inline void swap_free(swp_entry_t swp) > >> > >> > { > >> > >> > } > >> > >> > diff --git a/mm/memory.c b/mm/memory.c > >> > >> > index 7e1f4849463a..1749c700823d 100644 > >> > >> > --- a/mm/memory.c > >> > >> > +++ b/mm/memory.c > >> > >> > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > >> > >> > if (!folio) { > >> > >> > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > >> > >> > __swap_count(entry) == 1) { > >> > >> > + /* > >> > >> > + * Prevent parallel swapin from proceeding with > >> > >> > + * the cache flag. Otherwise, another thread may > >> > >> > + * finish swapin first, free the entry, and swapout > >> > >> > + * reusing the same entry. It's undetectable as > >> > >> > + * pte_same() returns true due to entry reuse. > >> > >> > + */ > >> > >> > + if (swapcache_prepare(entry)) > >> > >> > + goto out; > >> > >> > + > >> > >> > >> > >> I am puzzled by this "goto out". If I understand this correctly, you > >> > >> have two threads CPU1 and CPU2 racing to set the flag SWAP_HAS_CACHE. > >> > >> The CPU1 will succeed in adding the flag and the CPU2 will get > >> > >> "-EEXIST" from "swapcache_prepare(entry)". Am I understanding it > >> > >> correctly so far? > >> > >> > >> > >> Then the goto out seems wrong to me. For the CPU2, the page fault will > >> > >> return *unhandled*. Even worse, the "-EEXIST" error is not preserved, > >> > >> CPU2 does not even know the page fault is not handled, it will resume > >> > >> from the page fault instruction, possibly generate another page fault > >> > >> at the exact same location. That page fault loop will repeat until > >> > >> CPU1 install the new pte on that faulting virtual address and pick up > >> > >> by CPU2. > >> > >> > >> > >> Am I missing something obvious there? > >> > > > >> > > I feel you are right. any concurrent page faults at the same pte > >> > > will increase the count of page faults for a couple of times now. > >> > > > >> > >> > >> > >> I just re-read your comment: "Racers will simply busy wait since it's > >> > >> a rare and very short event." That might be referring to the above > >> > >> CPU2 page fault looping situation. I consider the page fault looping > >> > >> on CPU2 not acceptable. For one it will mess up the page fault > >> > >> statistics. > >> > >> In my mind, having an explicit loop for CPU2 waiting for the PTE to > >> > >> show up is still better than this page fault loop. You can have more > >> > >> CPU power friendly loops. > >> > > > >> > > I assume you mean something like > >> > > > >> > > while(!pte_same()) > >> > > cpu_relax(); > >> > > > >> > > then we still have a chance to miss the change of B. > >> > > > >> > > For example, another thread is changing pte to A->B->A, our loop can > >> > > miss B. Thus we will trap into an infinite loop. this is even worse. > >> > > > >> > > is it possible to loop for the success of swapcache_prepare(entry) > >> > > instead? > >> > > >> > This doesn't work too. The swap count can increase to > 1 and be put in > >> > swap cache for long time. > >> > > >> > Another possibility is to move swapcache_prepare() after > >> > vma_alloc_folio() to reduce the race window. > > > > what about we make everything go as it is. I mean, we only need to > > record we have failed on swapcache_prepare, but we don't goto out. > > > > bool swapcache_prepare_failed = swapcache_prepare(); > > .... // don't change any code > > > > > > but we only change the last code to set pte from the below > > ptl > > if(pte_same) > > set_pte > > > > to > > > > ptl > > if(pte_same && !swapcache_prepare_failed) > > set_pte > > > > as the chance is close to 0%, the increased count should be very minor. > > IIUC, if (!swapcache_prepare_failed), it will always fail. If so, why > bother wasting CPU cycles? If you return directly, and the first thread > runs quickly enough, you can use the installed PTE directly. > you are right. i thought the actual probability of the race this patch is trying to fix is 0%. so we fix the final place of setting pte, we stop setting pte once we detect one race. but i was wrong because the one who fails on swapcache_prepare could be the one who first gets PTL. > -- > Best Regards, > Huang, Ying > > >> > >> Reducing the race window seems like a good way. Or maybe we can just > >> add a cpu_relax() so raced swapins will just slow down, and won't loop > >> too much time and so the side effect (counter or power consumption) > >> should be much smaller? > > Thanks Barry
On Wed, Feb 07, 2024 at 12:06:15PM +0800, Kairui Song wrote: > On Wed, Feb 7, 2024 at 12:02 PM Chris Li <chrisl@kernel.org> wrote: > > > > On Tue, Feb 6, 2024 at 6:21 PM Kairui Song <ryncsn@gmail.com> wrote: > > > > > > On Wed, Feb 7, 2024 at 10:03 AM Chris Li <chrisl@kernel.org> wrote: > > > > > > > > On Tue, Feb 6, 2024 at 4:43 PM Barry Song <21cnbao@gmail.com> wrote: > > > > > > > > > > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote: > > > > > > > > > > > > Hi Kairui, > > > > > > > > > > > > Sorry replying to your patch V1 late, I will reply on the V2 thread. > > > > > > > > > > > > On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote: > > > > > > > > > > > > > > From: Kairui Song <kasong@tencent.com> > > > > > > > > > > > > > > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads > > > > > > > swapin the same entry at the same time, they get different pages (A, B). > > > > > > > Before one thread (T0) finishes the swapin and installs page (A) > > > > > > > to the PTE, another thread (T1) could finish swapin of page (B), > > > > > > > swap_free the entry, then swap out the possibly modified page > > > > > > > reusing the same entry. It breaks the pte_same check in (T0) because > > > > > > > PTE value is unchanged, causing ABA problem. Thread (T0) will > > > > > > > install a stalled page (A) into the PTE and cause data corruption. > > > > > > > > > > > > > > One possible callstack is like this: > > > > > > > > > > > > > > CPU0 CPU1 > > > > > > > ---- ---- > > > > > > > do_swap_page() do_swap_page() with same entry > > > > > > > <direct swapin path> <direct swapin path> > > > > > > > <alloc page A> <alloc page B> > > > > > > > swap_read_folio() <- read to page A swap_read_folio() <- read to page B > > > > > > > <slow on later locks or interrupt> <finished swapin first> > > > > > > > ... set_pte_at() > > > > > > > swap_free() <- entry is free > > > > > > > <write to page B, now page A stalled> > > > > > > > <swap out page B to same swap entry> > > > > > > > pte_same() <- Check pass, PTE seems > > > > > > > unchanged, but page A > > > > > > > is stalled! > > > > > > > swap_free() <- page B content lost! > > > > > > > set_pte_at() <- staled page A installed! > > > > > > > > > > > > > > And besides, for ZRAM, swap_free() allows the swap device to discard > > > > > > > the entry content, so even if page (B) is not modified, if > > > > > > > swap_read_folio() on CPU0 happens later than swap_free() on CPU1, > > > > > > > it may also cause data loss. > > > > > > > > > > > > > > To fix this, reuse swapcache_prepare which will pin the swap entry using > > > > > > > the cache flag, and allow only one thread to pin it. Release the pin > > > > > > > after PT unlocked. Racers will simply busy wait since it's a rare > > > > > > > and very short event. > > > > > > > > > > > > > > Other methods like increasing the swap count don't seem to be a good > > > > > > > idea after some tests, that will cause racers to fall back to use the > > > > > > > swap cache again. Parallel swapin using different methods leads to > > > > > > > a much more complex scenario. > > > > > > > > > > > > > > Reproducer: > > > > > > > > > > > > > > This race issue can be triggered easily using a well constructed > > > > > > > reproducer and patched brd (with a delay in read path) [1]: > > > > > > > > > > > > > > With latest 6.8 mainline, race caused data loss can be observed easily: > > > > > > > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out > > > > > > > Polulating 32MB of memory region... > > > > > > > Keep swapping out... > > > > > > > Starting round 0... > > > > > > > Spawning 65536 workers... > > > > > > > 32746 workers spawned, wait for done... > > > > > > > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! > > > > > > > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! > > > > > > > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! > > > > > > > Round 0 Failed, 15 data loss! > > > > > > > > > > > > > > This reproducer spawns multiple threads sharing the same memory region > > > > > > > using a small swap device. Every two threads updates mapped pages one by > > > > > > > one in opposite direction trying to create a race, with one dedicated > > > > > > > thread keep swapping out the data out using madvise. > > > > > > > > > > > > > > The reproducer created a reproduce rate of about once every 5 minutes, > > > > > > > so the race should be totally possible in production. > > > > > > > > > > > > > > After this patch, I ran the reproducer for over a few hundred rounds > > > > > > > and no data loss observed. > > > > > > > > > > > > > > Performance overhead is minimal, microbenchmark swapin 10G from 32G > > > > > > > zram: > > > > > > > > > > > > > > Before: 10934698 us > > > > > > > After: 11157121 us > > > > > > > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) > > > > > > > > > > > > > > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") > > > > > > > Reported-by: "Huang, Ying" <ying.huang@intel.com> > > > > > > > Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ > > > > > > > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] > > > > > > > Signed-off-by: Kairui Song <kasong@tencent.com> > > > > > > > Reviewed-by: "Huang, Ying" <ying.huang@intel.com> > > > > > > > Acked-by: Yu Zhao <yuzhao@google.com> > > > > > > > > > > > > > > --- > > > > > > > Update from V1: > > > > > > > - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song] > > > > > > > - Update comments make it cleaner [Huang, Ying] > > > > > > > - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park] > > > > > > > - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path" [Yu Zhao] > > > > > > > - Update commit message. > > > > > > > - Collect Review and Acks. > > > > > > > > > > > > > > include/linux/swap.h | 5 +++++ > > > > > > > mm/memory.c | 15 +++++++++++++++ > > > > > > > mm/swap.h | 5 +++++ > > > > > > > mm/swapfile.c | 13 +++++++++++++ > > > > > > > 4 files changed, 38 insertions(+) > > > > > > > > > > > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h > > > > > > > index 4db00ddad261..8d28f6091a32 100644 > > > > > > > --- a/include/linux/swap.h > > > > > > > +++ b/include/linux/swap.h > > > > > > > @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) > > > > > > > return 0; > > > > > > > } > > > > > > > > > > > > > > +static inline int swapcache_prepare(swp_entry_t swp) > > > > > > > +{ > > > > > > > + return 0; > > > > > > > +} > > > > > > > + > > > > > > > static inline void swap_free(swp_entry_t swp) > > > > > > > { > > > > > > > } > > > > > > > diff --git a/mm/memory.c b/mm/memory.c > > > > > > > index 7e1f4849463a..1749c700823d 100644 > > > > > > > --- a/mm/memory.c > > > > > > > +++ b/mm/memory.c > > > > > > > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > > > > if (!folio) { > > > > > > > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > > > > > > > __swap_count(entry) == 1) { > > > > > > > + /* > > > > > > > + * Prevent parallel swapin from proceeding with > > > > > > > + * the cache flag. Otherwise, another thread may > > > > > > > + * finish swapin first, free the entry, and swapout > > > > > > > + * reusing the same entry. It's undetectable as > > > > > > > + * pte_same() returns true due to entry reuse. > > > > > > > + */ > > > > > > > + if (swapcache_prepare(entry)) > > > > > > > + goto out; > > > > > > > + > > > > > > > > > > > > I am puzzled by this "goto out". If I understand this correctly, you > > > > > > have two threads CPU1 and CPU2 racing to set the flag SWAP_HAS_CACHE. > > > > > > The CPU1 will succeed in adding the flag and the CPU2 will get > > > > > > "-EEXIST" from "swapcache_prepare(entry)". Am I understanding it > > > > > > correctly so far? > > > > > > > > > > > > Then the goto out seems wrong to me. For the CPU2, the page fault will > > > > > > return *unhandled*. Even worse, the "-EEXIST" error is not preserved, > > > > > > CPU2 does not even know the page fault is not handled, it will resume > > > > > > from the page fault instruction, possibly generate another page fault > > > > > > at the exact same location. That page fault loop will repeat until > > > > > > CPU1 install the new pte on that faulting virtual address and pick up > > > > > > by CPU2. > > > > > > > > > > > > Am I missing something obvious there? > > > > > > > > > > I feel you are right. any concurrent page faults at the same pte > > > > > will increase the count of page faults for a couple of times now. > > > > > > > > > > > > > > > > > I just re-read your comment: "Racers will simply busy wait since it's > > > > > > a rare and very short event." That might be referring to the above > > > > > > CPU2 page fault looping situation. I consider the page fault looping > > > > > > on CPU2 not acceptable. For one it will mess up the page fault > > > > > > statistics. > > > > > > In my mind, having an explicit loop for CPU2 waiting for the PTE to > > > > > > show up is still better than this page fault loop. You can have more > > > > > > CPU power friendly loops. > > > > > > > > > > I assume you mean something like > > > > > > > > > > while(!pte_same()) > > > > > cpu_relax(); > > > > > > > > > > then we still have a chance to miss the change of B. > > > > > > > > > > For example, another thread is changing pte to A->B->A, our loop can > > > > > miss B. Thus we will trap into an infinite loop. this is even worse. > > > > > > > > Yes. You are right, it is worse. Thanks for catching that. That is why > > > > I say this needs more discussion, I haven't fully thought it through > > > > :-) > > > > > > Hi Chris and Barry, > > > > > > Thanks for the comments! > > > > > > The worst thing I know of returning in do_swap_page without handling > > > the swap, is an increase of some statistic counters, note it will not > > > cause major page fault counters to grow, only things like perf counter > > > and vma lock statistic are affected. > > > > > > And actually there are multiple already existing return points in > > > do_swap_page that will return without handling it, which may > > > re-trigger the page fault. > > > > Thanks for pointing that out. I take a look at those, which seems > > different than the case here. In those cases, it truely can not make > > forward progress. > > Here we actually have all the data it needs to complete the page > > fault. Just a data synchronization issue preventing making forward > > progress. > > Ideally we can have some clever data structure to solve the > > synchronization issue and make forward progress. > > > > > When do_swap_page is called, many pre-checks have been applied, and > > > they could all be invalidated if something raced, simply looping > > > inside here could miss a lot of corner cases, so we have to go through > > > that again. > > > > Actually, I think about it. Looping it here seems worse in the sense > > that it is already holding some locks. Return and retry the page fault > > at least release those locks and let others have a chance to make > > progress. > > > > > > > > This patch did increase the chance of false positive increase of some > > > counters, maybe something like returning a VM_FAULT_RETRY could make > > > it better, but code is more complex and will cause other counters to > > > grow. > > > > This is certainly not ideal. It might upset the feedback loop that > > uses the swap fault statistic as input to adjust the swapping > > behavior. > > > > Chris > > Hi Chris, > > Thanks for the reply. > > So I think the thing is, it's getting complex because this patch > wanted to make it simple and just reuse the swap cache flags. I agree that a simple fix would be the important at this point. Considering your description, here's my understanding of the other idea: Other method, such as increasing the swap count, haven't proven effective in your tests. The approach risk forcing racers to rely on the swap cache again and the potential performance loss in race scenario. While I understand that simplicity is important, and performance loss in this case may be infrequent, I believe swap_count approach could be a suitable solution. What do you think?
On Thu, Feb 8, 2024 at 2:31 AM Minchan Kim <minchan@kernel.org> wrote: > > On Wed, Feb 07, 2024 at 12:06:15PM +0800, Kairui Song wrote: > > On Wed, Feb 7, 2024 at 12:02 PM Chris Li <chrisl@kernel.org> wrote: > > > > > > On Tue, Feb 6, 2024 at 6:21 PM Kairui Song <ryncsn@gmail.com> wrote: > > > > > > > > On Wed, Feb 7, 2024 at 10:03 AM Chris Li <chrisl@kernel.org> wrote: > > > > > > > > > > On Tue, Feb 6, 2024 at 4:43 PM Barry Song <21cnbao@gmail.com> wrote: > > > > > > > > > > > > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote: > > > > > > > > > > > > > > Hi Kairui, > > > > > > > > > > > > > > Sorry replying to your patch V1 late, I will reply on the V2 thread. > > > > > > > > > > > > > > On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote: > > > > > > > > > > > > > > > > From: Kairui Song <kasong@tencent.com> > > > > > > > > > > > > > > > > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads > > > > > > > > swapin the same entry at the same time, they get different pages (A, B). > > > > > > > > Before one thread (T0) finishes the swapin and installs page (A) > > > > > > > > to the PTE, another thread (T1) could finish swapin of page (B), > > > > > > > > swap_free the entry, then swap out the possibly modified page > > > > > > > > reusing the same entry. It breaks the pte_same check in (T0) because > > > > > > > > PTE value is unchanged, causing ABA problem. Thread (T0) will > > > > > > > > install a stalled page (A) into the PTE and cause data corruption. > > > > > > > > > > > > > > > > One possible callstack is like this: > > > > > > > > > > > > > > > > CPU0 CPU1 > > > > > > > > ---- ---- > > > > > > > > do_swap_page() do_swap_page() with same entry > > > > > > > > <direct swapin path> <direct swapin path> > > > > > > > > <alloc page A> <alloc page B> > > > > > > > > swap_read_folio() <- read to page A swap_read_folio() <- read to page B > > > > > > > > <slow on later locks or interrupt> <finished swapin first> > > > > > > > > ... set_pte_at() > > > > > > > > swap_free() <- entry is free > > > > > > > > <write to page B, now page A stalled> > > > > > > > > <swap out page B to same swap entry> > > > > > > > > pte_same() <- Check pass, PTE seems > > > > > > > > unchanged, but page A > > > > > > > > is stalled! > > > > > > > > swap_free() <- page B content lost! > > > > > > > > set_pte_at() <- staled page A installed! > > > > > > > > > > > > > > > > And besides, for ZRAM, swap_free() allows the swap device to discard > > > > > > > > the entry content, so even if page (B) is not modified, if > > > > > > > > swap_read_folio() on CPU0 happens later than swap_free() on CPU1, > > > > > > > > it may also cause data loss. > > > > > > > > > > > > > > > > To fix this, reuse swapcache_prepare which will pin the swap entry using > > > > > > > > the cache flag, and allow only one thread to pin it. Release the pin > > > > > > > > after PT unlocked. Racers will simply busy wait since it's a rare > > > > > > > > and very short event. > > > > > > > > > > > > > > > > Other methods like increasing the swap count don't seem to be a good > > > > > > > > idea after some tests, that will cause racers to fall back to use the > > > > > > > > swap cache again. Parallel swapin using different methods leads to > > > > > > > > a much more complex scenario. > > > > > > > > > > > > > > > > Reproducer: > > > > > > > > > > > > > > > > This race issue can be triggered easily using a well constructed > > > > > > > > reproducer and patched brd (with a delay in read path) [1]: > > > > > > > > > > > > > > > > With latest 6.8 mainline, race caused data loss can be observed easily: > > > > > > > > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out > > > > > > > > Polulating 32MB of memory region... > > > > > > > > Keep swapping out... > > > > > > > > Starting round 0... > > > > > > > > Spawning 65536 workers... > > > > > > > > 32746 workers spawned, wait for done... > > > > > > > > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! > > > > > > > > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! > > > > > > > > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! > > > > > > > > Round 0 Failed, 15 data loss! > > > > > > > > > > > > > > > > This reproducer spawns multiple threads sharing the same memory region > > > > > > > > using a small swap device. Every two threads updates mapped pages one by > > > > > > > > one in opposite direction trying to create a race, with one dedicated > > > > > > > > thread keep swapping out the data out using madvise. > > > > > > > > > > > > > > > > The reproducer created a reproduce rate of about once every 5 minutes, > > > > > > > > so the race should be totally possible in production. > > > > > > > > > > > > > > > > After this patch, I ran the reproducer for over a few hundred rounds > > > > > > > > and no data loss observed. > > > > > > > > > > > > > > > > Performance overhead is minimal, microbenchmark swapin 10G from 32G > > > > > > > > zram: > > > > > > > > > > > > > > > > Before: 10934698 us > > > > > > > > After: 11157121 us > > > > > > > > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) > > > > > > > > > > > > > > > > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") > > > > > > > > Reported-by: "Huang, Ying" <ying.huang@intel.com> > > > > > > > > Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ > > > > > > > > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] > > > > > > > > Signed-off-by: Kairui Song <kasong@tencent.com> > > > > > > > > Reviewed-by: "Huang, Ying" <ying.huang@intel.com> > > > > > > > > Acked-by: Yu Zhao <yuzhao@google.com> > > > > > > > > > > > > > > > > --- > > > > > > > > Update from V1: > > > > > > > > - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song] > > > > > > > > - Update comments make it cleaner [Huang, Ying] > > > > > > > > - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park] > > > > > > > > - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path" [Yu Zhao] > > > > > > > > - Update commit message. > > > > > > > > - Collect Review and Acks. > > > > > > > > > > > > > > > > include/linux/swap.h | 5 +++++ > > > > > > > > mm/memory.c | 15 +++++++++++++++ > > > > > > > > mm/swap.h | 5 +++++ > > > > > > > > mm/swapfile.c | 13 +++++++++++++ > > > > > > > > 4 files changed, 38 insertions(+) > > > > > > > > > > > > > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h > > > > > > > > index 4db00ddad261..8d28f6091a32 100644 > > > > > > > > --- a/include/linux/swap.h > > > > > > > > +++ b/include/linux/swap.h > > > > > > > > @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) > > > > > > > > return 0; > > > > > > > > } > > > > > > > > > > > > > > > > +static inline int swapcache_prepare(swp_entry_t swp) > > > > > > > > +{ > > > > > > > > + return 0; > > > > > > > > +} > > > > > > > > + > > > > > > > > static inline void swap_free(swp_entry_t swp) > > > > > > > > { > > > > > > > > } > > > > > > > > diff --git a/mm/memory.c b/mm/memory.c > > > > > > > > index 7e1f4849463a..1749c700823d 100644 > > > > > > > > --- a/mm/memory.c > > > > > > > > +++ b/mm/memory.c > > > > > > > > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > > > > > if (!folio) { > > > > > > > > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > > > > > > > > __swap_count(entry) == 1) { > > > > > > > > + /* > > > > > > > > + * Prevent parallel swapin from proceeding with > > > > > > > > + * the cache flag. Otherwise, another thread may > > > > > > > > + * finish swapin first, free the entry, and swapout > > > > > > > > + * reusing the same entry. It's undetectable as > > > > > > > > + * pte_same() returns true due to entry reuse. > > > > > > > > + */ > > > > > > > > + if (swapcache_prepare(entry)) > > > > > > > > + goto out; > > > > > > > > + > > > > > > > > > > > > > > I am puzzled by this "goto out". If I understand this correctly, you > > > > > > > have two threads CPU1 and CPU2 racing to set the flag SWAP_HAS_CACHE. > > > > > > > The CPU1 will succeed in adding the flag and the CPU2 will get > > > > > > > "-EEXIST" from "swapcache_prepare(entry)". Am I understanding it > > > > > > > correctly so far? > > > > > > > > > > > > > > Then the goto out seems wrong to me. For the CPU2, the page fault will > > > > > > > return *unhandled*. Even worse, the "-EEXIST" error is not preserved, > > > > > > > CPU2 does not even know the page fault is not handled, it will resume > > > > > > > from the page fault instruction, possibly generate another page fault > > > > > > > at the exact same location. That page fault loop will repeat until > > > > > > > CPU1 install the new pte on that faulting virtual address and pick up > > > > > > > by CPU2. > > > > > > > > > > > > > > Am I missing something obvious there? > > > > > > > > > > > > I feel you are right. any concurrent page faults at the same pte > > > > > > will increase the count of page faults for a couple of times now. > > > > > > > > > > > > > > > > > > > > I just re-read your comment: "Racers will simply busy wait since it's > > > > > > > a rare and very short event." That might be referring to the above > > > > > > > CPU2 page fault looping situation. I consider the page fault looping > > > > > > > on CPU2 not acceptable. For one it will mess up the page fault > > > > > > > statistics. > > > > > > > In my mind, having an explicit loop for CPU2 waiting for the PTE to > > > > > > > show up is still better than this page fault loop. You can have more > > > > > > > CPU power friendly loops. > > > > > > > > > > > > I assume you mean something like > > > > > > > > > > > > while(!pte_same()) > > > > > > cpu_relax(); > > > > > > > > > > > > then we still have a chance to miss the change of B. > > > > > > > > > > > > For example, another thread is changing pte to A->B->A, our loop can > > > > > > miss B. Thus we will trap into an infinite loop. this is even worse. > > > > > > > > > > Yes. You are right, it is worse. Thanks for catching that. That is why > > > > > I say this needs more discussion, I haven't fully thought it through > > > > > :-) > > > > > > > > Hi Chris and Barry, > > > > > > > > Thanks for the comments! > > > > > > > > The worst thing I know of returning in do_swap_page without handling > > > > the swap, is an increase of some statistic counters, note it will not > > > > cause major page fault counters to grow, only things like perf counter > > > > and vma lock statistic are affected. > > > > > > > > And actually there are multiple already existing return points in > > > > do_swap_page that will return without handling it, which may > > > > re-trigger the page fault. > > > > > > Thanks for pointing that out. I take a look at those, which seems > > > different than the case here. In those cases, it truely can not make > > > forward progress. > > > Here we actually have all the data it needs to complete the page > > > fault. Just a data synchronization issue preventing making forward > > > progress. > > > Ideally we can have some clever data structure to solve the > > > synchronization issue and make forward progress. > > > > > > > When do_swap_page is called, many pre-checks have been applied, and > > > > they could all be invalidated if something raced, simply looping > > > > inside here could miss a lot of corner cases, so we have to go through > > > > that again. > > > > > > Actually, I think about it. Looping it here seems worse in the sense > > > that it is already holding some locks. Return and retry the page fault > > > at least release those locks and let others have a chance to make > > > progress. > > > > > > > > > > > This patch did increase the chance of false positive increase of some > > > > counters, maybe something like returning a VM_FAULT_RETRY could make > > > > it better, but code is more complex and will cause other counters to > > > > grow. > > > > > > This is certainly not ideal. It might upset the feedback loop that > > > uses the swap fault statistic as input to adjust the swapping > > > behavior. > > > > > > Chris > > > > Hi Chris, > > > > Thanks for the reply. > > > > So I think the thing is, it's getting complex because this patch > > wanted to make it simple and just reuse the swap cache flags. > > I agree that a simple fix would be the important at this point. > > Considering your description, here's my understanding of the other idea: > Other method, such as increasing the swap count, haven't proven effective > in your tests. The approach risk forcing racers to rely on the swap cache > again and the potential performance loss in race scenario. > > While I understand that simplicity is important, and performance loss > in this case may be infrequent, I believe swap_count approach could be a > suitable solution. What do you think? Hi Minchan Yes, my main concern was about simplicity and performance. Increasing swap_count here will also race with another process from releasing swap_count to 0 (swapcache was able to sync callers in other call paths but we skipped swapcache here). So the right step is: 1. Lock the cluster/swap lock; 2. Check if still have swap_count == 1, bail out if not; 3. Set it to 2; __swap_duplicate can be modified to support this, it's similar to existing logics for SWAP_HAS_CACHE. And swap freeing path will do more things, swapcache clean up needs to be handled even in the bypassing path since the racer may add it to swapcache. Reusing SWAP_HAS_CACHE seems to make it much simpler and avoided many overhead, so I used that way in this patch, the only issue is potentially repeated page faults now. I'm currently trying to add a SWAP_MAP_LOCK (or SWAP_MAP_SYNC, I'm bad at naming it) special value, so any racer can just spin on it to avoid all the problems, how do you think about this?
Kairui Song <ryncsn@gmail.com> writes: > On Thu, Feb 8, 2024 at 2:31 AM Minchan Kim <minchan@kernel.org> wrote: >> >> On Wed, Feb 07, 2024 at 12:06:15PM +0800, Kairui Song wrote: [snip] >> > >> > So I think the thing is, it's getting complex because this patch >> > wanted to make it simple and just reuse the swap cache flags. >> >> I agree that a simple fix would be the important at this point. >> >> Considering your description, here's my understanding of the other idea: >> Other method, such as increasing the swap count, haven't proven effective >> in your tests. The approach risk forcing racers to rely on the swap cache >> again and the potential performance loss in race scenario. >> >> While I understand that simplicity is important, and performance loss >> in this case may be infrequent, I believe swap_count approach could be a >> suitable solution. What do you think? > > Hi Minchan > > Yes, my main concern was about simplicity and performance. > > Increasing swap_count here will also race with another process from > releasing swap_count to 0 (swapcache was able to sync callers in other > call paths but we skipped swapcache here). What is the consequence of the race condition? > So the right step is: 1. Lock the cluster/swap lock; 2. Check if still > have swap_count == 1, bail out if not; 3. Set it to 2; > __swap_duplicate can be modified to support this, it's similar to > existing logics for SWAP_HAS_CACHE. > > And swap freeing path will do more things, swapcache clean up needs to > be handled even in the bypassing path since the racer may add it to > swapcache. > > Reusing SWAP_HAS_CACHE seems to make it much simpler and avoided many > overhead, so I used that way in this patch, the only issue is > potentially repeated page faults now. > > I'm currently trying to add a SWAP_MAP_LOCK (or SWAP_MAP_SYNC, I'm bad > at naming it) special value, so any racer can just spin on it to avoid > all the problems, how do you think about this? Let's try some simpler method firstly. -- Best Regards, Huang, Ying
On Thu, Feb 8, 2024 at 2:05 PM Kairui Song <ryncsn@gmail.com> wrote: > > On Thu, Feb 8, 2024 at 2:31 AM Minchan Kim <minchan@kernel.org> wrote: > > > > On Wed, Feb 07, 2024 at 12:06:15PM +0800, Kairui Song wrote: > > > On Wed, Feb 7, 2024 at 12:02 PM Chris Li <chrisl@kernel.org> wrote: > > > > > > > > On Tue, Feb 6, 2024 at 6:21 PM Kairui Song <ryncsn@gmail.com> wrote: > > > > > > > > > > On Wed, Feb 7, 2024 at 10:03 AM Chris Li <chrisl@kernel.org> wrote: > > > > > > > > > > > > On Tue, Feb 6, 2024 at 4:43 PM Barry Song <21cnbao@gmail.com> wrote: > > > > > > > > > > > > > > On Wed, Feb 7, 2024 at 7:18 AM Chris Li <chrisl@kernel.org> wrote: > > > > > > > > > > > > > > > > Hi Kairui, > > > > > > > > > > > > > > > > Sorry replying to your patch V1 late, I will reply on the V2 thread. > > > > > > > > > > > > > > > > On Tue, Feb 6, 2024 at 10:28 AM Kairui Song <ryncsn@gmail.com> wrote: > > > > > > > > > > > > > > > > > > From: Kairui Song <kasong@tencent.com> > > > > > > > > > > > > > > > > > > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads > > > > > > > > > swapin the same entry at the same time, they get different pages (A, B). > > > > > > > > > Before one thread (T0) finishes the swapin and installs page (A) > > > > > > > > > to the PTE, another thread (T1) could finish swapin of page (B), > > > > > > > > > swap_free the entry, then swap out the possibly modified page > > > > > > > > > reusing the same entry. It breaks the pte_same check in (T0) because > > > > > > > > > PTE value is unchanged, causing ABA problem. Thread (T0) will > > > > > > > > > install a stalled page (A) into the PTE and cause data corruption. > > > > > > > > > > > > > > > > > > One possible callstack is like this: > > > > > > > > > > > > > > > > > > CPU0 CPU1 > > > > > > > > > ---- ---- > > > > > > > > > do_swap_page() do_swap_page() with same entry > > > > > > > > > <direct swapin path> <direct swapin path> > > > > > > > > > <alloc page A> <alloc page B> > > > > > > > > > swap_read_folio() <- read to page A swap_read_folio() <- read to page B > > > > > > > > > <slow on later locks or interrupt> <finished swapin first> > > > > > > > > > ... set_pte_at() > > > > > > > > > swap_free() <- entry is free > > > > > > > > > <write to page B, now page A stalled> > > > > > > > > > <swap out page B to same swap entry> > > > > > > > > > pte_same() <- Check pass, PTE seems > > > > > > > > > unchanged, but page A > > > > > > > > > is stalled! > > > > > > > > > swap_free() <- page B content lost! > > > > > > > > > set_pte_at() <- staled page A installed! > > > > > > > > > > > > > > > > > > And besides, for ZRAM, swap_free() allows the swap device to discard > > > > > > > > > the entry content, so even if page (B) is not modified, if > > > > > > > > > swap_read_folio() on CPU0 happens later than swap_free() on CPU1, > > > > > > > > > it may also cause data loss. > > > > > > > > > > > > > > > > > > To fix this, reuse swapcache_prepare which will pin the swap entry using > > > > > > > > > the cache flag, and allow only one thread to pin it. Release the pin > > > > > > > > > after PT unlocked. Racers will simply busy wait since it's a rare > > > > > > > > > and very short event. > > > > > > > > > > > > > > > > > > Other methods like increasing the swap count don't seem to be a good > > > > > > > > > idea after some tests, that will cause racers to fall back to use the > > > > > > > > > swap cache again. Parallel swapin using different methods leads to > > > > > > > > > a much more complex scenario. > > > > > > > > > > > > > > > > > > Reproducer: > > > > > > > > > > > > > > > > > > This race issue can be triggered easily using a well constructed > > > > > > > > > reproducer and patched brd (with a delay in read path) [1]: > > > > > > > > > > > > > > > > > > With latest 6.8 mainline, race caused data loss can be observed easily: > > > > > > > > > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out > > > > > > > > > Polulating 32MB of memory region... > > > > > > > > > Keep swapping out... > > > > > > > > > Starting round 0... > > > > > > > > > Spawning 65536 workers... > > > > > > > > > 32746 workers spawned, wait for done... > > > > > > > > > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! > > > > > > > > > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! > > > > > > > > > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! > > > > > > > > > Round 0 Failed, 15 data loss! > > > > > > > > > > > > > > > > > > This reproducer spawns multiple threads sharing the same memory region > > > > > > > > > using a small swap device. Every two threads updates mapped pages one by > > > > > > > > > one in opposite direction trying to create a race, with one dedicated > > > > > > > > > thread keep swapping out the data out using madvise. > > > > > > > > > > > > > > > > > > The reproducer created a reproduce rate of about once every 5 minutes, > > > > > > > > > so the race should be totally possible in production. > > > > > > > > > > > > > > > > > > After this patch, I ran the reproducer for over a few hundred rounds > > > > > > > > > and no data loss observed. > > > > > > > > > > > > > > > > > > Performance overhead is minimal, microbenchmark swapin 10G from 32G > > > > > > > > > zram: > > > > > > > > > > > > > > > > > > Before: 10934698 us > > > > > > > > > After: 11157121 us > > > > > > > > > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) > > > > > > > > > > > > > > > > > > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") > > > > > > > > > Reported-by: "Huang, Ying" <ying.huang@intel.com> > > > > > > > > > Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ > > > > > > > > > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] > > > > > > > > > Signed-off-by: Kairui Song <kasong@tencent.com> > > > > > > > > > Reviewed-by: "Huang, Ying" <ying.huang@intel.com> > > > > > > > > > Acked-by: Yu Zhao <yuzhao@google.com> > > > > > > > > > > > > > > > > > > --- > > > > > > > > > Update from V1: > > > > > > > > > - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song] > > > > > > > > > - Update comments make it cleaner [Huang, Ying] > > > > > > > > > - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park] > > > > > > > > > - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path" [Yu Zhao] > > > > > > > > > - Update commit message. > > > > > > > > > - Collect Review and Acks. > > > > > > > > > > > > > > > > > > include/linux/swap.h | 5 +++++ > > > > > > > > > mm/memory.c | 15 +++++++++++++++ > > > > > > > > > mm/swap.h | 5 +++++ > > > > > > > > > mm/swapfile.c | 13 +++++++++++++ > > > > > > > > > 4 files changed, 38 insertions(+) > > > > > > > > > > > > > > > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h > > > > > > > > > index 4db00ddad261..8d28f6091a32 100644 > > > > > > > > > --- a/include/linux/swap.h > > > > > > > > > +++ b/include/linux/swap.h > > > > > > > > > @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) > > > > > > > > > return 0; > > > > > > > > > } > > > > > > > > > > > > > > > > > > +static inline int swapcache_prepare(swp_entry_t swp) > > > > > > > > > +{ > > > > > > > > > + return 0; > > > > > > > > > +} > > > > > > > > > + > > > > > > > > > static inline void swap_free(swp_entry_t swp) > > > > > > > > > { > > > > > > > > > } > > > > > > > > > diff --git a/mm/memory.c b/mm/memory.c > > > > > > > > > index 7e1f4849463a..1749c700823d 100644 > > > > > > > > > --- a/mm/memory.c > > > > > > > > > +++ b/mm/memory.c > > > > > > > > > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > > > > > > if (!folio) { > > > > > > > > > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > > > > > > > > > __swap_count(entry) == 1) { > > > > > > > > > + /* > > > > > > > > > + * Prevent parallel swapin from proceeding with > > > > > > > > > + * the cache flag. Otherwise, another thread may > > > > > > > > > + * finish swapin first, free the entry, and swapout > > > > > > > > > + * reusing the same entry. It's undetectable as > > > > > > > > > + * pte_same() returns true due to entry reuse. > > > > > > > > > + */ > > > > > > > > > + if (swapcache_prepare(entry)) > > > > > > > > > + goto out; > > > > > > > > > + > > > > > > > > > > > > > > > > I am puzzled by this "goto out". If I understand this correctly, you > > > > > > > > have two threads CPU1 and CPU2 racing to set the flag SWAP_HAS_CACHE. > > > > > > > > The CPU1 will succeed in adding the flag and the CPU2 will get > > > > > > > > "-EEXIST" from "swapcache_prepare(entry)". Am I understanding it > > > > > > > > correctly so far? > > > > > > > > > > > > > > > > Then the goto out seems wrong to me. For the CPU2, the page fault will > > > > > > > > return *unhandled*. Even worse, the "-EEXIST" error is not preserved, > > > > > > > > CPU2 does not even know the page fault is not handled, it will resume > > > > > > > > from the page fault instruction, possibly generate another page fault > > > > > > > > at the exact same location. That page fault loop will repeat until > > > > > > > > CPU1 install the new pte on that faulting virtual address and pick up > > > > > > > > by CPU2. > > > > > > > > > > > > > > > > Am I missing something obvious there? > > > > > > > > > > > > > > I feel you are right. any concurrent page faults at the same pte > > > > > > > will increase the count of page faults for a couple of times now. > > > > > > > > > > > > > > > > > > > > > > > I just re-read your comment: "Racers will simply busy wait since it's > > > > > > > > a rare and very short event." That might be referring to the above > > > > > > > > CPU2 page fault looping situation. I consider the page fault looping > > > > > > > > on CPU2 not acceptable. For one it will mess up the page fault > > > > > > > > statistics. > > > > > > > > In my mind, having an explicit loop for CPU2 waiting for the PTE to > > > > > > > > show up is still better than this page fault loop. You can have more > > > > > > > > CPU power friendly loops. > > > > > > > > > > > > > > I assume you mean something like > > > > > > > > > > > > > > while(!pte_same()) > > > > > > > cpu_relax(); > > > > > > > > > > > > > > then we still have a chance to miss the change of B. > > > > > > > > > > > > > > For example, another thread is changing pte to A->B->A, our loop can > > > > > > > miss B. Thus we will trap into an infinite loop. this is even worse. > > > > > > > > > > > > Yes. You are right, it is worse. Thanks for catching that. That is why > > > > > > I say this needs more discussion, I haven't fully thought it through > > > > > > :-) > > > > > > > > > > Hi Chris and Barry, > > > > > > > > > > Thanks for the comments! > > > > > > > > > > The worst thing I know of returning in do_swap_page without handling > > > > > the swap, is an increase of some statistic counters, note it will not > > > > > cause major page fault counters to grow, only things like perf counter > > > > > and vma lock statistic are affected. > > > > > > > > > > And actually there are multiple already existing return points in > > > > > do_swap_page that will return without handling it, which may > > > > > re-trigger the page fault. > > > > > > > > Thanks for pointing that out. I take a look at those, which seems > > > > different than the case here. In those cases, it truely can not make > > > > forward progress. > > > > Here we actually have all the data it needs to complete the page > > > > fault. Just a data synchronization issue preventing making forward > > > > progress. > > > > Ideally we can have some clever data structure to solve the > > > > synchronization issue and make forward progress. > > > > > > > > > When do_swap_page is called, many pre-checks have been applied, and > > > > > they could all be invalidated if something raced, simply looping > > > > > inside here could miss a lot of corner cases, so we have to go through > > > > > that again. > > > > > > > > Actually, I think about it. Looping it here seems worse in the sense > > > > that it is already holding some locks. Return and retry the page fault > > > > at least release those locks and let others have a chance to make > > > > progress. > > > > > > > > > > > > > > This patch did increase the chance of false positive increase of some > > > > > counters, maybe something like returning a VM_FAULT_RETRY could make > > > > > it better, but code is more complex and will cause other counters to > > > > > grow. > > > > > > > > This is certainly not ideal. It might upset the feedback loop that > > > > uses the swap fault statistic as input to adjust the swapping > > > > behavior. > > > > > > > > Chris > > > > > > Hi Chris, > > > > > > Thanks for the reply. > > > > > > So I think the thing is, it's getting complex because this patch > > > wanted to make it simple and just reuse the swap cache flags. > > > > I agree that a simple fix would be the important at this point. > > > > Considering your description, here's my understanding of the other idea: > > Other method, such as increasing the swap count, haven't proven effective > > in your tests. The approach risk forcing racers to rely on the swap cache > > again and the potential performance loss in race scenario. > > > > While I understand that simplicity is important, and performance loss > > in this case may be infrequent, I believe swap_count approach could be a > > suitable solution. What do you think? > > Hi Minchan > > Yes, my main concern was about simplicity and performance. > > Increasing swap_count here will also race with another process from > releasing swap_count to 0 (swapcache was able to sync callers in other > call paths but we skipped swapcache here). Just like a refcount, releasing to 0 should be ok. the last one who sets refcount to 0 will finally release the swap entry. Increasing swap_count from 1 to 2 might cause concurrent threads go into non-sync-io path(full swapcache things), which was Minchan's original patch wanted to avoid. i am not sure if this will negatively affect performance. If not, it seems a feasible way. > So the right step is: 1. Lock the cluster/swap lock; 2. Check if still > have swap_count == 1, bail out if not; 3. Set it to 2; > __swap_duplicate can be modified to support this, it's similar to > existing logics for SWAP_HAS_CACHE. > > And swap freeing path will do more things, swapcache clean up needs to > be handled even in the bypassing path since the racer may add it to > swapcache. > > Reusing SWAP_HAS_CACHE seems to make it much simpler and avoided many > overhead, so I used that way in this patch, the only issue is > potentially repeated page faults now. > > I'm currently trying to add a SWAP_MAP_LOCK (or SWAP_MAP_SYNC, I'm bad > at naming it) special value, so any racer can just spin on it to avoid > all the problems, how do you think about this? Thanks Barry
On Thu, Feb 8, 2024 at 2:36 PM Huang, Ying <ying.huang@intel.com> wrote: > > Kairui Song <ryncsn@gmail.com> writes: > > > On Thu, Feb 8, 2024 at 2:31 AM Minchan Kim <minchan@kernel.org> wrote: > >> > >> On Wed, Feb 07, 2024 at 12:06:15PM +0800, Kairui Song wrote: > > [snip] > > >> > > >> > So I think the thing is, it's getting complex because this patch > >> > wanted to make it simple and just reuse the swap cache flags. > >> > >> I agree that a simple fix would be the important at this point. > >> > >> Considering your description, here's my understanding of the other idea: > >> Other method, such as increasing the swap count, haven't proven effective > >> in your tests. The approach risk forcing racers to rely on the swap cache > >> again and the potential performance loss in race scenario. > >> > >> While I understand that simplicity is important, and performance loss > >> in this case may be infrequent, I believe swap_count approach could be a > >> suitable solution. What do you think? > > > > Hi Minchan > > > > Yes, my main concern was about simplicity and performance. > > > > Increasing swap_count here will also race with another process from > > releasing swap_count to 0 (swapcache was able to sync callers in other > > call paths but we skipped swapcache here). > > What is the consequence of the race condition? Hi Ying, It will increase the swap count of an already freed entry, this race with multiple swap free/alloc logic that checks if count == SWAP_HAS_CACHE or sets count to zero, or repeated free of an entry, all result in random corruption of the swap map. This happens a lot during stress testing. > > > So the right step is: 1. Lock the cluster/swap lock; 2. Check if still > > have swap_count == 1, bail out if not; 3. Set it to 2; > > __swap_duplicate can be modified to support this, it's similar to > > existing logics for SWAP_HAS_CACHE. > > > > And swap freeing path will do more things, swapcache clean up needs to > > be handled even in the bypassing path since the racer may add it to > > swapcache. > > > > Reusing SWAP_HAS_CACHE seems to make it much simpler and avoided many > > overhead, so I used that way in this patch, the only issue is > > potentially repeated page faults now. > > > > I'm currently trying to add a SWAP_MAP_LOCK (or SWAP_MAP_SYNC, I'm bad > > at naming it) special value, so any racer can just spin on it to avoid > > all the problems, how do you think about this? > > Let's try some simpler method firstly. Another simpler idea is, add a schedule() or schedule_timeout_uninterruptible(1) in the swapcache_prepare failure path before goto out (just like __read_swap_cache_async). I think this should ensure in almost all cases, PTE is ready after it returns, also yields more CPU.
On Thu, Feb 8, 2024 at 11:01 AM Kairui Song <ryncsn@gmail.com> wrote: > > On Thu, Feb 8, 2024 at 2:36 PM Huang, Ying <ying.huang@intel.com> wrote: > > > > Kairui Song <ryncsn@gmail.com> writes: > > > > > On Thu, Feb 8, 2024 at 2:31 AM Minchan Kim <minchan@kernel.org> wrote: > > >> > > >> On Wed, Feb 07, 2024 at 12:06:15PM +0800, Kairui Song wrote: > > > > [snip] > > > > >> > > > >> > So I think the thing is, it's getting complex because this patch > > >> > wanted to make it simple and just reuse the swap cache flags. > > >> > > >> I agree that a simple fix would be the important at this point. > > >> > > >> Considering your description, here's my understanding of the other idea: > > >> Other method, such as increasing the swap count, haven't proven effective > > >> in your tests. The approach risk forcing racers to rely on the swap cache > > >> again and the potential performance loss in race scenario. > > >> > > >> While I understand that simplicity is important, and performance loss > > >> in this case may be infrequent, I believe swap_count approach could be a > > >> suitable solution. What do you think? > > > > > > Hi Minchan > > > > > > Yes, my main concern was about simplicity and performance. > > > > > > Increasing swap_count here will also race with another process from > > > releasing swap_count to 0 (swapcache was able to sync callers in other > > > call paths but we skipped swapcache here). > > > > What is the consequence of the race condition? > > Hi Ying, > > It will increase the swap count of an already freed entry, this race > with multiple swap free/alloc logic that checks if count == > SWAP_HAS_CACHE or sets count to zero, or repeated free of an entry, > all result in random corruption of the swap map. This happens a lot > during stress testing. In theory, the original code before your patch can get into a situation similar to what you try to avoid. CPU1 enters the do_swap_page() with entry swap count == 1 and continues handling the swap fault without swap cache. Then some operation bumps up the swap entry count and CPU2 enters the do_swap_page() racing with CPU1 with swap count == 2. CPU2 will need to go through the swap cache case. We still need to handle this situation correctly. So the complexity is already there. If we can make sure the above case works correctly, then one way to avoid the problem is just send the CPU2 to use the swap cache (without the swap cache by-passing). > > > > > > So the right step is: 1. Lock the cluster/swap lock; 2. Check if still > > > have swap_count == 1, bail out if not; 3. Set it to 2; > > > __swap_duplicate can be modified to support this, it's similar to > > > existing logics for SWAP_HAS_CACHE. > > > > > > And swap freeing path will do more things, swapcache clean up needs to > > > be handled even in the bypassing path since the racer may add it to > > > swapcache. > > > > > > Reusing SWAP_HAS_CACHE seems to make it much simpler and avoided many > > > overhead, so I used that way in this patch, the only issue is > > > potentially repeated page faults now. > > > > > > I'm currently trying to add a SWAP_MAP_LOCK (or SWAP_MAP_SYNC, I'm bad > > > at naming it) special value, so any racer can just spin on it to avoid > > > all the problems, how do you think about this? > > > > Let's try some simpler method firstly. > > Another simpler idea is, add a schedule() or > schedule_timeout_uninterruptible(1) in the swapcache_prepare failure > path before goto out (just like __read_swap_cache_async). I think this > should ensure in almost all cases, PTE is ready after it returns, also > yields more CPU. I mentioned in my earlier email and Ying points out that as well. Looping waiting inside do_swap_page() is bad because it is holding other locks. Sorry I started this idea but it seems no good. If we can have CPU2 make forward progress without retrying the page fault would be the best, if possible. Chris
On Fri, Feb 9, 2024 at 3:42 AM Chris Li <chrisl@kernel.org> wrote: > > On Thu, Feb 8, 2024 at 11:01 AM Kairui Song <ryncsn@gmail.com> wrote: > > > > On Thu, Feb 8, 2024 at 2:36 PM Huang, Ying <ying.huang@intel.com> wrote: > > > > > > Kairui Song <ryncsn@gmail.com> writes: > > > > > > > On Thu, Feb 8, 2024 at 2:31 AM Minchan Kim <minchan@kernel.org> wrote: > > > >> > > > >> On Wed, Feb 07, 2024 at 12:06:15PM +0800, Kairui Song wrote: > > > > > > [snip] > > > > > > >> > > > > >> > So I think the thing is, it's getting complex because this patch > > > >> > wanted to make it simple and just reuse the swap cache flags. > > > >> > > > >> I agree that a simple fix would be the important at this point. > > > >> > > > >> Considering your description, here's my understanding of the other idea: > > > >> Other method, such as increasing the swap count, haven't proven effective > > > >> in your tests. The approach risk forcing racers to rely on the swap cache > > > >> again and the potential performance loss in race scenario. > > > >> > > > >> While I understand that simplicity is important, and performance loss > > > >> in this case may be infrequent, I believe swap_count approach could be a > > > >> suitable solution. What do you think? > > > > > > > > Hi Minchan > > > > > > > > Yes, my main concern was about simplicity and performance. > > > > > > > > Increasing swap_count here will also race with another process from > > > > releasing swap_count to 0 (swapcache was able to sync callers in other > > > > call paths but we skipped swapcache here). > > > > > > What is the consequence of the race condition? > > > > Hi Ying, > > > > It will increase the swap count of an already freed entry, this race > > with multiple swap free/alloc logic that checks if count == > > SWAP_HAS_CACHE or sets count to zero, or repeated free of an entry, > > all result in random corruption of the swap map. This happens a lot > > during stress testing. > > In theory, the original code before your patch can get into a > situation similar to what you try to avoid. > CPU1 enters the do_swap_page() with entry swap count == 1 and > continues handling the swap fault without swap cache. Then some > operation bumps up the swap entry count and CPU2 enters the > do_swap_page() racing with CPU1 with swap count == 2. CPU2 will need > to go through the swap cache case. We still need to handle this > situation correctly. Hi Chris, This won't happen, nothing can bump the swap entry count unless it's swapped in and freed. There are only two places that call swap_duplicate: unmap or fork, unmap need page mapped and entry alloc, so it won't happen unless we hit the entry reuse issue. Fork needs the VMA lock which we hold it during page fault. > So the complexity is already there. > > If we can make sure the above case works correctly, then one way to > avoid the problem is just send the CPU2 to use the swap cache (without > the swap cache by-passing). Yes, more auditing of existing code and explanation is needed to ensure things won't go wrong, that's the reason I tried to avoid things from going too complex... > > > > So the right step is: 1. Lock the cluster/swap lock; 2. Check if still > > > > have swap_count == 1, bail out if not; 3. Set it to 2; > > > > __swap_duplicate can be modified to support this, it's similar to > > > > existing logics for SWAP_HAS_CACHE. > > > > > > > > And swap freeing path will do more things, swapcache clean up needs to > > > > be handled even in the bypassing path since the racer may add it to > > > > swapcache. > > > > > > > > Reusing SWAP_HAS_CACHE seems to make it much simpler and avoided many > > > > overhead, so I used that way in this patch, the only issue is > > > > potentially repeated page faults now. > > > > > > > > I'm currently trying to add a SWAP_MAP_LOCK (or SWAP_MAP_SYNC, I'm bad > > > > at naming it) special value, so any racer can just spin on it to avoid > > > > all the problems, how do you think about this? > > > > > > Let's try some simpler method firstly. > > > > Another simpler idea is, add a schedule() or > > schedule_timeout_uninterruptible(1) in the swapcache_prepare failure > > path before goto out (just like __read_swap_cache_async). I think this > > should ensure in almost all cases, PTE is ready after it returns, also > > yields more CPU. > > I mentioned in my earlier email and Ying points out that as well. > Looping waiting inside do_swap_page() is bad because it is holding > other locks. It's not looping here though, just a tiny delay, since SWP_SYNCHRONOUS_IO is supposed to be super fast devices so a tiny delay should be enough. > Sorry I started this idea but it seems no good. Not at all, more reviewing helps to find a better solution :) > If we can have CPU2 make forward progress without retrying > the page fault would be the best, if possible. Yes, making CPU2 fall back to cached swapin path is doable after careful auditing. Just CPU2 is usually slower than CPU1 due to cache and timing, so what it does will most likely be in vain and need to be reverted, causing more work for both code logic and CPU. The case of the race (CPU2 went faster) is very rare. I'm not against the idea of bump count, it's better if things that can be done without introducing too much noise. Will come back after more tests and work on this.
On Fri, Feb 9, 2024 at 1:30 PM Kairui Song <ryncsn@gmail.com> wrote: > > On Fri, Feb 9, 2024 at 3:42 AM Chris Li <chrisl@kernel.org> wrote: > > > > On Thu, Feb 8, 2024 at 11:01 AM Kairui Song <ryncsn@gmail.com> wrote: > > > > > > On Thu, Feb 8, 2024 at 2:36 PM Huang, Ying <ying.huang@intel.com> wrote: > > > > > > > > Kairui Song <ryncsn@gmail.com> writes: > > > > > > > > > On Thu, Feb 8, 2024 at 2:31 AM Minchan Kim <minchan@kernel.org> wrote: > > > > >> > > > > >> On Wed, Feb 07, 2024 at 12:06:15PM +0800, Kairui Song wrote: > > > > > > > > [snip] > > > > > > > > >> > > > > > >> > So I think the thing is, it's getting complex because this patch > > > > >> > wanted to make it simple and just reuse the swap cache flags. > > > > >> > > > > >> I agree that a simple fix would be the important at this point. > > > > >> > > > > >> Considering your description, here's my understanding of the other idea: > > > > >> Other method, such as increasing the swap count, haven't proven effective > > > > >> in your tests. The approach risk forcing racers to rely on the swap cache > > > > >> again and the potential performance loss in race scenario. > > > > >> > > > > >> While I understand that simplicity is important, and performance loss > > > > >> in this case may be infrequent, I believe swap_count approach could be a > > > > >> suitable solution. What do you think? > > > > > > > > > > Hi Minchan > > > > > > > > > > Yes, my main concern was about simplicity and performance. > > > > > > > > > > Increasing swap_count here will also race with another process from > > > > > releasing swap_count to 0 (swapcache was able to sync callers in other > > > > > call paths but we skipped swapcache here). > > > > > > > > What is the consequence of the race condition? > > > > > > Hi Ying, > > > > > > It will increase the swap count of an already freed entry, this race > > > with multiple swap free/alloc logic that checks if count == > > > SWAP_HAS_CACHE or sets count to zero, or repeated free of an entry, > > > all result in random corruption of the swap map. This happens a lot > > > during stress testing. > > > > In theory, the original code before your patch can get into a > > situation similar to what you try to avoid. > > CPU1 enters the do_swap_page() with entry swap count == 1 and > > continues handling the swap fault without swap cache. Then some > > operation bumps up the swap entry count and CPU2 enters the > > do_swap_page() racing with CPU1 with swap count == 2. CPU2 will need > > to go through the swap cache case. We still need to handle this > > situation correctly. > > Hi Chris, > > This won't happen, nothing can bump the swap entry count unless it's > swapped in and freed. There are only two places that call > swap_duplicate: unmap or fork, unmap need page mapped and entry alloc, > so it won't happen unless we hit the entry reuse issue. Fork needs the > VMA lock which we hold it during page fault. > > > So the complexity is already there. > > > > If we can make sure the above case works correctly, then one way to > > avoid the problem is just send the CPU2 to use the swap cache (without > > the swap cache by-passing). > > Yes, more auditing of existing code and explanation is needed to > ensure things won't go wrong, that's the reason I tried to avoid > things from going too complex... > > > > > > So the right step is: 1. Lock the cluster/swap lock; 2. Check if still > > > > > have swap_count == 1, bail out if not; 3. Set it to 2; > > > > > __swap_duplicate can be modified to support this, it's similar to > > > > > existing logics for SWAP_HAS_CACHE. > > > > > > > > > > And swap freeing path will do more things, swapcache clean up needs to > > > > > be handled even in the bypassing path since the racer may add it to > > > > > swapcache. > > > > > > > > > > Reusing SWAP_HAS_CACHE seems to make it much simpler and avoided many > > > > > overhead, so I used that way in this patch, the only issue is > > > > > potentially repeated page faults now. > > > > > > > > > > I'm currently trying to add a SWAP_MAP_LOCK (or SWAP_MAP_SYNC, I'm bad > > > > > at naming it) special value, so any racer can just spin on it to avoid > > > > > all the problems, how do you think about this? > > > > > > > > Let's try some simpler method firstly. > > > > > > Another simpler idea is, add a schedule() or > > > schedule_timeout_uninterruptible(1) in the swapcache_prepare failure > > > path before goto out (just like __read_swap_cache_async). I think this > > > should ensure in almost all cases, PTE is ready after it returns, also > > > yields more CPU. > > > > I mentioned in my earlier email and Ying points out that as well. > > Looping waiting inside do_swap_page() is bad because it is holding > > other locks. > > It's not looping here though, just a tiny delay, since > SWP_SYNCHRONOUS_IO is supposed to be super fast devices so a tiny > delay should be enough. > > > Sorry I started this idea but it seems no good. > > Not at all, more reviewing helps to find a better solution :) > > > If we can have CPU2 make forward progress without retrying > > the page fault would be the best, if possible. > > Yes, making CPU2 fall back to cached swapin path is doable after > careful auditing. Just CPU2 is usually slower than CPU1 due to cache > and timing, so what it does will most likely be in vain and need to be > reverted, causing more work for both code logic and CPU. The case of > the race (CPU2 went faster) is very rare. > > I'm not against the idea of bump count, it's better if things that can > be done without introducing too much noise. Will come back after more > tests and work on this. Hi, After some more testing, I still think bump swap count is a bad idea here. Besides the issues I mentioned above, adding folio into cache is fundamentally in conflict with the problem we are trying to solve: folio in swapcache can be swapped out without allocating a new entry, so the entry reuse issue will be much more serious. If we want to avoid this we have to release the cache unconditionally before folio unlock because it can't tell if there is a parallel cache bypassing swapin going on or not. This is unacceptable, it breaks normal usage of swap cache. There are also other potential issues like read ahead, race causing stalled folio remain in swap cache, so I dont think this is a good approach. Instead if we add a schedule or schedule_timeout_uninterruptible(1), it seems quite enough to avoid repeated page faults. I did following test with the reproducer I provided in the commit message: Using ZRAM instead of brd for more realistic workload: $ perf stat -e minor-faults,major-faults -I 30000 ./a.out Unpatched kernel: # time counts unit events 30.000096504 33,089 minor-faults:u 30.000096504 958,745 major-faults:u 60.000375308 22,150 minor-faults:u 60.000375308 1,221,665 major-faults:u 90.001062176 24,840 minor-faults:u 90.001062176 1,005,427 major-faults:u 120.004233268 23,634 minor-faults:u 120.004233268 1,045,596 major-faults:u 150.004530091 22,358 minor-faults:u 150.004530091 1,097,871 major-faults:u Patched kernel: # time counts unit events 30.000069753 280,094 minor-faults:u 30.000069753 1,198,871 major-faults:u 60.000415915 436,862 minor-faults:u 60.000415915 919,860 major-faults:u 90.000651670 359,176 minor-faults:u 90.000651670 955,340 major-faults:u 120.001088135 289,137 minor-faults:u 120.001088135 992,317 major-faults:u Indeed there is a huge increase of minor page faults, which indicate the raced path returned without handling the fault. This reproducer tries to maximize the race chance so the reading is more terrifying than usual. But after adding a schedule/schedule_timeout_uninterruptible(1) after swapcache_prepare failed, still using the same test: Patched kernel (adding a schedule() before goto out): # time counts unit events 30.000335901 43,066 minor-faults:u 30.000335901 1,163,232 major-faults:u 60.034629687 35,652 minor-faults:u 60.034629687 844,188 major-faults:u 90.034941472 34,957 minor-faults:u 90.034941472 1,218,174 major-faults:u 120.035255386 36,241 minor-faults:u 120.035255386 1,073,395 major-faults:u 150.035535786 33,057 minor-faults:u 150.035535786 1,171,684 major-faults:u Patched kernel (adding a schedule_timeout_uninterruptible(1) before goto out): # time counts unit events 30.000044452 26,748 minor-faults:u 30.000044452 1,202,064 major-faults:u 60.000317379 19,883 minor-faults:u 60.000317379 1,186,152 major-faults:u 90.000568779 18,333 minor-faults:u 90.000568779 1,151,015 major-faults:u 120.000787253 22,844 minor-faults:u 120.000787253 887,531 major-faults:u 150.001309065 18,900 minor-faults:u 150.001309065 1,211,894 major-faults:u The minor faults are basically the same as before, or even lower since other races are also reduced, with no observable performance regression so far. If the vague margin of this schedule call is still concerning, I think another approach is just a new swap map value for parallel cache bypassed swapping to loop on.
Hi Kairui, On Tue, Feb 13, 2024 at 03:53:09AM +0800, Kairui Song wrote: > On Fri, Feb 9, 2024 at 1:30 PM Kairui Song <ryncsn@gmail.com> wrote: > > > > On Fri, Feb 9, 2024 at 3:42 AM Chris Li <chrisl@kernel.org> wrote: > > > > > > On Thu, Feb 8, 2024 at 11:01 AM Kairui Song <ryncsn@gmail.com> wrote: > > > > > > > > On Thu, Feb 8, 2024 at 2:36 PM Huang, Ying <ying.huang@intel.com> wrote: > > > > > > > > > > Kairui Song <ryncsn@gmail.com> writes: > > > > > > > > > > > On Thu, Feb 8, 2024 at 2:31 AM Minchan Kim <minchan@kernel.org> wrote: > > > > > >> > > > > > >> On Wed, Feb 07, 2024 at 12:06:15PM +0800, Kairui Song wrote: > > > > > > > > > > [snip] > > > > > > > > > > >> > > > > > > >> > So I think the thing is, it's getting complex because this patch > > > > > >> > wanted to make it simple and just reuse the swap cache flags. > > > > > >> > > > > > >> I agree that a simple fix would be the important at this point. > > > > > >> > > > > > >> Considering your description, here's my understanding of the other idea: > > > > > >> Other method, such as increasing the swap count, haven't proven effective > > > > > >> in your tests. The approach risk forcing racers to rely on the swap cache > > > > > >> again and the potential performance loss in race scenario. > > > > > >> > > > > > >> While I understand that simplicity is important, and performance loss > > > > > >> in this case may be infrequent, I believe swap_count approach could be a > > > > > >> suitable solution. What do you think? > > > > > > > > > > > > Hi Minchan > > > > > > > > > > > > Yes, my main concern was about simplicity and performance. > > > > > > > > > > > > Increasing swap_count here will also race with another process from > > > > > > releasing swap_count to 0 (swapcache was able to sync callers in other > > > > > > call paths but we skipped swapcache here). > > > > > > > > > > What is the consequence of the race condition? > > > > > > > > Hi Ying, > > > > > > > > It will increase the swap count of an already freed entry, this race > > > > with multiple swap free/alloc logic that checks if count == > > > > SWAP_HAS_CACHE or sets count to zero, or repeated free of an entry, > > > > all result in random corruption of the swap map. This happens a lot > > > > during stress testing. > > > > > > In theory, the original code before your patch can get into a > > > situation similar to what you try to avoid. > > > CPU1 enters the do_swap_page() with entry swap count == 1 and > > > continues handling the swap fault without swap cache. Then some > > > operation bumps up the swap entry count and CPU2 enters the > > > do_swap_page() racing with CPU1 with swap count == 2. CPU2 will need > > > to go through the swap cache case. We still need to handle this > > > situation correctly. > > > > Hi Chris, > > > > This won't happen, nothing can bump the swap entry count unless it's > > swapped in and freed. There are only two places that call > > swap_duplicate: unmap or fork, unmap need page mapped and entry alloc, > > so it won't happen unless we hit the entry reuse issue. Fork needs the > > VMA lock which we hold it during page fault. > > > > > So the complexity is already there. > > > > > > If we can make sure the above case works correctly, then one way to > > > avoid the problem is just send the CPU2 to use the swap cache (without > > > the swap cache by-passing). > > > > Yes, more auditing of existing code and explanation is needed to > > ensure things won't go wrong, that's the reason I tried to avoid > > things from going too complex... > > > > > > > > So the right step is: 1. Lock the cluster/swap lock; 2. Check if still > > > > > > have swap_count == 1, bail out if not; 3. Set it to 2; > > > > > > __swap_duplicate can be modified to support this, it's similar to > > > > > > existing logics for SWAP_HAS_CACHE. > > > > > > > > > > > > And swap freeing path will do more things, swapcache clean up needs to > > > > > > be handled even in the bypassing path since the racer may add it to > > > > > > swapcache. > > > > > > > > > > > > Reusing SWAP_HAS_CACHE seems to make it much simpler and avoided many > > > > > > overhead, so I used that way in this patch, the only issue is > > > > > > potentially repeated page faults now. > > > > > > > > > > > > I'm currently trying to add a SWAP_MAP_LOCK (or SWAP_MAP_SYNC, I'm bad > > > > > > at naming it) special value, so any racer can just spin on it to avoid > > > > > > all the problems, how do you think about this? > > > > > > > > > > Let's try some simpler method firstly. > > > > > > > > Another simpler idea is, add a schedule() or > > > > schedule_timeout_uninterruptible(1) in the swapcache_prepare failure > > > > path before goto out (just like __read_swap_cache_async). I think this > > > > should ensure in almost all cases, PTE is ready after it returns, also > > > > yields more CPU. > > > > > > I mentioned in my earlier email and Ying points out that as well. > > > Looping waiting inside do_swap_page() is bad because it is holding > > > other locks. > > > > It's not looping here though, just a tiny delay, since > > SWP_SYNCHRONOUS_IO is supposed to be super fast devices so a tiny > > delay should be enough. > > > > > Sorry I started this idea but it seems no good. > > > > Not at all, more reviewing helps to find a better solution :) > > > > > If we can have CPU2 make forward progress without retrying > > > the page fault would be the best, if possible. > > > > Yes, making CPU2 fall back to cached swapin path is doable after > > careful auditing. Just CPU2 is usually slower than CPU1 due to cache > > and timing, so what it does will most likely be in vain and need to be > > reverted, causing more work for both code logic and CPU. The case of > > the race (CPU2 went faster) is very rare. > > > > I'm not against the idea of bump count, it's better if things that can > > be done without introducing too much noise. Will come back after more > > tests and work on this. > > Hi, > > After some more testing, I still think bump swap count is a bad idea > here. Besides the issues I mentioned above, adding folio into cache is > fundamentally in conflict with the problem we are trying to solve: > folio in swapcache can be swapped out without allocating a new entry, > so the entry reuse issue will be much more serious. > > If we want to avoid this we have to release the cache unconditionally > before folio unlock because it can't tell if there is a parallel cache > bypassing swapin going on or not. This is unacceptable, it breaks > normal usage of swap cache. There are also other potential issues like > read ahead, race causing stalled folio remain in swap cache, so I dont > think this is a good approach. I also played the swap refcount stuff ideas and encoutered a lot problems(e.g., kernel crash and/or data missing). Now I realize your original solution would be best to fix under such a small change. Folks, please chime in if you have another idea. > > Instead if we add a schedule or schedule_timeout_uninterruptible(1), How much your workload is going If we use schedule instead of schedule_timeout_uninterruptible(1)? If that doesn't increase the statistics a lot, I prefer the schedule. (TBH, I didn't care much about the stat since it would be very rare). > it seems quite enough to avoid repeated page faults. I did following > test with the reproducer I provided in the commit message: > > Using ZRAM instead of brd for more realistic workload: > $ perf stat -e minor-faults,major-faults -I 30000 ./a.out > > Unpatched kernel: > # time counts unit events > 30.000096504 33,089 minor-faults:u > 30.000096504 958,745 major-faults:u > 60.000375308 22,150 minor-faults:u > 60.000375308 1,221,665 major-faults:u > 90.001062176 24,840 minor-faults:u > 90.001062176 1,005,427 major-faults:u > 120.004233268 23,634 minor-faults:u > 120.004233268 1,045,596 major-faults:u > 150.004530091 22,358 minor-faults:u > 150.004530091 1,097,871 major-faults:u > > Patched kernel: > # time counts unit events > 30.000069753 280,094 minor-faults:u > 30.000069753 1,198,871 major-faults:u > 60.000415915 436,862 minor-faults:u > 60.000415915 919,860 major-faults:u > 90.000651670 359,176 minor-faults:u > 90.000651670 955,340 major-faults:u > 120.001088135 289,137 minor-faults:u > 120.001088135 992,317 major-faults:u > > Indeed there is a huge increase of minor page faults, which indicate > the raced path returned without handling the fault. This reproducer > tries to maximize the race chance so the reading is more terrifying > than usual. > > But after adding a schedule/schedule_timeout_uninterruptible(1) after > swapcache_prepare failed, still using the same test: > > Patched kernel (adding a schedule() before goto out): > # time counts unit events > 30.000335901 43,066 minor-faults:u > 30.000335901 1,163,232 major-faults:u > 60.034629687 35,652 minor-faults:u > 60.034629687 844,188 major-faults:u > 90.034941472 34,957 minor-faults:u > 90.034941472 1,218,174 major-faults:u > 120.035255386 36,241 minor-faults:u > 120.035255386 1,073,395 major-faults:u > 150.035535786 33,057 minor-faults:u > 150.035535786 1,171,684 major-faults:u > > Patched kernel (adding a schedule_timeout_uninterruptible(1) before goto out): > # time counts unit events > 30.000044452 26,748 minor-faults:u > 30.000044452 1,202,064 major-faults:u > 60.000317379 19,883 minor-faults:u > 60.000317379 1,186,152 major-faults:u > 90.000568779 18,333 minor-faults:u > 90.000568779 1,151,015 major-faults:u > 120.000787253 22,844 minor-faults:u > 120.000787253 887,531 major-faults:u > 150.001309065 18,900 minor-faults:u > 150.001309065 1,211,894 major-faults:u > > The minor faults are basically the same as before, or even lower since > other races are also reduced, with no observable performance > regression so far. > If the vague margin of this schedule call is still concerning, I think > another approach is just a new swap map value for parallel cache > bypassed swapping to loop on. Long term, the swap map vaule would be good but for now, I prefer your original solution with some delay with schedule. Thanks, Kairui!
On 06.02.24 19:25, Kairui Song wrote: > From: Kairui Song <kasong@tencent.com> > > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads > swapin the same entry at the same time, they get different pages (A, B). > Before one thread (T0) finishes the swapin and installs page (A) > to the PTE, another thread (T1) could finish swapin of page (B), > swap_free the entry, then swap out the possibly modified page > reusing the same entry. It breaks the pte_same check in (T0) because > PTE value is unchanged, causing ABA problem. Thread (T0) will > install a stalled page (A) into the PTE and cause data corruption. > > One possible callstack is like this: > > CPU0 CPU1 > ---- ---- > do_swap_page() do_swap_page() with same entry > <direct swapin path> <direct swapin path> > <alloc page A> <alloc page B> > swap_read_folio() <- read to page A swap_read_folio() <- read to page B > <slow on later locks or interrupt> <finished swapin first> > ... set_pte_at() > swap_free() <- entry is free > <write to page B, now page A stalled> > <swap out page B to same swap entry> > pte_same() <- Check pass, PTE seems > unchanged, but page A > is stalled! > swap_free() <- page B content lost! > set_pte_at() <- staled page A installed! > > And besides, for ZRAM, swap_free() allows the swap device to discard > the entry content, so even if page (B) is not modified, if > swap_read_folio() on CPU0 happens later than swap_free() on CPU1, > it may also cause data loss. > > To fix this, reuse swapcache_prepare which will pin the swap entry using > the cache flag, and allow only one thread to pin it. Release the pin > after PT unlocked. Racers will simply busy wait since it's a rare > and very short event. > > Other methods like increasing the swap count don't seem to be a good > idea after some tests, that will cause racers to fall back to use the > swap cache again. Parallel swapin using different methods leads to > a much more complex scenario. > > Reproducer: > > This race issue can be triggered easily using a well constructed > reproducer and patched brd (with a delay in read path) [1]: > > With latest 6.8 mainline, race caused data loss can be observed easily: > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out > Polulating 32MB of memory region... > Keep swapping out... > Starting round 0... > Spawning 65536 workers... > 32746 workers spawned, wait for done... > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! > Round 0 Failed, 15 data loss! > > This reproducer spawns multiple threads sharing the same memory region > using a small swap device. Every two threads updates mapped pages one by > one in opposite direction trying to create a race, with one dedicated > thread keep swapping out the data out using madvise. > > The reproducer created a reproduce rate of about once every 5 minutes, > so the race should be totally possible in production. > > After this patch, I ran the reproducer for over a few hundred rounds > and no data loss observed. > > Performance overhead is minimal, microbenchmark swapin 10G from 32G > zram: > > Before: 10934698 us > After: 11157121 us > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) > > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") > Reported-by: "Huang, Ying" <ying.huang@intel.com> > Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] > Signed-off-by: Kairui Song <kasong@tencent.com> > Reviewed-by: "Huang, Ying" <ying.huang@intel.com> > Acked-by: Yu Zhao <yuzhao@google.com> > > --- > Update from V1: > - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song] > - Update comments make it cleaner [Huang, Ying] > - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park] > - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path" [Yu Zhao] > - Update commit message. > - Collect Review and Acks. > > include/linux/swap.h | 5 +++++ > mm/memory.c | 15 +++++++++++++++ > mm/swap.h | 5 +++++ > mm/swapfile.c | 13 +++++++++++++ > 4 files changed, 38 insertions(+) > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 4db00ddad261..8d28f6091a32 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) > return 0; > } > > +static inline int swapcache_prepare(swp_entry_t swp) > +{ > + return 0; > +} > + > static inline void swap_free(swp_entry_t swp) > { > } > diff --git a/mm/memory.c b/mm/memory.c > index 7e1f4849463a..1749c700823d 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > if (!folio) { > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > __swap_count(entry) == 1) { > + /* > + * Prevent parallel swapin from proceeding with > + * the cache flag. Otherwise, another thread may > + * finish swapin first, free the entry, and swapout > + * reusing the same entry. It's undetectable as > + * pte_same() returns true due to entry reuse. > + */ > + if (swapcache_prepare(entry)) > + goto out; > + Is there anything that guarantees that the following won't happen concurrently, and if it would happen, could it be a problem? Other thread MADV_DONTNEED's the swap entry, swap slot is freed. Some other page gets swapped out, reuses that swap slot. We call swapcache_prepare() on that slot that is being reused elsewhere. (possibly some other thread in the context of the reuses swap slot might do the same!) We would detect later, that the PTE changed, but we would temporarily mess with that swap slot that we might no longer "own". I was thinking about alternatives, it's tricky because of the concurrent MADV_DONTNEED possibility. Something with another fake-swap entry type (similar to migration entries) might work, but would require more changes.
On Thu, Feb 15, 2024 at 11:36 PM David Hildenbrand <david@redhat.com> wrote: > On 06.02.24 19:25, Kairui Song wrote: > > From: Kairui Song <kasong@tencent.com> > > > > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads > > swapin the same entry at the same time, they get different pages (A, B). > > Before one thread (T0) finishes the swapin and installs page (A) > > to the PTE, another thread (T1) could finish swapin of page (B), > > swap_free the entry, then swap out the possibly modified page > > reusing the same entry. It breaks the pte_same check in (T0) because > > PTE value is unchanged, causing ABA problem. Thread (T0) will > > install a stalled page (A) into the PTE and cause data corruption. > > > > One possible callstack is like this: > > > > CPU0 CPU1 > > ---- ---- > > do_swap_page() do_swap_page() with same entry > > <direct swapin path> <direct swapin path> > > <alloc page A> <alloc page B> > > swap_read_folio() <- read to page A swap_read_folio() <- read to page B > > <slow on later locks or interrupt> <finished swapin first> > > ... set_pte_at() > > swap_free() <- entry is free > > <write to page B, now page A stalled> > > <swap out page B to same swap entry> > > pte_same() <- Check pass, PTE seems > > unchanged, but page A > > is stalled! > > swap_free() <- page B content lost! > > set_pte_at() <- staled page A installed! > > > > And besides, for ZRAM, swap_free() allows the swap device to discard > > the entry content, so even if page (B) is not modified, if > > swap_read_folio() on CPU0 happens later than swap_free() on CPU1, > > it may also cause data loss. > > > > To fix this, reuse swapcache_prepare which will pin the swap entry using > > the cache flag, and allow only one thread to pin it. Release the pin > > after PT unlocked. Racers will simply busy wait since it's a rare > > and very short event. > > > > Other methods like increasing the swap count don't seem to be a good > > idea after some tests, that will cause racers to fall back to use the > > swap cache again. Parallel swapin using different methods leads to > > a much more complex scenario. > > > > Reproducer: > > > > This race issue can be triggered easily using a well constructed > > reproducer and patched brd (with a delay in read path) [1]: > > > > With latest 6.8 mainline, race caused data loss can be observed easily: > > $ gcc -g -lpthread test-thread-swap-race.c && ./a.out > > Polulating 32MB of memory region... > > Keep swapping out... > > Starting round 0... > > Spawning 65536 workers... > > 32746 workers spawned, wait for done... > > Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! > > Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! > > Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! > > Round 0 Failed, 15 data loss! > > > > This reproducer spawns multiple threads sharing the same memory region > > using a small swap device. Every two threads updates mapped pages one by > > one in opposite direction trying to create a race, with one dedicated > > thread keep swapping out the data out using madvise. > > > > The reproducer created a reproduce rate of about once every 5 minutes, > > so the race should be totally possible in production. > > > > After this patch, I ran the reproducer for over a few hundred rounds > > and no data loss observed. > > > > Performance overhead is minimal, microbenchmark swapin 10G from 32G > > zram: > > > > Before: 10934698 us > > After: 11157121 us > > Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) > > > > Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") > > Reported-by: "Huang, Ying" <ying.huang@intel.com> > > Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ > > Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] > > Signed-off-by: Kairui Song <kasong@tencent.com> > > Reviewed-by: "Huang, Ying" <ying.huang@intel.com> > > Acked-by: Yu Zhao <yuzhao@google.com> > > > > --- > > Update from V1: > > - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song] > > - Update comments make it cleaner [Huang, Ying] > > - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park] > > - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path" [Yu Zhao] > > - Update commit message. > > - Collect Review and Acks. > > > > include/linux/swap.h | 5 +++++ > > mm/memory.c | 15 +++++++++++++++ > > mm/swap.h | 5 +++++ > > mm/swapfile.c | 13 +++++++++++++ > > 4 files changed, 38 insertions(+) > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h > > index 4db00ddad261..8d28f6091a32 100644 > > --- a/include/linux/swap.h > > +++ b/include/linux/swap.h > > @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) > > return 0; > > } > > > > +static inline int swapcache_prepare(swp_entry_t swp) > > +{ > > + return 0; > > +} > > + > > static inline void swap_free(swp_entry_t swp) > > { > > } > > diff --git a/mm/memory.c b/mm/memory.c > > index 7e1f4849463a..1749c700823d 100644 > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > if (!folio) { > > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > > __swap_count(entry) == 1) { > > + /* > > + * Prevent parallel swapin from proceeding with > > + * the cache flag. Otherwise, another thread may > > + * finish swapin first, free the entry, and swapout > > + * reusing the same entry. It's undetectable as > > + * pte_same() returns true due to entry reuse. > > + */ > > + if (swapcache_prepare(entry)) > > + goto out; > > + > > Is there anything that guarantees that the following won't > happen concurrently, and if it would happen, could it be a problem? > > Other thread MADV_DONTNEED's the swap entry, swap slot is freed. > Some other page gets swapped out, reuses that swap slot. Hi David, Thanks for adding more comments and sharing your thoughts! I'm not sure what you mean by "reuses that swap slot" here, I think you just mean reuse that swap entry (and the actual content on swap device)? > We call swapcache_prepare() on that slot that is being reused > elsewhere. (possibly some other thread in the context of the reuses > swap slot might do the same!) I think this kind of swapcache_prepare() or false swapin read is already happening quite frequently by swap readaheads. I've seen swap cluster readahead mess up working set activation and memory policy already. Swap cluster readahead simply read in nearby entries of target entry, regardless of whether they are owned by the reader or not. For this patch, similar issues also exist, I think it only hurts the performance, but that's a really rare thing to happen, so should not be a problem. > > We would detect later, that the PTE changed, but we would temporarily > mess with that swap slot that we might no longer "own". > > I was thinking about alternatives, it's tricky because of the concurrent > MADV_DONTNEED possibility. Something with another fake-swap entry type > (similar to migration entries) might work, but would require more changes. Yeah, in the long term I also think more work is needed for the swap subsystem. In my opinion, for this particular issue, or, for cache bypassed swapin, a new swap map value similar to SWAP_MAP_BAD/SWAP_MAP_SHMEM might be needed, that may even help to simplify the swap count release routine for cache bypassed swapin, and improve the performance.
On Thu, Feb 15, 2024 at 8:44 AM Minchan Kim <minchan@kernel.org> wrote: > > Hi Kairui, > Hi, Minchan, > > I also played the swap refcount stuff ideas and encoutered a lot > problems(e.g., kernel crash and/or data missing). > > Now I realize your original solution would be best to fix under such a > small change. > > Folks, please chime in if you have another idea. > > > > > Instead if we add a schedule or schedule_timeout_uninterruptible(1), > > How much your workload is going If we use schedule instead of > schedule_timeout_uninterruptible(1)? If that doesn't increase the > statistics a lot, I prefer the schedule. > (TBH, I didn't care much about the stat since it would be > very rare). I've tested both schedule() and schedule_timeout_uninterruptible(1) locally using the reproducer, and some other cases, and looked all good. The reproducer I provided demonstrates an extreme case, so I think schedule() is good enough already. I currently don't have any more benchmarks or tests, as the change is very small for most workloads. I'll use schedule() here in V3 if no one else has other suggestions. I remember Barry's series about large folio swapin may change the ZRAM read time depending on folio size, and cause strange races that's different from the reproducer. Not sure if I can test using some known race cases. That could be helpful to verify this fix and if schedule() or schedule_timeout_uninterruptible(1) is better here. > > it seems quite enough to avoid repeated page faults. I did following > > test with the reproducer I provided in the commit message: > > > > Using ZRAM instead of brd for more realistic workload: > > $ perf stat -e minor-faults,major-faults -I 30000 ./a.out > > > > Unpatched kernel: > > # time counts unit events > > 30.000096504 33,089 minor-faults:u > > 30.000096504 958,745 major-faults:u > > 60.000375308 22,150 minor-faults:u > > 60.000375308 1,221,665 major-faults:u > > 90.001062176 24,840 minor-faults:u > > 90.001062176 1,005,427 major-faults:u > > 120.004233268 23,634 minor-faults:u > > 120.004233268 1,045,596 major-faults:u > > 150.004530091 22,358 minor-faults:u > > 150.004530091 1,097,871 major-faults:u > > > > Patched kernel: > > # time counts unit events > > 30.000069753 280,094 minor-faults:u > > 30.000069753 1,198,871 major-faults:u > > 60.000415915 436,862 minor-faults:u > > 60.000415915 919,860 major-faults:u > > 90.000651670 359,176 minor-faults:u > > 90.000651670 955,340 major-faults:u > > 120.001088135 289,137 minor-faults:u > > 120.001088135 992,317 major-faults:u > > > > Indeed there is a huge increase of minor page faults, which indicate > > the raced path returned without handling the fault. This reproducer > > tries to maximize the race chance so the reading is more terrifying > > than usual. > > > > But after adding a schedule/schedule_timeout_uninterruptible(1) after > > swapcache_prepare failed, still using the same test: > > > > Patched kernel (adding a schedule() before goto out): > > # time counts unit events > > 30.000335901 43,066 minor-faults:u > > 30.000335901 1,163,232 major-faults:u > > 60.034629687 35,652 minor-faults:u > > 60.034629687 844,188 major-faults:u > > 90.034941472 34,957 minor-faults:u > > 90.034941472 1,218,174 major-faults:u > > 120.035255386 36,241 minor-faults:u > > 120.035255386 1,073,395 major-faults:u > > 150.035535786 33,057 minor-faults:u > > 150.035535786 1,171,684 major-faults:u > > > > Patched kernel (adding a schedule_timeout_uninterruptible(1) before goto out): > > # time counts unit events > > 30.000044452 26,748 minor-faults:u > > 30.000044452 1,202,064 major-faults:u > > 60.000317379 19,883 minor-faults:u > > 60.000317379 1,186,152 major-faults:u > > 90.000568779 18,333 minor-faults:u > > 90.000568779 1,151,015 major-faults:u > > 120.000787253 22,844 minor-faults:u > > 120.000787253 887,531 major-faults:u > > 150.001309065 18,900 minor-faults:u > > 150.001309065 1,211,894 major-faults:u > > > > The minor faults are basically the same as before, or even lower since > > other races are also reduced, with no observable performance > > regression so far. > > If the vague margin of this schedule call is still concerning, I think > > another approach is just a new swap map value for parallel cache > > bypassed swapping to loop on. > > Long term, the swap map vaule would be good but for now, I prefer > your original solution with some delay with schedule. Yes, that's also what I have in mind. With a swap map value, actually I think the cache bypassed path may even go faster as it can simplify some swap map value change process, but that requires quite some extra work and change, could be introduced later as an optimization. > Thanks, Kairui! Thanks for the comments!
On 15.02.24 19:49, Kairui Song wrote: > On Thu, Feb 15, 2024 at 11:36 PM David Hildenbrand <david@redhat.com> wrote: >> On 06.02.24 19:25, Kairui Song wrote: >>> From: Kairui Song <kasong@tencent.com> >>> >>> When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads >>> swapin the same entry at the same time, they get different pages (A, B). >>> Before one thread (T0) finishes the swapin and installs page (A) >>> to the PTE, another thread (T1) could finish swapin of page (B), >>> swap_free the entry, then swap out the possibly modified page >>> reusing the same entry. It breaks the pte_same check in (T0) because >>> PTE value is unchanged, causing ABA problem. Thread (T0) will >>> install a stalled page (A) into the PTE and cause data corruption. >>> >>> One possible callstack is like this: >>> >>> CPU0 CPU1 >>> ---- ---- >>> do_swap_page() do_swap_page() with same entry >>> <direct swapin path> <direct swapin path> >>> <alloc page A> <alloc page B> >>> swap_read_folio() <- read to page A swap_read_folio() <- read to page B >>> <slow on later locks or interrupt> <finished swapin first> >>> ... set_pte_at() >>> swap_free() <- entry is free >>> <write to page B, now page A stalled> >>> <swap out page B to same swap entry> >>> pte_same() <- Check pass, PTE seems >>> unchanged, but page A >>> is stalled! >>> swap_free() <- page B content lost! >>> set_pte_at() <- staled page A installed! >>> >>> And besides, for ZRAM, swap_free() allows the swap device to discard >>> the entry content, so even if page (B) is not modified, if >>> swap_read_folio() on CPU0 happens later than swap_free() on CPU1, >>> it may also cause data loss. >>> >>> To fix this, reuse swapcache_prepare which will pin the swap entry using >>> the cache flag, and allow only one thread to pin it. Release the pin >>> after PT unlocked. Racers will simply busy wait since it's a rare >>> and very short event. >>> >>> Other methods like increasing the swap count don't seem to be a good >>> idea after some tests, that will cause racers to fall back to use the >>> swap cache again. Parallel swapin using different methods leads to >>> a much more complex scenario. >>> >>> Reproducer: >>> >>> This race issue can be triggered easily using a well constructed >>> reproducer and patched brd (with a delay in read path) [1]: >>> >>> With latest 6.8 mainline, race caused data loss can be observed easily: >>> $ gcc -g -lpthread test-thread-swap-race.c && ./a.out >>> Polulating 32MB of memory region... >>> Keep swapping out... >>> Starting round 0... >>> Spawning 65536 workers... >>> 32746 workers spawned, wait for done... >>> Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! >>> Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! >>> Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! >>> Round 0 Failed, 15 data loss! >>> >>> This reproducer spawns multiple threads sharing the same memory region >>> using a small swap device. Every two threads updates mapped pages one by >>> one in opposite direction trying to create a race, with one dedicated >>> thread keep swapping out the data out using madvise. >>> >>> The reproducer created a reproduce rate of about once every 5 minutes, >>> so the race should be totally possible in production. >>> >>> After this patch, I ran the reproducer for over a few hundred rounds >>> and no data loss observed. >>> >>> Performance overhead is minimal, microbenchmark swapin 10G from 32G >>> zram: >>> >>> Before: 10934698 us >>> After: 11157121 us >>> Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) >>> >>> Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") >>> Reported-by: "Huang, Ying" <ying.huang@intel.com> >>> Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ >>> Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] >>> Signed-off-by: Kairui Song <kasong@tencent.com> >>> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> >>> Acked-by: Yu Zhao <yuzhao@google.com> >>> >>> --- >>> Update from V1: >>> - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song] >>> - Update comments make it cleaner [Huang, Ying] >>> - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park] >>> - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path" [Yu Zhao] >>> - Update commit message. >>> - Collect Review and Acks. >>> >>> include/linux/swap.h | 5 +++++ >>> mm/memory.c | 15 +++++++++++++++ >>> mm/swap.h | 5 +++++ >>> mm/swapfile.c | 13 +++++++++++++ >>> 4 files changed, 38 insertions(+) >>> >>> diff --git a/include/linux/swap.h b/include/linux/swap.h >>> index 4db00ddad261..8d28f6091a32 100644 >>> --- a/include/linux/swap.h >>> +++ b/include/linux/swap.h >>> @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) >>> return 0; >>> } >>> >>> +static inline int swapcache_prepare(swp_entry_t swp) >>> +{ >>> + return 0; >>> +} >>> + >>> static inline void swap_free(swp_entry_t swp) >>> { >>> } >>> diff --git a/mm/memory.c b/mm/memory.c >>> index 7e1f4849463a..1749c700823d 100644 >>> --- a/mm/memory.c >>> +++ b/mm/memory.c >>> @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >>> if (!folio) { >>> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && >>> __swap_count(entry) == 1) { >>> + /* >>> + * Prevent parallel swapin from proceeding with >>> + * the cache flag. Otherwise, another thread may >>> + * finish swapin first, free the entry, and swapout >>> + * reusing the same entry. It's undetectable as >>> + * pte_same() returns true due to entry reuse. >>> + */ >>> + if (swapcache_prepare(entry)) >>> + goto out; >>> + >> >> Is there anything that guarantees that the following won't >> happen concurrently, and if it would happen, could it be a problem? >> >> Other thread MADV_DONTNEED's the swap entry, swap slot is freed. >> Some other page gets swapped out, reuses that swap slot. > > Hi David, > > Thanks for adding more comments and sharing your thoughts! > > I'm not sure what you mean by "reuses that swap slot" here, I think > you just mean reuse that swap entry (and the actual content on swap > device)? I tried to be precise but I think that caused more confusion :) entry = device + offset into swap_map I called a "swap slot" the metadata that is identified by device + offset (swap_map). I recall that's what we used to call these things [1], e.g., "Allocating swap slots ... All page sized slots are tracked by the array swap_info_struct→swap_map." [1] https://www.kernel.org/doc/gorman/html/understand/understand014.html Anyhow, you got what I mean :) > >> We call swapcache_prepare() on that slot that is being reused >> elsewhere. (possibly some other thread in the context of the reuses >> swap slot might do the same!) > > I think this kind of swapcache_prepare() or false swapin read is > already happening quite frequently by swap readaheads. I've seen swap > cluster readahead mess up working set activation and memory policy > already. Swap cluster readahead simply read in nearby entries of > target entry, regardless of whether they are owned by the reader or > not. Okay, thanks for the confirmation! > > For this patch, similar issues also exist, I think it only hurts the > performance, but that's a really rare thing to happen, so should not > be a problem. Okay, good. > >> >> We would detect later, that the PTE changed, but we would temporarily >> mess with that swap slot that we might no longer "own". >> >> I was thinking about alternatives, it's tricky because of the concurrent >> MADV_DONTNEED possibility. Something with another fake-swap entry type >> (similar to migration entries) might work, but would require more changes. > > Yeah, in the long term I also think more work is needed for the swap subsystem. > > In my opinion, for this particular issue, or, for cache bypassed > swapin, a new swap map value similar to SWAP_MAP_BAD/SWAP_MAP_SHMEM > might be needed, that may even help to simplify the swap count release > routine for cache bypassed swapin, and improve the performance. The question is if we really want to track that in the swapcache and not rather in the page table. Imagine the following: (1) allocate the folio and lock it (we do that already) (2) take the page table lock. If the PTE is still the same, insert a new "swapin_in_process" fake swp entry that references the locked folio. (3) read the folio from swap. This will unlock the folio IIUC. (we do that already) (4) relock the folio. (we do that already, might not want to fail) (4) take the PTE lock. If the PTE did not change, turn it into a present PTE entry. Otherwise, cleanup. Any concurrent swap-in users would spot the new "swapin_in_process" fake swp entry and wait for the page lock (just like we do with migration entries). Zap code would mostly only clear the "swapin_in_process" fake swp entry and leave the cleanup to (4) above. Fortunately, concurrent fork() is impossible as that cannot race with page faults. There might be one minor thing to optimize with the folio lock above. But in essence, it would work just like migration entries, just that they are installed only while we actually do read the content from disk etc.
Hi David, On Thu, Feb 15, 2024 at 09:03:28PM +0100, David Hildenbrand wrote: < snip > > > > > > > We would detect later, that the PTE changed, but we would temporarily > > > mess with that swap slot that we might no longer "own". > > > > > > I was thinking about alternatives, it's tricky because of the concurrent > > > MADV_DONTNEED possibility. Something with another fake-swap entry type > > > (similar to migration entries) might work, but would require more changes. > > > > Yeah, in the long term I also think more work is needed for the swap subsystem. > > > > In my opinion, for this particular issue, or, for cache bypassed > > swapin, a new swap map value similar to SWAP_MAP_BAD/SWAP_MAP_SHMEM > > might be needed, that may even help to simplify the swap count release > > routine for cache bypassed swapin, and improve the performance. > > The question is if we really want to track that in the swapcache and not > rather in the page table. > > Imagine the following: > > (1) allocate the folio and lock it (we do that already) > > (2) take the page table lock. If the PTE is still the same, insert a new > "swapin_in_process" fake swp entry that references the locked folio. > > (3) read the folio from swap. This will unlock the folio IIUC. (we do that > already) > > (4) relock the folio. (we do that already, might not want to fail) > > (4) take the PTE lock. If the PTE did not change, turn it into a present PTE > entry. Otherwise, cleanup. > > > Any concurrent swap-in users would spot the new "swapin_in_process" fake swp > entry and wait for the page lock (just like we do with migration entries). > > Zap code would mostly only clear the "swapin_in_process" fake swp entry and > leave the cleanup to (4) above. Fortunately, concurrent fork() is impossible > as that cannot race with page faults. > > There might be one minor thing to optimize with the folio lock above. But in > essence, it would work just like migration entries, just that they are > installed only while we actually do read the content from disk etc. That's a great idea. I was thinking to have the synchronization in the page table but couldn't reach to the other non_swap_entry idea. Only concern of the approach is that it would be harder to have the fix in the stable tree. If there isn't strong objection, I prefer the Kairui's orginal solution(with some tweak of scheduler if it's necessary) first and then pursue your idea on latest tree.
On Thu, 15 Feb 2024 12:55:11 -0800 Minchan Kim <minchan@kernel.org> wrote: > Only concern of the approach is that it would be harder to have the fix > in the stable tree. If there isn't strong objection, I prefer the > Kairui's orginal solution(with some tweak of scheduler if it's > necessary) first and then pursue your idea on latest tree. Do we agree that this fix is needed in -stable? For some reason I don't have a cc:stable in the changelog. I'd like to move this patch into mm-hotfixes-stable in order to get it upstream very soon. Are we all agreeable with that? I don't have an Acked-by: David? I have a note (which dates back to an earlier version) that Barry would be performing runtime testing. Has that occurred? Thanks. Here's what is presently in mm-hotfixes-unstable: From: Kairui Song <kasong@tencent.com> Subject: mm/swap: fix race when skipping swapcache Date: Wed, 7 Feb 2024 02:25:59 +0800 When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads swapin the same entry at the same time, they get different pages (A, B). Before one thread (T0) finishes the swapin and installs page (A) to the PTE, another thread (T1) could finish swapin of page (B), swap_free the entry, then swap out the possibly modified page reusing the same entry. It breaks the pte_same check in (T0) because PTE value is unchanged, causing ABA problem. Thread (T0) will install a stalled page (A) into the PTE and cause data corruption. One possible callstack is like this: CPU0 CPU1 ---- ---- do_swap_page() do_swap_page() with same entry <direct swapin path> <direct swapin path> <alloc page A> <alloc page B> swap_read_folio() <- read to page A swap_read_folio() <- read to page B <slow on later locks or interrupt> <finished swapin first> ... set_pte_at() swap_free() <- entry is free <write to page B, now page A stalled> <swap out page B to same swap entry> pte_same() <- Check pass, PTE seems unchanged, but page A is stalled! swap_free() <- page B content lost! set_pte_at() <- staled page A installed! And besides, for ZRAM, swap_free() allows the swap device to discard the entry content, so even if page (B) is not modified, if swap_read_folio() on CPU0 happens later than swap_free() on CPU1, it may also cause data loss. To fix this, reuse swapcache_prepare which will pin the swap entry using the cache flag, and allow only one thread to pin it. Release the pin after PT unlocked. Racers will simply busy wait since it's a rare and very short event. Other methods like increasing the swap count don't seem to be a good idea after some tests, that will cause racers to fall back to use the swap cache again. Parallel swapin using different methods leads to a much more complex scenario. Reproducer: This race issue can be triggered easily using a well constructed reproducer and patched brd (with a delay in read path) [1]: With latest 6.8 mainline, race caused data loss can be observed easily: $ gcc -g -lpthread test-thread-swap-race.c && ./a.out Polulating 32MB of memory region... Keep swapping out... Starting round 0... Spawning 65536 workers... 32746 workers spawned, wait for done... Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! Round 0 Failed, 15 data loss! This reproducer spawns multiple threads sharing the same memory region using a small swap device. Every two threads updates mapped pages one by one in opposite direction trying to create a race, with one dedicated thread keep swapping out the data out using madvise. The reproducer created a reproduce rate of about once every 5 minutes, so the race should be totally possible in production. After this patch, I ran the reproducer for over a few hundred rounds and no data loss observed. Performance overhead is minimal, microbenchmark swapin 10G from 32G zram: Before: 10934698 us After: 11157121 us Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) Link: https://lkml.kernel.org/r/20240206182559.32264-1-ryncsn@gmail.com Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") Reported-by: "Huang, Ying" <ying.huang@intel.com> Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Yu Zhao <yuzhao@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Barry Song <21cnbao@gmail.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- include/linux/swap.h | 5 +++++ mm/memory.c | 15 +++++++++++++++ mm/swap.h | 5 +++++ mm/swapfile.c | 13 +++++++++++++ 4 files changed, 38 insertions(+) --- a/include/linux/swap.h~mm-swap-fix-race-when-skipping-swapcache +++ a/include/linux/swap.h @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_ent return 0; } +static inline int swapcache_prepare(swp_entry_t swp) +{ + return 0; +} + static inline void swap_free(swp_entry_t swp) { } --- a/mm/memory.c~mm-swap-fix-race-when-skipping-swapcache +++ a/mm/memory.c @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault if (!folio) { if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1) { + /* + * Prevent parallel swapin from proceeding with + * the cache flag. Otherwise, another thread may + * finish swapin first, free the entry, and swapout + * reusing the same entry. It's undetectable as + * pte_same() returns true due to entry reuse. + */ + if (swapcache_prepare(entry)) + goto out; + /* skip swapcache */ folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false); @@ -4116,6 +4126,9 @@ vm_fault_t do_swap_page(struct vm_fault unlock: if (vmf->pte) pte_unmap_unlock(vmf->pte, vmf->ptl); + /* Clear the swap cache pin for direct swapin after PTL unlock */ + if (folio && !swapcache) + swapcache_clear(si, entry); out: if (si) put_swap_device(si); @@ -4124,6 +4137,8 @@ out_nomap: if (vmf->pte) pte_unmap_unlock(vmf->pte, vmf->ptl); out_page: + if (!swapcache) + swapcache_clear(si, entry); folio_unlock(folio); out_release: folio_put(folio); --- a/mm/swapfile.c~mm-swap-fix-race-when-skipping-swapcache +++ a/mm/swapfile.c @@ -3365,6 +3365,19 @@ int swapcache_prepare(swp_entry_t entry) return __swap_duplicate(entry, SWAP_HAS_CACHE); } +void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry) +{ + struct swap_cluster_info *ci; + unsigned long offset = swp_offset(entry); + unsigned char usage; + + ci = lock_cluster_or_swap_info(si, offset); + usage = __swap_entry_free_locked(si, offset, SWAP_HAS_CACHE); + unlock_cluster_or_swap_info(si, ci); + if (!usage) + free_swap_slot(entry); +} + struct swap_info_struct *swp_swap_info(swp_entry_t entry) { return swap_type_to_swap_info(swp_type(entry)); --- a/mm/swap.h~mm-swap-fix-race-when-skipping-swapcache +++ a/mm/swap.h @@ -41,6 +41,7 @@ void __delete_from_swap_cache(struct fol void delete_from_swap_cache(struct folio *folio); void clear_shadow_from_swap_cache(int type, unsigned long begin, unsigned long end); +void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry); struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_area_struct *vma, unsigned long addr); struct folio *filemap_get_incore_folio(struct address_space *mapping, @@ -97,6 +98,10 @@ static inline int swap_writepage(struct return 0; } +static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry) +{ +} + static inline struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_area_struct *vma, unsigned long addr) {
On Fri, Feb 16, 2024 at 4:03 AM David Hildenbrand <david@redhat.com> wrote: > On 15.02.24 19:49, Kairui Song wrote: > > On Thu, Feb 15, 2024 at 11:36 PM David Hildenbrand <david@redhat.com> wrote: > >> On 06.02.24 19:25, Kairui Song wrote: > >>> From: Kairui Song <kasong@tencent.com> > >>> > >>> When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads > >>> swapin the same entry at the same time, they get different pages (A, B). > >>> Before one thread (T0) finishes the swapin and installs page (A) > >>> to the PTE, another thread (T1) could finish swapin of page (B), > >>> swap_free the entry, then swap out the possibly modified page > >>> reusing the same entry. It breaks the pte_same check in (T0) because > >>> PTE value is unchanged, causing ABA problem. Thread (T0) will > >>> install a stalled page (A) into the PTE and cause data corruption. > >>> > >>> One possible callstack is like this: > >>> > >>> CPU0 CPU1 > >>> ---- ---- > >>> do_swap_page() do_swap_page() with same entry > >>> <direct swapin path> <direct swapin path> > >>> <alloc page A> <alloc page B> > >>> swap_read_folio() <- read to page A swap_read_folio() <- read to page B > >>> <slow on later locks or interrupt> <finished swapin first> > >>> ... set_pte_at() > >>> swap_free() <- entry is free > >>> <write to page B, now page A stalled> > >>> <swap out page B to same swap entry> > >>> pte_same() <- Check pass, PTE seems > >>> unchanged, but page A > >>> is stalled! > >>> swap_free() <- page B content lost! > >>> set_pte_at() <- staled page A installed! > >>> > >>> And besides, for ZRAM, swap_free() allows the swap device to discard > >>> the entry content, so even if page (B) is not modified, if > >>> swap_read_folio() on CPU0 happens later than swap_free() on CPU1, > >>> it may also cause data loss. > >>> > >>> To fix this, reuse swapcache_prepare which will pin the swap entry using > >>> the cache flag, and allow only one thread to pin it. Release the pin > >>> after PT unlocked. Racers will simply busy wait since it's a rare > >>> and very short event. > >>> > >>> Other methods like increasing the swap count don't seem to be a good > >>> idea after some tests, that will cause racers to fall back to use the > >>> swap cache again. Parallel swapin using different methods leads to > >>> a much more complex scenario. > >>> > >>> Reproducer: > >>> > >>> This race issue can be triggered easily using a well constructed > >>> reproducer and patched brd (with a delay in read path) [1]: > >>> > >>> With latest 6.8 mainline, race caused data loss can be observed easily: > >>> $ gcc -g -lpthread test-thread-swap-race.c && ./a.out > >>> Polulating 32MB of memory region... > >>> Keep swapping out... > >>> Starting round 0... > >>> Spawning 65536 workers... > >>> 32746 workers spawned, wait for done... > >>> Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss! > >>> Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss! > >>> Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss! > >>> Round 0 Failed, 15 data loss! > >>> > >>> This reproducer spawns multiple threads sharing the same memory region > >>> using a small swap device. Every two threads updates mapped pages one by > >>> one in opposite direction trying to create a race, with one dedicated > >>> thread keep swapping out the data out using madvise. > >>> > >>> The reproducer created a reproduce rate of about once every 5 minutes, > >>> so the race should be totally possible in production. > >>> > >>> After this patch, I ran the reproducer for over a few hundred rounds > >>> and no data loss observed. > >>> > >>> Performance overhead is minimal, microbenchmark swapin 10G from 32G > >>> zram: > >>> > >>> Before: 10934698 us > >>> After: 11157121 us > >>> Non-direct: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) > >>> > >>> Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") > >>> Reported-by: "Huang, Ying" <ying.huang@intel.com> > >>> Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/ > >>> Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1] > >>> Signed-off-by: Kairui Song <kasong@tencent.com> > >>> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> > >>> Acked-by: Yu Zhao <yuzhao@google.com> > >>> > >>> --- > >>> Update from V1: > >>> - Add some words on ZRAM case, it will discard swap content on swap_free so the race window is a bit different but cure is the same. [Barry Song] > >>> - Update comments make it cleaner [Huang, Ying] > >>> - Add a function place holder to fix CONFIG_SWAP=n built [SeongJae Park] > >>> - Update the commit message and summary, refer to SWP_SYNCHRONOUS_IO instead of "direct swapin path" [Yu Zhao] > >>> - Update commit message. > >>> - Collect Review and Acks. > >>> > >>> include/linux/swap.h | 5 +++++ > >>> mm/memory.c | 15 +++++++++++++++ > >>> mm/swap.h | 5 +++++ > >>> mm/swapfile.c | 13 +++++++++++++ > >>> 4 files changed, 38 insertions(+) > >>> > >>> diff --git a/include/linux/swap.h b/include/linux/swap.h > >>> index 4db00ddad261..8d28f6091a32 100644 > >>> --- a/include/linux/swap.h > >>> +++ b/include/linux/swap.h > >>> @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) > >>> return 0; > >>> } > >>> > >>> +static inline int swapcache_prepare(swp_entry_t swp) > >>> +{ > >>> + return 0; > >>> +} > >>> + > >>> static inline void swap_free(swp_entry_t swp) > >>> { > >>> } > >>> diff --git a/mm/memory.c b/mm/memory.c > >>> index 7e1f4849463a..1749c700823d 100644 > >>> --- a/mm/memory.c > >>> +++ b/mm/memory.c > >>> @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > >>> if (!folio) { > >>> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > >>> __swap_count(entry) == 1) { > >>> + /* > >>> + * Prevent parallel swapin from proceeding with > >>> + * the cache flag. Otherwise, another thread may > >>> + * finish swapin first, free the entry, and swapout > >>> + * reusing the same entry. It's undetectable as > >>> + * pte_same() returns true due to entry reuse. > >>> + */ > >>> + if (swapcache_prepare(entry)) > >>> + goto out; > >>> + > >> > >> Is there anything that guarantees that the following won't > >> happen concurrently, and if it would happen, could it be a problem? > >> > >> Other thread MADV_DONTNEED's the swap entry, swap slot is freed. > >> Some other page gets swapped out, reuses that swap slot. > > > > Hi David, > > > > Thanks for adding more comments and sharing your thoughts! > > > > I'm not sure what you mean by "reuses that swap slot" here, I think > > you just mean reuse that swap entry (and the actual content on swap > > device)? > > I tried to be precise but I think that caused more confusion :) > > entry = device + offset into swap_map > > I called a "swap slot" the metadata that is identified by device + > offset (swap_map). > > I recall that's what we used to call these things [1], e.g., "Allocating > swap slots ... All page sized slots are tracked by the array > swap_info_struct→swap_map." > > [1] https://www.kernel.org/doc/gorman/html/understand/understand014.html > > Anyhow, you got what I mean :) Hi David, That's good, I was confused about the naming of swap related things :) > > > >> We call swapcache_prepare() on that slot that is being reused > >> elsewhere. (possibly some other thread in the context of the reuses > >> swap slot might do the same!) > > > > I think this kind of swapcache_prepare() or false swapin read is > > already happening quite frequently by swap readaheads. I've seen swap > > cluster readahead mess up working set activation and memory policy > > already. Swap cluster readahead simply read in nearby entries of > > target entry, regardless of whether they are owned by the reader or > > not. > > Okay, thanks for the confirmation! > > > > For this patch, similar issues also exist, I think it only hurts the > > performance, but that's a really rare thing to happen, so should not > > be a problem. > > Okay, good. > > > > >> > >> We would detect later, that the PTE changed, but we would temporarily > >> mess with that swap slot that we might no longer "own". > >> > >> I was thinking about alternatives, it's tricky because of the concurrent > >> MADV_DONTNEED possibility. Something with another fake-swap entry type > >> (similar to migration entries) might work, but would require more changes. > > > > Yeah, in the long term I also think more work is needed for the swap subsystem. > > > > In my opinion, for this particular issue, or, for cache bypassed > > swapin, a new swap map value similar to SWAP_MAP_BAD/SWAP_MAP_SHMEM > > might be needed, that may even help to simplify the swap count release > > routine for cache bypassed swapin, and improve the performance. > > The question is if we really want to track that in the swapcache and not > rather in the page table. > > Imagine the following: > > (1) allocate the folio and lock it (we do that already) > > (2) take the page table lock. If the PTE is still the same, insert a new > "swapin_in_process" fake swp entry that references the locked folio. ...> > (3) read the folio from swap. This will unlock the folio IIUC. (we do > that already) > > (4) relock the folio. (we do that already, might not want to fail) > > (4) take the PTE lock. If the PTE did not change, turn it into a present > PTE entry. Otherwise, cleanup. Very interesting idea! I'm just not sure what actual benefit it brings. The only concern about reusing swapcache_prepare so far is repeated page faults that may hurt performance or statistics, this issue is basically gone after adding a schedule(). We can't drop all the operations around swap cache and map anyway. It doesn't know if it should skip the swapcache until swapcache lookup and swap count look up are all done. So I think it can be done more naturally here with a special value, making things simpler, robust, and improving performance a bit more. And in another series [1] I'm working on making shmem make use of cache bypassed swapin as well, following this approach I'll have to implement another shmem map based synchronization too. After all it's only a rare race, I think a simpler solution might be better. [1] https://lore.kernel.org/all/20240129175423.1987-1-ryncsn@gmail.com/
On Fri, Feb 16, 2024 at 6:58 AM Andrew Morton <akpm@linux-foundation.org> wrote: > On Thu, 15 Feb 2024 12:55:11 -0800 Minchan Kim <minchan@kernel.org> wrote: > > Only concern of the approach is that it would be harder to have the fix > > in the stable tree. If there isn't strong objection, I prefer the > > Kairui's orginal solution(with some tweak of scheduler if it's > > necessary) first and then pursue your idea on latest tree. > > Do we agree that this fix is needed in -stable? For some reason I > don't have a cc:stable in the changelog. > > I'd like to move this patch into mm-hotfixes-stable in order to get it > upstream very soon. Are we all agreeable with that? I don't have an > Acked-by: David? > > I have a note (which dates back to an earlier version) that Barry would > be performing runtime testing. Has that occurred? > > Thanks. > Hi Andrew, I think this fix is needed for stable, I've sent V3, if we agree on that, please consider V3 which includes changes we have discussed so far, thanks!
>> (4) relock the folio. (we do that already, might not want to fail) >> >> (4) take the PTE lock. If the PTE did not change, turn it into a present >> PTE entry. Otherwise, cleanup. > > Very interesting idea! > > I'm just not sure what actual benefit it brings. The only concern > about reusing swapcache_prepare so far is repeated page faults that > may hurt performance or statistics, this issue is basically gone after > adding a schedule(). I think you know that slapping in random schedule() calls is not a proper way to wait for an event to happen :) It's pretty much unpredictable how long the schedule() will take and if there even is anything to schedule to! With what I propose, just like with page migration, you really do wait for the event (page read and unlocked, only the PTE has to be updated) to happen before you try anything else. Now, the difference is most likely that the case here happens much less frequently than page migration. Still, you could have all threads faulting one the same page and all would do the same dance here. > > We can't drop all the operations around swap cache and map anyway. It > doesn't know if it should skip the swapcache until swapcache lookup > and swap count look up are all done. So I think it can be done more > naturally here with a special value, making things simpler, robust, > and improving performance a bit more. > The issue will always be that someone can zap the PTE concurrently, which would free up the swap cache. With what I propose, that cannot happen in the sync swapin case we are discussing here. If someone where to zap the PTE in the meantime, it would only remove the special non-swap entry, indicating to swapin code that concurrent zapping happened. The swapin code would handle the swapcache freeing instead, after finishing reading the content. So the swapcache would not get freed concurrently anymore if I am not wrong. At least the theory, I didn't prototype it yet. > And in another series [1] I'm working on making shmem make use of > cache bypassed swapin as well, following this approach I'll have to > implement another shmem map based synchronization too. > I'd have to look further into that, if something like that could similarly apply to shmem. But yes, it's no using PTEs, so a PTE-based sync mechanism does definitely not apply.. > After all it's only a rare race, I think a simpler solution might be better. I'm not sure that simpler means better here. Simpler might be better for a backport, though. The "use schedule() to wait" looks odd, maybe it's a common primitive that I simply didn't stumble over yet. (I doubt it but it could be possible)
Kairui Song <ryncsn@gmail.com> writes: > On Thu, Feb 8, 2024 at 2:36 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Kairui Song <ryncsn@gmail.com> writes: >> >> > On Thu, Feb 8, 2024 at 2:31 AM Minchan Kim <minchan@kernel.org> wrote: >> >> >> >> On Wed, Feb 07, 2024 at 12:06:15PM +0800, Kairui Song wrote: >> >> [snip] >> >> >> > >> >> > So I think the thing is, it's getting complex because this patch >> >> > wanted to make it simple and just reuse the swap cache flags. >> >> >> >> I agree that a simple fix would be the important at this point. >> >> >> >> Considering your description, here's my understanding of the other idea: >> >> Other method, such as increasing the swap count, haven't proven effective >> >> in your tests. The approach risk forcing racers to rely on the swap cache >> >> again and the potential performance loss in race scenario. >> >> >> >> While I understand that simplicity is important, and performance loss >> >> in this case may be infrequent, I believe swap_count approach could be a >> >> suitable solution. What do you think? >> > >> > Hi Minchan >> > >> > Yes, my main concern was about simplicity and performance. >> > >> > Increasing swap_count here will also race with another process from >> > releasing swap_count to 0 (swapcache was able to sync callers in other >> > call paths but we skipped swapcache here). >> >> What is the consequence of the race condition? > > Hi Ying, > > It will increase the swap count of an already freed entry, this race > with multiple swap free/alloc logic that checks if count == > SWAP_HAS_CACHE or sets count to zero, or repeated free of an entry, > all result in random corruption of the swap map. This happens a lot > during stress testing. You are right! Thanks for explanation. -- Best Regards, Huang, Ying >> >> > So the right step is: 1. Lock the cluster/swap lock; 2. Check if still >> > have swap_count == 1, bail out if not; 3. Set it to 2; >> > __swap_duplicate can be modified to support this, it's similar to >> > existing logics for SWAP_HAS_CACHE. >> > >> > And swap freeing path will do more things, swapcache clean up needs to >> > be handled even in the bypassing path since the racer may add it to >> > swapcache. >> > >> > Reusing SWAP_HAS_CACHE seems to make it much simpler and avoided many >> > overhead, so I used that way in this patch, the only issue is >> > potentially repeated page faults now. >> > >> > I'm currently trying to add a SWAP_MAP_LOCK (or SWAP_MAP_SYNC, I'm bad >> > at naming it) special value, so any racer can just spin on it to avoid >> > all the problems, how do you think about this? >> >> Let's try some simpler method firstly. > > Another simpler idea is, add a schedule() or > schedule_timeout_uninterruptible(1) in the swapcache_prepare failure > path before goto out (just like __read_swap_cache_async). I think this > should ensure in almost all cases, PTE is ready after it returns, also > yields more CPU.
diff --git a/include/linux/swap.h b/include/linux/swap.h index 4db00ddad261..8d28f6091a32 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -549,6 +549,11 @@ static inline int swap_duplicate(swp_entry_t swp) return 0; } +static inline int swapcache_prepare(swp_entry_t swp) +{ + return 0; +} + static inline void swap_free(swp_entry_t swp) { } diff --git a/mm/memory.c b/mm/memory.c index 7e1f4849463a..1749c700823d 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3867,6 +3867,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (!folio) { if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1) { + /* + * Prevent parallel swapin from proceeding with + * the cache flag. Otherwise, another thread may + * finish swapin first, free the entry, and swapout + * reusing the same entry. It's undetectable as + * pte_same() returns true due to entry reuse. + */ + if (swapcache_prepare(entry)) + goto out; + /* skip swapcache */ folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false); @@ -4116,6 +4126,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) unlock: if (vmf->pte) pte_unmap_unlock(vmf->pte, vmf->ptl); + /* Clear the swap cache pin for direct swapin after PTL unlock */ + if (folio && !swapcache) + swapcache_clear(si, entry); out: if (si) put_swap_device(si); @@ -4124,6 +4137,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (vmf->pte) pte_unmap_unlock(vmf->pte, vmf->ptl); out_page: + if (!swapcache) + swapcache_clear(si, entry); folio_unlock(folio); out_release: folio_put(folio); diff --git a/mm/swap.h b/mm/swap.h index 758c46ca671e..fc2f6ade7f80 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -41,6 +41,7 @@ void __delete_from_swap_cache(struct folio *folio, void delete_from_swap_cache(struct folio *folio); void clear_shadow_from_swap_cache(int type, unsigned long begin, unsigned long end); +void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry); struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_area_struct *vma, unsigned long addr); struct folio *filemap_get_incore_folio(struct address_space *mapping, @@ -97,6 +98,10 @@ static inline int swap_writepage(struct page *p, struct writeback_control *wbc) return 0; } +static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry) +{ +} + static inline struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_area_struct *vma, unsigned long addr) { diff --git a/mm/swapfile.c b/mm/swapfile.c index 556ff7347d5f..746aa9da5302 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -3365,6 +3365,19 @@ int swapcache_prepare(swp_entry_t entry) return __swap_duplicate(entry, SWAP_HAS_CACHE); } +void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry) +{ + struct swap_cluster_info *ci; + unsigned long offset = swp_offset(entry); + unsigned char usage; + + ci = lock_cluster_or_swap_info(si, offset); + usage = __swap_entry_free_locked(si, offset, SWAP_HAS_CACHE); + unlock_cluster_or_swap_info(si, ci); + if (!usage) + free_swap_slot(entry); +} + struct swap_info_struct *swp_swap_info(swp_entry_t entry) { return swap_type_to_swap_info(swp_type(entry));