mm: avoid unconditional one-tick sleep when swapcache_prepare fails

Message ID 20240926211936.75373-1-21cnbao@gmail.com (mailing list archive)
State New
Series mm: avoid unconditional one-tick sleep when swapcache_prepare fails

Commit Message

Barry Song Sept. 26, 2024, 9:19 p.m. UTC
From: Barry Song <v-songbaohua@oppo.com>

Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
introduced an unconditional one-tick sleep when `swapcache_prepare()`
fails, which has led to reports of UI stuttering on latency-sensitive
Android devices. To address this, we can use a waitqueue to wake up
tasks that fail `swapcache_prepare()` sooner, instead of always
sleeping for a full tick. While tasks may occasionally be woken by an
unrelated `do_swap_page()`, this method is preferable to two scenarios:
rapid re-entry into page faults, which can cause livelocks, and
multiple millisecond sleeps, which visibly degrade user experience.

Oven's testing shows that a single waitqueue resolves the UI
stuttering issue. If a 'thundering herd' problem becomes apparent
later, a waitqueue hash similar to `folio_wait_table[PAGE_WAIT_TABLE_SIZE]`
for page bit locks can be introduced.

Fixes: 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
Cc: Kairui Song <kasong@tencent.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Reported-by: Oven Liyang <liyangouwen1@oppo.com>
Tested-by: Oven Liyang <liyangouwen1@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/memory.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)
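
In sketch form, condensed from the diff (quoted in full in the replies
below): contended faulters park on a single global waitqueue for at most
one tick, and the task holding the swap cache pin wakes them early.

	static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);

	/* faulting side, when swapcache_prepare() detects a concurrent swapin: */
	DECLARE_WAITQUEUE(wait, current);
	add_wait_queue(&swapcache_wq, &wait);
	schedule_timeout_uninterruptible(1);
	remove_wait_queue(&swapcache_wq, &wait);

	/* owner side, once the pin is released: */
	swapcache_clear(si, entry, nr_pages);
	wake_up(&swapcache_wq);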

Comments

Huang, Ying Sept. 29, 2024, 2:39 a.m. UTC | #1
Hi, Barry,

Barry Song <21cnbao@gmail.com> writes:

> From: Barry Song <v-songbaohua@oppo.com>
>
> Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> introduced an unconditional one-tick sleep when `swapcache_prepare()`
> fails, which has led to reports of UI stuttering on latency-sensitive
> Android devices. To address this, we can use a waitqueue to wake up
> tasks that fail `swapcache_prepare()` sooner, instead of always
> sleeping for a full tick. While tasks may occasionally be woken by an
> unrelated `do_swap_page()`, this method is preferable to two scenarios:
> rapid re-entry into page faults, which can cause livelocks, and
> multiple millisecond sleeps, which visibly degrade user experience.

In general, I think that this works.  Why not extend the solution to
cover schedule_timeout_uninterruptible() in __read_swap_cache_async()
too?  We can call wake_up() when we clear SWAP_HAS_CACHE.  To avoid
the overhead of calling wake_up() when there's no task waiting, we can use an
atomic to count waiting tasks.

[snip]

--
Best Regards,
Huang, Ying
Barry Song Sept. 30, 2024, 1:18 p.m. UTC | #2
On Sun, Sep 29, 2024 at 3:43 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Barry,
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > introduced an unconditional one-tick sleep when `swapcache_prepare()`
> > fails, which has led to reports of UI stuttering on latency-sensitive
> > Android devices. To address this, we can use a waitqueue to wake up
> > tasks that fail `swapcache_prepare()` sooner, instead of always
> > sleeping for a full tick. While tasks may occasionally be woken by an
> > unrelated `do_swap_page()`, this method is preferable to two scenarios:
> > rapid re-entry into page faults, which can cause livelocks, and
> > multiple millisecond sleeps, which visibly degrade user experience.
>
> In general, I think that this works.  Why not extend the solution to
> cover schedule_timeout_uninterruptible() in __read_swap_cache_async()
> too?  We can call wake_up() when we clear SWAP_HAS_CACHE.  To avoid

Hi Ying,
Thanks for your comments.
I feel extending the solution to __read_swap_cache_async() should be done
in a separate patch. On phones, I've never encountered any issues reported
on that path, so it might be better suited for an optimization rather than a
hotfix?

> the overhead of calling wake_up() when there's no task waiting, we can use an
> atomic to count waiting tasks.

I'm not sure it's worth adding the complexity, as wake_up() on an empty
waitqueue should have a very low cost on its own?

>
> [snip]
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry
Huang, Ying Sept. 30, 2024, 11:40 p.m. UTC | #3
Barry Song <21cnbao@gmail.com> writes:

> On Sun, Sep 29, 2024 at 3:43 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Hi, Barry,
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > From: Barry Song <v-songbaohua@oppo.com>
>> >
>> > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
>> > introduced an unconditional one-tick sleep when `swapcache_prepare()`
>> > fails, which has led to reports of UI stuttering on latency-sensitive
>> > Android devices. To address this, we can use a waitqueue to wake up
>> > tasks that fail `swapcache_prepare()` sooner, instead of always
>> > sleeping for a full tick. While tasks may occasionally be woken by an
>> > unrelated `do_swap_page()`, this method is preferable to two scenarios:
>> > rapid re-entry into page faults, which can cause livelocks, and
>> > multiple millisecond sleeps, which visibly degrade user experience.
>>
>> In general, I think that this works.  Why not extend the solution to
>> cover schedule_timeout_uninterruptible() in __read_swap_cache_async()
>> too?  We can call wake_up() when we clear SWAP_HAS_CACHE.  To avoid
>
> Hi Ying,
> Thanks for your comments.
> I feel extending the solution to __read_swap_cache_async() should be done
> in a separate patch. On phones, I've never encountered any issues reported
> on that path, so it might be better suited for an optimization rather than a
> hotfix?

Yes.  It's fine to do that in another patch as an optimization.

>> the overhead of calling wake_up() when there's no task waiting, we can use an
>> atomic to count waiting tasks.
>
> I'm not sure it's worth adding the complexity, as wake_up() on an empty
> waitqueue should have a very low cost on its own?

wake_up() needs to call spin_lock_irqsave() unconditionally on a global
shared lock.  On systems with many CPUs (such as servers), this may cause
severe lock contention.  Even the cache ping-pong may hurt performance
significantly.
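
For context, the locking in question: wake_up() expands to __wake_up(),
which takes the waitqueue lock whether or not anyone is waiting. In
essence (a rough sketch, not an exact copy of kernel/sched/wait.c):

	/* include/linux/wait.h */
	#define wake_up(x)	__wake_up(x, TASK_NORMAL, 1, NULL)

	/* __wake_up(), in essence: */
	spin_lock_irqsave(&wq_head->lock, flags);
	/* ... walk the wait list and wake matching entries ... */
	spin_unlock_irqrestore(&wq_head->lock, flags);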

--
Best Regards,
Huang, Ying
Barry Song Oct. 1, 2024, 2:16 p.m. UTC | #4
On Tue, Oct 1, 2024 at 7:43 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Sun, Sep 29, 2024 at 3:43 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Hi, Barry,
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > From: Barry Song <v-songbaohua@oppo.com>
> >> >
> >> > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> >> > introduced an unconditional one-tick sleep when `swapcache_prepare()`
> >> > fails, which has led to reports of UI stuttering on latency-sensitive
> >> > Android devices. To address this, we can use a waitqueue to wake up
> >> > tasks that fail `swapcache_prepare()` sooner, instead of always
> >> > sleeping for a full tick. While tasks may occasionally be woken by an
> >> > unrelated `do_swap_page()`, this method is preferable to two scenarios:
> >> > rapid re-entry into page faults, which can cause livelocks, and
> >> > multiple millisecond sleeps, which visibly degrade user experience.
> >>
> >> In general, I think that this works.  Why not extend the solution to
> >> cover schedule_timeout_uninterruptible() in __read_swap_cache_async()
> >> too?  We can call wake_up() when we clear SWAP_HAS_CACHE.  To avoid
> >
> > Hi Ying,
> > Thanks for your comments.
> > I feel extending the solution to __read_swap_cache_async() should be done
> > in a separate patch. On phones, I've never encountered any issues reported
> > on that path, so it might be better suited for an optimization rather than a
> > hotfix?
>
> Yes.  It's fine to do that in another patch as an optimization.

Ok. I'll prepare a separate patch for optimizing that path.

>
> >> the overhead of calling wake_up() when there's no task waiting, we can use an
> >> atomic to count waiting tasks.
> >
> > I'm not sure it's worth adding the complexity, as wake_up() on an empty
> > waitqueue should have a very low cost on its own?
>
> wake_up() needs to call spin_lock_irqsave() unconditionally on a global
> shared lock.  On systems with many CPUs (such as servers), this may cause
> severe lock contention.  Even the cache ping-pong may hurt performance
> significantly.

I understand that cache synchronization was a significant issue before
qspinlock, but it seems to be less of a concern after its implementation.
However, using a global atomic variable would still trigger cache broadcasts,
correct?

>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry
Huang, Ying Oct. 2, 2024, 12:40 a.m. UTC | #5
Barry Song <21cnbao@gmail.com> writes:

> On Tue, Oct 1, 2024 at 7:43 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Sun, Sep 29, 2024 at 3:43 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Hi, Barry,
>> >>
>> >> Barry Song <21cnbao@gmail.com> writes:
>> >>
>> >> > From: Barry Song <v-songbaohua@oppo.com>
>> >> >
>> >> > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
>> >> > introduced an unconditional one-tick sleep when `swapcache_prepare()`
>> >> > fails, which has led to reports of UI stuttering on latency-sensitive
>> >> > Android devices. To address this, we can use a waitqueue to wake up
>> >> > tasks that fail `swapcache_prepare()` sooner, instead of always
>> >> > sleeping for a full tick. While tasks may occasionally be woken by an
>> >> > unrelated `do_swap_page()`, this method is preferable to two scenarios:
>> >> > rapid re-entry into page faults, which can cause livelocks, and
>> >> > multiple millisecond sleeps, which visibly degrade user experience.
>> >>
>> >> In general, I think that this works.  Why not extend the solution to
>> >> cover schedule_timeout_uninterruptible() in __read_swap_cache_async()
>> >> too?  We can call wake_up() when we clear SWAP_HAS_CACHE.  To avoid
>> >
>> > Hi Ying,
>> > Thanks for your comments.
>> > I feel extending the solution to __read_swap_cache_async() should be done
>> > in a separate patch. On phones, I've never encountered any issues reported
>> > on that path, so it might be better suited for an optimization rather than a
>> > hotfix?
>>
>> Yes.  It's fine to do that in another patch as an optimization.
>
> Ok. I'll prepare a separate patch for optimizing that path.

Thanks!

>>
>> >> the overhead of calling wake_up() when there's no task waiting, we can use an
>> >> atomic to count waiting tasks.
>> >
>> > I'm not sure it's worth adding the complexity, as wake_up() on an empty
>> > waitqueue should have a very low cost on its own?
>>
>> wake_up() needs to call spin_lock_irqsave() unconditionally on a global
>> shared lock.  On systems with many CPUs (such as servers), this may cause
>> severe lock contention.  Even the cache ping-pong may hurt performance
>> significantly.
>
> I understand that cache synchronization was a significant issue before
> qspinlock, but it seems to be less of a concern after its implementation.

Unfortunately, qspinlock cannot eliminate the cache ping-pong issue, as
discussed in the following thread.

https://lore.kernel.org/lkml/20220510192708.GQ76023@worktop.programming.kicks-ass.net/

> However, using a global atomic variable would still trigger cache broadcasts,
> correct?

We can only change the atomic variable to non-zero when
swapcache_prepare() returns non-zero, and call wake_up() when the atomic
variable is non-zero.  Because swapcache_prepare() returns 0 most times,
the atomic variable is 0 most times.  If we don't change the value of
the atomic variable, cache ping-pong will not be triggered.
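
A minimal sketch of this scheme (the names are hypothetical, not code
from the thread); the atomic is only written on the already-contended
path, so the uncontended fast path never dirties the shared cache line:

	static atomic_t swapcache_waiters = ATOMIC_INIT(0);	/* hypothetical */

	/* contended faulter, when swapcache_prepare() fails: */
	atomic_inc(&swapcache_waiters);
	add_wait_queue(&swapcache_wq, &wait);
	schedule_timeout_uninterruptible(1);
	remove_wait_queue(&swapcache_wq, &wait);
	atomic_dec(&swapcache_waiters);

	/* pin owner, after swapcache_clear(): */
	if (atomic_read(&swapcache_waiters))
		wake_up(&swapcache_wq);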

Hi, Kairui,

Do you have some test cases to test parallel zram swap-in?  If so, that
can be used to verify whether cache ping-pong is an issue and whether it
can be fixed via a global atomic variable.

--
Best Regards,
Huang, Ying
Barry Song Oct. 2, 2024, 1:57 a.m. UTC | #6
On Wed, Oct 2, 2024 at 8:43 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Tue, Oct 1, 2024 at 7:43 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Sun, Sep 29, 2024 at 3:43 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Hi, Barry,
> >> >>
> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >>
> >> >> > From: Barry Song <v-songbaohua@oppo.com>
> >> >> >
> >> >> > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> >> >> > introduced an unconditional one-tick sleep when `swapcache_prepare()`
> >> >> > fails, which has led to reports of UI stuttering on latency-sensitive
> >> >> > Android devices. To address this, we can use a waitqueue to wake up
> >> >> > tasks that fail `swapcache_prepare()` sooner, instead of always
> >> >> > sleeping for a full tick. While tasks may occasionally be woken by an
> >> >> > unrelated `do_swap_page()`, this method is preferable to two scenarios:
> >> >> > rapid re-entry into page faults, which can cause livelocks, and
> >> >> > multiple millisecond sleeps, which visibly degrade user experience.
> >> >>
> >> >> In general, I think that this works.  Why not extend the solution to
> >> >> cover schedule_timeout_uninterruptible() in __read_swap_cache_async()
> >> >> too?  We can call wake_up() when we clear SWAP_HAS_CACHE.  To avoid
> >> >
> >> > Hi Ying,
> >> > Thanks for your comments.
> >> > I feel extending the solution to __read_swap_cache_async() should be done
> >> > in a separate patch. On phones, I've never encountered any issues reported
> >> > on that path, so it might be better suited for an optimization rather than a
> >> > hotfix?
> >>
> >> Yes.  It's fine to do that in another patch as an optimization.
> >
> > Ok. I'll prepare a separate patch for optimizing that path.
>
> Thanks!
>
> >>
> >> >> the overhead of calling wake_up() when there's no task waiting, we can use an
> >> >> atomic to count waiting tasks.
> >> >
> >> > I'm not sure it's worth adding the complexity, as wake_up() on an empty
> >> > waitqueue should have a very low cost on its own?
> >>
> >> wake_up() needs to call spin_lock_irqsave() unconditionally on a global
> >> shared lock.  On systems with many CPUs (such as servers), this may cause
> >> severe lock contention.  Even the cache ping-pong may hurt performance
> >> significantly.
> >
> > I understand that cache synchronization was a significant issue before
> > qspinlock, but it seems to be less of a concern after its implementation.
>
> Unfortunately, qspinlock cannot eliminate the cache ping-pong issue, as
> discussed in the following thread.
>
> https://lore.kernel.org/lkml/20220510192708.GQ76023@worktop.programming.kicks-ass.net/
>
> > However, using a global atomic variable would still trigger cache broadcasts,
> > correct?
>
> We can only change the atomic variable to non-zero when
> swapcache_prepare() returns non-zero, and call wake_up() when the atomic
> variable is non-zero.  Because swapcache_prepare() returns 0 most times,
> the atomic variable is 0 most times.  If we don't change the value of
> the atomic variable, cache ping-pong will not be triggered.

yes. this can be implemented by adding another atomic variable.

>
> Hi, Kairui,
>
> Do you have some test cases to test parallel zram swap-in?  If so, that
> can be used to verify whether cache ping-pong is an issue and whether it
> can be fixed via a global atomic variable.
>

Yes, Kairui please run a test on your machine with lots of cores before
and after adding a global atomic variable as suggested by Ying. I am
sorry I don't have a server machine.

if it turns out you find cache ping-pong can be an issue, another
approach would be a waitqueue hash:

diff --git a/mm/memory.c b/mm/memory.c
index 2366578015ad..aae0e532d8b6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4192,6 +4192,23 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+/*
+ * Alleviating the 'thundering herd' phenomenon using a waitqueue hash
+ * when multiple do_swap_page() operations occur simultaneously.
+ */
+#define SWAPCACHE_WAIT_TABLE_BITS 5
+#define SWAPCACHE_WAIT_TABLE_SIZE (1 << SWAPCACHE_WAIT_TABLE_BITS)
+static wait_queue_head_t swapcache_wqs[SWAPCACHE_WAIT_TABLE_SIZE];
+
+static int __init swapcache_wqs_init(void)
+{
+	for (int i = 0; i < SWAPCACHE_WAIT_TABLE_SIZE; i++)
+		init_waitqueue_head(&swapcache_wqs[i]);
+
+	return 0;
+}
+late_initcall(swapcache_wqs_init);
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -4204,6 +4221,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *swapcache, *folio = NULL;
+	DECLARE_WAITQUEUE(wait, current);
+	wait_queue_head_t *swapcache_wq;
 	struct page *page;
 	struct swap_info_struct *si = NULL;
 	rmap_t rmap_flags = RMAP_NONE;
@@ -4297,12 +4316,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 				 * undetectable as pte_same() returns true due
 				 * to entry reuse.
 				 */
+				swapcache_wq = &swapcache_wqs[hash_long(vmf->address & PMD_MASK,
+							SWAPCACHE_WAIT_TABLE_BITS)];
 				if (swapcache_prepare(entry, nr_pages)) {
 					/*
 					 * Relax a bit to prevent rapid
 					 * repeated page faults.
 					 */
+					add_wait_queue(swapcache_wq, &wait);
 					schedule_timeout_uninterruptible(1);
+					remove_wait_queue(swapcache_wq, &wait);
 					goto out_page;
 				}
 				need_clear_cache = true;
@@ -4609,8 +4632,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 out:
 	/* Clear the swap cache pin for direct swapin after PTL unlock */
-	if (need_clear_cache)
+	if (need_clear_cache) {
 		swapcache_clear(si, entry, nr_pages);
+		wake_up(swapcache_wq);
+	}
 	if (si)
 		put_swap_device(si);
 	return ret;
@@ -4625,8 +4650,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		folio_unlock(swapcache);
 		folio_put(swapcache);
 	}
-	if (need_clear_cache)
+	if (need_clear_cache) {
 		swapcache_clear(si, entry, nr_pages);
+		wake_up(swapcache_wq);
+	}
 	if (si)
 		put_swap_device(si);
 	return ret;
Kairui Song Oct. 2, 2024, 6:30 p.m. UTC | #7
On Wed, Oct 2, 2024 at 10:02 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Oct 2, 2024 at 8:43 AM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Barry Song <21cnbao@gmail.com> writes:
> >
> > > On Tue, Oct 1, 2024 at 7:43 AM Huang, Ying <ying.huang@intel.com> wrote:
> > >>
> > >> Barry Song <21cnbao@gmail.com> writes:
> > >>
> > >> > On Sun, Sep 29, 2024 at 3:43 PM Huang, Ying <ying.huang@intel.com> wrote:
> > >> >>
> > >> >> Hi, Barry,
> > >> >>
> > >> >> Barry Song <21cnbao@gmail.com> writes:
> > >> >>
> > >> >> > From: Barry Song <v-songbaohua@oppo.com>
> > >> >> >
> > >> >> > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > >> >> > introduced an unconditional one-tick sleep when `swapcache_prepare()`
> > >> >> > fails, which has led to reports of UI stuttering on latency-sensitive
> > >> >> > Android devices. To address this, we can use a waitqueue to wake up
> > >> >> > tasks that fail `swapcache_prepare()` sooner, instead of always
> > >> >> > sleeping for a full tick. While tasks may occasionally be woken by an
> > >> >> > unrelated `do_swap_page()`, this method is preferable to two scenarios:
> > >> >> > rapid re-entry into page faults, which can cause livelocks, and
> > >> >> > multiple millisecond sleeps, which visibly degrade user experience.
> > >> >>
> > >> >> In general, I think that this works.  Why not extend the solution to
> > >> >> cover schedule_timeout_uninterruptible() in __read_swap_cache_async()
> > >> >> too?  We can call wake_up() when we clear SWAP_HAS_CACHE.  To avoid
> > >> >
> > >> > Hi Ying,
> > >> > Thanks for your comments.
> > >> > I feel extending the solution to __read_swap_cache_async() should be done
> > >> > in a separate patch. On phones, I've never encountered any issues reported
> > >> > on that path, so it might be better suited for an optimization rather than a
> > >> > hotfix?

Hi Barry and Ying,

For the __read_swap_cache_async case, I'm not really against adding a
similar waitqueue, but if no one is really suffering from it, and if
the waitqueue does cause extra overhead, maybe we can ignore the
__read_swap_cache_async case for now, and I plan to resend the following
patch:
https://lore.kernel.org/linux-mm/20240326185032.72159-9-ryncsn@gmail.com/#r

It removed all of the schedule_timeout_uninterruptible() workarounds and
other similar things, and the performance will go even higher.

> > >>
> > >> Yes.  It's fine to do that in another patch as an optimization.
> > >
> > > Ok. I'll prepare a separate patch for optimizing that path.
> >
> > Thanks!
> >
> > >>
> > >> >> the overhead of calling wake_up() when there's no task waiting, we can use an
> > >> >> atomic to count waiting tasks.
> > >> >
> > >> > I'm not sure it's worth adding the complexity, as wake_up() on an empty
> > >> > waitqueue should have a very low cost on its own?
> > >>
> > >> wake_up() needs to call spin_lock_irqsave() unconditionally on a global
> > >> shared lock.  On systems with many CPUs (such as servers), this may cause
> > >> severe lock contention.  Even the cache ping-pong may hurt performance
> > >> significantly.
> > >
> > > I understand that cache synchronization was a significant issue before
> > > qspinlock, but it seems to be less of a concern after its implementation.
> >
> > Unfortunately, qspinlock cannot eliminate the cache ping-pong issue, as
> > discussed in the following thread.
> >
> > https://lore.kernel.org/lkml/20220510192708.GQ76023@worktop.programming.kicks-ass.net/
> >
> > > However, using a global atomic variable would still trigger cache broadcasts,
> > > correct?
> >
> > We can only change the atomic variable to non-zero when
> > swapcache_prepare() returns non-zero, and call wake_up() when the atomic
> > variable is non-zero.  Because swapcache_prepare() returns 0 most times,
> > the atomic variable is 0 most times.  If we don't change the value of
> > the atomic variable, cache ping-pong will not be triggered.
>
> yes. this can be implemented by adding another atomic variable.
>
> >
> > Hi, Kairui,
> >
> > Do you have some test cases to test parallel zram swap-in?  If so, that
> > can be used to verify whether cache ping-pong is an issue and whether it
> > can be fixed via a global atomic variable.
> >
>
> Yes, Kairui please run a test on your machine with lots of cores before
> and after adding a global atomic variable as suggested by Ying. I am
> sorry I don't have a server machine.

I just had a try with the kernel build test which I used for the
allocator patch series, with -j64 and a 1G memcg, on my local branch:

Without the patch:
2677.63user 9100.43system 3:33.15elapsed 5452%CPU (0avgtext+0avgdata
863284maxresident)k
2671.40user 8969.07system 3:33.67elapsed 5447%CPU (0avgtext+0avgdata
863316maxresident)k
2673.66user 8973.90system 3:33.18elapsed 5463%CPU (0avgtext+0avgdata
863284maxresident)k

With the patch:
2655.05user 9134.21system 3:35.63elapsed 5467%CPU (0avgtext+0avgdata
863288maxresident)k
2652.57user 9104.87system 3:35.07elapsed 5466%CPU (0avgtext+0avgdata
863272maxresident)k
2665.44user 9155.97system 3:35.92elapsed 5474%CPU (0avgtext+0avgdata
863316maxresident)k

These are only three test runs, and the main bottleneck for the test is
still some other locks (the list_lru lock, swap cgroup lock, etc.), but
it does show that performance is a bit lower. That could be considered a
trivial amount of overhead, so I think it's acceptable for the SYNC_IO
path.
Huang, Ying Oct. 3, 2024, 12:31 a.m. UTC | #8
Barry Song <21cnbao@gmail.com> writes:

> On Wed, Oct 2, 2024 at 8:43 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Tue, Oct 1, 2024 at 7:43 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Barry Song <21cnbao@gmail.com> writes:
>> >>
>> >> > On Sun, Sep 29, 2024 at 3:43 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Hi, Barry,
>> >> >>
>> >> >> Barry Song <21cnbao@gmail.com> writes:
>> >> >>
>> >> >> > From: Barry Song <v-songbaohua@oppo.com>
>> >> >> >
>> >> >> > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
>> >> >> > introduced an unconditional one-tick sleep when `swapcache_prepare()`
>> >> >> > fails, which has led to reports of UI stuttering on latency-sensitive
>> >> >> > Android devices. To address this, we can use a waitqueue to wake up
>> >> >> > tasks that fail `swapcache_prepare()` sooner, instead of always
>> >> >> > sleeping for a full tick. While tasks may occasionally be woken by an
>> >> >> > unrelated `do_swap_page()`, this method is preferable to two scenarios:
>> >> >> > rapid re-entry into page faults, which can cause livelocks, and
>> >> >> > multiple millisecond sleeps, which visibly degrade user experience.
>> >> >>
>> >> >> In general, I think that this works.  Why not extend the solution to
>> >> >> cover schedule_timeout_uninterruptible() in __read_swap_cache_async()
>> >> >> too?  We can call wake_up() when we clear SWAP_HAS_CACHE.  To avoid
>> >> >
>> >> > Hi Ying,
>> >> > Thanks for your comments.
>> >> > I feel extending the solution to __read_swap_cache_async() should be done
>> >> > in a separate patch. On phones, I've never encountered any issues reported
>> >> > on that path, so it might be better suited for an optimization rather than a
>> >> > hotfix?
>> >>
>> >> Yes.  It's fine to do that in another patch as an optimization.
>> >
>> > Ok. I'll prepare a separate patch for optimizing that path.
>>
>> Thanks!
>>
>> >>
>> >> >> the overhead of calling wake_up() when there's no task waiting, we can use an
>> >> >> atomic to count waiting tasks.
>> >> >
>> >> > I'm not sure it's worth adding the complexity, as wake_up() on an empty
>> >> > waitqueue should have a very low cost on its own?
>> >>
>> >> wake_up() needs to call spin_lock_irqsave() unconditionally on a global
>> >> shared lock.  On systems with many CPUs (such as servers), this may cause
>> >> severe lock contention.  Even the cache ping-pong may hurt performance
>> >> significantly.
>> >
>> > I understand that cache synchronization was a significant issue before
>> > qspinlock, but it seems to be less of a concern after its implementation.
>>
>> Unfortunately, qspinlock cannot eliminate the cache ping-pong issue, as
>> discussed in the following thread.
>>
>> https://lore.kernel.org/lkml/20220510192708.GQ76023@worktop.programming.kicks-ass.net/
>>
>> > However, using a global atomic variable would still trigger cache broadcasts,
>> > correct?
>>
>> We can only change the atomic variable to non-zero when
>> swapcache_prepare() returns non-zero, and call wake_up() when the atomic
>> variable is non-zero.  Because swapcache_prepare() returns 0 most times,
>> the atomic variable is 0 most times.  If we don't change the value of
>> the atomic variable, cache ping-pong will not be triggered.
>
> yes. this can be implemented by adding another atomic variable.

Just realized that we don't need another atomic variable for this;
just using waitqueue_active() before wake_up() should be enough.
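
A sketch of that check: wq_has_sleeper() in <linux/wait.h> bundles the
smp_mb() that lockless use of waitqueue_active() requires, and a missed
wakeup is in any case bounded by the existing one-tick timeout:

	swapcache_clear(si, entry, nr_pages);
	if (wq_has_sleeper(&swapcache_wq))	/* smp_mb() + waitqueue_active() */
		wake_up(&swapcache_wq);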

>>
>> Hi, Kairui,
>>
>> Do you have some test cases to test parallel zram swap-in?  If so, that
>> can be used to verify whether cache ping-pong is an issue and whether it
>> can be fixed via a global atomic variable.
>>
>
> Yes, Kairui please run a test on your machine with lots of cores before
> and after adding a global atomic variable as suggested by Ying. I am
> sorry I don't have a server machine.
>
> if it turns out you find cache ping-pong can be an issue, another
> approach would be a waitqueue hash:

Yes.  A waitqueue hash may help reduce lock contention.  And we can have
both waitqueue_active() and a waitqueue hash if necessary.  As the first
step, waitqueue_active() appears simpler.

> diff --git a/mm/memory.c b/mm/memory.c
> index 2366578015ad..aae0e532d8b6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4192,6 +4192,23 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> +/*
> + * Alleviating the 'thundering herd' phenomenon using a waitqueue hash
> + * when multiple do_swap_page() operations occur simultaneously.
> + */
> +#define SWAPCACHE_WAIT_TABLE_BITS 5
> +#define SWAPCACHE_WAIT_TABLE_SIZE (1 << SWAPCACHE_WAIT_TABLE_BITS)
> +static wait_queue_head_t swapcache_wqs[SWAPCACHE_WAIT_TABLE_SIZE];
> +
> +static int __init swapcache_wqs_init(void)
> +{
> +	for (int i = 0; i < SWAPCACHE_WAIT_TABLE_SIZE; i++)
> +		init_waitqueue_head(&swapcache_wqs[i]);
> +
> +        return 0;
> +}
> +late_initcall(swapcache_wqs_init);
> +
>  /*
>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>   * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -4204,6 +4221,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct folio *swapcache, *folio = NULL;
> +	DECLARE_WAITQUEUE(wait, current);
> +	wait_queue_head_t *swapcache_wq;
>  	struct page *page;
>  	struct swap_info_struct *si = NULL;
>  	rmap_t rmap_flags = RMAP_NONE;
> @@ -4297,12 +4316,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  				 * undetectable as pte_same() returns true due
>  				 * to entry reuse.
>  				 */
> +				swapcache_wq = &swapcache_wqs[hash_long(vmf->address & PMD_MASK,
> +							SWAPCACHE_WAIT_TABLE_BITS)];
>  				if (swapcache_prepare(entry, nr_pages)) {
>  					/*
>  					 * Relax a bit to prevent rapid
>  					 * repeated page faults.
>  					 */
> +					add_wait_queue(swapcache_wq, &wait);
>  					schedule_timeout_uninterruptible(1);
> +					remove_wait_queue(swapcache_wq, &wait);
>  					goto out_page;
>  				}
>  				need_clear_cache = true;
> @@ -4609,8 +4632,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
>  out:
>  	/* Clear the swap cache pin for direct swapin after PTL unlock */
> -	if (need_clear_cache)
> +	if (need_clear_cache) {
>  		swapcache_clear(si, entry, nr_pages);
> +		wake_up(swapcache_wq);
> +	}
>  	if (si)
>  		put_swap_device(si);
>  	return ret;
> @@ -4625,8 +4650,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  		folio_unlock(swapcache);
>  		folio_put(swapcache);
>  	}
> -	if (need_clear_cache)
> +	if (need_clear_cache) {
>  		swapcache_clear(si, entry, nr_pages);
> +		wake_up(swapcache_wq);
> +	}
>  	if (si)
>  		put_swap_device(si);
>  	return ret;

--
Best Regards,
Huang, Ying
Huang, Ying Oct. 3, 2024, 12:38 a.m. UTC | #9
Hi, Kairui,

Kairui Song <ryncsn@gmail.com> writes:

> On Wed, Oct 2, 2024 at 10:02 AM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Wed, Oct 2, 2024 at 8:43 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >
>> > Barry Song <21cnbao@gmail.com> writes:
>> >
>> > > On Tue, Oct 1, 2024 at 7:43 AM Huang, Ying <ying.huang@intel.com> wrote:
>> > >>
>> > >> Barry Song <21cnbao@gmail.com> writes:
>> > >>
>> > >> > On Sun, Sep 29, 2024 at 3:43 PM Huang, Ying <ying.huang@intel.com> wrote:
>> > >> >>
>> > >> >> Hi, Barry,
>> > >> >>
>> > >> >> Barry Song <21cnbao@gmail.com> writes:
>> > >> >>
>> > >> >> > From: Barry Song <v-songbaohua@oppo.com>
>> > >> >> >
>> > >> >> > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
>> > >> >> > introduced an unconditional one-tick sleep when `swapcache_prepare()`
>> > >> >> > fails, which has led to reports of UI stuttering on latency-sensitive
>> > >> >> > Android devices. To address this, we can use a waitqueue to wake up
>> > >> >> > tasks that fail `swapcache_prepare()` sooner, instead of always
>> > >> >> > sleeping for a full tick. While tasks may occasionally be woken by an
>> > >> >> > unrelated `do_swap_page()`, this method is preferable to two scenarios:
>> > >> >> > rapid re-entry into page faults, which can cause livelocks, and
>> > >> >> > multiple millisecond sleeps, which visibly degrade user experience.
>> > >> >>
>> > >> >> In general, I think that this works.  Why not extend the solution to
>> > >> >> cover schedule_timeout_uninterruptible() in __read_swap_cache_async()
>> > >> >> too?  We can call wake_up() when we clear SWAP_HAS_CACHE.  To avoid
>> > >> >
>> > >> > Hi Ying,
>> > >> > Thanks for your comments.
>> > >> > I feel extending the solution to __read_swap_cache_async() should be done
>> > >> > in a separate patch. On phones, I've never encountered any issues reported
>> > >> > on that path, so it might be better suited for an optimization rather than a
>> > >> > hotfix?
>
> Hi Barry and Ying,
>
> For the __read_swap_cache_async case, I'm not really against adding a
> similar waitqueue, but if no one is really suffering from it, and if
> the waitqueue does cause extra overhead, maybe we can ignore the
> __read_swap_cache_async case for now, and I plan to resend the following
> patch:
> https://lore.kernel.org/linux-mm/20240326185032.72159-9-ryncsn@gmail.com/#r
>
> It removed all of the schedule_timeout_uninterruptible() workarounds and
> other similar things, and the performance will go even higher.

Sounds good to me.  Please resend it.  It's more complex than Barry's
fix.  So, I suggest merging Barry's version first.

>> > >>
>> > >> Yes.  It's fine to do that in another patch as an optimization.
>> > >
>> > > Ok. I'll prepare a separate patch for optimizing that path.
>> >
>> > Thanks!
>> >
>> > >>
>> > >> >> the overhead of calling wake_up() when there's no task waiting, we can use an
>> > >> >> atomic to count waiting tasks.
>> > >> >
>> > >> > I'm not sure it's worth adding the complexity, as wake_up() on an empty
>> > >> > waitqueue should have a very low cost on its own?
>> > >>
>> > >> wake_up() needs to call spin_lock_irqsave() unconditionally on a global
>> > >> shared lock.  On systems with many CPUs (such as servers), this may cause
>> > >> severe lock contention.  Even the cache ping-pong may hurt performance
>> > >> significantly.
>> > >
>> > > I understand that cache synchronization was a significant issue before
>> > > qspinlock, but it seems to be less of a concern after its implementation.
>> >
>> > Unfortunately, qspinlock cannot eliminate the cache ping-pong issue, as
>> > discussed in the following thread.
>> >
>> > https://lore.kernel.org/lkml/20220510192708.GQ76023@worktop.programming.kicks-ass.net/
>> >
>> > > However, using a global atomic variable would still trigger cache broadcasts,
>> > > correct?
>> >
>> > We can only change the atomic variable to non-zero when
>> > swapcache_prepare() returns non-zero, and call wake_up() when the atomic
>> > variable is non-zero.  Because swapcache_prepare() returns 0 most times,
>> > the atomic variable is 0 most times.  If we don't change the value of
>> > the atomic variable, cache ping-pong will not be triggered.
>>
>> yes. this can be implemented by adding another atomic variable.
>>
>> >
>> > Hi, Kairui,
>> >
>> > Do you have some test cases to test parallel zram swap-in?  If so, that
>> > can be used to verify whether cache ping-pong is an issue and whether it
>> > can be fixed via a global atomic variable.
>> >
>>
>> Yes, Kairui please run a test on your machine with lots of cores before
>> and after adding a global atomic variable as suggested by Ying. I am
>> sorry I don't have a server machine.
>
> I just had a try with the kernel build test which I used for the
> allocator patch series, with -j64 and a 1G memcg, on my local branch:
>
> Without the patch:
> 2677.63user 9100.43system 3:33.15elapsed 5452%CPU (0avgtext+0avgdata
> 863284maxresident)k
> 2671.40user 8969.07system 3:33.67elapsed 5447%CPU (0avgtext+0avgdata
> 863316maxresident)k
> 2673.66user 8973.90system 3:33.18elapsed 5463%CPU (0avgtext+0avgdata
> 863284maxresident)k
>
> With the patch:
> 2655.05user 9134.21system 3:35.63elapsed 5467%CPU (0avgtext+0avgdata
> 863288maxresident)k
> 2652.57user 9104.87system 3:35.07elapsed 5466%CPU (0avgtext+0avgdata
> 863272maxresident)k
> 2665.44user 9155.97system 3:35.92elapsed 5474%CPU (0avgtext+0avgdata
> 863316maxresident)k
>
> These are only three test runs, and the main bottleneck for the test is
> still some other locks (the list_lru lock, swap cgroup lock, etc.), but
> it does show that performance is a bit lower. That could be considered a
> trivial amount of overhead, so I think it's acceptable for the SYNC_IO
> path.

Thanks!  The difference appears measurable, although small.  And, in some
use cases, multiple memcgs may be used, so the list_lru and swap cgroup
locks will be less contended.

--
Best Regards,
Huang, Ying
Chris Li Oct. 3, 2024, 10:22 p.m. UTC | #10
On Thu, Sep 26, 2024 at 2:20 PM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> introduced an unconditional one-tick sleep when `swapcache_prepare()`
> fails, which has led to reports of UI stuttering on latency-sensitive
> Android devices. To address this, we can use a waitqueue to wake up
> tasks that fail `swapcache_prepare()` sooner, instead of always
> sleeping for a full tick. While tasks may occasionally be woken by an
> unrelated `do_swap_page()`, this method is preferable to two scenarios:
> rapid re-entry into page faults, which can cause livelocks, and
> multiple millisecond sleeps, which visibly degrade user experience.
>
> Oven's testing shows that a single waitqueue resolves the UI
> stuttering issue. If a 'thundering herd' problem becomes apparent
> later, a waitqueue hash similar to `folio_wait_table[PAGE_WAIT_TABLE_SIZE]`
> for page bit locks can be introduced.
>
> Fixes: 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> Cc: Kairui Song <kasong@tencent.com>
> Cc: "Huang, Ying" <ying.huang@intel.com>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Chris Li <chrisl@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Yosry Ahmed <yosryahmed@google.com>
> Cc: SeongJae Park <sj@kernel.org>
> Cc: Kalesh Singh <kaleshsingh@google.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: <stable@vger.kernel.org>
> Reported-by: Oven Liyang <liyangouwen1@oppo.com>
> Tested-by: Oven Liyang <liyangouwen1@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  mm/memory.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 2366578015ad..6913174f7f41 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4192,6 +4192,8 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> +static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
> +
>  /*
>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>   * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -4204,6 +4206,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  {
>         struct vm_area_struct *vma = vmf->vma;
>         struct folio *swapcache, *folio = NULL;
> +       DECLARE_WAITQUEUE(wait, current);
>         struct page *page;
>         struct swap_info_struct *si = NULL;
>         rmap_t rmap_flags = RMAP_NONE;
> @@ -4302,7 +4305,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                                          * Relax a bit to prevent rapid
>                                          * repeated page faults.
>                                          */
> +                                       add_wait_queue(&swapcache_wq, &wait);
>                                         schedule_timeout_uninterruptible(1);
> +                                       remove_wait_queue(&swapcache_wq, &wait);

There is only one "swapcache_wq". If we don't care about the memory
overhead, ideally each swap entry that fails to grab the HAS_CACHE bit
would have its own wait queue. Currently, with all swap entries sharing
one wait queue, waiters will likely be woken up for other swap entries
(if any), only to find that the entry they care about hasn't been served
yet.

Another thing to consider is that, if we are using a wait queue, the
1ms timeout is not relevant any more. The sleep can be longer than 1ms,
since the task gets woken up by the wait queue anyway. Here you might
use an indefinite sleep to avoid the unnecessary wake-ups and the
complexity of the timer.
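
A rough sketch of that indefinite-sleep variant; swap_entry_still_pinned()
is a hypothetical re-check (for example, another swapcache_prepare()
attempt), which must happen after queueing so a wakeup cannot be lost:

	DEFINE_WAIT(wait);

	prepare_to_wait(&swapcache_wq, &wait, TASK_UNINTERRUPTIBLE);
	if (swap_entry_still_pinned(entry))	/* hypothetical */
		schedule();
	finish_wait(&swapcache_wq, &wait);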

>                                         goto out_page;
>                                 }
>                                 need_clear_cache = true;
> @@ -4609,8 +4614,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                 pte_unmap_unlock(vmf->pte, vmf->ptl);
>  out:
>         /* Clear the swap cache pin for direct swapin after PTL unlock */
> -       if (need_clear_cache)
> +       if (need_clear_cache) {
>                 swapcache_clear(si, entry, nr_pages);
> +               wake_up(&swapcache_wq);

Agree with Ying that here the common path will need to take a lock to
wake up the wait queue.

Chris


> +       }
>         if (si)
>                 put_swap_device(si);
>         return ret;
> @@ -4625,8 +4632,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                 folio_unlock(swapcache);
>                 folio_put(swapcache);
>         }
> -       if (need_clear_cache)
> +       if (need_clear_cache) {
>                 swapcache_clear(si, entry, nr_pages);
> +               wake_up(&swapcache_wq);
> +       }
>         if (si)
>                 put_swap_device(si);
>         return ret;
> --
> 2.34.1
>
Chris Li Oct. 3, 2024, 10:53 p.m. UTC | #11
On Tue, Oct 1, 2024 at 6:58 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Oct 2, 2024 at 8:43 AM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Barry Song <21cnbao@gmail.com> writes:
> >
> > > On Tue, Oct 1, 2024 at 7:43 AM Huang, Ying <ying.huang@intel.com> wrote:
> > >>
> > >> Barry Song <21cnbao@gmail.com> writes:
> > >>
> > >> > On Sun, Sep 29, 2024 at 3:43 PM Huang, Ying <ying.huang@intel.com> wrote:
> > >> >>
> > >> >> Hi, Barry,
> > >> >>
> > >> >> Barry Song <21cnbao@gmail.com> writes:
> > >> >>
> > >> >> > From: Barry Song <v-songbaohua@oppo.com>
> > >> >> >
> > >> >> > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > >> >> > introduced an unconditional one-tick sleep when `swapcache_prepare()`
> > >> >> > fails, which has led to reports of UI stuttering on latency-sensitive
> > >> >> > Android devices. To address this, we can use a waitqueue to wake up
> > >> >> > tasks that fail `swapcache_prepare()` sooner, instead of always
> > >> >> > sleeping for a full tick. While tasks may occasionally be woken by an
> > >> >> > unrelated `do_swap_page()`, this method is preferable to two scenarios:
> > >> >> > rapid re-entry into page faults, which can cause livelocks, and
> > >> >> > multiple millisecond sleeps, which visibly degrade user experience.
> > >> >>
> > >> >> In general, I think that this works.  Why not extend the solution to
> > >> >> cover schedule_timeout_uninterruptible() in __read_swap_cache_async()
> > >> >> too?  We can call wake_up() when we clear SWAP_HAS_CACHE.  To avoid
> > >> >
> > >> > Hi Ying,
> > >> > Thanks for your comments.
> > >> > I feel extending the solution to __read_swap_cache_async() should be done
> > >> > in a separate patch. On phones, I've never encountered any issues reported
> > >> > on that path, so it might be better suited for an optimization rather than a
> > >> > hotfix?
> > >>
> > >> Yes.  It's fine to do that in another patch as an optimization.
> > >
> > > Ok. I'll prepare a separate patch for optimizing that path.
> >
> > Thanks!
> >
> > >>
> > >> >> the overhead of calling wake_up() when there's no task waiting, we can use an
> > >> >> atomic to count waiting tasks.
> > >> >
> > >> > I'm not sure it's worth adding the complexity, as wake_up() on an empty
> > >> > waitqueue should have a very low cost on its own?
> > >>
> > >> wake_up() needs to call spin_lock_irqsave() unconditionally on a global
> > >> shared lock.  On systems with many CPUs (such as servers), this may cause
> > >> severe lock contention.  Even the cache ping-pong may hurt performance
> > >> significantly.
> > >
> > > I understand that cache synchronization was a significant issue before
> > > qspinlock, but it seems to be less of a concern after its implementation.
> >
> > Unfortunately, qspinlock cannot eliminate the cache ping-pong issue, as
> > discussed in the following thread.
> >
> > https://lore.kernel.org/lkml/20220510192708.GQ76023@worktop.programming.kicks-ass.net/
> >
> > > However, using a global atomic variable would still trigger cache broadcasts,
> > > correct?
> >
> > We can only change the atomic variable to non-zero when
> > swapcache_prepare() returns non-zero, and call wake_up() when the atomic
> > variable is non-zero.  Because swapcache_prepare() returns 0 most times,
> > the atomic variable is 0 most times.  If we don't change the value of
> > the atomic variable, cache ping-pong will not be triggered.
>
> yes. this can be implemented by adding another atomic variable.
>
> >
> > Hi, Kairui,
> >
> > Do you have some test cases to test parallel zram swap-in?  If so, that
> > can be used to verify whether cache ping-pong is an issue and whether it
> > can be fixed via a global atomic variable.
> >
>
> Yes, Kairui please run a test on your machine with lots of cores before
> and after adding a global atomic variable as suggested by Ying. I am
> sorry I don't have a server machine.
>
> if it turns out you find cache ping-pong can be an issue, another
> approach would be a waitqueue hash:
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 2366578015ad..aae0e532d8b6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4192,6 +4192,23 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> +/*
> + * Alleviating the 'thundering herd' phenomenon using a waitqueue hash
> + * when multiple do_swap_page() operations occur simultaneously.
> + */
> +#define SWAPCACHE_WAIT_TABLE_BITS 5
> +#define SWAPCACHE_WAIT_TABLE_SIZE (1 << SWAPCACHE_WAIT_TABLE_BITS)
> +static wait_queue_head_t swapcache_wqs[SWAPCACHE_WAIT_TABLE_SIZE];
> +
> +static int __init swapcache_wqs_init(void)
> +{
> +       for (int i = 0; i < SWAPCACHE_WAIT_TABLE_SIZE; i++)
> +               init_waitqueue_head(&swapcache_wqs[i]);
> +
> +        return 0;
> +}
> +late_initcall(swapcache_wqs_init);
> +
>  /*
>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>   * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -4204,6 +4221,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  {
>         struct vm_area_struct *vma = vmf->vma;
>         struct folio *swapcache, *folio = NULL;
> +       DECLARE_WAITQUEUE(wait, current);
> +       wait_queue_head_t *swapcache_wq;
>         struct page *page;
>         struct swap_info_struct *si = NULL;
>         rmap_t rmap_flags = RMAP_NONE;
> @@ -4297,12 +4316,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                                  * undetectable as pte_same() returns true due
>                                  * to entry reuse.
>                                  */
> +                               swapcache_wq = &swapcache_wqs[hash_long(vmf->address & PMD_MASK,
> +                                                       SWAPCACHE_WAIT_TABLE_BITS)];

It is better to hash against the swap entry value rather than the
fault address. The same swap entry can map to different parts of the
page table. I am not sure whether this is triggerable in the SYNC_IO
page fault path, but hashing against the swap entry is more obviously
correct.
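
For reference, a sketch of the entry-keyed version:

	swapcache_wq = &swapcache_wqs[hash_long(entry.val,
						SWAPCACHE_WAIT_TABLE_BITS)];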

Chris

>                                 if (swapcache_prepare(entry, nr_pages)) {
>                                         /*
>                                          * Relax a bit to prevent rapid
>                                          * repeated page faults.
>                                          */
> +                                       add_wait_queue(swapcache_wq, &wait);
>                                         schedule_timeout_uninterruptible(1);
> +                                       remove_wait_queue(swapcache_wq, &wait);
>                                         goto out_page;
>                                 }
>                                 need_clear_cache = true;
> @@ -4609,8 +4632,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                 pte_unmap_unlock(vmf->pte, vmf->ptl);
>  out:
>         /* Clear the swap cache pin for direct swapin after PTL unlock */
> -       if (need_clear_cache)
> +       if (need_clear_cache) {
>                 swapcache_clear(si, entry, nr_pages);
> +               wake_up(swapcache_wq);
> +       }
>         if (si)
>                 put_swap_device(si);
>         return ret;
> @@ -4625,8 +4650,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                 folio_unlock(swapcache);
>                 folio_put(swapcache);
>         }
> -       if (need_clear_cache)
> +       if (need_clear_cache) {
>                 swapcache_clear(si, entry, nr_pages);
> +               wake_up(swapcache_wq);
> +       }
>         if (si)
>                 put_swap_device(si);
>         return ret;
> --
> 2.34.1
>
> > --
> > Best Regards,
> > Huang, Ying
>
> Thanks
> Barry
Chris Li Oct. 3, 2024, 11:03 p.m. UTC | #12
On Wed, Oct 2, 2024 at 5:35 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Wed, Oct 2, 2024 at 8:43 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Tue, Oct 1, 2024 at 7:43 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >>
> >> >> > On Sun, Sep 29, 2024 at 3:43 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Hi, Barry,
> >> >> >>
> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >> >>
> >> >> >> > From: Barry Song <v-songbaohua@oppo.com>
> >> >> >> >
> >> >> >> > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> >> >> >> > introduced an unconditional one-tick sleep when `swapcache_prepare()`
> >> >> >> > fails, which has led to reports of UI stuttering on latency-sensitive
> >> >> >> > Android devices. To address this, we can use a waitqueue to wake up
> >> >> >> > tasks that fail `swapcache_prepare()` sooner, instead of always
> >> >> >> > sleeping for a full tick. While tasks may occasionally be woken by an
> >> >> >> > unrelated `do_swap_page()`, this method is preferable to two scenarios:
> >> >> >> > rapid re-entry into page faults, which can cause livelocks, and
> >> >> >> > multiple millisecond sleeps, which visibly degrade user experience.
> >> >> >>
> >> >> >> In general, I think that this works.  Why not extend the solution to
> >> >> >> cover schedule_timeout_uninterruptible() in __read_swap_cache_async()
> >> >> >> too?  We can call wake_up() when we clear SWAP_HAS_CACHE.  To avoid
> >> >> >
> >> >> > Hi Ying,
> >> >> > Thanks for your comments.
> >> >> > I feel extending the solution to __read_swap_cache_async() should be done
> >> >> > in a separate patch. On phones, I've never encountered any issues reported
> >> >> > on that path, so it might be better suited for an optimization rather than a
> >> >> > hotfix?
> >> >>
> >> >> Yes.  It's fine to do that in another patch as an optimization.
> >> >
> >> > Ok. I'll prepare a separate patch for optimizing that path.
> >>
> >> Thanks!
> >>
> >> >>
> >> >> >> the overhead of calling wake_up() when there's no task waiting, we can use an
> >> >> >> atomic to count waiting tasks.
> >> >> >
> >> >> > I'm not sure it's worth adding the complexity, as wake_up() on an empty
> >> >> > waitqueue should have a very low cost on its own?
> >> >>
> >> >> wake_up() needs to call spin_lock_irqsave() unconditionally on a global
> >> >> shared lock.  On systems with many CPUs (such as servers), this may cause
> >> >> severe lock contention.  Even the cache ping-pong may hurt performance
> >> >> significantly.
> >> >
> >> > I understand that cache synchronization was a significant issue before
> >> > qspinlock, but it seems to be less of a concern after its implementation.
> >>
> >> Unfortunately, qspinlock cannot eliminate the cache ping-pong issue, as
> >> discussed in the following thread.
> >>
> >> https://lore.kernel.org/lkml/20220510192708.GQ76023@worktop.programming.kicks-ass.net/
> >>
> >> > However, using a global atomic variable would still trigger cache broadcasts,
> >> > correct?
> >>
> >> We can only change the atomic variable to non-zero when
> >> swapcache_prepare() returns non-zero, and call wake_up() when the atomic
> >> variable is non-zero.  Because swapcache_prepare() returns 0 most times,
> >> the atomic variable is 0 most times.  If we don't change the value of
> >> the atomic variable, cache ping-pong will not be triggered.
> >
> > yes. this can be implemented by adding another atomic variable.
>
> Just realized that we don't need another atomic variable for this;
> just using waitqueue_active() before wake_up() should be enough.
>
> >>
> >> Hi, Kairui,
> >>
> >> Do you have some test cases to test parallel zram swap-in?  If so, that
> >> can be used to verify whether cache ping-pong is an issue and whether it
> >> can be fixed via a global atomic variable.
> >>
> >
> > Yes, Kairui please run a test on your machine with lots of cores before
> > and after adding a global atomic variable as suggested by Ying. I am
> > sorry I don't have a server machine.
> >
> > if it turns out you find cache ping-pong can be an issue, another
> > approach would be a waitqueue hash:
>
> Yes.  A waitqueue hash may help reduce lock contention.  And we can have
> both waitqueue_active() and a waitqueue hash if necessary.  As the first
> step, waitqueue_active() appears simpler.

Interesting. I just took a look at waitqueue_active(); it requires an
smp_mb() if used without holding the lock.
Quoting from the comment above waitqueue_active():
 * Also note that this 'optimization' trades a spin_lock() for an smp_mb(),
 * which (when the lock is uncontended) are of roughly equal cost.

Chris

>
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 2366578015ad..aae0e532d8b6 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4192,6 +4192,23 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >  }
> >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > +/*
> > + * Alleviating the 'thundering herd' phenomenon using a waitqueue hash
> > + * when multiple do_swap_page() operations occur simultaneously.
> > + */
> > +#define SWAPCACHE_WAIT_TABLE_BITS 5
> > +#define SWAPCACHE_WAIT_TABLE_SIZE (1 << SWAPCACHE_WAIT_TABLE_BITS)
> > +static wait_queue_head_t swapcache_wqs[SWAPCACHE_WAIT_TABLE_SIZE];
> > +
> > +static int __init swapcache_wqs_init(void)
> > +{
> > +     for (int i = 0; i < SWAPCACHE_WAIT_TABLE_SIZE; i++)
> > +             init_waitqueue_head(&swapcache_wqs[i]);
> > +
> > +        return 0;
> > +}
> > +late_initcall(swapcache_wqs_init);
> > +
> >  /*
> >   * We enter with non-exclusive mmap_lock (to exclude vma changes,
> >   * but allow concurrent faults), and pte mapped but not yet locked.
> > @@ -4204,6 +4221,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  {
> >       struct vm_area_struct *vma = vmf->vma;
> >       struct folio *swapcache, *folio = NULL;
> > +     DECLARE_WAITQUEUE(wait, current);
> > +     wait_queue_head_t *swapcache_wq;
> >       struct page *page;
> >       struct swap_info_struct *si = NULL;
> >       rmap_t rmap_flags = RMAP_NONE;
> > @@ -4297,12 +4316,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >                                * undetectable as pte_same() returns true due
> >                                * to entry reuse.
> >                                */
> > +                             swapcache_wq = &swapcache_wqs[hash_long(vmf->address & PMD_MASK,
> > +                                                     SWAPCACHE_WAIT_TABLE_BITS)];
> >                               if (swapcache_prepare(entry, nr_pages)) {
> >                                       /*
> >                                        * Relax a bit to prevent rapid
> >                                        * repeated page faults.
> >                                        */
> > +                                     add_wait_queue(swapcache_wq, &wait);
> >                                       schedule_timeout_uninterruptible(1);
> > +                                     remove_wait_queue(swapcache_wq, &wait);
> >                                       goto out_page;
> >                               }
> >                               need_clear_cache = true;
> > @@ -4609,8 +4632,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >               pte_unmap_unlock(vmf->pte, vmf->ptl);
> >  out:
> >       /* Clear the swap cache pin for direct swapin after PTL unlock */
> > -     if (need_clear_cache)
> > +     if (need_clear_cache) {
> >               swapcache_clear(si, entry, nr_pages);
> > +             wake_up(swapcache_wq);
> > +     }
> >       if (si)
> >               put_swap_device(si);
> >       return ret;
> > @@ -4625,8 +4650,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >               folio_unlock(swapcache);
> >               folio_put(swapcache);
> >       }
> > -     if (need_clear_cache)
> > +     if (need_clear_cache) {
> >               swapcache_clear(si, entry, nr_pages);
> > +             wake_up(swapcache_wq);
> > +     }
> >       if (si)
> >               put_swap_device(si);
> >       return ret;
>
> --
> Best Regards,
> Huang, Ying
Barry Song Oct. 4, 2024, 3:35 p.m. UTC | #13
On Fri, Oct 4, 2024 at 6:53 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Tue, Oct 1, 2024 at 6:58 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Wed, Oct 2, 2024 at 8:43 AM Huang, Ying <ying.huang@intel.com> wrote:
> > >
> > > Barry Song <21cnbao@gmail.com> writes:
> > >
> > > [snip]
> >
> > Yes, Kairui please run a test on your machine with lots of cores before
> > and after adding a global atomic variable as suggested by Ying. I am
> > sorry I don't have a server machine.
> >
> > if it turns out you find cache ping-pong can be an issue, another
> > approach would be a waitqueue hash:
> >
> > [snip]
> > @@ -4297,12 +4316,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >                                  * undetectable as pte_same() returns true due
> >                                  * to entry reuse.
> >                                  */
> > +                               swapcache_wq = &swapcache_wqs[hash_long(vmf->address & PMD_MASK,
> > +                                                       SWAPCACHE_WAIT_TABLE_BITS)];
>
> It is better to hash against the swap entry value rather than the
> fault address. The same swap entry can map to different parts of the
> page table. I am not sure this is triggerable in the SYNC_IO page
> fault path, but hashing against the swap entries is more obviously correct.
>

I am not convinced the swap entry offset is the correct key here.

1. do_swap_page() only handles anon pages, and an anon page cannot be
mapped at different virtual addresses on this path; shmem takes a
different code path entirely.

2. Considering the mTHP swap-in case, the aligned virtual address is
the only reliable value to hash. If we only had to consider small-folio
swap-in, using the swap entry value would be fine.
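
To make the contrast concrete, here is a minimal sketch of the two
candidate keys (hypothetical helpers for illustration only, not part
of the patch):

/*
 * Hashing the PMD-aligned fault address funnels every racing fault on
 * the same (m)THP region into a single queue; hashing individual swap
 * entries could scatter one nr_pages swap-in across several queues.
 */
static inline wait_queue_head_t *wq_by_address(struct vm_fault *vmf)
{
	return &swapcache_wqs[hash_long(vmf->address & PMD_MASK,
					SWAPCACHE_WAIT_TABLE_BITS)];
}

static inline wait_queue_head_t *wq_by_entry(swp_entry_t entry)
{
	return &swapcache_wqs[hash_long(swp_offset(entry),
					SWAPCACHE_WAIT_TABLE_BITS)];
}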

> Chris
>
> > [snip]

Thanks
Barry
Barry Song Oct. 4, 2024, 3:55 p.m. UTC | #14
On Fri, Oct 4, 2024 at 6:22 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Thu, Sep 26, 2024 at 2:20 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > [snip]
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 2366578015ad..6913174f7f41 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4192,6 +4192,8 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >  }
> >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > +static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
> > +
> >  /*
> >   * We enter with non-exclusive mmap_lock (to exclude vma changes,
> >   * but allow concurrent faults), and pte mapped but not yet locked.
> > @@ -4204,6 +4206,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  {
> >         struct vm_area_struct *vma = vmf->vma;
> >         struct folio *swapcache, *folio = NULL;
> > +       DECLARE_WAITQUEUE(wait, current);
> >         struct page *page;
> >         struct swap_info_struct *si = NULL;
> >         rmap_t rmap_flags = RMAP_NONE;
> > @@ -4302,7 +4305,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >                                          * Relax a bit to prevent rapid
> >                                          * repeated page faults.
> >                                          */
> > +                                       add_wait_queue(&swapcache_wq, &wait);
> >                                         schedule_timeout_uninterruptible(1);
> > +                                       remove_wait_queue(&swapcache_wq, &wait);
>
> There is only one "swapcache_wq"; if we didn't care about the memory
> overhead, there would ideally be one wait queue per swap entry that
> fails to grab the HAS_CACHE bit. Currently, with all swap entries
> sharing one wait queue, a task will likely be woken up for other swap
> entries (if any) only to find that the entry it cares about hasn't
> been served yet.
>

Even the page bit locks share waitqueues rather than having one per
page, and I believe that case has much more serious contention than
swap-in; the page bit locks rely on a waitqueue hash to reduce
unrelated wake-ups.

If a process is woken up by an unrelated do_swap_page() while its
swapcache pin has not yet been released, it will simply sleep again
after re-checking swapcache_prepare().

Too many unrelated wake-ups would amount to a 'thundering herd', but
not a livelock.
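
In other words, the loser's behaviour across repeated page faults is
roughly the following loop (a simplified sketch of the patch's control
flow, not literal kernel code; each iteration is really a fresh call
into do_swap_page()):

	while (swapcache_prepare(entry, nr_pages)) {
		/* the pin is held by a concurrent direct swapin */
		add_wait_queue(&swapcache_wq, &wait);
		/* woken early by any wake_up(), or after one tick at worst */
		schedule_timeout_uninterruptible(1);
		remove_wait_queue(&swapcache_wq, &wait);
		/* goto out_page: the fault returns and is simply re-taken */
	}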

> Another thing to consider is that, if we are using a wait queue, the
> 1ms is not relevant any more. The sleep can be longer than 1ms, since
> the task gets woken up by the wait queue anyway. Here you might use an
> indefinite sleep to avoid the unnecessary timer wakeup and the
> complexity of the timer.

I'm not quite sure what you mean by 1ms: on an embedded system we never
use HZ=1000; the typical/maximum HZ is 250, so a one-tick sleep is
1000/250 = 4ms. I'm also not quite sure what you mean by "indefinite
sleep": my understanding is that we can't poll for the result of
swapcache_prepare(), as the winning process that succeeded in
swapcache_prepare() will drop the swap slots.

>
> >                                         goto out_page;
> >                                 }
> >                                 need_clear_cache = true;
> > @@ -4609,8 +4614,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >                 pte_unmap_unlock(vmf->pte, vmf->ptl);
> >  out:
> >         /* Clear the swap cache pin for direct swapin after PTL unlock */
> > -       if (need_clear_cache)
> > +       if (need_clear_cache) {
> >                 swapcache_clear(si, entry, nr_pages);
> > +               wake_up(&swapcache_wq);
>
> Agree with Ying that here the common path will need to take a lock to
> wake up the wait queue.

waitqueue_active() might be a good candidate.

>
> Chris
>
>
> > +       }
> >         if (si)
> >                 put_swap_device(si);
> >         return ret;
> > @@ -4625,8 +4632,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >                 folio_unlock(swapcache);
> >                 folio_put(swapcache);
> >         }
> > -       if (need_clear_cache)
> > +       if (need_clear_cache) {
> >                 swapcache_clear(si, entry, nr_pages);
> > +               wake_up(&swapcache_wq);
> > +       }
> >         if (si)
> >                 put_swap_device(si);
> >         return ret;
> > --
> > 2.34.1
> >

Thanks
Barry
Barry Song Oct. 4, 2024, 4:03 p.m. UTC | #15
On Fri, Oct 4, 2024 at 7:03 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Wed, Oct 2, 2024 at 5:35 PM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Barry Song <21cnbao@gmail.com> writes:
> >
> > > [snip]
> > >
> > > Yes, this can be implemented by adding another atomic variable.
> >
> > Just realized that we don't need another atomic variable for this; just
> > using waitqueue_active() before wake_up() should be enough.
> >
> > > [snip]
> >
> > Yes.  waitqueue hash may help reduce lock contention.  And, we can have
> > both waitqueue_active() and waitqueue hash if necessary.  As the first
> > step, waitqueue_active() appears simpler.
>
> Interesting. Having just taken a look at waitqueue_active(), it requires
> an smp_mb() if used without holding the waitqueue lock.
> Quoting from the comment of waitqueue_active():
>  * Also note that this 'optimization' trades a spin_lock() for an smp_mb(),
>  * which (when the lock is uncontended) are of roughly equal cost.
>

Probably not a problem in our case, for two reasons:
1. we don't have a condition to check here;
2. a false positive/negative wake_up() won't cause a problem here.

On an embedded system we used to always sleep for at least 4ms; if we
can kill 99% of those sleeps, it is all good.

Ideally, we could combine the waitqueue hash and waitqueue_active(), but
Kairui's tests show that even if we do neither of the above, performance
is still acceptable. So we can probably keep things simple by just
adding an if (waitqueue_active()) check before wake_up().
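
A minimal sketch of why the lockless check is benign here (assuming the
waiter keeps the one-tick timeout, exactly as in the patch):

	/* waker: may miss a waiter that is enqueuing concurrently */
	swapcache_clear(si, entry, nr_pages);
	if (waitqueue_active(&swapcache_wq))
		wake_up(&swapcache_wq);

	/*
	 * waiter: a missed wakeup is bounded by the timeout, so the
	 * worst case is one full tick, which is what we had before.
	 */
	add_wait_queue(&swapcache_wq, &wait);
	schedule_timeout_uninterruptible(1);
	remove_wait_queue(&swapcache_wq, &wait);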

> Chris
>
> >
> > > [snip]

Thanks
Barry
Barry Song Oct. 8, 2024, 1:08 p.m. UTC | #16
On Thu, Oct 3, 2024 at 8:35 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > [snip]
>
> Yes.  waitqueue hash may help reduce lock contention.  And, we can have
> both waitqueue_active() and waitqueue hash if necessary.  As the first
> step, waitqueue_active() appears simpler.

Hi Andrew,
If there are no objections, can you please squash the below change? Oven
has already tested the change, and the original issue remains fixed with
it. If you want me to send v2 instead, please let me know.

From a5ca401da89f3b628c3a0147e54541d0968654b2 Mon Sep 17 00:00:00 2001
From: Barry Song <v-songbaohua@oppo.com>
Date: Tue, 8 Oct 2024 20:18:27 +0800
Subject: [PATCH] mm: wake_up only when swapcache_wq waitqueue is active

wake_up() will acquire the spinlock even when the waitqueue is empty.
This might involve cache sync overhead. Let's only call wake_up() when
the waitqueue is active.

Suggested-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/memory.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index fe21bd3beff5..4adb2d0bcc7a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4623,7 +4623,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	/* Clear the swap cache pin for direct swapin after PTL unlock */
 	if (need_clear_cache) {
 		swapcache_clear(si, entry, nr_pages);
-		wake_up(&swapcache_wq);
+		if (waitqueue_active(&swapcache_wq))
+			wake_up(&swapcache_wq);
 	}
 	if (si)
 		put_swap_device(si);
@@ -4641,7 +4642,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	}
 	if (need_clear_cache) {
 		swapcache_clear(si, entry, nr_pages);
-		wake_up(&swapcache_wq);
+		if (waitqueue_active(&swapcache_wq))
+			wake_up(&swapcache_wq);
 	}
 	if (si)
 		put_swap_device(si);
Huang, Ying Oct. 9, 2024, 12:51 a.m. UTC | #17
Barry Song <21cnbao@gmail.com> writes:

> [snip]
>
> Hi Andrew,
> If there are no objections, can you please squash the below change? Oven
> has already tested the change, and the original issue remains fixed with
> it. If you want me to send v2 instead, please let me know.
>
> [snip]

Hi, Kairui,

Do you have time to give this patch (combined with the previous patch
from Barry) a test to check whether the overhead introduced in the
previous patch has been eliminated?

--
Best Regards,
Huang, Ying
Kairui Song Oct. 22, 2024, 9:21 a.m. UTC | #18
On Wed, Oct 9, 2024 at 8:55 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> [snip]
>
> Hi, Kairui,
>
> Do you have time to give this patch (combined with the previous patch
> from Barry) a test to check whether the overhead introduced in the
> previous patch has been eliminated?

Hi Ying, Barry

I rebased on the mm tree and ran more tests with the latest patch:

Before the two patches:
make -j96 (64k): 33814.45 35061.25 35667.54 36618.30 37381.60 37678.75
make -j96: 20456.03 20460.36 20511.55 20584.76 20751.07 20780.79
make -j64: 7490.83 7515.55 7535.30 7544.81 7564.77 7583.41

After adding the waitqueue:
make -j96 (64k): 33190.60 35049.57 35732.01 36263.81 37154.05 37815.50
make -j96: 20373.27 20382.96 20428.78 20459.73 20534.59 20548.48
make -j64: 7469.18 7522.57 7527.38 7532.69 7543.36 7546.28

After adding the waitqueue with the waitqueue_active() check:
make -j96 (64k): 33321.03 35039.68 35552.86 36474.95 37502.76 37549.04
make -j96: 20601.39 20639.08 20692.81 20693.91 20701.35 20740.71
make -j64: 7538.63 7542.27 7564.86 7567.36 7594.14 7600.96

So I think the performance change is just noise level; it should be OK
either way.

>
> --
> Best Regards,
> Huang, Ying
>
Huang, Ying Oct. 23, 2024, 1:57 a.m. UTC | #19
Kairui Song <ryncsn@gmail.com> writes:

> On Wed, Oct 9, 2024 at 8:55 AM Huang, Ying <ying.huang@intel.com> wrote:
>> [snip]
>> Hi, Kairui,
>>
>> Do you have time to give this patch (combined with the previous patch
>> from Barry) a test to check whether the overhead introduced in the
>> previous patch has been eliminated?
>
> Hi Ying, Barry
>
> I rebased on the mm tree and ran more tests with the latest patch:
>
> Before the two patches:
> make -j96 (64k): 33814.45 35061.25 35667.54 36618.30 37381.60 37678.75
> make -j96: 20456.03 20460.36 20511.55 20584.76 20751.07 20780.79
> make -j64: 7490.83 7515.55 7535.30 7544.81 7564.77 7583.41
>
> After adding the waitqueue:
> make -j96 (64k): 33190.60 35049.57 35732.01 36263.81 37154.05 37815.50
> make -j96: 20373.27 20382.96 20428.78 20459.73 20534.59 20548.48
> make -j64: 7469.18 7522.57 7527.38 7532.69 7543.36 7546.28
>
> After adding the waitqueue with the waitqueue_active() check:
> make -j96 (64k): 33321.03 35039.68 35552.86 36474.95 37502.76 37549.04
> make -j96: 20601.39 20639.08 20692.81 20693.91 20701.35 20740.71
> make -j64: 7538.63 7542.27 7564.86 7567.36 7594.14 7600.96
>
> So I think the performance change is just noise level; it should be OK
> either way.

Thanks for your test results.  The bottlenecks are probably in other
places.

--
Best Regards,
Huang, Ying
Barry Song Oct. 23, 2024, 2:32 a.m. UTC | #20
On Wed, Oct 23, 2024 at 3:01 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > On Wed, Oct 9, 2024 at 8:55 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Thu, Oct 3, 2024 at 8:35 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >>
> >> >> > On Wed, Oct 2, 2024 at 8:43 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >> >>
> >> >> >> > On Tue, Oct 1, 2024 at 7:43 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >>
> >> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >> >> >>
> >> >> >> >> > On Sun, Sep 29, 2024 at 3:43 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >>
> >> >> >> >> >> Hi, Barry,
> >> >> >> >> >>
> >> >> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >> >> >> >>
> >> >> >> >> >> > From: Barry Song <v-songbaohua@oppo.com>
> >> >> >> >> >> >
> >> >> >> >> >> > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> >> >> >> >> >> > introduced an unconditional one-tick sleep when `swapcache_prepare()`
> >> >> >> >> >> > fails, which has led to reports of UI stuttering on latency-sensitive
> >> >> >> >> >> > Android devices. To address this, we can use a waitqueue to wake up
> >> >> >> >> >> > tasks that fail `swapcache_prepare()` sooner, instead of always
> >> >> >> >> >> > sleeping for a full tick. While tasks may occasionally be woken by an
> >> >> >> >> >> > unrelated `do_swap_page()`, this method is preferable to two scenarios:
> >> >> >> >> >> > rapid re-entry into page faults, which can cause livelocks, and
> >> >> >> >> >> > multiple millisecond sleeps, which visibly degrade user experience.
> >> >> >> >> >>
> >> >> >> >> >> In general, I think that this works.  Why not extend the solution to
> >> >> >> >> >> cover schedule_timeout_uninterruptible() in __read_swap_cache_async()
> >> >> >> >> >> too?  We can call wake_up() when we clear SWAP_HAS_CACHE.  To avoid
> >> >> >> >> >
> >> >> >> >> > Hi Ying,
> >> >> >> >> > Thanks for your comments.
> >> >> >> >> > I feel extending the solution to __read_swap_cache_async() should be done
> >> >> >> >> > in a separate patch. On phones, I've never encountered any issues reported
> >> >> >> >> > on that path, so it might be better suited for an optimization rather than a
> >> >> >> >> > hotfix?
> >> >> >> >>
> >> >> >> >> Yes.  It's fine to do that in another patch as optimization.
> >> >> >> >
> >> >> >> > Ok. I'll prepare a separate patch for optimizing that path.
> >> >> >>
> >> >> >> Thanks!
> >> >> >>
> >> >> >> >>
> >> >> >> >> >> overhead to call wake_up() when there's no task waiting, we can use an
> >> >> >> >> >> atomic to count waiting tasks.
> >> >> >> >> >
> >> >> >> >> > I'm not sure it's worth adding the complexity, as wake_up() on an empty
> >> >> >> >> > waitqueue should have a very low cost on its own?
> >> >> >> >>
> >> >> >> >> wake_up() needs to call spin_lock_irqsave() unconditionally on a global
> >> >> >> >> shared lock.  On systems with many CPUs (such as servers), this may
> >> >> >> >> cause severe lock contention.  Even the cache ping-pong alone may hurt
> >> >> >> >> performance considerably.
> >> >> >> >
> >> >> >> > I understand that cache synchronization was a significant issue before
> >> >> >> > qspinlock, but it seems to be less of a concern after its implementation.
> >> >> >>
> >> >> >> Unfortunately, qspinlock cannot eliminate the cache ping-pong issue, as
> >> >> >> discussed in the following thread.
> >> >> >>
> >> >> >> https://lore.kernel.org/lkml/20220510192708.GQ76023@worktop.programming.kicks-ass.net/
> >> >> >>
> >> >> >> > However, using a global atomic variable would still trigger cache broadcasts,
> >> >> >> > correct?
> >> >> >>
> >> >> >> We can change the atomic variable to non-zero only when
> >> >> >> swapcache_prepare() returns non-zero, and call wake_up() only when the
> >> >> >> atomic variable is non-zero.  Because swapcache_prepare() returns 0 most
> >> >> >> of the time, the atomic variable stays 0 most of the time.  If we don't
> >> >> >> change the value of the atomic variable, no cache ping-pong is triggered.
> >> >> >
> >> >> > yes. this can be implemented by adding another atomic variable.
> >> >>
> >> >> Just realized that we don't need another atomic variable for this;
> >> >> just using waitqueue_active() before wake_up() should be enough.
> >> >>
> >> >> >>
> >> >> >> Hi, Kairui,
> >> >> >>
> >> >> >> Do you have some test cases to test parallel zram swap-in?  If so, that
> >> >> >> can be used to verify whether cache ping-pong is an issue and whether it
> >> >> >> can be fixed via a global atomic variable.
> >> >> >>
> >> >> >
> >> >> > Yes, Kairui please run a test on your machine with lots of cores before
> >> >> > and after adding a global atomic variable as suggested by Ying. I am
> >> >> > sorry I don't have a server machine.
> >> >> >
> >> >> > if it turns out you find cache ping-pong can be an issue, another
> >> >> > approach would be a waitqueue hash:
> >> >>
> >> >> Yes.  waitqueue hash may help reduce lock contention.  And, we can have
> >> >> both waitqueue_active() and waitqueue hash if necessary.  As the first
> >> >> step, waitqueue_active() appears simpler.
> >> >
> >> > Hi Andrew,
> >> > If there are no objections, can you please squash the below change? Oven
> >> > has already tested the change and the original issue was still fixed with
> >> > it. If you want me to send v2 instead, please let me know.
> >> >
> >> > From a5ca401da89f3b628c3a0147e54541d0968654b2 Mon Sep 17 00:00:00 2001
> >> > From: Barry Song <v-songbaohua@oppo.com>
> >> > Date: Tue, 8 Oct 2024 20:18:27 +0800
> >> > Subject: [PATCH] mm: wake_up only when swapcache_wq waitqueue is active
> >> >
> >> > wake_up() will acquire the spinlock even when the waitqueue is empty.
> >> > This might involve cache synchronization overhead. Let's only call
> >> > wake_up() when the waitqueue is active.
> >> >
> >> > Suggested-by: "Huang, Ying" <ying.huang@intel.com>
> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >> > ---
> >> >  mm/memory.c | 6 ++++--
> >> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >> >
> >> > diff --git a/mm/memory.c b/mm/memory.c
> >> > index fe21bd3beff5..4adb2d0bcc7a 100644
> >> > --- a/mm/memory.c
> >> > +++ b/mm/memory.c
> >> > @@ -4623,7 +4623,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> >       /* Clear the swap cache pin for direct swapin after PTL unlock */
> >> >       if (need_clear_cache) {
> >> >               swapcache_clear(si, entry, nr_pages);
> >> > -             wake_up(&swapcache_wq);
> >> > +             if (waitqueue_active(&swapcache_wq))
> >> > +                     wake_up(&swapcache_wq);
> >> >       }
> >> >       if (si)
> >> >               put_swap_device(si);
> >> > @@ -4641,7 +4642,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> >       }
> >> >       if (need_clear_cache) {
> >> >               swapcache_clear(si, entry, nr_pages);
> >> > -             wake_up(&swapcache_wq);
> >> > +             if (waitqueue_active(&swapcache_wq))
> >> > +                     wake_up(&swapcache_wq);
> >> >       }
> >> >       if (si)
> >> >               put_swap_device(si);
> >>
> >> Hi, Kairui,
> >>
> >> Do you have time to give this patch (combined with the previous patch
> >> from Barry) a test to check whether the overhead introduced in the
> >> previous patch has been eliminated?
> >
> > Hi Ying, Barry
> >
> > I did a rebase on the mm tree and ran more tests with the latest patch:
> >
> > Before the two patches:
> > make -j96 (64k): 33814.45 35061.25 35667.54 36618.30 37381.60 37678.75
> > make -j96: 20456.03 20460.36 20511.55 20584.76 20751.07 20780.79
> > make -j64: 7490.83 7515.55 7535.30 7544.81 7564.77 7583.41
> >
> > After adding waitqueue:
> > make -j96 (64k): 33190.60 35049.57 35732.01 36263.81 37154.05 37815.50
> > make -j96: 20373.27 20382.96 20428.78 20459.73 20534.59 20548.48
> > make -j64: 7469.18 7522.57 7527.38 7532.69 7543.36 7546.28
> >
> > After adding waitqueue with waitqueue_active() check:
> > make -j96 (64k): 33321.03 35039.68 35552.86 36474.95 37502.76 37549.04
> > make -j96: 20601.39 20639.08 20692.81 20693.91 20701.35 20740.71
> > make -j64: 7538.63 7542.27 7564.86 7567.36 7594.14 7600.96
> >
> > So I think it's just a noise-level performance change; it should be
> > OK either way.

Thanks for Kairui's testing.

>
> Thanks for your test results.  There should be bottlenecks in other
> places.

Exactly. I’d expect cache ping-pong to become noticeable only when
the spinlock is highly contended, such as when many threads
simultaneously follow the pattern below:

spin_lock
short-time operations
spin_unlock

But we’re likely dealing with a different pattern, as shown below:

long-time operations

spin_lock
short-time operations
spin_unlock

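Put concretely (an illustrative sketch, not code from this series; the
lock, counter, and work function are placeholders), the second pattern
is the do_swap_page() case: the spinlock inside wake_up() is reached
only after the comparatively long swap-in work, so arrivals at the
lock are naturally spread out in time.

/* Pattern A: lock-dominated loop -- the lock's cache line ping-pongs. */
for (;;) {
	spin_lock(&lock);
	counter++;			/* short critical section */
	spin_unlock(&lock);
}

/* Pattern B: work-dominated loop -- the do_swap_page() case. */
for (;;) {
	long_swapin_work();		/* folio alloc, I/O, decompression */
	spin_lock(&lock);		/* e.g. the waitqueue lock in wake_up() */
	counter++;			/* short critical section */
	spin_unlock(&lock);
}
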
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry
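
As a side note on the waitqueue-hash fallback floated above, here is a
minimal sketch of what a folio_wait_table-style hash for this path
might look like. Every name below is illustrative, not from any posted
patch:

#include <linux/hash.h>
#include <linux/init.h>
#include <linux/swap.h>
#include <linux/wait.h>

#define SWAPCACHE_WAIT_TABLE_BITS	5
#define SWAPCACHE_WAIT_TABLE_SIZE	(1 << SWAPCACHE_WAIT_TABLE_BITS)

static wait_queue_head_t swapcache_wqs[SWAPCACHE_WAIT_TABLE_SIZE];

/*
 * Hash the swap entry so that only faults whose entries collide in
 * the table share a queue (and thus a lock and spurious wakeups).
 */
static wait_queue_head_t *swapcache_wq_for(swp_entry_t entry)
{
	return &swapcache_wqs[hash_long(entry.val,
					SWAPCACHE_WAIT_TABLE_BITS)];
}

static int __init swapcache_wq_init(void)
{
	int i;

	for (i = 0; i < SWAPCACHE_WAIT_TABLE_SIZE; i++)
		init_waitqueue_head(&swapcache_wqs[i]);
	return 0;
}
early_initcall(swapcache_wq_init);	/* illustrative init hook */

Waiters and wakers for the same entry hash to the same head, so an
unrelated wakeup can still happen on a hash collision, but its blast
radius shrinks from the whole queue to a single bucket.
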
Patch

diff --git a/mm/memory.c b/mm/memory.c
index 2366578015ad..6913174f7f41 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4192,6 +4192,8 @@  static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -4204,6 +4206,7 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *swapcache, *folio = NULL;
+	DECLARE_WAITQUEUE(wait, current);
 	struct page *page;
 	struct swap_info_struct *si = NULL;
 	rmap_t rmap_flags = RMAP_NONE;
@@ -4302,7 +4305,9 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 					 * Relax a bit to prevent rapid
 					 * repeated page faults.
 					 */
+					add_wait_queue(&swapcache_wq, &wait);
 					schedule_timeout_uninterruptible(1);
+					remove_wait_queue(&swapcache_wq, &wait);
 					goto out_page;
 				}
 				need_clear_cache = true;
@@ -4609,8 +4614,10 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 out:
 	/* Clear the swap cache pin for direct swapin after PTL unlock */
-	if (need_clear_cache)
+	if (need_clear_cache) {
 		swapcache_clear(si, entry, nr_pages);
+		wake_up(&swapcache_wq);
+	}
 	if (si)
 		put_swap_device(si);
 	return ret;
@@ -4625,8 +4632,10 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 		folio_unlock(swapcache);
 		folio_put(swapcache);
 	}
-	if (need_clear_cache)
+	if (need_clear_cache) {
 		swapcache_clear(si, entry, nr_pages);
+		wake_up(&swapcache_wq);
+	}
 	if (si)
 		put_swap_device(si);
 	return ret;
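
As discussed in the thread, extending the same scheme to the one-tick
sleep in __read_swap_cache_async() is deferred to a separate patch. A
minimal sketch of that follow-up, assuming swapcache_wq were made
visible outside mm/memory.c (the patch above keeps it static), with
the wait side on the swapcache_prepare() retry path:

	DECLARE_WAITQUEUE(wait, current);

	add_wait_queue(&swapcache_wq, &wait);
	schedule_timeout_uninterruptible(1);
	remove_wait_queue(&swapcache_wq, &wait);

and, per the suggestion above, a matching guarded wakeup wherever
SWAP_HAS_CACHE is cleared:

	if (waitqueue_active(&swapcache_wq))
		wake_up(&swapcache_wq);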