mm: swap: async free swap slot cache entries

Message ID	20231221-async-free-v1-1-94b277992cb0@kernel.org (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: Chris Li <chrisl@kernel.org> Date: Thu, 21 Dec 2023 22:25:39 -0800 Subject: [PATCH] mm: swap: async free swap slot cache entries MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 8bit Message-Id: <20231221-async-free-v1-1-94b277992cb0@kernel.org> To: Andrew Morton <akpm@linux-foundation.org> Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, =?utf-8?b?V2VpIFh177+8?= <weixugc@google.com>, =?utf-8?q?Yu_Zhao=EF=BF=BC?= <yuzhao@google.com>, Greg Thelen <gthelen@google.com>, Chun-Tse Shao <ctshao@google.com>, =?utf-8?q?Suren_Baghdasaryan=EF=BF=BC?= <surenb@google.com>, =?utf-8?q?Yosr?= =?utf-8?q?y_Ahmed=EF=BF=BC?= <yosryahmed@google.com>, Brain Geffon <bgeffon@google.com>, Minchan Kim <minchan@kernel.org>, Michal Hocko <mhocko@suse.com>, Mel Gorman <mgorman@techsingularity.net>, Huang Ying <ying.huang@intel.com>, Nhat Pham <nphamcs@gmail.com>, Johannes Weiner <hannes@cmpxchg.org>, Kairui Song <kasong@tencent.com>, Zhongkun He <hezhongkun.hzk@bytedance.com>, Kemeng Shi <shikemeng@huaweicloud.com>, Barry Song <v-songbaohua@oppo.com>, Chris Li <chrisl@kernel.org> Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm: swap: async free swap slot cache entries \| expand mm: swap: async free swap slot cache entries

Chris Li Dec. 22, 2023, 6:25 a.m. UTC

We discovered that 1% swap page fault is 100us+ while 50% of
the swap fault is under 20us.

Further investigation show that a large portion of the time
spent in the free_swap_slots() function for the long tail case.

The percpu cache of swap slots is freed in a batch of 64 entries
inside free_swap_slots(). These cache entries are accumulated
from previous page faults, which may not be related to the current
process.

Doing the batch free in the page fault handler causes longer
tail latencies and penalizes the current process.

Move free_swap_slots() outside of the swapin page fault handler into an
async work queue to avoid such long tail latencies.

Testing:

Chun-Tse did some benchmark in chromebook, showing that
zram_wait_metrics improve about 15% with 80% and 95% confidence.

I recently ran some experiments on about 1000 Google production
machines. It shows swapin latency drops in the long tail
100us - 500us bucket dramatically.

platform	(100-500us)	 	(0-100us)
A		1.12% -> 0.36%		98.47% -> 99.22%
B		0.65% -> 0.15%		98.96% -> 99.46%
C		0.61% -> 0.23%		98.96% -> 99.38%

Signed-off-by: Chris Li <chrisl@kernel.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Wei Xu <weixugc@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Chun-Tse Shao <ctshao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Brain Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: Zhongkun He <hezhongkun.hzk@bytedance.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Barry Song <v-songbaohua@oppo.com>
---
 include/linux/swap_slots.h |  1 +
 mm/swap_slots.c            | 37 +++++++++++++++++++++++++++++--------
 2 files changed, 30 insertions(+), 8 deletions(-)


---
base-commit: eacce8189e28717da6f44ee492b7404c636ae0de
change-id: 20231216-async-free-bef392015432

Best regards,

Andrew Morton Dec. 22, 2023, 7:52 p.m. UTC | #1

On Thu, 21 Dec 2023 22:25:39 -0800 Chris Li <chrisl@kernel.org> wrote:

> We discovered that 1% swap page fault is 100us+ while 50% of
> the swap fault is under 20us.
> 
> Further investigation show that a large portion of the time
> spent in the free_swap_slots() function for the long tail case.
> 
> The percpu cache of swap slots is freed in a batch of 64 entries
> inside free_swap_slots(). These cache entries are accumulated
> from previous page faults, which may not be related to the current
> process.
> 
> Doing the batch free in the page fault handler causes longer
> tail latencies and penalizes the current process.
> 
> Move free_swap_slots() outside of the swapin page fault handler into an
> async work queue to avoid such long tail latencies.

This will require a larger amount of total work than the current
scheme.  So we're trading that off against better latency.

Why is this a good tradeoff?

Chris Li Dec. 22, 2023, 11:16 p.m. UTC | #2

On Fri, Dec 22, 2023 at 11:52:08AM -0800, Andrew Morton wrote:
> On Thu, 21 Dec 2023 22:25:39 -0800 Chris Li <chrisl@kernel.org> wrote:
> 
> > We discovered that 1% swap page fault is 100us+ while 50% of
> > the swap fault is under 20us.
> > 
> > Further investigation show that a large portion of the time
> > spent in the free_swap_slots() function for the long tail case.
> > 
> > The percpu cache of swap slots is freed in a batch of 64 entries
> > inside free_swap_slots(). These cache entries are accumulated
> > from previous page faults, which may not be related to the current
> > process.
> > 
> > Doing the batch free in the page fault handler causes longer
> > tail latencies and penalizes the current process.
> > 
> > Move free_swap_slots() outside of the swapin page fault handler into an
> > async work queue to avoid such long tail latencies.
> 
> This will require a larger amount of total work than the current

Yes, there will be a tiny little bit of extra overhead to schedule the job
on to the other work queue.

> scheme.  So we're trading that off against better latency.
> 
> Why is this a good tradeoff?

That is a very good question. Both Hugh and Wei had asked me similar questions
before. +Hugh.

The TL;DR is that it makes the swap more palleralizedable.

Because morden computers typically have more than one CPU and the CPU utilization
is rarely reached to 100%. We are actually not trading the latency for some one
run slower. Most of the time the real impact is that the current swapin page fault
can return quicker so more work can submit to the kernel sooner, at the same time
the other idle CPU can pick up the non latency critical work of freeing of the
swap slot cache entries. The net effect is that we speed things up and increase
the overall system utilization rather than slow things down.

The test result of chromebook and Google production server should be able to show
that it is beneficial to both laptop and server workloads, making them more responsive
in swap related workload.

Chris

Nhat Pham Dec. 23, 2023, 1:44 a.m. UTC | #3

On Thu, Dec 21, 2023 at 10:25 PM Chris Li <chrisl@kernel.org> wrote:
>
> We discovered that 1% swap page fault is 100us+ while 50% of
> the swap fault is under 20us.
>
> Further investigation show that a large portion of the time
> spent in the free_swap_slots() function for the long tail case.
>
> The percpu cache of swap slots is freed in a batch of 64 entries
> inside free_swap_slots(). These cache entries are accumulated
> from previous page faults, which may not be related to the current
> process.
>
> Doing the batch free in the page fault handler causes longer
> tail latencies and penalizes the current process.
>
> Move free_swap_slots() outside of the swapin page fault handler into an
> async work queue to avoid such long tail latencies.
>
> Testing:
>
> Chun-Tse did some benchmark in chromebook, showing that
> zram_wait_metrics improve about 15% with 80% and 95% confidence.
>
> I recently ran some experiments on about 1000 Google production
> machines. It shows swapin latency drops in the long tail
> 100us - 500us bucket dramatically.
>
> platform        (100-500us)             (0-100us)
> A               1.12% -> 0.36%          98.47% -> 99.22%
> B               0.65% -> 0.15%          98.96% -> 99.46%
> C               0.61% -> 0.23%          98.96% -> 99.38%

Nice! Are these values for zram as well, or ordinary (SSD?) swap? I
imagine it will matter less for swap, right?

>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> To: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: Wei Xu <weixugc@google.com>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Greg Thelen <gthelen@google.com>
> Cc: Chun-Tse Shao <ctshao@google.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Yosry Ahmed <yosryahmed@google.com>
> Cc: Brain Geffon <bgeffon@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Nhat Pham <nphamcs@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Kairui Song <kasong@tencent.com>
> Cc: Zhongkun He <hezhongkun.hzk@bytedance.com>
> Cc: Kemeng Shi <shikemeng@huaweicloud.com>
> Cc: Barry Song <v-songbaohua@oppo.com>
> ---
>  include/linux/swap_slots.h |  1 +
>  mm/swap_slots.c            | 37 +++++++++++++++++++++++++++++--------
>  2 files changed, 30 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
> index 15adfb8c813a..67bc8fa30d63 100644
> --- a/include/linux/swap_slots.h
> +++ b/include/linux/swap_slots.h
> @@ -19,6 +19,7 @@ struct swap_slots_cache {
>         spinlock_t      free_lock;  /* protects slots_ret, n_ret */
>         swp_entry_t     *slots_ret;
>         int             n_ret;
> +       struct work_struct async_free;
>  };
>
>  void disable_swap_slots_cache_lock(void);
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index 0bec1f705f8e..a3b306550732 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -42,8 +42,10 @@ static bool  swap_slot_cache_initialized;
>  static DEFINE_MUTEX(swap_slots_cache_mutex);
>  /* Serialize swap slots cache enable/disable operations */
>  static DEFINE_MUTEX(swap_slots_cache_enable_mutex);
> +static struct workqueue_struct *swap_free_queue;
>
>  static void __drain_swap_slots_cache(unsigned int type);
> +static void swapcache_async_free_entries(struct work_struct *data);
>
>  #define use_swap_slot_cache (swap_slot_cache_active && swap_slot_cache_enabled)
>  #define SLOTS_CACHE 0x1
> @@ -149,6 +151,7 @@ static int alloc_swap_slot_cache(unsigned int cpu)
>                 spin_lock_init(&cache->free_lock);
>                 cache->lock_initialized = true;
>         }
> +       INIT_WORK(&cache->async_free, swapcache_async_free_entries);
>         cache->nr = 0;
>         cache->cur = 0;
>         cache->n_ret = 0;
> @@ -269,6 +272,20 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
>         return cache->nr;
>  }
>
> +static void swapcache_async_free_entries(struct work_struct *data)
> +{
> +       struct swap_slots_cache *cache;
> +
> +       cache = container_of(data, struct swap_slots_cache, async_free);
> +       spin_lock_irq(&cache->free_lock);
> +       /* Swap slots cache may be deactivated before acquiring lock */
> +       if (cache->slots_ret) {
> +               swapcache_free_entries(cache->slots_ret, cache->n_ret);
> +               cache->n_ret = 0;
> +       }
> +       spin_unlock_irq(&cache->free_lock);
> +}
> +
>  void free_swap_slot(swp_entry_t entry)
>  {
>         struct swap_slots_cache *cache;
> @@ -282,17 +299,14 @@ void free_swap_slot(swp_entry_t entry)
>                         goto direct_free;
>                 }
>                 if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE) {
> -                       /*
> -                        * Return slots to global pool.
> -                        * The current swap_map value is SWAP_HAS_CACHE.
> -                        * Set it to 0 to indicate it is available for
> -                        * allocation in global pool
> -                        */
> -                       swapcache_free_entries(cache->slots_ret, cache->n_ret);
> -                       cache->n_ret = 0;
> +                       spin_unlock_irq(&cache->free_lock);
> +                       queue_work(swap_free_queue, &cache->async_free);
> +                       goto direct_free;
>                 }
>                 cache->slots_ret[cache->n_ret++] = entry;
>                 spin_unlock_irq(&cache->free_lock);
> +               if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE)
> +                       queue_work(swap_free_queue, &cache->async_free);
>         } else {
>  direct_free:
>                 swapcache_free_entries(&entry, 1);
> @@ -348,3 +362,10 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
>         }
>         return entry;
>  }
> +
> +static int __init async_queue_init(void)
> +{
> +       swap_free_queue = create_workqueue("async swap cache");

nit(?): isn't create_workqueue() deprecated? from:

https://www.kernel.org/doc/html/latest/core-api/workqueue.html#application-programming-interface-api

I think there's a zswap patch proposing fixing that on the zswap side.

> +       return 0;
> +}
> +subsys_initcall(async_queue_init);
>
> ---
> base-commit: eacce8189e28717da6f44ee492b7404c636ae0de
> change-id: 20231216-async-free-bef392015432
>
> Best regards,
> --
> Chris Li <chrisl@kernel.org>
>

Chris Li Dec. 23, 2023, 4:41 a.m. UTC | #4

On Fri, Dec 22, 2023 at 05:44:19PM -0800, Nhat Pham wrote:
> On Thu, Dec 21, 2023 at 10:25 PM Chris Li <chrisl@kernel.org> wrote:
> >
> > We discovered that 1% swap page fault is 100us+ while 50% of
> > the swap fault is under 20us.
> >
> > Further investigation show that a large portion of the time
> > spent in the free_swap_slots() function for the long tail case.
> >
> > The percpu cache of swap slots is freed in a batch of 64 entries
> > inside free_swap_slots(). These cache entries are accumulated
> > from previous page faults, which may not be related to the current
> > process.
> >
> > Doing the batch free in the page fault handler causes longer
> > tail latencies and penalizes the current process.
> >
> > Move free_swap_slots() outside of the swapin page fault handler into an
> > async work queue to avoid such long tail latencies.
> >
> > Testing:
> >
> > Chun-Tse did some benchmark in chromebook, showing that
> > zram_wait_metrics improve about 15% with 80% and 95% confidence.

This benchmark result is using zram. There are 3 micro benchmarks of
all showing about 15% improvement with a slightly different confidence
level.  That is where the 80%-90% come from.

> >
> > I recently ran some experiments on about 1000 Google production
> > machines. It shows swapin latency drops in the long tail
> > 100us - 500us bucket dramatically.
> >
> > platform        (100-500us)             (0-100us)
> > A               1.12% -> 0.36%          98.47% -> 99.22%
> > B               0.65% -> 0.15%          98.96% -> 99.46%
> > C               0.61% -> 0.23%          98.96% -> 99.38%
> 
> Nice! Are these values for zram as well, or ordinary (SSD?) swap? I
> imagine it will matter less for swap, right?

Those production servers only use zswap. There is no zram there.
For ordinary SSD swap the latency reduction is also there in terms
of absolute us. However the raw savings get shadowed by the SSD IO
latency, typically in the 100us range. In terms of percentage,
you don't have as dramatica an effect compared to the memory
compression based swapping(zswap and zram).

> > @@ -348,3 +362,10 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
> >         }
> >         return entry;
> >  }
> > +
> > +static int __init async_queue_init(void)
> > +{
> > +       swap_free_queue = create_workqueue("async swap cache");
> 
> nit(?): isn't create_workqueue() deprecated? from:
> 
> https://www.kernel.org/doc/html/latest/core-api/workqueue.html#application-programming-interface-api
> 
> I think there's a zswap patch proposing fixing that on the zswap side.
> 


Yes, I recall I saw that patch. I might acked on it as well.
Very good catch. I will fix it in the V2 spin.
Meanwhile, I will wait on it a bit to collect the other review
feedback.

Thans for catching that.

Chris

David Rientjes Dec. 23, 2023, 6:11 a.m. UTC | #5

On Fri, 22 Dec 2023, Chris Li wrote:

> On Fri, Dec 22, 2023 at 11:52:08AM -0800, Andrew Morton wrote:
> > On Thu, 21 Dec 2023 22:25:39 -0800 Chris Li <chrisl@kernel.org> wrote:
> > 
> > > We discovered that 1% swap page fault is 100us+ while 50% of
> > > the swap fault is under 20us.
> > > 
> > > Further investigation show that a large portion of the time
> > > spent in the free_swap_slots() function for the long tail case.
> > > 
> > > The percpu cache of swap slots is freed in a batch of 64 entries
> > > inside free_swap_slots(). These cache entries are accumulated
> > > from previous page faults, which may not be related to the current
> > > process.
> > > 
> > > Doing the batch free in the page fault handler causes longer
> > > tail latencies and penalizes the current process.
> > > 
> > > Move free_swap_slots() outside of the swapin page fault handler into an
> > > async work queue to avoid such long tail latencies.
> > 
> > This will require a larger amount of total work than the current
> 
> Yes, there will be a tiny little bit of extra overhead to schedule the job
> on to the other work queue.
> 

How do you quantify the impact of the delayed swap_entry_free()?

Since the free and memcg uncharge are now delayed, is there not the 
possibility that we stay under memory pressure for longer?  (Assuming at 
least some users are swapping because of memory pressure.)

I would assume that since the free and uncharge itself is delayed that in 
the pathological case we'd actually be swapping *more* until the async 
worker can run.

Chris Li Dec. 23, 2023, 4:51 p.m. UTC | #6

Hi David,

On Fri, Dec 22, 2023 at 10:11 PM David Rientjes <rientjes@google.com> wrote:
>
> How do you quantify the impact of the delayed swap_entry_free()?
>
> Since the free and memcg uncharge are now delayed, is there not the
> possibility that we stay under memory pressure for longer?  (Assuming at
> least some users are swapping because of memory pressure.)
>
> I would assume that since the free and uncharge itself is delayed that in
> the pathological case we'd actually be swapping *more* until the async
> worker can run.

Thanks for raising this interesting question.

First of all, the swap_entry_free() does not impact "memory.current".
It reduces "memory.swap.current". Technically it is the swap pressure
not memory pressure that suffers the extra delay.

Secondly, we are talking about delaying up to 64 swap entries for a
few microseconds. Where the swap slot cache itself delays the freeing
of the entries for an arbitrary amount of time. It is not freed until
the cache is full of 64 entries. This delay can be seconds or even
minutes. Adding a few microseconds of  extra delay to existing seconds
delay really makes no difference from the swap pressure point of view.

Chris

David Rientjes Dec. 24, 2023, 3:01 a.m. UTC | #7

On Sat, 23 Dec 2023, Chris Li wrote:

> > How do you quantify the impact of the delayed swap_entry_free()?
> >
> > Since the free and memcg uncharge are now delayed, is there not the
> > possibility that we stay under memory pressure for longer?  (Assuming at
> > least some users are swapping because of memory pressure.)
> >
> > I would assume that since the free and uncharge itself is delayed that in
> > the pathological case we'd actually be swapping *more* until the async
> > worker can run.
> 
> Thanks for raising this interesting question.
> 
> First of all, the swap_entry_free() does not impact "memory.current".
> It reduces "memory.swap.current". Technically it is the swap pressure
> not memory pressure that suffers the extra delay.
> 
> Secondly, we are talking about delaying up to 64 swap entries for a
> few microseconds.

What guarantees that the async freeing happens within a few microseconds?

> Where the swap slot cache itself delays the freeing
> of the entries for an arbitrary amount of time. It is not freed until
> the cache is full of 64 entries. This delay can be seconds or even
> minutes. Adding a few microseconds of  extra delay to existing seconds
> delay really makes no difference from the swap pressure point of view.
> 
> Chris
>

Chris Li Dec. 24, 2023, 6:15 p.m. UTC | #8

On Sat, Dec 23, 2023 at 7:01 PM David Rientjes <rientjes@google.com> wrote:
>
> On Sat, 23 Dec 2023, Chris Li wrote:
>
> > > How do you quantify the impact of the delayed swap_entry_free()?
> > >
> > > Since the free and memcg uncharge are now delayed, is there not the
> > > possibility that we stay under memory pressure for longer?  (Assuming at
> > > least some users are swapping because of memory pressure.)
> > >
> > > I would assume that since the free and uncharge itself is delayed that in
> > > the pathological case we'd actually be swapping *more* until the async
> > > worker can run.
> >
> > Thanks for raising this interesting question.
> >
> > First of all, the swap_entry_free() does not impact "memory.current".
> > It reduces "memory.swap.current". Technically it is the swap pressure
> > not memory pressure that suffers the extra delay.
> >
> > Secondly, we are talking about delaying up to 64 swap entries for a
> > few microseconds.
>
> What guarantees that the async freeing happens within a few microseconds?

Linux kernel typically doesn't provide RT scheduling guarantees. You
can change microseconds to milliseconds, my following reasoning still
holds.

>
> > Where the swap slot cache itself delays the freeing
> > of the entries for an arbitrary amount of time. It is not freed until
> > the cache is full of 64 entries. This delay can be seconds or even
> > minutes. Adding a few microseconds of  extra delay to existing seconds
> > delay really makes no difference from the swap pressure point of view.

Let me rephrase it. The swap slots cache itself has arbiturely delayed
on freeing the swap slots, the slot is not free until 64 entries are
reached. Adding a few milliseconds to the current swap slot freeing
delay does not significantly change current behavior from swap
pressure point of view.

Chris

David Rientjes Dec. 24, 2023, 9:13 p.m. UTC | #9

On Sun, 24 Dec 2023, Chris Li wrote:

> On Sat, Dec 23, 2023 at 7:01 PM David Rientjes <rientjes@google.com> wrote:
> >
> > On Sat, 23 Dec 2023, Chris Li wrote:
> >
> > > > How do you quantify the impact of the delayed swap_entry_free()?
> > > >
> > > > Since the free and memcg uncharge are now delayed, is there not the
> > > > possibility that we stay under memory pressure for longer?  (Assuming at
> > > > least some users are swapping because of memory pressure.)
> > > >
> > > > I would assume that since the free and uncharge itself is delayed that in
> > > > the pathological case we'd actually be swapping *more* until the async
> > > > worker can run.
> > >
> > > Thanks for raising this interesting question.
> > >
> > > First of all, the swap_entry_free() does not impact "memory.current".
> > > It reduces "memory.swap.current". Technically it is the swap pressure
> > > not memory pressure that suffers the extra delay.
> > >
> > > Secondly, we are talking about delaying up to 64 swap entries for a
> > > few microseconds.
> >
> > What guarantees that the async freeing happens within a few microseconds?
> 
> Linux kernel typically doesn't provide RT scheduling guarantees. You
> can change microseconds to milliseconds, my following reasoning still
> holds.
> 

What guarantees that the async freeing happens even within 10s?  Your 
responses are implying that there is some deadline by which this freeing 
absolutely must happen (us or ms), but I don't know of any strong 
guarantees.

If there are no absolute guarantees about when the freeing may now occur, 
I'm asking how the impact of the delayed swap_entry_free() can be 
quantified.

The benefit to the current implementation is that there *are* strong 
guarantees about when the freeing will occur and cannot grow exponentially 
before the async worker can do the freeing.

Chris Li Dec. 24, 2023, 10:06 p.m. UTC | #10

On Sun, Dec 24, 2023 at 1:13 PM David Rientjes <rientjes@google.com> wrote:
>
> On Sun, 24 Dec 2023, Chris Li wrote:
>
> > On Sat, Dec 23, 2023 at 7:01 PM David Rientjes <rientjes@google.com> wrote:
> > >
> > > On Sat, 23 Dec 2023, Chris Li wrote:
> > >
> > > > > How do you quantify the impact of the delayed swap_entry_free()?
> > > > >
> > > > > Since the free and memcg uncharge are now delayed, is there not the
> > > > > possibility that we stay under memory pressure for longer?  (Assuming at
> > > > > least some users are swapping because of memory pressure.)
> > > > >
> > > > > I would assume that since the free and uncharge itself is delayed that in
> > > > > the pathological case we'd actually be swapping *more* until the async
> > > > > worker can run.
> > > >
> > > > Thanks for raising this interesting question.
> > > >
> > > > First of all, the swap_entry_free() does not impact "memory.current".
> > > > It reduces "memory.swap.current". Technically it is the swap pressure
> > > > not memory pressure that suffers the extra delay.
> > > >
> > > > Secondly, we are talking about delaying up to 64 swap entries for a
> > > > few microseconds.
> > >
> > > What guarantees that the async freeing happens within a few microseconds?
> >
> > Linux kernel typically doesn't provide RT scheduling guarantees. You
> > can change microseconds to milliseconds, my following reasoning still
> > holds.
> >
>
> What guarantees that the async freeing happens even within 10s?  Your
> responses are implying that there is some deadline by which this freeing
> absolutely must happen (us or ms), but I don't know of any strong
> guarantees.

I think we are in agreement there, there are no such strong guarantees
in linux scheduling. However, when there are free CPU resources, the
job will get scheduled to execute in a reasonable table time frame. If
it does not, I consider that a bug if the CPU has idle resources and
the pending jobs are not able to run for a long time.
The existing code doesn't have such a guarantee either, see my point
follows. I don't know why you want to ask for such a guarantee.

> If there are no absolute guarantees about when the freeing may now occur,
> I'm asking how the impact of the delayed swap_entry_free() can be
> quantified.

Presumably each application has their own SLO metrics for monitoring
their application behavior. I am happy to take a look if any app has
new SLO violations caused by this change.
If you have one metric in mind, please  name it so we can look at it
together. During my current experiment and the chromebook benchmark, I
haven't noticed such ill effects show up in the other metrics drops in
a statistically significant manner. That is not the same as saying
such drops don't exist at all. Just I haven't noticed or the SLO
watching system hasn't caught it.

> The benefit to the current implementation is that there *are* strong
> guarantees about when the freeing will occur and cannot grow exponentially
> before the async worker can do the freeing.

I don't understand your point. Please help me. In the current code,
for the previous swapin fault that releases the swap slots into the
swap slot caches. Let's say the swap slot remains in the cache for X
seconds until Nth (N < 64) swapin page fault later, the cache is full
and finally all 64 swap slot caches are free. Are you suggesting there
is some kind of guarantee X is less than some fixed bound seconds?
What is that bound then? 10 second? 1 minutes?

BTW, there will be no exponential growth, that is guaranteed. Until
the 64 entries cache were freed. The swapin code will take the direct
free path for the current swap slot in hand. The direct free path
existed before my change.

Chris

David Rientjes Dec. 24, 2023, 10:20 p.m. UTC | #11

On Sun, 24 Dec 2023, Chris Li wrote:

> > > > > > How do you quantify the impact of the delayed swap_entry_free()?
> > > > > >
> > > > > > Since the free and memcg uncharge are now delayed, is there not the
> > > > > > possibility that we stay under memory pressure for longer?  (Assuming at
> > > > > > least some users are swapping because of memory pressure.)
> > > > > >
> > > > > > I would assume that since the free and uncharge itself is delayed that in
> > > > > > the pathological case we'd actually be swapping *more* until the async
> > > > > > worker can run.
> > > > >
> > > > > Thanks for raising this interesting question.
> > > > >
> > > > > First of all, the swap_entry_free() does not impact "memory.current".
> > > > > It reduces "memory.swap.current". Technically it is the swap pressure
> > > > > not memory pressure that suffers the extra delay.
> > > > >
> > > > > Secondly, we are talking about delaying up to 64 swap entries for a
> > > > > few microseconds.
> > > >
> > > > What guarantees that the async freeing happens within a few microseconds?
> > >
> > > Linux kernel typically doesn't provide RT scheduling guarantees. You
> > > can change microseconds to milliseconds, my following reasoning still
> > > holds.
> > >
> >
> > What guarantees that the async freeing happens even within 10s?  Your
> > responses are implying that there is some deadline by which this freeing
> > absolutely must happen (us or ms), but I don't know of any strong
> > guarantees.
> 
> I think we are in agreement there, there are no such strong guarantees
> in linux scheduling. However, when there are free CPU resources, the
> job will get scheduled to execute in a reasonable table time frame. If
> it does not, I consider that a bug if the CPU has idle resources and
> the pending jobs are not able to run for a long time.
> The existing code doesn't have such a guarantee either, see my point
> follows. I don't know why you want to ask for such a guarantee.
> 

I'm simply trying to compare the pros and cons of the approach.  As 
pointed out previously by Andrew, this approach actually results in *more* 
work to do freeing.  Then, we could have a significant time delay before 
the freeing is actually done, in addition to the code complexity.  And 
nothing safeguards us from an exponentially increasing amount of freeing 
that will be done sometime in the future.

The current implementation provides strong guarantees that you will do 
batched freeing that will not accumulate beyond a pre-defined threshold 
which is proven to work well in practice.

My only question was about how we can quantify the impact of the delayed 
free.  We've come to the conclusion that it hasn't been quantified and 
there is no guarantees on when it will be freed.

I'll leave it to those more invested in this path in the page fault 
handler to provide feedback.  Thanks!

Huang, Ying Dec. 25, 2023, 7:07 a.m. UTC | #12

Chris Li <chrisl@kernel.org> writes:

> On Fri, Dec 22, 2023 at 11:52:08AM -0800, Andrew Morton wrote:
>> On Thu, 21 Dec 2023 22:25:39 -0800 Chris Li <chrisl@kernel.org> wrote:
>> 
>> > We discovered that 1% swap page fault is 100us+ while 50% of
>> > the swap fault is under 20us.
>> > 
>> > Further investigation show that a large portion of the time
>> > spent in the free_swap_slots() function for the long tail case.
>> > 
>> > The percpu cache of swap slots is freed in a batch of 64 entries
>> > inside free_swap_slots(). These cache entries are accumulated
>> > from previous page faults, which may not be related to the current
>> > process.
>> > 
>> > Doing the batch free in the page fault handler causes longer
>> > tail latencies and penalizes the current process.
>> > 
>> > Move free_swap_slots() outside of the swapin page fault handler into an
>> > async work queue to avoid such long tail latencies.
>> 
>> This will require a larger amount of total work than the current
>
> Yes, there will be a tiny little bit of extra overhead to schedule the job
> on to the other work queue.
>
>> scheme.  So we're trading that off against better latency.
>> 
>> Why is this a good tradeoff?
>
> That is a very good question. Both Hugh and Wei had asked me similar questions
> before. +Hugh.
>
> The TL;DR is that it makes the swap more palleralizedable.
>
> Because morden computers typically have more than one CPU and the CPU utilization
> is rarely reached to 100%. We are actually not trading the latency for some one
> run slower. Most of the time the real impact is that the current swapin page fault
> can return quicker so more work can submit to the kernel sooner, at the same time
> the other idle CPU can pick up the non latency critical work of freeing of the
> swap slot cache entries. The net effect is that we speed things up and increase
> the overall system utilization rather than slow things down.

You solution depends on there is enough idle time in the system.  This
isn't always true.

In general, all async solutions have 2 possible issues.

a) Unrelated applications may be punished.  Because they may wait for
CPU which is running the async operations.  In the original solution,
the application swap more will be punished.

b) The CPU time cannot be charged to appropriate applications.  The
original behavior isn't perfect too.  But it's better than async worker.

Given the runtime of worker is at 100us level, these issues may be not
severe.  But I think that you may need to explain them at least.

And, when swap slots freeing batching was introduced, it was mainly used
to reduce the lock contention of sis->lock (via swap_info_get_cont()).
So, we may move some operations (e.g., mem_cgroup_uncharge_swap,
clear_shadow_from_swap_cache(), etc.) out of batched operation (before
calling free_swap_slot()) to reduce the latency impact.

> The test result of chromebook and Google production server should be able to show
> that it is beneficial to both laptop and server workloads, making them more responsive
> in swap related workload.

--
Best Regards,
Huang, Ying

Yosry Ahmed Dec. 28, 2023, 3:33 p.m. UTC | #13

On Thu, Dec 21, 2023 at 10:25 PM Chris Li <chrisl@kernel.org> wrote:
>
> We discovered that 1% swap page fault is 100us+ while 50% of
> the swap fault is under 20us.
>
> Further investigation show that a large portion of the time
> spent in the free_swap_slots() function for the long tail case.
>
> The percpu cache of swap slots is freed in a batch of 64 entries
> inside free_swap_slots(). These cache entries are accumulated
> from previous page faults, which may not be related to the current
> process.
>
> Doing the batch free in the page fault handler causes longer
> tail latencies and penalizes the current process.
>
> Move free_swap_slots() outside of the swapin page fault handler into an
> async work queue to avoid such long tail latencies.
>
> Testing:
>
> Chun-Tse did some benchmark in chromebook, showing that
> zram_wait_metrics improve about 15% with 80% and 95% confidence.
>
> I recently ran some experiments on about 1000 Google production
> machines. It shows swapin latency drops in the long tail
> 100us - 500us bucket dramatically.
>
> platform        (100-500us)             (0-100us)
> A               1.12% -> 0.36%          98.47% -> 99.22%
> B               0.65% -> 0.15%          98.96% -> 99.46%
> C               0.61% -> 0.23%          98.96% -> 99.38%

I recall you mentioning that mem_cgroup_uncharge_swap() is the most
expensive part of the batched freeing. If that's the case, I am
curious what happens if we move that call outside of the batching
(i.e. once the swap entry is no longer used and will be returned to
the cache). This should amortize the cost of memcg uncharging and
reduce the tail fault latency without extra work. Arguably, it could
increase the average fault latency, but not necessarily in a
significant way.

Ying pointed out something similar if I understand correctly (and
other operations that can be moved).

Also, if we choose to follow this route, I think there we should flush
the async worker in drain_slots_cache_cpu(), right?

Yosry Ahmed Dec. 28, 2023, 3:34 p.m. UTC | #14

On Sun, Dec 24, 2023 at 2:07 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Sun, Dec 24, 2023 at 1:13 PM David Rientjes <rientjes@google.com> wrote:
> >
> > On Sun, 24 Dec 2023, Chris Li wrote:
> >
> > > On Sat, Dec 23, 2023 at 7:01 PM David Rientjes <rientjes@google.com> wrote:
> > > >
> > > > On Sat, 23 Dec 2023, Chris Li wrote:
> > > >
> > > > > > How do you quantify the impact of the delayed swap_entry_free()?
> > > > > >
> > > > > > Since the free and memcg uncharge are now delayed, is there not the
> > > > > > possibility that we stay under memory pressure for longer?  (Assuming at
> > > > > > least some users are swapping because of memory pressure.)
> > > > > >
> > > > > > I would assume that since the free and uncharge itself is delayed that in
> > > > > > the pathological case we'd actually be swapping *more* until the async
> > > > > > worker can run.
> > > > >
> > > > > Thanks for raising this interesting question.
> > > > >
> > > > > First of all, the swap_entry_free() does not impact "memory.current".
> > > > > It reduces "memory.swap.current". Technically it is the swap pressure
> > > > > not memory pressure that suffers the extra delay.
> > > > >
> > > > > Secondly, we are talking about delaying up to 64 swap entries for a
> > > > > few microseconds.
> > > >
> > > > What guarantees that the async freeing happens within a few microseconds?
> > >
> > > Linux kernel typically doesn't provide RT scheduling guarantees. You
> > > can change microseconds to milliseconds, my following reasoning still
> > > holds.
> > >
> >
> > What guarantees that the async freeing happens even within 10s?  Your
> > responses are implying that there is some deadline by which this freeing
> > absolutely must happen (us or ms), but I don't know of any strong
> > guarantees.
>
> I think we are in agreement there, there are no such strong guarantees
> in linux scheduling. However, when there are free CPU resources, the
> job will get scheduled to execute in a reasonable table time frame. If
> it does not, I consider that a bug if the CPU has idle resources and
> the pending jobs are not able to run for a long time.
> The existing code doesn't have such a guarantee either, see my point
> follows. I don't know why you want to ask for such a guarantee.
>
> > If there are no absolute guarantees about when the freeing may now occur,
> > I'm asking how the impact of the delayed swap_entry_free() can be
> > quantified.
>
> Presumably each application has their own SLO metrics for monitoring
> their application behavior. I am happy to take a look if any app has
> new SLO violations caused by this change.
> If you have one metric in mind, please  name it so we can look at it
> together. During my current experiment and the chromebook benchmark, I
> haven't noticed such ill effects show up in the other metrics drops in
> a statistically significant manner. That is not the same as saying
> such drops don't exist at all. Just I haven't noticed or the SLO
> watching system hasn't caught it.
>
> > The benefit to the current implementation is that there *are* strong
> > guarantees about when the freeing will occur and cannot grow exponentially
> > before the async worker can do the freeing.
>
> I don't understand your point. Please help me. In the current code,
> for the previous swapin fault that releases the swap slots into the
> swap slot caches. Let's say the swap slot remains in the cache for X
> seconds until Nth (N < 64) swapin page fault later, the cache is full
> and finally all 64 swap slot caches are free. Are you suggesting there
> is some kind of guarantee X is less than some fixed bound seconds?
> What is that bound then? 10 second? 1 minutes?
>
> BTW, there will be no exponential growth, that is guaranteed. Until
> the 64 entries cache were freed. The swapin code will take the direct
> free path for the current swap slot in hand. The direct free path
> existed before my change.

FWIW, it's 64 * the number of CPUs.

Chris Li Feb. 1, 2024, 12:43 a.m. UTC | #15

Hi Ying,

Sorry for the late reply.

On Sun, Dec 24, 2023 at 11:10 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > On Fri, Dec 22, 2023 at 11:52:08AM -0800, Andrew Morton wrote:
> >> On Thu, 21 Dec 2023 22:25:39 -0800 Chris Li <chrisl@kernel.org> wrote:
> >>
> >> > We discovered that 1% swap page fault is 100us+ while 50% of
> >> > the swap fault is under 20us.
> >> >
> >> > Further investigation show that a large portion of the time
> >> > spent in the free_swap_slots() function for the long tail case.
> >> >
> >> > The percpu cache of swap slots is freed in a batch of 64 entries
> >> > inside free_swap_slots(). These cache entries are accumulated
> >> > from previous page faults, which may not be related to the current
> >> > process.
> >> >
> >> > Doing the batch free in the page fault handler causes longer
> >> > tail latencies and penalizes the current process.
> >> >
> >> > Move free_swap_slots() outside of the swapin page fault handler into an
> >> > async work queue to avoid such long tail latencies.
> >>
> >> This will require a larger amount of total work than the current
> >
> > Yes, there will be a tiny little bit of extra overhead to schedule the job
> > on to the other work queue.
> >
> >> scheme.  So we're trading that off against better latency.
> >>
> >> Why is this a good tradeoff?
> >
> > That is a very good question. Both Hugh and Wei had asked me similar questions
> > before. +Hugh.
> >
> > The TL;DR is that it makes the swap more palleralizedable.
> >
> > Because morden computers typically have more than one CPU and the CPU utilization
> > is rarely reached to 100%. We are actually not trading the latency for some one
> > run slower. Most of the time the real impact is that the current swapin page fault
> > can return quicker so more work can submit to the kernel sooner, at the same time
> > the other idle CPU can pick up the non latency critical work of freeing of the
> > swap slot cache entries. The net effect is that we speed things up and increase
> > the overall system utilization rather than slow things down.
>
> You solution depends on there is enough idle time in the system.  This
> isn't always true.
>
> In general, all async solutions have 2 possible issues.
>
> a) Unrelated applications may be punished.  Because they may wait for
> CPU which is running the async operations.  In the original solution,
> the application swap more will be punished.

The typical time to perform on the async free is very brief, at about
100ms level.
So the amount of punishment would be small.

The original behavior was already delaying the freeing of swap slots
due to batching.
Adding a tiny bit of time does not change the overall behavior too much.
Another thing is that, if the async free is pending, it will go
through the direct free path.

> b) The CPU time cannot be charged to appropriate applications.  The
> original behavior isn't perfect too.  But it's better than async worker.

Yes, the original behavior will free other cgroups' swap entries.

> Given the runtime of worker is at 100us level, these issues may be not
> severe.  But I think that you may need to explain them at least.

Thanks for the suggestion. Will do in V2.

>
> And, when swap slots freeing batching was introduced, it was mainly used
> to reduce the lock contention of sis->lock (via swap_info_get_cont()).
> So, we may move some operations (e.g., mem_cgroup_uncharge_swap,
> clear_shadow_from_swap_cache(), etc.) out of batched operation (before
> calling free_swap_slot()) to reduce the latency impact.

That is good to know. Thanks for the explanation.

Chris

>
> > The test result of chromebook and Google production server should be able to show
> > that it is beneficial to both laptop and server workloads, making them more responsive
> > in swap related workload.
>
> --
> Best Regards,
> Huang, Ying
>

Chris Li Feb. 1, 2024, 12:57 a.m. UTC | #16

Hi Yosry,

On Thu, Dec 28, 2023 at 7:34 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Dec 21, 2023 at 10:25 PM Chris Li <chrisl@kernel.org> wrote:
> >
> > We discovered that 1% swap page fault is 100us+ while 50% of
> > the swap fault is under 20us.
> >
> > Further investigation show that a large portion of the time
> > spent in the free_swap_slots() function for the long tail case.
> >
> > The percpu cache of swap slots is freed in a batch of 64 entries
> > inside free_swap_slots(). These cache entries are accumulated
> > from previous page faults, which may not be related to the current
> > process.
> >
> > Doing the batch free in the page fault handler causes longer
> > tail latencies and penalizes the current process.
> >
> > Move free_swap_slots() outside of the swapin page fault handler into an
> > async work queue to avoid such long tail latencies.
> >
> > Testing:
> >
> > Chun-Tse did some benchmark in chromebook, showing that
> > zram_wait_metrics improve about 15% with 80% and 95% confidence.
> >
> > I recently ran some experiments on about 1000 Google production
> > machines. It shows swapin latency drops in the long tail
> > 100us - 500us bucket dramatically.
> >
> > platform        (100-500us)             (0-100us)
> > A               1.12% -> 0.36%          98.47% -> 99.22%
> > B               0.65% -> 0.15%          98.96% -> 99.46%
> > C               0.61% -> 0.23%          98.96% -> 99.38%
>
> I recall you mentioning that mem_cgroup_uncharge_swap() is the most
> expensive part of the batched freeing. If that's the case, I am
> curious what happens if we move that call outside of the batching
> (i.e. once the swap entry is no longer used and will be returned to
> the cache). This should amortize the cost of memcg uncharging and
> reduce the tail fault latency without extra work. Arguably, it could
> increase the average fault latency, but not necessarily in a
> significant way.
>
> Ying pointed out something similar if I understand correctly (and
> other operations that can be moved).

If the goal is to let the swap fault return as soon as possible. Then
the current approach is better.
mem_cgroup_uncarge_swap() is only part of it. Not close to all of it.

>
> Also, if we choose to follow this route, I think there we should flush
> the async worker in drain_slots_cache_cpu(), right?
Not sure I understand this part. The drain_slots_cache_cpu(), will
free the entries already. The current lock around cache->free_lock
should protect async workers accessing the entries. What do you mean
by flushing?

Chris

Yosry Ahmed Feb. 1, 2024, 1:21 a.m. UTC | #17

On Wed, Jan 31, 2024 at 4:57 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Yosry,
>
> On Thu, Dec 28, 2023 at 7:34 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Thu, Dec 21, 2023 at 10:25 PM Chris Li <chrisl@kernel.org> wrote:
> > >
> > > We discovered that 1% swap page fault is 100us+ while 50% of
> > > the swap fault is under 20us.
> > >
> > > Further investigation show that a large portion of the time
> > > spent in the free_swap_slots() function for the long tail case.
> > >
> > > The percpu cache of swap slots is freed in a batch of 64 entries
> > > inside free_swap_slots(). These cache entries are accumulated
> > > from previous page faults, which may not be related to the current
> > > process.
> > >
> > > Doing the batch free in the page fault handler causes longer
> > > tail latencies and penalizes the current process.
> > >
> > > Move free_swap_slots() outside of the swapin page fault handler into an
> > > async work queue to avoid such long tail latencies.
> > >
> > > Testing:
> > >
> > > Chun-Tse did some benchmark in chromebook, showing that
> > > zram_wait_metrics improve about 15% with 80% and 95% confidence.
> > >
> > > I recently ran some experiments on about 1000 Google production
> > > machines. It shows swapin latency drops in the long tail
> > > 100us - 500us bucket dramatically.
> > >
> > > platform        (100-500us)             (0-100us)
> > > A               1.12% -> 0.36%          98.47% -> 99.22%
> > > B               0.65% -> 0.15%          98.96% -> 99.46%
> > > C               0.61% -> 0.23%          98.96% -> 99.38%
> >
> > I recall you mentioning that mem_cgroup_uncharge_swap() is the most
> > expensive part of the batched freeing. If that's the case, I am
> > curious what happens if we move that call outside of the batching
> > (i.e. once the swap entry is no longer used and will be returned to
> > the cache). This should amortize the cost of memcg uncharging and
> > reduce the tail fault latency without extra work. Arguably, it could
> > increase the average fault latency, but not necessarily in a
> > significant way.
> >
> > Ying pointed out something similar if I understand correctly (and
> > other operations that can be moved).
>
> If the goal is to let the swap fault return as soon as possible. Then
> the current approach is better.
> mem_cgroup_uncarge_swap() is only part of it. Not close to all of it.

I think there are a lot of operations that we can move out of
swapcache_free_entries():
- mem_cgroup_uncharge_swap()
- arch_swap_invalidate_page()
- zswap_invalidate()
- clear_shadow_from_swap_cache()
, and maybe others.

I am curious, if we move these operations from the batched freeing,
would this remove the increased tail latency and make it more
consistent, without doing extra work?

I believe this is what Ying was also asking about.

>
> >
> > Also, if we choose to follow this route, I think there we should flush
> > the async worker in drain_slots_cache_cpu(), right?
> Not sure I understand this part. The drain_slots_cache_cpu(), will
> free the entries already. The current lock around cache->free_lock
> should protect async workers accessing the entries. What do you mean
> by flushing?

Never mind. I just realized that the percpu caches are static, so they
are not freed in drain_slots_cache_cpu(). The NULL check in the async
worker should be enough protection.

mm: swap: async free swap slot cache entries

Commit Message

Comments

Patch