Message ID | 20231221-async-free-v1-1-94b277992cb0@kernel.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm: swap: async free swap slot cache entries | expand |
On Thu, 21 Dec 2023 22:25:39 -0800 Chris Li <chrisl@kernel.org> wrote: > We discovered that 1% swap page fault is 100us+ while 50% of > the swap fault is under 20us. > > Further investigation show that a large portion of the time > spent in the free_swap_slots() function for the long tail case. > > The percpu cache of swap slots is freed in a batch of 64 entries > inside free_swap_slots(). These cache entries are accumulated > from previous page faults, which may not be related to the current > process. > > Doing the batch free in the page fault handler causes longer > tail latencies and penalizes the current process. > > Move free_swap_slots() outside of the swapin page fault handler into an > async work queue to avoid such long tail latencies. This will require a larger amount of total work than the current scheme. So we're trading that off against better latency. Why is this a good tradeoff?
On Fri, Dec 22, 2023 at 11:52:08AM -0800, Andrew Morton wrote: > On Thu, 21 Dec 2023 22:25:39 -0800 Chris Li <chrisl@kernel.org> wrote: > > > We discovered that 1% swap page fault is 100us+ while 50% of > > the swap fault is under 20us. > > > > Further investigation show that a large portion of the time > > spent in the free_swap_slots() function for the long tail case. > > > > The percpu cache of swap slots is freed in a batch of 64 entries > > inside free_swap_slots(). These cache entries are accumulated > > from previous page faults, which may not be related to the current > > process. > > > > Doing the batch free in the page fault handler causes longer > > tail latencies and penalizes the current process. > > > > Move free_swap_slots() outside of the swapin page fault handler into an > > async work queue to avoid such long tail latencies. > > This will require a larger amount of total work than the current Yes, there will be a tiny little bit of extra overhead to schedule the job on to the other work queue. > scheme. So we're trading that off against better latency. > > Why is this a good tradeoff? That is a very good question. Both Hugh and Wei had asked me similar questions before. +Hugh. The TL;DR is that it makes the swap more palleralizedable. Because morden computers typically have more than one CPU and the CPU utilization is rarely reached to 100%. We are actually not trading the latency for some one run slower. Most of the time the real impact is that the current swapin page fault can return quicker so more work can submit to the kernel sooner, at the same time the other idle CPU can pick up the non latency critical work of freeing of the swap slot cache entries. The net effect is that we speed things up and increase the overall system utilization rather than slow things down. The test result of chromebook and Google production server should be able to show that it is beneficial to both laptop and server workloads, making them more responsive in swap related workload. Chris
On Thu, Dec 21, 2023 at 10:25 PM Chris Li <chrisl@kernel.org> wrote: > > We discovered that 1% swap page fault is 100us+ while 50% of > the swap fault is under 20us. > > Further investigation show that a large portion of the time > spent in the free_swap_slots() function for the long tail case. > > The percpu cache of swap slots is freed in a batch of 64 entries > inside free_swap_slots(). These cache entries are accumulated > from previous page faults, which may not be related to the current > process. > > Doing the batch free in the page fault handler causes longer > tail latencies and penalizes the current process. > > Move free_swap_slots() outside of the swapin page fault handler into an > async work queue to avoid such long tail latencies. > > Testing: > > Chun-Tse did some benchmark in chromebook, showing that > zram_wait_metrics improve about 15% with 80% and 95% confidence. > > I recently ran some experiments on about 1000 Google production > machines. It shows swapin latency drops in the long tail > 100us - 500us bucket dramatically. > > platform (100-500us) (0-100us) > A 1.12% -> 0.36% 98.47% -> 99.22% > B 0.65% -> 0.15% 98.96% -> 99.46% > C 0.61% -> 0.23% 98.96% -> 99.38% Nice! Are these values for zram as well, or ordinary (SSD?) swap? I imagine it will matter less for swap, right? > > Signed-off-by: Chris Li <chrisl@kernel.org> > To: Andrew Morton <akpm@linux-foundation.org> > Cc: linux-kernel@vger.kernel.org > Cc: linux-mm@kvack.org > Cc: Wei Xu <weixugc@google.com> > Cc: Yu Zhao <yuzhao@google.com> > Cc: Greg Thelen <gthelen@google.com> > Cc: Chun-Tse Shao <ctshao@google.com> > Cc: Suren Baghdasaryan <surenb@google.com> > Cc: Yosry Ahmed <yosryahmed@google.com> > Cc: Brain Geffon <bgeffon@google.com> > Cc: Minchan Kim <minchan@kernel.org> > Cc: Michal Hocko <mhocko@suse.com> > Cc: Mel Gorman <mgorman@techsingularity.net> > Cc: Huang Ying <ying.huang@intel.com> > Cc: Nhat Pham <nphamcs@gmail.com> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: Kairui Song <kasong@tencent.com> > Cc: Zhongkun He <hezhongkun.hzk@bytedance.com> > Cc: Kemeng Shi <shikemeng@huaweicloud.com> > Cc: Barry Song <v-songbaohua@oppo.com> > --- > include/linux/swap_slots.h | 1 + > mm/swap_slots.c | 37 +++++++++++++++++++++++++++++-------- > 2 files changed, 30 insertions(+), 8 deletions(-) > > diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h > index 15adfb8c813a..67bc8fa30d63 100644 > --- a/include/linux/swap_slots.h > +++ b/include/linux/swap_slots.h > @@ -19,6 +19,7 @@ struct swap_slots_cache { > spinlock_t free_lock; /* protects slots_ret, n_ret */ > swp_entry_t *slots_ret; > int n_ret; > + struct work_struct async_free; > }; > > void disable_swap_slots_cache_lock(void); > diff --git a/mm/swap_slots.c b/mm/swap_slots.c > index 0bec1f705f8e..a3b306550732 100644 > --- a/mm/swap_slots.c > +++ b/mm/swap_slots.c > @@ -42,8 +42,10 @@ static bool swap_slot_cache_initialized; > static DEFINE_MUTEX(swap_slots_cache_mutex); > /* Serialize swap slots cache enable/disable operations */ > static DEFINE_MUTEX(swap_slots_cache_enable_mutex); > +static struct workqueue_struct *swap_free_queue; > > static void __drain_swap_slots_cache(unsigned int type); > +static void swapcache_async_free_entries(struct work_struct *data); > > #define use_swap_slot_cache (swap_slot_cache_active && swap_slot_cache_enabled) > #define SLOTS_CACHE 0x1 > @@ -149,6 +151,7 @@ static int alloc_swap_slot_cache(unsigned int cpu) > spin_lock_init(&cache->free_lock); > cache->lock_initialized = true; > } > + INIT_WORK(&cache->async_free, swapcache_async_free_entries); > cache->nr = 0; > cache->cur = 0; > cache->n_ret = 0; > @@ -269,6 +272,20 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache) > return cache->nr; > } > > +static void swapcache_async_free_entries(struct work_struct *data) > +{ > + struct swap_slots_cache *cache; > + > + cache = container_of(data, struct swap_slots_cache, async_free); > + spin_lock_irq(&cache->free_lock); > + /* Swap slots cache may be deactivated before acquiring lock */ > + if (cache->slots_ret) { > + swapcache_free_entries(cache->slots_ret, cache->n_ret); > + cache->n_ret = 0; > + } > + spin_unlock_irq(&cache->free_lock); > +} > + > void free_swap_slot(swp_entry_t entry) > { > struct swap_slots_cache *cache; > @@ -282,17 +299,14 @@ void free_swap_slot(swp_entry_t entry) > goto direct_free; > } > if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE) { > - /* > - * Return slots to global pool. > - * The current swap_map value is SWAP_HAS_CACHE. > - * Set it to 0 to indicate it is available for > - * allocation in global pool > - */ > - swapcache_free_entries(cache->slots_ret, cache->n_ret); > - cache->n_ret = 0; > + spin_unlock_irq(&cache->free_lock); > + queue_work(swap_free_queue, &cache->async_free); > + goto direct_free; > } > cache->slots_ret[cache->n_ret++] = entry; > spin_unlock_irq(&cache->free_lock); > + if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE) > + queue_work(swap_free_queue, &cache->async_free); > } else { > direct_free: > swapcache_free_entries(&entry, 1); > @@ -348,3 +362,10 @@ swp_entry_t folio_alloc_swap(struct folio *folio) > } > return entry; > } > + > +static int __init async_queue_init(void) > +{ > + swap_free_queue = create_workqueue("async swap cache"); nit(?): isn't create_workqueue() deprecated? from: https://www.kernel.org/doc/html/latest/core-api/workqueue.html#application-programming-interface-api I think there's a zswap patch proposing fixing that on the zswap side. > + return 0; > +} > +subsys_initcall(async_queue_init); > > --- > base-commit: eacce8189e28717da6f44ee492b7404c636ae0de > change-id: 20231216-async-free-bef392015432 > > Best regards, > -- > Chris Li <chrisl@kernel.org> >
On Fri, Dec 22, 2023 at 05:44:19PM -0800, Nhat Pham wrote: > On Thu, Dec 21, 2023 at 10:25 PM Chris Li <chrisl@kernel.org> wrote: > > > > We discovered that 1% swap page fault is 100us+ while 50% of > > the swap fault is under 20us. > > > > Further investigation show that a large portion of the time > > spent in the free_swap_slots() function for the long tail case. > > > > The percpu cache of swap slots is freed in a batch of 64 entries > > inside free_swap_slots(). These cache entries are accumulated > > from previous page faults, which may not be related to the current > > process. > > > > Doing the batch free in the page fault handler causes longer > > tail latencies and penalizes the current process. > > > > Move free_swap_slots() outside of the swapin page fault handler into an > > async work queue to avoid such long tail latencies. > > > > Testing: > > > > Chun-Tse did some benchmark in chromebook, showing that > > zram_wait_metrics improve about 15% with 80% and 95% confidence. This benchmark result is using zram. There are 3 micro benchmarks of all showing about 15% improvement with a slightly different confidence level. That is where the 80%-90% come from. > > > > I recently ran some experiments on about 1000 Google production > > machines. It shows swapin latency drops in the long tail > > 100us - 500us bucket dramatically. > > > > platform (100-500us) (0-100us) > > A 1.12% -> 0.36% 98.47% -> 99.22% > > B 0.65% -> 0.15% 98.96% -> 99.46% > > C 0.61% -> 0.23% 98.96% -> 99.38% > > Nice! Are these values for zram as well, or ordinary (SSD?) swap? I > imagine it will matter less for swap, right? Those production servers only use zswap. There is no zram there. For ordinary SSD swap the latency reduction is also there in terms of absolute us. However the raw savings get shadowed by the SSD IO latency, typically in the 100us range. In terms of percentage, you don't have as dramatica an effect compared to the memory compression based swapping(zswap and zram). > > @@ -348,3 +362,10 @@ swp_entry_t folio_alloc_swap(struct folio *folio) > > } > > return entry; > > } > > + > > +static int __init async_queue_init(void) > > +{ > > + swap_free_queue = create_workqueue("async swap cache"); > > nit(?): isn't create_workqueue() deprecated? from: > > https://www.kernel.org/doc/html/latest/core-api/workqueue.html#application-programming-interface-api > > I think there's a zswap patch proposing fixing that on the zswap side. > Yes, I recall I saw that patch. I might acked on it as well. Very good catch. I will fix it in the V2 spin. Meanwhile, I will wait on it a bit to collect the other review feedback. Thans for catching that. Chris
On Fri, 22 Dec 2023, Chris Li wrote: > On Fri, Dec 22, 2023 at 11:52:08AM -0800, Andrew Morton wrote: > > On Thu, 21 Dec 2023 22:25:39 -0800 Chris Li <chrisl@kernel.org> wrote: > > > > > We discovered that 1% swap page fault is 100us+ while 50% of > > > the swap fault is under 20us. > > > > > > Further investigation show that a large portion of the time > > > spent in the free_swap_slots() function for the long tail case. > > > > > > The percpu cache of swap slots is freed in a batch of 64 entries > > > inside free_swap_slots(). These cache entries are accumulated > > > from previous page faults, which may not be related to the current > > > process. > > > > > > Doing the batch free in the page fault handler causes longer > > > tail latencies and penalizes the current process. > > > > > > Move free_swap_slots() outside of the swapin page fault handler into an > > > async work queue to avoid such long tail latencies. > > > > This will require a larger amount of total work than the current > > Yes, there will be a tiny little bit of extra overhead to schedule the job > on to the other work queue. > How do you quantify the impact of the delayed swap_entry_free()? Since the free and memcg uncharge are now delayed, is there not the possibility that we stay under memory pressure for longer? (Assuming at least some users are swapping because of memory pressure.) I would assume that since the free and uncharge itself is delayed that in the pathological case we'd actually be swapping *more* until the async worker can run.
Hi David, On Fri, Dec 22, 2023 at 10:11 PM David Rientjes <rientjes@google.com> wrote: > > How do you quantify the impact of the delayed swap_entry_free()? > > Since the free and memcg uncharge are now delayed, is there not the > possibility that we stay under memory pressure for longer? (Assuming at > least some users are swapping because of memory pressure.) > > I would assume that since the free and uncharge itself is delayed that in > the pathological case we'd actually be swapping *more* until the async > worker can run. Thanks for raising this interesting question. First of all, the swap_entry_free() does not impact "memory.current". It reduces "memory.swap.current". Technically it is the swap pressure not memory pressure that suffers the extra delay. Secondly, we are talking about delaying up to 64 swap entries for a few microseconds. Where the swap slot cache itself delays the freeing of the entries for an arbitrary amount of time. It is not freed until the cache is full of 64 entries. This delay can be seconds or even minutes. Adding a few microseconds of extra delay to existing seconds delay really makes no difference from the swap pressure point of view. Chris
On Sat, 23 Dec 2023, Chris Li wrote: > > How do you quantify the impact of the delayed swap_entry_free()? > > > > Since the free and memcg uncharge are now delayed, is there not the > > possibility that we stay under memory pressure for longer? (Assuming at > > least some users are swapping because of memory pressure.) > > > > I would assume that since the free and uncharge itself is delayed that in > > the pathological case we'd actually be swapping *more* until the async > > worker can run. > > Thanks for raising this interesting question. > > First of all, the swap_entry_free() does not impact "memory.current". > It reduces "memory.swap.current". Technically it is the swap pressure > not memory pressure that suffers the extra delay. > > Secondly, we are talking about delaying up to 64 swap entries for a > few microseconds. What guarantees that the async freeing happens within a few microseconds? > Where the swap slot cache itself delays the freeing > of the entries for an arbitrary amount of time. It is not freed until > the cache is full of 64 entries. This delay can be seconds or even > minutes. Adding a few microseconds of extra delay to existing seconds > delay really makes no difference from the swap pressure point of view. > > Chris >
On Sat, Dec 23, 2023 at 7:01 PM David Rientjes <rientjes@google.com> wrote: > > On Sat, 23 Dec 2023, Chris Li wrote: > > > > How do you quantify the impact of the delayed swap_entry_free()? > > > > > > Since the free and memcg uncharge are now delayed, is there not the > > > possibility that we stay under memory pressure for longer? (Assuming at > > > least some users are swapping because of memory pressure.) > > > > > > I would assume that since the free and uncharge itself is delayed that in > > > the pathological case we'd actually be swapping *more* until the async > > > worker can run. > > > > Thanks for raising this interesting question. > > > > First of all, the swap_entry_free() does not impact "memory.current". > > It reduces "memory.swap.current". Technically it is the swap pressure > > not memory pressure that suffers the extra delay. > > > > Secondly, we are talking about delaying up to 64 swap entries for a > > few microseconds. > > What guarantees that the async freeing happens within a few microseconds? Linux kernel typically doesn't provide RT scheduling guarantees. You can change microseconds to milliseconds, my following reasoning still holds. > > > Where the swap slot cache itself delays the freeing > > of the entries for an arbitrary amount of time. It is not freed until > > the cache is full of 64 entries. This delay can be seconds or even > > minutes. Adding a few microseconds of extra delay to existing seconds > > delay really makes no difference from the swap pressure point of view. Let me rephrase it. The swap slots cache itself has arbiturely delayed on freeing the swap slots, the slot is not free until 64 entries are reached. Adding a few milliseconds to the current swap slot freeing delay does not significantly change current behavior from swap pressure point of view. Chris
On Sun, 24 Dec 2023, Chris Li wrote: > On Sat, Dec 23, 2023 at 7:01 PM David Rientjes <rientjes@google.com> wrote: > > > > On Sat, 23 Dec 2023, Chris Li wrote: > > > > > > How do you quantify the impact of the delayed swap_entry_free()? > > > > > > > > Since the free and memcg uncharge are now delayed, is there not the > > > > possibility that we stay under memory pressure for longer? (Assuming at > > > > least some users are swapping because of memory pressure.) > > > > > > > > I would assume that since the free and uncharge itself is delayed that in > > > > the pathological case we'd actually be swapping *more* until the async > > > > worker can run. > > > > > > Thanks for raising this interesting question. > > > > > > First of all, the swap_entry_free() does not impact "memory.current". > > > It reduces "memory.swap.current". Technically it is the swap pressure > > > not memory pressure that suffers the extra delay. > > > > > > Secondly, we are talking about delaying up to 64 swap entries for a > > > few microseconds. > > > > What guarantees that the async freeing happens within a few microseconds? > > Linux kernel typically doesn't provide RT scheduling guarantees. You > can change microseconds to milliseconds, my following reasoning still > holds. > What guarantees that the async freeing happens even within 10s? Your responses are implying that there is some deadline by which this freeing absolutely must happen (us or ms), but I don't know of any strong guarantees. If there are no absolute guarantees about when the freeing may now occur, I'm asking how the impact of the delayed swap_entry_free() can be quantified. The benefit to the current implementation is that there *are* strong guarantees about when the freeing will occur and cannot grow exponentially before the async worker can do the freeing.
On Sun, Dec 24, 2023 at 1:13 PM David Rientjes <rientjes@google.com> wrote: > > On Sun, 24 Dec 2023, Chris Li wrote: > > > On Sat, Dec 23, 2023 at 7:01 PM David Rientjes <rientjes@google.com> wrote: > > > > > > On Sat, 23 Dec 2023, Chris Li wrote: > > > > > > > > How do you quantify the impact of the delayed swap_entry_free()? > > > > > > > > > > Since the free and memcg uncharge are now delayed, is there not the > > > > > possibility that we stay under memory pressure for longer? (Assuming at > > > > > least some users are swapping because of memory pressure.) > > > > > > > > > > I would assume that since the free and uncharge itself is delayed that in > > > > > the pathological case we'd actually be swapping *more* until the async > > > > > worker can run. > > > > > > > > Thanks for raising this interesting question. > > > > > > > > First of all, the swap_entry_free() does not impact "memory.current". > > > > It reduces "memory.swap.current". Technically it is the swap pressure > > > > not memory pressure that suffers the extra delay. > > > > > > > > Secondly, we are talking about delaying up to 64 swap entries for a > > > > few microseconds. > > > > > > What guarantees that the async freeing happens within a few microseconds? > > > > Linux kernel typically doesn't provide RT scheduling guarantees. You > > can change microseconds to milliseconds, my following reasoning still > > holds. > > > > What guarantees that the async freeing happens even within 10s? Your > responses are implying that there is some deadline by which this freeing > absolutely must happen (us or ms), but I don't know of any strong > guarantees. I think we are in agreement there, there are no such strong guarantees in linux scheduling. However, when there are free CPU resources, the job will get scheduled to execute in a reasonable table time frame. If it does not, I consider that a bug if the CPU has idle resources and the pending jobs are not able to run for a long time. The existing code doesn't have such a guarantee either, see my point follows. I don't know why you want to ask for such a guarantee. > If there are no absolute guarantees about when the freeing may now occur, > I'm asking how the impact of the delayed swap_entry_free() can be > quantified. Presumably each application has their own SLO metrics for monitoring their application behavior. I am happy to take a look if any app has new SLO violations caused by this change. If you have one metric in mind, please name it so we can look at it together. During my current experiment and the chromebook benchmark, I haven't noticed such ill effects show up in the other metrics drops in a statistically significant manner. That is not the same as saying such drops don't exist at all. Just I haven't noticed or the SLO watching system hasn't caught it. > The benefit to the current implementation is that there *are* strong > guarantees about when the freeing will occur and cannot grow exponentially > before the async worker can do the freeing. I don't understand your point. Please help me. In the current code, for the previous swapin fault that releases the swap slots into the swap slot caches. Let's say the swap slot remains in the cache for X seconds until Nth (N < 64) swapin page fault later, the cache is full and finally all 64 swap slot caches are free. Are you suggesting there is some kind of guarantee X is less than some fixed bound seconds? What is that bound then? 10 second? 1 minutes? BTW, there will be no exponential growth, that is guaranteed. Until the 64 entries cache were freed. The swapin code will take the direct free path for the current swap slot in hand. The direct free path existed before my change. Chris
On Sun, 24 Dec 2023, Chris Li wrote: > > > > > > How do you quantify the impact of the delayed swap_entry_free()? > > > > > > > > > > > > Since the free and memcg uncharge are now delayed, is there not the > > > > > > possibility that we stay under memory pressure for longer? (Assuming at > > > > > > least some users are swapping because of memory pressure.) > > > > > > > > > > > > I would assume that since the free and uncharge itself is delayed that in > > > > > > the pathological case we'd actually be swapping *more* until the async > > > > > > worker can run. > > > > > > > > > > Thanks for raising this interesting question. > > > > > > > > > > First of all, the swap_entry_free() does not impact "memory.current". > > > > > It reduces "memory.swap.current". Technically it is the swap pressure > > > > > not memory pressure that suffers the extra delay. > > > > > > > > > > Secondly, we are talking about delaying up to 64 swap entries for a > > > > > few microseconds. > > > > > > > > What guarantees that the async freeing happens within a few microseconds? > > > > > > Linux kernel typically doesn't provide RT scheduling guarantees. You > > > can change microseconds to milliseconds, my following reasoning still > > > holds. > > > > > > > What guarantees that the async freeing happens even within 10s? Your > > responses are implying that there is some deadline by which this freeing > > absolutely must happen (us or ms), but I don't know of any strong > > guarantees. > > I think we are in agreement there, there are no such strong guarantees > in linux scheduling. However, when there are free CPU resources, the > job will get scheduled to execute in a reasonable table time frame. If > it does not, I consider that a bug if the CPU has idle resources and > the pending jobs are not able to run for a long time. > The existing code doesn't have such a guarantee either, see my point > follows. I don't know why you want to ask for such a guarantee. > I'm simply trying to compare the pros and cons of the approach. As pointed out previously by Andrew, this approach actually results in *more* work to do freeing. Then, we could have a significant time delay before the freeing is actually done, in addition to the code complexity. And nothing safeguards us from an exponentially increasing amount of freeing that will be done sometime in the future. The current implementation provides strong guarantees that you will do batched freeing that will not accumulate beyond a pre-defined threshold which is proven to work well in practice. My only question was about how we can quantify the impact of the delayed free. We've come to the conclusion that it hasn't been quantified and there is no guarantees on when it will be freed. I'll leave it to those more invested in this path in the page fault handler to provide feedback. Thanks!
Chris Li <chrisl@kernel.org> writes: > On Fri, Dec 22, 2023 at 11:52:08AM -0800, Andrew Morton wrote: >> On Thu, 21 Dec 2023 22:25:39 -0800 Chris Li <chrisl@kernel.org> wrote: >> >> > We discovered that 1% swap page fault is 100us+ while 50% of >> > the swap fault is under 20us. >> > >> > Further investigation show that a large portion of the time >> > spent in the free_swap_slots() function for the long tail case. >> > >> > The percpu cache of swap slots is freed in a batch of 64 entries >> > inside free_swap_slots(). These cache entries are accumulated >> > from previous page faults, which may not be related to the current >> > process. >> > >> > Doing the batch free in the page fault handler causes longer >> > tail latencies and penalizes the current process. >> > >> > Move free_swap_slots() outside of the swapin page fault handler into an >> > async work queue to avoid such long tail latencies. >> >> This will require a larger amount of total work than the current > > Yes, there will be a tiny little bit of extra overhead to schedule the job > on to the other work queue. > >> scheme. So we're trading that off against better latency. >> >> Why is this a good tradeoff? > > That is a very good question. Both Hugh and Wei had asked me similar questions > before. +Hugh. > > The TL;DR is that it makes the swap more palleralizedable. > > Because morden computers typically have more than one CPU and the CPU utilization > is rarely reached to 100%. We are actually not trading the latency for some one > run slower. Most of the time the real impact is that the current swapin page fault > can return quicker so more work can submit to the kernel sooner, at the same time > the other idle CPU can pick up the non latency critical work of freeing of the > swap slot cache entries. The net effect is that we speed things up and increase > the overall system utilization rather than slow things down. You solution depends on there is enough idle time in the system. This isn't always true. In general, all async solutions have 2 possible issues. a) Unrelated applications may be punished. Because they may wait for CPU which is running the async operations. In the original solution, the application swap more will be punished. b) The CPU time cannot be charged to appropriate applications. The original behavior isn't perfect too. But it's better than async worker. Given the runtime of worker is at 100us level, these issues may be not severe. But I think that you may need to explain them at least. And, when swap slots freeing batching was introduced, it was mainly used to reduce the lock contention of sis->lock (via swap_info_get_cont()). So, we may move some operations (e.g., mem_cgroup_uncharge_swap, clear_shadow_from_swap_cache(), etc.) out of batched operation (before calling free_swap_slot()) to reduce the latency impact. > The test result of chromebook and Google production server should be able to show > that it is beneficial to both laptop and server workloads, making them more responsive > in swap related workload. -- Best Regards, Huang, Ying
On Thu, Dec 21, 2023 at 10:25 PM Chris Li <chrisl@kernel.org> wrote: > > We discovered that 1% swap page fault is 100us+ while 50% of > the swap fault is under 20us. > > Further investigation show that a large portion of the time > spent in the free_swap_slots() function for the long tail case. > > The percpu cache of swap slots is freed in a batch of 64 entries > inside free_swap_slots(). These cache entries are accumulated > from previous page faults, which may not be related to the current > process. > > Doing the batch free in the page fault handler causes longer > tail latencies and penalizes the current process. > > Move free_swap_slots() outside of the swapin page fault handler into an > async work queue to avoid such long tail latencies. > > Testing: > > Chun-Tse did some benchmark in chromebook, showing that > zram_wait_metrics improve about 15% with 80% and 95% confidence. > > I recently ran some experiments on about 1000 Google production > machines. It shows swapin latency drops in the long tail > 100us - 500us bucket dramatically. > > platform (100-500us) (0-100us) > A 1.12% -> 0.36% 98.47% -> 99.22% > B 0.65% -> 0.15% 98.96% -> 99.46% > C 0.61% -> 0.23% 98.96% -> 99.38% I recall you mentioning that mem_cgroup_uncharge_swap() is the most expensive part of the batched freeing. If that's the case, I am curious what happens if we move that call outside of the batching (i.e. once the swap entry is no longer used and will be returned to the cache). This should amortize the cost of memcg uncharging and reduce the tail fault latency without extra work. Arguably, it could increase the average fault latency, but not necessarily in a significant way. Ying pointed out something similar if I understand correctly (and other operations that can be moved). Also, if we choose to follow this route, I think there we should flush the async worker in drain_slots_cache_cpu(), right?
On Sun, Dec 24, 2023 at 2:07 PM Chris Li <chrisl@kernel.org> wrote: > > On Sun, Dec 24, 2023 at 1:13 PM David Rientjes <rientjes@google.com> wrote: > > > > On Sun, 24 Dec 2023, Chris Li wrote: > > > > > On Sat, Dec 23, 2023 at 7:01 PM David Rientjes <rientjes@google.com> wrote: > > > > > > > > On Sat, 23 Dec 2023, Chris Li wrote: > > > > > > > > > > How do you quantify the impact of the delayed swap_entry_free()? > > > > > > > > > > > > Since the free and memcg uncharge are now delayed, is there not the > > > > > > possibility that we stay under memory pressure for longer? (Assuming at > > > > > > least some users are swapping because of memory pressure.) > > > > > > > > > > > > I would assume that since the free and uncharge itself is delayed that in > > > > > > the pathological case we'd actually be swapping *more* until the async > > > > > > worker can run. > > > > > > > > > > Thanks for raising this interesting question. > > > > > > > > > > First of all, the swap_entry_free() does not impact "memory.current". > > > > > It reduces "memory.swap.current". Technically it is the swap pressure > > > > > not memory pressure that suffers the extra delay. > > > > > > > > > > Secondly, we are talking about delaying up to 64 swap entries for a > > > > > few microseconds. > > > > > > > > What guarantees that the async freeing happens within a few microseconds? > > > > > > Linux kernel typically doesn't provide RT scheduling guarantees. You > > > can change microseconds to milliseconds, my following reasoning still > > > holds. > > > > > > > What guarantees that the async freeing happens even within 10s? Your > > responses are implying that there is some deadline by which this freeing > > absolutely must happen (us or ms), but I don't know of any strong > > guarantees. > > I think we are in agreement there, there are no such strong guarantees > in linux scheduling. However, when there are free CPU resources, the > job will get scheduled to execute in a reasonable table time frame. If > it does not, I consider that a bug if the CPU has idle resources and > the pending jobs are not able to run for a long time. > The existing code doesn't have such a guarantee either, see my point > follows. I don't know why you want to ask for such a guarantee. > > > If there are no absolute guarantees about when the freeing may now occur, > > I'm asking how the impact of the delayed swap_entry_free() can be > > quantified. > > Presumably each application has their own SLO metrics for monitoring > their application behavior. I am happy to take a look if any app has > new SLO violations caused by this change. > If you have one metric in mind, please name it so we can look at it > together. During my current experiment and the chromebook benchmark, I > haven't noticed such ill effects show up in the other metrics drops in > a statistically significant manner. That is not the same as saying > such drops don't exist at all. Just I haven't noticed or the SLO > watching system hasn't caught it. > > > The benefit to the current implementation is that there *are* strong > > guarantees about when the freeing will occur and cannot grow exponentially > > before the async worker can do the freeing. > > I don't understand your point. Please help me. In the current code, > for the previous swapin fault that releases the swap slots into the > swap slot caches. Let's say the swap slot remains in the cache for X > seconds until Nth (N < 64) swapin page fault later, the cache is full > and finally all 64 swap slot caches are free. Are you suggesting there > is some kind of guarantee X is less than some fixed bound seconds? > What is that bound then? 10 second? 1 minutes? > > BTW, there will be no exponential growth, that is guaranteed. Until > the 64 entries cache were freed. The swapin code will take the direct > free path for the current swap slot in hand. The direct free path > existed before my change. FWIW, it's 64 * the number of CPUs.
Hi Ying, Sorry for the late reply. On Sun, Dec 24, 2023 at 11:10 PM Huang, Ying <ying.huang@intel.com> wrote: > > Chris Li <chrisl@kernel.org> writes: > > > On Fri, Dec 22, 2023 at 11:52:08AM -0800, Andrew Morton wrote: > >> On Thu, 21 Dec 2023 22:25:39 -0800 Chris Li <chrisl@kernel.org> wrote: > >> > >> > We discovered that 1% swap page fault is 100us+ while 50% of > >> > the swap fault is under 20us. > >> > > >> > Further investigation show that a large portion of the time > >> > spent in the free_swap_slots() function for the long tail case. > >> > > >> > The percpu cache of swap slots is freed in a batch of 64 entries > >> > inside free_swap_slots(). These cache entries are accumulated > >> > from previous page faults, which may not be related to the current > >> > process. > >> > > >> > Doing the batch free in the page fault handler causes longer > >> > tail latencies and penalizes the current process. > >> > > >> > Move free_swap_slots() outside of the swapin page fault handler into an > >> > async work queue to avoid such long tail latencies. > >> > >> This will require a larger amount of total work than the current > > > > Yes, there will be a tiny little bit of extra overhead to schedule the job > > on to the other work queue. > > > >> scheme. So we're trading that off against better latency. > >> > >> Why is this a good tradeoff? > > > > That is a very good question. Both Hugh and Wei had asked me similar questions > > before. +Hugh. > > > > The TL;DR is that it makes the swap more palleralizedable. > > > > Because morden computers typically have more than one CPU and the CPU utilization > > is rarely reached to 100%. We are actually not trading the latency for some one > > run slower. Most of the time the real impact is that the current swapin page fault > > can return quicker so more work can submit to the kernel sooner, at the same time > > the other idle CPU can pick up the non latency critical work of freeing of the > > swap slot cache entries. The net effect is that we speed things up and increase > > the overall system utilization rather than slow things down. > > You solution depends on there is enough idle time in the system. This > isn't always true. > > In general, all async solutions have 2 possible issues. > > a) Unrelated applications may be punished. Because they may wait for > CPU which is running the async operations. In the original solution, > the application swap more will be punished. The typical time to perform on the async free is very brief, at about 100ms level. So the amount of punishment would be small. The original behavior was already delaying the freeing of swap slots due to batching. Adding a tiny bit of time does not change the overall behavior too much. Another thing is that, if the async free is pending, it will go through the direct free path. > b) The CPU time cannot be charged to appropriate applications. The > original behavior isn't perfect too. But it's better than async worker. Yes, the original behavior will free other cgroups' swap entries. > Given the runtime of worker is at 100us level, these issues may be not > severe. But I think that you may need to explain them at least. Thanks for the suggestion. Will do in V2. > > And, when swap slots freeing batching was introduced, it was mainly used > to reduce the lock contention of sis->lock (via swap_info_get_cont()). > So, we may move some operations (e.g., mem_cgroup_uncharge_swap, > clear_shadow_from_swap_cache(), etc.) out of batched operation (before > calling free_swap_slot()) to reduce the latency impact. That is good to know. Thanks for the explanation. Chris > > > The test result of chromebook and Google production server should be able to show > > that it is beneficial to both laptop and server workloads, making them more responsive > > in swap related workload. > > -- > Best Regards, > Huang, Ying >
Hi Yosry, On Thu, Dec 28, 2023 at 7:34 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Thu, Dec 21, 2023 at 10:25 PM Chris Li <chrisl@kernel.org> wrote: > > > > We discovered that 1% swap page fault is 100us+ while 50% of > > the swap fault is under 20us. > > > > Further investigation show that a large portion of the time > > spent in the free_swap_slots() function for the long tail case. > > > > The percpu cache of swap slots is freed in a batch of 64 entries > > inside free_swap_slots(). These cache entries are accumulated > > from previous page faults, which may not be related to the current > > process. > > > > Doing the batch free in the page fault handler causes longer > > tail latencies and penalizes the current process. > > > > Move free_swap_slots() outside of the swapin page fault handler into an > > async work queue to avoid such long tail latencies. > > > > Testing: > > > > Chun-Tse did some benchmark in chromebook, showing that > > zram_wait_metrics improve about 15% with 80% and 95% confidence. > > > > I recently ran some experiments on about 1000 Google production > > machines. It shows swapin latency drops in the long tail > > 100us - 500us bucket dramatically. > > > > platform (100-500us) (0-100us) > > A 1.12% -> 0.36% 98.47% -> 99.22% > > B 0.65% -> 0.15% 98.96% -> 99.46% > > C 0.61% -> 0.23% 98.96% -> 99.38% > > I recall you mentioning that mem_cgroup_uncharge_swap() is the most > expensive part of the batched freeing. If that's the case, I am > curious what happens if we move that call outside of the batching > (i.e. once the swap entry is no longer used and will be returned to > the cache). This should amortize the cost of memcg uncharging and > reduce the tail fault latency without extra work. Arguably, it could > increase the average fault latency, but not necessarily in a > significant way. > > Ying pointed out something similar if I understand correctly (and > other operations that can be moved). If the goal is to let the swap fault return as soon as possible. Then the current approach is better. mem_cgroup_uncarge_swap() is only part of it. Not close to all of it. > > Also, if we choose to follow this route, I think there we should flush > the async worker in drain_slots_cache_cpu(), right? Not sure I understand this part. The drain_slots_cache_cpu(), will free the entries already. The current lock around cache->free_lock should protect async workers accessing the entries. What do you mean by flushing? Chris
On Wed, Jan 31, 2024 at 4:57 PM Chris Li <chrisl@kernel.org> wrote: > > Hi Yosry, > > On Thu, Dec 28, 2023 at 7:34 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Thu, Dec 21, 2023 at 10:25 PM Chris Li <chrisl@kernel.org> wrote: > > > > > > We discovered that 1% swap page fault is 100us+ while 50% of > > > the swap fault is under 20us. > > > > > > Further investigation show that a large portion of the time > > > spent in the free_swap_slots() function for the long tail case. > > > > > > The percpu cache of swap slots is freed in a batch of 64 entries > > > inside free_swap_slots(). These cache entries are accumulated > > > from previous page faults, which may not be related to the current > > > process. > > > > > > Doing the batch free in the page fault handler causes longer > > > tail latencies and penalizes the current process. > > > > > > Move free_swap_slots() outside of the swapin page fault handler into an > > > async work queue to avoid such long tail latencies. > > > > > > Testing: > > > > > > Chun-Tse did some benchmark in chromebook, showing that > > > zram_wait_metrics improve about 15% with 80% and 95% confidence. > > > > > > I recently ran some experiments on about 1000 Google production > > > machines. It shows swapin latency drops in the long tail > > > 100us - 500us bucket dramatically. > > > > > > platform (100-500us) (0-100us) > > > A 1.12% -> 0.36% 98.47% -> 99.22% > > > B 0.65% -> 0.15% 98.96% -> 99.46% > > > C 0.61% -> 0.23% 98.96% -> 99.38% > > > > I recall you mentioning that mem_cgroup_uncharge_swap() is the most > > expensive part of the batched freeing. If that's the case, I am > > curious what happens if we move that call outside of the batching > > (i.e. once the swap entry is no longer used and will be returned to > > the cache). This should amortize the cost of memcg uncharging and > > reduce the tail fault latency without extra work. Arguably, it could > > increase the average fault latency, but not necessarily in a > > significant way. > > > > Ying pointed out something similar if I understand correctly (and > > other operations that can be moved). > > If the goal is to let the swap fault return as soon as possible. Then > the current approach is better. > mem_cgroup_uncarge_swap() is only part of it. Not close to all of it. I think there are a lot of operations that we can move out of swapcache_free_entries(): - mem_cgroup_uncharge_swap() - arch_swap_invalidate_page() - zswap_invalidate() - clear_shadow_from_swap_cache() , and maybe others. I am curious, if we move these operations from the batched freeing, would this remove the increased tail latency and make it more consistent, without doing extra work? I believe this is what Ying was also asking about. > > > > > Also, if we choose to follow this route, I think there we should flush > > the async worker in drain_slots_cache_cpu(), right? > Not sure I understand this part. The drain_slots_cache_cpu(), will > free the entries already. The current lock around cache->free_lock > should protect async workers accessing the entries. What do you mean > by flushing? Never mind. I just realized that the percpu caches are static, so they are not freed in drain_slots_cache_cpu(). The NULL check in the async worker should be enough protection.
diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h index 15adfb8c813a..67bc8fa30d63 100644 --- a/include/linux/swap_slots.h +++ b/include/linux/swap_slots.h @@ -19,6 +19,7 @@ struct swap_slots_cache { spinlock_t free_lock; /* protects slots_ret, n_ret */ swp_entry_t *slots_ret; int n_ret; + struct work_struct async_free; }; void disable_swap_slots_cache_lock(void); diff --git a/mm/swap_slots.c b/mm/swap_slots.c index 0bec1f705f8e..a3b306550732 100644 --- a/mm/swap_slots.c +++ b/mm/swap_slots.c @@ -42,8 +42,10 @@ static bool swap_slot_cache_initialized; static DEFINE_MUTEX(swap_slots_cache_mutex); /* Serialize swap slots cache enable/disable operations */ static DEFINE_MUTEX(swap_slots_cache_enable_mutex); +static struct workqueue_struct *swap_free_queue; static void __drain_swap_slots_cache(unsigned int type); +static void swapcache_async_free_entries(struct work_struct *data); #define use_swap_slot_cache (swap_slot_cache_active && swap_slot_cache_enabled) #define SLOTS_CACHE 0x1 @@ -149,6 +151,7 @@ static int alloc_swap_slot_cache(unsigned int cpu) spin_lock_init(&cache->free_lock); cache->lock_initialized = true; } + INIT_WORK(&cache->async_free, swapcache_async_free_entries); cache->nr = 0; cache->cur = 0; cache->n_ret = 0; @@ -269,6 +272,20 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache) return cache->nr; } +static void swapcache_async_free_entries(struct work_struct *data) +{ + struct swap_slots_cache *cache; + + cache = container_of(data, struct swap_slots_cache, async_free); + spin_lock_irq(&cache->free_lock); + /* Swap slots cache may be deactivated before acquiring lock */ + if (cache->slots_ret) { + swapcache_free_entries(cache->slots_ret, cache->n_ret); + cache->n_ret = 0; + } + spin_unlock_irq(&cache->free_lock); +} + void free_swap_slot(swp_entry_t entry) { struct swap_slots_cache *cache; @@ -282,17 +299,14 @@ void free_swap_slot(swp_entry_t entry) goto direct_free; } if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE) { - /* - * Return slots to global pool. - * The current swap_map value is SWAP_HAS_CACHE. - * Set it to 0 to indicate it is available for - * allocation in global pool - */ - swapcache_free_entries(cache->slots_ret, cache->n_ret); - cache->n_ret = 0; + spin_unlock_irq(&cache->free_lock); + queue_work(swap_free_queue, &cache->async_free); + goto direct_free; } cache->slots_ret[cache->n_ret++] = entry; spin_unlock_irq(&cache->free_lock); + if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE) + queue_work(swap_free_queue, &cache->async_free); } else { direct_free: swapcache_free_entries(&entry, 1); @@ -348,3 +362,10 @@ swp_entry_t folio_alloc_swap(struct folio *folio) } return entry; } + +static int __init async_queue_init(void) +{ + swap_free_queue = create_workqueue("async swap cache"); + return 0; +} +subsys_initcall(async_queue_init);
We discovered that 1% swap page fault is 100us+ while 50% of the swap fault is under 20us. Further investigation show that a large portion of the time spent in the free_swap_slots() function for the long tail case. The percpu cache of swap slots is freed in a batch of 64 entries inside free_swap_slots(). These cache entries are accumulated from previous page faults, which may not be related to the current process. Doing the batch free in the page fault handler causes longer tail latencies and penalizes the current process. Move free_swap_slots() outside of the swapin page fault handler into an async work queue to avoid such long tail latencies. Testing: Chun-Tse did some benchmark in chromebook, showing that zram_wait_metrics improve about 15% with 80% and 95% confidence. I recently ran some experiments on about 1000 Google production machines. It shows swapin latency drops in the long tail 100us - 500us bucket dramatically. platform (100-500us) (0-100us) A 1.12% -> 0.36% 98.47% -> 99.22% B 0.65% -> 0.15% 98.96% -> 99.46% C 0.61% -> 0.23% 98.96% -> 99.38% Signed-off-by: Chris Li <chrisl@kernel.org> To: Andrew Morton <akpm@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Cc: Wei Xu <weixugc@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Chun-Tse Shao <ctshao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Brain Geffon <bgeffon@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Huang Ying <ying.huang@intel.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Zhongkun He <hezhongkun.hzk@bytedance.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Barry Song <v-songbaohua@oppo.com> --- include/linux/swap_slots.h | 1 + mm/swap_slots.c | 37 +++++++++++++++++++++++++++++-------- 2 files changed, 30 insertions(+), 8 deletions(-) --- base-commit: eacce8189e28717da6f44ee492b7404c636ae0de change-id: 20231216-async-free-bef392015432 Best regards,