Message ID | 20231207192406.3809579-1-nphamcs@gmail.com (mailing list archive) |
---|---|
State | New |
Series | [v6] zswap: memcontrol: implement zswap writeback disabling |
On Thu, Dec 7, 2023 at 11:24 AM Nhat Pham <nphamcs@gmail.com> wrote: > > During our experiment with zswap, we sometimes observe swap IOs due to > occasional zswap store failures and writebacks-to-swap. These swapping > IOs prevent many users who cannot tolerate swapping from adopting zswap > to save memory and improve performance where possible. > > This patch adds the option to disable this behavior entirely: do not > writeback to backing swapping device when a zswap store attempt fail, > and do not write pages in the zswap pool back to the backing swap > device (both when the pool is full, and when the new zswap shrinker is > called). > > This new behavior can be opted-in/out on a per-cgroup basis via a new > cgroup file. By default, writebacks to swap device is enabled, which is > the previous behavior. Initially, writeback is enabled for the root > cgroup, and a newly created cgroup will inherit the current setting of > its parent. > > Note that this is subtly different from setting memory.swap.max to 0, as > it still allows for pages to be stored in the zswap pool (which itself > consumes swap space in its current form). > > This patch should be applied on top of the zswap shrinker series: > > https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/ > > as it also disables the zswap shrinker, a major source of zswap > writebacks. > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org> > Signed-off-by: Nhat Pham <nphamcs@gmail.com> > Reviewed-by: Yosry Ahmed <yosryahmed@google.com> FTR I still prefer a more generic and future-proof interface (e.g. memory.swap.tiers), but I am not opposed to this. It will just be annoying to have different interfaces with overlapping functionalities in the future if/when the need for a generic interface arises.
On Thu, 7 Dec 2023 11:24:06 -0800 Nhat Pham <nphamcs@gmail.com> wrote: > During our experiment with zswap, we sometimes observe swap IOs due to > occasional zswap store failures and writebacks-to-swap. These swapping > IOs prevent many users who cannot tolerate swapping from adopting zswap > to save memory and improve performance where possible. > > This patch adds the option to disable this behavior entirely: do not > writeback to backing swapping device when a zswap store attempt fail, > and do not write pages in the zswap pool back to the backing swap > device (both when the pool is full, and when the new zswap shrinker is > called). > > This new behavior can be opted-in/out on a per-cgroup basis via a new > cgroup file. By default, writebacks to swap device is enabled, which is > the previous behavior. Initially, writeback is enabled for the root > cgroup, and a newly created cgroup will inherit the current setting of > its parent. > > Note that this is subtly different from setting memory.swap.max to 0, as > it still allows for pages to be stored in the zswap pool (which itself > consumes swap space in its current form). > > This patch should be applied on top of the zswap shrinker series: > > https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/ > > as it also disables the zswap shrinker, a major source of zswap > writebacks. > > ... > > --- a/Documentation/admin-guide/mm/zswap.rst > +++ b/Documentation/admin-guide/mm/zswap.rst > @@ -153,6 +153,12 @@ attribute, e. g.:: > > Setting this parameter to 100 will disable the hysteresis. > > +Some users cannot tolerate the swapping that comes with zswap store failures > +and zswap writebacks. Swapping can be disabled entirely (without disabling > +zswap itself) on a cgroup-basis as follows: > + > + echo 0 > /sys/fs/cgroup/<cgroup-name>/memory.zswap.writeback > + This does seem to be getting down into the weeds. How would a user know (or even suspect) that these things are happening to them? Perhaps it would be helpful to tell people where to go look to determine this. Also, it would be quite helpful of the changelog were to give us some idea of how important this tunable is. What sort of throughput differences might it cause and under what circumstances?
Hi Nhat, On Thu, Dec 7, 2023 at 11:24 AM Nhat Pham <nphamcs@gmail.com> wrote: > > During our experiment with zswap, we sometimes observe swap IOs due to > occasional zswap store failures and writebacks-to-swap. These swapping > IOs prevent many users who cannot tolerate swapping from adopting zswap > to save memory and improve performance where possible. > > This patch adds the option to disable this behavior entirely: do not > writeback to backing swapping device when a zswap store attempt fail, > and do not write pages in the zswap pool back to the backing swap > device (both when the pool is full, and when the new zswap shrinker is > called). > > This new behavior can be opted-in/out on a per-cgroup basis via a new > cgroup file. By default, writebacks to swap device is enabled, which is > the previous behavior. Initially, writeback is enabled for the root > cgroup, and a newly created cgroup will inherit the current setting of > its parent. > > Note that this is subtly different from setting memory.swap.max to 0, as > it still allows for pages to be stored in the zswap pool (which itself > consumes swap space in its current form). > > This patch should be applied on top of the zswap shrinker series: > > https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/ > > as it also disables the zswap shrinker, a major source of zswap > writebacks. I am wondering about the status of "memory.swap.tiers" proof of concept patch? Are we still on board to have this two patch merge together somehow so we can have "memory.swap.tiers" == "all" and "memory.swap.tiers" == "zswap" cover the memory.zswap.writeback == 1 and memory.zswap.writeback == 0 case? Thanks Chris > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org> > Signed-off-by: Nhat Pham <nphamcs@gmail.com> > Reviewed-by: Yosry Ahmed <yosryahmed@google.com> > --- > Documentation/admin-guide/cgroup-v2.rst | 12 ++++++++ > Documentation/admin-guide/mm/zswap.rst | 6 ++++ > include/linux/memcontrol.h | 12 ++++++++ > include/linux/zswap.h | 6 ++++ > mm/memcontrol.c | 38 +++++++++++++++++++++++++ > mm/page_io.c | 6 ++++ > mm/shmem.c | 3 +- > mm/zswap.c | 13 +++++++-- > 8 files changed, 92 insertions(+), 4 deletions(-) > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > index 3f85254f3cef..2b4ac43efdc8 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -1679,6 +1679,18 @@ PAGE_SIZE multiple when read back. > limit, it will refuse to take any more stores before existing > entries fault back in or are written out to disk. > > + memory.zswap.writeback > + A read-write single value file. The default value is "1". The > + initial value of the root cgroup is 1, and when a new cgroup is > + created, it inherits the current value of its parent. > + > + When this is set to 0, all swapping attempts to swapping devices > + are disabled. This included both zswap writebacks, and swapping due > + to zswap store failure. > + > + Note that this is subtly different from setting memory.swap.max to > + 0, as it still allows for pages to be written to the zswap pool. > + > memory.pressure > A read-only nested-keyed file. > > diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst > index 62fc244ec702..cfa653130346 100644 > --- a/Documentation/admin-guide/mm/zswap.rst > +++ b/Documentation/admin-guide/mm/zswap.rst > @@ -153,6 +153,12 @@ attribute, e. 
g.:: > > Setting this parameter to 100 will disable the hysteresis. > > +Some users cannot tolerate the swapping that comes with zswap store failures > +and zswap writebacks. Swapping can be disabled entirely (without disabling > +zswap itself) on a cgroup-basis as follows: > + > + echo 0 > /sys/fs/cgroup/<cgroup-name>/memory.zswap.writeback > + > When there is a sizable amount of cold memory residing in the zswap pool, it > can be advantageous to proactively write these cold pages to swap and reclaim > the memory for other use cases. By default, the zswap shrinker is disabled. > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 43b77363ab8e..5de775e6cdd9 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -219,6 +219,12 @@ struct mem_cgroup { > > #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) > unsigned long zswap_max; > + > + /* > + * Prevent pages from this memcg from being written back from zswap to > + * swap, and from being swapped out on zswap store failures. > + */ > + bool zswap_writeback; > #endif > > unsigned long soft_limit; > @@ -1941,6 +1947,7 @@ static inline void count_objcg_event(struct obj_cgroup *objcg, > bool obj_cgroup_may_zswap(struct obj_cgroup *objcg); > void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size); > void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size); > +bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg); > #else > static inline bool obj_cgroup_may_zswap(struct obj_cgroup *objcg) > { > @@ -1954,6 +1961,11 @@ static inline void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, > size_t size) > { > } > +static inline bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg) > +{ > + /* if zswap is disabled, do not block pages going to the swapping device */ > + return true; > +} > #endif > > #endif /* _LINUX_MEMCONTROL_H */ > diff --git a/include/linux/zswap.h b/include/linux/zswap.h > index 08c240e16a01..a78ceaf3a65e 100644 > --- a/include/linux/zswap.h > +++ b/include/linux/zswap.h > @@ -35,6 +35,7 @@ void zswap_swapoff(int type); > void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg); > void zswap_lruvec_state_init(struct lruvec *lruvec); > void zswap_page_swapin(struct page *page); > +bool is_zswap_enabled(void); > #else > > struct zswap_lruvec_state {}; > @@ -55,6 +56,11 @@ static inline void zswap_swapoff(int type) {} > static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {} > static inline void zswap_lruvec_state_init(struct lruvec *lruvec) {} > static inline void zswap_page_swapin(struct page *page) {} > + > +static inline bool is_zswap_enabled(void) > +{ > + return false; > +} > #endif > > #endif /* _LINUX_ZSWAP_H */ > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index d7bc47316acb..ae8c62c7aa53 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -5538,6 +5538,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) > WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX); > #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) > memcg->zswap_max = PAGE_COUNTER_MAX; > + WRITE_ONCE(memcg->zswap_writeback, > + !parent || READ_ONCE(parent->zswap_writeback)); > #endif > page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX); > if (parent) { > @@ -8174,6 +8176,12 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size) > rcu_read_unlock(); > } > > +bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg) > +{ > + /* if zswap is disabled, do not block pages going to 
the swapping device */ > + return !is_zswap_enabled() || !memcg || READ_ONCE(memcg->zswap_writeback); > +} > + > static u64 zswap_current_read(struct cgroup_subsys_state *css, > struct cftype *cft) > { > @@ -8206,6 +8214,31 @@ static ssize_t zswap_max_write(struct kernfs_open_file *of, > return nbytes; > } > > +static int zswap_writeback_show(struct seq_file *m, void *v) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); > + > + seq_printf(m, "%d\n", READ_ONCE(memcg->zswap_writeback)); > + return 0; > +} > + > +static ssize_t zswap_writeback_write(struct kernfs_open_file *of, > + char *buf, size_t nbytes, loff_t off) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > + int zswap_writeback; > + ssize_t parse_ret = kstrtoint(strstrip(buf), 0, &zswap_writeback); > + > + if (parse_ret) > + return parse_ret; > + > + if (zswap_writeback != 0 && zswap_writeback != 1) > + return -EINVAL; > + > + WRITE_ONCE(memcg->zswap_writeback, zswap_writeback); > + return nbytes; > +} > + > static struct cftype zswap_files[] = { > { > .name = "zswap.current", > @@ -8218,6 +8251,11 @@ static struct cftype zswap_files[] = { > .seq_show = zswap_max_show, > .write = zswap_max_write, > }, > + { > + .name = "zswap.writeback", > + .seq_show = zswap_writeback_show, > + .write = zswap_writeback_write, > + }, > { } /* terminate */ > }; > #endif /* CONFIG_MEMCG_KMEM && CONFIG_ZSWAP */ > diff --git a/mm/page_io.c b/mm/page_io.c > index cb559ae324c6..5e606f1aa2f6 100644 > --- a/mm/page_io.c > +++ b/mm/page_io.c > @@ -201,6 +201,12 @@ int swap_writepage(struct page *page, struct writeback_control *wbc) > folio_end_writeback(folio); > return 0; > } > + > + if (!mem_cgroup_zswap_writeback_enabled(folio_memcg(folio))) { > + folio_mark_dirty(folio); > + return AOP_WRITEPAGE_ACTIVATE; > + } > + > __swap_writepage(&folio->page, wbc); > return 0; > } > diff --git a/mm/shmem.c b/mm/shmem.c > index c62f904ba1ca..dd084fbafcf1 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -1514,8 +1514,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) > > mutex_unlock(&shmem_swaplist_mutex); > BUG_ON(folio_mapped(folio)); > - swap_writepage(&folio->page, wbc); > - return 0; > + return swap_writepage(&folio->page, wbc); > } > > mutex_unlock(&shmem_swaplist_mutex); > diff --git a/mm/zswap.c b/mm/zswap.c > index daaa949837f2..7ee54a3d8281 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -153,6 +153,11 @@ static bool zswap_shrinker_enabled = IS_ENABLED( > CONFIG_ZSWAP_SHRINKER_DEFAULT_ON); > module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644); > > +bool is_zswap_enabled(void) > +{ > + return zswap_enabled; > +} > + > /********************************* > * data structures > **********************************/ > @@ -596,7 +601,8 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker, > struct zswap_pool *pool = shrinker->private_data; > bool encountered_page_in_swapcache = false; > > - if (!zswap_shrinker_enabled) { > + if (!zswap_shrinker_enabled || > + !mem_cgroup_zswap_writeback_enabled(sc->memcg)) { > sc->nr_scanned = 0; > return SHRINK_STOP; > } > @@ -637,7 +643,7 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker, > struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid)); > unsigned long nr_backing, nr_stored, nr_freeable, nr_protected; > > - if (!zswap_shrinker_enabled) > + if (!zswap_shrinker_enabled || !mem_cgroup_zswap_writeback_enabled(memcg)) > return 0; > > #ifdef CONFIG_MEMCG_KMEM > @@ -956,6 +962,9 @@ static 
int shrink_memcg(struct mem_cgroup *memcg) > struct zswap_pool *pool; > int nid, shrunk = 0; > > + if (!mem_cgroup_zswap_writeback_enabled(memcg)) > + return -EINVAL; > + > /* > * Skip zombies because their LRUs are reparented and we would be > * reclaiming from the parent instead of the dead memcg. > > base-commit: cdcab2d34f129f593c0afbb2493bcaf41f4acd61 > -- > 2.34.1 >
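The per-cgroup knob documented in the patch above can be exercised roughly as follows. This is only a usage sketch against the documented semantics: it assumes a cgroup v2 mount at /sys/fs/cgroup and a kernel carrying this patch, and the cgroup name "workload" is just an example.

    # Assumes cgroup v2 mounted at /sys/fs/cgroup; "workload" is an example name.
    cd /sys/fs/cgroup
    mkdir -p workload

    # The default is 1 (writeback allowed); a new cgroup inherits its parent's value.
    cat workload/memory.zswap.writeback

    # Disable writeback to the backing swap device for this cgroup only.
    # Pages can still be stored in (and loaded back from) the zswap pool.
    echo 0 > workload/memory.zswap.writeback

    # Only 0 and 1 are accepted; zswap_writeback_write() above rejects
    # anything else with -EINVAL.
    echo 2 > workload/memory.zswap.writeback   # fails with "Invalid argument"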
> > This does seem to be getting down into the weeds. How would a user > know (or even suspect) that these things are happening to them? Perhaps > it would be helpful to tell people where to go look to determine this. When I test this feature during its development, I primarily just look at the swapin/major fault counters to see if I'm experiencing swapping IO, and when writeback is disabled, if the IO is still there. We can also poll these counters over time and plot them/compute their rate of change. I just assumed this is usually the standard practice, and not very zswap-specific in general, so I did not specify it in the zswap documentation. > > Also, it would be quite helpful of the changelog were to give us some > idea of how important this tunable is. What sort of throughput > differences might it cause and under what circumstances? For the most part, this feature is motivated by internal parties who have already established their opinions regarding swapping - the workloads that are highly sensitive to IO, and especially those who are using servers with really slow disk performance (for instance, massive but slow HDDs). For these folks, it's impossible to convince them to even entertain zswap if swapping also comes as a packaged deal. Writeback disabling is quite a useful feature in these situations - on a mixed workloads deployment, they can disable writeback for the more IO-sensitive workloads, and enable writeback for other background workloads. (Maybe we should include the paragraph above as part of the changelog?) I don't have any concrete numbers though - any numbers I can pull out are from highly artificial tasks that only serve to test the correctness aspect of the implementation. zswap.writeback disablement would of course be faster in these situations (up to 33%!!!!) - but that's basically just saying HDD is slow. Which is not very informative or surprising, so I did not include it in the changelog.
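As an illustration of the monitoring approach Nhat describes (this is not part of the patch): one way to watch for swap IO is to poll the pswpin/pswpout counters in /proc/vmstat and compute their rate of change; per-cgroup signals such as pgmajfault are available in memory.stat. A minimal sketch:

    # Sample swap-in/swap-out rates (pages/s) every 10 seconds.
    read_ctr() { awk -v k="$1" '$1 == k {print $2}' /proc/vmstat; }
    prev_in=$(read_ctr pswpin); prev_out=$(read_ctr pswpout)
    while sleep 10; do
        in=$(read_ctr pswpin); out=$(read_ctr pswpout)
        echo "$(date +%T) swapin/s=$(( (in - prev_in) / 10 )) swapout/s=$(( (out - prev_out) / 10 ))"
        prev_in=$in; prev_out=$out
    done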
On Thu, Dec 7, 2023 at 4:19 PM Chris Li <chrisl@kernel.org> wrote: > > Hi Nhat, > > > On Thu, Dec 7, 2023 at 11:24 AM Nhat Pham <nphamcs@gmail.com> wrote: > > > > During our experiment with zswap, we sometimes observe swap IOs due to > > occasional zswap store failures and writebacks-to-swap. These swapping > > IOs prevent many users who cannot tolerate swapping from adopting zswap > > to save memory and improve performance where possible. > > > > This patch adds the option to disable this behavior entirely: do not > > writeback to backing swapping device when a zswap store attempt fail, > > and do not write pages in the zswap pool back to the backing swap > > device (both when the pool is full, and when the new zswap shrinker is > > called). > > > > This new behavior can be opted-in/out on a per-cgroup basis via a new > > cgroup file. By default, writebacks to swap device is enabled, which is > > the previous behavior. Initially, writeback is enabled for the root > > cgroup, and a newly created cgroup will inherit the current setting of > > its parent. > > > > Note that this is subtly different from setting memory.swap.max to 0, as > > it still allows for pages to be stored in the zswap pool (which itself > > consumes swap space in its current form). > > > > This patch should be applied on top of the zswap shrinker series: > > > > https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/ > > > > as it also disables the zswap shrinker, a major source of zswap > > writebacks. > > I am wondering about the status of "memory.swap.tiers" proof of concept patch? > Are we still on board to have this two patch merge together somehow so > we can have > "memory.swap.tiers" == "all" and "memory.swap.tiers" == "zswap" cover the > memory.zswap.writeback == 1 and memory.zswap.writeback == 0 case? > > Thanks > > Chris > Hi Chris, I briefly summarized my recent discussion with Johannes here: https://lore.kernel.org/all/CAKEwX=NwGGRAtXoNPfq63YnNLBCF0ZDOdLVRsvzUmYhK4jxzHA@mail.gmail.com/ TL;DR is we acknowledge the potential usefulness of swap.tiers interface, but the use case is not quite there yet, so it does not make too much sense to build up that heavy machinery now. zswap.writeback is a more urgent need, and does not prevent swap.tiers if we do decide to implement it.
On Thu, Dec 7, 2023 at 5:03 PM Nhat Pham <nphamcs@gmail.com> wrote: > > On Thu, Dec 7, 2023 at 4:19 PM Chris Li <chrisl@kernel.org> wrote: > > > > Hi Nhat, > > > > > > On Thu, Dec 7, 2023 at 11:24 AM Nhat Pham <nphamcs@gmail.com> wrote: > > > > > > During our experiment with zswap, we sometimes observe swap IOs due to > > > occasional zswap store failures and writebacks-to-swap. These swapping > > > IOs prevent many users who cannot tolerate swapping from adopting zswap > > > to save memory and improve performance where possible. > > > > > > This patch adds the option to disable this behavior entirely: do not > > > writeback to backing swapping device when a zswap store attempt fail, > > > and do not write pages in the zswap pool back to the backing swap > > > device (both when the pool is full, and when the new zswap shrinker is > > > called). > > > > > > This new behavior can be opted-in/out on a per-cgroup basis via a new > > > cgroup file. By default, writebacks to swap device is enabled, which is > > > the previous behavior. Initially, writeback is enabled for the root > > > cgroup, and a newly created cgroup will inherit the current setting of > > > its parent. > > > > > > Note that this is subtly different from setting memory.swap.max to 0, as > > > it still allows for pages to be stored in the zswap pool (which itself > > > consumes swap space in its current form). > > > > > > This patch should be applied on top of the zswap shrinker series: > > > > > > https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/ > > > > > > as it also disables the zswap shrinker, a major source of zswap > > > writebacks. > > > > I am wondering about the status of "memory.swap.tiers" proof of concept patch? > > Are we still on board to have this two patch merge together somehow so > > we can have > > "memory.swap.tiers" == "all" and "memory.swap.tiers" == "zswap" cover the > > memory.zswap.writeback == 1 and memory.zswap.writeback == 0 case? > > > > Thanks > > > > Chris > > > > Hi Chris, > > I briefly summarized my recent discussion with Johannes here: > > https://lore.kernel.org/all/CAKEwX=NwGGRAtXoNPfq63YnNLBCF0ZDOdLVRsvzUmYhK4jxzHA@mail.gmail.com/ > > TL;DR is we acknowledge the potential usefulness of swap.tiers > interface, but the use case is not quite there yet, so it does not > make too much sense to build up that heavy machinery now. > zswap.writeback is a more urgent need, and does not prevent swap.tiers > if we do decide to implement it. I am honestly not convinced by this. There is no heavy machinery here. The interface is more generic and extensible, but the implementation is roughly the same. Unless we have a reason to think a swap.tiers interface may make it difficult to extend this later or will not support some use cases, I think we should go ahead with it. If we are worried that "tiers" may not accurately describe future use cases, we can be more generic and call it swap.types or something.
On Thu, Dec 7, 2023 at 4:42 PM Nhat Pham <nphamcs@gmail.com> wrote: > [..] > > I don't have any concrete numbers though - any numbers I can pull out > are from highly artificial tasks that only serve to test the > correctness aspect of the implementation. zswap.writeback disablement > would of course be faster in these situations (up to 33%!!!!) - but > that's basically just saying HDD is slow. Which is not very > informative or surprising, so I did not include it in the changelog. For instance, on a server with HDD, I allocate memory and populate it with random values (so that zswap store will always fail), and specify memory.high low enough to trigger reclaim. The time it takes to allocate the memory and just read through it a couple of times (doing silly things like computing the values' average etc.):

zswap.writeback disabled:
real 0m30.537s
user 0m23.687s
sys 0m6.637s
0 pages swapped in
0 pages swapped out

zswap.writeback enabled:
real 0m45.061s
user 0m24.310s
sys 0m8.892s
712686 pages swapped in
461093 pages swapped out

(the last two lines are from vmstat -s).
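The experiment Nhat describes can be reconstructed roughly as follows. The cgroup name, the memory.high limit, and the memory-hog program are illustrative placeholders, not the exact setup that produced the numbers above.

    # Sketch only; ./memhog-random stands in for a hypothetical program that
    # fills memory with random (incompressible) data and re-reads it.
    cd /sys/fs/cgroup
    mkdir -p zswap-test
    echo 1G > zswap-test/memory.high             # low enough to force reclaim
    echo 0 > zswap-test/memory.zswap.writeback   # set to 1 for the comparison run

    echo $$ > zswap-test/cgroup.procs
    time ./memhog-random 2G

    vmstat -s | grep -E 'pages swapped (in|out)'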
On Thu, Dec 07, 2023 at 05:12:13PM -0800, Yosry Ahmed wrote: > On Thu, Dec 7, 2023 at 5:03 PM Nhat Pham <nphamcs@gmail.com> wrote: > > On Thu, Dec 7, 2023 at 4:19 PM Chris Li <chrisl@kernel.org> wrote: > > > I am wondering about the status of "memory.swap.tiers" proof of concept patch? > > > Are we still on board to have this two patch merge together somehow so > > > we can have > > > "memory.swap.tiers" == "all" and "memory.swap.tiers" == "zswap" cover the > > > memory.zswap.writeback == 1 and memory.zswap.writeback == 0 case? > > > > > > Thanks > > > > > > Chris > > > > > > > Hi Chris, > > > > I briefly summarized my recent discussion with Johannes here: > > > > https://lore.kernel.org/all/CAKEwX=NwGGRAtXoNPfq63YnNLBCF0ZDOdLVRsvzUmYhK4jxzHA@mail.gmail.com/ > > > > TL;DR is we acknowledge the potential usefulness of swap.tiers > > interface, but the use case is not quite there yet, so it does not > > make too much sense to build up that heavy machinery now. > > zswap.writeback is a more urgent need, and does not prevent swap.tiers > > if we do decide to implement it. > > I am honestly not convinced by this. There is no heavy machinery here. > The interface is more generic and extensible, but the implementation > is roughly the same. Unless we have a reason to think a swap.tiers > interface may make it difficult to extend this later or will not > support some use cases, I think we should go ahead with it. If we are > worried that "tiers" may not accurately describe future use cases, we > can be more generic and call it swap.types or something. I have to disagree. The generic swap types or tiers ideas actually look pretty far-fetched to me, and there is a lack of convincing explanation for why this is even a probable direction for swap. For example, 1. What are the other backends? Where you seem to see a multitude of backends and arbitrary hierarchies of them, I see compression and flash, and really not much else. And there is only one reasonable direction in which to combine those two. The IOPs and latencies of HDDs and network compared to modern memory sizes and compute speeds make them for the most part impractical as paging backends. So I don't see a common third swap backend, let alone a fourth or a fifth, or a multitude of meaningful ways of combining them... 2. Even if the usecases were there, enabling this would be a ton of work and open interface questions: 1) There is no generic code to transfer pages between arbitrary backends. 2) There is no accepted indirection model where a swap pte can refer to backends dynamically, in a way that makes migration feasible at scale. 3) Arbitrary global strings are somewhat unlikely to be accepted as a way to configure these hierarchies. 4) Backend file paths in a global sysfs file don't work well with namespacing. The swapfile could be in a container namespace. Containers are not guaranteed to see /sys. 5) Fixed keywords like "zswap" might not be good enough - what about compression and backend parameters? None of these are insurmountable. My point is that this would be a huge amount of prerequisite code and effort for what seems would be a fringe usecase at best right now. And there could be a lot of curve balls in both the software design as well as the hardware development between now and then that could make your proposals moot. Is a per-cgroup string file really going to be the right way to configure arbitrary hierarchies if they materialize? This strikes me as premature and speculative, for what could be, some day. 
We don't even do it for *internal API*. There is a review rule to introduce a function in the same patch as its first caller, to make sure it's the right abstraction and a good fit for the usecase. There is no way we can have a lower bar than that for permanent ABI. The patch here integrates with what zswap is NOW and always has been: a compressing writeback cache for swap. Should multiple swap tiers overcome all the above and actually become real, this knob here would be the least of our worries. It would be easy to just ignore, automatically override, or deprecate. So I don't think you made a reasonable proposal for an alternative, or gave convincing reasons to hold off this one.
On Thu, 7 Dec 2023 16:42:59 -0800 Nhat Pham <nphamcs@gmail.com> wrote: > > > > Also, it would be quite helpful of the changelog were to give us some > > idea of how important this tunable is. What sort of throughput > > differences might it cause and under what circumstances? > > For the most part, this feature is motivated by internal parties who > have already established their opinions regarding swapping - the > workloads that are highly sensitive to IO, and especially those who > are using servers with really slow disk performance (for instance, > massive but slow HDDs). For these folks, it's impossible to convince > them to even entertain zswap if swapping also comes as a packaged > deal. Writeback disabling is quite a useful feature in these > situations - on a mixed workloads deployment, they can disable > writeback for the more IO-sensitive workloads, and enable writeback > for other background workloads. > > (Maybe we should include the paragraph above as part of the changelog?) I pasted it in, thanks.
On Thu, 7 Dec 2023 17:14:22 -0800 Nhat Pham <nphamcs@gmail.com> wrote: > > I don't have any concrete numbers though - any numbers I can pull out > > are from highly artificial tasks that only serve to test the > > correctness aspect of the implementation. zswap.writeback disablement > > would of course be faster in these situations (up to 33%!!!!) - but > > that's basically just saying HDD is slow. Which is not very > > informative or surprising, so I did not include it in the changelog. > > For instance, on a server with HDD, I allocate memories and populate > them with random values (so that zswap store will always fail), and > specify memory.high low enough to trigger reclaim. The time it takes > to allocate the memories and just read through it a couple of times > (doing silly things like computing the values' average etc.): > > zswap.writeback disabled: > real 0m30.537s > user 0m23.687s > sys 0m6.637s > 0 pages swapped in > 0 pages swapped out > > zswap.writeback enabled: > real 0m45.061s > user 0m24.310s > sys 0m8.892s > 712686 pages swapped in > 461093 pages swapped out > > (the last two lines are from vmstat -s). I pasted that also.
On Fri, Dec 8, 2023 at 8:34 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Thu, Dec 07, 2023 at 05:12:13PM -0800, Yosry Ahmed wrote: > > On Thu, Dec 7, 2023 at 5:03 PM Nhat Pham <nphamcs@gmail.com> wrote: > > > On Thu, Dec 7, 2023 at 4:19 PM Chris Li <chrisl@kernel.org> wrote: > > > > I am wondering about the status of "memory.swap.tiers" proof of concept patch? > > > > Are we still on board to have this two patch merge together somehow so > > > > we can have > > > > "memory.swap.tiers" == "all" and "memory.swap.tiers" == "zswap" cover the > > > > memory.zswap.writeback == 1 and memory.zswap.writeback == 0 case? > > > > > > > > Thanks > > > > > > > > Chris > > > > > > > > > > Hi Chris, > > > > > > I briefly summarized my recent discussion with Johannes here: > > > > > > https://lore.kernel.org/all/CAKEwX=NwGGRAtXoNPfq63YnNLBCF0ZDOdLVRsvzUmYhK4jxzHA@mail.gmail.com/ > > > > > > TL;DR is we acknowledge the potential usefulness of swap.tiers > > > interface, but the use case is not quite there yet, so it does not > > > make too much sense to build up that heavy machinery now. > > > zswap.writeback is a more urgent need, and does not prevent swap.tiers > > > if we do decide to implement it. > > > > I am honestly not convinced by this. There is no heavy machinery here. > > The interface is more generic and extensible, but the implementation > > is roughly the same. Unless we have a reason to think a swap.tiers > > interface may make it difficult to extend this later or will not > > support some use cases, I think we should go ahead with it. If we are > > worried that "tiers" may not accurately describe future use cases, we > > can be more generic and call it swap.types or something. > > I have to disagree. The generic swap types or tiers ideas actually > look pretty far-fetched to me, and there is a lack of convincing > explanation for why this is even a probable direction for swap. > > For example, > > 1. What are the other backends? Where you seem to see a multitude of > backends and arbitrary hierarchies of them, I see compression and > flash, and really not much else. And there is only one reasonable > direction in which to combine those two. > > The IOPs and latencies of HDDs and network compared to modern > memory sizes and compute speeds make them for the most part > impractical as paging backends. > > So I don't see a common third swap backend, let alone a fourth or a > fifth, or a multitude of meaningful ways of combining them... > > 2. Even if the usecases were there, enabling this would be a ton of > work and open interface questions: > > 1) There is no generic code to transfer pages between arbitrary > backends. > > 2) There is no accepted indirection model where a swap pte can refer > to backends dynamically, in a way that makes migration feasible > at scale. > > 3) Arbitrary global strings are somewhat unlikely to be accepted as > a way to configure these hierarchies. > > 4) Backend file paths in a global sysfs file don't work well with > namespacing. The swapfile could be in a container > namespace. Containers are not guaranteed to see /sys. > > 5) Fixed keywords like "zswap" might not be good enough - what about > compression and backend parameters? > > None of these are insurmountable. My point is that this would be a > huge amount of prerequisite code and effort for what seems would be a > fringe usecase at best right now. 
> > And there could be a lot of curve balls in both the software design as > well as the hardware development between now and then that could make > your proposals moot. Is a per-cgroup string file really going to be > the right way to configure arbitrary hierarchies if they materialize? > > This strikes me as premature and speculative, for what could be, some > day. It is, and I agree with all your reasoning above. My thinking was that if we create the file now, and we end up only supporting "all" and "zswap" values (equivalent to memory.zswap.writeback), that's fine, but if we can find a way to extend it, that would be nice. I honestly didn't think that "zswap" as a keyword may be a problem later on. > > We don't even do it for *internal API*. There is a review rule to > introduce a function in the same patch as its first caller, to make > sure it's the right abstraction and a good fit for the usecase. There > is no way we can have a lower bar than that for permanent ABI. > > The patch here integrates with what zswap is NOW and always has been: > a compressing writeback cache for swap. > > Should multiple swap tiers overcome all the above and actually become > real, this knob here would be the least of our worries. It would be > easy to just ignore, automatically override, or deprecate. My main concern was having two knobs with overlapping functionalities at that point, and we would need to handle invalid combinations, or make one of the knobs superior to the other one, which is not ideal. Having said that, you are right that we are probably a long ways from supporting more swap types anyway, and having two knobs wouldn't be a blocker then, only a nuisance that we need to take into consideration. I am fine with memory.zswap.writeback being merged as-is (as I mentioned earlier), it would have just been nice if we could agree on a more generic interface, but you thoroughly proved that it is too early to do that. Thanks!
Hi Nhat, On Thu, Dec 7, 2023 at 5:03 PM Nhat Pham <nphamcs@gmail.com> wrote: > > On Thu, Dec 7, 2023 at 4:19 PM Chris Li <chrisl@kernel.org> wrote: > > > > Hi Nhat, > > > > > > On Thu, Dec 7, 2023 at 11:24 AM Nhat Pham <nphamcs@gmail.com> wrote: > > > > > > During our experiment with zswap, we sometimes observe swap IOs due to > > > occasional zswap store failures and writebacks-to-swap. These swapping > > > IOs prevent many users who cannot tolerate swapping from adopting zswap > > > to save memory and improve performance where possible. > > > > > > This patch adds the option to disable this behavior entirely: do not > > > writeback to backing swapping device when a zswap store attempt fail, > > > and do not write pages in the zswap pool back to the backing swap > > > device (both when the pool is full, and when the new zswap shrinker is > > > called). > > > > > > This new behavior can be opted-in/out on a per-cgroup basis via a new > > > cgroup file. By default, writebacks to swap device is enabled, which is > > > the previous behavior. Initially, writeback is enabled for the root > > > cgroup, and a newly created cgroup will inherit the current setting of > > > its parent. > > > > > > Note that this is subtly different from setting memory.swap.max to 0, as > > > it still allows for pages to be stored in the zswap pool (which itself > > > consumes swap space in its current form). > > > > > > This patch should be applied on top of the zswap shrinker series: > > > > > > https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/ > > > > > > as it also disables the zswap shrinker, a major source of zswap > > > writebacks. > > > > I am wondering about the status of "memory.swap.tiers" proof of concept patch? > > Are we still on board to have this two patch merge together somehow so > > we can have > > "memory.swap.tiers" == "all" and "memory.swap.tiers" == "zswap" cover the > > memory.zswap.writeback == 1 and memory.zswap.writeback == 0 case? > > > > Thanks > > > > Chris > > > > Hi Chris, > > I briefly summarized my recent discussion with Johannes here: > > https://lore.kernel.org/all/CAKEwX=NwGGRAtXoNPfq63YnNLBCF0ZDOdLVRsvzUmYhK4jxzHA@mail.gmail.com/ Sorry I am traveling in a different time zone so not able to get to that email sooner. That email is only sent out less than one day before the V6 patch right? > > TL;DR is we acknowledge the potential usefulness of swap.tiers > interface, but the use case is not quite there yet, so it does not I disagree about no use case. No use case for Meta != no usage case for the rest of the linux kernel community. That mindset really needs to shift to do Linux kernel development. Respect other's usage cases. It is not just Meta's Linux kernel. It is everybody's Linux kernel. I can give you three usage cases right now: 1) Google producting kernel uses SSD only swap, it is currently on pilot. This is not expressible by the memory.zswap.writeback. You can set the memory.zswap.max = 0 and memory.zswap.writeback = 1, then SSD backed swapfile. But the whole thing feels very clunky, especially what you really want is SSD only swap, you need to do all this zswap config dance. Google has an internal memory.swapfile feature implemented per cgroup swap file type by "zswap only", "real swap file only", "both", "none" (the exact keyword might be different). running in the production for almost 10 years. The need for more than zswap type of per cgroup control is really there. 
2) As indicated by this discussion, Tencent has a usage case for SSD and hard disk swap as overflow. https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/ +Kairui 3) Android has some fancy swap ideas led by those patches. https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@kernel.org/ It got shot down due to removal of frontswap. But the usage case and product requirement is there. +Minchan > make too much sense to build up that heavy machinery now. Does my minimal memory.swap.tiers patch to support "zswap" and "all" sound like heavy machinery to you? It is the same implementation under the hood. I am only trying to avoid introducing something that will be foreseeably obsolete. > zswap.writeback is a more urgent need, and does not prevent swap.tiers > if we do decide to implement it. I respect that urgent need, that is why I Acked the V5 patch, under the understanding that this zswap.writeback is not carved into stone. When a better interface comes along, that interface can be made obsolete. Frankly speaking I would much prefer not introducing a cgroup API which will be obsolete soon. If you think zswap.writeback is not removable when another better alternative is available, please voice it now. If you squash my minimal memory.swap.tiers patch, it will also address your urgent need for merging the "zswap.writeback", no? Chris
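For context: memory.swap.tiers is only a proposed interface from the proof-of-concept patch Chris links above; it is not in mainline. The equivalence he describes would map roughly as follows, with memory.zswap.writeback being the knob added by the patch under discussion.

    # Hypothetical mapping; memory.swap.tiers is NOT a merged interface.
    echo all   > memory.swap.tiers     # roughly: echo 1 > memory.zswap.writeback
    echo zswap > memory.swap.tiers     # roughly: echo 0 > memory.zswap.writeback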
Hi Yosry, On Thu, Dec 7, 2023 at 5:12 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > I briefly summarized my recent discussion with Johannes here: > > > > https://lore.kernel.org/all/CAKEwX=NwGGRAtXoNPfq63YnNLBCF0ZDOdLVRsvzUmYhK4jxzHA@mail.gmail.com/ > > > > TL;DR is we acknowledge the potential usefulness of swap.tiers > > interface, but the use case is not quite there yet, so it does not > > make too much sense to build up that heavy machinery now. > > zswap.writeback is a more urgent need, and does not prevent swap.tiers > > if we do decide to implement it. > > I am honestly not convinced by this. There is no heavy machinery here. > The interface is more generic and extensible, but the implementation > is roughly the same. Unless we have a reason to think a swap.tiers > interface may make it difficult to extend this later or will not > support some use cases, I think we should go ahead with it. If we are > worried that "tiers" may not accurately describe future use cases, we > can be more generic and call it swap.types or something. > +100. Chris
Hi Johannes, On Fri, Dec 8, 2023 at 8:35 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Thu, Dec 07, 2023 at 05:12:13PM -0800, Yosry Ahmed wrote: > > On Thu, Dec 7, 2023 at 5:03 PM Nhat Pham <nphamcs@gmail.com> wrote: > > > On Thu, Dec 7, 2023 at 4:19 PM Chris Li <chrisl@kernel.org> wrote: > > > > I am wondering about the status of "memory.swap.tiers" proof of concept patch? > > > > Are we still on board to have this two patch merge together somehow so > > > > we can have > > > > "memory.swap.tiers" == "all" and "memory.swap.tiers" == "zswap" cover the > > > > memory.zswap.writeback == 1 and memory.zswap.writeback == 0 case? > > > > > > > > Thanks > > > > > > > > Chris > > > > > > > > > > Hi Chris, > > > > > > I briefly summarized my recent discussion with Johannes here: > > > > > > https://lore.kernel.org/all/CAKEwX=NwGGRAtXoNPfq63YnNLBCF0ZDOdLVRsvzUmYhK4jxzHA@mail.gmail.com/ > > > > > > TL;DR is we acknowledge the potential usefulness of swap.tiers > > > interface, but the use case is not quite there yet, so it does not > > > make too much sense to build up that heavy machinery now. > > > zswap.writeback is a more urgent need, and does not prevent swap.tiers > > > if we do decide to implement it. > > > > I am honestly not convinced by this. There is no heavy machinery here. > > The interface is more generic and extensible, but the implementation > > is roughly the same. Unless we have a reason to think a swap.tiers > > interface may make it difficult to extend this later or will not > > support some use cases, I think we should go ahead with it. If we are > > worried that "tiers" may not accurately describe future use cases, we > > can be more generic and call it swap.types or something. > > I have to disagree. The generic swap types or tiers ideas actually > look pretty far-fetched to me, and there is a lack of convincing > explanation for why this is even a probable direction for swap. It boils down to there being more than just "zswap + SSD" usage cases in other parts of the Linux communities. The need is real and it is just a question of how to get there. > > For example, > > 1. What are the other backends? Where you seem to see a multitude of > backends and arbitrary hierarchies of them, I see compression and > flash, and really not much else. And there is only one reasonable > direction in which to combine those two. I list a few other usage cases here in an earlier email of the same thread. https://lore.kernel.org/linux-mm/CAF8kJuNpnqTM5x1QmQ7h-FaRWVnHBdNGvGvB3txohSOmZhYA-Q@mail.gmail.com/T/#t TL;DR: 1) Google has had an internal memory.swapfile in production for almost 10 years. 2) Tencent uses hard drives as SSD swap overflow. +Kairui. https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/ 3) Android has more fancy swap usage. +Kimchan https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@kernel.org/ You can't imagine such an usage is not the reason to block others for such usage. Please respect other usage cases as well. As for the other backends, the first minimal milestone we can implement is "all" and "zswap", which is functional equivalent to the memory.zswap.writeback. Other common back ends there are SSD and hard drive. The exact keywords and tiers are up for future discussion. I just want to acknowledge the need is there. > > The IOPs and latencies of HDDs and network compared to modern > memory sizes and compute speeds make them for the most part > impractical as paging backends. 
I don't see that being a problem as lower tiers for very cold swaps. Again, it might not be a usage case for Meta but please respect the usage case for others. > > So I don't see a common third swap backend, let alone a fourth or a > fifth, or a multitude of meaningful ways of combining them... Does not stop others from wanting to use them. > > 2. Even if the usecases were there, enabling this would be a ton of > work and open interface questions: If we can agree this interface is more flexible and covers more usage cases than "zswap.writeback". We can start from the minimal implementation of "zswap" and "all". There is not a ton of work in the first minimal milestone. I send out the minimal patch here: https://lore.kernel.org/linux-mm/ZVrHXJLxvs4_CUxc@google.com/ > > 1) There is no generic code to transfer pages between arbitrary > backends. True, but it does not have to be there to make "swap.tiers" useful. It can start without a transfer page between backends. It is just a more flexible way to specify what swap the cgroup wants to opt in initially, that is a solid need. Everything else can be done later. > > 2) There is no accepted indirection model where a swap pte can refer > to backends dynamically, in a way that makes migration feasible > at scale. Same as above. > > 3) Arbitrary global strings are somewhat unlikely to be accepted as > a way to configure these hierarchies. It does not need to be an arbitrary string. We have a string to config cgroup.subtree_control for example. I am not saying just borrow the subtree control syntax. But we can do some thing similar to that. > > 4) Backend file paths in a global sysfs file don't work well with > namespacing. The swapfile could be in a container > namespace. Containers are not guaranteed to see /sys. Can you clarify what usage problem you are trying to solve? The containers are typically managed by the container manager's service. The service sees /sys for sure. Do you want a cgroup interface to allow the container usage to self-service the swap file? > > 5) Fixed keywords like "zswap" might not be good enough - what about > compression and backend parameters? I think those levels of detail are zswap/zpool specific, it belongs to the /sys/kernel/mm/zswap/foo_bar for example. We can discuss the detail usage, nothing is set in stone yet. > > None of these are insurmountable. My point is that this would be a > huge amount of prerequisite code and effort for what seems would be a > fringe usecase at best right now. For the fringe usage a minimal patch exists. Not a huge amount of requireste code. What you describe are close to the end goal of the swap tiers. I > > And there could be a lot of curve balls in both the software design as > well as the hardware development between now and then that could make > your proposals moot. Is a per-cgroup string file really going to be > the right way to configure arbitrary hierarchies if they materialize? > > This strikes me as premature and speculative, for what could be, some > day. "swap.tiers" in my mind is strictly better than "zswap.writeback" because writeback has only 0 or 1 value. It is not able to describe other swap selections. Setting zswap.writeback = 0 will disable SSD swap as well, that is not very intuitive. If we want SSD only swap, we need to set "zswap.writeback" = 1 and "zswap.max" = 0. All this makes me feel that we are limiting ourselves too much from the cubical of zswap to look at the rest of the world. > > We don't even do it for *internal API*. 
There is a review rule to > introduce a function in the same patch as its first caller, to make > sure it's the right abstraction and a good fit for the usecase. There > is no way we can have a lower bar than that for permanent ABI. > > The patch here integrates with what zswap is NOW and always has been: > a compressing writeback cache for swap. > > Should multiple swap tiers overcome all the above and actually become > real, this knob here would be the least of our worries. It would be > easy to just ignore, automatically override, or deprecate. > > So I don't think you made a reasonable proposal for an alternative, or > gave convincing reasons to hold off this one. > Keep in mind that the minimal patch is just trying to avoid the detour of introducing the zswap.writeback and obsolete it soon. There is a solid need for swap backend other than zswap. Frankly speaking I don't see writing "all" to "swap.tiers" is that big a deal different than writing "1" to "zswap.writeback". From the API point of view, one is more flexible and future proof. I just don't want "zswap.writeback" to become something carved into stone and we cant' remove it later, especially if we can have the alternative API compatible with other usage cases. Chris
On Fri, Dec 08, 2023 at 03:55:59PM -0800, Chris Li wrote: > I can give you three usage cases right now: > 1) Google producting kernel uses SSD only swap, it is currently on > pilot. This is not expressible by the memory.zswap.writeback. You can > set the memory.zswap.max = 0 and memory.zswap.writeback = 1, then SSD > backed swapfile. But the whole thing feels very clunky, especially > what you really want is SSD only swap, you need to do all this zswap > config dance. Google has an internal memory.swapfile feature > implemented per cgroup swap file type by "zswap only", "real swap file > only", "both", "none" (the exact keyword might be different). running > in the production for almost 10 years. The need for more than zswap > type of per cgroup control is really there. We use regular swap on SSD without zswap just fine. Of course it's expressible. On dedicated systems, zswap is disabled in sysfs. On shared hosts where it's determined based on which workload is scheduled, zswap is generally enabled through sysfs, and individual cgroup access is controlled via memory.zswap.max - which is what this knob is for. This is analogous to enabling swap globally, and then opting individual cgroups in and out with memory.swap.max. So this usecase is very much already supported, and it's expressed in a way that's pretty natural for how cgroups express access and lack of access to certain resources. I don't see how memory.swap.type or memory.swap.tiers would improve this in any way. On the contrary, it would overlap and conflict with existing controls to manage swap and zswap on a per-cgroup basis. > 2) As indicated by this discussion, Tencent has a usage case for SSD > and hard disk swap as overflow. > https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/ > +Kairui Multiple swap devices for round robin or with different priorities aren't new, they have been supported for a very, very long time. So far nobody has proposed to control the exact behavior on a per-cgroup basis, and I didn't see anybody in this thread asking for it either. So I don't see how this counts as an obvious and automatic usecase for memory.swap.tiers. > 3) Android has some fancy swap ideas led by those patches. > https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@kernel.org/ > It got shot down due to removal of frontswap. But the usage case and > product requirement is there. > +Minchan This looks like an optimization for zram to bypass the block layer and hook directly into the swap code. Correct me if I'm wrong, but this doesn't appear to have anything to do with per-cgroup backend control. > > zswap.writeback is a more urgent need, and does not prevent swap.tiers > > if we do decide to implement it. > > I respect that urgent need, that is why I Ack on the V5 path, under > the understanding that this zswap.writeback is not carved into stones. > When a better interface comes alone, that interface can be obsolete. > Frankly speaking I would much prefer not introducing the cgroup API > which will be obsolete soon. > > If you think zswap.writeback is not removable when another better > alternative is available, please voice it now. > > If you squash my minimal memory.swap.tiers patch, it will also address > your urgent need for merging the "zswap.writeback", no? We can always deprecate ABI if something better comes along. However, it's quite bold to claim that memory.swap.tiers is the best way to implement backend control on a per-cgroup basis, and that it'll definitely be needed in the future. 
You might see this as a foregone conclusion, but I very much doubt this. Even if such a file were to show up, I'm not convinced it should even include zswap as one of the tiers. Zswap isn't a regular swap backend, it doesn't show up in /proc/swaps, it can't be a second tier, the way it interacts with its backend file is very different than how two swapfiles of different priorities interact with each other, it's already controllable with memory.zswap.max, etc. I'm open to discussing usecases and proposals for more fine-grained per-cgroup backend control. We've had discussions about per-cgroup swapfiles in the past. Cgroup parameters for swapon are another thought. There are several options and many considerations. The memory.swap.tiers idea is the newest, has probably had the least amount of discussion among them, and looks the least convincing to me. Let's work out the requirements first. The "conflict" with memory.zswap.writeback is a red herring - it's no more of a conflict than setting memory.swap.tiers to "zswap" or "all" and then setting memory.zswap.max or memory.swap.max to 0. So the notion that we have to rush in a minimal version of a MUCH bigger concept, just to support zswap writeback disabling is misguided. And then hope that this format works as the concept evolves and real usecases materialize... There is no reason to take that risk.
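The existing controls Johannes describes can be sketched as follows. The module parameter and cgroup v2 files are the standard ones; the cgroup names are examples.

    # Enable zswap globally (module parameter, not per-cgroup).
    echo Y > /sys/module/zswap/parameters/enabled

    # Opt one cgroup out of zswap while still allowing disk swap:
    echo 0 > /sys/fs/cgroup/ssd-only-job/memory.zswap.max

    # Opt another cgroup out of swap entirely:
    echo 0 > /sys/fs/cgroup/no-swap-job/memory.swap.max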
On Fri, Dec 8, 2023 at 7:42 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Fri, Dec 08, 2023 at 03:55:59PM -0800, Chris Li wrote: > > I can give you three usage cases right now: > > 1) Google producting kernel uses SSD only swap, it is currently on > > pilot. This is not expressible by the memory.zswap.writeback. You can > > set the memory.zswap.max = 0 and memory.zswap.writeback = 1, then SSD > > backed swapfile. But the whole thing feels very clunky, especially > > what you really want is SSD only swap, you need to do all this zswap > > config dance. Google has an internal memory.swapfile feature > > implemented per cgroup swap file type by "zswap only", "real swap file > > only", "both", "none" (the exact keyword might be different). running > > in the production for almost 10 years. The need for more than zswap > > type of per cgroup control is really there. > > We use regular swap on SSD without zswap just fine. Of course it's > expressible. > > On dedicated systems, zswap is disabled in sysfs. On shared hosts > where it's determined based on which workload is scheduled, zswap is > generally enabled through sysfs, and individual cgroup access is > controlled via memory.zswap.max - which is what this knob is for. The sysfs API is not per cgroup, right? > > This is analogous to enabling swap globally, and then opting > individual cgroups in and out with memory.swap.max. That is good to know. Just comparing notes. Google is still using cgroup V1. There is a seperate per cgroup zswap switch control enabling zswap or not. As well as a swap file type switch to select ghost swap file or real swap file. > So this usecase is very much already supported, and it's expressed in > a way that's pretty natural for how cgroups express access and lack of > access to certain resources. > > I don't see how memory.swap.type or memory.swap.tiers would improve > this in any way. On the contrary, it would overlap and conflict with > existing controls to manage swap and zswap on a per-cgroup basis. One minor point is that, if we have a per cgroup list of swap tires/devices. It does not need to enter the zswap code to find out zswap is disabled. But that is a very minor point. > > > 2) As indicated by this discussion, Tencent has a usage case for SSD > > and hard disk swap as overflow. > > https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/ > > +Kairui > > Multiple swap devices for round robin or with different priorities > aren't new, they have been supported for a very, very long time. So > far nobody has proposed to control the exact behavior on a per-cgroup > basis, and I didn't see anybody in this thread asking for it either. > > So I don't see how this counts as an obvious and automatic usecase for > memory.swap.tiers. I assume Tencent would want to use per cgroup control but you have a point it is not automatic. It would be best to have Kairui clarify whether Tencent wants to use per cgroup SSD vs Hard Disk swap file control. > > > 3) Android has some fancy swap ideas led by those patches. > > https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@kernel.org/ > > It got shot down due to removal of frontswap. But the usage case and > > product requirement is there. > > +Minchan > > This looks like an optimization for zram to bypass the block layer and > hook directly into the swap code. Correct me if I'm wrong, but this > doesn't appear to have anything to do with per-cgroup backend control. No in that series. No. 
> > > > zswap.writeback is a more urgent need, and does not prevent swap.tiers > > > if we do decide to implement it. > > > > I respect that urgent need, that is why I Ack on the V5 path, under > > the understanding that this zswap.writeback is not carved into stones. > > When a better interface comes alone, that interface can be obsolete. > > Frankly speaking I would much prefer not introducing the cgroup API > > which will be obsolete soon. > > > > If you think zswap.writeback is not removable when another better > > alternative is available, please voice it now. > > > > If you squash my minimal memory.swap.tiers patch, it will also address > > your urgent need for merging the "zswap.writeback", no? > > We can always deprecate ABI if something better comes along. Thanks for the confirmation. That is what I want to hear from you: memory.zswap.writeback is not set in stone if something better comes along. > > However, it's quite bold to claim that memory.swap.tiers is the best > way to implement backend control on a per-cgroup basis, and that it'll > definitely be needed in the future. You might see this as a foregone > conclusion, but I very much doubt this. Ack. > > Even if such a file were to show up, I'm not convinced it should even > include zswap as one of the tiers. Zswap isn't a regular swap backend, > it doesn't show up in /proc/swaps, it can't be a second tier, the way > it interacts with its backend file is very different than how two > swapfiles of different priorities interact with each other, it's > already controllable with memory.zswap.max, etc. The rationale for separating out the swap tiers is not based on the device; rather it depends on the swapin latency distribution, access pattern, etc. In that regard, the memory-based zswap has much lower swapin latency, and should be considered a separate tier/class from SSD. Even though current zswap allocates swap slots from the SSD or ghost swap file, I don't consider it an issue to list zswap as a separate tier/class. At the end of the day this is tied to the SLO of the expected swapping performance. > > I'm open to discussing usecases and proposals for more fine-grained > per-cgroup backend control. We've had discussions about per-cgroup > swapfiles in the past. Cgroup parameters for swapon are another > thought. There are several options and many considerations. The > memory.swap.tiers idea is the newest, has probably had the least > amount of discussion among them, and looks the least convincing to me. > > Let's work out the requirements first. I agree with that. Let's lay out the different requirements and have an open discussion to cover each other's usage cases. > The "conflict" with memory.zswap.writeback is a red herring - it's no > more of a conflict than setting memory.swap.tiers to "zswap" or "all" > and then setting memory.zswap.max or memory.swap.max to 0. I am not sure I fully understand your point here. Can you elaborate? I am thinking memory.zswap.writeback and memory.swap.tiers both try to describe which set of swap tiers the cgroup can select. "swap.tiers" can also specify the order. Setting memory.zswap.max or memory.swap.max to 0 just disables zswap/swap; it does not specify the ordering. I am not sure what "conflict" you are referring to. > > So the notion that we have to rush in a minimal version of a MUCH > bigger concept, just to support zswap writeback disabling is > misguided. And then hope that this format works as the concept evolves > and real usecases materialize... 
There is no reason to take that risk. > I am convinced memory.swap.tiers is more general and more straightforward to reason about. You obviously need more convincing, which is fair given that I haven't got much more than the minimal patch. As long as it is not carved in stone and better things can be developed, I am fine with that. Chris
Chris Li <chrisl@kernel.org> 于2023年12月9日周六 07:56写道: > > Hi Nhat, > > On Thu, Dec 7, 2023 at 5:03 PM Nhat Pham <nphamcs@gmail.com> wrote: > > > > On Thu, Dec 7, 2023 at 4:19 PM Chris Li <chrisl@kernel.org> wrote: > > > > > > Hi Nhat, > > > > > > > > > On Thu, Dec 7, 2023 at 11:24 AM Nhat Pham <nphamcs@gmail.com> wrote: > > > > > > > > During our experiment with zswap, we sometimes observe swap IOs due to > > > > occasional zswap store failures and writebacks-to-swap. These swapping > > > > IOs prevent many users who cannot tolerate swapping from adopting zswap > > > > to save memory and improve performance where possible. > > > > > > > > This patch adds the option to disable this behavior entirely: do not > > > > writeback to backing swapping device when a zswap store attempt fail, > > > > and do not write pages in the zswap pool back to the backing swap > > > > device (both when the pool is full, and when the new zswap shrinker is > > > > called). > > > > > > > > This new behavior can be opted-in/out on a per-cgroup basis via a new > > > > cgroup file. By default, writebacks to swap device is enabled, which is > > > > the previous behavior. Initially, writeback is enabled for the root > > > > cgroup, and a newly created cgroup will inherit the current setting of > > > > its parent. > > > > > > > > Note that this is subtly different from setting memory.swap.max to 0, as > > > > it still allows for pages to be stored in the zswap pool (which itself > > > > consumes swap space in its current form). > > > > > > > > This patch should be applied on top of the zswap shrinker series: > > > > > > > > https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/ > > > > > > > > as it also disables the zswap shrinker, a major source of zswap > > > > writebacks. > > > > > > I am wondering about the status of "memory.swap.tiers" proof of concept patch? > > > Are we still on board to have this two patch merge together somehow so > > > we can have > > > "memory.swap.tiers" == "all" and "memory.swap.tiers" == "zswap" cover the > > > memory.zswap.writeback == 1 and memory.zswap.writeback == 0 case? > > > > > > Thanks > > > > > > Chris > > > > > > > Hi Chris, > > > > I briefly summarized my recent discussion with Johannes here: > > > > https://lore.kernel.org/all/CAKEwX=NwGGRAtXoNPfq63YnNLBCF0ZDOdLVRsvzUmYhK4jxzHA@mail.gmail.com/ > > Sorry I am traveling in a different time zone so not able to get to > that email sooner. That email is only sent out less than one day > before the V6 patch right? > > > > > TL;DR is we acknowledge the potential usefulness of swap.tiers > > interface, but the use case is not quite there yet, so it does not > > I disagree about no use case. No use case for Meta != no usage case > for the rest of the linux kernel community. That mindset really needs > to shift to do Linux kernel development. Respect other's usage cases. > It is not just Meta's Linux kernel. It is everybody's Linux kernel. > > I can give you three usage cases right now: > 1) Google producting kernel uses SSD only swap, it is currently on > pilot. This is not expressible by the memory.zswap.writeback. You can > set the memory.zswap.max = 0 and memory.zswap.writeback = 1, then SSD > backed swapfile. But the whole thing feels very clunky, especially > what you really want is SSD only swap, you need to do all this zswap > config dance. 
Google has an internal memory.swapfile feature > implemented per cgroup swap file type by "zswap only", "real swap file > only", "both", "none" (the exact keyword might be different). running > in the production for almost 10 years. The need for more than zswap > type of per cgroup control is really there. > > 2) As indicated by this discussion, Tencent has a usage case for SSD > and hard disk swap as overflow. > https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/ > +Kairui Yes, we are not using zswap. We are using ZRAM for swap since we have many different varieties of workload instances, with a very flexible storage setup. Some of them don't have the ability to set up a swapfile. So we built a set of kernel infrastructure based on ZRAM, which has so far worked pretty well. The concern from some teams is that ZRAM (or zswap) can't always free up memory, so it may lead to a higher risk of OOM compared to a physical swap device, and they do have suitable devices for doing swap on some of their machines. So secondary swap support is very helpful in case of memory usage peaks. Besides this, another requirement is that different containers may have different priorities; some containers can tolerate high swap overhead while some cannot, so swap tiering is useful for us in many ways. And thanks to cloud infrastructure the disk setup could change from time to time depending on workload requirements, so our requirement is to support ZRAM (always) + SSD (optional) + HDD (also optional) as swap backends, while not making things too complex to maintain. Currently we have implemented cgroup-based ZRAM compression algorithm control, per-cgroup ZRAM accounting and limits, and an experimental kernel worker to migrate cold swap entries from the high-priority device to the low-priority device at a very small scale (we lack the basic mechanics to do this at large scale; however, due to the low IOPS of the slow device and the fact that cold pages are rarely accessed, this hasn't been too much of a problem so far, just kind of ugly). The rest of the swapping (e.g. secondary swap when ZRAM is full) will depend on the kernel's native ability. So far it works, though not in the best form; it needs more patches to make it work better (e.g. the swapin/readahead patch I sent previously). Some of our design may also need to change in the long term, and we also want a well-built interface and kernel mechanics to manage multi-tier swap. I'm very willing to talk and collaborate on this.
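For reference, the kernel's existing multi-device priority mechanism that the setup above builds on looks roughly like this (device and file paths are examples only, and each backing device or swapfile is assumed to be prepared beforehand); higher-priority devices fill up first, and devices with equal priority are used round-robin:

  # ZRAM first, SSD swapfile as overflow, HDD swapfile as last resort
  mkswap /dev/zram0     && swapon -p 100 /dev/zram0
  mkswap /swap/ssd.swap && swapon -p 50  /swap/ssd.swap
  mkswap /swap/hdd.swap && swapon -p 10  /swap/hdd.swap

What is missing, and what the rest of the thread circles around, is doing this kind of selection per cgroup rather than system-wide.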
On Fri, Dec 08, 2023 at 10:42:29PM -0500, Johannes Weiner wrote: > On Fri, Dec 08, 2023 at 03:55:59PM -0800, Chris Li wrote: > > I can give you three usage cases right now: > > 1) Google producting kernel uses SSD only swap, it is currently on > > pilot. This is not expressible by the memory.zswap.writeback. You can > > set the memory.zswap.max = 0 and memory.zswap.writeback = 1, then SSD > > backed swapfile. But the whole thing feels very clunky, especially > > what you really want is SSD only swap, you need to do all this zswap > > config dance. Google has an internal memory.swapfile feature > > implemented per cgroup swap file type by "zswap only", "real swap file > > only", "both", "none" (the exact keyword might be different). running > > in the production for almost 10 years. The need for more than zswap > > type of per cgroup control is really there. > > We use regular swap on SSD without zswap just fine. Of course it's > expressible. > > On dedicated systems, zswap is disabled in sysfs. On shared hosts > where it's determined based on which workload is scheduled, zswap is > generally enabled through sysfs, and individual cgroup access is > controlled via memory.zswap.max - which is what this knob is for. > > This is analogous to enabling swap globally, and then opting > individual cgroups in and out with memory.swap.max. > > So this usecase is very much already supported, and it's expressed in > a way that's pretty natural for how cgroups express access and lack of > access to certain resources. > > I don't see how memory.swap.type or memory.swap.tiers would improve > this in any way. On the contrary, it would overlap and conflict with > existing controls to manage swap and zswap on a per-cgroup basis. > > > 2) As indicated by this discussion, Tencent has a usage case for SSD > > and hard disk swap as overflow. > > https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/ > > +Kairui > > Multiple swap devices for round robin or with different priorities > aren't new, they have been supported for a very, very long time. So > far nobody has proposed to control the exact behavior on a per-cgroup > basis, and I didn't see anybody in this thread asking for it either. > > So I don't see how this counts as an obvious and automatic usecase for > memory.swap.tiers. > > > 3) Android has some fancy swap ideas led by those patches. > > https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@kernel.org/ > > It got shot down due to removal of frontswap. But the usage case and > > product requirement is there. > > +Minchan > > This looks like an optimization for zram to bypass the block layer and > hook directly into the swap code. Correct me if I'm wrong, but this > doesn't appear to have anything to do with per-cgroup backend control. Hi Johannes, I haven't been following the thread closely, but I noticed the discussion about potential use cases for zram with memcg. One interesting idea I have is to implement a swap controller per cgroup. This would allow us to tailor the zram swap behavior to the specific needs of different groups. For example, Group A, which is sensitive to swap latency, could use zram swap with a fast compression setting, even if it sacrifices some compression ratio. This would prioritize quick access to swapped data, even if it takes up more space. On the other hand, Group B, which can tolerate higher swap latency, could benefit from a slower compression setting that achieves a higher compression ratio. 
This would maximize memory efficiency at the cost of slightly slower data access. This approach could provide a more nuanced and flexible way to manage swap usage within different cgroups.
> Hi Johannes, > > I haven't been following the thread closely, but I noticed the discussion > about potential use cases for zram with memcg. > > One interesting idea I have is to implement a swap controller per cgroup. > This would allow us to tailor the zram swap behavior to the specific needs of > different groups. Hi Minchan, It would be great if this feature could be in place. We are using zram and also trying to use zswap to offload memory to disk. As kairui mentioned above, there are many workloads of varying priority and sensitivity on a machine. > > For example, Group A, which is sensitive to swap latency, could use zram swap > with a fast compression setting, even if it sacrifices some compression ratio. > This would prioritize quick access to swapped data, even if it takes up more space. > > On the other hand, Group B, which can tolerate higher swap latency, could benefit > from a slower compression setting that achieves a higher compression ratio. > This would maximize memory efficiency at the cost of slightly slower data access. > > This approach could provide a more nuanced and flexible way to manage swap usage > within different cgroups. Indeed.
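As a sketch of the fast-versus-dense split being discussed, two zram devices can be given different compressors with the existing zram sysfs knobs (assuming lz4 and zstd are built into the kernel; the sizes are arbitrary, and the algorithm must be written before the disk size):

  modprobe zram num_devices=2
  # zram0: fast, lower ratio - for the latency-sensitive group
  echo lz4  > /sys/block/zram0/comp_algorithm
  echo 8G   > /sys/block/zram0/disksize
  # zram1: slower, higher ratio - for the latency-tolerant group
  echo zstd > /sys/block/zram1/comp_algorithm
  echo 16G  > /sys/block/zram1/disksize

Each device would then be mkswap'd and swapon'd as usual; the missing piece is assigning them to cgroups, which is what the discussion below is about.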
On Fri, Dec 8, 2023 at 7:42 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Fri, Dec 08, 2023 at 03:55:59PM -0800, Chris Li wrote: > > I can give you three usage cases right now: > > 1) Google producting kernel uses SSD only swap, it is currently on > > pilot. This is not expressible by the memory.zswap.writeback. You can > > set the memory.zswap.max = 0 and memory.zswap.writeback = 1, then SSD > > backed swapfile. But the whole thing feels very clunky, especially > > what you really want is SSD only swap, you need to do all this zswap > > config dance. Google has an internal memory.swapfile feature > > implemented per cgroup swap file type by "zswap only", "real swap file > > only", "both", "none" (the exact keyword might be different). running > > in the production for almost 10 years. The need for more than zswap > > type of per cgroup control is really there. > > We use regular swap on SSD without zswap just fine. Of course it's > expressible. > > On dedicated systems, zswap is disabled in sysfs. On shared hosts > where it's determined based on which workload is scheduled, zswap is > generally enabled through sysfs, and individual cgroup access is > controlled via memory.zswap.max - which is what this knob is for. > > This is analogous to enabling swap globally, and then opting > individual cgroups in and out with memory.swap.max. > > So this usecase is very much already supported, and it's expressed in > a way that's pretty natural for how cgroups express access and lack of > access to certain resources. > > I don't see how memory.swap.type or memory.swap.tiers would improve > this in any way. On the contrary, it would overlap and conflict with > existing controls to manage swap and zswap on a per-cgroup basis. > > > 2) As indicated by this discussion, Tencent has a usage case for SSD > > and hard disk swap as overflow. > > https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/ > > +Kairui > > Multiple swap devices for round robin or with different priorities > aren't new, they have been supported for a very, very long time. So > far nobody has proposed to control the exact behavior on a per-cgroup > basis, and I didn't see anybody in this thread asking for it either. > > So I don't see how this counts as an obvious and automatic usecase for > memory.swap.tiers. > > > 3) Android has some fancy swap ideas led by those patches. > > https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@kernel.org/ > > It got shot down due to removal of frontswap. But the usage case and > > product requirement is there. > > +Minchan > > This looks like an optimization for zram to bypass the block layer and > hook directly into the swap code. Correct me if I'm wrong, but this > doesn't appear to have anything to do with per-cgroup backend control. > > > > zswap.writeback is a more urgent need, and does not prevent swap.tiers > > > if we do decide to implement it. > > > > I respect that urgent need, that is why I Ack on the V5 path, under > > the understanding that this zswap.writeback is not carved into stones. > > When a better interface comes alone, that interface can be obsolete. > > Frankly speaking I would much prefer not introducing the cgroup API > > which will be obsolete soon. > > > > If you think zswap.writeback is not removable when another better > > alternative is available, please voice it now. > > > > If you squash my minimal memory.swap.tiers patch, it will also address > > your urgent need for merging the "zswap.writeback", no? 
> > We can always deprecate ABI if something better comes along. > > However, it's quite bold to claim that memory.swap.tiers is the best > way to implement backend control on a per-cgroup basis, and that it'll > definitely be needed in the future. You might see this as a foregone > conclusion, but I very much doubt this. > > Even if such a file were to show up, I'm not convinced it should even > include zswap as one of the tiers. Zswap isn't a regular swap backend, > it doesn't show up in /proc/swaps, it can't be a second tier, the way > it interacts with its backend file is very different than how two > swapfiles of different priorities interact with each other, it's > already controllable with memory.zswap.max, etc. This is honestly the thing I was originally most iffy about :) zswap is architecturally and semantically separate from other swap options. It gets really confusing to lump it as part of the swap tiers. > > I'm open to discussing usecases and proposals for more fine-grained > per-cgroup backend control. We've had discussions about per-cgroup > swapfiles in the past. Cgroup parameters for swapon are another > thought. There are several options and many considerations. The > memory.swap.tiers idea is the newest, has probably had the least > amount of discussion among them, and looks the least convincing to me. Definitely. zswap.writeback is a really concrete feature, with immediate use-case, whereas swap.tiers seems a bit nebulous to me now, the more we discuss it. I'm not against the inclusion of something along those lines though, and I'm definitely not trying to limit the use case of other folks - I'd be happy to contribute my engineering hours towards the discussion of the multi-tier swapping design (both internal implementation and public interface), as well as actual code, when that design is fully fleshed out :) > > Let's work out the requirements first. > > The "conflict" with memory.zswap.writeback is a red herring - it's no > more of a conflict than setting memory.swap.tiers to "zswap" or "all" > and then setting memory.zswap.max or memory.swap.max to 0. Yup. > > So the notion that we have to rush in a minimal version of a MUCH > bigger concept, just to support zswap writeback disabling is > misguided. And then hope that this format works as the concept evolves > and real usecases materialize... There is no reason to take that risk.
Hi Kairui, Thanks for sharing the information on how you use swap. On Mon, Dec 11, 2023 at 1:31 AM Kairui Song <ryncsn@gmail.com> wrote: > > 2) As indicated by this discussion, Tencent has a usage case for SSD > > and hard disk swap as overflow. > > https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/ > > +Kairui > > Yes, we are not using zswap. We are using ZRAM for swap since we have > many different varieties of workload instances, with a very flexible > storage setup. Some of them don't have the ability to set up a > swapfile. So we built a pack of kernel infrastructures based on ZRAM, > which so far worked pretty well. This is great. The usage case is actually much broader than I expected. For example, I never thought of zram as a swap tier. Now that you mention it, I am considering whether it makes sense to add zram to memory.swap.tiers as well as zswap. > > The concern from some teams is that ZRAM (or zswap) can't always free > up memory so they may lead to higher risk of OOM compared to a > physical swap device, and they do have suitable devices for doing swap > on some of their machines. So a secondary swap support is very helpful > in case of memory usage peak. > > Besides this, another requirement is that different containers may > have different priority, some containers can tolerate high swap > overhead while some cannot, so swap tiering is useful for us in many > ways. > > And thanks to cloud infrastructure the disk setup could change from > time to time depending on workload requirements, so our requirement is > to support ZRAM (always) + SSD (optional) + HDD (also optional) as > swap backends, while not making things too complex to maintain. Just curious, do you use ZRAM + SSD + HDD all enabled? Do you ever consider moving data from ZRAM to SSD, or from SSD to HDD? If you do, I do see the possibility of having more general swap tiers support and sharing the shrinking code between tiers somehow. Granted there are many unanswered questions and a lot of infrastructure is lacking. Gathering requirements and weighing their priorities is the first step towards a possible solution. > Currently we have implemented a cgroup based ZRAM compression > algorithm control, per-cgroup ZRAM accounting and limit, and a > experimental kernel worker to migrate cold swap entry from high > priority device to low priority device at very small scale (lack of > basic mechanics to do this at large scale, however due to the low IOPS > of slow device and cold pages are rarely accessed, this wasn't too > much of a problem so far but kind of ugly). The rest of swapping (eg. > secondary swap when ZRAM if full) will depend on the kernel's native > ability. Thanks for confirming the usage needs of per-cgroup ZRAM enablement and flushing between swap devices. I was hoping swap.tiers could support something like that. > So far it works, not in the best form, need more patches to make it > work better (eg. the swapin/readahead patch I sent previously). Some > of our design may also need to change in the long term, and we also > want a well built interface and kernel mechanics to manage multi tier > swaps, I'm very willing to talk and collaborate on this. > Great. Let's continue this discussion in a new thread and start gathering some requirements and priorities from everyone. The output of this discussion should be a one-pager document listing the swap tier requirements and rating the priorities among them. 
Once we have that nailed down, we can then discuss the incremental milestones to get there. I am very interested in this topic and willing to spend time on it as well. Chris
Hi Minchan, On Mon, Dec 11, 2023 at 2:55 PM Minchan Kim <minchan@kernel.org> wrote: > > > 3) Android has some fancy swap ideas led by those patches. > > > https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@kernel.org/ > > > It got shot down due to removal of frontswap. But the usage case and > > > product requirement is there. > > > +Minchan > > > > This looks like an optimization for zram to bypass the block layer and > > hook directly into the swap code. Correct me if I'm wrong, but this > > doesn't appear to have anything to do with per-cgroup backend control. > > Hi Johannes, > > I haven't been following the thread closely, but I noticed the discussion > about potential use cases for zram with memcg. > > One interesting idea I have is to implement a swap controller per cgroup. > This would allow us to tailor the zram swap behavior to the specific needs of > different groups. > > For example, Group A, which is sensitive to swap latency, could use zram swap > with a fast compression setting, even if it sacrifices some compression ratio. > This would prioritize quick access to swapped data, even if it takes up more space. > > On the other hand, Group B, which can tolerate higher swap latency, could benefit > from a slower compression setting that achieves a higher compression ratio. > This would maximize memory efficiency at the cost of slightly slower data access. That is a very solid usage case. Thanks for sharing this swap backend usage story. It goes beyond my original memory.swap.tiers idea as well. We can have some zram-specific knobs to control which compression setting is used. Moving data between different compression settings would be an interesting topic. It might fit the swap.tiers usage model as well. I am just thinking it out loud. Maybe define different compression settings as different tiers and then allow the cgroup to enroll in one of the tier lists. > > This approach could provide a more nuanced and flexible way to manage swap usage > within different cgroups. > That is again a very wonderful insight. Thanks Chris
On Tue, Dec 12, 2023 at 1:36 PM Nhat Pham <nphamcs@gmail.com> wrote: > > Even if such a file were to show up, I'm not convinced it should even > > include zswap as one of the tiers. Zswap isn't a regular swap backend, > > it doesn't show up in /proc/swaps, it can't be a second tier, the way > > it interacts with its backend file is very different than how two > > swapfiles of different priorities interact with each other, it's > > already controllable with memory.zswap.max, etc. > > This is honestly the thing I was originally most iffy about :) zswap > is architecturally and semantically separate from other swap options. > It gets really confusing to lump it as part of the swap tiers. The writeback option is about interacting with other swap backends, so technically it is not zswap alone; writeback = 0 will disable SSD swap as well. I am not against merging the writeback patch. I just want to make sure that 1) better alternatives can be developed and 2) zswap.writeback can be obsoleted if a better alternative is available. > > > > > I'm open to discussing usecases and proposals for more fine-grained > > per-cgroup backend control. We've had discussions about per-cgroup > > swapfiles in the past. Cgroup parameters for swapon are another > > thought. There are several options and many considerations. The > > memory.swap.tiers idea is the newest, has probably had the least > > amount of discussion among them, and looks the least convincing to me. > > Definitely. zswap.writeback is a really concrete feature, with > immediate use-case, whereas swap.tiers seem a bit nebulous to me now, > the more we discuss it. I'm not against the inclusion of something > along its line though, and I'm definitely not trying to limit the use > case of other folks - I'd be happy to contribute my engineering hours > towards the discussion of the multi-tier swapping design (both > internal implementation and and public interface), as well as actual > code, when that design is fully fleshed out :) Great to hear that. I think the discussion so far shows the alternative usage cases for swap backends/tiers are real. > > > > > Let's work out the requirements first. > > > > The "conflict" with memory.zswap.writeback is a red herring - it's no > > more of a conflict than setting memory.swap.tiers to "zswap" or "all" > > and then setting memory.zswap.max or memory.swap.max to 0. > > Yup. Care to elaborate more? I don't understand the conflict part. I did ask Johannes in my previous email for clarification. One is a superset of the other; I don't consider that a conflict. If we can have both to choose from, obviously I would pick the one that is more general and flexible. Chris
On Mon, Dec 11, 2023 at 02:55:43PM -0800, Minchan Kim wrote: > On Fri, Dec 08, 2023 at 10:42:29PM -0500, Johannes Weiner wrote: > > On Fri, Dec 08, 2023 at 03:55:59PM -0800, Chris Li wrote: > > > I can give you three usage cases right now: > > > 1) Google producting kernel uses SSD only swap, it is currently on > > > pilot. This is not expressible by the memory.zswap.writeback. You can > > > set the memory.zswap.max = 0 and memory.zswap.writeback = 1, then SSD > > > backed swapfile. But the whole thing feels very clunky, especially > > > what you really want is SSD only swap, you need to do all this zswap > > > config dance. Google has an internal memory.swapfile feature > > > implemented per cgroup swap file type by "zswap only", "real swap file > > > only", "both", "none" (the exact keyword might be different). running > > > in the production for almost 10 years. The need for more than zswap > > > type of per cgroup control is really there. > > > > We use regular swap on SSD without zswap just fine. Of course it's > > expressible. > > > > On dedicated systems, zswap is disabled in sysfs. On shared hosts > > where it's determined based on which workload is scheduled, zswap is > > generally enabled through sysfs, and individual cgroup access is > > controlled via memory.zswap.max - which is what this knob is for. > > > > This is analogous to enabling swap globally, and then opting > > individual cgroups in and out with memory.swap.max. > > > > So this usecase is very much already supported, and it's expressed in > > a way that's pretty natural for how cgroups express access and lack of > > access to certain resources. > > > > I don't see how memory.swap.type or memory.swap.tiers would improve > > this in any way. On the contrary, it would overlap and conflict with > > existing controls to manage swap and zswap on a per-cgroup basis. > > > > > 2) As indicated by this discussion, Tencent has a usage case for SSD > > > and hard disk swap as overflow. > > > https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/ > > > +Kairui > > > > Multiple swap devices for round robin or with different priorities > > aren't new, they have been supported for a very, very long time. So > > far nobody has proposed to control the exact behavior on a per-cgroup > > basis, and I didn't see anybody in this thread asking for it either. > > > > So I don't see how this counts as an obvious and automatic usecase for > > memory.swap.tiers. > > > > > 3) Android has some fancy swap ideas led by those patches. > > > https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@kernel.org/ > > > It got shot down due to removal of frontswap. But the usage case and > > > product requirement is there. > > > +Minchan > > > > This looks like an optimization for zram to bypass the block layer and > > hook directly into the swap code. Correct me if I'm wrong, but this > > doesn't appear to have anything to do with per-cgroup backend control. > > Hi Johannes, > > I haven't been following the thread closely, but I noticed the discussion > about potential use cases for zram with memcg. > > One interesting idea I have is to implement a swap controller per cgroup. > This would allow us to tailor the zram swap behavior to the specific needs of > different groups. > > For example, Group A, which is sensitive to swap latency, could use zram swap > with a fast compression setting, even if it sacrifices some compression ratio. > This would prioritize quick access to swapped data, even if it takes up more space. 
> > On the other hand, Group B, which can tolerate higher swap latency, could benefit > from a slower compression setting that achieves a higher compression ratio. > This would maximize memory efficiency at the cost of slightly slower data access. > > This approach could provide a more nuanced and flexible way to manage swap usage > within different cgroups. That makes sense to me. It sounds to me like per-cgroup swapfiles would be the easiest solution to this. Then you can create zram devices with different configurations and assign them to individual cgroups. This would also apply to Kairui's usecase: assign zrams and hdd backups as needed on a per-cgroup basis. In addition, it would naturally solve scalability and isolation problems when multiple containers would otherwise be hammering on the same swap backends and locks. It would also only require one, relatively simple new interface, such as a cgroup parameter to swapon(). That's highly preferable over a complex configuration file like memory.swap.tiers that needs to solve all sorts of visibility and namespace issues and duplicate the full configuration interface of every backend in some new, custom syntax.
On Thu, Dec 14, 2023 at 10:11 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Mon, Dec 11, 2023 at 02:55:43PM -0800, Minchan Kim wrote: > > On Fri, Dec 08, 2023 at 10:42:29PM -0500, Johannes Weiner wrote: > > > On Fri, Dec 08, 2023 at 03:55:59PM -0800, Chris Li wrote: > > > > I can give you three usage cases right now: > > > > 1) Google producting kernel uses SSD only swap, it is currently on > > > > pilot. This is not expressible by the memory.zswap.writeback. You can > > > > set the memory.zswap.max = 0 and memory.zswap.writeback = 1, then SSD > > > > backed swapfile. But the whole thing feels very clunky, especially > > > > what you really want is SSD only swap, you need to do all this zswap > > > > config dance. Google has an internal memory.swapfile feature > > > > implemented per cgroup swap file type by "zswap only", "real swap file > > > > only", "both", "none" (the exact keyword might be different). running > > > > in the production for almost 10 years. The need for more than zswap > > > > type of per cgroup control is really there. > > > > > > We use regular swap on SSD without zswap just fine. Of course it's > > > expressible. > > > > > > On dedicated systems, zswap is disabled in sysfs. On shared hosts > > > where it's determined based on which workload is scheduled, zswap is > > > generally enabled through sysfs, and individual cgroup access is > > > controlled via memory.zswap.max - which is what this knob is for. > > > > > > This is analogous to enabling swap globally, and then opting > > > individual cgroups in and out with memory.swap.max. > > > > > > So this usecase is very much already supported, and it's expressed in > > > a way that's pretty natural for how cgroups express access and lack of > > > access to certain resources. > > > > > > I don't see how memory.swap.type or memory.swap.tiers would improve > > > this in any way. On the contrary, it would overlap and conflict with > > > existing controls to manage swap and zswap on a per-cgroup basis. > > > > > > > 2) As indicated by this discussion, Tencent has a usage case for SSD > > > > and hard disk swap as overflow. > > > > https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/ > > > > +Kairui > > > > > > Multiple swap devices for round robin or with different priorities > > > aren't new, they have been supported for a very, very long time. So > > > far nobody has proposed to control the exact behavior on a per-cgroup > > > basis, and I didn't see anybody in this thread asking for it either. > > > > > > So I don't see how this counts as an obvious and automatic usecase for > > > memory.swap.tiers. > > > > > > > 3) Android has some fancy swap ideas led by those patches. > > > > https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@kernel.org/ > > > > It got shot down due to removal of frontswap. But the usage case and > > > > product requirement is there. > > > > +Minchan > > > > > > This looks like an optimization for zram to bypass the block layer and > > > hook directly into the swap code. Correct me if I'm wrong, but this > > > doesn't appear to have anything to do with per-cgroup backend control. > > > > Hi Johannes, > > > > I haven't been following the thread closely, but I noticed the discussion > > about potential use cases for zram with memcg. > > > > One interesting idea I have is to implement a swap controller per cgroup. > > This would allow us to tailor the zram swap behavior to the specific needs of > > different groups. 
> > > > For example, Group A, which is sensitive to swap latency, could use zram swap > > with a fast compression setting, even if it sacrifices some compression ratio. > > This would prioritize quick access to swapped data, even if it takes up more space. > > > > On the other hand, Group B, which can tolerate higher swap latency, could benefit > > from a slower compression setting that achieves a higher compression ratio. > > This would maximize memory efficiency at the cost of slightly slower data access. > > > > This approach could provide a more nuanced and flexible way to manage swap usage > > within different cgroups. > > That makes sense to me. > > It sounds to me like per-cgroup swapfiles would be the easiest > solution to this. Someone posted it about 10 years ago :) https://lwn.net/Articles/592923/ +fdeutsch@redhat.com Fabian recently asked me about its status.
On Thu, Dec 14, 2023 at 9:11 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > Hi Johannes, > > > > I haven't been following the thread closely, but I noticed the discussion > > about potential use cases for zram with memcg. > > > > One interesting idea I have is to implement a swap controller per cgroup. > > This would allow us to tailor the zram swap behavior to the specific needs of > > different groups. > > > > For example, Group A, which is sensitive to swap latency, could use zram swap > > with a fast compression setting, even if it sacrifices some compression ratio. > > This would prioritize quick access to swapped data, even if it takes up more space. > > > > On the other hand, Group B, which can tolerate higher swap latency, could benefit > > from a slower compression setting that achieves a higher compression ratio. > > This would maximize memory efficiency at the cost of slightly slower data access. > > > > This approach could provide a more nuanced and flexible way to manage swap usage > > within different cgroups. > > That makes sense to me. > > It sounds to me like per-cgroup swapfiles would be the easiest > solution to this. Then you can create zram devices with different > configurations and assign them to individual cgroups. Ideally you need zram and then a swap file following the zram. That would be a list of swap files rather than just one swapfile per cgroup. > > This would also apply to Kairu's usecase: assign zrams and hdd backups > as needed on a per-cgroup basis. Same there: Kairui's request involves ZRAM and at least one extra swap file. In other words, you really need a per-cgroup swap file list. > > In addition, it would naturally solve scalability and isolation > problems when multiple containers would otherwise be hammering on the > same swap backends and locks. > > It would also only require one, relatively simple new interface, such > as a cgroup parameter to swapon(). > > That's highly preferable over a complex configuration file like > memory.swap.tiers that needs to solve all sorts of visibility and > namespace issues and duplicate the full configuration interface of > every backend in some new, custom syntax. If you don't like the syntax of memory.swap.tiers, I am open to suggestions of your preferred syntax as well. The essence of swap.tiers is a per-cgroup list of the swap backends. The name implies that. I am not married to any given syntax of how to specify the list. Its goal matches the above requirement pretty well. Chris
On Thu, Dec 14, 2023 at 6:24 PM Yu Zhao <yuzhao@google.com> wrote: > > On Thu, Dec 14, 2023 at 10:11 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Mon, Dec 11, 2023 at 02:55:43PM -0800, Minchan Kim wrote: > > > On Fri, Dec 08, 2023 at 10:42:29PM -0500, Johannes Weiner wrote: > > > > On Fri, Dec 08, 2023 at 03:55:59PM -0800, Chris Li wrote: > > > > > I can give you three usage cases right now: > > > > > 1) Google producting kernel uses SSD only swap, it is currently on > > > > > pilot. This is not expressible by the memory.zswap.writeback. You can > > > > > set the memory.zswap.max = 0 and memory.zswap.writeback = 1, then SSD > > > > > backed swapfile. But the whole thing feels very clunky, especially > > > > > what you really want is SSD only swap, you need to do all this zswap > > > > > config dance. Google has an internal memory.swapfile feature > > > > > implemented per cgroup swap file type by "zswap only", "real swap file > > > > > only", "both", "none" (the exact keyword might be different). running > > > > > in the production for almost 10 years. The need for more than zswap > > > > > type of per cgroup control is really there. > > > > > > > > We use regular swap on SSD without zswap just fine. Of course it's > > > > expressible. > > > > > > > > On dedicated systems, zswap is disabled in sysfs. On shared hosts > > > > where it's determined based on which workload is scheduled, zswap is > > > > generally enabled through sysfs, and individual cgroup access is > > > > controlled via memory.zswap.max - which is what this knob is for. > > > > > > > > This is analogous to enabling swap globally, and then opting > > > > individual cgroups in and out with memory.swap.max. > > > > > > > > So this usecase is very much already supported, and it's expressed in > > > > a way that's pretty natural for how cgroups express access and lack of > > > > access to certain resources. > > > > > > > > I don't see how memory.swap.type or memory.swap.tiers would improve > > > > this in any way. On the contrary, it would overlap and conflict with > > > > existing controls to manage swap and zswap on a per-cgroup basis. > > > > > > > > > 2) As indicated by this discussion, Tencent has a usage case for SSD > > > > > and hard disk swap as overflow. > > > > > https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/ > > > > > +Kairui > > > > > > > > Multiple swap devices for round robin or with different priorities > > > > aren't new, they have been supported for a very, very long time. So > > > > far nobody has proposed to control the exact behavior on a per-cgroup > > > > basis, and I didn't see anybody in this thread asking for it either. > > > > > > > > So I don't see how this counts as an obvious and automatic usecase for > > > > memory.swap.tiers. > > > > > > > > > 3) Android has some fancy swap ideas led by those patches. > > > > > https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@kernel.org/ > > > > > It got shot down due to removal of frontswap. But the usage case and > > > > > product requirement is there. > > > > > +Minchan > > > > > > > > This looks like an optimization for zram to bypass the block layer and > > > > hook directly into the swap code. Correct me if I'm wrong, but this > > > > doesn't appear to have anything to do with per-cgroup backend control. > > > > > > Hi Johannes, > > > > > > I haven't been following the thread closely, but I noticed the discussion > > > about potential use cases for zram with memcg. 
> > > > > > One interesting idea I have is to implement a swap controller per cgroup. > > > This would allow us to tailor the zram swap behavior to the specific needs of > > > different groups. > > > > > > For example, Group A, which is sensitive to swap latency, could use zram swap > > > with a fast compression setting, even if it sacrifices some compression ratio. > > > This would prioritize quick access to swapped data, even if it takes up more space. > > > > > > On the other hand, Group B, which can tolerate higher swap latency, could benefit > > > from a slower compression setting that achieves a higher compression ratio. > > > This would maximize memory efficiency at the cost of slightly slower data access. > > > > > > This approach could provide a more nuanced and flexible way to manage swap usage > > > within different cgroups. > > > > That makes sense to me. > > > > It sounds to me like per-cgroup swapfiles would be the easiest > > solution to this. > > Someone posted it about 10 years ago :) > https://lwn.net/Articles/592923/ > > +fdeutsch@redhat.com > Fabian recently asked me about its status. Thanks Yu! Yes, I was interested due to container use-cases. Now a few thoughts in this direction: - With swap per cgroup you lose the big "statistical" benefit of having swap on a node level. Well, it depends on the size of the cgroup (i.e. system.slice is quite large). - With today's node-level swap, setting memory.swap.max=0 for all cgroups allows you to achieve a similar behavior (only opt-in cgroups will get swap). - The above approach however will still have a shared swap backend for all cgroups.
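The opt-in pattern in the second bullet maps onto the existing cgroup v2 swap limit roughly as follows (the cgroup paths are illustrative):

  # deny swap by default for one subtree ...
  echo 0   > /sys/fs/cgroup/system.slice/memory.swap.max
  # ... and opt an individual cgroup elsewhere back in
  echo max > /sys/fs/cgroup/workload.slice/job-a/memory.swap.max

As the last bullet notes, all opted-in cgroups still share the same node-level swap backend.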
On Thu, Dec 14, 2023 at 09:34:06AM -0800, Christopher Li wrote: > On Thu, Dec 14, 2023 at 9:11 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > Hi Johannes, > > > > > > I haven't been following the thread closely, but I noticed the discussion > > > about potential use cases for zram with memcg. > > > > > > One interesting idea I have is to implement a swap controller per cgroup. > > > This would allow us to tailor the zram swap behavior to the specific needs of > > > different groups. > > > > > > For example, Group A, which is sensitive to swap latency, could use zram swap > > > with a fast compression setting, even if it sacrifices some compression ratio. > > > This would prioritize quick access to swapped data, even if it takes up more space. > > > > > > On the other hand, Group B, which can tolerate higher swap latency, could benefit > > > from a slower compression setting that achieves a higher compression ratio. > > > This would maximize memory efficiency at the cost of slightly slower data access. > > > > > > This approach could provide a more nuanced and flexible way to manage swap usage > > > within different cgroups. > > > > That makes sense to me. > > > > It sounds to me like per-cgroup swapfiles would be the easiest > > solution to this. Then you can create zram devices with different > > configurations and assign them to individual cgroups. > > Ideally you need zram then following swap file after the zram. That > would be a list of the swap files rather than just one swapfile per > cgroup. > > > This would also apply to Kairu's usecase: assign zrams and hdd backups > > as needed on a per-cgroup basis. > > Same there, Kairui's request involves ZRAM and at least one extra swap > file. In other words, you really need a per cgroup swap file list. Why is that a problem? swapon(zram, cgroup=foo) swapon(hdd, cgroup=foo) > > In addition, it would naturally solve scalability and isolation > > problems when multiple containers would otherwise be hammering on the > > same swap backends and locks. > > > > It would also only require one, relatively simple new interface, such > > as a cgroup parameter to swapon(). > > > > That's highly preferable over a complex configuration file like > > memory.swap.tiers that needs to solve all sorts of visibility and > > namespace issues and duplicate the full configuration interface of > > every backend in some new, custom syntax. > > If you don't like the syntax of memory.swap.tiers, I am open to > suggestions of your preferred syntax as well. The essicents of the > swap.tiers is a per cgroup list of the swap back ends. The names imply > that. I am not married to any given syntax of how to specify the list. > Its goal matches the above requirement pretty well. Except Minchan said that he would also like different zram parameters depending on the cgroup. There is no way we'll add a memory.swap.tiers with a new configuration language for backend parameters.
On Thu, Dec 14, 2023 at 2:11 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Thu, Dec 14, 2023 at 09:34:06AM -0800, Christopher Li wrote: > > > On Thu, Dec 14, 2023 at 9:11 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > > > Hi Johannes, > > > > > > > > > > I haven't been following the thread closely, but I noticed the discussion > > > > > about potential use cases for zram with memcg. > > > > > > > > > > One interesting idea I have is to implement a swap controller per cgroup. > > > > > This would allow us to tailor the zram swap behavior to the specific needs of > > > > > different groups. > > > > > > > > > > For example, Group A, which is sensitive to swap latency, could use zram swap > > > > > with a fast compression setting, even if it sacrifices some compression ratio. > > > > > This would prioritize quick access to swapped data, even if it takes up more space. > > > > > > > > > > On the other hand, Group B, which can tolerate higher swap latency, could benefit > > > > > from a slower compression setting that achieves a higher compression ratio. > > > > > This would maximize memory efficiency at the cost of slightly slower data access. > > > > > > > > > > This approach could provide a more nuanced and flexible way to manage swap usage > > > > > within different cgroups. > > > > > > > > That makes sense to me. > > > > > > > > It sounds to me like per-cgroup swapfiles would be the easiest > > > > solution to this. Then you can create zram devices with different > > > > configurations and assign them to individual cgroups. > > > > > > Ideally you need zram then following swap file after the zram. That > > > would be a list of the swap files rather than just one swapfile per > > > cgroup. > > > > > > > This would also apply to Kairu's usecase: assign zrams and hdd backups > > > > as needed on a per-cgroup basis. > > > > > > Same there, Kairui's request involves ZRAM and at least one extra swap > > > file. In other words, you really need a per cgroup swap file list. > > > > Why is that a problem? It is not a problem. It is the necessary infrastructure to support the requirement. I am merely saying just having one swap file is not enough. > > > > swapon(zram, cgroup=foo) > > swapon(hdd, cgroup=foo) Interesting idea. I assume you want to use swapon/swapoff to turn a device on or off for a specific cgroup. That seems to imply each cgroup will have a private copy of the swap device list. I have considered memory.swap.tiers for the same thing, with one minor optimization. The list is maintained system-wide with a name. The per-cgroup side just has a pointer to that named list. There shouldn't be too many such lists of swap backend combinations on the system. We are getting into the weeds. The bottom line is, we need to have a per-cgroup swap file list. That is the necessary evil we can't get away from. > > > > In addition, it would naturally solve scalability and isolation > > > problems when multiple containers would otherwise be hammering on the > > > same swap backends and locks. > > > > > > It would also only require one, relatively simple new interface, such > > > as a cgroup parameter to swapon(). > > > > > > That's highly preferable over a complex configuration file like > > > memory.swap.tiers that needs to solve all sorts of visibility and > > > namespace issues and duplicate the full configuration interface of > > > every backend in some new, custom syntax. > > > > If you don't like the syntax of memory.swap.tiers, I am open to > > suggestions of your preferred syntax as well. 
The essicents of the > > swap.tiers is a per cgroup list of the swap back ends. The names imply > > that. I am not married to any given syntax of how to specify the list. > > Its goal matches the above requirement pretty well. > > Except Minchan said that he would also like different zram parameters > depending on the cgroup. Minchan's requirement is new. We will need to expand the original "memory.swap.tiers" to support such usage. > There is no way we'll add a memory.swap.tiers with a new configuration > language for backend parameters. > I agree that we don't want a complicated configuration language for "memory.swap.tiers". Those backend parameters should be configured on the back end side. The "memory.swap.tiers" just reference the already configured object. Just brainstorming: /dev/zram0 has compression algo1 for fast speed low compression ratio. /dev/zram1 has compression algo2 for slow speed high compression ratio. "memory.swap.tiers" point to zram0 or zram1 or a custom list has "zram0 + hdd" Chris
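(A rough sketch of the shape this idea could take, using purely hypothetical structures and field names rather than any existing kernel interface: tier lists are defined once, system wide, under a name, and each cgroup merely points at one of them.)

/* Hypothetical: a named, system-wide, ordered list of swap backends. */
struct swap_tier_list {
        const char *name;                       /* e.g. "fast" or "zram0+hdd" */
        int nr_backends;
        struct swap_info_struct *backends[8];   /* tried in this order */
};

/* Hypothetical: a memcg would only carry a pointer to one shared list. */
struct mem_cgroup_swap_tiers {
        struct swap_tier_list *tiers;
};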
Hi Fabian, On Thu, Dec 14, 2023 at 10:00 AM Fabian Deutsch <fdeutsch@redhat.com> wrote: > Yep - for container use-cases. > > Now a few thoughts in this direction: > - With swap per cgroup you loose the big "statistical" benefit of having swap on a node level. well, it depends on the size of the cgroup (i.e. system.slice is quite large). Just to clarify, by "node" you mean the "node" in the kubernetes sense, which is the whole machine. In the Linux kernel MM context, the node often refers to the NUMA memory node, that is not what you mean here, right? > - With todays node level swap, and setting memory.swap.max=0 for all cgroups allows you toachieve a similar behavior (only opt-in cgroups will get swap). > - the above approach however will still have a shared swap backend for all cgroups. Yes, the "memory.swap.tiers" idea is trying to allow cgroups to select a subset of the swap backend in a specific order. It is still in the early stage of discussion. If you have any suggestion or feedback in that direction, I am looking forward to hearing that. Chris
On Thu, Dec 14, 2023 at 2:55 PM Chris Li <chrisl@kernel.org> wrote: > > On Thu, Dec 14, 2023 at 2:11 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Thu, Dec 14, 2023 at 09:34:06AM -0800, Christopher Li wrote: > > > On Thu, Dec 14, 2023 at 9:11 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > > > Hi Johannes, > > > > > > > > > > I haven't been following the thread closely, but I noticed the discussion > > > > > about potential use cases for zram with memcg. > > > > > > > > > > One interesting idea I have is to implement a swap controller per cgroup. > > > > > This would allow us to tailor the zram swap behavior to the specific needs of > > > > > different groups. > > > > > > > > > > For example, Group A, which is sensitive to swap latency, could use zram swap > > > > > with a fast compression setting, even if it sacrifices some compression ratio. > > > > > This would prioritize quick access to swapped data, even if it takes up more space. > > > > > > > > > > On the other hand, Group B, which can tolerate higher swap latency, could benefit > > > > > from a slower compression setting that achieves a higher compression ratio. > > > > > This would maximize memory efficiency at the cost of slightly slower data access. > > > > > > > > > > This approach could provide a more nuanced and flexible way to manage swap usage > > > > > within different cgroups. > > > > > > > > That makes sense to me. > > > > > > > > It sounds to me like per-cgroup swapfiles would be the easiest > > > > solution to this. Then you can create zram devices with different > > > > configurations and assign them to individual cgroups. > > > > > > Ideally you need zram then following swap file after the zram. That > > > would be a list of the swap files rather than just one swapfile per > > > cgroup. > > > > > > > This would also apply to Kairu's usecase: assign zrams and hdd backups > > > > as needed on a per-cgroup basis. > > > > > > Same there, Kairui's request involves ZRAM and at least one extra swap > > > file. In other words, you really need a per cgroup swap file list. > > > > Why is that a problem? > > It is not a problem. It is the necessary infrastructure to support the > requirement. I am merely saying just having one swap file is not > enough. > > > > > swapon(zram, cgroup=foo) > > swapon(hdd, cgroup=foo) > > Interesting idea. I assume you want to use swapon/swapoff to turn on > off a device for a specific cgroup. > That seems to implite each cgroup will have a private copy of the swap > device list. > > I have considered the memory.swap.tiers for the same thing, with one > minor optimization. The list is system wide maintained with a name. > The per cgroup just has a pointer to that named list. There shouldn't > be too many such lists of swap back end combinations on the system. > > We are getting into the weeds. The bottom line is, we need to have per > cgroup a swap file list. That is the necessary evil we can't get away > with. Highly agree. This is getting waaayyyy too deep into the weeds, and the conversation has practically spiralled out of the original intention of this patch - its purported problem and proposed solution. Not to say that none of this is useful, but I sense that we first need to do the following: a) List out the requirements that the new interface has to support: the tiers made available to the cgroup, hierarchical structure (i.e do we want a tier list to have more than 1 non-zswap level? 
Maybe we won't need it after all, in which case the swapon solution is perhaps sufficient). b) Carefully evaluate the proposed candidates. It could be an altered memory.swap.tiers, or an extended swapon/swapoff. Perhaps we should organize a separate meeting or email thread to discuss this in detail, and write out proposed solutions for everyone to evaluate. In the meantime, I think that we should merge this new knob as-is. > > > > > > > In addition, it would naturally solve scalability and isolation > > > > problems when multiple containers would otherwise be hammering on the > > > > same swap backends and locks. > > > > > > > > It would also only require one, relatively simple new interface, such > > > > as a cgroup parameter to swapon(). > > > > > > > > That's highly preferable over a complex configuration file like > > > > memory.swap.tiers that needs to solve all sorts of visibility and > > > > namespace issues and duplicate the full configuration interface of > > > > every backend in some new, custom syntax. > > > > > > If you don't like the syntax of memory.swap.tiers, I am open to > > > suggestions of your preferred syntax as well. The essicents of the > > > swap.tiers is a per cgroup list of the swap back ends. The names imply > > > that. I am not married to any given syntax of how to specify the list. > > > Its goal matches the above requirement pretty well. > > > > Except Minchan said that he would also like different zram parameters > > depending on the cgroup. > > Minchan's requirement is new. We will need to expand the original > "memory.swap.tiers" to support such usage. > > > There is no way we'll add a memory.swap.tiers with a new configuration > > language for backend parameters. > > > > I agree that we don't want a complicated configuration language for > "memory.swap.tiers". > > Those backend parameters should be configured on the back end side. > The "memory.swap.tiers" just reference the already configured object. > Just brainstorming: > /dev/zram0 has compression algo1 for fast speed low compression ratio. > /dev/zram1 has compression algo2 for slow speed high compression ratio. > > "memory.swap.tiers" point to zram0 or zram1 or a custom list has "zram0 + hdd" > > Chris
On Friday, December 15, 2023, Chris Li <chrisl@kernel.org> wrote: > Hi Fabian, > > On Thu, Dec 14, 2023 at 10:00 AM Fabian Deutsch <fdeutsch@redhat.com> wrote: > >> Yep - for container use-cases. >> >> Now a few thoughts in this direction: >> - With swap per cgroup you loose the big "statistical" benefit of having swap on a node level. well, it depends on the size of the cgroup (i.e. system.slice is quite large). > > Just to clarify, by "node" you mean the "node" in the kubernetes sense, > which is the whole machine. In the Linux kernel MM context, the node > often refers to the NUMA memory node, that is not what you mean here, > right? Correct, I was referring to a kubernetes node, not numa node. > >> - With todays node level swap, and setting memory.swap.max=0 for all cgroups allows you toachieve a similar behavior (only opt-in cgroups will get swap). >> - the above approach however will still have a shared swap backend for all cgroups. > > Yes, the "memory.swap.tiers" idea is trying to allow cgroups to select > a subset of the swap backend in a specific order. It is still in the > early stage of discussion. If you have any suggestion or feedback in > that direction, I am looking forward to hearing that. Interesting. There have been concerns about accidentally leaking confidential data when it gets written to a swap device. The other less discussed item was QoS for swap io traffic. At first glance it seems like tiers could help with the second use-case. - fabian
On Thu, Dec 14, 2023 at 11:42 PM Fabian Deutsch <fdeutsch@redhat.com> wrote: > > > > Just to clarify, by "node" you mean the "node" in the kubernetes sense, > > which is the whole machine. In the Linux kernel MM context, the node > > often refers to the NUMA memory node, that is not what you mean here, > > right? > > Correct, I was referring to a kubernetes node, not numa node. > > > > >> - With todays node level swap, and setting memory.swap.max=0 for all cgroups allows you toachieve a similar behavior (only opt-in cgroups will get swap). > >> - the above approach however will still have a shared swap backend for all cgroups. > > > > Yes, the "memory.swap.tiers" idea is trying to allow cgroups to select > > a subset of the swap backend in a specific order. It is still in the > > early stage of discussion. If you have any suggestion or feedback in > > that direction, I am looking forward to hearing that. > > Interesting. There have been concerns about accidentally leaking confidential data when it gets written to a swap device. One common solution is to encrypt the data written to the device. If someone gets hold of the swapped-out device without the key, they can't get to the memory data. > The other less discussed item was QoS for swap io traffic. > > At first glance it seems like tiers could help with the second use-case. The idea is that you can select the swap tiers list for each cgroup. That way you can assign different swap QoS to different cgroups. Chris
On Fri, Dec 15, 2023 at 10:40 AM Chris Li <chrisl@kernel.org> wrote: > > On Thu, Dec 14, 2023 at 11:42 PM Fabian Deutsch <fdeutsch@redhat.com> wrote: > > > > > > Just to clarify, by "node" you mean the "node" in the kubernetes sense, > > > which is the whole machine. In the Linux kernel MM context, the node > > > often refers to the NUMA memory node, that is not what you mean here, > > > right? > > > > Correct, I was referring to a kubernetes node, not numa node. > > > > > > > >> - With todays node level swap, and setting memory.swap.max=0 for all cgroups allows you toachieve a similar behavior (only opt-in cgroups will get swap). > > >> - the above approach however will still have a shared swap backend for all cgroups. > > > > > > Yes, the "memory.swap.tiers" idea is trying to allow cgroups to select > > > a subset of the swap backend in a specific order. It is still in the > > > early stage of discussion. If you have any suggestion or feedback in > > > that direction, I am looking forward to hearing that. > > > > Interesting. There have been concerns about accidentally leaking confidential data when it gets written to a swap device. > > One common solution is to encrypt the data written to the device. If > someone gets hold of the swapped-out device without the key, they > can't get to the memory data. Yes - I guess like writing it onto a dm-crypt device with some random key. Nevertheless, this was one of the topics. > > > > The other less discussed item was QoS for swap io traffic. > > > > At first glance it seems like tiers could help with the second use-case. > > The idea is that you can select the swap tiers list for each cgroup. > That way you can assign different swap QoS to different cgroups. Yes, it sounds like a fit. What use-cases did you have in mind for the tiers feature?
On Thu, Dec 7, 2023 at 11:24 AM Nhat Pham <nphamcs@gmail.com> wrote: > > During our experiment with zswap, we sometimes observe swap IOs due to > occasional zswap store failures and writebacks-to-swap. These swapping > IOs prevent many users who cannot tolerate swapping from adopting zswap > to save memory and improve performance where possible. > > This patch adds the option to disable this behavior entirely: do not > writeback to backing swapping device when a zswap store attempt fail, > and do not write pages in the zswap pool back to the backing swap > device (both when the pool is full, and when the new zswap shrinker is > called). > > This new behavior can be opted-in/out on a per-cgroup basis via a new > cgroup file. By default, writebacks to swap device is enabled, which is > the previous behavior. Initially, writeback is enabled for the root > cgroup, and a newly created cgroup will inherit the current setting of > its parent. > > Note that this is subtly different from setting memory.swap.max to 0, as > it still allows for pages to be stored in the zswap pool (which itself > consumes swap space in its current form). > > This patch should be applied on top of the zswap shrinker series: > > https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/ > > as it also disables the zswap shrinker, a major source of zswap > writebacks. > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org> > Signed-off-by: Nhat Pham <nphamcs@gmail.com> > Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Taking a step back from all the memory.swap.tiers vs. memory.zswap.writeback discussions, I think there may be a more fundamental problem here. If the zswap store failure is recurrent, pages can keep going back to the LRUs and then sent back to zswap eventually, only to be rejected again. For example, this can if zswap is above the acceptance threshold, but could be even worse if it's the allocator rejecting the page due to not compressing well enough. In the latter case, the page can keep going back and forth between zswap and LRUs indefinitely. You probably did not run into this as you're using zsmalloc, but it can happen with zbud AFAICT. Even with zsmalloc, a less problematic version can happen if zswap is above its acceptance threshold. This can cause thrashing and ineffective reclaim. We have an internal implementation where we mark incompressible pages and put them on the unevictable LRU when we don't have a backing swapfile (i.e. ghost swapfiles), and something similar may work if writeback is disabled. We need to scan such incompressible pages periodically though to remove them from the unevictable LRU if they have been dirited.
On Fri, Dec 15, 2023 at 01:21:57PM -0800, Yosry Ahmed wrote: > On Thu, Dec 7, 2023 at 11:24 AM Nhat Pham <nphamcs@gmail.com> wrote: > > > > During our experiment with zswap, we sometimes observe swap IOs due to > > occasional zswap store failures and writebacks-to-swap. These swapping > > IOs prevent many users who cannot tolerate swapping from adopting zswap > > to save memory and improve performance where possible. > > > > This patch adds the option to disable this behavior entirely: do not > > writeback to backing swapping device when a zswap store attempt fail, > > and do not write pages in the zswap pool back to the backing swap > > device (both when the pool is full, and when the new zswap shrinker is > > called). > > > > This new behavior can be opted-in/out on a per-cgroup basis via a new > > cgroup file. By default, writebacks to swap device is enabled, which is > > the previous behavior. Initially, writeback is enabled for the root > > cgroup, and a newly created cgroup will inherit the current setting of > > its parent. > > > > Note that this is subtly different from setting memory.swap.max to 0, as > > it still allows for pages to be stored in the zswap pool (which itself > > consumes swap space in its current form). > > > > This patch should be applied on top of the zswap shrinker series: > > > > https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/ > > > > as it also disables the zswap shrinker, a major source of zswap > > writebacks. > > > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org> > > Signed-off-by: Nhat Pham <nphamcs@gmail.com> > > Reviewed-by: Yosry Ahmed <yosryahmed@google.com> > > Taking a step back from all the memory.swap.tiers vs. > memory.zswap.writeback discussions, I think there may be a more > fundamental problem here. If the zswap store failure is recurrent, > pages can keep going back to the LRUs and then sent back to zswap > eventually, only to be rejected again. For example, this can if zswap > is above the acceptance threshold, but could be even worse if it's the > allocator rejecting the page due to not compressing well enough. In > the latter case, the page can keep going back and forth between zswap > and LRUs indefinitely. > > You probably did not run into this as you're using zsmalloc, but it > can happen with zbud AFAICT. Even with zsmalloc, a less problematic > version can happen if zswap is above its acceptance threshold. > > This can cause thrashing and ineffective reclaim. We have an internal > implementation where we mark incompressible pages and put them on the > unevictable LRU when we don't have a backing swapfile (i.e. ghost > swapfiles), and something similar may work if writeback is disabled. > We need to scan such incompressible pages periodically though to > remove them from the unevictable LRU if they have been dirited. I'm not sure this is an actual problem. When pages get rejected, they rotate to the furthest point from the reclaimer - the head of the active list. We only get to them again after we scanned everything else. If all that's left on the LRU is unzswappable, then you'd assume that remainder isn't very large, and thus not a significant part of overall scan work. Because if it is, then there is a serious problem with the zswap configuration. There might be possible optimizations to determine how permanent a rejection is, but I'm not sure the effort is called for just yet. Rejections are already failure cases that screw up the LRU ordering, and healthy setups shouldn't have a lot of those. 
I don't think this patch adds any sort of new complications to this picture.
On Mon, Dec 18, 2023 at 6:44 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Fri, Dec 15, 2023 at 01:21:57PM -0800, Yosry Ahmed wrote: > > On Thu, Dec 7, 2023 at 11:24 AM Nhat Pham <nphamcs@gmail.com> wrote: > > > > > > During our experiment with zswap, we sometimes observe swap IOs due to > > > occasional zswap store failures and writebacks-to-swap. These swapping > > > IOs prevent many users who cannot tolerate swapping from adopting zswap > > > to save memory and improve performance where possible. > > > > > > This patch adds the option to disable this behavior entirely: do not > > > writeback to backing swapping device when a zswap store attempt fail, > > > and do not write pages in the zswap pool back to the backing swap > > > device (both when the pool is full, and when the new zswap shrinker is > > > called). > > > > > > This new behavior can be opted-in/out on a per-cgroup basis via a new > > > cgroup file. By default, writebacks to swap device is enabled, which is > > > the previous behavior. Initially, writeback is enabled for the root > > > cgroup, and a newly created cgroup will inherit the current setting of > > > its parent. > > > > > > Note that this is subtly different from setting memory.swap.max to 0, as > > > it still allows for pages to be stored in the zswap pool (which itself > > > consumes swap space in its current form). > > > > > > This patch should be applied on top of the zswap shrinker series: > > > > > > https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/ > > > > > > as it also disables the zswap shrinker, a major source of zswap > > > writebacks. > > > > > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org> > > > Signed-off-by: Nhat Pham <nphamcs@gmail.com> > > > Reviewed-by: Yosry Ahmed <yosryahmed@google.com> > > > > Taking a step back from all the memory.swap.tiers vs. > > memory.zswap.writeback discussions, I think there may be a more > > fundamental problem here. If the zswap store failure is recurrent, > > pages can keep going back to the LRUs and then sent back to zswap > > eventually, only to be rejected again. For example, this can if zswap > > is above the acceptance threshold, but could be even worse if it's the > > allocator rejecting the page due to not compressing well enough. In > > the latter case, the page can keep going back and forth between zswap > > and LRUs indefinitely. > > > > You probably did not run into this as you're using zsmalloc, but it Which is why I recommend everyone to use zsmalloc, and change the default allocator to it in Kconfig :) But tongue-in-cheek aside, I think this is fine. As you noted below, we probably want to try again on that page (for instance, in case its content has changed and is now more compressible). And as Johannes has explained, we'll only look at it again once we have scanned everything else. This sounds acceptable to me. Now if all of the intermediate pages are also unstoreable as well, then we have a problem, but that seems unlikely, and perhaps is an indication that we need to do something else entirely (if the workload is *that* incompressible, perhaps it is better to just disable zswap entirely here). > > can happen with zbud AFAICT. Even with zsmalloc, a less problematic > > version can happen if zswap is above its acceptance threshold. > > > > This can cause thrashing and ineffective reclaim. We have an internal > > implementation where we mark incompressible pages and put them on the > > unevictable LRU when we don't have a backing swapfile (i.e. 
ghost > > swapfiles), and something similar may work if writeback is disabled. > > We need to scan such incompressible pages periodically though to > > remove them from the unevictable LRU if they have been dirited. > > I'm not sure this is an actual problem. > > When pages get rejected, they rotate to the furthest point from the > reclaimer - the head of the active list. We only get to them again > after we scanned everything else. Agree. That is the reason why we rotate the LRU here - to avoid touching it again until we have tried other pages. > > If all that's left on the LRU is unzswappable, then you'd assume that > remainder isn't very large, and thus not a significant part of overall > scan work. Because if it is, then there is a serious problem with the > zswap configuration. Agree. > > There might be possible optimizations to determine how permanent a > rejection is, but I'm not sure the effort is called for just > yet. Rejections are already failure cases that screw up the LRU > ordering, and healthy setups shouldn't have a lot of those. I don't > think this patch adds any sort of new complications to this picture. Yep. This is one of the reasons (among many) why we were toying around with storing uncompressed pages in zswap - it's one of the failure cases where trying again (if the page's content has not changed) - isn't likely to yield a different result, so might as well just retain the overall LRU ordering and squeeze it in zswap (but as discussed, that has quite a bit of implications that we need to deal with).
> > Taking a step back from all the memory.swap.tiers vs. > > memory.zswap.writeback discussions, I think there may be a more > > fundamental problem here. If the zswap store failure is recurrent, > > pages can keep going back to the LRUs and then sent back to zswap > > eventually, only to be rejected again. For example, this can if zswap > > is above the acceptance threshold, but could be even worse if it's the > > allocator rejecting the page due to not compressing well enough. In > > the latter case, the page can keep going back and forth between zswap > > and LRUs indefinitely. > > > > You probably did not run into this as you're using zsmalloc, but it > > can happen with zbud AFAICT. Even with zsmalloc, a less problematic > > version can happen if zswap is above its acceptance threshold. > > > > This can cause thrashing and ineffective reclaim. We have an internal > > implementation where we mark incompressible pages and put them on the > > unevictable LRU when we don't have a backing swapfile (i.e. ghost > > swapfiles), and something similar may work if writeback is disabled. > > We need to scan such incompressible pages periodically though to > > remove them from the unevictable LRU if they have been dirited. > > I'm not sure this is an actual problem. > > When pages get rejected, they rotate to the furthest point from the > reclaimer - the head of the active list. We only get to them again > after we scanned everything else. > > If all that's left on the LRU is unzswappable, then you'd assume that > remainder isn't very large, and thus not a significant part of overall > scan work. Because if it is, then there is a serious problem with the > zswap configuration. > > There might be possible optimizations to determine how permanent a > rejection is, but I'm not sure the effort is called for just > yet. Rejections are already failure cases that screw up the LRU > ordering, and healthy setups shouldn't have a lot of those. I don't > think this patch adds any sort of new complications to this picture. We have workloads where a significant amount (maybe 20%? 30% not sure tbh) of the memory is incompressible. Zswap is still a very viable option for those workloads once those pages are taken out of the picture. If those pages remain on the LRUs, they will introduce a regression in reclaim efficiency. With the upstream code today, those pages go directly to the backing store, which isn't ideal in terms of LRU ordering, but this patch makes them stay on the LRUs, which can be harmful. I don't think we can just assume it is okay. Whether we make those pages unevictable or store them uncompressed in zswap, I think taking them out of the LRUs (until they are redirtied), is the right thing to do. Adding Wei and Yu for more data about incompressible memory in our fleet. Keep in mind that we have internal patches to cap the compression ratio (i.e. reject pages where the compressed size + metadata is not worth it, or where zsmalloc will store it in a full page anyway). But the same thing can happen upstream with zbud.
On Mon, Dec 18, 2023 at 11:21 AM Nhat Pham <nphamcs@gmail.com> wrote: > > On Mon, Dec 18, 2023 at 6:44 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Fri, Dec 15, 2023 at 01:21:57PM -0800, Yosry Ahmed wrote: > > > On Thu, Dec 7, 2023 at 11:24 AM Nhat Pham <nphamcs@gmail.com> wrote: > > > > > > > > During our experiment with zswap, we sometimes observe swap IOs due to > > > > occasional zswap store failures and writebacks-to-swap. These swapping > > > > IOs prevent many users who cannot tolerate swapping from adopting zswap > > > > to save memory and improve performance where possible. > > > > > > > > This patch adds the option to disable this behavior entirely: do not > > > > writeback to backing swapping device when a zswap store attempt fail, > > > > and do not write pages in the zswap pool back to the backing swap > > > > device (both when the pool is full, and when the new zswap shrinker is > > > > called). > > > > > > > > This new behavior can be opted-in/out on a per-cgroup basis via a new > > > > cgroup file. By default, writebacks to swap device is enabled, which is > > > > the previous behavior. Initially, writeback is enabled for the root > > > > cgroup, and a newly created cgroup will inherit the current setting of > > > > its parent. > > > > > > > > Note that this is subtly different from setting memory.swap.max to 0, as > > > > it still allows for pages to be stored in the zswap pool (which itself > > > > consumes swap space in its current form). > > > > > > > > This patch should be applied on top of the zswap shrinker series: > > > > > > > > https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/ > > > > > > > > as it also disables the zswap shrinker, a major source of zswap > > > > writebacks. > > > > > > > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org> > > > > Signed-off-by: Nhat Pham <nphamcs@gmail.com> > > > > Reviewed-by: Yosry Ahmed <yosryahmed@google.com> > > > > > > Taking a step back from all the memory.swap.tiers vs. > > > memory.zswap.writeback discussions, I think there may be a more > > > fundamental problem here. If the zswap store failure is recurrent, > > > pages can keep going back to the LRUs and then sent back to zswap > > > eventually, only to be rejected again. For example, this can if zswap > > > is above the acceptance threshold, but could be even worse if it's the > > > allocator rejecting the page due to not compressing well enough. In > > > the latter case, the page can keep going back and forth between zswap > > > and LRUs indefinitely. > > > > > > You probably did not run into this as you're using zsmalloc, but it > > Which is why I recommend everyone to use zsmalloc, and change the > default allocator to it in Kconfig :) > Internally, we have a cap on the compression ratio, after which we reject pages because it doesn't make sense to store them (e.g. zsmalloc will store them in a full page anyway, or the compressed size + metadata isn't worth it). I think this is where we should head upstream as well, you proposed something in the right direction with storing uncompressed pages in zswap. IOW, I think such pages should be taken out of the LRUs one way or another.
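(For illustration only: a minimal sketch of the kind of acceptance check being discussed, where the function name and the overhead constant are assumptions, not the internal patches or the current upstream code.)

/*
 * Hypothetical check: refuse to store a compressed page when doing so
 * would not actually save memory, e.g. because per-entry metadata eats
 * the savings or the allocator would round it up to a full page anyway.
 */
static bool compression_worth_it(size_t compressed_len)
{
        const size_t per_entry_overhead = 64;   /* assumed, not a real number */

        return compressed_len + per_entry_overhead < PAGE_SIZE;
}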
On Mon, Dec 18, 2023 at 01:52:23PM -0800, Yosry Ahmed wrote: > > > Taking a step back from all the memory.swap.tiers vs. > > > memory.zswap.writeback discussions, I think there may be a more > > > fundamental problem here. If the zswap store failure is recurrent, > > > pages can keep going back to the LRUs and then sent back to zswap > > > eventually, only to be rejected again. For example, this can if zswap > > > is above the acceptance threshold, but could be even worse if it's the > > > allocator rejecting the page due to not compressing well enough. In > > > the latter case, the page can keep going back and forth between zswap > > > and LRUs indefinitely. > > > > > > You probably did not run into this as you're using zsmalloc, but it > > > can happen with zbud AFAICT. Even with zsmalloc, a less problematic > > > version can happen if zswap is above its acceptance threshold. > > > > > > This can cause thrashing and ineffective reclaim. We have an internal > > > implementation where we mark incompressible pages and put them on the > > > unevictable LRU when we don't have a backing swapfile (i.e. ghost > > > swapfiles), and something similar may work if writeback is disabled. > > > We need to scan such incompressible pages periodically though to > > > remove them from the unevictable LRU if they have been dirited. > > > > I'm not sure this is an actual problem. > > > > When pages get rejected, they rotate to the furthest point from the > > reclaimer - the head of the active list. We only get to them again > > after we scanned everything else. > > > > If all that's left on the LRU is unzswappable, then you'd assume that > > remainder isn't very large, and thus not a significant part of overall > > scan work. Because if it is, then there is a serious problem with the > > zswap configuration. > > > > There might be possible optimizations to determine how permanent a > > rejection is, but I'm not sure the effort is called for just > > yet. Rejections are already failure cases that screw up the LRU > > ordering, and healthy setups shouldn't have a lot of those. I don't > > think this patch adds any sort of new complications to this picture. > > We have workloads where a significant amount (maybe 20%? 30% not sure > tbh) of the memory is incompressible. Zswap is still a very viable > option for those workloads once those pages are taken out of the > picture. If those pages remain on the LRUs, they will introduce a > regression in reclaim efficiency. > > With the upstream code today, those pages go directly to the backing > store, which isn't ideal in terms of LRU ordering, but this patch > makes them stay on the LRUs, which can be harmful. I don't think we > can just assume it is okay. Whether we make those pages unevictable or > store them uncompressed in zswap, I think taking them out of the LRUs > (until they are redirtied), is the right thing to do. This is how it works with zram as well, though, and it has plenty of happy users. The fact that there are antagonistic workloads doesn't mean the feature isn't useful. This flag is optional and not enabled by default, so nobody is forced to use it where it hurts. I'm not saying it's not worth optimizing those cases, but it doesn't look like a requirement in order to be useful to a variety of loads. > Adding Wei and Yu for more data about incompressible memory in our > fleet. Keep in mind that we have internal patches to cap the > compression ratio (i.e. 
reject pages where the compressed size + > metadata is not worth it, or where zsmalloc will store it in a full > page anyway). But the same thing can happen upstream with zbud. I hate to bring this up, but there has been a bit of a disturbing trend in the zswap discussions recently. Please do not argue with private patches. Their behavior, the usecases they enable, and their dependencies are entirely irrelevant to patches submitted in this forum. They do not need to cater to them or consider the consequences for them. The only thing that matters is the upstream codebase and the usecases enabled by it.
On Tue, Dec 19, 2023 at 9:15 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Mon, Dec 18, 2023 at 01:52:23PM -0800, Yosry Ahmed wrote: > > > > Taking a step back from all the memory.swap.tiers vs. > > > > memory.zswap.writeback discussions, I think there may be a more > > > > fundamental problem here. If the zswap store failure is recurrent, > > > > pages can keep going back to the LRUs and then sent back to zswap > > > > eventually, only to be rejected again. For example, this can if zswap > > > > is above the acceptance threshold, but could be even worse if it's the > > > > allocator rejecting the page due to not compressing well enough. In > > > > the latter case, the page can keep going back and forth between zswap > > > > and LRUs indefinitely. > > > > > > > > You probably did not run into this as you're using zsmalloc, but it > > > > can happen with zbud AFAICT. Even with zsmalloc, a less problematic > > > > version can happen if zswap is above its acceptance threshold. > > > > > > > > This can cause thrashing and ineffective reclaim. We have an internal > > > > implementation where we mark incompressible pages and put them on the > > > > unevictable LRU when we don't have a backing swapfile (i.e. ghost > > > > swapfiles), and something similar may work if writeback is disabled. > > > > We need to scan such incompressible pages periodically though to > > > > remove them from the unevictable LRU if they have been dirited. > > > > > > I'm not sure this is an actual problem. > > > > > > When pages get rejected, they rotate to the furthest point from the > > > reclaimer - the head of the active list. We only get to them again > > > after we scanned everything else. > > > > > > If all that's left on the LRU is unzswappable, then you'd assume that > > > remainder isn't very large, and thus not a significant part of overall > > > scan work. Because if it is, then there is a serious problem with the > > > zswap configuration. > > > > > > There might be possible optimizations to determine how permanent a > > > rejection is, but I'm not sure the effort is called for just > > > yet. Rejections are already failure cases that screw up the LRU > > > ordering, and healthy setups shouldn't have a lot of those. I don't > > > think this patch adds any sort of new complications to this picture. > > > > We have workloads where a significant amount (maybe 20%? 30% not sure > > tbh) of the memory is incompressible. Zswap is still a very viable > > option for those workloads once those pages are taken out of the > > picture. If those pages remain on the LRUs, they will introduce a > > regression in reclaim efficiency. > > > > With the upstream code today, those pages go directly to the backing > > store, which isn't ideal in terms of LRU ordering, but this patch > > makes them stay on the LRUs, which can be harmful. I don't think we > > can just assume it is okay. Whether we make those pages unevictable or > > store them uncompressed in zswap, I think taking them out of the LRUs > > (until they are redirtied), is the right thing to do. > > This is how it works with zram as well, though, and it has plenty of > happy users. I am not sure I understand. Zram does not reject pages that do not compress well, right? IIUC it acts as a block device so it cannot reject pages. I feel like I am missing something. > The fact that there are antagonistic workloads doesn't > mean the feature isn't useful. This flag is optional and not enabled > by default, so nobody is forced to use it where it hurts. 
> > I'm not saying it's not worth optimizing those cases, but it doesn't > look like a requirement in order to be useful to a variety of loads. But we don't even understand the impact on those workloads today to properly document it. What happens with a workload using zbud for example and has quite a bit of memory that gets rejected? Is the feature usable for such a setup or not? There has also been discussions upstream about introducing a compression threshold for zswap in general (see below), this seems to be going in an opposite direction. If we already want to support taking pages away from the LRUs when rejected by zswap (e.g. Nhat's proposal earlier), doesn't it make sense to do that first so that this patch can be useful for all workloads? > > > Adding Wei and Yu for more data about incompressible memory in our > > fleet. Keep in mind that we have internal patches to cap the > > compression ratio (i.e. reject pages where the compressed size + > > metadata is not worth it, or where zsmalloc will store it in a full > > page anyway). But the same thing can happen upstream with zbud. > > I hate to bring this up, but there has been a bit of a disturbing > trend in the zswap discussions recently. > > Please do not argue with private patches. Their behavior, the usecases > they enable, and their dependencies are entirely irrelevant to patches > submitted in this forum. They do not need to cater to them or consider > the consequences for them. The only thing that matters is the upstream > codebase and the usecases enabled by it. Sorry if my intention wasn't clear. I am not arguing that this patch affects our internal patches in any way. All I am saying is that we do something similar internally, and we would like to move to an upstream solution if possible -- so naturally we want the upstream solution to work for us as well. Besides, this can happen in the upstream codebase with zbud as I mentioned earlier, and there has been discussions upstream about introducing such a compression threshold as well (e.g. [1]). So it is not something unique to Google. If this is where we think we are headed upstream (and is already the case with zbud), I think it's not unreasonable to bring it up. [1]https://lore.kernel.org/lkml/7a0e3229-be63-4a24-a3fe-7e3ff517de10@bytedance.com/
Chris Li <chrisl@kernel.org> 于2023年12月13日周三 07:39写道: > > Hi Kairui, > > Thanks for sharing the information on how you use swap. Hi Chris, > > On Mon, Dec 11, 2023 at 1:31 AM Kairui Song <ryncsn@gmail.com> wrote: > > > 2) As indicated by this discussion, Tencent has a usage case for SSD > > > and hard disk swap as overflow. > > > https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/ > > > +Kairui > > > > Yes, we are not using zswap. We are using ZRAM for swap since we have > > many different varieties of workload instances, with a very flexible > > storage setup. Some of them don't have the ability to set up a > > swapfile. So we built a pack of kernel infrastructures based on ZRAM, > > which so far worked pretty well. > > This is great. The usage case is actually much more than I expected. > For example, I never thought of zram as a swap tier. Now you mention > it. I am considering whether it makes sense to add zram to the > memory.swap.tiers as well as zswap. > > > > > The concern from some teams is that ZRAM (or zswap) can't always free > > up memory so they may lead to higher risk of OOM compared to a > > physical swap device, and they do have suitable devices for doing swap > > on some of their machines. So a secondary swap support is very helpful > > in case of memory usage peak. > > > > Besides this, another requirement is that different containers may > > have different priority, some containers can tolerate high swap > > overhead while some cannot, so swap tiering is useful for us in many > > ways. > > > > And thanks to cloud infrastructure the disk setup could change from > > time to time depending on workload requirements, so our requirement is > > to support ZRAM (always) + SSD (optional) + HDD (also optional) as > > swap backends, while not making things too complex to maintain. > > Just curious, do you use ZRAM + SSD + HDD all enabled? Do you ever > consider moving data from ZRAM to SSD, or from SSD to HDD? If you do, > I do see the possibility of having more general swap tiers support and > sharing the shrinking code between tiers somehow. Granted there are > many unanswered questions and a lot of infrastructure is lacking. > Gathering requirements, weight in the priority of the quirement is the > first step towards a possible solution. Sorry for the late response. Yes, it's our plan to use ZRAM + SSD + HDD all enabled when possible, although currently only ZRAM + SSD is expected. I see this discussion is still going on, so let me just add some info here... We have some test environments which have a kernel worker enabled to move data from ZRAM to SSD, and from SSD to HDD too, to free up space for higher tier swap devices. The kworker is simple: it maintains a swap entry LRU for every swap device (maybe worth noting here: there is currently no LRU-based writeback for ZRAM, ZRAM writeback requires a fixed block device at init, and a swap-device-level LRU is also helpful for migrating entries from SSD to HDD). It walks the page table to swap in the coldest swap entry, then swaps it out immediately to a lower tier, doing this page by page periodically. Overhead and memory footprint are minimal with a limited moving rate, but the efficiency for large-scale data moving is terrible, so it only has very limited usage. I was trying to come up with a better design but am currently not working on it.
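(Purely as an illustration of the scheme described above; every helper below is hypothetical and this is not the internal implementation.)

/* Hypothetical periodic worker: demote cold swap entries to a slower tier. */
static void swap_demote_work(struct work_struct *work)
{
        struct swap_info_struct *fast = pick_faster_tier();     /* assumed */
        struct swap_info_struct *slow = pick_slower_tier();     /* assumed */
        int budget = 32;                                        /* rate limit */

        while (budget-- > 0) {
                /* Per-device LRU of swap entries, as described above. */
                swp_entry_t cold = coldest_entry(fast);         /* assumed */
                struct page *page = swapin_entry(cold);         /* assumed */

                if (!page)
                        break;
                /* Push it straight back out to the slower device. */
                swapout_to(page, slow);                         /* assumed */
        }
        schedule_delayed_work(to_delayed_work(work), 10 * HZ);
}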
Chris Li <chrisl@kernel.org> 于2023年12月13日周三 07:58写道: > > Hi Minchan, > > On Mon, Dec 11, 2023 at 2:55 PM Minchan Kim <minchan@kernel.org> wrote: > > > > > 3) Android has some fancy swap ideas led by those patches. > > > > https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@kernel.org/ > > > > It got shot down due to removal of frontswap. But the usage case and > > > > product requirement is there. > > > > +Minchan > > > > > > This looks like an optimization for zram to bypass the block layer and > > > hook directly into the swap code. Correct me if I'm wrong, but this > > > doesn't appear to have anything to do with per-cgroup backend control. > > > > Hi Johannes, > > > > I haven't been following the thread closely, but I noticed the discussion > > about potential use cases for zram with memcg. > > > > One interesting idea I have is to implement a swap controller per cgroup. > > This would allow us to tailor the zram swap behavior to the specific needs of > > different groups. > > > > For example, Group A, which is sensitive to swap latency, could use zram swap > > with a fast compression setting, even if it sacrifices some compression ratio. > > This would prioritize quick access to swapped data, even if it takes up more space. > > > > On the other hand, Group B, which can tolerate higher swap latency, could benefit > > from a slower compression setting that achieves a higher compression ratio. > > This would maximize memory efficiency at the cost of slightly slower data access. > > That is a very solid usage case. Thanks for sharing this swap backend > usage story. It goes beyond my original memory.swap.teires idea as > well. > > We can have some zram specific knobs to control what compression > setting is using. Moving data between different compression settings > would be an interesting topic. It might fit the swap.tiers usage model > as well. I am just thinking it out loud. Maybe define different > compression settings as different tiers and then allow the cgroup to > enroll into one of the tiers list. > This is very similar to our usage, easy to implement too. Actually, now ZRAM already supports multiple compression streams, so if each memcg just provides an extra knob to record the compression level (1-4), ZRAM can decide which compression stream to use when the page reaches ZRAM, just by checking pages memcg. It's limited to 4 levels but enough for most cases.
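(A sketch of the idea above: zram's multi-stream support and the ZRAM_MAX_COMPS / ZRAM_PRIMARY_COMP constants are real, but the per-memcg level knob and its accessor are assumptions.)

/* Hypothetical: pick one of zram's compression streams from the page's memcg. */
static u32 zram_pick_comp_prio(struct page *page)
{
        /* mem_cgroup_zram_level() would be the new per-memcg knob (assumed). */
        int level = mem_cgroup_zram_level(folio_memcg(page_folio(page)));

        if (level < 0 || level >= ZRAM_MAX_COMPS)
                level = ZRAM_PRIMARY_COMP;      /* fall back to the default stream */
        return level;
}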
On Wed, Dec 20, 2023 at 12:59:15AM -0800, Yosry Ahmed wrote: > On Tue, Dec 19, 2023 at 9:15 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Mon, Dec 18, 2023 at 01:52:23PM -0800, Yosry Ahmed wrote: > > > > > Taking a step back from all the memory.swap.tiers vs. > > > > > memory.zswap.writeback discussions, I think there may be a more > > > > > fundamental problem here. If the zswap store failure is recurrent, > > > > > pages can keep going back to the LRUs and then sent back to zswap > > > > > eventually, only to be rejected again. For example, this can if zswap > > > > > is above the acceptance threshold, but could be even worse if it's the > > > > > allocator rejecting the page due to not compressing well enough. In > > > > > the latter case, the page can keep going back and forth between zswap > > > > > and LRUs indefinitely. > > > > > > > > > > You probably did not run into this as you're using zsmalloc, but it > > > > > can happen with zbud AFAICT. Even with zsmalloc, a less problematic > > > > > version can happen if zswap is above its acceptance threshold. > > > > > > > > > > This can cause thrashing and ineffective reclaim. We have an internal > > > > > implementation where we mark incompressible pages and put them on the > > > > > unevictable LRU when we don't have a backing swapfile (i.e. ghost > > > > > swapfiles), and something similar may work if writeback is disabled. > > > > > We need to scan such incompressible pages periodically though to > > > > > remove them from the unevictable LRU if they have been dirited. > > > > > > > > I'm not sure this is an actual problem. > > > > > > > > When pages get rejected, they rotate to the furthest point from the > > > > reclaimer - the head of the active list. We only get to them again > > > > after we scanned everything else. > > > > > > > > If all that's left on the LRU is unzswappable, then you'd assume that > > > > remainder isn't very large, and thus not a significant part of overall > > > > scan work. Because if it is, then there is a serious problem with the > > > > zswap configuration. > > > > > > > > There might be possible optimizations to determine how permanent a > > > > rejection is, but I'm not sure the effort is called for just > > > > yet. Rejections are already failure cases that screw up the LRU > > > > ordering, and healthy setups shouldn't have a lot of those. I don't > > > > think this patch adds any sort of new complications to this picture. > > > > > > We have workloads where a significant amount (maybe 20%? 30% not sure > > > tbh) of the memory is incompressible. Zswap is still a very viable > > > option for those workloads once those pages are taken out of the > > > picture. If those pages remain on the LRUs, they will introduce a > > > regression in reclaim efficiency. > > > > > > With the upstream code today, those pages go directly to the backing > > > store, which isn't ideal in terms of LRU ordering, but this patch > > > makes them stay on the LRUs, which can be harmful. I don't think we > > > can just assume it is okay. Whether we make those pages unevictable or > > > store them uncompressed in zswap, I think taking them out of the LRUs > > > (until they are redirtied), is the right thing to do. > > > > This is how it works with zram as well, though, and it has plenty of > > happy users. > > I am not sure I understand. Zram does not reject pages that do not > compress well, right? IIUC it acts as a block device so it cannot > reject pages. I feel like I am missing something. 
zram_write_page() can fail for various reasons - compression failure, zsmalloc failure, the memory limit. This results in !!bio->bi_status, __end_swap_bio_write redirtying the page, and vmscan rotating it. The effect is actually more pronounced with zram, because the pages don't get activated and thus cycle faster. What you're raising doesn't seem to be a dealbreaker in practice. > If we already want to support taking pages away from the LRUs when > rejected by zswap (e.g. Nhat's proposal earlier), doesn't it make > sense to do that first so that this patch can be useful for all > workloads? No. Why should users who can benefit now wait for a hypothetical future optimization that isn't relevant to them? And by the looks of it, is only relevant to a small set of specialized cases? And the optimization - should anybody actually care to write it - can be transparently done on top later, so that's no reason to change merge order, either.
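(A simplified sketch of the failure path described above, paraphrased from the swap write completion handling in mm/page_io.c; names are condensed and details differ between kernel versions.)

/* Simplified: completion of a swap write that the backing device failed. */
static void end_swap_write(struct bio *bio)
{
        struct folio *folio = bio_first_folio_all(bio);

        if (bio->bi_status) {
                /*
                 * The device (e.g. zram) failed the write: compression
                 * failure, zsmalloc failure, memory limit, ... Re-dirty
                 * the folio so it is not freed; vmscan rotates it and only
                 * comes back to it after scanning everything else.
                 */
                folio_mark_dirty(folio);
        }
        folio_end_writeback(folio);
}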
On Wed, Dec 20, 2023 at 6:50 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Wed, Dec 20, 2023 at 12:59:15AM -0800, Yosry Ahmed wrote: > > On Tue, Dec 19, 2023 at 9:15 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > On Mon, Dec 18, 2023 at 01:52:23PM -0800, Yosry Ahmed wrote: > > > > > > Taking a step back from all the memory.swap.tiers vs. > > > > > > memory.zswap.writeback discussions, I think there may be a more > > > > > > fundamental problem here. If the zswap store failure is recurrent, > > > > > > pages can keep going back to the LRUs and then sent back to zswap > > > > > > eventually, only to be rejected again. For example, this can if zswap > > > > > > is above the acceptance threshold, but could be even worse if it's the > > > > > > allocator rejecting the page due to not compressing well enough. In > > > > > > the latter case, the page can keep going back and forth between zswap > > > > > > and LRUs indefinitely. > > > > > > > > > > > > You probably did not run into this as you're using zsmalloc, but it > > > > > > can happen with zbud AFAICT. Even with zsmalloc, a less problematic > > > > > > version can happen if zswap is above its acceptance threshold. > > > > > > > > > > > > This can cause thrashing and ineffective reclaim. We have an internal > > > > > > implementation where we mark incompressible pages and put them on the > > > > > > unevictable LRU when we don't have a backing swapfile (i.e. ghost > > > > > > swapfiles), and something similar may work if writeback is disabled. > > > > > > We need to scan such incompressible pages periodically though to > > > > > > remove them from the unevictable LRU if they have been dirited. > > > > > > > > > > I'm not sure this is an actual problem. > > > > > > > > > > When pages get rejected, they rotate to the furthest point from the > > > > > reclaimer - the head of the active list. We only get to them again > > > > > after we scanned everything else. > > > > > > > > > > If all that's left on the LRU is unzswappable, then you'd assume that > > > > > remainder isn't very large, and thus not a significant part of overall > > > > > scan work. Because if it is, then there is a serious problem with the > > > > > zswap configuration. > > > > > > > > > > There might be possible optimizations to determine how permanent a > > > > > rejection is, but I'm not sure the effort is called for just > > > > > yet. Rejections are already failure cases that screw up the LRU > > > > > ordering, and healthy setups shouldn't have a lot of those. I don't > > > > > think this patch adds any sort of new complications to this picture. > > > > > > > > We have workloads where a significant amount (maybe 20%? 30% not sure > > > > tbh) of the memory is incompressible. Zswap is still a very viable > > > > option for those workloads once those pages are taken out of the > > > > picture. If those pages remain on the LRUs, they will introduce a > > > > regression in reclaim efficiency. > > > > > > > > With the upstream code today, those pages go directly to the backing > > > > store, which isn't ideal in terms of LRU ordering, but this patch > > > > makes them stay on the LRUs, which can be harmful. I don't think we > > > > can just assume it is okay. Whether we make those pages unevictable or > > > > store them uncompressed in zswap, I think taking them out of the LRUs > > > > (until they are redirtied), is the right thing to do. > > > > > > This is how it works with zram as well, though, and it has plenty of > > > happy users. 
> > > > I am not sure I understand. Zram does not reject pages that do not > > compress well, right? IIUC it acts as a block device so it cannot > > reject pages. I feel like I am missing something. > > zram_write_page() can fail for various reasons - compression failure, > zsmalloc failure, the memory limit. This results in !!bio->bi_status, > __end_swap_bio_write redirtying the page, and vmscan rotating it. > > The effect is actually more pronounced with zram, because the pages > don't get activated and thus cycle faster. > > What you're raising doesn't seem to be a dealbreaker in practice. For the workloads using zram, yes, they are exclusively using zsmalloc which can store incompressible pages anyway. > > > If we already want to support taking pages away from the LRUs when > > rejected by zswap (e.g. Nhat's proposal earlier), doesn't it make > > sense to do that first so that this patch can be useful for all > > workloads? > > No. > > Why should users who can benefit now wait for a hypothetical future > optimization that isn't relevant to them? And by the looks of it, is > only relevant to a small set of specialized cases? > > And the optimization - should anybody actually care to write it - can > be transparently done on top later, so that's no reason to change > merge order, either. We can agree to disagree here, I am not trying to block this anyway. But let's at least document this in the commit message/docs/code (wherever it makes sense) -- that recurrent failures (e.g. incompressible memory) may keep going back to zswap only to get rejected, so workloads prone to this may observe some reclaim inefficiency.
On Wed, Dec 20, 2023 at 4:24 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> We can agree to disagree here, I am not trying to block this anyway.
> But let's at least document this in the commit message/docs/code
> (wherever it makes sense) -- that pages with recurrent store failures
> (e.g. incompressible memory) may keep going back to zswap only to get
> rejected, so workloads prone to this may observe some reclaim
> inefficiency.

I'll add the following caveat:

Note that if the store failures are recurring (e.g. if the pages are
incompressible), users can observe reclaim inefficiency after disabling
writeback (because the same pages might be rejected again and again).

to the zswap documentation and the cgroup documentation then? I'll
repeat this caveat in both places for self-containment purposes.
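
Once the new file is in place, turning writeback off for a workload is just a
write of "0" to that cgroup's memory.zswap.writeback, as the echo in the
documentation hunk below shows. For completeness, a minimal userspace sketch of
the same operation (the "batch" cgroup name and the /sys/fs/cgroup mount point
are assumptions made for the example, not something the patch mandates):

#include <stdio.h>

/*
 * Toggle zswap writeback for one cgroup by writing "0" or "1" to its
 * memory.zswap.writeback file. Assumes cgroup v2 is mounted at
 * /sys/fs/cgroup; the cgroup name used in main() is only an example.
 */
static int set_zswap_writeback(const char *cgroup, int enable)
{
	char path[4096];
	FILE *f;
	int ret;

	snprintf(path, sizeof(path),
		 "/sys/fs/cgroup/%s/memory.zswap.writeback", cgroup);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ret = fprintf(f, "%d\n", enable) < 0 ? -1 : 0;
	if (fclose(f) != 0)
		ret = -1;
	return ret;
}

int main(void)
{
	/* Keep the hypothetical "batch" cgroup's cold memory in zswap only. */
	return set_zswap_writeback("batch", 0) ? 1 : 0;
}
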
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3f85254f3cef..2b4ac43efdc8 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1679,6 +1679,18 @@ PAGE_SIZE multiple when read back.
 	limit, it will refuse to take any more stores before existing
 	entries fault back in or are written out to disk.

+  memory.zswap.writeback
+	A read-write single value file. The default value is "1". The
+	initial value of the root cgroup is 1, and when a new cgroup is
+	created, it inherits the current value of its parent.
+
+	When this is set to 0, all swapping attempts to swapping devices
+	are disabled. This includes both zswap writebacks, and swapping due
+	to zswap store failure.
+
+	Note that this is subtly different from setting memory.swap.max to
+	0, as it still allows for pages to be written to the zswap pool.
+
   memory.pressure
 	A read-only nested-keyed file.

diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
index 62fc244ec702..cfa653130346 100644
--- a/Documentation/admin-guide/mm/zswap.rst
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -153,6 +153,12 @@ attribute, e. g.::

 Setting this parameter to 100 will disable the hysteresis.

+Some users cannot tolerate the swapping that comes with zswap store failures
+and zswap writebacks. Swapping can be disabled entirely (without disabling
+zswap itself) on a cgroup-basis as follows:
+
+	echo 0 > /sys/fs/cgroup/<cgroup-name>/memory.zswap.writeback
+
 When there is a sizable amount of cold memory residing in the zswap pool, it
 can be advantageous to proactively write these cold pages to swap and reclaim
 the memory for other use cases. By default, the zswap shrinker is disabled.
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 43b77363ab8e..5de775e6cdd9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -219,6 +219,12 @@ struct mem_cgroup {

 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
 	unsigned long zswap_max;
+
+	/*
+	 * Prevent pages from this memcg from being written back from zswap to
+	 * swap, and from being swapped out on zswap store failures.
+	 */
+	bool zswap_writeback;
 #endif

 	unsigned long soft_limit;
@@ -1941,6 +1947,7 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
 bool obj_cgroup_may_zswap(struct obj_cgroup *objcg);
 void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size);
 void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size);
+bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg);
 #else
 static inline bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
 {
@@ -1954,6 +1961,11 @@ static inline void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg,
 					     size_t size)
 {
 }
+static inline bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
+{
+	/* if zswap is disabled, do not block pages going to the swapping device */
+	return true;
+}
 #endif

 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 08c240e16a01..a78ceaf3a65e 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -35,6 +35,7 @@ void zswap_swapoff(int type);
 void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
 void zswap_lruvec_state_init(struct lruvec *lruvec);
 void zswap_page_swapin(struct page *page);
+bool is_zswap_enabled(void);
 #else

 struct zswap_lruvec_state {};
@@ -55,6 +56,11 @@ static inline void zswap_swapoff(int type) {}
 static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {}
 static inline void zswap_lruvec_state_init(struct lruvec *lruvec) {}
 static inline void zswap_page_swapin(struct page *page) {}
+
+static inline bool is_zswap_enabled(void)
+{
+	return false;
+}
 #endif

 #endif /* _LINUX_ZSWAP_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d7bc47316acb..ae8c62c7aa53 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5538,6 +5538,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
 	memcg->zswap_max = PAGE_COUNTER_MAX;
+	WRITE_ONCE(memcg->zswap_writeback,
+		   !parent || READ_ONCE(parent->zswap_writeback));
 #endif
 	page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
 	if (parent) {
@@ -8174,6 +8176,12 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
 	rcu_read_unlock();
 }

+bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
+{
+	/* if zswap is disabled, do not block pages going to the swapping device */
+	return !is_zswap_enabled() || !memcg || READ_ONCE(memcg->zswap_writeback);
+}
+
 static u64 zswap_current_read(struct cgroup_subsys_state *css,
 			      struct cftype *cft)
 {
@@ -8206,6 +8214,31 @@ static ssize_t zswap_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }

+static int zswap_writeback_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	seq_printf(m, "%d\n", READ_ONCE(memcg->zswap_writeback));
+	return 0;
+}
+
+static ssize_t zswap_writeback_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	int zswap_writeback;
+	ssize_t parse_ret = kstrtoint(strstrip(buf), 0, &zswap_writeback);
+
+	if (parse_ret)
+		return parse_ret;
+
+	if (zswap_writeback != 0 && zswap_writeback != 1)
+		return -EINVAL;
+
+	WRITE_ONCE(memcg->zswap_writeback, zswap_writeback);
+	return nbytes;
+}
+
 static struct cftype zswap_files[] = {
 	{
 		.name = "zswap.current",
@@ -8218,6 +8251,11 @@ static struct cftype zswap_files[] = {
 		.seq_show = zswap_max_show,
 		.write = zswap_max_write,
 	},
+	{
+		.name = "zswap.writeback",
+		.seq_show = zswap_writeback_show,
+		.write = zswap_writeback_write,
+	},
 	{ }	/* terminate */
 };
 #endif /* CONFIG_MEMCG_KMEM && CONFIG_ZSWAP */
diff --git a/mm/page_io.c b/mm/page_io.c
index cb559ae324c6..5e606f1aa2f6 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -201,6 +201,12 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 		folio_end_writeback(folio);
 		return 0;
 	}
+
+	if (!mem_cgroup_zswap_writeback_enabled(folio_memcg(folio))) {
+		folio_mark_dirty(folio);
+		return AOP_WRITEPAGE_ACTIVATE;
+	}
+
 	__swap_writepage(&folio->page, wbc);
 	return 0;
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index c62f904ba1ca..dd084fbafcf1 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1514,8 +1514,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)

 		mutex_unlock(&shmem_swaplist_mutex);
 		BUG_ON(folio_mapped(folio));
-		swap_writepage(&folio->page, wbc);
-		return 0;
+		return swap_writepage(&folio->page, wbc);
 	}

 	mutex_unlock(&shmem_swaplist_mutex);
diff --git a/mm/zswap.c b/mm/zswap.c
index daaa949837f2..7ee54a3d8281 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -153,6 +153,11 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
 		CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
 module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);

+bool is_zswap_enabled(void)
+{
+	return zswap_enabled;
+}
+
 /*********************************
 * data structures
 **********************************/
@@ -596,7 +601,8 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
 	struct zswap_pool *pool = shrinker->private_data;
 	bool encountered_page_in_swapcache = false;

-	if (!zswap_shrinker_enabled) {
+	if (!zswap_shrinker_enabled ||
+	    !mem_cgroup_zswap_writeback_enabled(sc->memcg)) {
 		sc->nr_scanned = 0;
 		return SHRINK_STOP;
 	}
@@ -637,7 +643,7 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
 	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
 	unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;

-	if (!zswap_shrinker_enabled)
+	if (!zswap_shrinker_enabled || !mem_cgroup_zswap_writeback_enabled(memcg))
 		return 0;

 #ifdef CONFIG_MEMCG_KMEM
@@ -956,6 +962,9 @@ static int shrink_memcg(struct mem_cgroup *memcg)
 	struct zswap_pool *pool;
 	int nid, shrunk = 0;

+	if (!mem_cgroup_zswap_writeback_enabled(memcg))
+		return -EINVAL;
+
 	/*
 	 * Skip zombies because their LRUs are reparented and we would be
 	 * reclaiming from the parent instead of the dead memcg.