Message ID | 20250329110230.2459730-2-nphamcs@gmail.com (mailing list archive)
---|---
State | New
Series | zswap: fix placement inversion in memory tiering systems
Nhat Pham wrote:
> Currently, zsmalloc does not specify any memory policy when it allocates
> memory for the compressed objects.
>
> Let users select the NUMA node for the memory allocation, through the
> zpool-based API. Direct callers (i.e., zram) should not observe any
> behavioral change.
>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> ---
>  include/linux/zpool.h |  4 ++--
>  mm/zpool.c            |  8 +++++---
>  mm/zsmalloc.c         | 28 +++++++++++++++++++++-------
>  mm/zswap.c            |  2 +-
>  4 files changed, 29 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/zpool.h b/include/linux/zpool.h
> index 52f30e526607..0df8722e13d7 100644
> --- a/include/linux/zpool.h
> +++ b/include/linux/zpool.h
> @@ -22,7 +22,7 @@ const char *zpool_get_type(struct zpool *pool);
>  void zpool_destroy_pool(struct zpool *pool);
>
>  int zpool_malloc(struct zpool *pool, size_t size, gfp_t gfp,
> -		 unsigned long *handle);
> +		 unsigned long *handle, int *nid);

I agree with Johannes about the policy knob, so I'll just comment on the
implementation.

Why not just pass a "const int" for @nid, and use "NUMA_NO_NODE" for the
"default" case? alloc_pages_node_noprof() is already prepared for a
NUMA_NO_NODE argument.
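For reference, the shape Dan suggests would look roughly like the
following (a sketch only, not the posted patch; the function bodies are
illustrative):

/*
 * Sketch: take a plain int, with NUMA_NO_NODE meaning "no preference".
 * alloc_pages_node() accepts NUMA_NO_NODE and resolves it to a nearby
 * node, so no NULL-pointer special case is needed.
 */
int zpool_malloc(struct zpool *pool, size_t size, gfp_t gfp,
		 unsigned long *handle, int nid);

static inline struct zpdesc *alloc_zpdesc(gfp_t gfp, int nid)
{
	struct page *page = alloc_pages_node(nid, gfp, 0);

	return page_zpdesc(page);
}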
On Mon, Mar 31, 2025 at 3:18 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Nhat Pham wrote:
> > Currently, zsmalloc does not specify any memory policy when it allocates
> > memory for the compressed objects.
> >
> > Let users select the NUMA node for the memory allocation, through the
> > zpool-based API. Direct callers (i.e., zram) should not observe any
> > behavioral change.
> >
> > Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> > ---
> >  include/linux/zpool.h |  4 ++--
> >  mm/zpool.c            |  8 +++++---
> >  mm/zsmalloc.c         | 28 +++++++++++++++++++++-------
> >  mm/zswap.c            |  2 +-
> >  4 files changed, 29 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/linux/zpool.h b/include/linux/zpool.h
> > index 52f30e526607..0df8722e13d7 100644
> > --- a/include/linux/zpool.h
> > +++ b/include/linux/zpool.h
> > @@ -22,7 +22,7 @@ const char *zpool_get_type(struct zpool *pool);
> >  void zpool_destroy_pool(struct zpool *pool);
> >
> >  int zpool_malloc(struct zpool *pool, size_t size, gfp_t gfp,
> > -		 unsigned long *handle);
> > +		 unsigned long *handle, int *nid);
>
> I agree with Johannes about the policy knob, so I'll just comment on the
> implementation.
>
> Why not just pass a "const int" for @nid, and use "NUMA_NO_NODE" for the
> "default" case? alloc_pages_node_noprof() is already prepared for a
> NUMA_NO_NODE argument.

Gregory and Johannes gave me that suggestion too! However, my
understanding was that alloc_pages_node(NUMA_NO_NODE, ...) !=
alloc_page(...), and I was trying to preserve the latter, since it is
the "original behavior" (primarily for !same_node_mode, but also for
zram, which I tried not to change in this patch).

Please correct me if I'm mistaken, but IIUC:

1. alloc_pages_node(NUMA_NO_NODE, ...) would go to the local/closest
node:

/*
 * Allocate pages, preferring the node given as nid. When nid == NUMA_NO_NODE,
 * prefer the current CPU's closest node. Otherwise node must be valid and
 * online.
 */
static inline struct page *alloc_pages_node_noprof(int nid, gfp_t gfp_mask,
						unsigned int order)
{
	if (nid == NUMA_NO_NODE)
		nid = numa_mem_id();

2. whereas alloc_page(...) (i.e., the "original" behavior) would
actually adopt the allocation policy of the task entering reclaim, by
calling:

struct page *alloc_frozen_pages_noprof(gfp_t gfp, unsigned order)
{
	struct mempolicy *pol = &default_policy;

	/*
	 * No reference counting needed for current->mempolicy
	 * nor system default_policy
	 */
	if (!in_interrupt() && !(gfp & __GFP_THISNODE))
		pol = get_task_policy(current);

Now, I think the "original" behavior is dumb/broken and should be
changed altogether. We should probably always pass the page's node id.

On the zswap side, in the next version I'll remove the same_node_mode
sysfs knob and always pass the page's node id to zsmalloc and the page
allocator. That will clean up the zpool path per your (and Johannes'
and Gregory's) suggestion.

That still leaves zram though. zram is more complicated than zswap -
it has multiple allocation paths, so I don't want to touch it quite
yet (and preferably a zram maintainer/developer should do it). :) Or,
if zram maintainers are happy with NUMA_NO_NODE, then we can
completely get rid of the pointer arguments etc.
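The distinction Nhat draws reduces to roughly the following
(an illustrative contrast in a hypothetical helper, not real kernel
code):

/*
 * With an explicit nid (or NUMA_NO_NODE resolved to numa_mem_id()),
 * the allocation preference is pinned to one node. A bare alloc_page()
 * instead consults the calling task's mempolicy, e.g. an interleave or
 * bind policy set via set_mempolicy().
 */
static struct page *alloc_compressed_page(gfp_t gfp, int nid)
{
	if (nid != NUMA_NO_NODE)
		return alloc_pages_node(nid, gfp, 0);	/* pinned preference */

	return alloc_page(gfp);		/* task mempolicy applies */
}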
Nhat Pham wrote:
[..]
> That still leaves zram though. zram is more complicated than zswap -
> it has multiple allocation paths, so I don't want to touch it quite
> yet (and preferably a zram maintainer/developer should do it). :) Or,
> if zram maintainers are happy with NUMA_NO_NODE, then we can
> completely get rid of the pointer arguments etc.

At a minimum, make the argument a "const int *" so it does not look like
the value can be changed by the leaf functions.

...but I would challenge zram folks to ack/nak that change rather than
maintain old behavior based purely on momentum. I.e. add to the commit
message an "exclude zram from this policy for $explicit_reason"
statement rather than the current $implicit_guess that the old behavior
is there for "reasons".
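The const-qualified variant Dan asks for would read approximately as
follows (a sketch; only the qualifier changes, for both the zpool API
and the driver op):

int zpool_malloc(struct zpool *pool, size_t size, gfp_t gfp,
		 unsigned long *handle, const int *nid);

	int (*malloc)(void *pool, size_t size, gfp_t gfp,
		      unsigned long *handle, const int *nid);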
On Mon, Mar 31, 2025 at 4:23 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Nhat Pham wrote:
> [..]
> > That still leaves zram though. zram is more complicated than zswap -
> > it has multiple allocation paths, so I don't want to touch it quite
> > yet (and preferably a zram maintainer/developer should do it). :) Or,
> > if zram maintainers are happy with NUMA_NO_NODE, then we can
> > completely get rid of the pointer arguments etc.
>
> At a minimum, make the argument a "const int *" so it does not look like
> the value can be changed by the leaf functions.

That's a good idea! I'll incorporate it into the next version.

>
> ...but I would challenge zram folks to ack/nak that change rather than
> maintain old behavior based purely on momentum. I.e. add to the commit
> message an "exclude zram from this policy for $explicit_reason"
> statement rather than the current $implicit_guess that the old behavior
> is there for "reasons".

Yeah, I'll take a look at the zram code. There's no conceptual reason
why zram cannot or should not adopt the same behavior, from my POV -
it's just a bit more complicated than zswap, and as such would require
more hacking. So I'm leaning towards making the minimal change required
to support zswap first, and letting the zram maintainers/developers
work it out on the zram side :) We'll see!
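The zswap-side follow-up Nhat describes would presumably amount to
something like this in zswap_compress() (a hypothetical excerpt; the
next version of the series may differ):

	/* Always prefer the node of the page being compressed, so the
	 * compressed copy stays on (or near) the same node. */
	int nid = page_to_nid(page);

	alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle, &nid);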
diff --git a/include/linux/zpool.h b/include/linux/zpool.h
index 52f30e526607..0df8722e13d7 100644
--- a/include/linux/zpool.h
+++ b/include/linux/zpool.h
@@ -22,7 +22,7 @@ const char *zpool_get_type(struct zpool *pool);
 void zpool_destroy_pool(struct zpool *pool);
 
 int zpool_malloc(struct zpool *pool, size_t size, gfp_t gfp,
-		 unsigned long *handle);
+		 unsigned long *handle, int *nid);
 
 void zpool_free(struct zpool *pool, unsigned long handle);
 
@@ -64,7 +64,7 @@ struct zpool_driver {
 	void (*destroy)(void *pool);
 
 	int (*malloc)(void *pool, size_t size, gfp_t gfp,
-		      unsigned long *handle);
+		      unsigned long *handle, int *nid);
 	void (*free)(void *pool, unsigned long handle);
 
 	void *(*obj_read_begin)(void *pool, unsigned long handle,
diff --git a/mm/zpool.c b/mm/zpool.c
index 6d6d88930932..591a13b99755 100644
--- a/mm/zpool.c
+++ b/mm/zpool.c
@@ -226,20 +226,22 @@ const char *zpool_get_type(struct zpool *zpool)
  * @size: The amount of memory to allocate.
  * @gfp: The GFP flags to use when allocating memory.
  * @handle: Pointer to the handle to set
+ * @nid: Pointer to the preferred node id.
  *
  * This allocates the requested amount of memory from the pool.
  * The gfp flags will be used when allocating memory, if the
  * implementation supports it. The provided @handle will be
- * set to the allocated object handle.
+ * set to the allocated object handle. If @nid is provided, the
+ * allocation will prefer the specified node.
  *
  * Implementations must guarantee this to be thread-safe.
  *
  * Returns: 0 on success, negative value on error.
  */
 int zpool_malloc(struct zpool *zpool, size_t size, gfp_t gfp,
-		 unsigned long *handle)
+		 unsigned long *handle, int *nid)
 {
-	return zpool->driver->malloc(zpool->pool, size, gfp, handle);
+	return zpool->driver->malloc(zpool->pool, size, gfp, handle, nid);
 }
 
 /**
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 961b270f023c..35f61f14c32e 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -243,9 +243,14 @@ static inline void zpdesc_dec_zone_page_state(struct zpdesc *zpdesc)
 	dec_zone_page_state(zpdesc_page(zpdesc), NR_ZSPAGES);
 }
 
-static inline struct zpdesc *alloc_zpdesc(gfp_t gfp)
+static inline struct zpdesc *alloc_zpdesc(gfp_t gfp, int *nid)
 {
-	struct page *page = alloc_page(gfp);
+	struct page *page;
+
+	if (nid)
+		page = alloc_pages_node(*nid, gfp, 0);
+	else
+		page = alloc_page(gfp);
 
 	return page_zpdesc(page);
 }
@@ -461,10 +466,13 @@ static void zs_zpool_destroy(void *pool)
 	zs_destroy_pool(pool);
 }
 
+static unsigned long zs_malloc_node(struct zs_pool *pool, size_t size,
+				gfp_t gfp, int *nid);
+
 static int zs_zpool_malloc(void *pool, size_t size, gfp_t gfp,
-			unsigned long *handle)
+			unsigned long *handle, int *nid)
 {
-	*handle = zs_malloc(pool, size, gfp);
+	*handle = zs_malloc_node(pool, size, gfp, nid);
 
 	if (IS_ERR_VALUE(*handle))
 		return PTR_ERR((void *)*handle);
@@ -1044,7 +1052,7 @@ static void create_page_chain(struct size_class *class, struct zspage *zspage,
  */
 static struct zspage *alloc_zspage(struct zs_pool *pool,
 					struct size_class *class,
-					gfp_t gfp)
+					gfp_t gfp, int *nid)
 {
 	int i;
 	struct zpdesc *zpdescs[ZS_MAX_PAGES_PER_ZSPAGE];
@@ -1061,7 +1069,7 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
 	for (i = 0; i < class->pages_per_zspage; i++) {
 		struct zpdesc *zpdesc;
 
-		zpdesc = alloc_zpdesc(gfp);
+		zpdesc = alloc_zpdesc(gfp, nid);
 		if (!zpdesc) {
 			while (--i >= 0) {
 				zpdesc_dec_zone_page_state(zpdescs[i]);
@@ -1342,6 +1350,12 @@ static unsigned long obj_malloc(struct zs_pool *pool,
  * Allocation requests with size > ZS_MAX_ALLOC_SIZE will fail.
  */
 unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
+{
+	return zs_malloc_node(pool, size, gfp, NULL);
+}
+
+static unsigned long zs_malloc_node(struct zs_pool *pool, size_t size,
+				gfp_t gfp, int *nid)
 {
 	unsigned long handle;
 	struct size_class *class;
@@ -1376,7 +1390,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
 
 	spin_unlock(&class->lock);
 
-	zspage = alloc_zspage(pool, class, gfp);
+	zspage = alloc_zspage(pool, class, gfp, nid);
 	if (!zspage) {
 		cache_free_handle(pool, handle);
 		return (unsigned long)ERR_PTR(-ENOMEM);
diff --git a/mm/zswap.c b/mm/zswap.c
index 204fb59da33c..89b6d4ade4cd 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -981,7 +981,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	zpool = pool->zpool;
 	gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
 
-	alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle);
+	alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle, NULL);
 	if (alloc_ret)
 		goto unlock;
Currently, zsmalloc does not specify any memory policy when it allocates
memory for the compressed objects.

Let users select the NUMA node for the memory allocation, through the
zpool-based API. Direct callers (i.e., zram) should not observe any
behavioral change.

Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 include/linux/zpool.h |  4 ++--
 mm/zpool.c            |  8 +++++---
 mm/zsmalloc.c         | 28 +++++++++++++++++++++-------
 mm/zswap.c            |  2 +-
 4 files changed, 29 insertions(+), 13 deletions(-)
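As a usage sketch of the new parameter (a hypothetical caller, not part
of the patch):

/* Hypothetical zpool user steering the compressed copy to the page's
 * node; passing NULL instead of &nid keeps the old alloc_page()
 * behavior, which is what direct callers such as zram continue to get. */
static int compress_to_node(struct zpool *zpool, struct page *page,
			    size_t size, unsigned long *handle)
{
	int nid = page_to_nid(page);

	return zpool_malloc(zpool, size,
			    GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM,
			    handle, &nid);
}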