Message ID | 20250115120625.3785-1-donettom@linux.ibm.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [RFC] mm/swap.c: Enable promotion of unmapped MGLRU page cache pages | expand |
On Wed, Jan 15, 2025 at 06:06:25AM -0600, Donet Tom wrote: > An MGLRU page cache page is eligible for promotion when: > > 1. Memory Tiering and pagecache_promotion_enabled are enabled > 2. It resides in a lower memory tier. > 3. It is referenced. > 4. It is part of the working set. > 5. It belongs to the active list. > For MGLRU, the youngest generation and the youngest generation - 1 > are treated as the active list. ... why do you only promote folios if they belong to a memcg? > +promo_candid: > + if (!folio_test_isolated(folio) && > + (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) && > + numa_pagecache_promotion_enabled) { > + memcg = folio_memcg(folio); > + if (memcg) { > + lruvec = mem_cgroup_lruvec(memcg, folio_pgdat(folio)); > + gen = folio_lru_gen(folio); > + > + if ((gen < MAX_NR_GENS) && lru_gen_is_active(lruvec, gen)) > + promotion_candidate(folio); > + } Should there be an 'else' clause here? Or should it be: lruvec = mem_cgroup_lruvec(memcg, folio_pgdat(folio)); gen = folio_lru_gen(folio); if ((gen < MAX_NR_GENS) && lru_gen_is_active(lruvec, gen)) promotion_candidate(folio);
On Wed, Jan 15, 2025 at 06:06:25AM -0600, Donet Tom wrote: ... snip ... Thank you for taking the time to do this, I don't have enough background with MGLRU to have done this quickly. I'll pull this onto my branch and carry it if you don't mind so we can keep things tracked. I'll send you an updated RFC before i send out v4 and add appropriate tags. > This difference also impacts read latency: > > For MGLRU, the first read shows higher latency due to the combined > overhead of accessing a lower tier and performing promotion. > > For LRU, the first 3–4 reads typically exhibit lower latency since > promotion does not occur immediately. > Do you have a thought on a good test we can use to compare these strategies? We decided against promotion on first-access because there are many easy-to-imagine scenarios where that will clearly harm performance. We're planning to do some workload testing soon so we can get actual benefit numbers. > +promo_candid: > + if (!folio_test_isolated(folio) && > + (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) && > + numa_pagecache_promotion_enabled) { I am considering putting this in some inline wrapper with some likely() tags to clean this up a bit and optimize the fall-through cases since i've seen some measurable differences when left as-is. Thoughts on this are welcome > + memcg = folio_memcg(folio); > + if (memcg) { Also curious, why only promote when the folio is a member of a memcg? ~Gregory
On 1/15/25 18:55, Matthew Wilcox wrote: > On Wed, Jan 15, 2025 at 06:06:25AM -0600, Donet Tom wrote: >> An MGLRU page cache page is eligible for promotion when: >> >> 1. Memory Tiering and pagecache_promotion_enabled are enabled >> 2. It resides in a lower memory tier. >> 3. It is referenced. >> 4. It is part of the working set. >> 5. It belongs to the active list. >> For MGLRU, the youngest generation and the youngest generation - 1 >> are treated as the active list. > ... why do you only promote folios if they belong to a memcg? I missed adding non memcg folios for promotion . I will add them too. > >> +promo_candid: >> + if (!folio_test_isolated(folio) && >> + (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) && >> + numa_pagecache_promotion_enabled) { >> + memcg = folio_memcg(folio); >> + if (memcg) { >> + lruvec = mem_cgroup_lruvec(memcg, folio_pgdat(folio)); >> + gen = folio_lru_gen(folio); >> + >> + if ((gen < MAX_NR_GENS) && lru_gen_is_active(lruvec, gen)) >> + promotion_candidate(folio); >> + } > Should there be an 'else' clause here? Or should it be: > > lruvec = mem_cgroup_lruvec(memcg, folio_pgdat(folio)); > gen = folio_lru_gen(folio); > if ((gen < MAX_NR_GENS) && lru_gen_is_active(lruvec, gen)) > promotion_candidate(folio); > Yes, there should be an else block to handle the promotion of non memcg folios. Does this approach look correct to you? } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags)); + +promo_candid: + if (!folio_test_isolated(folio) && + (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) && + numa_pagecache_promotion_enabled) { + pgdat = folio_pgdat(folio); + gen = folio_lru_gen(folio); + + if (mem_cgroup_disabled()) + lruvec = &pgdat->__lruvec; + else { + memcg = folio_memcg(folio); + lruvec = mem_cgroup_lruvec(memcg, pgdat); + } + + if ((gen < MAX_NR_GENS) && lru_gen_is_active(lruvec, gen)) + promotion_candidate(folio); + } }
On 1/15/25 21:39, Gregory Price wrote: > On Wed, Jan 15, 2025 at 06:06:25AM -0600, Donet Tom wrote: > ... snip ... > > Thank you for taking the time to do this, I don't have enough background > with MGLRU to have done this quickly. I'll pull this onto my branch and > carry it if you don't mind so we can keep things tracked. > > I'll send you an updated RFC before i send out v4 and add appropriate tags. Sure. Thank you. > >> This difference also impacts read latency: >> >> For MGLRU, the first read shows higher latency due to the combined >> overhead of accessing a lower tier and performing promotion. >> >> For LRU, the first 3–4 reads typically exhibit lower latency since >> promotion does not occur immediately. >> > Do you have a thought on a good test we can use to compare these > strategies? I am also thinking on the same. I will come back with some tests which can compare these strategies. > > We decided against promotion on first-access because there are many > easy-to-imagine scenarios where that will clearly harm performance. Yes, I agree. After the first access, it gets promoted, but if those pages are not accessed afterward, it will become an overhead. I will check if we have any mechanism to restrict it. > > We're planning to do some workload testing soon so we can get actual > benefit numbers. Could you please let me know what workload you are planning? I can try it on my system as well. > >> +promo_candid: >> + if (!folio_test_isolated(folio) && >> + (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) && >> + numa_pagecache_promotion_enabled) { > I am considering putting this in some inline wrapper with some likely() > tags to clean this up a bit and optimize the fall-through cases since > i've seen some measurable differences when left as-is. Sounds good to me. > > Thoughts on this are welcome > >> + memcg = folio_memcg(folio); >> + if (memcg) { > Also curious, why only promote when the folio is a member of a memcg? I missed non memcg pages. I will send a V2 with that. ~Donet > > ~Gregory
On Thu, Jan 16, 2025 at 04:40:21PM +0530, Donet Tom wrote: > > Should there be an 'else' clause here? Or should it be: > > > > lruvec = mem_cgroup_lruvec(memcg, folio_pgdat(folio)); > > gen = folio_lru_gen(folio); > > if ((gen < MAX_NR_GENS) && lru_gen_is_active(lruvec, gen)) > > promotion_candidate(folio); > > > Yes, there should be an else block to handle the promotion of non memcg > folios. > > Does this approach look correct to you? No. Look at the implementation of folio_memcg(), and mem_cgroup_lruvec() for that matter. > } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags)); > + > +promo_candid: > + if (!folio_test_isolated(folio) && > + (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) > && > + numa_pagecache_promotion_enabled) { > + pgdat = folio_pgdat(folio); > + gen = folio_lru_gen(folio); > + > + if (mem_cgroup_disabled()) > + lruvec = &pgdat->__lruvec; > + else { > + memcg = folio_memcg(folio); > + lruvec = mem_cgroup_lruvec(memcg, pgdat); > + } > + > + if ((gen < MAX_NR_GENS) && lru_gen_is_active(lruvec, gen)) > + promotion_candidate(folio); > + } > } > >
diff --git a/mm/swap.c b/mm/swap.c index b2341bc18452..121de1d7e938 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -386,6 +386,9 @@ static void __lru_cache_activate_folio(struct folio *folio) static void lru_gen_inc_refs(struct folio *folio) { + struct mem_cgroup *memcg; + struct lruvec *lruvec; + int gen; unsigned long new_flags, old_flags = READ_ONCE(folio->flags); if (folio_test_unevictable(folio)) @@ -399,13 +402,29 @@ static void lru_gen_inc_refs(struct folio *folio) do { if ((old_flags & LRU_REFS_MASK) == LRU_REFS_MASK) { - if (!folio_test_workingset(folio)) + if (!folio_test_workingset(folio)) { folio_set_workingset(folio); - return; + return; + } + goto promo_candid; } new_flags = old_flags + BIT(LRU_REFS_PGOFF); } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags)); + +promo_candid: + if (!folio_test_isolated(folio) && + (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) && + numa_pagecache_promotion_enabled) { + memcg = folio_memcg(folio); + if (memcg) { + lruvec = mem_cgroup_lruvec(memcg, folio_pgdat(folio)); + gen = folio_lru_gen(folio); + + if ((gen < MAX_NR_GENS) && lru_gen_is_active(lruvec, gen)) + promotion_candidate(folio); + } + } } static bool lru_gen_clear_refs(struct folio *folio)
This patch is based on patch [1], which introduced support for promoting unmapped normal LRU page cache pages. Here, we extend the functionality to support promotion of MGLRU page cache pages. An MGLRU page cache page is eligible for promotion when: 1. Memory Tiering and pagecache_promotion_enabled are enabled 2. It resides in a lower memory tier. 3. It is referenced. 4. It is part of the working set. 5. It belongs to the active list. For MGLRU, the youngest generation and the youngest generation - 1 are treated as the active list. When a page is accessed through a file descriptor, folio_inc_refs() is invoked. In this function, we check whether the page is referenced, is part of the working set, and Belongs to the active list. If all these conditions are met, the page is added to the promotion list. The per-process task task_numa_promotion_work() take the pages from the promotion list and promotes them to a higher memory tier. Test process: We measured the read time in below scenarios for both LRU and MGLRU. Scenario 1: Pages are on Lower tier + promotion off Scenario 2: Pages are on Lower tier + promotion on Scenario 3: Pages are on higher tier Test Results MGLRU --------------------------------------------------------------- Pages on higher | Pages Lower tier | Pages on Lower Tier | Tier | promotion off | Promotion On | --------------------------------------------------------------- 1.5s | 3.2s |During Promotion - 6.6s | | |After Promotion - 1.6s | | | | --------------------------------------------------------------- Test Results LRU --------------------------------------------------------------- Pages on higher | Pages Lower tier | Pages on Lower Tier | Tier | promotion off | Promotion On | --------------------------------------------------------------- 1.5s | 3.2s |During Promotion - 4.2s | | | - 3.4s | | | - 5.6s | | |After Promotion - 1.6s | | | | --------------------------------------------------------------- In LRU, pages are initially added to the inactive list. When a page is referenced, it is moved to the active list, and promotion to a higher memory tier occurs from the active list. This process often requires multiple reads to trigger promotion. In contrast, MGLRU adds new pages to the youngest generation immediately. As a result, we observe that MGLRU promotes pages on the first read itself, unlike LRU which takes multiple reads to trigger promotion. This difference also impacts read latency: For MGLRU, the first read shows higher latency due to the combined overhead of accessing a lower tier and performing promotion. For LRU, the first 3–4 reads typically exhibit lower latency since promotion does not occur immediately. MGLRU and LRU are showing similar performance benefit. [1] https://lore.kernel.org/all/20250107000346.1338481-1-gourry@gourry.net/ Signed-off-by: Donet Tom <donettom@linux.ibm.com> --- mm/swap.c | 23 +++++++++++++++++++++-- 1 file changed, 21 insertions(+), 2 deletions(-)