Message ID | 20210809223740.59009-1-npache@redhat.com (mailing list archive)
---|---
State | New
Series | [v3] vm_swappiness=0 should still try to avoid swapping anon memory
Hello Nico,

On Mon, Aug 09, 2021 at 06:37:40PM -0400, Nico Pache wrote:
> Since commit 170b04b7ae49 ("mm/workingset: prepare the workingset detection
> infrastructure for anon LRU") and commit b91ac374346b ("mm: vmscan: enforce
> inactive:active ratio at the reclaim root") swappiness can start prematurely

Could you clarify what you mean by "prematurely"?

The new balancing algorithm targets the lowest amount of overall
paging IO performed across the anon and file sets. It doesn't swap
unless it has an indication that a couple of swap writes are
outweighed by a reduction of reads on the cache side.

Is this not working for you?

> swapping anon memory. This is due to the assumption that refaulting anon should
> always allow the shrinker to target anon memory.

This doesn't sound right. Did you mean "refaulting file"?

> Add a check for swappiness being >0 before indiscriminately
> targeting Anon.

> Before these commits when a user had swappiness=0 anon memory would
> rarely get swapped; this behavior has remained constant sense
> RHEL5. This commit keeps that behavior intact and prevents the new
> workingset refaulting from challenging the anon memory when
> swappiness=0.

I'm wondering how you're getting anon scans with swappiness=0. If you
look at get_scan_count(), SCAN_FRACT with swappiness=0 should always
result in ap = fraction[0] = 0, which never yields any anon scan
targets. So I'm thinking you're running into sc->file_is_tiny
situations, meaning remaining file pages alone are not enough to
restore watermarks anymore. Is that possible?

In that case, anon scanning is forced, and always has been. But the
difference is that before the above-mentioned patches, we'd usually
force scan just the smaller inactive list, whereas now we disable
active list protection due to swapins and target the entire anon
set. I suppose you'd prefer we go back to that, so that more pressure
remains proportionally on the file set, and just enough anon to get
above the watermarks again.

One complication I could see with that is that we no longer start anon
pages on the active list like we used to. We used to say active until
proven otherwise; now it's inactive until proven otherwise. It's
possible for the inactive list to contain a much bigger share of the
total anon set now than before, in which case your patch wouldn't have
the desired effect of targeting just a small amount of anon pages to
get over the watermark hump.

We may need a get_scan_count() solution after all, and I agree with
previous reviews that this is the better location for such an issue...

One thing I think we should do - whether we need more on top or not -
is allowing file reclaim to continue when sc->file_is_tiny. Yes, we
also need anon to meet the watermarks, but it's not clear why we
should stop scanning file pages altogether: it's possible they get us
there 99% of the way, and somebody clearly wanted us to swap as little
as possible to end up in a situation like that, so:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index eeab6611993c..90dac3dc9903 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2477,7 +2477,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	 * If the system is almost out of file pages, force-scan anon.
 	 */
 	if (sc->file_is_tiny) {
-		scan_balance = SCAN_ANON;
+		scan_balance = SCAN_EQUAL;
 		goto out;
 	}
 
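For illustration, here is a simplified paraphrase (not the exact upstream code) of the SCAN_FRACT arithmetic referenced above; with swappiness == 0 the anon share always works out to zero, so the fractional path on its own never produces anon scan targets:

/*
 * Simplified sketch of the SCAN_FRACT balancing in get_scan_count()
 * (paraphrased; names and details may differ from the real code).
 */
static void fractional_balance(unsigned long anon_cost, unsigned long file_cost,
			       int swappiness, unsigned long fraction[2])
{
	unsigned long total_cost = anon_cost + file_cost;
	unsigned long ap, fp;

	/* anon pressure scales with swappiness (0..200) */
	ap = swappiness * (total_cost + 1);
	ap /= anon_cost + 1;

	/* file pressure scales with the remainder of the range */
	fp = (200 - swappiness) * (total_cost + 1);
	fp /= file_cost + 1;

	fraction[0] = ap;	/* anon share; 0 whenever swappiness == 0 */
	fraction[1] = fp;	/* file share */
}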
On 8/9/21 6:37 PM, Nico Pache wrote:
> Since commit 170b04b7ae49 ("mm/workingset: prepare the workingset detection
> infrastructure for anon LRU") and commit b91ac374346b ("mm: vmscan: enforce
> inactive:active ratio at the reclaim root") swappiness can start prematurely
> swapping anon memory. This is due to the assumption that refaulting anon should
> always allow the shrinker to target anon memory. Add a check for swappiness
> being >0 before indiscriminately targeting Anon. Before these commits
> when a user had swappiness=0 anon memory would rarely get swapped; this
> behavior has remained constant sense RHEL5. This commit keeps that behavior

Typo: "sense" -> "since"

> intact and prevents the new workingset refaulting from challenging the anon
> memory when swappiness=0.
>
> Anon can still be swapped to prevent OOM. This does not completely disable
> swapping, but rather tames the refaulting aspect of the code that allows for
> the deactivating of anon memory.
>
> We have two customer workloads that discovered this issue:
> 1) A VM claiming 95% of the hosts memory followed by file reads (never dirty)
> which begins to challenge the anon. Refaulting the anon working set will then
> cause the indiscriminant swapping of the anon.
>
> 2) A VM running a in-memory DB is being populated from file reads.
> Swappiness is set to 0 or 1 to defer write I/O as much as possible. Once
> the customer experienced low memory, swapping anon starts, with
> little-to-no PageCache being swapped.

Pagecache is not swapped. It is discarded under memory pressure and
written back if dirty.

Other than that, the patch looks good to me.

Cheers,
Longman
On 8/10/21 11:27 AM, Johannes Weiner wrote:
> Hello Nico,
>
> On Mon, Aug 09, 2021 at 06:37:40PM -0400, Nico Pache wrote:
>> Since commit 170b04b7ae49 ("mm/workingset: prepare the workingset detection
>> infrastructure for anon LRU") and commit b91ac374346b ("mm: vmscan: enforce
>> inactive:active ratio at the reclaim root") swappiness can start prematurely
>
> Could you clarify what you mean by "prematurely"?

Hi Johannes!

The reason I used the words prematurely and indiscriminately when trying to
describe the behavior is that, AFAICS, the swappiness value is not being
considered and this isn't an OOM case, so it's prematurely going after anon
memory.

> The new balancing algorithm targets the lowest amount of overall
> paging IO performed across the anon and file sets. It doesn't swap
> unless it has an indication that a couple of swap writes are
> outweighed by a reduction of reads on the cache side.
>
> Is this not working for you?

Well, it is for the most part, but to your point below, the sc->file_is_tiny
case can directly bypass the meaning of swappiness and choose to do whatever
it likes.

>> swapping anon memory. This is due to the assumption that refaulting anon should
>> always allow the shrinker to target anon memory.
>
> This doesn't sound right. Did you mean "refaulting file"?

<code>
	refaults = lruvec_page_state(target_lruvec,
			WORKINGSET_ACTIVATE_ANON);
	if (refaults != target_lruvec->refaults[0] ||
	    inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
		sc->may_deactivate |= DEACTIVATE_ANON;
</code>

Perhaps this is incorrect then? target_lruvec is using refaults[0], which is
collected in snapshot_refaults(). snapshot_refaults() populates index 0 with
the WORKINGSET_ACTIVATE_ANON page state, and the refaults variable is doing
the same. So I assumed the refaulting (new refault count != snapshot count)
is comparing that of the anon workingset memory, not the refaulting of file
cache.

>> Add a check for swappiness being >0 before indiscriminately
>> targeting Anon.
>
>> Before these commits when a user had swappiness=0 anon memory would
>> rarely get swapped; this behavior has remained constant sense
>> RHEL5. This commit keeps that behavior intact and prevents the new
>> workingset refaulting from challenging the anon memory when
>> swappiness=0.
>
> I'm wondering how you're getting anon scans with swappiness=0. If you
> look at get_scan_count(), SCAN_FRACT with swappiness=0 should always
> result in ap = fraction[0] = 0, which never yields any anon scan
> targets. So I'm thinking you're running into sc->file_is_tiny
> situations, meaning remaining file pages alone are not enough to
> restore watermarks anymore. Is that possible?

Yes, DEACTIVATE_ANON is enabling the file_is_tiny case in shrink_node(). That
is what I'm trying to prevent in the swappiness=0 case.

> In that case, anon scanning is forced, and always has been. But the
> difference is that before the above-mentioned patches, we'd usually
> force scan just the smaller inactive list, whereas now we disable
> active list protection due to swapins and target the entire anon
> set. I suppose you'd prefer we go back to that, so that more pressure
> remains proportionally on the file set, and just enough anon to get
> above the watermarks again.

Well, kind of. It used to be that inactive_list_is_low would allow for the
scanning of anon memory, but I am not removing that case here. That's why my
V3 separated the swappiness check from the inactive_is_low check.
Furthermore, the active list protection used to only be considered for the
file LRU, as seen in the ~4.18 inactive_list_is_low().

> One complication I could see with that is that we no longer start anon
> pages on the active list like we used to. We used to say active until
> proven otherwise; now it's inactive until proven otherwise. It's
> possible for the inactive list to contain a much bigger share of the
> total anon set now than before, in which case your patch wouldn't have
> the desired effect of targeting just a small amount of anon pages to
> get over the watermark hump.

Yes, I believe this also makes the problem worse. I'm not sure if giving anon
memory the same read-once optimization as file cache (starting on the
inactive list) is the way to go.

> We may need a get_scan_count() solution after all, and I agree with
> previous reviews that this is the better location for such an issue...

I can't see why a get_scan_count() solution is better than removing the
problem where it starts.

> One thing I think we should do - whether we need more on top or not -
> is allowing file reclaim to continue when sc->file_is_tiny. Yes, we
> also need anon to meet the watermarks, but it's not clear why we
> should stop scanning file pages altogether: it's possible they get us
> there 99% of the way, and somebody clearly wanted us to swap as little
> as possible to end up in a situation like that, so:
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eeab6611993c..90dac3dc9903 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2477,7 +2477,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  	 * If the system is almost out of file pages, force-scan anon.
>  	 */
>  	if (sc->file_is_tiny) {
> -		scan_balance = SCAN_ANON;
> +		scan_balance = SCAN_EQUAL;
>  		goto out;
>  	}

I agree, I think allowing it to scan both would be better as well.

Cheers!
-- Nico
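For reference, snapshot_refaults() in a ~v5.14-era tree looks roughly like the following (paraphrased from memory, so details may differ), which is why refaults[0] in the check above tracks the anon side and refaults[1] the file side:

static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
{
	struct lruvec *target_lruvec;
	unsigned long refaults;

	target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
	/* refaults[0] snapshots anon workingset activations... */
	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON);
	target_lruvec->refaults[0] = refaults;
	/* ...and refaults[1] snapshots the file side */
	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_FILE);
	target_lruvec->refaults[1] = refaults;
}

So the shrink_node() check being discussed is indeed comparing anon workingset activations against the last snapshot.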
Hi Johannes,

On Tue, Aug 10, 2021 at 8:27 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> [...]
> One thing I think we should do - whether we need more on top or not -
> is allowing file reclaim to continue when sc->file_is_tiny. Yes, we
> also need anon to meet the watermarks, but it's not clear why we
> should stop scanning file pages altogether: it's possible they get us
> there 99% of the way, and somebody clearly wanted us to swap as little
> as possible to end up in a situation like that, so:
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eeab6611993c..90dac3dc9903 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2477,7 +2477,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  	 * If the system is almost out of file pages, force-scan anon.
>  	 */
>  	if (sc->file_is_tiny) {
> -		scan_balance = SCAN_ANON;
> +		scan_balance = SCAN_EQUAL;
>  		goto out;
>  	}
>

Another thing we should do is re-evaluate the sc->file_is_tiny condition.
Currently it is:

	anon = node_page_state(pgdat, NR_INACTIVE_ANON);

	sc->file_is_tiny =
		file + free <= total_high_wmark &&
		!(sc->may_deactivate & DEACTIVATE_ANON) &&
		anon >> sc->priority;

First, convert the node_page_state() usage to lruvec_page_state() for a
common source of truth.

Second, in commit b91ac374346b, (sc->may_deactivate & DEACTIVATE_ANON)
implies inactive_is_low(LRU_INACTIVE_ANON), but commit 170b04b7ae49 changed
that. Was that intended?

Third, the comment above this code says "Consider anon" but it is only
considering inactive anon. Do we need to change the comment or the check?
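For readers following along, the predicate quoted above can be restated as a small standalone helper (a hypothetical illustration only; the function and parameter names are made up, not upstream code):

#include <stdbool.h>

/*
 * Hypothetical restatement of the sc->file_is_tiny check: the file set
 * is "tiny" when file + free pages cannot restore the high watermarks,
 * anon is not already being deactivated, and there is still a
 * meaningful amount of inactive anon left at this reclaim priority.
 */
static bool file_is_tiny(unsigned long file, unsigned long free,
			 unsigned long total_high_wmark,
			 bool deactivating_anon,
			 unsigned long inactive_anon, unsigned int priority)
{
	return file + free <= total_high_wmark &&
	       !deactivating_anon &&
	       (inactive_anon >> priority) != 0;
}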
On Tue, Aug 10, 2021 at 12:24 PM Nico Pache <npache@redhat.com> wrote:
>
> [...]
> >
> > I'm wondering how you're getting anon scans with swappiness=0. If you
> > look at get_scan_count(), SCAN_FRACT with swappiness=0 should always
> > result in ap = fraction[0] = 0, which never yields any anon scan
> > targets. So I'm thinking you're running into sc->file_is_tiny
> > situations, meaning remaining file pages alone are not enough to
> > restore watermarks anymore. Is that possible?
>
> Yes, DEACTIVATE_ANON is enabling the file_is_tiny case in shrink_node().
> That is what I'm trying to prevent in the swappiness=0 case.
>

Can you please explain how DEACTIVATE_ANON is enabling the file_is_tiny case?
On 8/10/21 5:17 PM, Shakeel Butt wrote:
> On Tue, Aug 10, 2021 at 12:24 PM Nico Pache <npache@redhat.com> wrote:
>>
>> [...]
>>>
>>> I'm wondering how you're getting anon scans with swappiness=0. If you
>>> look at get_scan_count(), SCAN_FRACT with swappiness=0 should always
>>> result in ap = fraction[0] = 0, which never yields any anon scan
>>> targets. So I'm thinking you're running into sc->file_is_tiny
>>> situations, meaning remaining file pages alone are not enough to
>>> restore watermarks anymore. Is that possible?
>>
>> Yes, DEACTIVATE_ANON is enabling the file_is_tiny case in shrink_node().
>> That is what I'm trying to prevent in the swappiness=0 case.
>>
>
> Can you please explain how DEACTIVATE_ANON is enabling the file_is_tiny case?

You're right. Just did a second pass... I misinterpreted the assignment to
file_is_tiny. This is not the case that is causing the issue. So back to the
SCAN_FRACT case. From my testing the refaulting still seems to be causing the
issue; however, to your point in earlier discussions, if swappiness=0 then
the anon scan target from get_scan_count() *should* be 0.

So my patch does solve the issue by preventing shrink_list from deactivating
the anon, but it may be hiding some other issue that is the ultimate cause.

Thanks for pointing that out!
-- Nico
On Tue, Aug 10, 2021 at 3:16 PM Nico Pache <npache@redhat.com> wrote:
>
> On 8/10/21 5:17 PM, Shakeel Butt wrote:
> > On Tue, Aug 10, 2021 at 12:24 PM Nico Pache <npache@redhat.com> wrote:
> >>
> >> [...]
> >>>
> >>> I'm wondering how you're getting anon scans with swappiness=0. If you
> >>> look at get_scan_count(), SCAN_FRACT with swappiness=0 should always
> >>> result in ap = fraction[0] = 0, which never yields any anon scan
> >>> targets. So I'm thinking you're running into sc->file_is_tiny
> >>> situations, meaning remaining file pages alone are not enough to
> >>> restore watermarks anymore. Is that possible?
> >>
> >> Yes, DEACTIVATE_ANON is enabling the file_is_tiny case in shrink_node().
> >> That is what I'm trying to prevent in the swappiness=0 case.
> >>
> >
> > Can you please explain how DEACTIVATE_ANON is enabling the file_is_tiny case?
>
> You're right. Just did a second pass... I misinterpreted the assignment to
> file_is_tiny. This is not the case that is causing the issue. So back to the
> SCAN_FRACT case. From my testing the refaulting still seems to be causing the
> issue; however, to your point in earlier discussions, if swappiness=0 then
> the anon scan target from get_scan_count() *should* be 0.
>
> So my patch does solve the issue by preventing shrink_list from deactivating
> the anon, but it may be hiding some other issue that is the ultimate cause.

Yes, I am suspecting the same. BTW which kernel version are you testing with?
I think it is important to note that the issue we are seeing has greatly
improved since the initial posting. However, we have noticed that the issue
is still present (and significantly worse) when cgroup V1 is in use.

We were initially testing with CgroupV1 and later found that the issue was
not as bad in CgroupV2 (but was still a noticeable issue). This is also
resulting in the splitting of THPs in the host kernel.

-- Nico

On 8/9/21 18:37, Nico Pache wrote:
> Since commit 170b04b7ae49 ("mm/workingset: prepare the workingset detection
> infrastructure for anon LRU") and commit b91ac374346b ("mm: vmscan: enforce
> inactive:active ratio at the reclaim root") swappiness can start prematurely
> swapping anon memory. This is due to the assumption that refaulting anon should
> always allow the shrinker to target anon memory. Add a check for swappiness
> being >0 before indiscriminately targeting Anon. Before these commits
> when a user had swappiness=0 anon memory would rarely get swapped; this
> behavior has remained constant sense RHEL5. This commit keeps that behavior
> intact and prevents the new workingset refaulting from challenging the anon
> memory when swappiness=0.
>
> Anon can still be swapped to prevent OOM. This does not completely disable
> swapping, but rather tames the refaulting aspect of the code that allows for
> the deactivating of anon memory.
>
> We have two customer workloads that discovered this issue:
> 1) A VM claiming 95% of the hosts memory followed by file reads (never dirty)
> which begins to challenge the anon. Refaulting the anon working set will then
> cause the indiscriminant swapping of the anon.
>
> 2) A VM running a in-memory DB is being populated from file reads.
> Swappiness is set to 0 or 1 to defer write I/O as much as possible. Once
> the customer experienced low memory, swapping anon starts, with
> little-to-no PageCache being swapped.
>
> Previously the file cache would account for almost all of the memory
> reclaimed and reads would throttle. Although the two LRU changes mentioned
> allow for less thrashing of file cache, customers would like to be able to keep
> the swappiness=0 behavior that has been present in the kernel for a long
> time.
>
> A similar solution may be possible in get_scan_count(), which determines the
> reclaim pressure for each LRU; however I believe that kind of solution may be
> too aggressive, and will not allow other parts of the code (like direct reclaim)
> from targeting the active_anon list. This way we stop the problem at the heart
> of what is causing the issue, with the least amount of interference in other
> code paths. Furthermore, shrink_lruvec can modify the reclaim pressure of each
> LRU, which may make the get_scan_count solution even trickier.
>
> Changelog:
> -V3:
>   * Blame the right commit and be more descriptive in my log message.
>   * inactive_is_low should remain independent from the new swappiness check.
>   * Change how we get the swappiness value. Shrink_node can be called with a
>     null target_mem_cgroup so we should depend on the target_lruvec to do the
>     null check on memcg.
>
> -V2:
>   * made this mem_cgroup specific so now it will work with v1, v2, and
>     no cgroups.
>   * I've also touched up my commit log.
>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/vmscan.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4620df62f0ff..9f2420da4037 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2883,8 +2883,12 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  	struct lruvec *target_lruvec;
>  	bool reclaimable = false;
>  	unsigned long file;
> +	struct mem_cgroup *memcg;
> +	int swappiness;
>
>  	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
> +	memcg = lruvec_memcg(target_lruvec);
> +	swappiness = mem_cgroup_swappiness(memcg);
>
>  again:
>  	memset(&sc->nr, 0, sizeof(sc->nr));
> @@ -2909,7 +2913,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>
>  	refaults = lruvec_page_state(target_lruvec,
>  				WORKINGSET_ACTIVATE_ANON);
> -	if (refaults != target_lruvec->refaults[0] ||
> +	if ((swappiness && refaults != target_lruvec->refaults[0]) ||
>  	    inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
>  		sc->may_deactivate |= DEACTIVATE_ANON;
>  	else
Hi Nico,

On Tue, Apr 19, 2022 at 02:11:53PM -0400, Nico Pache wrote:
> I think it is important to note that the issue we are seeing has greatly
> improved since the initial posting. However, we have noticed that the issue
> is still present (and significantly worse) when cgroup V1 is in use.
>
> We were initially testing with CgroupV1 and later found that the issue was
> not as bad in CgroupV2 (but was still a noticeable issue). This is also
> resulting in the splitting of THPs in the host kernel.

When swappiness is 0, cgroup limit reclaim has a fixed SCAN_FILE
branch, so it shouldn't ever look at anon. I'm assuming you're getting
global reclaim mixed in. Indeed, I think we can try harder not to swap
for global reclaim if the user asks for that.

Can you try the below patch?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ba19edbc2452..0aa929151386 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2729,11 +2729,9 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	}
 
 	/*
-	 * Global reclaim will swap to prevent OOM even with no
-	 * swappiness, but memcg users want to use this knob to
-	 * disable swapping for individual groups completely when
-	 * using the memory controller's swap limit feature would be
-	 * too expensive.
+	 * For cgroups, swappiness=0 means never swap. Historically,
+	 * cgroup users have relied on this as it was cheaper than
+	 * setting a swap limit of 0.
 	 */
 	if (cgroup_reclaim(sc) && !swappiness) {
 		scan_balance = SCAN_FILE;
@@ -2754,7 +2752,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	 * If the system is almost out of file pages, force-scan anon.
 	 */
 	if (sc->file_is_tiny) {
-		scan_balance = SCAN_ANON;
+		scan_balance = SCAN_EQUAL;
 		goto out;
 	}
 
@@ -2767,6 +2765,12 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 		goto out;
 	}
 
+	/* swappiness=0 means no swap until OOM is imminent */
+	if (!swappiness && sc->priority) {
+		scan_balance = SCAN_FILE;
+		goto out;
+	}
+
 	scan_balance = SCAN_FRACT;
 	/*
 	 * Calculate the pressure balance between anon and file pages.
Hi Johannes,

On 4/19/22 14:46, Johannes Weiner wrote:
> Hi Nico,
>
> On Tue, Apr 19, 2022 at 02:11:53PM -0400, Nico Pache wrote:
>> I think it is important to note that the issue we are seeing has greatly
>> improved since the initial posting. However, we have noticed that the issue
>> is still present (and significantly worse) when cgroup V1 is in use.
>>
>> We were initially testing with CgroupV1 and later found that the issue was
>> not as bad in CgroupV2 (but was still a noticeable issue). This is also
>> resulting in the splitting of THPs in the host kernel.
>
> When swappiness is 0, cgroup limit reclaim has a fixed SCAN_FILE
> branch, so it shouldn't ever look at anon. I'm assuming you're getting
> global reclaim mixed in. Indeed, I think we can try harder not to swap
> for global reclaim if the user asks for that.

We aren't actually utilizing the cgroup mechanism; however, switching between
the two has a noticeable effect on the global reclaim of the system.

This is not a writeback case either -- the reproducer simply reads. So I
think we can rule out the V2 writeback controller being involved.

My initial patch was also targeting swappiness=0, but this also occurs when
it is >0.

> Can you try the below patch?

Of course, thanks for that :) I'll let you know how it goes!

Cheers,
-- Nico
On 4/19/22 14:46, Johannes Weiner wrote:
> Hi Nico,
>
> On Tue, Apr 19, 2022 at 02:11:53PM -0400, Nico Pache wrote:
>> I think it is important to note that the issue we are seeing has greatly
>> improved since the initial posting. However, we have noticed that the issue
>> is still present (and significantly worse) when cgroup V1 is in use.
>>
>> We were initially testing with CgroupV1 and later found that the issue was
>> not as bad in CgroupV2 (but was still a noticeable issue). This is also
>> resulting in the splitting of THPs in the host kernel.
>
> When swappiness is 0, cgroup limit reclaim has a fixed SCAN_FILE
> branch, so it shouldn't ever look at anon. I'm assuming you're getting
> global reclaim mixed in. Indeed, I think we can try harder not to swap
> for global reclaim if the user asks for that.
>
> Can you try the below patch?

Sadly this did not fix the V1 case.

I will have to go back through my notes and over the code again to find what
difference between the two may be causing this issue. I'm just starting to
focus on this issue again, so my memory needs some refreshing on where I left
off.

The good news is that with the current state of upstream, the issue of swap
storms tearing down THPs seems to be minimized to a sane amount with V2.

My swappiness=0 solution was a minimal approach to regaining the 'avoid
swapping ANON' behavior that was previously there, but as Shakeel pointed
out, there may be something larger at play.

Cheers,
-- Nico
On Tue, Apr 19, 2022 at 07:54:52PM -0400, Nico Pache wrote:
> On 4/19/22 14:46, Johannes Weiner wrote:
> > Hi Nico,
> >
> > On Tue, Apr 19, 2022 at 02:11:53PM -0400, Nico Pache wrote:
> >> I think it is important to note that the issue we are seeing has greatly
> >> improved since the initial posting. However, we have noticed that the issue
> >> is still present (and significantly worse) when cgroup V1 is in use.
> >>
> >> We were initially testing with CgroupV1 and later found that the issue was
> >> not as bad in CgroupV2 (but was still a noticeable issue). This is also
> >> resulting in the splitting of THPs in the host kernel.
> >
> > When swappiness is 0, cgroup limit reclaim has a fixed SCAN_FILE
> > branch, so it shouldn't ever look at anon. I'm assuming you're getting
> > global reclaim mixed in. Indeed, I think we can try harder not to swap
> > for global reclaim if the user asks for that.
> >
> > Can you try the below patch?
>
> Sadly this did not fix the V1 case.
>
> I will have to go back through my notes and over the code again to find what
> difference between the two may be causing this issue. I'm just starting to
> focus on this issue again, so my memory needs some refreshing on where I left
> off.
>
> The good news is that with the current state of upstream, the issue of swap
> storms tearing down THPs seems to be minimized to a sane amount with V2.
>
> My swappiness=0 solution was a minimal approach to regaining the 'avoid
> swapping ANON' behavior that was previously there, but as Shakeel pointed
> out, there may be something larger at play.

So with my patch and swappiness=0 you get excessive swapping on v1 but
not on v2? And the patch to avoid DEACTIVATE_ANON fixes it?

If you haven't done so, it could be useful to litter shrink_node() and
get_scan_count() with trace_printk() to try to make sense of all the
decisions that result in it swapping.
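As a hypothetical sketch of that kind of instrumentation (placement and fields are illustrative only, not a proposed change), something like this inside get_scan_count(), where the local swappiness variable is in scope, would dump the inputs that drive the scan_balance decision to the ftrace buffer:

	/*
	 * Debugging aid only: record why this reclaim cycle picked its
	 * balance.  Output is read back from
	 * /sys/kernel/debug/tracing/trace.
	 */
	trace_printk("swappiness=%d file_is_tiny=%d may_deactivate=%d priority=%d\n",
		     swappiness, sc->file_is_tiny, sc->may_deactivate,
		     sc->priority);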
On 4/20/22 10:01, Johannes Weiner wrote:
>> My swappiness=0 solution was a minimal approach to regaining the 'avoid
>> swapping ANON' behavior that was previously there, but as Shakeel pointed
>> out, there may be something larger at play.
>
> So with my patch and swappiness=0 you get excessive swapping on v1 but
> not on v2? And the patch to avoid DEACTIVATE_ANON fixes it?

Correct. I haven't tested the DEACTIVATE_ANON patch since last time I was
working on this, but it did cure it. I can build a new kernel with it and
verify again.

The larger issue is that our workload has regressed in performance.

With V2 and swappiness=10 we are still seeing some swap, but very little
tearing down of THPs over time. With swappiness=0 it did some when swap but
we are not losings GBs of THPS (with your patch swappiness=0 has swap or THP
issues on V2).

With V1 and swappiness=(0|10) (with and without your patch), it swaps a ton
and ultimately leads to a significant amount of THP splitting. So the longer
the system/workload runs, the less likely we are to get THPs backing the
guest, and the performance gain from THPs is lost.

So your patch does help return the old swappiness=0 behavior, but only for
V2. Ideally we would like to keep swappiness>0, but I found that with my
patch and swappiness=0 we could create a workaround for this effect on V1;
any other value still results in the THP issue.

After the workload is run with V2 and swappiness=0, the host system looks
like this**:

              total        used        free      shared  buff/cache   available
Mem:      264071432   257536896      927424        4664     5607112     4993184
Swap:       4194300           0     4194300

Node 0 AnonPages:      128145476 kB
Node 1 AnonPages:      128111908 kB
Node 0 AnonHugePages:  128026624 kB
Node 1 AnonHugePages:  128090112 kB

** without your patch there is still some swap and THP splitting, but nothing
like the case below.

Same workload on V1/swappiness=0 looks like this:

              total        used        free      shared  buff/cache   available
Mem:      264071432   257169500     1032612        4192     5869320     5357944
Swap:       4194300      623008     3571292

Node 0 AnonPages:      127927156 kB
Node 1 AnonPages:      127701088 kB
Node 0 AnonHugePages:  127789056 kB
Node 1 AnonHugePages:   87552000 kB   ^^^^^^^

This leads to the performance regression I'm referring to in later workloads.

V2 used to have a similar effect to V1, but not nearly as bad. Recent updates
upstream fixed this in V2. The workload tests multiple FS types, so this is
most likely not a FS-specific issue either.

> If you haven't done so, it could be useful to litter shrink_node() and
> get_scan_count() with trace_printk() to try to make sense of all the
> decisions that result in it swapping.

Will do :) I was originally doing some BPF tracing that led me to find the
DEACTIVATE_ANON case.

Thanks,
-- Nico
Hi Nico,

On Wed, Apr 20, 2022 at 01:34:58PM -0400, Nico Pache wrote:
> On 4/20/22 10:01, Johannes Weiner wrote:
> >> My swappiness=0 solution was a minimal approach to regaining the 'avoid
> >> swapping ANON' behavior that was previously there, but as Shakeel pointed
> >> out, there may be something larger at play.
> >
> > So with my patch and swappiness=0 you get excessive swapping on v1 but
> > not on v2? And the patch to avoid DEACTIVATE_ANON fixes it?
>
> Correct. I haven't tested the DEACTIVATE_ANON patch since last time I was
> working on this, but it did cure it. I can build a new kernel with it and
> verify again.
>
> The larger issue is that our workload has regressed in performance.
>
> With V2 and swappiness=10 we are still seeing some swap, but very little
> tearing down of THPs over time. With swappiness=0 it did some when swap but
> we are not losings GBs of THPS (with your patch swappiness=0 has swap or THP
> issues on V2).
>
> With V1 and swappiness=(0|10) (with and without your patch), it swaps a ton
> and ultimately leads to a significant amount of THP splitting. So the longer
> the system/workload runs, the less likely we are to get THPs backing the
> guest, and the performance gain from THPs is lost.

I hate to ask, but is it possible this is a configuration issue?

One significant difference between V1 and V2 is that V1 has per-cgroup
swappiness, which is inherited when the cgroup is created. So if you
set sysctl vm.swappiness=0 after cgroups have been created, it will
not update them. V2 cgroups do use vm.swappiness:

static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
{
	/* Cgroup2 doesn't have per-cgroup swappiness */
	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
		return vm_swappiness;

	/* root ? */
	if (mem_cgroup_disabled() || mem_cgroup_is_root(memcg))
		return vm_swappiness;

	return memcg->swappiness;
}

Is it possible the job cgroups on V1 have swappiness=60?

> So your patch does help return the old swappiness=0 behavior, but only for V2.

Thanks for verifying. I'll prepare a proper patch.
On 4/20/22 14:44, Johannes Weiner wrote:
>>
>> The larger issue is that our workload has regressed in performance.
>>
>> With V2 and swappiness=10 we are still seeing some swap, but very little
>> tearing down of THPs over time. With swappiness=0 it did some when swap but
>> we are not losings GBs of THPS (with your patch swappiness=0 has swap or THP
>> issues on V2).

I meant to say `with your patch swappiness=0 does not swap or have thp issues
on v2`.

>> With V1 and swappiness=(0|10) (with and without your patch), it swaps a ton
>> and ultimately leads to a significant amount of THP splitting. So the longer
>> the system/workload runs, the less likely we are to get THPs backing the
>> guest, and the performance gain from THPs is lost.
>
> I hate to ask, but is it possible this is a configuration issue?

I'm very glad you asked :)

> One significant difference between V1 and V2 is that V1 has per-cgroup
> swappiness, which is inherited when the cgroup is created. So if you
> set sysctl vm.swappiness=0 after cgroups have been created, it will
> not update them. V2 cgroups do use vm.swappiness:

This is something I did not consider... Thank you for pointing that out!

The issue still occurs whether or not I set the swappiness value before the
VM boots. However, this led me to find the icing on the cake :)

Even if I set vm.swappiness=0 at boot using sysctl.conf, I was not
considering the fact that libvirtd was creating its own cgroup for the
machines it starts... additionally, it does not inherit the sysctl value
(even when set at boot)?!? How annoying... The cgroup's swappiness value
defaults to 60.

This to me seems wrong from a libvirt/systemd POV. If the system is booted
with swappiness=0, then why does the (user|machine|system).slice cgroup
ignore this value when it creates its cgroups (see below)? I will have to dig
a little further to find a cause/fix for this. This requires libvirt users to
understand a number of intricacies that they really shouldn't have to
consider, and may lead to headaches like these ;P

Values of the memcgs created on boot (with sysctl.swappiness=0 on V1 boot)
------------------------------------------------------------------------
/sys/fs/cgroup/memory/memory.swappiness =0
/sys/fs/cgroup/memory/dev-hugepages.mount/memory.swappiness =60
/sys/fs/cgroup/memory/dev-mqueue.mount/memory.swappiness =60
/sys/fs/cgroup/memory/machine.slice/memory.swappiness =60
/sys/fs/cgroup/memory/proc-sys-fs-binfmt_misc.mount/memory.swappiness =0
/sys/fs/cgroup/memory/sys-fs-fuse-connections.mount/memory.swappiness =0
/sys/fs/cgroup/memory/sys-kernel-config.mount/memory.swappiness =0
/sys/fs/cgroup/memory/sys-kernel-debug.mount/memory.swappiness =60
/sys/fs/cgroup/memory/sys-kernel-tracing.mount/memory.swappiness =60
/sys/fs/cgroup/memory/system.slice/memory.swappiness =60
/sys/fs/cgroup/memory/user.slice/memory.swappiness =60

Some seem to inherit the cgroup/memory/memory.swappiness value and some do
not... This issue was brought up in a systemd issue with no solution or
documentation [1].

Libvirt in particular is using the machine.slice cgroup, so it inherits the
60. If I change that value to 0, then start the machine, it now has
swappiness 0:

$ echo 0 > /sys/fs/cgroup/memory/machine.slice/memory.swappiness
$ virsh start <guest-name>
$ cat /sys/fs/cgroup/memory/machine.slice/machine-qemu.scope/memory.swappiness
0

Thank you so much for your very insightful note that led to the real issue :)

> Thanks for verifying. I'll prepare a proper patch.
My issue with V1 vs V2 seems to go away with a much more sane value of
swappiness=10 on V1 (when actually set properly lol).

Also, as per my results below, I actually don't think your patch caused much
change to my workload. I'm not sure what happened the first time I ran it
that caused the swapping on V2 (before your patch)... perhaps I ran the older
kernel (~v5.14) that was still having issues with V2, or it's the fact that
the results can differ between runs. Sorry about that.

Here are the test results for your patch with V1 and V2 (swappiness=0/10):

Before Patch
-------------
-- V1 (swappiness=0):
              total        used        free      shared  buff/cache   available
Mem:      264071432   257465704     1100160        4224     5505568     5064964
Swap:       4194300       47828     4146472

Node 0 AnonPages:      128068580 kB
Node 1 AnonPages:      128120400 kB
Node 0 AnonHugePages:  128012288 kB
Node 1 AnonHugePages:  128090112 kB   ^^^^^ no loss

-- V1 (swappiness=10):
              total        used        free      shared  buff/cache   available
Mem:      264071432   257364436      972048        3972     5734948     5164520
Swap:       4194300      235028     3959272

Node 0 AnonPages:      128015404 kB
Node 1 AnonPages:      128002788 kB
Node 0 AnonHugePages:  128002048 kB
Node 1 AnonHugePages:  120576000 kB   ^^^^^ some loss

-- V2 (swappiness=0):
              total        used        free      shared  buff/cache   available
Mem:      264071432   257609352      924692        4664     5537388     4921236
Swap:       4194300           0     4194300   ^^^^^ No Swap

Node 0 AnonPages:      128083104 kB
Node 1 AnonPages:      128180576 kB
Node 0 AnonHugePages:  128002048 kB
Node 1 AnonHugePages:  128124928 kB   ^^^^^ No loss

-- V2 (swappiness=10):
              total        used        free      shared  buff/cache   available
Mem:      264071432   257407576      918124        4632     5745732     5101764
Swap:       4194300      220424     3973876   ^^^^^ Some Swap

Node 0 AnonPages:      128109700 kB
Node 1 AnonPages:      127918164 kB
Node 0 AnonHugePages:  128006144 kB
Node 1 AnonHugePages:  120569856 kB   ^^^^^ some loss

After Patch
-------------
-- V1 (swappiness=0):
              total        used        free      shared  buff/cache   available
Mem:      264071432   257538832      945276        4368     5587324     4991852
Swap:       4194300        9276     4185024

Node 0 AnonPages:      128133932 kB
Node 1 AnonPages:      128100540 kB
Node 0 AnonHugePages:  128047104 kB
Node 1 AnonHugePages:  128061440 kB

-- V1 (swappiness=10):
              total        used        free      shared  buff/cache   available
Mem:      264071432   257428564      969252        4384     5673616     5100824
Swap:       4194300      138936     4055364   ^^^^^ Some Swap

Node 0 AnonPages:      128161724 kB
Node 1 AnonPages:      127945368 kB
Node 0 AnonHugePages:  128043008 kB
Node 1 AnonHugePages:  120221696 kB   ^^^^^ some loss

-- V2 (swappiness=0):
              total        used        free      shared  buff/cache   available
Mem:      264071432   257536896      927424        4664     5607112     4993184
Swap:       4194300           0     4194300

Node 0 AnonPages:      128145476 kB
Node 1 AnonPages:      128111908 kB
Node 0 AnonHugePages:  128026624 kB
Node 1 AnonHugePages:  128090112 kB

-- V2 (swappiness=10):
              total        used        free      shared  buff/cache   available
Mem:      264071432   257423936     1007076        4548     5640420     5106544
Swap:       4194300      156016     4038284

Node 0 AnonPages:      128133264 kB
Node 1 AnonPages:      127955952 kB
Node 0 AnonHugePages:  128018432 kB
Node 1 AnonHugePages:  122507264 kB   ^^^^ slightly better

The only notable difference between before/after your patch is that with your
patch the THP tearing was slightly better, resulting in an extra 2GB as seen
in the last result. This may just be noise.

I'll have to see if I can find a fix for this in either the kernel, libvirt,
or systemd, and will follow up if I do. If not, this should at least be
documented correctly.

Given the fact that cgroup V1 is in limited-support mode upstream, and
systemd's hesitancy to make changes for V1, we may have to go down our own
avenues to ensure our customers don't run into this issue.

Big Thanks!
-- Nico

[1] - https://github.com/systemd/systemd/issues/9276
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4620df62f0ff..9f2420da4037 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2883,8 +2883,12 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	struct lruvec *target_lruvec;
 	bool reclaimable = false;
 	unsigned long file;
+	struct mem_cgroup *memcg;
+	int swappiness;
 
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
+	memcg = lruvec_memcg(target_lruvec);
+	swappiness = mem_cgroup_swappiness(memcg);
 
 again:
 	memset(&sc->nr, 0, sizeof(sc->nr));
@@ -2909,7 +2913,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 
 		refaults = lruvec_page_state(target_lruvec,
 					WORKINGSET_ACTIVATE_ANON);
-		if (refaults != target_lruvec->refaults[0] ||
+		if ((swappiness && refaults != target_lruvec->refaults[0]) ||
 		    inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
 			sc->may_deactivate |= DEACTIVATE_ANON;
 		else
Since commit 170b04b7ae49 ("mm/workingset: prepare the workingset detection
infrastructure for anon LRU") and commit b91ac374346b ("mm: vmscan: enforce
inactive:active ratio at the reclaim root") swappiness can start prematurely
swapping anon memory. This is due to the assumption that refaulting anon should
always allow the shrinker to target anon memory. Add a check for swappiness
being >0 before indiscriminately targeting anon. Before these commits, when a
user had swappiness=0, anon memory would rarely get swapped; this behavior has
remained constant since RHEL5. This commit keeps that behavior intact and
prevents the new workingset refaulting from challenging the anon memory when
swappiness=0.

Anon can still be swapped to prevent OOM. This does not completely disable
swapping, but rather tames the refaulting aspect of the code that allows for
the deactivation of anon memory.

We have two customer workloads that discovered this issue:
1) A VM claiming 95% of the host's memory, followed by file reads (never dirty)
which begin to challenge the anon. Refaulting the anon working set will then
cause the indiscriminate swapping of the anon.

2) A VM running an in-memory DB is being populated from file reads.
Swappiness is set to 0 or 1 to defer write I/O as much as possible. Once
the customer experiences low memory, anon swapping starts, with
little-to-no page cache being reclaimed.

Previously the file cache would account for almost all of the memory
reclaimed and reads would throttle. Although the two LRU changes mentioned
allow for less thrashing of file cache, customers would like to be able to keep
the swappiness=0 behavior that has been present in the kernel for a long
time.

A similar solution may be possible in get_scan_count(), which determines the
reclaim pressure for each LRU; however, I believe that kind of solution may be
too aggressive, and would not allow other parts of the code (like direct
reclaim) to target the active_anon list. This way we stop the problem at the
heart of what is causing the issue, with the least amount of interference in
other code paths. Furthermore, shrink_lruvec() can modify the reclaim pressure
of each LRU, which may make the get_scan_count() solution even trickier.

Changelog:
-V3:
  * Blame the right commit and be more descriptive in my log message.
  * inactive_is_low should remain independent from the new swappiness check.
  * Change how we get the swappiness value. shrink_node() can be called with a
    NULL target_mem_cgroup, so we should depend on the target_lruvec to do the
    NULL check on memcg.

-V2:
  * Made this mem_cgroup specific so now it will work with v1, v2, and
    no cgroups.
  * I've also touched up my commit log.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/vmscan.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)