
[v3] vm_swappiness=0 should still try to avoid swapping anon memory

Message ID 20210809223740.59009-1-npache@redhat.com (mailing list archive)
State New

Commit Message

Nico Pache Aug. 9, 2021, 10:37 p.m. UTC
Since commit 170b04b7ae49 ("mm/workingset: prepare the workingset detection
infrastructure for anon LRU") and commit b91ac374346b ("mm: vmscan: enforce
inactive:active ratio at the reclaim root"), reclaim can prematurely start
swapping anon memory. This is due to the assumption that refaulting anon should
always allow the shrinker to target anon memory. Add a check for swappiness
being >0 before indiscriminately targeting anon. Before these commits,
when a user had swappiness=0, anon memory would rarely get swapped; this
behavior has remained constant since RHEL5. This commit keeps that behavior
intact and prevents the new workingset refaulting from challenging the anon
memory when swappiness=0.

Anon can still be swapped to prevent OOM. This does not completely disable
swapping, but rather tames the refaulting aspect of the code that allows for
the deactivation of anon memory.

We have two customer workloads that discovered this issue:
1) A VM claiming 95% of the host's memory, followed by file reads (never dirty),
   which begin to challenge the anon. Refaulting the anon working set will then
   cause indiscriminate swapping of the anon.

2) A VM running an in-memory DB is being populated from file reads.
   Swappiness is set to 0 or 1 to defer write I/O as much as possible. Once
   the system runs low on memory, swapping of anon starts, with
   little-to-no page cache being reclaimed.

Previously the file cache would account for almost all of the memory
reclaimed and reads would throttle. Although the two LRU changes mentioned
allow for less thrashing of file cache, customers would like to be able to keep
the swappiness=0 behavior that has been present in the kernel for a long
time.

A similar solution may be possible in get_scan_count(), which determines the
reclaim pressure for each LRU; however, I believe that kind of solution may be
too aggressive and would prevent other parts of the code (like direct reclaim)
from targeting the active_anon list. This way we stop the problem at its
source, with the least amount of interference in other code paths. Furthermore,
shrink_lruvec can modify the reclaim pressure of each LRU, which may make the
get_scan_count solution even trickier.

Changelog:
 -V3:
    * Blame the right commit and be more descriptive in my log message.
    * inactive_is_low should remain independent from the new swappiness check.
    * Change how we get the swappiness value. shrink_node() can be called with a
      NULL target_mem_cgroup, so we should depend on the target_lruvec to do the
      NULL check on memcg.

 -V2:
     * made this mem_cgroup specific so now it will work with v1, v2, and
       no cgroups.
     * I've also touched up my commit log.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/vmscan.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Johannes Weiner Aug. 10, 2021, 3:27 p.m. UTC | #1
Hello Nico,

On Mon, Aug 09, 2021 at 06:37:40PM -0400, Nico Pache wrote:
> Since commit 170b04b7ae49 ("mm/workingset: prepare the workingset detection
> infrastructure for anon LRU") and commit b91ac374346b ("mm: vmscan: enforce
> inactive:active ratio at the reclaim root") swappiness can start prematurely

Could you clarify what you mean by "prematurely"?

The new balancing algorithm targets the lowest amount of overall
paging IO performed across the anon and file sets. It doesn't swap
unless it has an indication that a couple of swap writes are
outweighed by a reduction of reads on the cache side.

Is this not working for you?

> swapping anon memory. This is due to the assumption that refaulting anon should
> always allow the shrinker to target anon memory.

This doesn't sound right. Did you mean "refaulting file"?

> Add a check for swappiness being >0 before indiscriminately
> targeting Anon.

> Before these commits when a user had swappiness=0 anon memory would
> rarely get swapped; this behavior has remained constant sense
> RHEL5. This commit keeps that behavior intact and prevents the new
> workingset refaulting from challenging the anon memory when
> swappiness=0.

I'm wondering how you're getting anon scans with swappiness=0. If you
look at get_scan_count(), SCAN_FRACT with swappiness=0 should always
result in ap = fraction[0] = 0, which never yields any anon scan
targets. So I'm thinking you're running into sc->file_is_tiny
situations, meaning remaining file pages alone are not enough to
restore watermarks anymore. Is that possible?
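
For reference, the SCAN_FRACT balancing works roughly like the standalone
sketch below (a paraphrase of the ~v5.10 get_scan_count() arithmetic, not the
exact kernel code, with made-up cost numbers); it illustrates why swappiness=0
yields an anon scan fraction of zero:

#include <stdio.h>

/*
 * Rough userspace model of the SCAN_FRACT pressure split. Only the
 * shape of the arithmetic matters here.
 */
static void scan_fract(unsigned long swappiness,
		       unsigned long anon_cost, unsigned long file_cost,
		       unsigned long fraction[2])
{
	unsigned long total_cost = anon_cost + file_cost;
	unsigned long ap, fp;

	/* Anon pressure scales with swappiness (0..200). */
	ap = swappiness * (total_cost + 1);
	ap /= anon_cost + 1;

	/* File pressure scales with (200 - swappiness). */
	fp = (200 - swappiness) * (total_cost + 1);
	fp /= file_cost + 1;

	fraction[0] = ap;	/* anon */
	fraction[1] = fp;	/* file */
}

int main(void)
{
	unsigned long fraction[2];

	scan_fract(0, 1000, 5000, fraction);
	/* Prints anon=0: no anon scan targets come out of SCAN_FRACT. */
	printf("anon=%lu file=%lu\n", fraction[0], fraction[1]);
	return 0;
}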

In that case, anon scanning is forced, and always has been. But the
difference is that before the above-mentioned patches, we'd usually
force scan just the smaller inactive list, whereas now we disable
active list protection due to swapins and target the entire anon
set. I suppose you'd prefer we go back to that, so that more pressure
remains proportionally on the file set, and just enough anon to get
above the watermarks again.

One complication I could see with that is that we no longer start anon
pages on the active list like we used to. We used to say active until
proven otherwise; now it's inactive until proven otherwise. It's
possible for the inactive list to contain a much bigger share of the
total anon set now than before, in which case your patch wouldn't have
the desired effect of targeting just a small amount of anon pages to
get over the watermark hump.

We may need a get_scan_count() solution after all, and I agree with
previous reviews that this is the better location for such an issue...

One thing I think we should do - whether we need more on top or not -
is allowing file reclaim to continue when sc->file_is_tiny. Yes, we
also need anon to meet the watermarks, but it's not clear why we
should stop scanning file pages altogether: it's possible they get us
there 99% of the way, and somebody clearly wanted us to swap as little
as possible to end up in a situation like that, so:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index eeab6611993c..90dac3dc9903 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2477,7 +2477,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	 * If the system is almost out of file pages, force-scan anon.
 	 */
 	if (sc->file_is_tiny) {
-		scan_balance = SCAN_ANON;
+		scan_balance = SCAN_EQUAL;
 		goto out;
 	}
Waiman Long Aug. 10, 2021, 3:37 p.m. UTC | #2
On 8/9/21 6:37 PM, Nico Pache wrote:
> Since commit 170b04b7ae49 ("mm/workingset: prepare the workingset detection
> infrastructure for anon LRU") and commit b91ac374346b ("mm: vmscan: enforce
> inactive:active ratio at the reclaim root") swappiness can start prematurely
> swapping anon memory. This is due to the assumption that refaulting anon should
> always allow the shrinker to target anon memory. Add a check for swappiness
> being >0 before indiscriminately targeting Anon. Before these commits
> when a user had swappiness=0 anon memory would rarely get swapped; this
> behavior has remained constant sense RHEL5. This commit keeps that behavior
Typo: "sense" -> "since"
> intact and prevents the new workingset refaulting from challenging the anon
> memory when swappiness=0.
>
> Anon can still be swapped to prevent OOM. This does not completely disable
> swapping, but rather tames the refaulting aspect of the code that allows for
> the deactivating of anon memory.
>
> We have two customer workloads that discovered this issue:
> 1) A VM claiming 95% of the hosts memory followed by file reads (never dirty)
>     which begins to challenge the anon. Refaulting the anon working set will then
>     cause the indiscriminant swapping of the anon.
>
> 2) A VM running a in-memory DB is being populated from file reads.
>     Swappiness is set to 0 or 1 to defer write I/O as much as possible. Once
>     the customer experienced low memory, swapping anon starts, with
>     little-to-no PageCache being swapped.

Page cache is not swapped. It is discarded under memory pressure and 
written back if dirty.

Other than that, the patch looks good to me.

Cheers,
Longman
Nico Pache Aug. 10, 2021, 7:24 p.m. UTC | #3
On 8/10/21 11:27 AM, Johannes Weiner wrote:
> Hello Nico,
> 
> On Mon, Aug 09, 2021 at 06:37:40PM -0400, Nico Pache wrote:
>> Since commit 170b04b7ae49 ("mm/workingset: prepare the workingset detection
>> infrastructure for anon LRU") and commit b91ac374346b ("mm: vmscan: enforce
>> inactive:active ratio at the reclaim root") swappiness can start prematurely
> 
> Could clarify what you mean by "prematurely"?

Hi Johannes! 

The reason I used the words prematurely and indiscriminately when trying to describe the behavior is that, AFAICS, the swappiness value is not being considered and this isn't an OOM case, so it's prematurely going for anon memory. 

> 
> The new balancing algorithm targets the lowest amount of overall
> paging IO performed across the anon and file sets. It doesn't swap
> unless it has an indication that a couple of swap writes are
> outweighed by a reduction of reads on the cache side.
> 
> Is this not working for you?

Well, it is for the most part, but to your point below, the sc->file_is_tiny case can directly bypass the meaning of swappiness and do whatever it likes.

> 
>> swapping anon memory. This is due to the assumption that refaulting anon should
>> always allow the shrinker to target anon memory.
> 
> This doesn't sound right. Did you mean "refaulting file"?

<code>
refaults = lruvec_page_state(target_lruvec,
		WORKINGSET_ACTIVATE_ANON);
if (refaults != target_lruvec->refaults[0] ||
	inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
	sc->may_deactivate |= DEACTIVATE_ANON;
</code>

Perhaps this is incorrect then? target_lruvec is using refaults[0], which is collected in snapshot_refaults(). snapshot_refaults() populates index 0 with the WORKINGSET_ACTIVATE_ANON page state, and the refaults variable here does the same. So I assumed the refaulting (new refault count != snapshot count) is comparing the anon workingset memory, not the refaulting of file cache.
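
For context, a minimal sketch of snapshot_refaults() as I read it in ~v5.10
(paraphrased from memory, so double-check against the actual tree) backs up
that reading of refaults[0]:

<code>
static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
{
	struct lruvec *target_lruvec;
	unsigned long refaults;

	target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
	/* Index 0 snapshots the anon workingset activations... */
	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON);
	target_lruvec->refaults[0] = refaults;
	/* ...and index 1 the file workingset activations. */
	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_FILE);
	target_lruvec->refaults[1] = refaults;
}
</code>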

> 
>> Add a check for swappiness being >0 before indiscriminately
>> targeting Anon.
> 
>> Before these commits when a user had swappiness=0 anon memory would
>> rarely get swapped; this behavior has remained constant sense
>> RHEL5. This commit keeps that behavior intact and prevents the new
>> workingset refaulting from challenging the anon memory when
>> swappiness=0.
> 
> I'm wondering how you're getting anon scans with swappiness=0. If you
> look at get_scan_count(), SCAN_FRACT with swappines=0 should always
> result in ap = fraction[0] = 0, which never yields any anon scan
> targets. So I'm thinking you're running into sc->file_is_tiny
> situations, meaning remaining file pages alone are not enough to
> restore watermarks anymore. Is that possible?

Yes, DEACTIVATE_ANON is enabling the file_is_tiny case in shrink_node(). That is what I'm trying to prevent in the swappiness=0 case.

> 
> In that case, anon scanning is forced, and always has been. But the
> difference is that before the above-mentioned patches, we'd usually
> force scan just the smaller inactive list, whereas now we disable
> active list protection due to swapins and target the entire anon
> set. I suppose you'd prefer we go back to that, so that more pressure
> remains proportionally on the file set, and just enough anon to get
> above the watermarks again

Well, kind of. It used to be that inactive_list_is_low would allow for the scanning of anon memory, but I am not removing that case here. That's why my V3 separated the swappiness check from the inactive_is_low check. Furthermore, the active list protection used to only be considered on the file LRU, as seen in the ~4.18 inactive_list_is_low.

> 
> One complication I could see with that is that we no longer start anon
> pages on the active list like we used to. We used to say active until
> proven otherwise; now it's inactive until proven otherwise. It's
> possible for the inactive list to contain a much bigger share of the
> total anon set now than before, in which case your patch wouldn't have
> the desired effect of targetting just a small amount of anon pages to
> get over the watermark hump.

Yes, I believe this also makes the problem worse. I'm not sure if giving anon memory the same read-once optimization (starting on the inactive list) as file cache is the way to go.

> 
> We may need a get_scan_count() solution after all, and I agree with
> previous reviews that this is the better location for such an issue...

I can't see why a get_scan_count solution is better than removing the problem where it starts.

> 
> One thing I think we should do - whether we need more on top or not -
> is allowing file reclaim to continue when sc->file_is_tiny. Yes, we
> also need anon to meet the watermarks, but it's not clear why we
> should stop scanning file pages altogether: it's possible they get us
> there 99% of the way, and somebody clearly wanted us to swap as little
> as possible to end up in a situation like that, so:
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eeab6611993c..90dac3dc9903 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2477,7 +2477,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  	 * If the system is almost out of file pages, force-scan anon.
>  	 */
>  	if (sc->file_is_tiny) {
> -		scan_balance = SCAN_ANON;
> +		scan_balance = SCAN_EQUAL;
>  		goto out;
>  	}

I agree, I think allowing it to scan both would be better as well. 

Cheers!
-- Nico
Shakeel Butt Aug. 10, 2021, 9:16 p.m. UTC | #4
Hi Johannes,

On Tue, Aug 10, 2021 at 8:27 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
[...]
> One thing I think we should do - whether we need more on top or not -
> is allowing file reclaim to continue when sc->file_is_tiny. Yes, we
> also need anon to meet the watermarks, but it's not clear why we
> should stop scanning file pages altogether: it's possible they get us
> there 99% of the way, and somebody clearly wanted us to swap as little
> as possible to end up in a situation like that, so:
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eeab6611993c..90dac3dc9903 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2477,7 +2477,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>          * If the system is almost out of file pages, force-scan anon.
>          */
>         if (sc->file_is_tiny) {
> -               scan_balance = SCAN_ANON;
> +               scan_balance = SCAN_EQUAL;
>                 goto out;
>         }
>

Another thing we should do is to re-evaluate the sc->file_is_tiny
condition. Currently it is:

anon = node_page_state(pgdat, NR_INACTIVE_ANON);
sc->file_is_tiny =
	file + free <= total_high_wmark &&
	!(sc->may_deactivate & DEACTIVATE_ANON) &&
	anon >> sc->priority;

First, convert the node_page_state() usage to lruvec_page_state() for a
common source of truth.

Second, in commit b91ac374346b, (sc->may_deactivate & DEACTIVATE_ANON)
implies inactive_is_low(LRU_INACTIVE_ANON), but commit 170b04b7ae49 changed
that. Was that intended?

Third, the comment above this code says "Consider anon" but it is only
considering inactive anon. Do we need to change the comment or the
check?
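
A minimal sketch of the first suggestion (untested, just to illustrate the
direction) would be:

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
-		anon = node_page_state(pgdat, NR_INACTIVE_ANON);
+		anon = lruvec_page_state(target_lruvec, NR_INACTIVE_ANON);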
Shakeel Butt Aug. 10, 2021, 9:17 p.m. UTC | #5
On Tue, Aug 10, 2021 at 12:24 PM Nico Pache <npache@redhat.com> wrote:
>
>
[...]
> >
> > I'm wondering how you're getting anon scans with swappiness=0. If you
> > look at get_scan_count(), SCAN_FRACT with swappines=0 should always
> > result in ap = fraction[0] = 0, which never yields any anon scan
> > targets. So I'm thinking you're running into sc->file_is_tiny
> > situations, meaning remaining file pages alone are not enough to
> > restore watermarks anymore. Is that possible?
>
> Yes DEACTIVATE_ANON is enabling the file_is_tiny case in shrink_node(). That is what im trying to prevent in the swappiness=0 case.
>

Can you please explain how DEACTIVATE_ANON is enabling the file_is_tiny case?
Nico Pache Aug. 10, 2021, 10:16 p.m. UTC | #6
On 8/10/21 5:17 PM, Shakeel Butt wrote:
> On Tue, Aug 10, 2021 at 12:24 PM Nico Pache <npache@redhat.com> wrote:
>>
>>
> [...]
>>>
>>> I'm wondering how you're getting anon scans with swappiness=0. If you
>>> look at get_scan_count(), SCAN_FRACT with swappines=0 should always
>>> result in ap = fraction[0] = 0, which never yields any anon scan
>>> targets. So I'm thinking you're running into sc->file_is_tiny
>>> situations, meaning remaining file pages alone are not enough to
>>> restore watermarks anymore. Is that possible?
>>
>> Yes DEACTIVATE_ANON is enabling the file_is_tiny case in shrink_node(). That is what im trying to prevent in the swappiness=0 case.
>>
> 
> Can you please explain how DEACTIVATE_ANON is enabling the file_is_tiny case?


You're right. I just did a second pass... I misinterpreted the assignment to
file_is_tiny. This is not the case that is causing the issue. So back to the
SCAN_FRACT case. From my testing the refaulting still seems to be causing the
issue; however, to your point in earlier discussions, if swappiness=0 then the
anon scan target from get_scan_count() *should* be 0.
So my patch does solve the issue by preventing shrink_list from deactivating
the anon, but it may be hiding some other issue that is the ultimate cause.

Thanks for pointing that out!
-- Nico
Shakeel Butt Aug. 10, 2021, 10:29 p.m. UTC | #7
On Tue, Aug 10, 2021 at 3:16 PM Nico Pache <npache@redhat.com> wrote:
>
>
>
> On 8/10/21 5:17 PM, Shakeel Butt wrote:
> > On Tue, Aug 10, 2021 at 12:24 PM Nico Pache <npache@redhat.com> wrote:
> >>
> >>
> > [...]
> >>>
> >>> I'm wondering how you're getting anon scans with swappiness=0. If you
> >>> look at get_scan_count(), SCAN_FRACT with swappines=0 should always
> >>> result in ap = fraction[0] = 0, which never yields any anon scan
> >>> targets. So I'm thinking you're running into sc->file_is_tiny
> >>> situations, meaning remaining file pages alone are not enough to
> >>> restore watermarks anymore. Is that possible?
> >>
> >> Yes DEACTIVATE_ANON is enabling the file_is_tiny case in shrink_node(). That is what im trying to prevent in the swappiness=0 case.
> >>
> >
> > Can you please explain how DEACTIVATE_ANON is enabling the file_is_tiny case?
>
>
> You're right. Just did a second pass... I misinterpreted the assignment to
> file_is_tiny. This is not the case that is causing the issue. So back to the
> SCAN_FRACT case. From my testing the refaulting still seems to be causing the
> issue; however, to your point in earlier discussions, if swappiness=0 then the
> get_scan_count *should* be 0.
> So my patch does solve the issue by preventing the shrink_list from deactivating
> the anon, but it may be hiding some other issue that is the ultimate cause.
>

Yes, I am suspecting the same. BTW which kernel version are you testing with?
Nico Pache April 19, 2022, 6:11 p.m. UTC | #8
I think it is important to note the issue we are seeing has greatly improved
since the initial posting. However, we have noticed that the issue is still
present (and significantly worse) when cgroupV1 is set.

We were initially testing with CgroupV1 and later found that the issue was not
as bad in CgroupV2 (but was still a noticeable issue). This is also resulting
in the splitting of THPs in the host kernel.

-- Nico

On 8/9/21 18:37, Nico Pache wrote:
> Since commit 170b04b7ae49 ("mm/workingset: prepare the workingset detection
> infrastructure for anon LRU") and commit b91ac374346b ("mm: vmscan: enforce
> inactive:active ratio at the reclaim root") swappiness can start prematurely
> swapping anon memory. This is due to the assumption that refaulting anon should
> always allow the shrinker to target anon memory. Add a check for swappiness
> being >0 before indiscriminately targeting Anon. Before these commits
> when a user had swappiness=0 anon memory would rarely get swapped; this
> behavior has remained constant sense RHEL5. This commit keeps that behavior
> intact and prevents the new workingset refaulting from challenging the anon
> memory when swappiness=0.
> 
> Anon can still be swapped to prevent OOM. This does not completely disable
> swapping, but rather tames the refaulting aspect of the code that allows for
> the deactivating of anon memory.
> 
> We have two customer workloads that discovered this issue:
> 1) A VM claiming 95% of the hosts memory followed by file reads (never dirty)
>    which begins to challenge the anon. Refaulting the anon working set will then
>    cause the indiscriminant swapping of the anon.
> 
> 2) A VM running a in-memory DB is being populated from file reads.
>    Swappiness is set to 0 or 1 to defer write I/O as much as possible. Once
>    the customer experienced low memory, swapping anon starts, with
>    little-to-no PageCache being swapped.
> 
> Previously the file cache would account for almost all of the memory
> reclaimed and reads would throttle. Although the two LRU changes mentioned
> allow for less thrashing of file cache, customers would like to be able to keep
> the swappiness=0 behavior that has been present in the kernel for a long
> time.
> 
> A similar solution may be possible in get_scan_count(), which determines the
> reclaim pressure for each LRU; however I believe that kind of solution may be
> too aggressive, and will not allow other parts of the code (like direct reclaim)
> from targeting the active_anon list. This way we stop the problem at the heart
> of what is causing the issue, with the least amount of interference in other
> code paths. Furthermore, shrink_lruvec can modify the reclaim pressure of each
> LRU, which may make the get_scan_count solution even trickier.
> 
> Changelog:
>  -V3:
>     * Blame the right commit and be more descriptive in my log message.
>     * inactive_is_low should remain independent from the new swappiness check.
>     * Change how we get the swappiness value. Shrink_node can be called with a
>       null target_mem_cgroup so we should depend on the target_lruvec to do the
>       null check on memcg.
> 
>  -V2:
>      * made this mem_cgroup specific so now it will work with v1, v2, and
>        no cgroups.
>      * I've also touched up my commit log.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/vmscan.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4620df62f0ff..9f2420da4037 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2883,8 +2883,12 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  	struct lruvec *target_lruvec;
>  	bool reclaimable = false;
>  	unsigned long file;
> +	struct mem_cgroup *memcg;
> +	int swappiness;
>  
>  	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
> +	memcg = lruvec_memcg(target_lruvec);
> +	swappiness = mem_cgroup_swappiness(memcg);
>  
>  again:
>  	memset(&sc->nr, 0, sizeof(sc->nr));
> @@ -2909,7 +2913,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  
>  		refaults = lruvec_page_state(target_lruvec,
>  				WORKINGSET_ACTIVATE_ANON);
> -		if (refaults != target_lruvec->refaults[0] ||
> +		if ((swappiness && refaults != target_lruvec->refaults[0]) ||
>  			inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
>  			sc->may_deactivate |= DEACTIVATE_ANON;
>  		else
Johannes Weiner April 19, 2022, 6:46 p.m. UTC | #9
Hi Nico,

On Tue, Apr 19, 2022 at 02:11:53PM -0400, Nico Pache wrote:
> I think its is important to note the issue we are seeing has greatly improved
> since the initial posting. However we have noticed that the issue is still
> present (and significantly worse) when cgroupV1 is set.
> 
> We were initially testing with CgroupV1 and later found that the issue was not
> as bad in CgroupV2 (but was still an noticeable issue). This is also resulting
> in the splitting of THPs in the host kernel.

When swappiness is 0, cgroup limit reclaim has a fixed SCAN_FILE
branch, so it shouldn't ever look at anon. I'm assuming you're getting
global reclaim mixed in. Indeed, I think we can try harder not to swap
for global reclaim if the user asks for that.

Can you try the below patch?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ba19edbc2452..0aa929151386 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2729,11 +2729,9 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	}
 
 	/*
-	 * Global reclaim will swap to prevent OOM even with no
-	 * swappiness, but memcg users want to use this knob to
-	 * disable swapping for individual groups completely when
-	 * using the memory controller's swap limit feature would be
-	 * too expensive.
+	 * For cgroups, swappiness=0 means never swap. Historically,
+	 * cgroup users have relied on this as it was cheaper than
+	 * setting a swap limit of 0.
 	 */
 	if (cgroup_reclaim(sc) && !swappiness) {
 		scan_balance = SCAN_FILE;
@@ -2754,7 +2752,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	 * If the system is almost out of file pages, force-scan anon.
 	 */
 	if (sc->file_is_tiny) {
-		scan_balance = SCAN_ANON;
+		scan_balance = SCAN_EQUAL;
 		goto out;
 	}
 
@@ -2767,6 +2765,12 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 		goto out;
 	}
 
+	/* swappiness=0 means no swap until OOM is imminent */
+	if (!swappiness && sc->priority) {
+		scan_balance = SCAN_FILE;
+		goto out;
+	}
+
 	scan_balance = SCAN_FRACT;
 	/*
 	 * Calculate the pressure balance between anon and file pages.
Nico Pache April 19, 2022, 7:37 p.m. UTC | #10
Hi Johannes,

On 4/19/22 14:46, Johannes Weiner wrote:
> Hi Nico,
> 
> On Tue, Apr 19, 2022 at 02:11:53PM -0400, Nico Pache wrote:
>> I think its is important to note the issue we are seeing has greatly improved
>> since the initial posting. However we have noticed that the issue is still
>> present (and significantly worse) when cgroupV1 is set.
>>
>> We were initially testing with CgroupV1 and later found that the issue was not
>> as bad in CgroupV2 (but was still an noticeable issue). This is also resulting
>> in the splitting of THPs in the host kernel.
> 
> When swappiness is 0, cgroup limit reclaim has a fixed SCAN_FILE
> branch, so it shouldn't ever look at anon. I'm assuming you're getting
> global reclaim mixed in. Indeed, I think we can try harder not to swap
> for global reclaim if the user asks for that.
We aren't actually utilizing the cgroup mechanism; however, switching between
the two has a noticeable effect on the global reclaim of the system. This is not
a writeback case either; the reproducer simply reads. So I think we can rule
out the v2 writeback controller being involved. My initial patch was also
targeting swappiness=0, but this also occurs when it is >0.
> 
> Can you try the below patch?
Of course, thanks for that :) I'll let you know how it goes!

Cheers,
-- Nico
Nico Pache April 19, 2022, 11:54 p.m. UTC | #11
On 4/19/22 14:46, Johannes Weiner wrote:
> Hi Nico,
> 
> On Tue, Apr 19, 2022 at 02:11:53PM -0400, Nico Pache wrote:
>> I think its is important to note the issue we are seeing has greatly improved
>> since the initial posting. However we have noticed that the issue is still
>> present (and significantly worse) when cgroupV1 is set.
>>
>> We were initially testing with CgroupV1 and later found that the issue was not
>> as bad in CgroupV2 (but was still an noticeable issue). This is also resulting
>> in the splitting of THPs in the host kernel.
> 
> When swappiness is 0, cgroup limit reclaim has a fixed SCAN_FILE
> branch, so it shouldn't ever look at anon. I'm assuming you're getting
> global reclaim mixed in. Indeed, I think we can try harder not to swap
> for global reclaim if the user asks for that.
> 
> Can you try the below patch?
Sadly this did not fix the V1 case.

I will have to go back through my notes and over the code again to find what
difference between the two may be causing this issue. I'm just starting to focus
on this issue again, so I need to refresh my memory of where I left off.

The good news is that with the current state of upstream, the issue of swap
storms tearing down THPs seems to be minimized to a sane amount with V2.

My swappiness=0 solution was a minimal approach to regaining the 'avoid swapping
ANON' behavior that was previously there, but as Shakeel pointed out, there may
be something larger at play.

Cheers,
-- Nico
Johannes Weiner April 20, 2022, 2:01 p.m. UTC | #12
On Tue, Apr 19, 2022 at 07:54:52PM -0400, Nico Pache wrote:
> 
> 
> On 4/19/22 14:46, Johannes Weiner wrote:
> > Hi Nico,
> > 
> > On Tue, Apr 19, 2022 at 02:11:53PM -0400, Nico Pache wrote:
> >> I think its is important to note the issue we are seeing has greatly improved
> >> since the initial posting. However we have noticed that the issue is still
> >> present (and significantly worse) when cgroupV1 is set.
> >>
> >> We were initially testing with CgroupV1 and later found that the issue was not
> >> as bad in CgroupV2 (but was still an noticeable issue). This is also resulting
> >> in the splitting of THPs in the host kernel.
> > 
> > When swappiness is 0, cgroup limit reclaim has a fixed SCAN_FILE
> > branch, so it shouldn't ever look at anon. I'm assuming you're getting
> > global reclaim mixed in. Indeed, I think we can try harder not to swap
> > for global reclaim if the user asks for that.
> > 
> > Can you try the below patch?
> Sadly this did not fix the V1 case.
> 
> I will have to go back through my notes and over the code again to find what
> difference between the two may be causing this issue. Im just starting to focus
> on this issue again so my memory needs some refreshing of where I was last at.
> 
> The good news is that with the current state of upstream the issue of swap
> storms tearing down THPs seems to be minimized to sane amount with V2.
> 
> My swappiness=0 solution was a minimal approach to regaining the 'avoid swapping
> ANON' behavior that was previously there, but as Shakeel pointed out, there may
> be something larger at play.

So with my patch and swappiness=0 you get excessive swapping on v1 but
not on v2? And the patch to avoid DEACTIVATE_ANON fixes it?

If you haven't done so, it could be useful to litter shrink_node() and
get_scan_count() with trace_printk() to try to make sense of all the
decisions that result in it swapping.
Nico Pache April 20, 2022, 5:34 p.m. UTC | #13
On 4/20/22 10:01, Johannes Weiner wrote:
>> My swappiness=0 solution was a minimal approach to regaining the 'avoid swapping
>> ANON' behavior that was previously there, but as Shakeel pointed out, there may
>> be something larger at play.
> 
> So with my patch and swappiness=0 you get excessive swapping on v1 but
> not on v2? And the patch to avoid DEACTIVATE_ANON fixes it?

Correct. I haven't tested the DEACTIVATE_ANON patch since the last time I was
working on this, but it did cure it. I can build a new kernel with it and verify
again.

The larger issue is that our workload has regressed in performance.

With V2 and swappiness=10 we are still seeing some swap, but very little tearing
down of THPs over time. With swappiness=0 it did swap some, but we are not
losing GBs of THPs (with your patch, swappiness=0 does not swap or have THP
issues on V2).

With V1 and swappiness=(0|10) (with and without your patch), it swaps a ton and
ultimately leads to a significant amount of THP splitting. So the longer the
system/workload runs, the less likely we are to get THPs backing the guest, and
the performance gain from THPs is lost.

So your patch does help return the old swappiness=0 behavior, but only for V2.

Ideally we would like to keep swappiness>0, but I found that with my patch and
swappiness=0 we could create a workaround for this effect on V1; any other
value still results in the THP issue.


After the workload is run with V2 and swappiness=0, the host system looks like this**:
               total        used        free      shared  buff/cache   available
Mem:       264071432   257536896      927424        4664     5607112     4993184
Swap:        4194300           0     4194300

Node 0 AnonPages:      128145476 kB	Node 1 AnonPages:      128111908 kB
Node 0 AnonHugePages:  128026624 kB	Node 1 AnonHugePages:  128090112 kB

** Without your patch there is still some swap and THP splitting, but nothing
like the case below.

Same workload on V1/swappiness=0 looks like this:
               total        used        free	  shared  buff/cache   available
Mem:	   264071432   257169500     1032612        4192     5869320     5357944
Swap:        4194300      623008     3571292

Node 0 AnonPages:      127927156 kB     Node 1 AnonPages:      127701088 kB
Node 0 AnonHugePages:  127789056 kB     Node 1 AnonHugePages:  87552000 kB
								^^^^^^^

This leads to the performance regression I'm referring to in later workloads.
V2 used to have a similar effect to V1, but not nearly as bad. Recent updates
upstream fixed this in V2.

The workload tests multiple FS types, so this is most likely not an FS-specific
issue either.

> If you haven't done so, it could be useful to litter shrink_node() and
> get_scan_count() with trace_printk() to try to make sense of all the
> decisions that result in it swapping.
Will do :) I was originally doing some BPF tracing that led me to find the
DEACTIVATE_ANON case.

Thanks,
-- Nico
Johannes Weiner April 20, 2022, 6:44 p.m. UTC | #14
Hi Nico,

On Wed, Apr 20, 2022 at 01:34:58PM -0400, Nico Pache wrote:
> On 4/20/22 10:01, Johannes Weiner wrote:
> >> My swappiness=0 solution was a minimal approach to regaining the 'avoid swapping
> >> ANON' behavior that was previously there, but as Shakeel pointed out, there may
> >> be something larger at play.
> > 
> > So with my patch and swappiness=0 you get excessive swapping on v1 but
> > not on v2? And the patch to avoid DEACTIVATE_ANON fixes it?
> 
> correct, I haven't tested the DEACTIVATE_ANON patch since last time I was
> working on this, but it did cure it. I can build a new kernel with it and verify
> again.
> 
> The larger issue is that our workload has regressed in performance.
> 
> With V2 and swappiness=10 we are still seeing some swap, but very little tearing
> down of THPs over time. With swappiness=0 it did some when swap but we are not
> losings GBs of THPS (with your patch swappiness=0 has swap or THP issues on V2).
> 
> With V1 and swappiness=(0|10)(with and without your patch), it swaps a ton and
> ultimately leads to a significant amount of THP splitting. So the longer the
> system/workload runs, the less likely we are to get THPs backing the guest and
> the performance gain from THPs is lost.

I hate to ask, but is it possible this is a configuration issue?

One significant difference between V1 and V2 is that V1 has per-cgroup
swappiness, which is inherited when the cgroup is created. So if you
set sysctl vm.swappiness=0 after cgroups have been created, it will
not update them. V2 cgroups do use vm.swappiness:

static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
{
	/* Cgroup2 doesn't have per-cgroup swappiness */
	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
		return vm_swappiness;

	/* root ? */
	if (mem_cgroup_disabled() || mem_cgroup_is_root(memcg))
		return vm_swappiness;

	return memcg->swappiness;
}

Is it possible the job cgroups on V1 have swappiness=60?

> So your patch does help return the old swappiness=0 behavior, but only for V2.

Thanks for verifying. I'll prepare a proper patch.
Nico Pache April 21, 2022, 4:21 p.m. UTC | #15
On 4/20/22 14:44, Johannes Weiner wrote:
>>
>> The larger issue is that our workload has regressed in performance.
>>
>> With V2 and swappiness=10 we are still seeing some swap, but very little tearing
>> down of THPs over time. With swappiness=0 it did some when swap but we are not
>> losings GBs of THPS (with your patch swappiness=0 has swap or THP issues on V2).
I meant to say `with your patch swappiness=0 does not swap or have thp issues on v2`
>>
>> With V1 and swappiness=(0|10)(with and without your patch), it swaps a ton and
>> ultimately leads to a significant amount of THP splitting. So the longer the
>> system/workload runs, the less likely we are to get THPs backing the guest and
>> the performance gain from THPs is lost.
> 
> I hate to ask, but is it possible this is a configuration issue
I'm very glad you asked :)
> 
> One significant difference between V1 and V2 is that V1 has per-cgroup
> swappiness, which is inherited when the cgroup is created. So if you
> set sysctl vm.swappiness=0 after cgroups have been created, it will
> not update them. V2 cgroups do use vm.swappiness:

This is something I did not consider... Thank you for pointing that out!

The issue still occurs whether or not I set the swappiness value before the VM
boot. However, this led me to find the icing on the cake :)

Even when I set vm.swappiness=0 at boot using sysctl.conf, I was not considering
the fact that libvirtd creates its own cgroup for the machines you start with
it... and additionally it does not inherit the sysctl value (even when set at
boot)?! How annoying...

The cgroup's swappiness value defaults to 60. This seems wrong to me from a
libvirt/systemd POV. If the system is booted with swappiness=0, then why does
the (user|machine|system).slice cgroup ignore this value when it creates its
child cgroups (see below)?

I will have to dig a little further to find a cause/fix for this. This requires
libvirt users to understand a number of intricacies that they really shouldn't
have to consider, and may lead to headaches like these ;P

Values of the memcgs created on boot (with sysctl vm.swappiness=0 on a V1 boot)
------------------------------------------------------------------------
/sys/fs/cgroup/memory/memory.swappiness 				=0
/sys/fs/cgroup/memory/dev-hugepages.mount/memory.swappiness		=60
/sys/fs/cgroup/memory/dev-mqueue.mount/memory.swappiness		=60
/sys/fs/cgroup/memory/machine.slice/memory.swappiness			=60
/sys/fs/cgroup/memory/proc-sys-fs-binfmt_misc.mount/memory.swappiness	=0
/sys/fs/cgroup/memory/sys-fs-fuse-connections.mount/memory.swappiness	=0
/sys/fs/cgroup/memory/sys-kernel-config.mount/memory.swappiness		=0
/sys/fs/cgroup/memory/sys-kernel-debug.mount/memory.swappiness		=60
/sys/fs/cgroup/memory/sys-kernel-tracing.mount/memory.swappiness	=60
/sys/fs/cgroup/memory/system.slice/memory.swappiness			=60
/sys/fs/cgroup/memory/user.slice/memory.swappiness			=60

Some seem to inherit the cgroup/memory/memory.swappiness value and some do
not... This issue was brought up in a systemd issue with no solution or
documentation [1].

Libvirt in particular is using the machine.slice cgroup, so it inherits the 60.
If I change that value to 0 and then start the machine, it now has swappiness 0.
$ echo 0 > /sys/fs/cgroup/memory/machine.slice/memory.swappiness
$ virsh start <guest-name>
$ cat /sys/fs/cgroup/memory/machine.slice/machine-qemu.scope/memory.swappiness
0

Thank you so much for your very insightful note that led to the real issue :)

> Thanks for verifying. I'll prepare a proper patch.

My issue with v1 vs v2 seems to go away with a much more sane value of
swappiness=10 on v1 (when actually set properly lol).

Also, as per my results below, I actually don't think your patch caused much
change to my workload. I'm not sure what happened the first time I ran it that
caused the swapping on v2 (before your patch)... perhaps I ran the older kernel
(~v5.14) that was still having issues with v2, or it's the fact that the results
can differ between runs. Sorry about that.

Here are the test results for your patch with V1 and V2 (swappiness=0/10):

Before Patch
-------------
-- V1(swappiness=0):
               total        used        free      shared  buff/cache   available
Mem:       264071432   257465704     1100160        4224     5505568     5064964
Swap:        4194300       47828     4146472

Node 0 AnonPages:      128068580 kB	Node 1 AnonPages:      128120400 kB
Node 0 AnonHugePages:  128012288 kB	Node 1 AnonHugePages:  128090112 kB
                                                               ^^^^^ no loss

-- V1(swappiness=10):
               total        used        free	  shared  buff/cache   available
Mem:	   264071432   257364436      972048        3972     5734948     5164520
Swap:        4194300      235028     3959272

Node 0 AnonPages:      128015404 kB     Node 1 AnonPages:      128002788 kB
Node 0 AnonHugePages:  128002048 kB     Node 1 AnonHugePages:  120576000 kB
                                                               ^^^^^ some loss

-- V2(swappiness=0):
               total        used        free	  shared  buff/cache   available
Mem:	   264071432   257609352      924692        4664     5537388     4921236
Swap:        4194300           0     4194300
                           ^^^^^ No Swap
Node 0 AnonPages:      128083104 kB     Node 1 AnonPages:      128180576 kB
Node 0 AnonHugePages:  128002048 kB     Node 1 AnonHugePages:  128124928 kB
                                                               ^^^^^ No loss

-- V2(swappiness=10):
               total        used        free	  shared  buff/cache   available
Mem:	   264071432   257407576      918124        4632     5745732     5101764
Swap:        4194300      220424     3973876
                           ^^^^^ Some Swap
Node 0 AnonPages:      128109700 kB	Node 1 AnonPages:      127918164 kB
Node 0 AnonHugePages:  128006144 kB	Node 1 AnonHugePages:  120569856 kB
                                                               ^^^^^ some loss

After Patch
-------------
-- V1:swappiness=0
               total        used        free	  shared  buff/cache   available
Mem:	   264071432   257538832      945276        4368     5587324     4991852
Swap:        4194300        9276     4185024

Node 0 AnonPages:      128133932 kB	Node 1 AnonPages:      128100540 kB
Node 0 AnonHugePages:  128047104 kB	Node 1 AnonHugePages:  128061440 kB


-- V1:swappiness=10
               total        used        free	  shared  buff/cache   available
Mem:	   264071432   257428564      969252        4384     5673616     5100824
Swap:        4194300      138936     4055364
                           ^^^^^ Some Swap
Node 0 AnonPages:      128161724 kB     Node 1 AnonPages:      127945368 kB
Node 0 AnonHugePages:  128043008 kB     Node 1 AnonHugePages:  120221696 kB
                                                               ^^^^^ some loss

-- V2(swappiness=0):
               total        used        free      shared  buff/cache   available
Mem:       264071432   257536896      927424        4664     5607112     4993184
Swap:        4194300           0     4194300

Node 0 AnonPages:      128145476 kB	Node 1 AnonPages:      128111908 kB
Node 0 AnonHugePages:  128026624 kB	Node 1 AnonHugePages:  128090112 kB

-- V2(swappiness=10):
               total        used        free	  shared  buff/cache   available
Mem:	   264071432   257423936     1007076        4548     5640420     5106544
Swap:        4194300	  156016     4038284

Node 0 AnonPages:      128133264 kB     Node 1 AnonPages:      127955952 kB
Node 0 AnonHugePages:  128018432 kB     Node 1 AnonHugePages:  122507264 kB
                                                           ^^^^ slightly better

The only notable difference between before and after your patch is that with
your patch the THP tearing was slightly reduced, resulting in an extra 2GB as
seen in the last result. This may just be noise.

I'll have to see if I can find a fix for this in either the kernel, libvirt, or
systemd, and will follow up if I do. If not, this should at least be documented
correctly. Given that cgroupV1 is in limited-support mode upstream, and
systemd's hesitancy to make changes for V1, we may have to go down our own
avenues to ensure our customers don't run into this issue.

Big Thanks!
-- Nico

[1] - https://github.com/systemd/systemd/issues/9276

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4620df62f0ff..9f2420da4037 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2883,8 +2883,12 @@  static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	struct lruvec *target_lruvec;
 	bool reclaimable = false;
 	unsigned long file;
+	struct mem_cgroup *memcg;
+	int swappiness;
 
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
+	memcg = lruvec_memcg(target_lruvec);
+	swappiness = mem_cgroup_swappiness(memcg);
 
 again:
 	memset(&sc->nr, 0, sizeof(sc->nr));
@@ -2909,7 +2913,7 @@  static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 
 		refaults = lruvec_page_state(target_lruvec,
 				WORKINGSET_ACTIVATE_ANON);
-		if (refaults != target_lruvec->refaults[0] ||
+		if ((swappiness && refaults != target_lruvec->refaults[0]) ||
 			inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
 			sc->may_deactivate |= DEACTIVATE_ANON;
 		else