
mm/swap, workingset: make anon shadow nodes memcg aware

Message ID 20240820092359.97782-1-ryncsn@gmail.com (mailing list archive)
State New
Series mm/swap, workingset: make anon shadow nodes memcg aware

Commit Message

Kairui Song Aug. 20, 2024, 9:23 a.m. UTC
From: Kairui Song <kasong@tencent.com>

Currently, the workingset (shadow) nodes of the swap cache are not
accounted to their corresponding memory cgroup; instead, they are
all accounted to the root cgroup. This leads to inaccurate accounting
and ineffective reclaim: one cgroup could swap out a large amount
of memory and occupy a large amount of memory with shadow nodes,
without any of it being accounted.

This issue is similar to commit 7b785645e8f1 ("mm: fix page cache
convergence regression"), where page cache shadow nodes were incorrectly
accounted. That was due to the accidental dropping of the accounting
flag during the XArray conversion in commit a28334862993
("page cache: Finish XArray conversion").

However, the issue fixed here has a different cause: swap cache shadow
nodes were never accounted, even before the XArray conversion, since
they did not exist until commit 3852f6768ede ("mm/swapcache: support to
handle the shadow entries"), which landed years after the XArray
conversion.

It's worth noting that one anon shadow XArray node may contain
entries from different cgroups, and it gets accounted at reclaim
time, so it's arguable which cgroup it should be accounted to (as
Shakeel Butt pointed out [1]). File pages may suffer from a similar
issue, but it is less common there. Things like proactive memory
reclaim could make this more complex.

So this commit still can't provide 100% accurate accounting of anon
shadows, but it covers the cases where one memory cgroup uses a
significant amount of swap, and in most cases memory pressure in one
cgroup is only supposed to reclaim from that cgroup and its children.
Besides, this fix is clean and simple enough.

Link: https://lore.kernel.org/all/7gzevefivueqtebzvikzbucnrnpurmh3scmfuiuo2tnrs37xso@haj7gzepjur2/ [1]
Signed-off-by: Kairui Song <kasong@tencent.com>

---

This patch was part of previous series:
https://lore.kernel.org/all/20240624175313.47329-1-ryncsn@gmail.com/

Split out as a fix, as suggested by Muchun and Shakeel.

 mm/swap_state.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Shakeel Butt Aug. 21, 2024, 12:21 a.m. UTC | #1
On Tue, Aug 20, 2024 at 05:23:59PM GMT, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> Currently, the workingset (shadow) nodes of the swap cache are not
> accounted to their corresponding memory cgroup; instead, they are
> all accounted to the root cgroup. This leads to inaccurate accounting
> and ineffective reclaim: one cgroup could swap out a large amount
> of memory and occupy a large amount of memory with shadow nodes,
> without any of it being accounted.
> 
> This issue is similar to commit 7b785645e8f1 ("mm: fix page cache
> convergence regression"), where page cache shadow nodes were incorrectly
> accounted. That was due to the accidental dropping of the accounting
> flag during the XArray conversion in commit a28334862993
> ("page cache: Finish XArray conversion").
> 
> However, the issue fixed here has a different cause: swap cache shadow
> nodes were never accounted, even before the XArray conversion, since
> they did not exist until commit 3852f6768ede ("mm/swapcache: support to
> handle the shadow entries"), which landed years after the XArray
> conversion.
> 
> It's worth noting that one anon shadow XArray node may contain
> entries from different cgroups, and it gets accounted at reclaim
> time, so it's arguable which cgroup it should be accounted to (as
> Shakeel Butt pointed out [1]). File pages may suffer from a similar
> issue, but it is less common there. Things like proactive memory
> reclaim could make this more complex.
> 
> So this commit still can't provide 100% accurate accounting of anon
> shadows, but it covers the cases where one memory cgroup uses a
> significant amount of swap, and in most cases memory pressure in one
> cgroup is only supposed to reclaim from that cgroup and its children.
> Besides, this fix is clean and simple enough.
> 
> Link: https://lore.kernel.org/all/7gzevefivueqtebzvikzbucnrnpurmh3scmfuiuo2tnrs37xso@haj7gzepjur2/ [1]
> Signed-off-by: Kairui Song <kasong@tencent.com>
> 

Is this a real issue? Have you seen systems in production with a
large amount of memory occupied by anon shadow entries? This is still
limited by the amount of swap a cgroup is allowed to use.

The reason I am asking is that, at least to me, this solution is worse
than the perceived problem. With this patch, the kernel will charge
unrelated cgroups for the memory of swap xarray nodes during global
reclaim and proactive reclaim.

You can reduce this weirdness by using set_active_memcg() in
add_to_swap_cache() with the given folio's memcg, but you still have
the case of multiple unrelated folios and shadow entries of different
cgroups within the same node. For the filesystem case, userspace can
control which files are shared between different cgroups and so has
more control over it. That is not the case for swap space.
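
(For illustration, a minimal sketch of that suggestion, written as a
hypothetical wrapper rather than the posted patch; it assumes the swap
cache xarray already has XA_FLAGS_ACCOUNT set so that node allocations
use __GFP_ACCOUNT and get charged to the active memcg:)

#include <linux/memcontrol.h>
#include <linux/sched/mm.h>

/*
 * Hypothetical helper: charge any xarray node allocated while inserting
 * this folio to the folio's own memcg, instead of whichever task happens
 * to be running reclaim (e.g. kswapd or a proactive reclaimer).
 */
static int add_to_swap_cache_charged(struct folio *folio, swp_entry_t entry,
				     gfp_t gfp, void **shadowp)
{
	struct mem_cgroup *old_memcg;
	int err;

	/* folio_memcg() may be NULL (e.g. memcg disabled); that is harmless. */
	old_memcg = set_active_memcg(folio_memcg(folio));
	err = add_to_swap_cache(folio, entry, gfp, shadowp);
	set_active_memcg(old_memcg);

	return err;
}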

thanks,
Shakeel
Kairui Song Aug. 21, 2024, 5:35 p.m. UTC | #2
On Wed, Aug 21, 2024 at 8:22 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Aug 20, 2024 at 05:23:59PM GMT, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Currently, the workingset (shadow) nodes of the swap cache are not
> > accounted to their corresponding memory cgroup; instead, they are
> > all accounted to the root cgroup. This leads to inaccurate accounting
> > and ineffective reclaim: one cgroup could swap out a large amount
> > of memory and occupy a large amount of memory with shadow nodes,
> > without any of it being accounted.
> >
> > This issue is similar to commit 7b785645e8f1 ("mm: fix page cache
> > convergence regression"), where page cache shadow nodes were incorrectly
> > accounted. That was due to the accidental dropping of the accounting
> > flag during the XArray conversion in commit a28334862993
> > ("page cache: Finish XArray conversion").
> >
> > However, the issue fixed here has a different cause: swap cache shadow
> > nodes were never accounted, even before the XArray conversion, since
> > they did not exist until commit 3852f6768ede ("mm/swapcache: support to
> > handle the shadow entries"), which landed years after the XArray
> > conversion.
> >
> > It's worth noting that one anon shadow XArray node may contain
> > entries from different cgroups, and it gets accounted at reclaim
> > time, so it's arguable which cgroup it should be accounted to (as
> > Shakeel Butt pointed out [1]). File pages may suffer from a similar
> > issue, but it is less common there. Things like proactive memory
> > reclaim could make this more complex.
> >
> > So this commit still can't provide 100% accurate accounting of anon
> > shadows, but it covers the cases where one memory cgroup uses a
> > significant amount of swap, and in most cases memory pressure in one
> > cgroup is only supposed to reclaim from that cgroup and its children.
> > Besides, this fix is clean and simple enough.
> >
> > Link: https://lore.kernel.org/all/7gzevefivueqtebzvikzbucnrnpurmh3scmfuiuo2tnrs37xso@haj7gzepjur2/ [1]
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> >

Hi, Thanks for the comments.

> Is this a real issue? Have you seen systems in the production with
> large amount of memory occupied by anon shadow entries? This is still
> limited to the amount of swap a cgroup is allowed to use.

No, this patch is cherry-picked from the previous series. According to
my tests it helps separate the shadows into different cgroups properly,
and it reduces list_lru lock contention by a lot when combined with the
later patches. Not very convincing on its own indeed, so I hesitated to
send it alone.

> The reason I am asking is that this solution is worse than the perceived
> problem at least to me. With this patch, the kernel will be charging
> unrelated cgroups for the memory of swap xarray nodes during global
> reclaim and proactive reclaim.

Yes, this could be a problem.

I didn't observe this happening frequently in tests though: SWAP
allocations tend to be clustered, and reclaim tends to batch-reclaim
pages, so there is usually a fairly high chance that shadows of pages
from the same memcg stay on the same node.

It could end up completely random when the SWAP device is getting
fragmented or reclaim is struggling, though.

> You can reduce this weirdness by using set_active_memcg() in
> add_to_swap_cache() using the given folio's memcg but still you have the
> case of multiple unrelated folios and shadow entries of different
> cgroups within the same node. For filesystem case, the userspace can
> control which files are shared between different cgroups and has more
> control on it. That is not the case for swap space.

Right, this fix is not perfect, and it's arguable whether the new
behaviour is better or worse than before. There is some ongoing work on
the SWAP side, so things may get fixed differently in the future, but
I'll also check whether this patch can be improved.
Shakeel Butt Aug. 21, 2024, 6:14 p.m. UTC | #3
On Thu, Aug 22, 2024 at 01:35:29AM GMT, Kairui Song wrote:
> On Wed, Aug 21, 2024 at 8:22 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
[...]
> 
> Hi, Thanks for the comments.
> 
> > Is this a real issue? Have you seen systems in production with a
> > large amount of memory occupied by anon shadow entries? This is still
> > limited by the amount of swap a cgroup is allowed to use.
> 
> No, this patch is cherry-picked from the previous series. According to
> my tests it helps separate the shadows into different cgroups properly,
> and it reduces list_lru lock contention by a lot when combined with the
> later patches. Not very convincing on its own indeed, so I hesitated to
> send it alone.
> 

So, list_lru lock contention is the problem you are trying to solve.
Without this patch, do you see less impact from your list_lru series?
Anyway, this patch is not the right way to solve the list_lru lock
contention issue.

> > The reason I am asking is that, at least to me, this solution is worse
> > than the perceived problem. With this patch, the kernel will charge
> > unrelated cgroups for the memory of swap xarray nodes during global
> > reclaim and proactive reclaim.
> 
> Yes, this could be a problem.
> 
> I didn't observe this happening frequently in tests though: SWAP
> allocations tend to be clustered, and reclaim tends to batch-reclaim
> pages, so there is usually a fairly high chance that shadows of pages
> from the same memcg stay on the same node.
> 
> It could end up completely random when the SWAP device is getting
> fragmented or reclaim is struggling, though.

In actual production, fragmentation and memory over-commit are very
common, so such scenarios would occur more often.

> 
> > You can reduce this weirdness by using set_active_memcg() in
> > add_to_swap_cache() with the given folio's memcg, but you still have
> > the case of multiple unrelated folios and shadow entries of different
> > cgroups within the same node. For the filesystem case, userspace can
> > control which files are shared between different cgroups and so has
> > more control over it. That is not the case for swap space.
> 
> Right, this fix is not perfect, and it's arguable whether the new
> behaviour is better or worse than before. There is some ongoing work on
> the SWAP side, so things may get fixed differently in the future, but
> I'll also check whether this patch can be improved.

Yeah, with mTHP we can reevaluate this approach.

Patch

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4669f29cf555..b4ed2c664c67 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -97,6 +97,7 @@  int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
 	void *old;
 
 	xas_set_update(&xas, workingset_update_node);
+	xas_set_lru(&xas, &shadow_nodes);
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
@@ -718,7 +719,7 @@  int init_swap_address_space(unsigned int type, unsigned long nr_pages)
 		return -ENOMEM;
 	for (i = 0; i < nr; i++) {
 		space = spaces + i;
-		xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ);
+		xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
 		atomic_set(&space->i_mmap_writable, 0);
 		space->a_ops = &swap_aops;
 		/* swap cache doesn't use writeback related tags */
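
(For context on why these two lines are sufficient: with XA_FLAGS_ACCOUNT
set on the swap cache xarray, new xa_node allocations are made with
__GFP_ACCOUNT and therefore charged to the active memcg, and the list_lru
passed via xas_set_lru() lets the allocation target the memcg-aware
shadow_nodes lru, mirroring what the page cache side already does. The
snippet below is a rough, abridged paraphrase of xas_alloc() in
lib/xarray.c and may differ between kernel versions:)

static void *xas_alloc(struct xa_state *xas, gfp_t gfp)
{
	struct xa_node *node;
	...
	/* XA_FLAGS_ACCOUNT (set via xa_init_flags()) enables kmem accounting */
	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
		gfp |= __GFP_ACCOUNT;

	/* xas->xa_lru was set by xas_set_lru(&xas, &shadow_nodes) */
	node = kmem_cache_alloc_lru(radix_tree_node_cachep, xas->xa_lru, gfp);
	...
}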