
mm: memcontrol: do not miss MEMCG_MAX events for enforced allocations

Message ID 20220702033521.64630-1-roman.gushchin@linux.dev (mailing list archive)
State New
Series mm: memcontrol: do not miss MEMCG_MAX events for enforced allocations

Commit Message

Roman Gushchin July 2, 2022, 3:35 a.m. UTC
Yafang Shao reported an issue related to the accounting of bpf
memory: if a bpf map is charged indirectly for memory consumed
from an interrupt context and allocations are enforced, MEMCG_MAX
events are not raised.

It's not/less of an issue in the generic case because subsequent
allocations from a process context will trigger direct reclaim and
MEMCG_MAX events will be raised. However, a bpf map can belong to a
dying/abandoned memory cgroup, so there will be no allocations from a
process context and no MEMCG_MAX events will be triggered. As a result,
the cgroup can significantly exceed the memory.max limit without even
triggering MEMCG_MAX events.

Fix this by making sure that we never enforce allocations without
raising a MEMCG_MAX event.

Reported-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: bpf@vger.kernel.org
---
 mm/memcontrol.c | 9 +++++++++
 1 file changed, 9 insertions(+)
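For readers following along without the kernel tree at hand, the control flow of the fix can be sketched in userspace code (illustrative Python, not kernel code; the names mirror the patch, and the reclaim/retry loop is elided):

```python
# Model of the patched try_charge_memcg() logic: a charge that must be
# enforced (e.g. __GFP_HIGH from an interrupt context) now always raises
# a MEMCG_MAX event, even when the reclaim path was never entered.

GFP_HIGH = 0x1    # stand-ins for __GFP_HIGH / __GFP_NOFAIL
GFP_NOFAIL = 0x2

class Memcg:
    def __init__(self, limit):
        self.limit = limit
        self.usage = 0
        self.max_events = 0   # MEMCG_MAX event counter

def try_charge(memcg, nr_pages, gfp_mask, can_reclaim):
    raised_max_event = False
    if memcg.usage + nr_pages <= memcg.limit:
        memcg.usage += nr_pages
        return 0
    if can_reclaim:
        # Process-context path: raise MEMCG_MAX and reclaim.
        memcg.max_events += 1
        raised_max_event = True
        # (reclaim + retry elided in this sketch)
    if not (gfp_mask & (GFP_NOFAIL | GFP_HIGH)):
        return -1  # -ENOMEM
    # force: the fix -- never enforce a charge without a MAX event.
    if not raised_max_event:
        memcg.max_events += 1
    memcg.usage += nr_pages
    return 0

m = Memcg(limit=10)
try_charge(m, 20, GFP_HIGH, can_reclaim=False)  # irq-context enforced charge
# the enforced charge still raises exactly one MEMCG_MAX event
```

Before the patch, the `can_reclaim=False` path above would go straight to the force label and the event counter would stay at zero despite the limit being breached.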

Comments

Shakeel Butt July 2, 2022, 5:50 a.m. UTC | #1
On Fri, Jul 1, 2022 at 8:35 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Yafang Shao reported an issue related to the accounting of bpf
> memory: if a bpf map is charged indirectly for memory consumed
> from an interrupt context and allocations are enforced, MEMCG_MAX
> events are not raised.
>
> It's not/less of an issue in a generic case because consequent
> allocations from a process context will trigger the reclaim and
> MEMCG_MAX events. However a bpf map can belong to a dying/abandoned
> memory cgroup, so it might never happen.

The patch looks good but the above sentence is confusing. What might
never happen? Reclaim or MAX event on dying memcg?

> So the cgroup can
> significantly exceed the memory.max limit without even triggering
> MEMCG_MAX events.
>
> Fix this by making sure that we never enforce allocations without
> raising a MEMCG_MAX event.
>
> Reported-by: Yafang Shao <laoar.shao@gmail.com>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Muchun Song <songmuchun@bytedance.com>
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: bpf@vger.kernel.org
> ---
>  mm/memcontrol.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 655c09393ad5..eb383695659a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2577,6 +2577,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>         bool passed_oom = false;
>         bool may_swap = true;
>         bool drained = false;
> +       bool raised_max_event = false;
>         unsigned long pflags;
>
>  retry:
> @@ -2616,6 +2617,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>                 goto nomem;
>
>         memcg_memory_event(mem_over_limit, MEMCG_MAX);
> +       raised_max_event = true;
>
>         psi_memstall_enter(&pflags);
>         nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
> @@ -2682,6 +2684,13 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>         if (!(gfp_mask & (__GFP_NOFAIL | __GFP_HIGH)))
>                 return -ENOMEM;
>  force:
> +       /*
> +        * If the allocation has to be enforced, don't forget to raise
> +        * a MEMCG_MAX event.
> +        */
> +       if (!raised_max_event)
> +               memcg_memory_event(mem_over_limit, MEMCG_MAX);
> +
>         /*
>          * The allocation either can't fail or will lead to more memory
>          * being freed very soon.  Allow memory usage go over the limit
> --
> 2.36.1
>
Roman Gushchin July 2, 2022, 3:39 p.m. UTC | #2
On Fri, Jul 01, 2022 at 10:50:40PM -0700, Shakeel Butt wrote:
> On Fri, Jul 1, 2022 at 8:35 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > Yafang Shao reported an issue related to the accounting of bpf
> > memory: if a bpf map is charged indirectly for memory consumed
> > from an interrupt context and allocations are enforced, MEMCG_MAX
> > events are not raised.
> >
> > It's not/less of an issue in a generic case because consequent
> > allocations from a process context will trigger the reclaim and
> > MEMCG_MAX events. However a bpf map can belong to a dying/abandoned
> > memory cgroup, so it might never happen.
> 
> The patch looks good but the above sentence is confusing. What might
> never happen? Reclaim or MAX event on dying memcg?

Direct reclaim and MAX events. I agree it might not be clear without
looking into the code. How about something like this?

"It's not/less of an issue in a generic case because consequent
allocations from a process context will trigger the direct reclaim
and MEMCG_MAX events will be raised. However a bpf map can belong
to a dying/abandoned memory cgroup, so there will be no allocations
from a process context and no MEMCG_MAX events will be triggered."

Thanks!
Shakeel Butt July 3, 2022, 5:36 a.m. UTC | #3
On Sat, Jul 2, 2022 at 8:39 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Fri, Jul 01, 2022 at 10:50:40PM -0700, Shakeel Butt wrote:
> > On Fri, Jul 1, 2022 at 8:35 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >
> > > Yafang Shao reported an issue related to the accounting of bpf
> > > memory: if a bpf map is charged indirectly for memory consumed
> > > from an interrupt context and allocations are enforced, MEMCG_MAX
> > > events are not raised.
> > >
> > > It's not/less of an issue in a generic case because consequent
> > > allocations from a process context will trigger the reclaim and
> > > MEMCG_MAX events. However a bpf map can belong to a dying/abandoned
> > > memory cgroup, so it might never happen.
> >
> > The patch looks good but the above sentence is confusing. What might
> > never happen? Reclaim or MAX event on dying memcg?
>
> Direct reclaim and MAX events. I agree it might be not clear without
> looking into the code. How about something like this?
>
> "It's not/less of an issue in a generic case because consequent
> allocations from a process context will trigger the direct reclaim
> and MEMCG_MAX events will be raised. However a bpf map can belong
> to a dying/abandoned memory cgroup, so there will be no allocations
> from a process context and no MEMCG_MAX events will be triggered."
>

SGTM and you can add:

Acked-by: Shakeel Butt <shakeelb@google.com>
Roman Gushchin July 3, 2022, 10:50 p.m. UTC | #4
On Sat, Jul 02, 2022 at 10:36:28PM -0700, Shakeel Butt wrote:
> On Sat, Jul 2, 2022 at 8:39 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Fri, Jul 01, 2022 at 10:50:40PM -0700, Shakeel Butt wrote:
> > > On Fri, Jul 1, 2022 at 8:35 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > >
> > > > Yafang Shao reported an issue related to the accounting of bpf
> > > > memory: if a bpf map is charged indirectly for memory consumed
> > > > from an interrupt context and allocations are enforced, MEMCG_MAX
> > > > events are not raised.
> > > >
> > > > It's not/less of an issue in a generic case because consequent
> > > > allocations from a process context will trigger the reclaim and
> > > > MEMCG_MAX events. However a bpf map can belong to a dying/abandoned
> > > > memory cgroup, so it might never happen.
> > >
> > > The patch looks good but the above sentence is confusing. What might
> > > never happen? Reclaim or MAX event on dying memcg?
> >
> > Direct reclaim and MAX events. I agree it might be not clear without
> > looking into the code. How about something like this?
> >
> > "It's not/less of an issue in a generic case because consequent
> > allocations from a process context will trigger the direct reclaim
> > and MEMCG_MAX events will be raised. However a bpf map can belong
> > to a dying/abandoned memory cgroup, so there will be no allocations
> > from a process context and no MEMCG_MAX events will be triggered."
> >
> 
> SGTM and you can add:
> 
> Acked-by: Shakeel Butt <shakeelb@google.com>

Thank you!
Michal Hocko July 4, 2022, 3:07 p.m. UTC | #5
On Sat 02-07-22 08:39:14, Roman Gushchin wrote:
> On Fri, Jul 01, 2022 at 10:50:40PM -0700, Shakeel Butt wrote:
> > On Fri, Jul 1, 2022 at 8:35 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >
> > > Yafang Shao reported an issue related to the accounting of bpf
> > > memory: if a bpf map is charged indirectly for memory consumed
> > > from an interrupt context and allocations are enforced, MEMCG_MAX
> > > events are not raised.
> > >
> > > It's not/less of an issue in a generic case because consequent
> > > allocations from a process context will trigger the reclaim and
> > > MEMCG_MAX events. However a bpf map can belong to a dying/abandoned
> > > memory cgroup, so it might never happen.
> > 
> > The patch looks good but the above sentence is confusing. What might
> > never happen? Reclaim or MAX event on dying memcg?
> 
> Direct reclaim and MAX events. I agree it might be not clear without
> looking into the code. How about something like this?
> 
> "It's not/less of an issue in a generic case because consequent
> allocations from a process context will trigger the direct reclaim
> and MEMCG_MAX events will be raised. However a bpf map can belong
> to a dying/abandoned memory cgroup, so there will be no allocations
> from a process context and no MEMCG_MAX events will be triggered."

Could you expand a little bit more on the situation? Can those charges to
offline memcg happen indefinitely? How can it ever go away then? Also, is
this something that we actually want to encourage?

In other words shouldn't those remote charges be redirected when the
target memcg is offline?
Michal Hocko July 4, 2022, 3:12 p.m. UTC | #6
On Fri 01-07-22 20:35:21, Roman Gushchin wrote:
> Yafang Shao reported an issue related to the accounting of bpf
> memory: if a bpf map is charged indirectly for memory consumed
> from an interrupt context and allocations are enforced, MEMCG_MAX
> events are not raised.

So I guess this will be a GFP_ATOMIC request failing due to the hard
limit, right? I think it would be easier to understand if the specific
allocation request type was mentioned.

> It's not/less of an issue in a generic case because consequent
> allocations from a process context will trigger the reclaim and
> MEMCG_MAX events. However a bpf map can belong to a dying/abandoned
> memory cgroup, so it might never happen. So the cgroup can
> significantly exceed the memory.max limit without even triggering
> MEMCG_MAX events.

More on that in other reply.

> Fix this by making sure that we never enforce allocations without
> raising a MEMCG_MAX event.
> 
> Reported-by: Yafang Shao <laoar.shao@gmail.com>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Muchun Song <songmuchun@bytedance.com>
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: bpf@vger.kernel.org

The patch makes sense to me even without the weird charge-to-a-dead-memcg
aspect. It is true that a very calm memcg can trigger the event much later,
after a GFP_ATOMIC (or __GFP_HIGH in general) charge fails.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 655c09393ad5..eb383695659a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2577,6 +2577,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	bool passed_oom = false;
>  	bool may_swap = true;
>  	bool drained = false;
> +	bool raised_max_event = false;
>  	unsigned long pflags;
>  
>  retry:
> @@ -2616,6 +2617,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		goto nomem;
>  
>  	memcg_memory_event(mem_over_limit, MEMCG_MAX);
> +	raised_max_event = true;
>  
>  	psi_memstall_enter(&pflags);
>  	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
> @@ -2682,6 +2684,13 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (!(gfp_mask & (__GFP_NOFAIL | __GFP_HIGH)))
>  		return -ENOMEM;
>  force:
> +	/*
> +	 * If the allocation has to be enforced, don't forget to raise
> +	 * a MEMCG_MAX event.
> +	 */
> +	if (!raised_max_event)
> +		memcg_memory_event(mem_over_limit, MEMCG_MAX);
> +
>  	/*
>  	 * The allocation either can't fail or will lead to more memory
>  	 * being freed very soon.  Allow memory usage go over the limit
> -- 
> 2.36.1
Michal Hocko July 4, 2022, 3:30 p.m. UTC | #7
On Mon 04-07-22 17:07:32, Michal Hocko wrote:
> On Sat 02-07-22 08:39:14, Roman Gushchin wrote:
> > On Fri, Jul 01, 2022 at 10:50:40PM -0700, Shakeel Butt wrote:
> > > On Fri, Jul 1, 2022 at 8:35 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > >
> > > > Yafang Shao reported an issue related to the accounting of bpf
> > > > memory: if a bpf map is charged indirectly for memory consumed
> > > > from an interrupt context and allocations are enforced, MEMCG_MAX
> > > > events are not raised.
> > > >
> > > > It's not/less of an issue in a generic case because consequent
> > > > allocations from a process context will trigger the reclaim and
> > > > MEMCG_MAX events. However a bpf map can belong to a dying/abandoned
> > > > memory cgroup, so it might never happen.
> > > 
> > > The patch looks good but the above sentence is confusing. What might
> > > never happen? Reclaim or MAX event on dying memcg?
> > 
> > Direct reclaim and MAX events. I agree it might be not clear without
> > looking into the code. How about something like this?
> > 
> > "It's not/less of an issue in a generic case because consequent
> > allocations from a process context will trigger the direct reclaim
> > and MEMCG_MAX events will be raised. However a bpf map can belong
> > to a dying/abandoned memory cgroup, so there will be no allocations
> > from a process context and no MEMCG_MAX events will be triggered."
> 
> Could you expand little bit more on the situation? Can those charges to
> offline memcg happen indefinetely? How can it ever go away then? Also is
> this something that we actually want to encourage?

One more question. Mostly out of curiosity. How is userspace actually
acting on those events? Are watchers still active on those dead memcgs?
Roman Gushchin July 5, 2022, 8:49 p.m. UTC | #8
On Mon, Jul 04, 2022 at 05:07:30PM +0200, Michal Hocko wrote:
> On Sat 02-07-22 08:39:14, Roman Gushchin wrote:
> > On Fri, Jul 01, 2022 at 10:50:40PM -0700, Shakeel Butt wrote:
> > > On Fri, Jul 1, 2022 at 8:35 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > >
> > > > Yafang Shao reported an issue related to the accounting of bpf
> > > > memory: if a bpf map is charged indirectly for memory consumed
> > > > from an interrupt context and allocations are enforced, MEMCG_MAX
> > > > events are not raised.
> > > >
> > > > It's not/less of an issue in a generic case because consequent
> > > > allocations from a process context will trigger the reclaim and
> > > > MEMCG_MAX events. However a bpf map can belong to a dying/abandoned
> > > > memory cgroup, so it might never happen.
> > > 
> > > The patch looks good but the above sentence is confusing. What might
> > > never happen? Reclaim or MAX event on dying memcg?
> > 
> > Direct reclaim and MAX events. I agree it might be not clear without
> > looking into the code. How about something like this?
> > 
> > "It's not/less of an issue in a generic case because consequent
> > allocations from a process context will trigger the direct reclaim
> > and MEMCG_MAX events will be raised. However a bpf map can belong
> > to a dying/abandoned memory cgroup, so there will be no allocations
> > from a process context and no MEMCG_MAX events will be triggered."
> 
> Could you expand little bit more on the situation? Can those charges to
> offline memcg happen indefinetely?

Yes.

> How can it ever go away then?

Bpf map should be deleted by a user first.

> Also is this something that we actually want to encourage?

Not really. We can implement reparenting (probably objcg-based), I think it's
a good idea in general. I can take a look, but can't promise it will be fast.

In theory we could forbid deleting cgroups with associated bpf maps, but I
don't think it's a good idea.

> In other words shouldn't those remote charges be redirected when the
> target memcg is offline?

Reparenting is the best answer I have.

Thanks!
Roman Gushchin July 5, 2022, 8:51 p.m. UTC | #9
On Mon, Jul 04, 2022 at 05:30:25PM +0200, Michal Hocko wrote:
> On Mon 04-07-22 17:07:32, Michal Hocko wrote:
> > On Sat 02-07-22 08:39:14, Roman Gushchin wrote:
> > > On Fri, Jul 01, 2022 at 10:50:40PM -0700, Shakeel Butt wrote:
> > > > On Fri, Jul 1, 2022 at 8:35 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > >
> > > > > Yafang Shao reported an issue related to the accounting of bpf
> > > > > memory: if a bpf map is charged indirectly for memory consumed
> > > > > from an interrupt context and allocations are enforced, MEMCG_MAX
> > > > > events are not raised.
> > > > >
> > > > > It's not/less of an issue in a generic case because consequent
> > > > > allocations from a process context will trigger the reclaim and
> > > > > MEMCG_MAX events. However a bpf map can belong to a dying/abandoned
> > > > > memory cgroup, so it might never happen.
> > > > 
> > > > The patch looks good but the above sentence is confusing. What might
> > > > never happen? Reclaim or MAX event on dying memcg?
> > > 
> > > Direct reclaim and MAX events. I agree it might be not clear without
> > > looking into the code. How about something like this?
> > > 
> > > "It's not/less of an issue in a generic case because consequent
> > > allocations from a process context will trigger the direct reclaim
> > > and MEMCG_MAX events will be raised. However a bpf map can belong
> > > to a dying/abandoned memory cgroup, so there will be no allocations
> > > from a process context and no MEMCG_MAX events will be triggered."
> > 
> > Could you expand little bit more on the situation? Can those charges to
> > offline memcg happen indefinetely? How can it ever go away then? Also is
> > this something that we actually want to encourage?
> 
> One more question. Mostly out of curiosity. How is userspace actually
> acting on those events? Are watchers still active on those dead memcgs?

Idk, the whole problem was reported by Yafang, so he probably has a better
answer. But in general events are recursive, and the cgroup doesn't have
to be dying: it can be simply abandoned.
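The recursive nature of the events can be sketched as follows (illustrative Python with a simple parent-pointer hierarchy, not kernel code): raising an event on a memcg also bumps the counter on every ancestor, so a watcher on a live parent still observes events from a dying or abandoned child.

```python
# Sketch of hierarchical event accounting: memory.events is recursive,
# memory.events.local counts only events on the cgroup itself.

class Memcg:
    def __init__(self, parent=None):
        self.parent = parent
        self.events = {"max": 0}        # memory.events (recursive)
        self.events_local = {"max": 0}  # memory.events.local

def memcg_memory_event(memcg, event):
    memcg.events_local[event] += 1  # local counter: this cgroup only
    while memcg is not None:        # recursive counter: walk the ancestors
        memcg.events[event] += 1
        memcg = memcg.parent

root = Memcg()
parent = Memcg(root)
abandoned = Memcg(parent)  # e.g. an abandoned memcg still holding a bpf map

memcg_memory_event(abandoned, "max")
# every ancestor's recursive counter is bumped; only the abandoned
# cgroup's local counter changes
```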

Thanks!
Roman Gushchin July 5, 2022, 8:55 p.m. UTC | #10
On Mon, Jul 04, 2022 at 05:12:54PM +0200, Michal Hocko wrote:
> On Fri 01-07-22 20:35:21, Roman Gushchin wrote:
> > Yafang Shao reported an issue related to the accounting of bpf
> > memory: if a bpf map is charged indirectly for memory consumed
> > from an interrupt context and allocations are enforced, MEMCG_MAX
> > events are not raised.
> 
> So I guess this will be a GFP_ATOMIC request failing due to the hard
> limit, right? I think it would be easier to understand if the specific
> allocation request type was mentioned.

It all started from the discussion here:
https://www.spinics.net/lists/linux-mm/msg302319.html

Please, take a look.

> 
> > It's not/less of an issue in a generic case because consequent
> > allocations from a process context will trigger the reclaim and
> > MEMCG_MAX events. However a bpf map can belong to a dying/abandoned
> > memory cgroup, so it might never happen. So the cgroup can
> > significantly exceed the memory.max limit without even triggering
> > MEMCG_MAX events.
> 
> More on that in other reply.
> 
> > Fix this by making sure that we never enforce allocations without
> > raising a MEMCG_MAX event.
> > 
> > Reported-by: Yafang Shao <laoar.shao@gmail.com>
> > Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: Shakeel Butt <shakeelb@google.com>
> > Cc: Muchun Song <songmuchun@bytedance.com>
> > Cc: cgroups@vger.kernel.org
> > Cc: linux-mm@kvack.org
> > Cc: bpf@vger.kernel.org
> 
> The patch makes sense to me though even without the weird charge to a
> dead memcg aspect. It is true that a very calm memcg can trigger the
> even much later after a GFP_ATOMIC charge (or __GFP_HIGH in general)
> fails.

Good point!

> 
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!
Yafang Shao July 6, 2022, 2:40 a.m. UTC | #11
On Wed, Jul 6, 2022 at 4:52 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Mon, Jul 04, 2022 at 05:30:25PM +0200, Michal Hocko wrote:
> > On Mon 04-07-22 17:07:32, Michal Hocko wrote:
> > > On Sat 02-07-22 08:39:14, Roman Gushchin wrote:
> > > > On Fri, Jul 01, 2022 at 10:50:40PM -0700, Shakeel Butt wrote:
> > > > > On Fri, Jul 1, 2022 at 8:35 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > > >
> > > > > > Yafang Shao reported an issue related to the accounting of bpf
> > > > > > memory: if a bpf map is charged indirectly for memory consumed
> > > > > > from an interrupt context and allocations are enforced, MEMCG_MAX
> > > > > > events are not raised.
> > > > > >
> > > > > > It's not/less of an issue in a generic case because consequent
> > > > > > allocations from a process context will trigger the reclaim and
> > > > > > MEMCG_MAX events. However a bpf map can belong to a dying/abandoned
> > > > > > memory cgroup, so it might never happen.
> > > > >
> > > > > The patch looks good but the above sentence is confusing. What might
> > > > > never happen? Reclaim or MAX event on dying memcg?
> > > >
> > > > Direct reclaim and MAX events. I agree it might be not clear without
> > > > looking into the code. How about something like this?
> > > >
> > > > "It's not/less of an issue in a generic case because consequent
> > > > allocations from a process context will trigger the direct reclaim
> > > > and MEMCG_MAX events will be raised. However a bpf map can belong
> > > > to a dying/abandoned memory cgroup, so there will be no allocations
> > > > from a process context and no MEMCG_MAX events will be triggered."
> > >
> > > Could you expand little bit more on the situation? Can those charges to
> > > offline memcg happen indefinetely? How can it ever go away then? Also is
> > > this something that we actually want to encourage?
> >
> > One more question. Mostly out of curiosity. How is userspace actually
> > acting on those events? Are watchers still active on those dead memcgs?
>
> Idk, the whole problem was reported by Yafang, so he probably has a better
> answer. But in general events are recursive and the cgroup doesn't have
> to be dying, it can be simple abandoned.
>

Regarding pinned bpf programs: they can run without a user agent.
That means the cgroup may not be dead, just not populated.
(But in our case, the cgroup will be deleted after the user agent exits.)
Yafang Shao July 6, 2022, 2:46 a.m. UTC | #12
On Wed, Jul 6, 2022 at 4:49 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Mon, Jul 04, 2022 at 05:07:30PM +0200, Michal Hocko wrote:
> > On Sat 02-07-22 08:39:14, Roman Gushchin wrote:
> > > On Fri, Jul 01, 2022 at 10:50:40PM -0700, Shakeel Butt wrote:
> > > > On Fri, Jul 1, 2022 at 8:35 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > >
> > > > > Yafang Shao reported an issue related to the accounting of bpf
> > > > > memory: if a bpf map is charged indirectly for memory consumed
> > > > > from an interrupt context and allocations are enforced, MEMCG_MAX
> > > > > events are not raised.
> > > > >
> > > > > It's not/less of an issue in a generic case because consequent
> > > > > allocations from a process context will trigger the reclaim and
> > > > > MEMCG_MAX events. However a bpf map can belong to a dying/abandoned
> > > > > memory cgroup, so it might never happen.
> > > >
> > > > The patch looks good but the above sentence is confusing. What might
> > > > never happen? Reclaim or MAX event on dying memcg?
> > >
> > > Direct reclaim and MAX events. I agree it might be not clear without
> > > looking into the code. How about something like this?
> > >
> > > "It's not/less of an issue in a generic case because consequent
> > > allocations from a process context will trigger the direct reclaim
> > > and MEMCG_MAX events will be raised. However a bpf map can belong
> > > to a dying/abandoned memory cgroup, so there will be no allocations
> > > from a process context and no MEMCG_MAX events will be triggered."
> >
> > Could you expand little bit more on the situation? Can those charges to
> > offline memcg happen indefinetely?
>
> Yes.
>
> > How can it ever go away then?
>
> Bpf map should be deleted by a user first.
>

That can't apply to pinned bpf maps, because the user expects the bpf
maps to continue working after the user agent exits.

> > Also is this something that we actually want to encourage?
>
> Not really. We can implement reparenting (probably objcg-based), I think it's
> a good idea in general. I can take a look, but can't promise it will be fast.
>
> In thory we can't forbid deleting cgroups with associated bpf maps, but I don't
> thinks it's a good idea.
>

Agreed. It is not a good idea.

> > In other words shouldn't those remote charges be redirected when the
> > target memcg is offline?
>
> Reparenting is the best answer I have.
>

Given that it increases the complexity of deployment, that may not
be a good idea either.
Roman Gushchin July 6, 2022, 3:28 a.m. UTC | #13
On Wed, Jul 06, 2022 at 10:46:48AM +0800, Yafang Shao wrote:
> On Wed, Jul 6, 2022 at 4:49 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Mon, Jul 04, 2022 at 05:07:30PM +0200, Michal Hocko wrote:
> > > On Sat 02-07-22 08:39:14, Roman Gushchin wrote:
> > > > On Fri, Jul 01, 2022 at 10:50:40PM -0700, Shakeel Butt wrote:
> > > > > On Fri, Jul 1, 2022 at 8:35 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > > >
> > > > > > Yafang Shao reported an issue related to the accounting of bpf
> > > > > > memory: if a bpf map is charged indirectly for memory consumed
> > > > > > from an interrupt context and allocations are enforced, MEMCG_MAX
> > > > > > events are not raised.
> > > > > >
> > > > > > It's not/less of an issue in a generic case because consequent
> > > > > > allocations from a process context will trigger the reclaim and
> > > > > > MEMCG_MAX events. However a bpf map can belong to a dying/abandoned
> > > > > > memory cgroup, so it might never happen.
> > > > >
> > > > > The patch looks good but the above sentence is confusing. What might
> > > > > never happen? Reclaim or MAX event on dying memcg?
> > > >
> > > > Direct reclaim and MAX events. I agree it might be not clear without
> > > > looking into the code. How about something like this?
> > > >
> > > > "It's not/less of an issue in a generic case because consequent
> > > > allocations from a process context will trigger the direct reclaim
> > > > and MEMCG_MAX events will be raised. However a bpf map can belong
> > > > to a dying/abandoned memory cgroup, so there will be no allocations
> > > > from a process context and no MEMCG_MAX events will be triggered."
> > >
> > > Could you expand little bit more on the situation? Can those charges to
> > > offline memcg happen indefinetely?
> >
> > Yes.
> >
> > > How can it ever go away then?
> >
> > Bpf map should be deleted by a user first.
> >
> 
> It can't apply to pinned bpf maps, because the user expects the bpf
> maps to continue working after the user agent exits.
> 
> > > Also is this something that we actually want to encourage?
> >
> > Not really. We can implement reparenting (probably objcg-based), I think it's
> > a good idea in general. I can take a look, but can't promise it will be fast.
> >
> > In thory we can't forbid deleting cgroups with associated bpf maps, but I don't
> > thinks it's a good idea.
> >
> 
> Agreed. It is not a good idea.
> 
> > > In other words shouldn't those remote charges be redirected when the
> > > target memcg is offline?
> >
> > Reparenting is the best answer I have.
> >
> 
> At the cost of increasing the complexity of deployment, that may not
> be a good idea neither.

What do you mean? Can you please elaborate on it?

Thanks!
Yafang Shao July 6, 2022, 3:42 a.m. UTC | #14
On Wed, Jul 6, 2022 at 11:28 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Wed, Jul 06, 2022 at 10:46:48AM +0800, Yafang Shao wrote:
> > On Wed, Jul 6, 2022 at 4:49 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >
> > > On Mon, Jul 04, 2022 at 05:07:30PM +0200, Michal Hocko wrote:
> > > > On Sat 02-07-22 08:39:14, Roman Gushchin wrote:
> > > > > On Fri, Jul 01, 2022 at 10:50:40PM -0700, Shakeel Butt wrote:
> > > > > > On Fri, Jul 1, 2022 at 8:35 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > > > >
> > > > > > > Yafang Shao reported an issue related to the accounting of bpf
> > > > > > > memory: if a bpf map is charged indirectly for memory consumed
> > > > > > > from an interrupt context and allocations are enforced, MEMCG_MAX
> > > > > > > events are not raised.
> > > > > > >
> > > > > > > It's not/less of an issue in a generic case because consequent
> > > > > > > allocations from a process context will trigger the reclaim and
> > > > > > > MEMCG_MAX events. However a bpf map can belong to a dying/abandoned
> > > > > > > memory cgroup, so it might never happen.
> > > > > >
> > > > > > The patch looks good but the above sentence is confusing. What might
> > > > > > never happen? Reclaim or MAX event on dying memcg?
> > > > >
> > > > > Direct reclaim and MAX events. I agree it might be not clear without
> > > > > looking into the code. How about something like this?
> > > > >
> > > > > "It's not/less of an issue in a generic case because consequent
> > > > > allocations from a process context will trigger the direct reclaim
> > > > > and MEMCG_MAX events will be raised. However a bpf map can belong
> > > > > to a dying/abandoned memory cgroup, so there will be no allocations
> > > > > from a process context and no MEMCG_MAX events will be triggered."
> > > >
> > > > Could you expand little bit more on the situation? Can those charges to
> > > > offline memcg happen indefinetely?
> > >
> > > Yes.
> > >
> > > > How can it ever go away then?
> > >
> > > Bpf map should be deleted by a user first.
> > >
> >
> > It can't apply to pinned bpf maps, because the user expects the bpf
> > maps to continue working after the user agent exits.
> >
> > > > Also is this something that we actually want to encourage?
> > >
> > > Not really. We can implement reparenting (probably objcg-based), I think it's
> > > a good idea in general. I can take a look, but can't promise it will be fast.
> > >
> > > In thory we can't forbid deleting cgroups with associated bpf maps, but I don't
> > > thinks it's a good idea.
> > >
> >
> > Agreed. It is not a good idea.
> >
> > > > In other words shouldn't those remote charges be redirected when the
> > > > target memcg is offline?
> > >
> > > Reparenting is the best answer I have.
> > >
> >
> > At the cost of increasing the complexity of deployment, that may not
> > be a good idea neither.
>
> What do you mean? Can you please elaborate on it?
>

                   parent memcg
                         |
                    bpf memcg   <- limit the memory size of bpf
programs
                        /           \
         bpf user agent     pinned bpf program

After the bpf user agent exits, the bpf memcg will be dead, and then all
of its memory will be reparented.
That is okay for preallocated bpf maps, but not okay for
non-preallocated bpf maps: those maps will continue to charge, but
since all of their memory and objcg are reparented, we have to limit
the bpf memory size in the parent as follows,

                   parent memcg   <- limit the memory size of bpf programs
                         |
                    bpf memcg
                        /           \
         bpf user agent     pinned bpf program

That means the parent memcg can't be deleted and can only contain one bpf memcg.
That may work if we use systemd to manage the memcgs, but it will be a
problem if we use k8s.
Roman Gushchin July 6, 2022, 3:56 a.m. UTC | #15
On Wed, Jul 06, 2022 at 11:42:50AM +0800, Yafang Shao wrote:
> On Wed, Jul 6, 2022 at 11:28 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Wed, Jul 06, 2022 at 10:46:48AM +0800, Yafang Shao wrote:
> > > On Wed, Jul 6, 2022 at 4:49 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > >
> > > > On Mon, Jul 04, 2022 at 05:07:30PM +0200, Michal Hocko wrote:
> > > > > On Sat 02-07-22 08:39:14, Roman Gushchin wrote:
> > > > > > On Fri, Jul 01, 2022 at 10:50:40PM -0700, Shakeel Butt wrote:
> > > > > > > On Fri, Jul 1, 2022 at 8:35 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > > > > >
> > > > > > > > Yafang Shao reported an issue related to the accounting of bpf
> > > > > > > > memory: if a bpf map is charged indirectly for memory consumed
> > > > > > > > from an interrupt context and allocations are enforced, MEMCG_MAX
> > > > > > > > events are not raised.
> > > > > > > >
> > > > > > > > It's not/less of an issue in a generic case because consequent
> > > > > > > > allocations from a process context will trigger the reclaim and
> > > > > > > > MEMCG_MAX events. However a bpf map can belong to a dying/abandoned
> > > > > > > > memory cgroup, so it might never happen.
> > > > > > >
> > > > > > > The patch looks good but the above sentence is confusing. What might
> > > > > > > never happen? Reclaim or MAX event on dying memcg?
> > > > > >
> > > > > > Direct reclaim and MAX events. I agree it might be not clear without
> > > > > > looking into the code. How about something like this?
> > > > > >
> > > > > > "It's not/less of an issue in a generic case because consequent
> > > > > > allocations from a process context will trigger the direct reclaim
> > > > > > and MEMCG_MAX events will be raised. However a bpf map can belong
> > > > > > to a dying/abandoned memory cgroup, so there will be no allocations
> > > > > > from a process context and no MEMCG_MAX events will be triggered."
> > > > >
> > > > > Could you expand little bit more on the situation? Can those charges to
> > > > > offline memcg happen indefinetely?
> > > >
> > > > Yes.
> > > >
> > > > > How can it ever go away then?
> > > >
> > > > Bpf map should be deleted by a user first.
> > > >
> > >
> > > It can't apply to pinned bpf maps, because the user expects the bpf
> > > maps to continue working after the user agent exits.
> > >
> > > > > Also is this something that we actually want to encourage?
> > > >
> > > > Not really. We can implement reparenting (probably objcg-based), I think it's
> > > > a good idea in general. I can take a look, but can't promise it will be fast.
> > > >
> > > > In thory we can't forbid deleting cgroups with associated bpf maps, but I don't
> > > > thinks it's a good idea.
> > > >
> > >
> > > Agreed. It is not a good idea.
> > >
> > > > > In other words shouldn't those remote charges be redirected when the
> > > > > target memcg is offline?
> > > >
> > > > Reparenting is the best answer I have.
> > > >
> > >
> > > At the cost of increasing the complexity of deployment, that may not
> > > be a good idea neither.
> >
> > What do you mean? Can you please elaborate on it?
> >
> 
>                    parent memcg
>                          |
>                     bpf memcg   <- limit the memory size of bpf
> programs
>                         /           \
>          bpf user agent     pinned bpf program
> 
> After bpf user agents exit, the bpf memcg will be dead, and then all
> its memory will be reparented.
> That is okay for preallocated bpf maps, but not okay for
> non-preallocated bpf maps.
> Because the bpf maps will continue to charge, but as all its memory
> and objcg are reparented, so we have to limit the bpf memory size in
> the parent as follows,

So you're relying on the memory limit of a dying cgroup?
Sorry, but I don't think we can seriously discuss such a design.
A dying cgroup is invisible to a user: the user can't change any tunables
and has zero visibility into any stats or charges. Why would you do this?

If you want the cgroup to be an active part of the memory management
process, don't delete it. There are exactly zero guarantees about what
happens with a memory cgroup after being deleted by a user, it's all
implementation details.

Anyway, here is the patch for reparenting bpf maps:
https://github.com/rgushchin/linux/commit/f57df8bb35770507a4624fe52216b6c14f39c50c

I'm going to post it to bpf@ after some testing.

Thanks!
Yafang Shao July 6, 2022, 4:02 a.m. UTC | #16
On Wed, Jul 6, 2022 at 11:56 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Wed, Jul 06, 2022 at 11:42:50AM +0800, Yafang Shao wrote:
> > On Wed, Jul 6, 2022 at 11:28 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >
> > > On Wed, Jul 06, 2022 at 10:46:48AM +0800, Yafang Shao wrote:
> > > > On Wed, Jul 6, 2022 at 4:49 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > >
> > > > > On Mon, Jul 04, 2022 at 05:07:30PM +0200, Michal Hocko wrote:
> > > > > > On Sat 02-07-22 08:39:14, Roman Gushchin wrote:
> > > > > > > On Fri, Jul 01, 2022 at 10:50:40PM -0700, Shakeel Butt wrote:
> > > > > > > > On Fri, Jul 1, 2022 at 8:35 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > > > > > >
> > > > > > > > > Yafang Shao reported an issue related to the accounting of bpf
> > > > > > > > > memory: if a bpf map is charged indirectly for memory consumed
> > > > > > > > > from an interrupt context and allocations are enforced, MEMCG_MAX
> > > > > > > > > events are not raised.
> > > > > > > > >
> > > > > > > > > It's not/less of an issue in a generic case because consequent
> > > > > > > > > allocations from a process context will trigger the reclaim and
> > > > > > > > > MEMCG_MAX events. However a bpf map can belong to a dying/abandoned
> > > > > > > > > memory cgroup, so it might never happen.
> > > > > > > >
> > > > > > > > The patch looks good but the above sentence is confusing. What might
> > > > > > > > never happen? Reclaim or MAX event on dying memcg?
> > > > > > >
> > > > > > > Direct reclaim and MAX events. I agree it might be not clear without
> > > > > > > looking into the code. How about something like this?
> > > > > > >
> > > > > > > "It's not/less of an issue in a generic case because consequent
> > > > > > > allocations from a process context will trigger the direct reclaim
> > > > > > > and MEMCG_MAX events will be raised. However a bpf map can belong
> > > > > > > to a dying/abandoned memory cgroup, so there will be no allocations
> > > > > > > from a process context and no MEMCG_MAX events will be triggered."
> > > > > >
> > > > > > Could you expand little bit more on the situation? Can those charges to
> > > > > > offline memcg happen indefinetely?
> > > > >
> > > > > Yes.
> > > > >
> > > > > > How can it ever go away then?
> > > > >
> > > > > Bpf map should be deleted by a user first.
> > > > >
> > > >
> > > > It can't apply to pinned bpf maps, because the user expects the bpf
> > > > maps to continue working after the user agent exits.
> > > >
> > > > > > Also is this something that we actually want to encourage?
> > > > >
> > > > > Not really. We can implement reparenting (probably objcg-based), I think it's
> > > > > a good idea in general. I can take a look, but can't promise it will be fast.
> > > > >
> > > > > In thory we can't forbid deleting cgroups with associated bpf maps, but I don't
> > > > > thinks it's a good idea.
> > > > >
> > > >
> > > > Agreed. It is not a good idea.
> > > >
> > > > > > In other words shouldn't those remote charges be redirected when the
> > > > > > target memcg is offline?
> > > > >
> > > > > Reparenting is the best answer I have.
> > > > >
> > > >
> > > > At the cost of increasing the complexity of deployment, that may not
> > > > be a good idea neither.
> > >
> > > What do you mean? Can you please elaborate on it?
> > >
> >
> >                    parent memcg
> >                          |
> >                     bpf memcg   <- limit the memory size of bpf
> > programs
> >                         /           \
> >          bpf user agent     pinned bpf program
> >
> > After bpf user agents exit, the bpf memcg will be dead, and then all
> > its memory will be reparented.
> > That is okay for preallocated bpf maps, but not okay for
> > non-preallocated bpf maps.
> > Because the bpf maps will continue to charge, but as all its memory
> > and objcg are reparented, so we have to limit the bpf memory size in
> > the parent as follows,
>
> So you're relying on the memory limit of a dying cgroup?

No, I didn't say that. What I said is that you can't use a dying cgroup
to limit it; that's why I said we have to use the parent memcg to limit
it.

> Sorry, but I don't think we can seriously discuss such a design.
> A dying cgroup is invisible for a user, a user can't change any tunables,
> they have zero visibility into any stats or charges. Why would you do this?
>
> If you want the cgroup to be an active part of the memory management
> process, don't delete it. There are exactly zero guarantees about what
> happens with a memory cgroup after being deleted by a user, it's all
> implementation details.
>
> Anyway, here is the patch for reparenting bpf maps:
> https://github.com/rgushchin/linux/commit/f57df8bb35770507a4624fe52216b6c14f39c50c
>
> I gonna post it to bpf@ after some testing.
>

I will take a look at it.
But AFAIK the reparenting can't resolve the problem of non-preallocated maps.
Roman Gushchin July 6, 2022, 4:19 a.m. UTC | #17
On Wed, Jul 06, 2022 at 12:02:49PM +0800, Yafang Shao wrote:
> On Wed, Jul 6, 2022 at 11:56 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Wed, Jul 06, 2022 at 11:42:50AM +0800, Yafang Shao wrote:
> > > On Wed, Jul 6, 2022 at 11:28 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > >
> > > > On Wed, Jul 06, 2022 at 10:46:48AM +0800, Yafang Shao wrote:
> > > > > On Wed, Jul 6, 2022 at 4:49 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > > >
> > > > > > On Mon, Jul 04, 2022 at 05:07:30PM +0200, Michal Hocko wrote:
> > > > > > > On Sat 02-07-22 08:39:14, Roman Gushchin wrote:
> > > > > > > > On Fri, Jul 01, 2022 at 10:50:40PM -0700, Shakeel Butt wrote:
> > > > > > > > > On Fri, Jul 1, 2022 at 8:35 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > > > > > > >
> > > > > > > > > > Yafang Shao reported an issue related to the accounting of bpf
> > > > > > > > > > memory: if a bpf map is charged indirectly for memory consumed
> > > > > > > > > > from an interrupt context and allocations are enforced, MEMCG_MAX
> > > > > > > > > > events are not raised.
> > > > > > > > > >
> > > > > > > > > > It's not/less of an issue in a generic case because consequent
> > > > > > > > > > allocations from a process context will trigger the reclaim and
> > > > > > > > > > MEMCG_MAX events. However a bpf map can belong to a dying/abandoned
> > > > > > > > > > memory cgroup, so it might never happen.
> > > > > > > > >
> > > > > > > > > The patch looks good but the above sentence is confusing. What might
> > > > > > > > > never happen? Reclaim or MAX event on dying memcg?
> > > > > > > >
> > > > > > > > Direct reclaim and MAX events. I agree it might be not clear without
> > > > > > > > looking into the code. How about something like this?
> > > > > > > >
> > > > > > > > "It's not/less of an issue in a generic case because consequent
> > > > > > > > allocations from a process context will trigger the direct reclaim
> > > > > > > > and MEMCG_MAX events will be raised. However a bpf map can belong
> > > > > > > > to a dying/abandoned memory cgroup, so there will be no allocations
> > > > > > > > from a process context and no MEMCG_MAX events will be triggered."
> > > > > > >
> > > > > > > Could you expand little bit more on the situation? Can those charges to
> > > > > > > offline memcg happen indefinetely?
> > > > > >
> > > > > > Yes.
> > > > > >
> > > > > > > How can it ever go away then?
> > > > > >
> > > > > > Bpf map should be deleted by a user first.
> > > > > >
> > > > >
> > > > > It can't apply to pinned bpf maps, because the user expects the bpf
> > > > > maps to continue working after the user agent exits.
> > > > >
> > > > > > > Also is this something that we actually want to encourage?
> > > > > >
> > > > > > Not really. We can implement reparenting (probably objcg-based), I think it's
> > > > > > a good idea in general. I can take a look, but can't promise it will be fast.
> > > > > >
> > > > > > In thory we can't forbid deleting cgroups with associated bpf maps, but I don't
> > > > > > thinks it's a good idea.
> > > > > >
> > > > >
> > > > > Agreed. It is not a good idea.
> > > > >
> > > > > > > In other words shouldn't those remote charges be redirected when the
> > > > > > > target memcg is offline?
> > > > > >
> > > > > > Reparenting is the best answer I have.
> > > > > >
> > > > >
> > > > > At the cost of increasing the complexity of deployment, that may not
> > > > > be a good idea neither.
> > > >
> > > > What do you mean? Can you please elaborate on it?
> > > >
> > >
> > >                    parent memcg
> > >                          |
> > >                     bpf memcg   <- limit the memory size of bpf
> > > programs
> > >                         /           \
> > >          bpf user agent     pinned bpf program
> > >
> > > After bpf user agents exit, the bpf memcg will be dead, and then all
> > > its memory will be reparented.
> > > That is okay for preallocated bpf maps, but not okay for
> > > non-preallocated bpf maps.
> > > Because the bpf maps will continue to charge, but as all its memory
> > > and objcg are reparented, so we have to limit the bpf memory size in
> > > the parent as follows,
> >
> > So you're relying on the memory limit of a dying cgroup?
> 
> No. I didn't say it.  What I said is you can't use a dying cgroup to
> limit it, that's why I said that we have to use parant memcg to limit
> it.
> 
> > Sorry, but I don't think we can seriously discuss such a design.
> > A dying cgroup is invisible for a user, a user can't change any tunables,
> > they have zero visibility into any stats or charges. Why would you do this?
> >
> > If you want the cgroup to be an active part of the memory management
> > process, don't delete it. There are exactly zero guarantees about what
> > happens with a memory cgroup after being deleted by a user, it's all
> > implementation details.
> >
> > Anyway, here is the patch for reparenting bpf maps:
> > https://github.com/rgushchin/linux/commit/f57df8bb35770507a4624fe52216b6c14f39c50c
> >
> > I gonna post it to bpf@ after some testing.
> >
> 
> I will take a look at it.
> But AFAIK the reparenting can't resolve the problem of non-preallocated maps.

Sorry, what's the problem then?

Michal asked how we can prevent an indefinite pinning of a dying memcg by an associated
bpf map being used by other processes, and I guess the objcg-based reparenting is
the best answer here. You said it will complicate the deployment? What does it mean?

From a user's POV there is no visible difference. What am I missing here?
Yes, if we reparent the bpf map, memory.max of the original memory cgroup will
not apply, but as I said, if you want it to be effective, don't delete the cgroup.

Thanks!
Yafang Shao July 6, 2022, 4:33 a.m. UTC | #18
On Wed, Jul 6, 2022 at 12:19 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Wed, Jul 06, 2022 at 12:02:49PM +0800, Yafang Shao wrote:
> > On Wed, Jul 6, 2022 at 11:56 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >
> > > On Wed, Jul 06, 2022 at 11:42:50AM +0800, Yafang Shao wrote:
> > > > On Wed, Jul 6, 2022 at 11:28 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > >
> > > > > On Wed, Jul 06, 2022 at 10:46:48AM +0800, Yafang Shao wrote:
> > > > > > On Wed, Jul 6, 2022 at 4:49 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > > > >
> > > > > > > On Mon, Jul 04, 2022 at 05:07:30PM +0200, Michal Hocko wrote:
> > > > > > > > On Sat 02-07-22 08:39:14, Roman Gushchin wrote:
> > > > > > > > > On Fri, Jul 01, 2022 at 10:50:40PM -0700, Shakeel Butt wrote:
> > > > > > > > > > On Fri, Jul 1, 2022 at 8:35 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Yafang Shao reported an issue related to the accounting of bpf
> > > > > > > > > > > memory: if a bpf map is charged indirectly for memory consumed
> > > > > > > > > > > from an interrupt context and allocations are enforced, MEMCG_MAX
> > > > > > > > > > > events are not raised.
> > > > > > > > > > >
> > > > > > > > > > > It's not/less of an issue in a generic case because consequent
> > > > > > > > > > > allocations from a process context will trigger the reclaim and
> > > > > > > > > > > MEMCG_MAX events. However a bpf map can belong to a dying/abandoned
> > > > > > > > > > > memory cgroup, so it might never happen.
> > > > > > > > > >
> > > > > > > > > > The patch looks good but the above sentence is confusing. What might
> > > > > > > > > > never happen? Reclaim or MAX event on dying memcg?
> > > > > > > > >
> > > > > > > > > Direct reclaim and MAX events. I agree it might be not clear without
> > > > > > > > > looking into the code. How about something like this?
> > > > > > > > >
> > > > > > > > > "It's not/less of an issue in a generic case because consequent
> > > > > > > > > allocations from a process context will trigger the direct reclaim
> > > > > > > > > and MEMCG_MAX events will be raised. However a bpf map can belong
> > > > > > > > > to a dying/abandoned memory cgroup, so there will be no allocations
> > > > > > > > > from a process context and no MEMCG_MAX events will be triggered."
> > > > > > > >
> > > > > > > > Could you expand little bit more on the situation? Can those charges to
> > > > > > > > offline memcg happen indefinetely?
> > > > > > >
> > > > > > > Yes.
> > > > > > >
> > > > > > > > How can it ever go away then?
> > > > > > >
> > > > > > > Bpf map should be deleted by a user first.
> > > > > > >
> > > > > >
> > > > > > It can't apply to pinned bpf maps, because the user expects the bpf
> > > > > > maps to continue working after the user agent exits.
> > > > > >
> > > > > > > > Also is this something that we actually want to encourage?
> > > > > > >
> > > > > > > Not really. We can implement reparenting (probably objcg-based), I think it's
> > > > > > > a good idea in general. I can take a look, but can't promise it will be fast.
> > > > > > >
> > > > > > > In thory we can't forbid deleting cgroups with associated bpf maps, but I don't
> > > > > > > thinks it's a good idea.
> > > > > > >
> > > > > >
> > > > > > Agreed. It is not a good idea.
> > > > > >
> > > > > > > > In other words shouldn't those remote charges be redirected when the
> > > > > > > > target memcg is offline?
> > > > > > >
> > > > > > > Reparenting is the best answer I have.
> > > > > > >
> > > > > >
> > > > > > At the cost of increasing the complexity of deployment, that may not
> > > > > > be a good idea neither.
> > > > >
> > > > > What do you mean? Can you please elaborate on it?
> > > > >
> > > >
> > > >                    parent memcg
> > > >                          |
> > > >                     bpf memcg   <- limit the memory size of bpf
> > > > programs
> > > >                         /           \
> > > >          bpf user agent     pinned bpf program
> > > >
> > > > After bpf user agents exit, the bpf memcg will be dead, and then all
> > > > its memory will be reparented.
> > > > That is okay for preallocated bpf maps, but not okay for
> > > > non-preallocated bpf maps.
> > > > Because the bpf maps will continue to charge, but as all its memory
> > > > and objcg are reparented, so we have to limit the bpf memory size in
> > > > the parent as follows,
> > >
> > > So you're relying on the memory limit of a dying cgroup?
> >
> > No. I didn't say it.  What I said is you can't use a dying cgroup to
> > limit it, that's why I said that we have to use parant memcg to limit
> > it.
> >
> > > Sorry, but I don't think we can seriously discuss such a design.
> > > A dying cgroup is invisible for a user, a user can't change any tunables,
> > > they have zero visibility into any stats or charges. Why would you do this?
> > >
> > > If you want the cgroup to be an active part of the memory management
> > > process, don't delete it. There are exactly zero guarantees about what
> > > happens with a memory cgroup after being deleted by a user, it's all
> > > implementation details.
> > >
> > > Anyway, here is the patch for reparenting bpf maps:
> > > https://github.com/rgushchin/linux/commit/f57df8bb35770507a4624fe52216b6c14f39c50c
> > >
> > > I gonna post it to bpf@ after some testing.
> > >
> >
> > I will take a look at it.
> > But AFAIK the reparenting can't resolve the problem of non-preallocated maps.
>
> Sorry, what's the problem then?
>

The problem is that the bpf memcg or its parent memcg can't be destroyed
currently. IOW, you have to forbid the user from running rmdir.

Reparenting is an improvement for preallocated bpf maps: all of their
memory is already charged, so the memcg is not useful anymore.
It can therefore be destroyed, and thus the reparenting is an improvement.

But for a non-preallocated bpf map, the memcg still has to do the
limiting work, which means it can't be destroyed currently.
If you reparent it, then the parent can't be destroyed. So why not
forbid destroying the bpf memcg in the first place?
The reparenting just increases the complexity for this case.

> Michal asked how we can prevent an indefinite pinning of a dying memcg by an associated
> bpf map being used by other processes, and I guess the objcg-based reparenting is
> the best answer here. You said it will complicate the deployment? What does it mean?
>

See my reply above.

> From a user's POV there is no visible difference. What am I missing here?
> Yes, if we reparent the bpf map, memory.max of the original memory cgroup will
> not apply, but as I said, if you want it to be effective, don't delete the cgroup.
>
Michal Hocko July 7, 2022, 7:47 a.m. UTC | #19
On Wed 06-07-22 10:40:48, Yafang Shao wrote:
> On Wed, Jul 6, 2022 at 4:52 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Mon, Jul 04, 2022 at 05:30:25PM +0200, Michal Hocko wrote:
> > > On Mon 04-07-22 17:07:32, Michal Hocko wrote:
> > > > On Sat 02-07-22 08:39:14, Roman Gushchin wrote:
> > > > > On Fri, Jul 01, 2022 at 10:50:40PM -0700, Shakeel Butt wrote:
> > > > > > On Fri, Jul 1, 2022 at 8:35 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > > > >
> > > > > > > Yafang Shao reported an issue related to the accounting of bpf
> > > > > > > memory: if a bpf map is charged indirectly for memory consumed
> > > > > > > from an interrupt context and allocations are enforced, MEMCG_MAX
> > > > > > > events are not raised.
> > > > > > >
> > > > > > > It's not/less of an issue in a generic case because consequent
> > > > > > > allocations from a process context will trigger the reclaim and
> > > > > > > MEMCG_MAX events. However a bpf map can belong to a dying/abandoned
> > > > > > > memory cgroup, so it might never happen.
> > > > > >
> > > > > > The patch looks good but the above sentence is confusing. What might
> > > > > > never happen? Reclaim or MAX event on dying memcg?
> > > > >
> > > > > Direct reclaim and MAX events. I agree it might be not clear without
> > > > > looking into the code. How about something like this?
> > > > >
> > > > > "It's not/less of an issue in a generic case because consequent
> > > > > allocations from a process context will trigger the direct reclaim
> > > > > and MEMCG_MAX events will be raised. However a bpf map can belong
> > > > > to a dying/abandoned memory cgroup, so there will be no allocations
> > > > > from a process context and no MEMCG_MAX events will be triggered."
> > > >
> > > > Could you expand little bit more on the situation? Can those charges to
> > > > offline memcg happen indefinetely? How can it ever go away then? Also is
> > > > this something that we actually want to encourage?
> > >
> > > One more question. Mostly out of curiosity. How is userspace actually
> > > acting on those events? Are watchers still active on those dead memcgs?
> >
> > Idk, the whole problem was reported by Yafang, so he probably has a better
> > answer. But in general events are recursive and the cgroup doesn't have
> > to be dying, it can be simply abandoned.
> >
> 
> Regarding pinned bpf programs, they can run without a user agent.
> That means the cgroup may not be dead, but just not populated.
> (In our case, though, the cgroup will be deleted after the user agent exits.)

OK, that makes sense.
Alexei Starovoitov July 7, 2022, 10:41 p.m. UTC | #20
On Tue, Jul 5, 2022 at 9:24 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Anyway, here is the patch for reparenting bpf maps:
> https://github.com/rgushchin/linux/commit/f57df8bb35770507a4624fe52216b6c14f39c50c
>
> I gonna post it to bpf@ after some testing.

Please do. It looks good.
It needs #ifdef CONFIG_MEMCG_KMEM
because get_obj_cgroup_from_current() is undefined otherwise.
Ideally just add a static inline to a .h?

and
if (map->objcg)
   memcg = get_mem_cgroup_from_objcg(map->objcg);

or a !NULL check inside get_mem_cgroup_from_objcg(),
which would be better.
Roman Gushchin July 8, 2022, 3:18 a.m. UTC | #21
On Thu, Jul 07, 2022 at 03:41:11PM -0700, Alexei Starovoitov wrote:
> On Tue, Jul 5, 2022 at 9:24 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > Anyway, here is the patch for reparenting bpf maps:
> > https://github.com/rgushchin/linux/commit/f57df8bb35770507a4624fe52216b6c14f39c50c
> >
> > I gonna post it to bpf@ after some testing.
> 
> Please do. It looks good.
> It needs #ifdef CONFIG_MEMCG_KMEM
> because get_obj_cgroup_from_current() is undefined otherwise.
> Ideally just adding a static inline to a .h ?

Actually all call sites are already under CONFIG_MEMCG_KMEM.

> 
> and
> if (map->objcg)
>    memcg = get_mem_cgroup_from_objcg(map->objcg);
> 
> or !NULL check inside get_mem_cgroup_from_objcg()
> which would be better.

Yes, you're right; for now we need to handle it specially.

In the near future it won't be necessary. There are patches in
mm-unstable which make objcg API useful outside of CONFIG_MEMCG_KMEM.
In particular, this means that an objcg will be created for the root_mem_cgroup,
so map->objcg will always point at a valid objcg and we will be able
to drop this check.

Will post an updated version shortly.

Thanks!
diff mbox series

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 655c09393ad5..eb383695659a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2577,6 +2577,7 @@  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	bool passed_oom = false;
 	bool may_swap = true;
 	bool drained = false;
+	bool raised_max_event = false;
 	unsigned long pflags;
 
 retry:
@@ -2616,6 +2617,7 @@  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		goto nomem;
 
 	memcg_memory_event(mem_over_limit, MEMCG_MAX);
+	raised_max_event = true;
 
 	psi_memstall_enter(&pflags);
 	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
@@ -2682,6 +2684,13 @@  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (!(gfp_mask & (__GFP_NOFAIL | __GFP_HIGH)))
 		return -ENOMEM;
 force:
+	/*
+	 * If the allocation has to be enforced, don't forget to raise
+	 * a MEMCG_MAX event.
+	 */
+	if (!raised_max_event)
+		memcg_memory_event(mem_over_limit, MEMCG_MAX);
+
 	/*
 	 * The allocation either can't fail or will lead to more memory
 	 * being freed very soon.  Allow memory usage go over the limit