Message ID: 20220619155032.32515-1-laoar.shao@gmail.com
Series: bpf, mm: Recharge pages when reuse bpf map
On Sun, Jun 19, 2022 at 03:50:22PM +0000, Yafang Shao wrote:
> After switching to memcg-based bpf memory accounting, the bpf memory is
> charged to the loader's memcg by default, which causes unexpected issues for
> us. For instance, the container of the loader may be restarted after
> pinning progs and maps, but the bpf memcg will be left pinned on the
> system. Once the loader's new-generation container is started, the leftover
> pages won't be charged to it. That inconsistent behavior will make trouble
> for the memory resource management of this container.
>
> In the past few days, I have proposed two patchsets[1][2] to try to resolve
> this issue, but in both of these proposals the user code has to be
> changed to adapt to it, which is a pain for us. This patchset relieves the
> pain by triggering the recharge in libbpf. It also addresses Roman's
> critical comments.
>
> The key point that lets us avoid changing the user code is that there's a
> reuse path in libbpf. Once the bpf container is restarted, it will try
> to re-run the required bpf programs; if the bpf programs are the same as
> the already pinned ones, it will reuse them.
>
> To make sure we either recharge all of them successfully or don't recharge
> any of them, the recharge process is divided into three steps:
> - Pre charge to the new generation
>   To make sure that once we uncharge from the old generation, we can
>   always charge to the new generation successfully. If we can't pre charge
>   to the new generation, we won't allow it to be uncharged from the old
>   generation.
> - Uncharge from the old generation
>   After pre charging to the new generation, we can uncharge from the old
>   generation.
> - Post charge to the new generation
>   Finally we can set the pages' memcg_data to the new generation.
>
> In the pre charge step, we may succeed in charging some addresses but fail
> on a new address; then we should uncharge the already charged addresses,
> so another recharge-err step is introduced.
>
> This patchset has finished recharging the bpf hash map, which is the map
> type mostly used by our bpf services. The other map types haven't been
> implemented yet, and neither have the bpf progs.

... and the implementation in patch 10 is missing recharge of htab->elems,
which consumes the most memory. That begs the question how the whole set
was tested.

Even if that bug is fixed, this recharge approach works only with
preallocated maps. Their use will be reduced in the future.

Maps with kmalloc won't work with this multi-step approach.
There is no place where bpf_map_release_memcg can be done without racing
with concurrent kmallocs from the bpf program side.

Despite it being painful for user space, user space has to deal with it.
It created a container, charged its memcg, then destroyed the container,
but didn't free the bpf map. It's a user bug. It has to free the map.

User space can use a map-in-map solution. In the new container a new bpf map
can be allocated (and charged to the new memcg), the data copied from the old
map, and then the inner map replaced. At that point the old map can be freed
and the memcg uncharged.

To make things easier we can consider introducing an anon FD that points to a
memcg. Then the user can pick a memcg at map creation time instead of
get_mem_cgroup_from_current.
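For illustration, a minimal user-space sketch of the map-in-map swap described
above. The pin path, map names, and sizes are assumptions, not taken from this
thread, and error handling is omitted:

        /* Swap a fresh inner map (charged to the new container's memcg)
         * into a pinned outer map, then free the old inner map.
         */
        #include <unistd.h>
        #include <bpf/bpf.h>

        int swap_inner_map(void)
        {
                int outer_fd, old_fd, new_fd;
                __u32 key = 0, old_id;

                /* The outer map was pinned by the previous generation. */
                outer_fd = bpf_obj_get("/sys/fs/bpf/outer_map");

                /* Created from the new container, so it is charged to
                 * the new container's memcg. */
                new_fd = bpf_map_create(BPF_MAP_TYPE_HASH, "inner_new",
                                        sizeof(__u32), sizeof(__u64),
                                        10240, NULL);

                /* From user space, lookups in an outer map return the
                 * inner map's ID, not an fd. */
                bpf_map_lookup_elem(outer_fd, &key, &old_id);
                old_fd = bpf_map_get_fd_by_id(old_id);

                /* ... copy the entries of old_fd into new_fd here ... */

                /* Publish the new inner map; progs switch over atomically. */
                bpf_map_update_elem(outer_fd, &key, &new_fd, BPF_ANY);

                /* Dropping the last reference frees the old map and
                 * uncharges the old memcg. */
                close(old_fd);
                close(outer_fd);
                return new_fd;
        }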
On Wed, Jun 22, 2022 at 7:28 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sun, Jun 19, 2022 at 03:50:22PM +0000, Yafang Shao wrote:
> > [...]
>
> ... and the implementation in patch 10 is missing recharge of htab->elems,
> which consumes the most memory.

You got to the point. I intended to skip htab->elems for some reasons.

You have pointed out one of the reasons: it is complex for non-preallocated
memory. Another reason is memcg reparenting. For example, once the memcg is
offline, its pages and other related memory things like obj_cgroup will be
reparented, but, unluckily, map->memcg still points to the offline memcg.
There _must_ be issues in it, but I haven't figured out yet what they may
cause for the user. One issue I can imagine is that the memcg limit may not
work well in this case. That should be another, different issue, and I'm
still working on it.

> That begs the question how the whole set was tested.

In my test case, htab->buckets is around 120MB, which can be used to do the
comparison. That memory is recharged back without any kernel warnings, so I
posted this incomplete patchset to get feedback and check whether I'm in the
right direction.

> Even if that bug is fixed, this recharge approach works only with
> preallocated maps. Their use will be reduced in the future.
>
> Maps with kmalloc won't work with this multi-step approach.
> There is no place where bpf_map_release_memcg can be done without racing
> with concurrent kmallocs from the bpf program side.

Right, that is really an issue.

> Despite it being painful for user space, user space has to deal with it.
> It created a container, charged its memcg, then destroyed the container,
> but didn't free the bpf map. It's a user bug. It has to free the map.

It seems that I didn't describe this case clearly. It is not a user bug but
a use case in the k8s environment. For example, after the user has deployed
its container, it may need to upgrade its user application or bpf prog, so
the user upgrades its container; that means the old container will be
destroyed and a new container created. During this upgrade process, the bpf
programs can continue to work without interruption because they are pinned.

> User space can use a map-in-map solution. In the new container a new bpf
> map can be allocated (and charged to the new memcg), the data copied from
> the old map, and then the inner map replaced. At that point the old map
> can be freed and the memcg uncharged.

The map-in-map solution may work for some cases, but it will allocate more
memory, which is not acceptable if the bpf map uses lots of memory.

> To make things easier we can consider introducing an anon FD that points
> to a memcg. Then the user can pick a memcg at map creation time instead of
> get_mem_cgroup_from_current.

This may be a workable solution. The life cycle of a pinned bpf program is
independent of the application which loads and pins it, so it is reasonable
to introduce an independent memcg to manage the bpf memory, for example a
memcg like /sys/fs/cgroup/memory/pinned_bpf which is independent of the k8s
pod. I will analyze it. Thanks for the suggestion.
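To make the idea concrete, a rough user-space sketch of what such an
interface could look like. The memcg_fd option is hypothetical and does not
exist in the current UAPI or libbpf; it only illustrates picking the charging
memcg at map creation time:

        #include <fcntl.h>
        #include <bpf/bpf.h>

        int create_map_charged_to(const char *cgroup_path)
        {
                /* e.g. a dedicated memcg such as /sys/fs/cgroup/pinned_bpf */
                int memcg_fd = open(cgroup_path, O_RDONLY | O_DIRECTORY);

                LIBBPF_OPTS(bpf_map_create_opts, opts,
                        .memcg_fd = memcg_fd,   /* hypothetical field */
                );

                return bpf_map_create(BPF_MAP_TYPE_HASH, "pinned_map",
                                      sizeof(__u32), sizeof(__u64),
                                      10240, &opts);
        }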
On Sun, Jun 19, 2022 at 03:50:22PM +0000, Yafang Shao wrote:
> After switching to memcg-based bpf memory accounting, the bpf memory is
> charged to the loader's memcg by default, which causes unexpected issues for
> us. For instance, the container of the loader may be restarted after
> pinning progs and maps, but the bpf memcg will be left pinned on the
> system.
> [...]
> This patchset has finished recharging the bpf hash map, which is the map
> type mostly used by our bpf services. The other map types haven't been
> implemented yet, and neither have the bpf progs.

Without going into the implementation details, the overall approach looks
ok to me. But it adds complexity and code to several different subsystems,
and I'm 100% sure it's not worth it if we're talking about partial support
of a single map type. Are you committed to implementing the recharging
for all/most map types and progs, and to supporting this code in the future?

I'm still feeling you're trying to solve a userspace problem in the kernel.
Not saying it can't be solved this way, but it seems like there are
easier options.

Thanks!
On Thu, Jun 23, 2022 at 11:29 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> On Sun, Jun 19, 2022 at 03:50:22PM +0000, Yafang Shao wrote:
> > [...]
>
> Without going into the implementation details, the overall approach looks
> ok to me. But it adds complexity and code to several different subsystems,
> and I'm 100% sure it's not worth it if we're talking about partial support
> of a single map type. Are you committed to implementing the recharging
> for all/most map types and progs, and to supporting this code in the future?

I'm planning to support it for all map types and progs. Regarding the
progs, it seems that we have to introduce a new UAPI for the user to
do the recharge, because there's no similar reuse path in libbpf.

Our company is a heavy bpf user. We have many bpf programs running in our
production environment, including networking bpf, tracing/profiling bpf,
and some other bpf programs which are not supported in the upstream kernel;
for example, we're even trying the sched-bpf[1] proposed by you (and you
may remember that I reviewed your patchset). Most of the networking bpf,
e.g. gateway-bpf, edt-bpf, loadbalance-bpf, veth-bpf, are pinned on the
system.

It is a trend that bpf will be introduced into more and more subsystems,
and thus there is no doubt that a bpf patchset will involve many
subsystems. That means I will be continuously active in these areas in the
near future, for several years at least.

[1]. https://lwn.net/Articles/869433/

> I'm still feeling you're trying to solve a userspace problem in the kernel.

Your feeling can be converted to a simple question: is a process running in
a container allowed to pin a bpf program? The answer to this simple
question can help us understand whether it is a user bug or a kernel bug.

I think you will agree with me that there's definitely no reason to refuse
to let a containerized process pin a bpf program. And then we will find
that the pinned bpf program doesn't cooperate well with memcg. One kernel
feature can't work together with another kernel feature, and there's not
even an interface provided for the user to adjust it. The user either
doesn't pin the bpf program or disables kmemcg. Isn't that a kernel bug?

You may wonder why these two features can't cooperate. I will explain it in
detail; it will be a long story.

It should begin with why we introduced bpf pinning. We pin a program
because sometimes the lifecycle of a user application is different from the
bpf program's, or there's no user agent at all. To keep it simple, I will
take the no-user-agent case (the agent exits after pinning the bpf program)
as an example.

Now think about what happens if the agent which pins the bpf program has a
memcg. No matter whether the agent destroys the memcg or not once it exits,
the memcg will not disappear, because it is pinned by the bpf program. To
make it easy, let's assume the memcg isn't being destroyed, IOW, it is
still online.

This online memcg is no longer populated, but it is still being remotely
charged (if the map is a non-preallocated one); it looks like a ghost. Now
let's look into the details of what happens to this ghost memcg.

If this ghost memcg is limited, it will introduce many issues AFAICS.
Firstly, the memcg will be force charged[2], and I think I don't need to
explain the reason to you. Even worse, it force-charges silently, without
any event, because the force charge comes from

        if (!gfpflags_allow_blocking(gfp_mask))
                goto nomem;

and then all memcg events are skipped. So at least we should introduce a
force-charge event,

        force:
        +       memcg_memory_event(mem_over_limit, MEMCG_FORCE_CHARGE);
                page_counter_charge(&memcg->memory, nr_pages);

And then we should allow alloc_htab_elem() to fail,

                l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size,
        -                                    GFP_ATOMIC | __GFP_NOWARN,
        +                                    __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM | __GFP_NOWARN,
                                             htab->map.numa_node);

And then we'd better introduce an improvement for memcg,

        +       /*
        +        * Should wake up async memcg reclaim first,
        +        * in case there will be no direct memcg reclaim for a long time.
        +        * We can either introduce async memcg reclaim
        +        * or modify kswapd to reclaim a specific memcg.
        +        */
        +       if (gfp_mask & __GFP_KSWAPD_RECLAIM)
        +               wake_up_async_memcg_reclaim();
                if (!gfpflags_allow_blocking(gfp_mask))
                        goto nomem;

And so on...

Really bad luck that there are so many issues in memcg, though it may also
be because I don't have a deep understanding of memcg...

I have to clarify that these issues are not caused by memcg-based bpf
accounting, but exposed by it.

[ Time for lunch here, so I have to stop. ]

[2]. https://elixir.bootlin.com/linux/v5.19-rc3/source/mm/memcontrol.c#L2685

> Not saying it can't be solved this way, but it seems like there are
> easier options.
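The remote charging discussed above comes from the bpf map allocation
helpers. For reference, a condensed paraphrase of bpf_map_kmalloc_node() from
kernel/bpf/syscall.c (v5.19; simplified, see the source for exact details).
Every element allocation temporarily switches the active memcg to
map->memcg, which is why a non-preallocated map keeps charging the ghost
memcg long after its loader is gone:

        void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size,
                                   gfp_t flags, int node)
        {
                struct mem_cgroup *old_memcg;
                void *ptr;

                /* Charge map->memcg (saved at map creation time), not
                 * the memcg of whoever triggered this allocation. */
                old_memcg = set_active_memcg(map->memcg);
                ptr = kmalloc_node(size, flags | __GFP_ACCOUNT, node);
                set_active_memcg(old_memcg);

                return ptr;
        }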
On Sat, Jun 25, 2022 at 11:26:13AM +0800, Yafang Shao wrote:
> On Thu, Jun 23, 2022 at 11:29 AM Roman Gushchin
> <roman.gushchin@linux.dev> wrote:
> > [...]
>
> I'm planning to support it for all map types and progs. Regarding the
> progs, it seems that we have to introduce a new UAPI for the user to
> do the recharge, because there's no similar reuse path in libbpf.
> [...]
> That means I will be continuously active in these areas in the near
> future, for several years at least.

Ok, I'm glad to hear this. I highly recommend covering more map types
and use cases in the next iterations of the patchset.

> [...]
> So at least we should introduce a force-charge event,
>
>         force:
>         +       memcg_memory_event(mem_over_limit, MEMCG_FORCE_CHARGE);
>                 page_counter_charge(&memcg->memory, nr_pages);

This is actually a good point, let me try to fix it.

> And then we should allow alloc_htab_elem() to fail,
>
>                 l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size,
>         -                                    GFP_ATOMIC | __GFP_NOWARN,
>         +                                    __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM | __GFP_NOWARN,
>                                              htab->map.numa_node);

It's not a memcg thing, it was done this way previously. Probably Alexei
has a better explanation. Personally, I'm totally fine with removing
__GFP_NOWARN, but maybe I just don't know something.

> And then we'd better introduce an improvement for memcg,
>
>         +       if (gfp_mask & __GFP_KSWAPD_RECLAIM)
>         +               wake_up_async_memcg_reclaim();
>                 if (!gfpflags_allow_blocking(gfp_mask))
>                         goto nomem;

Hm, I see. It might be an issue if there is no global memory pressure,
right? Let me think about what I can do here too.

> [...]
> I have to clarify that these issues are not caused by memcg-based bpf
> accounting, but exposed by it.
>
> [ Time for lunch here, so I have to stop. ]

Thank you for writing this text, it was interesting to follow your
thinking. And thank you for bringing up the problems above.

Let me be clear: I'm not opposing the idea of recharging, I'm only against
introducing hacks for bpf-specific issues which can't be nicely generalized
to other use cases and subsystems. That's the only reason why I'm a bit
defensive here.

In general, as of now, memory cgroups do not provide the ability to
recharge accounted objects (with some exceptions from the v1 legacy). This
applies both to user and kernel memory. I agree that bpf maps are in some
sense unique, as they are potentially large kernel objects with a lot of
control from userspace. Is this a good reason to extend the memory cgroup
API with a recharging ability? Maybe, but if yes, let's do it well.

The big question is how to do it. Memcg accounting is done in a way that
requires few changes to the kernel code, right? You just add __GFP_ACCOUNT
to the gfp flags and that's it: you get a pointer to an already accounted
object. The same applies to uncharging. It works transparently.

Recharging is different: a caller should have some sort of ownership over
the object (to make sure we are not racing against reclaim and/or another
recharging). And the rules are different for each type of object. It's the
caller's duty to make sure all parts of a complex object are properly
recharged and nothing is left behind. There is also the reparenting
mechanism, which can race against the recharging. So it's not an easy
problem. If an object is large, we probably don't want to recharge it all
at once, otherwise the temporary doubling of the accounted memory (thanks
to the precharge-uncharge-commit approach) risks introducing spurious OOMs
on memory-limited systems.

So yeah, if it doesn't sound too scary to you, I'm happy to help with
this. But it's a lot of work to do it properly; that's why I'm thinking
that maybe it's better to work around it in userspace, as Alexei suggested.

Thanks!
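To make these requirements concrete, an illustrative pseudocode sketch of the
precharge-uncharge-commit sequence from the cover letter, including the
recharge-err rollback. The helper names and the page array are made up for
illustration; this is not the patchset's code:

        static int recharge_map_pages(struct bpf_map *map,
                                      struct mem_cgroup *dst)
        {
                struct mem_cgroup *src = map->memcg;
                size_t i, done = 0;
                int err;

                /* Step 1: precharge everything to dst, so the uncharge
                 * step below can never fail halfway through. */
                for (i = 0; i < map->nr_pages; i++, done++) {
                        err = precharge_page(map->pages[i], dst);
                        if (err)
                                goto err;       /* the recharge-err step */
                }

                /* Step 2: uncharge from src; cannot fail. */
                for (i = 0; i < map->nr_pages; i++)
                        uncharge_page(map->pages[i], src);

                /* Step 3: commit - point each page's memcg_data at dst. */
                for (i = 0; i < map->nr_pages; i++)
                        commit_page(map->pages[i], dst);

                map->memcg = dst;
                return 0;

        err:
                /* Roll back the partial precharge. */
                while (done--)
                        cancel_precharge(map->pages[done], dst);
                return err;
        }

Note that this is exactly where the ownership requirement bites: between
steps 1 and 3 nothing may reparent src or allocate from the map.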
On Sat, Jun 25, 2022 at 08:28:37PM -0700, Roman Gushchin wrote:
> On Sat, Jun 25, 2022 at 11:26:13AM +0800, Yafang Shao wrote:
> > [...]
>
> [...]
>
> So yeah, if it doesn't sound too scary to you, I'm happy to help with
> this. But it's a lot of work to do it properly; that's why I'm thinking
> that maybe it's better to work around it in userspace, as Alexei suggested.

And as Alexei mentioned, there are some changes coming in the way memory
allocations will usually be performed by the bpf code, which might make the
whole problem and/or your solution obsolete. Please make sure it's still
relevant before sending the next version.

Thanks!
On Sun, Jun 26, 2022 at 11:28 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> On Sat, Jun 25, 2022 at 11:26:13AM +0800, Yafang Shao wrote:
> > [...]
> > That means I will be continuously active in these areas in the near
> > future, for several years at least.
>
> Ok, I'm glad to hear this. I highly recommend covering more map types
> and use cases in the next iterations of the patchset.
>
> > [...]
> > So at least we should introduce a force-charge event,
> >
> >         force:
> >         +       memcg_memory_event(mem_over_limit, MEMCG_FORCE_CHARGE);
> >                 page_counter_charge(&memcg->memory, nr_pages);
>
> This is actually a good point, let me try to fix it.

Thanks for following up with it.
> > And then we should allow alloc_htab_elem() to fail,
> >
> >         l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size,
> > -                                    GFP_ATOMIC | __GFP_NOWARN,
> > +                                    __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM | __GFP_NOWARN,
> >                                      htab->map.numa_node);
>
> It's not a memcg thing, it was done this way previously. Probably Alexei
> has a better explanation. Personally, I'm totally fine with removing
> __GFP_NOWARN, but maybe I just don't know something.

Ah, the automatic line wrapping misled you. What I really want to remove is
the __GFP_HIGH in GFP_ATOMIC, because then the allocation will hit the
'nomem' check[1] when the memcg limit is reached. As the allocation in
alloc_htab_elem() is allowed to fail, it won't be an issue to remove this
flag. alloc_htab_elem() may allocate lots of memory, so it is also
reasonable to make it a low-priority allocation. Alexei may give us more
information on it.

[1]. https://elixir.bootlin.com/linux/v5.19-rc3/source/mm/memcontrol.c#L2683

> > And then we'd better introduce an improvement for memcg,
> >
> >         +       /*
> >         +        * Should wake up async memcg reclaim first,
> >         +        * in case there will be no direct memcg reclaim for a long time.
> >         +        * We can either introduce async memcg reclaim
> >         +        * or modify kswapd to reclaim a specific memcg.
> >         +        */
> >         +       if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> >         +               wake_up_async_memcg_reclaim();
> >                 if (!gfpflags_allow_blocking(gfp_mask))
> >                         goto nomem;
>
> Hm, I see. It might be an issue if there is no global memory pressure,
> right? Let me think about what I can do here too.

Right. It is not a good idea to expect a global memory reclaimer to do it.
Thanks for following up with it again.

> [...]
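For context, in v5.19 include/linux/gfp.h defines GFP_ATOMIC as
(__GFP_HIGH | __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM), so the proposed flags
keep the atomic and kswapd-wakeup semantics and drop only __GFP_HIGH. A
sketch of the resulting allocation site, with the failure handling that
alloc_htab_elem() already has (slightly simplified):

        /* Without __GFP_HIGH the memcg charge may now fail at the limit
         * instead of being force-charged past memory.max. */
        l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size,
                                     __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM |
                                     __GFP_NOWARN,
                                     htab->map.numa_node);
        if (!l_new) {
                l_new = ERR_PTR(-ENOMEM);       /* callers handle the error */
                goto dec_count;
        }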
> So it's not an easy problem. If an object is large, we probably don't
> want to recharge it all at once, otherwise the temporary doubling of the
> accounted memory (thanks to the precharge-uncharge-commit approach) risks
> introducing spurious OOMs on memory-limited systems.

As I explained in the cover letter, the 'doubling of the accounted memory'
can be avoided. The doubling happens where the src and dst memcg share the
same ancestor, so we can stop the iteration at their common ancestor.
For example,

                     memcg A
                    /       \
             memcg AB     memcg AC   <---- we can stop uncharge and recharge here
                         /        \
                  memcg ACA    memcg ACB
                      |            |
                  src memcg    dst memcg

Of course we must do it carefully.

> So yeah, if it doesn't sound too scary to you, I'm happy to help with
> this.

I'm glad to hear that you can help with it.

> But it's a lot of work to do it properly; that's why I'm thinking
> that maybe it's better to work around it in userspace, as Alexei suggested.

I don't mind working around it as Alexei suggested; that's why I
investigated whether it is possible to introduce a ghost memcg, and then
found so many issues with it. Even if all the issues I mentioned above are
fixed, we still need to change the kernel to adapt to it.
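For illustration, a sketch of how the common ancestor could be found with
existing memcg helpers. This is not the patchset's code, and the charge and
uncharge walks that stop at the returned memcg would still have to be
written:

        /* Walk up from src until we reach the first node that also
         * contains dst; counters above it never need to be touched,
         * so they see no transient double charge. */
        static struct mem_cgroup *common_ancestor(struct mem_cgroup *src,
                                                  struct mem_cgroup *dst)
        {
                struct mem_cgroup *memcg;

                for (memcg = src; memcg; memcg = parent_mem_cgroup(memcg)) {
                        if (mem_cgroup_is_descendant(dst, memcg))
                                return memcg;   /* "memcg AC" in the diagram */
                }
                return root_mem_cgroup;
        }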
On Sun, Jun 26, 2022 at 11:32 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > On Sat, Jun 25, 2022 at 08:28:37PM -0700, Roman Gushchin wrote: > > On Sat, Jun 25, 2022 at 11:26:13AM +0800, Yafang Shao wrote: > > > On Thu, Jun 23, 2022 at 11:29 AM Roman Gushchin > > > <roman.gushchin@linux.dev> wrote: > > > > > > > > On Sun, Jun 19, 2022 at 03:50:22PM +0000, Yafang Shao wrote: > > > > > After switching to memcg-based bpf memory accounting, the bpf memory is > > > > > charged to the loader's memcg by default, that causes unexpected issues for > > > > > us. For instance, the container of the loader may be restarted after > > > > > pinning progs and maps, but the bpf memcg will be left and pinned on the > > > > > system. Once the loader's new generation container is started, the leftover > > > > > pages won't be charged to it. That inconsistent behavior will make trouble > > > > > for the memory resource management for this container. > > > > > > > > > > In the past few days, I have proposed two patchsets[1][2] to try to resolve > > > > > this issue, but in both of these two proposals the user code has to be > > > > > changed to adapt to it, that is a pain for us. This patchset relieves the > > > > > pain by triggering the recharge in libbpf. It also addresses Roman's > > > > > critical comments. > > > > > > > > > > The key point we can avoid changing the user code is that there's a resue > > > > > path in libbpf. Once the bpf container is restarted again, it will try > > > > > to re-run the required bpf programs, if the bpf programs are the same with > > > > > the already pinned one, it will reuse them. > > > > > > > > > > To make sure we either recharge all of them successfully or don't recharge > > > > > any of them. The recharge prograss is divided into three steps: > > > > > - Pre charge to the new generation > > > > > To make sure once we uncharge from the old generation, we can always > > > > > charge to the new generation succeesfully. If we can't pre charge to > > > > > the new generation, we won't allow it to be uncharged from the old > > > > > generation. > > > > > - Uncharge from the old generation > > > > > After pre charge to the new generation, we can uncharge from the old > > > > > generation. > > > > > - Post charge to the new generation > > > > > Finnaly we can set pages' memcg_data to the new generation. > > > > > In the pre charge step, we may succeed to charge some addresses, but fail > > > > > to charge a new address, then we should uncharge the already charged > > > > > addresses, so another recharge-err step is instroduced. > > > > > > > > > > This pachset has finished recharging bpf hash map. which is mostly used > > > > > by our bpf services. The other maps hasn't been implemented yet. The bpf > > > > > progs hasn't been implemented neither. > > > > > > > > Without going into the implementation details, the overall approach looks > > > > ok to me. But it adds complexity and code into several different subsystems, > > > > and I'm 100% sure it's not worth it if we talking about a partial support > > > > of a single map type. Are you committed to implement the recharging > > > > for all/most map types and progs and support this code in the future? > > > > > > > > > > I'm planning to support it for all map types and progs. Regarding the > > > progs, it seems that we have to introduce a new UAPI for the user to > > > do the recharge, because there's no similar reuse path in libbpf. > > > > > > Our company is a heavy bpf user. 
We have many bpf programs running on > > > our production environment, including networking bpf, > > > tracing/profiling bpf, and some other bpf programs which are not > > > supported in upstream kernel, for example we're even trying the > > > sched-bpf[1] proposed by you (and you may remember that I reviewed > > > your patchset). Most of the networking bpf, e.g. gateway-bpf, > > > edt-bpf, loadbalance-bpf, veth-bpf, are pinned on the system. > > > > > > It is a trend that bpf will be introduced in more and more subsystems, > > > and thus it is no doubt that a bpf patchset will involve many > > > subsystems. > > > > > > That means I will be continuously active in these areas in the near > > > future, several years at least. > > > > Ok, I'm glad to hear this. I highly recommend to cover more map types > > and use cases in next iterations of the patchset. > > > > > > > > [1]. https://lwn.net/Articles/869433/ > > > > > > > I'm still feeling you trying to solve a userspace problem in the kernel. > > > > > > Your feeling can be converted to a simple question: is it allowed to > > > pin a bpf program by a process running in a container. The answer to > > > this simple question can help us to understand whether it is a user > > > bug or a kernel bug. > > > > > > I think you will agree with me that there's definitely no reason to > > > refuse to pin a bpf program by a containerized process. And then we > > > will find that the pinned-bpf-program doesn't cooperate well with > > > memcg. A kernel feature can't work together with another kernel > > > feature, and there's not even an interface provided to the user to > > > adjust it. The user either doesn't pin the bpf program or disable > > > kmemcg. Isn't it a kernel bug ? > > > > > > You may have a doubt why these two features can't cooperate. I will > > > explain it in detail. That will be a long story. > > > > > > It should begin with why we introduce bpf pinning. We pin it because > > > sometimes the lifecycle of a user application is different with the > > > bpf program, or there's no user agent at all. In order to make it > > > simple, I will take the no-user-agent (agent exits after pinning bpf > > > program) case as an example. > > > > > > Now thinking about what will happen if the agent which pins the bpf > > > program has a memcg. No matter if the agent destroys the memcg or not > > > once it exits, the memcg will not disappear because it is pinned by > > > the bpf program. To make it easy, let's assume the memcg isn't being > > > destroyed, IOW, it is online. > > > > > > An online memcg is not populated, but it is still being remote charged > > > (if it is a non-preallocate bpf map), that looks like a ghost. Now we > > > will look into the details to find what will happen to this ghost > > > memcg. > > > > > > If this ghost memcg is limited, it will introduce many issues AFAICS. > > > Firstly, the memcg will be force charged[2], and I think I don't need > > > to explain the reason to you. > > > Even worse is that it force-charges silently without any event, > > > because it comes from, > > > if (!gfpflags_allow_blocking(gfp_mask)) > > > goto nomem; > > > And then all memcg events will be skipped. So at least we will > > > introduce a force-charge event, > > > force: > > > + memcg_memory_event(mem_over_limit, MEMCG_FORCE_CHARGE); > > > page_counter_charge(&memcg->memory, nr_pages); > > > > This is actually a good point, let me try to fix it. 
> > > > > > > > And then we should allow alloc_htab_elem() to fail, > > > l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size, > > > - GFP_ATOMIC | __GFP_NOWARN, > > > + __GFP_ATOMIC | > > > __GFP_KSWAPD_RECLAIM | __GFP_NOWARN, > > > > It's not a memcg thing, it was done this way previously. Probably Alexei > > has a better explanation. Personally, I'm totally fine with removing > > __GFP_NOWARN, but maybe I just don't know something. > > > > > htab->map.numa_node); > > > And then we'd better introduce an improvement for memcg, > > > + /* > > > + * Should wakeup async memcg reclaim first, > > > + * in case there will be no direct memcg reclaim for a long time. > > > + * We can either introduce async memcg reclaim > > > + * or modify kswapd to reclaim a specific memcg > > > + */ > > > + if (gfp_mask & __GFP_KSWAPD_RECLAIM) > > > + wake_up_async_memcg_reclaim(); > > > if (!gfpflags_allow_blocking(gfp_mask)) > > > goto nomem; > > > > Hm, I see. It might be an issue if there is no global memory pressure, right? > > Let me think what I can do here too. > > > > > > > > And ..... > > > > > > Really bad luck that there are so many issues in memcg, but it may > > > also be because I don't have a deep understanding of memcg ...... > > > > > > I have to clarify that these issues are not caused by > > > memcg-based-bpf-accounting, but exposed by it. > > > > > > [ Time for lunch here, so I have to stop. ] > > > > Thank you for writing this text, it was interesting to follow your thinking. > > And thank you for bringing in these problems above. > > > > Let me be clear: I'm not opposing the idea of recharging, I'm only against > > introducing hacks for bpf-specific issues, which can't be nicely generalized > > for other use cases and subsystems. That's the only reason why I'm a bit > > defensive here. > > > > In general, as now memory cgroups do not provide an ability to recharge > > accounted objects (with some exceptions from the v1 legacy). It applies > > both to user and kernel memory. I agree, that bpf maps are in some sense > > unique, as they are potentially large kernel objects with a lot of control > > from the userspace. Is this a good reason to extend memory cgroup API > > with the recharging ability? Maybe, but if yes, let's do it well. > > > > The big question is how to do it? Memcg accounting is done in a way > > that requires little changes from the kernel code, right? You just > > add __GFP_ACCOUNT to gfp flags and that's it, you get a pointer to > > an already accounted object. The same applies for uncharging. > > It works transparently. > > > > Recharging is different: a caller should have some sort of the ownership > > over the object (to make sure we are not racing against the reclaim and/or > > another recharging). And the rules are different for each type of objects. > > It's a caller duty to make sure all parts of the complex object are properly > > recharged and nothing is left behind. There is also the reparenting mechanism > > which can race against the recharging. So it's not an easy problem. > > If an object is large, we probably don't want to recharge it at once, > > otherwise temporarily doubling of the accounted memory (thanks to the > > precharge-uncharge-commit approach) risks introducing spurious OOMs > > on memory-limited systems. > > > > So yeah, if it doesn't sound too scary for you, I'm happy to help > > with this. But it's a lot of work to do it properly, that's why I'm thinking > > that maybe it's better to workaround it in userspace, as Alexei suggested. 
> > And as Alexei mentioned, there are some changes coming around the way > how memory allocations will usually be performed by the bpf code, which might > make the whole problem and/or your solution obsolete. Please make sure it's > still relevant before sending the next version. > I have taken a look at Alexei's patchset. For the non-preallocated bpf memory, we have to use map->memcg to do the accounting, so this issue will still exist. It also reminds us that when we reuse a map, we must make sure map->memcg is the expected one.
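For context, this is roughly how bpf_map_kmalloc_node() charges non-preallocated elements against map->memcg (paraphrased from kernel/bpf/syscall.c under CONFIG_MEMCG_KMEM; details vary by kernel version). It shows why a stale map->memcg on a reused map keeps charging the old container:

/* Paraphrased from kernel/bpf/syscall.c; not an exact copy. */
void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size,
			   gfp_t flags, int node)
{
	struct mem_cgroup *old_memcg;
	void *ptr;

	/* Charge against the memcg captured at map creation time,
	 * not the current task's memcg. If the map is reused by a
	 * restarted container, map->memcg may still point at the
	 * old generation's memcg. */
	old_memcg = set_active_memcg(map->memcg);
	ptr = kmalloc_node(size, flags | __GFP_ACCOUNT, node);
	set_active_memcg(old_memcg);

	return ptr;
}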
On Fri, Jun 24, 2022 at 8:26 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > I'm planning to support it for all map types and progs. Regarding the > progs, it seems that we have to introduce a new UAPI for the user to > do the recharge, because there's no similar reuse path in libbpf. > > Our company is a heavy bpf user. What company is that?
On Mon, Jun 27, 2022 at 8:40 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Fri, Jun 24, 2022 at 8:26 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > I'm planning to support it for all map types and progs. Regarding the > > progs, it seems that we have to introduce a new UAPI for the user to > > do the recharge, because there's no similar reuse path in libbpf. > > > > Our company is a heavy bpf user. > > What company is that? Pinduoduo, aka PDD, a small company in China.
On Sun, Jun 26, 2022 at 02:25:51PM +0800, Yafang Shao wrote: > > > htab->map.numa_node); > > > And then we'd better introduce an improvement for memcg, > > > + /* > > > + * Should wakeup async memcg reclaim first, > > > + * in case there will be no direct memcg reclaim for a long time. > > > + * We can either introduce async memcg reclaim > > > + * or modify kswapd to reclaim a specific memcg > > > + */ > > > + if (gfp_mask & __GFP_KSWAPD_RECLAIM) > > > + wake_up_async_memcg_reclaim(); > > > if (!gfpflags_allow_blocking(gfp_mask)) > > > goto nomem; > > > > Hm, I see. It might be an issue if there is no global memory pressure, right? > > Let me think what I can do here too. > > > > Right. It is not a good idea to expect a global memory reclaimer to do it. > Thanks for following up with it again. After thinking a bit more, I'm not sure if it's actually a good idea: there might not be much memory to reclaim except the memory consumed by the bpf map itself, so waking kswapd might be useless (and just consume cpu and drain batteries). What we need to do instead is to prevent bpf maps from meaningfully exceeding memory.max, which is, btw, guaranteed by the cgroup API: memory.max is defined as a hard limit in the docs. Your recent patch is actually doing this for hash maps; let's fix the rest of the bpf code. Thanks!
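For background on how a charge can exceed memory.max at all: abridged from try_charge_memcg() in mm/memcontrol.c as of this thread (exact placement varies by version), atomic charges are treated as privileged and forced past the limit. A fix along the lines discussed here is therefore to avoid the privileged flags for map element allocations, so the charge fails with -ENOMEM instead of breaching the hard limit.

/* mm/memcontrol.c, try_charge_memcg() -- abridged */
	if (gfp_mask & __GFP_ATOMIC)
		goto force;	/* privileged: charge even above memory.max */

	if (!gfpflags_allow_blocking(gfp_mask))
		goto nomem;	/* cannot reclaim */
	/* ... reclaim and retry logic elided ... */
nomem:
	if (!(gfp_mask & __GFP_NOFAIL))
		return -ENOMEM;	/* respect the limit and fail */
force:
	page_counter_charge(&memcg->memory, nr_pages);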
On Sat, Jul 2, 2022 at 12:23 PM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > On Sun, Jun 26, 2022 at 02:25:51PM +0800, Yafang Shao wrote: > > > > htab->map.numa_node); > > > > And then we'd better introduce an improvement for memcg, > > > > + /* > > > > + * Should wakeup async memcg reclaim first, > > > > + * in case there will be no direct memcg reclaim for a long time. > > > > + * We can either introduce async memcg reclaim > > > > + * or modify kswapd to reclaim a specific memcg > > > > + */ > > > > + if (gfp_mask & __GFP_KSWAPD_RECLAIM) > > > > + wake_up_async_memcg_reclaim(); > > > > if (!gfpflags_allow_blocking(gfp_mask)) > > > > goto nomem; > > > > > > Hm, I see. It might be an issue if there is no global memory pressure, right? > > > Let me think what I can do here too. > > > > > > > Right. It is not a good idea to expect a global memory reclaimer to do it. > > Thanks for following up with it again. > > > After thinking a bit more, I'm not sure if it's actually a good idea: > there might not be much memory to reclaim except the memory consumed by the bpf > map itself, so waking kswapd might be useless (and just consume cpu and drain > batteries). > I'm not sure if it is a generic problem. For example, a latency-sensitive process running in a container doesn't set __GFP_DIRECT_RECLAIM, but there are page cache pages in this container. If there's no global memory pressure and no other kind of memory allocation in this container, these page cache pages will not be reclaimed for a long time. Maybe we should also check the number of page cache pages in this container before waking up kswapd, but I'm not quite sure about it.
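A hypothetical sketch of that check, purely illustrative: neither wake_up_async_memcg_reclaim() nor the threshold exists in the kernel, while memcg_page_state() and NR_FILE_PAGES are real, mm-internal symbols.

/* Hypothetical: only wake the async reclaimer when the memcg has
 * page cache worth reclaiming, to avoid pointless wakeups. */
static void maybe_wake_async_memcg_reclaim(struct mem_cgroup *memcg)
{
	/* A rough signal for "is there anything reclaimable here?" */
	unsigned long file_pages = memcg_page_state(memcg, NR_FILE_PAGES);

	if (file_pages > SWAP_CLUSTER_MAX)	/* arbitrary threshold */
		wake_up_async_memcg_reclaim(memcg);	/* hypothetical */
}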
On Sat, Jul 02, 2022 at 11:24:10PM +0800, Yafang Shao wrote: > On Sat, Jul 2, 2022 at 12:23 PM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > On Sun, Jun 26, 2022 at 02:25:51PM +0800, Yafang Shao wrote: > > > > > htab->map.numa_node); > > > > > And then we'd better introduce an improvement for memcg, > > > > > + /* > > > > > + * Should wakeup async memcg reclaim first, > > > > > + * in case there will be no direct memcg reclaim for a long time. > > > > > + * We can either introduce async memcg reclaim > > > > > + * or modify kswapd to reclaim a specific memcg > > > > > + */ > > > > > + if (gfp_mask & __GFP_KSWAPD_RECLAIM) > > > > > + wake_up_async_memcg_reclaim(); > > > > > if (!gfpflags_allow_blocking(gfp_mask)) > > > > > goto nomem; > > > > > > > > Hm, I see. It might be an issue if there is no global memory pressure, right? > > > > Let me think what I can do here too. > > > > > > > > > > Right. It is not a good idea to expect a global memory reclaimer to do it. > > > Thanks for following up with it again. > > > > After thinking a bit more, I'm not sure if it's actually a good idea: > > there might not be much memory to reclaim except the memory consumed by the bpf > > map itself, so waking kswapd might be useless (and just consume cpu and drain > > batteries). > > > > I'm not sure if it is a generic problem. > For example, a latency-sensitive process running in a container > doesn't set __GFP_DIRECT_RECLAIM, but there are page cache pages in > this container. If there's no global memory pressure and no other kind > of memory allocation in this container, these page cache pages will > not be reclaimed for a long time. > Maybe we should also check the number of page cache pages in this > container before waking up kswapd, but I'm not quite sure about it. It's not a generic problem, but it might be a very specific one. Anyway, it doesn't really matter: we shouldn't exceed memory.max.