Message ID | 20200331131628.153118-1-joel@joelfernandes.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC] rcu/tree: Use GFP_MEMALLOC for alloc memory to free memory pattern | expand |
On Tue, Mar 31, 2020 at 09:16:28AM -0400, Joel Fernandes (Google) wrote: > In kfree_rcu() headless implementation (where the caller need not pass > an rcu_head, but rather directly pass a pointer to an object), we have > a fall-back where we allocate a rcu_head wrapper for the caller (not the > common case). This brings the pattern of needing to allocate some memory > to free some memory. Currently we use GFP_ATOMIC flag to try harder for > this allocation, however the GFP_MEMALLOC flag is more tailored to this > pattern. We need to try harder not only during atomic context, but also > during non-atomic context anyway. So use the GFP_MEMALLOC flag instead. > > Also remove the __GFP_NOWARN flag simply because although we do have a > synchronize_rcu() fallback for absolutely worst case, we still would > like to not enter that path and atleast trigger a warning to the user. > > Cc: linux-mm@kvack.org > Cc: rcu@vger.kernel.org > Cc: willy@infradead.org > Cc: peterz@infradead.org > Cc: neilb@suse.com > Cc: vbabka@suse.cz > Cc: mgorman@suse.de > Cc: Andrew Morton <akpm@linux-foundation.org> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > --- > > This patch is based on the (not yet upstream) code in: > git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git (branch rcu/kfree) > > It is a follow-up to the posted series: > https://lore.kernel.org/lkml/20200330023248.164994-1-joel@joelfernandes.org/ > > > kernel/rcu/tree.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index 4be763355c9fb..965deefffdd58 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -3149,7 +3149,7 @@ static inline struct rcu_head *attach_rcu_head_to_object(void *obj) > > if (!ptr) > ptr = kmalloc(sizeof(unsigned long *) + > - sizeof(struct rcu_head), GFP_ATOMIC | __GFP_NOWARN); > + sizeof(struct rcu_head), GFP_MEMALLOC); > Hello, Joel I have some questions regarding improving it, see below them: Do you mean __GFP_MEMALLOC? Can that flag be used in atomic context? Actually we do allocate there under spin lock. Should be combined with GFP_ATOMIC | __GFP_MEMALLOC? As for removing __GFP_NOWARN. Actually it is expectable that an allocation can fail, if so we follow last emergency case. You can see the trace but what would you do with that information? Thanks! -- Vlad Rezki
On Tue, Mar 31, 2020 at 09:16:28AM -0400, Joel Fernandes (Google) wrote: > In kfree_rcu() headless implementation (where the caller need not pass > an rcu_head, but rather directly pass a pointer to an object), we have > a fall-back where we allocate a rcu_head wrapper for the caller (not the > common case). This brings the pattern of needing to allocate some memory > to free some memory. Currently we use GFP_ATOMIC flag to try harder for > this allocation, however the GFP_MEMALLOC flag is more tailored to this > pattern. We need to try harder not only during atomic context, but also > during non-atomic context anyway. So use the GFP_MEMALLOC flag instead. > > Also remove the __GFP_NOWARN flag simply because although we do have a > synchronize_rcu() fallback for absolutely worst case, we still would > like to not enter that path and atleast trigger a warning to the user. > > Cc: linux-mm@kvack.org > Cc: rcu@vger.kernel.org > Cc: willy@infradead.org > Cc: peterz@infradead.org > Cc: neilb@suse.com > Cc: vbabka@suse.cz > Cc: mgorman@suse.de > Cc: Andrew Morton <akpm@linux-foundation.org> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > --- > > This patch is based on the (not yet upstream) code in: > git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git (branch rcu/kfree) > > It is a follow-up to the posted series: > https://lore.kernel.org/lkml/20200330023248.164994-1-joel@joelfernandes.org/ > > > kernel/rcu/tree.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index 4be763355c9fb..965deefffdd58 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -3149,7 +3149,7 @@ static inline struct rcu_head *attach_rcu_head_to_object(void *obj) > > if (!ptr) > ptr = kmalloc(sizeof(unsigned long *) + > - sizeof(struct rcu_head), GFP_ATOMIC | __GFP_NOWARN); > + sizeof(struct rcu_head), GFP_MEMALLOC); Just to add, the main requirements here are: 1. Allocation should be bounded in time. 2. Allocation should try hard (possibly tapping into reserves) 3. Sleeping is Ok but should not affect the time bound. Considering this, GFP_MEMALLOC|GFP_NOWAIT seems appropriate. Thoughts? thanks, - Joel
On Tue, Mar 31, 2020 at 04:04:33PM +0200, Uladzislau Rezki wrote: > On Tue, Mar 31, 2020 at 09:16:28AM -0400, Joel Fernandes (Google) wrote: > > In kfree_rcu() headless implementation (where the caller need not pass > > an rcu_head, but rather directly pass a pointer to an object), we have > > a fall-back where we allocate a rcu_head wrapper for the caller (not the > > common case). This brings the pattern of needing to allocate some memory > > to free some memory. Currently we use GFP_ATOMIC flag to try harder for > > this allocation, however the GFP_MEMALLOC flag is more tailored to this > > pattern. We need to try harder not only during atomic context, but also > > during non-atomic context anyway. So use the GFP_MEMALLOC flag instead. > > > > Also remove the __GFP_NOWARN flag simply because although we do have a > > synchronize_rcu() fallback for absolutely worst case, we still would > > like to not enter that path and atleast trigger a warning to the user. > > > > Cc: linux-mm@kvack.org > > Cc: rcu@vger.kernel.org > > Cc: willy@infradead.org > > Cc: peterz@infradead.org > > Cc: neilb@suse.com > > Cc: vbabka@suse.cz > > Cc: mgorman@suse.de > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > --- > > > > This patch is based on the (not yet upstream) code in: > > git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git (branch rcu/kfree) > > > > It is a follow-up to the posted series: > > https://lore.kernel.org/lkml/20200330023248.164994-1-joel@joelfernandes.org/ > > > > > > kernel/rcu/tree.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > index 4be763355c9fb..965deefffdd58 100644 > > --- a/kernel/rcu/tree.c > > +++ b/kernel/rcu/tree.c > > @@ -3149,7 +3149,7 @@ static inline struct rcu_head *attach_rcu_head_to_object(void *obj) > > > > if (!ptr) > > ptr = kmalloc(sizeof(unsigned long *) + > > - sizeof(struct rcu_head), GFP_ATOMIC | __GFP_NOWARN); > > + sizeof(struct rcu_head), GFP_MEMALLOC); > > > Hello, Joel > > I have some questions regarding improving it, see below them: > > Do you mean __GFP_MEMALLOC? Can that flag be used in atomic context? > Actually we do allocate there under spin lock. Should be combined with > GFP_ATOMIC | __GFP_MEMALLOC? Yes, I mean __GFP_MEMALLOC. Sorry, the patch was just to show the idea and marked as RFC. Good point on the atomic aspect of this path, you are right we cannot sleep. I believe the GFP_NOWAIT I mentioned in my last reply will take care of that? > As for removing __GFP_NOWARN. Actually it is expectable that an > allocation can fail, if so we follow last emergency case. You > can see the trace but what would you do with that information? Yes, the benefit of the trace/warning is that the user can switch to a non-headless API and avoid the synchronize_rcu(), that would help them get faster kfree_rcu() performance instead of having silent slowdowns. It also tells us whether the headless API is worth it in the long run, I think it is worth it because we will likely never hit the synchronize_rcu() failsafe. But if we hit it a lot, at least it wont happen silently. Paul was concerned about following scenario with hitting synchronize_rcu(): 1. Consider a system under memory pressure. 2. Consider some other subsystem X depending on another system Y which uses kfree_rcu(). If Y doesn't complete the operation in time, X accumulates more memory. 3. Since kfree_rcu() on Y hits synchronize_rcu() a lot, it slows it down. This causes X to further allocate memory, further causing a chain reaction. Paul, please correct me if I'm wrong. thanks, - Joel
On Tue 31-03-20 10:58:06, Joel Fernandes wrote: [...] > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > index 4be763355c9fb..965deefffdd58 100644 > > --- a/kernel/rcu/tree.c > > +++ b/kernel/rcu/tree.c > > @@ -3149,7 +3149,7 @@ static inline struct rcu_head *attach_rcu_head_to_object(void *obj) > > > > if (!ptr) > > ptr = kmalloc(sizeof(unsigned long *) + > > - sizeof(struct rcu_head), GFP_ATOMIC | __GFP_NOWARN); > > + sizeof(struct rcu_head), GFP_MEMALLOC); > > Just to add, the main requirements here are: > 1. Allocation should be bounded in time. > 2. Allocation should try hard (possibly tapping into reserves) > 3. Sleeping is Ok but should not affect the time bound. __GFP_ATOMIC | __GFP_HIGH is the way to get an additional access to memory reserves regarless of the sleeping status. Using __GFP_MEMALLOC is quite dangerous because it can deplete _all_ the memory. What does prevent the above code path to do that? If a partial access to reserves is sufficient then why the existing modifiers (mentioned above are not sufficient?
On Tue, Mar 31, 2020 at 05:34:50PM +0200, Michal Hocko wrote: > On Tue 31-03-20 10:58:06, Joel Fernandes wrote: > [...] > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > > index 4be763355c9fb..965deefffdd58 100644 > > > --- a/kernel/rcu/tree.c > > > +++ b/kernel/rcu/tree.c > > > @@ -3149,7 +3149,7 @@ static inline struct rcu_head *attach_rcu_head_to_object(void *obj) > > > > > > if (!ptr) > > > ptr = kmalloc(sizeof(unsigned long *) + > > > - sizeof(struct rcu_head), GFP_ATOMIC | __GFP_NOWARN); > > > + sizeof(struct rcu_head), GFP_MEMALLOC); > > > > Just to add, the main requirements here are: > > 1. Allocation should be bounded in time. > > 2. Allocation should try hard (possibly tapping into reserves) > > 3. Sleeping is Ok but should not affect the time bound. > > > __GFP_ATOMIC | __GFP_HIGH is the way to get an additional access to > memory reserves regarless of the sleeping status. > > Using __GFP_MEMALLOC is quite dangerous because it can deplete _all_ the > memory. What does prevent the above code path to do that? Can you suggest what prevents other users of GFP_MEMALLOC from doing that also? That's the whole point of having a reserve, in normal usage no one will use it, but some times you need to use it. Keep in mind this is not a common case in this code here, this is triggered only if earlier allocation attempts failed. Only *then* we try with GFP_MEMALLOC with promises to free additional memory soon. > If a partial access to reserves is sufficient then why the existing > modifiers (mentioned above are not sufficient? The point with using GFP_MEMALLOC is it is useful for situations where you are about to free memory and needed some memory temporarily, to free that. It depletes it a bit temporarily to free even more. Is that not the point of PF_MEMALLOC? * %__GFP_MEMALLOC allows access to all memory. This should only be used when * the caller guarantees the allocation will allow more memory to be freed * very shortly e.g. process exiting or swapping. Users either should * be the MM or co-ordinating closely with the VM (e.g. swap over NFS). I was just recommending usage of this flag here because it fits the requirement of allocating some memory to free some memory. I am also Ok with GFP_ATOMIC with the GFP_NOWARN removed, if you are Ok with that. thanks, - Joel
> > Yes, I mean __GFP_MEMALLOC. Sorry, the patch was just to show the idea and > marked as RFC. > > Good point on the atomic aspect of this path, you are right we cannot sleep. > I believe the GFP_NOWAIT I mentioned in my last reply will take care of that? > I think there should be GFP_ATOMIC used, because it has more chance to return memory then GFP_NOWAIT. I see that Michal has same view on it. > > As for removing __GFP_NOWARN. Actually it is expectable that an > > allocation can fail, if so we follow last emergency case. You > > can see the trace but what would you do with that information? > > Yes, the benefit of the trace/warning is that the user can switch to a > non-headless API and avoid the synchronize_rcu(), that would help them get > faster kfree_rcu() performance instead of having silent slowdowns. > Agree. What about just adding WARN_ON_ONCE()? I am just thinking if it could be harmful or not. > > It also tells us whether the headless API is worth it in the long run, I > think it is worth it because we will likely never hit the synchronize_rcu() > failsafe. But if we hit it a lot, at least it wont happen silently. > Agree. > Paul was concerned about following scenario with hitting synchronize_rcu(): > 1. Consider a system under memory pressure. > 2. Consider some other subsystem X depending on another system Y which uses > kfree_rcu(). If Y doesn't complete the operation in time, X accumulates > more memory. > 3. Since kfree_rcu() on Y hits synchronize_rcu() a lot, it slows it down. > This causes X to further allocate memory, further causing a chain > reaction. > Paul, please correct me if I'm wrong. > I see your point and agree that in theory it can happen. So, we should make it more tight when it comes to rcu_head attachment logic. -- Vlad Rezki
> > __GFP_ATOMIC | __GFP_HIGH is the way to get an additional access to > memory reserves regarless of the sleeping status. > Michal, just one question here regarding proposed flags. Can we also tight it with __GFP_RETRY_MAYFAIL flag? Means it also can repeat a few times in order to increase the chance of being success. Thanks! -- Vlad Rezki
> > > > Paul was concerned about following scenario with hitting synchronize_rcu(): > > 1. Consider a system under memory pressure. > > 2. Consider some other subsystem X depending on another system Y which uses > > kfree_rcu(). If Y doesn't complete the operation in time, X accumulates > > more memory. > > 3. Since kfree_rcu() on Y hits synchronize_rcu() a lot, it slows it down. > > This causes X to further allocate memory, further causing a chain > > reaction. > > Paul, please correct me if I'm wrong. > > > I see your point and agree that in theory it can happen. So, we should > make it more tight when it comes to rcu_head attachment logic. > Just adding more thoughts about such concern. Even though in theory we can run into something like that. But also please note, that under high memory pressure it also does not mean that (X) will always succeed with further infinite allocations, so memory pressure is something common. As soon as the situation becomes slightly better we do our work much efficient. Practically, i was trying to simulate memory pressure to hit synchronize_rcu() on my test system. By just simulating head-less freeing(for any object) and by always dynamic attaching path. So i could trigger it, but that was really hard to achieve and it happened only few times. So that was not like a constant hit. What i got constantly were: - System got recovered and proceed with "normal" path; - The OOM hit as a final step, when the system is run out of memory fully. So, practically i have not seen massive synchronize_rcu() hit. -- Vlad Rezki
On Tue, Mar 31, 2020 at 07:02:32PM +0200, Uladzislau Rezki wrote: > > > > > > Paul was concerned about following scenario with hitting synchronize_rcu(): > > > 1. Consider a system under memory pressure. > > > 2. Consider some other subsystem X depending on another system Y which uses > > > kfree_rcu(). If Y doesn't complete the operation in time, X accumulates > > > more memory. > > > 3. Since kfree_rcu() on Y hits synchronize_rcu() a lot, it slows it down. > > > This causes X to further allocate memory, further causing a chain > > > reaction. > > > Paul, please correct me if I'm wrong. > > > > > I see your point and agree that in theory it can happen. So, we should > > make it more tight when it comes to rcu_head attachment logic. > > > Just adding more thoughts about such concern. Even though in theory we > can run into something like that. But also please note, that under high > memory pressure it also does not mean that (X) will always succeed with > further infinite allocations, so memory pressure is something common. > As soon as the situation becomes slightly better we do our work much > efficient. > > Practically, i was trying to simulate memory pressure to hit synchronize_rcu() > on my test system. By just simulating head-less freeing(for any object) and > by always dynamic attaching path. So i could trigger it, but that was really > hard to achieve and it happened only few times. So that was not like a constant > hit. What i got constantly were: > > - System got recovered and proceed with "normal" path; > - The OOM hit as a final step, when the system is run out of memory fully. > > So, practically i have not seen massive synchronize_rcu() hit. Understood, but given the attractive properties of headless kfree_rcu(), it is not unreasonable to expect its usage to remain low. In addition, memory-pressure scenarios can be quite involved. Finally, as Joel pointed out offlist, the per-CPU cached structure acts as a small portion of kfree_rcu()-specific reserved memory, so you guys have at least partially addressed parts of my concerns already. I am not at all a fan of using GFP_MEMALLOC because kfree_rcu() is sufficiently low-level to be in the business of ensuring its own forward progress. Thanx, Paul
On Tue, Mar 31, 2020 at 06:01:19PM +0200, Uladzislau Rezki wrote: > > > > Yes, I mean __GFP_MEMALLOC. Sorry, the patch was just to show the idea and > > marked as RFC. > > > > Good point on the atomic aspect of this path, you are right we cannot sleep. > > I believe the GFP_NOWAIT I mentioned in my last reply will take care of that? > > > I think there should be GFP_ATOMIC used, because it has more chance to > return memory then GFP_NOWAIT. I see that Michal has same view on it. I don't think so because GFP_ATOMIC implies GFP_NOWAIT. I am Ok with keeping the GFP_ATOMIC as it is btw. Paul mentioned he prefers this. I agree with that as well. > > > As for removing __GFP_NOWARN. Actually it is expectable that an > > > allocation can fail, if so we follow last emergency case. You > > > can see the trace but what would you do with that information? > > > > Yes, the benefit of the trace/warning is that the user can switch to a > > non-headless API and avoid the synchronize_rcu(), that would help them get > > faster kfree_rcu() performance instead of having silent slowdowns. > > > Agree. What about just adding WARN_ON_ONCE()? I am just thinking if it > could be harmful or not. You mean WARN_ON_ONCE() before the synchronize_rcu() right? We could do that. Paul mentioned to me he prefers if this new warning can be turned off with a boot parameter since some future user may prefer no warning. I also agree. If we add this then we can keep your __GFP_NOWARN flag with no additional GFP flag changes. > > It also tells us whether the headless API is worth it in the long run, I > > think it is worth it because we will likely never hit the synchronize_rcu() > > failsafe. But if we hit it a lot, at least it wont happen silently. > > > Agree. > > > Paul was concerned about following scenario with hitting synchronize_rcu(): > > 1. Consider a system under memory pressure. > > 2. Consider some other subsystem X depending on another system Y which uses > > kfree_rcu(). If Y doesn't complete the operation in time, X accumulates > > more memory. > > 3. Since kfree_rcu() on Y hits synchronize_rcu() a lot, it slows it down. > > This causes X to further allocate memory, further causing a chain > > reaction. > > Paul, please correct me if I'm wrong. > > > I see your point and agree that in theory it can happen. So, we should > make it more tight when it comes to rcu_head attachment logic. Right. Per discussion with Paul, we discussed that it is better if we pre-allocate N number of array blocks per-CPU and use it for the cache. Default for N being 1 and tunable with a boot parameter. I agree with this. In current code, we have 1 cache page per CPU, but this is allocated only on the first kvfree_rcu() request. So we could change this behavior as well to make it pre-allocated. Does this all sound good to you? thanks, - Joel
On Tue, Mar 31 2020, Joel Fernandes wrote: > On Tue, Mar 31, 2020 at 05:34:50PM +0200, Michal Hocko wrote: >> On Tue 31-03-20 10:58:06, Joel Fernandes wrote: >> [...] >> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c >> > > index 4be763355c9fb..965deefffdd58 100644 >> > > --- a/kernel/rcu/tree.c >> > > +++ b/kernel/rcu/tree.c >> > > @@ -3149,7 +3149,7 @@ static inline struct rcu_head *attach_rcu_head_to_object(void *obj) >> > > >> > > if (!ptr) >> > > ptr = kmalloc(sizeof(unsigned long *) + >> > > - sizeof(struct rcu_head), GFP_ATOMIC | __GFP_NOWARN); >> > > + sizeof(struct rcu_head), GFP_MEMALLOC); >> > >> > Just to add, the main requirements here are: >> > 1. Allocation should be bounded in time. >> > 2. Allocation should try hard (possibly tapping into reserves) >> > 3. Sleeping is Ok but should not affect the time bound. >> >> >> __GFP_ATOMIC | __GFP_HIGH is the way to get an additional access to >> memory reserves regarless of the sleeping status. >> >> Using __GFP_MEMALLOC is quite dangerous because it can deplete _all_ the >> memory. What does prevent the above code path to do that? > > Can you suggest what prevents other users of GFP_MEMALLOC from doing that > also? That's the whole point of having a reserve, in normal usage no one will > use it, but some times you need to use it. Keep in mind this is not a common > case in this code here, this is triggered only if earlier allocation attempts > failed. Only *then* we try with GFP_MEMALLOC with promises to free additional > memory soon. I think that "soon" is the key point. Users of __GFP_MEMALLOC certainly must be working to free other memory, that other memory needs to be freed "soon". In particular - sooner than all the reserve is exhausted. This can require rate-limiting. If one allocation can result in one page being freed, that is good and it is probably OK to have 1000 allocations resulting in 1000 pages being freed soon. But 10 million allocation to gain 10 million pages is not such a good thing and shouldn't be needed. Once those first 1000 pages have been freed, you won't need __GFP_MEMALLOC allocations any more, and you must be prepare to wait for them. So where does the rate-limiting happen in your proposal? A GP can be multiple milliseconds, which is time for lots of memory to be allocated and for rcu-free queues to grow quite large. You mention a possible fall-back of calling synchronize_rcu(). I think that needs to be a fallback that happens well before __GFP_MEMALLOC is exhausted. You need to choose some maximum amount that you will allocate, then use synchronize_rcu() (or probably the _expedited version) after that. The pool of reserves are certainly there for you to use, but not for you to exhaust. If you have your own rate-limiting, then I think __GFP_MEMALLOC is probably OK, and also you *don't* want the memalloc to wait. If memory cannot be allocated immediately, you need to use your own fallback. NeilBrown
On Wed, Apr 01, 2020 at 09:19:49AM +1100, NeilBrown wrote: > On Tue, Mar 31 2020, Joel Fernandes wrote: > > > On Tue, Mar 31, 2020 at 05:34:50PM +0200, Michal Hocko wrote: > >> On Tue 31-03-20 10:58:06, Joel Fernandes wrote: > >> [...] > >> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > >> > > index 4be763355c9fb..965deefffdd58 100644 > >> > > --- a/kernel/rcu/tree.c > >> > > +++ b/kernel/rcu/tree.c > >> > > @@ -3149,7 +3149,7 @@ static inline struct rcu_head *attach_rcu_head_to_object(void *obj) > >> > > > >> > > if (!ptr) > >> > > ptr = kmalloc(sizeof(unsigned long *) + > >> > > - sizeof(struct rcu_head), GFP_ATOMIC | __GFP_NOWARN); > >> > > + sizeof(struct rcu_head), GFP_MEMALLOC); > >> > > >> > Just to add, the main requirements here are: > >> > 1. Allocation should be bounded in time. > >> > 2. Allocation should try hard (possibly tapping into reserves) > >> > 3. Sleeping is Ok but should not affect the time bound. > >> > >> > >> __GFP_ATOMIC | __GFP_HIGH is the way to get an additional access to > >> memory reserves regarless of the sleeping status. > >> > >> Using __GFP_MEMALLOC is quite dangerous because it can deplete _all_ the > >> memory. What does prevent the above code path to do that? > > > > Can you suggest what prevents other users of GFP_MEMALLOC from doing that > > also? That's the whole point of having a reserve, in normal usage no one will > > use it, but some times you need to use it. Keep in mind this is not a common > > case in this code here, this is triggered only if earlier allocation attempts > > failed. Only *then* we try with GFP_MEMALLOC with promises to free additional > > memory soon. > > I think that "soon" is the key point. Users of __GFP_MEMALLOC certainly > must be working to free other memory, that other memory needs to be freed > "soon". In particular - sooner than all the reserve is exhausted. This > can require rate-limiting. If one allocation can result in one page > being freed, that is good and it is probably OK to have 1000 allocations > resulting in 1000 pages being freed soon. But 10 million allocation to > gain 10 million pages is not such a good thing and shouldn't be needed. > Once those first 1000 pages have been freed, you won't need > __GFP_MEMALLOC allocations any more, and you must be prepare to wait for > them. > > So where does the rate-limiting happen in your proposal? A GP can be > multiple milliseconds, which is time for lots of memory to be allocated > and for rcu-free queues to grow quite large. > > You mention a possible fall-back of calling synchronize_rcu(). I think > that needs to be a fallback that happens well before __GFP_MEMALLOC is > exhausted. You need to choose some maximum amount that you will > allocate, then use synchronize_rcu() (or probably the _expedited > version) after that. The pool of reserves are certainly there for you > to use, but not for you to exhaust. > > If you have your own rate-limiting, then I think __GFP_MEMALLOC is > probably OK, and also you *don't* want the memalloc to wait. If memory > cannot be allocated immediately, you need to use your own fallback. Thanks a lot for explaining in detail, the RFC patch has served its purpose well ;-) On discussing with RCU comrades, we agreed to not use GFP_MEMALLOC. But instead pre-allocate a cache (we do have a cache but it is not yet pre-allocated, just allocated on demand). About the rate limiting, we would fallback to synchronize_rcu() instead of sleeping in case of trobule. However I would like to add a warning if we ever hit the troublesome path mainly because that means we depleted the pre-allocated cache and perhaps the user should switch to adding an rcu_head in their structure to reduce latency. I'm adding that warning to my tree: diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 4be763355c9fb..6172e6296dd7d 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -110,6 +110,10 @@ module_param(rcu_fanout_exact, bool, 0444); static int rcu_fanout_leaf = RCU_FANOUT_LEAF; module_param(rcu_fanout_leaf, int, 0444); int rcu_num_lvls __read_mostly = RCU_NUM_LVLS; +/* Silence the kvfree_rcu() complaint (warning) that it blocks */ +int rcu_kfree_nowarn; +module_param(rcu_kfree_nowarn, int, 0444); + /* Number of rcu_nodes at specified level. */ int num_rcu_lvl[] = NUM_RCU_LVL_INIT; int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */ @@ -3266,6 +3270,12 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func) * state. */ if (!success) { + /* + * Please embed an rcu_head and pass it along if you hit this + * warning. Doing so would avoid long kfree_rcu() latencies. + */ + if (!rcu_kfree_nowarn) + WARN_ON_ONCE(1); debug_rcu_head_unqueue(ptr); synchronize_rcu(); kvfree(ptr);
On Tue, Mar 31 2020, Joel Fernandes wrote: > On Wed, Apr 01, 2020 at 09:19:49AM +1100, NeilBrown wrote: >> On Tue, Mar 31 2020, Joel Fernandes wrote: >> >> > On Tue, Mar 31, 2020 at 05:34:50PM +0200, Michal Hocko wrote: >> >> On Tue 31-03-20 10:58:06, Joel Fernandes wrote: >> >> [...] >> >> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c >> >> > > index 4be763355c9fb..965deefffdd58 100644 >> >> > > --- a/kernel/rcu/tree.c >> >> > > +++ b/kernel/rcu/tree.c >> >> > > @@ -3149,7 +3149,7 @@ static inline struct rcu_head *attach_rcu_head_to_object(void *obj) >> >> > > >> >> > > if (!ptr) >> >> > > ptr = kmalloc(sizeof(unsigned long *) + >> >> > > - sizeof(struct rcu_head), GFP_ATOMIC | __GFP_NOWARN); >> >> > > + sizeof(struct rcu_head), GFP_MEMALLOC); >> >> > >> >> > Just to add, the main requirements here are: >> >> > 1. Allocation should be bounded in time. >> >> > 2. Allocation should try hard (possibly tapping into reserves) >> >> > 3. Sleeping is Ok but should not affect the time bound. >> >> >> >> >> >> __GFP_ATOMIC | __GFP_HIGH is the way to get an additional access to >> >> memory reserves regarless of the sleeping status. >> >> >> >> Using __GFP_MEMALLOC is quite dangerous because it can deplete _all_ the >> >> memory. What does prevent the above code path to do that? >> > >> > Can you suggest what prevents other users of GFP_MEMALLOC from doing that >> > also? That's the whole point of having a reserve, in normal usage no one will >> > use it, but some times you need to use it. Keep in mind this is not a common >> > case in this code here, this is triggered only if earlier allocation attempts >> > failed. Only *then* we try with GFP_MEMALLOC with promises to free additional >> > memory soon. >> >> I think that "soon" is the key point. Users of __GFP_MEMALLOC certainly >> must be working to free other memory, that other memory needs to be freed >> "soon". In particular - sooner than all the reserve is exhausted. This >> can require rate-limiting. If one allocation can result in one page >> being freed, that is good and it is probably OK to have 1000 allocations >> resulting in 1000 pages being freed soon. But 10 million allocation to >> gain 10 million pages is not such a good thing and shouldn't be needed. >> Once those first 1000 pages have been freed, you won't need >> __GFP_MEMALLOC allocations any more, and you must be prepare to wait for >> them. >> >> So where does the rate-limiting happen in your proposal? A GP can be >> multiple milliseconds, which is time for lots of memory to be allocated >> and for rcu-free queues to grow quite large. >> >> You mention a possible fall-back of calling synchronize_rcu(). I think >> that needs to be a fallback that happens well before __GFP_MEMALLOC is >> exhausted. You need to choose some maximum amount that you will >> allocate, then use synchronize_rcu() (or probably the _expedited >> version) after that. The pool of reserves are certainly there for you >> to use, but not for you to exhaust. >> >> If you have your own rate-limiting, then I think __GFP_MEMALLOC is >> probably OK, and also you *don't* want the memalloc to wait. If memory >> cannot be allocated immediately, you need to use your own fallback. > > Thanks a lot for explaining in detail, the RFC patch has served its purpose > well ;-) > > On discussing with RCU comrades, we agreed to not use GFP_MEMALLOC. But > instead pre-allocate a cache (we do have a cache but it is not yet > pre-allocated, just allocated on demand). > > About the rate limiting, we would fallback to synchronize_rcu() instead of > sleeping in case of trobule. However I would like to add a warning if we ever > hit the troublesome path mainly because that means we depleted the > pre-allocated cache and perhaps the user should switch to adding an rcu_head > in their structure to reduce latency. I'm adding that warning to my tree: If this warning is only interesting to developers, I think you should only show it to developers, not to end-users. i.e. protect it with CONFIG_DEBUG_RCU or something like that. NeilBrown > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index 4be763355c9fb..6172e6296dd7d 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -110,6 +110,10 @@ module_param(rcu_fanout_exact, bool, 0444); > static int rcu_fanout_leaf = RCU_FANOUT_LEAF; > module_param(rcu_fanout_leaf, int, 0444); > int rcu_num_lvls __read_mostly = RCU_NUM_LVLS; > +/* Silence the kvfree_rcu() complaint (warning) that it blocks */ > +int rcu_kfree_nowarn; > +module_param(rcu_kfree_nowarn, int, 0444); > + > /* Number of rcu_nodes at specified level. */ > int num_rcu_lvl[] = NUM_RCU_LVL_INIT; > int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */ > @@ -3266,6 +3270,12 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func) > * state. > */ > if (!success) { > + /* > + * Please embed an rcu_head and pass it along if you hit this > + * warning. Doing so would avoid long kfree_rcu() latencies. > + */ > + if (!rcu_kfree_nowarn) > + WARN_ON_ONCE(1); > debug_rcu_head_unqueue(ptr); > synchronize_rcu(); > kvfree(ptr);
On Tue 31-03-20 18:12:15, Uladzislau Rezki wrote: > > > > __GFP_ATOMIC | __GFP_HIGH is the way to get an additional access to > > memory reserves regarless of the sleeping status. > > > Michal, just one question here regarding proposed flags. Can we also > tight it with __GFP_RETRY_MAYFAIL flag? Means it also can repeat a few > times in order to increase the chance of being success. yes, __GFP_RETRY_MAYFAIL is perfectly valid with __GFP_ATOMIC. Please note that __GFP_ATOMIC, despite its name, doesn't imply an atomic allocation which cannot sleep. Quite confusing, I know. A much better name would be __GFP_RESERVES or something like that.
On Tue 31-03-20 12:01:17, Joel Fernandes wrote: > On Tue, Mar 31, 2020 at 05:34:50PM +0200, Michal Hocko wrote: > > On Tue 31-03-20 10:58:06, Joel Fernandes wrote: > > [...] > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > > > index 4be763355c9fb..965deefffdd58 100644 > > > > --- a/kernel/rcu/tree.c > > > > +++ b/kernel/rcu/tree.c > > > > @@ -3149,7 +3149,7 @@ static inline struct rcu_head *attach_rcu_head_to_object(void *obj) > > > > > > > > if (!ptr) > > > > ptr = kmalloc(sizeof(unsigned long *) + > > > > - sizeof(struct rcu_head), GFP_ATOMIC | __GFP_NOWARN); > > > > + sizeof(struct rcu_head), GFP_MEMALLOC); > > > > > > Just to add, the main requirements here are: > > > 1. Allocation should be bounded in time. > > > 2. Allocation should try hard (possibly tapping into reserves) > > > 3. Sleeping is Ok but should not affect the time bound. > > > > > > __GFP_ATOMIC | __GFP_HIGH is the way to get an additional access to > > memory reserves regarless of the sleeping status. > > > > Using __GFP_MEMALLOC is quite dangerous because it can deplete _all_ the > > memory. What does prevent the above code path to do that? Neil has provided a nice explanation down the email thread. But let me clarify few things here. > Can you suggest what prevents other users of GFP_MEMALLOC from doing that > also? There is no explicit mechanism which is indeed unfortunate. The only user real user of the flag is Swap over NFS AFAIK. I have never dared to look into details on how the complete reserves depletion is prevented. Mel would be much better fit here. > That's the whole point of having a reserve, in normal usage no one will > use it, but some times you need to use it. Keep in mind this is not a common > case in this code here, this is triggered only if earlier allocation attempts > failed. Only *then* we try with GFP_MEMALLOC with promises to free additional > memory soon. You are right that this is the usecase for the flag. But this should be done with an extreme care because the core MM relies on those reserves so any other users should better make sure they do not consume a lot from reserves as well. > > If a partial access to reserves is sufficient then why the existing > > modifiers (mentioned above are not sufficient? > > The point with using GFP_MEMALLOC is it is useful for situations where you > are about to free memory and needed some memory temporarily, to free that. It > depletes it a bit temporarily to free even more. Is that not the point of > PF_MEMALLOC? > * %__GFP_MEMALLOC allows access to all memory. This should only be used when > * the caller guarantees the allocation will allow more memory to be freed > * very shortly e.g. process exiting or swapping. Users either should > * be the MM or co-ordinating closely with the VM (e.g. swap over NFS). > > I was just recommending usage of this flag here because it fits the > requirement of allocating some memory to free some memory. I am also Ok with > GFP_ATOMIC with the GFP_NOWARN removed, if you are Ok with that. Maybe we need to refine this documentation to be more explicit about an extreme care to be taken when using the flag. diff --git a/include/linux/gfp.h b/include/linux/gfp.h index e5b817cb86e7..e436a7e28392 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -110,6 +110,9 @@ struct vm_area_struct; * the caller guarantees the allocation will allow more memory to be freed * very shortly e.g. process exiting or swapping. Users either should * be the MM or co-ordinating closely with the VM (e.g. swap over NFS). + * Users of this flag have to be extremely careful to not deplete the reserve + * completely and implement a throttling mechanism which controls the consumption + * based on the amount of freed memory. * * %__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves. * This takes precedence over the %__GFP_MEMALLOC flag if both are set.
On April 1, 2020 3:23:59 AM EDT, Michal Hocko <mhocko@kernel.org> wrote: >On Tue 31-03-20 12:01:17, Joel Fernandes wrote: >> On Tue, Mar 31, 2020 at 05:34:50PM +0200, Michal Hocko wrote: >> > On Tue 31-03-20 10:58:06, Joel Fernandes wrote: >> > [...] >> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c >> > > > index 4be763355c9fb..965deefffdd58 100644 >> > > > --- a/kernel/rcu/tree.c >> > > > +++ b/kernel/rcu/tree.c >> > > > @@ -3149,7 +3149,7 @@ static inline struct rcu_head >*attach_rcu_head_to_object(void *obj) >> > > > >> > > > if (!ptr) >> > > > ptr = kmalloc(sizeof(unsigned long *) + >> > > > - sizeof(struct rcu_head), GFP_ATOMIC | __GFP_NOWARN); >> > > > + sizeof(struct rcu_head), GFP_MEMALLOC); >> > > >> > > Just to add, the main requirements here are: >> > > 1. Allocation should be bounded in time. >> > > 2. Allocation should try hard (possibly tapping into reserves) >> > > 3. Sleeping is Ok but should not affect the time bound. >> > >> > >> > __GFP_ATOMIC | __GFP_HIGH is the way to get an additional access to >> > memory reserves regarless of the sleeping status. >> > >> > Using __GFP_MEMALLOC is quite dangerous because it can deplete >_all_ the >> > memory. What does prevent the above code path to do that? > >Neil has provided a nice explanation down the email thread. But let me >clarify few things here. > >> Can you suggest what prevents other users of GFP_MEMALLOC from doing >that >> also? > >There is no explicit mechanism which is indeed unfortunate. The only >user real user of the flag is Swap over NFS AFAIK. I have never dared >to >look into details on how the complete reserves depletion is prevented. >Mel would be much better fit here. > >> That's the whole point of having a reserve, in normal usage no one >will >> use it, but some times you need to use it. Keep in mind this is not a >common >> case in this code here, this is triggered only if earlier allocation >attempts >> failed. Only *then* we try with GFP_MEMALLOC with promises to free >additional >> memory soon. > >You are right that this is the usecase for the flag. But this should be >done with an extreme care because the core MM relies on those reserves >so any other users should better make sure they do not consume a lot >from reserves as well. > Understood and agreed. >> > If a partial access to reserves is sufficient then why the existing >> > modifiers (mentioned above are not sufficient? >> >> The point with using GFP_MEMALLOC is it is useful for situations >where you >> are about to free memory and needed some memory temporarily, to free >that. It >> depletes it a bit temporarily to free even more. Is that not the >point of >> PF_MEMALLOC? >> * %__GFP_MEMALLOC allows access to all memory. This should only be >used when >> * the caller guarantees the allocation will allow more memory to be >freed >> * very shortly e.g. process exiting or swapping. Users either should >> * be the MM or co-ordinating closely with the VM (e.g. swap over >NFS). >> >> I was just recommending usage of this flag here because it fits the >> requirement of allocating some memory to free some memory. I am also >Ok with >> GFP_ATOMIC with the GFP_NOWARN removed, if you are Ok with that. > >Maybe we need to refine this documentation to be more explicit about an >extreme care to be taken when using the flag. > >diff --git a/include/linux/gfp.h b/include/linux/gfp.h >index e5b817cb86e7..e436a7e28392 100644 >--- a/include/linux/gfp.h >+++ b/include/linux/gfp.h >@@ -110,6 +110,9 @@ struct vm_area_struct; >* the caller guarantees the allocation will allow more memory to be >freed > * very shortly e.g. process exiting or swapping. Users either should > * be the MM or co-ordinating closely with the VM (e.g. swap over NFS). >+ * Users of this flag have to be extremely careful to not deplete the >reserve >+ * completely and implement a throttling mechanism which controls the >consumption >+ * based on the amount of freed memory. > * >* %__GFP_NOMEMALLOC is used to explicitly forbid access to emergency >reserves. > * This takes precedence over the %__GFP_MEMALLOC flag if both are set. I am in support of this documentation patch. I would say "consumption of the reserve". Thanks, - Joel
On Wed 01-04-20 07:14:16, joel@joelfernandes.org wrote: [...] > I am in support of this documentation patch. I would say "consumption of the reserve". Like this? From afc9c4e56c6dd5f59c1cf5f95ad42a0e7cd78b2e Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.com> Date: Wed, 1 Apr 2020 14:00:56 +0200 Subject: [PATCH] mm: clarify __GFP_MEMALLOC usage It seems that the existing documentation is not explicit about the expected usage and potential risks enough. While it is calls out that users have to free memory when using this flag it is not really apparent that users have to careful to not deplete memory reserves and that they should implement some sort of throttling wrt. freeing process. This is partly based on Neil's explanation [1]. [1] http://lkml.kernel.org/r/877dz0yxoa.fsf@notabene.neil.brown.name Signed-off-by: Michal Hocko <mhocko@suse.com> --- include/linux/gfp.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index e5b817cb86e7..e3ab1c0d9140 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -110,6 +110,9 @@ struct vm_area_struct; * the caller guarantees the allocation will allow more memory to be freed * very shortly e.g. process exiting or swapping. Users either should * be the MM or co-ordinating closely with the VM (e.g. swap over NFS). + * Users of this flag have to be extremely careful to not deplete the reserve + * completely and implement a throttling mechanism which controls the consumption + * of the reserve based on the amount of freed memory. * * %__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves. * This takes precedence over the %__GFP_MEMALLOC flag if both are set.
> > I think there should be GFP_ATOMIC used, because it has more chance to > > return memory then GFP_NOWAIT. I see that Michal has same view on it. > > I don't think so because GFP_ATOMIC implies GFP_NOWAIT. I am Ok with keeping > the GFP_ATOMIC as it is btw. Paul mentioned he prefers this. I agree with > that as well. > GFP_ATOMIC can access to reserved memory whereas GFP_NOWAIT is not eligible to do so. So there is difference between them :) > > > > > > Yes, the benefit of the trace/warning is that the user can switch to a > > > non-headless API and avoid the synchronize_rcu(), that would help them get > > > faster kfree_rcu() performance instead of having silent slowdowns. > > > > > Agree. What about just adding WARN_ON_ONCE()? I am just thinking if it > > could be harmful or not. > > You mean WARN_ON_ONCE() before the synchronize_rcu() right? We could do that. > Paul mentioned to me he prefers if this new warning can be turned off with a > boot parameter since some future user may prefer no warning. I also agree. > Yes, we can add it before doing synchronize_rcu(). WARN_ON_ONCE() will emit only once the warning. I think that would be enough to pay an attention. > > If we add this then we can keep your __GFP_NOWARN flag with no additional GFP > flag changes. > We can also add __GFP_RETRY_MAYFAIL to GFP_ATOMIC to make it more tight. Basically your patch can be modified just adding that. > > > It also tells us whether the headless API is worth it in the long run, I > > > think it is worth it because we will likely never hit the synchronize_rcu() > > > failsafe. But if we hit it a lot, at least it wont happen silently. > > > > > Agree. > > > > > Paul was concerned about following scenario with hitting synchronize_rcu(): > > > 1. Consider a system under memory pressure. > > > 2. Consider some other subsystem X depending on another system Y which uses > > > kfree_rcu(). If Y doesn't complete the operation in time, X accumulates > > > more memory. > > > 3. Since kfree_rcu() on Y hits synchronize_rcu() a lot, it slows it down. > > > This causes X to further allocate memory, further causing a chain > > > reaction. > > > Paul, please correct me if I'm wrong. > > > > > I see your point and agree that in theory it can happen. So, we should > > make it more tight when it comes to rcu_head attachment logic. > > Right. Per discussion with Paul, we discussed that it is better if we > pre-allocate N number of array blocks per-CPU and use it for the cache. > Default for N being 1 and tunable with a boot parameter. I agree with this. > As discussed before, we can make use of memory pool API for such purpose. But i am not sure if it should be one pool per CPU or one pool per NR_CPUS, that would contain NR_CPUS * N pre-allocated blocks. > In current code, we have 1 cache page per CPU, but this is allocated only on > the first kvfree_rcu() request. So we could change this behavior as well to > make it pre-allocated. > > Does this all sound good to you? > I think that makes sense :) -- Vlad Rezki
On Wed, Apr 01, 2020 at 09:09:58AM +0200, Michal Hocko wrote: > On Tue 31-03-20 18:12:15, Uladzislau Rezki wrote: > > > > > > __GFP_ATOMIC | __GFP_HIGH is the way to get an additional access to > > > memory reserves regarless of the sleeping status. > > > > > Michal, just one question here regarding proposed flags. Can we also > > tight it with __GFP_RETRY_MAYFAIL flag? Means it also can repeat a few > > times in order to increase the chance of being success. > > yes, __GFP_RETRY_MAYFAIL is perfectly valid with __GFP_ATOMIC. Please > note that __GFP_ATOMIC, despite its name, doesn't imply an atomic > allocation which cannot sleep. Quite confusing, I know. A much better > name would be __GFP_RESERVES or something like that. > OK. Then we can use GFP_ATOMIC | __GFP_RETRY_MAYFAIL to try in more harder way. Thanks! -- Vlad Rezki
On Wed 01-04-20 14:32:30, Uladzislau Rezki wrote: > On Wed, Apr 01, 2020 at 09:09:58AM +0200, Michal Hocko wrote: > > On Tue 31-03-20 18:12:15, Uladzislau Rezki wrote: > > > > > > > > __GFP_ATOMIC | __GFP_HIGH is the way to get an additional access to > > > > memory reserves regarless of the sleeping status. > > > > > > > Michal, just one question here regarding proposed flags. Can we also > > > tight it with __GFP_RETRY_MAYFAIL flag? Means it also can repeat a few > > > times in order to increase the chance of being success. > > > > yes, __GFP_RETRY_MAYFAIL is perfectly valid with __GFP_ATOMIC. Please > > note that __GFP_ATOMIC, despite its name, doesn't imply an atomic > > allocation which cannot sleep. Quite confusing, I know. A much better > > name would be __GFP_RESERVES or something like that. > > > OK. Then we can use GFP_ATOMIC | __GFP_RETRY_MAYFAIL to try in more harder > way. Please note the difference between __GFP_ATOMIC and GFP_ATOMIC. The later is a highlevel flag to use for atomic contexts. The former is an explicit way to give an access to memory reserves. I am not familiar with your code but if you have an existing gfp context coming from the caller then just do (gfp | __GFP_ATOMIC | __GFP_HIGH | __GFP_RETRY_MAYFAIL). If you do not have any gfp then decide based on whether the current context is allowed to sleep gfp = GFP_KERNEL | __GFP_ATOMIC | __GFP_HIGH | __GFP_RETRY_MAYFAIL; if (!sleepable) gfp &= ~__GFP_DIRECT_RECLAIM;
On Wed, Apr 01, 2020 at 02:55:03PM +0200, Michal Hocko wrote: > On Wed 01-04-20 14:32:30, Uladzislau Rezki wrote: > > On Wed, Apr 01, 2020 at 09:09:58AM +0200, Michal Hocko wrote: > > > On Tue 31-03-20 18:12:15, Uladzislau Rezki wrote: > > > > > > > > > > __GFP_ATOMIC | __GFP_HIGH is the way to get an additional access to > > > > > memory reserves regarless of the sleeping status. > > > > > > > > > Michal, just one question here regarding proposed flags. Can we also > > > > tight it with __GFP_RETRY_MAYFAIL flag? Means it also can repeat a few > > > > times in order to increase the chance of being success. > > > > > > yes, __GFP_RETRY_MAYFAIL is perfectly valid with __GFP_ATOMIC. Please > > > note that __GFP_ATOMIC, despite its name, doesn't imply an atomic > > > allocation which cannot sleep. Quite confusing, I know. A much better > > > name would be __GFP_RESERVES or something like that. > > > > > OK. Then we can use GFP_ATOMIC | __GFP_RETRY_MAYFAIL to try in more harder > > way. > > Please note the difference between __GFP_ATOMIC and GFP_ATOMIC. The > later is a highlevel flag to use for atomic contexts. The former is an > explicit way to give an access to memory reserves. I am not familiar > with your code but if you have an existing gfp context coming from the > caller then just do (gfp | __GFP_ATOMIC | __GFP_HIGH | __GFP_RETRY_MAYFAIL). > If you do not have any gfp then decide based on whether the current > context is allowed to sleep > gfp = GFP_KERNEL | __GFP_ATOMIC | __GFP_HIGH | __GFP_RETRY_MAYFAIL; > if (!sleepable) > gfp &= ~__GFP_DIRECT_RECLAIM; We call it from atomic context, so we can not sleep, also we do not have any existing context coming from the caller. I see that GFP_ATOMIC is high-level flag and is differ from __GFP_ATOMIC. It is defined as: #define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM) so basically we would like to have __GFP_KSWAPD_RECLAIM that is included in it, because it will also help in case of high memory pressure and wake-up kswapd to reclaim memory. We also can extract: __GFP_ATOMIC | __GFP_HIGH | __GFP_RETRY_MAYFAIL | __GFP_KSWAPD_RECLAIM but that is longer then GFP_ATMOC | __GFP_RETRY_MAYFAIL Am i missing something? Thank you, Michal! -- Vlad Rezki
On Wed, Apr 01, 2020 at 09:23:59AM +0200, Michal Hocko wrote: > > Can you suggest what prevents other users of GFP_MEMALLOC from doing that > > also? > > There is no explicit mechanism which is indeed unfortunate. The only > user real user of the flag is Swap over NFS AFAIK. I have never dared to > look into details on how the complete reserves depletion is prevented. > Mel would be much better fit here. > It's "prevented" by the fact that every other memory allocation request that is not involved with reclaiming memory gets stalled in the allocator with only the swap subsystem making any progress until the machine recovers. Potentially only kswapd is still running until the system recovers if stressed hard enough. The naming is terrible but is mased on kswapd's use of the PF_MEMALLOC flag. For swap-over-nfs, GFP_MEMALLOC saying "this allocation request is potentially needed for kswapd to make forward progress and not freeze". I would not be comfortable with kfree_rcu() doing the same thing because there can be many callers in parallel and it's freeing slab objects. Swap over NFS should free at least one page, freeing a slab object is not guaranteed to free anything.
On Wed 01-04-20 15:08:16, Uladzislau Rezki wrote: > On Wed, Apr 01, 2020 at 02:55:03PM +0200, Michal Hocko wrote: > > On Wed 01-04-20 14:32:30, Uladzislau Rezki wrote: > > > On Wed, Apr 01, 2020 at 09:09:58AM +0200, Michal Hocko wrote: > > > > On Tue 31-03-20 18:12:15, Uladzislau Rezki wrote: > > > > > > > > > > > > __GFP_ATOMIC | __GFP_HIGH is the way to get an additional access to > > > > > > memory reserves regarless of the sleeping status. > > > > > > > > > > > Michal, just one question here regarding proposed flags. Can we also > > > > > tight it with __GFP_RETRY_MAYFAIL flag? Means it also can repeat a few > > > > > times in order to increase the chance of being success. > > > > > > > > yes, __GFP_RETRY_MAYFAIL is perfectly valid with __GFP_ATOMIC. Please > > > > note that __GFP_ATOMIC, despite its name, doesn't imply an atomic > > > > allocation which cannot sleep. Quite confusing, I know. A much better > > > > name would be __GFP_RESERVES or something like that. > > > > > > > OK. Then we can use GFP_ATOMIC | __GFP_RETRY_MAYFAIL to try in more harder > > > way. > > > > Please note the difference between __GFP_ATOMIC and GFP_ATOMIC. The > > later is a highlevel flag to use for atomic contexts. The former is an > > explicit way to give an access to memory reserves. I am not familiar > > with your code but if you have an existing gfp context coming from the > > caller then just do (gfp | __GFP_ATOMIC | __GFP_HIGH | __GFP_RETRY_MAYFAIL). > > If you do not have any gfp then decide based on whether the current > > context is allowed to sleep > > gfp = GFP_KERNEL | __GFP_ATOMIC | __GFP_HIGH | __GFP_RETRY_MAYFAIL; > > if (!sleepable) > > gfp &= ~__GFP_DIRECT_RECLAIM; > > We call it from atomic context, so we can not sleep, also we do not have > any existing context coming from the caller. I see that GFP_ATOMIC is high-level > flag and is differ from __GFP_ATOMIC. It is defined as: > > #define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM) > > so basically we would like to have __GFP_KSWAPD_RECLAIM that is included in it, > because it will also help in case of high memory pressure and wake-up kswapd to > reclaim memory. > > We also can extract: > > __GFP_ATOMIC | __GFP_HIGH | __GFP_RETRY_MAYFAIL | __GFP_KSWAPD_RECLAIM > > but that is longer then > > GFP_ATMOC | __GFP_RETRY_MAYFAIL OK, if you are always in the atomic context then GFP_ATOMIC is sufficient. __GFP_RETRY_MAYFAIL will make no difference for allocations which do not reclaim (and thus not retry). Sorry this was not clear to me from the previous description.
> > We call it from atomic context, so we can not sleep, also we do not have > > any existing context coming from the caller. I see that GFP_ATOMIC is high-level > > flag and is differ from __GFP_ATOMIC. It is defined as: > > > > #define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM) > > > > so basically we would like to have __GFP_KSWAPD_RECLAIM that is included in it, > > because it will also help in case of high memory pressure and wake-up kswapd to > > reclaim memory. > > > > We also can extract: > > > > __GFP_ATOMIC | __GFP_HIGH | __GFP_RETRY_MAYFAIL | __GFP_KSWAPD_RECLAIM > > > > but that is longer then > > > > GFP_ATMOC | __GFP_RETRY_MAYFAIL > > OK, if you are always in the atomic context then GFP_ATOMIC is > sufficient. __GFP_RETRY_MAYFAIL will make no difference for allocations > which do not reclaim (and thus not retry). Sorry this was not clear to > me from the previous description. > Ahh. OK. Then adding __GFP_RETRY_MAYFAIL to GFP_ATOMIC will not make any effect. Thank you for your explanation! -- Vlad Rezki
On Wed, Apr 01, 2020 at 02:25:50PM +0200, Uladzislau Rezki wrote: [ . . . ] > > > > Paul was concerned about following scenario with hitting synchronize_rcu(): > > > > 1. Consider a system under memory pressure. > > > > 2. Consider some other subsystem X depending on another system Y which uses > > > > kfree_rcu(). If Y doesn't complete the operation in time, X accumulates > > > > more memory. > > > > 3. Since kfree_rcu() on Y hits synchronize_rcu() a lot, it slows it down. > > > > This causes X to further allocate memory, further causing a chain > > > > reaction. > > > > Paul, please correct me if I'm wrong. > > > > > > > I see your point and agree that in theory it can happen. So, we should > > > make it more tight when it comes to rcu_head attachment logic. > > > > Right. Per discussion with Paul, we discussed that it is better if we > > pre-allocate N number of array blocks per-CPU and use it for the cache. > > Default for N being 1 and tunable with a boot parameter. I agree with this. > > > As discussed before, we can make use of memory pool API for such > purpose. But i am not sure if it should be one pool per CPU or > one pool per NR_CPUS, that would contain NR_CPUS * N pre-allocated > blocks. There are advantages and disadvantages either way. The advantage of the per-CPU pool is that you don't have to worry about something like lock contention causing even more pain during an OOM event. One potential problem wtih the per-CPU pool can happen when callbacks are offloaded, in which case the CPUs needing the memory might never be getting it, because in the offloaded case (RCU_NOCB_CPU=y) the CPU posting callbacks might never be invoking them. But from what I know now, systems built with CONFIG_RCU_NOCB_CPU=y either don't have heavy callback loads (HPC systems) or are carefully configured (real-time systems). Plus large systems would probably end up needing something pretty close to a slab allocator to keep from dying from lock contention, and it is hard to justify that level of complexity at this point. Or is there some way to mark a specific slab allocator instance as being able to keep some amount of memory no matter what the OOM conditions are? If not, the current per-CPU pre-allocated cache is a better choice in the near term. Thanx, Paul > > In current code, we have 1 cache page per CPU, but this is allocated only on > > the first kvfree_rcu() request. So we could change this behavior as well to > > make it pre-allocated. > > > > Does this all sound good to you? > > > I think that makes sense :) > > -- > Vlad Rezki
On Wed, Apr 1, 2020 at 9:14 AM Mel Gorman <mgorman@suse.de> wrote: > > On Wed, Apr 01, 2020 at 09:23:59AM +0200, Michal Hocko wrote: > > > Can you suggest what prevents other users of GFP_MEMALLOC from doing that > > > also? > > > > There is no explicit mechanism which is indeed unfortunate. The only > > user real user of the flag is Swap over NFS AFAIK. I have never dared to > > look into details on how the complete reserves depletion is prevented. > > Mel would be much better fit here. > > > > It's "prevented" by the fact that every other memory allocation request > that is not involved with reclaiming memory gets stalled in the allocator > with only the swap subsystem making any progress until the machine > recovers. Potentially only kswapd is still running until the system > recovers if stressed hard enough. > > The naming is terrible but is mased on kswapd's use of the PF_MEMALLOC > flag. For swap-over-nfs, GFP_MEMALLOC saying "this allocation request is > potentially needed for kswapd to make forward progress and not freeze". > > I would not be comfortable with kfree_rcu() doing the same thing because > there can be many callers in parallel and it's freeing slab objects. > Swap over NFS should free at least one page, freeing a slab object is > not guaranteed to free anything. Got it Mel. Just to clarify to the onlooker. It seemed to fit the pattern that's why I proposed it as RFC, I was never sure it was the right approach -- I just proposed it for discussion-sake because I thought it was worth talking about at least. It was not even merged in my tree, was just RFC. Thanks Mel for clarifying the usage of the flag. - Joel
On Wed 01-04-20 15:22:58, Uladzislau Rezki wrote: > > > We call it from atomic context, so we can not sleep, also we do not have > > > any existing context coming from the caller. I see that GFP_ATOMIC is high-level > > > flag and is differ from __GFP_ATOMIC. It is defined as: > > > > > > #define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM) > > > > > > so basically we would like to have __GFP_KSWAPD_RECLAIM that is included in it, > > > because it will also help in case of high memory pressure and wake-up kswapd to > > > reclaim memory. > > > > > > We also can extract: > > > > > > __GFP_ATOMIC | __GFP_HIGH | __GFP_RETRY_MAYFAIL | __GFP_KSWAPD_RECLAIM > > > > > > but that is longer then > > > > > > GFP_ATMOC | __GFP_RETRY_MAYFAIL > > > > OK, if you are always in the atomic context then GFP_ATOMIC is > > sufficient. __GFP_RETRY_MAYFAIL will make no difference for allocations > > which do not reclaim (and thus not retry). Sorry this was not clear to > > me from the previous description. > > > Ahh. OK. Then adding __GFP_RETRY_MAYFAIL to GFP_ATOMIC will not make any effect. > > Thank you for your explanation! Welcome. I wish all those gfp flags were really clear but I fully understand that people who are not working with MM regurarly might find it confusing. Btw. have __GFP_RETRY_MAYFAIL is documented in gfp.h and it is documented as the reclaim modifier which should imply that it has no effect when the reclaim is not allowed which is the case for any non sleeping allocation. If that relation was not immediately obvious then I think we need to make it explicit. Would you find it useful? E.g. diff --git a/include/linux/gfp.h b/include/linux/gfp.h index e3ab1c0d9140..8f09cefdfa7b 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -127,6 +127,8 @@ struct vm_area_struct; * * Reclaim modifiers * ~~~~~~~~~~~~~~~~~ + * Please note that all the folloging flags are only applicable to sleepable + * allocations (e.g. %GFP_NOWAIT and %GFP_ATOMIC will ignore them). * * %__GFP_IO can start physical IO. *
> > > > > > OK, if you are always in the atomic context then GFP_ATOMIC is > > > sufficient. __GFP_RETRY_MAYFAIL will make no difference for allocations > > > which do not reclaim (and thus not retry). Sorry this was not clear to > > > me from the previous description. > > > > > Ahh. OK. Then adding __GFP_RETRY_MAYFAIL to GFP_ATOMIC will not make any effect. > > > > Thank you for your explanation! > > Welcome. I wish all those gfp flags were really clear but I fully > understand that people who are not working with MM regurarly might find > it confusing. Btw. have __GFP_RETRY_MAYFAIL is documented in gfp.h and > it is documented as the reclaim modifier which should imply that it has > no effect when the reclaim is not allowed which is the case for any non > sleeping allocation. If that relation was not immediately obvious then I > think we need to make it explicit. Would you find it useful? > > E.g. > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index e3ab1c0d9140..8f09cefdfa7b 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -127,6 +127,8 @@ struct vm_area_struct; > * > * Reclaim modifiers > * ~~~~~~~~~~~~~~~~~ > + * Please note that all the folloging flags are only applicable to sleepable > + * allocations (e.g. %GFP_NOWAIT and %GFP_ATOMIC will ignore them). > * > * %__GFP_IO can start physical IO. > * That would be definitely clear for me! -- Vlad Rezki
On Wed, Apr 01, 2020 at 05:28:05PM +0200, Michal Hocko wrote: > On Wed 01-04-20 15:22:58, Uladzislau Rezki wrote: > > > > We call it from atomic context, so we can not sleep, also we do not have > > > > any existing context coming from the caller. I see that GFP_ATOMIC is high-level > > > > flag and is differ from __GFP_ATOMIC. It is defined as: > > > > > > > > #define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM) > > > > > > > > so basically we would like to have __GFP_KSWAPD_RECLAIM that is included in it, > > > > because it will also help in case of high memory pressure and wake-up kswapd to > > > > reclaim memory. > > > > > > > > We also can extract: > > > > > > > > __GFP_ATOMIC | __GFP_HIGH | __GFP_RETRY_MAYFAIL | __GFP_KSWAPD_RECLAIM > > > > > > > > but that is longer then > > > > > > > > GFP_ATMOC | __GFP_RETRY_MAYFAIL > > > > > > OK, if you are always in the atomic context then GFP_ATOMIC is > > > sufficient. __GFP_RETRY_MAYFAIL will make no difference for allocations > > > which do not reclaim (and thus not retry). Sorry this was not clear to > > > me from the previous description. > > > > > Ahh. OK. Then adding __GFP_RETRY_MAYFAIL to GFP_ATOMIC will not make any effect. > > > > Thank you for your explanation! > > Welcome. I wish all those gfp flags were really clear but I fully > understand that people who are not working with MM regurarly might find > it confusing. Btw. have __GFP_RETRY_MAYFAIL is documented in gfp.h and > it is documented as the reclaim modifier which should imply that it has > no effect when the reclaim is not allowed which is the case for any non > sleeping allocation. If that relation was not immediately obvious then I > think we need to make it explicit. Would you find it useful? > > E.g. One nit below, but either way: Acked-by: Paul E. McKenney <paulmck@kernel.org> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index e3ab1c0d9140..8f09cefdfa7b 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -127,6 +127,8 @@ struct vm_area_struct; > * > * Reclaim modifiers > * ~~~~~~~~~~~~~~~~~ > + * Please note that all the folloging flags are only applicable to sleepable s/folloging/following/ > + * allocations (e.g. %GFP_NOWAIT and %GFP_ATOMIC will ignore them). > * > * %__GFP_IO can start physical IO. > * > -- > Michal Hocko > SUSE Labs
On Wed 01-04-20 08:57:12, Paul E. McKenney wrote: [...] > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > > index e3ab1c0d9140..8f09cefdfa7b 100644 > > --- a/include/linux/gfp.h > > +++ b/include/linux/gfp.h > > @@ -127,6 +127,8 @@ struct vm_area_struct; > > * > > * Reclaim modifiers > > * ~~~~~~~~~~~~~~~~~ > > + * Please note that all the folloging flags are only applicable to sleepable > > s/folloging/following/ Fixed. I will add it on top of http://lkml.kernel.org/r/20200401120502.GH22681@dhcp22.suse.cz and post both later this week to Andrew. From ce456c2f6c8fe0588c1d743a5b87e283aa4806f8 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.com> Date: Wed, 1 Apr 2020 18:02:23 +0200 Subject: [PATCH] mm: make it clear that gfp reclaim modifiers are valid only for sleepable allocations While it might be really clear to MM developers that gfp reclaim modifiers are applicable only to sleepable allocations (those with __GFP_DIRECT_RECLAIM) it seems that actual users of the API are not always sure. Make it explicit that they are not applicable for GFP_NOWAIT or GFP_ATOMIC allocations which are the most commonly used non-sleepable allocation masks. Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Michal Hocko <mhocko@suse.com> --- include/linux/gfp.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index e3ab1c0d9140..8040fa944cd8 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -127,6 +127,8 @@ struct vm_area_struct; * * Reclaim modifiers * ~~~~~~~~~~~~~~~~~ + * Please note that all the following flags are only applicable to sleepable + * allocations (e.g. %GFP_NOWAIT and %GFP_ATOMIC will ignore them). * * %__GFP_IO can start physical IO. *
> > > > > > Right. Per discussion with Paul, we discussed that it is better if we > > > pre-allocate N number of array blocks per-CPU and use it for the cache. > > > Default for N being 1 and tunable with a boot parameter. I agree with this. > > > > > As discussed before, we can make use of memory pool API for such > > purpose. But i am not sure if it should be one pool per CPU or > > one pool per NR_CPUS, that would contain NR_CPUS * N pre-allocated > > blocks. > > There are advantages and disadvantages either way. The advantage of the > per-CPU pool is that you don't have to worry about something like lock > contention causing even more pain during an OOM event. One potential > problem wtih the per-CPU pool can happen when callbacks are offloaded, > in which case the CPUs needing the memory might never be getting it, > because in the offloaded case (RCU_NOCB_CPU=y) the CPU posting callbacks > might never be invoking them. > > But from what I know now, systems built with CONFIG_RCU_NOCB_CPU=y > either don't have heavy callback loads (HPC systems) or are carefully > configured (real-time systems). Plus large systems would probably end > up needing something pretty close to a slab allocator to keep from dying > from lock contention, and it is hard to justify that level of complexity > at this point. > > Or is there some way to mark a specific slab allocator instance as being > able to keep some amount of memory no matter what the OOM conditions are? > If not, the current per-CPU pre-allocated cache is a better choice in the > near term. > As for mempool API: mempool_alloc() just tries to make regular allocation taking into account passed gfp_t bitmask. If it fails due to memory pressure, it uses reserved preallocated pool that consists of number of desirable elements(preallocated when a pool is created). mempoll_free() returns an element to to pool, if it detects that current reserved elements are lower then minimum allowed elements, it will add an element to reserved pool, i.e. refill it. Otherwise just call kfree() or whatever we define as "element-freeing function." > > If not, the current per-CPU pre-allocated cache is a better choice in the > near term. > OK. I see your point. Thank you for your comments and view :) -- Vlad Rezki
On Wed, Apr 01, 2020 at 08:16:01PM +0200, Uladzislau Rezki wrote: > > > > > > > > Right. Per discussion with Paul, we discussed that it is better if we > > > > pre-allocate N number of array blocks per-CPU and use it for the cache. > > > > Default for N being 1 and tunable with a boot parameter. I agree with this. > > > > > > > As discussed before, we can make use of memory pool API for such > > > purpose. But i am not sure if it should be one pool per CPU or > > > one pool per NR_CPUS, that would contain NR_CPUS * N pre-allocated > > > blocks. > > > > There are advantages and disadvantages either way. The advantage of the > > per-CPU pool is that you don't have to worry about something like lock > > contention causing even more pain during an OOM event. One potential > > problem wtih the per-CPU pool can happen when callbacks are offloaded, > > in which case the CPUs needing the memory might never be getting it, > > because in the offloaded case (RCU_NOCB_CPU=y) the CPU posting callbacks > > might never be invoking them. > > > > But from what I know now, systems built with CONFIG_RCU_NOCB_CPU=y > > either don't have heavy callback loads (HPC systems) or are carefully > > configured (real-time systems). Plus large systems would probably end > > up needing something pretty close to a slab allocator to keep from dying > > from lock contention, and it is hard to justify that level of complexity > > at this point. > > > > Or is there some way to mark a specific slab allocator instance as being > > able to keep some amount of memory no matter what the OOM conditions are? > > If not, the current per-CPU pre-allocated cache is a better choice in the > > near term. > > > As for mempool API: > > mempool_alloc() just tries to make regular allocation taking into > account passed gfp_t bitmask. If it fails due to memory pressure, > it uses reserved preallocated pool that consists of number of > desirable elements(preallocated when a pool is created). > > mempoll_free() returns an element to to pool, if it detects that > current reserved elements are lower then minimum allowed elements, > it will add an element to reserved pool, i.e. refill it. Otherwise > just call kfree() or whatever we define as "element-freeing function." Unless I am missing something, mempool_alloc() acquires a per-mempool lock on each invocation under OOM conditions. For our purposes, this is essentially a global lock. This will not be at all acceptable on a large system. Thanx, Paul > > If not, the current per-CPU pre-allocated cache is a better choice in the > > near term. > > > OK. I see your point. > > Thank you for your comments and view :) > > -- > Vlad Rezki
On Wed, Apr 01, 2020 at 11:26:15AM -0700, Paul E. McKenney wrote: > On Wed, Apr 01, 2020 at 08:16:01PM +0200, Uladzislau Rezki wrote: > > > > > > > > > > Right. Per discussion with Paul, we discussed that it is better if we > > > > > pre-allocate N number of array blocks per-CPU and use it for the cache. > > > > > Default for N being 1 and tunable with a boot parameter. I agree with this. > > > > > > > > > As discussed before, we can make use of memory pool API for such > > > > purpose. But i am not sure if it should be one pool per CPU or > > > > one pool per NR_CPUS, that would contain NR_CPUS * N pre-allocated > > > > blocks. > > > > > > There are advantages and disadvantages either way. The advantage of the > > > per-CPU pool is that you don't have to worry about something like lock > > > contention causing even more pain during an OOM event. One potential > > > problem wtih the per-CPU pool can happen when callbacks are offloaded, > > > in which case the CPUs needing the memory might never be getting it, > > > because in the offloaded case (RCU_NOCB_CPU=y) the CPU posting callbacks > > > might never be invoking them. > > > > > > But from what I know now, systems built with CONFIG_RCU_NOCB_CPU=y > > > either don't have heavy callback loads (HPC systems) or are carefully > > > configured (real-time systems). Plus large systems would probably end > > > up needing something pretty close to a slab allocator to keep from dying > > > from lock contention, and it is hard to justify that level of complexity > > > at this point. > > > > > > Or is there some way to mark a specific slab allocator instance as being > > > able to keep some amount of memory no matter what the OOM conditions are? > > > If not, the current per-CPU pre-allocated cache is a better choice in the > > > near term. > > > > > As for mempool API: > > > > mempool_alloc() just tries to make regular allocation taking into > > account passed gfp_t bitmask. If it fails due to memory pressure, > > it uses reserved preallocated pool that consists of number of > > desirable elements(preallocated when a pool is created). > > > > mempoll_free() returns an element to to pool, if it detects that > > current reserved elements are lower then minimum allowed elements, > > it will add an element to reserved pool, i.e. refill it. Otherwise > > just call kfree() or whatever we define as "element-freeing function." > > Unless I am missing something, mempool_alloc() acquires a per-mempool > lock on each invocation under OOM conditions. For our purposes, this > is essentially a global lock. This will not be at all acceptable on a > large system. > It uses pool->lock to access to reserved objects, so if we have one memory pool per one CPU then it would be serialized. -- Vlad Rezki
On Wed, Apr 01, 2020 at 08:37:45PM +0200, Uladzislau Rezki wrote: > On Wed, Apr 01, 2020 at 11:26:15AM -0700, Paul E. McKenney wrote: > > On Wed, Apr 01, 2020 at 08:16:01PM +0200, Uladzislau Rezki wrote: > > > > > > > > > > > > Right. Per discussion with Paul, we discussed that it is better if we > > > > > > pre-allocate N number of array blocks per-CPU and use it for the cache. > > > > > > Default for N being 1 and tunable with a boot parameter. I agree with this. > > > > > > > > > > > As discussed before, we can make use of memory pool API for such > > > > > purpose. But i am not sure if it should be one pool per CPU or > > > > > one pool per NR_CPUS, that would contain NR_CPUS * N pre-allocated > > > > > blocks. > > > > > > > > There are advantages and disadvantages either way. The advantage of the > > > > per-CPU pool is that you don't have to worry about something like lock > > > > contention causing even more pain during an OOM event. One potential > > > > problem wtih the per-CPU pool can happen when callbacks are offloaded, > > > > in which case the CPUs needing the memory might never be getting it, > > > > because in the offloaded case (RCU_NOCB_CPU=y) the CPU posting callbacks > > > > might never be invoking them. > > > > > > > > But from what I know now, systems built with CONFIG_RCU_NOCB_CPU=y > > > > either don't have heavy callback loads (HPC systems) or are carefully > > > > configured (real-time systems). Plus large systems would probably end > > > > up needing something pretty close to a slab allocator to keep from dying > > > > from lock contention, and it is hard to justify that level of complexity > > > > at this point. > > > > > > > > Or is there some way to mark a specific slab allocator instance as being > > > > able to keep some amount of memory no matter what the OOM conditions are? > > > > If not, the current per-CPU pre-allocated cache is a better choice in the > > > > near term. > > > > > > > As for mempool API: > > > > > > mempool_alloc() just tries to make regular allocation taking into > > > account passed gfp_t bitmask. If it fails due to memory pressure, > > > it uses reserved preallocated pool that consists of number of > > > desirable elements(preallocated when a pool is created). > > > > > > mempoll_free() returns an element to to pool, if it detects that > > > current reserved elements are lower then minimum allowed elements, > > > it will add an element to reserved pool, i.e. refill it. Otherwise > > > just call kfree() or whatever we define as "element-freeing function." > > > > Unless I am missing something, mempool_alloc() acquires a per-mempool > > lock on each invocation under OOM conditions. For our purposes, this > > is essentially a global lock. This will not be at all acceptable on a > > large system. > > > It uses pool->lock to access to reserved objects, so if we have one memory > pool per one CPU then it would be serialized. I am having difficulty parsing your sentence. It looks like your thought is to invoke mempool_create() for each CPU, so that the locking would be on a per-CPU basis, as in 128 invocations of mempool_init() on a system having 128 hardware threads. Is that your intent? Thanx, Paul
On Wed, Apr 01, 2020 at 11:54:39AM -0700, Paul E. McKenney wrote: > On Wed, Apr 01, 2020 at 08:37:45PM +0200, Uladzislau Rezki wrote: > > On Wed, Apr 01, 2020 at 11:26:15AM -0700, Paul E. McKenney wrote: > > > On Wed, Apr 01, 2020 at 08:16:01PM +0200, Uladzislau Rezki wrote: > > > > > > > > > > > > > > Right. Per discussion with Paul, we discussed that it is better if we > > > > > > > pre-allocate N number of array blocks per-CPU and use it for the cache. > > > > > > > Default for N being 1 and tunable with a boot parameter. I agree with this. > > > > > > > > > > > > > As discussed before, we can make use of memory pool API for such > > > > > > purpose. But i am not sure if it should be one pool per CPU or > > > > > > one pool per NR_CPUS, that would contain NR_CPUS * N pre-allocated > > > > > > blocks. > > > > > > > > > > There are advantages and disadvantages either way. The advantage of the > > > > > per-CPU pool is that you don't have to worry about something like lock > > > > > contention causing even more pain during an OOM event. One potential > > > > > problem wtih the per-CPU pool can happen when callbacks are offloaded, > > > > > in which case the CPUs needing the memory might never be getting it, > > > > > because in the offloaded case (RCU_NOCB_CPU=y) the CPU posting callbacks > > > > > might never be invoking them. > > > > > > > > > > But from what I know now, systems built with CONFIG_RCU_NOCB_CPU=y > > > > > either don't have heavy callback loads (HPC systems) or are carefully > > > > > configured (real-time systems). Plus large systems would probably end > > > > > up needing something pretty close to a slab allocator to keep from dying > > > > > from lock contention, and it is hard to justify that level of complexity > > > > > at this point. > > > > > > > > > > Or is there some way to mark a specific slab allocator instance as being > > > > > able to keep some amount of memory no matter what the OOM conditions are? > > > > > If not, the current per-CPU pre-allocated cache is a better choice in the > > > > > near term. > > > > > > > > > As for mempool API: > > > > > > > > mempool_alloc() just tries to make regular allocation taking into > > > > account passed gfp_t bitmask. If it fails due to memory pressure, > > > > it uses reserved preallocated pool that consists of number of > > > > desirable elements(preallocated when a pool is created). > > > > > > > > mempoll_free() returns an element to to pool, if it detects that > > > > current reserved elements are lower then minimum allowed elements, > > > > it will add an element to reserved pool, i.e. refill it. Otherwise > > > > just call kfree() or whatever we define as "element-freeing function." > > > > > > Unless I am missing something, mempool_alloc() acquires a per-mempool > > > lock on each invocation under OOM conditions. For our purposes, this > > > is essentially a global lock. This will not be at all acceptable on a > > > large system. > > > > > It uses pool->lock to access to reserved objects, so if we have one memory > > pool per one CPU then it would be serialized. > > I am having difficulty parsing your sentence. It looks like your thought > is to invoke mempool_create() for each CPU, so that the locking would be > on a per-CPU basis, as in 128 invocations of mempool_init() on a system > having 128 hardware threads. Is that your intent? > In order to serialize it, you need to have it per CPU. So if you have 128 cpus, it means: <snip> for_each_possible_cpu(...) cpu_pool = mempool_create(); <snip> but please keep in mind that it is not my intention, but i had a though about mempool API. Because it has pre-reserve logic inside. -- Vlad Rezki
On Wed, Apr 01, 2020 at 09:05:48PM +0200, Uladzislau Rezki wrote: > On Wed, Apr 01, 2020 at 11:54:39AM -0700, Paul E. McKenney wrote: > > On Wed, Apr 01, 2020 at 08:37:45PM +0200, Uladzislau Rezki wrote: > > > On Wed, Apr 01, 2020 at 11:26:15AM -0700, Paul E. McKenney wrote: > > > > On Wed, Apr 01, 2020 at 08:16:01PM +0200, Uladzislau Rezki wrote: > > > > > > > > > > > > > > > > Right. Per discussion with Paul, we discussed that it is better if we > > > > > > > > pre-allocate N number of array blocks per-CPU and use it for the cache. > > > > > > > > Default for N being 1 and tunable with a boot parameter. I agree with this. > > > > > > > > > > > > > > > As discussed before, we can make use of memory pool API for such > > > > > > > purpose. But i am not sure if it should be one pool per CPU or > > > > > > > one pool per NR_CPUS, that would contain NR_CPUS * N pre-allocated > > > > > > > blocks. > > > > > > > > > > > > There are advantages and disadvantages either way. The advantage of the > > > > > > per-CPU pool is that you don't have to worry about something like lock > > > > > > contention causing even more pain during an OOM event. One potential > > > > > > problem wtih the per-CPU pool can happen when callbacks are offloaded, > > > > > > in which case the CPUs needing the memory might never be getting it, > > > > > > because in the offloaded case (RCU_NOCB_CPU=y) the CPU posting callbacks > > > > > > might never be invoking them. > > > > > > > > > > > > But from what I know now, systems built with CONFIG_RCU_NOCB_CPU=y > > > > > > either don't have heavy callback loads (HPC systems) or are carefully > > > > > > configured (real-time systems). Plus large systems would probably end > > > > > > up needing something pretty close to a slab allocator to keep from dying > > > > > > from lock contention, and it is hard to justify that level of complexity > > > > > > at this point. > > > > > > > > > > > > Or is there some way to mark a specific slab allocator instance as being > > > > > > able to keep some amount of memory no matter what the OOM conditions are? > > > > > > If not, the current per-CPU pre-allocated cache is a better choice in the > > > > > > near term. > > > > > > > > > > > As for mempool API: > > > > > > > > > > mempool_alloc() just tries to make regular allocation taking into > > > > > account passed gfp_t bitmask. If it fails due to memory pressure, > > > > > it uses reserved preallocated pool that consists of number of > > > > > desirable elements(preallocated when a pool is created). > > > > > > > > > > mempoll_free() returns an element to to pool, if it detects that > > > > > current reserved elements are lower then minimum allowed elements, > > > > > it will add an element to reserved pool, i.e. refill it. Otherwise > > > > > just call kfree() or whatever we define as "element-freeing function." > > > > > > > > Unless I am missing something, mempool_alloc() acquires a per-mempool > > > > lock on each invocation under OOM conditions. For our purposes, this > > > > is essentially a global lock. This will not be at all acceptable on a > > > > large system. > > > > > > > It uses pool->lock to access to reserved objects, so if we have one memory > > > pool per one CPU then it would be serialized. > > > > I am having difficulty parsing your sentence. It looks like your thought > > is to invoke mempool_create() for each CPU, so that the locking would be > > on a per-CPU basis, as in 128 invocations of mempool_init() on a system > > having 128 hardware threads. Is that your intent? > > > In order to serialize it, you need to have it per CPU. So if you have 128 > cpus, it means: > > <snip> > for_each_possible_cpu(...) > cpu_pool = mempool_create(); > <snip> > > but please keep in mind that it is not my intention, but i had a though > about mempool API. Because it has pre-reserve logic inside. OK, fair point on use of mempool API, but my guess is that extending the current kfree_rcu() logic will be simpler. Thanx, Paul
On Wed, Apr 01, 2020 at 12:34:05PM -0700, Paul E. McKenney wrote: > On Wed, Apr 01, 2020 at 09:05:48PM +0200, Uladzislau Rezki wrote: > > On Wed, Apr 01, 2020 at 11:54:39AM -0700, Paul E. McKenney wrote: > > > On Wed, Apr 01, 2020 at 08:37:45PM +0200, Uladzislau Rezki wrote: > > > > On Wed, Apr 01, 2020 at 11:26:15AM -0700, Paul E. McKenney wrote: > > > > > On Wed, Apr 01, 2020 at 08:16:01PM +0200, Uladzislau Rezki wrote: > > > > > > > > > > > > > > > > > > Right. Per discussion with Paul, we discussed that it is better if we > > > > > > > > > pre-allocate N number of array blocks per-CPU and use it for the cache. > > > > > > > > > Default for N being 1 and tunable with a boot parameter. I agree with this. > > > > > > > > > > > > > > > > > As discussed before, we can make use of memory pool API for such > > > > > > > > purpose. But i am not sure if it should be one pool per CPU or > > > > > > > > one pool per NR_CPUS, that would contain NR_CPUS * N pre-allocated > > > > > > > > blocks. > > > > > > > > > > > > > > There are advantages and disadvantages either way. The advantage of the > > > > > > > per-CPU pool is that you don't have to worry about something like lock > > > > > > > contention causing even more pain during an OOM event. One potential > > > > > > > problem wtih the per-CPU pool can happen when callbacks are offloaded, > > > > > > > in which case the CPUs needing the memory might never be getting it, > > > > > > > because in the offloaded case (RCU_NOCB_CPU=y) the CPU posting callbacks > > > > > > > might never be invoking them. > > > > > > > > > > > > > > But from what I know now, systems built with CONFIG_RCU_NOCB_CPU=y > > > > > > > either don't have heavy callback loads (HPC systems) or are carefully > > > > > > > configured (real-time systems). Plus large systems would probably end > > > > > > > up needing something pretty close to a slab allocator to keep from dying > > > > > > > from lock contention, and it is hard to justify that level of complexity > > > > > > > at this point. > > > > > > > > > > > > > > Or is there some way to mark a specific slab allocator instance as being > > > > > > > able to keep some amount of memory no matter what the OOM conditions are? > > > > > > > If not, the current per-CPU pre-allocated cache is a better choice in the > > > > > > > near term. > > > > > > > > > > > > > As for mempool API: > > > > > > > > > > > > mempool_alloc() just tries to make regular allocation taking into > > > > > > account passed gfp_t bitmask. If it fails due to memory pressure, > > > > > > it uses reserved preallocated pool that consists of number of > > > > > > desirable elements(preallocated when a pool is created). > > > > > > > > > > > > mempoll_free() returns an element to to pool, if it detects that > > > > > > current reserved elements are lower then minimum allowed elements, > > > > > > it will add an element to reserved pool, i.e. refill it. Otherwise > > > > > > just call kfree() or whatever we define as "element-freeing function." > > > > > > > > > > Unless I am missing something, mempool_alloc() acquires a per-mempool > > > > > lock on each invocation under OOM conditions. For our purposes, this > > > > > is essentially a global lock. This will not be at all acceptable on a > > > > > large system. > > > > > > > > > It uses pool->lock to access to reserved objects, so if we have one memory > > > > pool per one CPU then it would be serialized. > > > > > > I am having difficulty parsing your sentence. It looks like your thought > > > is to invoke mempool_create() for each CPU, so that the locking would be > > > on a per-CPU basis, as in 128 invocations of mempool_init() on a system > > > having 128 hardware threads. Is that your intent? > > > > > In order to serialize it, you need to have it per CPU. So if you have 128 > > cpus, it means: > > > > <snip> > > for_each_possible_cpu(...) > > cpu_pool = mempool_create(); > > <snip> > > > > but please keep in mind that it is not my intention, but i had a though > > about mempool API. Because it has pre-reserve logic inside. > > OK, fair point on use of mempool API, but my guess is that extending > the current kfree_rcu() logic will be simpler. > Agree :) -- Vlad Rezki
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 4be763355c9fb..965deefffdd58 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3149,7 +3149,7 @@ static inline struct rcu_head *attach_rcu_head_to_object(void *obj) if (!ptr) ptr = kmalloc(sizeof(unsigned long *) + - sizeof(struct rcu_head), GFP_ATOMIC | __GFP_NOWARN); + sizeof(struct rcu_head), GFP_MEMALLOC); if (!ptr) return NULL;
In kfree_rcu() headless implementation (where the caller need not pass an rcu_head, but rather directly pass a pointer to an object), we have a fall-back where we allocate a rcu_head wrapper for the caller (not the common case). This brings the pattern of needing to allocate some memory to free some memory. Currently we use GFP_ATOMIC flag to try harder for this allocation, however the GFP_MEMALLOC flag is more tailored to this pattern. We need to try harder not only during atomic context, but also during non-atomic context anyway. So use the GFP_MEMALLOC flag instead. Also remove the __GFP_NOWARN flag simply because although we do have a synchronize_rcu() fallback for absolutely worst case, we still would like to not enter that path and atleast trigger a warning to the user. Cc: linux-mm@kvack.org Cc: rcu@vger.kernel.org Cc: willy@infradead.org Cc: peterz@infradead.org Cc: neilb@suse.com Cc: vbabka@suse.cz Cc: mgorman@suse.de Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> --- This patch is based on the (not yet upstream) code in: git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git (branch rcu/kfree) It is a follow-up to the posted series: https://lore.kernel.org/lkml/20200330023248.164994-1-joel@joelfernandes.org/ kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)