Message ID | 20200809204354.20137-2-urezki@gmail.com (mailing list archive)
---|---
State | New, archived
Series | __GFP_NO_LOCKS
On Sun 09-08-20 22:43:53, Uladzislau Rezki (Sony) wrote:
[...]
> Limitations and concerns (Main part)
> ====================================
> The current memory-allocation interface presents the following
> difficulties that this patch is designed to overcome:
>
> a) If built with CONFIG_PROVE_RAW_LOCK_NESTING, lockdep will
>    complain about a violation ("BUG: Invalid wait context") of the
>    nesting rules. It does the raw_spinlock vs. spinlock nesting
>    checks, i.e. it is not legal to acquire a spinlock_t while
>    holding a raw_spinlock_t.
>
>    Internally kfree_rcu() uses raw_spinlock_t (in the rcu-dev branch),
>    whereas the "page allocator" internally deals with spinlock_t to
>    access its zones. The code can also be broken from a higher level
>    of view:
>    <snip>
>        raw_spin_lock(&some_lock);
>        kfree_rcu(some_pointer, some_field_offset);
>    <snip>

Is there any fundamental problem to make zone raw_spin_lock?

> b) If built with CONFIG_PREEMPT_RT. Please note, in that case spinlock_t
>    is converted into a sleepable variant. Invoking the page allocator from
>    atomic contexts leads to "BUG: scheduling while atomic".

[...]

> Proposal
> ========
> 1) Make a GFP_* flag that ensures that the allocator returns NULL rather
>    than acquire its own spinlock_t. Having such a flag will address the a) and b)
>    limitations described above. It will also make the kfree_rcu() code
>    common for RT and regular kernels, cleaner, with less corner-case
>    handling, and reduce the code size.

I do not think this is a good idea. Single-purpose gfp flags that tend
to heavily depend on the current implementation of the page allocator
have turned out to be problematic. Users used to misunderstand their
meaning, resulting in a lot of abuse which was not trivial to remove.
This flag seems to fall into exactly this sort of category. If there is a
problem in nesting then that should be addressed rather than a new flag
exported, IMHO. If that is absolutely not possible for some reason then
we can try to figure out what to do, but that really needs a very strong
justification.
> On Sun 09-08-20 22:43:53, Uladzislau Rezki (Sony) wrote: > [...] > > Limitations and concerns (Main part) > > ==================================== > > The current memmory-allocation interface presents to following > > difficulties that this patch is designed to overcome: > > > > a) If built with CONFIG_PROVE_RAW_LOCK_NESTING, the lockdep will > > complain about violation("BUG: Invalid wait context") of the > > nesting rules. It does the raw_spinlock vs. spinlock nesting > > checks, i.e. it is not legal to acquire a spinlock_t while > > holding a raw_spinlock_t. > > > > Internally the kfree_rcu() uses raw_spinlock_t(in rcu-dev branch) > > whereas the "page allocator" internally deals with spinlock_t to > > access to its zones. The code also can be broken from higher level > > of view: > > <snip> > > raw_spin_lock(&some_lock); > > kfree_rcu(some_pointer, some_field_offset); > > <snip> > > Is there any fundamental problem to make zone raw_spin_lock? > Good point. Converting a regular spinlock to the raw_* variant can solve an issue and to me it seems partly reasonable. Because there are other questions if we do it: a) what to do with kswapd and "wake-up path" that uses sleepable lock: wakeup_kswapd() -> wake_up_interruptible(&pgdat->kswapd_wait). b) How RT people reacts on it? I guess they will no be happy. As i described before, calling the __get_free_page(0) with 0 as argument will solve the (a). How correctly is it? From my point of view the logic that bypass the wakeup path should be explicitly defined. Or we can enter the allocator with (__GFP_HIGH|__GFP_ATOMIC) that bypass the __GFP_KSWAPD_RECLAIM as well. Any thoughts here? Please comment. Having proposed flag will not heart RT latency and solve all concerns. > > b) If built with CONFIG_PREEMPT_RT. Please note, in that case spinlock_t > > is converted into sleepable variant. Invoking the page allocator from > > atomic contexts leads to "BUG: scheduling while atomic". > > [...] > > > Proposal > > ======== > > 1) Make GFP_* that ensures that the allocator returns NULL rather > > than acquire its own spinlock_t. Having such flag will address a and b > > limitations described above. It will also make the kfree_rcu() code > > common for RT and regular kernel, more clean, less handling corner > > cases and reduce the code size. > > I do not think this is a good idea. Single purpose gfp flags that tend > to heavily depend on the current implementation of the page allocator > have turned out to be problematic. Users used to misunderstand their > meaning resulting in a lot of abuse which was not trivial to remove. > This flag seem to fall into exactly this sort of category. If there is a > problem in nesting then that should be addressed rather than a new flag > exported IMHO. If that is absolutely not possible for some reason then > we can try to figure out what to do but that really need a very strong > justification. > The problem that i see is we can not use the page allocator from atomic contexts, what is our case: <snip> local_irq_save(flags) or preempt_disable() or raw_spinlock(); __get_free_page(GFP_ATOMIC); <snip> So if we can convert the page allocator to raw_* lock it will be appreciated, at least from our side, IMHO, not from RT one. But as i stated above we need to sort raised questions out if converting is done. What is your view? Thank you for your help and feedback! -- Vlad Rezki
On Mon 10-08-20 18:07:39, Uladzislau Rezki wrote: > > On Sun 09-08-20 22:43:53, Uladzislau Rezki (Sony) wrote: > > [...] > > > Limitations and concerns (Main part) > > > ==================================== > > > The current memmory-allocation interface presents to following > > > difficulties that this patch is designed to overcome: > > > > > > a) If built with CONFIG_PROVE_RAW_LOCK_NESTING, the lockdep will > > > complain about violation("BUG: Invalid wait context") of the > > > nesting rules. It does the raw_spinlock vs. spinlock nesting > > > checks, i.e. it is not legal to acquire a spinlock_t while > > > holding a raw_spinlock_t. > > > > > > Internally the kfree_rcu() uses raw_spinlock_t(in rcu-dev branch) > > > whereas the "page allocator" internally deals with spinlock_t to > > > access to its zones. The code also can be broken from higher level > > > of view: > > > <snip> > > > raw_spin_lock(&some_lock); > > > kfree_rcu(some_pointer, some_field_offset); > > > <snip> > > > > Is there any fundamental problem to make zone raw_spin_lock? > > > Good point. Converting a regular spinlock to the raw_* variant can solve > an issue and to me it seems partly reasonable. Because there are other > questions if we do it: > > a) what to do with kswapd and "wake-up path" that uses sleepable lock: > wakeup_kswapd() -> wake_up_interruptible(&pgdat->kswapd_wait). If there is no RT friendly variant for waking up process from the atomic context then we might need to special case this for the RT tree. > b) How RT people reacts on it? I guess they will no be happy. zone->lock should be held for a very limited amount of time. > As i described before, calling the __get_free_page(0) with 0 as argument > will solve the (a). How correctly is it? From my point of view the logic > that bypass the wakeup path should be explicitly defined. gfp_mask == 0 is GFP_NOWAIT (aka an atomic allocation request) which doesn't wake up kswapd. So if the wakeup is a problem then this would be a way to go. > Or we can enter the allocator with (__GFP_HIGH|__GFP_ATOMIC) that bypass > the __GFP_KSWAPD_RECLAIM as well. This would be an alternative which consumes memory reserves. Is this really needed for the particular case? > > Any thoughts here? Please comment. > > Having proposed flag will not heart RT latency and solve all concerns. > > > > b) If built with CONFIG_PREEMPT_RT. Please note, in that case spinlock_t > > > is converted into sleepable variant. Invoking the page allocator from > > > atomic contexts leads to "BUG: scheduling while atomic". > > > > [...] > > > > > Proposal > > > ======== > > > 1) Make GFP_* that ensures that the allocator returns NULL rather > > > than acquire its own spinlock_t. Having such flag will address a and b > > > limitations described above. It will also make the kfree_rcu() code > > > common for RT and regular kernel, more clean, less handling corner > > > cases and reduce the code size. > > > > I do not think this is a good idea. Single purpose gfp flags that tend > > to heavily depend on the current implementation of the page allocator > > have turned out to be problematic. Users used to misunderstand their > > meaning resulting in a lot of abuse which was not trivial to remove. > > This flag seem to fall into exactly this sort of category. If there is a > > problem in nesting then that should be addressed rather than a new flag > > exported IMHO. 
If that is absolutely not possible for some reason then > > we can try to figure out what to do but that really need a very strong > > justification. > > > The problem that i see is we can not use the page allocator from atomic > contexts, what is our case: > > <snip> > local_irq_save(flags) or preempt_disable() or raw_spinlock(); > __get_free_page(GFP_ATOMIC); > <snip> > > So if we can convert the page allocator to raw_* lock it will be appreciated, > at least from our side, IMHO, not from RT one. But as i stated above we need > to sort raised questions out if converting is done. > > What is your view? To me it would make more sense to support atomic allocations also for the RT tree. Having both GFP_NOWAIT and GFP_ATOMIC which do not really work for atomic context in RT sounds subtle and wrong.
On Mon 10-08-20 21:25:26, Michal Hocko wrote: > On Mon 10-08-20 18:07:39, Uladzislau Rezki wrote: [...] > > The problem that i see is we can not use the page allocator from atomic > > contexts, what is our case: > > > > <snip> > > local_irq_save(flags) or preempt_disable() or raw_spinlock(); > > __get_free_page(GFP_ATOMIC); > > <snip> > > > > So if we can convert the page allocator to raw_* lock it will be appreciated, > > at least from our side, IMHO, not from RT one. But as i stated above we need > > to sort raised questions out if converting is done. > > > > What is your view? > > To me it would make more sense to support atomic allocations also for > the RT tree. Having both GFP_NOWAIT and GFP_ATOMIC which do not really > work for atomic context in RT sounds subtle and wrong. I was thinking about this some more. I still think the above would be a reasonable goal we should try to achieve. If for not other then for future maintainability (especially after the RT patchset is merged). I have tried to search for any known problems/attempts to make zone->lock raw but couldn't find anything. Maybe somebody more involved in RT world have something to say about that. Anyway, if the zone->lock is not a good fit for raw_spin_lock then the only way I can see forward is to detect real (RT) atomic contexts and bail out early before taking the lock in the allocator for NOWAIT/ATOMIC requests.
On Mon, Aug 10, 2020 at 09:25:25PM +0200, Michal Hocko wrote: > On Mon 10-08-20 18:07:39, Uladzislau Rezki wrote: > > > On Sun 09-08-20 22:43:53, Uladzislau Rezki (Sony) wrote: > > > [...] > > > > Limitations and concerns (Main part) > > > > ==================================== > > > > The current memmory-allocation interface presents to following > > > > difficulties that this patch is designed to overcome: > > > > > > > > a) If built with CONFIG_PROVE_RAW_LOCK_NESTING, the lockdep will > > > > complain about violation("BUG: Invalid wait context") of the > > > > nesting rules. It does the raw_spinlock vs. spinlock nesting > > > > checks, i.e. it is not legal to acquire a spinlock_t while > > > > holding a raw_spinlock_t. > > > > > > > > Internally the kfree_rcu() uses raw_spinlock_t(in rcu-dev branch) > > > > whereas the "page allocator" internally deals with spinlock_t to > > > > access to its zones. The code also can be broken from higher level > > > > of view: > > > > <snip> > > > > raw_spin_lock(&some_lock); > > > > kfree_rcu(some_pointer, some_field_offset); > > > > <snip> > > > > > > Is there any fundamental problem to make zone raw_spin_lock? > > > > > Good point. Converting a regular spinlock to the raw_* variant can solve > > an issue and to me it seems partly reasonable. Because there are other > > questions if we do it: > > > > a) what to do with kswapd and "wake-up path" that uses sleepable lock: > > wakeup_kswapd() -> wake_up_interruptible(&pgdat->kswapd_wait). > > If there is no RT friendly variant for waking up process from the atomic > context then we might need to special case this for the RT tree. > I do not see it in RT kernel. The waiting primitives, see the wait.c, use sleepable locks all over the file. > > b) How RT people reacts on it? I guess they will no be happy. > > zone->lock should be held for a very limited amount of time. > > > As i described before, calling the __get_free_page(0) with 0 as argument > > will solve the (a). How correctly is it? From my point of view the logic > > that bypass the wakeup path should be explicitly defined. > > gfp_mask == 0 is GFP_NOWAIT (aka an atomic allocation request) which > doesn't wake up kswapd. So if the wakeup is a problem then this would be > a way to go. > What do you mean Michal? gfp_mask 0 != GFP_NOWAIT: #define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM) it does wakeup of the kswapd. Or am i missing something? Please comment. If we are about to avoid the kswapd, should we define something special? #define GFP_NOWWAKE_KSWAPD 0 > > Or we can enter the allocator with (__GFP_HIGH|__GFP_ATOMIC) that bypass > > the __GFP_KSWAPD_RECLAIM as well. > > This would be an alternative which consumes memory reserves. Is this > really needed for the particular case? > No. That was just another example illustrating how to bypass the __GFP_KSWAPD_RECLAIM. > > > > Any thoughts here? Please comment. > > > > Having proposed flag will not heart RT latency and solve all concerns. > > > > > > b) If built with CONFIG_PREEMPT_RT. Please note, in that case spinlock_t > > > > is converted into sleepable variant. Invoking the page allocator from > > > > atomic contexts leads to "BUG: scheduling while atomic". > > > > > > [...] > > > > > > > Proposal > > > > ======== > > > > 1) Make GFP_* that ensures that the allocator returns NULL rather > > > > than acquire its own spinlock_t. Having such flag will address a and b > > > > limitations described above. 
It will also make the kfree_rcu() code > > > > common for RT and regular kernel, more clean, less handling corner > > > > cases and reduce the code size. > > > > > > I do not think this is a good idea. Single purpose gfp flags that tend > > > to heavily depend on the current implementation of the page allocator > > > have turned out to be problematic. Users used to misunderstand their > > > meaning resulting in a lot of abuse which was not trivial to remove. > > > This flag seem to fall into exactly this sort of category. If there is a > > > problem in nesting then that should be addressed rather than a new flag > > > exported IMHO. If that is absolutely not possible for some reason then > > > we can try to figure out what to do but that really need a very strong > > > justification. > > > > > The problem that i see is we can not use the page allocator from atomic > > contexts, what is our case: > > > > <snip> > > local_irq_save(flags) or preempt_disable() or raw_spinlock(); > > __get_free_page(GFP_ATOMIC); > > <snip> > > > > So if we can convert the page allocator to raw_* lock it will be appreciated, > > at least from our side, IMHO, not from RT one. But as i stated above we need > > to sort raised questions out if converting is done. > > > > What is your view? > > To me it would make more sense to support atomic allocations also for > the RT tree. Having both GFP_NOWAIT and GFP_ATOMIC which do not really > work for atomic context in RT sounds subtle and wrong. > Same view on it. Thank you for your comments! -- Vlad Rezki
On Tue, Aug 11, 2020 at 10:19:17AM +0200, Michal Hocko wrote:
> On Mon 10-08-20 21:25:26, Michal Hocko wrote:
> > On Mon 10-08-20 18:07:39, Uladzislau Rezki wrote:
> [...]
> > > The problem that i see is we can not use the page allocator from atomic
> > > contexts, what is our case:
> > >
> > > <snip>
> > >     local_irq_save(flags) or preempt_disable() or raw_spinlock();
> > >     __get_free_page(GFP_ATOMIC);
> > > <snip>
> > >
> > > So if we can convert the page allocator to raw_* lock it will be appreciated,
> > > at least from our side, IMHO, not from RT one. But as i stated above we need
> > > to sort raised questions out if converting is done.
> > >
> > > What is your view?
> >
> > To me it would make more sense to support atomic allocations also for
> > the RT tree. Having both GFP_NOWAIT and GFP_ATOMIC which do not really
> > work for atomic context in RT sounds subtle and wrong.
>
> I was thinking about this some more. I still think the above would be a
> reasonable goal we should try to achieve. If for not other then for
> future maintainability (especially after the RT patchset is merged).
> I have tried to search for any known problems/attempts to make
> zone->lock raw but couldn't find anything. Maybe somebody more involved
> in RT world have something to say about that.
>
I tried yesterday to convert zone->lock. See below the files i had to modify:

<snip>
    modified:   include/linux/mmzone.h
    modified:   mm/compaction.c
    modified:   mm/memory_hotplug.c
    modified:   mm/page_alloc.c
    modified:   mm/page_isolation.c
    modified:   mm/page_reporting.c
    modified:   mm/shuffle.c
    modified:   mm/vmscan.c
    modified:   mm/vmstat.c
<snip>

There is one more lock, the zone->lru_lock one. Both zone->lock and this
one intersect with each other. If the lru_lock can be nested under zone->lock
it should be converted as well, but i need to analyze it further. There are
two wrapper functions which are used as a common interface to lock/unlock both
locks. See compact_lock_irqsave()/compact_unlock_should_abort_lru() in
mm/compaction.c.

Any thoughts here?

Anyway, i tried to convert only zone->lock and used the page allocator, passing
gfp_mask=0 as the argument. It works: CONFIG_PROVE_RAW_LOCK_NESTING does not
complain about any "bad" lock nesting.

> Anyway, if the zone->lock is not a good fit for raw_spin_lock then the
> only way I can see forward is to detect real (RT) atomic contexts and
> bail out early before taking the lock in the allocator for NOWAIT/ATOMIC
> requests.
>
For the RT kernel we can detect it for sure. preemptible() works just fine there,
i.e. we can identify the context we are currently in.

Thanks!

--
Vlad Rezki
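For reference, a sketch of the mechanical core of such a conversion
(illustrative hunks only, not the actual diff; the real change touches every
zone->lock site in the files listed above):

<snip>
/* include/linux/mmzone.h */
struct zone {
	/* ... other fields unchanged ... */
	raw_spinlock_t		lock;	/* was: spinlock_t lock; */
};

/* mm/page_alloc.c, e.g. around the buddy fast path */
raw_spin_lock_irqsave(&zone->lock, flags);
page = __rmqueue(zone, order, migratetype, alloc_flags);
raw_spin_unlock_irqrestore(&zone->lock, flags);
<snip>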
On Tue, Aug 11, 2020 at 11:37:13AM +0200, Uladzislau Rezki wrote: > On Tue, Aug 11, 2020 at 10:19:17AM +0200, Michal Hocko wrote: > > On Mon 10-08-20 21:25:26, Michal Hocko wrote: > > > On Mon 10-08-20 18:07:39, Uladzislau Rezki wrote: > > [...] > > > > The problem that i see is we can not use the page allocator from atomic > > > > contexts, what is our case: > > > > > > > > <snip> > > > > local_irq_save(flags) or preempt_disable() or raw_spinlock(); > > > > __get_free_page(GFP_ATOMIC); > > > > <snip> > > > > > > > > So if we can convert the page allocator to raw_* lock it will be appreciated, > > > > at least from our side, IMHO, not from RT one. But as i stated above we need > > > > to sort raised questions out if converting is done. > > > > > > > > What is your view? > > > > > > To me it would make more sense to support atomic allocations also for > > > the RT tree. Having both GFP_NOWAIT and GFP_ATOMIC which do not really > > > work for atomic context in RT sounds subtle and wrong. > > > > I was thinking about this some more. I still think the above would be a > > reasonable goal we should try to achieve. If for not other then for > > future maintainability (especially after the RT patchset is merged). > > I have tried to search for any known problems/attempts to make > > zone->lock raw but couldn't find anything. Maybe somebody more involved > > in RT world have something to say about that. > > > I tried yesterday to convert zone->lock. See below files i had to modify: > <snip> > modified: include/linux/mmzone.h > modified: mm/compaction.c > modified: mm/memory_hotplug.c > modified: mm/page_alloc.c > modified: mm/page_isolation.c > modified: mm/page_reporting.c > modified: mm/shuffle.c > modified: mm/vmscan.c > modified: mm/vmstat.c > <snip> > > There is one more lock, that is zone->lru_lock one. Both zone->lock and this > one intersect between each other. If the lru_lock can be nested under zone->lock > it should be converted as well. But i need to analyze it farther. There are > two wrapper functions which are used as common interface to lock/unlock both > locks. See compact_lock_irqsave()/compact_unlock_should_abort_lru() in the > mm/compaction.c. > > Any thoughts here? > > Anyway i tried to convert only zone->lock and use page allocator passing there > gfp_mask=0 as argument. So it works. CONFIG_PROVE_RAW_LOCK_NESTING does not > complain about any "bad" lock nesting. > > > Anyway, if the zone->lock is not a good fit for raw_spin_lock then the > > only way I can see forward is to detect real (RT) atomic contexts and > > bail out early before taking the lock in the allocator for NOWAIT/ATOMIC > > requests. > > This is similar what i have done with mm: Add __GFP_NO_LOCKS flag. I just did it for order-0 pages(other paths are impossible) and made it common for any kernel. Because when you say "bail out early" i suspect that we would like to check the per-cpu-list cache. > For RT kernel we can detect it for sure. preemtable() works just fine there, > i.e. we can identify the context we are currently in. > -- Vlad Rezki
On Tue 11-08-20 11:18:07, Uladzislau Rezki wrote:
> On Mon, Aug 10, 2020 at 09:25:25PM +0200, Michal Hocko wrote:
> > On Mon 10-08-20 18:07:39, Uladzislau Rezki wrote:
> > > > On Sun 09-08-20 22:43:53, Uladzislau Rezki (Sony) wrote:
[...]
> > > As i described before, calling the __get_free_page(0) with 0 as argument
> > > will solve the (a). How correctly is it? From my point of view the logic
> > > that bypass the wakeup path should be explicitly defined.
> >
> > gfp_mask == 0 is GFP_NOWAIT (aka an atomic allocation request) which
> > doesn't wake up kswapd. So if the wakeup is a problem then this would be
> > a way to go.
> >
> What do you mean Michal? gfp_mask 0 != GFP_NOWAIT:
>
>     #define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM)
>
> it does wakeup of the kswapd. Or am i missing something? Please comment.
> If we are about to avoid the kswapd, should we define something special?
>
>     #define GFP_NOWWAKE_KSWAPD 0

Sorry, I was more cryptic than necessary. What I meant is that
GFP_NOWAIT is the basic non-sleepable allocation. It does wake up
kswapd, but the lack of that wakeup can be expressed as
GFP_NOWAIT & ~__GFP_KSWAPD_RECLAIM, which happens to be 0 today. The
mouthful variant is better for future maintainability.
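For illustration, a minimal sketch of the flag arithmetic described above,
based on the mainline gfp definitions at the time (the helper name is made up):

<snip>
#include <linux/gfp.h>

/*
 * GFP_NOWAIT == __GFP_KSWAPD_RECLAIM here, so masking the kswapd-wakeup
 * bit off yields an empty gfp mask. Spelling it out documents the intent
 * better than passing a bare 0.
 */
static inline struct page *alloc_page_no_kswapd_wakeup(void)
{
	return alloc_pages(GFP_NOWAIT & ~__GFP_KSWAPD_RECLAIM, 0);
}
<snip>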
On Tue 11-08-20 11:37:13, Uladzislau Rezki wrote: > On Tue, Aug 11, 2020 at 10:19:17AM +0200, Michal Hocko wrote: > > On Mon 10-08-20 21:25:26, Michal Hocko wrote: > > > On Mon 10-08-20 18:07:39, Uladzislau Rezki wrote: > > [...] > > > > The problem that i see is we can not use the page allocator from atomic > > > > contexts, what is our case: > > > > > > > > <snip> > > > > local_irq_save(flags) or preempt_disable() or raw_spinlock(); > > > > __get_free_page(GFP_ATOMIC); > > > > <snip> > > > > > > > > So if we can convert the page allocator to raw_* lock it will be appreciated, > > > > at least from our side, IMHO, not from RT one. But as i stated above we need > > > > to sort raised questions out if converting is done. > > > > > > > > What is your view? > > > > > > To me it would make more sense to support atomic allocations also for > > > the RT tree. Having both GFP_NOWAIT and GFP_ATOMIC which do not really > > > work for atomic context in RT sounds subtle and wrong. > > > > I was thinking about this some more. I still think the above would be a > > reasonable goal we should try to achieve. If for not other then for > > future maintainability (especially after the RT patchset is merged). > > I have tried to search for any known problems/attempts to make > > zone->lock raw but couldn't find anything. Maybe somebody more involved > > in RT world have something to say about that. > > > I tried yesterday to convert zone->lock. See below files i had to modify: > <snip> > modified: include/linux/mmzone.h > modified: mm/compaction.c > modified: mm/memory_hotplug.c > modified: mm/page_alloc.c > modified: mm/page_isolation.c > modified: mm/page_reporting.c > modified: mm/shuffle.c > modified: mm/vmscan.c > modified: mm/vmstat.c > <snip> > > There is one more lock, that is zone->lru_lock one. Both zone->lock and this > one intersect between each other. If the lru_lock can be nested under zone->lock > it should be converted as well. But i need to analyze it farther. There are > two wrapper functions which are used as common interface to lock/unlock both > locks. See compact_lock_irqsave()/compact_unlock_should_abort_lru() in the > mm/compaction.c. > > Any thoughts here? I am not an expert on compaction. Vlastimil would know better. My thinking was that zone->lock is a tail lock but compaction/page isolation might be doing something I am not aware of right now. > Anyway i tried to convert only zone->lock and use page allocator passing there > gfp_mask=0 as argument. So it works. CONFIG_PROVE_RAW_LOCK_NESTING does not > complain about any "bad" lock nesting. > > > Anyway, if the zone->lock is not a good fit for raw_spin_lock then the > > only way I can see forward is to detect real (RT) atomic contexts and > > bail out early before taking the lock in the allocator for NOWAIT/ATOMIC > > requests. > > > For RT kernel we can detect it for sure. preemtable() works just fine there, > i.e. we can identify the context we are currently in. In previous email I didn't mention why I prefer full NOWAIT semantic over rt specific bailouts. There are users making NOWAIT allocation attempts as an opportunistic allocation request which is OK to fail as they have a fallback to go through. This would imply they would prefer to know this ASAP rather then get blocked and sleep. A lack of reports for PREEMPT_RT would suggest that nobody has noticed as this though.
On Tue 11-08-20 11:42:51, Uladzislau Rezki wrote: > On Tue, Aug 11, 2020 at 11:37:13AM +0200, Uladzislau Rezki wrote: > > On Tue, Aug 11, 2020 at 10:19:17AM +0200, Michal Hocko wrote: [...] > > > Anyway, if the zone->lock is not a good fit for raw_spin_lock then the > > > only way I can see forward is to detect real (RT) atomic contexts and > > > bail out early before taking the lock in the allocator for NOWAIT/ATOMIC > > > requests. > > > > This is similar what i have done with mm: Add __GFP_NO_LOCKS flag. I just > did it for order-0 pages(other paths are impossible) and made it common for > any kernel. > > Because when you say "bail out early" i suspect that we would like to check > the per-cpu-list cache. Bail out early means to do as much as possible until a raw non-compliant lock has to be taken.
On Tue, Aug 11, 2020 at 12:28:18PM +0200, Michal Hocko wrote:
> On Tue 11-08-20 11:42:51, Uladzislau Rezki wrote:
> > On Tue, Aug 11, 2020 at 11:37:13AM +0200, Uladzislau Rezki wrote:
> > > On Tue, Aug 11, 2020 at 10:19:17AM +0200, Michal Hocko wrote:
> [...]
> > > > Anyway, if the zone->lock is not a good fit for raw_spin_lock then the
> > > > only way I can see forward is to detect real (RT) atomic contexts and
> > > > bail out early before taking the lock in the allocator for NOWAIT/ATOMIC
> > > > requests.
> > >
> > This is similar to what i have done with "mm: Add __GFP_NO_LOCKS flag". I just
> > did it for order-0 pages (other paths are impossible) and made it common for
> > any kernel.
> >
> > Because when you say "bail out early" i suspect that we would like to check
> > the per-cpu-list cache.
>
> Bail out early means to do as much as possible until a raw non-compliant
> lock has to be taken.
>
<snip>
struct page *rmqueue(struct zone *preferred_zone,
			struct zone *zone, unsigned int order,
			gfp_t gfp_flags, unsigned int alloc_flags,
			int migratetype)
{
	unsigned long flags;
	struct page *page;

	if (likely(order == 0)) {
		page = rmqueue_pcplist(preferred_zone, zone, gfp_flags,
					migratetype, alloc_flags);
		goto out;
	}

	/*
	 * We most definitely don't want callers attempting to
	 * allocate greater than order-1 page units with __GFP_NOFAIL.
	 */
	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
	spin_lock_irqsave(&zone->lock, flags);
<snip>

Only order-0 allocations can check, without taking any locks, whether the
CPU's pcp-list cache has something, i.e. that path is lockless. "Pre-fetching"
is not possible since transferring pages from the buddy allocator to the
per-cpu lists takes zone->lock; that is done in the rmqueue_bulk() function.

--
Vlad Rezki
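As an illustration, a sketch of where such a bail-out would sit (the flag name
follows this series' proposal; the hunk itself is illustrative, not the posted
patch):

<snip>
/*
 * Order-0 requests can still try the lockless per-cpu list; any path
 * that would need zone->lock instead fails fast for such callers,
 * including the pcp refill in rmqueue_bulk().
 */
if (gfp_flags & __GFP_NO_LOCKS)
	return NULL;	/* caller falls back, e.g. kfree_rcu()'s list path */

spin_lock_irqsave(&zone->lock, flags);
<snip>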
On Tue, Aug 11, 2020 at 12:21:24PM +0200, Michal Hocko wrote: > On Tue 11-08-20 11:18:07, Uladzislau Rezki wrote: > > On Mon, Aug 10, 2020 at 09:25:25PM +0200, Michal Hocko wrote: > > > On Mon 10-08-20 18:07:39, Uladzislau Rezki wrote: > > > > > On Sun 09-08-20 22:43:53, Uladzislau Rezki (Sony) wrote: > [...] > > > > As i described before, calling the __get_free_page(0) with 0 as argument > > > > will solve the (a). How correctly is it? From my point of view the logic > > > > that bypass the wakeup path should be explicitly defined. > > > > > > gfp_mask == 0 is GFP_NOWAIT (aka an atomic allocation request) which > > > doesn't wake up kswapd. So if the wakeup is a problem then this would be > > > a way to go. > > > > > What do you mean Michal? gfp_mask 0 != GFP_NOWAIT: > > > > #define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM) > > > > it does wakeup of the kswapd. Or am i missing something? Please comment. > > If we are about to avoid the kswapd, should we define something special? > > > > #define GFP_NOWWAKE_KSWAPD 0 > > Sorry, I was more cryptic than necessary. What I meant is that > GFP_NOWAIT is the basic non-sleepable allocation. It does wake up > kswapd but a lack of it can be expressed GFP_NOWAIT & ~__GFP_KSWAPD_RECLAIM > which is 0, now. The mouthfull variant is better for future > maintainability. > OK. I got it anyway. Just decided to clarify. -- Vlad Rezki
On Tue, Aug 11, 2020 at 12:26:49PM +0200, Michal Hocko wrote: > On Tue 11-08-20 11:37:13, Uladzislau Rezki wrote: > > On Tue, Aug 11, 2020 at 10:19:17AM +0200, Michal Hocko wrote: > > > On Mon 10-08-20 21:25:26, Michal Hocko wrote: > > > > On Mon 10-08-20 18:07:39, Uladzislau Rezki wrote: > > > [...] > > > > > The problem that i see is we can not use the page allocator from atomic > > > > > contexts, what is our case: > > > > > > > > > > <snip> > > > > > local_irq_save(flags) or preempt_disable() or raw_spinlock(); > > > > > __get_free_page(GFP_ATOMIC); > > > > > <snip> > > > > > > > > > > So if we can convert the page allocator to raw_* lock it will be appreciated, > > > > > at least from our side, IMHO, not from RT one. But as i stated above we need > > > > > to sort raised questions out if converting is done. > > > > > > > > > > What is your view? > > > > > > > > To me it would make more sense to support atomic allocations also for > > > > the RT tree. Having both GFP_NOWAIT and GFP_ATOMIC which do not really > > > > work for atomic context in RT sounds subtle and wrong. > > > > > > I was thinking about this some more. I still think the above would be a > > > reasonable goal we should try to achieve. If for not other then for > > > future maintainability (especially after the RT patchset is merged). > > > I have tried to search for any known problems/attempts to make > > > zone->lock raw but couldn't find anything. Maybe somebody more involved > > > in RT world have something to say about that. > > > > > I tried yesterday to convert zone->lock. See below files i had to modify: > > <snip> > > modified: include/linux/mmzone.h > > modified: mm/compaction.c > > modified: mm/memory_hotplug.c > > modified: mm/page_alloc.c > > modified: mm/page_isolation.c > > modified: mm/page_reporting.c > > modified: mm/shuffle.c > > modified: mm/vmscan.c > > modified: mm/vmstat.c > > <snip> > > > > There is one more lock, that is zone->lru_lock one. Both zone->lock and this > > one intersect between each other. If the lru_lock can be nested under zone->lock > > it should be converted as well. But i need to analyze it farther. There are > > two wrapper functions which are used as common interface to lock/unlock both > > locks. See compact_lock_irqsave()/compact_unlock_should_abort_lru() in the > > mm/compaction.c. > > > > Any thoughts here? > > I am not an expert on compaction. Vlastimil would know better. My > thinking was that zone->lock is a tail lock but compaction/page > isolation might be doing something I am not aware of right now. > > > Anyway i tried to convert only zone->lock and use page allocator passing there > > gfp_mask=0 as argument. So it works. CONFIG_PROVE_RAW_LOCK_NESTING does not > > complain about any "bad" lock nesting. > > > > > Anyway, if the zone->lock is not a good fit for raw_spin_lock then the > > > only way I can see forward is to detect real (RT) atomic contexts and > > > bail out early before taking the lock in the allocator for NOWAIT/ATOMIC > > > requests. > > > > > For RT kernel we can detect it for sure. preemtable() works just fine there, > > i.e. we can identify the context we are currently in. > > In previous email I didn't mention why I prefer full NOWAIT semantic > over rt specific bailouts. There are users making NOWAIT allocation > attempts as an opportunistic allocation request which is OK to fail > as they have a fallback to go through. This would imply they would > prefer to know this ASAP rather then get blocked and sleep. 
A lack of > reports for PREEMPT_RT would suggest that nobody has noticed as this > though. > I agree here and share your view on it. To me, making *_ATOMIC *_NOWAIT to be fully workable on both kernels sounds like correct way to go. Indeed, there can be no complains as of now. But later on it can be and the question will be raised again, what to do. -- Vlad Rezki
Michal Hocko <mhocko@suse.com> writes:
> On Mon 10-08-20 18:07:39, Uladzislau Rezki wrote:
>> > On Sun 09-08-20 22:43:53, Uladzislau Rezki (Sony) wrote:
>> > Is there any fundamental problem to make zone raw_spin_lock?
>> >
>> Good point. Converting a regular spinlock to the raw_* variant can solve
>> an issue and to me it seems partly reasonable. Because there are other
>> questions if we do it:
>>
>> a) what to do with kswapd and "wake-up path" that uses sleepable lock:
>>    wakeup_kswapd() -> wake_up_interruptible(&pgdat->kswapd_wait).
>
> If there is no RT friendly variant for waking up process from the atomic
> context then we might need to special case this for the RT tree.

That's a solvable problem.

>> b) How RT people reacts on it? I guess they will no be happy.
>
> zone->lock should be held for a very limited amount of time.

Emphasis on should. free_pcppages_bulk() can hold it for quite some time
when a large amount of pages are purged. We surely would have converted
it to a raw lock long time ago otherwise.

For regular enterprise stuff a few hundred microseconds might qualify as
a limited amount of time. For advanced RT applications that's way beyond
tolerable.

>> As i described before, calling the __get_free_page(0) with 0 as argument
>> will solve the (a). How correctly is it? From my point of view the logic
>> that bypass the wakeup path should be explicitly defined.
>
> gfp_mask == 0 is GFP_NOWAIT (aka an atomic allocation request) which
> doesn't wake up kswapd. So if the wakeup is a problem then this would be
> a way to go.

The wakeup is the least of my worries.

> To me it would make more sense to support atomic allocations also for
> the RT tree. Having both GFP_NOWAIT and GFP_ATOMIC which do not really
> work for atomic context in RT sounds subtle and wrong.

Well, no. RT moves almost everything out of atomic context which means
that GFP_ATOMIC is pretty meaningless on an RT kernel. RT sacrifices
performance for determinism. It's a known tradeoff.

Now RCU creates a new thing which enforces to make page allocation in
atomic context possible on RT. What for?

What's the actual use case in truly atomic context for this new thing on
an RT kernel?

The actual RCU code disabling interrupts is an implementation detail
which can easily be mitigated with a local lock.

Thanks,

        tglx
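For illustration, a minimal sketch of the local-lock pattern referred to
above (the struct and names are made up, this is not RCU's actual data
structure). On a non-RT kernel it behaves like disabling interrupts; on RT it
maps to a per-CPU sleeping lock, so the protected section may legally take
regular spinlock_t locks such as the page allocator's:

<snip>
#include <linux/local_lock.h>
#include <linux/percpu.h>

struct krc_cpu {
	local_lock_t	lock;
	void		*cached_page;
};

static DEFINE_PER_CPU(struct krc_cpu, krc_cpu) = {
	.lock = INIT_LOCAL_LOCK(lock),
};

/* Hand out a pre-allocated page cached on this CPU, if there is one. */
static void *krc_grab_cached_page(void)
{
	unsigned long flags;
	void *p;

	local_lock_irqsave(&krc_cpu.lock, flags);
	p = this_cpu_read(krc_cpu.cached_page);
	this_cpu_write(krc_cpu.cached_page, NULL);
	local_unlock_irqrestore(&krc_cpu.lock, flags);

	return p;
}
<snip>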
Thomas Gleixner <tglx@linutronix.de> writes:
> Michal Hocko <mhocko@suse.com> writes:
>> zone->lock should be held for a very limited amount of time.
>
> Emphasis on should. free_pcppages_bulk() can hold it for quite some time
> when a large amount of pages are purged. We surely would have converted
> it to a raw lock long time ago otherwise.
>
> For regular enterprise stuff a few hundred microseconds might qualify as
> a limited amount of time. For advanced RT applications that's way beyond
> tolerable.

Sebastian just tried with the zone lock converted to a raw lock and maximum
latencies go up by a factor of 7 when putting a bit of stress on the memory
subsystem. Just a regular kernel compile kicks them up by a factor of 5.
Way out of tolerance.

We'll have a look at whether it's solely free_pcppages_bulk() and if so we
could get away with dropping the lock in the loop.

Thanks,

        tglx
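If it does come down to that one function, the generic shape of "dropping the
lock in the loop" would be something like the sketch below (illustrative only,
not a patch; the batch size is an arbitrary example and the irq/caller
constraints of the real code still apply):

<snip>
#define DRAIN_BATCH	64

	int done = 0;

	spin_lock(&zone->lock);
	while (count--) {
		/* ... return one page to the buddy free lists ... */
		if (++done >= DRAIN_BATCH && count) {
			done = 0;
			/* open a window for a waiting (RT) task */
			spin_unlock(&zone->lock);
			spin_lock(&zone->lock);
		}
	}
	spin_unlock(&zone->lock);
<snip>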
On Tue, Aug 11, 2020 at 04:44:21PM +0200, Thomas Gleixner wrote: > Michal Hocko <mhocko@suse.com> writes: > > On Mon 10-08-20 18:07:39, Uladzislau Rezki wrote: > >> > On Sun 09-08-20 22:43:53, Uladzislau Rezki (Sony) wrote: > >> > Is there any fundamental problem to make zone raw_spin_lock? > >> > > >> Good point. Converting a regular spinlock to the raw_* variant can solve > >> an issue and to me it seems partly reasonable. Because there are other > >> questions if we do it: > >> > >> a) what to do with kswapd and "wake-up path" that uses sleepable lock: > >> wakeup_kswapd() -> wake_up_interruptible(&pgdat->kswapd_wait). > > > > If there is no RT friendly variant for waking up process from the atomic > > context then we might need to special case this for the RT tree. > > That's a solvable problem. > > >> b) How RT people reacts on it? I guess they will no be happy. > > > > zone->lock should be held for a very limited amount of time. > > Emphasis on should. free_pcppages_bulk() can hold it for quite some time > when a large amount of pages are purged. We surely would have converted > it to a raw lock long time ago otherwise. > > For regular enterprise stuff a few hundred microseconds might qualify as > a limited amount of time. For advanced RT applications that's way beyond > tolerable.. > > >> As i described before, calling the __get_free_page(0) with 0 as argument > >> will solve the (a). How correctly is it? From my point of view the logic > >> that bypass the wakeup path should be explicitly defined. > > > > gfp_mask == 0 is GFP_NOWAIT (aka an atomic allocation request) which > > doesn't wake up kswapd. So if the wakeup is a problem then this would be > > a way to go. > > The wakeup is the least of my worries. > > > To me it would make more sense to support atomic allocations also for > > the RT tree. Having both GFP_NOWAIT and GFP_ATOMIC which do not really > > work for atomic context in RT sounds subtle and wrong. > > Well, no. RT moves almost everything out of atomic context which means > that GFP_ATOMIC is pretty meanlingless on a RT kernel. RT sacrifies > performance for determinism. It's a known tradeoff. > > Now RCU creates a new thing which enforces to make page allocation in > atomic context possible on RT. What for? > > What's the actual use case in truly atomic context for this new thing on > an RT kernel? It is not just RT kernels. CONFIG_PROVE_RAW_LOCK_NESTING=y propagates this constraint to all configurations, and a patch in your new favorite subsystem really did trigger this lockdep check in a non-RT kernel. > The actual RCU code disabling interrupts is an implementation detail > which can easily be mitigated with a local lock. In this case, we are in raw-spinlock context on entry to kfree_rcu(). Thanx, Paul
"Paul E. McKenney" <paulmck@kernel.org> writes: > On Tue, Aug 11, 2020 at 04:44:21PM +0200, Thomas Gleixner wrote: >> Now RCU creates a new thing which enforces to make page allocation in >> atomic context possible on RT. What for? >> >> What's the actual use case in truly atomic context for this new thing on >> an RT kernel? > > It is not just RT kernels. CONFIG_PROVE_RAW_LOCK_NESTING=y propagates > this constraint to all configurations, and a patch in your new favorite > subsystem really did trigger this lockdep check in a non-RT kernel. > >> The actual RCU code disabling interrupts is an implementation detail >> which can easily be mitigated with a local lock. > > In this case, we are in raw-spinlock context on entry to kfree_rcu(). Where?
On 2020-08-11 17:43:16 [+0200], Thomas Gleixner wrote:
> Where?
See commit 8ac88f7177c75 ("rcu/tree: Keep kfree_rcu() awake during lock contention")
Sebastian
On Tue, Aug 11, 2020 at 05:43:16PM +0200, Thomas Gleixner wrote: > "Paul E. McKenney" <paulmck@kernel.org> writes: > > On Tue, Aug 11, 2020 at 04:44:21PM +0200, Thomas Gleixner wrote: > >> Now RCU creates a new thing which enforces to make page allocation in > >> atomic context possible on RT. What for? > >> > >> What's the actual use case in truly atomic context for this new thing on > >> an RT kernel? > > > > It is not just RT kernels. CONFIG_PROVE_RAW_LOCK_NESTING=y propagates > > this constraint to all configurations, and a patch in your new favorite > > subsystem really did trigger this lockdep check in a non-RT kernel. > > > >> The actual RCU code disabling interrupts is an implementation detail > >> which can easily be mitigated with a local lock. > > > > In this case, we are in raw-spinlock context on entry to kfree_rcu(). > > Where? Some BPF code that needs to process and free a list. As noted above, this is a patch rather than something that is already in mainline. Not surprising, though, given call_rcu() invocations in similar contexts. Yes, we can perhaps rework all current and future callers to avoid invoking both call_rcu() and kfree_rcu() from raw atomic context, but the required change to permit this is quite a bit simpler. Thanx, Paul
On Tue, Aug 11, 2020 at 09:02:40AM -0700, Paul E. McKenney wrote: > On Tue, Aug 11, 2020 at 05:43:16PM +0200, Thomas Gleixner wrote: > > "Paul E. McKenney" <paulmck@kernel.org> writes: > > > On Tue, Aug 11, 2020 at 04:44:21PM +0200, Thomas Gleixner wrote: > > >> Now RCU creates a new thing which enforces to make page allocation in > > >> atomic context possible on RT. What for? > > >> > > >> What's the actual use case in truly atomic context for this new thing on > > >> an RT kernel? > > > > > > It is not just RT kernels. CONFIG_PROVE_RAW_LOCK_NESTING=y propagates > > > this constraint to all configurations, and a patch in your new favorite > > > subsystem really did trigger this lockdep check in a non-RT kernel. > > > > > >> The actual RCU code disabling interrupts is an implementation detail > > >> which can easily be mitigated with a local lock. > > > > > > In this case, we are in raw-spinlock context on entry to kfree_rcu(). > > > > Where? > > Some BPF code that needs to process and free a list. As noted above, > this is a patch rather than something that is already in mainline. > Not surprising, though, given call_rcu() invocations in similar contexts. > > Yes, we can perhaps rework all current and future callers to avoid > invoking both call_rcu() and kfree_rcu() from raw atomic context, but > the required change to permit this is quite a bit simpler. I should hasten to add that from what I can see right now, the required change allows telling the memory allocator bail out instead of acquiring a non-raw spinlock. I am absolutely not advocating converting the allocator's spinlocks to raw spinlocks. Thanx, Paul
Thomas Gleixner <tglx@linutronix.de> writes: > "Paul E. McKenney" <paulmck@kernel.org> writes: >> On Tue, Aug 11, 2020 at 04:44:21PM +0200, Thomas Gleixner wrote: >>> Now RCU creates a new thing which enforces to make page allocation in >>> atomic context possible on RT. What for? >>> >>> What's the actual use case in truly atomic context for this new thing on >>> an RT kernel? >> >> It is not just RT kernels. CONFIG_PROVE_RAW_LOCK_NESTING=y propagates >> this constraint to all configurations, and a patch in your new favorite >> subsystem really did trigger this lockdep check in a non-RT kernel. >> >>> The actual RCU code disabling interrupts is an implementation detail >>> which can easily be mitigated with a local lock. >> >> In this case, we are in raw-spinlock context on entry to kfree_rcu(). > > Where? And aside of the where, wasn't kfree_rcu() from within raw spinlock held regions possible all the time? Either I'm missing something or you are fundamentally changing RCU internals. kfree_rcu() saved RT in various ways where invoking kfree() was just not an option. Confused... Thanks, tglx
On Tue, Aug 11, 2020 at 09:39:10PM +0200, Thomas Gleixner wrote: > Thomas Gleixner <tglx@linutronix.de> writes: > > "Paul E. McKenney" <paulmck@kernel.org> writes: > >> On Tue, Aug 11, 2020 at 04:44:21PM +0200, Thomas Gleixner wrote: > >>> Now RCU creates a new thing which enforces to make page allocation in > >>> atomic context possible on RT. What for? > >>> > >>> What's the actual use case in truly atomic context for this new thing on > >>> an RT kernel? > >> > >> It is not just RT kernels. CONFIG_PROVE_RAW_LOCK_NESTING=y propagates > >> this constraint to all configurations, and a patch in your new favorite > >> subsystem really did trigger this lockdep check in a non-RT kernel. > >> > >>> The actual RCU code disabling interrupts is an implementation detail > >>> which can easily be mitigated with a local lock. > >> > >> In this case, we are in raw-spinlock context on entry to kfree_rcu(). > > > > Where? > > And aside of the where, wasn't kfree_rcu() from within raw spinlock held > regions possible all the time? Either I'm missing something or you are > fundamentally changing RCU internals. kfree_rcu() saved RT in various > ways where invoking kfree() was just not an option. Confused... Back in the old days (months ago!), it was possible to invoke kfree_rcu() while holding a raw spinlock because kfree_rcu() didn't acquire any locks. It didn't need to because it was just a wrapper around call_rcu(). And call_rcu(), along with the rest of RCU's internals, has used raw spinlocks since near the beginnings of RT, which meant that it wasn't a problem in the rare cases where call_rcu() needed to acquire one of RCU's locks (for example, when call_rcu() sees that the current CPU has accumulated more than 10,000 callbacks). But one problem with the old kfree_rcu() approach is that the memory to be freed is almost always cold in the cache by the time that the grace period ends. And this was made worse by the fact that the rcu_do_batch() function traverses the list of objects to be freed as a linked list, thus incurring a cache miss on each and every object. The usual fix for this sort of performance problem is to use arrays of pointers, which on a 64-bit system with 64-byte cache lines reduces the number of cache misses by a factor of eight. In addition, decreasing the number of post-grace-period cache misses increases the stability of RCU with respect to callback flooding: Because the kfree()s happen faster, it is harder to overrun RCU with tight loops posting callbacks (as happened some time back with certain types of ACLs). Hence Ulad's work on kfree_rcu(). The approach is to allocate a page-sized array to hold all the pointers, then fill in the rest of these pointers on each subsequent kfree_rcu() call. These arrays of pointers also allows use of kfree_bulk() instead of kfree(), which can improve performance yet more. It is no big deal if kfree_rcu()'s allocation attempts fail occasionally because it can simply fall back to the old linked-list approach. And given that the various lockless caches in the memory allocator are almost never empty, in theory life is good. But in practice, mainline now has CONFIG_PROVE_RAW_LOCK_NESTING, and for good reason -- this Kconfig option makes it at least a little bit harder for mainline developers to mess up RT. But with CONFIG_PROVE_RAW_LOCK_NESTING=y and lockdep enabled, mainline will now sometimes complain if you invoke kfree_rcu() while holding a raw spinlock. 
This happens when kfree_rcu() needs to invoke the memory allocator and the memory allocator's caches are empty, thus resulting in the memory allocator attempting to acquire a non-raw spinlock. Because kfree_rcu() has a fallback available (just using the old linked list), kfree_rcu() would work well given a way to tell the memory allocator to return NULL instead of acquiring a non-raw spinlock. Which is exactly what Ulad's recent patches are intended to do. Since then, this thread has discussed various other approaches, including using existing combinations of GFP_ flags, converting the allocator's zone lock to a raw spinlock, and so on. Does that help, or am I missing the point of your question? Thanx, Paul
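To make the pointer-array idea concrete, a simplified sketch (illustrative
types and names only, not the kernel's kfree_rcu() implementation):

<snip>
#include <linux/mm.h>
#include <linux/slab.h>

/* One page holds a small header plus an array of object pointers. */
struct bulk_page {
	struct bulk_page *next;
	unsigned int nr;
	void *ptrs[];
};

#define BULK_CAP \
	((PAGE_SIZE - sizeof(struct bulk_page)) / sizeof(void *))

/* Returns false if no page could be had without blocking. */
static bool bulk_queue(struct bulk_page **head, void *obj)
{
	struct bulk_page *bp = *head;

	if (!bp || bp->nr == BULK_CAP) {
		/*
		 * GFP_NOWAIT never sleeps, but it can still end up in
		 * zone->lock when the per-cpu lists are empty - which is
		 * exactly the gap this thread is about.
		 */
		bp = (void *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
		if (!bp)
			return false;	/* fall back to the linked list */
		bp->next = *head;
		bp->nr = 0;
		*head = bp;
	}

	bp->ptrs[bp->nr++] = obj;
	return true;
}

/* After the grace period: kfree_bulk(bp->nr, bp->ptrs) for each page. */
<snip>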
"Paul E. McKenney" <paulmck@kernel.org> writes: > Hence Ulad's work on kfree_rcu(). The approach is to allocate a > page-sized array to hold all the pointers, then fill in the rest of these > pointers on each subsequent kfree_rcu() call. These arrays of pointers > also allows use of kfree_bulk() instead of kfree(), which can improve > performance yet more. It is no big deal if kfree_rcu()'s allocation > attempts fail occasionally because it can simply fall back to the old > linked-list approach. And given that the various lockless caches in > the memory allocator are almost never empty, in theory life is good. Of course, it's always the damned reality which ruins the fun. > But in practice, mainline now has CONFIG_PROVE_RAW_LOCK_NESTING, > and for good reason -- this Kconfig option makes it at least a > little bit harder for mainline developers to mess up RT. But with > CONFIG_PROVE_RAW_LOCK_NESTING=y and lockdep enabled, mainline will now > sometimes complain if you invoke kfree_rcu() while holding a raw spinlock. > This happens when kfree_rcu() needs to invoke the memory allocator and > the memory allocator's caches are empty, thus resulting in the memory > allocator attempting to acquire a non-raw spinlock. Right. > Because kfree_rcu() has a fallback available (just using the old linked > list), kfree_rcu() would work well given a way to tell the memory > allocator to return NULL instead of acquiring a non-raw spinlock. > Which is exactly what Ulad's recent patches are intended to do. That much I understood, but I somehow failed to figure the why out despite the elaborate changelog. 2 weeks of 30+C seem to have cooked my brain :) > Since then, this thread has discussed various other approaches, > including using existing combinations of GFP_ flags, converting > the allocator's zone lock to a raw spinlock, and so on. > > Does that help, or am I missing the point of your question? Yes, that helps so far that I understand what the actual problem is. It does not really help to make me more happy. :) That said, we can support atomic allocations on RT up to the point where zone->lock comes into play. We don't know yet exactly where the zone->lock induced damage happens. Presumably it's inside free_pcppages_bulk() - at least that's where I have faint bad memories from 15+ years ago. Aside of that I seriously doubt that it can be made work within a reasonable time frame. But what makes me really unhappy is that my defense line against allocations from truly atomic contexts (from RT POV) which was enforced on RT gets a real big gap shot into it. It becomes pretty hard to argue why atomic allocations via kmalloc() or kmem_cache_alloc() should be treated any different. Technically they can work similar to the page allocations up to the point where regular spinlocks come into play or the slab cache is exhausted. Where to draw the line? It's also unclear for the page allocator case whether we can and should stick a limit on the number of pages and/or the pageorder. Myself and others spent a considerable amount of time to kill off these kind of allocations from various interesting places including the guts of send IPI, the affinity setting path and others where people just slapped allocations into them because the stack checker warned or because they happened to copy the code from some other place. RT was pretty much a quick crap detector whenever new incarnations of this got added and to some extent continuous education about these issues made them less prominent over the years. 
Using atomic allocations should always have a real good rationale, not only in the cases where they collide with RT. I can understand your rationale and what you are trying to solve. So, if we can actually have a distinct GFP variant: GFP_I_ABSOLUTELY_HAVE_TO_DO_THAT_AND_I_KNOW_IT_CAN_FAIL_EARLY which is easy to grep for then having the page allocator go down to the point where zone lock gets involved is not the end of the world for RT in theory - unless that damned reality tells otherwise. :) The page allocator allocations should also have a limit on the number of pages and eventually also page order (need to stare at the code or let Michal educate me that the order does not matter). To make it consistent the same GFP_ variant should allow the slab allocator go to the point where the slab cache is exhausted. Having a distinct and clearly defined GFP_ variant is really key to chase down offenders and to make reviewers double check upfront why this is absolutely required. Thanks, tglx
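As a strawman of what such a distinct, greppable variant could look like
(names and the bit value are purely illustrative; the posted series calls its
variant __GFP_NO_LOCKS):

<snip>
/* include/linux/gfp.h (sketch) */
#define ___GFP_NO_LOCKS		0x800000u	/* example bit, not a real one */
#define __GFP_NO_LOCKS		((__force gfp_t)___GFP_NO_LOCKS)
#define GFP_NO_LOCKS		(__GFP_NO_LOCKS)

/* allocator side: fail before any non-raw lock, and cap the order */
if (gfp_mask & __GFP_NO_LOCKS) {
	if (WARN_ON_ONCE(order > 0))
		return NULL;
	/* ... lockless per-cpu attempt only, NULL otherwise ... */
}
<snip>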
On Wed, Aug 12, 2020 at 02:13:25AM +0200, Thomas Gleixner wrote: > "Paul E. McKenney" <paulmck@kernel.org> writes: > > Hence Ulad's work on kfree_rcu(). The approach is to allocate a > > page-sized array to hold all the pointers, then fill in the rest of these > > pointers on each subsequent kfree_rcu() call. These arrays of pointers > > also allows use of kfree_bulk() instead of kfree(), which can improve > > performance yet more. It is no big deal if kfree_rcu()'s allocation > > attempts fail occasionally because it can simply fall back to the old > > linked-list approach. And given that the various lockless caches in > > the memory allocator are almost never empty, in theory life is good. > > Of course, it's always the damned reality which ruins the fun. Classic!!! And yes, it always is! > > But in practice, mainline now has CONFIG_PROVE_RAW_LOCK_NESTING, > > and for good reason -- this Kconfig option makes it at least a > > little bit harder for mainline developers to mess up RT. But with > > CONFIG_PROVE_RAW_LOCK_NESTING=y and lockdep enabled, mainline will now > > sometimes complain if you invoke kfree_rcu() while holding a raw spinlock. > > This happens when kfree_rcu() needs to invoke the memory allocator and > > the memory allocator's caches are empty, thus resulting in the memory > > allocator attempting to acquire a non-raw spinlock. > > Right. > > > Because kfree_rcu() has a fallback available (just using the old linked > > list), kfree_rcu() would work well given a way to tell the memory > > allocator to return NULL instead of acquiring a non-raw spinlock. > > Which is exactly what Ulad's recent patches are intended to do. > > That much I understood, but I somehow failed to figure the why out > despite the elaborate changelog. 2 weeks of 30+C seem to have cooked my > brain :) Ouch!!! And what on earth is Germany doing being that warm??? I hate it when that happens... > > Since then, this thread has discussed various other approaches, > > including using existing combinations of GFP_ flags, converting > > the allocator's zone lock to a raw spinlock, and so on. > > > > Does that help, or am I missing the point of your question? > > Yes, that helps so far that I understand what the actual problem is. It > does not really help to make me more happy. :) I must confess that I was not expecting to find anything resembling happiness anywhere down this road, whether for myself or anyone else... > That said, we can support atomic allocations on RT up to the point where > zone->lock comes into play. We don't know yet exactly where the > zone->lock induced damage happens. Presumably it's inside > free_pcppages_bulk() - at least that's where I have faint bad memories > from 15+ years ago. Aside of that I seriously doubt that it can be made > work within a reasonable time frame. I was not considering any approach other than return NULL just before the code would otherwise have acquired zone->lock. > But what makes me really unhappy is that my defense line against > allocations from truly atomic contexts (from RT POV) which was enforced > on RT gets a real big gap shot into it. Understood, and agreed: We do need to keep the RT degradation in check. > It becomes pretty hard to argue why atomic allocations via kmalloc() or > kmem_cache_alloc() should be treated any different. Technically they can > work similar to the page allocations up to the point where regular > spinlocks come into play or the slab cache is exhausted. Where to draw > the line? 
> > It's also unclear for the page allocator case whether we can and should > stick a limit on the number of pages and/or the pageorder. > > Myself and others spent a considerable amount of time to kill off these > kind of allocations from various interesting places including the guts > of send IPI, the affinity setting path and others where people just > slapped allocations into them because the stack checker warned or > because they happened to copy the code from some other place. > > RT was pretty much a quick crap detector whenever new incarnations of > this got added and to some extent continuous education about these > issues made them less prominent over the years. Using atomic allocations > should always have a real good rationale, not only in the cases where > they collide with RT. > > I can understand your rationale and what you are trying to solve. So, if > we can actually have a distinct GFP variant: > > GFP_I_ABSOLUTELY_HAVE_TO_DO_THAT_AND_I_KNOW_IT_CAN_FAIL_EARLY > > which is easy to grep for then having the page allocator go down to the > point where zone lock gets involved is not the end of the world for > RT in theory - unless that damned reality tells otherwise. :) I have no objection to an otherwise objectionable name in this particular case. After all, we now have 100 characters per line, right? ;-) > The page allocator allocations should also have a limit on the number of > pages and eventually also page order (need to stare at the code or let > Michal educate me that the order does not matter). Understood. I confess that I have but little understanding of that code. > To make it consistent the same GFP_ variant should allow the slab > allocator go to the point where the slab cache is exhausted. Why not wait until someone has an extremely good reason for needing this functionality from the slab allocators? After all, leaving out the slab allocators would provide a more robust defense line. Yes, consistent APIs are very good things as a general rule, but maybe this situation is one of the exceptions to that rule. > Having a distinct and clearly defined GFP_ variant is really key to > chase down offenders and to make reviewers double check upfront why this > is absolutely required. Checks for that GFP_ variant could be added to automation, though reality might eventually prove that to be a mixed blessing. Thanx, Paul
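To make the pointer-array scheme Paul describes above concrete, here is a toy sketch. It is not the in-tree kfree_rcu() code: the names krc_bulk/krc_queue are invented, locking is omitted, and the GFP flags used are exactly the open question of this thread. A page is filled with pointers and later handed to kfree_bulk(); if the page allocation fails, the object is simply chained through its rcu_head as in the old linked-list path.

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/rcupdate.h>

struct krc_bulk {
	unsigned long nr;		/* number of pointers stored so far */
	void *ptrs[];			/* fills the remainder of the page */
};

static struct krc_bulk *bulk;		/* current pointer page, may be NULL */
static struct rcu_head *head;		/* fallback: old linked-list channel */

static void krc_queue(struct rcu_head *rhp, void *obj)
{
	unsigned long max = (PAGE_SIZE - sizeof(*bulk)) / sizeof(void *);

	if (!bulk) {
		/* May fail in atomic context; that is acceptable by design. */
		bulk = (void *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
		if (bulk)
			bulk->nr = 0;
	}

	if (bulk && bulk->nr < max) {
		bulk->ptrs[bulk->nr++] = obj;	/* later: kfree_bulk(nr, ptrs) */
		return;
	}

	/* Fallback: queue the object on the classic rcu_head list. */
	rhp->next = head;
	head = rhp;
}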
Paul, "Paul E. McKenney" <paulmck@kernel.org> writes: > On Wed, Aug 12, 2020 at 02:13:25AM +0200, Thomas Gleixner wrote: >> That much I understood, but I somehow failed to figure the why out >> despite the elaborate changelog. 2 weeks of 30+C seem to have cooked my >> brain :) > > Ouch!!! And what on earth is Germany doing being that warm??? The hot air exhaustion of politicians, managers and conspiracy mythomaniacs seens to have contributed extensivly to global warming lately. >> But what makes me really unhappy is that my defense line against >> allocations from truly atomic contexts (from RT POV) which was enforced >> on RT gets a real big gap shot into it. > > Understood, and agreed: We do need to keep the RT degradation in > check. Not only that. It's bad practice in general to do memory allocations from such contexts if not absolutely necessary and the majority of cases which we cleaned up over time were just from the "works for me and why should I care and start to think" departement. >> I can understand your rationale and what you are trying to solve. So, if >> we can actually have a distinct GFP variant: >> >> GFP_I_ABSOLUTELY_HAVE_TO_DO_THAT_AND_I_KNOW_IT_CAN_FAIL_EARLY >> >> which is easy to grep for then having the page allocator go down to the >> point where zone lock gets involved is not the end of the world for >> RT in theory - unless that damned reality tells otherwise. :) > > I have no objection to an otherwise objectionable name in this particular > case. After all, we now have 100 characters per line, right? ;-) Hehe. I can live with the proposed NO_LOCK name or anything distinct which the mm people can agree on. >> To make it consistent the same GFP_ variant should allow the slab >> allocator go to the point where the slab cache is exhausted. > > Why not wait until someone has an extremely good reason for needing > this functionality from the slab allocators? After all, leaving out > the slab allocators would provide a more robust defense line. Yes, > consistent APIs are very good things as a general rule, but maybe this > situation is one of the exceptions to that rule. Fair enough. >> Having a distinct and clearly defined GFP_ variant is really key to >> chase down offenders and to make reviewers double check upfront why this >> is absolutely required. > > Checks for that GFP_ variant could be added to automation, though reality > might eventually prove that to be a mixed blessing. Did you really have to remind me and destroy my illusions before I was able to marvel at them? Thanks, tglx
Thomas Gleixner <tglx@linutronix.de> writes:
> Thomas Gleixner <tglx@linutronix.de> writes:
>> Michal Hocko <mhocko@suse.com> writes:
>>> zone->lock should be held for a very limited amount of time.
>>
>> Emphasis on should. free_pcppages_bulk() can hold it for quite some time
>> when a large amount of pages are purged. We surely would have converted
>> it to a raw lock long time ago otherwise.
>>
>> For regular enterprise stuff a few hundred microseconds might qualify as
>> a limited amount of time. For advanced RT applications that's way beyond
>> tolerable..
>
> Sebastian just tried with zone lock converted to a raw lock and maximum
> latencies go up by a factor of 7 when putting a bit of stress on the
> memory subsytem. Just a regular kernel compile kicks them up by a factor
> of 5. Way out of tolerance.
>
> We'll have a look whether it's solely free_pcppages_bulk() and if so we
> could get away with dropping the lock in the loop.

So even on !RT and just doing a kernel compile the time spent in
free_pcppages_bulk() is up to 270 usec.

It's not only the loop which processes a large pile of pages, part of it
is caused by lock contention on zone->lock. Dropping the lock after
processing a couple of pages does not make it much better if enough CPUs
are contending on the lock.

Thanks,

        tglx
On Wed, Aug 12, 2020 at 01:38:35PM +0200, Thomas Gleixner wrote: > Thomas Gleixner <tglx@linutronix.de> writes: > > Thomas Gleixner <tglx@linutronix.de> writes: > >> Michal Hocko <mhocko@suse.com> writes: > >>> zone->lock should be held for a very limited amount of time. > >> > >> Emphasis on should. free_pcppages_bulk() can hold it for quite some time > >> when a large amount of pages are purged. We surely would have converted > >> it to a raw lock long time ago otherwise. > >> > >> For regular enterprise stuff a few hundred microseconds might qualify as > >> a limited amount of time. For advanced RT applications that's way beyond > >> tolerable.. > > > > Sebastian just tried with zone lock converted to a raw lock and maximum > > latencies go up by a factor of 7 when putting a bit of stress on the > > memory subsytem. Just a regular kernel compile kicks them up by a factor > > of 5. Way out of tolerance. > > > > We'll have a look whether it's solely free_pcppages_bulk() and if so we > > could get away with dropping the lock in the loop. > > So even on !RT and just doing a kernel compile the time spent in > free_pcppages_bulk() is up to 270 usec. > I suspect if you measure the latency of the zone->lock and its contention on any embedded device, i mean not powerful devices like PC, it could be milliseconds. IMHO. > > It's not only the loop which processes a large pile of pages, part of it > is caused by lock contention on zone->lock. Dropping the lock after a > processing a couple of pages does not make it much better if enough CPUs > are contending on the lock. > Initially i have not proposed to convert the lock, because i suspected that from the RT point of view there could be problems. Also, like i mentioned before, the GFP_ATOMIC is not meaningful anymore, that is a bit out of what GFP_ATOMIC stands for. But i see your point about "where is a stop line". That is why i proposed to bail out as later as possible: mm: Add __GFP_NO_LOCKS flag From the other hand we have been discussing other options, like converting. Just to cover as much as possible :) Thanks Thomas for valuable comments! -- Vlad Rezki
On Wed, Aug 12, 2020 at 10:32:50AM +0200, Thomas Gleixner wrote: > Paul, > > "Paul E. McKenney" <paulmck@kernel.org> writes: > > On Wed, Aug 12, 2020 at 02:13:25AM +0200, Thomas Gleixner wrote: > >> That much I understood, but I somehow failed to figure the why out > >> despite the elaborate changelog. 2 weeks of 30+C seem to have cooked my > >> brain :) > > > > Ouch!!! And what on earth is Germany doing being that warm??? > > The hot air exhaustion of politicians, managers and conspiracy > mythomaniacs seens to have contributed extensivly to global warming > lately. In that case, our only hope here in this geography is that we are in a simulation, so that the hot air will cause a signed integer overflow to negative numbers some fraction of the time. :-( > >> But what makes me really unhappy is that my defense line against > >> allocations from truly atomic contexts (from RT POV) which was enforced > >> on RT gets a real big gap shot into it. > > > > Understood, and agreed: We do need to keep the RT degradation in > > check. > > Not only that. It's bad practice in general to do memory allocations > from such contexts if not absolutely necessary and the majority of cases > which we cleaned up over time were just from the "works for me and why > should I care and start to think" departement. Agreed, and I continue to see some of that myself. :-/ > >> I can understand your rationale and what you are trying to solve. So, if > >> we can actually have a distinct GFP variant: > >> > >> GFP_I_ABSOLUTELY_HAVE_TO_DO_THAT_AND_I_KNOW_IT_CAN_FAIL_EARLY > >> > >> which is easy to grep for then having the page allocator go down to the > >> point where zone lock gets involved is not the end of the world for > >> RT in theory - unless that damned reality tells otherwise. :) > > > > I have no objection to an otherwise objectionable name in this particular > > case. After all, we now have 100 characters per line, right? ;-) > > Hehe. I can live with the proposed NO_LOCK name or anything distinct > which the mm people can agree on. Sounds good. ;-) > >> To make it consistent the same GFP_ variant should allow the slab > >> allocator go to the point where the slab cache is exhausted. > > > > Why not wait until someone has an extremely good reason for needing > > this functionality from the slab allocators? After all, leaving out > > the slab allocators would provide a more robust defense line. Yes, > > consistent APIs are very good things as a general rule, but maybe this > > situation is one of the exceptions to that rule. > > Fair enough. > > >> Having a distinct and clearly defined GFP_ variant is really key to > >> chase down offenders and to make reviewers double check upfront why this > >> is absolutely required. > > > > Checks for that GFP_ variant could be added to automation, though reality > > might eventually prove that to be a mixed blessing. > > Did you really have to remind me and destroy my illusions before I was > able to marvel at them? Apologies! I am afraid that it has become a reflex due to living in this time and place. My further fear is that I will have all to great an opportunity for further reinforcing this reflex in the future. :-/ Thanx, Paul
On Wed 12-08-20 13:38:35, Thomas Gleixner wrote: > Thomas Gleixner <tglx@linutronix.de> writes: > > Thomas Gleixner <tglx@linutronix.de> writes: > >> Michal Hocko <mhocko@suse.com> writes: > >>> zone->lock should be held for a very limited amount of time. > >> > >> Emphasis on should. free_pcppages_bulk() can hold it for quite some time > >> when a large amount of pages are purged. We surely would have converted > >> it to a raw lock long time ago otherwise. > >> > >> For regular enterprise stuff a few hundred microseconds might qualify as > >> a limited amount of time. For advanced RT applications that's way beyond > >> tolerable.. > > > > Sebastian just tried with zone lock converted to a raw lock and maximum > > latencies go up by a factor of 7 when putting a bit of stress on the > > memory subsytem. Just a regular kernel compile kicks them up by a factor > > of 5. Way out of tolerance. > > > > We'll have a look whether it's solely free_pcppages_bulk() and if so we > > could get away with dropping the lock in the loop. > > So even on !RT and just doing a kernel compile the time spent in > free_pcppages_bulk() is up to 270 usec. > > It's not only the loop which processes a large pile of pages, part of it > is caused by lock contention on zone->lock. Dropping the lock after a > processing a couple of pages does not make it much better if enough CPUs > are contending on the lock. OK, this is a bit surprising to me but well, reality sucks sometimes. I was really hoping for a solution which would allow the allocator to really do what gfp flags say and if something is nowait then it shouldn't really block. I believe we need to document this properly. I will comment on the dedicated gfp flag in reply to other email. Thanks for trying that out Sebastian and Thomas!
On Wed 12-08-20 02:13:25, Thomas Gleixner wrote:
[...]
> I can understand your rationale and what you are trying to solve. So, if
> we can actually have a distinct GFP variant:
>
> GFP_I_ABSOLUTELY_HAVE_TO_DO_THAT_AND_I_KNOW_IT_CAN_FAIL_EARLY

Even if we cannot make the zone->lock raw I would prefer to not
introduce a new gfp flag. Well we can do an alias for easier grepping
#define GFP_RT_SAFE	0

that would imply nowait semantic and would exclude waking up kswapd as
well. If we can make wake up safe under RT then the alias would reflect
that without any code changes.

The second, and the more important part, would be to bail out anytime the
page allocator is to take a lock which is not allowed in the current RT
context. Something like

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 67a0774e080b..3ef3ac82d110 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -237,6 +237,9 @@ struct vm_area_struct;
  * that subsystems start with one of these combinations and then set/clear
  * %__GFP_FOO flags as necessary.
  *
+ * %GFP_RT_SAFE users can not sleep and they are running under RT atomic context
+ * e.g. under raw_spin_lock. Failure of an allocation is to be expected.
+ *
  * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
  * watermark is applied to allow access to "atomic reserves"
  *
@@ -293,6 +296,7 @@ struct vm_area_struct;
  * version does not attempt reclaim/compaction at all and is by default used
  * in page fault path, while the non-light is used by khugepaged.
  */
+#define GFP_RT_SAFE	0
 #define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
 #define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
 #define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e028b87ce294..268ae872cc2a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2824,6 +2824,13 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 {
 	int i, alloced = 0;
 
+	/*
+	 * Hard atomic contexts are not supported by the allocator for
+	 * anything but pcp requests
+	 */
+	if (!preemtable())
+		return 0;
+
 	spin_lock(&zone->lock);
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype,
@@ -3371,6 +3378,13 @@ struct page *rmqueue(struct zone *preferred_zone,
 		goto out;
 	}
 
+	/*
+	 * Hard atomic contexts are not supported by the allocator for high
+	 * order requests
+	 */
+	if (WARN_ON_ONCE(!preemtable()))
+		reurn NULL;
+
 	/*
 	 * We most definitely don't want callers attempting to
 	 * allocate greater than order-1 page units with __GFP_NOFAIL.
@@ -4523,6 +4537,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 			(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
 		gfp_mask &= ~__GFP_ATOMIC;
 
+	/* Hard atomic contexts support is very limited to the fast path */
+	if (!preemtable()) {
+		WARN_ON_ONCE(gfp_mask != GFP_RT_SAFE);
+		return NULL;
+	}
+
 retry_cpuset:
 	compaction_retries = 0;
 	no_progress_loops = 0;

What do you think?

> which is easy to grep for then having the page allocator go down to the
> point where zone lock gets involved is not the end of the world for
> RT in theory - unless that damned reality tells otherwise. :)
>
> The page allocator allocations should also have a limit on the number of
> pages and eventually also page order (need to stare at the code or let
> Michal educate me that the order does not matter).

In practice anything but order 0 is out of question because we need
zone->lock for that currently.
Maybe we can introduce pcp lists for higher orders in the future - I have a vague recollection Mel was playing with that some time ago. > To make it consistent the same GFP_ variant should allow the slab > allocator go to the point where the slab cache is exhausted. > > Having a distinct and clearly defined GFP_ variant is really key to > chase down offenders and to make reviewers double check upfront why this > is absolutely required. Having a high level and recognizable gfp mask is OK but I would really like not to introduce a dedicated flag. The page allocator should be able to recognize the context which cannot be handled.
On Thu, Aug 13, 2020 at 09:50:27AM +0200, Michal Hocko wrote: > On Wed 12-08-20 02:13:25, Thomas Gleixner wrote: > [...] > > I can understand your rationale and what you are trying to solve. So, if > > we can actually have a distinct GFP variant: > > > > GFP_I_ABSOLUTELY_HAVE_TO_DO_THAT_AND_I_KNOW_IT_CAN_FAIL_EARLY > > Even if we cannot make the zone->lock raw I would prefer to not > introduce a new gfp flag. Well we can do an alias for easier grepping > #define GFP_RT_SAFE 0 > > that would imply nowait semantic and would exclude waking up kswapd as > well. If we can make wake up safe under RT then the alias would reflect > that without any code changes. > > The second, and the more important part, would be to bail out anytime > the page allocator is to take a lock which is not allowed in the current > RT context. Something like > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index 67a0774e080b..3ef3ac82d110 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -237,6 +237,9 @@ struct vm_area_struct; > * that subsystems start with one of these combinations and then set/clear > * %__GFP_FOO flags as necessary. > * > + * %GFP_RT_SAFE users can not sleep and they are running under RT atomic context > + * e.g. under raw_spin_lock. Failure of an allocation is to be expected. > + * > * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower > * watermark is applied to allow access to "atomic reserves" > * > @@ -293,6 +296,7 @@ struct vm_area_struct; > * version does not attempt reclaim/compaction at all and is by default used > * in page fault path, while the non-light is used by khugepaged. > */ > +#define GFP_RT_SAFE 0 > #define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM) > #define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS) > #define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT) > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index e028b87ce294..268ae872cc2a 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2824,6 +2824,13 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, > { > int i, alloced = 0; > > + /* > + * Hard atomic contexts are not supported by the allocator for > + * anything but pcp requests > + */ > + if (!preemtable()) > + return 0; > + > spin_lock(&zone->lock); > for (i = 0; i < count; ++i) { > struct page *page = __rmqueue(zone, order, migratetype, > @@ -3371,6 +3378,13 @@ struct page *rmqueue(struct zone *preferred_zone, > goto out; > } > > + /* > + * Hard atomic contexts are not supported by the allocator for high > + * order requests > + */ > + if (WARN_ON_ONCE(!preemtable())) > + reurn NULL; > + > /* > * We most definitely don't want callers attempting to > * allocate greater than order-1 page units with __GFP_NOFAIL. > @@ -4523,6 +4537,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM))) > gfp_mask &= ~__GFP_ATOMIC; > > + /* Hard atomic contexts support is very limited to the fast path */ > + if (!preemtable()) { > + WARN_ON_ONCE(gfp_mask != GFP_RT_SAFE); > + return NULL; > + } > + > retry_cpuset: > compaction_retries = 0; > no_progress_loops = 0; > > What do you think? > > > which is easy to grep for then having the page allocator go down to the > > point where zone lock gets involved is not the end of the world for > > RT in theory - unless that damned reality tells otherwise. 
:) > > > > The page allocator allocations should also have a limit on the number of > > pages and eventually also page order (need to stare at the code or let > > Michal educate me that the order does not matter). > > In practice anything but order 0 is out of question because we need > zone->lock for that currently. Maybe we can introduce pcp lists for > higher orders in the future - I have a vague recollection Mel was > playing with that some time ago. > > > To make it consistent the same GFP_ variant should allow the slab > > allocator go to the point where the slab cache is exhausted. > > > > Having a distinct and clearly defined GFP_ variant is really key to > > chase down offenders and to make reviewers double check upfront why this > > is absolutely required. > > Having a high level and recognizable gfp mask is OK but I would really > like not to introduce a dedicated flag. The page allocator should be > able to recognize the context which cannot be handled. > Sorry for jumping in. We can rely on preemptable() for sure, if CONFIG_PREEMPT_RT is enabled, something like below: if (IS_ENABLED_RT && preemptebale()) Also i have a question about pcp-lists. Can we introduce and use all allowed MIGRATE_PCPTYPES? If called with GFP_RT_SAFE? If not please elaborate. According to my tests it really helps when it comes to: succeed(return the page) or NULL. Because on of the list of below types: MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RECLAIMABLE, can have a page making allocation succeed. Thanks! -- Vlad Rezki
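A sketch of what Uladzislau is asking for here, illustrative only: the helper name is invented and locking/statistics handling is omitted. Instead of checking only the pcp list matching the request's migratetype, scan the lists of all MIGRATE_PCPTYPES before failing, since any of MIGRATE_UNMOVABLE, MIGRATE_MOVABLE or MIGRATE_RECLAIMABLE may still hold a page that can be returned without touching zone->lock.

#include <linux/mm.h>
#include <linux/mmzone.h>

/* Hypothetical helper, not existing kernel code. */
static struct page *pcp_take_any_migratetype(struct per_cpu_pages *pcp)
{
	struct page *page;
	int mt;

	for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
		struct list_head *list = &pcp->lists[mt];

		if (list_empty(list))
			continue;

		page = list_first_entry(list, struct page, lru);
		list_del(&page->lru);
		pcp->count--;
		return page;		/* success without zone->lock */
	}

	return NULL;			/* all lockless caches empty: fail */
}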
On Thu 13-08-20 11:58:40, Uladzislau Rezki wrote: [...] > Sorry for jumping in. We can rely on preemptable() for sure, if CONFIG_PREEMPT_RT > is enabled, something like below: > > if (IS_ENABLED_RT && preemptebale()) Sure. I thought this was an RT specific thing that would be noop otherwise. > Also i have a question about pcp-lists. Can we introduce and use all allowed > MIGRATE_PCPTYPES? If called with GFP_RT_SAFE? If not please elaborate. Yes we can. This depends on the provided gfp_mask. Many gfp flags will be meaningless for such a context but essentially all we care about is that ((gfp_mask & __GFP_RECLAIM) == GFP_RT_SAFE) for the checking purpose. We can sort out all these details if this is decided to be the right way to do. My (pseudo) patch was mostly to show the direction I've had in mind for easier further discussion.
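The check Michal describes, spelled out as a sketch: the helper name is invented and GFP_RT_SAFE is only the proposed alias for 0, not an upstream flag. A request qualifies for the lockless-only treatment exactly when it carries no reclaim bits at all.

#include <linux/gfp.h>

#define GFP_RT_SAFE	0	/* proposed alias, not upstream */

static inline bool is_rt_safe_request(gfp_t gfp_mask)
{
	/* neither __GFP_DIRECT_RECLAIM nor __GFP_KSWAPD_RECLAIM is set */
	return (gfp_mask & __GFP_RECLAIM) == GFP_RT_SAFE;
}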
Uladzislau Rezki <urezki@gmail.com> writes: > On Thu, Aug 13, 2020 at 09:50:27AM +0200, Michal Hocko wrote: >> On Wed 12-08-20 02:13:25, Thomas Gleixner wrote: >> [...] >> > I can understand your rationale and what you are trying to solve. So, if >> > we can actually have a distinct GFP variant: >> > >> > GFP_I_ABSOLUTELY_HAVE_TO_DO_THAT_AND_I_KNOW_IT_CAN_FAIL_EARLY >> >> Even if we cannot make the zone->lock raw I would prefer to not >> introduce a new gfp flag. Well we can do an alias for easier grepping >> #define GFP_RT_SAFE 0 Just using 0 is sneaky but yes, that's fine :) Bikeshedding: GFP_RT_NOWAIT or such might be more obvious. >> that would imply nowait semantic and would exclude waking up kswapd as >> well. If we can make wake up safe under RT then the alias would reflect >> that without any code changes. It basically requires to convert the wait queue to something else. Is the waitqueue strict single waiter? >> The second, and the more important part, would be to bail out anytime >> the page allocator is to take a lock which is not allowed in the current >> RT context. Something like >> + /* >> + * Hard atomic contexts are not supported by the allocator for >> + * anything but pcp requests >> + */ >> + if (!preemtable()) If you make that preemtible() it might even compile, but that still wont work because if CONFIG_PREEMPT_COUNT=n then preemptible() is always false. So that should be: if (!preemptible() && gfp == GFP_RT_NOWAIT) which is limiting the damage to those callers which hand in GFP_RT_NOWAIT. lockdep will yell at invocations with gfp != GFP_RT_NOWAIT when it hits zone->lock in the wrong context. And we want to know about that so we can look at the caller and figure out how to solve it. >> > The page allocator allocations should also have a limit on the number of >> > pages and eventually also page order (need to stare at the code or let >> > Michal educate me that the order does not matter). >> >> In practice anything but order 0 is out of question because we need >> zone->lock for that currently. Maybe we can introduce pcp lists for >> higher orders in the future - I have a vague recollection Mel was >> playing with that some time ago. Ok. >> > To make it consistent the same GFP_ variant should allow the slab >> > allocator go to the point where the slab cache is exhausted. >> > >> > Having a distinct and clearly defined GFP_ variant is really key to >> > chase down offenders and to make reviewers double check upfront why this >> > is absolutely required. >> >> Having a high level and recognizable gfp mask is OK but I would really >> like not to introduce a dedicated flag. The page allocator should be >> able to recognize the context which cannot be handled. The GFP_xxx == 0 is perfectly fine. > Sorry for jumping in. We can rely on preemptable() for sure, if CONFIG_PREEMPT_RT > is enabled, something like below: > > if (IS_ENABLED_RT && preemptebale()) Ha, you morphed preemtable() into preemptebale() which will not compile either :) Thanks, tglx
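For reference, the definitions Thomas is pointing at, paraphrased from include/linux/preempt.h: with CONFIG_PREEMPT_COUNT=n the kernel does not track preemption state, so preemptible() is hard-wired to 0 and cannot be used to detect atomic context on such configurations.

/* include/linux/preempt.h, paraphrased */
#ifdef CONFIG_PREEMPT_COUNT
# define preemptible()	(preempt_count() == 0 && !irqs_disabled())
#else
# define preemptible()	0	/* always reports "not preemptible" */
#endif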
Michal Hocko <mhocko@suse.com> writes:
> On Thu 13-08-20 11:58:40, Uladzislau Rezki wrote:
> [...]
>> Sorry for jumping in. We can rely on preemptable() for sure, if CONFIG_PREEMPT_RT
>> is enabled, something like below:
>>
>> if (IS_ENABLED_RT && preemptebale())
>
> Sure. I thought this was an RT specific thing that would be noop
> otherwise.

Well, even if RT specific it would be still something returning either
true or false unconditionally.

And guarding it with RT is not working either because then you are back
to square one with the problem which triggered the discussion in the
first place:

   raw_spin_lock()
     alloc()
       if (RT && !preemptible()) <- False because RT == false
         goto bail;

       spin_lock(&zone->lock) --> LOCKDEP complains

So either you convince Paul not to do that or you need to do something
like I suggested in my other reply.

Thanks,

        tglx
Hello, Michal. > > On Wed 12-08-20 02:13:25, Thomas Gleixner wrote: > > [...] > > > I can understand your rationale and what you are trying to solve. So, if > > > we can actually have a distinct GFP variant: > > > > > > GFP_I_ABSOLUTELY_HAVE_TO_DO_THAT_AND_I_KNOW_IT_CAN_FAIL_EARLY > > > > Even if we cannot make the zone->lock raw I would prefer to not > > introduce a new gfp flag. Well we can do an alias for easier grepping > > #define GFP_RT_SAFE 0 > > > > that would imply nowait semantic and would exclude waking up kswapd as > > well. If we can make wake up safe under RT then the alias would reflect > > that without any code changes. > > > > The second, and the more important part, would be to bail out anytime > > the page allocator is to take a lock which is not allowed in the current > > RT context. Something like > > > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > > index 67a0774e080b..3ef3ac82d110 100644 > > --- a/include/linux/gfp.h > > +++ b/include/linux/gfp.h > > @@ -237,6 +237,9 @@ struct vm_area_struct; > > * that subsystems start with one of these combinations and then set/clear > > * %__GFP_FOO flags as necessary. > > * > > + * %GFP_RT_SAFE users can not sleep and they are running under RT atomic context > > + * e.g. under raw_spin_lock. Failure of an allocation is to be expected. > > + * > > * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower > > * watermark is applied to allow access to "atomic reserves" > > * > > @@ -293,6 +296,7 @@ struct vm_area_struct; > > * version does not attempt reclaim/compaction at all and is by default used > > * in page fault path, while the non-light is used by khugepaged. > > */ > > +#define GFP_RT_SAFE 0 > > #define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM) > > #define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS) > > #define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index e028b87ce294..268ae872cc2a 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -2824,6 +2824,13 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, > > { > > int i, alloced = 0; > > > > + /* > > + * Hard atomic contexts are not supported by the allocator for > > + * anything but pcp requests > > + */ > > + if (!preemtable()) > > + return 0; > > + > > spin_lock(&zone->lock); > > for (i = 0; i < count; ++i) { > > struct page *page = __rmqueue(zone, order, migratetype, > > @@ -3371,6 +3378,13 @@ struct page *rmqueue(struct zone *preferred_zone, > > goto out; > > } > > > > + /* > > + * Hard atomic contexts are not supported by the allocator for high > > + * order requests > > + */ > > + if (WARN_ON_ONCE(!preemtable())) > > + reurn NULL; > > + > > /* > > * We most definitely don't want callers attempting to > > * allocate greater than order-1 page units with __GFP_NOFAIL. > > @@ -4523,6 +4537,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > > (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM))) > > gfp_mask &= ~__GFP_ATOMIC; > > > > + /* Hard atomic contexts support is very limited to the fast path */ > > + if (!preemtable()) { > > + WARN_ON_ONCE(gfp_mask != GFP_RT_SAFE); > > + return NULL; > > + } > > + > > retry_cpuset: > > compaction_retries = 0; > > no_progress_loops = 0; > > > > What do you think? > > > > > which is easy to grep for then having the page allocator go down to the > > > point where zone lock gets involved is not the end of the world for > > > RT in theory - unless that damned reality tells otherwise. 
:) > > > > > > The page allocator allocations should also have a limit on the number of > > > pages and eventually also page order (need to stare at the code or let > > > Michal educate me that the order does not matter). > > > > In practice anything but order 0 is out of question because we need > > zone->lock for that currently. Maybe we can introduce pcp lists for > > higher orders in the future - I have a vague recollection Mel was > > playing with that some time ago. > > > > > To make it consistent the same GFP_ variant should allow the slab > > > allocator go to the point where the slab cache is exhausted. > > > > > > Having a distinct and clearly defined GFP_ variant is really key to > > > chase down offenders and to make reviewers double check upfront why this > > > is absolutely required. > > > > Having a high level and recognizable gfp mask is OK but I would really > > like not to introduce a dedicated flag. The page allocator should be > > able to recognize the context which cannot be handled. > > > Sorry for jumping in. We can rely on preemptable() for sure, if CONFIG_PREEMPT_RT > is enabled, something like below: > > if (IS_ENABLED_RT && preemptebale()) > I was a bit out of focus and did not mention about one thing. Identifying the context type using preemtable() primitives looks a bit not suitable, IMHO. GFP_* flags are not supposed to identify a context type, because it is not possible for all cases. But that i Also, to bail out based on a context's type will not allow to get a page from atomic contexts, what we try to achieve :) Whereas, i mean, we do have possibility to do lock-less per-cpu-list allocation without touching any zone lock. if (gfp_mask == 0) { check_pcp_lists(); if (page) return page; bail out here without doing farther logic, like pre-fetching. return NULL; } For example consider our case: kfree_rcu()-> raw_spin_lock() -> page_alloc-> preemtable() -> false -- Vlad Rezki
On Thu 13-08-20 15:22:00, Thomas Gleixner wrote: > Uladzislau Rezki <urezki@gmail.com> writes: > > On Thu, Aug 13, 2020 at 09:50:27AM +0200, Michal Hocko wrote: > >> On Wed 12-08-20 02:13:25, Thomas Gleixner wrote: > >> [...] > >> > I can understand your rationale and what you are trying to solve. So, if > >> > we can actually have a distinct GFP variant: > >> > > >> > GFP_I_ABSOLUTELY_HAVE_TO_DO_THAT_AND_I_KNOW_IT_CAN_FAIL_EARLY > >> > >> Even if we cannot make the zone->lock raw I would prefer to not > >> introduce a new gfp flag. Well we can do an alias for easier grepping > >> #define GFP_RT_SAFE 0 > > Just using 0 is sneaky but yes, that's fine :) > > Bikeshedding: GFP_RT_NOWAIT or such might be more obvious. Sounds goood. > >> that would imply nowait semantic and would exclude waking up kswapd as > >> well. If we can make wake up safe under RT then the alias would reflect > >> that without any code changes. > > It basically requires to convert the wait queue to something else. Is > the waitqueue strict single waiter? I would have to double check. From what I remember only kswapd should ever sleep on it. > >> The second, and the more important part, would be to bail out anytime > >> the page allocator is to take a lock which is not allowed in the current > >> RT context. Something like > > >> + /* > >> + * Hard atomic contexts are not supported by the allocator for > >> + * anything but pcp requests > >> + */ > >> + if (!preemtable()) > > If you make that preemtible() it might even compile, but that still wont > work because if CONFIG_PREEMPT_COUNT=n then preemptible() is always > false. It would be nice to hide all that behind a helper and guarded by PREEMPT_RT. That would imply PREEMPT_COUNT automatically, right? > > So that should be: > > if (!preemptible() && gfp == GFP_RT_NOWAIT) > > which is limiting the damage to those callers which hand in > GFP_RT_NOWAIT. > > lockdep will yell at invocations with gfp != GFP_RT_NOWAIT when it hits > zone->lock in the wrong context. And we want to know about that so we > can look at the caller and figure out how to solve it. Yes, that would have to somehow need to annotate the zone_lock to be ok in those paths so that lockdep doesn't complain.
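One possible shape for the helper Michal suggests above; the name is invented. CONFIG_PREEMPT_RT selects CONFIG_PREEMPT_COUNT, so preemptible() is meaningful inside the guard, while on !PREEMPT_RT the helper compiles to false, which is why the lockdep side still needs the separate treatment discussed in the surrounding replies.

#include <linux/preempt.h>

/* Hypothetical helper: "this allocation runs in an RT-atomic context". */
static inline bool pageatomic_rt_context(void)
{
	return IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible();
}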
On Thu 13-08-20 15:29:31, Uladzislau Rezki wrote: [...] > I was a bit out of focus and did not mention about one thing. Identifying the context > type using preemtable() primitives looks a bit not suitable, IMHO. GFP_* flags are > not supposed to identify a context type, because it is not possible for all cases. Yes, GFP flags do not identify context and that is why my draft didn't really consider gfp flags for anything but the retry logic which is already gfp based already. The buddy allocator path simply always bails out for those rt atomic paths whenever it gets close to zone->lock > But that i You meant to say more I guess > Also, to bail out based on a context's type will not allow to get a page from atomic > contexts, what we try to achieve :) Yes lockdep will need some special treatment but I suspect we want to address the underlying problem first and only then take care of the lockdep. > Whereas, i mean, we do have possibility to do lock-less per-cpu-list allocation without > touching any zone lock. > > if (gfp_mask == 0) { > check_pcp_lists(); > if (page) > return page; > > bail out here without doing farther logic, like pre-fetching. > return NULL; > } The existing code does that already. __rmqueue_pcplist will go rmqueue_bulk only when pcp lists are empty. Or did I miss your point?
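Roughly the fast-path shape Michal refers to, heavily trimmed and renamed (not verbatim mm/page_alloc.c): zone->lock is only reachable through the rmqueue_bulk() refill when the pcp list is empty, and that refill is the point where the proposed RT-atomic bailout would return NULL instead.

#include <linux/mm.h>
#include <linux/mmzone.h>

/* Trimmed sketch of the pcp fast path; argument lists and checks omitted. */
static struct page *pcplist_alloc_sketch(struct zone *zone, int migratetype,
					 struct per_cpu_pages *pcp,
					 struct list_head *list)
{
	struct page *page;

	if (list_empty(list)) {
		/*
		 * The refill takes zone->lock inside rmqueue_bulk(); an
		 * RT-atomic caller would bail out with NULL right here.
		 */
		pcp->count += rmqueue_bulk(zone, 0, pcp->batch, list,
					   migratetype, 0);
		if (list_empty(list))
			return NULL;
	}

	page = list_first_entry(list, struct page, lru);
	list_del(&page->lru);
	pcp->count--;
	return page;
}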
On Thu 13-08-20 15:27:15, Thomas Gleixner wrote: > Michal Hocko <mhocko@suse.com> writes: > > On Thu 13-08-20 11:58:40, Uladzislau Rezki wrote: > > [...] > >> Sorry for jumping in. We can rely on preemptable() for sure, if CONFIG_PREEMPT_RT > >> is enabled, something like below: > >> > >> if (IS_ENABLED_RT && preemptebale()) > > > > Sure. I thought this was an RT specific thing that would be noop > > otherwise. > > Well, even if RT specific it would be still something returning either > true or false unconditionally. > > And guarding it with RT is not working either because then you are back > to square one with the problem which triggered the discussion in the > first place: > > raw_spin_lock() > alloc() > if (RT && !preemptible()) <- False because RT == false > goto bail; > > spin_lock(&zone->lock) --> LOCKDEP complains > > So either you convince Paul not to do that or you need to do something > like I suggested in my other reply. Can we somehow annotate the lock to be safe for nesting for lockdep?
On Thu, Aug 13, 2020 at 03:41:39PM +0200, Michal Hocko wrote:
> On Thu 13-08-20 15:29:31, Uladzislau Rezki wrote:
> [...]
> > I was a bit out of focus and did not mention about one thing. Identifying the context
> > type using preemtable() primitives looks a bit not suitable, IMHO. GFP_* flags are
> > not supposed to identify a context type, because it is not possible for all cases.
>
> Yes, GFP flags do not identify context and that is why my draft didn't
> really consider gfp flags for anything but the retry logic which is
> already gfp based already. The buddy allocator path simply always bails
> out for those rt atomic paths whenever it gets close to zone->lock
>
> > But that i
>
> You meant to say more I guess
>
Ahh. Right. It was not completed. The idea was that we do not really need
to identify whether we are preemptible or not, unless we want to help RT
proceed further, based on "RT && preemtable()", by allowing it to take
zone->lock and improve the chance of the allocation succeeding. Basically
not bail out.

> > Also, to bail out based on a context's type will not allow to get a page from atomic
> > contexts, what we try to achieve :)
>
> Yes lockdep will need some special treatment but I suspect we want to
> address the underlying problem first and only then take care of the
> lockdep.
>
> > Whereas, i mean, we do have possibility to do lock-less per-cpu-list allocation without
> > touching any zone lock.
> >
> > if (gfp_mask == 0) {
> >     check_pcp_lists();
> >     if (page)
> >         return page;
> >
> >     bail out here without doing farther logic, like pre-fetching.
> >     return NULL;
> > }
>
> The existing code does that already. __rmqueue_pcplist will go
> rmqueue_bulk only when pcp lists are empty. Or did I miss your point?
>
Right. Probably we look at it from different angles :)

Right, when pcp is empty the zone->lock is accessed. I got the feeling
that you want to bail out based on the "if (RT && !preemptible()) -> bail
out" check, i.e. on RT the code below will always return NULL:

<snip>
    raw_spin_lock();
    page_alloc();
<snip>

Thanks!

--
Vlad Rezki
On Thu, Aug 13, 2020 at 03:27:15PM +0200, Thomas Gleixner wrote:
> And guarding it with RT is not working either because then you are back
> to square one with the problem which triggered the discussion in the
> first place:
>
>    raw_spin_lock()
>      alloc()
>        if (RT && !preemptible()) <- False because RT == false
>          goto bail;
>
>        spin_lock(&zone->lock) --> LOCKDEP complains
>
> So either you convince Paul not to do that or you need to do something
> like I suggested in my other reply.

I'd like to throw in the possibility that we do something like:

	raw_spin_lock()
	  alloc()
	    if (!spin_trylock(&zone->lock))
	      if (RT && !preemptible())
	        goto bail;
	      spin_lock(&zone->lock);

would that make us feel more comfortable about converting zone->lock to
a raw spinlock?
Michal Hocko <mhocko@suse.com> writes: > On Thu 13-08-20 15:22:00, Thomas Gleixner wrote: >> It basically requires to convert the wait queue to something else. Is >> the waitqueue strict single waiter? > > I would have to double check. From what I remember only kswapd should > ever sleep on it. That would make it trivial as we could simply switch it over to rcu_wait. >> So that should be: >> >> if (!preemptible() && gfp == GFP_RT_NOWAIT) >> >> which is limiting the damage to those callers which hand in >> GFP_RT_NOWAIT. >> >> lockdep will yell at invocations with gfp != GFP_RT_NOWAIT when it hits >> zone->lock in the wrong context. And we want to know about that so we >> can look at the caller and figure out how to solve it. > > Yes, that would have to somehow need to annotate the zone_lock to be ok > in those paths so that lockdep doesn't complain. That opens the worst of all cans of worms. If we start this here then Joe programmer and his dog will use these lockdep annotation to evade warnings and when exposed to RT it will fall apart in pieces. Just that at that point Joe programmer moved on to something else and the usual suspects can mop up the pieces. We've seen that all over the place and some people even disable lockdep temporarily because annotations don't help. PeterZ might have opinions about that too I suspect. Really, if your primary lockless caches are empty then any allocation which comes from deep atomic context should simply always fail. Being stuck in an interrupt handler or even deeper for 200+ microseconds waiting for zone lock is just bonkers IMO. Thanks, tglx
On Thu 13-08-20 16:34:57, Thomas Gleixner wrote: > Michal Hocko <mhocko@suse.com> writes: > > On Thu 13-08-20 15:22:00, Thomas Gleixner wrote: > >> It basically requires to convert the wait queue to something else. Is > >> the waitqueue strict single waiter? > > > > I would have to double check. From what I remember only kswapd should > > ever sleep on it. > > That would make it trivial as we could simply switch it over to rcu_wait. > > >> So that should be: > >> > >> if (!preemptible() && gfp == GFP_RT_NOWAIT) > >> > >> which is limiting the damage to those callers which hand in > >> GFP_RT_NOWAIT. > >> > >> lockdep will yell at invocations with gfp != GFP_RT_NOWAIT when it hits > >> zone->lock in the wrong context. And we want to know about that so we > >> can look at the caller and figure out how to solve it. > > > > Yes, that would have to somehow need to annotate the zone_lock to be ok > > in those paths so that lockdep doesn't complain. > > That opens the worst of all cans of worms. If we start this here then > Joe programmer and his dog will use these lockdep annotation to evade > warnings and when exposed to RT it will fall apart in pieces. Just that > at that point Joe programmer moved on to something else and the usual > suspects can mop up the pieces. We've seen that all over the place and > some people even disable lockdep temporarily because annotations don't > help. Hmm. I am likely missing something really important here. We have two problems at hand: 1) RT will become broken as soon as this new RCU functionality which requires an allocation from inside of raw_spinlock hits the RT tree 2) lockdep splats which are telling us that early because of the raw_spinlock-> spin_lock dependency. 1) can be handled by handled by the bailing out whenever we have to use zone->lock inside the buddy allocator - essentially even more strict NOWAIT semantic than we have for RT tree - proposed (pseudo) patch is trying to describe that. 2) would become a false positive if 1) is in place, right? RT wouldn't do the illegal nesting and !RT would just work fine because GFP_RT_NOWAIT would be simply GFP_NOWAIT & ~__GFP_KSWAPD_RECLAIM. Why should we limit the functionality of the allocator for something that is not a real problem? > PeterZ might have opinions about that too I suspect. > > Really, if your primary lockless caches are empty then any allocation > which comes from deep atomic context should simply always fail. Being > stuck in an interrupt handler or even deeper for 200+ microseconds > waiting for zone lock is just bonkers IMO. That would require changing NOWAIT/ATOMIC allocations semantic quite drastically for !RT kernels as well. I am not sure this is something we can do. Or maybe I am just missing your point.
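For reference, the identity behind "GFP_RT_NOWAIT would be simply GFP_NOWAIT & ~__GFP_KSWAPD_RECLAIM": GFP_NOWAIT is paraphrased from include/linux/gfp.h of that period, while GFP_RT_NOWAIT is only the proposed alias.

/* include/linux/gfp.h, paraphrased */
#define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)

/* proposed alias: NOWAIT semantics minus the kswapd wakeup */
#define GFP_RT_NOWAIT	(GFP_NOWAIT & ~__GFP_KSWAPD_RECLAIM)	/* == 0 */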
On Thu, Aug 13, 2020 at 04:53:35PM +0200, Michal Hocko wrote: > On Thu 13-08-20 16:34:57, Thomas Gleixner wrote: > > Michal Hocko <mhocko@suse.com> writes: > > > On Thu 13-08-20 15:22:00, Thomas Gleixner wrote: > > >> It basically requires to convert the wait queue to something else. Is > > >> the waitqueue strict single waiter? > > > > > > I would have to double check. From what I remember only kswapd should > > > ever sleep on it. > > > > That would make it trivial as we could simply switch it over to rcu_wait. > > > > >> So that should be: > > >> > > >> if (!preemptible() && gfp == GFP_RT_NOWAIT) > > >> > > >> which is limiting the damage to those callers which hand in > > >> GFP_RT_NOWAIT. > > >> > > >> lockdep will yell at invocations with gfp != GFP_RT_NOWAIT when it hits > > >> zone->lock in the wrong context. And we want to know about that so we > > >> can look at the caller and figure out how to solve it. > > > > > > Yes, that would have to somehow need to annotate the zone_lock to be ok > > > in those paths so that lockdep doesn't complain. > > > > That opens the worst of all cans of worms. If we start this here then > > Joe programmer and his dog will use these lockdep annotation to evade > > warnings and when exposed to RT it will fall apart in pieces. Just that > > at that point Joe programmer moved on to something else and the usual > > suspects can mop up the pieces. We've seen that all over the place and > > some people even disable lockdep temporarily because annotations don't > > help. > > Hmm. I am likely missing something really important here. We have two > problems at hand: > 1) RT will become broken as soon as this new RCU functionality which > requires an allocation from inside of raw_spinlock hits the RT tree > 2) lockdep splats which are telling us that early because of the > raw_spinlock-> spin_lock dependency. That is a reasonable high-level summary. > 1) can be handled by handled by the bailing out whenever we have to use > zone->lock inside the buddy allocator - essentially even more strict > NOWAIT semantic than we have for RT tree - proposed (pseudo) patch is > trying to describe that. Unless I am missing something subtle, the problem with this approach is that in production-environment CONFIG_PREEMPT_NONE=y kernels, there is no way at runtime to distinguish between holding a spinlock on the one hand and holding a raw spinlock on the other. Therefore, without some sort of indication from the caller, this approach will not make CONFIG_PREEMPT_NONE=y users happy. > 2) would become a false positive if 1) is in place, right? RT wouldn't > do the illegal nesting and !RT would just work fine because > GFP_RT_NOWAIT would be simply GFP_NOWAIT & ~__GFP_KSWAPD_RECLAIM. > Why should we limit the functionality of the allocator for something > that is not a real problem? I will confess that I found this approach quite attractive, at least until I dug into it. But it runs up against the inability to distinguish between holding a spinlock and holding a raw spinlock. This inability shows up in non-debugging CONFIG_PREEMPT_NONE=y kernels. (Specifically, production kernels of this type will have CONFIG_PREEMPT_COUNT=n.) I will always have a warm spot in my heart for RT, but this CONFIG_PREEMPT_NONE=y&&CONFIG_PREEMPT_COUNT=n configuration is still very important. > > PeterZ might have opinions about that too I suspect. > > > > Really, if your primary lockless caches are empty then any allocation > > which comes from deep atomic context should simply always fail. 
Being > > stuck in an interrupt handler or even deeper for 200+ microseconds > > waiting for zone lock is just bonkers IMO. > > That would require changing NOWAIT/ATOMIC allocations semantic quite > drastically for !RT kernels as well. I am not sure this is something we > can do. Or maybe I am just missing your point. Exactly, and avoiding changing this semantic for current users is precisely why we are proposing some sort of indication to be passed into the allocation request. In Uladzislau's patch, this was the __GFP_NO_LOCKS flag, but whatever works. Thanx, Paul
On Thu 13-08-20 08:41:59, Paul E. McKenney wrote: > On Thu, Aug 13, 2020 at 04:53:35PM +0200, Michal Hocko wrote: > > On Thu 13-08-20 16:34:57, Thomas Gleixner wrote: > > > Michal Hocko <mhocko@suse.com> writes: > > > > On Thu 13-08-20 15:22:00, Thomas Gleixner wrote: > > > >> It basically requires to convert the wait queue to something else. Is > > > >> the waitqueue strict single waiter? > > > > > > > > I would have to double check. From what I remember only kswapd should > > > > ever sleep on it. > > > > > > That would make it trivial as we could simply switch it over to rcu_wait. > > > > > > >> So that should be: > > > >> > > > >> if (!preemptible() && gfp == GFP_RT_NOWAIT) > > > >> > > > >> which is limiting the damage to those callers which hand in > > > >> GFP_RT_NOWAIT. > > > >> > > > >> lockdep will yell at invocations with gfp != GFP_RT_NOWAIT when it hits > > > >> zone->lock in the wrong context. And we want to know about that so we > > > >> can look at the caller and figure out how to solve it. > > > > > > > > Yes, that would have to somehow need to annotate the zone_lock to be ok > > > > in those paths so that lockdep doesn't complain. > > > > > > That opens the worst of all cans of worms. If we start this here then > > > Joe programmer and his dog will use these lockdep annotation to evade > > > warnings and when exposed to RT it will fall apart in pieces. Just that > > > at that point Joe programmer moved on to something else and the usual > > > suspects can mop up the pieces. We've seen that all over the place and > > > some people even disable lockdep temporarily because annotations don't > > > help. > > > > Hmm. I am likely missing something really important here. We have two > > problems at hand: > > 1) RT will become broken as soon as this new RCU functionality which > > requires an allocation from inside of raw_spinlock hits the RT tree > > 2) lockdep splats which are telling us that early because of the > > raw_spinlock-> spin_lock dependency. > > That is a reasonable high-level summary. > > > 1) can be handled by handled by the bailing out whenever we have to use > > zone->lock inside the buddy allocator - essentially even more strict > > NOWAIT semantic than we have for RT tree - proposed (pseudo) patch is > > trying to describe that. > > Unless I am missing something subtle, the problem with this approach > is that in production-environment CONFIG_PREEMPT_NONE=y kernels, there > is no way at runtime to distinguish between holding a spinlock on the > one hand and holding a raw spinlock on the other. Therefore, without > some sort of indication from the caller, this approach will not make > CONFIG_PREEMPT_NONE=y users happy. If the whole bailout is guarded by CONFIG_PREEMPT_RT specific atomicity check then there is no functional problem - GFP_RT_SAFE would still be GFP_NOWAIT so functional wise the allocator will still do the right thing. [...] > > That would require changing NOWAIT/ATOMIC allocations semantic quite > > drastically for !RT kernels as well. I am not sure this is something we > > can do. Or maybe I am just missing your point. > > Exactly, and avoiding changing this semantic for current users is > precisely why we are proposing some sort of indication to be passed > into the allocation request. In Uladzislau's patch, this was the > __GFP_NO_LOCKS flag, but whatever works. As I've tried to explain already, I would really hope we can do without any new gfp flags. We are running out of them and they tend to generate a lot of maintenance burden. 
There is a lot of abuse etc. We should also not expose such an implementation detail of the allocator to callers because that would make future changes even harder. The alias, on the othere hand already builds on top of existing NOWAIT semantic and it just helps the allocator to complain about a wrong usage while it doesn't expose any internals.
On Thu, Aug 13, 2020 at 05:54:12PM +0200, Michal Hocko wrote: > On Thu 13-08-20 08:41:59, Paul E. McKenney wrote: > > On Thu, Aug 13, 2020 at 04:53:35PM +0200, Michal Hocko wrote: > > > On Thu 13-08-20 16:34:57, Thomas Gleixner wrote: > > > > Michal Hocko <mhocko@suse.com> writes: > > > > > On Thu 13-08-20 15:22:00, Thomas Gleixner wrote: > > > > >> It basically requires to convert the wait queue to something else. Is > > > > >> the waitqueue strict single waiter? > > > > > > > > > > I would have to double check. From what I remember only kswapd should > > > > > ever sleep on it. > > > > > > > > That would make it trivial as we could simply switch it over to rcu_wait. > > > > > > > > >> So that should be: > > > > >> > > > > >> if (!preemptible() && gfp == GFP_RT_NOWAIT) > > > > >> > > > > >> which is limiting the damage to those callers which hand in > > > > >> GFP_RT_NOWAIT. > > > > >> > > > > >> lockdep will yell at invocations with gfp != GFP_RT_NOWAIT when it hits > > > > >> zone->lock in the wrong context. And we want to know about that so we > > > > >> can look at the caller and figure out how to solve it. > > > > > > > > > > Yes, that would have to somehow need to annotate the zone_lock to be ok > > > > > in those paths so that lockdep doesn't complain. > > > > > > > > That opens the worst of all cans of worms. If we start this here then > > > > Joe programmer and his dog will use these lockdep annotation to evade > > > > warnings and when exposed to RT it will fall apart in pieces. Just that > > > > at that point Joe programmer moved on to something else and the usual > > > > suspects can mop up the pieces. We've seen that all over the place and > > > > some people even disable lockdep temporarily because annotations don't > > > > help. > > > > > > Hmm. I am likely missing something really important here. We have two > > > problems at hand: > > > 1) RT will become broken as soon as this new RCU functionality which > > > requires an allocation from inside of raw_spinlock hits the RT tree > > > 2) lockdep splats which are telling us that early because of the > > > raw_spinlock-> spin_lock dependency. > > > > That is a reasonable high-level summary. > > > > > 1) can be handled by handled by the bailing out whenever we have to use > > > zone->lock inside the buddy allocator - essentially even more strict > > > NOWAIT semantic than we have for RT tree - proposed (pseudo) patch is > > > trying to describe that. > > > > Unless I am missing something subtle, the problem with this approach > > is that in production-environment CONFIG_PREEMPT_NONE=y kernels, there > > is no way at runtime to distinguish between holding a spinlock on the > > one hand and holding a raw spinlock on the other. Therefore, without > > some sort of indication from the caller, this approach will not make > > CONFIG_PREEMPT_NONE=y users happy. > > If the whole bailout is guarded by CONFIG_PREEMPT_RT specific atomicity > check then there is no functional problem - GFP_RT_SAFE would still be > GFP_NOWAIT so functional wise the allocator will still do the right > thing. Perhaps it was just me getting confused, early hour Pacific Time and whatever other excuses might apply. But I thought that you still had an objection to GFP_RT_SAFE based on changes in allocator semantics for other users. Of course, if it is just me being confused, by all means let's give this a shot!!! > [...] > > > > That would require changing NOWAIT/ATOMIC allocations semantic quite > > > drastically for !RT kernels as well. 
I am not sure this is something we > > > can do. Or maybe I am just missing your point. > > > > Exactly, and avoiding changing this semantic for current users is > > precisely why we are proposing some sort of indication to be passed > > into the allocation request. In Uladzislau's patch, this was the > > __GFP_NO_LOCKS flag, but whatever works. > > As I've tried to explain already, I would really hope we can do without > any new gfp flags. We are running out of them and they tend to generate > a lot of maintenance burden. There is a lot of abuse etc. We should also > not expose such an implementation detail of the allocator to callers > because that would make future changes even harder. The alias, on the > othere hand already builds on top of existing NOWAIT semantic and it > just helps the allocator to complain about a wrong usage while it > doesn't expose any internals. And I am plenty happy if an existing flag or set of flags (up to and including the all-zeroes empty set) can be used in this case. Thanx, Paul
On Thu 13-08-20 09:04:42, Paul E. McKenney wrote: > On Thu, Aug 13, 2020 at 05:54:12PM +0200, Michal Hocko wrote: [...] > > If the whole bailout is guarded by CONFIG_PREEMPT_RT specific atomicity > > check then there is no functional problem - GFP_RT_SAFE would still be > > GFP_NOWAIT so functional wise the allocator will still do the right > > thing. > > Perhaps it was just me getting confused, early hour Pacific Time and > whatever other excuses might apply. But I thought that you still had > an objection to GFP_RT_SAFE based on changes in allocator semantics for > other users. There is still that problem with lockdep complaining about raw->regular spinlock on !PREEMPT_RT that would need to get resolved somehow. Thomas is not really keen on adding some lockdep annotation mechanism and unfortunately I do not have a different idea of how to get rid of those.
Matthew Wilcox <willy@infradead.org> writes: > On Thu, Aug 13, 2020 at 03:27:15PM +0200, Thomas Gleixner wrote: >> And guarding it with RT is not working either because then you are back >> to square one with the problem which triggered the discussion in the >> first place: >> >> raw_spin_lock() >> alloc() >> if (RT && !preemptible()) <- False because RT == false >> goto bail; >> >> spin_lock(&zone->lock) --> LOCKDEP complains >> >> So either you convince Paul not to do that or you need to do something >> like I suggested in my other reply. > > I'd like to throw in the possibility that we do something like: > > raw_spin_lock() > alloc() > if (!spin_trylock(&zone->lock)) > if (RT && !preemptible()) > goto bail; > spin_lock(&zone->lock); > > would that make us feel more comfortable about converting zone->lock to > a raw spinlock? Even if that could cure that particular problem of allocations in deep atomic context, making zone->lock raw brings back the problem of zone->lock being held/contended for hundreds of microseconds with interrupts disabled which is causing RT tasks to miss their deadlines by big margins. Thanks, tglx
> On Thu 13-08-20 08:41:59, Paul E. McKenney wrote: > > On Thu, Aug 13, 2020 at 04:53:35PM +0200, Michal Hocko wrote: > > > On Thu 13-08-20 16:34:57, Thomas Gleixner wrote: > > > > Michal Hocko <mhocko@suse.com> writes: > > > > > On Thu 13-08-20 15:22:00, Thomas Gleixner wrote: > > > > >> It basically requires to convert the wait queue to something else. Is > > > > >> the waitqueue strict single waiter? > > > > > > > > > > I would have to double check. From what I remember only kswapd should > > > > > ever sleep on it. > > > > > > > > That would make it trivial as we could simply switch it over to rcu_wait. > > > > > > > > >> So that should be: > > > > >> > > > > >> if (!preemptible() && gfp == GFP_RT_NOWAIT) > > > > >> > > > > >> which is limiting the damage to those callers which hand in > > > > >> GFP_RT_NOWAIT. > > > > >> > > > > >> lockdep will yell at invocations with gfp != GFP_RT_NOWAIT when it hits > > > > >> zone->lock in the wrong context. And we want to know about that so we > > > > >> can look at the caller and figure out how to solve it. > > > > > > > > > > Yes, that would have to somehow need to annotate the zone_lock to be ok > > > > > in those paths so that lockdep doesn't complain. > > > > > > > > That opens the worst of all cans of worms. If we start this here then > > > > Joe programmer and his dog will use these lockdep annotation to evade > > > > warnings and when exposed to RT it will fall apart in pieces. Just that > > > > at that point Joe programmer moved on to something else and the usual > > > > suspects can mop up the pieces. We've seen that all over the place and > > > > some people even disable lockdep temporarily because annotations don't > > > > help. > > > > > > Hmm. I am likely missing something really important here. We have two > > > problems at hand: > > > 1) RT will become broken as soon as this new RCU functionality which > > > requires an allocation from inside of raw_spinlock hits the RT tree > > > 2) lockdep splats which are telling us that early because of the > > > raw_spinlock-> spin_lock dependency. > > > > That is a reasonable high-level summary. > > > > > 1) can be handled by handled by the bailing out whenever we have to use > > > zone->lock inside the buddy allocator - essentially even more strict > > > NOWAIT semantic than we have for RT tree - proposed (pseudo) patch is > > > trying to describe that. > > > > Unless I am missing something subtle, the problem with this approach > > is that in production-environment CONFIG_PREEMPT_NONE=y kernels, there > > is no way at runtime to distinguish between holding a spinlock on the > > one hand and holding a raw spinlock on the other. Therefore, without > > some sort of indication from the caller, this approach will not make > > CONFIG_PREEMPT_NONE=y users happy. > > If the whole bailout is guarded by CONFIG_PREEMPT_RT specific atomicity > check then there is no functional problem - GFP_RT_SAFE would still be > GFP_NOWAIT so functional wise the allocator will still do the right > thing. > > [...] > > > > That would require changing NOWAIT/ATOMIC allocations semantic quite > > > drastically for !RT kernels as well. I am not sure this is something we > > > can do. Or maybe I am just missing your point. > > > > Exactly, and avoiding changing this semantic for current users is > > precisely why we are proposing some sort of indication to be passed > > into the allocation request. In Uladzislau's patch, this was the > > __GFP_NO_LOCKS flag, but whatever works. 
> > As I've tried to explain already, I would really hope we can do without > any new gfp flags. We are running out of them and they tend to generate > a lot of maintenance burden. There is a lot of abuse etc. We should also > not expose such an implementation detail of the allocator to callers > because that would make future changes even harder. The alias, on the > othere hand already builds on top of existing NOWAIT semantic and it > just helps the allocator to complain about a wrong usage while it > doesn't expose any internals. > I know that Matthew and I raised it. We can handle it without introducing any flag. I mean just pass 0 as the gfp argument to the page allocator (gfp_flags = 0), i.e. #define __GFP_NO_LOCKS 0, so it will be handled the same way as it is done in the "mm: Add __GFP_NO_LOCKS flag" patch. I can re-spin the RFC patch and send it out for better understanding. Does it work for you, Michal? Or is it better just to drop the patch here? -- Vlad Rezki
On Thu, Aug 13, 2020 at 06:14:32PM +0200, Thomas Gleixner wrote: > Matthew Wilcox <willy@infradead.org> writes: > > On Thu, Aug 13, 2020 at 03:27:15PM +0200, Thomas Gleixner wrote: > >> And guarding it with RT is not working either because then you are back > >> to square one with the problem which triggered the discussion in the > >> first place: > >> > >> raw_spin_lock() > >> alloc() > >> if (RT && !preemptible()) <- False because RT == false > >> goto bail; > >> > >> spin_lock(&zone->lock) --> LOCKDEP complains > >> > >> So either you convince Paul not to do that or you need to do something > >> like I suggested in my other reply. > > > > I'd like to throw in the possibility that we do something like: > > > > raw_spin_lock() > > alloc() > > if (!spin_trylock(&zone->lock)) > > if (RT && !preemptible()) > > goto bail; > > spin_lock(&zone->lock); > > > > would that make us feel more comfortable about converting zone->lock to > > a raw spinlock? > > Even if that could cure that particular problem of allocations in deep > atomic context, making zone->lock raw brings back the problem of > zone->lock being held/contended for hundreds of microseconds with > interrupts disabled which is causing RT tasks to miss their deadlines by > big margins. Ah, I see. Yeah, that doesn't work. Never mind.
On Thu, Aug 13, 2020 at 06:13:57PM +0200, Michal Hocko wrote: > On Thu 13-08-20 09:04:42, Paul E. McKenney wrote: > > On Thu, Aug 13, 2020 at 05:54:12PM +0200, Michal Hocko wrote: > [...] > > > If the whole bailout is guarded by CONFIG_PREEMPT_RT specific atomicity > > > check then there is no functional problem - GFP_RT_SAFE would still be > > > GFP_NOWAIT so functional wise the allocator will still do the right > > > thing. > > > > Perhaps it was just me getting confused, early hour Pacific Time and > > whatever other excuses might apply. But I thought that you still had > > an objection to GFP_RT_SAFE based on changes in allocator semantics for > > other users. > > There is still that problem with lockdep complaining about raw->regular > spinlock on !PREEMPT_RT that would need to get resolved somehow. Thomas > is not really keen on adding some lockdep annotation mechanism and > unfortunatelly I do not have a different idea how to get rid of those. OK. So the current situation requires a choice between these alternatives, each of which has shortcomings that have been mentioned earlier in this thread: 1. Prohibit invoking allocators from raw atomic context, such as when holding a raw spinlock. 2. Adding a GFP_ flag. 3. Reusing existing GFP_ flags/values/whatever to communicate the raw-context information that was to be communicated by the new GFP_ flag. 4. Making lockdep forgive acquiring spinlocks while holding raw spinlocks, but only in CONFIG_PREEMPT_NONE=y kernels. Am I missing anything? Thanx, Paul
On Thu 13-08-20 18:20:47, Uladzislau Rezki wrote: > > On Thu 13-08-20 08:41:59, Paul E. McKenney wrote: > > > On Thu, Aug 13, 2020 at 04:53:35PM +0200, Michal Hocko wrote: > > > > On Thu 13-08-20 16:34:57, Thomas Gleixner wrote: > > > > > Michal Hocko <mhocko@suse.com> writes: > > > > > > On Thu 13-08-20 15:22:00, Thomas Gleixner wrote: > > > > > >> It basically requires to convert the wait queue to something else. Is > > > > > >> the waitqueue strict single waiter? > > > > > > > > > > > > I would have to double check. From what I remember only kswapd should > > > > > > ever sleep on it. > > > > > > > > > > That would make it trivial as we could simply switch it over to rcu_wait. > > > > > > > > > > >> So that should be: > > > > > >> > > > > > >> if (!preemptible() && gfp == GFP_RT_NOWAIT) > > > > > >> > > > > > >> which is limiting the damage to those callers which hand in > > > > > >> GFP_RT_NOWAIT. > > > > > >> > > > > > >> lockdep will yell at invocations with gfp != GFP_RT_NOWAIT when it hits > > > > > >> zone->lock in the wrong context. And we want to know about that so we > > > > > >> can look at the caller and figure out how to solve it. > > > > > > > > > > > > Yes, that would have to somehow need to annotate the zone_lock to be ok > > > > > > in those paths so that lockdep doesn't complain. > > > > > > > > > > That opens the worst of all cans of worms. If we start this here then > > > > > Joe programmer and his dog will use these lockdep annotation to evade > > > > > warnings and when exposed to RT it will fall apart in pieces. Just that > > > > > at that point Joe programmer moved on to something else and the usual > > > > > suspects can mop up the pieces. We've seen that all over the place and > > > > > some people even disable lockdep temporarily because annotations don't > > > > > help. > > > > > > > > Hmm. I am likely missing something really important here. We have two > > > > problems at hand: > > > > 1) RT will become broken as soon as this new RCU functionality which > > > > requires an allocation from inside of raw_spinlock hits the RT tree > > > > 2) lockdep splats which are telling us that early because of the > > > > raw_spinlock-> spin_lock dependency. > > > > > > That is a reasonable high-level summary. > > > > > > > 1) can be handled by handled by the bailing out whenever we have to use > > > > zone->lock inside the buddy allocator - essentially even more strict > > > > NOWAIT semantic than we have for RT tree - proposed (pseudo) patch is > > > > trying to describe that. > > > > > > Unless I am missing something subtle, the problem with this approach > > > is that in production-environment CONFIG_PREEMPT_NONE=y kernels, there > > > is no way at runtime to distinguish between holding a spinlock on the > > > one hand and holding a raw spinlock on the other. Therefore, without > > > some sort of indication from the caller, this approach will not make > > > CONFIG_PREEMPT_NONE=y users happy. > > > > If the whole bailout is guarded by CONFIG_PREEMPT_RT specific atomicity > > check then there is no functional problem - GFP_RT_SAFE would still be > > GFP_NOWAIT so functional wise the allocator will still do the right > > thing. > > > > [...] > > > > > > That would require changing NOWAIT/ATOMIC allocations semantic quite > > > > drastically for !RT kernels as well. I am not sure this is something we > > > > can do. Or maybe I am just missing your point. 
> > > > > > Exactly, and avoiding changing this semantic for current users is > > > precisely why we are proposing some sort of indication to be passed > > > into the allocation request. In Uladzislau's patch, this was the > > > __GFP_NO_LOCKS flag, but whatever works. > > > > As I've tried to explain already, I would really hope we can do without > > any new gfp flags. We are running out of them and they tend to generate > > a lot of maintenance burden. There is a lot of abuse etc. We should also > > not expose such an implementation detail of the allocator to callers > > because that would make future changes even harder. The alias, on the > > othere hand already builds on top of existing NOWAIT semantic and it > > just helps the allocator to complain about a wrong usage while it > > doesn't expose any internals. > > > I know that Matthew and me raised it. We do can handle it without > introducing any flag. I mean just use 0 as argument to the page_alloc(gfp_flags = 0) > > i.e. #define __GFP_NO_LOCKS 0 > > so it will be handled same way how it is done in the "mm: Add __GFP_NO_LOCKS flag" > I can re-spin the RFC patch and send it out for better understanding. > > Does it work for you, Michal? Or it is better just to drop the patch here? That would change the semantic for GFP_NOWAIT users who decided to drop __GFP_KSWAPD_RECLAIM or even use 0 gfp mask right away, right? The point I am trying to make is that an alias is good for RT because it doesn't have any users (because there is no RT atomic user of the allocator) currently.
Michal Hocko <mhocko@suse.com> writes: > On Thu 13-08-20 16:34:57, Thomas Gleixner wrote: >> Michal Hocko <mhocko@suse.com> writes: >> > Yes, that would have to somehow need to annotate the zone_lock to be ok >> > in those paths so that lockdep doesn't complain. >> >> That opens the worst of all cans of worms. If we start this here then >> Joe programmer and his dog will use these lockdep annotation to evade >> warnings and when exposed to RT it will fall apart in pieces. Just that >> at that point Joe programmer moved on to something else and the usual >> suspects can mop up the pieces. We've seen that all over the place and >> some people even disable lockdep temporarily because annotations don't >> help. > > Hmm. I am likely missing something really important here. We have two > problems at hand: > 1) RT will become broken as soon as this new RCU functionality which > requires an allocation from inside of raw_spinlock hits the RT tree > 2) lockdep splats which are telling us that early because of the > raw_spinlock-> spin_lock dependency. Correct. > 1) can be handled by handled by the bailing out whenever we have to use > zone->lock inside the buddy allocator - essentially even more strict > NOWAIT semantic than we have for RT tree - proposed (pseudo) patch is > trying to describe that. > > 2) would become a false positive if 1) is in place, right? RT wouldn't > do the illegal nesting and !RT would just work fine because > GFP_RT_NOWAIT would be simply GFP_NOWAIT & ~__GFP_KSWAPD_RECLAIM. And how do you deal with that false positive and the subsequent false positives when this code hits the next regular spinlock in some code path? Disabling lockdep or crippling coverage? > Why should we limit the functionality of the allocator for something > that is not a real problem? We'd limit the allocator for exactly ONE new user which was aware of this problem _before_ the code hit mainline. And that ONE user is prepared to handle the fail. Any other usage of the page allocator just works. The number of raw spinlocks is very limited and there are very good reasons to make them raw spinlocks. And none of them does allocations inside, except this particular new one. Some did years ago, but none of them was necessary at all; quite the contrary, most of them were just pointless and in frequent hot paths. Let me ask the question the other way round: Is there a real request by Paul that going deeper into the allocator is necessary for his newfangled RCU thing? I haven't seen one and if the lockless allocation fails then the system might have other worries than getting a page to this particular RCU thing which has a perfectly working fallback. It's not affecting anything else. GFP_ATOMIC/NOWAIT still work the same way as before from all other contexts and that's 99,9999999999% of all use cases. Why? Because none of them happen under a raw spinlock. Even if we could make this lockdep thing work that does not mean that it's a good thing to do. Quite the contrary, you'd just encourage people to create more of those use cases for probably the completely wrong reasons. Putting a limitation into place upfront might make them think further than just slapping GFP_RT_ATOMIC in and be done with it. Let me dream :) I've dealt with tons of patches in the last 15+ years where people just came up with 's/GFP_KERNEL/GFP_ATOMIC/ because tool complained' patches. The vast majority of them were bogus because the alloc() was simply at the wrong place.
Forcing people not to take the easy way out by making the infrastructure restrictive is way better than encouraging mindless hackery. We have enough of this (not restricted to memory allocations) all over the kernel already. No need for more. >> Really, if your primary lockless caches are empty then any allocation >> which comes from deep atomic context should simply always fail. Being >> stuck in an interrupt handler or even deeper for 200+ microseconds >> waiting for zone lock is just bonkers IMO. > > That would require changing NOWAIT/ATOMIC allocations semantic quite > drastically for !RT kernels as well. I am not sure this is something we > can do. Or maybe I am just missing your point. I really do not understand why you think that it affects everything. It's exactly ONE particular use case which is affected, i.e. Paul's new RCU thing if he uses GFP_RT_NOWAIT. Everything else is not affected at all and NOWAIT/ATOMIC just work as they used to because NOWAIT != 0 and ATOMIC != 0. Thanks, tglx
On Thu 13-08-20 09:29:04, Paul E. McKenney wrote: > On Thu, Aug 13, 2020 at 06:13:57PM +0200, Michal Hocko wrote: > > On Thu 13-08-20 09:04:42, Paul E. McKenney wrote: > > > On Thu, Aug 13, 2020 at 05:54:12PM +0200, Michal Hocko wrote: > > [...] > > > > If the whole bailout is guarded by CONFIG_PREEMPT_RT specific atomicity > > > > check then there is no functional problem - GFP_RT_SAFE would still be > > > > GFP_NOWAIT so functional wise the allocator will still do the right > > > > thing. > > > > > > Perhaps it was just me getting confused, early hour Pacific Time and > > > whatever other excuses might apply. But I thought that you still had > > > an objection to GFP_RT_SAFE based on changes in allocator semantics for > > > other users. > > > > There is still that problem with lockdep complaining about raw->regular > > spinlock on !PREEMPT_RT that would need to get resolved somehow. Thomas > > is not really keen on adding some lockdep annotation mechanism and > > unfortunatelly I do not have a different idea how to get rid of those. > > OK. So the current situation requires a choice between these these > alternatives, each of which has shortcomings that have been mentioned > earlier in this thread: > > 1. Prohibit invoking allocators from raw atomic context, such > as when holding a raw spinlock. > > 2. Adding a GFP_ flag. Which would implement a completely new level of atomic allocation for all preemption models > > > > 3. Reusing existing GFP_ flags/values/whatever to communicate > the raw-context information that was to be communicated by > the new GFP_ flag. this would have to be RT specific so as not to change the semantic for existing users. In other words, make the NOWAIT semantic work for RT atomic contexts. > > > > 4. Making lockdep forgive acquiring spinlocks while holding > raw spinlocks, but only in CONFIG_PREEMPT_NONE=y kernels. and this would have to go along with 3 to remove false positives on !RT.
On Thu 13-08-20 19:09:29, Thomas Gleixner wrote: > Michal Hocko <mhocko@suse.com> writes: > > On Thu 13-08-20 16:34:57, Thomas Gleixner wrote: [...] I will go through the rest of the email tomorrow. > >> Really, if your primary lockless caches are empty then any allocation > >> which comes from deep atomic context should simply always fail. Being > >> stuck in an interrupt handler or even deeper for 200+ microseconds > >> waiting for zone lock is just bonkers IMO. > > > > That would require changing NOWAIT/ATOMIC allocations semantic quite > > drastically for !RT kernels as well. I am not sure this is something we > > can do. Or maybe I am just missing your point. > > I really do not understand why you think that it affects everything. We are likely not on the same page here. Are you talking about the original proposal in this thread (aka a new flag) or a way to reuse existing gfp space (http://lkml.kernel.org/r/20200813075027.GD9477@dhcp22.suse.cz) with a modification that would both silence lockdep and keep the existing NOWAIT semantic? Sorry, bear with me, but I am getting lost in this thread.
On Thu, Aug 13, 2020 at 07:12:11PM +0200, Michal Hocko wrote: > On Thu 13-08-20 09:29:04, Paul E. McKenney wrote: > > On Thu, Aug 13, 2020 at 06:13:57PM +0200, Michal Hocko wrote: > > > On Thu 13-08-20 09:04:42, Paul E. McKenney wrote: > > > > On Thu, Aug 13, 2020 at 05:54:12PM +0200, Michal Hocko wrote: > > > [...] > > > > > If the whole bailout is guarded by CONFIG_PREEMPT_RT specific atomicity > > > > > check then there is no functional problem - GFP_RT_SAFE would still be > > > > > GFP_NOWAIT so functional wise the allocator will still do the right > > > > > thing. > > > > > > > > Perhaps it was just me getting confused, early hour Pacific Time and > > > > whatever other excuses might apply. But I thought that you still had > > > > an objection to GFP_RT_SAFE based on changes in allocator semantics for > > > > other users. > > > > > > There is still that problem with lockdep complaining about raw->regular > > > spinlock on !PREEMPT_RT that would need to get resolved somehow. Thomas > > > is not really keen on adding some lockdep annotation mechanism and > > > unfortunatelly I do not have a different idea how to get rid of those. > > > > OK. So the current situation requires a choice between these these > > alternatives, each of which has shortcomings that have been mentioned > > earlier in this thread: > > > > 1. Prohibit invoking allocators from raw atomic context, such > > as when holding a raw spinlock. > > > > 2. Adding a GFP_ flag. > > Which would implemente a completely new level atomic allocation for all > preemption models > > > > > 3. Reusing existing GFP_ flags/values/whatever to communicate > > the raw-context information that was to be communicated by > > the new GFP_ flag. > > this would have to be RT specific to not change the semantic for > existing users. In other words make NOWAIT semantic working for > RT atomic contexts. > > > > > 4. Making lockdep forgive acquiring spinlocks while holding > > raw spinlocks, but only in CONFIG_PREEMPT_NONE=y kernels. > > and this would have to go along with 3 to remove false positives on !RT. OK, let's fill in the issues, then: 1. Prohibit invoking allocators from raw atomic context, such as when holding a raw spinlock. o This would prevent an important cache-locality optimization. 2. Adding a GFP_ flag. o Requires a new level atomic allocation for all preemption models, namely, confined to the allocator's lockless caches. 3. Reusing existing GFP_ flags/values/whatever to communicate the raw-context information that was to be communicated by the new GFP_ flag. o There are existing users of all combinations that might be unhappy with a change of semantics. o But Michal is OK with this if usage is restricted to RT. Except that this requires #4 below: 4. Making lockdep forgive acquiring spinlocks while holding raw spinlocks, but only in CONFIG_PREEMPT_NONE=y kernels. o This would allow latency degradation and other bad coding practices to creep in, per Thomas's recent email. Again, am I missing anything? Thanx, Paul
On Thu, Aug 13, 2020 at 04:34:57PM +0200, Thomas Gleixner wrote: > Michal Hocko <mhocko@suse.com> writes: > > On Thu 13-08-20 15:22:00, Thomas Gleixner wrote: > >> It basically requires to convert the wait queue to something else. Is > >> the waitqueue strict single waiter? > > > > I would have to double check. From what I remember only kswapd should > > ever sleep on it. > > That would make it trivial as we could simply switch it over to rcu_wait. > > >> So that should be: > >> > >> if (!preemptible() && gfp == GFP_RT_NOWAIT) > >> > >> which is limiting the damage to those callers which hand in > >> GFP_RT_NOWAIT. > >> > >> lockdep will yell at invocations with gfp != GFP_RT_NOWAIT when it hits > >> zone->lock in the wrong context. And we want to know about that so we > >> can look at the caller and figure out how to solve it. > > > > Yes, that would have to somehow need to annotate the zone_lock to be ok > > in those paths so that lockdep doesn't complain. > > That opens the worst of all cans of worms. If we start this here then > Joe programmer and his dog will use these lockdep annotation to evade > warnings and when exposed to RT it will fall apart in pieces. Just that > at that point Joe programmer moved on to something else and the usual > suspects can mop up the pieces. We've seen that all over the place and > some people even disable lockdep temporarily because annotations don't > help. > > PeterZ might have opinions about that too I suspect. PeterZ is mightily confused by all of this -- also heat induced brain melt. I thought the rule was: - No allocators (alloc/free) inside raw_spinlock_t, full-stop. Why are we trying to craft an exception?
On Thu, Aug 13, 2020 at 09:29:04AM -0700, Paul E. McKenney wrote: > OK. So the current situation requires a choice between these these > alternatives, each of which has shortcomings that have been mentioned > earlier in this thread: > > 1. Prohibit invoking allocators from raw atomic context, such > as when holding a raw spinlock. This! This has always been the case, why are we even considering change here? > 2. Adding a GFP_ flag. The patch 1/2 in this thread is horrendous crap. > 3. Reusing existing GFP_ flags/values/whatever to communicate > the raw-context information that was to be communicated by > the new GFP_ flag. > > 4. Making lockdep forgive acquiring spinlocks while holding > raw spinlocks, but only in CONFIG_PREEMPT_NONE=y kernels. > > Am I missing anything? How would 4 solve anything? In other words, what is the actual friggin problem? I've not seen one described anywhere.
On Thu, Aug 13, 2020 at 08:26:18PM +0200, peterz@infradead.org wrote: > On Thu, Aug 13, 2020 at 04:34:57PM +0200, Thomas Gleixner wrote: > > Michal Hocko <mhocko@suse.com> writes: > > > On Thu 13-08-20 15:22:00, Thomas Gleixner wrote: > > >> It basically requires to convert the wait queue to something else. Is > > >> the waitqueue strict single waiter? > > > > > > I would have to double check. From what I remember only kswapd should > > > ever sleep on it. > > > > That would make it trivial as we could simply switch it over to rcu_wait. > > > > >> So that should be: > > >> > > >> if (!preemptible() && gfp == GFP_RT_NOWAIT) > > >> > > >> which is limiting the damage to those callers which hand in > > >> GFP_RT_NOWAIT. > > >> > > >> lockdep will yell at invocations with gfp != GFP_RT_NOWAIT when it hits > > >> zone->lock in the wrong context. And we want to know about that so we > > >> can look at the caller and figure out how to solve it. > > > > > > Yes, that would have to somehow need to annotate the zone_lock to be ok > > > in those paths so that lockdep doesn't complain. > > > > That opens the worst of all cans of worms. If we start this here then > > Joe programmer and his dog will use these lockdep annotation to evade > > warnings and when exposed to RT it will fall apart in pieces. Just that > > at that point Joe programmer moved on to something else and the usual > > suspects can mop up the pieces. We've seen that all over the place and > > some people even disable lockdep temporarily because annotations don't > > help. > > > > PeterZ might have opinions about that too I suspect. > > PeterZ is mightily confused by all of this -- also heat induced brain > melt. > > I thought the rule was: > > - No allocators (alloc/free) inside raw_spinlock_t, full-stop. > > Why are we trying to craft an exception? So that we can reduce post-grace-period cache misses by a factor of eight when invoking RCU callbacks. This reduction in cache misses also makes it more difficult to overrun RCU with floods of either call_rcu() or kfree_rcu() invocations. The idea is to allocate page-sized arrays of pointers so that the callback invocation can sequence through the array instead of walking a linked list, hence the reduction in cache misses. If the allocation fails, for example, during OOM events, we fall back to the linked-list approach. So, as with much of the rest of the kernel, under OOM we just run a bit slower. Thanx, Paul
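To make the layout behind that explanation concrete, here is a rough sketch of the idea (the structure and field names are invented for illustration and are not the actual RCU data structures): a page-sized block carries a small header plus a dense array of object pointers, so post-grace-period processing walks an array instead of chasing rcu_head list pointers, and one such page holds roughly PAGE_SIZE / sizeof(void *) entries.

	/* Illustrative only: a page-sized container of pointers to be freed. */
	struct free_ptr_page {
		struct free_ptr_page *next;	/* chain of filled pages */
		unsigned int nr;		/* number of valid entries in ptrs[] */
		void *ptrs[];			/* fills the remainder of the page */
	};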
On Thu 13-08-20 20:31:21, Peter Zijlstra wrote: > On Thu, Aug 13, 2020 at 09:29:04AM -0700, Paul E. McKenney wrote: [...] > > 3. Reusing existing GFP_ flags/values/whatever to communicate > > the raw-context information that was to be communicated by > > the new GFP_ flag. > > > > 4. Making lockdep forgive acquiring spinlocks while holding > > raw spinlocks, but only in CONFIG_PREEMPT_NONE=y kernels. > > > > Am I missing anything? > > How would 4 solve anything? Nothing on its own, but along with http://lkml.kernel.org/r/20200813075027.GD9477@dhcp22.suse.cz it would allow at least _some_ NOWAIT semantic for RT atomic contexts and prevent lockdep false positives for !RT trees.
On Thu, Aug 13, 2020 at 11:52:57AM -0700, Paul E. McKenney wrote: > On Thu, Aug 13, 2020 at 08:26:18PM +0200, peterz@infradead.org wrote: > > I thought the rule was: > > > > - No allocators (alloc/free) inside raw_spinlock_t, full-stop. > > > > Why are we trying to craft an exception? > > So that we can reduce post-grace-period cache misses by a factor of > eight when invoking RCU callbacks. This reduction in cache misses also > makes it more difficult to overrun RCU with floods of either call_rcu() > or kfree_rcu() invocations. > > The idea is to allocate page-sized arrays of pointers so that the callback > invocation can sequence through the array instead of walking a linked > list, hence the reduction in cache misses. I'm still not getting it, how do we end up trying to allocate memory from under raw spinlocks if you're not allowed to use kfree_rcu() under one ? Can someone please spell out the actual problem?
On Fri, Aug 14, 2020 at 12:06:19AM +0200, peterz@infradead.org wrote: > On Thu, Aug 13, 2020 at 11:52:57AM -0700, Paul E. McKenney wrote: > > On Thu, Aug 13, 2020 at 08:26:18PM +0200, peterz@infradead.org wrote: > > > > I thought the rule was: > > > > > > - No allocators (alloc/free) inside raw_spinlock_t, full-stop. > > > > > > Why are we trying to craft an exception? > > > > So that we can reduce post-grace-period cache misses by a factor of > > eight when invoking RCU callbacks. This reduction in cache misses also > > makes it more difficult to overrun RCU with floods of either call_rcu() > > or kfree_rcu() invocations. > > > > The idea is to allocate page-sized arrays of pointers so that the callback > > invocation can sequence through the array instead of walking a linked > > list, hence the reduction in cache misses. > > I'm still not getting it, how do we end up trying to allocate memory > from under raw spinlocks if you're not allowed to use kfree_rcu() under > one ? You are indeed not allowed to use kfree() under a raw spinlock, given that it can acquire a non-raw spinlock. But kfree_rcu() was just a wrapper around call_rcu(), which can be and is called from raw atomic context. > Can someone please spell out the actual problem? And as noted above, reducing the kfree()-time cache misses would be a good thing. Thanx, Paul
On Fri, Aug 14 2020 at 00:06, peterz wrote: > I'm still not getting it, how do we end up trying to allocate memory > from under raw spinlocks if you're not allowed to use kfree_rcu() under > one ? > > Can someone please spell out the actual problem? Your actual problem is the heat wave. Get an icepack already :) To set things straight: 1) kfree_rcu() which used to be just a conveniance wrapper around call_rcu() always worked in any context except NMI. Otherwise RT would have never worked or would have needed other horrible hacks to defer kfree in certain contexts including context switch. 2) RCU grew an optimization for kfree_rcu() which avoids the linked list cache misses when a large number of objects is freed via kfree_rcu(). Paul says it speeds up post grace period processing by a factor of 8 which is impressive. That's done by avoiding the linked list and storing the object pointer in an array of pointers which is page sized. This array is then freed in bulk without having to touch the rcu head linked list which obviously avoids cache misses. Occasionally kfree_rcu() needs to allocate a page for this but it can fallback to the linked list when the allocation fails. Inconveniantly this allocation happens with an RCU internal raw lock held, but even if we could do the allocation outside the RCU internal lock this would not solve the problem that kfree_rcu() can be called in any context except NMI even on RT. So RT forbids allocations from truly atomic contexts even with GFP_ATOMIC/NOWAIT. GFP_ATOMIC is kinda meaningless on RT because everything which calls it is in non-atomic context :) But still GFP_ATOMIC or GFP_NOWAIT retain the semantics of !RT and do not go down into deeper levels or wait for pages to become available. 3) #2 upsets lockdep (with the raw lock nesting enabled) rightfully when RCU tries to allocate a page, the lockless page cache is empty and it acquires zone->lock which is a regular spinlock 4) As kfree_rcu() can be used from any context except NMI and RT relies on it we ran into a circular dependency problem. If we disable that feature for RT then the problem would be solved except that lockdep still would complain about taking zone->lock within a forbidden context on !RT kernels. Preventing that feature because of that is not a feasible option either. Aside of that we discuss this postfactum, IOW the stuff is upstream already despite the problem being known before. 5) Ulad posted patches which introduce a new GFP allocation mode which makes the allocation fail if the lockless cache is empty, i.e. it does not try to touch zone->lock in that case. That works perfectly fine on RT and !RT, makes lockdep happy and Paul is happy as well. If the lockless cache, which works perfectly fine on RT, is empty then the performance of kfree_rcu() post grace period processing is probably not making the situation of the system worse. Michal is not fond of the new GFP flag and wants to explore options to avoid that completely. But so far there is no real feasible solution. A) It was suggested to make zone->lock raw, but that's a total disaster latency wise. With just a kernel compile (!RT kernel) spinwait times are in the hundreds of microseconds. RT tests showed max latency of cyclictest go up from 30 to 220 microseconds, i.e. factor 7 just because of that. So not really great. 
It would have had the charm to keep the semantics of GFP_NOWAIT, but OTOH even if it would work I'm pretty opposed to opening the can of worms which allows allocations from the deepest atomic contexts in general without a really good reason. B) Michal suggested to have GFP_RT_ATOMIC defined to 0 which would not require a new GFP bit and bail out when RT is enabled and the context is atomic. That of course does not solve the problem vs. lockdep. Also the idea to make this conditional on !preemptible() does not work because preemptible() always returns false for CONFIG_PREEMPT_COUNT=n kernels. C) I suggested to make GFP == 0 fail unconditionally when the lockless cache is empty, but that changes the semantics on !RT kernels for existing users which hand in 0. D) To solve the lockdep issue of #B Michal suggested to have magic lockdep annotations which allow !RT kernels to take zone->lock from the otherwise forbidden contexts because on !RT this lock nesting could be considered a false positive. But this would be horrors of some sorts because there might be locks further down which then need the same treatment or some general 'yay, I'm exempt from everything' annotation which is short of disabling lockdep completely. Even if that could be solved and made correct for both RT and !RT then this opens the can of worms that this kind of annotation would show up all over the place within no time for the completely wrong reasons. Paul compiled this options list: > 1. Prohibit invoking allocators from raw atomic context, such > as when holding a raw spinlock. Clearly the simplest solution but not Paul's favourite and unfortunately he has a good reason. > 2. Adding a GFP_ flag. The simplest and most consistent solution. If you really need to do allocations from such contexts then deal with the limitations whether it's RT or not. Paul has no problem with that and this newfangled kfree_rcu() is the only user and can live with that restriction. Michal does not like the restriction for !RT kernels and tries to avoid the introduction of a new allocation mode. My argument here is that consistency is the best solution and the extra GFP mode is explicitly restrictive due to the context which it is called from. Aside from that, this affects exactly ONE use case which has a really good reason and does not care about the restriction even on !RT because in that situation kfree_rcu() performance is not the most urgent problem. > 3. Reusing existing GFP_ flags/values/whatever to communicate > the raw-context information that was to be communicated by > the new GFP_ flag. > > 4. Making lockdep forgive acquiring spinlocks while holding > raw spinlocks, but only in CONFIG_PREEMPT_NONE=y kernels. These are not separate from each other as #3 requires #4. The most horrible solution IMO from a technical POV as it proliferates inconsistency for no good reason. Aside from that, it'd be solving a problem which does not exist simply because kfree_rcu() does not depend on it and we really don't want to set a precedent and encourage the (ab)use of this in any way. Having a distinct GFP mode is technically correct, consistent on all kernel variants and does not affect any existing user of the page allocator aside from the new kfree_rcu(). I hope that the cloudy and rainy day cured most of my heat wave induced brain damage to the extent that the above is correctly summarizing the state of affairs. If not, then please yell and I get an icepack. Thanks, tglx
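For reference, a minimal caller-side sketch of what option #2 amounts to, using the flag name from the RFC patch under discussion (treat this purely as illustration; the flag is not part of the allocator today): the request is either served from the lockless per-cpu caches or fails immediately, and the caller falls back to the rcu_head linked list.

	unsigned long page;

	/* Served from the per-cpu lists without taking zone->lock, or 0 on failure. */
	page = __get_free_page(__GFP_NO_LOCKS);
	if (!page) {
		/* Queue the object onto the regular rcu_head list instead. */
	}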
On Thu 13-08-20 19:09:29, Thomas Gleixner wrote: > Michal Hocko <mhocko@suse.com> writes: [...] > > Why should we limit the functionality of the allocator for something > > that is not a real problem? > > We'd limit the allocator for exactly ONE new user which was aware of > this problem _before_ the code hit mainline. And that ONE user is > prepared to handle the fail. If we are to limit the functionality to this one particular user then I would consider a dedicated gfp flag a huge overkill. It would be much easier to have a preallocated pool of pages and use those and completely avoid the core allocator. That would certainly only shift the complexity to the caller, but if it is expected there would be only that single user then it would probably be better than opening a can of worms like an allocator usable from raw spin locks. Paul, would something like that be feasible? Really, we have been bitten by single-usecase gfp flags in the past. [...] > Even if we could make this lockdep thing work that does not mean that > it's a good thing to do. > > Quite the contrary, you'd just encourage people to create more of those > use cases for probably the completely wrong reasons. Putting a > limitation into place upfront might makes them think farther than just > slapping GFP_RT_ATOMIC in and be done with it. Let me dream :) Good one ;) But seriously. I was suggesting a lockdep workaround because I reckon that is less prone to abuse than gfp flags. Lockdep is that scary thing people do not want to touch with a long pole, but gfp flags are something you have to deal with when calling the allocator, and people tend to be creative. We used to suck at documentation so I am not surprised, but things have improved so maybe the usage is going to improve as well. Anyway, __GFP_NO_LOCK would be a free ticket to "I want to optimize even further" land. Maybe a better name would help, but I am skeptical. > I've dealt with tons of patches in the last 15+ years where people just > came up with 's/GFP_KERNEL/GFP_ATOMIC/ because tool complained' > patches. The vast majority of them were bogus because the alloc() was > simply at the wrong place. Completely agreed. > Forcing people not to take the easy way out by making the infrastructure > restrictive is way better than encouraging mindless hackery. We have > enough of this (not restricted to memory allocations) all over the > kernel already. No need for more. I do agree with you. I just slightly disagree about where the danger is. Explicit lockdep usage outside of the core is spread much less than the allocator so the abuse is less likely.
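To make the preallocated-pool alternative a bit more tangible, a rough sketch of what such a caller-maintained cache could look like (all names and sizes here are hypothetical): pages are refilled from sleepable context, e.g. a work item, and handed out under a raw lock without ever entering the page allocator.

	struct page_cache_pool {
		raw_spinlock_t lock;
		void *pages[16];	/* arbitrary illustrative capacity */
		int nr;
	};

	/* Never allocates, so it is usable in the contexts being discussed. */
	static void *pool_take(struct page_cache_pool *p)
	{
		unsigned long flags;
		void *page = NULL;

		raw_spin_lock_irqsave(&p->lock, flags);
		if (p->nr > 0)
			page = p->pages[--p->nr];
		raw_spin_unlock_irqrestore(&p->lock, flags);
		return page;
	}

As the follow-ups below discuss, the open question with this approach is sizing: the pool has to be big enough for callback floods, which is exactly when its size is hardest to bound.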
On Fri, Aug 14, 2020 at 01:59:04AM +0200, Thomas Gleixner wrote: > On Fri, Aug 14 2020 at 00:06, peterz wrote: > > I'm still not getting it, how do we end up trying to allocate memory > > from under raw spinlocks if you're not allowed to use kfree_rcu() under > > one ? > > > > Can someone please spell out the actual problem? > > Your actual problem is the heat wave. Get an icepack already :) Sure, but also much of the below wasn't stated anywhere in the thread I got Cc'ed on. All I got was half arsed solutions without an actual problem statement. > To set things straight: > > 1) kfree_rcu() which used to be just a conveniance wrapper around > call_rcu() always worked in any context except NMI. > > Otherwise RT would have never worked or would have needed other > horrible hacks to defer kfree in certain contexts including > context switch. I've never used kfree_rcu(), htf would I know. > 2) RCU grew an optimization for kfree_rcu() which avoids the linked > list cache misses when a large number of objects is freed via > kfree_rcu(). Paul says it speeds up post grace period processing by > a factor of 8 which is impressive. > > That's done by avoiding the linked list and storing the object > pointer in an array of pointers which is page sized. This array is > then freed in bulk without having to touch the rcu head linked list > which obviously avoids cache misses. > > Occasionally kfree_rcu() needs to allocate a page for this but it > can fallback to the linked list when the allocation fails. > > Inconveniantly this allocation happens with an RCU internal raw > lock held, but even if we could do the allocation outside the RCU > internal lock this would not solve the problem that kfree_rcu() can > be called in any context except NMI even on RT. > > So RT forbids allocations from truly atomic contexts even with > GFP_ATOMIC/NOWAIT. GFP_ATOMIC is kinda meaningless on RT because > everything which calls it is in non-atomic context :) But still > GFP_ATOMIC or GFP_NOWAIT retain the semantics of !RT and do not go > down into deeper levels or wait for pages to become available. Right so on RT just cut out the allocation and unconditionally make it NULL. Since it needs to deal with GFP_ATOMIC|GFP_NOWARN failing the allocation anyway. > 3) #2 upsets lockdep (with the raw lock nesting enabled) rightfully > when RCU tries to allocate a page, the lockless page cache is empty > and it acquires zone->lock which is a regular spinlock There was no lockdep splat in the thread. > 4) As kfree_rcu() can be used from any context except NMI and RT > relies on it we ran into a circular dependency problem. Well, which actual usage sites are under a raw spinlock? Most of the ones I could find are not. > If we disable that feature for RT then the problem would be solved > except that lockdep still would complain about taking zone->lock > within a forbidden context on !RT kernels. > > Preventing that feature because of that is not a feasible option > either. Aside of that we discuss this postfactum, IOW the stuff is > upstream already despite the problem being known before. Well, that's a fail :-( I tought we were supposed to solve known issues before shit got merged. > 5) Ulad posted patches which introduce a new GFP allocation mode > which makes the allocation fail if the lockless cache is empty, > i.e. it does not try to touch zone->lock in that case. > > That works perfectly fine on RT and !RT, makes lockdep happy and > Paul is happy as well. 
> > If the lockless cache, which works perfectly fine on RT, is empty > then the performance of kfree_rcu() post grace period processing is > probably not making the situation of the system worse. > > Michal is not fond of the new GFP flag and wants to explore options > to avoid that completely. But so far there is no real feasible > solution. Not only Michal, those patches looked pretty horrid. > A) It was suggested to make zone->lock raw, but that's a total > disaster latency wise. With just a kernel compile (!RT kernel) > spinwait times are in the hundreds of microseconds. Yeah, I know, been there done that, like over a decade ago :-) > B) Michal suggested to have GFP_RT_ATOMIC defined to 0 which would > not require a new GFP bit and bail out when RT is enabled and > the context is atomic. I would've written the code like: void *mem = NULL; ... #ifndef CONFIG_PREEMPT_RT mem = kmalloc(PAGE_SIZE, GFP_ATOMIC|GFP_NOWAIT); #endif if (!mem) .... Seems much simpler and doesn't pollute the GFP_ space. > D) To solve the lockdep issue of #B Michal suggested to have magic > lockdep annotations which allow !RT kernels to take zone->lock > from the otherwise forbidden contexts because on !RT this lock > nesting could be considered a false positive. > > But this would be horrors of some sorts because there might be > locks further down which then need the same treatment or some > general 'yay, I'm excempt from everything' annotation which is > short of disabling lockdep completly. > > Even if that could be solved and made correct for both RT and > !RT then this opens the can of worms that this kind of > annotation would show up all over the place within no time for > the completely wrong reasons. So due to this heat crap I've not slept more than a few hours a night for the last week, I'm not entirely there, but we already have a bunch of lockdep magic for this raw nesting stuff. But yes, we need to avoid abuse, grep for lockdep_off() :-( This drivers/md/dm-cache-target.c thing is just plain broken. It sorta 'works' on mainline (and even there it can behave badly), but on RT it will come apart with a bang. > Paul compiled this options list: > > > 1. Prohibit invoking allocators from raw atomic context, such > > as when holding a raw spinlock. > > Clearly the simplest solution but not Pauls favourite and > unfortunately he has a good reason. Which isn't actually stated anywhere I suppose ? > > 2. Adding a GFP_ flag. > > Michal does not like the restriction for !RT kernels and tries to > avoid the introduction of a new allocation mode. Like above, I tend to be with Michal on this, just wrap the actual allocation in CONFIG_PREEMPT_RT, the code needs to handle a NULL pointer there anyway. > > 3. Reusing existing GFP_ flags/values/whatever to communicate > > the raw-context information that was to be communicated by > > the new GFP_ flag. > > > > 4. Making lockdep forgive acquiring spinlocks while holding > > raw spinlocks, but only in CONFIG_PREEMPT_NONE=y kernels. Uhh, !CONFIG_PREEMPT_RT, the rest is 'fine'. > These are not seperate of each other as #3 requires #4. The most > horrible solution IMO from a technical POV as it proliferates > inconsistency for no good reaosn. > > Aside of that it'd be solving a problem which does not exist simply > because kfree_rcu() does not depend on it and we really don't want to > set precedence and encourage the (ab)use of this in any way. My preferred solution is 1, if you want to use an allocator, you simply don't get to play under raw_spinlock_t. 
And from a quick grep, most kfree_rcu() users are not under raw_spinlock_t context. This here sounds like someone who wants to have his cake and eat it too. I'll try and think about a lockdep annotation that isn't utter crap, but that's probably next week, I need this heat to end and get a few nights sleep, I'm an utter wreck atm.
On Fri, Aug 14, 2020 at 10:30:37AM +0200, Peter Zijlstra wrote: > > > 1. Prohibit invoking allocators from raw atomic context, such > > > as when holding a raw spinlock. > > > > Clearly the simplest solution but not Pauls favourite and > > unfortunately he has a good reason. > > Which isn't actually stated anywhere I suppose ? Introduce raw_kfree_rcu() that doesn't do the allocation, and fix the few wonky callsites.
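To spell out what that would mean, a hypothetical sketch (raw_kfree_rcu() does not exist; the name is taken from the mail above): such a variant would skip the pointer-array page allocation entirely and always queue through the classic rcu_head path, which is what keeps it safe under raw_spinlock_t, at the cost of losing the bulk-array optimization for those callers.

	/* Hypothetical: never allocates, so callable under raw_spinlock_t. */
	static inline void raw_kfree_rcu_sketch(struct rcu_head *head, rcu_callback_t func)
	{
		call_rcu(head, func);	/* plain rcu_head queueing, no bulk page */
	}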
On Thu, Aug 13, 2020 at 06:36:17PM +0200, Michal Hocko wrote: > On Thu 13-08-20 18:20:47, Uladzislau Rezki wrote: > > > On Thu 13-08-20 08:41:59, Paul E. McKenney wrote: > > > > On Thu, Aug 13, 2020 at 04:53:35PM +0200, Michal Hocko wrote: > > > > > On Thu 13-08-20 16:34:57, Thomas Gleixner wrote: > > > > > > Michal Hocko <mhocko@suse.com> writes: > > > > > > > On Thu 13-08-20 15:22:00, Thomas Gleixner wrote: > > > > > > >> It basically requires to convert the wait queue to something else. Is > > > > > > >> the waitqueue strict single waiter? > > > > > > > > > > > > > > I would have to double check. From what I remember only kswapd should > > > > > > > ever sleep on it. > > > > > > > > > > > > That would make it trivial as we could simply switch it over to rcu_wait. > > > > > > > > > > > > >> So that should be: > > > > > > >> > > > > > > >> if (!preemptible() && gfp == GFP_RT_NOWAIT) > > > > > > >> > > > > > > >> which is limiting the damage to those callers which hand in > > > > > > >> GFP_RT_NOWAIT. > > > > > > >> > > > > > > >> lockdep will yell at invocations with gfp != GFP_RT_NOWAIT when it hits > > > > > > >> zone->lock in the wrong context. And we want to know about that so we > > > > > > >> can look at the caller and figure out how to solve it. > > > > > > > > > > > > > > Yes, that would have to somehow need to annotate the zone_lock to be ok > > > > > > > in those paths so that lockdep doesn't complain. > > > > > > > > > > > > That opens the worst of all cans of worms. If we start this here then > > > > > > Joe programmer and his dog will use these lockdep annotation to evade > > > > > > warnings and when exposed to RT it will fall apart in pieces. Just that > > > > > > at that point Joe programmer moved on to something else and the usual > > > > > > suspects can mop up the pieces. We've seen that all over the place and > > > > > > some people even disable lockdep temporarily because annotations don't > > > > > > help. > > > > > > > > > > Hmm. I am likely missing something really important here. We have two > > > > > problems at hand: > > > > > 1) RT will become broken as soon as this new RCU functionality which > > > > > requires an allocation from inside of raw_spinlock hits the RT tree > > > > > 2) lockdep splats which are telling us that early because of the > > > > > raw_spinlock-> spin_lock dependency. > > > > > > > > That is a reasonable high-level summary. > > > > > > > > > 1) can be handled by handled by the bailing out whenever we have to use > > > > > zone->lock inside the buddy allocator - essentially even more strict > > > > > NOWAIT semantic than we have for RT tree - proposed (pseudo) patch is > > > > > trying to describe that. > > > > > > > > Unless I am missing something subtle, the problem with this approach > > > > is that in production-environment CONFIG_PREEMPT_NONE=y kernels, there > > > > is no way at runtime to distinguish between holding a spinlock on the > > > > one hand and holding a raw spinlock on the other. Therefore, without > > > > some sort of indication from the caller, this approach will not make > > > > CONFIG_PREEMPT_NONE=y users happy. > > > > > > If the whole bailout is guarded by CONFIG_PREEMPT_RT specific atomicity > > > check then there is no functional problem - GFP_RT_SAFE would still be > > > GFP_NOWAIT so functional wise the allocator will still do the right > > > thing. > > > > > > [...] > > > > > > > > That would require changing NOWAIT/ATOMIC allocations semantic quite > > > > > drastically for !RT kernels as well. 
I am not sure this is something we > > > > > can do. Or maybe I am just missing your point. > > > > > > > > Exactly, and avoiding changing this semantic for current users is > > > > precisely why we are proposing some sort of indication to be passed > > > > into the allocation request. In Uladzislau's patch, this was the > > > > __GFP_NO_LOCKS flag, but whatever works. > > > > > > As I've tried to explain already, I would really hope we can do without > > > any new gfp flags. We are running out of them and they tend to generate > > > a lot of maintenance burden. There is a lot of abuse etc. We should also > > > not expose such an implementation detail of the allocator to callers > > > because that would make future changes even harder. The alias, on the > > > othere hand already builds on top of existing NOWAIT semantic and it > > > just helps the allocator to complain about a wrong usage while it > > > doesn't expose any internals. > > > > > I know that Matthew and me raised it. We do can handle it without > > introducing any flag. I mean just use 0 as argument to the page_alloc(gfp_flags = 0) > > > > i.e. #define __GFP_NO_LOCKS 0 > > > > so it will be handled same way how it is done in the "mm: Add __GFP_NO_LOCKS flag" > > I can re-spin the RFC patch and send it out for better understanding. > > > > Does it work for you, Michal? Or it is better just to drop the patch here? > > That would change the semantic for GFP_NOWAIT users who decided to drop > __GFP_KSWAPD_RECLAIM or even use 0 gfp mask right away, right? The point > I see your point. Doing GFP_NOWAIT & ~__GFP_KSWAPD_RECLAIM will do something different from what people expect. Right you are. > > I am trying to make is that an alias is good for RT because it doesn't > have any users (because there is no RT atomic user of the allocator) > currently. > Now I see your view. So we can handle the RT case by using "RT && !preemptible()"; based on that we can bail out. GFP_ATOMIC and NOWAIT will at least keep the same semantics. Second, if CONFIG_PROVE_RAW_LOCK_NESTING is fixed for PREEMPT_COUNT=n, then it would work. But I am a bit lost here as to whether that is up for discussion or not. Thanks! -- Vlad Rezki
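A minimal sketch of that bail-out, purely for illustration (this is not the actual allocator code): only PREEMPT_RT kernels running in a non-preemptible context would refuse to go for zone->lock, so GFP_ATOMIC/GFP_NOWAIT keep their current semantics everywhere else. Thomas's earlier caveat still applies: preemptible() is only meaningful with CONFIG_PREEMPT_COUNT enabled, which PREEMPT_RT implies.

	/* In the slow path, before zone->lock would be taken. */
	if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible())
		return NULL;	/* lockless per-cpu lists were empty; do not take the lock */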
> On Thu 13-08-20 19:09:29, Thomas Gleixner wrote: > > Michal Hocko <mhocko@suse.com> writes: > [...] > > > Why should we limit the functionality of the allocator for something > > > that is not a real problem? > > > > We'd limit the allocator for exactly ONE new user which was aware of > > this problem _before_ the code hit mainline. And that ONE user is > > prepared to handle the fail. > > If we are to limit the functionality to this one particular user then > I would consider a dedicated gfp flag a huge overkill. It would be much > more easier to have a preallocated pool of pages and use those and > completely avoid the core allocator. That would certainly only shift the > complexity to the caller but if it is expected there would be only that > single user then it would be probably better than opening a can of worms > like allocator usable from raw spin locks. > Vlastimil raised the same question earlier; I answered, but let me answer again: It is hard to achieve because the logic does not stick to a certain static test case, i.e. it depends on how heavily kfree_rcu() (single/double argument) is used. Based on that "how heavily", a number of pages are formed until the drain/reclaimer thread frees them. Preloading pages and keeping them for internal use, IMHO, seems suboptimal from the point of view of wasting resources. It is better to have a fast mechanism to request a page and release it back for the needs of others. As described above, we do not know how many we will need. -- Vlad Rezki
On Fri 14-08-20 14:15:44, Uladzislau Rezki wrote: > > On Thu 13-08-20 19:09:29, Thomas Gleixner wrote: > > > Michal Hocko <mhocko@suse.com> writes: > > [...] > > > > Why should we limit the functionality of the allocator for something > > > > that is not a real problem? > > > > > > We'd limit the allocator for exactly ONE new user which was aware of > > > this problem _before_ the code hit mainline. And that ONE user is > > > prepared to handle the fail. > > > > If we are to limit the functionality to this one particular user then > > I would consider a dedicated gfp flag a huge overkill. It would be much > > more easier to have a preallocated pool of pages and use those and > > completely avoid the core allocator. That would certainly only shift the > > complexity to the caller but if it is expected there would be only that > > single user then it would be probably better than opening a can of worms > > like allocator usable from raw spin locks. > > > Vlastimil raised same question earlier, i answered, but let me answer again: > > It is hard to achieve because the logic does not stick to certain static test > case, i.e. it depends on how heavily kfree_rcu(single/double) are used. Based > on that, "how heavily" - number of pages are formed, until the drain/reclaimer > thread frees them. How many pages are talking about - ball park? 100s, 1000s? > Preloading pages and keeping them for internal use, IMHO, seems not optimal > from the point of resources wasting. It is better to have a fast mechanism to > request a page and release it back for needs of others. As described about > we do not know how much we will need. It would be wasted memory but if we are talking about relatively small number of pages. Note that there are not that many pages on the allocator's pcp list anyway.
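For reference, the preallocated pool being suggested here could be shaped roughly as below. This is a sketch under the assumption of a small fixed per-CPU budget; all names (krc_*) are invented, and the refill must run from sleepable context:

<snip>
#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

#define KRC_CACHE_PAGES	4	/* assumed per-CPU budget */

struct krc_page_cache {
	raw_spinlock_t lock;		/* raw so it can nest inside kfree_rcu()'s own raw lock */
	struct list_head pages;		/* spare pages, linked via page->lru */
	int nr;
};

/* Refill from sleepable context (e.g. a worker), never from the hot path. */
static void krc_cache_refill(struct krc_page_cache *c)
{
	unsigned long flags;
	struct page *page;

	for (;;) {
		raw_spin_lock_irqsave(&c->lock, flags);
		if (c->nr >= KRC_CACHE_PAGES) {
			raw_spin_unlock_irqrestore(&c->lock, flags);
			return;
		}
		raw_spin_unlock_irqrestore(&c->lock, flags);

		page = alloc_page(GFP_KERNEL);
		if (!page)
			return;

		raw_spin_lock_irqsave(&c->lock, flags);
		list_add(&page->lru, &c->pages);
		c->nr++;
		raw_spin_unlock_irqrestore(&c->lock, flags);
	}
}

/* Grab a cached page without ever touching zone->lock; raw-lock safe. */
static struct page *krc_cache_get(struct krc_page_cache *c)
{
	struct page *page = NULL;
	unsigned long flags;

	raw_spin_lock_irqsave(&c->lock, flags);
	if (!list_empty(&c->pages)) {
		page = list_first_entry(&c->pages, struct page, lru);
		list_del(&page->lru);
		c->nr--;
	}
	raw_spin_unlock_irqrestore(&c->lock, flags);
	return page;
}
<snip>

(The lock and list head need raw_spin_lock_init()/INIT_LIST_HEAD() before first use.) Whether a handful of pages per CPU like this is enough is exactly the sizing question argued over in the rest of the thread.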
On Fri, Aug 14, 2020 at 02:48:32PM +0200, Michal Hocko wrote: > On Fri 14-08-20 14:15:44, Uladzislau Rezki wrote: > > > On Thu 13-08-20 19:09:29, Thomas Gleixner wrote: > > > > Michal Hocko <mhocko@suse.com> writes: > > > [...] > > > > > Why should we limit the functionality of the allocator for something > > > > > that is not a real problem? > > > > > > > > We'd limit the allocator for exactly ONE new user which was aware of > > > > this problem _before_ the code hit mainline. And that ONE user is > > > > prepared to handle the fail. > > > > > > If we are to limit the functionality to this one particular user then > > > I would consider a dedicated gfp flag a huge overkill. It would be much > > > more easier to have a preallocated pool of pages and use those and > > > completely avoid the core allocator. That would certainly only shift the > > > complexity to the caller but if it is expected there would be only that > > > single user then it would be probably better than opening a can of worms > > > like allocator usable from raw spin locks. > > > > > Vlastimil raised same question earlier, i answered, but let me answer again: > > > > It is hard to achieve because the logic does not stick to certain static test > > case, i.e. it depends on how heavily kfree_rcu(single/double) are used. Based > > on that, "how heavily" - number of pages are formed, until the drain/reclaimer > > thread frees them. > > How many pages are talking about - ball park? 100s, 1000s? Under normal operation, a couple of pages per CPU, which would make preallocation entirely reasonable. Except that if someone does something that floods RCU callbacks (close(open) in a tight userspace loop, for but one example), then 2000 per CPU might not be enough, which on a 64-CPU system comes to about 500MB. This is beyond excessive for preallocation on the systems I am familiar with. And the flooding case is where you most want the reclamation to be efficient, and thus where you want the pages. This of course raises the question of how much memory the lockless caches contain, but fortunately these RCU callback flooding scenarios also involve process-context allocation of the memory that is being passed to kfree_rcu(). That allocation should keep the lockless caches from going empty in the common case, correct? Please note that we will also need to apply this same optimization to call_rcu(), which will have the same constraints. > > Preloading pages and keeping them for internal use, IMHO, seems not optimal > > from the point of resources wasting. It is better to have a fast mechanism to > > request a page and release it back for needs of others. As described about > > we do not know how much we will need. > > It would be wasted memory but if we are talking about relatively small > number of pages. Note that there are not that many pages on the > allocator's pcp list anyway. Indeed, if we were talking a maximum of (say) 10 pages per CPU, we would not be having this conversation. ;-) Thanx, Paul
On Fri 14-08-20 06:34:50, Paul E. McKenney wrote: > On Fri, Aug 14, 2020 at 02:48:32PM +0200, Michal Hocko wrote: > > On Fri 14-08-20 14:15:44, Uladzislau Rezki wrote: > > > > On Thu 13-08-20 19:09:29, Thomas Gleixner wrote: > > > > > Michal Hocko <mhocko@suse.com> writes: > > > > [...] > > > > > > Why should we limit the functionality of the allocator for something > > > > > > that is not a real problem? > > > > > > > > > > We'd limit the allocator for exactly ONE new user which was aware of > > > > > this problem _before_ the code hit mainline. And that ONE user is > > > > > prepared to handle the fail. > > > > > > > > If we are to limit the functionality to this one particular user then > > > > I would consider a dedicated gfp flag a huge overkill. It would be much > > > > more easier to have a preallocated pool of pages and use those and > > > > completely avoid the core allocator. That would certainly only shift the > > > > complexity to the caller but if it is expected there would be only that > > > > single user then it would be probably better than opening a can of worms > > > > like allocator usable from raw spin locks. > > > > > > > Vlastimil raised same question earlier, i answered, but let me answer again: > > > > > > It is hard to achieve because the logic does not stick to certain static test > > > case, i.e. it depends on how heavily kfree_rcu(single/double) are used. Based > > > on that, "how heavily" - number of pages are formed, until the drain/reclaimer > > > thread frees them. > > > > How many pages are talking about - ball park? 100s, 1000s? > > Under normal operation, a couple of pages per CPU, which would make > preallocation entirely reasonable. Except that if someone does something > that floods RCU callbacks (close(open) in a tight userspace loop, for but > one example), then 2000 per CPU might not be enough, which on a 64-CPU > system comes to about 500MB. This is beyond excessive for preallocation > on the systems I am familiar with. > > And the flooding case is where you most want the reclamation to be > efficient, and thus where you want the pages. I am not sure the page allocator would help you with this scenario unless you are on very large machines. Pagesets scale with the available memory and percpu_pagelist_fraction sysctl (have a look at pageset_set_high_and_batch). It is roughly 1000th of the zone size for each zone. You can check that in /proc/vmstat (my 8G machine) Node 0, zone DMA Not interesting at all Node 0, zone DMA32 pagesets cpu: 0 count: 242 high: 378 batch: 63 cpu: 1 count: 355 high: 378 batch: 63 cpu: 2 count: 359 high: 378 batch: 63 cpu: 3 count: 366 high: 378 batch: 63 Node 0, zone Normal pagesets cpu: 0 count: 359 high: 378 batch: 63 cpu: 1 count: 241 high: 378 batch: 63 cpu: 2 count: 297 high: 378 batch: 63 cpu: 3 count: 227 high: 378 batch: 63 Besides that do you need to be per-cpu? Having 1000 pages available and managed under your raw spinlock should be good enough already no? > This of course raises the question of how much memory the lockless caches > contain, but fortunately these RCU callback flooding scenarios also > involve process-context allocation of the memory that is being passed > to kfree_rcu(). That allocation should keep the lockless caches from > going empty in the common case, correct? Yes, those are refilled both on the allocation/free paths. But you cannot really rely on that to happen early enough. 
Do you happen to have any numbers that would show the typical usage and how often the slow path has to be taken because the pcp lists are depleted? In other words, even if we provide a completely lockless way to allocate memory, how useful would that be?
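To put rough numbers on the pageset scaling mentioned above (going by zone_batchsize()/pageset_set_batch() as they looked around this time, so treat the constants as approximate): batch is managed_pages/1024, capped at 1 MB worth of pages, divided by 4 and rounded down to a 2^n - 1 value, and the pcp "high" mark is 6 * batch. With 4 KB pages the cap wins for any non-tiny zone: (1 MB / 4 KB) / 4 = 64, rounded down to 63, and 6 * 63 = 378 -- exactly the batch/high pair in the per-cpu listing above. So with the default percpu_pagelist_fraction, each CPU's pcp list per zone tops out at a few hundred order-0 pages even on a large machine.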
On Fri, Aug 14, 2020 at 10:30:37AM +0200, Peter Zijlstra wrote: > On Fri, Aug 14, 2020 at 01:59:04AM +0200, Thomas Gleixner wrote: > > On Fri, Aug 14 2020 at 00:06, peterz wrote: > > > I'm still not getting it, how do we end up trying to allocate memory > > > from under raw spinlocks if you're not allowed to use kfree_rcu() under > > > one ? > > > > > > Can someone please spell out the actual problem? > > > > Your actual problem is the heat wave. Get an icepack already :) > > Sure, but also much of the below wasn't stated anywhere in the thread I > got Cc'ed on. All I got was half arsed solutions without an actual > problem statement. > > > To set things straight: > > > > 1) kfree_rcu() which used to be just a conveniance wrapper around > > call_rcu() always worked in any context except NMI. > > > > Otherwise RT would have never worked or would have needed other > > horrible hacks to defer kfree in certain contexts including > > context switch. > > I've never used kfree_rcu(), htf would I know. Doing this to kfree_rcu() is the first step. We will also be doing this to call_rcu(), which has some long-standing invocations from various raw contexts, including hardirq handler. > > 2) RCU grew an optimization for kfree_rcu() which avoids the linked > > list cache misses when a large number of objects is freed via > > kfree_rcu(). Paul says it speeds up post grace period processing by > > a factor of 8 which is impressive. > > > > That's done by avoiding the linked list and storing the object > > pointer in an array of pointers which is page sized. This array is > > then freed in bulk without having to touch the rcu head linked list > > which obviously avoids cache misses. > > > > Occasionally kfree_rcu() needs to allocate a page for this but it > > can fallback to the linked list when the allocation fails. > > > > Inconveniantly this allocation happens with an RCU internal raw > > lock held, but even if we could do the allocation outside the RCU > > internal lock this would not solve the problem that kfree_rcu() can > > be called in any context except NMI even on RT. > > > > So RT forbids allocations from truly atomic contexts even with > > GFP_ATOMIC/NOWAIT. GFP_ATOMIC is kinda meaningless on RT because > > everything which calls it is in non-atomic context :) But still > > GFP_ATOMIC or GFP_NOWAIT retain the semantics of !RT and do not go > > down into deeper levels or wait for pages to become available. > > Right so on RT just cut out the allocation and unconditionally make it > NULL. Since it needs to deal with GFP_ATOMIC|GFP_NOWARN failing the > allocation anyway. Except that this is not just RT due to CONFIG_PROVE_RAW_LOCK_NESTING=y. > > 3) #2 upsets lockdep (with the raw lock nesting enabled) rightfully > > when RCU tries to allocate a page, the lockless page cache is empty > > and it acquires zone->lock which is a regular spinlock > > There was no lockdep splat in the thread. I don't see the point of including such a splat given that we know that lockdep is supposed to splat in this situation. > > 4) As kfree_rcu() can be used from any context except NMI and RT > > relies on it we ran into a circular dependency problem. > > Well, which actual usage sites are under a raw spinlock? Most of the > ones I could find are not. There are some on their way in, but this same optimization will be applied to call_rcu(), and there are no shortage of such call sites in that case. And these call sites have been around for a very long time. 
> > If we disable that feature for RT then the problem would be solved > > except that lockdep still would complain about taking zone->lock > > within a forbidden context on !RT kernels. > > > > Preventing that feature because of that is not a feasible option > > either. Aside of that we discuss this postfactum, IOW the stuff is > > upstream already despite the problem being known before. > > Well, that's a fail :-( I tought we were supposed to solve known issues > before shit got merged. The enforcement is coming in and the kfree_rcu() speed up is coming in at the same time. The call_rcu() speedup will appear later. > > 5) Ulad posted patches which introduce a new GFP allocation mode > > which makes the allocation fail if the lockless cache is empty, > > i.e. it does not try to touch zone->lock in that case. > > > > That works perfectly fine on RT and !RT, makes lockdep happy and > > Paul is happy as well. > > > > If the lockless cache, which works perfectly fine on RT, is empty > > then the performance of kfree_rcu() post grace period processing is > > probably not making the situation of the system worse. > > > > Michal is not fond of the new GFP flag and wants to explore options > > to avoid that completely. But so far there is no real feasible > > solution. > > Not only Michal, those patches looked pretty horrid. They are the first round, and will be improved. > > A) It was suggested to make zone->lock raw, but that's a total > > disaster latency wise. With just a kernel compile (!RT kernel) > > spinwait times are in the hundreds of microseconds. > > Yeah, I know, been there done that, like over a decade ago :-) > > > B) Michal suggested to have GFP_RT_ATOMIC defined to 0 which would > > not require a new GFP bit and bail out when RT is enabled and > > the context is atomic. > > I would've written the code like: > > void *mem = NULL; > > ... > #ifndef CONFIG_PREEMPT_RT > mem = kmalloc(PAGE_SIZE, GFP_ATOMIC|GFP_NOWAIT); > #endif > if (!mem) > .... > > Seems much simpler and doesn't pollute the GFP_ space. But fails in the CONFIG_PROVE_RAW_LOCK_NESTING=y case when lockdep is enabled. > > D) To solve the lockdep issue of #B Michal suggested to have magic > > lockdep annotations which allow !RT kernels to take zone->lock > > from the otherwise forbidden contexts because on !RT this lock > > nesting could be considered a false positive. > > > > But this would be horrors of some sorts because there might be > > locks further down which then need the same treatment or some > > general 'yay, I'm excempt from everything' annotation which is > > short of disabling lockdep completly. > > > > Even if that could be solved and made correct for both RT and > > !RT then this opens the can of worms that this kind of > > annotation would show up all over the place within no time for > > the completely wrong reasons. > > So due to this heat crap I've not slept more than a few hours a night > for the last week, I'm not entirely there, but we already have a bunch > of lockdep magic for this raw nesting stuff. Ouch! Here is hoping for cooler weather soon! > But yes, we need to avoid abuse, grep for lockdep_off() :-( This > drivers/md/dm-cache-target.c thing is just plain broken. It sorta > 'works' on mainline (and even there it can behave badly), but on RT it > will come apart with a bang. > > > Paul compiled this options list: > > > > > 1. Prohibit invoking allocators from raw atomic context, such > > > as when holding a raw spinlock. 
> > > > Clearly the simplest solution but not Pauls favourite and > > unfortunately he has a good reason. > > Which isn't actually stated anywhere I suppose ? Several times, but why not one more? ;-) In CONFIG_PREEMPT_NONE=y kernels, which are heavily used, and for which the proposed kfree_rcu() and later call_rcu() optimizations are quite important, there is no way to tell at runtime whether or you are in atomic raw context. > > > 2. Adding a GFP_ flag. > > > > Michal does not like the restriction for !RT kernels and tries to > > avoid the introduction of a new allocation mode. > > Like above, I tend to be with Michal on this, just wrap the actual > allocation in CONFIG_PREEMPT_RT, the code needs to handle a NULL pointer > there anyway. That disables the optimization in the CONFIG_PREEMPT_NONE=y case, where it really is needed. > > > 3. Reusing existing GFP_ flags/values/whatever to communicate > > > the raw-context information that was to be communicated by > > > the new GFP_ flag. > > > > > > 4. Making lockdep forgive acquiring spinlocks while holding > > > raw spinlocks, but only in CONFIG_PREEMPT_NONE=y kernels. > > Uhh, !CONFIG_PREEMPT_RT, the rest is 'fine'. I would be OK with either. In CONFIG_PREEMPT_NONE=n kernels, the kfree_rcu() code could use preemptible() to determine whether it was safe to invoke the allocator. The code in kfree_rcu() might look like this: mem = NULL; if (IS_ENABLED(CONFIG_PREEMPT_NONE) || preemptible()) mem = __get_free_page(...); Is your point is that the usual mistakes would then be caught by the usual testing on CONFIG_PREEMPT_NONE=n kernels? > > These are not seperate of each other as #3 requires #4. The most > > horrible solution IMO from a technical POV as it proliferates > > inconsistency for no good reaosn. > > > > Aside of that it'd be solving a problem which does not exist simply > > because kfree_rcu() does not depend on it and we really don't want to > > set precedence and encourage the (ab)use of this in any way. > > My preferred solution is 1, if you want to use an allocator, you simply > don't get to play under raw_spinlock_t. And from a quick grep, most > kfree_rcu() users are not under raw_spinlock_t context. There is at least one on its way in, but more to the point, we will need to apply this same optimization to call_rcu(), which is used in raw atomic context, including from hardirq handlers. > This here sounds like someone who wants to have his cake and eat it too. Just looking for a non-negative sized slice of cake, actually! > I'll try and think about a lockdep annotation that isn't utter crap, but > that's probably next week, I need this heat to end and get a few nights > sleep, I'm an utter wreck atm. Again, here is hoping for cooler weather! Thanx, Paul
On Fri, Aug 14, 2020 at 12:23:06PM +0200, peterz@infradead.org wrote: > On Fri, Aug 14, 2020 at 10:30:37AM +0200, Peter Zijlstra wrote: > > > > 1. Prohibit invoking allocators from raw atomic context, such > > > > as when holding a raw spinlock. > > > > > > Clearly the simplest solution but not Pauls favourite and > > > unfortunately he has a good reason. > > > > Which isn't actually stated anywhere I suppose ? > > Introduce raw_kfree_rcu() that doesn't do the allocation, and fix the > few wonky callsites. The problem with that is common code along with the tendency of people to just use the one that "works everywhere". Thanx, Paul
On Fri, Aug 14, 2020 at 07:14:25AM -0700, Paul E. McKenney wrote: > On Fri, Aug 14, 2020 at 10:30:37AM +0200, Peter Zijlstra wrote: > > On Fri, Aug 14, 2020 at 01:59:04AM +0200, Thomas Gleixner wrote: [ . . . ] > > > > 3. Reusing existing GFP_ flags/values/whatever to communicate > > > > the raw-context information that was to be communicated by > > > > the new GFP_ flag. > > > > > > > > 4. Making lockdep forgive acquiring spinlocks while holding > > > > raw spinlocks, but only in CONFIG_PREEMPT_NONE=y kernels. > > > > Uhh, !CONFIG_PREEMPT_RT, the rest is 'fine'. > > I would be OK with either. In CONFIG_PREEMPT_NONE=n kernels, the > kfree_rcu() code could use preemptible() to determine whether it was safe > to invoke the allocator. The code in kfree_rcu() might look like this: > > mem = NULL; > if (IS_ENABLED(CONFIG_PREEMPT_NONE) || preemptible()) > mem = __get_free_page(...); > > Is your point is that the usual mistakes would then be caught by the > usual testing on CONFIG_PREEMPT_NONE=n kernels? Just to make sure we are talking about the same thing, please see below for an untested patch that illustrates how I was interpreting your words. Was this what you had in mind? Thanx, Paul ------------------------------------------------------------------------ diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h index 62a382d..42d0ff1 100644 --- a/include/linux/lockdep.h +++ b/include/linux/lockdep.h @@ -579,7 +579,7 @@ do { \ # define lockdep_assert_preemption_disabled() do { } while (0) #endif -#ifdef CONFIG_PROVE_RAW_LOCK_NESTING +#ifdef CONFIG_PROVE_RAW_LOCK_NESTING_EFFECTIVE # define lockdep_assert_RT_in_threaded_ctx() do { \ WARN_ONCE(debug_locks && !current->lockdep_recursion && \ diff --git a/include/linux/lockdep_types.h b/include/linux/lockdep_types.h index bb35b44..70867d58 100644 --- a/include/linux/lockdep_types.h +++ b/include/linux/lockdep_types.h @@ -20,7 +20,7 @@ enum lockdep_wait_type { LD_WAIT_FREE, /* wait free, rcu etc.. */ LD_WAIT_SPIN, /* spin loops, raw_spinlock_t etc.. */ -#ifdef CONFIG_PROVE_RAW_LOCK_NESTING +#ifdef PROVE_RAW_LOCK_NESTING_EFFECTIVE LD_WAIT_CONFIG, /* CONFIG_PREEMPT_LOCK, spinlock_t etc.. */ #else LD_WAIT_CONFIG = LD_WAIT_SPIN, diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index e068c3c..e02de40 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1215,6 +1215,9 @@ config PROVE_RAW_LOCK_NESTING If unsure, select N. +config PROVE_RAW_LOCK_NESTING_EFFECTIVE + def_bool PROVE_RAW_LOCK_NESTING && !PREEMPTION + config LOCK_STAT bool "Lock usage statistics" depends on DEBUG_KERNEL && LOCK_DEBUGGING_SUPPORT
On Fri, Aug 14, 2020 at 07:14:25AM -0700, Paul E. McKenney wrote: > Doing this to kfree_rcu() is the first step. We will also be doing this > to call_rcu(), which has some long-standing invocations from various > raw contexts, including hardirq handler. Most hardirq handler are not raw on RT due to threaded interrupts. Lockdep knows about this. > > > 4) As kfree_rcu() can be used from any context except NMI and RT > > > relies on it we ran into a circular dependency problem. > > > > Well, which actual usage sites are under a raw spinlock? Most of the > > ones I could find are not. > > There are some on their way in, but this same optimization will be applied > to call_rcu(), and there are no shortage of such call sites in that case. > And these call sites have been around for a very long time. Luckily there is lockdep to help you find the ones that need converting to raw_call_rcu() :-) > > > Clearly the simplest solution but not Pauls favourite and > > > unfortunately he has a good reason. > > > > Which isn't actually stated anywhere I suppose ? > > Several times, but why not one more? ;-) > > In CONFIG_PREEMPT_NONE=y kernels, which are heavily used, and for which > the proposed kfree_rcu() and later call_rcu() optimizations are quite > important, there is no way to tell at runtime whether or you are in > atomic raw context. CONFIG_PREEMPT_NONE has nothing what so ever to do with any of this. > > > > 2. Adding a GFP_ flag. > > > > > > Michal does not like the restriction for !RT kernels and tries to > > > avoid the introduction of a new allocation mode. > > > > Like above, I tend to be with Michal on this, just wrap the actual > > allocation in CONFIG_PREEMPT_RT, the code needs to handle a NULL pointer > > there anyway. > > That disables the optimization in the CONFIG_PREEMPT_NONE=y case, > where it really is needed. No, it disables it for CONFIG_PREEMPT_RT. > I would be OK with either. In CONFIG_PREEMPT_NONE=n kernels, the > kfree_rcu() code could use preemptible() to determine whether it was safe > to invoke the allocator. The code in kfree_rcu() might look like this: > > mem = NULL; > if (IS_ENABLED(CONFIG_PREEMPT_NONE) || preemptible()) > mem = __get_free_page(...); > > Is your point is that the usual mistakes would then be caught by the > usual testing on CONFIG_PREEMPT_NONE=n kernels? mem = NULL; #if !defined(CONFIG_PREEMPT_RT) && !defined(CONFIG_PROVE_LOCKING) mem = __get_free_page(...) #endif if (!mem) But I _really_ would much rather have raw_kfree_rcu() and raw_call_rcu() variants for the few places that actually need it. > > > These are not seperate of each other as #3 requires #4. The most > > > horrible solution IMO from a technical POV as it proliferates > > > inconsistency for no good reaosn. > > > > > > Aside of that it'd be solving a problem which does not exist simply > > > because kfree_rcu() does not depend on it and we really don't want to > > > set precedence and encourage the (ab)use of this in any way. > > > > My preferred solution is 1, if you want to use an allocator, you simply > > don't get to play under raw_spinlock_t. And from a quick grep, most > > kfree_rcu() users are not under raw_spinlock_t context. > > There is at least one on its way in, but more to the point, we will > need to apply this same optimization to call_rcu(), which is used in There is no need, call_rcu() works perfectly fine today, thank you. You might want to, but that's an entirely different thing. > raw atomic context, including from hardirq handlers. Threaded IRQs. 
There really is very little code that is 'raw' on RT.
On Fri, Aug 14, 2020 at 09:11:06AM -0700, Paul E. McKenney wrote: > Just to make sure we are talking about the same thing, please see below > for an untested patch that illustrates how I was interpreting your words. > Was this what you had in mind? No, definitely not. Also, since we used to be able to use call_rcu() _everywhere_, including under zone->lock, how's that supposed to work when you're calling the page allocator from it?
On Fri, Aug 14, 2020 at 04:06:04PM +0200, Michal Hocko wrote: > On Fri 14-08-20 06:34:50, Paul E. McKenney wrote: > > On Fri, Aug 14, 2020 at 02:48:32PM +0200, Michal Hocko wrote: > > > On Fri 14-08-20 14:15:44, Uladzislau Rezki wrote: > > > > > On Thu 13-08-20 19:09:29, Thomas Gleixner wrote: > > > > > > Michal Hocko <mhocko@suse.com> writes: > > > > > [...] > > > > > > > Why should we limit the functionality of the allocator for something > > > > > > > that is not a real problem? > > > > > > > > > > > > We'd limit the allocator for exactly ONE new user which was aware of > > > > > > this problem _before_ the code hit mainline. And that ONE user is > > > > > > prepared to handle the fail. > > > > > > > > > > If we are to limit the functionality to this one particular user then > > > > > I would consider a dedicated gfp flag a huge overkill. It would be much > > > > > more easier to have a preallocated pool of pages and use those and > > > > > completely avoid the core allocator. That would certainly only shift the > > > > > complexity to the caller but if it is expected there would be only that > > > > > single user then it would be probably better than opening a can of worms > > > > > like allocator usable from raw spin locks. > > > > > > > > > Vlastimil raised same question earlier, i answered, but let me answer again: > > > > > > > > It is hard to achieve because the logic does not stick to certain static test > > > > case, i.e. it depends on how heavily kfree_rcu(single/double) are used. Based > > > > on that, "how heavily" - number of pages are formed, until the drain/reclaimer > > > > thread frees them. > > > > > > How many pages are talking about - ball park? 100s, 1000s? > > > > Under normal operation, a couple of pages per CPU, which would make > > preallocation entirely reasonable. Except that if someone does something > > that floods RCU callbacks (close(open) in a tight userspace loop, for but > > one example), then 2000 per CPU might not be enough, which on a 64-CPU > > system comes to about 500MB. This is beyond excessive for preallocation > > on the systems I am familiar with. > > > > And the flooding case is where you most want the reclamation to be > > efficient, and thus where you want the pages. > > I am not sure the page allocator would help you with this scenario > unless you are on very large machines. Pagesets scale with the available > memory and percpu_pagelist_fraction sysctl (have a look at > pageset_set_high_and_batch). It is roughly 1000th of the zone size for > each zone. You can check that in /proc/vmstat (my 8G machine) Small systems might have ~64G. The medium-sized systems might have ~250G. There are a few big ones that might have 1.5T. None of the /proc/vmstat files from those machines contain anything resembling the list below, though. > Node 0, zone DMA > Not interesting at all > Node 0, zone DMA32 > pagesets > cpu: 0 > count: 242 > high: 378 > batch: 63 > cpu: 1 > count: 355 > high: 378 > batch: 63 > cpu: 2 > count: 359 > high: 378 > batch: 63 > cpu: 3 > count: 366 > high: 378 > batch: 63 > Node 0, zone Normal > pagesets > cpu: 0 > count: 359 > high: 378 > batch: 63 > cpu: 1 > count: 241 > high: 378 > batch: 63 > cpu: 2 > count: 297 > high: 378 > batch: 63 > cpu: 3 > count: 227 > high: 378 > batch: 63 > > Besides that do you need to be per-cpu? Having 1000 pages available and > managed under your raw spinlock should be good enough already no? It needs to be almost entirely per-CPU for performance reasons. 
Plus a user could do a tight close(open()) loop on each CPU. > > This of course raises the question of how much memory the lockless caches > > contain, but fortunately these RCU callback flooding scenarios also > > involve process-context allocation of the memory that is being passed > > to kfree_rcu(). That allocation should keep the lockless caches from > > going empty in the common case, correct? > > Yes, those are refilled both on the allocation/free paths. But you > cannot really rely on that to happen early enough. So the really ugly scenarios with the tight loops normally allocate something and immediately either call_rcu() or kfree_rcu() it. But you are right, someone doing "rm -rf" on a large file tree with lots of small files might not be doing that many allocations. > Do you happen to have any numbers that would show the typical usage > and how often the slow path has to be taken becase pcp lists are > depleted? In other words even if we provide a functionality to give > completely lockless way to allocate memory how useful that is? Not yet, but let's see what we can do. Thanx, Paul
On Fri, Aug 14, 2020 at 07:49:24PM +0200, Peter Zijlstra wrote: > On Fri, Aug 14, 2020 at 09:11:06AM -0700, Paul E. McKenney wrote: > > Just to make sure we are talking about the same thing, please see below > > for an untested patch that illustrates how I was interpreting your words. > > Was this what you had in mind? > > No, definitely not. > > Also, since we used to be able to use call_rcu() _everywhere_, including > under zone->lock, how's that supposed to work when you're calling the > page allocator from it? Indeed, that is exactly the problem we are trying to solve. Thanx, Paul
On Fri, Aug 14, 2020 at 06:19:04PM +0200, peterz@infradead.org wrote: > On Fri, Aug 14, 2020 at 07:14:25AM -0700, Paul E. McKenney wrote: > > > Doing this to kfree_rcu() is the first step. We will also be doing this > > to call_rcu(), which has some long-standing invocations from various > > raw contexts, including hardirq handler. > > Most hardirq handler are not raw on RT due to threaded interrupts. > Lockdep knows about this. Understood. But not all hardirq handlers are threaded. > > > > 4) As kfree_rcu() can be used from any context except NMI and RT > > > > relies on it we ran into a circular dependency problem. > > > > > > Well, which actual usage sites are under a raw spinlock? Most of the > > > ones I could find are not. > > > > There are some on their way in, but this same optimization will be applied > > to call_rcu(), and there are no shortage of such call sites in that case. > > And these call sites have been around for a very long time. > > Luckily there is lockdep to help you find the ones that need converting > to raw_call_rcu() :-) I already gave you my views on raw_call_rcu(). :-/ > > > > Clearly the simplest solution but not Pauls favourite and > > > > unfortunately he has a good reason. > > > > > > Which isn't actually stated anywhere I suppose ? > > > > Several times, but why not one more? ;-) > > > > In CONFIG_PREEMPT_NONE=y kernels, which are heavily used, and for which > > the proposed kfree_rcu() and later call_rcu() optimizations are quite > > important, there is no way to tell at runtime whether or you are in > > atomic raw context. > > CONFIG_PREEMPT_NONE has nothing what so ever to do with any of this. On the contrary, it has everything to do with this. It is the environment in which we cannot use preemptible() to dynamically determine whether or not it is safe to invoke the memory allocator. > > > > > 2. Adding a GFP_ flag. > > > > > > > > Michal does not like the restriction for !RT kernels and tries to > > > > avoid the introduction of a new allocation mode. > > > > > > Like above, I tend to be with Michal on this, just wrap the actual > > > allocation in CONFIG_PREEMPT_RT, the code needs to handle a NULL pointer > > > there anyway. > > > > That disables the optimization in the CONFIG_PREEMPT_NONE=y case, > > where it really is needed. > > No, it disables it for CONFIG_PREEMPT_RT. Except that lockdep still yells at us for CONFIG_PREEMPT_NONE=y, and again, in the CONFIG_PREEMPT_NONE=y we cannot use preemptible() to dynamically determine whether it is safe to invoke the memory allocator. > > I would be OK with either. In CONFIG_PREEMPT_NONE=n kernels, the > > kfree_rcu() code could use preemptible() to determine whether it was safe > > to invoke the allocator. The code in kfree_rcu() might look like this: > > > > mem = NULL; > > if (IS_ENABLED(CONFIG_PREEMPT_NONE) || preemptible()) > > mem = __get_free_page(...); > > > > Is your point is that the usual mistakes would then be caught by the > > usual testing on CONFIG_PREEMPT_NONE=n kernels? > > mem = NULL; > #if !defined(CONFIG_PREEMPT_RT) && !defined(CONFIG_PROVE_LOCKING) > mem = __get_free_page(...) > #endif > if (!mem) > > But I _really_ would much rather have raw_kfree_rcu() and raw_call_rcu() > variants for the few places that actually need it. Until people start propagating them all over because they happen to unconditionally stop lockdep from complaining. > > > > These are not seperate of each other as #3 requires #4. 
The most > > > > horrible solution IMO from a technical POV as it proliferates > > > > inconsistency for no good reaosn. > > > > > > > > Aside of that it'd be solving a problem which does not exist simply > > > > because kfree_rcu() does not depend on it and we really don't want to > > > > set precedence and encourage the (ab)use of this in any way. > > > > > > My preferred solution is 1, if you want to use an allocator, you simply > > > don't get to play under raw_spinlock_t. And from a quick grep, most > > > kfree_rcu() users are not under raw_spinlock_t context. > > > > There is at least one on its way in, but more to the point, we will > > need to apply this same optimization to call_rcu(), which is used in > > There is no need, call_rcu() works perfectly fine today, thank you. > You might want to, but that's an entirely different thing. Sorry, but no. The call_rcu() callback invocation currently takes a cache miss on each step through the rcu_head chain. > > raw atomic context, including from hardirq handlers. > > Threaded IRQs. There really is very little code that is 'raw' on RT. Except that we also need to run non-RT kernels. Thanx, Paul
On Fri, Aug 14 2020 at 11:02, Paul E. McKenney wrote: > On Fri, Aug 14, 2020 at 07:49:24PM +0200, Peter Zijlstra wrote: >> On Fri, Aug 14, 2020 at 09:11:06AM -0700, Paul E. McKenney wrote: >> > Just to make sure we are talking about the same thing, please see below >> > for an untested patch that illustrates how I was interpreting your words. >> > Was this what you had in mind? >> >> No, definitely not. >> >> Also, since we used to be able to use call_rcu() _everywhere_, including >> under zone->lock, how's that working with you calling the >> page-allocating from it? > > Indeed, that is exactly the problem we are trying to solve. Wait a moment. Why are we discussing RT induced raw non raw lock ordering at all? Whatever kernel you variant you look at this is not working: lock(zone) call_rcu() lock(zone) It's a simple recursive dead lock, nothing else. And that enforces the GFP_NOLOCK allocation mode or some other solution unless you make a new rule that calling call_rcu() is forbidden while holding zone lock or any other lock which might be nested inside the GFP_NOWAIT zone::lock held region. Thanks, tglx
On Fri, Aug 14, 2020 at 09:33:47PM +0200, Thomas Gleixner wrote: > On Fri, Aug 14 2020 at 11:02, Paul E. McKenney wrote: > > On Fri, Aug 14, 2020 at 07:49:24PM +0200, Peter Zijlstra wrote: > >> On Fri, Aug 14, 2020 at 09:11:06AM -0700, Paul E. McKenney wrote: > >> > Just to make sure we are talking about the same thing, please see below > >> > for an untested patch that illustrates how I was interpreting your words. > >> > Was this what you had in mind? > >> > >> No, definitely not. > >> > >> Also, since we used to be able to use call_rcu() _everywhere_, including > >> under zone->lock, how's that working with you calling the > >> page-allocating from it? > > > > Indeed, that is exactly the problem we are trying to solve. > > Wait a moment. Why are we discussing RT induced raw non raw lock > ordering at all? Because we like to argue? (Sorry, couldn't resist.) > Whatever kernel you variant you look at this is not working: > > lock(zone) call_rcu() lock(zone) > > It's a simple recursive dead lock, nothing else. You are of course absolutely correct. > And that enforces the GFP_NOLOCK allocation mode or some other solution > unless you make a new rule that calling call_rcu() is forbidden while > holding zone lock or any other lock which might be nested inside the > GFP_NOWAIT zone::lock held region. Again, you are correct. Maybe the forecasted weekend heat will cause my brain to hallucinate a better solution, but in the meantime, the GFP_NOLOCK approach looks good from this end. Thanx, Paul
On Fri, Aug 14, 2020 at 01:41:40PM -0700, Paul E. McKenney wrote: > > And that enforces the GFP_NOLOCK allocation mode or some other solution > > unless you make a new rule that calling call_rcu() is forbidden while > > holding zone lock or any other lock which might be nested inside the > > GFP_NOWAIT zone::lock held region. > > Again, you are correct. Maybe the forecasted weekend heat will cause > my brain to hallucinate a better solution, but in the meantime, the > GFP_NOLOCK approach looks good from this end. So I hate __GFP_NO_LOCKS for a whole number of reasons: - it should be called __GFP_LOCKLESS if anything - it sprinkles a bunch of ugly branches around the allocator fast path - it only works for order==0 Combined I really odn't think this should be a GFP flag. How about a special purpose allocation function, something like so.. --- diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 901a21f61d68..cdec9c99fba7 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4875,6 +4875,47 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid, } EXPORT_SYMBOL(__alloc_pages_nodemask); +struct page *__rmqueue_lockless(struct zone *zone, struct per_cpu_pages *pcp) +{ + struct list_head *list; + struct page *page; + int migratetype; + + for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++) { + list = &pcp->list[migratetype]; + page = list_first_entry_or_null(list, struct page, lru); + if (page && check_new_pcp(page)) { + list_del(&page->lru); + pcp->count--; + return page; + } + } + + return NULL; +} + +struct page *__alloc_page_lockless(void) +{ + struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL); + struct per_cpu_pages *pcp; + struct page *page = NULL; + unsigned long flags; + struct zoneref *z; + struct zone *zone; + + for_each_zone_zonelist(zone, z, zonelist, ZONE_NORMAL) { + local_irq_save(flags); + pcp = &this_cpu_ptr(zone->pageset)->pcp; + page = __rmqueue_lockless(zone, pcp); + local_irq_restore(flags); + + if (page) + break; + } + + return page; +} + /* * Common helper functions. Never use with __GFP_HIGHMEM because the returned * address cannot represent highmem pages. Use alloc_pages and then kmap if
Paul, On Fri, Aug 14 2020 at 11:01, Paul E. McKenney wrote: > On Fri, Aug 14, 2020 at 04:06:04PM +0200, Michal Hocko wrote: >> > > > Vlastimil raised same question earlier, i answered, but let me answer again: >> > > > >> > > > It is hard to achieve because the logic does not stick to certain static test >> > > > case, i.e. it depends on how heavily kfree_rcu(single/double) are used. Based >> > > > on that, "how heavily" - number of pages are formed, until the drain/reclaimer >> > > > thread frees them. >> > > >> > > How many pages are talking about - ball park? 100s, 1000s? >> > >> > Under normal operation, a couple of pages per CPU, which would make >> > preallocation entirely reasonable. Except that if someone does something >> > that floods RCU callbacks (close(open) in a tight userspace loop, for but >> > one example), then 2000 per CPU might not be enough, which on a 64-CPU >> > system comes to about 500MB. This is beyond excessive for preallocation >> > on the systems I am familiar with. >> > >> > And the flooding case is where you most want the reclamation to be >> > efficient, and thus where you want the pages. As we now established that taking zone lock is impossible at all independent of raw/non-raw ordering and independent of RT/PREEMPT configs, can we just take a step back and look at the problem from scratch again? As a matter of fact I assume^Wdeclare that removing struct rcu_head which provides a fallback is not an option at all. I know that you want to, but it wont work ever. Dream on, but as we agreed on recently there is this thing called reality which ruins everything. For normal operations a couple of pages which can be preallocated are enough. What you are concerned of is the case where you run out of pointer storage space. There are two reasons why that can happen: 1) RCU call flooding 2) RCU not being able to run and mop up the backlog #1 is observable by looking at the remaining storage space and the RCU call frequency #2 is uninteresting because it's caused by RCU being stalled / delayed e.g. by a runaway of some sorts or a plain RCU usage bug. Allocating more memory in that case does not solve or improve anything. So the interesting case is #1. Which means we need to look at the potential sources of the flooding: 1) User space via syscalls, e.g. open/close 2) Kernel thread 3) Softirq 4) Device interrupt 5) System interrupts, deep atomic context, NMI ... #1 trivial fix is to force switching to an high prio thread or a soft interrupt which does the allocation #2 Similar to #1 unless that thread loops with interrupts, softirqs or preemption disabled. If that's the case then running out of RCU storage space is the least of your worries. #3 Similar to #2. The obvious candidates (e.g. NET) for monopolizing a CPU have loop limits in place already. If there is a bug which fails to care about the limit, why would RCU care and allocate more memory? #4 Similar to #3. If the interrupt handler loops forever or if the interrupt is a runaway which prevents task/softirq processing then RCU free performance is the least of your worries. #5 Clearly a bug and making RCU accomodate for that is beyond silly. So if call_rcu() detects that the remaining storage space for pointers goes below the critical point or if it observes high frequency calls then it simply should force a soft interrupt which does the allocation. Allocating from softirq context obviously without holding the raw lock which is used inside call_rcu() is safe on all configurations. 
If call_rcu() is forced to use the fallback for a few calls until this happens then that's not the end of the world. It is not going to be a problem ever for the most obvious issue #1, user space madness, because that case cannot delay the softirq processing unless there is a kernel bug which makes again RCU free performance irrelevant. So this will cure the problem for the most interesting case #1 and handle all sane variants of the other possible flooding sources. The other insane reasons do not justify any attempt to increase RCU performance at all. Watching the remaining storage space is good enough IMO. It clearly covers #1 and for all others the occasional fallback wont hurt. If it really matters for any case > #1 then doing a frequency based prediction is a straight forward optimization. As usual I might be missing something, but as this is the second day with reasonable temperatures here that would be caused by my ignorance and not be excusable by brain usage outside of specified temperature range. Thanks, tglx
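One concrete shape the "force a soft interrupt / high prio thread which does the allocation" idea could take is sketched below: irq_work to get out of the raw-locked context, then a regular work item for the actual GFP_KERNEL refill. This is only one possible variant, and every name in it is made up:

<snip>
#include <linux/gfp.h>
#include <linux/init.h>
#include <linux/irq_work.h>
#include <linux/workqueue.h>

static void krc_refill_fn(struct work_struct *unused)
{
	/* Process context: GFP_KERNEL, zone->lock and sleeping are all fine. */
	/* ... refill the per-CPU pointer-page cache here ... */
}
static DECLARE_WORK(krc_refill_work, krc_refill_fn);

static void krc_irq_work_fn(struct irq_work *unused)
{
	/* Hard-irq (or, on RT, irq_work thread) context: may queue work. */
	schedule_work(&krc_refill_work);
}
static struct irq_work krc_irq_work;

/*
 * Called from call_rcu()/kfree_rcu() under the raw lock when the
 * remaining pointer storage drops below a threshold.  irq_work_queue()
 * is usable from any context and touches neither the allocator nor any
 * sleeping lock.
 */
static void krc_note_low_space(void)
{
	irq_work_queue(&krc_irq_work);
}

static int __init krc_irq_work_setup(void)
{
	init_irq_work(&krc_irq_work, krc_irq_work_fn);
	return 0;
}
early_initcall(krc_irq_work_setup);
<snip>

Because irq_work_queue() works even with scheduler locks held, this variant would also cover the call_rcu()-under-scheduler-locks corner mentioned earlier; whether a softirq or a high-prio thread is the better vehicle for the refill itself is left open here.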
On Fri, Aug 14, 2020 at 11:52:06PM +0200, Peter Zijlstra wrote: > On Fri, Aug 14, 2020 at 01:41:40PM -0700, Paul E. McKenney wrote: > > > And that enforces the GFP_NOLOCK allocation mode or some other solution > > > unless you make a new rule that calling call_rcu() is forbidden while > > > holding zone lock or any other lock which might be nested inside the > > > GFP_NOWAIT zone::lock held region. > > > > Again, you are correct. Maybe the forecasted weekend heat will cause > > my brain to hallucinate a better solution, but in the meantime, the > > GFP_NOLOCK approach looks good from this end. > > So I hate __GFP_NO_LOCKS for a whole number of reasons: > > - it should be called __GFP_LOCKLESS if anything > - it sprinkles a bunch of ugly branches around the allocator fast path > - it only works for order==0 > > Combined I really odn't think this should be a GFP flag. How about a > special purpose allocation function, something like so.. This looks entirely reasonable to me! Thanx, Paul > --- > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 901a21f61d68..cdec9c99fba7 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4875,6 +4875,47 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid, > } > EXPORT_SYMBOL(__alloc_pages_nodemask); > > +struct page *__rmqueue_lockless(struct zone *zone, struct per_cpu_pages *pcp) > +{ > + struct list_head *list; > + struct page *page; > + int migratetype; > + > + for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++) { > + list = &pcp->list[migratetype]; > + page = list_first_entry_or_null(list, struct page, lru); > + if (page && check_new_pcp(page)) { > + list_del(&page->lru); > + pcp->count--; > + return page; > + } > + } > + > + return NULL; > +} > + > +struct page *__alloc_page_lockless(void) > +{ > + struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL); > + struct per_cpu_pages *pcp; > + struct page *page = NULL; > + unsigned long flags; > + struct zoneref *z; > + struct zone *zone; > + > + for_each_zone_zonelist(zone, z, zonelist, ZONE_NORMAL) { > + local_irq_save(flags); > + pcp = &this_cpu_ptr(zone->pageset)->pcp; > + page = __rmqueue_lockless(zone, pcp); > + local_irq_restore(flags); > + > + if (page) > + break; > + } > + > + return page; > +} > + > /* > * Common helper functions. Never use with __GFP_HIGHMEM because the returned > * address cannot represent highmem pages. Use alloc_pages and then kmap if
On Fri, Aug 14 2020 at 23:52, Peter Zijlstra wrote: > On Fri, Aug 14, 2020 at 01:41:40PM -0700, Paul E. McKenney wrote: >> > And that enforces the GFP_NOLOCK allocation mode or some other solution >> > unless you make a new rule that calling call_rcu() is forbidden while >> > holding zone lock or any other lock which might be nested inside the >> > GFP_NOWAIT zone::lock held region. >> >> Again, you are correct. Maybe the forecasted weekend heat will cause >> my brain to hallucinate a better solution, but in the meantime, the >> GFP_NOLOCK approach looks good from this end. > > So I hate __GFP_NO_LOCKS for a whole number of reasons: > > - it should be called __GFP_LOCKLESS if anything > - it sprinkles a bunch of ugly branches around the allocator fast path > - it only works for order==0 > > Combined I really odn't think this should be a GFP flag. How about a > special purpose allocation function, something like so.. No. No. No. It's not requried at all after the context got set straight and my brain started working again. There is no need to provide such a thing which tries to "optimize" unfixable problems and as a consequence makes other people use it for the completely wrong reasons all over the place. We have plenty of that stuff already. No need for more ... Thanks, tglx
On Sat, Aug 15, 2020 at 01:14:53AM +0200, Thomas Gleixner wrote: > Paul, > > On Fri, Aug 14 2020 at 11:01, Paul E. McKenney wrote: > > On Fri, Aug 14, 2020 at 04:06:04PM +0200, Michal Hocko wrote: > >> > > > Vlastimil raised same question earlier, i answered, but let me answer again: > >> > > > > >> > > > It is hard to achieve because the logic does not stick to certain static test > >> > > > case, i.e. it depends on how heavily kfree_rcu(single/double) are used. Based > >> > > > on that, "how heavily" - number of pages are formed, until the drain/reclaimer > >> > > > thread frees them. > >> > > > >> > > How many pages are talking about - ball park? 100s, 1000s? > >> > > >> > Under normal operation, a couple of pages per CPU, which would make > >> > preallocation entirely reasonable. Except that if someone does something > >> > that floods RCU callbacks (close(open) in a tight userspace loop, for but > >> > one example), then 2000 per CPU might not be enough, which on a 64-CPU > >> > system comes to about 500MB. This is beyond excessive for preallocation > >> > on the systems I am familiar with. > >> > > >> > And the flooding case is where you most want the reclamation to be > >> > efficient, and thus where you want the pages. > > As we now established that taking zone lock is impossible at all > independent of raw/non-raw ordering and independent of RT/PREEMPT > configs, can we just take a step back and look at the problem from > scratch again? Can't hurt! (Famous last words...) > As a matter of fact I assume^Wdeclare that removing struct rcu_head which > provides a fallback is not an option at all. I know that you want to, > but it wont work ever. Dream on, but as we agreed on recently there is > this thing called reality which ruins everything. For call_rcu(), agreed. For kfree_rcu(), we know what the callback is going to do, plus single-argument kfree_rcu() can only be invoked from sleepable context. (If you want to kfree_rcu() from non-sleepable context, that will cost you an rcu_head in the data structure being freed.) So if the single-argument kfree_rcu() case gets hit with a memory-allocation failure, it can fall back to waiting for a grace period and doing the free. Of course, grace-period waits have horrible latency, but under OOM life is hard. If this becomes a problem in non-OOM situations due to the lockless caches becoming empty, we will have to allocate memory if needed before acquiring the lock with the usual backout logic. Doing that means that we can let the allocator acquire locks and maybe even do a little bit of blocking, so that the inline grace-period-wait would only happen if the system was well and truly OOMed. > For normal operations a couple of pages which can be preallocated are > enough. What you are concerned of is the case where you run out of > pointer storage space. Agreed. > There are two reasons why that can happen: > > 1) RCU call flooding > 2) RCU not being able to run and mop up the backlog > > #1 is observable by looking at the remaining storage space and the RCU > call frequency > > #2 is uninteresting because it's caused by RCU being stalled / delayed > e.g. by a runaway of some sorts or a plain RCU usage bug. > > Allocating more memory in that case does not solve or improve anything. Yes, #2 is instead RCU CPU stall warning territory. If this becomes a problem, one approach is to skip the page-of-pointers allocation if the grace period is more than (say) one second old. 
If the grace period never completes, OOM is unavoidable, but this is a way of putting it off for a bit. > So the interesting case is #1. Which means we need to look at the > potential sources of the flooding: > > 1) User space via syscalls, e.g. open/close > 2) Kernel thread > 3) Softirq > 4) Device interrupt > 5) System interrupts, deep atomic context, NMI ... > > #1 trivial fix is to force switching to an high prio thread or a soft > interrupt which does the allocation > > #2 Similar to #1 unless that thread loops with interrupts, softirqs or > preemption disabled. If that's the case then running out of RCU > storage space is the least of your worries. > > #3 Similar to #2. The obvious candidates (e.g. NET) for monopolizing a > CPU have loop limits in place already. If there is a bug which fails > to care about the limit, why would RCU care and allocate more memory? > > #4 Similar to #3. If the interrupt handler loops forever or if the > interrupt is a runaway which prevents task/softirq processing then > RCU free performance is the least of your worries. > > #5 Clearly a bug and making RCU accomodate for that is beyond silly. > > So if call_rcu() detects that the remaining storage space for pointers > goes below the critical point or if it observes high frequency calls > then it simply should force a soft interrupt which does the allocation. Unless call_rcu() has been invoked with scheduler locks held. But eventually call_rcu() should be invoked with interrupts enabled, and at that point it would be safe to raise_softirq(), wake_up(), or whatever. > Allocating from softirq context obviously without holding the raw lock > which is used inside call_rcu() is safe on all configurations. Once we get there, agreed. > If call_rcu() is forced to use the fallback for a few calls until this > happens then that's not the end of the world. It is not going to be a > problem ever for the most obvious issue #1, user space madness, because > that case cannot delay the softirq processing unless there is a kernel > bug which makes again RCU free performance irrelevant. > > So this will cure the problem for the most interesting case #1 and > handle all sane variants of the other possible flooding sources. > > The other insane reasons do not justify any attempt to increase RCU > performance at all. > > Watching the remaining storage space is good enough IMO. It clearly > covers #1 and for all others the occasional fallback wont hurt. If it > really matters for any case > #1 then doing a frequency based prediction > is a straight forward optimization. > > As usual I might be missing something, but as this is the second day > with reasonable temperatures here that would be caused by my ignorance > and not be excusable by brain usage outside of specified temperature > range. It is at the very least a new approach, so either way thank you for that! ;-) Thanx, Paul
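The last-resort path described above for the sleepable single-argument case boils down to something like the following (hypothetical helper name; real code would also need the usual debug-objects and accounting handling):

<snip>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/rcupdate.h>

/*
 * Fallback when no pointer-storage page can be obtained and the object
 * carries no rcu_head: wait for a grace period inline, then free.
 * Only legal from sleepable context, which the single-argument variant
 * guarantees by contract.
 */
static void kvfree_rcu_inline_fallback(void *ptr)
{
	might_sleep();
	synchronize_rcu();	/* horrible latency, but this is the near-OOM path */
	kvfree(ptr);
}
<snip>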
Paul, On Fri, Aug 14 2020 at 16:41, Paul E. McKenney wrote: > On Sat, Aug 15, 2020 at 01:14:53AM +0200, Thomas Gleixner wrote: >> As a matter of fact I assume^Wdeclare that removing struct rcu_head which >> provides a fallback is not an option at all. I know that you want to, >> but it wont work ever. Dream on, but as we agreed on recently there is >> this thing called reality which ruins everything. > > For call_rcu(), agreed. For kfree_rcu(), we know what the callback is > going to do, plus single-argument kfree_rcu() can only be invoked from > sleepable context. (If you want to kfree_rcu() from non-sleepable > context, that will cost you an rcu_head in the data structure being > freed.) kfree_rcu() as of today is just a conveniance wrapper around call_rcu(obj, rcu) which can be called from any context and it still takes TWO arguments. Icepack? So if you come up with a new kfree_rcu_magic(void *obj) single argument variant which can only be called from sleepable contexts then this does not require any of the raw lock vs. non raw hacks at all because you can simply allocate without holding the raw lock in the rare case that you run out of storage space. With four 4k pages preallocated per CPU that's every 2048 invocations per CPU on 64bit. So if you run into that situation then you drop the lock and then it's racy because you might be preempted or migrated after dropping the lock and you might have done a useless allocation, but that does not justify having a special allocator just for that? You have an extra page, so what? To prevent subsequent callers to add to the allocation race you simply can let them wait on the first allocating attempt to finish That avoids more pointless allocations and as a side effect prevents all of them to create more pressure by continuing their open/close loop naturally without extra work. > So if the single-argument kfree_rcu() case gets hit with a > memory-allocation failure, it can fall back to waiting for a grace > period and doing the free. Of course, grace-period waits have horrible > latency, but under OOM life is hard. If this becomes a problem in > non-OOM situations due to the lockless caches becoming empty, we will > have to allocate memory if needed before acquiring the lock with the > usual backout logic. Doing that means that we can let the allocator > acquire locks and maybe even do a little bit of blocking, so that the > inline grace-period-wait would only happen if the system was well and > truly OOMed. No. It dropped the rcu internal lock and does a regular GFP_KENRNEL allocation which waits for the page to become available. Which is a good thing in the open/close scenario because it throttles the offender. >> For normal operations a couple of pages which can be preallocated are >> enough. What you are concerned of is the case where you run out of >> pointer storage space. > > Agreed. > >> There are two reasons why that can happen: >> >> 1) RCU call flooding >> 2) RCU not being able to run and mop up the backlog >> >> #1 is observable by looking at the remaining storage space and the RCU >> call frequency >> >> #2 is uninteresting because it's caused by RCU being stalled / delayed >> e.g. by a runaway of some sorts or a plain RCU usage bug. >> >> Allocating more memory in that case does not solve or improve anything. > > Yes, #2 is instead RCU CPU stall warning territory. > > If this becomes a problem, one approach is to skip the page-of-pointers > allocation if the grace period is more than (say) one second old. 
If > the grace period never completes, OOM is unavoidable, but this is a way > of putting it off for a bit. Don't even think about optimizing your new thing for #2. It's a pointless exercise. If the task which runs into the 'can't allocate' case then is sleeps and waits. End of story. >> So the interesting case is #1. Which means we need to look at the >> potential sources of the flooding: >> >> 1) User space via syscalls, e.g. open/close >> 2) Kernel thread >> 3) Softirq >> 4) Device interrupt >> 5) System interrupts, deep atomic context, NMI ... >> >> #1 trivial fix is to force switching to an high prio thread or a soft >> interrupt which does the allocation >> >> #2 Similar to #1 unless that thread loops with interrupts, softirqs or >> preemption disabled. If that's the case then running out of RCU >> storage space is the least of your worries. >> >> #3 Similar to #2. The obvious candidates (e.g. NET) for monopolizing a >> CPU have loop limits in place already. If there is a bug which fails >> to care about the limit, why would RCU care and allocate more memory? >> >> #4 Similar to #3. If the interrupt handler loops forever or if the >> interrupt is a runaway which prevents task/softirq processing then >> RCU free performance is the least of your worries. >> >> #5 Clearly a bug and making RCU accomodate for that is beyond silly. >> >> So if call_rcu() detects that the remaining storage space for pointers >> goes below the critical point or if it observes high frequency calls >> then it simply should force a soft interrupt which does the allocation. > > Unless call_rcu() has been invoked with scheduler locks held. But > eventually call_rcu() should be invoked with interrupts enabled, and at > that point it would be safe to raise_softirq(), wake_up(), or > whatever. If this atomic context corner case is hit within a problematic context then we talk about the RCU of today and not about the future single argument thing. And that oldschool RCU has a fallback. We are talking about pressure corner cases and you really want to squeeze out the last cache miss? What for? If there is pressure then these cache misses are irrelevant. Thanks, tglx
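To make the scheme Thomas sketches above more concrete, here is a minimal illustration of the "allocate without holding the raw lock" idea for a sleepable, single-argument kvfree_rcu(). All names here (struct krc_cpu, block_has_room(), attach_block(), block_append(), defer_to_fallback(), struct pointer_block) are invented for this sketch and do not correspond to the actual kernel code; it only shows the shape of the approach under discussion.

<snip>
#include <linux/gfp.h>
#include <linux/spinlock.h>

/* Invented, simplified per-CPU state; not the real kfree_rcu() structures. */
struct krc_cpu {
	raw_spinlock_t lock;
	struct pointer_block *head;	/* current page of pointers */
};

static void krc_queue_ptr(struct krc_cpu *krc, void *ptr)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&krc->lock, flags);
	if (!block_has_room(krc->head)) {
		struct pointer_block *bnode;

		/*
		 * Out of pointer storage. Drop the raw lock and allocate
		 * from this sleepable context; a GFP_KERNEL allocation may
		 * block, which is what throttles an open/close flooder.
		 */
		raw_spin_unlock_irqrestore(&krc->lock, flags);
		bnode = (struct pointer_block *)__get_free_page(GFP_KERNEL);
		raw_spin_lock_irqsave(&krc->lock, flags);

		/*
		 * We may have been preempted or migrated meanwhile, and some
		 * other caller may have refilled already; then this page is
		 * merely a spare. Subsequent callers could instead wait for
		 * the first allocation attempt to finish, as suggested above.
		 */
		if (bnode)
			attach_block(krc, bnode);
	}

	if (block_has_room(krc->head))
		block_append(krc->head, ptr);
	else
		defer_to_fallback(ptr);	/* e.g. the rcu_head / synchronize_rcu() path */

	raw_spin_unlock_irqrestore(&krc->lock, flags);
}
<snip>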
On Sat, Aug 15, 2020 at 02:43:51AM +0200, Thomas Gleixner wrote: > Paul, > > On Fri, Aug 14 2020 at 16:41, Paul E. McKenney wrote: > > On Sat, Aug 15, 2020 at 01:14:53AM +0200, Thomas Gleixner wrote: > >> As a matter of fact I assume^Wdeclare that removing struct rcu_head which > >> provides a fallback is not an option at all. I know that you want to, > >> but it wont work ever. Dream on, but as we agreed on recently there is > >> this thing called reality which ruins everything. > > > > For call_rcu(), agreed. For kfree_rcu(), we know what the callback is > > going to do, plus single-argument kfree_rcu() can only be invoked from > > sleepable context. (If you want to kfree_rcu() from non-sleepable > > context, that will cost you an rcu_head in the data structure being > > freed.) > > kfree_rcu() as of today is just a conveniance wrapper around > call_rcu(obj, rcu) which can be called from any context and it still > takes TWO arguments. > > Icepack? Indeed. Make that not kfree_rcu(), but rather kvfree_rcu(), which is in mainline. :-/ > So if you come up with a new kfree_rcu_magic(void *obj) single argument > variant which can only be called from sleepable contexts then this does > not require any of the raw lock vs. non raw hacks at all because you can > simply allocate without holding the raw lock in the rare case that you > run out of storage space. With four 4k pages preallocated per CPU that's > every 2048 invocations per CPU on 64bit. > > So if you run into that situation then you drop the lock and then it's > racy because you might be preempted or migrated after dropping the lock > and you might have done a useless allocation, but that does not justify > having a special allocator just for that? You have an extra page, so > what? > > To prevent subsequent callers to add to the allocation race you simply > can let them wait on the first allocating attempt to finish That avoids > more pointless allocations and as a side effect prevents all of them to > create more pressure by continuing their open/close loop naturally > without extra work. Agreed, as I said, it is the double-argument version that is the challenge. > > So if the single-argument kfree_rcu() case gets hit with a > > memory-allocation failure, it can fall back to waiting for a grace > > period and doing the free. Of course, grace-period waits have horrible > > latency, but under OOM life is hard. If this becomes a problem in > > non-OOM situations due to the lockless caches becoming empty, we will > > have to allocate memory if needed before acquiring the lock with the > > usual backout logic. Doing that means that we can let the allocator > > acquire locks and maybe even do a little bit of blocking, so that the > > inline grace-period-wait would only happen if the system was well and > > truly OOMed. > > No. It dropped the rcu internal lock and does a regular GFP_KENRNEL > allocation which waits for the page to become available. Which is a good > thing in the open/close scenario because it throttles the offender. Understood, especially that last. But it really doesn't want to be waiting in the memory allocator for more than a grace period. But that was hashed out quite some time ago, and there is a combination of GFP_* flags that achieves the right balance for the can-sleep situation. > >> For normal operations a couple of pages which can be preallocated are > >> enough. What you are concerned of is the case where you run out of > >> pointer storage space. > > > > Agreed. 
> > > >> There are two reasons why that can happen: > >> > >> 1) RCU call flooding > >> 2) RCU not being able to run and mop up the backlog > >> > >> #1 is observable by looking at the remaining storage space and the RCU > >> call frequency > >> > >> #2 is uninteresting because it's caused by RCU being stalled / delayed > >> e.g. by a runaway of some sorts or a plain RCU usage bug. > >> > >> Allocating more memory in that case does not solve or improve anything. > > > > Yes, #2 is instead RCU CPU stall warning territory. > > > > If this becomes a problem, one approach is to skip the page-of-pointers > > allocation if the grace period is more than (say) one second old. If > > the grace period never completes, OOM is unavoidable, but this is a way > > of putting it off for a bit. > > Don't even think about optimizing your new thing for #2. It's a > pointless exercise. If the task which runs into the 'can't allocate' > case then is sleeps and waits. End of story. Agreed, and hence my "If this becomes a problem". Until such time, it is pointless. For one thing, we don't yet know the failure mode. But it has been helpful for me to think a move or two ahead when playing against RCU, hence the remainder of my paragraph. > >> So the interesting case is #1. Which means we need to look at the > >> potential sources of the flooding: > >> > >> 1) User space via syscalls, e.g. open/close > >> 2) Kernel thread > >> 3) Softirq > >> 4) Device interrupt > >> 5) System interrupts, deep atomic context, NMI ... > >> > >> #1 trivial fix is to force switching to an high prio thread or a soft > >> interrupt which does the allocation > >> > >> #2 Similar to #1 unless that thread loops with interrupts, softirqs or > >> preemption disabled. If that's the case then running out of RCU > >> storage space is the least of your worries. > >> > >> #3 Similar to #2. The obvious candidates (e.g. NET) for monopolizing a > >> CPU have loop limits in place already. If there is a bug which fails > >> to care about the limit, why would RCU care and allocate more memory? > >> > >> #4 Similar to #3. If the interrupt handler loops forever or if the > >> interrupt is a runaway which prevents task/softirq processing then > >> RCU free performance is the least of your worries. > >> > >> #5 Clearly a bug and making RCU accomodate for that is beyond silly. > >> > >> So if call_rcu() detects that the remaining storage space for pointers > >> goes below the critical point or if it observes high frequency calls > >> then it simply should force a soft interrupt which does the allocation. > > > > Unless call_rcu() has been invoked with scheduler locks held. But > > eventually call_rcu() should be invoked with interrupts enabled, and at > > that point it would be safe to raise_softirq(), wake_up(), or > > whatever. > > If this atomic context corner case is hit within a problematic context > then we talk about the RCU of today and not about the future single > argument thing. And that oldschool RCU has a fallback. We are talking > about pressure corner cases and you really want to squeeze out the last > cache miss? What for? If there is pressure then these cache misses are > irrelevant. Of course. My point was instead that even this atomic corner case was likely to have escape routes in the form of occasional non-atomic calls, and that these could do the wakeups. Again, thank you. Thanx, Paul
On Sat, Aug 15, 2020 at 01:14:53AM +0200, Thomas Gleixner wrote: > > As a matter of fact I assume^Wdeclare that removing struct rcu_head which > provides a fallback is not an option at all. I know that you want to, > but it wont work ever. Dream on, but as we agreed on recently there is > this thing called reality which ruins everything. It never was going to work; freeing memory can never hard-rely on the success of allocating memory.
On Sat, Aug 15, 2020 at 01:14:53AM +0200, Thomas Gleixner wrote: > #1 trivial fix is to force switching to an high prio thread or a soft > interrupt which does the allocation Yeah, push the allocation out to another context. I did consider it, but why bother? Also, raising a softirq can't be done from every context; that's a whole new problem. You can do irq_work I suppose, but not all architectures support the self-IPI yet. All in all, it's just more complexity than the fairly trivial __alloc_page_lockless(). Whichever way around, we can't rely on the allocation.
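For reference, "pushing the allocation out to another context" via irq_work, as mentioned above, could look roughly like the sketch below: the atomic caller only queues an irq_work, the irq_work handler (hard-irq context) queues a regular work item, and the work item does the actual allocation from sleepable context. All krc_* names and the krc_stash_spare_page() helper are invented; this only illustrates the extra machinery Peter is referring to.

<snip>
#include <linux/irq_work.h>
#include <linux/workqueue.h>
#include <linux/gfp.h>

static void krc_refill_workfn(struct work_struct *work)
{
	/* Sleepable context: the page allocator may take its spinlock_t locks. */
	unsigned long page = __get_free_page(GFP_KERNEL | __GFP_NOWARN);

	if (page)
		krc_stash_spare_page((void *)page);	/* invented helper */
}
static DECLARE_WORK(krc_refill_work, krc_refill_workfn);

static void krc_refill_irqworkfn(struct irq_work *irqw)
{
	/* Hard-irq context: still no allocation here, only a safe queueing. */
	queue_work(system_unbound_wq, &krc_refill_work);
}
static struct irq_work krc_refill_iw;

static int __init krc_refill_init(void)
{
	init_irq_work(&krc_refill_iw, krc_refill_irqworkfn);
	return 0;
}
early_initcall(krc_refill_init);

/* Called from the atomic call_rcu()/kfree_rcu() path when storage runs low. */
static void krc_kick_refill(void)
{
	irq_work_queue(&krc_refill_iw);
}
<snip>

Even with this plumbing the refill is asynchronous, so the caller still needs a fallback for the window before the spare page arrives, which is the point of "we can't rely on the allocation" above.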
On Sat, Aug 15, 2020 at 10:27:54AM +0200, Peter Zijlstra wrote: > On Sat, Aug 15, 2020 at 01:14:53AM +0200, Thomas Gleixner wrote: > > > > As a matter of fact I assume^Wdeclare that removing struct rcu_head which > > provides a fallback is not an option at all. I know that you want to, > > but it wont work ever. Dream on, but as we agreed on recently there is > > this thing called reality which ruins everything. > > It never was going to work; freeing memory can never hard-rely on the > success of allocating memory. In neither case does the freeing of memory hard-rely on the success of allocating memory. This is because there is a fallback in both cases should allocation fail. Given an rcu_head structure, we use that, and accept the extra cache misses at callback-invocation time. Otherwise, without an rcu_head structure, the allocation parameters are carefully chosen to avoid indefinite sleeping, meaning that the allocation attempt either succeeds or fails within a reasonable amount of time. And upon failure we invoke synchronize_rcu(), then immediately free. Which is slow, but then again life is like that under OOM conditions. And yes, this means that the price of leaving the rcu_head structure out of the structure to be freed is that you must call kvfree_rcu() from a sleepable context. If you don't like being restricted to sleepable context, you can always supply the rcu_head structure. Thanx, Paul
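A rough sketch of the fallback ordering described above, for the hypothetical single-argument case only; the helper names (add_ptr_to_bulk_block(), refill_bulk_block()) and the exact GFP flags are assumptions for illustration, not the actual mainline implementation.

<snip>
static void kvfree_rcu_single_arg(void *ptr)
{
	/* The single-argument form may only be used from sleepable context. */
	might_sleep();

	/* Fast path: store the pointer in the current page of pointers. */
	if (add_ptr_to_bulk_block(ptr))		/* invented helper */
		return;

	/*
	 * Try to obtain a new page of pointers. The flags are assumed to be
	 * chosen so that the attempt fails in bounded time rather than
	 * sleeping indefinitely (e.g. no endless reclaim retries).
	 */
	if (refill_bulk_block(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN) &&
	    add_ptr_to_bulk_block(ptr))
		return;

	/* Last resort: wait for a grace period inline, then free. */
	synchronize_rcu();
	kvfree(ptr);
}
<snip>

With the two-argument form the embedded rcu_head is used instead, trading the allocation for the extra callback-time cache misses mentioned above.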
On Sat, Aug 15, 2020 at 10:42:50AM +0200, Peter Zijlstra wrote: > On Sat, Aug 15, 2020 at 01:14:53AM +0200, Thomas Gleixner wrote: > > > #1 trivial fix is to force switching to an high prio thread or a soft > > interrupt which does the allocation > > Yeah, push the alocation out to another context. I did consider it, but > why bother? > > Also, raising a softirq can't be done from every context, that's a whole > new problem. You can do irq_work I suppose, but not all architectures > support the self-IPI yet. > > All in all, it's just more complexity than the fairly trivial > __alloc_page_lockless(). > > Whichever way around, we can't rely on the allocation. One way to enforce that would be to put something like this at the beginning of the __alloc_page_lockless() function: if (IS_ENABLED(CONFIG_PROVE_LOCKING) && (prandom_u32() & 0xffff)) return NULL; I am sure that there is a better choice than CONFIG_PROVE_LOCKING. But whatever the choice, there is nothing quite like the occasional allocation failure during testing to convince people that such failure really can happen. Thanx, Paul
On Sat, Aug 15, 2020 at 07:18:39AM -0700, Paul E. McKenney wrote: > On Sat, Aug 15, 2020 at 10:42:50AM +0200, Peter Zijlstra wrote: > > On Sat, Aug 15, 2020 at 01:14:53AM +0200, Thomas Gleixner wrote: > > > > > #1 trivial fix is to force switching to an high prio thread or a soft > > > interrupt which does the allocation > > > > Yeah, push the alocation out to another context. I did consider it, but > > why bother? > > > > Also, raising a softirq can't be done from every context, that's a whole > > new problem. You can do irq_work I suppose, but not all architectures > > support the self-IPI yet. > > > > All in all, it's just more complexity than the fairly trivial > > __alloc_page_lockless(). > > > > Whichever way around, we can't rely on the allocation. > > One way to enforce that would be to put something like this at the > beginning of the __alloc_page_lockless() function: > > if (IS_ENABLED(CONFIG_PROVE_LOCKING) && (prandom_u32() & 0xffff)) > return NULL; Right, too early in the morning. :-/ This "slight" variation might include a bit of usefulness along with the convincing: if (IS_ENABLED(CONFIG_PROVE_LOCKING) && !(prandom_u32() & 0xff)) return NULL; Plus failing one out of 256 times is likely a better choice than once out of 65536 times, especially for the occasional caller of this function. Thanx, Paul > I am sure that there is a better choice than CONFIG_PROVE_LOCKING. > But whatever the choice, there is nothing quite like the occasional > allocation failure during testing to convince people that such failure > really can happen. > > Thanx, Paul
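Putting the suggested check in context: a debug-only failure injection at the top of the proposed (not mainline) __alloc_page_lockless() might look as follows. The function's signature and the pcp_list_try_get_page()/stash_pointer_page() helpers are invented for this sketch; the caller shows why the injection is useful, since every caller must already cope with NULL.

<snip>
struct page *__alloc_page_lockless(gfp_t gfp_mask)	/* proposed API, not mainline */
{
	/*
	 * On debug kernels, fail roughly 1 out of 256 attempts so that
	 * callers are forced to exercise their fallback paths in testing.
	 */
	if (IS_ENABLED(CONFIG_PROVE_LOCKING) && !(prandom_u32() & 0xff))
		return NULL;

	/* ... the real lockless pcp-list lookup would go here ... */
	return pcp_list_try_get_page(gfp_mask);	/* invented helper */
}

/* Caller side: the allocation is opportunistic and may always fail. */
static bool grow_pointer_cache(void)
{
	struct page *page = __alloc_page_lockless(__GFP_NOWARN);

	if (!page)
		return false;	/* stay on the rcu_head / synchronize_rcu() fallback */

	stash_pointer_page(page_address(page));	/* invented helper */
	return true;
}
<snip>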
On Fri, Aug 14, 2020 at 11:52:06PM +0200, Peter Zijlstra wrote: > On Fri, Aug 14, 2020 at 01:41:40PM -0700, Paul E. McKenney wrote: > > > And that enforces the GFP_NOLOCK allocation mode or some other solution > > > unless you make a new rule that calling call_rcu() is forbidden while > > > holding zone lock or any other lock which might be nested inside the > > > GFP_NOWAIT zone::lock held region. > > > > Again, you are correct. Maybe the forecasted weekend heat will cause > > my brain to hallucinate a better solution, but in the meantime, the > > GFP_NOLOCK approach looks good from this end. > > So I hate __GFP_NO_LOCKS for a whole number of reasons: > > - it should be called __GFP_LOCKLESS if anything > - it sprinkles a bunch of ugly branches around the allocator fast path > - it only works for order==0 > I had a look at your proposal, which is below. The underlying logic stays almost the same as what has been proposed by this RFC. I mean, I do not see any real difference: your approach does exactly the same thing - providing lock-less access to the per-cpu-lists. I am not talking about implementation details and further improvements, like also doing a search over the zonelist -> ZONE_NORMAL. Also, please note that the patch was tagged as RFC. > > Combined I really odn't think this should be a GFP flag. How about a > special purpose allocation function, something like so.. > I agree with you. Also, I think Michal does not like the GFP flag because it introduces more complexity to the page allocator. So, providing lock-less access as a separate function is a better approach, IMHO. Michal asked to provide some data regarding how many pages we need and how "lockless allocation" behaves when it comes to success vs failed scenarios. Please see below some results. The test case is a tight loop of 1 000 000 allocations doing kmalloc() and kfree_rcu(): sudo ./test_vmalloc.sh run_test_mask=2048 single_cpu_test=1 <snip> for (i = 0; i < 1 000 000; i++) { p = kmalloc(sizeof(*p), GFP_KERNEL); if (!p) return -1; p->array[0] = 'a'; kvfree_rcu(p, rcu); } <snip> wget ftp://vps418301.ovh.net/incoming/1000000_kmalloc_kfree_rcu_proc_percpu_pagelist_fractio_is_0.png wget ftp://vps418301.ovh.net/incoming/1000000_kmalloc_kfree_rcu_proc_percpu_pagelist_fractio_is_8.png Also I would like to underline that the kfree_rcu() reclaim logic can be improved further, making the drain logic more efficient when it comes to time, thus reducing the footprint and, as a result, the number of required pages. -- Vlad Rezki
On Mon 17-08-20 00:56:55, Uladzislau Rezki wrote: [...] > Michal asked to provide some data regarding how many pages we need and how > "lockless allocation" behaves when it comes to success vs failed scenarios. > > Please see below some results. The test case is a tight loop of 1 000 000 allocations > doing kmalloc() and kfree_rcu(): It would be nice to cover some more realistic workloads as well. > sudo ./test_vmalloc.sh run_test_mask=2048 single_cpu_test=1 > > <snip> > for (i = 0; i < 1 000 000; i++) { > p = kmalloc(sizeof(*p), GFP_KERNEL); > if (!p) > return -1; > > p->array[0] = 'a'; > kvfree_rcu(p, rcu); > } > <snip> > > wget ftp://vps418301.ovh.net/incoming/1000000_kmalloc_kfree_rcu_proc_percpu_pagelist_fractio_is_0.png If I understand this correctly then this means that failures happen very often because pcp pages are not recycled quickly enough. > wget ftp://vps418301.ovh.net/incoming/1000000_kmalloc_kfree_rcu_proc_percpu_pagelist_fractio_is_8.png 1/8 of the memory in pcp lists is quite large and likely not something used very often. Both these numbers just make me think that a dedicated pool of pages pre-allocated for RCU specifically might be a better solution. I still haven't read through that branch of the email thread though so there might be some pretty convincing arguments to not do that. > Also I would like to underline that the kfree_rcu() reclaim logic can be improved further, > making the drain logic more efficient when it comes to time, thus reducing the footprint > and, as a result, the number of required pages. > > -- > Vlad Rezki
On Sat 15-08-20 01:14:53, Thomas Gleixner wrote: [...] > For normal operations a couple of pages which can be preallocated are > enough. What you are concerned of is the case where you run out of > pointer storage space. > > There are two reasons why that can happen: > > 1) RCU call flooding > 2) RCU not being able to run and mop up the backlog > > #1 is observable by looking at the remaining storage space and the RCU > call frequency > > #2 is uninteresting because it's caused by RCU being stalled / delayed > e.g. by a runaway of some sorts or a plain RCU usage bug. > > Allocating more memory in that case does not solve or improve anything. > > So the interesting case is #1. Which means we need to look at the > potential sources of the flooding: > > 1) User space via syscalls, e.g. open/close > 2) Kernel thread > 3) Softirq > 4) Device interrupt > 5) System interrupts, deep atomic context, NMI ... > > #1 trivial fix is to force switching to an high prio thread or a soft > interrupt which does the allocation > > #2 Similar to #1 unless that thread loops with interrupts, softirqs or > preemption disabled. If that's the case then running out of RCU > storage space is the least of your worries. > > #3 Similar to #2. The obvious candidates (e.g. NET) for monopolizing a > CPU have loop limits in place already. If there is a bug which fails > to care about the limit, why would RCU care and allocate more memory? > > #4 Similar to #3. If the interrupt handler loops forever or if the > interrupt is a runaway which prevents task/softirq processing then > RCU free performance is the least of your worries. > > #5 Clearly a bug and making RCU accomodate for that is beyond silly. > > So if call_rcu() detects that the remaining storage space for pointers > goes below the critical point or if it observes high frequency calls > then it simply should force a soft interrupt which does the allocation. > > Allocating from softirq context obviously without holding the raw lock > which is used inside call_rcu() is safe on all configurations. > > If call_rcu() is forced to use the fallback for a few calls until this > happens then that's not the end of the world. It is not going to be a > problem ever for the most obvious issue #1, user space madness, because > that case cannot delay the softirq processing unless there is a kernel > bug which makes again RCU free performance irrelevant. Yes, this makes perfect sense to me! I really do not think we want to optimize for a userspace abuse to allow complete pcp allocator memory depletion (or a control in a worse case). Thanks!
On Mon, Aug 17, 2020 at 10:28:49AM +0200, Michal Hocko wrote: > On Mon 17-08-20 00:56:55, Uladzislau Rezki wrote: > [...] > > Michal asked to provide some data regarding how many pages we need and how > > "lockless allocation" behaves when it comes to success vs failed scenarios. > > > > Please see below some results. The test case is a tight loop of 1 000 000 allocations > > doing kmalloc() and kfree_rcu(): > > It would be nice to cover some more realistic workloads as well. > Hmm.. I tried to show a synthetic worst case, when a "flood" occurs. In such conditions we can get failures, which is expected, and we have a fallback mechanism for it. > > sudo ./test_vmalloc.sh run_test_mask=2048 single_cpu_test=1 > > > > <snip> > > for (i = 0; i < 1 000 000; i++) { > > p = kmalloc(sizeof(*p), GFP_KERNEL); > > if (!p) > > return -1; > > > > p->array[0] = 'a'; > > kvfree_rcu(p, rcu); > > } > > <snip> > > > > wget ftp://vps418301.ovh.net/incoming/1000000_kmalloc_kfree_rcu_proc_percpu_pagelist_fractio_is_0.png > > If I understand this correctly then this means that failures happen very > often because pcp pages are not recycled quickly enough. > Yep, it happens, and that is kind of the worst-case scenario (the flood one). Therefore we have a fallback, and it is expected. Also, I did not provide the number of pages used in the loop. On my test machine we need approximately 300-400 pages to cover that flood case until we recycle or return the pages back to the pcp. Please note, as I mentioned before, our drain part is not optimal for sure; it means that we can rework it a bit, making it more efficient. For example, when a flood occurs, instead of delaying the "reclaimer logic" thread, it can be placed on a run-queue right away. We can use a separate "flush workqueue" tagged with WQ_MEM_RECLAIM, raising the priority of the drain context. I.e. there is room for reducing such a page footprint. > > wget ftp://vps418301.ovh.net/incoming/1000000_kmalloc_kfree_rcu_proc_percpu_pagelist_fractio_is_8.png > > 1/8 of the memory in pcp lists is quite large and likely not something > used very often. > Just for illustration: when percpu_pagelist_fraction is set to 8, I do not see any page allocation fail in the single-CPU flood case. If I run such a flood simultaneously on all available CPUs, there will be failures for sure. > Both these numbers just make me think that a dedicated pool of pages > pre-allocated for RCU specifically might be a better solution. I still > haven't read through that branch of the email thread though so there > might be some pretty convincing arguments to not do that. > > > Also I would like to underline that the kfree_rcu() reclaim logic can be improved further, > > making the drain logic more efficient when it comes to time, thus reducing the footprint > > and, as a result, the number of required pages. > > > > -- > > Vlad Rezki > > -- > Michal Hocko > SUSE Labs
On Mon, Aug 17, 2020 at 10:28:49AM +0200, Michal Hocko wrote: > On Mon 17-08-20 00:56:55, Uladzislau Rezki wrote: [ . . . ] > > wget ftp://vps418301.ovh.net/incoming/1000000_kmalloc_kfree_rcu_proc_percpu_pagelist_fractio_is_8.png > > 1/8 of the memory in pcp lists is quite large and likely not something > used very often. > > Both these numbers just make me think that a dedicated pool of page > pre-allocated for RCU specifically might be a better solution. I still > haven't read through that branch of the email thread though so there > might be some pretty convincing argments to not do that. To avoid the problematic corner cases, we would need way more dedicated memory than is reasonable, as in well over one hundred pages per CPU. Sure, we could choose a smaller number, but then we are failing to defend against flooding, even on systems that have more than enough free memory to be able to do so. It would be better to live within what is available, taking the performance/robustness hit only if there isn't enough. My current belief is that we need a combination of (1) either the GFP_NOLOCK flag or Peter Zijlstra's patch and (2) Thomas Gleixner's delayed allocation approach. As you say and as Uladislau's measurements suggest, if we only have Peter's approach, although we could handle short floods just fine, we could not handle longer-term floods. And Thomas's approach is in fact designed to handle these longer-term floods. Except that if we have only Thomas's approach, then we have to handle the possibility that RCU_SOFTIRQ happened on the back of an interrupt that happened while the interrupted process was holding a memory-allocator lock. This means further deferral, such as going into a workqueue, which would allow better memory-allocator results, but which would also mean further delays from the time that memory was needed until the time that it was actually supplied. Delays that could be bridged by either a GFP_NOLOCK flag or Peter's patch. So again, it looks like this is not an either/or situation, but rather an need-both situation. I freely confess that one of my hopes almost 30 years ago was that a high-quality parallel memory allocator would eliminate the need for special-purpose allocators, but as has been noted several times on this thread, reality does not always seem to be compelled to take such hopes into account. Thanx, Paul
On Mon 17-08-20 15:28:03, Paul E. McKenney wrote: > On Mon, Aug 17, 2020 at 10:28:49AM +0200, Michal Hocko wrote: > > On Mon 17-08-20 00:56:55, Uladzislau Rezki wrote: > > [ . . . ] > > > > wget ftp://vps418301.ovh.net/incoming/1000000_kmalloc_kfree_rcu_proc_percpu_pagelist_fractio_is_8.png > > > > 1/8 of the memory in pcp lists is quite large and likely not something > > used very often. > > > > Both these numbers just make me think that a dedicated pool of page > > pre-allocated for RCU specifically might be a better solution. I still > > haven't read through that branch of the email thread though so there > > might be some pretty convincing argments to not do that. > > To avoid the problematic corner cases, we would need way more dedicated > memory than is reasonable, as in well over one hundred pages per CPU. > Sure, we could choose a smaller number, but then we are failing to defend > against flooding, even on systems that have more than enough free memory > to be able to do so. It would be better to live within what is available, > taking the performance/robustness hit only if there isn't enough. Thomas had a good point that it doesn't really make much sense to optimize for flooders because that just makes them more effective. > My current belief is that we need a combination of (1) either the > GFP_NOLOCK flag or Peter Zijlstra's patch and I must have missed the patch?
On Tue, Aug 18, 2020 at 09:43:44AM +0200, Michal Hocko wrote: > On Mon 17-08-20 15:28:03, Paul E. McKenney wrote: > > On Mon, Aug 17, 2020 at 10:28:49AM +0200, Michal Hocko wrote: > > > On Mon 17-08-20 00:56:55, Uladzislau Rezki wrote: > > > > [ . . . ] > > > > > > wget ftp://vps418301.ovh.net/incoming/1000000_kmalloc_kfree_rcu_proc_percpu_pagelist_fractio_is_8.png > > > > > > 1/8 of the memory in pcp lists is quite large and likely not something > > > used very often. > > > > > > Both these numbers just make me think that a dedicated pool of page > > > pre-allocated for RCU specifically might be a better solution. I still > > > haven't read through that branch of the email thread though so there > > > might be some pretty convincing argments to not do that. > > > > To avoid the problematic corner cases, we would need way more dedicated > > memory than is reasonable, as in well over one hundred pages per CPU. > > Sure, we could choose a smaller number, but then we are failing to defend > > against flooding, even on systems that have more than enough free memory > > to be able to do so. It would be better to live within what is available, > > taking the performance/robustness hit only if there isn't enough. > > Thomas had a good point that it doesn't really make much sense to > optimize for flooders because that just makes them more effective. The point is not to make the flooders go faster, but rather for the system to be robust in the face of flooders. Robust as in harder for a flooder to OOM the system. And reducing the number of post-grace-period cache misses makes it easier for the callback-invocation-time memory freeing to keep up with the flooder, thus avoiding (or at least delaying) the OOM. > > My current belief is that we need a combination of (1) either the > > GFP_NOLOCK flag or Peter Zijlstra's patch and > > I must have missed the patch? If I am keeping track, this one: https://lore.kernel.org/lkml/20200814215206.GL3982@worktop.programming.kicks-ass.net/ Thanx, Paul
On Tue, Aug 18 2020 at 06:53, Paul E. McKenney wrote: > On Tue, Aug 18, 2020 at 09:43:44AM +0200, Michal Hocko wrote: >> Thomas had a good point that it doesn't really make much sense to >> optimize for flooders because that just makes them more effective. > > The point is not to make the flooders go faster, but rather for the > system to be robust in the face of flooders. Robust as in harder for > a flooder to OOM the system. > > And reducing the number of post-grace-period cache misses makes it > easier for the callback-invocation-time memory freeing to keep up with > the flooder, thus avoiding (or at least delaying) the OOM. Throttling the flooder is increasing robustness far more than reducing cache misses. Thanks, tglx
On Tue 18-08-20 06:53:27, Paul E. McKenney wrote: > On Tue, Aug 18, 2020 at 09:43:44AM +0200, Michal Hocko wrote: > > On Mon 17-08-20 15:28:03, Paul E. McKenney wrote: > > > On Mon, Aug 17, 2020 at 10:28:49AM +0200, Michal Hocko wrote: > > > > On Mon 17-08-20 00:56:55, Uladzislau Rezki wrote: > > > > > > [ . . . ] > > > > > > > > wget ftp://vps418301.ovh.net/incoming/1000000_kmalloc_kfree_rcu_proc_percpu_pagelist_fractio_is_8.png > > > > > > > > 1/8 of the memory in pcp lists is quite large and likely not something > > > > used very often. > > > > > > > > Both these numbers just make me think that a dedicated pool of page > > > > pre-allocated for RCU specifically might be a better solution. I still > > > > haven't read through that branch of the email thread though so there > > > > might be some pretty convincing argments to not do that. > > > > > > To avoid the problematic corner cases, we would need way more dedicated > > > memory than is reasonable, as in well over one hundred pages per CPU. > > > Sure, we could choose a smaller number, but then we are failing to defend > > > against flooding, even on systems that have more than enough free memory > > > to be able to do so. It would be better to live within what is available, > > > taking the performance/robustness hit only if there isn't enough. > > > > Thomas had a good point that it doesn't really make much sense to > > optimize for flooders because that just makes them more effective. > > The point is not to make the flooders go faster, but rather for the > system to be robust in the face of flooders. Robust as in harder for > a flooder to OOM the system. Do we see this to be a practical problem? I am really confused because the initial argument was revolving around an optimization now you are suggesting that this is actually system stability measure. And I fail to see how allowing an easy way to deplete pcp caches completely solves any of that. Please do realize that if allow that then every user who relies on pcp caches will have to take a slow(er) path and that will have performance consequences. The pool is a global and a scarce resource. That's why I've suggested a dedicated preallocated pool and use it instead of draining global pcp caches. > And reducing the number of post-grace-period cache misses makes it > easier for the callback-invocation-time memory freeing to keep up with > the flooder, thus avoiding (or at least delaying) the OOM. > > > > My current belief is that we need a combination of (1) either the > > > GFP_NOLOCK flag or Peter Zijlstra's patch and > > > > I must have missed the patch? > > If I am keeping track, this one: > > https://lore.kernel.org/lkml/20200814215206.GL3982@worktop.programming.kicks-ass.net/ OK, I have certainly noticed that one but didn't react but my response would be similar to the dedicated gfp flag. This is less of a hack than __GFP_NOLOCK but it still exposes very internal parts of the allocator and I find that a quite problematic from the future maintenance of the allocator. The risk of an easy depletion of the pcp pool is also there of course.
Hello, Michal. You mentioned somewhere in the thread that I should show figures regarding hitting the fast path vs. the "fallback" one. We take the fallback path when a page allocation fails. Please see below the plot. I hope it is easy to understand: wget ftp://vps418301.ovh.net/incoming/1000000_kfree_rcu_fast_hit_vs_fallback_hit.png To summarize: when a tight loop is applied, i.e. a flood simulation, the fallback hit is negligible. It is noise; out of 1 000 000 invocations we hit the fallback in only 1% or 2% of cases. Thanks! -- Vlad Rezki
On Tue, Aug 18, 2020 at 04:43:14PM +0200, Thomas Gleixner wrote: > On Tue, Aug 18 2020 at 06:53, Paul E. McKenney wrote: > > On Tue, Aug 18, 2020 at 09:43:44AM +0200, Michal Hocko wrote: > >> Thomas had a good point that it doesn't really make much sense to > >> optimize for flooders because that just makes them more effective. > > > > The point is not to make the flooders go faster, but rather for the > > system to be robust in the face of flooders. Robust as in harder for > > a flooder to OOM the system. > > > > And reducing the number of post-grace-period cache misses makes it > > easier for the callback-invocation-time memory freeing to keep up with > > the flooder, thus avoiding (or at least delaying) the OOM. > > Throttling the flooder is incresing robustness far more than reducing > cache misses. True, but it takes time to identify a flooding event that needs to be throttled (as in milliseconds). This time cannot be made up. And in the absence of a flooding event, the last thing you want to do is to throttle call_rcu(), kfree_rcu(), and kvfree_rcu(). Thanx, Paul
On Tue, Aug 18, 2020 at 05:02:32PM +0200, Michal Hocko wrote: > On Tue 18-08-20 06:53:27, Paul E. McKenney wrote: > > On Tue, Aug 18, 2020 at 09:43:44AM +0200, Michal Hocko wrote: > > > On Mon 17-08-20 15:28:03, Paul E. McKenney wrote: > > > > On Mon, Aug 17, 2020 at 10:28:49AM +0200, Michal Hocko wrote: > > > > > On Mon 17-08-20 00:56:55, Uladzislau Rezki wrote: > > > > > > > > [ . . . ] > > > > > > > > > > wget ftp://vps418301.ovh.net/incoming/1000000_kmalloc_kfree_rcu_proc_percpu_pagelist_fractio_is_8.png > > > > > > > > > > 1/8 of the memory in pcp lists is quite large and likely not something > > > > > used very often. > > > > > > > > > > Both these numbers just make me think that a dedicated pool of page > > > > > pre-allocated for RCU specifically might be a better solution. I still > > > > > haven't read through that branch of the email thread though so there > > > > > might be some pretty convincing argments to not do that. > > > > > > > > To avoid the problematic corner cases, we would need way more dedicated > > > > memory than is reasonable, as in well over one hundred pages per CPU. > > > > Sure, we could choose a smaller number, but then we are failing to defend > > > > against flooding, even on systems that have more than enough free memory > > > > to be able to do so. It would be better to live within what is available, > > > > taking the performance/robustness hit only if there isn't enough. > > > > > > Thomas had a good point that it doesn't really make much sense to > > > optimize for flooders because that just makes them more effective. > > > > The point is not to make the flooders go faster, but rather for the > > system to be robust in the face of flooders. Robust as in harder for > > a flooder to OOM the system. > > Do we see this to be a practical problem? I am really confused because > the initial argument was revolving around an optimization now you are > suggesting that this is actually system stability measure. And I fail to > see how allowing an easy way to deplete pcp caches completely solves > any of that. Please do realize that if allow that then every user who > relies on pcp caches will have to take a slow(er) path and that will > have performance consequences. The pool is a global and a scarce > resource. That's why I've suggested a dedicated preallocated pool and > use it instead of draining global pcp caches. Both the optimization and the robustness are important. The problem with this thing is that I have to start describing it somewhere, and I have not yet come up with a description of the whole thing that isn't TL;DR. > > And reducing the number of post-grace-period cache misses makes it > > easier for the callback-invocation-time memory freeing to keep up with > > the flooder, thus avoiding (or at least delaying) the OOM. > > > > > > My current belief is that we need a combination of (1) either the > > > > GFP_NOLOCK flag or Peter Zijlstra's patch and > > > > > > I must have missed the patch? > > > > If I am keeping track, this one: > > > > https://lore.kernel.org/lkml/20200814215206.GL3982@worktop.programming.kicks-ass.net/ > > OK, I have certainly noticed that one but didn't react but my response > would be similar to the dedicated gfp flag. This is less of a hack than > __GFP_NOLOCK but it still exposes very internal parts of the allocator > and I find that a quite problematic from the future maintenance of the > allocator. The risk of an easy depletion of the pcp pool is also there > of course. I had to ask. ;-) Thanx, Paul
On Tue, Aug 18 2020 at 09:13, Paul E. McKenney wrote: > On Tue, Aug 18, 2020 at 04:43:14PM +0200, Thomas Gleixner wrote: >> On Tue, Aug 18 2020 at 06:53, Paul E. McKenney wrote: >> > On Tue, Aug 18, 2020 at 09:43:44AM +0200, Michal Hocko wrote: >> >> Thomas had a good point that it doesn't really make much sense to >> >> optimize for flooders because that just makes them more effective. >> > >> > The point is not to make the flooders go faster, but rather for the >> > system to be robust in the face of flooders. Robust as in harder for >> > a flooder to OOM the system. >> > >> > And reducing the number of post-grace-period cache misses makes it >> > easier for the callback-invocation-time memory freeing to keep up with >> > the flooder, thus avoiding (or at least delaying) the OOM. >> >> Throttling the flooder is incresing robustness far more than reducing >> cache misses. > > True, but it takes time to identify a flooding event that needs to be > throttled (as in milliseconds). This time cannot be made up. Not really. A flooding event will deplete your preallocated pages very fast, so you have to go into the allocator and get new ones which naturally throttles the offender. So if your open/close thing uses the new single argument free which has to be called from sleepable context then the allocation either gives you a page or that thing has to wait. No fancy extras. You still can have a page reserved for your other regular things and once that it gone, you have to fall back to the linked list for those. But when that happens the extra cache misses are not your main problem anymore. Thanks, tglx
On Tue, Aug 18, 2020 at 06:55:11PM +0200, Thomas Gleixner wrote: > On Tue, Aug 18 2020 at 09:13, Paul E. McKenney wrote: > > On Tue, Aug 18, 2020 at 04:43:14PM +0200, Thomas Gleixner wrote: > >> On Tue, Aug 18 2020 at 06:53, Paul E. McKenney wrote: > >> > On Tue, Aug 18, 2020 at 09:43:44AM +0200, Michal Hocko wrote: > >> >> Thomas had a good point that it doesn't really make much sense to > >> >> optimize for flooders because that just makes them more effective. > >> > > >> > The point is not to make the flooders go faster, but rather for the > >> > system to be robust in the face of flooders. Robust as in harder for > >> > a flooder to OOM the system. > >> > > >> > And reducing the number of post-grace-period cache misses makes it > >> > easier for the callback-invocation-time memory freeing to keep up with > >> > the flooder, thus avoiding (or at least delaying) the OOM. > >> > >> Throttling the flooder is incresing robustness far more than reducing > >> cache misses. > > > > True, but it takes time to identify a flooding event that needs to be > > throttled (as in milliseconds). This time cannot be made up. > > Not really. A flooding event will deplete your preallocated pages very > fast, so you have to go into the allocator and get new ones which > naturally throttles the offender. Should it turn out that we can in fact go into the allocator, completely agreed. > So if your open/close thing uses the new single argument free which has > to be called from sleepable context then the allocation either gives you > a page or that thing has to wait. No fancy extras. In the single-argument kvfree_rcu() case, completely agreed. > You still can have a page reserved for your other regular things and > once that it gone, you have to fall back to the linked list for > those. But when that happens the extra cache misses are not your main > problem anymore. The extra cache misses are a problem in that case because they throttle the reclamation, which anti-throttles the producer, especially in the case where callback invocation is offloaded. Thanx, Paul
Paul, On Tue, Aug 18 2020 at 10:13, Paul E. McKenney wrote: > On Tue, Aug 18, 2020 at 06:55:11PM +0200, Thomas Gleixner wrote: >> On Tue, Aug 18 2020 at 09:13, Paul E. McKenney wrote: >> > On Tue, Aug 18, 2020 at 04:43:14PM +0200, Thomas Gleixner wrote: >> >> Throttling the flooder is incresing robustness far more than reducing >> >> cache misses. >> > >> > True, but it takes time to identify a flooding event that needs to be >> > throttled (as in milliseconds). This time cannot be made up. >> >> Not really. A flooding event will deplete your preallocated pages very >> fast, so you have to go into the allocator and get new ones which >> naturally throttles the offender. > > Should it turn out that we can in fact go into the allocator, completely > agreed. You better can for any user space controllable flooding source. >> So if your open/close thing uses the new single argument free which has >> to be called from sleepable context then the allocation either gives you >> a page or that thing has to wait. No fancy extras. > > In the single-argument kvfree_rcu() case, completely agreed. > >> You still can have a page reserved for your other regular things and >> once that it gone, you have to fall back to the linked list for >> those. But when that happens the extra cache misses are not your main >> problem anymore. > > The extra cache misses are a problem in that case because they throttle > the reclamation, which anti-throttles the producer, especially in the > case where callback invocation is offloaded. You still did not explain which contexts can create flooding. I gave you a complete list a few mails ago, but you still did not tell which of the contexts can cause flooding. If it's any context which is not sleepable or controllable in any way, then any attempt to mitigate it is a lost battle: A dependency on allocating memory to free memory is a dead end by definition. Any flooder which is uncontrollable is a bug and no matter what kind of hacks you provide, it will be able to bring the whole thing down. So far this looks like you're trying to cure the symptoms, which is wrong to begin with. If the flooder is controllable then there is no problem with cache misses at all unless the RCU free callbacks are not able to catch up which is yet another problem which you can't cure by allocating more memory. Thanks, tglx
On Wed, Aug 19, 2020 at 01:26:02AM +0200, Thomas Gleixner wrote: > Paul, > > On Tue, Aug 18 2020 at 10:13, Paul E. McKenney wrote: > > On Tue, Aug 18, 2020 at 06:55:11PM +0200, Thomas Gleixner wrote: > >> On Tue, Aug 18 2020 at 09:13, Paul E. McKenney wrote: > >> > On Tue, Aug 18, 2020 at 04:43:14PM +0200, Thomas Gleixner wrote: > >> >> Throttling the flooder is incresing robustness far more than reducing > >> >> cache misses. > >> > > >> > True, but it takes time to identify a flooding event that needs to be > >> > throttled (as in milliseconds). This time cannot be made up. > >> > >> Not really. A flooding event will deplete your preallocated pages very > >> fast, so you have to go into the allocator and get new ones which > >> naturally throttles the offender. > > > > Should it turn out that we can in fact go into the allocator, completely > > agreed. > > You better can for any user space controllable flooding source. For the memory being passed to kvfree_rcu(), but of course. However, that memory has just as good a chance of going deeply into the allocator when being freed as when being allocated. In contrast, the page of pointers that kvfree_rcu() attempts to allocate can be handed back via per-CPU variables, avoiding lengthy time on the allocator's free path. Yes, yes, in theory we could make a devil's pact with potential flooders so that RCU directly handed memory back to them via a back channel as well, but in practice that sounds like an excellent source of complexity and bugs. > >> So if your open/close thing uses the new single argument free which has > >> to be called from sleepable context then the allocation either gives you > >> a page or that thing has to wait. No fancy extras. > > > > In the single-argument kvfree_rcu() case, completely agreed. > > > >> You still can have a page reserved for your other regular things and > >> once that it gone, you have to fall back to the linked list for > >> those. But when that happens the extra cache misses are not your main > >> problem anymore. > > > > The extra cache misses are a problem in that case because they throttle > > the reclamation, which anti-throttles the producer, especially in the > > case where callback invocation is offloaded. > > You still did not explain which contexts can create flooding. I gave you > a complete list a few mails ago, but you still did not tell which of the > contexts can cause flooding. Message-ID: <87tux4kefm.fsf@nanos.tec.linutronix.de>, correct? In case #1 (RCU call flooding), the page of pointers is helpful. In case #2 (RCU not being able to run and mop up the backlog), allocating pages of pointers is unhelpful, given that doing so simply speeds up the potential OOM. My thought is to skip kvfree_rcu()/call_rcu() pointer-page allocation once the current grace period's duration exceeds one second. > If it's any context which is not sleepable or controllable in any way, > then any attempt to mitigate it is a lost battle: > > A dependency on allocating memory to free memory is a dead end by > definition. Any attempt to mitigate that -lacks- -a- -fallback- is a losing battle. > Any flooder which is uncontrollable is a bug and no matter what kind of > hacks you provide, it will be able to bring the whole thing down. On to the sources of flooding, based on what reality has managed to teach me thus far: 1) User space via syscalls, e.g. open/close These are definitely in scope. 2) Kernel thread In general, these are out of scope, as you would expect. 
For completeness, based on rcutorture-induced callback flooding, if the kernel thread's loop contains at least one cond_resched() for CONFIG_PREEMPT_NONE=y on the one hand, at least one schedule() for preemptible kernels running NO_HZ_FULL, and at least one point during which preemption is possible otherwise. As discussed at Plumbers last year, the sysadm can always configure callback offloading in such a way that RCU has no chance of keeping up with a flood. Then again, the sysadm can also always crash the system in all sorts of interesting ways, so what is one more? But please note that rcutorture does -not- recycle the flooded memory via kfree(), and thus avoids any problems with kfree() needing to dive deep into the allocator. What rcutorture does instead is to pass the memory back to the flooder using a dedicated queue. At the end of the test, the whole mess is kfree()ed. And all of these will be solved (as you would hope) by "Don't do that!!!", as laid out in the paragraphs above. Of course, the exact form of "Don't do that" will no doubt change over time, but this code is in the kernel and therefore can be changed as needed. 3) Softirq Flooding from within the confines of a single softirq handler is of course out of scope. There are all sorts of ways for the softirq-handler writer to deal with this, for example, dropping into workqueue context to do the allocation, which brings things back to #2 (kernel thread). I have not experimented with this, nor do I intend to. 4) Device interrupt Likewise, flooding from within the confines of a single device interrupt handler is out of scope. As with softirq handlers, there are all sorts of ways for the device-driver writer to deal with this, for example, dropping into workqueue context to do the allocation, which brings things back to #2 (kernel thread). Again, as with softirq, I have not experimented with this, nor do I intend to. 5) System interrupts, deep atomic context, NMI ... System interrupts have the same callback-flooding constraints as device interrupts, correct? (I am thinking that by "system interrupt" you mean things like the scheduling-clock interrupt, except that I have always thought of this interrupt as being a form of device interrupt.) I have no idea what "deep atomic context" but it does sound intriguing. ;-) If you mean the entry/exit code that is not allowed to be traced, then you cannot allocate, call_rcu(), kfree_rcu, or kvfree_rcu() from that context anyway. NMI handlers are not allowed to allocate or to invoke either call_rcu(), kfree_rcu(), or kvfree_rcu(), so they are going to need help from some other execution context if they are going to do any flooding at all. In short, the main concern is flooding driven one way or another from user space, whether via syscalls, traps, exceptions, or whatever. Therefore, of the above list, I am worried only about #1. (OK, OK, I am worried about #2-#4, but only from the perspective of knowing what to tell the developer not to do.) > So far this looks like you're trying to cure the symptoms, which is > wrong to begin with. > > If the flooder is controllable then there is no problem with cache > misses at all unless the RCU free callbacks are not able to catch up > which is yet another problem which you can't cure by allocating more > memory. As much as I might like to agree with that statement, the possibility of freeing (not just allocation) going deep into the allocator leads me to believe that reality might have a rather different opinion. 
And the only way to find out is to try it. I therefore propose that the eventual patch series be split into three parts: 1. Maintain a separate per-CPU cache of pages of pointers dedicated to kfree_rcu() and kvfree_rcu(), and later also to call_rcu(). If a given cache is empty, the code takes the fallback (either use the rcu_head structure or block on synchronize_rcu(), depending), but the code also asks for an allocation from a sleepable context in order to replenish the cache. (We could use softirq for kfree_rcu() and kvfree_rcu(), but call_rcu() would have deadlock issues with softirq.) 2. Peter's patch or similar. 3. Use of Peter's patch. If the need for #2 and #3 is convincing, well and good. If not, I will create a tag for them in the -rcu tree so that they can be found quickly in case reality decides to express its opinion at some inconvenient time at some inconvenient scale. Thoughts? Thanx, Paul
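Item 1 above might look conceptually like the following per-CPU cache: the atomic path only pops from the cache and never allocates, while a sleepable context refills it. Everything here (the names, the depth of 4 pages, the krc_kick_refill() deferral helper) is an assumption for illustration, not the actual series.

<snip>
#include <linux/gfp.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>

#define KRC_CACHE_PAGES 4	/* assumed depth; ~2048 pointers on 64-bit */

struct krc_page_cache {
	raw_spinlock_t lock;
	void *pages[KRC_CACHE_PAGES];
	int nr;
};
static DEFINE_PER_CPU(struct krc_page_cache, krc_cache) = {
	.lock = __RAW_SPIN_LOCK_UNLOCKED(krc_cache.lock),
};

/* Atomic context: take a cached page if one is there, never allocate. */
static void *krc_cache_pop(void)
{
	struct krc_page_cache *c = raw_cpu_ptr(&krc_cache);
	unsigned long flags;
	void *page = NULL;
	bool low;

	raw_spin_lock_irqsave(&c->lock, flags);
	if (c->nr)
		page = c->pages[--c->nr];
	low = c->nr < KRC_CACHE_PAGES / 2;
	raw_spin_unlock_irqrestore(&c->lock, flags);

	if (low)
		krc_kick_refill();	/* invented: defer refilling to a sleepable context */

	return page;	/* NULL means: use the rcu_head / synchronize_rcu() fallback */
}

/* Sleepable context (work item or similar): replenish the cache. */
static void krc_cache_refill(void)
{
	struct krc_page_cache *c;
	unsigned long flags, page;

	while ((page = __get_free_page(GFP_KERNEL | __GFP_NOWARN))) {
		c = raw_cpu_ptr(&krc_cache);
		raw_spin_lock_irqsave(&c->lock, flags);
		if (c->nr < KRC_CACHE_PAGES) {
			c->pages[c->nr++] = (void *)page;
			page = 0;
		}
		raw_spin_unlock_irqrestore(&c->lock, flags);
		if (page) {
			free_page(page);	/* cache already full */
			break;
		}
	}
}
<snip>

The raw_spinlock_t keeps the pop side legal from the contexts discussed in this thread, and the refill side is where either Peter's lockless helper or a plain GFP_KERNEL allocation would slot in.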
diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 67a0774e080b..c6f11481c42a 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -39,8 +39,9 @@ struct vm_area_struct; #define ___GFP_HARDWALL 0x100000u #define ___GFP_THISNODE 0x200000u #define ___GFP_ACCOUNT 0x400000u +#define ___GFP_NO_LOCKS 0x800000u #ifdef CONFIG_LOCKDEP -#define ___GFP_NOLOCKDEP 0x800000u +#define ___GFP_NOLOCKDEP 0x1000000u #else #define ___GFP_NOLOCKDEP 0 #endif @@ -215,16 +216,22 @@ struct vm_area_struct; * %__GFP_COMP address compound page metadata. * * %__GFP_ZERO returns a zeroed page on success. + * + * %__GFP_NO_LOCKS order-0 allocation without sleepable-locks. + * It obtains a page from the per-cpu-list and considered as + * lock-less. No other actions are performed, thus it returns + * NULL if per-cpu-list is empty. */ #define __GFP_NOWARN ((__force gfp_t)___GFP_NOWARN) #define __GFP_COMP ((__force gfp_t)___GFP_COMP) #define __GFP_ZERO ((__force gfp_t)___GFP_ZERO) +#define __GFP_NO_LOCKS ((__force gfp_t)___GFP_NO_LOCKS) /* Disable lockdep for GFP context tracking */ #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP) /* Room for N __GFP_FOO bits */ -#define __GFP_BITS_SHIFT (23 + IS_ENABLED(CONFIG_LOCKDEP)) +#define __GFP_BITS_SHIFT (24 + IS_ENABLED(CONFIG_LOCKDEP)) #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) /** diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index 939092dbcb8b..653c56c478ad 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -45,6 +45,7 @@ {(unsigned long)__GFP_RECLAIMABLE, "__GFP_RECLAIMABLE"}, \ {(unsigned long)__GFP_MOVABLE, "__GFP_MOVABLE"}, \ {(unsigned long)__GFP_ACCOUNT, "__GFP_ACCOUNT"}, \ + {(unsigned long)__GFP_NO_LOCKS, "__GFP_NO_LOCKS"}, \ {(unsigned long)__GFP_WRITE, "__GFP_WRITE"}, \ {(unsigned long)__GFP_RECLAIM, "__GFP_RECLAIM"}, \ {(unsigned long)__GFP_DIRECT_RECLAIM, "__GFP_DIRECT_RECLAIM"},\ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e4896e674594..8bf1e3a9a1c3 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3305,7 +3305,8 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z) } /* Remove page from the per-cpu list, caller must protect the list */ -static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype, +static struct page *__rmqueue_pcplist(struct zone *zone, gfp_t gfp_flags, + int migratetype, unsigned int alloc_flags, struct per_cpu_pages *pcp, struct list_head *list) @@ -3314,7 +3315,8 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype, do { if (list_empty(list)) { - pcp->count += rmqueue_bulk(zone, 0, + if (!(gfp_flags & __GFP_NO_LOCKS)) + pcp->count += rmqueue_bulk(zone, 0, pcp->batch, list, migratetype, alloc_flags); if (unlikely(list_empty(list))) @@ -3341,8 +3343,20 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone, local_irq_save(flags); pcp = &this_cpu_ptr(zone->pageset)->pcp; - list = &pcp->lists[migratetype]; - page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list); + + if (!(gfp_flags & __GFP_NO_LOCKS)) { + list = &pcp->lists[migratetype]; + page = __rmqueue_pcplist(zone, gfp_flags, migratetype, alloc_flags, pcp, list); + } else { + /* Iterate over all migrate types of the pcp-lists. 
*/ + for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++) { + list = &pcp->lists[migratetype]; + page = __rmqueue_pcplist(zone, gfp_flags, migratetype, alloc_flags, pcp, list); + if (page) + break; + } + } + if (page) { __count_zid_vm_events(PGALLOC, page_zonenum(page), 1); zone_statistics(preferred_zone, zone); @@ -3790,7 +3804,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags, * grow this zone if it contains deferred pages. */ if (static_branch_unlikely(&deferred_pages)) { - if (_deferred_grow_zone(zone, order)) + if (!(gfp_mask & __GFP_NO_LOCKS) && + _deferred_grow_zone(zone, order)) goto try_this_zone; } #endif @@ -3835,7 +3850,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags, reserve_highatomic_pageblock(page, zone, order); return page; - } else { + } else if (!(gfp_mask & __GFP_NO_LOCKS)) { #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT /* Try again if zone has deferred pages */ if (static_branch_unlikely(&deferred_pages)) { @@ -4880,6 +4895,10 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid, if (likely(page)) goto out; + /* Bypass slow path if __GFP_NO_LOCKS. */ + if ((gfp_mask & __GFP_NO_LOCKS)) + goto out; + /* * Apply scoped allocation constraints. This is mainly about GFP_NOFS * resp. GFP_NOIO which has to be inherited for all allocation requests diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c index 38a5ab683ebc..662e1d9a0e99 100644 --- a/tools/perf/builtin-kmem.c +++ b/tools/perf/builtin-kmem.c @@ -656,6 +656,7 @@ static const struct { { "__GFP_RECLAIMABLE", "RC" }, { "__GFP_MOVABLE", "M" }, { "__GFP_ACCOUNT", "AC" }, + { "__GFP_NO_LOCKS", "NL" }, { "__GFP_WRITE", "WR" }, { "__GFP_RECLAIM", "R" }, { "__GFP_DIRECT_RECLAIM", "DR" },
Some background and kfree_rcu()
===============================
The pointers to be freed are stored in per-CPU arrays to improve
performance, to provide an easier-to-use API, to accommodate vmalloc
memory and to support the single-argument form of kfree_rcu() where
only a pointer is passed. More details are below.

In order to maintain such per-CPU arrays, a dynamic allocation is needed
whenever the current array is fully populated and a new block is
required. See the example below:

        0 1 2 3         0 1 2 3
      |p|p|p|p|   ->  |p|p|p|p|  ->  NULL

There are two pointer blocks; each one can store 4 addresses which will
be freed after a grace period has passed. In reality each block stores
PAGE_SIZE / sizeof(void *) pointers. To maintain such blocks, a single
page is obtained via the page allocator (a rough sketch of this block
layout is given at the end of this section):

    bnode = (struct kvfree_rcu_bulk_data *)
        __get_free_page(GFP_NOWAIT | __GFP_NOWARN);

After that it is attached to the "head" and its "next" pointer is set to
the previous "head", so the list of blocks can be maintained and grow
dynamically until it gets drained by the reclaiming thread.

Please note that there is always a fallback if an allocation fails. For
the single-argument case this is a call to synchronize_rcu(); for the
two-argument case the rcu_head structure embedded in the object being
freed is used instead, paying the cache-miss penalty and invoking
kfree() per object instead of kfree_bulk() for groups of objects.

Why do we maintain arrays/blocks instead of linking objects with the
regular "struct rcu_head" technique? The main reasons are:

a) Memory can be reclaimed through the kfree_bulk() interface, which
   takes an array and the number of entries in it. That removes the
   per-object overhead of calling kfree() for each object and reduces
   the reclamation time.

b) It improves locality and reduces the number of cache misses caused
   by "pointer chasing" between objects that can be spread far apart.

c) It supports the "single argument" form of kvfree_rcu():

       void *ptr = kvmalloc(some_bytes, GFP_KERNEL);
       if (ptr)
           kvfree_rcu(ptr);

   We need it when an "rcu_head" is not embedded into a structure but
   the object must still be freed after a grace period; for the single
   argument, such objects cannot be queued on a linked list. Since the
   single-argument form does not exist yet but there is demand for it,
   people currently work around it with a simple but inefficient
   sequence:

   <snip>
       synchronize_rcu(); /* Can be long and blocks the current context */
       kfree(p);
   <snip>

   More details are here: https://lkml.org/lkml/2020/4/28/1626

d) It distinguishes vmalloc pointers from SLAB ones, making it possible
   to invoke the right freeing API for the right kind of pointer:
   kfree_bulk() or, TBD, vmalloc_bulk().

   Also, please have a look here: https://lkml.org/lkml/2020/7/30/1166
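
Going back to the pointer blocks described at the beginning of this
section: for reference, one such block looks roughly like the sketch
below. This is an approximation for illustration; the exact definition
lives in the kfree_rcu() code (kernel/rcu/tree.c in the rcu-dev branch)
and may differ in naming and layout:

<snip>
    /* Approximate layout of one page-sized pointer block. */
    struct kvfree_rcu_bulk_data {
        unsigned long nr_records;              /* used slots in records[] */
        struct kvfree_rcu_bulk_data *next;     /* previous "head" block */
        void *records[];                       /* pointers queued for freeing */
    };

    /* How many pointers fit into a single page-sized block. */
    #define KVFREE_BULK_MAX_ENTR \
        ((PAGE_SIZE - sizeof(struct kvfree_rcu_bulk_data)) / sizeof(void *))
<snip>

Each block is filled until records[] is full, after which a new page is
requested from the page allocator and chained via "next".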
Limitations and concerns (Main part)
====================================
The current memory-allocation interface presents the following
difficulties that this patch is designed to overcome:

a) If built with CONFIG_PROVE_RAW_LOCK_NESTING, lockdep will complain
   about a violation ("BUG: Invalid wait context") of the nesting
   rules. It performs raw_spinlock vs. spinlock nesting checks, i.e.
   it is not legal to acquire a spinlock_t while holding a
   raw_spinlock_t.

   Internally kfree_rcu() uses raw_spinlock_t (in the rcu-dev branch),
   whereas the page allocator internally deals with spinlock_t to
   access its zones. The code can also be broken from a higher-level
   point of view:

   <snip>
       raw_spin_lock(&some_lock);
       kfree_rcu(some_pointer, some_field_offset);
   <snip>

b) If built with CONFIG_PREEMPT_RT, spinlock_t is converted into a
   sleepable variant, so invoking the page allocator from atomic
   contexts leads to "BUG: scheduling while atomic".

Proposal
========
1) Add a GFP_* flag that ensures the allocator returns NULL rather than
acquiring its own spinlock_t. Having such a flag addresses limitations
a) and b) described above. It also makes the kfree_rcu() code common
for RT and regular kernels, cleaner, with fewer corner cases to handle,
and reduces the code size.

Description: the page allocator has two phases, a fast path and a slow
one. We are interested in the fast path and order-0 allocations. The
fast path, in turn, is divided into two steps, a lock-less one and one
that takes a lock:

1) As a first step the page allocator tries to obtain a page from the
   per-cpu list (each CPU has its own one). That is why this step is
   lock-less and fast. Basically it disables irqs on the current CPU in
   order to access the per-cpu data and removes the first element from
   the pcp-list. The element/page is returned to the user.

2) If there is no available page in the per-cpu list, the second step
   is involved. It removes a specified number of elements from the
   buddy allocator, transferring them to the per-cpu list described in
   [1]. The number of pre-fetched elements can be controlled via the
   percpu_pagelist_fraction sysctl
   (/proc/sys/vm/percpu_pagelist_fraction). This step is not lock-less:
   it uses spinlock_t to access the buddy zone. It is fully covered by
   the rmqueue_bulk() function.

Summarizing: __GFP_NO_LOCKS covers only [1] and cannot perform step
[2], because [2] acquires spinlock_t. This implies that it is very
fast, but a higher failure rate is also expected. Having such a flag
addresses limitations a) and b) described above.

Usage: __get_free_page(__GFP_NO_LOCKS);
(a caller-side sketch is shown at the end of this message)

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 include/linux/gfp.h            | 11 +++++++++--
 include/trace/events/mmflags.h |  1 +
 mm/page_alloc.c                | 31 +++++++++++++++++++++++++------
 tools/perf/builtin-kmem.c      |  1 +
 4 files changed, 36 insertions(+), 8 deletions(-)
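
Finally, the caller-side sketch referenced above. It is hypothetical
and not part of this patch (the helper name is made up); a NULL result
simply routes the request to the existing synchronize_rcu()/rcu_head
fallback described earlier:

<snip>
    /* Safe under raw_spinlock_t, preempt- or IRQ-disabled sections. */
    static struct kvfree_rcu_bulk_data *try_get_block_lockless(void)
    {
        /*
         * Only the per-cpu lists are consulted; NULL is returned if
         * they are empty and no spinlock_t is ever acquired.
         */
        return (struct kvfree_rcu_bulk_data *)
            __get_free_page(__GFP_NO_LOCKS);
    }
<snip>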