Message ID: 20200918194817.48921-3-urezki@gmail.com (mailing list archive)
State: New, archived
Series: kvfree_rcu() and _LOCK_NESTING/_PREEMPT_RT
On Fri 18-09-20 21:48:15, Uladzislau Rezki (Sony) wrote:
[...]
> Proposal
> ========
> Introduce a lock-free function that obtains a page from the per-cpu lists
> on the current CPU. It returns NULL rather than acquiring any non-raw spinlock.

I was not happy about this solution when we discussed it last time and I have to say I am still not happy. It exposes an internal allocator optimization and allows a hard-to-estimate consumption of pcp free pages. IIUC this run on the pcp cache can be controlled directly from userspace (close(open) loop IIRC), which makes it an even bigger no-no.

I strongly agree with Thomas http://lkml.kernel.org/r/87tux4kefm.fsf@nanos.tec.linutronix.de that this optimization is not aiming at reasonable workloads. Really, go with a pre-allocated buffer and fall back to whatever slow path you already have. Exposing more internals of the allocator is not going to do any good for long-term maintainability.
On Mon, Sep 21, 2020 at 09:47:16AM +0200, Michal Hocko wrote:
> On Fri 18-09-20 21:48:15, Uladzislau Rezki (Sony) wrote:
> [...]
> > Proposal
> > ========
> > Introduce a lock-free function that obtain a page from the per-cpu-lists on current CPU. It returns NULL rather than acquiring any non-raw spinlock.
>
> I was not happy about this solution when we have discussed this last time and I have to say I am still not happy. This is exposing an internal allocator optimization and allows a hard to estimate consumption of pcp free pages. IIUC this run on pcp cache can be controled directly from userspace (close(open) loop IIRC) which makes it even bigger no-no.

Yes, I do well remember that you are unhappy with this approach. Unfortunately, thus far, there is no solution that makes all developers happy. You might be glad to hear that we are also looking into other solutions, each of which makes some other developers unhappy. So we are at least not picking on you alone. :-/

> I strongly agree with Thomas http://lkml.kernel.org/r/87tux4kefm.fsf@nanos.tec.linutronix.de that this optimization is not aiming at reasonable workloads. Really, go with pre-allocated buffer and fallback to whatever slow path you have already. Exposing more internals of the allocator is not going to do any good for long term maintainability.

I suggest that you carefully re-read the thread following that email. Given a choice between making users unhappy and making developers unhappy, I will side with the users each and every time.

Thanx, Paul
On Mon 21-09-20 08:45:58, Paul E. McKenney wrote:
> On Mon, Sep 21, 2020 at 09:47:16AM +0200, Michal Hocko wrote:
> > [...]
> > I was not happy about this solution when we have discussed this last time and I have to say I am still not happy. This is exposing an internal allocator optimization and allows a hard to estimate consumption of pcp free pages. IIUC this run on pcp cache can be controled directly from userspace (close(open) loop IIRC) which makes it even bigger no-no.
>
> Yes, I do well remember that you are unhappy with this approach. Unfortunately, thus far, there is no solution that makes all developers happy. You might be glad to hear that we are also looking into other solutions, each of which makes some other developers unhappy. So we are at least not picking on you alone. :-/

No worries, I do not feel like a whipping boy here. But do expect me to argue against the approach. I would also appreciate it if there was some more information on other attempts and why they have failed. E.g. why pre-allocation is not an option that works well enough in most reasonable workloads. I would also appreciate some more thoughts on why we need to optimize for heavy abusers of RCU (like close(open) extremes).

> > I strongly agree with Thomas http://lkml.kernel.org/r/87tux4kefm.fsf@nanos.tec.linutronix.de that this optimization is not aiming at reasonable workloads. Really, go with pre-allocated buffer and fallback to whatever slow path you have already. Exposing more internals of the allocator is not going to do any good for long term maintainability.
>
> I suggest that you carefully re-read the thread following that email.

I clearly remember Thomas not being particularly happy that you optimize for a corner case. I do not remember there being a consensus that this is the right approach. There was some consensus that this is better than a gfp flag. Still quite bad, though, if you ask me.

> Given a choice between making users unhappy and making developers unhappy, I will side with the users each and every time.

Well, let me rephrase. It is not only about me (as a developer) being unhappy but also about all the side effects this would have for users when performance of their favorite workload declines for no apparent reason, just because pcp caches are depleted by an unrelated process.
Hello, Michal.

> > Yes, I do well remember that you are unhappy with this approach. Unfortunately, thus far, there is no solution that makes all developers happy. [...]
>
> No worries I do not feel like a whipping boy here. But do expect me to argue against the approach. I would also appreciate it if there was some more information on other attempts, why they have failed. E.g. why pre-allocation is not an option that works well enough in most reasonable workloads.

Pre-allocating has some drawbacks:

a) It is impossible to predict how many pages will be required to cover a demand that is controlled by different workloads on various systems.

b) Memory overhead, since we do not know how many pages should be preloaded: 100, 200 or 300.

As for memory overhead, it is important to reduce it because of embedded devices like phones, where a low-memory condition is a big issue. In that sense pre-allocating is something that we would strongly like to avoid.

> I would also appreciate some more thoughts why we need to optimize for heavy abusers of RCU (like close(open) extremes).

I think there is a small misunderstanding here. Please note that this is not only about performance and corner cases. There is the single-argument support of kvfree_rcu(ptr), where maintaining an array in time is needed. The fallback for the single-argument case is extremely slow.

Single-argument details are here: https://lkml.org/lkml/2020/4/28/1626

> > I strongly agree with Thomas http://lkml.kernel.org/r/87tux4kefm.fsf@nanos.tec.linutronix.de that this optimization is not aiming at reasonable workloads. [...]
>
> I clearly remember Thomas not being particularly happy that you optimize for a corner case. I do not remember there being a consensus that this is the right approach. There was some consensus that this is better than a gfp flag. Still quite bad though if you ask me.
>
> > Given a choice between making users unhappy and making developers unhappy, I will side with the users each and every time.
>
> Well, let me rephrase. It is not only about me (as a developer) being unhappy but also all the side effects this would have for users when performance of their favorite workload declines for no apparent reason just because pcp caches are depleted by an unrelated process.

If depleted, we have a special worker that recharges it. On the other hand, the pcplist can be depleted by its nature, which _is_ not wrong. But just in case, we secure it, since you had a concern about it.

Could you please specify a real test case or workload you are talking about?

Thank you for your comments and help.

--
Vlad Rezki
On Mon, Sep 21, 2020 at 06:03:18PM +0200, Michal Hocko wrote:
> [...]
> No worries I do not feel like a whipping boy here. But do expect me to argue against the approach. I would also appreciate it if there was some more information on other attempts, why they have failed. E.g. why pre-allocation is not an option that works well enough in most reasonable workloads. I would also appreciate some more thoughts why we need to optimize for heavy abusers of RCU (like close(open) extremes).

Not optimizing for them, but rather defending against them. Uladzislau gave the example of low-memory phones. And we have quite the array of defenses against other userspace bugs, including MMUs, the "limit" command, and so on.

There have been quite a few successful attempts, starting from the introduction of blimit and RCU-bh more than 15 years ago, continuing through making call_rcu() jump-start grace periods, IPIing reluctant CPUs, tuning RCU callback offloading, and many others. But these prior approaches have only taken us so far.

Other approaches under consideration include making CONFIG_PREEMPT_COUNT unconditional and thus allowing call_rcu() and kvfree_rcu() to determine whether direct calls to the allocator are safe (some guy named Linus doesn't like this one), preallocation (Uladzislau covered this, and the amount that would need to be preallocated is excessive), deferring allocation to RCU_SOFTIRQ (this would also need CONFIG_PREEMPT_COUNT), and deferring to some clean context (which is the best we can do within the confines of RCU, but which can have issues with delay).

So it is not the need to address this general problem that is new. Far from it! What is new is the need for changes outside of RCU.

> > I strongly agree with Thomas http://lkml.kernel.org/r/87tux4kefm.fsf@nanos.tec.linutronix.de that this optimization is not aiming at reasonable workloads. [...]
> >
> > I suggest that you carefully re-read the thread following that email.
>
> I clearly remember Thomas not being particularly happy that you optimize for a corner case. I do not remember there being a consensus that this is the right approach. There was some consensus that this is better than a gfp flag. Still quite bad though if you ask me.

Again, this "optimization" is for robustness more than raw speed.

> > Given a choice between making users unhappy and making developers unhappy, I will side with the users each and every time.
>
> Well, let me rephrase. It is not only about me (as a developer) being unhappy but also all the side effects this would have for users when performance of their favorite workload declines for no apparent reason just because pcp caches are depleted by an unrelated process.

But in the close(open()) case, wouldn't the allocations on the open() side refill those caches? Yes, cases where one CPU is doing the allocating and another the call_rcu()/kvfree_rcu() need additional help, but as Uladzislau noted, we do have patches that ensure that the refilling happens.

Thanx, Paul
[Cc Mel - the thread starts http://lkml.kernel.org/r/20200918194817.48921-1-urezki@gmail.com]

On Mon 21-09-20 21:48:19, Uladzislau Rezki wrote:
> [...]
> Pre-allocating has some drawbacks:
>
> a) It is impossible to predict how many pages will be required to cover a demand that is controlled by different workloads on various systems.

Yes, this is not trivial, but not rocket science either. Remember that you are relying on a very dumb watermark-based pcp pool from the allocator. Mimicking a similar implementation shouldn't be all that hard, and as a bonus you will get your own pool which doesn't affect other page allocator users as much.

> b) Memory overhead since we do not know how much pages should be preloaded: 100, 200 or 300

Does anybody who really needs this optimization actually care about 300 pages?

> As for memory overhead, it is important to reduce it because of embedded devices like phones, where a low memory condition is a big issue. In that sense pre-allocating is something that we strongly would like to avoid.

How big "machines" are we talking about here? I would expect that really tiny machines would have a hard time filling up thousands of pages with pointers to free... Would a similar scaling as in the page allocator be feasible? Really, I mostly care about the shared nature of the pcp allocator list, which one user can easily monopolize with this API.

> > I would also appreciate some more thoughts why we need to optimize for heavy abusers of RCU (like close(open) extremes).
>
> I think here is a small misunderstanding. Please note, that is not only about performance and corner cases. There is a single argument support of the kvfree_rcu(ptr), where maintaining an array in time is needed. The fallback of the single argument case is extrimely slow.

This should be part of the changelog.

> Single-argument details is here: https://lkml.org/lkml/2020/4/28/1626

Error 501

> [...]
> If depleted, we have a special worker that charge it. From the other hand, the pcplist can be depleted by its nature, what _is_ not wrong. But just in case we secure it since you had a concern about it.

pcp free lists should only ever get empty when we run out of memory and need to reclaim. Otherwise they are constantly refilled/rebalanced on demand. The fact that you are refilling them from outside just suggests that you are operating on the wrong layer. Really, create your own pool of pages and rebalance it based on the workload.

> Could you please specify a real test case or workload you are talking about?

I am not a performance expert, but essentially any memory-allocator-heavy workload might notice. I am pretty sure Mel would tell you more.
On Mon 21-09-20 20:35:53, Paul E. McKenney wrote:
> [...]
> Not optimizing for them, but rather defending against them. Uladzislau gave the example of low-memory phones. And we have quite the array of defenses against other userspace bugs including MMUs, the "limit" command, and so on.
>
> There have been quite a few successful attempts, starting from the introduction of blimit and RCU-bh more than 15 years ago, continuing through making call_rcu() jump-start grace periods, IPIing reluctant CPUs, tuning RCU callback offloading, and many others. But these prior approaches have only taken us so far.

Thank you, this is useful. It gives me some idea about the problem, but I still cannot picture how serious the problem really is. Is this a DoS-like scenario where an unprivileged task is able to monopolize the system/CPU to do all the RCU heavy lifting? Are other RCU users deprived of doing their portion? How much does expediting help? Does it really solve the problem, or does it merely shift it around, as you will have a hard time keeping up with more and more tasks hitting the same path while the pre-allocated memory is a very finite resource?

> Other approaches under consideration include making CONFIG_PREEMPT_COUNT unconditional and thus allowing call_rcu() and kvfree_rcu() to determine whether direct calls to the allocator are safe (some guy named Linus doesn't like this one),

I assume that the primary argument is the overhead, right? Do you happen to have any reference?

> preallocation (Uladzislau covered this, and the amount that would need to be preallocated is excessive), deferring allocation to RCU_SOFTIRQ (this would also need CONFIG_PREEMPT_COUNT), and deferring to some clean context (which is the best we can do within the confines of RCU, but which can have issues with delay).

I have already replied to that, so let's not split the discussion into several fronts.

> So it is not the need to address this general problem that is new. Far from it! What is new is the need for changes outside of RCU.
>
> > > I strongly agree with Thomas http://lkml.kernel.org/r/87tux4kefm.fsf@nanos.tec.linutronix.de that this optimization is not aiming at reasonable workloads. [...]
> >
> > I clearly remember Thomas not being particularly happy that you optimize for a corner case. I do not remember there being a consensus that this is the right approach. There was some consensus that this is better than a gfp flag. Still quite bad though if you ask me.
>
> Again, this "optimization" is for robustness more than raw speed.
>
> > > Given a choice between making users unhappy and making developers unhappy, I will side with the users each and every time.
> >
> > Well, let me rephrase. It is not only about me (as a developer) being unhappy but also all the side effects this would have for users when performance of their favorite workload declines for no apparent reason just because pcp caches are depleted by an unrelated process.
>
> But in the close(open()) case, wouldn't the allocations on the open() side refill those caches?

Yes, pcp lists will eventually get replenished. This is not a correctness problem I am pointing out. It is the fact that any user of this new interface can monopolize the _global_ pool, which would affect all other users of the pool.
> > > [...] E.g. why pre-allocation is not an option that works well enough in most reasonable workloads.
> > Pre-allocating has some drawbacks:
> >
> > a) It is impossible to predict how many pages will be required to cover a demand that is controlled by different workloads on various systems.
>
> Yes, this is not trivial but not a rocket science either. Remember that you are relying on a very dumb watermark based pcp pool from the allocator.

We rely on it, indeed. If the pcp-cache is depleted, our special work is triggered to charge our local cache (a few pages); that way we also initiate pre-fetching of pages from the buddy allocator to populate the depleted pcp-cache. I do not have any concern here.

> Mimicing a similar implementation shouldn't be all that hard and you will get your own pool which doesn't affect other page allocator users as much as a bonus.

I see your point, Michal. As I mentioned before, it is important to avoid having such own pools, because the aim is not to waste memory resources. A page will be returned back to the page allocator as soon as the scheduler places our reclaim thread on a CPU and the grace period has passed, so the resource can be used for other needs. That is what is important. Otherwise the memory footprint is increased, which is bad for low-memory conditions when OOM is involved. Just in case: that is a big issue for mobile devices.

> > b) Memory overhead since we do not know how much pages should be preloaded: 100, 200 or 300
>
> Does anybody who really needs this optimization actually cares about 300 pages?

It might be an issue for embedded devices when such devices run into a low-memory condition, resulting in OOM or slow allocations due to the mentioned condition. For servers and big systems it will not be visible.

> > As for memory overhead, it is important to reduce it because of embedded devices like phones, where a low memory condition is a big issue. In that sense pre-allocating is something that we strongly would like to avoid.
>
> How big "machines" are we talking about here? I would expect that really tiny machines would have hard times to really fill up thousands of pages with pointers to free...

I mentioned this above. We cannot rely on a static model. We would like to have a mechanism that gives used pages back to the page allocator ASAP for other needs.

> Would a similar scaling as the page allocator feasible. Really I mostly do care about shared nature of the pcp allocator list that one user can easily monopolize with this API.

I see your concern. The pcplist can already be monopolized by existing API:

    while (i < 100)
        __get_free_page(GFP_NOWAIT | __GFP_NOWARN);

> > > I would also appreciate some more thoughts why we need to optimize for heavy abusers of RCU (like close(open) extremes).
> >
> > I think here is a small misunderstanding. Please note, that is not only about performance and corner cases. There is a single argument support of the kvfree_rcu(ptr), where maintaining an array in time is needed. The fallback of the single argument case is extrimely slow.
>
> This should be part of the changelog.

Hmm.. I think it is. Sorry if I missed that, but I hope I mentioned it.

> > Single-argument details is here: https://lkml.org/lkml/2020/4/28/1626
>
> Error 501

Could you please elaborate? I do not want to speculate :)

> > If depleted, we have a special worker that charge it. From the other hand, the pcplist can be depleted by its nature, what _is_ not wrong. But just in case we secure it since you had a concern about it.
>
> pcp free lists should ever get empty when we run out of memory and need to reclaim. Otherwise they are constantly refilled/rebalanced on demand. The fact that you are refilling them from outside just suggest that you are operating on a wrong layer. Really, create your own pool of pages and rebalance them based on the workload.

I covered this above.

> > Could you please specify a real test case or workload you are talking about?
>
> I am not a performance expert but essentially any memory allocator heavy workload might notice. I am pretty sure Mel would tell you more.

OK. Thank you for your comments, Michal!

--
Vlad Rezki
On Tue 22-09-20 15:12:57, Uladzislau Rezki wrote:
[...]
> > Mimicing a similar implementation shouldn't be all that hard and you will get your own pool which doesn't affect other page allocator users as much as a bonus.
>
> I see your point Michal. As i mentioned before, it is important to avoid of having such own pools, because the aim is not to waste memory resources. A page will be returned back to "page allocator" as soon as a scheduler place our reclaim thread on a CPU and grace period is passed. So, the resource can be used for other needs. What is important.
>
> Otherwise a memory footprint is increased what is bad for low memory conditions when OOM is involved. Just in case, it is a big issue for mobile devices.

Really, how much memory are we talking about here? Do you have any estimation? How many pointers do you need to store at once? 10k (that would be 20 pages per cpu)? Doesn't sound too big to me. But again, I do not know the scale here. Also, if you really care, you can fine-tune this pool based on demand. All that is not rocket science, and it can be tuned outside of the page allocator rather than the other way around. We will not move forward without any specific numbers here, I am afraid.

[...]

> > Would a similar scaling as the page allocator feasible. Really I mostly do care about shared nature of the pcp allocator list that one user can easily monopolize with this API.
>
> I see your concern. pcplist can be monopolized by already existing API:
>
>     while (i < 100)
>         __get_free_page(GFP_NOWAIT | __GFP_NOWARN);

They will usually not, because even non-sleeping allocations will refill them unless memory is scarce and memory reclaim is needed. As replied to Paul in the other email, this is not a question of correctness. It is a matter of shifting the overhead around.

> > > Single-argument details is here: https://lkml.org/lkml/2020/4/28/1626
> >
> > Error 501
>
> Could you please elaborate? I do not want to speculate :)

It threw a 501 at me. lkml.org is quite unreliable. It works now; I will read through it. Please use lore or lkml.kernel.org/r/$msg in the future.
On Tue, Sep 22, 2020 at 10:03:06AM +0200, Michal Hocko wrote:
> [...]
> > Not optimizing for them, but rather defending against them. Uladzislau gave the example of low-memory phones. And we have quite the array of defenses against other userspace bugs including MMUs, the "limit" command, and so on.
> >
> > There have been quite a few successful attempts, starting from the introduction of blimit and RCU-bh more than 15 years ago, continuing through making call_rcu() jump-start grace periods, IPIing reluctant CPUs, tuning RCU callback offloading, and many others. But these prior approaches have only taken us so far.
>
> Thank you this is useful. It gives me some idea about the problem but I still cannot picture how serious the problem really is. Is this a DoS like scenario where an unprivileged task is able to monopolize the system/CPU to do all the RCU heavy lifting? Are other RCU users deprived from doing their portion? How much expediting helps? Does it really solve the problem or it merely shifts it around as you will have hard time to keep up with more and more tasks to hit the same path and the pre-allocated memory is a very finite resource.

Yes, this situation can result in an unprivileged task being able to DoS some or even all of the system, which is why I have been pushing on this despite resistance. And again yes, expediting can help, and RCU does in fact take auto-expediting steps when any CPU's callback list exceeds 10,000 elements. These auto-expediting changes have also been made over a 15-year time period. Joel Fernandes is working on a patch series that will allow RCU to better choose when to auto-expedite and when to favor speedy RCU callback invocation. Currently RCU cannot tell, and so just does both whenever it starts getting into trouble.

Both auto-expediting and speedy RCU callback invocation help the memory allocator by freeing memory more quickly, thus reducing the system's memory footprint. Even more important, they help RCU callback invocation (and thus the freeing of memory) to better keep up with allocation in the various difficult situations, for example, the tight userspace loop around close(open()).

> > Other approaches under consideration include making CONFIG_PREEMPT_COUNT unconditional and thus allowing call_rcu() and kvfree_rcu() to determine whether direct calls to the allocator are safe (some guy named Linus doesn't like this one),
>
> I assume that the primary argument is the overhead, right? Do you happen to have any reference?

Jon Corbet wrote a very nice article summarizing the current situation: https://lwn.net/Articles/831678/. Thomas's measurements show no visible system-level performance impact. I will let Uladzislau present his more microbenchmarky performance work. We are pursuing this unconditional CONFIG_PREEMPT_COUNT approach as well, in keeping with our current de-facto policy of irritating everyone across the full expanse of the Linux-kernel source tree. :-/

> > preallocation (Uladzislau covered this, and the amount that would need to be preallocated is excessive), deferring allocation to RCU_SOFTIRQ (this would also need CONFIG_PREEMPT_COUNT), and deferring to some clean context (which is the best we can do within the confines of RCU, but which can have issues with delay).
>
> I have already replied to that so let's not split the discussion into several fronts.

I will look at other emails.

> > So it is not the need to address this general problem that is new. Far from it! What is new is the need for changes outside of RCU.
>
> > > > > I strongly agree with Thomas http://lkml.kernel.org/r/87tux4kefm.fsf@nanos.tec.linutronix.de that this optimization is not aiming at reasonable workloads. Really, go with pre-allocated buffer and fallback to whatever slow path you have already.
Exposing more internals of the allocator is not going to do any > > > > > good for long term maintainability. > > > > > > > > I suggest that you carefully re-read the thread following that email. > > > > > > I clearly remember Thomas not being particularly happy that you optimize > > > for a corner case. I do not remember there being a consensus that this > > > is the right approach. There was some consensus that this is better than > > > a gfp flag. Still quite bad though if you ask me. > > > > Again, this "optimization" is for robustness more than raw speed. > > > > > > Given a choice between making users unhappy and making developers > > > > unhappy, I will side with the users each and every time. > > > > > > Well, let me rephrase. It is not only about me (as a developer) being > > > unhappy but also all the side effects this would have for users when > > > performance of their favorite workload declines for no apparent reason > > > just because pcp caches are depleted by an unrelated process. > > > > But in the close(open()) case, wouldn't the allocations on the open() > > side refill those caches? > > Yes pcp lists will eventually get replenished. This is not a correctness > problem I am pointing out. It is the fact that any user of this new > interface can monopolize the _global_ pool which would affect all other > users of the pool. Understood. However, the kvfree_rcu() and call_rcu() allocations when under stress will be about one page per 500 objects on 64-bit systems. These 500 objects are there to begin with. So it is not at all clear to me that this additional one-page-per-500-objects memory overhead is significant. What about when not under stress? In that case, the absolute number of additional objects allocated will be small, so it is once again not at all clear to me that this additional memory overhead is significant. Thanx, Paul
On Tue, Sep 22, 2020 at 09:50:02AM +0200, Michal Hocko wrote: > [Cc Mel - the thread starts http://lkml.kernel.org/r/20200918194817.48921-1-urezki@gmail.com] > > On Mon 21-09-20 21:48:19, Uladzislau Rezki wrote: > > Hello, Michal. > > > > > > > > > > Yes, I do well remember that you are unhappy with this approach. > > > > Unfortunately, thus far, there is no solution that makes all developers > > > > happy. You might be glad to hear that we are also looking into other > > > > solutions, each of which makes some other developers unhappy. So we > > > > are at least not picking on you alone. :-/ > > > > > > No worries I do not feel like a whipping boy here. But do expect me to > > > argue against the approach. I would also appreciate it if there was some > > > more information on other attempts, why they have failed. E.g. why > > > pre-allocation is not an option that works well enough in most > > > reasonable workloads. > > Pre-allocating has some drawbacks: > > > > a) It is impossible to predict how many pages will be required to > > cover a demand that is controlled by different workloads on > > various systems. > > Yes, this is not trivial but not a rocket science either. Remember that > you are relying on a very dumb watermark based pcp pool from the > allocator. Mimicing a similar implementation shouldn't be all that hard > and you will get your own pool which doesn't affect other page allocator > users as much as a bonus. > > > b) Memory overhead since we do not know how much pages should be > > preloaded: 100, 200 or 300 > > Does anybody who really needs this optimization actually cares about 300 > pages? That would be 100-300 (maybe more) pages -per- -CPU-, so yes, some people will care quite deeply about this. Thanx, Paul > > As for memory overhead, it is important to reduce it because of > > embedded devices like phones, where a low memory condition is a > > big issue. In that sense pre-allocating is something that we strongly > > would like to avoid. 
> > How big "machines" are we talking about here? I would expect that really > tiny machines would have hard times to really fill up thousands of pages > with pointers to free... > > Would a similar scaling as the page allocator feasible. Really I mostly > do care about shared nature of the pcp allocator list that one user can > easily monopolize with this API. > > > > I would also appreciate some more thoughts why we > > > need to optimize for heavy abusers of RCU (like close(open) extremes). > > > > > I think here is a small misunderstanding. Please note, that is not only > > about performance and corner cases. There is a single argument support > > of the kvfree_rcu(ptr), where maintaining an array in time is needed. > > The fallback of the single argument case is extrimely slow. > > This should be part of the changelog. > > > > Single-argument details is here: https://lkml.org/lkml/2020/4/28/1626 > > Error 501 > > > > > > I strongly agree with Thomas http://lkml.kernel.org/r/87tux4kefm.fsf@nanos.tec.linutronix.de > > > > > that this optimization is not aiming at reasonable workloads. Really, go > > > > > with pre-allocated buffer and fallback to whatever slow path you have > > > > > already. Exposing more internals of the allocator is not going to do any > > > > > good for long term maintainability. > > > > > > > > I suggest that you carefully re-read the thread following that email. > > > > > > I clearly remember Thomas not being particularly happy that you optimize > > > for a corner case. I do not remember there being a consensus that this > > > is the right approach. There was some consensus that this is better than > > > a gfp flag. Still quite bad though if you ask me. > > > > > > > Given a choice between making users unhappy and making developers > > > > unhappy, I will side with the users each and every time. > > > > > > Well, let me rephrase. 
It is not only about me (as a developer) being > > > unhappy but also all the side effects this would have for users when > > > performance of their favorite workload declines for no apparent reason > > > just because pcp caches are depleted by an unrelated process. > > > > > If depleted, we have a special worker that charge it. From the other hand, > > the pcplist can be depleted by its nature, what _is_ not wrong. But just > > in case we secure it since you had a concern about it. > > pcp free lists should ever get empty when we run out of memory and need > to reclaim. Otherwise they are constantly refilled/rebalanced on demand. > The fact that you are refilling them from outside just suggest that you > are operating on a wrong layer. Really, create your own pool of pages > and rebalance them based on the workload. > > > Could you please specify a real test case or workload you are talking about? > > I am not a performance expert but essentially any memory allocator heavy > workload might notice. I am pretty sure Mel would tell you more. > > -- > Michal Hocko > SUSE Labs
On Tue, Sep 22, 2020 at 03:12:57PM +0200, Uladzislau Rezki wrote: > > > > > Yes, I do well remember that you are unhappy with this approach. > > > > > Unfortunately, thus far, there is no solution that makes all developers > > > > > happy. You might be glad to hear that we are also looking into other > > > > > solutions, each of which makes some other developers unhappy. So we > > > > > are at least not picking on you alone. :-/ > > > > > > > > No worries I do not feel like a whipping boy here. But do expect me to > > > > argue against the approach. I would also appreciate it if there was some > > > > more information on other attempts, why they have failed. E.g. why > > > > pre-allocation is not an option that works well enough in most > > > > reasonable workloads. > > > Pre-allocating has some drawbacks: > > > > > > a) It is impossible to predict how many pages will be required to > > > cover a demand that is controlled by different workloads on > > > various systems. > > > > Yes, this is not trivial but not a rocket science either. Remember that > > you are relying on a very dumb watermark based pcp pool from the > > allocator. > > > We rely on it, indeed. If the pcp-cache is depleted our special work is > triggered to charge our local cache(few pages) such way will also initiate > the process of pre-fetching pages from the buddy allocator populating > the depleted pcp-cache. I do not have any concern here. > It can interfere with ATOMIC allocations in critical paths in extreme circumstances as it potentially puts increased pressure on the emergency reserve as watermarks are bypassed. That adds to the risk of a functional failure if reclaim fails to make progress. The number of pages is likely to be limited and unpredictable. As it uses any PCP type, it potentially causes fragmentation issues. For the last point, the allocations may be transient in the RCU case now but not guaranteed forever.
As the API is in gfp.h, it's open to abuse so the next guy that comes along and thinks "I am critical no matter what the name says" will cause problems. While you could argue that would be caught in review, plenty of GFP flag abuses made it through review. Fundamentally, this is simply shifting the problem from RCU to the page allocator because of the locking arrangements and the hazard of acquiring the zone lock while a raw spinlock is held on RT. It does not even make the timing predictable as an empty PCP list (for example, a full drain in low memory situations) may mean the emergency path is hit anyway. About all it changes is the timing of when the emergency path is hit in some circumstances -- it's not fixing the problem, it's simply changing the shape. > > > > Mimicking a similar implementation shouldn't be all that hard > > and you will get your own pool which doesn't affect other page allocator > > users as much as a bonus. > > > I see your point Michal. As i mentioned before, it is important to avoid of > having such own pools, because the aim is not to waste memory resources. A > page will be returned back to "page allocator" as soon as a scheduler place > our reclaim thread on a CPU and grace period is passed. So, the resource > can be used for other needs. What is important. > As the emergency path and synchronising can be hit no matter what, why not increase the pool temporarily after the emergency path is hit and shrink it again later if necessary? > Otherwise a memory footprint is increased what is bad for low memory > conditions when OOM is involved. OOM would only be a major factor if the size of the pools meant the machine could not even operate or at least was severely degraded. However, depleting the PCPU lists for RCU may slow kswapd making reclaim progress and cause an OOM in itself, or at least an intervention by a userspace monitor that kills non-critical applications in the background when memory pressure exists.
> > > As for memory overhead, it is important to reduce it because of > > > embedded devices like phones, where a low memory condition is a > > > big issue. In that sense pre-allocating is something that we strongly > > > would like to avoid. > > > > How big "machines" are we talking about here? I would expect that really > > tiny machines would have hard times to really fill up thousands of pages > > with pointers to free... > > > I mentioned above. We can not rely on static model. We would like to > have a mechanism that gives back ASAP used pages to page allocator > for other needs. > After an emergency, temporarily increase the size of the pool to avoid hitting the emergency path again in the near future. > > > > Would a similar scaling as the page allocator feasible. Really I mostly > > do care about shared nature of the pcp allocator list that one user can > > easily monopolize with this API. > > > I see your concern. pcplist can be monopolized by already existing API: > > while (i < 100) > __get_free_page(GFP_NOWAIT | __GFP_NOWARN); > That's not the same class of abuse as it can go to the buddy lists to refill the correct PCP lists, avoid fragmentation issues, obeys watermarks and wakes kswapd if it's not awake already.
> > > Other approaches under consideration include making CONFIG_PREEMPT_COUNT > > > unconditional and thus allowing call_rcu() and kvfree_rcu() to determine > > > whether direct calls to the allocator are safe (some guy named Linus > > > doesn't like this one), > > > > I assume that the primary argument is the overhead, right? Do you happen > > to have any reference? > > Jon Corbet wrote a very nice article summarizing the current situation: > https://lwn.net/Articles/831678/. Thomas's measurements show no visible > system-level performance impact. I will let Uladzislau present his more > microbenchmarky performance work. > I have done some analysis of the !PREEMPT kernel with and without the PREEMPT_COUNT configuration. The aim is to show the performance impact if PREEMPT_COUNT is unconditionally enabled. As for the test, I used the refscale kernel module, which does:

<snip>
static void ref_rcu_read_section(const int nloops)
{
	int i;

	for (i = nloops; i >= 0; i--) {
		rcu_read_lock();
		rcu_read_unlock();
	}
}
<snip>

How to run the microbenchmark:

<snip>
urezki@pc638:~$ sudo modprobe refscale
<snip>

The below is the average duration per loop (nanoseconds):

  !PREEMPT_COUNT          PREEMPT_COUNT
  Runs  Time(ns)          Runs  Time(ns)
  1     109.640           1     99.915
  2     102.303           2     111.106
  3     90.520            3     98.713
  4     106.347           4     111.239
  5     108.374           5     111.797
  6     108.012           6     111.558
  7     103.989           7     113.122
  8     106.194           8     111.515
  9     107.330           9     107.559
  10    105.877           10    105.965
  11    104.860           11    104.835
  12    104.299           12    106.342
  13    104.794           13    106.664
  14    104.916           14    104.914
  15    105.485           15    104.280
  16    104.610           16    105.642
  17    104.981           17    105.646
  18    103.089           18    106.370
  19    105.251           19    105.284
  20    104.133           20    105.973
  21    105.589           21    105.271
  22    104.154           22    106.063
  23    104.963           23    106.248
  24    102.431           24    105.568
  25    102.610           25    105.556
  26    103.474           26    105.655
  27    100.194           27    102.887
  28    102.340           28    104.347
  29    102.075           29    102.389
  30    102.808           30    103.123

The difference is ~1.8% on average. The maximum value is 109.640 vs 113.122. The minimum value is 90.520 vs 98.713.
Tested on:

processor	: 63
vendor_id	: AuthenticAMD
cpu family	: 6
model		: 6
model name	: QEMU Virtual CPU version 2.5+
cpu MHz		: 3700.204

I also can do more detailed testing using "perf" tool.

--
Vlad Rezki
On Wed, Sep 23, 2020 at 11:37:06AM +0100, Mel Gorman wrote: > On Tue, Sep 22, 2020 at 03:12:57PM +0200, Uladzislau Rezki wrote: > > > > > > Yes, I do well remember that you are unhappy with this approach. > > > > > > Unfortunately, thus far, there is no solution that makes all developers > > > > > > happy. You might be glad to hear that we are also looking into other > > > > > > solutions, each of which makes some other developers unhappy. So we > > > > > > are at least not picking on you alone. :-/ > > > > > > > > > > No worries I do not feel like a whipping boy here. But do expect me to > > > > > argue against the approach. I would also appreciate it if there was some > > > > > more information on other attempts, why they have failed. E.g. why > > > > > pre-allocation is not an option that works well enough in most > > > > > reasonable workloads. > > > > Pre-allocating has some drawbacks: > > > > > > > > a) It is impossible to predict how many pages will be required to > > > > cover a demand that is controlled by different workloads on > > > > various systems. > > > > > > Yes, this is not trivial but not a rocket science either. Remember that > > > you are relying on a very dumb watermark based pcp pool from the > > > allocator. > > > > > We rely on it, indeed. If the pcp-cache is depleted our special work is > > triggered to charge our local cache(few pages) such way will also initiate > > the process of pre-featching pages from the buddy allocator populating > > the depleted pcp-cache. I do not have any concern here. > > It can interfere with ATOMIC allocations in critical paths in extreme > circumstances as it potentially puts increased pressure on the emergency > reserve as watermarks are bypassed. That adds to the risk of a functional > failuure if reclaim fails to make progress. The number of pages are likely > to be limited and unpredictable. As it uses any PCP type, it potentially > causes fragmention issues. 
For the last point, the allocations may be > transient in the RCU case now but not guaranteed forever. As the API is > in gfp.h, it's open to abuse so the next guy that comes along and thinks > "I am critical no matter what the name says" will cause problems. While > you could argue that would be caught in review, plenty of GFP flag abuses > made it through review. > > Fundamentally, this is simply shifting the problem from RCU to the page > allocator because of the locking arrangements and hazard of acquiring zone > lock is a raw spinlock is held on RT. It does not even make the timing > predictable as an empty PCU list (for example, a full drain in low memory > situations) may mean the emergency path is hit anyway. About all it changes > is the timing of when the emergency path is hit in some circumstances -- > it's not fixing the problem, it's simply changing the shape. All good points! On the other hand, duplicating a portion of the allocator functionality within RCU increases the amount of reserved memory, and needlessly most of the time. Is there some way that we can locklessly allocate memory, but return failure instead of running down the emergency pool? A change to the loop that iterates over the migration types? Or to the loop that iterates over the zones? Something else? > > > Mimicing a similar implementation shouldn't be all that hard > > > and you will get your own pool which doesn't affect other page allocator > > > users as much as a bonus. > > > > > I see your point Michal. As i mentioned before, it is important to avoid of > > having such own pools, because the aim is not to waste memory resources. A > > page will be returned back to "page allocator" as soon as a scheduler place > > our reclaim thread on a CPU and grace period is passed. So, the resource > > can be used for other needs. What is important. 
> > As the emergency path and synchronising can be hit no matter what, why > not increase the pool temporarily after the emergency path is hit and > shrink it again later if necessary? If I understand what you are suggesting, this is in fact what Uladzislau's prototyped commit 8c0a1269709d ("rcu/tree: Add a work to allocate pages from regular context") on the -rcu "dev" branch is intended to do. The issue, as Uladzislau noted above, is that scheduler delays can prevent these pool-increase actions until the point at which there is no memory. > > Otherwise a memory footprint is increased what is bad for low memory > > conditions when OOM is involved. > > OOM would only be a major factor if the size of the pools meant the > machine could not even operate or at least was severely degraded. However, > depleting the PCPU lists for RCU may slow kswapd making reclaim progress > and cause an OOM in itself, or at least an intervention by a userspace > monitor that kills non-critical applications in the background when memory > pressure exists. When under emergency conditions, we have one page allocated per 500 objects passed to kvfree_rcu(). So the increase in total allocated memory load due to this emergency path is quite small. > > > > As for memory overhead, it is important to reduce it because of > > > > embedded devices like phones, where a low memory condition is a > > > > big issue. In that sense pre-allocating is something that we strongly > > > > would like to avoid. > > > > > > How big "machines" are we talking about here? I would expect that really > > > tiny machines would have hard times to really fill up thousands of pages > > > with pointers to free... > > > > > I mentioned above. We can not rely on static model. We would like to > > have a mechanism that gives back ASAP used pages to page allocator > > for other needs. > > After an emergency, temporarily increase the size of the pool to avoid > hitting the emergency path again in the near future.
By which time we might well already be in OOM territory. The emergency situations can ramp up very quickly. > > > Would a similar scaling as the page allocator feasible. Really I mostly > > > do care about shared nature of the pcp allocator list that one user can > > > easily monopolize with this API. > > > > > I see your concern. pcplist can be monopolized by already existing API: > > > > while (i < 100) > > __get_free_page(GFP_NOWAIT | __GFP_NOWARN); > > That's not the same class of abuse as it can go to the buddy lists to > refill the correct PCP lists, avoid fragmentation issues, obeys watermarks > and wakes kswapd if it's not awake already. Good point, and we did try doing it this way. Unfortunately, in current !PREEMPT kernels, this approach can deadlock on one of the allocator locks via call_rcu(). This experience caused us to look at lockless allocator access. It also inspired the unconditional PREEMPT_COUNT approach, but that has its own detractors. (Yes, we are of course still pursuing it as well.) Thanx, Paul
On Wed, Sep 23, 2020 at 08:41:05AM -0700, Paul E. McKenney wrote: > > Fundamentally, this is simply shifting the problem from RCU to the page > > allocator because of the locking arrangements and hazard of acquiring zone > > lock is a raw spinlock is held on RT. It does not even make the timing > > predictable as an empty PCP list (for example, a full drain in low memory > > situations) may mean the emergency path is hit anyway. About all it changes > > is the timing of when the emergency path is hit in some circumstances -- > > it's not fixing the problem, it's simply changing the shape. > > All good points! > > On the other hand, duplicating a portion of the allocator functionality > within RCU increases the amount of reserved memory, and needlessly most > of the time. > But it's very similar to what mempools are for. > Is there some way that we can locklessly allocate memory, but return > failure instead of running down the emergency pool? A change to the loop > that iterates over the migration types? Or to the loop that iterates > over the zones? Something else? > Only duplicating some of the logic in get_page_from_freelist would protect the reserves. Even if you had that, the contents of the pcp pools can easily be zero for the local CPU and you still get stuck.
> > > > As the emergency path and synchronising can be hit no matter what, why > > not increase the pool temporarily after the emergency path is hit and > > shrink it again later if necessary? > > If I understand what you are suggesting, this is in fact what Uladzislau's > prototyped commit 8c0a1269709d ("rcu/tree: Add a work to allocate > pages from regular context") on the -rcu "dev" branch is intended to do. > The issue, as Uladzislau noted above, is that scheduler delays can prevent > these pool-increase actions until the point at which there is no memory. > Scheduler latency would be a problem. You would have to keep an emergency "rescue" pool that is enough to make slow progress even if no other memory is available until the worker takes action. The job of the worker that allocates pages from regular context would be to increase the pool size enough so the emergency reserve does not have to be used again in the near future. > > > Otherwise a memory footprint is increased what is bad for low memory > > > conditions when OOM is involved. > > > > OOM would only be a major factor if the size of the pools meant the > > machine could not even operate or at least was severely degraded. However, > > depleting the PCPU lists for RCU may slow kswapd making reclaim progress > > and cause an OOM in itself, or at least an intervention by a userspace > > monitor that kills non-critical applications in the background when memory > > pressure exists. > > When under emergency conditions, we have one page allocated per 500 > objects passed to kvfree_rcu(). So the increase in total allocated > memory load due to this emergency path is quite small. > Fair enough but any solution that depends on the PCP having even one page available is going to need a fallback option of some sort.
In that sense pre-allocating is something that we strongly > > > > > would like to avoid. > > > > > > > > How big "machines" are we talking about here? I would expect that really > > > > tiny machines would have hard times to really fill up thousands of pages > > > > with pointers to free... > > > > > > > I mentioned above. We can not rely on static model. We would like to > > > have a mechanism that gives back ASAP used pages to page allocator > > > for other needs. > > > > After an emergency, temporarily increase the size of the pool to avoid > > hitting the emergency path again in the near future. > > By which time we might well already be in OOM territory. The emergency > situations can ramp up very quickly. > And if the pcp lists are empty, this problem still exists -- pretty much anything that depends on the pcp lists always having pages is going to have a failure case. > > > > Would a similar scaling as the page allocator feasible. Really I mostly > > > > do care about shared nature of the pcp allocator list that one user can > > > > easily monopolize with this API. > > > > > > > I see your concern. pcplist can be monopolized by already existing API: > > > > > > while (i < 100) > > > __get_free_page(GFP_NOWAIT | __GFP_NOWARN); > > > > That's not the same class of abuse as it can go to the buddy lists to > > refill the correct PCP lists, avoid fragmentation issues, obeys watermarks > > and wakes kswapd if it's not awake already. > > Good point, and we did try doing it this way. Unfortunately, in current > !PREEMPT kernels, this approach can deadlock on one of the allocator > locks via call_rcu(). This experience caused us to look at lockless > allocator access. It also inspired the unconditional PREEMPT_COUNT > approach, but that has its own detractors. (Yes, we are of course still > persuing it as well.) > I didn't keep up to date unfortunately but pretty much anything that needs guaranteed access to pages has to maintain a mempool of some sort. 
Even if the page allocator had a common pool for emergency use, it would be subject to abuse because all it would take is one driver or subsystem deciding that it was special enough to deplete it. For example, the mlx4 driver decided that it should declare itself to be a "memory allocator" like kswapd for iSCSI (it claims swap-over-nfs, but swap-over-nfs was only meant to apply when a swapfile was backed by a network fs).
> On Wed, Sep 23, 2020 at 08:41:05AM -0700, Paul E. McKenney wrote: > > > Fundamentally, this is simply shifting the problem from RCU to the page > > > allocator because of the locking arrangements and hazard of acquiring zone > > > lock is a raw spinlock is held on RT. It does not even make the timing > > > predictable as an empty PCP list (for example, a full drain in low memory > > > situations) may mean the emergency path is hit anyway. About all it changes > > > is the timing of when the emergency path is hit in some circumstances -- > > > it's not fixing the problem, it's simply changing the shape. > > > > All good points! > > > > On the other hand, duplicating a portion of the allocator functionality > > within RCU increases the amount of reserved memory, and needlessly most > > of the time. > > > > But it's very similar to what mempools are for. > As for dynamic caching or mempools: they require extra logic on top of RCU to move things forward, and that might not be an efficient way. As a side effect, maintaining the bulk arrays in a separate worker thread will introduce other drawbacks:

a) There is an extra latency window, a time during which a fallback mechanism is used until pages are obtained via the special worker for further collecting of pointers into arrays.

b) It is impossible to predict how many pages will be required to cover a demand that is controlled by different workloads on various systems. It would require a rough value.

c) Extra memory footprint.

Why is a special worker required? It is because allocations with non-sleeping flags like GFP_ATOMIC or GFP_NOWAIT can apparently sleep, which is not supposed to happen. Therefore we proposed a lock-less flag as a first step, and later a lock-less function. Another option would be an unconditionally enabled PREEMPT_COUNT config. It would be easy to identify the context type and invoke the page allocator if the context is preemptible. But as of now preemptible() is "half" working. Thomas uploaded patches to make it unconditional.
But it can be blocked. If neither is accepted, we will need to work
around it with dynamic caching, at least for the !PREEMPT kernel.

> > Is there some way that we can locklessly allocate memory, but return
> > failure instead of running down the emergency pool? A change to the loop
> > that iterates over the migration types? Or to the loop that iterates
> > over the zones? Something else?
>
> Only by duplicating some of the logic in get_page_from_freelist would
> protect the reserves. Even if you had that, the contents of the pcp
> pools can easily be zero for the local CPU and you still get stuck.
>
We have a special worker that is triggered when a lock-less alloc fails.
By requesting a new page, the worker also initiates an internal process
that prefetches the specified number of elements from the buddy
allocator, populating the "pcplist" with new fresh pages.

> > > > > Mimicing a similar implementation shouldn't be all that hard
> > > > > and you will get your own pool which doesn't affect other page allocator
> > > > > users as much as a bonus.
> > > >
> > > > I see your point Michal. As i mentioned before, it is important to avoid of
> > > > having such own pools, because the aim is not to waste memory resources. A
> > > > page will be returned back to "page allocator" as soon as a scheduler place
> > > > our reclaim thread on a CPU and grace period is passed. So, the resource
> > > > can be used for other needs. What is important.
> > >
> > > As the emergency path and synchronising can be hit no matter what, why
> > > not increase the pool temporarily after the emergency path is hit and
> > > shrink it again later if necessary?
> >
> > If I understand what you are suggesting, this is in fact what Uladzislau's
> > prototyped commit 8c0a1269709d ("rcu/tree: Add a work to allocate
> > pages from regular context") on the -rcu "dev" branch is intended to do.
> > The issue, as Uladislau noted above, is that scheduler delays can prevent
> > these pool-increase actions until the point at which there is no memory.
>
> Scheduler latency would be a problem. You would have to keep an emergency
> "rescue" pool that is enough to make slow progress even if no other memory
> is available until the worker takes action. The worker that allocates
> pages from regular context would be to increase the pool size enough so
> the emergency reserve does not have to be used again in the near future.
>
The key point is "enough". We need pages to make a) fast progress b) support
single argument of kvfree_rcu(one_arg). Not vice versa. That "enough" depends
on scheduler latency and vague pre-allocated number of pages, it might
be not enough what would require to refill it more and more or we can
overshoot that would lead to memory overhead. So we have here timing
issues and not accurate model. IMHO.

> > > > Otherwise a memory footprint is increased what is bad for low memory
> > > > conditions when OOM is involved.
> > >
> > > OOM would only be a major factor if the size of the pools meant the
> > > machine could not even operate or at least was severely degraded. However,
> > > depleting the PCPU lists for RCU may slow kswapd making reclaim progress
> > > and cause an OOM in itself, or at least an intervention by a userspace
> > > monitor that kills non-critical applications in the background when memory
> > > pressure exists.
> >
> > When under emergency conditions, we have one page allocated per 500
> > objects passed to kvfree_rcu(). So the increase in total allocated
> > memory load due to this emergency path is quite small.
>
> Fair enough but any solution that depends on the PCP having even one
> page available is going to need a fallback option of some sort.
>
We have a fallback mechanism.
> > > > > > As for memory overhead, it is important to reduce it because of
> > > > > > embedded devices like phones, where a low memory condition is a
> > > > > > big issue. In that sense pre-allocating is something that we strongly
> > > > > > would like to avoid.
> > > > >
> > > > > How big "machines" are we talking about here? I would expect that really
> > > > > tiny machines would have hard times to really fill up thousands of pages
> > > > > with pointers to free...
> > > >
> > > > I mentioned above. We can not rely on static model. We would like to
> > > > have a mechanism that gives back ASAP used pages to page allocator
> > > > for other needs.
> > >
> > > After an emergency, temporarily increase the size of the pool to avoid
> > > hitting the emergency path again in the near future.
> >
> > By which time we might well already be in OOM territory. The emergency
> > situations can ramp up very quickly.
>
> And if the pcp lists are empty, this problem still exists -- pretty much
> anything that depends on the pcp lists always having pages is going to
> have a failure case.
>
But it is enough just to trigger __get_free_page() on that CPU so that
the PCP will be populated, which is what we do in a separate worker.

> > > > > Would a similar scaling as the page allocator feasible. Really I mostly
> > > > > do care about shared nature of the pcp allocator list that one user can
> > > > > easily monopolize with this API.
> > > >
> > > > I see your concern. pcplist can be monopolized by already existing API:
> > > >
> > > >     while (i < 100)
> > > >         __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> > >
> > > That's not the same class of abuse as it can go to the buddy lists to
> > > refill the correct PCP lists, avoid fragmentation issues, obeys watermarks
> > > and wakes kswapd if it's not awake already.
> >
> > Good point, and we did try doing it this way.
> > Unfortunately, in current !PREEMPT kernels, this approach can deadlock
> > on one of the allocator locks via call_rcu(). This experience caused
> > us to look at lockless allocator access. It also inspired the
> > unconditional PREEMPT_COUNT approach, but that has its own detractors.
> > (Yes, we are of course still pursuing it as well.)
>
> I didn't keep up to date unfortunately but pretty much anything that
> needs guaranteed access to pages has to maintain a mempool of some sort.
> Even if the page allocator had a common pool for emergency use, it would
> be subject to abuse because all it would take is one driver or subsystem
> deciding that it was special enough to deplete it. For example, the ml4
> driver decided that it should declare itself to be a "memory allocator"
> like kswapd for iSCSI (it claims swap-over-nfs, but swap-over-nfs was
> only meant to apply when a swapfile was backed by a network fs).
>
We certainly do not want to have many different memory allocators :)

As noted above, GFP_ATOMIC/GFP_NOWAIT can sleep, which is a problem for
atomic contexts. That is why we proposed a lock-less way of getting a
page, and you are not happy with that :)

The other option is to unconditionally enable the PREEMPT_COUNT config.
That would help a lot, not only for RCU but also for others that could
then use GFP_ATOMIC/GFP_NOWAIT and similar.

Appreciate your comments.

--
Vlad Rezki
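The shape of the proposal, a function that either takes an element from the current CPU's pcplist or returns NULL without acquiring any sleeping lock, can be modeled in user space with a lock-free list. This is only an illustrative sketch under stated assumptions: the names (`pcp_list_pop()`, `struct pcp_page`) are invented, and a C11 atomic stack stands in for the kernel's real per-CPU lists; it is not the patch itself.

```c
#include <stdatomic.h>
#include <stddef.h>

struct pcp_page {
	struct pcp_page *next;
};

/* Head of one modeled per-CPU free list. */
static _Atomic(struct pcp_page *) pcp_head;

/* Lock-free pop: return a cached entry or NULL; never block or sleep. */
struct pcp_page *pcp_list_pop(void)
{
	struct pcp_page *page = atomic_load(&pcp_head);

	while (page &&
	       !atomic_compare_exchange_weak(&pcp_head, &page, page->next))
		;	/* lost a race; 'page' now holds the new head, retry */
	return page;
}

/* Refill side, standing in for the worker that repopulates the list. */
void pcp_list_push(struct pcp_page *page)
{
	page->next = atomic_load(&pcp_head);
	while (!atomic_compare_exchange_weak(&pcp_head, &page->next, page))
		;	/* ->next was updated to the current head; retry */
}

/* Push two entries, then pop until the list reports empty via NULL. */
int pcp_demo(void)
{
	static struct pcp_page a, b;

	pcp_list_push(&a);
	pcp_list_push(&b);
	return pcp_list_pop() == &b &&
	       pcp_list_pop() == &a &&
	       pcp_list_pop() == NULL;
}
```

A caller that receives NULL simply takes the fallback path, which is the control-flow shape the kvfree_rcu() code wants in atomic context.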
On Thu, Sep 24, 2020 at 10:16:14AM +0200, Uladzislau Rezki wrote:
> Other option is if we had unconditionally enabled PREEMPT_COUNT config.
> It would be easy to identify a context type and invoke a page allocator
> if a context is preemptable. But as of now preemptible() is "half"
> working. Thomas uploaded patches to make it unconditional. But it can
> be blocked.

While I in principle support Thomas' patch, I think this is an abuse and
still completely wrong.

Not all preemptible() contexts can deal with GFP_KERNEL (GFP_NOFS
existing should be a clue). A !preemptible context cannot unconditionally
deal with GFP_NOWAIT; consider using it while holding zone->lock.

The preemption context does not say anything much useful about the
allocation context.
On Thu, Sep 24, 2020 at 10:16:14AM +0200, Uladzislau Rezki wrote:
> The key point is "enough". We need pages to make a) fast progress b) support
> single argument of kvfree_rcu(one_arg). Not vice versa. That "enough" depends
> on scheduler latency and vague pre-allocated number of pages, it might
> be not enough what would require to refill it more and more or we can
> overshoot that would lead to memory overhead. So we have here timing
> issues and not accurate model. IMHO.

I'm firmly opposed to the single argument kvfree_rcu() idea, that's
requiring memory to free memory.
On Thu, Sep 24, 2020 at 01:16:32PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 24, 2020 at 10:16:14AM +0200, Uladzislau Rezki wrote:
> > Other option is if we had unconditionally enabled PREEMPT_COUNT config.
> > It would be easy to identify a context type and invoke a page allocator
> > if a context is preemptable. But as of now preemptible() is "half"
> > working. Thomas uploaded patches to make it unconditional. But it can
> > be blocked.
>
> While I in principle support Thomas' patch, I think this is an abuse and
> still completely wrong.
>
Good that you support it :)

> Not all preemptible() contexts can deal with GFP_KERNEL (GFP_NOFS
> existing should be a clue). A !preemptible context cannot unconditionally
> deal with GFP_NOWAIT; consider using it while holding zone->lock.
>
Not sure if I fully follow you here. What I tried to express: if we had
preemptible() fully working, including for !PREEMPT, we could at least
detect the following:

	if (!preemptible())
		/* bypass using any ATOMIC/NOWAIT flags, they do not work */

because, as an example, you mentioned z->lock, which is sleepable.

> The preemption context does not say anything much useful about the
> allocation context.
>
I am talking about taking a decision.

--
Vlad Rezki
On Thu, Sep 24, 2020 at 01:19:07PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 24, 2020 at 10:16:14AM +0200, Uladzislau Rezki wrote:
> > The key point is "enough". We need pages to make a) fast progress b) support
> > single argument of kvfree_rcu(one_arg). Not vice versa. That "enough" depends
> > on scheduler latency and vague pre-allocated number of pages, it might
> > be not enough what would require to refill it more and more or we can
> > overshoot that would lead to memory overhead. So we have here timing
> > issues and not accurate model. IMHO.
>
> I'm firmly opposed to the single argument kvfree_rcu() idea, that's
> requiring memory to free memory.
>
Hmm.. The problem is there is a demand in it: please have a look at the
places in the kernel where people do not embed the rcu_head into their
structures and do like:

<snip>
	synchronize_rcu();
	kfree(p);
<snip>

<snip>
urezki@pc638:~/data/coding/linux-rcu.git$ find ./ -name "*.c" | xargs grep -C 1 -rn "synchronize_rcu" | grep kfree
./fs/nfs/sysfs.c-113-           kfree(old);
./fs/ext4/super.c-1708-         kfree(old_qname);
./kernel/trace/ftrace.c-5079-           kfree(direct);
./kernel/trace/ftrace.c-5156-           kfree(direct);
./kernel/trace/trace_probe.c-1087-              kfree(link);
./kernel/module.c-3910-         kfree(mod->args);
./net/core/sysctl_net_core.c-143-               kfree(cur);
./arch/x86/mm/mmio-mod.c-314-           kfree(found_trace);
./drivers/mfd/dln2.c-183-               kfree(i);
./drivers/block/drbd/drbd_state.c-2074-         kfree(old_conf);
./drivers/block/drbd/drbd_nl.c-1689-            kfree(old_disk_conf);
./drivers/block/drbd/drbd_nl.c-2522-            kfree(old_net_conf);
./drivers/block/drbd/drbd_nl.c-2935-            kfree(old_disk_conf);
./drivers/block/drbd/drbd_receiver.c-3805-              kfree(old_net_conf);
./drivers/block/drbd/drbd_receiver.c-4177-              kfree(old_disk_conf);
./drivers/ipack/carriers/tpci200.c-189-         kfree(slot_irq);
./drivers/crypto/nx/nx-842-pseries.c-1010-              kfree(old_devdata);
./drivers/net/ethernet/myricom/myri10ge/myri10ge.c-3583-                kfree(mgp->ss);
./drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c:286:               synchronize_rcu(); /* before kfree(flow) */
./drivers/net/ethernet/mellanox/mlxsw/core.c-1574-              kfree(rxl_item);
./drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c-6642-         kfree(adapter->mbox_log);
./drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c-6644-         kfree(adapter);
./drivers/infiniband/hw/hfi1/sdma.c-1337-               kfree(dd->per_sdma);
./drivers/infiniband/core/device.c:2164:         * synchronize_rcu before the netdev is kfreed, so we
./drivers/misc/vmw_vmci/vmci_context.c-692-             kfree(notifier);
./drivers/misc/vmw_vmci/vmci_event.c-213-               kfree(s);
./drivers/staging/fwserial/fwserial.c-2122-             kfree(peer);
urezki@pc638:~/data/coding/linux-rcu.git$
<snip>

--
Vlad Rezki
On Thu, Sep 24, 2020 at 01:19:07PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 24, 2020 at 10:16:14AM +0200, Uladzislau Rezki wrote:
> > The key point is "enough". We need pages to make a) fast progress b) support
> > single argument of kvfree_rcu(one_arg). Not vice versa. That "enough" depends
> > on scheduler latency and vague pre-allocated number of pages, it might
> > be not enough what would require to refill it more and more or we can
> > overshoot that would lead to memory overhead. So we have here timing
> > issues and not accurate model. IMHO.
>
> I'm firmly opposed to the single argument kvfree_rcu() idea, that's
> requiring memory to free memory.

Not quite.

First, there is a fallback when memory allocation fails. Second,
in heavy-use situations, there is only one allocation per about
500 kvfree_rcu() calls on 64-bit systems. Third, there are other
long-standing situations that require allocating memory in order to
free memory.

So I agree that it is a good general rule of thumb to avoid allocating
on free paths, but there are exceptions. This is one of them.

							Thanx, Paul
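The "one allocation per about 500 kvfree_rcu() calls" figure follows from the block layout described elsewhere in the thread: each block is one page holding PAGE_SIZE / sizeof(void *) pointer slots, minus a small header in the real code. A quick user-space sanity check, assuming 4 KiB pages (MODEL_PAGE_SIZE is a stand-in for the kernel's PAGE_SIZE):

```c
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Pointer slots available if a whole page is used to store pointers. */
unsigned int ptrs_per_page(void)
{
	return MODEL_PAGE_SIZE / sizeof(void *);
}
```

On a 64-bit build this gives 512 slots per page, so after subtracting the block-header overhead one page absorbs roughly 500 kvfree_rcu() calls.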
On Thu 24-09-20 10:16:14, Uladzislau Rezki wrote:
> > On Wed, Sep 23, 2020 at 08:41:05AM -0700, Paul E. McKenney wrote:
> > > > Fundamentally, this is simply shifting the problem from RCU to the page
> > > > allocator because of the locking arrangements and hazard of acquiring zone
> > > > lock is a raw spinlock is held on RT. It does not even make the timing
> > > > predictable as an empty PCU list (for example, a full drain in low memory
> > > > situations) may mean the emergency path is hit anyway. About all it changes
> > > > is the timing of when the emergency path is hit in some circumstances --
> > > > it's not fixing the problem, it's simply changing the shape.
> > >
> > > All good points!
> > >
> > > On the other hand, duplicating a portion of the allocator functionality
> > > within RCU increases the amount of reserved memory, and needlessly most
> > > of the time.
> > >
> > But it's very similar to what mempools are for.
>
> As for dynamic caching or mempools. It requires extra logic on top of RCU
> to move things forward and it might be not efficient way. As a side
> effect, maintaining of the bulk arrays in the separate worker thread
> will introduce other drawbacks:

This is true but it is also true that it is RCU to require this special
logic and we can expect that we might need to fine tune this logic
depending on the RCU usage. We definitely do not want to tune the
generic page allocator for a very specific usecase, do we?

> a) There is an extra latency window, a time during which a fallback
>    mechanism is used until pages are obtained via the special
>    worker for further pointers collecting over arrays.

This will be always the case for async refilling. More importantly this
will be a bigger problem at the page allocator level which has other
users other than RCU so more users are suffering...

> b) It is impossible to predict how many pages will be required to
>    cover a demand that is controlled by different workloads on
>    various systems. It would require a rough value.

I have asked for some actual numbers for real life scenarios this work
is meant to cover. There was nothing presented so far. We can hand wave
for ever but this will not move us forward much. As I've said in other
email, few dozens pages per CPU by default will hardly get noticeable.
You have a trivial initial implementation and can build on top in
incremental steps. You can kick a background allocator to start new
allocations when the pool hits a watermark and aggressivelly remove
cached pages when they are not used. You will have quite a freedom to
fine tune the scheme which is much harder for the page allocator because
there are many other consumers.

Anyway, I am afraid that we are going in circles here. We do not have
any meaningful numbers to claim memory footprint problems. There is a
clear opposition to hook into page allocator for reasons already
mentioned. You are looking for a dedicated memory pool and it should be
quite trivial to develop one and fine tune it for your specific usecase.
All that on top of page allocator. Unless this is seen as completely
unfeasible based on some solid arguments then we can start talking about
the page allocator itself.
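The dedicated-pool scheme described here, refill from a background worker when a low watermark is hit and trim the cache when it sits unused, can be sketched in user space. All names, watermark values, and the malloc() stand-in for __get_free_page() are illustrative assumptions, not kernel code:

```c
#include <stdlib.h>

#define POOL_LOW   4	/* kick a refill when fewer pages than this remain */
#define POOL_HIGH 16	/* trim back down toward POOL_LOW above this */

static void *pool[POOL_HIGH];
static int pool_count;

/* Background refill, run from process context where sleeping is fine. */
void pool_refill(void)
{
	while (pool_count < POOL_HIGH) {
		void *page = malloc(4096);	/* stand-in for __get_free_page() */

		if (!page)
			break;
		pool[pool_count++] = page;
	}
}

/* Return cached pages to the system when the pool sits unused. */
void pool_trim(void)
{
	while (pool_count > POOL_LOW)
		free(pool[--pool_count]);
}

/* Fast path: take a cached page and flag whether a refill is needed. */
void *pool_get(int *need_refill)
{
	void *page = pool_count ? pool[--pool_count] : NULL;

	*need_refill = (pool_count < POOL_LOW);
	return page;
}

/* Drain the whole pool; the final get reports that a refill is due. */
int pool_demo(void)
{
	int need = 0;
	void *page;

	pool_refill();
	while ((page = pool_get(&need)))
		free(page);
	return need;
}
```

The point of contention in the thread is exactly the window between `*need_refill` being raised and the worker actually running, during which `pool_get()` returns NULL and a fallback is required.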
On Thu, Sep 24, 2020 at 05:21:12PM +0200, Uladzislau Rezki wrote:
> On Thu, Sep 24, 2020 at 01:19:07PM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 24, 2020 at 10:16:14AM +0200, Uladzislau Rezki wrote:
> > > The key point is "enough". We need pages to make a) fast progress b) support
> > > single argument of kvfree_rcu(one_arg). Not vice versa. That "enough" depends
> > > on scheduler latency and vague pre-allocated number of pages, it might
> > > be not enough what would require to refill it more and more or we can
> > > overshoot that would lead to memory overhead. So we have here timing
> > > issues and not accurate model. IMHO.
> >
> > I'm firmly opposed to the single argument kvfree_rcu() idea, that's
> > requiring memory to free memory.
> >
> Hmm.. The problem is there is a demand in it:

People demand ponies all the time, the usual answer is: No.
On Thu, Sep 24, 2020 at 08:38:34AM -0700, Paul E. McKenney wrote:
> On Thu, Sep 24, 2020 at 01:19:07PM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 24, 2020 at 10:16:14AM +0200, Uladzislau Rezki wrote:
> > > The key point is "enough". We need pages to make a) fast progress b) support
> > > single argument of kvfree_rcu(one_arg). Not vice versa. That "enough" depends
> > > on scheduler latency and vague pre-allocated number of pages, it might
> > > be not enough what would require to refill it more and more or we can
> > > overshoot that would lead to memory overhead. So we have here timing
> > > issues and not accurate model. IMHO.
> >
> > I'm firmly opposed to the single argument kvfree_rcu() idea, that's
> > requiring memory to free memory.
>
> Not quite.
>
> First, there is a fallback when memory allocation fails. Second,
> in heavy-use situations, there is only one allocation per about
> 500 kvfree_rcu() calls on 64-bit systems. Third, there are other
> long-standing situations that require allocating memory in order to
> free memory.

Some of which are quite broken. And yes, I'm aware of all that, I'm the
one that started swap-over-NFS, which requires network traffic to free
memory, which is one insane step further.

But the way to make that 'work' is carefully account and pre-allocate
(or size the reserve) the required memory to make progress and to
strictly limit concurrency to ensure you stay in your bounds.

> So I agree that it is a good general rule of thumb to avoid allocating
> on free paths, but there are exceptions. This is one of them.

The very first thing you need to do is proof your memory usage is
bounded, and then calculate your bound.

The problem is that with RCU you can't limit concurrency. call_rcu()
can't block, you can't wait for a grace period to end when you've ran
out of your reserve.

That is, you don't have a bound, so no reserve what so ever is going to
help. You must have that callback_head fallback.
On Fri, Sep 25, 2020 at 10:15:52AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 24, 2020 at 05:21:12PM +0200, Uladzislau Rezki wrote:
> > On Thu, Sep 24, 2020 at 01:19:07PM +0200, Peter Zijlstra wrote:
> > > On Thu, Sep 24, 2020 at 10:16:14AM +0200, Uladzislau Rezki wrote:
> > > > The key point is "enough". We need pages to make a) fast progress b) support
> > > > single argument of kvfree_rcu(one_arg). Not vice versa. That "enough" depends
> > > > on scheduler latency and vague pre-allocated number of pages, it might
> > > > be not enough what would require to refill it more and more or we can
> > > > overshoot that would lead to memory overhead. So we have here timing
> > > > issues and not accurate model. IMHO.
> > >
> > > I'm firmly opposed to the single argument kvfree_rcu() idea, that's
> > > requiring memory to free memory.
> > >
> > Hmm.. The problem is there is a demand in it:
>
> People demand ponies all the time, the usual answer is: No.
>
I see your view. On the other hand it is clear that there is still
demand for it:

<snip>
void ext4_kvfree_array_rcu(void *to_free)
{
	struct ext4_rcu_ptr *ptr = kzalloc(sizeof(*ptr), GFP_KERNEL);

	if (ptr) {
		ptr->ptr = to_free;
		call_rcu(&ptr->rcu, ext4_rcu_ptr_callback);
		return;
	}
	synchronize_rcu();
	kvfree(to_free);
}
<snip>

--
Vlad Rezki
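The two-path shape of the ext4 snippet above can be modeled in user space. This is only a sketch of the control flow (allocate a small wrapper and defer the free, or wait for readers and free inline): stub_call_rcu_free() and stub_synchronize_rcu() are invented stand-ins for call_rcu()/synchronize_rcu(), so no real RCU semantics are involved.

```c
#include <stdlib.h>

struct rcu_ptr {
	void *ptr;
};

static int deferred_frees;	/* counts objects that took the deferred path */

/* Stand-in for call_rcu(): real code frees after a grace period. */
static void stub_call_rcu_free(struct rcu_ptr *p)
{
	free(p->ptr);
	free(p);
	deferred_frees++;
}

/* Stand-in for synchronize_rcu(): waiting for readers is elided here. */
static void stub_synchronize_rcu(void) { }

void kvfree_deferred(void *to_free)
{
	struct rcu_ptr *p = calloc(1, sizeof(*p));

	if (p) {
		p->ptr = to_free;
		stub_call_rcu_free(p);	/* fast path: needs a wrapper allocation */
		return;
	}
	stub_synchronize_rcu();		/* slow path: no allocation required */
	free(to_free);
}

/* Free one object; under normal conditions the deferred path is taken. */
int kvfree_deferred_demo(void)
{
	deferred_frees = 0;
	kvfree_deferred(malloc(32));
	return deferred_frees;
}
```

The pattern's cost is visible here: the common case needs an extra allocation per freed object, which is exactly what the per-page pointer arrays in kvfree_rcu() amortize away.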
> > > > All good points!
> > > >
> > > > On the other hand, duplicating a portion of the allocator functionality
> > > > within RCU increases the amount of reserved memory, and needlessly most
> > > > of the time.
> > > >
> > > But it's very similar to what mempools are for.
> > >
> > As for dynamic caching or mempools. It requires extra logic on top of RCU
> > to move things forward and it might be not efficient way. As a side
> > effect, maintaining of the bulk arrays in the separate worker thread
> > will introduce other drawbacks:
>
> This is true but it is also true that it is RCU to require this special
> logic and we can expect that we might need to fine tune this logic
> depending on the RCU usage. We definitely do not want to tune the
> generic page allocator for a very specific usecase, do we?
>
I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
to provide a memory service for contexts which are not allowed to
sleep, RCU is part of them. Both flags used to provide such ability
before but not anymore.

Do you agree with it?

> > do we?
>
If you agree with the above, then there is a reasonable question of what
to do. Converting zone->lock into a raw variant, the simplest and most
generic solution, has been declined (RT people do not like it and see a
side effect); the NO_LOCK flag together with the lock-less function were
declined as well. All those suggestions were efforts to find the best
way to go. That is why we highlighted it outside of RCU. If you do not
agree, then at least the documentation of the mentioned flags should be
changed, IMHO. We, in our turn, will focus on fixing it differently.

> > a) There is an extra latency window, a time during which a fallback
> >    mechanism is used until pages are obtained via the special
> >    worker for further pointers collecting over arrays.
>
> This will be always the case for async refilling. More importantly this
> will be a bigger problem at the page allocator level which has other
> users other than RCU so more users are suffering...
>
Agree on that, async refilling has its disadvantages.

> > b) It is impossible to predict how many pages will be required to
> >    cover a demand that is controlled by different workloads on
> >    various systems. It would require a rough value.
>
> I have asked for some actual numbers for real life scenarios this work
> is meant to cover. There was nothing presented so far. We can hand wave
> for ever but this will not move us forward much. As I've said in other
> email, few dozens pages per CPU by default will hardly get noticeable.
> You have a trivial initial implementation and can build on top in
> incremental steps. You can kick a background allocator to start new
> allocations when the pool hits a watermark and aggressivelly remove
> cached pages when they are not used. You will have quite a freedom to
> fine tune the scheme which is much harder for the page allocator because
> there are many other consumers.
>
> Anyway, I am afraid that we are going in circles here. We do not have
> any meaningful numbers to claim memory footprint problems. There is a
> clear opposition to hook into page allocator for reasons already
> mentioned. You are looking for a dedicated memory pool and it should be
> quite trivial to develop one and fine tune it for your specific usecase.
> All that on top of page allocator. Unless this is seen as completely
> unfeasible based on some solid arguments then we can start talking about
> the page allocator itself.
>
We have only synthetic tests. As for real life scenarios, on highly
loaded servers we will have one picture, on small devices like phones
we will have another one. As I mentioned in my previous emails, in the
worst case on my system it requires ~400 pages to cover the bandwidth.
But, again, we would like to find the best way to go.

As I mentioned, if Thomas' patches are in place, we do not need to work
around everything regarding dynamic caching or async preloading and so
on.

Thank you Michal!

--
Vlad Rezki
On Fri 25-09-20 17:31:29, Uladzislau Rezki wrote:
> > > > > All good points!
> > > > >
> > > > > On the other hand, duplicating a portion of the allocator functionality
> > > > > within RCU increases the amount of reserved memory, and needlessly most
> > > > > of the time.
> > > > >
> > > > But it's very similar to what mempools are for.
> > > >
> > > As for dynamic caching or mempools. It requires extra logic on top of RCU
> > > to move things forward and it might be not efficient way. As a side
> > > effect, maintaining of the bulk arrays in the separate worker thread
> > > will introduce other drawbacks:
> >
> > This is true but it is also true that it is RCU to require this special
> > logic and we can expect that we might need to fine tune this logic
> > depending on the RCU usage. We definitely do not want to tune the
> > generic page allocator for a very specific usecase, do we?
> >
> I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
> to provide a memory service for contexts which are not allowed to
> sleep, RCU is part of them. Both flags used to provide such ability
> before but not anymore.
>
> Do you agree with it?

Yes this sucks. But this is something that we likely really want to live
with. We have to explicitly _document_ that really atomic contexts in RT
cannot use the allocator. From the past discussions we've had this is
likely the most reasonable way forward because we do not really want to
encourage anybody to do something like that and there should be ways
around that. The same is btw. true also for !RT. The allocator is not
NMI safe and while we should be able to make it compatible I am not
convinced we really want to.

Would something like this be helpful wrt documentation?

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 67a0774e080b..9fcd47606493 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -238,7 +238,9 @@ struct vm_area_struct;
  * %__GFP_FOO flags as necessary.
  *
  * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
- * watermark is applied to allow access to "atomic reserves"
+ * watermark is applied to allow access to "atomic reserves".
+ * The current implementation doesn't support NMI and other non-preemptive context
+ * (e.g. raw_spin_lock).
  *
  * %GFP_KERNEL is typical for kernel-internal allocations. The caller requires
  * %ZONE_NORMAL or a lower zone for direct access but can direct reclaim.
[...]
On Fri, Sep 25, 2020 at 05:31:29PM +0200, Uladzislau Rezki wrote:
> > > > > All good points!
> > > > >
> > > > > On the other hand, duplicating a portion of the allocator functionality
> > > > > within RCU increases the amount of reserved memory, and needlessly most
> > > > > of the time.
> > > > >
> > > > But it's very similar to what mempools are for.
> > > >
> > > As for dynamic caching or mempools. It requires extra logic on top of RCU
> > > to move things forward and it might be not efficient way. As a side
> > > effect, maintaining of the bulk arrays in the separate worker thread
> > > will introduce other drawbacks:
> >
> > This is true but it is also true that it is RCU to require this special
> > logic and we can expect that we might need to fine tune this logic
> > depending on the RCU usage. We definitely do not want to tune the
> > generic page allocator for a very specific usecase, do we?
> >
> I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
> to provide a memory service for contexts which are not allowed to
> sleep, RCU is part of them. Both flags used to provide such ability
> before but not anymore.
>
> Do you agree with it?

I was led to believe that the problem was taking the zone lock while
holding a raw spinlock that was specific to RCU.

Are you saying that GFP_ATOMIC/GFP_NOWAIT users are also holding raw
spinlocks at the same time on RT?
On Fri, Sep 25, 2020 at 05:17:12PM +0100, Mel Gorman wrote:
> On Fri, Sep 25, 2020 at 05:31:29PM +0200, Uladzislau Rezki wrote:
> > > > > > All good points!
> > > > > >
> > > > > > On the other hand, duplicating a portion of the allocator functionality
> > > > > > within RCU increases the amount of reserved memory, and needlessly most
> > > > > > of the time.
> > > > > >
> > > > > But it's very similar to what mempools are for.
> > > > >
> > > > As for dynamic caching or mempools. It requires extra logic on top of RCU
> > > > to move things forward and it might be not efficient way. As a side
> > > > effect, maintaining of the bulk arrays in the separate worker thread
> > > > will introduce other drawbacks:
> > >
> > > This is true but it is also true that it is RCU to require this special
> > > logic and we can expect that we might need to fine tune this logic
> > > depending on the RCU usage. We definitely do not want to tune the
> > > generic page allocator for a very specific usecase, do we?
> > >
> > I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
> > to provide a memory service for contexts which are not allowed to
> > sleep, RCU is part of them. Both flags used to provide such ability
> > before but not anymore.
> >
> > Do you agree with it?
>
> I was led to believe that the problem was taking the zone lock while
> holding a raw spinlock that was specific to RCU.
>
In RCU code we hold a raw spinlock, because kfree_rcu() should follow
the call_rcu() rules and work in atomic contexts. So we can not enter
the page allocator, because it uses the spinlock_t zone->lock (which is
sleepable on RT). This is because of the CONFIG_PROVE_RAW_LOCK_NESTING
option and CONFIG_PREEMPT_RT.

> Are you saying that GFP_ATOMIC/GFP_NOWAIT users are also holding raw
> spinlocks at the same time on RT?
>
I do not say that. And it is not possible, because zone->lock has the
spinlock_t type. So, in the case of CONFIG_PREEMPT_RT, you will hit a
"BUG: scheduling while atomic" if the allocator is called while a raw
lock is held, irqs are disabled, or preempt_disable() was done at a
higher level.

--
Vlad Rezki
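The nesting constraint being explained, a sleeping spinlock_t such as zone->lock must not be taken while a raw_spinlock_t is held on PREEMPT_RT, can be illustrated with a tiny user-space model. Everything below is invented for illustration: a flag stands in for "raw lock held / preemption off", and the allocator stub refuses instead of sleeping.

```c
#include <stddef.h>

static int in_raw_atomic;	/* models "raw_spin_lock() held, preemption off" */

void model_raw_spin_lock(void)   { in_raw_atomic = 1; }
void model_raw_spin_unlock(void) { in_raw_atomic = 0; }

/*
 * Models entering the page allocator, which takes the sleeping
 * spinlock_t zone->lock on PREEMPT_RT: in raw-atomic context it must
 * refuse (return 0) -- the real kernel would instead splat with
 * "BUG: scheduling while atomic".
 */
int model_alloc_page(void)
{
	if (in_raw_atomic)
		return 0;
	return 1;
}

/* Attempt an allocation with the "raw lock" held, as kfree_rcu() would. */
int nesting_demo(void)
{
	int ok;

	model_raw_spin_lock();
	ok = model_alloc_page();
	model_raw_spin_unlock();
	return ok;
}
```

The lock-less grab discussed in the thread is one way to make the `in_raw_atomic` case return a useful page sometimes instead of always failing.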
On Fri, Sep 25, 2020 at 10:26:18AM +0200, Peter Zijlstra wrote: > On Thu, Sep 24, 2020 at 08:38:34AM -0700, Paul E. McKenney wrote: > > On Thu, Sep 24, 2020 at 01:19:07PM +0200, Peter Zijlstra wrote: > > > On Thu, Sep 24, 2020 at 10:16:14AM +0200, Uladzislau Rezki wrote: > > > > The key point is "enough". We need pages to make a) fast progress b) support > > > > single argument of kvfree_rcu(one_arg). Not vice versa. That "enough" depends > > > > on scheduler latency and vague pre-allocated number of pages, it might > > > > be not enough what would require to refill it more and more or we can overshoot > > > > that would lead to memory overhead. So we have here timing issues and > > > > not accurate model. IMHO. > > > > > > I'm firmly opposed to the single argument kvfree_rcu() idea, that's > > > requiring memory to free memory. > > > > Not quite. > > > > First, there is a fallback when memory allocation fails. Second, > > in heavy-use situations, there is only one allocation per about > > 500 kvfree_rcu() calls on 64-bit systems. Third, there are other > > long-standing situations that require allocating memory in order to > > free memory. > > Some of which are quite broken. And yes, I'm aware of all that, I'm the > one that started swap-over-NFS, which requires network traffic to free > memory, which is one insane step further. I could easily imagine that experience might have left some scars. > But the way to make that 'work' is carefully account and pre-allocate > (or size the reserve) the required memory to make progress and to > strictly limit concurrency to ensure you stay in your bounds. But your situation is different. When swapping over NFS, if you cannot allocate the memory to do the I/O, you cannot free the memory you are attempting to swap out, at least not unless you can kill the corresponding process. So if you don't want to kill processes, as you say, worst case is what matters. The kvfree_rcu() situation is rather different. 
In all cases, there is a fallback, namely using the existing rcu_head
for double-argument kvfree_rcu() or falling back to synchronize_rcu()
for single-argument kvfree_rcu(). As long as these fallbacks are
sufficiently rare, the system will probably survive.

> > So I agree that it is a good general rule of thumb to avoid allocating
> > on free paths, but there are exceptions. This is one of them.
>
> The very first thing you need to do is proof your memory usage is
> bounded, and then calculate your bound.

Again, you are confusing your old swap-over-NFS scars with the current
situation. They really are not the same.

> The problem is that with RCU you can't limit concurrency. call_rcu()
> can't block, you can't wait for a grace period to end when you've ran
> out of your reserve.
>
> That is, you don't have a bound, so no reserve what so ever is going to
> help.

Almost. A dedicated reserve large enough to result in sufficiently low
use of the fallback paths would be too large. Again, we can tolerate a
small fraction of requests taking the fallback, with emphasis on
"small".

> You must have that callback_head fallback.

And we do have that callback_head fallback. And in the case of
single-argument kvfree_rcu(), that synchronize_rcu() fallback. And as
long as we can avoid using those fallbacks almost all the time, things
will be OK. But we do need to be able to allocate memory in the common
case when there is memory to be had.

							Thanx, Paul
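The fallback scheme Paul describes (use the page-backed bulk array when a page can be had, otherwise fall back to the embedded rcu_head or synchronize_rcu()) can be modeled in plain userspace C. All names below are illustrative stand-ins, not the kernel code; the `page_alloc_fails` flag models `__get_free_page(GFP_NOWAIT)` returning NULL:

```c
/*
 * Userspace model of the fallback scheme described above. All names
 * are illustrative; the real logic lives in kernel/rcu/tree.c. When
 * no block can be obtained, the caller must take the rcu_head or
 * synchronize_rcu() fallback instead of the bulk array.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define PTRS_PER_BLOCK 4  /* kernel: PAGE_SIZE / sizeof(void *) */

struct bulk_block {
	struct bulk_block *next;      /* previous "head"            */
	int nr;                       /* pointers stored so far     */
	void *ptrs[PTRS_PER_BLOCK];
};

static bool page_alloc_fails;         /* simulate GFP_NOWAIT failing */

static struct bulk_block *alloc_block(void)
{
	return page_alloc_fails ? NULL : calloc(1, sizeof(struct bulk_block));
}

/*
 * Try to queue ptr in a bulk block. Returns false when no block can
 * be obtained; the caller then uses the per-object fallback path.
 */
static bool queue_kvfree(struct bulk_block **head, void *ptr)
{
	struct bulk_block *b = *head;

	if (!b || b->nr == PTRS_PER_BLOCK) {
		b = alloc_block();
		if (!b)
			return false;    /* fallback: rcu_head / synchronize_rcu() */
		b->next = *head;         /* new block points at the old head */
		*head = b;
	}
	b->ptrs[b->nr++] = ptr;
	return true;
}
```

As in the thread's description, the common case costs one allocation per PTRS_PER_BLOCK queued pointers, and an allocation failure degrades to the per-object path rather than losing the request.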
On 9/18/20 9:48 PM, Uladzislau Rezki (Sony) wrote:
> Some background and kfree_rcu()
> ===============================
> The pointers to be freed are stored in per-cpu arrays to improve
> performance, to enable an easier-to-use API, to accommodate vmalloc
> memory and to support a single argument of kfree_rcu() when only
> a pointer is passed. More details are below.
>
> In order to maintain such per-CPU arrays, dynamic allocation is needed
> when a current array is fully populated and a new block is required.
> See the example below:
>
>         0 1 2 3         0 1 2 3
>        |p|p|p|p| ->    |p|p|p|p| -> NULL
>
> there are two pointer-blocks, each one can store 4 addresses
> which will be freed after a grace period has passed. In reality
> we store PAGE_SIZE / sizeof(void *). So to maintain such blocks
> a single page is obtained via the page allocator:
>
>     bnode = (struct kvfree_rcu_bulk_data *)
>         __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
>
> after that it is attached to the "head" and its "next" pointer is
> set to the previous "head", so the list of blocks can be maintained and
> grow dynamically until it gets drained by the reclaiming thread.
>
> Please note: there is always a fallback if an allocation fails. In the
> single-argument case, this is a call to synchronize_rcu(), and for the
> two-argument case it is to use the rcu_head structure embedded in the
> object being freed, paying the cache-miss penalty and invoking kfree()
> per object instead of kfree_bulk() for groups of objects.
>
> Why we maintain arrays/blocks instead of linking objects by the regular
> "struct rcu_head" technique. See below a few main reasons:
>
> a) Memory can be reclaimed by invoking the kfree_bulk()
>    interface, which requires passing an array and the number of
>    entries in it. That reduces the per-object overhead caused
>    by calling kfree() per object. This reduces the reclamation
>    time.
>
> b) Improves locality and reduces the number of cache misses due to
>    "pointer chasing" between objects, which can be spread far from
>    each other.
>
> c) Support a "single argument" in the kvfree_rcu():
>
>        void *ptr = kvmalloc(some_bytes, GFP_KERNEL);
>        if (ptr)
>            kvfree_rcu(ptr);
>
>    We need it when an "rcu_head" is not embedded into a structure but
>    an object must be freed after a grace period. Therefore, for the
>    single argument, such objects cannot be queued on a linked list.
>
>    So nowadays, since we do not have a single argument but we see the
>    demand for it, to work around it people just use a simple,
>    inefficient sequence:
>    <snip>
>        synchronize_rcu(); /* Can be long and blocks a current context */
>        kfree(p);
>    <snip>
>
>    More details are here: https://lkml.org/lkml/2020/4/28/1626
>
> d) To distinguish vmalloc pointers from SLAB ones. It becomes possible
>    to invoke the right freeing API for the right kind of pointer,
>    kfree_bulk() or TBD: vmalloc_bulk().
>
> e) Speeding up the post-grace-period freeing reduces the chance of a
>    flood of callbacks OOMing the system.
>
> Also, please have a look here: https://lkml.org/lkml/2020/7/30/1166
>
> Proposal
> ========
> Introduce a lock-free function that obtains a page from the
> per-cpu-lists on the current CPU. It returns NULL rather than acquiring
> any non-raw spinlock.
>
> Description
> ===========
> The page allocator has two phases, a fast path and a slow one. We are
> interested in the fast path and order-0 allocations. In its turn it is
> also divided into two phases: lock-less and not:
>
> 1) As a first step the page allocator tries to obtain a page from the
>    per-cpu-list, so each CPU has its own one. That is why this step is
>    lock-less and fast. Basically it disables irqs on the current CPU in
>    order to access the per-cpu data and remove the first element from
>    the pcp-list. An element/page is returned to the user.
>
> 2) If there is no available page in the per-cpu-list, the second step
>    is involved.
>    It removes a specified number of elements from the buddy allocator,
>    transferring them to the "supplied-list/per-cpu-list" described in
>    [1].
>
> Summarizing: __rcu_alloc_page_lockless() covers only [1] and cannot
> do step [2], due to the fact that [2] requires access to zone->lock.
> This implies that it is super fast, but a higher rate of failure is
> also expected.
>
> Usage: __rcu_alloc_page_lockless();
>
> Link: https://lore.kernel.org/lkml/20200814215206.GL3982@worktop.programming.kicks-ass.net/
> Not-signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

After reading all the threads and mulling over this, I am going to
deflect from Mel and Michal and not oppose the idea of lockless
allocation. I would even prefer to do it via the gfp flag and not a
completely separate path. Not using the exact code from v1, I think it
could be done in a way that we don't actually look at the new flag until
we find that the pcplist is empty - which should not introduce overhead
to the fast-fast path when the pcplist is not empty. It's more
maintainable than adding new entry points, IMHO.

But there's still the condition that it's sufficiently shown that the
allocation is useful for RCU. In that case I prefer that the page
allocator (or MM in general) can give its users what they need without
having to work around it.

Seems like GFP_ATOMIC is insufficient today, so if that means we need a
new flag for the raw spinlock context, so be it. But if your usage of
__GFP_NO_LOCKS depends on the result of preempt_count() or a similar
check whether this is a context that needs it, I'd prefer to keep this
in the caller.
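The two-phase fast path the patch describes can be sketched as a userspace model: step [1] pops from a per-CPU list without any shared lock, step [2] refills from the "buddy" under a lock, and the lockless variant simply stops after [1]. Structure and names here are illustrative, not the mm/page_alloc.c code:

```c
/*
 * Userspace model of the order-0 fast path described above. pcp_pop()
 * models step [1] (per-CPU list, no shared lock), refill_from_buddy()
 * models step [2] (needs zone->lock), and alloc_page_lockless() does
 * [1] only, returning NULL otherwise - mirroring the behavior proposed
 * for __rcu_alloc_page_lockless().
 */
#include <assert.h>
#include <stddef.h>

#define PCP_BATCH 4

static struct { void *pages[PCP_BATCH]; int count; } pcp; /* one CPU */
static int zone_lock;                     /* stand-in for zone->lock */
static char buddy_mem[64][16];            /* fake buddy free pages   */
static int buddy_free = 64;

static void *pcp_pop(void)                /* step [1]: no shared lock */
{
	return pcp.count ? pcp.pages[--pcp.count] : NULL;
}

static void refill_from_buddy(void)       /* step [2]: takes the lock */
{
	assert(!zone_lock);
	zone_lock = 1;                    /* "spin_lock(&zone->lock)"   */
	while (pcp.count < PCP_BATCH && buddy_free > 0)
		pcp.pages[pcp.count++] = buddy_mem[--buddy_free];
	zone_lock = 0;                    /* "spin_unlock(&zone->lock)" */
}

static void *alloc_page_lockless(void)    /* proposed: step [1] only   */
{
	return pcp_pop();                 /* NULL when pcplist is empty */
}

static void *alloc_page(void)             /* regular: [1], then [2]    */
{
	void *page = pcp_pop();

	if (!page) {
		refill_from_buddy();
		page = pcp_pop();
	}
	return page;
}
```

The "higher rate of fails" in the description corresponds to `alloc_page_lockless()` returning NULL whenever the per-CPU batch happens to be drained, even though the buddy still has free pages.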
> > I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
> > to provide a memory service for contexts which are not allowed to
> > sleep, RCU is part of them. Both flags used to provide such ability
> > before but not anymore.
> >
> > Do you agree with it?
>
> Yes this sucks. But this is something that we likely really want to live
> with. We have to explicitly _document_ that really atomic contexts in RT
> cannot use the allocator. From the past discussions we've had this is
> likely the most reasonable way forward because we do not really want to
> encourage anybody to do something like that and there should be ways
> around that. The same is btw. true also for !RT. The allocator is not
> NMI safe and while we should be able to make it compatible I am not
> convinced we really want to.
>
> Would something like this be helpful wrt documentation?
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 67a0774e080b..9fcd47606493 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -238,7 +238,9 @@ struct vm_area_struct;
>  * %__GFP_FOO flags as necessary.
>  *
>  * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
> - * watermark is applied to allow access to "atomic reserves"
> + * watermark is applied to allow access to "atomic reserves".
> + * The current implementation doesn't support NMI and other non-preemptive context
> + * (e.g. raw_spin_lock).
>  *
>  * %GFP_KERNEL is typical for kernel-internal allocations. The caller requires
>  * %ZONE_NORMAL or a lower zone for direct access but can direct reclaim.

To me it is clear. But also above conflicting statement:

<snip>
%GFP_ATOMIC users can not sleep and need the allocation to succeed. A %lower
<snip>

should be rephrased, IMHO.

--
Vlad Rezki
On Tue, Sep 29, 2020 at 12:15:34PM +0200, Vlastimil Babka wrote:
> On 9/18/20 9:48 PM, Uladzislau Rezki (Sony) wrote:
[...]
> After reading all the threads and mulling over this, I am going to deflect from
> Mel and Michal and not oppose the idea of lockless allocation. I would even
> prefer to do it via the gfp flag and not a completely separate path. Not using
> the exact code from v1, I think it could be done in a way that we don't actually
> look at the new flag until we find that pcplist is empty - which should not
> introduce overhead to the fast-fast path when pcplist is not empty. It's more
> maintainable than adding new entry points, IMHO.

Thanks for reading all that from the beginning! It must have been tough,
given that there were a lot of messages in the threads; at least I was
lost.

I agree that adding a new entry point or a separate lock-less function
can be considered as something that is hard to maintain. I have a
question here.
I mean your different look at it:

<snip>
bool is_pcp_cache_empty(gfp_t gfp)
{
	struct per_cpu_pages *pcp;
	struct zoneref *ref;
	unsigned long flags;
	bool empty;

	ref = first_zones_zonelist(node_zonelist(
		numa_node_id(), gfp), gfp_zone(gfp), NULL);
	if (!ref->zone)
		return true;

	local_irq_save(flags);
	pcp = &this_cpu_ptr(ref->zone->pageset)->pcp;
	empty = list_empty(&pcp->lists[gfp_migratetype(gfp)]);
	local_irq_restore(flags);

	return empty;
}

disable_irq();
if (!is_pcp_cache_empty(GFP_NOWAIT))
	__get_free_page(GFP_NOWAIT);
enable_irq();
<snip>

Do you mean to have something like above? I mean some extra API
function that returns true or false depending on whether a fast-fast
allocation can occur or not. The code above works just fine and never
touches the main zone->lock, i.e. instead of introducing an extra
GFP_LOCKLESS flag or any new extra lock-less function, we could have
something that checks the pcp page cache list and thus can guarantee
that a request would be accomplished using the fast-fast path.

> But there's still the condition that it's sufficiently shown that the allocation
> is useful for RCU. In that case I prefer that the page allocator (or MM in
> general) can give its users what they need without having to work around it.

I see your point; if there is something that can be generic and not
specific, that makes sense to me also. I do share your view.

> Seems like GFP_ATOMIC is insufficient today so if that means we need a new flag
> for the raw spinlock context, so be it. But if your usage of __GFP_NO_LOCKS
> depends on the result of preempt_count() or similar check whether this is a
> context that needs it, I'd prefer to keep this in the caller.

By having the preemptible() check, we can just work around the
GFP_ATOMIC or GFP_NOWAIT flag issues, because both are insufficient. I
mean, when we see that a context is atomic (cannot sleep) we can bypass
any usage of the page allocator, because it will hit BUG() with the
flags in question.
And yes, the best is when we do not have to think much about context,
so there is something reliable from the point of view of using the page
allocator.

Thanks, Vlastimil!

--
Vlad Rezki
On Tue 29-09-20 18:25:14, Uladzislau Rezki wrote:
> > Would something like this be helpful wrt documentation?
[...]
> To me it is clear. But also above conflicting statement:
>
> <snip>
> %GFP_ATOMIC users can not sleep and need the allocation to succeed. A %lower
> <snip>
>
> should be rephrased, IMHO.

Any suggestions? Or more specifics about which part is conflicting?
It tries to say that there is a higher demand to succeed even though the
context cannot sleep to take active measures to achieve that. So the
only way to achieve that is to break the watermarks to a certain degree,
which makes them a "higher class" than other allocations.
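The watermark "breaking" Michal refers to can be illustrated with the kind of arithmetic `__zone_watermark_ok()` performs; the flag names and fractions below follow my reading of that code around this period and should be taken as an approximation rather than a contract:

```c
/*
 * Illustration of the "higher class" watermark treatment. GFP_ATOMIC
 * callers effectively get ALLOC_HIGH (via __GFP_HIGH) and ALLOC_HARDER,
 * each lowering the effective min watermark so atomic contexts can dip
 * into reserves ordinary allocations must not touch. Approximation of
 * __zone_watermark_ok(), not the real code.
 */
#include <assert.h>
#include <stdbool.h>

#define ALLOC_HIGH   0x1  /* allowed to go below the min watermark */
#define ALLOC_HARDER 0x2  /* atomic/rt callers: dig deeper still   */

static long effective_min(long min_wmark, int alloc_flags)
{
	long min = min_wmark;

	if (alloc_flags & ALLOC_HIGH)
		min -= min / 2;   /* e.g. 128 -> 64 */
	if (alloc_flags & ALLOC_HARDER)
		min -= min / 4;   /* e.g. 64 -> 48  */
	return min;
}

/* An allocation proceeds only while free pages stay above the bar. */
static bool watermark_ok(long free_pages, long min_wmark, int alloc_flags)
{
	return free_pages > effective_min(min_wmark, alloc_flags);
}
```

In other words, the "atomic reserves" are not a separate pool: they are the band of pages between the lowered bar and the normal one, usable only by the higher-class callers.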
On Wed 30-09-20 00:07:42, Uladzislau Rezki wrote:
[...]
> Do you mean to have something like above? I mean some extra API
> function that returns true or false if fast-fast allocation can
> either occur or not. Above code works just fine and never touches
> main zone->lock.

The above code works with the _current_ implementation and it restricts
its implementation to some degree. Future changes might get harder to
implement with a pattern like this. I do not think we want users to be
aware of internal implementation details like pcp caches, migrate types
or others. While pcp caches have been here for years and are unlikely to
change in the foreseeable future, many details change on a regular
basis.
On Wed, Sep 30, 2020 at 11:27:32AM +0200, Michal Hocko wrote:
> On Tue 29-09-20 18:25:14, Uladzislau Rezki wrote:
[...]
> > should be rephrased, IMHO.
>
> Any suggestions? Or more specifics about which part is conflicting? It
> tries to say that there is a higher demand to succeed even though the
> context cannot sleep to take active measures to achieve that. So the
> only way to achieve that is to break the watermarks to a certain degree,
> which makes them a "higher class" than other allocations.
>
Michal, I had only one concern about it. It says that %GFP_ATOMIC users
cannot sleep, i.e. callers know that they are in atomic context, thus no
sleeping, but the choice they make will force them to sleep.

--
Vlad Rezki
On Wed 30-09-20 14:35:35, Uladzislau Rezki wrote:
> On Wed, Sep 30, 2020 at 11:27:32AM +0200, Michal Hocko wrote:
[...]
> Michal, I had only one concern about it. It says that %GFP_ATOMIC users
> cannot sleep, i.e. callers know that they are in atomic context, thus no
> sleeping, but the choice they make will force them to sleep.

I am not sure I follow you here. Do you mean they will be forced to
sleep with PREEMPT_RT?
On Wed, Sep 30, 2020 at 02:44:13PM +0200, Michal Hocko wrote:
> On Wed 30-09-20 14:35:35, Uladzislau Rezki wrote:
[...]
> I am not sure I follow you here. Do you mean they will be forced to
> sleep with PREEMPT_RT?
>
Exactly :)

--
Vlad Rezki
On 9/30/20 12:07 AM, Uladzislau Rezki wrote:
> On Tue, Sep 29, 2020 at 12:15:34PM +0200, Vlastimil Babka wrote:
[...]
> Do you mean to have something like above? I mean some extra API
> function that returns true or false if fast-fast allocation can
> either occur or not. Above code works just fine and never touches
> main zone->lock.
>
> i.e. Instead of introducing an extra GFP_LOCKLESS flag or any new
> extra lock-less function, we could have something that checks a
> pcp page cache list, thus it can guarantee that a request would
> be accomplished using the fast-fast path.

No, I meant going back to the idea of a new gfp flag, but adjusting the
implementation in the allocator (different from what you posted in the
previous version) so that it only looks at the flag after it tries to
allocate from the pcplist and finds out it's empty. So, no inventing of
new page allocator entry points or checks such as the one you wrote
above, but adding the new gfp flag in a way that it doesn't affect
existing fast paths.
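A rough model of what Vlastimil suggests: the hot path never even reads the new flag, and only the empty-pcplist refill path checks it. The `__GFP_NO_LOCKS` name comes from the patch under discussion; everything else is a simplified stand-in for the allocator's pcplist path, not the real `rmqueue_pcplist()`:

```c
/*
 * Simplified model of the pcplist allocation path with the proposed
 * flag: the fast path (non-empty list) never reads __GFP_NO_LOCKS;
 * only the refill path, which would need zone->lock, examines it.
 */
#include <assert.h>
#include <stddef.h>

#define __GFP_NO_LOCKS 0x1           /* proposed flag */

struct pcplist { void *pages[8]; int count; };

static void *rmqueue_pcplist(struct pcplist *pcp, unsigned int gfp_mask)
{
	static char buddy_page[16];  /* pretend result of a locked refill */

	if (pcp->count)                      /* fast path: flag never read */
		return pcp->pages[--pcp->count];

	if (gfp_mask & __GFP_NO_LOCKS)       /* list empty: lockless bail  */
		return NULL;

	/* Otherwise take zone->lock and refill from the buddy (elided). */
	return buddy_page;
}
```

This way the existing fast-fast path pays no extra branch while the pcplist has pages, and lockless callers get NULL exactly where the allocator would otherwise have to take zone->lock.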
On Fri, Sep 25, 2020 at 05:47:41PM +0200, Michal Hocko wrote:
> On Fri 25-09-20 17:31:29, Uladzislau Rezki wrote:
[...]
> > I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
> > to provide a memory service for contexts which are not allowed to
> > sleep, RCU is part of them. Both flags used to provide such ability
> > before but not anymore.
> >
> > Do you agree with it?
>
> Yes this sucks. But this is something that we likely really want to live
> with. We have to explicitly _document_ that really atomic contexts in RT
> cannot use the allocator. From the past discussions we've had this is
> likely the most reasonable way forward because we do not really want to
> encourage anybody to do something like that and there should be ways
> around that. The same is btw. true also for !RT. The allocator is not
> NMI safe and while we should be able to make it compatible I am not
> convinced we really want to.
>
> Would something like this be helpful wrt documentation?
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index 67a0774e080b..9fcd47606493 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -238,7 +238,9 @@ struct vm_area_struct; > * %__GFP_FOO flags as necessary. > * > * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower > - * watermark is applied to allow access to "atomic reserves" > + * watermark is applied to allow access to "atomic reserves". > + * The current implementation doesn't support NMI and other non-preemptive context > + * (e.g. raw_spin_lock). I think documenting is useful. Could it be more explicit in what the issue is? Something like: * Even with GFP_ATOMIC, calls to the allocator can sleep on PREEMPT_RT systems. Therefore, the current low-level allocator implementation does not support being called from special contexts that are atomic on RT - such as NMI and raw_spin_lock. Due to these constraints, and considering calling code usually has no control over the PREEMPT_RT configuration, callers of the allocator should avoid calling the allocator from these contexts even in non-RT systems. thanks! - Joel > * > * %GFP_KERNEL is typical for kernel-internal allocations. The caller requires > * %ZONE_NORMAL or a lower zone for direct access but can direct reclaim. > > [...] > -- > Michal Hocko > SUSE Labs
On Wed, Sep 30, 2020 at 04:39:53PM +0200, Vlastimil Babka wrote: > On 9/30/20 12:07 AM, Uladzislau Rezki wrote: > > On Tue, Sep 29, 2020 at 12:15:34PM +0200, Vlastimil Babka wrote: > >> On 9/18/20 9:48 PM, Uladzislau Rezki (Sony) wrote: > >> > >> After reading all the threads and mulling over this, I am going to deflect from > >> Mel and Michal and not oppose the idea of lockless allocation. I would even > >> prefer to do it via the gfp flag and not a completely separate path. Not using > >> the exact code from v1, I think it could be done in a way that we don't actually > >> look at the new flag until we find that pcplist is empty - which should not > >> introduce overhead to the fast-fast path when pcpclist is not empty. It's more > >> maintainable that adding new entry points, IMHO. > >> > > Thanks for reading all that from the beginning! It must be tough due to the > > fact there were lot of messages in the threads, so at least i was lost. > > > > I agree that adding a new entry or separate lock-less function can be considered > > as something that is hard to maintain. I have a question here. I mean about your > > different look at it: > > > > <snip> > > bool is_pcp_cache_empty(gfp_t gfp) > > { > > struct per_cpu_pages *pcp; > > struct zoneref *ref; > > unsigned long flags; > > bool empty; > > > > ref = first_zones_zonelist(node_zonelist( > > numa_node_id(), gfp), gfp_zone(gfp), NULL); > > if (!ref->zone) > > return true; > > > > local_irq_save(flags); > > pcp = &this_cpu_ptr(ref->zone->pageset)->pcp; > > empty = list_empty(&pcp->lists[gfp_migratetype(gfp)]); > > local_irq_restore(flags); > > > > return empty; > > } > > > > disable_irq(); > > if (!is_pcp_cache_empty(GFP_NOWAIT)) > > __get_free_page(GFP_NOWAIT); > > enable_irq(); > > <snip> > > > > Do you mean to have something like above? I mean some extra API > > function that returns true or false if fast-fast allocation can > > either occur or not. 
Above code works just fine and never touches > > main zone->lock. > > > > i.e. Instead of introducing an extra GFP_LOCKLESS flag or any new > > extra lock-less function. We could have something that checks a > > pcp page cache list, thus it can guarantee that a request would > > be accomplish using fast-fast path. > > No, I meant going back to idea of new gfp flag, but adjust the implementation in > the allocator (different from what you posted in previous version) so that it > only looks at the flag after it tries to allocate from pcplist and finds out > it's empty. So, no inventing of new page allocator entry points or checks such > as the one you wrote above, but adding the new gfp flag in a way that it doesn't > affect existing fast paths. I like the idea of a new flag too :) After all telling the allocator more about what your context can tolerate, via GFP flags, is not without precedent. thanks, - Joel
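The scheme Vlastimil describes above — consult the new flag only once the pcplist has been found empty, so the existing fast path never looks at it — can be modeled outside the kernel. The sketch below is a hypothetical userspace analogue, not the page allocator itself; names such as `alloc_page_model`, `NO_LOCKS`, and `refill_slow_path` are invented for illustration:

```c
#include <assert.h>
#include <stdlib.h>

#define NO_LOCKS 0x1u /* hypothetical gfp-style flag, not a real kernel flag */

struct free_page { struct free_page *next; };

static struct free_page *pcplist;   /* models the per-cpu free list */
static int slow_path_calls;         /* counts lock-taking refills */

/* Models the refill path (rmqueue_bulk() would take zone->lock here),
 * which a NO_LOCKS request must never reach. */
static struct free_page *refill_slow_path(void)
{
	slow_path_calls++;
	return malloc(sizeof(struct free_page));
}

static struct free_page *alloc_page_model(unsigned int flags)
{
	if (pcplist) {			/* fast path: the flag is never inspected */
		struct free_page *p = pcplist;
		pcplist = p->next;
		return p;
	}
	/* Only an empty pcplist makes us look at the flag. */
	if (flags & NO_LOCKS)
		return NULL;
	return refill_slow_path();
}
```

The point of the structure is that a non-empty pcplist takes exactly the same branch with or without the flag, which is the "no new entry points, no cost on the fast path" property discussed in the thread.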
On Wed 30-09-20 15:39:54, Uladzislau Rezki wrote: > On Wed, Sep 30, 2020 at 02:44:13PM +0200, Michal Hocko wrote: > > On Wed 30-09-20 14:35:35, Uladzislau Rezki wrote: > > > On Wed, Sep 30, 2020 at 11:27:32AM +0200, Michal Hocko wrote: > > > > On Tue 29-09-20 18:25:14, Uladzislau Rezki wrote: > > > > > > > I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability > > > > > > > to provide a memory service for contexts which are not allowed to > > > > > > > sleep, RCU is part of them. Both flags used to provide such ability > > > > > > > before but not anymore. > > > > > > > > > > > > > > Do you agree with it? > > > > > > > > > > > > Yes this sucks. But this is something that we likely really want to live > > > > > > with. We have to explicitly _document_ that really atomic contexts in RT > > > > > > cannot use the allocator. From the past discussions we've had this is > > > > > > likely the most reasonable way forward because we do not really want to > > > > > > encourage anybody to do something like that and there should be ways > > > > > > around that. The same is btw. true also for !RT. The allocator is not > > > > > > NMI safe and while we should be able to make it compatible I am not > > > > > > convinced we really want to. > > > > > > > > > > > > Would something like this be helpful wrt documentation? > > > > > > > > > > > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > > > > > > index 67a0774e080b..9fcd47606493 100644 > > > > > > --- a/include/linux/gfp.h > > > > > > +++ b/include/linux/gfp.h > > > > > > @@ -238,7 +238,9 @@ struct vm_area_struct; > > > > > > * %__GFP_FOO flags as necessary. > > > > > > * > > > > > > * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower > > > > > > - * watermark is applied to allow access to "atomic reserves" > > > > > > + * watermark is applied to allow access to "atomic reserves". 
> > > > > > + * The current implementation doesn't support NMI and other non-preemptive context > > > > > > + * (e.g. raw_spin_lock). > > > > > > * > > > > > > * %GFP_KERNEL is typical for kernel-internal allocations. The caller requires > > > > > > * %ZONE_NORMAL or a lower zone for direct access but can direct reclaim. > > > > > > > > > > > To me it is clear. But also above conflicting statement: > > > > > > > > > > <snip> > > > > > %GFP_ATOMIC users can not sleep and need the allocation to succeed. A %lower > > > > > <snip> > > > > > > > > > > should be rephrased, IMHO. > > > > > > > > Any suggestions? Or more specifics about which part is conflicting? It > > > > tries to say that there is a higher demand to succeed even though the > > > > context cannot sleep to take active measures to achieve that. So the > > > > only way to achieve that is to break the watermakrs to a certain degree > > > > which is making them more "higher class" than other allocations. > > > > > > > Michal, i had only one concern about it. It says that %GFP_ATOMIC users > > > can not sleep, i.e. callers know that they are in atomic, thus no any > > > sleeping, but the chose they make will force them to sleep. > > > > I am not sure I follow you here. Do you mean they will be forced to > > sleep with PREEMPT_RT? > > > Exactly :) We can make that more specific once RT patchset is merged. As of now this is not the thing in the Linus tree. I believe there will be more to clarify about atomic contexts in the RT tree as it means something else than people are used to think.
On Wed 30-09-20 11:25:17, Joel Fernandes wrote: > On Fri, Sep 25, 2020 at 05:47:41PM +0200, Michal Hocko wrote: > > On Fri 25-09-20 17:31:29, Uladzislau Rezki wrote: > > > > > > > > > > > > > > All good points! > > > > > > > > > > > > > > On the other hand, duplicating a portion of the allocator functionality > > > > > > > within RCU increases the amount of reserved memory, and needlessly most > > > > > > > of the time. > > > > > > > > > > > > > > > > > > > But it's very similar to what mempools are for. > > > > > > > > > > > As for dynamic caching or mempools. It requires extra logic on top of RCU > > > > > to move things forward and it might be not efficient way. As a side > > > > > effect, maintaining of the bulk arrays in the separate worker thread > > > > > will introduce other drawbacks: > > > > > > > > This is true but it is also true that it is RCU to require this special > > > > logic and we can expect that we might need to fine tune this logic > > > > depending on the RCU usage. We definitely do not want to tune the > > > > generic page allocator for a very specific usecase, do we? > > > > > > > I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability > > > to provide a memory service for contexts which are not allowed to > > > sleep, RCU is part of them. Both flags used to provide such ability > > > before but not anymore. > > > > > > Do you agree with it? > > > > Yes this sucks. But this is something that we likely really want to live > > with. We have to explicitly _document_ that really atomic contexts in RT > > cannot use the allocator. From the past discussions we've had this is > > likely the most reasonable way forward because we do not really want to > > encourage anybody to do something like that and there should be ways > > around that. The same is btw. true also for !RT. The allocator is not > > NMI safe and while we should be able to make it compatible I am not > > convinced we really want to. 
> > > > Would something like this be helpful wrt documentation? > > > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > > index 67a0774e080b..9fcd47606493 100644 > > --- a/include/linux/gfp.h > > +++ b/include/linux/gfp.h > > @@ -238,7 +238,9 @@ struct vm_area_struct; > > * %__GFP_FOO flags as necessary. > > * > > * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower > > - * watermark is applied to allow access to "atomic reserves" > > + * watermark is applied to allow access to "atomic reserves". > > + * The current implementation doesn't support NMI and other non-preemptive context > > + * (e.g. raw_spin_lock). > > I think documenting is useful. > > Could it be more explicit in what the issue is? Something like: > > * Even with GFP_ATOMIC, calls to the allocator can sleep on PREEMPT_RT > systems. Therefore, the current low-level allocator implementation does not > support being called from special contexts that are atomic on RT - such as > NMI and raw_spin_lock. Due to these constraints and considering calling code > usually has no control over the PREEMPT_RT configuration, callers of the > allocator should avoid calling the allocator from these cotnexts even in > non-RT systems. I do not mind documenting RT specific behavior but as mentioned in other reply, this should likely go via RT tree for now. There is likely more to clarify about atomicity for PREEMPT_RT.
On Wed, Sep 30, 2020 at 12:48 PM Michal Hocko <mhocko@suse.com> wrote: > > On Wed 30-09-20 11:25:17, Joel Fernandes wrote: > > On Fri, Sep 25, 2020 at 05:47:41PM +0200, Michal Hocko wrote: > > > On Fri 25-09-20 17:31:29, Uladzislau Rezki wrote: > > > > > > > > > > > > > > > > All good points! > > > > > > > > > > > > > > > > On the other hand, duplicating a portion of the allocator functionality > > > > > > > > within RCU increases the amount of reserved memory, and needlessly most > > > > > > > > of the time. > > > > > > > > > > > > > > > > > > > > > > But it's very similar to what mempools are for. > > > > > > > > > > > > > As for dynamic caching or mempools. It requires extra logic on top of RCU > > > > > > to move things forward and it might be not efficient way. As a side > > > > > > effect, maintaining of the bulk arrays in the separate worker thread > > > > > > will introduce other drawbacks: > > > > > > > > > > This is true but it is also true that it is RCU to require this special > > > > > logic and we can expect that we might need to fine tune this logic > > > > > depending on the RCU usage. We definitely do not want to tune the > > > > > generic page allocator for a very specific usecase, do we? > > > > > > > > > I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability > > > > to provide a memory service for contexts which are not allowed to > > > > sleep, RCU is part of them. Both flags used to provide such ability > > > > before but not anymore. > > > > > > > > Do you agree with it? > > > > > > Yes this sucks. But this is something that we likely really want to live > > > with. We have to explicitly _document_ that really atomic contexts in RT > > > cannot use the allocator. From the past discussions we've had this is > > > likely the most reasonable way forward because we do not really want to > > > encourage anybody to do something like that and there should be ways > > > around that. The same is btw. true also for !RT. 
The allocator is not > > > NMI safe and while we should be able to make it compatible I am not > > > convinced we really want to. > > > > > > Would something like this be helpful wrt documentation? > > > > > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > > > index 67a0774e080b..9fcd47606493 100644 > > > --- a/include/linux/gfp.h > > > +++ b/include/linux/gfp.h > > > @@ -238,7 +238,9 @@ struct vm_area_struct; > > > * %__GFP_FOO flags as necessary. > > > * > > > * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower > > > - * watermark is applied to allow access to "atomic reserves" > > > + * watermark is applied to allow access to "atomic reserves". > > > + * The current implementation doesn't support NMI and other non-preemptive context > > > + * (e.g. raw_spin_lock). > > > > I think documenting is useful. > > > > Could it be more explicit in what the issue is? Something like: > > > > * Even with GFP_ATOMIC, calls to the allocator can sleep on PREEMPT_RT > > systems. Therefore, the current low-level allocator implementation does not > > support being called from special contexts that are atomic on RT - such as > > NMI and raw_spin_lock. Due to these constraints and considering calling code > > usually has no control over the PREEMPT_RT configuration, callers of the > > allocator should avoid calling the allocator from these cotnexts even in > > non-RT systems. > > I do not mind documenting RT specific behavior but as mentioned in other > reply, this should likely go via RT tree for now. There is likely more > to clarify about atomicity for PREEMPT_RT. I am sorry, I did not understand what you meant by something missing in Linus Tree. CONFIG_PREEMPT_RT is there. Could you clarify? Also the CONFIG_PREEMPT_RT is the only thing driving this requirement for GFP_ATOMIC right? Or are there non-RT reasons why GFP_ATOMIC allocation cannot be done from NMI/raw_spin_lock ? Thanks, - Joel
On Wed 30-09-20 13:03:29, Joel Fernandes wrote: > On Wed, Sep 30, 2020 at 12:48 PM Michal Hocko <mhocko@suse.com> wrote: > > > > On Wed 30-09-20 11:25:17, Joel Fernandes wrote: > > > On Fri, Sep 25, 2020 at 05:47:41PM +0200, Michal Hocko wrote: > > > > On Fri 25-09-20 17:31:29, Uladzislau Rezki wrote: > > > > > > > > > > > > > > > > > > All good points! > > > > > > > > > > > > > > > > > > On the other hand, duplicating a portion of the allocator functionality > > > > > > > > > within RCU increases the amount of reserved memory, and needlessly most > > > > > > > > > of the time. > > > > > > > > > > > > > > > > > > > > > > > > > But it's very similar to what mempools are for. > > > > > > > > > > > > > > > As for dynamic caching or mempools. It requires extra logic on top of RCU > > > > > > > to move things forward and it might be not efficient way. As a side > > > > > > > effect, maintaining of the bulk arrays in the separate worker thread > > > > > > > will introduce other drawbacks: > > > > > > > > > > > > This is true but it is also true that it is RCU to require this special > > > > > > logic and we can expect that we might need to fine tune this logic > > > > > > depending on the RCU usage. We definitely do not want to tune the > > > > > > generic page allocator for a very specific usecase, do we? > > > > > > > > > > > I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability > > > > > to provide a memory service for contexts which are not allowed to > > > > > sleep, RCU is part of them. Both flags used to provide such ability > > > > > before but not anymore. > > > > > > > > > > Do you agree with it? > > > > > > > > Yes this sucks. But this is something that we likely really want to live > > > > with. We have to explicitly _document_ that really atomic contexts in RT > > > > cannot use the allocator. 
From the past discussions we've had this is > > > > likely the most reasonable way forward because we do not really want to > > > > encourage anybody to do something like that and there should be ways > > > > around that. The same is btw. true also for !RT. The allocator is not > > > > NMI safe and while we should be able to make it compatible I am not > > > > convinced we really want to. > > > > > > > > Would something like this be helpful wrt documentation? > > > > > > > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > > > > index 67a0774e080b..9fcd47606493 100644 > > > > --- a/include/linux/gfp.h > > > > +++ b/include/linux/gfp.h > > > > @@ -238,7 +238,9 @@ struct vm_area_struct; > > > > * %__GFP_FOO flags as necessary. > > > > * > > > > * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower > > > > - * watermark is applied to allow access to "atomic reserves" > > > > + * watermark is applied to allow access to "atomic reserves". > > > > + * The current implementation doesn't support NMI and other non-preemptive context > > > > + * (e.g. raw_spin_lock). > > > > > > I think documenting is useful. > > > > > > Could it be more explicit in what the issue is? Something like: > > > > > > * Even with GFP_ATOMIC, calls to the allocator can sleep on PREEMPT_RT > > > systems. Therefore, the current low-level allocator implementation does not > > > support being called from special contexts that are atomic on RT - such as > > > NMI and raw_spin_lock. Due to these constraints and considering calling code > > > usually has no control over the PREEMPT_RT configuration, callers of the > > > allocator should avoid calling the allocator from these cotnexts even in > > > non-RT systems. > > > > I do not mind documenting RT specific behavior but as mentioned in other > > reply, this should likely go via RT tree for now. There is likely more > > to clarify about atomicity for PREEMPT_RT. 
> > I am sorry, I did not understand what you meant by something missing > in Linus Tree. CONFIG_PREEMPT_RT is there. OK, I was not aware we already have CONFIG_PREEMPT_RT in the tree. There is still a lot from the RT patchset, including the sleeping spin locks, that makes a real difference. Or maybe I haven't checked properly. > Could you clarify? Also the CONFIG_PREEMPT_RT is the only thing > driving this requirement for GFP_ATOMIC right? Or are there non-RT > reasons why GFP_ATOMIC allocation cannot be done from > NMI/raw_spin_lock ? I have already sent a clarification patch [1]. NMI is not supported regardless of the preemption mode. Hope this clarifies it a bit. [1] http://lkml.kernel.org/r/20200929123010.5137-1-mhocko@kernel.org
On Wed, Sep 30, 2020 at 1:22 PM Michal Hocko <mhocko@suse.com> wrote: > > > > I think documenting is useful. > > > > > > > > Could it be more explicit in what the issue is? Something like: > > > > > > > > * Even with GFP_ATOMIC, calls to the allocator can sleep on PREEMPT_RT > > > > systems. Therefore, the current low-level allocator implementation does not > > > > support being called from special contexts that are atomic on RT - such as > > > > NMI and raw_spin_lock. Due to these constraints and considering calling code > > > > usually has no control over the PREEMPT_RT configuration, callers of the > > > > allocator should avoid calling the allocator from these cotnexts even in > > > > non-RT systems. > > > > > > I do not mind documenting RT specific behavior but as mentioned in other > > > reply, this should likely go via RT tree for now. There is likely more > > > to clarify about atomicity for PREEMPT_RT. > > > > I am sorry, I did not understand what you meant by something missing > > in Linus Tree. CONFIG_PREEMPT_RT is there. > > OK, I was not aware we already CONFIG_PREEMPT_RT in the three. There is > still a lot from the RT patchset including sleeping spin locks that make > a real difference. Or maybe I haven't checked properly. > > > Could you clarify? Also the CONFIG_PREEMPT_RT is the only thing > > driving this requirement for GFP_ATOMIC right? Or are there non-RT > > reasons why GFP_ATOMIC allocation cannot be done from > > NMI/raw_spin_lock ? > > I have already sent a clarification patch [1]. NMI is not supported > regardless of the preemption mode. Hope this clarifies it a bit. > > [1] http://lkml.kernel.org/r/20200929123010.5137-1-mhocko@kernel.org That works for me. Thanks, - Joel
On Wed, Sep 30, 2020 at 06:46:00PM +0200, Michal Hocko wrote: > On Wed 30-09-20 15:39:54, Uladzislau Rezki wrote: > > On Wed, Sep 30, 2020 at 02:44:13PM +0200, Michal Hocko wrote: > > > On Wed 30-09-20 14:35:35, Uladzislau Rezki wrote: > > > > On Wed, Sep 30, 2020 at 11:27:32AM +0200, Michal Hocko wrote: > > > > > On Tue 29-09-20 18:25:14, Uladzislau Rezki wrote: > > > > > > > > I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability > > > > > > > > to provide a memory service for contexts which are not allowed to > > > > > > > > sleep, RCU is part of them. Both flags used to provide such ability > > > > > > > > before but not anymore. > > > > > > > > > > > > > > > > Do you agree with it? > > > > > > > > > > > > > > Yes this sucks. But this is something that we likely really want to live > > > > > > > with. We have to explicitly _document_ that really atomic contexts in RT > > > > > > > cannot use the allocator. From the past discussions we've had this is > > > > > > > likely the most reasonable way forward because we do not really want to > > > > > > > encourage anybody to do something like that and there should be ways > > > > > > > around that. The same is btw. true also for !RT. The allocator is not > > > > > > > NMI safe and while we should be able to make it compatible I am not > > > > > > > convinced we really want to. > > > > > > > > > > > > > > Would something like this be helpful wrt documentation? > > > > > > > > > > > > > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > > > > > > > index 67a0774e080b..9fcd47606493 100644 > > > > > > > --- a/include/linux/gfp.h > > > > > > > +++ b/include/linux/gfp.h > > > > > > > @@ -238,7 +238,9 @@ struct vm_area_struct; > > > > > > > * %__GFP_FOO flags as necessary. > > > > > > > * > > > > > > > * %GFP_ATOMIC users can not sleep and need the allocation to succeed. 
A lower > > > > > > > - * watermark is applied to allow access to "atomic reserves" > > > > > > > + * watermark is applied to allow access to "atomic reserves". > > > > > > > + * The current implementation doesn't support NMI and other non-preemptive context > > > > > > > + * (e.g. raw_spin_lock). > > > > > > > * > > > > > > > * %GFP_KERNEL is typical for kernel-internal allocations. The caller requires > > > > > > > * %ZONE_NORMAL or a lower zone for direct access but can direct reclaim. > > > > > > > > > > > > > To me it is clear. But also above conflicting statement: > > > > > > > > > > > > <snip> > > > > > > %GFP_ATOMIC users can not sleep and need the allocation to succeed. A %lower > > > > > > <snip> > > > > > > > > > > > > should be rephrased, IMHO. > > > > > > > > > > Any suggestions? Or more specifics about which part is conflicting? It > > > > > tries to say that there is a higher demand to succeed even though the > > > > > context cannot sleep to take active measures to achieve that. So the > > > > > only way to achieve that is to break the watermakrs to a certain degree > > > > > which is making them more "higher class" than other allocations. > > > > > > > > > Michal, i had only one concern about it. It says that %GFP_ATOMIC users > > > > can not sleep, i.e. callers know that they are in atomic, thus no any > > > > sleeping, but the chose they make will force them to sleep. > > > > > > I am not sure I follow you here. Do you mean they will be forced to > > > sleep with PREEMPT_RT? > > > > > Exactly :) > > We can make that more specific once RT patchset is merged. As of now > this is not the thing in the Linus tree. I believe there will be more to > clarify about atomic contexts in the RT tree as it means something else > than people are used to think. > Fair enough. Currently we have only the Kconfig CONFIG_PREEMPT_RT symbol, so RT patches are still not in place. -- Vlad Rezki
> > No, I meant going back to idea of new gfp flag, but adjust the implementation in > the allocator (different from what you posted in previous version) so that it > only looks at the flag after it tries to allocate from pcplist and finds out > it's empty. So, no inventing of new page allocator entry points or checks such > as the one you wrote above, but adding the new gfp flag in a way that it doesn't > affect existing fast paths. > OK. Now I see. Please have a look at the patch below, so we fully understand each other, and whether it is close to your view or not: <snip> diff --git a/include/linux/gfp.h b/include/linux/gfp.h index c603237e006c..7e613560a502 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -39,8 +39,9 @@ struct vm_area_struct; #define ___GFP_HARDWALL 0x100000u #define ___GFP_THISNODE 0x200000u #define ___GFP_ACCOUNT 0x400000u +#define ___GFP_NO_LOCKS 0x800000u #ifdef CONFIG_LOCKDEP -#define ___GFP_NOLOCKDEP 0x800000u +#define ___GFP_NOLOCKDEP 0x1000000u #else #define ___GFP_NOLOCKDEP 0 #endif @@ -215,16 +216,22 @@ struct vm_area_struct; * %__GFP_COMP address compound page metadata. * * %__GFP_ZERO returns a zeroed page on success. + * + * %__GFP_NO_LOCKS order-0 allocation without sleepable-locks. + * It obtains a page from the per-cpu-list and is considered + * lock-less. No other actions are performed, thus it returns + * NULL if the per-cpu-list is empty. 
*/ #define __GFP_NOWARN ((__force gfp_t)___GFP_NOWARN) #define __GFP_COMP ((__force gfp_t)___GFP_COMP) #define __GFP_ZERO ((__force gfp_t)___GFP_ZERO) +#define __GFP_NO_LOCKS ((__force gfp_t)___GFP_NO_LOCKS) /* Disable lockdep for GFP context tracking */ #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP) /* Room for N __GFP_FOO bits */ -#define __GFP_BITS_SHIFT (23 + IS_ENABLED(CONFIG_LOCKDEP)) +#define __GFP_BITS_SHIFT (24 + IS_ENABLED(CONFIG_LOCKDEP)) #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) /** diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index 67018d367b9f..d99af78237be 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -45,6 +45,7 @@ {(unsigned long)__GFP_RECLAIMABLE, "__GFP_RECLAIMABLE"}, \ {(unsigned long)__GFP_MOVABLE, "__GFP_MOVABLE"}, \ {(unsigned long)__GFP_ACCOUNT, "__GFP_ACCOUNT"}, \ + {(unsigned long)__GFP_NO_LOCKS, "__GFP_NO_LOCKS"}, \ {(unsigned long)__GFP_WRITE, "__GFP_WRITE"}, \ {(unsigned long)__GFP_RECLAIM, "__GFP_RECLAIM"}, \ {(unsigned long)__GFP_DIRECT_RECLAIM, "__GFP_DIRECT_RECLAIM"},\ diff --git a/mm/internal.h b/mm/internal.h index 6345b08ce86c..5724fba921f9 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -569,6 +569,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone, #define ALLOC_NOFRAGMENT 0x0 #endif #define ALLOC_KSWAPD 0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */ +#define ALLOC_NO_LOCKS 0x1000 /* Lock free allocation. 
*/ enum ttu_flags; struct tlbflush_unmap_batch; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index aff1f84bf268..19cd9794dd45 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2837,6 +2837,9 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, { int i, alloced = 0; + if (alloc_flags & ALLOC_NO_LOCKS) + return alloced; + spin_lock(&zone->lock); for (i = 0; i < count; ++i) { struct page *page = __rmqueue(zone, order, migratetype, @@ -3805,7 +3808,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags, * grow this zone if it contains deferred pages. */ if (static_branch_unlikely(&deferred_pages)) { - if (_deferred_grow_zone(zone, order)) + if (!(alloc_flags & ALLOC_NO_LOCKS) && + _deferred_grow_zone(zone, order)) goto try_this_zone; } #endif @@ -3850,7 +3854,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags, reserve_highatomic_pageblock(page, zone, order); return page; - } else { + } else if (!(alloc_flags & ALLOC_NO_LOCKS)) { #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT /* Try again if zone has deferred pages */ if (static_branch_unlikely(&deferred_pages)) { @@ -4846,6 +4850,9 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order, ac->preferred_zoneref = first_zones_zonelist(ac->zonelist, ac->highest_zoneidx, ac->nodemask); + if (gfp_mask & __GFP_NO_LOCKS) + *alloc_flags |= ALLOC_NO_LOCKS; + return true; } @@ -4886,6 +4893,10 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid, if (likely(page)) goto out; + /* Bypass slow path if __GFP_NO_LOCKS. */ + if ((gfp_mask & __GFP_NO_LOCKS)) + goto out; + /* * Apply scoped allocation constraints. This is mainly about GFP_NOFS * resp. 
GFP_NOIO which has to be inherited for all allocation requests diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c index a50dae2c4ae9..fee3221bcf6a 100644 --- a/tools/perf/builtin-kmem.c +++ b/tools/perf/builtin-kmem.c @@ -656,6 +656,7 @@ static const struct { { "__GFP_RECLAIMABLE", "RC" }, { "__GFP_MOVABLE", "M" }, { "__GFP_ACCOUNT", "AC" }, + { "__GFP_NO_LOCKS", "NL" }, { "__GFP_WRITE", "WR" }, { "__GFP_RECLAIM", "R" }, { "__GFP_DIRECT_RECLAIM", "DR" }, <snip> If not, could you please provide some snips or any pseudo code? Thanks Vlastimil! -- Vlad Rezki
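One mechanical detail in the patch above is easy to verify in isolation: ___GFP_NO_LOCKS takes the 0x800000 bit that ___GFP_NOLOCKDEP previously occupied, ___GFP_NOLOCKDEP moves up to 0x1000000, and __GFP_BITS_SHIFT grows from 23 to 24 (plus one under CONFIG_LOCKDEP). A standalone sanity check of those constants, mirroring the values in the diff (the macro names below are local stand-ins, not the kernel's):

```c
#include <assert.h>

/* Constants copied from the proposed diff (CONFIG_LOCKDEP=y case);
 * this is a standalone sanity check, not kernel code. */
#define GFP_NO_LOCKS   0x800000u
#define GFP_NOLOCKDEP  0x1000000u
#define GFP_BITS_SHIFT (24 + 1)			/* 24 + IS_ENABLED(CONFIG_LOCKDEP) */
#define GFP_BITS_MASK  ((1u << GFP_BITS_SHIFT) - 1)

/* Returns 1 when the two flags occupy distinct bits and both
 * still fit under the widened bits mask. */
static int gfp_bits_consistent(void)
{
	return (GFP_NO_LOCKS & GFP_NOLOCKDEP) == 0 &&
	       (GFP_NO_LOCKS & GFP_BITS_MASK) == GFP_NO_LOCKS &&
	       (GFP_NOLOCKDEP & GFP_BITS_MASK) == GFP_NOLOCKDEP;
}
```

This kind of check also makes Michal's later objection concrete: each new flag consumes one of a small number of remaining bits in the mask.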
On Wed, Sep 30, 2020 at 12:35:57PM +0200, Michal Hocko wrote: > On Wed 30-09-20 00:07:42, Uladzislau Rezki wrote: > [...] > > <snip> > > bool is_pcp_cache_empty(gfp_t gfp) > > { > > struct per_cpu_pages *pcp; > > struct zoneref *ref; > > unsigned long flags; > > bool empty; > > > > ref = first_zones_zonelist(node_zonelist( > > numa_node_id(), gfp), gfp_zone(gfp), NULL); > > if (!ref->zone) > > return true; > > > > local_irq_save(flags); > > pcp = &this_cpu_ptr(ref->zone->pageset)->pcp; > > empty = list_empty(&pcp->lists[gfp_migratetype(gfp)]); > > local_irq_restore(flags); > > > > return empty; > > } > > > > disable_irq(); > > if (!is_pcp_cache_empty(GFP_NOWAIT)) > > __get_free_page(GFP_NOWAIT); > > enable_irq(); > > <snip> > > > > Do you mean to have something like above? I mean some extra API > > function that returns true or false if fast-fast allocation can > > either occur or not. > > The above code works with the _current_ implementation and it restricts > its implementation to some degree. Future changes might get harder to > implement with a pattern like this. I do not think we want users to be > aware of internal implementation details like pcp caches, migrate types > or others. While pcp caches are here for years and unlikely to change in > a foreseeable future many details are changing on regular basis. > I see your view. That was an example for better understanding. Thanks. -- Vlad Rezki
On Thu 01-10-20 21:26:26, Uladzislau Rezki wrote: > > > > No, I meant going back to idea of new gfp flag, but adjust the implementation in > > the allocator (different from what you posted in previous version) so that it > > only looks at the flag after it tries to allocate from pcplist and finds out > > it's empty. So, no inventing of new page allocator entry points or checks such > > as the one you wrote above, but adding the new gfp flag in a way that it doesn't > > affect existing fast paths. > > > OK. Now i see. Please have a look below at the patch, so we fully understand > each other. If that is something that is close to your view or not: > > <snip> > t a/include/linux/gfp.h b/include/linux/gfp.h > index c603237e006c..7e613560a502 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -39,8 +39,9 @@ struct vm_area_struct; > #define ___GFP_HARDWALL 0x100000u > #define ___GFP_THISNODE 0x200000u > #define ___GFP_ACCOUNT 0x400000u > +#define ___GFP_NO_LOCKS 0x800000u Even if a new gfp flag gains a sufficient traction and support I am _strongly_ opposed against consuming another flag for that. Bit space is limited. Besides that we certainly do not want to allow craziness like __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we?
On Thu, Oct 01, 2020 at 09:26:26PM +0200, Uladzislau Rezki wrote:
> >
> > No, I meant going back to idea of new gfp flag, but adjust the implementation in
> > the allocator (different from what you posted in previous version) so that it
> > only looks at the flag after it tries to allocate from pcplist and finds out
> > it's empty. So, no inventing of new page allocator entry points or checks such
> > as the one you wrote above, but adding the new gfp flag in a way that it doesn't
> > affect existing fast paths.
> >
> OK. Now i see. Please have a look below at the patch, so we fully understand
> each other. If that is something that is close to your view or not:
>
> <snip>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index c603237e006c..7e613560a502 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -39,8 +39,9 @@ struct vm_area_struct;
>  #define ___GFP_HARDWALL		0x100000u
>  #define ___GFP_THISNODE		0x200000u
>  #define ___GFP_ACCOUNT		0x400000u
> +#define ___GFP_NO_LOCKS		0x800000u
>  #ifdef CONFIG_LOCKDEP
> -#define ___GFP_NOLOCKDEP	0x800000u
> +#define ___GFP_NOLOCKDEP	0x1000000u
>  #else
>  #define ___GFP_NOLOCKDEP	0
>  #endif
> @@ -215,16 +216,22 @@ struct vm_area_struct;
>   * %__GFP_COMP address compound page metadata.
>   *
>   * %__GFP_ZERO returns a zeroed page on success.
> + *
> + * %__GFP_NO_LOCKS order-0 allocation without sleepable-locks.
> + *	It obtains a page from the per-cpu-list and considered as
> + *	lock-less. No other actions are performed, thus it returns
> + *	NULL if per-cpu-list is empty.
>   */
>  #define __GFP_NOWARN	((__force gfp_t)___GFP_NOWARN)
>  #define __GFP_COMP	((__force gfp_t)___GFP_COMP)
>  #define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)
> +#define __GFP_NO_LOCKS	((__force gfp_t)___GFP_NO_LOCKS)
>

I'm not a fan of the GFP flag approach simply because we've had cases
before where GFP flags were used in inappropriate contexts like
__GFP_MEMALLOC which led to a surprising amount of bugs, particularly
from out-of-tree drivers but also in-tree drivers. Of course, there
are limited GFP flags available too but at least the comment should
be as robust as possible. Maybe something like

 * %__GFP_NO_LOCKS attempts an order-0 allocation without sleepable locks.
 *	It attempts to obtain a page without acquiring any spinlocks. This
 *	should only be used in a context where the caller holds a
 *	raw_spin_lock that cannot be released for the allocation request.
 *	This may be necessary in PREEMPT_RT kernels where a raw_spin_lock,
 *	which does not sleep, is held and the allocator would otherwise
 *	acquire a spin_lock that can sleep with PREEMPT_RT. This should
 *	not be confused with GFP_ATOMIC contexts. Like atomic allocation
 *	requests, there is no guarantee a page will be returned and
 *	the caller must be able to deal with allocation failures.
 *	The risk of allocation failure is higher than using GFP_ATOMIC.

It's verbose but it would be hard to misinterpret. I think we're
going to go through a period of time before people get familiar
with PREEMPT_RT-related hazards as various comments that were
true are going to be misleading for a while.

For anyone reviewing, any use of __GFP_NO_LOCKS should meet a high
standard where there is no alternative except to use the flag, i.e.
a higher standard than "but I'm an important driver".
On Fri, Oct 02, 2020 at 09:11:23AM +0200, Michal Hocko wrote: > On Thu 01-10-20 21:26:26, Uladzislau Rezki wrote: > > > > > > No, I meant going back to idea of new gfp flag, but adjust the implementation in > > > the allocator (different from what you posted in previous version) so that it > > > only looks at the flag after it tries to allocate from pcplist and finds out > > > it's empty. So, no inventing of new page allocator entry points or checks such > > > as the one you wrote above, but adding the new gfp flag in a way that it doesn't > > > affect existing fast paths. > > > > > OK. Now i see. Please have a look below at the patch, so we fully understand > > each other. If that is something that is close to your view or not: > > > > <snip> > > t a/include/linux/gfp.h b/include/linux/gfp.h > > index c603237e006c..7e613560a502 100644 > > --- a/include/linux/gfp.h > > +++ b/include/linux/gfp.h > > @@ -39,8 +39,9 @@ struct vm_area_struct; > > #define ___GFP_HARDWALL 0x100000u > > #define ___GFP_THISNODE 0x200000u > > #define ___GFP_ACCOUNT 0x400000u > > +#define ___GFP_NO_LOCKS 0x800000u > > Even if a new gfp flag gains a sufficient traction and support I am > _strongly_ opposed against consuming another flag for that. Bit space is > limited. That is definitely true. I'm not happy with the GFP flag at all, the comment is at best a damage limiting move. It still would be better for a memory pool to be reserved and sized for critical allocations. > Besides that we certainly do not want to allow craziness like > __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we? That would deserve to be taken to a dumpster and set on fire. The flag combination could be checked in the allocator but the allocator path fast paths are bad enough already.
On Fri 02-10-20 09:50:14, Mel Gorman wrote: > On Fri, Oct 02, 2020 at 09:11:23AM +0200, Michal Hocko wrote: > > On Thu 01-10-20 21:26:26, Uladzislau Rezki wrote: > > > > > > > > No, I meant going back to idea of new gfp flag, but adjust the implementation in > > > > the allocator (different from what you posted in previous version) so that it > > > > only looks at the flag after it tries to allocate from pcplist and finds out > > > > it's empty. So, no inventing of new page allocator entry points or checks such > > > > as the one you wrote above, but adding the new gfp flag in a way that it doesn't > > > > affect existing fast paths. > > > > > > > OK. Now i see. Please have a look below at the patch, so we fully understand > > > each other. If that is something that is close to your view or not: > > > > > > <snip> > > > t a/include/linux/gfp.h b/include/linux/gfp.h > > > index c603237e006c..7e613560a502 100644 > > > --- a/include/linux/gfp.h > > > +++ b/include/linux/gfp.h > > > @@ -39,8 +39,9 @@ struct vm_area_struct; > > > #define ___GFP_HARDWALL 0x100000u > > > #define ___GFP_THISNODE 0x200000u > > > #define ___GFP_ACCOUNT 0x400000u > > > +#define ___GFP_NO_LOCKS 0x800000u > > > > Even if a new gfp flag gains a sufficient traction and support I am > > _strongly_ opposed against consuming another flag for that. Bit space is > > limited. > > That is definitely true. I'm not happy with the GFP flag at all, the > comment is at best a damage limiting move. It still would be better for > a memory pool to be reserved and sized for critical allocations. Completely agreed. The only existing usecase is so special cased that a dedicated pool is not only easier to maintain but it should be also much better tuned for the specific workload. Something not really feasible with the allocator. > > Besides that we certainly do not want to allow craziness like > > __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we? > > That would deserve to be taken to a dumpster and set on fire. 
> The flag
> combination could be checked in the allocator but the allocator fast
> paths are bad enough already.

If a new allocation/gfp mode is absolutely necessary then I believe that
the most reasonable way forward would be

	#define GFP_NO_LOCK	((__force gfp_t)0)

and explicitly document it as a final flag to use without any further
modifiers. Yeah, there are some that could potentially be used - e.g.
zone specifiers, __GFP_ZERO and likely a few others. But support for those
can be added when there is an actual and reasonable demand. I would also
strongly argue against an implementation that allows pcp free pages to be
fully consumed.
On Fri, Oct 02, 2020 at 09:50:14AM +0100, Mel Gorman wrote: > On Fri, Oct 02, 2020 at 09:11:23AM +0200, Michal Hocko wrote: > > > +#define ___GFP_NO_LOCKS 0x800000u > > > > Even if a new gfp flag gains a sufficient traction and support I am > > _strongly_ opposed against consuming another flag for that. Bit space is > > limited. > > That is definitely true. I'm not happy with the GFP flag at all, the > comment is at best a damage limiting move. It still would be better for > a memory pool to be reserved and sized for critical allocations. This is one of the reasons I did a separate allocation function. No GFP flag to leak into general usage. > > Besides that we certainly do not want to allow craziness like > > __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we? > > That would deserve to be taken to a dumpster and set on fire. The flag > combination could be checked in the allocator but the allocator path fast > paths are bad enough already. Isn't that what we have CONFIG_DEBUG_VM for?
On Fri, Oct 02, 2020 at 11:07:29AM +0200, Peter Zijlstra wrote: > On Fri, Oct 02, 2020 at 09:50:14AM +0100, Mel Gorman wrote: > > On Fri, Oct 02, 2020 at 09:11:23AM +0200, Michal Hocko wrote: > > > > > +#define ___GFP_NO_LOCKS 0x800000u > > > > > > Even if a new gfp flag gains a sufficient traction and support I am > > > _strongly_ opposed against consuming another flag for that. Bit space is > > > limited. > > > > That is definitely true. I'm not happy with the GFP flag at all, the > > comment is at best a damage limiting move. It still would be better for > > a memory pool to be reserved and sized for critical allocations. > > This is one of the reasons I did a separate allocation function. No GFP > flag to leak into general usage. > Even a specific function with a hint that "this is for RCU only" will not prevent abuse. > > > Besides that we certainly do not want to allow craziness like > > > __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we? > > > > That would deserve to be taken to a dumpster and set on fire. The flag > > combination could be checked in the allocator but the allocator path fast > > paths are bad enough already. > > Isn't that what we have CONFIG_DEBUG_VM for? It's enabled by default by enough distros that adding too many checks is potentially painful. Granted it would be missed by most benchmarking which tend to control allocations from userspace but a lot of performance problems I see are the "death by a thousand cuts" variety.
On Fri, Oct 02, 2020 at 10:45:02AM +0100, Mel Gorman wrote: > On Fri, Oct 02, 2020 at 11:07:29AM +0200, Peter Zijlstra wrote: > > On Fri, Oct 02, 2020 at 09:50:14AM +0100, Mel Gorman wrote: > > > On Fri, Oct 02, 2020 at 09:11:23AM +0200, Michal Hocko wrote: > > > > > > > +#define ___GFP_NO_LOCKS 0x800000u > > > > > > > > Even if a new gfp flag gains a sufficient traction and support I am > > > > _strongly_ opposed against consuming another flag for that. Bit space is > > > > limited. > > > > > > That is definitely true. I'm not happy with the GFP flag at all, the > > > comment is at best a damage limiting move. It still would be better for > > > a memory pool to be reserved and sized for critical allocations. > > > > This is one of the reasons I did a separate allocation function. No GFP > > flag to leak into general usage. > > > > Even a specific function with a hint that "this is for RCU only" will > not prevent abuse. Not exporting it for modules helps, but yes. > > > > Besides that we certainly do not want to allow craziness like > > > > __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we? > > > > > > That would deserve to be taken to a dumpster and set on fire. The flag > > > combination could be checked in the allocator but the allocator path fast > > > paths are bad enough already. > > > > Isn't that what we have CONFIG_DEBUG_VM for? > > It's enabled by default by enough distros that adding too many checks > is potentially painful. Granted it would be missed by most benchmarking > which tend to control allocations from userspace but a lot of performance > problems I see are the "death by a thousand cuts" variety. Oh quite agreed, aka death by accounting. But if people are enabling DEBUG options in production kernels, there's something wrong, no? Should we now go add CONFIG_REALLY_DEBUG_STAY_AWAY_ALREADY options?
On Fri, Oct 02, 2020 at 11:58:58AM +0200, Peter Zijlstra wrote:
> > It's enabled by default by enough distros that adding too many checks
> > is potentially painful. Granted it would be missed by most benchmarking
> > which tend to control allocations from userspace but a lot of performance
> > problems I see are the "death by a thousand cuts" variety.
>
> Oh quite agreed, aka death by accounting. But if people are enabling
> DEBUG options in production kernels, there's something wrong, no?
>
You'd think, but historically I believe DEBUG_VM was enabled for some
distributions because it made certain classes of problems easier to debug
early. There is also a recent trend of enabling various DEBUG options for
"hardening" even when they protect very specific corner cases or are
intended for kernel development. I've pushed back where I have an opinion
that matters but it's generally corrosive.

> Should we now go add CONFIG_REALLY_DEBUG_STAY_AWAY_ALREADY options?

It's heading in that direction :(
On Fri, Oct 02, 2020 at 11:19:52AM +0100, Mel Gorman wrote: > On Fri, Oct 02, 2020 at 11:58:58AM +0200, Peter Zijlstra wrote: > > > It's enabled by default by enough distros that adding too many checks > > > is potentially painful. Granted it would be missed by most benchmarking > > > which tend to control allocations from userspace but a lot of performance > > > problems I see are the "death by a thousand cuts" variety. > > > > Oh quite agreed, aka death by accounting. But if people are enabling > > DEBUG options in production kernels, there's something wrong, no? > > > > You'd think but historically I believe DEBUG_VM was enabled for some > distributions because it made certain classes of problems easier to debug > early. There is also a recent trend for enabling various DEBUG options for > "hardening" even when they protect very specific corner cases or are for > intended for kernel development. I've pushed back where I have an opinion > that matters but it's generally corrosive. > > > Should we now go add CONFIG_REALLY_DEBUG_STAY_AWAY_ALREADY options? > > It's heading in that direction :( Given that you guys have just reiterated yet again that you are very unhappy with either a GFP_ flag or a special function like the one that Peter Zijlstra put together, it would be very helpful if you were to at least voice some level of support for Thomas Gleixner's patchset, which, if accepted, will allow me to solve at least 50% of the problem. https://lore.kernel.org/lkml/20200928233041.GA23230@paulmck-ThinkPad-P72/ Patch series including Thomas's patchset. https://lore.kernel.org/lkml/20201001210750.GA25287@paulmck-ThinkPad-P72/ Corresponding pull request. Thanx, Paul
On Fri, Oct 02, 2020 at 09:11:23AM +0200, Michal Hocko wrote:
> On Thu 01-10-20 21:26:26, Uladzislau Rezki wrote:
> > >
> > > No, I meant going back to idea of new gfp flag, but adjust the implementation in
> > > the allocator (different from what you posted in previous version) so that it
> > > only looks at the flag after it tries to allocate from pcplist and finds out
> > > it's empty. So, no inventing of new page allocator entry points or checks such
> > > as the one you wrote above, but adding the new gfp flag in a way that it doesn't
> > > affect existing fast paths.
> > >
> > OK. Now i see. Please have a look below at the patch, so we fully understand
> > each other. If that is something that is close to your view or not:
> >
> > <snip>
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index c603237e006c..7e613560a502 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -39,8 +39,9 @@ struct vm_area_struct;
> >  #define ___GFP_HARDWALL		0x100000u
> >  #define ___GFP_THISNODE		0x200000u
> >  #define ___GFP_ACCOUNT		0x400000u
> > +#define ___GFP_NO_LOCKS		0x800000u
>
> Even if a new gfp flag gains a sufficient traction and support I am
> _strongly_ opposed against consuming another flag for that. Bit space is
> limited.
>
That is a valid point.

> Besides that we certainly do not want to allow craziness like
> __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we?
>
Obviously not. And it seems that implementing the NO_LOCK logic would be
easier (fewer code changes) and better if it was defined like below (what
you proposed later in this thread):

-#define __GFP_NO_LOCKS	((__force gfp_t)___GFP_NO_LOCKS)
+#define __GFP_NO_LOCKS	((__force gfp_t)0)

That would imply that calling the page allocator with a zero gfp argument
applies a further limitation - that it is lock-free.

--
Vlad Rezki
On Fri, Oct 02, 2020 at 09:06:24AM +0100, Mel Gorman wrote: > On Thu, Oct 01, 2020 at 09:26:26PM +0200, Uladzislau Rezki wrote: > > > > > > No, I meant going back to idea of new gfp flag, but adjust the implementation in > > > the allocator (different from what you posted in previous version) so that it > > > only looks at the flag after it tries to allocate from pcplist and finds out > > > it's empty. So, no inventing of new page allocator entry points or checks such > > > as the one you wrote above, but adding the new gfp flag in a way that it doesn't > > > affect existing fast paths. > > > > > OK. Now i see. Please have a look below at the patch, so we fully understand > > each other. If that is something that is close to your view or not: > > > > <snip> > > t a/include/linux/gfp.h b/include/linux/gfp.h > > index c603237e006c..7e613560a502 100644 > > --- a/include/linux/gfp.h > > +++ b/include/linux/gfp.h > > @@ -39,8 +39,9 @@ struct vm_area_struct; > > #define ___GFP_HARDWALL 0x100000u > > #define ___GFP_THISNODE 0x200000u > > #define ___GFP_ACCOUNT 0x400000u > > +#define ___GFP_NO_LOCKS 0x800000u > > #ifdef CONFIG_LOCKDEP > > -#define ___GFP_NOLOCKDEP 0x800000u > > +#define ___GFP_NOLOCKDEP 0x1000000u > > #else > > #define ___GFP_NOLOCKDEP 0 > > #endif > > @@ -215,16 +216,22 @@ struct vm_area_struct; > > * %__GFP_COMP address compound page metadata. > > * > > * %__GFP_ZERO returns a zeroed page on success. > > + * > > + * %__GFP_NO_LOCKS order-0 allocation without sleepable-locks. > > + * It obtains a page from the per-cpu-list and considered as > > + * lock-less. No other actions are performed, thus it returns > > + * NULL if per-cpu-list is empty. 
> > */ > > #define __GFP_NOWARN ((__force gfp_t)___GFP_NOWARN) > > #define __GFP_COMP ((__force gfp_t)___GFP_COMP) > > #define __GFP_ZERO ((__force gfp_t)___GFP_ZERO) > > +#define __GFP_NO_LOCKS ((__force gfp_t)___GFP_NO_LOCKS) > > > > I'm not a fan of the GFP flag approach simply because we've had cases > before where GFP flags were used in inappropriate contexts like > __GFP_MEMALLOC which led to a surprising amount of bugs, particularly > from out-of-tree drivers but also in-tree drivers. Of course, there > are limited GFP flags available too but at least the comment should > be as robust as possible. Maybe something like > > * %__GFP_NO_LOCKS attempts order-0 allocation without sleepable-locks. It > * attempts to obtain a page without acquiring any spinlocks. This > * should only be used in a context where the holder holds a > * raw_spin_lock that cannot be released for the allocation request. > * This may be necessary in PREEMPT_RT kernels where a > * raw_spin_lock is held which does not sleep tries to acquire a > * spin_lock that can sleep with PREEMPT_RT. This should not be > * confused with GFP_ATOMIC contexts. Like atomic allocation > * requests, there is no guarantee a page will be returned and > * the caller must be able to deal with allocation failures. > * The risk of allocation failure is higher than using GFP_ATOMIC. > > It's verbose but it would be hard to misinterpret. I think we're > going to go through a period of time before people get familiar > with PREEMPT_RT-related hazards as various comments that were > true are going to be misleading for a while. > Yep, it should be properly documented for sure. Including new GFP_NOWAIT limitations, same as GFP_ATOMIC once you mentioned. Thanks! -- Vlad Rezki
On Fri, Oct 02, 2020 at 11:05:07AM +0200, Michal Hocko wrote: > On Fri 02-10-20 09:50:14, Mel Gorman wrote: > > On Fri, Oct 02, 2020 at 09:11:23AM +0200, Michal Hocko wrote: > > > On Thu 01-10-20 21:26:26, Uladzislau Rezki wrote: > > > > > > > > > > No, I meant going back to idea of new gfp flag, but adjust the implementation in > > > > > the allocator (different from what you posted in previous version) so that it > > > > > only looks at the flag after it tries to allocate from pcplist and finds out > > > > > it's empty. So, no inventing of new page allocator entry points or checks such > > > > > as the one you wrote above, but adding the new gfp flag in a way that it doesn't > > > > > affect existing fast paths. > > > > > > > > > OK. Now i see. Please have a look below at the patch, so we fully understand > > > > each other. If that is something that is close to your view or not: > > > > > > > > <snip> > > > > t a/include/linux/gfp.h b/include/linux/gfp.h > > > > index c603237e006c..7e613560a502 100644 > > > > --- a/include/linux/gfp.h > > > > +++ b/include/linux/gfp.h > > > > @@ -39,8 +39,9 @@ struct vm_area_struct; > > > > #define ___GFP_HARDWALL 0x100000u > > > > #define ___GFP_THISNODE 0x200000u > > > > #define ___GFP_ACCOUNT 0x400000u > > > > +#define ___GFP_NO_LOCKS 0x800000u > > > > > > Even if a new gfp flag gains a sufficient traction and support I am > > > _strongly_ opposed against consuming another flag for that. Bit space is > > > limited. > > > > That is definitely true. I'm not happy with the GFP flag at all, the > > comment is at best a damage limiting move. It still would be better for > > a memory pool to be reserved and sized for critical allocations. > > Completely agreed. The only existing usecase is so special cased that a > dedicated pool is not only easier to maintain but it should be also much > better tuned for the specific workload. Something not really feasible > with the allocator. 
> > > Besides that we certainly do not want to allow craziness like
> > > __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we?
> >
> > That would deserve to be taken to a dumpster and set on fire. The flag
> > combination could be checked in the allocator but the allocator fast
> > paths are bad enough already.
>
> If a new allocation/gfp mode is absolutely necessary then I believe that
> the most reasonable way forward would be
> 	#define GFP_NO_LOCK	((__force gfp_t)0)
>
Agree. Even though I see that some code should be adjusted for it. There are
a few users of __get_free_page(0), so it needs double checking:

<snip>
[    0.650351] BUG: kernel NULL pointer dereference, address: 0000000000000010
[    0.651083] #PF: supervisor read access in kernel mode
[    0.651639] #PF: error_code(0x0000) - not-present page
[    0.652200] PGD 0 P4D 0
[    0.652523] Oops: 0000 [#1] SMP NOPTI
[    0.652668] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc7-next-20200930+ #140
[    0.652668] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[    0.652668] RIP: 0010:__find_event_file+0x21/0x80
<snip>

Apart from that, there is a post_alloc_hook() that gets called from
prep_new_page(). If "debug page alloc" is enabled, it maps a page for debug
purposes invoking kernel_map_pages(). __kernel_map_pages() is arch-specific;
for example, the powerpc variant uses sleepable locks, which can be easily
converted to a raw variant.

--
Vlad Rezki
On Mon 05-10-20 17:08:01, Uladzislau Rezki wrote: > On Fri, Oct 02, 2020 at 11:05:07AM +0200, Michal Hocko wrote: > > On Fri 02-10-20 09:50:14, Mel Gorman wrote: > > > On Fri, Oct 02, 2020 at 09:11:23AM +0200, Michal Hocko wrote: > > > > On Thu 01-10-20 21:26:26, Uladzislau Rezki wrote: > > > > > > > > > > > > No, I meant going back to idea of new gfp flag, but adjust the implementation in > > > > > > the allocator (different from what you posted in previous version) so that it > > > > > > only looks at the flag after it tries to allocate from pcplist and finds out > > > > > > it's empty. So, no inventing of new page allocator entry points or checks such > > > > > > as the one you wrote above, but adding the new gfp flag in a way that it doesn't > > > > > > affect existing fast paths. > > > > > > > > > > > OK. Now i see. Please have a look below at the patch, so we fully understand > > > > > each other. If that is something that is close to your view or not: > > > > > > > > > > <snip> > > > > > t a/include/linux/gfp.h b/include/linux/gfp.h > > > > > index c603237e006c..7e613560a502 100644 > > > > > --- a/include/linux/gfp.h > > > > > +++ b/include/linux/gfp.h > > > > > @@ -39,8 +39,9 @@ struct vm_area_struct; > > > > > #define ___GFP_HARDWALL 0x100000u > > > > > #define ___GFP_THISNODE 0x200000u > > > > > #define ___GFP_ACCOUNT 0x400000u > > > > > +#define ___GFP_NO_LOCKS 0x800000u > > > > > > > > Even if a new gfp flag gains a sufficient traction and support I am > > > > _strongly_ opposed against consuming another flag for that. Bit space is > > > > limited. > > > > > > That is definitely true. I'm not happy with the GFP flag at all, the > > > comment is at best a damage limiting move. It still would be better for > > > a memory pool to be reserved and sized for critical allocations. > > > > Completely agreed. 
The only existing usecase is so special cased that a > > dedicated pool is not only easier to maintain but it should be also much > > better tuned for the specific workload. Something not really feasible > > with the allocator. > > > > > > Besides that we certainly do not want to allow craziness like > > > > __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we? > > > > > > That would deserve to be taken to a dumpster and set on fire. The flag > > > combination could be checked in the allocator but the allocator path fast > > > paths are bad enough already. > > > > If a new allocation/gfp mode is absolutely necessary then I believe that > > the most reasoanble way forward would be > > #define GFP_NO_LOCK ((__force gfp_t)0) > > > Agree. Even though i see that some code should be adjusted for it. There are > a few users of the __get_free_page(0); So, need to double check it: Yes, I believe I have pointed that out in the previous discussion. > <snip> > [ 0.650351] BUG: kernel NULL pointer dereference, address: 0000000000000010 > [ 0.651083] #PF: supervisor read access in kernel mode > [ 0.651639] #PF: error_code(0x0000) - not-present page > [ 0.652200] PGD 0 P4D 0 > [ 0.652523] Oops: 0000 [#1] SMP NOPTI > [ 0.652668] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc7-next-20200930+ #140 > [ 0.652668] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 > [ 0.652668] RIP: 0010:__find_event_file+0x21/0x80 > <snip> > > Apart of that. There is a post_alloc_hook(), that gets called from the prep_new_page(). > If "debug page alloc enabled", it maps a page for debug purposes invoking kernel_map_pages(). > __kernel_map_pages() is ARCH specific. For example, powerpc variant uses sleep-able locks > what can be easily converted to raw variant. Yes, there are likely more surprises like that. I am not sure about kasan, page owner (which depens on the stack unwinder) and others which hook into this path.
On Fri, Oct 02, 2020 at 07:41:20AM -0700, Paul E. McKenney wrote: > On Fri, Oct 02, 2020 at 11:19:52AM +0100, Mel Gorman wrote: > > On Fri, Oct 02, 2020 at 11:58:58AM +0200, Peter Zijlstra wrote: > > > > It's enabled by default by enough distros that adding too many checks > > > > is potentially painful. Granted it would be missed by most benchmarking > > > > which tend to control allocations from userspace but a lot of performance > > > > problems I see are the "death by a thousand cuts" variety. > > > > > > Oh quite agreed, aka death by accounting. But if people are enabling > > > DEBUG options in production kernels, there's something wrong, no? > > > > > > > You'd think but historically I believe DEBUG_VM was enabled for some > > distributions because it made certain classes of problems easier to debug > > early. There is also a recent trend for enabling various DEBUG options for > > "hardening" even when they protect very specific corner cases or are for > > intended for kernel development. I've pushed back where I have an opinion > > that matters but it's generally corrosive. > > > > > Should we now go add CONFIG_REALLY_DEBUG_STAY_AWAY_ALREADY options? > > > > It's heading in that direction :( > > Given that you guys have just reiterated yet again that you are very > unhappy with either a GFP_ flag or a special function like the one that > Peter Zijlstra put together, it would be very helpful if you were to at > least voice some level of support for Thomas Gleixner's patchset, which, > if accepted, will allow me to solve at least 50% of the problem. > I read through the series and didn't find anything problematic that had not been covered already. Minimally, avoiding surprises about what preemptible() means in different contexts is nice. While I have not run it through a test grid to check, I'd be very surprised if this was problematic from a performance perspective on a preempt-disabled kernels. 
Last I checked, the difference between PREEMPT_NONE and PREEMPT_VOLUNTARY was less than 2% *at worst* and I don't think that was due to the preempt accounting.
On Tue, Oct 06, 2020 at 11:03:34AM +0100, Mel Gorman wrote: > On Fri, Oct 02, 2020 at 07:41:20AM -0700, Paul E. McKenney wrote: > > On Fri, Oct 02, 2020 at 11:19:52AM +0100, Mel Gorman wrote: > > > On Fri, Oct 02, 2020 at 11:58:58AM +0200, Peter Zijlstra wrote: > > > > > It's enabled by default by enough distros that adding too many checks > > > > > is potentially painful. Granted it would be missed by most benchmarking > > > > > which tend to control allocations from userspace but a lot of performance > > > > > problems I see are the "death by a thousand cuts" variety. > > > > > > > > Oh quite agreed, aka death by accounting. But if people are enabling > > > > DEBUG options in production kernels, there's something wrong, no? > > > > > > > > > > You'd think but historically I believe DEBUG_VM was enabled for some > > > distributions because it made certain classes of problems easier to debug > > > early. There is also a recent trend for enabling various DEBUG options for > > > "hardening" even when they protect very specific corner cases or are for > > > intended for kernel development. I've pushed back where I have an opinion > > > that matters but it's generally corrosive. > > > > > > > Should we now go add CONFIG_REALLY_DEBUG_STAY_AWAY_ALREADY options? > > > > > > It's heading in that direction :( > > > > Given that you guys have just reiterated yet again that you are very > > unhappy with either a GFP_ flag or a special function like the one that > > Peter Zijlstra put together, it would be very helpful if you were to at > > least voice some level of support for Thomas Gleixner's patchset, which, > > if accepted, will allow me to solve at least 50% of the problem. > > I read through the series and didn't find anything problematic that > had not been covered already. Minimally, avoiding surprises about what > preemptible() means in different contexts is nice. 
While I have not > run it through a test grid to check, I'd be very surprised if this was > problematic from a performance perspective on a preempt-disabled kernels. > Last I checked, the difference between PREEMPT_NONE and PREEMPT_VOLUNTARY > was less than 2% *at worst* and I don't think that was due to the preempt > accounting. Thank you, Mel! Thanx, Paul
On Mon, Oct 05, 2020 at 05:41:00PM +0200, Michal Hocko wrote: > On Mon 05-10-20 17:08:01, Uladzislau Rezki wrote: > > On Fri, Oct 02, 2020 at 11:05:07AM +0200, Michal Hocko wrote: > > > On Fri 02-10-20 09:50:14, Mel Gorman wrote: > > > > On Fri, Oct 02, 2020 at 09:11:23AM +0200, Michal Hocko wrote: > > > > > On Thu 01-10-20 21:26:26, Uladzislau Rezki wrote: > > > > > > > > > > > > > > No, I meant going back to idea of new gfp flag, but adjust the implementation in > > > > > > > the allocator (different from what you posted in previous version) so that it > > > > > > > only looks at the flag after it tries to allocate from pcplist and finds out > > > > > > > it's empty. So, no inventing of new page allocator entry points or checks such > > > > > > > as the one you wrote above, but adding the new gfp flag in a way that it doesn't > > > > > > > affect existing fast paths. > > > > > > > > > > > > > OK. Now i see. Please have a look below at the patch, so we fully understand > > > > > > each other. If that is something that is close to your view or not: > > > > > > > > > > > > <snip> > > > > > > t a/include/linux/gfp.h b/include/linux/gfp.h > > > > > > index c603237e006c..7e613560a502 100644 > > > > > > --- a/include/linux/gfp.h > > > > > > +++ b/include/linux/gfp.h > > > > > > @@ -39,8 +39,9 @@ struct vm_area_struct; > > > > > > #define ___GFP_HARDWALL 0x100000u > > > > > > #define ___GFP_THISNODE 0x200000u > > > > > > #define ___GFP_ACCOUNT 0x400000u > > > > > > +#define ___GFP_NO_LOCKS 0x800000u > > > > > > > > > > Even if a new gfp flag gains a sufficient traction and support I am > > > > > _strongly_ opposed against consuming another flag for that. Bit space is > > > > > limited. > > > > > > > > That is definitely true. I'm not happy with the GFP flag at all, the > > > > comment is at best a damage limiting move. It still would be better for > > > > a memory pool to be reserved and sized for critical allocations. > > > > > > Completely agreed. 
The only existing usecase is so special cased that a > > > dedicated pool is not only easier to maintain but it should be also much > > > better tuned for the specific workload. Something not really feasible > > > with the allocator. > > > > > > > > Besides that we certainly do not want to allow craziness like > > > > > __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we? > > > > > > > > That would deserve to be taken to a dumpster and set on fire. The flag > > > > combination could be checked in the allocator but the allocator path fast > > > > paths are bad enough already. > > > > > > If a new allocation/gfp mode is absolutely necessary then I believe that > > > the most reasoanble way forward would be > > > #define GFP_NO_LOCK ((__force gfp_t)0) > > > > > Agree. Even though i see that some code should be adjusted for it. There are > > a few users of the __get_free_page(0); So, need to double check it: > > Yes, I believe I have pointed that out in the previous discussion. > OK. I spent more time on it. A passed gfp_mask can be adjusted on the entry, that adjustment depends on the gfp_allowed_mask. It can be changed in run-time. For example during early boot it excludes: __GFP_RECLAIM|__GFP_IO|__GFP_FS flags, what is GFP_KERNEL. So, GFP_KERNEL is adjusted on entry and becomes 0 during early boot phase. How to distinguish it: <snip> + /* + * gfp_mask can become zero because gfp_allowed_mask changes in run-time. + */ + if (!gfp_mask) + alloc_flags |= ALLOC_NO_LOCKS; + gfp_mask &= gfp_allowed_mask; alloc_mask = gfp_mask; if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags)) <snip> > > > > Apart of that. There is a post_alloc_hook(), that gets called from the prep_new_page(). > > If "debug page alloc enabled", it maps a page for debug purposes invoking kernel_map_pages(). > > __kernel_map_pages() is ARCH specific. For example, powerpc variant uses sleep-able locks > > what can be easily converted to raw variant. 
> > Yes, there are likely more surprises like that. I am not sure about > kasan, page owner (which depens on the stack unwinder) and others which > hook into this path. > I have checked kasan_alloc_pages(), kernel_poison_pages() both are OK, at least i did not find any locking there. As for set_page_owner(), it requires more attention, since it uses arch specific unwind logic. Though, i spent some time on it and so far have not noticed anything. -- Vlad Rezki
On Wed 07-10-20 00:25:29, Uladzislau Rezki wrote: > On Mon, Oct 05, 2020 at 05:41:00PM +0200, Michal Hocko wrote: > > On Mon 05-10-20 17:08:01, Uladzislau Rezki wrote: > > > On Fri, Oct 02, 2020 at 11:05:07AM +0200, Michal Hocko wrote: > > > > On Fri 02-10-20 09:50:14, Mel Gorman wrote: > > > > > On Fri, Oct 02, 2020 at 09:11:23AM +0200, Michal Hocko wrote: > > > > > > On Thu 01-10-20 21:26:26, Uladzislau Rezki wrote: > > > > > > > > > > > > > > > > No, I meant going back to idea of new gfp flag, but adjust the implementation in > > > > > > > > the allocator (different from what you posted in previous version) so that it > > > > > > > > only looks at the flag after it tries to allocate from pcplist and finds out > > > > > > > > it's empty. So, no inventing of new page allocator entry points or checks such > > > > > > > > as the one you wrote above, but adding the new gfp flag in a way that it doesn't > > > > > > > > affect existing fast paths. > > > > > > > > > > > > > > > OK. Now i see. Please have a look below at the patch, so we fully understand > > > > > > > each other. If that is something that is close to your view or not: > > > > > > > > > > > > > > <snip> > > > > > > > t a/include/linux/gfp.h b/include/linux/gfp.h > > > > > > > index c603237e006c..7e613560a502 100644 > > > > > > > --- a/include/linux/gfp.h > > > > > > > +++ b/include/linux/gfp.h > > > > > > > @@ -39,8 +39,9 @@ struct vm_area_struct; > > > > > > > #define ___GFP_HARDWALL 0x100000u > > > > > > > #define ___GFP_THISNODE 0x200000u > > > > > > > #define ___GFP_ACCOUNT 0x400000u > > > > > > > +#define ___GFP_NO_LOCKS 0x800000u > > > > > > > > > > > > Even if a new gfp flag gains a sufficient traction and support I am > > > > > > _strongly_ opposed against consuming another flag for that. Bit space is > > > > > > limited. > > > > > > > > > > That is definitely true. I'm not happy with the GFP flag at all, the > > > > > comment is at best a damage limiting move. 
It still would be better for > > > > > a memory pool to be reserved and sized for critical allocations. > > > > > > > > Completely agreed. The only existing usecase is so special cased that a > > > > dedicated pool is not only easier to maintain but it should be also much > > > > better tuned for the specific workload. Something not really feasible > > > > with the allocator. > > > > > > > > > > Besides that we certainly do not want to allow craziness like > > > > > > __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we? > > > > > > > > > > That would deserve to be taken to a dumpster and set on fire. The flag > > > > > combination could be checked in the allocator but the allocator path fast > > > > > paths are bad enough already. > > > > > > > > If a new allocation/gfp mode is absolutely necessary then I believe that > > > > the most reasoanble way forward would be > > > > #define GFP_NO_LOCK ((__force gfp_t)0) > > > > > > > Agree. Even though i see that some code should be adjusted for it. There are > > > a few users of the __get_free_page(0); So, need to double check it: > > > > Yes, I believe I have pointed that out in the previous discussion. > > > OK. I spent more time on it. A passed gfp_mask can be adjusted on the entry, > that adjustment depends on the gfp_allowed_mask. It can be changed in run-time. > > For example during early boot it excludes: __GFP_RECLAIM|__GFP_IO|__GFP_FS flags, > what is GFP_KERNEL. So, GFP_KERNEL is adjusted on entry and becomes 0 during early > boot phase. Honestly I am not sure how much is GFP_BOOT_MASK still needed. The remaining user of gfp_allowed_mask mask should be only hibernation and I believe this should be removed in long term. Not as trivial because scope API cannot be used for that as it needs a global flag but this is a gross hack that should be implemented differently. It is waiting on my todo list but never got around to that. 
> How to distinguish it: > > <snip> > + /* > + * gfp_mask can become zero because gfp_allowed_mask changes in run-time. > + */ > + if (!gfp_mask) > + alloc_flags |= ALLOC_NO_LOCKS; > + > gfp_mask &= gfp_allowed_mask; > alloc_mask = gfp_mask; > if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags)) > <snip> > > > > > > > Apart of that. There is a post_alloc_hook(), that gets called from the prep_new_page(). > > > If "debug page alloc enabled", it maps a page for debug purposes invoking kernel_map_pages(). > > > __kernel_map_pages() is ARCH specific. For example, powerpc variant uses sleep-able locks > > > what can be easily converted to raw variant. > > > > Yes, there are likely more surprises like that. I am not sure about > > kasan, page owner (which depens on the stack unwinder) and others which > > hook into this path. > > > I have checked kasan_alloc_pages(), kernel_poison_pages() both are OK, > at least i did not find any locking there. As for set_page_owner(), it > requires more attention, since it uses arch specific unwind logic. Though, > i spent some time on it and so far have not noticed anything. stack depod depends on a lock IIRC. Anyway, this is just showing how this is going to grow in complexity and make future additions harder. A niche usecase is inducing an additional complexity for future maintenance.
On Wed, Oct 07, 2020 at 12:02:34PM +0200, Michal Hocko wrote: > On Wed 07-10-20 00:25:29, Uladzislau Rezki wrote: > > On Mon, Oct 05, 2020 at 05:41:00PM +0200, Michal Hocko wrote: > > > On Mon 05-10-20 17:08:01, Uladzislau Rezki wrote: > > > > On Fri, Oct 02, 2020 at 11:05:07AM +0200, Michal Hocko wrote: > > > > > On Fri 02-10-20 09:50:14, Mel Gorman wrote: > > > > > > On Fri, Oct 02, 2020 at 09:11:23AM +0200, Michal Hocko wrote: > > > > > > > On Thu 01-10-20 21:26:26, Uladzislau Rezki wrote: > > > > > > > > > > > > > > > > > > No, I meant going back to idea of new gfp flag, but adjust the implementation in > > > > > > > > > the allocator (different from what you posted in previous version) so that it > > > > > > > > > only looks at the flag after it tries to allocate from pcplist and finds out > > > > > > > > > it's empty. So, no inventing of new page allocator entry points or checks such > > > > > > > > > as the one you wrote above, but adding the new gfp flag in a way that it doesn't > > > > > > > > > affect existing fast paths. > > > > > > > > > > > > > > > > > OK. Now i see. Please have a look below at the patch, so we fully understand > > > > > > > > each other. If that is something that is close to your view or not: > > > > > > > > > > > > > > > > <snip> > > > > > > > > t a/include/linux/gfp.h b/include/linux/gfp.h > > > > > > > > index c603237e006c..7e613560a502 100644 > > > > > > > > --- a/include/linux/gfp.h > > > > > > > > +++ b/include/linux/gfp.h > > > > > > > > @@ -39,8 +39,9 @@ struct vm_area_struct; > > > > > > > > #define ___GFP_HARDWALL 0x100000u > > > > > > > > #define ___GFP_THISNODE 0x200000u > > > > > > > > #define ___GFP_ACCOUNT 0x400000u > > > > > > > > +#define ___GFP_NO_LOCKS 0x800000u > > > > > > > > > > > > > > Even if a new gfp flag gains a sufficient traction and support I am > > > > > > > _strongly_ opposed against consuming another flag for that. Bit space is > > > > > > > limited. > > > > > > > > > > > > That is definitely true. 
I'm not happy with the GFP flag at all, the > > > > > > comment is at best a damage limiting move. It still would be better for > > > > > > a memory pool to be reserved and sized for critical allocations. > > > > > > > > > > Completely agreed. The only existing usecase is so special cased that a > > > > > dedicated pool is not only easier to maintain but it should be also much > > > > > better tuned for the specific workload. Something not really feasible > > > > > with the allocator. > > > > > > > > > > > > Besides that we certainly do not want to allow craziness like > > > > > > > __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we? > > > > > > > > > > > > That would deserve to be taken to a dumpster and set on fire. The flag > > > > > > combination could be checked in the allocator but the allocator path fast > > > > > > paths are bad enough already. > > > > > > > > > > If a new allocation/gfp mode is absolutely necessary then I believe that > > > > > the most reasoanble way forward would be > > > > > #define GFP_NO_LOCK ((__force gfp_t)0) > > > > > > > > > Agree. Even though i see that some code should be adjusted for it. There are > > > > a few users of the __get_free_page(0); So, need to double check it: > > > > > > Yes, I believe I have pointed that out in the previous discussion. > > > > > OK. I spent more time on it. A passed gfp_mask can be adjusted on the entry, > > that adjustment depends on the gfp_allowed_mask. It can be changed in run-time. > > > > For example during early boot it excludes: __GFP_RECLAIM|__GFP_IO|__GFP_FS flags, > > what is GFP_KERNEL. So, GFP_KERNEL is adjusted on entry and becomes 0 during early > > boot phase. > > Honestly I am not sure how much is GFP_BOOT_MASK still needed. The > remaining user of gfp_allowed_mask mask should be only hibernation and I > believe this should be removed in long term. 
Not as trivial because > scope API cannot be used for that as it needs a global flag but this is > a gross hack that should be implemented differently. It is waiting on my > todo list but never got around to that. > > > How to distinguish it: > > > > <snip> > > + /* > > + * gfp_mask can become zero because gfp_allowed_mask changes in run-time. > > + */ > > + if (!gfp_mask) > > + alloc_flags |= ALLOC_NO_LOCKS; > > + > > gfp_mask &= gfp_allowed_mask; > > alloc_mask = gfp_mask; > > if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags)) > > <snip> > > > > > > > > > > Apart of that. There is a post_alloc_hook(), that gets called from the prep_new_page(). > > > > If "debug page alloc enabled", it maps a page for debug purposes invoking kernel_map_pages(). > > > > __kernel_map_pages() is ARCH specific. For example, powerpc variant uses sleep-able locks > > > > what can be easily converted to raw variant. > > > > > > Yes, there are likely more surprises like that. I am not sure about > > > kasan, page owner (which depens on the stack unwinder) and others which > > > hook into this path. > > > > > I have checked kasan_alloc_pages(), kernel_poison_pages() both are OK, > > at least i did not find any locking there. As for set_page_owner(), it > > requires more attention, since it uses arch specific unwind logic. Though, > > i spent some time on it and so far have not noticed anything. > > stack depod depends on a lock IIRC. Anyway, this is just showing how > this is going to grow in complexity and make future additions harder. > A niche usecase is inducing an additional complexity for future > maintenance. > I agree regarding maintenance costs. That is what is hard to avoid. -- Vlad Rezki
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 67a0774e080b..c065031b4403 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -565,6 +565,7 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);
+extern unsigned long __rcu_alloc_page_lockless(void);

 void *alloc_pages_exact(size_t size, gfp_t gfp_mask);
 void free_pages_exact(void *virt, size_t size);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0e2bab486fea..360c68ea3491 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4908,6 +4908,88 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);

+static struct page *__rmqueue_lockless(struct zone *zone, struct per_cpu_pages *pcp)
+{
+	struct list_head *list;
+	struct page *page;
+	int migratetype;
+
+	for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++) {
+		list = &pcp->lists[migratetype];
+		page = list_first_entry_or_null(list, struct page, lru);
+		if (page && !check_new_pcp(page)) {
+			list_del(&page->lru);
+			pcp->count--;
+			return page;
+		}
+	}
+
+	return NULL;
+}
+
+/*
+ * The semantics of this function illustrate that a page
+ * is obtained in a lock-free manner. Instead of going
+ * deeper into the page allocator, it uses the pcplists
+ * only. This provides a lock-less allocation method.
+ *
+ * Some notes are below:
+ *   - intended for use by RCU code only;
+ *   - it does not use any atomic reserves.
+ */
+unsigned long __rcu_alloc_page_lockless(void)
+{
+	struct zonelist *zonelist =
+		node_zonelist(numa_node_id(), GFP_KERNEL);
+	struct zoneref *z, *preferred_zoneref;
+	struct per_cpu_pages *pcp;
+	struct page *page;
+	unsigned long flags;
+	struct zone *zone;
+
+	/*
+	 * If DEBUG_PAGEALLOC is enabled, the post_alloc_hook()
+	 * in the prep_new_page() function also does some extra
+	 * page mappings via __kernel_map_pages(), which is arch
+	 * specific and for debug purposes only.
+	 *
+	 * For example, the powerpc variant of __kernel_map_pages()
+	 * uses sleepable locks. Thus lock-less access cannot be
+	 * provided if the debug option is activated. In that
+	 * case it is fine to revert and return NULL, since RCU
+	 * code has a fallback mechanism. This is acceptable on
+	 * a debug kernel.
+	 */
+	if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC))
+		return 0;
+
+	/*
+	 * The preferred zone is the first one in the zonelist.
+	 */
+	preferred_zoneref = NULL;
+
+	for_each_zone_zonelist(zone, z, zonelist, ZONE_NORMAL) {
+		if (!preferred_zoneref)
+			preferred_zoneref = z;
+
+		local_irq_save(flags);
+		pcp = &this_cpu_ptr(zone->pageset)->pcp;
+		page = __rmqueue_lockless(zone, pcp);
+		if (page) {
+			__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
+			zone_statistics(preferred_zoneref->zone, zone);
+		}
+		local_irq_restore(flags);
+
+		if (page) {
+			prep_new_page(page, 0, 0, 0);
+			return (unsigned long) page_address(page);
+		}
+	}
+
+	return 0;
+}
+
 /*
  * Common helper functions. Never use with __GFP_HIGHMEM because the returned
  * address cannot represent highmem pages. Use alloc_pages and then kmap if
Some background and kfree_rcu()
===============================
The pointers to be freed are stored in a per-CPU array to improve
performance, to enable an easier-to-use API, to accommodate vmalloc
memory and to support a single argument of kfree_rcu() when only a
pointer is passed. More details are below.

In order to maintain such per-CPU arrays there is a need for dynamic
allocation when a current array is fully populated and a new block is
required. See the example below:

 0 1 2 3      0 1 2 3
|p|p|p|p| -> |p|p|p|p| -> NULL

there are two pointer-blocks, each one can store 4 addresses which
will be freed after a grace period has passed. In reality we store
PAGE_SIZE / sizeof(void *). So to maintain such blocks a single page
is obtained via the page allocator:

    bnode = (struct kvfree_rcu_bulk_data *)
        __get_free_page(GFP_NOWAIT | __GFP_NOWARN);

after that it is attached to the "head" and its "next" pointer is set
to the previous "head", so the list of blocks can be maintained and
grow dynamically until it gets drained by the reclaiming thread.

Please note that there is always a fallback if an allocation fails. In
the single-argument case, this is a call to synchronize_rcu(); in the
two-argument case, it is to use the rcu_head structure embedded in the
object being freed, paying the cache-miss penalty and invoking kfree()
per object instead of kfree_bulk() for groups of objects.

Why do we maintain arrays/blocks instead of linking objects via the
regular "struct rcu_head" technique? The main reasons are:

a) Memory can be reclaimed by invoking the kfree_bulk() interface,
   which requires passing an array and the number of entries in it.
   That reduces the per-object overhead caused by calling kfree()
   per object, which reduces reclamation time.

b) Improves locality and reduces the number of cache misses caused
   by "pointer chasing" between objects, which can be spread far
   apart from each other.

c) Supports a "single argument" in kvfree_rcu():

       void *ptr = kvmalloc(some_bytes, GFP_KERNEL);
       if (ptr)
           kvfree_rcu(ptr);

   We need it when an "rcu_head" is not embedded into a structure but
   the object must still be freed after a grace period. Therefore, for
   the single argument, such objects cannot be queued on a linked
   list. Nowadays, since single-argument support does not exist but
   there is demand for it, people work around it with a simple but
   inefficient sequence:

   <snip>
       synchronize_rcu(); /* Can be long and blocks the current context */
       kfree(p);
   <snip>

   More details are here: https://lkml.org/lkml/2020/4/28/1626

d) To distinguish vmalloc pointers from SLAB ones. It becomes possible
   to invoke the right freeing API for the right kind of pointer:
   kfree_bulk() or TBD: vmalloc_bulk().

e) Speeding up the post-grace-period freeing reduces the chance of a
   flood of callbacks OOMing the system.

Also, please have a look here: https://lkml.org/lkml/2020/7/30/1166

Proposal
========
Introduce a lock-free function that obtains a page from the
per-cpu-lists on the current CPU. It returns NULL rather than
acquiring any non-raw spinlock.

Description
===========
The page allocator has two phases, a fast path and a slow one. We are
interested in the fast path and order-0 allocations. It in turn is
divided into two phases, lock-less and not:

1) As a first step the page allocator tries to obtain a page from the
   per-cpu-list, so each CPU has its own one. That is why this step is
   lock-less and fast. Basically it disables irqs on the current CPU
   in order to access per-cpu data and removes the first element from
   the pcp-list. The element/page is returned to the user.

2) If there is no available page in the per-cpu-list, the second step
   is involved. It removes a specified number of elements from the
   buddy allocator, transferring them to the
   "supplied-list/per-cpu-list" described in [1].

Summarizing: __rcu_alloc_page_lockless() covers only step [1] and
cannot do step [2], because [2] requires access to zone->lock. This
implies that it is super fast, but a higher rate of failures is also
expected.

Usage: __rcu_alloc_page_lockless();

Link: https://lore.kernel.org/lkml/20200814215206.GL3982@worktop.programming.kicks-ass.net/
Not-signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 include/linux/gfp.h |  1 +
 mm/page_alloc.c     | 82 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 83 insertions(+)