Message ID | 20240604222355.2370768-2-paulmck@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Series | Miscellaneous fixes for v6.11 |
On Tue, Jun 04, 2024 at 03:23:48PM -0700, Paul E. McKenney wrote:
> From: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
>
> When all wait heads are in use, which can happen when rcu_sr_normal_gp_cleanup_work()'s callback processing is slow, any new synchronize_rcu() user's rcu_synchronize node's processing is deferred to future GP periods. This can result in a long list of synchronize_rcu() invocations waiting for full grace period processing, which can delay freeing of memory. Mitigate this problem by using the first node in the list as the wait tail when all wait heads are in use. While methods to speed up callback processing would be needed to recover from this situation, allowing new nodes to complete their grace period can help prevent delays due to a fixed number of wait head nodes.
>
> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

IIRC we agreed that this patch could be a step too far, making an already not-so-simple state machine even less simple and breaking the wait_head-based flow.

Should we postpone this change until it is observed that a workqueue not being scheduled for 5 grace periods is a real issue?

Thanks.
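The mechanism the commit message describes can be modeled in a few dozen lines. Below is a minimal, single-threaded userspace sketch, with illustrative stand-in names rather than the kernel's actual structures or API: a fixed pool of dummy wait-head markers shares one list with the synchronize_rcu() waiters, and the proposed fallback lets the newest waiter act as the wait tail when the pool is exhausted (the real patch additionally refuses the fallback when that node is the done tail, which this toy omits).

/*
 * Toy, single-threaded model of the wait-head idea.  Real waiters and a
 * fixed pool of dummy "wait head" markers share one singly linked list;
 * a grace period normally starts by pushing a free wait head as the new
 * wait tail.  When the pool is exhausted, the old behaviour defers new
 * waiters; the patch instead promotes the newest waiter node to act as
 * the wait tail.  All names here are illustrative, not the kernel's.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct node {
	struct node *next;
	bool is_wait_head;
	bool in_use;		/* only meaningful for wait heads */
	int id;			/* waiter id, for printing only */
};

#define NUM_WAIT_HEADS 5	/* fixed pool size, like the kernel's */

static struct node wait_heads[NUM_WAIT_HEADS];
static struct node *list;	/* newest element first, like an llist */

static void push(struct node *n)
{
	n->next = list;
	list = n;
}

static void add_waiter(int id)
{
	struct node *n = calloc(1, sizeof(*n));

	n->id = id;
	push(n);
}

/* Grab an unused wait head from the fixed pool, or NULL if all are busy. */
static struct node *get_free_wait_head(void)
{
	for (int i = 0; i < NUM_WAIT_HEADS; i++) {
		if (!wait_heads[i].in_use) {
			wait_heads[i].is_wait_head = true;
			wait_heads[i].in_use = true;
			return &wait_heads[i];
		}
	}
	return NULL;		/* pool exhausted: cleanup work is lagging */
}

/* Start a GP: pick the node that will mark this GP's wait tail. */
static struct node *gp_start(bool allow_first_node_fallback)
{
	struct node *wh = get_free_wait_head();

	if (wh) {
		push(wh);	/* normal case: inject a dummy marker */
		return wh;
	}
	if (allow_first_node_fallback && list && !list->is_wait_head)
		return list;	/* mitigation: newest waiter doubles as wait tail */
	return NULL;		/* old behaviour: defer to a future GP */
}

int main(void)
{
	/* Exhaust the pool: one waiter plus one wait head per GP. */
	for (int gp = 0; gp < NUM_WAIT_HEADS; gp++) {
		add_waiter(gp);
		gp_start(false);
	}

	add_waiter(100);
	printf("pool exhausted, no fallback: wait tail = %p\n",
	       (void *)gp_start(false));	/* NULL: waiter 100 is deferred */
	printf("pool exhausted, with fallback: waiter %d becomes wait tail\n",
	       gp_start(true)->id);		/* waiter 100 */
	return 0;
}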
On Wed, Jun 05, 2024 at 02:09:34PM +0200, Frederic Weisbecker wrote:
> On Tue, Jun 04, 2024 at 03:23:48PM -0700, Paul E. McKenney wrote:
> > From: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> >
> > When all wait heads are in use, which can happen when rcu_sr_normal_gp_cleanup_work()'s callback processing is slow, any new synchronize_rcu() user's rcu_synchronize node's processing is deferred to future GP periods. This can result in a long list of synchronize_rcu() invocations waiting for full grace period processing, which can delay freeing of memory. Mitigate this problem by using the first node in the list as the wait tail when all wait heads are in use. While methods to speed up callback processing would be needed to recover from this situation, allowing new nodes to complete their grace period can help prevent delays due to a fixed number of wait head nodes.
> >
> > Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
>
> IIRC we agreed that this patch could be a step too far, making an already not-so-simple state machine even less simple and breaking the wait_head-based flow.

True, which is why we agreed not to submit it into the v6.10 merge window.

And I don't recall us saying which merge window to send it to.

> Should we postpone this change until it is observed that a workqueue not being scheduled for 5 grace periods is a real issue?

Neeraj, thoughts? Or, better yet, test results? ;-)

							Thanx, Paul
On 6/6/2024 12:08 AM, Paul E. McKenney wrote:
> On Wed, Jun 05, 2024 at 02:09:34PM +0200, Frederic Weisbecker wrote:
>> On Tue, Jun 04, 2024 at 03:23:48PM -0700, Paul E. McKenney wrote:
>>> From: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
>>>
>>> When all wait heads are in use, which can happen when rcu_sr_normal_gp_cleanup_work()'s callback processing is slow, any new synchronize_rcu() user's rcu_synchronize node's processing is deferred to future GP periods. This can result in a long list of synchronize_rcu() invocations waiting for full grace period processing, which can delay freeing of memory. Mitigate this problem by using the first node in the list as the wait tail when all wait heads are in use. While methods to speed up callback processing would be needed to recover from this situation, allowing new nodes to complete their grace period can help prevent delays due to a fixed number of wait head nodes.
>>>
>>> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
>>> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
>>
>> IIRC we agreed that this patch could be a step too far, making an already not-so-simple state machine even less simple and breaking the wait_head-based flow.
>
> True, which is why we agreed not to submit it into the v6.10 merge window.
>
> And I don't recall us saying which merge window to send it to.
>
>> Should we postpone this change until it is observed that a workqueue not being scheduled for 5 grace periods is a real issue?
>
> Neeraj, thoughts? Or, better yet, test results? ;-)

Yes, I agree that we should postpone this change until we see it as a real problem. I had run a test that invokes synchronize_rcu() from all CPUs in parallel on a 96-core system, but I didn't specifically check whether this scenario was hit. I will run the RCU torture test with this change.

Thanks
Neeraj

>
> 							Thanx, Paul
On Thu, Jun 06, 2024 at 09:16:08AM +0530, Neeraj Upadhyay wrote:
> On 6/6/2024 12:08 AM, Paul E. McKenney wrote:
> > On Wed, Jun 05, 2024 at 02:09:34PM +0200, Frederic Weisbecker wrote:
> >> On Tue, Jun 04, 2024 at 03:23:48PM -0700, Paul E. McKenney wrote:
> >>> From: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> >>>
> >>> When all wait heads are in use, which can happen when rcu_sr_normal_gp_cleanup_work()'s callback processing is slow, any new synchronize_rcu() user's rcu_synchronize node's processing is deferred to future GP periods. This can result in a long list of synchronize_rcu() invocations waiting for full grace period processing, which can delay freeing of memory. Mitigate this problem by using the first node in the list as the wait tail when all wait heads are in use. While methods to speed up callback processing would be needed to recover from this situation, allowing new nodes to complete their grace period can help prevent delays due to a fixed number of wait head nodes.
> >>>
> >>> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> >>> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> >>
> >> IIRC we agreed that this patch could be a step too far, making an already not-so-simple state machine even less simple and breaking the wait_head-based flow.
> >
> > True, which is why we agreed not to submit it into the v6.10 merge window.
> >
> > And I don't recall us saying which merge window to send it to.
> >
> >> Should we postpone this change until it is observed that a workqueue not being scheduled for 5 grace periods is a real issue?
> >
> > Neeraj, thoughts? Or, better yet, test results? ;-)
>
> Yes, I agree that we should postpone this change until we see it as a real problem. I had run a test that invokes synchronize_rcu() from all CPUs in parallel on a 96-core system, but I didn't specifically check whether this scenario was hit. I will run the RCU torture test with this change.

Very well, I will drop this patch with the expectation that you will re-post it if a problem does arise.

							Thanx, Paul
On Thu, Jun 06, 2024 at 09:49:59AM -0700, Paul E. McKenney wrote:
> On Thu, Jun 06, 2024 at 09:16:08AM +0530, Neeraj Upadhyay wrote:
> > On 6/6/2024 12:08 AM, Paul E. McKenney wrote:
> > > On Wed, Jun 05, 2024 at 02:09:34PM +0200, Frederic Weisbecker wrote:
> > >> On Tue, Jun 04, 2024 at 03:23:48PM -0700, Paul E. McKenney wrote:
> > >>> From: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> > >>>
> > >>> When all wait heads are in use, which can happen when rcu_sr_normal_gp_cleanup_work()'s callback processing is slow, any new synchronize_rcu() user's rcu_synchronize node's processing is deferred to future GP periods. This can result in a long list of synchronize_rcu() invocations waiting for full grace period processing, which can delay freeing of memory. Mitigate this problem by using the first node in the list as the wait tail when all wait heads are in use. While methods to speed up callback processing would be needed to recover from this situation, allowing new nodes to complete their grace period can help prevent delays due to a fixed number of wait head nodes.
> > >>>
> > >>> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> > >>> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > >>
> > >> IIRC we agreed that this patch could be a step too far, making an already not-so-simple state machine even less simple and breaking the wait_head-based flow.
> > >
> > > True, which is why we agreed not to submit it into the v6.10 merge window.
> > >
> > > And I don't recall us saying which merge window to send it to.
> > >
> > >> Should we postpone this change until it is observed that a workqueue not being scheduled for 5 grace periods is a real issue?
> > >
> > > Neeraj, thoughts? Or, better yet, test results? ;-)
> >
> > Yes, I agree that we should postpone this change until we see it as a real problem. I had run a test that invokes synchronize_rcu() from all CPUs in parallel on a 96-core system, but I didn't specifically check whether this scenario was hit. I will run the RCU torture test with this change.
>
> Very well, I will drop this patch with the expectation that you will re-post it if a problem does arise.
>
Thank you! We discussed it before and came to the conclusion that it adds extra complexity. Once we hit an issue with delays, we can introduce it and explain a workload which triggers it.

--
Uladzislau Rezki
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 28c7031711a3f..6ba36d9c09bde 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1463,14 +1463,11 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
  * for this new grace period. Given that there are a fixed
  * number of wait nodes, if all wait nodes are in use
  * (which can happen when kworker callback processing
- * is delayed) and additional grace period is requested.
- * This means, a system is slow in processing callbacks.
- *
- * TODO: If a slow processing is detected, a first node
- * in the llist should be used as a wait-tail for this
- * grace period, therefore users which should wait due
- * to a slow process are handled by _this_ grace period
- * and not next.
+ * is delayed), first node in the llist is used as wait
+ * tail for this grace period. This means, the first node
+ * has to go through additional grace periods before it is
+ * part of the wait callbacks. This should be ok, as
+ * the system is slow in processing callbacks anyway.
  *
  * Below is an illustration of how the done and wait
  * tail pointers move from one set of rcu_synchronize nodes
@@ -1639,7 +1636,6 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
 	if (!done)
 		return;
 
-	WARN_ON_ONCE(!rcu_sr_is_wait_head(done));
 	head = done->next;
 	done->next = NULL;
 
@@ -1676,13 +1672,21 @@ static void rcu_sr_normal_gp_cleanup(void)
 
 	rcu_state.srs_wait_tail = NULL;
 	ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_wait_tail);
-	WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));
 
 	/*
 	 * Process (a) and (d) cases. See an illustration.
 	 */
 	llist_for_each_safe(rcu, next, wait_tail->next) {
-		if (rcu_sr_is_wait_head(rcu))
+		/*
+		 * The done tail may reference a rcu_synchronize node.
+		 * Stop at done tail, as using rcu_sr_normal_complete()
+		 * from this path can result in use-after-free. This
+		 * may occur if, following the wake-up of the synchronize_rcu()
+		 * wait contexts and freeing up of node memory,
+		 * rcu_sr_normal_gp_cleanup_work() accesses the done tail and
+		 * its subsequent nodes.
+		 */
+		if (wait_tail->next == rcu_state.srs_done_tail)
 			break;
 
 		rcu_sr_normal_complete(rcu);
@@ -1719,15 +1723,17 @@ static bool rcu_sr_normal_gp_init(void)
 		return start_new_poll;
 
 	wait_head = rcu_sr_get_wait_head();
-	if (!wait_head) {
-		// Kick another GP to retry.
+	if (wait_head) {
+		/* Inject a wait-dummy-node. */
+		llist_add(wait_head, &rcu_state.srs_next);
+	} else {
+		// Kick another GP for first node.
 		start_new_poll = true;
-		return start_new_poll;
+		if (first == rcu_state.srs_done_tail)
+			return start_new_poll;
+		wait_head = first;
 	}
 
-	/* Inject a wait-dummy-node. */
-	llist_add(wait_head, &rcu_state.srs_next);
-
 	/*
 	 * A waiting list of rcu_synchronize nodes should be empty on
 	 * this step, since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
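As a side note on the cleanup hunk above: the synchronous path now stops at the done tail because everything from the done tail onward still belongs to rcu_sr_normal_gp_cleanup_work(), which wakes those waiters and thereby lets their stack-allocated nodes go away. A rough userspace sketch of that stop condition, using simplified stand-in names rather than the kernel's API:

/*
 * Sketch of the stop-at-done-tail rule.  Completing entries at or past
 * the done tail from the synchronous path would race with the deferred
 * cleanup work, which owns that part of the list and may free those
 * waiters.  Names are illustrative, not the kernel's.
 */
#include <stddef.h>
#include <stdio.h>

struct snode {
	struct snode *next;
	int id;
};

/*
 * Complete waiters that this grace period owns: everything after
 * wait_tail, stopping before done_tail (exclusive), which still belongs
 * to the deferred cleanup work.
 */
static void cleanup_sync_part(struct snode *wait_tail, struct snode *done_tail)
{
	struct snode *n = wait_tail->next;

	while (n && n != done_tail) {
		struct snode *next = n->next;

		printf("completing waiter %d\n", n->id);
		/*
		 * In the kernel the completion may free the node, so it
		 * must not be touched afterwards; unlink it first.
		 */
		wait_tail->next = next;
		n = next;
	}
}

int main(void)
{
	/* wait_tail -> w3 -> w2 -> done_tail -> w1 -> w0 */
	struct snode w0 = { NULL, 0 };
	struct snode w1 = { &w0, 1 };
	struct snode done_tail = { &w1, -1 };
	struct snode w2 = { &done_tail, 2 };
	struct snode w3 = { &w2, 3 };
	struct snode wait_tail = { &w3, -1 };

	cleanup_sync_part(&wait_tail, &done_tail);	/* completes 3 and 2 only */
	return 0;
}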