
[2/9] rcu: Reduce synchronize_rcu() delays when all wait heads are in use

Message ID 20240604222355.2370768-2-paulmck@kernel.org (mailing list archive)
State New, archived
Series Miscellaneous fixes for v6.11

Commit Message

Paul E. McKenney June 4, 2024, 10:23 p.m. UTC
From: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>

When all wait heads are in use, which can happen when
rcu_sr_normal_gp_cleanup_work()'s callback processing
is slow, any new synchronize_rcu() user's rcu_synchronize
node's processing is deferred to a future grace period.
This can result in a long list of synchronize_rcu()
invocations waiting for full grace-period processing,
which can delay the freeing of memory. Mitigate this
problem by using the first node in the list as the wait
tail when all wait heads are in use. While methods to
speed up callback processing would be needed to recover
from this situation, allowing new nodes to complete their
grace period can help prevent delays due to a fixed number
of wait-head nodes.

Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tree.c | 40 +++++++++++++++++++++++-----------------
 1 file changed, 23 insertions(+), 17 deletions(-)
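
The heart of the change is the fallback in rcu_sr_normal_gp_init(): when
rcu_sr_get_wait_head() finds the fixed pool of wait heads exhausted, the
first queued rcu_synchronize node itself is pressed into service as the
wait tail rather than deferring all new waiters to a later grace period.
The following user-space sketch models only that decision; it is not the
kernel code, and the names (WAIT_HEAD_MAX, struct waiter, pick_wait_tail())
are invented for illustration, with the pool size of five taken from the
discussion below.

#include <stdbool.h>
#include <stddef.h>

#define WAIT_HEAD_MAX 5                         /* assumed fixed pool size */

struct waiter {
        struct waiter *next;
        bool is_wait_head;                      /* marker node vs. real synchronize_rcu() waiter */
};

static struct waiter wait_heads[WAIT_HEAD_MAX];
static bool wait_head_busy[WAIT_HEAD_MAX];

/* Hand out a free wait head from the fixed pool, or NULL if all are in use. */
static struct waiter *get_wait_head(void)
{
        for (int i = 0; i < WAIT_HEAD_MAX; i++) {
                if (!wait_head_busy[i]) {
                        wait_head_busy[i] = true;
                        wait_heads[i].is_wait_head = true;
                        return &wait_heads[i];
                }
        }
        return NULL;
}

/*
 * Choose the wait tail for the grace period that is about to start.
 * *listp points at the most recently queued node and done_tail marks
 * the batch already handed to the cleanup work.  Without the patch, a
 * NULL from get_wait_head() defers all new waiters to a later GP; with
 * it, the first node itself is reused as the wait tail so the waiters
 * behind it are still covered by this GP.
 */
static struct waiter *pick_wait_tail(struct waiter **listp, struct waiter *done_tail)
{
        struct waiter *first = *listp;
        struct waiter *wh;

        if (!first)                             /* no waiters queued */
                return NULL;

        wh = get_wait_head();
        if (wh) {                               /* normal path: inject a dummy marker node */
                wh->next = first;
                *listp = wh;
                return wh;
        }
        if (first == done_tail)                 /* nothing new to wait for; retry on a later GP */
                return NULL;
        return first;                           /* fallback: reuse the first waiter as the wait tail */
}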

Comments

Frederic Weisbecker June 5, 2024, 12:09 p.m. UTC | #1
On Tue, Jun 04, 2024 at 03:23:48PM -0700, Paul E. McKenney wrote:
> From: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> 
> When all wait heads are in use, which can happen when
> rcu_sr_normal_gp_cleanup_work()'s callback processing
> is slow, any new synchronize_rcu() user's rcu_synchronize
> node's processing is deferred to future GP periods. This
> can result in long list of synchronize_rcu() invocations
> waiting for full grace period processing, which can delay
> freeing of memory. Mitigate this problem by using first
> node in the list as wait tail when all wait heads are in use.
> While methods to speed up callback processing would be needed
> to recover from this situation, allowing new nodes to complete
> their grace period can help prevent delays due to a fixed
> number of wait head nodes.
> 
> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

IIRC we agreed that this patch could be a step too far, making an
already not-so-simple state machine even less simple and breaking
the wait_head-based flow.

Should we postpone this change until it is observed that a workqueue
not being scheduled for 5 grace periods is a real issue?

Thanks.
Paul E. McKenney June 5, 2024, 6:38 p.m. UTC | #2
On Wed, Jun 05, 2024 at 02:09:34PM +0200, Frederic Weisbecker wrote:
> On Tue, Jun 04, 2024 at 03:23:48PM -0700, Paul E. McKenney wrote:
> > From: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> > 
> > When all wait heads are in use, which can happen when
> > rcu_sr_normal_gp_cleanup_work()'s callback processing
> > is slow, any new synchronize_rcu() user's rcu_synchronize
> > node's processing is deferred to future GP periods. This
> > can result in long list of synchronize_rcu() invocations
> > waiting for full grace period processing, which can delay
> > freeing of memory. Mitigate this problem by using first
> > node in the list as wait tail when all wait heads are in use.
> > While methods to speed up callback processing would be needed
> > to recover from this situation, allowing new nodes to complete
> > their grace period can help prevent delays due to a fixed
> > number of wait head nodes.
> > 
> > Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> 
> IIRC we agreed that this patch could be a step too far that
> made an already not so simple state machine even less simple,
> breaking the wait_head based flow.

True, which is why we agreed not to submit it into the v6.10 merge window.

And I don't recall us saying what merge window to send it to.

> Should we postpone this change until it is observed that a workqueue
> not being scheduled for 5 grace periods is a real issue?

Neeraj, thoughts?  Or, better yet, test results?  ;-)

							Thanx, Paul
Neeraj Upadhyay June 6, 2024, 3:46 a.m. UTC | #3
On 6/6/2024 12:08 AM, Paul E. McKenney wrote:
> On Wed, Jun 05, 2024 at 02:09:34PM +0200, Frederic Weisbecker wrote:
>> On Tue, Jun 04, 2024 at 03:23:48PM -0700, Paul E. McKenney wrote:
>>> From: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
>>>
>>> When all wait heads are in use, which can happen when
>>> rcu_sr_normal_gp_cleanup_work()'s callback processing
>>> is slow, any new synchronize_rcu() user's rcu_synchronize
>>> node's processing is deferred to future GP periods. This
>>> can result in long list of synchronize_rcu() invocations
>>> waiting for full grace period processing, which can delay
>>> freeing of memory. Mitigate this problem by using first
>>> node in the list as wait tail when all wait heads are in use.
>>> While methods to speed up callback processing would be needed
>>> to recover from this situation, allowing new nodes to complete
>>> their grace period can help prevent delays due to a fixed
>>> number of wait head nodes.
>>>
>>> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
>>> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
>>
>> IIRC we agreed that this patch could be a step too far that
>> made an already not so simple state machine even less simple,
>> breaking the wait_head based flow.
> 
> True, which is why we agreed not to submit it into the v6.10 merge window.
> 
> And I don't recall us saying what merge window to send it to.
> 
>> Should we postpone this change until it is observed that a workqueue
>> not being scheduled for 5 grace periods is a real issue?
> 
> Neeraj, thoughts?  Or, better yet, test results?  ;-)

Yes, I agree that we should postpone this change until we see it as a
real problem. I had run a test that invokes synchronize_rcu() from all
CPUs in parallel on a 96-core system, but I didn't specifically check
whether this scenario was hit. I will run the RCU torture test with
this change.


Thanks
Neeraj


> 
> 							Thanx, Paul
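
For reference, below is a hypothetical sketch of the kind of stress test
described above: one kthread per online CPU, each calling
synchronize_rcu() in a loop, so that rcu_synchronize nodes are queued
faster than the cleanup work can retire them. The module and symbol names
(sr_stress_*) are invented; this illustrates the shape of such a test,
not the actual test that was run.

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/rcupdate.h>
#include <linux/cpumask.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/err.h>

static struct task_struct **sr_workers;

static int sr_stress_fn(void *arg)
{
	/* Each iteration queues one rcu_synchronize node and blocks on it. */
	while (!kthread_should_stop()) {
		synchronize_rcu();
		cond_resched();
	}
	return 0;
}

static int __init sr_stress_init(void)
{
	int cpu;

	sr_workers = kcalloc(nr_cpu_ids, sizeof(*sr_workers), GFP_KERNEL);
	if (!sr_workers)
		return -ENOMEM;

	for_each_online_cpu(cpu) {
		struct task_struct *t;

		t = kthread_create(sr_stress_fn, NULL, "sr_stress/%d", cpu);
		if (IS_ERR(t))
			continue;
		kthread_bind(t, cpu);		/* pin one worker per CPU */
		sr_workers[cpu] = t;
		wake_up_process(t);
	}
	return 0;
}

static void __exit sr_stress_exit(void)
{
	int cpu;

	for (cpu = 0; cpu < nr_cpu_ids; cpu++)
		if (sr_workers[cpu])
			kthread_stop(sr_workers[cpu]);
	kfree(sr_workers);
}

module_init(sr_stress_init);
module_exit(sr_stress_exit);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Per-CPU synchronize_rcu() stress sketch");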
Paul E. McKenney June 6, 2024, 4:49 p.m. UTC | #4
On Thu, Jun 06, 2024 at 09:16:08AM +0530, Neeraj Upadhyay wrote:
> 
> 
> On 6/6/2024 12:08 AM, Paul E. McKenney wrote:
> > On Wed, Jun 05, 2024 at 02:09:34PM +0200, Frederic Weisbecker wrote:
> >> On Tue, Jun 04, 2024 at 03:23:48PM -0700, Paul E. McKenney wrote:
> >>> From: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> >>>
> >>> When all wait heads are in use, which can happen when
> >>> rcu_sr_normal_gp_cleanup_work()'s callback processing
> >>> is slow, any new synchronize_rcu() user's rcu_synchronize
> >>> node's processing is deferred to future GP periods. This
> >>> can result in long list of synchronize_rcu() invocations
> >>> waiting for full grace period processing, which can delay
> >>> freeing of memory. Mitigate this problem by using first
> >>> node in the list as wait tail when all wait heads are in use.
> >>> While methods to speed up callback processing would be needed
> >>> to recover from this situation, allowing new nodes to complete
> >>> their grace period can help prevent delays due to a fixed
> >>> number of wait head nodes.
> >>>
> >>> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> >>> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> >>
> >> IIRC we agreed that this patch could be a step too far that
> >> made an already not so simple state machine even less simple,
> >> breaking the wait_head based flow.
> > 
> > True, which is why we agreed not to submit it into the v6.10 merge window.
> > 
> > And I don't recall us saying what merge window to send it to.
> > 
> >> Should we postpone this change until it is observed that a workqueue
> >> not being scheduled for 5 grace periods is a real issue?
> > 
> > Neeraj, thoughts?  Or, better yet, test results?  ;-)
> 
> Yes I agree that we postpone this change until we see it as a real
> problem. I had run a test to invoke synchronize_rcu() from all CPUs
> on a 96 core system in parallel. I didn't specifically check if this
> scenario was hit. Will run RCU torture test with this change.

Very well, I will drop this patch with the expectation that you will
re-post it if a problem does arise.

							Thanx, Paul
Uladzislau Rezki June 11, 2024, 10:12 a.m. UTC | #5
On Thu, Jun 06, 2024 at 09:49:59AM -0700, Paul E. McKenney wrote:
> On Thu, Jun 06, 2024 at 09:16:08AM +0530, Neeraj Upadhyay wrote:
> > 
> > 
> > On 6/6/2024 12:08 AM, Paul E. McKenney wrote:
> > > On Wed, Jun 05, 2024 at 02:09:34PM +0200, Frederic Weisbecker wrote:
> > >> On Tue, Jun 04, 2024 at 03:23:48PM -0700, Paul E. McKenney wrote:
> > >>> From: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> > >>>
> > >>> When all wait heads are in use, which can happen when
> > >>> rcu_sr_normal_gp_cleanup_work()'s callback processing
> > >>> is slow, any new synchronize_rcu() user's rcu_synchronize
> > >>> node's processing is deferred to future GP periods. This
> > >>> can result in long list of synchronize_rcu() invocations
> > >>> waiting for full grace period processing, which can delay
> > >>> freeing of memory. Mitigate this problem by using first
> > >>> node in the list as wait tail when all wait heads are in use.
> > >>> While methods to speed up callback processing would be needed
> > >>> to recover from this situation, allowing new nodes to complete
> > >>> their grace period can help prevent delays due to a fixed
> > >>> number of wait head nodes.
> > >>>
> > >>> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> > >>> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > >>
> > >> IIRC we agreed that this patch could be a step too far that
> > >> made an already not so simple state machine even less simple,
> > >> breaking the wait_head based flow.
> > > 
> > > True, which is why we agreed not to submit it into the v6.10 merge window.
> > > 
> > > And I don't recall us saying what merge window to send it to.
> > > 
> > >> Should we postpone this change until it is observed that a workqueue
> > >> not being scheduled for 5 grace periods is a real issue?
> > > 
> > > Neeraj, thoughts?  Or, better yet, test results?  ;-)
> > 
> > Yes I agree that we postpone this change until we see it as a real
> > problem. I had run a test to invoke synchronize_rcu() from all CPUs
> > on a 96 core system in parallel. I didn't specifically check if this
> > scenario was hit. Will run RCU torture test with this change.
> 
> Very well, I will drop this patch with the expectation that you will
> re-post it if a problem does arise.
> 
Thank you! We discussed it before and came to the conclusion that it
adds extra complexity. Once we hit an issue with delays, we can
introduce it and describe the workload that triggers it.

--
Uladzislau Rezki

Patch

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 28c7031711a3f..6ba36d9c09bde 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1463,14 +1463,11 @@  static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
  * for this new grace period. Given that there are a fixed
  * number of wait nodes, if all wait nodes are in use
  * (which can happen when kworker callback processing
- * is delayed) and additional grace period is requested.
- * This means, a system is slow in processing callbacks.
- *
- * TODO: If a slow processing is detected, a first node
- * in the llist should be used as a wait-tail for this
- * grace period, therefore users which should wait due
- * to a slow process are handled by _this_ grace period
- * and not next.
+ * is delayed), the first node in the llist is used as the
+ * wait tail for this grace period. This means that the first
+ * node has to go through additional grace periods before it
+ * is part of the wait callbacks. This should be OK, as the
+ * system is slow in processing callbacks anyway.
  *
  * Below is an illustration of how the done and wait
  * tail pointers move from one set of rcu_synchronize nodes
@@ -1639,7 +1636,6 @@  static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
 	if (!done)
 		return;
 
-	WARN_ON_ONCE(!rcu_sr_is_wait_head(done));
 	head = done->next;
 	done->next = NULL;
 
@@ -1676,13 +1672,21 @@  static void rcu_sr_normal_gp_cleanup(void)
 
 	rcu_state.srs_wait_tail = NULL;
 	ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_wait_tail);
-	WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));
 
 	/*
 	 * Process (a) and (d) cases. See an illustration.
 	 */
 	llist_for_each_safe(rcu, next, wait_tail->next) {
-		if (rcu_sr_is_wait_head(rcu))
+		/*
+		 * The done tail may reference a rcu_synchronize node.
+		 * Stop at done tail, as using rcu_sr_normal_complete()
+		 * from this path can result in use-after-free. This
+		 * may occur if, following the wake-up of the synchronize_rcu()
+		 * wait contexts and freeing up of node memory,
+		 * rcu_sr_normal_gp_cleanup_work() accesses the done tail and
+		 * its subsequent nodes.
+		 */
+		if (wait_tail->next == rcu_state.srs_done_tail)
 			break;
 
 		rcu_sr_normal_complete(rcu);
@@ -1719,15 +1723,17 @@  static bool rcu_sr_normal_gp_init(void)
 		return start_new_poll;
 
 	wait_head = rcu_sr_get_wait_head();
-	if (!wait_head) {
-		// Kick another GP to retry.
+	if (wait_head) {
+		/* Inject a wait-dummy-node. */
+		llist_add(wait_head, &rcu_state.srs_next);
+	} else {
+		// Kick another GP for first node.
 		start_new_poll = true;
-		return start_new_poll;
+		if (first == rcu_state.srs_done_tail)
+			return start_new_poll;
+		wait_head = first;
 	}
 
-	/* Inject a wait-dummy-node. */
-	llist_add(wait_head, &rcu_state.srs_next);
-
 	/*
 	 * A waiting list of rcu_synchronize nodes should be empty on
 	 * this step, since a GP-kthread, rcu_gp_init() -> gp_cleanup(),