[RFC,7/9] rv: Retry when da monitor detects race conditions

Message ID	20250404084512.98552-18-gmonaco@redhat.com (mailing list archive)
State	New
Headers	show Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E095A1A2642 for <linux-trace-kernel@vger.kernel.org>; Fri, 4 Apr 2025 08:46:28 +0000 (UTC) From: Gabriele Monaco <gmonaco@redhat.com> To: linux-kernel@vger.kernel.org, Steven Rostedt <rostedt@goodmis.org>, linux-trace-kernel@vger.kernel.org Cc: Gabriele Monaco <gmonaco@redhat.com>, Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Tomas Glozar <tglozar@redhat.com>, Juri Lelli <jlelli@redhat.com> Subject: [RFC PATCH 7/9] rv: Retry when da monitor detects race conditions Date: Fri, 4 Apr 2025 10:45:20 +0200 Message-ID: <20250404084512.98552-18-gmonaco@redhat.com> In-Reply-To: <20250404084512.98552-11-gmonaco@redhat.com> References: <20250404084512.98552-11-gmonaco@redhat.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	[RFC,1/9] tools/rv: Do not skip idle in trace \| expand [RFC,1/9] tools/rv: Do not skip idle in trace [RFC,2/9] tools/rv: Stop gracefully also on SIGTERM [RFC,3/9] rv: Add da_handle_start_run_event_ to per-task monitors [RFC,4/9] rv: Remove trailing whitespace from tracepoint string [RFC,5/9] sched: Add sched tracepoints for RV task model [RFC,7/9] rv: Retry when da monitor detects race conditions [RFC,8/9] rv: Replace tss monitor with more complete sts [RFC,9/9] rv: Add srs per-task monitor

Message ID

20250404084512.98552-18-gmonaco@redhat.com (mailing list archive)

State

New

Headers

From: Gabriele Monaco <gmonaco@redhat.com>
To: linux-kernel@vger.kernel.org,
	Steven Rostedt <rostedt@goodmis.org>,
	linux-trace-kernel@vger.kernel.org
Cc: Gabriele Monaco <gmonaco@redhat.com>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Tomas Glozar <tglozar@redhat.com>,
	Juri Lelli <jlelli@redhat.com>
Subject: [RFC PATCH 7/9] rv: Retry when da monitor detects race conditions
Date: Fri,  4 Apr 2025 10:45:20 +0200
Message-ID: <20250404084512.98552-18-gmonaco@redhat.com>
In-Reply-To: <20250404084512.98552-11-gmonaco@redhat.com>
References: <20250404084512.98552-11-gmonaco@redhat.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

[RFC,1/9] tools/rv: Do not skip idle in trace | expand

Commit Message

Gabriele Monaco April 4, 2025, 8:45 a.m. UTC

DA monitor can be accessed from multiple cores simultaneously, this is
likely, for instance when dealing with per-task monitors reacting on
events that do not always occur on the CPU where the task is running.
This can cause race conditions where two events change the next state
and we see inconsistent values. E.g.:

  [62] event_srs: 27: sleepable x sched_wakeup -> running (final)
  [63] event_srs: 27: sleepable x sched_set_state_sleepable -> sleepable
  [63] error_srs: 27: event sched_switch_suspend not expected in the state running

In this case the monitor fails because the event on CPU 62 wins against
the one on CPU 63, although the correct state should have been
sleepable, since the task get suspended.

Detect if the current state was modified by using try_cmpxchg while
storing the next value. If it was, try again reading the current state.
After a maximum number of failed retries, react as if it was an error
with invalid current state (we cannot determine it).

Monitors where this type of condition can occur must be able to account
for racing events in any possible order, as we cannot know the winner.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/linux/rv.h      |  3 ++-
 include/rv/da_monitor.h | 53 +++++++++++++++++++++++++++++++----------
 2 files changed, 42 insertions(+), 14 deletions(-)

Comments

Nam Cao April 11, 2025, 4:52 a.m. UTC | #1

On Fri, Apr 04, 2025 at 10:45:20AM +0200, Gabriele Monaco wrote:
> DA monitor can be accessed from multiple cores simultaneously, this is
> likely, for instance when dealing with per-task monitors reacting on
> events that do not always occur on the CPU where the task is running.
> This can cause race conditions where two events change the next state
> and we see inconsistent values. E.g.:
> 
>   [62] event_srs: 27: sleepable x sched_wakeup -> running (final)
>   [63] event_srs: 27: sleepable x sched_set_state_sleepable -> sleepable
>   [63] error_srs: 27: event sched_switch_suspend not expected in the state running
> 
> In this case the monitor fails because the event on CPU 62 wins against
> the one on CPU 63, although the correct state should have been
> sleepable, since the task get suspended.
> 
> Detect if the current state was modified by using try_cmpxchg while
> storing the next value. If it was, try again reading the current state.
> After a maximum number of failed retries, react as if it was an error
> with invalid current state (we cannot determine it).
> 
> Monitors where this type of condition can occur must be able to account
> for racing events in any possible order, as we cannot know the winner.

Is locking not simpler? I understand raw_spin_lock() doesn't work because
it steps on some tracepoints. But how about adding something like
raw_spin_lock_notrace()?

static inline bool raw_spin_lock_notrace(raw_spinlock_t *lock)
{
	preempt_disable_notrace(); //probably not required, tracepoint handlers do this already

	if (!do_raw_spin_trylock(lock))
		do_raw_spin_lock(lock);
}

My LTL series theoretically also has this problem, but I have never got it
during testing yet. We should use the same solution for both DA and LTL.

Also, can you please Cc me in your RV patches?

Best regards,
Nam

Gabriele Monaco April 11, 2025, 6:09 a.m. UTC | #2

On Fri, 2025-04-11 at 06:52 +0200, Nam Cao wrote:
> On Fri, Apr 04, 2025 at 10:45:20AM +0200, Gabriele Monaco wrote:
> > DA monitor can be accessed from multiple cores simultaneously, this
> > is
> > likely, for instance when dealing with per-task monitors reacting
> > on
> > events that do not always occur on the CPU where the task is
> > running.
> > This can cause race conditions where two events change the next
> > state
> > and we see inconsistent values. E.g.:
> > 
> >   [62] event_srs: 27: sleepable x sched_wakeup -> running (final)
> >   [63] event_srs: 27: sleepable x sched_set_state_sleepable ->
> > sleepable
> >   [63] error_srs: 27: event sched_switch_suspend not expected in
> > the state running
> > 
> > In this case the monitor fails because the event on CPU 62 wins
> > against
> > the one on CPU 63, although the correct state should have been
> > sleepable, since the task get suspended.
> > 
> > Detect if the current state was modified by using try_cmpxchg while
> > storing the next value. If it was, try again reading the current
> > state.
> > After a maximum number of failed retries, react as if it was an
> > error
> > with invalid current state (we cannot determine it).
> > 
> > Monitors where this type of condition can occur must be able to
> > account
> > for racing events in any possible order, as we cannot know the
> > winner.
> 
> Is locking not simpler? I understand raw_spin_lock() doesn't work
> because
> it steps on some tracepoints. But how about adding something like
> raw_spin_lock_notrace()?

It is probably simpler, but I think it would require also to disable
interrupts (some events occur in interrupt context), I'm not sure the
introduced overhead is going to be worth it in the fast path, but
that's kinda what I wanted to learn in this RFC ;)

> 
> static inline bool raw_spin_lock_notrace(raw_spinlock_t *lock)
> {
> 	preempt_disable_notrace(); //probably not required,
> tracepoint handlers do this already
> 
> 	if (!do_raw_spin_trylock(lock))
> 		do_raw_spin_lock(lock);
> }
> 
> My LTL series theoretically also has this problem, but I have never
> got it
> during testing yet. We should use the same solution for both DA and
> LTL.

Yes totally, on the long run we might get some common utilities for
this kind of things that aren't too monitor specific. But for now I
wouldn't worry too much.

> 
> Also, can you please Cc me in your RV patches?
> 

Right.. will do!

Thanks for your feedback,
Gabriele

diff --git a/include/linux/rv.h b/include/linux/rv.h
index 3452b5e4b29e7..a83a81ac6e466 100644
--- a/include/linux/rv.h
+++ b/include/linux/rv.h
@@ -7,7 +7,8 @@ 
 #ifndef _LINUX_RV_H
 #define _LINUX_RV_H
 
-#define MAX_DA_NAME_LEN	32
+#define MAX_DA_NAME_LEN			32
+#define MAX_DA_RETRY_RACING_EVENTS	3
 
 #ifdef CONFIG_RV
 /*
diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index 215c3eb770ccc..8b714e3085a55 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -82,16 +82,19 @@  static inline void da_monitor_reset_##name(struct da_monitor *da_mon)				\
  */												\
 static inline type da_monitor_curr_state_##name(struct da_monitor *da_mon)			\
 {												\
-	return da_mon->curr_state;								\
+	return READ_ONCE(da_mon->curr_state);							\
 }												\
 												\
 /*												\
  * da_monitor_set_state_##name - set the new current state					\
+ *												\
+ * return false without the change in case the state was modified elsewhere			\
  */												\
-static inline void										\
-da_monitor_set_state_##name(struct da_monitor *da_mon, enum states_##name state)		\
+static inline bool										\
+da_monitor_set_state_##name(struct da_monitor *da_mon, enum states_##name prev_state,		\
+			    enum states_##name state)						\
 {												\
-	da_mon->curr_state = state;								\
+	return try_cmpxchg(&da_mon->curr_state, &prev_state, state);				\
 }												\
 												\
 /*												\
@@ -150,17 +153,29 @@  static inline bool da_monitor_handling_event_##name(struct da_monitor *da_mon)
  * Event handler for implicit monitors. Implicit monitor is the one which the
  * handler does not need to specify which da_monitor to manipulate. Examples
  * of implicit monitor are the per_cpu or the global ones.
+ *
+ * Retry, in case there is a race while getting and setting the next state
+ * return an invalid current state if we run out of retries. The monitor should
+ * be able to handle various orders.
  */
 #define DECLARE_DA_MON_MODEL_HANDLER_IMPLICIT(name, type)					\
 												\
 static inline bool										\
 da_event_##name(struct da_monitor *da_mon, enum events_##name event)				\
 {												\
-	type curr_state = da_monitor_curr_state_##name(da_mon);					\
-	type next_state = model_get_next_state_##name(curr_state, event);			\
+	bool changed;										\
+	type curr_state, next_state;								\
 												\
-	if (next_state != INVALID_STATE) {							\
-		da_monitor_set_state_##name(da_mon, next_state);				\
+	for (int i = 0; i < MAX_DA_RETRY_RACING_EVENTS; i++) {					\
+		curr_state = da_monitor_curr_state_##name(da_mon);				\
+		next_state = model_get_next_state_##name(curr_state, event);			\
+		if (next_state == INVALID_STATE)						\
+			break;									\
+		changed = da_monitor_set_state_##name(da_mon, curr_state, next_state);		\
+		if (unlikely(!changed))	{							\
+			curr_state = -1;							\
+			continue;								\
+		}										\
 												\
 		trace_event_##name(model_get_state_name_##name(curr_state),			\
 				   model_get_event_name_##name(event),				\
@@ -181,17 +196,29 @@  da_event_##name(struct da_monitor *da_mon, enum events_##name event)				\
 
 /*
  * Event handler for per_task monitors.
+ *
+ * Retry, in case there is a race while getting and setting the next state
+ * return an invalid current state if we run out of retries. The monitor should
+ * be able to handle various orders.
  */
 #define DECLARE_DA_MON_MODEL_HANDLER_PER_TASK(name, type)					\
 												\
 static inline bool da_event_##name(struct da_monitor *da_mon, struct task_struct *tsk,		\
 				   enum events_##name event)					\
 {												\
-	type curr_state = da_monitor_curr_state_##name(da_mon);					\
-	type next_state = model_get_next_state_##name(curr_state, event);			\
-												\
-	if (next_state != INVALID_STATE) {							\
-		da_monitor_set_state_##name(da_mon, next_state);				\
+	bool changed;										\
+	type curr_state, next_state;								\
+												\
+	for (int i = 0; i < MAX_DA_RETRY_RACING_EVENTS; i++) {					\
+		curr_state = da_monitor_curr_state_##name(da_mon);				\
+		next_state = model_get_next_state_##name(curr_state, event);			\
+		if (next_state == INVALID_STATE)						\
+			break;									\
+		changed = da_monitor_set_state_##name(da_mon, curr_state, next_state);		\
+		if (unlikely(!changed))	{							\
+			curr_state = -1;							\
+			continue;								\
+		}										\
 												\
 		trace_event_##name(tsk->pid,							\
 				   model_get_state_name_##name(curr_state),			\

[RFC,7/9] rv: Retry when da monitor detects race conditions

Commit Message

Comments

Patch