[v2] sched/rt: Clean up usage of rt_task()

Message ID	20240515220536.823145-1-qyousef@layalina.io (mailing list archive)
State	New, archived
Headers	show Received: from mail-lj1-f173.google.com (mail-lj1-f173.google.com [209.85.208.173]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3F4E915B561 for <linux-fsdevel@vger.kernel.org>; Wed, 15 May 2024 22:05:46 +0000 (UTC) From: Qais Yousef <qyousef@layalina.io> To: Ingo Molnar <mingo@kernel.org>, Peter Zijlstra <peterz@infradead.org>, Juri Lelli <juri.lelli@redhat.com>, Steven Rostedt <rostedt@goodmis.org> Cc: Vincent Guittot <vincent.guittot@linaro.org>, Daniel Bristot de Oliveira <bristot@redhat.com>, Thomas Gleixner <tglx@linutronix.de>, Sebastian Andrzej Siewior <bigeasy@linutronix.de>, Alexander Viro <viro@zeniv.linux.org.uk>, Christian Brauner <brauner@kernel.org>, Andrew Morton <akpm@linux-foundation.org>, Jens Axboe <axboe@kernel.dk>, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-mm@kvack.org, Qais Yousef <qyousef@layalina.io>, Phil Auld <pauld@redhat.com> Subject: [PATCH v2] sched/rt: Clean up usage of rt_task() Date: Wed, 15 May 2024 23:05:36 +0100 Message-Id: <20240515220536.823145-1-qyousef@layalina.io> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	[v2] sched/rt: Clean up usage of rt_task() \| expand [v2] sched/rt: Clean up usage of rt_task()

Qais Yousef May 15, 2024, 10:05 p.m. UTC

rt_task() checks if a task has RT priority. But depends on your
dictionary, this could mean it belongs to RT class, or is a 'realtime'
task, which includes RT and DL classes.

Since this has caused some confusion already on discussion [1], it
seemed a clean up is due.

I define the usage of rt_task() to be tasks that belong to RT class.
Make sure that it returns true only for RT class and audit the users and
replace the ones required the old behavior with the new realtime_task()
which returns true for RT and DL classes. Introduce similar
realtime_prio() to create similar distinction to rt_prio() and update
the users that required the old behavior to use the new function.

Move MAX_DL_PRIO to prio.h so it can be used in the new definitions.

Document the functions to make it more obvious what is the difference
between them. PI-boosted tasks is a factor that must be taken into
account when choosing which function to use.

Rename task_is_realtime() to realtime_task_policy() as the old name is
confusing against the new realtime_task().

No functional changes were intended.

[1] https://lore.kernel.org/lkml/20240506100509.GL40213@noisy.programming.kicks-ass.net/

Reviewed-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Qais Yousef <qyousef@layalina.io>
---

Changes since v1:

	* Use realtime_task_policy() instead task_has_realtime_policy() (Peter)
	* Improve commit message readability about replace some rt_task()
	  users.

v1 discussion: https://lore.kernel.org/lkml/20240514234112.792989-1-qyousef@layalina.io/

 fs/select.c                       |  2 +-
 include/linux/ioprio.h            |  2 +-
 include/linux/sched/deadline.h    |  6 ++++--
 include/linux/sched/prio.h        |  1 +
 include/linux/sched/rt.h          | 27 ++++++++++++++++++++++++++-
 kernel/locking/rtmutex.c          |  4 ++--
 kernel/locking/rwsem.c            |  4 ++--
 kernel/locking/ww_mutex.h         |  2 +-
 kernel/sched/core.c               |  6 +++---
 kernel/time/hrtimer.c             |  6 +++---
 kernel/trace/trace_sched_wakeup.c |  2 +-
 mm/page-writeback.c               |  4 ++--
 mm/page_alloc.c                   |  2 +-
 13 files changed, 48 insertions(+), 20 deletions(-)

Sebastian Sewior May 21, 2024, 11 a.m. UTC | #1

On 2024-05-15 23:05:36 [+0100], Qais Yousef wrote:
> rt_task() checks if a task has RT priority. But depends on your
> dictionary, this could mean it belongs to RT class, or is a 'realtime'
> task, which includes RT and DL classes.
> 
> Since this has caused some confusion already on discussion [1], it
> seemed a clean up is due.
> 
> I define the usage of rt_task() to be tasks that belong to RT class.
> Make sure that it returns true only for RT class and audit the users and
> replace the ones required the old behavior with the new realtime_task()
> which returns true for RT and DL classes. Introduce similar
> realtime_prio() to create similar distinction to rt_prio() and update
> the users that required the old behavior to use the new function.
> 
> Move MAX_DL_PRIO to prio.h so it can be used in the new definitions.
> 
> Document the functions to make it more obvious what is the difference
> between them. PI-boosted tasks is a factor that must be taken into
> account when choosing which function to use.
> 
> Rename task_is_realtime() to realtime_task_policy() as the old name is
> confusing against the new realtime_task().

I *think* everyone using rt_task() means to include DL tasks. And
everyone means !SCHED-people since they know when the difference matters.

> No functional changes were intended.
> 
> [1] https://lore.kernel.org/lkml/20240506100509.GL40213@noisy.programming.kicks-ass.net/
> 
> Reviewed-by: Phil Auld <pauld@redhat.com>
> Signed-off-by: Qais Yousef <qyousef@layalina.io>
> ---
> 
> Changes since v1:
> 
> 	* Use realtime_task_policy() instead task_has_realtime_policy() (Peter)
> 	* Improve commit message readability about replace some rt_task()
> 	  users.
> 
> v1 discussion: https://lore.kernel.org/lkml/20240514234112.792989-1-qyousef@layalina.io/
> 
>  fs/select.c                       |  2 +-

fs/bcachefs/six.c
six_owner_running() has rt_task(). But imho should have realtime_task()
to consider DL. But I think it is way worse that it has its own locking
rather than using what everyone else but then again it wouldn't be the
new hot thing…

>  include/linux/ioprio.h            |  2 +-
>  include/linux/sched/deadline.h    |  6 ++++--
>  include/linux/sched/prio.h        |  1 +
>  include/linux/sched/rt.h          | 27 ++++++++++++++++++++++++++-
>  kernel/locking/rtmutex.c          |  4 ++--
>  kernel/locking/rwsem.c            |  4 ++--
>  kernel/locking/ww_mutex.h         |  2 +-
>  kernel/sched/core.c               |  6 +++---
>  kernel/time/hrtimer.c             |  6 +++---
>  kernel/trace/trace_sched_wakeup.c |  2 +-
>  mm/page-writeback.c               |  4 ++--
>  mm/page_alloc.c                   |  2 +-
>  13 files changed, 48 insertions(+), 20 deletions(-)
…
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 70625dff62ce..08b95e0a41ab 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -1996,7 +1996,7 @@ static void __hrtimer_init_sleeper(struct hrtimer_sleeper *sl,
>  	 * expiry.
>  	 */
>  	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> -		if (task_is_realtime(current) && !(mode & HRTIMER_MODE_SOFT))
> +		if (realtime_task_policy(current) && !(mode & HRTIMER_MODE_SOFT))
>  			mode |= HRTIMER_MODE_HARD;
>  	}
>  
> @@ -2096,7 +2096,7 @@ long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode,
>  	u64 slack;
>  
>  	slack = current->timer_slack_ns;
> -	if (rt_task(current))
> +	if (realtime_task(current))
>  		slack = 0;
>  
>  	hrtimer_init_sleeper_on_stack(&t, clockid, mode);
> @@ -2301,7 +2301,7 @@ schedule_hrtimeout_range_clock(ktime_t *expires, u64 delta,
>  	 * Override any slack passed by the user if under
>  	 * rt contraints.
>  	 */
> -	if (rt_task(current))
> +	if (realtime_task(current))
>  		delta = 0;

I know this is just converting what is already here but…
__hrtimer_init_sleeper() looks at the policy to figure out if the task
is realtime do decide if should expire in HARD-IRQ context. This is
correct, a boosted task should not sleep.

hrtimer_nanosleep() + schedule_hrtimeout_range_clock() is looking at
priority to decide if slack should be removed. This should also look at
policy since a boosted task shouldn't sleep.

In order to be PI-boosted you need to acquire a lock and the only lock
you can sleep while acquired without generating a warning is a mutex_t
(or equivalent sleeping lock) on PREEMPT_RT. 

>  	hrtimer_init_sleeper_on_stack(&t, clock_id, mode);
> diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
> index 0469a04a355f..19d737742e29 100644
> --- a/kernel/trace/trace_sched_wakeup.c
> +++ b/kernel/trace/trace_sched_wakeup.c
> @@ -545,7 +545,7 @@ probe_wakeup(void *ignore, struct task_struct *p)
>  	 *  - wakeup_dl handles tasks belonging to sched_dl class only.
>  	 */
>  	if (tracing_dl || (wakeup_dl && !dl_task(p)) ||
> -	    (wakeup_rt && !dl_task(p) && !rt_task(p)) ||
> +	    (wakeup_rt && !realtime_task(p)) ||
>  	    (!dl_task(p) && (p->prio >= wakeup_prio || p->prio >= current->prio)))
>  		return;
>  

Sebastian

Steven Rostedt May 23, 2024, 3:45 p.m. UTC | #2

On Wed, 15 May 2024 23:05:36 +0100
Qais Yousef <qyousef@layalina.io> wrote:
> diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
> index df3aca89d4f5..5cb88b748ad6 100644
> --- a/include/linux/sched/deadline.h
> +++ b/include/linux/sched/deadline.h
> @@ -10,8 +10,6 @@
>  
>  #include <linux/sched.h>
>  
> -#define MAX_DL_PRIO		0
> -
>  static inline int dl_prio(int prio)
>  {
>  	if (unlikely(prio < MAX_DL_PRIO))
> @@ -19,6 +17,10 @@ static inline int dl_prio(int prio)
>  	return 0;
>  }
>  
> +/*
> + * Returns true if a task has a priority that belongs to DL class. PI-boosted
> + * tasks will return true. Use dl_policy() to ignore PI-boosted tasks.
> + */
>  static inline int dl_task(struct task_struct *p)
>  {
>  	return dl_prio(p->prio);
> diff --git a/include/linux/sched/prio.h b/include/linux/sched/prio.h
> index ab83d85e1183..6ab43b4f72f9 100644
> --- a/include/linux/sched/prio.h
> +++ b/include/linux/sched/prio.h
> @@ -14,6 +14,7 @@
>   */
>  
>  #define MAX_RT_PRIO		100
> +#define MAX_DL_PRIO		0
>  
>  #define MAX_PRIO		(MAX_RT_PRIO + NICE_WIDTH)
>  #define DEFAULT_PRIO		(MAX_RT_PRIO + NICE_WIDTH / 2)
> diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
> index b2b9e6eb9683..a055dd68a77c 100644
> --- a/include/linux/sched/rt.h
> +++ b/include/linux/sched/rt.h
> @@ -7,18 +7,43 @@
>  struct task_struct;
>  
>  static inline int rt_prio(int prio)
> +{
> +	if (unlikely(prio < MAX_RT_PRIO && prio >= MAX_DL_PRIO))
> +		return 1;
> +	return 0;
> +}
> +
> +static inline int realtime_prio(int prio)
>  {
>  	if (unlikely(prio < MAX_RT_PRIO))
>  		return 1;
>  	return 0;
>  }

I'm thinking we should change the above to bool (separate patch), as
returning an int may give one the impression that it returns the actual
priority number. Having it return bool will clear that up.

In fact, if we are touching theses functions, might as well change all of
them to bool when returning true/false. Just to make it easier to
understand what they are doing.

>  
> +/*
> + * Returns true if a task has a priority that belongs to RT class. PI-boosted
> + * tasks will return true. Use rt_policy() to ignore PI-boosted tasks.
> + */
>  static inline int rt_task(struct task_struct *p)
>  {
>  	return rt_prio(p->prio);
>  }
>  
> -static inline bool task_is_realtime(struct task_struct *tsk)
> +/*
> + * Returns true if a task has a priority that belongs to RT or DL classes.
> + * PI-boosted tasks will return true. Use realtime_task_policy() to ignore
> + * PI-boosted tasks.
> + */
> +static inline int realtime_task(struct task_struct *p)
> +{
> +	return realtime_prio(p->prio);
> +}
> +
> +/*
> + * Returns true if a task has a policy that belongs to RT or DL classes.
> + * PI-boosted tasks will return false.
> + */
> +static inline bool realtime_task_policy(struct task_struct *tsk)
>  {
>  	int policy = tsk->policy;
>  



> diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
> index 0469a04a355f..19d737742e29 100644
> --- a/kernel/trace/trace_sched_wakeup.c
> +++ b/kernel/trace/trace_sched_wakeup.c
> @@ -545,7 +545,7 @@ probe_wakeup(void *ignore, struct task_struct *p)
>  	 *  - wakeup_dl handles tasks belonging to sched_dl class only.
>  	 */
>  	if (tracing_dl || (wakeup_dl && !dl_task(p)) ||
> -	    (wakeup_rt && !dl_task(p) && !rt_task(p)) ||
> +	    (wakeup_rt && !realtime_task(p)) ||
>  	    (!dl_task(p) && (p->prio >= wakeup_prio || p->prio >= current->prio)))
>  		return;
>  

Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>

Qais Yousef May 27, 2024, 5:26 p.m. UTC | #3

On 05/21/24 13:00, Sebastian Andrzej Siewior wrote:
> On 2024-05-15 23:05:36 [+0100], Qais Yousef wrote:
> > rt_task() checks if a task has RT priority. But depends on your
> > dictionary, this could mean it belongs to RT class, or is a 'realtime'
> > task, which includes RT and DL classes.
> > 
> > Since this has caused some confusion already on discussion [1], it
> > seemed a clean up is due.
> > 
> > I define the usage of rt_task() to be tasks that belong to RT class.
> > Make sure that it returns true only for RT class and audit the users and
> > replace the ones required the old behavior with the new realtime_task()
> > which returns true for RT and DL classes. Introduce similar
> > realtime_prio() to create similar distinction to rt_prio() and update
> > the users that required the old behavior to use the new function.
> > 
> > Move MAX_DL_PRIO to prio.h so it can be used in the new definitions.
> > 
> > Document the functions to make it more obvious what is the difference
> > between them. PI-boosted tasks is a factor that must be taken into
> > account when choosing which function to use.
> > 
> > Rename task_is_realtime() to realtime_task_policy() as the old name is
> > confusing against the new realtime_task().
> 
> I *think* everyone using rt_task() means to include DL tasks. And
> everyone means !SCHED-people since they know when the difference matters.

yes, this makes sense

> 
> > No functional changes were intended.
> > 
> > [1] https://lore.kernel.org/lkml/20240506100509.GL40213@noisy.programming.kicks-ass.net/
> > 
> > Reviewed-by: Phil Auld <pauld@redhat.com>
> > Signed-off-by: Qais Yousef <qyousef@layalina.io>
> > ---
> > 
> > Changes since v1:
> > 
> > 	* Use realtime_task_policy() instead task_has_realtime_policy() (Peter)
> > 	* Improve commit message readability about replace some rt_task()
> > 	  users.
> > 
> > v1 discussion: https://lore.kernel.org/lkml/20240514234112.792989-1-qyousef@layalina.io/
> > 
> >  fs/select.c                       |  2 +-
> 
> fs/bcachefs/six.c
> six_owner_running() has rt_task(). But imho should have realtime_task()
> to consider DL. But I think it is way worse that it has its own locking
> rather than using what everyone else but then again it wouldn't be the
> new hot thing…

I think I missed this one. Converted now. Thanks!

> 
> >  include/linux/ioprio.h            |  2 +-
> >  include/linux/sched/deadline.h    |  6 ++++--
> >  include/linux/sched/prio.h        |  1 +
> >  include/linux/sched/rt.h          | 27 ++++++++++++++++++++++++++-
> >  kernel/locking/rtmutex.c          |  4 ++--
> >  kernel/locking/rwsem.c            |  4 ++--
> >  kernel/locking/ww_mutex.h         |  2 +-
> >  kernel/sched/core.c               |  6 +++---
> >  kernel/time/hrtimer.c             |  6 +++---
> >  kernel/trace/trace_sched_wakeup.c |  2 +-
> >  mm/page-writeback.c               |  4 ++--
> >  mm/page_alloc.c                   |  2 +-
> >  13 files changed, 48 insertions(+), 20 deletions(-)
> …
> > diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> > index 70625dff62ce..08b95e0a41ab 100644
> > --- a/kernel/time/hrtimer.c
> > +++ b/kernel/time/hrtimer.c
> > @@ -1996,7 +1996,7 @@ static void __hrtimer_init_sleeper(struct hrtimer_sleeper *sl,
> >  	 * expiry.
> >  	 */
> >  	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> > -		if (task_is_realtime(current) && !(mode & HRTIMER_MODE_SOFT))
> > +		if (realtime_task_policy(current) && !(mode & HRTIMER_MODE_SOFT))
> >  			mode |= HRTIMER_MODE_HARD;
> >  	}
> >  
> > @@ -2096,7 +2096,7 @@ long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode,
> >  	u64 slack;
> >  
> >  	slack = current->timer_slack_ns;
> > -	if (rt_task(current))
> > +	if (realtime_task(current))
> >  		slack = 0;
> >  
> >  	hrtimer_init_sleeper_on_stack(&t, clockid, mode);
> > @@ -2301,7 +2301,7 @@ schedule_hrtimeout_range_clock(ktime_t *expires, u64 delta,
> >  	 * Override any slack passed by the user if under
> >  	 * rt contraints.
> >  	 */
> > -	if (rt_task(current))
> > +	if (realtime_task(current))
> >  		delta = 0;
> 
> I know this is just converting what is already here but…
> __hrtimer_init_sleeper() looks at the policy to figure out if the task
> is realtime do decide if should expire in HARD-IRQ context. This is
> correct, a boosted task should not sleep.
> 
> hrtimer_nanosleep() + schedule_hrtimeout_range_clock() is looking at
> priority to decide if slack should be removed. This should also look at
> policy since a boosted task shouldn't sleep.

I have to admit I never dug deep enough into this code. Happy to convert these
users. I'll add that as a separate patch as this is somewhat changing behavior
which this patch intends to do a clean up only.

> 
> In order to be PI-boosted you need to acquire a lock and the only lock
> you can sleep while acquired without generating a warning is a mutex_t
> (or equivalent sleeping lock) on PREEMPT_RT. 

Note we care about the behavior for !PREEMPT_RT. PI issues are important there
too. I assume the fact the PREEMPT_RT changes the locks behavior is what you're
referring to here and not applicable to normal case.


Thanks!

--
Qais Yousef

> 
> >  	hrtimer_init_sleeper_on_stack(&t, clock_id, mode);
> > diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
> > index 0469a04a355f..19d737742e29 100644
> > --- a/kernel/trace/trace_sched_wakeup.c
> > +++ b/kernel/trace/trace_sched_wakeup.c
> > @@ -545,7 +545,7 @@ probe_wakeup(void *ignore, struct task_struct *p)
> >  	 *  - wakeup_dl handles tasks belonging to sched_dl class only.
> >  	 */
> >  	if (tracing_dl || (wakeup_dl && !dl_task(p)) ||
> > -	    (wakeup_rt && !dl_task(p) && !rt_task(p)) ||
> > +	    (wakeup_rt && !realtime_task(p)) ||
> >  	    (!dl_task(p) && (p->prio >= wakeup_prio || p->prio >= current->prio)))
> >  		return;
> >  
> 
> Sebastian

Qais Yousef May 27, 2024, 5:37 p.m. UTC | #4

On 05/23/24 11:45, Steven Rostedt wrote:
> On Wed, 15 May 2024 23:05:36 +0100
> Qais Yousef <qyousef@layalina.io> wrote:
> > diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
> > index df3aca89d4f5..5cb88b748ad6 100644
> > --- a/include/linux/sched/deadline.h
> > +++ b/include/linux/sched/deadline.h
> > @@ -10,8 +10,6 @@
> >  
> >  #include <linux/sched.h>
> >  
> > -#define MAX_DL_PRIO		0
> > -
> >  static inline int dl_prio(int prio)
> >  {
> >  	if (unlikely(prio < MAX_DL_PRIO))
> > @@ -19,6 +17,10 @@ static inline int dl_prio(int prio)
> >  	return 0;
> >  }
> >  
> > +/*
> > + * Returns true if a task has a priority that belongs to DL class. PI-boosted
> > + * tasks will return true. Use dl_policy() to ignore PI-boosted tasks.
> > + */
> >  static inline int dl_task(struct task_struct *p)
> >  {
> >  	return dl_prio(p->prio);
> > diff --git a/include/linux/sched/prio.h b/include/linux/sched/prio.h
> > index ab83d85e1183..6ab43b4f72f9 100644
> > --- a/include/linux/sched/prio.h
> > +++ b/include/linux/sched/prio.h
> > @@ -14,6 +14,7 @@
> >   */
> >  
> >  #define MAX_RT_PRIO		100
> > +#define MAX_DL_PRIO		0
> >  
> >  #define MAX_PRIO		(MAX_RT_PRIO + NICE_WIDTH)
> >  #define DEFAULT_PRIO		(MAX_RT_PRIO + NICE_WIDTH / 2)
> > diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
> > index b2b9e6eb9683..a055dd68a77c 100644
> > --- a/include/linux/sched/rt.h
> > +++ b/include/linux/sched/rt.h
> > @@ -7,18 +7,43 @@
> >  struct task_struct;
> >  
> >  static inline int rt_prio(int prio)
> > +{
> > +	if (unlikely(prio < MAX_RT_PRIO && prio >= MAX_DL_PRIO))
> > +		return 1;
> > +	return 0;
> > +}
> > +
> > +static inline int realtime_prio(int prio)
> >  {
> >  	if (unlikely(prio < MAX_RT_PRIO))
> >  		return 1;
> >  	return 0;
> >  }
> 
> I'm thinking we should change the above to bool (separate patch), as
> returning an int may give one the impression that it returns the actual
> priority number. Having it return bool will clear that up.
> 
> In fact, if we are touching theses functions, might as well change all of
> them to bool when returning true/false. Just to make it easier to
> understand what they are doing.

I can add a patch on top, sure.

> 
> >  
> > +/*
> > + * Returns true if a task has a priority that belongs to RT class. PI-boosted
> > + * tasks will return true. Use rt_policy() to ignore PI-boosted tasks.
> > + */
> >  static inline int rt_task(struct task_struct *p)
> >  {
> >  	return rt_prio(p->prio);
> >  }
> >  
> > -static inline bool task_is_realtime(struct task_struct *tsk)
> > +/*
> > + * Returns true if a task has a priority that belongs to RT or DL classes.
> > + * PI-boosted tasks will return true. Use realtime_task_policy() to ignore
> > + * PI-boosted tasks.
> > + */
> > +static inline int realtime_task(struct task_struct *p)
> > +{
> > +	return realtime_prio(p->prio);
> > +}
> > +
> > +/*
> > + * Returns true if a task has a policy that belongs to RT or DL classes.
> > + * PI-boosted tasks will return false.
> > + */
> > +static inline bool realtime_task_policy(struct task_struct *tsk)
> >  {
> >  	int policy = tsk->policy;
> >  
> 
> 
> 
> > diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
> > index 0469a04a355f..19d737742e29 100644
> > --- a/kernel/trace/trace_sched_wakeup.c
> > +++ b/kernel/trace/trace_sched_wakeup.c
> > @@ -545,7 +545,7 @@ probe_wakeup(void *ignore, struct task_struct *p)
> >  	 *  - wakeup_dl handles tasks belonging to sched_dl class only.
> >  	 */
> >  	if (tracing_dl || (wakeup_dl && !dl_task(p)) ||
> > -	    (wakeup_rt && !dl_task(p) && !rt_task(p)) ||
> > +	    (wakeup_rt && !realtime_task(p)) ||
> >  	    (!dl_task(p) && (p->prio >= wakeup_prio || p->prio >= current->prio)))
> >  		return;
> >  
> 
> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>

Thanks!

--
Qais Yousef

> 
>

Sebastian Sewior May 29, 2024, 8:29 a.m. UTC | #5

On 2024-05-27 18:26:50 [+0100], Qais Yousef wrote:
> > In order to be PI-boosted you need to acquire a lock and the only lock
> > you can sleep while acquired without generating a warning is a mutex_t
> > (or equivalent sleeping lock) on PREEMPT_RT. 
> 
> Note we care about the behavior for !PREEMPT_RT. PI issues are important there
> too. I assume the fact the PREEMPT_RT changes the locks behavior is what you're
> referring to here and not applicable to normal case.

So for !PREEMPT_RT you need a rtmutex for PI. RCU and i2c is using it
within the kernel and this shouldn't go via the `slack' API.

The FUTEX API on the other hand is a different story and it might
matter. So you have one task running SCHED_OTHER and acquiring a lock in
userspace (pthread_mutex_t, PTHREAD_PRIO_INHERIT). Another task running
at SCHED_FIFO/ RR/ DL would also acquire that lock, block on it and
then inherit its priority.
This is the point where the former task has a different policy vs
priority considering PI-boosting. You could argue that the task
shouldn't sleep or invoke anything possible sleeping with a timeout > 0
because it is using an important lock.
But then it is userland and has the freedom to do whatever it wants you
know…

So it might be better to forget what I said and keeping the current
behaviour. But then it is insistent which matters only in the RT case.
Puh. Any sched folks regarding policy?

> Thanks!

Sebastian

Qais Yousef May 29, 2024, 10:34 a.m. UTC | #6

On 05/29/24 10:29, Sebastian Andrzej Siewior wrote:
> On 2024-05-27 18:26:50 [+0100], Qais Yousef wrote:
> > > In order to be PI-boosted you need to acquire a lock and the only lock
> > > you can sleep while acquired without generating a warning is a mutex_t
> > > (or equivalent sleeping lock) on PREEMPT_RT. 
> > 
> > Note we care about the behavior for !PREEMPT_RT. PI issues are important there
> > too. I assume the fact the PREEMPT_RT changes the locks behavior is what you're
> > referring to here and not applicable to normal case.
> 
> So for !PREEMPT_RT you need a rtmutex for PI. RCU and i2c is using it
> within the kernel and this shouldn't go via the `slack' API.
> 
> The FUTEX API on the other hand is a different story and it might
> matter. So you have one task running SCHED_OTHER and acquiring a lock in
> userspace (pthread_mutex_t, PTHREAD_PRIO_INHERIT). Another task running
> at SCHED_FIFO/ RR/ DL would also acquire that lock, block on it and
> then inherit its priority.
> This is the point where the former task has a different policy vs
> priority considering PI-boosting. You could argue that the task
> shouldn't sleep or invoke anything possible sleeping with a timeout > 0
> because it is using an important lock.
> But then it is userland and has the freedom to do whatever it wants you
> know…

Yes..

> 
> So it might be better to forget what I said and keeping the current

Okay I'll drop the patch then in next posting.

> behaviour. But then it is insistent which matters only in the RT case.
> Puh. Any sched folks regarding policy?

I am not sure I understood you here. Could you rephrase please?

Sebastian Sewior May 29, 2024, 10:55 a.m. UTC | #7

On 2024-05-29 11:34:09 [+0100], Qais Yousef wrote:
> > behaviour. But then it is insistent which matters only in the RT case.
> > Puh. Any sched folks regarding policy?
> 
> I am not sure I understood you here. Could you rephrase please?

Right now a SCHED_OTHER task boosted to a realtime priority gets
slack=0. In the !RT scenario everything is fine.
For RT the slack=0 also happens but the init of the timer looks at the
policy instead at the possible boosted priority and uses a different
clock attribute. This can lead to a delayed wake up (so avoiding the
slack does not solve the problem).

This is not consistent because IMHO the clock setup & slack should be
handled equally. So I am asking the sched folks for a policy and I am
leaning towards looking at task-policy in this case instead of prio
because you shouldn't do anything that can delay.

Sebastian

Qais Yousef May 30, 2024, 11:10 a.m. UTC | #8

On 05/29/24 12:55, Sebastian Andrzej Siewior wrote:
> On 2024-05-29 11:34:09 [+0100], Qais Yousef wrote:
> > > behaviour. But then it is insistent which matters only in the RT case.
> > > Puh. Any sched folks regarding policy?
> > 
> > I am not sure I understood you here. Could you rephrase please?
> 
> Right now a SCHED_OTHER task boosted to a realtime priority gets
> slack=0. In the !RT scenario everything is fine.
> For RT the slack=0 also happens but the init of the timer looks at the
> policy instead at the possible boosted priority and uses a different
> clock attribute. This can lead to a delayed wake up (so avoiding the
> slack does not solve the problem).
> 
> This is not consistent because IMHO the clock setup & slack should be
> handled equally. So I am asking the sched folks for a policy and I am
> leaning towards looking at task-policy in this case instead of prio
> because you shouldn't do anything that can delay.

Can't we do that based on is_soft/is_hard flag in hrtimer struct when we apply
the slack in hrtimer_set_expires_range_ns() instead?

(not compile tested even)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index aa1e65ccb615..e001f20bbea9 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -102,12 +102,16 @@ static inline void hrtimer_set_expires(struct hrtimer *timer, ktime_t time)
 
 static inline void hrtimer_set_expires_range(struct hrtimer *timer, ktime_t time, ktime_t delta)
 {
+       if (timer->is_soft || timer->is_hard)
+               delta = 0;
        timer->_softexpires = time;
        timer->node.expires = ktime_add_safe(time, delta);
 }
 
 static inline void hrtimer_set_expires_range_ns(struct hrtimer *timer, ktime_t time, u64 delta)
 {
+       if (timer->is_soft || timer->is_hard)
+               delta = 0;
        timer->_softexpires = time;
        timer->node.expires = ktime_add_safe(time, ns_to_ktime(delta));
 }

Sebastian Sewior May 31, 2024, 6:30 a.m. UTC | #9

On 2024-05-30 12:10:44 [+0100], Qais Yousef wrote:
> > This is not consistent because IMHO the clock setup & slack should be
> > handled equally. So I am asking the sched folks for a policy and I am
> > leaning towards looking at task-policy in this case instead of prio
> > because you shouldn't do anything that can delay.
> 
> Can't we do that based on is_soft/is_hard flag in hrtimer struct when we apply
> the slack in hrtimer_set_expires_range_ns() instead?

We need to decide on a policy first.
You don't want to add overhead on each invocation plus some in-kernel
ask for delta. ->is_soft is not a good criteria.

Sebastian

Qais Yousef June 1, 2024, 10:31 p.m. UTC | #10

On 05/31/24 08:30, Sebastian Andrzej Siewior wrote:
> On 2024-05-30 12:10:44 [+0100], Qais Yousef wrote:
> > > This is not consistent because IMHO the clock setup & slack should be
> > > handled equally. So I am asking the sched folks for a policy and I am
> > > leaning towards looking at task-policy in this case instead of prio
> > > because you shouldn't do anything that can delay.
> > 
> > Can't we do that based on is_soft/is_hard flag in hrtimer struct when we apply
> > the slack in hrtimer_set_expires_range_ns() instead?
> 
> We need to decide on a policy first.
> You don't want to add overhead on each invocation plus some in-kernel
> ask for delta. ->is_soft is not a good criteria.

The suggestion was not to check policy or priority but dictate the behavior
based on hrtimer properties set at init. Which agrees with your thinking.

I think the check for policy at init to enforce a behavior is borderline hack
by the way and better to get an explicit request from a task and let the
HRTIMER comply adequately without checking for policies anywhere. Whether this
property should be inherited or not is a good question and I think is the
policy you're after here. Proxy Execution would probably handle all this
inheritance problems transparently, so we need not think much about it if we
agree proxy execution is the way to deal with all those inheritance problems.
There are always exceptions though.. So worth a think.

I'll drop this patch as this discussion has diverged to something else now.
I'll leave this here until others get a chance to comment with their views too.

Cheers

--
Qais Yousef

[v2] sched/rt: Clean up usage of rt_task()

Commit Message

Comments

Patch