diff mbox series

[v2] sched/rt: Clean up usage of rt_task()

Message ID 20240515220536.823145-1-qyousef@layalina.io (mailing list archive)
State New, archived
Headers show
Series [v2] sched/rt: Clean up usage of rt_task() | expand

Commit Message

Qais Yousef May 15, 2024, 10:05 p.m. UTC
rt_task() checks if a task has RT priority. But depends on your
dictionary, this could mean it belongs to RT class, or is a 'realtime'
task, which includes RT and DL classes.

Since this has caused some confusion already on discussion [1], it
seemed a clean up is due.

I define the usage of rt_task() to be tasks that belong to RT class.
Make sure that it returns true only for RT class and audit the users and
replace the ones required the old behavior with the new realtime_task()
which returns true for RT and DL classes. Introduce similar
realtime_prio() to create similar distinction to rt_prio() and update
the users that required the old behavior to use the new function.

Move MAX_DL_PRIO to prio.h so it can be used in the new definitions.

Document the functions to make it more obvious what is the difference
between them. PI-boosted tasks is a factor that must be taken into
account when choosing which function to use.

Rename task_is_realtime() to realtime_task_policy() as the old name is
confusing against the new realtime_task().

No functional changes were intended.

[1] https://lore.kernel.org/lkml/20240506100509.GL40213@noisy.programming.kicks-ass.net/

Reviewed-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Qais Yousef <qyousef@layalina.io>
---

Changes since v1:

	* Use realtime_task_policy() instead task_has_realtime_policy() (Peter)
	* Improve commit message readability about replace some rt_task()
	  users.

v1 discussion: https://lore.kernel.org/lkml/20240514234112.792989-1-qyousef@layalina.io/

 fs/select.c                       |  2 +-
 include/linux/ioprio.h            |  2 +-
 include/linux/sched/deadline.h    |  6 ++++--
 include/linux/sched/prio.h        |  1 +
 include/linux/sched/rt.h          | 27 ++++++++++++++++++++++++++-
 kernel/locking/rtmutex.c          |  4 ++--
 kernel/locking/rwsem.c            |  4 ++--
 kernel/locking/ww_mutex.h         |  2 +-
 kernel/sched/core.c               |  6 +++---
 kernel/time/hrtimer.c             |  6 +++---
 kernel/trace/trace_sched_wakeup.c |  2 +-
 mm/page-writeback.c               |  4 ++--
 mm/page_alloc.c                   |  2 +-
 13 files changed, 48 insertions(+), 20 deletions(-)

Comments

Sebastian Sewior May 21, 2024, 11 a.m. UTC | #1
On 2024-05-15 23:05:36 [+0100], Qais Yousef wrote:
> rt_task() checks if a task has RT priority. But depends on your
> dictionary, this could mean it belongs to RT class, or is a 'realtime'
> task, which includes RT and DL classes.
> 
> Since this has caused some confusion already on discussion [1], it
> seemed a clean up is due.
> 
> I define the usage of rt_task() to be tasks that belong to RT class.
> Make sure that it returns true only for RT class and audit the users and
> replace the ones required the old behavior with the new realtime_task()
> which returns true for RT and DL classes. Introduce similar
> realtime_prio() to create similar distinction to rt_prio() and update
> the users that required the old behavior to use the new function.
> 
> Move MAX_DL_PRIO to prio.h so it can be used in the new definitions.
> 
> Document the functions to make it more obvious what is the difference
> between them. PI-boosted tasks is a factor that must be taken into
> account when choosing which function to use.
> 
> Rename task_is_realtime() to realtime_task_policy() as the old name is
> confusing against the new realtime_task().

I *think* everyone using rt_task() means to include DL tasks. And
everyone means !SCHED-people since they know when the difference matters.

> No functional changes were intended.
> 
> [1] https://lore.kernel.org/lkml/20240506100509.GL40213@noisy.programming.kicks-ass.net/
> 
> Reviewed-by: Phil Auld <pauld@redhat.com>
> Signed-off-by: Qais Yousef <qyousef@layalina.io>
> ---
> 
> Changes since v1:
> 
> 	* Use realtime_task_policy() instead task_has_realtime_policy() (Peter)
> 	* Improve commit message readability about replace some rt_task()
> 	  users.
> 
> v1 discussion: https://lore.kernel.org/lkml/20240514234112.792989-1-qyousef@layalina.io/
> 
>  fs/select.c                       |  2 +-

fs/bcachefs/six.c
six_owner_running() has rt_task(). But imho should have realtime_task()
to consider DL. But I think it is way worse that it has its own locking
rather than using what everyone else but then again it wouldn't be the
new hot thing…

>  include/linux/ioprio.h            |  2 +-
>  include/linux/sched/deadline.h    |  6 ++++--
>  include/linux/sched/prio.h        |  1 +
>  include/linux/sched/rt.h          | 27 ++++++++++++++++++++++++++-
>  kernel/locking/rtmutex.c          |  4 ++--
>  kernel/locking/rwsem.c            |  4 ++--
>  kernel/locking/ww_mutex.h         |  2 +-
>  kernel/sched/core.c               |  6 +++---
>  kernel/time/hrtimer.c             |  6 +++---
>  kernel/trace/trace_sched_wakeup.c |  2 +-
>  mm/page-writeback.c               |  4 ++--
>  mm/page_alloc.c                   |  2 +-
>  13 files changed, 48 insertions(+), 20 deletions(-)> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 70625dff62ce..08b95e0a41ab 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -1996,7 +1996,7 @@ static void __hrtimer_init_sleeper(struct hrtimer_sleeper *sl,
>  	 * expiry.
>  	 */
>  	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> -		if (task_is_realtime(current) && !(mode & HRTIMER_MODE_SOFT))
> +		if (realtime_task_policy(current) && !(mode & HRTIMER_MODE_SOFT))
>  			mode |= HRTIMER_MODE_HARD;
>  	}
>  
> @@ -2096,7 +2096,7 @@ long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode,
>  	u64 slack;
>  
>  	slack = current->timer_slack_ns;
> -	if (rt_task(current))
> +	if (realtime_task(current))
>  		slack = 0;
>  
>  	hrtimer_init_sleeper_on_stack(&t, clockid, mode);
> @@ -2301,7 +2301,7 @@ schedule_hrtimeout_range_clock(ktime_t *expires, u64 delta,
>  	 * Override any slack passed by the user if under
>  	 * rt contraints.
>  	 */
> -	if (rt_task(current))
> +	if (realtime_task(current))
>  		delta = 0;

I know this is just converting what is already here but…
__hrtimer_init_sleeper() looks at the policy to figure out if the task
is realtime do decide if should expire in HARD-IRQ context. This is
correct, a boosted task should not sleep.

hrtimer_nanosleep() + schedule_hrtimeout_range_clock() is looking at
priority to decide if slack should be removed. This should also look at
policy since a boosted task shouldn't sleep.

In order to be PI-boosted you need to acquire a lock and the only lock
you can sleep while acquired without generating a warning is a mutex_t
(or equivalent sleeping lock) on PREEMPT_RT. 

>  	hrtimer_init_sleeper_on_stack(&t, clock_id, mode);
> diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
> index 0469a04a355f..19d737742e29 100644
> --- a/kernel/trace/trace_sched_wakeup.c
> +++ b/kernel/trace/trace_sched_wakeup.c
> @@ -545,7 +545,7 @@ probe_wakeup(void *ignore, struct task_struct *p)
>  	 *  - wakeup_dl handles tasks belonging to sched_dl class only.
>  	 */
>  	if (tracing_dl || (wakeup_dl && !dl_task(p)) ||
> -	    (wakeup_rt && !dl_task(p) && !rt_task(p)) ||
> +	    (wakeup_rt && !realtime_task(p)) ||
>  	    (!dl_task(p) && (p->prio >= wakeup_prio || p->prio >= current->prio)))
>  		return;
>  

Sebastian
Steven Rostedt May 23, 2024, 3:45 p.m. UTC | #2
On Wed, 15 May 2024 23:05:36 +0100
Qais Yousef <qyousef@layalina.io> wrote:
> diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
> index df3aca89d4f5..5cb88b748ad6 100644
> --- a/include/linux/sched/deadline.h
> +++ b/include/linux/sched/deadline.h
> @@ -10,8 +10,6 @@
>  
>  #include <linux/sched.h>
>  
> -#define MAX_DL_PRIO		0
> -
>  static inline int dl_prio(int prio)
>  {
>  	if (unlikely(prio < MAX_DL_PRIO))
> @@ -19,6 +17,10 @@ static inline int dl_prio(int prio)
>  	return 0;
>  }
>  
> +/*
> + * Returns true if a task has a priority that belongs to DL class. PI-boosted
> + * tasks will return true. Use dl_policy() to ignore PI-boosted tasks.
> + */
>  static inline int dl_task(struct task_struct *p)
>  {
>  	return dl_prio(p->prio);
> diff --git a/include/linux/sched/prio.h b/include/linux/sched/prio.h
> index ab83d85e1183..6ab43b4f72f9 100644
> --- a/include/linux/sched/prio.h
> +++ b/include/linux/sched/prio.h
> @@ -14,6 +14,7 @@
>   */
>  
>  #define MAX_RT_PRIO		100
> +#define MAX_DL_PRIO		0
>  
>  #define MAX_PRIO		(MAX_RT_PRIO + NICE_WIDTH)
>  #define DEFAULT_PRIO		(MAX_RT_PRIO + NICE_WIDTH / 2)
> diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
> index b2b9e6eb9683..a055dd68a77c 100644
> --- a/include/linux/sched/rt.h
> +++ b/include/linux/sched/rt.h
> @@ -7,18 +7,43 @@
>  struct task_struct;
>  
>  static inline int rt_prio(int prio)
> +{
> +	if (unlikely(prio < MAX_RT_PRIO && prio >= MAX_DL_PRIO))
> +		return 1;
> +	return 0;
> +}
> +
> +static inline int realtime_prio(int prio)
>  {
>  	if (unlikely(prio < MAX_RT_PRIO))
>  		return 1;
>  	return 0;
>  }

I'm thinking we should change the above to bool (separate patch), as
returning an int may give one the impression that it returns the actual
priority number. Having it return bool will clear that up.

In fact, if we are touching theses functions, might as well change all of
them to bool when returning true/false. Just to make it easier to
understand what they are doing.

>  
> +/*
> + * Returns true if a task has a priority that belongs to RT class. PI-boosted
> + * tasks will return true. Use rt_policy() to ignore PI-boosted tasks.
> + */
>  static inline int rt_task(struct task_struct *p)
>  {
>  	return rt_prio(p->prio);
>  }
>  
> -static inline bool task_is_realtime(struct task_struct *tsk)
> +/*
> + * Returns true if a task has a priority that belongs to RT or DL classes.
> + * PI-boosted tasks will return true. Use realtime_task_policy() to ignore
> + * PI-boosted tasks.
> + */
> +static inline int realtime_task(struct task_struct *p)
> +{
> +	return realtime_prio(p->prio);
> +}
> +
> +/*
> + * Returns true if a task has a policy that belongs to RT or DL classes.
> + * PI-boosted tasks will return false.
> + */
> +static inline bool realtime_task_policy(struct task_struct *tsk)
>  {
>  	int policy = tsk->policy;
>  



> diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
> index 0469a04a355f..19d737742e29 100644
> --- a/kernel/trace/trace_sched_wakeup.c
> +++ b/kernel/trace/trace_sched_wakeup.c
> @@ -545,7 +545,7 @@ probe_wakeup(void *ignore, struct task_struct *p)
>  	 *  - wakeup_dl handles tasks belonging to sched_dl class only.
>  	 */
>  	if (tracing_dl || (wakeup_dl && !dl_task(p)) ||
> -	    (wakeup_rt && !dl_task(p) && !rt_task(p)) ||
> +	    (wakeup_rt && !realtime_task(p)) ||
>  	    (!dl_task(p) && (p->prio >= wakeup_prio || p->prio >= current->prio)))
>  		return;
>  

Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Qais Yousef May 27, 2024, 5:26 p.m. UTC | #3
On 05/21/24 13:00, Sebastian Andrzej Siewior wrote:
> On 2024-05-15 23:05:36 [+0100], Qais Yousef wrote:
> > rt_task() checks if a task has RT priority. But depends on your
> > dictionary, this could mean it belongs to RT class, or is a 'realtime'
> > task, which includes RT and DL classes.
> > 
> > Since this has caused some confusion already on discussion [1], it
> > seemed a clean up is due.
> > 
> > I define the usage of rt_task() to be tasks that belong to RT class.
> > Make sure that it returns true only for RT class and audit the users and
> > replace the ones required the old behavior with the new realtime_task()
> > which returns true for RT and DL classes. Introduce similar
> > realtime_prio() to create similar distinction to rt_prio() and update
> > the users that required the old behavior to use the new function.
> > 
> > Move MAX_DL_PRIO to prio.h so it can be used in the new definitions.
> > 
> > Document the functions to make it more obvious what is the difference
> > between them. PI-boosted tasks is a factor that must be taken into
> > account when choosing which function to use.
> > 
> > Rename task_is_realtime() to realtime_task_policy() as the old name is
> > confusing against the new realtime_task().
> 
> I *think* everyone using rt_task() means to include DL tasks. And
> everyone means !SCHED-people since they know when the difference matters.

yes, this makes sense

> 
> > No functional changes were intended.
> > 
> > [1] https://lore.kernel.org/lkml/20240506100509.GL40213@noisy.programming.kicks-ass.net/
> > 
> > Reviewed-by: Phil Auld <pauld@redhat.com>
> > Signed-off-by: Qais Yousef <qyousef@layalina.io>
> > ---
> > 
> > Changes since v1:
> > 
> > 	* Use realtime_task_policy() instead task_has_realtime_policy() (Peter)
> > 	* Improve commit message readability about replace some rt_task()
> > 	  users.
> > 
> > v1 discussion: https://lore.kernel.org/lkml/20240514234112.792989-1-qyousef@layalina.io/
> > 
> >  fs/select.c                       |  2 +-
> 
> fs/bcachefs/six.c
> six_owner_running() has rt_task(). But imho should have realtime_task()
> to consider DL. But I think it is way worse that it has its own locking
> rather than using what everyone else but then again it wouldn't be the
> new hot thing…

I think I missed this one. Converted now. Thanks!

> 
> >  include/linux/ioprio.h            |  2 +-
> >  include/linux/sched/deadline.h    |  6 ++++--
> >  include/linux/sched/prio.h        |  1 +
> >  include/linux/sched/rt.h          | 27 ++++++++++++++++++++++++++-
> >  kernel/locking/rtmutex.c          |  4 ++--
> >  kernel/locking/rwsem.c            |  4 ++--
> >  kernel/locking/ww_mutex.h         |  2 +-
> >  kernel/sched/core.c               |  6 +++---
> >  kernel/time/hrtimer.c             |  6 +++---
> >  kernel/trace/trace_sched_wakeup.c |  2 +-
> >  mm/page-writeback.c               |  4 ++--
> >  mm/page_alloc.c                   |  2 +-
> >  13 files changed, 48 insertions(+), 20 deletions(-)
> …
> > diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> > index 70625dff62ce..08b95e0a41ab 100644
> > --- a/kernel/time/hrtimer.c
> > +++ b/kernel/time/hrtimer.c
> > @@ -1996,7 +1996,7 @@ static void __hrtimer_init_sleeper(struct hrtimer_sleeper *sl,
> >  	 * expiry.
> >  	 */
> >  	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> > -		if (task_is_realtime(current) && !(mode & HRTIMER_MODE_SOFT))
> > +		if (realtime_task_policy(current) && !(mode & HRTIMER_MODE_SOFT))
> >  			mode |= HRTIMER_MODE_HARD;
> >  	}
> >  
> > @@ -2096,7 +2096,7 @@ long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode,
> >  	u64 slack;
> >  
> >  	slack = current->timer_slack_ns;
> > -	if (rt_task(current))
> > +	if (realtime_task(current))
> >  		slack = 0;
> >  
> >  	hrtimer_init_sleeper_on_stack(&t, clockid, mode);
> > @@ -2301,7 +2301,7 @@ schedule_hrtimeout_range_clock(ktime_t *expires, u64 delta,
> >  	 * Override any slack passed by the user if under
> >  	 * rt contraints.
> >  	 */
> > -	if (rt_task(current))
> > +	if (realtime_task(current))
> >  		delta = 0;
> 
> I know this is just converting what is already here but…
> __hrtimer_init_sleeper() looks at the policy to figure out if the task
> is realtime do decide if should expire in HARD-IRQ context. This is
> correct, a boosted task should not sleep.
> 
> hrtimer_nanosleep() + schedule_hrtimeout_range_clock() is looking at
> priority to decide if slack should be removed. This should also look at
> policy since a boosted task shouldn't sleep.

I have to admit I never dug deep enough into this code. Happy to convert these
users. I'll add that as a separate patch as this is somewhat changing behavior
which this patch intends to do a clean up only.

> 
> In order to be PI-boosted you need to acquire a lock and the only lock
> you can sleep while acquired without generating a warning is a mutex_t
> (or equivalent sleeping lock) on PREEMPT_RT. 

Note we care about the behavior for !PREEMPT_RT. PI issues are important there
too. I assume the fact the PREEMPT_RT changes the locks behavior is what you're
referring to here and not applicable to normal case.


Thanks!

--
Qais Yousef

> 
> >  	hrtimer_init_sleeper_on_stack(&t, clock_id, mode);
> > diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
> > index 0469a04a355f..19d737742e29 100644
> > --- a/kernel/trace/trace_sched_wakeup.c
> > +++ b/kernel/trace/trace_sched_wakeup.c
> > @@ -545,7 +545,7 @@ probe_wakeup(void *ignore, struct task_struct *p)
> >  	 *  - wakeup_dl handles tasks belonging to sched_dl class only.
> >  	 */
> >  	if (tracing_dl || (wakeup_dl && !dl_task(p)) ||
> > -	    (wakeup_rt && !dl_task(p) && !rt_task(p)) ||
> > +	    (wakeup_rt && !realtime_task(p)) ||
> >  	    (!dl_task(p) && (p->prio >= wakeup_prio || p->prio >= current->prio)))
> >  		return;
> >  
> 
> Sebastian
Qais Yousef May 27, 2024, 5:37 p.m. UTC | #4
On 05/23/24 11:45, Steven Rostedt wrote:
> On Wed, 15 May 2024 23:05:36 +0100
> Qais Yousef <qyousef@layalina.io> wrote:
> > diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
> > index df3aca89d4f5..5cb88b748ad6 100644
> > --- a/include/linux/sched/deadline.h
> > +++ b/include/linux/sched/deadline.h
> > @@ -10,8 +10,6 @@
> >  
> >  #include <linux/sched.h>
> >  
> > -#define MAX_DL_PRIO		0
> > -
> >  static inline int dl_prio(int prio)
> >  {
> >  	if (unlikely(prio < MAX_DL_PRIO))
> > @@ -19,6 +17,10 @@ static inline int dl_prio(int prio)
> >  	return 0;
> >  }
> >  
> > +/*
> > + * Returns true if a task has a priority that belongs to DL class. PI-boosted
> > + * tasks will return true. Use dl_policy() to ignore PI-boosted tasks.
> > + */
> >  static inline int dl_task(struct task_struct *p)
> >  {
> >  	return dl_prio(p->prio);
> > diff --git a/include/linux/sched/prio.h b/include/linux/sched/prio.h
> > index ab83d85e1183..6ab43b4f72f9 100644
> > --- a/include/linux/sched/prio.h
> > +++ b/include/linux/sched/prio.h
> > @@ -14,6 +14,7 @@
> >   */
> >  
> >  #define MAX_RT_PRIO		100
> > +#define MAX_DL_PRIO		0
> >  
> >  #define MAX_PRIO		(MAX_RT_PRIO + NICE_WIDTH)
> >  #define DEFAULT_PRIO		(MAX_RT_PRIO + NICE_WIDTH / 2)
> > diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
> > index b2b9e6eb9683..a055dd68a77c 100644
> > --- a/include/linux/sched/rt.h
> > +++ b/include/linux/sched/rt.h
> > @@ -7,18 +7,43 @@
> >  struct task_struct;
> >  
> >  static inline int rt_prio(int prio)
> > +{
> > +	if (unlikely(prio < MAX_RT_PRIO && prio >= MAX_DL_PRIO))
> > +		return 1;
> > +	return 0;
> > +}
> > +
> > +static inline int realtime_prio(int prio)
> >  {
> >  	if (unlikely(prio < MAX_RT_PRIO))
> >  		return 1;
> >  	return 0;
> >  }
> 
> I'm thinking we should change the above to bool (separate patch), as
> returning an int may give one the impression that it returns the actual
> priority number. Having it return bool will clear that up.
> 
> In fact, if we are touching theses functions, might as well change all of
> them to bool when returning true/false. Just to make it easier to
> understand what they are doing.

I can add a patch on top, sure.

> 
> >  
> > +/*
> > + * Returns true if a task has a priority that belongs to RT class. PI-boosted
> > + * tasks will return true. Use rt_policy() to ignore PI-boosted tasks.
> > + */
> >  static inline int rt_task(struct task_struct *p)
> >  {
> >  	return rt_prio(p->prio);
> >  }
> >  
> > -static inline bool task_is_realtime(struct task_struct *tsk)
> > +/*
> > + * Returns true if a task has a priority that belongs to RT or DL classes.
> > + * PI-boosted tasks will return true. Use realtime_task_policy() to ignore
> > + * PI-boosted tasks.
> > + */
> > +static inline int realtime_task(struct task_struct *p)
> > +{
> > +	return realtime_prio(p->prio);
> > +}
> > +
> > +/*
> > + * Returns true if a task has a policy that belongs to RT or DL classes.
> > + * PI-boosted tasks will return false.
> > + */
> > +static inline bool realtime_task_policy(struct task_struct *tsk)
> >  {
> >  	int policy = tsk->policy;
> >  
> 
> 
> 
> > diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
> > index 0469a04a355f..19d737742e29 100644
> > --- a/kernel/trace/trace_sched_wakeup.c
> > +++ b/kernel/trace/trace_sched_wakeup.c
> > @@ -545,7 +545,7 @@ probe_wakeup(void *ignore, struct task_struct *p)
> >  	 *  - wakeup_dl handles tasks belonging to sched_dl class only.
> >  	 */
> >  	if (tracing_dl || (wakeup_dl && !dl_task(p)) ||
> > -	    (wakeup_rt && !dl_task(p) && !rt_task(p)) ||
> > +	    (wakeup_rt && !realtime_task(p)) ||
> >  	    (!dl_task(p) && (p->prio >= wakeup_prio || p->prio >= current->prio)))
> >  		return;
> >  
> 
> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>

Thanks!

--
Qais Yousef

> 
>
Sebastian Sewior May 29, 2024, 8:29 a.m. UTC | #5
On 2024-05-27 18:26:50 [+0100], Qais Yousef wrote:
> > In order to be PI-boosted you need to acquire a lock and the only lock
> > you can sleep while acquired without generating a warning is a mutex_t
> > (or equivalent sleeping lock) on PREEMPT_RT. 
> 
> Note we care about the behavior for !PREEMPT_RT. PI issues are important there
> too. I assume the fact the PREEMPT_RT changes the locks behavior is what you're
> referring to here and not applicable to normal case.

So for !PREEMPT_RT you need a rtmutex for PI. RCU and i2c is using it
within the kernel and this shouldn't go via the `slack' API.

The FUTEX API on the other hand is a different story and it might
matter. So you have one task running SCHED_OTHER and acquiring a lock in
userspace (pthread_mutex_t, PTHREAD_PRIO_INHERIT). Another task running
at SCHED_FIFO/ RR/ DL would also acquire that lock, block on it and
then inherit its priority.
This is the point where the former task has a different policy vs
priority considering PI-boosting. You could argue that the task
shouldn't sleep or invoke anything possible sleeping with a timeout > 0
because it is using an important lock.
But then it is userland and has the freedom to do whatever it wants you
know…

So it might be better to forget what I said and keeping the current
behaviour. But then it is insistent which matters only in the RT case.
Puh. Any sched folks regarding policy?

> Thanks!

Sebastian
Qais Yousef May 29, 2024, 10:34 a.m. UTC | #6
On 05/29/24 10:29, Sebastian Andrzej Siewior wrote:
> On 2024-05-27 18:26:50 [+0100], Qais Yousef wrote:
> > > In order to be PI-boosted you need to acquire a lock and the only lock
> > > you can sleep while acquired without generating a warning is a mutex_t
> > > (or equivalent sleeping lock) on PREEMPT_RT. 
> > 
> > Note we care about the behavior for !PREEMPT_RT. PI issues are important there
> > too. I assume the fact the PREEMPT_RT changes the locks behavior is what you're
> > referring to here and not applicable to normal case.
> 
> So for !PREEMPT_RT you need a rtmutex for PI. RCU and i2c is using it
> within the kernel and this shouldn't go via the `slack' API.
> 
> The FUTEX API on the other hand is a different story and it might
> matter. So you have one task running SCHED_OTHER and acquiring a lock in
> userspace (pthread_mutex_t, PTHREAD_PRIO_INHERIT). Another task running
> at SCHED_FIFO/ RR/ DL would also acquire that lock, block on it and
> then inherit its priority.
> This is the point where the former task has a different policy vs
> priority considering PI-boosting. You could argue that the task
> shouldn't sleep or invoke anything possible sleeping with a timeout > 0
> because it is using an important lock.
> But then it is userland and has the freedom to do whatever it wants you
> know…

Yes..

> 
> So it might be better to forget what I said and keeping the current

Okay I'll drop the patch then in next posting.

> behaviour. But then it is insistent which matters only in the RT case.
> Puh. Any sched folks regarding policy?

I am not sure I understood you here. Could you rephrase please?
Sebastian Sewior May 29, 2024, 10:55 a.m. UTC | #7
On 2024-05-29 11:34:09 [+0100], Qais Yousef wrote:
> > behaviour. But then it is insistent which matters only in the RT case.
> > Puh. Any sched folks regarding policy?
> 
> I am not sure I understood you here. Could you rephrase please?

Right now a SCHED_OTHER task boosted to a realtime priority gets
slack=0. In the !RT scenario everything is fine.
For RT the slack=0 also happens but the init of the timer looks at the
policy instead at the possible boosted priority and uses a different
clock attribute. This can lead to a delayed wake up (so avoiding the
slack does not solve the problem).

This is not consistent because IMHO the clock setup & slack should be
handled equally. So I am asking the sched folks for a policy and I am
leaning towards looking at task-policy in this case instead of prio
because you shouldn't do anything that can delay.

Sebastian
Qais Yousef May 30, 2024, 11:10 a.m. UTC | #8
On 05/29/24 12:55, Sebastian Andrzej Siewior wrote:
> On 2024-05-29 11:34:09 [+0100], Qais Yousef wrote:
> > > behaviour. But then it is insistent which matters only in the RT case.
> > > Puh. Any sched folks regarding policy?
> > 
> > I am not sure I understood you here. Could you rephrase please?
> 
> Right now a SCHED_OTHER task boosted to a realtime priority gets
> slack=0. In the !RT scenario everything is fine.
> For RT the slack=0 also happens but the init of the timer looks at the
> policy instead at the possible boosted priority and uses a different
> clock attribute. This can lead to a delayed wake up (so avoiding the
> slack does not solve the problem).
> 
> This is not consistent because IMHO the clock setup & slack should be
> handled equally. So I am asking the sched folks for a policy and I am
> leaning towards looking at task-policy in this case instead of prio
> because you shouldn't do anything that can delay.

Can't we do that based on is_soft/is_hard flag in hrtimer struct when we apply
the slack in hrtimer_set_expires_range_ns() instead?

(not compile tested even)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index aa1e65ccb615..e001f20bbea9 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -102,12 +102,16 @@ static inline void hrtimer_set_expires(struct hrtimer *timer, ktime_t time)
 
 static inline void hrtimer_set_expires_range(struct hrtimer *timer, ktime_t time, ktime_t delta)
 {
+       if (timer->is_soft || timer->is_hard)
+               delta = 0;
        timer->_softexpires = time;
        timer->node.expires = ktime_add_safe(time, delta);
 }
 
 static inline void hrtimer_set_expires_range_ns(struct hrtimer *timer, ktime_t time, u64 delta)
 {
+       if (timer->is_soft || timer->is_hard)
+               delta = 0;
        timer->_softexpires = time;
        timer->node.expires = ktime_add_safe(time, ns_to_ktime(delta));
 }
Sebastian Sewior May 31, 2024, 6:30 a.m. UTC | #9
On 2024-05-30 12:10:44 [+0100], Qais Yousef wrote:
> > This is not consistent because IMHO the clock setup & slack should be
> > handled equally. So I am asking the sched folks for a policy and I am
> > leaning towards looking at task-policy in this case instead of prio
> > because you shouldn't do anything that can delay.
> 
> Can't we do that based on is_soft/is_hard flag in hrtimer struct when we apply
> the slack in hrtimer_set_expires_range_ns() instead?

We need to decide on a policy first.
You don't want to add overhead on each invocation plus some in-kernel
ask for delta. ->is_soft is not a good criteria.

Sebastian
Qais Yousef June 1, 2024, 10:31 p.m. UTC | #10
On 05/31/24 08:30, Sebastian Andrzej Siewior wrote:
> On 2024-05-30 12:10:44 [+0100], Qais Yousef wrote:
> > > This is not consistent because IMHO the clock setup & slack should be
> > > handled equally. So I am asking the sched folks for a policy and I am
> > > leaning towards looking at task-policy in this case instead of prio
> > > because you shouldn't do anything that can delay.
> > 
> > Can't we do that based on is_soft/is_hard flag in hrtimer struct when we apply
> > the slack in hrtimer_set_expires_range_ns() instead?
> 
> We need to decide on a policy first.
> You don't want to add overhead on each invocation plus some in-kernel
> ask for delta. ->is_soft is not a good criteria.

The suggestion was not to check policy or priority but dictate the behavior
based on hrtimer properties set at init. Which agrees with your thinking.

I think the check for policy at init to enforce a behavior is borderline hack
by the way and better to get an explicit request from a task and let the
HRTIMER comply adequately without checking for policies anywhere. Whether this
property should be inherited or not is a good question and I think is the
policy you're after here. Proxy Execution would probably handle all this
inheritance problems transparently, so we need not think much about it if we
agree proxy execution is the way to deal with all those inheritance problems.
There are always exceptions though.. So worth a think.

I'll drop this patch as this discussion has diverged to something else now.
I'll leave this here until others get a chance to comment with their views too.


Cheers

--
Qais Yousef
diff mbox series

Patch

diff --git a/fs/select.c b/fs/select.c
index 9515c3fa1a03..8d5c1419416c 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -82,7 +82,7 @@  u64 select_estimate_accuracy(struct timespec64 *tv)
 	 * Realtime tasks get a slack of 0 for obvious reasons.
 	 */
 
-	if (rt_task(current))
+	if (realtime_task(current))
 		return 0;
 
 	ktime_get_ts64(&now);
diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
index db1249cd9692..75859b78d540 100644
--- a/include/linux/ioprio.h
+++ b/include/linux/ioprio.h
@@ -40,7 +40,7 @@  static inline int task_nice_ioclass(struct task_struct *task)
 {
 	if (task->policy == SCHED_IDLE)
 		return IOPRIO_CLASS_IDLE;
-	else if (task_is_realtime(task))
+	else if (realtime_task_policy(task))
 		return IOPRIO_CLASS_RT;
 	else
 		return IOPRIO_CLASS_BE;
diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index df3aca89d4f5..5cb88b748ad6 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -10,8 +10,6 @@ 
 
 #include <linux/sched.h>
 
-#define MAX_DL_PRIO		0
-
 static inline int dl_prio(int prio)
 {
 	if (unlikely(prio < MAX_DL_PRIO))
@@ -19,6 +17,10 @@  static inline int dl_prio(int prio)
 	return 0;
 }
 
+/*
+ * Returns true if a task has a priority that belongs to DL class. PI-boosted
+ * tasks will return true. Use dl_policy() to ignore PI-boosted tasks.
+ */
 static inline int dl_task(struct task_struct *p)
 {
 	return dl_prio(p->prio);
diff --git a/include/linux/sched/prio.h b/include/linux/sched/prio.h
index ab83d85e1183..6ab43b4f72f9 100644
--- a/include/linux/sched/prio.h
+++ b/include/linux/sched/prio.h
@@ -14,6 +14,7 @@ 
  */
 
 #define MAX_RT_PRIO		100
+#define MAX_DL_PRIO		0
 
 #define MAX_PRIO		(MAX_RT_PRIO + NICE_WIDTH)
 #define DEFAULT_PRIO		(MAX_RT_PRIO + NICE_WIDTH / 2)
diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
index b2b9e6eb9683..a055dd68a77c 100644
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -7,18 +7,43 @@ 
 struct task_struct;
 
 static inline int rt_prio(int prio)
+{
+	if (unlikely(prio < MAX_RT_PRIO && prio >= MAX_DL_PRIO))
+		return 1;
+	return 0;
+}
+
+static inline int realtime_prio(int prio)
 {
 	if (unlikely(prio < MAX_RT_PRIO))
 		return 1;
 	return 0;
 }
 
+/*
+ * Returns true if a task has a priority that belongs to RT class. PI-boosted
+ * tasks will return true. Use rt_policy() to ignore PI-boosted tasks.
+ */
 static inline int rt_task(struct task_struct *p)
 {
 	return rt_prio(p->prio);
 }
 
-static inline bool task_is_realtime(struct task_struct *tsk)
+/*
+ * Returns true if a task has a priority that belongs to RT or DL classes.
+ * PI-boosted tasks will return true. Use realtime_task_policy() to ignore
+ * PI-boosted tasks.
+ */
+static inline int realtime_task(struct task_struct *p)
+{
+	return realtime_prio(p->prio);
+}
+
+/*
+ * Returns true if a task has a policy that belongs to RT or DL classes.
+ * PI-boosted tasks will return false.
+ */
+static inline bool realtime_task_policy(struct task_struct *tsk)
 {
 	int policy = tsk->policy;
 
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 88d08eeb8bc0..55c9dab37f33 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -347,7 +347,7 @@  static __always_inline int __waiter_prio(struct task_struct *task)
 {
 	int prio = task->prio;
 
-	if (!rt_prio(prio))
+	if (!realtime_prio(prio))
 		return DEFAULT_PRIO;
 
 	return prio;
@@ -435,7 +435,7 @@  static inline bool rt_mutex_steal(struct rt_mutex_waiter *waiter,
 	 * Note that RT tasks are excluded from same priority (lateral)
 	 * steals to prevent the introduction of an unbounded latency.
 	 */
-	if (rt_prio(waiter->tree.prio) || dl_prio(waiter->tree.prio))
+	if (realtime_prio(waiter->tree.prio))
 		return false;
 
 	return rt_waiter_node_equal(&waiter->tree, &top_waiter->tree);
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index c6d17aee4209..ad8d4438bc91 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -631,7 +631,7 @@  static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
 			 * if it is an RT task or wait in the wait queue
 			 * for too long.
 			 */
-			if (has_handoff || (!rt_task(waiter->task) &&
+			if (has_handoff || (!realtime_task(waiter->task) &&
 					    !time_after(jiffies, waiter->timeout)))
 				return false;
 
@@ -914,7 +914,7 @@  static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 		if (owner_state != OWNER_WRITER) {
 			if (need_resched())
 				break;
-			if (rt_task(current) &&
+			if (realtime_task(current) &&
 			   (prev_owner_state != OWNER_WRITER))
 				break;
 		}
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 3ad2cc4823e5..fa4b416a1f62 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -237,7 +237,7 @@  __ww_ctx_less(struct ww_acquire_ctx *a, struct ww_acquire_ctx *b)
 	int a_prio = a->task->prio;
 	int b_prio = b->task->prio;
 
-	if (rt_prio(a_prio) || rt_prio(b_prio)) {
+	if (realtime_prio(a_prio) || realtime_prio(b_prio)) {
 
 		if (a_prio > b_prio)
 			return true;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1a914388144a..27f15de3d099 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -162,7 +162,7 @@  static inline int __task_prio(const struct task_struct *p)
 	if (p->sched_class == &stop_sched_class) /* trumps deadline */
 		return -2;
 
-	if (rt_prio(p->prio)) /* includes deadline */
+	if (realtime_prio(p->prio)) /* includes deadline */
 		return p->prio; /* [-1, 99] */
 
 	if (p->sched_class == &idle_sched_class)
@@ -2198,7 +2198,7 @@  static int effective_prio(struct task_struct *p)
 	 * keep the priority unchanged. Otherwise, update priority
 	 * to the normal priority:
 	 */
-	if (!rt_prio(p->prio))
+	if (!realtime_prio(p->prio))
 		return p->normal_prio;
 	return p->prio;
 }
@@ -10282,7 +10282,7 @@  void normalize_rt_tasks(void)
 		schedstat_set(p->stats.sleep_start, 0);
 		schedstat_set(p->stats.block_start, 0);
 
-		if (!dl_task(p) && !rt_task(p)) {
+		if (!realtime_task(p)) {
 			/*
 			 * Renice negative nice level userspace
 			 * tasks back to 0:
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 70625dff62ce..08b95e0a41ab 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1996,7 +1996,7 @@  static void __hrtimer_init_sleeper(struct hrtimer_sleeper *sl,
 	 * expiry.
 	 */
 	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
-		if (task_is_realtime(current) && !(mode & HRTIMER_MODE_SOFT))
+		if (realtime_task_policy(current) && !(mode & HRTIMER_MODE_SOFT))
 			mode |= HRTIMER_MODE_HARD;
 	}
 
@@ -2096,7 +2096,7 @@  long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode,
 	u64 slack;
 
 	slack = current->timer_slack_ns;
-	if (rt_task(current))
+	if (realtime_task(current))
 		slack = 0;
 
 	hrtimer_init_sleeper_on_stack(&t, clockid, mode);
@@ -2301,7 +2301,7 @@  schedule_hrtimeout_range_clock(ktime_t *expires, u64 delta,
 	 * Override any slack passed by the user if under
 	 * rt contraints.
 	 */
-	if (rt_task(current))
+	if (realtime_task(current))
 		delta = 0;
 
 	hrtimer_init_sleeper_on_stack(&t, clock_id, mode);
diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index 0469a04a355f..19d737742e29 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -545,7 +545,7 @@  probe_wakeup(void *ignore, struct task_struct *p)
 	 *  - wakeup_dl handles tasks belonging to sched_dl class only.
 	 */
 	if (tracing_dl || (wakeup_dl && !dl_task(p)) ||
-	    (wakeup_rt && !dl_task(p) && !rt_task(p)) ||
+	    (wakeup_rt && !realtime_task(p)) ||
 	    (!dl_task(p) && (p->prio >= wakeup_prio || p->prio >= current->prio)))
 		return;
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3e19b87049db..7372e40f225d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -418,7 +418,7 @@  static void domain_dirty_limits(struct dirty_throttle_control *dtc)
 	if (bg_thresh >= thresh)
 		bg_thresh = thresh / 2;
 	tsk = current;
-	if (rt_task(tsk)) {
+	if (realtime_task(tsk)) {
 		bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32;
 		thresh += thresh / 4 + global_wb_domain.dirty_limit / 32;
 	}
@@ -468,7 +468,7 @@  static unsigned long node_dirty_limit(struct pglist_data *pgdat)
 	else
 		dirty = vm_dirty_ratio * node_memory / 100;
 
-	if (rt_task(tsk))
+	if (realtime_task(tsk))
 		dirty += dirty / 4;
 
 	return dirty;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 14d39f34d336..0af24a60ade0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3877,7 +3877,7 @@  gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
 		 */
 		if (alloc_flags & ALLOC_MIN_RESERVE)
 			alloc_flags &= ~ALLOC_CPUSET;
-	} else if (unlikely(rt_task(current)) && in_task())
+	} else if (unlikely(realtime_task(current)) && in_task())
 		alloc_flags |= ALLOC_MIN_RESERVE;
 
 	alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, alloc_flags);