diff mbox series

[1/1] psi: do not require setsched permission from the trigger creator

Message ID 20190730013310.162367-1-surenb@google.com (mailing list archive)
State New, archived
Headers show
Series [1/1] psi: do not require setsched permission from the trigger creator | expand

Commit Message

Suren Baghdasaryan July 30, 2019, 1:33 a.m. UTC
When a process creates a new trigger by writing into /proc/pressure/*
files, permissions to write such a file should be used to determine whether
the process is allowed to do so or not. Current implementation would also
require such a process to have setsched capability. Setting of psi trigger
thread's scheduling policy is an implementation detail and should not be
exposed to the user level. Remove the permission check by using _nocheck
version of the function.

Suggested-by: Nick Kralevich <nnk@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 kernel/sched/psi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Peter Zijlstra July 30, 2019, 8:11 a.m. UTC | #1
On Mon, Jul 29, 2019 at 06:33:10PM -0700, Suren Baghdasaryan wrote:
> When a process creates a new trigger by writing into /proc/pressure/*
> files, permissions to write such a file should be used to determine whether
> the process is allowed to do so or not. Current implementation would also
> require such a process to have setsched capability. Setting of psi trigger
> thread's scheduling policy is an implementation detail and should not be
> exposed to the user level. Remove the permission check by using _nocheck
> version of the function.
> 
> Suggested-by: Nick Kralevich <nnk@google.com>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  kernel/sched/psi.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 7acc632c3b82..ed9a1d573cb1 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -1061,7 +1061,7 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
>  			mutex_unlock(&group->trigger_lock);
>  			return ERR_CAST(kworker);
>  		}
> -		sched_setscheduler(kworker->task, SCHED_FIFO, &param);
> +		sched_setscheduler_nocheck(kworker->task, SCHED_FIFO, &param);

ARGGH, wtf is there a FIFO-99!! thread here at all !?

>  		kthread_init_delayed_work(&group->poll_work,
>  				psi_poll_work);
>  		rcu_assign_pointer(group->poll_kworker, kworker);
> -- 
> 2.22.0.709.g102302147b-goog
>
Suren Baghdasaryan July 30, 2019, 5:44 p.m. UTC | #2
On Tue, Jul 30, 2019 at 1:11 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Jul 29, 2019 at 06:33:10PM -0700, Suren Baghdasaryan wrote:
> > When a process creates a new trigger by writing into /proc/pressure/*
> > files, permissions to write such a file should be used to determine whether
> > the process is allowed to do so or not. Current implementation would also
> > require such a process to have setsched capability. Setting of psi trigger
> > thread's scheduling policy is an implementation detail and should not be
> > exposed to the user level. Remove the permission check by using _nocheck
> > version of the function.
> >
> > Suggested-by: Nick Kralevich <nnk@google.com>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  kernel/sched/psi.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > index 7acc632c3b82..ed9a1d573cb1 100644
> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -1061,7 +1061,7 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
> >                       mutex_unlock(&group->trigger_lock);
> >                       return ERR_CAST(kworker);
> >               }
> > -             sched_setscheduler(kworker->task, SCHED_FIFO, &param);
> > +             sched_setscheduler_nocheck(kworker->task, SCHED_FIFO, &param);
>
> ARGGH, wtf is there a FIFO-99!! thread here at all !?

We need psi poll_kworker to be an rt-priority thread so that psi
notifications are delivered to the userspace without delay even when
the CPUs are very congested. Otherwise it's easy to delay psi
notifications by running a simple CPU hogger executing "chrt -f 50 dd
if=/dev/zero of=/dev/null". Because these notifications are
time-critical for reacting to memory shortages we can't allow for such
delays.
Notice that this kworker is created only if userspace creates a psi
trigger. So unless you are using psi triggers you will never see this
kthread created.

> >               kthread_init_delayed_work(&group->poll_work,
> >                               psi_poll_work);
> >               rcu_assign_pointer(group->poll_kworker, kworker);
> > --
> > 2.22.0.709.g102302147b-goog
> >
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
Peter Zijlstra Aug. 1, 2019, 9:51 a.m. UTC | #3
On Tue, Jul 30, 2019 at 10:44:51AM -0700, Suren Baghdasaryan wrote:
> On Tue, Jul 30, 2019 at 1:11 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Jul 29, 2019 at 06:33:10PM -0700, Suren Baghdasaryan wrote:
> > > When a process creates a new trigger by writing into /proc/pressure/*
> > > files, permissions to write such a file should be used to determine whether
> > > the process is allowed to do so or not. Current implementation would also
> > > require such a process to have setsched capability. Setting of psi trigger
> > > thread's scheduling policy is an implementation detail and should not be
> > > exposed to the user level. Remove the permission check by using _nocheck
> > > version of the function.
> > >
> > > Suggested-by: Nick Kralevich <nnk@google.com>
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > >  kernel/sched/psi.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > > index 7acc632c3b82..ed9a1d573cb1 100644
> > > --- a/kernel/sched/psi.c
> > > +++ b/kernel/sched/psi.c
> > > @@ -1061,7 +1061,7 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
> > >                       mutex_unlock(&group->trigger_lock);
> > >                       return ERR_CAST(kworker);
> > >               }
> > > -             sched_setscheduler(kworker->task, SCHED_FIFO, &param);
> > > +             sched_setscheduler_nocheck(kworker->task, SCHED_FIFO, &param);
> >
> > ARGGH, wtf is there a FIFO-99!! thread here at all !?
> 
> We need psi poll_kworker to be an rt-priority thread so that psi

There is a giant difference between 'needs to be higher than OTHER' and
FIFO-99.

> notifications are delivered to the userspace without delay even when
> the CPUs are very congested. Otherwise it's easy to delay psi
> notifications by running a simple CPU hogger executing "chrt -f 50 dd
> if=/dev/zero of=/dev/null". Because these notifications are

So what; at that point that's exactly what you're asking for. Using RT
is for those who know what they're doing.

> time-critical for reacting to memory shortages we can't allow for such
> delays.

Furthermore, actual RT programs will have pre-allocated and locked any
memory they rely on. They don't give a crap about your pressure
nonsense.

> Notice that this kworker is created only if userspace creates a psi
> trigger. So unless you are using psi triggers you will never see this
> kthread created.

By marking it FIFO-99 you're in effect saying that your stupid
statistics gathering is more important than your life. It will preempt
the task that's in control of the band-saw emergency break, it will
preempt the task that's adjusting the electromagnetic field containing
this plasma flow.

That's insane.

I'm going to queue a patch to reduce this to FIFO-1, that will preempt
regular OTHER tasks but will not perturb (much) actual RT bits.
Suren Baghdasaryan Aug. 1, 2019, 6:28 p.m. UTC | #4
Hi Peter,
Thanks for sharing your thoughts. I understand your point and I tend
to agree with it. I originally designed this using watchdog as the
example of a critical system health signal and in the context of
mobile device memory pressure is critical but I agree that there are
more important things in life. I checked and your proposal to change
it to FIFO-1 should still work for our purposes. Will test to make
sure and reply to your patch. Couple clarifications in-line.

On Thu, Aug 1, 2019 at 2:51 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Jul 30, 2019 at 10:44:51AM -0700, Suren Baghdasaryan wrote:
> > On Tue, Jul 30, 2019 at 1:11 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Mon, Jul 29, 2019 at 06:33:10PM -0700, Suren Baghdasaryan wrote:
> > > > When a process creates a new trigger by writing into /proc/pressure/*
> > > > files, permissions to write such a file should be used to determine whether
> > > > the process is allowed to do so or not. Current implementation would also
> > > > require such a process to have setsched capability. Setting of psi trigger
> > > > thread's scheduling policy is an implementation detail and should not be
> > > > exposed to the user level. Remove the permission check by using _nocheck
> > > > version of the function.
> > > >
> > > > Suggested-by: Nick Kralevich <nnk@google.com>
> > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > ---
> > > >  kernel/sched/psi.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > > > index 7acc632c3b82..ed9a1d573cb1 100644
> > > > --- a/kernel/sched/psi.c
> > > > +++ b/kernel/sched/psi.c
> > > > @@ -1061,7 +1061,7 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
> > > >                       mutex_unlock(&group->trigger_lock);
> > > >                       return ERR_CAST(kworker);
> > > >               }
> > > > -             sched_setscheduler(kworker->task, SCHED_FIFO, &param);
> > > > +             sched_setscheduler_nocheck(kworker->task, SCHED_FIFO, &param);
> > >
> > > ARGGH, wtf is there a FIFO-99!! thread here at all !?
> >
> > We need psi poll_kworker to be an rt-priority thread so that psi
>
> There is a giant difference between 'needs to be higher than OTHER' and
> FIFO-99.
>
> > notifications are delivered to the userspace without delay even when
> > the CPUs are very congested. Otherwise it's easy to delay psi
> > notifications by running a simple CPU hogger executing "chrt -f 50 dd
> > if=/dev/zero of=/dev/null". Because these notifications are
>
> So what; at that point that's exactly what you're asking for. Using RT
> is for those who know what they're doing.
>
> > time-critical for reacting to memory shortages we can't allow for such
> > delays.
>
> Furthermore, actual RT programs will have pre-allocated and locked any
> memory they rely on. They don't give a crap about your pressure
> nonsense.
>

This signal is used not to protect other RT tasks but to monitor
overall system memory health for the sake of system responsiveness.

> > Notice that this kworker is created only if userspace creates a psi
> > trigger. So unless you are using psi triggers you will never see this
> > kthread created.
>
> By marking it FIFO-99 you're in effect saying that your stupid
> statistics gathering is more important than your life. It will preempt
> the task that's in control of the band-saw emergency break, it will
> preempt the task that's adjusting the electromagnetic field containing
> this plasma flow.
>
> That's insane.

IMHO an opt-in feature stops being "stupid" as soon as the user opted
in to use it, therefore explicitly indicating interest in it. However
I assume you are using "stupid" here to indicate that it's "less
important" rather than it's "useless".

> I'm going to queue a patch to reduce this to FIFO-1, that will preempt
> regular OTHER tasks but will not perturb (much) actual RT bits.
>

Thanks for posting the fix.

> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
Peter Zijlstra Aug. 1, 2019, 9:59 p.m. UTC | #5
On Thu, Aug 01, 2019 at 11:28:30AM -0700, Suren Baghdasaryan wrote:
> > By marking it FIFO-99 you're in effect saying that your stupid
> > statistics gathering is more important than your life. It will preempt
> > the task that's in control of the band-saw emergency break, it will
> > preempt the task that's adjusting the electromagnetic field containing
> > this plasma flow.
> >
> > That's insane.
> 
> IMHO an opt-in feature stops being "stupid" as soon as the user opted
> in to use it, therefore explicitly indicating interest in it. However
> I assume you are using "stupid" here to indicate that it's "less
> important" rather than it's "useless".

Quite; PSI does have its uses. RT just isn't one of them.
Suren Baghdasaryan Aug. 2, 2019, 1:19 a.m. UTC | #6
On Thu, Aug 1, 2019 at 2:59 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Aug 01, 2019 at 11:28:30AM -0700, Suren Baghdasaryan wrote:
> > > By marking it FIFO-99 you're in effect saying that your stupid
> > > statistics gathering is more important than your life. It will preempt
> > > the task that's in control of the band-saw emergency break, it will
> > > preempt the task that's adjusting the electromagnetic field containing
> > > this plasma flow.
> > >
> > > That's insane.
> >
> > IMHO an opt-in feature stops being "stupid" as soon as the user opted
> > in to use it, therefore explicitly indicating interest in it. However
> > I assume you are using "stupid" here to indicate that it's "less
> > important" rather than it's "useless".
>
> Quite; PSI does have its uses. RT just isn't one of them.

Sorry about messing it up in the first place.
If you don't see any issues with my patch replacing
sched_setscheduler() with sched_setscheduler_nocheck(), would you mind
taking it too? I applied it over your patch onto Linus' ToT with no
merge conflicts.
Thanks,
Suren.

> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
diff mbox series

Patch

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 7acc632c3b82..ed9a1d573cb1 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1061,7 +1061,7 @@  struct psi_trigger *psi_trigger_create(struct psi_group *group,
 			mutex_unlock(&group->trigger_lock);
 			return ERR_CAST(kworker);
 		}
-		sched_setscheduler(kworker->task, SCHED_FIFO, &param);
+		sched_setscheduler_nocheck(kworker->task, SCHED_FIFO, &param);
 		kthread_init_delayed_work(&group->poll_work,
 				psi_poll_work);
 		rcu_assign_pointer(group->poll_kworker, kworker);