Message ID | 20190730013310.162367-1-surenb@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/1] psi: do not require setsched permission from the trigger creator | expand |
On Mon, Jul 29, 2019 at 06:33:10PM -0700, Suren Baghdasaryan wrote: > When a process creates a new trigger by writing into /proc/pressure/* > files, permissions to write such a file should be used to determine whether > the process is allowed to do so or not. Current implementation would also > require such a process to have setsched capability. Setting of psi trigger > thread's scheduling policy is an implementation detail and should not be > exposed to the user level. Remove the permission check by using _nocheck > version of the function. > > Suggested-by: Nick Kralevich <nnk@google.com> > Signed-off-by: Suren Baghdasaryan <surenb@google.com> > --- > kernel/sched/psi.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c > index 7acc632c3b82..ed9a1d573cb1 100644 > --- a/kernel/sched/psi.c > +++ b/kernel/sched/psi.c > @@ -1061,7 +1061,7 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group, > mutex_unlock(&group->trigger_lock); > return ERR_CAST(kworker); > } > - sched_setscheduler(kworker->task, SCHED_FIFO, ¶m); > + sched_setscheduler_nocheck(kworker->task, SCHED_FIFO, ¶m); ARGGH, wtf is there a FIFO-99!! thread here at all !? > kthread_init_delayed_work(&group->poll_work, > psi_poll_work); > rcu_assign_pointer(group->poll_kworker, kworker); > -- > 2.22.0.709.g102302147b-goog >
On Tue, Jul 30, 2019 at 1:11 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, Jul 29, 2019 at 06:33:10PM -0700, Suren Baghdasaryan wrote: > > When a process creates a new trigger by writing into /proc/pressure/* > > files, permissions to write such a file should be used to determine whether > > the process is allowed to do so or not. Current implementation would also > > require such a process to have setsched capability. Setting of psi trigger > > thread's scheduling policy is an implementation detail and should not be > > exposed to the user level. Remove the permission check by using _nocheck > > version of the function. > > > > Suggested-by: Nick Kralevich <nnk@google.com> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com> > > --- > > kernel/sched/psi.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c > > index 7acc632c3b82..ed9a1d573cb1 100644 > > --- a/kernel/sched/psi.c > > +++ b/kernel/sched/psi.c > > @@ -1061,7 +1061,7 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group, > > mutex_unlock(&group->trigger_lock); > > return ERR_CAST(kworker); > > } > > - sched_setscheduler(kworker->task, SCHED_FIFO, ¶m); > > + sched_setscheduler_nocheck(kworker->task, SCHED_FIFO, ¶m); > > ARGGH, wtf is there a FIFO-99!! thread here at all !? We need psi poll_kworker to be an rt-priority thread so that psi notifications are delivered to the userspace without delay even when the CPUs are very congested. Otherwise it's easy to delay psi notifications by running a simple CPU hogger executing "chrt -f 50 dd if=/dev/zero of=/dev/null". Because these notifications are time-critical for reacting to memory shortages we can't allow for such delays. Notice that this kworker is created only if userspace creates a psi trigger. So unless you are using psi triggers you will never see this kthread created. > > kthread_init_delayed_work(&group->poll_work, > > psi_poll_work); > > rcu_assign_pointer(group->poll_kworker, kworker); > > -- > > 2.22.0.709.g102302147b-goog > > > > -- > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com. >
On Tue, Jul 30, 2019 at 10:44:51AM -0700, Suren Baghdasaryan wrote: > On Tue, Jul 30, 2019 at 1:11 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > > On Mon, Jul 29, 2019 at 06:33:10PM -0700, Suren Baghdasaryan wrote: > > > When a process creates a new trigger by writing into /proc/pressure/* > > > files, permissions to write such a file should be used to determine whether > > > the process is allowed to do so or not. Current implementation would also > > > require such a process to have setsched capability. Setting of psi trigger > > > thread's scheduling policy is an implementation detail and should not be > > > exposed to the user level. Remove the permission check by using _nocheck > > > version of the function. > > > > > > Suggested-by: Nick Kralevich <nnk@google.com> > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com> > > > --- > > > kernel/sched/psi.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c > > > index 7acc632c3b82..ed9a1d573cb1 100644 > > > --- a/kernel/sched/psi.c > > > +++ b/kernel/sched/psi.c > > > @@ -1061,7 +1061,7 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group, > > > mutex_unlock(&group->trigger_lock); > > > return ERR_CAST(kworker); > > > } > > > - sched_setscheduler(kworker->task, SCHED_FIFO, ¶m); > > > + sched_setscheduler_nocheck(kworker->task, SCHED_FIFO, ¶m); > > > > ARGGH, wtf is there a FIFO-99!! thread here at all !? > > We need psi poll_kworker to be an rt-priority thread so that psi There is a giant difference between 'needs to be higher than OTHER' and FIFO-99. > notifications are delivered to the userspace without delay even when > the CPUs are very congested. Otherwise it's easy to delay psi > notifications by running a simple CPU hogger executing "chrt -f 50 dd > if=/dev/zero of=/dev/null". Because these notifications are So what; at that point that's exactly what you're asking for. Using RT is for those who know what they're doing. > time-critical for reacting to memory shortages we can't allow for such > delays. Furthermore, actual RT programs will have pre-allocated and locked any memory they rely on. They don't give a crap about your pressure nonsense. > Notice that this kworker is created only if userspace creates a psi > trigger. So unless you are using psi triggers you will never see this > kthread created. By marking it FIFO-99 you're in effect saying that your stupid statistics gathering is more important than your life. It will preempt the task that's in control of the band-saw emergency break, it will preempt the task that's adjusting the electromagnetic field containing this plasma flow. That's insane. I'm going to queue a patch to reduce this to FIFO-1, that will preempt regular OTHER tasks but will not perturb (much) actual RT bits.
Hi Peter, Thanks for sharing your thoughts. I understand your point and I tend to agree with it. I originally designed this using watchdog as the example of a critical system health signal and in the context of mobile device memory pressure is critical but I agree that there are more important things in life. I checked and your proposal to change it to FIFO-1 should still work for our purposes. Will test to make sure and reply to your patch. Couple clarifications in-line. On Thu, Aug 1, 2019 at 2:51 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Tue, Jul 30, 2019 at 10:44:51AM -0700, Suren Baghdasaryan wrote: > > On Tue, Jul 30, 2019 at 1:11 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > On Mon, Jul 29, 2019 at 06:33:10PM -0700, Suren Baghdasaryan wrote: > > > > When a process creates a new trigger by writing into /proc/pressure/* > > > > files, permissions to write such a file should be used to determine whether > > > > the process is allowed to do so or not. Current implementation would also > > > > require such a process to have setsched capability. Setting of psi trigger > > > > thread's scheduling policy is an implementation detail and should not be > > > > exposed to the user level. Remove the permission check by using _nocheck > > > > version of the function. > > > > > > > > Suggested-by: Nick Kralevich <nnk@google.com> > > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com> > > > > --- > > > > kernel/sched/psi.c | 2 +- > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c > > > > index 7acc632c3b82..ed9a1d573cb1 100644 > > > > --- a/kernel/sched/psi.c > > > > +++ b/kernel/sched/psi.c > > > > @@ -1061,7 +1061,7 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group, > > > > mutex_unlock(&group->trigger_lock); > > > > return ERR_CAST(kworker); > > > > } > > > > - sched_setscheduler(kworker->task, SCHED_FIFO, ¶m); > > > > + sched_setscheduler_nocheck(kworker->task, SCHED_FIFO, ¶m); > > > > > > ARGGH, wtf is there a FIFO-99!! thread here at all !? > > > > We need psi poll_kworker to be an rt-priority thread so that psi > > There is a giant difference between 'needs to be higher than OTHER' and > FIFO-99. > > > notifications are delivered to the userspace without delay even when > > the CPUs are very congested. Otherwise it's easy to delay psi > > notifications by running a simple CPU hogger executing "chrt -f 50 dd > > if=/dev/zero of=/dev/null". Because these notifications are > > So what; at that point that's exactly what you're asking for. Using RT > is for those who know what they're doing. > > > time-critical for reacting to memory shortages we can't allow for such > > delays. > > Furthermore, actual RT programs will have pre-allocated and locked any > memory they rely on. They don't give a crap about your pressure > nonsense. > This signal is used not to protect other RT tasks but to monitor overall system memory health for the sake of system responsiveness. > > Notice that this kworker is created only if userspace creates a psi > > trigger. So unless you are using psi triggers you will never see this > > kthread created. > > By marking it FIFO-99 you're in effect saying that your stupid > statistics gathering is more important than your life. It will preempt > the task that's in control of the band-saw emergency break, it will > preempt the task that's adjusting the electromagnetic field containing > this plasma flow. > > That's insane. IMHO an opt-in feature stops being "stupid" as soon as the user opted in to use it, therefore explicitly indicating interest in it. However I assume you are using "stupid" here to indicate that it's "less important" rather than it's "useless". > I'm going to queue a patch to reduce this to FIFO-1, that will preempt > regular OTHER tasks but will not perturb (much) actual RT bits. > Thanks for posting the fix. > -- > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com. >
On Thu, Aug 01, 2019 at 11:28:30AM -0700, Suren Baghdasaryan wrote: > > By marking it FIFO-99 you're in effect saying that your stupid > > statistics gathering is more important than your life. It will preempt > > the task that's in control of the band-saw emergency break, it will > > preempt the task that's adjusting the electromagnetic field containing > > this plasma flow. > > > > That's insane. > > IMHO an opt-in feature stops being "stupid" as soon as the user opted > in to use it, therefore explicitly indicating interest in it. However > I assume you are using "stupid" here to indicate that it's "less > important" rather than it's "useless". Quite; PSI does have its uses. RT just isn't one of them.
On Thu, Aug 1, 2019 at 2:59 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Thu, Aug 01, 2019 at 11:28:30AM -0700, Suren Baghdasaryan wrote: > > > By marking it FIFO-99 you're in effect saying that your stupid > > > statistics gathering is more important than your life. It will preempt > > > the task that's in control of the band-saw emergency break, it will > > > preempt the task that's adjusting the electromagnetic field containing > > > this plasma flow. > > > > > > That's insane. > > > > IMHO an opt-in feature stops being "stupid" as soon as the user opted > > in to use it, therefore explicitly indicating interest in it. However > > I assume you are using "stupid" here to indicate that it's "less > > important" rather than it's "useless". > > Quite; PSI does have its uses. RT just isn't one of them. Sorry about messing it up in the first place. If you don't see any issues with my patch replacing sched_setscheduler() with sched_setscheduler_nocheck(), would you mind taking it too? I applied it over your patch onto Linus' ToT with no merge conflicts. Thanks, Suren. > -- > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com. >
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 7acc632c3b82..ed9a1d573cb1 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1061,7 +1061,7 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group, mutex_unlock(&group->trigger_lock); return ERR_CAST(kworker); } - sched_setscheduler(kworker->task, SCHED_FIFO, ¶m); + sched_setscheduler_nocheck(kworker->task, SCHED_FIFO, ¶m); kthread_init_delayed_work(&group->poll_work, psi_poll_work); rcu_assign_pointer(group->poll_kworker, kworker);
When a process creates a new trigger by writing into /proc/pressure/* files, permissions to write such a file should be used to determine whether the process is allowed to do so or not. Current implementation would also require such a process to have setsched capability. Setting of psi trigger thread's scheduling policy is an implementation detail and should not be exposed to the user level. Remove the permission check by using _nocheck version of the function. Suggested-by: Nick Kralevich <nnk@google.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> --- kernel/sched/psi.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)