Message ID | 20230303011346.3342233-1-surenb@google.com (mailing list archive) |
---|---|
State | New |
Series | [v2,1/1] psi: remove 500ms min window size limitation for triggers |

On Thu, Mar 2, 2023 at 5:13 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Current 500ms min window size for psi triggers limits polling interval
> to 50ms to prevent polling threads from using too much cpu bandwidth by
> polling too frequently. However the number of cgroups with triggers is
> unlimited, so this protection can be defeated by creating multiple
> cgroups with psi triggers (triggers in each cgroup are served by a single
> "psimon" kernel thread).
> Instead of limiting min polling period, which also limits the latency of
> psi events, it's better to limit psi trigger creation to authorized users
> only, like we do for system-wide psi triggers (/proc/pressure/* files can
> be written only by processes with CAP_SYS_RESOURCE capability). This also
> makes access rules for cgroup psi files consistent with system-wide ones.
> Add a CAP_SYS_RESOURCE capability check for cgroup psi file writers and
> remove the psi window min size limitation.
>
> Suggested-by: Sudarshan Rajagopalan <quic_sudaraja@quicinc.com>
> Link: https://lore.kernel.org/all/cover.1676067791.git.quic_sudaraja@quicinc.com/
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Forgot to change the --to field from Tejun to PeterZ.
Peter, just to clarify, this change is targeted for inclusion in your tree.
Thanks!

> ---
>  kernel/cgroup/cgroup.c | 10 ++++++++++
>  kernel/sched/psi.c     |  4 +---
>  2 files changed, 11 insertions(+), 3 deletions(-)
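
The arithmetic behind the commit message can be sketched in plain user-space C. This is only an illustration, not kernel code: the constants are copied from the quoted patch, while the helper names (`poll_period_us`, `window_valid_old`, `window_valid_new`, `max_wakeups_per_sec`) are hypothetical. It assumes the polling period is the window divided by `UPDATES_PER_WINDOW`, which is how a 500ms window yields the 50ms interval mentioned above, and it treats the wakeup count as a rough upper bound for cgroups that are all polling continuously.

```c
#include <stdint.h>

/* Constants as they appear in the patch to kernel/sched/psi.c. */
#define WINDOW_MAX_US      10000000u /* Max window size is 10s */
#define UPDATES_PER_WINDOW 10u       /* 10 updates per window */
#define OLD_WINDOW_MIN_US  500000u   /* WINDOW_MIN_US, removed by the patch */

/* Hypothetical helper: polling period implied by a trigger window. */
uint64_t poll_period_us(uint64_t window_us)
{
	return window_us / UPDATES_PER_WINDOW;
}

/* Hypothetical helper: the window check before the patch. */
int window_valid_old(uint64_t window_us)
{
	return window_us >= OLD_WINDOW_MIN_US && window_us <= WINDOW_MAX_US;
}

/* Hypothetical helper: the window check after the patch. */
int window_valid_new(uint64_t window_us)
{
	return window_us != 0 && window_us <= WINDOW_MAX_US;
}

/*
 * Rough upper bound on monitor wakeups per second when nr_cgroups
 * cgroups each hold one trigger with the given window and all of
 * them are polling. Shows why a per-trigger minimum does not bound
 * the total: the result grows linearly with nr_cgroups.
 */
uint64_t max_wakeups_per_sec(uint64_t nr_cgroups, uint64_t window_us)
{
	uint64_t period = poll_period_us(window_us);

	return period ? nr_cgroups * (1000000u / period) : 0;
}
```

Under these assumptions, 100 cgroups with the old minimum window already add up to 2000 wakeups per second, which is the gap the capability check is meant to close.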

On Thu, Mar 2, 2023 at 5:16 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Forgot to change the --to field from Tejun to PeterZ.
> Peter, just to clarify, this change is targeted for inclusion in your tree.

I think this patch slipped through the cracks. Peter, could you please
take it into your tree?
Thanks,
Suren.


On Tue, May 02, 2023 at 10:20:34AM -0700, Suren Baghdasaryan wrote:
> I think this patch slipped through the cracks. Peter, could you please
> take it into your tree?

Sorry, yes, got lost. I'll go queue it for post -rc1. No urgency with
this right?


On Tue, May 2, 2023 at 10:24 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Sorry, yes, got lost. I'll go queue it for post -rc1. No urgency with
> this right?

Yes, I'll be merging it into Android branches counting on it making
upstream later on :) Greg will hate me for that but I'll survive.
Thanks!

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 935e8121b21e..b600a6baaeca 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3867,6 +3867,12 @@ static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
 	return psi_trigger_poll(&ctx->psi.trigger, of->file, pt);
 }
 
+static int cgroup_pressure_open(struct kernfs_open_file *of)
+{
+	return (of->file->f_mode & FMODE_WRITE && !capable(CAP_SYS_RESOURCE)) ?
+		-EPERM : 0;
+}
+
 static void cgroup_pressure_release(struct kernfs_open_file *of)
 {
 	struct cgroup_file_ctx *ctx = of->priv;
@@ -5266,6 +5272,7 @@ static struct cftype cgroup_psi_files[] = {
 	{
 		.name = "io.pressure",
 		.file_offset = offsetof(struct cgroup, psi_files[PSI_IO]),
+		.open = cgroup_pressure_open,
 		.seq_show = cgroup_io_pressure_show,
 		.write = cgroup_io_pressure_write,
 		.poll = cgroup_pressure_poll,
@@ -5274,6 +5281,7 @@ static struct cftype cgroup_psi_files[] = {
 	{
 		.name = "memory.pressure",
 		.file_offset = offsetof(struct cgroup, psi_files[PSI_MEM]),
+		.open = cgroup_pressure_open,
 		.seq_show = cgroup_memory_pressure_show,
 		.write = cgroup_memory_pressure_write,
 		.poll = cgroup_pressure_poll,
@@ -5282,6 +5290,7 @@ static struct cftype cgroup_psi_files[] = {
 	{
 		.name = "cpu.pressure",
 		.file_offset = offsetof(struct cgroup, psi_files[PSI_CPU]),
+		.open = cgroup_pressure_open,
 		.seq_show = cgroup_cpu_pressure_show,
 		.write = cgroup_cpu_pressure_write,
 		.poll = cgroup_pressure_poll,
@@ -5291,6 +5300,7 @@ static struct cftype cgroup_psi_files[] = {
 	{
 		.name = "irq.pressure",
 		.file_offset = offsetof(struct cgroup, psi_files[PSI_IRQ]),
+		.open = cgroup_pressure_open,
 		.seq_show = cgroup_irq_pressure_show,
 		.write = cgroup_irq_pressure_write,
 		.poll = cgroup_pressure_poll,
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 02e011cabe91..0945f956bf80 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -160,7 +160,6 @@ __setup("psi=", setup_psi);
 #define EXP_300s 2034 /* 1/exp(2s/300s) */
 
 /* PSI trigger definitions */
-#define WINDOW_MIN_US 500000 /* Min window size is 500ms */
 #define WINDOW_MAX_US 10000000 /* Max window size is 10s */
 #define UPDATES_PER_WINDOW 10 /* 10 updates per window */
 
@@ -1278,8 +1277,7 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
 	if (state >= PSI_NONIDLE)
 		return ERR_PTR(-EINVAL);
 
-	if (window_us < WINDOW_MIN_US ||
-		window_us > WINDOW_MAX_US)
+	if (window_us == 0 || window_us > WINDOW_MAX_US)
 		return ERR_PTR(-EINVAL);
 
 	/* Check threshold */
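
For readers wondering what this looks like from user space, here is a minimal sketch of registering a trigger on a cgroup pressure file and waiting for an event. It follows the general pattern of writing `<some|full> <stall threshold us> <window us>` to the file and polling for `POLLPRI`. The helper names (`format_psi_trigger`, `register_psi_trigger`, `wait_for_psi_event`) are hypothetical, and the path passed in by the caller (for example a `memory.pressure` file under a cgroup v2 mount) is an assumption about the local setup. With the patch applied, the open for writing is expected to fail unless the caller has CAP_SYS_RESOURCE, and a window shorter than 500ms, such as the 100ms used in the usage note, should no longer be rejected.

```c
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: build "<kind> <threshold_us> <window_us>",
 * where kind is "some" or "full". Returns 0 on success, -1 if the
 * buffer is too small or formatting fails.
 */
int format_psi_trigger(char *buf, size_t len, const char *kind,
		       unsigned int threshold_us, unsigned int window_us)
{
	int n = snprintf(buf, len, "%s %u %u", kind, threshold_us, window_us);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: open a pressure file for writing and register
 * the trigger. Returns the open fd, or -1 on failure (errno is set by
 * open() or write(); with the patch, an unprivileged writer is
 * expected to be refused at open time).
 */
int register_psi_trigger(const char *path, const char *trigger)
{
	int fd = open(path, O_RDWR | O_NONBLOCK);

	if (fd < 0)
		return -1;
	if (write(fd, trigger, strlen(trigger) + 1) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Hypothetical helper: wait for one trigger event.
 * Returns 1 on an event, 0 on timeout, -1 on error.
 */
int wait_for_psi_event(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };
	int n = poll(&pfd, 1, timeout_ms);

	if (n <= 0)
		return n;
	if (pfd.revents & POLLERR)
		return -1;
	return (pfd.revents & POLLPRI) ? 1 : 0;
}
```

A caller would format something like `some 10000 100000` (10ms of stall within a 100ms window), pass it to `register_psi_trigger()` together with the pressure file path, and then loop on `wait_for_psi_event()`.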