Message ID | 20190208100554.32196-13-patrick.bellasi@arm.com (mailing list archive)
---|---
State | Not Applicable, archived
Series | Add utilization clamping support
On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>
> In order to properly support hierarchical resource control, the cgroup
> delegation model requires that attribute writes from a child group never
> fail but still are (potentially) constrained based on the parent's
> assigned resources. This requires properly propagating and aggregating
> parent attributes down to its descendants.
>
> Let's implement this mechanism by adding a new "effective" clamp value
> for each task group. The effective clamp value is defined as the smaller
> value between the clamp value of a group and the effective clamp value
> of its parent. This is the actual clamp value enforced on tasks in a
> task group.

In patch 10 in this series you mentioned "b) do not enforce any
constraints and/or dependencies between the parent and its child
nodes"

This patch seems to change that behavior. If so, should it be documented?

> Since it can be interesting for userspace, e.g. system management
> software, to know exactly what the currently propagated/enforced
> configuration is, the effective clamp values are exposed to user-space
> by means of a new pair of read-only attributes
> cpu.util.{min,max}.effective.
>
> Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Tejun Heo <tj@kernel.org>

[...]

> @@ -7011,6 +7029,53 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
>  }
>
>  #ifdef CONFIG_UCLAMP_TASK_GROUP
> +static void cpu_util_update_hier(struct cgroup_subsys_state *css,

s/cpu_util_update_hier/cpu_util_update_heir ?

> +                                unsigned int clamp_id, unsigned int bucket_id,
> +                                unsigned int value)
> +{

[...]

> --
> 2.20.1
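For reference, the aggregation rule described in the changelog reduces to
taking the minimum of a group's requested clamp and its parent's effective
clamp. The following is a minimal standalone sketch of just that rule, with
hypothetical types and names rather than the kernel's own:

#include <stdio.h>

/* Hypothetical stand-in for the kernel's per-group clamp state. */
struct group_clamp {
        unsigned int requested; /* value written by the group's owner */
        unsigned int effective; /* value actually enforced on tasks */
};

/*
 * The effective clamp is the smaller of the group's own request and the
 * parent's effective value, so a parent always constrains its subtree.
 */
static unsigned int effective_clamp(unsigned int requested,
                                    unsigned int parent_effective)
{
        return requested < parent_effective ? requested : parent_effective;
}

int main(void)
{
        struct group_clamp parent = { .requested = 512, .effective = 512 };
        struct group_clamp child = { .requested = 800 };

        child.effective = effective_clamp(child.requested, parent.effective);

        /* The child's write never fails, but it is constrained: prints 512. */
        printf("child effective util.min: %u\n", child.effective);
        return 0;
}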
On 14-Mar 09:17, Suren Baghdasaryan wrote:
> On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> >
> > In order to properly support hierarchical resource control, the cgroup
> > delegation model requires that attribute writes from a child group never
> > fail but still are (potentially) constrained based on the parent's
> > assigned resources. This requires properly propagating and aggregating
> > parent attributes down to its descendants.

[...]

> In patch 10 in this series you mentioned "b) do not enforce any
> constraints and/or dependencies between the parent and its child
> nodes"
>
> This patch seems to change that behavior. If so, should it be documented?

No, I actually have to update the changelog of that patch.

What I mean is that we do not enforce constraints among "requested"
values, thus ensuring that each sub-group can always request a clamp
value. Of course, whether it actually gets that value depends on the
parent's constraints, which are propagated down the hierarchy in the
form of "effective" values by cpu_util_update_hier().

I'll fix the changelog in patch 10, which seems to be confusing for
Tejun too.

[...]

> >  #ifdef CONFIG_UCLAMP_TASK_GROUP
> > +static void cpu_util_update_hier(struct cgroup_subsys_state *css,
>
> s/cpu_util_update_hier/cpu_util_update_heir ?

Mmm... why?

That "_hier" stands for "hierarchical". However, since we update the
effective values there, maybe it would be better to rename it to "_eff"?

[...]

Cheers,
Patrick
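To make the propagation concrete, here is a minimal userspace model of what
cpu_util_update_hier() does: walk the subtree top-down, clamp each group's
effective value by its parent's, and stop descending into subtrees that are
already consistent. Recursion stands in for css_for_each_descendant_pre()
and the css_rightmost_descendant() skip; every name below is hypothetical:

#include <stdio.h>

/* Toy task-group node; the kernel walks css nodes instead. */
struct tg {
        unsigned int value;             /* requested clamp */
        unsigned int effective;         /* propagated clamp */
        struct tg *parent;
        struct tg *child;               /* first child */
        struct tg *sibling;             /* next sibling */
};

static void update_effective(struct tg *tg)
{
        unsigned int value = tg->value;

        /* Propagate the most restrictive effective value. */
        if (tg->parent && tg->parent->effective < value)
                value = tg->parent->effective;

        /*
         * If the effective value did not change here, the whole subtree
         * is already consistent: this models the subtree skip.
         */
        if (tg->effective == value)
                return;
        tg->effective = value;

        for (struct tg *c = tg->child; c; c = c->sibling)
                update_effective(c);
}

int main(void)
{
        struct tg root = { .value = 1024, .effective = 1024 };
        struct tg mid = { .value = 600, .effective = 600, .parent = &root };
        struct tg leaf = { .value = 900, .effective = 600, .parent = &mid };

        root.child = &mid;
        mid.child = &leaf;

        /* Tighten the middle group: the leaf's effective value follows. */
        mid.value = 400;
        update_effective(&mid);
        printf("mid=%u leaf=%u\n", mid.effective, leaf.effective); /* 400 400 */
        return 0;
}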
On Mon, Mar 18, 2019 at 9:54 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>
> On 14-Mar 09:17, Suren Baghdasaryan wrote:
> > On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:

[...]

> > In patch 10 in this series you mentioned "b) do not enforce any
> > constraints and/or dependencies between the parent and its child
> > nodes"
> >
> > This patch seems to change that behavior. If so, should it be documented?
>
> No, I actually have to update the changelog of that patch.
>
> What I mean is that we do not enforce constraints among "requested"
> values, thus ensuring that each sub-group can always request a clamp
> value. Of course, whether it actually gets that value depends on the
> parent's constraints, which are propagated down the hierarchy in the
> form of "effective" values by cpu_util_update_hier().
>
> I'll fix the changelog in patch 10, which seems to be confusing for
> Tejun too.
>
> [...]
>
> > >  #ifdef CONFIG_UCLAMP_TASK_GROUP
> > > +static void cpu_util_update_hier(struct cgroup_subsys_state *css,
> >
> > s/cpu_util_update_hier/cpu_util_update_heir ?
>
> Mmm... why?
>
> That "_hier" stands for "hierarchical".

Yeah, I realized that later on but did not want to create more chatter.
_hier seems fine.

> However, since we update the effective values there, maybe it would be
> better to rename it to "_eff"?

[...]

> --
> #include <best/regards.h>
>
> Patrick Bellasi
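As a user-space usage sketch, system management software could compare the
requested and effective attributes to detect when a parent constraint is in
force. The group path below is hypothetical and assumes cgroup v2 mounted
at /sys/fs/cgroup with the cpu controller enabled for the group:

#include <stdio.h>

static int read_u64(const char *path, unsigned long long *val)
{
        FILE *f = fopen(path, "r");

        if (!f)
                return -1;
        if (fscanf(f, "%llu", val) != 1) {
                fclose(f);
                return -1;
        }
        fclose(f);
        return 0;
}

int main(void)
{
        const char *base = "/sys/fs/cgroup/mygroup"; /* hypothetical group */
        char path[256];
        unsigned long long requested, effective;

        snprintf(path, sizeof(path), "%s/cpu.util.min", base);
        if (read_u64(path, &requested))
                return 1;
        snprintf(path, sizeof(path), "%s/cpu.util.min.effective", base);
        if (read_u64(path, &effective))
                return 1;

        if (effective < requested)
                printf("util.min constrained by parent: %llu < %llu\n",
                       effective, requested);
        else
                printf("util.min request in force: %llu\n", effective);
        return 0;
}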
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 47710a77f4fa..7aad2435e961 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -990,6 +990,16 @@ All time durations are in microseconds.
         values similar to the sched_setattr(2). This minimum utilization
         value is used to clamp the task specific minimum utilization clamp.
 
+  cpu.util.min.effective
+        A read-only single value file which exists on non-root cgroups and
+        reports the minimum utilization clamp value currently enforced on a
+        task group.
+
+        The actual minimum utilization in the range [0, 1024].
+
+        This value can be lower than cpu.util.min when a parent cgroup
+        allows only smaller minimum utilization values.
+
   cpu.util.max
         A read-write single value file which exists on non-root cgroups.
         The default is "1024". i.e. no utilization capping
@@ -1000,6 +1010,15 @@ All time durations are in microseconds.
         values similar to the sched_setattr(2). This maximum utilization
         value is used to clamp the task specific maximum utilization clamp.
 
+  cpu.util.max.effective
+        A read-only single value file which exists on non-root cgroups and
+        reports the maximum utilization clamp value currently enforced on a
+        task group.
+
+        The actual maximum utilization in the range [0, 1024].
+
+        This value can be lower than cpu.util.max when a parent cgroup
+        enforces a more restrictive clamping on max utilization.
 
 
 Memory
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 122ab069ade5..1e54517acd58 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -720,6 +720,18 @@ static void set_load_weight(struct task_struct *p, bool update_load)
 }
 
 #ifdef CONFIG_UCLAMP_TASK
+/*
+ * Serializes updates of utilization clamp values.
+ *
+ * User-space (the slow path) triggers utilization clamp value updates
+ * which can require updates to (fast-path) scheduler data structures
+ * used to support enqueue/dequeue operations.
+ * While the per-CPU rq lock protects fast-path update operations,
+ * user-space requests are serialized using a mutex to reduce the risk
+ * of conflicting updates or API abuses.
+ */
+static DEFINE_MUTEX(uclamp_mutex);
+
 /* Max allowed minimum utilization */
 unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
 
@@ -1127,6 +1139,8 @@ static void __init init_uclamp(void)
         unsigned int value;
         int cpu;
 
+        mutex_init(&uclamp_mutex);
+
         for_each_possible_cpu(cpu) {
                 memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
                 cpu_rq(cpu)->uclamp_flags = 0;
@@ -6758,6 +6772,10 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
                         parent->uclamp[clamp_id].value;
                 tg->uclamp[clamp_id].bucket_id =
                         parent->uclamp[clamp_id].bucket_id;
+                tg->uclamp[clamp_id].effective.value =
+                        parent->uclamp[clamp_id].effective.value;
+                tg->uclamp[clamp_id].effective.bucket_id =
+                        parent->uclamp[clamp_id].effective.bucket_id;
         }
 #endif
 
@@ -7011,6 +7029,53 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 }
 
 #ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_hier(struct cgroup_subsys_state *css,
+                                 unsigned int clamp_id, unsigned int bucket_id,
+                                 unsigned int value)
+{
+        struct cgroup_subsys_state *top_css = css;
+        struct uclamp_se *uc_se, *uc_parent;
+
+        css_for_each_descendant_pre(css, top_css) {
+                /*
+                 * The first visited task group is top_css, whose clamp value
+                 * is the one passed as a parameter. For descendant task
+                 * groups we consider their current value.
+                 */
+                uc_se = &css_tg(css)->uclamp[clamp_id];
+                if (css != top_css) {
+                        value = uc_se->value;
+                        bucket_id = uc_se->effective.bucket_id;
+                }
+                uc_parent = NULL;
+                if (css_tg(css)->parent)
+                        uc_parent = &css_tg(css)->parent->uclamp[clamp_id];
+
+                /*
+                 * Skip the whole subtree if the current effective clamp
+                 * already matches the TG's clamp value.
+                 * In this case, all the subtrees already have top_value, or a
+                 * more restrictive value, as effective clamp.
+                 */
+                if (uc_se->effective.value == value &&
+                    uc_parent && uc_parent->effective.value >= value) {
+                        css = css_rightmost_descendant(css);
+                        continue;
+                }
+
+                /* Propagate the most restrictive effective value */
+                if (uc_parent && uc_parent->effective.value < value) {
+                        value = uc_parent->effective.value;
+                        bucket_id = uc_parent->effective.bucket_id;
+                }
+                if (uc_se->effective.value == value)
+                        continue;
+
+                uc_se->effective.value = value;
+                uc_se->effective.bucket_id = bucket_id;
+        }
+}
+
 static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
                                   struct cftype *cftype, u64 min_value)
 {
@@ -7020,6 +7085,7 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
         if (min_value > SCHED_CAPACITY_SCALE)
                 return -ERANGE;
 
+        mutex_lock(&uclamp_mutex);
         rcu_read_lock();
 
         tg = css_tg(css);
@@ -7038,8 +7104,13 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
         tg->uclamp[UCLAMP_MIN].value = min_value;
         tg->uclamp[UCLAMP_MIN].bucket_id = uclamp_bucket_id(min_value);
 
+        /* Update effective clamps to track the most restrictive value */
+        cpu_util_update_hier(css, UCLAMP_MIN, tg->uclamp[UCLAMP_MIN].bucket_id,
+                             min_value);
+
 out:
         rcu_read_unlock();
+        mutex_unlock(&uclamp_mutex);
 
         return ret;
 }
@@ -7053,6 +7124,7 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
         if (max_value > SCHED_CAPACITY_SCALE)
                 return -ERANGE;
 
+        mutex_lock(&uclamp_mutex);
         rcu_read_lock();
 
         tg = css_tg(css);
@@ -7071,21 +7143,29 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
         tg->uclamp[UCLAMP_MAX].value = max_value;
         tg->uclamp[UCLAMP_MAX].bucket_id = uclamp_bucket_id(max_value);
 
+        /* Update effective clamps to track the most restrictive value */
+        cpu_util_update_hier(css, UCLAMP_MAX, tg->uclamp[UCLAMP_MAX].bucket_id,
+                             max_value);
+
 out:
         rcu_read_unlock();
+        mutex_unlock(&uclamp_mutex);
 
         return ret;
 }
 
 static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
-                                  enum uclamp_id clamp_id)
+                                  enum uclamp_id clamp_id,
+                                  bool effective)
 {
         struct task_group *tg;
         u64 util_clamp;
 
         rcu_read_lock();
         tg = css_tg(css);
-        util_clamp = tg->uclamp[clamp_id].value;
+        util_clamp = effective
+                ? tg->uclamp[clamp_id].effective.value
+                : tg->uclamp[clamp_id].value;
         rcu_read_unlock();
 
         return util_clamp;
@@ -7094,13 +7174,25 @@ static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
 static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
                                  struct cftype *cft)
 {
-        return cpu_uclamp_read(css, UCLAMP_MIN);
+        return cpu_uclamp_read(css, UCLAMP_MIN, false);
 }
 
 static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
                                  struct cftype *cft)
 {
-        return cpu_uclamp_read(css, UCLAMP_MAX);
+        return cpu_uclamp_read(css, UCLAMP_MAX, false);
+}
+
+static u64 cpu_util_min_effective_read_u64(struct cgroup_subsys_state *css,
+                                           struct cftype *cft)
+{
+        return cpu_uclamp_read(css, UCLAMP_MIN, true);
+}
+
+static u64 cpu_util_max_effective_read_u64(struct cgroup_subsys_state *css,
+                                           struct cftype *cft)
+{
+        return cpu_uclamp_read(css, UCLAMP_MAX, true);
 }
 #endif /* CONFIG_UCLAMP_TASK_GROUP */
 
@@ -7448,11 +7540,19 @@ static struct cftype cpu_legacy_files[] = {
                 .read_u64 = cpu_util_min_read_u64,
                 .write_u64 = cpu_util_min_write_u64,
         },
+        {
+                .name = "util.min.effective",
+                .read_u64 = cpu_util_min_effective_read_u64,
+        },
         {
                 .name = "util.max",
                 .read_u64 = cpu_util_max_read_u64,
                 .write_u64 = cpu_util_max_write_u64,
         },
+        {
+                .name = "util.max.effective",
+                .read_u64 = cpu_util_max_effective_read_u64,
+        },
 #endif
         { }     /* Terminate */
 };
@@ -7628,12 +7728,22 @@ static struct cftype cpu_files[] = {
                 .read_u64 = cpu_util_min_read_u64,
                 .write_u64 = cpu_util_min_write_u64,
         },
+        {
+                .name = "util.min.effective",
+                .flags = CFTYPE_NOT_ON_ROOT,
+                .read_u64 = cpu_util_min_effective_read_u64,
+        },
         {
                 .name = "util.max",
                 .flags = CFTYPE_NOT_ON_ROOT,
                 .read_u64 = cpu_util_max_read_u64,
                 .write_u64 = cpu_util_max_write_u64,
         },
+        {
+                .name = "util.max.effective",
+                .flags = CFTYPE_NOT_ON_ROOT,
+                .read_u64 = cpu_util_max_effective_read_u64,
+        },
 #endif
         { }     /* terminate */
 };
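The uclamp_mutex comment in the hunk above describes a common split:
writers take a mutex on the slow path while the fast path stays lock-free.
A rough userspace analogy of that pattern, not kernel code, using C11
atomics in place of the rq-lock-protected fast-path data:

#include <pthread.h>
#include <stdatomic.h>

/* Writers serialize here, like user-space writes do on uclamp_mutex. */
static pthread_mutex_t clamp_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Values the lock-free fast path reads, like the rq-side clamp data. */
static _Atomic unsigned int util_min = 0;
static _Atomic unsigned int util_max = 1024;

/* Slow path: validate against the other clamp, then publish atomically. */
static int set_util_min(unsigned int value)
{
        int ret = 0;

        pthread_mutex_lock(&clamp_mutex);
        if (value > atomic_load(&util_max))
                ret = -1;       /* would cross the current max */
        else
                atomic_store(&util_min, value);
        pthread_mutex_unlock(&clamp_mutex);
        return ret;
}

/* Fast path: a plain atomic load, never blocked by writers. */
static unsigned int get_util_min(void)
{
        return atomic_load(&util_min);
}

int main(void)
{
        return set_util_min(512) == 0 && get_util_min() == 512 ? 0 : 1;
}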
In order to properly support hierarchical resource control, the cgroup
delegation model requires that attribute writes from a child group never
fail but still are (potentially) constrained based on the parent's
assigned resources. This requires properly propagating and aggregating
parent attributes down to its descendants.

Let's implement this mechanism by adding a new "effective" clamp value
for each task group. The effective clamp value is defined as the smaller
value between the clamp value of a group and the effective clamp value
of its parent. This is the actual clamp value enforced on tasks in a
task group.

Since it can be interesting for userspace, e.g. system management
software, to know exactly what the currently propagated/enforced
configuration is, the effective clamp values are exposed to user-space
by means of a new pair of read-only attributes
cpu.util.{min,max}.effective.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>

---
Changes in v7:
 Others:
 - ensure clamp values are not tunable at root cgroup level
---
 Documentation/admin-guide/cgroup-v2.rst |  19 ++++
 kernel/sched/core.c                     | 118 +++++++++++++++++++++++-
 2 files changed, 133 insertions(+), 4 deletions(-)