
[v9,05/10] sched: make scale_rt invariant with frequency

Message ID 20141121123559.GF23177@e105550-lin.cambridge.arm.com (mailing list archive)
State New, archived

Commit Message

Morten Rasmussen Nov. 21, 2014, 12:35 p.m. UTC
On Mon, Nov 03, 2014 at 04:54:42PM +0000, Vincent Guittot wrote:
> The average running time of RT tasks is used to estimate the remaining compute
> capacity for CFS tasks. This remaining capacity is the original capacity scaled
> down by a factor (aka scale_rt_capacity). This estimation of available capacity
> must also be invariant with frequency scaling.
> 
> A frequency scaling factor is applied on the running time of the RT tasks for
> computing scale_rt_capacity.
> 
> In sched_rt_avg_update, we scale the RT execution time like below:
> rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT
> 
> Then, scale_rt_capacity can be summarized by:
> scale_rt_capacity = SCHED_CAPACITY_SCALE -
> 		((rq->rt_avg << SCHED_CAPACITY_SHIFT) / period)
> 
> We can optimize by removing the right and left shifts in the computation
> of rq->rt_avg and scale_rt_capacity.
> 
> The call to arch_scale_freq_capacity in the rt scheduling path might be
> a concern for RT folks because I'm not sure whether we can rely on
> arch_scale_freq_capacity to be short and efficient?

It better be fast :) It is used in critical paths. However, if you
really care about latency you probably don't want frequency scaling to
mess around. If the architecture provides a fast-path for
arch_scale_freq_capacity() returning SCHED_CAPACITY_SCALE when frequency
scaling is disabled, the overhead should be minimal. If the architecture
doesn't provide arch_scale_freq_capacity() it becomes a constant
multiplication and should hopefully go away completely.
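
A sketch of what such a fast-path could look like (hypothetical arch
code; the 'cpufreq_scaling_active' flag and 'freq_scale' per-cpu
variable are made up for illustration, not from this series):

unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
{
	/*
	 * Hypothetical fast-path: return the constant when frequency
	 * scaling is off, so the multiply and shift in the caller are
	 * effectively no-ops.
	 */
	if (!cpufreq_scaling_active)
		return SCHED_CAPACITY_SCALE;

	return per_cpu(freq_scale, cpu);
}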

> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c  | 17 +++++------------
>  kernel/sched/sched.h |  4 +++-
>  2 files changed, 8 insertions(+), 13 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a5039da..b37c27b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5785,7 +5785,7 @@ unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
>  static unsigned long scale_rt_capacity(int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
> -	u64 total, available, age_stamp, avg;
> +	u64 total, used, age_stamp, avg;
>  	s64 delta;
>  
>  	/*
> @@ -5801,19 +5801,12 @@ static unsigned long scale_rt_capacity(int cpu)
>  
>  	total = sched_avg_period() + delta;
>  
> -	if (unlikely(total < avg)) {
> -		/* Ensures that capacity won't end up being negative */
> -		available = 0;
> -	} else {
> -		available = total - avg;
> -	}
> +	used = div_u64(avg, total);

I haven't looked through all the details of the rt avg tracking, but if
'used' is in the range [0..SCHED_CAPACITY_SCALE], I believe it should
work. Is it guaranteed that total > 0 so we don't get division by zero?
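
(If I read patch 4 right, the range part should be fine: the shifts are
gone from the accumulation, so rt_avg carries an implicit
SCHED_CAPACITY_SHIFT factor. Sketch of my understanding, not the literal
hunk:

/* accumulation with the shift removed; the scaling factor is
 * arch_scale_freq_capacity() <= SCHED_CAPACITY_SCALE */
rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));

/* hence avg <= total * SCHED_CAPACITY_SCALE, and */
used = div_u64(avg, total);	/* lands in [0..SCHED_CAPACITY_SCALE] */

The division-by-zero question still stands.)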

It does get slightly more complicated if we want to figure out the
available capacity at the current frequency (current < max) later. Say,
rt eats 25% of the compute capacity, but the current frequency is only
50%. In that case we get:

curr_avail_capacity = (arch_scale_cpu_capacity() *
  (arch_scale_freq_capacity() - (SCHED_CAPACITY_SCALE - scale_rt_capacity())))
  >> SCHED_CAPACITY_SHIFT

With numbers assuming arch_scale_cpu_capacity() = 800 and
scale_rt_capacity() = 768 (rt eating 25% of 1024):

curr_avail_capacity = 800 * (512 - (1024 - 768)) >> 10 = 200

Which isn't actually that bad. Anyway, it isn't needed until we start
involving energy models.

>  
> -	if (unlikely((s64)total < SCHED_CAPACITY_SCALE))
> -		total = SCHED_CAPACITY_SCALE;
> +	if (likely(used < SCHED_CAPACITY_SCALE))
> +		return SCHED_CAPACITY_SCALE - used;
>  
> -	total >>= SCHED_CAPACITY_SHIFT;
> -
> -	return div_u64(available, total);
> +	return 1;
>  }
>  
>  static void update_cpu_capacity(struct sched_domain *sd, int cpu)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index c34bd11..fc5b152 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1312,9 +1312,11 @@ static inline int hrtick_enabled(struct rq *rq)
>  
>  #ifdef CONFIG_SMP
>  extern void sched_avg_update(struct rq *rq);
> +extern unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu);

I'm not sure if it makes any difference, but shouldn't it be __weak
instead of extern?

unsigned long __weak arch_scale_freq_capacity(...)
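
For reference, my understanding of the usual pattern (sketch only, the
default body is illustrative): __weak goes on the default definition,
while the header can keep a plain extern declaration:

/* header (sched.h): an ordinary declaration is enough */
extern unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu);

/* core code: overridable default definition */
unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
{
	return SCHED_CAPACITY_SCALE;
}

Architectures providing their own strong definition override the weak
default at link time either way.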

Also, now that the function prototype is in the header file we can kill
the local prototype in fair.c introduced in patch 4 (see the patch at
the bottom of this message).


Comments

Vincent Guittot Nov. 24, 2014, 2:24 p.m. UTC | #1
On 21 November 2014 at 13:35, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Mon, Nov 03, 2014 at 04:54:42PM +0000, Vincent Guittot wrote:

[snip]

>> The average running time of RT tasks is used to estimate the remaining compute
>> @@ -5801,19 +5801,12 @@ static unsigned long scale_rt_capacity(int cpu)
>>
>>       total = sched_avg_period() + delta;
>>
>> -     if (unlikely(total < avg)) {
>> -             /* Ensures that capacity won't end up being negative */
>> -             available = 0;
>> -     } else {
>> -             available = total - avg;
>> -     }
>> +     used = div_u64(avg, total);
>
> I haven't looked through all the details of the rt avg tracking, but if
> 'used' is in the range [0..SCHED_CAPACITY_SCALE], I believe it should
> work. Is it guaranteed that total > 0 so we don't get division by zero?

static inline u64 sched_avg_period(void)
{
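	/*
	 * sysctl_sched_time_avg defaults to MSEC_PER_SEC (1000 ms), so by
	 * default this is a constant 500 ms worth of ns; total above is
	 * therefore never zero.
	 */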
	return (u64)sysctl_sched_time_avg * NSEC_PER_MSEC / 2;
}

>
> It does get slightly more complicated if we want to figure out the
> available capacity at the current frequency (current < max) later. Say,
> rt eats 25% of the compute capacity, but the current frequency is only
> 50%. In that case we get:
>
> curr_avail_capacity = (arch_scale_cpu_capacity() *
>   (arch_scale_freq_capacity() - (SCHED_CAPACITY_SCALE - scale_rt_capacity())))
>   >> SCHED_CAPACITY_SHIFT

You don't have to make it that complicated; you simply need to do:
curr_avail_capacity for CFS = (capacity_of(CPU) *
arch_scale_freq_capacity())  >> SCHED_CAPACITY_SHIFT

capacity_of(CPU) = 600 is the max available capacity for CFS tasks
once we have removed the 25% of capacity that is used by RT tasks
arch_scale_freq_capacity = 512 because we currently run at 50% of max freq

so curr_avail_capacity for CFS = 300

Vincent
>
> With numbers assuming arch_scale_cpu_capacity() = 800 and
> scale_rt_capacity() = 768 (rt eating 25% of 1024):
>
> curr_avail_capacity = 800 * (512 - (1024 - 768)) >> 10 = 200
>
> Which isn't actually that bad. Anyway, it isn't needed until we start
> involving energy models.
>
>>
>> -     if (unlikely((s64)total < SCHED_CAPACITY_SCALE))
>> -             total = SCHED_CAPACITY_SCALE;
>> +     if (likely(used < SCHED_CAPACITY_SCALE))
>> +             return SCHED_CAPACITY_SCALE - used;
>>
>> -     total >>= SCHED_CAPACITY_SHIFT;
>> -
>> -     return div_u64(available, total);
>> +     return 1;
>>  }
>>
Morten Rasmussen Nov. 24, 2014, 5:05 p.m. UTC | #2
On Mon, Nov 24, 2014 at 02:24:00PM +0000, Vincent Guittot wrote:
> On 21 November 2014 at 13:35, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Mon, Nov 03, 2014 at 04:54:42PM +0000, Vincent Guittot wrote:
> 
> [snip]
> 
> >> The average running time of RT tasks is used to estimate the remaining compute
> >> @@ -5801,19 +5801,12 @@ static unsigned long scale_rt_capacity(int cpu)
> >>
> >>       total = sched_avg_period() + delta;
> >>
> >> -     if (unlikely(total < avg)) {
> >> -             /* Ensures that capacity won't end up being negative */
> >> -             available = 0;
> >> -     } else {
> >> -             available = total - avg;
> >> -     }
> >> +     used = div_u64(avg, total);
> >
> > I haven't looked through all the details of the rt avg tracking, but if
> > 'used' is in the range [0..SCHED_CAPACITY_SCALE], I believe it should
> > work. Is it guaranteed that total > 0 so we don't get division by zero?
> 
> static inline u64 sched_avg_period(void)
> {
> return (u64)sysctl_sched_time_avg * NSEC_PER_MSEC / 2;
> }
>

I see.

> >
> > It does get slightly more complicated if we want to figure out the
> > available capacity at the current frequency (current < max) later. Say,
> > rt eats 25% of the compute capacity, but the current frequency is only
> > 50%. In that case we get:
> >
> > curr_avail_capacity = (arch_scale_cpu_capacity() *
> >   (arch_scale_freq_capacity() - (SCHED_CAPACITY_SCALE - scale_rt_capacity())))
> >   >> SCHED_CAPACITY_SHIFT
> 
> You don't have to make it that complicated; you simply need to do:
> curr_avail_capacity for CFS = (capacity_of(CPU) *
> arch_scale_freq_capacity())  >> SCHED_CAPACITY_SHIFT
> 
> capacity_of(CPU) = 600 is the max available capacity for CFS tasks
> once we have removed the 25% of capacity that is used by RT tasks
> arch_scale_freq_capacity = 512 because we currently run at 50% of max freq
> 
> so curr_avail_capacity for CFS = 300

I don't think that is correct. It is at least not what I had in mind.

capacity_orig_of(cpu) = 800, we run at 50% frequency which means:

curr_capacity = capacity_orig_of(cpu) * arch_scale_freq_capacity()
                  >> SCHED_CAPACITY_SHIFT
              = 400

So the total capacity at the current frequency (50%) is 400, without
considering RT. scale_rt_capacity() is frequency invariant, so it takes
away capacity_orig_of(cpu) - capacity_of(cpu) = 200 worth of capacity
for RT.  We need to subtract that from the current capacity to get the
available capacity at the current frequency.

curr_available_capacity = curr_capacity - (capacity_orig_of(cpu) -
capacity_of(cpu)) = 200

In other words, 800 is the max capacity, we are currently running at 50%
frequency, which gives us 400. RT takes away 25% of 800
(frequency-invariant) from the 400, which leaves us with 200 left for
CFS tasks at the current frequency.

In your calculations you subtract the RT load before computing the
current capacity using arch_scale_freq_capacity(), where I think it
should be done after. You find the amount of spare capacity you would have
at the maximum frequency when RT has been subtracted and then scale the
result by frequency which means indirectly scaling the RT load
contribution again (the rt avg has already been scaled). So instead of
taking away 200 of the 400 (current capacity @ 50% frequency), it only
takes away 100 which isn't right.

scale_rt_capacity() is frequency-invariant, so if the RT load is 50% and
the frequency is 50%, there are no spare cycles left.
curr_avail_capacity should be 0. But using your expression above you
would get capacity_of(cpu) = 400 after removing RT,
arch_scale_freq_capacity = 512 and you get 200. I don't think that is
right.
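
In code, something like this untested sketch (reusing the existing
capacity_of()/capacity_orig_of() accessors; the numbers from the example
above in the comments):

static unsigned long curr_avail_capacity(int cpu)
{
	unsigned long max_cap = capacity_orig_of(cpu);		/* 800 */
	unsigned long rt_used = max_cap - capacity_of(cpu);	/* 200 */
	unsigned long curr_cap = (max_cap *
		arch_scale_freq_capacity(NULL, cpu))
		>> SCHED_CAPACITY_SHIFT;			/* 400 */

	/*
	 * Subtract the frequency-invariant RT share _after_ scaling by
	 * frequency; clamp so that 50% rt at 50% frequency gives 0.
	 */
	return curr_cap > rt_used ? curr_cap - rt_used : 0;	/* 200 */
}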

Morten
Vincent Guittot Nov. 25, 2014, 1:48 p.m. UTC | #3
On 24 November 2014 at 18:05, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Mon, Nov 24, 2014 at 02:24:00PM +0000, Vincent Guittot wrote:
>> On 21 November 2014 at 13:35, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> > On Mon, Nov 03, 2014 at 04:54:42PM +0000, Vincent Guittot wrote:
>>
>> [snip]
>>
>> >> The average running time of RT tasks is used to estimate the remaining compute
>> >> @@ -5801,19 +5801,12 @@ static unsigned long scale_rt_capacity(int cpu)
>> >>
>> >>       total = sched_avg_period() + delta;
>> >>
>> >> -     if (unlikely(total < avg)) {
>> >> -             /* Ensures that capacity won't end up being negative */
>> >> -             available = 0;
>> >> -     } else {
>> >> -             available = total - avg;
>> >> -     }
>> >> +     used = div_u64(avg, total);
>> >
>> > I haven't looked through all the details of the rt avg tracking, but if
>> > 'used' is in the range [0..SCHED_CAPACITY_SCALE], I believe it should
>> > work. Is it guaranteed that total > 0 so we don't get division by zero?
>>
>> static inline u64 sched_avg_period(void)
>> {
>> return (u64)sysctl_sched_time_avg * NSEC_PER_MSEC / 2;
>> }
>>
>
> I see.
>
>> >
>> > It does get slightly more complicated if we want to figure out the
>> > available capacity at the current frequency (current < max) later. Say,
>> > rt eats 25% of the compute capacity, but the current frequency is only
>> > 50%. In that case we get:
>> >
>> > curr_avail_capacity = (arch_scale_cpu_capacity() *
>> >   (arch_scale_freq_capacity() - (SCHED_CAPACITY_SCALE - scale_rt_capacity())))
>> >   >> SCHED_CAPACITY_SHIFT
>>
>> You don't have to make it that complicated; you simply need to do:
>> curr_avail_capacity for CFS = (capacity_of(CPU) *
>> arch_scale_freq_capacity())  >> SCHED_CAPACITY_SHIFT
>>
>> capacity_of(CPU) = 600 is the max available capacity for CFS tasks
>> once we have removed the 25% of capacity that is used by RT tasks
>> arch_scale_freq_capacity = 512 because we currently run at 50% of max freq
>>
>> so curr_avail_capacity for CFS = 300
>
> I don't think that is correct. It is at least not what I had in mind.
>
> capacity_orig_of(cpu) = 800, we run at 50% frequency which means:
>
> curr_capacity = capacity_orig_of(cpu) * arch_scale_freq_capacity()
>                   >> SCHED_CAPACITY_SHIFT
>               = 400
>
> So the total capacity at the current frequency (50%) is 400, without
> considering RT. scale_rt_capacity() is frequency invariant, so it takes
> away capacity_orig_of(cpu) - capacity_of(cpu) = 200 worth of capacity
> for RT.  We need to subtract that from the current capacity to get the
> available capacity at the current frequency.
>
> curr_available_capacity = curr_capacity - (capacity_orig_of(cpu) -
> capacity_of(cpu)) = 200

you're right, this one looks good to me too

>
> In other words, 800 is the max capacity, we are currently running at 50%
> frequency, which gives us 400. RT takes away 25% of 800
> (frequency-invariant) from the 400, which leaves us with 200 left for
> CFS tasks at the current frequency.
>
> In your calculations you subtract the RT load before computing the
> current capacity using arch_scale_freq_capacity(), where I think it
> should be done after. You find the amount of spare capacity you would have
> at the maximum frequency when RT has been subtracted and then scale the
> result by frequency which means indirectly scaling the RT load
> contribution again (the rt avg has already been scaled). So instead of
> taking away 200 of the 400 (current capacity @ 50% frequency), it only
> takes away 100 which isn't right.
>
> scale_rt_capacity() is frequency-invariant, so if the RT load is 50% and
> the frequency is 50%, there are no spare cycles left.
> curr_avail_capacity should be 0. But using your expression above you
> would get capacity_of(cpu) = 400 after removing RT,
> arch_scale_freq_capacity = 512 and you get 200. I don't think that is
> right.
>
> Morten
Morten Rasmussen Nov. 26, 2014, 11:57 a.m. UTC | #4
On Tue, Nov 25, 2014 at 01:48:02PM +0000, Vincent Guittot wrote:
> On 24 November 2014 at 18:05, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Mon, Nov 24, 2014 at 02:24:00PM +0000, Vincent Guittot wrote:
> >> On 21 November 2014 at 13:35, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >> > On Mon, Nov 03, 2014 at 04:54:42PM +0000, Vincent Guittot wrote:
> >>
> >> [snip]
> >>
> >> >> The average running time of RT tasks is used to estimate the remaining compute
> >> >> @@ -5801,19 +5801,12 @@ static unsigned long scale_rt_capacity(int cpu)
> >> >>
> >> >>       total = sched_avg_period() + delta;
> >> >>
> >> >> -     if (unlikely(total < avg)) {
> >> >> -             /* Ensures that capacity won't end up being negative */
> >> >> -             available = 0;
> >> >> -     } else {
> >> >> -             available = total - avg;
> >> >> -     }
> >> >> +     used = div_u64(avg, total);
> >> >
> >> > I haven't looked through all the details of the rt avg tracking, but if
> >> > 'used' is in the range [0..SCHED_CAPACITY_SCALE], I believe it should
> >> > work. Is it guaranteed that total > 0 so we don't get division by zero?
> >>
> >> static inline u64 sched_avg_period(void)
> >> {
> >> return (u64)sysctl_sched_time_avg * NSEC_PER_MSEC / 2;
> >> }
> >>
> >
> > I see.
> >
> >> >
> >> > It does get slightly more complicated if we want to figure out the
> >> > available capacity at the current frequency (current < max) later. Say,
> >> > rt eats 25% of the compute capacity, but the current frequency is only
> >> > 50%. In that case we get:
> >> >
> >> > curr_avail_capacity = (arch_scale_cpu_capacity() *
> >> >   (arch_scale_freq_capacity() - (SCHED_CAPACITY_SCALE - scale_rt_capacity())))
> >> >   >> SCHED_CAPACITY_SHIFT
> >>
> >> You don't have to make it that complicated; you simply need to do:
> >> curr_avail_capacity for CFS = (capacity_of(CPU) *
> >> arch_scale_freq_capacity())  >> SCHED_CAPACITY_SHIFT
> >>
> >> capacity_of(CPU) = 600 is the max available capacity for CFS tasks
> >> once we have removed the 25% of capacity that is used by RT tasks
> >> arch_scale_freq_capacity = 512 because we currently run at 50% of max freq
> >>
> >> so curr_avail_capacity for CFS = 300
> >
> > I don't think that is correct. It is at least not what I had in mind.
> >
> > capacity_orig_of(cpu) = 800, we run at 50% frequency which means:
> >
> > curr_capacity = capacity_orig_of(cpu) * arch_scale_freq_capacity()
> >                   >> SCHED_CAPACITY_SHIFT
> >               = 400
> >
> > So the total capacity at the current frequency (50%) is 400, without
> > considering RT. scale_rt_capacity() is frequency invariant, so it takes
> > away capacity_orig_of(cpu) - capacity_of(cpu) = 200 worth of capacity
> > for RT.  We need to subtract that from the current capacity to get the
> > available capacity at the current frequency.
> >
> > curr_available_capacity = curr_capacity - (capacity_orig_of(cpu) -
> > capacity_of(cpu)) = 200
> 
> you're right, this one looks good to me too

Okay, thanks for confirming. It doesn't affect this patch set anyway; I just
wanted to be sure that I got all the scaling factors right :)

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6fd5ac6..921b174 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2277,8 +2277,6 @@  static u32 __compute_runnable_contrib(u64 n)
        return contrib + runnable_avg_yN_sum[n];
 }
 
-unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
-
 /*
 * We can represent the historical contribution to runnable average as the
 * coefficients of a geometric series.  To do this we sub-divide our runnable