
[v2] schedutil: Allow cpufreq requests to be made even when kthread kicked

Message ID 20180518185501.173552-1-joel@joelfernandes.org (mailing list archive)
State Not Applicable, archived
Delegated to: Rafael Wysocki

Commit Message

Joel Fernandes May 18, 2018, 6:55 p.m. UTC
From: "Joel Fernandes (Google)" <joel@joelfernandes.org>

Currently there is a chance that a schedutil cpufreq update request
will be dropped if another update request is already pending. The
pending request can be delayed if there is a scheduling delay of the
irq_work and the wakeup of the schedutil governor kthread.

A particularly bad scenario is when a schedutil request has just been
made, for example to reduce the CPU frequency, and a newer request to
increase the CPU frequency (even an urgent sched-deadline frequency
increase request) is then dropped, even though the rate limits suggest
that it is OK to process a request. This is because of the way the
work_in_progress flag is used.

This patch improves the situation by allowing new requests to be made
even though the old one is still being processed. Note that with this
approach, if an irq_work has already been issued, we just update
next_freq and don't queue another request, so there is no extra work
being done to make this happen.

I had brought up this issue at the OSPM conference, and Claudio had
posted a discussion RFC with an alternate approach [1]. I prefer the
approach in the patch below since it doesn't need any new flags and
doesn't cause any other extra overhead.

[1] https://patchwork.kernel.org/patch/10384261/

LGTMed-by: Viresh Kumar <viresh.kumar@linaro.org>
LGTMed-by: Juri Lelli <juri.lelli@redhat.com>
CC: Viresh Kumar <viresh.kumar@linaro.org>
CC: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Patrick Bellasi <patrick.bellasi@arm.com>
CC: Juri Lelli <juri.lelli@redhat.com>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
CC: Joel Fernandes <joelaf@google.com>
CC: Todd Kjos <tkjos@google.com>
CC: claudio@evidence.eu.com
CC: kernel-team@android.com
CC: linux-pm@vger.kernel.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
v1 -> v2: Minor style related changes.

 kernel/sched/cpufreq_schedutil.c | 34 ++++++++++++++++++++++++--------
 1 file changed, 26 insertions(+), 8 deletions(-)

Comments

Saravana Kannan May 18, 2018, 9:13 p.m. UTC | #1
On 05/18/2018 11:55 AM, Joel Fernandes (Google.) wrote:
> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
>
> Currently there is a chance of a schedutil cpufreq update request to be
> dropped if there is a pending update request. This pending request can
> be delayed if there is a scheduling delay of the irq_work and the wake
> up of the schedutil governor kthread.
>
> A very bad scenario is when a schedutil request was already just made,
> such as to reduce the CPU frequency, then a newer request to increase
> CPU frequency (even sched deadline urgent frequency increase requests)
> can be dropped, even though the rate limits suggest that its Ok to
> process a request. This is because of the way the work_in_progress flag
> is used.
>
> This patch improves the situation by allowing new requests to happen
> even though the old one is still being processed. Note that in this
> approach, if an irq_work was already issued, we just update next_freq
> and don't bother to queue another request so there's no extra work being
> done to make this happen.
>
> I had brought up this issue at the OSPM conference and Claudio had a
> discussion RFC with an alternate approach [1]. I prefer the approach as
> done in the patch below since it doesn't need any new flags and doesn't
> cause any other extra overhead.
>
> [1] https://patchwork.kernel.org/patch/10384261/
>
> LGTMed-by: Viresh Kumar <viresh.kumar@linaro.org>
> LGTMed-by: Juri Lelli <juri.lelli@redhat.com>
> CC: Viresh Kumar <viresh.kumar@linaro.org>
> CC: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> CC: Peter Zijlstra <peterz@infradead.org>
> CC: Ingo Molnar <mingo@redhat.com>
> CC: Patrick Bellasi <patrick.bellasi@arm.com>
> CC: Juri Lelli <juri.lelli@redhat.com>
> Cc: Luca Abeni <luca.abeni@santannapisa.it>
> CC: Joel Fernandes <joelaf@google.com>
> CC: Todd Kjos <tkjos@google.com>
> CC: claudio@evidence.eu.com
> CC: kernel-team@android.com
> CC: linux-pm@vger.kernel.org
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
> v1 -> v2: Minor style related changes.
>
>   kernel/sched/cpufreq_schedutil.c | 34 ++++++++++++++++++++++++--------
>   1 file changed, 26 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index e13df951aca7..5c482ec38610 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -92,9 +92,6 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
>   	    !cpufreq_can_do_remote_dvfs(sg_policy->policy))
>   		return false;
>
> -	if (sg_policy->work_in_progress)
> -		return false;
> -
>   	if (unlikely(sg_policy->need_freq_update)) {
>   		sg_policy->need_freq_update = false;
>   		/*
> @@ -128,7 +125,7 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
>
>   		policy->cur = next_freq;
>   		trace_cpu_frequency(next_freq, smp_processor_id());
> -	} else {
> +	} else if (!sg_policy->work_in_progress) {

Not really something you added, but if you are modifying it:
Do we really need this work_in_progress flag? irq_work_queue() already 
checks if the work is pending and then returns true/false.

Wouldn't the issue you are trying to fix be resolved just by dropping 
this flag check entirely?
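
(For reference, irq_work_queue() returns false when the work is already
pending, so a minimal sketch of that suggestion could look like the
snippet below; this is only an illustration of the idea, not part of
the posted patch.)

---8<---
	} else {
		/*
		 * Sketch of the suggestion above (illustration only): rely
		 * on irq_work_queue()'s own "already pending" check.  If it
		 * returns false, the work was already queued and the kthread
		 * will pick up the freshly written sg_policy->next_freq.
		 */
		irq_work_queue(&sg_policy->irq_work);
	}
---8<---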

>   		sg_policy->work_in_progress = true;
>   		irq_work_queue(&sg_policy->irq_work);
>   	}
> @@ -291,6 +288,13 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
>
>   	ignore_dl_rate_limit(sg_cpu, sg_policy);
>
> +	/*
> +	 * For slow-switch systems, single policy requests can't run at the
> +	 * moment if update is in progress, unless we acquire update_lock.
> +	 */
> +	if (sg_policy->work_in_progress)
> +		return;
> +
>   	if (!sugov_should_update_freq(sg_policy, time))
>   		return;
>
> @@ -382,13 +386,27 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
>   static void sugov_work(struct kthread_work *work)
>   {
>   	struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
> +	unsigned int freq;
> +	unsigned long flags;
> +
> +	/*
> +	 * Hold sg_policy->update_lock shortly to handle the case where:
> +	 * incase sg_policy->next_freq is read here, and then updated by
> +	 * sugov_update_shared just before work_in_progress is set to false
> +	 * here, we may miss queueing the new update.
> +	 *
> +	 * Note: If a work was queued after the update_lock is released,
> +	 * sugov_work will just be called again by kthread_work code; and the
> +	 * request will be proceed before the sugov thread sleeps.
> +	 */
> +	raw_spin_lock_irqsave(&sg_policy->update_lock, flags);
> +	freq = sg_policy->next_freq;
> +	sg_policy->work_in_progress = false;
> +	raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);
>
>   	mutex_lock(&sg_policy->work_lock);
> -	__cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
> -				CPUFREQ_RELATION_L);
> +	__cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L);
>   	mutex_unlock(&sg_policy->work_lock);
> -
> -	sg_policy->work_in_progress = false;
>   }
>
>   static void sugov_irq_work(struct irq_work *irq_work)
>

-Saravana
Rafael J. Wysocki May 18, 2018, 9:17 p.m. UTC | #2
On Fri, May 18, 2018 at 11:13 PM, Saravana Kannan
<skannan@codeaurora.org> wrote:
> On 05/18/2018 11:55 AM, Joel Fernandes (Google.) wrote:
>>
>> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
>>
>> Currently there is a chance of a schedutil cpufreq update request to be
>> dropped if there is a pending update request. This pending request can
>> be delayed if there is a scheduling delay of the irq_work and the wake
>> up of the schedutil governor kthread.
>>
>> A very bad scenario is when a schedutil request was already just made,
>> such as to reduce the CPU frequency, then a newer request to increase
>> CPU frequency (even sched deadline urgent frequency increase requests)
>> can be dropped, even though the rate limits suggest that its Ok to
>> process a request. This is because of the way the work_in_progress flag
>> is used.
>>
>> This patch improves the situation by allowing new requests to happen
>> even though the old one is still being processed. Note that in this
>> approach, if an irq_work was already issued, we just update next_freq
>> and don't bother to queue another request so there's no extra work being
>> done to make this happen.
>>
>> I had brought up this issue at the OSPM conference and Claudio had a
>> discussion RFC with an alternate approach [1]. I prefer the approach as
>> done in the patch below since it doesn't need any new flags and doesn't
>> cause any other extra overhead.
>>
>> [1] https://patchwork.kernel.org/patch/10384261/
>>
>> LGTMed-by: Viresh Kumar <viresh.kumar@linaro.org>
>> LGTMed-by: Juri Lelli <juri.lelli@redhat.com>
>> CC: Viresh Kumar <viresh.kumar@linaro.org>
>> CC: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> CC: Peter Zijlstra <peterz@infradead.org>
>> CC: Ingo Molnar <mingo@redhat.com>
>> CC: Patrick Bellasi <patrick.bellasi@arm.com>
>> CC: Juri Lelli <juri.lelli@redhat.com>
>> Cc: Luca Abeni <luca.abeni@santannapisa.it>
>> CC: Joel Fernandes <joelaf@google.com>
>> CC: Todd Kjos <tkjos@google.com>
>> CC: claudio@evidence.eu.com
>> CC: kernel-team@android.com
>> CC: linux-pm@vger.kernel.org
>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>> ---
>> v1 -> v2: Minor style related changes.
>>
>>   kernel/sched/cpufreq_schedutil.c | 34 ++++++++++++++++++++++++--------
>>   1 file changed, 26 insertions(+), 8 deletions(-)
>>
>> diff --git a/kernel/sched/cpufreq_schedutil.c
>> b/kernel/sched/cpufreq_schedutil.c
>> index e13df951aca7..5c482ec38610 100644
>> --- a/kernel/sched/cpufreq_schedutil.c
>> +++ b/kernel/sched/cpufreq_schedutil.c
>> @@ -92,9 +92,6 @@ static bool sugov_should_update_freq(struct sugov_policy
>> *sg_policy, u64 time)
>>             !cpufreq_can_do_remote_dvfs(sg_policy->policy))
>>                 return false;
>>
>> -       if (sg_policy->work_in_progress)
>> -               return false;
>> -
>>         if (unlikely(sg_policy->need_freq_update)) {
>>                 sg_policy->need_freq_update = false;
>>                 /*
>> @@ -128,7 +125,7 @@ static void sugov_update_commit(struct sugov_policy
>> *sg_policy, u64 time,
>>
>>                 policy->cur = next_freq;
>>                 trace_cpu_frequency(next_freq, smp_processor_id());
>> -       } else {
>> +       } else if (!sg_policy->work_in_progress) {
>
>
> Not really something you added, but if you are modifying it:
> Do we really need this work_in_progress flag? irq_work_queue() already
> checks if the work is pending and then returns true/false.
>
> Wouldn't the issue you are trying to fix be resolved just by dropping this
> flag check entirely?

You missed the entire discussion on that several days ago, sorry.

Thanks,
Rafael
Viresh Kumar May 21, 2018, 5:14 a.m. UTC | #3
On 18-05-18, 11:55, Joel Fernandes (Google.) wrote:
> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> 
> Currently there is a chance of a schedutil cpufreq update request to be
> dropped if there is a pending update request. This pending request can
> be delayed if there is a scheduling delay of the irq_work and the wake
> up of the schedutil governor kthread.
> 
> A very bad scenario is when a schedutil request was already just made,
> such as to reduce the CPU frequency, then a newer request to increase
> CPU frequency (even sched deadline urgent frequency increase requests)
> can be dropped, even though the rate limits suggest that its Ok to
> process a request. This is because of the way the work_in_progress flag
> is used.
> 
> This patch improves the situation by allowing new requests to happen
> even though the old one is still being processed. Note that in this
> approach, if an irq_work was already issued, we just update next_freq
> and don't bother to queue another request so there's no extra work being
> done to make this happen.

Now that this isn't an RFC anymore, you shouldn't have added the
paragraph below here. It could go in the comments section though.

> I had brought up this issue at the OSPM conference and Claudio had a
> discussion RFC with an alternate approach [1]. I prefer the approach as
> done in the patch below since it doesn't need any new flags and doesn't
> cause any other extra overhead.
> 
> [1] https://patchwork.kernel.org/patch/10384261/
> 
> LGTMed-by: Viresh Kumar <viresh.kumar@linaro.org>
> LGTMed-by: Juri Lelli <juri.lelli@redhat.com>

Looks like a Tag you just invented? :)

> CC: Viresh Kumar <viresh.kumar@linaro.org>
> CC: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> CC: Peter Zijlstra <peterz@infradead.org>
> CC: Ingo Molnar <mingo@redhat.com>
> CC: Patrick Bellasi <patrick.bellasi@arm.com>
> CC: Juri Lelli <juri.lelli@redhat.com>
> Cc: Luca Abeni <luca.abeni@santannapisa.it>
> CC: Joel Fernandes <joelaf@google.com>
> CC: Todd Kjos <tkjos@google.com>
> CC: claudio@evidence.eu.com
> CC: kernel-team@android.com
> CC: linux-pm@vger.kernel.org
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
> v1 -> v2: Minor style related changes.
> 
>  kernel/sched/cpufreq_schedutil.c | 34 ++++++++++++++++++++++++--------
>  1 file changed, 26 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index e13df951aca7..5c482ec38610 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -92,9 +92,6 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
>  	    !cpufreq_can_do_remote_dvfs(sg_policy->policy))
>  		return false;
>  
> -	if (sg_policy->work_in_progress)
> -		return false;
> -
>  	if (unlikely(sg_policy->need_freq_update)) {
>  		sg_policy->need_freq_update = false;
>  		/*
> @@ -128,7 +125,7 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
>  
>  		policy->cur = next_freq;
>  		trace_cpu_frequency(next_freq, smp_processor_id());
> -	} else {
> +	} else if (!sg_policy->work_in_progress) {
>  		sg_policy->work_in_progress = true;
>  		irq_work_queue(&sg_policy->irq_work);
>  	}
> @@ -291,6 +288,13 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
>  
>  	ignore_dl_rate_limit(sg_cpu, sg_policy);
>  
> +	/*
> +	 * For slow-switch systems, single policy requests can't run at the
> +	 * moment if update is in progress, unless we acquire update_lock.
> +	 */
> +	if (sg_policy->work_in_progress)
> +		return;
> +

I would still want this to go away :)

@Rafael, will it be fine to get locking in place for unshared-policy
platforms?

>  	if (!sugov_should_update_freq(sg_policy, time))
>  		return;
>  
> @@ -382,13 +386,27 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
>  static void sugov_work(struct kthread_work *work)
>  {
>  	struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
> +	unsigned int freq;
> +	unsigned long flags;
> +
> +	/*
> +	 * Hold sg_policy->update_lock shortly to handle the case where:
> +	 * incase sg_policy->next_freq is read here, and then updated by
> +	 * sugov_update_shared just before work_in_progress is set to false
> +	 * here, we may miss queueing the new update.
> +	 *
> +	 * Note: If a work was queued after the update_lock is released,
> +	 * sugov_work will just be called again by kthread_work code; and the
> +	 * request will be proceed before the sugov thread sleeps.
> +	 */
> +	raw_spin_lock_irqsave(&sg_policy->update_lock, flags);
> +	freq = sg_policy->next_freq;
> +	sg_policy->work_in_progress = false;
> +	raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);
>  
>  	mutex_lock(&sg_policy->work_lock);
> -	__cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
> -				CPUFREQ_RELATION_L);
> +	__cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L);
>  	mutex_unlock(&sg_policy->work_lock);
> -
> -	sg_policy->work_in_progress = false;
>  }
>  
>  static void sugov_irq_work(struct irq_work *irq_work)

Fix the commit log and you can add my

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Rafael J. Wysocki May 21, 2018, 8:29 a.m. UTC | #4
On Mon, May 21, 2018 at 7:14 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> On 18-05-18, 11:55, Joel Fernandes (Google.) wrote:
>> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
>>
>> Currently there is a chance of a schedutil cpufreq update request to be
>> dropped if there is a pending update request. This pending request can
>> be delayed if there is a scheduling delay of the irq_work and the wake
>> up of the schedutil governor kthread.
>>
>> A very bad scenario is when a schedutil request was already just made,
>> such as to reduce the CPU frequency, then a newer request to increase
>> CPU frequency (even sched deadline urgent frequency increase requests)
>> can be dropped, even though the rate limits suggest that its Ok to
>> process a request. This is because of the way the work_in_progress flag
>> is used.
>>
>> This patch improves the situation by allowing new requests to happen
>> even though the old one is still being processed. Note that in this
>> approach, if an irq_work was already issued, we just update next_freq
>> and don't bother to queue another request so there's no extra work being
>> done to make this happen.
>
> Now that this isn't an RFC anymore, you shouldn't have added below
> paragraph here. It could go to the comments section though.
>
>> I had brought up this issue at the OSPM conference and Claudio had a
>> discussion RFC with an alternate approach [1]. I prefer the approach as
>> done in the patch below since it doesn't need any new flags and doesn't
>> cause any other extra overhead.
>>
>> [1] https://patchwork.kernel.org/patch/10384261/
>>
>> LGTMed-by: Viresh Kumar <viresh.kumar@linaro.org>
>> LGTMed-by: Juri Lelli <juri.lelli@redhat.com>
>
> Looks like a Tag you just invented ? :)

Yeah.

The LGTM from Juri can be converted into an ACK silently IMO.  That
said, I have added Looks-good-to: tags to a couple of commits. :-)

>> CC: Viresh Kumar <viresh.kumar@linaro.org>
>> CC: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> CC: Peter Zijlstra <peterz@infradead.org>
>> CC: Ingo Molnar <mingo@redhat.com>
>> CC: Patrick Bellasi <patrick.bellasi@arm.com>
>> CC: Juri Lelli <juri.lelli@redhat.com>
>> Cc: Luca Abeni <luca.abeni@santannapisa.it>
>> CC: Joel Fernandes <joelaf@google.com>
>> CC: Todd Kjos <tkjos@google.com>
>> CC: claudio@evidence.eu.com
>> CC: kernel-team@android.com
>> CC: linux-pm@vger.kernel.org
>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>> ---
>> v1 -> v2: Minor style related changes.
>>
>>  kernel/sched/cpufreq_schedutil.c | 34 ++++++++++++++++++++++++--------
>>  1 file changed, 26 insertions(+), 8 deletions(-)
>>
>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
>> index e13df951aca7..5c482ec38610 100644
>> --- a/kernel/sched/cpufreq_schedutil.c
>> +++ b/kernel/sched/cpufreq_schedutil.c
>> @@ -92,9 +92,6 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
>>           !cpufreq_can_do_remote_dvfs(sg_policy->policy))
>>               return false;
>>
>> -     if (sg_policy->work_in_progress)
>> -             return false;
>> -
>>       if (unlikely(sg_policy->need_freq_update)) {
>>               sg_policy->need_freq_update = false;
>>               /*
>> @@ -128,7 +125,7 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
>>
>>               policy->cur = next_freq;
>>               trace_cpu_frequency(next_freq, smp_processor_id());
>> -     } else {
>> +     } else if (!sg_policy->work_in_progress) {
>>               sg_policy->work_in_progress = true;
>>               irq_work_queue(&sg_policy->irq_work);
>>       }
>> @@ -291,6 +288,13 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
>>
>>       ignore_dl_rate_limit(sg_cpu, sg_policy);
>>
>> +     /*
>> +      * For slow-switch systems, single policy requests can't run at the
>> +      * moment if update is in progress, unless we acquire update_lock.
>> +      */
>> +     if (sg_policy->work_in_progress)
>> +             return;
>> +
>
> I would still want this to go away :)
>
> @Rafael, will it be fine to get locking in place for unshared policy
> platforms ?

As long as it doesn't affect the fast switch path in any way.

>
>>       if (!sugov_should_update_freq(sg_policy, time))
>>               return;
>>
>> @@ -382,13 +386,27 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
>>  static void sugov_work(struct kthread_work *work)
>>  {
>>       struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
>> +     unsigned int freq;
>> +     unsigned long flags;
>> +
>> +     /*
>> +      * Hold sg_policy->update_lock shortly to handle the case where:
>> +      * incase sg_policy->next_freq is read here, and then updated by
>> +      * sugov_update_shared just before work_in_progress is set to false
>> +      * here, we may miss queueing the new update.
>> +      *
>> +      * Note: If a work was queued after the update_lock is released,
>> +      * sugov_work will just be called again by kthread_work code; and the
>> +      * request will be proceed before the sugov thread sleeps.
>> +      */
>> +     raw_spin_lock_irqsave(&sg_policy->update_lock, flags);
>> +     freq = sg_policy->next_freq;
>> +     sg_policy->work_in_progress = false;
>> +     raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);
>>
>>       mutex_lock(&sg_policy->work_lock);
>> -     __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
>> -                             CPUFREQ_RELATION_L);
>> +     __cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L);
>>       mutex_unlock(&sg_policy->work_lock);
>> -
>> -     sg_policy->work_in_progress = false;
>>  }
>>
>>  static void sugov_irq_work(struct irq_work *irq_work)
>
> Fix the commit log and you can add my

I can fix it up.

> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

Thanks,
Rafael
Juri Lelli May 21, 2018, 9:57 a.m. UTC | #5
On 21/05/18 10:29, Rafael J. Wysocki wrote:
> On Mon, May 21, 2018 at 7:14 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> > On 18-05-18, 11:55, Joel Fernandes (Google.) wrote:
> >> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> >>
> >> Currently there is a chance of a schedutil cpufreq update request to be
> >> dropped if there is a pending update request. This pending request can
> >> be delayed if there is a scheduling delay of the irq_work and the wake
> >> up of the schedutil governor kthread.
> >>
> >> A very bad scenario is when a schedutil request was already just made,
> >> such as to reduce the CPU frequency, then a newer request to increase
> >> CPU frequency (even sched deadline urgent frequency increase requests)
> >> can be dropped, even though the rate limits suggest that its Ok to
> >> process a request. This is because of the way the work_in_progress flag
> >> is used.
> >>
> >> This patch improves the situation by allowing new requests to happen
> >> even though the old one is still being processed. Note that in this
> >> approach, if an irq_work was already issued, we just update next_freq
> >> and don't bother to queue another request so there's no extra work being
> >> done to make this happen.
> >
> > Now that this isn't an RFC anymore, you shouldn't have added below
> > paragraph here. It could go to the comments section though.
> >
> >> I had brought up this issue at the OSPM conference and Claudio had a
> >> discussion RFC with an alternate approach [1]. I prefer the approach as
> >> done in the patch below since it doesn't need any new flags and doesn't
> >> cause any other extra overhead.
> >>
> >> [1] https://patchwork.kernel.org/patch/10384261/
> >>
> >> LGTMed-by: Viresh Kumar <viresh.kumar@linaro.org>
> >> LGTMed-by: Juri Lelli <juri.lelli@redhat.com>
> >
> > Looks like a Tag you just invented ? :)
> 
> Yeah.
> 
> The LGTM from Juri can be converned into an ACK silently IMO.  That

Sure! :)

Thanks,

- Juri
Patrick Bellasi May 21, 2018, 10:50 a.m. UTC | #6
On 18-May 11:55, Joel Fernandes (Google.) wrote:
> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> 
> Currently there is a chance of a schedutil cpufreq update request to be
> dropped if there is a pending update request. This pending request can
> be delayed if there is a scheduling delay of the irq_work and the wake
> up of the schedutil governor kthread.
> 
> A very bad scenario is when a schedutil request was already just made,
> such as to reduce the CPU frequency, then a newer request to increase
> CPU frequency (even sched deadline urgent frequency increase requests)
> can be dropped, even though the rate limits suggest that its Ok to
> process a request. This is because of the way the work_in_progress flag
> is used.
> 
> This patch improves the situation by allowing new requests to happen
> even though the old one is still being processed. Note that in this
> approach, if an irq_work was already issued, we just update next_freq
> and don't bother to queue another request so there's no extra work being
> done to make this happen.

Maybe I'm missing something, but... isn't this patch just a partial
mitigation of the issue you describe above?

If a DL freq increase is queued, with this patch we store the request
but we don't actually increase the frequency until the next schedutil
update, which can be one tick away... isn't it?

If that's the case, maybe something like the following can complete
the cure?

---8<---
#define SUGOV_FREQ_NONE 0

static unsigned int sugov_work_update(struct sugov_policy *sg_policy,
				      unsigned int prev_freq)
{
	unsigned long irq_flags;
	bool update_freq = true;
	unsigned int next_freq;

	/*
	 * Hold sg_policy->update_lock briefly to handle the case where
	 * sg_policy->next_freq is read here and then updated by
	 * sugov_update_shared() just before work_in_progress is set to
	 * false here; without the lock we could miss queueing the new
	 * update.
	 *
	 * Note: if a work item is queued after the update_lock is
	 * released, sugov_work() will just be called again by the
	 * kthread_work code, and the request will be processed before
	 * the sugov thread sleeps.
	 */
	raw_spin_lock_irqsave(&sg_policy->update_lock, irq_flags);
	next_freq = sg_policy->next_freq;
	sg_policy->work_in_progress = false;
	if (prev_freq == next_freq)
		update_freq = false;
	raw_spin_unlock_irqrestore(&sg_policy->update_lock, irq_flags);

	/*
	 * Update the frequency only if it has changed since the last call.
	 */
	if (update_freq) {
		mutex_lock(&sg_policy->work_lock);
		__cpufreq_driver_target(sg_policy->policy, next_freq, CPUFREQ_RELATION_L);
		mutex_unlock(&sg_policy->work_lock);

		return next_freq;
	}

	return SUGOV_FREQ_NONE;
}

static void sugov_work(struct kthread_work *work)
{
	struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
	unsigned int prev_freq = 0;

	/*
	 * Keep updating the frequency until we end up with a frequency which
	 * satisfies the most recent request we got meanwhile.
	 */
	do {
		prev_freq = sugov_work_update(sg_policy, prev_freq);
	} while (prev_freq != SUGOV_FREQ_NONE);
}
---8<---
Joel Fernandes May 21, 2018, 3:49 p.m. UTC | #7
On Mon, May 21, 2018 at 11:50:55AM +0100, Patrick Bellasi wrote:
> On 18-May 11:55, Joel Fernandes (Google.) wrote:
> > From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> > 
> > Currently there is a chance of a schedutil cpufreq update request to be
> > dropped if there is a pending update request. This pending request can
> > be delayed if there is a scheduling delay of the irq_work and the wake
> > up of the schedutil governor kthread.
> > 
> > A very bad scenario is when a schedutil request was already just made,
> > such as to reduce the CPU frequency, then a newer request to increase
> > CPU frequency (even sched deadline urgent frequency increase requests)
> > can be dropped, even though the rate limits suggest that its Ok to
> > process a request. This is because of the way the work_in_progress flag
> > is used.
> > 
> > This patch improves the situation by allowing new requests to happen
> > even though the old one is still being processed. Note that in this
> > approach, if an irq_work was already issued, we just update next_freq
> > and don't bother to queue another request so there's no extra work being
> > done to make this happen.
> 
> Maybe I'm missing something but... is not this patch just a partial
> mitigation of the issue you descrive above?
> 
> If a DL freq increase is queued, with this patch we store the request
> but we don't actually increase the frequency until the next schedutil
> update, which can be one tick away... isn't it?
> 
> If that's the case, maybe something like the following can complete
> the cure?

We already discussed this and thought of this case; I think you missed
a previous thread [1]. The outer loop in the kthread_work subsystem
will take care of calling sugov_work again in case another request was
queued that we happened to miss. So I don't think more complexity is
needed to handle the case you're bringing up.
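
(The guarantee relied on here, roughly: the kthread worker dequeues and
runs one work item at a time and re-checks its queue after each
callback returns, so a work item that is re-queued while sugov_work is
running gets executed again. The sketch below is a simplified model,
not the actual kthread.c code; dequeue_one_work() and wait_for_work()
are hypothetical helpers used only for illustration.)

---8<---
	/* "worker" is the kthread_worker owning the sugov work item. */
	for (;;) {
		/* Hypothetical helper: pop one queued work item, or NULL. */
		struct kthread_work *work = dequeue_one_work(worker);

		if (!work) {
			/* Hypothetical helper: sleep until work is queued. */
			wait_for_work(worker);
			continue;
		}

		/*
		 * Runs e.g. sugov_work().  If the work is queued again while
		 * the callback executes, it is found on the next iteration,
		 * so a request racing with the callback is not lost.
		 */
		work->func(work);
	}
---8<---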

thanks!

 - Joel

[1] https://lkml.org/lkml/2018/5/17/668
Joel Fernandes May 21, 2018, 4:13 p.m. UTC | #8
On Mon, May 21, 2018 at 10:29:52AM +0200, Rafael J. Wysocki wrote:
> On Mon, May 21, 2018 at 7:14 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> > On 18-05-18, 11:55, Joel Fernandes (Google.) wrote:
> >> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> >>
> >> Currently there is a chance of a schedutil cpufreq update request to be
> >> dropped if there is a pending update request. This pending request can
> >> be delayed if there is a scheduling delay of the irq_work and the wake
> >> up of the schedutil governor kthread.
> >>
> >> A very bad scenario is when a schedutil request was already just made,
> >> such as to reduce the CPU frequency, then a newer request to increase
> >> CPU frequency (even sched deadline urgent frequency increase requests)
> >> can be dropped, even though the rate limits suggest that its Ok to
> >> process a request. This is because of the way the work_in_progress flag
> >> is used.
> >>
> >> This patch improves the situation by allowing new requests to happen
> >> even though the old one is still being processed. Note that in this
> >> approach, if an irq_work was already issued, we just update next_freq
> >> and don't bother to queue another request so there's no extra work being
> >> done to make this happen.
> >
> > Now that this isn't an RFC anymore, you shouldn't have added below
> > paragraph here. It could go to the comments section though.
> >
> >> I had brought up this issue at the OSPM conference and Claudio had a
> >> discussion RFC with an alternate approach [1]. I prefer the approach as
> >> done in the patch below since it doesn't need any new flags and doesn't
> >> cause any other extra overhead.
> >>
> >> [1] https://patchwork.kernel.org/patch/10384261/
> >>
> >> LGTMed-by: Viresh Kumar <viresh.kumar@linaro.org>
> >> LGTMed-by: Juri Lelli <juri.lelli@redhat.com>
> >
> > Looks like a Tag you just invented ? :)
> 
> Yeah.
> 
> The LGTM from Juri can be converned into an ACK silently IMO.  That
> said I have added Looks-good-to: tags to a couple of commits. :-)

Cool, I'll convert them to Acks :-)

[..]
> >> v1 -> v2: Minor style related changes.
> >>
> >>  kernel/sched/cpufreq_schedutil.c | 34 ++++++++++++++++++++++++--------
> >>  1 file changed, 26 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> >> index e13df951aca7..5c482ec38610 100644
> >> --- a/kernel/sched/cpufreq_schedutil.c
> >> +++ b/kernel/sched/cpufreq_schedutil.c
> >> @@ -92,9 +92,6 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> >>           !cpufreq_can_do_remote_dvfs(sg_policy->policy))
> >>               return false;
> >>
> >> -     if (sg_policy->work_in_progress)
> >> -             return false;
> >> -
> >>       if (unlikely(sg_policy->need_freq_update)) {
> >>               sg_policy->need_freq_update = false;
> >>               /*
> >> @@ -128,7 +125,7 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
> >>
> >>               policy->cur = next_freq;
> >>               trace_cpu_frequency(next_freq, smp_processor_id());
> >> -     } else {
> >> +     } else if (!sg_policy->work_in_progress) {
> >>               sg_policy->work_in_progress = true;
> >>               irq_work_queue(&sg_policy->irq_work);
> >>       }
> >> @@ -291,6 +288,13 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
> >>
> >>       ignore_dl_rate_limit(sg_cpu, sg_policy);
> >>
> >> +     /*
> >> +      * For slow-switch systems, single policy requests can't run at the
> >> +      * moment if update is in progress, unless we acquire update_lock.
> >> +      */
> >> +     if (sg_policy->work_in_progress)
> >> +             return;
> >> +
> >
> > I would still want this to go away :)
> >
> > @Rafael, will it be fine to get locking in place for unshared policy
> > platforms ?
> 
> As long as it doesn't affect the fast switch path in any way.

I just realized that for a single-policy switch that uses the governor
kthread, there will be one kthread per CPU. sugov_update_single() will
be called on the same CPU with interrupts disabled. In sugov_work(), we
are doing a raw_spin_lock_irqsave(), which also disables interrupts. So
I don't think there's any possibility of a race happening on the same
CPU between the frequency update request and the execution of
sugov_work(). In other words, I feel we can drop the above if (..)
statement for single policies completely and only keep the changes for
the shared policy. Viresh, since you brought up the single-policy issue
initially, which made me add this if statement, could you let me know
if you agree with what I just said?

> >>       if (!sugov_should_update_freq(sg_policy, time))
> >>               return;
> >>
> >> @@ -382,13 +386,27 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
> >>  static void sugov_work(struct kthread_work *work)
> >>  {
> >>       struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
> >> +     unsigned int freq;
> >> +     unsigned long flags;
> >> +
> >> +     /*
> >> +      * Hold sg_policy->update_lock shortly to handle the case where:
> >> +      * incase sg_policy->next_freq is read here, and then updated by
> >> +      * sugov_update_shared just before work_in_progress is set to false
> >> +      * here, we may miss queueing the new update.
> >> +      *
> >> +      * Note: If a work was queued after the update_lock is released,
> >> +      * sugov_work will just be called again by kthread_work code; and the
> >> +      * request will be proceed before the sugov thread sleeps.
> >> +      */
> >> +     raw_spin_lock_irqsave(&sg_policy->update_lock, flags);
> >> +     freq = sg_policy->next_freq;
> >> +     sg_policy->work_in_progress = false;
> >> +     raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);
> >>
> >>       mutex_lock(&sg_policy->work_lock);
> >> -     __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
> >> -                             CPUFREQ_RELATION_L);
> >> +     __cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L);
> >>       mutex_unlock(&sg_policy->work_lock);
> >> -
> >> -     sg_policy->work_in_progress = false;
> >>  }
> >>
> >>  static void sugov_irq_work(struct irq_work *irq_work)
> >
> > Fix the commit log and you can add my
> 
> I can fix it up.
> 
> > Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

Thanks!

 - Joel
Patrick Bellasi May 21, 2018, 5 p.m. UTC | #9
On 21-May 08:49, Joel Fernandes wrote:
> On Mon, May 21, 2018 at 11:50:55AM +0100, Patrick Bellasi wrote:
> > On 18-May 11:55, Joel Fernandes (Google.) wrote:
> > > From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> > > 
> > > Currently there is a chance of a schedutil cpufreq update request to be
> > > dropped if there is a pending update request. This pending request can
> > > be delayed if there is a scheduling delay of the irq_work and the wake
> > > up of the schedutil governor kthread.
> > > 
> > > A very bad scenario is when a schedutil request was already just made,
> > > such as to reduce the CPU frequency, then a newer request to increase
> > > CPU frequency (even sched deadline urgent frequency increase requests)
> > > can be dropped, even though the rate limits suggest that its Ok to
> > > process a request. This is because of the way the work_in_progress flag
> > > is used.
> > > 
> > > This patch improves the situation by allowing new requests to happen
> > > even though the old one is still being processed. Note that in this
> > > approach, if an irq_work was already issued, we just update next_freq
> > > and don't bother to queue another request so there's no extra work being
> > > done to make this happen.
> > 
> > Maybe I'm missing something but... is not this patch just a partial
> > mitigation of the issue you descrive above?
> > 
> > If a DL freq increase is queued, with this patch we store the request
> > but we don't actually increase the frequency until the next schedutil
> > update, which can be one tick away... isn't it?
> > 
> > If that's the case, maybe something like the following can complete
> > the cure?
> 
> We already discussed this and thought of this case, I think you missed a
> previous thread [1]. The outer loop in the kthread_work subsystem will take
> care of calling sugov_work again incase another request was queued which we
> happen to miss.

Ok, I missed that thread... my bad.

However, [1] made me notice that your solution works under the
assumption that we keep queuing a new kworker job for each request we
get, doesn't it?

If that's the case, this means that if, for example, during a
frequency switch you get a request to reduce the frequency (e.g. a
deadline task passing its 0-lag time) and right after that a request to
increase the frequency (e.g. the current FAIR task's tick), you will
enqueue a freq drop followed by a freq increase and actually do two
frequency hops?

> So I don't think more complexity is needed to handle the case
> you're bringing up.
>
> thanks!
> 
>  - Joel
> 
> [1] https://lkml.org/lkml/2018/5/17/668
>
Joel Fernandes May 21, 2018, 5:20 p.m. UTC | #10
Hi Patrick,

On Mon, May 21, 2018 at 06:00:50PM +0100, Patrick Bellasi wrote:
> On 21-May 08:49, Joel Fernandes wrote:
> > On Mon, May 21, 2018 at 11:50:55AM +0100, Patrick Bellasi wrote:
> > > On 18-May 11:55, Joel Fernandes (Google.) wrote:
> > > > From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> > > > 
> > > > Currently there is a chance of a schedutil cpufreq update request to be
> > > > dropped if there is a pending update request. This pending request can
> > > > be delayed if there is a scheduling delay of the irq_work and the wake
> > > > up of the schedutil governor kthread.
> > > > 
> > > > A very bad scenario is when a schedutil request was already just made,
> > > > such as to reduce the CPU frequency, then a newer request to increase
> > > > CPU frequency (even sched deadline urgent frequency increase requests)
> > > > can be dropped, even though the rate limits suggest that its Ok to
> > > > process a request. This is because of the way the work_in_progress flag
> > > > is used.
> > > > 
> > > > This patch improves the situation by allowing new requests to happen
> > > > even though the old one is still being processed. Note that in this
> > > > approach, if an irq_work was already issued, we just update next_freq
> > > > and don't bother to queue another request so there's no extra work being
> > > > done to make this happen.
> > > 
> > > Maybe I'm missing something but... is not this patch just a partial
> > > mitigation of the issue you descrive above?
> > > 
> > > If a DL freq increase is queued, with this patch we store the request
> > > but we don't actually increase the frequency until the next schedutil
> > > update, which can be one tick away... isn't it?
> > > 
> > > If that's the case, maybe something like the following can complete
> > > the cure?
> > 
> > We already discussed this and thought of this case, I think you missed a
> > previous thread [1]. The outer loop in the kthread_work subsystem will take
> > care of calling sugov_work again incase another request was queued which we
> > happen to miss.
> 
> Ok, I missed that thread... my bad.

Sure, no problem. Sorry, I was just pointing out the thread, not
blaming you for not reading it ;)

> However, [1] made me noticing that your solution works under the
> assumption that we keep queuing a new kworker job for each request we
> get, isn't it?

Not at each request, but at each request after work_in_progress has
been cleared by sugov_work. Any requests that happen between when
work_in_progress is set and when it is cleared only result in next_freq
being updated.
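
(In other words, the commit side behaves roughly like the sketch below
for the slow-switch case; this is a simplified view of the patch's
logic, not the literal patch text, assuming the caller holds
update_lock for shared policies.)

---8<---
	/* Always record the latest requested frequency. */
	sg_policy->next_freq = next_freq;

	if (!sg_policy->work_in_progress) {
		/* First request since the kthread last cleared the flag. */
		sg_policy->work_in_progress = true;
		irq_work_queue(&sg_policy->irq_work);
	}
	/*
	 * Otherwise the kthread has not yet snapshotted next_freq and
	 * cleared work_in_progress under the lock, so it will pick up
	 * the value written above.
	 */
---8<---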

> If that's the case, this means that if, for example, during a
> frequency switch you get a request to reduce the frequency (e.g.
> deadline task passing the 0-lag time) and right after a request to
> increase the frequency (e.g. the current FAIR task tick)... you will
> enqueue a freq drop followed by a freq increase and actually do two
> frequency hops?

Yes, possibly. I see your point, but I'm not sure the tight loop around
that is worth the complexity, or at least whether it is within the
scope of my patch. Perhaps the problem you describe can be looked at as
a future enhancement?

thanks,

 - Joel
Patrick Bellasi May 21, 2018, 5:41 p.m. UTC | #11
On 21-May 10:20, Joel Fernandes wrote:
> Hi Patrick,
> 
> On Mon, May 21, 2018 at 06:00:50PM +0100, Patrick Bellasi wrote:
> > On 21-May 08:49, Joel Fernandes wrote:
> > > On Mon, May 21, 2018 at 11:50:55AM +0100, Patrick Bellasi wrote:
> > > > On 18-May 11:55, Joel Fernandes (Google.) wrote:
> > > > > From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> > > > > 
> > > > > Currently there is a chance of a schedutil cpufreq update request to be
> > > > > dropped if there is a pending update request. This pending request can
> > > > > be delayed if there is a scheduling delay of the irq_work and the wake
> > > > > up of the schedutil governor kthread.
> > > > > 
> > > > > A very bad scenario is when a schedutil request was already just made,
> > > > > such as to reduce the CPU frequency, then a newer request to increase
> > > > > CPU frequency (even sched deadline urgent frequency increase requests)
> > > > > can be dropped, even though the rate limits suggest that its Ok to
> > > > > process a request. This is because of the way the work_in_progress flag
> > > > > is used.
> > > > > 
> > > > > This patch improves the situation by allowing new requests to happen
> > > > > even though the old one is still being processed. Note that in this
> > > > > approach, if an irq_work was already issued, we just update next_freq
> > > > > and don't bother to queue another request so there's no extra work being
> > > > > done to make this happen.
> > > > 
> > > > Maybe I'm missing something but... is not this patch just a partial
> > > > mitigation of the issue you descrive above?
> > > > 
> > > > If a DL freq increase is queued, with this patch we store the request
> > > > but we don't actually increase the frequency until the next schedutil
> > > > update, which can be one tick away... isn't it?
> > > > 
> > > > If that's the case, maybe something like the following can complete
> > > > the cure?
> > > 
> > > We already discussed this and thought of this case, I think you missed a
> > > previous thread [1]. The outer loop in the kthread_work subsystem will take
> > > care of calling sugov_work again incase another request was queued which we
> > > happen to miss.
> > 
> > Ok, I missed that thread... my bad.
> 
> Sure no problem, sorry I was just pointing out the thread, not blaming you
> for not reading it ;)

Sure, np here too ;)

> > However, [1] made me noticing that your solution works under the
> > assumption that we keep queuing a new kworker job for each request we
> > get, isn't it?
> 
> Not at each request, but each request after work_in_progress was cleared by the
> sugov_work. Any requests that happen between work_in_progress is set and
> cleared only result in updating of the next_freq.

I see, so we enqueue for the time of:

   mutex_lock(&sg_policy->work_lock);
   __cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L);
   mutex_unlock(&sg_policy->work_lock);

> > If that's the case, this means that if, for example, during a
> > frequency switch you get a request to reduce the frequency (e.g.
> > deadline task passing the 0-lag time) and right after a request to
> > increase the frequency (e.g. the current FAIR task tick)... you will
> > enqueue a freq drop followed by a freq increase and actually do two
> > frequency hops?
> 
> Yes possibly,

Not sure about the time window above; I can try to get some
measurements tomorrow.

> I see your point but I'm not sure if the tight loop around that
> is worth the complexity, or atleast is within the scope of my patch.
> Perhaps the problem you describe can be looked at as a future enhancement?

Sure, I already have it as a patch on top of yours. I can post it
afterwards and we can discuss whether it makes sense or not.

I still have to check more carefully, but I think we can maybe skip the
queueing altogether if some work is already pending... in case we want
to go for a dedicated inner loop like the one I was proposing.

Apart from that, I think your patch already fixes 90% of the issue we
have now.

> thanks,
> 
>  - Joel
Joel Fernandes May 21, 2018, 6:05 p.m. UTC | #12
On Mon, May 21, 2018 at 11:50:55AM +0100, Patrick Bellasi wrote:
> On 18-May 11:55, Joel Fernandes (Google.) wrote:
> > From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> > 
> > Currently there is a chance of a schedutil cpufreq update request to be
> > dropped if there is a pending update request. This pending request can
> > be delayed if there is a scheduling delay of the irq_work and the wake
> > up of the schedutil governor kthread.
> > 
> > A very bad scenario is when a schedutil request was already just made,
> > such as to reduce the CPU frequency, then a newer request to increase
> > CPU frequency (even sched deadline urgent frequency increase requests)
> > can be dropped, even though the rate limits suggest that its Ok to
> > process a request. This is because of the way the work_in_progress flag
> > is used.
> > 
> > This patch improves the situation by allowing new requests to happen
> > even though the old one is still being processed. Note that in this
> > approach, if an irq_work was already issued, we just update next_freq
> > and don't bother to queue another request so there's no extra work being
> > done to make this happen.
> 
> Maybe I'm missing something but... is not this patch just a partial
> mitigation of the issue you descrive above?
> 
> If a DL freq increase is queued, with this patch we store the request
> but we don't actually increase the frequency until the next schedutil
> update, which can be one tick away... isn't it?
> 
> If that's the case, maybe something like the following can complete
> the cure?
> 
> ---8<---
> #define SUGOV_FREQ_NONE 0
> 
> static unsigned int sugov_work_update(struct sugov_policy *sg_policy,
> 				      unsigned int prev_freq)
> {
> 	unsigned long irq_flags;
> 	bool update_freq = true;
> 	unsigned int next_freq;
> 
> 	/*
> 	 * Hold sg_policy->update_lock shortly to handle the case where:
> 	 * incase sg_policy->next_freq is read here, and then updated by
> 	 * sugov_update_shared just before work_in_progress is set to false
> 	 * here, we may miss queueing the new update.
> 	 *
> 	 * Note: If a work was queued after the update_lock is released,
> 	 * sugov_work will just be called again by kthread_work code; and the
> 	 * request will be proceed before the sugov thread sleeps.
> 	 */
> 	raw_spin_lock_irqsave(&sg_policy->update_lock, irq_flags);
> 	next_freq = sg_policy->next_freq;
> 	sg_policy->work_in_progress = false;
> 	if (prev_freq == next_freq)
> 		update_freq = false;

About this patch on top of mine: I believe this check is already being
done by sugov_update_commit():

static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
				unsigned int next_freq)
{
	struct cpufreq_policy *policy = sg_policy->policy;

	if (sg_policy->next_freq == next_freq)
		return;

	sg_policy->next_freq = next_freq;
	sg_policy->last_freq_update_time = time;
----

thanks,

 - Joel
Rafael J. Wysocki May 22, 2018, 10:02 a.m. UTC | #13
On Mon, May 21, 2018 at 6:13 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
> On Mon, May 21, 2018 at 10:29:52AM +0200, Rafael J. Wysocki wrote:
>> On Mon, May 21, 2018 at 7:14 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>> > On 18-05-18, 11:55, Joel Fernandes (Google.) wrote:
>> >> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
>> >>
>> >> Currently there is a chance of a schedutil cpufreq update request to be
>> >> dropped if there is a pending update request. This pending request can
>> >> be delayed if there is a scheduling delay of the irq_work and the wake
>> >> up of the schedutil governor kthread.
>> >>
>> >> A very bad scenario is when a schedutil request was already just made,
>> >> such as to reduce the CPU frequency, then a newer request to increase
>> >> CPU frequency (even sched deadline urgent frequency increase requests)
>> >> can be dropped, even though the rate limits suggest that its Ok to
>> >> process a request. This is because of the way the work_in_progress flag
>> >> is used.
>> >>
>> >> This patch improves the situation by allowing new requests to happen
>> >> even though the old one is still being processed. Note that in this
>> >> approach, if an irq_work was already issued, we just update next_freq
>> >> and don't bother to queue another request so there's no extra work being
>> >> done to make this happen.
>> >
>> > Now that this isn't an RFC anymore, you shouldn't have added below
>> > paragraph here. It could go to the comments section though.
>> >
>> >> I had brought up this issue at the OSPM conference and Claudio had a
>> >> discussion RFC with an alternate approach [1]. I prefer the approach as
>> >> done in the patch below since it doesn't need any new flags and doesn't
>> >> cause any other extra overhead.
>> >>
>> >> [1] https://patchwork.kernel.org/patch/10384261/
>> >>
>> >> LGTMed-by: Viresh Kumar <viresh.kumar@linaro.org>
>> >> LGTMed-by: Juri Lelli <juri.lelli@redhat.com>
>> >
>> > Looks like a Tag you just invented ? :)
>>
>> Yeah.
>>
>> The LGTM from Juri can be converned into an ACK silently IMO.  That
>> said I have added Looks-good-to: tags to a couple of commits. :-)
>
> Cool, I'll covert them to Acks :-)

So it looks like I should expect an update of this patch, right?

Or do you prefer the current one to be applied and work on top of it?

> [..]
>> >> v1 -> v2: Minor style related changes.
>> >>
>> >>  kernel/sched/cpufreq_schedutil.c | 34 ++++++++++++++++++++++++--------
>> >>  1 file changed, 26 insertions(+), 8 deletions(-)
>> >>
>> >> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
>> >> index e13df951aca7..5c482ec38610 100644
>> >> --- a/kernel/sched/cpufreq_schedutil.c
>> >> +++ b/kernel/sched/cpufreq_schedutil.c
>> >> @@ -92,9 +92,6 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
>> >>           !cpufreq_can_do_remote_dvfs(sg_policy->policy))
>> >>               return false;
>> >>
>> >> -     if (sg_policy->work_in_progress)
>> >> -             return false;
>> >> -
>> >>       if (unlikely(sg_policy->need_freq_update)) {
>> >>               sg_policy->need_freq_update = false;
>> >>               /*
>> >> @@ -128,7 +125,7 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
>> >>
>> >>               policy->cur = next_freq;
>> >>               trace_cpu_frequency(next_freq, smp_processor_id());
>> >> -     } else {
>> >> +     } else if (!sg_policy->work_in_progress) {
>> >>               sg_policy->work_in_progress = true;
>> >>               irq_work_queue(&sg_policy->irq_work);
>> >>       }
>> >> @@ -291,6 +288,13 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
>> >>
>> >>       ignore_dl_rate_limit(sg_cpu, sg_policy);
>> >>
>> >> +     /*
>> >> +      * For slow-switch systems, single policy requests can't run at the
>> >> +      * moment if update is in progress, unless we acquire update_lock.
>> >> +      */
>> >> +     if (sg_policy->work_in_progress)
>> >> +             return;
>> >> +
>> >
>> > I would still want this to go away :)
>> >
>> > @Rafael, will it be fine to get locking in place for unshared policy
>> > platforms ?
>>
>> As long as it doesn't affect the fast switch path in any way.
>
> I just realized that on a single policy switch that uses the governor thread,
> there will be 1 thread per-CPU. The sugov_update_single will be called on the
> same CPU with interrupts disabled.

sugov_update_single() doesn't have to run on the target CPU.

> In sugov_work, we are doing a
> raw_spin_lock_irqsave which also disables interrupts. So I don't think
> there's any possibility of a race happening on the same CPU between the
> frequency update request and the sugov_work executing. In other words, I feel
> we can drop the above if (..) statement for single policies completely and
> only keep the changes for the shared policy. Viresh since you brought up the
> single policy issue initially which made me add this if statememnt, could you
> let me know if you agree with what I just said?

Which is why you need the spinlock too.
Viresh Kumar May 22, 2018, 10:23 a.m. UTC | #14
On 21-05-18, 10:20, Joel Fernandes wrote:
> On Mon, May 21, 2018 at 06:00:50PM +0100, Patrick Bellasi wrote:
> > If that's the case, this means that if, for example, during a
> > frequency switch you get a request to reduce the frequency (e.g.
> > deadline task passing the 0-lag time) and right after a request to
> > increase the frequency (e.g. the current FAIR task tick)... you will
> > enqueue a freq drop followed by a freq increase and actually do two
> > frequency hops?

I don't think so.

Consider that the kthread is currently running and has just cleared the
work_in_progress flag. A sched update comes in at that time and we
decide to reduce the frequency, so we queue another work item and
update next_freq. Now, if another sched update comes in before the
kthread finishes its previous loop, we will simply update next_freq and
return. So the next time the kthread runs, it will pick up the most
recent update.

Where is the problem that both of you see?
Patrick Bellasi May 22, 2018, 10:26 a.m. UTC | #15
On 21-May 11:05, Joel Fernandes wrote:
> On Mon, May 21, 2018 at 11:50:55AM +0100, Patrick Bellasi wrote:
> > On 18-May 11:55, Joel Fernandes (Google.) wrote:
> > > From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> > > 
> > > Currently there is a chance of a schedutil cpufreq update request to be
> > > dropped if there is a pending update request. This pending request can
> > > be delayed if there is a scheduling delay of the irq_work and the wake
> > > up of the schedutil governor kthread.
> > > 
> > > A very bad scenario is when a schedutil request was already just made,
> > > such as to reduce the CPU frequency, then a newer request to increase
> > > CPU frequency (even sched deadline urgent frequency increase requests)
> > > can be dropped, even though the rate limits suggest that its Ok to
> > > process a request. This is because of the way the work_in_progress flag
> > > is used.
> > > 
> > > This patch improves the situation by allowing new requests to happen
> > > even though the old one is still being processed. Note that in this
> > > approach, if an irq_work was already issued, we just update next_freq
> > > and don't bother to queue another request so there's no extra work being
> > > done to make this happen.
> > 
> > Maybe I'm missing something but... is not this patch just a partial
> > mitigation of the issue you descrive above?
> > 
> > If a DL freq increase is queued, with this patch we store the request
> > but we don't actually increase the frequency until the next schedutil
> > update, which can be one tick away... isn't it?
> > 
> > If that's the case, maybe something like the following can complete
> > the cure?
> > 
> > ---8<---
> > #define SUGOV_FREQ_NONE 0
> > 
> > static unsigned int sugov_work_update(struct sugov_policy *sg_policy,
> > 				      unsigned int prev_freq)
> > {
> > 	unsigned long irq_flags;
> > 	bool update_freq = true;
> > 	unsigned int next_freq;
> > 
> > 	/*
> > 	 * Hold sg_policy->update_lock shortly to handle the case where:
> > 	 * incase sg_policy->next_freq is read here, and then updated by
> > 	 * sugov_update_shared just before work_in_progress is set to false
> > 	 * here, we may miss queueing the new update.
> > 	 *
> > 	 * Note: If a work was queued after the update_lock is released,
> > 	 * sugov_work will just be called again by kthread_work code; and the
> > 	 * request will be proceed before the sugov thread sleeps.
> > 	 */
> > 	raw_spin_lock_irqsave(&sg_policy->update_lock, irq_flags);
> > 	next_freq = sg_policy->next_freq;
> > 	sg_policy->work_in_progress = false;
> > 	if (prev_freq == next_freq)
> > 		update_freq = false;
> 
> About this patch on top of mine, I believe this check is already being done
> by sugov_update_commit? :

No, that check is different...
> 
> static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
> 				unsigned int next_freq)
> {
> 	struct cpufreq_policy *policy = sg_policy->policy;
> 
> 	if (sg_policy->next_freq == next_freq)
> 		return;
> 
> 	sg_policy->next_freq = next_freq;
> 	sg_policy->last_freq_update_time = time;
> ----

... in my snippet the check is required to verify whether, once a freq
switch has been completed by the kthread, sugov_update_commit has
actually committed a new and different frequency with respect to the
one the kthread has just configured.

It means we will have two async paths:

1. sugov_update_commit()
   which updates sg_policy->next_freq

2. sugov_work_update()
   which will run in a loop until the last freq it configures matches
   the current value of sg_policy->next_freq (sketched below)

But again, as we were discussing yesterday, we can add these
additional bits in a follow-up patch on top of yours.
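
Just to sketch what 2. could look like -- this is only an illustration
completing my earlier snippet, so sugov_work_update() and
SUGOV_FREQ_NONE are hypothetical names from that snippet, not anything
in the tree:

   #define SUGOV_FREQ_NONE 0

   static unsigned int sugov_work_update(struct sugov_policy *sg_policy,
   					 unsigned int prev_freq)
   {
   	unsigned long irq_flags;
   	unsigned int next_freq;

   	/* Sample the latest committed request and clear the flag. */
   	raw_spin_lock_irqsave(&sg_policy->update_lock, irq_flags);
   	next_freq = sg_policy->next_freq;
   	sg_policy->work_in_progress = false;
   	raw_spin_unlock_irqrestore(&sg_policy->update_lock, irq_flags);

   	/* Nothing new was committed since the last switch we performed. */
   	if (next_freq == prev_freq)
   		return SUGOV_FREQ_NONE;

   	mutex_lock(&sg_policy->work_lock);
   	__cpufreq_driver_target(sg_policy->policy, next_freq, CPUFREQ_RELATION_L);
   	mutex_unlock(&sg_policy->work_lock);

   	return next_freq;
   }

   static void sugov_work(struct kthread_work *work)
   {
   	struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
   	unsigned int freq = SUGOV_FREQ_NONE;

   	/* Loop until the last freq we configured matches next_freq. */
   	while ((freq = sugov_work_update(sg_policy, freq)) != SUGOV_FREQ_NONE)
   		;
   }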
Viresh Kumar May 22, 2018, 10:34 a.m. UTC | #16
Okay, Rafael and I were discussing this patch and the locking and races around it.

On 18-05-18, 11:55, Joel Fernandes (Google.) wrote:
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index e13df951aca7..5c482ec38610 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -92,9 +92,6 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
>  	    !cpufreq_can_do_remote_dvfs(sg_policy->policy))
>  		return false;
>  
> -	if (sg_policy->work_in_progress)
> -		return false;
> -
>  	if (unlikely(sg_policy->need_freq_update)) {
>  		sg_policy->need_freq_update = false;
>  		/*
> @@ -128,7 +125,7 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
>  
>  		policy->cur = next_freq;
>  		trace_cpu_frequency(next_freq, smp_processor_id());
> -	} else {
> +	} else if (!sg_policy->work_in_progress) {
>  		sg_policy->work_in_progress = true;
>  		irq_work_queue(&sg_policy->irq_work);
>  	}
> @@ -291,6 +288,13 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
>  
>  	ignore_dl_rate_limit(sg_cpu, sg_policy);
>  
> +	/*
> +	 * For slow-switch systems, single policy requests can't run at the
> +	 * moment if update is in progress, unless we acquire update_lock.
> +	 */
> +	if (sg_policy->work_in_progress)
> +		return;
> +
>  	if (!sugov_should_update_freq(sg_policy, time))
>  		return;
>  
> @@ -382,13 +386,27 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
>  static void sugov_work(struct kthread_work *work)
>  {
>  	struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
> +	unsigned int freq;
> +	unsigned long flags;
> +
> +	/*
> +	 * Hold sg_policy->update_lock shortly to handle the case where:
> +	 * incase sg_policy->next_freq is read here, and then updated by
> +	 * sugov_update_shared just before work_in_progress is set to false
> +	 * here, we may miss queueing the new update.
> +	 *
> +	 * Note: If a work was queued after the update_lock is released,
> +	 * sugov_work will just be called again by kthread_work code; and the
> +	 * request will be proceed before the sugov thread sleeps.
> +	 */
> +	raw_spin_lock_irqsave(&sg_policy->update_lock, flags);
> +	freq = sg_policy->next_freq;
> +	sg_policy->work_in_progress = false;
> +	raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);
>  
>  	mutex_lock(&sg_policy->work_lock);
> -	__cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
> -				CPUFREQ_RELATION_L);
> +	__cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L);
>  	mutex_unlock(&sg_policy->work_lock);
> -
> -	sg_policy->work_in_progress = false;
>  }

And I do see a race here for single policy systems doing slow switching.

Kthread                                                 Sched update

sugov_work()                                            sugov_update_single()

        lock();
        // The CPU is free to rearrange below           
        // two in any order, so it may clear
        // the flag first and then read next
        // freq. Lets assume it does.
        work_in_progress = false

                                                        if (work_in_progress)
                                                                return;

                                                        sg_policy->next_freq = 0;
        freq = sg_policy->next_freq;
                                                        sg_policy->next_freq = real-next-freq;
        unlock();



Is the above theory right, or am I daydreaming? :)
Patrick Bellasi May 22, 2018, 10:38 a.m. UTC | #17
Hi Viresh,
thanks for clarifying...

On 22-May 15:53, Viresh Kumar wrote:
> On 21-05-18, 10:20, Joel Fernandes wrote:
> > On Mon, May 21, 2018 at 06:00:50PM +0100, Patrick Bellasi wrote:
> > > If that's the case, this means that if, for example, during a
> > > frequency switch you get a request to reduce the frequency (e.g.
> > > deadline task passing the 0-lag time) and right after a request to
> > > increase the frequency (e.g. the current FAIR task tick)... you will
> > > enqueue a freq drop followed by a freq increase and actually do two
> > > frequency hops?
> 
> I don't think so.
> 
> Consider the kthread as running currently and has just cleared the
> work_in_progress flag. Sched update comes at that time and we decide
> to reduce the frequency, we queue another work and update next_freq.
> Now if another sched update comes before the kthread finishes its
> previous loop, we will simply update next_freq and return. So when the
> next time kthread runs, it will pick the most recent update.

Mmm... right... looking more closely at the two execution contexts:

   // A) Frequency update requests
   sugov_update_commit() {
      sg_policy->next_freq = next_freq;
      if (!sg_policy->work_in_progress) {
         sg_policy->work_in_progress = true;
         irq_work_queue(&sg_policy->irq_work);
      }
   }

   // B) Actual frequency updates
   sugov_work() {
      freq = sg_policy->next_freq;
      sg_policy->work_in_progress = false;

      __cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L);
   }


It's true that A will enqueue only one B at the first next_freq update
and then it will just keep updating next_freq.
Thus, we should be guaranteed to always have just one kwork pending in
the queue.
 
> Where is the problem both of you see ?

Perhaps the confusion comes just from the naming of
"work_in_progress", which is confusing since we now use it to
indicate that we have enqueued a frequency change and are waiting for
the kwork to pick it up.

Maybe it would help to rename it to something like kwork_queued,
update_pending, update_queued... ?
Patrick Bellasi May 22, 2018, 10:51 a.m. UTC | #18
On 22-May 16:04, Viresh Kumar wrote:
> Okay, me and Rafael were discussing this patch, locking and races around this.
> 
> On 18-05-18, 11:55, Joel Fernandes (Google.) wrote:
> > @@ -382,13 +386,27 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
> >  static void sugov_work(struct kthread_work *work)
> >  {
> >  	struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
> > +	unsigned int freq;
> > +	unsigned long flags;
> > +
> > +	/*
> > +	 * Hold sg_policy->update_lock shortly to handle the case where:
> > +	 * incase sg_policy->next_freq is read here, and then updated by
> > +	 * sugov_update_shared just before work_in_progress is set to false
> > +	 * here, we may miss queueing the new update.
> > +	 *
> > +	 * Note: If a work was queued after the update_lock is released,
> > +	 * sugov_work will just be called again by kthread_work code; and the
> > +	 * request will be proceed before the sugov thread sleeps.
> > +	 */
> > +	raw_spin_lock_irqsave(&sg_policy->update_lock, flags);
> > +	freq = sg_policy->next_freq;
> > +	sg_policy->work_in_progress = false;
> > +	raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);
> >  
> >  	mutex_lock(&sg_policy->work_lock);
> > -	__cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
> > -				CPUFREQ_RELATION_L);
> > +	__cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L);
> >  	mutex_unlock(&sg_policy->work_lock);
> > -
> > -	sg_policy->work_in_progress = false;
> >  }
> 
> And I do see a race here for single policy systems doing slow switching.
> 
> Kthread                                                 Sched update
> 
> sugov_work()                                            sugov_update_single()
> 
>         lock();
>         // The CPU is free to rearrange below
>         // two in any order, so it may clear
>         // the flag first and then read next
>         // freq. Lets assume it does.
>         work_in_progress = false
> 
>                                                         if (work_in_progress)
>                                                                 return;
> 
>                                                         sg_policy->next_freq = 0;
>         freq = sg_policy->next_freq;
>                                                         sg_policy->next_freq = real-next-freq;
>         unlock();
> 
> 
> 
> Is the above theory right or am I day dreaming ? :)

It could happen, but using:

   raw_spin_lock_irqsave(&sg_policy->update_lock, flags);
   freq = READ_ONCE(sg_policy->next_freq)
   WRITE_ONCE(sg_policy->work_in_progress, false);
   raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);

                       if (!READ_ONCE(sg_policy->work_in_progress)) {
                           WRITE_ONCE(sg_policy->work_in_progress, true);
                           irq_work_queue(&sg_policy->irq_work);
                       }

should fix it by enforcing the ordering as well as documenting the
concurrent access.

However, on the "sched update" side, where do we have the sequence:

   sg_policy->next_freq = 0;
   sg_policy->next_freq = real-next-freq;

AFAICS we always use locals for next_freq and do a single assignment
in sugov_update_commit(), don't we?
Viresh Kumar May 22, 2018, 10:56 a.m. UTC | #19
On 22-05-18, 11:51, Patrick Bellasi wrote:
> It could happen, but using:
> 
>    raw_spin_lock_irqsave(&sg_policy->update_lock, flags);
>    freq = READ_ONCE(sg_policy->next_freq)
>    WRITE_ONCE(sg_policy->work_in_progress, false);
>    raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);
> 
>                        if (!READ_ONCE(sg_policy->work_in_progress)) {
>                            WRITE_ONCE(sg_policy->work_in_progress, true);
>                            irq_work_queue(&sg_policy->irq_work);
>                        }

I think it's better to get locking in place for the non-fast-switching case
in single-policy systems right now.

> should fix it by enforcing the ordering as well as documenting the
> concurrent access.
> 
> However, in the "sched update" side, where do we have the sequence:
> 
>    sg_policy->next_freq = 0;
>    sg_policy->next_freq = real-next-freq;

Ah, that was just an example of what a compiler may do (though it shouldn't).
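
Purely as an illustration (not real schedutil code, and not something a
sane compiler is actually expected to emit), the kind of staging the
marked accessors are meant to rule out:

   /* Plain store: the compiler has (theoretical) latitude to stage it: */
   sg_policy->next_freq = 0;			/* invented transient value */
   sg_policy->next_freq = real_next_freq;	/* final value */

   /* Marked accesses forbid such staging and document the concurrency: */
   WRITE_ONCE(sg_policy->next_freq, real_next_freq);
   freq = READ_ONCE(sg_policy->next_freq);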
Rafael J. Wysocki May 22, 2018, 11:26 a.m. UTC | #20
On Tuesday, May 22, 2018 12:02:24 PM CEST Rafael J. Wysocki wrote:
> On Mon, May 21, 2018 at 6:13 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
> > On Mon, May 21, 2018 at 10:29:52AM +0200, Rafael J. Wysocki wrote:
> >> On Mon, May 21, 2018 at 7:14 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> >> > On 18-05-18, 11:55, Joel Fernandes (Google.) wrote:
> >> >> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> >> >>
> >> >> Currently there is a chance of a schedutil cpufreq update request to be
> >> >> dropped if there is a pending update request. This pending request can
> >> >> be delayed if there is a scheduling delay of the irq_work and the wake
> >> >> up of the schedutil governor kthread.
> >> >>
> >> >> A very bad scenario is when a schedutil request was already just made,
> >> >> such as to reduce the CPU frequency, then a newer request to increase
> >> >> CPU frequency (even sched deadline urgent frequency increase requests)
> >> >> can be dropped, even though the rate limits suggest that its Ok to
> >> >> process a request. This is because of the way the work_in_progress flag
> >> >> is used.
> >> >>
> >> >> This patch improves the situation by allowing new requests to happen
> >> >> even though the old one is still being processed. Note that in this
> >> >> approach, if an irq_work was already issued, we just update next_freq
> >> >> and don't bother to queue another request so there's no extra work being
> >> >> done to make this happen.
> >> >
> >> > Now that this isn't an RFC anymore, you shouldn't have added below
> >> > paragraph here. It could go to the comments section though.
> >> >
> >> >> I had brought up this issue at the OSPM conference and Claudio had a
> >> >> discussion RFC with an alternate approach [1]. I prefer the approach as
> >> >> done in the patch below since it doesn't need any new flags and doesn't
> >> >> cause any other extra overhead.
> >> >>
> >> >> [1] https://patchwork.kernel.org/patch/10384261/
> >> >>
> >> >> LGTMed-by: Viresh Kumar <viresh.kumar@linaro.org>
> >> >> LGTMed-by: Juri Lelli <juri.lelli@redhat.com>
> >> >
> >> > Looks like a Tag you just invented ? :)
> >>
> >> Yeah.
> >>
> >> The LGTM from Juri can be converned into an ACK silently IMO.  That
> >> said I have added Looks-good-to: tags to a couple of commits. :-)
> >
> > Cool, I'll covert them to Acks :-)
> 
> So it looks like I should expect an update of this patch, right?
> 
> Or do you prefer the current one to be applied and work on top of it?

Well, sorry, I can't apply this one as it is racy.
Rafael J. Wysocki May 22, 2018, 3:30 p.m. UTC | #21
On Tue, May 22, 2018 at 12:02 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Mon, May 21, 2018 at 6:13 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
>> On Mon, May 21, 2018 at 10:29:52AM +0200, Rafael J. Wysocki wrote:
>>> On Mon, May 21, 2018 at 7:14 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>>> > On 18-05-18, 11:55, Joel Fernandes (Google.) wrote:
>>> >> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
>>> >>
>>> >> Currently there is a chance of a schedutil cpufreq update request to be
>>> >> dropped if there is a pending update request. This pending request can
>>> >> be delayed if there is a scheduling delay of the irq_work and the wake
>>> >> up of the schedutil governor kthread.
>>> >>
>>> >> A very bad scenario is when a schedutil request was already just made,
>>> >> such as to reduce the CPU frequency, then a newer request to increase
>>> >> CPU frequency (even sched deadline urgent frequency increase requests)
>>> >> can be dropped, even though the rate limits suggest that its Ok to
>>> >> process a request. This is because of the way the work_in_progress flag
>>> >> is used.
>>> >>
>>> >> This patch improves the situation by allowing new requests to happen
>>> >> even though the old one is still being processed. Note that in this
>>> >> approach, if an irq_work was already issued, we just update next_freq
>>> >> and don't bother to queue another request so there's no extra work being
>>> >> done to make this happen.
>>> >
>>> > Now that this isn't an RFC anymore, you shouldn't have added below
>>> > paragraph here. It could go to the comments section though.
>>> >
>>> >> I had brought up this issue at the OSPM conference and Claudio had a
>>> >> discussion RFC with an alternate approach [1]. I prefer the approach as
>>> >> done in the patch below since it doesn't need any new flags and doesn't
>>> >> cause any other extra overhead.
>>> >>
>>> >> [1] https://patchwork.kernel.org/patch/10384261/
>>> >>
>>> >> LGTMed-by: Viresh Kumar <viresh.kumar@linaro.org>
>>> >> LGTMed-by: Juri Lelli <juri.lelli@redhat.com>
>>> >
>>> > Looks like a Tag you just invented ? :)
>>>
>>> Yeah.
>>>
>>> The LGTM from Juri can be converned into an ACK silently IMO.  That
>>> said I have added Looks-good-to: tags to a couple of commits. :-)
>>
>> Cool, I'll covert them to Acks :-)
>
> So it looks like I should expect an update of this patch, right?
>
> Or do you prefer the current one to be applied and work on top of it?
>

[cut]

>>
>> I just realized that on a single policy switch that uses the governor thread,
>> there will be 1 thread per-CPU. The sugov_update_single will be called on the
>> same CPU with interrupts disabled.
>
> sugov_update_single() doesn't have to run on the target CPU.

Which sadly is a bug IMO. :-/

>> In sugov_work, we are doing a
>> raw_spin_lock_irqsave which also disables interrupts. So I don't think
>> there's any possibility of a race happening on the same CPU between the
>> frequency update request and the sugov_work executing. In other words, I feel
>> we can drop the above if (..) statement for single policies completely and
>> only keep the changes for the shared policy. Viresh since you brought up the
>> single policy issue initially which made me add this if statememnt, could you
>> let me know if you agree with what I just said?
>
> Which is why you need the spinlock too.

And you totally have a point.  With the above bug fixed, disabling
interrupts should be sufficient to prevent concurrent updates from
occurring in the one-CPU policy case and the work_in_progress check in
sugov_update_single() isn't necessary.
Rafael J. Wysocki May 22, 2018, 5:07 p.m. UTC | #22
On Tue, May 22, 2018 at 5:30 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Tue, May 22, 2018 at 12:02 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
>> On Mon, May 21, 2018 at 6:13 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
>>> On Mon, May 21, 2018 at 10:29:52AM +0200, Rafael J. Wysocki wrote:
>>>> On Mon, May 21, 2018 at 7:14 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>>>> > On 18-05-18, 11:55, Joel Fernandes (Google.) wrote:
>>>> >> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
>>>> >>
>>>> >> Currently there is a chance of a schedutil cpufreq update request to be
>>>> >> dropped if there is a pending update request. This pending request can
>>>> >> be delayed if there is a scheduling delay of the irq_work and the wake
>>>> >> up of the schedutil governor kthread.
>>>> >>
>>>> >> A very bad scenario is when a schedutil request was already just made,
>>>> >> such as to reduce the CPU frequency, then a newer request to increase
>>>> >> CPU frequency (even sched deadline urgent frequency increase requests)
>>>> >> can be dropped, even though the rate limits suggest that its Ok to
>>>> >> process a request. This is because of the way the work_in_progress flag
>>>> >> is used.
>>>> >>
>>>> >> This patch improves the situation by allowing new requests to happen
>>>> >> even though the old one is still being processed. Note that in this
>>>> >> approach, if an irq_work was already issued, we just update next_freq
>>>> >> and don't bother to queue another request so there's no extra work being
>>>> >> done to make this happen.
>>>> >
>>>> > Now that this isn't an RFC anymore, you shouldn't have added below
>>>> > paragraph here. It could go to the comments section though.
>>>> >
>>>> >> I had brought up this issue at the OSPM conference and Claudio had a
>>>> >> discussion RFC with an alternate approach [1]. I prefer the approach as
>>>> >> done in the patch below since it doesn't need any new flags and doesn't
>>>> >> cause any other extra overhead.
>>>> >>
>>>> >> [1] https://patchwork.kernel.org/patch/10384261/
>>>> >>
>>>> >> LGTMed-by: Viresh Kumar <viresh.kumar@linaro.org>
>>>> >> LGTMed-by: Juri Lelli <juri.lelli@redhat.com>
>>>> >
>>>> > Looks like a Tag you just invented ? :)
>>>>
>>>> Yeah.
>>>>
>>>> The LGTM from Juri can be converned into an ACK silently IMO.  That
>>>> said I have added Looks-good-to: tags to a couple of commits. :-)
>>>
>>> Cool, I'll covert them to Acks :-)
>>
>> So it looks like I should expect an update of this patch, right?
>>
>> Or do you prefer the current one to be applied and work on top of it?
>>
>
> [cut]
>
>>>
>>> I just realized that on a single policy switch that uses the governor thread,
>>> there will be 1 thread per-CPU. The sugov_update_single will be called on the
>>> same CPU with interrupts disabled.
>>
>> sugov_update_single() doesn't have to run on the target CPU.
>
> Which sadly is a bug IMO. :-/

My bad.

sugov_update_single() runs under rq->lock, so it need not run on a
target CPU so long as the CPU running it can update the frequency for
the target and there is the requisite check for that in
sugov_should_update_freq().

That means that sugov_update_single() will not run concurrently on two
different CPUs for the same target, but it may be running concurrently
with the kthread (as pointed out by Viresh).

>>> In sugov_work, we are doing a
>>> raw_spin_lock_irqsave which also disables interrupts. So I don't think
>>> there's any possibility of a race happening on the same CPU between the
>>> frequency update request and the sugov_work executing. In other words, I feel
>>> we can drop the above if (..) statement for single policies completely and
>>> only keep the changes for the shared policy. Viresh since you brought up the
>>> single policy issue initially which made me add this if statememnt, could you
>>> let me know if you agree with what I just said?
>>
>> Which is why you need the spinlock too.
>
> And you totally have a point.

Not really, sorry about that.

It is necessary to take the spinlock in the non-fast-switch case,
because of the possible race with the kthread, so something like my
patch at https://patchwork.kernel.org/patch/10418551/ is needed after
all.
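
Roughly along the lines below -- this is only an illustration of the
idea, not the actual patch at that link: take update_lock around the
commit in the non-fast-switch path of sugov_update_single() (and drop
the early work_in_progress bail-out added there), so it is ordered
against the read/clear done in sugov_work():

   	if (sg_policy->policy->fast_switch_enabled) {
   		sugov_update_commit(sg_policy, time, next_f);
   	} else {
   		unsigned long irq_flags;

   		/*
   		 * Serialize against the kthread's read of next_freq and
   		 * its clearing of work_in_progress.
   		 */
   		raw_spin_lock_irqsave(&sg_policy->update_lock, irq_flags);
   		sugov_update_commit(sg_policy, time, next_f);
   		raw_spin_unlock_irqrestore(&sg_policy->update_lock, irq_flags);
   	}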
Joel Fernandes May 22, 2018, 10:09 p.m. UTC | #23
On Tue, May 22, 2018 at 04:04:15PM +0530, Viresh Kumar wrote:
> Okay, me and Rafael were discussing this patch, locking and races around this.
> 
> On 18-05-18, 11:55, Joel Fernandes (Google.) wrote:
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index e13df951aca7..5c482ec38610 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -92,9 +92,6 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> >  	    !cpufreq_can_do_remote_dvfs(sg_policy->policy))
> >  		return false;
> >  
> > -	if (sg_policy->work_in_progress)
> > -		return false;
> > -
> >  	if (unlikely(sg_policy->need_freq_update)) {
> >  		sg_policy->need_freq_update = false;
> >  		/*
> > @@ -128,7 +125,7 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
> >  
> >  		policy->cur = next_freq;
> >  		trace_cpu_frequency(next_freq, smp_processor_id());
> > -	} else {
> > +	} else if (!sg_policy->work_in_progress) {
> >  		sg_policy->work_in_progress = true;
> >  		irq_work_queue(&sg_policy->irq_work);
> >  	}
> > @@ -291,6 +288,13 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
> >  
> >  	ignore_dl_rate_limit(sg_cpu, sg_policy);
> >  
> > +	/*
> > +	 * For slow-switch systems, single policy requests can't run at the
> > +	 * moment if update is in progress, unless we acquire update_lock.
> > +	 */
> > +	if (sg_policy->work_in_progress)
> > +		return;
> > +
> >  	if (!sugov_should_update_freq(sg_policy, time))
> >  		return;
> >  
> > @@ -382,13 +386,27 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
> >  static void sugov_work(struct kthread_work *work)
> >  {
> >  	struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
> > +	unsigned int freq;
> > +	unsigned long flags;
> > +
> > +	/*
> > +	 * Hold sg_policy->update_lock shortly to handle the case where:
> > +	 * incase sg_policy->next_freq is read here, and then updated by
> > +	 * sugov_update_shared just before work_in_progress is set to false
> > +	 * here, we may miss queueing the new update.
> > +	 *
> > +	 * Note: If a work was queued after the update_lock is released,
> > +	 * sugov_work will just be called again by kthread_work code; and the
> > +	 * request will be proceed before the sugov thread sleeps.
> > +	 */
> > +	raw_spin_lock_irqsave(&sg_policy->update_lock, flags);
> > +	freq = sg_policy->next_freq;
> > +	sg_policy->work_in_progress = false;
> > +	raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);
> >  
> >  	mutex_lock(&sg_policy->work_lock);
> > -	__cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
> > -				CPUFREQ_RELATION_L);
> > +	__cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L);
> >  	mutex_unlock(&sg_policy->work_lock);
> > -
> > -	sg_policy->work_in_progress = false;
> >  }
> 
> And I do see a race here for single policy systems doing slow switching.
> 
> Kthread                                                 Sched update
> 
> sugov_work()                                            sugov_update_single()
> 
>         lock();
>         // The CPU is free to rearrange below           
>         // two in any order, so it may clear
>         // the flag first and then read next
>         // freq. Lets assume it does.
>         work_in_progress = false
> 
>                                                         if (work_in_progress)
>                                                                 return;
> 
>                                                         sg_policy->next_freq = 0;
>         freq = sg_policy->next_freq;
>                                                         sg_policy->next_freq = real-next-freq;
>         unlock();
> 

I agree with the race you describe for the single-policy slow-switch case. Good find :)

Mainline's sugov_work could also do such reordering, I think. Even
with the mutex_unlock in mainline's sugov_work, that work_in_progress write could
be reordered by the CPU to happen before the read of next_freq. AIUI,
mutex_unlock is expected to be only a release barrier.

Although to be safe, I could just put an smp_mb() there. I believe with that,
no locking would be needed for that case.

I'll send out a v3 with Acks for the original patch, and then send out the
smp_mb() as a separate patch if that's Ok.
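
In case it helps, the placement I have in mind is roughly the
following -- an untested sketch for the single-policy slow-switch case
only, assuming we would then skip taking update_lock there:

   static void sugov_work(struct kthread_work *work)
   {
   	struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
   	unsigned int freq;

   	freq = READ_ONCE(sg_policy->next_freq);

   	/*
   	 * Order the read of next_freq above before clearing
   	 * work_in_progress below, so a concurrent sugov_update_single()
   	 * cannot observe the flag cleared while we still hold a stale
   	 * next_freq.
   	 */
   	smp_mb();

   	WRITE_ONCE(sg_policy->work_in_progress, false);

   	mutex_lock(&sg_policy->work_lock);
   	__cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L);
   	mutex_unlock(&sg_policy->work_lock);
   }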

thanks,

 - Joel
Rafael J. Wysocki May 23, 2018, 8:18 a.m. UTC | #24
On Wed, May 23, 2018 at 12:09 AM, Joel Fernandes <joel@joelfernandes.org> wrote:
> On Tue, May 22, 2018 at 04:04:15PM +0530, Viresh Kumar wrote:
>> Okay, me and Rafael were discussing this patch, locking and races around this.
>>
>> On 18-05-18, 11:55, Joel Fernandes (Google.) wrote:
>> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
>> > index e13df951aca7..5c482ec38610 100644
>> > --- a/kernel/sched/cpufreq_schedutil.c
>> > +++ b/kernel/sched/cpufreq_schedutil.c
>> > @@ -92,9 +92,6 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
>> >         !cpufreq_can_do_remote_dvfs(sg_policy->policy))
>> >             return false;
>> >
>> > -   if (sg_policy->work_in_progress)
>> > -           return false;
>> > -
>> >     if (unlikely(sg_policy->need_freq_update)) {
>> >             sg_policy->need_freq_update = false;
>> >             /*
>> > @@ -128,7 +125,7 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
>> >
>> >             policy->cur = next_freq;
>> >             trace_cpu_frequency(next_freq, smp_processor_id());
>> > -   } else {
>> > +   } else if (!sg_policy->work_in_progress) {
>> >             sg_policy->work_in_progress = true;
>> >             irq_work_queue(&sg_policy->irq_work);
>> >     }
>> > @@ -291,6 +288,13 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
>> >
>> >     ignore_dl_rate_limit(sg_cpu, sg_policy);
>> >
>> > +   /*
>> > +    * For slow-switch systems, single policy requests can't run at the
>> > +    * moment if update is in progress, unless we acquire update_lock.
>> > +    */
>> > +   if (sg_policy->work_in_progress)
>> > +           return;
>> > +
>> >     if (!sugov_should_update_freq(sg_policy, time))
>> >             return;
>> >
>> > @@ -382,13 +386,27 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
>> >  static void sugov_work(struct kthread_work *work)
>> >  {
>> >     struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
>> > +   unsigned int freq;
>> > +   unsigned long flags;
>> > +
>> > +   /*
>> > +    * Hold sg_policy->update_lock shortly to handle the case where:
>> > +    * incase sg_policy->next_freq is read here, and then updated by
>> > +    * sugov_update_shared just before work_in_progress is set to false
>> > +    * here, we may miss queueing the new update.
>> > +    *
>> > +    * Note: If a work was queued after the update_lock is released,
>> > +    * sugov_work will just be called again by kthread_work code; and the
>> > +    * request will be proceed before the sugov thread sleeps.
>> > +    */
>> > +   raw_spin_lock_irqsave(&sg_policy->update_lock, flags);
>> > +   freq = sg_policy->next_freq;
>> > +   sg_policy->work_in_progress = false;
>> > +   raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);
>> >
>> >     mutex_lock(&sg_policy->work_lock);
>> > -   __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
>> > -                           CPUFREQ_RELATION_L);
>> > +   __cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L);
>> >     mutex_unlock(&sg_policy->work_lock);
>> > -
>> > -   sg_policy->work_in_progress = false;
>> >  }
>>
>> And I do see a race here for single policy systems doing slow switching.
>>
>> Kthread                                                 Sched update
>>
>> sugov_work()                                            sugov_update_single()
>>
>>         lock();
>>         // The CPU is free to rearrange below
>>         // two in any order, so it may clear
>>         // the flag first and then read next
>>         // freq. Lets assume it does.
>>         work_in_progress = false
>>
>>                                                         if (work_in_progress)
>>                                                                 return;
>>
>>                                                         sg_policy->next_freq = 0;
>>         freq = sg_policy->next_freq;
>>                                                         sg_policy->next_freq = real-next-freq;
>>         unlock();
>>
>
> I agree with the race you describe for single policy slow-switch. Good find :)
>
> The mainline sugov_work could also do such reordering in sugov_work, I think. Even
> with the mutex_unlock in mainline's sugov_work, that work_in_progress write could
> be reordered by the CPU to happen before the read of next_freq. AIUI,
> mutex_unlock is expected to be only a release-barrier.
>
> Although to be safe, I could just put an smp_mb() there. I believe with that,
> no locking would be needed for such case.

Yes, but leaving the work_in_progress check in sugov_update_single()
means that the original problem is still there in the one-CPU policy
case.  Namely, utilization updates coming in between setting
work_in_progress in sugov_update_commit() and clearing it in
sugov_work() will be discarded in the one-CPU policy case, but not in
the shared policy case.

> I'll send out a v3 with Acks for the original patch,

OK

> and the send out the smp_mb() as a separate patch if that's Ok.

I would prefer to use a spinlock in the one-CPU policy non-fast-switch
case and remove the work_in_progress check from sugov_update_single().

I can do a patch on top of yours for that.  In fact, I've done that already. :-)

Thanks,
Rafael
Viresh Kumar May 23, 2018, 9:01 a.m. UTC | #25
On 22-05-18, 15:09, Joel Fernandes wrote:
> I agree with the race you describe for single policy slow-switch. Good find :)
> 
> The mainline sugov_work could also do such reordering in sugov_work, I think. Even
> with the mutex_unlock in mainline's sugov_work, that work_in_progress write could
> be reordered by the CPU to happen before the read of next_freq. AIUI,
> mutex_unlock is expected to be only a release-barrier.
> 
> Although to be safe, I could just put an smp_mb() there. I believe with that,
> no locking would be needed for such case.
> 
> I'll send out a v3 with Acks for the original patch, and the send out the
> smp_mb() as a separate patch if that's Ok.

Maybe it would be better to get the fix (with smp_mb) first and then
this optimization patch on top of it? That would mean that the fix can
be applied to stable kernels easily.
Joel Fernandes May 23, 2018, 9:42 a.m. UTC | #26
On May 23, 2018 2:01:01 AM PDT, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>On 22-05-18, 15:09, Joel Fernandes wrote:
>> I agree with the race you describe for single policy slow-switch.
>Good find :)
>> 
>> The mainline sugov_work could also do such reordering in sugov_work,
>I think. Even
>> with the mutex_unlock in mainline's sugov_work, that work_in_progress
>write could
>> be reordered by the CPU to happen before the read of next_freq. AIUI,
>> mutex_unlock is expected to be only a release-barrier.
>> 
>> Although to be safe, I could just put an smp_mb() there. I believe
>with that,
>> no locking would be needed for such case.
>> 
>> I'll send out a v3 with Acks for the original patch, and the send out
>the
>> smp_mb() as a separate patch if that's Ok.
>
>Maybe it would be better to get the fix (with smp_mb) first and then
>this optimization patch on the top? That would mean that the fix can
>get applied to stable kernels easily.

Probably. But then Rafael is changing the single-policy case to use the lock, so the barrier wouldn't be needed at all. In that case, both my patch and Rafael's new one can go into stable, which handles your race (optimization == fix in this case :P).

thanks,

- Joel
Viresh Kumar May 23, 2018, 10:06 a.m. UTC | #27
On 23-05-18, 02:42, Joel Fernandes wrote:
> Probably. But then Rafael is changing single policy to use the lock
> so then barrier wouldn't be needed at all. In that case, both mine
> and Rafael new patch can go into stable which handles your race (
> optimization == fix in this case :P )

Yeah, we discussed that offline.

Go get some sleep. There are no barriers in this world :)
diff mbox

Patch

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index e13df951aca7..5c482ec38610 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -92,9 +92,6 @@  static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
 	    !cpufreq_can_do_remote_dvfs(sg_policy->policy))
 		return false;
 
-	if (sg_policy->work_in_progress)
-		return false;
-
 	if (unlikely(sg_policy->need_freq_update)) {
 		sg_policy->need_freq_update = false;
 		/*
@@ -128,7 +125,7 @@  static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
 
 		policy->cur = next_freq;
 		trace_cpu_frequency(next_freq, smp_processor_id());
-	} else {
+	} else if (!sg_policy->work_in_progress) {
 		sg_policy->work_in_progress = true;
 		irq_work_queue(&sg_policy->irq_work);
 	}
@@ -291,6 +288,13 @@  static void sugov_update_single(struct update_util_data *hook, u64 time,
 
 	ignore_dl_rate_limit(sg_cpu, sg_policy);
 
+	/*
+	 * For slow-switch systems, single policy requests can't run at the
+	 * moment if update is in progress, unless we acquire update_lock.
+	 */
+	if (sg_policy->work_in_progress)
+		return;
+
 	if (!sugov_should_update_freq(sg_policy, time))
 		return;
 
@@ -382,13 +386,27 @@  sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
 static void sugov_work(struct kthread_work *work)
 {
 	struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
+	unsigned int freq;
+	unsigned long flags;
+
+	/*
+	 * Hold sg_policy->update_lock briefly to handle the case where:
+	 * if sg_policy->next_freq is read here, and then updated by
+	 * sugov_update_shared() just before work_in_progress is set to false
+	 * here, we may miss queueing the new update.
+	 *
+	 * Note: If a work was queued after the update_lock is released,
+	 * sugov_work() will just be called again by the kthread_work code; and
+	 * the request will be processed before the sugov thread sleeps.
+	 */
+	raw_spin_lock_irqsave(&sg_policy->update_lock, flags);
+	freq = sg_policy->next_freq;
+	sg_policy->work_in_progress = false;
+	raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);
 
 	mutex_lock(&sg_policy->work_lock);
-	__cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
-				CPUFREQ_RELATION_L);
+	__cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L);
 	mutex_unlock(&sg_policy->work_lock);
-
-	sg_policy->work_in_progress = false;
 }
 
 static void sugov_irq_work(struct irq_work *irq_work)