diff mbox series

clk: allow to skip clk_core_req_round_rate_nolock()

Message ID 20240126201433.1830600-1-jan.dakinevich@salutedevices.com (mailing list archive)
State Changes Requested, archived
Headers show
Series clk: allow to skip clk_core_req_round_rate_nolock() | expand

Commit Message

Jan Dakinevich Jan. 26, 2024, 8:14 p.m. UTC
Calling of clk_core_req_round_rate_nolock() can be time-consuming in a
case of deep hierarchy with multiple dividers/parents. But if the clock
already has exactly the same rate as desired, there is no need to
determine how it could be rounded.

Signed-off-by: Jan Dakinevich <jan.dakinevich@salutedevices.com>
---
 drivers/clk/clk.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Jan Dakinevich Feb. 20, 2024, 5:34 p.m. UTC | #1
Please take a look at this patch.

On 1/26/24 23:14, Jan Dakinevich wrote:
> Calling of clk_core_req_round_rate_nolock() can be time-consuming in a
> case of deep hierarchy with multiple dividers/parents. But if the clock
> already has exactly the same rate as desired, there is no need to
> determine how it could be rounded.
> 
> Signed-off-by: Jan Dakinevich <jan.dakinevich@salutedevices.com>
> ---
>   drivers/clk/clk.c | 6 ++++++
>   1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/clk/clk.c b/drivers/clk/clk.c
> index 2253c154a824..04f0ddced932 100644
> --- a/drivers/clk/clk.c
> +++ b/drivers/clk/clk.c
> @@ -2423,6 +2423,12 @@ static int clk_core_set_rate_nolock(struct clk_core *core,
>   	if (!core)
>   		return 0;
>   
> +	/* skip calculation of rounded rate if the clock already has exactly
> +	 * the same rate as desired
> +	 */
> +	if (req_rate == clk_core_get_rate_nolock(core))
> +		return 0;
> +
>   	rate = clk_core_req_round_rate_nolock(core, req_rate);
>   
>   	/* bail early if nothing to do */
Stephen Boyd Feb. 22, 2024, 11:20 p.m. UTC | #2
Quoting Jan Dakinevich (2024-01-26 12:14:33)
> Calling of clk_core_req_round_rate_nolock() can be time-consuming in a
> case of deep hierarchy with multiple dividers/parents. But if the clock
> already has exactly the same rate as desired, there is no need to
> determine how it could be rounded.

What exactly are you trying to avoid? Is this an optimization or a bug
fix? TL;DR: I'm unlikely to apply this patch.

I could see some driver implementing round_rate()/determine_rate() in a
way that rounds the rate passed in, so that even if the rate is what the
clk is running at _right now_, it still wants to change it to something
else, or at least call down into the driver to call the set_rate clk_op.
Applying this patch will break that. The contract is that
clk_set_rate(rate) == clk_set_rate(clk_round_rate(rate)). It doesn't
look like anything needs to change.
Jan Dakinevich Feb. 23, 2024, 9:47 p.m. UTC | #3
On 2/23/24 02:20, Stephen Boyd wrote:
> Quoting Jan Dakinevich (2024-01-26 12:14:33)
>> Calling of clk_core_req_round_rate_nolock() can be time-consuming in a
>> case of deep hierarchy with multiple dividers/parents. But if the clock
>> already has exactly the same rate as desired, there is no need to
>> determine how it could be rounded.
> 
> What exactly are you trying to avoid? Is this an optimization or a bug
> fix? TL;DR: I'm unlikely to apply this patch.
> 

It is an optimization, not a bug. The problem is that 
clk_core_req_round_rate_nolock() is quite expensive, and I faced with 
cases, where it takes tens and hundreds milliseconds (depending on SoC).

As I see, it is irremovable feature of clk_core_req_round_rate_nolock() 
design itself. Lets imagine, we have some clock, and its parent is a 
divider. When clk_core_req_round_rate_nolock() is being called the 
execution is walked through the following path:

clk_core_determine_round_nolock
  core->ops->determine_rate
    divider_determine_rate
     clk_divider_bestdiv

Inside clk_divider_bestdiv() for each possible divider 
clk_hw_round_rate() is called for parent of the clock, which in turn 
calls clk_core_determine_round_nolock().

So, each divider and multiplexer in clock path multiplies many times an 
amount of iteration required to execute 
clk_core_req_round_rate_nolock(). When there are a lot of them the time 
consumed by clk_core_req_round_rate_nolock() becomes sufficient.

> I could see some driver implementing round_rate()/determine_rate() in a
> way that rounds the rate passed in, so that even if the rate is what the
> clk is running at _right now_, it still wants to change it to something
> else, or at least call down into the driver to call the set_rate clk_op.
> Applying this patch will break that. The contract is that
> clk_set_rate(rate) == clk_set_rate(clk_round_rate(rate)). It doesn't
> look like anything needs to change.

If I am not mistaken, clocks's rate is either equal to its parent rate 
or calculated by ->recalc_rate(). I suppose, this callback should return 
valid rate value that is based on current clock parameters.

Now, suppose the clock has rate "rateA" and we called clk_set_rate() to 
set "rateA", but clk_core_req_round_rate_nolock() inside clk_set_rate() 
rounds it to "rateB". Thus, although the clock is able to run on desired 
rate (and actually run on it), ->determine_rate() and ->round_rate() are 
unable to choose clocks's parameters for that value. Is it correct 
behavior for clock driver?
Stephen Boyd Feb. 29, 2024, 2:16 a.m. UTC | #4
Quoting Jan Dakinevich (2024-02-23 13:47:35)
> 
> 
> On 2/23/24 02:20, Stephen Boyd wrote:
> > Quoting Jan Dakinevich (2024-01-26 12:14:33)
> >> Calling of clk_core_req_round_rate_nolock() can be time-consuming in a
> >> case of deep hierarchy with multiple dividers/parents. But if the clock
> >> already has exactly the same rate as desired, there is no need to
> >> determine how it could be rounded.
> > 
> > What exactly are you trying to avoid? Is this an optimization or a bug
> > fix? TL;DR: I'm unlikely to apply this patch.
> > 
> 
> It is an optimization, not a bug. The problem is that 
> clk_core_req_round_rate_nolock() is quite expensive, and I faced with 
> cases, where it takes tens and hundreds milliseconds (depending on SoC).
> 
> As I see, it is irremovable feature of clk_core_req_round_rate_nolock() 
> design itself. Lets imagine, we have some clock, and its parent is a 
> divider. When clk_core_req_round_rate_nolock() is being called the 
> execution is walked through the following path:
> 
> clk_core_determine_round_nolock
>   core->ops->determine_rate
>     divider_determine_rate
>      clk_divider_bestdiv
> 
> Inside clk_divider_bestdiv() for each possible divider 
> clk_hw_round_rate() is called for parent of the clock, which in turn 
> calls clk_core_determine_round_nolock().
> 
> So, each divider and multiplexer in clock path multiplies many times an 
> amount of iteration required to execute 
> clk_core_req_round_rate_nolock(). When there are a lot of them the time 
> consumed by clk_core_req_round_rate_nolock() becomes sufficient.

Do you have a more concrete example? I wonder if perhaps you've split up
the clk hardware into multipliers and dividers, when they really could
all be combined into one clk that does all the math at once without
traversing the tree. But if the problem is really just that the
clk_divider_bestdiv() implementation is slow then that's good to know.

> 
> > I could see some driver implementing round_rate()/determine_rate() in a
> > way that rounds the rate passed in, so that even if the rate is what the
> > clk is running at _right now_, it still wants to change it to something
> > else, or at least call down into the driver to call the set_rate clk_op.
> > Applying this patch will break that. The contract is that
> > clk_set_rate(rate) == clk_set_rate(clk_round_rate(rate)). It doesn't
> > look like anything needs to change.
> 
> If I am not mistaken, clocks's rate is either equal to its parent rate 
> or calculated by ->recalc_rate(). I suppose, this callback should return 
> valid rate value that is based on current clock parameters.
> 
> Now, suppose the clock has rate "rateA" and we called clk_set_rate() to 
> set "rateA", but clk_core_req_round_rate_nolock() inside clk_set_rate() 
> rounds it to "rateB". Thus, although the clock is able to run on desired 
> rate (and actually run on it), ->determine_rate() and ->round_rate() are 
> unable to choose clocks's parameters for that value. Is it correct 
> behavior for clock driver?
> 

It's not really a question for the clk framework. If the clk driver
wants to round rateA to rateB then it can. It could be that the
recalc_rate() clk_op calculates a slightly different rate than what
round_rate() clk op did, because maybe the driver has frequency tables
and the rate the clk runs at is something like 933333Hz but the driver
just says that's 930000Hz for simplicity. If that happens, recalc_rate()
gives us the "true" rate, while round_rate() gives us the "approximate"
rate. Either way, the set_rate() clk_op knows that 930000Hz means set
some clk rate, even if that doesn't match what recalc_rate() returns
once the rate is changed.

This is very much a real case, because this is essentially how the qcom
clk driver works.
Jan Dakinevich March 9, 2024, 9:20 p.m. UTC | #5
On 2/29/24 05:16, Stephen Boyd wrote:
> Quoting Jan Dakinevich (2024-02-23 13:47:35)
>>
>>
>> On 2/23/24 02:20, Stephen Boyd wrote:
>>> Quoting Jan Dakinevich (2024-01-26 12:14:33)
>>>> Calling of clk_core_req_round_rate_nolock() can be time-consuming in a
>>>> case of deep hierarchy with multiple dividers/parents. But if the clock
>>>> already has exactly the same rate as desired, there is no need to
>>>> determine how it could be rounded.
>>>
>>> What exactly are you trying to avoid? Is this an optimization or a bug
>>> fix? TL;DR: I'm unlikely to apply this patch.
>>>
>>
>> It is an optimization, not a bug. The problem is that 
>> clk_core_req_round_rate_nolock() is quite expensive, and I faced with 
>> cases, where it takes tens and hundreds milliseconds (depending on SoC).
>>
>> As I see, it is irremovable feature of clk_core_req_round_rate_nolock() 
>> design itself. Lets imagine, we have some clock, and its parent is a 
>> divider. When clk_core_req_round_rate_nolock() is being called the 
>> execution is walked through the following path:
>>
>> clk_core_determine_round_nolock
>>   core->ops->determine_rate
>>     divider_determine_rate
>>      clk_divider_bestdiv
>>
>> Inside clk_divider_bestdiv() for each possible divider 
>> clk_hw_round_rate() is called for parent of the clock, which in turn 
>> calls clk_core_determine_round_nolock().
>>
>> So, each divider and multiplexer in clock path multiplies many times an 
>> amount of iteration required to execute 
>> clk_core_req_round_rate_nolock(). When there are a lot of them the time 
>> consumed by clk_core_req_round_rate_nolock() becomes sufficient.
> 
> Do you have a more concrete example? I wonder if perhaps you've split up
> the clk hardware into multipliers and dividers, when they really could
> all be combined into one clk that does all the math at once without
> traversing the tree. But if the problem is really just that the
> clk_divider_bestdiv() implementation is slow then that's good to know.
> 

My experience related to audio stack on Amlogic SoCs. For example, below
is clock hierarchy on AXG SoC:

xtal

hifi_pll_dco

hifi_pll

aud_mst_c_mclk_sel

aud_mst_c_mclk_div

aud_mst_c_mclk

aud_mst_c_sclk_pre_en

aud_mst_c_sclk_div

aud_mst_c_sclk_post_en

aud_mst_c_lrclk_div

aud_mst_c_lrclk

aud_tdmout_c_lrclk

on A1 SoC (which is my target) it will be almost identical, but it is
not upstreamed yet:

xtal

hifipll_in

hifi_pll

audio_mst_a_mclk_mux

audio_mst_a_mclk_div

audio_mst_a_mclk

audio_mst_a_sclk_pre_en

audio_mst_a_sclk_div

audio_mst_a_sclk_post_en

audio_mst_a_lrclk_div

audio_mst_a_lrclk

audio_tdmout_a_lrclk

Clock setting operation that takes too long is here:

https://elixir.bootlin.com/linux/v6.7/source/sound/soc/meson/axg-tdm-interface.c#L279

In both cases there are three divider. When I artificially make that one
of these dividers has single value (using clk_div_table) it
significantly decreases the time that spent by clk_set_rate(): less then
1ms instead ~300ms on A1 SoC.


>>> I could see some driver implementing round_rate()/determine_rate() in a
>>> way that rounds the rate passed in, so that even if the rate is what the
>>> clk is running at _right now_, it still wants to change it to something
>>> else, or at least call down into the driver to call the set_rate clk_op.
>>> Applying this patch will break that. The contract is that
>>> clk_set_rate(rate) == clk_set_rate(clk_round_rate(rate)). It doesn't
>>> look like anything needs to change.
>>
>> If I am not mistaken, clocks's rate is either equal to its parent rate 
>> or calculated by ->recalc_rate(). I suppose, this callback should return 
>> valid rate value that is based on current clock parameters.
>>
>> Now, suppose the clock has rate "rateA" and we called clk_set_rate() to 
>> set "rateA", but clk_core_req_round_rate_nolock() inside clk_set_rate() 
>> rounds it to "rateB". Thus, although the clock is able to run on desired 
>> rate (and actually run on it), ->determine_rate() and ->round_rate() are 
>> unable to choose clocks's parameters for that value. Is it correct 
>> behavior for clock driver?
>>
> 
> It's not really a question for the clk framework. If the clk driver
> wants to round rateA to rateB then it can. It could be that the
> recalc_rate() clk_op calculates a slightly different rate than what
> round_rate() clk op did, because maybe the driver has frequency tables
> and the rate the clk runs at is something like 933333Hz but the driver
> just says that's 930000Hz for simplicity. If that happens, recalc_rate()
> gives us the "true" rate, while round_rate() gives us the "approximate"
> rate. Either way, the set_rate() clk_op knows that 930000Hz means set
> some clk rate, even if that doesn't match what recalc_rate() returns
> once the rate is changed.
> 
> This is very much a real case, because this is essentially how the qcom
> clk driver works.

Ok. Got it.
diff mbox series

Patch

diff --git a/drivers/clk/clk.c b/drivers/clk/clk.c
index 2253c154a824..04f0ddced932 100644
--- a/drivers/clk/clk.c
+++ b/drivers/clk/clk.c
@@ -2423,6 +2423,12 @@  static int clk_core_set_rate_nolock(struct clk_core *core,
 	if (!core)
 		return 0;
 
+	/* skip calculation of rounded rate if the clock already has exactly
+	 * the same rate as desired
+	 */
+	if (req_rate == clk_core_get_rate_nolock(core))
+		return 0;
+
 	rate = clk_core_req_round_rate_nolock(core, req_rate);
 
 	/* bail early if nothing to do */