mbox series

[v1,0/9] Fix Allwinner D1 boot regression

Message ID 20240814145642.344485-1-emil.renner.berthing@canonical.com (mailing list archive)
Headers show
Series Fix Allwinner D1 boot regression | expand

Message

Emil Renner Berthing Aug. 14, 2024, 2:56 p.m. UTC
Hi Anup,

As described in the thread below[1] I haven't been able to boot my
boards based on the Allwinner D1 SoC since 6.9 where you converted the
SiFive PLIC driver to a platform driver.

This is clearly a regression and there haven't really been much progress
on fixing the issue since then, so here is the revert that fixes it.

If no other fix is found before 6.11 I suggest we apply this.

[1]: https://lore.kernel.org/linux-riscv/CAJM55Z9hGKo4784N3s3DhWw=nMRKZKcmvZ58x7uVBghExhoc9A@mail.gmail.com/

/Emil

Emil Renner Berthing (9):
  Revert "irqchip/sifive-plic: Chain to parent IRQ after handlers are
    ready"
  Revert "irqchip/sifive-plic: Avoid explicit cpumask allocation on
    stack"
  Revert "irqchip/sifive-plic: Improve locking safety by using
    irqsave/irqrestore"
  Revert "irqchip/sifive-plic: Parse number of interrupts and contexts
    early in plic_probe()"
  Revert "irqchip/sifive-plic: Cleanup PLIC contexts upon irqdomain
    creation failure"
  Revert "irqchip/sifive-plic: Use riscv_get_intc_hwnode() to get parent
    fwnode"
  Revert "irqchip/sifive-plic: Use devm_xyz() for managed allocation"
  Revert "irqchip/sifive-plic: Use dev_xyz() in-place of pr_xyz()"
  Revert "irqchip/sifive-plic: Convert PLIC driver into a platform
    driver"

 drivers/irqchip/irq-sifive-plic.c | 296 ++++++++++++------------------
 1 file changed, 117 insertions(+), 179 deletions(-)

Comments

Thomas Gleixner Aug. 14, 2024, 5:30 p.m. UTC | #1
On Wed, Aug 14 2024 at 16:56, Emil Renner Berthing wrote:
> As described in the thread below[1] I haven't been able to boot my
> boards based on the Allwinner D1 SoC since 6.9 where you converted the
> SiFive PLIC driver to a platform driver.
>
> This is clearly a regression and there haven't really been much progress
> on fixing the issue since then, so here is the revert that fixes it.
>
> If no other fix is found before 6.11 I suggest we apply this.

So this mess has been ignored for two month now?

From the pastebin in the initial report:

[    0.000000] irq: no irq domain found for interrupt-controller@10000000 !
[    0.000000] Failed to map interrupt for /soc/timer@2050000
[    0.000000] Failed to initialize '/soc/timer@2050000': -22

This comes back with -EINVAL. So the timer cannot find an interrupt,
which makes it pretty obvious why the system stops to boot, unless there
is some other timer available.

This is obviously related to the SUN4I_TIMER because that message went
away when it was disabled according to the next pastebin.

Obviously that can't work because the SUN4I timer driver is using
timer_of_init() which cannot handle deferred probing.

Daniel: There was a partial fix for the sun4i driver, which you said you
applied:

  https://lore.kernel.org/all/20240312192519.1602493-1-samuel.holland@sifive.com

But that thing never materialized in a pull request.

And of course everyone involved ignored the problem since March 13th
2024, i.e. almost half a year.

Seriously?

Can you RISCV folks get your act together and ensure to fix things you
broke on the way? Especially when Emil reported this nobody pointed him
to this patch and nobody noticed that it's still not merged?

It took me less than 15 minutes to find that patch and the correlation,
but this is absolutely not my job.

I'm seriously grumpy about that. This is not how it works. If you break
stuff, then you take care to fix it before you shove more changes into
the tree and waste my time.

I'm very much inclined to take the reverts right now, send them to Linus
for -rc5 tagged with cc: stable and ignore/nak any irqchip related riscv
patches until the next merge window is over.

Emil, can you give that sun4i fix a test ride please?

Thanks,

        tglx
Emil Renner Berthing Aug. 15, 2024, 10:29 a.m. UTC | #2
Thomas Gleixner wrote:
> On Wed, Aug 14 2024 at 16:56, Emil Renner Berthing wrote:
> > As described in the thread below[1] I haven't been able to boot my
> > boards based on the Allwinner D1 SoC since 6.9 where you converted the
> > SiFive PLIC driver to a platform driver.
> >
> > This is clearly a regression and there haven't really been much progress
> > on fixing the issue since then, so here is the revert that fixes it.
> >
> > If no other fix is found before 6.11 I suggest we apply this.
>
> So this mess has been ignored for two month now?
>
> From the pastebin in the initial report:
>
> [    0.000000] irq: no irq domain found for interrupt-controller@10000000 !
> [    0.000000] Failed to map interrupt for /soc/timer@2050000
> [    0.000000] Failed to initialize '/soc/timer@2050000': -22
>
> This comes back with -EINVAL. So the timer cannot find an interrupt,
> which makes it pretty obvious why the system stops to boot, unless there
> is some other timer available.
>
> This is obviously related to the SUN4I_TIMER because that message went
> away when it was disabled according to the next pastebin.
>
> Obviously that can't work because the SUN4I timer driver is using
> timer_of_init() which cannot handle deferred probing.
>
> Daniel: There was a partial fix for the sun4i driver, which you said you
> applied:
>
>   https://lore.kernel.org/all/20240312192519.1602493-1-samuel.holland@sifive.com
>
> But that thing never materialized in a pull request.
>
> And of course everyone involved ignored the problem since March 13th
> 2024, i.e. almost half a year.
>
> Seriously?
>
> Can you RISCV folks get your act together and ensure to fix things you
> broke on the way? Especially when Emil reported this nobody pointed him
> to this patch and nobody noticed that it's still not merged?
>
> It took me less than 15 minutes to find that patch and the correlation,
> but this is absolutely not my job.
>
> I'm seriously grumpy about that. This is not how it works. If you break
> stuff, then you take care to fix it before you shove more changes into
> the tree and waste my time.
>
> I'm very much inclined to take the reverts right now, send them to Linus
> for -rc5 tagged with cc: stable and ignore/nak any irqchip related riscv
> patches until the next merge window is over.
>
> Emil, can you give that sun4i fix a test ride please?

Hi Thomas,

Thanks for looking at this! Unfortunately the above patch isn't enough to fix
the issue:

https://termbin.com/7sgc

It still hangs after the "[    0.176451] cpuidle: using governor teo" message
until the watchdog reboots the systems.

/Emil
Thomas Gleixner Aug. 15, 2024, 11:44 a.m. UTC | #3
On Thu, Aug 15 2024 at 03:29, Emil Renner Berthing wrote:
> Thomas Gleixner wrote:
> Thanks for looking at this! Unfortunately the above patch isn't enough to fix
> the issue:
>
> https://termbin.com/7sgc
>
> It still hangs after the "[    0.176451] cpuidle: using governor teo" message
> until the watchdog reboots the systems.

So what's puzzling is that there is a timer installed early on:

[    0.000000] clocksource: riscv_clocksource: ....

That same init function installs the per cpu riscv clockevent, so there
should be a timer available.

The deffered probing of the PLIC driver delays obviously the probing of
the sun4i timer, but that should not matter when another timer is
available. So the sun4i driver might be a red herring.

Can you please add "ignore_loglevel initcall_debug" to the command line
and provide the output of a booting and a failing kernel?

And on the booting kernel please provide the output from:

# cat /sys/devices/system/clockevents/clockevent0/current_device
# cat /sys/devices/system/clockevents/broadcast/current_device
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource

Thanks,

        tglx
Emil Renner Berthing Aug. 15, 2024, 12:04 p.m. UTC | #4
Thomas Gleixner wrote:
> On Thu, Aug 15 2024 at 03:29, Emil Renner Berthing wrote:
> > Thomas Gleixner wrote:
> > Thanks for looking at this! Unfortunately the above patch isn't enough to fix
> > the issue:
> >
> > https://termbin.com/7sgc
> >
> > It still hangs after the "[    0.176451] cpuidle: using governor teo" message
> > until the watchdog reboots the systems.
>
> So what's puzzling is that there is a timer installed early on:
>
> [    0.000000] clocksource: riscv_clocksource: ....
>
> That same init function installs the per cpu riscv clockevent, so there
> should be a timer available.
>
> The deffered probing of the PLIC driver delays obviously the probing of
> the sun4i timer, but that should not matter when another timer is
> available. So the sun4i driver might be a red herring.
>
> Can you please add "ignore_loglevel initcall_debug" to the command line
> and provide the output of a booting and a failing kernel?

6.11-rc3 + these reverts:  https://termbin.com/q6wk
6.11-rc3 + Samuel's patch: https://termbin.com/7cgs

> And on the booting kernel please provide the output from:
>
> # cat /sys/devices/system/clockevents/clockevent0/current_device
> # cat /sys/devices/system/clockevents/broadcast/current_device
> # cat /sys/devices/system/clocksource/clocksource0/current_clocksource

On both a 6.8.6 kernel and 6.11-rc3 + reverts I get:

  # cat /sys/devices/system/clockevents/clockevent0/current_device
  sun4i_tick
  # cat /sys/devices/system/clockevents/broadcast/current_device
  riscv_timer_clockevent
  # cat /sys/devices/system/clocksource/clocksource0/current_clocksource
  riscv_clocksource

/Emil
Emil Renner Berthing Aug. 15, 2024, 12:14 p.m. UTC | #5
Emil Renner Berthing wrote:
> Thomas Gleixner wrote:
> > On Thu, Aug 15 2024 at 03:29, Emil Renner Berthing wrote:
> > > Thomas Gleixner wrote:
> > > Thanks for looking at this! Unfortunately the above patch isn't enough to fix
> > > the issue:
> > >
> > > https://termbin.com/7sgc
> > >
> > > It still hangs after the "[    0.176451] cpuidle: using governor teo" message
> > > until the watchdog reboots the systems.
> >
> > So what's puzzling is that there is a timer installed early on:
> >
> > [    0.000000] clocksource: riscv_clocksource: ....
> >
> > That same init function installs the per cpu riscv clockevent, so there
> > should be a timer available.
> >
> > The deffered probing of the PLIC driver delays obviously the probing of
> > the sun4i timer, but that should not matter when another timer is
> > available. So the sun4i driver might be a red herring.
> >
> > Can you please add "ignore_loglevel initcall_debug" to the command line
> > and provide the output of a booting and a failing kernel?
>
> 6.11-rc3 + these reverts:  https://termbin.com/q6wk
> 6.11-rc3 + Samuel's patch: https://termbin.com/7cgs

I think this confirms what Charlie found here:
https://lore.kernel.org/linux-riscv/ZoydV7vad5JWIcZb@ghost/

>
> > And on the booting kernel please provide the output from:
> >
> > # cat /sys/devices/system/clockevents/clockevent0/current_device
> > # cat /sys/devices/system/clockevents/broadcast/current_device
> > # cat /sys/devices/system/clocksource/clocksource0/current_clocksource
>
> On both a 6.8.6 kernel and 6.11-rc3 + reverts I get:
>
>   # cat /sys/devices/system/clockevents/clockevent0/current_device
>   sun4i_tick
>   # cat /sys/devices/system/clockevents/broadcast/current_device
>   riscv_timer_clockevent
>   # cat /sys/devices/system/clocksource/clocksource0/current_clocksource
>   riscv_clocksource
>
> /Emil
>
> _______________________________________________
> linux-riscv mailing list
> linux-riscv@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv
Thomas Gleixner Aug. 15, 2024, 1:16 p.m. UTC | #6
On Thu, Aug 15 2024 at 05:14, Emil Renner Berthing wrote:
> Emil Renner Berthing wrote:
>> 6.11-rc3 + these reverts:  https://termbin.com/q6wk
>> 6.11-rc3 + Samuel's patch: https://termbin.com/7cgs
>
> I think this confirms what Charlie found here:
> https://lore.kernel.org/linux-riscv/ZoydV7vad5JWIcZb@ghost/

Yes. So the riscv timer is not working on this thing or it stops
somehow.

Can you apply the debug patch below and check whether you see the
'J: ....' output at all and if so whether it stops at some point.

Thanks,

        tglx

---
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -2459,6 +2459,9 @@ static void run_local_timers(void)
 {
 	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_LOCAL]);
 
+	if (!(jiffies & 0xFF))
+		pr_info("J: %lx\n", jiffies);
+
 	hrtimer_run_queues();
 
 	for (int i = 0; i < NR_BASES; i++, base++) {
Samuel Holland Aug. 15, 2024, 1:32 p.m. UTC | #7
Hi Thomas, Emil,

On 2024-08-15 8:16 AM, Thomas Gleixner wrote:
> On Thu, Aug 15 2024 at 05:14, Emil Renner Berthing wrote:
>> Emil Renner Berthing wrote:
>>> 6.11-rc3 + these reverts:  https://us01.z.antigena.com/l/Er4kZWDmvL5-bLzHHJoZv0k71iwW2jCD5qNpiz0x0XdYY6oORF_nXh7U7jw6oubhi~32HI4i71jUW9v8~NvSvPeUWrdYx3WJBr2GPDUjOu6LYPCOBfR2dVQuMWvlNj4tDjXFp3QEQAmeawZflD4JrIJjtSYIbKfe6v-tgH7SEuHMeSSriU633Lv 
>>> 6.11-rc3 + Samuel's patch: https://us01.z.antigena.com/l/EULtAYky6ZvgqZ49KGS-WBsYTg~Ht1NoQtEYmUVb56ymS9jDagqYHLK90WDjnVt69GfB4IX5NSRQXmSfkNsTzB8lJmFvDihHQmGrsCv9FzlorD9yGfXDlQ6rG6vmn5BNDwlipmssGaOGfh9yko8n9ArWR4TLhEf~f9ODqme~NXXwA9DLLc9p 
>>
>> I think this confirms what Charlie found here:
>> https://lore.kernel.org/linux-riscv/ZoydV7vad5JWIcZb@ghost/
> 
> Yes. So the riscv timer is not working on this thing or it stops
> somehow.

That's correct. With the (firmware) devicetree that Emil is using, the OpenSBI
firmware does not have a timer device, so it does not expose the (optional[1])
SBI time extension, and sbi_set_timer() does nothing.

I wrote a patch (not submitted) to skip registering riscv_clock_event when the
SBI time extension is unavailable, but this doesn't fully solve the issue
either, because then we have no clockevent at all when
check_unaligned_access_all_cpus() is called.

How early in the boot process are we "required" to have a functional clockevent?
Do we need to refactor check_unaligned_access_all_cpus() so it works on systems
where the only clockevent is provided by a platform device?

Regards,
Samuel

[1] https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/intro.adoc

> Can you apply the debug patch below and check whether you see the
> 'J: ....' output at all and if so whether it stops at some point.
> 
> Thanks,
> 
>         tglx
> 
> ---
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -2459,6 +2459,9 @@ static void run_local_timers(void)
>  {
>  	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_LOCAL]);
>  
> +	if (!(jiffies & 0xFF))
> +		pr_info("J: %lx\n", jiffies);
> +
>  	hrtimer_run_queues();
>  
>  	for (int i = 0; i < NR_BASES; i++, base++) {
> 
> 
> 
> _______________________________________________
> linux-riscv mailing list
> linux-riscv@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv
>
Emil Renner Berthing Aug. 15, 2024, 1:35 p.m. UTC | #8
Thomas Gleixner wrote:
> On Thu, Aug 15 2024 at 05:14, Emil Renner Berthing wrote:
> > Emil Renner Berthing wrote:
> >> 6.11-rc3 + these reverts:  https://termbin.com/q6wk
> >> 6.11-rc3 + Samuel's patch: https://termbin.com/7cgs
> >
> > I think this confirms what Charlie found here:
> > https://lore.kernel.org/linux-riscv/ZoydV7vad5JWIcZb@ghost/
>
> Yes. So the riscv timer is not working on this thing or it stops
> somehow.
>
> Can you apply the debug patch below and check whether you see the
> 'J: ....' output at all and if so whether it stops at some point.

No, I don't see the J: ... debug output anywhere:
https://termbin.com/3vez

/Emil
Thomas Gleixner Aug. 15, 2024, 2:11 p.m. UTC | #9
On Thu, Aug 15 2024 at 08:32, Samuel Holland wrote:
> On 2024-08-15 8:16 AM, Thomas Gleixner wrote:
>> Yes. So the riscv timer is not working on this thing or it stops
>> somehow.
>
> That's correct. With the (firmware) devicetree that Emil is using, the OpenSBI
> firmware does not have a timer device, so it does not expose the (optional[1])
> SBI time extension, and sbi_set_timer() does nothing.

Sigh. Does RISCV really have to repeat all mistakes which have been made
by x86, ARM and others before? It's known for decades that the kernel
relies on a working timer...

> I wrote a patch (not submitted) to skip registering riscv_clock_event when the
> SBI time extension is unavailable, but this doesn't fully solve the issue
> either, because then we have no clockevent at all when
> check_unaligned_access_all_cpus() is called.

check_unaligned_access_all_cpus() is irrelevant.

> How early in the boot process are we "required" to have a functional clockevent?
> Do we need to refactor check_unaligned_access_all_cpus() so it works on systems
> where the only clockevent is provided by a platform device?

Right after init/main::late_time_init() everything can depend on a
working timer and on jiffies increasing.

I'm actually surprised that the boot process gets that far. That's just
by pure luck, really.

Thanks,

        tglx
Anup Patel Aug. 15, 2024, 2:16 p.m. UTC | #10
On Thu, Aug 15, 2024 at 7:41 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Thu, Aug 15 2024 at 08:32, Samuel Holland wrote:
> > On 2024-08-15 8:16 AM, Thomas Gleixner wrote:
> >> Yes. So the riscv timer is not working on this thing or it stops
> >> somehow.
> >
> > That's correct. With the (firmware) devicetree that Emil is using, the OpenSBI
> > firmware does not have a timer device, so it does not expose the (optional[1])
> > SBI time extension, and sbi_set_timer() does nothing.
>
> Sigh. Does RISCV really have to repeat all mistakes which have been made
> by x86, ARM and others before? It's known for decades that the kernel
> relies on a working timer...

My apologies for the delay in finding a fix for this issue.

Almost all RISC-V platforms (except this one) have SBI Timer always
available and Linux uses a better timer or Sstc extension whenever
it is available.

When Emil first reported this issue, I did try to help him root cause
the issue but unfortunately I don't have this particular platform and
PLIC on all other RISC-V platforms works fine.

I am also surprised that none of the Allwiner folks tried helping.

>
> > I wrote a patch (not submitted) to skip registering riscv_clock_event when the
> > SBI time extension is unavailable, but this doesn't fully solve the issue
> > either, because then we have no clockevent at all when
> > check_unaligned_access_all_cpus() is called.
>
> check_unaligned_access_all_cpus() is irrelevant.
>
> > How early in the boot process are we "required" to have a functional clockevent?
> > Do we need to refactor check_unaligned_access_all_cpus() so it works on systems
> > where the only clockevent is provided by a platform device?
>
> Right after init/main::late_time_init() everything can depend on a
> working timer and on jiffies increasing.
>
> I'm actually surprised that the boot process gets that far. That's just
> by pure luck, really.
>
> Thanks,
>
>         tglx

Regards,
Anup
Anup Patel Aug. 15, 2024, 2:30 p.m. UTC | #11
On Thu, Aug 15, 2024 at 7:02 PM Samuel Holland
<samuel.holland@sifive.com> wrote:
>
> Hi Thomas, Emil,
>
> On 2024-08-15 8:16 AM, Thomas Gleixner wrote:
> > On Thu, Aug 15 2024 at 05:14, Emil Renner Berthing wrote:
> >> Emil Renner Berthing wrote:
> >>> 6.11-rc3 + these reverts:  https://us01.z.antigena.com/l/Er4kZWDmvL5-bLzHHJoZv0k71iwW2jCD5qNpiz0x0XdYY6oORF_nXh7U7jw6oubhi~32HI4i71jUW9v8~NvSvPeUWrdYx3WJBr2GPDUjOu6LYPCOBfR2dVQuMWvlNj4tDjXFp3QEQAmeawZflD4JrIJjtSYIbKfe6v-tgH7SEuHMeSSriU633Lv
> >>> 6.11-rc3 + Samuel's patch: https://us01.z.antigena.com/l/EULtAYky6ZvgqZ49KGS-WBsYTg~Ht1NoQtEYmUVb56ymS9jDagqYHLK90WDjnVt69GfB4IX5NSRQXmSfkNsTzB8lJmFvDihHQmGrsCv9FzlorD9yGfXDlQ6rG6vmn5BNDwlipmssGaOGfh9yko8n9ArWR4TLhEf~f9ODqme~NXXwA9DLLc9p
> >>
> >> I think this confirms what Charlie found here:
> >> https://lore.kernel.org/linux-riscv/ZoydV7vad5JWIcZb@ghost/
> >
> > Yes. So the riscv timer is not working on this thing or it stops
> > somehow.
>
> That's correct. With the (firmware) devicetree that Emil is using, the OpenSBI
> firmware does not have a timer device, so it does not expose the (optional[1])
> SBI time extension, and sbi_set_timer() does nothing.

OpenSBI uses platform specific M-mode timer (mtime and mtimecmp) to
provide SBI time extension to Linux.

The RISC-V privileged specification (v1.10 or higher) requires platform to
provide a M-mode timer (mtime and mtimecmp).

This platform not having any M-mode timer is yet another RISC-V spec
violation by this platform.

Regards,
Anup

>
> I wrote a patch (not submitted) to skip registering riscv_clock_event when the
> SBI time extension is unavailable, but this doesn't fully solve the issue
> either, because then we have no clockevent at all when
> check_unaligned_access_all_cpus() is called.
>
> How early in the boot process are we "required" to have a functional clockevent?
> Do we need to refactor check_unaligned_access_all_cpus() so it works on systems
> where the only clockevent is provided by a platform device?
>
> Regards,
> Samuel
>
> [1] https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/intro.adoc
>
> > Can you apply the debug patch below and check whether you see the
> > 'J: ....' output at all and if so whether it stops at some point.
> >
> > Thanks,
> >
> >         tglx
> >
> > ---
> > --- a/kernel/time/timer.c
> > +++ b/kernel/time/timer.c
> > @@ -2459,6 +2459,9 @@ static void run_local_timers(void)
> >  {
> >       struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_LOCAL]);
> >
> > +     if (!(jiffies & 0xFF))
> > +             pr_info("J: %lx\n", jiffies);
> > +
> >       hrtimer_run_queues();
> >
> >       for (int i = 0; i < NR_BASES; i++, base++) {
> >
> >
> >
> > _______________________________________________
> > linux-riscv mailing list
> > linux-riscv@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/linux-riscv
> >
>
Samuel Holland Aug. 15, 2024, 2:41 p.m. UTC | #12
On 2024-08-15 9:16 AM, Anup Patel wrote:
> On Thu, Aug 15, 2024 at 7:41 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> On Thu, Aug 15 2024 at 08:32, Samuel Holland wrote:
>>> On 2024-08-15 8:16 AM, Thomas Gleixner wrote:
>>>> Yes. So the riscv timer is not working on this thing or it stops
>>>> somehow.
>>>
>>> That's correct. With the (firmware) devicetree that Emil is using, the OpenSBI
>>> firmware does not have a timer device, so it does not expose the (optional[1])
>>> SBI time extension, and sbi_set_timer() does nothing.
>>
>> Sigh. Does RISCV really have to repeat all mistakes which have been made
>> by x86, ARM and others before? It's known for decades that the kernel
>> relies on a working timer...
> 
> My apologies for the delay in finding a fix for this issue.
> 
> Almost all RISC-V platforms (except this one) have SBI Timer always
> available and Linux uses a better timer or Sstc extension whenever
> it is available.

So this is the immediate solution: add the CLINT to the firmware devicetree so
that the SBI time extension works, and Linux will boot without any code changes,
albeit with a higher-overhead clockevent device.

Additionally merging the sun4i timer patch[1] will allow the system to switch to
the better MMIO clocksource later in the boot process.

The reason the CLINT was not added to the devicetree already is that the T-HEAD
version of the CLINT includes an extension to drive SSIP/STIP from a second
S-mode visible set of registers. So it should really have twice as many entries
in its interrupts-extended property as the existing CLINT, and I never got
around to validating that this would work.

The long-term solution would be adding driver support for the T-HEAD CLINT
extensions, which provide an even better clockevent than the sun4i timer.

[1]: https://lore.kernel.org/all/20240312192519.1602493-1-samuel.holland@sifive.com/

> When Emil first reported this issue, I did try to help him root cause
> the issue but unfortunately I don't have this particular platform and
> PLIC on all other RISC-V platforms works fine.
> 
> I am also surprised that none of the Allwinner folks tried helping.

Allwinner D1 support was upstreamed by unpaid hobbyists with very little
first-party assistance.

>>> I wrote a patch (not submitted) to skip registering riscv_clock_event when the
>>> SBI time extension is unavailable, but this doesn't fully solve the issue
>>> either, because then we have no clockevent at all when
>>> check_unaligned_access_all_cpus() is called.
>>
>> check_unaligned_access_all_cpus() is irrelevant.
>>
>>> How early in the boot process are we "required" to have a functional clockevent?
>>> Do we need to refactor check_unaligned_access_all_cpus() so it works on systems
>>> where the only clockevent is provided by a platform device?
>>
>> Right after init/main::late_time_init() everything can depend on a
>> working timer and on jiffies increasing.
>>
>> I'm actually surprised that the boot process gets that far. That's just
>> by pure luck, really.

Thanks for clearing this up!

Regards,
Samuel
Samuel Holland Aug. 15, 2024, 3:03 p.m. UTC | #13
Hi Anup,

On 2024-08-15 9:30 AM, Anup Patel wrote:
> On Thu, Aug 15, 2024 at 7:02 PM Samuel Holland
> <samuel.holland@sifive.com> wrote:
>>> Yes. So the riscv timer is not working on this thing or it stops
>>> somehow.
>>
>> That's correct. With the (firmware) devicetree that Emil is using, the OpenSBI
>> firmware does not have a timer device, so it does not expose the (optional[1])
>> SBI time extension, and sbi_set_timer() does nothing.
> 
> OpenSBI uses platform specific M-mode timer (mtime and mtimecmp) to
> provide SBI time extension to Linux.
> 
> The RISC-V privileged specification (v1.10 or higher) requires platform to
> provide a M-mode timer (mtime and mtimecmp).
> 
> This platform not having any M-mode timer is yet another RISC-V spec
> violation by this platform.

You've misunderstood here. Allwinner D1 (T-HEAD C906) _does_ have an M-mode
timer (a CLINT). It is just omitted from devicetree that Emil happens to be
using, so OpenSBI isn't using it.

Currently OpenSBI allows the system to boot without a timer device, and the SBI
specification does not mandate the time extension. If consensus is that either
of these should change, that's fine, but currently I see nothing in either the
privileged spec nor the SBI spec that guarantees the availability of some timer
to the kernel in S-mode.

Regards,
Samuel
Emil Renner Berthing Aug. 15, 2024, 3:07 p.m. UTC | #14
Samuel Holland wrote:
> On 2024-08-15 9:16 AM, Anup Patel wrote:
> > On Thu, Aug 15, 2024 at 7:41 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >>
> >> On Thu, Aug 15 2024 at 08:32, Samuel Holland wrote:
> >>> On 2024-08-15 8:16 AM, Thomas Gleixner wrote:
> >>>> Yes. So the riscv timer is not working on this thing or it stops
> >>>> somehow.
> >>>
> >>> That's correct. With the (firmware) devicetree that Emil is using, the OpenSBI
> >>> firmware does not have a timer device, so it does not expose the (optional[1])
> >>> SBI time extension, and sbi_set_timer() does nothing.
> >>
> >> Sigh. Does RISCV really have to repeat all mistakes which have been made
> >> by x86, ARM and others before? It's known for decades that the kernel
> >> relies on a working timer...
> >
> > My apologies for the delay in finding a fix for this issue.
> >
> > Almost all RISC-V platforms (except this one) have SBI Timer always
> > available and Linux uses a better timer or Sstc extension whenever
> > it is available.
>
> So this is the immediate solution: add the CLINT to the firmware devicetree so
> that the SBI time extension works, and Linux will boot without any code changes,
> albeit with a higher-overhead clockevent device.

But this will mean that you can't update your kernel to v6.9 or newer without
reflashing OpenSBI and u-boot. That's still a regression right?

/Emil
Thomas Gleixner Aug. 15, 2024, 3:14 p.m. UTC | #15
On Thu, Aug 15 2024 at 09:41, Samuel Holland wrote:
> On 2024-08-15 9:16 AM, Anup Patel wrote:
>> Almost all RISC-V platforms (except this one) have SBI Timer always
>> available and Linux uses a better timer or Sstc extension whenever
>> it is available.
>
> So this is the immediate solution: add the CLINT to the firmware
> devicetree so that the SBI time extension works, and Linux will boot
> without any code changes, albeit with a higher-overhead clockevent
> device.

That does not matter for the boot process when the sun4i timer becomes
available afterwards. But how can this be retrofitted along with the
kernel update?

Thanks,

        tglx
Anup Patel Aug. 15, 2024, 3:53 p.m. UTC | #16
On Thu, Aug 15, 2024 at 8:33 PM Samuel Holland
<samuel.holland@sifive.com> wrote:
>
> Hi Anup,
>
> On 2024-08-15 9:30 AM, Anup Patel wrote:
> > On Thu, Aug 15, 2024 at 7:02 PM Samuel Holland
> > <samuel.holland@sifive.com> wrote:
> >>> Yes. So the riscv timer is not working on this thing or it stops
> >>> somehow.
> >>
> >> That's correct. With the (firmware) devicetree that Emil is using, the OpenSBI
> >> firmware does not have a timer device, so it does not expose the (optional[1])
> >> SBI time extension, and sbi_set_timer() does nothing.
> >
> > OpenSBI uses platform specific M-mode timer (mtime and mtimecmp) to
> > provide SBI time extension to Linux.
> >
> > The RISC-V privileged specification (v1.10 or higher) requires platform to
> > provide a M-mode timer (mtime and mtimecmp).
> >
> > This platform not having any M-mode timer is yet another RISC-V spec
> > violation by this platform.
>
> You've misunderstood here. Allwinner D1 (T-HEAD C906) _does_ have an M-mode
> timer (a CLINT). It is just omitted from devicetree that Emil happens to be
> using, so OpenSBI isn't using it.
>
> Currently OpenSBI allows the system to boot without a timer device, and the SBI
> specification does not mandate the time extension. If consensus is that either
> of these should change, that's fine, but currently I see nothing in either the
> privileged spec nor the SBI spec that guarantees the availability of some timer
> to the kernel in S-mode.
>

The SBI time is certainly optional hence OpenSBI does not hang or crash if
it can't provide SBI time to supervisor software.

My comment is from the RISC-V privileged spec perspective:

1) Priv v1.10 says "Platforms provide a real-time counter, exposed as a
memory-mapped machine-mode register, mtime." in section "3.1.15 Machine
Timer Registers (mtime and mtimecmp)".

2) Similar statement in Priv v1.11 section "3.1.10 Machine Timer Registers
(mtime and mtimecmp)"

3) Similar statement in Priv v1.12 section "3.2.1 Machine Timer Registers
(mtime and mtimecmp)"

But since the M-mode timer was omitted from the DT, I think the DT was
always incomplete from the M-mode perspective.

Regards,
Anup
Samuel Holland Aug. 15, 2024, 3:59 p.m. UTC | #17
Hi Emil,

On 2024-08-15 10:07 AM, Emil Renner Berthing wrote:
> Samuel Holland wrote:
>> On 2024-08-15 9:16 AM, Anup Patel wrote:
>>> On Thu, Aug 15, 2024 at 7:41 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>>>
>>>> On Thu, Aug 15 2024 at 08:32, Samuel Holland wrote:
>>>>> On 2024-08-15 8:16 AM, Thomas Gleixner wrote:
>>>>>> Yes. So the riscv timer is not working on this thing or it stops
>>>>>> somehow.
>>>>>
>>>>> That's correct. With the (firmware) devicetree that Emil is using, the OpenSBI
>>>>> firmware does not have a timer device, so it does not expose the (optional[1])
>>>>> SBI time extension, and sbi_set_timer() does nothing.
>>>>
>>>> Sigh. Does RISCV really have to repeat all mistakes which have been made
>>>> by x86, ARM and others before? It's known for decades that the kernel
>>>> relies on a working timer...
>>>
>>> My apologies for the delay in finding a fix for this issue.
>>>
>>> Almost all RISC-V platforms (except this one) have SBI Timer always
>>> available and Linux uses a better timer or Sstc extension whenever
>>> it is available.
>>
>> So this is the immediate solution: add the CLINT to the firmware devicetree so
>> that the SBI time extension works, and Linux will boot without any code changes,
>> albeit with a higher-overhead clockevent device.
> 
> But this will mean that you can't update your kernel to v6.9 or newer without
> reflashing OpenSBI and u-boot. That's still a regression right?

I suppose that depends on if you think the SBI time extension is (or should have
been) mandatory for platforms without Sstc. If the SBI time extension is
mandatory, then this is a firmware bug, and not really Linux's responsibility to
work around.

If the SBI time extension is not mandatory, then Linux needs to be able to
handle platforms where the S-mode visible timer is attached to an external
interrupt controller (PLIC or APLIC), so the irqchip driver needs to be loaded
before time_init() (timer_probe()). So in that case, the bug is a Linux
regression, and we would need to revert the platform driver conversion.

Regards,
Samuel
Palmer Dabbelt Aug. 15, 2024, 5:51 p.m. UTC | #18
On Thu, 15 Aug 2024 08:59:37 PDT (-0700), samuel.holland@sifive.com wrote:
> Hi Emil,
>
> On 2024-08-15 10:07 AM, Emil Renner Berthing wrote:
>> Samuel Holland wrote:
>>> On 2024-08-15 9:16 AM, Anup Patel wrote:
>>>> On Thu, Aug 15, 2024 at 7:41 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>>>>
>>>>> On Thu, Aug 15 2024 at 08:32, Samuel Holland wrote:
>>>>>> On 2024-08-15 8:16 AM, Thomas Gleixner wrote:
>>>>>>> Yes. So the riscv timer is not working on this thing or it stops
>>>>>>> somehow.
>>>>>>
>>>>>> That's correct. With the (firmware) devicetree that Emil is using, the OpenSBI
>>>>>> firmware does not have a timer device, so it does not expose the (optional[1])
>>>>>> SBI time extension, and sbi_set_timer() does nothing.
>>>>>
>>>>> Sigh. Does RISCV really have to repeat all mistakes which have been made
>>>>> by x86, ARM and others before? It's known for decades that the kernel
>>>>> relies on a working timer...

It's even worse than that: RISC-V doesn't even mandate any working 
_instructions_, much less anything in the platform/firmware.

>>>> My apologies for the delay in finding a fix for this issue.
>>>>
>>>> Almost all RISC-V platforms (except this one) have SBI Timer always
>>>> available and Linux uses a better timer or Sstc extension whenever
>>>> it is available.
>>>
>>> So this is the immediate solution: add the CLINT to the firmware devicetree so
>>> that the SBI time extension works, and Linux will boot without any code changes,
>>> albeit with a higher-overhead clockevent device.
>>
>> But this will mean that you can't update your kernel to v6.9 or newer without
>> reflashing OpenSBI and u-boot. That's still a regression right?

Ya, I'd call that a regression.  Updating the firmware on these things 
isn't generally something we can rely on users to do, we've worked 
around other firmware bugs where we can to avoid forced updates.

> I suppose that depends on if you think the SBI time extension is (or should have
> been) mandatory for platforms without Sstc. If the SBI time extension is
> mandatory, then this is a firmware bug, and not really Linux's responsibility to
> work around.
>
> If the SBI time extension is not mandatory, then Linux needs to be able to
> handle platforms where the S-mode visible timer is attached to an external
> interrupt controller (PLIC or APLIC), so the irqchip driver needs to be loaded
> before time_init() (timer_probe()). So in that case, the bug is a Linux
> regression, and we would need to revert the platform driver conversion.

It doesn't really matter what the specs say (aka intended to say in 
RISC-V land): if there's a regression then we have to deal with it.  
It's not like whatever's written in the specs actually matters, vendors 
can just do whatever they want, so wer'e just stuck making the known 
implementations work.

So I think if the revert is the best fix then we should revert it.

That said: If the CLINT works, could we just add a probing quirk to make 
it appear on these systems even when it's not in the DT?  I'm thinking 
something like adding a compatibly string to the CLINT driver for the 
SOC (or core or whatever, just something that's already there).  We'd 
probably need a bit of special-case probing code, but shouldn't be so 
bad.  We've got some other compatibility-oriented DT quirks floating 
around.

> Regards,
> Samuel
Palmer Dabbelt Aug. 15, 2024, 5:51 p.m. UTC | #19
On Wed, 14 Aug 2024 10:30:48 PDT (-0700), tglx@linutronix.de wrote:
> On Wed, Aug 14 2024 at 16:56, Emil Renner Berthing wrote:
>> As described in the thread below[1] I haven't been able to boot my
>> boards based on the Allwinner D1 SoC since 6.9 where you converted the
>> SiFive PLIC driver to a platform driver.
>>
>> This is clearly a regression and there haven't really been much progress
>> on fixing the issue since then, so here is the revert that fixes it.
>>
>> If no other fix is found before 6.11 I suggest we apply this.
>
> So this mess has been ignored for two month now?
>
>>From the pastebin in the initial report:
>
> [    0.000000] irq: no irq domain found for interrupt-controller@10000000 !
> [    0.000000] Failed to map interrupt for /soc/timer@2050000
> [    0.000000] Failed to initialize '/soc/timer@2050000': -22
>
> This comes back with -EINVAL. So the timer cannot find an interrupt,
> which makes it pretty obvious why the system stops to boot, unless there
> is some other timer available.
>
> This is obviously related to the SUN4I_TIMER because that message went
> away when it was disabled according to the next pastebin.
>
> Obviously that can't work because the SUN4I timer driver is using
> timer_of_init() which cannot handle deferred probing.
>
> Daniel: There was a partial fix for the sun4i driver, which you said you
> applied:
>
>   https://lore.kernel.org/all/20240312192519.1602493-1-samuel.holland@sifive.com
>
> But that thing never materialized in a pull request.
>
> And of course everyone involved ignored the problem since March 13th
> 2024, i.e. almost half a year.
>
> Seriously?
>
> Can you RISCV folks get your act together and ensure to fix things you
> broke on the way? Especially when Emil reported this nobody pointed him
> to this patch and nobody noticed that it's still not merged?
>
> It took me less than 15 minutes to find that patch and the correlation,
> but this is absolutely not my job.

Sorry, I guess I'd just sort of been ignoring the platform-specific side 
of things because it's so frustrating to deal with, but that's led to a 
bunch of breakages so it's obviously the wrong thing to do.

> I'm seriously grumpy about that. This is not how it works. If you break
> stuff, then you take care to fix it before you shove more changes into
> the tree and waste my time.
>
> I'm very much inclined to take the reverts right now, send them to Linus
> for -rc5 tagged with cc: stable and ignore/nak any irqchip related riscv
> patches until the next merge window is over.

Acked-by: Palmer Dabbelt <palmer@rivosinc.com>

if you want to take the revert.

IIUC the patch above doesn't actually fix it, that's what led to just 
sending the reverts -- at least reverts are better than breaking users.  
I'll post over there too...

And it's no big deal if we're in the doghouse for a bit.  Regressions 
should get fixed faster than this, so we deserve it.

Probably also another sign we're way too focused on getting new features 
merged, as that's coming at the expense of making existing platforms 
work.  IMO we've been way too focused on getting support for specs that 
don't even have implementations, and not enough on building real working 
systems.

> Emil, can you give that sun4i fix a test ride please?
>
> Thanks,
>
>         tglx
Thomas Gleixner Aug. 15, 2024, 6:10 p.m. UTC | #20
On Thu, Aug 15 2024 at 10:51, Palmer Dabbelt wrote:
> On Wed, 14 Aug 2024 10:30:48 PDT (-0700), tglx@linutronix.de wrote:
>> I'm very much inclined to take the reverts right now, send them to Linus
>> for -rc5 tagged with cc: stable and ignore/nak any irqchip related riscv
>> patches until the next merge window is over.
>
> Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
>
> if you want to take the revert.

I'm happy to wait a week and see whether someone gets that CLINT hack
working or as I suggested the D1 PLIC early probe quirk.

> IIUC the patch above doesn't actually fix it, that's what led to just 
> sending the reverts -- at least reverts are better than breaking users.  
> I'll post over there too...

Right. We figured that out by now :)

> And it's no big deal if we're in the doghouse for a bit.  Regressions 
> should get fixed faster than this, so we deserve it.

For a week I consider you probationers :)

> Probably also another sign we're way too focused on getting new features 
> merged, as that's coming at the expense of making existing platforms 
> work.  IMO we've been way too focused on getting support for specs that 
> don't even have implementations, and not enough on building real working 
> systems.

RISCV is not alone with that. This whole industry is nuts about features
and forgets the stuff what matters.

Thanks,

        tglx
Palmer Dabbelt Aug. 15, 2024, 11:04 p.m. UTC | #21
On Thu, 15 Aug 2024 11:10:25 PDT (-0700), tglx@linutronix.de wrote:
> On Thu, Aug 15 2024 at 10:51, Palmer Dabbelt wrote:
>> On Wed, 14 Aug 2024 10:30:48 PDT (-0700), tglx@linutronix.de wrote:
>>> I'm very much inclined to take the reverts right now, send them to Linus
>>> for -rc5 tagged with cc: stable and ignore/nak any irqchip related riscv
>>> patches until the next merge window is over.
>>
>> Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
>>
>> if you want to take the revert.
>
> I'm happy to wait a week and see whether someone gets that CLINT hack
> working or as I suggested the D1 PLIC early probe quirk.
>
>> IIUC the patch above doesn't actually fix it, that's what led to just
>> sending the reverts -- at least reverts are better than breaking users.
>> I'll post over there too...
>
> Right. We figured that out by now :)
>
>> And it's no big deal if we're in the doghouse for a bit.  Regressions
>> should get fixed faster than this, so we deserve it.
>
> For a week I consider you probationers :)

Works for me ;)

>> Probably also another sign we're way too focused on getting new features
>> merged, as that's coming at the expense of making existing platforms
>> work.  IMO we've been way too focused on getting support for specs that
>> don't even have implementations, and not enough on building real working
>> systems.
>
> RISCV is not alone with that. This whole industry is nuts about features
> and forgets the stuff what matters.
>
> Thanks,
>
>         tglx
Icenowy Zheng Aug. 16, 2024, 6:09 a.m. UTC | #22
在 2024-08-15星期四的 20:00 +0530,Anup Patel写道:
> On Thu, Aug 15, 2024 at 7:02 PM Samuel Holland
> <samuel.holland@sifive.com> wrote:
> > 
> > Hi Thomas, Emil,
> > 
> > On 2024-08-15 8:16 AM, Thomas Gleixner wrote:
> > > On Thu, Aug 15 2024 at 05:14, Emil Renner Berthing wrote:
> > > > Emil Renner Berthing wrote:
> > > > > 6.11-rc3 + these reverts: 
> > > > > https://us01.z.antigena.com/l/Er4kZWDmvL5-bLzHHJoZv0k71iwW2jCD5qNpiz0x0XdYY6oORF_nXh7U7jw6oubhi~32HI4i71jUW9v8~NvSvPeUWrdYx3WJBr2GPDUjOu6LYPCOBfR2dVQuMWvlNj4tDjXFp3QEQAmeawZflD4JrIJjtSYIbKfe6v-tgH7SEuHMeSSriU633Lv
> > > > > 6.11-rc3 + Samuel's patch:
> > > > > https://us01.z.antigena.com/l/EULtAYky6ZvgqZ49KGS-WBsYTg~Ht1NoQtEYmUVb56ymS9jDagqYHLK90WDjnVt69GfB4IX5NSRQXmSfkNsTzB8lJmFvDihHQmGrsCv9FzlorD9yGfXDlQ6rG6vmn5BNDwlipmssGaOGfh9yko8n9ArWR4TLhEf~f9ODqme~NXXwA9DLLc9p
> > > > 
> > > > I think this confirms what Charlie found here:
> > > > https://lore.kernel.org/linux-riscv/ZoydV7vad5JWIcZb@ghost/
> > > 
> > > Yes. So the riscv timer is not working on this thing or it stops
> > > somehow.
> > 
> > That's correct. With the (firmware) devicetree that Emil is using,
> > the OpenSBI
> > firmware does not have a timer device, so it does not expose the
> > (optional[1])
> > SBI time extension, and sbi_set_timer() does nothing.
> 
> OpenSBI uses platform specific M-mode timer (mtime and mtimecmp) to
> provide SBI time extension to Linux.
> 
> The RISC-V privileged specification (v1.10 or higher) requires
> platform to
> provide a M-mode timer (mtime and mtimecmp).

The T-Head cores' design is weird, and because of its source code is
available (as OpenC906), we can investigate it further in the RTL
level:

- From software perspective, it has no mtime mmap'ed register, but it
has mtimecmp, which compares with time CSR (a CSR R/O in all priv
levels).
- From SoC integration perspective, the value of the time CSR is
sourced from an external input, pad_cpu_sys_cnt[63:0] [1].

BTW I already added support for this kind of non-standard CLINT to
OpenSBI [2], however I don't think the current firmware DT of D1
utilizes it; and this is also a quite new SBI version I think.

[1]
https://github.com/XUANTIE-RV/openc906/blob/main/C906_RTL_FACTORY/gen_rtl/cpu/rtl/openC906.v#L84
[2]
https://github.com/riscv-software-src/opensbi/commit/b848d8763a737de44b64bfc036c8f51200226440

> 
> This platform not having any M-mode timer is yet another RISC-V spec
> violation by this platform.
> 
> Regards,
> Anup
> 
> > 
> > I wrote a patch (not submitted) to skip registering
> > riscv_clock_event when the
> > SBI time extension is unavailable, but this doesn't fully solve the
> > issue
> > either, because then we have no clockevent at all when
> > check_unaligned_access_all_cpus() is called.
> > 
> > How early in the boot process are we "required" to have a
> > functional clockevent?
> > Do we need to refactor check_unaligned_access_all_cpus() so it
> > works on systems
> > where the only clockevent is provided by a platform device?
> > 
> > Regards,
> > Samuel
> > 
> > [1]
> > https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/intro.adoc
> > 
> > > Can you apply the debug patch below and check whether you see the
> > > 'J: ....' output at all and if so whether it stops at some point.
> > > 
> > > Thanks,
> > > 
> > >         tglx
> > > 
> > > ---
> > > --- a/kernel/time/timer.c
> > > +++ b/kernel/time/timer.c
> > > @@ -2459,6 +2459,9 @@ static void run_local_timers(void)
> > >  {
> > >       struct timer_base *base =
> > > this_cpu_ptr(&timer_bases[BASE_LOCAL]);
> > > 
> > > +     if (!(jiffies & 0xFF))
> > > +             pr_info("J: %lx\n", jiffies);
> > > +
> > >       hrtimer_run_queues();
> > > 
> > >       for (int i = 0; i < NR_BASES; i++, base++) {
> > > 
> > > 
> > > 
> > > _______________________________________________
> > > linux-riscv mailing list
> > > linux-riscv@lists.infradead.org
> > > http://lists.infradead.org/mailman/listinfo/linux-riscv
> > > 
> > 
> 
> _______________________________________________
> linux-riscv mailing list
> linux-riscv@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv
Icenowy Zheng Aug. 16, 2024, 6:13 a.m. UTC | #23
在 2024-08-15星期四的 10:51 -0700,Palmer Dabbelt写道:
> On Thu, 15 Aug 2024 08:59:37 PDT (-0700),
> samuel.holland@sifive.com wrote:
> > Hi Emil,
> > 
> > On 2024-08-15 10:07 AM, Emil Renner Berthing wrote:
> > > Samuel Holland wrote:
> > > > On 2024-08-15 9:16 AM, Anup Patel wrote:
> > > > > On Thu, Aug 15, 2024 at 7:41 PM Thomas Gleixner
> > > > > <tglx@linutronix.de> wrote:
> > > > > > 
> > > > > > On Thu, Aug 15 2024 at 08:32, Samuel Holland wrote:
> > > > > > > On 2024-08-15 8:16 AM, Thomas Gleixner wrote:
> > > > > > > > Yes. So the riscv timer is not working on this thing or
> > > > > > > > it stops
> > > > > > > > somehow.
> > > > > > > 
> > > > > > > That's correct. With the (firmware) devicetree that Emil
> > > > > > > is using, the OpenSBI
> > > > > > > firmware does not have a timer device, so it does not
> > > > > > > expose the (optional[1])
> > > > > > > SBI time extension, and sbi_set_timer() does nothing.
> > > > > > 
> > > > > > Sigh. Does RISCV really have to repeat all mistakes which
> > > > > > have been made
> > > > > > by x86, ARM and others before? It's known for decades that
> > > > > > the kernel
> > > > > > relies on a working timer...
> 
> It's even worse than that: RISC-V doesn't even mandate any working 
> _instructions_, much less anything in the platform/firmware.
> 
> > > > > My apologies for the delay in finding a fix for this issue.
> > > > > 
> > > > > Almost all RISC-V platforms (except this one) have SBI Timer
> > > > > always
> > > > > available and Linux uses a better timer or Sstc extension
> > > > > whenever
> > > > > it is available.
> > > > 
> > > > So this is the immediate solution: add the CLINT to the
> > > > firmware devicetree so
> > > > that the SBI time extension works, and Linux will boot without
> > > > any code changes,
> > > > albeit with a higher-overhead clockevent device.
> > > 
> > > But this will mean that you can't update your kernel to v6.9 or
> > > newer without
> > > reflashing OpenSBI and u-boot. That's still a regression right?
> 
> Ya, I'd call that a regression.  Updating the firmware on these
> things 
> isn't generally something we can rely on users to do, we've worked 
> around other firmware bugs where we can to avoid forced updates.
> 
> > I suppose that depends on if you think the SBI time extension is
> > (or should have
> > been) mandatory for platforms without Sstc. If the SBI time
> > extension is
> > mandatory, then this is a firmware bug, and not really Linux's
> > responsibility to
> > work around.
> > 
> > If the SBI time extension is not mandatory, then Linux needs to be
> > able to
> > handle platforms where the S-mode visible timer is attached to an
> > external
> > interrupt controller (PLIC or APLIC), so the irqchip driver needs
> > to be loaded
> > before time_init() (timer_probe()). So in that case, the bug is a
> > Linux
> > regression, and we would need to revert the platform driver
> > conversion.
> 
> It doesn't really matter what the specs say (aka intended to say in 
> RISC-V land): if there's a regression then we have to deal with it.  
> It's not like whatever's written in the specs actually matters,
> vendors 
> can just do whatever they want, so wer'e just stuck making the known 
> implementations work.
> 
> So I think if the revert is the best fix then we should revert it.
> 
> That said: If the CLINT works, could we just add a probing quirk to
> make 

CLINT works, but will will never work in S mode by its design -- the
register used is all M-mode-only.

So it this kind of probing quirk is being added, the target will be
OpenSBI instead of Linux, and the problem of updating firmware still
exists.

> it appear on these systems even when it's not in the DT?  I'm
> thinking 
> something like adding a compatibly string to the CLINT driver for the
> SOC (or core or whatever, just something that's already there).  We'd
> probably need a bit of special-case probing code, but shouldn't be so
> bad.  We've got some other compatibility-oriented DT quirks floating 
> around.
> 
> > Regards,
> > Samuel
> 
> _______________________________________________
> linux-riscv mailing list
> linux-riscv@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv
Icenowy Zheng Aug. 16, 2024, 6:15 a.m. UTC | #24
在 2024-08-15星期四的 10:51 -0700,Palmer Dabbelt写道:
> On Wed, 14 Aug 2024 10:30:48 PDT (-0700), tglx@linutronix.de wrote:
> > On Wed, Aug 14 2024 at 16:56, Emil Renner Berthing wrote:
> > > As described in the thread below[1] I haven't been able to boot
> > > my
> > > boards based on the Allwinner D1 SoC since 6.9 where you
> > > converted the
> > > SiFive PLIC driver to a platform driver.
> > > 
> > > This is clearly a regression and there haven't really been much
> > > progress
> > > on fixing the issue since then, so here is the revert that fixes
> > > it.
> > > 
> > > If no other fix is found before 6.11 I suggest we apply this.
> > 
> > So this mess has been ignored for two month now?
> > 
> > > From the pastebin in the initial report:
> > 
> > [    0.000000] irq: no irq domain found for
> > interrupt-controller@10000000 !
> > [    0.000000] Failed to map interrupt for /soc/timer@2050000
> > [    0.000000] Failed to initialize '/soc/timer@2050000': -22
> > 
> > This comes back with -EINVAL. So the timer cannot find an
> > interrupt,
> > which makes it pretty obvious why the system stops to boot, unless
> > there
> > is some other timer available.
> > 
> > This is obviously related to the SUN4I_TIMER because that message
> > went
> > away when it was disabled according to the next pastebin.
> > 
> > Obviously that can't work because the SUN4I timer driver is using
> > timer_of_init() which cannot handle deferred probing.
> > 
> > Daniel: There was a partial fix for the sun4i driver, which you
> > said you
> > applied:
> > 
> >  
> > https://lore.kernel.org/all/20240312192519.1602493-1-samuel.holland@sifive.com
> > 
> > But that thing never materialized in a pull request.
> > 
> > And of course everyone involved ignored the problem since March
> > 13th
> > 2024, i.e. almost half a year.
> > 
> > Seriously?
> > 
> > Can you RISCV folks get your act together and ensure to fix things
> > you
> > broke on the way? Especially when Emil reported this nobody pointed
> > him
> > to this patch and nobody noticed that it's still not merged?
> > 
> > It took me less than 15 minutes to find that patch and the
> > correlation,
> > but this is absolutely not my job.
> 
> Sorry, I guess I'd just sort of been ignoring the platform-specific
> side 
> of things because it's so frustrating to deal with, but that's led to
> a 
> bunch of breakages so it's obviously the wrong thing to do.
> 
> > I'm seriously grumpy about that. This is not how it works. If you
> > break
> > stuff, then you take care to fix it before you shove more changes
> > into
> > the tree and waste my time.
> > 
> > I'm very much inclined to take the reverts right now, send them to
> > Linus
> > for -rc5 tagged with cc: stable and ignore/nak any irqchip related
> > riscv
> > patches until the next merge window is over.
> 
> Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
> 
> if you want to take the revert.
> 
> IIUC the patch above doesn't actually fix it, that's what led to just
> sending the reverts -- at least reverts are better than breaking
> users.  
> I'll post over there too...
> 
> And it's no big deal if we're in the doghouse for a bit.  Regressions
> should get fixed faster than this, so we deserve it.
> 
> Probably also another sign we're way too focused on getting new
> features 
> merged, as that's coming at the expense of making existing platforms 
> work.  IMO we've been way too focused on getting support for specs
> that 
> don't even have implementations, and not enough on building real
> working 
> systems.

Well I think all existing platforms are more or less weird (in
specification-compatibility, stability, etc). (Maybe FU540 isn't so
weird, but it has too few peripherals to be really useful, and it's
discontinued; FU740 has some stability issues.)

> 
> > Emil, can you give that sun4i fix a test ride please?
> > 
> > Thanks,
> > 
> >         tglx
> 
> _______________________________________________
> linux-riscv mailing list
> linux-riscv@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv
Palmer Dabbelt Aug. 18, 2024, 2:47 p.m. UTC | #25
On Wed, 14 Aug 2024 07:56:32 PDT (-0700), Renner Berthing wrote:
> Hi Anup,
>
> As described in the thread below[1] I haven't been able to boot my
> boards based on the Allwinner D1 SoC since 6.9 where you converted the
> SiFive PLIC driver to a platform driver.
>
> This is clearly a regression and there haven't really been much progress
> on fixing the issue since then, so here is the revert that fixes it.
>
> If no other fix is found before 6.11 I suggest we apply this.
>
> [1]: https://lore.kernel.org/linux-riscv/CAJM55Z9hGKo4784N3s3DhWw=nMRKZKcmvZ58x7uVBghExhoc9A@mail.gmail.com/
>
> /Emil
>
> Emil Renner Berthing (9):
>   Revert "irqchip/sifive-plic: Chain to parent IRQ after handlers are
>     ready"
>   Revert "irqchip/sifive-plic: Avoid explicit cpumask allocation on
>     stack"
>   Revert "irqchip/sifive-plic: Improve locking safety by using
>     irqsave/irqrestore"
>   Revert "irqchip/sifive-plic: Parse number of interrupts and contexts
>     early in plic_probe()"
>   Revert "irqchip/sifive-plic: Cleanup PLIC contexts upon irqdomain
>     creation failure"
>   Revert "irqchip/sifive-plic: Use riscv_get_intc_hwnode() to get parent
>     fwnode"
>   Revert "irqchip/sifive-plic: Use devm_xyz() for managed allocation"
>   Revert "irqchip/sifive-plic: Use dev_xyz() in-place of pr_xyz()"
>   Revert "irqchip/sifive-plic: Convert PLIC driver into a platform
>     driver"
>
>  drivers/irqchip/irq-sifive-plic.c | 296 ++++++++++++------------------
>  1 file changed, 117 insertions(+), 179 deletions(-)

I'm still only testing on the QEMU-emulated platforms, but this isn't 
regressing over there so

Tested-by: Palmer Dabbelt <palmer@rivosinc.com> # QEMU

Thanks!