diff mbox

hw/char/cmsdk-apb-timer: Correctly identify and set one-shot mode

Message ID 1529374119-27015-1-git-send-email-linux@roeck-us.net (mailing list archive)
State New, archived
Headers show

Commit Message

Guenter Roeck June 19, 2018, 2:08 a.m. UTC
The CMSDK APB timer is currently always configured as periodic timer.
This results in the following messages when trying to boot Linux.

Timer with delta zero, disabling

If the timer limit set with the RELOAD command is 0, the timer
needs to be enabled as one-shot timer.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
---
 hw/timer/cmsdk-apb-timer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Peter Maydell June 26, 2018, 5:17 p.m. UTC | #1
On 19 June 2018 at 03:08, Guenter Roeck <linux@roeck-us.net> wrote:
> The CMSDK APB timer is currently always configured as periodic timer.
> This results in the following messages when trying to boot Linux.
>
> Timer with delta zero, disabling
>
> If the timer limit set with the RELOAD command is 0, the timer
> needs to be enabled as one-shot timer.
>
> Signed-off-by: Guenter Roeck <linux@roeck-us.net>
> ---
>  hw/timer/cmsdk-apb-timer.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/timer/cmsdk-apb-timer.c b/hw/timer/cmsdk-apb-timer.c
> index 9878746..79c1b49 100644
> --- a/hw/timer/cmsdk-apb-timer.c
> +++ b/hw/timer/cmsdk-apb-timer.c
> @@ -119,7 +119,7 @@ static void cmsdk_apb_timer_write(void *opaque, hwaddr offset, uint64_t value,
>          }
>          s->ctrl = value & 0xf;
>          if (s->ctrl & R_CTRL_EN_MASK) {
> -            ptimer_run(s->timer, 0);
> +            ptimer_run(s->timer, ptimer_get_limit(s->timer) == 0);
>          } else {
>              ptimer_stop(s->timer);
>          }
> --
> 2.7.4

Thanks for this patch. I was wondering whether it would be better
just to remove the fprintf message instead. I'll either apply
this or send a patch to do that before 3.0, anyway.

I think we also want to make sure we convert back to a
periodic timer if the reload register is written with a nonzero
value before the timer expires (and that if that happens after
the timer expired that we restart the timer). Thinking about
this is also on my todo list.

-- PMM
Guenter Roeck June 26, 2018, 5:59 p.m. UTC | #2
On Tue, Jun 26, 2018 at 06:17:44PM +0100, Peter Maydell wrote:
> On 19 June 2018 at 03:08, Guenter Roeck <linux@roeck-us.net> wrote:
> > The CMSDK APB timer is currently always configured as periodic timer.
> > This results in the following messages when trying to boot Linux.
> >
> > Timer with delta zero, disabling
> >
> > If the timer limit set with the RELOAD command is 0, the timer
> > needs to be enabled as one-shot timer.
> >
> > Signed-off-by: Guenter Roeck <linux@roeck-us.net>
> > ---
> >  hw/timer/cmsdk-apb-timer.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/hw/timer/cmsdk-apb-timer.c b/hw/timer/cmsdk-apb-timer.c
> > index 9878746..79c1b49 100644
> > --- a/hw/timer/cmsdk-apb-timer.c
> > +++ b/hw/timer/cmsdk-apb-timer.c
> > @@ -119,7 +119,7 @@ static void cmsdk_apb_timer_write(void *opaque, hwaddr offset, uint64_t value,
> >          }
> >          s->ctrl = value & 0xf;
> >          if (s->ctrl & R_CTRL_EN_MASK) {
> > -            ptimer_run(s->timer, 0);
> > +            ptimer_run(s->timer, ptimer_get_limit(s->timer) == 0);
> >          } else {
> >              ptimer_stop(s->timer);
> >          }
> > --
> > 2.7.4
> 
> Thanks for this patch. I was wondering whether it would be better
> just to remove the fprintf message instead. I'll either apply
> this or send a patch to do that before 3.0, anyway.
> 

If I recall correctly, I tried that, and it did not help.
The messages don't happen too often, and the message itself
does not cause a problem. Issue is that the interrupts happen
at the wrong time or not at all (after a while, ie after the
configured one-shot time expires), and the kernel really doesn't
like that.

I think the underlying problem was that the periodic timer counts
the period down (based on the time set for the one-shot timer),
stops with the "Timer with delta zero, disabling" message once
the period reaches 0, and does not fire anymore afterwards.
As a result, the kernel fails to boot maybe 90% of the time.
I should probably have mentioned that in more detail in the
commit log.

> I think we also want to make sure we convert back to a
> periodic timer if the reload register is written with a nonzero
> value before the timer expires (and that if that happens after
> the timer expired that we restart the timer). Thinking about
> this is also on my todo list.
> 

I thought that was handled in the Linux driver by disabling
the timer, updating the period, and then re-enabling it in
periodic mode, but I may have misinterpreted the code.

Guenter
Peter Maydell June 26, 2018, 6:10 p.m. UTC | #3
On 26 June 2018 at 18:59, Guenter Roeck <linux@roeck-us.net> wrote:
> On Tue, Jun 26, 2018 at 06:17:44PM +0100, Peter Maydell wrote:
>> Thanks for this patch. I was wondering whether it would be better
>> just to remove the fprintf message instead. I'll either apply
>> this or send a patch to do that before 3.0, anyway.
>>
>
> If I recall correctly, I tried that, and it did not help.
> The messages don't happen too often, and the message itself
> does not cause a problem. Issue is that the interrupts happen
> at the wrong time or not at all (after a while, ie after the
> configured one-shot time expires), and the kernel really doesn't
> like that.
>
> I think the underlying problem was that the periodic timer counts
> the period down (based on the time set for the one-shot timer),
> stops with the "Timer with delta zero, disabling" message once
> the period reaches 0, and does not fire anymore afterwards.
> As a result, the kernel fails to boot maybe 90% of the time.
> I should probably have mentioned that in more detail in the
> commit log.

Hmm, that's odd, because I don't really see what the difference
between the two is. If you set the thing as a one-shot then we
won't reenable the timer later either with this patch.

Can you provide an image/QEMU command line that repros this,
and I'll see if I can find time to investigate it? (We have
a softfreeze deadline next Tuesday, so I probably won't be
able to get to it until after that, but since this is a bugfix
it doesn't have to be done before freeze.)

>> periodic timer if the reload register is written with a nonzero
>> value before the timer expires (and that if that happens after
>> the timer expired that we restart the timer). Thinking about
>> this is also on my todo list.
>>
>
> I thought that was handled in the Linux driver by disabling
> the timer, updating the period, and then re-enabling it in
> periodic mode, but I may have misinterpreted the code.

Yeah, it would depend what the guest does whether it's affected,
but my reading of the data sheet is that messing with the reload
register with the timer still active isn't prohibited.

thanks
-- PMM
Guenter Roeck June 26, 2018, 6:43 p.m. UTC | #4
On Tue, Jun 26, 2018 at 07:10:45PM +0100, Peter Maydell wrote:
> On 26 June 2018 at 18:59, Guenter Roeck <linux@roeck-us.net> wrote:
> > On Tue, Jun 26, 2018 at 06:17:44PM +0100, Peter Maydell wrote:
> >> Thanks for this patch. I was wondering whether it would be better
> >> just to remove the fprintf message instead. I'll either apply
> >> this or send a patch to do that before 3.0, anyway.
> >>
> >
> > If I recall correctly, I tried that, and it did not help.
> > The messages don't happen too often, and the message itself
> > does not cause a problem. Issue is that the interrupts happen
> > at the wrong time or not at all (after a while, ie after the
> > configured one-shot time expires), and the kernel really doesn't
> > like that.
> >
> > I think the underlying problem was that the periodic timer counts
> > the period down (based on the time set for the one-shot timer),
> > stops with the "Timer with delta zero, disabling" message once
> > the period reaches 0, and does not fire anymore afterwards.
> > As a result, the kernel fails to boot maybe 90% of the time.
> > I should probably have mentioned that in more detail in the
> > commit log.
> 
> Hmm, that's odd, because I don't really see what the difference
> between the two is. If you set the thing as a one-shot then we
> won't reenable the timer later either with this patch.
> 
> Can you provide an image/QEMU command line that repros this,
> and I'll see if I can find time to investigate it? (We have
> a softfreeze deadline next Tuesday, so I probably won't be
> able to get to it until after that, but since this is a bugfix
> it doesn't have to be done before freeze.)
> 

I had added some debug messages into the timer code and noticed
the behavior. If I recall correctly, for periodic timers, the timer
period counts down in ptimer_reload(). At some point, it reaches
zero, and the timer stops working. The behavior is different
between one-shot and periodic timers; I think that count-down
does not happen with one-shot timers. This also has to do with the
timer limit which is 0 for one-shot timers but != 0 for periodic
timers and is used in various locations in hw/core/ptimer.c.

This was with Cortex-M3 where you said you won't accept my patches
to add initrd/dtb/bios support, so the qemu version would not be
an official one and has functionality not available in upstream qemu
(it has other changes but those are relevant for reproducing the
problem). My qemu clone is at git@github.com:groeck/qemu.git in branch
v2.12.0-local. It includes this patch. The test scripts are at
git@github.com:groeck/linux-build-test.git, in the rootfs/arm/
subdirectory.

I can try to reproduce it myself tonight and send you a log,
but no promises.

Guenter

> >> periodic timer if the reload register is written with a nonzero
> >> value before the timer expires (and that if that happens after
> >> the timer expired that we restart the timer). Thinking about
> >> this is also on my todo list.
> >>
> >
> > I thought that was handled in the Linux driver by disabling
> > the timer, updating the period, and then re-enabling it in
> > periodic mode, but I may have misinterpreted the code.
> 
> Yeah, it would depend what the guest does whether it's affected,
> but my reading of the data sheet is that messing with the reload
> register with the timer still active isn't prohibited.
> 
> thanks
> -- PMM
Guenter Roeck June 26, 2018, 8 p.m. UTC | #5
On Tue, Jun 26, 2018 at 07:10:45PM +0100, Peter Maydell wrote:
> On 26 June 2018 at 18:59, Guenter Roeck <linux@roeck-us.net> wrote:
> > On Tue, Jun 26, 2018 at 06:17:44PM +0100, Peter Maydell wrote:
> >> Thanks for this patch. I was wondering whether it would be better
> >> just to remove the fprintf message instead. I'll either apply
> >> this or send a patch to do that before 3.0, anyway.
> >>
> >
> > If I recall correctly, I tried that, and it did not help.
> > The messages don't happen too often, and the message itself
> > does not cause a problem. Issue is that the interrupts happen
> > at the wrong time or not at all (after a while, ie after the
> > configured one-shot time expires), and the kernel really doesn't
> > like that.
> >
> > I think the underlying problem was that the periodic timer counts
> > the period down (based on the time set for the one-shot timer),
> > stops with the "Timer with delta zero, disabling" message once
> > the period reaches 0, and does not fire anymore afterwards.
> > As a result, the kernel fails to boot maybe 90% of the time.
> > I should probably have mentioned that in more detail in the
> > commit log.
> 
> Hmm, that's odd, because I don't really see what the difference
> between the two is. If you set the thing as a one-shot then we
> won't reenable the timer later either with this patch.
> 
> Can you provide an image/QEMU command line that repros this,
> and I'll see if I can find time to investigate it? (We have
> a softfreeze deadline next Tuesday, so I probably won't be
> able to get to it until after that, but since this is a bugfix
> it doesn't have to be done before freeze.)
> 

Here is a log, with debug messages added to ptimer code.

Without my patch:

ptimer_tick(0x555556edac70): enabled=1 delta=250000 limit=250000
ptimer_reload(555556edac70): frac=0 period=40 delta=250000 limit=250000 adjust=1
ptimer_tick(0x555556edac70): enabled=1 delta=250000 limit=250000
ptimer_reload(555556edac70): frac=0 period=40 delta=250000 limit=250000 adjust=1
[    0.497942] clocksource: Switched to clocksource mps2-clksrc^M
ptimer_tick(0x555556edac70): enabled=1 delta=250000 limit=250000
ptimer_reload(555556edac70): frac=0 period=40 delta=250000 limit=250000 adjust=1

# up to here timer is in periodic mode and fires on a regular basis

# prepare to switch to one-shot mode
ptimer_reload(555556edac70): frac=0 period=40 delta=0 limit=0 adjust=0

# set timer for one-shot mode, start
ptimer_set_count(0x555556edac70, 92004
ptimer_reload(555556edac70): frac=0 period=40 delta=92004 limit=0 adjust=0

# one-shot mode fires
ptimer_tick(0x555556edac70): enabled=1 delta=92004 limit=0

# timer is in periodic mode, limit is 0 (from one-shot mode)
# reload propagates limit -> delta, making it 0
ptimer_reload(555556edac70): frac=0 period=40 delta=0 limit=0 adjust=-1

# and the timer is disabled as result.
0x555556edac70: Timer with delta zero, disabling

ptimer_set_count(0x555556edac70, 199340
ptimer_reload(555556edac70): frac=0 period=40 delta=199340 limit=0 adjust=0
[    0.514458] NET: Registered protocol family 2^M
ptimer_tick(0x555556edac70): enabled=1 delta=199340 limit=0
ptimer_reload(555556edac70): frac=0 period=40 delta=0 limit=0 adjust=-1
0x555556edac70: Timer with delta zero, disabling
ptimer_set_count(0x555556edac70, 240488
ptimer_reload(555556edac70): frac=0 period=40 delta=240488 limit=0 adjust=0
[    0.526096] tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 4096
bytes)^M
[    0.526533] TCP established hash table entries: 1024 (order: 0, 4096 bytes)^M
[    0.526943] TCP bind hash table entries: 1024 (order: 0, 4096 bytes)^M
ptimer_tick(0x555556edac70): enabled=1 delta=240488 limit=0
ptimer_reload(555556edac70): frac=0 period=40 delta=0 limit=0 adjust=-1
0x555556edac70: Timer with delta zero, disabling

And so on. The system may finish booting or lock up. Which one it is seems
to be random.

Same log with my patch applied:

ptimer_tick(0x555556edac70): enabled=1 delta=250000 limit=250000
ptimer_reload(555556edac70): frac=0 period=40 delta=250000 limit=250000 adjust=1

# switch to one-shot mode
ptimer_reload(555556edac70): frac=0 period=40 delta=0 limit=0 adjust=0

@ set counter, start
ptimer_set_count(0x555556edac70, 24
ptimer_reload(555556edac70): frac=0 period=40 delta=24 limit=0 adjust=0

# tick
ptimer_tick(0x555556edac70): enabled=2 delta=24 limit=0

# update counter, restart
ptimer_set_count(0x555556edac70, 209498
ptimer_reload(555556edac70): frac=0 period=40 delta=209498 limit=0 adjust=0
[    0.509883] NET: Registered protocol family 2^M

# tick
ptimer_tick(0x555556edac70): enabled=2 delta=209498 limit=0

# update counter, restart
ptimer_set_count(0x555556edac70, 217500
ptimer_reload(555556edac70): frac=0 period=40 delta=217500 limit=0 adjust=0

# tick
ptimer_tick(0x555556edac70): enabled=2 delta=217500 limit=0

# update counter, restart
ptimer_set_count(0x555556edac70, 236941
ptimer_reload(555556edac70): frac=0 period=40 delta=236941 limit=0 adjust=0
[    0.529228] tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 4096 bytes)^M
[    0.529661] TCP established hash table entries: 1024 (order: 0, 4096 bytes)^M
[    0.530059] TCP bind hash table entries: 1024 (order: 0, 4096 bytes)^M
[    0.530535] TCP: Hash tables configured (established 1024 bind 1024)^M

# tick
ptimer_tick(0x555556edac70): enabled=2 delta=236941 limit=0

# update counter, restart
ptimer_set_count(0x555556edac70, 245934
ptimer_reload(555556edac70): frac=0 period=40 delta=245934 limit=0 adjust=0

and so on.

Hope this helps,

Guenter
Peter Maydell July 2, 2018, 3:05 p.m. UTC | #6
On 26 June 2018 at 21:00, Guenter Roeck <linux@roeck-us.net> wrote:
> Here is a log, with debug messages added to ptimer code.

Thanks for the logging. (I am currently also messing about
with buildroot to try to reproduce -- I have a kernel that
boots, but since it has no initial filesystem it hits the
"stop because I can't find init" before anything interesting
happens.)

> Without my patch:
>
> ptimer_tick(0x555556edac70): enabled=1 delta=250000 limit=250000
> ptimer_reload(555556edac70): frac=0 period=40 delta=250000 limit=250000 adjust=1
> ptimer_tick(0x555556edac70): enabled=1 delta=250000 limit=250000
> ptimer_reload(555556edac70): frac=0 period=40 delta=250000 limit=250000 adjust=1
> [    0.497942] clocksource: Switched to clocksource mps2-clksrc^M
> ptimer_tick(0x555556edac70): enabled=1 delta=250000 limit=250000
> ptimer_reload(555556edac70): frac=0 period=40 delta=250000 limit=250000 adjust=1
>
> # up to here timer is in periodic mode and fires on a regular basis
>
> # prepare to switch to one-shot mode
> ptimer_reload(555556edac70): frac=0 period=40 delta=0 limit=0 adjust=0
>
> # set timer for one-shot mode, start
> ptimer_set_count(0x555556edac70, 92004
> ptimer_reload(555556edac70): frac=0 period=40 delta=92004 limit=0 adjust=0
>
> # one-shot mode fires
> ptimer_tick(0x555556edac70): enabled=1 delta=92004 limit=0
>
> # timer is in periodic mode, limit is 0 (from one-shot mode)
> # reload propagates limit -> delta, making it 0
> ptimer_reload(555556edac70): frac=0 period=40 delta=0 limit=0 adjust=-1
>
> # and the timer is disabled as result.
> 0x555556edac70: Timer with delta zero, disabling
>
> ptimer_set_count(0x555556edac70, 199340
> ptimer_reload(555556edac70): frac=0 period=40 delta=199340 limit=0 adjust=0
> [    0.514458] NET: Registered protocol family 2^M
> ptimer_tick(0x555556edac70): enabled=1 delta=199340 limit=0
> ptimer_reload(555556edac70): frac=0 period=40 delta=0 limit=0 adjust=-1
> 0x555556edac70: Timer with delta zero, disabling
> ptimer_set_count(0x555556edac70, 240488
> ptimer_reload(555556edac70): frac=0 period=40 delta=240488 limit=0 adjust=0
> [    0.526096] tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 4096
> bytes)^M
> [    0.526533] TCP established hash table entries: 1024 (order: 0, 4096 bytes)^M
> [    0.526943] TCP bind hash table entries: 1024 (order: 0, 4096 bytes)^M
> ptimer_tick(0x555556edac70): enabled=1 delta=240488 limit=0
> ptimer_reload(555556edac70): frac=0 period=40 delta=0 limit=0 adjust=-1
> 0x555556edac70: Timer with delta zero, disabling

OK, but I can't see anything actually going wrong here. Once
the guest sets the RELOAD register to 0, it gets one-shot timers, and
needs to write to VALUE each time to restart the timer. We seem to get
calls to set_count which set different counts, and a a tick for each
set_count, so that looks like it's working to me. (But see below...)

> And so on. The system may finish booting or lock up. Which one it is seems
> to be random.
>
> Same log with my patch applied:
>
> ptimer_tick(0x555556edac70): enabled=1 delta=250000 limit=250000
> ptimer_reload(555556edac70): frac=0 period=40 delta=250000 limit=250000 adjust=1
>
> # switch to one-shot mode
> ptimer_reload(555556edac70): frac=0 period=40 delta=0 limit=0 adjust=0
>
> @ set counter, start
> ptimer_set_count(0x555556edac70, 24
> ptimer_reload(555556edac70): frac=0 period=40 delta=24 limit=0 adjust=0
>
> # tick
> ptimer_tick(0x555556edac70): enabled=2 delta=24 limit=0
>
> # update counter, restart
> ptimer_set_count(0x555556edac70, 209498
> ptimer_reload(555556edac70): frac=0 period=40 delta=209498 limit=0 adjust=0
> [    0.509883] NET: Registered protocol family 2^M
>
> # tick
> ptimer_tick(0x555556edac70): enabled=2 delta=209498 limit=0
>
> # update counter, restart
> ptimer_set_count(0x555556edac70, 217500
> ptimer_reload(555556edac70): frac=0 period=40 delta=217500 limit=0 adjust=0
>
> # tick
> ptimer_tick(0x555556edac70): enabled=2 delta=217500 limit=0

This doesn't seem to be behaving any differently (apart from
having silenced the fprintf about the zero delta).

My reading of the code in ptimer_tick() is that the behaviour
for "this was setup as a one-shot" is in effect the same as
for "this was setup as periodic but the reload value is 0"
is the same: it sets enabled to 0 and delta to 0.

One thing I do notice reading the code is that we don't
actually do anything to re-enable the timer on a write to
the VALUE register. We call ptimer_set_count(), but that will
not reenable the timer if it was disabled -- the cmsdk timer
code needs to call ptimer_run() if necessary, I think. But
that bug should manifest whether we're setting the timer up
as a one-shot or as a periodic with a 0 limit, because
ptimer_tick() will disable the timer when it fires whichever
way we do that.

thanks
-- PMM
Peter Maydell July 2, 2018, 3:29 p.m. UTC | #7
On 2 July 2018 at 16:05, Peter Maydell <peter.maydell@linaro.org> wrote:
> One thing I do notice reading the code is that we don't
> actually do anything to re-enable the timer on a write to
> the VALUE register. We call ptimer_set_count(), but that will
> not reenable the timer if it was disabled -- the cmsdk timer
> code needs to call ptimer_run() if necessary, I think. But
> that bug should manifest whether we're setting the timer up
> as a one-shot or as a periodic with a 0 limit, because
> ptimer_tick() will disable the timer when it fires whichever
> way we do that.

...but as it happens the Linux driver doesn't rely on this:
https://elixir.bootlin.com/linux/v4.18-rc3/source/drivers/clocksource/mps2-timer.c#L66
mps2_timer_set_next_event() always writes both VALUE and
the CTRL register, so it will re-enable the timer anyway
and avoid this QEMU bug.

thanks
-- PMM
Guenter Roeck July 2, 2018, 4:27 p.m. UTC | #8
On Mon, Jul 02, 2018 at 04:05:54PM +0100, Peter Maydell wrote:
> On 26 June 2018 at 21:00, Guenter Roeck <linux@roeck-us.net> wrote:
> > Here is a log, with debug messages added to ptimer code.
> 
> Thanks for the logging. (I am currently also messing about
> with buildroot to try to reproduce -- I have a kernel that
> boots, but since it has no initial filesystem it hits the
> "stop because I can't find init" before anything interesting
> happens.)
> 
> > Without my patch:
> >
> > ptimer_tick(0x555556edac70): enabled=1 delta=250000 limit=250000
> > ptimer_reload(555556edac70): frac=0 period=40 delta=250000 limit=250000 adjust=1
> > ptimer_tick(0x555556edac70): enabled=1 delta=250000 limit=250000
> > ptimer_reload(555556edac70): frac=0 period=40 delta=250000 limit=250000 adjust=1
> > [    0.497942] clocksource: Switched to clocksource mps2-clksrc^M
> > ptimer_tick(0x555556edac70): enabled=1 delta=250000 limit=250000
> > ptimer_reload(555556edac70): frac=0 period=40 delta=250000 limit=250000 adjust=1
> >
> > # up to here timer is in periodic mode and fires on a regular basis
> >
> > # prepare to switch to one-shot mode
> > ptimer_reload(555556edac70): frac=0 period=40 delta=0 limit=0 adjust=0
> >
> > # set timer for one-shot mode, start
> > ptimer_set_count(0x555556edac70, 92004
> > ptimer_reload(555556edac70): frac=0 period=40 delta=92004 limit=0 adjust=0
> >
> > # one-shot mode fires
> > ptimer_tick(0x555556edac70): enabled=1 delta=92004 limit=0
> >
> > # timer is in periodic mode, limit is 0 (from one-shot mode)
> > # reload propagates limit -> delta, making it 0
> > ptimer_reload(555556edac70): frac=0 period=40 delta=0 limit=0 adjust=-1
> >
> > # and the timer is disabled as result.
> > 0x555556edac70: Timer with delta zero, disabling
> >
> > ptimer_set_count(0x555556edac70, 199340
> > ptimer_reload(555556edac70): frac=0 period=40 delta=199340 limit=0 adjust=0
> > [    0.514458] NET: Registered protocol family 2^M
> > ptimer_tick(0x555556edac70): enabled=1 delta=199340 limit=0
> > ptimer_reload(555556edac70): frac=0 period=40 delta=0 limit=0 adjust=-1
> > 0x555556edac70: Timer with delta zero, disabling
> > ptimer_set_count(0x555556edac70, 240488
> > ptimer_reload(555556edac70): frac=0 period=40 delta=240488 limit=0 adjust=0
> > [    0.526096] tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 4096
> > bytes)^M
> > [    0.526533] TCP established hash table entries: 1024 (order: 0, 4096 bytes)^M
> > [    0.526943] TCP bind hash table entries: 1024 (order: 0, 4096 bytes)^M
> > ptimer_tick(0x555556edac70): enabled=1 delta=240488 limit=0
> > ptimer_reload(555556edac70): frac=0 period=40 delta=0 limit=0 adjust=-1
> > 0x555556edac70: Timer with delta zero, disabling
> 
> OK, but I can't see anything actually going wrong here. Once
> the guest sets the RELOAD register to 0, it gets one-shot timers, and
> needs to write to VALUE each time to restart the timer. We seem to get
> calls to set_count which set different counts, and a a tick for each
> set_count, so that looks like it's working to me. (But see below...)
> 

Sometimes the timer doesn't fire or is not reloaded when expected, causing
the system to stall. It also seemed to me that the timer fires at the wrong
time, ie at the period instead of the selected one-time value, but I may
misinterpret the log.

> > And so on. The system may finish booting or lock up. Which one it is seems
> > to be random.
> >
> > Same log with my patch applied:
> >
> > ptimer_tick(0x555556edac70): enabled=1 delta=250000 limit=250000
> > ptimer_reload(555556edac70): frac=0 period=40 delta=250000 limit=250000 adjust=1
> >
> > # switch to one-shot mode
> > ptimer_reload(555556edac70): frac=0 period=40 delta=0 limit=0 adjust=0
> >
> > @ set counter, start
> > ptimer_set_count(0x555556edac70, 24
> > ptimer_reload(555556edac70): frac=0 period=40 delta=24 limit=0 adjust=0
> >
> > # tick
> > ptimer_tick(0x555556edac70): enabled=2 delta=24 limit=0
> >
> > # update counter, restart
> > ptimer_set_count(0x555556edac70, 209498
> > ptimer_reload(555556edac70): frac=0 period=40 delta=209498 limit=0 adjust=0
> > [    0.509883] NET: Registered protocol family 2^M
> >
> > # tick
> > ptimer_tick(0x555556edac70): enabled=2 delta=209498 limit=0
> >
> > # update counter, restart
> > ptimer_set_count(0x555556edac70, 217500
> > ptimer_reload(555556edac70): frac=0 period=40 delta=217500 limit=0 adjust=0
> >
> > # tick
> > ptimer_tick(0x555556edac70): enabled=2 delta=217500 limit=0
> 
> This doesn't seem to be behaving any differently (apart from
> having silenced the fprintf about the zero delta).
> 
> My reading of the code in ptimer_tick() is that the behaviour
> for "this was setup as a one-shot" is in effect the same as
> for "this was setup as periodic but the reload value is 0"
> is the same: it sets enabled to 0 and delta to 0.
> 

To me the second log looks more predictive than the first, but that
may just be me.

Again, in my observation the boot sometimes stalls without my
patch. Maybe there is a better fix, but the code as-is doesn't
work for me. All I can say is that the current code does not set
one-shot mode when it should, and that seems wrong.

You could of course take out the zero delta warning. Personally
I would caution against that - whenever I have seen that warning,
it suggested that something is wrong with the emulation, and it
helped me track down the problem. Your call, of course. Also,
I don't think taking out the warning would fix my stall problem.

Thanks,
Guenter

> One thing I do notice reading the code is that we don't
> actually do anything to re-enable the timer on a write to
> the VALUE register. We call ptimer_set_count(), but that will
> not reenable the timer if it was disabled -- the cmsdk timer
> code needs to call ptimer_run() if necessary, I think. But
> that bug should manifest whether we're setting the timer up
> as a one-shot or as a periodic with a 0 limit, because
> ptimer_tick() will disable the timer when it fires whichever
> way we do that.
> 
> thanks
> -- PMM
Peter Maydell July 2, 2018, 4:46 p.m. UTC | #9
On 2 July 2018 at 17:27, Guenter Roeck <linux@roeck-us.net> wrote:
> Again, in my observation the boot sometimes stalls without my
> patch. Maybe there is a better fix, but the code as-is doesn't
> work for me. All I can say is that the current code does not set
> one-shot mode when it should, and that seems wrong.
>
> You could of course take out the zero delta warning. Personally
> I would caution against that - whenever I have seen that warning,
> it suggested that something is wrong with the emulation, and it
> helped me track down the problem. Your call, of course. Also,
> I don't think taking out the warning would fix my stall problem.

Can you provide me with something I can reproduce with, then,
please? My kernel + buildroot don't repro the problem: it always
seems to boot fine. I had a look at your testbuild git repo
but couldn't manage to figure out how to build a binary
from it. I'm ok with building a QEMU from your branch for
testing this, but I'd rather not try to figure out how to
reproduce your kernel and root filesystem setup.

thanks
-- PMM
Guenter Roeck July 2, 2018, 5:05 p.m. UTC | #10
On Mon, Jul 02, 2018 at 05:46:39PM +0100, Peter Maydell wrote:
> On 2 July 2018 at 17:27, Guenter Roeck <linux@roeck-us.net> wrote:
> > Again, in my observation the boot sometimes stalls without my
> > patch. Maybe there is a better fix, but the code as-is doesn't
> > work for me. All I can say is that the current code does not set
> > one-shot mode when it should, and that seems wrong.
> >
> > You could of course take out the zero delta warning. Personally
> > I would caution against that - whenever I have seen that warning,
> > it suggested that something is wrong with the emulation, and it
> > helped me track down the problem. Your call, of course. Also,
> > I don't think taking out the warning would fix my stall problem.
> 
> Can you provide me with something I can reproduce with, then,
> please? My kernel + buildroot don't repro the problem: it always
> seems to boot fine. I had a look at your testbuild git repo
> but couldn't manage to figure out how to build a binary
> from it. I'm ok with building a QEMU from your branch for
> testing this, but I'd rather not try to figure out how to
> reproduce your kernel and root filesystem setup.
> 

I don't think I'll have more time available to spend on this
anytime soon. If it works for you, just drop the patch. I can
keep it local in my branch. My version of qemu is different anyway,
after all.

Thanks,
Guenter
Peter Maydell July 2, 2018, 5:10 p.m. UTC | #11
On 2 July 2018 at 18:05, Guenter Roeck <linux@roeck-us.net> wrote:
> I don't think I'll have more time available to spend on this
> anytime soon. If it works for you, just drop the patch. I can
> keep it local in my branch. My version of qemu is different anyway,
> after all.

I'd kind of like to fix the bug, though. I'd just like you
to provide the image you're testing with...

thanks
-- PMM
Guenter Roeck July 2, 2018, 5:57 p.m. UTC | #12
On Mon, Jul 02, 2018 at 06:10:36PM +0100, Peter Maydell wrote:
> On 2 July 2018 at 18:05, Guenter Roeck <linux@roeck-us.net> wrote:
> > I don't think I'll have more time available to spend on this
> > anytime soon. If it works for you, just drop the patch. I can
> > keep it local in my branch. My version of qemu is different anyway,
> > after all.
> 
> I'd kind of like to fix the bug, though. I'd just like you
> to provide the image you're testing with...
> 

Ah, but that won't help. We are dealing with a heisenbug, and me using
a different host than you (or even a different gcc version) may be
sufficient to make the problem disappear on your system.

The mps2 images use high resolution timers, which is likely what is
triggering here. My working assumption is that those timers trigger at
the wrong time because the timer is set up in periodic instead of one-time
mode. With that, behavior is more or less random. On some systems it will
work, on some systems it won't. Since we can not agree that setting the
timer in periodic mode instead of one-shot mode is a problem, I don't
really see how we can proceed. I would somehow have to prove to you that
there is an observable problem in the first place, well beyond of what
I already did. That might take a lot of time - much more than I would
be able to spend on it.

Guenter
Peter Maydell July 2, 2018, 6:06 p.m. UTC | #13
On 2 July 2018 at 18:57, Guenter Roeck <linux@roeck-us.net> wrote:
> On Mon, Jul 02, 2018 at 06:10:36PM +0100, Peter Maydell wrote:
>> On 2 July 2018 at 18:05, Guenter Roeck <linux@roeck-us.net> wrote:
>> > I don't think I'll have more time available to spend on this
>> > anytime soon. If it works for you, just drop the patch. I can
>> > keep it local in my branch. My version of qemu is different anyway,
>> > after all.
>>
>> I'd kind of like to fix the bug, though. I'd just like you
>> to provide the image you're testing with...
>>
>
> Ah, but that won't help. We are dealing with a heisenbug, and me using
> a different host than you (or even a different gcc version) may be
> sufficient to make the problem disappear on your system.

True, but we can at least rule out some differences between
your setup and mine. (For instance, your kernel config and
version and rootfs are all going to be different from what
I've been testing.)

> The mps2 images use high resolution timers, which is likely what is
> triggering here. My working assumption is that those timers trigger at
> the wrong time because the timer is set up in periodic instead of one-time
> mode. With that, behavior is more or less random. On some systems it will
> work, on some systems it won't. Since we can not agree that setting the
> timer in periodic mode instead of one-shot mode is a problem, I don't
> really see how we can proceed. I would somehow have to prove to you that
> there is an observable problem in the first place, well beyond of what
> I already did. That might take a lot of time - much more than I would
> be able to spend on it.

I certainly believe you that there's an observable problem.
I just disagree about your proposed fix, because I don't see
from the code why it would make any difference. So I want to
get to a point where I can run exactly the same thing you're
running on my system to have a look at what's happening.

thanks
-- PMM
Guenter Roeck July 2, 2018, 6:37 p.m. UTC | #14
On Mon, Jul 02, 2018 at 07:06:17PM +0100, Peter Maydell wrote:
> On 2 July 2018 at 18:57, Guenter Roeck <linux@roeck-us.net> wrote:
> > On Mon, Jul 02, 2018 at 06:10:36PM +0100, Peter Maydell wrote:
> >> On 2 July 2018 at 18:05, Guenter Roeck <linux@roeck-us.net> wrote:
> >> > I don't think I'll have more time available to spend on this
> >> > anytime soon. If it works for you, just drop the patch. I can
> >> > keep it local in my branch. My version of qemu is different anyway,
> >> > after all.
> >>
> >> I'd kind of like to fix the bug, though. I'd just like you
> >> to provide the image you're testing with...
> >>
> >
> > Ah, but that won't help. We are dealing with a heisenbug, and me using
> > a different host than you (or even a different gcc version) may be
> > sufficient to make the problem disappear on your system.
> 
> True, but we can at least rule out some differences between
> your setup and mine. (For instance, your kernel config and
> version and rootfs are all going to be different from what
> I've been testing.)
> 
> > The mps2 images use high resolution timers, which is likely what is
> > triggering here. My working assumption is that those timers trigger at
> > the wrong time because the timer is set up in periodic instead of one-time
> > mode. With that, behavior is more or less random. On some systems it will
> > work, on some systems it won't. Since we can not agree that setting the
> > timer in periodic mode instead of one-shot mode is a problem, I don't
> > really see how we can proceed. I would somehow have to prove to you that
> > there is an observable problem in the first place, well beyond of what
> > I already did. That might take a lot of time - much more than I would
> > be able to spend on it.
> 
> I certainly believe you that there's an observable problem.
> I just disagree about your proposed fix, because I don't see
> from the code why it would make any difference. So I want to
> get to a point where I can run exactly the same thing you're
> running on my system to have a look at what's happening.
> 

Please try http://server.roeck-us.net/qemu/an385/

vmlinux is plain v4.18-rc3 compiled with the provided defconfig and
the provided toolchain (position independent, created with buildroot).
defconfig is mps2_defconfig+CONFIG_BLK_DEV_INITRD=y. Older kernels work
as well (I tested as far back as 4.9.y), with the same configuration.

qemu is from the master-local branch in my repository, built with
https://github.com/groeck/linux-build-test/blob/master/qemu/buildall.sh.

The boot wrapper was generated from
https://github.com/groeck/linux-build-test/tree/master/rootfs/arm/arm-m3-bootwrapper/.

run.sh should boot the system into a shell (at least it does for me).

Guenter
Peter Maydell July 3, 2018, 5:06 p.m. UTC | #15
On 2 July 2018 at 19:37, Guenter Roeck <linux@roeck-us.net> wrote:
> Please try http://server.roeck-us.net/qemu/an385/
>
> vmlinux is plain v4.18-rc3 compiled with the provided defconfig and
> the provided toolchain (position independent, created with buildroot).
> defconfig is mps2_defconfig+CONFIG_BLK_DEV_INITRD=y. Older kernels work
> as well (I tested as far back as 4.9.y), with the same configuration.
>
> qemu is from the master-local branch in my repository, built with
> https://github.com/groeck/linux-build-test/blob/master/qemu/buildall.sh.
>
> The boot wrapper was generated from
> https://github.com/groeck/linux-build-test/tree/master/rootfs/arm/arm-m3-bootwrapper/.
>
> run.sh should boot the system into a shell (at least it does for me).

Thanks. Unfortunately I tried running that a bunch of times with a
version of your QEMU with the 'set one-shot mode' patch reverted,
and it still booted every time.

However, I think I agree with you that it's better for the timer
device to avoid doing things that result in the ptimer code emitting
warnings rather than just dropping the ptimer warning, so I'm going
to take this patch upstream.

I also sorted through some of the other corner case issues
(eg restarting a stopped one-shot timer by writing to VALUE,
and whether an interrupt should be generated if VALUE is
written as zero). I'm going to send a patchset in a moment that
includes this patch and those fixes. I've tested that set by
applying it to your master-local/ branch and using your testcase
here, as well as by using my buildroot/uclinux image on an
upstream QEMU (and a few other RTOS images I had to hand).

thanks
-- PMM
Guenter Roeck July 3, 2018, 6:49 p.m. UTC | #16
On Tue, Jul 03, 2018 at 06:06:24PM +0100, Peter Maydell wrote:
> On 2 July 2018 at 19:37, Guenter Roeck <linux@roeck-us.net> wrote:
> > Please try http://server.roeck-us.net/qemu/an385/
> >
> > vmlinux is plain v4.18-rc3 compiled with the provided defconfig and
> > the provided toolchain (position independent, created with buildroot).
> > defconfig is mps2_defconfig+CONFIG_BLK_DEV_INITRD=y. Older kernels work
> > as well (I tested as far back as 4.9.y), with the same configuration.
> >
> > qemu is from the master-local branch in my repository, built with
> > https://github.com/groeck/linux-build-test/blob/master/qemu/buildall.sh.
> >
> > The boot wrapper was generated from
> > https://github.com/groeck/linux-build-test/tree/master/rootfs/arm/arm-m3-bootwrapper/.
> >
> > run.sh should boot the system into a shell (at least it does for me).
> 
> Thanks. Unfortunately I tried running that a bunch of times with a
> version of your QEMU with the 'set one-shot mode' patch reverted,
> and it still booted every time.
> 

Oddly enough, I now seem to be unable to reproduce the problem either.
Weird, when I wrote the patch it was absolutely necessary.

The only other change I can think of would be the code to set the
serial baud rate in the boot loader, which I did not have initially.
Is it possible that the hangup could have been caused by that ? 

Is it worth tracking it down, or are we ok ?

> However, I think I agree with you that it's better for the timer
> device to avoid doing things that result in the ptimer code emitting
> warnings rather than just dropping the ptimer warning, so I'm going
> to take this patch upstream.
> 
> I also sorted through some of the other corner case issues
> (eg restarting a stopped one-shot timer by writing to VALUE,
> and whether an interrupt should be generated if VALUE is
> written as zero). I'm going to send a patchset in a moment that
> includes this patch and those fixes. I've tested that set by
> applying it to your master-local/ branch and using your testcase
> here, as well as by using my buildroot/uclinux image on an
> upstream QEMU (and a few other RTOS images I had to hand).
> 

The series WFM as well. I'll send out a Tested-by: in a minute.

Thanks,
Guenter
Peter Maydell July 3, 2018, 8:08 p.m. UTC | #17
On 3 July 2018 at 19:49, Guenter Roeck <linux@roeck-us.net> wrote:
> Oddly enough, I now seem to be unable to reproduce the problem either.
> Weird, when I wrote the patch it was absolutely necessary.
>
> The only other change I can think of would be the code to set the
> serial baud rate in the boot loader, which I did not have initially.
> Is it possible that the hangup could have been caused by that ?

Seems unlikely -- unless you're actually connecting the emulated
serial port to a real host serial port, we don't do anything with
the baudrate in the cmsdk uart model except complain if the
guest hasn't set it to something valid.

> Is it worth tracking it down, or are we ok ?

It does worry me a little, but if neither you nor I can repro
it at the moment there's not a lot we can do about it I think.
Looking at the ptimer and cmsdk timer code I think it *should*
now implement what the hardware does...

If you see the problem again in future let me know and we
can have another go at figuring out what's going on.

> The series WFM as well. I'll send out a Tested-by: in a minute.

Thanks for the testing.

-- PMM
diff mbox

Patch

diff --git a/hw/timer/cmsdk-apb-timer.c b/hw/timer/cmsdk-apb-timer.c
index 9878746..79c1b49 100644
--- a/hw/timer/cmsdk-apb-timer.c
+++ b/hw/timer/cmsdk-apb-timer.c
@@ -119,7 +119,7 @@  static void cmsdk_apb_timer_write(void *opaque, hwaddr offset, uint64_t value,
         }
         s->ctrl = value & 0xf;
         if (s->ctrl & R_CTRL_EN_MASK) {
-            ptimer_run(s->timer, 0);
+            ptimer_run(s->timer, ptimer_get_limit(s->timer) == 0);
         } else {
             ptimer_stop(s->timer);
         }