
[v1,02/11] ci: increase timeout for hw tests

Message ID 7578489af5c7df525d4c82231b68bbb7d955d2b4.1743678257.git-series.marmarek@invisiblethingslab.com
State New
Series Several CI cleanups and improvements, plus yet another new runner

Commit Message

Marek Marczykowski-Górecki April 3, 2025, 11:04 a.m. UTC
It appears that sometimes it takes longer for Xen to even start booting,
mostly due to firmware and to GRUB fetching large boot files. In some
jobs the current timeout is pretty close to the actual time needed, and
sometimes (rarely for now) a test fails because the timeout expires in the
middle of dom0 booting. This will happen more often as the
initramfs grows (and with more complex tests).
This has been observed on some dom0pvh-hvm jobs, at least on runners hw3
and hw11.

Increase the timeout by yet another 60s (up to 180s now).

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
---
 automation/scripts/qubes-x86-64.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Jan Beulich April 3, 2025, 11:32 a.m. UTC | #1
On 03.04.2025 13:04, Marek Marczykowski-Górecki wrote:
> It appears that sometimes it takes longer for Xen to even start booting,
> mostly due to firmware and to GRUB fetching large boot files. In some
> jobs the current timeout is pretty close to the actual time needed, and
> sometimes (rarely for now) a test fails because the timeout expires in the
> middle of dom0 booting. This will happen more often as the
> initramfs grows (and with more complex tests).

With that, ...

> This has been observed on some dom0pvh-hvm jobs, at least on runners hw3
> and hw11.
> 
> Increase the timeout by yet another 60s (up to 180s now).

... is such a small bump going to be sufficient? How about moving straight
to 5min?

As to observed failing jobs - the PV Dom0 boot failure seen today looks to
also be due to too short a timeout.

Jan

Marek Marczykowski-Górecki April 3, 2025, 12:25 p.m. UTC | #2
On Thu, Apr 03, 2025 at 01:32:38PM +0200, Jan Beulich wrote:
> On 03.04.2025 13:04, Marek Marczykowski-Górecki wrote:
> > It appears that sometimes it takes longer for Xen to even start booting,
> > mostly due to firmware and to GRUB fetching large boot files. In some
> > jobs the current timeout is pretty close to the actual time needed, and
> > sometimes (rarely for now) a test fails because the timeout expires in the
> > middle of dom0 booting. This will happen more often as the
> > initramfs grows (and with more complex tests).
> 
> With that, ...
> 
> > This has been observed on some dom0pvh-hvm jobs, at least on runners hw3
> > and hw11.
> > 
> > Increase the timeout by yet another 60s (up to 180s now).
> 
> ... is such a small bump going to be sufficient? How about moving straight
> to 5min?

I don't like this, as many (most) actual failures show up as a timeout
(for example, a panic that prevents reaching the Alpine prompt). One
improvement I can see is splitting this into two separate timeouts: one
for seeing the first line from Xen, and a second one for reaching the
Alpine login prompt. The first one can be longer, as it's mostly about
firmware + fetching boot files and shouldn't be hit on crashes (unless
a crash happens before anything is printed on the console, but those are
rare).
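The two-timeout split described above could look roughly like the following plain-shell sketch. It is not taken from qubes-x86-64.sh: the function name `wait_two_phase`, the log-file argument, and the patterns used in the usage line are hypothetical placeholders, and it assumes the serial console output is captured to a file that can be polled once per second.

```bash
#!/bin/bash
# Hypothetical sketch: wait for a captured serial log to reach two
# milestones, each with its own timeout budget (in seconds).
#   $1 - log file (hypothetical; the captured serial console)
#   $2 - pattern marking the first output from Xen
#   $3 - timeout for phase 1 (firmware + fetching boot files)
#   $4 - pattern marking a successful boot (e.g. a login prompt)
#   $5 - timeout for phase 2 (Xen + dom0 boot)
# Returns 0 on success, 1 if phase 1 times out, 2 if phase 2 times out.
wait_two_phase() {
    local log=$1 first_pat=$2 first_timeout=$3 done_pat=$4 done_timeout=$5

    # Phase 1: nothing from Xen yet. Crashes are rare this early,
    # so this budget can be generous.
    until grep -q -- "$first_pat" "$log" 2>/dev/null; do
        if [ "$first_timeout" -le 0 ]; then
            return 1
        fi
        sleep 1
        first_timeout=$((first_timeout - 1))
    done

    # Phase 2: Xen has started printing. A tighter budget here still
    # catches hangs and panics that never reach the login prompt.
    until grep -q -- "$done_pat" "$log" 2>/dev/null; do
        if [ "$done_timeout" -le 0 ]; then
            return 2
        fi
        sleep 1
        done_timeout=$((done_timeout - 1))
    done
    return 0
}
```

A call might look like `wait_two_phase serial.log "Xen version" 180 "login:" 120` (all arguments hypothetical). Polling with `grep` keeps the sketch free of extra dependencies, at the cost of up to one second of extra latency per phase; the distinct return codes let a caller report which phase timed out.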

> As to observed failing jobs - the PV Dom0 boot failure seen today looks to
> also be due to too short a timeout.

As I responded on Matrix, I'm not so sure: there is a wait of over 1m after the
"Built 1 zonelists, mobility grouping on.  Total pages: 8228487" line
from dom0 (or a bit later, due to buffering by sed), while in a successful
test the next lines follow instantaneously.

Stefano Stabellini April 4, 2025, 12:21 a.m. UTC | #3
On Thu, 3 Apr 2025, Marek Marczykowski-Górecki wrote:
> On Thu, Apr 03, 2025 at 01:32:38PM +0200, Jan Beulich wrote:
> > On 03.04.2025 13:04, Marek Marczykowski-Górecki wrote:
> > > It appears that sometimes it takes longer for Xen to even start booting,
> > > mostly due to firmware and to GRUB fetching large boot files. In some
> > > jobs the current timeout is pretty close to the actual time needed, and
> > > sometimes (rarely for now) a test fails because the timeout expires in the
> > > middle of dom0 booting. This will happen more often as the
> > > initramfs grows (and with more complex tests).
> > 
> > With that, ...
> > 
> > > This has been observed on some dom0pvh-hvm jobs, at least on runners hw3
> > > and hw11.
> > > 
> > > Increase the timeout by yet another 60s (up to 180s now).
> > 
> > ... is such a small bump going to be sufficient? How about moving straight
> > to 5min?

Hi Marek, would you be up for moving your script to use "expect"?
Something like ./automation/scripts/console.exp?

That way, we would immediately complete the job no matter the timeout
value. It is also nicer :-)
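To illustrate what an expect-style approach buys, here is a minimal sketch in plain shell (not expect, and not the contents of console.exp). The wait ends as soon as either a success or a failure pattern shows up, so the timeout only matters for genuine hangs. The function name `wait_for_outcome`, the log file, and both patterns (a login prompt and Xen's "Panic on CPU" message) are assumptions for illustration.

```bash
#!/bin/bash
# Hypothetical sketch: finish as soon as the outcome is known.
#   $1 - log file (hypothetical; the captured serial console)
#   $2 - success pattern (assumed, e.g. a login prompt)
#   $3 - failure pattern (assumed, e.g. "Panic on CPU")
#   $4 - overall timeout in seconds, only reached on a real hang
# Returns 0 on success, 1 on a matched failure, 2 on timeout.
wait_for_outcome() {
    local log=$1 ok_pat=$2 fail_pat=$3 timeout=$4

    while :; do
        # Check the failure pattern first, so a log containing both
        # patterns counts as a failure.
        if grep -q -- "$fail_pat" "$log" 2>/dev/null; then
            return 1
        fi
        if grep -q -- "$ok_pat" "$log" 2>/dev/null; then
            return 0
        fi
        if [ "$timeout" -le 0 ]; then
            return 2
        fi
        sleep 1
        timeout=$((timeout - 1))
    done
}
```

With something like `wait_for_outcome serial.log "login:" "Panic on CPU" 180` (arguments hypothetical), a panic ends the job right away instead of surfacing as a timeout, which also makes a larger timeout value cheap.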


> I don't like this, as many (most) actual failures show up as a timeout
> (for example, a panic that prevents reaching the Alpine prompt). One
> improvement I can see is splitting this into two separate timeouts: one
> for seeing the first line from Xen, and a second one for reaching the
> Alpine login prompt. The first one can be longer, as it's mostly about
> firmware + fetching boot files and shouldn't be hit on crashes (unless
> a crash happens before anything is printed on the console, but those are
> rare).

This is also something you can very specifically tweak with expect.

Marek Marczykowski-Górecki April 4, 2025, 12:35 a.m. UTC | #4
On Thu, Apr 03, 2025 at 05:21:56PM -0700, Stefano Stabellini wrote:
> Hi Marek, would you be up for moving your script to use "expect"?
> Something like ./automation/scripts/console.exp?
> 
> That way, we would immediately complete the job no matter the timeout
> value. It is also nicer :-)

Oh, this is an excellent idea; I'll see how it fits there. It seems I'll
need to at least add support for wakeup for the suspend tests, but that
shouldn't be a problem.

Patch

diff --git a/automation/scripts/qubes-x86-64.sh b/automation/scripts/qubes-x86-64.sh
index 8e78b7984e98..771c77d6618b 100755
--- a/automation/scripts/qubes-x86-64.sh
+++ b/automation/scripts/qubes-x86-64.sh
@@ -17,7 +17,7 @@  test_variant=$1
 ### defaults
 extra_xen_opts=
 wait_and_wakeup=
-timeout=120
+timeout=180
 domU_type="pvh"
 domU_vif="'bridge=xenbr0',"
 domU_extra_config=