diff mbox series

[PULL,10/15] tests/functional: fix race in virtio balloon test

Message ID 20250307115314.1096373-11-thuth@redhat.com (mailing list archive)
State New
Headers show
Series [PULL,01/15] tests/functional: remove unused 'bin_prefix' variable | expand

Commit Message

Thomas Huth March 7, 2025, 11:53 a.m. UTC
From: Daniel P. Berrangé <berrange@redhat.com>

There are two race conditions in the recently added virtio balloon
test

 * The /dev/vda device node is not ready
 * The virtio-balloon driver has not issued the first stats refresh

To fix the former, monitor dmesg for a line about 'vda'.

To fix the latter, retry the stats query until seeing fresh data.

Adding 'quiet' to the kernel command line reduces serial output
which otherwise slows boot, making it less likely to hit the former
race too.

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Message-ID: <20250304183340.3749797-1-berrange@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
[thuth: Break long line to avoid checkpatch error]
Signed-off-by: Thomas Huth <thuth@redhat.com>
---
 tests/functional/test_virtio_balloon.py | 26 ++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

Comments

Philippe Mathieu-Daudé March 7, 2025, 12:34 p.m. UTC | #1
Hi Thomas, Daniel,

On 7/3/25 12:53, Thomas Huth wrote:
> From: Daniel P. Berrangé <berrange@redhat.com>
> 
> There are two race conditions in the recently added virtio balloon
> test
> 
>   * The /dev/vda device node is not ready
>   * The virtio-balloon driver has not issued the first stats refresh
> 
> To fix the former, monitor dmesg for a line about 'vda'.
> 
> To fix the latter, retry the stats query until seeing fresh data.
> 
> Adding 'quiet' to the kernel command line reduces serial output
> which otherwise slows boot, making it less likely to hit the former
> race too.
> 
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> Message-ID: <20250304183340.3749797-1-berrange@redhat.com>
> Reviewed-by: Thomas Huth <thuth@redhat.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> [thuth: Break long line to avoid checkpatch error]
> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
>   tests/functional/test_virtio_balloon.py | 26 ++++++++++++++++++++-----
>   1 file changed, 21 insertions(+), 5 deletions(-)

This fix isn't really working, and is more of band-aid IIUC.

With the following patch restricting to KVM accel, do we still want it 
merged?
Thomas Huth March 7, 2025, 12:39 p.m. UTC | #2
On 07/03/2025 13.34, Philippe Mathieu-Daudé wrote:
> Hi Thomas, Daniel,
> 
> On 7/3/25 12:53, Thomas Huth wrote:
>> From: Daniel P. Berrangé <berrange@redhat.com>
>>
>> There are two race conditions in the recently added virtio balloon
>> test
>>
>>   * The /dev/vda device node is not ready
>>   * The virtio-balloon driver has not issued the first stats refresh
>>
>> To fix the former, monitor dmesg for a line about 'vda'.
>>
>> To fix the latter, retry the stats query until seeing fresh data.
>>
>> Adding 'quiet' to the kernel command line reduces serial output
>> which otherwise slows boot, making it less likely to hit the former
>> race too.
>>
>> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
>> Message-ID: <20250304183340.3749797-1-berrange@redhat.com>
>> Reviewed-by: Thomas Huth <thuth@redhat.com>
>> Reviewed-by: David Hildenbrand <david@redhat.com>
>> [thuth: Break long line to avoid checkpatch error]
>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>> ---
>>   tests/functional/test_virtio_balloon.py | 26 ++++++++++++++++++++-----
>>   1 file changed, 21 insertions(+), 5 deletions(-)
> 
> This fix isn't really working, and is more of band-aid IIUC.
> 
> With the following patch restricting to KVM accel, do we still want it merged?

Yes, I think it still makes this test more robust.

  Thomas
Philippe Mathieu-Daudé March 7, 2025, 12:42 p.m. UTC | #3
On 7/3/25 13:39, Thomas Huth wrote:
> On 07/03/2025 13.34, Philippe Mathieu-Daudé wrote:
>> Hi Thomas, Daniel,
>>
>> On 7/3/25 12:53, Thomas Huth wrote:
>>> From: Daniel P. Berrangé <berrange@redhat.com>
>>>
>>> There are two race conditions in the recently added virtio balloon
>>> test
>>>
>>>   * The /dev/vda device node is not ready
>>>   * The virtio-balloon driver has not issued the first stats refresh
>>>
>>> To fix the former, monitor dmesg for a line about 'vda'.
>>>
>>> To fix the latter, retry the stats query until seeing fresh data.
>>>
>>> Adding 'quiet' to the kernel command line reduces serial output
>>> which otherwise slows boot, making it less likely to hit the former
>>> race too.
>>>
>>> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
>>> Message-ID: <20250304183340.3749797-1-berrange@redhat.com>
>>> Reviewed-by: Thomas Huth <thuth@redhat.com>
>>> Reviewed-by: David Hildenbrand <david@redhat.com>
>>> [thuth: Break long line to avoid checkpatch error]
>>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>>> ---
>>>   tests/functional/test_virtio_balloon.py | 26 ++++++++++++++++++++-----
>>>   1 file changed, 21 insertions(+), 5 deletions(-)
>>
>> This fix isn't really working, and is more of band-aid IIUC.
>>
>> With the following patch restricting to KVM accel, do we still want it 
>> merged?
> 
> Yes, I think it still makes this test more robust.

Fine then, thanks.
Daniel P. Berrangé March 7, 2025, 12:45 p.m. UTC | #4
On Fri, Mar 07, 2025 at 01:39:17PM +0100, Thomas Huth wrote:
> On 07/03/2025 13.34, Philippe Mathieu-Daudé wrote:
> > Hi Thomas, Daniel,
> > 
> > On 7/3/25 12:53, Thomas Huth wrote:
> > > From: Daniel P. Berrangé <berrange@redhat.com>
> > > 
> > > There are two race conditions in the recently added virtio balloon
> > > test
> > > 
> > >   * The /dev/vda device node is not ready
> > >   * The virtio-balloon driver has not issued the first stats refresh
> > > 
> > > To fix the former, monitor dmesg for a line about 'vda'.
> > > 
> > > To fix the latter, retry the stats query until seeing fresh data.
> > > 
> > > Adding 'quiet' to the kernel command line reduces serial output
> > > which otherwise slows boot, making it less likely to hit the former
> > > race too.
> > > 
> > > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> > > Message-ID: <20250304183340.3749797-1-berrange@redhat.com>
> > > Reviewed-by: Thomas Huth <thuth@redhat.com>
> > > Reviewed-by: David Hildenbrand <david@redhat.com>
> > > [thuth: Break long line to avoid checkpatch error]
> > > Signed-off-by: Thomas Huth <thuth@redhat.com>
> > > ---
> > >   tests/functional/test_virtio_balloon.py | 26 ++++++++++++++++++++-----
> > >   1 file changed, 21 insertions(+), 5 deletions(-)
> > 
> > This fix isn't really working, and is more of band-aid IIUC.
> > 
> > With the following patch restricting to KVM accel, do we still want it merged?
> 
> Yes, I think it still makes this test more robust.

Agreed, even though this patch isn't perfect it is definitely fixing two
clear race conditions.


With regards,
Daniel
diff mbox series

Patch

diff --git a/tests/functional/test_virtio_balloon.py b/tests/functional/test_virtio_balloon.py
index 67b48e1b4e8..082bf08c4e8 100755
--- a/tests/functional/test_virtio_balloon.py
+++ b/tests/functional/test_virtio_balloon.py
@@ -32,7 +32,7 @@  class VirtioBalloonx86(QemuSystemTest):
         'e3c1b309d9203604922d6e255c2c5d098a309c2d46215d8fc026954f3c5c27a0')
 
     DEFAULT_KERNEL_PARAMS = ('root=/dev/vda1 console=ttyS0 net.ifnames=0 '
-                             'rd.rescue')
+                             'rd.rescue quiet')
 
     def wait_for_console_pattern(self, success_message, vm=None):
         wait_for_console_pattern(
@@ -47,6 +47,11 @@  def mount_root(self):
         prompt = '# '
         self.wait_for_console_pattern(prompt)
 
+        # Synchronize on virtio-block driver creating the root device
+        exec_command_and_wait_for_pattern(self,
+                        "while ! (dmesg -c | grep vda:) ; do sleep 1 ; done",
+                        "vda1")
+
         exec_command_and_wait_for_pattern(self, 'mount /dev/vda1 /sysroot',
                                           prompt)
         exec_command_and_wait_for_pattern(self, 'chroot /sysroot',
@@ -65,10 +70,21 @@  def assert_initial_stats(self):
             assert val == UNSET_STATS_VALUE
 
     def assert_running_stats(self, then):
-        ret = self.vm.qmp('qom-get',
-                          {'path': '/machine/peripheral/balloon',
-                           'property': 'guest-stats'})['return']
-        when = ret.get('last-update')
+        # We told the QEMU to refresh stats every 100ms, but
+        # there can be a delay between virtio-ballon driver
+        # being modprobed and seeing the first stats refresh
+        # Retry a few times for robustness under heavy load
+        retries = 10
+        when = 0
+        while when == 0 and retries:
+            ret = self.vm.qmp('qom-get',
+                              {'path': '/machine/peripheral/balloon',
+                               'property': 'guest-stats'})['return']
+            when = ret.get('last-update')
+            if when == 0:
+                retries = retries - 1
+                time.sleep(0.5)
+
         now = time.time()
 
         assert when > then and when < now