diff mbox series

[v1] selftests: openvswitch: retry instead of sleep

Message ID 20240708134451.3489802-1-amorenoz@redhat.com (mailing list archive)
State New
Headers show
Series [v1] selftests: openvswitch: retry instead of sleep | expand

Commit Message

Adrián Moreno July 8, 2024, 1:44 p.m. UTC
There are a couple of places where the test script "sleep"s to wait for
some external condition to be met.

This is error prone, specially in slow systems (identified in CI by
"KSFT_MACHINE_SLOW=yes").

To fix this, add a "ovs_wait" function that tries to execute a command
a few times until it succeeds. The timeout used is set to 5s for
"normal" systems and doubled if a slow CI machine is detected.

This should make the following work:

$ vng --build  \
    --config tools/testing/selftests/net/config \
    --config kernel/configs/debug.config

$ vng --run . --user root -- "make -C tools/testing/selftests/ \
    KSFT_MACHINE_SLOW=yes TARGETS=net/openvswitch run_tests"

Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
---
 .../selftests/net/openvswitch/openvswitch.sh  | 49 ++++++++++++++++---
 .../selftests/net/openvswitch/ovs-dpctl.py    |  1 +
 2 files changed, 42 insertions(+), 8 deletions(-)

Comments

Aaron Conole July 8, 2024, 6:27 p.m. UTC | #1
Adrian Moreno <amorenoz@redhat.com> writes:

> There are a couple of places where the test script "sleep"s to wait for
> some external condition to be met.
>
> This is error prone, specially in slow systems (identified in CI by
> "KSFT_MACHINE_SLOW=yes").
>
> To fix this, add a "ovs_wait" function that tries to execute a command
> a few times until it succeeds. The timeout used is set to 5s for
> "normal" systems and doubled if a slow CI machine is detected.
>
> This should make the following work:
>
> $ vng --build  \
>     --config tools/testing/selftests/net/config \
>     --config kernel/configs/debug.config
>
> $ vng --run . --user root -- "make -C tools/testing/selftests/ \
>     KSFT_MACHINE_SLOW=yes TARGETS=net/openvswitch run_tests"
>
> Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
> ---

Looks like this does resolve the issue in question on the -dbg
environment:

https://netdev.bots.linux.dev/contest.html?executor=vmksft-net-dbg&test=openvswitch-sh

Thanks Adrian!  Also, thanks for including the fractional sleep.

Reviewed-by: Aaron Conole <aconole@redhat.com>
Adrián Moreno July 8, 2024, 7:24 p.m. UTC | #2
On Mon, Jul 08, 2024 at 02:27:45PM GMT, Aaron Conole wrote:
> Adrian Moreno <amorenoz@redhat.com> writes:
>
> > There are a couple of places where the test script "sleep"s to wait for
> > some external condition to be met.
> >
> > This is error prone, specially in slow systems (identified in CI by
> > "KSFT_MACHINE_SLOW=yes").
> >
> > To fix this, add a "ovs_wait" function that tries to execute a command
> > a few times until it succeeds. The timeout used is set to 5s for
> > "normal" systems and doubled if a slow CI machine is detected.
> >
> > This should make the following work:
> >
> > $ vng --build  \
> >     --config tools/testing/selftests/net/config \
> >     --config kernel/configs/debug.config
> >
> > $ vng --run . --user root -- "make -C tools/testing/selftests/ \
> >     KSFT_MACHINE_SLOW=yes TARGETS=net/openvswitch run_tests"
> >
> > Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
> > ---
>
> Looks like this does resolve the issue in question on the -dbg
> environment:
>
> https://netdev.bots.linux.dev/contest.html?executor=vmksft-net-dbg&test=openvswitch-sh

Nice! I guess the 10s global timeout is enough for now.

>
> Thanks Adrian!  Also, thanks for including the fractional sleep.
>
> Reviewed-by: Aaron Conole <aconole@redhat.com>
>

Thanks.

I just realized that I've missed the target branch ("net-next") in the
subject. I'll wait a bit and respin to fix that.

Adrián
Ilya Maximets July 8, 2024, 7:31 p.m. UTC | #3
On 7/8/24 15:44, Adrian Moreno wrote:
> There are a couple of places where the test script "sleep"s to wait for
> some external condition to be met.
> 
> This is error prone, specially in slow systems (identified in CI by
> "KSFT_MACHINE_SLOW=yes").
> 
> To fix this, add a "ovs_wait" function that tries to execute a command
> a few times until it succeeds. The timeout used is set to 5s for
> "normal" systems and doubled if a slow CI machine is detected.
> 
> This should make the following work:
> 
> $ vng --build  \
>     --config tools/testing/selftests/net/config \
>     --config kernel/configs/debug.config
> 
> $ vng --run . --user root -- "make -C tools/testing/selftests/ \
>     KSFT_MACHINE_SLOW=yes TARGETS=net/openvswitch run_tests"
> 
> Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
> ---
>  .../selftests/net/openvswitch/openvswitch.sh  | 49 ++++++++++++++++---
>  .../selftests/net/openvswitch/ovs-dpctl.py    |  1 +
>  2 files changed, 42 insertions(+), 8 deletions(-)
> 

Hi, Adrian.  See a small pile of nitpicks below.

None of them are blocking from my perspective, except for a typo.
Just listed them since there is a typo anyway.

> diff --git a/tools/testing/selftests/net/openvswitch/openvswitch.sh b/tools/testing/selftests/net/openvswitch/openvswitch.sh
> index bc71dbc18b21..83407b42073a 100755
> --- a/tools/testing/selftests/net/openvswitch/openvswitch.sh
> +++ b/tools/testing/selftests/net/openvswitch/openvswitch.sh
> @@ -11,6 +11,7 @@ ksft_skip=4
>  PAUSE_ON_FAIL=no
>  VERBOSE=0
>  TRACING=0
> +WAIT_TIMEOUT=5
>  
>  tests="
>  	arp_ping				eth-arp: Basic arp ping between two NS
> @@ -29,6 +30,32 @@ info() {
>  	[ $VERBOSE = 0 ] || echo $*
>  }
>  
> +ovs_wait() {
> +	info "waiting $WAIT_TIMEOUT s for: $@"
> +
> +	"$@"
> +	if [[ $? -eq 0 ]]; then

Maybe just 'if "$@"; then' ?

> +		info "wait succeeded inmediately"

* immediately

> +		return 0
> +	fi
> +
> +	# A quick re-check helps speed up small races in fast systems.
> +	# However, fractional sleeps might not necessarily work.
> +	local start=0
> +	sleep 0.1 || { sleep 1; start=1; }
> +
> +	for (( i=start; i<WAIT_TIMEOUT; i++ )); do

for i in $(seq ${start} ${WAIT_TIMEOUT}); do

Will need to initialize start to 1 and 2.

It works, but seems like an unnecessary use of non-POSIX constructs.

> +		"$@"
> +		if [[ $? -eq 0 ]]; then

if "$@"; then

> +			info "wait succeeded after $i seconds"
> +			return 0
> +		fi
> +		sleep 1
> +	done
> +	info "wait failed after $i seconds"
> +	return 1
> +}
> +
>  ovs_base=`pwd`
>  sbxs=
>  sbx_add () {
> @@ -278,20 +305,21 @@ test_psample() {
>  
>  	# Record psample data.
>  	ovs_spawn_daemon "test_psample" python3 $ovs_base/ovs-dpctl.py psample-events
> +	ovs_wait grep -q "listening for psample events" ${ovs_dir}/stdout
>  
>  	# Send a single ping.
> -	sleep 1
>  	ovs_sbx "test_psample" ip netns exec client ping -I c1 172.31.110.20 -c 1 || return 1
> -	sleep 1
>  
>  	# We should have received one userspace action upcall and 2 psample packets.
> -	grep -E "userspace action command" $ovs_dir/s0.out >/dev/null 2>&1 || return 1
> +	ovs_wait grep -q "userspace action command" $ovs_dir/s0.out
> +	[[ $? -eq 0 ]] || return 1

Why checking separately and not one the same line with || return 1 ?
Also double brackets seem unnecessary.

>  
>  	# client -> server samples should only contain the first 14 bytes of the packet.
> -	grep -E "rate:4294967295,group:1,cookie:c0ffee data:[0-9a-f]{28}$" \
> -			 $ovs_dir/stdout >/dev/null 2>&1 || return 1
> -	grep -E "rate:4294967295,group:2,cookie:eeff0c" \
> -			 $ovs_dir/stdout >/dev/null 2>&1 || return 1
> +	ovs_wait grep -qE "rate:4294967295,group:1,cookie:c0ffee data:[0-9a-f]{28}$" $ovs_dir/stdout
> +	[[ $? -eq 0 ]] || return 1
> +
> +	ovs_wait grep -q "rate:4294967295,group:2,cookie:eeff0c" $ovs_dir/stdout
> +	[[ $? -eq 0 ]] || return 1

Same for above two.

>  
>  	return 0
>  }
> @@ -711,7 +739,8 @@ test_upcall_interfaces() {
>  	ovs_add_netns_and_veths "test_upcall_interfaces" ui0 upc left0 l0 \
>  	    172.31.110.1/24 -u || return 1
>  
> -	sleep 1
> +	ovs_wait grep -q "listening on upcall packet handler" ${ovs_dir}/left0.out
> +
>  	info "sending arping"
>  	ip netns exec upc arping -I l0 172.31.110.20 -c 1 \
>  	    >$ovs_dir/arping.stdout 2>$ovs_dir/arping.stderr
> @@ -811,6 +840,10 @@ shift $(($OPTIND-1))
>  IFS="	
>  "
>  
> +if test "X$KSFT_MACHINE_SLOW" == "Xyes"; then
> +	WAIT_TIMEOUT=10
> +fi

Should this be done closer to the first initialization of WAIT_TIMEOUT ?

> +
>  for arg do
>  	# Check first that all requested tests are available before running any
>  	command -v > /dev/null "test_${arg}" || { echo "=== Test ${arg} not found"; usage; }
> diff --git a/tools/testing/selftests/net/openvswitch/ovs-dpctl.py b/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
> index 1e15b0818074..8a0396bfaf99 100644
> --- a/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
> +++ b/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
> @@ -2520,6 +2520,7 @@ class PsampleEvent(EventSocket):
>      marshal_class = psample_msg
>  
>      def read_samples(self):
> +        print("listening for psample events", flush=True)
>          while True:
>              try:
>                  for msg in self.get():
Adrián Moreno July 8, 2024, 9:40 p.m. UTC | #4
On Mon, Jul 08, 2024 at 09:31:58PM GMT, Ilya Maximets wrote:
> On 7/8/24 15:44, Adrian Moreno wrote:
> > There are a couple of places where the test script "sleep"s to wait for
> > some external condition to be met.
> >
> > This is error prone, specially in slow systems (identified in CI by
> > "KSFT_MACHINE_SLOW=yes").
> >
> > To fix this, add a "ovs_wait" function that tries to execute a command
> > a few times until it succeeds. The timeout used is set to 5s for
> > "normal" systems and doubled if a slow CI machine is detected.
> >
> > This should make the following work:
> >
> > $ vng --build  \
> >     --config tools/testing/selftests/net/config \
> >     --config kernel/configs/debug.config
> >
> > $ vng --run . --user root -- "make -C tools/testing/selftests/ \
> >     KSFT_MACHINE_SLOW=yes TARGETS=net/openvswitch run_tests"
> >
> > Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
> > ---
> >  .../selftests/net/openvswitch/openvswitch.sh  | 49 ++++++++++++++++---
> >  .../selftests/net/openvswitch/ovs-dpctl.py    |  1 +
> >  2 files changed, 42 insertions(+), 8 deletions(-)
> >
>
> Hi, Adrian.  See a small pile of nitpicks below.
>
> None of them are blocking from my perspective, except for a typo.
> Just listed them since there is a typo anyway.
>
> > diff --git a/tools/testing/selftests/net/openvswitch/openvswitch.sh b/tools/testing/selftests/net/openvswitch/openvswitch.sh
> > index bc71dbc18b21..83407b42073a 100755
> > --- a/tools/testing/selftests/net/openvswitch/openvswitch.sh
> > +++ b/tools/testing/selftests/net/openvswitch/openvswitch.sh
> > @@ -11,6 +11,7 @@ ksft_skip=4
> >  PAUSE_ON_FAIL=no
> >  VERBOSE=0
> >  TRACING=0
> > +WAIT_TIMEOUT=5
> >
> >  tests="
> >  	arp_ping				eth-arp: Basic arp ping between two NS
> > @@ -29,6 +30,32 @@ info() {
> >  	[ $VERBOSE = 0 ] || echo $*
> >  }
> >
> > +ovs_wait() {
> > +	info "waiting $WAIT_TIMEOUT s for: $@"
> > +
> > +	"$@"
> > +	if [[ $? -eq 0 ]]; then
>
> Maybe just 'if "$@"; then' ?
>

In my head this is a bit less clean but I don't mind.

> > +		info "wait succeeded inmediately"
>
> * immediately

Thanks. Will fix the typo.

>
> > +		return 0
> > +	fi
> > +
> > +	# A quick re-check helps speed up small races in fast systems.
> > +	# However, fractional sleeps might not necessarily work.
> > +	local start=0
> > +	sleep 0.1 || { sleep 1; start=1; }
> > +
> > +	for (( i=start; i<WAIT_TIMEOUT; i++ )); do
>
> for i in $(seq ${start} ${WAIT_TIMEOUT}); do
>
> Will need to initialize start to 1 and 2.
>
> It works, but seems like an unnecessary use of non-POSIX constructs.

The reason why I chose this form is that I find it more robust on a
script that changes IFS. If this function is called within a block that
has changed IFS, "i" will take the entire sequence as the value for the
first iteration.

>
> > +		"$@"
> > +		if [[ $? -eq 0 ]]; then
>
> if "$@"; then
>
> > +			info "wait succeeded after $i seconds"
> > +			return 0
> > +		fi
> > +		sleep 1
> > +	done
> > +	info "wait failed after $i seconds"
> > +	return 1
> > +}
> > +
> >  ovs_base=`pwd`
> >  sbxs=
> >  sbx_add () {
> > @@ -278,20 +305,21 @@ test_psample() {
> >
> >  	# Record psample data.
> >  	ovs_spawn_daemon "test_psample" python3 $ovs_base/ovs-dpctl.py psample-events
> > +	ovs_wait grep -q "listening for psample events" ${ovs_dir}/stdout
> >
> >  	# Send a single ping.
> > -	sleep 1
> >  	ovs_sbx "test_psample" ip netns exec client ping -I c1 172.31.110.20 -c 1 || return 1
> > -	sleep 1
> >
> >  	# We should have received one userspace action upcall and 2 psample packets.
> > -	grep -E "userspace action command" $ovs_dir/s0.out >/dev/null 2>&1 || return 1
> > +	ovs_wait grep -q "userspace action command" $ovs_dir/s0.out
> > +	[[ $? -eq 0 ]] || return 1
>
> Why checking separately and not one the same line with || return 1 ?

IMHO, passing complex commands to a function in bash can easily get very
problematic. That's why I try to remove all pipes, redirections or
logical operators like && and ||. At least for me it removes one extra
cycle that my brain has to spend looking at quotes and figuring out if
the operand will be interpreted inside the function or outside.

> Also double brackets seem unnecessary.

That's true.

>
> >
> >  	# client -> server samples should only contain the first 14 bytes of the packet.
> > -	grep -E "rate:4294967295,group:1,cookie:c0ffee data:[0-9a-f]{28}$" \
> > -			 $ovs_dir/stdout >/dev/null 2>&1 || return 1
> > -	grep -E "rate:4294967295,group:2,cookie:eeff0c" \
> > -			 $ovs_dir/stdout >/dev/null 2>&1 || return 1
> > +	ovs_wait grep -qE "rate:4294967295,group:1,cookie:c0ffee data:[0-9a-f]{28}$" $ovs_dir/stdout
> > +	[[ $? -eq 0 ]] || return 1
> > +
> > +	ovs_wait grep -q "rate:4294967295,group:2,cookie:eeff0c" $ovs_dir/stdout
> > +	[[ $? -eq 0 ]] || return 1
>
> Same for above two.
>
> >
> >  	return 0
> >  }
> > @@ -711,7 +739,8 @@ test_upcall_interfaces() {
> >  	ovs_add_netns_and_veths "test_upcall_interfaces" ui0 upc left0 l0 \
> >  	    172.31.110.1/24 -u || return 1
> >
> > -	sleep 1
> > +	ovs_wait grep -q "listening on upcall packet handler" ${ovs_dir}/left0.out
> > +
> >  	info "sending arping"
> >  	ip netns exec upc arping -I l0 172.31.110.20 -c 1 \
> >  	    >$ovs_dir/arping.stdout 2>$ovs_dir/arping.stderr
> > @@ -811,6 +840,10 @@ shift $(($OPTIND-1))
> >  IFS="	
> >  "
> >
> > +if test "X$KSFT_MACHINE_SLOW" == "Xyes"; then
> > +	WAIT_TIMEOUT=10
> > +fi
>
> Should this be done closer to the first initialization of WAIT_TIMEOUT ?
>

My rationale was splitting "variable declaration" and "code". Sure
we're not adding an explicit cli argument for this (as with TRACING or
VERBOSE) but we kind-of are using KSFT_MACHINE_SLOW as an input so for
me grouping input processing all together made some sense. Having said
that, I don't have a very strong opinion. I guess we can move it up as
well.

> > +
> >  for arg do
> >  	# Check first that all requested tests are available before running any
> >  	command -v > /dev/null "test_${arg}" || { echo "=== Test ${arg} not found"; usage; }
> > diff --git a/tools/testing/selftests/net/openvswitch/ovs-dpctl.py b/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
> > index 1e15b0818074..8a0396bfaf99 100644
> > --- a/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
> > +++ b/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
> > @@ -2520,6 +2520,7 @@ class PsampleEvent(EventSocket):
> >      marshal_class = psample_msg
> >
> >      def read_samples(self):
> > +        print("listening for psample events", flush=True)
> >          while True:
> >              try:
> >                  for msg in self.get():
>

Thanks.
Adrián
diff mbox series

Patch

diff --git a/tools/testing/selftests/net/openvswitch/openvswitch.sh b/tools/testing/selftests/net/openvswitch/openvswitch.sh
index bc71dbc18b21..83407b42073a 100755
--- a/tools/testing/selftests/net/openvswitch/openvswitch.sh
+++ b/tools/testing/selftests/net/openvswitch/openvswitch.sh
@@ -11,6 +11,7 @@  ksft_skip=4
 PAUSE_ON_FAIL=no
 VERBOSE=0
 TRACING=0
+WAIT_TIMEOUT=5
 
 tests="
 	arp_ping				eth-arp: Basic arp ping between two NS
@@ -29,6 +30,32 @@  info() {
 	[ $VERBOSE = 0 ] || echo $*
 }
 
+ovs_wait() {
+	info "waiting $WAIT_TIMEOUT s for: $@"
+
+	"$@"
+	if [[ $? -eq 0 ]]; then
+		info "wait succeeded inmediately"
+		return 0
+	fi
+
+	# A quick re-check helps speed up small races in fast systems.
+	# However, fractional sleeps might not necessarily work.
+	local start=0
+	sleep 0.1 || { sleep 1; start=1; }
+
+	for (( i=start; i<WAIT_TIMEOUT; i++ )); do
+		"$@"
+		if [[ $? -eq 0 ]]; then
+			info "wait succeeded after $i seconds"
+			return 0
+		fi
+		sleep 1
+	done
+	info "wait failed after $i seconds"
+	return 1
+}
+
 ovs_base=`pwd`
 sbxs=
 sbx_add () {
@@ -278,20 +305,21 @@  test_psample() {
 
 	# Record psample data.
 	ovs_spawn_daemon "test_psample" python3 $ovs_base/ovs-dpctl.py psample-events
+	ovs_wait grep -q "listening for psample events" ${ovs_dir}/stdout
 
 	# Send a single ping.
-	sleep 1
 	ovs_sbx "test_psample" ip netns exec client ping -I c1 172.31.110.20 -c 1 || return 1
-	sleep 1
 
 	# We should have received one userspace action upcall and 2 psample packets.
-	grep -E "userspace action command" $ovs_dir/s0.out >/dev/null 2>&1 || return 1
+	ovs_wait grep -q "userspace action command" $ovs_dir/s0.out
+	[[ $? -eq 0 ]] || return 1
 
 	# client -> server samples should only contain the first 14 bytes of the packet.
-	grep -E "rate:4294967295,group:1,cookie:c0ffee data:[0-9a-f]{28}$" \
-			 $ovs_dir/stdout >/dev/null 2>&1 || return 1
-	grep -E "rate:4294967295,group:2,cookie:eeff0c" \
-			 $ovs_dir/stdout >/dev/null 2>&1 || return 1
+	ovs_wait grep -qE "rate:4294967295,group:1,cookie:c0ffee data:[0-9a-f]{28}$" $ovs_dir/stdout
+	[[ $? -eq 0 ]] || return 1
+
+	ovs_wait grep -q "rate:4294967295,group:2,cookie:eeff0c" $ovs_dir/stdout
+	[[ $? -eq 0 ]] || return 1
 
 	return 0
 }
@@ -711,7 +739,8 @@  test_upcall_interfaces() {
 	ovs_add_netns_and_veths "test_upcall_interfaces" ui0 upc left0 l0 \
 	    172.31.110.1/24 -u || return 1
 
-	sleep 1
+	ovs_wait grep -q "listening on upcall packet handler" ${ovs_dir}/left0.out
+
 	info "sending arping"
 	ip netns exec upc arping -I l0 172.31.110.20 -c 1 \
 	    >$ovs_dir/arping.stdout 2>$ovs_dir/arping.stderr
@@ -811,6 +840,10 @@  shift $(($OPTIND-1))
 IFS="	
 "
 
+if test "X$KSFT_MACHINE_SLOW" == "Xyes"; then
+	WAIT_TIMEOUT=10
+fi
+
 for arg do
 	# Check first that all requested tests are available before running any
 	command -v > /dev/null "test_${arg}" || { echo "=== Test ${arg} not found"; usage; }
diff --git a/tools/testing/selftests/net/openvswitch/ovs-dpctl.py b/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
index 1e15b0818074..8a0396bfaf99 100644
--- a/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
+++ b/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
@@ -2520,6 +2520,7 @@  class PsampleEvent(EventSocket):
     marshal_class = psample_msg
 
     def read_samples(self):
+        print("listening for psample events", flush=True)
         while True:
             try:
                 for msg in self.get():