Message ID | baf2abd6af2e88f8874d14c97da1554b7e7a710e.1731342342.git.petrm@nvidia.com |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Series | net: ndo_fdb_add/del: Have drivers report whether they notified |
On Mon, 11 Nov 2024 18:09:01 +0100 Petr Machata wrote:
> Check that only one notification is produced for various FDB edit
> operations.
>
> Regarding the ip_link_add() and ip_link_master() helpers. This pattern of
> action plus corresponding defer is bound to come up often, and a dedicated
> vocabulary to capture it will be handy. tunnel_create() and vlan_create()
> from forwarding/lib.sh are somewhat opaque and perhaps too kitchen-sinky,
> so I tried to go in the opposite direction with these ones, and wrapped
> only the bare minimum to schedule a corresponding cleanup.

Looks like it fails about half of the time :(

https://netdev.bots.linux.dev/flakes.html?min-flip=0&tn-needle=fdb-notify&br-cnt=200
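The "action plus corresponding defer" pattern from the quoted text can be illustrated with a small stand-alone sketch. This is not the lib.sh implementation: sketch_defer, sketch_run_deferred and tmpdir_add are hypothetical names invented here, and directories stand in for network devices so the example runs without privileges. The assumption is only that cleanups run in reverse order of registration.

```shell
#!/bin/bash
# Illustrative only; the real defer machinery lives in the selftests lib.sh.
declare -a DEFERRED=()

# Record a cleanup command. It is stored shell-quoted so that arguments
# containing spaces survive until the command is replayed.
sketch_defer()
{
	DEFERRED+=("$(printf '%q ' "$@")")
}

# Run the recorded cleanups in reverse (LIFO) order, then clear the list.
sketch_run_deferred()
{
	local i

	for ((i = ${#DEFERRED[@]} - 1; i >= 0; i--)); do
		eval "${DEFERRED[i]}"
	done
	DEFERRED=()
}

# Same shape as ip_link_add(): perform the action, then schedule its undo.
tmpdir_add()
{
	local path=$1; shift

	mkdir "$path"
	sketch_defer rmdir "$path"
}

base=$(mktemp -d)
sketch_defer rmdir "$base"
tmpdir_add "$base/inner"

# Removes "$base/inner" first, then "$base".
sketch_run_deferred
[ -e "$base" ] || echo "cleaned up"
```

Because the undo is registered right next to the action, a caller cannot forget it, and reverse-order replay means a dependent object (here the inner directory) is removed before the object it depends on.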
Jakub Kicinski <kuba@kernel.org> writes:

> On Mon, 11 Nov 2024 18:09:01 +0100 Petr Machata wrote:
>> Check that only one notification is produced for various FDB edit
>> operations.
>>
>> Regarding the ip_link_add() and ip_link_master() helpers. This pattern of
>> action plus corresponding defer is bound to come up often, and a dedicated
>> vocabulary to capture it will be handy. tunnel_create() and vlan_create()
>> from forwarding/lib.sh are somewhat opaque and perhaps too kitchen-sinky,
>> so I tried to go in the opposite direction with these ones, and wrapped
>> only the bare minimum to schedule a corresponding cleanup.
>
> Looks like it fails about half of the time :(
>
> https://netdev.bots.linux.dev/flakes.html?min-flip=0&tn-needle=fdb-notify&br-cnt=200

OK, I can't reproduce this. Trying in VM, on an actual HW, debug, no
debug, no luck. But I see basically two failures:

- A "0 seen, 1 expected", which... I don't know, maybe it could just be
  a misplaced sleep. I don't see how, but it's a deterministic
  scenario, there shouldn't be anything racy here, either it emits or it
  doesn't, so some buffering issue is the only thing I can think of.

- Deadlocks. E.g. this, which looks like it deadlocked and timed out
  ("bad unlock balance detected" followed by "blocked for more than 122
  seconds" et al.):

  https://netdev-3.bots.linux.dev/vmksft-net-dbg/results/846621/18-fdb-notify-sh/

Like... how could this patchset even theoretically cause a deadlock?
Petr Machata <petrm@nvidia.com> writes:

> Jakub Kicinski <kuba@kernel.org> writes:
>
>> On Mon, 11 Nov 2024 18:09:01 +0100 Petr Machata wrote:
>>> Check that only one notification is produced for various FDB edit
>>> operations.
>>>
>>> Regarding the ip_link_add() and ip_link_master() helpers. This pattern of
>>> action plus corresponding defer is bound to come up often, and a dedicated
>>> vocabulary to capture it will be handy. tunnel_create() and vlan_create()
>>> from forwarding/lib.sh are somewhat opaque and perhaps too kitchen-sinky,
>>> so I tried to go in the opposite direction with these ones, and wrapped
>>> only the bare minimum to schedule a corresponding cleanup.
>>
>> Looks like it fails about half of the time :(
>>
>> https://netdev.bots.linux.dev/flakes.html?min-flip=0&tn-needle=fdb-notify&br-cnt=200
>
> OK, I can't reproduce this. Trying in VM, on an actual HW, debug, no
> debug, no luck. But I see basically two failures:
>
> - A "0 seen, 1 expected", which... I don't know, maybe it could just be
>   a misplaced sleep. I don't see how, but it's a deterministic
>   scenario, there shouldn't be anything racy here, either it emits or it
>   doesn't, so some buffering issue is the only thing I can think of.

I think this really could be just a "bridge monitor" taking a bit more
time to start every now and then. Can I have you test with this extra
chunk, or should I just resend with that change and hope for the best?

diff --git a/tools/testing/selftests/net/fdb_notify.sh b/tools/testing/selftests/net/fdb_notify.sh
index a98047361988..a8e04f08831c 100755
--- a/tools/testing/selftests/net/fdb_notify.sh
+++ b/tools/testing/selftests/net/fdb_notify.sh
@@ -26,6 +26,7 @@ do_test_dup()
 		bridge monitor fdb &> "$tmpf" &
 		defer kill_process $!
 
+		sleep 0.5
 		bridge fdb "$op" 00:11:22:33:44:55 vlan 1 "$@"
 		sleep 0.2
 	defer_scope_pop

> - Deadlocks. E.g. this, which looks like it deadlocked and timed out

Eh, these are ancient. Never mind.
On Wed, 13 Nov 2024 16:11:03 +0100 Petr Machata wrote:
> > - A "0 seen, 1 expected", which... I don't know, maybe it could just be
> >   a misplaced sleep. I don't see how, but it's a deterministic
> >   scenario, there shouldn't be anything racy here, either it emits or it
> >   doesn't, so some buffering issue is the only thing I can think of.
>
> I think this really could be just a "bridge monitor" taking a bit more
> time to start every now and then. Can I have you test with this extra
> chunk, or should I just resend with that change and hope for the best?

Let's give it a go, if it doesn't fix it we can try to do sneaky local
changes in the CI, without more resends.
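For reference, the start-up race suspected above (an event fired before the background listener is ready is simply lost) is often handled by polling for readiness instead of sleeping for a fixed time. The sketch below is generic: wait_until is a hypothetical helper, and the listener is a stand-in subshell that announces readiness through a marker file. Nothing in this thread says "bridge monitor" offers such a readiness signal, which is presumably why a fixed sleep was proposed.

```shell
#!/bin/bash
# Hypothetical helper: wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND every 0.1 s until it succeeds (return 0) or roughly
# TIMEOUT_SECONDS have elapsed (return 1). SECONDS is bash's built-in
# whole-second counter, so the timeout is approximate.
wait_until()
{
	local timeout=$1; shift
	local deadline=$((SECONDS + timeout))

	until "$@"; do
		((SECONDS >= deadline)) && return 1
		sleep 0.1
	done
	return 0
}

# Stand-in for a slow-starting monitor: it needs 0.3 s before it is ready,
# then creates a marker file and keeps running.
workdir=$(mktemp -d)
( sleep 0.3; : > "$workdir/ready"; exec sleep 5 ) &
listener=$!

if wait_until 2 test -e "$workdir/ready"; then
	echo "listener ready, safe to trigger the event"
else
	echo "listener did not become ready in time"
fi

# Suppress noise from killing the background process.
{ kill "$listener" && wait "$listener"; } 2>/dev/null
rm -rf "$workdir"
```

The trade-off against a plain sleep is that polling returns as soon as the condition holds and fails loudly on timeout, but it only helps when the listener exposes something observable to poll for.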
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 26a4883a65c9..ab0e8f30bfe7 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -92,7 +92,7 @@ TEST_PROGS += test_vxlan_mdb.sh
 TEST_PROGS += test_bridge_neigh_suppress.sh
 TEST_PROGS += test_vxlan_nolocalbypass.sh
 TEST_PROGS += test_bridge_backup_port.sh
-TEST_PROGS += fdb_flush.sh
+TEST_PROGS += fdb_flush.sh fdb_notify.sh
 TEST_PROGS += fq_band_pktlimit.sh
 TEST_PROGS += vlan_hw_filter.sh
 TEST_PROGS += bpf_offload.py
diff --git a/tools/testing/selftests/net/fdb_notify.sh b/tools/testing/selftests/net/fdb_notify.sh
new file mode 100755
index 000000000000..a98047361988
--- /dev/null
+++ b/tools/testing/selftests/net/fdb_notify.sh
@@ -0,0 +1,95 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+source lib.sh
+
+ALL_TESTS="
+	test_dup_bridge
+	test_dup_vxlan_self
+	test_dup_vxlan_master
+	test_dup_macvlan_self
+	test_dup_macvlan_master
+"
+
+do_test_dup()
+{
+	local op=$1; shift
+	local what=$1; shift
+	local tmpf
+
+	RET=0
+
+	tmpf=$(mktemp)
+	defer rm "$tmpf"
+
+	defer_scope_push
+		bridge monitor fdb &> "$tmpf" &
+		defer kill_process $!
+
+		bridge fdb "$op" 00:11:22:33:44:55 vlan 1 "$@"
+		sleep 0.2
+	defer_scope_pop
+
+	local count=$(grep -c -e 00:11:22:33:44:55 $tmpf)
+	((count == 1))
+	check_err $? "Got $count notifications, expected 1"
+
+	log_test "$what $op: Duplicate notifications"
+}
+
+test_dup_bridge()
+{
+	ip_link_add br up type bridge vlan_filtering 1
+	do_test_dup add "bridge" dev br self
+	do_test_dup del "bridge" dev br self
+}
+
+test_dup_vxlan_self()
+{
+	ip_link_add br up type bridge vlan_filtering 1
+	ip_link_add vx up type vxlan id 2000 dstport 4789
+	ip_link_master vx br
+
+	do_test_dup add "vxlan" dev vx self dst 192.0.2.1
+	do_test_dup del "vxlan" dev vx self dst 192.0.2.1
+}
+
+test_dup_vxlan_master()
+{
+	ip_link_add br up type bridge vlan_filtering 1
+	ip_link_add vx up type vxlan id 2000 dstport 4789
+	ip_link_master vx br
+
+	do_test_dup add "vxlan master" dev vx master
+	do_test_dup del "vxlan master" dev vx master
+}
+
+test_dup_macvlan_self()
+{
+	ip_link_add dd up type dummy
+	ip_link_add mv up link dd type macvlan mode passthru
+
+	do_test_dup add "macvlan self" dev mv self
+	do_test_dup del "macvlan self" dev mv self
+}
+
+test_dup_macvlan_master()
+{
+	ip_link_add br up type bridge vlan_filtering 1
+	ip_link_add dd up type dummy
+	ip_link_add mv up link dd type macvlan mode passthru
+	ip_link_master mv br
+
+	do_test_dup add "macvlan master" dev mv self
+	do_test_dup del "macvlan master" dev mv self
+}
+
+cleanup()
+{
+	defer_scopes_cleanup
+}
+
+trap cleanup EXIT
+tests_run
+
+exit $EXIT_STATUS
diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh
index 24f63e45735d..8994fec1c38f 100644
--- a/tools/testing/selftests/net/lib.sh
+++ b/tools/testing/selftests/net/lib.sh
@@ -442,3 +442,20 @@ kill_process()
 	# Suppress noise from killing the process.
 	{ kill $pid && wait $pid; } 2>/dev/null
 }
+
+ip_link_add()
+{
+	local name=$1; shift
+
+	ip link add name "$name" "$@"
+	defer ip link del dev "$name"
+}
+
+ip_link_master()
+{
+	local member=$1; shift
+	local master=$1; shift
+
+	ip link set dev "$member" master "$master"
+	defer ip link set dev "$member" nomaster
+}