Message ID | 20210319204307.3128280-1-aconole@redhat.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | openvswitch: perform refragmentation for packets which pass through conntrack | expand |
Aaron Conole <aconole@redhat.com> writes: > When a user instructs a flow pipeline to perform connection tracking, > there is an implicit L3 operation that occurs - namely the IP fragments > are reassembled and then processed as a single unit. After this, new > fragments are generated and then transmitted, with the hint that they > should be fragmented along the max rx unit boundary. In general, this > behavior works well to forward packets along when the MTUs are congruent > across the datapath. > > However, if using a protocol such as UDP on a network with mismatching > MTUs, it is possible that the refragmentation will still produce an > invalid fragment, and that fragmented packet will not be delivered. > Such a case shouldn't happen because the user explicitly requested a > layer 3+4 function (conntrack), and that function generates new fragments, > so we should perform the needed actions in that case (namely, refragment > IPv4 along a correct boundary, or send a packet too big in the IPv6 case). > > Additionally, introduce a test suite for openvswitch with a test case > that ensures this MTU behavior, with the expectation that new tests are > added when needed. > > Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action") > Signed-off-by: Aaron Conole <aconole@redhat.com> > --- Ugh, after I re-generated, I forgot to add 'net' as the target tree. Sorry for that.
Hey Aaron, long time no chat :) On Fri, Mar 19, 2021 at 1:43 PM Aaron Conole <aconole@redhat.com> wrote: > > When a user instructs a flow pipeline to perform connection tracking, > there is an implicit L3 operation that occurs - namely the IP fragments > are reassembled and then processed as a single unit. After this, new > fragments are generated and then transmitted, with the hint that they > should be fragmented along the max rx unit boundary. In general, this > behavior works well to forward packets along when the MTUs are congruent > across the datapath. > > However, if using a protocol such as UDP on a network with mismatching > MTUs, it is possible that the refragmentation will still produce an > invalid fragment, and that fragmented packet will not be delivered. > Such a case shouldn't happen because the user explicitly requested a > layer 3+4 function (conntrack), and that function generates new fragments, > so we should perform the needed actions in that case (namely, refragment > IPv4 along a correct boundary, or send a packet too big in the IPv6 case). > > Additionally, introduce a test suite for openvswitch with a test case > that ensures this MTU behavior, with the expectation that new tests are > added when needed. > > Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action") > Signed-off-by: Aaron Conole <aconole@redhat.com> > --- > NOTE: checkpatch reports a whitespace error with the openvswitch.sh > script - this is due to using tab as the IFS value. This part > of the script was copied from > tools/testing/selftests/net/pmtu.sh so I think should be > permissible. > > net/openvswitch/actions.c | 2 +- > tools/testing/selftests/net/.gitignore | 1 + > tools/testing/selftests/net/Makefile | 1 + > tools/testing/selftests/net/openvswitch.sh | 394 +++++++++++++++++++++ > 4 files changed, 397 insertions(+), 1 deletion(-) > create mode 100755 tools/testing/selftests/net/openvswitch.sh > > diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c > index 92a0b67b2728..d858ea580e43 100644 > --- a/net/openvswitch/actions.c > +++ b/net/openvswitch/actions.c > @@ -890,7 +890,7 @@ static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port, > if (likely(!mru || > (skb->len <= mru + vport->dev->hard_header_len))) { > ovs_vport_send(vport, skb, ovs_key_mac_proto(key)); > - } else if (mru <= vport->dev->mtu) { > + } else if (mru) { > struct net *net = read_pnet(&dp->net); > > ovs_fragment(net, vport, skb, mru, key); I thought about this for a while. For a bit of context, my recollection is that in the initial design, there was an attempt to minimize the set of assumptions around L3 behaviour and despite performing this pseudo-L3 action of connection tracking, attempt a "bump-in-the-wire" approach where OVS is serving as an L2 switch and if you wanted L3 features, you need to build them on top or explicitly define that you're looking for L3 semantics. In this case, you're interpreting that the combination of the conntrack action + an output action implies that L3 routing is being performed. Hence, OVS should act like a router and either refragment or generate ICMP PTB in the case where MTU differs. According to the flow table, the rest of the routing functionality (MAC handling for instance) may or may not have been performed at this point, but we basically leave that up to the SDN controller to implement the right behaviour. In relation to this particular check, the idea was to retain the original geometry of the packet such that it's as though there were no functionality performed in the middle at all. OVS happened to do connection tracking (which implicitly involved queueing fragments), but if you treat it as an opaque box, you have ports connected and OVS is simply performing forwarding between the ports. One of the related implications is the contrast between what happens in this case if you have a conntrack action injected or not when outputting to another port. If you didn't put a connection tracking action into the flows when redirecting here, then there would be no defragmentation or refragmentation. In that case, OVS is just attempting to forward to another device and if the MTU check fails, then bad luck, packets will be dropped. Now, with the interpretation in this patch, it seems like we're trying to say that, well, actually, if the controller injects a connection tracking action, then the controller implicitly switches OVS into a sort of half-L3 mode for this particular flow. This makes the behaviour a bit inconsistent. Another thought that occurs here is that if you have three interfaces attached to the switch, say one with MTU 1500 and two with MTU 1450, and the OVS flows are configured to conntrack and clone the packets from the higher-MTU interface to the lower-MTU interfaces. If you receive larger IP fragments on the first interface and attempt to forward on to the other interfaces, should all interfaces generate an ICMPv6 PTB? That doesn't seem quite right, especially if one of those ports is used for mirroring the traffic for operational reasons while the other path is part of the actual routing path for the traffic. You'd end up with duplicate PTB messages for the same outbound request. If I read right, this would also not be able to be controlled by the OVS controller because when we call into ip6_fragment() and hit the MTU-handling path, it will automatically take over and generate the ICMP response out the source interface, without any reference to the OVS flow table. This seems like it's further breaking the model where instead of OVS being a purely programmable L2-like flow match+actions pipeline, now depending on the specific actions you inject (in particular combinations), you get some bits of the L3 functionality. But for full L3 functionality, the controller still needs to handle the rest through the correct set of actions in the flow. Looking at the tree, it seems like this problem can be solved in userspace without further kernel changes by using OVS_ACTION_ATTR_CHECK_PKT_LEN, see commit 4d5ec89fc8d1 ("net: openvswitch: Add a new action check_pkt_len"). It even explicitly says "The main use case for adding this action is to solve the packet drops because of MTU mismatch in OVN virtual networking solution.". Have you tried using this approach? > diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore > index 61ae899cfc17..d4d7487833be 100644 > --- a/tools/testing/selftests/net/.gitignore > +++ b/tools/testing/selftests/net/.gitignore > @@ -30,3 +30,4 @@ hwtstamp_config > rxtimestamp > timestamping > txtimestamp > +test_mismatched_mtu_with_conntrack > diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile > index 25f198bec0b2..dc9b556f86fd 100644 > --- a/tools/testing/selftests/net/Makefile > +++ b/tools/testing/selftests/net/Makefile Neat to see some bootstrapping of in-tree OVS testing. I'd probably put it in a separate commit but maybe that's just personal preference. I didn't look *too* closely at the tests but just one nit below: > + # test a udp connection > + info "send udp data" > + ip netns exec server sh -c 'cat ${ovs_dir}/fifo | nc -l -vv -u 8888 >${ovs_dir}/fifo 2>${ovs_dir}/s1-nc.log & echo $! > ${ovs_dir}/server.pid' There are multiple netcat implementations with different arguments (BSD and nmap.org and maybe also Debian versions). Might be nice to point out which netcat you're relying on here or try to detect & fail out/skip on the wrong one. For reference, the equivalent OVS test code detection is here: https://github.com/openvswitch/ovs/blob/80e74da4fd8bfdaba92105560ce144b4b2d00e36/tests/atlocal.in#L175
Hi, On Fri, Mar 19, 2021 at 04:43:07PM -0400, Aaron Conole wrote: > When a user instructs a flow pipeline to perform connection tracking, > there is an implicit L3 operation that occurs - namely the IP fragments > are reassembled and then processed as a single unit. After this, new > fragments are generated and then transmitted, with the hint that they > should be fragmented along the max rx unit boundary. In general, this > behavior works well to forward packets along when the MTUs are congruent > across the datapath. > > However, if using a protocol such as UDP on a network with mismatching > MTUs, it is possible that the refragmentation will still produce an > invalid fragment, and that fragmented packet will not be delivered. > Such a case shouldn't happen because the user explicitly requested a > layer 3+4 function (conntrack), and that function generates new fragments, > so we should perform the needed actions in that case (namely, refragment > IPv4 along a correct boundary, or send a packet too big in the IPv6 case). > > Additionally, introduce a test suite for openvswitch with a test case > that ensures this MTU behavior, with the expectation that new tests are > added when needed. > > Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action") > Signed-off-by: Aaron Conole <aconole@redhat.com> > --- > NOTE: checkpatch reports a whitespace error with the openvswitch.sh > script - this is due to using tab as the IFS value. This part > of the script was copied from > tools/testing/selftests/net/pmtu.sh so I think should be > permissible. > > net/openvswitch/actions.c | 2 +- > tools/testing/selftests/net/.gitignore | 1 + > tools/testing/selftests/net/Makefile | 1 + > tools/testing/selftests/net/openvswitch.sh | 394 +++++++++++++++++++++ > 4 files changed, 397 insertions(+), 1 deletion(-) > create mode 100755 tools/testing/selftests/net/openvswitch.sh > > diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c > index 92a0b67b2728..d858ea580e43 100644 > --- a/net/openvswitch/actions.c > +++ b/net/openvswitch/actions.c > @@ -890,7 +890,7 @@ static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port, > if (likely(!mru || > (skb->len <= mru + vport->dev->hard_header_len))) { > ovs_vport_send(vport, skb, ovs_key_mac_proto(key)); > - } else if (mru <= vport->dev->mtu) { > + } else if (mru) { > struct net *net = read_pnet(&dp->net); If the egress port has MTU less than MRU, then before the patch the packet is dropped and after the patch ip_do_fragment() will correctly take care of fragmenting according with egress MTU. That seems a reasonable change to me. This condition below makes little sense to me now that this patch is changing the MTU assumption: skb->len <= mru + vport->dev->hard_header_len The MRU depends on the ingress port. Perhaps that should check if the skb length fits into the egress port even with mru available: if (!mru || (packet_length(skb, vport->dev) <= vport->dev->mtu)) { ovs_vport_send(vport, skb, ovs_key_mac_proto(key)); else { ovs_fragment(net, vport, skb, mru, key); } IIRC, a GSO packet will always fall into the first condition otherwise we need an additional skb_is_gso(skb). Also, that last 'else' branch is not needed anymore. Then we have check_pkt_len which Aaron and I discussed a bit offline. I think we should revert commit[1] at least the part relying on mru because a packet with mru > mtu is not dropped after the patch. [1] 178436557 ("openvswitch: take into account de-fragmentation/gso_size in execute_check_pkt_len") Thanks, fbl > > ovs_fragment(net, vport, skb, mru, key); > diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore > index 61ae899cfc17..d4d7487833be 100644 > --- a/tools/testing/selftests/net/.gitignore > +++ b/tools/testing/selftests/net/.gitignore > @@ -30,3 +30,4 @@ hwtstamp_config > rxtimestamp > timestamping > txtimestamp > +test_mismatched_mtu_with_conntrack > diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile > index 25f198bec0b2..dc9b556f86fd 100644 > --- a/tools/testing/selftests/net/Makefile > +++ b/tools/testing/selftests/net/Makefile > @@ -23,6 +23,7 @@ TEST_PROGS += drop_monitor_tests.sh > TEST_PROGS += vrf_route_leaking.sh > TEST_PROGS += bareudp.sh > TEST_PROGS += unicast_extensions.sh > +TEST_PROGS += openvswitch.sh > TEST_PROGS_EXTENDED := in_netns.sh > TEST_GEN_FILES = socket nettest > TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy reuseport_addr_any > diff --git a/tools/testing/selftests/net/openvswitch.sh b/tools/testing/selftests/net/openvswitch.sh > new file mode 100755 > index 000000000000..7b6341688cc3 > --- /dev/null > +++ b/tools/testing/selftests/net/openvswitch.sh > @@ -0,0 +1,394 @@ > +#!/bin/sh > +# SPDX-License-Identifier: GPL-2.0 > +# > +# OVS kernel module self tests > +# > +# Tests currently implemented: > +# > +# - mismatched_mtu_with_conntrack > +# Set up two namespaces (client and server) which each have devices specifying > +# incongruent MTUs (1450 vs 1500 in the test). Transmit a UDP packet of 1901 bytes > +# from client to server, and back. Ensure that the ct() action > +# uses the mru as a hint, but not a forced check. > + > + > +# Kselftest framework requirement - SKIP code is 4. > +ksft_skip=4 > + > +PAUSE_ON_FAIL=no > +VERBOSE=0 > +TRACING=0 > + > +tests=" > + mismatched_mtu_with_conntrack ipv4: IP Fragmentation with conntrack" > + > + > +usage() { > + echo > + echo "$0 [OPTIONS] [TEST]..." > + echo "If no TEST argument is given, all tests will be run." > + echo > + echo "Options" > + echo " -t: capture traffic via tcpdump" > + echo " -v: verbose" > + echo " -p: pause on failure" > + echo > + echo "Available tests${tests}" > + exit 1 > +} > + > +on_exit() { > + echo "$1" > ${ovs_dir}/cleanup.tmp > + cat ${ovs_dir}/cleanup >> ${ovs_dir}/cleanup.tmp > + mv ${ovs_dir}/cleanup.tmp ${ovs_dir}/cleanup > +} > + > +ovs_setenv() { > + sandbox=$1 > + > + ovs_dir=$ovs_base${1:+/$1}; export ovs_dir > + > + test -e ${ovs_dir}/cleanup || : > ${ovs_dir}/cleanup > + > + OVS_RUNDIR=$ovs_dir; export OVS_RUNDIR > + OVS_LOGDIR=$ovs_dir; export OVS_LOGDIR > + OVS_DBDIR=$ovs_dir; export OVS_DBDIR > + OVS_SYSCONFDIR=$ovs_dir; export OVS_SYSCONFDIR > + OVS_PKGDATADIR=$ovs_dir; export OVS_PKGDATADIR > +} > + > +ovs_exit_sig() { > + . "$ovs_dir/cleanup" > + ovs-dpctl del-dp ovs-system > +} > + > +ovs_cleanup() { > + ovs_exit_sig > + [ $VERBOSE = 0 ] || echo "Error detected. See $ovs_dir for more details." > +} > + > +ovs_normal_exit() { > + ovs_exit_sig > + rm -rf ${ovs_dir} > +} > + > +info() { > + [ $VERBOSE = 0 ] || echo $* > +} > + > +kill_ovs_vswitchd () { > + # Use provided PID or save the current PID if available. > + TMPPID=$1 > + if test -z "$TMPPID"; then > + TMPPID=$(cat $OVS_RUNDIR/ovs-vswitchd.pid 2>/dev/null) > + fi > + > + # Tell the daemon to terminate gracefully > + ovs-appctl -t ovs-vswitchd exit --cleanup 2>/dev/null > + > + # Nothing else to be done if there is no PID > + test -z "$TMPPID" && return > + > + for i in 1 2 3 4 5 6 7 8 9; do > + # Check if the daemon is alive. > + kill -0 $TMPPID 2>/dev/null || return > + > + # Fallback to whole number since POSIX doesn't require > + # fractional times to work. > + sleep 0.1 || sleep 1 > + done > + > + # Make sure it is terminated. > + kill $TMPPID > +} > + > +start_daemon () { > + info "exec: $@ -vconsole:off --detach --no-chdir --pidfile --log-file" > + "$@" -vconsole:off --detach --no-chdir --pidfile --log-file >/dev/null > + pidfile="$OVS_RUNDIR"/$1.pid > + > + info "setting kill for $@..." > + on_exit "test -e \"$pidfile\" && kill \`cat \"$pidfile\"\`" > +} > + > +if test "X$vswitchd_schema" = "X"; then > + vswitchd_schema="/usr/share/openvswitch" > +fi > + > +ovs_sbx() { > + if test "X$2" != X; then > + (ovs_setenv $1; shift; "$@" >> ${ovs_dir}/debug.log) > + else > + ovs_setenv $1 > + fi > +} > + > +seq () { > + if test $# = 1; then > + set 1 $1 > + fi > + while test $1 -le $2; do > + echo $1 > + set `expr $1 + ${3-1}` $2 $3 > + done > +} > + > +ovs_wait () { > + info "$1: waiting $2..." > + > + # First try the condition without waiting. > + if ovs_wait_cond; then info "$1: wait succeeded immediately"; return 0; fi > + > + # Try a quick sleep, so that the test completes very quickly > + # in the normal case. POSIX doesn't require fractional times to > + # work, so this might not work. > + sleep 0.1 > + if ovs_wait_cond; then info "$1: wait succeeded quickly"; return 0; fi > + > + # Then wait up to OVS_CTL_TIMEOUT seconds. > + local d > + for d in `seq 1 "$OVS_CTL_TIMEOUT"`; do > + sleep 1 > + if ovs_wait_cond; then info "$1: wait succeeded after $d seconds"; return 0; fi > + done > + > + info "$1: wait failed after $d seconds" > + ovs_wait_failed > +} > + > +sbxs= > +sbx_add () { > + info "adding sandbox '$1'" > + > + sbxs="$sbxs $1" > + > + NO_BIN=0 > + which ovsdb-tool >/dev/null 2>&1 || NO_BIN=1 > + which ovsdb-server >/dev/null 2>&1 || NO_BIN=1 > + which ovs-vsctl >/dev/null 2>&1 || NO_BIN=1 > + which ovs-vswitchd >/dev/null 2>&1 || NO_BIN=1 > + > + if [ $NO_BIN = 1 ]; then > + info "Missing required binaries..." > + return 4 > + fi > + # Create sandbox. > + local d="$ovs_base"/$1 > + if [ -e $d ]; then > + info "removing $d" > + rm -rf "$d" > + fi > + mkdir "$d" || return 1 > + ovs_setenv $1 > + > + # Create database and start ovsdb-server. > + info "$1: create db and start db-server" > + : > "$d"/.conf.db.~lock~ > + ovs_sbx $1 ovsdb-tool create "$d"/conf.db "$vswitchd_schema"/vswitch.ovsschema || return 1 > + ovs_sbx $1 start_daemon ovsdb-server --detach --remote=punix:"$d"/db.sock || return 1 > + > + # Initialize database. > + ovs_sbx $1 ovs-vsctl --no-wait -- init || return 1 > + > + # Start ovs-vswitchd > + info "$1: start vswitchd" > + ovs_sbx $1 start_daemon ovs-vswitchd -vvconn -vofproto_dpif -vunixctl > + > + ovs_wait_cond () { > + if ip link show ovs-netdev >/dev/null 2>&1; then return 1; else return 0; fi > + } > + ovs_wait_failed () { > + : > + } > + > + ovs_wait "sandbox_add" "while ip link show ovs-netdev" || return 1 > +} > + > +ovs_base=`pwd` > + > +# mismatched_mtu_with_conntrack test > +# - client has 1450 byte MTU > +# - server has 1500 byte MTU > +# - use UDP to send 1901 bytes each direction for mismatched > +# fragmentation. > +test_mismatched_mtu_with_conntrack() { > + > + sbx_add "test_mismatched_mtu_with_conntrack" || return $? > + > + info "create namespaces" > + for ns in client server; do > + ip netns add $ns || return 1 > + on_exit "ip netns del $ns" > + done > + > + # setup the base bridge > + ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-vsctl add-br br0 || return 1 > + > + # setup the client > + info "setup client namespace" > + ip link add c0 type veth peer name c1 || return 1 > + on_exit "ip link del c0 >/dev/null 2>&1" > + > + ip link set c0 mtu 1450 > + ip link set c0 up > + > + ip link set c1 netns client || return 1 > + ip netns exec client ip addr add 172.31.110.2/24 dev c1 > + ip netns exec client ip link set c1 mtu 1450 > + ip netns exec client ip link set c1 up > + ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-vsctl add-port br0 c0 || return 1 > + > + # setup the server > + info "setup server namespace" > + ip link add s0 type veth peer name s1 || return 1 > + on_exit "ip link del s0 >/dev/null 2>&1; ip netns exec server ip link del s0 >/dev/null 2>&1" > + ip link set s0 up > + > + ip link set s1 netns server || return 1 > + ip netns exec server ip addr add 172.31.110.1/24 dev s1 || return 1 > + ip netns exec server ip link set s1 up > + ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-vsctl add-port br0 s0 || return 1 > + > + info "setup flows" > + ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-ofctl del-flows br0 > + > + cat >${ovs_dir}/flows.txt <<EOF > +table=0,priority=100,arp action=normal > +table=0,priority=100,ip,udp action=ct(table=1) > +table=0,priority=10 action=drop > + > +table=1,priority=100,ct_state=+new+trk,in_port=c0,ip action=ct(commit),s0 > +table=1,priority=100,ct_state=+est+trk,in_port=s0,ip action=c0 > + > +EOF > + ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-ofctl --bundle add-flows br0 ${ovs_dir}/flows.txt || return 1 > + ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-ofctl dump-flows br0 > + > + ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-vsctl show > + > + # setup echo > + mknod -m 777 ${ovs_dir}/fifo p || return 1 > + # on_exit "rm ${ovs_dir}/fifo" > + > + # test a udp connection > + info "send udp data" > + ip netns exec server sh -c 'cat ${ovs_dir}/fifo | nc -l -vv -u 8888 >${ovs_dir}/fifo 2>${ovs_dir}/s1-nc.log & echo $! > ${ovs_dir}/server.pid' > + on_exit "test -e \"${ovs_dir}/server.pid\" && kill \`cat \"${ovs_dir}/server.pid\"\`" > + if [ $TRACING = 1 ]; then > + ip netns exec server sh -c "tcpdump -i s1 -l -n -U -xx >> ${ovs_dir}/s1-pkts.cap &" > + ip netns exec client sh -c "tcpdump -i c1 -l -n -U -xx >> ${ovs_dir}/c1-pkts.cap &" > + fi > + > + ip netns exec client sh -c "python -c \"import time; print('a' * 1900); time.sleep(2)\" | nc -v -u 172.31.110.1 8888 2>${ovs_dir}/c1-nc.log" || return 1 > + > + ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-appctl dpctl/dump-flows > + ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-ofctl dump-flows br0 > + > + info "check udp data was tx and rx" > + grep "1901 bytes received" ${ovs_dir}/c1-nc.log || return 1 > + ovs_normal_exit > +} > + > +run_test() { > + ( > + tname="$1" > + tdesc="$2" > + > + if ! lsmod | grep openvswitch >/dev/null 2>&1; then > + printf "TEST: %-60s [SKIP]\n" "${tdesc}" > + return $ksft_skip > + fi > + > + unset IFS > + > + eval test_${tname} > + ret=$? > + > + if [ $ret -eq 0 ]; then > + printf "TEST: %-60s [ OK ]\n" "${tdesc}" > + ovs_normal_exit > + elif [ $ret -eq 1 ]; then > + printf "TEST: %-60s [FAIL]\n" "${tdesc}" > + if [ "${PAUSE_ON_FAIL}" = "yes" ]; then > + echo > + echo "Pausing. Hit enter to continue" > + read a > + fi > + ovs_exit_sig > + exit 1 > + elif [ $ret -eq $ksft_skip ]; then > + printf "TEST: %-60s [SKIP]\n" "${tdesc}" > + elif [ $ret -eq 2 ]; then > + rm -rf test_${tname} > + run_test "$1" "$2" > + fi > + > + return $ret > + ) > + ret=$? > + case $ret in > + 0) > + all_skipped=false > + [ $exitcode=$ksft_skip ] && exitcode=0 > + ;; > + $ksft_skip) > + [ $all_skipped = true ] && exitcode=$ksft_skip > + ;; > + *) > + all_skipped=false > + exitcode=1 > + ;; > + esac > + > + return $ret > +} > + > + > +exitcode=0 > +desc=0 > +all_skipped=true > + > +while getopts :pvt o > +do > + case $o in > + p) PAUSE_ON_FAIL=yes;; > + v) VERBOSE=1;; > + t) if which tcpdump > /dev/null 2>&1; then > + TRACING=1 > + else > + echo "=== tcpdump not available, tracing disabled" > + fi > + ;; > + *) usage;; > + esac > +done > +shift $(($OPTIND-1)) > + > +IFS=" > +" > + > +for arg do > + # Check first that all requested tests are available before running any > + command -v > /dev/null "test_${arg}" || { echo "=== Test ${arg} not found"; usage; } > +done > + > +name="" > +desc="" > +for t in ${tests}; do > + [ "${name}" = "" ] && name="${t}" && continue > + [ "${desc}" = "" ] && desc="${t}" > + > + run_this=1 > + for arg do > + [ "${arg}" != "${arg#--*}" ] && continue > + [ "${arg}" = "${name}" ] && run_this=1 && break > + run_this=0 > + done > + if [ $run_this -eq 1 ]; then > + run_test "${name}" "${desc}" > + fi > + name="" > + desc="" > +done > + > +exit ${exitcode} > -- > 2.25.4 >
Joe Stringer <joe@cilium.io> writes: > Hey Aaron, long time no chat :) Same :) > On Fri, Mar 19, 2021 at 1:43 PM Aaron Conole <aconole@redhat.com> wrote: >> >> When a user instructs a flow pipeline to perform connection tracking, >> there is an implicit L3 operation that occurs - namely the IP fragments >> are reassembled and then processed as a single unit. After this, new >> fragments are generated and then transmitted, with the hint that they >> should be fragmented along the max rx unit boundary. In general, this >> behavior works well to forward packets along when the MTUs are congruent >> across the datapath. >> >> However, if using a protocol such as UDP on a network with mismatching >> MTUs, it is possible that the refragmentation will still produce an >> invalid fragment, and that fragmented packet will not be delivered. >> Such a case shouldn't happen because the user explicitly requested a >> layer 3+4 function (conntrack), and that function generates new fragments, >> so we should perform the needed actions in that case (namely, refragment >> IPv4 along a correct boundary, or send a packet too big in the IPv6 case). >> >> Additionally, introduce a test suite for openvswitch with a test case >> that ensures this MTU behavior, with the expectation that new tests are >> added when needed. >> >> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action") >> Signed-off-by: Aaron Conole <aconole@redhat.com> >> --- >> NOTE: checkpatch reports a whitespace error with the openvswitch.sh >> script - this is due to using tab as the IFS value. This part >> of the script was copied from >> tools/testing/selftests/net/pmtu.sh so I think should be >> permissible. >> >> net/openvswitch/actions.c | 2 +- >> tools/testing/selftests/net/.gitignore | 1 + >> tools/testing/selftests/net/Makefile | 1 + >> tools/testing/selftests/net/openvswitch.sh | 394 +++++++++++++++++++++ >> 4 files changed, 397 insertions(+), 1 deletion(-) >> create mode 100755 tools/testing/selftests/net/openvswitch.sh >> >> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c >> index 92a0b67b2728..d858ea580e43 100644 >> --- a/net/openvswitch/actions.c >> +++ b/net/openvswitch/actions.c >> @@ -890,7 +890,7 @@ static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port, >> if (likely(!mru || >> (skb->len <= mru + vport->dev->hard_header_len))) { >> ovs_vport_send(vport, skb, ovs_key_mac_proto(key)); >> - } else if (mru <= vport->dev->mtu) { >> + } else if (mru) { >> struct net *net = read_pnet(&dp->net); >> >> ovs_fragment(net, vport, skb, mru, key); > > I thought about this for a while. For a bit of context, my > recollection is that in the initial design, there was an attempt to > minimize the set of assumptions around L3 behaviour and despite > performing this pseudo-L3 action of connection tracking, attempt a > "bump-in-the-wire" approach where OVS is serving as an L2 switch and > if you wanted L3 features, you need to build them on top or explicitly > define that you're looking for L3 semantics. In this case, you're > interpreting that the combination of the conntrack action + an output > action implies that L3 routing is being performed. Hence, OVS should > act like a router and either refragment or generate ICMP PTB in the > case where MTU differs. According to the flow table, the rest of the > routing functionality (MAC handling for instance) may or may not have > been performed at this point, but we basically leave that up to the > SDN controller to implement the right behaviour. In relation to this > particular check, the idea was to retain the original geometry of the > packet such that it's as though there were no functionality performed > in the middle at all. OVS happened to do connection tracking (which > implicitly involved queueing fragments), but if you treat it as an > opaque box, you have ports connected and OVS is simply performing > forwarding between the ports. I've been going back and forth on this. On the one hand, Open vSwitch is supposed to only care about 'just' the L2 forwarding, with some additional processing to assist. After that, it's up to an L3 layer to really provide additional support, and the idea is that the controller or something else should really be guiding this higher level intelligence. The issue I have is that we do some of the high level intelligence here to support conntrack, and it's done in a way that is a bit unintuitive. As an example, you write: ... the idea was to retain the original geometry of the packet such that it's as though there were no functionality performed in the middle at all But, the fragmentation engine isn't guaranteed to reassemble exactly the same packets. Consider the scenario where there is an upstream router that has performed it's own mitm fragmentation. There can be a sequence of packets after that that looks like: IPID == 1, pkt 1 (1410 bytes), pkt 2 (64 bytes), pkt 3 (1000 bytes) When reassembled by the frag engine, we will only use the MRU as the guide, and that will spit out: IPID == 1, pkt 1 (1410 bytes), pkt 2 (1044 bytes) We will have reduced the number of packets moving through the network, and then aren't acting as a bump in the wire, but as a real entity. I even tested this: 04:28:47 root@dhcp-25 /home/aconole/git/linux/tools/testing/selftests/net# grep 'IP 172.31.110.2' test_mismatched_mtu_with_conntrack/c1-pkts.cap 16:25:43.481072 IP 172.31.110.2.52352 > 172.31.110.1.ddi-udp-1: UDP, bad length 1901 > 1360 16:25:43.525972 IP 172.31.110.2 > 172.31.110.1: ip-proto-17 16:25:43.567272 IP 172.31.110.2 > 172.31.110.1: ip-proto-17 bash: __git_ps1: command not found 04:28:54 root@dhcp-25 /home/aconole/git/linux/tools/testing/selftests/net# grep 'IP 172.31.110.2' test_mismatched_mtu_with_conntrack/s1-pkts.cap 16:25:43.567435 IP 172.31.110.2.52352 > 172.31.110.1.ddi-udp-1: UDP, bad length 1901 > 1360 16:25:43.567438 IP 172.31.110.2 > 172.31.110.1: ip-proto-17 Additionally, because this happens transparently for the flow rule user, we need to run check_pkt_len() after every call to the conntrack action because there is really no longer a way to know whether the packet came in via a fragmented path. I guess we could do something with ip.frag==first|later|no... selection rules to try and create a custom table for handling fragments - but that seems like it's a workaround for the conntrack functionality w.r.t. the fragmentation engine. > One of the related implications is the contrast between what happens > in this case if you have a conntrack action injected or not when > outputting to another port. If you didn't put a connection tracking > action into the flows when redirecting here, then there would be no > defragmentation or refragmentation. In that case, OVS is just > attempting to forward to another device and if the MTU check fails, > then bad luck, packets will be dropped. Now, with the interpretation > in this patch, it seems like we're trying to say that, well, actually, > if the controller injects a connection tracking action, then the > controller implicitly switches OVS into a sort of half-L3 mode for > this particular flow. This makes the behaviour a bit inconsistent. I agree, the behavior will be inconsistent w.r.t. L3. But it is right now also. And at least with this change we will be consistently inconsistent - the user requests ct() with the L3 functions that it implies. One other problem with the controller is the way we need to generate FRAGNEED packets in v4. The spec is quite clear with DF=1, drop and generate. With DF=0, it's less clear (at least after I re-checked RFC 791 I didn't see anything, but I might have missed it). The controller will now receive all the traffic, I guess. It's okay with TCP flows that set DF=1, but for UDP (maybe other protocols) that isn't the case. > Another thought that occurs here is that if you have three interfaces > attached to the switch, say one with MTU 1500 and two with MTU 1450, > and the OVS flows are configured to conntrack and clone the packets > from the higher-MTU interface to the lower-MTU interfaces. If you > receive larger IP fragments on the first interface and attempt to > forward on to the other interfaces, should all interfaces generate an > ICMPv6 PTB? I guess they would, for each destination. I don't know if it's desirable, but I can see how it would generate a lot of traffic. Then again, why should it? Would conntrack determine that we have two interfaces to output: actions? > That doesn't seem quite right, especially if one of those > ports is used for mirroring the traffic for operational reasons while > the other path is part of the actual routing path for the traffic. I didn't consider the mirroring case. I guess we would either need some specific metadata. I don't know that anyone is making a mirror port this way, but I guess if the bug report comes in I'll look at it ;) > You'd end up with duplicate PTB messages for the same outbound > request. If I read right, this would also not be able to be controlled > by the OVS controller because when we call into ip6_fragment() and hit > the MTU-handling path, it will automatically take over and generate > the ICMP response out the source interface, without any reference to > the OVS flow table. This seems like it's further breaking the model > where instead of OVS being a purely programmable L2-like flow > match+actions pipeline, now depending on the specific actions you > inject (in particular combinations), you get some bits of the L3 > functionality. But for full L3 functionality, the controller still > needs to handle the rest through the correct set of actions in the > flow. It's made more difficult because ct() action itself does L3 processing (and I think I demonstrated this). > Looking at the tree, it seems like this problem can be solved in > userspace without further kernel changes by using > OVS_ACTION_ATTR_CHECK_PKT_LEN, see commit 4d5ec89fc8d1 ("net: > openvswitch: Add a new action check_pkt_len"). It even explicitly says > "The main use case for adding this action is to solve the packet drops > because of MTU mismatch in OVN virtual networking solution.". Have you > tried using this approach? We looked and discussed it a bit. I think the outcome boils down to check_pkt_len needs to be used on every single instance where a ct() call occurs because ct() implies we have connections to monitor, and that implies l3, so we need to do something (either push to a controller and handle it like OVN would, etc). This has implications on older versions of OvS userspace (that don't support check_pkt_len action), and non-OVN based controllers (that might just program a flow pipeline and expect it to work). I'm not sure what the best approach is really. >> diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore >> index 61ae899cfc17..d4d7487833be 100644 >> --- a/tools/testing/selftests/net/.gitignore >> +++ b/tools/testing/selftests/net/.gitignore >> @@ -30,3 +30,4 @@ hwtstamp_config >> rxtimestamp >> timestamping >> txtimestamp >> +test_mismatched_mtu_with_conntrack >> diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile >> index 25f198bec0b2..dc9b556f86fd 100644 >> --- a/tools/testing/selftests/net/Makefile >> +++ b/tools/testing/selftests/net/Makefile > > Neat to see some bootstrapping of in-tree OVS testing. I'd probably > put it in a separate commit but maybe that's just personal preference. I figured I should add it here because it demonstrates the issue I'm trying to solve. But I agree, it's maybe a new functionality, so I'm okay with submitting this part + test cases with net-next instead. > I didn't look *too* closely at the tests but just one nit below: > >> + # test a udp connection >> + info "send udp data" >> + ip netns exec server sh -c 'cat ${ovs_dir}/fifo | nc -l -vv -u >> 8888 >${ovs_dir}/fifo 2>${ovs_dir}/s1-nc.log & echo $! > >> ${ovs_dir}/server.pid' > > There are multiple netcat implementations with different arguments > (BSD and nmap.org and maybe also Debian versions). Might be nice to > point out which netcat you're relying on here or try to detect & fail > out/skip on the wrong one. For reference, the equivalent OVS test code > detection is here: netcat's -l, -v, and -u options are universal (even to the old 'hobbit' 1.10 netcat), so I don't think we need to do detection for these options. If a future test needs something special (like 'send-only' for nmap-ncat), then it probably makes sense to hook something up then. > https://github.com/openvswitch/ovs/blob/80e74da4fd8bfdaba92105560ce144b4b2d00e36/tests/atlocal.in#L175
On 4/8/21 10:41 PM, Aaron Conole wrote: > Joe Stringer <joe@cilium.io> writes: > >> Hey Aaron, long time no chat :) > > Same :) > >> On Fri, Mar 19, 2021 at 1:43 PM Aaron Conole <aconole@redhat.com> wrote: >>> >>> When a user instructs a flow pipeline to perform connection tracking, >>> there is an implicit L3 operation that occurs - namely the IP fragments >>> are reassembled and then processed as a single unit. After this, new >>> fragments are generated and then transmitted, with the hint that they >>> should be fragmented along the max rx unit boundary. In general, this >>> behavior works well to forward packets along when the MTUs are congruent >>> across the datapath. >>> >>> However, if using a protocol such as UDP on a network with mismatching >>> MTUs, it is possible that the refragmentation will still produce an >>> invalid fragment, and that fragmented packet will not be delivered. >>> Such a case shouldn't happen because the user explicitly requested a >>> layer 3+4 function (conntrack), and that function generates new fragments, >>> so we should perform the needed actions in that case (namely, refragment >>> IPv4 along a correct boundary, or send a packet too big in the IPv6 case). >>> >>> Additionally, introduce a test suite for openvswitch with a test case >>> that ensures this MTU behavior, with the expectation that new tests are >>> added when needed. >>> >>> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action") >>> Signed-off-by: Aaron Conole <aconole@redhat.com> >>> --- >>> NOTE: checkpatch reports a whitespace error with the openvswitch.sh >>> script - this is due to using tab as the IFS value. This part >>> of the script was copied from >>> tools/testing/selftests/net/pmtu.sh so I think should be >>> permissible. >>> >>> net/openvswitch/actions.c | 2 +- >>> tools/testing/selftests/net/.gitignore | 1 + >>> tools/testing/selftests/net/Makefile | 1 + >>> tools/testing/selftests/net/openvswitch.sh | 394 +++++++++++++++++++++ >>> 4 files changed, 397 insertions(+), 1 deletion(-) >>> create mode 100755 tools/testing/selftests/net/openvswitch.sh >>> >>> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c >>> index 92a0b67b2728..d858ea580e43 100644 >>> --- a/net/openvswitch/actions.c >>> +++ b/net/openvswitch/actions.c >>> @@ -890,7 +890,7 @@ static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port, >>> if (likely(!mru || >>> (skb->len <= mru + vport->dev->hard_header_len))) { >>> ovs_vport_send(vport, skb, ovs_key_mac_proto(key)); >>> - } else if (mru <= vport->dev->mtu) { >>> + } else if (mru) { >>> struct net *net = read_pnet(&dp->net); >>> >>> ovs_fragment(net, vport, skb, mru, key); >> >> I thought about this for a while. For a bit of context, my >> recollection is that in the initial design, there was an attempt to >> minimize the set of assumptions around L3 behaviour and despite >> performing this pseudo-L3 action of connection tracking, attempt a >> "bump-in-the-wire" approach where OVS is serving as an L2 switch and >> if you wanted L3 features, you need to build them on top or explicitly >> define that you're looking for L3 semantics. In this case, you're >> interpreting that the combination of the conntrack action + an output >> action implies that L3 routing is being performed. Hence, OVS should >> act like a router and either refragment or generate ICMP PTB in the >> case where MTU differs. According to the flow table, the rest of the >> routing functionality (MAC handling for instance) may or may not have >> been performed at this point, but we basically leave that up to the >> SDN controller to implement the right behaviour. In relation to this >> particular check, the idea was to retain the original geometry of the >> packet such that it's as though there were no functionality performed >> in the middle at all. OVS happened to do connection tracking (which >> implicitly involved queueing fragments), but if you treat it as an >> opaque box, you have ports connected and OVS is simply performing >> forwarding between the ports. > > I've been going back and forth on this. On the one hand, Open vSwitch > is supposed to only care about 'just' the L2 forwarding, with some > additional processing to assist. After that, it's up to an L3 layer to > really provide additional support, and the idea is that the controller > or something else should really be guiding this higher level > intelligence. > > The issue I have is that we do some of the high level intelligence here > to support conntrack, and it's done in a way that is a bit unintuitive. > As an example, you write: > > ... the idea was to retain the original geometry of the packet such > that it's as though there were no functionality performed in the > middle at all > > But, the fragmentation engine isn't guaranteed to reassemble exactly the > same packets. > > Consider the scenario where there is an upstream router that has > performed it's own mitm fragmentation. There can be a sequence of > packets after that that looks like: > > IPID == 1, pkt 1 (1410 bytes), pkt 2 (64 bytes), pkt 3 (1000 bytes) > > When reassembled by the frag engine, we will only use the MRU as the > guide, and that will spit out: > > IPID == 1, pkt 1 (1410 bytes), pkt 2 (1044 bytes) > > We will have reduced the number of packets moving through the network, > and then aren't acting as a bump in the wire, but as a real entity. > > I even tested this: > > 04:28:47 root@dhcp-25 /home/aconole/git/linux/tools/testing/selftests/net# grep 'IP 172.31.110.2' test_mismatched_mtu_with_conntrack/c1-pkts.cap > 16:25:43.481072 IP 172.31.110.2.52352 > 172.31.110.1.ddi-udp-1: UDP, bad length 1901 > 1360 > 16:25:43.525972 IP 172.31.110.2 > 172.31.110.1: ip-proto-17 > 16:25:43.567272 IP 172.31.110.2 > 172.31.110.1: ip-proto-17 > bash: __git_ps1: command not found > 04:28:54 root@dhcp-25 /home/aconole/git/linux/tools/testing/selftests/net# grep 'IP 172.31.110.2' test_mismatched_mtu_with_conntrack/s1-pkts.cap > 16:25:43.567435 IP 172.31.110.2.52352 > 172.31.110.1.ddi-udp-1: UDP, bad length 1901 > 1360 > 16:25:43.567438 IP 172.31.110.2 > 172.31.110.1: ip-proto-17 > > Additionally, because this happens transparently for the flow rule user, > we need to run check_pkt_len() after every call to the conntrack action > because there is really no longer a way to know whether the packet came > in via a fragmented path. I guess we could do something with > ip.frag==first|later|no... selection rules to try and create a custom > table for handling fragments - but that seems like it's a workaround for > the conntrack functionality w.r.t. the fragmentation engine. Maybe it makes no sense, so correct me if I'm wrong, but looking at the defragmentation code I see that it does not touch original fragments. I mean, since it's not a IP_DEFRAG_LOCAL_DELIVER, skb still holds a list of fragments with their original size. Maybe we can fragment them based on existing skb fragments instead of using the maximum fragment size and get the same split as it was before defragmentation? > >> One of the related implications is the contrast between what happens >> in this case if you have a conntrack action injected or not when >> outputting to another port. If you didn't put a connection tracking >> action into the flows when redirecting here, then there would be no >> defragmentation or refragmentation. In that case, OVS is just >> attempting to forward to another device and if the MTU check fails, >> then bad luck, packets will be dropped. Now, with the interpretation >> in this patch, it seems like we're trying to say that, well, actually, >> if the controller injects a connection tracking action, then the >> controller implicitly switches OVS into a sort of half-L3 mode for >> this particular flow. This makes the behaviour a bit inconsistent. > > I agree, the behavior will be inconsistent w.r.t. L3. But it is right > now also. And at least with this change we will be consistently > inconsistent - the user requests ct() with the L3 functions that it > implies. > > One other problem with the controller is the way we need to generate > FRAGNEED packets in v4. The spec is quite clear with DF=1, drop and > generate. With DF=0, it's less clear (at least after I re-checked RFC > 791 I didn't see anything, but I might have missed it). The controller > will now receive all the traffic, I guess. It's okay with TCP flows > that set DF=1, but for UDP (maybe other protocols) that isn't the case. > >> Another thought that occurs here is that if you have three interfaces >> attached to the switch, say one with MTU 1500 and two with MTU 1450, >> and the OVS flows are configured to conntrack and clone the packets >> from the higher-MTU interface to the lower-MTU interfaces. If you >> receive larger IP fragments on the first interface and attempt to >> forward on to the other interfaces, should all interfaces generate an >> ICMPv6 PTB? > > I guess they would, for each destination. I don't know if it's > desirable, but I can see how it would generate a lot of traffic. Then > again, why should it? Would conntrack determine that we have two > interfaces to output: actions? > >> That doesn't seem quite right, especially if one of those >> ports is used for mirroring the traffic for operational reasons while >> the other path is part of the actual routing path for the traffic. > > I didn't consider the mirroring case. I guess we would either need some > specific metadata. I don't know that anyone is making a mirror port > this way, but I guess if the bug report comes in I'll look at it ;) > >> You'd end up with duplicate PTB messages for the same outbound >> request. If I read right, this would also not be able to be controlled >> by the OVS controller because when we call into ip6_fragment() and hit >> the MTU-handling path, it will automatically take over and generate >> the ICMP response out the source interface, without any reference to >> the OVS flow table. This seems like it's further breaking the model >> where instead of OVS being a purely programmable L2-like flow >> match+actions pipeline, now depending on the specific actions you >> inject (in particular combinations), you get some bits of the L3 >> functionality. But for full L3 functionality, the controller still >> needs to handle the rest through the correct set of actions in the >> flow. > > It's made more difficult because ct() action itself does L3 processing > (and I think I demonstrated this). > >> Looking at the tree, it seems like this problem can be solved in >> userspace without further kernel changes by using >> OVS_ACTION_ATTR_CHECK_PKT_LEN, see commit 4d5ec89fc8d1 ("net: >> openvswitch: Add a new action check_pkt_len"). It even explicitly says >> "The main use case for adding this action is to solve the packet drops >> because of MTU mismatch in OVN virtual networking solution.". Have you >> tried using this approach? > > We looked and discussed it a bit. I think the outcome boils down to > check_pkt_len needs to be used on every single instance where a ct() > call occurs because ct() implies we have connections to monitor, and > that implies l3, so we need to do something (either push to a controller > and handle it like OVN would, etc). This has implications on older > versions of OvS userspace (that don't support check_pkt_len action), and > non-OVN based controllers (that might just program a flow pipeline and > expect it to work). > > I'm not sure what the best approach is really. > >>> diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore >>> index 61ae899cfc17..d4d7487833be 100644 >>> --- a/tools/testing/selftests/net/.gitignore >>> +++ b/tools/testing/selftests/net/.gitignore >>> @@ -30,3 +30,4 @@ hwtstamp_config >>> rxtimestamp >>> timestamping >>> txtimestamp >>> +test_mismatched_mtu_with_conntrack >>> diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile >>> index 25f198bec0b2..dc9b556f86fd 100644 >>> --- a/tools/testing/selftests/net/Makefile >>> +++ b/tools/testing/selftests/net/Makefile >> >> Neat to see some bootstrapping of in-tree OVS testing. I'd probably >> put it in a separate commit but maybe that's just personal preference. > > I figured I should add it here because it demonstrates the issue I'm > trying to solve. But I agree, it's maybe a new functionality, so I'm > okay with submitting this part + test cases with net-next instead. > >> I didn't look *too* closely at the tests but just one nit below: >> >>> + # test a udp connection >>> + info "send udp data" >>> + ip netns exec server sh -c 'cat ${ovs_dir}/fifo | nc -l -vv -u >>> 8888 >${ovs_dir}/fifo 2>${ovs_dir}/s1-nc.log & echo $! > >>> ${ovs_dir}/server.pid' >> >> There are multiple netcat implementations with different arguments >> (BSD and nmap.org and maybe also Debian versions). Might be nice to >> point out which netcat you're relying on here or try to detect & fail >> out/skip on the wrong one. For reference, the equivalent OVS test code >> detection is here: > > netcat's -l, -v, and -u options are universal (even to the old 'hobbit' > 1.10 netcat), so I don't think we need to do detection for these > options. If a future test needs something special (like 'send-only' for > nmap-ncat), then it probably makes sense to hook something up then. > >> https://github.com/openvswitch/ovs/blob/80e74da4fd8bfdaba92105560ce144b4b2d00e36/tests/atlocal.in#L175 > > _______________________________________________ > dev mailing list > dev@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev >
Ilya Maximets <i.maximets@ovn.org> writes: > On 4/8/21 10:41 PM, Aaron Conole wrote: >> Joe Stringer <joe@cilium.io> writes: >> >>> Hey Aaron, long time no chat :) >> >> Same :) >> >>> On Fri, Mar 19, 2021 at 1:43 PM Aaron Conole <aconole@redhat.com> wrote: >>>> >>>> When a user instructs a flow pipeline to perform connection tracking, >>>> there is an implicit L3 operation that occurs - namely the IP fragments >>>> are reassembled and then processed as a single unit. After this, new >>>> fragments are generated and then transmitted, with the hint that they >>>> should be fragmented along the max rx unit boundary. In general, this >>>> behavior works well to forward packets along when the MTUs are congruent >>>> across the datapath. >>>> >>>> However, if using a protocol such as UDP on a network with mismatching >>>> MTUs, it is possible that the refragmentation will still produce an >>>> invalid fragment, and that fragmented packet will not be delivered. >>>> Such a case shouldn't happen because the user explicitly requested a >>>> layer 3+4 function (conntrack), and that function generates new fragments, >>>> so we should perform the needed actions in that case (namely, refragment >>>> IPv4 along a correct boundary, or send a packet too big in the IPv6 case). >>>> >>>> Additionally, introduce a test suite for openvswitch with a test case >>>> that ensures this MTU behavior, with the expectation that new tests are >>>> added when needed. >>>> >>>> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action") >>>> Signed-off-by: Aaron Conole <aconole@redhat.com> >>>> --- >>>> NOTE: checkpatch reports a whitespace error with the openvswitch.sh >>>> script - this is due to using tab as the IFS value. This part >>>> of the script was copied from >>>> tools/testing/selftests/net/pmtu.sh so I think should be >>>> permissible. >>>> >>>> net/openvswitch/actions.c | 2 +- >>>> tools/testing/selftests/net/.gitignore | 1 + >>>> tools/testing/selftests/net/Makefile | 1 + >>>> tools/testing/selftests/net/openvswitch.sh | 394 +++++++++++++++++++++ >>>> 4 files changed, 397 insertions(+), 1 deletion(-) >>>> create mode 100755 tools/testing/selftests/net/openvswitch.sh >>>> >>>> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c >>>> index 92a0b67b2728..d858ea580e43 100644 >>>> --- a/net/openvswitch/actions.c >>>> +++ b/net/openvswitch/actions.c >>>> @@ -890,7 +890,7 @@ static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port, >>>> if (likely(!mru || >>>> (skb->len <= mru + vport->dev->hard_header_len))) { >>>> ovs_vport_send(vport, skb, ovs_key_mac_proto(key)); >>>> - } else if (mru <= vport->dev->mtu) { >>>> + } else if (mru) { >>>> struct net *net = read_pnet(&dp->net); >>>> >>>> ovs_fragment(net, vport, skb, mru, key); >>> >>> I thought about this for a while. For a bit of context, my >>> recollection is that in the initial design, there was an attempt to >>> minimize the set of assumptions around L3 behaviour and despite >>> performing this pseudo-L3 action of connection tracking, attempt a >>> "bump-in-the-wire" approach where OVS is serving as an L2 switch and >>> if you wanted L3 features, you need to build them on top or explicitly >>> define that you're looking for L3 semantics. In this case, you're >>> interpreting that the combination of the conntrack action + an output >>> action implies that L3 routing is being performed. Hence, OVS should >>> act like a router and either refragment or generate ICMP PTB in the >>> case where MTU differs. According to the flow table, the rest of the >>> routing functionality (MAC handling for instance) may or may not have >>> been performed at this point, but we basically leave that up to the >>> SDN controller to implement the right behaviour. In relation to this >>> particular check, the idea was to retain the original geometry of the >>> packet such that it's as though there were no functionality performed >>> in the middle at all. OVS happened to do connection tracking (which >>> implicitly involved queueing fragments), but if you treat it as an >>> opaque box, you have ports connected and OVS is simply performing >>> forwarding between the ports. >> >> I've been going back and forth on this. On the one hand, Open vSwitch >> is supposed to only care about 'just' the L2 forwarding, with some >> additional processing to assist. After that, it's up to an L3 layer to >> really provide additional support, and the idea is that the controller >> or something else should really be guiding this higher level >> intelligence. >> >> The issue I have is that we do some of the high level intelligence here >> to support conntrack, and it's done in a way that is a bit unintuitive. >> As an example, you write: >> >> ... the idea was to retain the original geometry of the packet such >> that it's as though there were no functionality performed in the >> middle at all >> >> But, the fragmentation engine isn't guaranteed to reassemble exactly the >> same packets. >> >> Consider the scenario where there is an upstream router that has >> performed it's own mitm fragmentation. There can be a sequence of >> packets after that that looks like: >> >> IPID == 1, pkt 1 (1410 bytes), pkt 2 (64 bytes), pkt 3 (1000 bytes) >> >> When reassembled by the frag engine, we will only use the MRU as the >> guide, and that will spit out: >> >> IPID == 1, pkt 1 (1410 bytes), pkt 2 (1044 bytes) >> >> We will have reduced the number of packets moving through the network, >> and then aren't acting as a bump in the wire, but as a real entity. >> >> I even tested this: >> >> 04:28:47 root@dhcp-25 >> /home/aconole/git/linux/tools/testing/selftests/net# grep 'IP >> 172.31.110.2' test_mismatched_mtu_with_conntrack/c1-pkts.cap >> 16:25:43.481072 IP 172.31.110.2.52352 > 172.31.110.1.ddi-udp-1: UDP, bad length 1901 > 1360 >> 16:25:43.525972 IP 172.31.110.2 > 172.31.110.1: ip-proto-17 >> 16:25:43.567272 IP 172.31.110.2 > 172.31.110.1: ip-proto-17 >> bash: __git_ps1: command not found >> 04:28:54 root@dhcp-25 >> /home/aconole/git/linux/tools/testing/selftests/net# grep 'IP >> 172.31.110.2' test_mismatched_mtu_with_conntrack/s1-pkts.cap >> 16:25:43.567435 IP 172.31.110.2.52352 > 172.31.110.1.ddi-udp-1: UDP, bad length 1901 > 1360 >> 16:25:43.567438 IP 172.31.110.2 > 172.31.110.1: ip-proto-17 >> >> Additionally, because this happens transparently for the flow rule user, >> we need to run check_pkt_len() after every call to the conntrack action >> because there is really no longer a way to know whether the packet came >> in via a fragmented path. I guess we could do something with >> ip.frag==first|later|no... selection rules to try and create a custom >> table for handling fragments - but that seems like it's a workaround for >> the conntrack functionality w.r.t. the fragmentation engine. > > > Maybe it makes no sense, so correct me if I'm wrong, but looking at the > defragmentation code I see that it does not touch original fragments. I guess you're referring to the frag list that gets generated and stored in the skbuff shinfo? > I mean, since it's not a IP_DEFRAG_LOCAL_DELIVER, skb still holds a list > of fragments with their original size. Maybe we can fragment them based > on existing skb fragments instead of using the maximum fragment size and > get the same split as it was before defragmentation? I think during conntrack processing we linearize the skbuff and then discard these fragments. At least, I didn't look as deeply just now, but I did hack a check for the skbfraglist on output: if (skb_has_frag_list(skb)) { printk(KERN_CRIT "SKB HAS A FRAG LIST"); } And this print wasn't hit during the ovs output processing above. So I assume we don't have the fraglist any more by the time output would happen. Are you asking if we can keep this list around to use? >>> One of the related implications is the contrast between what happens >>> in this case if you have a conntrack action injected or not when >>> outputting to another port. If you didn't put a connection tracking >>> action into the flows when redirecting here, then there would be no >>> defragmentation or refragmentation. In that case, OVS is just >>> attempting to forward to another device and if the MTU check fails, >>> then bad luck, packets will be dropped. Now, with the interpretation >>> in this patch, it seems like we're trying to say that, well, actually, >>> if the controller injects a connection tracking action, then the >>> controller implicitly switches OVS into a sort of half-L3 mode for >>> this particular flow. This makes the behaviour a bit inconsistent. >> >> I agree, the behavior will be inconsistent w.r.t. L3. But it is right >> now also. And at least with this change we will be consistently >> inconsistent - the user requests ct() with the L3 functions that it >> implies. >> >> One other problem with the controller is the way we need to generate >> FRAGNEED packets in v4. The spec is quite clear with DF=1, drop and >> generate. With DF=0, it's less clear (at least after I re-checked RFC >> 791 I didn't see anything, but I might have missed it). The controller >> will now receive all the traffic, I guess. It's okay with TCP flows >> that set DF=1, but for UDP (maybe other protocols) that isn't the case. >> >>> Another thought that occurs here is that if you have three interfaces >>> attached to the switch, say one with MTU 1500 and two with MTU 1450, >>> and the OVS flows are configured to conntrack and clone the packets >>> from the higher-MTU interface to the lower-MTU interfaces. If you >>> receive larger IP fragments on the first interface and attempt to >>> forward on to the other interfaces, should all interfaces generate an >>> ICMPv6 PTB? >> >> I guess they would, for each destination. I don't know if it's >> desirable, but I can see how it would generate a lot of traffic. Then >> again, why should it? Would conntrack determine that we have two >> interfaces to output: actions? >> >>> That doesn't seem quite right, especially if one of those >>> ports is used for mirroring the traffic for operational reasons while >>> the other path is part of the actual routing path for the traffic. >> >> I didn't consider the mirroring case. I guess we would either need some >> specific metadata. I don't know that anyone is making a mirror port >> this way, but I guess if the bug report comes in I'll look at it ;) >> >>> You'd end up with duplicate PTB messages for the same outbound >>> request. If I read right, this would also not be able to be controlled >>> by the OVS controller because when we call into ip6_fragment() and hit >>> the MTU-handling path, it will automatically take over and generate >>> the ICMP response out the source interface, without any reference to >>> the OVS flow table. This seems like it's further breaking the model >>> where instead of OVS being a purely programmable L2-like flow >>> match+actions pipeline, now depending on the specific actions you >>> inject (in particular combinations), you get some bits of the L3 >>> functionality. But for full L3 functionality, the controller still >>> needs to handle the rest through the correct set of actions in the >>> flow. >> >> It's made more difficult because ct() action itself does L3 processing >> (and I think I demonstrated this). >> >>> Looking at the tree, it seems like this problem can be solved in >>> userspace without further kernel changes by using >>> OVS_ACTION_ATTR_CHECK_PKT_LEN, see commit 4d5ec89fc8d1 ("net: >>> openvswitch: Add a new action check_pkt_len"). It even explicitly says >>> "The main use case for adding this action is to solve the packet drops >>> because of MTU mismatch in OVN virtual networking solution.". Have you >>> tried using this approach? >> >> We looked and discussed it a bit. I think the outcome boils down to >> check_pkt_len needs to be used on every single instance where a ct() >> call occurs because ct() implies we have connections to monitor, and >> that implies l3, so we need to do something (either push to a controller >> and handle it like OVN would, etc). This has implications on older >> versions of OvS userspace (that don't support check_pkt_len action), and >> non-OVN based controllers (that might just program a flow pipeline and >> expect it to work). >> >> I'm not sure what the best approach is really. >> >>>> diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore >>>> index 61ae899cfc17..d4d7487833be 100644 >>>> --- a/tools/testing/selftests/net/.gitignore >>>> +++ b/tools/testing/selftests/net/.gitignore >>>> @@ -30,3 +30,4 @@ hwtstamp_config >>>> rxtimestamp >>>> timestamping >>>> txtimestamp >>>> +test_mismatched_mtu_with_conntrack >>>> diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile >>>> index 25f198bec0b2..dc9b556f86fd 100644 >>>> --- a/tools/testing/selftests/net/Makefile >>>> +++ b/tools/testing/selftests/net/Makefile >>> >>> Neat to see some bootstrapping of in-tree OVS testing. I'd probably >>> put it in a separate commit but maybe that's just personal preference. >> >> I figured I should add it here because it demonstrates the issue I'm >> trying to solve. But I agree, it's maybe a new functionality, so I'm >> okay with submitting this part + test cases with net-next instead. >> >>> I didn't look *too* closely at the tests but just one nit below: >>> >>>> + # test a udp connection >>>> + info "send udp data" >>>> + ip netns exec server sh -c 'cat ${ovs_dir}/fifo | nc -l -vv -u >>>> 8888 >${ovs_dir}/fifo 2>${ovs_dir}/s1-nc.log & echo $! > >>>> ${ovs_dir}/server.pid' >>> >>> There are multiple netcat implementations with different arguments >>> (BSD and nmap.org and maybe also Debian versions). Might be nice to >>> point out which netcat you're relying on here or try to detect & fail >>> out/skip on the wrong one. For reference, the equivalent OVS test code >>> detection is here: >> >> netcat's -l, -v, and -u options are universal (even to the old 'hobbit' >> 1.10 netcat), so I don't think we need to do detection for these >> options. If a future test needs something special (like 'send-only' for >> nmap-ncat), then it probably makes sense to hook something up then. >> >>> https://github.com/openvswitch/ovs/blob/80e74da4fd8bfdaba92105560ce144b4b2d00e36/tests/atlocal.in#L175 >> >> _______________________________________________ >> dev mailing list >> dev@openvswitch.org >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >>
On 4/10/21 2:22 PM, Aaron Conole wrote: > Ilya Maximets <i.maximets@ovn.org> writes: > >> On 4/8/21 10:41 PM, Aaron Conole wrote: >>> Joe Stringer <joe@cilium.io> writes: >>> >>>> Hey Aaron, long time no chat :) >>> >>> Same :) >>> >>>> On Fri, Mar 19, 2021 at 1:43 PM Aaron Conole <aconole@redhat.com> wrote: >>>>> >>>>> When a user instructs a flow pipeline to perform connection tracking, >>>>> there is an implicit L3 operation that occurs - namely the IP fragments >>>>> are reassembled and then processed as a single unit. After this, new >>>>> fragments are generated and then transmitted, with the hint that they >>>>> should be fragmented along the max rx unit boundary. In general, this >>>>> behavior works well to forward packets along when the MTUs are congruent >>>>> across the datapath. >>>>> >>>>> However, if using a protocol such as UDP on a network with mismatching >>>>> MTUs, it is possible that the refragmentation will still produce an >>>>> invalid fragment, and that fragmented packet will not be delivered. >>>>> Such a case shouldn't happen because the user explicitly requested a >>>>> layer 3+4 function (conntrack), and that function generates new fragments, >>>>> so we should perform the needed actions in that case (namely, refragment >>>>> IPv4 along a correct boundary, or send a packet too big in the IPv6 case). >>>>> >>>>> Additionally, introduce a test suite for openvswitch with a test case >>>>> that ensures this MTU behavior, with the expectation that new tests are >>>>> added when needed. >>>>> >>>>> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action") >>>>> Signed-off-by: Aaron Conole <aconole@redhat.com> >>>>> --- >>>>> NOTE: checkpatch reports a whitespace error with the openvswitch.sh >>>>> script - this is due to using tab as the IFS value. This part >>>>> of the script was copied from >>>>> tools/testing/selftests/net/pmtu.sh so I think should be >>>>> permissible. >>>>> >>>>> net/openvswitch/actions.c | 2 +- >>>>> tools/testing/selftests/net/.gitignore | 1 + >>>>> tools/testing/selftests/net/Makefile | 1 + >>>>> tools/testing/selftests/net/openvswitch.sh | 394 +++++++++++++++++++++ >>>>> 4 files changed, 397 insertions(+), 1 deletion(-) >>>>> create mode 100755 tools/testing/selftests/net/openvswitch.sh >>>>> >>>>> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c >>>>> index 92a0b67b2728..d858ea580e43 100644 >>>>> --- a/net/openvswitch/actions.c >>>>> +++ b/net/openvswitch/actions.c >>>>> @@ -890,7 +890,7 @@ static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port, >>>>> if (likely(!mru || >>>>> (skb->len <= mru + vport->dev->hard_header_len))) { >>>>> ovs_vport_send(vport, skb, ovs_key_mac_proto(key)); >>>>> - } else if (mru <= vport->dev->mtu) { >>>>> + } else if (mru) { >>>>> struct net *net = read_pnet(&dp->net); >>>>> >>>>> ovs_fragment(net, vport, skb, mru, key); >>>> >>>> I thought about this for a while. For a bit of context, my >>>> recollection is that in the initial design, there was an attempt to >>>> minimize the set of assumptions around L3 behaviour and despite >>>> performing this pseudo-L3 action of connection tracking, attempt a >>>> "bump-in-the-wire" approach where OVS is serving as an L2 switch and >>>> if you wanted L3 features, you need to build them on top or explicitly >>>> define that you're looking for L3 semantics. In this case, you're >>>> interpreting that the combination of the conntrack action + an output >>>> action implies that L3 routing is being performed. Hence, OVS should >>>> act like a router and either refragment or generate ICMP PTB in the >>>> case where MTU differs. According to the flow table, the rest of the >>>> routing functionality (MAC handling for instance) may or may not have >>>> been performed at this point, but we basically leave that up to the >>>> SDN controller to implement the right behaviour. In relation to this >>>> particular check, the idea was to retain the original geometry of the >>>> packet such that it's as though there were no functionality performed >>>> in the middle at all. OVS happened to do connection tracking (which >>>> implicitly involved queueing fragments), but if you treat it as an >>>> opaque box, you have ports connected and OVS is simply performing >>>> forwarding between the ports. >>> >>> I've been going back and forth on this. On the one hand, Open vSwitch >>> is supposed to only care about 'just' the L2 forwarding, with some >>> additional processing to assist. After that, it's up to an L3 layer to >>> really provide additional support, and the idea is that the controller >>> or something else should really be guiding this higher level >>> intelligence. >>> >>> The issue I have is that we do some of the high level intelligence here >>> to support conntrack, and it's done in a way that is a bit unintuitive. >>> As an example, you write: >>> >>> ... the idea was to retain the original geometry of the packet such >>> that it's as though there were no functionality performed in the >>> middle at all >>> >>> But, the fragmentation engine isn't guaranteed to reassemble exactly the >>> same packets. >>> >>> Consider the scenario where there is an upstream router that has >>> performed it's own mitm fragmentation. There can be a sequence of >>> packets after that that looks like: >>> >>> IPID == 1, pkt 1 (1410 bytes), pkt 2 (64 bytes), pkt 3 (1000 bytes) >>> >>> When reassembled by the frag engine, we will only use the MRU as the >>> guide, and that will spit out: >>> >>> IPID == 1, pkt 1 (1410 bytes), pkt 2 (1044 bytes) >>> >>> We will have reduced the number of packets moving through the network, >>> and then aren't acting as a bump in the wire, but as a real entity. >>> >>> I even tested this: >>> >>> 04:28:47 root@dhcp-25 >>> /home/aconole/git/linux/tools/testing/selftests/net# grep 'IP >>> 172.31.110.2' test_mismatched_mtu_with_conntrack/c1-pkts.cap >>> 16:25:43.481072 IP 172.31.110.2.52352 > 172.31.110.1.ddi-udp-1: UDP, bad length 1901 > 1360 >>> 16:25:43.525972 IP 172.31.110.2 > 172.31.110.1: ip-proto-17 >>> 16:25:43.567272 IP 172.31.110.2 > 172.31.110.1: ip-proto-17 >>> bash: __git_ps1: command not found >>> 04:28:54 root@dhcp-25 >>> /home/aconole/git/linux/tools/testing/selftests/net# grep 'IP >>> 172.31.110.2' test_mismatched_mtu_with_conntrack/s1-pkts.cap >>> 16:25:43.567435 IP 172.31.110.2.52352 > 172.31.110.1.ddi-udp-1: UDP, bad length 1901 > 1360 >>> 16:25:43.567438 IP 172.31.110.2 > 172.31.110.1: ip-proto-17 >>> >>> Additionally, because this happens transparently for the flow rule user, >>> we need to run check_pkt_len() after every call to the conntrack action >>> because there is really no longer a way to know whether the packet came >>> in via a fragmented path. I guess we could do something with >>> ip.frag==first|later|no... selection rules to try and create a custom >>> table for handling fragments - but that seems like it's a workaround for >>> the conntrack functionality w.r.t. the fragmentation engine. >> >> >> Maybe it makes no sense, so correct me if I'm wrong, but looking at the >> defragmentation code I see that it does not touch original fragments. > > I guess you're referring to the frag list that gets generated and stored > in the skbuff shinfo? I guess so. :) > >> I mean, since it's not a IP_DEFRAG_LOCAL_DELIVER, skb still holds a list >> of fragments with their original size. Maybe we can fragment them based >> on existing skb fragments instead of using the maximum fragment size and >> get the same split as it was before defragmentation? > > I think during conntrack processing we linearize the skbuff and then > discard these fragments. At least, I didn't look as deeply just now, > but I did hack a check for the skbfraglist on output: > > if (skb_has_frag_list(skb)) { > printk(KERN_CRIT "SKB HAS A FRAG LIST"); > } > > And this print wasn't hit during the ovs output processing above. So I > assume we don't have the fraglist any more by the time output would > happen. Are you asking if we can keep this list around to use? Yes, it will be good if we can keep it somehow. ip_do_fragment() uses this list for fragmentation if it's available. At least, it should be available right after ip_defrag(). If we can't keep the list itself without modifying too much of the code, maybe we can just memorize sizes and use them later for fragmentation. Not sure how the good implementation should look like, though. Anyway, my point is that it looks more like a technical difficulty rather than conceptual problem. Another thing: It seems like very recently some very similar (to what OVS does right now) fragmentation logic was added to net/sched/: c129412f74e9 ("net/sched: sch_frag: add generic packet fragment support.") Do we have the same problem there? We, likely, want the logic to be consistent. IIUC, this is the code that will be executed if OVS will offload conntrack to TC, right? And one more question here: How does CT offload capable HW works in this case? What is the logic around re-fragmenting there? Is there some common guideline on how solution should behave (looks like there is no any, otherwise we would not have this discussion)? > >>>> One of the related implications is the contrast between what happens >>>> in this case if you have a conntrack action injected or not when >>>> outputting to another port. If you didn't put a connection tracking >>>> action into the flows when redirecting here, then there would be no >>>> defragmentation or refragmentation. In that case, OVS is just >>>> attempting to forward to another device and if the MTU check fails, >>>> then bad luck, packets will be dropped. Now, with the interpretation >>>> in this patch, it seems like we're trying to say that, well, actually, >>>> if the controller injects a connection tracking action, then the >>>> controller implicitly switches OVS into a sort of half-L3 mode for >>>> this particular flow. This makes the behaviour a bit inconsistent. >>> >>> I agree, the behavior will be inconsistent w.r.t. L3. But it is right >>> now also. And at least with this change we will be consistently >>> inconsistent - the user requests ct() with the L3 functions that it >>> implies. >>> >>> One other problem with the controller is the way we need to generate >>> FRAGNEED packets in v4. The spec is quite clear with DF=1, drop and >>> generate. With DF=0, it's less clear (at least after I re-checked RFC >>> 791 I didn't see anything, but I might have missed it). The controller >>> will now receive all the traffic, I guess. It's okay with TCP flows >>> that set DF=1, but for UDP (maybe other protocols) that isn't the case. >>> >>>> Another thought that occurs here is that if you have three interfaces >>>> attached to the switch, say one with MTU 1500 and two with MTU 1450, >>>> and the OVS flows are configured to conntrack and clone the packets >>>> from the higher-MTU interface to the lower-MTU interfaces. If you >>>> receive larger IP fragments on the first interface and attempt to >>>> forward on to the other interfaces, should all interfaces generate an >>>> ICMPv6 PTB? >>> >>> I guess they would, for each destination. I don't know if it's >>> desirable, but I can see how it would generate a lot of traffic. Then >>> again, why should it? Would conntrack determine that we have two >>> interfaces to output: actions? >>> >>>> That doesn't seem quite right, especially if one of those >>>> ports is used for mirroring the traffic for operational reasons while >>>> the other path is part of the actual routing path for the traffic. >>> >>> I didn't consider the mirroring case. I guess we would either need some >>> specific metadata. I don't know that anyone is making a mirror port >>> this way, but I guess if the bug report comes in I'll look at it ;) >>> >>>> You'd end up with duplicate PTB messages for the same outbound >>>> request. If I read right, this would also not be able to be controlled >>>> by the OVS controller because when we call into ip6_fragment() and hit >>>> the MTU-handling path, it will automatically take over and generate >>>> the ICMP response out the source interface, without any reference to >>>> the OVS flow table. This seems like it's further breaking the model >>>> where instead of OVS being a purely programmable L2-like flow >>>> match+actions pipeline, now depending on the specific actions you >>>> inject (in particular combinations), you get some bits of the L3 >>>> functionality. But for full L3 functionality, the controller still >>>> needs to handle the rest through the correct set of actions in the >>>> flow. >>> >>> It's made more difficult because ct() action itself does L3 processing >>> (and I think I demonstrated this). >>> >>>> Looking at the tree, it seems like this problem can be solved in >>>> userspace without further kernel changes by using >>>> OVS_ACTION_ATTR_CHECK_PKT_LEN, see commit 4d5ec89fc8d1 ("net: >>>> openvswitch: Add a new action check_pkt_len"). It even explicitly says >>>> "The main use case for adding this action is to solve the packet drops >>>> because of MTU mismatch in OVN virtual networking solution.". Have you >>>> tried using this approach? >>> >>> We looked and discussed it a bit. I think the outcome boils down to >>> check_pkt_len needs to be used on every single instance where a ct() >>> call occurs because ct() implies we have connections to monitor, and >>> that implies l3, so we need to do something (either push to a controller >>> and handle it like OVN would, etc). This has implications on older >>> versions of OvS userspace (that don't support check_pkt_len action), and >>> non-OVN based controllers (that might just program a flow pipeline and >>> expect it to work). >>> >>> I'm not sure what the best approach is really. >>> >>>>> diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore >>>>> index 61ae899cfc17..d4d7487833be 100644 >>>>> --- a/tools/testing/selftests/net/.gitignore >>>>> +++ b/tools/testing/selftests/net/.gitignore >>>>> @@ -30,3 +30,4 @@ hwtstamp_config >>>>> rxtimestamp >>>>> timestamping >>>>> txtimestamp >>>>> +test_mismatched_mtu_with_conntrack >>>>> diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile >>>>> index 25f198bec0b2..dc9b556f86fd 100644 >>>>> --- a/tools/testing/selftests/net/Makefile >>>>> +++ b/tools/testing/selftests/net/Makefile >>>> >>>> Neat to see some bootstrapping of in-tree OVS testing. I'd probably >>>> put it in a separate commit but maybe that's just personal preference. >>> >>> I figured I should add it here because it demonstrates the issue I'm >>> trying to solve. But I agree, it's maybe a new functionality, so I'm >>> okay with submitting this part + test cases with net-next instead. >>> >>>> I didn't look *too* closely at the tests but just one nit below: >>>> >>>>> + # test a udp connection >>>>> + info "send udp data" >>>>> + ip netns exec server sh -c 'cat ${ovs_dir}/fifo | nc -l -vv -u >>>>> 8888 >${ovs_dir}/fifo 2>${ovs_dir}/s1-nc.log & echo $! > >>>>> ${ovs_dir}/server.pid' >>>> >>>> There are multiple netcat implementations with different arguments >>>> (BSD and nmap.org and maybe also Debian versions). Might be nice to >>>> point out which netcat you're relying on here or try to detect & fail >>>> out/skip on the wrong one. For reference, the equivalent OVS test code >>>> detection is here: >>> >>> netcat's -l, -v, and -u options are universal (even to the old 'hobbit' >>> 1.10 netcat), so I don't think we need to do detection for these >>> options. If a future test needs something special (like 'send-only' for >>> nmap-ncat), then it probably makes sense to hook something up then. >>> >>>> https://github.com/openvswitch/ovs/blob/80e74da4fd8bfdaba92105560ce144b4b2d00e36/tests/atlocal.in#L175 >>> >>> _______________________________________________ >>> dev mailing list >>> dev@openvswitch.org >>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >>> >
Ilya Maximets <i.maximets@ovn.org> writes: > On 4/10/21 2:22 PM, Aaron Conole wrote: >> Ilya Maximets <i.maximets@ovn.org> writes: >> >>> On 4/8/21 10:41 PM, Aaron Conole wrote: >>>> Joe Stringer <joe@cilium.io> writes: >>>> >>>>> Hey Aaron, long time no chat :) >>>> >>>> Same :) >>>> >>>>> On Fri, Mar 19, 2021 at 1:43 PM Aaron Conole <aconole@redhat.com> wrote: >>>>>> >>>>>> When a user instructs a flow pipeline to perform connection tracking, >>>>>> there is an implicit L3 operation that occurs - namely the IP fragments >>>>>> are reassembled and then processed as a single unit. After this, new >>>>>> fragments are generated and then transmitted, with the hint that they >>>>>> should be fragmented along the max rx unit boundary. In general, this >>>>>> behavior works well to forward packets along when the MTUs are congruent >>>>>> across the datapath. >>>>>> >>>>>> However, if using a protocol such as UDP on a network with mismatching >>>>>> MTUs, it is possible that the refragmentation will still produce an >>>>>> invalid fragment, and that fragmented packet will not be delivered. >>>>>> Such a case shouldn't happen because the user explicitly requested a >>>>>> layer 3+4 function (conntrack), and that function generates new fragments, >>>>>> so we should perform the needed actions in that case (namely, refragment >>>>>> IPv4 along a correct boundary, or send a packet too big in the IPv6 case). >>>>>> >>>>>> Additionally, introduce a test suite for openvswitch with a test case >>>>>> that ensures this MTU behavior, with the expectation that new tests are >>>>>> added when needed. >>>>>> >>>>>> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action") >>>>>> Signed-off-by: Aaron Conole <aconole@redhat.com> >>>>>> --- >>>>>> NOTE: checkpatch reports a whitespace error with the openvswitch.sh >>>>>> script - this is due to using tab as the IFS value. This part >>>>>> of the script was copied from >>>>>> tools/testing/selftests/net/pmtu.sh so I think should be >>>>>> permissible. >>>>>> >>>>>> net/openvswitch/actions.c | 2 +- >>>>>> tools/testing/selftests/net/.gitignore | 1 + >>>>>> tools/testing/selftests/net/Makefile | 1 + >>>>>> tools/testing/selftests/net/openvswitch.sh | 394 +++++++++++++++++++++ >>>>>> 4 files changed, 397 insertions(+), 1 deletion(-) >>>>>> create mode 100755 tools/testing/selftests/net/openvswitch.sh >>>>>> >>>>>> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c >>>>>> index 92a0b67b2728..d858ea580e43 100644 >>>>>> --- a/net/openvswitch/actions.c >>>>>> +++ b/net/openvswitch/actions.c >>>>>> @@ -890,7 +890,7 @@ static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port, >>>>>> if (likely(!mru || >>>>>> (skb->len <= mru + vport->dev->hard_header_len))) { >>>>>> ovs_vport_send(vport, skb, ovs_key_mac_proto(key)); >>>>>> - } else if (mru <= vport->dev->mtu) { >>>>>> + } else if (mru) { >>>>>> struct net *net = read_pnet(&dp->net); >>>>>> >>>>>> ovs_fragment(net, vport, skb, mru, key); >>>>> >>>>> I thought about this for a while. For a bit of context, my >>>>> recollection is that in the initial design, there was an attempt to >>>>> minimize the set of assumptions around L3 behaviour and despite >>>>> performing this pseudo-L3 action of connection tracking, attempt a >>>>> "bump-in-the-wire" approach where OVS is serving as an L2 switch and >>>>> if you wanted L3 features, you need to build them on top or explicitly >>>>> define that you're looking for L3 semantics. In this case, you're >>>>> interpreting that the combination of the conntrack action + an output >>>>> action implies that L3 routing is being performed. Hence, OVS should >>>>> act like a router and either refragment or generate ICMP PTB in the >>>>> case where MTU differs. According to the flow table, the rest of the >>>>> routing functionality (MAC handling for instance) may or may not have >>>>> been performed at this point, but we basically leave that up to the >>>>> SDN controller to implement the right behaviour. In relation to this >>>>> particular check, the idea was to retain the original geometry of the >>>>> packet such that it's as though there were no functionality performed >>>>> in the middle at all. OVS happened to do connection tracking (which >>>>> implicitly involved queueing fragments), but if you treat it as an >>>>> opaque box, you have ports connected and OVS is simply performing >>>>> forwarding between the ports. >>>> >>>> I've been going back and forth on this. On the one hand, Open vSwitch >>>> is supposed to only care about 'just' the L2 forwarding, with some >>>> additional processing to assist. After that, it's up to an L3 layer to >>>> really provide additional support, and the idea is that the controller >>>> or something else should really be guiding this higher level >>>> intelligence. >>>> >>>> The issue I have is that we do some of the high level intelligence here >>>> to support conntrack, and it's done in a way that is a bit unintuitive. >>>> As an example, you write: >>>> >>>> ... the idea was to retain the original geometry of the packet such >>>> that it's as though there were no functionality performed in the >>>> middle at all >>>> >>>> But, the fragmentation engine isn't guaranteed to reassemble exactly the >>>> same packets. >>>> >>>> Consider the scenario where there is an upstream router that has >>>> performed it's own mitm fragmentation. There can be a sequence of >>>> packets after that that looks like: >>>> >>>> IPID == 1, pkt 1 (1410 bytes), pkt 2 (64 bytes), pkt 3 (1000 bytes) >>>> >>>> When reassembled by the frag engine, we will only use the MRU as the >>>> guide, and that will spit out: >>>> >>>> IPID == 1, pkt 1 (1410 bytes), pkt 2 (1044 bytes) >>>> >>>> We will have reduced the number of packets moving through the network, >>>> and then aren't acting as a bump in the wire, but as a real entity. >>>> >>>> I even tested this: >>>> >>>> 04:28:47 root@dhcp-25 >>>> /home/aconole/git/linux/tools/testing/selftests/net# grep 'IP >>>> 172.31.110.2' test_mismatched_mtu_with_conntrack/c1-pkts.cap >>>> 16:25:43.481072 IP 172.31.110.2.52352 > 172.31.110.1.ddi-udp-1: UDP, bad length 1901 > 1360 >>>> 16:25:43.525972 IP 172.31.110.2 > 172.31.110.1: ip-proto-17 >>>> 16:25:43.567272 IP 172.31.110.2 > 172.31.110.1: ip-proto-17 >>>> bash: __git_ps1: command not found >>>> 04:28:54 root@dhcp-25 >>>> /home/aconole/git/linux/tools/testing/selftests/net# grep 'IP >>>> 172.31.110.2' test_mismatched_mtu_with_conntrack/s1-pkts.cap >>>> 16:25:43.567435 IP 172.31.110.2.52352 > 172.31.110.1.ddi-udp-1: UDP, bad length 1901 > 1360 >>>> 16:25:43.567438 IP 172.31.110.2 > 172.31.110.1: ip-proto-17 >>>> >>>> Additionally, because this happens transparently for the flow rule user, >>>> we need to run check_pkt_len() after every call to the conntrack action >>>> because there is really no longer a way to know whether the packet came >>>> in via a fragmented path. I guess we could do something with >>>> ip.frag==first|later|no... selection rules to try and create a custom >>>> table for handling fragments - but that seems like it's a workaround for >>>> the conntrack functionality w.r.t. the fragmentation engine. >>> >>> >>> Maybe it makes no sense, so correct me if I'm wrong, but looking at the >>> defragmentation code I see that it does not touch original fragments. >> >> I guess you're referring to the frag list that gets generated and stored >> in the skbuff shinfo? > > I guess so. :) > >> >>> I mean, since it's not a IP_DEFRAG_LOCAL_DELIVER, skb still holds a list >>> of fragments with their original size. Maybe we can fragment them based >>> on existing skb fragments instead of using the maximum fragment size and >>> get the same split as it was before defragmentation? >> >> I think during conntrack processing we linearize the skbuff and then >> discard these fragments. At least, I didn't look as deeply just now, >> but I did hack a check for the skbfraglist on output: >> >> if (skb_has_frag_list(skb)) { >> printk(KERN_CRIT "SKB HAS A FRAG LIST"); >> } >> >> And this print wasn't hit during the ovs output processing above. So I >> assume we don't have the fraglist any more by the time output would >> happen. Are you asking if we can keep this list around to use? > > Yes, it will be good if we can keep it somehow. ip_do_fragment() uses > this list for fragmentation if it's available. This is quite a change - we would need some additional data to hang onto with the skbuff (or maybe we store it somewhere else). I'm not sure where to put this information. > At least, it should be available right after ip_defrag(). If we can't > keep the list itself without modifying too much of the code, maybe we > can just memorize sizes and use them later for fragmentation. Not sure > how the good implementation should look like, though. > > Anyway, my point is that it looks more like a technical difficulty rather > than conceptual problem. Sure. It's all software, not stone tablets :) > Another thing: It seems like very recently some very similar (to what OVS > does right now) fragmentation logic was added to net/sched/: > c129412f74e9 ("net/sched: sch_frag: add generic packet fragment support.") > > Do we have the same problem there? We, likely, want the logic to be > consistent. IIUC, this is the code that will be executed if OVS will > offload conntrack to TC, right? Yes, similar behavior is present, and if tc datapath is used I guess it would be executed and exhibit the same behavior. Should be simple to modify the test case I attached to the patch and test it out, so I'll try it out. > And one more question here: How does CT offload capable HW works in this > case? What is the logic around re-fragmenting there? Is there some > common guideline on how solution should behave (looks like there is no > any, otherwise we would not have this discussion)? In the case of CT offload, fragmented packets might be skipped by the HW and pushed into SW path... not sure really. It's a difficult set of cases. No such guidelines exist anywhere that I've found. I think OvS does have some flow matching for fragmented packets, but no idea what they do with it. Maybe Joe or Marcelo has some thoughts on what the expected / desired behavior might be in these cases? >>>>> One of the related implications is the contrast between what happens >>>>> in this case if you have a conntrack action injected or not when >>>>> outputting to another port. If you didn't put a connection tracking >>>>> action into the flows when redirecting here, then there would be no >>>>> defragmentation or refragmentation. In that case, OVS is just >>>>> attempting to forward to another device and if the MTU check fails, >>>>> then bad luck, packets will be dropped. Now, with the interpretation >>>>> in this patch, it seems like we're trying to say that, well, actually, >>>>> if the controller injects a connection tracking action, then the >>>>> controller implicitly switches OVS into a sort of half-L3 mode for >>>>> this particular flow. This makes the behaviour a bit inconsistent. >>>> >>>> I agree, the behavior will be inconsistent w.r.t. L3. But it is right >>>> now also. And at least with this change we will be consistently >>>> inconsistent - the user requests ct() with the L3 functions that it >>>> implies. >>>> >>>> One other problem with the controller is the way we need to generate >>>> FRAGNEED packets in v4. The spec is quite clear with DF=1, drop and >>>> generate. With DF=0, it's less clear (at least after I re-checked RFC >>>> 791 I didn't see anything, but I might have missed it). The controller >>>> will now receive all the traffic, I guess. It's okay with TCP flows >>>> that set DF=1, but for UDP (maybe other protocols) that isn't the case. >>>> >>>>> Another thought that occurs here is that if you have three interfaces >>>>> attached to the switch, say one with MTU 1500 and two with MTU 1450, >>>>> and the OVS flows are configured to conntrack and clone the packets >>>>> from the higher-MTU interface to the lower-MTU interfaces. If you >>>>> receive larger IP fragments on the first interface and attempt to >>>>> forward on to the other interfaces, should all interfaces generate an >>>>> ICMPv6 PTB? >>>> >>>> I guess they would, for each destination. I don't know if it's >>>> desirable, but I can see how it would generate a lot of traffic. Then >>>> again, why should it? Would conntrack determine that we have two >>>> interfaces to output: actions? >>>> >>>>> That doesn't seem quite right, especially if one of those >>>>> ports is used for mirroring the traffic for operational reasons while >>>>> the other path is part of the actual routing path for the traffic. >>>> >>>> I didn't consider the mirroring case. I guess we would either need some >>>> specific metadata. I don't know that anyone is making a mirror port >>>> this way, but I guess if the bug report comes in I'll look at it ;) >>>> >>>>> You'd end up with duplicate PTB messages for the same outbound >>>>> request. If I read right, this would also not be able to be controlled >>>>> by the OVS controller because when we call into ip6_fragment() and hit >>>>> the MTU-handling path, it will automatically take over and generate >>>>> the ICMP response out the source interface, without any reference to >>>>> the OVS flow table. This seems like it's further breaking the model >>>>> where instead of OVS being a purely programmable L2-like flow >>>>> match+actions pipeline, now depending on the specific actions you >>>>> inject (in particular combinations), you get some bits of the L3 >>>>> functionality. But for full L3 functionality, the controller still >>>>> needs to handle the rest through the correct set of actions in the >>>>> flow. >>>> >>>> It's made more difficult because ct() action itself does L3 processing >>>> (and I think I demonstrated this). >>>> >>>>> Looking at the tree, it seems like this problem can be solved in >>>>> userspace without further kernel changes by using >>>>> OVS_ACTION_ATTR_CHECK_PKT_LEN, see commit 4d5ec89fc8d1 ("net: >>>>> openvswitch: Add a new action check_pkt_len"). It even explicitly says >>>>> "The main use case for adding this action is to solve the packet drops >>>>> because of MTU mismatch in OVN virtual networking solution.". Have you >>>>> tried using this approach? >>>> >>>> We looked and discussed it a bit. I think the outcome boils down to >>>> check_pkt_len needs to be used on every single instance where a ct() >>>> call occurs because ct() implies we have connections to monitor, and >>>> that implies l3, so we need to do something (either push to a controller >>>> and handle it like OVN would, etc). This has implications on older >>>> versions of OvS userspace (that don't support check_pkt_len action), and >>>> non-OVN based controllers (that might just program a flow pipeline and >>>> expect it to work). >>>> >>>> I'm not sure what the best approach is really. >>>> >>>>>> diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore >>>>>> index 61ae899cfc17..d4d7487833be 100644 >>>>>> --- a/tools/testing/selftests/net/.gitignore >>>>>> +++ b/tools/testing/selftests/net/.gitignore >>>>>> @@ -30,3 +30,4 @@ hwtstamp_config >>>>>> rxtimestamp >>>>>> timestamping >>>>>> txtimestamp >>>>>> +test_mismatched_mtu_with_conntrack >>>>>> diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile >>>>>> index 25f198bec0b2..dc9b556f86fd 100644 >>>>>> --- a/tools/testing/selftests/net/Makefile >>>>>> +++ b/tools/testing/selftests/net/Makefile >>>>> >>>>> Neat to see some bootstrapping of in-tree OVS testing. I'd probably >>>>> put it in a separate commit but maybe that's just personal preference. >>>> >>>> I figured I should add it here because it demonstrates the issue I'm >>>> trying to solve. But I agree, it's maybe a new functionality, so I'm >>>> okay with submitting this part + test cases with net-next instead. >>>> >>>>> I didn't look *too* closely at the tests but just one nit below: >>>>> >>>>>> + # test a udp connection >>>>>> + info "send udp data" >>>>>> + ip netns exec server sh -c 'cat ${ovs_dir}/fifo | nc -l -vv -u >>>>>> 8888 >${ovs_dir}/fifo 2>${ovs_dir}/s1-nc.log & echo $! > >>>>>> ${ovs_dir}/server.pid' >>>>> >>>>> There are multiple netcat implementations with different arguments >>>>> (BSD and nmap.org and maybe also Debian versions). Might be nice to >>>>> point out which netcat you're relying on here or try to detect & fail >>>>> out/skip on the wrong one. For reference, the equivalent OVS test code >>>>> detection is here: >>>> >>>> netcat's -l, -v, and -u options are universal (even to the old 'hobbit' >>>> 1.10 netcat), so I don't think we need to do detection for these >>>> options. If a future test needs something special (like 'send-only' for >>>> nmap-ncat), then it probably makes sense to hook something up then. >>>> >>>>> https://github.com/openvswitch/ovs/blob/80e74da4fd8bfdaba92105560ce144b4b2d00e36/tests/atlocal.in#L175 >>>> >>>> _______________________________________________ >>>> dev mailing list >>>> dev@openvswitch.org >>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >>>> >>
On Mon, Apr 12, 2021 at 09:40:40AM -0400, Aaron Conole wrote: > Ilya Maximets <i.maximets@ovn.org> writes: > > > On 4/10/21 2:22 PM, Aaron Conole wrote: > >> Ilya Maximets <i.maximets@ovn.org> writes: > >> > >>> On 4/8/21 10:41 PM, Aaron Conole wrote: > >>>> Joe Stringer <joe@cilium.io> writes: > >>>> > >>>>> Hey Aaron, long time no chat :) > >>>> > >>>> Same :) > >>>> > >>>>> On Fri, Mar 19, 2021 at 1:43 PM Aaron Conole <aconole@redhat.com> wrote: > >>>>>> > >>>>>> When a user instructs a flow pipeline to perform connection tracking, > >>>>>> there is an implicit L3 operation that occurs - namely the IP fragments > >>>>>> are reassembled and then processed as a single unit. After this, new > >>>>>> fragments are generated and then transmitted, with the hint that they > >>>>>> should be fragmented along the max rx unit boundary. In general, this > >>>>>> behavior works well to forward packets along when the MTUs are congruent > >>>>>> across the datapath. > >>>>>> > >>>>>> However, if using a protocol such as UDP on a network with mismatching > >>>>>> MTUs, it is possible that the refragmentation will still produce an > >>>>>> invalid fragment, and that fragmented packet will not be delivered. > >>>>>> Such a case shouldn't happen because the user explicitly requested a > >>>>>> layer 3+4 function (conntrack), and that function generates new fragments, > >>>>>> so we should perform the needed actions in that case (namely, refragment > >>>>>> IPv4 along a correct boundary, or send a packet too big in the IPv6 case). > >>>>>> > >>>>>> Additionally, introduce a test suite for openvswitch with a test case > >>>>>> that ensures this MTU behavior, with the expectation that new tests are > >>>>>> added when needed. > >>>>>> > >>>>>> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action") > >>>>>> Signed-off-by: Aaron Conole <aconole@redhat.com> > >>>>>> --- > >>>>>> NOTE: checkpatch reports a whitespace error with the openvswitch.sh > >>>>>> script - this is due to using tab as the IFS value. This part > >>>>>> of the script was copied from > >>>>>> tools/testing/selftests/net/pmtu.sh so I think should be > >>>>>> permissible. > >>>>>> > >>>>>> net/openvswitch/actions.c | 2 +- > >>>>>> tools/testing/selftests/net/.gitignore | 1 + > >>>>>> tools/testing/selftests/net/Makefile | 1 + > >>>>>> tools/testing/selftests/net/openvswitch.sh | 394 +++++++++++++++++++++ > >>>>>> 4 files changed, 397 insertions(+), 1 deletion(-) > >>>>>> create mode 100755 tools/testing/selftests/net/openvswitch.sh > >>>>>> > >>>>>> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c > >>>>>> index 92a0b67b2728..d858ea580e43 100644 > >>>>>> --- a/net/openvswitch/actions.c > >>>>>> +++ b/net/openvswitch/actions.c > >>>>>> @@ -890,7 +890,7 @@ static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port, > >>>>>> if (likely(!mru || > >>>>>> (skb->len <= mru + vport->dev->hard_header_len))) { > >>>>>> ovs_vport_send(vport, skb, ovs_key_mac_proto(key)); > >>>>>> - } else if (mru <= vport->dev->mtu) { > >>>>>> + } else if (mru) { > >>>>>> struct net *net = read_pnet(&dp->net); > >>>>>> > >>>>>> ovs_fragment(net, vport, skb, mru, key); > >>>>> > >>>>> I thought about this for a while. For a bit of context, my > >>>>> recollection is that in the initial design, there was an attempt to > >>>>> minimize the set of assumptions around L3 behaviour and despite > >>>>> performing this pseudo-L3 action of connection tracking, attempt a > >>>>> "bump-in-the-wire" approach where OVS is serving as an L2 switch and > >>>>> if you wanted L3 features, you need to build them on top or explicitly > >>>>> define that you're looking for L3 semantics. In this case, you're > >>>>> interpreting that the combination of the conntrack action + an output > >>>>> action implies that L3 routing is being performed. Hence, OVS should > >>>>> act like a router and either refragment or generate ICMP PTB in the > >>>>> case where MTU differs. According to the flow table, the rest of the > >>>>> routing functionality (MAC handling for instance) may or may not have > >>>>> been performed at this point, but we basically leave that up to the > >>>>> SDN controller to implement the right behaviour. In relation to this > >>>>> particular check, the idea was to retain the original geometry of the > >>>>> packet such that it's as though there were no functionality performed > >>>>> in the middle at all. OVS happened to do connection tracking (which > >>>>> implicitly involved queueing fragments), but if you treat it as an > >>>>> opaque box, you have ports connected and OVS is simply performing > >>>>> forwarding between the ports. > >>>> > >>>> I've been going back and forth on this. On the one hand, Open vSwitch > >>>> is supposed to only care about 'just' the L2 forwarding, with some > >>>> additional processing to assist. After that, it's up to an L3 layer to > >>>> really provide additional support, and the idea is that the controller > >>>> or something else should really be guiding this higher level > >>>> intelligence. > >>>> > >>>> The issue I have is that we do some of the high level intelligence here > >>>> to support conntrack, and it's done in a way that is a bit unintuitive. > >>>> As an example, you write: > >>>> > >>>> ... the idea was to retain the original geometry of the packet such > >>>> that it's as though there were no functionality performed in the > >>>> middle at all > >>>> > >>>> But, the fragmentation engine isn't guaranteed to reassemble exactly the > >>>> same packets. > >>>> > >>>> Consider the scenario where there is an upstream router that has > >>>> performed it's own mitm fragmentation. There can be a sequence of > >>>> packets after that that looks like: > >>>> > >>>> IPID == 1, pkt 1 (1410 bytes), pkt 2 (64 bytes), pkt 3 (1000 bytes) > >>>> > >>>> When reassembled by the frag engine, we will only use the MRU as the > >>>> guide, and that will spit out: > >>>> > >>>> IPID == 1, pkt 1 (1410 bytes), pkt 2 (1044 bytes) > >>>> > >>>> We will have reduced the number of packets moving through the network, > >>>> and then aren't acting as a bump in the wire, but as a real entity. > >>>> > >>>> I even tested this: > >>>> > >>>> 04:28:47 root@dhcp-25 > >>>> /home/aconole/git/linux/tools/testing/selftests/net# grep 'IP > >>>> 172.31.110.2' test_mismatched_mtu_with_conntrack/c1-pkts.cap > >>>> 16:25:43.481072 IP 172.31.110.2.52352 > 172.31.110.1.ddi-udp-1: UDP, bad length 1901 > 1360 > >>>> 16:25:43.525972 IP 172.31.110.2 > 172.31.110.1: ip-proto-17 > >>>> 16:25:43.567272 IP 172.31.110.2 > 172.31.110.1: ip-proto-17 > >>>> bash: __git_ps1: command not found > >>>> 04:28:54 root@dhcp-25 > >>>> /home/aconole/git/linux/tools/testing/selftests/net# grep 'IP > >>>> 172.31.110.2' test_mismatched_mtu_with_conntrack/s1-pkts.cap > >>>> 16:25:43.567435 IP 172.31.110.2.52352 > 172.31.110.1.ddi-udp-1: UDP, bad length 1901 > 1360 > >>>> 16:25:43.567438 IP 172.31.110.2 > 172.31.110.1: ip-proto-17 > >>>> > >>>> Additionally, because this happens transparently for the flow rule user, > >>>> we need to run check_pkt_len() after every call to the conntrack action check_pkt_len() if not offloadable (via tc, at least), btw. Point being, its usage can stop offloadable flows from getting offloaded. > >>>> because there is really no longer a way to know whether the packet came > >>>> in via a fragmented path. I guess we could do something with > >>>> ip.frag==first|later|no... selection rules to try and create a custom > >>>> table for handling fragments - but that seems like it's a workaround for > >>>> the conntrack functionality w.r.t. the fragmentation engine. > >>> > >>> > >>> Maybe it makes no sense, so correct me if I'm wrong, but looking at the > >>> defragmentation code I see that it does not touch original fragments. > >> > >> I guess you're referring to the frag list that gets generated and stored > >> in the skbuff shinfo? > > > > I guess so. :) > > > >> > >>> I mean, since it's not a IP_DEFRAG_LOCAL_DELIVER, skb still holds a list > >>> of fragments with their original size. Maybe we can fragment them based > >>> on existing skb fragments instead of using the maximum fragment size and > >>> get the same split as it was before defragmentation? > >> > >> I think during conntrack processing we linearize the skbuff and then > >> discard these fragments. At least, I didn't look as deeply just now, > >> but I did hack a check for the skbfraglist on output: > >> > >> if (skb_has_frag_list(skb)) { > >> printk(KERN_CRIT "SKB HAS A FRAG LIST"); > >> } > >> > >> And this print wasn't hit during the ovs output processing above. So I > >> assume we don't have the fraglist any more by the time output would > >> happen. Are you asking if we can keep this list around to use? > > > > Yes, it will be good if we can keep it somehow. ip_do_fragment() uses > > this list for fragmentation if it's available. > > This is quite a change - we would need some additional data to hang onto > with the skbuff (or maybe we store it somewhere else). I'm not sure > where to put this information. > > > At least, it should be available right after ip_defrag(). If we can't > > keep the list itself without modifying too much of the code, maybe we > > can just memorize sizes and use them later for fragmentation. Not sure > > how the good implementation should look like, though. > > > > Anyway, my point is that it looks more like a technical difficulty rather > > than conceptual problem. > > Sure. It's all software, not stone tablets :) > > > Another thing: It seems like very recently some very similar (to what OVS > > does right now) fragmentation logic was added to net/sched/: > > c129412f74e9 ("net/sched: sch_frag: add generic packet fragment support.") > > > > Do we have the same problem there? We, likely, want the logic to be > > consistent. IIUC, this is the code that will be executed if OVS will > > offload conntrack to TC, right? > > Yes, similar behavior is present, and if tc datapath is used I guess it > would be executed and exhibit the same behavior. Should be simple to > modify the test case I attached to the patch and test it out, so I'll > try it out. tc code for handling fragments for CT was heavily based on OVS kernel code. I also expect both (tc and ovs kernel) implementations to match. > > > And one more question here: How does CT offload capable HW works in this > > case? What is the logic around re-fragmenting there? Is there some > > common guideline on how solution should behave (looks like there is no > > any, otherwise we would not have this discussion)? > > In the case of CT offload, fragmented packets might be skipped by the HW > and pushed into SW path... not sure really. It's a difficult set of > cases. No such guidelines exist anywhere that I've found. I think OvS > does have some flow matching for fragmented packets, but no idea what > they do with it. > > Maybe Joe or Marcelo has some thoughts on what the expected / desired > behavior might be in these cases? To the best of my knowlege, the HW will never do ip defrag and will simply send fragments to SW. Then, tc rules should be able to pick these up and if act_ct was loaded, reassemble them and re-fragment when mirred sends them out. With that implicit fallback, the rules/flows don't need to change to accomodate it. At least, up to now. :-) > > >>>>> One of the related implications is the contrast between what happens > >>>>> in this case if you have a conntrack action injected or not when > >>>>> outputting to another port. If you didn't put a connection tracking > >>>>> action into the flows when redirecting here, then there would be no > >>>>> defragmentation or refragmentation. In that case, OVS is just > >>>>> attempting to forward to another device and if the MTU check fails, > >>>>> then bad luck, packets will be dropped. Now, with the interpretation > >>>>> in this patch, it seems like we're trying to say that, well, actually, > >>>>> if the controller injects a connection tracking action, then the > >>>>> controller implicitly switches OVS into a sort of half-L3 mode for > >>>>> this particular flow. This makes the behaviour a bit inconsistent. > >>>> > >>>> I agree, the behavior will be inconsistent w.r.t. L3. But it is right > >>>> now also. And at least with this change we will be consistently > >>>> inconsistent - the user requests ct() with the L3 functions that it > >>>> implies. > >>>> > >>>> One other problem with the controller is the way we need to generate > >>>> FRAGNEED packets in v4. The spec is quite clear with DF=1, drop and > >>>> generate. With DF=0, it's less clear (at least after I re-checked RFC > >>>> 791 I didn't see anything, but I might have missed it). The controller > >>>> will now receive all the traffic, I guess. It's okay with TCP flows > >>>> that set DF=1, but for UDP (maybe other protocols) that isn't the case. > >>>> > >>>>> Another thought that occurs here is that if you have three interfaces > >>>>> attached to the switch, say one with MTU 1500 and two with MTU 1450, > >>>>> and the OVS flows are configured to conntrack and clone the packets > >>>>> from the higher-MTU interface to the lower-MTU interfaces. If you > >>>>> receive larger IP fragments on the first interface and attempt to > >>>>> forward on to the other interfaces, should all interfaces generate an > >>>>> ICMPv6 PTB? > >>>> > >>>> I guess they would, for each destination. I don't know if it's > >>>> desirable, but I can see how it would generate a lot of traffic. Then > >>>> again, why should it? Would conntrack determine that we have two > >>>> interfaces to output: actions? > >>>> > >>>>> That doesn't seem quite right, especially if one of those > >>>>> ports is used for mirroring the traffic for operational reasons while > >>>>> the other path is part of the actual routing path for the traffic. > >>>> > >>>> I didn't consider the mirroring case. I guess we would either need some > >>>> specific metadata. I don't know that anyone is making a mirror port > >>>> this way, but I guess if the bug report comes in I'll look at it ;) > >>>> > >>>>> You'd end up with duplicate PTB messages for the same outbound > >>>>> request. If I read right, this would also not be able to be controlled > >>>>> by the OVS controller because when we call into ip6_fragment() and hit > >>>>> the MTU-handling path, it will automatically take over and generate > >>>>> the ICMP response out the source interface, without any reference to > >>>>> the OVS flow table. This seems like it's further breaking the model > >>>>> where instead of OVS being a purely programmable L2-like flow > >>>>> match+actions pipeline, now depending on the specific actions you > >>>>> inject (in particular combinations), you get some bits of the L3 > >>>>> functionality. But for full L3 functionality, the controller still > >>>>> needs to handle the rest through the correct set of actions in the > >>>>> flow. > >>>> > >>>> It's made more difficult because ct() action itself does L3 processing > >>>> (and I think I demonstrated this). > >>>> > >>>>> Looking at the tree, it seems like this problem can be solved in > >>>>> userspace without further kernel changes by using > >>>>> OVS_ACTION_ATTR_CHECK_PKT_LEN, see commit 4d5ec89fc8d1 ("net: > >>>>> openvswitch: Add a new action check_pkt_len"). It even explicitly says > >>>>> "The main use case for adding this action is to solve the packet drops > >>>>> because of MTU mismatch in OVN virtual networking solution.". Have you > >>>>> tried using this approach? > >>>> > >>>> We looked and discussed it a bit. I think the outcome boils down to > >>>> check_pkt_len needs to be used on every single instance where a ct() > >>>> call occurs because ct() implies we have connections to monitor, and > >>>> that implies l3, so we need to do something (either push to a controller > >>>> and handle it like OVN would, etc). This has implications on older > >>>> versions of OvS userspace (that don't support check_pkt_len action), and > >>>> non-OVN based controllers (that might just program a flow pipeline and > >>>> expect it to work). > >>>> > >>>> I'm not sure what the best approach is really. > >>>> > >>>>>> diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore > >>>>>> index 61ae899cfc17..d4d7487833be 100644 > >>>>>> --- a/tools/testing/selftests/net/.gitignore > >>>>>> +++ b/tools/testing/selftests/net/.gitignore > >>>>>> @@ -30,3 +30,4 @@ hwtstamp_config > >>>>>> rxtimestamp > >>>>>> timestamping > >>>>>> txtimestamp > >>>>>> +test_mismatched_mtu_with_conntrack > >>>>>> diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile > >>>>>> index 25f198bec0b2..dc9b556f86fd 100644 > >>>>>> --- a/tools/testing/selftests/net/Makefile > >>>>>> +++ b/tools/testing/selftests/net/Makefile > >>>>> > >>>>> Neat to see some bootstrapping of in-tree OVS testing. I'd probably > >>>>> put it in a separate commit but maybe that's just personal preference. > >>>> > >>>> I figured I should add it here because it demonstrates the issue I'm > >>>> trying to solve. But I agree, it's maybe a new functionality, so I'm > >>>> okay with submitting this part + test cases with net-next instead. > >>>> > >>>>> I didn't look *too* closely at the tests but just one nit below: > >>>>> > >>>>>> + # test a udp connection > >>>>>> + info "send udp data" > >>>>>> + ip netns exec server sh -c 'cat ${ovs_dir}/fifo | nc -l -vv -u > >>>>>> 8888 >${ovs_dir}/fifo 2>${ovs_dir}/s1-nc.log & echo $! > > >>>>>> ${ovs_dir}/server.pid' > >>>>> > >>>>> There are multiple netcat implementations with different arguments > >>>>> (BSD and nmap.org and maybe also Debian versions). Might be nice to > >>>>> point out which netcat you're relying on here or try to detect & fail > >>>>> out/skip on the wrong one. For reference, the equivalent OVS test code > >>>>> detection is here: > >>>> > >>>> netcat's -l, -v, and -u options are universal (even to the old 'hobbit' > >>>> 1.10 netcat), so I don't think we need to do detection for these > >>>> options. If a future test needs something special (like 'send-only' for > >>>> nmap-ncat), then it probably makes sense to hook something up then. > >>>> > >>>>> https://github.com/openvswitch/ovs/blob/80e74da4fd8bfdaba92105560ce144b4b2d00e36/tests/atlocal.in#L175 > >>>> > >>>> _______________________________________________ > >>>> dev mailing list > >>>> dev@openvswitch.org > >>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev > >>>> > >> >
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c index 92a0b67b2728..d858ea580e43 100644 --- a/net/openvswitch/actions.c +++ b/net/openvswitch/actions.c @@ -890,7 +890,7 @@ static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port, if (likely(!mru || (skb->len <= mru + vport->dev->hard_header_len))) { ovs_vport_send(vport, skb, ovs_key_mac_proto(key)); - } else if (mru <= vport->dev->mtu) { + } else if (mru) { struct net *net = read_pnet(&dp->net); ovs_fragment(net, vport, skb, mru, key); diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore index 61ae899cfc17..d4d7487833be 100644 --- a/tools/testing/selftests/net/.gitignore +++ b/tools/testing/selftests/net/.gitignore @@ -30,3 +30,4 @@ hwtstamp_config rxtimestamp timestamping txtimestamp +test_mismatched_mtu_with_conntrack diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile index 25f198bec0b2..dc9b556f86fd 100644 --- a/tools/testing/selftests/net/Makefile +++ b/tools/testing/selftests/net/Makefile @@ -23,6 +23,7 @@ TEST_PROGS += drop_monitor_tests.sh TEST_PROGS += vrf_route_leaking.sh TEST_PROGS += bareudp.sh TEST_PROGS += unicast_extensions.sh +TEST_PROGS += openvswitch.sh TEST_PROGS_EXTENDED := in_netns.sh TEST_GEN_FILES = socket nettest TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy reuseport_addr_any diff --git a/tools/testing/selftests/net/openvswitch.sh b/tools/testing/selftests/net/openvswitch.sh new file mode 100755 index 000000000000..7b6341688cc3 --- /dev/null +++ b/tools/testing/selftests/net/openvswitch.sh @@ -0,0 +1,394 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 +# +# OVS kernel module self tests +# +# Tests currently implemented: +# +# - mismatched_mtu_with_conntrack +# Set up two namespaces (client and server) which each have devices specifying +# incongruent MTUs (1450 vs 1500 in the test). Transmit a UDP packet of 1901 bytes +# from client to server, and back. Ensure that the ct() action +# uses the mru as a hint, but not a forced check. + + +# Kselftest framework requirement - SKIP code is 4. +ksft_skip=4 + +PAUSE_ON_FAIL=no +VERBOSE=0 +TRACING=0 + +tests=" + mismatched_mtu_with_conntrack ipv4: IP Fragmentation with conntrack" + + +usage() { + echo + echo "$0 [OPTIONS] [TEST]..." + echo "If no TEST argument is given, all tests will be run." + echo + echo "Options" + echo " -t: capture traffic via tcpdump" + echo " -v: verbose" + echo " -p: pause on failure" + echo + echo "Available tests${tests}" + exit 1 +} + +on_exit() { + echo "$1" > ${ovs_dir}/cleanup.tmp + cat ${ovs_dir}/cleanup >> ${ovs_dir}/cleanup.tmp + mv ${ovs_dir}/cleanup.tmp ${ovs_dir}/cleanup +} + +ovs_setenv() { + sandbox=$1 + + ovs_dir=$ovs_base${1:+/$1}; export ovs_dir + + test -e ${ovs_dir}/cleanup || : > ${ovs_dir}/cleanup + + OVS_RUNDIR=$ovs_dir; export OVS_RUNDIR + OVS_LOGDIR=$ovs_dir; export OVS_LOGDIR + OVS_DBDIR=$ovs_dir; export OVS_DBDIR + OVS_SYSCONFDIR=$ovs_dir; export OVS_SYSCONFDIR + OVS_PKGDATADIR=$ovs_dir; export OVS_PKGDATADIR +} + +ovs_exit_sig() { + . "$ovs_dir/cleanup" + ovs-dpctl del-dp ovs-system +} + +ovs_cleanup() { + ovs_exit_sig + [ $VERBOSE = 0 ] || echo "Error detected. See $ovs_dir for more details." +} + +ovs_normal_exit() { + ovs_exit_sig + rm -rf ${ovs_dir} +} + +info() { + [ $VERBOSE = 0 ] || echo $* +} + +kill_ovs_vswitchd () { + # Use provided PID or save the current PID if available. + TMPPID=$1 + if test -z "$TMPPID"; then + TMPPID=$(cat $OVS_RUNDIR/ovs-vswitchd.pid 2>/dev/null) + fi + + # Tell the daemon to terminate gracefully + ovs-appctl -t ovs-vswitchd exit --cleanup 2>/dev/null + + # Nothing else to be done if there is no PID + test -z "$TMPPID" && return + + for i in 1 2 3 4 5 6 7 8 9; do + # Check if the daemon is alive. + kill -0 $TMPPID 2>/dev/null || return + + # Fallback to whole number since POSIX doesn't require + # fractional times to work. + sleep 0.1 || sleep 1 + done + + # Make sure it is terminated. + kill $TMPPID +} + +start_daemon () { + info "exec: $@ -vconsole:off --detach --no-chdir --pidfile --log-file" + "$@" -vconsole:off --detach --no-chdir --pidfile --log-file >/dev/null + pidfile="$OVS_RUNDIR"/$1.pid + + info "setting kill for $@..." + on_exit "test -e \"$pidfile\" && kill \`cat \"$pidfile\"\`" +} + +if test "X$vswitchd_schema" = "X"; then + vswitchd_schema="/usr/share/openvswitch" +fi + +ovs_sbx() { + if test "X$2" != X; then + (ovs_setenv $1; shift; "$@" >> ${ovs_dir}/debug.log) + else + ovs_setenv $1 + fi +} + +seq () { + if test $# = 1; then + set 1 $1 + fi + while test $1 -le $2; do + echo $1 + set `expr $1 + ${3-1}` $2 $3 + done +} + +ovs_wait () { + info "$1: waiting $2..." + + # First try the condition without waiting. + if ovs_wait_cond; then info "$1: wait succeeded immediately"; return 0; fi + + # Try a quick sleep, so that the test completes very quickly + # in the normal case. POSIX doesn't require fractional times to + # work, so this might not work. + sleep 0.1 + if ovs_wait_cond; then info "$1: wait succeeded quickly"; return 0; fi + + # Then wait up to OVS_CTL_TIMEOUT seconds. + local d + for d in `seq 1 "$OVS_CTL_TIMEOUT"`; do + sleep 1 + if ovs_wait_cond; then info "$1: wait succeeded after $d seconds"; return 0; fi + done + + info "$1: wait failed after $d seconds" + ovs_wait_failed +} + +sbxs= +sbx_add () { + info "adding sandbox '$1'" + + sbxs="$sbxs $1" + + NO_BIN=0 + which ovsdb-tool >/dev/null 2>&1 || NO_BIN=1 + which ovsdb-server >/dev/null 2>&1 || NO_BIN=1 + which ovs-vsctl >/dev/null 2>&1 || NO_BIN=1 + which ovs-vswitchd >/dev/null 2>&1 || NO_BIN=1 + + if [ $NO_BIN = 1 ]; then + info "Missing required binaries..." + return 4 + fi + # Create sandbox. + local d="$ovs_base"/$1 + if [ -e $d ]; then + info "removing $d" + rm -rf "$d" + fi + mkdir "$d" || return 1 + ovs_setenv $1 + + # Create database and start ovsdb-server. + info "$1: create db and start db-server" + : > "$d"/.conf.db.~lock~ + ovs_sbx $1 ovsdb-tool create "$d"/conf.db "$vswitchd_schema"/vswitch.ovsschema || return 1 + ovs_sbx $1 start_daemon ovsdb-server --detach --remote=punix:"$d"/db.sock || return 1 + + # Initialize database. + ovs_sbx $1 ovs-vsctl --no-wait -- init || return 1 + + # Start ovs-vswitchd + info "$1: start vswitchd" + ovs_sbx $1 start_daemon ovs-vswitchd -vvconn -vofproto_dpif -vunixctl + + ovs_wait_cond () { + if ip link show ovs-netdev >/dev/null 2>&1; then return 1; else return 0; fi + } + ovs_wait_failed () { + : + } + + ovs_wait "sandbox_add" "while ip link show ovs-netdev" || return 1 +} + +ovs_base=`pwd` + +# mismatched_mtu_with_conntrack test +# - client has 1450 byte MTU +# - server has 1500 byte MTU +# - use UDP to send 1901 bytes each direction for mismatched +# fragmentation. +test_mismatched_mtu_with_conntrack() { + + sbx_add "test_mismatched_mtu_with_conntrack" || return $? + + info "create namespaces" + for ns in client server; do + ip netns add $ns || return 1 + on_exit "ip netns del $ns" + done + + # setup the base bridge + ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-vsctl add-br br0 || return 1 + + # setup the client + info "setup client namespace" + ip link add c0 type veth peer name c1 || return 1 + on_exit "ip link del c0 >/dev/null 2>&1" + + ip link set c0 mtu 1450 + ip link set c0 up + + ip link set c1 netns client || return 1 + ip netns exec client ip addr add 172.31.110.2/24 dev c1 + ip netns exec client ip link set c1 mtu 1450 + ip netns exec client ip link set c1 up + ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-vsctl add-port br0 c0 || return 1 + + # setup the server + info "setup server namespace" + ip link add s0 type veth peer name s1 || return 1 + on_exit "ip link del s0 >/dev/null 2>&1; ip netns exec server ip link del s0 >/dev/null 2>&1" + ip link set s0 up + + ip link set s1 netns server || return 1 + ip netns exec server ip addr add 172.31.110.1/24 dev s1 || return 1 + ip netns exec server ip link set s1 up + ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-vsctl add-port br0 s0 || return 1 + + info "setup flows" + ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-ofctl del-flows br0 + + cat >${ovs_dir}/flows.txt <<EOF +table=0,priority=100,arp action=normal +table=0,priority=100,ip,udp action=ct(table=1) +table=0,priority=10 action=drop + +table=1,priority=100,ct_state=+new+trk,in_port=c0,ip action=ct(commit),s0 +table=1,priority=100,ct_state=+est+trk,in_port=s0,ip action=c0 + +EOF + ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-ofctl --bundle add-flows br0 ${ovs_dir}/flows.txt || return 1 + ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-ofctl dump-flows br0 + + ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-vsctl show + + # setup echo + mknod -m 777 ${ovs_dir}/fifo p || return 1 + # on_exit "rm ${ovs_dir}/fifo" + + # test a udp connection + info "send udp data" + ip netns exec server sh -c 'cat ${ovs_dir}/fifo | nc -l -vv -u 8888 >${ovs_dir}/fifo 2>${ovs_dir}/s1-nc.log & echo $! > ${ovs_dir}/server.pid' + on_exit "test -e \"${ovs_dir}/server.pid\" && kill \`cat \"${ovs_dir}/server.pid\"\`" + if [ $TRACING = 1 ]; then + ip netns exec server sh -c "tcpdump -i s1 -l -n -U -xx >> ${ovs_dir}/s1-pkts.cap &" + ip netns exec client sh -c "tcpdump -i c1 -l -n -U -xx >> ${ovs_dir}/c1-pkts.cap &" + fi + + ip netns exec client sh -c "python -c \"import time; print('a' * 1900); time.sleep(2)\" | nc -v -u 172.31.110.1 8888 2>${ovs_dir}/c1-nc.log" || return 1 + + ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-appctl dpctl/dump-flows + ovs_sbx "test_mismatched_mtu_with_conntrack" ovs-ofctl dump-flows br0 + + info "check udp data was tx and rx" + grep "1901 bytes received" ${ovs_dir}/c1-nc.log || return 1 + ovs_normal_exit +} + +run_test() { + ( + tname="$1" + tdesc="$2" + + if ! lsmod | grep openvswitch >/dev/null 2>&1; then + printf "TEST: %-60s [SKIP]\n" "${tdesc}" + return $ksft_skip + fi + + unset IFS + + eval test_${tname} + ret=$? + + if [ $ret -eq 0 ]; then + printf "TEST: %-60s [ OK ]\n" "${tdesc}" + ovs_normal_exit + elif [ $ret -eq 1 ]; then + printf "TEST: %-60s [FAIL]\n" "${tdesc}" + if [ "${PAUSE_ON_FAIL}" = "yes" ]; then + echo + echo "Pausing. Hit enter to continue" + read a + fi + ovs_exit_sig + exit 1 + elif [ $ret -eq $ksft_skip ]; then + printf "TEST: %-60s [SKIP]\n" "${tdesc}" + elif [ $ret -eq 2 ]; then + rm -rf test_${tname} + run_test "$1" "$2" + fi + + return $ret + ) + ret=$? + case $ret in + 0) + all_skipped=false + [ $exitcode=$ksft_skip ] && exitcode=0 + ;; + $ksft_skip) + [ $all_skipped = true ] && exitcode=$ksft_skip + ;; + *) + all_skipped=false + exitcode=1 + ;; + esac + + return $ret +} + + +exitcode=0 +desc=0 +all_skipped=true + +while getopts :pvt o +do + case $o in + p) PAUSE_ON_FAIL=yes;; + v) VERBOSE=1;; + t) if which tcpdump > /dev/null 2>&1; then + TRACING=1 + else + echo "=== tcpdump not available, tracing disabled" + fi + ;; + *) usage;; + esac +done +shift $(($OPTIND-1)) + +IFS=" +" + +for arg do + # Check first that all requested tests are available before running any + command -v > /dev/null "test_${arg}" || { echo "=== Test ${arg} not found"; usage; } +done + +name="" +desc="" +for t in ${tests}; do + [ "${name}" = "" ] && name="${t}" && continue + [ "${desc}" = "" ] && desc="${t}" + + run_this=1 + for arg do + [ "${arg}" != "${arg#--*}" ] && continue + [ "${arg}" = "${name}" ] && run_this=1 && break + run_this=0 + done + if [ $run_this -eq 1 ]; then + run_test "${name}" "${desc}" + fi + name="" + desc="" +done + +exit ${exitcode}
When a user instructs a flow pipeline to perform connection tracking, there is an implicit L3 operation that occurs - namely the IP fragments are reassembled and then processed as a single unit. After this, new fragments are generated and then transmitted, with the hint that they should be fragmented along the max rx unit boundary. In general, this behavior works well to forward packets along when the MTUs are congruent across the datapath. However, if using a protocol such as UDP on a network with mismatching MTUs, it is possible that the refragmentation will still produce an invalid fragment, and that fragmented packet will not be delivered. Such a case shouldn't happen because the user explicitly requested a layer 3+4 function (conntrack), and that function generates new fragments, so we should perform the needed actions in that case (namely, refragment IPv4 along a correct boundary, or send a packet too big in the IPv6 case). Additionally, introduce a test suite for openvswitch with a test case that ensures this MTU behavior, with the expectation that new tests are added when needed. Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action") Signed-off-by: Aaron Conole <aconole@redhat.com> --- NOTE: checkpatch reports a whitespace error with the openvswitch.sh script - this is due to using tab as the IFS value. This part of the script was copied from tools/testing/selftests/net/pmtu.sh so I think should be permissible. net/openvswitch/actions.c | 2 +- tools/testing/selftests/net/.gitignore | 1 + tools/testing/selftests/net/Makefile | 1 + tools/testing/selftests/net/openvswitch.sh | 394 +++++++++++++++++++++ 4 files changed, 397 insertions(+), 1 deletion(-) create mode 100755 tools/testing/selftests/net/openvswitch.sh