diff mbox series

[net] net: ipv4: fix path MTU for multi path routes

Message ID 20210731011729.4357-1-vfedorenko@novek.ru (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net] net: ipv4: fix path MTU for multi path routes | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present fail Series targets non-next tree, but doesn't contain any Fixes tags
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net
netdev/subject_prefix success Link
netdev/cc_maintainers warning 1 maintainers not CCed: yoshfuji@linux-ipv6.org
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 45 this patch: 45
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch warning WARNING: line length of 83 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns
netdev/build_allmodconfig_warn success Errors and warnings before: 38 this patch: 38
netdev/header_inline success Link

Commit Message

Vadim Fedorenko July 31, 2021, 1:17 a.m. UTC
Bug 213729 showed that MTU check could be used against route that
will not be used in actual transmit if source ip is not specified.
But path MTU update is always done on route with defined source ip.
Fix route selection by updating flow info in case when source ip
is not explicitly defined in raw and udp sockets.

Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
---
 net/ipv4/raw.c | 11 +++++++++++
 net/ipv4/udp.c | 13 +++++++++++++
 2 files changed, 24 insertions(+)

Comments

David Ahern Aug. 1, 2021, 5:12 p.m. UTC | #1
On 7/30/21 7:17 PM, Vadim Fedorenko wrote:
> Bug 213729 showed that MTU check could be used against route that
> will not be used in actual transmit if source ip is not specified.
> But path MTU update is always done on route with defined source ip.
> Fix route selection by updating flow info in case when source ip
> is not explicitly defined in raw and udp sockets.
> 
>

There is more to it than just setting the source address and doing a
second lookup.

Attached is a test script I started last summer which shows the problem
at hand and is setup to cover various permutations of routing (legacy
routing, nexthop objects, and vrf), network protocols (v4 and v6) and
should cover tcp, udp and icmp:

# PMTU handling with multipath routing.
#
#          .-- sw1 --.
#   h1 ----|-- sw2 --|---- h2 -------- h3
#          |   ...   |       reduced mtu
#          .-- swN --.
#
# h2-h3 segment has reduced mtu.
# Exceptions created in h1 for h3.

N=8 (8-way multipath) seems to always demonstrate it; N=2 is a 50-50 chance.

Snippet from a run this morning:

# ip netns exec h1 ping -s 1450 10.100.2.254
PING 10.100.2.254 (10.100.2.254) 1450(1478) bytes of data.
From 10.2.22.254 icmp_seq=1 Frag needed and DF set (mtu = 1420)
From 10.2.22.254 icmp_seq=2 Frag needed and DF set (mtu = 1420)
From 10.2.22.254 icmp_seq=3 Frag needed and DF set (mtu = 1420)
From 10.2.22.254 icmp_seq=4 Frag needed and DF set (mtu = 1420)

ok, an MTU message makes it back to h1, but that it continues shows the
exception is not created on the right interface:

# ip -netns h1 ro ls cache
10.100.2.254 via 10.1.15.5 dev eth5
    cache expires 580sec mtu 1420

But the selected path is:
# ip -netns h1 ro get 10.100.2.254
10.100.2.254 via 10.1.12.2 dev eth2 src 10.1.12.254 uid 0
    cache

Adding in the source address does not fix it but it does change the
selected path. .e.g,

# ip -netns h1 ro get 10.100.2.254 from 10.1.16.254
10.100.2.254 from 10.1.16.254 via 10.1.14.4 dev eth4 uid 0
    cache

If 10.1.16.254 is the set source address then egress should be eth6, not
eth4, since that is an address on eth6.
Vadim Fedorenko Aug. 5, 2021, 8:51 p.m. UTC | #2
On 01.08.2021 18:12, David Ahern wrote:
> On 7/30/21 7:17 PM, Vadim Fedorenko wrote:
>> Bug 213729 showed that MTU check could be used against route that
>> will not be used in actual transmit if source ip is not specified.
>> But path MTU update is always done on route with defined source ip.
>> Fix route selection by updating flow info in case when source ip
>> is not explicitly defined in raw and udp sockets.
> 
> There is more to it than just setting the source address and doing a
> second lookup.
> 
You are right. Update of source address fixes only some specific cases.
Also, I'm not fun of doing several lookups just because we found additional
next hops. It looks like (for ipv4 case) fib_table_lookup() should select
correct next-hop based on hash and update source ip and output interface
for flowi4. But right now flowi4 is constant and such change looks more
like net-next improvement. Or do you have another solution?

> Attached is a test script I started last summer which shows the problem
> at hand and is setup to cover various permutations of routing (legacy
> routing, nexthop objects, and vrf), network protocols (v4 and v6) and
> should cover tcp, udp and icmp:
> 
> # PMTU handling with multipath routing.
> #
> #          .-- sw1 --.
> #   h1 ----|-- sw2 --|---- h2 -------- h3
> #          |   ...   |       reduced mtu
> #          .-- swN --.
> #
> # h2-h3 segment has reduced mtu.
> # Exceptions created in h1 for h3.
> 
> N=8 (8-way multipath) seems to always demonstrate it; N=2 is a 50-50 chance.
> 
> Snippet from a run this morning:
> 
> # ip netns exec h1 ping -s 1450 10.100.2.254
> PING 10.100.2.254 (10.100.2.254) 1450(1478) bytes of data.
>  From 10.2.22.254 icmp_seq=1 Frag needed and DF set (mtu = 1420)
>  From 10.2.22.254 icmp_seq=2 Frag needed and DF set (mtu = 1420)
>  From 10.2.22.254 icmp_seq=3 Frag needed and DF set (mtu = 1420)
>  From 10.2.22.254 icmp_seq=4 Frag needed and DF set (mtu = 1420)
> 
> ok, an MTU message makes it back to h1, but that it continues shows the
> exception is not created on the right interface:
> 
> # ip -netns h1 ro ls cache
> 10.100.2.254 via 10.1.15.5 dev eth5
>      cache expires 580sec mtu 1420
> 
> But the selected path is:
> # ip -netns h1 ro get 10.100.2.254
> 10.100.2.254 via 10.1.12.2 dev eth2 src 10.1.12.254 uid 0
>      cache
> 
> Adding in the source address does not fix it but it does change the
> selected path. .e.g,
> 
> # ip -netns h1 ro get 10.100.2.254 from 10.1.16.254
> 10.100.2.254 from 10.1.16.254 via 10.1.14.4 dev eth4 uid 0
>      cache
> 
> If 10.1.16.254 is the set source address then egress should be eth6, not
> eth4, since that is an address on eth6.
>
David Ahern Aug. 5, 2021, 11:13 p.m. UTC | #3
On 8/5/21 2:51 PM, Vadim Fedorenko wrote:
> On 01.08.2021 18:12, David Ahern wrote:
>> On 7/30/21 7:17 PM, Vadim Fedorenko wrote:
>>> Bug 213729 showed that MTU check could be used against route that
>>> will not be used in actual transmit if source ip is not specified.
>>> But path MTU update is always done on route with defined source ip.
>>> Fix route selection by updating flow info in case when source ip
>>> is not explicitly defined in raw and udp sockets.
>>
>> There is more to it than just setting the source address and doing a
>> second lookup.
>>
> You are right. Update of source address fixes only some specific cases.
> Also, I'm not fun of doing several lookups just because we found additional
> next hops. It looks like (for ipv4 case) fib_table_lookup() should select
> correct next-hop based on hash and update source ip and output interface
> for flowi4. But right now flowi4 is constant and such change looks more
> like net-next improvement. Or do you have another solution?
> 

This is a rare case where I wrote the test script first lacking ideas at
that moment. What comes to mind now is that one part of the solution is
to make ipv4 similar to ipv6 in that

1. device bind is always stronger than source address bind, and
2. if saddr is set, prefer the nexthop device with that address (but
only for local traffic). This handles the connect() case where saddr has
not been set yet.

The second one will get expensive if a device has multiple addresses.
FIB lookups for IPv4 have been highly optimized, and we do not want to
explode the lookup time for forwards where saddr is always known and it
does not need to be considered in the lookup to fix local local traffic
which needs a saddr to be picked along with the egress device. Perhaps a
flow flag (SKIP_SADDR similar to the current SKIP_NH_OIF) will address
this part. Another less desirable option is to improve but not totally
fix the problem by only considering the primary address.

For ICMP the PMTU update should look at the header returned in the icmp
payload to determine which device needs the pmtu exception.
diff mbox series

Patch

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index bb446e60cf58..e4008416dfc1 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -640,6 +640,17 @@  static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 			goto done;
 	}
 
+	if (!saddr) {
+		rt = __ip_route_output_key(net, &fl4);
+		if (IS_ERR(rt)) {
+			err = PTR_ERR(rt);
+			rt = NULL;
+			goto done;
+		}
+		ip_rt_put(rt);
+		flowi4_update_output(&fl4, ipc.oif, tos, fl4.daddr, fl4.saddr);
+	}
+
 	security_sk_classify_flow(sk, flowi4_to_flowi_common(&fl4));
 	rt = ip_route_output_flow(net, &fl4, sk);
 	if (IS_ERR(rt)) {
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 1a742b710e54..c6db5c3aa294 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1215,6 +1215,19 @@  int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 				   faddr, saddr, dport, inet->inet_sport,
 				   sk->sk_uid);
 
+		if (!saddr) {
+			rt = __ip_route_output_key(net, fl4);
+			if (IS_ERR(rt)) {
+				err = PTR_ERR(rt);
+				rt = NULL;
+				if (err == -ENETUNREACH)
+					IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+				goto out;
+			}
+			ip_rt_put(rt);
+			flowi4_update_output(fl4, ipc.oif, tos, fl4->daddr, fl4->saddr);
+		}
+
 		security_sk_classify_flow(sk, flowi4_to_flowi_common(fl4));
 		rt = ip_route_output_flow(net, fl4, sk);
 		if (IS_ERR(rt)) {