Message ID | 20230501162530.26414-1-vladimir@nikishkin.pw (mailing list archive) |
---|---|
State | Deferred |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net-next,v7,1/2] Add nolocalbypass option to vxlan. | expand |
On Tue, 2 May 2023 00:25:29 +0800 Vladimir Nikishkin <vladimir@nikishkin.pw> wrote: > If a packet needs to be encapsulated towards a local destination IP and > a VXLAN device that matches the destination port and VNI exists, then > the packet will be injected into the Rx path as if it was received by > the target VXLAN device without undergoing encapsulation. If such a > device does not exist, the packet will be dropped. > > There are scenarios where we do not want to drop such packets and > instead want to let them be encapsulated and locally received by a user > space program that post-processes these VXLAN packets. > > To that end, add a new VXLAN device attribute that controls whether such > packets are dropped or not. When set ("localbypass") these packets are > dropped and when unset ("nolocalbypass") the packets are encapsulated > and locally delivered to the listening user space application. Default > to "localbypass" to maintain existing behavior. > > Signed-off-by: Vladimir Nikishkin <vladimir@nikishkin.pw> Is there some way to use BPF for this. Rather than a special case for some userspace program?
Stephen Hemminger <stephen@networkplumber.org> writes: > On Tue, 2 May 2023 00:25:29 +0800 > Vladimir Nikishkin <vladimir@nikishkin.pw> wrote: > >> If a packet needs to be encapsulated towards a local destination IP and >> a VXLAN device that matches the destination port and VNI exists, then >> the packet will be injected into the Rx path as if it was received by >> the target VXLAN device without undergoing encapsulation. If such a >> device does not exist, the packet will be dropped. >> >> There are scenarios where we do not want to drop such packets and >> instead want to let them be encapsulated and locally received by a user >> space program that post-processes these VXLAN packets. >> >> To that end, add a new VXLAN device attribute that controls whether such >> packets are dropped or not. When set ("localbypass") these packets are >> dropped and when unset ("nolocalbypass") the packets are encapsulated >> and locally delivered to the listening user space application. Default >> to "localbypass" to maintain existing behavior. >> >> Signed-off-by: Vladimir Nikishkin <vladimir@nikishkin.pw> > > Is there some way to use BPF for this. Rather than a special case > for some userspace program? Well, in the first patch this was not a special case, but rather change to the default behaviour. (Which, I guess has been a little too audacious.) I am not sure about BPF, but the concrete use-case I have is solvable by dedicating a packet to a bogus IP, and doing an nftables double-NAT (source and destination) to 127.0.0.1, which is the way I am solving this problem now, and I suspect, what most sysadmins who need this feature would be doing this without this patch. In fact, among all the people I have talked to about this issue (on #networking@libera.chat, and elsewhere), nobody considered dropping packets to be an intuitive thing. The "intuitive logic" here is the following: 1) I am sending packets to an ip and a port, 2) I have a process listening to packets on this IP and port, 3) Why on Earth are packets not arriving? 4) Even further, why does local behaviour differ from remote behaviour? So the "special case" is already there by design. The new option is turning off the special case. I am aware of the fact that heavy-duty network processing people have a different perspective on this issue, and that in high-load environments every tiny bit of performance is of crucial importance, hence "local bypass" is seen not as a dirty heuristic, but rather as an essential feature which vastly increases performance, but for "kitchen sink" sysadmins the current (not documented) behaviour is just baffling. So I would argue that having an option that, even though it might not be the most frequently used one, is clearly documented as enabling the most straightforward behaviour, would be worth it. And although having a userspace process listening to a vxlan "for processing" might not be the most frequently used thing (although I do need it), at least being able to see the packets being sent to local ports, with, say, tcpdump, in exactly the same way as the packets being sent to remote addresses, would help sysadmins debug their setups better even when only the most basic tools available. I hope that this is convincing enough. P.S. A apologise for not adding the vxlan: and testing/selftests/net: prefixes to the patches. I will add them to the next attempt, in addition to fixing the other issues that might be discovered.
On Tue, May 02, 2023 at 12:25:29AM +0800, Vladimir Nikishkin wrote: > diff --git a/drivers/net/vxlan/vxlan_core.c b/drivers/net/vxlan/vxlan_core.c > index 561fe1b314f5..ede98b879257 100644 > --- a/drivers/net/vxlan/vxlan_core.c > +++ b/drivers/net/vxlan/vxlan_core.c > @@ -2355,11 +2355,13 @@ static int encap_bypass_if_local(struct sk_buff *skb, struct net_device *dev, > !(rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))) { > struct vxlan_dev *dst_vxlan; > > - dst_release(dst); > dst_vxlan = vxlan_find_vni(vxlan->net, dst_ifindex, vni, > daddr->sa.sa_family, dst_port, > vxlan->cfg.flags); > if (!dst_vxlan) { > + if (!(vxlan->cfg.flags & VXLAN_F_LOCALBYPASS)) > + return 0; > + dst_release(dst); Thinking about it again, now that we have a new flag to signal the desired behavior, why do we care if there is a local VXLAN device listening or not? If 'VXLAN_F_LOCALBYPASS' is not set we don't want to deliver the packet to a local VXLAN device even if one exists. IOW, can't the diff simply be: diff --git a/drivers/net/vxlan/vxlan_core.c b/drivers/net/vxlan/vxlan_core.c index 561fe1b314f5..1a1dfe6be92d 100644 --- a/drivers/net/vxlan/vxlan_core.c +++ b/drivers/net/vxlan/vxlan_core.c @@ -2352,7 +2352,8 @@ static int encap_bypass_if_local(struct sk_buff *skb, struct net_device *dev, #endif /* Bypass encapsulation if the destination is local */ if (rt_flags & RTCF_LOCAL && - !(rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))) { + !(rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST)) && + vxlan->cfg.flags & VXLAN_F_LOCALBYPASS) { struct vxlan_dev *dst_vxlan; dst_release(dst); ? > dev->stats.tx_errors++; > vxlan_vnifilter_count(vxlan, vni, NULL, > VXLAN_VNI_STATS_TX_ERRORS, 0); > @@ -2367,6 +2369,7 @@ static int encap_bypass_if_local(struct sk_buff *skb, struct net_device *dev, > > return -ENOENT; > } > + dst_release(dst); > vxlan_encap_bypass(skb, vxlan, dst_vxlan, vni, true); > return 1; > } > @@ -2568,7 +2571,6 @@ void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev, > > if (!info) { > u32 rt6i_flags = ((struct rt6_info *)ndst)->rt6i_flags; > - Unnecessary change. Please remove. Probably a leftover from previous versions > err = encap_bypass_if_local(skb, dev, vxlan, dst, > dst_port, ifindex, vni, > ndst, rt6i_flags); > @@ -3172,6 +3174,7 @@ static void vxlan_raw_setup(struct net_device *dev) > } > > static const struct nla_policy vxlan_policy[IFLA_VXLAN_MAX + 1] = { > + [IFLA_VXLAN_UNSPEC] = { .strict_start_type = IFLA_VXLAN_LOCALBYPASS }, > [IFLA_VXLAN_ID] = { .type = NLA_U32 }, > [IFLA_VXLAN_GROUP] = { .len = sizeof_field(struct iphdr, daddr) }, > [IFLA_VXLAN_GROUP6] = { .len = sizeof(struct in6_addr) }, > @@ -3202,6 +3205,7 @@ static const struct nla_policy vxlan_policy[IFLA_VXLAN_MAX + 1] = { > [IFLA_VXLAN_TTL_INHERIT] = { .type = NLA_FLAG }, > [IFLA_VXLAN_DF] = { .type = NLA_U8 }, > [IFLA_VXLAN_VNIFILTER] = { .type = NLA_U8 }, > + [IFLA_VXLAN_LOCALBYPASS] = NLA_POLICY_MAX(NLA_U8, 1), > }; The rest looks fine to me
diff --git a/drivers/net/vxlan/vxlan_core.c b/drivers/net/vxlan/vxlan_core.c index 561fe1b314f5..ede98b879257 100644 --- a/drivers/net/vxlan/vxlan_core.c +++ b/drivers/net/vxlan/vxlan_core.c @@ -2355,11 +2355,13 @@ static int encap_bypass_if_local(struct sk_buff *skb, struct net_device *dev, !(rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))) { struct vxlan_dev *dst_vxlan; - dst_release(dst); dst_vxlan = vxlan_find_vni(vxlan->net, dst_ifindex, vni, daddr->sa.sa_family, dst_port, vxlan->cfg.flags); if (!dst_vxlan) { + if (!(vxlan->cfg.flags & VXLAN_F_LOCALBYPASS)) + return 0; + dst_release(dst); dev->stats.tx_errors++; vxlan_vnifilter_count(vxlan, vni, NULL, VXLAN_VNI_STATS_TX_ERRORS, 0); @@ -2367,6 +2369,7 @@ static int encap_bypass_if_local(struct sk_buff *skb, struct net_device *dev, return -ENOENT; } + dst_release(dst); vxlan_encap_bypass(skb, vxlan, dst_vxlan, vni, true); return 1; } @@ -2568,7 +2571,6 @@ void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev, if (!info) { u32 rt6i_flags = ((struct rt6_info *)ndst)->rt6i_flags; - err = encap_bypass_if_local(skb, dev, vxlan, dst, dst_port, ifindex, vni, ndst, rt6i_flags); @@ -3172,6 +3174,7 @@ static void vxlan_raw_setup(struct net_device *dev) } static const struct nla_policy vxlan_policy[IFLA_VXLAN_MAX + 1] = { + [IFLA_VXLAN_UNSPEC] = { .strict_start_type = IFLA_VXLAN_LOCALBYPASS }, [IFLA_VXLAN_ID] = { .type = NLA_U32 }, [IFLA_VXLAN_GROUP] = { .len = sizeof_field(struct iphdr, daddr) }, [IFLA_VXLAN_GROUP6] = { .len = sizeof(struct in6_addr) }, @@ -3202,6 +3205,7 @@ static const struct nla_policy vxlan_policy[IFLA_VXLAN_MAX + 1] = { [IFLA_VXLAN_TTL_INHERIT] = { .type = NLA_FLAG }, [IFLA_VXLAN_DF] = { .type = NLA_U8 }, [IFLA_VXLAN_VNIFILTER] = { .type = NLA_U8 }, + [IFLA_VXLAN_LOCALBYPASS] = NLA_POLICY_MAX(NLA_U8, 1), }; static int vxlan_validate(struct nlattr *tb[], struct nlattr *data[], @@ -4011,6 +4015,17 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct nlattr *data[], conf->flags |= VXLAN_F_UDP_ZERO_CSUM_TX; } + if (data[IFLA_VXLAN_LOCALBYPASS]) { + err = vxlan_nl2flag(conf, data, IFLA_VXLAN_LOCALBYPASS, + VXLAN_F_LOCALBYPASS, changelink, + true, extack); + if (err) + return err; + } else if (!changelink) { + /* default to local bypass on a new device */ + conf->flags |= VXLAN_F_LOCALBYPASS; + } + if (data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX]) { err = vxlan_nl2flag(conf, data, IFLA_VXLAN_UDP_ZERO_CSUM6_TX, VXLAN_F_UDP_ZERO_CSUM6_TX, changelink, @@ -4232,6 +4247,7 @@ static size_t vxlan_get_size(const struct net_device *dev) nla_total_size(sizeof(__u8)) + /* IFLA_VXLAN_UDP_ZERO_CSUM6_RX */ nla_total_size(sizeof(__u8)) + /* IFLA_VXLAN_REMCSUM_TX */ nla_total_size(sizeof(__u8)) + /* IFLA_VXLAN_REMCSUM_RX */ + nla_total_size(sizeof(__u8)) + /* IFLA_VXLAN_LOCALBYPASS */ 0; } @@ -4308,7 +4324,9 @@ static int vxlan_fill_info(struct sk_buff *skb, const struct net_device *dev) nla_put_u8(skb, IFLA_VXLAN_REMCSUM_TX, !!(vxlan->cfg.flags & VXLAN_F_REMCSUM_TX)) || nla_put_u8(skb, IFLA_VXLAN_REMCSUM_RX, - !!(vxlan->cfg.flags & VXLAN_F_REMCSUM_RX))) + !!(vxlan->cfg.flags & VXLAN_F_REMCSUM_RX)) || + nla_put_u8(skb, IFLA_VXLAN_LOCALBYPASS, + !!(vxlan->cfg.flags & VXLAN_F_LOCALBYPASS))) goto nla_put_failure; if (nla_put(skb, IFLA_VXLAN_PORT_RANGE, sizeof(ports), &ports)) diff --git a/include/net/vxlan.h b/include/net/vxlan.h index 20bd7d893e10..0be91ca78d3a 100644 --- a/include/net/vxlan.h +++ b/include/net/vxlan.h @@ -328,6 +328,7 @@ struct vxlan_dev { #define VXLAN_F_TTL_INHERIT 0x10000 #define VXLAN_F_VNIFILTER 0x20000 #define VXLAN_F_MDB 0x40000 +#define VXLAN_F_LOCALBYPASS 0x80000 /* Flags that are used in the receive path. These flags must match in * order for a socket to be shareable @@ -348,7 +349,8 @@ struct vxlan_dev { VXLAN_F_UDP_ZERO_CSUM6_TX | \ VXLAN_F_UDP_ZERO_CSUM6_RX | \ VXLAN_F_COLLECT_METADATA | \ - VXLAN_F_VNIFILTER) + VXLAN_F_VNIFILTER | \ + VXLAN_F_LOCALBYPASS) struct net_device *vxlan_dev_create(struct net *net, const char *name, u8 name_assign_type, struct vxlan_config *conf); diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index 4ac1000b0ef2..0f6a0fe09bdb 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -828,6 +828,7 @@ enum { IFLA_VXLAN_TTL_INHERIT, IFLA_VXLAN_DF, IFLA_VXLAN_VNIFILTER, /* only applicable with COLLECT_METADATA mode */ + IFLA_VXLAN_LOCALBYPASS, __IFLA_VXLAN_MAX }; #define IFLA_VXLAN_MAX (__IFLA_VXLAN_MAX - 1)
If a packet needs to be encapsulated towards a local destination IP and a VXLAN device that matches the destination port and VNI exists, then the packet will be injected into the Rx path as if it was received by the target VXLAN device without undergoing encapsulation. If such a device does not exist, the packet will be dropped. There are scenarios where we do not want to drop such packets and instead want to let them be encapsulated and locally received by a user space program that post-processes these VXLAN packets. To that end, add a new VXLAN device attribute that controls whether such packets are dropped or not. When set ("localbypass") these packets are dropped and when unset ("nolocalbypass") the packets are encapsulated and locally delivered to the listening user space application. Default to "localbypass" to maintain existing behavior. Signed-off-by: Vladimir Nikishkin <vladimir@nikishkin.pw> --- drivers/net/vxlan/vxlan_core.c | 24 +++++++++++++++++++++--- include/net/vxlan.h | 4 +++- include/uapi/linux/if_link.h | 1 + 3 files changed, 25 insertions(+), 4 deletions(-)