Message ID | 20210126132311.3061388-6-idosch@idosch.org (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | Add notifications when route hardware flags change | expand |
Context | Check | Description |
---|---|---|
netdev/cover_letter | success | Link |
netdev/fixes_present | success | Link |
netdev/patch_count | success | Link |
netdev/tree_selection | success | Clearly marked for net-next |
netdev/subject_prefix | success | Link |
netdev/cc_maintainers | warning | 7 maintainers not CCed: linux-doc@vger.kernel.org yoshfuji@linux-ipv6.org edumazet@google.com weiwan@google.com roopa@cumulusnetworks.com kuniyu@amazon.co.jp corbet@lwn.net |
netdev/source_inline | success | Was 0 now: 0 |
netdev/verify_signedoff | success | Link |
netdev/module_param | success | Was 0 now: 0 |
netdev/build_32bit | success | Errors and warnings before: 7490 this patch: 7490 |
netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
netdev/verify_fixes | success | Link |
netdev/checkpatch | warning | WARNING: line length of 82 exceeds 80 columns |
netdev/build_allmodconfig_warn | success | Errors and warnings before: 7892 this patch: 7892 |
netdev/header_inline | success | Link |
netdev/stable | success | Stable not CCed |
On 1/26/21 6:23 AM, Ido Schimmel wrote: > From: Amit Cohen <amcohen@nvidia.com> > > After installing a route to the kernel, user space receives an > acknowledgment, which means the route was installed in the kernel, > but not necessarily in hardware. > > The asynchronous nature of route installation in hardware can lead to a > routing daemon advertising a route before it was actually installed in > hardware. This can result in packet loss or mis-routed packets until the > route is installed in hardware. > > It is also possible for a route already installed in hardware to change > its action and therefore its flags. For example, a host route that is > trapping packets can be "promoted" to perform decapsulation following > the installation of an IPinIP/VXLAN tunnel. > > Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags > are changed. The aim is to provide an indication to user-space > (e.g., routing daemons) about the state of the route in hardware. > > Introduce a sysctl that controls this behavior. > > Keep the default value at 0 (i.e., do not emit notifications) for several > reasons: > - Multiple RTM_NEWROUTE notification per-route might confuse existing > routing daemons. are you aware of any routing daemons that are affected? Seems like they should be able to handle redundant notifications > - Convergence reasons in routing daemons. > - The extra notifications will negatively impact the insertion rate. any numbers on the overhead? > - Not all users are interested in these notifications. > > Signed-off-by: Amit Cohen <amcohen@nvidia.com> > Acked-by: Roopa Prabhu <roopa@nvidia.com> > Signed-off-by: Ido Schimmel <idosch@nvidia.com> > --- > Documentation/networking/ip-sysctl.rst | 20 +++++++++++++++++++ > include/net/netns/ipv4.h | 2 ++ > net/ipv4/af_inet.c | 2 ++ > net/ipv4/fib_trie.c | 27 ++++++++++++++++++++++++++ > net/ipv4/sysctl_net_ipv4.c | 9 +++++++++ > 5 files changed, 60 insertions(+) >
>-----Original Message----- >From: David Ahern <dsahern@gmail.com> >Sent: Wednesday, January 27, 2021 7:03 >To: Ido Schimmel <idosch@idosch.org>; netdev@vger.kernel.org >Cc: davem@davemloft.net; kuba@kernel.org; Amit Cohen <amcohen@nvidia.com>; Roopa Prabhu <roopa@nvidia.com>; Donald >Sharp <sharpd@nvidia.com>; Benjamin Poirier <bpoirier@nvidia.com>; mlxsw <mlxsw@nvidia.com>; Ido Schimmel ><idosch@nvidia.com> >Subject: Re: [PATCH net-next 05/10] net: ipv4: Emit notification when fib hardware flags are changed > >On 1/26/21 6:23 AM, Ido Schimmel wrote: >> From: Amit Cohen <amcohen@nvidia.com> >> >> After installing a route to the kernel, user space receives an >> acknowledgment, which means the route was installed in the kernel, but >> not necessarily in hardware. >> >> The asynchronous nature of route installation in hardware can lead to >> a routing daemon advertising a route before it was actually installed >> in hardware. This can result in packet loss or mis-routed packets >> until the route is installed in hardware. >> >> It is also possible for a route already installed in hardware to >> change its action and therefore its flags. For example, a host route >> that is trapping packets can be "promoted" to perform decapsulation >> following the installation of an IPinIP/VXLAN tunnel. >> >> Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP >> flags are changed. The aim is to provide an indication to user-space >> (e.g., routing daemons) about the state of the route in hardware. >> >> Introduce a sysctl that controls this behavior. >> >> Keep the default value at 0 (i.e., do not emit notifications) for >> several >> reasons: >> - Multiple RTM_NEWROUTE notification per-route might confuse existing >> routing daemons. > >are you aware of any routing daemons that are affected? Seems like they should be able to handle redundant notifications Actually no, we didn't check all the existing daemons, just assume that not everyone will want to activate the notifications at all. So there is no point in sending notifications for users which aren't interested in them. > >> - Convergence reasons in routing daemons. >> - The extra notifications will negatively impact the insertion rate. > >any numbers on the overhead? For addition of 256k routes in mlxsw, the overhead is 3.6% of the total time. > >> - Not all users are interested in these notifications. >> >> Signed-off-by: Amit Cohen <amcohen@nvidia.com> >> Acked-by: Roopa Prabhu <roopa@nvidia.com> >> Signed-off-by: Ido Schimmel <idosch@nvidia.com> >> --- >> Documentation/networking/ip-sysctl.rst | 20 +++++++++++++++++++ >> include/net/netns/ipv4.h | 2 ++ >> net/ipv4/af_inet.c | 2 ++ >> net/ipv4/fib_trie.c | 27 ++++++++++++++++++++++++++ >> net/ipv4/sysctl_net_ipv4.c | 9 +++++++++ >> 5 files changed, 60 insertions(+) >>
On 1/26/21 6:23 AM, Ido Schimmel wrote: > From: Amit Cohen <amcohen@nvidia.com> > > After installing a route to the kernel, user space receives an > acknowledgment, which means the route was installed in the kernel, > but not necessarily in hardware. > > The asynchronous nature of route installation in hardware can lead to a > routing daemon advertising a route before it was actually installed in > hardware. This can result in packet loss or mis-routed packets until the > route is installed in hardware. > > It is also possible for a route already installed in hardware to change > its action and therefore its flags. For example, a host route that is > trapping packets can be "promoted" to perform decapsulation following > the installation of an IPinIP/VXLAN tunnel. > > Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags > are changed. The aim is to provide an indication to user-space > (e.g., routing daemons) about the state of the route in hardware. > > Introduce a sysctl that controls this behavior. > > Keep the default value at 0 (i.e., do not emit notifications) for several > reasons: > - Multiple RTM_NEWROUTE notification per-route might confuse existing > routing daemons. > - Convergence reasons in routing daemons. > - The extra notifications will negatively impact the insertion rate. > - Not all users are interested in these notifications. > > Signed-off-by: Amit Cohen <amcohen@nvidia.com> > Acked-by: Roopa Prabhu <roopa@nvidia.com> > Signed-off-by: Ido Schimmel <idosch@nvidia.com> > --- > Documentation/networking/ip-sysctl.rst | 20 +++++++++++++++++++ > include/net/netns/ipv4.h | 2 ++ > net/ipv4/af_inet.c | 2 ++ > net/ipv4/fib_trie.c | 27 ++++++++++++++++++++++++++ > net/ipv4/sysctl_net_ipv4.c | 9 +++++++++ > 5 files changed, 60 insertions(+) > Reviewed-by: David Ahern <dsahern@kernel.org>
On Tue, 26 Jan 2021 15:23:06 +0200 Ido Schimmel wrote: > Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags > are changed. The aim is to provide an indication to user-space > (e.g., routing daemons) about the state of the route in hardware. What does the daemon in the user space do with it? The notification will only be generated for the _first_ ASIC which offloaded the object. Which may be fine for you today but as an uAPI it feels slightly lacking. If the user space just wants to make sure the devices are synced to notifications from certain stage, wouldn't it be more idiomatic to provide some "fence" operation? WDYT? David?
On 1/28/21 8:04 PM, Jakub Kicinski wrote: > On Tue, 26 Jan 2021 15:23:06 +0200 Ido Schimmel wrote: >> Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags >> are changed. The aim is to provide an indication to user-space >> (e.g., routing daemons) about the state of the route in hardware. > > What does the daemon in the user space do with it? You don't want FRR for example to advertise a route to a peer until it is really programmed in h/w. This notification gives routing daemons that information. > > The notification will only be generated for the _first_ ASIC which > offloaded the object. Which may be fine for you today but as an uAPI > it feels slightly lacking. > > If the user space just wants to make sure the devices are synced to > notifications from certain stage, wouldn't it be more idiomatic to > provide some "fence" operation? > > WDYT? David? > This feature was first discussed I think about 2 years ago - when I was still with Cumulus, so I already knew the intent and end goal. I think support for multiple ASICs / NICs doing this kind of offload will have a whole lot of challenges. I don't think this particular user notification is going to be a big problem - e.g., you could always delay the emit until all have indicated the offload.
On Thu, 28 Jan 2021 20:33:22 -0700 David Ahern wrote: > On 1/28/21 8:04 PM, Jakub Kicinski wrote: > > On Tue, 26 Jan 2021 15:23:06 +0200 Ido Schimmel wrote: > >> Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags > >> are changed. The aim is to provide an indication to user-space > >> (e.g., routing daemons) about the state of the route in hardware. > > > > What does the daemon in the user space do with it? > > You don't want FRR for example to advertise a route to a peer until it > is really programmed in h/w. This notification gives routing daemons > that information. I see. Hm. > > The notification will only be generated for the _first_ ASIC which > > offloaded the object. Which may be fine for you today but as an uAPI > > it feels slightly lacking. > > > > If the user space just wants to make sure the devices are synced to > > notifications from certain stage, wouldn't it be more idiomatic to > > provide some "fence" operation? > > > > WDYT? David? > > This feature was first discussed I think about 2 years ago - when I was > still with Cumulus, so I already knew the intent and end goal. > > I think support for multiple ASICs / NICs doing this kind of offload > will have a whole lot of challenges. I don't think this particular user > notification is going to be a big problem - e.g., you could always delay > the emit until all have indicated the offload. My impression from working on this problem in TC is that the definition of "all" becomes problematic especially if one takes into account drivers getting reloaded. But I think routing offload has stronger semantics than TC, so no objections. We need a respin for the somewhat embarrassing loop in patch 1, tho.
On Thu, Jan 28, 2021 at 08:33:22PM -0700, David Ahern wrote: > On 1/28/21 8:04 PM, Jakub Kicinski wrote: > > On Tue, 26 Jan 2021 15:23:06 +0200 Ido Schimmel wrote: > >> Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags > >> are changed. The aim is to provide an indication to user-space > >> (e.g., routing daemons) about the state of the route in hardware. > > > > What does the daemon in the user space do with it? > > You don't want FRR for example to advertise a route to a peer until it > is really programmed in h/w. This notification gives routing daemons > that information. Correct. It is in the cover letter: "These flags are of interest to routing daemons since they would like to delay advertisement of routes until they are installed in hardware." Amit is working on follow-up to emit notifications when route offload fails. This request also comes from the FRR team. Currently we have a policy inside mlxsw to abort route offload and install a default route that sends all the traffic to the CPU. It obviously kills the box and anyway the policy is something user space should decide, not the kernel. > > > > > The notification will only be generated for the _first_ ASIC which > > offloaded the object. Which may be fine for you today but as an uAPI > > it feels slightly lacking. > > > > If the user space just wants to make sure the devices are synced to > > notifications from certain stage, wouldn't it be more idiomatic to > > provide some "fence" operation? > > > > WDYT? David? > > > > This feature was first discussed I think about 2 years ago - when I was > still with Cumulus, so I already knew the intent and end goal. > > I think support for multiple ASICs / NICs doing this kind of offload > will have a whole lot of challenges. I don't think this particular user > notification is going to be a big problem - e.g., you could always delay > the emit until all have indicated the offload. I do not have experience with multi-ASIC systems, but my understanding is that each ASIC has its own copy of the networking stack and the ASICs are connected via front panel or backplane ports, like distinct leaf/spine switches. In Linux, such a system can be supported by registering a devlink instance for each ASIC and reloading each instance to a separate namespace. Thanks for reviewing, David
On Thu, Jan 28, 2021 at 08:15:45PM -0800, Jakub Kicinski wrote: > My impression from working on this problem in TC is that the definition > of "all" becomes problematic especially if one takes into account > drivers getting reloaded. But I think routing offload has stronger > semantics than TC, so no objections. During the teardown phase of the reload, all the routes using the driver's netdevs (or their uppers) will be deleted by the kernel because the netdevs will disappear. During the init phase of the reload, the driver will re-register its FIB notifier and ask for a dump of all the existing routes (usually only host routes). With this patchset, user space will receive a notification that these routes are now in hardware. # ip monitor route broadcast 127.255.255.255 dev lo table local proto kernel scope link src 127.0.0.1 local 127.0.0.1 dev lo table local proto kernel scope host src 127.0.0.1 local 127.0.0.0/8 dev lo table local proto kernel scope host src 127.0.0.1 broadcast 127.0.0.0 dev lo table local proto kernel scope link src 127.0.0.1 broadcast 10.209.1.255 dev eth0 table local proto kernel scope link src 10.209.0.191 local 10.209.0.191 dev eth0 table local proto kernel scope host src 10.209.0.191 broadcast 10.209.0.0 dev eth0 table local proto kernel scope link src 10.209.0.191 10.209.0.1 dev eth0 proto dhcp scope link src 10.209.0.191 metric 1024 10.209.0.0/23 dev eth0 proto kernel scope link src 10.209.0.191 default via 10.209.0.1 dev eth0 proto dhcp src 10.209.0.191 metric 1024 << init phase starts here >> default via 10.209.0.1 dev eth0 proto dhcp src 10.209.0.191 metric 1024 rt_trap 10.209.0.0/23 dev eth0 proto kernel scope link src 10.209.0.191 rt_trap 10.209.0.1 dev eth0 proto dhcp scope link src 10.209.0.191 metric 1024 rt_trap broadcast 10.209.0.0 dev eth0 table local proto kernel scope link src 10.209.0.191 rt_trap local 10.209.0.191 dev eth0 table local proto kernel scope host src 10.209.0.191 rt_trap broadcast 10.209.1.255 dev eth0 table local proto kernel scope link src 10.209.0.191 rt_trap broadcast 127.0.0.0 dev lo table local proto kernel scope link src 127.0.0.1 rt_trap local 127.0.0.0/8 dev lo table local proto kernel scope host src 127.0.0.1 rt_trap local 127.0.0.1 dev lo table local proto kernel scope host src 127.0.0.1 rt_trap broadcast 127.255.255.255 dev lo table local proto kernel scope link src 127.0.0.1 rt_trap
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index dd2b12a32b73..01927b36bbee 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -178,6 +178,26 @@ min_adv_mss - INTEGER The advertised MSS depends on the first hop route MTU, but will never be lower than this setting. +fib_notify_on_flag_change - INTEGER + Whether to emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/ + RTM_F_TRAP flags are changed. + + After installing a route to the kernel, user space receives an + acknowledgment, which means the route was installed in the kernel, + but not necessarily in hardware. + It is also possible for a route already installed in hardware to change + its action and therefore its flags. For example, a host route that is + trapping packets can be "promoted" to perform decapsulation following + the installation of an IPinIP/VXLAN tunnel. + The notifications will indicate to user-space the state of the route. + + Default: 0 (Do not emit notifications.) + + Possible values: + + - 0 - Do not emit notifications. + - 1 - Emit notifications. + IP Fragmentation: ipfrag_high_thresh - LONG INTEGER diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h index 8e4fcac4df72..70a2a085dd1a 100644 --- a/include/net/netns/ipv4.h +++ b/include/net/netns/ipv4.h @@ -188,6 +188,8 @@ struct netns_ipv4 { int sysctl_udp_wmem_min; int sysctl_udp_rmem_min; + int sysctl_fib_notify_on_flag_change; + #ifdef CONFIG_NET_L3_MASTER_DEV int sysctl_udp_l3mdev_accept; #endif diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index b94fa8eb831b..ab42f6404fc6 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1871,6 +1871,8 @@ static __net_init int inet_init_net(struct net *net) net->ipv4.sysctl_igmp_llm_reports = 1; net->ipv4.sysctl_igmp_qrv = 2; + net->ipv4.sysctl_fib_notify_on_flag_change = 0; + return 0; } diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c index 28117c05dc35..60559b708158 100644 --- a/net/ipv4/fib_trie.c +++ b/net/ipv4/fib_trie.c @@ -1038,6 +1038,8 @@ fib_find_matching_alias(struct net *net, const struct fib_rt_info *fri) void fib_alias_hw_flags_set(struct net *net, const struct fib_rt_info *fri) { struct fib_alias *fa_match; + struct sk_buff *skb; + int err; rcu_read_lock(); @@ -1045,9 +1047,34 @@ void fib_alias_hw_flags_set(struct net *net, const struct fib_rt_info *fri) if (!fa_match) goto out; + if (fa_match->offload == fri->offload && fa_match->trap == fri->trap) + goto out; + fa_match->offload = fri->offload; fa_match->trap = fri->trap; + if (!net->ipv4.sysctl_fib_notify_on_flag_change) + goto out; + + skb = nlmsg_new(fib_nlmsg_size(fa_match->fa_info), GFP_ATOMIC); + if (!skb) { + err = -ENOBUFS; + goto errout; + } + + err = fib_dump_info(skb, 0, 0, RTM_NEWROUTE, fri, 0); + if (err < 0) { + /* -EMSGSIZE implies BUG in fib_nlmsg_size() */ + WARN_ON(err == -EMSGSIZE); + kfree_skb(skb); + goto errout; + } + + rtnl_notify(skb, net, 0, RTNLGRP_IPV4_ROUTE, NULL, GFP_ATOMIC); + goto out; + +errout: + rtnl_set_sk_err(net, RTNLGRP_IPV4_ROUTE, err); out: rcu_read_unlock(); } diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 3e5f4f2e705e..e5798b3b59d2 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -1354,6 +1354,15 @@ static struct ctl_table ipv4_net_table[] = { .proc_handler = proc_dointvec_minmax, .extra1 = SYSCTL_ONE }, + { + .procname = "fib_notify_on_flag_change", + .data = &init_net.ipv4.sysctl_fib_notify_on_flag_change, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = SYSCTL_ZERO, + .extra2 = SYSCTL_ONE, + }, { } };