diff mbox series

[net,1/1] net/ipv6: Netlink flag for new IPv6 Default Routes

Message ID 20241105031841.10730-2-Matt.Muggeridge@hpe.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net,1/1] net/ipv6: Netlink flag for new IPv6 Default Routes | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present fail Series targets non-next tree, but doesn't contain any Fixes tags
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 18 this patch: 18
netdev/build_tools success Errors and warnings before: 2 (+0) this patch: 2 (+0)
netdev/cc_maintainers success CCed 6 of 6 maintainers
netdev/build_clang success Errors and warnings before: 61 this patch: 61
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 3551 this patch: 3551
netdev/checkpatch warning WARNING: The commit message has 'stable@', perhaps it also needs a 'Fixes:' tag?
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-11-05--21-00 (tests: 779)

Commit Message

Matt Muggeridge Nov. 5, 2024, 3:18 a.m. UTC
Add a Netlink rtm_flag, RTM_F_RA_ROUTER for the RTM_NEWROUTE message.
This allows an IPv6 Netlink client to indicate the default route came
from an RA. This results in the kernel creating individual default
routes, rather than coalescing multiple default routes into a single
ECMP route.

Details:

For IPv6, a Netlink client is unable to create default routes in the
same manner as the kernel. This leads to failures when there are
multiple default routers, as they were being coalesced into a single
ECMP route. When one of the ECMP default routers becomes UNREACHABLE, it
was still being selected as the nexthop.

Meanwhile, when the kernel processes RAs from multiple default routers,
it sets the fib6_flags: RTF_ADDRCONF | RTF_DEFAULT. The RTF_ADDRCONF
flag is checked by rt6_qualify_for_ecmp(), which returns false when
ADDRCONF is set. As such, the kernel creates separate default routes.

E.g. compare the routing tables when RAs are processed by the kernel
versus a Netlink client (systemd-networkd, in my case).

1) RA Processed by kernel (accept_ra = 2)
$ ip -6 route
2001:2:0:1000::/64 dev enp0s9 proto kernel metric 256 expires ...
fe80::/64 dev enp0s9 proto kernel metric 256 pref medium
default via fe80::200:10ff:fe10:1060 dev enp0s9 proto ra ...
default via fe80::200:10ff:fe10:1061 dev enp0s9 proto ra ...

2) RA Processed by Netlink client (accept_ra = 0)
$ ip -6 route
2001:2:0:1000::/64 dev enp0s9 proto ra metric 1024 expires ...
fe80::/64 dev enp0s3 proto kernel metric 256 pref medium
fe80::/64 dev enp0s9 proto kernel metric 256 pref medium
default proto ra metric 1024 expires 595sec pref medium
	nexthop via fe80::200:10ff:fe10:1060 dev enp0s9 weight 1
	nexthop via fe80::200:10ff:fe10:1061 dev enp0s9 weight 1

IPv6 Netlink clients need a mechanism to identify a route as coming from
an RA. i.e. a Netlink client needs a method to set the kernel flags:

    RTF_ADDRCONF | RTF_DEFAULT

This is needed when there are multiple default routers that each send
an RA. Setting the RTF_ADDRCONF flag ensures their fib entries do not
qualify for ECMP routes, see rt6_qualify_for_ecmp().

To achieve this, introduce a user-level flag RTM_F_RA_ROUTER that a
Netlink client can pass to the kernel.

A Netlink user-level network manager, such as systemd-networkd, may set
the RTM_F_RA_ROUTER flag in the Netlink RTM_NEWROUTE rtmsg. When set,
the kernel sets RTF_RA_ROUTER in the fib6_config fc_flags. This causes a
default route to be created in the same way as if the kernel processed
the RA, via rt6add_dflt_router().

This is needed by user-level network managers, like systemd-networkd,
that prefer to do the RA processing themselves. ie. they disable the
kernel's RA processing by setting net.ipv6.conf.<intf>.accept_ra=0.

Without this flag, when there are mutliple default routers, the kernel
coalesces multiple default routes into an ECMP route. The ECMP route
ignores per-route REACHABILITY information. If one of the default
routers is unresponsive, with a Neighbor Cache entry of INCOMPLETE, then
it can still be selected as the nexthop for outgoing packets. This
results in an inability to communicate with remote hosts, even though
one of the default routers remains REACHABLE. This violates RFC4861
section 6.3.6, bullet 1.

Extract from RFC4861 6.3.6 bullet 1:
     1) Routers that are reachable or probably reachable (i.e., in any
        state other than INCOMPLETE) SHOULD be preferred over routers
        whose reachability is unknown or suspect (i.e., in the
        INCOMPLETE state, or for which no Neighbor Cache entry exists).

This fixes the IPv6 Logo conformance test v6LC_2_2_11, and others that
test with multiple default routers. Also see systemd issue #33470:
https://github.com/systemd/systemd/issues/33470.

Signed-off-by: Matt Muggeridge <Matt.Muggeridge@hpe.com>
Cc: David Ahern <dsahern@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: linux-api@vger.kernel.org
Cc: stable@vger.kernel.org
---
 include/uapi/linux/rtnetlink.h | 9 +++++----
 net/ipv6/route.c               | 3 +++
 2 files changed, 8 insertions(+), 4 deletions(-)

Comments

Nicolas Dichtel Nov. 5, 2024, 2:37 p.m. UTC | #1
Le 05/11/2024 à 04:18, Matt Muggeridge a écrit :
> Add a Netlink rtm_flag, RTM_F_RA_ROUTER for the RTM_NEWROUTE message.
> This allows an IPv6 Netlink client to indicate the default route came
> from an RA. This results in the kernel creating individual default
> routes, rather than coalescing multiple default routes into a single
> ECMP route.
> 
> Details:
> 
> For IPv6, a Netlink client is unable to create default routes in the
> same manner as the kernel. This leads to failures when there are
> multiple default routers, as they were being coalesced into a single
> ECMP route. When one of the ECMP default routers becomes UNREACHABLE, it
> was still being selected as the nexthop.
> 
> Meanwhile, when the kernel processes RAs from multiple default routers,
> it sets the fib6_flags: RTF_ADDRCONF | RTF_DEFAULT. The RTF_ADDRCONF
> flag is checked by rt6_qualify_for_ecmp(), which returns false when
> ADDRCONF is set. As such, the kernel creates separate default routes.
> 
> E.g. compare the routing tables when RAs are processed by the kernel
> versus a Netlink client (systemd-networkd, in my case).
> 
> 1) RA Processed by kernel (accept_ra = 2)
> $ ip -6 route
> 2001:2:0:1000::/64 dev enp0s9 proto kernel metric 256 expires ...
> fe80::/64 dev enp0s9 proto kernel metric 256 pref medium
> default via fe80::200:10ff:fe10:1060 dev enp0s9 proto ra ...
> default via fe80::200:10ff:fe10:1061 dev enp0s9 proto ra ...
> 
> 2) RA Processed by Netlink client (accept_ra = 0)
> $ ip -6 route
> 2001:2:0:1000::/64 dev enp0s9 proto ra metric 1024 expires ...
> fe80::/64 dev enp0s3 proto kernel metric 256 pref medium
> fe80::/64 dev enp0s9 proto kernel metric 256 pref medium
> default proto ra metric 1024 expires 595sec pref medium
> 	nexthop via fe80::200:10ff:fe10:1060 dev enp0s9 weight 1
> 	nexthop via fe80::200:10ff:fe10:1061 dev enp0s9 weight 1
> 
> IPv6 Netlink clients need a mechanism to identify a route as coming from
> an RA. i.e. a Netlink client needs a method to set the kernel flags:
> 
>     RTF_ADDRCONF | RTF_DEFAULT
> 
> This is needed when there are multiple default routers that each send
> an RA. Setting the RTF_ADDRCONF flag ensures their fib entries do not
> qualify for ECMP routes, see rt6_qualify_for_ecmp().
> 
> To achieve this, introduce a user-level flag RTM_F_RA_ROUTER that a
> Netlink client can pass to the kernel.
> 
> A Netlink user-level network manager, such as systemd-networkd, may set
> the RTM_F_RA_ROUTER flag in the Netlink RTM_NEWROUTE rtmsg. When set,
> the kernel sets RTF_RA_ROUTER in the fib6_config fc_flags. This causes a
> default route to be created in the same way as if the kernel processed
> the RA, via rt6add_dflt_router().
> 
> This is needed by user-level network managers, like systemd-networkd,
> that prefer to do the RA processing themselves. ie. they disable the
> kernel's RA processing by setting net.ipv6.conf.<intf>.accept_ra=0.
> 
> Without this flag, when there are mutliple default routers, the kernel
> coalesces multiple default routes into an ECMP route. The ECMP route
> ignores per-route REACHABILITY information. If one of the default
> routers is unresponsive, with a Neighbor Cache entry of INCOMPLETE, then
> it can still be selected as the nexthop for outgoing packets. This
> results in an inability to communicate with remote hosts, even though
> one of the default routers remains REACHABLE. This violates RFC4861
> section 6.3.6, bullet 1.
> 
> Extract from RFC4861 6.3.6 bullet 1:
>      1) Routers that are reachable or probably reachable (i.e., in any
>         state other than INCOMPLETE) SHOULD be preferred over routers
>         whose reachability is unknown or suspect (i.e., in the
>         INCOMPLETE state, or for which no Neighbor Cache entry exists).
> 
> This fixes the IPv6 Logo conformance test v6LC_2_2_11, and others that
> test with multiple default routers. Also see systemd issue #33470:
> https://github.com/systemd/systemd/issues/33470.
> 
> Signed-off-by: Matt Muggeridge <Matt.Muggeridge@hpe.com>
> Cc: David Ahern <dsahern@kernel.org>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: linux-api@vger.kernel.org
> Cc: stable@vger.kernel.org
> ---
>  include/uapi/linux/rtnetlink.h | 9 +++++----
>  net/ipv6/route.c               | 3 +++
>  2 files changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
> index 3b687d20c9ed..9f0259f6e4ed 100644
> --- a/include/uapi/linux/rtnetlink.h
> +++ b/include/uapi/linux/rtnetlink.h
> @@ -202,7 +202,7 @@ enum {
>  #define RTM_NR_FAMILIES	(RTM_NR_MSGTYPES >> 2)
>  #define RTM_FAM(cmd)	(((cmd) - RTM_BASE) >> 2)
>  
> -/* 
> +/*
>     Generic structure for encapsulation of optional route information.
>     It is reminiscent of sockaddr, but with sa_family replaced
>     with attribute type.
> @@ -242,7 +242,7 @@ struct rtmsg {
>  
>  	unsigned char		rtm_table;	/* Routing table id */
>  	unsigned char		rtm_protocol;	/* Routing protocol; see below	*/
> -	unsigned char		rtm_scope;	/* See below */	
> +	unsigned char		rtm_scope;	/* See below */
>  	unsigned char		rtm_type;	/* See below	*/
>  
>  	unsigned		rtm_flags;
> @@ -336,6 +336,7 @@ enum rt_scope_t {
>  #define RTM_F_FIB_MATCH	        0x2000	/* return full fib lookup match */
>  #define RTM_F_OFFLOAD		0x4000	/* route is offloaded */
>  #define RTM_F_TRAP		0x8000	/* route is trapping packets */
> +#define RTM_F_RA_ROUTER		0x10000	/* route is a default route from RA */
Please, don't mix whitespace changes with the changes related to the new flag.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst#n166


Regards,
Nicolas
Ido Schimmel Nov. 5, 2024, 6:15 p.m. UTC | #2
On Mon, Nov 04, 2024 at 10:18:39PM -0500, Matt Muggeridge wrote:
> Add a Netlink rtm_flag, RTM_F_RA_ROUTER for the RTM_NEWROUTE message.
> This allows an IPv6 Netlink client to indicate the default route came
> from an RA. This results in the kernel creating individual default
> routes, rather than coalescing multiple default routes into a single
> ECMP route.
> 
> Details:
> 
> For IPv6, a Netlink client is unable to create default routes in the
> same manner as the kernel. This leads to failures when there are
> multiple default routers, as they were being coalesced into a single
> ECMP route. When one of the ECMP default routers becomes UNREACHABLE, it
> was still being selected as the nexthop.
> 
> Meanwhile, when the kernel processes RAs from multiple default routers,
> it sets the fib6_flags: RTF_ADDRCONF | RTF_DEFAULT. The RTF_ADDRCONF
> flag is checked by rt6_qualify_for_ecmp(), which returns false when
> ADDRCONF is set. As such, the kernel creates separate default routes.
> 
> E.g. compare the routing tables when RAs are processed by the kernel
> versus a Netlink client (systemd-networkd, in my case).
> 
> 1) RA Processed by kernel (accept_ra = 2)
> $ ip -6 route
> 2001:2:0:1000::/64 dev enp0s9 proto kernel metric 256 expires ...
> fe80::/64 dev enp0s9 proto kernel metric 256 pref medium
> default via fe80::200:10ff:fe10:1060 dev enp0s9 proto ra ...
> default via fe80::200:10ff:fe10:1061 dev enp0s9 proto ra ...
> 
> 2) RA Processed by Netlink client (accept_ra = 0)
> $ ip -6 route
> 2001:2:0:1000::/64 dev enp0s9 proto ra metric 1024 expires ...
> fe80::/64 dev enp0s3 proto kernel metric 256 pref medium
> fe80::/64 dev enp0s9 proto kernel metric 256 pref medium
> default proto ra metric 1024 expires 595sec pref medium
> 	nexthop via fe80::200:10ff:fe10:1060 dev enp0s9 weight 1
> 	nexthop via fe80::200:10ff:fe10:1061 dev enp0s9 weight 1
> 
> IPv6 Netlink clients need a mechanism to identify a route as coming from
> an RA. i.e. a Netlink client needs a method to set the kernel flags:
> 
>     RTF_ADDRCONF | RTF_DEFAULT
> 
> This is needed when there are multiple default routers that each send
> an RA. Setting the RTF_ADDRCONF flag ensures their fib entries do not
> qualify for ECMP routes, see rt6_qualify_for_ecmp().
> 
> To achieve this, introduce a user-level flag RTM_F_RA_ROUTER that a
> Netlink client can pass to the kernel.
> 
> A Netlink user-level network manager, such as systemd-networkd, may set
> the RTM_F_RA_ROUTER flag in the Netlink RTM_NEWROUTE rtmsg. When set,
> the kernel sets RTF_RA_ROUTER in the fib6_config fc_flags. This causes a
> default route to be created in the same way as if the kernel processed
> the RA, via rt6add_dflt_router().
> 
> This is needed by user-level network managers, like systemd-networkd,
> that prefer to do the RA processing themselves. ie. they disable the
> kernel's RA processing by setting net.ipv6.conf.<intf>.accept_ra=0.
> 
> Without this flag, when there are mutliple default routers, the kernel
> coalesces multiple default routes into an ECMP route. The ECMP route
> ignores per-route REACHABILITY information. If one of the default
> routers is unresponsive, with a Neighbor Cache entry of INCOMPLETE, then
> it can still be selected as the nexthop for outgoing packets. This
> results in an inability to communicate with remote hosts, even though
> one of the default routers remains REACHABLE. This violates RFC4861
> section 6.3.6, bullet 1.

Do you have forwarding disabled (it causes RT6_LOOKUP_F_REACHABLE to be
set)? Is the problem that fib6_table_lookup() chooses a reachable
nexthop and then fib6_select_path() overrides it with an unreachable
one?

> 
> Extract from RFC4861 6.3.6 bullet 1:
>      1) Routers that are reachable or probably reachable (i.e., in any
>         state other than INCOMPLETE) SHOULD be preferred over routers
>         whose reachability is unknown or suspect (i.e., in the
>         INCOMPLETE state, or for which no Neighbor Cache entry exists).
> 
> This fixes the IPv6 Logo conformance test v6LC_2_2_11, and others that
> test with multiple default routers. Also see systemd issue #33470:
> https://github.com/systemd/systemd/issues/33470.
> 
> Signed-off-by: Matt Muggeridge <Matt.Muggeridge@hpe.com>
> Cc: David Ahern <dsahern@kernel.org>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: linux-api@vger.kernel.org
> Cc: stable@vger.kernel.org
> ---
>  include/uapi/linux/rtnetlink.h | 9 +++++----
>  net/ipv6/route.c               | 3 +++
>  2 files changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
> index 3b687d20c9ed..9f0259f6e4ed 100644
> --- a/include/uapi/linux/rtnetlink.h
> +++ b/include/uapi/linux/rtnetlink.h
> @@ -202,7 +202,7 @@ enum {
>  #define RTM_NR_FAMILIES	(RTM_NR_MSGTYPES >> 2)
>  #define RTM_FAM(cmd)	(((cmd) - RTM_BASE) >> 2)
>  
> -/* 
> +/*
>     Generic structure for encapsulation of optional route information.
>     It is reminiscent of sockaddr, but with sa_family replaced
>     with attribute type.
> @@ -242,7 +242,7 @@ struct rtmsg {
>  
>  	unsigned char		rtm_table;	/* Routing table id */
>  	unsigned char		rtm_protocol;	/* Routing protocol; see below	*/
> -	unsigned char		rtm_scope;	/* See below */	
> +	unsigned char		rtm_scope;	/* See below */
>  	unsigned char		rtm_type;	/* See below	*/
>  
>  	unsigned		rtm_flags;
> @@ -336,6 +336,7 @@ enum rt_scope_t {
>  #define RTM_F_FIB_MATCH	        0x2000	/* return full fib lookup match */
>  #define RTM_F_OFFLOAD		0x4000	/* route is offloaded */
>  #define RTM_F_TRAP		0x8000	/* route is trapping packets */
> +#define RTM_F_RA_ROUTER		0x10000	/* route is a default route from RA */
>  #define RTM_F_OFFLOAD_FAILED	0x20000000 /* route offload failed, this value
>  					    * is chosen to avoid conflicts with
>  					    * other flags defined in
> @@ -568,7 +569,7 @@ struct ifinfomsg {
>  };
>  
>  /********************************************************************
> - *		prefix information 
> + *		prefix information
>   ****/
>  
>  struct prefixmsg {
> @@ -582,7 +583,7 @@ struct prefixmsg {
>  	unsigned char	prefix_pad3;
>  };
>  
> -enum 
> +enum
>  {
>  	PREFIX_UNSPEC,
>  	PREFIX_ADDRESS,
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index b4251915585f..5b0c16422720 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -5055,6 +5055,9 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
>  	if (rtm->rtm_flags & RTM_F_CLONED)
>  		cfg->fc_flags |= RTF_CACHE;
>  
> +	if (rtm->rtm_flags & RTM_F_RA_ROUTER)
> +		cfg->fc_flags |= RTF_RA_ROUTER;
> +

It is possible there are user space programs out there that set this bit
(knowingly or not) when sending requests to the kernel and this change
will result in a behavior change for them. So, if we were to continue in
this path, this would need to be converted to a new netlink attribute to
avoid such potential problems.

BTW, you can avoid the coalescing problem by using the nexthop API (man
ip-nexthop).

>  	cfg->fc_flags |= (rtm->rtm_flags & RTNH_F_ONLINK);
>  
>  	if (tb[RTA_NH_ID]) {
> -- 
> 2.35.3
> 
>
Matt Muggeridge Nov. 6, 2024, 1:44 a.m. UTC | #3
> Please, don't mix whitespace changes with the changes related to the new flag.

Thanks, Nicolas. I will revert the changes that tidied up trailing whitespace in the next version.
Matt Muggeridge Nov. 6, 2024, 2:50 a.m. UTC | #4
Thank you for your review and feedback, Ido.

>> Without this flag, when there are mutliple default routers, the kernel
>> coalesces multiple default routes into an ECMP route. The ECMP route
>> ignores per-route REACHABILITY information. If one of the default
>> routers is unresponsive, with a Neighbor Cache entry of INCOMPLETE, then
>> it can still be selected as the nexthop for outgoing packets. This
>> results in an inability to communicate with remote hosts, even though
>> one of the default routers remains REACHABLE. This violates RFC4861
>> section 6.3.6, bullet 1.
>
>Do you have forwarding disabled (it causes RT6_LOOKUP_F_REACHABLE to be
>set)?

Yes, forwarding is disabled on our embedded system. Though, this needs to
work on systems regardless of the state of forwarding.

>  Is the problem that fib6_table_lookup() chooses a reachable
>nexthop and then fib6_select_path() overrides it with an unreachable
>one?

I'm afraid I don't know.

The objective is to allow IPv6 Netlink clients to be able to create default
routes from RAs in the same way the kernel creates default routes from RAs.
Essentially, I'm trying to have Netlink and Kernel behaviors match.

My analysis led me to the need for Netlink clients to set the kernel's
fib6_config flags RTF_RA_ROUTER, where:

    #define RTF_RA_ROUTER		(RTF_ADDRCONF | RTF_DEFAULT)

>> +	if (rtm->rtm_flags & RTM_F_RA_ROUTER)
>> +		cfg->fc_flags |= RTF_RA_ROUTER;
>> +
> 
> It is possible there are user space programs out there that set this bit
> (knowingly or not) when sending requests to the kernel and this change
> will result in a behavior change for them. So, if we were to continue in
> this path, this would need to be converted to a new netlink attribute to
> avoid such potential problems.
> 

Is this a mandated approach to implementing unspecified bits in a flag?

I'm a little surprised by this consideration. If we account for poorly
written buggy user-programs, doesn't this open any API to an explosion
of new attributes or other odd extensions? I'd imagine the same argument
would be applicable to ioctl flags, socket flags, and so on. Why would we
treat implementing unspecified Netlink bits differently to implementing
unspecified ioctl bits, etc.

Naturally, if this is the mandated approach, then I'll reimplement it with
a new Netlink attribute. I'm just trying to understand what is the
Linux-lore, here?

> BTW, you can avoid the coalescing problem by using the nexthop API (man
> ip-nexthop).

I'm not sure how that would help in this case. We need the nexthop to be
determined according to its REACHABILITY and other considerations described
in RFC4861.

Kind regards,
Matt.
Ido Schimmel Nov. 6, 2024, 12:37 p.m. UTC | #5
On Tue, Nov 05, 2024 at 09:50:56PM -0500, Matt Muggeridge wrote:
> Thank you for your review and feedback, Ido.
> 
> >> Without this flag, when there are mutliple default routers, the kernel
> >> coalesces multiple default routes into an ECMP route. The ECMP route
> >> ignores per-route REACHABILITY information. If one of the default
> >> routers is unresponsive, with a Neighbor Cache entry of INCOMPLETE, then
> >> it can still be selected as the nexthop for outgoing packets. This
> >> results in an inability to communicate with remote hosts, even though
> >> one of the default routers remains REACHABLE. This violates RFC4861
> >> section 6.3.6, bullet 1.
> >
> >Do you have forwarding disabled (it causes RT6_LOOKUP_F_REACHABLE to be
> >set)?
> 
> Yes, forwarding is disabled on our embedded system. Though, this needs to
> work on systems regardless of the state of forwarding.
> 
> >  Is the problem that fib6_table_lookup() chooses a reachable
> >nexthop and then fib6_select_path() overrides it with an unreachable
> >one?
> 
> I'm afraid I don't know.

We need to understand the current behavior before adding a new interface
that we will never be able to remove. It is possible we can improve /
fix the current code. I won't have time to look into it myself until
next week.

> 
> The objective is to allow IPv6 Netlink clients to be able to create default
> routes from RAs in the same way the kernel creates default routes from RAs.
> Essentially, I'm trying to have Netlink and Kernel behaviors match.

I understand, but it's essentially an extension for the legacy IPv6
multipath API which we are trying to move away from towards the nexthop
API (see more below).

> 
> My analysis led me to the need for Netlink clients to set the kernel's
> fib6_config flags RTF_RA_ROUTER, where:
> 
>     #define RTF_RA_ROUTER		(RTF_ADDRCONF | RTF_DEFAULT)
> 
> >> +	if (rtm->rtm_flags & RTM_F_RA_ROUTER)
> >> +		cfg->fc_flags |= RTF_RA_ROUTER;
> >> +
> > 
> > It is possible there are user space programs out there that set this bit
> > (knowingly or not) when sending requests to the kernel and this change
> > will result in a behavior change for them. So, if we were to continue in
> > this path, this would need to be converted to a new netlink attribute to
> > avoid such potential problems.
> > 
> 
> Is this a mandated approach to implementing unspecified bits in a flag?
> 
> I'm a little surprised by this consideration. If we account for poorly
> written buggy user-programs, doesn't this open any API to an explosion
> of new attributes or other odd extensions? I'd imagine the same argument
> would be applicable to ioctl flags, socket flags, and so on. Why would we
> treat implementing unspecified Netlink bits differently to implementing
> unspecified ioctl bits, etc.
> 
> Naturally, if this is the mandated approach, then I'll reimplement it with
> a new Netlink attribute. I'm just trying to understand what is the
> Linux-lore, here?

Using this bit could have been valid if previously the kernel rejected
requests with this bit set, but as evident by your patch the kernel does
not do it. It is therefore possible that there are user space programs
out there that are working perfectly fine right now and they will break
/ misbehave after this change.

> 
> > BTW, you can avoid the coalescing problem by using the nexthop API (man
> > ip-nexthop).
> 
> I'm not sure how that would help in this case. We need the nexthop to be
> determined according to its REACHABILITY and other considerations described
> in RFC4861.

Using your example:

# ip nexthop add id 1 via fe80::200:10ff:fe10:1060 dev enp0s9
# ip -6 route add default nhid 1 expires 600 proto ra
# ip nexthop add id 2 via fe80::200:10ff:fe10:1061 dev enp0s9
# ip -6 route append default nhid 2 expires 600 proto ra
# ip -6 route
fe80::/64 dev enp0s9 proto kernel metric 256 pref medium
default nhid 1 via fe80::200:10ff:fe10:1060 dev enp0s9 proto ra metric 1024 expires 563sec pref medium
default nhid 2 via fe80::200:10ff:fe10:1061 dev enp0s9 proto ra metric 1024 expires 594sec pref medium
David Ahern Nov. 6, 2024, 6:59 p.m. UTC | #6
On 11/5/24 7:50 PM, Matt Muggeridge wrote:
> I'm a little surprised by this consideration. If we account for poorly
> written buggy user-programs, doesn't this open any API to an explosion
> of new attributes or other odd extensions? I'd imagine the same argument

yes, it does and there are many examples. UAPIs are hard to get right
yet persistent forever, so I do agree with Ido's push back on making
sure it is needed.
Matt Muggeridge Nov. 7, 2024, 3:53 a.m. UTC | #7
Hi Ido,

> >>> Is the problem that fib6_table_lookup() chooses a reachable
> >>> nexthop and then fib6_select_path() overrides it with an unreachable
> >>> one?
> 
> >> I'm afraid I don't know.
> >>
> > We need to understand the current behavior before adding a new interface
> > that we will never be able to remove. It is possible we can improve /
> > fix the current code. I won't have time to look into it myself until
> > next week.

I am grateful that you want to look into it. Thank you! And I look forward to
learning what you discover.

You probably already know how to reproduce it, but in case it helps, I still
have the packet captures and can share them with you. Let me know if you'd
like me to share them (and how to share them).

> > 
> > The objective is to allow IPv6 Netlink clients to be able to create default
> > routes from RAs in the same way the kernel creates default routes from RAs.
> > Essentially, I'm trying to have Netlink and Kernel behaviors match.
> 
> I understand, but it's essentially an extension for the legacy IPv6
> multipath API which we are trying to move away from towards the nexthop
> API (see more below).

Very interesting, I wasn't aware of this movement.

While this change is an extension of the legacy IPv6 multipath API, won't it
still need to support Netlink clients that have been designed around it? I
imagine that transitioning Netlink clients to the NH API will take many years?

As such, it still seems appropriate (to me) that this be implemented in the
legacy API as well as ensuring it works with the NH API.

Another consideration...

Will the kernel RA processing go through the same nh pathway? The reason I
ask is because I faced several challenges with IPv6 Logo certification due to
Netlink clients being unable to achieve the same as the kernel's behavior.

As long as the kernel is creating RA routes in a way that meets RFC4861, then
I'd hope that  Netlink clients would be able to leverage that for 'free'.

> > 
> > My analysis led me to the need for Netlink clients to set the kernel's
> > fib6_config flags RTF_RA_ROUTER, where:
> > 
> >     #define RTF_RA_ROUTER		(RTF_ADDRCONF | RTF_DEFAULT)
> > 
> >>> +	if (rtm->rtm_flags & RTM_F_RA_ROUTER)
> >>> +		cfg->fc_flags |= RTF_RA_ROUTER;
> >>> +
> >> 
> >> It is possible there are user space programs out there that set this bit
> >> (knowingly or not) when sending requests to the kernel and this change
> >> will result in a behavior change for them. So, if we were to continue in
> >> this path, this would need to be converted to a new netlink attribute to
> >> avoid such potential problems.
> >> 
> > 
> > Is this a mandated approach to implementing unspecified bits in a flag?
> > 
> > I'm a little surprised by this consideration. If we account for poorly
> > written buggy user-programs, doesn't this open any API to an explosion
> > of new attributes or other odd extensions? I'd imagine the same argument
> > would be applicable to ioctl flags, socket flags, and so on. Why would we
> > treat implementing unspecified Netlink bits differently to implementing
> > unspecified ioctl bits, etc.
> > 
> > Naturally, if this is the mandated approach, then I'll reimplement it with
> > a new Netlink attribute. I'm just trying to understand what is the
> > Linux-lore, here?
> 
> Using this bit could have been valid if previously the kernel rejected
> requests with this bit set, but as evident by your patch the kernel does
> not do it. It is therefore possible that there are user space programs
> out there that are working perfectly fine right now and they will break
> / misbehave after this change.
> 

Understood and I agree.

> > 
> >> BTW, you can avoid the coalescing problem by using the nexthop API (man
> >> ip-nexthop).
> > 
> > I'm not sure how that would help in this case. We need the nexthop to be
> > determined according to its REACHABILITY and other considerations described
> > in RFC4861.
> 
> Using your example:
> 
> # ip nexthop add id 1 via fe80::200:10ff:fe10:1060 dev enp0s9
> # ip -6 route add default nhid 1 expires 600 proto ra
> # ip nexthop add id 2 via fe80::200:10ff:fe10:1061 dev enp0s9
> # ip -6 route append default nhid 2 expires 600 proto ra
> # ip -6 route
> fe80::/64 dev enp0s9 proto kernel metric 256 pref medium
> default nhid 1 via fe80::200:10ff:fe10:1060 dev enp0s9 proto ra metric 1024 expires 563sec pref medium
> default nhid 2 via fe80::200:10ff:fe10:1061 dev enp0s9 proto ra metric 1024 expires 594sec pref medium

Thanks! That looks like it should work. I'll raise this with the the developers
of systemd-networkd.

Just to confirm; are these two nhid routes equivalent to having two separate
default routes that are created when the kernel processes IPv6 RAs?

Specifically, if one of these nhid routes becomes UNREACHABLE, will that be
taken into consideration during the routing decision? (I'm guessing so?)

Thank you for your interest in this.

Regards,
Matt.
Ido Schimmel Nov. 7, 2024, 9:52 a.m. UTC | #8
On Wed, Nov 06, 2024 at 10:53:03PM -0500, Matt Muggeridge wrote:
> Hi Ido,
> 
> > >>> Is the problem that fib6_table_lookup() chooses a reachable
> > >>> nexthop and then fib6_select_path() overrides it with an unreachable
> > >>> one?
> > 
> > >> I'm afraid I don't know.
> > >>
> > > We need to understand the current behavior before adding a new interface
> > > that we will never be able to remove. It is possible we can improve /
> > > fix the current code. I won't have time to look into it myself until
> > > next week.
> 
> I am grateful that you want to look into it. Thank you! And I look forward to
> learning what you discover.
> 
> You probably already know how to reproduce it, but in case it helps, I still
> have the packet captures and can share them with you. Let me know if you'd
> like me to share them (and how to share them).

It would be best if you could provide a reproducer using iproute2:
Configure a dummy device using ip-link, install the multipath route
using ip-route, configure the neighbour table using ip-neigh and then
perform route queries using "ip route get ..." showing the problem. We
can then use it as the basis for a new test case in
tools/testing/selftests/net/fib_tests.sh 

BTW, do you have CONFIG_IPV6_ROUTER_PREF=y in your config?

> 
> > > 
> > > The objective is to allow IPv6 Netlink clients to be able to create default
> > > routes from RAs in the same way the kernel creates default routes from RAs.
> > > Essentially, I'm trying to have Netlink and Kernel behaviors match.
> > 
> > I understand, but it's essentially an extension for the legacy IPv6
> > multipath API which we are trying to move away from towards the nexthop
> > API (see more below).
> 
> Very interesting, I wasn't aware of this movement.
> 
> While this change is an extension of the legacy IPv6 multipath API, won't it
> still need to support Netlink clients that have been designed around it? I
> imagine that transitioning Netlink clients to the NH API will take many years?

FRR already supports it and I saw that there is some support for nexthop
objects in systemd:

https://github.com/systemd/systemd/pull/13735

> 
> As such, it still seems appropriate (to me) that this be implemented in the
> legacy API as well as ensuring it works with the NH API.

As I understand it you currently get different results because the
kernel installs two default routes whereas user space can only create
one default multipath route. Before adding a new uAPI I want to
understand the source of the difference and see if we can improve / fix
the current multipath code so that the two behave the same. If we can
get them to behave the same then I don't think user space will care
about two default routes versus one default multipath route.

> 
> Another consideration...
> 
> Will the kernel RA processing go through the same nh pathway? The reason I
> ask is because I faced several challenges with IPv6 Logo certification due to
> Netlink clients being unable to achieve the same as the kernel's behavior.

If you are asking if the kernel can install RA routes using nexthop
objects, then the answer is no. Only user space can create nexthop
objects and I don't think we want to allow the kernel to do that.

> 
> As long as the kernel is creating RA routes in a way that meets RFC4861, then
> I'd hope that  Netlink clients would be able to leverage that for 'free'.
> 
> > > 
> > > My analysis led me to the need for Netlink clients to set the kernel's
> > > fib6_config flags RTF_RA_ROUTER, where:
> > > 
> > >     #define RTF_RA_ROUTER		(RTF_ADDRCONF | RTF_DEFAULT)
> > > 
> > >>> +	if (rtm->rtm_flags & RTM_F_RA_ROUTER)
> > >>> +		cfg->fc_flags |= RTF_RA_ROUTER;
> > >>> +
> > >> 
> > >> It is possible there are user space programs out there that set this bit
> > >> (knowingly or not) when sending requests to the kernel and this change
> > >> will result in a behavior change for them. So, if we were to continue in
> > >> this path, this would need to be converted to a new netlink attribute to
> > >> avoid such potential problems.
> > >> 
> > > 
> > > Is this a mandated approach to implementing unspecified bits in a flag?
> > > 
> > > I'm a little surprised by this consideration. If we account for poorly
> > > written buggy user-programs, doesn't this open any API to an explosion
> > > of new attributes or other odd extensions? I'd imagine the same argument
> > > would be applicable to ioctl flags, socket flags, and so on. Why would we
> > > treat implementing unspecified Netlink bits differently to implementing
> > > unspecified ioctl bits, etc.
> > > 
> > > Naturally, if this is the mandated approach, then I'll reimplement it with
> > > a new Netlink attribute. I'm just trying to understand what is the
> > > Linux-lore, here?
> > 
> > Using this bit could have been valid if previously the kernel rejected
> > requests with this bit set, but as evident by your patch the kernel does
> > not do it. It is therefore possible that there are user space programs
> > out there that are working perfectly fine right now and they will break
> > / misbehave after this change.
> > 
> 
> Understood and I agree.
> 
> > > 
> > >> BTW, you can avoid the coalescing problem by using the nexthop API (man
> > >> ip-nexthop).
> > > 
> > > I'm not sure how that would help in this case. We need the nexthop to be
> > > determined according to its REACHABILITY and other considerations described
> > > in RFC4861.
> > 
> > Using your example:
> > 
> > # ip nexthop add id 1 via fe80::200:10ff:fe10:1060 dev enp0s9
> > # ip -6 route add default nhid 1 expires 600 proto ra
> > # ip nexthop add id 2 via fe80::200:10ff:fe10:1061 dev enp0s9
> > # ip -6 route append default nhid 2 expires 600 proto ra
> > # ip -6 route
> > fe80::/64 dev enp0s9 proto kernel metric 256 pref medium
> > default nhid 1 via fe80::200:10ff:fe10:1060 dev enp0s9 proto ra metric 1024 expires 563sec pref medium
> > default nhid 2 via fe80::200:10ff:fe10:1061 dev enp0s9 proto ra metric 1024 expires 594sec pref medium
> 
> Thanks! That looks like it should work. I'll raise this with the the developers
> of systemd-networkd.
> 
> Just to confirm; are these two nhid routes equivalent to having two separate
> default routes that are created when the kernel processes IPv6 RAs?
> 
> Specifically, if one of these nhid routes becomes UNREACHABLE, will that be
> taken into consideration during the routing decision? (I'm guessing so?)

I didn't test it, but I don't see a reason for these two routes to
behave differently than two default routes installed with legacy
nexthops.
Matt Muggeridge Nov. 8, 2024, 2:20 a.m. UTC | #9
> > You probably already know how to reproduce it, but in case it helps, I still
> > have the packet captures and can share them with you. Let me know if you'd
> > like me to share them (and how to share them).
> 
> It would be best if you could provide a reproducer using iproute2:
> Configure a dummy device using ip-link, install the multipath route
> using ip-route, configure the neighbour table using ip-neigh and then
> perform route queries using "ip route get ..." showing the problem. We
> can then use it as the basis for a new test case in
> tools/testing/selftests/net/fib_tests.sh 

I'll try to do that next week.

> BTW, do you have CONFIG_IPV6_ROUTER_PREF=y in your config?

Yes.

$ gunzip -c /proc/config.gz | grep ROUTER_PREF
CONFIG_IPV6_ROUTER_PREF=y

> > 
> > As such, it still seems appropriate (to me) that this be implemented in the
> > legacy API as well as ensuring it works with the NH API.
> 
> As I understand it you currently get different results because the
> kernel installs two default routes whereas user space can only create
> one default multipath route.

Yes, that's the end result of an underlying problem.

Perhaps more to the point, the fact that a coalesced, INCOMPLETE, multipath
route is selected when a REACHABLE alternative exists, is what prevents us
from using coalesced multipath routes. This seems like a bug, since it violates
RFC4861 6.3.6, bullet 1.

Imagine adding a 2nd router to an IPv6 network for added resiliency, but when
one becomes unreachable, some network flows keep choosing the unreachable
router. This is what is happening with ECMP routes. It doesn't happen with
multiple default routes.

I'll just reiterate earlier comments, this doesn't happen all of the time.
It seems I have a 50/50 chance of the INCOMPLETE route being selected.

> Before adding a new uAPI I want to
> understand the source of the difference and see if we can improve / fix
> the current multipath code so that the two behave the same. If we can
> get them to behave the same then I don't think user space will care
> about two default routes versus one default multipath route.

Exactly, I totally support that approach.

Regards,
Matt.
Matt Muggeridge Nov. 14, 2024, 8:42 p.m. UTC | #10
Thank you for your review and feedback, Ido.

>> Without this flag, when there are mutliple default routers, the kernel
>> coalesces multiple default routes into an ECMP route. The ECMP route
>> ignores per-route REACHABILITY information. If one of the default
>> routers is unresponsive, with a Neighbor Cache entry of INCOMPLETE, then
>> it can still be selected as the nexthop for outgoing packets. This
>> results in an inability to communicate with remote hosts, even though
>> one of the default routers remains REACHABLE. This violates RFC4861
>> section 6.3.6, bullet 1.
>
>Do you have forwarding disabled (it causes RT6_LOOKUP_F_REACHABLE to be
>set)?

Yes, forwarding is disabled on our embedded system. Though, this needs to
work on systems regardless of the state of forwarding.

>  Is the problem that fib6_table_lookup() chooses a reachable
>nexthop and then fib6_select_path() overrides it with an unreachable
>one?

I'm afraid I don't know.

The objective is to allow IPv6 Netlink clients to be able to create default
routes from RAs in the same way the kernel creates default routes from RAs.
Essentially, I'm trying to have Netlink and Kernel behaviors match.

My analysis led me to the need for Netlink clients to set the kernel's
fib6_config flags RTF_RA_ROUTER, where:

    #define RTF_RA_ROUTER		(RTF_ADDRCONF | RTF_DEFAULT)

>> +	if (rtm->rtm_flags & RTM_F_RA_ROUTER)
>> +		cfg->fc_flags |= RTF_RA_ROUTER;
>> +
> 
> It is possible there are user space programs out there that set this bit
> (knowingly or not) when sending requests to the kernel and this change
> will result in a behavior change for them. So, if we were to continue in
> this path, this would need to be converted to a new netlink attribute to
> avoid such potential problems.
> 

Is this a mandated approach to implementing unspecified bits in a flag?

I'm a little surprised by this consideration. If we account for poorly
written buggy user-programs, doesn't this open any API to an explosion
of new attributes or other odd extensions? I'd imagine the same argument
would be applicable to ioctl flags, socket flags, and so on. Why would we
treat implementing unspecified Netlink bits differently to implementing
unspecified ioctl bits, etc.

Naturally, if this is the mandated approach, then I'll reimplement it with
a new Netlink attribute. I'm just trying to understand what is the
Linux-lore, here?

> BTW, you can avoid the coalescing problem by using the nexthop API (man
> ip-nexthop).

I'm not sure how that would help in this case. We need the nexthop to be
determined according to its REACHABILITY and other considerations described
in RFC4861.

Kind regards,
Matt.
diff mbox series

Patch

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 3b687d20c9ed..9f0259f6e4ed 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -202,7 +202,7 @@  enum {
 #define RTM_NR_FAMILIES	(RTM_NR_MSGTYPES >> 2)
 #define RTM_FAM(cmd)	(((cmd) - RTM_BASE) >> 2)
 
-/* 
+/*
    Generic structure for encapsulation of optional route information.
    It is reminiscent of sockaddr, but with sa_family replaced
    with attribute type.
@@ -242,7 +242,7 @@  struct rtmsg {
 
 	unsigned char		rtm_table;	/* Routing table id */
 	unsigned char		rtm_protocol;	/* Routing protocol; see below	*/
-	unsigned char		rtm_scope;	/* See below */	
+	unsigned char		rtm_scope;	/* See below */
 	unsigned char		rtm_type;	/* See below	*/
 
 	unsigned		rtm_flags;
@@ -336,6 +336,7 @@  enum rt_scope_t {
 #define RTM_F_FIB_MATCH	        0x2000	/* return full fib lookup match */
 #define RTM_F_OFFLOAD		0x4000	/* route is offloaded */
 #define RTM_F_TRAP		0x8000	/* route is trapping packets */
+#define RTM_F_RA_ROUTER		0x10000	/* route is a default route from RA */
 #define RTM_F_OFFLOAD_FAILED	0x20000000 /* route offload failed, this value
 					    * is chosen to avoid conflicts with
 					    * other flags defined in
@@ -568,7 +569,7 @@  struct ifinfomsg {
 };
 
 /********************************************************************
- *		prefix information 
+ *		prefix information
  ****/
 
 struct prefixmsg {
@@ -582,7 +583,7 @@  struct prefixmsg {
 	unsigned char	prefix_pad3;
 };
 
-enum 
+enum
 {
 	PREFIX_UNSPEC,
 	PREFIX_ADDRESS,
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index b4251915585f..5b0c16422720 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -5055,6 +5055,9 @@  static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
 	if (rtm->rtm_flags & RTM_F_CLONED)
 		cfg->fc_flags |= RTF_CACHE;
 
+	if (rtm->rtm_flags & RTM_F_RA_ROUTER)
+		cfg->fc_flags |= RTF_RA_ROUTER;
+
 	cfg->fc_flags |= (rtm->rtm_flags & RTNH_F_ONLINK);
 
 	if (tb[RTA_NH_ID]) {