diff mbox series

[net] ipv4: fix source address and gateway mismatch under multiple default gateways

Message ID 20221026032017.3675060-1-william.xuanziyang@huawei.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net] ipv4: fix source address and gateway mismatch under multiple default gateways | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit fail Errors and warnings before: 2 this patch: 4
netdev/cc_maintainers success CCed 8 of 8 maintainers
netdev/build_clang success Errors and warnings before: 5 this patch: 5
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn fail Errors and warnings before: 2 this patch: 4
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 18 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Ziyang Xuan (William) Oct. 26, 2022, 3:20 a.m. UTC
We found a problem that source address doesn't match with selected gateway
under multiple default gateways. The reproducer is as following:

Setup in client as following:

$ ip link add link eth2 dev eth2.71 type vlan id 71
$ ip link add link eth2 dev eth2.72 type vlan id 72
$ ip addr add 192.168.71.41/24 dev eth2.71
$ ip addr add 192.168.72.41/24 dev eth2.72
$ ip link set eth2.71 up
$ ip link set eth2.72 up
$ route add -net default gw 192.168.71.1 dev eth2.71
$ route add -net default gw 192.168.72.1 dev eth2.72

Add a nameserver configuration in the following file:
$ cat /etc/resolv.conf
nameserver 8.8.8.8

Setup in peer server as following:

$ ip link add link eth2 dev eth2.71 type vlan id 71
$ ip link add link eth2 dev eth2.72 type vlan id 72
$ ip addr add 192.168.71.1/24 dev eth2.71
$ ip addr add 192.168.72.1/24 dev eth2.72
$ ip link set eth2.71 up
$ ip link set eth2.72 up

Use the following command trigger DNS packet in client:
$ ping www.baidu.com

Capture packets with tcpdump in client when ping:
$ tcpdump -i eth2 -vne
...
20:30:22.996044 52:54:00:20:23:a9 > 52:54:00:d2:4f:e3, ethertype 802.1Q (0x8100), length 77: vlan 71, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 25407, offset 0, flags [DF], proto UDP (17), length 59)
    192.168.72.41.42666 > 8.8.8.8.domain: 58562+ A? www.baidu.com. (31)
...

We get the problem that IPv4 saddr "192.168.72.41" do not match with
selected VLAN device "eth2.71".

In above scenario, the process does __ip_route_output_key() twice in
ip_route_connect(), the two processes have chosen different default gateway,
and the last choice is not the best.

Add flowi4->saddr and fib_nh_common->nhc_gw.ipv4 matching consideration in
fib_select_default() to fix that.

Fixes: 19baf839ff4a ("[IPV4]: Add LC-Trie FIB lookup algorithm.")
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
---
 net/ipv4/fib_semantics.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Julian Anastasov Oct. 26, 2022, 10:15 a.m. UTC | #1
Hello,

On Wed, 26 Oct 2022, Ziyang Xuan wrote:

> We found a problem that source address doesn't match with selected gateway
> under multiple default gateways. The reproducer is as following:
> 
> Setup in client as following:
> 
> $ ip link add link eth2 dev eth2.71 type vlan id 71
> $ ip link add link eth2 dev eth2.72 type vlan id 72
> $ ip addr add 192.168.71.41/24 dev eth2.71
> $ ip addr add 192.168.72.41/24 dev eth2.72
> $ ip link set eth2.71 up
> $ ip link set eth2.72 up
> $ route add -net default gw 192.168.71.1 dev eth2.71
> $ route add -net default gw 192.168.72.1 dev eth2.72

	Second route goes to first position due to the
prepend operation for the route add command. That is
why 192.168.72.41 is selected.

> Add a nameserver configuration in the following file:
> $ cat /etc/resolv.conf
> nameserver 8.8.8.8
> 
> Setup in peer server as following:
> 
> $ ip link add link eth2 dev eth2.71 type vlan id 71
> $ ip link add link eth2 dev eth2.72 type vlan id 72
> $ ip addr add 192.168.71.1/24 dev eth2.71
> $ ip addr add 192.168.72.1/24 dev eth2.72
> $ ip link set eth2.71 up
> $ ip link set eth2.72 up
> 
> Use the following command trigger DNS packet in client:
> $ ping www.baidu.com
> 
> Capture packets with tcpdump in client when ping:
> $ tcpdump -i eth2 -vne
> ...
> 20:30:22.996044 52:54:00:20:23:a9 > 52:54:00:d2:4f:e3, ethertype 802.1Q (0x8100), length 77: vlan 71, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 25407, offset 0, flags [DF], proto UDP (17), length 59)
>     192.168.72.41.42666 > 8.8.8.8.domain: 58562+ A? www.baidu.com. (31)
> ...
> 
> We get the problem that IPv4 saddr "192.168.72.41" do not match with
> selected VLAN device "eth2.71".

	The problem could be that source address is selected
once and later used as source address in following routing lookups.

	And your routing rules do not express the restriction that
both addresses can not be used for specific route. If you have
such restriction which is common, you should use source-specific routes:

1. ip rule to consider table main only for link routes,
no default routes here

ip rule add prio 10 table main

2. source-specific routes:

ip rule add prio 20 from 192.168.71.0/24 table 20
ip route append default via 192.168.71.1 dev eth2.71 src 192.168.71.41 table 20
ip rule add prio 30 from 192.168.72.0/24 table 30
ip route append default via 192.168.72.1 dev eth2.72 src 192.168.72.41 table 30

3. Store default alternative routes not in table main:
ip rule add prio 200 table 200
ip route append default via 192.168.71.1 dev eth2.71 src 192.168.71.41 table 200
ip route append default via 192.168.72.1 dev eth2.72 src 192.168.72.41 table 200

	Above routes should work even without specifying prefsrc.

	As result, table 200 is used only for routing lookups
without specific source address, usually for first packet in
connection, next packets should hit tables 20/30.

	You can check https://ja.ssi.bg/dgd-usage.txt for such
examples, see under 2. Alternative routes and dead gateway detection

> In above scenario, the process does __ip_route_output_key() twice in
> ip_route_connect(), the two processes have chosen different default gateway,
> and the last choice is not the best.
> 
> Add flowi4->saddr and fib_nh_common->nhc_gw.ipv4 matching consideration in
> fib_select_default() to fix that.

	Other setups may not have such restriction, they can
prefer any gateway in reachable state no matter the saddr.

> Fixes: 19baf839ff4a ("[IPV4]: Add LC-Trie FIB lookup algorithm.")
> Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
> ---
>  net/ipv4/fib_semantics.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index e9a7f70a54df..8bd94875a009 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -2046,6 +2046,7 @@ static void fib_select_default(const struct flowi4 *flp, struct fib_result *res)
>  	int order = -1, last_idx = -1;
>  	struct fib_alias *fa, *fa1 = NULL;
>  	u32 last_prio = res->fi->fib_priority;
> +	u8 prefix, max_prefix = 0;
>  	dscp_t last_dscp = 0;
>  
>  	hlist_for_each_entry_rcu(fa, fa_head, fa_list) {
> @@ -2078,6 +2079,11 @@ static void fib_select_default(const struct flowi4 *flp, struct fib_result *res)
>  		if (!nhc->nhc_gw_family || nhc->nhc_scope != RT_SCOPE_LINK)
>  			continue;
>  
> +		prefix = __ffs(flp->saddr ^ nhc->nhc_gw.ipv4);
> +		if (prefix < max_prefix)
> +			continue;
> +		max_prefix = max_t(u8, prefix, max_prefix);
> +
>  		fib_alias_accessed(fa);
>  
>  		if (!fi) {
> -- 
> 2.25.1

Regards

--
Julian Anastasov <ja@ssi.bg>
kernel test robot Oct. 26, 2022, 3:48 p.m. UTC | #2
Hi Ziyang,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Ziyang-Xuan/ipv4-fix-source-address-and-gateway-mismatch-under-multiple-default-gateways/20221026-112135
patch link:    https://lore.kernel.org/r/20221026032017.3675060-1-william.xuanziyang%40huawei.com
patch subject: [PATCH net] ipv4: fix source address and gateway mismatch under multiple default gateways
config: sparc-randconfig-s033-20221026
compiler: sparc64-linux-gcc (GCC) 12.1.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # apt-get install sparse
        # sparse version: v0.6.4-39-gce1a6720-dirty
        # https://github.com/intel-lab-lkp/linux/commit/10ba8a9a3b91638837103326eb6a23b814b5ee2b
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Ziyang-Xuan/ipv4-fix-source-address-and-gateway-mismatch-under-multiple-default-gateways/20221026-112135
        git checkout 10ba8a9a3b91638837103326eb6a23b814b5ee2b
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=sparc SHELL=/bin/bash net/ipv4/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

sparse warnings: (new ones prefixed by >>)
>> net/ipv4/fib_semantics.c:2082:43: sparse: sparse: incorrect type in argument 1 (different base types) @@     expected unsigned long @@     got restricted __be32 @@
   net/ipv4/fib_semantics.c:2082:43: sparse:     expected unsigned long
   net/ipv4/fib_semantics.c:2082:43: sparse:     got restricted __be32

vim +2082 net/ipv4/fib_semantics.c

  2038	
  2039	/* Must be invoked inside of an RCU protected region.  */
  2040	static void fib_select_default(const struct flowi4 *flp, struct fib_result *res)
  2041	{
  2042		struct fib_info *fi = NULL, *last_resort = NULL;
  2043		struct hlist_head *fa_head = res->fa_head;
  2044		struct fib_table *tb = res->table;
  2045		u8 slen = 32 - res->prefixlen;
  2046		int order = -1, last_idx = -1;
  2047		struct fib_alias *fa, *fa1 = NULL;
  2048		u32 last_prio = res->fi->fib_priority;
  2049		u8 prefix, max_prefix = 0;
  2050		dscp_t last_dscp = 0;
  2051	
  2052		hlist_for_each_entry_rcu(fa, fa_head, fa_list) {
  2053			struct fib_info *next_fi = fa->fa_info;
  2054			struct fib_nh_common *nhc;
  2055	
  2056			if (fa->fa_slen != slen)
  2057				continue;
  2058			if (fa->fa_dscp &&
  2059			    fa->fa_dscp != inet_dsfield_to_dscp(flp->flowi4_tos))
  2060				continue;
  2061			if (fa->tb_id != tb->tb_id)
  2062				continue;
  2063			if (next_fi->fib_priority > last_prio &&
  2064			    fa->fa_dscp == last_dscp) {
  2065				if (last_dscp)
  2066					continue;
  2067				break;
  2068			}
  2069			if (next_fi->fib_flags & RTNH_F_DEAD)
  2070				continue;
  2071			last_dscp = fa->fa_dscp;
  2072			last_prio = next_fi->fib_priority;
  2073	
  2074			if (next_fi->fib_scope != res->scope ||
  2075			    fa->fa_type != RTN_UNICAST)
  2076				continue;
  2077	
  2078			nhc = fib_info_nhc(next_fi, 0);
  2079			if (!nhc->nhc_gw_family || nhc->nhc_scope != RT_SCOPE_LINK)
  2080				continue;
  2081	
> 2082			prefix = __ffs(flp->saddr ^ nhc->nhc_gw.ipv4);
  2083			if (prefix < max_prefix)
  2084				continue;
  2085			max_prefix = max_t(u8, prefix, max_prefix);
  2086	
  2087			fib_alias_accessed(fa);
  2088	
  2089			if (!fi) {
  2090				if (next_fi != res->fi)
  2091					break;
  2092				fa1 = fa;
  2093			} else if (!fib_detect_death(fi, order, &last_resort,
  2094						     &last_idx, fa1->fa_default)) {
  2095				fib_result_assign(res, fi);
  2096				fa1->fa_default = order;
  2097				goto out;
  2098			}
  2099			fi = next_fi;
  2100			order++;
  2101		}
  2102	
  2103		if (order <= 0 || !fi) {
  2104			if (fa1)
  2105				fa1->fa_default = -1;
  2106			goto out;
  2107		}
  2108	
  2109		if (!fib_detect_death(fi, order, &last_resort, &last_idx,
  2110				      fa1->fa_default)) {
  2111			fib_result_assign(res, fi);
  2112			fa1->fa_default = order;
  2113			goto out;
  2114		}
  2115	
  2116		if (last_idx >= 0)
  2117			fib_result_assign(res, last_resort);
  2118		fa1->fa_default = last_idx;
  2119	out:
  2120		return;
  2121	}
  2122
kernel test robot Oct. 26, 2022, 4:49 p.m. UTC | #3
Hi Ziyang,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Ziyang-Xuan/ipv4-fix-source-address-and-gateway-mismatch-under-multiple-default-gateways/20221026-112135
patch link:    https://lore.kernel.org/r/20221026032017.3675060-1-william.xuanziyang%40huawei.com
patch subject: [PATCH net] ipv4: fix source address and gateway mismatch under multiple default gateways
config: loongarch-randconfig-s043-20221024
compiler: loongarch64-linux-gcc (GCC) 12.1.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # apt-get install sparse
        # sparse version: v0.6.4-39-gce1a6720-dirty
        # https://github.com/intel-lab-lkp/linux/commit/10ba8a9a3b91638837103326eb6a23b814b5ee2b
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Ziyang-Xuan/ipv4-fix-source-address-and-gateway-mismatch-under-multiple-default-gateways/20221026-112135
        git checkout 10ba8a9a3b91638837103326eb6a23b814b5ee2b
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=loongarch SHELL=/bin/bash net/ipv4/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

sparse warnings: (new ones prefixed by >>)
>> net/ipv4/fib_semantics.c:2082:43: sparse: sparse: incorrect type in argument 1 (different base types) @@     expected unsigned long word @@     got restricted __be32 @@
   net/ipv4/fib_semantics.c:2082:43: sparse:     expected unsigned long word
   net/ipv4/fib_semantics.c:2082:43: sparse:     got restricted __be32

vim +2082 net/ipv4/fib_semantics.c

  2038	
  2039	/* Must be invoked inside of an RCU protected region.  */
  2040	static void fib_select_default(const struct flowi4 *flp, struct fib_result *res)
  2041	{
  2042		struct fib_info *fi = NULL, *last_resort = NULL;
  2043		struct hlist_head *fa_head = res->fa_head;
  2044		struct fib_table *tb = res->table;
  2045		u8 slen = 32 - res->prefixlen;
  2046		int order = -1, last_idx = -1;
  2047		struct fib_alias *fa, *fa1 = NULL;
  2048		u32 last_prio = res->fi->fib_priority;
  2049		u8 prefix, max_prefix = 0;
  2050		dscp_t last_dscp = 0;
  2051	
  2052		hlist_for_each_entry_rcu(fa, fa_head, fa_list) {
  2053			struct fib_info *next_fi = fa->fa_info;
  2054			struct fib_nh_common *nhc;
  2055	
  2056			if (fa->fa_slen != slen)
  2057				continue;
  2058			if (fa->fa_dscp &&
  2059			    fa->fa_dscp != inet_dsfield_to_dscp(flp->flowi4_tos))
  2060				continue;
  2061			if (fa->tb_id != tb->tb_id)
  2062				continue;
  2063			if (next_fi->fib_priority > last_prio &&
  2064			    fa->fa_dscp == last_dscp) {
  2065				if (last_dscp)
  2066					continue;
  2067				break;
  2068			}
  2069			if (next_fi->fib_flags & RTNH_F_DEAD)
  2070				continue;
  2071			last_dscp = fa->fa_dscp;
  2072			last_prio = next_fi->fib_priority;
  2073	
  2074			if (next_fi->fib_scope != res->scope ||
  2075			    fa->fa_type != RTN_UNICAST)
  2076				continue;
  2077	
  2078			nhc = fib_info_nhc(next_fi, 0);
  2079			if (!nhc->nhc_gw_family || nhc->nhc_scope != RT_SCOPE_LINK)
  2080				continue;
  2081	
> 2082			prefix = __ffs(flp->saddr ^ nhc->nhc_gw.ipv4);
  2083			if (prefix < max_prefix)
  2084				continue;
  2085			max_prefix = max_t(u8, prefix, max_prefix);
  2086	
  2087			fib_alias_accessed(fa);
  2088	
  2089			if (!fi) {
  2090				if (next_fi != res->fi)
  2091					break;
  2092				fa1 = fa;
  2093			} else if (!fib_detect_death(fi, order, &last_resort,
  2094						     &last_idx, fa1->fa_default)) {
  2095				fib_result_assign(res, fi);
  2096				fa1->fa_default = order;
  2097				goto out;
  2098			}
  2099			fi = next_fi;
  2100			order++;
  2101		}
  2102	
  2103		if (order <= 0 || !fi) {
  2104			if (fa1)
  2105				fa1->fa_default = -1;
  2106			goto out;
  2107		}
  2108	
  2109		if (!fib_detect_death(fi, order, &last_resort, &last_idx,
  2110				      fa1->fa_default)) {
  2111			fib_result_assign(res, fi);
  2112			fa1->fa_default = order;
  2113			goto out;
  2114		}
  2115	
  2116		if (last_idx >= 0)
  2117			fib_result_assign(res, last_resort);
  2118		fa1->fa_default = last_idx;
  2119	out:
  2120		return;
  2121	}
  2122
kernel test robot Oct. 26, 2022, 5:29 p.m. UTC | #4
Hi Ziyang,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Ziyang-Xuan/ipv4-fix-source-address-and-gateway-mismatch-under-multiple-default-gateways/20221026-112135
patch link:    https://lore.kernel.org/r/20221026032017.3675060-1-william.xuanziyang%40huawei.com
patch subject: [PATCH net] ipv4: fix source address and gateway mismatch under multiple default gateways
config: x86_64-randconfig-s023
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.4-39-gce1a6720-dirty
        # https://github.com/intel-lab-lkp/linux/commit/10ba8a9a3b91638837103326eb6a23b814b5ee2b
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Ziyang-Xuan/ipv4-fix-source-address-and-gateway-mismatch-under-multiple-default-gateways/20221026-112135
        git checkout 10ba8a9a3b91638837103326eb6a23b814b5ee2b
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=x86_64 SHELL=/bin/bash net/ipv4/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

sparse warnings: (new ones prefixed by >>)
   net/ipv4/fib_semantics.c:2082:26: sparse: sparse: incorrect type in argument 1 (different base types) @@     expected unsigned long word @@     got restricted __be32 @@
   net/ipv4/fib_semantics.c:2082:26: sparse:     expected unsigned long word
   net/ipv4/fib_semantics.c:2082:26: sparse:     got restricted __be32
>> net/ipv4/fib_semantics.c:2082:26: sparse: sparse: incorrect type in argument 1 (different base types) @@     expected long @@     got restricted __be32 @@
   net/ipv4/fib_semantics.c:2082:26: sparse:     expected long
   net/ipv4/fib_semantics.c:2082:26: sparse:     got restricted __be32

vim +2082 net/ipv4/fib_semantics.c

  2038	
  2039	/* Must be invoked inside of an RCU protected region.  */
  2040	static void fib_select_default(const struct flowi4 *flp, struct fib_result *res)
  2041	{
  2042		struct fib_info *fi = NULL, *last_resort = NULL;
  2043		struct hlist_head *fa_head = res->fa_head;
  2044		struct fib_table *tb = res->table;
  2045		u8 slen = 32 - res->prefixlen;
  2046		int order = -1, last_idx = -1;
  2047		struct fib_alias *fa, *fa1 = NULL;
  2048		u32 last_prio = res->fi->fib_priority;
  2049		u8 prefix, max_prefix = 0;
  2050		dscp_t last_dscp = 0;
  2051	
  2052		hlist_for_each_entry_rcu(fa, fa_head, fa_list) {
  2053			struct fib_info *next_fi = fa->fa_info;
  2054			struct fib_nh_common *nhc;
  2055	
  2056			if (fa->fa_slen != slen)
  2057				continue;
  2058			if (fa->fa_dscp &&
  2059			    fa->fa_dscp != inet_dsfield_to_dscp(flp->flowi4_tos))
  2060				continue;
  2061			if (fa->tb_id != tb->tb_id)
  2062				continue;
  2063			if (next_fi->fib_priority > last_prio &&
  2064			    fa->fa_dscp == last_dscp) {
  2065				if (last_dscp)
  2066					continue;
  2067				break;
  2068			}
  2069			if (next_fi->fib_flags & RTNH_F_DEAD)
  2070				continue;
  2071			last_dscp = fa->fa_dscp;
  2072			last_prio = next_fi->fib_priority;
  2073	
  2074			if (next_fi->fib_scope != res->scope ||
  2075			    fa->fa_type != RTN_UNICAST)
  2076				continue;
  2077	
  2078			nhc = fib_info_nhc(next_fi, 0);
  2079			if (!nhc->nhc_gw_family || nhc->nhc_scope != RT_SCOPE_LINK)
  2080				continue;
  2081	
> 2082			prefix = __ffs(flp->saddr ^ nhc->nhc_gw.ipv4);
  2083			if (prefix < max_prefix)
  2084				continue;
  2085			max_prefix = max_t(u8, prefix, max_prefix);
  2086	
  2087			fib_alias_accessed(fa);
  2088	
  2089			if (!fi) {
  2090				if (next_fi != res->fi)
  2091					break;
  2092				fa1 = fa;
  2093			} else if (!fib_detect_death(fi, order, &last_resort,
  2094						     &last_idx, fa1->fa_default)) {
  2095				fib_result_assign(res, fi);
  2096				fa1->fa_default = order;
  2097				goto out;
  2098			}
  2099			fi = next_fi;
  2100			order++;
  2101		}
  2102	
  2103		if (order <= 0 || !fi) {
  2104			if (fa1)
  2105				fa1->fa_default = -1;
  2106			goto out;
  2107		}
  2108	
  2109		if (!fib_detect_death(fi, order, &last_resort, &last_idx,
  2110				      fa1->fa_default)) {
  2111			fib_result_assign(res, fi);
  2112			fa1->fa_default = order;
  2113			goto out;
  2114		}
  2115	
  2116		if (last_idx >= 0)
  2117			fib_result_assign(res, last_resort);
  2118		fa1->fa_default = last_idx;
  2119	out:
  2120		return;
  2121	}
  2122
kernel test robot Oct. 27, 2022, 3:54 p.m. UTC | #5
Hi Ziyang,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Ziyang-Xuan/ipv4-fix-source-address-and-gateway-mismatch-under-multiple-default-gateways/20221026-112135
patch link:    https://lore.kernel.org/r/20221026032017.3675060-1-william.xuanziyang%40huawei.com
patch subject: [PATCH net] ipv4: fix source address and gateway mismatch under multiple default gateways
config: parisc-randconfig-s031-20221026
compiler: hppa64-linux-gcc (GCC) 12.1.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # apt-get install sparse
        # sparse version: v0.6.4-39-gce1a6720-dirty
        # https://github.com/intel-lab-lkp/linux/commit/10ba8a9a3b91638837103326eb6a23b814b5ee2b
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Ziyang-Xuan/ipv4-fix-source-address-and-gateway-mismatch-under-multiple-default-gateways/20221026-112135
        git checkout 10ba8a9a3b91638837103326eb6a23b814b5ee2b
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=parisc64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

sparse warnings: (new ones prefixed by >>)
>> net/ipv4/fib_semantics.c:2082:43: sparse: sparse: incorrect type in argument 1 (different base types) @@     expected unsigned long x @@     got restricted __be32 @@
   net/ipv4/fib_semantics.c:2082:43: sparse:     expected unsigned long x
   net/ipv4/fib_semantics.c:2082:43: sparse:     got restricted __be32

vim +2082 net/ipv4/fib_semantics.c

  2038	
  2039	/* Must be invoked inside of an RCU protected region.  */
  2040	static void fib_select_default(const struct flowi4 *flp, struct fib_result *res)
  2041	{
  2042		struct fib_info *fi = NULL, *last_resort = NULL;
  2043		struct hlist_head *fa_head = res->fa_head;
  2044		struct fib_table *tb = res->table;
  2045		u8 slen = 32 - res->prefixlen;
  2046		int order = -1, last_idx = -1;
  2047		struct fib_alias *fa, *fa1 = NULL;
  2048		u32 last_prio = res->fi->fib_priority;
  2049		u8 prefix, max_prefix = 0;
  2050		dscp_t last_dscp = 0;
  2051	
  2052		hlist_for_each_entry_rcu(fa, fa_head, fa_list) {
  2053			struct fib_info *next_fi = fa->fa_info;
  2054			struct fib_nh_common *nhc;
  2055	
  2056			if (fa->fa_slen != slen)
  2057				continue;
  2058			if (fa->fa_dscp &&
  2059			    fa->fa_dscp != inet_dsfield_to_dscp(flp->flowi4_tos))
  2060				continue;
  2061			if (fa->tb_id != tb->tb_id)
  2062				continue;
  2063			if (next_fi->fib_priority > last_prio &&
  2064			    fa->fa_dscp == last_dscp) {
  2065				if (last_dscp)
  2066					continue;
  2067				break;
  2068			}
  2069			if (next_fi->fib_flags & RTNH_F_DEAD)
  2070				continue;
  2071			last_dscp = fa->fa_dscp;
  2072			last_prio = next_fi->fib_priority;
  2073	
  2074			if (next_fi->fib_scope != res->scope ||
  2075			    fa->fa_type != RTN_UNICAST)
  2076				continue;
  2077	
  2078			nhc = fib_info_nhc(next_fi, 0);
  2079			if (!nhc->nhc_gw_family || nhc->nhc_scope != RT_SCOPE_LINK)
  2080				continue;
  2081	
> 2082			prefix = __ffs(flp->saddr ^ nhc->nhc_gw.ipv4);
  2083			if (prefix < max_prefix)
  2084				continue;
  2085			max_prefix = max_t(u8, prefix, max_prefix);
  2086	
  2087			fib_alias_accessed(fa);
  2088	
  2089			if (!fi) {
  2090				if (next_fi != res->fi)
  2091					break;
  2092				fa1 = fa;
  2093			} else if (!fib_detect_death(fi, order, &last_resort,
  2094						     &last_idx, fa1->fa_default)) {
  2095				fib_result_assign(res, fi);
  2096				fa1->fa_default = order;
  2097				goto out;
  2098			}
  2099			fi = next_fi;
  2100			order++;
  2101		}
  2102	
  2103		if (order <= 0 || !fi) {
  2104			if (fa1)
  2105				fa1->fa_default = -1;
  2106			goto out;
  2107		}
  2108	
  2109		if (!fib_detect_death(fi, order, &last_resort, &last_idx,
  2110				      fa1->fa_default)) {
  2111			fib_result_assign(res, fi);
  2112			fa1->fa_default = order;
  2113			goto out;
  2114		}
  2115	
  2116		if (last_idx >= 0)
  2117			fib_result_assign(res, last_resort);
  2118		fa1->fa_default = last_idx;
  2119	out:
  2120		return;
  2121	}
  2122
Ziyang Xuan (William) Oct. 29, 2022, 2:38 a.m. UTC | #6
> 
> 	Hello,
> 
> On Wed, 26 Oct 2022, Ziyang Xuan wrote:
> 
>> We found a problem that source address doesn't match with selected gateway
>> under multiple default gateways. The reproducer is as following:
>>
>> Setup in client as following:
>>
>> $ ip link add link eth2 dev eth2.71 type vlan id 71
>> $ ip link add link eth2 dev eth2.72 type vlan id 72
>> $ ip addr add 192.168.71.41/24 dev eth2.71
>> $ ip addr add 192.168.72.41/24 dev eth2.72
>> $ ip link set eth2.71 up
>> $ ip link set eth2.72 up
>> $ route add -net default gw 192.168.71.1 dev eth2.71
>> $ route add -net default gw 192.168.72.1 dev eth2.72
> 
> 	Second route goes to first position due to the
> prepend operation for the route add command. That is
> why 192.168.72.41 is selected.
> 
>> Add a nameserver configuration in the following file:
>> $ cat /etc/resolv.conf
>> nameserver 8.8.8.8
>>
>> Setup in peer server as following:
>>
>> $ ip link add link eth2 dev eth2.71 type vlan id 71
>> $ ip link add link eth2 dev eth2.72 type vlan id 72
>> $ ip addr add 192.168.71.1/24 dev eth2.71
>> $ ip addr add 192.168.72.1/24 dev eth2.72
>> $ ip link set eth2.71 up
>> $ ip link set eth2.72 up
>>
>> Use the following command trigger DNS packet in client:
>> $ ping www.baidu.com
>>
>> Capture packets with tcpdump in client when ping:
>> $ tcpdump -i eth2 -vne
>> ...
>> 20:30:22.996044 52:54:00:20:23:a9 > 52:54:00:d2:4f:e3, ethertype 802.1Q (0x8100), length 77: vlan 71, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 25407, offset 0, flags [DF], proto UDP (17), length 59)
>>     192.168.72.41.42666 > 8.8.8.8.domain: 58562+ A? www.baidu.com. (31)
>> ...
>>
>> We get the problem that IPv4 saddr "192.168.72.41" do not match with
>> selected VLAN device "eth2.71".
> 
> 	The problem could be that source address is selected
> once and later used as source address in following routing lookups.
> 
> 	And your routing rules do not express the restriction that
> both addresses can not be used for specific route. If you have
> such restriction which is common, you should use source-specific routes:

Hi Julian,

Thank you for your quick reply.

Can we make some work to make the restriction "both addresses can not be used for specific route" in code but in consciousness?

Thanks.

> 
> 1. ip rule to consider table main only for link routes,
> no default routes here
> 
> ip rule add prio 10 table main
> 
> 2. source-specific routes:
> 
> ip rule add prio 20 from 192.168.71.0/24 table 20
> ip route append default via 192.168.71.1 dev eth2.71 src 192.168.71.41 table 20
> ip rule add prio 30 from 192.168.72.0/24 table 30
> ip route append default via 192.168.72.1 dev eth2.72 src 192.168.72.41 table 30
> 
> 3. Store default alternative routes not in table main:
> ip rule add prio 200 table 200
> ip route append default via 192.168.71.1 dev eth2.71 src 192.168.71.41 table 200
> ip route append default via 192.168.72.1 dev eth2.72 src 192.168.72.41 table 200
> 
> 	Above routes should work even without specifying prefsrc.
> 
> 	As result, table 200 is used only for routing lookups
> without specific source address, usually for first packet in
> connection, next packets should hit tables 20/30.
> 
> 	You can check https://ja.ssi.bg/dgd-usage.txt for such
> examples, see under 2. Alternative routes and dead gateway detection
> 
>> In above scenario, the process does __ip_route_output_key() twice in
>> ip_route_connect(), the two processes have chosen different default gateway,
>> and the last choice is not the best.
>>
>> Add flowi4->saddr and fib_nh_common->nhc_gw.ipv4 matching consideration in
>> fib_select_default() to fix that.
> 
> 	Other setups may not have such restriction, they can
> prefer any gateway in reachable state no matter the saddr.
> 
>> Fixes: 19baf839ff4a ("[IPV4]: Add LC-Trie FIB lookup algorithm.")
>> Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
>> ---
>>  net/ipv4/fib_semantics.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
>> index e9a7f70a54df..8bd94875a009 100644
>> --- a/net/ipv4/fib_semantics.c
>> +++ b/net/ipv4/fib_semantics.c
>> @@ -2046,6 +2046,7 @@ static void fib_select_default(const struct flowi4 *flp, struct fib_result *res)
>>  	int order = -1, last_idx = -1;
>>  	struct fib_alias *fa, *fa1 = NULL;
>>  	u32 last_prio = res->fi->fib_priority;
>> +	u8 prefix, max_prefix = 0;
>>  	dscp_t last_dscp = 0;
>>  
>>  	hlist_for_each_entry_rcu(fa, fa_head, fa_list) {
>> @@ -2078,6 +2079,11 @@ static void fib_select_default(const struct flowi4 *flp, struct fib_result *res)
>>  		if (!nhc->nhc_gw_family || nhc->nhc_scope != RT_SCOPE_LINK)
>>  			continue;
>>  
>> +		prefix = __ffs(flp->saddr ^ nhc->nhc_gw.ipv4);
>> +		if (prefix < max_prefix)
>> +			continue;
>> +		max_prefix = max_t(u8, prefix, max_prefix);
>> +
>>  		fib_alias_accessed(fa);
>>  
>>  		if (!fi) {
>> -- 
>> 2.25.1
> 
> Regards
> 
> --
> Julian Anastasov <ja@ssi.bg>
> 
> 
> .
>
Julian Anastasov Nov. 13, 2022, 8:01 p.m. UTC | #7
Hello,

On Sat, 29 Oct 2022, Ziyang Xuan (William) wrote:

> > On Wed, 26 Oct 2022, Ziyang Xuan wrote:
> > 
> >> We found a problem that source address doesn't match with selected gateway
> >> under multiple default gateways. The reproducer is as following:
> >>
> >> Setup in client as following:
> >>
> >> $ ip link add link eth2 dev eth2.71 type vlan id 71
> >> $ ip link add link eth2 dev eth2.72 type vlan id 72
> >> $ ip addr add 192.168.71.41/24 dev eth2.71
> >> $ ip addr add 192.168.72.41/24 dev eth2.72
> >> $ ip link set eth2.71 up
> >> $ ip link set eth2.72 up
> >> $ route add -net default gw 192.168.71.1 dev eth2.71
> >> $ route add -net default gw 192.168.72.1 dev eth2.72
> > 
> > 	Second route goes to first position due to the
> > prepend operation for the route add command. That is
> > why 192.168.72.41 is selected.
> > 
> >> Add a nameserver configuration in the following file:
> >> $ cat /etc/resolv.conf
> >> nameserver 8.8.8.8
> >>
> >> Setup in peer server as following:
> >>
> >> $ ip link add link eth2 dev eth2.71 type vlan id 71
> >> $ ip link add link eth2 dev eth2.72 type vlan id 72
> >> $ ip addr add 192.168.71.1/24 dev eth2.71
> >> $ ip addr add 192.168.72.1/24 dev eth2.72
> >> $ ip link set eth2.71 up
> >> $ ip link set eth2.72 up
> >>
> >> Use the following command trigger DNS packet in client:
> >> $ ping www.baidu.com
> >>
> >> Capture packets with tcpdump in client when ping:
> >> $ tcpdump -i eth2 -vne
> >> ...
> >> 20:30:22.996044 52:54:00:20:23:a9 > 52:54:00:d2:4f:e3, ethertype 802.1Q (0x8100), length 77: vlan 71, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 25407, offset 0, flags [DF], proto UDP (17), length 59)
> >>     192.168.72.41.42666 > 8.8.8.8.domain: 58562+ A? www.baidu.com. (31)
> >> ...
> >>
> >> We get the problem that IPv4 saddr "192.168.72.41" do not match with
> >> selected VLAN device "eth2.71".
> > 
> > 	The problem could be that source address is selected
> > once and later used as source address in following routing lookups.
> > 
> > 	And your routing rules do not express the restriction that
> > both addresses can not be used for specific route. If you have
> > such restriction which is common, you should use source-specific routes:
> 
> Hi Julian,
> 
> Thank you for your quick reply.
> 
> Can we make some work to make the restriction "both addresses can not be used for specific route" in code but in consciousness?

	No good idea about this. Not sure if the new 'nexthop'
object is suitable for such policy.

Regards

--
Julian Anastasov <ja@ssi.bg>
diff mbox series

Patch

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index e9a7f70a54df..8bd94875a009 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -2046,6 +2046,7 @@  static void fib_select_default(const struct flowi4 *flp, struct fib_result *res)
 	int order = -1, last_idx = -1;
 	struct fib_alias *fa, *fa1 = NULL;
 	u32 last_prio = res->fi->fib_priority;
+	u8 prefix, max_prefix = 0;
 	dscp_t last_dscp = 0;
 
 	hlist_for_each_entry_rcu(fa, fa_head, fa_list) {
@@ -2078,6 +2079,11 @@  static void fib_select_default(const struct flowi4 *flp, struct fib_result *res)
 		if (!nhc->nhc_gw_family || nhc->nhc_scope != RT_SCOPE_LINK)
 			continue;
 
+		prefix = __ffs(flp->saddr ^ nhc->nhc_gw.ipv4);
+		if (prefix < max_prefix)
+			continue;
+		max_prefix = max_t(u8, prefix, max_prefix);
+
 		fib_alias_accessed(fa);
 
 		if (!fi) {