mbox series

[net-next,v2,0/6] net: nexthop: Increase weight to u16

Message ID cover.1723036486.git.petrm@nvidia.com (mailing list archive)
Headers show
Series net: nexthop: Increase weight to u16 | expand

Message

Petr Machata Aug. 7, 2024, 2:13 p.m. UTC
In CLOS networks, as link failures occur at various points in the network,
ECMP weights of the involved nodes are adjusted to compensate. With high
fan-out of the involved nodes, and overall high number of nodes,
a (non-)ECMP weight ratio that we would like to configure does not fit into
8 bits. Instead of, say, 255:254, we might like to configure something like
1000:999. For these deployments, the 8-bit weight may not be enough.

To that end, in this patchset increase the next hop weight from u8 to u16.

Patch #1 adds a flag that indicates whether the reserved fields are zeroed.
This is a follow-up to a new fix merged in commit 6d745cd0e972 ("net:
nexthop: Initialize all fields in dumped nexthops"). The theory behind this
patch is that there is a strict ordering between the fields actually being
zeroed, the kernel declaring that they are, and the kernel repurposing the
fields. Thus clients can use the flag to tell if it is safe to interpret
the reserved fields in any way.

Patch #2 contains the substantial code and the commit message covers the
details of the changes.

Patches #3 to #6 add selftests.

v2:
- Patch #1:
    - Move the new OP_FLAG to bit 31 to make in/out confusion less likely
    - Add a comment to the flag
- Patch #2:
    - s/endianes/endianness/

Petr Machata (6):
  net: nexthop: Add flag to assert that NHGRP reserved fields are zero
  net: nexthop: Increase weight to u16
  selftests: router_mpath: Sleep after MZ
  selftests: router_mpath_nh: Test 16-bit next hop weights
  selftests: router_mpath_nh_res: Test 16-bit next hop weights
  selftests: fib_nexthops: Test 16-bit next hop weights

 include/net/nexthop.h                         |  4 +-
 include/uapi/linux/nexthop.h                  | 10 +++-
 net/ipv4/nexthop.c                            | 49 ++++++++++------
 tools/testing/selftests/net/fib_nexthops.sh   | 55 +++++++++++++++++-
 tools/testing/selftests/net/forwarding/lib.sh |  7 +++
 .../net/forwarding/router_mpath_nh.sh         | 40 ++++++++++---
 .../net/forwarding/router_mpath_nh_lib.sh     | 13 +++++
 .../net/forwarding/router_mpath_nh_res.sh     | 58 ++++++++++++++++---
 .../net/forwarding/router_multipath.sh        |  2 +
 9 files changed, 201 insertions(+), 37 deletions(-)

Comments

Jakub Kicinski Aug. 8, 2024, 1:28 p.m. UTC | #1
On Wed, 7 Aug 2024 16:13:45 +0200 Petr Machata wrote:
> In CLOS networks, as link failures occur at various points in the network,
> ECMP weights of the involved nodes are adjusted to compensate. With high
> fan-out of the involved nodes, and overall high number of nodes,
> a (non-)ECMP weight ratio that we would like to configure does not fit into
> 8 bits. Instead of, say, 255:254, we might like to configure something like
> 1000:999. For these deployments, the 8-bit weight may not be enough.
> 
> To that end, in this patchset increase the next hop weight from u8 to u16.
> 
> Patch #1 adds a flag that indicates whether the reserved fields are zeroed.
> This is a follow-up to a new fix merged in commit 6d745cd0e972 ("net:
> nexthop: Initialize all fields in dumped nexthops"). The theory behind this
> patch is that there is a strict ordering between the fields actually being
> zeroed, the kernel declaring that they are, and the kernel repurposing the
> fields. Thus clients can use the flag to tell if it is safe to interpret
> the reserved fields in any way.
> 
> Patch #2 contains the substantial code and the commit message covers the
> details of the changes.
> 
> Patches #3 to #6 add selftests.

I did update iproute2 to the branch you sent me last time, but tests
are not happy:

# IPv6 groups functional
# ----------------------
# TEST: Create nexthop group with single nexthop                      [ OK ]
# TEST: Get nexthop group by id                                       [ OK ]
# TEST: Delete nexthop group by id                                    [ OK ]
# TEST: Nexthop group with multiple nexthops                          [ OK ]
# TEST: Nexthop group updated when entry is deleted                   [ OK ]
# TEST: Nexthop group with weighted nexthops                          [ OK ]
# TEST: Weighted nexthop group updated when entry is deleted          [ OK ]
# TEST: Nexthops in groups removed on admin down                      [ OK ]
# TEST: Multiple groups with same nexthop                             [ OK ]
# TEST: Nexthops in group removed on admin down - mixed group         [ OK ]
# TEST: Nexthop group can not have a group as an entry                [ OK ]
# TEST: Nexthop group with a blackhole entry                          [ OK ]
# TEST: Nexthop group can not have a blackhole and another nexthop    [ OK ]
# TEST: Nexthop group replace refcounts                               [ OK ]
#       WARNING: Unexpected route entry
# TEST: 16-bit weights                                                [FAIL]
# 
# IPv6 resilient groups functional
# --------------------------------
# TEST: Nexthop group updated when entry is deleted                   [ OK ]
# TEST: Nexthop buckets updated when entry is deleted                 [ OK ]
# TEST: Nexthop group updated after replace                           [ OK ]
# TEST: Nexthop buckets updated after replace                         [ OK ]
# TEST: Nexthop group updated when entry is deleted - nECMP           [ OK ]
# TEST: Nexthop buckets updated when entry is deleted - nECMP         [ OK ]
# TEST: Nexthop group updated after replace - nECMP                   [ OK ]
# TEST: Nexthop buckets updated after replace - nECMP                 [ OK ]
#       WARNING: Unexpected route entry
# TEST: 16-bit weights                                                [FAIL]

https://netdev-3.bots.linux.dev/vmksft-net/results/718641/2-fib-nexthops-sh/stdout
Petr Machata Aug. 9, 2024, 9:48 a.m. UTC | #2
Jakub Kicinski <kuba@kernel.org> writes:

> On Wed, 7 Aug 2024 16:13:45 +0200 Petr Machata wrote:
>> In CLOS networks, as link failures occur at various points in the network,
>> ECMP weights of the involved nodes are adjusted to compensate. With high
>> fan-out of the involved nodes, and overall high number of nodes,
>> a (non-)ECMP weight ratio that we would like to configure does not fit into
>> 8 bits. Instead of, say, 255:254, we might like to configure something like
>> 1000:999. For these deployments, the 8-bit weight may not be enough.
>> 
>> To that end, in this patchset increase the next hop weight from u8 to u16.
>> 
>> Patch #1 adds a flag that indicates whether the reserved fields are zeroed.
>> This is a follow-up to a new fix merged in commit 6d745cd0e972 ("net:
>> nexthop: Initialize all fields in dumped nexthops"). The theory behind this
>> patch is that there is a strict ordering between the fields actually being
>> zeroed, the kernel declaring that they are, and the kernel repurposing the
>> fields. Thus clients can use the flag to tell if it is safe to interpret
>> the reserved fields in any way.
>> 
>> Patch #2 contains the substantial code and the commit message covers the
>> details of the changes.
>> 
>> Patches #3 to #6 add selftests.
>
> I did update iproute2 to the branch you sent me last time, but tests
> are not happy:
>
> # IPv6 groups functional
> # ----------------------
> # TEST: Create nexthop group with single nexthop                      [ OK ]
> # TEST: Get nexthop group by id                                       [ OK ]
> # TEST: Delete nexthop group by id                                    [ OK ]
> # TEST: Nexthop group with multiple nexthops                          [ OK ]
> # TEST: Nexthop group updated when entry is deleted                   [ OK ]
> # TEST: Nexthop group with weighted nexthops                          [ OK ]
> # TEST: Weighted nexthop group updated when entry is deleted          [ OK ]
> # TEST: Nexthops in groups removed on admin down                      [ OK ]
> # TEST: Multiple groups with same nexthop                             [ OK ]
> # TEST: Nexthops in group removed on admin down - mixed group         [ OK ]
> # TEST: Nexthop group can not have a group as an entry                [ OK ]
> # TEST: Nexthop group with a blackhole entry                          [ OK ]
> # TEST: Nexthop group can not have a blackhole and another nexthop    [ OK ]
> # TEST: Nexthop group replace refcounts                               [ OK ]
> #       WARNING: Unexpected route entry
> # TEST: 16-bit weights                                                [FAIL]
> # 
> # IPv6 resilient groups functional
> # --------------------------------
> # TEST: Nexthop group updated when entry is deleted                   [ OK ]
> # TEST: Nexthop buckets updated when entry is deleted                 [ OK ]
> # TEST: Nexthop group updated after replace                           [ OK ]
> # TEST: Nexthop buckets updated after replace                         [ OK ]
> # TEST: Nexthop group updated when entry is deleted - nECMP           [ OK ]
> # TEST: Nexthop buckets updated when entry is deleted - nECMP         [ OK ]
> # TEST: Nexthop group updated after replace - nECMP                   [ OK ]
> # TEST: Nexthop buckets updated after replace - nECMP                 [ OK ]
> #       WARNING: Unexpected route entry
> # TEST: 16-bit weights                                                [FAIL]
>
> https://netdev-3.bots.linux.dev/vmksft-net/results/718641/2-fib-nexthops-sh/stdout

This failure mode is consistent with non-updated iproute2. I only pushed
to the iproute2 repository after having sent the kernel patches, so I
think you or your automation have picked up the old version. Can you try
again, please? I retested on my end and it still works.
Jakub Kicinski Aug. 10, 2024, 4:09 a.m. UTC | #3
On Fri, 9 Aug 2024 11:48:09 +0200 Petr Machata wrote:
> This failure mode is consistent with non-updated iproute2. I only pushed
> to the iproute2 repository after having sent the kernel patches, so I
> think you or your automation have picked up the old version. Can you try
> again, please? I retested on my end and it still works.

Updated now. iproute2 is at:

commit cefc088ed1b96a2c245b60043b52f31cd89c2946 (HEAD -> main)
Merge: 354d8a36 54da1517
Author: Your Name
Date:   Fri Aug 9 21:06:12 2024 -0700

    Merge branch 'nhgw16' of https://github.com/pmachata/iproute2


Let's see what happens, I haven't rebuilt the image in a while
maybe I'm doing it wrong.
Petr Machata Aug. 12, 2024, 10:09 a.m. UTC | #4
Jakub Kicinski <kuba@kernel.org> writes:

> On Fri, 9 Aug 2024 11:48:09 +0200 Petr Machata wrote:
>> This failure mode is consistent with non-updated iproute2. I only pushed
>> to the iproute2 repository after having sent the kernel patches, so I
>> think you or your automation have picked up the old version. Can you try
>> again, please? I retested on my end and it still works.
>
> Updated now. iproute2 is at:
>
> commit cefc088ed1b96a2c245b60043b52f31cd89c2946 (HEAD -> main)
> Merge: 354d8a36 54da1517
> Author: Your Name
> Date:   Fri Aug 9 21:06:12 2024 -0700
>
>     Merge branch 'nhgw16' of https://github.com/pmachata/iproute2
>
>
> Let's see what happens, I haven't rebuilt the image in a while
> maybe I'm doing it wrong.

Looking good:

https://netdev-3.bots.linux.dev/vmksft-net/results/724221/2-fib-nexthops-sh/stdout
patchwork-bot+netdevbpf@kernel.org Aug. 13, 2024, 1 a.m. UTC | #5
Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 7 Aug 2024 16:13:45 +0200 you wrote:
> In CLOS networks, as link failures occur at various points in the network,
> ECMP weights of the involved nodes are adjusted to compensate. With high
> fan-out of the involved nodes, and overall high number of nodes,
> a (non-)ECMP weight ratio that we would like to configure does not fit into
> 8 bits. Instead of, say, 255:254, we might like to configure something like
> 1000:999. For these deployments, the 8-bit weight may not be enough.
> 
> [...]

Here is the summary with links:
  - [net-next,v2,1/6] net: nexthop: Add flag to assert that NHGRP reserved fields are zero
    https://git.kernel.org/netdev/net-next/c/75bab45e6b2d
  - [net-next,v2,2/6] net: nexthop: Increase weight to u16
    https://git.kernel.org/netdev/net-next/c/b72a6a7ab957
  - [net-next,v2,3/6] selftests: router_mpath: Sleep after MZ
    https://git.kernel.org/netdev/net-next/c/110d3ffe9d2b
  - [net-next,v2,4/6] selftests: router_mpath_nh: Test 16-bit next hop weights
    https://git.kernel.org/netdev/net-next/c/bb89fdacf99c
  - [net-next,v2,5/6] selftests: router_mpath_nh_res: Test 16-bit next hop weights
    https://git.kernel.org/netdev/net-next/c/dce0765c1d5b
  - [net-next,v2,6/6] selftests: fib_nexthops: Test 16-bit next hop weights
    https://git.kernel.org/netdev/net-next/c/4b808f447332

You are awesome, thank you!