
[net-next,2/2] datagram, udp: Set local address and rehash socket atomically against lookup

Message ID 20241204221254.3537932-3-sbrivio@redhat.com (mailing list archive)
State Rejected
Delegated to: Netdev Maintainers
Series Fix race between datagram socket address change and rehash

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 17 this patch: 17
netdev/build_tools success Errors and warnings before: 0 (+0) this patch: 0 (+0)
netdev/cc_maintainers warning 4 maintainers not CCed: kuba@kernel.org pabeni@redhat.com horms@kernel.org dsahern@kernel.org
netdev/build_clang success Errors and warnings before: 29 this patch: 29
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn fail Errors and warnings before: 2898 this patch: 2899
netdev/checkpatch warning CHECK: Unbalanced braces around else statement WARNING: Non-standard signature: Analysed-by:
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 17 this patch: 17
netdev/source_inline success Was 0 now: 0

Commit Message

Stefano Brivio Dec. 4, 2024, 10:12 p.m. UTC
If a UDP socket changes its local address while it's receiving
datagrams, as a result of connect(), there is a period during which
a lookup operation might fail to find it, after the address is changed
but before the secondary hash (port and address) and the four-tuple
hash (local and remote ports and addresses) are updated.

Secondary hash chains were introduced by commit 30fff9231fad ("udp:
bind() optimisation") and, as a result, a rehash operation became
needed to make a bound socket reachable again after a connect().

This operation was introduced by commit 719f835853a9 ("udp: add
rehash on connect()") which isn't however a complete fix: the
socket will be found once the rehashing completes, but not while
it's pending.

This is noticeable with a socat(1) server in UDP4-LISTEN mode, and a
client sending datagrams to it. After the server receives the first
datagram (cf. _xioopen_ipdgram_listen()), it issues a connect() to
the address of the sender, in order to set up a directed flow.

Now, if the client, running on a different CPU thread, happens to
send a (subsequent) datagram while the server's socket changes its
address, but is not rehashed yet, this will result in a failed
lookup and a port unreachable error delivered to the client, as
apparent from the following reproducer:

  LEN=$(($(cat /proc/sys/net/core/wmem_default) / 4))
  dd if=/dev/urandom bs=1 count=${LEN} of=tmp.in

  while :; do
  	taskset -c 1 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
  	sleep 0.1 || sleep 1
  	taskset -c 2 socat OPEN:tmp.in UDP4:localhost:1337,shut-null
  	wait
  done

where the client will eventually get ECONNREFUSED on a write()
(typically the second or third one of a given iteration):

  2024/11/13 21:28:23 socat[46901] E write(6, 0x556db2e3c000, 8192): Connection refused

This issue was first observed as an occasional failure in Podman's tests
checking UDP functionality while using pasta(1) to connect the
container's network namespace, which leads us to a reproducer with
the lookup error resulting in an ICMP packet on a tap device:

  LOCAL_ADDR="$(ip -j -4 addr show|jq -rM '.[] | .addr_info[0] | select(.scope == "global").local')"

  while :; do
  	./pasta --config-net -p pasta.pcap -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
  	sleep 0.2 || sleep 1
  	socat OPEN:tmp.in UDP4:${LOCAL_ADDR}:1337,shut-null
  	wait
  	cmp tmp.in tmp.out
  done

Once this fails:

  tmp.in tmp.out differ: char 8193, line 29

we can finally have a look at what's going on:

  $ tshark -r pasta.pcap
      1   0.000000           :: → ff02::16     ICMPv6 110 Multicast Listener Report Message v2
      2   0.168690 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
      3   0.168767 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
      4   0.168806 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
      5   0.168827 c6:47:05:8d:dc:04 → Broadcast    ARP 42 Who has 88.198.0.161? Tell 88.198.0.164
      6   0.168851 9a:55:9a:55:9a:55 → c6:47:05:8d:dc:04 ARP 42 88.198.0.161 is at 9a:55:9a:55:9a:55
      7   0.168875 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
      8   0.168896 88.198.0.164 → 88.198.0.161 ICMP 590 Destination unreachable (Port unreachable)
      9   0.168926 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
     10   0.168959 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
     11   0.168989 88.198.0.161 → 88.198.0.164 UDP 4138 60260 → 1337 Len=4096
     12   0.169010 88.198.0.161 → 88.198.0.164 UDP 42 60260 → 1337 Len=0

On the third datagram received, the network namespace of the container
initiates an ARP lookup to deliver the ICMP message.

In another variant of this reproducer, starting the client with:

  strace -f pasta --config-net -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc 2>strace.log &

and connecting to the socat server using a loopback address:

  socat OPEN:tmp.in UDP4:localhost:1337,shut-null

we can more clearly observe a sendmmsg() call failing after the
first datagram is delivered:

  [pid 278012] connect(173, 0x7fff96c95fc0, 16) = 0
  [...]
  [pid 278012] recvmmsg(173, 0x7fff96c96020, 1024, MSG_DONTWAIT, NULL) = -1 EAGAIN (Resource temporarily unavailable)
  [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = 1
  [...]
  [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = -1 ECONNREFUSED (Connection refused)

and, somewhat confusingly, after a connect() on the same socket
succeeded.

To fix this, replace the rehash operation by a set_rcv_saddr()
callback holding the spinlock on the primary hash chain, just like
the rehash operation used to do, but also setting the address (via
inet_update_saddr(), moved to headers) while holding the spinlock.

To make this atomic against the lookup operation, also acquire the
spinlock on the primary chain there.

This results in some awkwardness at a caller site, specifically
sock_bindtoindex_locked(), where we really just need to rehash the
socket without changing its address. With the new operation, we now
need to forcibly set the current address again.

On the other hand, this appears more elegant than alternatives such
as fetching the spinlock reference in ip4_datagram_connect() and
ip6_datagram_connect(), and keeping the rehash operation around for
a single user also seems a tad overkill.

v1:
  - fix build with CONFIG_IPV6=n: add ifdef around sk_v6_rcv_saddr
    usage (Kuniyuki Iwashima)
  - directly use sk_rcv_saddr for IPv4 receive addresses instead of
    fetching inet_rcv_saddr (Kuniyuki Iwashima)
  - move inet_update_saddr() to inet_hashtables.h and use that
    to set IPv4/IPv6 addresses as suitable (Kuniyuki Iwashima)
  - rebase onto net-next, update commit message accordingly

Reported-by: Ed Santiago <santiago@redhat.com>
Link: https://github.com/containers/podman/issues/24147
Analysed-by: David Gibson <david@gibson.dropbear.id.au>
Fixes: 30fff9231fad ("udp: bind() optimisation")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 include/net/inet_hashtables.h | 13 ++++++
 include/net/sock.h            |  2 +-
 include/net/udp.h             |  3 +-
 net/core/sock.c               | 12 ++++-
 net/ipv4/datagram.c           |  7 +--
 net/ipv4/inet_hashtables.c    | 13 ------
 net/ipv4/udp.c                | 84 +++++++++++++++++++++++------------
 net/ipv4/udp_impl.h           |  2 +-
 net/ipv4/udplite.c            |  2 +-
 net/ipv6/datagram.c           | 30 +++++++++----
 net/ipv6/udp.c                | 31 +++++++------
 net/ipv6/udp_impl.h           |  2 +-
 net/ipv6/udplite.c            |  2 +-
 13 files changed, 130 insertions(+), 73 deletions(-)

Comments

Paolo Abeni Dec. 5, 2024, 9:30 a.m. UTC | #1
Hi,

On 12/4/24 23:12, Stefano Brivio wrote:
> If a UDP socket changes its local address while it's receiving
> datagrams, as a result of connect(), there is a period during which
> a lookup operation might fail to find it, after the address is changed
> but before the secondary hash (port and address) and the four-tuple
> hash (local and remote ports and addresses) are updated.
> 
> Secondary hash chains were introduced by commit 30fff9231fad ("udp:
> bind() optimisation") and, as a result, a rehash operation became
> needed to make a bound socket reachable again after a connect().
> 
> This operation was introduced by commit 719f835853a9 ("udp: add
> rehash on connect()") which isn't however a complete fix: the
> socket will be found once the rehashing completes, but not while
> it's pending.
> 
> This is noticeable with a socat(1) server in UDP4-LISTEN mode, and a
> client sending datagrams to it. After the server receives the first
> datagram (cf. _xioopen_ipdgram_listen()), it issues a connect() to
> the address of the sender, in order to set up a directed flow.
> 
> Now, if the client, running on a different CPU thread, happens to
> send a (subsequent) datagram while the server's socket changes its
> address, but is not rehashed yet, this will result in a failed
> lookup and a port unreachable error delivered to the client, as
> apparent from the following reproducer:
> 
>   LEN=$(($(cat /proc/sys/net/core/wmem_default) / 4))
>   dd if=/dev/urandom bs=1 count=${LEN} of=tmp.in
> 
>   while :; do
>   	taskset -c 1 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
>   	sleep 0.1 || sleep 1
>   	taskset -c 2 socat OPEN:tmp.in UDP4:localhost:1337,shut-null
>   	wait
>   done
> 
> where the client will eventually get ECONNREFUSED on a write()
> (typically the second or third one of a given iteration):
> 
>   2024/11/13 21:28:23 socat[46901] E write(6, 0x556db2e3c000, 8192): Connection refused
> 
> This issue was first observed as an occasional failure in Podman's tests
> checking UDP functionality while using pasta(1) to connect the
> container's network namespace, which leads us to a reproducer with
> the lookup error resulting in an ICMP packet on a tap device:
> 
>   LOCAL_ADDR="$(ip -j -4 addr show|jq -rM '.[] | .addr_info[0] | select(.scope == "global").local')"
> 
>   while :; do
>   	./pasta --config-net -p pasta.pcap -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
>   	sleep 0.2 || sleep 1
>   	socat OPEN:tmp.in UDP4:${LOCAL_ADDR}:1337,shut-null
>   	wait
>   	cmp tmp.in tmp.out
>   done
> 
> Once this fails:
> 
>   tmp.in tmp.out differ: char 8193, line 29
> 
> we can finally have a look at what's going on:
> 
>   $ tshark -r pasta.pcap
>       1   0.000000           :: → ff02::16     ICMPv6 110 Multicast Listener Report Message v2
>       2   0.168690 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
>       3   0.168767 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
>       4   0.168806 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
>       5   0.168827 c6:47:05:8d:dc:04 → Broadcast    ARP 42 Who has 88.198.0.161? Tell 88.198.0.164
>       6   0.168851 9a:55:9a:55:9a:55 → c6:47:05:8d:dc:04 ARP 42 88.198.0.161 is at 9a:55:9a:55:9a:55
>       7   0.168875 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
>       8   0.168896 88.198.0.164 → 88.198.0.161 ICMP 590 Destination unreachable (Port unreachable)
>       9   0.168926 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
>      10   0.168959 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
>      11   0.168989 88.198.0.161 → 88.198.0.164 UDP 4138 60260 → 1337 Len=4096
>      12   0.169010 88.198.0.161 → 88.198.0.164 UDP 42 60260 → 1337 Len=0
> 
> On the third datagram received, the network namespace of the container
> initiates an ARP lookup to deliver the ICMP message.
> 
> In another variant of this reproducer, starting the client with:
> 
>   strace -f pasta --config-net -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc 2>strace.log &
> 
> and connecting to the socat server using a loopback address:
> 
>   socat OPEN:tmp.in UDP4:localhost:1337,shut-null
> 
> we can more clearly observe a sendmmsg() call failing after the
> first datagram is delivered:
> 
>   [pid 278012] connect(173, 0x7fff96c95fc0, 16) = 0
>   [...]
>   [pid 278012] recvmmsg(173, 0x7fff96c96020, 1024, MSG_DONTWAIT, NULL) = -1 EAGAIN (Resource temporarily unavailable)
>   [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = 1
>   [...]
>   [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = -1 ECONNREFUSED (Connection refused)
> 
> and, somewhat confusingly, after a connect() on the same socket
> succeeded.
> 
> To fix this, replace the rehash operation by a set_rcv_saddr()
> callback holding the spinlock on the primary hash chain, just like
> the rehash operation used to do, but also setting the address (via
> inet_update_saddr(), moved to headers) while holding the spinlock.
> 
> To make this atomic against the lookup operation, also acquire the
> spinlock on the primary chain there.

I'm sorry for the late feedback.

I'm concerned by the unconditional spinlock in __udp4_lib_lookup(). I
fear it could cause performance regressions in different workloads:
heavy UDP unicast flow, or even TCP over UDP tunnel when the NIC
supports RX offload for the relevant UDP tunnel protocol.

In the first case there will be an additional atomic operation per packet.

In the latter the spin_lock will be contended with multiple concurrent
TCP over UDP tunnel flows: the NIC with UDP tunnel offload can use the
inner header to compute the RX hash, and use different rx queues for
such flows.

The GRO stage will perform UDP tunnel socket lookup and will contend the
bucket lock.

> This results in some awkwardness at a caller site, specifically
> sock_bindtoindex_locked(), where we really just need to rehash the
> socket without changing its address. With the new operation, we now
> need to forcibly set the current address again.
> 
> On the other hand, this appears more elegant than alternatives such
> as fetching the spinlock reference in ip4_datagram_connect() and
> ip6_datagram_connect(), and keeping the rehash operation around for
> a single user also seems a tad overkill.

Would such option require the same additional lock at lookup time?

Thanks,

Paolo
Stefano Brivio Dec. 5, 2024, 3:58 p.m. UTC | #2
On Thu, 5 Dec 2024 10:30:14 +0100
Paolo Abeni <pabeni@redhat.com> wrote:

> On 12/4/24 23:12, Stefano Brivio wrote:
>
> > [...]
> > 
> > To fix this, replace the rehash operation by a set_rcv_saddr()
> > callback holding the spinlock on the primary hash chain, just like
> > the rehash operation used to do, but also setting the address (via
> > inet_update_saddr(), moved to headers) while holding the spinlock.
> > 
> > To make this atomic against the lookup operation, also acquire the
> > spinlock on the primary chain there.  
> 
> I'm sorry for the late feedback.
> 
> I'm concerned by the unconditional spinlock in __udp4_lib_lookup(). I
> fear it could cause performance regressions in different workloads:
> heavy UDP unicast flow, or even TCP over UDP tunnel when the NIC
> supports RX offload for the relevant UDP tunnel protocol.
> 
> In the first case there will be an additional atomic operation per packet.

So, I've been looking into this a bit, and request-response rates with
neper's udp_rr (https://github.com/google/neper/blob/master/udp_rr.c)
for a client/server pair via loopback interface are the same before and
after this patch.

The reason is, I suppose, that the only contention on that spinlock is
the "intended" one, that is, between connect() and lookup.

Then I moved on to bulk flows, with socat or iperf3. But there (and
that's the whole point of this fix) we have connected sockets, and once
they are connected, we switch to early demux, which is not affected by
this patch.

In the end, I don't think this will affect "regular", bulk unicast
flows, because applications using them will typically connect sockets,
and we'll switch to early demux right away.

This lookup is not exactly "slow path", but it's not fast path either.

> In the latter the spin_lock will be contended with multiple concurrent
> TCP over UDP tunnel flows: the NIC with UDP tunnel offload can use the
> inner header to compute the RX hash, and use different rx queues for
> such flows.
> 
> The GRO stage will perform UDP tunnel socket lookup and will contend the
> bucket lock.

In this case (I couldn't find out yet), aren't sockets connected? I
would expect that we switch to the early demux path relatively soon for
anything that needs to have somehow high throughput.

And if we don't, probably the more reasonable alternative would be to
"fix" that, rather than keeping this relatively common case broken.

Do you have a benchmark or something I can run?

> > This results in some awkwardness at a caller site, specifically
> > sock_bindtoindex_locked(), where we really just need to rehash the
> > socket without changing its address. With the new operation, we now
> > need to forcibly set the current address again.
> > 
> > On the other hand, this appears more elegant than alternatives such
> > as fetching the spinlock reference in ip4_datagram_connect() and
> > ip6_datagram_connect(), and keeping the rehash operation around for
> > a single user also seems a tad overkill.  
> 
> Would such option require the same additional lock at lookup time?

Yes, it's conceptually the same, we would pretty much just move code
around.

I've been thinking about possible alternatives but they all involve a
much bigger rework. One idea could be that we RCU-connect() sockets,
instead of just having the hash table insertion under RCU. That is, as
long as we're in the grace period, the lookup would still see the old
receive address.

But, especially now that we have *three* hash tables, this is extremely
involved, and perhaps would warrant a rewrite of the whole thing. Given
that we're currently breaking users, I'd rather fix this first.

Sure, things have been broken for 19 years so I guess it's okay to defer
this fix to net-next (see discussion around the RFC), but I'd still
suggest that we fix this as a first step, because the breakage is
embarrassingly obvious (see reproducers).
Eric Dumazet Dec. 5, 2024, 4:35 p.m. UTC | #3
On Wed, Dec 4, 2024 at 11:12 PM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> If a UDP socket changes its local address while it's receiving
> datagrams, as a result of connect(), there is a period during which
> a lookup operation might fail to find it, after the address is changed
> but before the secondary hash (port and address) and the four-tuple
> hash (local and remote ports and addresses) are updated.
>
> Secondary hash chains were introduced by commit 30fff9231fad ("udp:
> bind() optimisation") and, as a result, a rehash operation became
> needed to make a bound socket reachable again after a connect().
>
> This operation was introduced by commit 719f835853a9 ("udp: add
> rehash on connect()") which isn't however a complete fix: the
> socket will be found once the rehashing completes, but not while
> it's pending.
>
> This is noticeable with a socat(1) server in UDP4-LISTEN mode, and a
> client sending datagrams to it. After the server receives the first
> datagram (cf. _xioopen_ipdgram_listen()), it issues a connect() to
> the address of the sender, in order to set up a directed flow.
>
> Now, if the client, running on a different CPU thread, happens to
> send a (subsequent) datagram while the server's socket changes its
> address, but is not rehashed yet, this will result in a failed
> lookup and a port unreachable error delivered to the client, as
> apparent from the following reproducer:
>
>   LEN=$(($(cat /proc/sys/net/core/wmem_default) / 4))
>   dd if=/dev/urandom bs=1 count=${LEN} of=tmp.in
>
>   while :; do
>         taskset -c 1 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
>         sleep 0.1 || sleep 1
>         taskset -c 2 socat OPEN:tmp.in UDP4:localhost:1337,shut-null
>         wait
>   done
>
> where the client will eventually get ECONNREFUSED on a write()
> (typically the second or third one of a given iteration):
>
>   2024/11/13 21:28:23 socat[46901] E write(6, 0x556db2e3c000, 8192): Connection refused
>
> This issue was first observed as an occasional failure in Podman's tests
> checking UDP functionality while using pasta(1) to connect the
> container's network namespace, which leads us to a reproducer with
> the lookup error resulting in an ICMP packet on a tap device:
>
>   LOCAL_ADDR="$(ip -j -4 addr show|jq -rM '.[] | .addr_info[0] | select(.scope == "global").local')"
>
>   while :; do
>         ./pasta --config-net -p pasta.pcap -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
>         sleep 0.2 || sleep 1
>         socat OPEN:tmp.in UDP4:${LOCAL_ADDR}:1337,shut-null
>         wait
>         cmp tmp.in tmp.out
>   done
>
> Once this fails:
>
>   tmp.in tmp.out differ: char 8193, line 29
>
> we can finally have a look at what's going on:
>
>   $ tshark -r pasta.pcap
>       1   0.000000           :: → ff02::16     ICMPv6 110 Multicast Listener Report Message v2
>       2   0.168690 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
>       3   0.168767 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
>       4   0.168806 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
>       5   0.168827 c6:47:05:8d:dc:04 → Broadcast    ARP 42 Who has 88.198.0.161? Tell 88.198.0.164
>       6   0.168851 9a:55:9a:55:9a:55 → c6:47:05:8d:dc:04 ARP 42 88.198.0.161 is at 9a:55:9a:55:9a:55
>       7   0.168875 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
>       8   0.168896 88.198.0.164 → 88.198.0.161 ICMP 590 Destination unreachable (Port unreachable)
>       9   0.168926 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
>      10   0.168959 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
>      11   0.168989 88.198.0.161 → 88.198.0.164 UDP 4138 60260 → 1337 Len=4096
>      12   0.169010 88.198.0.161 → 88.198.0.164 UDP 42 60260 → 1337 Len=0
>
> On the third datagram received, the network namespace of the container
> initiates an ARP lookup to deliver the ICMP message.
>
> In another variant of this reproducer, starting the client with:
>
>   strace -f pasta --config-net -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc 2>strace.log &
>
> and connecting to the socat server using a loopback address:
>
>   socat OPEN:tmp.in UDP4:localhost:1337,shut-null
>
> we can more clearly observe a sendmmsg() call failing after the
> first datagram is delivered:
>
>   [pid 278012] connect(173, 0x7fff96c95fc0, 16) = 0
>   [...]
>   [pid 278012] recvmmsg(173, 0x7fff96c96020, 1024, MSG_DONTWAIT, NULL) = -1 EAGAIN (Resource temporarily unavailable)
>   [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = 1
>   [...]
>   [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = -1 ECONNREFUSED (Connection refused)
>
> and, somewhat confusingly, after a connect() on the same socket
> succeeded.
>
> To fix this, replace the rehash operation by a set_rcv_saddr()
> callback holding the spinlock on the primary hash chain, just like
> the rehash operation used to do, but also setting the address (via
> inet_update_saddr(), moved to headers) while holding the spinlock.
>
> To make this atomic against the lookup operation, also acquire the
> spinlock on the primary chain there.
>
> This results in some awkwardness at a caller site, specifically
> sock_bindtoindex_locked(), where we really just need to rehash the
> socket without changing its address. With the new operation, we now
> need to forcibly set the current address again.
>
> On the other hand, this appears more elegant than alternatives such
> as fetching the spinlock reference in ip4_datagram_connect() and
> ip6_datagram_connect(), and keeping the rehash operation around for
> a single user also seems a tad overkill.
>
> v1:
>   - fix build with CONFIG_IPV6=n: add ifdef around sk_v6_rcv_saddr
>     usage (Kuniyuki Iwashima)
>   - directly use sk_rcv_saddr for IPv4 receive addresses instead of
>     fetching inet_rcv_saddr (Kuniyuki Iwashima)
>   - move inet_update_saddr() to inet_hashtables.h and use that
>     to set IPv4/IPv6 addresses as suitable (Kuniyuki Iwashima)
>   - rebase onto net-next, update commit message accordingly
>
> Reported-by: Ed Santiago <santiago@redhat.com>
> Link: https://github.com/containers/podman/issues/24147
> Analysed-by: David Gibson <david@gibson.dropbear.id.au>
> Fixes: 30fff9231fad ("udp: bind() optimisation")
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
>  include/net/inet_hashtables.h | 13 ++++++
>  include/net/sock.h            |  2 +-
>  include/net/udp.h             |  3 +-
>  net/core/sock.c               | 12 ++++-
>  net/ipv4/datagram.c           |  7 +--
>  net/ipv4/inet_hashtables.c    | 13 ------
>  net/ipv4/udp.c                | 84 +++++++++++++++++++++++------------
>  net/ipv4/udp_impl.h           |  2 +-
>  net/ipv4/udplite.c            |  2 +-
>  net/ipv6/datagram.c           | 30 +++++++++----
>  net/ipv6/udp.c                | 31 +++++++------
>  net/ipv6/udp_impl.h           |  2 +-
>  net/ipv6/udplite.c            |  2 +-
>  13 files changed, 130 insertions(+), 73 deletions(-)
>
> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> index 5eea47f135a4..7f05e7ebc2e4 100644
> --- a/include/net/inet_hashtables.h
> +++ b/include/net/inet_hashtables.h
> @@ -525,6 +525,19 @@ static inline void sk_rcv_saddr_set(struct sock *sk, __be32 addr)
>  #endif
>  }
>
> +static inline void inet_update_saddr(struct sock *sk, void *saddr, int family)
> +{
> +       if (family == AF_INET) {
> +               inet_sk(sk)->inet_saddr = *(__be32 *)saddr;
> +               sk_rcv_saddr_set(sk, inet_sk(sk)->inet_saddr);
> +       }
> +#if IS_ENABLED(CONFIG_IPV6)
> +       else {
> +               sk->sk_v6_rcv_saddr = *(struct in6_addr *)saddr;
> +       }
> +#endif
> +}
> +
>  int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>                         struct sock *sk, u64 port_offset,
>                         int (*check_established)(struct inet_timewait_death_row *,
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 7464e9f9f47c..1410036e4f5a 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1228,6 +1228,7 @@ struct proto {
>         int                     (*connect)(struct sock *sk,
>                                         struct sockaddr *uaddr,
>                                         int addr_len);
> +       void                    (*set_rcv_saddr)(struct sock *sk, void *addr);
>         int                     (*disconnect)(struct sock *sk, int flags);
>
>         struct sock *           (*accept)(struct sock *sk,
> @@ -1269,7 +1270,6 @@ struct proto {
>         /* Keeping track of sk's, looking them up, and port selection methods. */
>         int                     (*hash)(struct sock *sk);
>         void                    (*unhash)(struct sock *sk);
> -       void                    (*rehash)(struct sock *sk);
>         int                     (*get_port)(struct sock *sk, unsigned short snum);
>         void                    (*put_port)(struct sock *sk);
>  #ifdef CONFIG_BPF_SYSCALL
> diff --git a/include/net/udp.h b/include/net/udp.h
> index 6e89520e100d..8283ea5768de 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -302,7 +302,6 @@ static inline int udp_lib_hash(struct sock *sk)
>  }
>
>  void udp_lib_unhash(struct sock *sk);
> -void udp_lib_rehash(struct sock *sk, u16 new_hash, u16 new_hash4);
>  u32 udp_ehashfn(const struct net *net, const __be32 laddr, const __u16 lport,
>                 const __be32 faddr, const __be16 fport);
>
> @@ -411,6 +410,8 @@ int udp_rcv(struct sk_buff *skb);
>  int udp_ioctl(struct sock *sk, int cmd, int *karg);
>  int udp_init_sock(struct sock *sk);
>  int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
> +void udp_lib_set_rcv_saddr(struct sock *sk, void *addr, u16 hash, u16 hash4);
> +int udp_set_rcv_saddr(struct sock *sk, void *addr);
>  int __udp_disconnect(struct sock *sk, int flags);
>  int udp_disconnect(struct sock *sk, int flags);
>  __poll_t udp_poll(struct file *file, struct socket *sock, poll_table *wait);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 74729d20cd00..221c904d870d 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -641,8 +641,16 @@ static int sock_bindtoindex_locked(struct sock *sk, int ifindex)
>         /* Paired with all READ_ONCE() done locklessly. */
>         WRITE_ONCE(sk->sk_bound_dev_if, ifindex);
>
> -       if (sk->sk_prot->rehash)
> -               sk->sk_prot->rehash(sk);
> +       /* Force rehash if protocol needs it */
> +       if (sk->sk_prot->set_rcv_saddr) {
> +               if (sk->sk_family == AF_INET)
> +                       sk->sk_prot->set_rcv_saddr(sk, &sk->sk_rcv_saddr);
> +#if IS_ENABLED(CONFIG_IPV6)
> +               else if (sk->sk_family == AF_INET6)
> +                       sk->sk_prot->set_rcv_saddr(sk, &sk->sk_v6_rcv_saddr);
> +#endif
> +       }
> +
>         sk_dst_reset(sk);
>
>         ret = 0;
> diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
> index d52333e921f3..3ea3fa94c127 100644
> --- a/net/ipv4/datagram.c
> +++ b/net/ipv4/datagram.c
> @@ -64,9 +64,10 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
>         if (!inet->inet_saddr)
>                 inet->inet_saddr = fl4->saddr;  /* Update source address */
>         if (!inet->inet_rcv_saddr) {
> -               inet->inet_rcv_saddr = fl4->saddr;
> -               if (sk->sk_prot->rehash && sk->sk_family == AF_INET)
> -                       sk->sk_prot->rehash(sk);
> +               if (sk->sk_prot->set_rcv_saddr && sk->sk_family == AF_INET)
> +                       sk->sk_prot->set_rcv_saddr(sk, &fl4->saddr);
> +               else
> +                       inet->inet_rcv_saddr = fl4->saddr;
>         }
>         inet->inet_daddr = fl4->daddr;
>         inet->inet_dport = usin->sin_port;
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index 9bfcfd016e18..74e6a3604bcf 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -874,19 +874,6 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
>         return &hinfo->bhash2[hash & (hinfo->bhash_size - 1)];
>  }
>
> -static void inet_update_saddr(struct sock *sk, void *saddr, int family)
> -{
> -       if (family == AF_INET) {
> -               inet_sk(sk)->inet_saddr = *(__be32 *)saddr;
> -               sk_rcv_saddr_set(sk, inet_sk(sk)->inet_saddr);
> -       }
> -#if IS_ENABLED(CONFIG_IPV6)
> -       else {
> -               sk->sk_v6_rcv_saddr = *(struct in6_addr *)saddr;
> -       }
> -#endif
> -}
> -
>  static int __inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family, bool reset)
>  {
>         struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk);
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 6a01905d379f..8490408f6009 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -639,18 +639,21 @@ struct sock *__udp4_lib_lookup(const struct net *net, __be32 saddr,
>                 int sdif, struct udp_table *udptable, struct sk_buff *skb)
>  {
>         unsigned short hnum = ntohs(dport);
> -       struct udp_hslot *hslot2;
> +       struct udp_hslot *hslot, *hslot2;
>         struct sock *result, *sk;
>         unsigned int hash2;
>
> +       hslot = udp_hashslot(udptable, net, hnum);
> +       spin_lock_bh(&hslot->lock);

This is not acceptable.
UDP is best effort, packets can be dropped.
Please fix user application expectations.
Paolo Abeni Dec. 5, 2024, 4:53 p.m. UTC | #4
On 12/5/24 16:58, Stefano Brivio wrote:
> On Thu, 5 Dec 2024 10:30:14 +0100
> Paolo Abeni <pabeni@redhat.com> wrote:
> 
>> On 12/4/24 23:12, Stefano Brivio wrote:
>>
>>> [...]
>>>
>>> To fix this, replace the rehash operation by a set_rcv_saddr()
>>> callback holding the spinlock on the primary hash chain, just like
>>> the rehash operation used to do, but also setting the address (via
>>> inet_update_saddr(), moved to headers) while holding the spinlock.
>>>
>>> To make this atomic against the lookup operation, also acquire the
>>> spinlock on the primary chain there.  
>>
>> I'm sorry for the late feedback.
>>
>> I'm concerned by the unconditional spinlock in __udp4_lib_lookup(). I
>> fear it could cause performance regressions in different workloads:
>> heavy UDP unicast flow, or even TCP over UDP tunnel when the NIC
>> supports RX offload for the relevant UDP tunnel protocol.
>>
>> In the first case there will be an additional atomic operation per packet.
> 
> So, I've been looking into this a bit, and request-response rates with
> neper's udp_rr (https://github.com/google/neper/blob/master/udp_rr.c)
> for a client/server pair via loopback interface are the same before and
> after this patch.
> 
> The reason is, I suppose, that the only contention on that spinlock is
> the "intended" one, that is, between connect() and lookup.
> 
> Then I moved on to bulk flows, with socat or iperf3. But there (and
> that's the whole point of this fix) we have connected sockets, and once
> they are connected, we switch to early demux, which is not affected by
> this patch.
> 
> In the end, I don't think this will affect "regular", bulk unicast
> flows, because applications using them will typically connect sockets,
> and we'll switch to early demux right away.
> 
> This lookup is not exactly "slow path", but it's not fast path either.

Some (most ?) quick server implementations don't use connect.

DNS servers will be affected, and will see contention on the hash lock

Even deployment using SO_REUSEPORT with a per-cpu UDP socket will see
contention. This latter case would be pretty bad, as it's supposed to
scale linearly.

I really think the hash lock during lookup is a no go.

>> In the latter the spin_lock will be contended with multiple concurrent
>> TCP over UDP tunnel flows: the NIC with UDP tunnel offload can use the
>> inner header to compute the RX hash, and use different rx queues for
>> such flows.
>>
>> The GRO stage will perform UDP tunnel socket lookup and will contend the
>> bucket lock.
> 
> In this case (I couldn't find out yet), aren't sockets connected? I
> would expect that we switch to the early demux path relatively soon for
> anything that needs to have somehow high throughput.

The UDP socket backing tunnels is unconnected and can receive data from
multiple other tunnel endpoints.

> And if we don't, probably the more reasonable alternative would be to
> "fix" that, rather than keeping this relatively common case broken.
> 
> Do you have a benchmark or something I can run?

I'm sorry, but I don't have anything handy. If you have a NIC
implementing e.g. vxlan H/W offload you should be able to observe
contention with multiple simultaneous TCP over vxlan flows targeting an
endpoint on top of it.

>>> This results in some awkwardness at a caller site, specifically
>>> sock_bindtoindex_locked(), where we really just need to rehash the
>>> socket without changing its address. With the new operation, we now
>>> need to forcibly set the current address again.
>>>
>>> On the other hand, this appears more elegant than alternatives such
>>> as fetching the spinlock reference in ip4_datagram_connect() and
>>> ip6_datagram_connect(), and keeping the rehash operation around for
>>> a single user also seems a tad overkill.  
>>
>> Would such option require the same additional lock at lookup time?
> 
> Yes, it's conceptually the same, we would pretty much just move code
> around.
> 
> I've been thinking about possible alternatives but they all involve a
> much bigger rework. One idea could be that we RCU-connect() sockets,
> instead of just having the hash table insertion under RCU. That is, as
> long as we're in the grace period, the lookup would still see the old
> receive address.

I'm wondering if the issue could be solved (almost) entirely in the
rehash callback?!? if the rehash happens on connect and the socket
does not have hash4 yet (it's not a reconnect) do the l4 hashing before
everything else.

Incoming packets should match the l4 hash and reach the socket even
while later updating the other hash(es).

Cheers,

Paolo
David Gibson Dec. 5, 2024, 10:32 p.m. UTC | #5
On Thu, Dec 05, 2024 at 05:35:52PM +0100, Eric Dumazet wrote:
> On Wed, Dec 4, 2024 at 11:12 PM Stefano Brivio <sbrivio@redhat.com> wrote:
[snip]
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index 6a01905d379f..8490408f6009 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -639,18 +639,21 @@ struct sock *__udp4_lib_lookup(const struct net *net, __be32 saddr,
> >                 int sdif, struct udp_table *udptable, struct sk_buff *skb)
> >  {
> >         unsigned short hnum = ntohs(dport);
> > -       struct udp_hslot *hslot2;
> > +       struct udp_hslot *hslot, *hslot2;
> >         struct sock *result, *sk;
> >         unsigned int hash2;
> >
> > +       hslot = udp_hashslot(udptable, net, hnum);
> > +       spin_lock_bh(&hslot->lock);
> 
> This is not acceptable.
> UDP is best effort, packets can be dropped.
> Please fix user application expectations.

The packets aren't merely dropped, they're rejected with an ICMP Port
Unreachable.
Eric Dumazet Dec. 5, 2024, 10:52 p.m. UTC | #6
On Thu, Dec 5, 2024 at 11:32 PM David Gibson
<david@gibson.dropbear.id.au> wrote:
>
> On Thu, Dec 05, 2024 at 05:35:52PM +0100, Eric Dumazet wrote:
> > On Wed, Dec 4, 2024 at 11:12 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> [snip]
> > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > index 6a01905d379f..8490408f6009 100644
> > > --- a/net/ipv4/udp.c
> > > +++ b/net/ipv4/udp.c
> > > @@ -639,18 +639,21 @@ struct sock *__udp4_lib_lookup(const struct net *net, __be32 saddr,
> > >                 int sdif, struct udp_table *udptable, struct sk_buff *skb)
> > >  {
> > >         unsigned short hnum = ntohs(dport);
> > > -       struct udp_hslot *hslot2;
> > > +       struct udp_hslot *hslot, *hslot2;
> > >         struct sock *result, *sk;
> > >         unsigned int hash2;
> > >
> > > +       hslot = udp_hashslot(udptable, net, hnum);
> > > +       spin_lock_bh(&hslot->lock);
> >
> > This is not acceptable.
> > UDP is best effort, packets can be dropped.
> > Please fix user application expectations.
>
> The packets aren't merely dropped, they're rejected with an ICMP Port
> Unreachable.

We made UDP stack scalable with RCU, it took years of work.

And this patch is bringing back the UDP stack to horrible performance
from more than a decade ago.
Everybody will go back to DPDK.

I am pretty certain this can be solved without using a spinlock in the
fast path.

Think about UDP DNS/QUIC servers, using SO_REUSEPORT and receiving
10,000,000 packets per second....

Changing source address on an UDP socket is highly unusual, we are not
going to slow down UDP for this case.

Application could instead open another socket, and would probably work
on old linux versions.

If the regression was recent, this would be considered as a normal regression,
but apparently nobody noticed for 10 years. This should be saying something...
David Gibson Dec. 6, 2024, 2:16 a.m. UTC | #7
On Thu, Dec 05, 2024 at 11:52:38PM +0100, Eric Dumazet wrote:
> On Thu, Dec 5, 2024 at 11:32 PM David Gibson
> <david@gibson.dropbear.id.au> wrote:
> >
> > On Thu, Dec 05, 2024 at 05:35:52PM +0100, Eric Dumazet wrote:
> > > On Wed, Dec 4, 2024 at 11:12 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > [snip]
> > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > > index 6a01905d379f..8490408f6009 100644
> > > > --- a/net/ipv4/udp.c
> > > > +++ b/net/ipv4/udp.c
> > > > @@ -639,18 +639,21 @@ struct sock *__udp4_lib_lookup(const struct net *net, __be32 saddr,
> > > >                 int sdif, struct udp_table *udptable, struct sk_buff *skb)
> > > >  {
> > > >         unsigned short hnum = ntohs(dport);
> > > > -       struct udp_hslot *hslot2;
> > > > +       struct udp_hslot *hslot, *hslot2;
> > > >         struct sock *result, *sk;
> > > >         unsigned int hash2;
> > > >
> > > > +       hslot = udp_hashslot(udptable, net, hnum);
> > > > +       spin_lock_bh(&hslot->lock);
> > >
> > > This is not acceptable.
> > > UDP is best effort, packets can be dropped.
> > > Please fix user application expectations.
> >
> > The packets aren't merely dropped, they're rejected with an ICMP Port
> > Unreachable.
> 
> We made UDP stack scalable with RCU, it took years of work.
> 
> And this patch is bringing back the UDP stack to horrible performance
> from more than a decade ago.
> Everybody will go back to DPDK.

It's reasonable to be concerned about the performance impact.  But
this seems like premature hyperbole given no-one has numbers yet, or
has even suggested a specific benchmark to reveal the impact.

> I am pretty certain this can be solved without using a spinlock in the
> fast path.

Quite possibly.  But Stefano has tried, and it certainly wasn't
trivial.

> Think about UDP DNS/QUIC servers, using SO_REUSEPORT and receiving
> 10,000,000 packets per second....
> 
> Changing source address on an UDP socket is highly unusual, we are not
> going to slow down UDP for this case.

Changing in a general way is very rare, one specific case is not.
Every time you connect() a socket that wasn't previously bound to a
specific address you get an implicit source address change from
0.0.0.0 or :: to something that depends on the routing table.
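That implicit change is easy to observe from userspace. A minimal sketch (the loopback peer is an arbitrary illustration; connect() on a datagram socket sends no packet by itself):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.bind(("0.0.0.0", 0))         # wildcard address, ephemeral port
before = s.getsockname()[0]    # still the wildcard address

# connect() picks a local source address from the routing table;
# for a loopback peer that is 127.0.0.1.
s.connect(("127.0.0.1", 53))
after = s.getsockname()[0]

print(before, "->", after)
s.close()
```

The transition from the wildcard to a concrete local address shown here is exactly the rehash-triggering event under discussion.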

> Application could instead open another socket, and would probably work
> on old linux versions.

Possibly there's a procedure that would work here, but it's not at all
obvious:

 * Clearly, you can't close the non-connected socket before opening
   the connected one - that just introduces a new much wider race.  It
   doesn't even get rid of the existing one, because unless you can
   independently predict what the correct bound address will be
   for a given peer address, the second socket will still have an
   address change when you connect().

 * So, you must create the connected socket before closing the
   unconnected one, meaning you have to use SO_REUSEADDR or
   SO_REUSEPORT whether or not you otherwise wanted to.

 * While both sockets are open, you need to handle the possibility
   that packets could be delivered to either one.  Doable, but a pain
   in the arse.

 * How do you know when the transition is completed and you can close
   the unconnected socket?  The fact that the rehashing has completed
   and all the necessary memory barriers passed isn't something
   userspace can directly discern.
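For concreteness, a hedged sketch of what that procedure might look like — the port handling and the peer address are illustrative assumptions, not a known-good recipe:

```python
import socket

def reuse_socket(port):
    # Both options set, per the point above that you are forced into
    # SO_REUSEADDR/SO_REUSEPORT whether you otherwise wanted them or not.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    if hasattr(socket, "SO_REUSEPORT"):
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("0.0.0.0", port))
    return s

# The original, unconnected socket.
unconnected = reuse_socket(0)
port = unconnected.getsockname()[1]

# Open the connected socket *before* closing the old one, to avoid
# widening the race.
connected = reuse_socket(port)
connected.connect(("127.0.0.1", 9))   # hypothetical peer; 9 = discard

# While both exist, datagrams may land on either socket, so a real
# server would have to poll and read both here.

# There is no userspace-visible signal that the rehash has completed,
# so the point at which this close() becomes safe is unknowable.
unconnected.close()
```

The comments mark the two structural problems called out above: the double-delivery window and the missing completion signal.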

> If the regression was recent, this would be considered as a normal regression,
> but apparently nobody noticed for 10 years. This should be saying something...

It does.  But so does the fact that it can be trivially reproduced.
Eric Dumazet Dec. 6, 2024, 9:04 a.m. UTC | #8
On Fri, Dec 6, 2024 at 3:16 AM David Gibson <david@gibson.dropbear.id.au> wrote:
>
> On Thu, Dec 05, 2024 at 11:52:38PM +0100, Eric Dumazet wrote:
> > On Thu, Dec 5, 2024 at 11:32 PM David Gibson
> > <david@gibson.dropbear.id.au> wrote:
> > >
> > > On Thu, Dec 05, 2024 at 05:35:52PM +0100, Eric Dumazet wrote:
> > > > On Wed, Dec 4, 2024 at 11:12 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > [snip]
> > > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > > > index 6a01905d379f..8490408f6009 100644
> > > > > --- a/net/ipv4/udp.c
> > > > > +++ b/net/ipv4/udp.c
> > > > > @@ -639,18 +639,21 @@ struct sock *__udp4_lib_lookup(const struct net *net, __be32 saddr,
> > > > >                 int sdif, struct udp_table *udptable, struct sk_buff *skb)
> > > > >  {
> > > > >         unsigned short hnum = ntohs(dport);
> > > > > -       struct udp_hslot *hslot2;
> > > > > +       struct udp_hslot *hslot, *hslot2;
> > > > >         struct sock *result, *sk;
> > > > >         unsigned int hash2;
> > > > >
> > > > > +       hslot = udp_hashslot(udptable, net, hnum);
> > > > > +       spin_lock_bh(&hslot->lock);
> > > >
> > > > This is not acceptable.
> > > > UDP is best effort, packets can be dropped.
> > > > Please fix user application expectations.
> > >
> > > The packets aren't merely dropped, they're rejected with an ICMP Port
> > > Unreachable.
> >
> > We made UDP stack scalable with RCU, it took years of work.
> >
> > And this patch is bringing back the UDP stack to horrible performance
> > from more than a decade ago.
> > Everybody will go back to DPDK.
>
> It's reasonable to be concerned about the performance impact.  But
> this seems like premature hyperbole given no-one has numbers yet, or
> has even suggested a specific benchmark to reveal the impact.
>
> > I am pretty certain this can be solved without using a spinlock in the
> > fast path.
>
> Quite possibly.  But Stefano has tried, and it certainly wasn't
> trivial.
>
> > Think about UDP DNS/QUIC servers, using SO_REUSEPORT and receiving
> > 10,000,000 packets per second....
> >
> > Changing source address on an UDP socket is highly unusual, we are not
> > going to slow down UDP for this case.
>
> Changing in a general way is very rare, one specific case is not.
> Every time you connect() a socket that wasn't previously bound to a
> specific address you get an implicit source address change from
> 0.0.0.0 or :: to something that depends on the routing table.
>
> > Application could instead open another socket, and would probably work
> > on old linux versions.
>
> Possibly there's a procedure that would work here, but it's not at all
> obvious:
>
>  * Clearly, you can't close the non-connected socket before opening
>    the connected one - that just introduces a new much wider race.  It
>    doesn't even get rid of the existing one, because unless you can
>    independently predict what the correct bound address will be
>    for a given peer address, the second socket will still have an
>    address change when you connect().
>

The order is kind of obvious.

Kernel does not have to deal with wrong application design.

>  * So, you must create the connected socket before closing the
>    unconnected one, meaning you have to use SO_REUSEADDR or
>    SO_REUSEPORT whether or not you otherwise wanted to.
>
>  * While both sockets are open, you need to handle the possibility
>    that packets could be delivered to either one.  Doable, but a pain
>    in the arse.

Given UDP does not have a proper listen() + accept() model, I am
afraid this is the only way.

You need to keep the generic UDP socket as a catch all, and deal with
packets received on it.

>
>  * How do you know when the transition is completed and you can close
>    the unconnected socket?  The fact that the rehashing has completed
>    and all the necessary memory barriers passed isn't something
>    userspace can directly discern.
>
> > If the regression was recent, this would be considered as a normal regression,
> > but apparently nobody noticed for 10 years. This should be saying something...
>
> It does.  But so does the fact that it can be trivially reproduced.

If a kernel fix is doable without making UDP stack a complete nogo for
most of us,
I will be happy to review it.
Stefano Brivio Dec. 6, 2024, 10:50 a.m. UTC | #9
On Thu, 5 Dec 2024 17:53:33 +0100
Paolo Abeni <pabeni@redhat.com> wrote:

> On 12/5/24 16:58, Stefano Brivio wrote:
> > On Thu, 5 Dec 2024 10:30:14 +0100
> > Paolo Abeni <pabeni@redhat.com> wrote:
> >   
> >> On 12/4/24 23:12, Stefano Brivio wrote:
> >>  
> >>> [...]
> >>>
> >>> To fix this, replace the rehash operation by a set_rcv_saddr()
> >>> callback holding the spinlock on the primary hash chain, just like
> >>> the rehash operation used to do, but also setting the address (via
> >>> inet_update_saddr(), moved to headers) while holding the spinlock.
> >>>
> >>> To make this atomic against the lookup operation, also acquire the
> >>> spinlock on the primary chain there.    
> >>
> >> I'm sorry for the late feedback.
> >>
> >> I'm concerned by the unconditional spinlock in __udp4_lib_lookup(). I
> >> fear it could cause performance regressions in different workloads:
> >> heavy UDP unicast flow, or even TCP over UDP tunnel when the NIC
> >> supports RX offload for the relevant UDP tunnel protocol.
> >>
> >> In the first case there will be an additional atomic operation per packet.  
> > 
> > So, I've been looking into this a bit, and request-response rates with
> > neper's udp_rr (https://github.com/google/neper/blob/master/udp_rr.c)
> > for a client/server pair via loopback interface are the same before and
> > after this patch.
> > 
> > The reason is, I suppose, that the only contention on that spinlock is
> > the "intended" one, that is, between connect() and lookup.
> > 
> > Then I moved on to bulk flows, with socat or iperf3. But there (and
> > that's the whole point of this fix) we have connected sockets, and once
> > they are connected, we switch to early demux, which is not affected by
> > this patch.
> > 
> > In the end, I don't think this will affect "regular", bulk unicast
> > flows, because applications using them will typically connect sockets,
> > and we'll switch to early demux right away.
> > 
> > This lookup is not exactly "slow path", but it's not fast path either.  
> 
> Some (most ?) quick server implementations don't use connect.

Assuming you mean QUIC, fair enough, I see your point.

> DNS servers will be affected, and will see contention on the hash lock

At the same time, clients (not just DNS) are surely affected by bogus
ICMP Port Unreachable messages, if remote, or ECONNREFUSED on send()
(!), if local.
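The local case can be sketched as a userspace reproducer: a connected sender blasts datagrams at a wildcard-bound socket while that socket connect()s, changing its local address and rehashing. On affected kernels some sends fail with ECONNREFUSED; the window is tiny, so the failure count is timing-dependent and may well be zero on any given run — this is an illustration, not a reliable test.

```python
import socket
import threading

failures = 0

receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("0.0.0.0", 0))
port = receiver.getsockname()[1]

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.connect(("127.0.0.1", port))
sport = sender.getsockname()[1]

def blast(n=10000):
    global failures
    for _ in range(n):
        try:
            sender.send(b"x")
        except ConnectionRefusedError:
            # Pending ICMP Port Unreachable surfaces on the next send()
            failures += 1

t = threading.Thread(target=blast)
t.start()
# Rehash the receiver mid-stream: local address goes 0.0.0.0 -> 127.0.0.1
receiver.connect(("127.0.0.1", sport))
t.join()
print("failed sends:", failures)
```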

If (presumed) contention is so relevant, I would have expected that
somebody could indicate a benchmark for it. As I mentioned, udp_rr from
'neper' didn't really show any difference for me. Anyway, fine, let's
assume that it's an issue.
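For reference, the shape of such a measurement: a toy loopback ping-pong in the spirit of udp_rr, not the neper tool itself, and absolute numbers from it mean little.

```python
import socket
import threading
import time

N = 1000

def server(sock, n):
    # Echo each datagram back to its sender
    for _ in range(n):
        data, peer = sock.recvfrom(64)
        sock.sendto(data, peer)

srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
srv.bind(("127.0.0.1", 0))
addr = srv.getsockname()

t = threading.Thread(target=server, args=(srv, N))
t.start()

cli = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
cli.connect(addr)          # connected socket: eligible for early demux
start = time.monotonic()
for _ in range(N):
    cli.send(b"ping")
    cli.recv(64)
elapsed = time.monotonic() - start
t.join()
print(f"{N / elapsed:.0f} transactions/s")
```

Note that because the client socket is connected, this mostly exercises the early-demux path; stressing the contended slow-path lookup would require unconnected receivers, e.g. a SO_REUSEPORT group.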

> Even deployment using SO_REUSEPORT with a per-cpu UDP socket will see
> contention. This latter case would be pretty bad, as it's supposed to
> scale linearly.

Okay, I guess we could observe a bigger impact in this case (this is
something I didn't try).

> I really think the hash lock during lookup is a no go.
> 
> >> In the latter the spin_lock will be contended with multiple concurrent
> >> TCP over UDP tunnel flows: the NIC with UDP tunnel offload can use the
> >> inner header to compute the RX hash, and use different rx queues for
> >> such flows.
> >>
> >> The GRO stage will perform UDP tunnel socket lookup and will contend the
> >> bucket lock.  
> > 
> > In this case (I couldn't find out yet), aren't sockets connected? I
> > would expect that we switch to the early demux path relatively soon for
> > anything that needs to have somehow high throughput.  
> 
> The UDP socket backing tunnels is unconnected and can receive data from
> multiple other tunnel endpoints.
> 
> > And if we don't, probably the more reasonable alternative would be to
> > "fix" that, rather than keeping this relatively common case broken.
> > 
> > Do you have a benchmark or something I can run?  
> 
> I'm sorry, but I don't have anything handy. If you have a NIC
> implementing e.g. vxlan H/W offload you should be able to observe
> contention with multiple simultaneous TCP over vxlan flows targeting an
> endpoint on top of it.

Thanks for the idea, but no, I don't have one right now.

> >>> This results in some awkwardness at a caller site, specifically
> >>> sock_bindtoindex_locked(), where we really just need to rehash the
> >>> socket without changing its address. With the new operation, we now
> >>> need to forcibly set the current address again.
> >>>
> >>> On the other hand, this appears more elegant than alternatives such
> >>> as fetching the spinlock reference in ip4_datagram_connect() and
> >> ip6_datagram_connect(), and keeping the rehash operation around for
> >>> a single user also seems a tad overkill.    
> >>
> >> Would such option require the same additional lock at lookup time?  
> > 
> > Yes, it's conceptually the same, we would pretty much just move code
> > around.
> > 
> > I've been thinking about possible alternatives but they all involve a
> > much bigger rework. One idea could be that we RCU-connect() sockets,
> > instead of just having the hash table insertion under RCU. That is, as
> > long as we're in the grace period, the lookup would still see the old
> > receive address.  
> 
> I'm wondering if the issue could be solved (almost) entirely in the
> rehash callback?!? if the rehash happens on connect and the socket
> does not have hash4 yet (it's not a reconnect) do the l4 hashing before
> everything else.

So, yes, that's actually the first thing I tried: do the hashing (any
hash) before setting the address (I guess that's what you mean by
"everything else").

If you take this series, and drop the changes in __udp4_lib_lookup(), I
guess that would match what you suggest.

With udp_lib_set_rcv_saddr() instead of a "rehash" callback you can see
pretty easily that hashes are updated first, and then we set the
receiving address.

It doesn't work because the socket does have a receiving address (and
hashes) already: it's 0.0.0.0. So we're just moving the race condition.
I don't think we can really change that part.

Note that this issue occurs with and without four-tuple hashes (I
actually posted the original fix before they were introduced).

> Incoming packets should match the l4 hash and reach the socket even
> while later updating the other hash(es).

...to obtain this kind of outcome, I'm trying to keep the old hash
around until the new hash is there *and* we changed the address.

For simplicity, I cut out four-tuple hashes, and, in the new
udp_lib_set_rcv_saddr(), I changed RCU calls so that it should always
be the case... but that doesn't help either for some reason.

I wonder if you have some idea as to whether that's a viable approach
at all, and if there's something particular I should observe while
implementing it.
Paolo Abeni Dec. 6, 2024, 12:36 p.m. UTC | #10
On 12/6/24 11:50, Stefano Brivio wrote:
> On Thu, 5 Dec 2024 17:53:33 +0100 Paolo Abeni <pabeni@redhat.com> wrote:
>> I'm wondering if the issue could be solved (almost) entirely in the
>> rehash callback?!? if the rehash happens on connect and the socket
>> does not have hash4 yet (it's not a reconnect) do the l4 hashing before
>> everything else.
> 
> So, yes, that's actually the first thing I tried: do the hashing (any
> hash) before setting the address (I guess that's what you mean by
> "everything else").
> 
> If you take this series, and drop the changes in __udp4_lib_lookup(), I
> guess that would match what you suggest.

I mean something slightly different. Just to explain the idea, something
like the following (completely untested):

---
diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index cc6d0bd7b0a9..e9cc6edbcdc6 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -61,6 +61,10 @@ int __ip4_datagram_connect(struct sock *sk, struct
sockaddr *uaddr, int addr_len
 		err = -EACCES;
 		goto out;
 	}
+
+	sk->sk_state = TCP_ESTABLISHED;
+	inet->inet_daddr = fl4->daddr;
+	inet->inet_dport = usin->sin_port;
 	if (!inet->inet_saddr)
 		inet->inet_saddr = fl4->saddr;	/* Update source address */
 	if (!inet->inet_rcv_saddr) {
@@ -68,10 +72,7 @@ int __ip4_datagram_connect(struct sock *sk, struct
sockaddr *uaddr, int addr_len
 		if (sk->sk_prot->rehash)
 			sk->sk_prot->rehash(sk);
 	}
-	inet->inet_daddr = fl4->daddr;
-	inet->inet_dport = usin->sin_port;
 	reuseport_has_conns_set(sk);
-	sk->sk_state = TCP_ESTABLISHED;
 	sk_set_txhash(sk);
 	atomic_set(&inet->inet_id, get_random_u16());

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 6a01905d379f..c6c58b0a6b7b 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2194,6 +2194,21 @@ void udp_lib_rehash(struct sock *sk, u16 newhash,
u16 newhash4)
 			if (rcu_access_pointer(sk->sk_reuseport_cb))
 				reuseport_detach_sock(sk);

+			if (sk->sk_state == TCP_ESTABLISHED && !udp_hashed4(sk)) {
+				struct udp_hslot * hslot4 = udp_hashslot4(udptable, newhash4);
+
+				udp_sk(sk)->udp_lrpa_hash = newhash4;
+				spin_lock(&hslot4->lock);
+				hlist_nulls_add_head_rcu(&udp_sk(sk)->udp_lrpa_node,
+							 &hslot4->nulls_head);
+				hslot4->count++;
+				spin_unlock(&hslot4->lock);
+
+				spin_lock(&hslot2->lock);
+				udp_hash4_inc(hslot2);
+				spin_unlock(&hslot2->lock);
+			}
+
 			if (hslot2 != nhslot2) {
 				spin_lock(&hslot2->lock);
 				hlist_del_init_rcu(&udp_sk(sk)->udp_portaddr_node);
---

Basically the idea is to leverage the hash4 - which should be not yet
initialized when rehash is invoked due to connect().

In such a case, before touching hash{,2}, do hash4.

/P
Stefano Brivio Dec. 6, 2024, 1:35 p.m. UTC | #11
On Fri, 6 Dec 2024 13:36:47 +0100
Paolo Abeni <pabeni@redhat.com> wrote:

> On 12/6/24 11:50, Stefano Brivio wrote:
> > On Thu, 5 Dec 2024 17:53:33 +0100 Paolo Abeni <pabeni@redhat.com> wrote:  
> >> I'm wondering if the issue could be solved (almost) entirely in the
> >> rehash callback?!? if the rehash happens on connect and the socket
> >> does not have hash4 yet (it's not a reconnect) do the l4 hashing before
> >> everything else.  
> > 
> > So, yes, that's actually the first thing I tried: do the hashing (any
> > hash) before setting the address (I guess that's what you mean by
> > "everything else").
> > 
> > If you take this series, and drop the changes in __udp4_lib_lookup(), I
> > guess that would match what you suggest.  
> 
> I mean something slightly different. Just to explain the idea, something
> like the following (completely untested):
> 
> ---
> diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
> index cc6d0bd7b0a9..e9cc6edbcdc6 100644
> --- a/net/ipv4/datagram.c
> +++ b/net/ipv4/datagram.c
> @@ -61,6 +61,10 @@ int __ip4_datagram_connect(struct sock *sk, struct
> sockaddr *uaddr, int addr_len
>  		err = -EACCES;
>  		goto out;
>  	}
> +
> +	sk->sk_state = TCP_ESTABLISHED;
> +	inet->inet_daddr = fl4->daddr;
> +	inet->inet_dport = usin->sin_port;
>  	if (!inet->inet_saddr)
>  		inet->inet_saddr = fl4->saddr;	/* Update source address */
>  	if (!inet->inet_rcv_saddr) {
> @@ -68,10 +72,7 @@ int __ip4_datagram_connect(struct sock *sk, struct
> sockaddr *uaddr, int addr_len
>  		if (sk->sk_prot->rehash)
>  			sk->sk_prot->rehash(sk);
>  	}
> -	inet->inet_daddr = fl4->daddr;
> -	inet->inet_dport = usin->sin_port;
>  	reuseport_has_conns_set(sk);
> -	sk->sk_state = TCP_ESTABLISHED;
>  	sk_set_txhash(sk);
>  	atomic_set(&inet->inet_id, get_random_u16());
> 
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 6a01905d379f..c6c58b0a6b7b 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -2194,6 +2194,21 @@ void udp_lib_rehash(struct sock *sk, u16 newhash,
> u16 newhash4)
>  			if (rcu_access_pointer(sk->sk_reuseport_cb))
>  				reuseport_detach_sock(sk);
> 
> +			if (sk->sk_state == TCP_ESTABLISHED && !udp_hashed4(sk)) {
> +				struct udp_hslot * hslot4 = udp_hashslot4(udptable, newhash4);
> +
> +				udp_sk(sk)->udp_lrpa_hash = newhash4;
> +				spin_lock(&hslot4->lock);
> +				hlist_nulls_add_head_rcu(&udp_sk(sk)->udp_lrpa_node,
> +							 &hslot4->nulls_head);
> +				hslot4->count++;
> +				spin_unlock(&hslot4->lock);
> +
> +				spin_lock(&hslot2->lock);
> +				udp_hash4_inc(hslot2);
> +				spin_unlock(&hslot2->lock);
> +			}
> +
>  			if (hslot2 != nhslot2) {
>  				spin_lock(&hslot2->lock);
>  				hlist_del_init_rcu(&udp_sk(sk)->udp_portaddr_node);
> ---
> 
> Basically the idea is to leverage the hash4 - which should be not yet
> initialized when rehash is invoked due to connect().

That assumption seems to be correct from my tests.

> In such a case, before touching hash{,2}, do hash4.

Brilliant, thanks. I'll give that a try.
Paolo Abeni Dec. 6, 2024, 3:10 p.m. UTC | #12
On 12/6/24 14:35, Stefano Brivio wrote:
> On Fri, 6 Dec 2024 13:36:47 +0100
> Paolo Abeni <pabeni@redhat.com> wrote:
> 
>> On 12/6/24 11:50, Stefano Brivio wrote:
>>> On Thu, 5 Dec 2024 17:53:33 +0100 Paolo Abeni <pabeni@redhat.com> wrote:  
>>>> I'm wondering if the issue could be solved (almost) entirely in the
>>>> rehash callback?!? if the rehash happens on connect and the socket
>>>> does not have hash4 yet (it's not a reconnect) do the l4 hashing before
>>>> everything else.  
>>>
>>> So, yes, that's actually the first thing I tried: do the hashing (any
>>> hash) before setting the address (I guess that's what you mean by
>>> "everything else").
>>>
>>> If you take this series, and drop the changes in __udp4_lib_lookup(), I
>>> guess that would match what you suggest.  
>>
>> I mean something slightly different. Just to explain the idea, something
>> like the following (completely untested):
>>
>> ---
>> diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
>> index cc6d0bd7b0a9..e9cc6edbcdc6 100644
>> --- a/net/ipv4/datagram.c
>> +++ b/net/ipv4/datagram.c
>> @@ -61,6 +61,10 @@ int __ip4_datagram_connect(struct sock *sk, struct
>> sockaddr *uaddr, int addr_len
>>  		err = -EACCES;
>>  		goto out;
>>  	}
>> +
>> +	sk->sk_state = TCP_ESTABLISHED;
>> +	inet->inet_daddr = fl4->daddr;
>> +	inet->inet_dport = usin->sin_port;
>>  	if (!inet->inet_saddr)
>>  		inet->inet_saddr = fl4->saddr;	/* Update source address */
>>  	if (!inet->inet_rcv_saddr) {
>> @@ -68,10 +72,7 @@ int __ip4_datagram_connect(struct sock *sk, struct
>> sockaddr *uaddr, int addr_len
>>  		if (sk->sk_prot->rehash)
>>  			sk->sk_prot->rehash(sk);
>>  	}
>> -	inet->inet_daddr = fl4->daddr;
>> -	inet->inet_dport = usin->sin_port;

Side note: I think that moving the initialization of the above fields
before the rehash is a separate fix - otherwise reconnect will screw up hash4.

I'll submit just that (sub) chunk separately.

Cheers,

Paolo
David Gibson Dec. 9, 2024, 2:20 a.m. UTC | #13
On Fri, Dec 06, 2024 at 10:04:33AM +0100, Eric Dumazet wrote:
> On Fri, Dec 6, 2024 at 3:16 AM David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > On Thu, Dec 05, 2024 at 11:52:38PM +0100, Eric Dumazet wrote:
> > > On Thu, Dec 5, 2024 at 11:32 PM David Gibson
> > > <david@gibson.dropbear.id.au> wrote:
> > > >
> > > > On Thu, Dec 05, 2024 at 05:35:52PM +0100, Eric Dumazet wrote:
> > > > > On Wed, Dec 4, 2024 at 11:12 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > [snip]
> > > > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > > > > index 6a01905d379f..8490408f6009 100644
> > > > > > --- a/net/ipv4/udp.c
> > > > > > +++ b/net/ipv4/udp.c
> > > > > > @@ -639,18 +639,21 @@ struct sock *__udp4_lib_lookup(const struct net *net, __be32 saddr,
> > > > > >                 int sdif, struct udp_table *udptable, struct sk_buff *skb)
> > > > > >  {
> > > > > >         unsigned short hnum = ntohs(dport);
> > > > > > -       struct udp_hslot *hslot2;
> > > > > > +       struct udp_hslot *hslot, *hslot2;
> > > > > >         struct sock *result, *sk;
> > > > > >         unsigned int hash2;
> > > > > >
> > > > > > +       hslot = udp_hashslot(udptable, net, hnum);
> > > > > > +       spin_lock_bh(&hslot->lock);
> > > > >
> > > > > This is not acceptable.
> > > > > UDP is best effort, packets can be dropped.
> > > > > Please fix user application expectations.
> > > >
> > > > The packets aren't merely dropped, they're rejected with an ICMP Port
> > > > Unreachable.
> > >
> > > We made UDP stack scalable with RCU, it took years of work.
> > >
> > > And this patch is bringing back the UDP stack to horrible performance
> > > from more than a decade ago.
> > > Everybody will go back to DPDK.
> >
> > It's reasonable to be concerned about the performance impact.  But
> > this seems like premature hyperbole given no-one has numbers yet, or
> > has even suggested a specific benchmark to reveal the impact.
> >
> > > I am pretty certain this can be solved without using a spinlock in the
> > > fast path.
> >
> > Quite possibly.  But Stefano has tried, and it certainly wasn't
> > trivial.
> >
> > > Think about UDP DNS/QUIC servers, using SO_REUSEPORT and receiving
> > > 10,000,000 packets per second....
> > >
> > > Changing source address on an UDP socket is highly unusual, we are not
> > > going to slow down UDP for this case.
> >
> > Changing in a general way is very rare, one specific case is not.
> > Every time you connect() a socket that wasn't previously bound to a
> > specific address you get an implicit source address change from
> > 0.0.0.0 or :: to something that depends on the routing table.
> >
> > > Application could instead open another socket, and would probably work
> > > on old linux versions.
> >
> > Possibly there's a procedure that would work here, but it's not at all
> > obvious:
> >
> >  * Clearly, you can't close the non-connected socket before opening
> >    the connected one - that just introduces a new much wider race.  It
> >    doesn't even get rid of the existing one, because unless you can
> >    independently predict what the correct bound address will be
> >    for a given peer address, the second socket will still have an
> >    address change when you connect().
> >
> 
> The order is kind of obvious.
> 
> Kernel does not have to deal with wrong application design.

What we're talking about is:

	bind("0.0.0.0:12345");
	connect("1.2.3.4:54321");

Which AFAIK has been a legal sequence since the sockets interface was
a thing.  I don't think it's reasonable to call expecting that *not*
to trigger ICMPs around the connect "wrong application design".
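
To make that concrete: this is a minimal userspace sketch (hypothetical
helper name; it connects to 127.0.0.1 rather than 1.2.3.4 so it works
without a default route) showing that connect() on a wildcard-bound UDP
socket implicitly rewrites the local address, which is exactly the
rehash-triggering transition being discussed:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind a UDP socket to 0.0.0.0:0, connect() it, and report whether the
 * kernel implicitly replaced the wildcard with a concrete local address
 * (visible via getsockname()). Hypothetical demo helper. */
static int udp_connect_changes_saddr(void)
{
	struct sockaddr_in local = { .sin_family = AF_INET,
				     .sin_addr.s_addr = htonl(INADDR_ANY) };
	struct sockaddr_in peer = { .sin_family = AF_INET,
				    .sin_port = htons(54321) };
	struct sockaddr_in after;
	socklen_t len = sizeof(after);
	int s = socket(AF_INET, SOCK_DGRAM, 0);

	inet_pton(AF_INET, "127.0.0.1", &peer.sin_addr);

	if (s < 0 ||
	    bind(s, (struct sockaddr *)&local, sizeof(local)) ||
	    connect(s, (struct sockaddr *)&peer, sizeof(peer)) ||
	    getsockname(s, (struct sockaddr *)&after, &len)) {
		if (s >= 0)
			close(s);
		return -1;
	}

	close(s);
	/* 1: the local address is no longer the 0.0.0.0 wildcard */
	return after.sin_addr.s_addr != htonl(INADDR_ANY);
}
```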

> >  * So, you must create the connected socket before closing the
> >    unconnected one, meaning you have to use SO_REUSEADDR or
> >    SO_REUSEPORT whether or not you otherwise wanted to.
> >
> >  * While both sockets are open, you need to handle the possibility
> >    that packets could be delivered to either one.  Doable, but a pain
> >    in the arse.
> 
> Given UDP does not have a proper listen() + accept() model, I am
> afraid this is the only way
> 
> You need to keep the generic UDP socket as a catch all, and deal with
> packets received on it.
> 
> >
> >  * How do you know when the transition is completed and you can close
> >    the unconnected socket?  The fact that the rehashing has completed
> >    and all the necessary memory barriers passed isn't something
> >    userspace can directly discern.
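
For reference, the two-socket procedure sketched above would look roughly
like this (a hypothetical sketch, not a recommendation: it relies on
SO_REUSEADDR being set on both sockets so the second bind succeeds, and it
trims the drain/close handling of the catch-all socket):

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a second, connected UDP socket on the same local port before
 * closing the unconnected catch-all one. Returns 1 if both sockets
 * coexisted on the port. Hypothetical demo helper. */
static int udp_two_socket_handover(void)
{
	struct sockaddr_in any = { .sin_family = AF_INET,
				   .sin_addr.s_addr = htonl(INADDR_ANY) };
	struct sockaddr_in peer = { .sin_family = AF_INET,
				    .sin_port = htons(54321) };
	struct sockaddr_in bound;
	socklen_t len = sizeof(bound);
	int one = 1, ok = 0;
	int catchall, connected = -1;

	inet_pton(AF_INET, "127.0.0.1", &peer.sin_addr);

	catchall = socket(AF_INET, SOCK_DGRAM, 0);
	if (catchall < 0)
		return 0;
	setsockopt(catchall, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
	if (bind(catchall, (struct sockaddr *)&any, sizeof(any)) ||
	    getsockname(catchall, (struct sockaddr *)&bound, &len))
		goto out;

	connected = socket(AF_INET, SOCK_DGRAM, 0);
	if (connected < 0)
		goto out;
	setsockopt(connected, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
	/* binding the same port again only works with SO_REUSEADDR set
	 * on both sockets */
	if (bind(connected, (struct sockaddr *)&bound, sizeof(bound)) ||
	    connect(connected, (struct sockaddr *)&peer, sizeof(peer)))
		goto out;

	ok = 1;	/* application must now drain the catch-all before close */
out:
	if (connected >= 0)
		close(connected);
	close(catchall);
	return ok;
}
```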
> >
> > > If the regression was recent, this would be considered as a normal regression,
> > > but apparently nobody noticed for 10 years. This should be saying something...
> >
> > It does.  But so does the fact that it can be trivially reproduced.
> 
> If a kernel fix is doable without making UDP stack a complete nogo for
> most of us,

The benchmarks Stefano has tried so far don't show an impact, and you
haven't yet suggested another one.  Again, calling this a "complete
nogo" seems like huge hyperbole without more data.

> I will be happy to review it.
>
Stefano Brivio Dec. 18, 2024, 4:21 p.m. UTC | #14
On Fri, 6 Dec 2024 14:35:35 +0100
Stefano Brivio <sbrivio@redhat.com> wrote:

> On Fri, 6 Dec 2024 13:36:47 +0100
> Paolo Abeni <pabeni@redhat.com> wrote:
> 
> > On 12/6/24 11:50, Stefano Brivio wrote:  
> > > On Thu, 5 Dec 2024 17:53:33 +0100 Paolo Abeni <pabeni@redhat.com> wrote:    
> > >> I'm wondering if the issue could be solved (almost) entirely in the
> > >> rehash callback?!? if the rehash happens on connect and the socket
> > >> does not have hash4 yet (it's not a reconnect) do the l4 hashing before
> > >> everything else.    
> > > 
> > > So, yes, that's actually the first thing I tried: do the hashing (any
> > > hash) before setting the address (I guess that's what you mean by
> > > "everything else").
> > > 
> > > If you take this series, and drop the changes in __udp4_lib_lookup(), I
> > > guess that would match what you suggest.    
> > 
> > I mean something slightly different. Just to explain the idea something
> > alike the following (completely untested):
> > 
> > ---
> > diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
> > index cc6d0bd7b0a9..e9cc6edbcdc6 100644
> > --- a/net/ipv4/datagram.c
> > +++ b/net/ipv4/datagram.c
> > @@ -61,6 +61,10 @@ int __ip4_datagram_connect(struct sock *sk, struct
> > sockaddr *uaddr, int addr_len
> >  		err = -EACCES;
> >  		goto out;
> >  	}
> > +
> > +	sk->sk_state = TCP_ESTABLISHED;
> > +	inet->inet_daddr = fl4->daddr;
> > +	inet->inet_dport = usin->sin_port;
> >  	if (!inet->inet_saddr)
> >  		inet->inet_saddr = fl4->saddr;	/* Update source address */
> >  	if (!inet->inet_rcv_saddr) {
> > @@ -68,10 +72,7 @@ int __ip4_datagram_connect(struct sock *sk, struct
> > sockaddr *uaddr, int addr_len
> >  		if (sk->sk_prot->rehash)
> >  			sk->sk_prot->rehash(sk);
> >  	}
> > -	inet->inet_daddr = fl4->daddr;
> > -	inet->inet_dport = usin->sin_port;
> >  	reuseport_has_conns_set(sk);
> > -	sk->sk_state = TCP_ESTABLISHED;
> >  	sk_set_txhash(sk);
> >  	atomic_set(&inet->inet_id, get_random_u16());
> > 
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index 6a01905d379f..c6c58b0a6b7b 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -2194,6 +2194,21 @@ void udp_lib_rehash(struct sock *sk, u16 newhash,
> > u16 newhash4)
> >  			if (rcu_access_pointer(sk->sk_reuseport_cb))
> >  				reuseport_detach_sock(sk);
> > 
> > +			if (sk->sk_state == TCP_ESTABLISHED && !udp_hashed4(sk)) {
> > +				struct udp_hslot * hslot4 = udp_hashslot4(udptable, newhash4);
> > +
> > +				udp_sk(sk)->udp_lrpa_hash = newhash4;
> > +				spin_lock(&hslot4->lock);
> > +				hlist_nulls_add_head_rcu(&udp_sk(sk)->udp_lrpa_node,
> > +							 &hslot4->nulls_head);
> > +				hslot4->count++;
> > +				spin_unlock(&hslot4->lock);
> > +
> > +				spin_lock(&hslot2->lock);
> > +				udp_hash4_inc(hslot2);
> > +				spin_unlock(&hslot2->lock);
> > +			}
> > +
> >  			if (hslot2 != nhslot2) {
> >  				spin_lock(&hslot2->lock);
> >  				hlist_del_init_rcu(&udp_sk(sk)->udp_portaddr_node);
> > ---
> > 
> > Basically the idea is to leverage the hash4 - which should be not yet
> > initialized when rehash is invoked due to connect().  
> 
> That assumption seems to be correct from my tests.

...but that doesn't help in a general case, because we don't have a
wildcard lookup for four-tuple hashes, more on that below.

> > In such a case, before touching hash{,2}, do hash4.  
> 
> Brilliant, thanks. I'll give that a try.

It sounded like a nice idea and I actually tried quite hard, but it
can't work (so I'm posting a different/simpler fix), mostly for three
reasons (plus a bunch that would require sparse but doable changes):

1. we can't use four-tuple hashes on CONFIG_BASE_SMALL=y, and it would
   be rather weird to leave it unfixed in that case (and, worse, to have
   substantially different behaviours depending on CONFIG_BASE_SMALL).

   At the same time, I see your point about it (from review to v4 of
   the four-tuple hash series), and I don't feel like it's worth adding
   it back also for CONFIG_BASE_SMALL.

   I tried adding some special handling based on a similar concept that
   wouldn't make struct udp_table bigger, but it's strictly more
   complicated than the other fix I'm posting.

2. hash4_cnt is stored in the secondary hash slot, and I see why, but
   that means that if the secondary hash doesn't match, we'll also fail
   the lookup based on four-tuple hash.

   We could introduce a special case in the lookup, perhaps as fallback
   only, ignoring the result of udp_has_hash4(), but it looks rather
   convoluted (especially compared to the fix I'm posting).

3. we would need another version of udp{4,6}_lib_lookup4() (or a branch
   inside it), handling wildcard lookups like udp{4,6}_lib_lookup2()
   does, and then call udp{4,6}_lib_lookup4() a second time with
   INADDR_ANY / &in6addr_any, because we don't know if the receive
   address changed yet, as we're performing the lookup.

So, instead, I'm resorting to the primary hash, as fallback only. If
what we need is a hash that doesn't include the address, such as an
"uninitialised" four-tuple hash, we can as well use the original hash
that doesn't include addresses by design.
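
The property being relied on can be shown with a toy hash (standing in for
the kernel's jhash-based ipv4_portaddr_hash(), names hypothetical): a
port-only hash is invariant across the address change, while the
port-and-address hash is not, so the primary (port-only) chain is the one
slot that covers the socket throughout the transition:

```c
#include <assert.h>

/* Toy mixers standing in for the kernel's primary (port-only) and
 * secondary (port and address) hashes; the real ones use jhash. */
static unsigned int toy_port_hash(unsigned short port)
{
	return port * 2654435761u;
}

static unsigned int toy_portaddr_hash(unsigned int addr, unsigned short port)
{
	/* changes whenever the local address changes */
	return (addr * 2654435761u) ^ port;
}
```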

Patch

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 5eea47f135a4..7f05e7ebc2e4 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -525,6 +525,19 @@  static inline void sk_rcv_saddr_set(struct sock *sk, __be32 addr)
 #endif
 }
 
+static inline void inet_update_saddr(struct sock *sk, void *saddr, int family)
+{
+	if (family == AF_INET) {
+		inet_sk(sk)->inet_saddr = *(__be32 *)saddr;
+		sk_rcv_saddr_set(sk, inet_sk(sk)->inet_saddr);
+	}
+#if IS_ENABLED(CONFIG_IPV6)
+	else {
+		sk->sk_v6_rcv_saddr = *(struct in6_addr *)saddr;
+	}
+#endif
+}
+
 int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 			struct sock *sk, u64 port_offset,
 			int (*check_established)(struct inet_timewait_death_row *,
diff --git a/include/net/sock.h b/include/net/sock.h
index 7464e9f9f47c..1410036e4f5a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1228,6 +1228,7 @@  struct proto {
 	int			(*connect)(struct sock *sk,
 					struct sockaddr *uaddr,
 					int addr_len);
+	void			(*set_rcv_saddr)(struct sock *sk, void *addr);
 	int			(*disconnect)(struct sock *sk, int flags);
 
 	struct sock *		(*accept)(struct sock *sk,
@@ -1269,7 +1270,6 @@  struct proto {
 	/* Keeping track of sk's, looking them up, and port selection methods. */
 	int			(*hash)(struct sock *sk);
 	void			(*unhash)(struct sock *sk);
-	void			(*rehash)(struct sock *sk);
 	int			(*get_port)(struct sock *sk, unsigned short snum);
 	void			(*put_port)(struct sock *sk);
 #ifdef CONFIG_BPF_SYSCALL
diff --git a/include/net/udp.h b/include/net/udp.h
index 6e89520e100d..8283ea5768de 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -302,7 +302,6 @@  static inline int udp_lib_hash(struct sock *sk)
 }
 
 void udp_lib_unhash(struct sock *sk);
-void udp_lib_rehash(struct sock *sk, u16 new_hash, u16 new_hash4);
 u32 udp_ehashfn(const struct net *net, const __be32 laddr, const __u16 lport,
 		const __be32 faddr, const __be16 fport);
 
@@ -411,6 +410,8 @@  int udp_rcv(struct sk_buff *skb);
 int udp_ioctl(struct sock *sk, int cmd, int *karg);
 int udp_init_sock(struct sock *sk);
 int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
+void udp_lib_set_rcv_saddr(struct sock *sk, void *addr, u16 hash, u16 hash4);
+int udp_set_rcv_saddr(struct sock *sk, void *addr);
 int __udp_disconnect(struct sock *sk, int flags);
 int udp_disconnect(struct sock *sk, int flags);
 __poll_t udp_poll(struct file *file, struct socket *sock, poll_table *wait);
diff --git a/net/core/sock.c b/net/core/sock.c
index 74729d20cd00..221c904d870d 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -641,8 +641,16 @@  static int sock_bindtoindex_locked(struct sock *sk, int ifindex)
 	/* Paired with all READ_ONCE() done locklessly. */
 	WRITE_ONCE(sk->sk_bound_dev_if, ifindex);
 
-	if (sk->sk_prot->rehash)
-		sk->sk_prot->rehash(sk);
+	/* Force rehash if protocol needs it */
+	if (sk->sk_prot->set_rcv_saddr) {
+		if (sk->sk_family == AF_INET)
+			sk->sk_prot->set_rcv_saddr(sk, &sk->sk_rcv_saddr);
+#if IS_ENABLED(CONFIG_IPV6)
+		else if (sk->sk_family == AF_INET6)
+			sk->sk_prot->set_rcv_saddr(sk, &sk->sk_v6_rcv_saddr);
+#endif
+	}
+
 	sk_dst_reset(sk);
 
 	ret = 0;
diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index d52333e921f3..3ea3fa94c127 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -64,9 +64,10 @@  int __ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
 	if (!inet->inet_saddr)
 		inet->inet_saddr = fl4->saddr;	/* Update source address */
 	if (!inet->inet_rcv_saddr) {
-		inet->inet_rcv_saddr = fl4->saddr;
-		if (sk->sk_prot->rehash && sk->sk_family == AF_INET)
-			sk->sk_prot->rehash(sk);
+		if (sk->sk_prot->set_rcv_saddr && sk->sk_family == AF_INET)
+			sk->sk_prot->set_rcv_saddr(sk, &fl4->saddr);
+		else
+			inet->inet_rcv_saddr = fl4->saddr;
 	}
 	inet->inet_daddr = fl4->daddr;
 	inet->inet_dport = usin->sin_port;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 9bfcfd016e18..74e6a3604bcf 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -874,19 +874,6 @@  inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
 	return &hinfo->bhash2[hash & (hinfo->bhash_size - 1)];
 }
 
-static void inet_update_saddr(struct sock *sk, void *saddr, int family)
-{
-	if (family == AF_INET) {
-		inet_sk(sk)->inet_saddr = *(__be32 *)saddr;
-		sk_rcv_saddr_set(sk, inet_sk(sk)->inet_saddr);
-	}
-#if IS_ENABLED(CONFIG_IPV6)
-	else {
-		sk->sk_v6_rcv_saddr = *(struct in6_addr *)saddr;
-	}
-#endif
-}
-
 static int __inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family, bool reset)
 {
 	struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 6a01905d379f..8490408f6009 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -639,18 +639,21 @@  struct sock *__udp4_lib_lookup(const struct net *net, __be32 saddr,
 		int sdif, struct udp_table *udptable, struct sk_buff *skb)
 {
 	unsigned short hnum = ntohs(dport);
-	struct udp_hslot *hslot2;
+	struct udp_hslot *hslot, *hslot2;
 	struct sock *result, *sk;
 	unsigned int hash2;
 
+	hslot = udp_hashslot(udptable, net, hnum);
+	spin_lock_bh(&hslot->lock);
+
 	hash2 = ipv4_portaddr_hash(net, daddr, hnum);
 	hslot2 = udp_hashslot2(udptable, hash2);
 
 	if (udp_has_hash4(hslot2)) {
 		result = udp4_lib_lookup4(net, saddr, sport, daddr, hnum,
 					  dif, sdif, udptable);
-		if (result) /* udp4_lib_lookup4 return sk or NULL */
-			return result;
+		if (result) /* udp4_lib_lookup4() returns sk or NULL */
+			goto done;
 	}
 
 	/* Lookup connected or non-wildcard socket */
@@ -684,6 +687,8 @@  struct sock *__udp4_lib_lookup(const struct net *net, __be32 saddr,
 				  htonl(INADDR_ANY), hnum, dif, sdif,
 				  hslot2, skb);
 done:
+	spin_unlock_bh(&hslot->lock);
+
 	if (IS_ERR(result))
 		return NULL;
 	return result;
@@ -2118,10 +2123,12 @@  int __udp_disconnect(struct sock *sk, int flags)
 	sock_rps_reset_rxhash(sk);
 	sk->sk_bound_dev_if = 0;
 	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK)) {
+		if (sk->sk_prot->set_rcv_saddr &&
+		    (sk->sk_userlocks & SOCK_BINDPORT_LOCK)) {
+			sk->sk_prot->set_rcv_saddr(sk, &(__be32){ 0 });
+		}
+
 		inet_reset_saddr(sk);
-		if (sk->sk_prot->rehash &&
-		    (sk->sk_userlocks & SOCK_BINDPORT_LOCK))
-			sk->sk_prot->rehash(sk);
 	}
 
 	if (!(sk->sk_userlocks & SOCK_BINDPORT_LOCK)) {
@@ -2172,25 +2179,37 @@  void udp_lib_unhash(struct sock *sk)
 }
 EXPORT_SYMBOL(udp_lib_unhash);
 
-/*
- * inet_rcv_saddr was changed, we must rehash secondary hash
+/**
+ * udp_lib_set_rcv_saddr() - Set local address and rehash socket atomically
+ * @sk:		Socket changing local address
+ * @addr:	New address, __be32 * or struct in6_addr * depending on family
+ * @hash:	New secondary hash (local port and address) for socket
+ * @hash4:	New 4-tuple hash (local/remote port and address) for socket
+ *
+ * Set local address for socket and rehash it while holding a spinlock on the
+ * primary hash chain (port only). This needs to be atomic to avoid that a
+ * concurrent lookup misses a socket while it's being connected or disconnected.
  */
-void udp_lib_rehash(struct sock *sk, u16 newhash, u16 newhash4)
+void udp_lib_set_rcv_saddr(struct sock *sk, void *addr, u16 hash, u16 hash4)
 {
+	struct udp_hslot *hslot;
+
 	if (sk_hashed(sk)) {
 		struct udp_table *udptable = udp_get_table_prot(sk);
-		struct udp_hslot *hslot, *hslot2, *nhslot2;
+		struct udp_hslot *hslot2, *nhslot2;
 
 		hslot2 = udp_hashslot2(udptable, udp_sk(sk)->udp_portaddr_hash);
-		nhslot2 = udp_hashslot2(udptable, newhash);
-		udp_sk(sk)->udp_portaddr_hash = newhash;
+		nhslot2 = udp_hashslot2(udptable, hash);
+
+		hslot = udp_hashslot(udptable, sock_net(sk),
+				     udp_sk(sk)->udp_port_hash);
+
+		spin_lock_bh(&hslot->lock);
+
+		udp_sk(sk)->udp_portaddr_hash = hash;
 
 		if (hslot2 != nhslot2 ||
 		    rcu_access_pointer(sk->sk_reuseport_cb)) {
-			hslot = udp_hashslot(udptable, sock_net(sk),
-					     udp_sk(sk)->udp_port_hash);
-			/* we must lock primary chain too */
-			spin_lock_bh(&hslot->lock);
 			if (rcu_access_pointer(sk->sk_reuseport_cb))
 				reuseport_detach_sock(sk);
 
@@ -2208,7 +2227,7 @@  void udp_lib_rehash(struct sock *sk, u16 newhash, u16 newhash4)
 			}
 
 			if (udp_hashed4(sk)) {
-				udp_rehash4(udptable, sk, newhash4);
+				udp_rehash4(udptable, sk, hash4);
 
 				if (hslot2 != nhslot2) {
 					spin_lock(&hslot2->lock);
@@ -2220,23 +2239,32 @@  void udp_lib_rehash(struct sock *sk, u16 newhash, u16 newhash4)
 					spin_unlock(&nhslot2->lock);
 				}
 			}
-			spin_unlock_bh(&hslot->lock);
 		}
 	}
+
+	inet_update_saddr(sk, addr, sk->sk_family);
+
+	if (sk_hashed(sk))
+		spin_unlock_bh(&hslot->lock);
 }
-EXPORT_SYMBOL(udp_lib_rehash);
+EXPORT_SYMBOL(udp_lib_set_rcv_saddr);
 
-void udp_v4_rehash(struct sock *sk)
+/**
+ * udp_v4_set_rcv_saddr() - Set local address and new hash for IPv4 socket
+ * @sk:		Socket changing local address
+ * @addr:	New address, pointer to __be32 representation
+ */
+void udp_v4_set_rcv_saddr(struct sock *sk, void *addr)
 {
-	u16 new_hash = ipv4_portaddr_hash(sock_net(sk),
-					  inet_sk(sk)->inet_rcv_saddr,
-					  inet_sk(sk)->inet_num);
-	u16 new_hash4 = udp_ehashfn(sock_net(sk),
-				    sk->sk_rcv_saddr, sk->sk_num,
-				    sk->sk_daddr, sk->sk_dport);
+	u16 hash = ipv4_portaddr_hash(sock_net(sk),
+				      *(__be32 *)addr, inet_sk(sk)->inet_num);
+	u16 hash4 = udp_ehashfn(sock_net(sk),
+				*(__be32 *)addr, sk->sk_num,
+				sk->sk_daddr, sk->sk_dport);
 
-	udp_lib_rehash(sk, new_hash, new_hash4);
+	udp_lib_set_rcv_saddr(sk, addr, hash, hash4);
 }
+EXPORT_SYMBOL(udp_v4_set_rcv_saddr);
 
 static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
@@ -3129,6 +3157,7 @@  struct proto udp_prot = {
 	.close			= udp_lib_close,
 	.pre_connect		= udp_pre_connect,
 	.connect		= udp_connect,
+	.set_rcv_saddr		= udp_v4_set_rcv_saddr,
 	.disconnect		= udp_disconnect,
 	.ioctl			= udp_ioctl,
 	.init			= udp_init_sock,
@@ -3141,7 +3170,6 @@  struct proto udp_prot = {
 	.release_cb		= ip4_datagram_release_cb,
 	.hash			= udp_lib_hash,
 	.unhash			= udp_lib_unhash,
-	.rehash			= udp_v4_rehash,
 	.get_port		= udp_v4_get_port,
 	.put_port		= udp_lib_unhash,
 #ifdef CONFIG_BPF_SYSCALL
diff --git a/net/ipv4/udp_impl.h b/net/ipv4/udp_impl.h
index e1ff3a375996..1c5ad903c064 100644
--- a/net/ipv4/udp_impl.h
+++ b/net/ipv4/udp_impl.h
@@ -10,7 +10,7 @@  int __udp4_lib_rcv(struct sk_buff *, struct udp_table *, int);
 int __udp4_lib_err(struct sk_buff *, u32, struct udp_table *);
 
 int udp_v4_get_port(struct sock *sk, unsigned short snum);
-void udp_v4_rehash(struct sock *sk);
+void udp_v4_set_rcv_saddr(struct sock *sk, void *addr);
 
 int udp_setsockopt(struct sock *sk, int level, int optname, sockptr_t optval,
 		   unsigned int optlen);
diff --git a/net/ipv4/udplite.c b/net/ipv4/udplite.c
index af37af3ab727..4a5c2e080120 100644
--- a/net/ipv4/udplite.c
+++ b/net/ipv4/udplite.c
@@ -57,7 +57,7 @@  struct proto 	udplite_prot = {
 	.recvmsg	   = udp_recvmsg,
 	.hash		   = udp_lib_hash,
 	.unhash		   = udp_lib_unhash,
-	.rehash		   = udp_v4_rehash,
+	.set_rcv_saddr	   = udp_v4_set_rcv_saddr,
 	.get_port	   = udp_v4_get_port,
 
 	.memory_allocated  = &udp_memory_allocated,
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 5c28a11128c7..ec659bc97c04 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -104,10 +104,19 @@  int ip6_datagram_dst_update(struct sock *sk, bool fix_sk_saddr)
 			np->saddr = fl6.saddr;
 
 		if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) {
-			sk->sk_v6_rcv_saddr = fl6.saddr;
-			inet->inet_rcv_saddr = LOOPBACK4_IPV6;
-			if (sk->sk_prot->rehash)
-				sk->sk_prot->rehash(sk);
+			__be32 v4addr = LOOPBACK4_IPV6;
+
+			if (sk->sk_prot->set_rcv_saddr &&
+			    sk->sk_family == AF_INET)
+				sk->sk_prot->set_rcv_saddr(sk, &v4addr);
+			else
+				inet->inet_rcv_saddr = v4addr;
+
+			if (sk->sk_prot->set_rcv_saddr &&
+			    sk->sk_family == AF_INET6)
+				sk->sk_prot->set_rcv_saddr(sk, &fl6.saddr);
+			else
+				sk->sk_v6_rcv_saddr = fl6.saddr;
 		}
 	}
 
@@ -209,10 +218,15 @@  int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr,
 
 		if (ipv6_addr_any(&sk->sk_v6_rcv_saddr) ||
 		    ipv6_mapped_addr_any(&sk->sk_v6_rcv_saddr)) {
-			ipv6_addr_set_v4mapped(inet->inet_rcv_saddr,
-					       &sk->sk_v6_rcv_saddr);
-			if (sk->sk_prot->rehash && sk->sk_family == AF_INET6)
-				sk->sk_prot->rehash(sk);
+			struct in6_addr v4mapped;
+
+			ipv6_addr_set_v4mapped(inet->inet_rcv_saddr, &v4mapped);
+
+			if (sk->sk_prot->set_rcv_saddr &&
+			    sk->sk_family == AF_INET6)
+				sk->sk_prot->set_rcv_saddr(sk, &v4mapped);
+			else
+				sk->sk_v6_rcv_saddr = v4mapped;
 		}
 
 		goto out;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index d766fd798ecf..6f7b13c53941 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -105,25 +105,30 @@  int udp_v6_get_port(struct sock *sk, unsigned short snum)
 	return udp_lib_get_port(sk, snum, hash2_nulladdr);
 }
 
-void udp_v6_rehash(struct sock *sk)
+/**
+ * udp_v6_set_rcv_saddr() - Set local address and new hashes for IPv6 socket
+ * @sk:		Socket changing local address
+ * @addr:	New address, pointer to struct in6_addr
+ */
+void udp_v6_set_rcv_saddr(struct sock *sk, void *addr)
 {
-	u16 new_hash = ipv6_portaddr_hash(sock_net(sk),
-					  &sk->sk_v6_rcv_saddr,
-					  inet_sk(sk)->inet_num);
-	u16 new_hash4;
+	u16 hash = ipv6_portaddr_hash(sock_net(sk),
+				      addr, inet_sk(sk)->inet_num);
+	u16 hash4;
 
 	if (ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr)) {
-		new_hash4 = udp_ehashfn(sock_net(sk),
-					sk->sk_rcv_saddr, sk->sk_num,
-					sk->sk_daddr, sk->sk_dport);
+		hash4 = udp_ehashfn(sock_net(sk),
+				    sk->sk_rcv_saddr, sk->sk_num,
+				    sk->sk_daddr, sk->sk_dport);
 	} else {
-		new_hash4 = udp6_ehashfn(sock_net(sk),
-					 &sk->sk_v6_rcv_saddr, sk->sk_num,
-					 &sk->sk_v6_daddr, sk->sk_dport);
+		hash4 = udp6_ehashfn(sock_net(sk),
+				     addr, sk->sk_num,
+				     &sk->sk_v6_daddr, sk->sk_dport);
 	}
 
-	udp_lib_rehash(sk, new_hash, new_hash4);
+	udp_lib_set_rcv_saddr(sk, addr, hash, hash4);
 }
+EXPORT_SYMBOL(udp_v6_set_rcv_saddr);
 
 static int compute_score(struct sock *sk, const struct net *net,
 			 const struct in6_addr *saddr, __be16 sport,
@@ -1860,6 +1865,7 @@  struct proto udpv6_prot = {
 	.close			= udp_lib_close,
 	.pre_connect		= udpv6_pre_connect,
 	.connect		= udpv6_connect,
+	.set_rcv_saddr		= udp_v6_set_rcv_saddr,
 	.disconnect		= udp_disconnect,
 	.ioctl			= udp_ioctl,
 	.init			= udpv6_init_sock,
@@ -1872,7 +1878,6 @@  struct proto udpv6_prot = {
 	.release_cb		= ip6_datagram_release_cb,
 	.hash			= udp_lib_hash,
 	.unhash			= udp_lib_unhash,
-	.rehash			= udp_v6_rehash,
 	.get_port		= udp_v6_get_port,
 	.put_port		= udp_lib_unhash,
 #ifdef CONFIG_BPF_SYSCALL
diff --git a/net/ipv6/udp_impl.h b/net/ipv6/udp_impl.h
index 0590f566379d..45166e56ef12 100644
--- a/net/ipv6/udp_impl.h
+++ b/net/ipv6/udp_impl.h
@@ -14,7 +14,7 @@  int __udp6_lib_err(struct sk_buff *, struct inet6_skb_parm *, u8, u8, int,
 
 int udpv6_init_sock(struct sock *sk);
 int udp_v6_get_port(struct sock *sk, unsigned short snum);
-void udp_v6_rehash(struct sock *sk);
+void udp_v6_set_rcv_saddr(struct sock *sk, void *addr);
 
 int udpv6_getsockopt(struct sock *sk, int level, int optname,
 		     char __user *optval, int __user *optlen);
diff --git a/net/ipv6/udplite.c b/net/ipv6/udplite.c
index a60bec9b14f1..597ca4558b3f 100644
--- a/net/ipv6/udplite.c
+++ b/net/ipv6/udplite.c
@@ -56,7 +56,7 @@  struct proto udplitev6_prot = {
 	.recvmsg	   = udpv6_recvmsg,
 	.hash		   = udp_lib_hash,
 	.unhash		   = udp_lib_unhash,
-	.rehash		   = udp_v6_rehash,
+	.set_rcv_saddr	   = udp_v6_set_rcv_saddr,
 	.get_port	   = udp_v6_get_port,
 
 	.memory_allocated  = &udp_memory_allocated,