diff mbox series

[v3,bpf-next,4/5,RFC] udp: Fix destroying UDP listening sockets

Message ID 20230321184541.1857363-5-aditi.ghag@isovalent.com (mailing list archive)
State RFC
Delegated to: BPF
Headers show
Series bpf-next: Add socket destroy capability | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for bpf-next
netdev/apply fail Patch does not apply to bpf-next
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-5 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-7 success Logs for llvm-toolchain
bpf/vmtest-bpf-next-VM_Test-8 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_maps on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-11 success Logs for test_maps on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-14 success Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-15 success Logs for test_progs on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-16 success Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-19 success Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for test_progs_no_alu32 on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-21 success Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-24 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for test_progs_no_alu32_parallel on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-26 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-27 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-28 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-29 success Logs for test_progs_parallel on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-30 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-31 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-32 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-33 success Logs for test_verifier on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-34 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-35 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-36 success Logs for test_verifier on x86_64 with llvm-16

Commit Message

Aditi Ghag March 21, 2023, 6:45 p.m. UTC
Previously, UDP listening sockets that bind'ed to a port
weren't getting properly destroyed via udp_abort.
Specifically, the sockets were left in the UDP hash table with
unset source port.
Fix the issue by unconditionally unhashing and resetting source
port for sockets that are getting destroyed. This would mean
that in case of sockets listening on wildcarded address and
on a specific address with a common port, users would have to
explicitly select the socket(s) they intend to destroy.

Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 net/ipv4/udp.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

Comments

Martin KaFai Lau March 22, 2023, 12:29 a.m. UTC | #1
On 3/21/23 11:45 AM, Aditi Ghag wrote:
> Previously, UDP listening sockets that bind'ed to a port
> weren't getting properly destroyed via udp_abort.
> Specifically, the sockets were left in the UDP hash table with
> unset source port.
> Fix the issue by unconditionally unhashing and resetting source
> port for sockets that are getting destroyed. This would mean
> that in case of sockets listening on wildcarded address and
> on a specific address with a common port, users would have to
> explicitly select the socket(s) they intend to destroy.
> 
> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> ---
>   net/ipv4/udp.c | 21 ++++++++++++++++++++-
>   1 file changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 02d357713838..a495ac88fcee 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1965,6 +1965,25 @@ int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>   }
>   EXPORT_SYMBOL(udp_pre_connect);
>   
> +int __udp_disconnect_with_abort(struct sock *sk)
> +{
> +	struct inet_sock *inet = inet_sk(sk);
> +
> +	sk->sk_state = TCP_CLOSE;
> +	inet->inet_daddr = 0;
> +	inet->inet_dport = 0;
> +	sock_rps_reset_rxhash(sk);
> +	sk->sk_bound_dev_if = 0;
> +	inet_reset_saddr(sk);
> +	inet->inet_sport = 0;
> +	sk_dst_reset(sk);
> +	/* (TBD) In case of sockets listening on wildcard and specific address
> +	 * with a common port, socket will be removed from {hash, hash2} table.
> +	 */
> +	sk->sk_prot->unhash(sk);

hmm... not sure if I understand the use case. The idea is to enforce the user 
space to bind() again when it gets error from read(fd) because the source 
ip/port needs to change when sending to another dst IP/port? Does it have a 
usage example in the selftests?

> +	return 0;
> +}
> +
>   int __udp_disconnect(struct sock *sk, int flags)
>   {
>   	struct inet_sock *inet = inet_sk(sk);
> @@ -2937,7 +2956,7 @@ int udp_abort(struct sock *sk, int err)
>   
>   	sk->sk_err = err;
>   	sk_error_report(sk);
> -	__udp_disconnect(sk, 0);
> +	__udp_disconnect_with_abort(sk);
>   
>   out:
>   	if (!has_current_bpf_ctx())
Aditi Ghag March 22, 2023, 12:59 a.m. UTC | #2
> On Mar 21, 2023, at 5:29 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 3/21/23 11:45 AM, Aditi Ghag wrote:
>> Previously, UDP listening sockets that bind'ed to a port
>> weren't getting properly destroyed via udp_abort.
>> Specifically, the sockets were left in the UDP hash table with
>> unset source port.
>> Fix the issue by unconditionally unhashing and resetting source
>> port for sockets that are getting destroyed. This would mean
>> that in case of sockets listening on wildcarded address and
>> on a specific address with a common port, users would have to
>> explicitly select the socket(s) they intend to destroy.
>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>> ---
>>  net/ipv4/udp.c | 21 ++++++++++++++++++++-
>>  1 file changed, 20 insertions(+), 1 deletion(-)
>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> index 02d357713838..a495ac88fcee 100644
>> --- a/net/ipv4/udp.c
>> +++ b/net/ipv4/udp.c
>> @@ -1965,6 +1965,25 @@ int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>>  }
>>  EXPORT_SYMBOL(udp_pre_connect);
>>  +int __udp_disconnect_with_abort(struct sock *sk)
>> +{
>> +	struct inet_sock *inet = inet_sk(sk);
>> +
>> +	sk->sk_state = TCP_CLOSE;
>> +	inet->inet_daddr = 0;
>> +	inet->inet_dport = 0;
>> +	sock_rps_reset_rxhash(sk);
>> +	sk->sk_bound_dev_if = 0;
>> +	inet_reset_saddr(sk);
>> +	inet->inet_sport = 0;
>> +	sk_dst_reset(sk);
>> +	/* (TBD) In case of sockets listening on wildcard and specific address
>> +	 * with a common port, socket will be removed from {hash, hash2} table.
>> +	 */
>> +	sk->sk_prot->unhash(sk);
> 
> hmm... not sure if I understand the use case. The idea is to enforce the user space to bind() again when it gets error from read(fd) because the source ip/port needs to change when sending to another dst IP/port?


> Does it have a usage example in the selftests?

Yes, there is a new selftest case where I intend to exercise the UDP sockets batching changes (check the udp_server test case). Well, the Cilium use case is to destroy client sockets (the selftests from v1/v2 patch mirror the use case), but we would want to be able destroy listening sockets too since we don't have any code preventing that?   

I expected when UDP listening server sockets are destroyed, they are removed from the hash table, and a subsequent bind on the overlapping port would succeed? At least, I observed similar behavior for TCP sockets (minus the bind part, of course) in the test, and the connected client sockets were reset when the server sockets were destroyed. That's not what I observed for UDP listening sockets though (shared the debugging notes in the v2 patch [1]). 

[1] https://lore.kernel.org/bpf/FB695169-4640-4E50-901D-84CF145765F2@isovalent.com/T/#u

> 
>> +	return 0;
>> +}
>> +
>>  int __udp_disconnect(struct sock *sk, int flags)
>>  {
>>  	struct inet_sock *inet = inet_sk(sk);
>> @@ -2937,7 +2956,7 @@ int udp_abort(struct sock *sk, int err)
>>    	sk->sk_err = err;
>>  	sk_error_report(sk);
>> -	__udp_disconnect(sk, 0);
>> +	__udp_disconnect_with_abort(sk);
>>    out:
>>  	if (!has_current_bpf_ctx())
Martin KaFai Lau March 22, 2023, 10:55 p.m. UTC | #3
On 3/21/23 5:59 PM, Aditi Ghag wrote:
> 
> 
>> On Mar 21, 2023, at 5:29 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 3/21/23 11:45 AM, Aditi Ghag wrote:
>>> Previously, UDP listening sockets that bind'ed to a port
>>> weren't getting properly destroyed via udp_abort.
>>> Specifically, the sockets were left in the UDP hash table with
>>> unset source port.
>>> Fix the issue by unconditionally unhashing and resetting source
>>> port for sockets that are getting destroyed. This would mean
>>> that in case of sockets listening on wildcarded address and
>>> on a specific address with a common port, users would have to
>>> explicitly select the socket(s) they intend to destroy.
>>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>>> ---
>>>   net/ipv4/udp.c | 21 ++++++++++++++++++++-
>>>   1 file changed, 20 insertions(+), 1 deletion(-)
>>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>>> index 02d357713838..a495ac88fcee 100644
>>> --- a/net/ipv4/udp.c
>>> +++ b/net/ipv4/udp.c
>>> @@ -1965,6 +1965,25 @@ int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>>>   }
>>>   EXPORT_SYMBOL(udp_pre_connect);
>>>   +int __udp_disconnect_with_abort(struct sock *sk)
>>> +{
>>> +	struct inet_sock *inet = inet_sk(sk);
>>> +
>>> +	sk->sk_state = TCP_CLOSE;
>>> +	inet->inet_daddr = 0;
>>> +	inet->inet_dport = 0;
>>> +	sock_rps_reset_rxhash(sk);
>>> +	sk->sk_bound_dev_if = 0;
>>> +	inet_reset_saddr(sk);
>>> +	inet->inet_sport = 0;
>>> +	sk_dst_reset(sk);
>>> +	/* (TBD) In case of sockets listening on wildcard and specific address
>>> +	 * with a common port, socket will be removed from {hash, hash2} table.
>>> +	 */
>>> +	sk->sk_prot->unhash(sk);
>>
>> hmm... not sure if I understand the use case. The idea is to enforce the user space to bind() again when it gets error from read(fd) because the source ip/port needs to change when sending to another dst IP/port?
> 
> 
>> Does it have a usage example in the selftests?
> 
> Yes, there is a new selftest case where I intend to exercise the UDP sockets batching changes (check the udp_server test case). Well, the Cilium use case is to destroy client sockets (the selftests from v1/v2 patch mirror the use case), but we would want to be able destroy listening sockets too since we don't have any code preventing that?
> 
> I expected when UDP listening server sockets are destroyed, they are removed from the hash table, and a subsequent bind on the overlapping port would succeed? At least, I observed similar behavior for TCP sockets (minus the bind part, of course) in the test, and the connected client sockets were reset when the server sockets were destroyed. That's not what I observed for UDP listening sockets though (shared the debugging notes in the v2 patch [1]).

The tcp 'clien', from 'connect_to_fd()', was not bind() to a particular local ip 
and port. afaik, the tcp server which binds to a particular ip and port will 
observe similar behavior as the udp server.

When the user space notices a read() error from a UDP server socket (because of 
udp_abort), should the user space close() this udp server socket first before 
opening a new one and then bind to a different src ip and src port?
or I am still missing some pieces in the use case where there is other cgroup 
bpf programs doing bind?


> 
> [1] https://lore.kernel.org/bpf/FB695169-4640-4E50-901D-84CF145765F2@isovalent.com/T/#u
Aditi Ghag March 23, 2023, 12:32 a.m. UTC | #4
> On Mar 22, 2023, at 3:55 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 3/21/23 5:59 PM, Aditi Ghag wrote:
>>> On Mar 21, 2023, at 5:29 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>> 
>>> On 3/21/23 11:45 AM, Aditi Ghag wrote:
>>>> Previously, UDP listening sockets that bind'ed to a port
>>>> weren't getting properly destroyed via udp_abort.
>>>> Specifically, the sockets were left in the UDP hash table with
>>>> unset source port.
>>>> Fix the issue by unconditionally unhashing and resetting source
>>>> port for sockets that are getting destroyed. This would mean
>>>> that in case of sockets listening on wildcarded address and
>>>> on a specific address with a common port, users would have to
>>>> explicitly select the socket(s) they intend to destroy.
>>>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>>>> ---
>>>>  net/ipv4/udp.c | 21 ++++++++++++++++++++-
>>>>  1 file changed, 20 insertions(+), 1 deletion(-)
>>>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>>>> index 02d357713838..a495ac88fcee 100644
>>>> --- a/net/ipv4/udp.c
>>>> +++ b/net/ipv4/udp.c
>>>> @@ -1965,6 +1965,25 @@ int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>>>>  }
>>>>  EXPORT_SYMBOL(udp_pre_connect);
>>>>  +int __udp_disconnect_with_abort(struct sock *sk)
>>>> +{
>>>> +	struct inet_sock *inet = inet_sk(sk);
>>>> +
>>>> +	sk->sk_state = TCP_CLOSE;
>>>> +	inet->inet_daddr = 0;
>>>> +	inet->inet_dport = 0;
>>>> +	sock_rps_reset_rxhash(sk);
>>>> +	sk->sk_bound_dev_if = 0;
>>>> +	inet_reset_saddr(sk);
>>>> +	inet->inet_sport = 0;
>>>> +	sk_dst_reset(sk);
>>>> +	/* (TBD) In case of sockets listening on wildcard and specific address
>>>> +	 * with a common port, socket will be removed from {hash, hash2} table.
>>>> +	 */
>>>> +	sk->sk_prot->unhash(sk);
>>> 
>>> hmm... not sure if I understand the use case. The idea is to enforce the user space to bind() again when it gets error from read(fd) because the source ip/port needs to change when sending to another dst IP/port?
>>> Does it have a usage example in the selftests?
>> Yes, there is a new selftest case where I intend to exercise the UDP sockets batching changes (check the udp_server test case). Well, the Cilium use case is to destroy client sockets (the selftests from v1/v2 patch mirror the use case), but we would want to be able destroy listening sockets too since we don't have any code preventing that?
>> I expected when UDP listening server sockets are destroyed, they are removed from the hash table, and a subsequent bind on the overlapping port would succeed? At least, I observed similar behavior for TCP sockets (minus the bind part, of course) in the test, and the connected client sockets were reset when the server sockets were destroyed. That's not what I observed for UDP listening sockets though (shared the debugging notes in the v2 patch [1]).
> 
> The tcp 'clien', from 'connect_to_fd()', was not bind() to a particular local ip and port. afaik, the tcp server which binds to a particular ip and port will observe similar behavior as the udp server.
> 
> When the user space notices a read() error from a UDP server socket (because of udp_abort), should the user space close() this udp server socket first before opening a new one and then bind to a different src ip and src port?
> or I am still missing some pieces in the use case where there is other cgroup bpf programs doing bind?

I'm not sure if we are talking about the same selftests. It's possible that the new server related selftests are not validating the behavior in the right way.

Let's take an example of the test_udp_server test. It does the following - 

1) Start SO_REUSEPORT servers. (I hit same issue for regular UDP listening sockets as well.)
2) Run BPF iterators that destroys the server sockets.
3) Start a regular server that binds on the same port as the ones from (1) with the expectation that it succeeds after (1) sockets were destroyed. The server fails to bind. Moreover, the UDP sockets were lingering in the kernel hash table without the fix in one of the commits. 

Are you suggesting that (3) is expected?  The destroyed sockets are expected to be also present in the hash table? Are you saying that userspace needs to first close the sockets that were destroyed in the kernel when they encounter read() error? 

> 
> 
>> [1] https://lore.kernel.org/bpf/FB695169-4640-4E50-901D-84CF145765F2@isovalent.com/T/#u
Martin KaFai Lau March 23, 2023, 1:08 a.m. UTC | #5
On 3/22/23 5:32 PM, Aditi Ghag wrote:
> 
> 
>> On Mar 22, 2023, at 3:55 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 3/21/23 5:59 PM, Aditi Ghag wrote:
>>>> On Mar 21, 2023, at 5:29 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>
>>>> On 3/21/23 11:45 AM, Aditi Ghag wrote:
>>>>> Previously, UDP listening sockets that bind'ed to a port
>>>>> weren't getting properly destroyed via udp_abort.
>>>>> Specifically, the sockets were left in the UDP hash table with
>>>>> unset source port.
>>>>> Fix the issue by unconditionally unhashing and resetting source
>>>>> port for sockets that are getting destroyed. This would mean
>>>>> that in case of sockets listening on wildcarded address and
>>>>> on a specific address with a common port, users would have to
>>>>> explicitly select the socket(s) they intend to destroy.
>>>>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>>>>> ---
>>>>>   net/ipv4/udp.c | 21 ++++++++++++++++++++-
>>>>>   1 file changed, 20 insertions(+), 1 deletion(-)
>>>>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>>>>> index 02d357713838..a495ac88fcee 100644
>>>>> --- a/net/ipv4/udp.c
>>>>> +++ b/net/ipv4/udp.c
>>>>> @@ -1965,6 +1965,25 @@ int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>>>>>   }
>>>>>   EXPORT_SYMBOL(udp_pre_connect);
>>>>>   +int __udp_disconnect_with_abort(struct sock *sk)
>>>>> +{
>>>>> +	struct inet_sock *inet = inet_sk(sk);
>>>>> +
>>>>> +	sk->sk_state = TCP_CLOSE;
>>>>> +	inet->inet_daddr = 0;
>>>>> +	inet->inet_dport = 0;
>>>>> +	sock_rps_reset_rxhash(sk);
>>>>> +	sk->sk_bound_dev_if = 0;
>>>>> +	inet_reset_saddr(sk);
>>>>> +	inet->inet_sport = 0;
>>>>> +	sk_dst_reset(sk);
>>>>> +	/* (TBD) In case of sockets listening on wildcard and specific address
>>>>> +	 * with a common port, socket will be removed from {hash, hash2} table.
>>>>> +	 */
>>>>> +	sk->sk_prot->unhash(sk);
>>>>
>>>> hmm... not sure if I understand the use case. The idea is to enforce the user space to bind() again when it gets error from read(fd) because the source ip/port needs to change when sending to another dst IP/port?
>>>> Does it have a usage example in the selftests?
>>> Yes, there is a new selftest case where I intend to exercise the UDP sockets batching changes (check the udp_server test case). Well, the Cilium use case is to destroy client sockets (the selftests from v1/v2 patch mirror the use case), but we would want to be able destroy listening sockets too since we don't have any code preventing that?
>>> I expected when UDP listening server sockets are destroyed, they are removed from the hash table, and a subsequent bind on the overlapping port would succeed? At least, I observed similar behavior for TCP sockets (minus the bind part, of course) in the test, and the connected client sockets were reset when the server sockets were destroyed. That's not what I observed for UDP listening sockets though (shared the debugging notes in the v2 patch [1]).
>>
>> The tcp 'clien', from 'connect_to_fd()', was not bind() to a particular local ip and port. afaik, the tcp server which binds to a particular ip and port will observe similar behavior as the udp server.
>>
>> When the user space notices a read() error from a UDP server socket (because of udp_abort), should the user space close() this udp server socket first before opening a new one and then bind to a different src ip and src port?
>> or I am still missing some pieces in the use case where there is other cgroup bpf programs doing bind?
> 
> I'm not sure if we are talking about the same selftests. It's possible that the new server related selftests are not validating the behavior in the right way.
> 
> Let's take an example of the test_udp_server test. It does the following -

Yep, I was looking at the test_udp_server test.

> 
> 1) Start SO_REUSEPORT servers. (I hit same issue for regular UDP listening sockets as well.)

Note that UDP has no listen(). It is just a bind().

> 2) Run BPF iterators that destroys the server sockets.

destroy does not mean it is gone from the kernel space or user space. User space 
still has a fd.

> 3) Start a regular server that binds on the same port as the ones from (1) with the expectation that it succeeds after (1) sockets were destroyed. The server fails to bind. Moreover, the UDP sockets were lingering in the kernel hash table without the fix in one of the commits.
> 
> Are you suggesting that (3) is expected?  The destroyed sockets are expected to be also present in the hash table? Are you saying that userspace needs to first close the sockets that were destroyed in the kernel when they encounter read() error?

Yes, (3) is expected (at least the current abort behavior). The user space still 
holds the fd of (1). The user space did bind(fd1) to request for a specific 
local addr/port and the user space has not closed it yet. The current udp_abort 
does not undo this ip/port binding which is a sensible thing, imo.

I have been answering a lot of questions. May be time to go back to my earlier 
question, why the user space server cannot do a close on the fd after the 
previous read() return ECONNABORTED? It seems to be the most sensible server 
retry logic when getting ECONNABORTED and then open another socket to do bind() 
again. test_udp_server() is also creating a new socket to do the bind() after 
the earlier the kfunc destroy is done, so the old fd1 is supposed to be useless 
anyway.
Aditi Ghag March 23, 2023, 1:35 a.m. UTC | #6
> On Mar 22, 2023, at 6:08 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 3/22/23 5:32 PM, Aditi Ghag wrote:
>>> On Mar 22, 2023, at 3:55 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>> 
>>> On 3/21/23 5:59 PM, Aditi Ghag wrote:
>>>>> On Mar 21, 2023, at 5:29 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>> 
>>>>> On 3/21/23 11:45 AM, Aditi Ghag wrote:
>>>>>> Previously, UDP listening sockets that bind'ed to a port
>>>>>> weren't getting properly destroyed via udp_abort.
>>>>>> Specifically, the sockets were left in the UDP hash table with
>>>>>> unset source port.
>>>>>> Fix the issue by unconditionally unhashing and resetting source
>>>>>> port for sockets that are getting destroyed. This would mean
>>>>>> that in case of sockets listening on wildcarded address and
>>>>>> on a specific address with a common port, users would have to
>>>>>> explicitly select the socket(s) they intend to destroy.
>>>>>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>>>>>> ---
>>>>>>  net/ipv4/udp.c | 21 ++++++++++++++++++++-
>>>>>>  1 file changed, 20 insertions(+), 1 deletion(-)
>>>>>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>>>>>> index 02d357713838..a495ac88fcee 100644
>>>>>> --- a/net/ipv4/udp.c
>>>>>> +++ b/net/ipv4/udp.c
>>>>>> @@ -1965,6 +1965,25 @@ int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>>>>>>  }
>>>>>>  EXPORT_SYMBOL(udp_pre_connect);
>>>>>>  +int __udp_disconnect_with_abort(struct sock *sk)
>>>>>> +{
>>>>>> +	struct inet_sock *inet = inet_sk(sk);
>>>>>> +
>>>>>> +	sk->sk_state = TCP_CLOSE;
>>>>>> +	inet->inet_daddr = 0;
>>>>>> +	inet->inet_dport = 0;
>>>>>> +	sock_rps_reset_rxhash(sk);
>>>>>> +	sk->sk_bound_dev_if = 0;
>>>>>> +	inet_reset_saddr(sk);
>>>>>> +	inet->inet_sport = 0;
>>>>>> +	sk_dst_reset(sk);
>>>>>> +	/* (TBD) In case of sockets listening on wildcard and specific address
>>>>>> +	 * with a common port, socket will be removed from {hash, hash2} table.
>>>>>> +	 */
>>>>>> +	sk->sk_prot->unhash(sk);
>>>>> 
>>>>> hmm... not sure if I understand the use case. The idea is to enforce the user space to bind() again when it gets error from read(fd) because the source ip/port needs to change when sending to another dst IP/port?
>>>>> Does it have a usage example in the selftests?
>>>> Yes, there is a new selftest case where I intend to exercise the UDP sockets batching changes (check the udp_server test case). Well, the Cilium use case is to destroy client sockets (the selftests from v1/v2 patch mirror the use case), but we would want to be able destroy listening sockets too since we don't have any code preventing that?
>>>> I expected when UDP listening server sockets are destroyed, they are removed from the hash table, and a subsequent bind on the overlapping port would succeed? At least, I observed similar behavior for TCP sockets (minus the bind part, of course) in the test, and the connected client sockets were reset when the server sockets were destroyed. That's not what I observed for UDP listening sockets though (shared the debugging notes in the v2 patch [1]).
>>> 
>>> The tcp 'clien', from 'connect_to_fd()', was not bind() to a particular local ip and port. afaik, the tcp server which binds to a particular ip and port will observe similar behavior as the udp server.
>>> 
>>> When the user space notices a read() error from a UDP server socket (because of udp_abort), should the user space close() this udp server socket first before opening a new one and then bind to a different src ip and src port?
>>> or I am still missing some pieces in the use case where there is other cgroup bpf programs doing bind?
>> I'm not sure if we are talking about the same selftests. It's possible that the new server related selftests are not validating the behavior in the right way.
>> Let's take an example of the test_udp_server test. It does the following -
> 
> Yep, I was looking at the test_udp_server test.
> 
>> 1) Start SO_REUSEPORT servers. (I hit same issue for regular UDP listening sockets as well.)
> 
> Note that UDP has no listen(). It is just a bind().
> 
>> 2) Run BPF iterators that destroys the server sockets.
> 
> destroy does not mean it is gone from the kernel space or user space. User space still has a fd.
> 
>> 3) Start a regular server that binds on the same port as the ones from (1) with the expectation that it succeeds after (1) sockets were destroyed. The server fails to bind. Moreover, the UDP sockets were lingering in the kernel hash table without the fix in one of the commits.
>> Are you suggesting that (3) is expected?  The destroyed sockets are expected to be also present in the hash table? Are you saying that userspace needs to first close the sockets that were destroyed in the kernel when they encounter read() error?
> 
> Yes, (3) is expected (at least the current abort behavior). The user space still holds the fd of (1). The user space did bind(fd1) to request for a specific local addr/port and the user space has not closed it yet. The current udp_abort does not undo this ip/port binding which is a sensible thing, imo.

Got it. 

> 
> I have been answering a lot of questions. May be time to go back to my earlier question,
> why the user space server cannot do a close on the fd after the previous read() return ECONNABORTED? It seems to be the most sensible server retry logic when getting ECONNABORTED and then open another socket to do bind() again. test_udp_server() is also creating a new socket to do the bind() after the earlier the kfunc destroy is done, so the old fd1 is supposed to be useless anyway.


Thanks for the clarifications. The client selftests are doing the same validations: check if read() fails, and check for ECONNABORTED error code. So yes, we can do the same for servers as well.
I've been on the same page with userspace checking for error codes, or failures in the socket read call. It's just the bind case that tripped me: I expected the kernel to free up source ports that the destroyed sockets had previously called bind on.

I'll update the server tests.
Aditi Ghag March 23, 2023, 6:14 a.m. UTC | #7
> On Mar 22, 2023, at 6:35 PM, Aditi Ghag <aditi.ghag@isovalent.com> wrote:
> 
> 
> 
>> On Mar 22, 2023, at 6:08 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>> 
>> On 3/22/23 5:32 PM, Aditi Ghag wrote:
>>>> On Mar 22, 2023, at 3:55 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>> 
>>>> On 3/21/23 5:59 PM, Aditi Ghag wrote:
>>>>>> On Mar 21, 2023, at 5:29 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>>> 
>>>>>> On 3/21/23 11:45 AM, Aditi Ghag wrote:
>>>>>>> Previously, UDP listening sockets that bind'ed to a port
>>>>>>> weren't getting properly destroyed via udp_abort.
>>>>>>> Specifically, the sockets were left in the UDP hash table with
>>>>>>> unset source port.
>>>>>>> Fix the issue by unconditionally unhashing and resetting source
>>>>>>> port for sockets that are getting destroyed. This would mean
>>>>>>> that in case of sockets listening on wildcarded address and
>>>>>>> on a specific address with a common port, users would have to
>>>>>>> explicitly select the socket(s) they intend to destroy.
>>>>>>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>>>>>>> ---
>>>>>>> net/ipv4/udp.c | 21 ++++++++++++++++++++-
>>>>>>> 1 file changed, 20 insertions(+), 1 deletion(-)
>>>>>>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>>>>>>> index 02d357713838..a495ac88fcee 100644
>>>>>>> --- a/net/ipv4/udp.c
>>>>>>> +++ b/net/ipv4/udp.c
>>>>>>> @@ -1965,6 +1965,25 @@ int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>>>>>>> }
>>>>>>> EXPORT_SYMBOL(udp_pre_connect);
>>>>>>> +int __udp_disconnect_with_abort(struct sock *sk)
>>>>>>> +{
>>>>>>> +	struct inet_sock *inet = inet_sk(sk);
>>>>>>> +
>>>>>>> +	sk->sk_state = TCP_CLOSE;
>>>>>>> +	inet->inet_daddr = 0;
>>>>>>> +	inet->inet_dport = 0;
>>>>>>> +	sock_rps_reset_rxhash(sk);
>>>>>>> +	sk->sk_bound_dev_if = 0;
>>>>>>> +	inet_reset_saddr(sk);
>>>>>>> +	inet->inet_sport = 0;
>>>>>>> +	sk_dst_reset(sk);
>>>>>>> +	/* (TBD) In case of sockets listening on wildcard and specific address
>>>>>>> +	 * with a common port, socket will be removed from {hash, hash2} table.
>>>>>>> +	 */
>>>>>>> +	sk->sk_prot->unhash(sk);
>>>>>> 
>>>>>> hmm... not sure if I understand the use case. The idea is to enforce the user space to bind() again when it gets error from read(fd) because the source ip/port needs to change when sending to another dst IP/port?
>>>>>> Does it have a usage example in the selftests?
>>>>> Yes, there is a new selftest case where I intend to exercise the UDP sockets batching changes (check the udp_server test case). Well, the Cilium use case is to destroy client sockets (the selftests from v1/v2 patch mirror the use case), but we would want to be able destroy listening sockets too since we don't have any code preventing that?
>>>>> I expected when UDP listening server sockets are destroyed, they are removed from the hash table, and a subsequent bind on the overlapping port would succeed? At least, I observed similar behavior for TCP sockets (minus the bind part, of course) in the test, and the connected client sockets were reset when the server sockets were destroyed. That's not what I observed for UDP listening sockets though (shared the debugging notes in the v2 patch [1]).
>>>> 
>>>> The tcp 'clien', from 'connect_to_fd()', was not bind() to a particular local ip and port. afaik, the tcp server which binds to a particular ip and port will observe similar behavior as the udp server.
>>>> 
>>>> When the user space notices a read() error from a UDP server socket (because of udp_abort), should the user space close() this udp server socket first before opening a new one and then bind to a different src ip and src port?
>>>> or I am still missing some pieces in the use case where there is other cgroup bpf programs doing bind?
>>> I'm not sure if we are talking about the same selftests. It's possible that the new server related selftests are not validating the behavior in the right way.
>>> Let's take an example of the test_udp_server test. It does the following -
>> 
>> Yep, I was looking at the test_udp_server test.
>> 
>>> 1) Start SO_REUSEPORT servers. (I hit same issue for regular UDP listening sockets as well.)
>> 
>> Note that UDP has no listen(). It is just a bind().
>> 
>>> 2) Run BPF iterators that destroys the server sockets.
>> 
>> destroy does not mean it is gone from the kernel space or user space. User space still has a fd.
>> 
>>> 3) Start a regular server that binds on the same port as the ones from (1) with the expectation that it succeeds after (1) sockets were destroyed. The server fails to bind. Moreover, the UDP sockets were lingering in the kernel hash table without the fix in one of the commits.
>>> Are you suggesting that (3) is expected?  The destroyed sockets are expected to be also present in the hash table? Are you saying that userspace needs to first close the sockets that were destroyed in the kernel when they encounter read() error?
>> 
>> Yes, (3) is expected (at least the current abort behavior). The user space still holds the fd of (1). The user space did bind(fd1) to request for a specific local addr/port and the user space has not closed it yet. The current udp_abort does not undo this ip/port binding which is a sensible thing, imo.
> 
> Got it. 
> 
>> 
>> I have been answering a lot of questions. May be time to go back to my earlier question,
>> why the user space server cannot do a close on the fd after the previous read() return ECONNABORTED? It seems to be the most sensible server retry logic when getting ECONNABORTED and then open another socket to do bind() again. test_udp_server() is also creating a new socket to do the bind() after the earlier the kfunc destroy is done, so the old fd1 is supposed to be useless anyway.
> 
> 
> Thanks for the clarifications. The client selftests are doing the same validations: check if read() fails, and check for ECONNABORTED error code. So yes, we can do the same for servers as well.
> I've been on the same page with userspace checking for error codes, or failures in the socket read call. It's just the bind case that tripped me: I expected the kernel to free up source ports that the destroyed sockets had previously called bind on.
> 
> I'll update the server tests. 


Just to close the loop on this: I updated the server tests to check for errno == ECONNABORTED, and they pass (no surprises there!). (Just to clarify, read() doesn't return ECONNABORTED, but I presume you meant errno. ) The more interesting scenarios to consider are asynchronous I/O (e.g., epoll) that can be non-blocking. In case of aborted sockets, EPOLLERR should bubble up the error condition on an fd to userspace applications. I'm not seeing related code in udp_poll that checks for sk_err though. Am I missing something, or not looking at the right place?
diff mbox series

Patch

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 02d357713838..a495ac88fcee 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1965,6 +1965,25 @@  int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 }
 EXPORT_SYMBOL(udp_pre_connect);
 
+int __udp_disconnect_with_abort(struct sock *sk)
+{
+	struct inet_sock *inet = inet_sk(sk);
+
+	sk->sk_state = TCP_CLOSE;
+	inet->inet_daddr = 0;
+	inet->inet_dport = 0;
+	sock_rps_reset_rxhash(sk);
+	sk->sk_bound_dev_if = 0;
+	inet_reset_saddr(sk);
+	inet->inet_sport = 0;
+	sk_dst_reset(sk);
+	/* (TBD) In case of sockets listening on wildcard and specific address
+	 * with a common port, socket will be removed from {hash, hash2} table.
+	 */
+	sk->sk_prot->unhash(sk);
+	return 0;
+}
+
 int __udp_disconnect(struct sock *sk, int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
@@ -2937,7 +2956,7 @@  int udp_abort(struct sock *sk, int err)
 
 	sk->sk_err = err;
 	sk_error_report(sk);
-	__udp_disconnect(sk, 0);
+	__udp_disconnect_with_abort(sk);
 
 out:
 	if (!has_current_bpf_ctx())