diff mbox series

[net,v3] sunrpc: fix one UAF issue caused by sunrpc kernel tcp socket

Message ID 20241111081736.526093-1-liujian56@huawei.com (mailing list archive)
State New
Headers show
Series [net,v3] sunrpc: fix one UAF issue caused by sunrpc kernel tcp socket | expand

Commit Message

Liu Jian Nov. 11, 2024, 8:17 a.m. UTC
BUG: KASAN: slab-use-after-free in tcp_write_timer_handler+0x156/0x3e0
Read of size 1 at addr ffff888111f322cd by task swapper/0/0

CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.12.0-rc4-dirty #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1
Call Trace:
 <IRQ>
 dump_stack_lvl+0x68/0xa0
 print_address_description.constprop.0+0x2c/0x3d0
 print_report+0xb4/0x270
 kasan_report+0xbd/0xf0
 tcp_write_timer_handler+0x156/0x3e0
 tcp_write_timer+0x66/0x170
 call_timer_fn+0xfb/0x1d0
 __run_timers+0x3f8/0x480
 run_timer_softirq+0x9b/0x100
 handle_softirqs+0x153/0x390
 __irq_exit_rcu+0x103/0x120
 irq_exit_rcu+0xe/0x20
 sysvec_apic_timer_interrupt+0x76/0x90
 </IRQ>
 <TASK>
 asm_sysvec_apic_timer_interrupt+0x1a/0x20
RIP: 0010:default_idle+0xf/0x20
Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90
 90 90 90 90 f3 0f 1e fa 66 90 0f 00 2d 33 f8 25 00 fb f4 <fa> c3 cc cc cc
 cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
RSP: 0018:ffffffffa2007e28 EFLAGS: 00000242
RAX: 00000000000f3b31 RBX: 1ffffffff4400fc7 RCX: ffffffffa09c3196
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff9f00590f
RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed102360835d
R10: ffff88811b041aeb R11: 0000000000000001 R12: 0000000000000000
R13: ffffffffa202d7c0 R14: 0000000000000000 R15: 00000000000147d0
 default_idle_call+0x6b/0xa0
 cpuidle_idle_call+0x1af/0x1f0
 do_idle+0xbc/0x130
 cpu_startup_entry+0x33/0x40
 rest_init+0x11f/0x210
 start_kernel+0x39a/0x420
 x86_64_start_reservations+0x18/0x30
 x86_64_start_kernel+0x97/0xa0
 common_startup_64+0x13e/0x141
 </TASK>

Allocated by task 595:
 kasan_save_stack+0x24/0x50
 kasan_save_track+0x14/0x30
 __kasan_slab_alloc+0x87/0x90
 kmem_cache_alloc_noprof+0x12b/0x3f0
 copy_net_ns+0x94/0x380
 create_new_namespaces+0x24c/0x500
 unshare_nsproxy_namespaces+0x75/0xf0
 ksys_unshare+0x24e/0x4f0
 __x64_sys_unshare+0x1f/0x30
 do_syscall_64+0x70/0x180
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Freed by task 100:
 kasan_save_stack+0x24/0x50
 kasan_save_track+0x14/0x30
 kasan_save_free_info+0x3b/0x60
 __kasan_slab_free+0x54/0x70
 kmem_cache_free+0x156/0x5d0
 cleanup_net+0x5d3/0x670
 process_one_work+0x776/0xa90
 worker_thread+0x2e2/0x560
 kthread+0x1a8/0x1f0
 ret_from_fork+0x34/0x60
 ret_from_fork_asm+0x1a/0x30

Reproduction script:

mkdir -p /mnt/nfsshare
mkdir -p /mnt/nfs/netns_1
mkfs.ext4 /dev/sdb
mount /dev/sdb /mnt/nfsshare
systemctl restart nfs-server
chmod 777 /mnt/nfsshare
exportfs -i -o rw,no_root_squash *:/mnt/nfsshare

ip netns add netns_1
ip link add name veth_1_peer type veth peer veth_1
ifconfig veth_1_peer 11.11.0.254 up
ip link set veth_1 netns netns_1
ip netns exec netns_1 ifconfig veth_1 11.11.0.1

ip netns exec netns_1 /root/iptables -A OUTPUT -d 11.11.0.254 -p tcp \
	--tcp-flags FIN FIN  -j DROP

(note: In my environment, a DESTROY_CLIENTID operation is always sent
 immediately, breaking the nfs tcp connection.)
ip netns exec netns_1 timeout -s 9 300 mount -t nfs -o proto=tcp,vers=4.1 \
	11.11.0.254:/mnt/nfsshare /mnt/nfs/netns_1

ip netns del netns_1

The reason here is that the tcp socket in netns_1 (nfs side) has been
shutdown and closed (done in xs_destroy), but the FIN message (with ack)
is discarded, and the nfsd side keeps sending retransmission messages.
As a result, when the tcp sock in netns_1 processes the received message,
it sends the message (FIN message) in the sending queue, and the tcp timer
is re-established. When the network namespace is deleted, the net structure
accessed by tcp's timer handler function causes problems.

To fix this problem, let's hold netns refcnt for the tcp kernel socket as
 done in other modules.

Fixes: 26abe14379f8 ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
Signed-off-by: Liu Jian <liujian56@huawei.com>
---
 net/sunrpc/svcsock.c  | 4 ++++
 net/sunrpc/xprtsock.c | 6 ++++++
 2 files changed, 10 insertions(+)

Comments

Chuck Lever III Nov. 11, 2024, 3:11 p.m. UTC | #1
On Mon, Nov 11, 2024 at 04:17:36PM +0800, Liu Jian wrote:
> BUG: KASAN: slab-use-after-free in tcp_write_timer_handler+0x156/0x3e0
> Read of size 1 at addr ffff888111f322cd by task swapper/0/0
> 
> CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.12.0-rc4-dirty #7
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1
> Call Trace:
>  <IRQ>
>  dump_stack_lvl+0x68/0xa0
>  print_address_description.constprop.0+0x2c/0x3d0
>  print_report+0xb4/0x270
>  kasan_report+0xbd/0xf0
>  tcp_write_timer_handler+0x156/0x3e0
>  tcp_write_timer+0x66/0x170
>  call_timer_fn+0xfb/0x1d0
>  __run_timers+0x3f8/0x480
>  run_timer_softirq+0x9b/0x100
>  handle_softirqs+0x153/0x390
>  __irq_exit_rcu+0x103/0x120
>  irq_exit_rcu+0xe/0x20
>  sysvec_apic_timer_interrupt+0x76/0x90
>  </IRQ>
>  <TASK>
>  asm_sysvec_apic_timer_interrupt+0x1a/0x20
> RIP: 0010:default_idle+0xf/0x20
> Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90
>  90 90 90 90 f3 0f 1e fa 66 90 0f 00 2d 33 f8 25 00 fb f4 <fa> c3 cc cc cc
>  cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
> RSP: 0018:ffffffffa2007e28 EFLAGS: 00000242
> RAX: 00000000000f3b31 RBX: 1ffffffff4400fc7 RCX: ffffffffa09c3196
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff9f00590f
> RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed102360835d
> R10: ffff88811b041aeb R11: 0000000000000001 R12: 0000000000000000
> R13: ffffffffa202d7c0 R14: 0000000000000000 R15: 00000000000147d0
>  default_idle_call+0x6b/0xa0
>  cpuidle_idle_call+0x1af/0x1f0
>  do_idle+0xbc/0x130
>  cpu_startup_entry+0x33/0x40
>  rest_init+0x11f/0x210
>  start_kernel+0x39a/0x420
>  x86_64_start_reservations+0x18/0x30
>  x86_64_start_kernel+0x97/0xa0
>  common_startup_64+0x13e/0x141
>  </TASK>
> 
> Allocated by task 595:
>  kasan_save_stack+0x24/0x50
>  kasan_save_track+0x14/0x30
>  __kasan_slab_alloc+0x87/0x90
>  kmem_cache_alloc_noprof+0x12b/0x3f0
>  copy_net_ns+0x94/0x380
>  create_new_namespaces+0x24c/0x500
>  unshare_nsproxy_namespaces+0x75/0xf0
>  ksys_unshare+0x24e/0x4f0
>  __x64_sys_unshare+0x1f/0x30
>  do_syscall_64+0x70/0x180
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> Freed by task 100:
>  kasan_save_stack+0x24/0x50
>  kasan_save_track+0x14/0x30
>  kasan_save_free_info+0x3b/0x60
>  __kasan_slab_free+0x54/0x70
>  kmem_cache_free+0x156/0x5d0
>  cleanup_net+0x5d3/0x670
>  process_one_work+0x776/0xa90
>  worker_thread+0x2e2/0x560
>  kthread+0x1a8/0x1f0
>  ret_from_fork+0x34/0x60
>  ret_from_fork_asm+0x1a/0x30
> 
> Reproduction script:
> 
> mkdir -p /mnt/nfsshare
> mkdir -p /mnt/nfs/netns_1
> mkfs.ext4 /dev/sdb
> mount /dev/sdb /mnt/nfsshare
> systemctl restart nfs-server
> chmod 777 /mnt/nfsshare
> exportfs -i -o rw,no_root_squash *:/mnt/nfsshare
> 
> ip netns add netns_1
> ip link add name veth_1_peer type veth peer veth_1
> ifconfig veth_1_peer 11.11.0.254 up
> ip link set veth_1 netns netns_1
> ip netns exec netns_1 ifconfig veth_1 11.11.0.1
> 
> ip netns exec netns_1 /root/iptables -A OUTPUT -d 11.11.0.254 -p tcp \
> 	--tcp-flags FIN FIN  -j DROP
> 
> (note: In my environment, a DESTROY_CLIENTID operation is always sent
>  immediately, breaking the nfs tcp connection.)
> ip netns exec netns_1 timeout -s 9 300 mount -t nfs -o proto=tcp,vers=4.1 \
> 	11.11.0.254:/mnt/nfsshare /mnt/nfs/netns_1
> 
> ip netns del netns_1
> 
> The reason here is that the tcp socket in netns_1 (nfs side) has been
> shutdown and closed (done in xs_destroy), but the FIN message (with ack)
> is discarded, and the nfsd side keeps sending retransmission messages.
> As a result, when the tcp sock in netns_1 processes the received message,
> it sends the message (FIN message) in the sending queue, and the tcp timer
> is re-established. When the network namespace is deleted, the net structure
> accessed by tcp's timer handler function causes problems.
> 
> To fix this problem, let's hold netns refcnt for the tcp kernel socket as
>  done in other modules.
> 
> Fixes: 26abe14379f8 ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
> Signed-off-by: Liu Jian <liujian56@huawei.com>
> ---
>  net/sunrpc/svcsock.c  | 4 ++++
>  net/sunrpc/xprtsock.c | 6 ++++++
>  2 files changed, 10 insertions(+)
> 
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 6f272013fd9b..d4330aaadc23 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -1551,6 +1551,10 @@ static struct svc_xprt *svc_create_socket(struct svc_serv *serv,
>  	newlen = error;
>  
>  	if (protocol == IPPROTO_TCP) {
> +		__netns_tracker_free(net, &sock->sk->ns_tracker, false);
> +		sock->sk->sk_net_refcnt = 1;
> +		get_net_track(net, &sock->sk->ns_tracker, GFP_KERNEL);
> +		sock_inuse_add(net, 1);

I'm not as familiar with net tracking as perhaps I should be. Can
you tell me where this reference count is released, or does it not
need to be?

Does the net reference count get carried over to sockets created
by accept() ?


>  		if ((error = kernel_listen(sock, 64)) < 0)
>  			goto bummer;
>  	}
> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> index d2f31b59457b..0f0b9f9283d9 100644
> --- a/net/sunrpc/xprtsock.c
> +++ b/net/sunrpc/xprtsock.c
> @@ -1942,6 +1942,12 @@ static struct socket *xs_create_sock(struct rpc_xprt *xprt,
>  		goto out;
>  	}
>  
> +	if (protocol == IPPROTO_TCP) {
> +		__netns_tracker_free(xprt->xprt_net, &sock->sk->ns_tracker, false);
> +		sock->sk->sk_net_refcnt = 1;
> +		get_net_track(xprt->xprt_net, &sock->sk->ns_tracker, GFP_KERNEL);
> +		sock_inuse_add(xprt->xprt_net, 1);
> +	}
>  	filp = sock_alloc_file(sock, O_NONBLOCK, NULL);
>  	if (IS_ERR(filp))
>  		return ERR_CAST(filp);
> -- 
> 2.34.1
> 
>
Kuniyuki Iwashima Nov. 11, 2024, 11 p.m. UTC | #2
From: Chuck Lever <chuck.lever@oracle.com>
Date: Mon, 11 Nov 2024 10:11:08 -0500
> > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > index 6f272013fd9b..d4330aaadc23 100644
> > --- a/net/sunrpc/svcsock.c
> > +++ b/net/sunrpc/svcsock.c
> > @@ -1551,6 +1551,10 @@ static struct svc_xprt *svc_create_socket(struct svc_serv *serv,
> >  	newlen = error;
> >  
> >  	if (protocol == IPPROTO_TCP) {
> > +		__netns_tracker_free(net, &sock->sk->ns_tracker, false);
> > +		sock->sk->sk_net_refcnt = 1;
> > +		get_net_track(net, &sock->sk->ns_tracker, GFP_KERNEL);
> > +		sock_inuse_add(net, 1);
> 
> I'm not as familiar with net tracking as perhaps I should be. Can
> you tell me where this reference count is released, or does it not
> need to be?

It's decremented when the socket is destroyed in __sk_free().


> 
> Does the net reference count get carried over to sockets created
> by accept() ?

Yes, sk_clone_lock() creates a child socket that inherits the
listener's sk->sk_net_refcnt, then the child will call get_net_track().

  tcp_create_openreq_child
    inet_csk_clone_lock
      sk_clone_lock
NeilBrown Nov. 11, 2024, 11:52 p.m. UTC | #3
On Mon, 11 Nov 2024, Liu Jian wrote:
> BUG: KASAN: slab-use-after-free in tcp_write_timer_handler+0x156/0x3e0
> Read of size 1 at addr ffff888111f322cd by task swapper/0/0
> 
> CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.12.0-rc4-dirty #7
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1
> Call Trace:
>  <IRQ>
>  dump_stack_lvl+0x68/0xa0
>  print_address_description.constprop.0+0x2c/0x3d0
>  print_report+0xb4/0x270
>  kasan_report+0xbd/0xf0
>  tcp_write_timer_handler+0x156/0x3e0
>  tcp_write_timer+0x66/0x170
>  call_timer_fn+0xfb/0x1d0
>  __run_timers+0x3f8/0x480
>  run_timer_softirq+0x9b/0x100
>  handle_softirqs+0x153/0x390
>  __irq_exit_rcu+0x103/0x120
>  irq_exit_rcu+0xe/0x20
>  sysvec_apic_timer_interrupt+0x76/0x90
>  </IRQ>
>  <TASK>
>  asm_sysvec_apic_timer_interrupt+0x1a/0x20
> RIP: 0010:default_idle+0xf/0x20
> Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90
>  90 90 90 90 f3 0f 1e fa 66 90 0f 00 2d 33 f8 25 00 fb f4 <fa> c3 cc cc cc
>  cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
> RSP: 0018:ffffffffa2007e28 EFLAGS: 00000242
> RAX: 00000000000f3b31 RBX: 1ffffffff4400fc7 RCX: ffffffffa09c3196
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff9f00590f
> RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed102360835d
> R10: ffff88811b041aeb R11: 0000000000000001 R12: 0000000000000000
> R13: ffffffffa202d7c0 R14: 0000000000000000 R15: 00000000000147d0
>  default_idle_call+0x6b/0xa0
>  cpuidle_idle_call+0x1af/0x1f0
>  do_idle+0xbc/0x130
>  cpu_startup_entry+0x33/0x40
>  rest_init+0x11f/0x210
>  start_kernel+0x39a/0x420
>  x86_64_start_reservations+0x18/0x30
>  x86_64_start_kernel+0x97/0xa0
>  common_startup_64+0x13e/0x141
>  </TASK>
> 
> Allocated by task 595:
>  kasan_save_stack+0x24/0x50
>  kasan_save_track+0x14/0x30
>  __kasan_slab_alloc+0x87/0x90
>  kmem_cache_alloc_noprof+0x12b/0x3f0
>  copy_net_ns+0x94/0x380
>  create_new_namespaces+0x24c/0x500
>  unshare_nsproxy_namespaces+0x75/0xf0
>  ksys_unshare+0x24e/0x4f0
>  __x64_sys_unshare+0x1f/0x30
>  do_syscall_64+0x70/0x180
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> Freed by task 100:
>  kasan_save_stack+0x24/0x50
>  kasan_save_track+0x14/0x30
>  kasan_save_free_info+0x3b/0x60
>  __kasan_slab_free+0x54/0x70
>  kmem_cache_free+0x156/0x5d0
>  cleanup_net+0x5d3/0x670
>  process_one_work+0x776/0xa90
>  worker_thread+0x2e2/0x560
>  kthread+0x1a8/0x1f0
>  ret_from_fork+0x34/0x60
>  ret_from_fork_asm+0x1a/0x30
> 
> Reproduction script:
> 
> mkdir -p /mnt/nfsshare
> mkdir -p /mnt/nfs/netns_1
> mkfs.ext4 /dev/sdb
> mount /dev/sdb /mnt/nfsshare
> systemctl restart nfs-server
> chmod 777 /mnt/nfsshare
> exportfs -i -o rw,no_root_squash *:/mnt/nfsshare
> 
> ip netns add netns_1
> ip link add name veth_1_peer type veth peer veth_1
> ifconfig veth_1_peer 11.11.0.254 up
> ip link set veth_1 netns netns_1
> ip netns exec netns_1 ifconfig veth_1 11.11.0.1
> 
> ip netns exec netns_1 /root/iptables -A OUTPUT -d 11.11.0.254 -p tcp \
> 	--tcp-flags FIN FIN  -j DROP
> 
> (note: In my environment, a DESTROY_CLIENTID operation is always sent
>  immediately, breaking the nfs tcp connection.)
> ip netns exec netns_1 timeout -s 9 300 mount -t nfs -o proto=tcp,vers=4.1 \
> 	11.11.0.254:/mnt/nfsshare /mnt/nfs/netns_1
> 
> ip netns del netns_1
> 
> The reason here is that the tcp socket in netns_1 (nfs side) has been
> shutdown and closed (done in xs_destroy), but the FIN message (with ack)
> is discarded, and the nfsd side keeps sending retransmission messages.
> As a result, when the tcp sock in netns_1 processes the received message,
> it sends the message (FIN message) in the sending queue, and the tcp timer
> is re-established. When the network namespace is deleted, the net structure
> accessed by tcp's timer handler function causes problems.
> 
> To fix this problem, let's hold netns refcnt for the tcp kernel socket as
>  done in other modules.
> 
> Fixes: 26abe14379f8 ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
> Signed-off-by: Liu Jian <liujian56@huawei.com>
> ---
>  net/sunrpc/svcsock.c  | 4 ++++
>  net/sunrpc/xprtsock.c | 6 ++++++
>  2 files changed, 10 insertions(+)
> 
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 6f272013fd9b..d4330aaadc23 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -1551,6 +1551,10 @@ static struct svc_xprt *svc_create_socket(struct svc_serv *serv,
>  	newlen = error;
>  
>  	if (protocol == IPPROTO_TCP) {
> +		__netns_tracker_free(net, &sock->sk->ns_tracker, false);
> +		sock->sk->sk_net_refcnt = 1;
> +		get_net_track(net, &sock->sk->ns_tracker, GFP_KERNEL);
> +		sock_inuse_add(net, 1);

This is really ugly.  These internal details of the network layer have
no place in sunrpc code. There must be a better way.

Can we pass '0' for the kern arg to __sock_create()?  That should fix
the refcounting issues, but might mess up security labelling.

Can we wait for something before we call put_net() to release the net.

Maybe we want to split the "kern" arg t __sock_create() and have
"kern" which affects labeling and "refnet" with affects refcounting the
net.

I had a quick look and very nearly every caller of __sock_create()
outside of net/core really does want refcount.  Many callers of
sock_create_kern() possibly don't.

So I really think this needs to be cleaned up in net/core, not in all
the different network clients in the kernel.

Thanks,
NeilBrown


>  		if ((error = kernel_listen(sock, 64)) < 0)
>  			goto bummer;
>  	}
> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> index d2f31b59457b..0f0b9f9283d9 100644
> --- a/net/sunrpc/xprtsock.c
> +++ b/net/sunrpc/xprtsock.c
> @@ -1942,6 +1942,12 @@ static struct socket *xs_create_sock(struct rpc_xprt *xprt,
>  		goto out;
>  	}
>  
> +	if (protocol == IPPROTO_TCP) {
> +		__netns_tracker_free(xprt->xprt_net, &sock->sk->ns_tracker, false);
> +		sock->sk->sk_net_refcnt = 1;
> +		get_net_track(xprt->xprt_net, &sock->sk->ns_tracker, GFP_KERNEL);
> +		sock_inuse_add(xprt->xprt_net, 1);
> +	}
>  	filp = sock_alloc_file(sock, O_NONBLOCK, NULL);
>  	if (IS_ERR(filp))
>  		return ERR_CAST(filp);
> -- 
> 2.34.1
> 
>
Kuniyuki Iwashima Nov. 12, 2024, 12:13 a.m. UTC | #4
From: "NeilBrown" <neilb@suse.de>
Date: Tue, 12 Nov 2024 10:52:34 +1100
> > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > index 6f272013fd9b..d4330aaadc23 100644
> > --- a/net/sunrpc/svcsock.c
> > +++ b/net/sunrpc/svcsock.c
> > @@ -1551,6 +1551,10 @@ static struct svc_xprt *svc_create_socket(struct svc_serv *serv,
> >  	newlen = error;
> >  
> >  	if (protocol == IPPROTO_TCP) {
> > +		__netns_tracker_free(net, &sock->sk->ns_tracker, false);
> > +		sock->sk->sk_net_refcnt = 1;
> > +		get_net_track(net, &sock->sk->ns_tracker, GFP_KERNEL);
> > +		sock_inuse_add(net, 1);
> 
> This is really ugly.  These internal details of the network layer have
> no place in sunrpc code. There must be a better way.

I asked to do this way.  I agree this way is really ugly.  Similar
code exists in MPTCP, SMC, CIFS, etc, so I plan to add a new API for
this case, but this requires huge change adding a new parameter for
->create() prototype and the changes are not backportable.

https://github.com/q2ven/linux/commit/bb8b8814a73b3f50c3fef5eaf8d30d8c1df43e7b
https://github.com/q2ven/linux/commits/427_2

After my series, we can use the following but cannot backport it to
stable.

  sock_create_net(net, family, type, protocol);

  e.g. commit for MPTCP
  https://github.com/q2ven/linux/commit/24a4647561272c1e67a685d8403e27eb863398cf

That's why I suggested to go with the ugly way and I will clean them
up in the next cycle.

So, finally the sunrpc code will be much cleaner and the netns refcnt
will be touched only in the core code.


> 
> Can we pass '0' for the kern arg to __sock_create()?  That should fix
> the refcounting issues, but might mess up security labelling.

This should be avoided as it's confusing for BPF programs, LSMs, and
LOCKDEP.


> 
> Can we wait for something before we call put_net() to release the net.
> 
> Maybe we want to split the "kern" arg t __sock_create() and have
> "kern" which affects labeling and "refnet" with affects refcounting the
> net.

This is exactly what my series does, but again, it's not backport
friendly.
https://github.com/q2ven/linux/commit/413e867b4aee9e9f60f3c33fb38d2004aeb29c40


> 
> I had a quick look and very nearly every caller of __sock_create()
> outside of net/core really does want refcount.  Many callers of
> sock_create_kern() possibly don't.

Actually, since sock_create_kern() is added, we no longer need to
export __sock_create(), so I have a patch to convert them to
sock_create_kern().

And most of TCP socket does need refcnt, but non-TCP won't.
Also, handshake one is exception, which uses TCP but only in init_net,
where we need not take care of netns refcnt.

https://github.com/q2ven/linux/commit/b56888bbbf327d57ea25a6b97275d6b9b8ad043a



> 
> So I really think this needs to be cleaned up in net/core, not in all
> the different network clients in the kernel.

Yes, will be done in the next cycle.
NeilBrown Nov. 12, 2024, 12:54 a.m. UTC | #5
On Tue, 12 Nov 2024, Kuniyuki Iwashima wrote:
> From: "NeilBrown" <neilb@suse.de>
> Date: Tue, 12 Nov 2024 10:52:34 +1100
> > > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > > index 6f272013fd9b..d4330aaadc23 100644
> > > --- a/net/sunrpc/svcsock.c
> > > +++ b/net/sunrpc/svcsock.c
> > > @@ -1551,6 +1551,10 @@ static struct svc_xprt *svc_create_socket(struct svc_serv *serv,
> > >  	newlen = error;
> > >  
> > >  	if (protocol == IPPROTO_TCP) {
> > > +		__netns_tracker_free(net, &sock->sk->ns_tracker, false);
> > > +		sock->sk->sk_net_refcnt = 1;
> > > +		get_net_track(net, &sock->sk->ns_tracker, GFP_KERNEL);
> > > +		sock_inuse_add(net, 1);
> > 
> > This is really ugly.  These internal details of the network layer have
> > no place in sunrpc code. There must be a better way.
> 
> I asked to do this way.  I agree this way is really ugly.  Similar
> code exists in MPTCP, SMC, CIFS, etc, so I plan to add a new API for
> this case, but this requires huge change adding a new parameter for
> ->create() prototype and the changes are not backportable.
> 
> https://github.com/q2ven/linux/commit/bb8b8814a73b3f50c3fef5eaf8d30d8c1df43e7b
> https://github.com/q2ven/linux/commits/427_2
> 
> After my series, we can use the following but cannot backport it to
> stable.
> 
>   sock_create_net(net, family, type, protocol);
> 
>   e.g. commit for MPTCP
>   https://github.com/q2ven/linux/commit/24a4647561272c1e67a685d8403e27eb863398cf
> 
> That's why I suggested to go with the ugly way and I will clean them
> up in the next cycle.
> 
> So, finally the sunrpc code will be much cleaner and the netns refcnt
> will be touched only in the core code.

This fact needs to be spelled out in the commit message:

   This is an ugly hack which can easily be backported to earlier
   kernels.  A proper fix which cleans up the interfaces will follow,
   but will not be so easy to backport.

or something like that.

I would still prefer if a little helper were made available so sunrpc
could just call one function rather than adding 4 cryptic lines.  But I
won't argue that too strongly.

Thanks,
NeilBrown

> 
> 
> > 
> > Can we pass '0' for the kern arg to __sock_create()?  That should fix
> > the refcounting issues, but might mess up security labelling.
> 
> This should be avoided as it's confusing for BPF programs, LSMs, and
> LOCKDEP.
> 
> 
> > 
> > Can we wait for something before we call put_net() to release the net.
> > 
> > Maybe we want to split the "kern" arg t __sock_create() and have
> > "kern" which affects labeling and "refnet" with affects refcounting the
> > net.
> 
> This is exactly what my series does, but again, it's not backport
> friendly.
> https://github.com/q2ven/linux/commit/413e867b4aee9e9f60f3c33fb38d2004aeb29c40
> 
> 
> > 
> > I had a quick look and very nearly every caller of __sock_create()
> > outside of net/core really does want refcount.  Many callers of
> > sock_create_kern() possibly don't.
> 
> Actually, since sock_create_kern() is added, we no longer need to
> export __sock_create(), so I have a patch to convert them to
> sock_create_kern().
> 
> And most of TCP socket does need refcnt, but non-TCP won't.
> Also, handshake one is exception, which uses TCP but only in init_net,
> where we need not take care of netns refcnt.
> 
> https://github.com/q2ven/linux/commit/b56888bbbf327d57ea25a6b97275d6b9b8ad043a
> 
> 
> 
> > 
> > So I really think this needs to be cleaned up in net/core, not in all
> > the different network clients in the kernel.
> 
> Yes, will be done in the next cycle.
>
Liu Jian Nov. 12, 2024, 1:41 p.m. UTC | #6
在 2024/11/12 8:54, NeilBrown 写道:
> On Tue, 12 Nov 2024, Kuniyuki Iwashima wrote:
>> From: "NeilBrown" <neilb@suse.de>
>> Date: Tue, 12 Nov 2024 10:52:34 +1100
>>>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>>>> index 6f272013fd9b..d4330aaadc23 100644
>>>> --- a/net/sunrpc/svcsock.c
>>>> +++ b/net/sunrpc/svcsock.c
>>>> @@ -1551,6 +1551,10 @@ static struct svc_xprt *svc_create_socket(struct svc_serv *serv,
>>>>   	newlen = error;
>>>>   
>>>>   	if (protocol == IPPROTO_TCP) {
>>>> +		__netns_tracker_free(net, &sock->sk->ns_tracker, false);
>>>> +		sock->sk->sk_net_refcnt = 1;
>>>> +		get_net_track(net, &sock->sk->ns_tracker, GFP_KERNEL);
>>>> +		sock_inuse_add(net, 1);
>>>
>>> This is really ugly.  These internal details of the network layer have
>>> no place in sunrpc code. There must be a better way.
>>
>> I asked to do this way.  I agree this way is really ugly.  Similar
>> code exists in MPTCP, SMC, CIFS, etc, so I plan to add a new API for
>> this case, but this requires huge change adding a new parameter for
>> ->create() prototype and the changes are not backportable.
>>
>> https://github.com/q2ven/linux/commit/bb8b8814a73b3f50c3fef5eaf8d30d8c1df43e7b
>> https://github.com/q2ven/linux/commits/427_2
>>
>> After my series, we can use the following but cannot backport it to
>> stable.
>>
>>    sock_create_net(net, family, type, protocol);
>>
>>    e.g. commit for MPTCP
>>    https://github.com/q2ven/linux/commit/24a4647561272c1e67a685d8403e27eb863398cf
>>
>> That's why I suggested to go with the ugly way and I will clean them
>> up in the next cycle.
>>
>> So, finally the sunrpc code will be much cleaner and the netns refcnt
>> will be touched only in the core code.
> 
> This fact needs to be spelled out in the commit message:
> 
>     This is an ugly hack which can easily be backported to earlier
>     kernels.  A proper fix which cleans up the interfaces will follow,
>     but will not be so easy to backport.
> 
> or something like that.
Already added to v4, thank you.
> 
> I would still prefer if a little helper were made available so sunrpc
> could just call one function rather than adding 4 cryptic lines.  But I
> won't argue that too strongly.
Let's just leave it at that for now. ):


> Thanks,
> NeilBrown
> 
>>
>>
>>>
>>> Can we pass '0' for the kern arg to __sock_create()?  That should fix
>>> the refcounting issues, but might mess up security labelling.
>>
>> This should be avoided as it's confusing for BPF programs, LSMs, and
>> LOCKDEP.
>>
>>
>>>
>>> Can we wait for something before we call put_net() to release the net.
>>>
>>> Maybe we want to split the "kern" arg t __sock_create() and have
>>> "kern" which affects labeling and "refnet" with affects refcounting the
>>> net.
>>
>> This is exactly what my series does, but again, it's not backport
>> friendly.
>> https://github.com/q2ven/linux/commit/413e867b4aee9e9f60f3c33fb38d2004aeb29c40
>>
>>
>>>
>>> I had a quick look and very nearly every caller of __sock_create()
>>> outside of net/core really does want refcount.  Many callers of
>>> sock_create_kern() possibly don't.
>>
>> Actually, since sock_create_kern() is added, we no longer need to
>> export __sock_create(), so I have a patch to convert them to
>> sock_create_kern().
>>
>> And most of TCP socket does need refcnt, but non-TCP won't.
>> Also, handshake one is exception, which uses TCP but only in init_net,
>> where we need not take care of netns refcnt.
>>
>> https://github.com/q2ven/linux/commit/b56888bbbf327d57ea25a6b97275d6b9b8ad043a
>>
>>
>>
>>>
>>> So I really think this needs to be cleaned up in net/core, not in all
>>> the different network clients in the kernel.
>>
>> Yes, will be done in the next cycle.
>>
>
diff mbox series

Patch

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 6f272013fd9b..d4330aaadc23 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1551,6 +1551,10 @@  static struct svc_xprt *svc_create_socket(struct svc_serv *serv,
 	newlen = error;
 
 	if (protocol == IPPROTO_TCP) {
+		__netns_tracker_free(net, &sock->sk->ns_tracker, false);
+		sock->sk->sk_net_refcnt = 1;
+		get_net_track(net, &sock->sk->ns_tracker, GFP_KERNEL);
+		sock_inuse_add(net, 1);
 		if ((error = kernel_listen(sock, 64)) < 0)
 			goto bummer;
 	}
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index d2f31b59457b..0f0b9f9283d9 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1942,6 +1942,12 @@  static struct socket *xs_create_sock(struct rpc_xprt *xprt,
 		goto out;
 	}
 
+	if (protocol == IPPROTO_TCP) {
+		__netns_tracker_free(xprt->xprt_net, &sock->sk->ns_tracker, false);
+		sock->sk->sk_net_refcnt = 1;
+		get_net_track(xprt->xprt_net, &sock->sk->ns_tracker, GFP_KERNEL);
+		sock_inuse_add(xprt->xprt_net, 1);
+	}
 	filp = sock_alloc_file(sock, O_NONBLOCK, NULL);
 	if (IS_ERR(filp))
 		return ERR_CAST(filp);