diff mbox series

[net] sunrpc: fix one UAF issue caused by sunrpc kernel tcp socket

Message ID 20241024015543.568476-1-liujian56@huawei.com (mailing list archive)
State New
Headers show
Series [net] sunrpc: fix one UAF issue caused by sunrpc kernel tcp socket | expand

Commit Message

liujian (CE) Oct. 24, 2024, 1:55 a.m. UTC
BUG: KASAN: slab-use-after-free in tcp_write_timer_handler+0x156/0x3e0
Read of size 1 at addr ffff888111f322cd by task swapper/0/0

CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.12.0-rc4-dirty #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1
Call Trace:
 <IRQ>
 dump_stack_lvl+0x68/0xa0
 print_address_description.constprop.0+0x2c/0x3d0
 print_report+0xb4/0x270
 kasan_report+0xbd/0xf0
 tcp_write_timer_handler+0x156/0x3e0
 tcp_write_timer+0x66/0x170
 call_timer_fn+0xfb/0x1d0
 __run_timers+0x3f8/0x480
 run_timer_softirq+0x9b/0x100
 handle_softirqs+0x153/0x390
 __irq_exit_rcu+0x103/0x120
 irq_exit_rcu+0xe/0x20
 sysvec_apic_timer_interrupt+0x76/0x90
 </IRQ>
 <TASK>
 asm_sysvec_apic_timer_interrupt+0x1a/0x20
RIP: 0010:default_idle+0xf/0x20
Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90
 90 90 90 90 f3 0f 1e fa 66 90 0f 00 2d 33 f8 25 00 fb f4 <fa> c3 cc cc cc
 cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
RSP: 0018:ffffffffa2007e28 EFLAGS: 00000242
RAX: 00000000000f3b31 RBX: 1ffffffff4400fc7 RCX: ffffffffa09c3196
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff9f00590f
RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed102360835d
R10: ffff88811b041aeb R11: 0000000000000001 R12: 0000000000000000
R13: ffffffffa202d7c0 R14: 0000000000000000 R15: 00000000000147d0
 default_idle_call+0x6b/0xa0
 cpuidle_idle_call+0x1af/0x1f0
 do_idle+0xbc/0x130
 cpu_startup_entry+0x33/0x40
 rest_init+0x11f/0x210
 start_kernel+0x39a/0x420
 x86_64_start_reservations+0x18/0x30
 x86_64_start_kernel+0x97/0xa0
 common_startup_64+0x13e/0x141
 </TASK>

Allocated by task 595:
 kasan_save_stack+0x24/0x50
 kasan_save_track+0x14/0x30
 __kasan_slab_alloc+0x87/0x90
 kmem_cache_alloc_noprof+0x12b/0x3f0
 copy_net_ns+0x94/0x380
 create_new_namespaces+0x24c/0x500
 unshare_nsproxy_namespaces+0x75/0xf0
 ksys_unshare+0x24e/0x4f0
 __x64_sys_unshare+0x1f/0x30
 do_syscall_64+0x70/0x180
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Freed by task 100:
 kasan_save_stack+0x24/0x50
 kasan_save_track+0x14/0x30
 kasan_save_free_info+0x3b/0x60
 __kasan_slab_free+0x54/0x70
 kmem_cache_free+0x156/0x5d0
 cleanup_net+0x5d3/0x670
 process_one_work+0x776/0xa90
 worker_thread+0x2e2/0x560
 kthread+0x1a8/0x1f0
 ret_from_fork+0x34/0x60
 ret_from_fork_asm+0x1a/0x30

Reproduction script:

mkdir -p /mnt/nfsshare
mkdir -p /mnt/nfs/netns_1
mkfs.ext4 /dev/sdb
mount /dev/sdb /mnt/nfsshare
systemctl restart nfs-server
chmod 777 /mnt/nfsshare
exportfs -i -o rw,no_root_squash *:/mnt/nfsshare

ip netns add netns_1
ip link add name veth_1_peer type veth peer veth_1
ifconfig veth_1_peer 11.11.0.254 up
ip link set veth_1 netns netns_1
ip netns exec netns_1 ifconfig veth_1 11.11.0.1

ip netns exec netns_1 /root/iptables -A OUTPUT -d 11.11.0.254 -p tcp \
	--tcp-flags FIN FIN  -j DROP

(note: In my environment, a DESTROY_CLIENTID operation is always sent
 immediately, breaking the nfs tcp connection.)
ip netns exec netns_1 timeout -s 9 300 mount -t nfs -o proto=tcp,vers=4.1 \
	11.11.0.254:/mnt/nfsshare /mnt/nfs/netns_1

ip netns del netns_1

The reason here is that the tcp socket in netns_1 (nfs side) has been
shutdown and closed (done in xs_destroy), but the FIN message (with ack)
is discarded, and the nfsd side keeps sending retransmission messages.
As a result, when the tcp sock in netns_1 processes the received message,
it sends the message (FIN message) in the sending queue, and the tcp timer
is re-established. When the network namespace is deleted, the net structure
accessed by tcp's timer handler function causes problems.

The modification here aborts the TCP connection directly in xs_destroy().

Fixes: 26abe14379f8 ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
Signed-off-by: Liu Jian <liujian56@huawei.com>
---
 net/sunrpc/xprtsock.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Trond Myklebust Oct. 24, 2024, 2:20 a.m. UTC | #1
On Thu, 2024-10-24 at 09:55 +0800, Liu Jian wrote:
> BUG: KASAN: slab-use-after-free in
> tcp_write_timer_handler+0x156/0x3e0
> Read of size 1 at addr ffff888111f322cd by task swapper/0/0
> 
> CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.12.0-rc4-dirty #7
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1
> Call Trace:
>  <IRQ>
>  dump_stack_lvl+0x68/0xa0
>  print_address_description.constprop.0+0x2c/0x3d0
>  print_report+0xb4/0x270
>  kasan_report+0xbd/0xf0
>  tcp_write_timer_handler+0x156/0x3e0
>  tcp_write_timer+0x66/0x170
>  call_timer_fn+0xfb/0x1d0
>  __run_timers+0x3f8/0x480
>  run_timer_softirq+0x9b/0x100
>  handle_softirqs+0x153/0x390
>  __irq_exit_rcu+0x103/0x120
>  irq_exit_rcu+0xe/0x20
>  sysvec_apic_timer_interrupt+0x76/0x90
>  </IRQ>
>  <TASK>
>  asm_sysvec_apic_timer_interrupt+0x1a/0x20
> RIP: 0010:default_idle+0xf/0x20
> Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90 90
> 90 90
>  90 90 90 90 f3 0f 1e fa 66 90 0f 00 2d 33 f8 25 00 fb f4 <fa> c3 cc
> cc cc
>  cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
> RSP: 0018:ffffffffa2007e28 EFLAGS: 00000242
> RAX: 00000000000f3b31 RBX: 1ffffffff4400fc7 RCX: ffffffffa09c3196
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff9f00590f
> RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed102360835d
> R10: ffff88811b041aeb R11: 0000000000000001 R12: 0000000000000000
> R13: ffffffffa202d7c0 R14: 0000000000000000 R15: 00000000000147d0
>  default_idle_call+0x6b/0xa0
>  cpuidle_idle_call+0x1af/0x1f0
>  do_idle+0xbc/0x130
>  cpu_startup_entry+0x33/0x40
>  rest_init+0x11f/0x210
>  start_kernel+0x39a/0x420
>  x86_64_start_reservations+0x18/0x30
>  x86_64_start_kernel+0x97/0xa0
>  common_startup_64+0x13e/0x141
>  </TASK>
> 
> Allocated by task 595:
>  kasan_save_stack+0x24/0x50
>  kasan_save_track+0x14/0x30
>  __kasan_slab_alloc+0x87/0x90
>  kmem_cache_alloc_noprof+0x12b/0x3f0
>  copy_net_ns+0x94/0x380
>  create_new_namespaces+0x24c/0x500
>  unshare_nsproxy_namespaces+0x75/0xf0
>  ksys_unshare+0x24e/0x4f0
>  __x64_sys_unshare+0x1f/0x30
>  do_syscall_64+0x70/0x180
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> Freed by task 100:
>  kasan_save_stack+0x24/0x50
>  kasan_save_track+0x14/0x30
>  kasan_save_free_info+0x3b/0x60
>  __kasan_slab_free+0x54/0x70
>  kmem_cache_free+0x156/0x5d0
>  cleanup_net+0x5d3/0x670
>  process_one_work+0x776/0xa90
>  worker_thread+0x2e2/0x560
>  kthread+0x1a8/0x1f0
>  ret_from_fork+0x34/0x60
>  ret_from_fork_asm+0x1a/0x30
> 
> Reproduction script:
> 
> mkdir -p /mnt/nfsshare
> mkdir -p /mnt/nfs/netns_1
> mkfs.ext4 /dev/sdb
> mount /dev/sdb /mnt/nfsshare
> systemctl restart nfs-server
> chmod 777 /mnt/nfsshare
> exportfs -i -o rw,no_root_squash *:/mnt/nfsshare
> 
> ip netns add netns_1
> ip link add name veth_1_peer type veth peer veth_1
> ifconfig veth_1_peer 11.11.0.254 up
> ip link set veth_1 netns netns_1
> ip netns exec netns_1 ifconfig veth_1 11.11.0.1
> 
> ip netns exec netns_1 /root/iptables -A OUTPUT -d 11.11.0.254 -p tcp
> \
> 	--tcp-flags FIN FIN  -j DROP
> 
> (note: In my environment, a DESTROY_CLIENTID operation is always sent
>  immediately, breaking the nfs tcp connection.)
> ip netns exec netns_1 timeout -s 9 300 mount -t nfs -o
> proto=tcp,vers=4.1 \
> 	11.11.0.254:/mnt/nfsshare /mnt/nfs/netns_1
> 
> ip netns del netns_1
> 
> The reason here is that the tcp socket in netns_1 (nfs side) has been
> shutdown and closed (done in xs_destroy), but the FIN message (with
> ack)
> is discarded, and the nfsd side keeps sending retransmission
> messages.
> As a result, when the tcp sock in netns_1 processes the received
> message,
> it sends the message (FIN message) in the sending queue, and the tcp
> timer
> is re-established. When the network namespace is deleted, the net
> structure
> accessed by tcp's timer handler function causes problems.
> 
> The modification here aborts the TCP connection directly in
> xs_destroy().
> 
> Fixes: 26abe14379f8 ("net: Modify sk_alloc to not reference count the
> netns of kernel sockets.")
> Signed-off-by: Liu Jian <liujian56@huawei.com>
> ---
>  net/sunrpc/xprtsock.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> index 0e1691316f42..91ee3484155a 100644
> --- a/net/sunrpc/xprtsock.c
> +++ b/net/sunrpc/xprtsock.c
> @@ -1287,6 +1287,9 @@ static void xs_reset_transport(struct sock_xprt
> *transport)
>  	release_sock(sk);
>  	mutex_unlock(&transport->recv_mutex);
>  
> +	if (sk->sk_prot == &tcp_prot)
> +		tcp_abort(sk, ECONNABORTED);

We've already called kernel_sock_shutdown(sock, SHUT_RDWR), and we're
about to close the socket. Why on earth should we need a hack like the
above in order to abort the connection at this point?

This would appear to be a networking layer bug, and not an RPC level
problem.

> +
>  	trace_rpc_socket_close(xprt, sock);
>  	__fput_sync(filp);
>
Trond Myklebust Oct. 24, 2024, 12:57 p.m. UTC | #2
On Thu, 2024-10-24 at 02:20 +0000, Trond Myklebust wrote:
> On Thu, 2024-10-24 at 09:55 +0800, Liu Jian wrote:
> > BUG: KASAN: slab-use-after-free in
> > tcp_write_timer_handler+0x156/0x3e0
> > Read of size 1 at addr ffff888111f322cd by task swapper/0/0
> > 
> > CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.12.0-rc4-dirty
> > #7
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-
> > 1
> > Call Trace:
> >  <IRQ>
> >  dump_stack_lvl+0x68/0xa0
> >  print_address_description.constprop.0+0x2c/0x3d0
> >  print_report+0xb4/0x270
> >  kasan_report+0xbd/0xf0
> >  tcp_write_timer_handler+0x156/0x3e0
> >  tcp_write_timer+0x66/0x170
> >  call_timer_fn+0xfb/0x1d0
> >  __run_timers+0x3f8/0x480
> >  run_timer_softirq+0x9b/0x100
> >  handle_softirqs+0x153/0x390
> >  __irq_exit_rcu+0x103/0x120
> >  irq_exit_rcu+0xe/0x20
> >  sysvec_apic_timer_interrupt+0x76/0x90
> >  </IRQ>
> >  <TASK>
> >  asm_sysvec_apic_timer_interrupt+0x1a/0x20
> > RIP: 0010:default_idle+0xf/0x20
> > Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90
> > 90
> > 90 90
> >  90 90 90 90 f3 0f 1e fa 66 90 0f 00 2d 33 f8 25 00 fb f4 <fa> c3
> > cc
> > cc cc
> >  cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
> > RSP: 0018:ffffffffa2007e28 EFLAGS: 00000242
> > RAX: 00000000000f3b31 RBX: 1ffffffff4400fc7 RCX: ffffffffa09c3196
> > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff9f00590f
> > RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed102360835d
> > R10: ffff88811b041aeb R11: 0000000000000001 R12: 0000000000000000
> > R13: ffffffffa202d7c0 R14: 0000000000000000 R15: 00000000000147d0
> >  default_idle_call+0x6b/0xa0
> >  cpuidle_idle_call+0x1af/0x1f0
> >  do_idle+0xbc/0x130
> >  cpu_startup_entry+0x33/0x40
> >  rest_init+0x11f/0x210
> >  start_kernel+0x39a/0x420
> >  x86_64_start_reservations+0x18/0x30
> >  x86_64_start_kernel+0x97/0xa0
> >  common_startup_64+0x13e/0x141
> >  </TASK>
> > 
> > Allocated by task 595:
> >  kasan_save_stack+0x24/0x50
> >  kasan_save_track+0x14/0x30
> >  __kasan_slab_alloc+0x87/0x90
> >  kmem_cache_alloc_noprof+0x12b/0x3f0
> >  copy_net_ns+0x94/0x380
> >  create_new_namespaces+0x24c/0x500
> >  unshare_nsproxy_namespaces+0x75/0xf0
> >  ksys_unshare+0x24e/0x4f0
> >  __x64_sys_unshare+0x1f/0x30
> >  do_syscall_64+0x70/0x180
> >  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > 
> > Freed by task 100:
> >  kasan_save_stack+0x24/0x50
> >  kasan_save_track+0x14/0x30
> >  kasan_save_free_info+0x3b/0x60
> >  __kasan_slab_free+0x54/0x70
> >  kmem_cache_free+0x156/0x5d0
> >  cleanup_net+0x5d3/0x670
> >  process_one_work+0x776/0xa90
> >  worker_thread+0x2e2/0x560
> >  kthread+0x1a8/0x1f0
> >  ret_from_fork+0x34/0x60
> >  ret_from_fork_asm+0x1a/0x30
> > 
> > Reproduction script:
> > 
> > mkdir -p /mnt/nfsshare
> > mkdir -p /mnt/nfs/netns_1
> > mkfs.ext4 /dev/sdb
> > mount /dev/sdb /mnt/nfsshare
> > systemctl restart nfs-server
> > chmod 777 /mnt/nfsshare
> > exportfs -i -o rw,no_root_squash *:/mnt/nfsshare
> > 
> > ip netns add netns_1
> > ip link add name veth_1_peer type veth peer veth_1
> > ifconfig veth_1_peer 11.11.0.254 up
> > ip link set veth_1 netns netns_1
> > ip netns exec netns_1 ifconfig veth_1 11.11.0.1
> > 
> > ip netns exec netns_1 /root/iptables -A OUTPUT -d 11.11.0.254 -p
> > tcp
> > \
> > 	--tcp-flags FIN FIN  -j DROP
> > 
> > (note: In my environment, a DESTROY_CLIENTID operation is always
> > sent
> >  immediately, breaking the nfs tcp connection.)
> > ip netns exec netns_1 timeout -s 9 300 mount -t nfs -o
> > proto=tcp,vers=4.1 \
> > 	11.11.0.254:/mnt/nfsshare /mnt/nfs/netns_1
> > 
> > ip netns del netns_1
> > 
> > The reason here is that the tcp socket in netns_1 (nfs side) has
> > been
> > shutdown and closed (done in xs_destroy), but the FIN message (with
> > ack)
> > is discarded, and the nfsd side keeps sending retransmission
> > messages.
> > As a result, when the tcp sock in netns_1 processes the received
> > message,
> > it sends the message (FIN message) in the sending queue, and the
> > tcp
> > timer
> > is re-established. When the network namespace is deleted, the net
> > structure
> > accessed by tcp's timer handler function causes problems.
> > 
> > The modification here aborts the TCP connection directly in
> > xs_destroy().
> > 
> > Fixes: 26abe14379f8 ("net: Modify sk_alloc to not reference count
> > the
> > netns of kernel sockets.")
> > Signed-off-by: Liu Jian <liujian56@huawei.com>
> > ---
> >  net/sunrpc/xprtsock.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> > index 0e1691316f42..91ee3484155a 100644
> > --- a/net/sunrpc/xprtsock.c
> > +++ b/net/sunrpc/xprtsock.c
> > @@ -1287,6 +1287,9 @@ static void xs_reset_transport(struct
> > sock_xprt
> > *transport)
> >  	release_sock(sk);
> >  	mutex_unlock(&transport->recv_mutex);
> >  
> > +	if (sk->sk_prot == &tcp_prot)
> > +		tcp_abort(sk, ECONNABORTED);
> 
> We've already called kernel_sock_shutdown(sock, SHUT_RDWR), and we're
> about to close the socket. Why on earth should we need a hack like
> the
> above in order to abort the connection at this point?
> 
> This would appear to be a networking layer bug, and not an RPC level
> problem.
> 

To put this differently: if a use after free can occur in the kernel
when the RPC layer closes a TCP socket and then exits the network
namespace, then can't that occur when a userland application does the
same?

If not, then what prevents it from happening?

> > +
> >  	trace_rpc_socket_close(xprt, sock);
> >  	__fput_sync(filp);
> >
liujian (CE) Oct. 24, 2024, 1:40 p.m. UTC | #3
在 2024/10/24 20:57, Trond Myklebust 写道:
> On Thu, 2024-10-24 at 02:20 +0000, Trond Myklebust wrote:
>> On Thu, 2024-10-24 at 09:55 +0800, Liu Jian wrote:
>>> BUG: KASAN: slab-use-after-free in
>>> tcp_write_timer_handler+0x156/0x3e0
>>> Read of size 1 at addr ffff888111f322cd by task swapper/0/0
>>>
>>> CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.12.0-rc4-dirty
>>> #7
>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-
>>> 1
>>> Call Trace:
>>>   <IRQ>
>>>   dump_stack_lvl+0x68/0xa0
>>>   print_address_description.constprop.0+0x2c/0x3d0
>>>   print_report+0xb4/0x270
>>>   kasan_report+0xbd/0xf0
>>>   tcp_write_timer_handler+0x156/0x3e0
>>>   tcp_write_timer+0x66/0x170
>>>   call_timer_fn+0xfb/0x1d0
>>>   __run_timers+0x3f8/0x480
>>>   run_timer_softirq+0x9b/0x100
>>>   handle_softirqs+0x153/0x390
>>>   __irq_exit_rcu+0x103/0x120
>>>   irq_exit_rcu+0xe/0x20
>>>   sysvec_apic_timer_interrupt+0x76/0x90
>>>   </IRQ>
>>>   <TASK>
>>>   asm_sysvec_apic_timer_interrupt+0x1a/0x20
>>> RIP: 0010:default_idle+0xf/0x20
>>> Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90
>>> 90
>>> 90 90
>>>   90 90 90 90 f3 0f 1e fa 66 90 0f 00 2d 33 f8 25 00 fb f4 <fa> c3
>>> cc
>>> cc cc
>>>   cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
>>> RSP: 0018:ffffffffa2007e28 EFLAGS: 00000242
>>> RAX: 00000000000f3b31 RBX: 1ffffffff4400fc7 RCX: ffffffffa09c3196
>>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff9f00590f
>>> RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed102360835d
>>> R10: ffff88811b041aeb R11: 0000000000000001 R12: 0000000000000000
>>> R13: ffffffffa202d7c0 R14: 0000000000000000 R15: 00000000000147d0
>>>   default_idle_call+0x6b/0xa0
>>>   cpuidle_idle_call+0x1af/0x1f0
>>>   do_idle+0xbc/0x130
>>>   cpu_startup_entry+0x33/0x40
>>>   rest_init+0x11f/0x210
>>>   start_kernel+0x39a/0x420
>>>   x86_64_start_reservations+0x18/0x30
>>>   x86_64_start_kernel+0x97/0xa0
>>>   common_startup_64+0x13e/0x141
>>>   </TASK>
>>>
>>> Allocated by task 595:
>>>   kasan_save_stack+0x24/0x50
>>>   kasan_save_track+0x14/0x30
>>>   __kasan_slab_alloc+0x87/0x90
>>>   kmem_cache_alloc_noprof+0x12b/0x3f0
>>>   copy_net_ns+0x94/0x380
>>>   create_new_namespaces+0x24c/0x500
>>>   unshare_nsproxy_namespaces+0x75/0xf0
>>>   ksys_unshare+0x24e/0x4f0
>>>   __x64_sys_unshare+0x1f/0x30
>>>   do_syscall_64+0x70/0x180
>>>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>
>>> Freed by task 100:
>>>   kasan_save_stack+0x24/0x50
>>>   kasan_save_track+0x14/0x30
>>>   kasan_save_free_info+0x3b/0x60
>>>   __kasan_slab_free+0x54/0x70
>>>   kmem_cache_free+0x156/0x5d0
>>>   cleanup_net+0x5d3/0x670
>>>   process_one_work+0x776/0xa90
>>>   worker_thread+0x2e2/0x560
>>>   kthread+0x1a8/0x1f0
>>>   ret_from_fork+0x34/0x60
>>>   ret_from_fork_asm+0x1a/0x30
>>>
>>> Reproduction script:
>>>
>>> mkdir -p /mnt/nfsshare
>>> mkdir -p /mnt/nfs/netns_1
>>> mkfs.ext4 /dev/sdb
>>> mount /dev/sdb /mnt/nfsshare
>>> systemctl restart nfs-server
>>> chmod 777 /mnt/nfsshare
>>> exportfs -i -o rw,no_root_squash *:/mnt/nfsshare
>>>
>>> ip netns add netns_1
>>> ip link add name veth_1_peer type veth peer veth_1
>>> ifconfig veth_1_peer 11.11.0.254 up
>>> ip link set veth_1 netns netns_1
>>> ip netns exec netns_1 ifconfig veth_1 11.11.0.1
>>>
>>> ip netns exec netns_1 /root/iptables -A OUTPUT -d 11.11.0.254 -p
>>> tcp
>>> \
>>> 	--tcp-flags FIN FIN  -j DROP
>>>
>>> (note: In my environment, a DESTROY_CLIENTID operation is always
>>> sent
>>>   immediately, breaking the nfs tcp connection.)
>>> ip netns exec netns_1 timeout -s 9 300 mount -t nfs -o
>>> proto=tcp,vers=4.1 \
>>> 	11.11.0.254:/mnt/nfsshare /mnt/nfs/netns_1
>>>
>>> ip netns del netns_1
>>>
>>> The reason here is that the tcp socket in netns_1 (nfs side) has
>>> been
>>> shutdown and closed (done in xs_destroy), but the FIN message (with
>>> ack)
>>> is discarded, and the nfsd side keeps sending retransmission
>>> messages.
>>> As a result, when the tcp sock in netns_1 processes the received
>>> message,
>>> it sends the message (FIN message) in the sending queue, and the
>>> tcp
>>> timer
>>> is re-established. When the network namespace is deleted, the net
>>> structure
>>> accessed by tcp's timer handler function causes problems.
>>>
>>> The modification here aborts the TCP connection directly in
>>> xs_destroy().
>>>
>>> Fixes: 26abe14379f8 ("net: Modify sk_alloc to not reference count
>>> the
>>> netns of kernel sockets.")
>>> Signed-off-by: Liu Jian <liujian56@huawei.com>
>>> ---
>>>   net/sunrpc/xprtsock.c | 3 +++
>>>   1 file changed, 3 insertions(+)
>>>
>>> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
>>> index 0e1691316f42..91ee3484155a 100644
>>> --- a/net/sunrpc/xprtsock.c
>>> +++ b/net/sunrpc/xprtsock.c
>>> @@ -1287,6 +1287,9 @@ static void xs_reset_transport(struct
>>> sock_xprt
>>> *transport)
>>>   	release_sock(sk);
>>>   	mutex_unlock(&transport->recv_mutex);
>>>   
>>> +	if (sk->sk_prot == &tcp_prot)
>>> +		tcp_abort(sk, ECONNABORTED);
>> We've already called kernel_sock_shutdown(sock, SHUT_RDWR), and we're
>> about to close the socket. Why on earth should we need a hack like
>> the
>> above in order to abort the connection at this point?
>>
>> This would appear to be a networking layer bug, and not an RPC level
>> problem.
>>
> To put this differently: if a use after free can occur in the kernel
> when the RPC layer closes a TCP socket and then exits the network
> namespace, then can't that occur when a userland application does the
> same?
>
> If not, then what prevents it from happening?
The socket created by the userspace program obtains the reference 
counting of the namespace, but the kernel socket does not.

There's some discussion here:
https://lore.kernel.org/all/CANn89iJE5anTbyLJ0TdGAqGsE+GichY3YzQECjNUVMz=G3bcQg@mail.gmail.com/
>>> +
>>>   	trace_rpc_socket_close(xprt, sock);
>>>   	__fput_sync(filp);
>>>
Trond Myklebust Oct. 24, 2024, 1:57 p.m. UTC | #4
On Thu, 2024-10-24 at 21:40 +0800, liujian (CE) wrote:
> 
> 
> 在 2024/10/24 20:57, Trond Myklebust 写道:
> > On Thu, 2024-10-24 at 02:20 +0000, Trond Myklebust wrote:
> > > On Thu, 2024-10-24 at 09:55 +0800, Liu Jian wrote:
> > > > BUG: KASAN: slab-use-after-free in
> > > > tcp_write_timer_handler+0x156/0x3e0
> > > > Read of size 1 at addr ffff888111f322cd by task swapper/0/0
> > > > 
> > > > CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.12.0-rc4-
> > > > dirty
> > > > #7
> > > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> > > > 1.15.0-
> > > > 1
> > > > Call Trace:
> > > >   <IRQ>
> > > >   dump_stack_lvl+0x68/0xa0
> > > >   print_address_description.constprop.0+0x2c/0x3d0
> > > >   print_report+0xb4/0x270
> > > >   kasan_report+0xbd/0xf0
> > > >   tcp_write_timer_handler+0x156/0x3e0
> > > >   tcp_write_timer+0x66/0x170
> > > >   call_timer_fn+0xfb/0x1d0
> > > >   __run_timers+0x3f8/0x480
> > > >   run_timer_softirq+0x9b/0x100
> > > >   handle_softirqs+0x153/0x390
> > > >   __irq_exit_rcu+0x103/0x120
> > > >   irq_exit_rcu+0xe/0x20
> > > >   sysvec_apic_timer_interrupt+0x76/0x90
> > > >   </IRQ>
> > > >   <TASK>
> > > >   asm_sysvec_apic_timer_interrupt+0x1a/0x20
> > > > RIP: 0010:default_idle+0xf/0x20
> > > > Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90
> > > > 90
> > > > 90
> > > > 90 90
> > > >   90 90 90 90 f3 0f 1e fa 66 90 0f 00 2d 33 f8 25 00 fb f4 <fa>
> > > > c3
> > > > cc
> > > > cc cc
> > > >   cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
> > > > RSP: 0018:ffffffffa2007e28 EFLAGS: 00000242
> > > > RAX: 00000000000f3b31 RBX: 1ffffffff4400fc7 RCX:
> > > > ffffffffa09c3196
> > > > RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> > > > ffffffff9f00590f
> > > > RBP: 0000000000000000 R08: 0000000000000001 R09:
> > > > ffffed102360835d
> > > > R10: ffff88811b041aeb R11: 0000000000000001 R12:
> > > > 0000000000000000
> > > > R13: ffffffffa202d7c0 R14: 0000000000000000 R15:
> > > > 00000000000147d0
> > > >   default_idle_call+0x6b/0xa0
> > > >   cpuidle_idle_call+0x1af/0x1f0
> > > >   do_idle+0xbc/0x130
> > > >   cpu_startup_entry+0x33/0x40
> > > >   rest_init+0x11f/0x210
> > > >   start_kernel+0x39a/0x420
> > > >   x86_64_start_reservations+0x18/0x30
> > > >   x86_64_start_kernel+0x97/0xa0
> > > >   common_startup_64+0x13e/0x141
> > > >   </TASK>
> > > > 
> > > > Allocated by task 595:
> > > >   kasan_save_stack+0x24/0x50
> > > >   kasan_save_track+0x14/0x30
> > > >   __kasan_slab_alloc+0x87/0x90
> > > >   kmem_cache_alloc_noprof+0x12b/0x3f0
> > > >   copy_net_ns+0x94/0x380
> > > >   create_new_namespaces+0x24c/0x500
> > > >   unshare_nsproxy_namespaces+0x75/0xf0
> > > >   ksys_unshare+0x24e/0x4f0
> > > >   __x64_sys_unshare+0x1f/0x30
> > > >   do_syscall_64+0x70/0x180
> > > >   entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > > 
> > > > Freed by task 100:
> > > >   kasan_save_stack+0x24/0x50
> > > >   kasan_save_track+0x14/0x30
> > > >   kasan_save_free_info+0x3b/0x60
> > > >   __kasan_slab_free+0x54/0x70
> > > >   kmem_cache_free+0x156/0x5d0
> > > >   cleanup_net+0x5d3/0x670
> > > >   process_one_work+0x776/0xa90
> > > >   worker_thread+0x2e2/0x560
> > > >   kthread+0x1a8/0x1f0
> > > >   ret_from_fork+0x34/0x60
> > > >   ret_from_fork_asm+0x1a/0x30
> > > > 
> > > > Reproduction script:
> > > > 
> > > > mkdir -p /mnt/nfsshare
> > > > mkdir -p /mnt/nfs/netns_1
> > > > mkfs.ext4 /dev/sdb
> > > > mount /dev/sdb /mnt/nfsshare
> > > > systemctl restart nfs-server
> > > > chmod 777 /mnt/nfsshare
> > > > exportfs -i -o rw,no_root_squash *:/mnt/nfsshare
> > > > 
> > > > ip netns add netns_1
> > > > ip link add name veth_1_peer type veth peer veth_1
> > > > ifconfig veth_1_peer 11.11.0.254 up
> > > > ip link set veth_1 netns netns_1
> > > > ip netns exec netns_1 ifconfig veth_1 11.11.0.1
> > > > 
> > > > ip netns exec netns_1 /root/iptables -A OUTPUT -d 11.11.0.254 -
> > > > p
> > > > tcp
> > > > \
> > > > 	--tcp-flags FIN FIN  -j DROP
> > > > 
> > > > (note: In my environment, a DESTROY_CLIENTID operation is
> > > > always
> > > > sent
> > > >   immediately, breaking the nfs tcp connection.)
> > > > ip netns exec netns_1 timeout -s 9 300 mount -t nfs -o
> > > > proto=tcp,vers=4.1 \
> > > > 	11.11.0.254:/mnt/nfsshare /mnt/nfs/netns_1
> > > > 
> > > > ip netns del netns_1
> > > > 
> > > > The reason here is that the tcp socket in netns_1 (nfs side)
> > > > has
> > > > been
> > > > shutdown and closed (done in xs_destroy), but the FIN message
> > > > (with
> > > > ack)
> > > > is discarded, and the nfsd side keeps sending retransmission
> > > > messages.
> > > > As a result, when the tcp sock in netns_1 processes the
> > > > received
> > > > message,
> > > > it sends the message (FIN message) in the sending queue, and
> > > > the
> > > > tcp
> > > > timer
> > > > is re-established. When the network namespace is deleted, the
> > > > net
> > > > structure
> > > > accessed by tcp's timer handler function causes problems.
> > > > 
> > > > The modification here aborts the TCP connection directly in
> > > > xs_destroy().
> > > > 
> > > > Fixes: 26abe14379f8 ("net: Modify sk_alloc to not reference
> > > > count
> > > > the
> > > > netns of kernel sockets.")
> > > > Signed-off-by: Liu Jian <liujian56@huawei.com>
> > > > ---
> > > >   net/sunrpc/xprtsock.c | 3 +++
> > > >   1 file changed, 3 insertions(+)
> > > > 
> > > > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> > > > index 0e1691316f42..91ee3484155a 100644
> > > > --- a/net/sunrpc/xprtsock.c
> > > > +++ b/net/sunrpc/xprtsock.c
> > > > @@ -1287,6 +1287,9 @@ static void xs_reset_transport(struct
> > > > sock_xprt
> > > > *transport)
> > > >   	release_sock(sk);
> > > >   	mutex_unlock(&transport->recv_mutex);
> > > >   
> > > > +	if (sk->sk_prot == &tcp_prot)
> > > > +		tcp_abort(sk, ECONNABORTED);
> > > We've already called kernel_sock_shutdown(sock, SHUT_RDWR), and
> > > we're
> > > about to close the socket. Why on earth should we need a hack
> > > like
> > > the
> > > above in order to abort the connection at this point?
> > > 
> > > This would appear to be a networking layer bug, and not an RPC
> > > level
> > > problem.
> > > 
> > To put this differently: if a use after free can occur in the
> > kernel
> > when the RPC layer closes a TCP socket and then exits the network
> > namespace, then can't that occur when a userland application does
> > the
> > same?
> > 
> > If not, then what prevents it from happening?
> The socket created by the userspace program obtains the reference 
> counting of the namespace, but the kernel socket does not.
> 
> There's some discussion here:
> https://lore.kernel.org/all/CANn89iJE5anTbyLJ0TdGAqGsE+GichY3YzQECjNUVMz=G3bcQg@mail.gmail.com/

OK... So then it looks to me as if NFS, SMB, AFS, and any other
networked filesystem that can be started from inside a container is
going to need to do the same thing that rds appears to be doing.

Should there perhaps be a helper function in the networking layer for
this?

> > > > +
> > > >   	trace_rpc_socket_close(xprt, sock);
> > > >   	__fput_sync(filp);
> > > >   
> 
>
liujian (CE) Oct. 25, 2024, 3:32 a.m. UTC | #5
在 2024/10/24 21:57, Trond Myklebust 写道:
> On Thu, 2024-10-24 at 21:40 +0800, liujian (CE) wrote:
>>
>> 在 2024/10/24 20:57, Trond Myklebust 写道:
>>> On Thu, 2024-10-24 at 02:20 +0000, Trond Myklebust wrote:
>>>> On Thu, 2024-10-24 at 09:55 +0800, Liu Jian wrote:
>>>>> BUG: KASAN: slab-use-after-free in
>>>>> tcp_write_timer_handler+0x156/0x3e0
>>>>> Read of size 1 at addr ffff888111f322cd by task swapper/0/0
>>>>>
>>>>> CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.12.0-rc4-
>>>>> dirty
>>>>> #7
>>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>>>>> 1.15.0-
>>>>> 1
>>>>> Call Trace:
>>>>>    <IRQ>
>>>>>    dump_stack_lvl+0x68/0xa0
>>>>>    print_address_description.constprop.0+0x2c/0x3d0
>>>>>    print_report+0xb4/0x270
>>>>>    kasan_report+0xbd/0xf0
>>>>>    tcp_write_timer_handler+0x156/0x3e0
>>>>>    tcp_write_timer+0x66/0x170
>>>>>    call_timer_fn+0xfb/0x1d0
>>>>>    __run_timers+0x3f8/0x480
>>>>>    run_timer_softirq+0x9b/0x100
>>>>>    handle_softirqs+0x153/0x390
>>>>>    __irq_exit_rcu+0x103/0x120
>>>>>    irq_exit_rcu+0xe/0x20
>>>>>    sysvec_apic_timer_interrupt+0x76/0x90
>>>>>    </IRQ>
>>>>>    <TASK>
>>>>>    asm_sysvec_apic_timer_interrupt+0x1a/0x20
>>>>> RIP: 0010:default_idle+0xf/0x20
>>>>> Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90
>>>>> 90
>>>>> 90
>>>>> 90 90
>>>>>    90 90 90 90 f3 0f 1e fa 66 90 0f 00 2d 33 f8 25 00 fb f4 <fa>
>>>>> c3
>>>>> cc
>>>>> cc cc
>>>>>    cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
>>>>> RSP: 0018:ffffffffa2007e28 EFLAGS: 00000242
>>>>> RAX: 00000000000f3b31 RBX: 1ffffffff4400fc7 RCX:
>>>>> ffffffffa09c3196
>>>>> RDX: 0000000000000000 RSI: 0000000000000000 RDI:
>>>>> ffffffff9f00590f
>>>>> RBP: 0000000000000000 R08: 0000000000000001 R09:
>>>>> ffffed102360835d
>>>>> R10: ffff88811b041aeb R11: 0000000000000001 R12:
>>>>> 0000000000000000
>>>>> R13: ffffffffa202d7c0 R14: 0000000000000000 R15:
>>>>> 00000000000147d0
>>>>>    default_idle_call+0x6b/0xa0
>>>>>    cpuidle_idle_call+0x1af/0x1f0
>>>>>    do_idle+0xbc/0x130
>>>>>    cpu_startup_entry+0x33/0x40
>>>>>    rest_init+0x11f/0x210
>>>>>    start_kernel+0x39a/0x420
>>>>>    x86_64_start_reservations+0x18/0x30
>>>>>    x86_64_start_kernel+0x97/0xa0
>>>>>    common_startup_64+0x13e/0x141
>>>>>    </TASK>
>>>>>
>>>>> Allocated by task 595:
>>>>>    kasan_save_stack+0x24/0x50
>>>>>    kasan_save_track+0x14/0x30
>>>>>    __kasan_slab_alloc+0x87/0x90
>>>>>    kmem_cache_alloc_noprof+0x12b/0x3f0
>>>>>    copy_net_ns+0x94/0x380
>>>>>    create_new_namespaces+0x24c/0x500
>>>>>    unshare_nsproxy_namespaces+0x75/0xf0
>>>>>    ksys_unshare+0x24e/0x4f0
>>>>>    __x64_sys_unshare+0x1f/0x30
>>>>>    do_syscall_64+0x70/0x180
>>>>>    entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>>>
>>>>> Freed by task 100:
>>>>>    kasan_save_stack+0x24/0x50
>>>>>    kasan_save_track+0x14/0x30
>>>>>    kasan_save_free_info+0x3b/0x60
>>>>>    __kasan_slab_free+0x54/0x70
>>>>>    kmem_cache_free+0x156/0x5d0
>>>>>    cleanup_net+0x5d3/0x670
>>>>>    process_one_work+0x776/0xa90
>>>>>    worker_thread+0x2e2/0x560
>>>>>    kthread+0x1a8/0x1f0
>>>>>    ret_from_fork+0x34/0x60
>>>>>    ret_from_fork_asm+0x1a/0x30
>>>>>
>>>>> Reproduction script:
>>>>>
>>>>> mkdir -p /mnt/nfsshare
>>>>> mkdir -p /mnt/nfs/netns_1
>>>>> mkfs.ext4 /dev/sdb
>>>>> mount /dev/sdb /mnt/nfsshare
>>>>> systemctl restart nfs-server
>>>>> chmod 777 /mnt/nfsshare
>>>>> exportfs -i -o rw,no_root_squash *:/mnt/nfsshare
>>>>>
>>>>> ip netns add netns_1
>>>>> ip link add name veth_1_peer type veth peer veth_1
>>>>> ifconfig veth_1_peer 11.11.0.254 up
>>>>> ip link set veth_1 netns netns_1
>>>>> ip netns exec netns_1 ifconfig veth_1 11.11.0.1
>>>>>
>>>>> ip netns exec netns_1 /root/iptables -A OUTPUT -d 11.11.0.254 -
>>>>> p
>>>>> tcp
>>>>> \
>>>>> 	--tcp-flags FIN FIN  -j DROP
>>>>>
>>>>> (note: In my environment, a DESTROY_CLIENTID operation is
>>>>> always
>>>>> sent
>>>>>    immediately, breaking the nfs tcp connection.)
>>>>> ip netns exec netns_1 timeout -s 9 300 mount -t nfs -o
>>>>> proto=tcp,vers=4.1 \
>>>>> 	11.11.0.254:/mnt/nfsshare /mnt/nfs/netns_1
>>>>>
>>>>> ip netns del netns_1
>>>>>
>>>>> The reason here is that the tcp socket in netns_1 (nfs side)
>>>>> has
>>>>> been
>>>>> shutdown and closed (done in xs_destroy), but the FIN message
>>>>> (with
>>>>> ack)
>>>>> is discarded, and the nfsd side keeps sending retransmission
>>>>> messages.
>>>>> As a result, when the tcp sock in netns_1 processes the
>>>>> received
>>>>> message,
>>>>> it sends the message (FIN message) in the sending queue, and
>>>>> the
>>>>> tcp
>>>>> timer
>>>>> is re-established. When the network namespace is deleted, the
>>>>> net
>>>>> structure
>>>>> accessed by tcp's timer handler function causes problems.
>>>>>
>>>>> The modification here aborts the TCP connection directly in
>>>>> xs_destroy().
>>>>>
>>>>> Fixes: 26abe14379f8 ("net: Modify sk_alloc to not reference
>>>>> count
>>>>> the
>>>>> netns of kernel sockets.")
>>>>> Signed-off-by: Liu Jian <liujian56@huawei.com>
>>>>> ---
>>>>>    net/sunrpc/xprtsock.c | 3 +++
>>>>>    1 file changed, 3 insertions(+)
>>>>>
>>>>> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
>>>>> index 0e1691316f42..91ee3484155a 100644
>>>>> --- a/net/sunrpc/xprtsock.c
>>>>> +++ b/net/sunrpc/xprtsock.c
>>>>> @@ -1287,6 +1287,9 @@ static void xs_reset_transport(struct
>>>>> sock_xprt
>>>>> *transport)
>>>>>    	release_sock(sk);
>>>>>    	mutex_unlock(&transport->recv_mutex);
>>>>>    
>>>>> +	if (sk->sk_prot == &tcp_prot)
>>>>> +		tcp_abort(sk, ECONNABORTED);
>>>> We've already called kernel_sock_shutdown(sock, SHUT_RDWR), and
>>>> we're
>>>> about to close the socket. Why on earth should we need a hack
>>>> like
>>>> the
>>>> above in order to abort the connection at this point?
>>>>
>>>> This would appear to be a networking layer bug, and not an RPC
>>>> level
>>>> problem.
>>>>
>>> To put this differently: if a use after free can occur in the
>>> kernel
>>> when the RPC layer closes a TCP socket and then exits the network
>>> namespace, then can't that occur when a userland application does
>>> the
>>> same?
>>>
>>> If not, then what prevents it from happening?
>> The socket created by the userspace program obtains the reference
>> counting of the namespace, but the kernel socket does not.
>>
>> There's some discussion here:
>> https://lore.kernel.org/all/CANn89iJE5anTbyLJ0TdGAqGsE+GichY3YzQECjNUVMz=G3bcQg@mail.gmail.com/
> OK... So then it looks to me as if NFS, SMB, AFS, and any other
> networked filesystem that can be started from inside a container is
> going to need to do the same thing that rds appears to be doing.
>
> Should there perhaps be a helper function in the networking layer for
> this?

There should be no such helper function at present, right?.

If get net's reference to fix this problem, the following test is 
performed. There's nothing wrong with this case. I don't know if there's 
anything else to consider.

I don't have any other ideas other than these two methods. Do you have 
any suggestions on this problem? @Eric @Jakub ... @All

diff --git a/include/linux/net.h b/include/linux/net.h
index b75bc534c1b3..58216da3b62c 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -255,6 +255,7 @@ int __sock_create(struct net *net, int family, int 
type, int proto,
                   struct socket **res, int kern);
  int sock_create(int family, int type, int proto, struct socket **res);
  int sock_create_kern(struct net *net, int family, int type, int proto, 
struct socket **res);
+int sock_create_kern_getnet(struct net *net, int family, int type, int 
proto, struct socket **res);
  int sock_create_lite(int family, int type, int proto, struct socket 
**res);
  struct socket *sock_alloc(void);
  void sock_release(struct socket *sock);
diff --git a/net/socket.c b/net/socket.c
index 042451f01c65..e64a02445b1a 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1651,6 +1651,34 @@ int sock_create_kern(struct net *net, int family, 
int type, int protocol, struct
  }
  EXPORT_SYMBOL(sock_create_kern);

+int sock_create_kern_getnet(struct net *net, int family, int type, int 
proto, struct socket **res)
+{
+       struct sock *sk;
+       int ret;
+
+       if (!maybe_get_net(net))
+               return -EINVAL;
+
+       ret = sock_create_kern(net, family, type, proto, res);
+       if (ret < 0) {
+               put_net(net);
+               return ret;
+       }
+
+       sk = (*res)->sk;
+       lock_sock(sk);
+       /* Update ns_tracker to current stack trace and refcounted 
tracker */
+       __netns_tracker_free(net, &sk->ns_tracker, false);
+
+       sk->sk_net_refcnt = 1;
+       netns_tracker_alloc(net, &sk->ns_tracker, GFP_KERNEL);
+       sock_inuse_add(net, 1);
+       release_sock(sk);
+
+       return ret;
+}
+EXPORT_SYMBOL(sock_create_kern_getnet);
+
  static struct socket *__sys_socket_create(int family, int type, int 
protocol)
  {
         struct socket *sock;
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 825ec5357691..31dc291446fb 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1526,7 +1526,10 @@ static struct svc_xprt *svc_create_socket(struct 
svc_serv *serv,
                 return ERR_PTR(-EINVAL);
         }

-       error = __sock_create(net, family, type, protocol, &sock, 1);
+       if (protocol == IPPROTO_TCP)
+               error = sock_create_kern_getnet(net, family, type, 
protocol, &sock);
+       else
+               error = sock_create_kern(net, family, type, protocol, 
&sock);
         if (error < 0)
                 return ERR_PTR(error);

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 0e1691316f42..d2304010daeb 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1922,7 +1922,10 @@ static struct socket *xs_create_sock(struct 
rpc_xprt *xprt,
         struct socket *sock;
         int err;

-       err = __sock_create(xprt->xprt_net, family, type, protocol, 
&sock, 1);
+       if (protocol == IPPROTO_TCP)
+               err = sock_create_kern_getnet(xprt->xprt_net, family, 
type, protocol, &sock);
+       else
+               err = sock_create_kern(xprt->xprt_net, family, type, 
protocol, &sock);
         if (err < 0) {
                 dprintk("RPC:       can't create %d transport socket 
(%d).\n",
                                 protocol, -err);

>>>>> +
>>>>>    	trace_rpc_socket_close(xprt, sock);
>>>>>    	__fput_sync(filp);
>>>>>    
>>
Kuniyuki Iwashima Oct. 25, 2024, 9:20 p.m. UTC | #6
From: "liujian (CE)" <liujian56@huawei.com>
Date: Fri, 25 Oct 2024 11:32:52 +0800
> >>> If not, then what prevents it from happening?
> >> The socket created by the userspace program obtains the reference
> >> counting of the namespace, but the kernel socket does not.
> >>
> >> There's some discussion here:
> >> https://lore.kernel.org/all/CANn89iJE5anTbyLJ0TdGAqGsE+GichY3YzQECjNUVMz=G3bcQg@mail.gmail.com/
> > OK... So then it looks to me as if NFS, SMB, AFS, and any other
> > networked filesystem that can be started from inside a container is
> > going to need to do the same thing that rds appears to be doing.

FWIW, recently we saw a similar UAF on CIFS.


> >
> > Should there perhaps be a helper function in the networking layer for
> > this?
> 
> There should be no such helper function at present, right?.
> 
> If get net's reference to fix this problem, the following test is 
> performed. There's nothing wrong with this case. I don't know if there's 
> anything else to consider.
> 
> I don't have any other ideas other than these two methods. Do you have 
> any suggestions on this problem? @Eric @Jakub ... @All

The netns lifetime should be managed by the upper layer rather than
the networking layer.  If the netns is already dead, the upper layer
must discard the net pointer anyway.

I suggest checking maybe_get_net() in NFS, CIFS, etc and then calling
__sock_create() with kern 0.


> 
> diff --git a/include/linux/net.h b/include/linux/net.h
> index b75bc534c1b3..58216da3b62c 100644
> --- a/include/linux/net.h
> +++ b/include/linux/net.h
> @@ -255,6 +255,7 @@ int __sock_create(struct net *net, int family, int 
> type, int proto,
>                    struct socket **res, int kern);
>   int sock_create(int family, int type, int proto, struct socket **res);
>   int sock_create_kern(struct net *net, int family, int type, int proto, 
> struct socket **res);
> +int sock_create_kern_getnet(struct net *net, int family, int type, int 
> proto, struct socket **res);
>   int sock_create_lite(int family, int type, int proto, struct socket 
> **res);
>   struct socket *sock_alloc(void);
>   void sock_release(struct socket *sock);
> diff --git a/net/socket.c b/net/socket.c
> index 042451f01c65..e64a02445b1a 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -1651,6 +1651,34 @@ int sock_create_kern(struct net *net, int family, 
> int type, int protocol, struct
>   }
>   EXPORT_SYMBOL(sock_create_kern);
> 
> +int sock_create_kern_getnet(struct net *net, int family, int type, int 
> proto, struct socket **res)
> +{
> +       struct sock *sk;
> +       int ret;
> +
> +       if (!maybe_get_net(net))
> +               return -EINVAL;
> +
> +       ret = sock_create_kern(net, family, type, proto, res);
> +       if (ret < 0) {
> +               put_net(net);
> +               return ret;
> +       }
> +
> +       sk = (*res)->sk;
> +       lock_sock(sk);
> +       /* Update ns_tracker to current stack trace and refcounted 
> tracker */
> +       __netns_tracker_free(net, &sk->ns_tracker, false);
> +
> +       sk->sk_net_refcnt = 1;
> +       netns_tracker_alloc(net, &sk->ns_tracker, GFP_KERNEL);
> +       sock_inuse_add(net, 1);
> +       release_sock(sk);
> +
> +       return ret;
> +}
> +EXPORT_SYMBOL(sock_create_kern_getnet);
> +
>   static struct socket *__sys_socket_create(int family, int type, int 
> protocol)
>   {
>          struct socket *sock;
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 825ec5357691..31dc291446fb 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -1526,7 +1526,10 @@ static struct svc_xprt *svc_create_socket(struct 
> svc_serv *serv,
>                  return ERR_PTR(-EINVAL);
>          }
> 
> -       error = __sock_create(net, family, type, protocol, &sock, 1);
> +       if (protocol == IPPROTO_TCP)
> +               error = sock_create_kern_getnet(net, family, type, 
> protocol, &sock);
> +       else
> +               error = sock_create_kern(net, family, type, protocol, 
> &sock);
>          if (error < 0)
>                  return ERR_PTR(error);
> 
> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> index 0e1691316f42..d2304010daeb 100644
> --- a/net/sunrpc/xprtsock.c
> +++ b/net/sunrpc/xprtsock.c
> @@ -1922,7 +1922,10 @@ static struct socket *xs_create_sock(struct 
> rpc_xprt *xprt,
>          struct socket *sock;
>          int err;
> 
> -       err = __sock_create(xprt->xprt_net, family, type, protocol, 
> &sock, 1);
> +       if (protocol == IPPROTO_TCP)
> +               err = sock_create_kern_getnet(xprt->xprt_net, family, 
> type, protocol, &sock);
> +       else
> +               err = sock_create_kern(xprt->xprt_net, family, type, 
> protocol, &sock);
>          if (err < 0) {
>                  dprintk("RPC:       can't create %d transport socket 
> (%d).\n",
>                                  protocol, -err);
Trond Myklebust Oct. 26, 2024, 12:35 a.m. UTC | #7
On Fri, 2024-10-25 at 14:20 -0700, Kuniyuki Iwashima wrote:
> From: "liujian (CE)" <liujian56@huawei.com>
> Date: Fri, 25 Oct 2024 11:32:52 +0800
> > > > > If not, then what prevents it from happening?
> > > > The socket created by the userspace program obtains the
> > > > reference
> > > > counting of the namespace, but the kernel socket does not.
> > > > 
> > > > There's some discussion here:
> > > > https://lore.kernel.org/all/CANn89iJE5anTbyLJ0TdGAqGsE+GichY3YzQECjNUVMz=G3bcQg@mail.gmail.com/
> > > OK... So then it looks to me as if NFS, SMB, AFS, and any other
> > > networked filesystem that can be started from inside a container
> > > is
> > > going to need to do the same thing that rds appears to be doing.
> 
> FWIW, recently we saw a similar UAF on CIFS.
> 
> 
> > > 
> > > Should there perhaps be a helper function in the networking layer
> > > for
> > > this?
> > 
> > There should be no such helper function at present, right?.
> > 
> > If get net's reference to fix this problem, the following test is 
> > performed. There's nothing wrong with this case. I don't know if
> > there's 
> > anything else to consider.
> > 
> > I don't have any other ideas other than these two methods. Do you
> > have 
> > any suggestions on this problem? @Eric @Jakub ... @All
> 
> The netns lifetime should be managed by the upper layer rather than
> the networking layer.  If the netns is already dead, the upper layer
> must discard the net pointer anyway.
> 
> I suggest checking maybe_get_net() in NFS, CIFS, etc and then calling
> __sock_create() with kern 0.
> 

Thanks for the suggestion, but we already manage the netns lifetime in
the RPC layer. A reference is taken when the filesystem is being
mounted. It is dropped when the filesystem is being unmounted.

The problem is the TCP timer races on shutdown. There is no interest in
having to manage that in the RPC layer.
Kuniyuki Iwashima Oct. 26, 2024, 12:48 a.m. UTC | #8
From: Trond Myklebust <trondmy@hammerspace.com>
Date: Sat, 26 Oct 2024 00:35:30 +0000
> On Fri, 2024-10-25 at 14:20 -0700, Kuniyuki Iwashima wrote:
> > From: "liujian (CE)" <liujian56@huawei.com>
> > Date: Fri, 25 Oct 2024 11:32:52 +0800
> > > > > > If not, then what prevents it from happening?
> > > > > The socket created by the userspace program obtains the
> > > > > reference
> > > > > counting of the namespace, but the kernel socket does not.
> > > > > 
> > > > > There's some discussion here:
> > > > > https://lore.kernel.org/all/CANn89iJE5anTbyLJ0TdGAqGsE+GichY3YzQECjNUVMz=G3bcQg@mail.gmail.com/
> > > > OK... So then it looks to me as if NFS, SMB, AFS, and any other
> > > > networked filesystem that can be started from inside a container
> > > > is
> > > > going to need to do the same thing that rds appears to be doing.
> > 
> > FWIW, recently we saw a similar UAF on CIFS.
> > 
> > 
> > > > 
> > > > Should there perhaps be a helper function in the networking layer
> > > > for
> > > > this?
> > > 
> > > There should be no such helper function at present, right?.
> > > 
> > > If get net's reference to fix this problem, the following test is 
> > > performed. There's nothing wrong with this case. I don't know if
> > > there's 
> > > anything else to consider.
> > > 
> > > I don't have any other ideas other than these two methods. Do you
> > > have 
> > > any suggestions on this problem? @Eric @Jakub ... @All
> > 
> > The netns lifetime should be managed by the upper layer rather than
> > the networking layer.  If the netns is already dead, the upper layer
> > must discard the net pointer anyway.
> > 
> > I suggest checking maybe_get_net() in NFS, CIFS, etc and then calling
> > __sock_create() with kern 0.
> > 
> 
> Thanks for the suggestion, but we already manage the netns lifetime in
> the RPC layer. A reference is taken when the filesystem is being
> mounted. It is dropped when the filesystem is being unmounted.
> 
> The problem is the TCP timer races on shutdown. There is no interest in
> having to manage that in the RPC layer.

Does that mean netns is always alive when the socket is created
in svc_create_socket() or xs_create_sock() ?  If so, you can just
use __sock_create(kern=0) there to prevent net from being freed
before the socket.

sock_create_kern() and kern@ are confusing, and we had similar issues
in other kernel TCP socket users SMC/RDS, so I'll rename them to
sock_create_noref() and no_net_ref@ or something.
liujian (CE) Oct. 26, 2024, 1:31 a.m. UTC | #9
在 2024/10/26 8:48, Kuniyuki Iwashima 写道:
> From: Trond Myklebust <trondmy@hammerspace.com>
> Date: Sat, 26 Oct 2024 00:35:30 +0000
>> On Fri, 2024-10-25 at 14:20 -0700, Kuniyuki Iwashima wrote:
>>> From: "liujian (CE)" <liujian56@huawei.com>
>>> Date: Fri, 25 Oct 2024 11:32:52 +0800
>>>>>>> If not, then what prevents it from happening?
>>>>>> The socket created by the userspace program obtains the
>>>>>> reference
>>>>>> counting of the namespace, but the kernel socket does not.
>>>>>>
>>>>>> There's some discussion here:
>>>>>> https://lore.kernel.org/all/CANn89iJE5anTbyLJ0TdGAqGsE+GichY3YzQECjNUVMz=G3bcQg@mail.gmail.com/
>>>>> OK... So then it looks to me as if NFS, SMB, AFS, and any other
>>>>> networked filesystem that can be started from inside a container
>>>>> is
>>>>> going to need to do the same thing that rds appears to be doing.
>>>
>>> FWIW, recently we saw a similar UAF on CIFS.
>>>
>>>
>>>>>
>>>>> Should there perhaps be a helper function in the networking layer
>>>>> for
>>>>> this?
>>>>
>>>> There should be no such helper function at present, right?.
>>>>
>>>> If get net's reference to fix this problem, the following test is
>>>> performed. There's nothing wrong with this case. I don't know if
>>>> there's
>>>> anything else to consider.
>>>>
>>>> I don't have any other ideas other than these two methods. Do you
>>>> have
>>>> any suggestions on this problem? @Eric @Jakub ... @All
>>>
>>> The netns lifetime should be managed by the upper layer rather than
>>> the networking layer.  If the netns is already dead, the upper layer
>>> must discard the net pointer anyway.
>>>
>>> I suggest checking maybe_get_net() in NFS, CIFS, etc and then calling
>>> __sock_create() with kern 0.
Only maybe_get_net() is enough. sk_kern_sock also needs to identify 
whether it is a kernel socket.
>>>
>>
>> Thanks for the suggestion, but we already manage the netns lifetime in
>> the RPC layer. A reference is taken when the filesystem is being
>> mounted. It is dropped when the filesystem is being unmounted.
>>
>> The problem is the TCP timer races on shutdown. There is no interest in
>> having to manage that in the RPC layer.
> 
> Does that mean netns is always alive when the socket is created
> in svc_create_socket() or xs_create_sock() ?  If so, you can just
> use __sock_create(kern=0) there to prevent net from being freed
> before the socket.
> 
> sock_create_kern() and kern@ are confusing, and we had similar issues
> in other kernel TCP socket users SMC/RDS, so I'll rename them to
> sock_create_noref() and no_net_ref@ or something.
diff mbox series

Patch

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 0e1691316f42..91ee3484155a 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1287,6 +1287,9 @@  static void xs_reset_transport(struct sock_xprt *transport)
 	release_sock(sk);
 	mutex_unlock(&transport->recv_mutex);
 
+	if (sk->sk_prot == &tcp_prot)
+		tcp_abort(sk, ECONNABORTED);
+
 	trace_rpc_socket_close(xprt, sock);
 	__fput_sync(filp);