diff mbox series

[net-next,v10,08/16] tls: Inline do_tcp_sendpages()

Message ID 20230522121125.2595254-9-dhowells@redhat.com (mailing list archive)
State New
Headers show
Series splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 | expand

Commit Message

David Howells May 22, 2023, 12:11 p.m. UTC
do_tcp_sendpages() is now just a small wrapper around tcp_sendmsg_locked(),
so inline it, allowing do_tcp_sendpages() to be removed.  This is part of
replacing ->sendpage() with a call to sendmsg() with MSG_SPLICE_PAGES set.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Boris Pismenny <borisp@nvidia.com>
cc: John Fastabend <john.fastabend@gmail.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 include/net/tls.h  |  2 +-
 net/tls/tls_main.c | 24 +++++++++++++++---------
 2 files changed, 16 insertions(+), 10 deletions(-)

Comments

Tariq Toukan June 7, 2023, 2:17 p.m. UTC | #1
On 22/05/2023 15:11, David Howells wrote:
> do_tcp_sendpages() is now just a small wrapper around tcp_sendmsg_locked(),
> so inline it, allowing do_tcp_sendpages() to be removed.  This is part of
> replacing ->sendpage() with a call to sendmsg() with MSG_SPLICE_PAGES set.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Boris Pismenny <borisp@nvidia.com>
> cc: John Fastabend <john.fastabend@gmail.com>
> cc: Jakub Kicinski <kuba@kernel.org>
> cc: "David S. Miller" <davem@davemloft.net>
> cc: Eric Dumazet <edumazet@google.com>
> cc: Paolo Abeni <pabeni@redhat.com>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: netdev@vger.kernel.org
> ---

Hi,

My team spotted a new degradation in TLS TX device offload, bisected to 
this patch.

 From a quick look at the patch, it's not clear to me what's going wrong.
Please let us know of any helpful information that we can provide to 
help in the debug.

Regards,
Tariq

Reproduce Flow:
client / server test using nginx and  wrk (nothing special/custom about 
the apps used).

client:
/opt/mellanox/iproute2/sbin/ip link set dev eth3 up
/opt/mellanox/iproute2/sbin/ip addr add 11.141.46.9/16 dev eth3

server:
/opt/mellanox/iproute2/sbin/ip link set dev eth3 up
/opt/mellanox/iproute2/sbin/ip addr add 11.141.46.10/16 dev eth3

client:
/auto/sw/regression/sw_net_ver_tools/ktls/tools/x86_64/nginx_openssl_3_0_0 
-p /usr/bin/drivertest_rpms/ktls/nginx/
/opt/mellanox/iproute2/sbin/ss -i src [11.141.46.9]

server:
/auto/sw/regression/sw_net_ver_tools/ktls/tools/x86_64/wrk_openssl_3_0_0 
-b11.141.46.10 -t4 -c874 -d14 --timeout 5s 
https://11.141.46.9:20443/256000b.img

client:
dmesg
/auto/sw/regression/sw_net_ver_tools/ktls/tools/x86_64/nginx_openssl_3_0_0 
-p /usr/bin/drivertest_rpms/ktls/nginx/ -s stop


[root@c-141-46-1-009 ~]# dmesg
------------[ cut here ]------------
WARNING: CPU: 1 PID: 977 at net/core/skbuff.c:6957 
skb_splice_from_iter+0x102/0x300
Modules linked in: rpcrdma rdma_ucm ib_iser libiscsi 
scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib 
ib_uverbs ib_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink 
nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 
auth_rpcgss oid_registry overlay mlx5_core zram zsmalloc fuse
CPU: 1 PID: 977 Comm: nginx_openssl_3 Not tainted 
6.4.0-rc3_for_upstream_min_debug_2023_06_01_23_04 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:skb_splice_from_iter+0x102/0x300
Code: ef 48 8b 55 08 f6 c2 01 0f 85 54 01 00 00 8b 0d 98 cf 5f 01 48 89 
ea 85 c9 0f 8f 4c 01 00 00 48 8b 12 80 e6 02 74 48 49 89 dd <0f> 0b 48 
c7 c1 fb ff ff ff 45 01 65 70 45 01 65 74 45 01 a5 d0 00
RSP: 0018:ffff8881045abaa0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88814370fe00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffea00051123c0 RDI: ffff88814370fe00
RBP: ffffea0005112400 R08: 0000000000000011 R09: 0000000000003ffd
R10: 0000000000003ffd R11: 0000000000000008 R12: 0000000000002e6e
R13: ffff88814370fe00 R14: ffff8881045abae8 R15: 000000000000118f
FS:  00007f6e23043740(0000) GS:ffff88852c880000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000009c6c00 CR3: 000000013b791001 CR4: 0000000000370ea0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:

  ? kmalloc_reserve+0x86/0xe0
  tcp_sendmsg_locked+0x33e/0xd40
  tls_push_sg+0xdd/0x230
  tls_push_data+0x673/0x920
  tls_device_sendmsg+0x6e/0xc0
  sock_sendmsg+0x38/0x60
  sock_write_iter+0x97/0x100
  vfs_write+0x2df/0x380
  ksys_write+0xa7/0xe0
  do_syscall_64+0x3d/0x90
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f6e22f018b7
Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e 
fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 
f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007ffdb528a2f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000004000 RCX: 00007f6e22f018b7
RDX: 0000000000004000 RSI: 00000000025cdef0 RDI: 0000000000000028
RBP: 00000000020103c0 R08: 00007ffdb5289a90 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000025cdef0
R13: 000000000204fca0 R14: 0000000000004000 R15: 0000000000004000

---[ end trace 0000000000000000 ]---



>   include/net/tls.h  |  2 +-
>   net/tls/tls_main.c | 24 +++++++++++++++---------
>   2 files changed, 16 insertions(+), 10 deletions(-)
> 
> diff --git a/include/net/tls.h b/include/net/tls.h
> index 6056ce5a2aa5..5791ca7a189c 100644
> --- a/include/net/tls.h
> +++ b/include/net/tls.h
> @@ -258,7 +258,7 @@ struct tls_context {
>   	struct scatterlist *partially_sent_record;
>   	u16 partially_sent_offset;
>   
> -	bool in_tcp_sendpages;
> +	bool splicing_pages;
>   	bool pending_open_record_frags;
>   
>   	struct mutex tx_lock; /* protects partially_sent_* fields and
> diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> index f2e7302a4d96..3d45fdb5c4e9 100644
> --- a/net/tls/tls_main.c
> +++ b/net/tls/tls_main.c
> @@ -125,7 +125,10 @@ int tls_push_sg(struct sock *sk,
>   		u16 first_offset,
>   		int flags)
>   {
> -	int sendpage_flags = flags | MSG_SENDPAGE_NOTLAST;
> +	struct bio_vec bvec;
> +	struct msghdr msg = {
> +		.msg_flags = MSG_SENDPAGE_NOTLAST | MSG_SPLICE_PAGES | flags,
> +	};
>   	int ret = 0;
>   	struct page *p;
>   	size_t size;
> @@ -134,16 +137,19 @@ int tls_push_sg(struct sock *sk,
>   	size = sg->length - offset;
>   	offset += sg->offset;
>   
> -	ctx->in_tcp_sendpages = true;
> +	ctx->splicing_pages = true;
>   	while (1) {
>   		if (sg_is_last(sg))
> -			sendpage_flags = flags;
> +			msg.msg_flags = flags;
>   
>   		/* is sending application-limited? */
>   		tcp_rate_check_app_limited(sk);
>   		p = sg_page(sg);
>   retry:
> -		ret = do_tcp_sendpages(sk, p, offset, size, sendpage_flags);
> +		bvec_set_page(&bvec, p, size, offset);
> +		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
> +
> +		ret = tcp_sendmsg_locked(sk, &msg, size);
>   
>   		if (ret != size) {
>   			if (ret > 0) {
> @@ -155,7 +161,7 @@ int tls_push_sg(struct sock *sk,
>   			offset -= sg->offset;
>   			ctx->partially_sent_offset = offset;
>   			ctx->partially_sent_record = (void *)sg;
> -			ctx->in_tcp_sendpages = false;
> +			ctx->splicing_pages = false;
>   			return ret;
>   		}
>   
> @@ -169,7 +175,7 @@ int tls_push_sg(struct sock *sk,
>   		size = sg->length;
>   	}
>   
> -	ctx->in_tcp_sendpages = false;
> +	ctx->splicing_pages = false;
>   
>   	return 0;
>   }
> @@ -247,11 +253,11 @@ static void tls_write_space(struct sock *sk)
>   {
>   	struct tls_context *ctx = tls_get_ctx(sk);
>   
> -	/* If in_tcp_sendpages call lower protocol write space handler
> +	/* If splicing_pages call lower protocol write space handler
>   	 * to ensure we wake up any waiting operations there. For example
> -	 * if do_tcp_sendpages where to call sk_wait_event.
> +	 * if splicing pages where to call sk_wait_event.
>   	 */
> -	if (ctx->in_tcp_sendpages) {
> +	if (ctx->splicing_pages) {
>   		ctx->sk_write_space(sk);
>   		return;
>   	}
> 
>
David Howells June 7, 2023, 3:03 p.m. UTC | #2
Tariq Toukan <ttoukan.linux@gmail.com> wrote:

> My team spotted a new degradation in TLS TX device offload, bisected to this
> patch.

I presume you're using some hardware (I'm guessing Mellanox?) that can
actually do TLS offload?  Unfortunately, I don't have any hardware that can do
this, so I can't test the tls_device stuff.

> From a quick look at the patch, it's not clear to me what's going wrong.
> Please let us know of any helpful information that we can provide to help in
> the debug.

Can you find out what source line this corresponds to?

	RIP: 0010:skb_splice_from_iter+0x102/0x300

Assuming you're building your own kernel, something like the following might
do the trick:

	echo "RIP: 0010:skb_splice_from_iter+0x102/0x300" |
	./scripts/decode_stacktrace.sh /my/built/vmlinux /my/build/tree

if you run it in the kernel source tree you're using and substitute the
paths to vmlinux and the build tree for modules.

David
Tariq Toukan June 13, 2023, 11:15 a.m. UTC | #3
On 07/06/2023 18:03, David Howells wrote:
> Tariq Toukan <ttoukan.linux@gmail.com> wrote:
> 
>> My team spotted a new degradation in TLS TX device offload, bisected to this
>> patch.
> 
> I presume you're using some hardware (I'm guessing Mellanox?) that can
> actually do TLS offload?  Unfortunately, I don't have any hardware that can do
> this, so I can't test the tls_device stuff.
> 
>>  From a quick look at the patch, it's not clear to me what's going wrong.
>> Please let us know of any helpful information that we can provide to help in
>> the debug.
> 
> Can you find out what source line this corresponds to?
> 
> 	RIP: 0010:skb_splice_from_iter+0x102/0x300
> 
> Assuming you're building your own kernel, something like the following might
> do the trick:
> 
> 	echo "RIP: 0010:skb_splice_from_iter+0x102/0x300" |
> 	./scripts/decode_stacktrace.sh /my/built/vmlinux /my/build/tree
> 

Hi,

It's:
RIP: 0010:skb_splice_from_iter (/usr/linux/net/core/skbuff.c:6957)

which coresponds to this line:
                         if (WARN_ON_ONCE(!sendpage_ok(page)))

> if you run it in the kernel source tree you're using and substitute the
> paths to vmlinux and the build tree for modules.
> 
> David
>
Tariq Toukan June 19, 2023, 8:23 a.m. UTC | #4
On 13/06/2023 14:15, Tariq Toukan wrote:
> 
> 
> On 07/06/2023 18:03, David Howells wrote:
>> Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>>
>>> My team spotted a new degradation in TLS TX device offload, bisected 
>>> to this
>>> patch.
>>
>> I presume you're using some hardware (I'm guessing Mellanox?) that can
>> actually do TLS offload?  Unfortunately, I don't have any hardware 
>> that can do
>> this, so I can't test the tls_device stuff.
>>
>>>  From a quick look at the patch, it's not clear to me what's going 
>>> wrong.
>>> Please let us know of any helpful information that we can provide to 
>>> help in
>>> the debug.
>>
>> Can you find out what source line this corresponds to?
>>
>>     RIP: 0010:skb_splice_from_iter+0x102/0x300
>>
>> Assuming you're building your own kernel, something like the following 
>> might
>> do the trick:
>>
>>     echo "RIP: 0010:skb_splice_from_iter+0x102/0x300" |
>>     ./scripts/decode_stacktrace.sh /my/built/vmlinux /my/build/tree
>>
> 
> Hi,
> 
> It's:
> RIP: 0010:skb_splice_from_iter (/usr/linux/net/core/skbuff.c:6957)
> 
> which coresponds to this line:
>                          if (WARN_ON_ONCE(!sendpage_ok(page)))
> 

Hi David,
Any other debug information that we can provide to progress with the 
analysis?
David Howells June 19, 2023, 9:35 a.m. UTC | #5
Tariq Toukan <ttoukan.linux@gmail.com> wrote:

> Any other debug information that we can provide to progress with the analysis?

Can you see if the problem still happens on this branch of my tree?

https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=sendpage-3-frag

It eases the restriction that the WARN_ON is warning about by (in patch 1[*])
copying slab objects into page fragments.

David

[*] "net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)"
Tariq Toukan June 27, 2023, 4:49 p.m. UTC | #6
On 19/06/2023 12:35, David Howells wrote:
> Tariq Toukan <ttoukan.linux@gmail.com> wrote:
> 
>> Any other debug information that we can provide to progress with the analysis?
> 
> Can you see if the problem still happens on this branch of my tree?
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=sendpage-3-frag
> 
> It eases the restriction that the WARN_ON is warning about by (in patch 1[*])
> copying slab objects into page fragments.
> 
> David
> 
> [*] "net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)"
> 

Hi David,

Unfortunately, it still happens:

------------[ cut here ]------------
WARNING: CPU: 2 PID: 93427 at net/core/skbuff.c:7013 
skb_splice_from_iter+0x299/0x550
Modules linked in: bonding nf_tables vfio_pci ip_gre geneve ib_umad 
rdma_ucm ipip tunnel4 ip6_gre gre ip6_tunnel tunnel6 ib_ipoib 
mlx5_vfio_pci vfio_pci_core mlx5_ib ib_uverbs mlx5_core sch_mqprio 
sch_mqprio_lib sch_netem iptable_raw vfio_iommu_type1 vfio openvswitch 
nsh rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm 
ib_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink 
xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss 
oid_registry overlay zram zsmalloc fuse [last unloaded: nf_tables]
CPU: 2 PID: 93427 Comm: nginx_openssl_3 Tainted: G        W 
6.4.0-rc6_net_next_mlx5_9b6e6b6 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:skb_splice_from_iter+0x299/0x550
Code: 49 8b 57 08 f6 c2 01 0f 85 89 01 00 00 8b 0d 22 b3 4a 01 4c 89 fa 
85 c9 0f 8f 81 01 00 00 48 8b 12 80 e6 02 0f 84 a3 00 00 00 <0f> 0b 48 
c7 c1 fb ff ff ff 44 01 6b 70 44 01 6b 74 44 01 ab d0 00
RSP: 0018:ffff8882a16d3a80 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88821a89ee00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffea0004c2cfc0 RDI: ffff88821a89ee00
RBP: 0000000000000f34 R08: 0000000000000011 R09: 0000000000000f34
R10: 0000000000000000 R11: 000000000000000d R12: 0000000000000004
R13: 0000000000002d0f R14: 0000000000000f34 R15: ffffea0004c2d000
FS:  00007f5c383eb740(0000) GS:ffff88885f880000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000002ea5000 CR3: 0000000264ffe006 CR4: 0000000000370ea0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  <TASK>
  ? __warn+0x79/0x120
  ? skb_splice_from_iter+0x299/0x550
  ? report_bug+0x17c/0x190
  ? handle_bug+0x3c/0x60
  ? exc_invalid_op+0x14/0x70
  ? asm_exc_invalid_op+0x16/0x20
  ? skb_splice_from_iter+0x299/0x550
  tcp_sendmsg_locked+0x375/0xd00
  tls_push_sg+0xdd/0x230
  tls_push_data+0x6de/0xb00
  tls_device_sendmsg+0x7a/0xd0
  sock_sendmsg+0x38/0x60
  sock_write_iter+0x97/0x100
  vfs_write+0x2df/0x380
  ksys_write+0xa7/0xe0
  do_syscall_64+0x3d/0x90
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f5c381018b7
Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e 
fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 
f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007ffee9750848 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000004000 RCX: 00007f5c381018b7
RDX: 0000000000004000 RSI: 0000000002ea2dc0 RDI: 00000000000000d1
RBP: 0000000001d1bbe0 R08: 00007ffee974ffe0 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000002ea2dc0
R13: 0000000001d2a7e0 R14: 0000000000004000 R15: 0000000000004000
  </TASK>
---[ end trace 0000000000000000 ]---
David Howells June 27, 2023, 4:55 p.m. UTC | #7
Tariq Toukan <ttoukan.linux@gmail.com> wrote:

> WARNING: CPU: 2 PID: 93427 at net/core/skbuff.c:7013

Is that this line for you:

			} else if (WARN_ON_ONCE(!sendpage_ok(page))) {

If so, it's not slab data, but we've got a page with a 0 refcount from
somewhere.

David
David Howells June 27, 2023, 5:06 p.m. UTC | #8
Can you try net-next/main?  Parts of the branch you're trying have been
dropped.

David
Jakub Kicinski June 30, 2023, 5:21 p.m. UTC | #9
On Tue, 27 Jun 2023 19:49:22 +0300 Tariq Toukan wrote:
> Unfortunately, it still happens:
> 
> ------------[ cut here ]------------
> WARNING: CPU: 2 PID: 93427 at net/core/skbuff.c:7013 

I can't repro it on net-next with basic TLS 1.2 sendmsg/stream
test + device offload, let us know if you still see it.
Tariq Toukan July 4, 2023, 8:06 p.m. UTC | #10
On 30/06/2023 20:21, Jakub Kicinski wrote:
> On Tue, 27 Jun 2023 19:49:22 +0300 Tariq Toukan wrote:
>> Unfortunately, it still happens:
>>
>> ------------[ cut here ]------------
>> WARNING: CPU: 2 PID: 93427 at net/core/skbuff.c:7013
> 
> I can't repro it on net-next with basic TLS 1.2 sendmsg/stream
> test + device offload, let us know if you still see it.

Hi,

Unfortunately, it still repros for us.

We are collecting more info on how the repro is affected by the 
different parameters.

Regards,
Tariq
Jakub Kicinski July 5, 2023, 4:19 p.m. UTC | #11
On Tue, 4 Jul 2023 23:06:02 +0300 Tariq Toukan wrote:
> Unfortunately, it still repros for us.
> 
> We are collecting more info on how the repro is affected by the 
> different parameters.

Consider configuring kdump for your test env. Debugging is super easy
if one has the vmcore available.
Tariq Toukan July 23, 2023, 6:35 a.m. UTC | #12
On 05/07/2023 19:19, Jakub Kicinski wrote:
> On Tue, 4 Jul 2023 23:06:02 +0300 Tariq Toukan wrote:
>> Unfortunately, it still repros for us.
>>
>> We are collecting more info on how the repro is affected by the
>> different parameters.
> 
> Consider configuring kdump for your test env. Debugging is super easy
> if one has the vmcore available.

Hi Jakub, David,

We repro the issue on the server side using this client command:
$ wrk -b2.2.2.2 -t4 -c1000 -d5 --timeout 5s 
https://2.2.2.3:20443/256000b.img

Port 20443 is configured with:
     ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256;
     sendfile    off;


Important:
1. Couldn't repro with files smaller than 40KB.
2. Couldn't repro with "sendfile    on;"

In addition, we collected the vmcore (forced by panic_on_warn), it can 
be downloaded from here:
https://drive.google.com/file/d/1Fi2dzgq6k2hb2L_kwyntRjfLF6_RmbxB/view?usp=sharing

Regards,
Tariq
Jakub Kicinski July 26, 2023, 12:30 a.m. UTC | #13
On Sun, 23 Jul 2023 09:35:56 +0300 Tariq Toukan wrote:
> Hi Jakub, David,
> 
> We repro the issue on the server side using this client command:
> $ wrk -b2.2.2.2 -t4 -c1000 -d5 --timeout 5s 
> https://2.2.2.3:20443/256000b.img
> 
> Port 20443 is configured with:
>      ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256;
>      sendfile    off;
> 
> 
> Important:
> 1. Couldn't repro with files smaller than 40KB.
> 2. Couldn't repro with "sendfile    on;"
> 
> In addition, we collected the vmcore (forced by panic_on_warn), it can 
> be downloaded from here:
> https://drive.google.com/file/d/1Fi2dzgq6k2hb2L_kwyntRjfLF6_RmbxB/view?usp=sharing

This has no symbols :(

There is a small bug in this commit, we should always set SPLICE.
But I don't see how that'd cause the warning you're seeing.
Does your build have CONFIG_DEBUG_VM enabled?

-->8-------------------------

From: Jakub Kicinski <kuba@kernel.org>
Date: Tue, 25 Jul 2023 17:03:25 -0700
Subject: net: tls: set MSG_SPLICE_PAGES consistently

We used to change the flags for the last segment, because
non-last segments had the MSG_SENDPAGE_NOTLAST flag set.
That flag is no longer a thing so remove the setting.

Since flags most likely don't have MSG_SPLICE_PAGES set
this avoids passing parts of the sg as splice and parts
as non-splice.

... tags ...
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 net/tls/tls_main.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index b6896126bb92..4a8ee2f6badb 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -139,9 +139,6 @@ int tls_push_sg(struct sock *sk,
 
 	ctx->splicing_pages = true;
 	while (1) {
-		if (sg_is_last(sg))
-			msg.msg_flags = flags;
-
 		/* is sending application-limited? */
 		tcp_rate_check_app_limited(sk);
 		p = sg_page(sg);
David Howells July 26, 2023, 10:51 a.m. UTC | #14
Tariq Toukan <ttoukan.linux@gmail.com> wrote:

> We repro the issue on the server side using this client command:
> $ wrk -b2.2.2.2 -t4 -c1000 -d5 --timeout 5s https://2.2.2.3:20443/256000b.img

What's wrk?

David
Tariq Toukan July 26, 2023, 11:43 a.m. UTC | #15
On 26/07/2023 13:51, David Howells wrote:
> Tariq Toukan <ttoukan.linux@gmail.com> wrote:
> 
>> We repro the issue on the server side using this client command:
>> $ wrk -b2.2.2.2 -t4 -c1000 -d5 --timeout 5s https://2.2.2.3:20443/256000b.img
> 
> What's wrk?
> 
> David
> 

Pretty known and standard client app.
wrk - a HTTP benchmarking tool
https://github.com/wg/wrk
Jakub Kicinski July 26, 2023, 2:57 p.m. UTC | #16
On Wed, 26 Jul 2023 14:43:35 +0300 Tariq Toukan wrote:
> >> We repro the issue on the server side using this client command:
> >> $ wrk -b2.2.2.2 -t4 -c1000 -d5 --timeout 5s https://2.2.2.3:20443/256000b.img  
> > 
> > What's wrk?
> > 
> > David
> >   
> 
> Pretty known and standard client app.
> wrk - a HTTP benchmarking tool
> https://github.com/wg/wrk

Let us know if your build has CONFIG_DEBUG_VM, please.
Because in the old code the warning was gated by this config,
so the bug may be older. We just started reporting it.
Tariq Toukan July 26, 2023, 7:20 p.m. UTC | #17
On 26/07/2023 3:30, Jakub Kicinski wrote:
> On Sun, 23 Jul 2023 09:35:56 +0300 Tariq Toukan wrote:
>> Hi Jakub, David,
>>
>> We repro the issue on the server side using this client command:
>> $ wrk -b2.2.2.2 -t4 -c1000 -d5 --timeout 5s
>> https://2.2.2.3:20443/256000b.img
>>
>> Port 20443 is configured with:
>>       ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256;
>>       sendfile    off;
>>
>>
>> Important:
>> 1. Couldn't repro with files smaller than 40KB.
>> 2. Couldn't repro with "sendfile    on;"
>>
>> In addition, we collected the vmcore (forced by panic_on_warn), it can
>> be downloaded from here:
>> https://drive.google.com/file/d/1Fi2dzgq6k2hb2L_kwyntRjfLF6_RmbxB/view?usp=sharing
> 
> This has no symbols :(
> 

Uh.. :/
I'll try to fix this and re-generate.

> There is a small bug in this commit, we should always set SPLICE.
> But I don't see how that'd cause the warning you're seeing.
> Does your build have CONFIG_DEBUG_VM enabled?

No.

# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_VM_PGTABLE is not set

> 
> -->8-------------------------
> 
> From: Jakub Kicinski <kuba@kernel.org>
> Date: Tue, 25 Jul 2023 17:03:25 -0700
> Subject: net: tls: set MSG_SPLICE_PAGES consistently
> 
> We used to change the flags for the last segment, because
> non-last segments had the MSG_SENDPAGE_NOTLAST flag set.
> That flag is no longer a thing so remove the setting.
> 
> Since flags most likely don't have MSG_SPLICE_PAGES set
> this avoids passing parts of the sg as splice and parts
> as non-splice.
> 
> ... tags ...
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>   net/tls/tls_main.c | 3 ---
>   1 file changed, 3 deletions(-)
> 
> diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> index b6896126bb92..4a8ee2f6badb 100644
> --- a/net/tls/tls_main.c
> +++ b/net/tls/tls_main.c
> @@ -139,9 +139,6 @@ int tls_push_sg(struct sock *sk,
>   
>   	ctx->splicing_pages = true;
>   	while (1) {
> -		if (sg_is_last(sg))
> -			msg.msg_flags = flags;
> -
>   		/* is sending application-limited? */
>   		tcp_rate_check_app_limited(sk);
>   		p = sg_page(sg);

I'll test this anyway tomorrow and update.

Regards,
Tariq
Jakub Kicinski July 26, 2023, 8:08 p.m. UTC | #18
On Wed, 26 Jul 2023 22:20:42 +0300 Tariq Toukan wrote:
> > There is a small bug in this commit, we should always set SPLICE.
> > But I don't see how that'd cause the warning you're seeing.
> > Does your build have CONFIG_DEBUG_VM enabled?  
> 
> No.
> 
> # CONFIG_DEBUG_VM is not set
> # CONFIG_DEBUG_VM_PGTABLE is not set

Try testing v6.3 with DEBUG_VM enabled or just remove the IS_ENABLED()
from: https://github.com/torvalds/linux/blob/v6.4/net/ipv4/tcp.c#L1051
Tariq Toukan Aug. 3, 2023, 11:47 a.m. UTC | #19
On 26/07/2023 3:30, Jakub Kicinski wrote:
> On Sun, 23 Jul 2023 09:35:56 +0300 Tariq Toukan wrote:
>> Hi Jakub, David,
>>
>> We repro the issue on the server side using this client command:
>> $ wrk -b2.2.2.2 -t4 -c1000 -d5 --timeout 5s
>> https://2.2.2.3:20443/256000b.img
>>
>> Port 20443 is configured with:
>>       ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256;
>>       sendfile    off;
>>
>>
>> Important:
>> 1. Couldn't repro with files smaller than 40KB.
>> 2. Couldn't repro with "sendfile    on;"
>>
>> In addition, we collected the vmcore (forced by panic_on_warn), it can
>> be downloaded from here:
>> https://drive.google.com/file/d/1Fi2dzgq6k2hb2L_kwyntRjfLF6_RmbxB/view?usp=sharing
> 
> This has no symbols :(
> 
> There is a small bug in this commit, we should always set SPLICE.
> But I don't see how that'd cause the warning you're seeing.
> Does your build have CONFIG_DEBUG_VM enabled?
> 
> -->8-------------------------
> 
> From: Jakub Kicinski <kuba@kernel.org>
> Date: Tue, 25 Jul 2023 17:03:25 -0700
> Subject: net: tls: set MSG_SPLICE_PAGES consistently
> 
> We used to change the flags for the last segment, because
> non-last segments had the MSG_SENDPAGE_NOTLAST flag set.
> That flag is no longer a thing so remove the setting.
> 
> Since flags most likely don't have MSG_SPLICE_PAGES set
> this avoids passing parts of the sg as splice and parts
> as non-splice.
> 
> ... tags ...
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>   net/tls/tls_main.c | 3 ---
>   1 file changed, 3 deletions(-)
> 
> diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> index b6896126bb92..4a8ee2f6badb 100644
> --- a/net/tls/tls_main.c
> +++ b/net/tls/tls_main.c
> @@ -139,9 +139,6 @@ int tls_push_sg(struct sock *sk,
>   
>   	ctx->splicing_pages = true;
>   	while (1) {
> -		if (sg_is_last(sg))
> -			msg.msg_flags = flags;
> -
>   		/* is sending application-limited? */
>   		tcp_rate_check_app_limited(sk);
>   		p = sg_page(sg);

Hi Jakub,

When applying this patch, repro disappears! :)
Apparently it is related to the warning.
Please go on and submit it.

Tested-by: Tariq Toukan <tariqt@nvidia.com>

We are going to run more comprehensive tests, I'll let you know if we 
find anything unusual.

Regards,
Tariq
Tariq Toukan Aug. 3, 2023, 11:52 a.m. UTC | #20
On 26/07/2023 23:08, Jakub Kicinski wrote:
> On Wed, 26 Jul 2023 22:20:42 +0300 Tariq Toukan wrote:
>>> There is a small bug in this commit, we should always set SPLICE.
>>> But I don't see how that'd cause the warning you're seeing.
>>> Does your build have CONFIG_DEBUG_VM enabled?
>>
>> No.
>>
>> # CONFIG_DEBUG_VM is not set
>> # CONFIG_DEBUG_VM_PGTABLE is not set
> 
> Try testing v6.3 with DEBUG_VM enabled or just remove the IS_ENABLED()
> from: https://github.com/torvalds/linux/blob/v6.4/net/ipv4/tcp.c#L1051

Tested. It doesn't repro.
Jakub Kicinski Aug. 4, 2023, 3:12 a.m. UTC | #21
On Thu, 3 Aug 2023 14:47:35 +0300 Tariq Toukan wrote:
> When applying this patch, repro disappears! :)
> Apparently it is related to the warning.
> Please go on and submit it.

I have no idea how. I found a different bug, staring at this code
for another hour. But I still don't get how we can avoid UaF on
a page by having the TCP take a ref on it rather than copy it.

If anything we should have 2 refs on any page in the sg, one because
it's on the sg, and another held by the re-tx handling.

So I'm afraid we're papering over something here :( We need to keep
digging.
Tariq Toukan Aug. 8, 2023, 7:29 a.m. UTC | #22
On 04/08/2023 6:12, Jakub Kicinski wrote:
> On Thu, 3 Aug 2023 14:47:35 +0300 Tariq Toukan wrote:
>> When applying this patch, repro disappears! :)
>> Apparently it is related to the warning.
>> Please go on and submit it.
> 
> I have no idea how. I found a different bug, staring at this code
> for another hour. But I still don't get how we can avoid UaF on
> a page by having the TCP take a ref on it rather than copy it.
> 
> If anything we should have 2 refs on any page in the sg, one because
> it's on the sg, and another held by the re-tx handling.
> 
> So I'm afraid we're papering over something here :( We need to keep
> digging.

Hi Jakub,
I'm glad to see that you already nailed the other bug and merged the fix.

I can update that we ran comprehensive TLS testing on a branch that 
contains your proposed fix (net: tls: set MSG_SPLICE_PAGES 
consistently), and doesn't contain the other fix (net: tls: avoid 
discarding data on record close).

Except for one "known" issue (we'll discuss it in a second), the runs 
look clean.
No more traces or encrypt/decrypt error counters. Your proposed fix 
seems to work and causes no degradation.
How do you suggest proceeding here?

One mysterious remaining issue, which I already reported some time ago 
but couldn't effectively debug due to other TLS bugs, is the increase of 
TlsDecryptError / TlsEncryptError counters when running kTLS offloaded 
traffic during bond creation on some other interface.
Weird...

We should start giving it the needed attention now that the other issues 
seem to be resolved.

Regards,
Tariq
David Howells Aug. 10, 2023, 1:07 p.m. UTC | #23
Tariq Toukan <ttoukan.linux@gmail.com> wrote:

> We are collecting more info on how the repro is affected by the different
> parameters.

I'm wondering if userspace is feeding the unspliceable page in somehow.  Could
you try running with the attached changes?  It might help catch the point at
which the offending page is first spliced into the pipe and any backtrace
might help localise the driver that's producing it.

Thanks,
David
---
diff --git a/fs/splice.c b/fs/splice.c
index 3e2a31e1ce6a..877df1de3863 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -218,6 +218,8 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 	while (!pipe_full(head, tail, pipe->max_usage)) {
 		struct pipe_buffer *buf = &pipe->bufs[head & mask];
 
+		WARN_ON_ONCE(!sendpage_ok(spd->pages[page_nr]));
+
 		buf->page = spd->pages[page_nr];
 		buf->offset = spd->partial[page_nr].offset;
 		buf->len = spd->partial[page_nr].len;
@@ -252,6 +254,8 @@ ssize_t add_to_pipe(struct pipe_inode_info *pipe, struct pipe_buffer *buf)
 	unsigned int mask = pipe->ring_size - 1;
 	int ret;
 
+	WARN_ON_ONCE(!sendpage_ok(buf->page));
+
 	if (unlikely(!pipe->readers)) {
 		send_sig(SIGPIPE, current, 0);
 		ret = -EPIPE;
@@ -861,6 +865,8 @@ ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out,
 				break;
 			}
 
+			WARN_ON_ONCE(!sendpage_ok(buf->page));
+
 			bvec_set_page(&bvec[bc++], buf->page, seg, buf->offset);
 			remain -= seg;
 			if (remain == 0 || bc >= ARRAY_SIZE(bvec))
@@ -1411,6 +1417,8 @@ static int iter_to_pipe(struct iov_iter *from,
 		for (i = 0; i < n; i++) {
 			int size = min_t(int, left, PAGE_SIZE - start);
 
+			WARN_ON_ONCE(!sendpage_ok(pages[i]));
+
 			buf.page = pages[i];
 			buf.offset = start;
 			buf.len = size;
diff mbox series

Patch

diff --git a/include/net/tls.h b/include/net/tls.h
index 6056ce5a2aa5..5791ca7a189c 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -258,7 +258,7 @@  struct tls_context {
 	struct scatterlist *partially_sent_record;
 	u16 partially_sent_offset;
 
-	bool in_tcp_sendpages;
+	bool splicing_pages;
 	bool pending_open_record_frags;
 
 	struct mutex tx_lock; /* protects partially_sent_* fields and
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index f2e7302a4d96..3d45fdb5c4e9 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -125,7 +125,10 @@  int tls_push_sg(struct sock *sk,
 		u16 first_offset,
 		int flags)
 {
-	int sendpage_flags = flags | MSG_SENDPAGE_NOTLAST;
+	struct bio_vec bvec;
+	struct msghdr msg = {
+		.msg_flags = MSG_SENDPAGE_NOTLAST | MSG_SPLICE_PAGES | flags,
+	};
 	int ret = 0;
 	struct page *p;
 	size_t size;
@@ -134,16 +137,19 @@  int tls_push_sg(struct sock *sk,
 	size = sg->length - offset;
 	offset += sg->offset;
 
-	ctx->in_tcp_sendpages = true;
+	ctx->splicing_pages = true;
 	while (1) {
 		if (sg_is_last(sg))
-			sendpage_flags = flags;
+			msg.msg_flags = flags;
 
 		/* is sending application-limited? */
 		tcp_rate_check_app_limited(sk);
 		p = sg_page(sg);
 retry:
-		ret = do_tcp_sendpages(sk, p, offset, size, sendpage_flags);
+		bvec_set_page(&bvec, p, size, offset);
+		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
+
+		ret = tcp_sendmsg_locked(sk, &msg, size);
 
 		if (ret != size) {
 			if (ret > 0) {
@@ -155,7 +161,7 @@  int tls_push_sg(struct sock *sk,
 			offset -= sg->offset;
 			ctx->partially_sent_offset = offset;
 			ctx->partially_sent_record = (void *)sg;
-			ctx->in_tcp_sendpages = false;
+			ctx->splicing_pages = false;
 			return ret;
 		}
 
@@ -169,7 +175,7 @@  int tls_push_sg(struct sock *sk,
 		size = sg->length;
 	}
 
-	ctx->in_tcp_sendpages = false;
+	ctx->splicing_pages = false;
 
 	return 0;
 }
@@ -247,11 +253,11 @@  static void tls_write_space(struct sock *sk)
 {
 	struct tls_context *ctx = tls_get_ctx(sk);
 
-	/* If in_tcp_sendpages call lower protocol write space handler
+	/* If splicing_pages call lower protocol write space handler
 	 * to ensure we wake up any waiting operations there. For example
-	 * if do_tcp_sendpages where to call sk_wait_event.
+	 * if splicing pages where to call sk_wait_event.
 	 */
-	if (ctx->in_tcp_sendpages) {
+	if (ctx->splicing_pages) {
 		ctx->sk_write_space(sk);
 		return;
 	}