[bpf,0/2] Revert "xsk: support redirect to any socket bound to the same umem"

Message ID: 20240604122927.29080-1-magnus.karlsson@gmail.com

Message

Magnus Karlsson June 4, 2024, 12:29 p.m. UTC
Revert "xsk: support redirect to any socket bound to the same umem"

This patch introduced a potential kernel crash when multiple napi
instances redirect to the same AF_XDP socket. By removing the
queue_index check, it is possible for multiple napi instances to
access the Rx ring at the same time, resulting in a corrupted ring
state that can lead to a crash when flushing the rings in
__xsk_flush(). This can happen when the linked list of sockets to
flush gets corrupted by concurrent accesses. A quick and small fix is
unfortunately not possible, so let us revert this for now.
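
For context, the check that this revert restores in xsk_rcv_check() looks
roughly like the sketch below (based on the pre-feature code; the exact
context in net/xdp/xsk.c may differ). With it in place, only the socket
bound to the receiving device and queue can be redirected to, so a single
napi instance owns each Rx ring:

static int xsk_rcv_check(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
{
        if (!xsk_is_bound(xs))
                return -ENXIO;

        /* Restored by this revert: only accept packets from the device
         * and queue that the socket was bound to.
         */
        if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
                return -EINVAL;

        /* ... length checks and Rx descriptor production follow ... */
}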

[  306.997548] BUG: kernel NULL pointer dereference, address: 0000000000000008
[  307.088372] #PF: supervisor read access in kernel mode
[  307.149079] #PF: error_code(0x0000) - not-present page
[  307.209774] PGD 10f131067 P4D 10f131067 PUD 102642067 PMD 0
[  307.276608] Oops: 0000 [#1] SMP
[  307.313712] CPU: 3 PID: 1919 Comm: sp1 Tainted: P           OE     5.15.117-1-ULP-NG #1
[  307.408219] Hardware name: Radware Radware/Default string, BIOS 5.25 (785A.015) 05/11/2023
[  307.505779] RIP: 0010:xsk_flush+0xb/0x40
[  307.552099] Code: a0 03 00 00 01 b8 e4 ff ff ff eb dc 49 83 85 a0 03 00 00 01 b8 e4 ff ff ff eb cd 0f 1f 40 00 48 8b 87 40 03 00 00 55 48 89 e5 <8b> 50 08 48 8b 40 10 89 10 48 8b 87 68 03 00 00 48 8b 80 80 00 00
[  307.773694] RSP: 0000:ffffb7ae01037c80 EFLAGS: 00010287
[  307.835401] RAX: 0000000000000000 RBX: ffffa0a88f8ab768 RCX: ffffa0a88f8abac0
[  307.919670] RDX: ffffa0a88f8abac0 RSI: 0000000000000004 RDI: ffffa0a88f8ab768
[  308.003922] RBP: ffffb7ae01037c80 R08: ffffa0a10b3e0000 R09: 000000000000769f
[  308.088172] R10: ffffa0a1035ca000 R11: 000000000d7f9180 R12: ffffa0a88f8ab768
[  308.172405] R13: ffffa0a88f8ebac0 R14: ffffa0a2ef135300 R15: 0000000000000155
[  308.256635] FS:  00007ffff7e97a80(0000) GS:ffffa0a88f8c0000(0000) knlGS:0000000000000000
[  308.352186] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  308.420043] CR2: 0000000000000008 CR3: 000000010cf6e000 CR4: 0000000000750ee0
[  308.504309] PKRU: 55555554
[  308.536296] Call Trace:
[  308.565209]  <TASK>
[  308.590026]  ? show_regs+0x56/0x60
[  308.630218]  ? __die_body+0x1a/0x60
[  308.671433]  ? __die+0x25/0x30
[  308.707529]  ? page_fault_oops+0xc0/0x440
[  308.754897]  ? do_sys_poll+0x47c/0x5e0
[  308.799188]  ? do_user_addr_fault+0x319/0x6e0
[  308.850659]  ? exc_page_fault+0x6c/0x130
[  308.896992]  ? asm_exc_page_fault+0x27/0x30
[  308.946398]  ? xsk_flush+0xb/0x40
[  308.985546]  __xsk_map_flush+0x3a/0x80
[  309.029824]  xdp_do_flush+0x13/0x20
[  309.071043]  i40e_finalize_xdp_rx+0x44/0x50 [i40e]
[  309.127653]  i40e_clean_rx_irq_zc+0x132/0x500 [i40e]
[  309.202736]  i40e_napi_poll+0x119/0x1270 [i40e]
[  309.256285]  ? xsk_sendmsg+0xf4/0x100
[  309.315969]  ? sock_sendmsg+0x2e/0x40
[  309.359244]  __napi_poll+0x23/0x160
[  309.400482]  net_rx_action+0x232/0x290
[  309.444778]  __do_softirq+0xd0/0x270
[  309.487012]  irq_exit_rcu+0x74/0xa0
[  309.528241]  common_interrupt+0x83/0xa0
[  309.573577]  asm_common_interrupt+0x27/0x40

Thanks: Magnus

Magnus Karlsson (2):
  Revert "xsk: support redirect to any socket bound to the same umem"
  Revert "xsk: document ability to redirect to any socket bound to the
    same umem"

 Documentation/networking/af_xdp.rst | 33 ++++++++++++-----------------
 net/xdp/xsk.c                       |  5 +----
 2 files changed, 15 insertions(+), 23 deletions(-)


base-commit: 2317dc2c22cc353b699c7d1db47b2fe91f54055c
--
2.45.1

Comments

Willem de Bruijn June 4, 2024, 11:03 p.m. UTC | #1
Magnus Karlsson wrote:
> Revert "xsk: support redirect to any socket bound to the same umem"
> 
> This patch introduced a potential kernel crash when multiple napi
> instances redirect to the same AF_XDP socket. By removing the
> queue_index check, it is possible for multiple napi instances to
> access the Rx ring at the same time, resulting in a corrupted ring
> state that can lead to a crash when flushing the rings in
> __xsk_flush(). This can happen when the linked list of sockets to
> flush gets corrupted by concurrent accesses. A quick and small fix is
> unfortunately not possible, so let us revert this for now.

This is a very useful feature, to be able to use AF_XDP sockets with
a standard RSS nic configuration.

Not all AF_XDP use cases require the absolute highest packet rate.

Can this be addressed with an optional spinlock on the RxQ, only for
this case?

If there is no simple enough fix in the short term, do you plan to
reintroduce this in another form later?
Magnus Karlsson June 5, 2024, 7:45 a.m. UTC | #2
On Wed, 5 Jun 2024 at 01:03, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Magnus Karlsson wrote:
> > Revert "xsk: support redirect to any socket bound to the same umem"
> >
> > This patch introduced a potential kernel crash when multiple napi
> > instances redirect to the same AF_XDP socket. By removing the
> > queue_index check, it is possible for multiple napi instances to
> > access the Rx ring at the same time, resulting in a corrupted ring
> > state that can lead to a crash when flushing the rings in
> > __xsk_flush(). This can happen when the linked list of sockets to
> > flush gets corrupted by concurrent accesses. A quick and small fix is
> > unfortunately not possible, so let us revert this for now.
>
> This is a very useful feature, to be able to use AF_XDP sockets with
> a standard RSS nic configuration.

I completely agree.

> Not all AF_XDP use cases require the absolute highest packet rate.
>
> Can this be addressed with an optional spinlock on the RxQ, only for
> this case?

Yes, or with an MPSC ring implementation.

> If there is no simple enough fix in the short term, do you plan to
> reintroduce this in another form later?

Yuval and I are looking into a solution based around an optional
spinlock since it is easier to pull off than an MPSC ring. The
discussion is ongoing on the xdp-newbies list [0], but as soon as we
have a first patch, we will post it here for review and debate.

[0] https://lore.kernel.org/xdp-newbies/8100DBDC-0B7C-49DB-9995-6027F6E63147@radware.com/
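
To make the idea a bit more concrete, below is a minimal sketch of what an
optional, per-socket producer lock could look like. All names here are
hypothetical and this is not the patch under discussion; the flush list
mentioned above would need similar protection:

/* Hypothetical sketch: serialize Rx ring producers only when the
 * socket can be fed from more than one napi instance.
 */
static int xsk_rcv_maybe_locked(struct xdp_sock *xs, struct xdp_buff *xdp,
                                u32 len)
{
        int err;

        if (!xs->shared_rx)                   /* hypothetical "shared" flag */
                return xsk_rcv_one(xs, xdp, len);   /* unchanged fast path */

        spin_lock(&xs->rx_prod_lock);         /* hypothetical per-socket lock */
        err = xsk_rcv_one(xs, xdp, len);
        spin_unlock(&xs->rx_prod_lock);

        return err;
}

Here xsk_rcv_one() stands in for the existing per-packet receive path.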
patchwork-bot+netdevbpf@kernel.org June 5, 2024, 7:50 a.m. UTC | #3
Hello:

This series was applied to bpf/bpf.git (master)
by Daniel Borkmann <daniel@iogearbox.net>:

On Tue,  4 Jun 2024 14:29:24 +0200 you wrote:
> Revert "xsk: support redirect to any socket bound to the same umem"
> 
> This patch introduced a potential kernel crash when multiple napi
> instances redirect to the same AF_XDP socket. By removing the
> queue_index check, it is possible for multiple napi instances to
> access the Rx ring at the same time, resulting in a corrupted ring
> state that can lead to a crash when flushing the rings in
> __xsk_flush(). This can happen when the linked list of sockets to
> flush gets corrupted by concurrent accesses. A quick and small fix is
> unfortunately not possible, so let us revert this for now.
> 
> [...]

Here is the summary with links:
  - [bpf,1/2] Revert "xsk: support redirect to any socket bound to the same umem"
    https://git.kernel.org/bpf/bpf/c/7fcf26b315bb
  - [bpf,2/2] Revert "xsk: document ability to redirect to any socket bound to the same umem"
    https://git.kernel.org/bpf/bpf/c/03e38d315f3c

You are awesome, thank you!
Willem de Bruijn June 5, 2024, 7:28 p.m. UTC | #4
Magnus Karlsson wrote:
> On Wed, 5 Jun 2024 at 01:03, Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > Magnus Karlsson wrote:
> > > Revert "xsk: support redirect to any socket bound to the same umem"
> > >
> > > This patch introduced a potential kernel crash when multiple napi
> > > instances redirect to the same AF_XDP socket. By removing the
> > > queue_index check, it is possible for multiple napi instances to
> > > access the Rx ring at the same time, resulting in a corrupted ring
> > > state that can lead to a crash when flushing the rings in
> > > __xsk_flush(). This can happen when the linked list of sockets to
> > > flush gets corrupted by concurrent accesses. A quick and small fix is
> > > unfortunately not possible, so let us revert this for now.
> >
> > This is a very useful feature, to be able to use AF_XDP sockets with
> > a standard RSS nic configuration.
> 
> I completely agree.
> 
> > Not all AF_XDP use cases require the absolute highest packet rate.
> >
> > Can this be addressed with an optional spinlock on the RxQ, only for
> > this case?
> 
> Yes, or with an MPSC ring implementation.
> 
> > If there is no simple enough fix in the short term, do you plan to
> > reintroduce this in another form later?
> 
> Yuval and I are looking into a solution based around an optional
> spinlock since it is easier to pull off than an MPSC ring. The
> discussion is ongoing on the xdp-newbies list [0], but as soon as we
> have a first patch, we will post it here for review and debate.
> 
> [0] https://lore.kernel.org/xdp-newbies/8100DBDC-0B7C-49DB-9995-6027F6E63147@radware.com/

Glad to hear that it's intended to be supported, and even being worked
on, thanks! I'll follow the conversation there.