[RFC] wiregard RX packet processing.

Message ID	20211208173205.zajfvg6zvi4g5kln@linutronix.de (mailing list archive)
State	RFC
Delegated to:	Netdev Maintainers
Headers
Return-Path: <netdev-owner@kernel.org>
Date: Wed, 8 Dec 2021 18:32:05 +0100
From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
To: wireguard@lists.zx2c4.com, netdev@vger.kernel.org
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>,
        "David S. Miller" <davem@davemloft.net>,
        Jakub Kicinski <kuba@kernel.org>,
        Thomas Gleixner <tglx@linutronix.de>,
        Peter Zijlstra <peterz@infradead.org>
Subject: [RFC] wiregard RX packet processing.
Message-ID: <20211208173205.zajfvg6zvi4g5kln@linutronix.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
Precedence: bulk
Series	[RFC] wiregard RX packet processing.

Context	Check	Description
netdev/tree_selection	success	Guessed tree name to be net-next
netdev/fixes_present	success	Fixes tag not required for -next series
netdev/subject_prefix	warning	Target tree name not specified in the subject
netdev/cover_letter	success	Single patches do not need cover letters
netdev/patch_count	success	Link
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers	success	CCed 5 of 5 maintainers
netdev/build_clang	success	Errors and warnings before: 0 this patch: 0
netdev/module_param	success	Was 0 now: 0
netdev/verify_signedoff	fail	author Signed-off-by missing
netdev/verify_fixes	success	No Fixes tag
netdev/build_allmodconfig_warn	success	Errors and warnings before: 0 this patch: 0
netdev/checkpatch	success	total: 0 errors, 0 warnings, 0 checks, 13 lines checked
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	success	Was 0 now: 0

Commit Message

Sebastian Sewior Dec. 8, 2021, 5:32 p.m. UTC

I didn't understand everything; I just stumbled upon this while looking
for something else and don't have the time to figure everything out.
Also, I might have taken a wrong turn somewhere…

need_resched() is something you want to avoid unless you write core code.
On a PREEMPT kernel you never observe true here and cond_resched() is a
nop. On non-PREEMPT kernels need_resched() can return true/false _and_
should_resched() (which is part of cond_resched()) returns true only if
the same bit is set. This means invoking only cond_resched() saves one
read access. Bonus points: on x86 that bit is folded into the preemption
counter, so you avoid reading that bit entirely, plus the whole thing is
optimized away on a PREEMPT kernel.
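To make the read-count argument concrete, here is a small userspace toy model, not kernel code. toy_tif_need_resched, toy_need_resched(), toy_cond_resched() and the counters are hypothetical stand-ins for the TIF_NEED_RESCHED bit, need_resched() and cond_resched() on a non-PREEMPT kernel; the sketch assumes should_resched() simply tests that one bit and ignores the preempt-count part of the real check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the TIF_NEED_RESCHED bit. */
static bool toy_tif_need_resched;
static int flag_reads;   /* how often the bit was read */
static int schedules;    /* how often the toy "schedule()" ran */

/* Toy need_resched(): one read of the bit. */
static bool toy_need_resched(void)
{
	flag_reads++;
	return toy_tif_need_resched;
}

/* Toy cond_resched() on a non-PREEMPT kernel: should_resched()
 * reads the same bit again, and schedule() runs only if it is set. */
static void toy_cond_resched(void)
{
	flag_reads++;
	if (toy_tif_need_resched) {
		toy_tif_need_resched = false;   /* consumed by schedule() */
		schedules++;
	}
}

/* Pattern currently in receive.c: redundant outer check. */
static void pattern_checked(void)
{
	if (toy_need_resched())
		toy_cond_resched();
}

/* Suggested pattern: call cond_resched() unconditionally. */
static void pattern_plain(void)
{
	toy_cond_resched();
}
```

Both patterns schedule in exactly the same cases; the guarded form just reads the bit twice whenever it is set.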

wg_queue_enqueue_per_peer_rx() somehow enqueues the skb for NAPI
processing (I haven't figured out this bit yet, but it has to) and then
invokes napi_schedule(). This napi_schedule() wasn't meant to be invoked
from preemptible context, only from an actual IRQ handler:
- if NAPI is already active (which can only happen if it is running on a
  remote CPU) then nothing happens. Good.

- if NAPI is idle then __napi_schedule() will "schedule" it. Here is
  the thing: you are in process context (kworker), so nothing happens
  right away: NET_RX_SOFTIRQ is set for the local CPU and the NAPI struct
  is added to the list. Now you need to wait until a random interrupt
  arrives, which will notice that a softirq bit is set and will process
  it. So it will happen eventually…
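A minimal single-CPU toy model of the idle-NAPI case, with hypothetical names throughout: pending stands in for the per-CPU softirq pending mask, toy_napi_schedule() for __napi_schedule(), and toy_irq_exit() for the softirq check at the end of irq_exit(). It only illustrates the ordering: scheduling from the kworker sets a bit, and the poll callback runs later, when some unrelated interrupt exits.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit; NET_RX_SOFTIRQ is bit 3 in the kernel's enum. */
#define TOY_NET_RX_SOFTIRQ (1u << 3)

static unsigned int pending;   /* toy softirq pending mask */
static bool napi_scheduled;    /* toy NAPI "scheduled" state */
static int poll_runs;          /* how often the poll callback ran */

static void toy_napi_poll(void)
{
	poll_runs++;
	napi_scheduled = false;
}

/* Toy __napi_schedule(): queue the NAPI instance and OR in the
 * softirq bit. Nothing is executed here. */
static void toy_napi_schedule(void)
{
	if (napi_scheduled)
		return;   /* already scheduled: nothing to do */
	napi_scheduled = true;
	pending |= TOY_NET_RX_SOFTIRQ;
}

/* Toy tail of irq_exit(): only here do pending softirqs run. */
static void toy_irq_exit(void)
{
	if (pending & TOY_NET_RX_SOFTIRQ) {
		pending &= ~TOY_NET_RX_SOFTIRQ;
		toy_napi_poll();
	}
}
```

Calling toy_napi_schedule() from the "kworker" leaves poll_runs at 0; only a later toy_irq_exit() runs the poll.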

I would suggest one of the following:
- add a comment that this is known and that it does not matter because
  $REASON. I would imagine you might want to batch multiple skbs but…

- add a BH-disabled section around wg_queue_enqueue_per_peer_rx() (see
  below). The local_bh_enable() will invoke pending softirqs, which in
  your case should invoke wg_packet_rx_poll(), where you see only one skb.
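The second option can be sketched with the same kind of toy model. All names are hypothetical; bh_depth stands in for the BH-disable count, and the sketch assumes we are not in interrupt context, so the final enable processes pending softirqs. The poll runs before toy_local_bh_enable() returns and sees exactly the one skb that was queued.

```c
#include <assert.h>
#include <stdbool.h>

static int bh_depth;        /* toy BH-disable nesting count */
static bool rx_pending;     /* toy NET_RX_SOFTIRQ pending bit */
static int queued_skbs;     /* skbs waiting for the poll */
static int poll_runs;
static int last_poll_size;  /* skbs handed to the most recent poll */

static void toy_poll(void)
{
	poll_runs++;
	last_poll_size = queued_skbs;
	queued_skbs = 0;
}

static void toy_local_bh_disable(void)
{
	bh_depth++;
}

/* Toy local_bh_enable(): run pending RX work once BHs are back on. */
static void toy_local_bh_enable(void)
{
	assert(bh_depth > 0);
	if (--bh_depth == 0 && rx_pending) {
		rx_pending = false;
		toy_poll();
	}
}

/* Stands in for wg_queue_enqueue_per_peer_rx() + napi_schedule(). */
static void toy_enqueue_and_schedule(void)
{
	queued_skbs++;
	rx_pending = true;
}
```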


Sebastian

Comments

Jason A. Donenfeld Dec. 20, 2021, 5:29 p.m. UTC | #1

Hi Sebastian,

Seems like you've identified two things, the use of need_resched, and
potentially surrounding napi_schedule in local_bh_{disable,enable}.

Regarding need_resched, I pulled that out of other code that seemed to
have the "same requirements", as vaguely conceived. It indeed might
not be right. The intent is to have that worker running at maximum
throughput for extended periods of time, but not preventing other
threads from running elsewhere, so that, e.g., a user's machine
doesn't have a janky mouse when downloading a file.

What are the effects of unconditionally calling cond_resched() without
checking for if (need_resched())? Sounds like you're saying none at
all?

Regarding napi_schedule, I actually wasn't aware that its requirement
to _only_ ever run from softirq was a strict one. When I switched to
using napi_schedule in this way, throughput really jumped up
significantly. Part of this indeed is from the batching, so that the
napi callback can then handle more packets in one go later. But I
assumed it was something inside of NAPI that was batching and
scheduling it, rather than a mistake on my part to call this from a wq
and not from a softirq.

What, then, are the effects of surrounding that in
local_bh_{disable,enable} as you've done in the patch? You mentioned
one aspect is that it will "invoke wg_packet_rx_poll() where you see
only one skb." It sounds like that'd be bad for performance, though,
given that the design of napi is really geared toward batching.

Jason

Toke Høiland-Jørgensen Jan. 5, 2022, 12:14 a.m. UTC | #2

"Jason A. Donenfeld" <Jason@zx2c4.com> writes:

> Hi Sebastian,
>
> Seems like you've identified two things, the use of need_resched, and
> potentially surrounding napi_schedule in local_bh_{disable,enable}.
>
> Regarding need_resched, I pulled that out of other code that seemed to
> have the "same requirements", as vaguely conceived. It indeed might
> not be right. The intent is to have that worker running at maximum
> throughput for extended periods of time, but not preventing other
> threads from running elsewhere, so that, e.g., a user's machine
> doesn't have a janky mouse when downloading a file.
>
> What are the effects of unconditionally calling cond_resched() without
> checking for if (need_resched())? Sounds like you're saying none at
> all?

I believe so: AFAIU, you use need_resched() if you need to do some kind
of teardown before the schedule point, like this example I was recently
looking at:

https://elixir.bootlin.com/linux/latest/source/net/bpf/test_run.c#L73

If you just need to maybe reschedule, you can just call cond_resched()
and it'll do what it says on the tin: do a schedule if needed, and
return immediately otherwise.
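A toy sketch of the case where need_resched() is justified: something has to be torn down before the schedule point (in the linked test_run.c code that is, roughly, stopping the timer and leaving the RCU read-side section). All names are hypothetical; section_active stands in for whatever must not be held across schedule(), and the counters show that teardown is only paid when a reschedule is actually due.

```c
#include <assert.h>
#include <stdbool.h>

static bool toy_need_resched;   /* hypothetical TIF_NEED_RESCHED */
static bool section_active;     /* state that must not cross schedule() */
static int leaves, enters, schedules;

static void toy_enter(void) { section_active = true; enters++; }
static void toy_leave(void) { section_active = false; leaves++; }

static void toy_cond_resched(void)
{
	assert(!section_active);   /* never sleep inside the section */
	if (toy_need_resched) {
		toy_need_resched = false;
		schedules++;
	}
}

/* One loop iteration: only tear down when a reschedule is due. */
static void toy_iteration(void)
{
	if (toy_need_resched) {
		toy_leave();
		toy_cond_resched();
		toy_enter();
	}
}
```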

> Regarding napi_schedule, I actually wasn't aware that its requirement
> to _only_ ever run from softirq was a strict one. When I switched to
> using napi_schedule in this way, throughput really jumped up
> significantly. Part of this indeed is from the batching, so that the
> napi callback can then handle more packets in one go later. But I
> assumed it was something inside of NAPI that was batching and
> scheduling it, rather than a mistake on my part to call this from a wq
> and not from a softirq.
>
> What, then, are the effects of surrounding that in
> local_bh_{disable,enable} as you've done in the patch? You mentioned
> one aspect is that it will "invoke wg_packet_rx_poll() where you see
> only one skb." It sounds like that'd be bad for performance, though,
> given that the design of napi is really geared toward batching.

Heh, I wrote a whole long explanation here about variable batch sizes,
because you don't control when the NAPI is scheduled, etc... And then I
noticed the while loop is calling ptr_ring_consume_bh(), which means
that there's already a local_bh_disable/enable pair on every loop
iteration. So you already have this :)

Which of course raises the question of whether there's anything to gain
from *adding* batching to the worker? Something like:

#define BATCH_SIZE 8
void wg_packet_decrypt_worker(struct work_struct *work)
{
	struct crypt_queue *queue = container_of(work, struct multicore_worker,
						 work)->ptr;
	void *skbs[BATCH_SIZE];
	bool again;
	int i, n;

restart:
	/* Keep BHs disabled for the whole batch: the napi_schedule() in
	 * wg_queue_enqueue_per_peer_rx() only marks NET_RX_SOFTIRQ pending,
	 * and it is handled at local_bh_enable() below.
	 */
	local_bh_disable();
	/* Only the first n entries of skbs[] are filled in. */
	n = ptr_ring_consume_batched(&queue->ring, skbs, BATCH_SIZE);

	for (i = 0; i < n; i++) {
		struct sk_buff *skb = skbs[i];
		enum packet_state state;

		state = likely(decrypt_packet(skb, PACKET_CB(skb)->keypair)) ?
				PACKET_STATE_CRYPTED : PACKET_STATE_DEAD;
		wg_queue_enqueue_per_peer_rx(skb, state);
	}

	again = !ptr_ring_empty(&queue->ring);
	local_bh_enable();

	if (again) {
		cond_resched();
		goto restart;
	}
}
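A back-of-the-envelope toy model of how many skbs each NAPI poll would see with the current per-skb loop versus the BATCH_SIZE sketch above. It is hypothetical userspace code that assumes each poll runs at the end of a BH section and drains everything queued in it (i.e. the batch stays under the NAPI budget); real poll sizes also depend on interrupts and other local_bh_enable() callers.

```c
#include <assert.h>

/* Hypothetical model: each BH section moves up to `batch` skbs to the
 * RX queue (like ptr_ring_consume_batched() returning n entries), and
 * the pending poll then runs at local_bh_enable() and sees all of them.
 * batch == 1 models the ptr_ring_consume_bh() loop. Writes each poll's
 * size into sizes[] and returns the number of polls. */
static int toy_poll_sizes(int nr_skbs, int batch, int *sizes, int max)
{
	int polls = 0;

	assert(batch > 0);
	while (nr_skbs > 0) {
		int n = nr_skbs < batch ? nr_skbs : batch;

		nr_skbs -= n;
		assert(polls < max);
		sizes[polls++] = n;
	}
	return polls;
}
```

For 20 queued skbs this gives 20 polls of one skb with batch == 1, and polls of 8, 8 and 4 with batch == 8.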


Another thing that might be worth looking into is whether it makes sense
to enable threaded NAPI for Wireguard. See:
https://lore.kernel.org/r/20210208193410.3859094-1-weiwan@google.com

-Toke

Sebastian Sewior Jan. 11, 2022, 3:40 p.m. UTC | #3

On 2021-12-20 18:29:49 [+0100], Jason A. Donenfeld wrote:
> Hi Sebastian,
> 
> Seems like you've identified two things, the use of need_resched, and
> potentially surrounding napi_schedule in local_bh_{disable,enable}.
> 
> Regarding need_resched, I pulled that out of other code that seemed to
> have the "same requirements", as vaguely conceived. It indeed might
> not be right. The intent is to have that worker running at maximum
> throughput for extended periods of time, but not preventing other
> threads from running elsewhere, so that, e.g., a user's machine
> doesn't have a janky mouse when downloading a file.
>
> What are the effects of unconditionally calling cond_resched() without
> checking for if (need_resched())? Sounds like you're saying none at
> all?

I stand to be corrected, but "if (need_resched()) cond_resched();" is not
something one should do. If you hold a lock and need to drop it first,
and you don't want to drop the lock if there is no need for
scheduling, then there is cond_resched_lock(), for instance. If you need
to do something more complex (say, set a marker if you drop the lock)
then okay, _but_ in this case you do more than just the "if …" from
above.
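For the lock case, a toy model of the cond_resched_lock() contract: drop the lock and schedule only when a reschedule is due, and report whether that happened so the caller can revalidate anything it derived under the lock (the "marker" case). Names are hypothetical, and the real helper may also drop the lock when it is contended (spin_needbreak()), which this model ignores.

```c
#include <assert.h>
#include <stdbool.h>

static bool lock_held;          /* toy spinlock state */
static bool toy_need_resched;   /* hypothetical TIF_NEED_RESCHED */
static int schedules;

/* Returns 1 if the lock was dropped and re-taken, 0 otherwise. */
static int toy_cond_resched_lock(void)
{
	assert(lock_held);
	if (!toy_need_resched)
		return 0;               /* fast path: keep the lock */
	lock_held = false;          /* spin_unlock() */
	toy_need_resched = false;   /* schedule() */
	schedules++;
	lock_held = true;           /* spin_lock() */
	return 1;
}
```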

cond_resched() gets optimized away on a preemptible kernel. The side
effect is that you always have a branch (to cond_resched()), including a
possible RCU section (urgently needed quiescent state).

> Regarding napi_schedule, I actually wasn't aware that its requirement
> to _only_ ever run from softirq was a strict one. When I switched to
> using napi_schedule in this way, throughput really jumped up
> significantly. Part of this indeed is from the batching, so that the
> napi callback can then handle more packets in one go later. But I
> assumed it was something inside of NAPI that was batching and
> scheduling it, rather than a mistake on my part to call this from a wq
> and not from a softirq.

There is no strict requirement to do napi_schedule() from hard-IRQ, but
it actually makes sense. napi_schedule() invokes
__raise_softirq_irqoff(), which only ORs a bit into the softirq pending
state. Nothing else. The only reason that the softirq is invoked in a
deterministic way is that irq_exit() has this "if
(local_softirq_pending()) invoke_softirq()" check before returning (to
the interrupted user/kernel code).

So if you use it in a worker (for instance), the NAPI call is delayed
until the next IRQ (due to the irq_exit() part) or a random
local_bh_enable() user.

> What, then, are the effects of surrounding that in
> local_bh_{disable,enable} as you've done in the patch? You mentioned
> one aspect is that it will "invoke wg_packet_rx_poll() where you see
> only one skb." It sounds like that'd be bad for performance, though,
> given that the design of napi is really geared toward batching.

As Toke Høiland-Jørgensen wrote in the previous reply, I missed the BH
disable/enable in ptr_ring_consume_bh(). So what happens is that
ptr_ring_consume_bh() gives you one skb, you do
wg_queue_enqueue_per_peer_rx(), which raises NAPI, and then the following
ptr_ring_consume_bh() (its local_bh_enable(), to be exact) invokes the
NAPI callback (I guess wg_packet_rx_poll(), but as I wrote earlier, I
didn't figure out how the skbs move from here to the other queue for
that callback).

So there is probably no batching, assuming that one skb is processed in
the NAPI callback.

> Jason

Sebastian

diff --git a/drivers/net/wireguard/receive.c b/drivers/net/wireguard/receive.c
index 7b8df406c7737..64e4ca1ded108 100644
--- a/drivers/net/wireguard/receive.c
+++ b/drivers/net/wireguard/receive.c
@@ -507,9 +507,11 @@  void wg_packet_decrypt_worker(struct work_struct *work)
 		enum packet_state state =
 			likely(decrypt_packet(skb, PACKET_CB(skb)->keypair)) ?
 				PACKET_STATE_CRYPTED : PACKET_STATE_DEAD;
+		local_bh_disable();
 		wg_queue_enqueue_per_peer_rx(skb, state);
-		if (need_resched())
-			cond_resched();
+		local_bh_enable();
+
+		cond_resched();
 	}
 }
