| Message ID | 20241025142025.3558051-2-leitao@debian.org (mailing list archive) |
| --- | --- |
| State | Superseded |
| Delegated to: | Netdev Maintainers |
| Series | net: netpoll: Improve SKB pool management |

On Fri, 25 Oct 2024 07:20:18 -0700 Breno Leitao wrote:
> The current implementation has a flaw where it populates the skb_pool
> with 32 SKBs before calling __netpoll_setup(). If the setup fails, the
> skb_pool buffer will persist indefinitely and never be cleaned up.
>
> This change moves the skb_pool population to after the successful
> completion of __netpoll_setup(), ensuring that the buffers are not
> unnecessarily retained. Additionally, this modification alleviates rtnl
> lock pressure by allowing the buffer filling to occur outside of the
> lock.

arguably if the setup succeeds there would now be a window of time
where np is active but pool is empty.

Hello Jakub,

On Thu, Oct 31, 2024 at 06:26:47PM -0700, Jakub Kicinski wrote:
> On Fri, 25 Oct 2024 07:20:18 -0700 Breno Leitao wrote:
> > The current implementation has a flaw where it populates the skb_pool
> > with 32 SKBs before calling __netpoll_setup(). If the setup fails, the
> > skb_pool buffer will persist indefinitely and never be cleaned up.
> >
> > This change moves the skb_pool population to after the successful
> > completion of __netpoll_setup(), ensuring that the buffers are not
> > unnecessarily retained. Additionally, this modification alleviates rtnl
> > lock pressure by allowing the buffer filling to occur outside of the
> > lock.
>
> arguably if the setup succeeds there would now be a window of time
> where np is active but pool is empty.

I am not convinced this is a problem, given that netpoll_setup() is
only called from netconsole.

In netconsole, a target is not enabled (as in sending packets) until the
netconsole target is, in fact, enabled (nt->enabled = true). Enabling
the target (nt) only happens after netpoll_setup() returns successfully.

Example:

	static void write_ext_msg(struct console *con, const char *msg,
				  unsigned int len)
	{
		...
		list_for_each_entry(nt, &target_list, list)
			if (nt->extended && nt->enabled && netif_running(nt->np.dev))
				send_ext_msg_udp(nt, msg, len);

So, back to your point, the netpoll interface will be up, but not used
at all.

On top of that, two other considerations:

 * If the netpoll target is used without the buffer, it is not a big
   deal, since refill_skbs() is called regardless of whether the pool is
   full or not (which is not ideal, and I eventually want to improve it).

   Anyway, this is how the code works today:

	void netpoll_send_udp(struct netpoll *np, const char *msg, int len)
	{
		...
		skb = find_skb(np, total_len + np->dev->needed_tailroom,...
		// transmit the skb

	static struct sk_buff *find_skb(struct netpoll *np, int len, int reserve)
	{
		...
		refill_skbs(np);
		skb = alloc_skb(len, GFP_ATOMIC);
		if (!skb)
			skb = skb_dequeue(&np->skb_pool);
		...
		// return the skb

So, even if there is a transmission between enabling the netpoll
target and populating the pool (which is NOT the case in the code
today), it would not be a problem, given that netpoll_send_udp() will
call refill_skbs() anyway.

I have an in-development patch to improve it, by deferring this to a
workthread, mainly because this whole allocation dance is done with a
bunch of locks held, including the printk/console lock.

I think that a better mechanism might be something like:

 * If find_skb() needs to consume from the pool (which is rare, only
   when alloc_skb() fails), raise a workthread that tries to repopulate
   the pool in the background.

 * Eventually avoid alloc_skb() first and get directly from the pool
   instead; if the pool is depleted, try alloc_skb(GFP_ATOMIC).
   This might make the code faster, but I don't have data yet.

This might also require making the netpoll pool size configurable. Today
it is hardcoded (#define MAX_SKBS 32). This current patchset is the first
step to individualize the pool; then we can have a field in struct
netpoll that specifies the pool size (32 by default) but is
user-configurable. On netconsole, we can do it using the configfs fields.

Anyway, are these ideas too crazy?

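For illustration, a minimal sketch of the user-configurable pool size idea
floated above. The pool_size field and the pool_size_store() attribute are
hypothetical names, not code from this series or from mainline netconsole:

```c
/* Hypothetical sketch only: struct netpoll would grow a field such as
 *
 *	unsigned int pool_size;		// number of pooled SKBs, default 32
 *
 * and netconsole could expose it per target through a configfs
 * attribute, alongside the existing enabled/local_ip/... knobs.
 */
static ssize_t pool_size_store(struct config_item *item, const char *buf,
			       size_t count)
{
	struct netconsole_target *nt = to_target(item);
	unsigned int size;
	int err;

	err = kstrtouint(buf, 10, &size);
	if (err)
		return err;

	if (!size || size > 1024)	/* arbitrary sanity bound */
		return -EINVAL;

	nt->np.pool_size = size;	/* picked up by the next refill */
	return count;
}
```
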
On Fri, Nov 01, 2024 at 03:51:59AM -0700, Breno Leitao wrote:
> Hello Jakub,
>
> On Thu, Oct 31, 2024 at 06:26:47PM -0700, Jakub Kicinski wrote:
> > On Fri, 25 Oct 2024 07:20:18 -0700 Breno Leitao wrote:
> > > The current implementation has a flaw where it populates the skb_pool
> > > with 32 SKBs before calling __netpoll_setup(). If the setup fails, the
> > > skb_pool buffer will persist indefinitely and never be cleaned up.
> > >
> > > This change moves the skb_pool population to after the successful
> > > completion of __netpoll_setup(), ensuring that the buffers are not
> > > unnecessarily retained. Additionally, this modification alleviates rtnl
> > > lock pressure by allowing the buffer filling to occur outside of the
> > > lock.
> >
> > arguably if the setup succeeds there would now be a window of time
> > where np is active but pool is empty.
>
> I am not convinced this is a problem, given that netpoll_setup() is
> only called from netconsole.
>
> In netconsole, a target is not enabled (as in sending packets) until the
> netconsole target is, in fact, enabled (nt->enabled = true). Enabling
> the target (nt) only happens after netpoll_setup() returns successfully.
>
> Example:
>
> 	static void write_ext_msg(struct console *con, const char *msg,
> 				  unsigned int len)
> 	{
> 		...
> 		list_for_each_entry(nt, &target_list, list)
> 			if (nt->extended && nt->enabled && netif_running(nt->np.dev))
> 				send_ext_msg_udp(nt, msg, len);
>
> So, back to your point, the netpoll interface will be up, but not used
> at all.
>
> On top of that, two other considerations:
>
>  * If the netpoll target is used without the buffer, it is not a big
>    deal, since refill_skbs() is called regardless of whether the pool is
>    full or not (which is not ideal, and I eventually want to improve it).
>
>    Anyway, this is how the code works today:
>
> 	void netpoll_send_udp(struct netpoll *np, const char *msg, int len)
> 	{
> 		...
> 		skb = find_skb(np, total_len + np->dev->needed_tailroom,...
> 		// transmit the skb
>
> 	static struct sk_buff *find_skb(struct netpoll *np, int len, int reserve)
> 	{
> 		...
> 		refill_skbs(np);
> 		skb = alloc_skb(len, GFP_ATOMIC);
> 		if (!skb)
> 			skb = skb_dequeue(&np->skb_pool);
> 		...
> 		// return the skb
>
> So, even if there is a transmission between enabling the netpoll
> target and populating the pool (which is NOT the case in the code
> today), it would not be a problem, given that netpoll_send_udp() will
> call refill_skbs() anyway.
>
> I have an in-development patch to improve it, by deferring this to a
> workthread, mainly because this whole allocation dance is done with a
> bunch of locks held, including the printk/console lock.
>
> I think that a better mechanism might be something like:
>
>  * If find_skb() needs to consume from the pool (which is rare, only
>    when alloc_skb() fails), raise a workthread that tries to repopulate
>    the pool in the background.
>
>  * Eventually avoid alloc_skb() first and get directly from the pool
>    instead; if the pool is depleted, try alloc_skb(GFP_ATOMIC).
>    This might make the code faster, but I don't have data yet.

I've hacked this case (getting the skb from the pool first and refilling
it on a workqueue) today, and the performance gain is significant.

I've tested sending 2k messages, and measured the time it takes to
run `netpoll_send_udp`, which is the critical function in netpoll.

Current code (with this patchset applied), where the index is nanoseconds:

[14K, 16K)   107 |@@@                                                 |
[16K, 20K)  1757 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[20K, 24K)    59 |@                                                   |
[24K, 28K)    35 |@                                                   |
[28K, 32K)    35 |@                                                   |
[32K, 40K)     5 |                                                    |
[40K, 48K)     0 |                                                    |
[48K, 56K)     0 |                                                    |
[56K, 64K)     0 |                                                    |
[64K, 80K)     1 |                                                    |
[80K, 96K)     0 |                                                    |
[96K, 112K)    1 |                                                    |

With the optimization applied, I get a solid win:

[8K, 10K)     32 |@                                                   |
[10K, 12K)   514 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[12K, 14K)   932 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[14K, 16K)   367 |@@@@@@@@@@@@@@@@@@@@                                |
[16K, 20K)   102 |@@@@@                                               |
[20K, 24K)    29 |@                                                   |
[24K, 28K)    17 |                                                    |
[28K, 32K)     1 |                                                    |
[32K, 40K)     3 |                                                    |
[40K, 48K)     0 |                                                    |
[48K, 56K)     0 |                                                    |
[56K, 64K)     1 |                                                    |
[64K, 80K)     0 |                                                    |
[80K, 96K)     1 |                                                    |
[96K, 112K)    1 |                                                    |

That was captured with this simple bpftrace script:

kprobe:netpoll_send_udp {
	@start[tid] = nsecs;
}

kretprobe:netpoll_send_udp /@start[tid]/ {
	$duration = nsecs - @start[tid];
	delete(@start[tid]);
	@ = hist($duration, 2)
}

END {
	clear(@start);
	print(@);
}

And this is the patch I am testing right now:

commit 262de00e439e0708fadf5e4c2896837c046a325b
Author: Breno Leitao <leitao@debian.org>
Date:   Fri Nov 1 10:03:58 2024 -0700

    netpoll: prioritize the skb from the pool.

    Prioritize allocating SKBs from the pool over alloc_skb() to reduce
    overhead in the critical path. Move the pool refill operation to a
    worktask, allowing it to run outside of the printk/console lock.

    This change improves performance by minimizing the time spent in the
    critical path filling and allocating skbs, reducing contention on the
    printk/console lock.

    Signed-off-by: Breno Leitao <leitao@debian.org>

diff --git a/include/linux/netpoll.h b/include/linux/netpoll.h
index 77635b885c18..c81dc9cc0139 100644
--- a/include/linux/netpoll.h
+++ b/include/linux/netpoll.h
@@ -33,6 +33,7 @@ struct netpoll {
 	u16 local_port, remote_port;
 	u8 remote_mac[ETH_ALEN];
 	struct sk_buff_head skb_pool;
+	struct work_struct work;
 };
 
 struct netpoll_info {
diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index bf2064d689d5..657100393489 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -278,18 +278,24 @@ static void zap_completion_queue(void)
 	put_cpu_var(softnet_data);
 }
 
+
+static void refill_skbs_wt(struct work_struct *work)
+{
+	struct netpoll *np = container_of(work, struct netpoll, work);
+
+	refill_skbs(np);
+}
+
 static struct sk_buff *find_skb(struct netpoll *np, int len, int reserve)
 {
 	int count = 0;
 	struct sk_buff *skb;
 
 	zap_completion_queue();
-	refill_skbs(np);
 repeat:
-
-	skb = alloc_skb(len, GFP_ATOMIC);
+	skb = skb_dequeue(&np->skb_pool);
 	if (!skb)
-		skb = skb_dequeue(&np->skb_pool);
+		skb = alloc_skb(len, GFP_ATOMIC);
 
 	if (!skb) {
 		if (++count < 10) {
@@ -301,6 +307,7 @@ static struct sk_buff *find_skb(struct netpoll *np, int len, int reserve)
 	refcount_set(&skb->users, 1);
 	skb_reserve(skb, reserve);
 
+	schedule_work(&np->work);
 	return skb;
 }
 
@@ -780,6 +787,7 @@ int netpoll_setup(struct netpoll *np)
 
 	/* fill up the skb queue */
 	refill_skbs(np);
+	INIT_WORK(&np->work, refill_skbs_wt);
 	return 0;
 
 put:

On Fri, 1 Nov 2024 11:18:29 -0700 Breno Leitao wrote:
> > I think that a better mechanism might be something like:
> >
> >  * If find_skb() needs to consume from the pool (which is rare, only
> >    when alloc_skb() fails), raise a workthread that tries to repopulate
> >    the pool in the background.
> >
> >  * Eventually avoid alloc_skb() first and get directly from the pool
> >    instead; if the pool is depleted, try alloc_skb(GFP_ATOMIC).
> >    This might make the code faster, but I don't have data yet.
>
> I've hacked this case (getting the skb from the pool first and refilling
> it on a workqueue) today, and the performance gain is significant.
>
> I've tested sending 2k messages, and measured the time it takes to
> run `netpoll_send_udp`, which is the critical function in netpoll.

The purpose of the pool is to have a reserve in case of OOM, AFAIU.
We may speed things up by taking the allocations out of line but
we risk the pool being empty when we really need it.

On Fri, Nov 01, 2024 at 07:01:01PM -0700, Jakub Kicinski wrote:
> On Fri, 1 Nov 2024 11:18:29 -0700 Breno Leitao wrote:
> > > I think that a better mechanism might be something like:
> > >
> > >  * If find_skb() needs to consume from the pool (which is rare, only
> > >    when alloc_skb() fails), raise a workthread that tries to repopulate
> > >    the pool in the background.
> > >
> > >  * Eventually avoid alloc_skb() first and get directly from the pool
> > >    instead; if the pool is depleted, try alloc_skb(GFP_ATOMIC).
> > >    This might make the code faster, but I don't have data yet.
> >
> > I've hacked this case (getting the skb from the pool first and refilling
> > it on a workqueue) today, and the performance gain is significant.
> >
> > I've tested sending 2k messages, and measured the time it takes to
> > run `netpoll_send_udp`, which is the critical function in netpoll.
>
> The purpose of the pool is to have a reserve in case of OOM, AFAIU.
> We may speed things up by taking the allocations out of line but
> we risk the pool being empty when we really need it.

Correct, but in a case of OOM, I am not sure this is going to change the
chances at all.

Let's assume the pool is full and we start getting OOMs. It doesn't
matter if alloc_skb() will fail in the critical path or in the work
thread, netpoll will have MAX_SKBS skbs buffered to use, and none will
be allocated, thus, just 32 SKBs will be used until a -ENOMEM returns.

On the other side, let's suppose there is a bunch of OOM pressure for a
while (10 SKBs are consumed for instance), and then some free memory
shows up, causing the pool to be replenished. It is better to do that in
the workthread rather than in the hot path.

In both cases, the chance of not having SKBs to send the packet seems to
be the same, unless I am not modeling the problem correctly.

On top of that, a few other points where this new model could help more
in an OOM case:

1) Now with Maksym's patches, we can monitor the OOMing rate

2) With the pool per target, we can easily increase the pool size if we
   want. (patchset not pushed yet)

This will also fix another corner case we have in netconsole. When
printk() holds the console/target_list lock, the code downstream cannot
printk() anymore, otherwise it will deadlock the system. That is because
a printk() will call netconsole again (nested), and it will try to get
the console_lock/target_lock again, but that is already held.

Having the alloc_skb() out of that hot path will reduce the probability
of this happening.

This is something I am planning to fix later, by just dropping the
nested message. Having this patch might mean fewer packets get dropped.

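A rough sketch of the "drop the nested message" idea mentioned above (not a
patch from this thread; netconsole_sending and write_msg_guarded are made-up
names, and a real version would also need to cope with preemption and CPU
migration between the flag read and write):

```c
#include <linux/percpu.h>

/* Set while this CPU is inside the netconsole send path. */
static DEFINE_PER_CPU(bool, netconsole_sending);

static void write_msg_guarded(struct console *con, const char *msg,
			      unsigned int len)
{
	/* A printk() issued from inside the send path would deadlock on
	 * the locks already held, so drop the nested message instead.
	 */
	if (this_cpu_read(netconsole_sending))
		return;

	this_cpu_write(netconsole_sending, true);
	/* ... the regular write_msg()/send_udp() body would run here ... */
	this_cpu_write(netconsole_sending, false);
}
```
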
On Mon, 4 Nov 2024 12:40:00 -0800 Breno Leitao wrote:
> Let's assume the pool is full and we start getting OOMs. It doesn't
> matter if alloc_skb() will fail in the critical path or in the work
> thread, netpoll will have MAX_SKBS skbs buffered to use, and none will
> be allocated, thus, just 32 SKBs will be used until a -ENOMEM returns.

Do you assume the worker thread will basically keep up with the output?
Vadim was showing me a system earlier today where workqueue workers
didn't get scheduled in for minutes :( That's a bit extreme but doesn't
inspire confidence in the worker replenishing the pool quickly.

> On the other side, let's suppose there is a bunch of OOM pressure for a
> while (10 SKBs are consumed for instance), and then some free memory
> shows up, causing the pool to be replenished. It is better to do that in
> the workthread rather than in the hot path.

We could cap how much we replenish in one go?

> In both cases, the chance of not having SKBs to send the packet seems to
> be the same, unless I am not modeling the problem correctly.

Maybe I misunderstood the proposal, I think you said earlier that you
want to consume from the pool instead of calling alloc(). If you mean
that we'd still alloc in the fast path but not replenish the pool
that's different.

> On top of that, a few other points where this new model could help more
> in an OOM case:
>
> 1) Now with Maksym's patches, we can monitor the OOMing rate
>
> 2) With the pool per target, we can easily increase the pool size if we
>    want. (patchset not pushed yet)

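For reference, a minimal sketch of the capped replenish suggested above,
keeping the refill in the hot path but bounding the work done per call
(REFILL_BUDGET and refill_skbs_capped() are made-up names; MAX_SKBS and
MAX_SKB_SIZE are the constants the existing refill_skbs() uses):

```c
#define REFILL_BUDGET	4	/* at most this many allocations per call */

static void refill_skbs_capped(struct netpoll *np)
{
	struct sk_buff *skb;
	int budget = REFILL_BUDGET;

	/* Top up the per-target pool, but never spend more than
	 * REFILL_BUDGET allocations in one go from the send path.
	 */
	while (skb_queue_len(&np->skb_pool) < MAX_SKBS && budget--) {
		skb = alloc_skb(MAX_SKB_SIZE, GFP_ATOMIC);
		if (!skb)
			break;		/* under memory pressure, give up */
		skb_queue_tail(&np->skb_pool, skb);
	}
}
```
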
Hello Jakub,

On Tue, Nov 05, 2024 at 05:00:29PM -0800, Jakub Kicinski wrote:
> On Mon, 4 Nov 2024 12:40:00 -0800 Breno Leitao wrote:
> > Let's assume the pool is full and we start getting OOMs. It doesn't
> > matter if alloc_skb() will fail in the critical path or in the work
> > thread, netpoll will have MAX_SKBS skbs buffered to use, and none will
> > be allocated, thus, just 32 SKBs will be used until a -ENOMEM returns.
>
> Do you assume the worker thread will basically keep up with the output?
> Vadim was showing me a system earlier today where workqueue workers
> didn't get scheduled in for minutes :( That's a bit extreme but doesn't
> inspire confidence in the worker replenishing the pool quickly.

Interesting. Thanks for the data point.

> > On the other side, let's suppose there is a bunch of OOM pressure for a
> > while (10 SKBs are consumed for instance), and then some free memory
> > shows up, causing the pool to be replenished. It is better to do that in
> > the workthread rather than in the hot path.
>
> We could cap how much we replenish in one go?

If we keep the replenish in the hot path, I think it is worth doing it,
for sure.

> > In both cases, the chance of not having SKBs to send the packet seems to
> > be the same, unless I am not modeling the problem correctly.
>
> Maybe I misunderstood the proposal, I think you said earlier that you
> want to consume from the pool instead of calling alloc(). If you mean
> that we'd still alloc in the fast path but not replenish the pool
> that's different.

To clarify, let me take a step back and outline what this patchset
proposes:

The patchset enhances SKB pool management in three key ways:

  a) It delays populating the skb pool until the target is active.
  b) It releases the skb pool when there are no more active users.
  c) It creates a separate pool for each target.

The third point (c) is the one that's open to discussion, as I
understand it.

I proposed that having an individualized skb pool that users can control
would be beneficial. For example, users could define the number of skbs
in the pool. This could lead to additional advantages, such as allowing
netpoll to directly consume from the pool instead of relying on alloc()
in the optimal scenario, thereby speeding up the critical path.

--breno

On Wed, 6 Nov 2024 07:06:06 -0800 Breno Leitao wrote:
> To clarify, let me take a step back and outline what this patchset
> proposes:
>
> The patchset enhances SKB pool management in three key ways:
>
>   a) It delays populating the skb pool until the target is active.
>   b) It releases the skb pool when there are no more active users.
>   c) It creates a separate pool for each target.
>
> The third point (c) is the one that's open to discussion, as I
> understand it.
>
> I proposed that having an individualized skb pool that users can control
> would be beneficial. For example, users could define the number of skbs
> in the pool. This could lead to additional advantages, such as allowing
> netpoll to directly consume from the pool instead of relying on alloc()
> in the optimal scenario, thereby speeding up the critical path.

Patch 1 is the one I'm not completely convinced by. I understand
the motivation but it's rather unusual to activate partially initialized
objects. Maybe let's leave it out.

The rest is fine, although I'd invert the justification for the second
patch. We should in fact scale the number of pooled packets with the
number of consoles. Each message gets sent to every console, so a system
with 2 netconsoles has effectively half the OOM cushion.

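As a rough worked example of the cushion argument (assuming the current
MAX_SKBS of 32): with one shared pool and two netconsole targets, every
message that has to fall back to the pool consumes one SKB per target, so
the 32-SKB reserve covers only about 16 messages; with a 32-SKB pool per
target, each console keeps its full 32-message cushion.
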
Hello Jakub,

On Wed, Nov 06, 2024 at 03:43:49PM -0800, Jakub Kicinski wrote:
> On Wed, 6 Nov 2024 07:06:06 -0800 Breno Leitao wrote:
> > To clarify, let me take a step back and outline what this patchset
> > proposes:
> >
> > The patchset enhances SKB pool management in three key ways:
> >
> >   a) It delays populating the skb pool until the target is active.
> >   b) It releases the skb pool when there are no more active users.
> >   c) It creates a separate pool for each target.
> >
> > The third point (c) is the one that's open to discussion, as I
> > understand it.
> >
> > I proposed that having an individualized skb pool that users can control
> > would be beneficial. For example, users could define the number of skbs
> > in the pool. This could lead to additional advantages, such as allowing
> > netpoll to directly consume from the pool instead of relying on alloc()
> > in the optimal scenario, thereby speeding up the critical path.
>
> Patch 1 is the one I'm not completely convinced by. I understand
> the motivation but it's rather unusual to activate partially initialized
> objects. Maybe let's leave it out.
>
> The rest is fine, although I'd invert the justification for the second
> patch. We should in fact scale the number of pooled packets with the
> number of consoles. Each message gets sent to every console, so a system
> with 2 netconsoles has effectively half the OOM cushion.

That is fair. Thanks for the guidance. I will keep patch 1 out of it and
send a v2.

Thanks
--breno

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index aa49b92e9194..e83fd8bdce36 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -772,13 +772,13 @@ int netpoll_setup(struct netpoll *np)
 		}
 	}
 
-	/* fill up the skb queue */
-	refill_skbs();
-
 	err = __netpoll_setup(np, ndev);
 	if (err)
 		goto put;
 	rtnl_unlock();
+
+	/* fill up the skb queue */
+	refill_skbs();
 	return 0;
 
 put:

The current implementation has a flaw where it populates the skb_pool
with 32 SKBs before calling __netpoll_setup(). If the setup fails, the
skb_pool buffer will persist indefinitely and never be cleaned up.

This change moves the skb_pool population to after the successful
completion of __netpoll_setup(), ensuring that the buffers are not
unnecessarily retained. Additionally, this modification alleviates rtnl
lock pressure by allowing the buffer filling to occur outside of the
lock.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 net/core/netpoll.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)