Message ID | 20210817194003.2102381-1-weiwan@google.com (mailing list archive)
---|---
State | Accepted
Commit | 4b1327be9fe57443295ae86fe0fcf24a18469e9f
Delegated to | Netdev Maintainers
Series | [net-next] net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()
Hello:

This patch was applied to netdev/net-next.git (refs/heads/master):

On Tue, 17 Aug 2021 12:40:03 -0700 you wrote:
> Add gfp_t mask as an input parameter to mem_cgroup_charge_skmem(),
> to give more control to the networking stack and enable it to change
> memcg charging behavior. In the future, the networking stack may decide
> to avoid oom-kills when fallbacks are more appropriate.
>
> One behavior change in mem_cgroup_charge_skmem() by this patch is to
> avoid force charging by default and let the caller decide when and if
> force charging is needed through the presence or absence of
> __GFP_NOFAIL.
>
> [...]

Here is the summary with links:
  - [net-next] net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()
    https://git.kernel.org/netdev/net-next/c/4b1327be9fe5

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
On Tue, 17 Aug 2021 12:40:03 -0700 Wei Wang wrote:
> Add gfp_t mask as an input parameter to mem_cgroup_charge_skmem(),
> to give more control to the networking stack and enable it to change
> memcg charging behavior. In the future, the networking stack may decide
> to avoid oom-kills when fallbacks are more appropriate.
>
> One behavior change in mem_cgroup_charge_skmem() by this patch is to
> avoid force charging by default and let the caller decide when and if
> force charging is needed through the presence or absence of
> __GFP_NOFAIL.

This patch is causing a little bit of pain to us, to workloads running
with just memory.max set. After this change the TCP rx path no longer
forces the charging.

Any recommendation for the fix? Setting memory.high a few MB under
memory.max seems to remove the failures.
On Wed, Oct 12, 2022 at 4:33 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 17 Aug 2021 12:40:03 -0700 Wei Wang wrote:
> > Add gfp_t mask as an input parameter to mem_cgroup_charge_skmem(),
> > to give more control to the networking stack and enable it to change
> > memcg charging behavior. In the future, the networking stack may decide
> > to avoid oom-kills when fallbacks are more appropriate.
> >
> > One behavior change in mem_cgroup_charge_skmem() by this patch is to
> > avoid force charging by default and let the caller decide when and if
> > force charging is needed through the presence or absence of
> > __GFP_NOFAIL.
>
> This patch is causing a little bit of pain to us, to workloads running
> with just memory.max set. After this change the TCP rx path no longer
> forces the charging.
>
> Any recommendation for the fix? Setting memory.high a few MB under
> memory.max seems to remove the failures.

Did the revert of this patch fix the issue you are seeing? The reason
I am asking is because this patch should not change the behavior.

Actually someone else reported the similar issue for UDP RX at [1] and
they tested the revert as well. The revert did not fix the issue for
them.

Wei has a better explanation at [2] why this patch is not the cause
for these issues.

[1] https://lore.kernel.org/all/CALvZod5_LVkOkF+gmefnctmx+bRjykSARm2JA9eqKJx85NYBGQ@mail.gmail.com/
[2] https://lore.kernel.org/all/CAEA6p_BhAh6f_kAHEoEJ38nunY=c=4WqxhJQUjT+dCSAr_rm8g@mail.gmail.com/
On Wed, 12 Oct 2022 17:17:38 -0700 Shakeel Butt wrote:
> Did the revert of this patch fix the issue you are seeing? The reason
> I am asking is because this patch should not change the behavior.
> Actually someone else reported the similar issue for UDP RX at [1] and
> they tested the revert as well. The revert did not fix the issue for
> them.
>
> Wei has a better explanation at [2] why this patch is not the cause
> for these issues.

We're talking TCP here, to be clear. I haven't tested a revert, yet (not
that easy to test with a real workload) but I'm relatively confident the
change did introduce an "unforced" call, specifically this bit:

@@ -2728,10 +2728,12 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 {
 	struct proto *prot = sk->sk_prot;
 	long allocated = sk_memory_allocated_add(sk, amt);
+	bool memcg_charge = mem_cgroup_sockets_enabled && sk->sk_memcg;
 	bool charged = true;

-	if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
-	    !(charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt)))
+	if (memcg_charge &&
+	    !(charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
+						gfp_memcg_charge())))

where gfp_memcg_charge() is GFP_NOWAIT in softirq.

The above gets called from (inverted stack):
 tcp_data_queue()
  tcp_try_rmem_schedule(sk, skb, skb->truesize)
   tcp_try_rmem_schedule()
    sk_rmem_schedule()
     __sk_mem_schedule()
      __sk_mem_raise_allocated()

Is my confidence unjustified? :)
On Wed, Oct 12, 2022 at 05:38:25PM -0700, Jakub Kicinski wrote:
> On Wed, 12 Oct 2022 17:17:38 -0700 Shakeel Butt wrote:
> > Did the revert of this patch fix the issue you are seeing? The reason
> > I am asking is because this patch should not change the behavior.
> > Actually someone else reported the similar issue for UDP RX at [1] and
> > they tested the revert as well. The revert did not fix the issue for
> > them.
> >
> > Wei has a better explanation at [2] why this patch is not the cause
> > for these issues.
>
> We're talking TCP here, to be clear. I haven't tested a revert, yet (not
> that easy to test with a real workload) but I'm relatively confident the
> change did introduce an "unforced" call, specifically this bit:
>
> @@ -2728,10 +2728,12 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
>  {
>  	struct proto *prot = sk->sk_prot;
>  	long allocated = sk_memory_allocated_add(sk, amt);
> +	bool memcg_charge = mem_cgroup_sockets_enabled && sk->sk_memcg;
>  	bool charged = true;
>
> -	if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
> -	    !(charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt)))
> +	if (memcg_charge &&
> +	    !(charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
> +						gfp_memcg_charge())))
>
> where gfp_memcg_charge() is GFP_NOWAIT in softirq.
>
> The above gets called from (inverted stack):
>  tcp_data_queue()
>   tcp_try_rmem_schedule(sk, skb, skb->truesize)
>    tcp_try_rmem_schedule()
>     sk_rmem_schedule()
>      __sk_mem_schedule()
>       __sk_mem_raise_allocated()
>
> Is my confidence unjustified? :)

Let me add Wei's explanation inline, which is protocol independent:

__sk_mem_raise_allocated() BEFORE the above patch:
- mem_cgroup_charge_skmem() gets called:
  - try_charge() with GFP_NOWAIT gets called and fails
  - try_charge() with __GFP_NOFAIL
  - return false
- goto suppress_allocation:
  - mem_cgroup_uncharge_skmem() gets called
- return 0 (which means failure)

AFTER the above patch, what happens in __sk_mem_raise_allocated() is:
- mem_cgroup_charge_skmem() gets called:
  - try_charge() with GFP_NOWAIT gets called and fails
  - return false
- goto suppress_allocation:
  - We no longer call mem_cgroup_uncharge_skmem()
- return 0

So, before the patch, the memcg code may force charges but it will
return false and make the networking code uncharge memcg for
SK_MEM_RECV.
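To make the two flows easier to compare, here is a small userspace model of the sequence Wei describes. It uses toy stand-ins only -- the try_charge()/uncharge() helpers and the usage/limit counters below are simplifications invented for illustration, not the real memcontrol.c code -- and it shows why the caller sees false in both cases while only the old path transiently overcharges:

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-ins for the memcg charge path; "usage" vs "limit" models
 * the cgroup's page counter. All names here are illustrative only. */
static long usage, limit = 100;

static int try_charge(long pages, bool nofail)
{
	if (usage + pages <= limit || nofail) {
		usage += pages;	/* __GFP_NOFAIL may push usage past limit */
		return 0;
	}
	return -1;		/* GFP_NOWAIT charge fails under pressure */
}

static void uncharge(long pages) { usage -= pages; }

/* Before the patch: force-charge on failure, but still report failure. */
static bool charge_skmem_old(long pages)
{
	if (try_charge(pages, false) == 0)
		return true;
	try_charge(pages, true);	/* force charge (__GFP_NOFAIL) */
	return false;
}

/* After the patch: simply fail; no force charge, nothing to undo. */
static bool charge_skmem_new(long pages)
{
	return try_charge(pages, false) == 0;
}

int main(void)
{
	usage = 99;
	/* Old path: SK_MEM_RECV sees 'false' and uncharges the forced
	 * pages, so the end state matches the new path -- except for the
	 * transient overcharge (and whatever side effects it triggered). */
	if (!charge_skmem_old(4))
		uncharge(4);
	printf("old path usage: %ld\n", usage);

	usage = 99;
	if (!charge_skmem_new(4)) {
		/* nothing was charged, so there is nothing to uncharge */
	}
	printf("new path usage: %ld\n", usage);
	return 0;
}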
On Thu, 13 Oct 2022 00:54:31 +0000 Shakeel Butt wrote:
> So, before the patch, the memcg code may force charges but it will
> return false and make the networking code uncharge memcg for
> SK_MEM_RECV.

Ah, right, I see it now :(

I guess I'll have to try to test (some approximation of) a revert
after all.

Did the fact that we used to force charge not potentially cause
reclaim, tho? Letting TCP accept the next packet even if it had
to drop the current one?
On Wed, 12 Oct 2022 18:40:50 -0700 Jakub Kicinski wrote:
> Did the fact that we used to force charge not potentially cause
> reclaim, tho? Letting TCP accept the next packet even if it had
> to drop the current one?

I pushed this little nugget to one affected machine via KLP:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 03ffbb255e60..c1ca369a1b77 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7121,6 +7121,10 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
 		return true;
 	}

+	if (gfp_mask == GFP_NOWAIT) {
+		try_charge(memcg, gfp_mask|__GFP_NOFAIL, nr_pages);
+		refill_stock(memcg, nr_pages);
+	}
 	return false;
 }

The problem normally reproes reliably within 10min -- 30min and counting
and the application-level latency has not spiked.
On Wed, Oct 12, 2022 at 8:16 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 12 Oct 2022 18:40:50 -0700 Jakub Kicinski wrote:
> > Did the fact that we used to force charge not potentially cause
> > reclaim, tho? Letting TCP accept the next packet even if it had
> > to drop the current one?
>
> I pushed this little nugget to one affected machine via KLP:
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 03ffbb255e60..c1ca369a1b77 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7121,6 +7121,10 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
>  		return true;
>  	}
>
> +	if (gfp_mask == GFP_NOWAIT) {
> +		try_charge(memcg, gfp_mask|__GFP_NOFAIL, nr_pages);
> +		refill_stock(memcg, nr_pages);
> +	}
>  	return false;
>  }
>

AFAICT, if you force charge by passing __GFP_NOFAIL to try_charge(),
you should return true to tell the caller that the nr_pages is
actually being charged.

Although I am not very sure what refill_stock() does. Does that
"uncharge" those pages?

> The problem normally reproes reliably within 10min -- 30min and counting
> and the application-level latency has not spiked.
On Wed, 12 Oct 2022 20:34:00 -0700 Wei Wang wrote:
> > I pushed this little nugget to one affected machine via KLP:
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 03ffbb255e60..c1ca369a1b77 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -7121,6 +7121,10 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
> >  		return true;
> >  	}
> >
> > +	if (gfp_mask == GFP_NOWAIT) {
> > +		try_charge(memcg, gfp_mask|__GFP_NOFAIL, nr_pages);
> > +		refill_stock(memcg, nr_pages);
> > +	}
> >  	return false;
> >  }
> >
> AFAICT, if you force charge by passing __GFP_NOFAIL to try_charge(),
> you should return true to tell the caller that the nr_pages is
> actually being charged.

Ack - not sure what the best thing to do is, tho. Always pass NOFAIL
in softirq?

It's not clear to me yet why doing the charge/uncharge actually helps,
perhaps try_to_free_mem_cgroup_pages() does more when NOFAIL is passed?
I'll do more digging tomorrow.

> Although I am not very sure what refill_stock() does. Does that
> "uncharge" those pages?

I think so, I copied it from mem_cgroup_uncharge_skmem().
On Wed, Oct 12, 2022 at 8:49 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 12 Oct 2022 20:34:00 -0700 Wei Wang wrote:
> > > I pushed this little nugget to one affected machine via KLP:
> > >
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 03ffbb255e60..c1ca369a1b77 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -7121,6 +7121,10 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
> > >  		return true;
> > >  	}
> > >
> > > +	if (gfp_mask == GFP_NOWAIT) {
> > > +		try_charge(memcg, gfp_mask|__GFP_NOFAIL, nr_pages);
> > > +		refill_stock(memcg, nr_pages);
> > > +	}
> > >  	return false;
> > >  }
> > >
> > AFAICT, if you force charge by passing __GFP_NOFAIL to try_charge(),
> > you should return true to tell the caller that the nr_pages is
> > actually being charged.
>
> Ack - not sure what the best thing to do is, tho. Always pass NOFAIL
> in softirq?
>
> It's not clear to me yet why doing the charge/uncharge actually helps,
> perhaps try_to_free_mem_cgroup_pages() does more when NOFAIL is passed?
>
I am curious to know as well.

> I'll do more digging tomorrow.
>
> > Although I am not very sure what refill_stock() does. Does that
> > "uncharge" those pages?
>
> I think so, I copied it from mem_cgroup_uncharge_skmem().
On Wed, Oct 12, 2022 at 09:04:59PM -0700, Wei Wang wrote:
> On Wed, Oct 12, 2022 at 8:49 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Wed, 12 Oct 2022 20:34:00 -0700 Wei Wang wrote:
> > > > I pushed this little nugget to one affected machine via KLP:
> > > >
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index 03ffbb255e60..c1ca369a1b77 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -7121,6 +7121,10 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
> > > >  		return true;
> > > >  	}
> > > >
> > > > +	if (gfp_mask == GFP_NOWAIT) {
> > > > +		try_charge(memcg, gfp_mask|__GFP_NOFAIL, nr_pages);
> > > > +		refill_stock(memcg, nr_pages);
> > > > +	}
> > > >  	return false;
> > > >  }
> > > >
> > > AFAICT, if you force charge by passing __GFP_NOFAIL to try_charge(),
> > > you should return true to tell the caller that the nr_pages is
> > > actually being charged.
> >
> > Ack - not sure what the best thing to do is, tho. Always pass NOFAIL
> > in softirq?
> >
> > It's not clear to me yet why doing the charge/uncharge actually helps,
> > perhaps try_to_free_mem_cgroup_pages() does more when NOFAIL is passed?
> >
> I am curious to know as well.
>
> > I'll do more digging tomorrow.
> >
> > > Although I am not very sure what refill_stock() does. Does that
> > > "uncharge" those pages?
> >
> > I think so, I copied it from mem_cgroup_uncharge_skmem().

I think I understand why this issue starts happening after this patch.

The memcg charging happens in batches of 32 (64 nowadays) pages even if
the charge request is less. The remaining pre-charge is cached in the
per-cpu cache (or stock).

With (GFP_NOWAIT | __GFP_NOFAIL), you let the memcg go over the limit
without triggering oom-kill, and then refill_stock just puts the
pre-charge in the per-cpu cache. So the later allocation/charge
succeeds from the per-cpu cache even though memcg is over the limit.

With this patch we no longer force charge and then uncharge on failure,
so the later allocation/charge fails similarly.

Regarding the right thing to do: IMHO, it is to use GFP_ATOMIC instead
of GFP_NOWAIT. If you see the following comment in try_charge_memcg(),
we added this exception particularly for this kind of situation.

...
	/*
	 * Memcg doesn't have a dedicated reserve for atomic
	 * allocations. But like the global atomic pool, we need to
	 * put the burden of reclaim on regular allocation requests
	 * and let these go through as privileged allocations.
	 */
	if (!(gfp_mask & (__GFP_NOFAIL | __GFP_HIGH)))
		return -ENOMEM;
...

Shakeel
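As a rough illustration of the batching effect Shakeel describes, the following userspace model may help. It is a toy sketch, not the real per-cpu stock code: BATCH, counter, stock and the helper names are invented stand-ins, and the batch size is assumed to be 32 as in the older kernels discussed here. It shows why force-charge followed by "uncharge" (refill_stock) leaves a pre-charge behind that lets later GFP_NOWAIT charges succeed:

#include <stdbool.h>
#include <stdio.h>

#define BATCH 32	/* memcg charges the page counter in batches */

/* Toy model: "counter" is the memcg page counter, "stock" is the
 * per-cpu cache of pre-charged pages. Names are illustrative only. */
static long counter, limit = 100, stock;

static bool try_charge(long pages, bool nofail)
{
	if (stock >= pages) {		/* consume the per-cpu pre-charge */
		stock -= pages;
		return true;
	}
	if (counter + BATCH > limit && !nofail)
		return false;		/* GFP_NOWAIT: fail at the limit */
	counter += BATCH;		/* __GFP_NOFAIL: go over the limit */
	stock += BATCH - pages;		/* remainder becomes pre-charge */
	return true;
}

static void refill_stock(long pages) { stock += pages; }	/* "uncharge" */

int main(void)
{
	counter = 96;	/* near the limit, as in the reported workloads */

	/* Old behavior: force charge, return false, caller uncharges. */
	try_charge(4, false);		/* fails */
	try_charge(4, true);		/* forces counter over the limit */
	refill_stock(4);		/* caller's "uncharge" refills stock */
	printf("stock after old-style failure: %ld\n", stock);		/* 32 */
	printf("next GFP_NOWAIT charge: %d\n", try_charge(4, false));	/* 1 */

	/* New behavior: a plain failure leaves no pre-charge behind. */
	counter = 96; stock = 0;
	printf("next GFP_NOWAIT charge: %d\n", try_charge(4, false));	/* 0 */
	return 0;
}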
On Wed, 12 Oct 2022 16:33:00 -0700 Jakub Kicinski wrote:
> This patch is causing a little bit of pain to us, to workloads running
> with just memory.max set. After this change the TCP rx path no longer
> forces the charging.
>
> Any recommendation for the fix? Setting memory.high a few MB under
> memory.max seems to remove the failures.

Eric, is there anything that would make the TCP perform particularly
poorly under mem pressure?

Dropping and pruning happens a lot here:

# nstat -a | grep -i -E 'Prune|Drop'
TcpExtPruneCalled               1202577            0.0
TcpExtOfoPruned                 734606             0.0
TcpExtTCPOFODrop                64191              0.0
TcpExtTCPRcvQDrop               384305             0.0

Same workload on 5.6 kernel:

TcpExtPruneCalled               1223043            0.0
TcpExtOfoPruned                 3377               0.0
TcpExtListenDrops               10596              0.0
TcpExtTCPOFODrop                22                 0.0
TcpExtTCPRcvQDrop               734                0.0

From a quick look at the code and with what Shakeel explained in mind -
previously we would have "loaded up the cache" after the first failed
try, so we never got into the loop inside tcp_try_rmem_schedule() which
most likely nukes the entire OFO queue:

static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
				 unsigned int size)
{
	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
	    !sk_rmem_schedule(sk, skb, size)) {
		/* ^ would fail but "load up the cache" ^ */

		if (tcp_prune_queue(sk) < 0)
			return -1;

		/* v this one would not fail due to the cache v */
		while (!sk_rmem_schedule(sk, skb, size)) {
			if (!tcp_prune_ofo_queue(sk))
				return -1;

Neil mentioned that he's seen multi-second stalls when SACKed segments
get dropped from the OFO queue. Sender waits for a very long time before
retrying something that was already SACKed if the receiver keeps
sacking new, later segments. Even when ACK reaches the previously-SACKed
block which should prove to the sender that something is very wrong.

I tried to repro this with a packet drill and it's not what I see
exactly, I need to keep shortening the RTT otherwise the retx comes
out before the next SACK arrives.

I'll try to read the code, and maybe I'll get lucky and manage to
capture the exact impacted flows :S But does anything of this nature
ring a bell?

`../common/defaults.sh`

    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 8>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
  +.1 < . 1:1(0) ack 1 win 2048
   +0 accept(3, ..., ...) = 4

   +0 write(4, ..., 60000) = 60000
   +0 > P. 1:10001(10000) ack 1

// Do some SACK-ing
  +.1 < . 1:1(0) ack 1 win 513 <sack 1001:2001,nop,nop>
+.001 < . 1:1(0) ack 1 win 513 <sack 1001:2001 3001:4001 5001:6001,nop,nop>
// ..and we pretend we lost 1001:2001
+.001 < . 1:1(0) ack 1 win 513 <sack 2001:10001,nop,nop>

// re-xmit holes and send more
   +0 > . 10001:11001(1000) ack 1
   +0 > . 1:1001(1000) ack 1
   +0 > . 2001:3001(1000) ack 1 win 256
   +0 > P. 11001:13001(2000) ack 1 win 256
   +0 > P. 13001:15001(2000) ack 1 win 256

  +.1 < . 1:1(0) ack 1001 win 513 <sack 2001:15001,nop,nop>

   +0 > P. 15001:18001(3000) ack 1 win 256
   +0 > P. 18001:20001(2000) ack 1 win 256
   +0 > P. 20001:22001(2000) ack 1 win 256

  +.1 < . 1:1(0) ack 1001 win 513 <sack 2001:22001,nop,nop>

   +0 > P. 22001:24001(2000) ack 1 win 256
   +0 > P. 24001:26001(2000) ack 1 win 256
   +0 > P. 26001:28001(2000) ack 1 win 256
   +0 > . 28001:29001(1000) ack 1 win 256

+0.05 < . 1:1(0) ack 1001 win 257 <sack 2001:29001,nop,nop>

   +0 > P. 29001:31001(2000) ack 1 win 256
   +0 > P. 31001:33001(2000) ack 1 win 256
   +0 > P. 33001:35001(2000) ack 1 win 256
   +0 > . 35001:36001(1000) ack 1 win 256

+0.05 < . 1:1(0) ack 1001 win 257 <sack 2001:36001,nop,nop>

   +0 > P. 36001:38001(2000) ack 1 win 256
   +0 > P. 38001:40001(2000) ack 1 win 256
   +0 > P. 40001:42001(2000) ack 1 win 256
   +0 > . 42001:43001(1000) ack 1 win 256

+0.05 < . 1:1(0) ack 1001 win 257 <sack 2001:43001,nop,nop>

   +0 > P. 43001:45001(2000) ack 1 win 256
   +0 > P. 45001:47001(2000) ack 1 win 256
   +0 > P. 47001:49001(2000) ack 1 win 256
   +0 > . 49001:50001(1000) ack 1 win 256

+0.04 < . 1:1(0) ack 1001 win 257 <sack 2001:50001,nop,nop>

   +0 > P. 50001:52001(2000) ack 1 win 256
   +0 > P. 52001:54001(2000) ack 1 win 256
   +0 > P. 54001:56001(2000) ack 1 win 256
   +0 > . 56001:57001(1000) ack 1 win 256

+0.04 > . 1001:2001(1000) ack 1 win 256

  +.1 < . 1:1(0) ack 1001 win 257 <sack 2001:29001,nop,nop>
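For what it's worth, the pruning difference described above can be sketched with a toy model. Everything below is a stand-in for the real TCP/memcg code (the state variables and the "stock loading" shortcut are invented, and tcp_prune_queue() is assumed to free nothing): with the old force-charge-then-uncharge behavior the while loop never runs, while with the new behavior the whole OFO queue gets drained.

#include <stdbool.h>
#include <stdio.h>

/* Toy model of the tcp_try_rmem_schedule() path quoted above. */
static long memcg_usage, memcg_limit = 100, memcg_stock;
static int ofo_segments = 10;	/* SACKed out-of-order data we'd rather keep */
static bool old_behavior;

static bool sk_rmem_schedule(int pages)
{
	if (memcg_stock >= pages) {	/* pre-charge left in the per-cpu cache */
		memcg_stock -= pages;
		return true;
	}
	if (memcg_usage + pages <= memcg_limit) {
		memcg_usage += pages;
		return true;
	}
	if (old_behavior)
		memcg_stock += 32;	/* force charge + uncharge "loads up the cache" */
	return false;
}

static bool tcp_prune_ofo_queue(void)
{
	if (!ofo_segments)
		return false;
	ofo_segments--;			/* drop already-SACKed data */
	return true;
}

static int tcp_try_rmem_schedule(int pages)
{
	if (!sk_rmem_schedule(pages)) {
		/* tcp_prune_queue() elided; assume it frees nothing here */
		while (!sk_rmem_schedule(pages))
			if (!tcp_prune_ofo_queue())
				return -1;
	}
	return 0;
}

int main(void)
{
	old_behavior = true;
	memcg_usage = 100;		/* at the memcg limit */
	tcp_try_rmem_schedule(4);
	printf("old: ofo segments left %d\n", ofo_segments);	/* 10 */

	old_behavior = false;
	memcg_usage = 100; memcg_stock = 0; ofo_segments = 10;
	tcp_try_rmem_schedule(4);
	printf("new: ofo segments left %d\n", ofo_segments);	/* 0 */
	return 0;
}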
On Thu, Oct 13, 2022 at 2:49 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 12 Oct 2022 16:33:00 -0700 Jakub Kicinski wrote:
> > This patch is causing a little bit of pain to us, to workloads running
> > with just memory.max set. After this change the TCP rx path no longer
> > forces the charging.
> >
> > Any recommendation for the fix? Setting memory.high a few MB under
> > memory.max seems to remove the failures.
>
> Eric, is there anything that would make the TCP perform particularly
> poorly under mem pressure?
>
> Dropping and pruning happens a lot here:
>
> # nstat -a | grep -i -E 'Prune|Drop'
> TcpExtPruneCalled               1202577            0.0
> TcpExtOfoPruned                 734606             0.0
> TcpExtTCPOFODrop                64191              0.0
> TcpExtTCPRcvQDrop               384305             0.0
>
> Same workload on 5.6 kernel:
>
> TcpExtPruneCalled               1223043            0.0
> TcpExtOfoPruned                 3377               0.0
> TcpExtListenDrops               10596              0.0
> TcpExtTCPOFODrop                22                 0.0
> TcpExtTCPRcvQDrop               734                0.0
>
> From a quick look at the code and with what Shakeel explained in mind -
> previously we would have "loaded up the cache" after the first failed
> try, so we never got into the loop inside tcp_try_rmem_schedule() which
> most likely nukes the entire OFO queue:
>
> static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
> 				 unsigned int size)
> {
> 	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
> 	    !sk_rmem_schedule(sk, skb, size)) {
> 		/* ^ would fail but "load up the cache" ^ */
>
> 		if (tcp_prune_queue(sk) < 0)
> 			return -1;
>
> 		/* v this one would not fail due to the cache v */
> 		while (!sk_rmem_schedule(sk, skb, size)) {
> 			if (!tcp_prune_ofo_queue(sk))
> 				return -1;
>
> Neil mentioned that he's seen multi-second stalls when SACKed segments
> get dropped from the OFO queue. Sender waits for a very long time before
> retrying something that was already SACKed if the receiver keeps
> sacking new, later segments. Even when ACK reaches the previously-SACKed
> block which should prove to the sender that something is very wrong.
>
> I tried to repro this with a packet drill and it's not what I see
> exactly, I need to keep shortening the RTT otherwise the retx comes
> out before the next SACK arrives.
>
> I'll try to read the code, and maybe I'll get lucky and manage to
> capture the exact impacted flows :S But does anything of this nature
> ring a bell?
>
> `../common/defaults.sh`
>
>     0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>    +0 bind(3, ..., ...) = 0
>    +0 listen(3, 1) = 0
>
>    +0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 8>
>    +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
>   +.1 < . 1:1(0) ack 1 win 2048
>    +0 accept(3, ..., ...) = 4
>
>    +0 write(4, ..., 60000) = 60000
>    +0 > P. 1:10001(10000) ack 1
>
> // Do some SACK-ing
>   +.1 < . 1:1(0) ack 1 win 513 <sack 1001:2001,nop,nop>
> +.001 < . 1:1(0) ack 1 win 513 <sack 1001:2001 3001:4001 5001:6001,nop,nop>
> // ..and we pretend we lost 1001:2001
> +.001 < . 1:1(0) ack 1 win 513 <sack 2001:10001,nop,nop>
>
> // re-xmit holes and send more
>    +0 > . 10001:11001(1000) ack 1
>    +0 > . 1:1001(1000) ack 1
>    +0 > . 2001:3001(1000) ack 1 win 256
>    +0 > P. 11001:13001(2000) ack 1 win 256
>    +0 > P. 13001:15001(2000) ack 1 win 256
>
>   +.1 < . 1:1(0) ack 1001 win 513 <sack 2001:15001,nop,nop>
>
>    +0 > P. 15001:18001(3000) ack 1 win 256
>    +0 > P. 18001:20001(2000) ack 1 win 256
>    +0 > P. 20001:22001(2000) ack 1 win 256
>
>   +.1 < . 1:1(0) ack 1001 win 513 <sack 2001:22001,nop,nop>
>
>    +0 > P. 22001:24001(2000) ack 1 win 256
>    +0 > P. 24001:26001(2000) ack 1 win 256
>    +0 > P. 26001:28001(2000) ack 1 win 256
>    +0 > . 28001:29001(1000) ack 1 win 256
>
> +0.05 < . 1:1(0) ack 1001 win 257 <sack 2001:29001,nop,nop>
>
>    +0 > P. 29001:31001(2000) ack 1 win 256
>    +0 > P. 31001:33001(2000) ack 1 win 256
>    +0 > P. 33001:35001(2000) ack 1 win 256
>    +0 > . 35001:36001(1000) ack 1 win 256
>
> +0.05 < . 1:1(0) ack 1001 win 257 <sack 2001:36001,nop,nop>
>
>    +0 > P. 36001:38001(2000) ack 1 win 256
>    +0 > P. 38001:40001(2000) ack 1 win 256
>    +0 > P. 40001:42001(2000) ack 1 win 256
>    +0 > . 42001:43001(1000) ack 1 win 256
>
> +0.05 < . 1:1(0) ack 1001 win 257 <sack 2001:43001,nop,nop>
>
>    +0 > P. 43001:45001(2000) ack 1 win 256
>    +0 > P. 45001:47001(2000) ack 1 win 256
>    +0 > P. 47001:49001(2000) ack 1 win 256
>    +0 > . 49001:50001(1000) ack 1 win 256
>
> +0.04 < . 1:1(0) ack 1001 win 257 <sack 2001:50001,nop,nop>
>
>    +0 > P. 50001:52001(2000) ack 1 win 256
>    +0 > P. 52001:54001(2000) ack 1 win 256
>    +0 > P. 54001:56001(2000) ack 1 win 256
>    +0 > . 56001:57001(1000) ack 1 win 256
>
> +0.04 > . 1001:2001(1000) ack 1 win 256
>

This is SACK reneging, I would have to double check linux behavior, but
reverting to timeout could very well happen.

> +.1 < . 1:1(0) ack 1001 win 257 <sack 2001:29001,nop,nop>
>
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bfe5c486f4ad..f0ee30881ca9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1581,7 +1581,8 @@ static inline void mem_cgroup_flush_foreign(struct bdi_writeback *wb)
 #endif	/* CONFIG_CGROUP_WRITEBACK */

 struct sock;
-bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
+bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
+			     gfp_t gfp_mask);
 void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
 #ifdef CONFIG_MEMCG
 extern struct static_key_false memcg_sockets_enabled_key;
diff --git a/include/net/sock.h b/include/net/sock.h
index 6e761451c927..95b25777b53e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2400,6 +2400,11 @@ static inline gfp_t gfp_any(void)
 	return in_softirq() ? GFP_ATOMIC : GFP_KERNEL;
 }

+static inline gfp_t gfp_memcg_charge(void)
+{
+	return in_softirq() ? GFP_NOWAIT : GFP_KERNEL;
+}
+
 static inline long sock_rcvtimeo(const struct sock *sk, bool noblock)
 {
 	return noblock ? 0 : sk->sk_rcvtimeo;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8ef06f9e0db1..be585ceaba98 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7048,14 +7048,14 @@ void mem_cgroup_sk_free(struct sock *sk)
  * mem_cgroup_charge_skmem - charge socket memory
  * @memcg: memcg to charge
  * @nr_pages: number of pages to charge
+ * @gfp_mask: reclaim mode
  *
  * Charges @nr_pages to @memcg. Returns %true if the charge fit within
- * @memcg's configured limit, %false if the charge had to be forced.
+ * @memcg's configured limit, %false if it doesn't.
  */
-bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
+bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
+			     gfp_t gfp_mask)
 {
-	gfp_t gfp_mask = GFP_KERNEL;
-
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
 		struct page_counter *fail;

@@ -7063,21 +7063,19 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 		memcg->tcpmem_pressure = 0;
 		return true;
 	}
-	page_counter_charge(&memcg->tcpmem, nr_pages);
 	memcg->tcpmem_pressure = 1;
+	if (gfp_mask & __GFP_NOFAIL) {
+		page_counter_charge(&memcg->tcpmem, nr_pages);
+		return true;
+	}
 		return false;
 	}

-	/* Don't block in the packet receive path */
-	if (in_softirq())
-		gfp_mask = GFP_NOWAIT;
-
-	mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
-
-	if (try_charge(memcg, gfp_mask, nr_pages) == 0)
+	if (try_charge(memcg, gfp_mask, nr_pages) == 0) {
+		mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
 		return true;
+	}

-	try_charge(memcg, gfp_mask|__GFP_NOFAIL, nr_pages);
 	return false;
 }

diff --git a/net/core/sock.c b/net/core/sock.c
index aada649e07e8..950f1e70dbf5 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2728,10 +2728,12 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 {
 	struct proto *prot = sk->sk_prot;
 	long allocated = sk_memory_allocated_add(sk, amt);
+	bool memcg_charge = mem_cgroup_sockets_enabled && sk->sk_memcg;
 	bool charged = true;

-	if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
-	    !(charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt)))
+	if (memcg_charge &&
+	    !(charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
+						gfp_memcg_charge())))
 		goto suppress_allocation;

 	/* Under limit. */
@@ -2785,8 +2787,14 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 		/* Fail only if socket is _under_ its sndbuf.
 		 * In this case we cannot block, so that we have to fail.
 		 */
-		if (sk->sk_wmem_queued + size >= sk->sk_sndbuf)
+		if (sk->sk_wmem_queued + size >= sk->sk_sndbuf) {
+			/* Force charge with __GFP_NOFAIL */
+			if (memcg_charge && !charged) {
+				mem_cgroup_charge_skmem(sk->sk_memcg, amt,
+					gfp_memcg_charge() | __GFP_NOFAIL);
+			}
 			return 1;
+		}
 	}

 	if (kind == SK_MEM_SEND || (kind == SK_MEM_RECV && charged))
@@ -2794,7 +2802,7 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)

 	sk_memory_allocated_sub(sk, amt);

-	if (mem_cgroup_sockets_enabled && sk->sk_memcg)
+	if (memcg_charge && charged)
 		mem_cgroup_uncharge_skmem(sk->sk_memcg, amt);

 	return 0;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 754013fa393b..f25d02ad4a8a 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -534,7 +534,8 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
 				  atomic_read(&newsk->sk_rmem_alloc));
 		mem_cgroup_sk_alloc(newsk);
 		if (newsk->sk_memcg && amt)
-			mem_cgroup_charge_skmem(newsk->sk_memcg, amt);
+			mem_cgroup_charge_skmem(newsk->sk_memcg, amt,
+						GFP_KERNEL | __GFP_NOFAIL);

 		release_sock(newsk);
 	}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 29553fce8502..6d72f3ea48c4 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3373,7 +3373,8 @@ void sk_forced_mem_schedule(struct sock *sk, int size)
 	sk_memory_allocated_add(sk, amt);

 	if (mem_cgroup_sockets_enabled && sk->sk_memcg)
-		mem_cgroup_charge_skmem(sk->sk_memcg, amt);
+		mem_cgroup_charge_skmem(sk->sk_memcg, amt,
+					gfp_memcg_charge() | __GFP_NOFAIL);
 }

 /* Send a FIN. The caller locks the socket for us.