[RFC,net-next] sock: Propose socket.urgent for sockmem isolation

Message ID 20230609082712.34889-1-wuyun.abel@bytedance.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Series [RFC,net-next] sock: Propose socket.urgent for sockmem isolation

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 8350 this patch: 8350
netdev/cc_maintainers success CCed 16 of 16 maintainers
netdev/build_clang success Errors and warnings before: 2254 this patch: 2254
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 9032 this patch: 9032
netdev/checkpatch warning WARNING: line length of 81 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Abel Wu June 9, 2023, 8:27 a.m. UTC
This is just a PoC patch intended to resume the discussion about
tcpmem isolation opened by Google in LPC'22 [1].

We are facing the same problem: the global shared threshold can
cause isolation issues. Low priority jobs can hog TCP memory and
adversely impact higher priority jobs. What's worse, these low
priority jobs usually have smaller cpu weights, leading to a poor
ability to consume rx data.

To tackle this problem, an interface for the non-root cgroup memory
controller named 'socket.urgent' is proposed. It determines whether
the sockets of this cgroup and its descendants can escape from the
constraints under global socket memory pressure.
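
As a usage sketch (the paths here are hypothetical and depend on where
cgroup2 is mounted; with this patch applied the knob should appear as
'memory.socket.urgent' in non-root cgroups), a high priority job could
be marked urgent from userspace roughly like this:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical cgroup of a high priority job. */
	const char *knob = "/sys/fs/cgroup/hipri/memory.socket.urgent";
	int fd = open(knob, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* "1" lets sockets of this cgroup and its descendants escape
	 * the global constraints; "0" keeps the current behavior.
	 */
	if (write(fd, "1", 1) != 1)
		perror("write");

	close(fd);
	return 0;
}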

The 'urgent' semantics do not take effect under memcg pressure, in
order to protect against worse memstalls; in that case behavior is
the same as before this patch.
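
In rough terms, the per-socket decision wired into the pressure checks
is sketched below (simplified from the diff;
mem_cgroup_under_socket_pressure() is the modified variant that also
reports whether any ancestor set socket.urgent):

/* Simplified sketch of the escape logic; see the diff for details. */
static bool escape_global_pressure(struct sock *sk)
{
	bool urgent = false;

	if (!mem_cgroup_sockets_enabled || !sk->sk_memcg)
		return false;

	/* Under net-memcg pressure: never escape, same as before. */
	if (mem_cgroup_under_socket_pressure(sk->sk_memcg, &urgent))
		return false;

	/* No memcg pressure, and some ancestor enabled socket.urgent:
	 * ignore the global tcp_mem pressure/limit for this socket.
	 */
	return urgent;
}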

This proposal doesn't remove the protocol's threshold, as we found
it useful in restraining memory defragment. As aforementioned, the
low priority jobs can hog lots of memory, which is unreclaimable
and unmovable, for some time due to their small cpu weight.

So in practice we allow high priority jobs with net-memcg accounting
enabled to escape the global constraints if the net-memcg itself is
not under pressure. For lower priority jobs, the budget will be
tightened as the memory usage of 'urgent' jobs increases (a worked
example follows the list below). In this way we can finally achieve:

  - Important jobs won't be priority-inverted by the background
    jobs in terms of socket memory pressure/limit.

  - Global constraints are still effective, but only on non-urgent
    jobs, which is useful for admins making policy decisions on
    defrag.
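
For a concrete (purely illustrative) feel of the second point: with
tcp_mem[2] at 1M pages and urgent jobs currently holding 600K pages,
sk_memory_allocated() already reports 600K, so non-urgent sockets
start hitting the hard limit after roughly 400K more pages of their
own. Urgent allocations still count against the global counter; they
just stop being throttled by it.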

Comments/Ideas are welcomed, thanks!

[1] https://lpc.events/event/16/contributions/1212/

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 include/linux/memcontrol.h | 15 +++++++++++++--
 include/net/sock.h         | 11 ++++++++---
 include/net/tcp.h          | 26 ++++++++++++++++++++------
 mm/memcontrol.c            | 35 +++++++++++++++++++++++++++++++++++
 net/core/sock.c            | 22 ++++++++++++++++++----
 net/ipv4/tcp_input.c       | 10 +++++++---
 6 files changed, 101 insertions(+), 18 deletions(-)

Comments

Eric Dumazet June 9, 2023, 9:07 a.m. UTC | #1
On Fri, Jun 9, 2023 at 10:28 AM Abel Wu <wuyun.abel@bytedance.com> wrote:
>
> This is just a PoC patch intended to resume the discussion about
> tcpmem isolation opened by Google in LPC'22 [1].
>
> We are facing the same problem: the global shared threshold can
> cause isolation issues. Low priority jobs can hog TCP memory and
> adversely impact higher priority jobs. What's worse, these low
> priority jobs usually have smaller cpu weights, leading to a poor
> ability to consume rx data.
>
> To tackle this problem, an interface for the non-root cgroup memory
> controller named 'socket.urgent' is proposed. It determines whether
> the sockets of this cgroup and its descendants can escape from the
> constraints under global socket memory pressure.
>
> The 'urgent' semantics do not take effect under memcg pressure, in
> order to protect against worse memstalls; in that case behavior is
> the same as before this patch.
>
> This proposal doesn't remove the protocol's threshold, as we found
> it useful in restraining memory defragment. As aforementioned, the
> low priority jobs can hog lots of memory, which is unreclaimable
> and unmovable, for some time due to their small cpu weight.
>
> So in practice we allow high priority jobs with net-memcg accounting
> enabled to escape the global constraints if the net-memcg itself is
> not under pressure. For lower priority jobs, the budget will be
> tightened as the memory usage of 'urgent' jobs increases. In this
> way we can finally achieve:
>
>   - Important jobs won't be priority-inverted by the background
>     jobs in terms of socket memory pressure/limit.
>
>   - Global constraints are still effective, but only on non-urgent
>     jobs, which is useful for admins making policy decisions on
>     defrag.
>
> Comments/Ideas are welcomed, thanks!
>

This seems to go in the complete opposite direction from what memcg promises.

Can we fix memcg, so that:

Each group can use the memory it was provisioned (this includes TCP buffers)

Global tcp_memory can disappear (set tcp_mem to infinity)
Shakeel Butt June 9, 2023, 5:53 p.m. UTC | #2
On Fri, Jun 9, 2023 at 2:07 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Jun 9, 2023 at 10:28 AM Abel Wu <wuyun.abel@bytedance.com> wrote:
> >
> > This is just a PoC patch intended to resume the discussion about
> > tcpmem isolation opened by Google in LPC'22 [1].
> >
> > We are facing the same problem: the global shared threshold can
> > cause isolation issues. Low priority jobs can hog TCP memory and
> > adversely impact higher priority jobs. What's worse, these low
> > priority jobs usually have smaller cpu weights, leading to a poor
> > ability to consume rx data.
> >
> > To tackle this problem, an interface for the non-root cgroup memory
> > controller named 'socket.urgent' is proposed. It determines whether
> > the sockets of this cgroup and its descendants can escape from the
> > constraints under global socket memory pressure.
> >
> > The 'urgent' semantics do not take effect under memcg pressure, in
> > order to protect against worse memstalls; in that case behavior is
> > the same as before this patch.
> >
> > This proposal doesn't remove the protocol's threshold, as we found
> > it useful in restraining memory defragment. As aforementioned, the
> > low priority jobs can hog lots of memory, which is unreclaimable
> > and unmovable, for some time due to their small cpu weight.
> >
> > So in practice we allow high priority jobs with net-memcg accounting
> > enabled to escape the global constraints if the net-memcg itself is
> > not under pressure. For lower priority jobs, the budget will be
> > tightened as the memory usage of 'urgent' jobs increases. In this
> > way we can finally achieve:
> >
> >   - Important jobs won't be priority-inverted by the background
> >     jobs in terms of socket memory pressure/limit.
> >
> >   - Global constraints are still effective, but only on non-urgent
> >     jobs, which is useful for admins making policy decisions on
> >     defrag.
> >
> > Comments/Ideas are welcomed, thanks!
> >
>
> This seems to go in the complete opposite direction from what memcg promises.
>
> Can we fix memcg, so that:
>
> Each group can use the memory it was provisioned (this includes TCP buffers)
>
> Global tcp_memory can disappear (set tcp_mem to infinity)

I agree with Eric and this is exactly how we at Google overcome the
isolation issue. We have set tcp_mem to unlimited and enabled memcg
accounting of network memory (by surgically incorporating v2 semantics
of network memory accounting in our v1 environment).

I do have one question though:

> This proposal doesn't remove the protocol's threshold, as we found
> it useful in restraining memory defragment.

Can you explain how you find the global tcp limit useful? What does
memory defragment mean?
Abel Wu June 13, 2023, 6:46 a.m. UTC | #3
On 6/9/23 5:07 PM, Eric Dumazet wrote:
> On Fri, Jun 9, 2023 at 10:28 AM Abel Wu <wuyun.abel@bytedance.com> wrote:
>>
>> This is just a PoC patch intended to resume the discussion about
>> tcpmem isolation opened by Google in LPC'22 [1].
>>
>> We are facing the same problem: the global shared threshold can
>> cause isolation issues. Low priority jobs can hog TCP memory and
>> adversely impact higher priority jobs. What's worse, these low
>> priority jobs usually have smaller cpu weights, leading to a poor
>> ability to consume rx data.
>>
>> To tackle this problem, an interface for the non-root cgroup memory
>> controller named 'socket.urgent' is proposed. It determines whether
>> the sockets of this cgroup and its descendants can escape from the
>> constraints under global socket memory pressure.
>>
>> The 'urgent' semantics do not take effect under memcg pressure, in
>> order to protect against worse memstalls; in that case behavior is
>> the same as before this patch.
>>
>> This proposal doesn't remove the protocol's threshold, as we found
>> it useful in restraining memory defragment. As aforementioned, the
>> low priority jobs can hog lots of memory, which is unreclaimable
>> and unmovable, for some time due to their small cpu weight.
>>
>> So in practice we allow high priority jobs with net-memcg accounting
>> enabled to escape the global constraints if the net-memcg itself is
>> not under pressure. For lower priority jobs, the budget will be
>> tightened as the memory usage of 'urgent' jobs increases. In this
>> way we can finally achieve:
>>
>>    - Important jobs won't be priority-inverted by the background
>>      jobs in terms of socket memory pressure/limit.
>>
>>    - Global constraints are still effective, but only on non-urgent
>>      jobs, which is useful for admins making policy decisions on
>>      defrag.
>>
>> Comments/Ideas are welcomed, thanks!
>>
> 
> This seems to go in the complete opposite direction from what memcg promises.
>
> Can we fix memcg, so that:
> 
> Each group can use the memory it was provisioned (this includes TCP buffers)

Yes, but it might not be easy once memory gets over-committed (which
is common in modern data-centers). So as a tradeoff, we intend to put
a harder constraint on memory allocation for low priority jobs.
Otherwise, if every job can use its provisioned memory, then there
will be more memstalls blocking random jobs, which could be the
important ones. Either way hurts performance; the difference is whose
performance gets hurt.

Memory protection (memory.{min,low}) helps keep the important jobs
less affected by memstalls. But once low priority jobs use lots of
kernel memory like sockmem, the protection might become much less
effective.

> 
> Global tcp_memory can disappear (set tcp_mem to infinity)
Abel Wu June 13, 2023, 6:46 a.m. UTC | #4
On 6/10/23 1:53 AM, Shakeel Butt wrote:
> On Fri, Jun 9, 2023 at 2:07 PM Eric Dumazet <edumazet@google.com> wrote:
>>
>> On Fri, Jun 9, 2023 at 10:28 AM Abel Wu <wuyun.abel@bytedance.com> wrote:
>>>
>>> This is just a PoC patch intended to resume the discussion about
>>> tcpmem isolation opened by Google in LPC'22 [1].
>>>
>>> We are facing the same problem: the global shared threshold can
>>> cause isolation issues. Low priority jobs can hog TCP memory and
>>> adversely impact higher priority jobs. What's worse, these low
>>> priority jobs usually have smaller cpu weights, leading to a poor
>>> ability to consume rx data.
>>>
>>> To tackle this problem, an interface for the non-root cgroup memory
>>> controller named 'socket.urgent' is proposed. It determines whether
>>> the sockets of this cgroup and its descendants can escape from the
>>> constraints under global socket memory pressure.
>>>
>>> The 'urgent' semantics do not take effect under memcg pressure, in
>>> order to protect against worse memstalls; in that case behavior is
>>> the same as before this patch.
>>>
>>> This proposal doesn't remove the protocol's threshold, as we found
>>> it useful in restraining memory defragment. As aforementioned, the
>>> low priority jobs can hog lots of memory, which is unreclaimable
>>> and unmovable, for some time due to their small cpu weight.
>>>
>>> So in practice we allow high priority jobs with net-memcg accounting
>>> enabled to escape the global constraints if the net-memcg itself is
>>> not under pressure. For lower priority jobs, the budget will be
>>> tightened as the memory usage of 'urgent' jobs increases. In this
>>> way we can finally achieve:
>>>
>>>    - Important jobs won't be priority-inverted by the background
>>>      jobs in terms of socket memory pressure/limit.
>>>
>>>    - Global constraints are still effective, but only on non-urgent
>>>      jobs, which is useful for admins making policy decisions on
>>>      defrag.
>>>
>>> Comments/Ideas are welcomed, thanks!
>>>
>>
>> This seems to go in the complete opposite direction from what memcg promises.
>>
>> Can we fix memcg, so that:
>>
>> Each group can use the memory it was provisioned (this includes TCP buffers)
>>
>> Global tcp_memory can disappear (set tcp_mem to infinity)
> 
> I agree with Eric and this is exactly how we at Google overcome the
> isolation issue. We have set tcp_mem to unlimited and enabled memcg
> accounting of network memory (by surgically incorporating v2 semantics
> of network memory accounting in our v1 environment).
> 
> I do have one question though:
> 
>> This proposal doesn't remove the protocol's threshold, as we found
>> it useful in restraining memory defragment.
> 
> Can you explain how you find the global tcp limit useful? What does
> memory defragment mean?

We co-locate different kinds of jobs with different priorities in
cgroups, among which there are some background jobs that can have lots
of net data to process, e.g. training jobs. These background jobs
usually don't have enough cpu bandwidth to consume the rx data in time
if more important jobs are running simultaneously. The data can
accumulate and eat up some or all of the provisioned memory. This
unreclaimable memory can gradually fragment the whole of memory. We
have already seen many such cases in our production environment.

Maybe it's not proper to use the word 'defragment', as what we do is
try to prevent fragmentation rather than defragment (the way
compaction does). With global tcp_mem pressure/limit and
socket.urgent, we are able to achieve this goal, at least to some
extent.

And it is not only the global tcp limit: the pressure threshold can
also cause something like priority inversion. We monitored the top 20
priority jobs and found their performance reduced by 2~9% under global
tcp memory pressure (sometimes with the majority of
sk_memory_allocated() contributed by the low priority jobs), although
this has nothing to do with 'memory defrag'.
Abel Wu June 16, 2023, 7:27 a.m. UTC | #5
Gentle ping :)

Any suggestions for the memory over-committed scenario?

Thanks,
	Abel

On 6/13/23 2:46 PM, Abel Wu wrote:
> On 6/9/23 5:07 PM, Eric Dumazet wrote:
>> On Fri, Jun 9, 2023 at 10:28 AM Abel Wu <wuyun.abel@bytedance.com> wrote:
>>>
>>> This is just a PoC patch intended to resume the discussion about
>>> tcpmem isolation opened by Google in LPC'22 [1].
>>>
>>> We are facing the same problem: the global shared threshold can
>>> cause isolation issues. Low priority jobs can hog TCP memory and
>>> adversely impact higher priority jobs. What's worse, these low
>>> priority jobs usually have smaller cpu weights, leading to a poor
>>> ability to consume rx data.
>>>
>>> To tackle this problem, an interface for the non-root cgroup memory
>>> controller named 'socket.urgent' is proposed. It determines whether
>>> the sockets of this cgroup and its descendants can escape from the
>>> constraints under global socket memory pressure.
>>>
>>> The 'urgent' semantics do not take effect under memcg pressure, in
>>> order to protect against worse memstalls; in that case behavior is
>>> the same as before this patch.
>>>
>>> This proposal doesn't remove the protocol's threshold, as we found
>>> it useful in restraining memory defragment. As aforementioned, the
>>> low priority jobs can hog lots of memory, which is unreclaimable
>>> and unmovable, for some time due to their small cpu weight.
>>>
>>> So in practice we allow high priority jobs with net-memcg accounting
>>> enabled to escape the global constraints if the net-memcg itself is
>>> not under pressure. For lower priority jobs, the budget will be
>>> tightened as the memory usage of 'urgent' jobs increases. In this
>>> way we can finally achieve:
>>>
>>>    - Important jobs won't be priority-inverted by the background
>>>      jobs in terms of socket memory pressure/limit.
>>>
>>>    - Global constraints are still effective, but only on non-urgent
>>>      jobs, which is useful for admins making policy decisions on
>>>      defrag.
>>>
>>> Comments/Ideas are welcomed, thanks!
>>>
>>
>> This seems to go in the complete opposite direction from what memcg promises.
>>
>> Can we fix memcg, so that:
>>
>> Each group can use the memory it was provisioned (this includes TCP 
>> buffers)
> 
> Yes, but it might not be easy once memory gets over-committed (which
> is common in modern data-centers). So as a tradeoff, we intend to put
> a harder constraint on memory allocation for low priority jobs.
> Otherwise, if every job can use its provisioned memory, then there
> will be more memstalls blocking random jobs, which could be the
> important ones. Either way hurts performance; the difference is whose
> performance gets hurt.
> 
> Memory protection (memory.{min,low}) helps keep the important jobs
> less affected by memstalls. But once low priority jobs use lots of
> kernel memory like sockmem, the protection might become much less
> effective.
> 
>>
>> Global tcp_memory can disappear (set tcp_mem to infinity)
Michal Koutný June 19, 2023, 5:30 p.m. UTC | #6
On Tue, Jun 13, 2023 at 02:46:32PM +0800, Abel Wu <wuyun.abel@bytedance.com> wrote:
> Memory protection (memory.{min,low}) helps keep the important jobs
> less affected by memstalls. But once low priority jobs use lots of
> kernel memory like sockmem, the protection might become much less
> effective.

What would happen if you applied memory.{min,low} to the important jobs
and memory.{max,high} to the low prio ones?

Thanks,
Michal
Abel Wu June 20, 2023, 6:39 a.m. UTC | #7
Hi Michal,

On 6/20/23 1:30 AM, Michal Koutný wrote:
> On Tue, Jun 13, 2023 at 02:46:32PM +0800, Abel Wu <wuyun.abel@bytedance.com> wrote:
>> Memory protection (memory.{min,low}) helps keep the important jobs
>> less affected by memstalls. But once low priority jobs use lots of
>> kernel memory like sockmem, the protection might become much less
>> effective.
> 
> What would happen if you applied memory.{min,low} to the important jobs
> and memory.{max,high} to the low prio ones?

I would expect the memory of the low prio jobs to get reclaimed first.
Specifically, we set memory.low to protect the working-set of important
jobs. Due to the best-effort behavior of 'low', the important jobs can
still be affected if not enough memory is reclaimed from the low prio
ones. And we don't use 'min' (yet?) because of the need for flexibility
when memory is tight.
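
For reference, the kind of layout discussed in this thread, combined
with the proposed knob, might look like this (names and values are
purely illustrative):

  /sys/fs/cgroup/
    hipri/    memory.low = <working-set>  memory.socket.urgent = 1
    lowpri/   memory.max = <cap>          memory.socket.urgent = 0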

Best Regards,
	Abel

Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 222d7370134c..f8c1c108aa28 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -284,7 +284,13 @@  struct mem_cgroup {
 	atomic_long_t		memory_events[MEMCG_NR_MEMORY_EVENTS];
 	atomic_long_t		memory_events_local[MEMCG_NR_MEMORY_EVENTS];
 
+	/*
+	 * Urgent sockets can escape from the constraints under global memory
+	 * pressure/limit iff !socket_pressure. So these two variables are
+	 * always used together; make sure they are in the same cacheline.
+	 */
 	unsigned long		socket_pressure;
+	bool			socket_urgent;
 
 	/* Legacy tcp memory accounting */
 	bool			tcpmem_active;
@@ -1741,13 +1747,17 @@  extern struct static_key_false memcg_sockets_enabled_key;
 #define mem_cgroup_sockets_enabled static_branch_unlikely(&memcg_sockets_enabled_key)
 void mem_cgroup_sk_alloc(struct sock *sk);
 void mem_cgroup_sk_free(struct sock *sk);
-static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
+
+static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg,
+						    bool *is_urgent)
 {
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg->tcpmem_pressure)
 		return true;
 	do {
 		if (time_before(jiffies, READ_ONCE(memcg->socket_pressure)))
 			return true;
+		if (is_urgent && !*is_urgent && READ_ONCE(memcg->socket_urgent))
+			*is_urgent = true;
 	} while ((memcg = parent_mem_cgroup(memcg)));
 	return false;
 }
@@ -1760,7 +1770,8 @@  void reparent_shrinker_deferred(struct mem_cgroup *memcg);
 #define mem_cgroup_sockets_enabled 0
 static inline void mem_cgroup_sk_alloc(struct sock *sk) { };
 static inline void mem_cgroup_sk_free(struct sock *sk) { };
-static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg,
+						    bool *is_urgent)
 {
 	return false;
 }
diff --git a/include/net/sock.h b/include/net/sock.h
index 656ea89f60ff..80e1240ffc35 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1414,9 +1414,14 @@  static inline bool sk_under_memory_pressure(const struct sock *sk)
 	if (!sk->sk_prot->memory_pressure)
 		return false;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
-	    mem_cgroup_under_socket_pressure(sk->sk_memcg))
-		return true;
+	if (mem_cgroup_sockets_enabled && sk->sk_memcg) {
+		bool urgent = false;
+
+		if (mem_cgroup_under_socket_pressure(sk->sk_memcg, &urgent))
+			return true;
+		if (urgent)
+			return false;
+	}
 
 	return !!*sk->sk_prot->memory_pressure;
 }
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 14fa716cac50..9fa8d8fcb992 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -259,9 +259,14 @@  extern unsigned long tcp_memory_pressure;
 /* optimized version of sk_under_memory_pressure() for TCP sockets */
 static inline bool tcp_under_memory_pressure(const struct sock *sk)
 {
-	if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
-	    mem_cgroup_under_socket_pressure(sk->sk_memcg))
-		return true;
+	if (mem_cgroup_sockets_enabled && sk->sk_memcg) {
+		bool urgent = false;
+
+		if (mem_cgroup_under_socket_pressure(sk->sk_memcg, &urgent))
+			return true;
+		if (urgent)
+			return false;
+	}
 
 	return READ_ONCE(tcp_memory_pressure);
 }
@@ -284,9 +289,18 @@  static inline bool between(__u32 seq1, __u32 seq2, __u32 seq3)
 
 static inline bool tcp_out_of_memory(struct sock *sk)
 {
-	if (sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
-	    sk_memory_allocated(sk) > sk_prot_mem_limits(sk, 2))
-		return true;
+	if (sk->sk_wmem_queued > SOCK_MIN_SNDBUF) {
+		bool urgent = false;
+
+		if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
+		    !mem_cgroup_under_socket_pressure(sk->sk_memcg, &urgent) &&
+		    urgent)
+			return false;
+
+		if (sk_memory_allocated(sk) > sk_prot_mem_limits(sk, 2))
+			return true;
+	}
+
 	return false;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4b27e245a055..d620c4d9b2cc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6753,6 +6753,35 @@  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 	return nbytes;
 }
 
+static int memory_sock_urgent_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	seq_printf(m, "%d\n", READ_ONCE(memcg->socket_urgent));
+
+	return 0;
+}
+
+static ssize_t memory_sock_urgent_write(struct kernfs_open_file *of,
+					char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	bool socket_urgent;
+	int ret;
+
+	buf = strstrip(buf);
+	if (!buf)
+		return -EINVAL;
+
+	ret = kstrtobool(buf, &socket_urgent);
+	if (ret)
+		return ret;
+
+	WRITE_ONCE(memcg->socket_urgent, socket_urgent);
+
+	return nbytes;
+}
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -6821,6 +6850,12 @@  static struct cftype memory_files[] = {
 		.flags = CFTYPE_NS_DELEGATABLE,
 		.write = memory_reclaim,
 	},
+	{
+		.name = "socket.urgent",
+		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+		.seq_show = memory_sock_urgent_show,
+		.write = memory_sock_urgent_write,
+	},
 	{ }	/* terminate */
 };
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 5440e67bcfe3..29d2b03595cf 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2982,10 +2982,24 @@  int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 
 	sk_memory_allocated_add(sk, amt);
 	allocated = sk_memory_allocated(sk);
-	if (memcg_charge &&
-	    !(charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
-						gfp_memcg_charge())))
-		goto suppress_allocation;
+
+	if (memcg_charge) {
+		bool urgent = false;
+
+		charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
+						  gfp_memcg_charge());
+		if (!charged)
+			goto suppress_allocation;
+
+		if (!mem_cgroup_under_socket_pressure(sk->sk_memcg, &urgent)) {
+			/* Urgent sockets by design escape from the constraints
+			 * under global memory pressure/limit iff there is no
+			 * pressure in the net-memcg to avoid priority inversion.
+			 */
+			if (urgent)
+				return 1;
+		}
+	}
 
 	/* Under limit. */
 	if (allocated <= sk_prot_mem_limits(sk, 0)) {
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 61b6710f337a..7d5d4b4e17b4 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5439,6 +5439,7 @@  static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb)
 static bool tcp_should_expand_sndbuf(struct sock *sk)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
+	bool under_pressure, urgent = false;
 
 	/* If the user specified a specific send buffer setting, do
 	 * not modify it.
@@ -5446,8 +5447,11 @@  static bool tcp_should_expand_sndbuf(struct sock *sk)
 	if (sk->sk_userlocks & SOCK_SNDBUF_LOCK)
 		return false;
 
-	/* If we are under global TCP memory pressure, do not expand.  */
-	if (tcp_under_memory_pressure(sk)) {
+	under_pressure = mem_cgroup_sockets_enabled && sk->sk_memcg &&
+			 mem_cgroup_under_socket_pressure(sk->sk_memcg, &urgent);
+
+	/* If we are under net-memcg/TCP memory pressure, do not expand.  */
+	if (under_pressure || (!urgent && READ_ONCE(tcp_memory_pressure))) {
 		int unused_mem = sk_unused_reserved_mem(sk);
 
 		/* Adjust sndbuf according to reserved mem. But make sure
@@ -5461,7 +5465,7 @@  static bool tcp_should_expand_sndbuf(struct sock *sk)
 	}
 
 	/* If we are under soft global TCP memory pressure, do not expand.  */
-	if (sk_memory_allocated(sk) >= sk_prot_mem_limits(sk, 0))
+	if (!urgent && sk_memory_allocated(sk) >= sk_prot_mem_limits(sk, 0))
 		return false;
 
 	/* If we filled the congestion window, do not expand.  */