
[net-next] tcp: use RCU in __inet{6}_check_established()

Message ID 20250301194624.1879919-1-edumazet@google.com (mailing list archive)
State New
Delegated to: Netdev Maintainers
Series: [net-next] tcp: use RCU in __inet{6}_check_established()

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 1 maintainers not CCed: dsahern@kernel.org
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch warning CHECK: spinlock_t definition without comment
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2025-03-02--15-00 (tests: 893)

Commit Message

Eric Dumazet March 1, 2025, 7:46 p.m. UTC
When __inet_hash_connect() has to try many 4-tuples before
finding an available one, we see a high spinlock cost from
__inet_check_established() and/or __inet6_check_established().

This patch adds an RCU lookup to avoid the spinlock
acquisition if the 4-tuple is found in the hash table.

Note that there are still spin_lock_bh() calls in
__inet_hash_connect() to protect inet_bind_hashbucket;
this will be fixed in a future patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/inet_hashtables.c  | 19 ++++++++++++++++---
 net/ipv6/inet6_hashtables.c | 19 ++++++++++++++++---
 2 files changed, 32 insertions(+), 6 deletions(-)
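
The change follows the classic "lockless check, then lock and re-check"
idiom. Below is a minimal userspace sketch of that idiom in plain C11,
with atomics plus a pthread mutex standing in for RCU and the ehash
spinlock; every name in it is made up for illustration and it
deliberately ignores hash collisions.

#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NSLOTS 1024

static _Atomic unsigned int table[NSLOTS];	/* 0 means "slot free" */
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Try to claim @key (a stand-in for the 4-tuple hash). */
static int claim_tuple(unsigned int key)
{
	unsigned int slot = key % NSLOTS;

	/*
	 * Lockless pre-check: if the tuple is already in use, fail fast
	 * without touching the lock. Under port exhaustion this is the
	 * overwhelmingly common outcome.
	 */
	if (atomic_load_explicit(&table[slot], memory_order_acquire) == key)
		return -EADDRNOTAVAIL;

	/*
	 * Slow path: the pre-check can race with a concurrent insert, so
	 * re-check under the lock before committing, just like the second
	 * sk_nulls_for_each() walk in the patch below.
	 */
	pthread_mutex_lock(&table_lock);
	if (table[slot] == key) {
		pthread_mutex_unlock(&table_lock);
		return -EADDRNOTAVAIL;
	}
	atomic_store_explicit(&table[slot], key, memory_order_release);
	pthread_mutex_unlock(&table_lock);
	return 0;
}

int main(void)
{
	printf("first claim:  %d\n", claim_tuple(42));	/* 0 */
	printf("second claim: %d\n", claim_tuple(42));	/* -EADDRNOTAVAIL, lock never taken */
	return 0;
}

As in the patch, the unlocked check may miss or race with a concurrent
insert, so the locked walk repeats the check before committing; a miss in
the fast path only costs a lock acquisition, never correctness. That is
also why the RCU walk in the diff simply breaks out and falls through to
the locked path when it finds a TCP_TIME_WAIT entry, leaving the timewait
handling where it was.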

Comments

Jason Xing March 2, 2025, 12:16 a.m. UTC | #1
On Sun, Mar 2, 2025 at 3:46 AM Eric Dumazet <edumazet@google.com> wrote:
>
> When __inet_hash_connect() has to try many 4-tuples before
> finding an available one, we see a high spinlock cost from
> __inet_check_established() and/or __inet6_check_established().
>
> This patch adds an RCU lookup to avoid the spinlock
> acquisition if the 4-tuple is found in the hash table.
>
> Note that there are still spin_lock_bh() calls in
> __inet_hash_connect() to protect inet_bind_hashbucket,
> this will be fixed in a future patch.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

This can add a bit of overhead in the common case, because the new
lookup only pays off when the 4-tuple is already present in the hash
table, and I'm not sure how likely that case is in practice. Since an
extra RCU lookup is cheap, it looks good to me; that overhead is the
only thing I'm slightly worried about.

As you said, it really mitigates the heavy contention in the case
mentioned earlier, where available ports are becoming scarce. We have
seen that situation cause high CPU load before. Thanks for the
optimization!

Thanks,
Jason
Eric Dumazet March 2, 2025, 7:11 a.m. UTC | #2
On Sun, Mar 2, 2025 at 1:17 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Sun, Mar 2, 2025 at 3:46 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > When __inet_hash_connect() has to try many 4-tuples before
> > finding an available one, we see a high spinlock cost from
> > __inet_check_established() and/or __inet6_check_established().
> >
> > This patch adds an RCU lookup to avoid the spinlock
> > acquisition if the 4-tuple is found in the hash table.
> >
> > Note that there are still spin_lock_bh() calls in
> > __inet_hash_connect() to protect inet_bind_hashbucket,
> > this will be fixed in a future patch.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>
> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
>
> It can introduce extra system overhead in most cases because it takes
> effect only when the socket is not unique in the hash table. I'm not
> sure what the probability of seeing this case is in reality in
> general. Considering performing a look-up seems not to consume much, I
> think it looks good to me. Well, it's the only one I'm a bit worried
> about.
>
> As you said, it truly mitigates the huge contention in the earlier
> mentioned case where the available port resources are becoming rare.
> We've encountered this situation causing high cpu load before. Thanks
> for the optimization!

The addition of bhash2 in 6.1 introduced a major regression.

This is the reason I started working on this. I will send the whole
series later today; I am seeing a ~200% increase in performance and
will provide numbers in the cover letter.

neper/tcp_crr can be used to measure the gains.

Both server/client have 240 cores, 480 hyperthreads (Intel(R) Xeon(R) 6985P-C)

Server
ulimit -n 40000; neper/tcp_crr -6 -T200 -F20000 --nolog

Client
ulimit -n 40000; neper/tcp_crr -6 -T200 -F20000 --nolog -c -H server
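
Going by neper's usual option meanings, -6 selects IPv6, -T200 runs 200
threads, -F20000 opens 20000 flows in total, -c/-H put tcp_crr in client
mode against the given server, and the ulimit bump is needed because each
flow holds a file descriptor.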

Before this first patch:

utime_start=0.210641
utime_end=1.704755
stime_start=11.842697
stime_end=1997.341498
nvcsw_start=18518
nvcsw_end=18672
nivcsw_start=26
nivcsw_end=14828
num_transactions=615906
latency_min=0.051826868
latency_max=12.015396087
latency_mean=0.642949344
latency_stddev=1.860316922
num_samples=207534
correlation_coefficient=1.00
throughput=62524.04

After this patch:

utime_start=0.185656
utime_end=2.436602
stime_start=11.470889
stime_end=1980.679087
nvcsw_start=17327
nvcsw_end=17514
nivcsw_start=48
nivcsw_end=77724
num_transactions=821025
latency_min=0.025097789
latency_max=11.581610596
latency_mean=0.475903462
latency_stddev=1.597439931
num_samples=206556
time_end=173.321207377
correlation_coefficient=1.00
throughput=84387.19
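
The throughput numbers above work out to 84387.19 / 62524.04 ≈ 1.35, i.e.
roughly a 35% gain from this single patch; the ~200% figure quoted earlier
is for the full series.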
Jason Xing March 2, 2025, 8:56 a.m. UTC | #3
On Sun, Mar 2, 2025 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Sun, Mar 2, 2025 at 1:17 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Sun, Mar 2, 2025 at 3:46 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > When __inet_hash_connect() has to try many 4-tuples before
> > > finding an available one, we see a high spinlock cost from
> > > __inet_check_established() and/or __inet6_check_established().
> > >
> > > This patch adds an RCU lookup to avoid the spinlock
> > > acquisition if the 4-tuple is found in the hash table.
> > >
> > > Note that there are still spin_lock_bh() calls in
> > > __inet_hash_connect() to protect inet_bind_hashbucket,
> > > this will be fixed in a future patch.
> > >
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> >
> > Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
> >
> > It can introduce extra system overhead in most cases because it takes
> > effect only when the socket is not unique in the hash table. I'm not
> > sure what the probability of seeing this case is in reality in
> > general. Considering performing a look-up seems not to consume much, I
> > think it looks good to me. Well, it's the only one I'm a bit worried
> > about.
> >
> > As you said, it truly mitigates the huge contention in the earlier
> > mentioned case where the available port resources are becoming rare.
> > We've encountered this situation causing high cpu load before. Thanks
> > for the optimization!
>
> Addition of bhash2 in 6.1 added a major regression.
>
> This is the reason I started to work on this stuff.
> I will send the whole series later today, but I get a ~200% increase
> in performance.
> I will provide numbers in the cover letter.
>
> neper/tcp_crr can be used to measure the gains.
>
> Both server/client have 240 cores, 480 hyperthreads (Intel(R) Xeon(R) 6985P-C)
>
> Server
> ulimit -n 40000; neper/tcp_crr -6 -T200 -F20000 --nolog
>
> Client
> ulimit -n 40000; neper/tcp_crr -6 -T200 -F20000 --nolog -c -H server
>
> Before this first patch:
>
> utime_start=0.210641
> utime_end=1.704755
> stime_start=11.842697
> stime_end=1997.341498
> nvcsw_start=18518
> nvcsw_end=18672
> nivcsw_start=26
> nivcsw_end=14828
> num_transactions=615906
> latency_min=0.051826868
> latency_max=12.015396087
> latency_mean=0.642949344
> latency_stddev=1.860316922
> num_samples=207534
> correlation_coefficient=1.00
> throughput=62524.04
>
> After this patch:
>
> utime_start=0.185656
> utime_end=2.436602
> stime_start=11.470889
> stime_end=1980.679087
> nvcsw_start=17327
> nvcsw_end=17514
> nivcsw_start=48
> nivcsw_end=77724
> num_transactions=821025
> latency_min=0.025097789
> latency_max=11.581610596
> latency_mean=0.475903462
> latency_stddev=1.597439931
> num_samples=206556
> time_end=173.321207377
> correlation_coefficient=1.00
> throughput=84387.19

Amazing! Thanks for the work and looking forward to your series then!

Thanks,
Jason
Jason Xing March 2, 2025, 9:48 a.m. UTC | #4
On Sun, Mar 2, 2025 at 3:46 AM Eric Dumazet <edumazet@google.com> wrote:
>
> When __inet_hash_connect() has to try many 4-tuples before
> finding an available one, we see a high spinlock cost from
> __inet_check_established() and/or __inet6_check_established().
>
> This patch adds an RCU lookup to avoid the spinlock
> acquisition if the 4-tuple is found in the hash table.
>
> Note that there are still spin_lock_bh() calls in
> __inet_hash_connect() to protect inet_bind_hashbucket,
> this will be fixed in a future patch.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

After Eric's reminder, I ran a similar test and saw a 7% performance
increase. During a single test run there were 1225112852 early returns
with -EADDRNOTAVAIL from the RCU section newly added in this patch,
which means we avoided taking the spinlock 1225112852 times. It's
really remarkable!

So,
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

Thank you.

Patch

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 9bfcfd016e18275fb50fea8d77adc8a64fb12494..46d39aa2199ec3a405b50e8e85130e990d2c26b7 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -551,11 +551,24 @@  static int __inet_check_established(struct inet_timewait_death_row *death_row,
 	unsigned int hash = inet_ehashfn(net, daddr, lport,
 					 saddr, inet->inet_dport);
 	struct inet_ehash_bucket *head = inet_ehash_bucket(hinfo, hash);
-	spinlock_t *lock = inet_ehash_lockp(hinfo, hash);
-	struct sock *sk2;
-	const struct hlist_nulls_node *node;
 	struct inet_timewait_sock *tw = NULL;
+	const struct hlist_nulls_node *node;
+	struct sock *sk2;
+	spinlock_t *lock;
+
+	rcu_read_lock();
+	sk_nulls_for_each(sk2, node, &head->chain) {
+		if (sk2->sk_hash != hash ||
+		    !inet_match(net, sk2, acookie, ports, dif, sdif))
+			continue;
+		if (sk2->sk_state == TCP_TIME_WAIT)
+			break;
+		rcu_read_unlock();
+		return -EADDRNOTAVAIL;
+	}
+	rcu_read_unlock();
 
+	lock = inet_ehash_lockp(hinfo, hash);
 	spin_lock(lock);
 
 	sk_nulls_for_each(sk2, node, &head->chain) {
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 9ec05e354baa69d14e88da37f5a9fce11e874e35..3604a5cae5d29a25d24f9513308334ff8e64b083 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -276,11 +276,24 @@  static int __inet6_check_established(struct inet_timewait_death_row *death_row,
 	const unsigned int hash = inet6_ehashfn(net, daddr, lport, saddr,
 						inet->inet_dport);
 	struct inet_ehash_bucket *head = inet_ehash_bucket(hinfo, hash);
-	spinlock_t *lock = inet_ehash_lockp(hinfo, hash);
-	struct sock *sk2;
-	const struct hlist_nulls_node *node;
 	struct inet_timewait_sock *tw = NULL;
+	const struct hlist_nulls_node *node;
+	struct sock *sk2;
+	spinlock_t *lock;
+
+	rcu_read_lock();
+	sk_nulls_for_each(sk2, node, &head->chain) {
+		if (sk2->sk_hash != hash ||
+		    !inet6_match(net, sk2, saddr, daddr, ports, dif, sdif))
+			continue;
+		if (sk2->sk_state == TCP_TIME_WAIT)
+			break;
+		rcu_read_unlock();
+		return -EADDRNOTAVAIL;
+	}
+	rcu_read_unlock();
 
+	lock = inet_ehash_lockp(hinfo, hash);
 	spin_lock(lock);
 
 	sk_nulls_for_each(sk2, node, &head->chain) {