Message ID | 20240412212004.17181-1-krisman@suse.de (mailing list archive) |
---|---|
State | Accepted |
Commit | 50aee97d15113b95a68848db1f0cb2a6c09f753a |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [v3] udp: Avoid call to compute_score on multiple sites | expand |
From: Gabriel Krisman Bertazi <krisman@suse.de> Date: Fri, 12 Apr 2024 17:20:04 -0400 > We've observed a 7-12% performance regression in iperf3 UDP ipv4 and > ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to > commit f0ea27e7bfe1 ("udp: re-score reuseport groups when connected > sockets are present"). The failing tests were those that would spawn > UDP sockets per-cpu on systems that have a high number of cpus. > > Unsurprisingly, it is not caused by the extra re-scoring of the reused > socket, but due to the compiler no longer inlining compute_score, once > it has the extra call site in udp4_lib_lookup2. This is augmented by > the "Safe RET" mitigation for SRSO, needed in our Zen3 cpus. > > We could just explicitly inline it, but compute_score() is quite a large > function, around 300b. Inlining in two sites would almost double > udp4_lib_lookup2, which is a silly thing to do just to workaround a > mitigation. Instead, this patch shuffles the code a bit to avoid the > multiple calls to compute_score. Since it is a static function used in > one spot, the compiler can safely fold it in, as it did before, without > increasing the text size. > > With this patch applied I ran my original iperf3 testcases. The failing > cases all looked like this (ipv4): > iperf3 -c 127.0.0.1 --udp -4 -f K -b $R -l 8920 -t 30 -i 5 -P 64 -O 2 > > where $R is either 1G/10G/0 (max, unlimited). I ran 3 times each. > baseline is v6.9-rc3. harmean == harmonic mean; CV == coefficient of > variation. > > ipv4: > 1G 10G MAX > HARMEAN (CV) HARMEAN (CV) HARMEAN (CV) > baseline 1743852.66(0.0208) 1725933.02(0.0167) 1705203.78(0.0386) > patched 1968727.61(0.0035) 1962283.22(0.0195) 1923853.50(0.0256) > > ipv6: > 1G 10G MAX > HARMEAN (CV) HARMEAN (CV) HARMEAN (CV) > baseline 1729020.03(0.0028) 1691704.49(0.0243) 1692251.34(0.0083) > patched 1900422.19(0.0067) 1900968.01(0.0067) 1568532.72(0.1519) > > This restores the performance we had before the change above with this > benchmark. We obviously don't expect any real impact when mitigations > are disabled, but just to be sure it also doesn't regresses: > > mitigations=off ipv4: > 1G 10G MAX > HARMEAN (CV) HARMEAN (CV) HARMEAN (CV) > baseline 3230279.97(0.0066) 3229320.91(0.0060) 2605693.19(0.0697) > patched 3242802.36(0.0073) 3239310.71(0.0035) 2502427.19(0.0882) > > Cc: Lorenz Bauer <lmb@isovalent.com> > Fixes: f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present") > Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Thanks! > > --- > Changes since v2: > (me) > - recollected performance data after changes below only for the > mitigations=auto case. > (suggested by Willem de Bruijn) > - Explicitly continue the loop after a rescore > - rename rescore variable to not clash with jump label > - disable rescore for new loop iteration > (suggested by Kuniyuki Iwashima) > - sort stack variables > - drop unneeded () > > Changes since v1: > (me) > - recollected performance data after changes below only for the > mitigations enabled case. > (suggested by Willem de Bruijn) > - Drop __always_inline in compute_score > - Simplify logic by replacing third struct sock pointer with bool > - Fix typo in commit message > - Don't explicitly break out of loop after rescore > --- > net/ipv4/udp.c | 21 ++++++++++++++++----- > net/ipv6/udp.c | 20 ++++++++++++++++---- > 2 files changed, 32 insertions(+), 9 deletions(-) > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > index c02bf011d4a6..4eff1e145c63 100644 > --- a/net/ipv4/udp.c > +++ b/net/ipv4/udp.c > @@ -427,15 +427,21 @@ static struct sock *udp4_lib_lookup2(struct net *net, > { > struct sock *sk, *result; > int score, badness; > + bool need_rescore; > > result = NULL; > badness = 0; > udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) { > - score = compute_score(sk, net, saddr, sport, > - daddr, hnum, dif, sdif); > + need_rescore = false; > +rescore: > + score = compute_score(need_rescore ? result : sk, net, saddr, > + sport, daddr, hnum, dif, sdif); > if (score > badness) { > badness = score; > > + if (need_rescore) > + continue; > + > if (sk->sk_state == TCP_ESTABLISHED) { > result = sk; > continue; > @@ -456,9 +462,14 @@ static struct sock *udp4_lib_lookup2(struct net *net, > if (IS_ERR(result)) > continue; > > - badness = compute_score(result, net, saddr, sport, > - daddr, hnum, dif, sdif); > - > + /* compute_score is too long of a function to be > + * inlined, and calling it again here yields > + * measureable overhead for some > + * workloads. Work around it by jumping > + * backwards to rescore 'result'. > + */ > + need_rescore = true; > + goto rescore; > } > } > return result; > diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c > index 8b1dd7f51249..e80e8b1d2000 100644 > --- a/net/ipv6/udp.c > +++ b/net/ipv6/udp.c > @@ -168,15 +168,21 @@ static struct sock *udp6_lib_lookup2(struct net *net, > { > struct sock *sk, *result; > int score, badness; > + bool need_rescore; > > result = NULL; > badness = -1; > udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) { > - score = compute_score(sk, net, saddr, sport, > - daddr, hnum, dif, sdif); > + need_rescore = false; > +rescore: > + score = compute_score(need_rescore ? result : sk, net, saddr, > + sport, daddr, hnum, dif, sdif); > if (score > badness) { > badness = score; > > + if (need_rescore) > + continue; > + > if (sk->sk_state == TCP_ESTABLISHED) { > result = sk; > continue; > @@ -197,8 +203,14 @@ static struct sock *udp6_lib_lookup2(struct net *net, > if (IS_ERR(result)) > continue; > > - badness = compute_score(sk, net, saddr, sport, > - daddr, hnum, dif, sdif); > + /* compute_score is too long of a function to be > + * inlined, and calling it again here yields > + * measureable overhead for some > + * workloads. Work around it by jumping > + * backwards to rescore 'result'. > + */ > + need_rescore = true; > + goto rescore; > } > } > return result; > -- > 2.44.0
Kuniyuki Iwashima wrote: > From: Gabriel Krisman Bertazi <krisman@suse.de> > Date: Fri, 12 Apr 2024 17:20:04 -0400 > > We've observed a 7-12% performance regression in iperf3 UDP ipv4 and > > ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to > > commit f0ea27e7bfe1 ("udp: re-score reuseport groups when connected > > sockets are present"). The failing tests were those that would spawn > > UDP sockets per-cpu on systems that have a high number of cpus. > > > > Unsurprisingly, it is not caused by the extra re-scoring of the reused > > socket, but due to the compiler no longer inlining compute_score, once > > it has the extra call site in udp4_lib_lookup2. This is augmented by > > the "Safe RET" mitigation for SRSO, needed in our Zen3 cpus. > > > > We could just explicitly inline it, but compute_score() is quite a large > > function, around 300b. Inlining in two sites would almost double > > udp4_lib_lookup2, which is a silly thing to do just to workaround a > > mitigation. Instead, this patch shuffles the code a bit to avoid the > > multiple calls to compute_score. Since it is a static function used in > > one spot, the compiler can safely fold it in, as it did before, without > > increasing the text size. > > > > With this patch applied I ran my original iperf3 testcases. The failing > > cases all looked like this (ipv4): > > iperf3 -c 127.0.0.1 --udp -4 -f K -b $R -l 8920 -t 30 -i 5 -P 64 -O 2 > > > > where $R is either 1G/10G/0 (max, unlimited). I ran 3 times each. > > baseline is v6.9-rc3. harmean == harmonic mean; CV == coefficient of > > variation. > > > > ipv4: > > 1G 10G MAX > > HARMEAN (CV) HARMEAN (CV) HARMEAN (CV) > > baseline 1743852.66(0.0208) 1725933.02(0.0167) 1705203.78(0.0386) > > patched 1968727.61(0.0035) 1962283.22(0.0195) 1923853.50(0.0256) > > > > ipv6: > > 1G 10G MAX > > HARMEAN (CV) HARMEAN (CV) HARMEAN (CV) > > baseline 1729020.03(0.0028) 1691704.49(0.0243) 1692251.34(0.0083) > > patched 1900422.19(0.0067) 1900968.01(0.0067) 1568532.72(0.1519) > > > > This restores the performance we had before the change above with this > > benchmark. We obviously don't expect any real impact when mitigations > > are disabled, but just to be sure it also doesn't regresses: > > > > mitigations=off ipv4: > > 1G 10G MAX > > HARMEAN (CV) HARMEAN (CV) HARMEAN (CV) > > baseline 3230279.97(0.0066) 3229320.91(0.0060) 2605693.19(0.0697) > > patched 3242802.36(0.0073) 3239310.71(0.0035) 2502427.19(0.0882) > > > > Cc: Lorenz Bauer <lmb@isovalent.com> > > Fixes: f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present") > > Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> > > Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com>
Hello: This patch was applied to netdev/net-next.git (main) by David S. Miller <davem@davemloft.net>: On Fri, 12 Apr 2024 17:20:04 -0400 you wrote: > We've observed a 7-12% performance regression in iperf3 UDP ipv4 and > ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to > commit f0ea27e7bfe1 ("udp: re-score reuseport groups when connected > sockets are present"). The failing tests were those that would spawn > UDP sockets per-cpu on systems that have a high number of cpus. > > Unsurprisingly, it is not caused by the extra re-scoring of the reused > socket, but due to the compiler no longer inlining compute_score, once > it has the extra call site in udp4_lib_lookup2. This is augmented by > the "Safe RET" mitigation for SRSO, needed in our Zen3 cpus. > > [...] Here is the summary with links: - [v3] udp: Avoid call to compute_score on multiple sites https://git.kernel.org/netdev/net-next/c/50aee97d1511 You are awesome, thank you!
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index c02bf011d4a6..4eff1e145c63 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -427,15 +427,21 @@ static struct sock *udp4_lib_lookup2(struct net *net, { struct sock *sk, *result; int score, badness; + bool need_rescore; result = NULL; badness = 0; udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) { - score = compute_score(sk, net, saddr, sport, - daddr, hnum, dif, sdif); + need_rescore = false; +rescore: + score = compute_score(need_rescore ? result : sk, net, saddr, + sport, daddr, hnum, dif, sdif); if (score > badness) { badness = score; + if (need_rescore) + continue; + if (sk->sk_state == TCP_ESTABLISHED) { result = sk; continue; @@ -456,9 +462,14 @@ static struct sock *udp4_lib_lookup2(struct net *net, if (IS_ERR(result)) continue; - badness = compute_score(result, net, saddr, sport, - daddr, hnum, dif, sdif); - + /* compute_score is too long of a function to be + * inlined, and calling it again here yields + * measureable overhead for some + * workloads. Work around it by jumping + * backwards to rescore 'result'. + */ + need_rescore = true; + goto rescore; } } return result; diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index 8b1dd7f51249..e80e8b1d2000 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -168,15 +168,21 @@ static struct sock *udp6_lib_lookup2(struct net *net, { struct sock *sk, *result; int score, badness; + bool need_rescore; result = NULL; badness = -1; udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) { - score = compute_score(sk, net, saddr, sport, - daddr, hnum, dif, sdif); + need_rescore = false; +rescore: + score = compute_score(need_rescore ? result : sk, net, saddr, + sport, daddr, hnum, dif, sdif); if (score > badness) { badness = score; + if (need_rescore) + continue; + if (sk->sk_state == TCP_ESTABLISHED) { result = sk; continue; @@ -197,8 +203,14 @@ static struct sock *udp6_lib_lookup2(struct net *net, if (IS_ERR(result)) continue; - badness = compute_score(sk, net, saddr, sport, - daddr, hnum, dif, sdif); + /* compute_score is too long of a function to be + * inlined, and calling it again here yields + * measureable overhead for some + * workloads. Work around it by jumping + * backwards to rescore 'result'. + */ + need_rescore = true; + goto rescore; } } return result;
We've observed a 7-12% performance regression in iperf3 UDP ipv4 and ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to commit f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present"). The failing tests were those that would spawn UDP sockets per-cpu on systems that have a high number of cpus. Unsurprisingly, it is not caused by the extra re-scoring of the reused socket, but due to the compiler no longer inlining compute_score, once it has the extra call site in udp4_lib_lookup2. This is augmented by the "Safe RET" mitigation for SRSO, needed in our Zen3 cpus. We could just explicitly inline it, but compute_score() is quite a large function, around 300b. Inlining in two sites would almost double udp4_lib_lookup2, which is a silly thing to do just to workaround a mitigation. Instead, this patch shuffles the code a bit to avoid the multiple calls to compute_score. Since it is a static function used in one spot, the compiler can safely fold it in, as it did before, without increasing the text size. With this patch applied I ran my original iperf3 testcases. The failing cases all looked like this (ipv4): iperf3 -c 127.0.0.1 --udp -4 -f K -b $R -l 8920 -t 30 -i 5 -P 64 -O 2 where $R is either 1G/10G/0 (max, unlimited). I ran 3 times each. baseline is v6.9-rc3. harmean == harmonic mean; CV == coefficient of variation. ipv4: 1G 10G MAX HARMEAN (CV) HARMEAN (CV) HARMEAN (CV) baseline 1743852.66(0.0208) 1725933.02(0.0167) 1705203.78(0.0386) patched 1968727.61(0.0035) 1962283.22(0.0195) 1923853.50(0.0256) ipv6: 1G 10G MAX HARMEAN (CV) HARMEAN (CV) HARMEAN (CV) baseline 1729020.03(0.0028) 1691704.49(0.0243) 1692251.34(0.0083) patched 1900422.19(0.0067) 1900968.01(0.0067) 1568532.72(0.1519) This restores the performance we had before the change above with this benchmark. We obviously don't expect any real impact when mitigations are disabled, but just to be sure it also doesn't regresses: mitigations=off ipv4: 1G 10G MAX HARMEAN (CV) HARMEAN (CV) HARMEAN (CV) baseline 3230279.97(0.0066) 3229320.91(0.0060) 2605693.19(0.0697) patched 3242802.36(0.0073) 3239310.71(0.0035) 2502427.19(0.0882) Cc: Lorenz Bauer <lmb@isovalent.com> Fixes: f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> --- Changes since v2: (me) - recollected performance data after changes below only for the mitigations=auto case. (suggested by Willem de Bruijn) - Explicitly continue the loop after a rescore - rename rescore variable to not clash with jump label - disable rescore for new loop iteration (suggested by Kuniyuki Iwashima) - sort stack variables - drop unneeded () Changes since v1: (me) - recollected performance data after changes below only for the mitigations enabled case. (suggested by Willem de Bruijn) - Drop __always_inline in compute_score - Simplify logic by replacing third struct sock pointer with bool - Fix typo in commit message - Don't explicitly break out of loop after rescore --- net/ipv4/udp.c | 21 ++++++++++++++++----- net/ipv6/udp.c | 20 ++++++++++++++++---- 2 files changed, 32 insertions(+), 9 deletions(-)