Message ID | 20220826000445.46552-9-kuniyu@amazon.com (mailing list archive) |
---|---|
State | New, archived |
Series | tcp/udp: Introduce optional per-netns hash table. |
On Thu, Aug 25, 2022 at 5:07 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> The more sockets we have in the hash table, the more time we spend
> looking up the socket. While running a number of small workloads on
> the same host, they penalise each other and cause performance degradation.
>
> Also, the root cause might be a single workload that consumes much more
> resources than the others. It often happens on a cloud service where
> different workloads share the same computing resource.
>
> To resolve the issue, we introduce an optional per-netns hash table for
> TCP, but it's just ehash, and we still share the global bhash and lhash2.
>
> With a smaller ehash, we can look up non-listener sockets faster and
> isolate such noisy neighbours. Also, we can reduce lock contention.
>
> We can control the ehash size by a new sysctl knob. However, depending
> on workloads, it will require very sensitive tuning, so we disable the
> feature by default (net.ipv4.tcp_child_ehash_entries == 0). Moreover,
> we can fall back to using the global ehash in case we fail to allocate
> enough memory for a new ehash.
>
> We can check the current ehash size by another read-only sysctl knob,
> net.ipv4.tcp_ehash_entries. A negative value means the netns shares
> the global ehash (per-netns ehash is disabled or failed to allocate
> memory).
>
>   # dmesg | cut -d ' ' -f 5- | grep "established hash"
>   TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)
>
>   # sysctl net.ipv4.tcp_ehash_entries
>   net.ipv4.tcp_ehash_entries = 524288 # can be changed by thash_entries
>
>   # sysctl net.ipv4.tcp_child_ehash_entries
>   net.ipv4.tcp_child_ehash_entries = 0 # disabled by default
>
>   # ip netns add test1
>   # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
>   net.ipv4.tcp_ehash_entries = -524288 # share the global ehash
>
>   # sysctl -w net.ipv4.tcp_child_ehash_entries=100
>   net.ipv4.tcp_child_ehash_entries = 100
>
>   # sysctl net.ipv4.tcp_child_ehash_entries
>   net.ipv4.tcp_child_ehash_entries = 128 # rounded up to 2^n
>
>   # ip netns add test2
>   # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
>   net.ipv4.tcp_ehash_entries = 128 # own per-netns ehash
>
> When more than two processes in the same netns create per-netns ehash
> concurrently with different sizes, we need to guarantee the size in
> one of the following ways:
>
> 1) Share the global ehash and create per-netns ehash
>
> First, unshare() with tcp_child_ehash_entries==0. It creates dedicated
> netns sysctl knobs where we can safely change tcp_child_ehash_entries
> and clone()/unshare() to create a per-netns ehash.
>
> 2) Lock the sysctl knob
>
> We can use flock(LOCK_MAND) or BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny
> read/write on sysctl knobs.
>
> Note the default values of two sysctl knobs depend on the ehash size and
> should be tuned carefully:
>
>   tcp_max_tw_buckets : tcp_child_ehash_entries / 2
>   tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)
>
> Also, we could optimise ehash lookup/iteration further by removing netns
> comparison for the per-netns ehash in the future.
>
> As a bonus, we can dismantle netns faster. Currently, while destroying
> netns, we call inet_twsk_purge(), which walks through the global ehash.
> It can be potentially big because it can have many sockets other than > TIME_WAIT in all netns. Splitting ehash changes that situation, where > it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets > in each netns. > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> > --- > Documentation/networking/ip-sysctl.rst | 20 +++++++++ > include/net/inet_hashtables.h | 6 +++ > include/net/netns/ipv4.h | 1 + > net/dccp/proto.c | 2 + > net/ipv4/inet_hashtables.c | 57 ++++++++++++++++++++++++++ > net/ipv4/inet_timewait_sock.c | 4 +- > net/ipv4/sysctl_net_ipv4.c | 57 ++++++++++++++++++++++++++ > net/ipv4/tcp.c | 1 + > net/ipv4/tcp_ipv4.c | 53 ++++++++++++++++++++---- > net/ipv6/tcp_ipv6.c | 12 +++++- > 10 files changed, 202 insertions(+), 11 deletions(-) > > diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst > index 56cd4ea059b2..97a0952b11e3 100644 > --- a/Documentation/networking/ip-sysctl.rst > +++ b/Documentation/networking/ip-sysctl.rst > @@ -1037,6 +1037,26 @@ tcp_challenge_ack_limit - INTEGER > in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks) > Default: 1000 > > +tcp_ehash_entries - INTEGER > + Read-only number of hash buckets for TCP sockets in the current > + networking namespace. > + > + A negative value means the networking namespace does not own its > + hash buckets and shares the initial networking namespace's one. > + > +tcp_child_ehash_entries - INTEGER > + Control the number of hash buckets for TCP sockets in the child > + networking namespace, which must be set before clone() or unshare(). > + > + The written value except for 0 is rounded up to 2^n. 0 is a special > + value, meaning the child networking namespace will share the initial > + networking namespace's hash buckets. > + > + Note that the child will use the global one in case the kernel > + fails to allocate enough memory. 
> + > + Default: 0 > + > UDP variables > ============= > > diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h > index 2c866112433e..039440936ab2 100644 > --- a/include/net/inet_hashtables.h > +++ b/include/net/inet_hashtables.h > @@ -168,6 +168,8 @@ struct inet_hashinfo { > /* The 2nd listener table hashed by local port and address */ > unsigned int lhash2_mask; > struct inet_listen_hashbucket *lhash2; > + > + bool pernet; > }; > > static inline struct inet_hashinfo *inet_get_hashinfo(const struct sock *sk) > @@ -214,6 +216,10 @@ static inline void inet_ehash_locks_free(struct inet_hashinfo *hashinfo) > hashinfo->ehash_locks = NULL; > } > > +struct inet_hashinfo *inet_pernet_hashinfo_alloc(struct inet_hashinfo *hashinfo, > + unsigned int ehash_entries); > +void inet_pernet_hashinfo_free(struct inet_hashinfo *hashinfo); > + > struct inet_bind_bucket * > inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net, > struct inet_bind_hashbucket *head, > diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h > index c7320ef356d9..6d9c01879027 100644 > --- a/include/net/netns/ipv4.h > +++ b/include/net/netns/ipv4.h > @@ -170,6 +170,7 @@ struct netns_ipv4 { > int sysctl_tcp_pacing_ca_ratio; > int sysctl_tcp_wmem[3]; > int sysctl_tcp_rmem[3]; > + unsigned int sysctl_tcp_child_ehash_entries; > unsigned long sysctl_tcp_comp_sack_delay_ns; > unsigned long sysctl_tcp_comp_sack_slack_ns; > int sysctl_max_syn_backlog; > diff --git a/net/dccp/proto.c b/net/dccp/proto.c > index 7cd4a6cc99fc..c548ca3e9b0e 100644 > --- a/net/dccp/proto.c > +++ b/net/dccp/proto.c > @@ -1197,6 +1197,8 @@ static int __init dccp_init(void) > INIT_HLIST_HEAD(&dccp_hashinfo.bhash2[i].chain); > } > > + dccp_hashinfo.pernet = false; > + > rc = dccp_mib_init(); > if (rc) > goto out_free_dccp_bhash2; > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c > index 5eb21a95179b..a57932b14bc6 100644 > --- a/net/ipv4/inet_hashtables.c > +++ 
b/net/ipv4/inet_hashtables.c > @@ -1145,3 +1145,60 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo) > return 0; > } > EXPORT_SYMBOL_GPL(inet_ehash_locks_alloc); > + > +struct inet_hashinfo *inet_pernet_hashinfo_alloc(struct inet_hashinfo *hashinfo, > + unsigned int ehash_entries) > +{ > + struct inet_hashinfo *new_hashinfo; > + int i; > + > + new_hashinfo = kmalloc(sizeof(*new_hashinfo), GFP_KERNEL); > + if (!new_hashinfo) > + goto err; > + > + new_hashinfo->ehash = kvmalloc_array(ehash_entries, > + sizeof(struct inet_ehash_bucket), > + GFP_KERNEL); GFP_KERNEL_ACCOUNT ? > + if (!new_hashinfo->ehash) > + goto free_hashinfo; > + > + new_hashinfo->ehash_mask = ehash_entries - 1; > + > + if (inet_ehash_locks_alloc(new_hashinfo)) > + goto free_ehash; > + > + for (i = 0; i < ehash_entries; i++) > + INIT_HLIST_NULLS_HEAD(&new_hashinfo->ehash[i].chain, i); > + > + new_hashinfo->bind_bucket_cachep = hashinfo->bind_bucket_cachep; > + new_hashinfo->bhash = hashinfo->bhash; > + new_hashinfo->bind2_bucket_cachep = hashinfo->bind2_bucket_cachep; > + new_hashinfo->bhash2 = hashinfo->bhash2; > + new_hashinfo->bhash_size = hashinfo->bhash_size; > + > + new_hashinfo->lhash2_mask = hashinfo->lhash2_mask; > + new_hashinfo->lhash2 = hashinfo->lhash2; > + > + new_hashinfo->pernet = true; > + > + return new_hashinfo; > + > +free_ehash: > + kvfree(new_hashinfo->ehash); > +free_hashinfo: > + kfree(new_hashinfo); > +err: > + return NULL; > +} > +EXPORT_SYMBOL_GPL(inet_pernet_hashinfo_alloc); > + > +void inet_pernet_hashinfo_free(struct inet_hashinfo *hashinfo) > +{ > + if (!hashinfo->pernet) > + return; > + > + inet_ehash_locks_free(hashinfo); > + kvfree(hashinfo->ehash); > + kfree(hashinfo); > +} > +EXPORT_SYMBOL_GPL(inet_pernet_hashinfo_free); > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c > index 47ccc343c9fb..a5d40acde9d6 100644 > --- a/net/ipv4/inet_timewait_sock.c > +++ b/net/ipv4/inet_timewait_sock.c > @@ -59,8 +59,10 @@ static void 
inet_twsk_kill(struct inet_timewait_sock *tw) > inet_twsk_bind_unhash(tw, hashinfo); > spin_unlock(&bhead->lock); > > - if (refcount_dec_and_test(&tw->tw_dr->tw_refcount)) > + if (refcount_dec_and_test(&tw->tw_dr->tw_refcount)) { > + inet_pernet_hashinfo_free(hashinfo); > kfree(tw->tw_dr); > + } > > inet_twsk_put(tw); > } > diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c > index 5490c285668b..03a3187c4705 100644 > --- a/net/ipv4/sysctl_net_ipv4.c > +++ b/net/ipv4/sysctl_net_ipv4.c > @@ -382,6 +382,48 @@ static int proc_tcp_available_ulp(struct ctl_table *ctl, > return ret; > } > > +static int proc_tcp_ehash_entries(struct ctl_table *table, int write, > + void *buffer, size_t *lenp, loff_t *ppos) > +{ > + struct net *net = container_of(table->data, struct net, > + ipv4.sysctl_tcp_child_ehash_entries); > + struct inet_hashinfo *hinfo = net->ipv4.tcp_death_row->hashinfo; > + int tcp_ehash_entries; > + struct ctl_table tbl; > + > + tcp_ehash_entries = hinfo->ehash_mask + 1; > + > + /* A negative number indicates that the child netns > + * shares the global ehash. 
> + */ > + if (!net_eq(net, &init_net) && !hinfo->pernet) > + tcp_ehash_entries *= -1; > + > + tbl.data = &tcp_ehash_entries; > + tbl.maxlen = sizeof(int); > + > + return proc_dointvec(&tbl, write, buffer, lenp, ppos); > +} > + > +static int proc_tcp_child_ehash_entries(struct ctl_table *table, int write, > + void *buffer, size_t *lenp, loff_t *ppos) > +{ > + unsigned int tcp_child_ehash_entries; > + int ret; > + > + ret = proc_douintvec(table, write, buffer, lenp, ppos); > + if (!write || ret) > + return ret; > + > + tcp_child_ehash_entries = READ_ONCE(*(unsigned int *)table->data); > + if (tcp_child_ehash_entries) > + tcp_child_ehash_entries = roundup_pow_of_two(tcp_child_ehash_entries); > + > + WRITE_ONCE(*(unsigned int *)table->data, tcp_child_ehash_entries); > + > + return 0; > +} > + > #ifdef CONFIG_IP_ROUTE_MULTIPATH > static int proc_fib_multipath_hash_policy(struct ctl_table *table, int write, > void *buffer, size_t *lenp, > @@ -1321,6 +1363,21 @@ static struct ctl_table ipv4_net_table[] = { > .extra1 = SYSCTL_ZERO, > .extra2 = SYSCTL_ONE, > }, > + { > + .procname = "tcp_ehash_entries", > + .data = &init_net.ipv4.sysctl_tcp_child_ehash_entries, > + .mode = 0444, > + .proc_handler = proc_tcp_ehash_entries, > + }, > + { > + .procname = "tcp_child_ehash_entries", > + .data = &init_net.ipv4.sysctl_tcp_child_ehash_entries, > + .maxlen = sizeof(unsigned int), > + .mode = 0644, > + .proc_handler = proc_tcp_child_ehash_entries, > + .extra1 = SYSCTL_ZERO, > + .extra2 = SYSCTL_INT_MAX, Have you really tested what happens if you set the sysctl to max value 0x7fffffff I would assume some kernel allocations will fail, or some loops will trigger some kind of soft lockups. 
> + }, > { > .procname = "udp_rmem_min", > .data = &init_net.ipv4.sysctl_udp_rmem_min, > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index baf6adb723ad..f8ce673e32cb 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -4788,6 +4788,7 @@ void __init tcp_init(void) > INIT_HLIST_HEAD(&tcp_hashinfo.bhash2[i].chain); > } > > + tcp_hashinfo.pernet = false; > > cnt = tcp_hashinfo.ehash_mask + 1; > sysctl_tcp_max_orphans = cnt / 2; > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c > index b07930643b11..604119f46b52 100644 > --- a/net/ipv4/tcp_ipv4.c > +++ b/net/ipv4/tcp_ipv4.c > @@ -3109,14 +3109,23 @@ static void __net_exit tcp_sk_exit(struct net *net) > if (net->ipv4.tcp_congestion_control) > bpf_module_put(net->ipv4.tcp_congestion_control, > net->ipv4.tcp_congestion_control->owner); > - if (refcount_dec_and_test(&tcp_death_row->tw_refcount)) > + if (refcount_dec_and_test(&tcp_death_row->tw_refcount)) { > + inet_pernet_hashinfo_free(tcp_death_row->hashinfo); > kfree(tcp_death_row); > + } > } > > -static int __net_init tcp_sk_init(struct net *net) > +static void __net_init tcp_set_hashinfo(struct net *net, struct inet_hashinfo *hinfo) > { > - int cnt; > + int ehash_entries = hinfo->ehash_mask + 1; 0x7fffffff + 1 -> integer overflow > > + net->ipv4.tcp_death_row->hashinfo = hinfo; > + net->ipv4.tcp_death_row->sysctl_max_tw_buckets = ehash_entries / 2; > + net->ipv4.sysctl_max_syn_backlog = max(128, ehash_entries / 128); > +} > + > +static int __net_init tcp_sk_init(struct net *net) > +{ > net->ipv4.sysctl_tcp_ecn = 2; > net->ipv4.sysctl_tcp_ecn_fallback = 1; > > @@ -3145,12 +3154,10 @@ static int __net_init tcp_sk_init(struct net *net) > net->ipv4.tcp_death_row = kzalloc(sizeof(struct inet_timewait_death_row), GFP_KERNEL); > if (!net->ipv4.tcp_death_row) > return -ENOMEM; > + > refcount_set(&net->ipv4.tcp_death_row->tw_refcount, 1); > - cnt = tcp_hashinfo.ehash_mask + 1; > - net->ipv4.tcp_death_row->sysctl_max_tw_buckets = cnt / 2; > - 
net->ipv4.tcp_death_row->hashinfo = &tcp_hashinfo; > + tcp_set_hashinfo(net, &tcp_hashinfo); > > - net->ipv4.sysctl_max_syn_backlog = max(128, cnt / 128); > net->ipv4.sysctl_tcp_sack = 1; > net->ipv4.sysctl_tcp_window_scaling = 1; > net->ipv4.sysctl_tcp_timestamps = 1; > @@ -3206,18 +3213,46 @@ static int __net_init tcp_sk_init(struct net *net) > return 0; > } > > +static int __net_init tcp_sk_init_pernet_hashinfo(struct net *net, struct net *old_net) > +{ > + struct inet_hashinfo *child_hinfo; > + int ehash_entries; > + > + ehash_entries = READ_ONCE(old_net->ipv4.sysctl_tcp_child_ehash_entries); > + if (!ehash_entries) > + goto out; > + > + child_hinfo = inet_pernet_hashinfo_alloc(&tcp_hashinfo, ehash_entries); > + if (child_hinfo) > + tcp_set_hashinfo(net, child_hinfo); > + else > + pr_warn("Failed to allocate TCP ehash (entries: %u) " > + "for a netns, fallback to use the global one\n", > + ehash_entries); > +out: > + return 0; > +} > + > static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list) > { > + bool purge_once = true; > struct net *net; > > - inet_twsk_purge(&tcp_hashinfo, AF_INET); > + list_for_each_entry(net, net_exit_list, exit_list) { > + if (net->ipv4.tcp_death_row->hashinfo->pernet) { > + inet_twsk_purge(net->ipv4.tcp_death_row->hashinfo, AF_INET); > + } else if (purge_once) { > + inet_twsk_purge(&tcp_hashinfo, AF_INET); > + purge_once = false; > + } > > - list_for_each_entry(net, net_exit_list, exit_list) > tcp_fastopen_ctx_destroy(net); > + } > } > > static struct pernet_operations __net_initdata tcp_sk_ops = { > .init = tcp_sk_init, > + .init2 = tcp_sk_init_pernet_hashinfo, > .exit = tcp_sk_exit, > .exit_batch = tcp_sk_exit_batch, > }; > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c > index 27b2fd98a2c4..19f730428720 100644 > --- a/net/ipv6/tcp_ipv6.c > +++ b/net/ipv6/tcp_ipv6.c > @@ -2229,7 +2229,17 @@ static void __net_exit tcpv6_net_exit(struct net *net) > > static void __net_exit tcpv6_net_exit_batch(struct 
list_head *net_exit_list) > { > - inet_twsk_purge(&tcp_hashinfo, AF_INET6); > + bool purge_once = true; > + struct net *net; > + This looks like a duplicate of ipv4 function. Opportunity of factorization ? > + list_for_each_entry(net, net_exit_list, exit_list) { > + if (net->ipv4.tcp_death_row->hashinfo->pernet) { > + inet_twsk_purge(net->ipv4.tcp_death_row->hashinfo, AF_INET6); > + } else if (purge_once) { > + inet_twsk_purge(&tcp_hashinfo, AF_INET6); > + purge_once = false; > + } > + } > } > > static struct pernet_operations tcpv6_net_ops = { > -- > 2.30.2 >
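
The `0x7fffffff + 1` overflow flagged in the review above can be reproduced outside the kernel. What follows is a userspace sketch, not kernel code: `struct ehash_defaults`, `entries_fit_in_int()` and `defaults_from_mask()` are hypothetical names, and the two formulas are the ones quoted in the commit message (entries / 2 and max(128, entries / 128)). It assumes the arithmetic is done in `unsigned int`, which is one possible way to avoid the overflow.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Userspace model only; the struct and function names are hypothetical. */
struct ehash_defaults {
	unsigned int max_tw_buckets;
	unsigned int max_syn_backlog;
};

/* Returns 1 if (ehash_mask + 1) can be stored in an int without overflow. */
static int entries_fit_in_int(unsigned int ehash_mask)
{
	return (uint64_t)ehash_mask + 1 <= (uint64_t)INT_MAX;
}

/* Derive the two defaults from the mask using unsigned arithmetic. */
static struct ehash_defaults defaults_from_mask(unsigned int ehash_mask)
{
	unsigned int entries = ehash_mask + 1;
	struct ehash_defaults d;

	/* A mask of UINT_MAX would wrap entries to 0; not a valid table size. */
	assert(entries != 0);

	d.max_tw_buckets = entries / 2;
	d.max_syn_backlog = entries / 128 > 128 ? entries / 128 : 128;
	return d;
}
```

With a mask of 0x7fffffff, `entries_fit_in_int()` reports that the entry count does not fit in an `int`, while `defaults_from_mask()` still computes 0x40000000 for the TIME_WAIT limit, because 0x80000000 is representable as `unsigned int`.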
From: Eric Dumazet <edumazet@google.com>
Date: Fri, 26 Aug 2022 08:24:54 -0700
> On Thu, Aug 25, 2022 at 5:07 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > The more sockets we have in the hash table, the more time we spend
> > looking up the socket. While running a number of small workloads on
> > the same host, they penalise each other and cause performance degradation.
> >
> > Also, the root cause might be a single workload that consumes much more
> > resources than the others. It often happens on a cloud service where
> > different workloads share the same computing resource.
> >
> > To resolve the issue, we introduce an optional per-netns hash table for
> > TCP, but it's just ehash, and we still share the global bhash and lhash2.
> >
> > With a smaller ehash, we can look up non-listener sockets faster and
> > isolate such noisy neighbours. Also, we can reduce lock contention.
> >
> > We can control the ehash size by a new sysctl knob. However, depending
> > on workloads, it will require very sensitive tuning, so we disable the
> > feature by default (net.ipv4.tcp_child_ehash_entries == 0). Moreover,
> > we can fall back to using the global ehash in case we fail to allocate
> > enough memory for a new ehash.
> >
> > We can check the current ehash size by another read-only sysctl knob,
> > net.ipv4.tcp_ehash_entries. A negative value means the netns shares
> > the global ehash (per-netns ehash is disabled or failed to allocate
> > memory).
> >
> >   # dmesg | cut -d ' ' -f 5- | grep "established hash"
> >   TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)
> >
> >   # sysctl net.ipv4.tcp_ehash_entries
> >   net.ipv4.tcp_ehash_entries = 524288 # can be changed by thash_entries
> >
> >   # sysctl net.ipv4.tcp_child_ehash_entries
> >   net.ipv4.tcp_child_ehash_entries = 0 # disabled by default
> >
> >   # ip netns add test1
> >   # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
> >   net.ipv4.tcp_ehash_entries = -524288 # share the global ehash
> >
> >   # sysctl -w net.ipv4.tcp_child_ehash_entries=100
> >   net.ipv4.tcp_child_ehash_entries = 100
> >
> >   # sysctl net.ipv4.tcp_child_ehash_entries
> >   net.ipv4.tcp_child_ehash_entries = 128 # rounded up to 2^n
> >
> >   # ip netns add test2
> >   # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
> >   net.ipv4.tcp_ehash_entries = 128 # own per-netns ehash
> >
> > When more than two processes in the same netns create per-netns ehash
> > concurrently with different sizes, we need to guarantee the size in
> > one of the following ways:
> >
> > 1) Share the global ehash and create per-netns ehash
> >
> > First, unshare() with tcp_child_ehash_entries==0. It creates dedicated
> > netns sysctl knobs where we can safely change tcp_child_ehash_entries
> > and clone()/unshare() to create a per-netns ehash.
> >
> > 2) Lock the sysctl knob
> >
> > We can use flock(LOCK_MAND) or BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny
> > read/write on sysctl knobs.
> >
> > Note the default values of two sysctl knobs depend on the ehash size and
> > should be tuned carefully:
> >
> >   tcp_max_tw_buckets : tcp_child_ehash_entries / 2
> >   tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)
> >
> > Also, we could optimise ehash lookup/iteration further by removing netns
> > comparison for the per-netns ehash in the future.
> >
> > As a bonus, we can dismantle netns faster.
Currently, while destroying > > netns, we call inet_twsk_purge(), which walks through the global ehash. > > It can be potentially big because it can have many sockets other than > > TIME_WAIT in all netns. Splitting ehash changes that situation, where > > it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets > > in each netns. > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> > > --- > > Documentation/networking/ip-sysctl.rst | 20 +++++++++ > > include/net/inet_hashtables.h | 6 +++ > > include/net/netns/ipv4.h | 1 + > > net/dccp/proto.c | 2 + > > net/ipv4/inet_hashtables.c | 57 ++++++++++++++++++++++++++ > > net/ipv4/inet_timewait_sock.c | 4 +- > > net/ipv4/sysctl_net_ipv4.c | 57 ++++++++++++++++++++++++++ > > net/ipv4/tcp.c | 1 + > > net/ipv4/tcp_ipv4.c | 53 ++++++++++++++++++++---- > > net/ipv6/tcp_ipv6.c | 12 +++++- > > 10 files changed, 202 insertions(+), 11 deletions(-) > > > > diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst > > index 56cd4ea059b2..97a0952b11e3 100644 > > --- a/Documentation/networking/ip-sysctl.rst > > +++ b/Documentation/networking/ip-sysctl.rst > > @@ -1037,6 +1037,26 @@ tcp_challenge_ack_limit - INTEGER > > in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks) > > Default: 1000 > > > > +tcp_ehash_entries - INTEGER > > + Read-only number of hash buckets for TCP sockets in the current > > + networking namespace. > > + > > + A negative value means the networking namespace does not own its > > + hash buckets and shares the initial networking namespace's one. > > + > > +tcp_child_ehash_entries - INTEGER > > + Control the number of hash buckets for TCP sockets in the child > > + networking namespace, which must be set before clone() or unshare(). > > + > > + The written value except for 0 is rounded up to 2^n. 0 is a special > > + value, meaning the child networking namespace will share the initial > > + networking namespace's hash buckets. 
> > + > > + Note that the child will use the global one in case the kernel > > + fails to allocate enough memory. > > + > > + Default: 0 > > + > > UDP variables > > ============= > > > > diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h > > index 2c866112433e..039440936ab2 100644 > > --- a/include/net/inet_hashtables.h > > +++ b/include/net/inet_hashtables.h > > @@ -168,6 +168,8 @@ struct inet_hashinfo { > > /* The 2nd listener table hashed by local port and address */ > > unsigned int lhash2_mask; > > struct inet_listen_hashbucket *lhash2; > > + > > + bool pernet; > > }; > > > > static inline struct inet_hashinfo *inet_get_hashinfo(const struct sock *sk) > > @@ -214,6 +216,10 @@ static inline void inet_ehash_locks_free(struct inet_hashinfo *hashinfo) > > hashinfo->ehash_locks = NULL; > > } > > > > +struct inet_hashinfo *inet_pernet_hashinfo_alloc(struct inet_hashinfo *hashinfo, > > + unsigned int ehash_entries); > > +void inet_pernet_hashinfo_free(struct inet_hashinfo *hashinfo); > > + > > struct inet_bind_bucket * > > inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net, > > struct inet_bind_hashbucket *head, > > diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h > > index c7320ef356d9..6d9c01879027 100644 > > --- a/include/net/netns/ipv4.h > > +++ b/include/net/netns/ipv4.h > > @@ -170,6 +170,7 @@ struct netns_ipv4 { > > int sysctl_tcp_pacing_ca_ratio; > > int sysctl_tcp_wmem[3]; > > int sysctl_tcp_rmem[3]; > > + unsigned int sysctl_tcp_child_ehash_entries; > > unsigned long sysctl_tcp_comp_sack_delay_ns; > > unsigned long sysctl_tcp_comp_sack_slack_ns; > > int sysctl_max_syn_backlog; > > diff --git a/net/dccp/proto.c b/net/dccp/proto.c > > index 7cd4a6cc99fc..c548ca3e9b0e 100644 > > --- a/net/dccp/proto.c > > +++ b/net/dccp/proto.c > > @@ -1197,6 +1197,8 @@ static int __init dccp_init(void) > > INIT_HLIST_HEAD(&dccp_hashinfo.bhash2[i].chain); > > } > > > > + dccp_hashinfo.pernet = false; > > + > > rc = 
dccp_mib_init(); > > if (rc) > > goto out_free_dccp_bhash2; > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c > > index 5eb21a95179b..a57932b14bc6 100644 > > --- a/net/ipv4/inet_hashtables.c > > +++ b/net/ipv4/inet_hashtables.c > > @@ -1145,3 +1145,60 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo) > > return 0; > > } > > EXPORT_SYMBOL_GPL(inet_ehash_locks_alloc); > > + > > +struct inet_hashinfo *inet_pernet_hashinfo_alloc(struct inet_hashinfo *hashinfo, > > + unsigned int ehash_entries) > > +{ > > + struct inet_hashinfo *new_hashinfo; > > + int i; > > + > > + new_hashinfo = kmalloc(sizeof(*new_hashinfo), GFP_KERNEL); > > + if (!new_hashinfo) > > + goto err; > > + > > + new_hashinfo->ehash = kvmalloc_array(ehash_entries, > > + sizeof(struct inet_ehash_bucket), > > + GFP_KERNEL); > > GFP_KERNEL_ACCOUNT ? Right, we should account the use. Will use it in v2. > > > + if (!new_hashinfo->ehash) > > + goto free_hashinfo; > > + > > + new_hashinfo->ehash_mask = ehash_entries - 1; > > + > > + if (inet_ehash_locks_alloc(new_hashinfo)) > > + goto free_ehash; > > + > > + for (i = 0; i < ehash_entries; i++) > > + INIT_HLIST_NULLS_HEAD(&new_hashinfo->ehash[i].chain, i); > > + > > + new_hashinfo->bind_bucket_cachep = hashinfo->bind_bucket_cachep; > > + new_hashinfo->bhash = hashinfo->bhash; > > + new_hashinfo->bind2_bucket_cachep = hashinfo->bind2_bucket_cachep; > > + new_hashinfo->bhash2 = hashinfo->bhash2; > > + new_hashinfo->bhash_size = hashinfo->bhash_size; > > + > > + new_hashinfo->lhash2_mask = hashinfo->lhash2_mask; > > + new_hashinfo->lhash2 = hashinfo->lhash2; > > + > > + new_hashinfo->pernet = true; > > + > > + return new_hashinfo; > > + > > +free_ehash: > > + kvfree(new_hashinfo->ehash); > > +free_hashinfo: > > + kfree(new_hashinfo); > > +err: > > + return NULL; > > +} > > +EXPORT_SYMBOL_GPL(inet_pernet_hashinfo_alloc); > > + > > +void inet_pernet_hashinfo_free(struct inet_hashinfo *hashinfo) > > +{ > > + if (!hashinfo->pernet) > > 
+ return; > > + > > + inet_ehash_locks_free(hashinfo); > > + kvfree(hashinfo->ehash); > > + kfree(hashinfo); > > +} > > +EXPORT_SYMBOL_GPL(inet_pernet_hashinfo_free); > > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c > > index 47ccc343c9fb..a5d40acde9d6 100644 > > --- a/net/ipv4/inet_timewait_sock.c > > +++ b/net/ipv4/inet_timewait_sock.c > > @@ -59,8 +59,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw) > > inet_twsk_bind_unhash(tw, hashinfo); > > spin_unlock(&bhead->lock); > > > > - if (refcount_dec_and_test(&tw->tw_dr->tw_refcount)) > > + if (refcount_dec_and_test(&tw->tw_dr->tw_refcount)) { > > + inet_pernet_hashinfo_free(hashinfo); > > kfree(tw->tw_dr); > > + } > > > > inet_twsk_put(tw); > > } > > diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c > > index 5490c285668b..03a3187c4705 100644 > > --- a/net/ipv4/sysctl_net_ipv4.c > > +++ b/net/ipv4/sysctl_net_ipv4.c > > @@ -382,6 +382,48 @@ static int proc_tcp_available_ulp(struct ctl_table *ctl, > > return ret; > > } > > > > +static int proc_tcp_ehash_entries(struct ctl_table *table, int write, > > + void *buffer, size_t *lenp, loff_t *ppos) > > +{ > > + struct net *net = container_of(table->data, struct net, > > + ipv4.sysctl_tcp_child_ehash_entries); > > + struct inet_hashinfo *hinfo = net->ipv4.tcp_death_row->hashinfo; > > + int tcp_ehash_entries; > > + struct ctl_table tbl; > > + > > + tcp_ehash_entries = hinfo->ehash_mask + 1; > > + > > + /* A negative number indicates that the child netns > > + * shares the global ehash. 
> > + */ > > + if (!net_eq(net, &init_net) && !hinfo->pernet) > > + tcp_ehash_entries *= -1; > > + > > + tbl.data = &tcp_ehash_entries; > > + tbl.maxlen = sizeof(int); > > + > > + return proc_dointvec(&tbl, write, buffer, lenp, ppos); > > +} > > + > > +static int proc_tcp_child_ehash_entries(struct ctl_table *table, int write, > > + void *buffer, size_t *lenp, loff_t *ppos) > > +{ > > + unsigned int tcp_child_ehash_entries; > > + int ret; > > + > > + ret = proc_douintvec(table, write, buffer, lenp, ppos); > > + if (!write || ret) > > + return ret; > > + > > + tcp_child_ehash_entries = READ_ONCE(*(unsigned int *)table->data); > > + if (tcp_child_ehash_entries) > > + tcp_child_ehash_entries = roundup_pow_of_two(tcp_child_ehash_entries); > > + > > + WRITE_ONCE(*(unsigned int *)table->data, tcp_child_ehash_entries); > > + > > + return 0; > > +} > > + > > #ifdef CONFIG_IP_ROUTE_MULTIPATH > > static int proc_fib_multipath_hash_policy(struct ctl_table *table, int write, > > void *buffer, size_t *lenp, > > @@ -1321,6 +1363,21 @@ static struct ctl_table ipv4_net_table[] = { > > .extra1 = SYSCTL_ZERO, > > .extra2 = SYSCTL_ONE, > > }, > > + { > > + .procname = "tcp_ehash_entries", > > + .data = &init_net.ipv4.sysctl_tcp_child_ehash_entries, > > + .mode = 0444, > > + .proc_handler = proc_tcp_ehash_entries, > > + }, > > + { > > + .procname = "tcp_child_ehash_entries", > > + .data = &init_net.ipv4.sysctl_tcp_child_ehash_entries, > > + .maxlen = sizeof(unsigned int), > > + .mode = 0644, > > + .proc_handler = proc_tcp_child_ehash_entries, > > + .extra1 = SYSCTL_ZERO, > > + .extra2 = SYSCTL_INT_MAX, > > Have you really tested what happens if you set the sysctl to max value > 0x7fffffff > > I would assume some kernel allocations will fail, or some loops will > trigger some kind of soft lockups. Yes, I saw vmalloc() error splat and fallback to the global ehash. I think 4Mi or 8Mi should be enough for most workloads. What do you think? 
---8<---
[ 46.525863] ------------[ cut here ]------------
[ 46.526095] WARNING: CPU: 0 PID: 240 at mm/util.c:624 kvmalloc_node+0xbb/0xc0
[ 46.526534] Modules linked in:
[ 46.526901] CPU: 0 PID: 240 Comm: ip Not tainted 6.0.0-rc1-per-netns-hash-tcpudp-15620-gd02cde62bac1 #121
[ 46.527241] Hardware name: Red Hat KVM, BIOS 1.11.0-2.amzn2 04/01/2014
[ 46.527568] RIP: 0010:kvmalloc_node+0xbb/0xc0
[ 46.527870] Code: 55 48 89 ef 68 00 04 00 00 48 8d 4c 0a ff e8 7c 74 03 00 48 83 c4 18 5d 41 5c 41 5d c3 cc cc cc cc 41 81 e4 00 20 00 00 75 ed <0f> 0b eb e9 90 55 48 89 fd e8 57 1d 03 00 48 89 ef 84 c0 74 06 5d
[ 46.528493] RSP: 0018:ffffc9000022fd80 EFLAGS: 00000246
[ 46.528801] RAX: 0000000000000000 RBX: 0000000080000000 RCX: 0000000000000000
[ 46.529107] RDX: 0000000000000015 RSI: ffffffff81220571 RDI: 0000000000052cc0
[ 46.529390] RBP: 0000000400000000 R08: ffffffff830ee280 R09: 0000000000000060
[ 46.529730] R10: ffff888005305a00 R11: 0000000000001788 R12: 0000000000000000
[ 46.529989] R13: 00000000ffffffff R14: ffffffff8389db80 R15: ffff88800550e300
[ 46.530351] FS: 00007f1ca8200740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
[ 46.530731] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 46.530931] CR2: 00007f1ca82e9150 CR3: 00000000054ca000 CR4: 00000000000006f0
[ 46.531386] Call Trace:
[ 46.531966] <TASK>
[ 46.532277] inet_pernet_hashinfo_alloc+0x40/0xe0
[ 46.532530] tcp_sk_init_pernet_hashinfo+0x26/0x80
[ 46.532806] ops_init+0x7a/0x150
[ 46.532965] setup_net+0x145/0x2b0
[ 46.533116] copy_net_ns+0xf8/0x1c0
[ 46.533310] create_new_namespaces+0x10e/0x2e0
[ 46.533478] unshare_nsproxy_namespaces+0x57/0x90
[ 46.533731] ksys_unshare+0x183/0x320
[ 46.533863] __x64_sys_unshare+0x9/0x10
[ 46.534029] do_syscall_64+0x3b/0x90
[ 46.534192] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 46.534506] RIP: 0033:0x7f1ca82f86c7
[ 46.534906] Code: 73 01 c3 48 8b 0d a9 07 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 10 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 79 07 0c 00 f7 d8 64 89 01 48
[ 46.535581] RSP: 002b:00007ffd394c2298 EFLAGS: 00000206 ORIG_RAX: 0000000000000110
[ 46.535860] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f1ca82f86c7
[ 46.536118] RDX: 0000000000080000 RSI: 0000560877451812 RDI: 0000000040000000
[ 46.536425] RBP: 0000000000000005 R08: 0000000000000000 R09: 00007ffd394c2140
[ 46.536826] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
[ 46.537094] R13: 00007ffd394c22b8 R14: 00007ffd394c4490 R15: 00007f1ca82006c8
[ 46.537437] </TASK>
[ 46.537558] ---[ end trace 0000000000000000 ]---
[ 46.538077] TCP: Failed to allocate TCP ehash (entries: 2147483648) for a netns, fallback to use the global one
---8<---

>
> > +	},
> >  	{
> >  		.procname = "udp_rmem_min",
> >  		.data = &init_net.ipv4.sysctl_udp_rmem_min,
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index baf6adb723ad..f8ce673e32cb 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -4788,6 +4788,7 @@ void __init tcp_init(void)
> >  		INIT_HLIST_HEAD(&tcp_hashinfo.bhash2[i].chain);
> >  	}
> >
> > +	tcp_hashinfo.pernet = false;
> >
> >  	cnt = tcp_hashinfo.ehash_mask + 1;
> >  	sysctl_tcp_max_orphans = cnt / 2;
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index b07930643b11..604119f46b52 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -3109,14 +3109,23 @@ static void __net_exit tcp_sk_exit(struct net *net)
> >  	if (net->ipv4.tcp_congestion_control)
> >  		bpf_module_put(net->ipv4.tcp_congestion_control,
> >  			       net->ipv4.tcp_congestion_control->owner);
> > -	if (refcount_dec_and_test(&tcp_death_row->tw_refcount))
> > +	if (refcount_dec_and_test(&tcp_death_row->tw_refcount)) {
> > +		inet_pernet_hashinfo_free(tcp_death_row->hashinfo);
> >  		kfree(tcp_death_row);
> > +	}
> >  }
> >
> > -static int __net_init tcp_sk_init(struct net *net)
> > +static void __net_init tcp_set_hashinfo(struct net *net, struct inet_hashinfo *hinfo)
> >  {
> > -	int cnt;
> > +	int ehash_entries = hinfo->ehash_mask + 1;
>
> 0x7fffffff + 1 -> integer overflow
>

Nice catch!
I'll change it to unsigned int.

>
> >
> > +	net->ipv4.tcp_death_row->hashinfo = hinfo;
> > +	net->ipv4.tcp_death_row->sysctl_max_tw_buckets = ehash_entries / 2;
> > +	net->ipv4.sysctl_max_syn_backlog = max(128, ehash_entries / 128);
> > +}
> > +
> > +static int __net_init tcp_sk_init(struct net *net)
> > +{
> >  	net->ipv4.sysctl_tcp_ecn = 2;
> >  	net->ipv4.sysctl_tcp_ecn_fallback = 1;
> >
> > @@ -3145,12 +3154,10 @@ static int __net_init tcp_sk_init(struct net *net)
> >  	net->ipv4.tcp_death_row = kzalloc(sizeof(struct inet_timewait_death_row), GFP_KERNEL);
> >  	if (!net->ipv4.tcp_death_row)
> >  		return -ENOMEM;
> > +
> >  	refcount_set(&net->ipv4.tcp_death_row->tw_refcount, 1);
> > -	cnt = tcp_hashinfo.ehash_mask + 1;
> > -	net->ipv4.tcp_death_row->sysctl_max_tw_buckets = cnt / 2;
> > -	net->ipv4.tcp_death_row->hashinfo = &tcp_hashinfo;
> > +	tcp_set_hashinfo(net, &tcp_hashinfo);
> >
> > -	net->ipv4.sysctl_max_syn_backlog = max(128, cnt / 128);
> >  	net->ipv4.sysctl_tcp_sack = 1;
> >  	net->ipv4.sysctl_tcp_window_scaling = 1;
> >  	net->ipv4.sysctl_tcp_timestamps = 1;
> > @@ -3206,18 +3213,46 @@ static int __net_init tcp_sk_init(struct net *net)
> >  	return 0;
> >  }
> >
> > +static int __net_init tcp_sk_init_pernet_hashinfo(struct net *net, struct net *old_net)
> > +{
> > +	struct inet_hashinfo *child_hinfo;
> > +	int ehash_entries;
> > +
> > +	ehash_entries = READ_ONCE(old_net->ipv4.sysctl_tcp_child_ehash_entries);
> > +	if (!ehash_entries)
> > +		goto out;
> > +
> > +	child_hinfo = inet_pernet_hashinfo_alloc(&tcp_hashinfo, ehash_entries);
> > +	if (child_hinfo)
> > +		tcp_set_hashinfo(net, child_hinfo);
> > +	else
> > +		pr_warn("Failed to allocate TCP ehash (entries: %u) "
> > +			"for a netns, fallback to use the global one\n",
> > +			ehash_entries);
> > +out:
> > +	return 0;
> > +}
> > +
> >  static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
> >  {
> > +	bool purge_once = true;
> >  	struct net *net;
> >
> > -	inet_twsk_purge(&tcp_hashinfo, AF_INET);
> > +	list_for_each_entry(net, net_exit_list, exit_list) {
> > +		if (net->ipv4.tcp_death_row->hashinfo->pernet) {
> > +			inet_twsk_purge(net->ipv4.tcp_death_row->hashinfo, AF_INET);
> > +		} else if (purge_once) {
> > +			inet_twsk_purge(&tcp_hashinfo, AF_INET);
> > +			purge_once = false;
> > +		}
> >
> > -	list_for_each_entry(net, net_exit_list, exit_list)
> >  		tcp_fastopen_ctx_destroy(net);
> > +	}
> >  }
> >
> >  static struct pernet_operations __net_initdata tcp_sk_ops = {
> >  	.init = tcp_sk_init,
> > +	.init2 = tcp_sk_init_pernet_hashinfo,
> >  	.exit = tcp_sk_exit,
> >  	.exit_batch = tcp_sk_exit_batch,
> >  };
> > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > index 27b2fd98a2c4..19f730428720 100644
> > --- a/net/ipv6/tcp_ipv6.c
> > +++ b/net/ipv6/tcp_ipv6.c
> > @@ -2229,7 +2229,17 @@ static void __net_exit tcpv6_net_exit(struct net *net)
> >
> >  static void __net_exit tcpv6_net_exit_batch(struct list_head *net_exit_list)
> >  {
> > -	inet_twsk_purge(&tcp_hashinfo, AF_INET6);
> > +	bool purge_once = true;
> > +	struct net *net;
> > +
>
> This looks like a duplicate of ipv4 function. Opportunity of factorization ?

Exactly.
I'll factorise it in v2.

>
> > +	list_for_each_entry(net, net_exit_list, exit_list) {
> > +		if (net->ipv4.tcp_death_row->hashinfo->pernet) {
> > +			inet_twsk_purge(net->ipv4.tcp_death_row->hashinfo, AF_INET6);
> > +		} else if (purge_once) {
> > +			inet_twsk_purge(&tcp_hashinfo, AF_INET6);
> > +			purge_once = false;
> > +		}
> > +	}
> >  }
> >
> >  static struct pernet_operations tcpv6_net_ops = {
> > --
> > 2.30.2
> >
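The overflow flagged in the review can be illustrated in userspace. The following is only a sketch, assuming a 32-bit int and using hypothetical names (demo_hashinfo, demo_ehash_entries, demo_exceeds_int) that are not part of the patch. It shows that a mask of 0x7fffffff yields 2147483648 entries, which fits in an unsigned int but not in a signed int, matching the "entries: 2147483648" warning in the log above.

```c
#include <limits.h>

/* Hypothetical stand-in for the ehash_mask field of struct inet_hashinfo. */
struct demo_hashinfo {
	unsigned int ehash_mask;
};

/*
 * Compute the entry count in unsigned arithmetic, so that
 * 0x7fffffff + 1 is well defined instead of overflowing a signed int.
 */
unsigned int demo_ehash_entries(const struct demo_hashinfo *hinfo)
{
	return hinfo->ehash_mask + 1;
}

/* Return nonzero when the entry count cannot be represented as an int. */
int demo_exceeds_int(unsigned int entries)
{
	return entries > (unsigned int)INT_MAX;
}
```

With ehash_mask set to 0x7fffffff, demo_ehash_entries() returns 2147483648 and demo_exceeds_int() reports that the value no longer fits in an int, which is the case the switch to unsigned int is meant to cover.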
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 56cd4ea059b2..97a0952b11e3 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -1037,6 +1037,26 @@ tcp_challenge_ack_limit - INTEGER
 	in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks)
 	Default: 1000
 
+tcp_ehash_entries - INTEGER
+	Read-only number of hash buckets for TCP sockets in the current
+	networking namespace.
+
+	A negative value means the networking namespace does not own its
+	hash buckets and shares the initial networking namespace's one.
+
+tcp_child_ehash_entries - INTEGER
+	Control the number of hash buckets for TCP sockets in the child
+	networking namespace, which must be set before clone() or unshare().
+
+	The written value except for 0 is rounded up to 2^n. 0 is a special
+	value, meaning the child networking namespace will share the initial
+	networking namespace's hash buckets.
+
+	Note that the child will use the global one in case the kernel
+	fails to allocate enough memory.
+
+	Default: 0
+
 UDP variables
 =============
 
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 2c866112433e..039440936ab2 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -168,6 +168,8 @@ struct inet_hashinfo {
 	/* The 2nd listener table hashed by local port and address */
 	unsigned int lhash2_mask;
 	struct inet_listen_hashbucket *lhash2;
+
+	bool pernet;
 };
 
 static inline struct inet_hashinfo *inet_get_hashinfo(const struct sock *sk)
@@ -214,6 +216,10 @@ static inline void inet_ehash_locks_free(struct inet_hashinfo *hashinfo)
 	hashinfo->ehash_locks = NULL;
 }
 
+struct inet_hashinfo *inet_pernet_hashinfo_alloc(struct inet_hashinfo *hashinfo,
+						 unsigned int ehash_entries);
+void inet_pernet_hashinfo_free(struct inet_hashinfo *hashinfo);
+
 struct inet_bind_bucket *
 inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net,
 			struct inet_bind_hashbucket *head,
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index c7320ef356d9..6d9c01879027 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -170,6 +170,7 @@ struct netns_ipv4 {
 	int sysctl_tcp_pacing_ca_ratio;
 	int sysctl_tcp_wmem[3];
 	int sysctl_tcp_rmem[3];
+	unsigned int sysctl_tcp_child_ehash_entries;
 	unsigned long sysctl_tcp_comp_sack_delay_ns;
 	unsigned long sysctl_tcp_comp_sack_slack_ns;
 	int sysctl_max_syn_backlog;
diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 7cd4a6cc99fc..c548ca3e9b0e 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -1197,6 +1197,8 @@ static int __init dccp_init(void)
 		INIT_HLIST_HEAD(&dccp_hashinfo.bhash2[i].chain);
 	}
 
+	dccp_hashinfo.pernet = false;
+
 	rc = dccp_mib_init();
 	if (rc)
 		goto out_free_dccp_bhash2;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 5eb21a95179b..a57932b14bc6 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -1145,3 +1145,60 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
 	return 0;
 }
 EXPORT_SYMBOL_GPL(inet_ehash_locks_alloc);
+
+struct inet_hashinfo *inet_pernet_hashinfo_alloc(struct inet_hashinfo *hashinfo,
+						 unsigned int ehash_entries)
+{
+	struct inet_hashinfo *new_hashinfo;
+	int i;
+
+	new_hashinfo = kmalloc(sizeof(*new_hashinfo), GFP_KERNEL);
+	if (!new_hashinfo)
+		goto err;
+
+	new_hashinfo->ehash = kvmalloc_array(ehash_entries,
+					     sizeof(struct inet_ehash_bucket),
+					     GFP_KERNEL);
+	if (!new_hashinfo->ehash)
+		goto free_hashinfo;
+
+	new_hashinfo->ehash_mask = ehash_entries - 1;
+
+	if (inet_ehash_locks_alloc(new_hashinfo))
+		goto free_ehash;
+
+	for (i = 0; i < ehash_entries; i++)
+		INIT_HLIST_NULLS_HEAD(&new_hashinfo->ehash[i].chain, i);
+
+	new_hashinfo->bind_bucket_cachep = hashinfo->bind_bucket_cachep;
+	new_hashinfo->bhash = hashinfo->bhash;
+	new_hashinfo->bind2_bucket_cachep = hashinfo->bind2_bucket_cachep;
+	new_hashinfo->bhash2 = hashinfo->bhash2;
+	new_hashinfo->bhash_size = hashinfo->bhash_size;
+
+	new_hashinfo->lhash2_mask = hashinfo->lhash2_mask;
+	new_hashinfo->lhash2 = hashinfo->lhash2;
+
+	new_hashinfo->pernet = true;
+
+	return new_hashinfo;
+
+free_ehash:
+	kvfree(new_hashinfo->ehash);
+free_hashinfo:
+	kfree(new_hashinfo);
+err:
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(inet_pernet_hashinfo_alloc);
+
+void inet_pernet_hashinfo_free(struct inet_hashinfo *hashinfo)
+{
+	if (!hashinfo->pernet)
+		return;
+
+	inet_ehash_locks_free(hashinfo);
+	kvfree(hashinfo->ehash);
+	kfree(hashinfo);
+}
+EXPORT_SYMBOL_GPL(inet_pernet_hashinfo_free);
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 47ccc343c9fb..a5d40acde9d6 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -59,8 +59,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
 	inet_twsk_bind_unhash(tw, hashinfo);
 	spin_unlock(&bhead->lock);
 
-	if (refcount_dec_and_test(&tw->tw_dr->tw_refcount))
+	if (refcount_dec_and_test(&tw->tw_dr->tw_refcount)) {
+		inet_pernet_hashinfo_free(hashinfo);
 		kfree(tw->tw_dr);
+	}
 
 	inet_twsk_put(tw);
 }
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 5490c285668b..03a3187c4705 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -382,6 +382,48 @@ static int proc_tcp_available_ulp(struct ctl_table *ctl,
 	return ret;
 }
 
+static int proc_tcp_ehash_entries(struct ctl_table *table, int write,
+				  void *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct net *net = container_of(table->data, struct net,
+				       ipv4.sysctl_tcp_child_ehash_entries);
+	struct inet_hashinfo *hinfo = net->ipv4.tcp_death_row->hashinfo;
+	int tcp_ehash_entries;
+	struct ctl_table tbl;
+
+	tcp_ehash_entries = hinfo->ehash_mask + 1;
+
+	/* A negative number indicates that the child netns
+	 * shares the global ehash.
+	 */
+	if (!net_eq(net, &init_net) && !hinfo->pernet)
+		tcp_ehash_entries *= -1;
+
+	tbl.data = &tcp_ehash_entries;
+	tbl.maxlen = sizeof(int);
+
+	return proc_dointvec(&tbl, write, buffer, lenp, ppos);
+}
+
+static int proc_tcp_child_ehash_entries(struct ctl_table *table, int write,
+					void *buffer, size_t *lenp, loff_t *ppos)
+{
+	unsigned int tcp_child_ehash_entries;
+	int ret;
+
+	ret = proc_douintvec(table, write, buffer, lenp, ppos);
+	if (!write || ret)
+		return ret;
+
+	tcp_child_ehash_entries = READ_ONCE(*(unsigned int *)table->data);
+	if (tcp_child_ehash_entries)
+		tcp_child_ehash_entries = roundup_pow_of_two(tcp_child_ehash_entries);
+
+	WRITE_ONCE(*(unsigned int *)table->data, tcp_child_ehash_entries);
+
+	return 0;
+}
+
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 static int proc_fib_multipath_hash_policy(struct ctl_table *table, int write,
 					  void *buffer, size_t *lenp,
@@ -1321,6 +1363,21 @@ static struct ctl_table ipv4_net_table[] = {
 		.extra1 = SYSCTL_ZERO,
 		.extra2 = SYSCTL_ONE,
 	},
+	{
+		.procname = "tcp_ehash_entries",
+		.data = &init_net.ipv4.sysctl_tcp_child_ehash_entries,
+		.mode = 0444,
+		.proc_handler = proc_tcp_ehash_entries,
+	},
+	{
+		.procname = "tcp_child_ehash_entries",
+		.data = &init_net.ipv4.sysctl_tcp_child_ehash_entries,
+		.maxlen = sizeof(unsigned int),
+		.mode = 0644,
+		.proc_handler = proc_tcp_child_ehash_entries,
+		.extra1 = SYSCTL_ZERO,
+		.extra2 = SYSCTL_INT_MAX,
+	},
 	{
 		.procname = "udp_rmem_min",
 		.data = &init_net.ipv4.sysctl_udp_rmem_min,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index baf6adb723ad..f8ce673e32cb 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4788,6 +4788,7 @@ void __init tcp_init(void)
 		INIT_HLIST_HEAD(&tcp_hashinfo.bhash2[i].chain);
 	}
 
+	tcp_hashinfo.pernet = false;
 
 	cnt = tcp_hashinfo.ehash_mask + 1;
 	sysctl_tcp_max_orphans = cnt / 2;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index b07930643b11..604119f46b52 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3109,14 +3109,23 @@ static void __net_exit tcp_sk_exit(struct net *net)
 	if (net->ipv4.tcp_congestion_control)
 		bpf_module_put(net->ipv4.tcp_congestion_control,
 			       net->ipv4.tcp_congestion_control->owner);
-	if (refcount_dec_and_test(&tcp_death_row->tw_refcount))
+	if (refcount_dec_and_test(&tcp_death_row->tw_refcount)) {
+		inet_pernet_hashinfo_free(tcp_death_row->hashinfo);
 		kfree(tcp_death_row);
+	}
 }
 
-static int __net_init tcp_sk_init(struct net *net)
+static void __net_init tcp_set_hashinfo(struct net *net, struct inet_hashinfo *hinfo)
 {
-	int cnt;
+	int ehash_entries = hinfo->ehash_mask + 1;
 
+	net->ipv4.tcp_death_row->hashinfo = hinfo;
+	net->ipv4.tcp_death_row->sysctl_max_tw_buckets = ehash_entries / 2;
+	net->ipv4.sysctl_max_syn_backlog = max(128, ehash_entries / 128);
+}
+
+static int __net_init tcp_sk_init(struct net *net)
+{
 	net->ipv4.sysctl_tcp_ecn = 2;
 	net->ipv4.sysctl_tcp_ecn_fallback = 1;
 
@@ -3145,12 +3154,10 @@ static int __net_init tcp_sk_init(struct net *net)
 	net->ipv4.tcp_death_row = kzalloc(sizeof(struct inet_timewait_death_row), GFP_KERNEL);
 	if (!net->ipv4.tcp_death_row)
 		return -ENOMEM;
+
 	refcount_set(&net->ipv4.tcp_death_row->tw_refcount, 1);
-	cnt = tcp_hashinfo.ehash_mask + 1;
-	net->ipv4.tcp_death_row->sysctl_max_tw_buckets = cnt / 2;
-	net->ipv4.tcp_death_row->hashinfo = &tcp_hashinfo;
+	tcp_set_hashinfo(net, &tcp_hashinfo);
 
-	net->ipv4.sysctl_max_syn_backlog = max(128, cnt / 128);
 	net->ipv4.sysctl_tcp_sack = 1;
 	net->ipv4.sysctl_tcp_window_scaling = 1;
 	net->ipv4.sysctl_tcp_timestamps = 1;
@@ -3206,18 +3213,46 @@ static int __net_init tcp_sk_init(struct net *net)
 	return 0;
 }
 
+static int __net_init tcp_sk_init_pernet_hashinfo(struct net *net, struct net *old_net)
+{
+	struct inet_hashinfo *child_hinfo;
+	int ehash_entries;
+
+	ehash_entries = READ_ONCE(old_net->ipv4.sysctl_tcp_child_ehash_entries);
+	if (!ehash_entries)
+		goto out;
+
+	child_hinfo = inet_pernet_hashinfo_alloc(&tcp_hashinfo, ehash_entries);
+	if (child_hinfo)
+		tcp_set_hashinfo(net, child_hinfo);
+	else
+		pr_warn("Failed to allocate TCP ehash (entries: %u) "
+			"for a netns, fallback to use the global one\n",
+			ehash_entries);
+out:
+	return 0;
+}
+
 static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
 {
+	bool purge_once = true;
 	struct net *net;
 
-	inet_twsk_purge(&tcp_hashinfo, AF_INET);
+	list_for_each_entry(net, net_exit_list, exit_list) {
+		if (net->ipv4.tcp_death_row->hashinfo->pernet) {
+			inet_twsk_purge(net->ipv4.tcp_death_row->hashinfo, AF_INET);
+		} else if (purge_once) {
+			inet_twsk_purge(&tcp_hashinfo, AF_INET);
+			purge_once = false;
+		}
 
-	list_for_each_entry(net, net_exit_list, exit_list)
 		tcp_fastopen_ctx_destroy(net);
+	}
 }
 
 static struct pernet_operations __net_initdata tcp_sk_ops = {
 	.init = tcp_sk_init,
+	.init2 = tcp_sk_init_pernet_hashinfo,
 	.exit = tcp_sk_exit,
 	.exit_batch = tcp_sk_exit_batch,
 };
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 27b2fd98a2c4..19f730428720 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -2229,7 +2229,17 @@ static void __net_exit tcpv6_net_exit(struct net *net)
 
 static void __net_exit tcpv6_net_exit_batch(struct list_head *net_exit_list)
 {
-	inet_twsk_purge(&tcp_hashinfo, AF_INET6);
+	bool purge_once = true;
+	struct net *net;
+
+	list_for_each_entry(net, net_exit_list, exit_list) {
+		if (net->ipv4.tcp_death_row->hashinfo->pernet) {
+			inet_twsk_purge(net->ipv4.tcp_death_row->hashinfo, AF_INET6);
+		} else if (purge_once) {
+			inet_twsk_purge(&tcp_hashinfo, AF_INET6);
+			purge_once = false;
+		}
+	}
 }
 
 static struct pernet_operations tcpv6_net_ops = {
The more sockets we have in the hash table, the more time we spend
looking up the socket. While running a number of small workloads on
the same host, they penalise each other and cause performance degradation.

Also, the root cause might be a single workload that consumes much more
resources than the others. It often happens on a cloud service where
different workloads share the same computing resource.

To resolve the issue, we introduce an optional per-netns hash table for
TCP, but it's just ehash, and we still share the global bhash and lhash2.

With a smaller ehash, we can look up non-listener sockets faster and
isolate such noisy neighbours. Also, we can reduce lock contention.

We can control the ehash size by a new sysctl knob. However, depending
on workloads, it will require very sensitive tuning, so we disable the
feature by default (net.ipv4.tcp_child_ehash_entries == 0). Moreover,
we can fall back to using the global ehash in case we fail to allocate
enough memory for a new ehash.

We can check the current ehash size by another read-only sysctl knob,
net.ipv4.tcp_ehash_entries. A negative value means the netns shares
the global ehash (per-netns ehash is disabled or failed to allocate
memory).

  # dmesg | cut -d ' ' -f 5- | grep "established hash"
  TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)

  # sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = 524288  # can be changed by thash_entries

  # sysctl net.ipv4.tcp_child_ehash_entries
  net.ipv4.tcp_child_ehash_entries = 0  # disabled by default

  # ip netns add test1
  # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = -524288  # share the global ehash

  # sysctl -w net.ipv4.tcp_child_ehash_entries=100
  net.ipv4.tcp_child_ehash_entries = 100

  # sysctl net.ipv4.tcp_child_ehash_entries
  net.ipv4.tcp_child_ehash_entries = 128  # rounded up to 2^n

  # ip netns add test2
  # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = 128  # own per-netns ehash

When two or more processes in the same netns create per-netns ehash
concurrently with different sizes, we need to guarantee the size in
one of the following ways:

  1) Share the global ehash and create per-netns ehash

  First, unshare() with tcp_child_ehash_entries==0. It creates dedicated
  netns sysctl knobs where we can safely change tcp_child_ehash_entries
  and clone()/unshare() to create a per-netns ehash.

  2) Lock the sysctl knob

  We can use flock(LOCK_MAND) or BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny
  read/write on sysctl knobs.

Note the default values of two sysctl knobs depend on the ehash size and
should be tuned carefully:

  tcp_max_tw_buckets  : tcp_child_ehash_entries / 2
  tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)

Also, we could optimise ehash lookup/iteration further by removing netns
comparison for the per-netns ehash in the future.

As a bonus, we can dismantle netns faster. Currently, while destroying
netns, we call inet_twsk_purge(), which walks through the global ehash.
It can be potentially big because it can have many sockets other than
TIME_WAIT in all netns. Splitting ehash changes that situation, where
it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets
in each netns.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 Documentation/networking/ip-sysctl.rst | 20 +++++++++
 include/net/inet_hashtables.h          |  6 +++
 include/net/netns/ipv4.h               |  1 +
 net/dccp/proto.c                       |  2 +
 net/ipv4/inet_hashtables.c             | 57 ++++++++++++++++++++++++++
 net/ipv4/inet_timewait_sock.c          |  4 +-
 net/ipv4/sysctl_net_ipv4.c             | 57 ++++++++++++++++++++++++++
 net/ipv4/tcp.c                         |  1 +
 net/ipv4/tcp_ipv4.c                    | 53 ++++++++++++++++++++----
 net/ipv6/tcp_ipv6.c                    | 12 +++++-
 10 files changed, 202 insertions(+), 11 deletions(-)
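
The arithmetic described in the commit message (rounding the written value up to 2^n, the two derived defaults, and the sign convention of tcp_ehash_entries) can be sketched in userspace as follows. This is an illustration under stated assumptions, not the kernel code: every demo_* name is hypothetical, demo_roundup_pow_of_two() is a simple stand-in for the kernel's roundup_pow_of_two(), and inputs are assumed to be at most 2^31.

```c
/*
 * Hypothetical userspace stand-in for the kernel's roundup_pow_of_two().
 * Assumes 1 <= n <= 2^31 so that the shift below cannot wrap to zero.
 */
unsigned int demo_roundup_pow_of_two(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Value stored after a write to tcp_child_ehash_entries: 0 stays 0. */
unsigned int demo_child_entries(unsigned int written)
{
	return written ? demo_roundup_pow_of_two(written) : 0;
}

/* Default tcp_max_tw_buckets derived from the ehash size. */
unsigned int demo_max_tw_buckets(unsigned int entries)
{
	return entries / 2;
}

/* Default tcp_max_syn_backlog: max(128, entries / 128). */
unsigned int demo_max_syn_backlog(unsigned int entries)
{
	unsigned int scaled = entries / 128;

	return scaled > 128 ? scaled : 128;
}

/*
 * Value reported by tcp_ehash_entries: negative when a child netns
 * shares the global ehash instead of owning one.
 */
long long demo_reported_entries(unsigned int entries, int shares_global)
{
	return shares_global ? -(long long)entries : (long long)entries;
}
```

For the transcript above, demo_child_entries(100) gives 128, so a netns created afterwards would default to a tcp_max_tw_buckets of 64 and a tcp_max_syn_backlog of 128, which is why both knobs may need retuning for small tables.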