Message ID | 20220626082331.36119-1-kuniyu@amazon.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [v2,net] af_unix: Do not call kmemdup() for init_net's sysctl table. | expand |
Kuniyuki Iwashima <kuniyu@amazon.com> writes: > While setting up init_net's sysctl table, we need not duplicate the global > table and can use it directly. Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> I am not quite certain the savings of a single entry table justivies the complexity. But the looks correct. Eric > > Fixes: 1597fbc0faf8 ("[UNIX]: Make the unix sysctl tables per-namespace") > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> > --- > v2: > * Fix NULL comparison style by checkpatch.pl > > v1: https://lore.kernel.org/all/20220626074454.28944-1-kuniyu@amazon.com/ > --- > --- > net/unix/sysctl_net_unix.c | 16 +++++++++++----- > 1 file changed, 11 insertions(+), 5 deletions(-) > > diff --git a/net/unix/sysctl_net_unix.c b/net/unix/sysctl_net_unix.c > index 01d44e2598e2..4bd856d05135 100644 > --- a/net/unix/sysctl_net_unix.c > +++ b/net/unix/sysctl_net_unix.c > @@ -26,11 +26,16 @@ int __net_init unix_sysctl_register(struct net *net) > { > struct ctl_table *table; > > - table = kmemdup(unix_table, sizeof(unix_table), GFP_KERNEL); > - if (table == NULL) > - goto err_alloc; > + if (net_eq(net, &init_net)) { > + table = unix_table; > + } else { > + table = kmemdup(unix_table, sizeof(unix_table), GFP_KERNEL); > + if (!table) > + goto err_alloc; > + > + table[0].data = &net->unx.sysctl_max_dgram_qlen; > + } > > - table[0].data = &net->unx.sysctl_max_dgram_qlen; > net->unx.ctl = register_net_sysctl(net, "net/unix", table); > if (net->unx.ctl == NULL) > goto err_reg; > @@ -38,7 +43,8 @@ int __net_init unix_sysctl_register(struct net *net) > return 0; > > err_reg: > - kfree(table); > + if (net_eq(net, &init_net)) > + kfree(table); > err_alloc: > return -ENOMEM; > }
On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote: > Kuniyuki Iwashima <kuniyu@amazon.com> writes: > > > While setting up init_net's sysctl table, we need not duplicate the global > > table and can use it directly. > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> > > I am not quite certain the savings of a single entry table justivies > the complexity. But the looks correct. Yeah, the commit message is a little sparse. The "why" is not addressed. Could you add more details to explain the motivation?
From: Jakub Kicinski <kuba@kernel.org> Date: Mon, 27 Jun 2022 10:58:59 -0700 > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote: > > Kuniyuki Iwashima <kuniyu@amazon.com> writes: > > > > > While setting up init_net's sysctl table, we need not duplicate the global > > > table and can use it directly. > > > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> > > > > I am not quite certain the savings of a single entry table justivies > > the complexity. But the looks correct. > > Yeah, the commit message is a little sparse. The "why" is not addressed. > Could you add more details to explain the motivation? I was working on a series which converts UDP/TCP hash tables into per-netns ones like AF_UNIX to speed up looking up sockets. It will consume much memory on a host with thousands of netns, but it can be waste if we do not have its protocol family's sockets. So, I'm now working on a follow-up series for AF_UNIX per-netns hash table so that we can change the size for a child netns by a sysctl knob: # sysctl -w net.unix.child_hash_entries=128 # ip net add test # created with the hash table size 128 # ip net exec test sh # sysctl net.unix.hash_entries # read-only 128 (The size for init_net can be changed via a new boot parameter xhash_entries like uhash_entries/thash_entries.) While implementing that, I found that kmemdup() is called for init_net but TCP/UDP does not (See: ipv4_sysctl_init_net()). Unlike IPv4, AF_UNIX does not have a huge sysctl table, so it cannot be a problem though, this patch is for consuming less memory and kind of consistency. The reason I submit this seperately is that it might be better to have a Fixes tag.
On Mon, Jun 27, 2022 at 8:30 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > From: Jakub Kicinski <kuba@kernel.org> > Date: Mon, 27 Jun 2022 10:58:59 -0700 > > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote: > > > Kuniyuki Iwashima <kuniyu@amazon.com> writes: > > > > > > > While setting up init_net's sysctl table, we need not duplicate the global > > > > table and can use it directly. > > > > > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> > > > > > > I am not quite certain the savings of a single entry table justivies > > > the complexity. But the looks correct. > > > > Yeah, the commit message is a little sparse. The "why" is not addressed. > > Could you add more details to explain the motivation? > > I was working on a series which converts UDP/TCP hash tables into per-netns > ones like AF_UNIX to speed up looking up sockets. It will consume much > memory on a host with thousands of netns, but it can be waste if we do not > have its protocol family's sockets. For the record, I doubt we will accept such a patch (per net-ns TCP/UDP hash tables) > > So, I'm now working on a follow-up series for AF_UNIX per-netns hash table > so that we can change the size for a child netns by a sysctl knob: > > # sysctl -w net.unix.child_hash_entries=128 > # ip net add test # created with the hash table size 128 > # ip net exec test sh > # sysctl net.unix.hash_entries # read-only > 128 > > (The size for init_net can be changed via a new boot parameter > xhash_entries like uhash_entries/thash_entries.) > > While implementing that, I found that kmemdup() is called for init_net but > TCP/UDP does not (See: ipv4_sysctl_init_net()). Unlike IPv4, AF_UNIX does > not have a huge sysctl table, so it cannot be a problem though, this patch > is for consuming less memory and kind of consistency. The reason I submit > this seperately is that it might be better to have a Fixes tag. I think that af_unix module can be unloaded. Your patch will break the module unload operation.
From: Eric Dumazet <edumazet@google.com> Date: Mon, 27 Jun 2022 20:40:24 +0200 > On Mon, Jun 27, 2022 at 8:30 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > From: Jakub Kicinski <kuba@kernel.org> > > Date: Mon, 27 Jun 2022 10:58:59 -0700 > > > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote: > > > > Kuniyuki Iwashima <kuniyu@amazon.com> writes: > > > > > > > > > While setting up init_net's sysctl table, we need not duplicate the global > > > > > table and can use it directly. > > > > > > > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> > > > > > > > > I am not quite certain the savings of a single entry table justivies > > > > the complexity. But the looks correct. > > > > > > Yeah, the commit message is a little sparse. The "why" is not addressed. > > > Could you add more details to explain the motivation? > > > > I was working on a series which converts UDP/TCP hash tables into per-netns > > ones like AF_UNIX to speed up looking up sockets. It will consume much > > memory on a host with thousands of netns, but it can be waste if we do not > > have its protocol family's sockets. > > For the record, I doubt we will accept such a patch (per net-ns > TCP/UDP hash tables) Is it because it's risky? IIRC, you said we need per netns table for TCP in the future. > > So, I'm now working on a follow-up series for AF_UNIX per-netns hash table > > so that we can change the size for a child netns by a sysctl knob: > > > > # sysctl -w net.unix.child_hash_entries=128 > > # ip net add test # created with the hash table size 128 > > # ip net exec test sh > > # sysctl net.unix.hash_entries # read-only > > 128 > > > > (The size for init_net can be changed via a new boot parameter > > xhash_entries like uhash_entries/thash_entries.) > > > > While implementing that, I found that kmemdup() is called for init_net but > > TCP/UDP does not (See: ipv4_sysctl_init_net()). Unlike IPv4, AF_UNIX does > > not have a huge sysctl table, so it cannot be a problem though, this patch > > is for consuming less memory and kind of consistency. The reason I submit > > this seperately is that it might be better to have a Fixes tag. > > I think that af_unix module can be unloaded. > > Your patch will break the module unload operation. Thank you! I had to take of kfree() in unix_sysctl_unregister().
On Mon, Jun 27, 2022 at 8:59 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > From: Eric Dumazet <edumazet@google.com> > Date: Mon, 27 Jun 2022 20:40:24 +0200 > > On Mon, Jun 27, 2022 at 8:30 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > > > From: Jakub Kicinski <kuba@kernel.org> > > > Date: Mon, 27 Jun 2022 10:58:59 -0700 > > > > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote: > > > > > Kuniyuki Iwashima <kuniyu@amazon.com> writes: > > > > > > > > > > > While setting up init_net's sysctl table, we need not duplicate the global > > > > > > table and can use it directly. > > > > > > > > > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> > > > > > > > > > > I am not quite certain the savings of a single entry table justivies > > > > > the complexity. But the looks correct. > > > > > > > > Yeah, the commit message is a little sparse. The "why" is not addressed. > > > > Could you add more details to explain the motivation? > > > > > > I was working on a series which converts UDP/TCP hash tables into per-netns > > > ones like AF_UNIX to speed up looking up sockets. It will consume much > > > memory on a host with thousands of netns, but it can be waste if we do not > > > have its protocol family's sockets. > > > > For the record, I doubt we will accept such a patch (per net-ns > > TCP/UDP hash tables) > > Is it because it's risky? Because it will be very expensive. TCP hash tables are quite big. [ 4.917080] tcp_listen_portaddr_hash hash table entries: 65536 (order: 8, 1048576 bytes, vmalloc) [ 4.917260] TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage) [ 4.917760] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes, vmalloc) [ 4.917881] TCP: Hash tables configured (established 524288 bind 65536) > IIRC, you said we need per netns table for TCP in the future. Which ones exactly ? I guess you misunderstood. > > > > > So, I'm now working on a follow-up series for AF_UNIX per-netns hash table > > > so that we can change the size for a child netns by a sysctl knob: > > > > > > # sysctl -w net.unix.child_hash_entries=128 > > > # ip net add test # created with the hash table size 128 > > > # ip net exec test sh > > > # sysctl net.unix.hash_entries # read-only > > > 128 > > > > > > (The size for init_net can be changed via a new boot parameter > > > xhash_entries like uhash_entries/thash_entries.) > > > > > > While implementing that, I found that kmemdup() is called for init_net but > > > TCP/UDP does not (See: ipv4_sysctl_init_net()). Unlike IPv4, AF_UNIX does > > > not have a huge sysctl table, so it cannot be a problem though, this patch > > > is for consuming less memory and kind of consistency. The reason I submit > > > this seperately is that it might be better to have a Fixes tag. > > > > I think that af_unix module can be unloaded. > > > > Your patch will break the module unload operation. > > Thank you! > I had to take of kfree() in unix_sysctl_unregister().
From: Eric Dumazet <edumazet@google.com> Date: Mon, 27 Jun 2022 21:06:14 +0200 > On Mon, Jun 27, 2022 at 8:59 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > From: Eric Dumazet <edumazet@google.com> > > Date: Mon, 27 Jun 2022 20:40:24 +0200 > > > On Mon, Jun 27, 2022 at 8:30 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > > > > > From: Jakub Kicinski <kuba@kernel.org> > > > > Date: Mon, 27 Jun 2022 10:58:59 -0700 > > > > > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote: > > > > > > Kuniyuki Iwashima <kuniyu@amazon.com> writes: > > > > > > > > > > > > > While setting up init_net's sysctl table, we need not duplicate the global > > > > > > > table and can use it directly. > > > > > > > > > > > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> > > > > > > > > > > > > I am not quite certain the savings of a single entry table justivies > > > > > > the complexity. But the looks correct. > > > > > > > > > > Yeah, the commit message is a little sparse. The "why" is not addressed. > > > > > Could you add more details to explain the motivation? > > > > > > > > I was working on a series which converts UDP/TCP hash tables into per-netns > > > > ones like AF_UNIX to speed up looking up sockets. It will consume much > > > > memory on a host with thousands of netns, but it can be waste if we do not > > > > have its protocol family's sockets. > > > > > > For the record, I doubt we will accept such a patch (per net-ns > > > TCP/UDP hash tables) > > > > Is it because it's risky? > > Because it will be very expensive. TCP hash tables are quite big. Yes, so I'm wondering if changing the size by sysctl makes sense. If we have per-netns hash tables, each table should have smaller amount of sockets and smaller size should be enough, I think. > > [ 4.917080] tcp_listen_portaddr_hash hash table entries: 65536 > (order: 8, 1048576 bytes, vmalloc) > [ 4.917260] TCP established hash table entries: 524288 (order: 10, > 4194304 bytes, vmalloc hugepage) > [ 4.917760] TCP bind hash table entries: 65536 (order: 8, 1048576 > bytes, vmalloc) > [ 4.917881] TCP: Hash tables configured (established 524288 bind 65536) > > > > > IIRC, you said we need per netns table for TCP in the future. > > Which ones exactly ? I guess you misunderstood. I think this. https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=04c494e68a13
On Mon, Jun 27, 2022 at 9:16 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > From: Eric Dumazet <edumazet@google.com> > Date: Mon, 27 Jun 2022 21:06:14 +0200 > > On Mon, Jun 27, 2022 at 8:59 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > > > From: Eric Dumazet <edumazet@google.com> > > > Date: Mon, 27 Jun 2022 20:40:24 +0200 > > > > On Mon, Jun 27, 2022 at 8:30 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > > > > > > > From: Jakub Kicinski <kuba@kernel.org> > > > > > Date: Mon, 27 Jun 2022 10:58:59 -0700 > > > > > > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote: > > > > > > > Kuniyuki Iwashima <kuniyu@amazon.com> writes: > > > > > > > > > > > > > > > While setting up init_net's sysctl table, we need not duplicate the global > > > > > > > > table and can use it directly. > > > > > > > > > > > > > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> > > > > > > > > > > > > > > I am not quite certain the savings of a single entry table justivies > > > > > > > the complexity. But the looks correct. > > > > > > > > > > > > Yeah, the commit message is a little sparse. The "why" is not addressed. > > > > > > Could you add more details to explain the motivation? > > > > > > > > > > I was working on a series which converts UDP/TCP hash tables into per-netns > > > > > ones like AF_UNIX to speed up looking up sockets. It will consume much > > > > > memory on a host with thousands of netns, but it can be waste if we do not > > > > > have its protocol family's sockets. > > > > > > > > For the record, I doubt we will accept such a patch (per net-ns > > > > TCP/UDP hash tables) > > > > > > Is it because it's risky? > > > > Because it will be very expensive. TCP hash tables are quite big. > > Yes, so I'm wondering if changing the size by sysctl makes sense. If we > have per-netns hash tables, each table should have smaller amount of > sockets and smaller size should be enough, I think. How can a sysctl be safely used if two different threads call "unshare -n" at the same time ? > > > > > [ 4.917080] tcp_listen_portaddr_hash hash table entries: 65536 > > (order: 8, 1048576 bytes, vmalloc) > > [ 4.917260] TCP established hash table entries: 524288 (order: 10, > > 4194304 bytes, vmalloc hugepage) > > [ 4.917760] TCP bind hash table entries: 65536 (order: 8, 1048576 > > bytes, vmalloc) > > [ 4.917881] TCP: Hash tables configured (established 524288 bind 65536) > > > > > > > > > IIRC, you said we need per netns table for TCP in the future. > > > > Which ones exactly ? I guess you misunderstood. > > I think this. > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=04c494e68a13 "might" is very different than "will" I would rather use the list of time_wait, instead of adding huge memory costs for hosts with hundreds of netns.
From: Eric Dumazet <edumazet@google.com> Date: Mon, 27 Jun 2022 21:36:18 +0200 > On Mon, Jun 27, 2022 at 9:16 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > From: Eric Dumazet <edumazet@google.com> > > Date: Mon, 27 Jun 2022 21:06:14 +0200 > > > On Mon, Jun 27, 2022 at 8:59 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > > > > > From: Eric Dumazet <edumazet@google.com> > > > > Date: Mon, 27 Jun 2022 20:40:24 +0200 > > > > > On Mon, Jun 27, 2022 at 8:30 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > > > > > > > > > From: Jakub Kicinski <kuba@kernel.org> > > > > > > Date: Mon, 27 Jun 2022 10:58:59 -0700 > > > > > > > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote: > > > > > > > > Kuniyuki Iwashima <kuniyu@amazon.com> writes: > > > > > > > > > > > > > > > > > While setting up init_net's sysctl table, we need not duplicate the global > > > > > > > > > table and can use it directly. > > > > > > > > > > > > > > > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> > > > > > > > > > > > > > > > > I am not quite certain the savings of a single entry table justivies > > > > > > > > the complexity. But the looks correct. > > > > > > > > > > > > > > Yeah, the commit message is a little sparse. The "why" is not addressed. > > > > > > > Could you add more details to explain the motivation? > > > > > > > > > > > > I was working on a series which converts UDP/TCP hash tables into per-netns > > > > > > ones like AF_UNIX to speed up looking up sockets. It will consume much > > > > > > memory on a host with thousands of netns, but it can be waste if we do not > > > > > > have its protocol family's sockets. > > > > > > > > > > For the record, I doubt we will accept such a patch (per net-ns > > > > > TCP/UDP hash tables) > > > > > > > > Is it because it's risky? > > > > > > Because it will be very expensive. TCP hash tables are quite big. > > > > Yes, so I'm wondering if changing the size by sysctl makes sense. If we > > have per-netns hash tables, each table should have smaller amount of > > sockets and smaller size should be enough, I think. > > How can a sysctl be safely used if two different threads call "unshare > -n" at the same time ? I think two unshare are safe. Each of them reads its parent netns's sysctl knob. Even when the parent is the same, they can read the same value. But I think we need READ_ONCE()/WRITE_ONCE() in such a sysctl. While creating a child netns, another one can change the value and there can be a data-race. So we have to use custome handler and pass write/read handler as conv of do_proc_douintvec(), like do_proc_douintvec_conv_lockless(). If there are some sysctls missing READ_ONCE/WRITE_ONCE(), I will add more general one, proc_douintvec_lockless(). > > > [ 4.917080] tcp_listen_portaddr_hash hash table entries: 65536 > > > (order: 8, 1048576 bytes, vmalloc) > > > [ 4.917260] TCP established hash table entries: 524288 (order: 10, > > > 4194304 bytes, vmalloc hugepage) > > > [ 4.917760] TCP bind hash table entries: 65536 (order: 8, 1048576 > > > bytes, vmalloc) > > > [ 4.917881] TCP: Hash tables configured (established 524288 bind 65536) > > > > > > > > > > > > > IIRC, you said we need per netns table for TCP in the future. > > > > > > Which ones exactly ? I guess you misunderstood. > > > > I think this. > > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=04c494e68a13 > > "might" is very different than "will" > > I would rather use the list of time_wait, instead of adding huge > memory costs for hosts with hundreds of netns. Sorry, my bad. I would give it a try only for TIME_WAIT.
On Mon, Jun 27, 2022 at 10:00 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > From: Eric Dumazet <edumazet@google.com> > Date: Mon, 27 Jun 2022 21:36:18 +0200 > > On Mon, Jun 27, 2022 at 9:16 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > > > From: Eric Dumazet <edumazet@google.com> > > > Date: Mon, 27 Jun 2022 21:06:14 +0200 > > > > On Mon, Jun 27, 2022 at 8:59 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > > > > > > > From: Eric Dumazet <edumazet@google.com> > > > > > Date: Mon, 27 Jun 2022 20:40:24 +0200 > > > > > > On Mon, Jun 27, 2022 at 8:30 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > > > > > > > > > > > From: Jakub Kicinski <kuba@kernel.org> > > > > > > > Date: Mon, 27 Jun 2022 10:58:59 -0700 > > > > > > > > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote: > > > > > > > > > Kuniyuki Iwashima <kuniyu@amazon.com> writes: > > > > > > > > > > > > > > > > > > > While setting up init_net's sysctl table, we need not duplicate the global > > > > > > > > > > table and can use it directly. > > > > > > > > > > > > > > > > > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> > > > > > > > > > > > > > > > > > > I am not quite certain the savings of a single entry table justivies > > > > > > > > > the complexity. But the looks correct. > > > > > > > > > > > > > > > > Yeah, the commit message is a little sparse. The "why" is not addressed. > > > > > > > > Could you add more details to explain the motivation? > > > > > > > > > > > > > > I was working on a series which converts UDP/TCP hash tables into per-netns > > > > > > > ones like AF_UNIX to speed up looking up sockets. It will consume much > > > > > > > memory on a host with thousands of netns, but it can be waste if we do not > > > > > > > have its protocol family's sockets. > > > > > > > > > > > > For the record, I doubt we will accept such a patch (per net-ns > > > > > > TCP/UDP hash tables) > > > > > > > > > > Is it because it's risky? > > > > > > > > Because it will be very expensive. TCP hash tables are quite big. > > > > > > Yes, so I'm wondering if changing the size by sysctl makes sense. If we > > > have per-netns hash tables, each table should have smaller amount of > > > sockets and smaller size should be enough, I think. > > > > How can a sysctl be safely used if two different threads call "unshare > > -n" at the same time ? > > I think two unshare are safe. Each of them reads its parent netns's sysctl > knob. Even when the parent is the same, they can read the same value. How can one thread create a netns with a TCP ehash table with 1024 buckets, and a second one create a netns with a TCP ehash table with 1 million buckets at the same time, if they share the same sysctl ??? > > But I think we need READ_ONCE()/WRITE_ONCE() in such a sysctl. Like all sysctls really. While > creating a child netns, another one can change the value and there can be > a data-race. So we have to use custome handler and pass write/read handler > as conv of do_proc_douintvec(), like do_proc_douintvec_conv_lockless(). > > If there are some sysctls missing READ_ONCE/WRITE_ONCE(), I will add > more general one, proc_douintvec_lockless(). Seriously, all sysctls can be set while being read. That is not something new. > > > > > > [ 4.917080] tcp_listen_portaddr_hash hash table entries: 65536 > > > > (order: 8, 1048576 bytes, vmalloc) > > > > [ 4.917260] TCP established hash table entries: 524288 (order: 10, > > > > 4194304 bytes, vmalloc hugepage) > > > > [ 4.917760] TCP bind hash table entries: 65536 (order: 8, 1048576 > > > > bytes, vmalloc) > > > > [ 4.917881] TCP: Hash tables configured (established 524288 bind 65536) > > > > > > > > > > > > > > > > > IIRC, you said we need per netns table for TCP in the future. > > > > > > > > Which ones exactly ? I guess you misunderstood. > > > > > > I think this. > > > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=04c494e68a13 > > > > "might" is very different than "will" > > > > I would rather use the list of time_wait, instead of adding huge > > memory costs for hosts with hundreds of netns. > > Sorry, my bad. > I would give it a try only for TIME_WAIT.
From: Eric Dumazet <edumazet@google.com> Date: Mon, 27 Jun 2022 22:04:07 +0200 > On Mon, Jun 27, 2022 at 10:00 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > From: Eric Dumazet <edumazet@google.com> > > Date: Mon, 27 Jun 2022 21:36:18 +0200 > > > On Mon, Jun 27, 2022 at 9:16 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > > > > > From: Eric Dumazet <edumazet@google.com> > > > > Date: Mon, 27 Jun 2022 21:06:14 +0200 > > > > > On Mon, Jun 27, 2022 at 8:59 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > > > > > > > > > From: Eric Dumazet <edumazet@google.com> > > > > > > Date: Mon, 27 Jun 2022 20:40:24 +0200 > > > > > > > On Mon, Jun 27, 2022 at 8:30 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > > > > > > > > > > > > > From: Jakub Kicinski <kuba@kernel.org> > > > > > > > > Date: Mon, 27 Jun 2022 10:58:59 -0700 > > > > > > > > > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote: > > > > > > > > > > Kuniyuki Iwashima <kuniyu@amazon.com> writes: > > > > > > > > > > > > > > > > > > > > > While setting up init_net's sysctl table, we need not duplicate the global > > > > > > > > > > > table and can use it directly. > > > > > > > > > > > > > > > > > > > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> > > > > > > > > > > > > > > > > > > > > I am not quite certain the savings of a single entry table justivies > > > > > > > > > > the complexity. But the looks correct. > > > > > > > > > > > > > > > > > > Yeah, the commit message is a little sparse. The "why" is not addressed. > > > > > > > > > Could you add more details to explain the motivation? > > > > > > > > > > > > > > > > I was working on a series which converts UDP/TCP hash tables into per-netns > > > > > > > > ones like AF_UNIX to speed up looking up sockets. It will consume much > > > > > > > > memory on a host with thousands of netns, but it can be waste if we do not > > > > > > > > have its protocol family's sockets. > > > > > > > > > > > > > > For the record, I doubt we will accept such a patch (per net-ns > > > > > > > TCP/UDP hash tables) > > > > > > > > > > > > Is it because it's risky? > > > > > > > > > > Because it will be very expensive. TCP hash tables are quite big. > > > > > > > > Yes, so I'm wondering if changing the size by sysctl makes sense. If we > > > > have per-netns hash tables, each table should have smaller amount of > > > > sockets and smaller size should be enough, I think. > > > > > > How can a sysctl be safely used if two different threads call "unshare > > > -n" at the same time ? > > > > I think two unshare are safe. Each of them reads its parent netns's sysctl > > knob. Even when the parent is the same, they can read the same value. > > How can one thread create a netns with a TCP ehash table with 1024 buckets, > and a second one create a netns with a TCP ehash table with 1 million > buckets at the same time, > if they share the same sysctl ??? Oh, I undertood. In the example, I added net.unix.hash_entries so we can confirm if the size is intended one, but yes, checking it and recreating netns is crazy... # sysctl -w net.unix.child_hash_entries=128 # ip net add test # created with the hash table size 128 # ip net exec test sh # sysctl net.unix.hash_entries # read-only 128 Do you have good idea? > > > > > But I think we need READ_ONCE()/WRITE_ONCE() in such a sysctl. > > Like all sysctls really. > > While > > creating a child netns, another one can change the value and there can be > > a data-race. So we have to use custome handler and pass write/read handler > > as conv of do_proc_douintvec(), like do_proc_douintvec_conv_lockless(). > > > > If there are some sysctls missing READ_ONCE/WRITE_ONCE(), I will add > > more general one, proc_douintvec_lockless(). > > Seriously, all sysctls can be set while being read. That is not something new. Ok, I added that on TODO.
diff --git a/net/unix/sysctl_net_unix.c b/net/unix/sysctl_net_unix.c index 01d44e2598e2..4bd856d05135 100644 --- a/net/unix/sysctl_net_unix.c +++ b/net/unix/sysctl_net_unix.c @@ -26,11 +26,16 @@ int __net_init unix_sysctl_register(struct net *net) { struct ctl_table *table; - table = kmemdup(unix_table, sizeof(unix_table), GFP_KERNEL); - if (table == NULL) - goto err_alloc; + if (net_eq(net, &init_net)) { + table = unix_table; + } else { + table = kmemdup(unix_table, sizeof(unix_table), GFP_KERNEL); + if (!table) + goto err_alloc; + + table[0].data = &net->unx.sysctl_max_dgram_qlen; + } - table[0].data = &net->unx.sysctl_max_dgram_qlen; net->unx.ctl = register_net_sysctl(net, "net/unix", table); if (net->unx.ctl == NULL) goto err_reg; @@ -38,7 +43,8 @@ int __net_init unix_sysctl_register(struct net *net) return 0; err_reg: - kfree(table); + if (net_eq(net, &init_net)) + kfree(table); err_alloc: return -ENOMEM; }
While setting up init_net's sysctl table, we need not duplicate the global table and can use it directly. Fixes: 1597fbc0faf8 ("[UNIX]: Make the unix sysctl tables per-namespace") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> --- v2: * Fix NULL comparison style by checkpatch.pl v1: https://lore.kernel.org/all/20220626074454.28944-1-kuniyu@amazon.com/ --- --- net/unix/sysctl_net_unix.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-)