Message ID | 159611007271.535980.15362304262237658692.stgit@localhost.localdomain (mailing list archive) |
---|---|
Series | proc: Introduce /proc/namespaces/ directory to expose namespaces lineary |
On Thu, Jul 30, 2020 at 02:59:20PM +0300, Kirill Tkhai wrote:
> Currently, there is no way to list or iterate all or a subset of namespaces
> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
> but some may also exist as open files, which are not attached to a process.
> When an open namespace fd is sent over a unix socket and then closed, it is
> impossible to know whether the namespace exists or not.
>
> Also, even if a namespace is exposed as attached to a process or as an open file,
> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
> the cost multiplies with the number of tasks and fds.
>
> This patchset introduces a new /proc/namespaces/ directory, which exposes
> a subset of permitted namespaces in a linear view:
>
> # ls /proc/namespaces/ -l
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'cgroup:[4026531835]' -> 'cgroup:[4026531835]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'ipc:[4026531839]' -> 'ipc:[4026531839]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531840]' -> 'mnt:[4026531840]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531861]' -> 'mnt:[4026531861]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532133]' -> 'mnt:[4026532133]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532134]' -> 'mnt:[4026532134]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532135]' -> 'mnt:[4026532135]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532136]' -> 'mnt:[4026532136]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'net:[4026531993]' -> 'net:[4026531993]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'pid:[4026531836]' -> 'pid:[4026531836]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'time:[4026531834]' -> 'time:[4026531834]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'user:[4026531837]' -> 'user:[4026531837]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'uts:[4026531838]' -> 'uts:[4026531838]'
>
> Namespace ns is exposed in case its user_ns is permitted from /proc's pid_ns.
> I.e., /proc is related to pid_ns, so in /proc/namespaces we show only a ns which is
>
>     in_userns(pid_ns->user_ns, ns->user_ns).
>
> In case ns is a user_ns:
>
>     in_userns(pid_ns->user_ns, ns).
>
> The patchset follows these steps:
>
> 1) A generic counter in ns_common is introduced instead of separate
>    counters for every ns type (net::count, uts_namespace::kref,
>    user_namespace::count, etc). Patches [1-8];
> 2) Patch [9] introduces IDR to link and iterate alive namespaces;
> 3) Patch [10] is refactoring;
> 4) Patch [11] actually adds /proc/namespaces directory and fs methods;
> 5) Patches [12-23] make every namespace use the added methods
>    and appear in /proc/namespaces directory.
>
> This may be useful to write effective debug utils (say, fast build
> of network topology) and checkpoint/restore software.

Kirill,

Thanks for working on this!
We have a need for this functionality too for namespace introspection.
I actually had a prototype of this as well but mine was based on debugfs
but /proc/namespaces seems like a good place.

Christian
[Cc: linux-api]

On Thu, Jul 30, 2020 at 03:08:53PM +0200, Christian Brauner wrote:
> On Thu, Jul 30, 2020 at 02:59:20PM +0300, Kirill Tkhai wrote:
> > Currently, there is no way to list or iterate all or a subset of namespaces
> > in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
> > but some may also exist as open files, which are not attached to a process.
> > When an open namespace fd is sent over a unix socket and then closed, it is
> > impossible to know whether the namespace exists or not.
> >
> > Also, even if a namespace is exposed as attached to a process or as an open file,
> > iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
> > the cost multiplies with the number of tasks and fds.
> >
> > This patchset introduces a new /proc/namespaces/ directory, which exposes
> > a subset of permitted namespaces in a linear view:
> >
> > # ls /proc/namespaces/ -l
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'cgroup:[4026531835]' -> 'cgroup:[4026531835]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'ipc:[4026531839]' -> 'ipc:[4026531839]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531840]' -> 'mnt:[4026531840]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531861]' -> 'mnt:[4026531861]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532133]' -> 'mnt:[4026532133]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532134]' -> 'mnt:[4026532134]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532135]' -> 'mnt:[4026532135]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532136]' -> 'mnt:[4026532136]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'net:[4026531993]' -> 'net:[4026531993]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'pid:[4026531836]' -> 'pid:[4026531836]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'time:[4026531834]' -> 'time:[4026531834]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'user:[4026531837]' -> 'user:[4026531837]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'uts:[4026531838]' -> 'uts:[4026531838]'
> >
> > Namespace ns is exposed in case its user_ns is permitted from /proc's pid_ns.
> > I.e., /proc is related to pid_ns, so in /proc/namespaces we show only a ns which is
> >
> >     in_userns(pid_ns->user_ns, ns->user_ns).
> >
> > In case ns is a user_ns:
> >
> >     in_userns(pid_ns->user_ns, ns).
> >
> > The patchset follows these steps:
> >
> > 1) A generic counter in ns_common is introduced instead of separate
> >    counters for every ns type (net::count, uts_namespace::kref,
> >    user_namespace::count, etc). Patches [1-8];
> > 2) Patch [9] introduces IDR to link and iterate alive namespaces;
> > 3) Patch [10] is refactoring;
> > 4) Patch [11] actually adds /proc/namespaces directory and fs methods;
> > 5) Patches [12-23] make every namespace use the added methods
> >    and appear in /proc/namespaces directory.
> >
> > This may be useful to write effective debug utils (say, fast build
> > of network topology) and checkpoint/restore software.
>
> Kirill,
>
> Thanks for working on this!
> We have a need for this functionality too for namespace introspection.
> I actually had a prototype of this as well but mine was based on debugfs
> but /proc/namespaces seems like a good place.
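The visibility rule in the cover letter can be modeled outside the kernel. The following is a minimal Python sketch over a toy namespace tree, not kernel code: the classes `UserNs` and `Ns` and the helper `visible_in_proc` are hypothetical names introduced for illustration, and `proc_user_ns` stands in for `pid_ns->user_ns`. Only the `in_userns(ancestor, child)` check mirrors the rule as stated in the cover letter.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(eq=False)
class UserNs:
    """Toy user namespace; only the parent link matters (hypothetical model)."""
    name: str
    parent: Optional["UserNs"] = None

    @property
    def level(self) -> int:
        # Depth in the user namespace tree; the initial namespace is level 0.
        return 0 if self.parent is None else self.parent.level + 1


@dataclass(eq=False)
class Ns:
    """Toy non-user namespace (net, mnt, ...) owned by a user namespace."""
    kind: str
    user_ns: UserNs


def in_userns(ancestor: UserNs, child: UserNs) -> bool:
    """Return True if child is ancestor itself or one of its descendants."""
    ns = child
    # Walk up from child until we reach ancestor's depth, then compare identity.
    while ns.level > ancestor.level:
        ns = ns.parent
    return ns is ancestor


def visible_in_proc(proc_user_ns: UserNs, ns: Union[Ns, UserNs]) -> bool:
    """Apply the cover letter's rule: check ns itself if it is a user_ns,
    otherwise check the user_ns that owns it."""
    if isinstance(ns, UserNs):
        return in_userns(proc_user_ns, ns)
    return in_userns(proc_user_ns, ns.user_ns)


init_user = UserNs("init")
container = UserNs("container", parent=init_user)
nested = UserNs("nested", parent=container)

print(visible_in_proc(container, Ns("net", nested)))     # True: owner is a descendant
print(visible_in_proc(container, Ns("net", init_user)))  # False: owner is an ancestor
```

Under this model a /proc instance mounted inside a container lists namespaces owned by the container's user namespace and its descendants, but none owned by the host.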
Kirill Tkhai <ktkhai@virtuozzo.com> writes:

> Currently, there is no way to list or iterate all or a subset of namespaces
> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
> but some may also exist as open files, which are not attached to a process.
> When an open namespace fd is sent over a unix socket and then closed, it is
> impossible to know whether the namespace exists or not.
>
> Also, even if a namespace is exposed as attached to a process or as an open file,
> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
> the cost multiplies with the number of tasks and fds.

I am very dubious about this.

I have been avoiding exactly this kind of interface because it can
create rather fundamental problems with checkpoint restart.

You do have some filtering and the filtering is not based on current.
Which is good.

A view that is relative to a user namespace might be ok. It almost
certainly does better as its own little filesystem than as an extension
to proc though.

The big thing we want to ensure is that if you migrate you can restore
everything. I don't see how you will be able to restore these files
after migration. Anything like this without having a complete
checkpoint/restore story is a non-starter.

Further by not going through the processes it looks like you are
bypassing the existing permission checks. Which has the potential
to allow someone to use a namespace who would not be able to otherwise.

So I think this goes one step too far but I am willing to be persuaded
otherwise.

Eric

> This patchset introduces a new /proc/namespaces/ directory, which exposes
> a subset of permitted namespaces in a linear view:
>
> # ls /proc/namespaces/ -l
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'cgroup:[4026531835]' -> 'cgroup:[4026531835]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'ipc:[4026531839]' -> 'ipc:[4026531839]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531840]' -> 'mnt:[4026531840]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531861]' -> 'mnt:[4026531861]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532133]' -> 'mnt:[4026532133]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532134]' -> 'mnt:[4026532134]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532135]' -> 'mnt:[4026532135]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532136]' -> 'mnt:[4026532136]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'net:[4026531993]' -> 'net:[4026531993]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'pid:[4026531836]' -> 'pid:[4026531836]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'time:[4026531834]' -> 'time:[4026531834]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'user:[4026531837]' -> 'user:[4026531837]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'uts:[4026531838]' -> 'uts:[4026531838]'
>
> Namespace ns is exposed in case its user_ns is permitted from /proc's pid_ns.
> I.e., /proc is related to pid_ns, so in /proc/namespaces we show only a ns which is
>
>     in_userns(pid_ns->user_ns, ns->user_ns).
>
> In case ns is a user_ns:
>
>     in_userns(pid_ns->user_ns, ns).
>
> The patchset follows these steps:
>
> 1) A generic counter in ns_common is introduced instead of separate
>    counters for every ns type (net::count, uts_namespace::kref,
>    user_namespace::count, etc). Patches [1-8];
> 2) Patch [9] introduces IDR to link and iterate alive namespaces;
> 3) Patch [10] is refactoring;
> 4) Patch [11] actually adds /proc/namespaces directory and fs methods;
> 5) Patches [12-23] make every namespace use the added methods
>    and appear in /proc/namespaces directory.
>
> This may be useful to write effective debug utils (say, fast build
> of network topology) and checkpoint/restore software.
> ---
>
> Kirill Tkhai (23):
>       ns: Add common refcount into ns_common add use it as counter for net_ns
>       uts: Use generic ns_common::count
>       ipc: Use generic ns_common::count
>       pid: Use generic ns_common::count
>       user: Use generic ns_common::count
>       mnt: Use generic ns_common::count
>       cgroup: Use generic ns_common::count
>       time: Use generic ns_common::count
>       ns: Introduce ns_idr to be able to iterate all allocated namespaces in the system
>       fs: Rename fs/proc/namespaces.c into fs/proc/task_namespaces.c
>       fs: Add /proc/namespaces/ directory
>       user: Free user_ns one RCU grace period after final counter put
>       user: Add user namespaces into ns_idr
>       net: Add net namespaces into ns_idr
>       pid: Eextract child_reaper check from pidns_for_children_get()
>       proc_ns_operations: Add can_get method
>       pid: Add pid namespaces into ns_idr
>       uts: Free uts namespace one RCU grace period after final counter put
>       uts: Add uts namespaces into ns_idr
>       ipc: Add ipc namespaces into ns_idr
>       mnt: Add mount namespaces into ns_idr
>       cgroup: Add cgroup namespaces into ns_idr
>       time: Add time namespaces into ns_idr
>
>
>  fs/mount.h                     |   4
>  fs/namespace.c                 |  14 +
>  fs/nsfs.c                      |  78 ++++++++
>  fs/proc/Makefile               |   1
>  fs/proc/internal.h             |  18 +-
>  fs/proc/namespaces.c           | 382 +++++++++++++++++++++++++++-------------
>  fs/proc/root.c                 |  17 ++
>  fs/proc/task_namespaces.c      | 183 +++++++++++++++++++
>  include/linux/cgroup.h         |   6 -
>  include/linux/ipc_namespace.h  |   3
>  include/linux/ns_common.h      |  11 +
>  include/linux/pid_namespace.h  |   4
>  include/linux/proc_fs.h        |   1
>  include/linux/proc_ns.h        |  12 +
>  include/linux/time_namespace.h |  10 +
>  include/linux/user_namespace.h |  10 +
>  include/linux/utsname.h        |  10 +
>  include/net/net_namespace.h    |  11 -
>  init/version.c                 |   2
>  ipc/msgutil.c                  |   2
>  ipc/namespace.c                |  17 +-
>  ipc/shm.c                      |   1
>  kernel/cgroup/cgroup.c         |   2
>  kernel/cgroup/namespace.c      |  25 ++-
>  kernel/pid.c                   |   2
>  kernel/pid_namespace.c         |  46 +++--
>  kernel/time/namespace.c        |  20 +-
>  kernel/user.c                  |   2
>  kernel/user_namespace.c        |  23 ++
>  kernel/utsname.c               |  23 ++
>  net/core/net-sysfs.c           |   6 -
>  net/core/net_namespace.c       |  18 +-
>  net/ipv4/inet_timewait_sock.c  |   4
>  net/ipv4/tcp_metrics.c         |   2
>  34 files changed, 746 insertions(+), 224 deletions(-)
>  create mode 100644 fs/proc/task_namespaces.c
>
> --
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
On Thu, Jul 30, 2020 at 09:34:01AM -0500, Eric W. Biederman wrote:
> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>
> > Currently, there is no way to list or iterate all or a subset of namespaces
> > in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
> > but some may also exist as open files, which are not attached to a process.
> > When an open namespace fd is sent over a unix socket and then closed, it is
> > impossible to know whether the namespace exists or not.
> >
> > Also, even if a namespace is exposed as attached to a process or as an open file,
> > iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
> > the cost multiplies with the number of tasks and fds.
>
> I am very dubious about this.
>
> I have been avoiding exactly this kind of interface because it can
> create rather fundamental problems with checkpoint restart.
>
> You do have some filtering and the filtering is not based on current.
> Which is good.
>
> A view that is relative to a user namespace might be ok. It almost
> certainly does better as its own little filesystem than as an extension
> to proc though.
>
> The big thing we want to ensure is that if you migrate you can restore
> everything. I don't see how you will be able to restore these files
> after migration. Anything like this without having a complete
> checkpoint/restore story is a non-starter.
>
> Further by not going through the processes it looks like you are
> bypassing the existing permission checks. Which has the potential
> to allow someone to use a namespace who would not be able to otherwise.
>
> So I think this goes one step too far but I am willing to be persuaded
> otherwise.

I think we discussed this at Plumbers (last year I want to say?) and
you were against making this a part of procfs already back then, I
think. The last known idea we could agree on was debugfs (shudder).
But a tiny separate fs might work as well.

We really would want those introspection abilities this provides though.
For us it was for debugging when namespaces linger and also to crawl
and inspect namespaces from LXD and various other use-cases. So if we
could make this happen in some form that'd be great.

Thanks!
Christian
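For reference, the per-task crawl that the cover letter describes as slow looks roughly like this from userspace. This is a minimal Python sketch using only `os.listdir` and `os.readlink`; the function name `crawl_task_namespaces` and its `proc_root` parameter are hypothetical. It covers only /proc/[pid]/ns/, so a namespace that exists solely as an open file is missed, and also scanning /proc/[pid]/fd/ would add a second nested loop per task.

```python
import os


def crawl_task_namespaces(proc_root="/proc"):
    """Collect namespace link targets such as 'mnt:[4026531840]' by walking
    every <pid>/ns/ directory under proc_root.

    Returns (unique_targets, readlink_calls) so the amount of work done
    can be compared with the number of distinct namespaces found.
    """
    seen = set()
    lookups = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a task directory
        ns_dir = os.path.join(proc_root, entry, "ns")
        try:
            names = os.listdir(ns_dir)
        except OSError:
            continue  # task exited meanwhile, or access was denied
        for name in names:
            try:
                target = os.readlink(os.path.join(ns_dir, name))
            except OSError:
                continue
            lookups += 1
            seen.add(target)
    return seen, lookups
```

On a live system `crawl_task_namespaces()` performs one readlink per task per namespace type even when nearly all tasks share the same namespaces, which is the multiplication the cover letter refers to.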
On 30.07.2020 17:34, Eric W. Biederman wrote:
> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>
>> Currently, there is no way to list or iterate all or a subset of namespaces
>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
>> but some may also exist as open files, which are not attached to a process.
>> When an open namespace fd is sent over a unix socket and then closed, it is
>> impossible to know whether the namespace exists or not.
>>
>> Also, even if a namespace is exposed as attached to a process or as an open file,
>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
>> the cost multiplies with the number of tasks and fds.
>
> I am very dubious about this.
>
> I have been avoiding exactly this kind of interface because it can
> create rather fundamental problems with checkpoint restart.

restart/restore :)

> You do have some filtering and the filtering is not based on current.
> Which is good.
>
> A view that is relative to a user namespace might be ok. It almost
> certainly does better as its own little filesystem than as an extension
> to proc though.
>
> The big thing we want to ensure is that if you migrate you can restore
> everything. I don't see how you will be able to restore these files
> after migration. Anything like this without having a complete
> checkpoint/restore story is a non-starter.

There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/.

CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files.
As a person who worked deeply on pid_ns and user_ns support in CRIU, I don't see any
problem here.

If you have specific worries, let's discuss them.

CC: Pavel Tikhomirov, CRIU maintainer, who knows everything about namespaces C/R.

> Further by not going through the processes it looks like you are
> bypassing the existing permission checks. Which has the potential
> to allow someone to use a namespace who would not be able to otherwise.

I agree, and I wrote to Christian, that permissions should be more strict.
This just should be formalized. Let's discuss this.

> So I think this goes one step too far but I am willing to be persuaded
> otherwise.
>
> Eric
Kirill Tkhai <ktkhai@virtuozzo.com> writes:

> On 30.07.2020 17:34, Eric W. Biederman wrote:
>> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>>
>>> Currently, there is no way to list or iterate all or a subset of namespaces
>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
>>> but some may also exist as open files, which are not attached to a process.
>>> When an open namespace fd is sent over a unix socket and then closed, it is
>>> impossible to know whether the namespace exists or not.
>>>
>>> Also, even if a namespace is exposed as attached to a process or as an open file,
>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
>>> the cost multiplies with the number of tasks and fds.
>>
>> I am very dubious about this.
>>
>> I have been avoiding exactly this kind of interface because it can
>> create rather fundamental problems with checkpoint restart.
>
> restart/restore :)
>
>> You do have some filtering and the filtering is not based on current.
>> Which is good.
>>
>> A view that is relative to a user namespace might be ok. It almost
>> certainly does better as its own little filesystem than as an extension
>> to proc though.
>>
>> The big thing we want to ensure is that if you migrate you can restore
>> everything. I don't see how you will be able to restore these files
>> after migration. Anything like this without having a complete
>> checkpoint/restore story is a non-starter.
>
> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/.
>
> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files.
> As a person who worked deeply on pid_ns and user_ns support in CRIU, I don't see any
> problem here.

An obvious difference is that you are adding the inode to
the file name. Which means that now you really do have to preserve the
inode numbers during process migration.

Which means now we have to do all of the work to make inode number
restoration possible. Which means now we need to have multiple
instances of nsfs so that we can restore inode numbers.

I think this is still possible but we have been delaying figuring out
how to restore inode numbers long enough that there may be actual technical
problems making it happen.

Now maybe CRIU can handle the names of the files changing during
migration but you have just increased the level of difficulty for doing
that.

> If you have specific worries, let's discuss them.

I was asking and I am asking that it be described in the patch
description how a container using this feature can be migrated
from one machine to another. This code is so close to being problematic
that we need to be very careful we don't fundamentally break CRIU while
trying to make its job simpler and easier.

> CC: Pavel Tikhomirov, CRIU maintainer, who knows everything about namespaces C/R.
>
>> Further by not going through the processes it looks like you are
>> bypassing the existing permission checks. Which has the potential
>> to allow someone to use a namespace who would not be able to otherwise.
>
> I agree, and I wrote to Christian, that permissions should be more strict.
> This just should be formalized. Let's discuss this.
>
>> So I think this goes one step too far but I am willing to be persuaded
>> otherwise.
>>

Eric
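To make the naming concern concrete: each entry in the proposed directory is named `<type>:[<inode>]`, as in the `ls` output from the cover letter. The following is a minimal Python sketch of splitting such a name; `parse_ns_link` and `NS_LINK_RE` are hypothetical names, and the second inode number in the usage lines is a made-up value standing in for a namespace recreated after restore.

```python
import re
from typing import Tuple

# Matches names such as "mnt:[4026531840]" (pattern is an assumption based
# on the listing in the cover letter).
NS_LINK_RE = re.compile(r"^([a-z_]+):\[(\d+)\]$")


def parse_ns_link(target: str) -> Tuple[str, int]:
    """Split '<type>:[<inode>]' into (type, inode); raise ValueError otherwise."""
    match = NS_LINK_RE.match(target)
    if match is None:
        raise ValueError("not a namespace link: %r" % (target,))
    return match.group(1), int(match.group(2))


dumped = parse_ns_link("mnt:[4026531840]")
restored = parse_ns_link("mnt:[4026532999]")  # made-up inode after restore

# Same namespace type, different inode: the file name itself has changed.
print(dumped[0] == restored[0], dumped[1] == restored[1])  # True False
```

Because the inode number is part of the name, any consumer that remembers names across a migration sees them change unless inode numbers are restored as well.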
On 7/31/20 1:13 AM, Eric W. Biederman wrote:
> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>
>> On 30.07.2020 17:34, Eric W. Biederman wrote:
>>> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>>>
>>>> Currently, there is no way to list or iterate all or a subset of namespaces
>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
>>>> but some may also exist as open files, which are not attached to a process.
>>>> When an open namespace fd is sent over a unix socket and then closed, it is
>>>> impossible to know whether the namespace exists or not.
>>>>
>>>> Also, even if a namespace is exposed as attached to a process or as an open file,
>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
>>>> the cost multiplies with the number of tasks and fds.
>>>
>>> I am very dubious about this.
>>>
>>> I have been avoiding exactly this kind of interface because it can
>>> create rather fundamental problems with checkpoint restart.
>>
>> restart/restore :)
>>
>>> You do have some filtering and the filtering is not based on current.
>>> Which is good.
>>>
>>> A view that is relative to a user namespace might be ok. It almost
>>> certainly does better as its own little filesystem than as an extension
>>> to proc though.
>>>
>>> The big thing we want to ensure is that if you migrate you can restore
>>> everything. I don't see how you will be able to restore these files
>>> after migration. Anything like this without having a complete
>>> checkpoint/restore story is a non-starter.
>>
>> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/.
>>
>> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files.
>> As a person who worked deeply on pid_ns and user_ns support in CRIU, I don't see any
>> problem here.
>
> An obvious difference is that you are adding the inode to
> the file name. Which means that now you really do have to preserve the
> inode numbers during process migration.
>
> Which means now we have to do all of the work to make inode number
> restoration possible. Which means now we need to have multiple
> instances of nsfs so that we can restore inode numbers.
>
> I think this is still possible but we have been delaying figuring out
> how to restore inode numbers long enough that there may be actual technical
> problems making it happen.
>
> Now maybe CRIU can handle the names of the files changing during
> migration but you have just increased the level of difficulty for doing
> that.

Yes, adding /proc/namespaces/<ns_name>:[<ns_ino>] files may be a problem
for CRIU.

First I would like to highlight that open files are not a problem. Open
files from /proc/namespaces/* are exactly the same as open files from
/proc/<pid>/ns/<ns_name>. So when we c/r an open nsfs file, on dump we
readlink the fd and get <ns_name>:[<ns_ino>], and on restore we recreate
each dumped namespace and open an fd to each, so we can 'dup' it when
restoring the open file. It will be an fd to the topologically same
namespace, though ns_ino would be newly generated.

But the problem I see is with readdir. What if some task is reading the
/proc/namespaces/ directory at the time of dump? After restore the
directory will contain new names for namespaces, possibly in a different
order, so if the process continues to readdir it can miss some namespaces
or read some twice.

Maybe instead of multiple files in the /proc/namespaces directory, we can
leave just one file /proc/namespaces, and when we open it we would return
e.g. a unix socket filled with all the fds of all namespaces visible at
this point. It looks like a possible solution to the above problem. CRIU
can restore unix sockets with fds inside, so we should be able to dump a
process using this functionality.

>
>> If you have specific worries, let's discuss them.
>
> I was asking and I am asking that it be described in the patch
> description how a container using this feature can be migrated
> from one machine to another. This code is so close to being problematic
> that we need to be very careful we don't fundamentally break CRIU while
> trying to make its job simpler and easier.
>
>> CC: Pavel Tikhomirov, CRIU maintainer, who knows everything about namespaces C/R.
>>
>>> Further by not going through the processes it looks like you are
>>> bypassing the existing permission checks. Which has the potential
>>> to allow someone to use a namespace who would not be able to otherwise.
>>
>> I agree, and I wrote to Christian, that permissions should be more strict.
>> This just should be formalized. Let's discuss this.
>>
>>> So I think this goes one step too far but I am willing to be persuaded
>>> otherwise.
>>>
>
> Eric
On 31.07.2020 01:13, Eric W. Biederman wrote:
> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>
>> On 30.07.2020 17:34, Eric W. Biederman wrote:
>>> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>>>
>>>> Currently, there is no way to list or iterate all or a subset of namespaces
>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
>>>> but some may also exist as open files, which are not attached to a process.
>>>> When an open namespace fd is sent over a unix socket and then closed, it is
>>>> impossible to know whether the namespace exists or not.
>>>>
>>>> Also, even if a namespace is exposed as attached to a process or as an open file,
>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
>>>> the cost multiplies with the number of tasks and fds.
>>>
>>> I am very dubious about this.
>>>
>>> I have been avoiding exactly this kind of interface because it can
>>> create rather fundamental problems with checkpoint restart.
>>
>> restart/restore :)
>>
>>> You do have some filtering and the filtering is not based on current.
>>> Which is good.
>>>
>>> A view that is relative to a user namespace might be ok. It almost
>>> certainly does better as its own little filesystem than as an extension
>>> to proc though.
>>>
>>> The big thing we want to ensure is that if you migrate you can restore
>>> everything. I don't see how you will be able to restore these files
>>> after migration. Anything like this without having a complete
>>> checkpoint/restore story is a non-starter.
>>
>> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/.
>>
>> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files.
>> As a person who worked deeply on pid_ns and user_ns support in CRIU, I don't see any
>> problem here.
>
> An obvious difference is that you are adding the inode to
> the file name. Which means that now you really do have to preserve the
> inode numbers during process migration.
>
> Which means now we have to do all of the work to make inode number
> restoration possible. Which means now we need to have multiple
> instances of nsfs so that we can restore inode numbers.
>
> I think this is still possible but we have been delaying figuring out
> how to restore inode numbers long enough that there may be actual technical
> problems making it happen.

Yeah, this matters. But it looks like this is not a dead end. We just need
to change the names under which namespaces are exported to a particular fs,
and to support rename().

Before introducing a principally new filesystem type for this, can't
this be solved in the current /proc?

Alexey, is rename() prohibited for /proc fs?

> Now maybe CRIU can handle the names of the files changing during
> migration but you have just increased the level of difficulty for doing
> that.
>
>> If you have specific worries, let's discuss them.
>
> I was asking and I am asking that it be described in the patch
> description how a container using this feature can be migrated
> from one machine to another. This code is so close to being problematic
> that we need to be very careful we don't fundamentally break CRIU while
> trying to make its job simpler and easier.
>
>> CC: Pavel Tikhomirov, CRIU maintainer, who knows everything about namespaces C/R.
>>
>>> Further by not going through the processes it looks like you are
>>> bypassing the existing permission checks. Which has the potential
>>> to allow someone to use a namespace who would not be able to otherwise.
>>
>> I agree, and I wrote to Christian, that permissions should be more strict.
>> This just should be formalized. Let's discuss this.
>>
>>> So I think this goes one step too far but I am willing to be persuaded
>>> otherwise.
>>>
>
> Eric
On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
> > > > An obvious diffference is that you are adding the inode to the inode to > > the file name. Which means that now you really do have to preserve the > > inode numbers during process migration. > > > > Which means now we have to do all of the work to make inode number > > restoration possible. Which means now we need to have multiple > > instances of nsfs so that we can restore inode numbers. > > > > I think this is still possible but we have been delaying figuring out > > how to restore inode numbers long enough that may be actual technical > > problems making it happen. > > Yeah, this matters. But it looks like here is not a dead end. We just need > change the names the namespaces are exported to particular fs and to support > rename(). > > Before introduction a principally new filesystem type for this, can't > this be solved in current /proc? > > Alexey, does rename() is prohibited for /proc fs?

Technically it is allowed: add ->rename to the /proc/ns inode. But nobody does it.
On Thu, Jul 30, 2020 at 06:01:20PM +0300, Kirill Tkhai wrote: > On 30.07.2020 17:34, Eric W. Biederman wrote: > > Kirill Tkhai <ktkhai@virtuozzo.com> writes: > > > >> Currently, there is no a way to list or iterate all or subset of namespaces > >> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories, > >> but some also may be as open files, which are not attached to a process. > >> When a namespace open fd is sent over unix socket and then closed, it is > >> impossible to know whether the namespace exists or not. > >> > >> Also, even if namespace is exposed as attached to a process or as open file, > >> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because > >> this multiplies at tasks and fds number.

Could you describe in more detail when you need to iterate namespaces?

There are three ways to hold namespaces:

* processes
* bind-mounts
* file descriptors

When CRIU dumps a container, it enumerates all processes and collects file descriptors and mounts. This means that we will be able to collect all namespaces, doesn't it?
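The three holders listed above can be walked from user space roughly as follows. This is a minimal sketch, not CRIU's actual code: `parse_ns_link` and `collect_namespaces` are hypothetical helper names, the link format is the `type:[inode]` form shown in the cover letter, and the sketch assumes that an nsfs bind mount shows that same string in the root field of mountinfo. It also makes the cost argument visible: the loop touches every task and every fd of every task.

```python
import os
import re

# Link targets look like "mnt:[4026531840]". Restricting the type list keeps
# "socket:[123]" and "pipe:[123]" fd links from being mistaken for namespaces.
NS_LINK_RE = re.compile(r"^(cgroup|ipc|mnt|net|pid|time|user|uts):\[(\d+)\]$")

def parse_ns_link(target):
    """Return (type, inode) for an nsfs link target, or None for anything else."""
    m = NS_LINK_RE.match(target)
    return (m.group(1), int(m.group(2))) if m else None

def collect_namespaces(proc="/proc"):
    """Hypothetical helper: gather namespaces held by tasks, fds and bind mounts."""
    found = set()
    for pid in filter(str.isdigit, os.listdir(proc)):
        # 1) and 3): namespaces of the task itself and namespaces held by open fds
        for sub in ("ns", "fd"):
            base = os.path.join(proc, pid, sub)
            try:
                names = os.listdir(base)
            except OSError:          # task exited or access denied
                continue
            for name in names:
                try:
                    ns = parse_ns_link(os.readlink(os.path.join(base, name)))
                except OSError:
                    continue
                if ns:
                    found.add(ns)
        # 2): nsfs bind mounts, assumed to appear in the 4th mountinfo field
        try:
            with open(os.path.join(proc, pid, "mountinfo")) as f:
                for line in f:
                    fields = line.split()
                    ns = parse_ns_link(fields[3]) if len(fields) > 3 else None
                    if ns:
                        found.add(ns)
        except OSError:
            pass
    return found
```

A namespace whose only reference is an fd queued inside a unix socket is not reachable by any of these three loops.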
On 8/4/20 8:43 AM, Andrei Vagin wrote: > On Thu, Jul 30, 2020 at 06:01:20PM +0300, Kirill Tkhai wrote: >> On 30.07.2020 17:34, Eric W. Biederman wrote: >>> Kirill Tkhai <ktkhai@virtuozzo.com> writes: >>> >>>> Currently, there is no a way to list or iterate all or subset of namespaces >>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories, >>>> but some also may be as open files, which are not attached to a process. >>>> When a namespace open fd is sent over unix socket and then closed, it is >>>> impossible to know whether the namespace exists or not. >>>> >>>> Also, even if namespace is exposed as attached to a process or as open file, >>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because >>>> this multiplies at tasks and fds number. > > Could you describe with more details when you need to iterate > namespaces? > > There are three ways to hold namespaces. > > * processes > * bind-mounts > * file descriptors > > When CRIU dumps a container, it enumirates all processes, collects file > descriptors and mounts. This means that we will be able to collect all > namespaces, doesn't it?

Yes, we can. But it would be much easier for us to have all namespaces in one place, wouldn't it? And this patch set has another, non-CRIU use case: it can simplify the view of namespaces for a normal user. Let's consider some cases:

Let's assume we have an empty (no processes) mount namespace M which is held by a single open fd, which was put into a unix socket and closed; the unix socket has a single open fd to it, which was in its turn put into another unix socket, and again and again until we reach the unix socket max depth... How should a normal user find this mount namespace M?

Let's assume that M also has an nsfs bind mount which holds some empty network namespace N... How should a normal user find N?
Let's also assume that M has an overmounted "/":

mount -t tmpfs tmpfs /

Now if you enter M, you would see a single tmpfs in mountinfo (because of the implicit chroot to the overmount on setns), and there is no way to see the full mountinfo if you do not know the real root dentry... How should a normal user (or even CRIU) find N?

So my personal opinion is that we need this interface; maybe it should be done somewhat differently, but we need it.
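The "fd parked inside a unix socket" situation described in this message can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions: `park_fd_in_socket` and `retrieve_fd` are hypothetical names, it needs Python 3.9+ for `socket.send_fds`/`socket.recv_fds`, and it works for any descriptor. Creating an actual process-less mount namespace M would additionally need privileges, so the sketch leaves the choice of fd to the caller.

```python
import os
import socket

def park_fd_in_socket(fd):
    """Queue `fd` in a socketpair via SCM_RIGHTS, then close the caller's copy.

    After this returns, the open file is kept alive only by the in-flight
    message; it is no longer listed under /proc/<pid>/fd/ of the sender.
    """
    tx, rx = socket.socketpair()
    socket.send_fds(tx, [b"x"], [fd])   # Python 3.9+
    os.close(fd)
    return tx, rx

def retrieve_fd(rx):
    """Receive the parked descriptor back; returns a new fd number."""
    _data, fds, _flags, _addr = socket.recv_fds(rx, 1, 1)
    return fds[0]
```

With `fd = os.open("/proc/self/ns/mnt", os.O_RDONLY)` as the argument, the namespace reference sits in the socket queue until someone calls `retrieve_fd`; nesting sockets inside sockets, as in the scenario above, hides it further.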
On 04.08.2020 08:43, Andrei Vagin wrote: > On Thu, Jul 30, 2020 at 06:01:20PM +0300, Kirill Tkhai wrote: >> On 30.07.2020 17:34, Eric W. Biederman wrote: >>> Kirill Tkhai <ktkhai@virtuozzo.com> writes: >>> >>>> Currently, there is no a way to list or iterate all or subset of namespaces >>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories, >>>> but some also may be as open files, which are not attached to a process. >>>> When a namespace open fd is sent over unix socket and then closed, it is >>>> impossible to know whether the namespace exists or not. >>>> >>>> Also, even if namespace is exposed as attached to a process or as open file, >>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because >>>> this multiplies at tasks and fds number. > > Could you describe with more details when you need to iterate > namespaces? > > There are three ways to hold namespaces. > > * processes > * bind-mounts > * file descriptors > > When CRIU dumps a container, it enumirates all processes, collects file > descriptors and mounts. This means that we will be able to collect all > namespaces, doesn't it?

1) It's not only for CRIU. No other utility can read the content of another task's unix socket the way CRIU does. Sometimes we may just want to see all mount namespaces to find the mount that holds a reference to a device.

2) In the case of CRIU, a recursive dump (when you iterate unix socket content, then you find another namespace and iterate another unix socket's content, then you find one more namespace) is less efficient and slower than dumping different types sequentially: first namespaces, then fds, etc.

3) It's still impossible to collect all namespaces, as Pasha wrote.
On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
> > > > An obvious diffference is that you are adding the inode to the inode to > > the file name. Which means that now you really do have to preserve the > > inode numbers during process migration. > > > > Which means now we have to do all of the work to make inode number > > restoration possible. Which means now we need to have multiple > > instances of nsfs so that we can restore inode numbers. > > > > I think this is still possible but we have been delaying figuring out > > how to restore inode numbers long enough that may be actual technical > > problems making it happen. > > Yeah, this matters. But it looks like here is not a dead end. We just need > change the names the namespaces are exported to particular fs and to support > rename(). > > Before introduction a principally new filesystem type for this, can't > this be solved in current /proc?

Do you mean to introduce names for namespaces which users will be able to change? By default, this can be a uuid.

And I have a suggestion about the structure of /proc/namespaces/.

Each namespace is owned by one of the user namespaces. Maybe it makes sense to group namespaces by their user namespaces?

/proc/namespaces/
                 user
                 mnt-X
                 mnt-Y
                 pid-X
                 uts-Z
                 user-X/
                        user
                        mnt-A
                        mnt-B
                        user-C
                        user-C/
                               user
                 user-Y/
                        user

Are we trying to invent cgroupfs for namespaces?

Thanks,
Andrei
On 06.08.2020 11:05, Andrei Vagin wrote:
>>>> As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't see any >>>> problem here. >>> >>> An obvious diffference is that you are adding the inode to the inode to >>> the file name. Which means that now you really do have to preserve the >>> inode numbers during process migration. >>> >>> Which means now we have to do all of the work to make inode number >>> restoration possible. Which means now we need to have multiple >>> instances of nsfs so that we can restore inode numbers. >>> >>> I think this is still possible but we have been delaying figuring out >>> how to restore inode numbers long enough that may be actual technical >>> problems making it happen. >> >> Yeah, this matters. But it looks like here is not a dead end. We just need >> change the names the namespaces are exported to particular fs and to support >> rename(). >> >> Before introduction a principally new filesystem type for this, can't >> this be solved in current /proc? > > do you mean to introduce names for namespaces which users will be able > to change? By default, this can be uuid.

Yes, I mean this.

I can't give a final answer about UUIDs yet, but I planned to show some default names based on the namespace type and inode number. Completely custom names for every /proc by default would waste too much memory.

So, I think a good way would be:

1) Introduce a function which returns a hash/uuid based on the ino, the ns type and some static random seed which is generated at boot;

2) Use the hash/uuid as the default names in a newly created /proc/namespaces: pid-{hash/uuid(ino, "pid")}

3) Allow rename, and allocate space only for renamed names.

Maybe 2 and 3 will be implemented as shrinkable and non-shrinkable dentries.

> And I have a suggestion about the structure of /proc/namespaces/. > > Each namespace is owned by one of user namespaces. Maybe it makes sense > to group namespaces by their user-namespaces?
> > /proc/namespaces/ > user > mnt-X > mnt-Y > pid-X > uts-Z > user-X/ > user > mnt-A > mnt-B > user-C > user-C/ > user > user-Y/ > user

Hm, I don't think that the user namespace is a generic key for everybody. For most people's tasks, a user namespace is just one namespace type among the others. To me it would look a bit strange to iterate over user namespaces to build a container's net topology.

> Do we try to invent cgroupfs for namespaces?

Could you clarify your thought?
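The three-step naming scheme proposed in this message (derived default names, rename allowed, storage only for renamed entries) can be sketched in user-space Python. Everything here is illustrative: `default_name`, `NamespaceNames` and `BOOT_SEED` are hypothetical, HMAC-SHA256 truncated to 16 hex digits is an arbitrary stand-in for the unspecified "hash/uuid" function, and a dict stands in for whatever the kernel would use for non-shrinkable dentries.

```python
import hashlib
import hmac
import os

BOOT_SEED = os.urandom(16)   # stands in for the seed "generated at boot"

def default_name(ns_type, ino, seed=BOOT_SEED):
    """Step 1 and 2: derive e.g. 'pid-<hex>' from (ino, ns type, seed)."""
    msg = f"{ns_type}:{ino}".encode()
    digest = hmac.new(seed, msg, hashlib.sha256).hexdigest()[:16]
    return f"{ns_type}-{digest}"

class NamespaceNames:
    """Step 3: only renamed entries take memory; defaults are recomputed."""

    def __init__(self, seed=BOOT_SEED):
        self.seed = seed
        self.renamed = {}            # (ns_type, ino) -> custom name

    def name_of(self, ns_type, ino):
        custom = self.renamed.get((ns_type, ino))
        return custom or default_name(ns_type, ino, self.seed)

    def rename(self, ns_type, ino, new_name):
        # Sketch only: a clash with another entry's *default* name is not checked.
        if new_name in self.renamed.values():
            raise FileExistsError(new_name)
        self.renamed[(ns_type, ino)] = new_name
```

Because the default name depends on a per-boot seed rather than on the bare inode number, it does not have to be preserved across migration; a restored container would get its old names back through `rename`.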
On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote:
> > > > Each namespace is owned by one of user namespaces. Maybe it makes sense > > to group namespaces by their user-namespaces? > > > > /proc/namespaces/ > > user > > mnt-X > > mnt-Y > > pid-X > > uts-Z > > user-X/ > > user > > mnt-A > > mnt-B > > user-C > > user-C/ > > user > > user-Y/ > > user > > Hm, I don't think that user namespace is a generic key value for everybody. > For generic people tasks a user namespace is just a namespace among another > namespace types. For me it will look a bit strage to iterate some user namespaces > to build container net topology.

I can’t agree with you that the user namespace is just one among the others. It is the namespace for namespaces. It sets security boundaries in the system, and we need to know them to understand the whole system.

If user namespaces are not used in the system or in a container, you will see all namespaces in one directory. But if the system has a more complicated structure, you will be able to build a full picture of it.

You said that one of the users of this feature is CRIU (the tool to checkpoint/restore containers), and you said that it would be good if CRIU were able to collect all container namespaces before dumping processes, sockets, files, etc. But how will we be able to do this if we list all namespaces in one directory?

Here are my thoughts on why the suggested structure is better than just a list of namespaces:

* Users will be able to understand security boundaries in the system. Each namespace in the system is owned by one of the user namespaces, and we need to know these relationships to understand the whole system.

* This simplifies collecting the namespaces which belong to one container. For example, CRIU collects all namespaces before dumping file descriptors. Then it collects all sockets with socket-diag in network namespaces and collects mount points via /proc/pid/mountinfo in mount namespaces. Then this information is used to dump socket file descriptors and opened files.
* We are going to assign names to namespaces. But this means that we need to guarantee that all names in one directory are unique. The initial proposal was to enumerate all namespaces in one proc directory, which means the names of all namespaces have to be unique. This can be problematic in some cases. For example, we may want to dump a container and then restore it more than once on the same host. How are we going to avoid namespace name conflicts in such cases? If we have per-user-namespace directories, we will need to guarantee that names are unique only inside one user namespace.

* With the suggested structure, for each user namespace, we will show only its subtree of namespaces. This looks more natural than filtering the content of one directory.

> > > Do we try to invent cgroupfs for namespaces? > > Could you clarify your thought?
On 10.08.2020 20:34, Andrei Vagin wrote:
>> >>> And I have a suggestion about the structure of /proc/namespaces/. >>> >>> Each namespace is owned by one of user namespaces. Maybe it makes sense >>> to group namespaces by their user-namespaces? >>> >>> /proc/namespaces/ >>> user >>> mnt-X >>> mnt-Y >>> pid-X >>> uts-Z >>> user-X/ >>> user >>> mnt-A >>> mnt-B >>> user-C >>> user-C/ >>> user >>> user-Y/ >>> user >> >> Hm, I don't think that user namespace is a generic key value for everybody. >> For generic people tasks a user namespace is just a namespace among another >> namespace types. For me it will look a bit strage to iterate some user namespaces >> to build container net topology. > > I can’t agree with you that the user namespace is one of others. It is > the namespace for namespaces. It sets security boundaries in the system > and we need to know them to understand the whole system. > > If user namespaces are not used in the system or on a container, you > will see all namespaces in one directory. But if the system has a more > complicated structure, you will be able to build a full picture of it. > > You said that one of the users of this feature is CRIU (the tool to > checkpoint/restore containers) and you said that it would be good if > CRIU will be able to collect all container namespaces before dumping > processes, sockets, files etc. But how will we be able to do this if we > will list all namespaces in one directory?

There is no problem; this looks rather simple. Two cases are possible:

1) The container has a dedicated set of namespaces, and CRIU just has to iterate the files in /proc/namespaces of the root pid namespace of the container. The relationships between parents and children of pid and user namespaces are found via ioctl(NS_GET_PARENT).

2) The container has no dedicated set of namespaces. Then CRIU just has to iterate all host namespaces. There is no other way to do that, because the container may have any of the host namespaces, and a hierarchy in /proc/namespaces won't help you.
> Here are my thoughts why we need to the suggested structure is better > than just a list of namespaces: > > * Users will be able to understand securies bondaries in the system. > Each namespace in the system is owned by one of user namespace and we > need to know these relationshipts to understand the whole system.

NS_GET_PARENT and NS_GET_USERNS already exist. What is the problem with using these interfaces?

> * This is simplify collecting namespaces which belong to one container. > > For example, CRIU collects all namespaces before dumping file > descriptors. Then it collects all sockets with socket-diag in network > namespaces and collects mount points via /proc/pid/mountinfo in mount > namesapces. Then these information is used to dump socket file > descriptors and opened files.

This is exactly what I am saying. This allows us to avoid writing a recursive dump. But it says nothing about the advantages of a hierarchy in /proc/namespaces.

> * We are going to assign names to namespaces. But this means that we > need to guarantee that all names in one directory are unique. The > initial proposal was to enumerate all namespaces in one proc directory, > that means names of all namespaces have to be unique. This can be > problematic in some cases. For example, we may want to dump a container > and then restore it more than once on the same host. How are we going to > avoid namespace name conficts in such cases?

In a previous message I wrote about .rename for proc files, and Alexey Dobriyan said this is not taboo. Is there a problem in the case you point out that this does not cover?

> If we will have per-user-namespace directories, we will need to > guarantee that names are unique only inside one user namespace.

Unique names inside one user namespace won't introduce a new /proc mount. You can't pass a sub-directory of /proc/namespaces/ to a specific container. To give a virtualized name, you have to have a dedicated pid ns.
Suppose that in one /proc mount we have:

/mnt1/proc/namespaces/userns1/.../[namespaceX_name1 -- inode XXX]

In another /proc mount we have:

/mnt2/proc/namespaces/userns1/.../[namespaceX_name1_synonym -- inode XXX]

The virtualization is done per /proc (i.e., per pid ns). The container should receive either /mnt1/proc or /mnt2/proc on restore as its /proc. The directory hierarchy makes no sense for virtualization, since you can't use a specific sub-directory as the root directory of /proc/namespaces for a container. You still have to introduce a new pid ns to have a virtualized /proc.

> * With the suggested structure, for each user namepsace, we will show > only its subtree of namespaces. This looks more natural than > filltering content of one directory.

It's rather subjective, I think. /proc is related to the pid ns, and a user ns hierarchy does not look more natural to me.
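The NS_GET_USERNS and NS_GET_PARENT ioctls mentioned in this message can be exercised from user space roughly as follows. This is a minimal sketch: `owner_chain` is a hypothetical helper, and the numeric request values are written out on the assumption of an architecture where `_IO(0xb7, n)` encodes as `0xb700 | n` (true on x86 and arm64, not on every architecture). The kernel refuses to step outside the caller's own user namespace scope, so the walk simply stops at the first error.

```python
import fcntl
import os

# _IO(0xb7, 0x1) and _IO(0xb7, 0x2) from <linux/nsfs.h>; the literal values
# below assume _IOC_NONE == 0, which does not hold on all architectures.
NS_GET_USERNS = 0xB701
NS_GET_PARENT = 0xB702

def owner_chain(path, limit=32):
    """Return inode numbers of the user namespaces above `path`, innermost first."""
    chain = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for _ in range(limit):
            # First step: the owning user ns of the namespace at `path`;
            # later steps: the parent of the user ns reached so far.
            request = NS_GET_USERNS if not chain else NS_GET_PARENT
            try:
                next_fd = fcntl.ioctl(fd, request)
            except OSError:          # e.g. EPERM when leaving the caller's scope
                break
            os.close(fd)
            fd = next_fd
            chain.append(os.fstat(fd).st_ino)
    finally:
        os.close(fd)
    return chain
```

Calling `owner_chain("/proc/self/ns/net")` inside a nested container would list the user namespaces visible from the caller's position; on the host as an unprivileged or root user in the initial user namespace it typically yields a single entry.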
On Tue, Aug 11, 2020 at 01:23:35PM +0300, Kirill Tkhai wrote:
> >> > >> So, I think the good way will be: > >> > >> 1)Introduce a function, which returns a hash/uuid based on ino, ns type and some static > >> random seed, which is generated on boot; > >> > >> 2)Use the hash/uuid as default names in newly create /proc/namespaces: pid-{hash/uuid(ino, "pid")} > >> > >> 3)Allow rename, and allocate space only for renamed names. > >> > >> Maybe 2 and 3 will be implemented as shrinkable dentries and non-shrinkable. > >> > >>> And I have a suggestion about the structure of /proc/namespaces/. > >>> > >>> Each namespace is owned by one of user namespaces. Maybe it makes sense > >>> to group namespaces by their user-namespaces? > >>> > >>> /proc/namespaces/ > >>> user > >>> mnt-X > >>> mnt-Y > >>> pid-X > >>> uts-Z > >>> user-X/ > >>> user > >>> mnt-A > >>> mnt-B > >>> user-C > >>> user-C/ > >>> user > >>> user-Y/ > >>> user > >> > >> Hm, I don't think that user namespace is a generic key value for everybody. > >> For generic people tasks a user namespace is just a namespace among another > >> namespace types. For me it will look a bit strage to iterate some user namespaces > >> to build container net topology. > > > > I can’t agree with you that the user namespace is one of others. It is > > the namespace for namespaces. It sets security boundaries in the system > > and we need to know them to understand the whole system. > > > > If user namespaces are not used in the system or on a container, you > > will see all namespaces in one directory. But if the system has a more > > complicated structure, you will be able to build a full picture of it. > > > > You said that one of the users of this feature is CRIU (the tool to > > checkpoint/restore containers) and you said that it would be good if > > CRIU will be able to collect all container namespaces before dumping > > processes, sockets, files etc. But how will we be able to do this if we > > will list all namespaces in one directory? > > There is no a problem, this looks rather simple. 
Two cases are possible: > > 1)a container has dedicated namespaces set, and CRIU just has to iterate > files in /proc/namespaces of root pid namespace of the container. > The relationships between parents and childs of pid and user namespaces > are founded via ioctl(NS_GET_PARENT). > > 2)container has no dedicated namespaces set. Then CRIU just has to iterate > all host namespaces. There is no another way to do that, because container > may have any host namespaces, and hierarchy in /proc/namespaces won't > help you. > > > Here are my thoughts why we need to the suggested structure is better > > than just a list of namespaces: > > > > * Users will be able to understand securies bondaries in the system. > > Each namespace in the system is owned by one of user namespace and we > > need to know these relationshipts to understand the whole system. > > Here are already NS_GET_PARENT and NS_GET_USERNS. What is the problem to use > this interfaces? We can use these ioctl-s, but we will need to enumerate all namespaces in the system to build a view of the namespace hierarchy. This will be very expensive. The kernel can show this hierarchy without additional cost. > > > * This is simplify collecting namespaces which belong to one container. > > > > For example, CRIU collects all namespaces before dumping file > > descriptors. Then it collects all sockets with socket-diag in network > > namespaces and collects mount points via /proc/pid/mountinfo in mount > > namesapces. Then these information is used to dump socket file > > descriptors and opened files. > > This is just the thing I say. This allows to avoid writing recursive dump. I don't understand this. How are you going to collect namespaces in CRIU without knowing which are used by a dumped container? > But this has nothing about advantages of hierarchy in /proc/namespaces. Really? You said that you implemented this series to help CRIU dumping namespaces. 
I think we need to implement the CRIU part to prove that this interface is usable for this case. Right now, I have doubts about this. > > > * We are going to assign names to namespaces. But this means that we > > need to guarantee that all names in one directory are unique. The > > initial proposal was to enumerate all namespaces in one proc directory, > > that means names of all namespaces have to be unique. This can be > > problematic in some cases. For example, we may want to dump a container > > and then restore it more than once on the same host. How are we going to > > avoid namespace name conficts in such cases? > > Previous message I wrote about .rename of proc files, Alexey Dobriyan > said this is not a taboo. Are there problem which doesn't cover the case > you point? Yes, there is. Namespace names will be visible from a container, so they have to be restored. But this means that two containers can't be restored from the same snapshot due to namespace name conflicts. But if we will show namespaces how I suggest, each container will see only its sub-tree of namespaces and we will be able to specify any name for the container root user namespace. > > > If we will have per-user-namespace directories, we will need to > > guarantee that names are unique only inside one user namespace. > > Unique names inside one user namespace won't introduce a new /proc > mount. You can't pass a sub-directory of /proc/namespaces/ to a specific > container. To give a virtualized name you have to have a dedicated pid ns. > > Let we have in one /proc mount: > > /mnt1/proc/namespaces/userns1/.../[namespaceX_name1 -- inode XXX] > > In another another /proc mount we have: > > /mnt2/proc/namespaces/userns1/.../[namespaceX_name1_synonym -- inode XXX] > > The virtualization is made per /proc (i.e., per pid ns). Container should > receive either /mnt1/proc or /mnt2/proc on restore as it's /proc. 
> > There is no a sense of directory hierarchy for virtualization, since > you can't use specific sub-directory as a root directory of /proc/namespaces > to a container. You still have to introduce a new pid ns to have virtualized > /proc. I think we can figure out how to implement this. As a first idea, we could do it the same way /proc/net is implemented. > > > * With the suggested structure, for each user namepsace, we will show > > only its subtree of namespaces. This looks more natural than > > filltering content of one directory. > > It's rather subjectively I think. /proc is related to pid ns, and user ns > hierarchy does not look more natural for me. Or /proc is the wrong place for this.
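To make the cost argument in this message concrete, here is a minimal sketch of rebuilding the ownership grouping from a flat listing. It is only an illustration: `build_owner_tree`, `owner_of` and the `OWNERS` table are hypothetical names, and `owner_of` stands in for whatever per-namespace query a real tool would issue (for instance, opening the namespace file and calling the NS_GET_USERNS ioctl mentioned above). No kernel interface is called here.

```python
from collections import defaultdict


def build_owner_tree(namespaces, owner_of):
    """Group namespaces by their owning user namespace.

    `owner_of` is a callable returning the owner of one namespace, or
    None for the initial user namespace. It is invoked once per entry,
    which is the per-namespace cost discussed in the thread.
    """
    tree = defaultdict(list)
    lookups = 0
    for ns in namespaces:
        owner = owner_of(ns)  # one lookup per listed namespace
        lookups += 1
        if owner is not None:
            tree[owner].append(ns)
    return dict(tree), lookups


# Hypothetical flat listing with an ownership table (illustrative names only).
OWNERS = {
    "user-root": None,
    "mnt-1": "user-root",
    "pid-1": "user-root",
    "user-ct": "user-root",
    "mnt-2": "user-ct",
    "net-2": "user-ct",
}

tree, lookups = build_owner_tree(sorted(OWNERS), OWNERS.get)
print(tree)
print("lookups:", lookups)
```

In this toy model the number of lookups equals the number of listed entries, so the cost depends on how many namespaces the listing exposes, which is the point the two participants go on to dispute.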
On 12.08.2020 20:53, Andrei Vagin wrote: > On Tue, Aug 11, 2020 at 01:23:35PM +0300, Kirill Tkhai wrote: >> On 10.08.2020 20:34, Andrei Vagin wrote: >>> On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote: >>>> On 06.08.2020 11:05, Andrei Vagin wrote: >>>>> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote: >>>>>> On 31.07.2020 01:13, Eric W. Biederman wrote: >>>>>>> Kirill Tkhai <ktkhai@virtuozzo.com> writes: >>>>>>> >>>>>>>> On 30.07.2020 17:34, Eric W. Biederman wrote: >>>>>>>>> Kirill Tkhai <ktkhai@virtuozzo.com> writes: >>>>>>>>> >>>>>>>>>> Currently, there is no a way to list or iterate all or subset of namespaces >>>>>>>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories, >>>>>>>>>> but some also may be as open files, which are not attached to a process. >>>>>>>>>> When a namespace open fd is sent over unix socket and then closed, it is >>>>>>>>>> impossible to know whether the namespace exists or not. >>>>>>>>>> >>>>>>>>>> Also, even if namespace is exposed as attached to a process or as open file, >>>>>>>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because >>>>>>>>>> this multiplies at tasks and fds number. >>>>>>>>> >>>>>>>>> I am very dubious about this. >>>>>>>>> >>>>>>>>> I have been avoiding exactly this kind of interface because it can >>>>>>>>> create rather fundamental problems with checkpoint restart. >>>>>>>> >>>>>>>> restart/restore :) >>>>>>>> >>>>>>>>> You do have some filtering and the filtering is not based on current. >>>>>>>>> Which is good. >>>>>>>>> >>>>>>>>> A view that is relative to a user namespace might be ok. It almost >>>>>>>>> certainly does better as it's own little filesystem than as an extension >>>>>>>>> to proc though. >>>>>>>>> >>>>>>>>> The big thing we want to ensure is that if you migrate you can restore >>>>>>>>> everything. I don't see how you will be able to restore these files >>>>>>>>> after migration. 
Anything like this without having a complete >>>>>>>>> checkpoint/restore story is a non-starter. >>>>>>>> >>>>>>>> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/. >>>>>>>> >>>>>>>> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files. >>>>>>>> As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't see any >>>>>>>> problem here. >>>>>>> >>>>>>> An obvious diffference is that you are adding the inode to the inode to >>>>>>> the file name. Which means that now you really do have to preserve the >>>>>>> inode numbers during process migration. >>>>>>> >>>>>>> Which means now we have to do all of the work to make inode number >>>>>>> restoration possible. Which means now we need to have multiple >>>>>>> instances of nsfs so that we can restore inode numbers. >>>>>>> >>>>>>> I think this is still possible but we have been delaying figuring out >>>>>>> how to restore inode numbers long enough that may be actual technical >>>>>>> problems making it happen. >>>>>> >>>>>> Yeah, this matters. But it looks like here is not a dead end. We just need >>>>>> change the names the namespaces are exported to particular fs and to support >>>>>> rename(). >>>>>> >>>>>> Before introduction a principally new filesystem type for this, can't >>>>>> this be solved in current /proc? >>>>> >>>>> do you mean to introduce names for namespaces which users will be able >>>>> to change? By default, this can be uuid. >>>> >>>> Yes, I mean this. >>>> >>>> Currently I won't give a final answer about UUID, but I planned to show some >>>> default names, which based on namespace type and inode num. Completely custom >>>> names for any /proc by default will waste too much memory. 
>>>> >>>> So, I think the good way will be: >>>> >>>> 1)Introduce a function, which returns a hash/uuid based on ino, ns type and some static >>>> random seed, which is generated on boot; >>>> >>>> 2)Use the hash/uuid as default names in newly create /proc/namespaces: pid-{hash/uuid(ino, "pid")} >>>> >>>> 3)Allow rename, and allocate space only for renamed names. >>>> >>>> Maybe 2 and 3 will be implemented as shrinkable dentries and non-shrinkable. >>>> >>>>> And I have a suggestion about the structure of /proc/namespaces/. >>>>> >>>>> Each namespace is owned by one of user namespaces. Maybe it makes sense >>>>> to group namespaces by their user-namespaces? >>>>> >>>>> /proc/namespaces/ >>>>> user >>>>> mnt-X >>>>> mnt-Y >>>>> pid-X >>>>> uts-Z >>>>> user-X/ >>>>> user >>>>> mnt-A >>>>> mnt-B >>>>> user-C >>>>> user-C/ >>>>> user >>>>> user-Y/ >>>>> user >>>> >>>> Hm, I don't think that user namespace is a generic key value for everybody. >>>> For generic people tasks a user namespace is just a namespace among another >>>> namespace types. For me it will look a bit strage to iterate some user namespaces >>>> to build container net topology. >>> >>> I can’t agree with you that the user namespace is one of others. It is >>> the namespace for namespaces. It sets security boundaries in the system >>> and we need to know them to understand the whole system. >>> >>> If user namespaces are not used in the system or on a container, you >>> will see all namespaces in one directory. But if the system has a more >>> complicated structure, you will be able to build a full picture of it. >>> >>> You said that one of the users of this feature is CRIU (the tool to >>> checkpoint/restore containers) and you said that it would be good if >>> CRIU will be able to collect all container namespaces before dumping >>> processes, sockets, files etc. But how will we be able to do this if we >>> will list all namespaces in one directory? 
>> >> There is no a problem, this looks rather simple. Two cases are possible: >> >> 1)a container has dedicated namespaces set, and CRIU just has to iterate >> files in /proc/namespaces of root pid namespace of the container. >> The relationships between parents and childs of pid and user namespaces >> are founded via ioctl(NS_GET_PARENT). >> >> 2)container has no dedicated namespaces set. Then CRIU just has to iterate >> all host namespaces. There is no another way to do that, because container >> may have any host namespaces, and hierarchy in /proc/namespaces won't >> help you. >> >>> Here are my thoughts why we need to the suggested structure is better >>> than just a list of namespaces: >>> >>> * Users will be able to understand securies bondaries in the system. >>> Each namespace in the system is owned by one of user namespace and we >>> need to know these relationshipts to understand the whole system. >> >> Here are already NS_GET_PARENT and NS_GET_USERNS. What is the problem to use >> this interfaces? > > We can use these ioctl-s, but we will need to enumerate all namespaces in > the system to build a view of the namespace hierarchy. This will be very > expensive. The kernel can show this hierarchy without additional cost. No. We only have to iterate /proc/namespaces of a specific container to get its namespaces. That is a subset of all namespaces in the system, and these are all the namespaces that are potentially allowed for the container. >> >>> * This is simplify collecting namespaces which belong to one container. >>> >>> For example, CRIU collects all namespaces before dumping file >>> descriptors. Then it collects all sockets with socket-diag in network >>> namespaces and collects mount points via /proc/pid/mountinfo in mount >>> namesapces. Then these information is used to dump socket file >>> descriptors and opened files. >> >> This is just the thing I say. This allows to avoid writing recursive dump. > > I don't understand this.
How are you going to collect namespaces in CRIU > without knowing which are used by a dumped container? My patchset exports only the namespaces that are allowed for a specific container, and nothing beyond that. All exported namespaces are alive, so someone holds a reference to each of them. So they are in use. It seems you haven't understood the approach I suggested here. See patch [11/23] for the details. It's about permissions, and the subset of exported namespaces is formalized there. >> But this has nothing about advantages of hierarchy in /proc/namespaces. > > Really? You said that you implemented this series to help CRIU dumping > namespaces. I think we need to implement the CRIU part to prove that > this interface is usable for this case. Right now, I have doubts about > this. Yes, really. See my comment above and patch [11/23]. >> >>> * We are going to assign names to namespaces. But this means that we >>> need to guarantee that all names in one directory are unique. The >>> initial proposal was to enumerate all namespaces in one proc directory, >>> that means names of all namespaces have to be unique. This can be >>> problematic in some cases. For example, we may want to dump a container >>> and then restore it more than once on the same host. How are we going to >>> avoid namespace name conficts in such cases? >> >> Previous message I wrote about .rename of proc files, Alexey Dobriyan >> said this is not a taboo. Are there problem which doesn't cover the case >> you point? > > Yes, there is. Namespace names will be visible from a container, so they > have to be restored. But this means that two containers can't be > restored from the same snapshot due to namespace name conflicts. > > But if we will show namespaces how I suggest, each container will see > only its sub-tree of namespaces and we will be able to specify any name > for the container root user namespace. Now I'm sure you missed my idea. See proc_namespaces_readdir() in [11/23]. I do export a sub-tree.
>> >>> If we will have per-user-namespace directories, we will need to >>> guarantee that names are unique only inside one user namespace. >> >> Unique names inside one user namespace won't introduce a new /proc >> mount. You can't pass a sub-directory of /proc/namespaces/ to a specific >> container. To give a virtualized name you have to have a dedicated pid ns. >> >> Let we have in one /proc mount: >> >> /mnt1/proc/namespaces/userns1/.../[namespaceX_name1 -- inode XXX] >> >> In another another /proc mount we have: >> >> /mnt2/proc/namespaces/userns1/.../[namespaceX_name1_synonym -- inode XXX] >> >> The virtualization is made per /proc (i.e., per pid ns). Container should >> receive either /mnt1/proc or /mnt2/proc on restore as it's /proc. >> >> There is no a sense of directory hierarchy for virtualization, since >> you can't use specific sub-directory as a root directory of /proc/namespaces >> to a container. You still have to introduce a new pid ns to have virtualized >> /proc. > > I think we can figure out how to implement this. As the first idea, we > can use the same way how /proc/net is implemented. > >> >>> * With the suggested structure, for each user namepsace, we will show >>> only its subtree of namespaces. This looks more natural than >>> filltering content of one directory. >> >> It's rather subjectively I think. /proc is related to pid ns, and user ns >> hierarchy does not look more natural for me. > > or /proc is wrong place for this
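The sub-tree export Kirill refers to can be sketched as a simple filter. This is a toy model under a stated assumption: it takes the rule to be the one given in the description of [PATCH 11/23] as quoted later in this thread, namely that a namespace is listed when its owning user namespace is the /proc mount's user namespace or a descendant of it. The function names and the sample data are hypothetical and are not taken from the patch.

```python
def is_same_or_descendant(userns, ancestor, parent_of):
    """Walk up the user namespace chain looking for `ancestor`."""
    while userns is not None:
        if userns == ancestor:
            return True
        userns = parent_of.get(userns)
    return False


def visible_namespaces(owner_by_ns, proc_userns, parent_of):
    """Return the namespaces a /proc mount owned by `proc_userns` would list."""
    return sorted(
        name
        for name, owner in owner_by_ns.items()
        if is_same_or_descendant(owner, proc_userns, parent_of)
    )


# Hypothetical user namespace hierarchy: child -> parent.
PARENT_OF = {
    "user-root": None,
    "user-ct1": "user-root",
    "user-ct2": "user-root",
    "user-nested": "user-ct1",
}

# Hypothetical namespaces mapped to their owning user namespace.
OWNER_BY_NS = {
    "mnt-host": "user-root",
    "net-ct1": "user-ct1",
    "mnt-nested": "user-nested",
    "net-ct2": "user-ct2",
}

host_view = visible_namespaces(OWNER_BY_NS, "user-root", PARENT_OF)
ct1_view = visible_namespaces(OWNER_BY_NS, "user-ct1", PARENT_OF)
ct2_view = visible_namespaces(OWNER_BY_NS, "user-ct2", PARENT_OF)
print(host_view, ct1_view, ct2_view)
```

Under this assumed rule the host view contains every namespace, while each container view contains only the namespaces owned at or below its own user namespace, still as one flat list.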
On Thu, Aug 13, 2020 at 11:12:45AM +0300, Kirill Tkhai wrote: > On 12.08.2020 20:53, Andrei Vagin wrote: > > On Tue, Aug 11, 2020 at 01:23:35PM +0300, Kirill Tkhai wrote: > >> On 10.08.2020 20:34, Andrei Vagin wrote: > >>> On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote: > >>>> On 06.08.2020 11:05, Andrei Vagin wrote: > >>>>> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote: > >>>>>> On 31.07.2020 01:13, Eric W. Biederman wrote: > >>>>>>> Kirill Tkhai <ktkhai@virtuozzo.com> writes: > >>>>>>> > >>>>>>>> On 30.07.2020 17:34, Eric W. Biederman wrote: > >>>>>>>>> Kirill Tkhai <ktkhai@virtuozzo.com> writes: > >>>>>>>>> > >>>>>>>>>> Currently, there is no a way to list or iterate all or subset of namespaces > >>>>>>>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories, > >>>>>>>>>> but some also may be as open files, which are not attached to a process. > >>>>>>>>>> When a namespace open fd is sent over unix socket and then closed, it is > >>>>>>>>>> impossible to know whether the namespace exists or not. > >>>>>>>>>> > >>>>>>>>>> Also, even if namespace is exposed as attached to a process or as open file, > >>>>>>>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because > >>>>>>>>>> this multiplies at tasks and fds number. > >>>>>>>>> > >>>>>>>>> I am very dubious about this. > >>>>>>>>> > >>>>>>>>> I have been avoiding exactly this kind of interface because it can > >>>>>>>>> create rather fundamental problems with checkpoint restart. > >>>>>>>> > >>>>>>>> restart/restore :) > >>>>>>>> > >>>>>>>>> You do have some filtering and the filtering is not based on current. > >>>>>>>>> Which is good. > >>>>>>>>> > >>>>>>>>> A view that is relative to a user namespace might be ok. It almost > >>>>>>>>> certainly does better as it's own little filesystem than as an extension > >>>>>>>>> to proc though. 
> >>>>>>>>> > >>>>>>>>> The big thing we want to ensure is that if you migrate you can restore > >>>>>>>>> everything. I don't see how you will be able to restore these files > >>>>>>>>> after migration. Anything like this without having a complete > >>>>>>>>> checkpoint/restore story is a non-starter. > >>>>>>>> > >>>>>>>> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/. > >>>>>>>> > >>>>>>>> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files. > >>>>>>>> As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't see any > >>>>>>>> problem here. > >>>>>>> > >>>>>>> An obvious diffference is that you are adding the inode to the inode to > >>>>>>> the file name. Which means that now you really do have to preserve the > >>>>>>> inode numbers during process migration. > >>>>>>> > >>>>>>> Which means now we have to do all of the work to make inode number > >>>>>>> restoration possible. Which means now we need to have multiple > >>>>>>> instances of nsfs so that we can restore inode numbers. > >>>>>>> > >>>>>>> I think this is still possible but we have been delaying figuring out > >>>>>>> how to restore inode numbers long enough that may be actual technical > >>>>>>> problems making it happen. > >>>>>> > >>>>>> Yeah, this matters. But it looks like here is not a dead end. We just need > >>>>>> change the names the namespaces are exported to particular fs and to support > >>>>>> rename(). > >>>>>> > >>>>>> Before introduction a principally new filesystem type for this, can't > >>>>>> this be solved in current /proc? > >>>>> > >>>>> do you mean to introduce names for namespaces which users will be able > >>>>> to change? By default, this can be uuid. > >>>> > >>>> Yes, I mean this. > >>>> > >>>> Currently I won't give a final answer about UUID, but I planned to show some > >>>> default names, which based on namespace type and inode num. 
Completely custom > >>>> names for any /proc by default will waste too much memory. > >>>> > >>>> So, I think the good way will be: > >>>> > >>>> 1)Introduce a function, which returns a hash/uuid based on ino, ns type and some static > >>>> random seed, which is generated on boot; > >>>> > >>>> 2)Use the hash/uuid as default names in newly create /proc/namespaces: pid-{hash/uuid(ino, "pid")} > >>>> > >>>> 3)Allow rename, and allocate space only for renamed names. > >>>> > >>>> Maybe 2 and 3 will be implemented as shrinkable dentries and non-shrinkable. > >>>> > >>>>> And I have a suggestion about the structure of /proc/namespaces/. > >>>>> > >>>>> Each namespace is owned by one of user namespaces. Maybe it makes sense > >>>>> to group namespaces by their user-namespaces? > >>>>> > >>>>> /proc/namespaces/ > >>>>> user > >>>>> mnt-X > >>>>> mnt-Y > >>>>> pid-X > >>>>> uts-Z > >>>>> user-X/ > >>>>> user > >>>>> mnt-A > >>>>> mnt-B > >>>>> user-C > >>>>> user-C/ > >>>>> user > >>>>> user-Y/ > >>>>> user > >>>> > >>>> Hm, I don't think that user namespace is a generic key value for everybody. > >>>> For generic people tasks a user namespace is just a namespace among another > >>>> namespace types. For me it will look a bit strage to iterate some user namespaces > >>>> to build container net topology. > >>> > >>> I can’t agree with you that the user namespace is one of others. It is > >>> the namespace for namespaces. It sets security boundaries in the system > >>> and we need to know them to understand the whole system. > >>> > >>> If user namespaces are not used in the system or on a container, you > >>> will see all namespaces in one directory. But if the system has a more > >>> complicated structure, you will be able to build a full picture of it. 
> >>> > >>> You said that one of the users of this feature is CRIU (the tool to > >>> checkpoint/restore containers) and you said that it would be good if > >>> CRIU will be able to collect all container namespaces before dumping > >>> processes, sockets, files etc. But how will we be able to do this if we > >>> will list all namespaces in one directory? > >> > >> There is no a problem, this looks rather simple. Two cases are possible: > >> > >> 1)a container has dedicated namespaces set, and CRIU just has to iterate > >> files in /proc/namespaces of root pid namespace of the container. > >> The relationships between parents and childs of pid and user namespaces > >> are founded via ioctl(NS_GET_PARENT). > >> > >> 2)container has no dedicated namespaces set. Then CRIU just has to iterate > >> all host namespaces. There is no another way to do that, because container > >> may have any host namespaces, and hierarchy in /proc/namespaces won't > >> help you. > >> > >>> Here are my thoughts why we need to the suggested structure is better > >>> than just a list of namespaces: > >>> > >>> * Users will be able to understand securies bondaries in the system. > >>> Each namespace in the system is owned by one of user namespace and we > >>> need to know these relationshipts to understand the whole system. > >> > >> Here are already NS_GET_PARENT and NS_GET_USERNS. What is the problem to use > >> this interfaces? > > > > We can use these ioctl-s, but we will need to enumerate all namespaces in > > the system to build a view of the namespace hierarchy. This will be very > > expensive. The kernel can show this hierarchy without additional cost. > > No. We will have to iterate /proc/namespaces of a specific container to get > its namespaces. It's a subset of all namespaces in system, and these all the > namespaces, which are potentially allowed for the container. """ Every /proc is related to a pid_namespace, and the pid_namespace is related to a user_namespace. 
The items, we show in this /proc/namespaces/ directory, are the namespaces, whose user_namespaces are the same as /proc's user_namespace, or their descendants. """ // [PATCH 11/23] fs: Add /proc/namespaces/ directory This means that if a user wants to find all of a container's namespaces, they must have access to the container's procfs, and the container must have a separate pid namespace. I would say these are two big limitations. The first one will not affect CRIU, and I agree CRIU can use this interface in its current form. The second one will still be an issue for CRIU. And both will affect other users. For end users, it will be a pain. They will need to create a pid namespace in the specified user namespace if the container doesn't have its own. Then they will need to mount /proc from the container's pid namespace, and only then will they be able to enumerate namespaces. But to build a view of the hierarchy of these namespaces, they will need a binary tool that opens each of these namespaces, calls the NS_GET_PARENT and NS_GET_USERNS ioctls, and builds a tree. > > >> > >>> * This is simplify collecting namespaces which belong to one container. > >>> > >>> For example, CRIU collects all namespaces before dumping file > >>> descriptors. Then it collects all sockets with socket-diag in network > >>> namespaces and collects mount points via /proc/pid/mountinfo in mount > >>> namesapces. Then these information is used to dump socket file > >>> descriptors and opened files. > >> > >> This is just the thing I say. This allows to avoid writing recursive dump. > > > > I don't understand this. How are you going to collect namespaces in CRIU > > without knowing which are used by a dumped container? > > My patchset exports only the namespaces, which are allowed for a specific > container, and no more above this. All exported namespaces are alive, > so someone holds a reference on every of it. So they are used. > > It seems you haven't understood the way I suggested here.
See patch [11/23] > for the details. It's about permissions, and the subset of exported namespaces > is formalized there. Honestly, I have not read all the patches in this series, and you didn't describe this behavior in the cover letter. Thank you for pointing me to patch 11, but I still think it doesn't solve the problem completely. More details are in my comment a few lines above this one. > > >> But this has nothing about advantages of hierarchy in /proc/namespaces. Yes, it does. For example, when a container doesn't have its own pid namespace. > > > > Really? You said that you implemented this series to help CRIU dumping > > namespaces. I think we need to implement the CRIU part to prove that > > this interface is usable for this case. Right now, I have doubts about > > this. > > Yes, really. See my comment above and patch [11/23]. > > >> > >>> * We are going to assign names to namespaces. But this means that we > >>> need to guarantee that all names in one directory are unique. The > >>> initial proposal was to enumerate all namespaces in one proc directory, > >>> that means names of all namespaces have to be unique. This can be > >>> problematic in some cases. For example, we may want to dump a container > >>> and then restore it more than once on the same host. How are we going to > >>> avoid namespace name conficts in such cases? > >> > >> Previous message I wrote about .rename of proc files, Alexey Dobriyan > >> said this is not a taboo. Are there problem which doesn't cover the case > >> you point? > > > > Yes, there is. Namespace names will be visible from a container, so they > > have to be restored. But this means that two containers can't be > > restored from the same snapshot due to namespace name conflicts. > > > > But if we will show namespaces how I suggest, each container will see > > only its sub-tree of namespaces and we will be able to specify any name > > for the container root user namespace.
> > Now I'm sure you missed my idea. See proc_namespaces_readdir() in [11/23]. > > I do export sub-tree. I got your idea, but it is unclear how you are going to avoid name conflicts. In the root container, you will show all namespaces in the system. This means that all namespaces have to have unique names, so we will not be able to restore two containers from the same snapshot without renaming namespaces. But we can't change namespace names, because they are visible from containers and container processes can use them. > > >> > >>> If we will have per-user-namespace directories, we will need to > >>> guarantee that names are unique only inside one user namespace. > >> > >> Unique names inside one user namespace won't introduce a new /proc > >> mount. You can't pass a sub-directory of /proc/namespaces/ to a specific > >> container. To give a virtualized name you have to have a dedicated pid ns. > >> > >> Let we have in one /proc mount: > >> > >> /mnt1/proc/namespaces/userns1/.../[namespaceX_name1 -- inode XXX] > >> > >> In another another /proc mount we have: > >> > >> /mnt2/proc/namespaces/userns1/.../[namespaceX_name1_synonym -- inode XXX] > >> > >> The virtualization is made per /proc (i.e., per pid ns). Container should > >> receive either /mnt1/proc or /mnt2/proc on restore as it's /proc. > >> > >> There is no a sense of directory hierarchy for virtualization, since > >> you can't use specific sub-directory as a root directory of /proc/namespaces > >> to a container. You still have to introduce a new pid ns to have virtualized > >> /proc. > > > > I think we can figure out how to implement this. As the first idea, we > > can use the same way how /proc/net is implemented. > > > >> > >>> * With the suggested structure, for each user namepsace, we will show > >>> only its subtree of namespaces. This looks more natural than > >>> filltering content of one directory. > >> > >> It's rather subjectively I think.
/proc is related to pid ns, and user ns > >> hierarchy does not look more natural for me. > > > > or /proc is wrong place for this
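The naming concern raised in this message can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not kernel code: it assumes namespace names behave like directory entries that must be unique within one directory, and `Directory`, `restore_flat`, and `restore_grouped` are hypothetical names introduced only for illustration.

```python
# Toy model of the name-conflict argument; nothing here is a real kernel API.

class NameConflict(Exception):
    """Raised when a directory already contains an entry with this name."""


class Directory:
    """A directory whose entry names must be unique."""

    def __init__(self):
        self.entries = {}

    def add(self, name, obj):
        if name in self.entries:
            raise NameConflict(name)
        self.entries[name] = obj


def restore_flat(host_dir, snapshot_names):
    """Restore a container into one flat, host-wide directory."""
    for name in snapshot_names:
        host_dir.add(name, object())


def restore_grouped(host_dir, container_root_name, snapshot_names):
    """Restore a container into its own sub-directory of the host view."""
    sub = Directory()
    for name in snapshot_names:
        sub.add(name, object())
    host_dir.add(container_root_name, sub)


snapshot = ["user1", "user2", "mnt-A"]

# Flat layout: the second restore of the same snapshot collides.
flat = Directory()
restore_flat(flat, snapshot)
try:
    restore_flat(flat, snapshot)
    flat_second_restore_ok = True
except NameConflict:
    flat_second_restore_ok = False

# Grouped layout: only the per-container root name has to differ.
grouped = Directory()
restore_grouped(grouped, "user_ns_ct1", snapshot)
restore_grouped(grouped, "user_ns_ct2", snapshot)
```

This only models the argument made here for per-user-namespace directories. The rename()-based approach and the per-/proc virtualization mentioned in the quoted replies are not modelled.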
On 14.08.2020 04:16, Andrei Vagin wrote: > On Thu, Aug 13, 2020 at 11:12:45AM +0300, Kirill Tkhai wrote: >> On 12.08.2020 20:53, Andrei Vagin wrote: >>> On Tue, Aug 11, 2020 at 01:23:35PM +0300, Kirill Tkhai wrote: >>>> On 10.08.2020 20:34, Andrei Vagin wrote: >>>>> On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote: >>>>>> On 06.08.2020 11:05, Andrei Vagin wrote: >>>>>>> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote: >>>>>>>> On 31.07.2020 01:13, Eric W. Biederman wrote: >>>>>>>>> Kirill Tkhai <ktkhai@virtuozzo.com> writes: >>>>>>>>> >>>>>>>>>> On 30.07.2020 17:34, Eric W. Biederman wrote: >>>>>>>>>>> Kirill Tkhai <ktkhai@virtuozzo.com> writes: >>>>>>>>>>> >>>>>>>>>>>> Currently, there is no a way to list or iterate all or subset of namespaces >>>>>>>>>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories, >>>>>>>>>>>> but some also may be as open files, which are not attached to a process. >>>>>>>>>>>> When a namespace open fd is sent over unix socket and then closed, it is >>>>>>>>>>>> impossible to know whether the namespace exists or not. >>>>>>>>>>>> >>>>>>>>>>>> Also, even if namespace is exposed as attached to a process or as open file, >>>>>>>>>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because >>>>>>>>>>>> this multiplies at tasks and fds number. >>>>>>>>>>> >>>>>>>>>>> I am very dubious about this. >>>>>>>>>>> >>>>>>>>>>> I have been avoiding exactly this kind of interface because it can >>>>>>>>>>> create rather fundamental problems with checkpoint restart. >>>>>>>>>> >>>>>>>>>> restart/restore :) >>>>>>>>>> >>>>>>>>>>> You do have some filtering and the filtering is not based on current. >>>>>>>>>>> Which is good. >>>>>>>>>>> >>>>>>>>>>> A view that is relative to a user namespace might be ok. It almost >>>>>>>>>>> certainly does better as it's own little filesystem than as an extension >>>>>>>>>>> to proc though. 
>>>>>>>>>>> >>>>>>>>>>> The big thing we want to ensure is that if you migrate you can restore >>>>>>>>>>> everything. I don't see how you will be able to restore these files >>>>>>>>>>> after migration. Anything like this without having a complete >>>>>>>>>>> checkpoint/restore story is a non-starter. >>>>>>>>>> >>>>>>>>>> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/. >>>>>>>>>> >>>>>>>>>> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files. >>>>>>>>>> As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't see any >>>>>>>>>> problem here. >>>>>>>>> >>>>>>>>> An obvious diffference is that you are adding the inode to the inode to >>>>>>>>> the file name. Which means that now you really do have to preserve the >>>>>>>>> inode numbers during process migration. >>>>>>>>> >>>>>>>>> Which means now we have to do all of the work to make inode number >>>>>>>>> restoration possible. Which means now we need to have multiple >>>>>>>>> instances of nsfs so that we can restore inode numbers. >>>>>>>>> >>>>>>>>> I think this is still possible but we have been delaying figuring out >>>>>>>>> how to restore inode numbers long enough that may be actual technical >>>>>>>>> problems making it happen. >>>>>>>> >>>>>>>> Yeah, this matters. But it looks like here is not a dead end. We just need >>>>>>>> change the names the namespaces are exported to particular fs and to support >>>>>>>> rename(). >>>>>>>> >>>>>>>> Before introduction a principally new filesystem type for this, can't >>>>>>>> this be solved in current /proc? >>>>>>> >>>>>>> do you mean to introduce names for namespaces which users will be able >>>>>>> to change? By default, this can be uuid. >>>>>> >>>>>> Yes, I mean this. >>>>>> >>>>>> Currently I won't give a final answer about UUID, but I planned to show some >>>>>> default names, which based on namespace type and inode num. 
Completely custom >>>>>> names for any /proc by default will waste too much memory. >>>>>> >>>>>> So, I think the good way will be: >>>>>> >>>>>> 1)Introduce a function, which returns a hash/uuid based on ino, ns type and some static >>>>>> random seed, which is generated on boot; >>>>>> >>>>>> 2)Use the hash/uuid as default names in newly create /proc/namespaces: pid-{hash/uuid(ino, "pid")} >>>>>> >>>>>> 3)Allow rename, and allocate space only for renamed names. >>>>>> >>>>>> Maybe 2 and 3 will be implemented as shrinkable dentries and non-shrinkable. >>>>>> >>>>>>> And I have a suggestion about the structure of /proc/namespaces/. >>>>>>> >>>>>>> Each namespace is owned by one of user namespaces. Maybe it makes sense >>>>>>> to group namespaces by their user-namespaces? >>>>>>> >>>>>>> /proc/namespaces/ >>>>>>> user >>>>>>> mnt-X >>>>>>> mnt-Y >>>>>>> pid-X >>>>>>> uts-Z >>>>>>> user-X/ >>>>>>> user >>>>>>> mnt-A >>>>>>> mnt-B >>>>>>> user-C >>>>>>> user-C/ >>>>>>> user >>>>>>> user-Y/ >>>>>>> user >>>>>> >>>>>> Hm, I don't think that user namespace is a generic key value for everybody. >>>>>> For generic people tasks a user namespace is just a namespace among another >>>>>> namespace types. For me it will look a bit strage to iterate some user namespaces >>>>>> to build container net topology. >>>>> >>>>> I can’t agree with you that the user namespace is one of others. It is >>>>> the namespace for namespaces. It sets security boundaries in the system >>>>> and we need to know them to understand the whole system. >>>>> >>>>> If user namespaces are not used in the system or on a container, you >>>>> will see all namespaces in one directory. But if the system has a more >>>>> complicated structure, you will be able to build a full picture of it. 
>>>>> >>>>> You said that one of the users of this feature is CRIU (the tool to >>>>> checkpoint/restore containers) and you said that it would be good if >>>>> CRIU will be able to collect all container namespaces before dumping >>>>> processes, sockets, files etc. But how will we be able to do this if we >>>>> will list all namespaces in one directory? >>>> >>>> There is no a problem, this looks rather simple. Two cases are possible: >>>> >>>> 1)a container has dedicated namespaces set, and CRIU just has to iterate >>>> files in /proc/namespaces of root pid namespace of the container. >>>> The relationships between parents and childs of pid and user namespaces >>>> are founded via ioctl(NS_GET_PARENT). >>>> >>>> 2)container has no dedicated namespaces set. Then CRIU just has to iterate >>>> all host namespaces. There is no another way to do that, because container >>>> may have any host namespaces, and hierarchy in /proc/namespaces won't >>>> help you. >>>> >>>>> Here are my thoughts why we need to the suggested structure is better >>>>> than just a list of namespaces: >>>>> >>>>> * Users will be able to understand securies bondaries in the system. >>>>> Each namespace in the system is owned by one of user namespace and we >>>>> need to know these relationshipts to understand the whole system. >>>> >>>> Here are already NS_GET_PARENT and NS_GET_USERNS. What is the problem to use >>>> this interfaces? >>> >>> We can use these ioctl-s, but we will need to enumerate all namespaces in >>> the system to build a view of the namespace hierarchy. This will be very >>> expensive. The kernel can show this hierarchy without additional cost. >> >> No. We will have to iterate /proc/namespaces of a specific container to get >> its namespaces. It's a subset of all namespaces in system, and these all the >> namespaces, which are potentially allowed for the container. > > """ > Every /proc is related to a pid_namespace, and the pid_namespace > is related to a user_namespace. 
The items, we show in this > /proc/namespaces/ directory, are the namespaces, > whose user_namespaces are the same as /proc's user_namespace, > or their descendants. > """ // [PATCH 11/23] fs: Add /proc/namespaces/ directory > > This means that if a user want to find out all container namespaces, it > has to have access to the container procfs and the container should > a separate pid namespace. > > I would say these are two big limitations. The first one will not affect > CRIU and I agree CRIU can use this interface in its current form. > > The second one will be still the issue for CRIU. And they both will > affect other users. > > For end users, it will be a pain. They will need to create a pid > namespaces in a specified user-namespace, if a container doesn't have > its own. Then they will need to mount /proc from the container pid > namespace and only then they will be able to enumerate namespaces. In case a container does not have its own pid namespace, CRIU already sucks. No file in the /proc directory is reliable after restore, so /proc/namespaces is just one of them. A container that may access files in /proc does have to have its own pid namespace. Even if we imagine an unreal situation where the rest of the /proc files are reliable, sub-directories won't help in this case either. If we introduce a user ns hierarchy, the namespace names above the container's user ns will still be unchangeable: /proc/namespaces/parent_user_ns/container_user_ns/... The path to container_user_ns is fixed. If the container accesses the /proc/namespaces/parent_user_ns file, it will be stuck again after restore. So, the suggested sub-directories just don't work. > But to build a view of a hierarchy of these namespaces, they will need to > use a binary tool which will open each of these namespaces, call > NS_GET_PARENT and NS_GET_USERNS ioctl-s and build a tree. Yes, this is the same way we construct the task tree. A linear /proc/namespaces is a rather natural way. 
The meaning is "all namespaces that are available to tasks in this /proc directory". Grouping by user ns directories looks odd. CRIU is the only utility that needs such grouping. But even for CRIU the performance advantages look dubious. For other utilities, preferring user ns grouping over the other hierarchical namespaces looks just weird. I can agree with the idea of separate top-level sub-directories for different namespace types, like: /proc/namespaces/uts/ /proc/namespaces/user/ /proc/namespaces/pid/ ... But grouping all the other namespaces by user ns sub-directories absolutely does not look sane to me. >> >>>> >>>>> * This is simplify collecting namespaces which belong to one container. >>>>> >>>>> For example, CRIU collects all namespaces before dumping file >>>>> descriptors. Then it collects all sockets with socket-diag in network >>>>> namespaces and collects mount points via /proc/pid/mountinfo in mount >>>>> namesapces. Then these information is used to dump socket file >>>>> descriptors and opened files. >>>> >>>> This is just the thing I say. This allows to avoid writing recursive dump. >>> >>> I don't understand this. How are you going to collect namespaces in CRIU >>> without knowing which are used by a dumped container? >> >> My patchset exports only the namespaces, which are allowed for a specific >> container, and no more above this. All exported namespaces are alive, >> so someone holds a reference on every of it. So they are used. >> >> It seems you haven't understood the way I suggested here. See patch [11/23] >> for the details. It's about permissions, and the subset of exported namespaces >> is formalized there. > > Honestly, I have not read all patches in this series and you didn't > describe this behavior in the cover letter. Thank you for pointing out > to the 11 patch, but I still think it doesn't solve the problem > completely. More details is in the comment which is a few lines above > this one. 
> >> >>>> But this has nothing about advantages of hierarchy in /proc/namespaces. > > Yes, it has. For example, in cases when a container doesn't have its own > pid namespaces. > >>> >>> Really? You said that you implemented this series to help CRIU dumping >>> namespaces. I think we need to implement the CRIU part to prove that >>> this interface is usable for this case. Right now, I have doubts about >>> this. >> >> Yes, really. See my comment above and patch [11/23]. >> >>>> >>>>> * We are going to assign names to namespaces. But this means that we >>>>> need to guarantee that all names in one directory are unique. The >>>>> initial proposal was to enumerate all namespaces in one proc directory, >>>>> that means names of all namespaces have to be unique. This can be >>>>> problematic in some cases. For example, we may want to dump a container >>>>> and then restore it more than once on the same host. How are we going to >>>>> avoid namespace name conficts in such cases? >>>> >>>> Previous message I wrote about .rename of proc files, Alexey Dobriyan >>>> said this is not a taboo. Are there problem which doesn't cover the case >>>> you point? >>> >>> Yes, there is. Namespace names will be visible from a container, so they >>> have to be restored. But this means that two containers can't be >>> restored from the same snapshot due to namespace name conflicts. >>> >>> But if we will show namespaces how I suggest, each container will see >>> only its sub-tree of namespaces and we will be able to specify any name >>> for the container root user namespace. >> >> Now I'm sure you missed my idea. See proc_namespaces_readdir() in [11/23]. >> >> I do export sub-tree. > > I got your idea, but it is unclear how your are going to avoid name > conflicts. > > In the root container, you will show all namespaces in the system. These > means that all namespaces have to have unique names. 
This means we will > not able to restore two containers from the same snapshot without > renaming namespaces. But we can't change namespace names, because they > are visible from containers and container processes can use them. Grouping by user ns sub-directories does not solve the naming problem for containers without their own pid ns. See above. >> >>>> >>>>> If we will have per-user-namespace directories, we will need to >>>>> guarantee that names are unique only inside one user namespace. >>>> >>>> Unique names inside one user namespace won't introduce a new /proc >>>> mount. You can't pass a sub-directory of /proc/namespaces/ to a specific >>>> container. To give a virtualized name you have to have a dedicated pid ns. >>>> >>>> Let we have in one /proc mount: >>>> >>>> /mnt1/proc/namespaces/userns1/.../[namespaceX_name1 -- inode XXX] >>>> >>>> In another another /proc mount we have: >>>> >>>> /mnt2/proc/namespaces/userns1/.../[namespaceX_name1_synonym -- inode XXX] >>>> >>>> The virtualization is made per /proc (i.e., per pid ns). Container should >>>> receive either /mnt1/proc or /mnt2/proc on restore as it's /proc. >>>> >>>> There is no a sense of directory hierarchy for virtualization, since >>>> you can't use specific sub-directory as a root directory of /proc/namespaces >>>> to a container. You still have to introduce a new pid ns to have virtualized >>>> /proc. >>> >>> I think we can figure out how to implement this. As the first idea, we >>> can use the same way how /proc/net is implemented. >>> >>>> >>>>> * With the suggested structure, for each user namepsace, we will show >>>>> only its subtree of namespaces. This looks more natural than >>>>> filltering content of one directory. >>>> >>>> It's rather subjectively I think. /proc is related to pid ns, and user ns >>>> hierarchy does not look more natural for me. >>> >>> or /proc is wrong place for this
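The visibility rule quoted in this message from [PATCH 11/23] (a /proc mount lists the namespaces whose owning user namespace is the same as /proc's user namespace, or a descendant of it) can be sketched in Python as follows. This is a toy model, not the patch's implementation: the dictionaries, the namespace names, and the inode numbers are made up for illustration, and user namespaces themselves are left out of `owner_of` for brevity. The owner links play the role of what the NS_GET_USERNS ioctl mentioned in the thread reports.

```python
# Toy model of the visibility rule quoted from [PATCH 11/23]; not kernel code.

# parent_of maps each user namespace to its parent (None for the initial one).
parent_of = {
    "init_user": None,
    "ct1_user": "init_user",
    "ct1_nested": "ct1_user",
    "ct2_user": "init_user",
}

# owner_of maps every other namespace to the user namespace that owns it.
# Names use the type:[inode] form from the cover letter with made-up inodes.
owner_of = {
    "net:[1]": "init_user",
    "mnt:[2]": "ct1_user",
    "pid:[3]": "ct1_nested",
    "uts:[4]": "ct2_user",
}


def is_same_or_descendant(user_ns, ancestor):
    """Walk parent links upward until we reach `ancestor` or run out."""
    while user_ns is not None:
        if user_ns == ancestor:
            return True
        user_ns = parent_of[user_ns]
    return False


def visible_namespaces(proc_user_ns):
    """Names a /proc mount tied to `proc_user_ns` would list, sorted."""
    return sorted(
        ns for ns, owner in owner_of.items()
        if is_same_or_descendant(owner, proc_user_ns)
    )
```

In this model, `visible_namespaces("init_user")` lists everything, while `visible_namespaces("ct1_user")` lists only the namespaces owned by `ct1_user` and its nested user namespace. The result is a flat list, so any parent/child structure still has to be rebuilt from the owner links.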
On Fri, Aug 14, 2020 at 06:11:58PM +0300, Kirill Tkhai wrote: > On 14.08.2020 04:16, Andrei Vagin wrote: > > On Thu, Aug 13, 2020 at 11:12:45AM +0300, Kirill Tkhai wrote: > >> On 12.08.2020 20:53, Andrei Vagin wrote: > >>> On Tue, Aug 11, 2020 at 01:23:35PM +0300, Kirill Tkhai wrote: > >>>> On 10.08.2020 20:34, Andrei Vagin wrote: > >>>>> On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote: > >>>>>> On 06.08.2020 11:05, Andrei Vagin wrote: > >>>>>>> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote: > >>>>>>>> On 31.07.2020 01:13, Eric W. Biederman wrote: > >>>>>>>>> Kirill Tkhai <ktkhai@virtuozzo.com> writes: > >>>>>>>>> > >>>>>>>>>> On 30.07.2020 17:34, Eric W. Biederman wrote: > >>>>>>>>>>> Kirill Tkhai <ktkhai@virtuozzo.com> writes: > >>>>>>>>>>> > >>>>>>>>>>>> Currently, there is no a way to list or iterate all or subset of namespaces > >>>>>>>>>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories, > >>>>>>>>>>>> but some also may be as open files, which are not attached to a process. > >>>>>>>>>>>> When a namespace open fd is sent over unix socket and then closed, it is > >>>>>>>>>>>> impossible to know whether the namespace exists or not. > >>>>>>>>>>>> > >>>>>>>>>>>> Also, even if namespace is exposed as attached to a process or as open file, > >>>>>>>>>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because > >>>>>>>>>>>> this multiplies at tasks and fds number. > >>>>>>>>>>> > >>>>>>>>>>> I am very dubious about this. > >>>>>>>>>>> > >>>>>>>>>>> I have been avoiding exactly this kind of interface because it can > >>>>>>>>>>> create rather fundamental problems with checkpoint restart. > >>>>>>>>>> > >>>>>>>>>> restart/restore :) > >>>>>>>>>> > >>>>>>>>>>> You do have some filtering and the filtering is not based on current. > >>>>>>>>>>> Which is good. > >>>>>>>>>>> > >>>>>>>>>>> A view that is relative to a user namespace might be ok. 
It almost > >>>>>>>>>>> certainly does better as it's own little filesystem than as an extension > >>>>>>>>>>> to proc though. > >>>>>>>>>>> > >>>>>>>>>>> The big thing we want to ensure is that if you migrate you can restore > >>>>>>>>>>> everything. I don't see how you will be able to restore these files > >>>>>>>>>>> after migration. Anything like this without having a complete > >>>>>>>>>>> checkpoint/restore story is a non-starter. > >>>>>>>>>> > >>>>>>>>>> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/. > >>>>>>>>>> > >>>>>>>>>> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files. > >>>>>>>>>> As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't see any > >>>>>>>>>> problem here. > >>>>>>>>> > >>>>>>>>> An obvious diffference is that you are adding the inode to the inode to > >>>>>>>>> the file name. Which means that now you really do have to preserve the > >>>>>>>>> inode numbers during process migration. > >>>>>>>>> > >>>>>>>>> Which means now we have to do all of the work to make inode number > >>>>>>>>> restoration possible. Which means now we need to have multiple > >>>>>>>>> instances of nsfs so that we can restore inode numbers. > >>>>>>>>> > >>>>>>>>> I think this is still possible but we have been delaying figuring out > >>>>>>>>> how to restore inode numbers long enough that may be actual technical > >>>>>>>>> problems making it happen. > >>>>>>>> > >>>>>>>> Yeah, this matters. But it looks like here is not a dead end. We just need > >>>>>>>> change the names the namespaces are exported to particular fs and to support > >>>>>>>> rename(). > >>>>>>>> > >>>>>>>> Before introduction a principally new filesystem type for this, can't > >>>>>>>> this be solved in current /proc? > >>>>>>> > >>>>>>> do you mean to introduce names for namespaces which users will be able > >>>>>>> to change? By default, this can be uuid. 
> >>>>>> > >>>>>> Yes, I mean this. > >>>>>> > >>>>>> Currently I won't give a final answer about UUID, but I planned to show some > >>>>>> default names, which based on namespace type and inode num. Completely custom > >>>>>> names for any /proc by default will waste too much memory. > >>>>>> > >>>>>> So, I think the good way will be: > >>>>>> > >>>>>> 1)Introduce a function, which returns a hash/uuid based on ino, ns type and some static > >>>>>> random seed, which is generated on boot; > >>>>>> > >>>>>> 2)Use the hash/uuid as default names in newly create /proc/namespaces: pid-{hash/uuid(ino, "pid")} > >>>>>> > >>>>>> 3)Allow rename, and allocate space only for renamed names. > >>>>>> > >>>>>> Maybe 2 and 3 will be implemented as shrinkable dentries and non-shrinkable. > >>>>>> > >>>>>>> And I have a suggestion about the structure of /proc/namespaces/. > >>>>>>> > >>>>>>> Each namespace is owned by one of user namespaces. Maybe it makes sense > >>>>>>> to group namespaces by their user-namespaces? > >>>>>>> > >>>>>>> /proc/namespaces/ > >>>>>>> user > >>>>>>> mnt-X > >>>>>>> mnt-Y > >>>>>>> pid-X > >>>>>>> uts-Z > >>>>>>> user-X/ > >>>>>>> user > >>>>>>> mnt-A > >>>>>>> mnt-B > >>>>>>> user-C > >>>>>>> user-C/ > >>>>>>> user > >>>>>>> user-Y/ > >>>>>>> user > >>>>>> > >>>>>> Hm, I don't think that user namespace is a generic key value for everybody. > >>>>>> For generic people tasks a user namespace is just a namespace among another > >>>>>> namespace types. For me it will look a bit strage to iterate some user namespaces > >>>>>> to build container net topology. > >>>>> > >>>>> I can’t agree with you that the user namespace is one of others. It is > >>>>> the namespace for namespaces. It sets security boundaries in the system > >>>>> and we need to know them to understand the whole system. > >>>>> > >>>>> If user namespaces are not used in the system or on a container, you > >>>>> will see all namespaces in one directory. 
But if the system has a more > >>>>> complicated structure, you will be able to build a full picture of it. > >>>>> > >>>>> You said that one of the users of this feature is CRIU (the tool to > >>>>> checkpoint/restore containers) and you said that it would be good if > >>>>> CRIU will be able to collect all container namespaces before dumping > >>>>> processes, sockets, files etc. But how will we be able to do this if we > >>>>> will list all namespaces in one directory? > >>>> > >>>> There is no a problem, this looks rather simple. Two cases are possible: > >>>> > >>>> 1)a container has dedicated namespaces set, and CRIU just has to iterate > >>>> files in /proc/namespaces of root pid namespace of the container. > >>>> The relationships between parents and childs of pid and user namespaces > >>>> are founded via ioctl(NS_GET_PARENT). > >>>> > >>>> 2)container has no dedicated namespaces set. Then CRIU just has to iterate > >>>> all host namespaces. There is no another way to do that, because container > >>>> may have any host namespaces, and hierarchy in /proc/namespaces won't > >>>> help you. > >>>> > >>>>> Here are my thoughts why we need to the suggested structure is better > >>>>> than just a list of namespaces: > >>>>> > >>>>> * Users will be able to understand securies bondaries in the system. > >>>>> Each namespace in the system is owned by one of user namespace and we > >>>>> need to know these relationshipts to understand the whole system. > >>>> > >>>> Here are already NS_GET_PARENT and NS_GET_USERNS. What is the problem to use > >>>> this interfaces? > >>> > >>> We can use these ioctl-s, but we will need to enumerate all namespaces in > >>> the system to build a view of the namespace hierarchy. This will be very > >>> expensive. The kernel can show this hierarchy without additional cost. > >> > >> No. We will have to iterate /proc/namespaces of a specific container to get > >> its namespaces. 
It's a subset of all namespaces in system, and these all the > >> namespaces, which are potentially allowed for the container. > > > > """ > > Every /proc is related to a pid_namespace, and the pid_namespace > > is related to a user_namespace. The items, we show in this > > /proc/namespaces/ directory, are the namespaces, > > whose user_namespaces are the same as /proc's user_namespace, > > or their descendants. > > """ // [PATCH 11/23] fs: Add /proc/namespaces/ directory > > > > This means that if a user want to find out all container namespaces, it > > has to have access to the container procfs and the container should > > a separate pid namespace. > > > > I would say these are two big limitations. The first one will not affect > > CRIU and I agree CRIU can use this interface in its current form. > > > > The second one will be still the issue for CRIU. And they both will > > affect other users. > > > > For end users, it will be a pain. They will need to create a pid > > namespaces in a specified user-namespace, if a container doesn't have > > its own. Then they will need to mount /proc from the container pid > > namespace and only then they will be able to enumerate namespaces. > > In case of a container does not have its own pid namespace, CRIU already > sucks. Every file in /proc directory is not reliable after restore, > so /proc/namespaces is just one of them. Container, who may access files > in /proc, does have to have its own pid namespace. Can you be more specific here? Which files are not reliable? And why don't we need to think about this use case? If we have issues here, maybe we need to think about how to fix them instead of adding a new one. > > Even if we imagine an unreal situation, when the rest of /proc files are reliable, > sub-directories won't help in this case also. In case of we introduce user ns > hierarchy, the namespaces names above container's user ns, will still > be unchangeble: > > /proc/namespaces/parent_user_ns/container_user_ns/... 
> > Path to container_user_ns is fixed. If container accesses /proc/namespace/parent_user_ns > file, it will suck a pow after restore again. In the case of a user ns hierarchy, a container will see only its sub-tree and it will not know the name of its root namespace. It will look like this:

From host:
/proc/namespaces/user_ns_ct1/user1
                             user2
/proc/namespaces/user_ns_ct2/user1
                             user2

From ct1:
/proc/namespaces/user1
                 user2

And now could you explain how you are going to solve this problem with your interface? > > So, the suggested sub-directories just don't work. I am sure it will work. > > > But to build a view of a hierarchy of these namespaces, they will need to > > use a binary tool which will open each of these namespaces, call > > NS_GET_PARENT and NS_GET_USERNS ioctl-s and build a tree. > > Yes, it's the same way we have on a construction of tasks tree. > > Linear /proc/namespaces is rather natural way. The sense is "all namespaces, > which are available for tasks in this /proc directory". > > Grouping by user ns directories looks odd. CRIU is only util, who needs > such the grouping. But even for CRIU performance advantages look dubious. I can't agree with you here. This isn't about CRIU. Grouping by user ns doesn't look odd to me, because this is how namespaces are grouped in the kernel. > > For another utils, the preference of user ns grouping over another hierarchy > namespaces looks just weirdy weird. > > I can agree with an idea of separate top-level sub-directories for different > namespaces types like: > > /proc/namespaces/uts/ > /proc/namespaces/user/ > /proc/namespaces/pid/ > ... > > But grouping of all another namespaces by user ns sub-directories absolutely > does not look sane for me. I think we are stuck here and we need to ask for someone else's opinion. > > >> > >>>> > >>>>> * This is simplify collecting namespaces which belong to one container. > >>>>> > >>>>> For example, CRIU collects all namespaces before dumping file > >>>>> descriptors. 
Then it collects all sockets with socket-diag in network > >>>>> namespaces and collects mount points via /proc/pid/mountinfo in mount > >>>>> namesapces. Then these information is used to dump socket file > >>>>> descriptors and opened files. > >>>> > >>>> This is just the thing I say. This allows to avoid writing recursive dump. > >>> > >>> I don't understand this. How are you going to collect namespaces in CRIU > >>> without knowing which are used by a dumped container? > >> > >> My patchset exports only the namespaces, which are allowed for a specific > >> container, and no more above this. All exported namespaces are alive, > >> so someone holds a reference on every of it. So they are used. > >> > >> It seems you haven't understood the way I suggested here. See patch [11/23] > >> for the details. It's about permissions, and the subset of exported namespaces > >> is formalized there. > > > > Honestly, I have not read all patches in this series and you didn't > > describe this behavior in the cover letter. Thank you for pointing out > > to the 11 patch, but I still think it doesn't solve the problem > > completely. More details is in the comment which is a few lines above > > this one. > > > >> > >>>> But this has nothing about advantages of hierarchy in /proc/namespaces. > > > > Yes, it has. For example, in cases when a container doesn't have its own > > pid namespaces. > > > >>> > >>> Really? You said that you implemented this series to help CRIU dumping > >>> namespaces. I think we need to implement the CRIU part to prove that > >>> this interface is usable for this case. Right now, I have doubts about > >>> this. > >> > >> Yes, really. See my comment above and patch [11/23]. > >> > >>>> > >>>>> * We are going to assign names to namespaces. But this means that we > >>>>> need to guarantee that all names in one directory are unique. 
The > >>>>> initial proposal was to enumerate all namespaces in one proc directory, > >>>>> that means names of all namespaces have to be unique. This can be > >>>>> problematic in some cases. For example, we may want to dump a container > >>>>> and then restore it more than once on the same host. How are we going to > >>>>> avoid namespace name conficts in such cases? > >>>> > >>>> Previous message I wrote about .rename of proc files, Alexey Dobriyan > >>>> said this is not a taboo. Are there problem which doesn't cover the case > >>>> you point? > >>> > >>> Yes, there is. Namespace names will be visible from a container, so they > >>> have to be restored. But this means that two containers can't be > >>> restored from the same snapshot due to namespace name conflicts. > >>> > >>> But if we will show namespaces how I suggest, each container will see > >>> only its sub-tree of namespaces and we will be able to specify any name > >>> for the container root user namespace. > >> > >> Now I'm sure you missed my idea. See proc_namespaces_readdir() in [11/23]. > >> > >> I do export sub-tree. > > > > I got your idea, but it is unclear how your are going to avoid name > > conflicts. > > > > In the root container, you will show all namespaces in the system. These > > means that all namespaces have to have unique names. This means we will > > not able to restore two containers from the same snapshot without > > renaming namespaces. But we can't change namespace names, because they > > are visible from containers and container processes can use them. > > Grouping by user ns sub-directories does not solve a problem with names > of containers w/o own pid ns. See above. It does solve it; you just don't understand how it works. See above. > > >> > >>>> > >>>>> If we will have per-user-namespace directories, we will need to > >>>>> guarantee that names are unique only inside one user namespace. 
> >>>> > >>>> Unique names inside one user namespace won't introduce a new /proc > >>>> mount. You can't pass a sub-directory of /proc/namespaces/ to a specific > >>>> container. To give a virtualized name you have to have a dedicated pid ns. > >>>> > >>>> Let we have in one /proc mount: > >>>> > >>>> /mnt1/proc/namespaces/userns1/.../[namespaceX_name1 -- inode XXX] > >>>> > >>>> In another another /proc mount we have: > >>>> > >>>> /mnt2/proc/namespaces/userns1/.../[namespaceX_name1_synonym -- inode XXX] > >>>> > >>>> The virtualization is made per /proc (i.e., per pid ns). Container should > >>>> receive either /mnt1/proc or /mnt2/proc on restore as it's /proc. > >>>> > >>>> There is no a sense of directory hierarchy for virtualization, since > >>>> you can't use specific sub-directory as a root directory of /proc/namespaces > >>>> to a container. You still have to introduce a new pid ns to have virtualized > >>>> /proc. > >>> > >>> I think we can figure out how to implement this. As the first idea, we > >>> can use the same way how /proc/net is implemented. > >>> > >>>> > >>>>> * With the suggested structure, for each user namepsace, we will show > >>>>> only its subtree of namespaces. This looks more natural than > >>>>> filltering content of one directory. > >>>> > >>>> It's rather subjectively I think. /proc is related to pid ns, and user ns > >>>> hierarchy does not look more natural for me. > >>> > >>> or /proc is wrong place for this >
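The host view and container view described in this message (the host sees /proc/namespaces/user_ns_ct1/... and /proc/namespaces/user_ns_ct2/..., while ct1 sees only its own sub-tree at the top level) can be sketched in Python. This is a toy model of the proposed layout, not an implementation: nested dictionaries stand in for directories, and `list_paths` and `container_view` are hypothetical helpers introduced only for illustration.

```python
# Toy model of the per-user-namespace directory layout; not kernel code.

# Nested dicts stand in for directories; names follow the example in the thread.
host_tree = {
    "user_ns_ct1": {"user1": {}, "user2": {}},
    "user_ns_ct2": {"user1": {}, "user2": {}},
}


def list_paths(tree, prefix="/proc/namespaces"):
    """Return every path below `prefix`, depth first, in sorted order."""
    paths = []
    for name in sorted(tree):
        path = f"{prefix}/{name}"
        paths.append(path)
        paths.extend(list_paths(tree[name], path))
    return paths


def container_view(tree, root_name):
    """The sub-tree a container rooted at `root_name` sees as its top level."""
    return tree[root_name]


host_paths = list_paths(host_tree)
ct1_paths = list_paths(container_view(host_tree, "user_ns_ct1"))
ct2_paths = list_paths(container_view(host_tree, "user_ns_ct2"))
```

In this model both containers see identical paths (`/proc/namespaces/user1` and `/proc/namespaces/user2`), and the name of each container's root user namespace appears only in the host view. Whether the kernel can actually present such a rooted view without a separate /proc mount is the open question in the thread; the sketch does not address it.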
On 14.08.2020 22:21, Andrei Vagin wrote: > On Fri, Aug 14, 2020 at 06:11:58PM +0300, Kirill Tkhai wrote: >> On 14.08.2020 04:16, Andrei Vagin wrote: >>> On Thu, Aug 13, 2020 at 11:12:45AM +0300, Kirill Tkhai wrote: >>>> On 12.08.2020 20:53, Andrei Vagin wrote: >>>>> On Tue, Aug 11, 2020 at 01:23:35PM +0300, Kirill Tkhai wrote: >>>>>> On 10.08.2020 20:34, Andrei Vagin wrote: >>>>>>> On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote: >>>>>>>> On 06.08.2020 11:05, Andrei Vagin wrote: >>>>>>>>> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote: >>>>>>>>>> On 31.07.2020 01:13, Eric W. Biederman wrote: >>>>>>>>>>> Kirill Tkhai <ktkhai@virtuozzo.com> writes: >>>>>>>>>>> >>>>>>>>>>>> On 30.07.2020 17:34, Eric W. Biederman wrote: >>>>>>>>>>>>> Kirill Tkhai <ktkhai@virtuozzo.com> writes: >>>>>>>>>>>>> >>>>>>>>>>>>>> Currently, there is no a way to list or iterate all or subset of namespaces >>>>>>>>>>>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories, >>>>>>>>>>>>>> but some also may be as open files, which are not attached to a process. >>>>>>>>>>>>>> When a namespace open fd is sent over unix socket and then closed, it is >>>>>>>>>>>>>> impossible to know whether the namespace exists or not. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Also, even if namespace is exposed as attached to a process or as open file, >>>>>>>>>>>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because >>>>>>>>>>>>>> this multiplies at tasks and fds number. >>>>>>>>>>>>> >>>>>>>>>>>>> I am very dubious about this. >>>>>>>>>>>>> >>>>>>>>>>>>> I have been avoiding exactly this kind of interface because it can >>>>>>>>>>>>> create rather fundamental problems with checkpoint restart. >>>>>>>>>>>> >>>>>>>>>>>> restart/restore :) >>>>>>>>>>>> >>>>>>>>>>>>> You do have some filtering and the filtering is not based on current. >>>>>>>>>>>>> Which is good. >>>>>>>>>>>>> >>>>>>>>>>>>> A view that is relative to a user namespace might be ok. 
It almost >>>>>>>>>>>>> certainly does better as it's own little filesystem than as an extension >>>>>>>>>>>>> to proc though. >>>>>>>>>>>>> >>>>>>>>>>>>> The big thing we want to ensure is that if you migrate you can restore >>>>>>>>>>>>> everything. I don't see how you will be able to restore these files >>>>>>>>>>>>> after migration. Anything like this without having a complete >>>>>>>>>>>>> checkpoint/restore story is a non-starter. >>>>>>>>>>>> >>>>>>>>>>>> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/. >>>>>>>>>>>> >>>>>>>>>>>> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files. >>>>>>>>>>>> As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't see any >>>>>>>>>>>> problem here. >>>>>>>>>>> >>>>>>>>>>> An obvious diffference is that you are adding the inode to the inode to >>>>>>>>>>> the file name. Which means that now you really do have to preserve the >>>>>>>>>>> inode numbers during process migration. >>>>>>>>>>> >>>>>>>>>>> Which means now we have to do all of the work to make inode number >>>>>>>>>>> restoration possible. Which means now we need to have multiple >>>>>>>>>>> instances of nsfs so that we can restore inode numbers. >>>>>>>>>>> >>>>>>>>>>> I think this is still possible but we have been delaying figuring out >>>>>>>>>>> how to restore inode numbers long enough that may be actual technical >>>>>>>>>>> problems making it happen. >>>>>>>>>> >>>>>>>>>> Yeah, this matters. But it looks like here is not a dead end. We just need >>>>>>>>>> change the names the namespaces are exported to particular fs and to support >>>>>>>>>> rename(). >>>>>>>>>> >>>>>>>>>> Before introduction a principally new filesystem type for this, can't >>>>>>>>>> this be solved in current /proc? >>>>>>>>> >>>>>>>>> do you mean to introduce names for namespaces which users will be able >>>>>>>>> to change? By default, this can be uuid. 
>>>>>>>> >>>>>>>> Yes, I mean this. >>>>>>>> >>>>>>>> Currently I won't give a final answer about UUID, but I planned to show some >>>>>>>> default names, which based on namespace type and inode num. Completely custom >>>>>>>> names for any /proc by default will waste too much memory. >>>>>>>> >>>>>>>> So, I think the good way will be: >>>>>>>> >>>>>>>> 1)Introduce a function, which returns a hash/uuid based on ino, ns type and some static >>>>>>>> random seed, which is generated on boot; >>>>>>>> >>>>>>>> 2)Use the hash/uuid as default names in newly create /proc/namespaces: pid-{hash/uuid(ino, "pid")} >>>>>>>> >>>>>>>> 3)Allow rename, and allocate space only for renamed names. >>>>>>>> >>>>>>>> Maybe 2 and 3 will be implemented as shrinkable dentries and non-shrinkable. >>>>>>>> >>>>>>>>> And I have a suggestion about the structure of /proc/namespaces/. >>>>>>>>> >>>>>>>>> Each namespace is owned by one of user namespaces. Maybe it makes sense >>>>>>>>> to group namespaces by their user-namespaces? >>>>>>>>> >>>>>>>>> /proc/namespaces/ >>>>>>>>> user >>>>>>>>> mnt-X >>>>>>>>> mnt-Y >>>>>>>>> pid-X >>>>>>>>> uts-Z >>>>>>>>> user-X/ >>>>>>>>> user >>>>>>>>> mnt-A >>>>>>>>> mnt-B >>>>>>>>> user-C >>>>>>>>> user-C/ >>>>>>>>> user >>>>>>>>> user-Y/ >>>>>>>>> user >>>>>>>> >>>>>>>> Hm, I don't think that user namespace is a generic key value for everybody. >>>>>>>> For generic people tasks a user namespace is just a namespace among another >>>>>>>> namespace types. For me it will look a bit strage to iterate some user namespaces >>>>>>>> to build container net topology. >>>>>>> >>>>>>> I can’t agree with you that the user namespace is one of others. It is >>>>>>> the namespace for namespaces. It sets security boundaries in the system >>>>>>> and we need to know them to understand the whole system. >>>>>>> >>>>>>> If user namespaces are not used in the system or on a container, you >>>>>>> will see all namespaces in one directory. 
But if the system has a more >>>>>>> complicated structure, you will be able to build a full picture of it. >>>>>>> >>>>>>> You said that one of the users of this feature is CRIU (the tool to >>>>>>> checkpoint/restore containers) and you said that it would be good if >>>>>>> CRIU will be able to collect all container namespaces before dumping >>>>>>> processes, sockets, files etc. But how will we be able to do this if we >>>>>>> will list all namespaces in one directory? >>>>>> >>>>>> There is no a problem, this looks rather simple. Two cases are possible: >>>>>> >>>>>> 1)a container has dedicated namespaces set, and CRIU just has to iterate >>>>>> files in /proc/namespaces of root pid namespace of the container. >>>>>> The relationships between parents and childs of pid and user namespaces >>>>>> are founded via ioctl(NS_GET_PARENT). >>>>>> >>>>>> 2)container has no dedicated namespaces set. Then CRIU just has to iterate >>>>>> all host namespaces. There is no another way to do that, because container >>>>>> may have any host namespaces, and hierarchy in /proc/namespaces won't >>>>>> help you. >>>>>> >>>>>>> Here are my thoughts why we need to the suggested structure is better >>>>>>> than just a list of namespaces: >>>>>>> >>>>>>> * Users will be able to understand securies bondaries in the system. >>>>>>> Each namespace in the system is owned by one of user namespace and we >>>>>>> need to know these relationshipts to understand the whole system. >>>>>> >>>>>> Here are already NS_GET_PARENT and NS_GET_USERNS. What is the problem to use >>>>>> this interfaces? >>>>> >>>>> We can use these ioctl-s, but we will need to enumerate all namespaces in >>>>> the system to build a view of the namespace hierarchy. This will be very >>>>> expensive. The kernel can show this hierarchy without additional cost. >>>> >>>> No. We will have to iterate /proc/namespaces of a specific container to get >>>> its namespaces. 
It's a subset of all namespaces in system, and these all the >>>> namespaces, which are potentially allowed for the container. >>> >>> """ >>> Every /proc is related to a pid_namespace, and the pid_namespace >>> is related to a user_namespace. The items, we show in this >>> /proc/namespaces/ directory, are the namespaces, >>> whose user_namespaces are the same as /proc's user_namespace, >>> or their descendants. >>> """ // [PATCH 11/23] fs: Add /proc/namespaces/ directory >>> >>> This means that if a user want to find out all container namespaces, it >>> has to have access to the container procfs and the container should >>> a separate pid namespace. >>> >>> I would say these are two big limitations. The first one will not affect >>> CRIU and I agree CRIU can use this interface in its current form. >>> >>> The second one will be still the issue for CRIU. And they both will >>> affect other users. >>> >>> For end users, it will be a pain. They will need to create a pid >>> namespaces in a specified user-namespace, if a container doesn't have >>> its own. Then they will need to mount /proc from the container pid >>> namespace and only then they will be able to enumerate namespaces. >> >> In case of a container does not have its own pid namespace, CRIU already >> sucks. Every file in /proc directory is not reliable after restore, >> so /proc/namespaces is just one of them. Container, who may access files >> in /proc, does have to have its own pid namespace. > > Can you be more detailed here? What files are not reliable? And why we > don't need to think about this use-case? If we have any issues here, > maybe we need to think how to fix them instead of adding a new one. Any file in /proc is not reliable. You can't guarantee, the pid you need at restore time will be free. Simple example: a program reading information about its threads. It can't believe /proc/XXX/task/YYY/ after restore, any access will results in error. The same is with other files in /proc. 
Why do you require additional guarantees from the only directory in /proc? This is really strange approach. The issue is already fixed, and the fix is called pid namespace. Did you get my proposition? Any container will rename namespaces like it wants in its own /proc. Current patchset does not contain this, but I wrote this in replies. Maybe you missed that. >> >> Even if we imagine an unreal situation, when the rest of /proc files are reliable, >> sub-directories won't help in this case also. In case of we introduce user ns >> hierarchy, the namespaces names above container's user ns, will still >> be unchangeble: >> >> /proc/namespaces/parent_user_ns/container_user_ns/... >> >> Path to container_user_ns is fixed. If container accesses /proc/namespace/parent_user_ns >> file, it will suck a pow after restore again. > > > In case of user ns hierarchy, a container will see only its sub-tree and > it will not know a name of its root namespace. It will look like this: > > From host: > /proc/namespaces/user_ns_ct1/user1 > user2 > > /proc/namespaces/user_ns_ct2/user1 > user2 > > From ct1: > /proc/namespaces/user1 > user2 This is not expedient. You can't reliable restore certain pid in the same pid namespace, which is very likely used information. But you request this strange functionality from rare used /proc/namespaces, which is only for system utils. This is really strange and useless. Hierarchy during user namespace is completely crap IMO. The world does not spinning around CRIU. It will be really strange to analyze container net namespaces topology (say, where veth is connected) iterating over user namespaces directories. What is this information for? Nobody needs it. It is just bad design and ugly interface, which makes users to say curses for inventor of such the interface. > And now could you explain how you are going to solve this problem with > your interface? I don't give more guarantees, than guarantees during pid restore. 
What do you have on restore w/o pid namespace now?! If there is free pid, you restore you program with this pid number. Otherwise, you restore with another pid, or do not restore. The same is with namespace aliases. No more. >> >> So, the suggested sub-directories just don't work. > > I am sure it will work. > >> >>> But to build a view of a hierarchy of these namespaces, they will need to >>> use a binary tool which will open each of these namespaces, call >>> NS_GET_PARENT and NS_GET_USERNS ioctl-s and build a tree. >> >> Yes, it's the same way we have on a construction of tasks tree. >> >> Linear /proc/namespaces is rather natural way. The sense is "all namespaces, >> which are available for tasks in this /proc directory". >> >> Grouping by user ns directories looks odd. CRIU is only util, who needs >> such the grouping. But even for CRIU performance advantages look dubious. > > I can't agree with you here. This isn't about CRIU. Grouping by user ns > doesn't look odd for me, because this is how namespaces are grouped in > the kernel. Nope. Namespaces are not grouped by user namespace hierarchy. Pid and user namespace use their own parent/child grouping, all another namespaces types are linked in double linked lists. >> >> For another utils, the preference of user ns grouping over another hierarchy >> namespaces looks just weirdy weird. >> >> I can agree with an idea of separate top-level sub-directories for different >> namespaces types like: >> >> /proc/namespaces/uts/ >> /proc/namespaces/user/ >> /proc/namespaces/pid/ >> ... >> >> But grouping of all another namespaces by user ns sub-directories absolutely >> does not look sane for me. > > I think we are stuck here and we need to ask an opinion of someone else. > >> >>>> >>>>>> >>>>>>> * This is simplify collecting namespaces which belong to one container. >>>>>>> >>>>>>> For example, CRIU collects all namespaces before dumping file >>>>>>> descriptors. 
Then it collects all sockets with socket-diag in network >>>>>>> namespaces and collects mount points via /proc/pid/mountinfo in mount >>>>>>> namesapces. Then these information is used to dump socket file >>>>>>> descriptors and opened files. >>>>>> >>>>>> This is just the thing I say. This allows to avoid writing recursive dump. >>>>> >>>>> I don't understand this. How are you going to collect namespaces in CRIU >>>>> without knowing which are used by a dumped container? >>>> >>>> My patchset exports only the namespaces, which are allowed for a specific >>>> container, and no more above this. All exported namespaces are alive, >>>> so someone holds a reference on every of it. So they are used. >>>> >>>> It seems you haven't understood the way I suggested here. See patch [11/23] >>>> for the details. It's about permissions, and the subset of exported namespaces >>>> is formalized there. >>> >>> Honestly, I have not read all patches in this series and you didn't >>> describe this behavior in the cover letter. Thank you for pointing out >>> to the 11 patch, but I still think it doesn't solve the problem >>> completely. More details is in the comment which is a few lines above >>> this one. >>> >>>> >>>>>> But this has nothing about advantages of hierarchy in /proc/namespaces. >>> >>> Yes, it has. For example, in cases when a container doesn't have its own >>> pid namespaces. >>> >>>>> >>>>> Really? You said that you implemented this series to help CRIU dumping >>>>> namespaces. I think we need to implement the CRIU part to prove that >>>>> this interface is usable for this case. Right now, I have doubts about >>>>> this. >>>> >>>> Yes, really. See my comment above and patch [11/23]. >>>> >>>>>> >>>>>>> * We are going to assign names to namespaces. But this means that we >>>>>>> need to guarantee that all names in one directory are unique. 
The >>>>>>> initial proposal was to enumerate all namespaces in one proc directory, >>>>>>> that means names of all namespaces have to be unique. This can be >>>>>>> problematic in some cases. For example, we may want to dump a container >>>>>>> and then restore it more than once on the same host. How are we going to >>>>>>> avoid namespace name conficts in such cases? >>>>>> >>>>>> Previous message I wrote about .rename of proc files, Alexey Dobriyan >>>>>> said this is not a taboo. Are there problem which doesn't cover the case >>>>>> you point? >>>>> >>>>> Yes, there is. Namespace names will be visible from a container, so they >>>>> have to be restored. But this means that two containers can't be >>>>> restored from the same snapshot due to namespace name conflicts. >>>>> >>>>> But if we will show namespaces how I suggest, each container will see >>>>> only its sub-tree of namespaces and we will be able to specify any name >>>>> for the container root user namespace. >>>> >>>> Now I'm sure you missed my idea. See proc_namespaces_readdir() in [11/23]. >>>> >>>> I do export sub-tree. >>> >>> I got your idea, but it is unclear how your are going to avoid name >>> conflicts. >>> >>> In the root container, you will show all namespaces in the system. These >>> means that all namespaces have to have unique names. This means we will >>> not able to restore two containers from the same snapshot without >>> renaming namespaces. But we can't change namespace names, because they >>> are visible from containers and container processes can use them. >> >> Grouping by user ns sub-directories does not solve a problem with names >> of containers w/o own pid ns. See above. > > It solves, you just doesn't understand how it works. See above. > >> >>>> >>>>>> >>>>>>> If we will have per-user-namespace directories, we will need to >>>>>>> guarantee that names are unique only inside one user namespace. 
>>>>>> >>>>>> Unique names inside one user namespace won't introduce a new /proc >>>>>> mount. You can't pass a sub-directory of /proc/namespaces/ to a specific >>>>>> container. To give a virtualized name you have to have a dedicated pid ns. >>>>>> >>>>>> Let we have in one /proc mount: >>>>>> >>>>>> /mnt1/proc/namespaces/userns1/.../[namespaceX_name1 -- inode XXX] >>>>>> >>>>>> In another another /proc mount we have: >>>>>> >>>>>> /mnt2/proc/namespaces/userns1/.../[namespaceX_name1_synonym -- inode XXX] >>>>>> >>>>>> The virtualization is made per /proc (i.e., per pid ns). Container should >>>>>> receive either /mnt1/proc or /mnt2/proc on restore as it's /proc. >>>>>> >>>>>> There is no a sense of directory hierarchy for virtualization, since >>>>>> you can't use specific sub-directory as a root directory of /proc/namespaces >>>>>> to a container. You still have to introduce a new pid ns to have virtualized >>>>>> /proc. >>>>> >>>>> I think we can figure out how to implement this. As the first idea, we >>>>> can use the same way how /proc/net is implemented. >>>>> >>>>>> >>>>>>> * With the suggested structure, for each user namepsace, we will show >>>>>>> only its subtree of namespaces. This looks more natural than >>>>>>> filltering content of one directory. >>>>>> >>>>>> It's rather subjectively I think. /proc is related to pid ns, and user ns >>>>>> hierarchy does not look more natural for me. >>>>> >>>>> or /proc is wrong place for this
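The default-naming scheme quoted earlier in this message (a hash of ino, ns type and a boot-time seed, with memory allocated only for renamed entries) could be modelled in userspace roughly as follows. This is only a sketch: BOOT_SEED, default_ns_name() and NamespaceNames are hypothetical names, and the choice of HMAC-SHA256 truncated to 16 hex digits is an assumption, not something the patchset specifies.

```python
import hashlib
import hmac
import os

# Hypothetical stand-in for the "static random seed, which is generated on boot".
BOOT_SEED = os.urandom(16)


def default_ns_name(ns_type, ino, seed=BOOT_SEED):
    """Derive a default name such as 'pid-<hash>' from type, inode and seed."""
    msg = f"{ns_type}:{ino}".encode()
    digest = hmac.new(seed, msg, hashlib.sha256).hexdigest()[:16]
    return f"{ns_type}-{digest}"


class NamespaceNames:
    """Name table for one /proc mount; only renamed entries take up memory."""

    def __init__(self, seed=BOOT_SEED):
        self._seed = seed
        self._renamed = {}  # (ns_type, ino) -> custom name

    def name_of(self, ns_type, ino):
        # Default names are recomputed on demand and never stored.
        key = (ns_type, ino)
        if key in self._renamed:
            return self._renamed[key]
        return default_ns_name(ns_type, ino, self._seed)

    def rename(self, ns_type, ino, new_name):
        # Only clashes with other renamed entries are checked in this sketch.
        for key, name in self._renamed.items():
            if name == new_name and key != (ns_type, ino):
                raise FileExistsError(new_name)
        self._renamed[(ns_type, ino)] = new_name
```

Two such tables with different renames would correspond to the /mnt1/proc and /mnt2/proc synonyms mentioned above: the same inode, seen under a different name in each mount.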
Creating names in the kernel for namespaces is very difficult and
problematic. I have not seen anything that looks like all of the
problems have been solved with restoring these new names.

When your filter for your list of namespaces is the user namespace,
creating a new directory in proc is highly questionable.

As everyone uses proc, placing this functionality in proc also amplifies
the problem of creating names.

Rather than proc, having a way to mount a namespace filesystem filtered by
the user namespace of the mounter is likely to have many, many fewer
problems. Especially as we are limiting/not allowing new non-process
things and ideally finding a way to remove the non-process things.

Kirill, you have a good point that taking the case where a pid namespace
does not exist in a user namespace is likely quite unrealistic.

Kirill mentioned upthread that the list of namespaces is the list that
can appear in a container. Except by discipline in creating containers,
it is not possible to know which namespaces may appear attached to a
process. It is possible to be very creative with setns, and violate any
constraint you may have. Which means your filtered list of namespaces
may not contain all of the namespaces used by a set of processes. This
further argues that attaching the list of namespaces to proc does not
make sense.

Andrei has a good point that placing the names in a hierarchy by
user namespace has the potential to create more freedom when
assigning names to namespaces, as it means the names for namespaces
do not need to be globally unique, while still allowing the names
to stay the same.

To recap, the possibilities for names for namespaces that I have seen
mentioned in this thread are:
  - Names per mount
  - Names per user namespace

I personally suspect that names per mount are likely to be so flexible
that they are confusing, while names per user namespace are likely to be
rigid, possibly too rigid to use.

It all depends upon how everything is used. I have yet to see a
complete story of how these names will be generated and used. So I can
not really judge.

Let me add another take on this idea that might give this work a path
forward. If I were solving this I would explore giving nsfs directories
per user namespace, and a way to mount it that exposed the directory of
the mounter's current user namespace (something like btrfs snapshots).

Hmm. For the user namespace directory I think I would give it a file
"ns" that can be opened to get a file handle on the user namespace.
Plus a set of subdirectories ("cgroup", "ipc", "mnt", "net", "pid",
"user", "uts") for each type of namespace. In each directory I think
I would just have a 64-bit counter, and each new entry I would assign the
next number from that counter.

The restore could either have the ability to rename files or simply the
ability to bump the counter (like we do with pids) so the names of the
namespaces can be restored.

That winds up making a user namespace the namespace of namespaces, so
I am not 100% sure about the idea.

Eric
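The per-type counter idea above can be illustrated with a small userspace model. It is a sketch under assumptions, not kernel code: the class UserNsDirectory and its methods add() and bump() are hypothetical names, and starting the counters at 1 is an arbitrary choice.

```python
NS_TYPES = ("cgroup", "ipc", "mnt", "net", "pid", "user", "uts")
COUNTER_MAX = 2**64 - 1  # the proposal mentions a 64-bit counter


class UserNsDirectory:
    """Model of one user namespace directory with a subdirectory per type."""

    def __init__(self):
        self.next_id = {t: 1 for t in NS_TYPES}  # starting at 1 is an assumption
        self.entries = {t: {} for t in NS_TYPES}

    def add(self, ns_type, ns):
        """Name a new namespace with the next number from its type's counter."""
        n = self.next_id[ns_type]
        if n > COUNTER_MAX:
            raise OverflowError("counter exhausted")
        self.next_id[ns_type] = n + 1
        self.entries[ns_type][n] = ns
        return n

    def bump(self, ns_type, value):
        """Move the counter forward so a restore can recreate a given name."""
        if value < self.next_id[ns_type]:
            raise ValueError("counter can only move forward")
        self.next_id[ns_type] = value
```

On restore, a tool would call bump("net", 10) before recreating the namespace that was named 10 at dump time, much like choosing the next pid. Because each user namespace directory has its own counters, two copies of the same snapshot would not clash with each other.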
On Mon, Aug 17, 2020 at 10:48:01AM -0500, Eric W. Biederman wrote: > > Creating names in the kernel for namespaces is very difficult and > problematic. I have not seen anything that looks like all of the > problems have been solved with restoring these new names. > > When your filter for your list of namespaces is user namespace creating > a new directory in proc is highly questionable. > > As everyone uses proc placing this functionality in proc also amplifies > the problem of creating names. > > > Rather than proc having a way to mount a namespace filesystem filter by > the user namespace of the mounter likely to have many many fewer > problems. Especially as we are limiting/not allow new non-process > things and ideally finding a way to remove the non-process things. > > > Kirill you have a good point that taking the case where a pid namespace > does not exist in a user namespace is likely quite unrealistic. > > Kirill mentioned upthread that the list of namespaces are the list that > can appear in a container. Except by discipline in creating containers > it is not possible to know which namespaces may appear in attached to a > process. It is possible to be very creative with setns, and violate any > constraint you may have. Which means your filtered list of namespaces > may not contain all of the namespaces used by a set of processes. This Indeed. We use setns() quite creatively when intercepting syscalls and when attaching to a container. > further argues that attaching the list of namespaces to proc does not > make sense. > > Andrei has a good point that placing the names in a hierarchy by > user namespace has the potential to create more freedom when > assigning names to namespaces, as it means the names for namespaces > do not need to be globally unique, and while still allowing the names > to stay the same. 
> > > To recap the possibilities for names for namespaces that I have seen > mentioned in this thread are: > - Names per mount > - Names per user namespace > > I personally suspect that names per mount are likely to be so flexibly > they are confusing, while names per user namespace are likely to be > rigid, possibly too rigid to use. > > It all depends upon how everything is used. I have yet to see a > complete story of how these names will be generated and used. So I can > not really judge. So I haven't fully understood either what the motivation for this patchset is. I can just speak to the use-case I had when I started prototyping something similar: We needed a way to get a view on all namespaces that exist on the system because we wanted a way to do namespace debugging on a live system. This interface could've easily lived in debugfs. The main point was that it should contain all namespaces. Note, that it wasn't supposed to be a hierarchical format it was only mean to list all namespaces and accessible to real root. The interface here is way more flexible/complex and I haven't yet figured out what exactly it is supposed to be used for. > > > Let me add another take on this idea that might give this work a path > forward. If I were solving this I would explore giving nsfs directories > per user namespace, and a way to mount it that exposed the directory of > the mounters current user namespace (something like btrfs snapshots). > > Hmm. For the user namespace directory I think I would give it a file > "ns" that can be opened to get a file handle on the user namespace. > Plus a set of subdirectories "cgroup", "ipc", "mnt", "net", "pid", > "user", "uts") for each type of namespace. In each directory I think > I would just have a 64bit counter and each new entry I would assign the > next number from that counter. 
>
> The restore could either have the ability to rename files or simply the
> ability to bump the counter (like we do with pids) so the names of the
> namespaces can be restored.
>
> That winds up making a user namespace the namespace of namespaces, so
> I am not 100% about the idea.

I think you're right that we need to understand better what the use-case
is. If I understand your suggestion correctly it wouldn't allow to show
nested user namespaces if the nsfs mount is per-user namespace.

Let me throw in a crazy idea: couldn't we just make the ioctl_ns() walk
a namespace hierarchy? For example, you could pass in a user namespace
fd and then you'd get back a struct with handles for fds for the
namespaces owned by that user namespace and then you could use
NS_GET_USERNS/NS_GET_PARENT to walk upwards from the user namespace fd
passed in initially and so on? Or something similar/simpler. This would
also decouple this from procfs somewhat.

Christian
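The upward half of this idea already works with the existing ioctls. Below is a minimal sketch, assuming a Linux system with /proc mounted; the helper names ns_id() and walk_up() are made up for illustration. The request numbers are computed the way _IO(0xb7, nr) expands in <linux/nsfs.h>. The downward half (a struct of fds for the namespaces owned by a user namespace) does not exist and is not shown.

```python
import fcntl
import os

# _IO(NSIO, nr) with NSIO = 0xb7, as defined in <linux/nsfs.h>.
NSIO = 0xB7
NS_GET_USERNS = (NSIO << 8) | 0x1  # owning user namespace
NS_GET_PARENT = (NSIO << 8) | 0x2  # parent (pid and user namespaces only)


def ns_id(fd):
    """Identify a namespace by the (device, inode) pair of its nsfs file."""
    st = os.fstat(fd)
    return (st.st_dev, st.st_ino)


def walk_up(path, request=NS_GET_USERNS, limit=64):
    """Follow `request` upwards from the namespace at `path`.

    Returns the ids seen, starting with the namespace itself. The walk stops
    when the ioctl fails, for example with EPERM once the next namespace is
    outside the caller's scope.
    """
    chain = []
    fd = os.open(path, os.O_RDONLY)
    try:
        while len(chain) < limit:
            chain.append(ns_id(fd))
            try:
                up = fcntl.ioctl(fd, request)  # returns a new fd on success
            except OSError:
                break
            os.close(fd)
            fd = up
    finally:
        os.close(fd)
    return chain
```

For example, walk_up("/proc/self/ns/net") yields the net namespace followed by the chain of user namespaces above it that the caller is allowed to see.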
Christian Brauner <christian.brauner@ubuntu.com> writes: > On Mon, Aug 17, 2020 at 10:48:01AM -0500, Eric W. Biederman wrote: >> >> Creating names in the kernel for namespaces is very difficult and >> problematic. I have not seen anything that looks like all of the >> problems have been solved with restoring these new names. >> >> When your filter for your list of namespaces is user namespace creating >> a new directory in proc is highly questionable. >> >> As everyone uses proc placing this functionality in proc also amplifies >> the problem of creating names. >> >> >> Rather than proc having a way to mount a namespace filesystem filter by >> the user namespace of the mounter likely to have many many fewer >> problems. Especially as we are limiting/not allow new non-process >> things and ideally finding a way to remove the non-process things. >> >> >> Kirill you have a good point that taking the case where a pid namespace >> does not exist in a user namespace is likely quite unrealistic. >> >> Kirill mentioned upthread that the list of namespaces are the list that >> can appear in a container. Except by discipline in creating containers >> it is not possible to know which namespaces may appear in attached to a >> process. It is possible to be very creative with setns, and violate any >> constraint you may have. Which means your filtered list of namespaces >> may not contain all of the namespaces used by a set of processes. This > > Indeed. We use setns() quite creatively when intercepting syscalls and > when attaching to a container. > >> further argues that attaching the list of namespaces to proc does not >> make sense. >> >> Andrei has a good point that placing the names in a hierarchy by >> user namespace has the potential to create more freedom when >> assigning names to namespaces, as it means the names for namespaces >> do not need to be globally unique, and while still allowing the names >> to stay the same. 
>> >> >> To recap the possibilities for names for namespaces that I have seen >> mentioned in this thread are: >> - Names per mount >> - Names per user namespace >> >> I personally suspect that names per mount are likely to be so flexibly >> they are confusing, while names per user namespace are likely to be >> rigid, possibly too rigid to use. >> >> It all depends upon how everything is used. I have yet to see a >> complete story of how these names will be generated and used. So I can >> not really judge. > > So I haven't fully understood either what the motivation for this > patchset is. > I can just speak to the use-case I had when I started prototyping > something similar: We needed a way to get a view on all namespaces > that exist on the system because we wanted a way to do namespace > debugging on a live system. This interface could've easily lived in > debugfs. The main point was that it should contain all namespaces. > Note, that it wasn't supposed to be a hierarchical format it was only > mean to list all namespaces and accessible to real root. > The interface here is way more flexible/complex and I haven't yet > figured out what exactly it is supposed to be used for. > >> >> >> Let me add another take on this idea that might give this work a path >> forward. If I were solving this I would explore giving nsfs directories >> per user namespace, and a way to mount it that exposed the directory of >> the mounters current user namespace (something like btrfs snapshots). >> >> Hmm. For the user namespace directory I think I would give it a file >> "ns" that can be opened to get a file handle on the user namespace. >> Plus a set of subdirectories "cgroup", "ipc", "mnt", "net", "pid", >> "user", "uts") for each type of namespace. In each directory I think >> I would just have a 64bit counter and each new entry I would assign the >> next number from that counter. 
>>
>> The restore could either have the ability to rename files or simply the
>> ability to bump the counter (like we do with pids) so the names of the
>> namespaces can be restored.
>>
>> That winds up making a user namespace the namespace of namespaces, so
>> I am not 100% about the idea.
>
> I think you're right that we need to understand better what the use-case
> is. If I understand your suggestion correctly it wouldn't allow to show
> nested user namespaces if the nsfs mount is per-user namespace.

So what I was thinking is that we have the user namespace directories
and that the mount code would perform a bind mount such that the
directory that matches the mounter's user namespace is the root
directory.

> Let me throw in a crazy idea: couldn't we just make the ioctl_ns() walk
> a namespace hierarchy? For example, you could pass in a user namespace
> fd and then you'd get back a struct with handles for fds for the
> namespaces owned by that user namespace and then you could use
> NS_GET_USERNS/NS_GET_PARENT to walk upwards from the user namespace fd
> passed in initially and so on? Or something similar/simpler. This would
> also decouple this from procfs somewhat.

Hmm. That would remove the need to have names. We could just keep a
list of the namespaces in creation order. Hopefully the CRIU folks could
preserve that creation order without too much trouble.

Say with an ioctl NS_NEXT_CREATION which takes two fds, and returns a
new file descriptor. The arguments would be the user namespace and -1
or the file descriptor last returned from NS_NEXT_CREATION.

Assuming that is not difficult for CRIU to restore, that would be a very
simple patch.

Eric
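NS_NEXT_CREATION is only proposed here and does not exist in any kernel, so the following is purely a userspace model of the suggested iteration, with every name hypothetical. It represents the "-1" argument as None and the returned file descriptors as plain Python objects.

```python
class UserNs:
    """Model of a user namespace that remembers what it owns, oldest first."""

    def __init__(self, name):
        self.name = name
        self.created = []  # owned namespaces in creation order


def ns_next_creation(userns, last=None):
    """Model of the proposed call: return the namespace created after `last`.

    `last=None` plays the role of -1 and yields the oldest namespace.
    Returns None once the list is exhausted.
    """
    idx = 0 if last is None else userns.created.index(last) + 1
    return userns.created[idx] if idx < len(userns.created) else None


def list_owned(userns):
    """Enumerate all namespaces owned by `userns` using only the cursor call."""
    out = []
    cur = ns_next_creation(userns)
    while cur is not None:
        out.append(cur)
        cur = ns_next_creation(userns, cur)
    return out
```

The model ignores namespaces being destroyed between two calls, which a real implementation would have to handle. It does show why no names are needed: a restore only has to recreate the namespaces in the same order.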
Currently, there is no way to list or iterate over all, or a subset, of
the namespaces in the system. Some namespaces are exposed in
/proc/[pid]/ns/ directories, but others may exist only as open files
that are not attached to any process. When an open namespace fd is sent
over a unix socket and then closed, it is impossible to know whether the
namespace still exists.

Also, even if a namespace is exposed as attached to a process or as an
open file, iterating over /proc/*/ns/* or /proc/*/fd/* is not fast,
because the cost multiplies with the number of tasks and fds.

This patchset introduces a new /proc/namespaces/ directory, which exposes
the subset of permitted namespaces in a linear view:

# ls /proc/namespaces/ -l
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'cgroup:[4026531835]' -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'ipc:[4026531839]' -> 'ipc:[4026531839]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531840]' -> 'mnt:[4026531840]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531861]' -> 'mnt:[4026531861]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532133]' -> 'mnt:[4026532133]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532134]' -> 'mnt:[4026532134]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532135]' -> 'mnt:[4026532135]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532136]' -> 'mnt:[4026532136]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'net:[4026531993]' -> 'net:[4026531993]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'pid:[4026531836]' -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'time:[4026531834]' -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'user:[4026531837]' -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'uts:[4026531838]' -> 'uts:[4026531838]'

A namespace ns is exposed only if its user_ns is permitted from /proc's
pid_ns. That is, /proc is tied to a pid_ns, so /proc/namespaces shows
only those ns for which in_userns(pid_ns->user_ns, ns->user_ns) holds.
If ns is itself a user_ns, the check is in_userns(pid_ns->user_ns, ns).
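
The visibility rule above can be sketched in plain userspace C. This is
a simplified model, not code from the patchset: struct user_ns and
struct ns_model are hypothetical stand-ins for the kernel structures,
ns_visible() is an invented helper name, and in_userns() here is
modelled on the level-walking ancestry check that the kernel helper of
the same name performs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct user_namespace. */
struct user_ns {
    const struct user_ns *parent; /* NULL for the initial user namespace */
    int level;                    /* nesting depth; 0 for the initial one */
};

/* True if `child` is `ancestor` itself or one of its descendants. */
static bool in_userns(const struct user_ns *ancestor,
                      const struct user_ns *child)
{
    /* Climb from the child until we reach the ancestor's depth. */
    while (child->level > ancestor->level)
        child = child->parent;
    return child == ancestor;
}

/* Hypothetical namespace record: only the owning user namespace matters. */
struct ns_model {
    const struct user_ns *user_ns;
};

/*
 * Visibility rule for a non-user namespace: it is listed only if its
 * owning user namespace is the user namespace of /proc's pid_ns, or a
 * descendant of it.
 */
static bool ns_visible(const struct user_ns *proc_pidns_userns,
                       const struct ns_model *ns)
{
    return in_userns(proc_pidns_userns, ns->user_ns);
}
```

With this rule, a /proc mounted from a nested container sees namespaces
owned by its own user namespace and by its descendants, but none owned
by siblings or ancestors.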

The patchset follows these steps:

1) A generic counter in ns_common is introduced instead of separate
   counters for every ns type (net::count, uts_namespace::kref,
   user_namespace::count, etc). Patches [1-8];
2) Patch [9] introduces an IDR to link and iterate live namespaces;
3) Patch [10] is refactoring;
4) Patch [11] actually adds the /proc/namespaces directory and fs methods;
5) Patches [12-23] make every namespace use the added methods and
   appear in the /proc/namespaces directory.

This may be useful for writing efficient debug utils (say, fast building
of a network topology) and checkpoint/restore software.

---

Kirill Tkhai (23):
      ns: Add common refcount into ns_common add use it as counter for net_ns
      uts: Use generic ns_common::count
      ipc: Use generic ns_common::count
      pid: Use generic ns_common::count
      user: Use generic ns_common::count
      mnt: Use generic ns_common::count
      cgroup: Use generic ns_common::count
      time: Use generic ns_common::count
      ns: Introduce ns_idr to be able to iterate all allocated namespaces in the system
      fs: Rename fs/proc/namespaces.c into fs/proc/task_namespaces.c
      fs: Add /proc/namespaces/ directory
      user: Free user_ns one RCU grace period after final counter put
      user: Add user namespaces into ns_idr
      net: Add net namespaces into ns_idr
      pid: Eextract child_reaper check from pidns_for_children_get()
      proc_ns_operations: Add can_get method
      pid: Add pid namespaces into ns_idr
      uts: Free uts namespace one RCU grace period after final counter put
      uts: Add uts namespaces into ns_idr
      ipc: Add ipc namespaces into ns_idr
      mnt: Add mount namespaces into ns_idr
      cgroup: Add cgroup namespaces into ns_idr
      time: Add time namespaces into ns_idr

 fs/mount.h                     |   4
 fs/namespace.c                 |  14 +
 fs/nsfs.c                      |  78 ++++++++
 fs/proc/Makefile               |   1
 fs/proc/internal.h             |  18 +-
 fs/proc/namespaces.c           | 382 +++++++++++++++++++++++++++-------------
 fs/proc/root.c                 |  17 ++
 fs/proc/task_namespaces.c      | 183 +++++++++++++++++++
 include/linux/cgroup.h         |   6 -
 include/linux/ipc_namespace.h  |   3
 include/linux/ns_common.h      |  11 +
 include/linux/pid_namespace.h  |   4
 include/linux/proc_fs.h        |   1
 include/linux/proc_ns.h        |  12 +
 include/linux/time_namespace.h |  10 +
 include/linux/user_namespace.h |  10 +
 include/linux/utsname.h        |  10 +
 include/net/net_namespace.h    |  11 -
 init/version.c                 |   2
 ipc/msgutil.c                  |   2
 ipc/namespace.c                |  17 +-
 ipc/shm.c                      |   1
 kernel/cgroup/cgroup.c         |   2
 kernel/cgroup/namespace.c      |  25 ++-
 kernel/pid.c                   |   2
 kernel/pid_namespace.c         |  46 +++--
 kernel/time/namespace.c        |  20 +-
 kernel/user.c                  |   2
 kernel/user_namespace.c        |  23 ++
 kernel/utsname.c               |  23 ++
 net/core/net-sysfs.c           |   6 -
 net/core/net_namespace.c       |  18 +-
 net/ipv4/inet_timewait_sock.c  |   4
 net/ipv4/tcp_metrics.c         |   2
 34 files changed, 746 insertions(+), 224 deletions(-)
 create mode 100644 fs/proc/task_namespaces.c

--
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>