Message ID | 20240213-vfs-pidfd_fs-v1-0-f863f58cfce1@kernel.org (mailing list archive) |
---|---|
Series | Move pidfd to tiny pseudo fs |
On Tue, Feb 13, 2024 at 05:45:45PM +0100, Christian Brauner wrote:
> Hey,
>
> This moves pidfds from the anonymous inode infrastructure to a tiny
> pseudo filesystem. This has been on my todo for quite a while as it will
> unblock further work that we weren't able to do so far simply because of
> the very justified limitations of anonymous inodes. So yesterday I sat
> down and wrote it down.
>
> Back when I added pidfds the concept was new (on Linux) and the
> limitations were acceptable but now it's starting to hurt us. And with
> the concept of pidfds having been around quite a while and being widely
> used this is worth doing. This makes it so that:
>
> * statx() on pidfds becomes useful for the first time.
> * pidfds can be compared simply via statx() for equality.
> * pidfds have unique inode numbers for the system lifetime.
> * struct pid is now stashed in inode->i_private instead of
>   file->private_data. This means it is now possible to introduce
>   concepts that operate on a process once all file descriptors have been
>   closed. A concrete example is kill-on-last-close.
> * file->private_data is freed up for per-file options for pidfds.
> * Each struct pid will refer to a different inode but the same struct
>   pid will refer to the same inode if it's opened multiple times. In
>   contrast to now where each struct pid refers to the same inode. Even
>   if we were to move to anon_inode_create_getfile() which creates new
>   inodes we'd still be associating the same struct pid with multiple
>   different inodes.
> * Pidfds now go through the regular dentry_open() path which means that
>   all security hooks are called unblocking proper LSM management for
>   pidfds. In addition fsnotify hooks are called and allow for listening
>   to open events on pidfds.
>
> The tiny pseudo filesystem is not visible anywhere in userspace exactly
> like e.g., pipefs and sockfs. There's no lookup, there's no inode
> operations in general, so nothing complex. It's hopefully the best kind
> of dumb there is. Dentries and inodes are always deleted when the last
> pidfd is closed.
>
> I've made the new code optional and placed it under CONFIG_FS_PIDFD but
> I'm confident we can remove that very soon. This takes some inspiration
> from nsfs which uses a similar stashing mechanism.
>
> Thanks!
> Christian
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
>
> ---
> base-commit: 3f643cd2351099e6b859533b6f984463e5315e5f
> change-id: 20240212-vfs-pidfd_fs-9a6e49283d80

I forgot to mention that pidfds are explicitly not simply directory
inodes in procfs, for various reasons, so this isn't an option I want to
pursue. Integrating them into procfs would be a nasty level of
complexity that makes for very ugly and convoluted code, especially
given how this would need to be integrated into copy_process() and other
locations.

It also poses significant security and permission checking challenges to
userspace because it is generally not safe to send around file
descriptors for /proc/<pid> directories. It's a pretty big attack vector
and cause of security issues. So really this is not a path that I want
to go down. It defeats the whole purpose of pidfds as opaque, easily
delegatable handles.

Oh, and the tree is vfs.pidfd at the usual location:
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git

Hey,

This moves pidfds from the anonymous inode infrastructure to a tiny
pseudo filesystem. This has been on my todo for quite a while as it will
unblock further work that we weren't able to do so far simply because of
the very justified limitations of anonymous inodes. So yesterday I sat
down and wrote it down.

Back when I added pidfds the concept was new (on Linux) and the
limitations were acceptable but now it's starting to hurt us. And with
the concept of pidfds having been around quite a while and being widely
used this is worth doing. This makes it so that:

* statx() on pidfds becomes useful for the first time.
* pidfds can be compared simply via statx() for equality.
* pidfds have unique inode numbers for the system lifetime.
* struct pid is now stashed in inode->i_private instead of
  file->private_data. This means it is now possible to introduce
  concepts that operate on a process once all file descriptors have been
  closed. A concrete example is kill-on-last-close.
* file->private_data is freed up for per-file options for pidfds.
* Each struct pid will refer to a different inode but the same struct
  pid will refer to the same inode if it's opened multiple times. In
  contrast to now where each struct pid refers to the same inode. Even
  if we were to move to anon_inode_create_getfile() which creates new
  inodes we'd still be associating the same struct pid with multiple
  different inodes.
* Pidfds now go through the regular dentry_open() path which means that
  all security hooks are called unblocking proper LSM management for
  pidfds. In addition fsnotify hooks are called and allow for listening
  to open events on pidfds.

The tiny pseudo filesystem is not visible anywhere in userspace exactly
like e.g., pipefs and sockfs. There's no lookup, there's no inode
operations in general, so nothing complex. It's hopefully the best kind
of dumb there is. Dentries and inodes are always deleted when the last
pidfd is closed.

I've made the new code optional and placed it under CONFIG_FS_PIDFD but
I'm confident we can remove that very soon. This takes some inspiration
from nsfs which uses a similar stashing mechanism.

Thanks!
Christian

Signed-off-by: Christian Brauner <brauner@kernel.org>

---
base-commit: 3f643cd2351099e6b859533b6f984463e5315e5f
change-id: 20240212-vfs-pidfd_fs-9a6e49283d80
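
To illustrate the "compare pidfds for equality" point from the list above, here is a minimal userspace sketch, not part of the series. It assumes a kernel with the pseudo filesystem enabled, and it uses Python's `os.fstat()` rather than statx() directly, on the assumption that the inode and device numbers it reports are the same ones statx() would return. The helper names `fd_identity` and `same_process` are hypothetical and chosen for this example only.

```python
import os


def fd_identity(fd):
    """Return the (device, inode) pair identifying the file behind fd."""
    st = os.fstat(fd)
    return (st.st_dev, st.st_ino)


def same_process(pidfd_a, pidfd_b):
    """Compare two pidfds by inode identity.

    Assumption: pidfds live on the pseudo filesystem described in the
    cover letter. With anonymous-inode pidfds every pidfd shares a single
    inode, so this comparison cannot tell processes apart.
    """
    return fd_identity(pidfd_a) == fd_identity(pidfd_b)


if __name__ == "__main__":
    if not hasattr(os, "pidfd_open"):
        print("os.pidfd_open is unavailable (needs Python 3.9+ on Linux)")
    else:
        fds = []
        try:
            # Two independently opened pidfds for this process.
            self_a = os.pidfd_open(os.getpid())
            fds.append(self_a)
            self_b = os.pidfd_open(os.getpid())
            fds.append(self_b)
            # One pidfd for the parent process, for contrast.
            parent = os.pidfd_open(os.getppid())
            fds.append(parent)

            print("self vs. self:  ", same_process(self_a, self_b))
            print("self vs. parent:", same_process(self_a, parent))
        except OSError as exc:
            # For example, the syscall is filtered or the pid is not visible.
            print(f"pidfd_open failed: {exc}")
        finally:
            for fd in fds:
                os.close(fd)
```

On a kernel without this series, both comparisons would be expected to report equality because all pidfds share the one anonymous inode. With the series applied, the second comparison should differ according to the description above.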