
[GIT,PULL,for,v6.11] vfs nsfs

Message ID 20240712-vfs-nsfs-bb9a28102667@brauner (mailing list archive)
State New

Pull-request

git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.11.nsfs

Commit Message

Christian Brauner July 12, 2024, 2 p.m. UTC
Hey Linus,

/* Summary */
This adds ioctls that translate PIDs between PID namespaces.

The motivating use-case comes from LXCFS, which is a tiny FUSE filesystem used
to virtualize various aspects of procfs. LXCFS is run on the host. The files
and directories it creates can be bind-mounted by e.g. a container at startup
and mounted over the various procfs files the container wishes to have
virtualized.

When e.g. a read request for uptime is received, LXCFS will receive the pid of
the reader. In order to virtualize the corresponding read, LXCFS needs to know
the pid of the init process of the reader's pid namespace.

In order to do this, LXCFS first needs to fork() two helper processes. The
first helper process calls setns() to attach to the reader's pid namespace.
Since setns() only changes the pid namespace of subsequently created children,
a second helper process is needed to create a process that is a proper member
of the pid namespace.

The second helper process then creates a ucred message with ucred.pid set to 1
and sends it back to LXCFS. The kernel will translate the ucred.pid field to
the corresponding pid number in LXCFS's pid namespace. This way LXCFS can learn
the init pid number of the reader's pid namespace and can go on to virtualize.

Since these two fork() calls are costly, LXCFS maintains an init pid cache that
caches a given pid for a fixed amount of time. The cache is pruned during new
read requests. However, even with the cache, the cost of the two fork() calls
is significant when a very large number of containers are running.

So this adds a simple set of ioctls that lets a caller translate PIDs from and
into a given PID namespace. This significantly improves performance with a very
simple change.

To protect against races, pidfds can be used to check whether the process is
still valid.
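
One possible way to do such a check is sketched below; this mail does not
prescribe it. The idea is to open a pidfd for the reader before translating
and to confirm afterwards that the process behind the pidfd has not exited,
so that a recycled pid number is not mistaken for the original reader. The
helper names pidfd_open_raw() and pidfd_alive() are made up for this example,
and the fallback syscall number 434 is an assumption that holds on most
architectures but not on alpha.

```c
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback for older headers; 434 on most architectures (not alpha). */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif

/* Hypothetical wrapper around the raw pidfd_open() system call. */
static int pidfd_open_raw(pid_t pid)
{
	return (int)syscall(SYS_pidfd_open, (long)pid, 0L);
}

/*
 * Hypothetical helper: a pidfd becomes readable once the process it
 * refers to has terminated. Returns 1 if the process is still running,
 * 0 if it has exited, -1 on error.
 */
static int pidfd_alive(int pidfd)
{
	struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
	int n = poll(&pfd, 1, 0);

	if (n < 0)
		return -1;
	return n == 0;
}
```

A caller would open the pidfd first, perform the translation, and only trust
or cache the result if pidfd_alive() still returns 1 afterwards.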

/* Testing */
clang: Debian clang version 16.0.6 (26)
gcc: (Debian 13.2.0-24)

All patches are based on v6.10-rc1 and have been sitting in linux-next.
No build failures or warnings were observed.

/* Conflicts */
[1]: This contains a merge conflict with the vfs-6.11.mount pull request
     https://lore.kernel.org/r/20240712-vfs-mount-8fd93381a87f@brauner

     After conflict resolution the merge diff looks like this:

+++ b/fs/nsfs.c
@@@ -144,22 -147,56 +148,69 @@@ static long ns_ioctl(struct file *filp
  		argp = (uid_t __user *) arg;
  		uid = from_kuid_munged(current_user_ns(), user_ns->owner);
  		return put_user(uid, argp);
 +	case NS_GET_MNTNS_ID: {
 +		struct mnt_namespace *mnt_ns;
 +		__u64 __user *idp;
 +		__u64 id;
 +
 +		if (ns->ops->type != CLONE_NEWNS)
 +			return -EINVAL;
 +
 +		mnt_ns = container_of(ns, struct mnt_namespace, ns);
 +		idp = (__u64 __user *)arg;
 +		id = mnt_ns->seq;
 +		return put_user(id, idp);
 +	}
+ 	case NS_GET_PID_FROM_PIDNS:
+ 		fallthrough;
+ 	case NS_GET_TGID_FROM_PIDNS:
+ 		fallthrough;
+ 	case NS_GET_PID_IN_PIDNS:
+ 		fallthrough;
+ 	case NS_GET_TGID_IN_PIDNS:
+ 		if (ns->ops->type != CLONE_NEWPID)
+ 			return -EINVAL;
+ 
+ 		ret = -ESRCH;
+ 		pid_ns = container_of(ns, struct pid_namespace, ns);
+ 
+ 		rcu_read_lock();
+ 
+ 		if (ioctl == NS_GET_PID_IN_PIDNS ||
+ 		    ioctl == NS_GET_TGID_IN_PIDNS)
+ 			tsk = find_task_by_vpid(arg);
+ 		else
+ 			tsk = find_task_by_pid_ns(arg, pid_ns);
+ 		if (!tsk)
+ 			break;
+ 
+ 		switch (ioctl) {
+ 		case NS_GET_PID_FROM_PIDNS:
+ 			ret = task_pid_vnr(tsk);
+ 			break;
+ 		case NS_GET_TGID_FROM_PIDNS:
+ 			ret = task_tgid_vnr(tsk);
+ 			break;
+ 		case NS_GET_PID_IN_PIDNS:
+ 			ret = task_pid_nr_ns(tsk, pid_ns);
+ 			break;
+ 		case NS_GET_TGID_IN_PIDNS:
+ 			ret = task_tgid_nr_ns(tsk, pid_ns);
+ 			break;
+ 		default:
+ 			ret = 0;
+ 			break;
+ 		}
+ 		rcu_read_unlock();
+ 
+ 		if (!ret)
+ 			ret = -ESRCH;
+ 		break;
  	default:
- 		return -ENOTTY;
+ 		ret = -ENOTTY;
  	}
+ 
+ 	return ret;
  }
  
  int ns_get_name(char *buf, size_t size, struct task_struct *task,
+++ b/include/uapi/linux/nsfs.h
@@@ -15,7 -15,13 +15,15 @@@
  #define NS_GET_NSTYPE		_IO(NSIO, 0x3)
  /* Get owner UID (in the caller's user namespace) for a user namespace */
  #define NS_GET_OWNER_UID	_IO(NSIO, 0x4)
 +/* Get the id for a mount namespace */
 +#define NS_GET_MNTNS_ID		_IO(NSIO, 0x5)
+ /* Translate pid from target pid namespace into the caller's pid namespace. */
 -#define NS_GET_PID_FROM_PIDNS	_IOR(NSIO, 0x5, int)
++#define NS_GET_PID_FROM_PIDNS	_IOR(NSIO, 0x6, int)
+ /* Return thread-group leader id of pid in the callers pid namespace. */
+ #define NS_GET_TGID_FROM_PIDNS	_IOR(NSIO, 0x7, int)
+ /* Translate pid from caller's pid namespace into a target pid namespace. */
 -#define NS_GET_PID_IN_PIDNS	_IOR(NSIO, 0x6, int)
++#define NS_GET_PID_IN_PIDNS	_IOR(NSIO, 0x8, int)
+ /* Return thread-group leader id of pid in the target pid namespace. */
 -#define NS_GET_TGID_IN_PIDNS	_IOR(NSIO, 0x8, int)
++#define NS_GET_TGID_IN_PIDNS	_IOR(NSIO, 0x9, int)
  
  #endif /* __LINUX_NSFS_H */

The following changes since commit 1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0:

  Linux 6.10-rc1 (2024-05-26 15:20:12 -0700)

are available in the Git repository at:

  git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.11.nsfs

for you to fetch changes up to ca567df74a28a9fb368c6b2d93e864113f73f5c2:

  nsfs: add pid translation ioctls (2024-06-25 23:00:41 +0200)

Please consider pulling these changes from the signed vfs-6.11.nsfs tag.

Thanks!
Christian

----------------------------------------------------------------
vfs-6.11.nsfs

----------------------------------------------------------------
Christian Brauner (1):
      nsfs: add pid translation ioctls

 fs/nsfs.c                 | 53 ++++++++++++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/nsfs.h |  8 +++++++
 2 files changed, 60 insertions(+), 1 deletion(-)

Comments

pr-tracker-bot@kernel.org July 15, 2024, 8:34 p.m. UTC | #1
The pull request you sent on Fri, 12 Jul 2024 16:00:48 +0200:

> git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.11.nsfs

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/1b074abe885f43b2c207b5e748ffa60604dbc020

Thank you!

Patch

diff --cc fs/nsfs.c
index af352dadffe1,a23c827a0299..ad6bb91a3e23
--- a/fs/nsfs.c
diff --cc include/uapi/linux/nsfs.h
index 56e8b1639b98,faeb9195da08..b133211331f6
--- a/include/uapi/linux/nsfs.h