Message ID: 87bll17ili.fsf_-_@x220.int.ebiederm.org
Series: Make the user mode driver code a better citizen
On Mon, Jun 29, 2020 at 02:55:05PM -0500, Eric W. Biederman wrote:
>
> I have tested these changes by booting with the code compiled in and
> by killing "bpfilter_umh" and running iptables -vnL to restart
> the userspace driver.
>
> I have compile tested each change with and without CONFIG_BPFILTER
> enabled.

With
CONFIG_BPFILTER=y
CONFIG_BPFILTER_UMH=m
it doesn't build:

ERROR: modpost: "kill_pid_info" [net/bpfilter/bpfilter.ko] undefined!

I've added:
+EXPORT_SYMBOL(kill_pid_info);
to continue testing...

And then did:
while true; do iptables -L; rmmod bpfilter; done

Unfortunately sometimes 'rmmod bpfilter' hangs in wait_event().

I suspect patch 13 is somehow responsible:
+	if (tgid) {
+		kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid);
+		wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID));
+		bpfilter_umh_cleanup(info);
+	}

I cannot figure out why it hangs.  Some sort of race?
Since adding a short delay between kill and wait makes it work.
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Mon, Jun 29, 2020 at 02:55:05PM -0500, Eric W. Biederman wrote:
>>
>> I have tested these changes by booting with the code compiled in and
>> by killing "bpfilter_umh" and running iptables -vnL to restart
>> the userspace driver.
>>
>> I have compile tested each change with and without CONFIG_BPFILTER
>> enabled.
>
> With
> CONFIG_BPFILTER=y
> CONFIG_BPFILTER_UMH=m
> it doesn't build:
>
> ERROR: modpost: "kill_pid_info" [net/bpfilter/bpfilter.ko] undefined!
>
> I've added:
> +EXPORT_SYMBOL(kill_pid_info);
> to continue testing...
>
> And then did:
> while true; do iptables -L; rmmod bpfilter; done
>
> Unfortunately sometimes 'rmmod bpfilter' hangs in wait_event().
>
> I suspect patch 13 is somehow responsible:
> +	if (tgid) {
> +		kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid);
> +		wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID));
> +		bpfilter_umh_cleanup(info);
> +	}
>
> I cannot figure out why it hangs.  Some sort of race?
> Since adding a short delay between kill and wait makes it work.

Thanks.  I will take a look.

Eric
On 2020/06/30 10:13, Eric W. Biederman wrote:
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
>> On Mon, Jun 29, 2020 at 02:55:05PM -0500, Eric W. Biederman wrote:
>>>
>>> I have tested these changes by booting with the code compiled in and
>>> by killing "bpfilter_umh" and running iptables -vnL to restart
>>> the userspace driver.
>>>
>>> I have compile tested each change with and without CONFIG_BPFILTER
>>> enabled.
>>
>> With
>> CONFIG_BPFILTER=y
>> CONFIG_BPFILTER_UMH=m
>> it doesn't build:
>>
>> ERROR: modpost: "kill_pid_info" [net/bpfilter/bpfilter.ko] undefined!
>>
>> I've added:
>> +EXPORT_SYMBOL(kill_pid_info);
>> to continue testing...

kill_pid() is already exported.

>>
>> And then did:
>> while true; do iptables -L; rmmod bpfilter; done
>>
>> Unfortunately sometimes 'rmmod bpfilter' hangs in wait_event().
>>
>> I suspect patch 13 is somehow responsible:
>> +	if (tgid) {
>> +		kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid);
>> +		wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID));
>> +		bpfilter_umh_cleanup(info);
>> +	}
>>
>> I cannot figure out why it hangs.  Some sort of race?
>> Since adding a short delay between kill and wait makes it work.

Because there is a race window where detach_pid() (from __unhash_process()
from __exit_signal() from release_task() from exit_notify() from do_exit())
is called some time after wake_up_all(&pid->wait_pidfd) (from
do_notify_pidfd() from do_notify_parent() from exit_notify() from
do_exit()) was called?  In other words, we can't use pid->wait_pidfd when
pid_task() is used as the wait_event() condition.

Below are changes I suggest.

diff --git a/kernel/umd.c b/kernel/umd.c
index ff79fb16d738..f688813b8830 100644
--- a/kernel/umd.c
+++ b/kernel/umd.c
@@ -26,7 +26,7 @@ static struct vfsmount *blob_to_mnt(const void *data, size_t len, const char *na
 	if (IS_ERR(mnt))
 		return mnt;
 
-	file = file_open_root(mnt->mnt_root, mnt, name, O_CREAT | O_WRONLY, 0700);
+	file = file_open_root(mnt->mnt_root, mnt, name, O_CREAT | O_WRONLY | O_EXCL, 0700);
 	if (IS_ERR(file)) {
 		mntput(mnt);
 		return ERR_CAST(file);
@@ -52,16 +52,18 @@ static struct vfsmount *blob_to_mnt(const void *data, size_t len, const char *na
 
 /**
  * umd_load_blob - Remember a blob of bytes for fork_usermode_driver
- * @info: information about usermode driver
- * @data: a blob of bytes that can be executed as a file
- * @len: The lentgh of the blob
+ * @info: information about usermode driver (shouldn't be NULL)
+ * @data: a blob of bytes that can be executed as a file (shouldn't be NULL)
+ * @len: The lentgh of the blob (shouldn't be 0)
  *
  */
 int umd_load_blob(struct umd_info *info, const void *data, size_t len)
 {
 	struct vfsmount *mnt;
 
-	if (WARN_ON_ONCE(info->wd.dentry || info->wd.mnt))
+	if (!info || !info->driver_name || !data || !len)
+		return -EINVAL;
+	if (info->wd.dentry || info->wd.mnt)
 		return -EBUSY;
 
 	mnt = blob_to_mnt(data, len, info->driver_name);
@@ -76,15 +78,14 @@ EXPORT_SYMBOL_GPL(umd_load_blob);
 
 /**
  * umd_unload_blob - Disassociate @info from a previously loaded blob
- * @info: information about usermode driver
+ * @info: information about usermode driver (shouldn't be NULL)
  *
  */
 int umd_unload_blob(struct umd_info *info)
 {
-	if (WARN_ON_ONCE(!info->wd.mnt ||
-			 !info->wd.dentry ||
-			 info->wd.mnt->mnt_root != info->wd.dentry))
+	if (!info || !info->driver_name || !info->wd.dentry || !info->wd.mnt)
 		return -EINVAL;
+	BUG_ON(info->wd.mnt->mnt_root != info->wd.dentry);
 
 	kern_unmount(info->wd.mnt);
 	info->wd.mnt = NULL;
@@ -138,7 +139,7 @@ static void umd_cleanup(struct subprocess_info *info)
 {
 	struct umd_info *umd_info = info->data;
 
-	/* cleanup if umh_setup() was successful but exec failed */
+	/* cleanup if umd_setup() was successful but exec failed */
 	if (info->retval) {
 		fput(umd_info->pipe_to_umh);
 		fput(umd_info->pipe_from_umh);
@@ -163,7 +164,10 @@ int fork_usermode_driver(struct umd_info *info)
 	const char *argv[] = { info->driver_name, NULL };
 	int err;
 
-	if (WARN_ON_ONCE(info->tgid))
+	if (!info || !info->driver_name || !info->wd.dentry || !info->wd.mnt)
+		return -EINVAL;
+	BUG_ON(info->wd.mnt->mnt_root != info->wd.dentry);
+	if (info->tgid)
 		return -EBUSY;
 
 	err = -ENOMEM;
diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c
index 91474884ddb7..9dd70aacb81a 100644
--- a/net/bpfilter/bpfilter_kern.c
+++ b/net/bpfilter/bpfilter_kern.c
@@ -19,8 +19,13 @@ static void shutdown_umh(void)
 	struct pid *tgid = info->tgid;
 
 	if (tgid) {
-		kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid);
-		wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID));
+		kill_pid(tgid, SIGKILL, 1);
+		while (!({ bool done;
+			   rcu_read_lock();
+			   done = !pid_task(tgid, PIDTYPE_TGID);
+			   rcu_read_unlock();
+			   done; }))
+			schedule_timeout_uninterruptible(1);
 		bpfilter_umh_cleanup(info);
 	}
 }
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Mon, Jun 29, 2020 at 02:55:05PM -0500, Eric W. Biederman wrote:
>>
>> I have tested these changes by booting with the code compiled in and
>> by killing "bpfilter_umh" and running iptables -vnL to restart
>> the userspace driver.
>>
>> I have compile tested each change with and without CONFIG_BPFILTER
>> enabled.
>
> With
> CONFIG_BPFILTER=y
> CONFIG_BPFILTER_UMH=m
> it doesn't build:
>
> ERROR: modpost: "kill_pid_info" [net/bpfilter/bpfilter.ko] undefined!
>
> I've added:
> +EXPORT_SYMBOL(kill_pid_info);
> to continue testing...

I am rather surprised.  I thought Tetsuo had already compile tested
modules.

> I suspect patch 13 is somehow responsible:
> +	if (tgid) {
> +		kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid);
> +		wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID));
> +		bpfilter_umh_cleanup(info);
> +	}
>
> I cannot figure out why it hangs.  Some sort of race?
> Since adding a short delay between kill and wait makes it work.

Having had a chance to sleep, kill_pid_info was a thinko, as was
!pid_task.  It should have been !pid_has_task, as that takes the proper
rcu locking.  I don't know if that is going to be enough to fix the
wait_event, but those are obvious bugs that need to be fixed.

diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c
index 91474884ddb7..3e1874030daa 100644
--- a/net/bpfilter/bpfilter_kern.c
+++ b/net/bpfilter/bpfilter_kern.c
@@ -19,8 +19,8 @@ static void shutdown_umh(void)
 	struct pid *tgid = info->tgid;
 
 	if (tgid) {
-		kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid);
-		wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID));
+		kill_pid(tgid, SIGKILL, 1);
+		wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID));
 		bpfilter_umh_cleanup(info);
 	}
 }

> And then did:
> while true; do iptables -L; rmmod bpfilter; done
>
> Unfortunately sometimes 'rmmod bpfilter' hangs in wait_event().

Hmm.  The wake up of tgid->wait_pidfd happens just before release_task
is called, so there is a race.  As it is possible to wake up and then
go back to sleep before pid_has_task becomes false.

So I think I need a friendly helper that does:

bool task_has_exited(struct pid *tgid)
{
	struct task_struct *tsk;
	bool exited = true;

	rcu_read_lock();
	tsk = pid_task(tgid, PIDTYPE_TGID);
	if (tsk)
		exited = !!tsk->exit_state;
	rcu_read_unlock();
	return exited;
}

There should be a sensible way to do that.

Eric
On 2020/06/30 21:29, Eric W. Biederman wrote:
> Hmm.  The wake up of tgid->wait_pidfd happens just before release_task
> is called, so there is a race.  As it is possible to wake up and then
> go back to sleep before pid_has_task becomes false.

What is the reason we want to wait until pid_has_task() becomes false?

-		wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID));
+		while (!wait_event_timeout(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID), 1));

By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says
that use of flush_delayed_fput() has to be careful.  Al, is it safe to call
flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might
be called from both kernel thread and from process context (e.g.
init_module() syscall by /sbin/insmod))?
On Tue, Jun 30, 2020 at 07:29:34AM -0500, Eric W. Biederman wrote:
>
> diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c
> index 91474884ddb7..3e1874030daa 100644
> --- a/net/bpfilter/bpfilter_kern.c
> +++ b/net/bpfilter/bpfilter_kern.c
> @@ -19,8 +19,8 @@ static void shutdown_umh(void)
>  	struct pid *tgid = info->tgid;
>  
>  	if (tgid) {
> -		kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid);
> -		wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID));
> +		kill_pid(tgid, SIGKILL, 1);
> +		wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID));
>  		bpfilter_umh_cleanup(info);
>  	}
>  }
>
>> And then did:
>> while true; do iptables -L; rmmod bpfilter; done
>>
>> Unfortunately sometimes 'rmmod bpfilter' hangs in wait_event().
>
> Hmm.  The wake up of tgid->wait_pidfd happens just before release_task
> is called, so there is a race.  As it is possible to wake up and then
> go back to sleep before pid_has_task becomes false.
>
> So I think I need a friendly helper that does:
>
> bool task_has_exited(struct pid *tgid)
> {
> 	struct task_struct *tsk;
> 	bool exited = true;
>
> 	rcu_read_lock();
> 	tsk = pid_task(tgid, PIDTYPE_TGID);
> 	if (tsk)
> 		exited = !!tsk->exit_state;
> 	rcu_read_unlock();
> 	return exited;
> }

All makes sense to me.
If I understood the race condition, such a helper should indeed solve it.
Are you going to add such a patch to your series?
I'll proceed with my work on top of your series and will ignore this
race for now, but I think it should be fixed before we land this set
into multiple trees.
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Tue, Jun 30, 2020 at 07:29:34AM -0500, Eric W. Biederman wrote:
>>
>> diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c
>> index 91474884ddb7..3e1874030daa 100644
>> --- a/net/bpfilter/bpfilter_kern.c
>> +++ b/net/bpfilter/bpfilter_kern.c
>> @@ -19,8 +19,8 @@ static void shutdown_umh(void)
>>  	struct pid *tgid = info->tgid;
>>  
>>  	if (tgid) {
>> -		kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid);
>> -		wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID));
>> +		kill_pid(tgid, SIGKILL, 1);
>> +		wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID));
>>  		bpfilter_umh_cleanup(info);
>>  	}
>>  }
>>
>>> And then did:
>>> while true; do iptables -L; rmmod bpfilter; done
>>>
>>> Unfortunately sometimes 'rmmod bpfilter' hangs in wait_event().
>>
>> Hmm.  The wake up of tgid->wait_pidfd happens just before release_task
>> is called, so there is a race.  As it is possible to wake up and then
>> go back to sleep before pid_has_task becomes false.
>>
>> So I think I need a friendly helper that does:
>>
>> bool task_has_exited(struct pid *tgid)
>> {
>> 	struct task_struct *tsk;
>> 	bool exited = true;
>>
>> 	rcu_read_lock();
>> 	tsk = pid_task(tgid, PIDTYPE_TGID);
>> 	if (tsk)
>> 		exited = !!tsk->exit_state;
>> 	rcu_read_unlock();
>> 	return exited;
>> }
>
> All makes sense to me.
> If I understood the race condition, such a helper should indeed solve it.
> Are you going to add such a patch to your series?
> I'll proceed with my work on top of your series and will ignore this
> race for now, but I think it should be fixed before we land this set
> into multiple trees.

Yes.  I am just finishing it up now.

Eric
Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> writes:

> On 2020/06/30 21:29, Eric W. Biederman wrote:
>> Hmm.  The wake up of tgid->wait_pidfd happens just before release_task
>> is called, so there is a race.  As it is possible to wake up and then
>> go back to sleep before pid_has_task becomes false.
>
> What is the reason we want to wait until pid_has_task() becomes false?
>
> -		wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID));
> +		while (!wait_event_timeout(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID), 1));

So that it is safe to call bpfilter_umh_cleanup.  The previous code
performed the wait by having a callback in do_exit.

It might be possible to call bpf_umh_cleanup early but I have not done
that analysis.

To perform the test correctly what I have right now is:

bool thread_group_exited(struct pid *pid)
{
	struct task_struct *tsk;
	bool exited;

	rcu_read_lock();
	tsk = pid_task(pid, PIDTYPE_PID);
	exited = !tsk ||
		(READ_ONCE(tsk->exit_state) && thread_group_empty(tsk));
	rcu_read_unlock();

	return exited;
}

Which is factored out of pidfd_poll.  Which means that this won't be
something that the bpfilter code has to maintain.  That seems to be a
fundamentally good facility to have regardless of bpfilter.

I will post the whole thing in a bit once I have a chance to dot my i's
and cross my t's.

> By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says
> that use of flush_delayed_fput() has to be careful.  Al, is it safe to call
> flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might
> be called from both kernel thread and from process context (e.g.
> init_module() syscall by /sbin/insmod))?

And __fput_sync needs to be even more careful.
umd_load_blob is called in these changes without any locks held.

We fundamentally, AKA in any correct version of this code, need to flush
the file descriptor before we call exec, or exec can not open it
read-only, denying all writes from any other opens.

The use case of flush_delayed_fput is exactly the same as that used
when loading the initramfs.

Eric
On 2020/07/02 22:08, Eric W. Biederman wrote:
> Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> writes:
>
>> On 2020/06/30 21:29, Eric W. Biederman wrote:
>>> Hmm.  The wake up of tgid->wait_pidfd happens just before release_task
>>> is called, so there is a race.  As it is possible to wake up and then
>>> go back to sleep before pid_has_task becomes false.
>>
>> What is the reason we want to wait until pid_has_task() becomes false?
>>
>> -		wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID));
>> +		while (!wait_event_timeout(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID), 1));
>
> So that it is safe to call bpfilter_umh_cleanup.  The previous code
> performed the wait by having a callback in do_exit.

But bpfilter_umh_cleanup() does only

	fput(info->pipe_to_umh);
	fput(info->pipe_from_umh);
	put_pid(info->tgid);
	info->tgid = NULL;

which is (I think) already safe regardless of the usermode process, because
bpfilter_umh_cleanup() merely closes one side of two pipes used between
two processes and forgets about the usermode process.

> It might be possible to call bpf_umh_cleanup early but I have not done
> that analysis.
>
> To perform the test correctly what I have right now is:

Waiting for the termination of a SIGKILLed usermode process is not that
simple.  If a usermode process was killed by the OOM killer, it might
take minutes for the killed process to reach do_exit() due to an
invisible memory allocation dependency chain.  Since the OOM killer kicks
the OOM reaper, and the OOM reaper forgets about the killed process after
one second if mmap_sem could not be held (in order to avoid OOM deadlock),
the OOM situation will eventually be solved; but there is no guarantee
that the killed process can reach do_exit() in a short period.
Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> writes:

> On 2020/07/02 22:08, Eric W. Biederman wrote:
>> Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> writes:
>>
>>> On 2020/06/30 21:29, Eric W. Biederman wrote:
>>>> Hmm.  The wake up of tgid->wait_pidfd happens just before release_task
>>>> is called, so there is a race.  As it is possible to wake up and then
>>>> go back to sleep before pid_has_task becomes false.
>>>
>>> What is the reason we want to wait until pid_has_task() becomes false?
>>>
>>> -		wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID));
>>> +		while (!wait_event_timeout(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID), 1));
>>
>> So that it is safe to call bpfilter_umh_cleanup.  The previous code
>> performed the wait by having a callback in do_exit.
>
> But bpfilter_umh_cleanup() does only
>
> 	fput(info->pipe_to_umh);
> 	fput(info->pipe_from_umh);
> 	put_pid(info->tgid);
> 	info->tgid = NULL;
>
> which is (I think) already safe regardless of the usermode process, because
> bpfilter_umh_cleanup() merely closes one side of two pipes used between
> two processes and forgets about the usermode process.

It is not safe.

Barring bugs there is only one use of shutdown_umh that matters.  The one
in fini_umh.  The use of the file by the mm must be finished before
umd_unload_blob.  AKA unmount.  Which completely frees the filesystem.

>> It might be possible to call bpf_umh_cleanup early but I have not done
>> that analysis.
>>
>> To perform the test correctly what I have right now is:
>
> Waiting for the termination of a SIGKILLed usermode process is not that
> simple.

The waiting is that simple.  You are correct that it might not be a quick
process.

A good general principle is to start with something simple and correct
for what it does, and then to make it more complicated when real world
cases show up and it can be understood what the real challenges are.

I am not going to merge known broken code, but I am also not going to
overcomplicate it.  Dealing with very rare and pathological cases that
are not handled or considered today is out of scope for my patchset.

Eric
On 2020/07/02 22:08, Eric W. Biederman wrote:
>> By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says
>> that use of flush_delayed_fput() has to be careful.  Al, is it safe to call
>> flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might
>> be called from both kernel thread and from process context (e.g.
>> init_module() syscall by /sbin/insmod))?
>
> And __fput_sync needs to be even more careful.
> umd_load_blob is called in these changes without any locks held.

But where is the guarantee that a thread which called flush_delayed_fput()
waits for the completion of processing _all_ "struct file" linked into
delayed_fput_list?  What if some other thread or delayed_fput_work
(scheduled by fput_many()) called flush_delayed_fput() between
blob_to_mnt()'s fput(file) and flush_delayed_fput() sequence?
blob_to_mnt()'s flush_delayed_fput() can miss the "struct file" which
needs to be processed before execve(), can't it?

Also, I don't know how convoluted the dependency of all "struct file"
linked into delayed_fput_list might be, for there can be a "struct file"
which will not be a simple close of the tmpfs file created by
blob_to_mnt()'s file_open_root() request.

On the other hand, although __fput_sync() cannot be called from
!PF_KTHREAD threads, there is a guarantee that __fput_sync() waits for
the completion of the "struct file" which needs to be flushed before
execve(), isn't there?

>
> We fundamentally, AKA in any correct version of this code, need to flush
> the file descriptor before we call exec, or exec can not open it
> read-only, denying all writes from any other opens.
>
> The use case of flush_delayed_fput is exactly the same as that used
> when loading the initramfs.

When loading the initramfs, the number of threads is quite few (which
means that the possibility of hitting the race window and convoluted
dependency is small).

But as EXPORT_SYMBOL_GPL(umd_load_blob) indicates, blob_to_mnt()'s
flush_delayed_fput() might be called after a large number of threads have
already started running.

On 2020/07/03 1:02, Eric W. Biederman wrote:
>>>> On 2020/06/30 21:29, Eric W. Biederman wrote:
>>>>> Hmm.  The wake up of tgid->wait_pidfd happens just before release_task
>>>>> is called, so there is a race.  As it is possible to wake up and then
>>>>> go back to sleep before pid_has_task becomes false.
>>>>
>>>> What is the reason we want to wait until pid_has_task() becomes false?
>>>>
>>>> -		wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID));
>>>> +		while (!wait_event_timeout(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID), 1));
>>>
>>> So that it is safe to call bpfilter_umh_cleanup.  The previous code
>>> performed the wait by having a callback in do_exit.
>>
>> But bpfilter_umh_cleanup() does only
>>
>> 	fput(info->pipe_to_umh);
>> 	fput(info->pipe_from_umh);
>> 	put_pid(info->tgid);
>> 	info->tgid = NULL;
>>
>> which is (I think) already safe regardless of the usermode process, because
>> bpfilter_umh_cleanup() merely closes one side of two pipes used between
>> two processes and forgets about the usermode process.
>
> It is not safe.
>
> Barring bugs there is only one use of shutdown_umh that matters.  The one
> in fini_umh.  The use of the file by the mm must be finished before
> umd_unload_blob.  AKA unmount.  Which completely frees the filesystem.

Do we really need to mount upon umd_load_blob() and unmount upon
umd_unload_blob()?  LSM modules might prefer only one instance of
filesystem for umd blobs.

For pathname based LSMs, since that filesystem is not visible from the
mount tree, only info->driver_name can be used for distinction.
Therefore, one instance of filesystem with files created with
file_open_root(O_CREAT | O_WRONLY | O_EXCL) might be preferable.

For inode based LSMs, reusing one instance of filesystem created upon
early boot might be convenient for labeling.

Also, we might want a dedicated filesystem (say, "umdfs") instead of
regular tmpfs in order to implement protections without labeling files.
Then, we might also be able to implement minimal protections without
LSMs.
Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> writes:

> On 2020/07/02 22:08, Eric W. Biederman wrote:
>>> By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says
>>> that use of flush_delayed_fput() has to be careful.  Al, is it safe to call
>>> flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might
>>> be called from both kernel thread and from process context (e.g.
>>> init_module() syscall by /sbin/insmod))?
>>
>> And __fput_sync needs to be even more careful.
>> umd_load_blob is called in these changes without any locks held.
>
> But where is the guarantee that a thread which called flush_delayed_fput()
> waits for the completion of processing _all_ "struct file" linked into
> delayed_fput_list?  What if some other thread or delayed_fput_work
> (scheduled by fput_many()) called flush_delayed_fput() between
> blob_to_mnt()'s fput(file) and flush_delayed_fput() sequence?
> blob_to_mnt()'s flush_delayed_fput() can miss the "struct file" which
> needs to be processed before execve(), can't it?

As a module the guarantee is we call task_work_run.

Built into the kernel the guarantee, as best I can trace it, is that
kthreadd hasn't started, and as such nothing that is scheduled has run
yet.

> Also, I don't know how convoluted the dependency of all "struct file"
> linked into delayed_fput_list might be, for there can be a "struct file"
> which will not be a simple close of the tmpfs file created by
> blob_to_mnt()'s file_open_root() request.
>
> On the other hand, although __fput_sync() cannot be called from
> !PF_KTHREAD threads, there is a guarantee that __fput_sync() waits for
> the completion of the "struct file" which needs to be flushed before
> execve(), isn't there?

There is really not a good helper or helpers, and this code suggests we
should have something better.  Right now I have used the existing helpers
to the best of my ability.  If you or someone else wants to write a better
version of flushing so that exec can happen, be my guest.

As far as I can tell, what I have is good enough.

>> We fundamentally, AKA in any correct version of this code, need to flush
>> the file descriptor before we call exec, or exec can not open it
>> read-only, denying all writes from any other opens.
>>
>> The use case of flush_delayed_fput is exactly the same as that used
>> when loading the initramfs.
>
> When loading the initramfs, the number of threads is quite few (which
> means that the possibility of hitting the race window and convoluted
> dependency is small).

But the reality is the code runs very early, before the initramfs is
initialized in practice.

> But as EXPORT_SYMBOL_GPL(umd_load_blob) indicates, blob_to_mnt()'s
> flush_delayed_fput() might be called after a large number of threads have
> already started running.

At which point the code probably won't be running from a kernel thread
but instead will be running in a thread where task_work_run is relevant.

At worst it is a very small race, where someone else in another thread
starts flushing the file.  Which means the file could still be
completely closed before exec.  Even that is not necessarily fatal,
as the usermode driver code has a respawn capability.

Code that is used enough that it hits that race sounds like a very
good problem to have, from the perspective of the usermode driver code.

> Do we really need to mount upon umd_load_blob() and unmount upon
> umd_unload_blob()?  LSM modules might prefer only one instance of
> filesystem for umd blobs.

It is simple.  People are free to change it, but a single filesystem
seems like a very good place to start with this functionality.

> For pathname based LSMs, since that filesystem is not visible from the
> mount tree, only info->driver_name can be used for distinction.
> Therefore, one instance of filesystem with files created with
> file_open_root(O_CREAT | O_WRONLY | O_EXCL) might be preferable.

I took a quick look, and the creation and removal of files with the
in-kernel helpers is not particularly easy.

Certainly it is more work, and thus a higher likelihood of bugs, than
what I have done.  A directory per driver does sound tempting.  Just
more work than I am willing to do.

> For inode based LSMs, reusing one instance of filesystem created upon
> early boot might be convenient for labeling.
>
> Also, we might want a dedicated filesystem (say, "umdfs") instead of
> regular tmpfs in order to implement protections without labeling files.
> Then, we might also be able to implement minimal protections without
> LSMs.

All valid points.  Nothing sets this design in stone.  Nothing says this
is the endpoint of the evolution of this code.

The entire point of this patchset for me is that I remove the unnecessary
special cases from exec and do_exit, so I don't have to deal with the
usermode driver code anymore.

Eric
On 2020/07/04 7:25, Eric W. Biederman wrote: > Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> writes: > >> On 2020/07/02 22:08, Eric W. Biederman wrote: >>>> By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says >>>> that use of flush_delayed_fput() has to be careful. Al, is it safe to call >>>> flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might be >>>> called from both kernel thread and from process context (e.g. init_module() >>>> syscall by /sbin/insmod )) ? >>> >>> And __fput_sync needs to be even more careful. >>> umd_load_blob is called in these changes without any locks held. >> >> But where is the guarantee that a thread which called flush_delayed_fput() waits for >> the completion of processing _all_ "struct file" linked into delayed_fput_list ? >> If some other thread or delayed_fput_work (scheduled by fput_many()) called >> flush_delayed_fput() between blob_to_mnt()'s fput(file) and flush_delayed_fput() >> sequence? blob_to_mnt()'s flush_delayed_fput() can miss the "struct file" which >> needs to be processed before execve(), can't it? > > As a module the guarantee is we call task_work_run. No. It is possible that blob_to_mnt() is called by a kernel thread which was started by init_module() syscall by /sbin/insmod . > Built into the kernel the guarantee as best I can trace it is that > kthreadd hasn't started, and as such nothing that is scheduled has run > yet. Have you ever checked how early the kthreadd (PID=2) gets started? 
---------- --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2306,6 +2306,7 @@ static __latent_entropy struct task_struct *copy_process( trace_task_newtask(p, clone_flags); uprobe_copy_process(p, clone_flags); + printk(KERN_INFO "Created PID: %u Comm: %s\n", p->pid, p->comm); return p; bad_fork_cancel_cgroup: ---------- ---------- [ 0.090757][ T0] pid_max: default: 65536 minimum: 512 [ 0.090890][ T0] LSM: Security Framework initializing [ 0.090890][ T0] Mount-cache hash table entries: 8192 (order: 4, 65536 bytes, linear) [ 0.090890][ T0] Mountpoint-cache hash table entries: 8192 (order: 4, 65536 bytes, linear) [ 0.090890][ T0] Disabled fast string operations [ 0.090890][ T0] Last level iTLB entries: 4KB 1024, 2MB 1024, 4MB 1024 [ 0.090890][ T0] Last level dTLB entries: 4KB 1024, 2MB 1024, 4MB 1024, 1GB 4 [ 0.090890][ T0] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization [ 0.090890][ T0] Spectre V2 : Spectre mitigation: kernel not compiled with retpoline; no mitigation available! [ 0.090890][ T0] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl and seccomp [ 0.090890][ T0] SRBDS: Unknown: Dependent on hypervisor status [ 0.090890][ T0] MDS: Mitigation: Clear CPU buffers [ 0.090890][ T0] Freeing SMP alternatives memory: 24K [ 0.090890][ T0] Created PID: 1 Comm: swapper/0 [ 0.090890][ T0] Created PID: 2 Comm: swapper/0 [ 0.090890][ T1] smpboot: CPU0: Intel(R) Core(TM) i5-4440S CPU @ 2.80GHz (family: 0x6, model: 0x3c, stepping: 0x3) [ 0.091000][ T2] Created PID: 3 Comm: kthreadd [ 0.091995][ T2] Created PID: 4 Comm: kthreadd [ 0.093028][ T2] Created PID: 5 Comm: kthreadd [ 0.093997][ T2] Created PID: 6 Comm: kthreadd [ 0.094995][ T2] Created PID: 7 Comm: kthreadd [ 0.096037][ T2] Created PID: 8 Comm: kthreadd (...snipped...) [ 0.135716][ T2] Created PID: 13 Comm: kthreadd [ 0.135716][ T1] smp: Bringing up secondary CPUs ... 
[ 0.135716][ T2] Created PID: 14 Comm: kthreadd [ 0.135716][ T2] Created PID: 15 Comm: kthreadd [ 0.135716][ T2] Created PID: 16 Comm: kthreadd [ 0.135716][ T2] Created PID: 17 Comm: kthreadd [ 0.135716][ T2] Created PID: 18 Comm: kthreadd [ 0.135716][ T1] x86: Booting SMP configuration: (...snipped...) [ 0.901990][ T1] pci 0000:00:00.0: Limiting direct PCI/PCI transfers [ 0.902145][ T1] pci 0000:00:0f.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff] [ 0.902213][ T1] pci 0000:02:00.0: CLS mismatch (32 != 64), using 64 bytes [ 0.902224][ T1] Trying to unpack rootfs image as initramfs... [ 1.107993][ T1] Freeing initrd memory: 18876K [ 1.109049][ T1] PCI-DMA: Using software bounce buffering for IO (SWIOTLB) [ 1.111003][ T1] software IO TLB: mapped [mem 0xab000000-0xaf000000] (64MB) [ 1.112136][ T1] check: Scanning for low memory corruption every 60 seconds [ 1.115040][ T2] Created PID: 52 Comm: kthreadd [ 1.116110][ T1] workingset: timestamp_bits=46 max_order=20 bucket_order=0 [ 1.120936][ T1] SGI XFS with ACLs, security attributes, verbose warnings, quota, no debug enabled [ 1.129626][ T2] Created PID: 53 Comm: kthreadd [ 1.131403][ T2] Created PID: 54 Comm: kthreadd ---------- kthreadd (PID=2) is created by swapper/0 (PID=0) immediately after init (PID=1) was created by swapper/0 (PID=0). It is even before secondary CPUs are brought up, and far earlier than unpacking initramfs. And how can we prove that blob_to_mnt() is only called by a kernel thread before some kernel thread that interferes fput() starts running? blob_to_mnt() needs to be prepared for being called after many processes already started running. > >> Also, I don't know how convoluted the dependency of all "struct file" linked into >> delayed_fput_list might be, for there can be "struct file" which will not be a >> simple close of tmpfs file created by blob_to_mnt()'s file_open_root() request. 
>>
>> On the other hand, although __fput_sync() cannot be called from !PF_KTHREAD threads,
>> there is a guarantee that __fput_sync() waits for the completion of the "struct file"
>> which needs to be flushed before execve(), isn't there?
>
> There is really not a good helper or helpers, and this code suggests we
> need something better. Right now I have used the existing helpers to
> the best of my ability. If you or someone else wants to write a better
> version of the flushing so that exec can happen, be my guest.
>
> As far as I can tell what I have is good enough.

Just saying what you think is not a "review". I'm waiting for an answer from Al Viro because I consider Al to be the most familiar with fput()'s behavior.

At least I consider that

	if (current->flags & PF_KTHREAD) {
		__fput_sync(file);
	} else {
		fput(file);
		task_work_run();
	}

is a candidate for closing the race window. And depending on Al's answer, removing

	BUG_ON(!(task->flags & PF_KTHREAD));

from __fput_sync() and unconditionally using

	__fput_sync(file);

from blob_to_mnt() might be the better choice. Anyway, I consider Al's response important for this "review".

>
>>> We fundamentally (in any correct version of this code) need to flush
>>> the file descriptor before we call exec, or exec cannot open it
>>> read-only, denying all writes from any other opens.
>>>
>>> The use case of flush_delayed_fput is exactly the same as that used
>>> when loading the initramfs.
>>
>> When loading the initramfs, the number of threads is quite few (which
>> means that the possibility of hitting the race window and convoluted
>> dependency is small).
>
> But the reality is the code runs very early, before the initramfs is
> initialized in practice.

Such an expectation is not the reality.

>
>> But like EXPORT_SYMBOL_GPL(umd_load_blob) indicates, blob_to_mnt()'s
>> flush_delayed_fput() might be called after a large number of threads
>> have already started running.
>
> At which point the code probably won't be running from a kernel thread
> but instead will be running in a thread where task_work_run is relevant.

No. It is possible that blob_to_mnt() is called by a kernel thread which was started by an init_module() syscall issued by /sbin/insmod.

>
> At worst it is a very small race, where someone else in another thread
> starts flushing the file. Which means the file could still be
> completely closed before exec. Even that is not necessarily fatal,
> as the usermode driver code has a respawn capability.
>
> Code that is used enough that it hits that race sounds like a very
> good problem to have from the perspective of the usermode driver code.

In general, unconditionally retrying call_usermodehelper() when it returned a negative value (e.g. -ENOENT, -ENOMEM, -EBUSY) is bad. I don't know which code is the implementation of "a respawn capability"; I'd like to check where that code is and whether it checks for -ETXTBSY.
Just to make certain I understand what is going on, I instrumented a kernel with some print statements.

a) The workqueues and timers start before populate_rootfs.

b) populate_rootfs does indeed happen long before the bpfilter module is initialized.

c) What prevents populate_rootfs and umd_load_blob from having problems when they call flush_delayed_fput is the fact that fput_many does "schedule_delayed_work(&delayed_fput_work, 1)". That 1 requests a delay of at least 1 jiffy. A jiffy is between 1ms and 10ms depending on how Linux is configured. In my test configuration, running a kernel in kvm and printing to a serial console, I measured 0.8ms between the fput in blob_to_mnt and the flush_delayed_fput which immediately follows it. So unless the fput becomes incredibly slow, there is nothing to worry about in blob_to_mnt.

d) As the same mechanism is used by populate_rootfs, a bug in the mechanism would apply to both.

e) No one appears to have reported a problem executing files out of initramfs in the several years since flush_delayed_fput was introduced.

f) The code works for me. There is real reason to believe the code will work for everyone else, as the exact same logic is used by initramfs. So it should be perfectly fine for the patchset and the usermode_driver code to go ahead as written.

h) If there is something to be fixed, it is flush_delayed_fput, as that is much more important than anything in the usermode driver code.

Eric

p.s.) When I talked of restarts of the usermode driver code earlier, I was referring to the code that restarts the usermode driver, if it is killed, the next time the kernel tries to talk to it. That could mask an -ETXTBSY, except that if it happens on the first exec, net/bpfilter/bpfilter_kern.c:load_umh() will return an error.
On Mon, Jun 29, 2020 at 02:55:05PM -0500, Eric W. Biederman wrote:
> 
> I have tested thes changes by booting with the code compiled in and
> by killing "bpfilter_umh" and running iptables -vnL to restart
> the userspace driver.
> 
> I have compiled tested each change with and without CONFIG_BPFILTER
> enabled.

Sounds like grounds for a selftests driver and a respective selftest? And if so, have the other issues Tetsuo reported been hacked into one?

  Luis