Message ID | 20210104232123.31378-1-stephen.s.brennan@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v4] proc: Allow pid_revalidate() during LOOKUP_RCU | expand |
On Mon, Jan 04, 2021 at 03:21:22PM -0800, Stephen Brennan wrote: > The pid_revalidate() function drops from RCU into REF lookup mode. When > many threads are resolving paths within /proc in parallel, this can > result in heavy spinlock contention on d_lockref as each thread tries to > grab a reference to the /proc dentry (and drop it shortly thereafter). > > Investigation indicates that it is not necessary to drop RCU in > pid_revalidate(), as no RCU data is modified and the function never > sleeps. So, remove the LOOKUP_RCU check. Umm... I'm rather worried about the side effect you are removing here - you are suddenly exposing a bunch of methods in there to RCU mode. E.g. is proc_pid_permission() safe with MAY_NOT_BLOCK in the mask? generic_permission() call in there is fine, but has_pid_permission() doesn't even see the mask. Is that thing safe in RCU mode? AFAICS, this static int selinux_ptrace_access_check(struct task_struct *child, unsigned int mode) { u32 sid = current_sid(); u32 csid = task_sid(child); if (mode & PTRACE_MODE_READ) return avc_has_perm(&selinux_state, sid, csid, SECCLASS_FILE, FILE__READ, NULL); return avc_has_perm(&selinux_state, sid, csid, SECCLASS_PROCESS, PROCESS__PTRACE, NULL); } is reachable and IIRC avc_has_perm() should *NOT* be called in RCU mode. If nothing else, audit handling needs care... And LSM-related stuff is only a part of possible issues here. It does need a careful code audit - you are taking a bunch of methods into the conditions they'd never been tested in. ->permission(), ->get_link(), ->d_revalidate(), ->d_hash() and ->d_compare() instances for objects that subtree. The last two are not there in case of anything in /proc/<pid>, but the first 3 very much are.
On Tue, Jan 05, 2021 at 05:59:35AM +0000, Al Viro wrote: > Umm... I'm rather worried about the side effect you are removing here - > you are suddenly exposing a bunch of methods in there to RCU mode. > E.g. is proc_pid_permission() safe with MAY_NOT_BLOCK in the mask? > generic_permission() call in there is fine, but has_pid_permission() > doesn't even see the mask. Is that thing safe in RCU mode? AFAICS, > this > static int selinux_ptrace_access_check(struct task_struct *child, > unsigned int mode) > { > u32 sid = current_sid(); > u32 csid = task_sid(child); > > if (mode & PTRACE_MODE_READ) > return avc_has_perm(&selinux_state, > sid, csid, SECCLASS_FILE, FILE__READ, NULL); > > return avc_has_perm(&selinux_state, > sid, csid, SECCLASS_PROCESS, PROCESS__PTRACE, NULL); > } > is reachable and IIRC avc_has_perm() should *NOT* be called in RCU mode. > If nothing else, audit handling needs care... > > And LSM-related stuff is only a part of possible issues here. It does need > a careful code audit - you are taking a bunch of methods into the conditions > they'd never been tested in. ->permission(), ->get_link(), ->d_revalidate(), > ->d_hash() and ->d_compare() instances for objects that subtree. The last > two are not there in case of anything in /proc/<pid>, but the first 3 very > much are. FWIW, after looking through the selinux and smack I started to wonder whether we really need that "return -ECHILD rather than audit and fail" in case of ->inode_permission(). AFAICS, the reason we need it is that dump_common_audit_data() is not safe in RCU mode with LSM_AUDIT_DATA_DENTRY and even more so - with LSM_AUDIT_DATA_INODE (d_find_alias() + dput() there, and dput() can bloody well block). LSM_AUDIT_DATA_DENTRY is easy to handle - wrap audit_log_untrustedstring(ab, a->u.dentry->d_name.name); into grabbing/dropping a->u.dentry->d_lock and we are done. And as for the LSM_AUDIT_DATA_INODE... How about this: /* * Caller *MUST* hold rcu_read_lock() and be guaranteed that inode itself * will be around until that gets dropped. */ struct dentry *d_find_alias_rcu(struct inode *inode) { struct hlist_head *l = &inode->i_dentry; struct dentry *de = NULL; spin_lock(&inode->i_lock); // ->i_dentry and ->i_rcu are colocated, but the latter won't be // used without having I_FREEING set, which means no aliases left if (inode->i_state & I_FREEING) { spin_unlock(&inode->i_lock); return NULL; } // we can safely access inode->i_dentry if (hlist_empty(p)) { spin_unlock(&inode->i_lock); return NULL; } if (S_ISDIR(inode->i_mode)) { de = hlist_entry(l->first, struct dentry, d_u.d_alias); } else hlist_for_each_entry(de, l, d_u.d_alias) { if (d_unhashed(de)) continue; // hashed + nonrcu really shouldn't be possible if (WARN_ON(READ_ONCE(de->d_flags) & DCACE_NONRCU)) continue; break; } spin_unlock(&inode->i_lock); return de; } and have case LSM_AUDIT_DATA_INODE: { struct dentry *dentry; struct inode *inode; rcu_read_lock(); inode = a->u.inode; dentry = d_find_alias_rcu(inode); if (dentry) { audit_log_format(ab, " name="); spin_lock(&dentry->d_lock); audit_log_untrustedstring(ab, dentry->d_name.name); spin_unlock(&dentry->d_lock); } audit_log_format(ab, " dev="); audit_log_untrustedstring(ab, inode->i_sb->s_id); audit_log_format(ab, " ino=%lu", inode->i_ino); rcu_read_unlock(); break; } in dump_common_audit_data(). Would that be enough to stop bothering with the entire AVC_NONBLOCKING thing or is there anything else involved?
On Tue, Jan 05, 2021 at 04:50:05PM +0000, Al Viro wrote: > struct dentry *d_find_alias_rcu(struct inode *inode) > { > struct hlist_head *l = &inode->i_dentry; > struct dentry *de = NULL; > > spin_lock(&inode->i_lock); > // ->i_dentry and ->i_rcu are colocated, but the latter won't be > // used without having I_FREEING set, which means no aliases left > if (inode->i_state & I_FREEING) { > spin_unlock(&inode->i_lock); > return NULL; > } > // we can safely access inode->i_dentry > if (hlist_empty(p)) { if (hlist_empty(l)) { obviously...
On Tue, Jan 05, 2021 at 04:50:05PM +0000, Al Viro wrote: > LSM_AUDIT_DATA_DENTRY is easy to handle - wrap > audit_log_untrustedstring(ab, a->u.dentry->d_name.name); > into grabbing/dropping a->u.dentry->d_lock and we are done. Incidentally, LSM_AUDIT_DATA_DENTRY in mainline is *not* safe wrt rename() - for long-named dentries it is possible to get preempted in the middle of audit_log_untrustedstring(ab, a->u.dentry->d_name.name); and have the bugger renamed, with old name ending up freed. The same goes for LSM_AUDIT_DATA_INODE... Folks, ->d_name.name is not automatically stable and the memory it points to is not always guaranteed to last as long as dentry itself does. In something like ->rename(), ->mkdir(), etc. - sure, we have the parent ->i_rwsem held exclusive and nothing's going to rename dentry under us. But there's a reason why e.g. d_path() has to be careful. And there are selinux and smack hooks using LSM_AUDIT_DATA_DENTRY in locking environment that does not exclude renames. AFAICS, that race goes back to 2004: 605303cc340d ("[PATCH] selinux: Add dname to audit output when a path cannot be generated.") in historical tree is where its first ancestor appears... The minimal fix is to grab ->d_lock around these audit_log_untrustedstring() calls, and IMO that's -stable fodder. It is a slow path (we are spewing an audit record, not to mention anything else), so I don't believe it's worth trying to do anything fancier than that. How about the following? If nobody objects, I'll drop it into #fixes and send a pull request in a few days... dump_common_audit_data(): fix racy accesses to ->d_name We are not guaranteed the locking environment that would prevent dentry getting renamed right under us. And it's possible for old long name to be freed after rename, leading to UAF here. Cc: stable@kernel.org # v2.6.2+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- diff --git a/security/lsm_audit.c b/security/lsm_audit.c index 7d8026f3f377..a0cd28cd31a8 100644 --- a/security/lsm_audit.c +++ b/security/lsm_audit.c @@ -275,7 +275,9 @@ static void dump_common_audit_data(struct audit_buffer *ab, struct inode *inode; audit_log_format(ab, " name="); + spin_lock(&a->u.dentry->d_lock); audit_log_untrustedstring(ab, a->u.dentry->d_name.name); + spin_unlock(&a->u.dentry->d_lock); inode = d_backing_inode(a->u.dentry); if (inode) { @@ -293,8 +295,9 @@ static void dump_common_audit_data(struct audit_buffer *ab, dentry = d_find_alias(inode); if (dentry) { audit_log_format(ab, " name="); - audit_log_untrustedstring(ab, - dentry->d_name.name); + spin_lock(&dentry->d_lock); + audit_log_untrustedstring(ab, dentry->d_name.name); + spin_unlock(&dentry->d_lock); dput(dentry); } audit_log_format(ab, " dev=");
On Tue, Jan 5, 2021 at 12:00 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > We are not guaranteed the locking environment that would prevent > dentry getting renamed right under us. And it's possible for > old long name to be freed after rename, leading to UAF here. This whole thing isn't important enough to get the dentry lock. It's more of a hint than anything else. Why isn't the fix to just use READ_ONCE() of the name pointer, and do it under RCU? That's what dentry_name() does for the much more complex case of actually even following parent data for a depth up to 4, much less just a single name. So instead of spin_lock(&dentry->d_lock); audit_log_untrustedstring(ab, dentry->d_name.name); spin_unlock(&dentry->d_lock); why not rcu_read_lock(); audit_log_untrustedstring(ab, READ_ONCE(dentry->d_name.name)); rcu_read_unlock(); which looks a lot more in line with the other dentry path functions. Maybe even have this as part of fs/d_path.c and try to get rid of magic internal dentry name knowledge from the audit code? Linus
On Tue, Jan 05, 2021 at 12:38:31PM -0800, Linus Torvalds wrote: > This whole thing isn't important enough to get the dentry lock. It's > more of a hint than anything else. > > Why isn't the fix to just use READ_ONCE() of the name pointer, and do > it under RCU? Umm... Take a look at audit_log_untrustedstring() - it really assumes that string is not changing under it. It could be massaged to be resilent to such changes, and it's not even all that hard (copy the sucker byte-by-byte, checking them for prohibited characters, with fallback to hex dump if it finds one), but I really don't want to mess with that for -stable and TBH I don't see the point - if the system is spending enough time in spewing into audit for contention and/or cacheline pingpong to matter, you are FUBAR anyway. In this case dumber is better; sure, if it was just a string copy with the accuracy in face of concurrent renames not guaranteed, I'd be all for "let's see if we can just use %pd printf, or go for open-coded analogue of such". But here the lack of whitespaces and quotes in the output is expected by userland tools and that's more sensitive than the accuracy... Again, if there's anybody seriously interested in analogue of %pd with that (or some other) form of quoting, it could be done. But I don't think it's a good idea for -stable and it obviously can be done on top of the minimal race fix.
Al Viro <viro@zeniv.linux.org.uk> writes: > On Tue, Jan 05, 2021 at 04:50:05PM +0000, Al Viro wrote: > >> LSM_AUDIT_DATA_DENTRY is easy to handle - wrap >> audit_log_untrustedstring(ab, a->u.dentry->d_name.name); >> into grabbing/dropping a->u.dentry->d_lock and we are done. > > Incidentally, LSM_AUDIT_DATA_DENTRY in mainline is *not* safe wrt > rename() - for long-named dentries it is possible to get preempted > in the middle of > audit_log_untrustedstring(ab, a->u.dentry->d_name.name); > and have the bugger renamed, with old name ending up freed. The > same goes for LSM_AUDIT_DATA_INODE... In the case of proc_pid_permission(), this preemption doesn't seem possible. We have task_lock() (a spinlock) held by ptrace_may_access() during this call, so preemption should be disabled: proc_pid_permission() has_pid_permissions() ptrace_may_access() task_lock() __ptrace_may_access() | security_ptrace_access_check() | ptrace_access_check -> selinux_ptrace_access_check() | avc_has_perm() | avc_audit() // note that has_pid_permissions() didn't get a | // flags field to propagate, so flags will not | // contain MAY_NOT_BLOCK | slow_avc_audit() | common_lsm_audit() | dump_common_audit_data() task_unlock() I understand the issue of d_name.name being freed across a preemption is more general than proc_pid_permission() (as other callers may have preemption enabled). However, it seems like there's another issue here. avc_audit() seems to imply that slow_avc_audit() would sleep: static inline int avc_audit(struct selinux_state *state, u32 ssid, u32 tsid, u16 tclass, u32 requested, struct av_decision *avd, int result, struct common_audit_data *a, int flags) { u32 audited, denied; audited = avc_audit_required(requested, avd, result, 0, &denied); if (likely(!audited)) return 0; /* fall back to ref-walk if we have to generate audit */ if (flags & MAY_NOT_BLOCK) return -ECHILD; return slow_avc_audit(state, ssid, tsid, tclass, requested, audited, denied, result, a); } If there are other cases in here where we might sleep, it would be a problem to sleep with the task lock held, correct?
On Tue, Jan 5, 2021 at 6:27 PM Stephen Brennan <stephen.s.brennan@oracle.com> wrote: > Al Viro <viro@zeniv.linux.org.uk> writes: > > > On Tue, Jan 05, 2021 at 04:50:05PM +0000, Al Viro wrote: > > > >> LSM_AUDIT_DATA_DENTRY is easy to handle - wrap > >> audit_log_untrustedstring(ab, a->u.dentry->d_name.name); > >> into grabbing/dropping a->u.dentry->d_lock and we are done. > > > > Incidentally, LSM_AUDIT_DATA_DENTRY in mainline is *not* safe wrt > > rename() - for long-named dentries it is possible to get preempted > > in the middle of > > audit_log_untrustedstring(ab, a->u.dentry->d_name.name); > > and have the bugger renamed, with old name ending up freed. The > > same goes for LSM_AUDIT_DATA_INODE... > > In the case of proc_pid_permission(), this preemption doesn't seem > possible. We have task_lock() (a spinlock) held by ptrace_may_access() > during this call, so preemption should be disabled: > > proc_pid_permission() > has_pid_permissions() > ptrace_may_access() > task_lock() > __ptrace_may_access() > | security_ptrace_access_check() > | ptrace_access_check -> selinux_ptrace_access_check() > | avc_has_perm() > | avc_audit() // note that has_pid_permissions() didn't get a > | // flags field to propagate, so flags will not > | // contain MAY_NOT_BLOCK > | slow_avc_audit() > | common_lsm_audit() > | dump_common_audit_data() > task_unlock() > > I understand the issue of d_name.name being freed across a preemption is > more general than proc_pid_permission() (as other callers may have > preemption enabled). However, it seems like there's another issue here. > avc_audit() seems to imply that slow_avc_audit() would sleep: > > static inline int avc_audit(struct selinux_state *state, > u32 ssid, u32 tsid, > u16 tclass, u32 requested, > struct av_decision *avd, > int result, > struct common_audit_data *a, > int flags) > { > u32 audited, denied; > audited = avc_audit_required(requested, avd, result, 0, &denied); > if (likely(!audited)) > return 0; > /* fall back to ref-walk if we have to generate audit */ > if (flags & MAY_NOT_BLOCK) > return -ECHILD; > return slow_avc_audit(state, ssid, tsid, tclass, > requested, audited, denied, result, > a); > } > > If there are other cases in here where we might sleep, it would be a > problem to sleep with the task lock held, correct? I would expect the problem here to be the currently allocated audit buffer isn't large enough to hold the full audit record, in which case it will attempt to expand the buffer by a call to pskb_expand_head() - don't ask why audit buffers are skbs, it's awful - using a gfp flag that was established when the buffer was first created. In this particular case it is GFP_ATOMIC|__GFP_NOWARN, which I believe should be safe in that it will not sleep on an allocation miss. I need to go deal with dinner, so I can't trace the entire path at the moment, but I believe the potential audit buffer allocation is the main issue.
On Tue, Jan 05, 2021 at 07:00:59PM -0500, Paul Moore wrote: > > > Incidentally, LSM_AUDIT_DATA_DENTRY in mainline is *not* safe wrt > > > rename() - for long-named dentries it is possible to get preempted > > > in the middle of > > > audit_log_untrustedstring(ab, a->u.dentry->d_name.name); > > > and have the bugger renamed, with old name ending up freed. The > > > same goes for LSM_AUDIT_DATA_INODE... > > > > In the case of proc_pid_permission(), this preemption doesn't seem > > possible. We have task_lock() (a spinlock) held by ptrace_may_access() > > during this call, so preemption should be disabled: > > > > proc_pid_permission() > > has_pid_permissions() > > ptrace_may_access() > > task_lock() > > __ptrace_may_access() > > | security_ptrace_access_check() > > | ptrace_access_check -> selinux_ptrace_access_check() > > | avc_has_perm() ... which does not hit either LSM_AUDIT_DATA_DENTRY nor LSM_AUDIT_DATA_INODE. It's really an unrelated issue. > > preemption enabled). However, it seems like there's another issue here. > > avc_audit() seems to imply that slow_avc_audit() would sleep: > > > > static inline int avc_audit(struct selinux_state *state, > > u32 ssid, u32 tsid, > > u16 tclass, u32 requested, > > struct av_decision *avd, > > int result, > > struct common_audit_data *a, > > int flags) > > { > > u32 audited, denied; > > audited = avc_audit_required(requested, avd, result, 0, &denied); > > if (likely(!audited)) > > return 0; > > /* fall back to ref-walk if we have to generate audit */ > > if (flags & MAY_NOT_BLOCK) > > return -ECHILD; > > return slow_avc_audit(state, ssid, tsid, tclass, > > requested, audited, denied, result, > > a); > > } > > > > If there are other cases in here where we might sleep, it would be a > > problem to sleep with the task lock held, correct? It can sleep - with LSM_AUDIT_DATA_INODE, which is precisely what selinux_inode_permission() is hitting. > I would expect the problem here to be the currently allocated audit > buffer isn't large enough to hold the full audit record, in which case > it will attempt to expand the buffer by a call to pskb_expand_head() - > don't ask why audit buffers are skbs, it's awful - using a gfp flag > that was established when the buffer was first created. In this > particular case it is GFP_ATOMIC|__GFP_NOWARN, which I believe should > be safe in that it will not sleep on an allocation miss. > > I need to go deal with dinner, so I can't trace the entire path at the > moment, but I believe the potential audit buffer allocation is the > main issue. Nope. dput() in dump_common_audit_data(), OTOH, is certainly not safe. OTTH, it's not really needed there - see vfs.git #work.audit for (untested) turning that sucker non-blocking. I hadn't tried a followup that would get rid of the entire AVC_NONBLOCKING thing yet, but I suspect that it should simplify the things in there nicely...
Al Viro <viro@zeniv.linux.org.uk> writes: > On Mon, Jan 04, 2021 at 03:21:22PM -0800, Stephen Brennan wrote: >> The pid_revalidate() function drops from RCU into REF lookup mode. When >> many threads are resolving paths within /proc in parallel, this can >> result in heavy spinlock contention on d_lockref as each thread tries to >> grab a reference to the /proc dentry (and drop it shortly thereafter). >> >> Investigation indicates that it is not necessary to drop RCU in >> pid_revalidate(), as no RCU data is modified and the function never >> sleeps. So, remove the LOOKUP_RCU check. > > Umm... I'm rather worried about the side effect you are removing here - > you are suddenly exposing a bunch of methods in there to RCU mode. > E.g. is proc_pid_permission() safe with MAY_NOT_BLOCK in the mask? > generic_permission() call in there is fine, but has_pid_permission() > doesn't even see the mask. Is that thing safe in RCU mode? AFAICS, > this > static int selinux_ptrace_access_check(struct task_struct *child, > unsigned int mode) > { > u32 sid = current_sid(); > u32 csid = task_sid(child); > > if (mode & PTRACE_MODE_READ) > return avc_has_perm(&selinux_state, > sid, csid, SECCLASS_FILE, FILE__READ, NULL); > > return avc_has_perm(&selinux_state, > sid, csid, SECCLASS_PROCESS, PROCESS__PTRACE, NULL); > } > is reachable and IIRC avc_has_perm() should *NOT* be called in RCU mode. > If nothing else, audit handling needs care... > > And LSM-related stuff is only a part of possible issues here. It does need > a careful code audit - you are taking a bunch of methods into the conditions > they'd never been tested in. ->permission(), ->get_link(), ->d_revalidate(), > ->d_hash() and ->d_compare() instances for objects that subtree. The last > two are not there in case of anything in /proc/<pid>, but the first 3 very > much are. You're right, this was a major oversight on my part. The main motivation of this patch is to reduce contention on the /proc dentry, which occurs directly after d_revalidate() returns -ECHILD the first time in lookup_fast(). To drop into ref mode, we call unlazy_child(), while nd->path still refers to /proc and dentry refers to /proc/PID. Grabbing a reference to /proc is the heart of the contention issue. But directly after a successful d_revalidate() in lookup_fast(), we return and go to step_into(), which assigns the /proc/PID dentry to nd->path. After this point, any unlazy operation will not try to grab the /proc dentry, resulting in significantly less contention. So it would already be a significant improvement if we kept this change to pid_revalidate(), and simply added checks to bail out of each of the other procfs methods if we're in LOOKUP_RCU. Would that be an acceptable change for you?
On Tue, Jan 5, 2021 at 7:38 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > On Tue, Jan 05, 2021 at 07:00:59PM -0500, Paul Moore wrote: ... > > I would expect the problem here to be the currently allocated audit > > buffer isn't large enough to hold the full audit record, in which case > > it will attempt to expand the buffer by a call to pskb_expand_head() - > > don't ask why audit buffers are skbs, it's awful - using a gfp flag > > that was established when the buffer was first created. In this > > particular case it is GFP_ATOMIC|__GFP_NOWARN, which I believe should > > be safe in that it will not sleep on an allocation miss. > > > > I need to go deal with dinner, so I can't trace the entire path at the > > moment, but I believe the potential audit buffer allocation is the > > main issue. > > Nope. dput() in dump_common_audit_data(), OTOH, is certainly not > safe. My mistake. My initial reaction is to always assume audit is the problem; I should have traced everything through before commenting. > OTTH, it's not really needed there - see vfs.git #work.audit > for (untested) turning that sucker non-blocking. I hadn't tried > a followup that would get rid of the entire AVC_NONBLOCKING thing yet, > but I suspect that it should simplify the things in there nicely... It would be nice to be able to get rid of the limitation on when we can update the AVC and do proper auditing. I doubt the impact is anything that anyone notices, but I agree that it should make things much cleaner. Thanks Al.
Al Viro <viro@zeniv.linux.org.uk> writes: > OTTH, it's not really needed there - see vfs.git #work.audit > for (untested) turning that sucker non-blocking. I hadn't tried > a followup that would get rid of the entire AVC_NONBLOCKING thing yet, > but I suspect that it should simplify the things in there nicely... I went ahead and pulled down this branch and combined it with my pid_revalidate change. Further, I audited all the inode get_link and permission() implementations, as well as dentry d_revalidate() implementations, in fs/proc (more on that below). Together, all these patches have run stable under a steady high load of concurrent PS processes on a 104CPU machine for over an hour, and greatly reduced the %sys utilization which the patch originally addressed. How would you like to proceed with the #work.audit changes? I could include them in a v5 of this patch series. Regarding my audit (ha) of dentry and inode functions in the fs/proc/ directory: * get_link() receives a NULL dentry pointer when called in RCU mode. * permission() receives MAY_NOT_BLOCK in the mode parameter when called from RCU. * d_revalidate() receives LOOKUP_RCU in flags. There were generally three groups I found. Group (1) are functions which contain a check at the top of the function and return -ECHILD, and so appear to be trivially RCU safe (although this is by dropping out of RCU completely). Group (2) are functions which have no explicit check, but on my audit, I was confident that there were no sleeping function calls, and thus were RCU safe as is. Group (3) are functions which appeared to be unsafe for some reason or another. Group (1): proc_ns_get_link() proc_pid_get_link() map_files_d_revalidate() proc_misc_d_revalidate() tid_fd_revalidate() Group (2): proc_get_link() proc_self_get_link() proc_thread_self_get_link() proc_fd_permission() Group (3): pid_revalidate() -- addressed by my patch proc_map_files_get_link() proc_pid_permission() -- addressed by Al's work.audit branch proc_map_files_get_link() calls capable() which ends up calling a security hook, which can get into the audit guts, and so I marked it as potentially unsafe, and added a patch to bail out of this function before the capable() check. However, I doubt this is really necessary. So to conclude, depending on how Al wants to move forward with the work.audit branch, I could send a full series with the proposed changes. Stephen
diff --git a/fs/proc/base.c b/fs/proc/base.c index f52217f432bc..633ef74e8dfd 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1974,19 +1974,18 @@ static int pid_revalidate(struct dentry *dentry, unsigned int flags) { struct inode *inode; struct task_struct *task; + int ret = 0; - if (flags & LOOKUP_RCU) - return -ECHILD; - - inode = d_inode(dentry); - task = get_proc_task(inode); + rcu_read_lock(); + inode = d_inode_rcu(dentry); + task = pid_task(proc_pid(inode), PIDTYPE_PID); if (task) { pid_update_inode(task, inode); - put_task_struct(task); - return 1; + ret = 1; } - return 0; + rcu_read_unlock(); + return ret; } static inline bool proc_inode_is_dead(struct inode *inode)