Message ID: 87o8j2svnt.fsf_-_@x220.int.ebiederm.org (mailing list archive)
State: New, archived
Series: files: rcu free files_struct
On Wed, Dec 9, 2020 at 10:05 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> -				struct file * file = xchg(&fdt->fd[i], NULL);
> +				struct file * file = fdt->fd[i];
> 				if (file) {
> +					rcu_assign_pointer(fdt->fd[i], NULL);

This makes me nervous. Why did we use to do that xchg() there? That
has atomicity guarantees that now are gone.

Now, this whole thing should be called for just the last ref of the fd
table, so presumably that atomicity was never needed in the first
place. But the fact that we did that very expensive xchg() then makes
me go "there's some reason for it".

Is this xchg() just bogus historical leftover? It kind of looks that
way. But maybe that change should be done separately?

              Linus
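The atomicity in question can be illustrated with a small userspace sketch. C11 atomics stand in for the kernel primitives, and `struct file` is a stand-in type, not the kernel's: `atomic_exchange()` (the analogue of xchg()) atomically returns the old value while storing the new one, so two concurrent callers can never both obtain the same non-NULL pointer, while the plain load-then-store variant in the patch is safe only because close_files() runs on the last reference.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Stand-in for the kernel's struct file; not the real type. */
struct file { int dummy; };

/* Analogue of the old code: xchg() atomically reads and clears the slot,
 * so at most one caller ever sees the old non-NULL pointer. */
static struct file *take_slot_xchg(_Atomic(struct file *) *slot)
{
	return atomic_exchange(slot, NULL);
}

/* Analogue of the proposed code: a plain load followed by a separate
 * store.  Two unserialized callers could both observe the same pointer;
 * this is only safe when callers are already serialized, as close_files()
 * is by running on the final reference. */
static struct file *take_slot_plain(_Atomic(struct file *) *slot)
{
	struct file *file = atomic_load_explicit(slot, memory_order_relaxed);

	if (file)
		atomic_store_explicit(slot, NULL, memory_order_release);
	return file;
}
```

When calls are serialized the two variants behave identically, which is why the xchg() only matters if close_files() can race with another accessor of the table.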
On Wed, Dec 09, 2020 at 12:04:38PM -0600, Eric W. Biederman wrote:
> @@ -397,8 +397,9 @@ static struct fdtable *close_files(struct files_struct * files)
> 		set = fdt->open_fds[j++];
> 		while (set) {
> 			if (set & 1) {
> -				struct file * file = xchg(&fdt->fd[i], NULL);
> +				struct file * file = fdt->fd[i];
> 				if (file) {
> +					rcu_assign_pointer(fdt->fd[i], NULL);

Assuming this is safe, you can use RCU_INIT_POINTER() here because
you're storing NULL, so you don't need the wmb() before storing the
pointer.
On Wed, Dec 09, 2020 at 11:13:38AM -0800, Linus Torvalds wrote:
> On Wed, Dec 9, 2020 at 10:05 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >
> > -				struct file * file = xchg(&fdt->fd[i], NULL);
> > +				struct file * file = fdt->fd[i];
> > 				if (file) {
> > +					rcu_assign_pointer(fdt->fd[i], NULL);
>
> This makes me nervous. Why did we use to do that xchg() there? That
> has atomicity guarantees that now are gone.
>
> Now, this whole thing should be called for just the last ref of the fd
> table, so presumably that atomicity was never needed in the first
> place. But the fact that we did that very expensive xchg() then makes
> me go "there's some reason for it".
>
> Is this xchg() just bogus historical leftover? It kind of looks that
> way. But maybe that change should be done separately?

I'm still not convinced that exposing close_files() to parallel
3rd-party accesses is safe in all cases, so this patch still needs
more analysis.

And I'm none too happy about "we'll fix the things up at the tail of
the series" - the changes are subtle enough and the area affected is
rather fundamental.  So if we end up returning to that several years
from now while debugging something, I would very much prefer to have
the transformation series as clean and understandable as possible.
It's not just about bisect hazard - asking yourself "WTF had it been
done that way, is there anything subtle I'm missing here?" can cost
many hours of head-scratching, IME.

Eric, I understand that you want to avoid reordering/folding, but in
this case it _is_ needed.  It's not as if there had been any serious
objections to the overall direction of changes; it's just that we need
to get that as understandable as possible.
Al Viro <viro@zeniv.linux.org.uk> writes:

> On Wed, Dec 09, 2020 at 11:13:38AM -0800, Linus Torvalds wrote:
>> On Wed, Dec 9, 2020 at 10:05 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> >
>> > -				struct file * file = xchg(&fdt->fd[i], NULL);
>> > +				struct file * file = fdt->fd[i];
>> > 				if (file) {
>> > +					rcu_assign_pointer(fdt->fd[i], NULL);
>>
>> This makes me nervous. Why did we use to do that xchg() there? That
>> has atomicity guarantees that now are gone.
>>
>> Now, this whole thing should be called for just the last ref of the fd
>> table, so presumably that atomicity was never needed in the first
>> place. But the fact that we did that very expensive xchg() then makes
>> me go "there's some reason for it".
>>
>> Is this xchg() just bogus historical leftover? It kind of looks that
>> way. But maybe that change should be done separately?
>
> I'm still not convinced that exposing close_files() to parallel
> 3rd-party accesses is safe in all cases, so this patch still needs
> more analysis.

That is fine.  I just wanted to post the latest version so we could
continue the discussion.  Especially with comments etc.

> And I'm none too happy about "we'll fix the things
> up at the tail of the series" - the changes are subtle enough and
> the area affected is rather fundamental.  So if we end up returning
> to that several years from now while debugging something, I would
> very much prefer to have the transformation series as clean and
> understandable as possible.  It's not just about bisect hazard -
> asking yourself "WTF had it been done that way, is there anything
> subtle I'm missing here?" can cost many hours of head-scratching,
> IME.

Fair enough.  I don't expect anyone is basing anything on that branch,
so a rebase is possible.

Now removing the pounding on task_lock isn't about correctness, and it
is not fixing a performance problem anyone has measured at this point.
So I do think it should be a follow-on, if for no other reason than to
keep the problem small enough that it can fit in heads.

Similarly the dnotify stuff.  Your description certainly makes it look
fishy, but the questionable parts are orthogonal to my patches.

> Eric, I understand that you want to avoid reordering/folding, but
> in this case it _is_ needed.  It's not as if there had been any
> serious objections to the overall direction of changes; it's
> just that we need to get that as understandable as possible.

I will post the patch that will become -1/24 in a moment.

Eric
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, Dec 9, 2020 at 10:05 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> -				struct file * file = xchg(&fdt->fd[i], NULL);
>> +				struct file * file = fdt->fd[i];
>> 				if (file) {
>> +					rcu_assign_pointer(fdt->fd[i], NULL);
>
> This makes me nervous. Why did we use to do that xchg() there? That
> has atomicity guarantees that now are gone.
>
> Now, this whole thing should be called for just the last ref of the fd
> table, so presumably that atomicity was never needed in the first
> place. But the fact that we did that very expensive xchg() then makes
> me go "there's some reason for it".
>
> Is this xchg() just bogus historical leftover? It kind of looks that
> way. But maybe that change should be done separately?

Removing the xchg in a separate patch seems reasonable, just to make
the review easier.

I traced the xchg back to 7cf4dc3c8dbf ("move files_struct-related
bits from kernel/exit.c to fs/file.c") when put_files_struct was
introduced.  The xchg did not exist before that change.

There were many other xchgs in the code back then, so I suspect it was
left over from the way an earlier version of the change worked and
simply was not removed when the patch was updated.

Eric
Matthew Wilcox <willy@infradead.org> writes:

> On Wed, Dec 09, 2020 at 12:04:38PM -0600, Eric W. Biederman wrote:
>> @@ -397,8 +397,9 @@ static struct fdtable *close_files(struct files_struct * files)
>> 		set = fdt->open_fds[j++];
>> 		while (set) {
>> 			if (set & 1) {
>> -				struct file * file = xchg(&fdt->fd[i], NULL);
>> +				struct file * file = fdt->fd[i];
>> 				if (file) {
>> +					rcu_assign_pointer(fdt->fd[i], NULL);
>
> Assuming this is safe, you can use RCU_INIT_POINTER() here because you're
> storing NULL, so you don't need the wmb() before storing the pointer.

Thanks.  I was remembering there was a special case like that and had
forgotten what the rule was.

Eric
On Wed, Dec 09, 2020 at 07:49:38PM +0000, Matthew Wilcox wrote:
> On Wed, Dec 09, 2020 at 12:04:38PM -0600, Eric W. Biederman wrote:
> > @@ -397,8 +397,9 @@ static struct fdtable *close_files(struct files_struct * files)
> > 		set = fdt->open_fds[j++];
> > 		while (set) {
> > 			if (set & 1) {
> > -				struct file * file = xchg(&fdt->fd[i], NULL);
> > +				struct file * file = fdt->fd[i];
> > 				if (file) {
> > +					rcu_assign_pointer(fdt->fd[i], NULL);
>
> Assuming this is safe, you can use RCU_INIT_POINTER() here because you're
> storing NULL, so you don't need the wmb() before storing the pointer.

fs/file.c:pick_file() would make a more interesting target for the
same treatment...
On Wed, Dec 9, 2020 at 2:58 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Dec 09, 2020 at 07:49:38PM +0000, Matthew Wilcox wrote:
> >
> > Assuming this is safe, you can use RCU_INIT_POINTER() here because you're
> > storing NULL, so you don't need the wmb() before storing the pointer.
>
> fs/file.c:pick_file() would make more interesting target for the same treatment...

Actually, don't.

rcu_assign_pointer() itself already does the optimization for the case
of a constant NULL pointer assignment.

So there's no need to manually change things to RCU_INIT_POINTER().

              Linus
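The optimization in question can be sketched in userspace roughly as follows. This is a simplified stand-in built on C11 atomics and the GCC/Clang `__builtin_constant_p` builtin, not the kernel's actual `rcu_assign_pointer()`: publishing a non-NULL pointer needs release ordering so that the pointee's initialization is visible before the pointer, but a constant NULL gives readers nothing to dereference, so a relaxed store suffices.

```c
#include <stdatomic.h>
#include <stddef.h>

/*
 * Sketch only: the real rcu_assign_pointer() differs in detail.
 * A compile-time-constant NULL is published without ordering;
 * everything else gets a release store.
 */
#define sketch_assign_pointer(p, v)					\
	do {								\
		if (__builtin_constant_p(v) && (v) == NULL)		\
			atomic_store_explicit((p), (v),			\
					      memory_order_relaxed);	\
		else							\
			atomic_store_explicit((p), (v),			\
					      memory_order_release);	\
	} while (0)
```

Because the NULL test happens at compile time, the relaxed branch costs nothing for ordinary assignments, which is why manually switching NULL stores to RCU_INIT_POINTER() no longer buys anything.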
On Wed, Dec 9, 2020 at 3:01 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> rcu_assign_pointer() itself already does the optimization for the case
> of a constant NULL pointer assignment.
>
> So there's no need to manually change things to RCU_INIT_POINTER().

Side note: what should be done instead is to delete the stale comment.

It should have been removed back when the optimization was done in
2016 - see commit 3a37f7275cda ("rcu: No ordering for
rcu_assign_pointer() of NULL").

              Linus
On Wed, Dec 09, 2020 at 03:01:36PM -0800, Linus Torvalds wrote:
> On Wed, Dec 9, 2020 at 2:58 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Wed, Dec 09, 2020 at 07:49:38PM +0000, Matthew Wilcox wrote:
> > >
> > > Assuming this is safe, you can use RCU_INIT_POINTER() here because you're
> > > storing NULL, so you don't need the wmb() before storing the pointer.
> >
> > fs/file.c:pick_file() would make more interesting target for the same treatment...
>
> Actually, don't.
>
> rcu_assign_pointer() itself already does the optimization for the case
> of a constant NULL pointer assignment.
>
> So there's no need to manually change things to RCU_INIT_POINTER().

I missed that, and the documentation wasn't updated by
3a37f7275cda5ad25c1fe9be8f20c76c60d175fa.

Paul, how about this?

+++ b/Documentation/RCU/Design/Requirements/Requirements.rst
@@ -1668,8 +1668,10 @@ against mishaps and misuse:
    this purpose.
 #. It is not necessary to use rcu_assign_pointer() when creating
    linked structures that are to be published via a single external
-   pointer. The RCU_INIT_POINTER() macro is provided for this task
-   and also for assigning ``NULL`` pointers at runtime.
+   pointer. The RCU_INIT_POINTER() macro is provided for this task.
+   It used to be more efficient to use RCU_INIT_POINTER() to store a
+   ``NULL`` pointer, but rcu_assign_pointer() now optimises for a constant
+   ``NULL`` pointer itself.

 This not a hard-and-fast list: RCU's diagnostic capabilities will
 continue to be guided by the number and type of usage bugs found in
On Wed, Dec 09, 2020 at 11:07:55PM +0000, Matthew Wilcox wrote:
> On Wed, Dec 09, 2020 at 03:01:36PM -0800, Linus Torvalds wrote:
> > On Wed, Dec 9, 2020 at 2:58 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > On Wed, Dec 09, 2020 at 07:49:38PM +0000, Matthew Wilcox wrote:
> > > >
> > > > Assuming this is safe, you can use RCU_INIT_POINTER() here because you're
> > > > storing NULL, so you don't need the wmb() before storing the pointer.
> > >
> > > fs/file.c:pick_file() would make more interesting target for the same treatment...
> >
> > Actually, don't.
> >
> > rcu_assign_pointer() itself already does the optimization for the case
> > of a constant NULL pointer assignment.
> >
> > So there's no need to manually change things to RCU_INIT_POINTER().
>
> I missed that, and the documentation wasn't updated by
> 3a37f7275cda5ad25c1fe9be8f20c76c60d175fa.

Can't trust the author of that patch!  ;-)

> Paul, how about this?
>
> +++ b/Documentation/RCU/Design/Requirements/Requirements.rst
> @@ -1668,8 +1668,10 @@ against mishaps and misuse:
>     this purpose.
>  #. It is not necessary to use rcu_assign_pointer() when creating
>     linked structures that are to be published via a single external
> -   pointer. The RCU_INIT_POINTER() macro is provided for this task
> -   and also for assigning ``NULL`` pointers at runtime.
> +   pointer. The RCU_INIT_POINTER() macro is provided for this task.
> +   It used to be more efficient to use RCU_INIT_POINTER() to store a
> +   ``NULL`` pointer, but rcu_assign_pointer() now optimises for a constant
> +   ``NULL`` pointer itself.
>
>  This not a hard-and-fast list: RCU's diagnostic capabilities will
>  continue to be guided by the number and type of usage bugs found in

Looks good to me!  If you send a complete patch, I will be happy to
pull it in.

							Thanx, Paul
On Wed, Dec 9, 2020 at 3:07 PM Matthew Wilcox <willy@infradead.org> wrote:
>
>  #. It is not necessary to use rcu_assign_pointer() when creating
>     linked structures that are to be published via a single external
> -   pointer. The RCU_INIT_POINTER() macro is provided for this task
> -   and also for assigning ``NULL`` pointers at runtime.
> +   pointer. The RCU_INIT_POINTER() macro is provided for this task.
> +   It used to be more efficient to use RCU_INIT_POINTER() to store a
> +   ``NULL`` pointer, but rcu_assign_pointer() now optimises for a constant
> +   ``NULL`` pointer itself.

I would just remove the historical note about "It used to be more
efficient ..". It's not really helpful.

If somebody wants to dig into the history, it's there in git.

Maybe the note could be part of the commit message for the comment
update, explaining that it used to be more efficient but no longer is.
Because commit messages are about the explanation for the change.

But I don't feel it makes any sense in the source code.

              Linus
On Wed, Dec 09, 2020 at 03:29:09PM -0800, Linus Torvalds wrote:
> On Wed, Dec 9, 2020 at 3:07 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> >  #. It is not necessary to use rcu_assign_pointer() when creating
> >     linked structures that are to be published via a single external
> > -   pointer. The RCU_INIT_POINTER() macro is provided for this task
> > -   and also for assigning ``NULL`` pointers at runtime.
> > +   pointer. The RCU_INIT_POINTER() macro is provided for this task.
> > +   It used to be more efficient to use RCU_INIT_POINTER() to store a
> > +   ``NULL`` pointer, but rcu_assign_pointer() now optimises for a constant
> > +   ``NULL`` pointer itself.
>
> I would just remove the historical note about "It used to be more
> efficient ..". It's not really helpful.
>
> If somebody wants to dig into the history, it's there in git.
>
> Maybe the note could be part of the commit message for the comment
> update, explaining that it used to be more efficient but no longer is.
> Because commit messages are about the explanation for the change.
>
> But I don't feel it makes any sense in the source code.

Like this, then?

							Thanx, Paul

 #. It is not necessary to use rcu_assign_pointer() when creating
    linked structures that are to be published via a single external
-   pointer. The RCU_INIT_POINTER() macro is provided for this task
-   and also for assigning ``NULL`` pointers at runtime.
+   pointer. The RCU_INIT_POINTER() macro is provided for this task.
On Wed, Dec 9, 2020 at 4:39 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > Like this, then? Ack. Linus
On Wed, Dec 09, 2020 at 04:41:06PM -0800, Linus Torvalds wrote: > On Wed, Dec 9, 2020 at 4:39 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > Like this, then? > > Ack. Queued with Matthew's Reported-by and your Acked-by, thank you all! Thanx, Paul
On Wed, Dec 09, 2020 at 03:32:38PM -0600, Eric W. Biederman wrote:
> Al Viro <viro@zeniv.linux.org.uk> writes:
>
> > On Wed, Dec 09, 2020 at 11:13:38AM -0800, Linus Torvalds wrote:
> >> On Wed, Dec 9, 2020 at 10:05 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >> >
> >> > -				struct file * file = xchg(&fdt->fd[i], NULL);
> >> > +				struct file * file = fdt->fd[i];
> >> > 				if (file) {
> >> > +					rcu_assign_pointer(fdt->fd[i], NULL);
> >>
> >> This makes me nervous. Why did we use to do that xchg() there? That
> >> has atomicity guarantees that now are gone.
> >>
> >> Now, this whole thing should be called for just the last ref of the fd
> >> table, so presumably that atomicity was never needed in the first
> >> place. But the fact that we did that very expensive xchg() then makes
> >> me go "there's some reason for it".
> >>
> >> Is this xchg() just bogus historical leftover? It kind of looks that
> >> way. But maybe that change should be done separately?
> >
> > I'm still not convinced that exposing close_files() to parallel
> > 3rd-party accesses is safe in all cases, so this patch still needs
> > more analysis.
>
> That is fine.  I just wanted to post the latest version so we could
> continue the discussion.  Especially with comments etc.

It's probably safe.  I've spent today digging through the mess in
fs/notify and kernel/bpf, and while I'm disgusted with both, at this
point I believe that close_files() exposure is not going to create
problems with either.  And xchg() in there _is_ useless.

Said that, the BPF "file iterator" stuff is potentially very
unpleasant - it allows to pin a struct file found in any process'
descriptor table indefinitely long.  Temporary struct file references
grabbed by procfs code, while unfortunate, are at least short-lived;
with this stuff sky's the limit.

I'm not happy about having that available, especially if it's a
user-visible primitive we can't withdraw at zero notice ;-/

What are the users of that thing and is there any chance to replace it
with something saner?  IOW, what *is* realistically called for each
struct file by the users of that iterator?
Al Viro <viro@zeniv.linux.org.uk> writes:

> On Wed, Dec 09, 2020 at 03:32:38PM -0600, Eric W. Biederman wrote:
>> Al Viro <viro@zeniv.linux.org.uk> writes:
>>
>> > I'm still not convinced that exposing close_files() to parallel
>> > 3rd-party accesses is safe in all cases, so this patch still needs
>> > more analysis.
>>
>> That is fine.  I just wanted to post the latest version so we could
>> continue the discussion.  Especially with comments etc.
>
> It's probably safe.  I've spent today digging through the mess in
> fs/notify and kernel/bpf, and while I'm disgusted with both, at
> that point I believe that close_files() exposure is not going to
> create problems with either.  And xchg() in there _is_ useless.

Then I will work on a cleaned up version.

> Said that, BPF "file iterator" stuff is potentially very unpleasant -
> it allows to pin a struct file found in any process' descriptor table
> indefinitely long.  Temporary struct file references grabbed by procfs
> code, while unfortunate, are at least short-lived; with this stuff sky's
> the limit.
>
> I'm not happy about having that available, especially if it's a user-visible
> primitive we can't withdraw at zero notice ;-/
>
> What are the users of that thing and is there any chance to replace it
> with something saner?  IOW, what *is* realistically called for each
> struct file by the users of that iterator?

The bpf guys are no longer Cc'd and they can probably answer better
than I.

In a previous conversation it was mentioned that task_iter was
supposed to be a high-performance interface for getting proc-like data
out of the kernel using bpf.

If so I think that handles the lifetime issues, as bpf programs are
supposed to be short-lived and cannot pass references anywhere.

On the flip side it raises the question: did the BPF guys just make
the current layout of task_struct and struct file part of the linux
kernel user space ABI?

Eric
On Thu, Dec 10, 2020 at 01:29:01PM -0600, Eric W. Biederman wrote:
> Al Viro <viro@zeniv.linux.org.uk> writes:
>
> > What are the users of that thing and is there any chance to replace it
> > with something saner?  IOW, what *is* realistically called for each
> > struct file by the users of that iterator?
>
> The bpf guys are no longer Cc'd and they can probably answer better than
> I.
>
> In a previous conversation it was mentioned that task_iter was supposed
> to be a high performance interface for getting proc like data out of the
> kernel using bpf.
>
> If so I think that handles the lifetime issues as bpf programs are
> supposed to be short-lived and can not pass references anywhere.
>
> On the flip side it raises the question did the BPF guys just make the
> current layout of task_struct and struct file part of the linux kernel
> user space ABI?

An interesting question, that...  For the record: anybody coming to
complain about a removed/renamed/replaced with something else field
in struct file will be referred to Figure 1.

None of the VFS data structures has any layout stability warranties.
If BPF folks want access to something in that, they are welcome to
come and discuss the set of accessors; so far nothing of that sort has
happened.

Direct access to any fields of any of those structures is subject to
being broken at zero notice.

IMO we need some notation for a structure being off-limits for BPF,
tracing, etc., along the lines of "don't access any field directly".
On Thu, Dec 10, 2020 at 09:36:24PM +0000, Al Viro wrote:
> On Thu, Dec 10, 2020 at 01:29:01PM -0600, Eric W. Biederman wrote:
> > Al Viro <viro@zeniv.linux.org.uk> writes:
> >
> > > What are the users of that thing and is there any chance to replace it
> > > with something saner?  IOW, what *is* realistically called for each
> > > struct file by the users of that iterator?
> >
> > The bpf guys are no longer Cc'd and they can probably answer better than
> > I.
> >
> > In a previous conversation it was mentioned that task_iter was supposed
> > to be a high performance interface for getting proc like data out of the
> > kernel using bpf.
> >
> > If so I think that handles the lifetime issues as bpf programs are
> > supposed to be short-lived and can not pass references anywhere.
> >
> > On the flip side it raises the question did the BPF guys just make the
> > current layout of task_struct and struct file part of the linux kernel
> > user space ABI?
>
> An interesting question, that...

Imho, they did.  An example from the BPF LSM: a few weeks ago someone
asked me whether it would be possible to use the BPF LSM to enforce
that you can't open files when they are on a given filesystem.  Since
this bpf lsm allows to attach to lsm hooks, say security_file_open(),
you can get at the superblock and check the filesystem type in a bpf
program (requiring btf), i.e. in security_file_open() follow
file->f_inode->i_sb->s_type->s_magic.  If we change, say, struct
super_block I'd expect these bpf programs to break.  I'm sure there's
something clever that they came up with, but it is nonetheless
uncomfortably close to making internal kernel structures part of the
userspace ABI indeed.

> For the record: anybody coming to
> complain about a removed/renamed/replaced with something else field
> in struct file will be refered to Figure 1.
>
> None of the VFS data structures has any layout stability warranties.
> If BPF folks want access to something in that, they are welcome to come
> and discuss the set of accessors; so far nothing of that sort has happened.
>
> Direct access to any fields of any of those structures is subject to
> being broken at zero notice.
>
> IMO we need some notation for a structure being off-limits for BPF, tracing,
> etc., along the lines of "don't access any field directly".

Indeed.  I would also like to see a list where changes need to be sent
that are technically specific to a subsystem but will necessarily have
kernel-wide impact; prime example: a lot of bpf.

Christian
On Thu, Dec 10, 2020 at 11:30:24PM +0100, Christian Brauner wrote:
> (requiring btf), i.e. security_file_open, then follow
> file->f_inode->i_sb->s_type->s_magic.  If we change the say struct
> super_block I'd expect these bpf programs to break.

To break they would need to have compiled in the first place;
->s_type is struct file_system_type and it contains no ->s_magic
(nor would it be possible, really - ->s_magic can vary between
filesystems that *do* share ->s_type).
On Thu, Dec 10, 2020 at 10:54:05PM +0000, Al Viro wrote:
> On Thu, Dec 10, 2020 at 11:30:24PM +0100, Christian Brauner wrote:
> > (requiring btf), i.e. security_file_open, then follow
> > file->f_inode->i_sb->s_type->s_magic.  If we change the say struct
> > super_block I'd expect these bpf programs to break.
>
> To break they would need to have compiled in the first place;
> ->s_type is struct file_system_type and it contains no ->s_magic
> (nor would it be possible, really - ->s_magic can vary between
> filesystems that *do* share ->s_type).

Incidentally, a lot of things in e.g. struct dentry need care when
accessing; the fields are there, but e.g. blind access to name or
parent really can oops.  Moreover, blindly following a chain of
->d_parent pointers without taking appropriate precautions might end
up reading from arbitrary kernel address, including iomem ones.  I
don't see anything that would prevent that...

TAINT_BPF would probably be too impractical, since there's a lot of
boxen using it more reasonably on the networking side.  But it really
looks like we *do* need annotations with their violation triggering a
taint, so that BS bug reports could be discarded.
diff --git a/fs/file.c b/fs/file.c
index 412033d8cfdf..88064f185560 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -379,7 +379,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
 		return NULL;
 }
 
-static struct fdtable *close_files(struct files_struct * files)
+static void close_files(struct files_struct * files)
 {
 	/*
 	 * It is safe to dereference the fd table without RCU or
@@ -397,8 +397,9 @@ static struct fdtable *close_files(struct files_struct * files)
 		set = fdt->open_fds[j++];
 		while (set) {
 			if (set & 1) {
-				struct file * file = xchg(&fdt->fd[i], NULL);
+				struct file * file = fdt->fd[i];
 				if (file) {
+					rcu_assign_pointer(fdt->fd[i], NULL);
 					filp_close(file, files);
 					cond_resched();
 				}
@@ -407,19 +408,35 @@ static struct fdtable *close_files(struct files_struct * files)
 			set >>= 1;
 		}
 	}
+}
 
-	return fdt;
+static void free_files_struct_rcu(struct rcu_head *rcu)
+{
+	struct files_struct *files =
+		container_of(rcu, struct files_struct, fdtab.rcu);
+	struct fdtable *fdt = rcu_dereference_raw(files->fdt);
+
+	/* free the arrays if they are not embedded */
+	if (fdt != &files->fdtab)
+		__free_fdtable(fdt);
+	kmem_cache_free(files_cachep, files);
 }
 
 void put_files_struct(struct files_struct *files)
 {
 	if (atomic_dec_and_test(&files->count)) {
-		struct fdtable *fdt = close_files(files);
-
-		/* free the arrays if they are not embedded */
-		if (fdt != &files->fdtab)
-			__free_fdtable(fdt);
-		kmem_cache_free(files_cachep, files);
+		close_files(files);
+		/*
+		 * The rules for using the rcu_head in fdtable:
+		 *
+		 * If the fdtable is being freed independently
+		 * of the files_struct fdtable->rcu is used.
+		 *
+		 * When the fdtable and files_struct are freed
+		 * together the rcu_head from the fdtable embedded in
+		 * files_struct is used.
+		 */
+		call_rcu(&files->fdtab.rcu, free_files_struct_rcu);
 	}
 }
 
@@ -822,12 +839,14 @@ EXPORT_SYMBOL(fget_raw);
 
 struct file *fget_task(struct task_struct *task, unsigned int fd)
 {
+	struct files_struct *files;
 	struct file *file = NULL;
 
-	task_lock(task);
-	if (task->files)
-		file = __fget_files(task->files, fd, 0, 1);
-	task_unlock(task);
+	rcu_read_lock();
+	files = rcu_dereference(task->files);
+	if (files)
+		file = __fget_files(files, fd, 0, 1);
+	rcu_read_unlock();
 
 	return file;
 }
@@ -838,11 +857,9 @@ struct file *task_lookup_fd_rcu(struct task_struct *task, unsigned int fd)
 	struct files_struct *files;
 	struct file *file = NULL;
 
-	task_lock(task);
-	files = task->files;
+	files = rcu_dereference(task->files);
 	if (files)
 		file = files_lookup_fd_rcu(files, fd);
-	task_unlock(task);
 
 	return file;
 }
@@ -854,8 +871,7 @@ struct file *task_lookup_next_fd_rcu(struct task_struct *task, unsigned int *ret
 	unsigned int fd = *ret_fd;
 	struct file *file = NULL;
 
-	task_lock(task);
-	files = task->files;
+	files = rcu_dereference(task->files);
 	if (files) {
 		for (; fd < files_fdtable(files)->max_fds; fd++) {
 			file = files_lookup_fd_rcu(files, fd);
@@ -863,7 +879,6 @@ struct file *task_lookup_next_fd_rcu(struct task_struct *task, unsigned int *ret
 			break;
 		}
 	}
-	task_unlock(task);
 	*ret_fd = fd;
 	return file;
 }
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index d0e78174874a..37dcface020f 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -30,6 +30,7 @@ struct fdtable {
 	unsigned long *close_on_exec;
 	unsigned long *open_fds;
 	unsigned long *full_fds_bits;
+	/* Used for freeing fdtable and any containing files_struct */
 	struct rcu_head rcu;
 };
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 4d0ae6f827df..eca474a1586a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2982,7 +2982,7 @@ int ksys_unshare(unsigned long unshare_flags)
 
 	if (new_fd) {
 		fd = current->files;
-		current->files = new_fd;
+		rcu_assign_pointer(current->files, new_fd);
 		new_fd = fd;
 	}
 
@@ -3035,7 +3035,7 @@ int unshare_files(void)
 	old = task->files;
 	task_lock(task);
-	task->files = copy;
+	rcu_assign_pointer(task->files, copy);
 	task_unlock(task);
 	put_files_struct(old);
 	return 0;
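The free_files_struct_rcu() callback above recovers the enclosing files_struct from the pointer to its embedded rcu_head via container_of(). The pattern can be shown in a userspace sketch with hypothetical stand-in types (only the names mirror the kernel's; the real structures differ):

```c
#include <stddef.h>

/* Userspace rendering of the kernel's container_of() idiom: recover
 * the enclosing structure by subtracting the member's offset. */
#define my_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in types; fields chosen only to mirror the embedding shape
 * of the real fdtable / files_struct. */
struct rcu_head_sketch {
	void (*func)(struct rcu_head_sketch *);
};

struct fdtable_sketch {
	unsigned int max_fds;
	struct rcu_head_sketch rcu;	/* embedded, like fdtable.rcu */
};

struct files_struct_sketch {
	int count;
	struct fdtable_sketch fdtab;	/* embedded, like files_struct.fdtab */
};

/* What a callback like free_files_struct_rcu() effectively does:
 * given &files->fdtab.rcu, get back to files. */
static struct files_struct_sketch *
files_from_rcu(struct rcu_head_sketch *rcu)
{
	return my_container_of(rcu, struct files_struct_sketch, fdtab.rcu);
}
```

This is why a single rcu_head in the embedded fdtab can serve double duty: call_rcu() only hands the callback the rcu_head's address, and each callback knows which enclosing structure to compute from it.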
This makes it safe to dereference task->files with just rcu_read_lock
held, removing the need to take task_lock.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---

I have cleaned up my patch a little more and made this a little more
explicitly rcu.  I took a completely different approach to documenting
the use of rcu_head than what we described, which seems a little more
robust.

 fs/file.c               | 53 ++++++++++++++++++++++++++---------------
 include/linux/fdtable.h |  1 +
 kernel/fork.c           |  4 ++--
 3 files changed, 37 insertions(+), 21 deletions(-)