Message ID | 87wo66vvnm.fsf_-_@x220.int.ebiederm.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2,1/2] proc: Use PIDTYPE_TGID in next_tgid | expand |
On Thu, Apr 23, 2020 at 12:42 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > +void exchange_tids(struct task_struct *ntask, struct task_struct *otask) > +{ > + /* pid_links[PIDTYPE_PID].next is always NULL */ > + struct pid *npid = READ_ONCE(ntask->thread_pid); > + struct pid *opid = READ_ONCE(otask->thread_pid); > + > + rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]); > + rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]); > + rcu_assign_pointer(ntask->thread_pid, opid); > + rcu_assign_pointer(otask->thread_pid, npid); > + WRITE_ONCE(ntask->pid_links[PIDTYPE_PID].pprev, &opid->tasks[PIDTYPE_PID].first); > + WRITE_ONCE(otask->pid_links[PIDTYPE_PID].pprev, &npid->tasks[PIDTYPE_PID].first); > + WRITE_ONCE(ntask->pid, pid_nr(opid)); > + WRITE_ONCE(otask->pid, pid_nr(npid)); > +} This function is _very_ hard to read as written. It really wants a helper function to do the swapping per hlist_head and hlist_node, I think. And "opid/npid" is very hard to see, and the naming doesn't make much sense (if it's an "exchange", then why is it "old/new" - they're symmetric). At least something like struct hlist_head *old_pid_hlist = opid->tasks + PIDTYPE_PID; struct hlist_head *new_pid_hlist = npid->tasks + PIDTYPE_PID; struct hlist_node *old_pid_node = otask->pid_links + PIDTYPE_PID; struct hlist_node *new_pid_node = ntask->pid_links + PIDTYPE_PID; struct hlist_node *old_first_node = old_pid_hlist->first; struct hlist_node *new_first_node = new_pid_hlist->first; and then trying to group up the first/pprev/thread_pid/pid accesses so that you them together, and using a helper function that does the whole switch, so that you'd have /* Move new node to old hlist, and update thread_pid/pid fields */ insert_pid_pointers(old_pid_hlist, new_pid_node, new_first_node); rcu_assign_pointer(ntask->thread_pid, opid); WRITE_ONCE(ntask->pid, pid_nr(opid)); /* Move old new to new hlist, and update thread_pid/pid fields */ insert_pid_pointers(new_pid_hlist, old_pid_node, old_first_node); rcu_assign_pointer(otask->thread_pid, npid); WRITE_ONCE(otask->pid, pid_nr(npid)); or something roughly like that. (And the above still uses "old/new", which as mentioned sounds wrong to me. Maybe it should just be "a_xyz" and "b_xyz"? Also note that I did this in my MUA, so I could have gotten the names and types wrong etc). I think that would make it look at least _slightly_ less like random line noise and easier to follow. But maybe even a rcu_hlist_swap() helper? We have one for regular lists. Do we really have to do it all written out, not do it with a "remove and reinsert" model? Linus
Linus Torvalds <torvalds@linux-foundation.org> writes: > On Thu, Apr 23, 2020 at 12:42 PM Eric W. Biederman > <ebiederm@xmission.com> wrote: >> >> +void exchange_tids(struct task_struct *ntask, struct task_struct *otask) >> +{ >> + /* pid_links[PIDTYPE_PID].next is always NULL */ >> + struct pid *npid = READ_ONCE(ntask->thread_pid); >> + struct pid *opid = READ_ONCE(otask->thread_pid); >> + >> + rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]); >> + rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]); >> + rcu_assign_pointer(ntask->thread_pid, opid); >> + rcu_assign_pointer(otask->thread_pid, npid); >> + WRITE_ONCE(ntask->pid_links[PIDTYPE_PID].pprev, &opid->tasks[PIDTYPE_PID].first); >> + WRITE_ONCE(otask->pid_links[PIDTYPE_PID].pprev, &npid->tasks[PIDTYPE_PID].first); >> + WRITE_ONCE(ntask->pid, pid_nr(opid)); >> + WRITE_ONCE(otask->pid, pid_nr(npid)); >> +} > > This function is _very_ hard to read as written. > > It really wants a helper function to do the swapping per hlist_head > and hlist_node, I think. And "opid/npid" is very hard to see, and the > naming doesn't make much sense (if it's an "exchange", then why is it > "old/new" - they're symmetric). > > At least something like > > struct hlist_head *old_pid_hlist = opid->tasks + PIDTYPE_PID; > struct hlist_head *new_pid_hlist = npid->tasks + PIDTYPE_PID; > struct hlist_node *old_pid_node = otask->pid_links + PIDTYPE_PID; > struct hlist_node *new_pid_node = ntask->pid_links + PIDTYPE_PID; > > struct hlist_node *old_first_node = old_pid_hlist->first; > struct hlist_node *new_first_node = new_pid_hlist->first; > > and then trying to group up the first/pprev/thread_pid/pid accesses > so that you them together, and using a helper function that does the > whole switch, so that you'd have > > /* Move new node to old hlist, and update thread_pid/pid fields */ > insert_pid_pointers(old_pid_hlist, new_pid_node, new_first_node); > rcu_assign_pointer(ntask->thread_pid, opid); > WRITE_ONCE(ntask->pid, pid_nr(opid)); > > /* Move old new to new hlist, and update thread_pid/pid fields */ > insert_pid_pointers(new_pid_hlist, old_pid_node, old_first_node); > rcu_assign_pointer(otask->thread_pid, npid); > WRITE_ONCE(otask->pid, pid_nr(npid)); > > or something roughly like that. > > (And the above still uses "old/new", which as mentioned sounds wrong > to me. Maybe it should just be "a_xyz" and "b_xyz"? Also note that I > did this in my MUA, so I could have gotten the names and types wrong > etc). > > I think that would make it look at least _slightly_ less like random > line noise and easier to follow. > > But maybe even a rcu_hlist_swap() helper? We have one for regular > lists. Do we really have to do it all written out, not do it with a > "remove and reinsert" model? At one point my brain I had forgetten that xchg can not take two memory arguments and had hoped to be able to provide stronger guarnatees than I can. Which is where I think the structure of exchange_pids came from. I do agree the clearer we can write things, the easier it is for someone else to come along and follow. We can not use a remove and reinser model because that does break rcu accesses, and complicates everything else. With a swap model we have the struct pids pointer at either of the tasks that are swapped but never at nothing. With a remove/reinsert model we have to deal the addittional possibility of the pids not pointing at a thread at all which can result in things like signals not being delivered at all. I played with it a bit and the best I have been able to come up is: void hlist_swap_before_rcu(struct hlist_node *left, struct hlist_node *right) { struct hlist_node **lpprev = left->pprev; struct hlist_node **rpprev = right->pprev; rcu_assign_pointer(*lpprev, right); rcu_assign_pointer(*rpprev, left); WRITE_ONCE(left->pprev, rpprev); WRITE_ONCE(right->pprev, lpprev); } void exchange_tids(struct task_struct *left, struct task_struct *right) { struct hlist_node *lnode = &left->pid_links[PIDTYPE_PID]; struct hlist_node *rnode = &right->pid_links[PIDTYPE_PID]; struct pid *lpid, *rpid; /* Replace the single entry tid lists with each other */ hlist_swap_before_rcu(lnode, rnode); /* Swap thread_pid */ rpid = left->thread_pid; lpid = right->thread_pid; rcu_assign_pointer(left->thread_pid, lpid); rcu_assign_pointer(right->thread_pid, rpid); /* Swap the cached pid value */ WRITE_ONCE(left->pid, pid_nr(lpid)); WRITE_ONCE(right->pid, pid_nr(rpid)); } hlists because they are not doubly linked can legitimately swap their beginnings or their tails. Something that regular lists can not, and I think that is exactly the general purpose semantic I want. Does that look a little more readable? Eric
On 04/23, Eric W. Biederman wrote: > > When the thread group leader changes during exec and the old leaders > thread is reaped proc_flush_pid This is off-topic, but let me mention this before I forget... Note that proc_flush_pid() does nothing if CONFIG_PROC_FS=n, this mean that in this case release_task() leaks thread_pid. > +void exchange_tids(struct task_struct *ntask, struct task_struct *otask) > +{ > + /* pid_links[PIDTYPE_PID].next is always NULL */ > + struct pid *npid = READ_ONCE(ntask->thread_pid); > + struct pid *opid = READ_ONCE(otask->thread_pid); > + > + rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]); > + rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]); > + rcu_assign_pointer(ntask->thread_pid, opid); > + rcu_assign_pointer(otask->thread_pid, npid); > + WRITE_ONCE(ntask->pid_links[PIDTYPE_PID].pprev, &opid->tasks[PIDTYPE_PID].first); > + WRITE_ONCE(otask->pid_links[PIDTYPE_PID].pprev, &npid->tasks[PIDTYPE_PID].first); > + WRITE_ONCE(ntask->pid, pid_nr(opid)); > + WRITE_ONCE(otask->pid, pid_nr(npid)); > +} Oh, at first glance this breaks posix-cpu-timers.c:lookup_task(), the last user of has_group_leader_pid(). I think that we should change lookup_task() to return "struct *pid", this should simplify the code... Note that none of its callers needs task_struct. And, instead of thread_group_leader/has_group_leader_pid checks we should use pid_has_task(TGID). After that, this patch should kill has_group_leader_pid(). What do you think? Oleg.
On Thu, Apr 23, 2020 at 8:36 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > At one point my brain I had forgetten that xchg can not take two memory > arguments and had hoped to be able to provide stronger guarnatees than I > can. Which is where I think the structure of exchange_pids came from. Note that even if we were to have a "exchange two memory locations atomically" instruction (and we don't - even a "double cmpxchg" is actually just a double-_sized_ one, not a two different locations one), I'm not convinced it makes sense. There's no way to _walk_ two lists atomically. Any user will only ever walk one or the other, so it's not sensible to try to make the two list updates be atomic. And if a user for some reason walks both, the walking itself will obviously then be racy - it does one or the other first, and can see either the old state, or the new state - or see _neither_ (ie if you walk it twice, you might see neither task, or you might see both, just depending on order or walk). > I do agree the clearer we can write things, the easier it is for > someone else to come along and follow. Your alternate write of the function seems a bit more readable to me, even if the main effect might be just that it was split up a bit and added a few comments and whitespace. So I'm more happier with that one. That said: > We can not use a remove and reinser model because that does break rcu > accesses, and complicates everything else. With a swap model we have > the struct pids pointer at either of the tasks that are swapped but > never at nothing. I'm not suggesting removing the pid entirely - like making task->pid be NULL. I'm literally suggesting just doing the RCU list operations as "remove and re-insert". And that shouldn't break anything, for the same reason that an atomic exchange doesn't make sense: you can only ever walk one of the lists at a time. And regardless of how you walk it, you might not see the new state (or the old state) reliably. Put another way: > void hlist_swap_before_rcu(struct hlist_node *left, struct hlist_node *right) > { > struct hlist_node **lpprev = left->pprev; > struct hlist_node **rpprev = right->pprev; > > rcu_assign_pointer(*lpprev, right); > rcu_assign_pointer(*rpprev, left); These are the only two assignments that matter for anything that walks the list (the pprev ones are for things that change the list, and they have to have exclusions in place). And those two writes cannot be atomic anyway, so you fundamentally will always be in the situation that a walker can miss one of the tasks. Which is why I think it would be ok to just do the RCU list swap as a "remove left, remove right, add left, add right" operation. It doesn't seem fundamentally different to a walker than the "switch left/right" operation, and it seems much simpler. Is there something I'm missing? But I'm *not* suggesting that we change these simple parts to be "remove thread_pid or pid pointer, and then insert a new one": > /* Swap thread_pid */ > rpid = left->thread_pid; > lpid = right->thread_pid; > rcu_assign_pointer(left->thread_pid, lpid); > rcu_assign_pointer(right->thread_pid, rpid); > > /* Swap the cached pid value */ > WRITE_ONCE(left->pid, pid_nr(lpid)); > WRITE_ONCE(right->pid, pid_nr(rpid)); > } because I agree that for things that don't _walk_ the list, but just look up "thread_pid" vs "pid" atomically but asynchronously, we obviously need to get one or the other, not some kind of "empty" state. > Does that look a little more readable? Regardless, I find your new version at least a lot more readable, so I'm ok with it. It looks like Oleg found an independent issue, though. Linus
Oleg Nesterov <oleg@redhat.com> writes: > On 04/23, Eric W. Biederman wrote: >> >> When the thread group leader changes during exec and the old leaders >> thread is reaped proc_flush_pid > > This is off-topic, but let me mention this before I forget... > > Note that proc_flush_pid() does nothing if CONFIG_PROC_FS=n, this mean > that in this case release_task() leaks thread_pid. Good catch. Wow. I seem to be introducing almost as many bugs as I am fixing right now. Ouch. >> +void exchange_tids(struct task_struct *ntask, struct task_struct *otask) >> +{ >> + /* pid_links[PIDTYPE_PID].next is always NULL */ >> + struct pid *npid = READ_ONCE(ntask->thread_pid); >> + struct pid *opid = READ_ONCE(otask->thread_pid); >> + >> + rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]); >> + rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]); >> + rcu_assign_pointer(ntask->thread_pid, opid); >> + rcu_assign_pointer(otask->thread_pid, npid); >> + WRITE_ONCE(ntask->pid_links[PIDTYPE_PID].pprev, &opid->tasks[PIDTYPE_PID].first); >> + WRITE_ONCE(otask->pid_links[PIDTYPE_PID].pprev, &npid->tasks[PIDTYPE_PID].first); >> + WRITE_ONCE(ntask->pid, pid_nr(opid)); >> + WRITE_ONCE(otask->pid, pid_nr(npid)); >> +} > > Oh, at first glance this breaks posix-cpu-timers.c:lookup_task(), the last > user of has_group_leader_pid(). > > I think that we should change lookup_task() to return "struct *pid", this > should simplify the code... Note that none of its callers needs task_struct. > > And, instead of thread_group_leader/has_group_leader_pid checks we should > use pid_has_task(TGID). Somehow I thought we could get away without fiddling with that right now, but on second glance I can see the races. I played with this earlier and I agree returning a struct pid * is desirable. I will see if I can track down the patches I was playing with as that definitely needs to get fixed first. > After that, this patch should kill has_group_leader_pid(). > > What do you think? I agree completely. has_group_leader_pid is the same as thread_group_leader after this so should be removed. Especially as it won't have any users. There are several other potential cleanups as well. Such as not using a hlist for PIDTYPE_PID. Which would allow us to run the hlists through struct signal_struct instead. I think that would clean things up but that touches so many things it may just be pointless code churn. Just for mentioning I am thinking we should rename PIDTYPE_PID to PIDTYPE_TID just to create a distance in peoples minds between the kernel concepts and the user space concepts. Eric
On Fri, Apr 24, 2020 at 11:02 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > [..] even a "double cmpxchg" is > actually just a double-_sized_ one, not a two different locations > one Historical accuracy side note: the 68020 actually had a CAS2 that was "two different locations". Maybe somebody else did too. Linus
Linus Torvalds <torvalds@linux-foundation.org> writes: > On Thu, Apr 23, 2020 at 8:36 PM Eric W. Biederman <ebiederm@xmission.com> wrote: >> >> At one point my brain I had forgetten that xchg can not take two memory >> arguments and had hoped to be able to provide stronger guarnatees than I >> can. Which is where I think the structure of exchange_pids came from. > > Note that even if we were to have a "exchange two memory locations > atomically" instruction (and we don't - even a "double cmpxchg" is > actually just a double-_sized_ one, not a two different locations > one), I'm not convinced it makes sense. > > There's no way to _walk_ two lists atomically. Any user will only ever > walk one or the other, so it's not sensible to try to make the two > list updates be atomic. > > And if a user for some reason walks both, the walking itself will > obviously then be racy - it does one or the other first, and can see > either the old state, or the new state - or see _neither_ (ie if you > walk it twice, you might see neither task, or you might see both, just > depending on order or walk). > >> I do agree the clearer we can write things, the easier it is for >> someone else to come along and follow. > > Your alternate write of the function seems a bit more readable to me, > even if the main effect might be just that it was split up a bit and > added a few comments and whitespace. > > So I'm more happier with that one. That said: > >> We can not use a remove and reinser model because that does break rcu >> accesses, and complicates everything else. With a swap model we have >> the struct pids pointer at either of the tasks that are swapped but >> never at nothing. > > I'm not suggesting removing the pid entirely - like making task->pid > be NULL. I'm literally suggesting just doing the RCU list operations > as "remove and re-insert". > > And that shouldn't break anything, for the same reason that an atomic > exchange doesn't make sense: you can only ever walk one of the lists > at a time. And regardless of how you walk it, you might not see the > new state (or the old state) reliably. > > Put another way: > >> void hlist_swap_before_rcu(struct hlist_node *left, struct hlist_node *right) >> { >> struct hlist_node **lpprev = left->pprev; >> struct hlist_node **rpprev = right->pprev; >> >> rcu_assign_pointer(*lpprev, right); >> rcu_assign_pointer(*rpprev, left); > > These are the only two assignments that matter for anything that walks > the list (the pprev ones are for things that change the list, and they > have to have exclusions in place). > > And those two writes cannot be atomic anyway, so you fundamentally > will always be in the situation that a walker can miss one of the > tasks. > > Which is why I think it would be ok to just do the RCU list swap as a > "remove left, remove right, add left, add right" operation. It doesn't > seem fundamentally different to a walker than the "switch left/right" > operation, and it seems much simpler. > > Is there something I'm missing? The problem with remove remove add add is: A lookup that hit between the remove and the add could return nothing. The function kill_pid_info does everything it can to handle this case today does: int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid) { int error = -ESRCH; struct task_struct *p; for (;;) { rcu_read_lock(); p = pid_task(pid, PIDTYPE_PID); if (p) error = group_send_sig_info(sig, info, p, PIDTYPE_TGID); rcu_read_unlock(); if (likely(!p || error != -ESRCH)) return error; /* * The task was unhashed in between, try again. If it * is dead, pid_task() will return NULL, if we race with * de_thread() it will find the new leader. */ } } Now kill_pid_info is signalling the entire task and is just using PIDTYPE_PID to find a thread in the task. With the remove then add model there will be a point where pid_task will return nothing, because ever so briefly the lists will be empty. However with an actually swap we will find a task and kill_pid_info will work. It pathloglical cases lock_task_sighand might have to loop and we would need to find the new task that has the given pid. But kill_pid_info is guaranteed to work with swaps and will fail with remove add. > But I'm *not* suggesting that we change these simple parts to be > "remove thread_pid or pid pointer, and then insert a new one": > >> /* Swap thread_pid */ >> rpid = left->thread_pid; >> lpid = right->thread_pid; >> rcu_assign_pointer(left->thread_pid, lpid); >> rcu_assign_pointer(right->thread_pid, rpid); >> >> /* Swap the cached pid value */ >> WRITE_ONCE(left->pid, pid_nr(lpid)); >> WRITE_ONCE(right->pid, pid_nr(rpid)); >> } > > because I agree that for things that don't _walk_ the list, but just > look up "thread_pid" vs "pid" atomically but asynchronously, we > obviously need to get one or the other, not some kind of "empty" > state. For PIDTYPE_PID and PIDTYPE_TGID these practically aren't lists but pointers to the appropriate task. Only for PIDTYPE_PGID and PIDTYPE_SID do these become lists in practice. That not-really-a-list status allows for signel delivery to indivdual processes to happen in rcu context. Which is where we would get into trouble with add/remove. Since signals are guaranteed to be delivered to the entire session or the entire process group all of the list walking happens under the tasklist_lock currently. Which really keeps list walking from being a concern. >> Does that look a little more readable? > > Regardless, I find your new version at least a lot more readable, so > I'm ok with it. Good. Then I will finish cleaning it up and go with that version. > It looks like Oleg found an independent issue, though. Yes, and I will definitely work through those. Eric
On Fri, Apr 24, 2020 at 12:54 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > The problem with > > remove > remove > add > add > is: > > A lookup that hit between the remove and the add could return nothing. Argh. Because that thing doesn't actually _search_ the list, it just wants to pick any (first) entry. So it doesn't actually care what it gets, just that it gets something. Ok, I see. > For PIDTYPE_PID and PIDTYPE_TGID these practically aren't lists but > pointers to the appropriate task. Only for PIDTYPE_PGID and PIDTYPE_SID > do these become lists in practice. Yeah, that's what confused me. Ok, please add a comment about the "single-entry list" issue, and I'm happy. Linus
Eric, I am sick today and can't read the code, but I feel this patch is not right ... please correct me. So, iiuc when posix_cpu_timer_create() is called and CPUCLOCK_PERTHREAD is false we roughly have task = pid_task(pid, PIDTYPE_TGID); // lookup_task() /* WINDOW */ timer->it.cpu.pid = = get_task_pid(task, PIDTYPE_TGID) // posix_cpu_timer_create() Now suppose that we race with mt-exec and this "task" is the old leader; it can be release_task()'ed in the WINDOW above and then get_task_pid() will return NULL. That is why I suggested to change lookup_task() to return "struct pid*" to eliminate the pid -> task -> pid transition. Apart from the same_thread_group() check for the "thread" case we do not need task_struct at all, lookup_task() can do if (thread) { p = pid_task(pid, PIDTYPE_PID); if (p && !same_thread_group(p, current)) pid = NULL; } else { ... gettime check ... if (!pid_has_task(pid, PIDTYPE_TGID)) pid = NULL; } return pid; No? Oleg.
On Sun, Apr 26, 2020 at 7:14 AM Eric W. Biederman <ebiederm@xmission.com> wrote: > > To support this add hlist_swap_before_rcu. An hlist primitive that > will allow swaping the leading sections of two tasks. For exchanging > the task pids it will just be swapping the hlist_heads of two single > entry lists. But the functionality is more general. So I have no problems with the series now - the code is much more understandable. Partly because of the split-up, partly because of the comments, and partly because you explained the special case and why it was a valid thing to do... However, I did start thinking about this case again. I still don't think the "swap entry" macro is necessarily useful in _general_ - any time it's an actual individual entry, that swap macro doesn't really work. So the only reason it works here is because you're actually swapping the whole list. But that, in turn, shouldn't be using that "first node" model at all, it should use the hlist_head. That would have made it a lot more obvious what is actually going on to me. Now, the comment very much talks about the head case, but the code still looks like it's swapping a non-head thing. I guess the code technically _works_ with "swap two list ends", but does that actually really make sense as an operation? So I no longer hate how this patch looks, but I wonder if we should just make the whole "this node is the *first* node" a bit more explicit in both the caller and in the swapping code. It could be as simple as replacing just the conceptual types and names, so instead of some "pnode1" double-indirect node pointer, we'd have struct hlist_head *left_head = container_of(left->pprev, struct hlist_head, first); struct hlist_head *right_head = container_of(right->pprev, struct hlist_head, first); and then the code would do rcu_assign_pointer(right_head->first, left); rcu_assign_pointer(left_head->first, right); WRITE_ONCE(left->pprev, &right_head->first); WRITE_ONCE(right->pprev, &left_head->first); which should generate the exact same code, but makes it clear that what we're doing is switching the whole hlist when given the first entries. Doesn't that make what it actually does a lot more understandable? The *pnode1/pnode2 games are somewhat opaque, but with that type and name change and using "container_of()", the code now fairly naturally reads as "oh, we're changing the first pointers in the list heads, and making the nodes point back to them" . Again - the current function _works_ with swapping two hlists in the middle (not two entries - it swaps the whole list starting at that entry!), so your current patch is in some ways "more generic". I'm just suggesting that the generic case doesn't make much sense, and that the "we know the first entries, swap the lists" actually is what the real use is, and writing it as such makes the code easier to understand. But I'm not going to insist on this, so this is more an RFC. Maybe people disagree, and/or have an actual use case for that "break two hlists in the middle, swap the ends" that I find unlikely... (NOTE: My "convert to hlist_head" code _works_ for that case too because the code generation is the same! But it would be really really confusing for that code to be used for anything but the first entry). Linus
ebiederm@xmission.com (Eric W. Biederman) writes: > This allows the getref flag to be removed and the callers can > than take a task reference if needed. That changelog lacks any form of information why this should be changed. I can see the point vs. patch 2, but pretty please put coherent explanations into each patch. > Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> > --- > kernel/time/posix-cpu-timers.c | 41 +++++++++++++++++----------------- > 1 file changed, 21 insertions(+), 20 deletions(-) > > diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c > index 2fd3b3fa68bf..eba41c70f0f0 100644 > --- a/kernel/time/posix-cpu-timers.c > +++ b/kernel/time/posix-cpu-timers.c > @@ -86,36 +86,34 @@ static struct task_struct *lookup_task(const pid_t pid, bool thread, > } > > static struct task_struct *__get_task_for_clock(const clockid_t clock, > - bool getref, bool gettime) > + bool gettime) > { > const bool thread = !!CPUCLOCK_PERTHREAD(clock); > const pid_t pid = CPUCLOCK_PID(clock); > - struct task_struct *p; > > if (CPUCLOCK_WHICH(clock) >= CPUCLOCK_MAX) > return NULL; > > - rcu_read_lock(); > - p = lookup_task(pid, thread, gettime); > - if (p && getref) > - get_task_struct(p); > - rcu_read_unlock(); > - return p; > + return lookup_task(pid, thread, gettime); > } > > static inline struct task_struct *get_task_for_clock(const clockid_t clock) > { > - return __get_task_for_clock(clock, true, false); > + return __get_task_for_clock(clock, false); > } > > static inline struct task_struct *get_task_for_clock_get(const clockid_t clock) > { > - return __get_task_for_clock(clock, true, true); > + return __get_task_for_clock(clock, true); > } > > static inline int validate_clock_permissions(const clockid_t clock) > { > - return __get_task_for_clock(clock, false, false) ? 0 : -EINVAL; > + int ret; New line between declarations and code please. > + rcu_read_lock(); > + ret = __get_task_for_clock(clock, false) ? 0 : -EINVAL; > + rcu_read_unlock(); > + return ret; > } Thanks, tglx
ebiederm@xmission.com (Eric W. Biederman) writes: > Using pid_task(find_vpid(N), PIDTYPE_TGID) guarantees that if a task > is found it is at that moment the thread group leader. Which removes > the need for the follow on test has_group_leader_pid. > > I have reorganized the rest of the code in lookup_task for clarity, > and created a common return for most of the code. Sorry, it's way harder to read than the very explicit exits which were there before. > The special case for clock_gettime with "pid == gettid" is not my > favorite. I strongly suspect it isn't used as gettid is such a pain, > and passing 0 is much easier. Still it is easier to keep this special > case than to do the reasarch that will show it isn't used. It might be not your favorite, but when I refactored the code I learned the hard way that one of the test suites has assumptions that clock_gettime(PROCESS) works from any task of a group and not just for the group leader. Sure we could fix the test suite, but test code tends to be copied ... > /* > * Functions for validating access to tasks. > */ > -static struct task_struct *lookup_task(const pid_t pid, bool thread, > +static struct task_struct *lookup_task(const pid_t which_pid, bool thread, > bool gettime) > { > struct task_struct *p; > + struct pid *pid; > > /* > * If the encoded PID is 0, then the timer is targeted at current > * or the process to which current belongs. > */ > - if (!pid) > + if (!which_pid) > return thread ? current : current->group_leader; > > - p = find_task_by_vpid(pid); > - if (!p) > - return p; > - > - if (thread) > - return same_thread_group(p, current) ? p : NULL; > - > - if (gettime) { > + pid = find_vpid(which_pid); > + if (thread) { > + p = pid_task(pid, PIDTYPE_PID); > + if (p && !same_thread_group(p, current)) > + p = NULL; > + } else { > /* > * For clock_gettime(PROCESS) the task does not need to be > * the actual group leader. tsk->sighand gives > @@ -76,13 +75,13 @@ static struct task_struct *lookup_task(const pid_t pid, bool thread, > * reference on it and store the task pointer until the > * timer is destroyed. Btw, this comment is wrong since 55e8c8eb2c7b ("posix-cpu-timers: Store a reference to a pid not a task") Thanks, tglx
Oleg Nesterov <oleg@redhat.com> writes: > Eric, > > I am sick today and can't read the code, but I feel this patch is not > right ... please correct me. > So, iiuc when posix_cpu_timer_create() is called and CPUCLOCK_PERTHREAD > is false we roughly have > > task = pid_task(pid, PIDTYPE_TGID); // lookup_task() > > /* WINDOW */ > > timer->it.cpu.pid = = get_task_pid(task, PIDTYPE_TGID) // posix_cpu_timer_create() > > Now suppose that we race with mt-exec and this "task" is the old leader; > it can be release_task()'ed in the WINDOW above and then get_task_pid() > will return NULL. Except it is asking for PIDTYPE_TGID. task->signal even if it is freed (which it won't be in a mt-exec) is valid until after an rcu window. release_task() put_task_struct_rcu_user() call_rcu(..., delayed_put_task_struct()) ... rcu delay ... delayed_put_task_struct() put_task_struct() __put_task_struct() put_signal_struct() free_signal_struct() Which means that task->signal->pids[PIDTYPE_TGID] will remain valid even across mt-exec. Further the only change I have introduced is to perform this work under rcu_read_lock vs taking a reference to task_struct. As the reference to task_struct does not prevent release_task, the situation with respect to races in the rest of the code does not change. Hmm.... If the case instead is: > timer->it.cpu.pid = get_task_pid(task, PIDTYPE_PID) // posix_cpu_timer_create() Which can also happen for threads in the same thread group. I have to agree that we can wind up with a NULL pid. And that is a brand new bug, because we didn't use to use pids. Sigh. > That is why I suggested to change lookup_task() to return "struct pid*" > to eliminate the pid -> task -> pid transition. Yes. I have to agree. Getting rid of the pid -> task -> pid transition looks important to close bugs like that. > Apart from the same_thread_group() check for the "thread" case we do not > need task_struct at all, lookup_task() can do > > if (thread) { > p = pid_task(pid, PIDTYPE_PID); > if (p && !same_thread_group(p, current)) > pid = NULL; > } else { > ... gettime check ... > > if (!pid_has_task(pid, PIDTYPE_TGID)) > pid = NULL; > } > > return pid; > > No? There is also the posix_cpu_clock_get, where we immediately use the clock instead of create something we can use later. I want to say the gettime case is another reason to go through the whole transition but the code can just as easily say "pid = task_tgid(current)" as it can "p = current"; Eric
Thomas Gleixner <tglx@linutronix.de> writes: > ebiederm@xmission.com (Eric W. Biederman) writes: > >> This allows the getref flag to be removed and the callers can >> than take a task reference if needed. > > That changelog lacks any form of information why this should be > changed. I can see the point vs. patch 2, but pretty please put coherent > explanations into each patch. Well excess flags bad. But in this case I can do even better. The code no longer takes a reference on task_struct so this patch removes unnecessary code. I will see if I can say that better. >> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> >> --- >> kernel/time/posix-cpu-timers.c | 41 +++++++++++++++++----------------- >> 1 file changed, 21 insertions(+), 20 deletions(-) >> >> diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c >> index 2fd3b3fa68bf..eba41c70f0f0 100644 >> --- a/kernel/time/posix-cpu-timers.c >> +++ b/kernel/time/posix-cpu-timers.c >> @@ -86,36 +86,34 @@ static struct task_struct *lookup_task(const pid_t pid, bool thread, >> } >> >> static struct task_struct *__get_task_for_clock(const clockid_t clock, >> - bool getref, bool gettime) >> + bool gettime) >> { >> const bool thread = !!CPUCLOCK_PERTHREAD(clock); >> const pid_t pid = CPUCLOCK_PID(clock); >> - struct task_struct *p; >> >> if (CPUCLOCK_WHICH(clock) >= CPUCLOCK_MAX) >> return NULL; >> >> - rcu_read_lock(); >> - p = lookup_task(pid, thread, gettime); >> - if (p && getref) >> - get_task_struct(p); >> - rcu_read_unlock(); >> - return p; >> + return lookup_task(pid, thread, gettime); >> } >> >> static inline struct task_struct *get_task_for_clock(const clockid_t clock) >> { >> - return __get_task_for_clock(clock, true, false); >> + return __get_task_for_clock(clock, false); >> } >> >> static inline struct task_struct *get_task_for_clock_get(const clockid_t clock) >> { >> - return __get_task_for_clock(clock, true, true); >> + return __get_task_for_clock(clock, true); >> } >> >> static inline int validate_clock_permissions(const clockid_t clock) >> { >> - return __get_task_for_clock(clock, false, false) ? 0 : -EINVAL; >> + int ret; > > New line between declarations and code please. > >> + rcu_read_lock(); >> + ret = __get_task_for_clock(clock, false) ? 0 : -EINVAL; >> + rcu_read_unlock(); >> + return ret; >> } > > Thanks, > > tglx
Linus Torvalds <torvalds@linux-foundation.org> writes: > On Sun, Apr 26, 2020 at 7:14 AM Eric W. Biederman <ebiederm@xmission.com> wrote: >> >> To support this add hlist_swap_before_rcu. An hlist primitive that >> will allow swaping the leading sections of two tasks. For exchanging >> the task pids it will just be swapping the hlist_heads of two single >> entry lists. But the functionality is more general. > > So I have no problems with the series now - the code is much more > understandable. Partly because of the split-up, partly because of the > comments, and partly because you explained the special case and why it > was a valid thing to do... > > However, I did start thinking about this case again. > > I still don't think the "swap entry" macro is necessarily useful in > _general_ - any time it's an actual individual entry, that swap macro > doesn't really work. But it isn't a "swap entry" macro/function. I did not even attempt to make it a "swap entry" function. I made a chop two lists into two and swap the pieces function. > So the only reason it works here is because you're actually swapping > the whole list. > > But that, in turn, shouldn't be using that "first node" model at all, > it should use the hlist_head. That would have made it a lot more > obvious what is actually going on to me. > > Now, the comment very much talks about the head case, but the code > still looks like it's swapping a non-head thing. > > I guess the code technically _works_ with "swap two list ends", but > does that actually really make sense as an operation? As an operation yes. Will anyone else want that operation I don't know. > So I no longer hate how this patch looks, but I wonder if we should > just make the whole "this node is the *first* node" a bit more > explicit in both the caller and in the swapping code. > > It could be as simple as replacing just the conceptual types and > names, so instead of some "pnode1" double-indirect node pointer, we'd > have > > struct hlist_head *left_head = container_of(left->pprev, > struct hlist_head, first); > struct hlist_head *right_head = container_of(right->pprev, > struct hlist_head, first); > > and then the code would do > > rcu_assign_pointer(right_head->first, left); > rcu_assign_pointer(left_head->first, right); > WRITE_ONCE(left->pprev, &right_head->first); > WRITE_ONCE(right->pprev, &left_head->first); > > which should generate the exact same code, but makes it clear that > what we're doing is switching the whole hlist when given the first > entries. > > Doesn't that make what it actually does a lot more understandable? Understandable is a bit subjective. I think having a well defined hlist operation I can call makes things more understandable. I think the getting the list head as: "head = &task->thread_pid->tasks[PIDTYPE_PID];" is more understandable and less risky than container_of. My concern and probably unreasonbable as this is a slow path with getting the list heads after looking up the pid is that it seems to add a wait for an additional cache line to load before anything can happen. The only way I really know to make this code much more understandable is to remove the lists entirely for this case. But that is a much larger change and it is not clear that it makes the kernel code overall better. I stared at that for a while and it is an interesting follow on but not something I want or we even can do before exchange_tids is in place. > The > *pnode1/pnode2 games are somewhat opaque, but with that type and name > change and using "container_of()", the code now fairly naturally reads > as "oh, we're changing the first pointers in the list heads, and > making the nodes point back to them" . > > Again - the current function _works_ with swapping two hlists in the > middle (not two entries - it swaps the whole list starting at that > entry!), so your current patch is in some ways "more generic". I'm > just suggesting that the generic case doesn't make much sense, and > that the "we know the first entries, swap the lists" actually is what > the real use is, and writing it as such makes the code easier to > understand. Yep. That is waht I designed it to do. I sort of went the other direction when writing this. I could start with the list heads and swap the rest of the lists and get the same code. But it looked like it would be a little slower to find the hlist_heads, and I couldn't think of a good name for the function. So I figured if I was writing a fucntion for this case I would write one that was convinient. For understandability that is my real challenge what is a good name that people can read and understand what is happening for this swapping function. > But I'm not going to insist on this, so this is more an RFC. Maybe > people disagree, and/or have an actual use case for that "break two > hlists in the middle, swap the ends" that I find unlikely... > > (NOTE: My "convert to hlist_head" code _works_ for that case too > because the code generation is the same! But it would be really really > confusing for that code to be used for anything but the first entry). Yes. I am open to improvements. Especially in the naming. Would hlists_swap_heads_rcu be noticably better? Eric
Thomas Gleixner <tglx@linutronix.de> writes: > ebiederm@xmission.com (Eric W. Biederman) writes: >> Using pid_task(find_vpid(N), PIDTYPE_TGID) guarantees that if a task >> is found it is at that moment the thread group leader. Which removes >> the need for the follow on test has_group_leader_pid. >> >> I have reorganized the rest of the code in lookup_task for clarity, >> and created a common return for most of the code. > > Sorry, it's way harder to read than the very explicit exits which were > there before. My biggest gripe is the gettime and the ordinary !thread case should be sharing code and they are not. I know historically why they don't but for all practical purposes has_group_leader_pid and thread_group_leader are the same test. >> The special case for clock_gettime with "pid == gettid" is not my >> favorite. I strongly suspect it isn't used as gettid is such a pain, >> and passing 0 is much easier. Still it is easier to keep this special >> case than to do the reasarch that will show it isn't used. > > It might be not your favorite, but when I refactored the code I learned > the hard way that one of the test suites has assumptions that > clock_gettime(PROCESS) works from any task of a group and not just for > the group leader. Sure we could fix the test suite, but test code tends > to be copied ... Do you know which test suite? It would be nice to see such surprising code with my own eyes. I completely agree that clock_gettime(PROCESS) should work for any task of a group. I would think anything with a constant like that would just be passing in 0, which is trivial. Looking up your threadid seems like extra work. Mostly my complaint is that the gettime subcase is an awkward special case. Added in 33ab0fec3352 ("posix-timers: Consolidate posix_cpu_clock_get()") and the only justification for changing the userspace ABI was that it made things less awkward to combine to branches of code. >> /* >> * Functions for validating access to tasks. >> */ >> -static struct task_struct *lookup_task(const pid_t pid, bool thread, >> +static struct task_struct *lookup_task(const pid_t which_pid, bool thread, >> bool gettime) >> { >> struct task_struct *p; >> + struct pid *pid; >> >> /* >> * If the encoded PID is 0, then the timer is targeted at current >> * or the process to which current belongs. >> */ >> - if (!pid) >> + if (!which_pid) >> return thread ? current : current->group_leader; >> >> - p = find_task_by_vpid(pid); >> - if (!p) >> - return p; >> - >> - if (thread) >> - return same_thread_group(p, current) ? p : NULL; >> - >> - if (gettime) { >> + pid = find_vpid(which_pid); >> + if (thread) { >> + p = pid_task(pid, PIDTYPE_PID); >> + if (p && !same_thread_group(p, current)) >> + p = NULL; >> + } else { >> /* >> * For clock_gettime(PROCESS) the task does not need to be >> * the actual group leader. tsk->sighand gives >> @@ -76,13 +75,13 @@ static struct task_struct *lookup_task(const pid_t pid, bool thread, >> * reference on it and store the task pointer until the >> * timer is destroyed. > > Btw, this comment is wrong since > > 55e8c8eb2c7b ("posix-cpu-timers: Store a reference to a pid not a task") It is definitely stale. It continues to describe the motive for limiting ourselves to a thread_group_leader. I am cooking up a patch to tweak that comment, and get replace of has_group_leader_pid. Since it is unnecessary I just want to decouple that work from this patchset. Eric
ebiederm@xmission.com (Eric W. Biederman) writes: > Take a good look over the code base and see if I can see any more > issues like was found in next_tgid and repost the core patches. Oleg, I am reading through kernel/ptrace.c and I am seeing a lot of: spin_lock(&child->sighand->siglock); In places where I don't see anything guaranteeing the child is stopped such as ptrace_freeze_traced. Are all of those places safe or do some of the need to be transformed into lock_task_sighand in case the process is current running? I might just be reading the code to quickly and missing what keeps the code from executing exec and changing sighand. Eric
On Mon, Apr 27, 2020 at 7:32 AM Eric W. Biederman <ebiederm@xmission.com> wrote: > > Would hlists_swap_heads_rcu be noticably better? Edited out most of the rest because I think we're in agreement and it's not a huge deal. And yes, I think it might be nice to just call it "swap_heads()" and make the naming match what the user wants, so that people who don't care about the implementation can just guess from the function name. But I also think that by now it's mostly bikeshedding, and I'm probably also now biased by the fact that I have looked at that code and read your explanations so it's all fresh in my mind. A year from now, when I've forgotten the details, who knows which part I'd find confusing? ;) Linus
On 04/27, Eric W. Biederman wrote: > > Oleg Nesterov <oleg@redhat.com> writes: > > > Eric, > > > > I am sick today and can't read the code, but I feel this patch is not > > right ... please correct me. > > > > So, iiuc when posix_cpu_timer_create() is called and CPUCLOCK_PERTHREAD > > is false we roughly have > > > > task = pid_task(pid, PIDTYPE_TGID); // lookup_task() > > > > /* WINDOW */ > > > > timer->it.cpu.pid = = get_task_pid(task, PIDTYPE_TGID) // posix_cpu_timer_create() > > > > Now suppose that we race with mt-exec and this "task" is the old leader; > > it can be release_task()'ed in the WINDOW above and then get_task_pid() > > will return NULL. > > Except it is asking for PIDTYPE_TGID. > > task->signal Ah yes, I knew I missed something... Oleg.
diff --git a/fs/exec.c b/fs/exec.c index 06b4c550af5d..9b60f927afd7 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1186,11 +1186,8 @@ static int de_thread(struct task_struct *tsk) /* Become a process group leader with the old leader's pid. * The old leader becomes a thread of the this thread group. - * Note: The old leader also uses this pid until release_task - * is called. Odd but simple and correct. */ - tsk->pid = leader->pid; - change_pid(tsk, PIDTYPE_PID, task_pid(leader)); + exchange_tids(tsk, leader); transfer_pid(leader, tsk, PIDTYPE_TGID); transfer_pid(leader, tsk, PIDTYPE_PGID); transfer_pid(leader, tsk, PIDTYPE_SID); diff --git a/include/linux/pid.h b/include/linux/pid.h index cc896f0fc4e3..2159ffca63fc 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -102,6 +102,7 @@ extern void attach_pid(struct task_struct *task, enum pid_type); extern void detach_pid(struct task_struct *task, enum pid_type); extern void change_pid(struct task_struct *task, enum pid_type, struct pid *pid); +extern void exchange_tids(struct task_struct *task, struct task_struct *old); extern void transfer_pid(struct task_struct *old, struct task_struct *new, enum pid_type); diff --git a/kernel/pid.c b/kernel/pid.c index c835b844aca7..4ece32d8791a 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -363,6 +363,22 @@ void change_pid(struct task_struct *task, enum pid_type type, attach_pid(task, type); } +void exchange_tids(struct task_struct *ntask, struct task_struct *otask) +{ + /* pid_links[PIDTYPE_PID].next is always NULL */ + struct pid *npid = READ_ONCE(ntask->thread_pid); + struct pid *opid = READ_ONCE(otask->thread_pid); + + rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]); + rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]); + rcu_assign_pointer(ntask->thread_pid, opid); + rcu_assign_pointer(otask->thread_pid, npid); + WRITE_ONCE(ntask->pid_links[PIDTYPE_PID].pprev, &opid->tasks[PIDTYPE_PID].first); + WRITE_ONCE(otask->pid_links[PIDTYPE_PID].pprev, &npid->tasks[PIDTYPE_PID].first); + WRITE_ONCE(ntask->pid, pid_nr(opid)); + WRITE_ONCE(otask->pid, pid_nr(npid)); +} + /* transfer_pid is an optimization of attach_pid(new), detach_pid(old) */ void transfer_pid(struct task_struct *old, struct task_struct *new, enum pid_type type)