Message ID | 20200313174701.148376-4-bigeasy@linutronix.de (mailing list archive) |
---|---|
State | Superseded, archived |
Delegated to: | Bjorn Helgaas |
Headers | show |
Series | None | expand |
On 2020-03-13 11:46 a.m., Sebastian Andrzej Siewior wrote: > The poll callback is abusing the completion wait queue and sticks it into > poll_wait() to wake up pollers after a command has completed. > > First of all it's a layering violation as it imposes restrictions on the > inner workings of completions. Just because C allows to do so does not > justify that in any way. The proper way to do such things is to post > patches which extend the core infrastructure and not by silently abusing > it. As I've said previously, I disagree with this approach. Open coding standard primitives sweeps issues under the rug and is a step backwards for code quality. Calling it a layering violation is just one opinion and if it is, the better solution would be to create an interface you find appropriate so that it isn't one. > Aside of that the implementation is seriously broken: > > 1) It cannot work with EPOLLEXCLUSIVE Why? You don't explain this. And I don't see how this patch would change anything to do with the call to poll_wait(). All you've done is open-code the completion. Not that it matters that much, having multiple waiters poll on this interface can pretty much never happen. It only makes sense for the process who submitted the write to poll on the interface. > 2) It's racy: > > poll() write() > switchtec_dev_poll() switchtec_dev_write() > poll_wait(&s->comp.wait); mrpc_queue_cmd() > init_completion(&s->comp) > init_waitqueue_head(&s->comp.wait) That's a nice catch! But wouldn't an easier solution be to change the code to use reinit_completion() instead of using the bug to justify a different change? Thanks, Logan
On Fri, Mar 13, 2020 at 06:46:55PM +0100, Sebastian Andrzej Siewior wrote: > The poll callback is abusing the completion wait queue and sticks it into > poll_wait() to wake up pollers after a command has completed. > > First of all it's a layering violation as it imposes restrictions on the > inner workings of completions. Just because C allows to do so does not > justify that in any way. The proper way to do such things is to post > patches which extend the core infrastructure and not by silently abusing > it. > > Aside of that the implementation is seriously broken: > > 1) It cannot work with EPOLLEXCLUSIVE > > 2) It's racy: > > poll() write() > switchtec_dev_poll() switchtec_dev_write() > poll_wait(&s->comp.wait); mrpc_queue_cmd() > init_completion(&s->comp) > init_waitqueue_head(&s->comp.wait) > > Replace it with a regular wait queue which removes the completion abuse and > cures #1 and #2 above. > > Cc: Kurt Schwemmer <kurt.schwemmer@microsemi.com> > Cc: Logan Gunthorpe <logang@deltatee.com> > Cc: Bjorn Helgaas <bhelgaas@google.com> > Cc: linux-pci@vger.kernel.org > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Relying on implementation details of locking primitives like that is yuck. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Logan Gunthorpe <logang@deltatee.com> writes: > On 2020-03-13 11:46 a.m., Sebastian Andrzej Siewior wrote: >> The poll callback is abusing the completion wait queue and sticks it into >> poll_wait() to wake up pollers after a command has completed. >> >> First of all it's a layering violation as it imposes restrictions on the >> inner workings of completions. Just because C allows to do so does not >> justify that in any way. The proper way to do such things is to post >> patches which extend the core infrastructure and not by silently abusing >> it. > > As I've said previously, I disagree with this approach. Feel free to do s. > Open coding standard primitives sweeps issues under the rug and is a > step backwards for code quality. Calling it a layering violation is > just one opinion and if it is, the better solution would be to create > an interface you find appropriate so that it isn't one. There is no standard primitive which allows to poll on a completion. You decided that this is smart to do and just because C does not allow to hide implementation details this is not a justification for this at all. Due to the limitations of C, the kernel has to rely on the assumption that developers know and respect the difference between API and implementation. Relying on implementation details of an interface and then arguing that this is a standard primitive for the chosen purpose is just backwards. What's even more hillarious is that you now request that we give you a replacement interface for something which was not an interface to use in the first place. >> 1) It cannot work with EPOLLEXCLUSIVE > > Why? You don't explain this. And I don't see how this patch would change > anything to do with the call to poll_wait(). All you've done is > open-code the completion. > > Not that it matters that much, having multiple waiters poll on this > interface can pretty much never happen. It only makes sense for the > process who submitted the write to poll on the interface. It does not matter whether your envisioned usage implies that it cannot happen. Fact is that there is no restriction. That means using this with the well documented semantics of poll(2) will result in failure. >> 2) It's racy: >> >> poll() write() >> switchtec_dev_poll() switchtec_dev_write() >> poll_wait(&s->comp.wait); mrpc_queue_cmd() >> init_completion(&s->comp) >> init_waitqueue_head(&s->comp.wait) > > That's a nice catch! But wouldn't an easier solution be to change the > code to use reinit_completion() instead of using the bug to justify a > different change? Sure taht can be cured by changing it to reinit, but that does not cure the abuse of implementation details. As Peter, who maintains the completion code says: Relying on implementation details of locking primitives like that is yuck. Thanks, tglx
On 2020-03-13 6:23 p.m., Thomas Gleixner wrote: > Logan Gunthorpe <logang@deltatee.com> writes: >> On 2020-03-13 11:46 a.m., Sebastian Andrzej Siewior wrote: >>> The poll callback is abusing the completion wait queue and sticks it into >>> poll_wait() to wake up pollers after a command has completed. >>> >>> First of all it's a layering violation as it imposes restrictions on the >>> inner workings of completions. Just because C allows to do so does not >>> justify that in any way. The proper way to do such things is to post >>> patches which extend the core infrastructure and not by silently abusing >>> it. >> >> As I've said previously, I disagree with this approach. > > Feel free to do s. > >> Open coding standard primitives sweeps issues under the rug and is a >> step backwards for code quality. Calling it a layering violation is >> just one opinion and if it is, the better solution would be to create >> an interface you find appropriate so that it isn't one. > > There is no standard primitive which allows to poll on a completion. > > You decided that this is smart to do and just because C does not > allow to hide implementation details this is not a justification for > this at all. > > Due to the limitations of C, the kernel has to rely on the assumption > that developers know and respect the difference between API and > implementation. > > Relying on implementation details of an interface and then arguing that > this is a standard primitive for the chosen purpose is just backwards. > > What's even more hillarious is that you now request that we give you a > replacement interface for something which was not an interface to use in > the first place. I'm in awe at the lack of professionalism in your emails. If you bothered to edit out the ad hominems, you might have noticed that nobody has yet described how the poll interface fails here (with EPOLLEXCLUSIVE) or how replacing one wait queue for another fixes the purported problem. We clearly disagree on what's considered appropriate usage of the completion helper (calling it a "locking primitive" is a bit disingenuous) and it doesn't sound like that's going to change. I hold no power here, but you aren't going to bully me into giving this patch an Ack or into silencing my opinion on the matter. I'd prefer it if you submit a patch that's honest about what it's trying to accomplish and why (ie. it doesn't masquerade as being necessary to fix a bug). I also ask that you accept that I'm within my right to voice my dissent. If, after that, Bjorn chooses to take your patch, then I will respect his decision. I trust that he's able to read behind the personal attacks and look only at the technical issues. Logan
Logan Gunthorpe <logang@deltatee.com> writes: > On 2020-03-13 6:23 p.m., Thomas Gleixner wrote: > I'm in awe at the lack of professionalism in your emails. If you > bothered to edit out the ad hominems, you might have noticed that nobody > has yet described how the poll interface fails here (with > EPOLLEXCLUSIVE) or how replacing one wait queue for another fixes the > purported problem. I merily stated an opinion, but if you consider this an ad hominem attack, then let me ensure you this wasn't my intention and accept my apology. Thanks, tglx
On 2020-03-16 12:52 p.m., Thomas Gleixner wrote: > Logan Gunthorpe <logang@deltatee.com> writes: >> On 2020-03-13 6:23 p.m., Thomas Gleixner wrote: >> I'm in awe at the lack of professionalism in your emails. If you >> bothered to edit out the ad hominems, you might have noticed that nobody >> has yet described how the poll interface fails here (with >> EPOLLEXCLUSIVE) or how replacing one wait queue for another fixes the >> purported problem. > > I merily stated an opinion, but if you consider this an ad hominem > attack, then let me ensure you this wasn't my intention and accept my > apology. A technical opinion, and a valid argument, does *not* involve telling me what my decision process was ("You decided this was smart to do"), or mocking it as "hillarious" (sic). Your actual opinion was drowned out by these attacks. And, while valid, your opinion is very much subjective and I, personally, disagree with it. I accept your apology and hope this doesn't happen again. Thanks, Logan
Logan Gunthorpe <logang@deltatee.com> writes: > On 2020-03-13 11:46 a.m., Sebastian Andrzej Siewior wrote: >> 1) It cannot work with EPOLLEXCLUSIVE > > Why? You don't explain this. man epoll_ctt(2) EPOLLEXCLUSIVE (since Linux 4.5) Sets an exclusive wakeup mode for the epoll file descriptor that is being attached to the target file descriptor, fd. When a wakeup event occurs and multiple epoll file descriptors are attached to the same target file using EPOLLEXCLUSIVE, one or more of the epoll file descriptors will receive an event with epoll_wait(2). As this uses complete_all() there is no distinction possible, because complete_all() wakes up everything. > And I don't see how this patch would change anything to do with the > call to poll_wait(). All you've done is open-code the completion. wake_up_interruptible(x) resolves to: __wake_up(x, TASK_INTERRUPTIBLE, 1, NULL) which wakes exactly 1 exclusive waiter. Also the other way round is just working because the waker side uses complete_all(). Why? Because completion internally defaults to exclusive mode and complete() wakes exactly one exlusive waiter. There is a conceptual difference and while it works for that particular purpose to some extent it's not suitable as a general wait notification construct. Thanks, tglx
On 2020-03-16 1:34 p.m., Thomas Gleixner wrote: > Logan Gunthorpe <logang@deltatee.com> writes: >> On 2020-03-13 11:46 a.m., Sebastian Andrzej Siewior wrote: >>> 1) It cannot work with EPOLLEXCLUSIVE >> >> Why? You don't explain this. > > man epoll_ctt(2) > > EPOLLEXCLUSIVE (since Linux 4.5) > > Sets an exclusive wakeup mode for the epoll file descriptor that is > being attached to the target file descriptor, fd. When a wakeup event > occurs and multiple epoll file descriptors are attached to the same > target file using EPOLLEXCLUSIVE, one or more of the epoll file > descriptors will receive an event with epoll_wait(2). > > As this uses complete_all() there is no distinction possible, because > complete_all() wakes up everything. > >> And I don't see how this patch would change anything to do with the >> call to poll_wait(). All you've done is open-code the completion. > > wake_up_interruptible(x) resolves to: > > __wake_up(x, TASK_INTERRUPTIBLE, 1, NULL) > > which wakes exactly 1 exclusive waiter. > > Also the other way round is just working because the waker side uses > complete_all(). Why? Because completion internally defaults to exclusive > mode and complete() wakes exactly one exlusive waiter. > > There is a conceptual difference and while it works for that particular > purpose to some extent it's not suitable as a general wait notification > construct. Ok, I now understand this point. That's exceedingly subtle. I certainly would not agree that this qualifies as "seriously broken", and I'm not even sure I'd agree that this actually violates the semantics of poll() seeing the man page clearly states that with EPOLLEXCLUSIVE set, "one or more" pollers will be woken up. So waking up all of them is still allowed. Ensuring fewer pollers wake up is just an optimization to avoid the thundering herd problem which users of this interface are very unlikely to ever have (I can confidently tell you that none have this problem now). If we do want to say that all poll_wait() users *must* respect EPOLLEXCLUSIVE, we should at least have some documentation saying that combining poll_wait() with wake_up_all() (or similar) is not allowed. A *very* quick check finds there's at least a few drivers doing this: drivers/char/ipmi/ipmb_dev_int.c drivers/dma-buf/sync_file.c drivers/gpu/vga/vgaarb.c (That's just looking at the drivers tree, up to "G".) Finally, since we seem to back to more reasonable discussion, I will make this point: it's fairly common for wait queue users to directly use the spinlock from within wait_queue_head_t without an interface (even completion.c does it). How are developers supposed to know when an interface is required and when it's not? Sometimes using "implementation" details interface-free is standard practice, but other times it's "yuck" and will illicit ire from other developers? Is it valid to use completion.wait.lock? Where's the line? Logan
Logan, Logan Gunthorpe <logang@deltatee.com> writes: > On 2020-03-16 1:34 p.m., Thomas Gleixner wrote: >> Logan Gunthorpe <logang@deltatee.com> writes: > I certainly would not agree that this qualifies as "seriously broken", I concede that this is the wrong wording, but chasing things like this and constantly mopping up code is not necessarily keeping my mood up all the time. > and I'm not even sure I'd agree that this actually violates the > semantics of poll() seeing the man page clearly states that with > EPOLLEXCLUSIVE set, "one or more" pollers will be woken up. So waking up > all of them is still allowed. Ensuring fewer pollers wake up is just an > optimization to avoid the thundering herd problem which users of this > interface are very unlikely to ever have (I can confidently tell you > that none have this problem now). I can see the point, but in general we should just strive for keeping the interfaces consistent and not subject to interpretation by individual developers. Maybe in your case the probability that someone wants to use it in that way is close to zero, but we've been surprised often enough by the creativity of user space developers who came up with use cases nobody ever expected. > If we do want to say that all poll_wait() users *must* respect > EPOLLEXCLUSIVE, we should at least have some documentation saying that > combining poll_wait() with wake_up_all() (or similar) is not allowed. A > *very* quick check finds there's at least a few drivers doing this: > > drivers/char/ipmi/ipmb_dev_int.c > drivers/dma-buf/sync_file.c > drivers/gpu/vga/vgaarb.c Right. There is surely quite some stuff in drivers which either predates these things or slipped through review. > Finally, since we seem to back to more reasonable discussion, I will > make this point: it's fairly common for wait queue users to directly use > the spinlock from within wait_queue_head_t without an interface (even > completion.c does it). completions and waitqueues are both core code. Core primitives build obviously on other primitives so dependencies on the inner workings are expected to some degree, especially for real and valuable optimization reasons. Solving these dependencies when one primitive changes has limited scope. > How are developers supposed to know when an interface is required and > when it's not? Sometimes using "implementation" details interface-free > is standard practice, but other times it's "yuck" and will illicit ire > from other developers? Is it valid to use completion.wait.lock? > Where's the line? That's a good question, but that question simply arises due to the fact that C does not provide proper privatizing or if you want to work around that it forces you to come up with really horrible constructs. The waitqueue lock usage has a very long history. It goes back to 2002 when the semaphore implementation was optimized. That exposed some of the waitqueue internals which got consequently used elsewhere as well. But looking at the actual usage sites, that's a pretty limited amount. Most of them are core infrastrucure. Of course there are drivers as well which use it for questionable reasons. In general I think that silently using implementation details just to scratch an itch or "optimizing" code is pretty much bad practice. Especially as this has a tendency to spread out in creative ways. And this happens simply because developers often copy the next thing which looks suitable and then expand on it. As this is not necessarily caught in reviews, this can spread to the point where the core infrastructure cannot be changed anymore. That's not a made up nightmare scenario. This happened in reality and caused me to mop up 50+ interrupt chip implementations in order to be able to make an urgently needed 10 line change to the core interrupt infrastructure. Guess what, the vast majority of instances fiddling with the core internals were either voodoo programming or plain bugs. There were a few legitimate issues, but they had been better solved in the core code upfront. Even after that cleanup a driver got merged which had #include "../../../../kernel/irg/internal.h" inside just because the code which was developed out of tree before this change had be made to "work". That's just not a workable solution with the ever expanding code base of the kernel. I really prefer when people come along and show the problem they want to solve - I'm completely fine with the POC hack which uses internals for that purpose - so that this can be discussed and eventually integrated into core infrastructure in one way or the other or better suitable solutions can be found. There are other aspects to this: - Using existing interfaces allows a reviewer to do the quick check on init, run time usage and tear down instead of wrapping his head around the special case - Verification tooling of all sorts just works - Improvements to the core implementation including new features are just coming for free. I hope that clarifies where I'm coming from. This has nothing to do with you personally, you just triggered a few sensible fuses while understandably defending your admittedly smart solution. Thanks, tglx
On 2020-03-16 6:17 p.m., Thomas Gleixner wrote: > That's a good question, but that question simply arises due to the fact > that C does not provide proper privatizing or if you want to work around > that it forces you to come up with really horrible constructs. Well, we do have the underscore convention. Though, I concede this code could potentially predate that. Had there been a preceding underscore, I definitely would have thought twice before touching it. > That's not a made up nightmare scenario. This happened in reality and > caused me to mop up 50+ interrupt chip implementations in order to be > able to make an urgently needed 10 line change to the core interrupt > infrastructure. Guess what, the vast majority of instances fiddling with > the core internals were either voodoo programming or plain bugs. There > were a few legitimate issues, but they had been better solved in the > core code upfront. Even after that cleanup a driver got merged which > had #include "../../../../kernel/irg/internal.h" inside just because the > code which was developed out of tree before this change had be made to > "work". I get where your coming from, and it sucks having to clean up so many instances in an urgent situation. But I see this kind of cleanup work as routine, a necessary thing that happens all the time. I've done it myself a couple times before in the kernel. The hard trick is to get developers to do more of it, before it becomes a problem like the one you faced. In my experience, what makes clean up work even harder is where developers see an interface, notice it's not a perfect fit and so open code the whole thing themselves. Then you have random buggy primitives open coded all over the place that are impossible to find. And the primitives themselves never improve or grow new interfaces because nobody knows there's a bunch of instances that require the improvement. That's a much bigger mess. > I really prefer when people come along and show the problem they want to > solve - I'm completely fine with the POC hack which uses internals for > that purpose - so that this can be discussed and eventually integrated > into core infrastructure in one way or the other or better suitable > solutions can be found. Yes, and this code was a prototype at one point and went through review from a number of people in the community, and nobody complained about this. I've also been in the situation where I submitted a POC and somebody pointed out a better way (though with a few swears thrown in for good measure); but in that case, I was actually changing a primitive so it got their attention more easily. It's impossible for the maintainer of a primitive to review all the use cases of every primitive when new code gets merged. But at least if new code uses/abuses the primitive they will eventually come to light and can be cleaned up as appropriate with a holistic view. > I hope that clarifies where I'm coming from. Yes, I understood your point. And I concede that a completion is a pretty trivial primitive to open code and the change is not really worth any further fight. If the patch gets merged (preferably with a reworked commit message), I will not complain. > This has nothing to do with you personally, you just triggered a few > sensible fuses while understandably defending your admittedly smart > solution. Thank you. Logan
diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c index a823b4b8ef8a9..e69cac84b605f 100644 --- a/drivers/pci/switch/switchtec.c +++ b/drivers/pci/switch/switchtec.c @@ -52,10 +52,11 @@ struct switchtec_user { enum mrpc_state state; - struct completion comp; + wait_queue_head_t cmd_comp; struct kref kref; struct list_head list; + bool cmd_done; u32 cmd; u32 status; u32 return_code; @@ -77,7 +78,7 @@ static struct switchtec_user *stuser_create(struct switchtec_dev *stdev) stuser->stdev = stdev; kref_init(&stuser->kref); INIT_LIST_HEAD(&stuser->list); - init_completion(&stuser->comp); + init_waitqueue_head(&stuser->cmd_comp); stuser->event_cnt = atomic_read(&stdev->event_cnt); dev_dbg(&stdev->dev, "%s: %p\n", __func__, stuser); @@ -175,7 +176,7 @@ static int mrpc_queue_cmd(struct switchtec_user *stuser) kref_get(&stuser->kref); stuser->read_len = sizeof(stuser->data); stuser_set_state(stuser, MRPC_QUEUED); - init_completion(&stuser->comp); + stuser->cmd_done = false; list_add_tail(&stuser->list, &stdev->mrpc_queue); mrpc_cmd_submit(stdev); @@ -222,7 +223,8 @@ static void mrpc_complete_cmd(struct switchtec_dev *stdev) memcpy_fromio(stuser->data, &stdev->mmio_mrpc->output_data, stuser->read_len); out: - complete_all(&stuser->comp); + stuser->cmd_done = true; + wake_up_interruptible(&stuser->cmd_comp); list_del_init(&stuser->list); stuser_put(stuser); stdev->mrpc_busy = 0; @@ -529,10 +531,11 @@ static ssize_t switchtec_dev_read(struct file *filp, char __user *data, mutex_unlock(&stdev->mrpc_mutex); if (filp->f_flags & O_NONBLOCK) { - if (!try_wait_for_completion(&stuser->comp)) + if (!stuser->cmd_done) return -EAGAIN; } else { - rc = wait_for_completion_interruptible(&stuser->comp); + rc = wait_event_interruptible(stuser->cmd_comp, + stuser->cmd_done); if (rc < 0) return rc; } @@ -580,7 +583,7 @@ static __poll_t switchtec_dev_poll(struct file *filp, poll_table *wait) struct switchtec_dev *stdev = stuser->stdev; __poll_t ret = 0; - poll_wait(filp, &stuser->comp.wait, wait); + poll_wait(filp, &stuser->cmd_comp, wait); poll_wait(filp, &stdev->event_wq, wait); if (lock_mutex_and_test_alive(stdev)) @@ -588,7 +591,7 @@ static __poll_t switchtec_dev_poll(struct file *filp, poll_table *wait) mutex_unlock(&stdev->mrpc_mutex); - if (try_wait_for_completion(&stuser->comp)) + if (stuser->cmd_done) ret |= EPOLLIN | EPOLLRDNORM; if (stuser->event_cnt != atomic_read(&stdev->event_cnt)) @@ -1272,7 +1275,8 @@ static void stdev_kill(struct switchtec_dev *stdev) /* Wake up and kill any users waiting on an MRPC request */ list_for_each_entry_safe(stuser, tmpuser, &stdev->mrpc_queue, list) { - complete_all(&stuser->comp); + stuser->cmd_done = true; + wake_up_interruptible(&stuser->cmd_comp); list_del_init(&stuser->list); stuser_put(stuser); }
The poll callback is abusing the completion wait queue and sticks it into poll_wait() to wake up pollers after a command has completed. First of all it's a layering violation as it imposes restrictions on the inner workings of completions. Just because C allows to do so does not justify that in any way. The proper way to do such things is to post patches which extend the core infrastructure and not by silently abusing it. Aside of that the implementation is seriously broken: 1) It cannot work with EPOLLEXCLUSIVE 2) It's racy: poll() write() switchtec_dev_poll() switchtec_dev_write() poll_wait(&s->comp.wait); mrpc_queue_cmd() init_completion(&s->comp) init_waitqueue_head(&s->comp.wait) Replace it with a regular wait queue which removes the completion abuse and cures #1 and #2 above. Cc: Kurt Schwemmer <kurt.schwemmer@microsemi.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: linux-pci@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> --- drivers/pci/switch/switchtec.c | 22 +++++++++++++--------- 1 file changed, 13 insertions(+), 9 deletions(-)