Message ID | 20190204180422.5095-1-guido@kiener-muenchen.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/3] udc: net2280: Fix net2280_disable | expand |
On Mon, 4 Feb 2019, Guido Kiener wrote: > From: Guido Kiener <guido.kiener@rohde-schwarz.com> > > A reset e.g. calling ep_reset_338x() can happen while endpoints > are enabled. How can this happen? That routine is called from only two places. One is in net2280_disable(), so after the endpoint has already been disabled. The other is in usb_reinit_338x() via usb_reinit() via stop_activity(), which disconnects the gadget driver and thereby disables all the endpoints. > The ep_reset_338x() sets ep->desc = NULL to mark > endpoint being invalid. A subsequent call of net2280_disable will > fail and return -EINVAL to parent function usb_ep_disable(), > which will fail, too, and do not set the member ep->enabled = false. So maybe a better fix (assuming there really is a problem) is to make usb_ep_disable() clear ep->enabled always. After all, the only reasonable way for usb_ep_disable() to fail is if the endpoint isn't enabled in the first place. Alan Stern > See: > https://elixir.bootlin.com/linux/v5.0-rc5/source/drivers/usb/gadget/udc/core.c#L139 > > This fix ignores dp->desc and allows net2280_disable() to succeed. > Subsequent calls to usb_ep_enable()/usb_ep_disable() succeeds. > > Signed-off-by: Guido Kiener <guido.kiener@rohde-schwarz.com> > --- > drivers/usb/gadget/udc/net2280.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/usb/gadget/udc/net2280.c b/drivers/usb/gadget/udc/net2280.c > index e7dae5379e04..7154f00dea40 100644 > --- a/drivers/usb/gadget/udc/net2280.c > +++ b/drivers/usb/gadget/udc/net2280.c > @@ -516,8 +516,8 @@ static int net2280_disable(struct usb_ep *_ep) > unsigned long flags; > > ep = container_of(_ep, struct net2280_ep, ep); > - if (!_ep || !ep->desc || _ep->name == ep0name) { > - pr_err("%s: Invalid ep=%p or ep->desc\n", __func__, _ep); > + if (!_ep || _ep->name == ep0name) { > + pr_err("%s: Invalid ep=%p\n", __func__, _ep); > return -EINVAL; > } > spin_lock_irqsave(&ep->dev->lock, flags); >
Zitat von Alan Stern <stern@rowland.harvard.edu>: > On Mon, 4 Feb 2019, Guido Kiener wrote: > >> From: Guido Kiener <guido.kiener@rohde-schwarz.com> >> >> A reset e.g. calling ep_reset_338x() can happen while endpoints >> are enabled. > > How can this happen? That routine is called from only two places. > One is in net2280_disable(), so after the endpoint has already been > disabled. The other is in usb_reinit_338x() via usb_reinit() via > stop_activity(), which disconnects the gadget driver and thereby > disables all the endpoints. > Yes, the routine is called when I stop my application. See callstack: Call Trace: net2280_disable: Invalid ep=00000000341c7996 or ep->desc usb_ep_disable+0x24/0xa0 [udc_core] disableEndpoints+0x1c/0x80 [rsusbtmc] tmc_func_disable+0x29/0xa0 [rsusbtmc] reset_config+0x3e/0xa0 [libcomposite] composite_disconnect+0x36/0x60 [libcomposite] net2280_pullup+0x9e/0xf0 [net2280] usb_gadget_disconnect+0x35/0xc0 [udc_core] usb_gadget_remove_driver+0x29/0xa0 [udc_core] usb_gadget_unregister_driver+0xc1/0xf0 [udc_core] usb_composite_unregister+0x12/0x20 [libcomposite] dev_release+0x3f/0x80 [rsusbtmc] __fput+0x102/0x230 ____fput+0xe/0x10 task_work_run+0x90/0xc0 exit_to_usermode_loop+0xf2/0x100 do_syscall_64+0xfb/0x120 entry_SYSCALL_64_after_hwframe+0x44/0xa9 When I restart my application then the endpoints remains disabled. A workaround is to reload our (legacy) gadget driver, libcomposite and net2280. >> The ep_reset_338x() sets ep->desc = NULL to mark >> endpoint being invalid. A subsequent call of net2280_disable will >> fail and return -EINVAL to parent function usb_ep_disable(), >> which will fail, too, and do not set the member ep->enabled = false. > > So maybe a better fix (assuming there really is a problem) is to make > usb_ep_disable() clear ep->enabled always. After all, the only > reasonable way for usb_ep_disable() to fail is if the endpoint isn't > enabled in the first place. > I had the same idea. However the root cause is the net2280 driver. We have not seen the problem with other function drivers. The usb_ep_disable() implementation looks reasonable to me. When the ret = ep->ops->disable(ep); function fails for any other fictitious reason (e.g. conflicts, order of endpoints..) then it might be better to set ep->enabled = false when the function driver really succeeds. Regards, Guido
On Tue, 5 Feb 2019 guido@kiener-muenchen.de wrote: > Zitat von Alan Stern <stern@rowland.harvard.edu>: > > > On Mon, 4 Feb 2019, Guido Kiener wrote: > > > >> From: Guido Kiener <guido.kiener@rohde-schwarz.com> > >> > >> A reset e.g. calling ep_reset_338x() can happen while endpoints > >> are enabled. > > > > How can this happen? That routine is called from only two places. > > One is in net2280_disable(), so after the endpoint has already been > > disabled. The other is in usb_reinit_338x() via usb_reinit() via > > stop_activity(), which disconnects the gadget driver and thereby > > disables all the endpoints. > > > Yes, the routine is called when I stop my application. See callstack: > Call Trace: > net2280_disable: Invalid ep=00000000341c7996 or ep->desc > usb_ep_disable+0x24/0xa0 [udc_core] > disableEndpoints+0x1c/0x80 [rsusbtmc] > tmc_func_disable+0x29/0xa0 [rsusbtmc] > reset_config+0x3e/0xa0 [libcomposite] > composite_disconnect+0x36/0x60 [libcomposite] > net2280_pullup+0x9e/0xf0 [net2280] > usb_gadget_disconnect+0x35/0xc0 [udc_core] > usb_gadget_remove_driver+0x29/0xa0 [udc_core] > usb_gadget_unregister_driver+0xc1/0xf0 [udc_core] > usb_composite_unregister+0x12/0x20 [libcomposite] > dev_release+0x3f/0x80 [rsusbtmc] > __fput+0x102/0x230 > ____fput+0xe/0x10 > task_work_run+0x90/0xc0 > exit_to_usermode_loop+0xf2/0x100 > do_syscall_64+0xfb/0x120 > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > When I restart my application then the endpoints remains disabled. > A workaround is to reload our (legacy) gadget driver, libcomposite > and net2280. > > >> The ep_reset_338x() sets ep->desc = NULL to mark > >> endpoint being invalid. A subsequent call of net2280_disable will > >> fail and return -EINVAL to parent function usb_ep_disable(), > >> which will fail, too, and do not set the member ep->enabled = false. > > > > So maybe a better fix (assuming there really is a problem) is to make > > usb_ep_disable() clear ep->enabled always. After all, the only > > reasonable way for usb_ep_disable() to fail is if the endpoint isn't > > enabled in the first place. > > > I had the same idea. However the root cause is the net2280 driver. > We have not seen the problem with other function drivers. > The usb_ep_disable() implementation looks reasonable to me. When the > ret = ep->ops->disable(ep); function fails for any other fictitious > reason (e.g. conflicts, order of endpoints..) then it might be better to > set ep->enabled = false when the function driver really succeeds. All right, I withdraw the objection. Acked-by: Alan Stern <stern@rowland.harvard.edu> Alan Stern
Hi Alan, My last proposal "udc: net2280: Fix overrun of OUT messages" is still under investigation. During the random stress tests I found a new rare problem: When a request must be dequeued with net2280_dequeue() e.g. due to a device clear action and the same request is finished by the function scan_dma_completions(): https://elixir.bootlin.com/linux/v5.0-rc7/source/drivers/usb/gadget/udc/net2280.c#L1269 then the following search loop does not find the request and the function returns the error -EINVAL without restoring the state ep->stopped = stopped! https://elixir.bootlin.com/linux/v5.0-rc7/source/drivers/usb/gadget/udc/net2280.c#L1269 Thus the endpoint keeps blocked and does not receive any data. When we insert ep->stopped = stopped here: if (&req->req != _req) { ep->stopped = stopped; // <<<<<---- spin_unlock_irqrestore(&ep->dev->lock, flags); dev_err(&ep->dev->pdev->dev, "%s: Request mismatch\n", __func__); return -EINVAL; } then the driver continues as expected, although the driver issues the false error message "Request mismatch". What do you think? It's labor-intensive to fix the error message here, or shall we leave it as it is, and only insert the "ep->stopped = stopped;" Regards, Guido
On Tue, 19 Feb 2019 guido@kiener-muenchen.de wrote: > Hi Alan, > > My last proposal "udc: net2280: Fix overrun of OUT messages" is still > under investigation. > > During the random stress tests I found a new rare problem: > > When a request must be dequeued with net2280_dequeue() e.g. due > to a device clear action and the same request is finished by the > function scan_dma_completions(): > > https://elixir.bootlin.com/linux/v5.0-rc7/source/drivers/usb/gadget/udc/net2280.c#L1269 > > then the following search loop does not find the request and the function > returns the error -EINVAL without restoring the state ep->stopped = stopped! > > https://elixir.bootlin.com/linux/v5.0-rc7/source/drivers/usb/gadget/udc/net2280.c#L1269 > > Thus the endpoint keeps blocked and does not receive any data. > When we insert ep->stopped = stopped here: > > if (&req->req != _req) { > ep->stopped = stopped; // <<<<<---- > spin_unlock_irqrestore(&ep->dev->lock, flags); > dev_err(&ep->dev->pdev->dev, "%s: Request mismatch\n", > __func__); > return -EINVAL; > } > > then the driver continues as expected, although the driver issues the > false error message "Request mismatch". > > What do you think? It's labor-intensive to fix the error message here, > or shall we leave it as it is, and only insert the "ep->stopped = stopped;" Yes, this is a bug. I think the dev_err() should be changed to ep_dbg(). It's not a real error, after all -- it's just a race. Do that and insert the "ep->stopped = stopped;" line. Also, consider changing the -EINVAL return code here to -EIDRM. This is analogous to what usb_unlink_urb() does on the host side. Alan Stern
Zitat von Alan Stern <stern@rowland.harvard.edu>: > On Tue, 19 Feb 2019 guido@kiener-muenchen.de wrote: > >> Hi Alan, >> >> My last proposal "udc: net2280: Fix overrun of OUT messages" is still >> under investigation. >> >> During the random stress tests I found a new rare problem: >> >> When a request must be dequeued with net2280_dequeue() e.g. due >> to a device clear action and the same request is finished by the >> function scan_dma_completions(): >> >> https://elixir.bootlin.com/linux/v5.0-rc7/source/drivers/usb/gadget/udc/net2280.c#L1269 >> >> then the following search loop does not find the request and the function >> returns the error -EINVAL without restoring the state ep->stopped = stopped! >> >> https://elixir.bootlin.com/linux/v5.0-rc7/source/drivers/usb/gadget/udc/net2280.c#L1269 >> >> Thus the endpoint keeps blocked and does not receive any data. >> When we insert ep->stopped = stopped here: >> >> if (&req->req != _req) { >> ep->stopped = stopped; // <<<<<---- >> spin_unlock_irqrestore(&ep->dev->lock, flags); >> dev_err(&ep->dev->pdev->dev, "%s: Request mismatch\n", >> __func__); >> return -EINVAL; >> } >> >> then the driver continues as expected, although the driver issues the >> false error message "Request mismatch". >> >> What do you think? It's labor-intensive to fix the error message here, >> or shall we leave it as it is, and only insert the "ep->stopped = stopped;" > > Yes, this is a bug. I think the dev_err() should be changed to > ep_dbg(). It's not a real error, after all -- it's just a race. I agree. ep_dbg() is better here. I see the same bug in the sibling driver net2272_dequeue(): https://elixir.bootlin.com/linux/v5.0-rc8/source/drivers/usb/gadget/udc/net2272.c#L948 The bug is quite obvious to me, however I'm not aware that we use this driver in any of our instruments. I cannot test it. What do you recommend? - Shall I fix it within the same commit. - Shall I fix it with another commit. - Don't care about it > Do that and insert the "ep->stopped = stopped;" line. Also, consider > changing the -EINVAL return code here to -EIDRM. This is analogous to > what usb_unlink_urb() does on the host side. EIDRM sounds better. However I see that all udc drivers calling the <udc>_dequeue() function returns -EINVAL when a request cannot be dequeued. I do not dare to break this tradition. BTW we tested my last proposal: https://patchwork.kernel.org/patch/10796161/ ... I see that we can simplify this patch when we just insert the stop_out_nacking(ep) between line 909 and 910: https://elixir.bootlin.com/linux/v5.0-rc5/source/drivers/usb/gadget/udc/net2280.c#L909 and remove the stop_out_naking() call in start_queue()... I will add this improvement to the new patch sequence. Guido
On Wed, 27 Feb 2019 guido@kiener-muenchen.de wrote: > Zitat von Alan Stern <stern@rowland.harvard.edu>: > > > On Tue, 19 Feb 2019 guido@kiener-muenchen.de wrote: > > > >> Hi Alan, > >> > >> My last proposal "udc: net2280: Fix overrun of OUT messages" is still > >> under investigation. > >> > >> During the random stress tests I found a new rare problem: > >> > >> When a request must be dequeued with net2280_dequeue() e.g. due > >> to a device clear action and the same request is finished by the > >> function scan_dma_completions(): > >> > >> https://elixir.bootlin.com/linux/v5.0-rc7/source/drivers/usb/gadget/udc/net2280.c#L1269 > >> > >> then the following search loop does not find the request and the function > >> returns the error -EINVAL without restoring the state ep->stopped = stopped! > >> > >> https://elixir.bootlin.com/linux/v5.0-rc7/source/drivers/usb/gadget/udc/net2280.c#L1269 > >> > >> Thus the endpoint keeps blocked and does not receive any data. > >> When we insert ep->stopped = stopped here: > >> > >> if (&req->req != _req) { > >> ep->stopped = stopped; // <<<<<---- > >> spin_unlock_irqrestore(&ep->dev->lock, flags); > >> dev_err(&ep->dev->pdev->dev, "%s: Request mismatch\n", > >> __func__); > >> return -EINVAL; > >> } > >> > >> then the driver continues as expected, although the driver issues the > >> false error message "Request mismatch". > >> > >> What do you think? It's labor-intensive to fix the error message here, > >> or shall we leave it as it is, and only insert the "ep->stopped = stopped;" > > > > Yes, this is a bug. I think the dev_err() should be changed to > > ep_dbg(). It's not a real error, after all -- it's just a race. > > I agree. ep_dbg() is better here. I see the same bug in the sibling driver > net2272_dequeue(): > https://elixir.bootlin.com/linux/v5.0-rc8/source/drivers/usb/gadget/udc/net2272.c#L948 > > The bug is quite obvious to me, however I'm not aware that we use this driver > in any of our instruments. I cannot test it. What do you recommend? > - Shall I fix it within the same commit. > - Shall I fix it with another commit. > - Don't care about it You can fix it in a separate commit, and in the commit log say that it was only compile-tested because you don't have the right hardware for a full test. > > Do that and insert the "ep->stopped = stopped;" line. Also, consider > > changing the -EINVAL return code here to -EIDRM. This is analogous to > > what usb_unlink_urb() does on the host side. > > EIDRM sounds better. However I see that all udc drivers calling the > <udc>_dequeue() function returns -EINVAL when a request cannot be dequeued. > I do not dare to break this tradition. I guess it's too late to change this now. A pity... Alan Stern > BTW we tested my last proposal: > https://patchwork.kernel.org/patch/10796161/ > > ... I see that we can simplify this patch when we just insert the > stop_out_nacking(ep) between line 909 and 910: > > https://elixir.bootlin.com/linux/v5.0-rc5/source/drivers/usb/gadget/udc/net2280.c#L909 > > and remove the stop_out_naking() call in start_queue()... > > I will add this improvement to the new patch sequence. > > Guido
diff --git a/drivers/usb/gadget/udc/net2280.c b/drivers/usb/gadget/udc/net2280.c index e7dae5379e04..7154f00dea40 100644 --- a/drivers/usb/gadget/udc/net2280.c +++ b/drivers/usb/gadget/udc/net2280.c @@ -516,8 +516,8 @@ static int net2280_disable(struct usb_ep *_ep) unsigned long flags; ep = container_of(_ep, struct net2280_ep, ep); - if (!_ep || !ep->desc || _ep->name == ep0name) { - pr_err("%s: Invalid ep=%p or ep->desc\n", __func__, _ep); + if (!_ep || _ep->name == ep0name) { + pr_err("%s: Invalid ep=%p\n", __func__, _ep); return -EINVAL; } spin_lock_irqsave(&ep->dev->lock, flags);