Message ID | 2127594073.298820.1493143869792@email.1und1.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
> Stefan Wahren <stefan.wahren@i2se.com> hat am 25. April 2017 um 20:11 geschrieben: > > > > Stefan Wahren <stefan.wahren@i2se.com> hat am 22. April 2017 um 22:50 geschrieben: > > > > > > Hi, > > > > > Eric Anholt <eric@anholt.net> hat am 20. April 2017 um 20:54 geschrieben: > > > > > > > > > Stefan Wahren <stefan.wahren@i2se.com> writes: > > > > > > > Hi, > > > > > > > >> Doug Anderson <dianders@chromium.org> hat am 18. April 2017 um 22:41 geschrieben: > > > >> > > > >> > > > >> It's hard to know for sure that all of this time is really in > > > >> urb_enqueue(). Possible we could have task switched out and been > > > >> blocked elsewhere. Using ftrace to get more fine-grained timings > > > >> would be useful. ktime_get(), ktime_sub(), and ktime_to_us() are your > > > >> friends here if you want to use trace_printk. > > > > > > > > i'm a newbie to ftrace, so i hope this would be helpful. > > > > > > > > # connect PL2303 to the onboard hub > > > > # echo 0 > options/sleep-time > > > > # echo 0 > function_profile_enabled > > > > # echo 1 > function_profile_enabled > > > > # ./usb_test > > > > # Waiting for at least 20 seconds and then disconnect PL2303 > > > > # echo 0 > function_profile_enabled > > > > # cat trace_stat/function0 > > > > > > > > Function Hit Time Avg s^2 > > > > -------- --- ---- --- --- > > > > bcm2835_handle_irq 361347 219567633 us 607.636 us 1485199 us > > > > __handle_domain_irq 1082482 212639551 us 196.437 us 3642030 us > > > > generic_handle_irq 1082482 100592051 us 92.927 us 50511334 us > > > > irq_exit 1082482 98197771 us 90.715 us 29649040 us > > > > handle_level_irq 1082482 95812379 us 88.511 us 51910093 us > > > > > > If I'm reading this output right, we're spending half of our interrupt > > > processing time in irq_exit(), so even if dwc2's interrupt was free (the > > > generic_handle_irq() chain), we'd be eating about half the CPU getting > > > back out of the interrupt handler, right? > > > > > > I don't really know anything about DWC2 or USB, but is there any way we > > > could mitigate the interrupt frequency with this hardware? If nothing > > > else, could we loop reading gintsts until it reads back 0? > > > > first of all i updated my kernel to 4.11rc7 and the issue still occures. Today i had some time to investigate this issue further and i made some interesting observations: > > > > 1. The lockup doesn't occure always after starting usb_test. In rare cases i was able to run the program without lockup. > > 2. In case the lockup occured we are always able to "rescue" the BCM2835 from this state by sending some serial data to the PL2303. > > Based on this scenario i patched the interrupt routine to detect the interrupt storm and normal condition. So i can dump the global and host registers in both situations (bad and goodcase). > > Here is the diff between both register dumps, maybe someone see something interesting: > > --- badcase.txt 2017-04-25 18:02:59.000000000 +0000 > +++ goodcase.txt 2017-04-25 18:02:59.000000000 +0000 > @@ -4,12 +4,12 @@ > dwc2_dump_global_registers: GAHBCFG @0xCC850008 : 0x00000031 > dwc2_dump_global_registers: GUSBCFG @0xCC85000C : 0x20001700 > dwc2_dump_global_registers: GRSTCTL @0xCC850010 : 0x80000000 > - dwc2_dump_global_registers: GINTSTS @0xCC850014 : 0x06000021 > + dwc2_dump_global_registers: GINTSTS @0xCC850014 : 0x06000029 > dwc2_dump_global_registers: GINTMSK @0xCC850018 : 0xF300080E > - dwc2_dump_global_registers: GRXSTSR @0xCC85001C : 0x00070044 > + dwc2_dump_global_registers: GRXSTSR @0xCC85001C : 0x000F0001 > dwc2_dump_global_registers: GRXFSIZ @0xCC850024 : 0x00000306 > dwc2_dump_global_registers: GNPTXFSIZ @0xCC850028 : 0x01000306 > - dwc2_dump_global_registers: GNPTXSTS @0xCC85002C : 0x1D080100 > + dwc2_dump_global_registers: GNPTXSTS @0xCC85002C : 0x18080100 > ... @Doug, @John Y., @Johan: Any further ideas to narrow down the reason for this interrupt storm? Regards Stefan
On Mon, May 08, 2017 at 10:22:08PM +0200, Stefan Wahren wrote: > > > Stefan Wahren <stefan.wahren@i2se.com> hat am 25. April 2017 um 20:11 geschrieben: > > > > > > > Stefan Wahren <stefan.wahren@i2se.com> hat am 22. April 2017 um 22:50 geschrieben: > > > > > > > > > Hi, > > > > > > > Eric Anholt <eric@anholt.net> hat am 20. April 2017 um 20:54 geschrieben: > > > > > > > > > > > > Stefan Wahren <stefan.wahren@i2se.com> writes: > > > > > > > > > Hi, > > > > > > > > > >> Doug Anderson <dianders@chromium.org> hat am 18. April 2017 um 22:41 geschrieben: > > > > >> > > > > >> > > > > >> It's hard to know for sure that all of this time is really in > > > > >> urb_enqueue(). Possible we could have task switched out and been > > > > >> blocked elsewhere. Using ftrace to get more fine-grained timings > > > > >> would be useful. ktime_get(), ktime_sub(), and ktime_to_us() are your > > > > >> friends here if you want to use trace_printk. > > > > > > > > > > i'm a newbie to ftrace, so i hope this would be helpful. > > > > > > > > > > # connect PL2303 to the onboard hub > > > > > # echo 0 > options/sleep-time > > > > > # echo 0 > function_profile_enabled > > > > > # echo 1 > function_profile_enabled > > > > > # ./usb_test > > > > > # Waiting for at least 20 seconds and then disconnect PL2303 > > > > > # echo 0 > function_profile_enabled > > > > > # cat trace_stat/function0 > > > > > > > > > > Function Hit Time Avg s^2 > > > > > -------- --- ---- --- --- > > > > > bcm2835_handle_irq 361347 219567633 us 607.636 us 1485199 us > > > > > __handle_domain_irq 1082482 212639551 us 196.437 us 3642030 us > > > > > generic_handle_irq 1082482 100592051 us 92.927 us 50511334 us > > > > > irq_exit 1082482 98197771 us 90.715 us 29649040 us > > > > > handle_level_irq 1082482 95812379 us 88.511 us 51910093 us > > > > > > > > If I'm reading this output right, we're spending half of our interrupt > > > > processing time in irq_exit(), so even if dwc2's interrupt was free (the > > > > generic_handle_irq() chain), we'd be eating about half the CPU getting > > > > back out of the interrupt handler, right? > > > > > > > > I don't really know anything about DWC2 or USB, but is there any way we > > > > could mitigate the interrupt frequency with this hardware? If nothing > > > > else, could we loop reading gintsts until it reads back 0? > > > > > > first of all i updated my kernel to 4.11rc7 and the issue still > > > occures. Today i had some time to investigate this issue further > > > and i made some interesting observations: > > > > > > 1. The lockup doesn't occure always after starting usb_test. In > > > rare cases i was able to run the program without lockup. > > > 2. In case the lockup occured we are always able to "rescue" the > > > BCM2835 from this state by sending some serial data to the PL2303. > > > > Based on this scenario i patched the interrupt routine to detect the > > interrupt storm and normal condition. So i can dump the global and > > host registers in both situations (bad and goodcase). > > > > Here is the diff between both register dumps, maybe someone see > > something interesting: > > > > --- badcase.txt 2017-04-25 18:02:59.000000000 +0000 > > +++ goodcase.txt 2017-04-25 18:02:59.000000000 +0000 > > @@ -4,12 +4,12 @@ > > dwc2_dump_global_registers: GAHBCFG @0xCC850008 : 0x00000031 > > dwc2_dump_global_registers: GUSBCFG @0xCC85000C : 0x20001700 > > dwc2_dump_global_registers: GRSTCTL @0xCC850010 : 0x80000000 > > - dwc2_dump_global_registers: GINTSTS @0xCC850014 : 0x06000021 > > + dwc2_dump_global_registers: GINTSTS @0xCC850014 : 0x06000029 > > dwc2_dump_global_registers: GINTMSK @0xCC850018 : 0xF300080E > > - dwc2_dump_global_registers: GRXSTSR @0xCC85001C : 0x00070044 > > + dwc2_dump_global_registers: GRXSTSR @0xCC85001C : 0x000F0001 > > dwc2_dump_global_registers: GRXFSIZ @0xCC850024 : 0x00000306 > > dwc2_dump_global_registers: GNPTXFSIZ @0xCC850028 : 0x01000306 > > - dwc2_dump_global_registers: GNPTXSTS @0xCC85002C : 0x1D080100 > > + dwc2_dump_global_registers: GNPTXSTS @0xCC85002C : 0x18080100 > > ... > > @Doug, @John Y., @Johan: Any further ideas to narrow down the reason > for this interrupt storm? No, sorry. I suggested earlier that you could try commenting out the interrupt-URB submission in a naive attempt to narrow it down since that is an obvious difference from the ftdi driver, but then again so are the frequent ftdi status bulk messages (and any other timing differences). Seems like you need to dig into dwc2. Johan
Hi, On Wed, May 10, 2017 at 9:31 AM, Johan Hovold <johan@kernel.org> wrote: > On Mon, May 08, 2017 at 10:22:08PM +0200, Stefan Wahren wrote: >> >> > Stefan Wahren <stefan.wahren@i2se.com> hat am 25. April 2017 um 20:11 geschrieben: >> > >> > >> > > Stefan Wahren <stefan.wahren@i2se.com> hat am 22. April 2017 um 22:50 geschrieben: >> > > >> > > >> > > Hi, >> > > >> > > > Eric Anholt <eric@anholt.net> hat am 20. April 2017 um 20:54 geschrieben: >> > > > >> > > > >> > > > Stefan Wahren <stefan.wahren@i2se.com> writes: >> > > > >> > > > > Hi, >> > > > > >> > > > >> Doug Anderson <dianders@chromium.org> hat am 18. April 2017 um 22:41 geschrieben: >> > > > >> >> > > > >> >> > > > >> It's hard to know for sure that all of this time is really in >> > > > >> urb_enqueue(). Possible we could have task switched out and been >> > > > >> blocked elsewhere. Using ftrace to get more fine-grained timings >> > > > >> would be useful. ktime_get(), ktime_sub(), and ktime_to_us() are your >> > > > >> friends here if you want to use trace_printk. >> > > > > >> > > > > i'm a newbie to ftrace, so i hope this would be helpful. >> > > > > >> > > > > # connect PL2303 to the onboard hub >> > > > > # echo 0 > options/sleep-time >> > > > > # echo 0 > function_profile_enabled >> > > > > # echo 1 > function_profile_enabled >> > > > > # ./usb_test >> > > > > # Waiting for at least 20 seconds and then disconnect PL2303 >> > > > > # echo 0 > function_profile_enabled >> > > > > # cat trace_stat/function0 >> > > > > >> > > > > Function Hit Time Avg s^2 >> > > > > -------- --- ---- --- --- >> > > > > bcm2835_handle_irq 361347 219567633 us 607.636 us 1485199 us >> > > > > __handle_domain_irq 1082482 212639551 us 196.437 us 3642030 us >> > > > > generic_handle_irq 1082482 100592051 us 92.927 us 50511334 us >> > > > > irq_exit 1082482 98197771 us 90.715 us 29649040 us >> > > > > handle_level_irq 1082482 95812379 us 88.511 us 51910093 us >> > > > >> > > > If I'm reading this output right, we're spending half of our interrupt >> > > > processing time in irq_exit(), so even if dwc2's interrupt was free (the >> > > > generic_handle_irq() chain), we'd be eating about half the CPU getting >> > > > back out of the interrupt handler, right? >> > > > >> > > > I don't really know anything about DWC2 or USB, but is there any way we >> > > > could mitigate the interrupt frequency with this hardware? If nothing >> > > > else, could we loop reading gintsts until it reads back 0? >> > > >> > > first of all i updated my kernel to 4.11rc7 and the issue still >> > > occures. Today i had some time to investigate this issue further >> > > and i made some interesting observations: >> > > >> > > 1. The lockup doesn't occure always after starting usb_test. In >> > > rare cases i was able to run the program without lockup. >> > > 2. In case the lockup occured we are always able to "rescue" the >> > > BCM2835 from this state by sending some serial data to the PL2303. >> > >> > Based on this scenario i patched the interrupt routine to detect the >> > interrupt storm and normal condition. So i can dump the global and >> > host registers in both situations (bad and goodcase). >> > >> > Here is the diff between both register dumps, maybe someone see >> > something interesting: >> > >> > --- badcase.txt 2017-04-25 18:02:59.000000000 +0000 >> > +++ goodcase.txt 2017-04-25 18:02:59.000000000 +0000 >> > @@ -4,12 +4,12 @@ >> > dwc2_dump_global_registers: GAHBCFG @0xCC850008 : 0x00000031 >> > dwc2_dump_global_registers: GUSBCFG @0xCC85000C : 0x20001700 >> > dwc2_dump_global_registers: GRSTCTL @0xCC850010 : 0x80000000 >> > - dwc2_dump_global_registers: GINTSTS @0xCC850014 : 0x06000021 >> > + dwc2_dump_global_registers: GINTSTS @0xCC850014 : 0x06000029 >> > dwc2_dump_global_registers: GINTMSK @0xCC850018 : 0xF300080E >> > - dwc2_dump_global_registers: GRXSTSR @0xCC85001C : 0x00070044 >> > + dwc2_dump_global_registers: GRXSTSR @0xCC85001C : 0x000F0001 >> > dwc2_dump_global_registers: GRXFSIZ @0xCC850024 : 0x00000306 >> > dwc2_dump_global_registers: GNPTXFSIZ @0xCC850028 : 0x01000306 >> > - dwc2_dump_global_registers: GNPTXSTS @0xCC85002C : 0x1D080100 >> > + dwc2_dump_global_registers: GNPTXSTS @0xCC85002C : 0x18080100 >> > ... >> >> @Doug, @John Y., @Johan: Any further ideas to narrow down the reason >> for this interrupt storm? > > No, sorry. > > I suggested earlier that you could try commenting out the interrupt-URB > submission in a naive attempt to narrow it down since that is an obvious > difference from the ftdi driver, but then again so are the frequent ftdi > status bulk messages (and any other timing differences). > > Seems like you need to dig into dwc2. I haven't had a chance to dig and not sure when I'll have that time. I'll keep this around in my inbox for now on the off chance that I clean it out enough to get back to this. Sorry!
> Doug Anderson <dianders@chromium.org> hat am 11. Mai 2017 um 01:50 geschrieben: > > > Hi, > > On Wed, May 10, 2017 at 9:31 AM, Johan Hovold <johan@kernel.org> wrote: > > On Mon, May 08, 2017 at 10:22:08PM +0200, Stefan Wahren wrote: > >> @Doug, @John Y., @Johan: Any further ideas to narrow down the reason > >> for this interrupt storm? > > > > No, sorry. > > > > I suggested earlier that you could try commenting out the interrupt-URB > > submission in a naive attempt to narrow it down since that is an obvious > > difference from the ftdi driver, but then again so are the frequent ftdi > > status bulk messages (and any other timing differences). > > > > Seems like you need to dig into dwc2. > > I haven't had a chance to dig and not sure when I'll have that time. > I'll keep this around in my inbox for now on the off chance that I > clean it out enough to get back to this. Sorry! > But thanks for all your feedback guys. Btw i think i took a step forward. During the whole time i a took a Raspberry Pi B to reproduce this issue, which has a builtin USB hub and a SMSC USB to ethernet chip. /: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=dwc2/1p, 480M |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/3p, 480M |__ Port 1: Dev 3, If 0, Class=Vendor Specific Class, Driver=smsc95xx, 480M |__ Port 2: Dev 4, If 0, Class=Vendor Specific Class, Driver=pl2303, 12M I decide to switch to the Raspberry Pi Zero which neither have a builtin USB hub or a USB ethernet chip. If i connect the PL2303 directly to the RPi Zero the issue is not reproducible and usb_test behave as expected by opening the port and terminate after a few seconds (good case). /: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=dwc2/1p, 480M |__ Port 1: Dev 5, If 0, Class=Vendor Specific Class, Driver=pl2303, 12M After that i placed a USB 4 port hub between RPi Zero and PL2303. Now the issue (interrupt storm) occurs on the RPi Zero (bad case). After nearly 20 seconds i disconnected the PL2303 in order to make the system usable and stop the trace. /: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=dwc2/1p, 480M |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/4p, 480M |__ Port 3: Dev 4, If 0, Class=Vendor Specific Class, Driver=pl2303, 12M Additional i made usbmon traces of both cases. Maybe that's useful for somebody with more USB knowledge than me. Stefan good case: d9adfd00 1324940464 S Co:1:005:0 s 40 01 0008 0000 0000 0 d9adfd00 1324941739 C Co:1:005:0 0 0 d9adfd00 1324941792 S Co:1:005:0 s 40 01 0009 0000 0000 0 d9adfd00 1324941879 C Co:1:005:0 0 0 d9adfd00 1324941920 S Ci:1:005:0 s a1 21 0000 0000 0007 7 < d9adfd00 1324942046 C Ci:1:005:0 0 7 = 80250000 000000 d9adfd00 1324942093 S Co:1:005:0 s 21 20 0000 0000 0007 7 = 80250000 000008 d9adfd00 1324942211 C Co:1:005:0 0 7 > d9adfd00 1324942248 S Co:1:005:0 s 40 01 0000 0000 0000 0 d9adfd00 1324942335 C Co:1:005:0 0 0 da6ef600 1324942367 S Ii:1:005:1 -115:1 10 < da6efd00 1324942412 S Bi:1:005:3 -115 256 < da6ef800 1324942431 S Bi:1:005:3 -115 256 < d9adfd00 1324942447 S Co:1:005:0 s 21 22 0003 0000 0000 0 d9adfd00 1324943310 C Co:1:005:0 0 0 da6ef600 1324944174 C Ii:1:005:1 0:1 10 = a1200000 00000200 8b06 da6ef600 1324944192 S Ii:1:005:1 -115:1 10 < da6ef600 1324946152 C Ii:1:005:1 0:1 10 = a1200000 00000200 8b00 da6ef600 1324946161 S Ii:1:005:1 -115:1 10 < da6ef600 1325451149 C Ii:1:005:1 -5:1 0 da6ef600 1325451177 S Ii:1:005:1 -115:1 10 < d9adf400 1330210089 S Co:1:005:0 s 21 22 0000 0000 0000 0 d9adf400 1330211717 C Co:1:005:0 0 0 da6efd00 1330211855 C Bi:1:005:3 -2 0 da6ef800 1330211921 C Bi:1:005:3 -2 0 da6ef600 1330211969 C Ii:1:005:1 -2:1 0 d9adf400 1330211997 S Co:1:005:0 s 21 23 0000 0000 0000 0 d9adf400 1330212158 C Co:1:005:0 0 0 bad case: da6efb00 316377734 S Co:1:004:0 s 40 01 0008 0000 0000 0 da6efb00 316378587 C Co:1:004:0 0 0 da6efb00 316378837 S Co:1:004:0 s 40 01 0009 0000 0000 0 da6efb00 316379201 C Co:1:004:0 0 0 da6efb00 316379397 S Ci:1:004:0 s a1 21 0000 0000 0007 7 < da6efb00 316379697 C Ci:1:004:0 0 7 = 80250000 000000 da6efb00 316379890 S Co:1:004:0 s 21 20 0000 0000 0007 7 = 80250000 000008 da6efb00 316380350 C Co:1:004:0 0 7 > da6efb00 316380580 S Co:1:004:0 s 40 01 0000 0000 0000 0 da6efb00 316380932 C Co:1:004:0 0 0 d9a14b00 316381132 S Ii:1:004:1 -115:1 10 < da780600 316381325 S Bi:1:004:3 -115 256 < da780100 316381488 S Bi:1:004:3 -115 256 < d9a14b00 316389933 C Ii:1:004:1 0:1 10 = a1200000 00000200 8b06 d9a14b00 316390820 S Ii:1:004:1 -115:1 10 < d9a14b00 316398308 C Ii:1:004:1 0:1 10 = a1200000 00000200 8b00 d9a14b00 316399936 S Ii:1:004:1 -115:1 10 < da6efb00 316419808 S Co:1:004:0 s 21 22 0003 0000 0000 0 da6efb00 316438433 C Co:1:004:0 0 0 da780600 339316684 C Bi:1:004:3 -32 0 da780100 339316772 C Bi:1:004:3 -32 0 d9a14b00 339317035 C Ii:1:004:1 -32:1 0 d9a14b00 339317049 S Ii:1:004:1 -115:1 10 < d9a14b00 339318040 C Ii:1:004:1 -32:1 0 d9a14b00 339318057 S Ii:1:004:1 -115:1 10 < d9a14b00 339319042 C Ii:1:004:1 -32:1 0 d9a14b00 339319056 S Ii:1:004:1 -115:1 10 < d9a14b00 339329551 C Ii:1:004:1 -32:1 0 d9a14b00 339329571 S Ii:1:004:1 -115:1 10 < d9a14b00 339330586 C Ii:1:004:1 -32:1 0 d9a14b00 339330601 S Ii:1:004:1 -115:1 10 < d9a14b00 339331035 C Ii:1:004:1 -32:1 0 d9a14b00 339331046 S Ii:1:004:1 -115:1 10 < d9a14b00 339332037 C Ii:1:004:1 -32:1 0 d9a14b00 339332049 S Ii:1:004:1 -115:1 10 < d9a14b00 339333054 C Ii:1:004:1 -32:1 0 d9a14b00 339333068 S Ii:1:004:1 -115:1 10 < d9a14b00 339334037 C Ii:1:004:1 -32:1 0 d9a14b00 339334050 S Ii:1:004:1 -115:1 10 < d9a14b00 339335036 C Ii:1:004:1 -32:1 0 d9a14b00 339335050 S Ii:1:004:1 -115:1 10 < d9a14b00 339336036 C Ii:1:004:1 -32:1 0 d9a14b00 339336049 S Ii:1:004:1 -115:1 10 < d9a14b00 339337037 C Ii:1:004:1 -32:1 0 d9a14b00 339337050 S Ii:1:004:1 -115:1 10 < d9a14b00 339338039 C Ii:1:004:1 -32:1 0 d9a14b00 339338074 S Ii:1:004:1 -115:1 10 < d9a14b00 339339038 C Ii:1:004:1 -32:1 0 d9a14b00 339339052 S Ii:1:004:1 -115:1 10 < d9a14b00 339340130 C Ii:1:004:1 -32:1 0 d9a14b00 339340143 S Ii:1:004:1 -115:1 10 < d9a14b00 339341035 C Ii:1:004:1 -32:1 0 d9a14b00 339341048 S Ii:1:004:1 -115:1 10 < d9a14b00 339342039 C Ii:1:004:1 -32:1 0 d9a14b00 339342051 S Ii:1:004:1 -115:1 10 < d9a14b00 339343038 C Ii:1:004:1 -32:1 0 d9a14b00 339343051 S Ii:1:004:1 -115:1 10 < d9a14b00 339344038 C Ii:1:004:1 -32:1 0 d9a14b00 339344050 S Ii:1:004:1 -115:1 10 < d9a14b00 339345037 C Ii:1:004:1 -32:1 0 d9a14b00 339345050 S Ii:1:004:1 -115:1 10 < d9a14b00 339346038 C Ii:1:004:1 -32:1 0 d9a14b00 339346049 S Ii:1:004:1 -115:1 10 < d9a14b00 339347030 C Ii:1:004:1 -32:1 0 d9a14b00 339347041 S Ii:1:004:1 -115:1 10 < d9a14b00 339348039 C Ii:1:004:1 -32:1 0 d9a14b00 339348052 S Ii:1:004:1 -115:1 10 < d9a14b00 339349039 C Ii:1:004:1 -32:1 0 d9a14b00 339349053 S Ii:1:004:1 -115:1 10 < d9a14b00 339350105 C Ii:1:004:1 -32:1 0 d9a14b00 339350119 S Ii:1:004:1 -115:1 10 < d9a14b00 339351032 C Ii:1:004:1 -32:1 0 d9a14b00 339351044 S Ii:1:004:1 -115:1 10 < d9a14b00 339352030 C Ii:1:004:1 -32:1 0 d9a14b00 339352042 S Ii:1:004:1 -115:1 10 < d9a14b00 339353037 C Ii:1:004:1 -32:1 0 d9a14b00 339353051 S Ii:1:004:1 -115:1 10 < d9a14b00 339354068 C Ii:1:004:1 -32:1 0 d9a14b00 339354083 S Ii:1:004:1 -115:1 10 < d9a14b00 339355018 C Ii:1:004:1 -32:1 0 d9a14b00 339355026 S Ii:1:004:1 -115:1 10 < d9a14b00 339356017 C Ii:1:004:1 -32:1 0 d9a14b00 339356025 S Ii:1:004:1 -115:1 10 < d9a14b00 339357018 C Ii:1:004:1 -32:1 0 d9a14b00 339357026 S Ii:1:004:1 -115:1 10 < d9a14b00 339358017 C Ii:1:004:1 -32:1 0 d9a14b00 339358025 S Ii:1:004:1 -115:1 10 < d9a14b00 339359018 C Ii:1:004:1 -32:1 0 d9a14b00 339359026 S Ii:1:004:1 -115:1 10 < d9a14b00 339360016 C Ii:1:004:1 -32:1 0 d9a14b00 339360024 S Ii:1:004:1 -115:1 10 < d9a14b00 339361018 C Ii:1:004:1 -32:1 0 d9a14b00 339361026 S Ii:1:004:1 -115:1 10 < d9a14b00 339362016 C Ii:1:004:1 -32:1 0 d9a14b00 339362024 S Ii:1:004:1 -115:1 10 < d9a14b00 339363017 C Ii:1:004:1 -32:1 0 d9a14b00 339363049 S Ii:1:004:1 -115:1 10 < d9a14b00 339364017 C Ii:1:004:1 -32:1 0 d9a14b00 339364025 S Ii:1:004:1 -115:1 10 < d9a14b00 339365017 C Ii:1:004:1 -32:1 0 d9a14b00 339365025 S Ii:1:004:1 -115:1 10 < d9a14b00 339366016 C Ii:1:004:1 -32:1 0 d9a14b00 339366024 S Ii:1:004:1 -115:1 10 < d9a14b00 339367017 C Ii:1:004:1 -32:1 0 d9a14b00 339367025 S Ii:1:004:1 -115:1 10 < d9a14b00 339368017 C Ii:1:004:1 -32:1 0 d9a14b00 339368024 S Ii:1:004:1 -115:1 10 < d9a14b00 339369017 C Ii:1:004:1 -32:1 0 d9a14b00 339369025 S Ii:1:004:1 -115:1 10 < d9a14b00 339370017 C Ii:1:004:1 -32:1 0 d9a14b00 339370024 S Ii:1:004:1 -115:1 10 < d9a14b00 339371017 C Ii:1:004:1 -32:1 0 d9a14b00 339371026 S Ii:1:004:1 -115:1 10 < d9a14b00 339372017 C Ii:1:004:1 -32:1 0 d9a14b00 339372024 S Ii:1:004:1 -115:1 10 < d9a14b00 339373017 C Ii:1:004:1 -32:1 0 d9a14b00 339373025 S Ii:1:004:1 -115:1 10 < d9a14b00 339374018 C Ii:1:004:1 -32:1 0 d9a14b00 339374025 S Ii:1:004:1 -115:1 10 < d9a14b00 339375017 C Ii:1:004:1 -32:1 0 d9a14b00 339375025 S Ii:1:004:1 -115:1 10 < d9a14b00 339376018 C Ii:1:004:1 -32:1 0 d9a14b00 339376026 S Ii:1:004:1 -115:1 10 < d9a14b00 339377017 C Ii:1:004:1 -32:1 0 d9a14b00 339377024 S Ii:1:004:1 -115:1 10 < d9930180 339605934 S Ci:1:002:0 s a3 00 0000 0003 0004 4 < d9930180 339606101 C Ci:1:002:0 0 4 = 00010000 d9930380 339650171 S Ci:1:002:0 s a3 00 0000 0003 0004 4 < d9930380 339650346 C Ci:1:002:0 0 4 = 00010000 d9842c80 339700165 S Ci:1:002:0 s a3 00 0000 0003 0004 4 < d9842c80 339700346 C Ci:1:002:0 0 4 = 00010000 da780000 339751063 S Ci:1:002:0 s a3 00 0000 0003 0004 4 < da780000 339751221 C Ci:1:002:0 0 4 = 00010000 d9a15380 339800833 S Ci:1:002:0 s a3 00 0000 0003 0004 4 < d9a15380 339800974 C Ci:1:002:0 0 4 = 00010000 da737080 339801535 C Ii:1:002:1 -2:2048 0 d9a15380 339802157 S Co:1:002:0 s 00 03 0001 0000 0000 0 d9a15380 339802265 C Co:1:002:0 0 0 d9a15380 339802291 S Co:1:001:0 s 23 03 0002 0001 0000 0 d9a15380 339802352 C Co:1:001:0 0 0 da6ba280 339830247 C Ii:1:001:1 -2:2048 0
> d9a14b00 339317035 C Ii:1:004:1 -32:1 0 > d9a14b00 339317049 S Ii:1:004:1 -115:1 10 < > d9a14b00 339318040 C Ii:1:004:1 -32:1 0 > d9a14b00 339318057 S Ii:1:004:1 -115:1 10 < > d9a14b00 339319042 C Ii:1:004:1 -32:1 0 > d9a14b00 339319056 S Ii:1:004:1 -115:1 10 < > d9a14b00 339329551 C Ii:1:004:1 -32:1 0 > d9a14b00 339329571 S Ii:1:004:1 -115:1 10 < > d9a14b00 339330586 C Ii:1:004:1 -32:1 0 > d9a14b00 339330601 S Ii:1:004:1 -115:1 10 < > d9a14b00 339331035 C Ii:1:004:1 -32:1 0 Sorry for necromancing an old thread, but I just happened to read through this and thought someone might care: If I read that right, the usbmon output shows that the interrupt endpoint is stalled (keeps returning -EPIPE). A STALL is a special device-side USB condition that tells the host something is wrong and will persist until cleared manually. It seems that the driver isn't prepared for this (see drivers/usb/serial/pl2303.c#pl2303_read_int_callback) and just keeps resubmitting the URB, so it will stall again as fast as the endpoint allows it to. This may be the reason why you get so many transfers that it overwhelms the CPU. A fix would be to catch -EPIPE in that function and handle it explicitly (with either a CLEAR_STALL to the endpoint or a full USB reset... would have to look at the documentation for PL2303 to see what the stall actually means and how you're supposed to treat it).
On Mon, Oct 16, 2017 at 01:49:11PM -0700, Julius Werner wrote: > > d9a14b00 339317035 C Ii:1:004:1 -32:1 0 > > d9a14b00 339317049 S Ii:1:004:1 -115:1 10 < > > d9a14b00 339318040 C Ii:1:004:1 -32:1 0 > > d9a14b00 339318057 S Ii:1:004:1 -115:1 10 < > > d9a14b00 339319042 C Ii:1:004:1 -32:1 0 > > d9a14b00 339319056 S Ii:1:004:1 -115:1 10 < > > d9a14b00 339329551 C Ii:1:004:1 -32:1 0 > > d9a14b00 339329571 S Ii:1:004:1 -115:1 10 < > > d9a14b00 339330586 C Ii:1:004:1 -32:1 0 > > d9a14b00 339330601 S Ii:1:004:1 -115:1 10 < > > d9a14b00 339331035 C Ii:1:004:1 -32:1 0 > > Sorry for necromancing an old thread, but I just happened to read > through this and thought someone might care: > > If I read that right, the usbmon output shows that the interrupt > endpoint is stalled (keeps returning -EPIPE). A STALL is a special > device-side USB condition that tells the host something is wrong and > will persist until cleared manually. It seems that the driver isn't > prepared for this (see > drivers/usb/serial/pl2303.c#pl2303_read_int_callback) and just keeps > resubmitting the URB, so it will stall again as fast as the endpoint > allows it to. This may be the reason why you get so many transfers > that it overwhelms the CPU. That's a bug in the driver, we should not resubmit (without further action) on -EPIPE. > A fix would be to catch -EPIPE in that function and handle it > explicitly (with either a CLEAR_STALL to the endpoint or a full USB > reset... would have to look at the documentation for PL2303 to see > what the stall actually means and how you're supposed to treat it). Yes, but we can't just clear the halt from the completion handler, so you'd typically have to schedule a work struct and call usb_clear_halt from there. Only then could we try resubmitting the URBs, but chances are we'd just hit that stall again (with the hardware setup in question). Note that no usb-serial drivers currently implement any such stall recovery, and just stop resubmitting the URB on -EPIPE. Or at least so I thought. The generic implementation (which most drivers rely on) and a few others get this right, but we have a number of legacy drivers with custom implementations that do resubmit on -EPIPE (including the pl2303 one). I'll go fix up that up. Thanks, Johan
Hi, On Mon, Oct 16, 2017 at 1:49 PM, Julius Werner <jwerner@chromium.org> wrote: >> d9a14b00 339317035 C Ii:1:004:1 -32:1 0 >> d9a14b00 339317049 S Ii:1:004:1 -115:1 10 < >> d9a14b00 339318040 C Ii:1:004:1 -32:1 0 >> d9a14b00 339318057 S Ii:1:004:1 -115:1 10 < >> d9a14b00 339319042 C Ii:1:004:1 -32:1 0 >> d9a14b00 339319056 S Ii:1:004:1 -115:1 10 < >> d9a14b00 339329551 C Ii:1:004:1 -32:1 0 >> d9a14b00 339329571 S Ii:1:004:1 -115:1 10 < >> d9a14b00 339330586 C Ii:1:004:1 -32:1 0 >> d9a14b00 339330601 S Ii:1:004:1 -115:1 10 < >> d9a14b00 339331035 C Ii:1:004:1 -32:1 0 > > Sorry for necromancing an old thread, but I just happened to read > through this and thought someone might care: > > If I read that right, the usbmon output shows that the interrupt > endpoint is stalled (keeps returning -EPIPE). A STALL is a special > device-side USB condition that tells the host something is wrong and > will persist until cleared manually. It seems that the driver isn't > prepared for this (see > drivers/usb/serial/pl2303.c#pl2303_read_int_callback) and just keeps > resubmitting the URB, so it will stall again as fast as the endpoint > allows it to. This may be the reason why you get so many transfers > that it overwhelms the CPU. > > A fix would be to catch -EPIPE in that function and handle it > explicitly (with either a CLEAR_STALL to the endpoint or a full USB > reset... would have to look at the documentation for PL2303 to see > what the stall actually means and how you're supposed to treat it). To further comment on this old thread, I just posted another patch at <https://patchwork.kernel.org/patch/10027237/> that could also make pl2303 less able to bring dwc2-based controllers to a screeching halt. I added many of the people who had taken part in this thread, but if you were just lurking here then hopefully you can dig it up and try it out. -Doug
--- badcase.txt 2017-04-25 18:02:59.000000000 +0000 +++ goodcase.txt 2017-04-25 18:02:59.000000000 +0000 @@ -4,12 +4,12 @@ dwc2_dump_global_registers: GAHBCFG @0xCC850008 : 0x00000031 dwc2_dump_global_registers: GUSBCFG @0xCC85000C : 0x20001700 dwc2_dump_global_registers: GRSTCTL @0xCC850010 : 0x80000000 - dwc2_dump_global_registers: GINTSTS @0xCC850014 : 0x06000021 + dwc2_dump_global_registers: GINTSTS @0xCC850014 : 0x06000029 dwc2_dump_global_registers: GINTMSK @0xCC850018 : 0xF300080E - dwc2_dump_global_registers: GRXSTSR @0xCC85001C : 0x00070044 + dwc2_dump_global_registers: GRXSTSR @0xCC85001C : 0x000F0001 dwc2_dump_global_registers: GRXFSIZ @0xCC850024 : 0x00000306 dwc2_dump_global_registers: GNPTXFSIZ @0xCC850028 : 0x01000306 - dwc2_dump_global_registers: GNPTXSTS @0xCC85002C : 0x1D080100 + dwc2_dump_global_registers: GNPTXSTS @0xCC85002C : 0x18080100 dwc2_dump_global_registers: GI2CCTL @0xCC850030 : 0x00000000 dwc2_dump_global_registers: GPVNDCTL @0xCC850034 : 0x00000000 dwc2_dump_global_registers: GGPIO @0xCC850038 : 0x00000000 @@ -27,64 +27,64 @@ dwc2_dump_host_registers: Host Global Registers dwc2_dump_host_registers: HCFG @0xCC850400 : 0x00000000 dwc2_dump_host_registers: HFIR @0xCC850404 : 0x00001D4B - dwc2_dump_host_registers: HFNUM @0xCC850408 : 0x0BF03CA1 - dwc2_dump_host_registers: HPTXSTS @0xCC850410 : 0x29080200 - dwc2_dump_host_registers: HAINT @0xCC850414 : 0x00000002 - dwc2_dump_host_registers: HAINTMSK @0xCC850418 : 0x00000002 + dwc2_dump_host_registers: HFNUM @0xCC850408 : 0x029F3421 + dwc2_dump_host_registers: HPTXSTS @0xCC850410 : 0x25080200 + dwc2_dump_host_registers: HAINT @0xCC850414 : 0x00000001 + dwc2_dump_host_registers: HAINTMSK @0xCC850418 : 0x00000001 dwc2_dump_host_registers: HPRT0 @0xCC850440 : 0x00001005 dwc2_dump_host_registers: Host Channel 0 Specific Registers - dwc2_dump_host_registers: HCCHAR @0xCC850500 : 0x01189840 - dwc2_dump_host_registers: HCSPLT @0xCC850504 : 0x8001C102 - dwc2_dump_host_registers: HCINT @0xCC850508 : 0x00000000 - dwc2_dump_host_registers: HCINTMSK @0xCC85050C : 0x00000000 - dwc2_dump_host_registers: HCTSIZ @0xCC850510 : 0x00080040 - dwc2_dump_host_registers: HCDMA @0xCC850514 : 0x4A07D300 + dwc2_dump_host_registers: HCCHAR @0xCC850500 : 0x013C880A + dwc2_dump_host_registers: HCSPLT @0xCC850504 : 0x8000C102 + dwc2_dump_host_registers: HCINT @0xCC850508 : 0x00000022 + dwc2_dump_host_registers: HCINTMSK @0xCC85050C : 0x00000006 + dwc2_dump_host_registers: HCTSIZ @0xCC850510 : 0x4008000A + dwc2_dump_host_registers: HCDMA @0xCC850514 : 0x4A093840 dwc2_dump_host_registers: Host Channel 1 Specific Registers dwc2_dump_host_registers: HCCHAR @0xCC850520 : 0x01189840 dwc2_dump_host_registers: HCSPLT @0xCC850524 : 0x8000C102 - dwc2_dump_host_registers: HCINT @0xCC850528 : 0x00000022 - dwc2_dump_host_registers: HCINTMSK @0xCC85052C : 0x00000006 + dwc2_dump_host_registers: HCINT @0xCC850528 : 0x00000000 + dwc2_dump_host_registers: HCINTMSK @0xCC85052C : 0x00000000 dwc2_dump_host_registers: HCTSIZ @0xCC850530 : 0x00080040 - dwc2_dump_host_registers: HCDMA @0xCC850534 : 0x4A07D300 + dwc2_dump_host_registers: HCDMA @0xCC850534 : 0x4A07D500 dwc2_dump_host_registers: Host Channel 2 Specific Registers dwc2_dump_host_registers: HCCHAR @0xCC850540 : 0x01189840 dwc2_dump_host_registers: HCSPLT @0xCC850544 : 0x8001C102 dwc2_dump_host_registers: HCINT @0xCC850548 : 0x00000000 dwc2_dump_host_registers: HCINTMSK @0xCC85054C : 0x00000000 dwc2_dump_host_registers: HCTSIZ @0xCC850550 : 0x00080040 - dwc2_dump_host_registers: HCDMA @0xCC850554 : 0x4A07D300 + dwc2_dump_host_registers: HCDMA @0xCC850554 : 0x4A07D500 dwc2_dump_host_registers: Host Channel 3 Specific Registers - dwc2_dump_host_registers: HCCHAR @0xCC850560 : 0x01189840 - dwc2_dump_host_registers: HCSPLT @0xCC850564 : 0x8001C102 + dwc2_dump_host_registers: HCCHAR @0xCC850560 : 0x00D08040 + dwc2_dump_host_registers: HCSPLT @0xCC850564 : 0x00000000 dwc2_dump_host_registers: HCINT @0xCC850568 : 0x00000000 dwc2_dump_host_registers: HCINTMSK @0xCC85056C : 0x00000000 - dwc2_dump_host_registers: HCTSIZ @0xCC850570 : 0x00080040 - dwc2_dump_host_registers: HCDMA @0xCC850574 : 0x4A07D300 + dwc2_dump_host_registers: HCTSIZ @0xCC850570 : 0x8000003C + dwc2_dump_host_registers: HCDMA @0xCC850574 : 0x4A3B7544 dwc2_dump_host_registers: Host Channel 4 Specific Registers - dwc2_dump_host_registers: HCCHAR @0xCC850580 : 0x01189840 - dwc2_dump_host_registers: HCSPLT @0xCC850584 : 0x8001C102 + dwc2_dump_host_registers: HCCHAR @0xCC850580 : 0x00D00040 + dwc2_dump_host_registers: HCSPLT @0xCC850584 : 0x00000000 dwc2_dump_host_registers: HCINT @0xCC850588 : 0x00000000 dwc2_dump_host_registers: HCINTMSK @0xCC85058C : 0x00000000 - dwc2_dump_host_registers: HCTSIZ @0xCC850590 : 0x00080040 - dwc2_dump_host_registers: HCDMA @0xCC850594 : 0x4A07D300 + dwc2_dump_host_registers: HCTSIZ @0xCC850590 : 0x20000008 + dwc2_dump_host_registers: HCDMA @0xCC850594 : 0x4A3B74C8 dwc2_dump_host_registers: Host Channel 5 Specific Registers dwc2_dump_host_registers: HCCHAR @0xCC8505A0 : 0x01189840 dwc2_dump_host_registers: HCSPLT @0xCC8505A4 : 0x8000C102 dwc2_dump_host_registers: HCINT @0xCC8505A8 : 0x00000000 dwc2_dump_host_registers: HCINTMSK @0xCC8505AC : 0x00000000 dwc2_dump_host_registers: HCTSIZ @0xCC8505B0 : 0x00080040 - dwc2_dump_host_registers: HCDMA @0xCC8505B4 : 0x4A07D300 + dwc2_dump_host_registers: HCDMA @0xCC8505B4 : 0x4A07D500 dwc2_dump_host_registers: Host Channel 6 Specific Registers - dwc2_dump_host_registers: HCCHAR @0xCC8505C0 : 0x01189840 - dwc2_dump_host_registers: HCSPLT @0xCC8505C4 : 0x8001C102 + dwc2_dump_host_registers: HCCHAR @0xCC8505C0 : 0x00D00040 + dwc2_dump_host_registers: HCSPLT @0xCC8505C4 : 0x00000000 dwc2_dump_host_registers: HCINT @0xCC8505C8 : 0x00000000 dwc2_dump_host_registers: HCINTMSK @0xCC8505CC : 0x00000000 - dwc2_dump_host_registers: HCTSIZ @0xCC8505D0 : 0x00080040 - dwc2_dump_host_registers: HCDMA @0xCC8505D4 : 0x4A07D300 + dwc2_dump_host_registers: HCTSIZ @0xCC8505D0 : 0x00000000 + dwc2_dump_host_registers: HCDMA @0xCC8505D4 : 0x4B080000 dwc2_dump_host_registers: Host Channel 7 Specific Registers - dwc2_dump_host_registers: HCCHAR @0xCC8505E0 : 0x01189840 + dwc2_dump_host_registers: HCCHAR @0xCC8505E0 : 0x013C880A dwc2_dump_host_registers: HCSPLT @0xCC8505E4 : 0x8001C102 dwc2_dump_host_registers: HCINT @0xCC8505E8 : 0x00000000 dwc2_dump_host_registers: HCINTMSK @0xCC8505EC : 0x00000000 - dwc2_dump_host_registers: HCTSIZ @0xCC8505F0 : 0x00080040 - dwc2_dump_host_registers: HCDMA @0xCC8505F4 : 0x4A07D300 + dwc2_dump_host_registers: HCTSIZ @0xCC8505F0 : 0x4008000A + dwc2_dump_host_registers: HCDMA @0xCC8505F4 : 0x4A093840