Message ID | 20230327011117.33953-1-xiehongyu1@kylinos.cn (mailing list archive) |
---|---|
State | Superseded |
Series | [-next,v2] usb: xhci: do not free an empty cmd ring |
On 27.3.2023 4.11, Hongyu Xie wrote:
> It was first found on HUAWEI Kirin 9006C platform with a builtin xhci
> controller during stress cycle test (stress-ng, glmark2, x11perf, S4...).
>
> phase one:
> [26788.706878] PM: dpm_run_callback(): platform_pm_thaw+0x0/0x68 returns -12
> [26788.706878] PM: Device xhci-hcd.1.auto failed to thaw async: error -12
> ...
> phase two:
> [28650.583496] [2023:01:19 04:43:29]Unable to handle kernel NULL pointer dereference at virtual address 0000000000000028
> ...
> [28650.583526] user pgtable: 4k pages, 39-bit VAs, pgdp=000000027862a000
> [28650.583557] [0000000000000028] pgd=0000000000000000
> ...
> [28650.583587] pc : xhci_suspend+0x154/0x5b0
> [28650.583618] lr : xhci_suspend+0x148/0x5b0
> [28650.583618] sp : ffffffc01c7ebbd0
> [28650.583618] x29: ffffffc01c7ebbd0 x28: ffffffec834d0000
> [28650.583618] x27: ffffffc0106a3cc8 x26: ffffffb2c540c848
> [28650.583618] x25: 0000000000000000 x24: ffffffec82ee30b0
> [28650.583618] x23: ffffffb43b31c2f8 x22: 0000000000000000
> [28650.583618] x21: 0000000000000000 x20: ffffffb43b31c000
> [28650.583648] x19: ffffffb43b31c2a8 x18: 0000000000000001
> [28650.583648] x17: 0000000000000803 x16: 00000000fffffffe
> [28650.583648] x15: 0000000000001000 x14: ffffffb150b67e00
> [28650.583648] x13: 00000000f0000000 x12: 0000000000000001
> [28650.583648] x11: 0000000000000000 x10: 0000000000000a80
> [28650.583648] x9 : ffffffc01c7eba00 x8 : ffffffb43ad10ae0
> [28650.583648] x7 : ffffffb84cd98dc0 x6 : 0000000cceb6a101
> [28650.583679] x5 : 00ffffffffffffff x4 : 0000000000000001
> [28650.583679] x3 : 0000000000000011 x2 : 0000000000e2cfa8
> [28650.583679] x1 : 00000000823535e1 x0 : 0000000000000000
>
> gdb:
> (gdb) l *(xhci_suspend+0x154)
> 0xffffffc010b6cd44 is in xhci_suspend (/.../drivers/usb/host/xhci.c:854).
> 849 {
> 850     struct xhci_ring *ring;
> 851     struct xhci_segment *seg;
> 852
> 853     ring = xhci->cmd_ring;
> 854     seg = ring->deq_seg;
> (gdb) disassemble 0xffffffc010b6cd44
> ...
> 0xffffffc010b6cd40 <+336>: ldr x22, [x19, #160]
> 0xffffffc010b6cd44 <+340>: ldr x20, [x22, #40]
> 0xffffffc010b6cd48 <+344>: mov w1, #0x0 // #0
>
> During phase one, platform_pm_thaw called xhci_plat_resume which called
> xhci_resume. The remaining call path was probably
> xhci_resume->xhci_init->xhci_mem_init, and xhci->cmd_ring was cleared in
> xhci_mem_cleanup before xhci_mem_init returned -ENOMEM.
>
> During phase two, systemd was trying to hibernate again and called
> xhci_suspend, then xhci_clear_command_ring dereferenced xhci->cmd_ring
> which was already NULL.
>

Any comments on the questions I had on the first version of the patch?

xhci_mem_init() failing with -ENOMEM looks like the real problem here.

Are we really running out of memory? Does kmemleak say anything?

Any chance you could look into where exactly xhci_mem_init() fails, as
xhci_mem_init() always returns -ENOMEM on failure?

> So if xhci->cmd_ring is NULL, xhci_clear_command_ring just returns.

This hides the problem more than solves it. Root cause is still unknown.

Thanks
Mathias
Hi,

On 2023/3/27 22:58, Mathias Nyman wrote:
> On 27.3.2023 4.11, Hongyu Xie wrote:
>> It was first found on HUAWEI Kirin 9006C platform with a builtin xhci
>> controller during stress cycle test (stress-ng, glmark2, x11perf, S4...).
>>
>> phase one:
>> [26788.706878] PM: dpm_run_callback(): platform_pm_thaw+0x0/0x68 returns -12
>> [26788.706878] PM: Device xhci-hcd.1.auto failed to thaw async: error -12
>> ...
>> phase two:
>> [28650.583496] [2023:01:19 04:43:29]Unable to handle kernel NULL pointer dereference at virtual address 0000000000000028
>> ...
>> [28650.583526] user pgtable: 4k pages, 39-bit VAs, pgdp=000000027862a000
>> [28650.583557] [0000000000000028] pgd=0000000000000000
>> ...
>> [28650.583587] pc : xhci_suspend+0x154/0x5b0
>> [28650.583618] lr : xhci_suspend+0x148/0x5b0
>> [28650.583618] sp : ffffffc01c7ebbd0
>> [28650.583618] x29: ffffffc01c7ebbd0 x28: ffffffec834d0000
>> [28650.583618] x27: ffffffc0106a3cc8 x26: ffffffb2c540c848
>> [28650.583618] x25: 0000000000000000 x24: ffffffec82ee30b0
>> [28650.583618] x23: ffffffb43b31c2f8 x22: 0000000000000000
>> [28650.583618] x21: 0000000000000000 x20: ffffffb43b31c000
>> [28650.583648] x19: ffffffb43b31c2a8 x18: 0000000000000001
>> [28650.583648] x17: 0000000000000803 x16: 00000000fffffffe
>> [28650.583648] x15: 0000000000001000 x14: ffffffb150b67e00
>> [28650.583648] x13: 00000000f0000000 x12: 0000000000000001
>> [28650.583648] x11: 0000000000000000 x10: 0000000000000a80
>> [28650.583648] x9 : ffffffc01c7eba00 x8 : ffffffb43ad10ae0
>> [28650.583648] x7 : ffffffb84cd98dc0 x6 : 0000000cceb6a101
>> [28650.583679] x5 : 00ffffffffffffff x4 : 0000000000000001
>> [28650.583679] x3 : 0000000000000011 x2 : 0000000000e2cfa8
>> [28650.583679] x1 : 00000000823535e1 x0 : 0000000000000000
>>
>> gdb:
>> (gdb) l *(xhci_suspend+0x154)
>> 0xffffffc010b6cd44 is in xhci_suspend (/.../drivers/usb/host/xhci.c:854).
>> 849 {
>> 850     struct xhci_ring *ring;
>> 851     struct xhci_segment *seg;
>> 852
>> 853     ring = xhci->cmd_ring;
>> 854     seg = ring->deq_seg;
>> (gdb) disassemble 0xffffffc010b6cd44
>> ...
>> 0xffffffc010b6cd40 <+336>: ldr x22, [x19, #160]
>> 0xffffffc010b6cd44 <+340>: ldr x20, [x22, #40]
>> 0xffffffc010b6cd48 <+344>: mov w1, #0x0 // #0
>>
>> During phase one, platform_pm_thaw called xhci_plat_resume which called
>> xhci_resume. The remaining call path was probably
>> xhci_resume->xhci_init->xhci_mem_init, and xhci->cmd_ring was cleared in
>> xhci_mem_cleanup before xhci_mem_init returned -ENOMEM.
>>
>> During phase two, systemd was trying to hibernate again and called
>> xhci_suspend, then xhci_clear_command_ring dereferenced xhci->cmd_ring
>> which was already NULL.
>>
>
> Any comments on the questions I had on the first version of the patch?

Sorry, I didn't notice your reply to the first version.

>
> xhci_mem_init() failing with -ENOMEM looks like the real problem here.
>
> Are we really running out of memory? Does kmemleak say anything?

It looks like the system was running out of memory, since a stress test
was running. But I can't go any further without more details. I didn't
run with kmemleak enabled.

> Any chance you could look into where exactly xhci_mem_init() fails, as
> xhci_mem_init() always returns -ENOMEM on failure?

I haven't been able to reproduce the problem for a very long time, so I
still don't know where it failed in xhci_mem_init. But I don't think you
can blame the xhci driver for a memory shortage, and you can't fix that.

>
>> So if xhci->cmd_ring is NULL, xhci_clear_command_ring just returns.
>
> This hides the problem more than solves it. Root cause is still unknown.

You were saying "If xhci_mem_init() failed then...it shouldn't be...",
and I agree with it. Furthermore, I think the functions that call
xhci_mem_init need to check its return value, but that needs another
patch.

This patch only says that xhci_clear_command_ring should check the
pointer before using it, because somewhere else might have cleared
cmd_ring, that's all.

>
> Thanks
> Mathias
>

Thanks

Hongyu Xie
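
Note on the fault address in the quoted oops: the disassembly loads
xhci->cmd_ring into x22 (`ldr x22, [x19, #160]`), the register dump shows
x22 as 0, and the next load (`ldr x20, [x22, #40]`) faults at 0 + 40 = 0x28.
That matches a NULL cmd_ring with deq_seg at byte offset 40. The sketch
below is only an illustration of that arithmetic. It assumes the leading
members of struct xhci_ring are six pointers in the order shown; the
`ring_layout` type and `deq_seg_offset()` helper are hypothetical stand-ins,
not kernel definitions.

```c
#include <stddef.h>

/* Hypothetical stand-ins; only the member order matters for the offset. */
struct seg_layout;
union trb_layout;

/*
 * Assumed leading members of struct xhci_ring (an assumption, check
 * drivers/usb/host/xhci.h for the tree in question).
 */
struct ring_layout {
	struct seg_layout *first_seg;
	struct seg_layout *last_seg;
	union trb_layout *enqueue;
	struct seg_layout *enq_seg;
	union trb_layout *dequeue;
	struct seg_layout *deq_seg;
};

/* Byte offset the compiler adds to `ring` when loading ring->deq_seg. */
static size_t deq_seg_offset(void)
{
	return offsetof(struct ring_layout, deq_seg);
}
```

With 8-byte pointers, as on the arm64 system in the report, five pointers
precede deq_seg, so the offset is 5 * 8 = 40 = 0x28.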
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 6183ce8574b1..faa0a63671f6 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -920,6 +920,11 @@ static void xhci_clear_command_ring(struct xhci_hcd *xhci)
 	struct xhci_ring *ring;
 	struct xhci_segment *seg;
 
+	if (!xhci->cmd_ring) {
+		xhci_err(xhci, "Empty cmd ring");
+		return;
+	}
+
 	ring = xhci->cmd_ring;
 	seg = ring->deq_seg;
 	do {
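
To make the two phases from the report concrete, here is a small userspace
model of the sequence with the guard from this patch applied. It is a
sketch under stated assumptions, not the driver code: every `model_*` name
is hypothetical, the ring has a single segment, allocation failure is
injected through a flag, and the clear routine simply zeroes each segment
and returns a count so that the behaviour can be observed (the kernel
function returns void).

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_TRBS_PER_SEG 4

struct model_seg {
	unsigned int trbs[MODEL_TRBS_PER_SEG];
	struct model_seg *next;		/* segments form a circular list */
};

struct model_ring {
	struct model_seg *deq_seg;
};

struct model_hcd {
	struct model_ring *cmd_ring;	/* NULL after a failed init */
};

/* Frees the one-segment ring and clears the pointer, like a cleanup path. */
static void model_mem_cleanup(struct model_hcd *hcd)
{
	if (hcd->cmd_ring) {
		free(hcd->cmd_ring->deq_seg);
		free(hcd->cmd_ring);
	}
	hcd->cmd_ring = NULL;
}

/* Builds a one-segment ring; fail_alloc != 0 injects an allocation failure. */
static int model_mem_init(struct model_hcd *hcd, int fail_alloc)
{
	struct model_ring *ring = fail_alloc ? NULL : calloc(1, sizeof(*ring));

	if (!ring)
		goto fail;
	hcd->cmd_ring = ring;
	ring->deq_seg = calloc(1, sizeof(*ring->deq_seg));
	if (!ring->deq_seg)
		goto fail;
	ring->deq_seg->next = ring->deq_seg;
	return 0;
fail:
	/* Phase one: cleanup runs and leaves cmd_ring NULL. */
	model_mem_cleanup(hcd);
	return -ENOMEM;
}

/* Phase two with the guard: returns the number of segments cleared. */
static int model_clear_command_ring(struct model_hcd *hcd)
{
	struct model_ring *ring;
	struct model_seg *seg;
	int cleared = 0;

	if (!hcd->cmd_ring)
		return 0;	/* nothing to clear, avoid the NULL dereference */

	ring = hcd->cmd_ring;
	seg = ring->deq_seg;
	do {
		memset(seg->trbs, 0, sizeof(seg->trbs));
		cleared++;
		seg = seg->next;
	} while (seg != ring->deq_seg);

	return cleared;
}
```

In this model, a caller that checks the return value of model_mem_init()
and stops using the device on -ENOMEM would never reach the clear routine
with a NULL ring, which is the point raised in the review. The guard only
covers callers that do not make that check.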