diff mbox series

[-next,v2] usb: xhci: do not free an empty cmd ring

Message ID 20230327011117.33953-1-xiehongyu1@kylinos.cn (mailing list archive)
State Superseded
Headers show
Series [-next,v2] usb: xhci: do not free an empty cmd ring | expand

Commit Message

Hongyu Xie March 27, 2023, 1:11 a.m. UTC
It was first found on HUAWEI Kirin 9006C platform with a builtin xhci
controller during stress cycle test(stress-ng, glmark2, x11perf, S4...).

phase one:
[26788.706878] PM: dpm_run_callback(): platform_pm_thaw+0x0/0x68 returns -12
[26788.706878] PM: Device xhci-hcd.1.auto failed to thaw async: error -12
...
phase two:
[28650.583496] [2023:01:19 04:43:29]Unable to handle kernel NULL pointer dereference at virtual address 0000000000000028
...
[28650.583526] user pgtable: 4k pages, 39-bit VAs, pgdp=000000027862a000
[28650.583557] [0000000000000028] pgd=0000000000000000
...
[28650.583587] pc : xhci_suspend+0x154/0x5b0
[28650.583618] lr : xhci_suspend+0x148/0x5b0
[28650.583618] sp : ffffffc01c7ebbd0
[28650.583618] x29: ffffffc01c7ebbd0 x28: ffffffec834d0000
[28650.583618] x27: ffffffc0106a3cc8 x26: ffffffb2c540c848
[28650.583618] x25: 0000000000000000 x24: ffffffec82ee30b0
[28650.583618] x23: ffffffb43b31c2f8 x22: 0000000000000000
[28650.583618] x21: 0000000000000000 x20: ffffffb43b31c000
[28650.583648] x19: ffffffb43b31c2a8 x18: 0000000000000001
[28650.583648] x17: 0000000000000803 x16: 00000000fffffffe
[28650.583648] x15: 0000000000001000 x14: ffffffb150b67e00
[28650.583648] x13: 00000000f0000000 x12: 0000000000000001
[28650.583648] x11: 0000000000000000 x10: 0000000000000a80
[28650.583648] x9 : ffffffc01c7eba00 x8 : ffffffb43ad10ae0
[28650.583648] x7 : ffffffb84cd98dc0 x6 : 0000000cceb6a101
[28650.583679] x5 : 00ffffffffffffff x4 : 0000000000000001
[28650.583679] x3 : 0000000000000011 x2 : 0000000000e2cfa8
[28650.583679] x1 : 00000000823535e1 x0 : 0000000000000000

gdb:
(gdb) l *(xhci_suspend+0x154)
0xffffffc010b6cd44 is in xhci_suspend (/.../drivers/usb/host/xhci.c:854).
849	{
850		struct xhci_ring *ring;
851		struct xhci_segment *seg;
852
853		ring = xhci->cmd_ring;
854		seg = ring->deq_seg;
(gdb) disassemble 0xffffffc010b6cd44
...
0xffffffc010b6cd40 <+336>:	ldr	x22, [x19, #160]
0xffffffc010b6cd44 <+340>:	ldr	x20, [x22, #40]
0xffffffc010b6cd48 <+344>:	mov	w1, #0x0                   	// #0

During phase one, platform_pm_thaw called xhci_plat_resume which called
xhci_resume. The rest possible calling routine might be
xhci_resume->xhci_init->xhci_mem_init, and xhci->cmd_ring was cleaned in
xhci_mem_cleanup before xhci_mem_init returned -ENOMEM.

During phase two, systemd was tring to hibernate again and called
xhci_suspend, then xhci_clear_command_ring dereferenced xhci->cmd_ring
which was already NULL.

So if xhci->cmd_ring is NULL, xhci_clear_command_ring just return.

Fixes: 898213200cba ("xhci: Fix command ring replay after resume.")
Cc: stable@vger.kernel.org # 2.6.27+
Co-developed-by: sunke <sunke@kylinos.cn>
Signed-off-by: sunke <sunke@kylinos.cn>
Signed-off-by: Hongyu Xie <xiehongyu1@kylinos.cn>
---

v2: add fixes

 drivers/usb/host/xhci.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Mathias Nyman March 27, 2023, 2:58 p.m. UTC | #1
On 27.3.2023 4.11, Hongyu Xie wrote:
> It was first found on HUAWEI Kirin 9006C platform with a builtin xhci
> controller during stress cycle test(stress-ng, glmark2, x11perf, S4...).
> 
> phase one:
> [26788.706878] PM: dpm_run_callback(): platform_pm_thaw+0x0/0x68 returns -12
> [26788.706878] PM: Device xhci-hcd.1.auto failed to thaw async: error -12
> ...
> phase two:
> [28650.583496] [2023:01:19 04:43:29]Unable to handle kernel NULL pointer dereference at virtual address 0000000000000028
> ...
> [28650.583526] user pgtable: 4k pages, 39-bit VAs, pgdp=000000027862a000
> [28650.583557] [0000000000000028] pgd=0000000000000000
> ...
> [28650.583587] pc : xhci_suspend+0x154/0x5b0
> [28650.583618] lr : xhci_suspend+0x148/0x5b0
> [28650.583618] sp : ffffffc01c7ebbd0
> [28650.583618] x29: ffffffc01c7ebbd0 x28: ffffffec834d0000
> [28650.583618] x27: ffffffc0106a3cc8 x26: ffffffb2c540c848
> [28650.583618] x25: 0000000000000000 x24: ffffffec82ee30b0
> [28650.583618] x23: ffffffb43b31c2f8 x22: 0000000000000000
> [28650.583618] x21: 0000000000000000 x20: ffffffb43b31c000
> [28650.583648] x19: ffffffb43b31c2a8 x18: 0000000000000001
> [28650.583648] x17: 0000000000000803 x16: 00000000fffffffe
> [28650.583648] x15: 0000000000001000 x14: ffffffb150b67e00
> [28650.583648] x13: 00000000f0000000 x12: 0000000000000001
> [28650.583648] x11: 0000000000000000 x10: 0000000000000a80
> [28650.583648] x9 : ffffffc01c7eba00 x8 : ffffffb43ad10ae0
> [28650.583648] x7 : ffffffb84cd98dc0 x6 : 0000000cceb6a101
> [28650.583679] x5 : 00ffffffffffffff x4 : 0000000000000001
> [28650.583679] x3 : 0000000000000011 x2 : 0000000000e2cfa8
> [28650.583679] x1 : 00000000823535e1 x0 : 0000000000000000
> 
> gdb:
> (gdb) l *(xhci_suspend+0x154)
> 0xffffffc010b6cd44 is in xhci_suspend (/.../drivers/usb/host/xhci.c:854).
> 849	{
> 850		struct xhci_ring *ring;
> 851		struct xhci_segment *seg;
> 852
> 853		ring = xhci->cmd_ring;
> 854		seg = ring->deq_seg;
> (gdb) disassemble 0xffffffc010b6cd44
> ...
> 0xffffffc010b6cd40 <+336>:	ldr	x22, [x19, #160]
> 0xffffffc010b6cd44 <+340>:	ldr	x20, [x22, #40]
> 0xffffffc010b6cd48 <+344>:	mov	w1, #0x0                   	// #0
> 
> During phase one, platform_pm_thaw called xhci_plat_resume which called
> xhci_resume. The rest possible calling routine might be
> xhci_resume->xhci_init->xhci_mem_init, and xhci->cmd_ring was cleaned in
> xhci_mem_cleanup before xhci_mem_init returned -ENOMEM.
> 
> During phase two, systemd was tring to hibernate again and called
> xhci_suspend, then xhci_clear_command_ring dereferenced xhci->cmd_ring
> which was already NULL.
> 

Any comments on the questions I had on the first version of the patch?

xhci_mem_init() failing with -ENOMEM looks like the real problem here.

Are we really running out of memory? does kmemleak say anything?

Any chance you could look into where exactly xhci_mem_init() fails as
xhci_mem_init() always returns -ENOMEM on failure?

> So if xhci->cmd_ring is NULL, xhci_clear_command_ring just return.

This hides the problem more than solves it. Root cause is still unknown

Thanks
Mathias
Hongyu Xie March 30, 2023, 2:58 a.m. UTC | #2
Hi,

在 2023/3/27 22:58, Mathias Nyman 写道:
> On 27.3.2023 4.11, Hongyu Xie wrote:
>> It was first found on HUAWEI Kirin 9006C platform with a builtin xhci
>> controller during stress cycle test(stress-ng, glmark2, x11perf, S4...).
>>
>> phase one:
>> [26788.706878] PM: dpm_run_callback(): platform_pm_thaw+0x0/0x68 returns -12
>> [26788.706878] PM: Device xhci-hcd.1.auto failed to thaw async: error -12
>> ...
>> phase two:
>> [28650.583496] [2023:01:19 04:43:29]Unable to handle kernel NULL pointer dereference at virtual address 0000000000000028
>> ...
>> [28650.583526] user pgtable: 4k pages, 39-bit VAs, pgdp=000000027862a000
>> [28650.583557] [0000000000000028] pgd=0000000000000000
>> ...
>> [28650.583587] pc : xhci_suspend+0x154/0x5b0
>> [28650.583618] lr : xhci_suspend+0x148/0x5b0
>> [28650.583618] sp : ffffffc01c7ebbd0
>> [28650.583618] x29: ffffffc01c7ebbd0 x28: ffffffec834d0000
>> [28650.583618] x27: ffffffc0106a3cc8 x26: ffffffb2c540c848
>> [28650.583618] x25: 0000000000000000 x24: ffffffec82ee30b0
>> [28650.583618] x23: ffffffb43b31c2f8 x22: 0000000000000000
>> [28650.583618] x21: 0000000000000000 x20: ffffffb43b31c000
>> [28650.583648] x19: ffffffb43b31c2a8 x18: 0000000000000001
>> [28650.583648] x17: 0000000000000803 x16: 00000000fffffffe
>> [28650.583648] x15: 0000000000001000 x14: ffffffb150b67e00
>> [28650.583648] x13: 00000000f0000000 x12: 0000000000000001
>> [28650.583648] x11: 0000000000000000 x10: 0000000000000a80
>> [28650.583648] x9 : ffffffc01c7eba00 x8 : ffffffb43ad10ae0
>> [28650.583648] x7 : ffffffb84cd98dc0 x6 : 0000000cceb6a101
>> [28650.583679] x5 : 00ffffffffffffff x4 : 0000000000000001
>> [28650.583679] x3 : 0000000000000011 x2 : 0000000000e2cfa8
>> [28650.583679] x1 : 00000000823535e1 x0 : 0000000000000000
>>
>> gdb:
>> (gdb) l *(xhci_suspend+0x154)
>> 0xffffffc010b6cd44 is in xhci_suspend (/.../drivers/usb/host/xhci.c:854).
>> 849	{
>> 850		struct xhci_ring *ring;
>> 851		struct xhci_segment *seg;
>> 852
>> 853		ring = xhci->cmd_ring;
>> 854		seg = ring->deq_seg;
>> (gdb) disassemble 0xffffffc010b6cd44
>> ...
>> 0xffffffc010b6cd40 <+336>:	ldr	x22, [x19, #160]
>> 0xffffffc010b6cd44 <+340>:	ldr	x20, [x22, #40]
>> 0xffffffc010b6cd48 <+344>:	mov	w1, #0x0                   	// #0
>>
>> During phase one, platform_pm_thaw called xhci_plat_resume which called
>> xhci_resume. The rest possible calling routine might be
>> xhci_resume->xhci_init->xhci_mem_init, and xhci->cmd_ring was cleaned in
>> xhci_mem_cleanup before xhci_mem_init returned -ENOMEM.
>>
>> During phase two, systemd was tring to hibernate again and called
>> xhci_suspend, then xhci_clear_command_ring dereferenced xhci->cmd_ring
>> which was already NULL.
>>
> 
> Any comments on the questions I had on the first version of the patch?
Sorry, didn't notice your reply in the first version.
> 
> xhci_mem_init() failing with -ENOMEM looks like the real problem here.
> 
> Are we really running out of memory? does kmemleak say anything?
It looks like running out of memory, since it was running a stress test. 
But can't go any further without more details. Didn't run with kmemleak 
open.
> Any chance you could look into where exactly xhci_mem_init() fails as
> xhci_mem_init() always returns -ENOMEM on failure?
Can't reproduce the problem for a very long time. Still don't know where 
did it fail in xhci_mem_init. But I think you can't blame xhci driver 
for memory shortage, and you can't fix that.
> 
>> So if xhci->cmd_ring is NULL, xhci_clear_command_ring just return.
> 
> This hides the problem more than solves it. Root cause is still unknown
You were saying "If xhci_mem_init() failed then...it shouldn't be...", 
and I agree with it. Further more, I think functions that calling 
xhci_mem_init needs to check xhci_mem_init's return value, but it needs 
another patch to do this. This patch is saying that 
xhci_clear_command_ring should check a pointer before using it, because 
somewhere else might clear cmd_ring, that's all.
> 
> Thanks
> Mathias
> 
> 
Thanks

Hongyu Xie
diff mbox series

Patch

diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 6183ce8574b1..faa0a63671f6 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -920,6 +920,11 @@  static void xhci_clear_command_ring(struct xhci_hcd *xhci)
 	struct xhci_ring *ring;
 	struct xhci_segment *seg;
 
+	if (!xhci->cmd_ring) {
+		xhci_err(xhci, "Empty cmd ring");
+		return;
+	}
+
 	ring = xhci->cmd_ring;
 	seg = ring->deq_seg;
 	do {