usb HC busted?

Message ID	80eace7a-976d-65a5-a353-54a2b18edd06@linux.intel.com (mailing list archive)
State	New, archived
From: Mathias Nyman <mathias.nyman@linux.intel.com>
To: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Cc: Mathias Nyman <mathias.nyman@intel.com>, linux-usb@vger.kernel.org, lukaszx.szulc@intel.com
Subject: Re: usb HC busted?
Date: Thu, 24 May 2018 16:35:34 +0300
In-Reply-To: <20180523212956.n4ztasdffg2aeaku@debian>

Mathias Nyman May 24, 2018, 1:35 p.m. UTC

Hi

On 24.05.2018 00:29, Sudip Mukherjee wrote:
> Hi Mathias,
> 
> On Fri, May 18, 2018 at 04:19:02PM +0300, Mathias Nyman wrote:
>> On 18.05.2018 16:04, Sudip Mukherjee wrote:
>>> Hi Mathias,
>>>
>>> On Fri, May 18, 2018 at 03:55:04PM +0300, Mathias Nyman wrote:
>>>> Hi,
>>>>
>>>> Looks like event for Transfer block (TRB) at 0x32a21060 was never completed,
>>>> or at least not handled by xhci driver.
>>>> (either the event was never issued by hw, or something got messed up in the driver along the way)
>>>>
>>>> HC doesn't look busted; it continues sending transfer completion events.
>>>> It is already at event 0x32a211d0, which is 23 TRBs later (one TRB is 0x10 bytes).
>>>>
>>>> This small log snippet doesn't say much about the reasons.
>>>>
>>>> Can you enable tracing for xhci and send me the output?
>>>
> We have finally reproduced the error while traces were on. The trace and
> the relevant part of the dmesg (when the error starts) are in:
> https://drive.google.com/open?id=1PbcYwL1a9ndtHw1MNjE6uVqb0fFX9jV8
> 
> Will request you to have a look and suggest what might be going wrong here.
> 
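
For reference, the "23 TRBs later" figure quoted above is plain address arithmetic. A minimal userspace sketch, assuming both addresses lie in the same ring segment; trbs_between() is a hypothetical helper, not an xhci driver function:

```c
#include <assert.h>
#include <stdint.h>

/* Size of one TRB in bytes, as stated in the quoted mail ("one TRB is 0x10"). */
#define TRB_SIZE 0x10u

/*
 * Hypothetical helper (not part of the xhci driver): number of TRBs between
 * two TRB addresses, assuming both lie in the same ring segment.
 */
static uint64_t trbs_between(uint64_t from, uint64_t to)
{
	assert(to >= from);
	assert((to - from) % TRB_SIZE == 0);
	return (to - from) / TRB_SIZE;
}
```

With the values from the first log, trbs_between(0x32a21060, 0x32a211d0) is 0x170 / 0x10 = 23.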

The log shows two rings with the same TRB segment DMA address, which will completely mess up the transfers.

When the rings are allocated, the enqueue pointers of the two rings are the same:

461.859315: xhci_ring_alloc: ISOC efa4e580: enq 0x0000000033386000(0x0000000033386000) deq 0x0000000033386000(0x0000000033386000) segs 2 stream 0 ...bs
461.859320: xhci_ring_alloc: ISOC f0ce1f00: enq 0x0000000033386000(0x0000000033386000) deq 0x0000000033386000(0x0000000033386000) segs 2 stream 0 ...

URBs for ISOC IN transfers are queued on EP3 at enqueue address (33386000 to 33386140)

461.859998: xhci_urb_enqueue: ep3in-isoc: urb f0ec0e00 pipe 4294528 slot 8 length 0/170 sgs 0/0 stream 0 flags 00010302
461.860004: xhci_queue_trb: ISOC: Buffer 000000002b480240 length 17 TD size 0 intr 0 type 'Isoch' flags b:i:I:c:s:I:e:c
461.860006: xhci_inc_enq: ISOC f0ce1f00: enq 0x0000000033386010(0x0000000033386000) deq 0x0000000033386000(0x0000000033386000

Later, URBs for ISOC OUT transfers are queued at the same address; this should not happen:

461.901175: xhci_urb_enqueue: ep3out-isoc: urb ecec2600 pipe 100096 slot 8 length 0/51 sgs 0/0 stream 0 flags 00010002
461.901180: xhci_queue_trb: ISOC: Buffer 000000002d9fa805 length 17 TD size 0 intr 0 type 'Isoch' flags b:i:I:c:s:i:e:c
461.901181: xhci_inc_enq: ISOC efa4e580: enq 0x0000000033386010(0x0000000033386000) deq 0x0000000033386000(0x0000000033386000)
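
The overlap above can also be looked for mechanically in a longer trace. A rough userspace sketch, assuming each xhci_ring_alloc line has already been parsed into a ring pointer and the segment address shown in parentheses after "enq"; the struct and function names are made up for illustration and do not exist in the driver:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One xhci_ring_alloc trace line reduced to the two fields of interest. */
struct ring_alloc_rec {
	uint64_t ring;    /* ring pointer as printed in the trace, e.g. 0xefa4e580 */
	uint64_t enq_seg; /* DMA address of the segment holding the enqueue pointer */
};

/*
 * Hypothetical checker: returns the index of the first record whose segment
 * address was already reported for a different ring, or -1 if there is none.
 */
static int first_shared_segment(const struct ring_alloc_rec *recs, size_t n)
{
	assert(recs != NULL || n == 0);
	for (size_t i = 1; i < n; i++)
		for (size_t j = 0; j < i; j++)
			if (recs[i].ring != recs[j].ring &&
			    recs[i].enq_seg == recs[j].enq_seg)
				return (int)i;
	return -1;
}
```

Note that this deliberately ignores ring frees. A segment that was freed and later handed to another ring would be flagged too, so a hit only means "look closer", not "bug".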

So something goes really wrong when allocating or setting up the rings in one of these functions:
xhci_ring_alloc()
xhci_alloc_segments_for_ring()
xhci_initialize_ring_info()
xhci_segment_alloc()
xhci_link_segments()
dma_pool_zalloc()

To verify and rule out dma_pool_zalloc(), could you apply the attached patch and reproduce with new logs?

Thanks
-Mathias

Sudip Mukherjee June 3, 2018, 7:37 p.m. UTC | #1

Hi Mathias,

On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote:
> Hi
> 
> On 24.05.2018 00:29, Sudip Mukherjee wrote:
> >Hi Mathias,
> >
> >>>On Fri, May 18, 2018 at 03:55:04PM +0300, Mathias Nyman wrote:
> >>>>Hi,
<snip>
> >>>>
> >>>>
> >>>>Can you enable tracing for xhci and send me the output.
> >>>
> >We have finally reproduced the error while traces were on. The trace and
> >the relevant part of the dmesg (when the error starts) are in:
> >https://drive.google.com/open?id=1PbcYwL1a9ndtHw1MNjE6uVqb0fFX9jV8
> >
> >Will request you to have a look and suggest what might be going wrong here.
> >
> 
> The log shows two rings with the same TRB segment DMA address, which will completely mess up the transfers.
> 
> When the rings are allocated, the enqueue pointers of the two rings are the same:
> 
> 461.859315: xhci_ring_alloc: ISOC efa4e580: enq 0x0000000033386000(0x0000000033386000) deq 0x0000000033386000(0x0000000033386000) segs 2 stream 0 ...bs
> 461.859320: xhci_ring_alloc: ISOC f0ce1f00: enq 0x0000000033386000(0x0000000033386000) deq 0x0000000033386000(0x0000000033386000) segs 2 stream 0 ...
> 
> URBs for ISOC IN transfers are queued on EP3 at enqueue address (33386000 to 33386140)
> 
> 461.859998: xhci_urb_enqueue: ep3in-isoc: urb f0ec0e00 pipe 4294528 slot 8 length 0/170 sgs 0/0 stream 0 flags 00010302
> 461.860004: xhci_queue_trb: ISOC: Buffer 000000002b480240 length 17 TD size 0 intr 0 type 'Isoch' flags b:i:I:c:s:I:e:c
> 461.860006: xhci_inc_enq: ISOC f0ce1f00: enq 0x0000000033386010(0x0000000033386000) deq 0x0000000033386000(0x0000000033386000
> 
> Later, URBs for ISOC OUT transfers are queued at the same address; this should not happen:
> 
> 461.901175: xhci_urb_enqueue: ep3out-isoc: urb ecec2600 pipe 100096 slot 8 length 0/51 sgs 0/0 stream 0 flags 00010002
> 461.901180: xhci_queue_trb: ISOC: Buffer 000000002d9fa805 length 17 TD size 0 intr 0 type 'Isoch' flags b:i:I:c:s:i:e:c
> 461.901181: xhci_inc_enq: ISOC efa4e580: enq 0x0000000033386010(0x0000000033386000) deq 0x0000000033386000(0x0000000033386000)
> 
> So something goes really wrong when allocating or setting up the rings in one of these functions:
> xhci_ring_alloc()
> xhci_alloc_segments_for_ring()
> xhci_initialize_ring_info()
> xhci_segment_alloc()
> xhci_link_segments()
> dma_pool_zalloc()
> 
> To verify and rule out dma_pool_zalloc(), could you apply the attached patch and reproduce with new logs?

We tested for the full week but still could not reproduce with the patch
applied. We are still trying and will be setting up automated tests for
this. And, since we are not able to reproduce it, I was wondering if it
is some kind of race and the applied patch with extra tracing has changed
the timing in such a way that it is not seen now. And also, wondering if
2b3ff282dff3 ("xhci: Don't add a virt_dev to the devs array before it's fully allocated")
will be of any help to us.

--
Regards
Sudip
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sudip Mukherjee June 4, 2018, 3:28 p.m. UTC | #2

Hi Mathias,

On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote:
> Hi
> 
> On 24.05.2018 00:29, Sudip Mukherjee wrote:
> > Hi Mathias,
> > 
> > On Fri, May 18, 2018 at 04:19:02PM +0300, Mathias Nyman wrote:
> > > On 18.05.2018 16:04, Sudip Mukherjee wrote:
<snip>
> > > > 
> > We have finally reproduced the error while traces were on. The trace and
> > the relevant part of the dmesg (when the error starts) are in:
> > https://drive.google.com/open?id=1PbcYwL1a9ndtHw1MNjE6uVqb0fFX9jV8
> > 
> > Will request you to have a look and suggest what might be going wrong here.
> > 
> 
> The log shows two rings with the same TRB segment DMA address, which will completely mess up the transfers.
> 
> When the rings are allocated, the enqueue pointers of the two rings are the same:
> 
> 461.859315: xhci_ring_alloc: ISOC efa4e580: enq 0x0000000033386000(0x0000000033386000) deq 0x0000000033386000(0x0000000033386000) segs 2 stream 0 ...bs
> 461.859320: xhci_ring_alloc: ISOC f0ce1f00: enq 0x0000000033386000(0x0000000033386000) deq 0x0000000033386000(0x0000000033386000) segs 2 stream 0 ...
> 
> URBs for ISOC IN transfers are queued on EP3 at enqueue address (33386000 to 33386140)
> 
> 461.859998: xhci_urb_enqueue: ep3in-isoc: urb f0ec0e00 pipe 4294528 slot 8 length 0/170 sgs 0/0 stream 0 flags 00010302
> 461.860004: xhci_queue_trb: ISOC: Buffer 000000002b480240 length 17 TD size 0 intr 0 type 'Isoch' flags b:i:I:c:s:I:e:c
> 461.860006: xhci_inc_enq: ISOC f0ce1f00: enq 0x0000000033386010(0x0000000033386000) deq 0x0000000033386000(0x0000000033386000
> 
> Later, URBs for ISOC OUT transfers are queued at the same address; this should not happen:
> 
> 461.901175: xhci_urb_enqueue: ep3out-isoc: urb ecec2600 pipe 100096 slot 8 length 0/51 sgs 0/0 stream 0 flags 00010002
> 461.901180: xhci_queue_trb: ISOC: Buffer 000000002d9fa805 length 17 TD size 0 intr 0 type 'Isoch' flags b:i:I:c:s:i:e:c
> 461.901181: xhci_inc_enq: ISOC efa4e580: enq 0x0000000033386010(0x0000000033386000) deq 0x0000000033386000(0x0000000033386000)
> 
> So something goes really wrong when allocating or setting up the rings in one of these functions:
> xhci_ring_alloc()
> xhci_alloc_segments_for_ring()
> xhci_initialize_ring_info()
> xhci_segment_alloc()
> xhci_link_segments()
> dma_pool_zalloc()
> 
> To verify and rule out dma_pool_zalloc(), could you apply the attached patch and reproduce with new logs?

I spoke too soon in yesterday's mail. We were able to reproduce it
on the automated tests. The log and the trace are at:
https://drive.google.com/open?id=1h-3r-1lfjg8oblBGkzdRIq8z3ZNgGZx-

Will request you to have a look at it.

--
Regards
Sudip

Mathias Nyman June 6, 2018, 2:12 p.m. UTC | #3

On 04.06.2018 18:28, Sudip Mukherjee wrote:
> On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote:
>>
>> The log shows two rings with the same TRB segment DMA address, which will completely mess up the transfers.
>>
>> When the rings are allocated, the enqueue pointers of the two rings are the same:
>>
>> 461.859315: xhci_ring_alloc: ISOC efa4e580: enq 0x0000000033386000(0x0000000033386000) deq 0x0000000033386000(0x0000000033386000) segs 2 stream 0 ...bs
>> 461.859320: xhci_ring_alloc: ISOC f0ce1f00: enq 0x0000000033386000(0x0000000033386000) deq 0x0000000033386000(0x0000000033386000) segs 2 stream 0 ...
>>
>> So something goes really wrong when allocating or setting up the rings in one of these functions:
>>
>> To verify and rule out dma_pool_zalloc(), could you apply the attached patch and reproduce with new logs?
> 
> I spoke too soon in yesterday's mail. We were able to reproduce it
> on the automated tests. The log and the trace are at:
> https://drive.google.com/open?id=1h-3r-1lfjg8oblBGkzdRIq8z3ZNgGZx-
> 
> Will request you to have a look at it.
> 

Odd and unlikely, but to me this looks like some issue in allocating dma memory
from pool using dma_pool_zalloc()

Adding people with DMA knowledge to cc, maybe someone knows what is going on.

Here's the story:
Sudip sees USB issues on an Intel Atom-based board with a 4.14.2 kernel.
All tracing points to dma_pool_zalloc() returning the same DMA address block on
consecutive calls.

In the failing case dma_pool_zalloc() is called 3 - 6 us apart.

<...>-26362 [002] ....  1186.756739: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x000000002d92b000
<...>-26362 [002] ....  1186.756745: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x000000002d92b000
<...>-26362 [002] ....  1186.756748: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x000000002d92b000
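
The same check can be run over a whole trace. A minimal userspace sketch, assuming each "xhci_segment_alloc dma @" line has been parsed into a timestamp in microseconds and an address; the struct and function names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One custom "xhci_segment_alloc dma @" trace line, parsed by hand. */
struct seg_alloc_rec {
	uint64_t ts_us; /* trace timestamp converted to microseconds */
	uint64_t dma;   /* DMA address printed by the trace */
};

/*
 * Hypothetical helper: counts records that repeat the DMA address of the
 * record immediately before them, and stores the largest time gap between
 * such a pair in *max_gap_us (0 if there are no repeats).
 */
static size_t count_repeated_allocs(const struct seg_alloc_rec *recs,
				    size_t n, uint64_t *max_gap_us)
{
	size_t repeats = 0;

	assert(max_gap_us != NULL);
	*max_gap_us = 0;
	for (size_t i = 1; i < n; i++) {
		if (recs[i].dma != recs[i - 1].dma)
			continue;
		repeats++;
		uint64_t gap = recs[i].ts_us - recs[i - 1].ts_us;
		if (gap > *max_gap_us)
			*max_gap_us = gap;
	}
	return repeats;
}
```

For the three lines above (1186756739, 1186756745 and 1186756748 us, all at 0x2d92b000) this reports two repeats with gaps of 6 us and 3 us. It does not know about frees, so a repeat is only suspicious if no dma_pool_free() of that address happened in between.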

dma_pool_zalloc() is called from xhci_segment_alloc() in drivers/usb/host/xhci-mem.c
see:
https://elixir.bootlin.com/linux/v4.14.2/source/drivers/usb/host/xhci-mem.c#L52

The prints above are custom traces added right after dma_pool_zalloc():
@@ -44,10 +44,15 @@ static struct xhci_segment *xhci_segment_alloc(struct xhci_hcd *xhci,
  		return NULL;
  	}
  
+	xhci_dbg_trace(xhci,  trace_xhci_ring_mem_detail,
+		       "MATTU xhci_segment_alloc dma @ %pad", &dma);
+

Any idea what's going on?
dma_pool_alloc() has a comment saying that it drops &pool->lock if it needs to
allocate a page; could that be related?

Thanks
-Mathias


Andy Shevchenko June 6, 2018, 3:36 p.m. UTC | #4

On Wed, 2018-06-06 at 17:12 +0300, Mathias Nyman wrote:
> On 04.06.2018 18:28, Sudip Mukherjee wrote:
> > On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote:
> > > 

> Odd and unlikely, but to me this looks like some issue in allocating
> dma memory
> from pool using dma_pool_zalloc()
> 
> Adding people with DMA knowledge to cc, maybe someone knows what is
> going on.
> 
> Here's the story:
> Sudip sees usb issues on a Intel Atom based board with 4.14.2 kernel.
> All tracing points to dma_pool_zalloc() returning the same dma address
> block on
> consecutive calls.
> 
> In the failing case dma_pool_zalloc() is called 3 - 6us apart.
> 
> <...>-26362 [002] ....  1186.756739: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x000000002d92b000
> <...>-26362 [002] ....  1186.756745: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x000000002d92b000
> <...>-26362 [002] ....  1186.756748: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x000000002d92b000
> 
> dma_pool_zalloc() is called from xhci_segment_alloc() in
> drivers/usb/host/xhci-mem.c
> see:
> https://elixir.bootlin.com/linux/v4.14.2/source/drivers/usb/host/xhci-mem.c#L52
> 
> prints above are custom traces added right after dma_pool_zalloc()

For better understanding it would be good to have dma_pool_free() calls
debugged as well.

Is it possible that something running in parallel is just fast enough to free
the allocated resource back to the pool?
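
To illustrate why the free side matters: in a pool that reuses the most recently freed block first, an alloc/free/alloc sequence legitimately returns the same address twice. The toy model below is only a sketch of that idea in userspace C. It is NOT the kernel's dma_pool implementation, and all names in it are made up:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_POOL_BLOCKS 4
#define TOY_BLOCK_SIZE  0x1000u

/* Toy fixed-size pool with a LIFO free stack; NOT the kernel's dma_pool. */
struct toy_pool {
	uint64_t free_stack[TOY_POOL_BLOCKS];
	size_t nfree;
};

static void toy_pool_init(struct toy_pool *p, uint64_t base)
{
	p->nfree = 0;
	/* Push the highest address first so the lowest is handed out first. */
	for (size_t i = TOY_POOL_BLOCKS; i > 0; i--)
		p->free_stack[p->nfree++] = base + (i - 1) * TOY_BLOCK_SIZE;
}

/* Returns a block address, or 0 if the pool is empty. */
static uint64_t toy_pool_alloc(struct toy_pool *p)
{
	return p->nfree ? p->free_stack[--p->nfree] : 0;
}

/* Returns a block to the pool; it becomes the next one handed out. */
static void toy_pool_free(struct toy_pool *p, uint64_t addr)
{
	assert(p->nfree < TOY_POOL_BLOCKS);
	p->free_stack[p->nfree++] = addr;
}
```

In this model two allocations with a free in between give the same address, while two back-to-back allocations never do. That is the distinction a dma_pool_free() trace would let us make in the real logs.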

Sudip Mukherjee June 6, 2018, 4:42 p.m. UTC | #5

On Wed, Jun 06, 2018 at 05:12:21PM +0300, Mathias Nyman wrote:
> On 04.06.2018 18:28, Sudip Mukherjee wrote:
> > On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote:
> > > 
<snip>
> > 
> > Will request you to have a look at it.
> > 
> 
> Odd and unlikely, but to me this looks like some issue in allocating dma memory
> from pool using dma_pool_zalloc()
> 
> Adding people with DMA knowledge to cc, maybe someone knows what is going on.

Thanks Mathias.

--
Regards
Sudip

Sudip Mukherjee June 6, 2018, 4:45 p.m. UTC | #6

Hi Andy,

And we meet again. :)

On Wed, Jun 06, 2018 at 06:36:35PM +0300, Andy Shevchenko wrote:
> On Wed, 2018-06-06 at 17:12 +0300, Mathias Nyman wrote:
> > On 04.06.2018 18:28, Sudip Mukherjee wrote:
> > > On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote:
> > > > 
> 
> > Odd and unlikely, but to me this looks like some issue in allocating
> > dma memory
> > from pool using dma_pool_zalloc()
> > 
> > Adding people with DMA knowledge to cc, maybe someone knows what is
> > going on.
> > 
> > Here's the story:
> > Sudip sees usb issues on a Intel Atom based board with 4.14.2 kernel.
> > All tracing points to dma_pool_zalloc() returning the same dma address
> > block on
> > consecutive calls.
> > 
> > In the failing case dma_pool_zalloc() is called 3 - 6us apart.
> > 
> > <...>-26362 [002] ....  1186.756739: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x000000002d92b000
> > <...>-26362 [002] ....  1186.756745: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x000000002d92b000
> > <...>-26362 [002] ....  1186.756748: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x000000002d92b000
> > 
> > dma_pool_zalloc() is called from xhci_segment_alloc() in
> > drivers/usb/host/xhci-mem.c
> > see:
> > https://elixir.bootlin.com/linux/v4.14.2/source/drivers/usb/host/xhci-mem.c#L52
> > 
> > prints above are custom traces added right after dma_pool_zalloc()
> 
> For better understanding it would be good to have dma_pool_free() calls
> debugged as well.

So, I am adding another trace event for dma_pool_free() and continuing
with the test. Is there anything else that I should be adding as debug?

--
Regards
Sudip
