[2/4] xhci: Mitigate failed set dequeue pointer commands

Message ID 20241016140000.783905-3-mathias.nyman@linux.intel.com (mailing list archive)
State Accepted
Commit fe49df60cdb7c2975aa743dc295f8786e4b7db10
Series xhci fixes for usb-linus

Commit Message

Mathias Nyman Oct. 16, 2024, 1:59 p.m. UTC
Prevent the xHC host from processing a cancelled URB by always turning
cancelled URB TDs into no-op TRBs before queuing a 'Set TR Deq' command.

If the command fails then xHC will start processing the cancelled TD
instead of skipping it once the endpoint is restarted, causing issues like
Babble error.

This is not a complete solution as a failed 'Set TR Deq' command does not
guarantee xHC TRB caches are cleared.

Fixes: 4db356924a50 ("xhci: turn cancelled td cleanup to its own function")
Cc: stable@vger.kernel.org
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
---
 drivers/usb/host/xhci-ring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
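
For readers unfamiliar with what "turning a TD into no-op TRBs" involves, here
is a minimal user-space sketch. It is not the driver's td_to_noop(); the
struct model_trb type and the model_* helpers are hypothetical names made up
for illustration, and the model assumes the TD does not wrap around the end of
the ring. The bit positions and TRB type numbers follow the xHCI TRB control
word layout. Each transfer TRB has its buffer pointer and length cleared and
its type set to No Op, while the cycle bit is kept so ring ownership does not
change. Link TRBs are only unchained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Control word (field[3]) bits and TRB type numbers. */
#define TRB_CYCLE      (1u << 0)
#define TRB_CHAIN      (1u << 4)
#define TRB_TYPE(t)    ((uint32_t)(t) << 10)
#define TRB_TYPE_MASK  (0x3fu << 10)
#define TRB_NORMAL     1u
#define TRB_LINK       6u
#define TRB_TR_NOOP    8u

/* Hypothetical stand-in for a 16-byte transfer TRB. */
struct model_trb {
	uint32_t field[4];
};

static unsigned int model_trb_type(const struct model_trb *trb)
{
	return (trb->field[3] & TRB_TYPE_MASK) >> 10;
}

/*
 * Turn TRBs ring[first..last] (inclusive) into no-ops.
 * Assumption: the TD does not wrap around the end of the ring.
 */
static void model_td_to_noop(struct model_trb *ring, size_t ring_len,
			     size_t first, size_t last)
{
	assert(first <= last && last < ring_len);

	for (size_t i = first; i <= last; i++) {
		struct model_trb *trb = &ring[i];

		if (model_trb_type(trb) == TRB_LINK) {
			/* Link TRBs keep their target; only drop the chain bit. */
			trb->field[3] &= ~TRB_CHAIN;
			continue;
		}

		/* Drop the DMA pointer and length so nothing is transferred. */
		trb->field[0] = 0;
		trb->field[1] = 0;
		trb->field[2] = 0;
		/* Keep only the cycle bit, then mark the TRB as a No Op. */
		trb->field[3] &= TRB_CYCLE;
		trb->field[3] |= TRB_TYPE(TRB_TR_NOOP);
	}
}
```

In this model a cancelled TD no longer carries any DMA address once it has
been rewritten, which is the property the patch relies on if the
'Set TR Deq' command fails.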

Comments

Michał Pecio Oct. 17, 2024, 6:40 a.m. UTC | #1
> Prevent the xHC host from processing a cancelled URB by always turning
> cancelled URB TDs into no-op TRBs before queuing a 'Set TR Deq'
> command.
>
> If the command fails then xHC will start processing the cancelled TD
> instead of skipping it once the endpoint is restarted, causing issues like
> Babble error.
>
> This is not a complete solution as a failed 'Set TR Deq' command does
> not guarantee xHC TRB caches are cleared.

Hmm, wouldn't a long and partially cached TD basically become corrupted
by this overwrite?

For instance, No Op following a chain bit TRB is prohibited by 4.11.7.

4.11.5.1 even goes as far as saying that there are no constraints on
the order in which TRBs are fetched from the ring, not sure how much
"out of order" it can be and if a cached TD could be left with a hole?

If the reason of Set TR Deq failure is an earlier Stop Endpoint failure,
the xHC is executing this TD right now. Or maybe the next one - I guess 
the driver already risks UB when it misses any Stop EP failure.

If it didn't fail, xHC may store some "state" which allows it to restart
a TRB stopped in the middle. It might not expect the TRB to change.


Actually, it would *almost* be better to deal with it by simply leaving
the TRB on the ring and waiting for it to complete. Problem is when it
doesn't execute soon, or ever, leaving the urb_dequeue() caller hanging.

Regards,
Michal
Mathias Nyman Oct. 17, 2024, 1:10 p.m. UTC | #2
On 17.10.2024 9.40, Michał Pecio wrote:
>> Prevent the xHC host from processing a cancelled URB by always turning
>> cancelled URB TDs into no-op TRBs before queuing a 'Set TR Deq'
>> command.
>>
>> If the command fails then xHC will start processing the cancelled TD
>> instead of skipping it once the endpoint is restarted, causing issues like
>> Babble error.
>>
>> This is not a complete solution as a failed 'Set TR Deq' command does
>> not guarantee xHC TRB caches are cleared.
> 
> Hmm, wouldn't a long and partially cached TD basically become corrupted
> by this overwrite?

Unlikely but not impossible.
We already turn all cancelled TDs that we don't stop on into no-ops, so those
would already now experience the same problem.

We stopped the endpoint, and issued a 'Set TR deq' command which is supposed
to clear xHC TRB cache.  I find it hard to believe xHC would continue
by caching some select TRBs of a TD to cache.

But let's say we end up corrupting the TD. It might still be better than
allowing xHC to process the TRBs and write to DMA addresses that might be
freed/reused already.
   
> 
> For instance, No Op following a chain bit TRB is prohibited by 4.11.7.
> 
> 4.11.5.1 even goes as far as saying that there are no constraints on
> the order in which TRBs are fetched from the ring, not sure how much
> "out of order" it can be and if a cached TD could be left with a hole?
> 
> If the reason of Set TR Deq failure is an earlier Stop Endpoint failure,
> the xHC is executing this TD right now. Or maybe the next one - I guess
> the driver already risks UB when it misses any Stop EP failure.
> 
> If it didn't fail, xHC may store some "state" which allows it to restart
> a TRB stopped in the middle. It might not expect the TRB to change.

This should not be an issue.
We don't queue a 'Set TR Deq' command if we intend to continue processing
a stopped TD, as the 'Set TR Deq' is designed to dump all transfer related
state of the endpoint.

> 
> 
> Actually, it would *almost* be better to deal with it by simply leaving
> the TRB on the ring and waiting for it to complete. Problem is when it
> doesn't execute soon, or ever, leaving the urb_dequeue() caller hanging.

We need to give back the cancelled URB at some point, and 'Set TR Deq'
command completion is the latest reasonable place to do it.

After this we should prevent xHC hw from accessing URB DMA pointers.

Thanks
Mathias
Michał Pecio Oct. 17, 2024, 4:14 p.m. UTC | #3
On Thu, 17 Oct 2024 16:10:39 +0300, Mathias Nyman wrote:
> > Hmm, wouldn't a long and partially cached TD basically become
> > corrupted by this overwrite?  
> 
> Unlikely but not impossible.
> We already turn all cancelled TDs that we don't stop on into no-ops,
> so those would already now experience the same problem.

No, I think they wouldn't. A note in xHCI 1.2, section 4.6.9, on page 135
clearly states that xHC shall invalidate cached TRBs besides the current TD.

Same page, point 3, mentions that software "may not modify" the current
TD, whatever on earth that is supposed to mean. Unfortunately, I can't
find a clear "shall not" in 4.6.9, but I would see it as such.

> We stopped the endpoint, and issued a 'Set TR deq' command which is
> supposed to clear xHC TRB cache.  I find it hard to believe xHC would
> continue by caching some select TRBs of a TD to cache.

The idea is, if Set TR Deq fails, the xHC preserves transfer state and
cache and tries to continue. If the TD wasn't fully cached when the xHC
stopped, it remains incomplete. Missing TRBs will be filled with No Ops
when it restarts, yielding an invalid TD (e.g. No Op chained at the end).

So it may turn out that instead of "EP TRB ptr not part of current TD"
something else would show up, perhaps TRB Errors.

> But let's say we end up corrupting the TD. It might still be better
> than allowing xHC to process the TRBs and write to DMA addresses that
> might be freed/reused already.

There is some truth to that, I guess. It's a bummer that those bugs are
here in the first place and no one seems to know where they come from.


Was this tested on HW? I suppose it wouldn't be hard to corrupt a Set
TR Deq command to make it fail, stream 0xffff or something like that.
It may be harder to come up with a realistic test case with long TDs.

Regards,
Michal
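
The partially cached TD scenario described above can be sketched in a small
user-space model. Everything here is hypothetical: struct view_trb and the
view_* helpers are made-up names, and the validity check encodes only the rule
as it is read in this thread (a No Op must not follow a TRB with the chain bit
set), not the full set of constraints in the specification. The model assumes
the controller keeps the first ncached TRBs from its cache and fetches the
remaining ones from the ring after the driver has overwritten it with no-ops.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum view_type { VIEW_NORMAL, VIEW_NOOP };

/* Hypothetical, reduced view of a TRB: just its type and chain bit. */
struct view_trb {
	enum view_type type;
	bool chain;
};

/*
 * Build the TD as the controller might see it after a restart:
 * the first ncached TRBs come from its cache (pre-overwrite contents),
 * the remaining ones are fetched from the ring (post-overwrite contents).
 */
static void view_after_restart(const struct view_trb *cached, size_t ncached,
			       const struct view_trb *ring, size_t n,
			       struct view_trb *out)
{
	assert(ncached <= n);

	for (size_t i = 0; i < n; i++)
		out[i] = (i < ncached) ? cached[i] : ring[i];
}

/* Reject a TD in which a No Op directly follows a chained TRB. */
static bool view_is_valid(const struct view_trb *td, size_t n)
{
	for (size_t i = 1; i < n; i++) {
		if (td[i].type == VIEW_NOOP && td[i - 1].chain)
			return false;
	}
	return true;
}
```

With a four-TRB TD chained on its first three TRBs, ncached values of 0 and 4
give a consistent view in this model (all no-ops, or the original TD), while
any value in between leaves a chained Normal TRB followed by a No Op.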
Mathias Nyman Oct. 18, 2024, 9:59 a.m. UTC | #4
On 17.10.2024 19.14, Michał Pecio wrote:
> On Thu, 17 Oct 2024 16:10:39 +0300, Mathias Nyman wrote:
>>> Hmm, wouldn't a long and partially cached TD basically become
>>> corrupted by this overwrite?
>>
>> Unlikely but not impossible.
>> We already turn all cancelled TDs that we don't stop on into no-ops,
>> so those would already now experience the same problem.
> 
> No, I think they wouldn't. A note in xHCI 1.2, section 4.6.9, on page 135
> clearly states that xHC shall invalidate cached TRBs besides the current TD.
> 
> Same page, point 3, mentions that software "may not modify" the current
> TD, whatever on earth that is supposed to mean. Unfortunately, I can't
> find a clear "shall not" in 4.6.9, but I would see it as such.
> 

Ok, I think we are talking about two different things here.

Point 3 you mentioned is about modifying TDs on the ring and then continuing.
And you are right, xHC should in this case invalidate all future TDs, but
not the current one it stopped on.

I'm talking about point 2, about aborting the current TD where we know
we are queuing a "Set TR Deq" command. Same section states that
Set TR Deq may be used to force xHC to dump any internal state it has for
the ring.

>> We stopped the endpoint, and issued a 'Set TR deq' command which is
>> supposed to clear xHC TRB cache.  I find it hard to believe xHC would
>> continue by caching some select TRBs of a TD to cache.
> 
> The idea is, if Set TR Deq fails, the xHC preserves transfer state and
> cache and tries to continue. If the TD wasn't fully cached when the xHC
> stopped, it remains incomplete. Missing TRBs will be filled with No Ops
> when it restarts, yielding an invalid TD (e.g. No Op chained at the end).
> 
> So it may turn out that instead of "EP TRB ptr not part of current TD"
> something else would show up, perhaps TRB Errors.

If this is how xHC behaves on failed Set TR Deq commands, then yes,
TRB errors are possible.

But if xHC does clear TD cache on failed Set TR Deq command then it's
smooth sailing.

If we don't turn the TD to no-op then xHC is more likely to write to
freed DMA address in both cases above, which I think is worse.

> 
>> But let's say we end up corrupting the TD. It might still be better
>> than allowing xHC to process the TRBs and write to DMA addresses that
>> might be freed/reused already.
> 
> There is some truth to that, I guess. It's a bummer that those bugs are
> here in the first place and no one seems to know where they come from.
> 
> 
> Was this tested on HW? I suppose it wouldn't be hard to corrupt a Set
> TR Deq command to make it fail, stream 0xffff or something like that.
> It may be harder to come up with a realistic test case with long TDs.

Unfortunately no, this patch is an attempt to mitigate the issue seen in
"Strange issues with USB device" [1]. That discussion continued off-list
with a lot more testing and debugging, but I ran out of testing goodwill
before I came up with this partial solution.

1. https://lore.kernel.org/linux-usb/ZsjgmCjHdzck9UKd@alphanet.ch/

Thanks
Mathias
Patch

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 4d664ba53fe9..7dedf31bbddd 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1023,7 +1023,7 @@  static int xhci_invalidate_cancelled_tds(struct xhci_virt_ep *ep)
 					td_to_noop(xhci, ring, cached_td, false);
 					cached_td->cancel_status = TD_CLEARED;
 				}
-
+				td_to_noop(xhci, ring, td, false);
 				td->cancel_status = TD_CLEARING_CACHE;
 				cached_td = td;
 				break;
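
The effect of the one-line change can be illustrated with a simplified
user-space model. This is a sketch under stated assumptions, not the driver
code: struct model_td, the MODEL_* states and both functions are hypothetical,
the only thing tracked is whether a cancelled TD still carries live TRBs on
the ring, and the controller's TRB cache is ignored entirely (the commit
message notes that a failed 'Set TR Deq' does not guarantee the cache is
cleared). The patched flag switches between leaving the TD the controller
stopped on untouched and turning it into no-ops before the command is queued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_cancel_status {
	MODEL_TD_DIRTY,
	MODEL_TD_CLEARING_CACHE,
	MODEL_TD_CLEARED,
};

/* Hypothetical, reduced view of a cancelled TD. */
struct model_td {
	bool is_noop;		/* TRBs on the ring were rewritten as no-ops */
	bool hw_stopped_here;	/* controller dequeue pointer is inside this TD */
	enum model_cancel_status cancel_status;
};

/* Walk the cancelled TDs the way the patched or unpatched loop would. */
static void model_invalidate_cancelled_tds(struct model_td *tds, size_t n,
					   bool patched)
{
	assert(tds != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		struct model_td *td = &tds[i];

		if (td->hw_stopped_here) {
			/* Needs 'Set TR Deq'; with the patch, no-op it first. */
			if (patched)
				td->is_noop = true;
			td->cancel_status = MODEL_TD_CLEARING_CACHE;
		} else {
			/* Not the current TD: no-op it and finish right away. */
			td->is_noop = true;
			td->cancel_status = MODEL_TD_CLEARED;
		}
	}
}

/*
 * Count cancelled TDs that still hold live TRBs on the ring, i.e. what the
 * controller could pick up again if 'Set TR Deq' fails and the endpoint is
 * restarted at the old dequeue pointer.
 */
static size_t model_live_cancelled_tds(const struct model_td *tds, size_t n)
{
	size_t live = 0;

	for (size_t i = 0; i < n; i++) {
		if (!tds[i].is_noop)
			live++;
	}
	return live;
}
```

In this model the unpatched loop leaves exactly the stopped-on TD live on the
ring, while the patched loop leaves none, which matches the stated goal of the
patch within the limits noted above.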