mbox series

[RFC,0/2] Avoiding DWC3 transfer stalls/hangs when using adb over f_fs

Message ID 20200122222645.38805-1-john.stultz@linaro.org (mailing list archive)
Headers show
Series Avoiding DWC3 transfer stalls/hangs when using adb over f_fs | expand

Message

John Stultz Jan. 22, 2020, 10:26 p.m. UTC
Hey all,
  I wanted to send these out for comment and thoughts.

Since ~4.20, when the functionfs gadget enabled scatter-gather
support, we have seen problems with adb connections stalling and
stopping to function on hardware with dwc3 usb controllers.
Specifically, HiKey960, Dragonboard 845c, and Pixel3 devices.

Initally the workaround we used was to simply disable scatter
gather support on the dwc3 by commenting out the
"dwc->gadget.sg_supported = true;" line.

After working with Fei Yang, who was seeing similar trouble on
Intel dwc3 based hardare, Thinh Nguyen mentioned that a fix had
already been found and pointed me to one of Anurag's patches.

This solved the issue on HiKey960 and I sent it out to the list
but didn't get any feedback.

Additional testing with the Dragonboard 845c found that that
first fix was not sufficient, and so I've sat on the fix
thinking something deeper was amiss and went back to the hack
of disabling sg_supported on all dwc3 platforms. 

In the following months Fei's continued and repeated efforts
didn't seem to get enough review to result in a fix, and they've
since moved on to other work.

Recently, I found that folks at qcom have seen similer issues
and pointed me to the second patch in this series, which does
seem to resolve the issue on the Dragonboard 845c, but not the
HiKey960 on its own.

So I wanted to send these patches out for comment. There's
clearly a number of folks seeing broken behavior for ahwile on
dwc3 hardware, and we're all seeemingly working around it in our
own ways, so either those individual fixes need to get upstream
or we need to figure out some deeper solution to the issue.

So I wanted to send these two out for review and feedback.

thanks
-john

Cc: Felipe Balbi <felipe.balbi@linux.intel.com>
Cc: Yang Fei <fei.yang@intel.com>
Cc: Thinh Nguyen <thinhn@synopsys.com>
Cc: Tejas Joglekar <tejas.joglekar@synopsys.com>
Cc: Andrzej Pietrasiewicz <andrzej.p@collabora.com>
Cc: Jack Pham <jackp@codeaurora.org>
Cc: Todd Kjos <tkjos@google.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Linux USB List <linux-usb@vger.kernel.org>
Cc: stable <stable@vger.kernel.org>

Anurag Kumar Vulisha (2):
  usb: dwc3: gadget: Check for IOC/LST bit in both event->status and
    TRB->ctrl fields
  usb: dwc3: gadget: Correct the logic for finding last SG entry

 drivers/usb/dwc3/gadget.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Comments

Felipe Balbi Jan. 23, 2020, 7:18 a.m. UTC | #1
Hi,

John Stultz <john.stultz@linaro.org> writes:
> Hey all,
>   I wanted to send these out for comment and thoughts.
>
> Since ~4.20, when the functionfs gadget enabled scatter-gather
> support, we have seen problems with adb connections stalling and
> stopping to function on hardware with dwc3 usb controllers.
> Specifically, HiKey960, Dragonboard 845c, and Pixel3 devices.
>
> Initally the workaround we used was to simply disable scatter
> gather support on the dwc3 by commenting out the
> "dwc->gadget.sg_supported = true;" line.
>
> After working with Fei Yang, who was seeing similar trouble on
> Intel dwc3 based hardare, Thinh Nguyen mentioned that a fix had
> already been found and pointed me to one of Anurag's patches.
>
> This solved the issue on HiKey960 and I sent it out to the list
> but didn't get any feedback.
>
> Additional testing with the Dragonboard 845c found that that
> first fix was not sufficient, and so I've sat on the fix
> thinking something deeper was amiss and went back to the hack
> of disabling sg_supported on all dwc3 platforms. 
>
> In the following months Fei's continued and repeated efforts
> didn't seem to get enough review to result in a fix, and they've
> since moved on to other work.
>
> Recently, I found that folks at qcom have seen similer issues
> and pointed me to the second patch in this series, which does
> seem to resolve the issue on the Dragonboard 845c, but not the
> HiKey960 on its own.
>
> So I wanted to send these patches out for comment. There's
> clearly a number of folks seeing broken behavior for ahwile on
> dwc3 hardware, and we're all seeemingly working around it in our
> own ways, so either those individual fixes need to get upstream
> or we need to figure out some deeper solution to the issue.
>
> So I wanted to send these two out for review and feedback.
>
> thanks
> -john
>
> Cc: Felipe Balbi <felipe.balbi@linux.intel.com>
> Cc: Yang Fei <fei.yang@intel.com>
> Cc: Thinh Nguyen <thinhn@synopsys.com>
> Cc: Tejas Joglekar <tejas.joglekar@synopsys.com>
> Cc: Andrzej Pietrasiewicz <andrzej.p@collabora.com>
> Cc: Jack Pham <jackp@codeaurora.org>
> Cc: Todd Kjos <tkjos@google.com>
> Cc: Greg KH <gregkh@linuxfoundation.org>
> Cc: Linux USB List <linux-usb@vger.kernel.org>
> Cc: stable <stable@vger.kernel.org>
>
> Anurag Kumar Vulisha (2):
>   usb: dwc3: gadget: Check for IOC/LST bit in both event->status and
>     TRB->ctrl fields
>   usb: dwc3: gadget: Correct the logic for finding last SG entry

I remember commenting on these patches before and never getting a newer
version from Anurag.
Andrzej Pietrasiewicz Jan. 23, 2020, 8:43 a.m. UTC | #2
Hi John,

W dniu 22.01.2020 o 23:26, John Stultz pisze:
> Hey all,
>    I wanted to send these out for comment and thoughts.
> 
> Since ~4.20, when the functionfs gadget enabled scatter-gather
> support, we have seen problems with adb connections stalling and
> stopping to function on hardware with dwc3 usb controllers.
> Specifically, HiKey960, Dragonboard 845c, and Pixel3 devices.

Any chance this:

https://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb.git/commit/?h=testing/next&id=f63333e8e4fd63d8d8ae83b89d2c38cf21d64801

has something to do with the problem you are reporting?

Andrzej
Yang, Fei Jan. 23, 2020, 4:29 p.m. UTC | #3
>> Hey all,
>>    I wanted to send these out for comment and thoughts.
>> 
>> Since ~4.20, when the functionfs gadget enabled scatter-gather 
>> support, we have seen problems with adb connections stalling and 
>> stopping to function on hardware with dwc3 usb controllers.
>> Specifically, HiKey960, Dragonboard 845c, and Pixel3 devices.
>
> Any chance this:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb.git/commit/?h=testing/next&id=f63333e8e4fd63d8d8ae83b89d2c38cf21d64801
This is a different issue. I have tried initializing num_sgs when debugging this adb stall problem, but it didn't help.

>> has something to do with the problem you are reporting?
>
>> Andrzej
Felipe Balbi Jan. 23, 2020, 5:31 p.m. UTC | #4
Hi,

"Yang, Fei" <fei.yang@intel.com> writes:
>>> Hey all,
>>>    I wanted to send these out for comment and thoughts.
>>> 
>>> Since ~4.20, when the functionfs gadget enabled scatter-gather 
>>> support, we have seen problems with adb connections stalling and 
>>> stopping to function on hardware with dwc3 usb controllers.
>>> Specifically, HiKey960, Dragonboard 845c, and Pixel3 devices.
>>
>> Any chance this:
>> 
>> https://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb.git/commit/?h=testing/next&id=f63333e8e4fd63d8d8ae83b89d2c38cf21d64801
> This is a different issue. I have tried initializing num_sgs when debugging this adb stall problem, but it didn't help.

So multiple folks have run through this problem, but not *one* has
tracepoints collected from the issue? C'mon guys. Can someone, please,
collect tracepoints so we can figure out what's actually going on?

I'm pretty sure this should be solved at the DMA API level, just want to
confirm.

cheers
Yang, Fei Jan. 23, 2020, 5:37 p.m. UTC | #5
>>>> Hey all,
>>>>    I wanted to send these out for comment and thoughts.
>>>> 
>>>> Since ~4.20, when the functionfs gadget enabled scatter-gather 
>>>> support, we have seen problems with adb connections stalling and 
>>>> stopping to function on hardware with dwc3 usb controllers.
>>>> Specifically, HiKey960, Dragonboard 845c, and Pixel3 devices.
>>>
>>> Any chance this:
>>> 
>>> https://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb.git/commit/
>>> ?h=testing/next&id=f63333e8e4fd63d8d8ae83b89d2c38cf21d64801
>> This is a different issue. I have tried initializing num_sgs when debugging this adb stall problem, but it didn't help.
>
> So multiple folks have run through this problem, but not *one* has tracepoints collected from the issue? C'mon guys. Can someone, please, collect tracepoints so we can figure out what's actually going on?
>
> I'm pretty sure this should be solved at the DMA API level, just want to confirm.
I have sent you the tracepoints long time ago. Also my analysis of the problem (BTW, I don't think the tracepoints helped much). It's basically a logic problem in function dwc3_gadget_ep_reclaim_trb_sg().
I can try dig into my old emails and resend, but that is a bit hard to find.

-Fei
Felipe Balbi Jan. 23, 2020, 5:46 p.m. UTC | #6
Hi,

"Yang, Fei" <fei.yang@intel.com> writes:
>>>>> Hey all,
>>>>>    I wanted to send these out for comment and thoughts.
>>>>> 
>>>>> Since ~4.20, when the functionfs gadget enabled scatter-gather 
>>>>> support, we have seen problems with adb connections stalling and 
>>>>> stopping to function on hardware with dwc3 usb controllers.
>>>>> Specifically, HiKey960, Dragonboard 845c, and Pixel3 devices.
>>>>
>>>> Any chance this:
>>>> 
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb.git/commit/
>>>> ?h=testing/next&id=f63333e8e4fd63d8d8ae83b89d2c38cf21d64801
>>> This is a different issue. I have tried initializing num_sgs when debugging this adb stall problem, but it didn't help.
>>
>> So multiple folks have run through this problem, but not *one* has tracepoints collected from the issue? C'mon guys. Can someone, please, collect tracepoints so we can figure out what's actually going on?
>>
>> I'm pretty sure this should be solved at the DMA API level, just want to confirm.
>
> I have sent you the tracepoints long time ago. Also my analysis of the
> problem (BTW, I don't think the tracepoints helped much). It's
> basically a logic problem in function dwc3_gadget_ep_reclaim_trb_sg().

AFAICT, this is caused by DMA API merging pages together when map an
sglist for DMA. While doing that, it does *not* move the SG_END flag
which sg_is_last() checks.

I consider that an overlook on the DMA API, wouldn't you? Why should DMA
API users care if pages were merged or not while mapping the sglist? We
have for_each_sg() and sg_is_last() for a reason.

> I can try dig into my old emails and resend, but that is a bit hard to find.

Don't bother, I'm still not convinced we should fix at the driver level
when sg_is_last() should be working here, unless we should iterate over
num_sgs instead of num_mapped_sgs, though I don't think that's the case
since in that case we would have to chain buffers of size zero.
Yang, Fei Jan. 23, 2020, 6:28 p.m. UTC | #7
>>>>>> Since ~4.20, when the functionfs gadget enabled scatter-gather 
>>>>>> support, we have seen problems with adb connections stalling and 
>>>>>> stopping to function on hardware with dwc3 usb controllers.
>>>>>> Specifically, HiKey960, Dragonboard 845c, and Pixel3 devices.
>>>>>
>>>>> Any chance this:
>>>>> 
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb.git/commi
>>>>> t/
>>>>> ?h=testing/next&id=f63333e8e4fd63d8d8ae83b89d2c38cf21d64801
>>>> This is a different issue. I have tried initializing num_sgs when debugging this adb stall problem, but it didn't help.
>>>
>>> So multiple folks have run through this problem, but not *one* has tracepoints collected from the issue? C'mon guys.
>>> Can someone, please, collect tracepoints so we can figure out what's actually going on?
>>>
>>> I'm pretty sure this should be solved at the DMA API level, just want to confirm.
>>
>> I have sent you the tracepoints long time ago. Also my analysis of the 
>> problem (BTW, I don't think the tracepoints helped much). It's 
>> basically a logic problem in function dwc3_gadget_ep_reclaim_trb_sg().
>
> AFAICT, this is caused by DMA API merging pages together when map an sglist for DMA. While doing that,
> it does *not* move the SG_END flag which sg_is_last() checks.
>
> I consider that an overlook on the DMA API, wouldn't you? Why should DMA API users care if pages were merged or not while mapping the sglist?
> We have for_each_sg() and sg_is_last() for a reason.

Oops, my bad. Actually, I was talking about the other patch, not the one setting num_sgs = 0; I don't know if this patch is really needed, but from
what I remember the DMA API is setting up the num_sgs properly. I agree even if there is a problem initializing num_sgs, it should be fixed in DMA API.

> I can try dig into my old emails and resend, but that is a bit hard to find.
>
> Don't bother, I'm still not convinced we should fix at the driver level when sg_is_last() should be working here,
> unless we should iterate over num_sgs instead of num_mapped_sgs, though I don't think that's the case since
> in that case we would have to chain buffers of size zero.

> --
> balbi
John Stultz Jan. 23, 2020, 7:58 p.m. UTC | #8
On Thu, Jan 23, 2020 at 9:31 AM Felipe Balbi <balbi@kernel.org> wrote:
> "Yang, Fei" <fei.yang@intel.com> writes:
> >>> Hey all,
> >>>    I wanted to send these out for comment and thoughts.
> >>>
> >>> Since ~4.20, when the functionfs gadget enabled scatter-gather
> >>> support, we have seen problems with adb connections stalling and
> >>> stopping to function on hardware with dwc3 usb controllers.
> >>> Specifically, HiKey960, Dragonboard 845c, and Pixel3 devices.
> >>
> >> Any chance this:
> >>
> >> https://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb.git/commit/?h=testing/next&id=f63333e8e4fd63d8d8ae83b89d2c38cf21d64801
> > This is a different issue. I have tried initializing num_sgs when debugging this adb stall problem, but it didn't help.
>
> So multiple folks have run through this problem, but not *one* has
> tracepoints collected from the issue? C'mon guys. Can someone, please,
> collect tracepoints so we can figure out what's actually going on?

Sure, I can do that. Though to be fair, I recall Fei sending out
tracepoint data earlier that didn't get a response.

So attached is trace/regdump data for db845c both in the failure case
and with the patch ("Correct the logic for finding last SG entry").

I'll collect HiKey960 data here after lunch when I can swap over to
that board and will send it along soon.

Thanks so much for taking a look at this!
-john
John Stultz Feb. 5, 2020, 9:03 p.m. UTC | #9
On Thu, Jan 23, 2020 at 9:46 AM Felipe Balbi <balbi@kernel.org> wrote:
> >>>>>
> >>>>> Since ~4.20, when the functionfs gadget enabled scatter-gather
> >>>>> support, we have seen problems with adb connections stalling and
> >>>>> stopping to function on hardware with dwc3 usb controllers.
> >>>>> Specifically, HiKey960, Dragonboard 845c, and Pixel3 devices.
...
> >>
> >> I'm pretty sure this should be solved at the DMA API level, just want to confirm.
> >
> > I have sent you the tracepoints long time ago. Also my analysis of the
> > problem (BTW, I don't think the tracepoints helped much). It's
> > basically a logic problem in function dwc3_gadget_ep_reclaim_trb_sg().
>
> AFAICT, this is caused by DMA API merging pages together when map an
> sglist for DMA. While doing that, it does *not* move the SG_END flag
> which sg_is_last() checks.
>
> I consider that an overlook on the DMA API, wouldn't you? Why should DMA
> API users care if pages were merged or not while mapping the sglist? We
> have for_each_sg() and sg_is_last() for a reason.
>

From an initial look, I agree this is pretty confusing.   dma_map_sg()
can coalesce entries in the sg list, modifying the sg entires
themselves, however, in doing so it doesn't modify the number of
entries in the sglist (nor the end state bit).  That's pretty subtle!

My initial naive attempt to fix the dma-iommu path to set the end bit
at the tail of __finalize_sg() which does the sg entry modifications,
only caused trouble elsewhere, as there's plenty of logic that expects
the number of entries to not change, so having sg_next() return NULL
before that point results in lots of null pointer traversals.

Further, looking at the history, while apparently quirky, this has
been the documented behavior in DMA-API.txt for over almost 14 years
(at least).  It clearly states that that dma_map_api can return fewer
mapped entries then sg entries, and one should loop only that many
times (for_each_sg() having a max number of entries to iterate over
seems specifically for this purpose).  Additionally, it says one must
preserve the original number of entries (not # mapped entries) for
dma_unmap_sg().

So I'm not sure that sg_is_last() is really valid for iterating on
mapped sg lists.

Should it be? Probably (at least with my unfamiliar eyes), but
sg_is_last() has been around for almost as long coexisting with this
behavioral quirk, so I'm also not sure this is the best hill for the
dwc3 driver to die on. :)

The fix here:
  https://lore.kernel.org/lkml/20200122222645.38805-3-john.stultz@linaro.org/
Or maybe the slightly cleaner varient here:
  https://git.linaro.org/people/john.stultz/android-dev.git/commit/?h=dev/db845c-mainline-WIP&id=61a5816aa71ec719e77df9f2260dbbf6bcec7c99
seems like it would correctly address things following the
documentation and avoid the failures we're seeing.

As to if dma_map_sg() should fixup the state bits or properly shrink
the sg list when it coalesces entries, that seems like it would be a
much more intrusive change across quite a bit of the kernel that was
written to follow the documented method. So my confidence that such a
change would make it upstream in a reasonable amount of time isn't
very high, and it seems like a bad idea to block the driver from
working properly in the meantime.

Pulling in Christoph and Jens as I suspect they have more context on
this, and maybe can explain thats its not so quirky with the right
perspective?

Thoughts? Maybe there is an easier way to make it less quirky then
what I imagine?

thanks
-john
Felipe Balbi Feb. 6, 2020, 6:23 a.m. UTC | #10
Hi,

John Stultz <john.stultz@linaro.org> writes:
> On Thu, Jan 23, 2020 at 9:46 AM Felipe Balbi <balbi@kernel.org> wrote:
>> >>>>>
>> >>>>> Since ~4.20, when the functionfs gadget enabled scatter-gather
>> >>>>> support, we have seen problems with adb connections stalling and
>> >>>>> stopping to function on hardware with dwc3 usb controllers.
>> >>>>> Specifically, HiKey960, Dragonboard 845c, and Pixel3 devices.
> ...
>> >>
>> >> I'm pretty sure this should be solved at the DMA API level, just want to confirm.
>> >
>> > I have sent you the tracepoints long time ago. Also my analysis of the
>> > problem (BTW, I don't think the tracepoints helped much). It's
>> > basically a logic problem in function dwc3_gadget_ep_reclaim_trb_sg().
>>
>> AFAICT, this is caused by DMA API merging pages together when map an
>> sglist for DMA. While doing that, it does *not* move the SG_END flag
>> which sg_is_last() checks.
>>
>> I consider that an overlook on the DMA API, wouldn't you? Why should DMA
>> API users care if pages were merged or not while mapping the sglist? We
>> have for_each_sg() and sg_is_last() for a reason.
>>
>
> From an initial look, I agree this is pretty confusing.   dma_map_sg()
> can coalesce entries in the sg list, modifying the sg entires
> themselves, however, in doing so it doesn't modify the number of
> entries in the sglist (nor the end state bit).  That's pretty subtle!
>
> My initial naive attempt to fix the dma-iommu path to set the end bit
> at the tail of __finalize_sg() which does the sg entry modifications,
> only caused trouble elsewhere, as there's plenty of logic that expects
> the number of entries to not change, so having sg_next() return NULL
> before that point results in lots of null pointer traversals.
>
> Further, looking at the history, while apparently quirky, this has
> been the documented behavior in DMA-API.txt for over almost 14 years
> (at least).  It clearly states that that dma_map_api can return fewer
> mapped entries then sg entries, and one should loop only that many
> times (for_each_sg() having a max number of entries to iterate over
> seems specifically for this purpose).  Additionally, it says one must
> preserve the original number of entries (not # mapped entries) for
> dma_unmap_sg().
>
> So I'm not sure that sg_is_last() is really valid for iterating on
> mapped sg lists.
>
> Should it be? Probably (at least with my unfamiliar eyes), but
> sg_is_last() has been around for almost as long coexisting with this
> behavioral quirk, so I'm also not sure this is the best hill for the
> dwc3 driver to die on. :)
>
> The fix here:
>   https://lore.kernel.org/lkml/20200122222645.38805-3-john.stultz@linaro.org/
> Or maybe the slightly cleaner varient here:
>   https://git.linaro.org/people/john.stultz/android-dev.git/commit/?h=dev/db845c-mainline-WIP&id=61a5816aa71ec719e77df9f2260dbbf6bcec7c99

in that case, we don't need to use sg_is_last() at all, since i will
always encode the last entry in the list.

> seems like it would correctly address things following the
> documentation and avoid the failures we're seeing.
>
> As to if dma_map_sg() should fixup the state bits or properly shrink
> the sg list when it coalesces entries, that seems like it would be a
> much more intrusive change across quite a bit of the kernel that was
> written to follow the documented method. So my confidence that such a
> change would make it upstream in a reasonable amount of time isn't
> very high, and it seems like a bad idea to block the driver from
> working properly in the meantime.
>
> Pulling in Christoph and Jens as I suspect they have more context on
> this, and maybe can explain thats its not so quirky with the right
> perspective?
>
> Thoughts? Maybe there is an easier way to make it less quirky then
> what I imagine?

it just seems very counter-intuitive to me that DMA api can coalesce
entries but they're actually still there and drivers have to cope with
this behavior.
Christoph Hellwig Feb. 6, 2020, 7:40 a.m. UTC | #11
On Wed, Feb 05, 2020 at 01:03:51PM -0800, John Stultz wrote:
> On Thu, Jan 23, 2020 at 9:46 AM Felipe Balbi <balbi@kernel.org> wrote:
> > >> I'm pretty sure this should be solved at the DMA API level, just want to confirm.
> > >
> > > I have sent you the tracepoints long time ago. Also my analysis of the
> > > problem (BTW, I don't think the tracepoints helped much). It's
> > > basically a logic problem in function dwc3_gadget_ep_reclaim_trb_sg().
> >
> > AFAICT, this is caused by DMA API merging pages together when map an
> > sglist for DMA. While doing that, it does *not* move the SG_END flag
> > which sg_is_last() checks.
> >
> > I consider that an overlook on the DMA API, wouldn't you? Why should DMA
> > API users care if pages were merged or not while mapping the sglist? We
> > have for_each_sg() and sg_is_last() for a reason.
> >
> 
> >From an initial look, I agree this is pretty confusing.   dma_map_sg()
> can coalesce entries in the sg list, modifying the sg entires
> themselves, however, in doing so it doesn't modify the number of
> entries in the sglist (nor the end state bit).  That's pretty subtle!

dma_map_sg only coalesces the dma address.  The page, offset and len
members are immutable.

The problem is really the design of the scatterlist structure - it
combines immutable input parameters (page, offset, len) and output
parameters (dma_addr, dma_len) in one data structure, and then needs
different accessors depending on which information you care about.
The end marker only works for the "CPU" view.

The right fix is top stop using struct scatterlist, but that is going to
be larger and painful change.  At least for block layer stuff I plan to
incrementally do that, though.

> So I'm not sure that sg_is_last() is really valid for iterating on
> mapped sg lists.
> 
> Should it be? Probably (at least with my unfamiliar eyes), but
> sg_is_last() has been around for almost as long coexisting with this
> behavioral quirk, so I'm also not sure this is the best hill for the
> dwc3 driver to die on. :)

No, it shoudn't.  dma_map_sg returns the number of mapped segments,
and the callers need to remember that.

> 
> The fix here:
>   https://lore.kernel.org/lkml/20200122222645.38805-3-john.stultz@linaro.org/
> Or maybe the slightly cleaner varient here:
>   https://git.linaro.org/people/john.stultz/android-dev.git/commit/?h=dev/db845c-mainline-WIP&id=61a5816aa71ec719e77df9f2260dbbf6bcec7c99
> seems like it would correctly address things following the
> documentation and avoid the failures we're seeing.

The first version is the corret one.  sg_is_last has no meaning for the
"DMA" view of the scatterlist.
Felipe Balbi Feb. 6, 2020, 6:29 p.m. UTC | #12
Hi,

Christoph Hellwig <hch@infradead.org> writes:
> On Wed, Feb 05, 2020 at 01:03:51PM -0800, John Stultz wrote:
>> On Thu, Jan 23, 2020 at 9:46 AM Felipe Balbi <balbi@kernel.org> wrote:
>> > >> I'm pretty sure this should be solved at the DMA API level, just want to confirm.
>> > >
>> > > I have sent you the tracepoints long time ago. Also my analysis of the
>> > > problem (BTW, I don't think the tracepoints helped much). It's
>> > > basically a logic problem in function dwc3_gadget_ep_reclaim_trb_sg().
>> >
>> > AFAICT, this is caused by DMA API merging pages together when map an
>> > sglist for DMA. While doing that, it does *not* move the SG_END flag
>> > which sg_is_last() checks.
>> >
>> > I consider that an overlook on the DMA API, wouldn't you? Why should DMA
>> > API users care if pages were merged or not while mapping the sglist? We
>> > have for_each_sg() and sg_is_last() for a reason.
>> >
>> 
>> >From an initial look, I agree this is pretty confusing.   dma_map_sg()
>> can coalesce entries in the sg list, modifying the sg entires
>> themselves, however, in doing so it doesn't modify the number of
>> entries in the sglist (nor the end state bit).  That's pretty subtle!
>
> dma_map_sg only coalesces the dma address.  The page, offset and len
> members are immutable.

ok

> The problem is really the design of the scatterlist structure - it
> combines immutable input parameters (page, offset, len) and output
> parameters (dma_addr, dma_len) in one data structure, and then needs
> different accessors depending on which information you care about.
> The end marker only works for the "CPU" view.

right

> The right fix is top stop using struct scatterlist, but that is going to
> be larger and painful change.  At least for block layer stuff I plan to
> incrementally do that, though.

I don't think that would be necessary though.

>> So I'm not sure that sg_is_last() is really valid for iterating on
>> mapped sg lists.
>> 
>> Should it be? Probably (at least with my unfamiliar eyes), but
>> sg_is_last() has been around for almost as long coexisting with this
>> behavioral quirk, so I'm also not sure this is the best hill for the
>> dwc3 driver to die on. :)
>
> No, it shoudn't.  dma_map_sg returns the number of mapped segments,
> and the callers need to remember that.

We _do_ remember that:

	unsigned int remaining = req->request.num_mapped_sgs
		- req->num_queued_sgs;

	for_each_sg(sg, s, remaining, i) {
		unsigned int length = req->request.length;
		unsigned int maxp = usb_endpoint_maxp(dep->endpoint.desc);
		unsigned int rem = length % maxp;
		unsigned chain = true;

		if (sg_is_last(s))
			chain = false;

		if (rem && usb_endpoint_dir_out(dep->endpoint.desc) && !chain) {

that req->request.num_mapped_sgs is the returned value. So you're saying
we should test for i == num_mapped_sgs, instead of using
sg_is_last(). Is that it?

Fair enough. Just out of curiosity, then, when *should* we use
sg_is_last()?

cheers
Christoph Hellwig Feb. 6, 2020, 6:41 p.m. UTC | #13
On Thu, Feb 06, 2020 at 08:29:45PM +0200, Felipe Balbi wrote:
> > No, it shoudn't.  dma_map_sg returns the number of mapped segments,
> > and the callers need to remember that.
> 
> We _do_ remember that:

That helps :)

> that req->request.num_mapped_sgs is the returned value. So you're saying
> we should test for i == num_mapped_sgs, instead of using
> sg_is_last(). Is that it?

Yes.

> Fair enough. Just out of curiosity, then, when *should* we use
> sg_is_last()?

Outside of sg_next/sg_last it really shoud not be used at all as far
as I'm concerned.
Felipe Balbi Feb. 7, 2020, 6 a.m. UTC | #14
Hi,

Christoph Hellwig <hch@infradead.org> writes:
> On Thu, Feb 06, 2020 at 08:29:45PM +0200, Felipe Balbi wrote:
>> Fair enough. Just out of curiosity, then, when *should* we use
>> sg_is_last()?
>
> Outside of sg_next/sg_last it really shoud not be used at all as far
> as I'm concerned.

In that case, we may have other drivers with similar issues that just
haven't surfaced:

$ git --no-pager grep -e sg_is_last
drivers/ata/pata_octeon_cf.c:701:			if (!sg_is_last(qc->cursg)) {
drivers/crypto/amcc/crypto4xx_core.c:738:	if (sg_is_last(dst) && force_sd == false) {
drivers/crypto/atmel-sha.c:318:			if ((ctx->sg->length == 0) && !sg_is_last(ctx->sg)) {
drivers/crypto/atmel-sha.c:781:	if (!sg_is_last(sg) && !IS_ALIGNED(sg->length, ctx->block_size))
drivers/crypto/atmel-sha.c:787:	if (sg_is_last(sg)) {
drivers/crypto/ccree/cc_buffer_mgr.c:293:	if (sg_is_last(sg)) {
drivers/crypto/ccree/cc_buffer_mgr.c:305:	} else {  /*sg_is_last*/
drivers/crypto/hisilicon/hpre/hpre_crypto.c:238:	if ((sg_is_last(data) && len == ctx->key_sz) &&
drivers/crypto/marvell/tdma.c:25:		if (sg_is_last(sgiter->sg))
drivers/crypto/mediatek/mtk-sha.c:196:			if ((ctx->sg->length == 0) && !sg_is_last(ctx->sg)) {
drivers/crypto/mediatek/mtk-sha.c:530:	if (!sg_is_last(sg) && !IS_ALIGNED(sg->length, ctx->bs))
drivers/crypto/mediatek/mtk-sha.c:536:	if (sg_is_last(sg)) {
drivers/crypto/mxs-dcp.c:339:			if (actx->fill == out_off || sg_is_last(src) ||
drivers/crypto/qat/qat_common/qat_asym_algs.c:322:		if (sg_is_last(req->src) && req->src_len == ctx->p_size) {
drivers/crypto/qat/qat_common/qat_asym_algs.c:353:	if (sg_is_last(req->dst) && req->dst_len == ctx->p_size) {
drivers/crypto/qat/qat_common/qat_asym_algs.c:730:	if (sg_is_last(req->src) && req->src_len == ctx->key_sz) {
drivers/crypto/qat/qat_common/qat_asym_algs.c:749:	if (sg_is_last(req->dst) && req->dst_len == ctx->key_sz) {
drivers/crypto/qat/qat_common/qat_asym_algs.c:874:	if (sg_is_last(req->src) && req->src_len == ctx->key_sz) {
drivers/crypto/qat/qat_common/qat_asym_algs.c:893:	if (sg_is_last(req->dst) && req->dst_len == ctx->key_sz) {
drivers/crypto/rockchip/rk3288_crypto_ahash.c:238:			if (sg_is_last(dev->sg_src)) {
drivers/crypto/rockchip/rk3288_crypto_skcipher.c:356:			if (sg_is_last(dev->sg_src)) {
drivers/crypto/s5p-sss.c:581:	if (!sg_is_last(dev->sg_dst)) {
drivers/crypto/s5p-sss.c:603:	if (!sg_is_last(dev->sg_src)) {
drivers/crypto/s5p-sss.c:690:		if (sg_is_last(dev->sg_dst))
drivers/crypto/stm32/stm32-hash.c:304:			if ((rctx->sg->length == 0) && !sg_is_last(rctx->sg)) {
drivers/crypto/stm32/stm32-hash.c:568:		if (sg_is_last(sg)) {
drivers/crypto/stm32/stm32-hash.c:595:					  !sg_is_last(sg));
drivers/crypto/stm32/stm32-hash.c:668:			    (!sg_is_last(sg)))
drivers/dma/ipu/ipu_idmac.c:847:		dma_addr_t dma_1 = sg_is_last(desc->sg) ? 0 :
drivers/gpu/drm/i915/i915_gpu_error.c:649:			if (sg_is_last(sg))
drivers/gpu/drm/i915/i915_gpu_error.c:653:		sg = sg_is_last(sg) ? NULL : sg_chain_ptr(sg);
drivers/gpu/drm/i915/i915_gpu_error.c:903:	} while (!sg_is_last(sg++));
drivers/gpu/drm/i915/i915_scatterlist.h:66:	return sg_is_last(sg) ? NULL : ____sg_next(sg);
drivers/hwtracing/intel_th/msu.c:545:	if (sg_is_last(iter->block))
drivers/memstick/core/ms_block.c:43:			if (sg_is_last(sg_from))
drivers/memstick/core/ms_block.c:58:		if (sg_is_last(sg_from) || !len)
drivers/memstick/core/ms_block.c:73:		if (sg_is_last(sg_from) || !len)
drivers/mmc/host/bcm2835.c:485:			if (sg_is_last(sg)) {
drivers/rapidio/devices/tsi721_dma.c:514:		if (sg_is_last(sg)) {
drivers/s390/scsi/zfcp_qdio.h:184:	return sg_is_last(sg) && sg->length <= ZFCP_QDIO_SBALE_LEN;
drivers/scsi/NCR5380.c:171:	if (!cmd->SCp.this_residual && s && !sg_is_last(s)) {
drivers/scsi/NCR5380.c:184:		while (!sg_is_last(s)) {
drivers/scsi/aha152x.c:2019:				    !sg_is_last(CURRENT_SC->SCp.buffer)) {
drivers/scsi/aha152x.c:2125:		    !sg_is_last(CURRENT_SC->SCp.buffer)) {
drivers/scsi/aha152x.c:2155:		while (done > 0 && !sg_is_last(sg)) {
drivers/scsi/qla2xxx/qla_iocb.c:1226:				    sg_is_last(sg)) {
drivers/spi/spi-bcm2835.c:484:	if (bs->tx_buf && !sg_is_last(&tfr->tx_sg.sgl[0]))
drivers/spi/spi-bcm2835.c:487:	if (bs->rx_buf && !sg_is_last(&tfr->rx_sg.sgl[0])) {
drivers/spi/spi-bcm2835.c:491:			if (!bs->tx_buf || sg_is_last(&tfr->tx_sg.sgl[0])) {
drivers/usb/dwc2/gadget.c:861:			sg_is_last(sg));
drivers/usb/dwc3/gadget.c:1074:		if (sg_is_last(s))
drivers/usb/image/microtek.c:507:			   sg_is_last(context->curr_sg) ?
include/linux/devcoredump.h:40:	while (!sg_is_last(iter)) {
include/linux/scatterlist.h:73:#define sg_is_last(sg)		((sg)->page_link & SG_END)
lib/scatterlist.c:25:	if (sg_is_last(sg))
lib/scatterlist.c:109:	BUG_ON(!sg_is_last(ret));
net/core/skbuff.c:4289:			if (unlikely(elt && sg_is_last(&sg[elt - 1])))
net/core/skbuff.c:4311:			if (unlikely(elt && sg_is_last(&sg[elt - 1])))
net/tls/tls_main.c:116:		if (sg_is_last(sg))
net/xfrm/espintcp.c:179:		if (sg_is_last(sg))
samples/kfifo/dma-example.c:79:		if (sg_is_last(&sg[i]))
samples/kfifo/dma-example.c:108:		if (sg_is_last(&sg[i]))
tools/virtio/linux/scatterlist.h:15:#define sg_is_last(sg)		((sg)->page_link & 0x02)
tools/virtio/linux/scatterlist.h:139:	if (sg_is_last(sg))