Message ID | 1314456515-16419-1-git-send-email-ming.lei@canonical.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Hi, On Saturday 27 August 2011 08:18 PM, ming.lei@canonical.com wrote: > From: Ming Lei<ming.lei@canonical.com> > > This patch fixs one performance bug on ARM Cortex A9 dual core platform, > which has been reported on quite a few ARM machines(OMAP4, Tegra 2, snowball...), > see details from link of https://bugs.launchpad.net/bugs/709245. > > In fact, one mb() on ARM is enough to flush L2 cache, but > 'dummy->hw_token = token;' after mb() is added just for obeying > correct mb() usage. > Who said "one mb() on ARM is enough to flush L2 cache" ? It's just a memory barrier and it doesn't flush any cache. What it cleans is the CPU write buffers and the L2 cache write buffers. > The patch has been tested ok on OMAP4 panda A1 board, the performance > of 'dd' over usb mass storage can be increased from 4~5MB/sec to > 14~16MB/sec after applying this patch. > Though number looks great, how is the below patch helping to get better numbers. > Signed-off-by: Ming Lei<ming.lei@canonical.com> > --- > drivers/usb/host/ehci-q.c | 14 ++++++++++++++ > 1 files changed, 14 insertions(+), 0 deletions(-) > > diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c > index 0917e3a..65b5021 100644 > --- a/drivers/usb/host/ehci-q.c > +++ b/drivers/usb/host/ehci-q.c > @@ -1082,6 +1082,20 @@ static struct ehci_qh *qh_append_tds ( > wmb (); > dummy->hw_token = token; > > + /* The mb() below is added to make sure that > + * 'token' can be writen into qtd, so that ehci > + * HC can see the up-to-date qtd descriptor. On > + * some archs(at least on ARM Cortex A9 dual core), > + * writing into coherenet memory doesn't mean the > + * value written can reach physical memory > + * immediately, and the value may be buffered > + * inside L2 cache. 'dummy->hw_token = token;' > + * after mb() is added for obeying correct mb() > + * usage. > + * */ > + mb(); > + token = dummy->hw_token; > + This patch at max fix some corruption if the memory buffer used is buffer-able. Infact I see there is already a write memory barrier above. So just pushing that down by one line should be enough. > dummy->hw_token = token; > wmb (); Is there another patch along with this which removes, some cache clean on this buffer ? Regards Santosh
On Sat, Aug 27, 2011 at 10:48:35PM +0800, ming.lei@canonical.com wrote: > From: Ming Lei <ming.lei@canonical.com> > > This patch fixs one performance bug on ARM Cortex A9 dual core platform, > which has been reported on quite a few ARM machines(OMAP4, Tegra 2, snowball...), > see details from link of https://bugs.launchpad.net/bugs/709245. > > In fact, one mb() on ARM is enough to flush L2 cache, but > 'dummy->hw_token = token;' after mb() is added just for obeying > correct mb() usage. Really? A mb() should not be flushing any caches, it's just a memory barrier. Or is ARM somehow "special" in this way? > The patch has been tested ok on OMAP4 panda A1 board, the performance > of 'dd' over usb mass storage can be increased from 4~5MB/sec to > 14~16MB/sec after applying this patch. That's impressive, but I don't think this is really the proper way to do this... > Signed-off-by: Ming Lei <ming.lei@canonical.com> > --- > drivers/usb/host/ehci-q.c | 14 ++++++++++++++ > 1 files changed, 14 insertions(+), 0 deletions(-) > > diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c > index 0917e3a..65b5021 100644 > --- a/drivers/usb/host/ehci-q.c > +++ b/drivers/usb/host/ehci-q.c > @@ -1082,6 +1082,20 @@ static struct ehci_qh *qh_append_tds ( > wmb (); > dummy->hw_token = token; > > + /* The mb() below is added to make sure that > + * 'token' can be writen into qtd, so that ehci > + * HC can see the up-to-date qtd descriptor. On > + * some archs(at least on ARM Cortex A9 dual core), > + * writing into coherenet memory doesn't mean the > + * value written can reach physical memory > + * immediately, and the value may be buffered > + * inside L2 cache. 'dummy->hw_token = token;' > + * after mb() is added for obeying correct mb() > + * usage. > + * */ > + mb(); > + token = dummy->hw_token; Your comment does not match the code, so something is wrong here. greg k-h
On Sat, Aug 27, 2011 at 11:03 PM, Santosh <santosh.shilimkar@ti.com> wrote: > Hi, > > On Saturday 27 August 2011 08:18 PM, ming.lei@canonical.com wrote: >> >> From: Ming Lei<ming.lei@canonical.com> >> >> This patch fixs one performance bug on ARM Cortex A9 dual core platform, >> which has been reported on quite a few ARM machines(OMAP4, Tegra 2, >> snowball...), >> see details from link of https://bugs.launchpad.net/bugs/709245. >> >> In fact, one mb() on ARM is enough to flush L2 cache, but >> 'dummy->hw_token = token;' after mb() is added just for obeying >> correct mb() usage. >> > Who said "one mb() on ARM is enough to flush L2 cache" ? > It's just a memory barrier and it doesn't flush any cache. > What it cleans is the CPU write buffers and the L2 cache > write buffers Yes, your description is more accurate, it should be L2 write buffer, I see mb() will call outer_sync() on ARM. > >> The patch has been tested ok on OMAP4 panda A1 board, the performance >> of 'dd' over usb mass storage can be increased from 4~5MB/sec to >> 14~16MB/sec after applying this patch. >> > Though number looks great, how is the below patch helping to get better > numbers. The patch can make ehci HC see the up-to-date qtd, so make usb transaction executed correctly. If a qtd->token is not updated, maybe IOC is not set or set very late, so interrupt can't be triggered in time, also mistaken 'total bytes to transfer' can make HC work badly. In fact, I have traced the problem and found ehci irq is often delayed by ehci HC. also sometimes ehci irq is lost, so I start to trace ehci driver and find the problem here. > >> Signed-off-by: Ming Lei<ming.lei@canonical.com> >> --- >> drivers/usb/host/ehci-q.c | 14 ++++++++++++++ >> 1 files changed, 14 insertions(+), 0 deletions(-) >> >> diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c >> index 0917e3a..65b5021 100644 >> --- a/drivers/usb/host/ehci-q.c >> +++ b/drivers/usb/host/ehci-q.c >> @@ -1082,6 +1082,20 @@ static struct ehci_qh *qh_append_tds ( >> wmb (); >> dummy->hw_token = token; >> >> + /* The mb() below is added to make sure that >> + * 'token' can be writen into qtd, so that ehci >> + * HC can see the up-to-date qtd descriptor. On >> + * some archs(at least on ARM Cortex A9 dual >> core), >> + * writing into coherenet memory doesn't mean the >> + * value written can reach physical memory >> + * immediately, and the value may be buffered >> + * inside L2 cache. 'dummy->hw_token = token;' >> + * after mb() is added for obeying correct mb() >> + * usage. >> + * */ >> + mb(); >> + token = dummy->hw_token; >> + > > This patch at max fix some corruption if the memory buffer > used is buffer-able. Infact I see there is already a write memory > barrier above. So just pushing that down by one line should > be enough. The above wmb is used to order updating qtd->hw_next and dummy->hw_token. > >> dummy->hw_token = token; >> wmb (); > > Is there another patch along with this which removes, some cache clean > on this buffer ? No, I am not sure the wmb should be merged with mb(). thanks, -- Ming Lei
Hi, On Sat, Aug 27, 2011 at 11:13 PM, Greg KH <greg@kroah.com> wrote: > On Sat, Aug 27, 2011 at 10:48:35PM +0800, ming.lei@canonical.com wrote: >> From: Ming Lei <ming.lei@canonical.com> >> >> This patch fixs one performance bug on ARM Cortex A9 dual core platform, >> which has been reported on quite a few ARM machines(OMAP4, Tegra 2, snowball...), >> see details from link of https://bugs.launchpad.net/bugs/709245. >> >> In fact, one mb() on ARM is enough to flush L2 cache, but >> 'dummy->hw_token = token;' after mb() is added just for obeying >> correct mb() usage. > > Really? A mb() should not be flushing any caches, it's just a memory > barrier. Or is ARM somehow "special" in this way? As Santosh pointed out, mb on ARM will flush L2 write buffer. The description here is wrong. I think the below should make the writing reach into memory on all ARCH after ' token = dummy->hw_token;' is executed. dummy->hw_token = token; mb() token = dummy->hw_token; The above is the idea introduced to fix the problem. > >> The patch has been tested ok on OMAP4 panda A1 board, the performance >> of 'dd' over usb mass storage can be increased from 4~5MB/sec to >> 14~16MB/sec after applying this patch. > > That's impressive, but I don't think this is really the proper way to do > this... > >> Signed-off-by: Ming Lei <ming.lei@canonical.com> >> --- >> drivers/usb/host/ehci-q.c | 14 ++++++++++++++ >> 1 files changed, 14 insertions(+), 0 deletions(-) >> >> diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c >> index 0917e3a..65b5021 100644 >> --- a/drivers/usb/host/ehci-q.c >> +++ b/drivers/usb/host/ehci-q.c >> @@ -1082,6 +1082,20 @@ static struct ehci_qh *qh_append_tds ( >> wmb (); >> dummy->hw_token = token; >> >> + /* The mb() below is added to make sure that >> + * 'token' can be writen into qtd, so that ehci >> + * HC can see the up-to-date qtd descriptor. On >> + * some archs(at least on ARM Cortex A9 dual core), >> + * writing into coherenet memory doesn't mean the >> + * value written can reach physical memory >> + * immediately, and the value may be buffered >> + * inside L2 cache. 'dummy->hw_token = token;' >> + * after mb() is added for obeying correct mb() >> + * usage. >> + * */ >> + mb(); >> + token = dummy->hw_token; > > Your comment does not match the code, so something is wrong here. If you mean "L2 cache flush", I confess to the mistaken description, and will update it later. If you mean others, could you help to point it out? thanks, -- Ming Lei
On Saturday 27 August 2011 08:48 PM, Ming Lei wrote: > On Sat, Aug 27, 2011 at 11:03 PM, Santosh<santosh.shilimkar@ti.com> wrote: >> Hi, >> >> On Saturday 27 August 2011 08:18 PM, ming.lei@canonical.com wrote: >>> >>> From: Ming Lei<ming.lei@canonical.com> >>> >>> This patch fixs one performance bug on ARM Cortex A9 dual core platform, >>> which has been reported on quite a few ARM machines(OMAP4, Tegra 2, >>> snowball...), >>> see details from link of https://bugs.launchpad.net/bugs/709245. >>> >>> In fact, one mb() on ARM is enough to flush L2 cache, but >>> 'dummy->hw_token = token;' after mb() is added just for obeying >>> correct mb() usage. >>> >> Who said "one mb() on ARM is enough to flush L2 cache" ? >> It's just a memory barrier and it doesn't flush any cache. >> What it cleans is the CPU write buffers and the L2 cache >> write buffers > > Yes, your description is more accurate, it should be L2 write buffer, > I see mb() will call outer_sync() on ARM. > >> >>> The patch has been tested ok on OMAP4 panda A1 board, the performance >>> of 'dd' over usb mass storage can be increased from 4~5MB/sec to >>> 14~16MB/sec after applying this patch. >>> >> Though number looks great, how is the below patch helping to get better >> numbers. > > The patch can make ehci HC see the up-to-date qtd, so make usb transaction > executed correctly. If a qtd->token is not updated, maybe IOC is not > set or set very > late, so interrupt can't be triggered in time, also mistaken 'total > bytes to transfer' > can make HC work badly. > > In fact, I have traced the problem and found ehci irq is often delayed > by ehci HC. > also sometimes ehci irq is lost, so I start to trace ehci driver and > find the problem here. > You are missing all of this important description in change log. >> >>> Signed-off-by: Ming Lei<ming.lei@canonical.com> >>> --- >>> drivers/usb/host/ehci-q.c | 14 ++++++++++++++ >>> 1 files changed, 14 insertions(+), 0 deletions(-) >>> >>> diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c >>> index 0917e3a..65b5021 100644 >>> --- a/drivers/usb/host/ehci-q.c >>> +++ b/drivers/usb/host/ehci-q.c >>> @@ -1082,6 +1082,20 @@ static struct ehci_qh *qh_append_tds ( >>> wmb (); >>> dummy->hw_token = token; >>> >>> + /* The mb() below is added to make sure that >>> + * 'token' can be writen into qtd, so that ehci >>> + * HC can see the up-to-date qtd descriptor. On >>> + * some archs(at least on ARM Cortex A9 dual >>> core), >>> + * writing into coherenet memory doesn't mean the >>> + * value written can reach physical memory >>> + * immediately, and the value may be buffered >>> + * inside L2 cache. 'dummy->hw_token = token;' >>> + * after mb() is added for obeying correct mb() >>> + * usage. >>> + * */ >>> + mb(); >>> + token = dummy->hw_token; >>> + >> >> This patch at max fix some corruption if the memory buffer >> used is buffer-able. Infact I see there is already a write memory >> barrier above. So just pushing that down by one line should >> be enough. > > The above wmb is used to order updating qtd->hw_next and > dummy->hw_token. > >> >>> dummy->hw_token = token; >>> wmb (); >> >> Is there another patch along with this which removes, some cache clean >> on this buffer ? > > No, I am not sure the wmb should be merged with mb(). > It will work but I leave that call to you since I don't understand much what that function is doing. Infact I see, excessive usage of wmb() in qh_append_tds() Regards Santosh
On Sat, Aug 27, 2011 at 11:33:26PM +0800, Ming Lei wrote: > Hi, > > On Sat, Aug 27, 2011 at 11:13 PM, Greg KH <greg@kroah.com> wrote: > > On Sat, Aug 27, 2011 at 10:48:35PM +0800, ming.lei@canonical.com wrote: > >> From: Ming Lei <ming.lei@canonical.com> > >> > >> This patch fixs one performance bug on ARM Cortex A9 dual core platform, > >> which has been reported on quite a few ARM machines(OMAP4, Tegra 2, snowball...), > >> see details from link of https://bugs.launchpad.net/bugs/709245. > >> > >> In fact, one mb() on ARM is enough to flush L2 cache, but > >> 'dummy->hw_token = token;' after mb() is added just for obeying > >> correct mb() usage. > > > > Really? A mb() should not be flushing any caches, it's just a memory > > barrier. Or is ARM somehow "special" in this way? > > As Santosh pointed out, mb on ARM will flush L2 write buffer. The > description here is wrong. Then this can't be accepted as-is :) > I think the below should make the writing reach into memory on all > ARCH after ' token = dummy->hw_token;' is executed. > > dummy->hw_token = token; > mb() > token = dummy->hw_token; > > The above is the idea introduced to fix the problem. Are you sure? Have you read the documentation about memory barriers to confirm this? > >> The patch has been tested ok on OMAP4 panda A1 board, the performance > >> of 'dd' over usb mass storage can be increased from 4~5MB/sec to > >> 14~16MB/sec after applying this patch. > > > > That's impressive, but I don't think this is really the proper way to do > > this... > > > >> Signed-off-by: Ming Lei <ming.lei@canonical.com> > >> --- > >> drivers/usb/host/ehci-q.c | 14 ++++++++++++++ > >> 1 files changed, 14 insertions(+), 0 deletions(-) > >> > >> diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c > >> index 0917e3a..65b5021 100644 > >> --- a/drivers/usb/host/ehci-q.c > >> +++ b/drivers/usb/host/ehci-q.c > >> @@ -1082,6 +1082,20 @@ static struct ehci_qh *qh_append_tds ( > >> wmb (); > >> dummy->hw_token = token; > >> > >> + /* The mb() below is added to make sure that > >> + * 'token' can be writen into qtd, so that ehci > >> + * HC can see the up-to-date qtd descriptor. On > >> + * some archs(at least on ARM Cortex A9 dual core), > >> + * writing into coherenet memory doesn't mean the > >> + * value written can reach physical memory > >> + * immediately, and the value may be buffered > >> + * inside L2 cache. 'dummy->hw_token = token;' > >> + * after mb() is added for obeying correct mb() > >> + * usage. > >> + * */ > >> + mb(); > >> + token = dummy->hw_token; > > > > Your comment does not match the code, so something is wrong here. > > If you mean "L2 cache flush", I confess to the mistaken description, > and will update it later. If you mean others, could you help to point it out? I mean others, please read the the last 3 lines of the comment and compare that to the code lines you added. greg k-h
Hello. On 27-08-2011 18:48, ming.lei@canonical.com wrote: > From: Ming Lei<ming.lei@canonical.com> > This patch fixs one performance bug on ARM Cortex A9 dual core platform, > which has been reported on quite a few ARM machines(OMAP4, Tegra 2, snowball...), > see details from link of https://bugs.launchpad.net/bugs/709245. > In fact, one mb() on ARM is enough to flush L2 cache, but > 'dummy->hw_token = token;' after mb() is added just for obeying > correct mb() usage. > The patch has been tested ok on OMAP4 panda A1 board, the performance > of 'dd' over usb mass storage can be increased from 4~5MB/sec to > 14~16MB/sec after applying this patch. > Signed-off-by: Ming Lei<ming.lei@canonical.com> > --- > drivers/usb/host/ehci-q.c | 14 ++++++++++++++ > 1 files changed, 14 insertions(+), 0 deletions(-) > diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c > index 0917e3a..65b5021 100644 > --- a/drivers/usb/host/ehci-q.c > +++ b/drivers/usb/host/ehci-q.c > @@ -1082,6 +1082,20 @@ static struct ehci_qh *qh_append_tds ( > wmb (); > dummy->hw_token = token; > > + /* The mb() below is added to make sure that > + * 'token' can be writen into qtd, so that ehci > + * HC can see the up-to-date qtd descriptor. On > + * some archs(at least on ARM Cortex A9 dual core), > + * writing into coherenet memory doesn't mean the > + * value written can reach physical memory > + * immediately, and the value may be buffered > + * inside L2 cache. 'dummy->hw_token = token;' You meant 'token = dummy->hw_token;'? > + * after mb() is added for obeying correct mb() > + * usage. > + * */ > + mb(); > + token = dummy->hw_token; > + > urb->hcpriv = qh_get (qh); > } > } WBR, Sergei
Hi, Thanks for your comment. On Sun, Aug 28, 2011 at 12:07 AM, Greg KH <greg@kroah.com> wrote: >> As Santosh pointed out, mb on ARM will flush L2 write buffer. The >> description here is wrong. > > Then this can't be accepted as-is :) Yes, I will update it in v1, :-) >> I think the below should make the writing reach into memory on all >> ARCH after ' token = dummy->hw_token;' is executed. >> >> dummy->hw_token = token; >> mb() >> token = dummy->hw_token; >> >> The above is the idea introduced to fix the problem. > > Are you sure? Have you read the documentation about memory barriers to > confirm this? I read the doc again, :-), and it mentions few about mb/wmb/rmb, I think my above description is still not correct. Generally speaking, mb only means there is a order between two accesses. Now I think only one mb() after 'dummy->hw_token = token;' is enough: HC will read the up-to-date value of qtd->hw_token after mb() is executed because of the effect of the mb(), which should be guaranteed by mb. > I mean others, please read the the last 3 lines of the comment and > compare that to the code lines you added. I see now, the comment of the last 3 lines is wrong, should be * inside L2 cache. 'token = dummy->hw_token' * after mb() is added for obeying correct mb() * usage. But the 'token = dummy->hw_token' after mb() isn't needed any more as described above, is it? thanks, -- Ming Lei
Hi, On Sun, Aug 28, 2011 at 12:57 AM, Ming Lei <ming.lei@canonical.com> wrote: >> Are you sure? Have you read the documentation about memory barriers to >> confirm this? > > I read the doc again, :-), and it mentions few about mb/wmb/rmb, I think > my above description is still not correct. Generally speaking, mb only > means there is a order between two accesses. > > Now I think only one mb() after 'dummy->hw_token = token;' is enough: > HC will read the up-to-date value of qtd->hw_token after mb() is executed > because of the effect of the mb(), which should be guaranteed by mb. Looks like there is still another similar problem in qh_link_async(): the last wmb should be changed into mb, because HC will read 'head->hw->hw_next' from qh descriptor and this pointer in qh is read only for HC. But this problem can't be observed on ARM, since wmb on ARM is same with mb. thanks, -- Ming Lei
On Sun, 28 Aug 2011, Ming Lei wrote: > I read the doc again, :-), and it mentions few about mb/wmb/rmb, I think > my above description is still not correct. Generally speaking, mb only > means there is a order between two accesses. > > Now I think only one mb() after 'dummy->hw_token = token;' is enough: > HC will read the up-to-date value of qtd->hw_token after mb() is executed > because of the effect of the mb(), which should be guaranteed by mb. I've been following this whole discussion. I didn't understand the idea behind the original patch, and I don't understand this. What reason is there for adding a memory barrier after the "dummy->hw_token = token" line? Such a barrier would not accomplish anything useful, because there are no later memory accesses which need to be ordered with respect to that line. (By the way, Santosh appears to be right about the earlier wmb() in this routine. As far as I can see, it isn't needed. David Brownell probably wrote it just as a precaution.) > > I mean others, please read the the last 3 lines of the comment and > > compare that to the code lines you added. > > I see now, the comment of the last 3 lines is wrong, should be > > * inside L2 cache. 'token = dummy->hw_token' > * after mb() is added for obeying correct mb() > * usage. > > But the 'token = dummy->hw_token' after mb() isn't needed any > more as described above, is it? I don't see why it was ever needed at all. The compiler will realize that token is a dead variable at that point in the routine and will optimize the new line away. > The patch can make ehci HC see the up-to-date qtd, so make usb transaction > executed correctly. If a qtd->token is not updated, maybe IOC is not set or set very > late, so interrupt can't be triggered in time, also mistaken 'total bytes to transfer' > can make HC work badly. This doesn't make any sense. qtd becomes the new dummy; by the time the HC sees qtd the only important bit set in qtd->hw_token is the HALT bit. > In fact, I have traced the problem and found ehci irq is often delayed by ehci HC. > also sometimes ehci irq is lost, so I start to trace ehci driver and find the problem here. I don't think you have identified the underlying cause correctly. Something else must be responsible. Alan Stern
On Sun, 28 Aug 2011, Ming Lei wrote: > Looks like there is still another similar problem in qh_link_async(): > the last wmb > should be changed into mb, because HC will read 'head->hw->hw_next' from qh > descriptor and this pointer in qh is read only for HC. But this problem can't be > observed on ARM, since wmb on ARM is same with mb. It doesn't matter what the HC does -- the wmb() instruction is executed by the CPU, not the HC. The point of that instruction is to make sure that the qh->hw->hw_next = head->hw->hw_next; line (and all the preceding lines as well) is ordered before the head->hw->hw_next = dma; 3C line. Since both of these lines are writes, not reads, it suffices to use wmb() rather than mb(). Alan Stern
Hi, Thanks for your comment. On Sun, Aug 28, 2011 at 4:06 AM, Alan Stern <stern@rowland.harvard.edu> wrote: > On Sun, 28 Aug 2011, Ming Lei wrote: > >> I read the doc again, :-), and it mentions few about mb/wmb/rmb, I think >> my above description is still not correct. Generally speaking, mb only >> means there is a order between two accesses. >> >> Now I think only one mb() after 'dummy->hw_token = token;' is enough: >> HC will read the up-to-date value of qtd->hw_token after mb() is executed >> because of the effect of the mb(), which should be guaranteed by mb. > > I've been following this whole discussion. I didn't understand the > idea behind the original patch, and I don't understand this. What > reason is there for adding a memory barrier after the "dummy->hw_token > = token" line? Such a barrier would not accomplish anything useful, > because there are no later memory accesses which need to be ordered > with respect to that line. Here, mb is used to synchronize between writing of CPU and reading of ehci HC, see below: CPU EHCI dummy->hw_token = token; mb read dummy->hw_token The mb() introduced was supposed to be used to make sure that EHCI can see the correct value of dummy->hw_token. If EHCI can't get the correct hw_token in time, it will work badly, such as irq delay or irq lost which may be caused by mistaken 'IOC' or 'total bytes to transfer' in hw_token. This is just what the patch is to fix. But I think the above is still not correct, and the correct way may be like below: CPU EHCI dummy->hw_token = token; wmb qh->hw->hw_qtd_next = QTD_NEXT(ehci, dummy->qtd_dma); fetch next qtd from qh->hw->hw_qtd_next read qtd->hw_token The problem is that qh has already linked dummy into its queue before(as did in qh_update), so even after 'dummy->hw_token = token;' is executed on CPU, EHCI still may not fetch the up-to-date value of the qtd(dummy in here), and this should be the root cause, IMO. I will figure out a elegant way to handle this race. > (By the way, Santosh appears to be right about the earlier wmb() in > this routine. As far as I can see, it isn't needed. David Brownell > probably wrote it just as a precaution.) > >> > I mean others, please read the the last 3 lines of the comment and >> > compare that to the code lines you added. >> >> I see now, the comment of the last 3 lines is wrong, should be >> >> * inside L2 cache. 'token = dummy->hw_token' >> * after mb() is added for obeying correct mb() >> * usage. >> >> But the 'token = dummy->hw_token' after mb() isn't needed any >> more as described above, is it? > > I don't see why it was ever needed at all. The compiler will realize > that token is a dead variable at that point in the routine and will > optimize the new line away. Yes, it is wrong, as I said before. > >> The patch can make ehci HC see the up-to-date qtd, so make usb transaction >> executed correctly. If a qtd->token is not updated, maybe IOC is not set or set very >> late, so interrupt can't be triggered in time, also mistaken 'total bytes to transfer' >> can make HC work badly. > > This doesn't make any sense. qtd becomes the new dummy; by the time > the HC sees qtd the only important bit set in qtd->hw_token is the HALT > bit. Here, I mean the 'qtd' is the 'dummy' in code, and I described it in such way because 'dummy' has been scheduled into qh queue, so qtd should be its new identity, a little confusion about the description, :-) > >> In fact, I have traced the problem and found ehci irq is often delayed by ehci HC. >> also sometimes ehci irq is lost, so I start to trace ehci driver and find the problem here. > > I don't think you have identified the underlying cause correctly. > Something else must be responsible. I have explained above, so please point it out if anything is still wrong. thanks, -- Ming Lei
Hi, On Sun, Aug 28, 2011 at 4:11 AM, Alan Stern <stern@rowland.harvard.edu> wrote: > On Sun, 28 Aug 2011, Ming Lei wrote: > >> Looks like there is still another similar problem in qh_link_async(): >> the last wmb >> should be changed into mb, because HC will read 'head->hw->hw_next' from qh >> descriptor and this pointer in qh is read only for HC. But this problem can't be >> observed on ARM, since wmb on ARM is same with mb. > > It doesn't matter what the HC does -- the wmb() instruction is executed > by the CPU, not the HC. The point of that instruction is to make sure > that the > > qh->hw->hw_next = head->hw->hw_next; > > line (and all the preceding lines as well) is ordered before the > > head->hw->hw_next = dma; > 3C > line. Since both of these lines are writes, not reads, it suffices to > use wmb() rather than mb(). oops, I added a wmb() after the line of 'head->hw->hw_next = dma;', in my debug code, and not found that it is my local modification before sending out last mail. (I described it as 'the last wmb' in fact), sorry for the noise, please ignore it. thanks, -- Ming Lei
On Sun, Aug 28, 2011 at 01:00:07PM -0400, Alan Stern wrote: > It won't do that. All it will do is guarantee that the CPU writes out > dumy->hw_token before it writes out or reads in any values executed > after the mb. You're right from the perspective of how things are defined today. However, that isn't how things work on ARM. With ARMv6 and ARMv7, we have weak memory ordering. This includes so called "DMA coherent" memory. This means that the architecture does not guarantee the order of writes to DMA coherent memory (which is non- cacheable normal memory) without an intervening 'data synchronization barrier' (dsb). Even that may not be sufficient without also poking at the L2 cache controller. We get around some of that by ensuring that our MMIO read/write macros contain the necessary barriers to ensure that DMA memory is up to date before the DMA agent is programmed. However, this doesn't cater for agents which continue to run in the background. These agents will need some kind of barrier to ensure that the write becomes visible - there's no way to get around that. Maybe we need yet another new barrier macro...
On Mon, 29 Aug 2011, Russell King - ARM Linux wrote: > On Sun, Aug 28, 2011 at 01:00:07PM -0400, Alan Stern wrote: > > It won't do that. All it will do is guarantee that the CPU writes out > > dumy->hw_token before it writes out or reads in any values executed > > after the mb. > > You're right from the perspective of how things are defined today. However, > that isn't how things work on ARM. > > With ARMv6 and ARMv7, we have weak memory ordering. This includes so > called "DMA coherent" memory. This means that the architecture does not > guarantee the order of writes to DMA coherent memory (which is non- > cacheable normal memory) without an intervening 'data synchronization > barrier' (dsb). Even that may not be sufficient without also poking > at the L2 cache controller. > > We get around some of that by ensuring that our MMIO read/write macros > contain the necessary barriers to ensure that DMA memory is up to date > before the DMA agent is programmed. However, this doesn't cater for > agents which continue to run in the background. > > These agents will need some kind of barrier to ensure that the write > becomes visible - there's no way to get around that. Maybe we need > yet another new barrier macro... Hmmm. Although the semantics of the various mb() macros were originally defined only for inter-CPU synchronization, I believe they are also supposed to work for guaranteeing the order of accesses to DMA-coherent memory. If that's not the case with ARM, something is seriously wrong. (Maybe I'm wrong about this, but if I am then there's currently _no_ way for the kernel to order DMA-coherent accesses on ARM.) You know better than I do what is needed to resolve the ordering issue. However, contrary to what the original patch description said, this isn't entirely a matter of making the write visible to the host controller: No doubt in time the write will eventually become visible anyway. It's a matter of making the write become visible reasonably quickly and in the correct order with respect to other writes. Is this extra L2-cache "poke" needed for proper ordering, or is it needed merely to flush the write out to memory in a timely manner? Alan Stern
On Sun, Aug 28, 2011 at 09:51:10PM -0400, Alan Stern wrote: > Hmmm. Although the semantics of the various mb() macros were > originally defined only for inter-CPU synchronization, I believe they > are also supposed to work for guaranteeing the order of accesses to > DMA-coherent memory. If that's not the case with ARM, something is > seriously wrong. (Maybe I'm wrong about this, but if I am then there's > currently _no_ way for the kernel to order DMA-coherent accesses on > ARM.) That is the case with ARM - mb() and wmb() does everything that's required. rmb() is weaker than the other two. > You know better than I do what is needed to resolve the ordering issue. > However, contrary to what the original patch description said, this > isn't entirely a matter of making the write visible to the host > controller: No doubt in time the write will eventually become visible > anyway. It's a matter of making the write become visible reasonably > quickly and in the correct order with respect to other writes. I'm not entirely sure what the problem is - I think its about a write by the CPU to dma coherent memory being delayed and not being visible to the HC in a timely manner. Either mb() or wmb() placed after the write on ARM will do that - and ARM has no requirement to do a read- back after the barrier. > Is this extra L2-cache "poke" needed for proper ordering, or is it > needed merely to flush the write out to memory in a timely manner? Both, though primerily it's about ensuring correct ordering. A side effect of it is that it will flush all pending writes in L2 before completing. From the theoretical viewpoint, I think I'm right to say that mb() doesn't need to provide that level of ordering as its supposed to be an inter-CPU barrier - which probably means we need to invent a new barrier to deal with DMA memory ordering. However, given the difficulty of getting the existing barriers placed correctly, I don't think inventing new barriers is a very good idea. What we can do is view devices which perform DMA as being strongly ordered with respect to their memory accesses - iow, they have an implicit memory barrier before and after their accesses to memory. This would make the CPUs use of mb() have a conceptual pairing with the DMA agents.
On Mon, 29 Aug 2011, Russell King - ARM Linux wrote: > > You know better than I do what is needed to resolve the ordering issue. > > However, contrary to what the original patch description said, this > > isn't entirely a matter of making the write visible to the host > > controller: No doubt in time the write will eventually become visible > > anyway. It's a matter of making the write become visible reasonably > > quickly and in the correct order with respect to other writes. > > I'm not entirely sure what the problem is - I think its about a write > by the CPU to dma coherent memory being delayed and not being visible > to the HC in a timely manner. Either mb() or wmb() placed after the > write on ARM will do that - and ARM has no requirement to do a read- > back after the barrier. Okay, then this needs to be done in a way that won't slow down other architectures with an unnecessary memory barrier. And there needs to be a comment in the code explaining that the new mb() instruction isn't being used as a memory barrier but rather to expedite writeback of the L2 cache. This certainly is starting to sound like something that needs to be addressed in the arch-specific #include files... > > Is this extra L2-cache "poke" needed for proper ordering, or is it > > needed merely to flush the write out to memory in a timely manner? > > Both, though primerily it's about ensuring correct ordering. A side > effect of it is that it will flush all pending writes in L2 before > completing. > > From the theoretical viewpoint, I think I'm right to say that mb() > doesn't need to provide that level of ordering as its supposed to be > an inter-CPU barrier - which probably means we need to invent a new > barrier to deal with DMA memory ordering. However, given the > difficulty of getting the existing barriers placed correctly, I don't > think inventing new barriers is a very good idea. > > What we can do is view devices which perform DMA as being strongly > ordered with respect to their memory accesses - iow, they have an > implicit memory barrier before and after their accesses to memory. > This would make the CPUs use of mb() have a conceptual pairing with > the DMA agents. Yes, that's the model I have been using all along. After all, if a DMA master carries out its memory accesses in some random order then it's impossible for the CPU to make any guarantees. Alan Stern
Hi, On Mon, Aug 29, 2011 at 1:00 AM, Alan Stern <stern@rowland.harvard.edu> wrote: > On Sun, 28 Aug 2011, Ming Lei wrote: > >> > I've been following this whole discussion. I didn't understand the >> > idea behind the original patch, and I don't understand this. What >> > reason is there for adding a memory barrier after the "dummy->hw_token >> > = token" line? Such a barrier would not accomplish anything useful, >> > because there are no later memory accesses which need to be ordered >> > with respect to that line. >> >> Here, mb is used to synchronize between writing of CPU and reading of >> ehci HC, see below: > > A memory barrier never synchronizes events between two processors (for > example, between the CPU and HC). It only affects ordering for the > processor it executes on. That's why you always have to use memory > barriers in pairs for inter-CPU synchronization: one memory barrier > instruction for each CPU. IMO, for SMP case, memory barrier pairs(smp_*) are required to synchronize events between two processors. mb/wmb/rmb are mainly used to synchronize events between CPU and device, and one barrier may be enough, for example: CPU device A=1; wmb B=2; read B read A one wmb is used to order 'A=1' and 'B=2', which will make the two write operations reach to physical memory as the order: 'A=1' first, 'B=1' second. Then the device can observe the two write events as the order above, so if device has seen 'B==2', then device will surely see 'A==1'. > >> CPU EHCI >> dummy->hw_token = token; >> mb >> read dummy->hw_token >> >> The mb() introduced was supposed to be used to make sure that EHCI >> can see the correct value of dummy->hw_token. > > It won't do that. All it will do is guarantee that the CPU writes out > dumy->hw_token before it writes out or reads in any values executed > after the mb. > >> If EHCI can't get the correct >> hw_token in time, it will work badly, such as irq delay or irq lost which may be >> caused by mistaken 'IOC' or 'total bytes to transfer' in hw_token. > > No. All that will happen is that the HC will execute the qTD a little > later, when it does read the correct hw_token. No IRQs will be lost, > and there will be no mistaken IOC or "total bytes to transfer" values. IMO, the dummy has been linked into queue pointed by qh->hw->hw_qtd_next, so EHCI will fetch dummy qtd and execute the transaction and will not have any delay on the transaction. Let me explain the problem again: On ARM, the wmb() before 'dummy->hw_token = token;' will flush l2 write buffer into memory and all parts of 'dummy' except for hw_token field have reached into memory already, but dummy->hw_token will stay at l2 write buffer and not reach into memory at this time, so ehci may fetch a inconsistent qtd and execute it, then mistaken IOC or "total bytes to transfer" are read by EHCI and cause delayed irq or lost irq. It is not only a reasoning or guess, and I have traced this kind of fact certainly. > >> This is just >> what the patch is to fix. > > The patch won't fix this. You're trying to force the CPU to write out > dummy->hw_token more quickly. A memory barrier won't do that. And > even if the CPU does write out the value more quickly, that doesn't > guarantee it will be in time for the HC to see the new value right > away. > >> But I think the above is still not correct, and the correct way may be >> like below: >> >> CPU EHCI >> dummy->hw_token = token; >> wmb >> qh->hw->hw_qtd_next = QTD_NEXT(ehci, dummy->qtd_dma); > > If you do this write, you will corrupt the hardware list. You _must_ > _not_ modify the hardware fields of an active QH. Yes, I agree on it. hw_qtd_next is write by HC and qh_refresh(), and it should be written only by cpu when the qh is in idle state. > Besides, qh_link_async() calls qh_refresh() -> qh_update(), which does > this already. And it does this correctly, with the appropriate checks > to make sure the QH isn't active. > >> fetch next qtd from qh->hw->hw_qtd_next >> read qtd->hw_token >> >> The problem is that qh has already linked dummy into its queue before(as did in >> qh_update), so even after 'dummy->hw_token = token;' is executed on CPU, EHCI >> still may not fetch the up-to-date value of the qtd(dummy in here), >> and this should >> be the root cause, IMO. > > There's no way to fix this. The CPU and the HC are independent. The > HC is allowed to cache values whenever it likes (within certain > limits). If it has already cached an old value from the qTD, there's > no way to force it to use a new value. You just have to wait until it > refreshes its cache. I don't think ehci will cache something inside hardware, and it will always fetch qtd linked into qh from memory(not cache) to qh transfer overlay. We can find it in ehci spec 4.10.2. As I explained above in the 'wmb' example, I hope we can use the current memory barrier rules to fix this kind of problem and avoid to invent new barrier. But because we can't write qh->hw->hw_qtd_next when the qh has been linked into ehci async queue, it is not doable. thanks, -- Ming Lei
On Mon, 29 Aug 2011, Ming Lei wrote: > IMO, the dummy has been linked into queue pointed by qh->hw->hw_qtd_next, > so EHCI will fetch dummy qtd and execute the transaction and will not have > any delay on the transaction. > > Let me explain the problem again: On ARM, the wmb() before > 'dummy->hw_token = token;' > will flush l2 write buffer into memory and all parts of 'dummy' except > for hw_token field > have reached into memory already, but dummy->hw_token will stay at l2 > write buffer > and not reach into memory at this time, so ehci may fetch a > inconsistent qtd and execute it, > then mistaken IOC or "total bytes to transfer" are read by EHCI and > cause delayed irq > or lost irq. No. Even if the HC reads dummy before dummy->hw_token has been written out to memory from the L2 cache, it will not see any inconsistencies. It will see the old value in hw_token, which has the ACTIVE bit clear. Therefore it will not try to execute the qTD but will move on to the next QH. See what the fourth paragraph in section 4.10.2 of the EHCI spec says about the case where a qTD's ACTIVE bit is set to 0. Some EHCI implementations have a quirk, in which they perform the overlay even when ACTIVE is clear. But even these implementations won't try to execute the qTD, because the old value of dummy->hw_token also has the HALT bit set. > It is not only a reasoning or guess, and I have traced this kind of > fact certainly. If your controller behaves as you suggest then it is buggy. And in that case, adding another memory barrier won't fix it. There is still the possibility that the HC will read dummy during the brief time after the existing wmb() and before the CPU has written dummy->hw_token. Alan Stern
Hi, On Mon, Aug 29, 2011 at 11:03 PM, Alan Stern <stern@rowland.harvard.edu> wrote: > On Mon, 29 Aug 2011, Ming Lei wrote: > >> IMO, the dummy has been linked into queue pointed by qh->hw->hw_qtd_next, >> so EHCI will fetch dummy qtd and execute the transaction and will not have >> any delay on the transaction. >> >> Let me explain the problem again: On ARM, the wmb() before >> 'dummy->hw_token = token;' >> will flush l2 write buffer into memory and all parts of 'dummy' except >> for hw_token field >> have reached into memory already, but dummy->hw_token will stay at l2 >> write buffer >> and not reach into memory at this time, so ehci may fetch a >> inconsistent qtd and execute it, >> then mistaken IOC or "total bytes to transfer" are read by EHCI and >> cause delayed irq >> or lost irq. > > No. Even if the HC reads dummy before dummy->hw_token has been written > out to memory from the L2 cache, it will not see any inconsistencies. > It will see the old value in hw_token, which has the ACTIVE bit clear. > Therefore it will not try to execute the qTD but will move on to the > next QH. See what the fourth paragraph in section 4.10.2 of the EHCI > spec says about the case where a qTD's ACTIVE bit is set to 0. Suppose HC can see the old value in hw_token, which has the ACTIVE bit clear. The qtd transaction will not be executed until the new token is flushed into memory. From view of CPU, the irq is still be delayed because ioc irq is generated after the qtd transaction is executed when new token is flushed into memory. The delay depends on how long the token will be flushed into memory. From my observation, the delay is commonly over 5ms for CSW, sometimes the delay is more than 20ms, so caused the bad performance. Also I am not sure if EHCI can read the old hw_token correctly in this kind of inconsistent memory state. > > Some EHCI implementations have a quirk, in which they perform the > overlay even when ACTIVE is clear. But even these implementations > won't try to execute the qTD, because the old value of dummy->hw_token > also has the HALT bit set. > >> It is not only a reasoning or guess, and I have traced this kind of >> fact certainly. > > If your controller behaves as you suggest then it is buggy. And in > that case, adding another memory barrier won't fix it. There is still > the possibility that the HC will read dummy during the brief time after > the existing wmb() and before the CPU has written dummy->hw_token. No, the mb after 'dummy->hw_token=token' does fix the problem. As said above, IOC IRQ is surely delayed from view of CPU. thanks, -- Ming Lei
Hi, On Mon, Aug 29, 2011 at 9:57 PM, Alan Stern <stern@rowland.harvard.edu> wrote: > On Mon, 29 Aug 2011, Russell King - ARM Linux wrote: > >> > You know better than I do what is needed to resolve the ordering issue. >> > However, contrary to what the original patch description said, this >> > isn't entirely a matter of making the write visible to the host >> > controller: No doubt in time the write will eventually become visible >> > anyway. It's a matter of making the write become visible reasonably >> > quickly and in the correct order with respect to other writes. >> >> I'm not entirely sure what the problem is - I think its about a write >> by the CPU to dma coherent memory being delayed and not being visible >> to the HC in a timely manner. Either mb() or wmb() placed after the >> write on ARM will do that - and ARM has no requirement to do a read- >> back after the barrier. > > Okay, then this needs to be done in a way that won't slow down other > architectures with an unnecessary memory barrier. And there needs to > be a comment in the code explaining that the new mb() instruction isn't > being used as a memory barrier but rather to expedite writeback of the > L2 cache. If writing to coherent memory can't reach physical memory immediately on other ARCHs, the problem can still happen on these ARCHs. But I am not sure if there are these kind of ARCHs except for ARM. Anyway, current memory barriers in qh_append_tds() can't prevent the problem from happening on ARM. If no better solutions, maybe we have to use 'mb()' after 'dummy->hw_token = token' to fix the problem: > > This certainly is starting to sound like something that needs to be > addressed in the arch-specific #include files... > >> > Is this extra L2-cache "poke" needed for proper ordering, or is it >> > needed merely to flush the write out to memory in a timely manner? >> >> Both, though primerily it's about ensuring correct ordering. A side >> effect of it is that it will flush all pending writes in L2 before >> completing. >> >> From the theoretical viewpoint, I think I'm right to say that mb() >> doesn't need to provide that level of ordering as its supposed to be >> an inter-CPU barrier - which probably means we need to invent a new >> barrier to deal with DMA memory ordering. However, given the >> difficulty of getting the existing barriers placed correctly, I don't >> think inventing new barriers is a very good idea. >> >> What we can do is view devices which perform DMA as being strongly >> ordered with respect to their memory accesses - iow, they have an >> implicit memory barrier before and after their accesses to memory. >> This would make the CPUs use of mb() have a conceptual pairing with >> the DMA agents. > > Yes, that's the model I have been using all along. After all, if a DMA > master carries out its memory accesses in some random order then it's > impossible for the CPU to make any guarantees. > > Alan Stern > >
On Mon, 2011-08-29 at 23:55 +0800, Ming Lei wrote: > If writing to coherent memory can't reach physical memory immediately on > other ARCHs, the problem can still happen on these ARCHs. But I am > not sure if there are these kind of ARCHs except for ARM. If there is write buffering which prevents outside bus masters from seeing the data, can we really call it coherent memory? --Mark
Hi Alan, Thanks for your comment. On Tue, Aug 30, 2011 at 12:33 AM, Alan Stern <stern@rowland.harvard.edu> wrote: > Nevertheless, it remains true that you want to add a memory barrier > instruction simply in order to speed up a cache writeback, not to force > any sort of ordering. This needs to be explained carefully in the code > (not just in the patch description!) and it needs to be done in a way > that won't affect other architectures. OK, will do it. > Also, as you mentioned before, you may want to do the same thing in > qh_link_async() just after the > > head->hw->hw_next = dma; > > line. Delays in flushing that write would also slow down performance. Will do it too. thanks, -- Ming Lei
diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c index 0917e3a..65b5021 100644 --- a/drivers/usb/host/ehci-q.c +++ b/drivers/usb/host/ehci-q.c @@ -1082,6 +1082,20 @@ static struct ehci_qh *qh_append_tds ( wmb (); dummy->hw_token = token; + /* The mb() below is added to make sure that + * 'token' can be writen into qtd, so that ehci + * HC can see the up-to-date qtd descriptor. On + * some archs(at least on ARM Cortex A9 dual core), + * writing into coherenet memory doesn't mean the + * value written can reach physical memory + * immediately, and the value may be buffered + * inside L2 cache. 'dummy->hw_token = token;' + * after mb() is added for obeying correct mb() + * usage. + * */ + mb(); + token = dummy->hw_token; + urb->hcpriv = qh_get (qh); } }