Message ID | 1314720193-26577-1-git-send-email-ming.lei@canonical.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Wed, 31 Aug 2011 ming.lei@canonical.com wrote: > From: Ming Lei <ming.lei@canonical.com> > > This patch introduces the helper of ehci_sync_mem to flush > qtd/qh into memory immediately on some ARM, so that HC can > see the up-to-date qtd/qh descriptor asap. > > This patch fixs one performance bug on ARM Cortex A9 dual core > platform, which has been reported on quite a few ARM machines > (OMAP4, Tegra 2, snowball...), see details from link of > https://bugs.launchpad.net/bugs/709245. > > The patch has been tested ok on OMAP4 panda A1 board, and the > performance of 'dd' over usb mass storage can be increased from > 4~5MB/sec to 14~16MB/sec after applying this patch. > > Cc: Alan Stern <stern@rowland.harvard.edu> > Cc: Russell King <linux@arm.linux.org.uk> > Signed-off-by: Ming Lei <ming.lei@canonical.com> > --- > drivers/usb/host/ehci-q.c | 18 ++++++++++++++++++ > drivers/usb/host/ehci.h | 17 +++++++++++++++++ > 2 files changed, 35 insertions(+), 0 deletions(-) > > diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c > index 0917e3a..2719879 100644 > --- a/drivers/usb/host/ehci-q.c > +++ b/drivers/usb/host/ehci-q.c > @@ -995,6 +995,12 @@ static void qh_link_async (struct ehci_hcd *ehci, struct ehci_qh *qh) > head->qh_next.qh = qh; > head->hw->hw_next = dma; > > + /* > + * flush qh descriptor into memory immediately, > + * see comments in qh_append_tds. > + * */ Comments are supposed to look like this: /* * Blah blah blah * blah blah blah */ > + ehci_sync_mem(); > + > qh_get(qh); > qh->xacterrs = 0; > qh->qh_state = QH_STATE_LINKED; > @@ -1082,6 +1088,18 @@ static struct ehci_qh *qh_append_tds ( > wmb (); > dummy->hw_token = token; > > + /* > + * Writing to dma coherent buffer on ARM may > + * be delayed to reach memory, so HC may not see > + * hw_token of dummy qtd in time, which can cause > + * the qtd transaction to be executed very late, > + * and degrade performance a lot. ehci_sync_mem > + * is added to flush 'token' immediatelly into > + * memory, so that ehci can execute the transaction > + * ASAP. > + * */ Here too. > + ehci_sync_mem(); > + > urb->hcpriv = qh_get (qh); > } > } > diff --git a/drivers/usb/host/ehci.h b/drivers/usb/host/ehci.h > index cc7d337..313d9d6 100644 > --- a/drivers/usb/host/ehci.h > +++ b/drivers/usb/host/ehci.h > @@ -738,6 +738,23 @@ static inline u32 hc32_to_cpup (const struct ehci_hcd *ehci, const __hc32 *x) > > #endif > > +/* > + * Writing to dma coherent memory on ARM may be delayed via L2 > + * writing buffer, so introduce the helper which can flush L2 writing > + * buffer into memory immediately, especially used to flush ehci > + * descriptor to memory. > + * */ And here. > +#ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE > +static inline void ehci_sync_mem() > +{ > + mb(); > +} > +#else > +static inline void ehci_sync_mem() > +{ > +} > +#endif > + Except for the formatting of the comments, this is fine. When you fix up the comments, you can add: Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Alan Stern
On Wed, 2011-08-31 at 00:03 +0800, ming.lei@canonical.com wrote: > +/* > + * Writing to dma coherent memory on ARM may be delayed via L2 > + * writing buffer, so introduce the helper which can flush L2 writing > + * buffer into memory immediately, especially used to flush ehci > + * descriptor to memory. > + * */ > +#ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE > +static inline void ehci_sync_mem() > +{ > + mb(); > +} > +#else > +static inline void ehci_sync_mem() > +{ > +} > +#endif > + I'm wondering if this doesn't really belong in the DMA API for any future architectures that can't avoid prolonged write buffering to DMA coherent memory. IIUC, ARM mitigates this for most drivers by including an implicit write buffer flush in the mmio write routines. This takes care of the drivers which write to a mmio device register after writing something to shared DMA memory. IIUC, this doesn't help ehci because the host controller is polling to see what the cpu writes to the shared memory. Other hardware which polls shared memory like that will likely have the same problem and could use buffer drain helpers as well. --Mark
On Tue, 30 Aug 2011, Mark Salter wrote: > On Wed, 2011-08-31 at 00:03 +0800, ming.lei@canonical.com wrote: > > +/* > > + * Writing to dma coherent memory on ARM may be delayed via L2 > > + * writing buffer, so introduce the helper which can flush L2 writing > > + * buffer into memory immediately, especially used to flush ehci > > + * descriptor to memory. > > + * */ > > +#ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE > > +static inline void ehci_sync_mem() > > +{ > > + mb(); > > +} > > +#else > > +static inline void ehci_sync_mem() > > +{ > > +} > > +#endif > > + > > I'm wondering if this doesn't really belong in the DMA API for any > future architectures that can't avoid prolonged write buffering to DMA > coherent memory. IIUC, ARM mitigates this for most drivers by including > an implicit write buffer flush in the mmio write routines. This takes > care of the drivers which write to a mmio device register after writing > something to shared DMA memory. IIUC, this doesn't help ehci because the > host controller is polling to see what the cpu writes to the shared > memory. Other hardware which polls shared memory like that will likely > have the same problem and could use buffer drain helpers as well. This would be a good thing to define centrally. Would you like to post an RFC on LKML? Do you know of any other examples of hardware that polls shared DMA memory? Alan Stern
On Tue, Aug 30, 2011 at 05:38:30PM +0100, Mark Salter wrote: > On Wed, 2011-08-31 at 00:03 +0800, ming.lei@canonical.com wrote: > > +/* > > + * Writing to dma coherent memory on ARM may be delayed via L2 > > + * writing buffer, so introduce the helper which can flush L2 writing > > + * buffer into memory immediately, especially used to flush ehci > > + * descriptor to memory. > > + * */ > > +#ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE > > +static inline void ehci_sync_mem() > > +{ > > + mb(); > > +} > > +#else > > +static inline void ehci_sync_mem() > > +{ > > +} > > +#endif > > + > > I'm wondering if this doesn't really belong in the DMA API for any > future architectures that can't avoid prolonged write buffering to DMA > coherent memory. IIUC, ARM mitigates this for most drivers by including > an implicit write buffer flush in the mmio write routines. This takes > care of the drivers which write to a mmio device register after writing > something to shared DMA memory. IIUC, this doesn't help ehci because the > host controller is polling to see what the cpu writes to the shared > memory. Other hardware which polls shared memory like that will likely > have the same problem and could use buffer drain helpers as well. Right. In this case the buffering is happening at L2 which becomes noticeable when measuring performance. We also buffer stores at the CPU (regardless of memory type) but because these tend to become visible fairly quickly, there isn't a comparable performance problem. Given that I would expect other architectures to buffer writes at the CPU, would it not be worth having an API for flushing to L3 (devices)? It seems like this would be a useful addition to the coherent DMA API on platforms that handle coherency with non-cacheable memory attributes. Will
On Tue, Aug 30, 2011 at 06:26:42PM +0100, Will Deacon wrote: > On Tue, Aug 30, 2011 at 05:38:30PM +0100, Mark Salter wrote: > > On Wed, 2011-08-31 at 00:03 +0800, ming.lei@canonical.com wrote: > > > +/* > > > + * Writing to dma coherent memory on ARM may be delayed via L2 > > > + * writing buffer, so introduce the helper which can flush L2 writing > > > + * buffer into memory immediately, especially used to flush ehci > > > + * descriptor to memory. > > > + * */ > > > +#ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE > > > +static inline void ehci_sync_mem() > > > +{ > > > + mb(); > > > +} > > > +#else > > > +static inline void ehci_sync_mem() > > > +{ > > > +} > > > +#endif > > > + > > > > I'm wondering if this doesn't really belong in the DMA API for any > > future architectures that can't avoid prolonged write buffering to DMA > > coherent memory. IIUC, ARM mitigates this for most drivers by including > > an implicit write buffer flush in the mmio write routines. This takes > > care of the drivers which write to a mmio device register after writing > > something to shared DMA memory. IIUC, this doesn't help ehci because the > > host controller is polling to see what the cpu writes to the shared > > memory. Other hardware which polls shared memory like that will likely > > have the same problem and could use buffer drain helpers as well. > > Right. In this case the buffering is happening at L2 which becomes > noticeable when measuring performance. We also buffer stores at the > CPU (regardless of memory type) but because these tend to become visible > fairly quickly, there isn't a comparable performance problem. > > Given that I would expect other architectures to buffer writes at the CPU, > would it not be worth having an API for flushing to L3 (devices)? It seems > like this would be a useful addition to the coherent DMA API on platforms > that handle coherency with non-cacheable memory attributes. I agree, this seems to be a "new" type of barrier that is needed, as the code comment above seems to go against what the kernel memory barrier documentation says about what a memory barrier really does on the hardware. greg k-h
On Tue, Aug 30, 2011 at 06:48:59PM +0100, Greg KH wrote: > On Tue, Aug 30, 2011 at 06:26:42PM +0100, Will Deacon wrote: > > On Tue, Aug 30, 2011 at 05:38:30PM +0100, Mark Salter wrote: > > > On Wed, 2011-08-31 at 00:03 +0800, ming.lei@canonical.com wrote: > > > > +/* > > > > + * Writing to dma coherent memory on ARM may be delayed via L2 > > > > + * writing buffer, so introduce the helper which can flush L2 writing > > > > + * buffer into memory immediately, especially used to flush ehci > > > > + * descriptor to memory. > > > > + * */ > > > > +#ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE > > > > +static inline void ehci_sync_mem() > > > > +{ > > > > + mb(); > > > > +} > > > > +#else > > > > +static inline void ehci_sync_mem() > > > > +{ > > > > +} > > > > +#endif > > > > + > > > > > > I'm wondering if this doesn't really belong in the DMA API for any > > > future architectures that can't avoid prolonged write buffering to DMA > > > coherent memory. IIUC, ARM mitigates this for most drivers by including > > > an implicit write buffer flush in the mmio write routines. This takes > > > care of the drivers which write to a mmio device register after writing > > > something to shared DMA memory. IIUC, this doesn't help ehci because the > > > host controller is polling to see what the cpu writes to the shared > > > memory. Other hardware which polls shared memory like that will likely > > > have the same problem and could use buffer drain helpers as well. > > > > Right. In this case the buffering is happening at L2 which becomes > > noticeable when measuring performance. We also buffer stores at the > > CPU (regardless of memory type) but because these tend to become visible > > fairly quickly, there isn't a comparable performance problem. > > > > Given that I would expect other architectures to buffer writes at the CPU, > > would it not be worth having an API for flushing to L3 (devices)? It seems > > like this would be a useful addition to the coherent DMA API on platforms > > that handle coherency with non-cacheable memory attributes. > > I agree, this seems to be a "new" type of barrier that is needed, as the > code comment above seems to go against what the kernel memory barrier > documentation says about what a memory barrier really does on the > hardware. Although this doesn't have anything to do with ordering; it's all to do with immediacy so I think we should try to avoiding using the term `barrier'. If this can be made part of the coherent DMA API, that might be the best place for it (I can't think of any other areas this is needed given that the streaming DMA API and I/O accessors already deal with it). Will
On Tue, 2011-08-30 at 13:15 -0400, Alan Stern wrote: > On Tue, 30 Aug 2011, Mark Salter wrote: > > > On Wed, 2011-08-31 at 00:03 +0800, ming.lei@canonical.com wrote: > > > +/* > > > + * Writing to dma coherent memory on ARM may be delayed via L2 > > > + * writing buffer, so introduce the helper which can flush L2 writing > > > + * buffer into memory immediately, especially used to flush ehci > > > + * descriptor to memory. > > > + * */ > > > +#ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE > > > +static inline void ehci_sync_mem() > > > +{ > > > + mb(); > > > +} > > > +#else > > > +static inline void ehci_sync_mem() > > > +{ > > > +} > > > +#endif > > > + > > > > I'm wondering if this doesn't really belong in the DMA API for any > > future architectures that can't avoid prolonged write buffering to DMA > > coherent memory. IIUC, ARM mitigates this for most drivers by including > > an implicit write buffer flush in the mmio write routines. This takes > > care of the drivers which write to a mmio device register after writing > > something to shared DMA memory. IIUC, this doesn't help ehci because the > > host controller is polling to see what the cpu writes to the shared > > memory. Other hardware which polls shared memory like that will likely > > have the same problem and could use buffer drain helpers as well. > > This would be a good thing to define centrally. Would you like to > post an RFC on LKML? Yes, I can take a stab at that. > > Do you know of any other examples of hardware that polls shared DMA > memory? Not offhand nor after a quick search. I don't think it is a common way of doing things. --Mark
On Aug 31, 2011, at 1:54 AM, Will Deacon wrote: > On Tue, Aug 30, 2011 at 06:48:59PM +0100, Greg KH wrote: >> On Tue, Aug 30, 2011 at 06:26:42PM +0100, Will Deacon wrote: >>> On Tue, Aug 30, 2011 at 05:38:30PM +0100, Mark Salter wrote: >>>> On Wed, 2011-08-31 at 00:03 +0800, ming.lei@canonical.com wrote: >>>>> +/* >>>>> + * Writing to dma coherent memory on ARM may be delayed via L2 >>>>> + * writing buffer, so introduce the helper which can flush L2 writing >>>>> + * buffer into memory immediately, especially used to flush ehci >>>>> + * descriptor to memory. >>>>> + * */ >>>>> +#ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE >>>>> +static inline void ehci_sync_mem() >>>>> +{ >>>>> + mb(); >>>>> +} >>>>> +#else >>>>> +static inline void ehci_sync_mem() >>>>> +{ >>>>> +} >>>>> +#endif >>>>> + >>>> >>>> I'm wondering if this doesn't really belong in the DMA API for any >>>> future architectures that can't avoid prolonged write buffering to DMA >>>> coherent memory. IIUC, ARM mitigates this for most drivers by including >>>> an implicit write buffer flush in the mmio write routines. This takes >>>> care of the drivers which write to a mmio device register after writing >>>> something to shared DMA memory. IIUC, this doesn't help ehci because the >>>> host controller is polling to see what the cpu writes to the shared >>>> memory. Other hardware which polls shared memory like that will likely >>>> have the same problem and could use buffer drain helpers as well. >>> >>> Right. In this case the buffering is happening at L2 which becomes >>> noticeable when measuring performance. We also buffer stores at the >>> CPU (regardless of memory type) but because these tend to become visible >>> fairly quickly, there isn't a comparable performance problem. >>> >>> Given that I would expect other architectures to buffer writes at the CPU, >>> would it not be worth having an API for flushing to L3 (devices)? It seems >>> like this would be a useful addition to the coherent DMA API on platforms >>> that handle coherency with non-cacheable memory attributes. >> >> I agree, this seems to be a "new" type of barrier that is needed, as the >> code comment above seems to go against what the kernel memory barrier >> documentation says about what a memory barrier really does on the >> hardware. > > Although this doesn't have anything to do with ordering; it's all to do with > immediacy so I think we should try to avoiding using the term `barrier'. If > this can be made part of the coherent DMA API, that might be the best place > for it (I can't think of any other areas this is needed given that the > streaming DMA API and I/O accessors already deal with it). I am agree with you. I met the same issue at both usb device driver (adding next dTD pointer which the current one is handling) and usb host driver (performance issue this thread have discussed) at Freescale i.MX6Q platform (4 Cores, ARM SMP). So, now I need to add two barriers at two different drivers. One question: why this write buffer issue did not happen at UP ARM V7 platform, whose dma buffer also uncache, but bufferable? > > Will > -- > To unsubscribe from this list: send the line "unsubscribe linux-usb" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Best Regard, Peter Chen
On Wed, Aug 31, 2011 at 1:54 AM, Will Deacon <will.deacon@arm.com> wrote: > Although this doesn't have anything to do with ordering; it's all to do with > immediacy so I think we should try to avoiding using the term `barrier'. If > this can be made part of the coherent DMA API, that might be the best place > for it (I can't think of any other areas this is needed given that the > streaming DMA API and I/O accessors already deal with it). Agree too. thanks, -- Ming Lei
On Wed, Aug 31, 2011 at 01:23:47AM +0100, Chen Peter-B29397 wrote: > One question: why this write buffer issue did not happen at UP ARM V7 platform, whose dma buffer > also uncache, but bufferable? Which CPU was on this platform? Will
Best Regard, Peter Chen On Aug 31, 2011, at 4:49 PM, Will Deacon wrote: > On Wed, Aug 31, 2011 at 01:23:47AM +0100, Chen Peter-B29397 wrote: >> One question: why this write buffer issue did not happen at UP ARM V7 platform, whose dma buffer >> also uncache, but bufferable? > > Which CPU was on this platform? Cortex A8 UP system (freescale i.MX5x platform) > > Will >
On Wed, 2011-08-31 at 09:49 +0100, Will Deacon wrote: > On Wed, Aug 31, 2011 at 01:23:47AM +0100, Chen Peter-B29397 wrote: > > One question: why this write buffer issue did not happen at UP ARM V7 platform, whose dma buffer > > also uncache, but bufferable? > > Which CPU was on this platform? Using a 3.1.0-rc4+ kernel on a Pandaboard, and running 'hdparm -t' on a usb disk drive, I see ~5.8MB/s read speed. Same kernel, but passing nosmp on the commandline, I see 20.3MB/s. Can someone explain why nosmp would make such a difference? --Mark
On Wed, Aug 31, 2011 at 02:43:33PM +0100, Mark Salter wrote: > On Wed, 2011-08-31 at 09:49 +0100, Will Deacon wrote: > > On Wed, Aug 31, 2011 at 01:23:47AM +0100, Chen Peter-B29397 wrote: > > > One question: why this write buffer issue did not happen at UP ARM V7 platform, whose dma buffer > > > also uncache, but bufferable? > > > > Which CPU was on this platform? > > Using a 3.1.0-rc4+ kernel on a Pandaboard, and running 'hdparm -t' on a > usb disk drive, I see ~5.8MB/s read speed. Same kernel, but passing > nosmp on the commandline, I see 20.3MB/s. > > Can someone explain why nosmp would make such a difference? Oh gawd, that's horrible. I have a feeling it's probably a separate issue though, caused by: omap_modify_auxcoreboot0(0x200, 0xfffffdff); in boot_secondary for OMAP. Unfortunately I have no idea what that line is doing because it ends up talking to the secure monitor. Will
On Wed, 2011-08-31 at 16:21 +0100, Will Deacon wrote: > On Wed, Aug 31, 2011 at 02:43:33PM +0100, Mark Salter wrote: > > On Wed, 2011-08-31 at 09:49 +0100, Will Deacon wrote: > > > On Wed, Aug 31, 2011 at 01:23:47AM +0100, Chen Peter-B29397 wrote: > > > > One question: why this write buffer issue did not happen at UP ARM V7 platform, whose dma buffer > > > > also uncache, but bufferable? > > > > > > Which CPU was on this platform? > > > > Using a 3.1.0-rc4+ kernel on a Pandaboard, and running 'hdparm -t' on a > > usb disk drive, I see ~5.8MB/s read speed. Same kernel, but passing > > nosmp on the commandline, I see 20.3MB/s. > > > > Can someone explain why nosmp would make such a difference? > > Oh gawd, that's horrible. I have a feeling it's probably a separate issue > though, caused by: > > omap_modify_auxcoreboot0(0x200, 0xfffffdff); > > in boot_secondary for OMAP. Unfortunately I have no idea what that line is > doing because it ends up talking to the secure monitor. Okay, I may poke around a bit with that to see I can get a better understanding. With the patched ehci-q.c, I see no noticeable difference between smp and nosmp. Both get me around 23.5MB/s with my setup. --Mark
On 31/08/11 16:27, Mark Salter wrote: > On Wed, 2011-08-31 at 16:21 +0100, Will Deacon wrote: >> On Wed, Aug 31, 2011 at 02:43:33PM +0100, Mark Salter wrote: >>> On Wed, 2011-08-31 at 09:49 +0100, Will Deacon wrote: >>>> On Wed, Aug 31, 2011 at 01:23:47AM +0100, Chen Peter-B29397 wrote: >>>>> One question: why this write buffer issue did not happen at UP ARM V7 platform, whose dma buffer >>>>> also uncache, but bufferable? >>>> >>>> Which CPU was on this platform? >>> >>> Using a 3.1.0-rc4+ kernel on a Pandaboard, and running 'hdparm -t' on a >>> usb disk drive, I see ~5.8MB/s read speed. Same kernel, but passing >>> nosmp on the commandline, I see 20.3MB/s. >>> >>> Can someone explain why nosmp would make such a difference? >> >> Oh gawd, that's horrible. I have a feeling it's probably a separate issue >> though, caused by: >> >> omap_modify_auxcoreboot0(0x200, 0xfffffdff); >> >> in boot_secondary for OMAP. Unfortunately I have no idea what that line is >> doing because it ends up talking to the secure monitor. > > Okay, I may poke around a bit with that to see I can get a better > understanding. > > With the patched ehci-q.c, I see no noticeable difference between smp > and nosmp. Both get me around 23.5MB/s with my setup. Oddly enough, this patch doesn't do anything on my Tegra setup. In both cases, I get around 17MB/s from a crap SD card plugged in a USB reader. This leads me to suspect that this issue is very much OMAP4 specific. Can anyone verify this theory on other some A9 platforms? Cheers, M.
Am Mittwoch 31 August 2011, 18:12:48 schrieb Marc Zyngier: > [...] > Oddly enough, this patch doesn't do anything on my Tegra setup. In both > cases, I get around 17MB/s from a crap SD card plugged in a USB reader. > > This leads me to suspect that this issue is very much OMAP4 specific. > Can anyone verify this theory on other some A9 platforms? That's odd. On my Tegra2 (on ac100) it boosts the transfer rate from 7 to 17 MB/s. Marc
On Wed, 31 Aug 2011, Will Deacon wrote: > On Wed, Aug 31, 2011 at 02:43:33PM +0100, Mark Salter wrote: > > On Wed, 2011-08-31 at 09:49 +0100, Will Deacon wrote: > > > On Wed, Aug 31, 2011 at 01:23:47AM +0100, Chen Peter-B29397 wrote: > > > > One question: why this write buffer issue did not happen at UP ARM V7 platform, whose dma buffer > > > > also uncache, but bufferable? > > > > > > Which CPU was on this platform? > > > > Using a 3.1.0-rc4+ kernel on a Pandaboard, and running 'hdparm -t' on a > > usb disk drive, I see ~5.8MB/s read speed. Same kernel, but passing > > nosmp on the commandline, I see 20.3MB/s. > > > > Can someone explain why nosmp would make such a difference? > > Oh gawd, that's horrible. I have a feeling it's probably a separate issue > though, caused by: > > omap_modify_auxcoreboot0(0x200, 0xfffffdff); > > in boot_secondary for OMAP. Unfortunately I have no idea what that line is > doing because it ends up talking to the secure monitor. Well, this issue is apparently affecting other ARMv9 implementations too. In which case this code in arch/arm/mm/mmu.c could be responsible: if (is_smp()) { /* * Mark memory with the "shared" attribute * for SMP systems */ user_pgprot |= L_PTE_SHARED; kern_pgprot |= L_PTE_SHARED; vecs_pgprot |= L_PTE_SHARED; mem_types[MT_DEVICE_WC].prot_sect |= PMD_SECT_S; mem_types[MT_DEVICE_WC].prot_pte |= L_PTE_SHARED; mem_types[MT_DEVICE_CACHED].prot_sect |= PMD_SECT_S; mem_types[MT_DEVICE_CACHED].prot_pte |= L_PTE_SHARED; mem_types[MT_MEMORY].prot_sect |= PMD_SECT_S; mem_types[MT_MEMORY].prot_pte |= L_PTE_SHARED; mem_types[MT_MEMORY_NONCACHED].prot_sect |= PMD_SECT_S; mem_types[MT_MEMORY_NONCACHED].prot_pte |= L_PTE_SHARED; } However I don't see the nosmp kernel argument having any effect on the result from is_smp(). Nicolas
On Wed, Aug 31, 2011 at 06:46:50PM +0100, Nicolas Pitre wrote: > On Wed, 31 Aug 2011, Will Deacon wrote: > > > On Wed, Aug 31, 2011 at 02:43:33PM +0100, Mark Salter wrote: > > > On Wed, 2011-08-31 at 09:49 +0100, Will Deacon wrote: > > > > On Wed, Aug 31, 2011 at 01:23:47AM +0100, Chen Peter-B29397 wrote: > > > > > One question: why this write buffer issue did not happen at UP ARM V7 platform, whose dma buffer > > > > > also uncache, but bufferable? > > > > > > > > Which CPU was on this platform? > > > > > > Using a 3.1.0-rc4+ kernel on a Pandaboard, and running 'hdparm -t' on a > > > usb disk drive, I see ~5.8MB/s read speed. Same kernel, but passing > > > nosmp on the commandline, I see 20.3MB/s. > > > > > > Can someone explain why nosmp would make such a difference? > > > > Oh gawd, that's horrible. I have a feeling it's probably a separate issue > > though, caused by: > > > > omap_modify_auxcoreboot0(0x200, 0xfffffdff); > > > > in boot_secondary for OMAP. Unfortunately I have no idea what that line is > > doing because it ends up talking to the secure monitor. > > Well, this issue is apparently affecting other ARMv9 implementations > too. In which case this code in arch/arm/mm/mmu.c could be responsible: > > if (is_smp()) { > /* > * Mark memory with the "shared" attribute > * for SMP systems > */ > user_pgprot |= L_PTE_SHARED; > kern_pgprot |= L_PTE_SHARED; > vecs_pgprot |= L_PTE_SHARED; > mem_types[MT_DEVICE_WC].prot_sect |= PMD_SECT_S; > mem_types[MT_DEVICE_WC].prot_pte |= L_PTE_SHARED; > mem_types[MT_DEVICE_CACHED].prot_sect |= PMD_SECT_S; > mem_types[MT_DEVICE_CACHED].prot_pte |= L_PTE_SHARED; > mem_types[MT_MEMORY].prot_sect |= PMD_SECT_S; > mem_types[MT_MEMORY].prot_pte |= L_PTE_SHARED; > mem_types[MT_MEMORY_NONCACHED].prot_sect |= PMD_SECT_S; > mem_types[MT_MEMORY_NONCACHED].prot_pte |= L_PTE_SHARED; > } > > However I don't see the nosmp kernel argument having any effect on the > result from is_smp(). Yes, the first thing that sprung to mind was the shared attribute, but like you say, that doesn't seem to be affected by the nosmp command line argument. Another thing that Marc and I tried on OMAP4 was not bringing up the secondary CPU during boot (by commenting out most of smp_init). In this case, I/O performance was good until we tried to online the secondary CPU. The online failed but after that the I/O performance was certainly degraded. Will
On 08/31/2011 12:51 PM, Will Deacon wrote: > On Wed, Aug 31, 2011 at 06:46:50PM +0100, Nicolas Pitre wrote: >> On Wed, 31 Aug 2011, Will Deacon wrote: >> >>> On Wed, Aug 31, 2011 at 02:43:33PM +0100, Mark Salter wrote: >>>> On Wed, 2011-08-31 at 09:49 +0100, Will Deacon wrote: >>>>> On Wed, Aug 31, 2011 at 01:23:47AM +0100, Chen Peter-B29397 wrote: >>>>>> One question: why this write buffer issue did not happen at UP ARM V7 platform, whose dma buffer >>>>>> also uncache, but bufferable? >>>>> >>>>> Which CPU was on this platform? >>>> >>>> Using a 3.1.0-rc4+ kernel on a Pandaboard, and running 'hdparm -t' on a >>>> usb disk drive, I see ~5.8MB/s read speed. Same kernel, but passing >>>> nosmp on the commandline, I see 20.3MB/s. >>>> >>>> Can someone explain why nosmp would make such a difference? >>> >>> Oh gawd, that's horrible. I have a feeling it's probably a separate issue >>> though, caused by: >>> >>> omap_modify_auxcoreboot0(0x200, 0xfffffdff); >>> >>> in boot_secondary for OMAP. Unfortunately I have no idea what that line is >>> doing because it ends up talking to the secure monitor. >> >> Well, this issue is apparently affecting other ARMv9 implementations >> too. In which case this code in arch/arm/mm/mmu.c could be responsible: >> >> if (is_smp()) { >> /* >> * Mark memory with the "shared" attribute >> * for SMP systems >> */ >> user_pgprot |= L_PTE_SHARED; >> kern_pgprot |= L_PTE_SHARED; >> vecs_pgprot |= L_PTE_SHARED; >> mem_types[MT_DEVICE_WC].prot_sect |= PMD_SECT_S; >> mem_types[MT_DEVICE_WC].prot_pte |= L_PTE_SHARED; >> mem_types[MT_DEVICE_CACHED].prot_sect |= PMD_SECT_S; >> mem_types[MT_DEVICE_CACHED].prot_pte |= L_PTE_SHARED; >> mem_types[MT_MEMORY].prot_sect |= PMD_SECT_S; >> mem_types[MT_MEMORY].prot_pte |= L_PTE_SHARED; >> mem_types[MT_MEMORY_NONCACHED].prot_sect |= PMD_SECT_S; >> mem_types[MT_MEMORY_NONCACHED].prot_pte |= L_PTE_SHARED; >> } >> >> However I don't see the nosmp kernel argument having any effect on the >> result from is_smp(). > > Yes, the first thing that sprung to mind was the shared attribute, but like > you say, that doesn't seem to be affected by the nosmp command line > argument. > > Another thing that Marc and I tried on OMAP4 was not bringing up the secondary > CPU during boot (by commenting out most of smp_init). In this case, I/O > performance was good until we tried to online the secondary CPU. The online > failed but after that the I/O performance was certainly degraded. > Was the SCU enabled at that point? One diff between nosmp boot and offlining the 2nd core would be that the SCU remains enabled in the latter case. I think the SCU does not get enabled for nosmp. Do we really know which write buffer the data is sitting? Some experiments to only flush the L1 write buffer would be interesting. Perhaps something executed on the 2nd core has a mb which doesn't help for SMP because the other core's L1 write buffer is not flushed, but it helps for nosmp because everything runs on 1 core and any occurrence of a mb will flush all data out. I wouldn't expect the behavior to be so consistent though. Could it be something is not visible to the other core rather than not visible to the EHCI controller? Rob
On Wed, 2011-08-31 at 13:19 -0500, Rob Herring wrote: > On 08/31/2011 12:51 PM, Will Deacon wrote: > > On Wed, Aug 31, 2011 at 06:46:50PM +0100, Nicolas Pitre wrote: > >> On Wed, 31 Aug 2011, Will Deacon wrote: > >> > >>> On Wed, Aug 31, 2011 at 02:43:33PM +0100, Mark Salter wrote: > >>>> On Wed, 2011-08-31 at 09:49 +0100, Will Deacon wrote: > >>>>> On Wed, Aug 31, 2011 at 01:23:47AM +0100, Chen Peter-B29397 wrote: > >>>>>> One question: why this write buffer issue did not happen at UP ARM V7 platform, whose dma buffer > >>>>>> also uncache, but bufferable? > >>>>> > >>>>> Which CPU was on this platform? > >>>> > >>>> Using a 3.1.0-rc4+ kernel on a Pandaboard, and running 'hdparm -t' on a > >>>> usb disk drive, I see ~5.8MB/s read speed. Same kernel, but passing > >>>> nosmp on the commandline, I see 20.3MB/s. > >>>> > >>>> Can someone explain why nosmp would make such a difference? > >>> > >>> Oh gawd, that's horrible. I have a feeling it's probably a separate issue > >>> though, caused by: > >>> > >>> omap_modify_auxcoreboot0(0x200, 0xfffffdff); > >>> > >>> in boot_secondary for OMAP. Unfortunately I have no idea what that line is > >>> doing because it ends up talking to the secure monitor. > >> > >> Well, this issue is apparently affecting other ARMv9 implementations > >> too. In which case this code in arch/arm/mm/mmu.c could be responsible: > >> > >> if (is_smp()) { > >> /* > >> * Mark memory with the "shared" attribute > >> * for SMP systems > >> */ > >> user_pgprot |= L_PTE_SHARED; > >> kern_pgprot |= L_PTE_SHARED; > >> vecs_pgprot |= L_PTE_SHARED; > >> mem_types[MT_DEVICE_WC].prot_sect |= PMD_SECT_S; > >> mem_types[MT_DEVICE_WC].prot_pte |= L_PTE_SHARED; > >> mem_types[MT_DEVICE_CACHED].prot_sect |= PMD_SECT_S; > >> mem_types[MT_DEVICE_CACHED].prot_pte |= L_PTE_SHARED; > >> mem_types[MT_MEMORY].prot_sect |= PMD_SECT_S; > >> mem_types[MT_MEMORY].prot_pte |= L_PTE_SHARED; > >> mem_types[MT_MEMORY_NONCACHED].prot_sect |= PMD_SECT_S; > >> mem_types[MT_MEMORY_NONCACHED].prot_pte |= L_PTE_SHARED; > >> } > >> > >> However I don't see the nosmp kernel argument having any effect on the > >> result from is_smp(). > > > > Yes, the first thing that sprung to mind was the shared attribute, but like > > you say, that doesn't seem to be affected by the nosmp command line > > argument. > > > > Another thing that Marc and I tried on OMAP4 was not bringing up the secondary > > CPU during boot (by commenting out most of smp_init). In this case, I/O > > performance was good until we tried to online the secondary CPU. The online > > failed but after that the I/O performance was certainly degraded. > > > > Was the SCU enabled at that point? One diff between nosmp boot and > offlining the 2nd core would be that the SCU remains enabled in the > latter case. I think the SCU does not get enabled for nosmp. > > Do we really know which write buffer the data is sitting? Some > experiments to only flush the L1 write buffer would be interesting. > Perhaps something executed on the 2nd core has a mb which doesn't help > for SMP because the other core's L1 write buffer is not flushed, but it > helps for nosmp because everything runs on 1 core and any occurrence of > a mb will flush all data out. I wouldn't expect the behavior to be so > consistent though. Could it be something is not visible to the other > core rather than not visible to the EHCI controller? One experiment I did a few days ago was to pin processes and interrupts to core#0 (except IPI and local timer). This didn't make any noticeable difference. My current understanding is that the writes are getting hung up in a cache and not a write buffer. I am seeing delays of 10-15ms between queuing the urb and getting an interrupt for urb completion. That drops to a few hundred microseconds with the explicit flushing added to the ehci driver. I don't see how any write buffer could hold data that long without draining out on its own. What I see seems to suggest that the memory is only coherent among the cores and not coherent for CPU writes/device reads. Adding just a dsb() for the ehci flush does not help. An outer_sync() is also necessary. --Mark
On 08/31/2011 01:35 PM, Mark Salter wrote: > On Wed, 2011-08-31 at 13:19 -0500, Rob Herring wrote: >> On 08/31/2011 12:51 PM, Will Deacon wrote: >>> On Wed, Aug 31, 2011 at 06:46:50PM +0100, Nicolas Pitre wrote: >>>> On Wed, 31 Aug 2011, Will Deacon wrote: >>>> >>>>> On Wed, Aug 31, 2011 at 02:43:33PM +0100, Mark Salter wrote: >>>>>> On Wed, 2011-08-31 at 09:49 +0100, Will Deacon wrote: >>>>>>> On Wed, Aug 31, 2011 at 01:23:47AM +0100, Chen Peter-B29397 wrote: >>>>>>>> One question: why this write buffer issue did not happen at UP ARM V7 platform, whose dma buffer >>>>>>>> also uncache, but bufferable? >>>>>>> >>>>>>> Which CPU was on this platform? >>>>>> >>>>>> Using a 3.1.0-rc4+ kernel on a Pandaboard, and running 'hdparm -t' on a >>>>>> usb disk drive, I see ~5.8MB/s read speed. Same kernel, but passing >>>>>> nosmp on the commandline, I see 20.3MB/s. >>>>>> >>>>>> Can someone explain why nosmp would make such a difference? >>>>> >>>>> Oh gawd, that's horrible. I have a feeling it's probably a separate issue >>>>> though, caused by: >>>>> >>>>> omap_modify_auxcoreboot0(0x200, 0xfffffdff); >>>>> >>>>> in boot_secondary for OMAP. Unfortunately I have no idea what that line is >>>>> doing because it ends up talking to the secure monitor. >>>> >>>> Well, this issue is apparently affecting other ARMv9 implementations >>>> too. In which case this code in arch/arm/mm/mmu.c could be responsible: >>>> >>>> if (is_smp()) { >>>> /* >>>> * Mark memory with the "shared" attribute >>>> * for SMP systems >>>> */ >>>> user_pgprot |= L_PTE_SHARED; >>>> kern_pgprot |= L_PTE_SHARED; >>>> vecs_pgprot |= L_PTE_SHARED; >>>> mem_types[MT_DEVICE_WC].prot_sect |= PMD_SECT_S; >>>> mem_types[MT_DEVICE_WC].prot_pte |= L_PTE_SHARED; >>>> mem_types[MT_DEVICE_CACHED].prot_sect |= PMD_SECT_S; >>>> mem_types[MT_DEVICE_CACHED].prot_pte |= L_PTE_SHARED; >>>> mem_types[MT_MEMORY].prot_sect |= PMD_SECT_S; >>>> mem_types[MT_MEMORY].prot_pte |= L_PTE_SHARED; >>>> mem_types[MT_MEMORY_NONCACHED].prot_sect |= PMD_SECT_S; >>>> mem_types[MT_MEMORY_NONCACHED].prot_pte |= L_PTE_SHARED; >>>> } >>>> >>>> However I don't see the nosmp kernel argument having any effect on the >>>> result from is_smp(). >>> >>> Yes, the first thing that sprung to mind was the shared attribute, but like >>> you say, that doesn't seem to be affected by the nosmp command line >>> argument. >>> >>> Another thing that Marc and I tried on OMAP4 was not bringing up the secondary >>> CPU during boot (by commenting out most of smp_init). In this case, I/O >>> performance was good until we tried to online the secondary CPU. The online >>> failed but after that the I/O performance was certainly degraded. >>> >> >> Was the SCU enabled at that point? One diff between nosmp boot and >> offlining the 2nd core would be that the SCU remains enabled in the >> latter case. I think the SCU does not get enabled for nosmp. >> >> Do we really know which write buffer the data is sitting? Some >> experiments to only flush the L1 write buffer would be interesting. >> Perhaps something executed on the 2nd core has a mb which doesn't help >> for SMP because the other core's L1 write buffer is not flushed, but it >> helps for nosmp because everything runs on 1 core and any occurrence of >> a mb will flush all data out. I wouldn't expect the behavior to be so >> consistent though. Could it be something is not visible to the other >> core rather than not visible to the EHCI controller? > > One experiment I did a few days ago was to pin processes and interrupts > to core#0 (except IPI and local timer). This didn't make any noticeable > difference. > > My current understanding is that the writes are getting hung up in a > cache and not a write buffer. I am seeing delays of 10-15ms between > queuing the urb and getting an interrupt for urb completion. That > drops to a few hundred microseconds with the explicit flushing added > to the ehci driver. I don't see how any write buffer could hold data > that long without draining out on its own. What I see seems to suggest > that the memory is only coherent among the cores and not coherent for > CPU writes/device reads. Adding just a dsb() for the ehci flush does > not help. An outer_sync() is also necessary. > An outer_sync will only drain the write buffer of the L2. It does not flush the cache though. If the write buffer does in fact keep data as long as possible (until it needs a free slot or the line is full), then long delays to write out data are certainly possible. The exact operation is not documented AFAIR. Rob
On Wed, 2011-08-31 at 13:49 -0500, Rob Herring wrote: > An outer_sync will only drain the write buffer of the L2. It does not > flush the cache though. If the write buffer does in fact keep data as > long as possible (until it needs a free slot or the line is full), then > long delays to write out data are certainly possible. The exact > operation is not documented AFAIR. Ah, thanks for that. I really haven't been paying close attention to ARMv6/7 hardware and I'm in the process of playing catch up. --Mark
On Wed, Aug 31, 2011 at 07:19:33PM +0100, Rob Herring wrote: > On 08/31/2011 12:51 PM, Will Deacon wrote: > > Another thing that Marc and I tried on OMAP4 was not bringing up the secondary > > CPU during boot (by commenting out most of smp_init). In this case, I/O > > performance was good until we tried to online the secondary CPU. The online > > failed but after that the I/O performance was certainly degraded. > > > > Was the SCU enabled at that point? One diff between nosmp boot and > offlining the 2nd core would be that the SCU remains enabled in the > latter case. I think the SCU does not get enabled for nosmp. Our rudimentary test (printing out the SCU control register during boot) showed that it *was* enabled for nosmp. I think this is due to the secure world having to do that on OMAP so it's probably not true for other platforms. Will
Hi Marc, On 31/08/11 17:55, Marc Dietrich wrote: > Am Mittwoch 31 August 2011, 18:12:48 schrieb Marc Zyngier: >> [...] >> Oddly enough, this patch doesn't do anything on my Tegra setup. In both >> cases, I get around 17MB/s from a crap SD card plugged in a USB reader. >> >> This leads me to suspect that this issue is very much OMAP4 specific. >> Can anyone verify this theory on other some A9 platforms? > > That's odd. On my Tegra2 (on ac100) it boosts the transfer rate from 7 to > 17 MB/s. I'm using a Harmony board. Could you share your kernel version, .config and dmesg? Thanks, M.
> Hi Marc, ^dito, > On 31/08/11 17:55, Marc Dietrich wrote: > > Am Mittwoch 31 August 2011, 18:12:48 schrieb Marc Zyngier: > >> [...] > >> Oddly enough, this patch doesn't do anything on my Tegra setup. In both > >> cases, I get around 17MB/s from a crap SD card plugged in a USB reader. > >> > >> This leads me to suspect that this issue is very much OMAP4 specific. > >> Can anyone verify this theory on other some A9 platforms? > > > > That's odd. On my Tegra2 (on ac100) it boosts the transfer rate from 7 to > > 17 MB/s. > > I'm using a Harmony board. Could you share your kernel version, .config > and dmesg? > > Thanks, > > M. I use the chromiumos tree (for chromiumos 2.6.38 kernel: http://git.chromium.org/chromiumos/third_party/kernel-next.git) with some additions to make it run on the AC100. This modified tree is on git://gitorious.org/~marvin24/ac100/marvin24s-kernel.git. The config is paz00_defconfig and a dmesg you can get e.g. from http://pastebin.com/9uVfDWma (it's not very current, but it should be sufficient). I'll add Stephen Warren from NVIDIA to the CC list. He has more HW to test on. Btw, this is the patch I used: http://gitorious.org/~marvin24/ac100/marvin24s- kernel/commit/cce8d9e25d009a45c219a6ad0b9ac4e27d034ab0 Marc
Marc Dietich wrote at Thursday, September 01, 2011 5:14 AM:
> I'll add Stephen Warren from NVIDIA to the CC list. He has more HW to test on.
Here are the results I found:
Harmony:
Tegra USB3 -> SMSC9514 hub: NOT affected
(Unplugging LAN cable, or disabling SMSC9514 LAN driver doesn't change this)
Seaboard (springbank; clamshell):
Tegra USB1 -> no hub: Affected
Seaboard (seaboard non-clamshell):
Tegra USB1 -> no hub: Affected
Tegra USB3 -> no hub: Affected
TrimSlice:
Tegra USB3 -> unknown hub: Affected
This implies there's something different about Harmony.
Is the USB hub a clue? Seaboard doesn't have one, and although I don't
know what model TrimSlice uses, I assume it's different since I know
TrimSlice's Ethernet is not the same as Harmony's.
I don't see anything in board-harmony.c vs. board-seaboard.c that'd affect
anything USB-related.
Perhaps there's some kind of bootloader or BCT difference. However, my
Harmony and both Seaboards both use (a very old) U-Boot and BCT from
ChromeOS, so I don't imagine there's actually much difference there.
On 01/09/11 20:08, Stephen Warren wrote: > Marc Dietich wrote at Thursday, September 01, 2011 5:14 AM: >> I'll add Stephen Warren from NVIDIA to the CC list. He has more HW to test on. > > Here are the results I found: > > Harmony: > Tegra USB3 -> SMSC9514 hub: NOT affected > (Unplugging LAN cable, or disabling SMSC9514 LAN driver doesn't change this) > > Seaboard (springbank; clamshell): > Tegra USB1 -> no hub: Affected > > Seaboard (seaboard non-clamshell): > Tegra USB1 -> no hub: Affected > Tegra USB3 -> no hub: Affected > > TrimSlice: > Tegra USB3 -> unknown hub: Affected > > This implies there's something different about Harmony. > > Is the USB hub a clue? Seaboard doesn't have one, and although I don't > know what model TrimSlice uses, I assume it's different since I know > TrimSlice's Ethernet is not the same as Harmony's. Panda has the exact same USB hub configuration, and is affected. So we can rule this out. > I don't see anything in board-harmony.c vs. board-seaboard.c that'd affect > anything USB-related. > > Perhaps there's some kind of bootloader or BCT difference. However, my > Harmony and both Seaboards both use (a very old) U-Boot and BCT from > ChromeOS, so I don't imagine there's actually much difference there. I just noticed something else. Harmony is fast *most of the time*. In about one in 3 reboots, I get the slow behavior. When USB is fast, I also have I2C interrupts "screaming": 85: 294321 0 GIC tegra-i2c 116: 0 0 GIC tegra-i2c 118: 98542 0 GIC tps6586x This is a couple of seconds after boot. When USB is slow, I see the following: [ 0.385270] tps6586x 3-0034: Chip ID read failed: -121 [ 0.390584] tps6586x: probe of 3-0034 failed with error -5 ... and I2C interrupt is quiet. The I2C interrupt handler calls writel(), which does a cache sync. That would explain the "fast" behavior of Harmony. Do you see the same this on your board? M.
just another measurement point > Stephen Warren wrote at Thursday: > > Here are the results I found: > > Harmony: > Tegra USB3 -> SMSC9514 hub: NOT affected > (Unplugging LAN cable, or disabling SMSC9514 LAN driver doesn't change > this) > > Seaboard (springbank; clamshell): > Tegra USB1 -> no hub: Affected > > Seaboard (seaboard non-clamshell): > Tegra USB1 -> no hub: Affected > Tegra USB3 -> no hub: Affected > > TrimSlice: > Tegra USB3 -> unknown hub: Affected PAZ00: ULPI -> SMCS 2512: Affected Tegra USB3 -> SMSC 2514: Affected The patch also cures high latencies/packet drops on wifi connected to ULPI via the 2512 hub. The pen drive was connected to USB3/2514. > This implies there's something different about Harmony. > > Is the USB hub a clue? Seaboard doesn't have one, and although I don't > know what model TrimSlice uses, I assume it's different since I know > TrimSlice's Ethernet is not the same as Harmony's. > > I don't see anything in board-harmony.c vs. board-seaboard.c that'd affect > anything USB-related. > > Perhaps there's some kind of bootloader or BCT difference. However, my > Harmony and both Seaboards both use (a very old) U-Boot and BCT from > ChromeOS, so I don't imagine there's actually much difference there.
Marc Zyngier wrote at Friday, September 02, 2011 3:51 AM: > On 01/09/11 20:08, Stephen Warren wrote: > > Marc Dietich wrote at Thursday, September 01, 2011 5:14 AM: > >> I'll add Stephen Warren from NVIDIA to the CC list. He has more HW to test on. > > > > Here are the results I found: > > > > Harmony: > > Tegra USB3 -> SMSC9514 hub: NOT affected > > (Unplugging LAN cable, or disabling SMSC9514 LAN driver doesn't change this) ... > I just noticed something else. Harmony is fast *most of the time*. In > about one in 3 reboots, I get the slow behavior. When USB is fast, I > also have I2C interrupts "screaming": > > 85: 294321 0 GIC tegra-i2c > 116: 0 0 GIC tegra-i2c > 118: 98542 0 GIC tps6586x > > This is a couple of seconds after boot. > > When USB is slow, I see the following: > [ 0.385270] tps6586x 3-0034: Chip ID read failed: -121 > [ 0.390584] tps6586x: probe of 3-0034 failed with error -5 > > ... and I2C interrupt is quiet. > > The I2C interrupt handler calls writel(), which does a cache sync. That > would explain the "fast" behavior of Harmony. > > Do you see the same this on your board? Yes, I re-ran the test a few more times and see those exact same symptoms. In a case with the screaming I2C interrupts and fast USB, I then did: echo 3-0034 > /sys/bus/i2c/drivers/tps6586x/unbind (I got a kernel BUG and bash crashed here, but just logged back in) which caused the I2C interrupt handler to stop, then re-ran the test. I then saw the slow USB speed. So, now I think *all* platforms(boards) are affected, right?
diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c index 0917e3a..2719879 100644 --- a/drivers/usb/host/ehci-q.c +++ b/drivers/usb/host/ehci-q.c @@ -995,6 +995,12 @@ static void qh_link_async (struct ehci_hcd *ehci, struct ehci_qh *qh) head->qh_next.qh = qh; head->hw->hw_next = dma; + /* + * flush qh descriptor into memory immediately, + * see comments in qh_append_tds. + * */ + ehci_sync_mem(); + qh_get(qh); qh->xacterrs = 0; qh->qh_state = QH_STATE_LINKED; @@ -1082,6 +1088,18 @@ static struct ehci_qh *qh_append_tds ( wmb (); dummy->hw_token = token; + /* + * Writing to dma coherent buffer on ARM may + * be delayed to reach memory, so HC may not see + * hw_token of dummy qtd in time, which can cause + * the qtd transaction to be executed very late, + * and degrade performance a lot. ehci_sync_mem + * is added to flush 'token' immediatelly into + * memory, so that ehci can execute the transaction + * ASAP. + * */ + ehci_sync_mem(); + urb->hcpriv = qh_get (qh); } } diff --git a/drivers/usb/host/ehci.h b/drivers/usb/host/ehci.h index cc7d337..313d9d6 100644 --- a/drivers/usb/host/ehci.h +++ b/drivers/usb/host/ehci.h @@ -738,6 +738,23 @@ static inline u32 hc32_to_cpup (const struct ehci_hcd *ehci, const __hc32 *x) #endif +/* + * Writing to dma coherent memory on ARM may be delayed via L2 + * writing buffer, so introduce the helper which can flush L2 writing + * buffer into memory immediately, especially used to flush ehci + * descriptor to memory. + * */ +#ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE +static inline void ehci_sync_mem() +{ + mb(); +} +#else +static inline void ehci_sync_mem() +{ +} +#endif + /*-------------------------------------------------------------------------*/ #ifndef DEBUG