Message ID | BANLkTinK88QT_7Zr2YF1sTAiM778Xp9rFQ@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
On Mon, Jun 27, 2011 at 11:42:52AM +0200, Per Forlin wrote:
> Conclusion:
> Working with mmc the relative cost of DSB is almost none. There seems
> to be slightly higher number for mmc blocking requests with the DSB
> patch compared to not having it.

These figures suggest that dsb is comparatively not heavy on the hardware you're testing.

I think I'm going to apply the patch anyway - it certainly makes stuff no worse, and if someone has a platform where dsb is more expensive, then they should see a greater benefit from this change.

The next thing to think about in DMA-land is whether we should total up the size of the SG list and choose whether to flush the individual SG elements or do a full cache flush. There becomes a point where the full cache flush becomes cheaper than flushing each SG element individually.
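The SG-list decision proposed above can be sketched in plain C. This is only an illustration under stated assumptions: `struct sg_seg`, `enum flush_strategy`, `choose_flush()` and the `limit` parameter are hypothetical names, not kernel APIs, and the break-even size has to be measured per platform (a later message in this thread reports about 32 KB on one system).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one scatter-gather element (not struct scatterlist). */
struct sg_seg {
	size_t length;		/* bytes covered by this element */
};

enum flush_strategy {
	FLUSH_PER_ELEMENT,	/* maintain each element's range by address */
	FLUSH_FULL_CACHE,	/* one whole-D-cache operation */
};

/*
 * Total up the list and pick a strategy. `limit` is an assumed,
 * platform-specific break-even size in bytes.
 */
static enum flush_strategy choose_flush(const struct sg_seg *segs,
					size_t nsegs, size_t limit)
{
	size_t total = 0;
	size_t i;

	assert(segs != NULL || nsegs == 0);

	for (i = 0; i < nsegs; i++) {
		total += segs[i].length;
		if (total >= limit)	/* stop early: the answer cannot change */
			return FLUSH_FULL_CACHE;
	}
	return FLUSH_PER_ELEMENT;
}
```

Treating a total exactly equal to `limit` as a full flush is an arbitrary choice here, since both paths cost about the same at break-even. On SMP a whole-cache operation would presumably have to run on every CPU, which this sketch does not model.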
On 27 June 2011 12:02, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
> On Mon, Jun 27, 2011 at 11:42:52AM +0200, Per Forlin wrote:
>> Conclusion:
>> Working with mmc the relative cost of DSB is almost none. There seems
>> to be slightly higher number for mmc blocking requests with the DSB
>> patch compared to not having it.
>
> These figures suggest that dsb is comparatively not heavy on the hardware
> you're testing.

Yes, of course.

> I think I'm going to apply the patch anyway - it certainly makes stuff
> no worse, and if someone has a platform where dsb is more expensive,
> then they should see a greater benefit from this change.

I agree.

> The next thing to think about in DMA-land is whether we should total up
> the size of the SG list and choose whether to flush the individual SG
> elements or do a full cache flush. There becomes a point where the full
> cache flush becomes cheaper than flushing each SG element individually.

Interesting. I have seen such optimisations in hwmem (yet another memory manager for multimedia hardware). It would be nice to have such functionality in dma-mapping, to be used by anyone.

Regards,
Per
On Mon, Jun 27, 2011 at 12:02 PM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
> The next thing to think about in DMA-land is whether we should total up
> the size of the SG list and choose whether to flush the individual SG
> elements or do a full cache flush. There becomes a point where the full
> cache flush becomes cheaper than flushing each SG element individually.

We noticed that even for a single (large) buffer, once a cache flush operation exceeds a certain size threshold, flushing individual lines becomes more expensive than flushing the entire cache.

I asked colleagues to look into implementing this threshold in the arch/arm/mm/cache-v7.S file, but I think they ran into trouble and eventually had to give up on it.

Vijay or Srinidhi, can you share your findings?

Thanks,
Linus Walleij
Hi,

Below are the timings for clean & flush.

/*
 * Size      Clean     Dirty_clean  Flush     Dirty_Flush
 *           T1(ns)    T2(ns)       T3(ns)    T4(ns)
 * ============================================================
 * 4096      30517     30517        30517     30517
 * 8192      30517     30517        30517     30517
 * 16384     30518     30518        30518     30518
 * 32768     30518     30518        30518     61035  <--
 * 36864     61036     61036        61035     61035
 * 65536     91553     91553        91553     91553
 * 131072    183106    183106       183106    183106
 * Full      30518     30518        30518     30518  <--
 * Cache
 */

/* Based on the above values, a size of 32768 is break-even for
 * flushing/cleaning the full D cache */

With a 32KB DLIMIT, I noticed a small reduction of about 1 fps in the skiamark profile after this change. This could be because a full flush or clean causes more cache misses later in execution. With a 64KB DLIMIT, however, skiamark performance degrades further, so I think 32KB is a good value.

However, problems are seen in the Android UI: small artifacts appear on UI widgets during video playback. These artifacts are not seen if clean is called on each CPU.

Also, it takes some effort to implement clean_all / flush_all APIs in the cache-v7.S (asm) file that execute on each CPU, hence this was set aside. I have also not investigated why, in the flush-all case, flushing on both CPUs always works.

Thanks & Regards
Vijay

-----Original Message-----
From: Linus Walleij [mailto:linus.walleij@linaro.org]
Sent: Monday, June 27, 2011 5:30 PM
To: Russell King - ARM Linux; Srinidhi KASAGAR; Vijaya Kumar K-1
Cc: Per Forlin; Nicolas Pitre; Chris Ball; linaro-dev@lists.linaro.org; linux-mmc@vger.kernel.org; linux-arm-kernel@lists.infradead.org; Robert Fekete
Subject: Re: [PATCH v6 00/11] mmc: use nonblock mmc requests to minimize latency

On Mon, Jun 27, 2011 at 12:02 PM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
> The next thing to think about in DMA-land is whether we should total up
> the size of the SG list and choose whether to flush the individual SG
> elements or do a full cache flush. There becomes a point where the full
> cache flush becomes cheaper than flushing each SG element individually.

We noticed that even for a single (large) buffer, once a cache flush operation exceeds a certain size threshold, flushing individual lines becomes more expensive than flushing the entire cache.

I asked colleagues to look into implementing this threshold in the arch/arm/mm/cache-v7.S file, but I think they ran into trouble and eventually had to give up on it.

Vijay or Srinidhi, can you share your findings?

Thanks,
Linus Walleij
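As a worked check of the break-even reading of Vijay's timing table, the following C sketch scans two of its columns for the first size whose by-range time exceeds the full-cache time. The helper name `first_size_slower_than_full()` and the `struct sample` layout are hypothetical; only the numbers come from the table.

```c
#include <assert.h>
#include <stddef.h>

struct sample {
	unsigned long size;	/* buffer size in bytes */
	unsigned long ns;	/* measured time of the by-range operation */
};

/*
 * Return the smallest measured size whose by-range time exceeds the
 * full-cache time, or 0 if none does. Assumes ascending sizes.
 */
static unsigned long first_size_slower_than_full(const struct sample *s,
						 size_t n,
						 unsigned long full_ns)
{
	size_t i;

	assert(s != NULL || n == 0);

	for (i = 0; i < n; i++)
		if (s[i].ns > full_ns)
			return s[i].size;
	return 0;
}

/* Dirty_Flush and Clean columns copied from the table in the message. */
static const struct sample dirty_flush[] = {
	{4096, 30517}, {8192, 30517}, {16384, 30518}, {32768, 61035},
	{36864, 61035}, {65536, 91553}, {131072, 183106},
};

static const struct sample clean[] = {
	{4096, 30517}, {8192, 30517}, {16384, 30518}, {32768, 30518},
	{36864, 61036}, {65536, 91553}, {131072, 183106},
};
```

Against the full-cache figure of 30518 ns, the Dirty_Flush column first exceeds it at 32768 bytes and the Clean column at 36864 bytes, which matches the 32 KB limit discussed in the message. All the timings sit near multiples of about 30517.6 ns, one period of a 32.768 kHz clock, so the measurements appear to be quantized by a coarse timer and the true crossover may lie anywhere between adjacent rows.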
diff --git a/arch/arm/mm/cache-v7.S b/arch/arm/mm/cache-v7.S
index d32f02b..3fb51c5 100644
--- a/arch/arm/mm/cache-v7.S
+++ b/arch/arm/mm/cache-v7.S
@@ -228,7 +228,6 @@ ENTRY(v7_flush_kern_dcache_area)
 	add	r0, r0, r2
 	cmp	r0, r1
 	blo	1b
-	dsb
 	mov	pc, lr
 ENDPROC(v7_flush_kern_dcache_area)