Message ID | 20140509212810.GF18257@obsidianresearch.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 09 May 03:28 PM, Jason Gunthorpe wrote: > > > I gave this a try in order to answer Arnd's performance > > question. First of all, the patch seems wrong. I guess it's because > > readsl reads 4-bytes pieces, instead of 8-bytes. > > > > This patch below is tested (but not completely, see below) and works: > > Compilers are better now, I think you can just ditch the weirdness: > [..] > > The below gives: > > c8: ea000002 b d8 <orion_nand_read_buf+0x84> > cc: e5dc0000 ldrb r0, [ip] > d0: e7c30001 strb r0, [r3, r1] > d4: e2811001 add r1, r1, #1 > d8: e1510002 cmp r1, r2 > > Which looks the same as the asm version to me. > Nice! It wasn't really needed but since I have the board here: # time nanddump /dev/mtd5 -f /dev/null -q real 0m 5.82s user 0m 0.20s sys 0m 5.60s Jason: Care to submit a proper patch? On 08 May 04:56 PM, Arnd Bergmann wrote: > Ok, that is a noticeable difference. For scale, what is the size of that partition? The board is Openblocks A6, running mainline. # cat /proc/mtd dev: size erasesize name mtd0: 00090000 00004000 "uboot" mtd1: 00044000 00004000 "env" mtd2: 00024000 00004000 "test" mtd3: 00400000 00004000 "conf" mtd4: 01d20000 00004000 "linux" mtd5: 01dec000 00004000 "user"
On Friday 09 May 2014 19:09:15 Ezequiel Garcia wrote: > # time nanddump /dev/mtd5 -f /dev/null -q > real 0m 5.82s > user 0m 0.20s > sys 0m 5.60s > > Jason: Care to submit a proper patch? > > On 08 May 04:56 PM, Arnd Bergmann wrote: > > Ok, that is a noticeable difference. For scale, what is the size of that partition? > > The board is Openblocks A6, running mainline. > > # cat /proc/mtd > dev: size erasesize name > mtd0: 00090000 00004000 "uboot" > mtd1: 00044000 00004000 "env" > mtd2: 00024000 00004000 "test" > mtd3: 00400000 00004000 "conf" > mtd4: 01d20000 00004000 "linux" > mtd5: 01dec000 00004000 "user" Ok, so it takes 5.6 seconds in kernel mode to access 31MB, which comes down to 5.60MB/s. That isn't very fast compared to the time the CPU should take for those instructions, so I'm surprised it actually makes any difference at all. There isn't a usable slave DMA engine in Armada XP by chance? Arnd
On 10 May 12:24 AM, Arnd Bergmann wrote: > On Friday 09 May 2014 19:09:15 Ezequiel Garcia wrote: > > # time nanddump /dev/mtd5 -f /dev/null -q > > real 0m 5.82s > > user 0m 0.20s > > sys 0m 5.60s > > > > Jason: Care to submit a proper patch? > > > > On 08 May 04:56 PM, Arnd Bergmann wrote: > > > Ok, that is a noticeable difference. For scale, what is the size of that partition? > > > > The board is Openblocks A6, running mainline. > > > > # cat /proc/mtd > > dev: size erasesize name > > mtd0: 00090000 00004000 "uboot" > > mtd1: 00044000 00004000 "env" > > mtd2: 00024000 00004000 "test" > > mtd3: 00400000 00004000 "conf" > > mtd4: 01d20000 00004000 "linux" > > mtd5: 01dec000 00004000 "user" > > Ok, so it takes 5.6 seconds in kernel mode to access 31MB, which comes down > to 5.60MB/s. That isn't very fast compared to the time the CPU should take > for those instructions, so I'm surprised it actually makes any difference > at all. > > There isn't a usable slave DMA engine in Armada XP by chance? > The Openblocks A6 is a Kirkwood, not Armada XP.
On Fri, May 09, 2014 at 07:09:15PM -0300, Ezequiel Garcia wrote: > On 09 May 03:28 PM, Jason Gunthorpe wrote: > > > > > I gave this a try in order to answer Arnd's performance > > > question. First of all, the patch seems wrong. I guess it's because > > > readsl reads 4-bytes pieces, instead of 8-bytes. > > > > > > This patch below is tested (but not completely, see below) and works: > > > > Compilers are better now, I think you can just ditch the weirdness: > > > [..] > > > > The below gives: > > > > c8: ea000002 b d8 <orion_nand_read_buf+0x84> > > cc: e5dc0000 ldrb r0, [ip] > > d0: e7c30001 strb r0, [r3, r1] > > d4: e2811001 add r1, r1, #1 > > d8: e1510002 cmp r1, r2 > > > > Which looks the same as the asm version to me. > > > > Nice! It wasn't really needed but since I have the board here: > > # time nanddump /dev/mtd5 -f /dev/null -q > real 0m 5.82s > user 0m 0.20s > sys 0m 5.60s > > Jason: Care to submit a proper patch? Sure, but did anyone (Arnd?) have thoughts on a better way to do this: +#ifdef CONFIG_64BIT + buf64[i++] = readq_relaxed(io_base); +#else + buf64[i++] = *(const volatile u64 __force *)io_base; +#endif IMHO, readq should exist on any platform that can issue a 64 bit bus transaction, and I expect many ARM's qualify. > On 08 May 04:56 PM, Arnd Bergmann wrote: > Ok, so it takes 5.6 seconds in kernel mode to access 31MB, which > comes down to 5.60MB/s. That isn't very fast compared to the time > the CPU should take for those instructions, so I'm surprised it > actually makes any difference at all. Likely, what is happening is that the bus interface is holding off returning the read data until it complets the bus cycles, then the response travels to the CPU which turns around another. This creates a dead time where the bus isn't do anything. The larger bus transfer the CPU can do the less percentage of time the turnaround takes as overhead. If the cpu could pipeline two reads then it could be highest-possible, but I guess the memory ordering for the mapping prevents that?? Regarding DMA, who knows if the interface can handle a burst transfer.. Jason
On Tuesday 13 May 2014 14:55:48 Jason Gunthorpe wrote: > On Fri, May 09, 2014 at 07:09:15PM -0300, Ezequiel Garcia wrote: > > On 09 May 03:28 PM, Jason Gunthorpe wrote: > > > > > > > I gave this a try in order to answer Arnd's performance > > > > question. First of all, the patch seems wrong. I guess it's because > > > > readsl reads 4-bytes pieces, instead of 8-bytes. > > > > > > > > This patch below is tested (but not completely, see below) and works: > > > > > > Compilers are better now, I think you can just ditch the weirdness: > > > > > [..] > > > > > > The below gives: > > > > > > c8: ea000002 b d8 <orion_nand_read_buf+0x84> > > > cc: e5dc0000 ldrb r0, [ip] > > > d0: e7c30001 strb r0, [r3, r1] > > > d4: e2811001 add r1, r1, #1 > > > d8: e1510002 cmp r1, r2 > > > > > > Which looks the same as the asm version to me. > > > > > > > Nice! It wasn't really needed but since I have the board here: > > > > # time nanddump /dev/mtd5 -f /dev/null -q > > real 0m 5.82s > > user 0m 0.20s > > sys 0m 5.60s > > > > Jason: Care to submit a proper patch? > > Sure, but did anyone (Arnd?) have thoughts on a better way to do this: > > +#ifdef CONFIG_64BIT > + buf64[i++] = readq_relaxed(io_base); > +#else > + buf64[i++] = *(const volatile u64 __force *)io_base; > +#endif > > IMHO, readq should exist on any platform that can issue a 64 bit bus > transaction, and I expect many ARM's qualify. Well, the original problem happened specifically for the case that doesn't have a 64-bit bus transaction (building for ARMv4). If we define readq_relaxed, it has to be an inline assembly, in order to work for drivers that require an atomic transaction, so that would have the same problem. We are a bit inconsistent here though: most 32-bit architectures have no readq, parisc has one that does two 32-bit accesses, sh relies on the compiler, and tile apparently has a native instruction. It seems reasonable to replace the inline assembly in this driver with a new function as a cleanup, but then how do you want to solve the case of building a combined armv4/v5 kernel? Arnd
On Wed, May 14, 2014 at 1:47 PM, Arnd Bergmann <arnd@arndb.de> wrote: >> Sure, but did anyone (Arnd?) have thoughts on a better way to do this: >> >> +#ifdef CONFIG_64BIT >> + buf64[i++] = readq_relaxed(io_base); >> +#else >> + buf64[i++] = *(const volatile u64 __force *)io_base; >> +#endif >> >> IMHO, readq should exist on any platform that can issue a 64 bit bus >> transaction, and I expect many ARM's qualify. > > Well, the original problem happened specifically for the case that doesn't > have a 64-bit bus transaction (building for ARMv4). If we define > readq_relaxed, it has to be an inline assembly, in order to work for > drivers that require an atomic transaction, so that would have the > same problem. We are a bit inconsistent here though: most 32-bit > architectures have no readq, parisc has one that does two 32-bit accesses, > sh relies on the compiler, and tile apparently has a native instruction. > > It seems reasonable to replace the inline assembly in this driver with > a new function as a cleanup, but then how do you want to solve the case > of building a combined armv4/v5 kernel? Provide two helper functions, one for v4, one for v5, and call them through a function pointer that's set up during driver probe? Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
On Wednesday 14 May 2014 14:35:28 Geert Uytterhoeven wrote: > On Wed, May 14, 2014 at 1:47 PM, Arnd Bergmann <arnd@arndb.de> wrote: > >> Sure, but did anyone (Arnd?) have thoughts on a better way to do this: > >> > >> +#ifdef CONFIG_64BIT > >> + buf64[i++] = readq_relaxed(io_base); > >> +#else > >> + buf64[i++] = *(const volatile u64 __force *)io_base; > >> +#endif > >> > >> IMHO, readq should exist on any platform that can issue a 64 bit bus > >> transaction, and I expect many ARM's qualify. > > > > Well, the original problem happened specifically for the case that doesn't > > have a 64-bit bus transaction (building for ARMv4). If we define > > readq_relaxed, it has to be an inline assembly, in order to work for > > drivers that require an atomic transaction, so that would have the > > same problem. We are a bit inconsistent here though: most 32-bit > > architectures have no readq, parisc has one that does two 32-bit accesses, > > sh relies on the compiler, and tile apparently has a native instruction. > > > > It seems reasonable to replace the inline assembly in this driver with > > a new function as a cleanup, but then how do you want to solve the case > > of building a combined armv4/v5 kernel? > > Provide two helper functions, one for v4, one for v5, and call > them through a function pointer that's set up during driver probe? No, that won't help. It's a compile-time problem only: This driver is never actually used on ARMv4, but if we build a kernel that supports both ARMv4 and later, gcc passes -march=armv4 to the assembler, which barfs on an invalid instruction. It would be fine to make that error message just go away because we know it will only be used on CPUs that do have this instruction. Arnd
diff --git a/drivers/mtd/nand/orion_nand.c b/drivers/mtd/nand/orion_nand.c index dc9d07f34b8a..fea1597f623e 100644 --- a/drivers/mtd/nand/orion_nand.c +++ b/drivers/mtd/nand/orion_nand.c @@ -95,16 +95,13 @@ static void orion_nand_read_buf(struct mtd_info *mtd, uint8_t *buf, int len) } buf64 = (uint64_t *)buf; while (i < len/8) { - /* - * Since GCC has no proper constraint (PR 43518) - * force x variable to r2/r3 registers as ldrd instruction - * requires first register to be even. - */ - register uint64_t x asm ("r2"); - - asm volatile ("ldrd\t%0, [%1]" : "=&r" (x) : "r" (io_base)); - buf64[i++] = x; +#ifdef CONFIG_64BIT + buf64[i++] = readq_relaxed(io_base); +#else + buf64[i++] = *(const volatile u64 __force *)io_base; +#endif } + i *= 8; while (i < len) buf[i++] = readb(io_base);