Message ID | E1ooSWP-000FDy-5t@rmk-PC.armlinux.org.uk (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | ARM: findbit assembly updates | expand |
On Fri, Oct 28, 2022 at 9:47 AM Russell King (Oracle) <rmk+kernel@armlinux.org.uk> wrote: > > Document the ARMv5 bit offset calculation code. Hmm. Don't the generic bitop functions end up using this? We do have a comment in the code that says * On ARMv5 and above, the gcc built-ins may rely on the clz instruction * and produce optimal inlined code in all cases. On ARMv7 it is even * better by also using the rbit instruction. but that 'may' makes me wonder... IOW, what is it in the hand-written code that doesn't get done by the generic code these days? Linus
On Fri, Oct 28, 2022 at 10:05:29AM -0700, Linus Torvalds wrote: > On Fri, Oct 28, 2022 at 9:47 AM Russell King (Oracle) > <rmk+kernel@armlinux.org.uk> wrote: > > > > Document the ARMv5 bit offset calculation code. > > Hmm. Don't the generic bitop functions end up using this? We do have a > comment in the code that says > > * On ARMv5 and above, the gcc built-ins may rely on the clz instruction > * and produce optimal inlined code in all cases. On ARMv7 it is even > * better by also using the rbit instruction. It's true that the generic code also makes use of the rbit and clz instructions - but in terms of the speed of the functions these only get used once we've found a word that is interesting to locate the bit we want in. > but that 'may' makes me wonder... > > IOW, what is it in the hand-written code that doesn't get done by the > generic code these days? For the _find_first_bit, there isn't much difference in the number of instructions or really what is going on, only the organisation and flow of the code is more inline - but that shouldn't make much of a difference. Yet, there is a definite repeatable measurable difference between the two: random-filled: arm : find_first_bit: 17778911 ns, 16448 iterations generic: find_first_bit: 18596622 ns, 16401 iterations sparse: arm : find_first_bit: 7301363 ns, 656 iterations generic: find_first_bit: 7589120 ns, 655 iterations The bigger difference is in the find_next_bit operations, and this likely comes from the arm32 code not having the hassles of the "_and" and other conditionals that the generic code has: random-filled: arm : find_next_bit: 2242618 ns, 163949 iterations generic: find_next_bit: 2632859 ns, 163743 iterations sparse: arm : find_next_bit: 40078 ns, 656 iterations generic: find_next_bit: 69943 ns, 655 iterations find_next_zero_bit show a greater difference: random-filled: arm : find_next_zero_bit: 2049129 ns, 163732 iterations generic: find_next_zero_bit: 2844221 ns, 163938 iterations sparse: arm : find_next_zero_bit: 3939309 ns, 327025 iterations generic: find_next_zero_bit: 5529553 ns, 327026 iterations Here's the disassemblies for comparison. Note that the arm versions share code paths between the functions which makes the code even more compact - so the loop in the find_first gets re-used for find_next after we check the current word. generic: 00000000 <_find_first_bit>: 0: e1a02000 mov r2, r0 4: e2510000 subs r0, r1, #0 8: 012fff1e bxeq lr c: e2422004 sub r2, r2, #4 10: e3a03000 mov r3, #0 14: ea000002 b 24 <_find_first_bit+0x24> 18: e2833020 add r3, r3, #32 1c: e1500003 cmp r0, r3 20: 912fff1e bxls lr 24: e5b2c004 ldr ip, [r2, #4]! 28: e35c0000 cmp ip, #0 2c: 0afffff9 beq 18 <_find_first_bit+0x18> 30: e6ffcf3c rbit ip, ip 34: e16fcf1c clz ip, ip 38: e08c3003 add r3, ip, r3 3c: e1500003 cmp r0, r3 40: 21a00003 movcs r0, r3 44: e12fff1e bx lr 000000e8 <_find_next_bit>: e8: e92d4070 push {r4, r5, r6, lr} ec: e1530002 cmp r3, r2 f0: e59d4010 ldr r4, [sp, #16] f4: e59d5014 ldr r5, [sp, #20] f8: 2a000024 bcs 190 <_find_next_bit+0xa8> fc: e1a0e2a3 lsr lr, r3, #5 100: e3510000 cmp r1, #0 104: e203601f and r6, r3, #31 108: e3c3301f bic r3, r3, #31 10c: e790c10e ldr ip, [r0, lr, lsl #2] 110: 1791e10e ldrne lr, [r1, lr, lsl #2] 114: 100cc00e andne ip, ip, lr 118: e3e0e000 mvn lr, #0 11c: e02cc004 eor ip, ip, r4 120: e3550000 cmp r5, #0 124: e1a0e61e lsl lr, lr, r6 128: 1a00001a bne 198 <_find_next_bit+0xb0> 12c: e01cc00e ands ip, ip, lr 130: 1a000011 bne 17c <_find_next_bit+0x94> 134: e2833020 add r3, r3, #32 138: e1530002 cmp r3, r2 13c: 3a000003 bcc 150 <_find_next_bit+0x68> 140: ea000012 b 190 <_find_next_bit+0xa8> 144: e2833020 add r3, r3, #32 148: e1520003 cmp r2, r3 14c: 9a00000f bls 190 <_find_next_bit+0xa8> 150: e1a0e2a3 lsr lr, r3, #5 154: e3510000 cmp r1, #0 158: e790c10e ldr ip, [r0, lr, lsl #2] 15c: 1791e10e ldrne lr, [r1, lr, lsl #2] 160: 100cc00e andne ip, ip, lr 164: e15c0004 cmp ip, r4 168: 0afffff5 beq 144 <_find_next_bit+0x5c> 16c: e02cc004 eor ip, ip, r4 170: e3550000 cmp r5, #0 174: 0a000000 beq 17c <_find_next_bit+0x94> 178: e6bfcf3c rev ip, ip 17c: e6ffcf3c rbit ip, ip 180: e16fcf1c clz ip, ip 184: e08c3003 add r3, ip, r3 188: e1520003 cmp r2, r3 18c: 21a02003 movcs r2, r3 190: e1a00002 mov r0, r2 194: e8bd8070 pop {r4, r5, r6, pc} 198: e6bfef3e rev lr, lr 19c: e01cc00e ands ip, ip, lr 1a0: 0affffe3 beq 134 <_find_next_bit+0x4c> 1a4: eafffff3 b 178 <_find_next_bit+0x90> ========== arm optimised: 00000060 <_find_first_bit_le>: 60: e3310000 teq r1, #0 64: 0a000006 beq 84 <_find_first_bit_le+0x24> 68: e3a02000 mov r2, #0 6c: e4903004 ldr r3, [r0], #4 70: e1b03003 movs r3, r3 74: 1a000011 bne c0 <_find_next_bit_le+0x34> 78: e2822020 add r2, r2, #32 7c: e1520001 cmp r2, r1 80: 3afffff9 bcc 6c <_find_first_bit_le+0xc> 84: e1a00001 mov r0, r1 88: e12fff1e bx lr 0000008c <_find_next_bit_le>: 8c: e1520001 cmp r2, r1 90: 2afffffb bcs 84 <_find_first_bit_le+0x24> 94: e1a0c2a2 lsr ip, r2, #5 98: e080010c add r0, r0, ip, lsl #2 9c: e212c01f ands ip, r2, #31 a0: 0afffff1 beq 6c <_find_first_bit_le+0xc> a4: e4903004 ldr r3, [r0], #4 a8: e1b03c33 lsrs r3, r3, ip ac: 1a000003 bne c0 <_find_next_bit_le+0x34> b0: e382201f orr r2, r2, #31 b4: e2822001 add r2, r2, #1 b8: eaffffef b 7c <_find_first_bit_le+0x1c> bc: e6bf3f33 rev r3, r3 c0: e6ff3f33 rbit r3, r3 c4: e16f3f13 clz r3, r3 c8: e0820003 add r0, r2, r3 cc: e1510000 cmp r1, r0 d0: 31a00001 movcc r0, r1 d4: e12fff1e bx lr
+ Alexey Klimov On Fri, Oct 28, 2022 at 06:45:50PM +0100, Russell King (Oracle) wrote: > On Fri, Oct 28, 2022 at 10:05:29AM -0700, Linus Torvalds wrote: > > On Fri, Oct 28, 2022 at 9:47 AM Russell King (Oracle) > > <rmk+kernel@armlinux.org.uk> wrote: > > > > > > Document the ARMv5 bit offset calculation code. > > > > Hmm. Don't the generic bitop functions end up using this? We do have a > > comment in the code that says > > > > * On ARMv5 and above, the gcc built-ins may rely on the clz instruction > > * and produce optimal inlined code in all cases. On ARMv7 it is even > > * better by also using the rbit instruction. > > It's true that the generic code also makes use of the rbit and clz > instructions - but in terms of the speed of the functions these only > get used once we've found a word that is interesting to locate the > bit we want in. > > > but that 'may' makes me wonder... > > > > IOW, what is it in the hand-written code that doesn't get done by the > > generic code these days? > > For the _find_first_bit, there isn't much difference in the number > of instructions or really what is going on, only the organisation > and flow of the code is more inline - but that shouldn't make much > of a difference. Yet, there is a definite repeatable measurable > difference between the two: > > random-filled: > arm : find_first_bit: 17778911 ns, 16448 iterations > generic: find_first_bit: 18596622 ns, 16401 iterations > > sparse: > arm : find_first_bit: 7301363 ns, 656 iterations > generic: find_first_bit: 7589120 ns, 655 iterations > > The bigger difference is in the find_next_bit operations, and this > likely comes from the arm32 code not having the hassles of the "_and" > and other conditionals that the generic code has: > > random-filled: > arm : find_next_bit: 2242618 ns, 163949 iterations > generic: find_next_bit: 2632859 ns, 163743 iterations > > sparse: > arm : find_next_bit: 40078 ns, 656 iterations > generic: find_next_bit: 69943 ns, 655 iterations > > find_next_zero_bit show a greater difference: > > random-filled: > arm : find_next_zero_bit: 2049129 ns, 163732 iterations > generic: find_next_zero_bit: 2844221 ns, 163938 iterations > > sparse: > arm : find_next_zero_bit: 3939309 ns, 327025 iterations > generic: find_next_zero_bit: 5529553 ns, 327026 iterations Those numbers disagree with what Alexey has measured on Odroid board for A15 but somewhat in line with what he had for A7: https://lore.kernel.org/all/YuWk3titnOiQACzC@yury-laptop/ Can you please share details about your methodology: what CPU did you use; did you lock cpu frequencies; in addition to mean, can you also show standard deviation, raw logs? To Alexey: if you have time, can you please repeat those measurements on top of v6 + this series? This would help a lot in understanding how new code performs, both generic and arch. If generic vs arch code comparison looks different for different CPU versions, what should we prefer? > Here's the disassemblies for comparison. Note that the arm versions > share code paths between the functions which makes the code even more > compact - so the loop in the find_first gets re-used for find_next > after we check the current word. > > generic: [...] > 000000e8 <_find_next_bit>: > e8: e92d4070 push {r4, r5, r6, lr} > ec: e1530002 cmp r3, r2 > f0: e59d4010 ldr r4, [sp, #16] > f4: e59d5014 ldr r5, [sp, #20] > f8: 2a000024 bcs 190 <_find_next_bit+0xa8> > fc: e1a0e2a3 lsr lr, r3, #5 > 100: e3510000 cmp r1, #0 > 104: e203601f and r6, r3, #31 > 108: e3c3301f bic r3, r3, #31 > 10c: e790c10e ldr ip, [r0, lr, lsl #2] > 110: 1791e10e ldrne lr, [r1, lr, lsl #2] > 114: 100cc00e andne ip, ip, lr > 118: e3e0e000 mvn lr, #0 > 11c: e02cc004 eor ip, ip, r4 > 120: e3550000 cmp r5, #0 > 124: e1a0e61e lsl lr, lr, r6 > 128: 1a00001a bne 198 <_find_next_bit+0xb0> > 12c: e01cc00e ands ip, ip, lr > 130: 1a000011 bne 17c <_find_next_bit+0x94> > 134: e2833020 add r3, r3, #32 > 138: e1530002 cmp r3, r2 > 13c: 3a000003 bcc 150 <_find_next_bit+0x68> > 140: ea000012 b 190 <_find_next_bit+0xa8> > 144: e2833020 add r3, r3, #32 > 148: e1520003 cmp r2, r3 > 14c: 9a00000f bls 190 <_find_next_bit+0xa8> > 150: e1a0e2a3 lsr lr, r3, #5 > 154: e3510000 cmp r1, #0 > 158: e790c10e ldr ip, [r0, lr, lsl #2] > 15c: 1791e10e ldrne lr, [r1, lr, lsl #2] > 160: 100cc00e andne ip, ip, lr > 164: e15c0004 cmp ip, r4 > 168: 0afffff5 beq 144 <_find_next_bit+0x5c> > 16c: e02cc004 eor ip, ip, r4 > 170: e3550000 cmp r5, #0 > 174: 0a000000 beq 17c <_find_next_bit+0x94> > 178: e6bfcf3c rev ip, ip > 17c: e6ffcf3c rbit ip, ip > 180: e16fcf1c clz ip, ip > 184: e08c3003 add r3, ip, r3 > 188: e1520003 cmp r2, r3 > 18c: 21a02003 movcs r2, r3 > 190: e1a00002 mov r0, r2 > 194: e8bd8070 pop {r4, r5, r6, pc} > 198: e6bfef3e rev lr, lr > 19c: e01cc00e ands ip, ip, lr > 1a0: 0affffe3 beq 134 <_find_next_bit+0x4c> > 1a4: eafffff3 b 178 <_find_next_bit+0x90> On top of master, my generic _find_next_bit looks different: 000000e4 <_find_next_bit>: e4: e1510002 cmp r1, r2 e8: e1a0c000 mov ip, r0 ec: e1a00001 mov r0, r1 f0: 912fff1e bxls lr f4: e1a012a2 lsr r1, r2, #5 f8: e92d4010 push {r4, lr} fc: e202201f and r2, r2, #31 100: e3e03000 mvn r3, #0 104: e79ce101 ldr lr, [ip, r1, lsl #2] 108: e01ee213 ands lr, lr, r3, lsl r2 10c: 1a000012 bne 15c <_find_next_bit+0x78> 110: e2813001 add r3, r1, #1 114: e1a04283 lsl r4, r3, #5 118: e1540000 cmp r4, r0 11c: 28bd8010 popcs {r4, pc} 120: e08c2101 add r2, ip, r1, lsl #2 124: ea000002 b 134 <_find_next_bit+0x50> 128: e1a04283 lsl r4, r3, #5 12c: e1500004 cmp r0, r4 130: 98bd8010 popls {r4, pc} 134: e5b2e004 ldr lr, [r2, #4]! 138: e2833001 add r3, r3, #1 13c: e35e0000 cmp lr, #0 140: 0afffff8 beq 128 <_find_next_bit+0x44> 144: e6ffef3e rbit lr, lr 148: e16fef1e clz lr, lr 14c: e084400e add r4, r4, lr 150: e1500004 cmp r0, r4 154: 21a00004 movcs r0, r4 158: e8bd8010 pop {r4, pc} 15c: e1a04281 lsl r4, r1, #5 160: eafffff7 b 144 <_find_next_bit+0x60> Are you sure you're running latest kernel? Thanks, Yury
On Fri, Oct 28, 2022 at 10:45 AM Russell King (Oracle) <linux@armlinux.org.uk> wrote: > > For the _find_first_bit, there isn't much difference in the number > of instructions or really what is going on, only the organisation > and flow of the code is more inline - but that shouldn't make much > of a difference. Yet, there is a definite repeatable measurable > difference between the two: Hmm. Interestingly, your _find_first_zero_bit_le() (which find_next_bit ends up using except for the first byte) ends up doing an optimization that is technically not valid. In particular, the *generic* code does sz = min(idx * BITS_PER_LONG + __ffs(MUNGE(val)), sz); for the final result. In contrast, the arm code doesn't do the "min()" at all, and if there are bits after the bitmap (in a partial byte), it will just return those bits. So the arm code ends up avoiding some operations. Which works most of the time, because (a) usually bitmaps are byte-aligned anyway (b) most users do "find_first_bit(.., size) >= size" as the "found no bits" test but it actually looks to me like your handcoded arm code is simply wrong. At least going by our docbook comments for find_first_bit: * Returns the bit number of the first set bit. * If no bits are set, returns @size. And look here: bitmap_empty() does return find_first_bit(src, nbits) == nbits; and now imagine that 'nbits' is not a small constant value (which we handle separately) and is also not byte aligned. Maybe I'm mis-reading your assembly (I "know" arm assembly, but I can't read it fluently like x86). But I don't think so. So I think your code is actually buggy, but probably the bug is quite hard to trigger in practice due to (a)/(b) above. We do have bitmaps that aren't byte-aligned. The cpumask ones are the most common ones. But in the cpumask_first() implementation (which is just a wrapper for find_first_bit()), our documentation actually says * Returns >= nr_cpu_ids if no cpus set. and I think that may have been what we historically did elsewhere too, and may be the source of the arm situation. Anyway, this can be fixed by either (a) fixing the arm code (b) changing the docs and making that ">= size" be the right thing to do but I do think this is a very strong example of why having architecture-specific code like this is very very dangerous. Because as it stands now, that arm code really looks like it's actively buggy to me. Linus
On Fri, Oct 28, 2022 at 12:01 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > In contrast, the arm code doesn't do the "min()" at all, and if there > are bits after the bitmap (in a partial byte), it will just return > those bits. No, I did misread the code. It returns 'size' for the no bits case, and the 'movlo r0, r1'; does the right thing for the "bits past the end" case too. I guess I need to get better at reading arm asm. Linus
On Fri, Oct 28, 2022 at 11:37:44AM -0700, Yury Norov wrote: > + Alexey Klimov > > On Fri, Oct 28, 2022 at 06:45:50PM +0100, Russell King (Oracle) wrote: > > On Fri, Oct 28, 2022 at 10:05:29AM -0700, Linus Torvalds wrote: > > > On Fri, Oct 28, 2022 at 9:47 AM Russell King (Oracle) > > > <rmk+kernel@armlinux.org.uk> wrote: > > > > > > > > Document the ARMv5 bit offset calculation code. > > > > > > Hmm. Don't the generic bitop functions end up using this? We do have a > > > comment in the code that says > > > > > > * On ARMv5 and above, the gcc built-ins may rely on the clz instruction > > > * and produce optimal inlined code in all cases. On ARMv7 it is even > > > * better by also using the rbit instruction. > > > > It's true that the generic code also makes use of the rbit and clz > > instructions - but in terms of the speed of the functions these only > > get used once we've found a word that is interesting to locate the > > bit we want in. > > > > > but that 'may' makes me wonder... > > > > > > IOW, what is it in the hand-written code that doesn't get done by the > > > generic code these days? > > > > For the _find_first_bit, there isn't much difference in the number > > of instructions or really what is going on, only the organisation > > and flow of the code is more inline - but that shouldn't make much > > of a difference. Yet, there is a definite repeatable measurable > > difference between the two: > > > > random-filled: > > arm : find_first_bit: 17778911 ns, 16448 iterations > > generic: find_first_bit: 18596622 ns, 16401 iterations > > > > sparse: > > arm : find_first_bit: 7301363 ns, 656 iterations > > generic: find_first_bit: 7589120 ns, 655 iterations > > > > The bigger difference is in the find_next_bit operations, and this > > likely comes from the arm32 code not having the hassles of the "_and" > > and other conditionals that the generic code has: > > > > random-filled: > > arm : find_next_bit: 2242618 ns, 163949 iterations > > generic: find_next_bit: 2632859 ns, 163743 iterations > > > > sparse: > > arm : find_next_bit: 40078 ns, 656 iterations > > generic: find_next_bit: 69943 ns, 655 iterations > > > > find_next_zero_bit show a greater difference: > > > > random-filled: > > arm : find_next_zero_bit: 2049129 ns, 163732 iterations > > generic: find_next_zero_bit: 2844221 ns, 163938 iterations > > > > sparse: > > arm : find_next_zero_bit: 3939309 ns, 327025 iterations > > generic: find_next_zero_bit: 5529553 ns, 327026 iterations > > Those numbers disagree with what Alexey has measured on Odroid board > for A15 but somewhat in line with what he had for A7: Considering no one has seen these patches until I've just posted them, frankly I don't think there's any point me looking at anyone elses results. These changes make substantial improvements to the arm32 assembly code versions. If you want a like-for-like comparison, then please get Alexey to test with these patches applied. I am confident that he will confirm my results.
On Fri, Oct 28, 2022 at 12:01:00PM -0700, Linus Torvalds wrote: > Hmm. Interestingly, your _find_first_zero_bit_le() (which > find_next_bit ends up using except for the first byte) ends up doing > an optimization that is technically not valid. > > In particular, the *generic* code does > > sz = min(idx * BITS_PER_LONG + __ffs(MUNGE(val)), sz); > > for the final result. > > In contrast, the arm code doesn't do the "min()" at all, and if there > are bits after the bitmap (in a partial byte), it will just return > those bits. You've missed how the min() is coded. Specifically, that's handled by: cc: e1510000 cmp r1, r0 d0: 31a00001 movcc r0, r1 which clamps the returned index to the size of the array (held in r1). So everything is in fact fine - and I think your analysis is incorrect. Please could you take another look and evaluate whether you think the arm assembly is incorrect. I kind'a stopped reading here on the assumption that the remainder of your email was based on this misinterpretation of the code.
On Fri, Oct 28, 2022 at 12:46 PM Russell King (Oracle) <linux@armlinux.org.uk> wrote: > > You've missed how the min() is coded. Yes, I realised that and sent another email about my mea culpa. Linus
diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S index 7fd3600db8ef..4c584bc4704b 100644 --- a/arch/arm/lib/findbit.S +++ b/arch/arm/lib/findbit.S @@ -172,10 +172,10 @@ ENDPROC(_find_next_bit_be) .L_found: #if __LINUX_ARM_ARCH__ >= 5 rsb r0, r3, #0 - and r3, r3, r0 - clz r3, r3 - rsb r3, r3, #31 - add r0, r2, r3 + and r3, r3, r0 @ mask out lowest bit set + clz r3, r3 @ count high zero bits + rsb r3, r3, #31 @ offset of first set bit + add r0, r2, r3 @ add offset of first set bit #else tst r3, #0x0f addeq r2, r2, #4
Document the ARMv5 bit offset calculation code. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> --- arch/arm/lib/findbit.S | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)