| Message ID | 20131212164748.GS4360@n2100.arm.linux.org.uk (mailing list archive) |
| --- | --- |
| State | New, archived |
On Thu, Dec 12, 2013 at 04:47:48PM +0000, Russell King - ARM Linux wrote:
> > Then changing the type of the function argument would probably be safer!
>
> Actually, I think we can do a bit better with this code.  We really don't
> need much of this messing around here, we can combine some of these steps.
>
> We have:
>
>         16-bit protocol in host endian
>         16-bit length in host endian
>
> and we need to combine them into a 32-bit checksum which is then
> subsequently folded down to 16 bits by adding the top and bottom halves.
>
> Now, what we can do is this:
>
> 1. Construct a combined 32-bit protocol and length:
>
>         unsigned lenprot = len | proto << 16;
>
> 2. Pass this into the assembly thusly:
>
>         __asm__(
>         "adds   %0, %1, %2      @ csum_tcpudp_nofold    \n\t"
>         "adcs   %0, %0, %3                              \n\t"
> #ifdef __ARMEB__
>         "adcs   %0, %0, %4                              \n\t"
> #else
>         "adcs   %0, %0, %4, ror #8                      \n\t"
> #endif
>         "adc    %0, %0, #0"
>         : "=&r"(sum)
>         : "r" (sum), "r" (daddr), "r" (saddr), "r" (lenprot)
>         : "cc");
>
> with no swabbing at this stage.  Well, where do we get the endian
> conversion?  See that ror #8 - that's a 32-bit rotate by 8 bits.  As
> these are two 16-bit quantities, we end up with this:
>
> original:
>         31..24  23..16  15..8   7..0
>         pro_h   pro_l   len_h   len_l
>
> accumulated:
>         31..24  23..16  15..8   7..0
>         len_l   pro_h   pro_l   len_h
>
> And now when we fold it down to 16 bits:
>
>         15..8   7..0
>         len_l   pro_h
>         pro_l   len_h

Amusingly, I used the same optimization yesterday when computing a TCP
pseudo-header checksum.

Another thing that can be done to improve the folding of the 16-bit
checksum is to swap the values to be added, sum them and only keep
the high half integer which already contains the carry. At least on
x86 I save some cycles doing this :

        31:24   23:16   15:8    7:0
 sum32 =  D       C       B      A

To fold this into 16-bit at a time, I just do this :

                  31:24   23:16   15:8    7:0
   sum32            D       C       B      A
 + sum32swapped     B       A       D      C
 =                 D+B  C+A+carry(B+D/C+A)  B+D  C+A

so just take the upper result and you get the final 16-bit word at
once.

In C it does :

        fold16 = (((sum32 >> 16) | (sum32 << 16)) + sum32) >> 16

When the CPU has a rotate instruction, it's fast :-)

Cheers,
Willy
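The one-line fold Willy gives can be checked in isolation. Below is a small user-space sketch (not part of the thread; the helper names fold16_rotate and fold16_loop are invented for illustration) that compares the swap-and-add fold against the classic add-the-carry-back loop over random 32-bit sums:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* One-step fold: swap the halves, add, keep the top half,
 * which already contains the carry of the low-half addition. */
static uint16_t fold16_rotate(uint32_t sum32)
{
	return (((sum32 >> 16) | (sum32 << 16)) + sum32) >> 16;
}

/* Classic fold: keep adding the carry back until nothing overflows. */
static uint16_t fold16_loop(uint32_t sum32)
{
	while (sum32 >> 16)
		sum32 = (sum32 & 0xffff) + (sum32 >> 16);
	return sum32;
}

int main(void)
{
	for (int i = 0; i < 1000000; i++) {
		uint32_t s = ((uint32_t)rand() << 16) ^ rand();

		if (fold16_rotate(s) != fold16_loop(s)) {
			printf("mismatch at 0x%08x\n", (unsigned int)s);
			return 1;
		}
	}
	printf("rotate fold and loop fold agree\n");
	return 0;
}
```

Both folds agree for every input because the carry out of the low half of the rotated addition lands in the high half, which is exactly the property being exploited above.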
On Thu, Dec 12, 2013 at 06:11:08PM +0100, Willy Tarreau wrote:
> Another thing that can be done to improve the folding of the 16-bit
> checksum is to swap the values to be added, sum them and only keep
> the high half integer which already contains the carry. At least on
> x86 I save some cycles doing this :
>
>         31:24   23:16   15:8    7:0
>  sum32 =  D       C       B      A
>
> To fold this into 16-bit at a time, I just do this :
>
>                   31:24   23:16   15:8    7:0
>    sum32            D       C       B      A
>  + sum32swapped     B       A       D      C
>  =                 D+B  C+A+carry(B+D/C+A)  B+D  C+A
>
> so just take the upper result and you get the final 16-bit word at
> once.
>
> In C it does :
>
>         fold16 = (((sum32 >> 16) | (sum32 << 16)) + sum32) >> 16
>
> When the CPU has a rotate instruction, it's fast :-)

Indeed - and if your CPU can do the rotate and add at the same time,
it's just a single instruction, and it ends up looking remarkably
similar to this:

static inline __sum16 csum_fold(__wsum sum)
{
	__asm__(
	"add	%0, %1, %1, ror #16	@ csum_fold"
	: "=r" (sum)
	: "r" (sum)
	: "cc");
	return (__force __sum16)(~(__force u32)sum >> 16);
}
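For readers without the ARM reference handy, here is a plain-C rendering of what that add-with-rotate does, written as a hypothetical portable helper (standard integer types instead of the kernel's __wsum/__sum16):

```c
#include <stdint.h>

/* Same idea as "add %0, %1, %1, ror #16" above: add the sum to itself
 * rotated by 16 bits, then the folded ones'-complement checksum is the
 * bitwise NOT of the top half. */
static inline uint16_t csum_fold_generic(uint32_t sum)
{
	sum += (sum >> 16) | (sum << 16);   /* rotate by 16 and add */
	return (uint16_t)(~sum >> 16);      /* invert, keep the top half */
}
```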
On Thu, Dec 12, 2013 at 05:20:49PM +0000, Russell King - ARM Linux wrote:
> On Thu, Dec 12, 2013 at 06:11:08PM +0100, Willy Tarreau wrote:
> > Another thing that can be done to improve the folding of the 16-bit
> > checksum is to swap the values to be added, sum them and only keep
> > the high half integer which already contains the carry. At least on
> > x86 I save some cycles doing this :
> >
> >         31:24   23:16   15:8    7:0
> >  sum32 =  D       C       B      A
> >
> > To fold this into 16-bit at a time, I just do this :
> >
> >                   31:24   23:16   15:8    7:0
> >    sum32            D       C       B      A
> >  + sum32swapped     B       A       D      C
> >  =                 D+B  C+A+carry(B+D/C+A)  B+D  C+A
> >
> > so just take the upper result and you get the final 16-bit word at
> > once.
> >
> > In C it does :
> >
> >         fold16 = (((sum32 >> 16) | (sum32 << 16)) + sum32) >> 16
> >
> > When the CPU has a rotate instruction, it's fast :-)
>
> Indeed - and if your CPU can do the rotate and add at the same time,
> it's just a single instruction, and it ends up looking remarkably
> similar to this:
>
> static inline __sum16 csum_fold(__wsum sum)
> {
> 	__asm__(
> 	"add	%0, %1, %1, ror #16	@ csum_fold"
> 	: "=r" (sum)
> 	: "r" (sum)
> 	: "cc");
> 	return (__force __sum16)(~(__force u32)sum >> 16);
> }

Marvelous :-)

Willy
On Thu, 12 Dec 2013, Russell King - ARM Linux wrote:

> static inline __sum16 csum_fold(__wsum sum)
> {
> 	__asm__(
> 	"add	%0, %1, %1, ror #16	@ csum_fold"
> 	: "=r" (sum)
> 	: "r" (sum)
> 	: "cc");
> 	return (__force __sum16)(~(__force u32)sum >> 16);
> }

There are no longer any flag updates, so you should remove "cc" from the
clobber list as well.

Nicolas
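Applying that suggestion would leave the helper looking something like the sketch below - the plain ADD (as opposed to ADDS) never updates the condition flags, so nothing needs to be clobbered. This is only an illustration of the proposed change, not necessarily what was committed:

```c
static inline __sum16 csum_fold(__wsum sum)
{
	__asm__(
	"add	%0, %1, %1, ror #16	@ csum_fold"
	: "=r" (sum)
	: "r" (sum));
	return (__force __sum16)(~(__force u32)sum >> 16);
}
```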
On Thu, 2013-12-12 at 16:47 +0000, Russell King - ARM Linux wrote:
>
> So here's a patch for Maxime to try - I've not run it yet but it seems
> to do the right thing by code inspection.

Kernel built with this patch seems to work fine.  Also I called it
manually with a few test patterns and the results are correct.

Tested-by: Maxime Bizon <mbizon@freebox.fr>

Thanks Russell,
On Thu, Dec 12, 2013 at 11:30:02PM +0100, Maxime Bizon wrote:
>
> On Thu, 2013-12-12 at 16:47 +0000, Russell King - ARM Linux wrote:
> >
> > So here's a patch for Maxime to try - I've not run it yet but it seems
> > to do the right thing by code inspection.
>
> Kernel built with this patch seems to work fine.  Also I called it
> manually with a few test patterns and the results are correct.
>
> Tested-by: Maxime Bizon <mbizon@freebox.fr>
>
> Thanks Russell,

Thanks Maxime.  Does this bug show up in existing unmodified mainline
kernels?
On Thu, 2013-12-12 at 22:36 +0000, Russell King - ARM Linux wrote:
> Does this bug show up in existing unmodified mainline kernels?
not for me, I found this while writing new code.
though not all possible kernel callers are enabled in my .config.
On Thu, Dec 12, 2013 at 11:44:05PM +0100, Maxime Bizon wrote:
>
> On Thu, 2013-12-12 at 22:36 +0000, Russell King - ARM Linux wrote:
>
> > Does this bug show up in existing unmodified mainline kernels?
>
> not for me, I found this while writing new code.
>
> though not all possible kernel callers are enabled in my .config.

Okay, as we have no reports so far of this affecting anything in any
mainline kernel, I think I'm going to queue the fix up for v3.14 and not
-rc - we know that although the assembly is not correct in the IP code,
it's completely harmless there because the overflowed byte is always
zero.

So... I'll queue it for v3.14.  If anyone does find it affects existing
kernels, we can always apply it for -rc and stable kernels.

Thanks for your patience Maxime.
diff --git a/arch/arm/include/asm/checksum.h b/arch/arm/include/asm/checksum.h
index 6dcc16430868..523315115478 100644
--- a/arch/arm/include/asm/checksum.h
+++ b/arch/arm/include/asm/checksum.h
@@ -87,19 +87,34 @@ static inline __wsum
 csum_tcpudp_nofold(__be32 saddr, __be32 daddr, unsigned short len,
                    unsigned short proto, __wsum sum)
 {
-        __asm__(
-        "adds   %0, %1, %2      @ csum_tcpudp_nofold    \n\
-        adcs    %0, %0, %3                              \n"
+        u32 lenprot = len | proto << 16;
+
+        if (__builtin_constant_p(sum) && sum == 0) {
+                __asm__(
+                "adds   %0, %1, %2      @ csum_tcpudp_nofold0   \n\t"
 #ifdef __ARMEB__
-        "adcs   %0, %0, %4                              \n"
+                "adcs   %0, %0, %3                              \n\t"
 #else
-        "adcs   %0, %0, %4, lsl #8                      \n"
+                "adcs   %0, %0, %3, ror #8                      \n\t"
 #endif
-        "adcs   %0, %0, %5                              \n\
-        adc     %0, %0, #0"
-        : "=&r"(sum)
-        : "r" (sum), "r" (daddr), "r" (saddr), "r" (len), "Ir" (htons(proto))
-        : "cc");
+                "adc    %0, %0, #0"
+                : "=&r" (sum)
+                : "r" (daddr), "r" (saddr), "r" (lenprot)
+                : "cc");
+        } else {
+                __asm__(
+                "adds   %0, %1, %2      @ csum_tcpudp_nofold    \n\t"
+                "adcs   %0, %0, %3                              \n\t"
+#ifdef __ARMEB__
+                "adcs   %0, %0, %4                              \n\t"
+#else
+                "adcs   %0, %0, %4, ror #8                      \n\t"
+#endif
+                "adc    %0, %0, #0"
+                : "=&r"(sum)
+                : "r" (sum), "r" (daddr), "r" (saddr), "r" (lenprot)
+                : "cc");
+        }
         return sum;
 }
 /*
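For anyone who wants to poke at the accumulation scheme in the patch above from user space, here is a hypothetical little-endian test sketch (the helper names and the random test patterns are made up; Maxime's actual patterns are not given in the thread). It checks that folding one combined len | proto << 16 value, rotated right by 8 bits, gives the same result as adding the byte-swapped length and protocol words individually:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* 32-bit rotate right, standing in for the ARM "ror #8" operand shift. */
static uint32_t ror32(uint32_t x, unsigned int n)
{
	return (x >> n) | (x << (32 - n));
}

/* Fold a 32-bit partial sum to 16 bits with the rotate-and-add trick. */
static uint16_t fold16(uint32_t sum)
{
	sum += (sum >> 16) | (sum << 16);
	return sum >> 16;
}

/* 32-bit ones'-complement addition: wrap the carry back in. */
static uint32_t add32_carry(uint32_t a, uint32_t b)
{
	uint64_t s = (uint64_t)a + b;
	return (uint32_t)s + (uint32_t)(s >> 32);
}

static uint16_t bswap16(uint16_t x)
{
	return (uint16_t)((x >> 8) | (x << 8));
}

int main(void)
{
	for (int i = 0; i < 1000000; i++) {
		uint32_t saddr = ((uint32_t)rand() << 16) ^ rand();
		uint32_t daddr = ((uint32_t)rand() << 16) ^ rand();
		uint16_t len = rand();
		uint16_t proto = rand() & 0xff;

		/* Reference: add the six 16-bit pseudo-header words directly,
		 * with len and proto byte-swapped to network order. */
		uint32_t ref = (saddr >> 16) + (saddr & 0xffff) +
			       (daddr >> 16) + (daddr & 0xffff) +
			       bswap16(len) + bswap16(proto);

		/* Trick from the thread: one combined value, rotated right
		 * by 8 on a little-endian host instead of byte-swapping. */
		uint32_t lenprot = len | (uint32_t)proto << 16;
		uint32_t sum = add32_carry(saddr, daddr);

		sum = add32_carry(sum, ror32(lenprot, 8));

		if (fold16(ref) != fold16(sum)) {
			printf("mismatch: len=%#x proto=%#x\n", len, proto);
			return 1;
		}
	}
	printf("all test patterns match\n");
	return 0;
}
```

On a big-endian host the rotate would simply be dropped, mirroring the #ifdef __ARMEB__ split in the patch.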