| Message ID | 20240213234631.940055-1-linux@roeck-us.net (mailing list archive) |
| --- | --- |
| State | Accepted, archived |
| Series | parisc: Fix csum_ipv6_magic on 64-bit systems |
On Tue, Feb 13, 2024 at 03:46:31PM -0800, Guenter Roeck wrote: > hppa 64-bit systems calculates the IPv6 checksum using 64-bit add > operations. The last add folds protocol and length fields into the 64-bit > result. While unlikely, this operation can overflow. The overflow can be > triggered with a code sequence such as the following. > > /* try to trigger massive overflows */ > memset(tmp_buf, 0xff, sizeof(struct in6_addr)); > csum_result = csum_ipv6_magic((struct in6_addr *)tmp_buf, > (struct in6_addr *)tmp_buf, > 0xffff, 0xff, 0xffffffff); > > Fix the problem by adding any overflows from the final add operation into > the calculated checksum. Fortunately, we can do this without additional > cost by replacing the add operation used to fold the checksum into 32 bit > with "add,dc" to add in the missing carry. > > Cc: Charlie Jenkins <charlie@rivosinc.com> > Cc: Palmer Dabbelt <palmer@rivosinc.com> > Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") > Signed-off-by: Guenter Roeck <linux@roeck-us.net> > --- > This patch does not completely fix the problems with csum_ipv6_magic seen > when running 64-bit parisc images with the C3700 emulation in qemu. That > is due to unaligned 64-bit load operations which (presumably as part of > unaligned trap handling) generate bad carry flags. It is unknown if that > is a problem with the qemu emulation or with the Linux kernel, so it is not > addressed here. > > arch/parisc/include/asm/checksum.h | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/arch/parisc/include/asm/checksum.h b/arch/parisc/include/asm/checksum.h > index e619e67440db..c949aa20fa16 100644 > --- a/arch/parisc/include/asm/checksum.h > +++ b/arch/parisc/include/asm/checksum.h > @@ -137,8 +137,8 @@ static __inline__ __sum16 csum_ipv6_magic(const struct in6_addr *saddr, > " add,dc %3, %0, %0\n" /* fold in proto+len | carry bit */ > " extrd,u %0, 31, 32, %4\n"/* copy upper half down */ > " depdi 0, 31, 32, %0\n"/* clear upper half */ > -" add %4, %0, %0\n" /* fold into 32-bits */ > -" addc 0, %0, %0\n" /* add carry */ > +" add,dc %4, %0, %0\n" /* fold into 32-bits, plus carry */ > +" addc 0, %0, %0\n" /* add final carry */ > > #else > > -- > 2.39.2 > Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
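For readers skimming the thread, the core of the bug is plain ones'-complement arithmetic: if a fixed-width add overflows, the lost carry has to be folded back into the sum, or the checksum changes. The minimal C model below is an illustration only — it is not the hppa assembly under discussion, and the helper names are made up — but it shows the difference between dropping and keeping that end-around carry, which is exactly what replacing `add` with `add,dc` restores.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Simplified 32-bit model of end-around-carry addition -- not the hppa
 * assembly itself.  add_keep_carry() folds an overflow back into the sum
 * (what "add,dc"/"addc" achieve); add_drop_carry() silently discards it.
 */
static uint32_t add_keep_carry(uint32_t a, uint32_t b)
{
	uint32_t s = a + b;

	return s + (s < a);		/* end-around carry */
}

static uint32_t add_drop_carry(uint32_t a, uint32_t b)
{
	return a + b;			/* carry out of bit 31 is lost */
}

int main(void)
{
	/* Near-saturated partial sums, as produced by the all-0xff trigger
	 * in the commit message, make the difference visible. */
	uint32_t a = 0xfffffff0u, b = 0x20u;

	printf("carry kept:    0x%08x\n", add_keep_carry(a, b));  /* 0x00000011 */
	printf("carry dropped: 0x%08x\n", add_drop_carry(a, b));  /* 0x00000010 */
	return 0;
}
```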
* Guenter Roeck <linux@roeck-us.net>: > hppa 64-bit systems calculates the IPv6 checksum using 64-bit add > operations. The last add folds protocol and length fields into the 64-bit > result. While unlikely, this operation can overflow. The overflow can be > triggered with a code sequence such as the following. > > /* try to trigger massive overflows */ > memset(tmp_buf, 0xff, sizeof(struct in6_addr)); > csum_result = csum_ipv6_magic((struct in6_addr *)tmp_buf, > (struct in6_addr *)tmp_buf, > 0xffff, 0xff, 0xffffffff); > > Fix the problem by adding any overflows from the final add operation into > the calculated checksum. Fortunately, we can do this without additional > cost by replacing the add operation used to fold the checksum into 32 bit > with "add,dc" to add in the missing carry. Gunter, Thanks for your patch for 32- and 64-bit systems. But I think it's time to sunset the parisc inline assembly for ipv4/ipv6 checksumming. The patch below converts the code to use standard C-coding (taken from the s390 header file) and it survives the v8 lib/checksum patch. Opinions? Helge Subject: [PATCH] parisc: Fix csum_ipv6_magic on 32- and 64-bit machines Guenter noticed that the 32- and 64-bit checksum functions may calculate wrong values under certain circumstances. He fixed it by usining corrected carry-flags added, but overall I think it's better to convert away from inline assembly to generic C-coding. That way the code is much cleaner and the compiler can do much better optimizations based on the target architecture (32- vs. 64-bit). Furthermore, the compiler generates nowadays much better code, so inline assembly usually won't give any benefit any longer. Signed-off-by: Helge Deller <deller@gmx.de> Noticed-by: Guenter Roeck <linux@roeck-us.net> diff --git a/arch/parisc/include/asm/checksum.h b/arch/parisc/include/asm/checksum.h index 3c43baca7b39..c72f14536353 100644 --- a/arch/parisc/include/asm/checksum.h +++ b/arch/parisc/include/asm/checksum.h @@ -18,160 +18,93 @@ */ extern __wsum csum_partial(const void *, int, __wsum); + /* - * Optimized for IP headers, which always checksum on 4 octet boundaries. - * - * Written by Randolph Chung <tausq@debian.org>, and then mucked with by - * LaMont Jones <lamont@debian.org> + * Fold a partial checksum without adding pseudo headers. */ -static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl) +static inline __sum16 csum_fold(__wsum sum) { - unsigned int sum; - unsigned long t0, t1, t2; + u32 csum = (__force u32) sum; - __asm__ __volatile__ ( -" ldws,ma 4(%1), %0\n" -" addib,<= -4, %2, 2f\n" -"\n" -" ldws 4(%1), %4\n" -" ldws 8(%1), %5\n" -" add %0, %4, %0\n" -" ldws,ma 12(%1), %3\n" -" addc %0, %5, %0\n" -" addc %0, %3, %0\n" -"1: ldws,ma 4(%1), %3\n" -" addib,< 0, %2, 1b\n" -" addc %0, %3, %0\n" -"\n" -" extru %0, 31, 16, %4\n" -" extru %0, 15, 16, %5\n" -" addc %4, %5, %0\n" -" extru %0, 15, 16, %5\n" -" add %0, %5, %0\n" -" subi -1, %0, %0\n" -"2:\n" - : "=r" (sum), "=r" (iph), "=r" (ihl), "=r" (t0), "=r" (t1), "=r" (t2) - : "1" (iph), "2" (ihl) - : "memory"); - - return (__force __sum16)sum; + csum += (csum >> 16) | (csum << 16); + csum >>= 16; + return (__force __sum16) ~csum; } /* - * Fold a partial checksum + * This is a version of ip_compute_csum() optimized for IP headers, + * which always checksums on 4 octet boundaries. 
*/ -static inline __sum16 csum_fold(__wsum csum) +static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl) { - u32 sum = (__force u32)csum; - /* add the swapped two 16-bit halves of sum, - a possible carry from adding the two 16-bit halves, - will carry from the lower half into the upper half, - giving us the correct sum in the upper half. */ - sum += (sum << 16) + (sum >> 16); - return (__force __sum16)(~sum >> 16); + __u64 csum = 0; + __u32 *ptr = (u32 *)iph; + + csum += *ptr++; + csum += *ptr++; + csum += *ptr++; + csum += *ptr++; + ihl -= 4; + while (ihl--) + csum += *ptr++; + csum += (csum >> 32) | (csum << 32); + return csum_fold((__force __wsum)(csum >> 32)); } - -static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr, - __u32 len, __u8 proto, - __wsum sum) + +/* + * Computes the checksum of the TCP/UDP pseudo-header. + * Returns a 32-bit checksum. + */ +static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr, __u32 len, + __u8 proto, __wsum sum) { - __asm__( - " add %1, %0, %0\n" - " addc %2, %0, %0\n" - " addc %3, %0, %0\n" - " addc %%r0, %0, %0\n" - : "=r" (sum) - : "r" (daddr), "r"(saddr), "r"(proto+len), "0"(sum)); - return sum; + __u64 csum = (__force __u64)sum; + + csum += (__force __u32)saddr; + csum += (__force __u32)daddr; + csum += len; + csum += proto; + csum += (csum >> 32) | (csum << 32); + return (__force __wsum)(csum >> 32); } /* - * computes the checksum of the TCP/UDP pseudo-header - * returns a 16-bit checksum, already complemented + * Computes the checksum of the TCP/UDP pseudo-header. + * Returns a 16-bit checksum, already complemented. */ -static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr, - __u32 len, __u8 proto, - __wsum sum) +static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr, __u32 len, + __u8 proto, __wsum sum) { - return csum_fold(csum_tcpudp_nofold(saddr,daddr,len,proto,sum)); + return csum_fold(csum_tcpudp_nofold(saddr, daddr, len, proto, sum)); } /* - * this routine is used for miscellaneous IP-like checksums, mainly - * in icmp.c + * Used for miscellaneous IP-like checksums, mainly icmp. */ -static inline __sum16 ip_compute_csum(const void *buf, int len) +static inline __sum16 ip_compute_csum(const void *buff, int len) { - return csum_fold (csum_partial(buf, len, 0)); + return csum_fold(csum_partial(buff, len, 0)); } - #define _HAVE_ARCH_IPV6_CSUM -static __inline__ __sum16 csum_ipv6_magic(const struct in6_addr *saddr, - const struct in6_addr *daddr, - __u32 len, __u8 proto, - __wsum sum) +static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr, + const struct in6_addr *daddr, + __u32 len, __u8 proto, __wsum csum) { - unsigned long t0, t1, t2, t3; - - len += proto; /* add 16-bit proto + len */ - - __asm__ __volatile__ ( - -#if BITS_PER_LONG > 32 - - /* - ** We can execute two loads and two adds per cycle on PA 8000. - ** But add insn's get serialized waiting for the carry bit. - ** Try to keep 4 registers with "live" values ahead of the ALU. 
- */ - -" ldd,ma 8(%1), %4\n" /* get 1st saddr word */ -" ldd,ma 8(%2), %5\n" /* get 1st daddr word */ -" add %4, %0, %0\n" -" ldd,ma 8(%1), %6\n" /* 2nd saddr */ -" ldd,ma 8(%2), %7\n" /* 2nd daddr */ -" add,dc %5, %0, %0\n" -" add,dc %6, %0, %0\n" -" add,dc %7, %0, %0\n" -" add,dc %3, %0, %0\n" /* fold in proto+len | carry bit */ -" extrd,u %0, 31, 32, %4\n"/* copy upper half down */ -" depdi 0, 31, 32, %0\n"/* clear upper half */ -" add %4, %0, %0\n" /* fold into 32-bits */ -" addc 0, %0, %0\n" /* add carry */ - -#else - - /* - ** For PA 1.x, the insn order doesn't matter as much. - ** Insn stream is serialized on the carry bit here too. - ** result from the previous operation (eg r0 + x) - */ -" ldw,ma 4(%1), %4\n" /* get 1st saddr word */ -" ldw,ma 4(%2), %5\n" /* get 1st daddr word */ -" add %4, %0, %0\n" -" ldw,ma 4(%1), %6\n" /* 2nd saddr */ -" addc %5, %0, %0\n" -" ldw,ma 4(%2), %7\n" /* 2nd daddr */ -" addc %6, %0, %0\n" -" ldw,ma 4(%1), %4\n" /* 3rd saddr */ -" addc %7, %0, %0\n" -" ldw,ma 4(%2), %5\n" /* 3rd daddr */ -" addc %4, %0, %0\n" -" ldw,ma 4(%1), %6\n" /* 4th saddr */ -" addc %5, %0, %0\n" -" ldw,ma 4(%2), %7\n" /* 4th daddr */ -" addc %6, %0, %0\n" -" addc %7, %0, %0\n" -" addc %3, %0, %0\n" /* fold in proto+len, catch carry */ - -#endif - : "=r" (sum), "=r" (saddr), "=r" (daddr), "=r" (len), - "=r" (t0), "=r" (t1), "=r" (t2), "=r" (t3) - : "0" (sum), "1" (saddr), "2" (daddr), "3" (len) - : "memory"); - return csum_fold(sum); + __u64 sum = (__force __u64)csum; + + sum += (__force __u32)saddr->s6_addr32[0]; + sum += (__force __u32)saddr->s6_addr32[1]; + sum += (__force __u32)saddr->s6_addr32[2]; + sum += (__force __u32)saddr->s6_addr32[3]; + sum += (__force __u32)daddr->s6_addr32[0]; + sum += (__force __u32)daddr->s6_addr32[1]; + sum += (__force __u32)daddr->s6_addr32[2]; + sum += (__force __u32)daddr->s6_addr32[3]; + sum += len; + sum += proto; + sum += (sum >> 32) | (sum << 32); + return csum_fold((__force __wsum)(sum >> 32)); } #endif -
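The rotate-based fold used throughout Helge's proposed conversion — `sum += (sum >> 32) | (sum << 32); return sum >> 32;` — is the s390 idiom for adding the two 32-bit halves of the accumulator together with the carry wrapped back in. The snippet below is an editorial sketch (standalone C, function names invented here) demonstrating that equivalence; it is not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* The rotate-by-32 fold from the proposed C conversion. */
static uint32_t fold_rotate(uint64_t sum)
{
	sum += (sum >> 32) | (sum << 32);	/* add the swapped halves */
	return (uint32_t)(sum >> 32);		/* high half = lo + hi + carry */
}

/* The same fold written out explicitly. */
static uint32_t fold_explicit(uint64_t sum)
{
	uint32_t lo = (uint32_t)sum, hi = (uint32_t)(sum >> 32);
	uint32_t s = lo + hi;

	return s + (s < lo);			/* end-around carry */
}

int main(void)
{
	const uint64_t samples[] = {
		0x0000000000000000ULL, 0x00000001ffffffffULL,
		0xffffffff00000001ULL, 0xfffffffffffffffeULL,
	};

	for (unsigned int i = 0; i < 4; i++)
		assert(fold_rotate(samples[i]) == fold_explicit(samples[i]));
	return 0;
}
```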
Hi Helge, On 2/16/24 04:38, Helge Deller wrote: > * Guenter Roeck <linux@roeck-us.net>: >> hppa 64-bit systems calculates the IPv6 checksum using 64-bit add >> operations. The last add folds protocol and length fields into the 64-bit >> result. While unlikely, this operation can overflow. The overflow can be >> triggered with a code sequence such as the following. >> >> /* try to trigger massive overflows */ >> memset(tmp_buf, 0xff, sizeof(struct in6_addr)); >> csum_result = csum_ipv6_magic((struct in6_addr *)tmp_buf, >> (struct in6_addr *)tmp_buf, >> 0xffff, 0xff, 0xffffffff); >> >> Fix the problem by adding any overflows from the final add operation into >> the calculated checksum. Fortunately, we can do this without additional >> cost by replacing the add operation used to fold the checksum into 32 bit >> with "add,dc" to add in the missing carry. > > > Gunter, > > Thanks for your patch for 32- and 64-bit systems. > But I think it's time to sunset the parisc inline assembly for ipv4/ipv6 > checksumming. > The patch below converts the code to use standard C-coding (taken from the > s390 header file) and it survives the v8 lib/checksum patch. > > Opinions? > Works for me. > Helge > > Subject: [PATCH] parisc: Fix csum_ipv6_magic on 32- and 64-bit machines > > Guenter noticed that the 32- and 64-bit checksum functions may calculate > wrong values under certain circumstances. He fixed it by usining using > corrected carry-flags added, but overall I think it's better to convert > away from inline assembly to generic C-coding. That way the code is > much cleaner and the compiler can do much better optimizations based on > the target architecture (32- vs. 64-bit). Furthermore, the compiler > generates nowadays much better code, so inline assembly usually won't > give any benefit any longer. > > Signed-off-by: Helge Deller <deller@gmx.de> > Noticed-by: Guenter Roeck <linux@roeck-us.net> That would be a new tag. Maybe use "Reported-by:", or just drop it. Maybe also consider adding Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Other than that, Tested-by: Guenter Roeck <linux@roeck-us.net> Thanks, Guenter
On Fri, Feb 16, 2024 at 01:38:51PM +0100, Helge Deller wrote: > * Guenter Roeck <linux@roeck-us.net>: > > hppa 64-bit systems calculates the IPv6 checksum using 64-bit add > > operations. The last add folds protocol and length fields into the 64-bit > > result. While unlikely, this operation can overflow. The overflow can be > > triggered with a code sequence such as the following. > > > > /* try to trigger massive overflows */ > > memset(tmp_buf, 0xff, sizeof(struct in6_addr)); > > csum_result = csum_ipv6_magic((struct in6_addr *)tmp_buf, > > (struct in6_addr *)tmp_buf, > > 0xffff, 0xff, 0xffffffff); > > > > Fix the problem by adding any overflows from the final add operation into > > the calculated checksum. Fortunately, we can do this without additional > > cost by replacing the add operation used to fold the checksum into 32 bit > > with "add,dc" to add in the missing carry. > > > Gunter, > > Thanks for your patch for 32- and 64-bit systems. > But I think it's time to sunset the parisc inline assembly for ipv4/ipv6 > checksumming. > The patch below converts the code to use standard C-coding (taken from the > s390 header file) and it survives the v8 lib/checksum patch. > > Opinions? > > Helge > > Subject: [PATCH] parisc: Fix csum_ipv6_magic on 32- and 64-bit machines > > Guenter noticed that the 32- and 64-bit checksum functions may calculate > wrong values under certain circumstances. He fixed it by usining > corrected carry-flags added, but overall I think it's better to convert > away from inline assembly to generic C-coding. That way the code is > much cleaner and the compiler can do much better optimizations based on > the target architecture (32- vs. 64-bit). Furthermore, the compiler > generates nowadays much better code, so inline assembly usually won't > give any benefit any longer. > > Signed-off-by: Helge Deller <deller@gmx.de> > Noticed-by: Guenter Roeck <linux@roeck-us.net> > > diff --git a/arch/parisc/include/asm/checksum.h b/arch/parisc/include/asm/checksum.h > index 3c43baca7b39..c72f14536353 100644 > --- a/arch/parisc/include/asm/checksum.h > +++ b/arch/parisc/include/asm/checksum.h > @@ -18,160 +18,93 @@ > */ > extern __wsum csum_partial(const void *, int, __wsum); > > + > /* > - * Optimized for IP headers, which always checksum on 4 octet boundaries. > - * > - * Written by Randolph Chung <tausq@debian.org>, and then mucked with by > - * LaMont Jones <lamont@debian.org> > + * Fold a partial checksum without adding pseudo headers. > */ > -static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl) > +static inline __sum16 csum_fold(__wsum sum) > { > - unsigned int sum; > - unsigned long t0, t1, t2; > + u32 csum = (__force u32) sum; > > - __asm__ __volatile__ ( > -" ldws,ma 4(%1), %0\n" > -" addib,<= -4, %2, 2f\n" > -"\n" > -" ldws 4(%1), %4\n" > -" ldws 8(%1), %5\n" > -" add %0, %4, %0\n" > -" ldws,ma 12(%1), %3\n" > -" addc %0, %5, %0\n" > -" addc %0, %3, %0\n" > -"1: ldws,ma 4(%1), %3\n" > -" addib,< 0, %2, 1b\n" > -" addc %0, %3, %0\n" > -"\n" > -" extru %0, 31, 16, %4\n" > -" extru %0, 15, 16, %5\n" > -" addc %4, %5, %0\n" > -" extru %0, 15, 16, %5\n" > -" add %0, %5, %0\n" > -" subi -1, %0, %0\n" > -"2:\n" > - : "=r" (sum), "=r" (iph), "=r" (ihl), "=r" (t0), "=r" (t1), "=r" (t2) > - : "1" (iph), "2" (ihl) > - : "memory"); > - > - return (__force __sum16)sum; > + csum += (csum >> 16) | (csum << 16); > + csum >>= 16; > + return (__force __sum16) ~csum; > } For all of my analysis I am using gcc 12.3.0. We can do better than this! 
By inspection this looks like a performance regression. The generic version of csum_fold in include/asm-generic/checksum.h is better than this so should be used instead. I am not familiar with hppa assembly but this is the assembly for the original csum_fold here: 0000000000000000 <csum_fold>: 0: 08 03 02 41 copy r3,r1 4: 08 1e 02 43 copy sp,r3 8: 73 c1 00 88 std,ma r1,40(sp) c: d3 5a 09 fc shrpw r26,r26,16,ret0 10: 0b 9a 0a 1c add,l r26,ret0,ret0 14: 0b 80 09 9c uaddcm r0,ret0,ret0 18: db 9c 09 f0 extrd,u,* ret0,47,16,ret0 1c: 34 7e 00 80 ldo 40(r3),sp 20: e8 40 d0 00 bve (rp) 24: 53 c3 3f 8d ldd,mb -40(sp),r3 This is the assembly for the generic version: 0000000000000000 <csum_fold_generic>: 0: 08 03 02 41 copy r3,r1 4: 08 1e 02 43 copy sp,r3 8: 73 c1 00 88 std,ma r1,40(sp) c: 0b 40 09 9c uaddcm r0,r26,ret0 10: d3 5a 09 fa shrpw r26,r26,16,r26 14: 0b 5c 04 1c sub ret0,r26,ret0 18: db 9c 09 f0 extrd,u,* ret0,47,16,ret0 1c: 34 7e 00 80 ldo 40(r3),sp 20: e8 40 d0 00 bve (rp) 24: 53 c3 3f 8d ldd,mb -40(sp),r3 This is the assembly for yours: 0000000000000028 <csum_fold_new>: 28: 08 03 02 41 copy r3,r1 2c: 08 1e 02 43 copy sp,r3 30: 73 c1 00 88 std,ma r1,40(sp) 34: d3 5a 09 fc shrpw r26,r26,16,ret0 38: 0b 9a 0a 1c add,l r26,ret0,ret0 3c: d3 9c 19 f0 extrw,u ret0,15,16,ret0 40: 0b 80 09 9c uaddcm r0,ret0,ret0 44: db 9c 0b f0 extrd,u,* ret0,63,16,ret0 48: 34 7e 00 80 ldo 40(r3),sp 4c: e8 40 d0 00 bve (rp) 50: 53 c3 3f 8d ldd,mb -40(sp),r3 54: 00 00 00 00 break 0,0 The assembly is almost the same for the generic and the original code, except for rearranging the shift and add operation which allows the addition to become a subtraction and shortens the dependency chain on some architectures (looks like it doesn't change much here). However, this new code introduces an additional extrw instruction. > > /* > - * Fold a partial checksum > + * This is a version of ip_compute_csum() optimized for IP headers, > + * which always checksums on 4 octet boundaries. > */ > -static inline __sum16 csum_fold(__wsum csum) > +static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl) > { > - u32 sum = (__force u32)csum; > - /* add the swapped two 16-bit halves of sum, > - a possible carry from adding the two 16-bit halves, > - will carry from the lower half into the upper half, > - giving us the correct sum in the upper half. */ > - sum += (sum << 16) + (sum >> 16); > - return (__force __sum16)(~sum >> 16); > + __u64 csum = 0; > + __u32 *ptr = (u32 *)iph; > + > + csum += *ptr++; > + csum += *ptr++; > + csum += *ptr++; > + csum += *ptr++; > + ihl -= 4; > + while (ihl--) > + csum += *ptr++; > + csum += (csum >> 32) | (csum << 32); > + return csum_fold((__force __wsum)(csum >> 32)); > } This doesn't leverage add with carry well. This causes the code size of this to be dramatically larger than the original assembly, which I assume nicely correlates to an increased execution time. 
This is the original assembly in 64-bit (almost the same in 32-bit): 0000000000000028 <ip_fast_csum>: 28: 08 03 02 41 copy r3,r1 2c: 08 1e 02 43 copy sp,r3 30: 73 c1 00 88 std,ma r1,40(sp) 34: 0f 48 10 bc ldw,ma 4(r26),ret0 38: a7 39 60 70 addib,<= -4,r25,78 <ip_fast_csum+0x50> 3c: 0f 48 10 93 ldw 4(r26),r19 40: 0f 50 10 94 ldw 8(r26),r20 44: 0a 7c 06 1c add ret0,r19,ret0 48: 0f 58 10 bf ldw,ma c(r26),r31 4c: 0a 9c 07 1c add,c ret0,r20,ret0 50: 0b fc 07 1c add,c ret0,r31,ret0 54: 0f 48 10 bf ldw,ma 4(r26),r31 58: a7 20 5f ed addib,< 0,r25,54 <ip_fast_csum+0x2c> 5c: 0b fc 07 1c add,c ret0,r31,ret0 60: d3 93 1b f0 extrw,u ret0,31,16,r19 64: d3 94 19 f0 extrw,u ret0,15,16,r20 68: 0a 93 07 1c add,c r19,r20,ret0 6c: d3 94 19 f0 extrw,u ret0,15,16,r20 70: 0a 9c 06 1c add ret0,r20,ret0 74: 97 9c 07 ff subi -1,ret0,ret0 78: db 9c 0b f0 extrd,u,* ret0,63,16,ret0 7c: 34 7e 00 80 ldo 40(r3),sp 80: e8 40 d0 00 bve (rp) 84: 53 c3 3f 8d ldd,mb -40(sp),r3 This is the 64-bit assembly of your proposal: 0000000000000058 <ip_fast_csum_new>: 58: 08 03 02 41 copy r3,r1 5c: 0f c2 12 c1 std rp,-10(sp) 60: 08 1e 02 43 copy sp,r3 64: 73 c1 01 08 std,ma r1,80(sp) 68: 37 39 3f f9 ldo -4(r25),r25 6c: 0c 64 12 d0 std r4,8(r3) 70: 0f 48 10 9f ldw 4(r26),r31 74: db 39 0b e0 extrd,u,* r25,63,32,r25 78: 37 5a 00 20 ldo 10(r26),r26 7c: 0f 41 10 9c ldw -10(r26),ret0 80: 0b fc 0a 1c add,l ret0,r31,ret0 84: 0f 51 10 9f ldw -8(r26),r31 88: 0b 9f 0a 1f add,l r31,ret0,r31 8c: 0f 59 10 9c ldw -4(r26),ret0 90: 0b fc 0a 1c add,l ret0,r31,ret0 94: 37 3f 3f ff ldo -1(r25),r31 98: 8f ff 20 68 cmpib,<> -1,r31,d4 <ip_fast_csum_new+0x7c> 9c: db f9 0b e0 extrd,u,* r31,63,32,r25 a0: d3 9c 07 fa shrpd,* ret0,ret0,32,r26 a4: 37 dd 3f a1 ldo -30(sp),ret1 a8: 0b 9a 0a 1a add,l r26,ret0,r26 ac: 00 00 14 a1 mfia r1 b0: 28 20 00 00 addil L%0,r1,r1 b4: 34 21 00 00 ldo 0(r1),r1 b8: e8 20 f0 00 bve,l (r1),rp bc: db 5a 03 e0 extrd,u,* r26,31,32,r26 c0: 0c 70 10 c4 ldd 8(r3),r4 c4: 0c 61 10 c2 ldd -10(r3),rp c8: 34 7e 00 80 ldo 40(r3),sp cc: e8 40 d0 00 bve (rp) d0: 53 c3 3f 8d ldd,mb -40(sp),r3 d4: 0f 40 10 9f ldw 0(r26),r31 d8: 37 5a 00 08 ldo 4(r26),r26 dc: e8 1f 1f 65 b,l 94 <ip_fast_csum_new+0x3c>,r0 e0: 0b fc 0a 1c add,l ret0,r31,ret0 e4: 00 00 00 00 break 0,0 The 32-bit assembly is even longer. Maybe you can get some inspiration from my riscv implementation arch/riscv/include/asm/checksum.h. You can swap out the line: csum += csum < ((const unsigned int *)iph)[pos]; With the addc macro defined in arch/parisc/lib/checksum.c. I will guess that this will improve the code but I haven't checked. > - > -static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr, > - __u32 len, __u8 proto, > - __wsum sum) > + > +/* > + * Computes the checksum of the TCP/UDP pseudo-header. > + * Returns a 32-bit checksum. 
> + */ > +static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr, __u32 len, > + __u8 proto, __wsum sum) > { > - __asm__( > - " add %1, %0, %0\n" > - " addc %2, %0, %0\n" > - " addc %3, %0, %0\n" > - " addc %%r0, %0, %0\n" > - : "=r" (sum) > - : "r" (daddr), "r"(saddr), "r"(proto+len), "0"(sum)); > - return sum; > + __u64 csum = (__force __u64)sum; > + > + csum += (__force __u32)saddr; > + csum += (__force __u32)daddr; > + csum += len; > + csum += proto; > + csum += (csum >> 32) | (csum << 32); > + return (__force __wsum)(csum >> 32); > } The assembly from the original: 0000000000000088 <csum_tcpudp_nofold>: 88: 08 03 02 41 copy r3,r1 8c: 08 1e 02 43 copy sp,r3 90: 73 c1 00 88 std,ma r1,40(sp) 94: da f7 0b f8 extrd,u,* r23,63,8,r23 98: 0a f8 0a 17 add,l r24,r23,r23 9c: 0a d9 06 16 add r25,r22,r22 a0: 0a da 07 16 add,c r26,r22,r22 a4: 0a d7 07 16 add,c r23,r22,r22 a8: 0a c0 07 16 add,c r0,r22,r22 ac: da dc 0b e0 extrd,u,* r22,63,32,ret0 b0: 34 7e 00 80 ldo 40(r3),sp b4: e8 40 d0 00 bve (rp) b8: 53 c3 3f 8d ldd,mb -40(sp),r3 bc: 00 00 00 00 break 0,0 The 64-bit assembly from your proposal: 0000000000000140 <csum_tcpudp_nofold_new>: 140: 08 03 02 41 copy r3,r1 144: 08 1e 02 43 copy sp,r3 148: 73 c1 00 88 std,ma r1,40(sp) 14c: db 5a 0b e0 extrd,u,* r26,63,32,r26 150: db 39 0b e0 extrd,u,* r25,63,32,r25 154: da f7 0b f8 extrd,u,* r23,63,8,r23 158: db 18 0b e0 extrd,u,* r24,63,32,r24 15c: da d6 0b e0 extrd,u,* r22,63,32,r22 160: 0a f8 0a 18 add,l r24,r23,r24 164: 0b 38 0a 18 add,l r24,r25,r24 168: 0b 58 0a 18 add,l r24,r26,r24 16c: 0a d8 0a 16 add,l r24,r22,r22 170: d2 d6 07 fc shrpd,* r22,r22,32,ret0 174: 0a dc 0a 1c add,l ret0,r22,ret0 178: db 9c 03 e0 extrd,u,* ret0,31,32,ret0 17c: 34 7e 00 80 ldo 40(r3),sp 180: e8 40 d0 00 bve (rp) 184: 53 c3 3f 8d ldd,mb -40(sp),r3 There are a lot of extra extrd instructions, and again is really bad on 32-bit parisc. Maybe there is a good way to represent this in C, but the assembly is so simple and clean for this function already. > > /* > - * computes the checksum of the TCP/UDP pseudo-header > - * returns a 16-bit checksum, already complemented > + * Computes the checksum of the TCP/UDP pseudo-header. > + * Returns a 16-bit checksum, already complemented. > */ > -static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr, > - __u32 len, __u8 proto, > - __wsum sum) > +static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr, __u32 len, > + __u8 proto, __wsum sum) > { > - return csum_fold(csum_tcpudp_nofold(saddr,daddr,len,proto,sum)); > + return csum_fold(csum_tcpudp_nofold(saddr, daddr, len, proto, sum)); > } > The implementation of csum_tcpudp_magic is the same as the generic, so the generic version should be used instead. > /* > - * this routine is used for miscellaneous IP-like checksums, mainly > - * in icmp.c > + * Used for miscellaneous IP-like checksums, mainly icmp. > */ > -static inline __sum16 ip_compute_csum(const void *buf, int len) > +static inline __sum16 ip_compute_csum(const void *buff, int len) > { > - return csum_fold (csum_partial(buf, len, 0)); > + return csum_fold(csum_partial(buff, len, 0)); > } The generic version is better than this so it should be used instead. 
> > - > #define _HAVE_ARCH_IPV6_CSUM > -static __inline__ __sum16 csum_ipv6_magic(const struct in6_addr *saddr, > - const struct in6_addr *daddr, > - __u32 len, __u8 proto, > - __wsum sum) > +static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr, > + const struct in6_addr *daddr, > + __u32 len, __u8 proto, __wsum csum) > { > - unsigned long t0, t1, t2, t3; > - > - len += proto; /* add 16-bit proto + len */ > - > - __asm__ __volatile__ ( > - > -#if BITS_PER_LONG > 32 > - > - /* > - ** We can execute two loads and two adds per cycle on PA 8000. > - ** But add insn's get serialized waiting for the carry bit. > - ** Try to keep 4 registers with "live" values ahead of the ALU. > - */ > - > -" ldd,ma 8(%1), %4\n" /* get 1st saddr word */ > -" ldd,ma 8(%2), %5\n" /* get 1st daddr word */ > -" add %4, %0, %0\n" > -" ldd,ma 8(%1), %6\n" /* 2nd saddr */ > -" ldd,ma 8(%2), %7\n" /* 2nd daddr */ > -" add,dc %5, %0, %0\n" > -" add,dc %6, %0, %0\n" > -" add,dc %7, %0, %0\n" > -" add,dc %3, %0, %0\n" /* fold in proto+len | carry bit */ > -" extrd,u %0, 31, 32, %4\n"/* copy upper half down */ > -" depdi 0, 31, 32, %0\n"/* clear upper half */ > -" add %4, %0, %0\n" /* fold into 32-bits */ > -" addc 0, %0, %0\n" /* add carry */ > - > -#else > - > - /* > - ** For PA 1.x, the insn order doesn't matter as much. > - ** Insn stream is serialized on the carry bit here too. > - ** result from the previous operation (eg r0 + x) > - */ > -" ldw,ma 4(%1), %4\n" /* get 1st saddr word */ > -" ldw,ma 4(%2), %5\n" /* get 1st daddr word */ > -" add %4, %0, %0\n" > -" ldw,ma 4(%1), %6\n" /* 2nd saddr */ > -" addc %5, %0, %0\n" > -" ldw,ma 4(%2), %7\n" /* 2nd daddr */ > -" addc %6, %0, %0\n" > -" ldw,ma 4(%1), %4\n" /* 3rd saddr */ > -" addc %7, %0, %0\n" > -" ldw,ma 4(%2), %5\n" /* 3rd daddr */ > -" addc %4, %0, %0\n" > -" ldw,ma 4(%1), %6\n" /* 4th saddr */ > -" addc %5, %0, %0\n" > -" ldw,ma 4(%2), %7\n" /* 4th daddr */ > -" addc %6, %0, %0\n" > -" addc %7, %0, %0\n" > -" addc %3, %0, %0\n" /* fold in proto+len, catch carry */ > - > -#endif > - : "=r" (sum), "=r" (saddr), "=r" (daddr), "=r" (len), > - "=r" (t0), "=r" (t1), "=r" (t2), "=r" (t3) > - : "0" (sum), "1" (saddr), "2" (daddr), "3" (len) > - : "memory"); > - return csum_fold(sum); > + __u64 sum = (__force __u64)csum; > + > + sum += (__force __u32)saddr->s6_addr32[0]; > + sum += (__force __u32)saddr->s6_addr32[1]; > + sum += (__force __u32)saddr->s6_addr32[2]; > + sum += (__force __u32)saddr->s6_addr32[3]; > + sum += (__force __u32)daddr->s6_addr32[0]; > + sum += (__force __u32)daddr->s6_addr32[1]; > + sum += (__force __u32)daddr->s6_addr32[2]; > + sum += (__force __u32)daddr->s6_addr32[3]; > + sum += len; > + sum += proto; > + sum += (sum >> 32) | (sum << 32); > + return csum_fold((__force __wsum)(sum >> 32)); > } > > #endif > - Similar story again here where the add with carry is not well translated into C, resulting in significantly worse assembly. Using __u64 seems to be a big contributing factor for why the 32-bit assembly is worse. I am not sure the best way to represent this in a clean way in C. add with carry is not well understood by GCC 12.3 it seems. These functions are generally heavily optimized on every architecture, so I think it is worth it to default to assembly if you aren't able to achieve comparable performance in C. - Charlie
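Charlie's pointer to the riscv header refers to the compare-for-carry idiom, where each 32-bit add recovers its own overflow with `sum += sum < word`. Below is a hedged sketch of what an ip_fast_csum written that way might look like — editorial, untested on hppa, and not code proposed anywhere in this thread.

```c
#include <stdint.h>

/*
 * Plain-C ip_fast_csum using the compare-for-carry idiom: each 32-bit add
 * feeds its overflow back into the running sum, emulating an add-with-carry
 * chain without inline assembly.  Editorial sketch only; assumes the usual
 * 4-byte-aligned IP header with ihl counted in 32-bit words.
 */
static uint16_t ip_fast_csum_sketch(const void *iph, unsigned int ihl)
{
	const uint32_t *p = iph;
	uint32_t sum = 0;

	for (unsigned int i = 0; i < ihl; i++) {
		sum += p[i];
		sum += sum < p[i];	/* end-around carry */
	}

	/* fold 32 -> 16 bits, twice to absorb the final carry */
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```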
... > We can do better than this! By inspection this looks like a performance > regression. The generic version of csum_fold in > include/asm-generic/checksum.h is better than this so should be used > instead. Yes, that got changed for 6.8-rc1 (I pretty much suggested the patch) but hadn't noticed Linus has applied it. That C version is (probably) not worse than any of the asm versions except sparc32 - which has a carry flag but rotate. (It is better than the x86-64 asm one.) ... > This doesn't leverage add with carry well. This causes the code size of this > to be dramatically larger than the original assembly, which I assume > nicely correlates to an increased execution time. It is pretty much impossible to do add with carry from C. So an asm adc block is pretty much always going to win. For csum_partial and short to moderate length buffers on x86 it is hard to beat 10: adc, adc, dec, jnz 10b which (on modern intel cpu at least) does 8 bytes/clock. You can get 12 bytes/clock but it only really wins for 256+ bytes. (See the current x86-64 version.) For cpu without a carry flag it is likely that a common C function will be pretty much optimal on all architectures. (Or maybe a couple of implementations based the actual cpu implementation - not the architecture.) Mostly I don't think you can beat 4 instructions/word, but they will pipeline so with multi-issue you might get a read/clock. Arm's barrel shifter might give 3: v + *p; x += v, y += v >> 32. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
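Since portable C has no direct way to express "add with carry", the usual compromise David alludes to is to defer the carries: accumulate 32-bit words into a 64-bit sum, where the carries simply pile up in the upper half, and fold them back in once at the end. The loop below is an editorial sketch under simplifying assumptions (4-byte-aligned buffer, length a multiple of four bytes), not a drop-in csum_partial.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Deferred-carry Internet-checksum loop: 32-bit words are accumulated into
 * a 64-bit sum, so no per-add carry handling is needed; the carries are
 * recovered in a single fold at the end.  Editorial sketch only.
 */
static uint16_t csum32_words(const uint32_t *p, size_t nwords)
{
	uint64_t sum = 0;

	while (nwords--)
		sum += *p++;			/* carries pile up in bits 32..63 */

	sum = (sum & 0xffffffff) + (sum >> 32);	/* fold 64 -> 33 bits */
	sum = (sum & 0xffffffff) + (sum >> 32);	/* absorb the last carry */
	sum = (sum & 0xffff) + (sum >> 16);	/* fold 32 -> 17 bits */
	sum = (sum & 0xffff) + (sum >> 16);	/* absorb the last carry */
	return (uint16_t)~sum;
}
```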
On 2/17/24 04:00, Charlie Jenkins wrote: > On Fri, Feb 16, 2024 at 01:38:51PM +0100, Helge Deller wrote: >> * Guenter Roeck <linux@roeck-us.net>: >>> hppa 64-bit systems calculates the IPv6 checksum using 64-bit add >>> operations. The last add folds protocol and length fields into the 64-bit >>> result. While unlikely, this operation can overflow. The overflow can be >>> triggered with a code sequence such as the following. >>> >>> /* try to trigger massive overflows */ >>> memset(tmp_buf, 0xff, sizeof(struct in6_addr)); >>> csum_result = csum_ipv6_magic((struct in6_addr *)tmp_buf, >>> (struct in6_addr *)tmp_buf, >>> 0xffff, 0xff, 0xffffffff); >>> >>> Fix the problem by adding any overflows from the final add operation into >>> the calculated checksum. Fortunately, we can do this without additional >>> cost by replacing the add operation used to fold the checksum into 32 bit >>> with "add,dc" to add in the missing carry. >> >> >> Gunter, >> >> Thanks for your patch for 32- and 64-bit systems. >> But I think it's time to sunset the parisc inline assembly for ipv4/ipv6 >> checksumming. >> The patch below converts the code to use standard C-coding (taken from the >> s390 header file) and it survives the v8 lib/checksum patch. >> >> Opinions? >> [...] > We can do better than this! By inspection this looks like a performance > regression. > [...] > Similar story again here where the add with carry is not well translated > into C, resulting in significantly worse assembly. Using __u64 seems to > be a big contributing factor for why the 32-bit assembly is worse. > > I am not sure the best way to represent this in a clean way in C. > > add with carry is not well understood by GCC 12.3 it seems. These > functions are generally heavily optimized on every architecture, so I > think it is worth it to default to assembly if you aren't able to > achieve comparable performance in C. Thanks a lot for your analysis!!! I've now reverted my change to switch to generic code and applied the 3 suggested patches from Guenter which fix the hppa assembly. Let's see how they behave in the for-next git tree during the next few days. Helge
diff --git a/arch/parisc/include/asm/checksum.h b/arch/parisc/include/asm/checksum.h
index e619e67440db..c949aa20fa16 100644
--- a/arch/parisc/include/asm/checksum.h
+++ b/arch/parisc/include/asm/checksum.h
@@ -137,8 +137,8 @@ static __inline__ __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
 " add,dc	%3, %0, %0\n"  /* fold in proto+len | carry bit */
 " extrd,u	%0, 31, 32, %4\n"/* copy upper half down */
 " depdi	0, 31, 32, %0\n"/* clear upper half */
-" add	%4, %0, %0\n"	   /* fold into 32-bits */
-" addc	0, %0, %0\n"	   /* add carry */
+" add,dc	%4, %0, %0\n"  /* fold into 32-bits, plus carry */
+" addc	0, %0, %0\n"	   /* add final carry */
 
 #else
hppa 64-bit systems calculate the IPv6 checksum using 64-bit add
operations. The last add folds the protocol and length fields into the
64-bit result. While unlikely, this operation can overflow. The overflow
can be triggered with a code sequence such as the following.

    /* try to trigger massive overflows */
    memset(tmp_buf, 0xff, sizeof(struct in6_addr));
    csum_result = csum_ipv6_magic((struct in6_addr *)tmp_buf,
                                  (struct in6_addr *)tmp_buf,
                                  0xffff, 0xff, 0xffffffff);

Fix the problem by adding any overflow from the final add operation into
the calculated checksum. Fortunately, we can do this without additional
cost by replacing the add operation used to fold the checksum into 32 bits
with "add,dc", which adds in the missing carry.

Cc: Charlie Jenkins <charlie@rivosinc.com>
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
---
This patch does not completely fix the problems with csum_ipv6_magic seen
when running 64-bit parisc images with the C3700 emulation in qemu. Those
are due to unaligned 64-bit load operations which (presumably as part of
unaligned trap handling) generate bad carry flags. It is unknown whether
that is a problem with the qemu emulation or with the Linux kernel, so it
is not addressed here.

 arch/parisc/include/asm/checksum.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)