Message ID | 1573006806-12037-1-git-send-email-zhangshaokun@hisilicon.com (mailing list archive) |
---|---|
State | New, archived |
Series | [v4] arm64: lib: accelerate do_csum |
On Wed, Nov 06, 2019 at 10:20:06AM +0800, Shaokun Zhang wrote:
> From: Lingyan Huang <huanglingyan2@huawei.com>
>
> Function do_csum() in lib/checksum.c is used to compute the checksum,
> which turns out to be slow and to cost a lot of resources. Let's
> accelerate the checksum computation for arm64.
>
> We tested its performance on the Huawei Kunpeng 920 SoC, as follows:
>
>    size      general(ns)   csum_128(ns)   csum_64(ns)
>    64B:          160            80             50
>   256B:          120            70             60
>  1023B:          350           140            150
>  1024B:          350           130            140
>  1500B:          470           170            180
>  2048B:          630           210            240
>  4095B:         1220           390            430
>  4096B:         1230           390            430
>
> Cc: Will Deacon <will@kernel.org>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Originally-from: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: Lingyan Huang <huanglingyan2@huawei.com>
> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
> ---
> Hi,
>
> Apologies for posting this version so late; we wanted to optimise it
> further. Lingyan tested its performance, which is attached in the
> commit log. Both variants (128-bit and 64-bit) are much better than
> the initial code.
>
> ChangeLog:
> Based on Robin's code, with the stride changed from 64 to 128 bits.
>
>  arch/arm64/include/asm/checksum.h |  3 ++
>  arch/arm64/lib/Makefile           |  2 +-
>  arch/arm64/lib/csum.c             | 81 +++++++++++++++++++++++++++++++++++++++
>  3 files changed, 85 insertions(+), 1 deletion(-)
>  create mode 100644 arch/arm64/lib/csum.c

Robin -- any chance you could look at this please? If it's based on your
code then hopefully it's straightforward to review ;)

Will
+Cc Yuke Zhang, who has used this patch and benefited from its gains while
debugging a performance issue.

Hi Will,

Thanks for reactivating this thread.
Robin, any comments are welcome; hopefully this can be merged into mainline.

Thanks,
Shaokun

On 2020/1/9 1:20, Will Deacon wrote:
> On Wed, Nov 06, 2019 at 10:20:06AM +0800, Shaokun Zhang wrote:
>> [...]
>
> Robin -- any chance you could look at this please? If it's based on your
> code then hopefully it's straightforward to review ;)
>
> Will
On 2020-01-11 8:09 am, Shaokun Zhang wrote:
> +Cc Yuke Zhang, who has used this patch and benefited from its gains while
> debugging a performance issue.
>
> Hi Will,
>
> Thanks for reactivating this thread.
> Robin, any comments are welcome; hopefully this can be merged into mainline.

OK, I had a play with this yesterday, and somewhat surprisingly even with a
recent GCC it results in utterly dreadful code. I would always have expected
the head/tail alignment in __uint128_t arithmetic to be ugly, and it certainly
is, but even the "*ptr++" load in the main loop comes out as this delightful
nugget:

     e3c:	f8410502 	ldr	x2, [x8], #16
     e40:	f85f8103 	ldur	x3, [x8, #-8]

(Clang does at least manage to emit a post-indexed LDP there, but the rest
remains pretty awful.)

Overall it ends up noticeably slower than even the generic code for small
buffers. I rigged up a crude userspace test to run the numbers below - data
is average call time in nanoseconds; "new" is the routine from this patch,
"new2/3/4" are loop-tuning variations of what I came up with when I then went
back to my WIP branch and finished off my original idea. Once I've confirmed
I got big-endian right I'll send out another patch :)

Robin.


GCC 9.2.0:
----------

Cortex-A53

     size  generic     new    new2    new3    new4
        3:      20      35      22      22      24
        8:      34      35      22      22      24
       15:      36      35      29      23      25
       48:      69      45      38      38      39
       64:      80      50      49      44      44
      256:     217     117      99     110      92
     4096:    2908    1310    1146    1269     983
  1048576:  860430  461694  461694  493173  451201

Cortex-A72

     size  generic     new    new2    new3    new4
        3:       8      21      10       9      10
        8:      20      21      10       9      10
       15:      16      21      12      11      11
       48:      29      29      18      19      20
       64:      35      30      24      21      23
      256:     125      66      48      46      46
     4096:    1720     778     532     573     450
  1048576:  472187  272819  188874  220354  167888

Clang 9.0.1:
------------

Cortex-A53

     size  generic     new    new2    new3    new4
        3:      21      29      21      21      21
        8:      33      29      21      21      21
       15:      35      28      24      23      23
       48:      73      39      36      37      38
       64:      85      44      46      42      44
      256:     220     110     107     107      89
     4096:    2949    1310    1187    1310     942
  1048576:  849937  451201  472187  482680  451201

Cortex-A72

     size  generic     new    new2    new3    new4
        3:       8      16      10      10      10
        8:      23      16      10      10      10
       15:      16      16      12      12      12
       48:      27      21      18      20      20
       64:      31      24      24      22      23
      256:     125      53      48      63      46
     4096:    1720     655     573     860     532
  1048576:  472187  230847  209861  272819  188874

> Thanks,
> Shaokun
>
> On 2020/1/9 1:20, Will Deacon wrote:
>> On Wed, Nov 06, 2019 at 10:20:06AM +0800, Shaokun Zhang wrote:
>>> [...]
>>
>> Robin -- any chance you could look at this please? If it's based on your
>> code then hopefully it's straightforward to review ;)
>>
>> Will
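Robin's figures above come from a userspace micro-benchmark rather than
in-kernel measurement. The sketch below is illustrative only: it is not the
"crude userspace test" he actually used, and the candidates[] table, function
names and iteration counts are invented for the example. It shows one
plausible way to obtain average ns/call numbers for competing do_csum()
implementations; only a naive byte-wise reference is compiled in here, and
the generic routine plus the routine from this patch would be added as
further table entries once built for userspace.

/*
 * Illustrative sketch only: NOT the harness used for the numbers above.
 * Measures average ns/call for candidate do_csum() implementations.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/*
 * Naive reference: 16-bit one's-complement sum of the buffer (the odd-start
 * byte-swap handling of the kernel's generic do_csum() is ignored here).
 */
static unsigned int csum_bytewise(const unsigned char *buff, int len)
{
	uint64_t sum = 0;
	int i;

	for (i = 0; i + 1 < len; i += 2)
		sum += buff[i] | (buff[i + 1] << 8);
	if (len & 1)
		sum += buff[len - 1];
	while (sum >> 16)			/* fold end-around carries */
		sum = (sum & 0xffff) + (sum >> 16);
	return (unsigned int)sum;
}

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

struct candidate {
	const char *name;
	unsigned int (*fn)(const unsigned char *, int);
};

int main(void)
{
	/* Buffer sizes matching the tables above. */
	static const int sizes[] = { 3, 8, 15, 48, 64, 256, 4096, 1048576 };
	/* Add the generic and patched do_csum() here once built for userspace. */
	static const struct candidate candidates[] = {
		{ "bytewise", csum_bytewise },
	};
	unsigned char *buf = malloc(1048576 + 1);
	size_t i, j, k;

	if (!buf)
		return 1;
	for (i = 0; i < 1048576 + 1; i++)
		buf[i] = (unsigned char)rand();

	for (j = 0; j < sizeof(sizes) / sizeof(sizes[0]); j++) {
		int len = sizes[j];
		int iters = len < 65536 ? 200000 : 2000;

		for (k = 0; k < sizeof(candidates) / sizeof(candidates[0]); k++) {
			volatile unsigned int sink = 0;
			uint64_t t0, t1;
			int n;

			for (n = 0; n < iters / 10; n++)	/* warm up */
				sink += candidates[k].fn(buf + 1, len);
			t0 = now_ns();
			for (n = 0; n < iters; n++)
				sink += candidates[k].fn(buf + 1, len);
			t1 = now_ns();
			/* buf + 1 keeps the input deliberately misaligned. */
			printf("%-10s %8d: %8.1f ns/call\n",
			       candidates[k].name, len,
			       (double)(t1 - t0) / iters);
			(void)sink;
		}
	}
	free(buf);
	return 0;
}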
diff --git a/arch/arm64/include/asm/checksum.h b/arch/arm64/include/asm/checksum.h
index d064a50deb5f..8d2a7de39744 100644
--- a/arch/arm64/include/asm/checksum.h
+++ b/arch/arm64/include/asm/checksum.h
@@ -35,6 +35,9 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 }
 #define ip_fast_csum ip_fast_csum
 
+extern unsigned int do_csum(const unsigned char *buff, int len);
+#define do_csum do_csum
+
 #include <asm-generic/checksum.h>
 
 #endif	/* __ASM_CHECKSUM_H */
diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
index c21b936dc01d..8a0644a831eb 100644
--- a/arch/arm64/lib/Makefile
+++ b/arch/arm64/lib/Makefile
@@ -3,7 +3,7 @@ lib-y		:= clear_user.o delay.o copy_from_user.o	\
 		   copy_to_user.o copy_in_user.o copy_page.o	\
 		   clear_page.o memchr.o memcpy.o memmove.o memset.o	\
 		   memcmp.o strcmp.o strncmp.o strlen.o strnlen.o	\
-		   strchr.o strrchr.o tishift.o
+		   strchr.o strrchr.o tishift.o csum.o
 
 ifeq ($(CONFIG_KERNEL_MODE_NEON), y)
 obj-$(CONFIG_XOR_BLOCKS)	+= xor-neon.o
diff --git a/arch/arm64/lib/csum.c b/arch/arm64/lib/csum.c
new file mode 100644
index 000000000000..20170d8dcbc4
--- /dev/null
+++ b/arch/arm64/lib/csum.c
@@ -0,0 +1,81 @@
+// SPDX-License-Identifier: GPL-2.0-only
+// Copyright (C) 2019 Arm Ltd.
+
+#include <linux/compiler.h>
+#include <linux/kasan-checks.h>
+#include <linux/kernel.h>
+
+#include <net/checksum.h>
+
+
+/* handle overflow */
+static __uint128_t accumulate128(__uint128_t sum, __uint128_t data)
+{
+	sum += (sum >> 64) | (sum << 64);
+	data += (data >> 64) | (data << 64);
+	return (sum + data) >> 64;
+}
+
+unsigned int do_csum(const unsigned char *buff, int len)
+{
+	unsigned int offset, shift, sum, count;
+	__uint128_t data, *ptr;
+	__uint128_t sum128 = 0;
+	u64 sum64 = 0;
+
+	offset = (unsigned long)buff & 0xf;
+	/*
+	 * This is to all intents and purposes safe, since rounding down cannot
+	 * result in a different page or cache line being accessed, and @buff
+	 * should absolutely not be pointing to anything read-sensitive. We do,
+	 * however, have to be careful not to piss off KASAN, which means using
+	 * unchecked reads to accommodate the head and tail, for which we'll
+	 * compensate with an explicit check up-front.
+	 */
+	kasan_check_read(buff, len);
+	ptr = (__uint128_t *)(buff - offset);
+	shift = offset * 8;
+
+	/*
+	 * Head: zero out any excess leading bytes. Shifting back by the same
+	 * amount should be at least as fast as any other way of handling the
+	 * odd/even alignment, and means we can ignore it until the very end.
+	 */
+	data = READ_ONCE_NOCHECK(*ptr++);
+#ifdef __LITTLE_ENDIAN
+	data = (data >> shift) << shift;
+#else
+	data = (data << shift) >> shift;
+#endif
+	count = 16 - offset;
+
+	/* Body: straightforward aligned loads from here on... */
+
+	while (len > count) {
+		sum128 = accumulate128(sum128, data);
+		data = READ_ONCE_NOCHECK(*ptr++);
+		count += 16;
+	}
+	/*
+	 * Tail: zero any over-read bytes similarly to the head, again
+	 * preserving odd/even alignment.
+	 */
+	shift = (count - len) * 8;
+#ifdef __LITTLE_ENDIAN
+	data = (data << shift) >> shift;
+#else
+	data = (data >> shift) << shift;
+#endif
+	sum128 = accumulate128(sum128, data);
+
+	/* Finally, folding */
+	sum128 += (sum128 >> 64) | (sum128 << 64);
+	sum64 = (sum128 >> 64);
+	sum64 += (sum64 >> 32) | (sum64 << 32);
+	sum = (sum64 >> 32);
+	sum += (sum >> 16) | (sum << 16);
+	if (offset & 1)
+		return (u16)swab32(sum);
+
+	return sum >> 16;
+}
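The folding sequence at the end of do_csum() above reduces the 128-bit
accumulator to a 16-bit value by repeatedly rotating the value by half its
width and adding, which performs a one's-complement addition of the two
halves so that carries wrap around instead of being lost. The host-side
sketch below is illustrative only (the helper names and test lengths are
invented, it accumulates 32-bit rather than 128-bit words, and it assumes a
little-endian host); it applies the same 64 -> 32 -> 16 rotate-and-add folds
and checks the result against a reference that folds the end-around carry
after every 16-bit addition.

/*
 * Illustrative sketch (not part of the patch): sanity-check the
 * rotate-and-add folding chain against a simple 16-bit reference.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Reference: 16-bit words, end-around carry folded after each add. */
static uint16_t csum16_ref(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		sum += buf[i] | (buf[i + 1] << 8);
		sum = (sum & 0xffff) + (sum >> 16);
	}
	if (len & 1)
		sum += buf[len - 1];
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Wide accumulation, then the rotate-and-add folds used by the patch. */
static uint16_t csum16_fold(const uint8_t *buf, size_t len)
{
	uint64_t sum64 = 0;
	uint32_t sum, word;
	size_t i;

	for (i = 0; i + 4 <= len; i += 4)
		sum64 += buf[i] | (buf[i + 1] << 8) |
			 ((uint32_t)buf[i + 2] << 16) |
			 ((uint32_t)buf[i + 3] << 24);
	for (word = 0; i < len; i++)		/* 0-3 tail bytes, zero-padded */
		word |= (uint32_t)buf[i] << (8 * (i & 3));
	sum64 += word;

	sum64 += (sum64 >> 32) | (sum64 << 32);	/* fold 64 -> 32 */
	sum = sum64 >> 32;
	sum += (sum >> 16) | (sum << 16);	/* fold 32 -> 16 */
	return sum >> 16;
}

int main(void)
{
	uint8_t *buf = malloc(1 << 20);
	size_t len, i;

	if (!buf)
		return 1;
	srand(1);
	for (i = 0; i < (1u << 20); i++)
		buf[i] = (uint8_t)rand();

	/* Small lengths exhaustively, plus large ones that exceed 32 bits. */
	for (len = 1; len <= 8192; len++)
		if (csum16_ref(buf, len) != csum16_fold(buf, len)) {
			printf("mismatch at len %zu\n", len);
			return 1;
		}
	for (len = 1u << 16; len <= (1u << 20); len <<= 1)
		if (csum16_ref(buf, len) != csum16_fold(buf, len)) {
			printf("mismatch at len %zu\n", len);
			return 1;
		}

	printf("fold chain matches the 16-bit reference\n");
	free(buf);
	return 0;
}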