Message ID | 20210514100106.3404011-8-arnd@kernel.org |
---|---|
State | Not Applicable |
Delegated to | Herbert Xu |
Series | Unify asm/unaligned.h around struct helper |
On Fri, May 14, 2021 at 12:00:55PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann <arnd@arndb.de>
>
> As found by Vineet Gupta and Linus Torvalds, gcc has somewhat unexpected
> behavior when faced with overlapping unaligned pointers. The kernel's
> unaligned/access-ok.h header technically invokes undefined behavior
> that happens to usually work on the architectures using it, but if the
> compiler optimizes code based on the assumption that undefined behavior
> doesn't happen, it can create output that actually causes data corruption.
>
> A related problem was previously found on 32-bit ARMv7, where most
> instructions can be used on unaligned data, but 64-bit ldrd/strd causes
> an exception. The workaround was to always use the unaligned/le_struct.h
> helper instead of unaligned/access-ok.h, in commit 1cce91dfc8f7 ("ARM:
> 8715/1: add a private asm/unaligned.h").
>
> The same solution should work on all other architectures as well, so
> remove the access-ok.h variant and use the other one unconditionally on
> all architectures, picking either the big-endian or little-endian version.

FYI, gcc 10 had a bug where it miscompiled code that uses "packed structs" to
copy between overlapping unaligned pointers
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94994).

I'm not sure whether the kernel will run into that or not, and gcc has since
fixed it. But it's worth mentioning, especially since the issue mentioned in
this commit sounds very similar (overlapping unaligned pointers), and both
involved implementations of DEFLATE decompression.

Anyway, partly due to the above, in userspace I now only use memcpy() to
implement {get,put}_unaligned_*, since these days it seems to be compiled
optimally and have the least amount of problems.

I wonder if the kernel should do the same, or whether there are still cases
where memcpy() isn't compiled optimally. armv6/7 used to be one such case,
but it was fixed in gcc 6.

- Eric
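For readers unfamiliar with the pattern being discussed: DEFLATE-style decompressors expand a "match" by copying from an earlier, possibly unaligned and overlapping position in the output buffer. The sketch below is purely illustrative (it is neither the PR 94994 reproducer nor kernel code, and the helper names are invented); it uses memcpy()-based accessors of the kind Eric describes:

    #include <stdint.h>
    #include <string.h>

    /* Unaligned 32-bit load/store via memcpy(), one of the idioms under
     * discussion (a __packed struct would be the other). */
    static inline uint32_t load32(const void *p)
    {
            uint32_t v;

            memcpy(&v, p, sizeof(v));
            return v;
    }

    static inline void store32(void *p, uint32_t v)
    {
            memcpy(p, &v, sizeof(v));
    }

    /*
     * Expand a DEFLATE-style match: copy 'len' bytes from 'src' to 'dst'
     * where the two ranges may overlap (dst > src). The word-at-a-time
     * loop is only used when dst - src >= 4, so each 4-byte load never
     * reads bytes that the previous store just wrote; shorter offsets
     * fall back to a byte copy.
     */
    static void copy_match(uint8_t *dst, const uint8_t *src, size_t len)
    {
            if (dst - src >= 4) {
                    while (len >= 4) {
                            store32(dst, load32(src));
                            dst += 4;
                            src += 4;
                            len -= 4;
                    }
            }
            while (len--)
                    *dst++ = *src++;
    }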
On Mon, May 17, 2021 at 11:53 PM Eric Biggers <ebiggers@kernel.org> wrote:
> On Fri, May 14, 2021 at 12:00:55PM +0200, Arnd Bergmann wrote:
> > From: Arnd Bergmann <arnd@arndb.de>
> >
> > As found by Vineet Gupta and Linus Torvalds, gcc has somewhat unexpected
> > behavior when faced with overlapping unaligned pointers. The kernel's
> > unaligned/access-ok.h header technically invokes undefined behavior
> > that happens to usually work on the architectures using it, but if the
> > compiler optimizes code based on the assumption that undefined behavior
> > doesn't happen, it can create output that actually causes data corruption.
> >
> > A related problem was previously found on 32-bit ARMv7, where most
> > instructions can be used on unaligned data, but 64-bit ldrd/strd causes
> > an exception. The workaround was to always use the unaligned/le_struct.h
> > helper instead of unaligned/access-ok.h, in commit 1cce91dfc8f7 ("ARM:
> > 8715/1: add a private asm/unaligned.h").
> >
> > The same solution should work on all other architectures as well, so
> > remove the access-ok.h variant and use the other one unconditionally on
> > all architectures, picking either the big-endian or little-endian version.
>
> FYI, gcc 10 had a bug where it miscompiled code that uses "packed structs" to
> copy between overlapping unaligned pointers
> (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94994).

Thank you for pointing this out.

> I'm not sure whether the kernel will run into that or not, and gcc has since
> fixed it. But it's worth mentioning, especially since the issue mentioned in
> this commit sounds very similar (overlapping unaligned pointers), and both
> involved implementations of DEFLATE decompression.

I tried reproducing this on the kernel deflate code with the kernel.org
gcc-10.1 and gcc-10.3 crosstool versions but couldn't quite get there with
Vineet's preprocessed source
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

Trying with both the original get_unaligned() version in there and the
packed-struct variant, I get the same output from gcc-10.1 and gcc-10.3 when
I compile those myself for arc hs4x, but it's rather different from the
output that Vineet got, and I don't know how to spot whether the problem
exists in any of those versions.

> Anyway, partly due to the above, in userspace I now only use memcpy() to
> implement {get,put}_unaligned_*, since these days it seems to be compiled
> optimally and have the least amount of problems.
>
> I wonder if the kernel should do the same, or whether there are still cases
> where memcpy() isn't compiled optimally. armv6/7 used to be one such case,
> but it was fixed in gcc 6.

It would have to be memmove(), not memcpy() in this case, right?

My feeling is that if gcc-4.9 and gcc-5 produce correct but slightly slower
code, we can live with that, unlike the possibility of gcc-10.{1,2} producing
incorrect code.

Since the new asm/unaligned.h has a single implementation across all
architectures, we could probably fall back to a memmove-based version for the
compilers affected by the 94994 bug, but I'd first need a better way to test
whether a given combination of asm/unaligned.h and compiler version runs into
this bug.

I have checked your reproducer and confirmed that it does affect x86_64
gcc-10.1 -O3 with my proposed version of asm-generic/unaligned.h, but does
not trigger on any other version (4.9 through 9.3, 10.3 or 11.1), and not on
-O2 or "-O3 -mno-sse" builds or on arm64, but that doesn't necessarily mean
it's safe on these.

       Arnd
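For reference, the "packed-struct variant" mentioned above works roughly as follows. This is a simplified, standalone sketch in the spirit of linux/unaligned/packed_struct.h and le_struct.h, with invented names and userspace types, not the verbatim kernel headers:

    #include <stdint.h>

    /* The kernel spells the type u32 and the attribute __packed, and
     * splits this across linux/unaligned/packed_struct.h and
     * le_struct.h / be_struct.h. */
    struct una_u32 { uint32_t x; } __attribute__((packed));

    static inline uint32_t get_unaligned_u32(const void *p)
    {
            const struct una_u32 *ptr = p;

            /* The packed struct tells the compiler the field may sit at
             * any address, so it emits an unaligned-safe access: a single
             * load on x86/ARMv7, byte loads on strict-alignment targets. */
            return ptr->x;
    }

    static inline void put_unaligned_u32(uint32_t val, void *p)
    {
            struct una_u32 *ptr = p;

            ptr->x = val;
    }

On a little-endian build the LE accessors are just these CPU accesses; a big-endian build adds the byte swap.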
On Tue, May 18, 2021 at 12:27 AM Arnd Bergmann <arnd@kernel.org> wrote:
> >
> > I wonder if the kernel should do the same, or whether there are still cases
> > where memcpy() isn't compiled optimally. armv6/7 used to be one such case,
> > but it was fixed in gcc 6.
>
> It would have to be memmove(), not memcpy() in this case, right?

No, it would simply be something like

#define __get_unaligned_t(type, ptr) \
        ({ type __val; memcpy(&__val, ptr, sizeof(type)); __val; })

#define get_unaligned(ptr) \
        __get_unaligned_t(typeof(*(ptr)), ptr)

but honestly, the likelihood that the compiler generates something horrible
(possibly because of KASAN etc) is uncomfortably high.

I'd prefer the __packed thing. We don't actually use -O3, and it's considered
a bad idea, and the gcc bug is as such less likely than just the above
generating unacceptable code (we have several cases where "bad code
generation" ends up being an actual bug, since we depend on inlining and
depend on some code sequences not generating calls etc).

But I hate how gcc is buggy in so many places here, and the straightforward
thing is made to explicitly not work.

I absolutely despise compiler people who think it's ok to generate known bad
code based on pointless "undefined behavior" arguments - and then those same
clever optimizations break even when you do things properly.

It's basically intellectual dishonesty - doing known fragile things, blaming
the user when it breaks, but then not acknowledging that the fragile shit
they did was broken even when the user bent over backwards.

               Linus
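A quick usage sketch of the memcpy-based macros Linus shows above (standalone userspace C using GNU extensions, statement expressions and typeof, as the kernel does; the buffer contents are invented for illustration):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define __get_unaligned_t(type, ptr) \
            ({ type __val; memcpy(&__val, ptr, sizeof(type)); __val; })

    #define get_unaligned(ptr) \
            __get_unaligned_t(typeof(*(ptr)), ptr)

    int main(void)
    {
            unsigned char buf[8] = { 0, 0x78, 0x56, 0x34, 0x12, 0, 0, 0 };

            /* &buf[1] is almost certainly not 4-byte aligned */
            uint32_t v = get_unaligned((uint32_t *)&buf[1]);

            printf("0x%08x\n", v);  /* prints 0x12345678 on a little-endian host */
            return 0;
    }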
On Tue, May 18, 2021 at 4:56 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Tue, May 18, 2021 at 12:27 AM Arnd Bergmann <arnd@kernel.org> wrote:
> > >
> > > I wonder if the kernel should do the same, or whether there are still cases
> > > where memcpy() isn't compiled optimally. armv6/7 used to be one such case,
> > > but it was fixed in gcc 6.
> >
> > It would have to be memmove(), not memcpy() in this case, right?
>
> No, it would simply be something like
>
> #define __get_unaligned_t(type, ptr) \
>         ({ type __val; memcpy(&__val, ptr, sizeof(type)); __val; })
>
> #define get_unaligned(ptr) \
>         __get_unaligned_t(typeof(*(ptr)), ptr)
>
> but honestly, the likelihood that the compiler generates something
> horrible (possibly because of KASAN etc) is uncomfortably high.
>
> I'd prefer the __packed thing. We don't actually use -O3, and it's
> considered a bad idea, and the gcc bug is as such less likely than
> just the above generating unacceptable code (we have several cases
> where "bad code generation" ends up being an actual bug, since we
> depend on inlining and depend on some code sequences not generating
> calls etc).

I think the important question is whether we know that the bug that Eric
pointed to can only happen with -O3, or whether it is something in gcc-10.1
that got triggered by -O3 -msse on x86-64 but could equally well get
triggered on some other architecture without -O3 or vector instructions
enabled.

From the gcc fix at
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=9fa5b473b5b8e289b
it looks like this code path is entered when compiling with
-ftree-loop-vectorize, which is documented as

    '-ftree-loop-vectorize'
        Perform loop vectorization on trees. This flag is enabled by
        default at '-O3' and by '-ftree-vectorize', '-fprofile-use', and
        '-fauto-profile'.

-ftree-vectorize is set in arch/arm/lib/xor-neon.c
-O3 is set for the lz4 and zstd compression helpers and for wireguard.

To be on the safe side, we could pass -fno-tree-loop-vectorize along with -O3
on the affected gcc versions, or use a bigger hammer (not use -O3 at all,
always set -fno-tree-loop-vectorize, ...).

       Arnd
On Tue, May 18, 2021 at 5:42 AM Arnd Bergmann <arnd@kernel.org> wrote:
>
> To be on the safe side, we could pass -fno-tree-loop-vectorize along
> with -O3 on the affected gcc versions, or use a bigger hammer
> (not use -O3 at all, always set -fno-tree-loop-vectorize, ...).

I personally think -O3 in general is unsafe. It has historically been
horribly buggy. It's gotten better, but this case clearly shows that "gotten
better" really isn't that high of a bar.

Very few projects use -O3, which is obviously part of why it's buggy. But the
other part of why it's buggy is that vectorization is simply very
complicated, and honestly, judging by the last report the gcc people don't
care about being careful. They literally are ok with knowingly generating an
off-by-one range check, because "it's undefined behavior".

With that kind of mentality, I'm not personally all that inclined to say
"sure, use -O3". We know it has bugs even for the well-defined cases.

> -O3 is set for the lz4 and zstd compression helpers and for wireguard.

I'm actually surprised wireguard would use -O3. Yes, performance is
important. But for wireguard, correctness is certainly important too. Maybe
Jason isn't aware of just how bad gcc -O3 has historically been?

And -O3 has often generated _slower_ code, in addition to the bugs. It's not
like it's a situation where "-O3 is obviously better than -O2". There's a
reason -O2 is the default. And that tends to be even more true in the kernel
than in many user space programs (ie smaller loops, generally much higher I$
miss rates etc).

Jason? How big of a deal is that -O3 for wireguard wrt the normal -O2? There
are known buggy gcc versions that aren't ancient.

Of the other cases, that xor-neon.c case actually makes sense. For that file,
it literally exists _only_ to get a vectorized version of the trivial
xor_8regs loop. It's one of the (very very few) cases of vectorization we
actually want. And in that case, we might even want to make things easier -
and more explicit - for the compiler by making the xor_8regs loops use
"restrict" pointers.

That neon case actually wants and needs that tree-vectorization to DTRT. But
maybe it doesn't need the actual _loop_ vectorization? The xor_8regs code is
literally using hand-unrolled loops already, exactly to make it as simple as
possible for the compiler (but the lack of "restrict" pointers means that
it's not all that simple after all, and I assume the compiler generates
conditionals for the NEON case?).

lz4 is questionable - yes, upstream lz4 seems to use -O3 (good), but it also
very much uses unaligned accesses, which is where the gcc bug hits. I doubt
that it really needs or wants the loop vectorization.

zstd looks very similar to lz4.

End result: at a minimum, I'd suggest using "-fno-tree-loop-vectorize",
although somebody should check that NEON case.

And I still think that using O3 for anything halfway complicated should be
considered odd and need some strong numbers to enable.

               Linus
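For context, the xor_8regs loops being discussed have roughly the shape below (a sketch modeled on include/asm-generic/xor.h, not a verbatim copy). Marking the pointers "restrict" is what would let the vectorizer assume the buffers never alias, which is also why, as noted later in the thread, it would be wrong for callers that xor in place:

    /* Sketch of an xor_8regs-style routine with restrict-qualified
     * pointers. "restrict" promises p1 and p2 never alias, so the
     * compiler can vectorize without runtime overlap checks -- and that
     * promise is exactly what an in-place caller would violate.
     * As in the kernel, callers are assumed to pass a byte count that is
     * a non-zero multiple of 8 machine words. */
    static void xor_8regs_2_restrict(unsigned long bytes,
                                     unsigned long * restrict p1,
                                     const unsigned long * restrict p2)
    {
            long lines = bytes / (sizeof(unsigned long) * 8);

            do {
                    p1[0] ^= p2[0];
                    p1[1] ^= p2[1];
                    p1[2] ^= p2[2];
                    p1[3] ^= p2[3];
                    p1[4] ^= p2[4];
                    p1[5] ^= p2[5];
                    p1[6] ^= p2[6];
                    p1[7] ^= p2[7];
                    p1 += 8;
                    p2 += 8;
            } while (--lines > 0);
    }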
Hi Linus,

On Tue, May 18, 2021 at 6:12 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
> I'm actually surprised wireguard would use -O3. Yes, performance is
> important. But for wireguard, correctness is certainly important too.
> Maybe Jason isn't aware of just how bad gcc -O3 has historically been?
> Jason? How big of a deal is that -O3 for wireguard wrt the normal -O2?
> There are known buggy gcc versions that aren't ancient.

My impression has always been that O3 might sometimes generate slower code,
but not that it generates buggy code so commonly. Thanks for letting me know.

I have a habit of compulsively running IDA Pro after making changes (brain
damage from too many years as a "security person" or something) to see what
the compiler did, and I've just been doing that with O3 since the beginning
of the project, so that's what I wound up optimizing for. Or sometimes I'll
work little things out in Godbolt's compiler explorer. It's not like it
matters much most of the time, but sometimes I enjoy the golf. Anyway, I've
never noticed it producing any clearly wrong code compared to O2. But I'm
obviously not testing on all compilers or on all architectures. So if you
think there's danger lurking somewhere, it seems reasonable to change that
to O2.

Comparing gcc 11's output between O2 and O3, it looks like the primary
difference is that the constant propagation is much less aggressive with O2,
and less inlining in general also means that some stores and loads to local
variables across static function calls aren't being coalesced. A few null
checks are removed too, where the compiler can prove them away. So while I've
never seen issues with that code under O3, I don't see a super compelling
speed up anywhere either, but rather a bunch of places that may or may not be
theoretically faster or slower on some system, maybe.

I can queue up a patch for the next wireguard series I send to Dave.

Jason
On Tue, May 18, 2021 at 6:12 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, May 18, 2021 at 5:42 AM Arnd Bergmann <arnd@kernel.org> wrote:
>
> Of the other cases, that xor-neon.c case actually makes sense. For
> that file, it literally exists _only_ to get a vectorized version of
> the trivial xor_8regs loop. It's one of the (very very few) cases of
> vectorization we actually want. And in that case, we might even want
> to make things easier - and more explicit - for the compiler by making
> the xor_8regs loops use "restrict" pointers.
>
> That neon case actually wants and needs that tree-vectorization to
> DTRT. But maybe it doesn't need the actual _loop_ vectorization? The
> xor_8regs code is literally using hand-unrolled loops already, exactly
> to make it as simple as possible for the compiler (but the lack of
> "restrict" pointers means that it's not all that simple after all, and
> I assume the compiler generates conditionals for the NEON case?).

Right, I think there is an ongoing debate over how to best handle this one in
clang, since that does not do any vectorization for this file unless the
pointers are marked "restrict". As far as I remember, there are some callers
that want to do the xor in place though, which means restrict is wrong.

> lz4 is questionable - yes, upstream lz4 seems to use -O3 (good), but
> it also very much uses unaligned accesses, which is where the gcc bug
> hits. I doubt that it really needs or wants the loop vectorization.

I ran some limited speed tests with the lz4 sources that come with Ubuntu,
using gcc-10.3 on an AMD Zen1 Threadripper with 10GB of /dev/urandom input.
This package patches the sources to use -O2 and no vectorization, which turns
out to be the fastest combination for my stupid test as well. The results are
barely above noise, but it appears that -O2 -ftree-loop-vectorize makes it
slightly slower than just -O2, while -O3 is even slower than that regardless
of -fno-tree-loop-vectorize/-ftree-loop-vectorize.

I see that Nobuhiro Iwamatsu (Cc'd) changed the Debian lz4 package to use
-O2, but I don't see an explanation for it. I also see that the lz4 sources
force -O2 on ppc64 because -O3 causes a 30% slowdown from vectorization. The
kernel version is missing the bit that does this.

> zstd looks very similar to lz4.
>
> End result: at a minimum, I'd suggest using
> "-fno-tree-loop-vectorize", although somebody should check that NEON
> case.
>
> And I still think that using O3 for anything halfway complicated
> should be considered odd and need some strong numbers to enable.

Agreed. I think there is a fairly strong case for just using -O2 on lz4
and backporting that to stable.
Searching for lz4 bugs with -O3 also finds several reports including
one that I sent myself:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69702

I see that user space zstd is built with -O3 in Debian, but the changelog
also lists "Improved : better speed on clang and gcc -O2, thanks to Eric
Biggers", so maybe Eric has some useful ideas on whether we should just use
-O2 for the in-kernel version.

       Arnd
From: Linus Torvalds
> Sent: 18 May 2021 15:56
>
> On Tue, May 18, 2021 at 12:27 AM Arnd Bergmann <arnd@kernel.org> wrote:
> > >
> > > I wonder if the kernel should do the same, or whether there are still cases
> > > where memcpy() isn't compiled optimally. armv6/7 used to be one such case,
> > > but it was fixed in gcc 6.
> >
> > It would have to be memmove(), not memcpy() in this case, right?
>
> No, it would simply be something like
>
> #define __get_unaligned_t(type, ptr) \
>         ({ type __val; memcpy(&__val, ptr, sizeof(type)); __val; })

You still need something to ensure that gcc can't assume that 'ptr' has an
aligned type. If there is an 'int *ptr' visible in the call chain, no amount
of (void *) casts will make gcc forget the alignment, so the memcpy() will
get converted to an aligned load-store pair. (This has always caused grief
on sparc.)

A cast through (long) might be enough, as might a cast to a __packed struct
pointer type. Using a union of the two pointer types might be ok - but might
generate a store/load to stack.

An alternative is an asm statement with input and output of different pointer
types but using the same register for both. That ought to force the compiler
to forget any tracked type and value.

	David
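One way to spell the asm-based laundering David describes is the sketch below; the helper name is invented here, and the kernel has a similar idiom in its OPTIMIZER_HIDE_VAR() macro:

    /*
     * Pass a pointer through an empty asm that claims it may rewrite the
     * register in place. Afterwards the compiler can no longer derive
     * alignment (or anything else) from the pointer's original declared
     * type, so a following memcpy() should not be turned into an aligned
     * load/store pair. Similar in spirit to the kernel's
     * OPTIMIZER_HIDE_VAR(); the helper name is invented for this sketch.
     */
    static inline void *launder_pointer(void *p)
    {
            __asm__ ("" : "+r" (p));
            return p;
    }

The trade-off is that this also hides information the compiler could have used legitimately, which is presumably why the __packed and memcpy() idioms are the ones being weighed for the common helpers.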
On Tue, May 18, 2021 at 10:51:23PM +0200, Arnd Bergmann wrote:
>
> > zstd looks very similar to lz4.
>
> > End result: at a minimum, I'd suggest using
> > "-fno-tree-loop-vectorize", although somebody should check that NEON
> > case.
>
> > And I still think that using O3 for anything halfway complicated
> > should be considered odd and need some strong numbers to enable.
>
> Agreed. I think there is a fairly strong case for just using -O2 on lz4
> and backporting that to stable.
> Searching for lz4 bugs with -O3 also finds several reports including
> one that I sent myself:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69702
>
> I see that user space zstd is built with -O3 in Debian, but the changelog
> also lists "Improved : better speed on clang and gcc -O2, thanks to Eric
> Biggers", so maybe Eric has some useful ideas on whether we should just use
> -O2 for the in-kernel version.
>

In my opinion, -O2 is a good default even for compression code. I generally
don't see any benefit from -O3 in compression code I've written. That being
said, -O2 is what I usually use during development. Other people could write
code that relies on -O3 to be optimized well.

The Makefiles for lz4 and zstd use -O3 by default, which is a little
concerning. I do expect that they're still well-written enough to do well
with -O2 too, but it would require doing benchmarks to tell for sure.

(As Arnd noted, it happens that I did do such benchmarks on zstd about 5
years ago, and I found an issue where some functions weren't marked inline
when they should be, causing them to be inlined at -O3 but not at -O2. That
got fixed.)

- Eric
diff --git a/arch/arm/include/asm/unaligned.h b/arch/arm/include/asm/unaligned.h
deleted file mode 100644
index 3c5248fb4cdc..000000000000
--- a/arch/arm/include/asm/unaligned.h
+++ /dev/null
@@ -1,25 +0,0 @@
-#ifndef __ASM_ARM_UNALIGNED_H
-#define __ASM_ARM_UNALIGNED_H
-
-/*
- * We generally want to set CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS on ARMv6+,
- * but we don't want to use linux/unaligned/access_ok.h since that can lead
- * to traps on unaligned stm/ldm or strd/ldrd.
- */
-#include <asm/byteorder.h>
-
-#if defined(__LITTLE_ENDIAN)
-# include <linux/unaligned/le_struct.h>
-# include <linux/unaligned/generic.h>
-# define get_unaligned	__get_unaligned_le
-# define put_unaligned	__put_unaligned_le
-#elif defined(__BIG_ENDIAN)
-# include <linux/unaligned/be_struct.h>
-# include <linux/unaligned/generic.h>
-# define get_unaligned	__get_unaligned_be
-# define put_unaligned	__put_unaligned_be
-#else
-# error need to define endianess
-#endif
-
-#endif /* __ASM_ARM_UNALIGNED_H */
diff --git a/arch/mips/crypto/crc32-mips.c b/arch/mips/crypto/crc32-mips.c
index faa88a6a74c0..0a03529cf317 100644
--- a/arch/mips/crypto/crc32-mips.c
+++ b/arch/mips/crypto/crc32-mips.c
@@ -8,13 +8,13 @@
  * Copyright (C) 2018 MIPS Tech, LLC
  */
 
-#include <linux/unaligned/access_ok.h>
 #include <linux/cpufeature.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/string.h>
 #include <asm/mipsregs.h>
+#include <asm/unaligned.h>
 
 #include <crypto/internal/hash.h>
 
diff --git a/include/asm-generic/unaligned.h b/include/asm-generic/unaligned.h
index d79df721ae60..36bf03aaa674 100644
--- a/include/asm-generic/unaligned.h
+++ b/include/asm-generic/unaligned.h
@@ -8,22 +8,13 @@
  */
 #include <asm/byteorder.h>
 
-/* Set by the arch if it can handle unaligned accesses in hardware. */
-#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
-# include <linux/unaligned/access_ok.h>
-#endif
-
 #if defined(__LITTLE_ENDIAN)
-# ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
-# include <linux/unaligned/le_struct.h>
-# endif
+# include <linux/unaligned/le_struct.h>
 # include <linux/unaligned/generic.h>
 # define get_unaligned	__get_unaligned_le
 # define put_unaligned	__put_unaligned_le
 #elif defined(__BIG_ENDIAN)
-# ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
-# include <linux/unaligned/be_struct.h>
-# endif
+# include <linux/unaligned/be_struct.h>
 # include <linux/unaligned/generic.h>
 # define get_unaligned	__get_unaligned_be
 # define put_unaligned	__put_unaligned_be
diff --git a/include/linux/unaligned/access_ok.h b/include/linux/unaligned/access_ok.h
deleted file mode 100644
index 167aa849c0ce..000000000000
--- a/include/linux/unaligned/access_ok.h
+++ /dev/null
@@ -1,68 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _LINUX_UNALIGNED_ACCESS_OK_H
-#define _LINUX_UNALIGNED_ACCESS_OK_H
-
-#include <linux/kernel.h>
-#include <asm/byteorder.h>
-
-static __always_inline u16 get_unaligned_le16(const void *p)
-{
-	return le16_to_cpup((__le16 *)p);
-}
-
-static __always_inline u32 get_unaligned_le32(const void *p)
-{
-	return le32_to_cpup((__le32 *)p);
-}
-
-static __always_inline u64 get_unaligned_le64(const void *p)
-{
-	return le64_to_cpup((__le64 *)p);
-}
-
-static __always_inline u16 get_unaligned_be16(const void *p)
-{
-	return be16_to_cpup((__be16 *)p);
-}
-
-static __always_inline u32 get_unaligned_be32(const void *p)
-{
-	return be32_to_cpup((__be32 *)p);
-}
-
-static __always_inline u64 get_unaligned_be64(const void *p)
-{
-	return be64_to_cpup((__be64 *)p);
-}
-
-static __always_inline void put_unaligned_le16(u16 val, void *p)
-{
-	*((__le16 *)p) = cpu_to_le16(val);
-}
-
-static __always_inline void put_unaligned_le32(u32 val, void *p)
-{
-	*((__le32 *)p) = cpu_to_le32(val);
-}
-
-static __always_inline void put_unaligned_le64(u64 val, void *p)
-{
-	*((__le64 *)p) = cpu_to_le64(val);
-}
-
-static __always_inline void put_unaligned_be16(u16 val, void *p)
-{
-	*((__be16 *)p) = cpu_to_be16(val);
-}
-
-static __always_inline void put_unaligned_be32(u32 val, void *p)
-{
-	*((__be32 *)p) = cpu_to_be32(val);
-}
-
-static __always_inline void put_unaligned_be64(u64 val, void *p)
-{
-	*((__be64 *)p) = cpu_to_be64(val);
-}
-
-#endif /* _LINUX_UNALIGNED_ACCESS_OK_H */