Message ID | 20220415021328.7D31EC385A1@smtp.kernel.org
---|---
State | New
Series | [01/14] MAINTAINERS: Broadcom internal lists aren't maintainers
On Thu, Apr 14, 2022 at 7:13 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> Revert shmem_file_read_iter() to using ZERO_PAGE for holes only when
> iter_is_iovec(); in other cases, use the more natural iov_iter_zero()
> instead of copy_page_to_iter(). We would use iov_iter_zero() throughout,
> but the x86 clear_user() is not nearly so well optimized as copy to user
> (dd of 1T sparse tmpfs file takes 57 seconds rather than 44 seconds).

Ugh.

I've applied this patch, but honestly, the proper course of action should just be to improve on clear_user().

If it really is important enough that we should care about that performance, then we just should fix clear_user(). It's a very odd special thing right now (at least on x86-64), using some strange handcrafted inline asm code.

I assume that 'rep stosb' is the fastest way to clear things on modern CPUs that have FSRM, and then we have the usual fallbacks (ie ERMS -> "rep stos" except for small areas, and probably the "store zeros by hand" loop for older CPUs).

Adding PeterZ and Borislav (who seem to be the last ones to have worked on the copy and clear_page stuff respectively) and the x86 maintainers in case somebody gets the urge to just fix this. Memory clearing should be faster than copying, and the thing that makes copying fast is the FSRM and ERMS logic (the whole "manually unrolled copy" is hopefully mostly a thing of the past and we can consider it legacy).

Linus
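The fallback ladder sketched above, written as a plain runtime dispatch, would look roughly like the following. This is only a sketch: the helper names (clear_user_rep_stosb() and friends) are hypothetical, the 64-byte cutoff is an assumption, and the real kernel would patch one variant in with the alternatives machinery rather than branch on every call.

    #include <asm/cpufeature.h>
    #include <linux/uaccess.h>

    /* Hypothetical out-of-line variants, one per strategy. */
    unsigned long clear_user_rep_stosb(void __user *addr, unsigned long len);
    unsigned long clear_user_rep_stosq(void __user *addr, unsigned long len);
    unsigned long clear_user_unrolled(void __user *addr, unsigned long len);

    static unsigned long clear_user_dispatch(void __user *addr, unsigned long len)
    {
            /* FSRM: 'rep stosb' is fast for any size. */
            if (boot_cpu_has(X86_FEATURE_FSRM))
                    return clear_user_rep_stosb(addr, len);
            /* ERMS: 'rep stosb' only pays off above some small cutoff. */
            if (boot_cpu_has(X86_FEATURE_ERMS) && len >= 64)
                    return clear_user_rep_stosb(addr, len);
            /* rep_good: 'rep stosq' plus a byte-sized tail. */
            if (boot_cpu_has(X86_FEATURE_REP_GOOD))
                    return clear_user_rep_stosq(addr, len);
            /* Older CPUs: store zeros by hand. */
            return clear_user_unrolled(addr, len);
    }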
On Fri, Apr 15, 2022 at 03:10:51PM -0700, Linus Torvalds wrote: > On Thu, Apr 14, 2022 at 7:13 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > > Revert shmem_file_read_iter() to using ZERO_PAGE for holes only when > > iter_is_iovec(); in other cases, use the more natural iov_iter_zero() > > instead of copy_page_to_iter(). We would use iov_iter_zero() throughout, > > but the x86 clear_user() is not nearly so well optimized as copy to user > > (dd of 1T sparse tmpfs file takes 57 seconds rather than 44 seconds). > > Ugh. > > I've applied this patch, but honestly, the proper course of action > should just be to improve on clear_user(). > > If it really is important enough that we should care about that > performance, then we just should fix clear_user(). > > It's a very odd special thing right now (at least on x86-64) using > some strange handcrafted inline asm code. > > I assume that 'rep stosb' is the fastest way to clear things on modern > CPU's that have FSRM, and then we have the usual fallbacks (ie ERMS -> > "rep stos" except for small areas, and probably that "store zeros by > hand" for older CPUs). > > Adding PeterZ and Borislav (who seem to be the last ones to have > worked on the copy and clear_page stuff respectively) and the x86 > maintainers in case somebody gets the urge to just fix this. Perhaps the x86 maintainers would like to start from https://lore.kernel.org/lkml/20210523180423.108087-1-sneves@dei.uc.pt/ instead of pushing that work off on the submitter.
On Fri, 15 Apr 2022, Linus Torvalds wrote: > On Thu, Apr 14, 2022 at 7:13 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > > Revert shmem_file_read_iter() to using ZERO_PAGE for holes only when > > iter_is_iovec(); in other cases, use the more natural iov_iter_zero() > > instead of copy_page_to_iter(). We would use iov_iter_zero() throughout, > > but the x86 clear_user() is not nearly so well optimized as copy to user > > (dd of 1T sparse tmpfs file takes 57 seconds rather than 44 seconds). > > Ugh. > > I've applied this patch, Phew, thanks. > but honestly, the proper course of action > should just be to improve on clear_user(). You'll find no disagreement here: we've all been saying the same. It's just that that work is yet to be done (or yet to be accepted). > > If it really is important enough that we should care about that > performance, then we just should fix clear_user(). > > It's a very odd special thing right now (at least on x86-64) using > some strange handcrafted inline asm code. > > I assume that 'rep stosb' is the fastest way to clear things on modern > CPU's that have FSRM, and then we have the usual fallbacks (ie ERMS -> > "rep stos" except for small areas, and probably that "store zeros by > hand" for older CPUs). > > Adding PeterZ and Borislav (who seem to be the last ones to have > worked on the copy and clear_page stuff respectively) and the x86 > maintainers in case somebody gets the urge to just fix this. Yes, it was exactly Borislav and PeterZ whom I first approached too, link 3 in the commit message of the patch that this one is fixing, https://lore.kernel.org/lkml/2f5ca5e4-e250-a41c-11fb-a7f4ebc7e1c9@google.com/ Borislav wants a thorough good patch, and I don't blame him for that! Hugh > > Because memory clearing should be faster than copying, and the thing > that makes copying fast is that FSRM and ERMS logic (the whole > "manually unrolled copy" is hopefully mostly a thing of the past and > we can consider it legacy) > > Linus
On Sat, 16 Apr 2022, Borislav Petkov wrote:
> On Fri, Apr 15, 2022 at 03:10:51PM -0700, Linus Torvalds wrote:
> > Adding PeterZ and Borislav (who seem to be the last ones to have
> > worked on the copy and clear_page stuff respectively) and the x86
> > maintainers in case somebody gets the urge to just fix this.
>
> I guess if enough people ask and keep asking, some people at least try
> to move...
>
> > Because memory clearing should be faster than copying, and the thing
> > that makes copying fast is that FSRM and ERMS logic (the whole
> > "manually unrolled copy" is hopefully mostly a thing of the past and
> > we can consider it legacy)
>
> So I did give it a look and it seems to me, if we want to do the
> alternatives thing here, it will have to look something like
> arch/x86/lib/copy_user_64.S.
>
> I.e., the current __clear_user() will have to become the "handle_tail"
> thing there which deals with uncopied rest-bytes at the end and the new
> fsrm/erms/rep_good variants will then be alternative_call_2 or _3.
>
> The fsrm thing will have only the handle_tail thing at the end when size
> != 0.
>
> The others - erms and rep_good - will have to check for sizes smaller
> than, say a cacheline, and for those call the handle_tail thing directly
> instead of going into a REP loop. The current __clear_user() is still a
> lot better than that copy_user_generic_unrolled() abomination. And it's
> not like old CPUs would get any perf penalty - they'll simply use the
> same code.
>
> And then you need the labels for _ASM_EXTABLE_UA() exception handling.
>
> Anyway, something along those lines.
>
> And then we'll need to benchmark this on a bunch of current machines to
> make sure there's no funny surprises, perf-wise.
>
> I can get cracking on this but I would advise people not to hold their
> breaths. :)
>
> Unless someone has a better idea or is itching to get hands dirty
> her-/himself.

I've done a skeleton implementation of alternative __clear_user() based on CPU features. It has three versions of __clear_user():

 o __clear_user_original() - similar to the 'standard' __clear_user()
 o __clear_user_rep_good() - using rep stos{q,b} when the CPU has 'rep_good'
 o __clear_user_erms()     - using 'rep stosb' when the CPU has 'erms'

Not claiming the implementation is ideal, but it might be a useful starting point for someone. Patch is against 5.18.0-rc2. Only basic sanity testing done.

Simple performance testing done for large sizes, on a system (Intel E8400) which has rep_good but not erms:

 # dd if=/dev/zero of=/dev/null bs=16384 count=10000

 o *_original() - ~14.2GB/s. Same as the 'standard' __clear_user().
 o *_rep_good() - same throughput as *_original().
 o *_erms()     - ~12.2GB/s (expected on a system without erms).

No performance testing done for zeroing small sizes.
Cheers,
Mark

Signed-off-by: Mark Hemment <markhemm@googlemail.com>
---
 arch/x86/include/asm/asm.h        |  39 +++++++++++++++
 arch/x86/include/asm/uaccess_64.h |  36 ++++++++++++++
 arch/x86/lib/clear_page_64.S      | 100 ++++++++++++++++++++++++++++++++++++++
 arch/x86/lib/usercopy_64.c        |  32 ------------
 4 files changed, 175 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index fbcfec4dc4cc..373ed6be7a8d 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -132,6 +132,35 @@
 /* Exception table entry */
 #ifdef __ASSEMBLY__
 
+# define UNDEFINE_EXTABLE_TYPE_REG					\
+	.purgem extable_type_reg ;
+
+# define DEFINE_EXTABLE_TYPE_REG					\
+	.macro extable_type_reg type:req reg:req ;			\
+	.set .Lfound, 0 ;						\
+	.set .Lregnr, 0 ;						\
+	.irp rs,rax,rcx,rdx,rbx,rsp,rbp,rsi,rdi,r8,r9,r10,r11,r12,r13,	\
+	     r14,r15 ;							\
+	.ifc \reg, %\rs ;						\
+	.set .Lfound, .Lfound+1 ;					\
+	.long \type + (.Lregnr << 8) ;					\
+	.endif ;							\
+	.set .Lregnr, .Lregnr+1 ;					\
+	.endr ;								\
+	.set .Lregnr, 0 ;						\
+	.irp rs,eax,ecx,edx,ebx,esp,ebp,esi,edi,r8d,r9d,r10d,r11d,r12d,	\
+	     r13d,r14d,r15d ;						\
+	.ifc \reg, %\rs ;						\
+	.set .Lfound, .Lfound+1 ;					\
+	.long \type + (.Lregnr << 8) ;					\
+	.endif ;							\
+	.set .Lregnr, .Lregnr+1 ;					\
+	.endr ;								\
+	.if (.Lfound != 1) ;						\
+	.error "extable_type_reg: bad register argument" ;		\
+	.endif ;							\
+	.endm ;
+
 # define _ASM_EXTABLE_TYPE(from, to, type)			\
 	.pushsection "__ex_table","a" ;				\
 	.balign 4 ;						\
@@ -140,6 +169,16 @@
 	.long type ;						\
 	.popsection
 
+# define _ASM_EXTABLE_TYPE_REG(from, to, type1, reg1)		\
+	.pushsection "__ex_table","a" ;				\
+	.balign 4 ;						\
+	.long (from) - . ;					\
+	.long (to) - . ;					\
+	DEFINE_EXTABLE_TYPE_REG					\
+	extable_type_reg reg=reg1, type=type1 ;			\
+	UNDEFINE_EXTABLE_TYPE_REG				\
+	.popsection
+
 # ifdef CONFIG_KPROBES
 #  define _ASM_NOKPROBE(entry)					\
 	.pushsection "_kprobe_blacklist","aw" ;			\
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 45697e04d771..6a4995e4cfae 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -79,4 +79,40 @@ __copy_from_user_flushcache(void *dst, const void __user *src, unsigned size)
 	kasan_check_write(dst, size);
 	return __copy_user_flushcache(dst, src, size);
 }
+
+/*
+ * Zero Userspace.
+ */
+
+__must_check unsigned long
+clear_user_original(void __user *addr, unsigned long len);
+__must_check unsigned long
+clear_user_rep_good(void __user *addr, unsigned long len);
+__must_check unsigned long
+clear_user_erms(void __user *addr, unsigned long len);
+
+static __always_inline __must_check unsigned long
+___clear_user(void __user *addr, unsigned long len)
+{
+	unsigned long ret;
+
+	/*
+	 * No memory constraint because it doesn't change any memory gcc
+	 * knows about.
+	 */
+
+	might_fault();
+	alternative_call_2(
+		clear_user_original,
+		clear_user_rep_good,
+		X86_FEATURE_REP_GOOD,
+		clear_user_erms,
+		X86_FEATURE_ERMS,
+		ASM_OUTPUT2("=a" (ret), "=D" (addr), "=c" (len)),
+		"1" (addr), "2" (len)
+		: "%rdx", "cc");
+	return ret;
+}
+
+#define __clear_user(d, n)	___clear_user(d, n)
 #endif /* _ASM_X86_UACCESS_64_H */
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index fe59b8ac4fcc..abe1f44ea422 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -1,5 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0-only */
 #include <linux/linkage.h>
+#include <asm/asm.h>
+#include <asm/smap.h>
 #include <asm/export.h>
 
 /*
@@ -50,3 +52,101 @@ SYM_FUNC_START(clear_page_erms)
 	RET
 SYM_FUNC_END(clear_page_erms)
 EXPORT_SYMBOL_GPL(clear_page_erms)
+
+/*
+ * Default clear user-space.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rax uncopied bytes or 0 if successful.
+ */
+
+SYM_FUNC_START(clear_user_original)
+	ASM_STAC
+	movq %rcx,%rax
+	shrq $3,%rcx
+	andq $7,%rax
+	testq %rcx,%rcx
+	jz 1f
+
+	.p2align 4
+0:	movq $0,(%rdi)
+	leaq 8(%rdi),%rdi
+	decq %rcx
+	jnz 0b
+
+1:	movq %rax,%rcx
+	testq %rcx,%rcx
+	jz 3f
+
+2:	movb $0,(%rdi)
+	incq %rdi
+	decl %ecx
+	jnz 2b
+
+3:	ASM_CLAC
+	movq %rcx,%rax
+	RET
+
+	_ASM_EXTABLE_TYPE_REG(0b, 3b, EX_TYPE_UCOPY_LEN8, %rax)
+	_ASM_EXTABLE_UA(2b, 3b)
+SYM_FUNC_END(clear_user_original)
+EXPORT_SYMBOL(clear_user_original)
+
+/*
+ * Alternative clear user-space when CPU feature X86_FEATURE_REP_GOOD is
+ * present.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rax uncopied bytes or 0 if successful.
+ */
+
+SYM_FUNC_START(clear_user_rep_good)
+	ASM_STAC
+	movq %rcx,%rdx
+	xorq %rax,%rax
+	shrq $3,%rcx
+	andq $7,%rdx
+
+0:	rep stosq
+	movq %rdx,%rcx
+
+1:	rep stosb
+
+3:	ASM_CLAC
+	movq %rcx,%rax
+	RET
+
+	_ASM_EXTABLE_TYPE_REG(0b, 3b, EX_TYPE_UCOPY_LEN8, %rdx)
+	_ASM_EXTABLE_UA(1b, 3b)
+SYM_FUNC_END(clear_user_rep_good)
+EXPORT_SYMBOL(clear_user_rep_good)
+
+/*
+ * Alternative clear user-space when CPU feature X86_FEATURE_ERMS is present.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rax uncopied bytes or 0 if successful.
+ */
+
+SYM_FUNC_START(clear_user_erms)
+	xorq %rax,%rax
+	ASM_STAC
+
+0:	rep stosb
+
+3:	ASM_CLAC
+	movq %rcx,%rax
+	RET
+
+	_ASM_EXTABLE_UA(0b, 3b)
+SYM_FUNC_END(clear_user_erms)
+EXPORT_SYMBOL(clear_user_erms)
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 0402a749f3a0..3a2872c9c4a9 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -14,38 +14,6 @@
  * Zero Userspace
  */
 
-unsigned long __clear_user(void __user *addr, unsigned long size)
-{
-	long __d0;
-	might_fault();
-	/* no memory constraint because it doesn't change any memory gcc knows
-	   about */
-	stac();
-	asm volatile(
-		"	testq  %[size8],%[size8]\n"
-		"	jz     4f\n"
-		"	.align 16\n"
-		"0:	movq $0,(%[dst])\n"
-		"	addq   $8,%[dst]\n"
-		"	decl %%ecx ; jnz   0b\n"
-		"4:	movq  %[size1],%%rcx\n"
-		"	testl %%ecx,%%ecx\n"
-		"	jz     2f\n"
-		"1:	movb   $0,(%[dst])\n"
-		"	incq   %[dst]\n"
-		"	decl %%ecx ; jnz  1b\n"
-		"2:\n"
-
-		_ASM_EXTABLE_TYPE_REG(0b, 2b, EX_TYPE_UCOPY_LEN8, %[size1])
-		_ASM_EXTABLE_UA(1b, 2b)
-
-		: [size8] "=&c"(size), [dst] "=&D" (__d0)
-		: [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr));
-	clac();
-	return size;
-}
-EXPORT_SYMBOL(__clear_user);
-
 unsigned long clear_user(void __user *to, unsigned long n)
 {
 	if (access_ok(to, n))
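For anyone who wants to reproduce numbers like the dd run above without dd, a small userspace harness along these lines times reads of holes from a sparse tmpfs file, which is the path these variants sit under. The /dev/shm location and the sizes are assumptions, not taken from the thread.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
            const off_t file_size = 1L << 30;   /* 1G sparse tmpfs file: all holes */
            const size_t buf_size = 1 << 16;
            char path[] = "/dev/shm/sparse-XXXXXX";
            char *buf = malloc(buf_size);
            int fd = mkstemp(path);
            struct timespec t0, t1;
            ssize_t n;
            size_t total = 0;

            if (fd < 0 || !buf || ftruncate(fd, file_size))
                    return 1;

            clock_gettime(CLOCK_MONOTONIC, &t0);
            while ((n = read(fd, buf, buf_size)) > 0)   /* every read hits a hole */
                    total += n;
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
            printf("%zu bytes in %.3fs (%.2f GB/s)\n", total, secs, total / secs / 1e9);
            unlink(path);
            return 0;
    }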
On Sat, Apr 16, 2022 at 10:28 AM Borislav Petkov <bp@alien8.de> wrote:
>
> you also need a _fsrm() one which checks X86_FEATURE_FSRM. That one
> should simply do rep; stosb regardless of the size. For that you can
> define an alternative_call_3 similar to how the _2 variant is defined.

Honestly, my personal preference would be that with FSRM, we'd have an alternative that looks something like

	asm volatile(
		"1:"
		ALTERNATIVE("call __stosb_user", "rep stosb", X86_FEATURE_FSRM)
		"2:"
		_ASM_EXTABLE_UA(1b, 2b)
		:"=c" (count), "=D" (dest), ASM_CALL_CONSTRAINT
		:"0" (count), "1" (dest), "a" (0)
		:"memory");

iow, the 'rep stosb' case would be inline.

Note that the above would have a few things to look out for:

 - special 'stosb' calling convention:
     %rax/%rcx/%rdi as inputs
     %rcx as "bytes not copied" return value
     %rdi can be clobbered

   so the actual functions would look a bit odd and would need to save/restore some registers, but they'd basically just emulate "rep stosb".

 - since the whole point is that the "rep stosb" is inlined, it also means that the "call __stosb_user" is done within the STAC/CLAC region, so objtool would have to be taught that's ok

but wouldn't it be lovely if we could start moving towards a model where we can just inline 'memset' and 'memcpy' like this?

NOTE! The above asm has not been tested. I wrote it in this email. I'm sure I messed something up.

Linus
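Fleshed out into a callable helper, the inline-FSRM idea might look roughly like the sketch below. Assumptions: __stosb_user is the hypothetical out-of-line fallback with the special calling convention described above (it preserves everything except %rcx/%rdi), and the objtool annotation needed for a call inside the STAC/CLAC region is left out.

    #include <linux/uaccess.h>
    #include <asm/alternative.h>
    #include <asm/asm.h>
    #include <asm/cpufeatures.h>
    #include <asm/smap.h>

    /*
     * Sketch only.  __stosb_user is a hypothetical asm stub: %rdi = dest,
     * %rcx = count, %rax = 0 on entry, "bytes not cleared" returned in %rcx,
     * all other registers preserved.
     */
    static __always_inline __must_check unsigned long
    clear_user_inlined(void __user *dest, unsigned long count)
    {
            might_fault();
            stac();
            asm volatile(
                    "1:\n\t"
                    ALTERNATIVE("call __stosb_user", "rep stosb", X86_FEATURE_FSRM)
                    "\n2:\n"
                    _ASM_EXTABLE_UA(1b, 2b)
                    : "+c" (count), "+D" (dest), ASM_CALL_CONSTRAINT
                    : "a" (0)
                    : "memory");
            clac();
            return count;   /* bytes not cleared */
    }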
On Sun, Apr 17, 2022 at 12:41 PM Borislav Petkov <bp@alien8.de> wrote:
>
> Anyway, more playing with this later to make sure it really does what it
> should.

I think the special calling conventions have tripped you up:

> SYM_FUNC_START(clear_user_original)
> -	ASM_STAC
> 	movq %rcx,%rax
> 	shrq $3,%rcx
> 	andq $7,%rax
> @@ -86,7 +84,7 @@ SYM_FUNC_START(clear_user_original)
> 	decl %ecx
> 	jnz 2b
>
> -3:	ASM_CLAC
> +3:
> 	movq %rcx,%rax
> 	RET

That 'movq %rcx,%rax' can't be right. The caller expects it to be zero on input and stay zero on output.

But I think "xorl %eax,%eax" is good, since %eax was used as a temporary in that function. And the comment above the function should be fixed too.

> SYM_FUNC_START(clear_user_rep_good)
> -	ASM_STAC
> 	movq %rcx,%rdx
> -	xorq %rax,%rax
> 	shrq $3,%rcx
> 	andq $7,%rdx
>
> @@ -118,7 +113,7 @@ SYM_FUNC_START(clear_user_rep_good)
>
> 1:	rep stosb
>
> -3:	ASM_CLAC
> +3:
> 	movq %rcx,%rax
> 	RET

Same issue here. Probably nothing notices, since %rcx *does* end up containing the right value, and it's _likely_ that the compiler doesn't actually end up re-using the zero value in %rax after the inline asm (that this bug has corrupted), but if the compiler ever goes "Oh, I put zero in %rax, so I'll just use that afterwards", this is going to blow up very spectacularly and be very hard to debug.

> @@ -135,15 +130,13 @@ EXPORT_SYMBOL(clear_user_rep_good)
>  *
>  * Output:
>  * rax uncopied bytes or 0 if successful.
> + *
> + * XXX: check for small sizes and call the original version.
> + * Benchmark it first though.
>  */
> -
> SYM_FUNC_START(clear_user_erms)
> -	xorq %rax,%rax
> -	ASM_STAC
> -
> 0:	rep stosb
> -
> -3:	ASM_CLAC
> +3:
> 	movq %rcx,%rax
> 	RET

.. and one more time.

Also, I do think that the rep_good and erms cases should probably check for small copies, and use the clear_user_original thing for %rcx < 64 or something like that.

It's what we do on the memcpy side - and more importantly, it's the only difference between "erms" and FSRM. If the ERMS code doesn't do anything different for small copies, why have it at all?

But other than these small issues, it looks good to me.

Linus
On Mon, Apr 18, 2022 at 3:15 AM Borislav Petkov <bp@alien8.de> wrote:
>
> Yah, wanted to singlestep that whole asm anyway to make sure it is good.
> And just started going through it - I think it can be even optimized a
> bit to use %rax for the rest bytes and decrement it into 0 eventually.

Ugh. If you do this, you need to have a big comment about how that %rcx value gets fixed up with EX_TYPE_UCOPY_LEN (which basically ends up doing "%rcx = %rcx*8+%rax" in ex_handler_ucopy_len() for the exception case).

Plus you need to explain the xorl here:

> 3:
> 	xorl %eax,%eax
> 	RET

because with your re-written function it *looks* like %eax is already zero, so - once again - this is about how the exception cases work and get here magically.

So that needs some big comment about "if an exception happens, we jump to label '3', and the exception handler will fix up %rcx, but we'll need to clear %rax". Or something like that.

But yes, that does look like it will work, it's just really subtle how %rcx is zero for the 'rest bytes', and %rax is the byte count remaining in addition.

Linus
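In plain C terms, the fixup being referred to amounts to something like this (a simplified paraphrase of the extable handler's arithmetic, not its exact kernel signature):

    /*
     * Simplified paraphrase of what an EX_TYPE_UCOPY_LEN8 fixup does when the
     * qword loop faults: the length register still counts qwords, so it is
     * rescaled to bytes and the leftover-bytes register (here %rax) is added
     * in.  This is the "%rcx = %rcx*8 + %rax" mentioned above.
     */
    static unsigned long ucopy_len8_fixup(unsigned long rcx_qwords_left,
                                          unsigned long rax_rest_bytes)
    {
            return rcx_qwords_left * 8 + rax_rest_bytes;
    }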
On Tue, Apr 19, 2022 at 2:17 AM Borislav Petkov <bp@alien8.de> wrote:
>
> Yap, and I reused your text and expanded it. You made me look at that
> crazy DEFINE_EXTABLE_TYPE_REG macro finally so that I know what it does
> in detail.
>
> So I have the below now, it boots in the guest so it must be perfect.

This looks fine to me.

Although honestly, I'd be even happier without those fancy exception table tricks. I actually think things would be more legible if we had explicit error return points that did the

	err8:
		shrq $3,%rcx
		addq %rax,%rcx
	err1:
		xorl %eax,%eax
		RET

things explicitly.

That's perhaps especially true since this whole thing now added a new - and even more complex - error case with that _ASM_EXTABLE_TYPE_REG.

But I'm ok with the complex version too, I guess.

Linus
On Thu, Apr 21, 2022 at 9:50 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > where that "read(..) = 16" is the important part. It correctly figured > out that it can only do 16 bytes (ok, 17, but we've always allowed the > user accessor functions to block). Bad choice of words - by "block" I meant "doing the accesses in blocks", in this case 64-bit words. Obviously the user accessors _also_ "block" in the sense of having to wait for page faults and IO. I think 'copy_{to,from}_user()' actually does go to the effort to try to do byte-exact results, though. In particular, see copy_user_handle_tail in arch/x86/lib/copy_user_64.S. But I think that we long ago ended up deciding it really wasn't worth doing it, and x86 ends up just going to unnecessary lengths for this case. Linus
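Conceptually, the byte-exact tail handling mentioned here boils down to retrying the remainder one byte at a time after a wider access faults; a rough C equivalent (the real copy_user_handle_tail is asm in arch/x86/lib/copy_user_64.S) would be:

    #include <linux/uaccess.h>

    /*
     * Conceptual sketch only: after a word-sized user copy faults, retry the
     * remaining bytes one at a time so the returned "bytes not copied" count
     * is exact to the byte.
     */
    static unsigned long copy_tail_byte_exact(char *dst, const char __user *src,
                                              unsigned long remaining)
    {
            while (remaining) {
                    char c;

                    if (__get_user(c, src))   /* stop at the first faulting byte */
                            break;
                    *dst++ = c;
                    src++;
                    remaining--;
            }
            return remaining;                 /* bytes that could not be copied */
    }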
On Sun, Apr 24, 2022 at 12:37 PM Borislav Petkov <bp@alien8.de> wrote:
>
> You could give me some more details but AFAIU, you mean, that
> fallback to byte-sized reads is unnecessary and I can get rid of
> copy_user_handle_tail? Because that would be a nice cleanup.

Yeah, we already don't have that fallback in many other places.

For example: traditionally we implemented fixed-size small copy_to/from_user() with get/put_user(). We don't do that any more, but see commit 4b842e4e25b1 ("x86: get rid of small constant size cases in raw_copy_{to,from}_user()") and realize how it historically never did the byte-loop fallback.

The "clear_user()" case is another example of something that was never byte-exact.

And last time we discussed this, Al was looking at making it byte-exact, and I'm pretty sure he noted that other architectures already didn't always do it.

Let me go try to find it.

> Anyway, I ran your short prog and it all looks like you predicted it:

Well, this actually shows a bug.

> fsrm:
> ----
> read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 65536) = 17

The above is good.

> erms:
> -----
> read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 65536) = 17

This is good too: erms should do small copies with the expanded case, but since it *thinks* it's a large copy, it will just use "rep stosb" and be byte-exact.

> rep_good:
> ---------
> read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 65536) = 16

This is good: it starts off with "rep stosq", does two iterations, and then fails on the third word, so it only succeeds for 16 bytes.

> original:
> ---------
> read(3, strace: umoven: short read (17 < 33) @0x7f3ff61d5fef
> 0x7f3ff61d5fef, 65536) = 3586

This is broken. And I'm *not* talking about the strace warning. The strace warning is actually just a *result* of the kernel bug.

Look at that return value. It returns '3586'. That's just pure garbage.

So that 'original' routine simply returns the wrong value. I suspect it's a %rax vs %rcx confusion again, but with your "patch on top of earlier patch" I didn't go and sort it out.

Linus
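The "short prog" itself is not included in this excerpt; a guess at its shape, based on the read() traces above, is a read into a buffer that ends 17 bytes before an unmapped page, with /dev/zero standing in as a source that zeroes user memory through iov_iter_zero()/clear_user():

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            long page = sysconf(_SC_PAGESIZE);
            char *map = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            int fd = open("/dev/zero", O_RDONLY);

            if (map == MAP_FAILED || fd < 0)
                    return 1;

            /* Unmap the second page: the fault lands 17 bytes into the read,
             * so a byte-exact clear reports 17 and a qword-only one 16. */
            munmap(map + page, page);

            ssize_t n = read(fd, map + page - 17, 65536);
            printf("read() = %zd\n", n);
            return 0;
    }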
On Sun, Apr 24, 2022 at 12:54 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> And last time we discussed this, Al was looking at making it
> byte-exact, and I'm pretty sure he noted that other architectures
> already didn't always do it.
>
> Let me go try to find it.

Hmm. I may have mis-remembered the details. The thread I was thinking of was this:

   https://lore.kernel.org/all/20200719031733.GI2786714@ZenIV.linux.org.uk/

and while Al was arguing for not enforcing the exact byte count, he still suggested that it must make *some* progress.

But note the whole

  "There are two problems with that. First of all, the promise was
   bogus - there are architectures where it is simply not true. E.g. ppc
   (or alpha, or sparc, or...) can have copy_from_user() called with source
   one word prior to an unmapped page, fetch that word, fail to fetch the
   next one and bugger off without doing any stores"

ie it's simply never been the case in general, and Al mentions ppc, alpha and sparc as examples of architectures where it has not been true. (arm and arm64, btw, do seem to have the "try harder" byte copy loop at the end, like x86 does).

And that's when I argued that we should just accept that the byte-exact behavior simply has never been reality, and we shouldn't even try to make it be reality.

NOTE! We almost certainly do want to have some limit of how far off we can be, though. I do *not* think we can just unroll the loop a ton, and say "hey, we're doing copies in chunks of 16 words, so now we're off by up to 128 bytes".

I'd suggest making it clear that being "off" by a word is fine, because that's the natural thing for any architecture that needs to do a "load low/high word" followed by "store aligned word" due to not handling unaligned faults well (eg the whole traditional RISC thing).

And yes, I think it's actually somewhat detrimental to our test coverage that x86 does the byte-exact thing, because it means that *if* we have any code that depends on it, it will just happen to work on x86, but then fail on architectures that don't get nearly the same test coverage.

Linus
On Tue, Apr 26, 2022 at 5:14 PM Borislav Petkov <bp@alien8.de> wrote:
>
> So when we enter the function, we shift %rcx to get the number of
> qword-sized quantities to zero:
>
> SYM_FUNC_START(clear_user_original)
> 	mov %rcx,%rax
> 	shr $3,%rcx # qwords  <---

Yes. But that's what we do for "rep stosq" too, for all the same reasons.

> but when we encounter the fault here, we return *%rcx* - not %rcx << 3
> - latter being the *bytes* leftover which we *actually* need to return
> when we encounter the #PF.

Yes, but:

> So, we need to shift back when we fault during the qword-sized zeroing,
> i.e., full function below, see label 3 there.

No. The problem is that you're using the wrong exception type.

Thanks for posting the whole thing, because that makes it much more obvious.

You have the exception table entries switched. You should have

	_ASM_EXTABLE_TYPE_REG(0b, 3b, EX_TYPE_UCOPY_LEN8, %rax)
	_ASM_EXTABLE_UA(2b, 3b)

and not need that label '4' at all.

Note how that "_ASM_EXTABLE_TYPE_REG" thing is literally designed to do

	%rcx = 8*%rcx+%rax

in the exception handler.

Of course, to get that regular _ASM_EXTABLE_UA(2b, 3b) to work, you need to have the final byte count in %rcx, not in %rax, so that means that the "now do the rest of the bytes" case should have done something like

	movl %eax,%ecx
2:	movb $0,(%rdi)
	inc %rdi
	decl %ecx
	jnz 2b

instead.

Yeah, yeah, you could also use that _ASM_EXTABLE_TYPE_REG thing for the second exception point, and keep %rcx as zero, and keep it in %eax, and depend on that whole "%rcx = 8*%rcx+%rax" fixing it up, and knowing that if an exception does *not* happen, %rcx will be zero from the word-size loop.

But that really seems much too subtle for me - why not just keep things in %rcx, and try to make this look as much as possible like the "rep stosq + rep stosb" case?

And finally: I still think that those fancy exception table things are *much* too fancy, and much too subtle, and much too complicated. So I'd actually prefer to get rid of them entirely, and make the code use the "no register changes" exception, and make the exception handler do a proper site-specific fixup. At that point, you can get rid of all the "mask bits early" logic, get rid of all the extraneous 'test' instructions, and make it all look something like below.

NOTE! I've intentionally kept the %eax thing using 32-bit instructions - smaller encoding, and only the low three bits matter, so why move/mask full 64 bits?

NOTE2! Entirely untested. But I tried to make the normal code do minimal work, and then fix things up in the exception case more. So it just keeps the original count in the 32 bits in %eax until it wants to test it, and then uses the 'andl' to both mask and test. And the exception case knows that, so it masks there too.

I dunno. But I really think that whole new _ASM_EXTABLE_TYPE_REG and EX_TYPE_UCOPY_LEN8 was unnecessary.

Linus

SYM_FUNC_START(clear_user_original)
	movl %ecx,%eax
	shrq $3,%rcx # qwords
	jz .Lrest_bytes

	# do the qwords first
	.p2align 4
.Lqwords:
	movq $0,(%rdi)
	lea 8(%rdi),%rdi
	dec %rcx
	jnz .Lqwords

.Lrest_bytes:
	andl $7,%eax # rest bytes
	jz .Lexit

	# now do the rest bytes
.Lbytes:
	movb $0,(%rdi)
	inc %rdi
	decl %eax
	jnz .Lbytes

.Lexit:
	xorl %eax,%eax
	RET

.Lqwords_exception:
	# convert qwords back into bytes to return to caller
	shlq $3, %rcx
	andl $7, %eax
	addq %rax,%rcx
	jmp .Lexit

.Lbytes_exception:
	movl %eax,%ecx
	jmp .Lexit

	_ASM_EXTABLE_UA(.Lqwords, .Lqwords_exception)
	_ASM_EXTABLE_UA(.Lbytes, .Lbytes_exception)
On Wed, Apr 27, 2022 at 3:41 AM Borislav Petkov <bp@alien8.de> wrote: > > Any particular reason for the explicit "q" suffix? The register already > denotes the opsize and the generated opcode is the same. No, I just added the 'q' prefix to distinguish it from the 'l' instruction working on the same register one line earlier. And then I wasn't consistent throughout, because it was really just about me thinking about that "here we can use 32-bit ops, and here we are working with the full possible 64-bit range". So take that code as what it is - an untested and slightly edited version of the code you sent me, meant for "how about like this" purposes rather than anything else. Linus
On Wed, May 4, 2022 at 11:56 AM Borislav Petkov <bp@alien8.de> wrote: > > Just to update folks here: I haven't forgotten about this - Mel and I > are running some benchmarks first and staring at results to see whether > all the hoopla is even worth it. Side note: the "do FSRM inline" would likely be a really good thing for "copy_to_user()", more so than the silly "clear_user()" that we realistically do almost nowhere. I doubt you can find "clear_user()" outside of benchmarks (but hey, people do odd things). But "copy_to_user()" is everywhere, and the I$ advantage of inlining it might be noticeable on some real loads. I remember some git profiles having copy_to_user very high due to fstat(), for example - cp_new_stat64 and friends. Of course, I haven't profiled git in ages, but I doubt that has changed. Many of those kinds of loads are all about name lookup and stat (basic things like "make" would be that too, if it weren't for the fact that it spends a _lot_ of its time in user space string handling). The inlining advantage would obviously only show up on CPUs that actually do FSRM. Which I think is currently only Ice Lake. I don't have access to one. Linus
On Wed, May 4, 2022 at 1:18 PM Borislav Petkov <bp@alien8.de> wrote: > > Zen3 has FSRM. Sadly, I'm on Zen2 with my 3970X, and the Zen 3 threadrippers seem to be basically impossible to get. > So below's the git test suite with clear_user on Zen3. It creates a lot > of processes so we get to clear_user a bunch and that's the inlined rep > movsb. Oh, the clear_user() in the ELF loader? I wouldn't have expected that to be noticeable. Now, clear_page() is *very* noticeable, but that has its own special magic and doesn't use clear_user(). Maybe there is some other clear_user() case I never thought of. My dim memories of profiles definitely had copy_to_user, clear_page and copy_page being some of the top copy loops. Linus
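For reference, the ELF-loader use of clear_user() being referred to is the BSS tail padding in fs/binfmt_elf.c; in outline (a paraphrase, not the exact kernel code) it does something like:

    #include <linux/errno.h>
    #include <linux/uaccess.h>

    /*
     * Paraphrase of the fs/binfmt_elf.c padzero() helper: when the BSS does
     * not end on a page boundary, the tail of the last mapped page is cleared
     * with clear_user().  ELF_MIN_ALIGN is the loader's ELF page size.
     */
    static int pad_bss_tail(unsigned long elf_bss)
    {
            unsigned long nbyte = elf_bss & (ELF_MIN_ALIGN - 1);

            if (nbyte) {
                    nbyte = ELF_MIN_ALIGN - nbyte;
                    if (clear_user((void __user *)elf_bss, nbyte))
                            return -EFAULT;
            }
            return 0;
    }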
On Wed, May 4, 2022 at 2:01 PM Borislav Petkov <bp@alien8.de> wrote:
>
> I could try to do a perf probe or whatever fancy new thing we do now on
> clear_user to get some numbers of how many times it gets called during
> the benchmark run. Or do you wanna know the callers too?

One of the non-performance reasons I like inlined memcpy is actually that when you do a regular 'perf record' run, the cost of the memcpy gets associated with the call-site.

Which is universally what I want for those things. I used to love our inlined spinlocks for the same reason back when we did them.

Yeah, yeah, you can do it with callchain magic, but then you get it all - and I really consider memcpy/memset to be a special case. Normally I want the "oh, that leaf function is expensive", but not for memcpy and memset (and not for spinlocks, but we'll never go back to the old trivial spinlocks).

I don't tend to particularly care about "how many times has this been called" kind of trace profiles. It's the actual expense in CPU cycles I tend to care about.

That said, I cared deeply about those kinds of CPU profiles when I was working with Al on the RCU path lookup code and looking for where the problem spots were. That was years ago.

I haven't really done serious profiling work for a while (which is just as well, because it's one of the things that went backwards when I switched to the Zen 2 threadripper for my main machine).

Linus
On Tue, May 10, 2022 at 2:31 AM Borislav Petkov <bp@alien8.de> wrote:
>
> > I haven't really done serious profiling work for a while (which is
> > just as well, because it's one of the things that went backwards when
> > I switched to the Zen 2 threadripper for my main machine)
>
> Because of the not as advanced perf support there? Any pain points I can
> forward?

It's not anything fancy, and it's not anything new - you've been cc'd on me talking about it before.

As mentioned, I don't actually do anything fancy with profiling - I basically almost always just want to do a simple

    perf record -e cycles:pp

so that I get reasonable instruction attribution for what the expensive part actually is (where "actually is" is obviously always just an approximation, I'm not claiming anything else - but I just don't want to have to try to figure out some huge instruction skid issue).

And then (because I only tend to care about the kernel, and don't care about _who_ is doing things), I just do

    perf report --sort=dso,symbol

and start looking at the kernel side of things.

I then occasionally enable -g, but I hate doing it, and it's usually because I see "oh damn, some spinlock slowpath, let's see what the callers are" just to figure out which spinlock it is.

VERY rudimentary, in other words. It's the "I don't know where the time is going, so let's find out".

And that simple thing doesn't work, because Zen 2 doesn't like the per-thread profiling. So I can be root and use '-a', and it works fine, except I don't want to do things as root just for profiling. Plus I don't actually want to see the ACPI "CPU idle" things.

I'm told the issue is that Zen 2 is not stable with IBS (aka "PEBS for AMD"), which is all kinds of sad, but there it is.

As also mentioned, it's not actually a huge deal for me, because all I do is read email and do "git pull". And the times when I used profiling to find things git could need improvement on (usually pathname lookup for "git diff") are long long gone.

Linus
On Tue, May 10, 2022 at 2:31 AM Borislav Petkov <bp@alien8.de> wrote:
>
> clear_user_original:
> Amean: 9219.71 (Sum: 6340154910, samples: 687674)
>
> fsrm:
> Amean: 8030.63 (Sum: 5522277720, samples: 687652)

Well, that's pretty conclusive. I'm obviously very happy with fsrm. I've been pushing for that thing for probably over two decades by now, because I absolutely detest uarch optimizations for memset/memcpy that can never be done well in software anyway (because it depends not just on cache organization, but on cache sizes and dynamic cache hit/miss behavior of the load).

And one of the things I always wanted to do was to just have memcpy/memset entirely inlined.

In fact, if you go back to the 0.01 linux kernel sources, you'll see that they only compile with my bastardized version of gcc-1.40, because I made the compiler inline those things with 'rep movs/stos', and there was no other implementation of memcpy/memset at all.

That was a bit optimistic at the time, but here we are, 30+ years later and it is finally looking possible, at least on some uarchs.

Linus
--- a/mm/filemap.c~tmpfs-fix-regressions-from-wider-use-of-zero_page
+++ a/mm/filemap.c
@@ -1063,12 +1063,6 @@ void __init pagecache_init(void)
 		init_waitqueue_head(&folio_wait_table[i]);
 
 	page_writeback_init();
-
-	/*
-	 * tmpfs uses the ZERO_PAGE for reading holes: it is up-to-date,
-	 * and splice's page_cache_pipe_buf_confirm() needs to see that.
-	 */
-	SetPageUptodate(ZERO_PAGE(0));
 }
 
 /*
--- a/mm/shmem.c~tmpfs-fix-regressions-from-wider-use-of-zero_page
+++ a/mm/shmem.c
@@ -2513,7 +2513,6 @@ static ssize_t shmem_file_read_iter(stru
 		pgoff_t end_index;
 		unsigned long nr, ret;
 		loff_t i_size = i_size_read(inode);
-		bool got_page;
 
 		end_index = i_size >> PAGE_SHIFT;
 		if (index > end_index)
@@ -2570,24 +2569,34 @@ static ssize_t shmem_file_read_iter(stru
 			 */
 			if (!offset)
 				mark_page_accessed(page);
-			got_page = true;
+			/*
+			 * Ok, we have the page, and it's up-to-date, so
+			 * now we can copy it to user space...
+			 */
+			ret = copy_page_to_iter(page, offset, nr, to);
+			put_page(page);
+
+		} else if (iter_is_iovec(to)) {
+			/*
+			 * Copy to user tends to be so well optimized, but
+			 * clear_user() not so much, that it is noticeably
+			 * faster to copy the zero page instead of clearing.
+			 */
+			ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to);
 		} else {
-			page = ZERO_PAGE(0);
-			got_page = false;
+			/*
+			 * But submitting the same page twice in a row to
+			 * splice() - or others? - can result in confusion:
+			 * so don't attempt that optimization on pipes etc.
+			 */
+			ret = iov_iter_zero(nr, to);
 		}
 
-		/*
-		 * Ok, we have the page, and it's up-to-date, so
-		 * now we can copy it to user space...
-		 */
-		ret = copy_page_to_iter(page, offset, nr, to);
 		retval += ret;
 		offset += ret;
 		index += offset >> PAGE_SHIFT;
 		offset &= ~PAGE_MASK;
 
-		if (got_page)
-			put_page(page);
 		if (!iov_iter_count(to))
 			break;
 		if (ret < nr) {
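The splice() case the last comment above guards against can be exercised from userspace with something along these lines (the /dev/shm path and sizes are illustrative assumptions):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
            char path[] = "/dev/shm/hole-XXXXXX";
            int fd = mkstemp(path);
            int pipefd[2];
            char buf[4096];
            loff_t off = 0;

            if (fd < 0 || ftruncate(fd, 2 * 4096) || pipe(pipefd))
                    return 1;

            /* Two back-to-back splices of hole-backed pages into the same pipe. */
            for (int i = 0; i < 2; i++) {
                    ssize_t n = splice(fd, &off, pipefd[1], NULL, 4096, 0);
                    printf("splice %d -> %zd\n", i, n);
            }
            /* Drain the pipe; every byte read should be zero. */
            printf("read from pipe -> %zd\n", read(pipefd[0], buf, sizeof(buf)));
            unlink(path);
            return 0;
    }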