Message ID | YozQZMyQ0NDdD8cH@zn.tnic (mailing list archive)
---|---
State | New
Series | x86/clear_user: Make it faster
On Tue, May 24, 2022 at 5:32 AM Borislav Petkov <bp@alien8.de> wrote:
>
> finally a somewhat final version, lightly tested.

I can't find anything wrong with this, but who knows what patch-blindness I have from looking at a few different versions of it. Maybe my eyes just skim over it now.

I do note that the clearing of %rax here:

> +.Lerms_exit:
> +	xorl %eax,%eax
> +	RET

seems to be unnecessary, since %rax is never modified in the path leading to this. But maybe just as well just for consistency with the cases where it *is* used as a temporary.

And I still suspect that "copy_to_user()" is *much* more interesting than "clear_user()", but I guess we can't inline it anyway due to all the other overhead (ie access_ok() and stac/clac).

And for a plain "call memcpy/memset", we'd need compiler help to do this (at a minimum, we'd have to have the compiler use the 'rep movs/stos' register logic, and then we could patch things in place afterwards, with objtool creating the alternatives section or something).

Linus
On Tue, May 24, 2022 at 09:51:56AM -0700, Linus Torvalds wrote:
> I can't find anything wrong with this, but who knows what
> patch-blindness I have from looking at a few different versions of it.
> Maybe my eyes just skim over it now.

Same here - I can't look at that code anymore. I'll try to gain some distance and look at it again later, and do some more extensive testing too.

> I do note that the clearing of %rax here:
>
> > +.Lerms_exit:
> > +	xorl %eax,%eax
> > +	RET
>
> seems to be unnecessary, since %rax is never modified in the path
> leading to this. But maybe just as well just for consistency with the
> cases where it *is* used as a temporary.

Yeah.

> And I still suspect that "copy_to_user()" is *much* more interesting
> than "clear_user()", but I guess we can't inline it anyway due to all
> the other overhead (ie access_ok() and stac/clac).
>
> And for a plain "call memcpy/memset", we'd need compiler help to do
> this (at a minimum, we'd have to have the compiler use the 'rep
> movs/stos' register logic, and then we could patch things in place
> afterwards, with objtool creating the alternatives section or
> something).

Yeah, I have this on my todo to research them properly. Will report when I have something.
On Tue, 24 May 2022 at 13:32, Borislav Petkov <bp@alien8.de> wrote:
>
> Ok,
>
> finally a somewhat final version, lightly tested.
>
> I still need to run it on production Icelake and that is kinda being
> delayed due to server room cooling issues (don't ask ;-\).
>
> ---
> From: Borislav Petkov <bp@suse.de>
>
> Based on a patch by Mark Hemment <markhemm@googlemail.com> and
> incorporating very sane suggestions from Linus.
>
> The point here is to have the default case with FSRM - which is supposed
> to be the majority of x86 hw out there - if not now then soon - be
> directly inlined into the instruction stream so that no function call
> overhead is taking place.
>
> The benchmarks I ran would show very small improvements and a PF
> benchmark would even show weird things like slowdowns with higher core
> counts.
>
> So for a ~6m running the git test suite, the function gets called under
> 700K times, all from padzero():
>
>   <...>-2536  [006] .....   261.208801: padzero: to: 0x55b0663ed214, size: 3564, cycles: 21900
>   <...>-2536  [006] .....   261.208819: padzero: to: 0x7f061adca078, size: 3976, cycles: 17160
>   <...>-2537  [008] .....   261.211027: padzero: to: 0x5572d019e240, size: 3520, cycles: 23850
>   <...>-2537  [008] .....   261.211049: padzero: to: 0x7f1288dc9078, size: 3976, cycles: 15900
>   ...
>
> which is around 1%-ish of the total time and which is consistent with
> the benchmark numbers.
>
> So Mel gave me the idea to simply measure how fast the function becomes.
> I.e.:
>
>   start = rdtsc_ordered();
>   ret = __clear_user(to, n);
>   end = rdtsc_ordered();
>
> Computing the mean average of all the samples collected during the test
> suite run then shows some improvement:
>
>   clear_user_original:
>   Amean: 9219.71 (Sum: 6340154910, samples: 687674)
>
>   fsrm:
>   Amean: 8030.63 (Sum: 5522277720, samples: 687652)
>
> That's on Zen3.
> Signed-off-by: Borislav Petkov <bp@suse.de>
> Link: https://lore.kernel.org/r/CAHk-=wh=Mu_EYhtOmPn6AxoQZyEh-4fo2Zx3G7rBv1g7vwoKiw@mail.gmail.com
> ---
>  arch/x86/include/asm/uaccess.h    |   5 +-
>  arch/x86/include/asm/uaccess_64.h |  45 ++++++++++
>  arch/x86/lib/clear_page_64.S      | 142 ++++++++++++++++++++++++++++++
>  arch/x86/lib/usercopy_64.c        |  40 ---------
>  tools/objtool/check.c             |   3 +
>  5 files changed, 192 insertions(+), 43 deletions(-)

[...snip...]

> diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
> index fe59b8ac4fcc..6dd6c7fd1eb8 100644
...
> +/*
> + * Alternative clear user-space when CPU feature X86_FEATURE_REP_GOOD is
> + * present.
> + * Input:
> + * rdi destination
> + * rcx count
> + *
> + * Output:
> + * rcx: uncleared bytes or 0 if successful.
> + */
> +SYM_FUNC_START(clear_user_rep_good)
> +	# call the original thing for less than a cacheline
> +	cmp $64, %rcx
> +	ja .Lprep
> +	call clear_user_original
> +	RET
> +
> +.Lprep:
> +	# copy lower 32-bits for rest bytes
> +	mov %ecx, %edx
> +	shr $3, %rcx
> +	jz .Lrep_good_rest_bytes

A slight doubt here; the comment says "less than a cacheline", but the code is using 'ja' (jump if above) - so it calls 'clear_user_original' for a 'len' less than or equal to 64.

Not sure of the intended behaviour for 64 bytes here, but 'copy_user_enhanced_fast_string' uses the slow method for lengths less than 64. So, should this be coded as;

	cmp $64,%rcx
	jb clear_user_original

? 'clear_user_erms' has similar logic which might also need reworking.
> +
> +.Lrep_good_qwords:
> +	rep stosq
> +
> +.Lrep_good_rest_bytes:
> +	and $7, %edx
> +	jz .Lrep_good_exit
> +
> +.Lrep_good_bytes:
> +	mov %edx, %ecx
> +	rep stosb
> +
> +.Lrep_good_exit:
> +	# see .Lexit comment above
> +	xor %eax, %eax
> +	RET
> +
> +.Lrep_good_qwords_exception:
> +	# convert remaining qwords back into bytes to return to caller
> +	shl $3, %rcx
> +	and $7, %edx
> +	add %rdx, %rcx
> +	jmp .Lrep_good_exit
> +
> +	_ASM_EXTABLE_UA(.Lrep_good_qwords, .Lrep_good_qwords_exception)
> +	_ASM_EXTABLE_UA(.Lrep_good_bytes, .Lrep_good_exit)
> +SYM_FUNC_END(clear_user_rep_good)
> +EXPORT_SYMBOL(clear_user_rep_good)
> +
> +/*
> + * Alternative clear user-space when CPU feature X86_FEATURE_ERMS is present.
> + * Input:
> + * rdi destination
> + * rcx count
> + *
> + * Output:
> + * rcx: uncleared bytes or 0 if successful.
> + *
> + */
> +SYM_FUNC_START(clear_user_erms)
> +	# call the original thing for less than a cacheline
> +	cmp $64, %rcx
> +	ja .Lerms_bytes
> +	call clear_user_original
> +	RET
> +
> +.Lerms_bytes:
> +	rep stosb
> +
> +.Lerms_exit:
> +	xorl %eax,%eax
> +	RET
> +
> +	_ASM_EXTABLE_UA(.Lerms_bytes, .Lerms_exit)
> +SYM_FUNC_END(clear_user_erms)
> +EXPORT_SYMBOL(clear_user_erms)
> diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
> index 0ae6cf804197..6c1f8ac5e721 100644
> --- a/arch/x86/lib/usercopy_64.c
> +++ b/arch/x86/lib/usercopy_64.c
> @@ -14,46 +14,6 @@
>   * Zero Userspace
>   */
>
> -unsigned long __clear_user(void __user *addr, unsigned long size)
> -{
> -	long __d0;
> -	might_fault();
> -	/* no memory constraint because it doesn't change any memory gcc knows
> -	   about */
> -	stac();
> -	asm volatile(
> -		"	testq  %[size8],%[size8]\n"
> -		"	jz     4f\n"
> -		"	.align 16\n"
> -		"0:	movq $0,(%[dst])\n"
> -		"	addq   $8,%[dst]\n"
> -		"	decl %%ecx ; jnz   0b\n"
> -		"4:	movq  %[size1],%%rcx\n"
> -		"	testl %%ecx,%%ecx\n"
> -		"	jz     2f\n"
> -		"1:	movb   $0,(%[dst])\n"
> -		"	incq   %[dst]\n"
> -		"	decl %%ecx ; jnz  1b\n"
> -		"2:\n"
> -
> -		_ASM_EXTABLE_TYPE_REG(0b, 2b, EX_TYPE_UCOPY_LEN8, %[size1])
> -		_ASM_EXTABLE_UA(1b, 2b)
> -
> -		: [size8] "=&c"(size), [dst] "=&D" (__d0)
> -		: [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr));
> -	clac();
> -	return size;
> -}
> -EXPORT_SYMBOL(__clear_user);
> -
> -unsigned long clear_user(void __user *to, unsigned long n)
> -{
> -	if (access_ok(to, n))
> -		return __clear_user(to, n);
> -	return n;
> -}
> -EXPORT_SYMBOL(clear_user);
> -
>  #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
>  /**
>   * clean_cache_range - write back a cache range with CLWB
> diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> index ca5b74603008..e460aa004b88 100644
> --- a/tools/objtool/check.c
> +++ b/tools/objtool/check.c
> @@ -1020,6 +1020,9 @@ static const char *uaccess_safe_builtin[] = {
>  	"copy_mc_fragile_handle_tail",
>  	"copy_mc_enhanced_fast_string",
>  	"ftrace_likely_update", /* CONFIG_TRACE_BRANCH_PROFILING */
> +	"clear_user_erms",
> +	"clear_user_rep_good",
> +	"clear_user_original",
>  	NULL
>  };
>
> --
> 2.35.1
>
> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

Cheers,
    Mark
* Borislav Petkov <bp@alien8.de> wrote:

> Ok,
>
> finally a somewhat final version, lightly tested.
>
> I still need to run it on production Icelake and that is kinda being
> delayed due to server room cooling issues (don't ask ;-\).
>
> So Mel gave me the idea to simply measure how fast the function becomes.
> I.e.:
>
>   start = rdtsc_ordered();
>   ret = __clear_user(to, n);
>   end = rdtsc_ordered();
>
> Computing the mean average of all the samples collected during the test
> suite run then shows some improvement:
>
>   clear_user_original:
>   Amean: 9219.71 (Sum: 6340154910, samples: 687674)
>
>   fsrm:
>   Amean: 8030.63 (Sum: 5522277720, samples: 687652)
>
> That's on Zen3.

As a side note, there's some rudimentary perf tooling that allows the user-space testing of kernel-space x86 memcpy and memset implementations:

  $ perf bench mem memcpy
  # Running 'mem/memcpy' benchmark:
  # function 'default' (Default memcpy() provided by glibc)
  # Copying 1MB bytes ...

        42.459239 GB/sec

  # function 'x86-64-unrolled' (unrolled memcpy() in arch/x86/lib/memcpy_64.S)
  # Copying 1MB bytes ...

        23.818598 GB/sec

  # function 'x86-64-movsq' (movsq-based memcpy() in arch/x86/lib/memcpy_64.S)
  # Copying 1MB bytes ...

        10.172526 GB/sec

  # function 'x86-64-movsb' (movsb-based memcpy() in arch/x86/lib/memcpy_64.S)
  # Copying 1MB bytes ...

        10.614810 GB/sec

Note how the actual implementation in arch/x86/lib/memcpy_64.S was used to build a user-space test into 'perf bench'.

For copy_user() & clear_user() some additional wrappery would be needed I guess, to wrap away stac()/clac()/might_sleep(), etc. ...

[ Plus it could all be improved to measure cache hot & cache cold performance, to use different sizes, etc. ]

Even with the limitation that it's not 100% equivalent to the kernel-space thing, especially for very short buffers, having the whole perf side benchmarking, profiling & statistics machinery available is a plus I think.

Thanks,

	Ingo
On Wed, May 25, 2022 at 01:11:17PM +0100, Mark Hemment wrote:
> A slight doubt here; the comment says "less than a cacheline", but the code
> is using 'ja' (jump if above) - so it calls 'clear_user_original' for a
> 'len' less than or equal to 64.
> Not sure of the intended behaviour for 64 bytes here, but
> 'copy_user_enhanced_fast_string' uses the slow method for lengths less
> than 64. So, should this be coded as;
>	cmp $64,%rcx
>	jb clear_user_original
> ?

Yeah, it probably doesn't matter whether you clear a cacheline the "old" way or with some of the new ones. clear_user() performance matters only in microbenchmarks, as I've come to realize.

But your suggestion simplifies the code so lemme do that.

Thx!
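The `ja` vs `jb` question above boils down to which path the exactly-64-byte case takes. A minimal C model of the two comparisons (hypothetical names, not kernel code) makes the difference concrete:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C model of the length dispatch in clear_user_erms /
 * clear_user_rep_good.  With `cmp $64, %rcx; ja`, the fast-string path
 * is taken only for len > 64, so len == 64 still goes to
 * clear_user_original.  Mark's suggested `jb` sends len == 64 to the
 * fast-string path instead. */
enum path { PATH_ORIGINAL, PATH_REP };

static enum path pick_path_ja(size_t len)	/* current code: ja == "above" */
{
	return (len > 64) ? PATH_REP : PATH_ORIGINAL;
}

static enum path pick_path_jb(size_t len)	/* suggestion: jb == "below" */
{
	return (len < 64) ? PATH_ORIGINAL : PATH_REP;
}
```

The two only disagree at len == 64; as Borislav notes above, either choice is fine for performance, but `jb` lets the branch jump straight to `clear_user_original` and drop the extra `RET`.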
On Tue, May 24, 2022 at 02:32:36PM +0200, Borislav Petkov wrote:
> I still need to run it on production Icelake and that is kinda being
> delayed due to server room cooling issues (don't ask ;-\).

So I finally got a production level ICL-X:

[    0.220822] smpboot: CPU0: Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz (family: 0x6, model: 0x6a, stepping: 0x6)

and frankly, this looks really weird:

clear_user_original:
Amean: 19679.4 (Sum: 13652560764, samples: 693750)
Amean: 19743.7 (Sum: 13693470604, samples: 693562)

(I ran it twice just to be sure.)

ERMS:
Amean: 20374.3 (Sum: 13910601024, samples: 682752)
Amean: 20453.7 (Sum: 14186223606, samples: 693576)

FSRM:
Amean: 20458.2 (Sum: 13918381386, samples: 680331)

so either that particular box is weird or Icelake really is slower wrt FSRM or I've fat-fingered it somewhere.

But my measuring is as trivial as:

static __always_inline unsigned long clear_user(void __user *to, unsigned long n)
{
	if (access_ok(to, n)) {
		unsigned long start, end, ret;

		start = rdtsc_ordered();
		ret = __clear_user(to, n);
		end = rdtsc_ordered();

		trace_printk("to: 0x%lx, size: %ld, cycles: %lu\n",
			     (unsigned long)to, n, end - start);

		return ret;
	}
	return n;
}

so if anything I don't see what the problem could be...

Hmm.
On Wed, Jun 22, 2022 at 9:21 AM Borislav Petkov <bp@alien8.de> wrote:
>
> and frankly, this looks really weird:

I'm not sure how valid the TSC thing is, with the extra synchronization maybe interacting with the whole microcode engine startup/stop thing.

I'm also not sure the rdtsc is doing the same thing on your AMD tests vs your Intel tests - I suspect you end up both using 'rdtscp' (as opposed to the 'lsync' variant we also have), but I don't think the ordering really is all that well defined architecturally, so AMD may have very different serialization rules than Intel does.

.. and that serialization may well be different wrt normal load/stores and microcode.

So those numbers look like they have a 3% difference, but I'm not 100% convinced it might not be due to measuring artifacts. The fact that it worked well for you on your AMD platform doesn't necessarily mean that it has to work on icelake-x.

But it could equally easily be that "rep stosb" really just isn't any better on that platform, and the numbers are just giving the plain reality.

Or it could mean that it makes some cache access decision ("this is big enough that let's not pollute L1 caches, do stores directly to L2") that might be better for actual performance afterwards, but that makes that clearing itself that bit slower.

IOW, I do think that microbenchmarks are kind of suspect to begin with, and the rdtsc thing in particular may work better on some microarchitectures than it does others.

Very hard to make a judgment call - I think the only thing that really ends up mattering is the macro-benchmarks, but I think when you tried that it was way too noisy to actually show any real signal.

That is, of course, a problem with memcpy and memset in general. It's easy to do microbenchmarks for them, it's just not clear whether said microbenchmarks give numbers that are actually meaningful, exactly because of things like cache replacement policy etc.

And finally, I will repeat that this particular code probably just isn't that important. The memory clearing for page allocation and regular memcpy is where most of the real time is spent, so I don't think that you should necessarily worry too much about this special case.

Linus
On Wed, Jun 22, 2022 at 10:06:42AM -0500, Linus Torvalds wrote:
> I'm not sure how valid the TSC thing is, with the extra
> synchronization maybe interacting with the whole microcode engine
> startup/stop thing.

Very possible.

So I went and did the original microbenchmark which started people looking into this in the first place and with it, it looks very good:

before:

$ dd if=/dev/zero of=/dev/null bs=1024k status=progress
400823418880 bytes (401 GB, 373 GiB) copied, 17 s, 23.6 GB/s

after:

$ dd if=/dev/zero of=/dev/null bs=1024k status=progress
2696274771968 bytes (2.7 TB, 2.5 TiB) copied, 50 s, 53.9 GB/s

So that's very persuasive in my book.

> I'm also not sure the rdtsc is doing the same thing on your AMD tests
> vs your Intel tests - I suspect you end up both using 'rdtscp' (as

That is correct.

> opposed to the 'lsync' variant we also have), but I don't think the
> ordering really is all that well defined architecturally, so AMD may
> have very different serialization rules than Intel does.
>
> .. and that serialization may well be different wrt normal load/stores
> and microcode.

Well, if it is that, hw people have always been telling me to use RDTSC to measure stuff but I will object next time.

> So those numbers look like they have a 3% difference, but I'm not 100%
> convinced it might not be due to measuring artifacts. The fact that it
> worked well for you on your AMD platform doesn't necessarily mean that
> it has to work on icelake-x.

Well, it certainly is something uarch-specific because that machine had an eval Icelake sample before and it would show the same minute slowdown too. I attributed it to the CPU being an eval sample but I guess uarch-wise it didn't matter.

> But it could equally easily be that "rep stosb" really just isn't any
> better on that platform, and the numbers are just giving the plain
> reality.

Right.

> Or it could mean that it makes some cache access decision ("this is
> big enough that let's not pollute L1 caches, do stores directly to
> L2") that might be better for actual performance afterwards, but that
> makes that clearing itself that bit slower.

There's that too.

> IOW, I do think that microbenchmarks are kind of suspect to begin
> with, and the rdtsc thing in particular may work better on some
> microarchitectures than it does others.
>
> Very hard to make a judgment call - I think the only thing that really
> ends up mattering is the macro-benchmarks, but I think when you tried
> that it was way too noisy to actually show any real signal.

Yap, that was the reason why I went down to the microbenchmarks. But even with the real benchmark, it would show slightly bad numbers on ICL which got me scratching my head as to why that is...

> That is, of course, a problem with memcpy and memset in general. It's
> easy to do microbenchmarks for them, it's just not clear whether said
> microbenchmarks give numbers that are actually meaningful, exactly
> because of things like cache replacement policy etc.
>
> And finally, I will repeat that this particular code probably just
> isn't that important. The memory clearing for page allocation and
> regular memcpy is where most of the real time is spent, so I don't
> think that you should necessarily worry too much about this special
> case.

Yeah, I started poking at this because people would come with patches or complain about stuff being slow but then no one would actually sit down and do the measurements...

Oh well, anyway, I still think we should take that because that dd thing above is pretty good-lookin' even on ICL. And we now have a good example of how all this patching thing should work - have the insns patched in and only replace with calls to other variants on the minority of machines.

And the ICL slowdown is small enough and kinda hard to measure...

Thoughts?
On Wed, Jun 22, 2022 at 3:14 PM Borislav Petkov <bp@alien8.de> wrote:
>
> before:
>
> $ dd if=/dev/zero of=/dev/null bs=1024k status=progress
> 400823418880 bytes (401 GB, 373 GiB) copied, 17 s, 23.6 GB/s
>
> after:
>
> $ dd if=/dev/zero of=/dev/null bs=1024k status=progress
> 2696274771968 bytes (2.7 TB, 2.5 TiB) copied, 50 s, 53.9 GB/s
>
> So that's very persuasive in my book.

Heh. Your numbers are very confusing, because apparently you just ^C'd the thing randomly and they do different sizes (and the GB/s number is what matters).

Might I suggest just using "count=XYZ" to make the sizes the same and the numbers a bit more comparable? Because when I first looked at the numbers I was like "oh, the first one finished in 17s, the second one was three times slower!"

But yes, apparently that "rep stos" is *much* better with that /dev/zero test.

That does imply that what it does is to avoid polluting some cache hierarchy, since your 'dd' test case doesn't actually ever *use* the end result of the zeroing.

So yeah, memset and memcpy are just fundamentally hard to benchmark, because what matters more than the cost of the op itself is often how the end result interacts with the code around it.

For example, one of the things that I hope FSRM really does well is when small copies (or memsets) are then used immediately afterwards - does the just stored data by the microcode get nicely forwarded from the store buffers (like it would if it was a loop of stores) or does it mean that the store buffer is bypassed and subsequent loads will then hit the L1 cache?

That is *not* an issue in this situation, since any clear_user() won't be immediately loaded just a few instructions later, but it's traditionally an issue for the "small memset/memcpy" case, where the memset/memcpy destination is possibly accessed immediately afterwards (either to make further modifications, or to just be read).

In a perfect world, you get all the memory forwarding logic kicking in, which can really shortcircuit things on an OoO core and take the memory pipeline out of the critical path, which then helps IPC.

And that's an area that legacy microcoded 'rep stosb' has not been good at. Whether FSRM is quite there yet, I don't know.

(Somebody could test: do a 'store register to memory', then do a 'memcpy()' of that memory to another memory area, and then do a register load from that new area - at least in _theory_ a very aggressive microarchitecture could actually do that whole forwarding, and make the latency from the original memory store to the final memory load be zero cycles.

I know AMD was supposedly doing that for some of the simpler cases, and it *does* actually matter for real world loads, because that memory indirection is often due to passing data in structures as function arguments. So it sounds stupid to store to memory and then immediately load it again, but it actually happens _all_the_time_ even for smart software).

Linus
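The store -> memcpy -> load pattern described above can be written down as a tiny C function. This sketch only demonstrates the access pattern a forwarding experiment would time (no cycle counting here); the struct and function names are made up:

```c
#include <assert.h>
#include <string.h>

/* The pattern Linus proposes testing: store a register value to memory,
 * memcpy() that memory to another area, then immediately load from the
 * new area.  An aggressive OoO core could in theory forward the
 * original store straight to the final load, at zero added latency. */
struct args {
	long a, b;
};

static long store_copy_load(long x)
{
	struct args src, dst;

	src.a = x;				/* store register to memory */
	src.b = 0;
	memcpy(&dst, &src, sizeof(dst));	/* small memcpy of that memory */
	return dst.a;				/* immediate load from new area */
}
```

This is also the "passing data in structures as function arguments" shape: the stored value round-trips through memory even though, logically, it never needed to leave a register.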
On Wed, Jun 22, 2022 at 04:07:19PM -0500, Linus Torvalds wrote:
> Might I suggest just using "count=XYZ" to make the sizes the same and
> the numbers a bit more comparable? Because when I first looked at the
> numbers I was like "oh, the first one finished in 17s, the second one
> was three times slower!

Yah, I got confused too but then I looked at the rate...

But it looks like even this microbenchmark is, hm, well, showing that there's more than meets the eye. Look at the rates:

for i in $(seq 1 10); do dd if=/dev/zero of=/dev/null bs=1M status=progress count=65536; done 2>&1 | grep copied
32207011840 bytes (32 GB, 30 GiB) copied, 1 s, 32.2 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.93069 s, 35.6 GB/s
37597741056 bytes (38 GB, 35 GiB) copied, 1 s, 37.6 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.78017 s, 38.6 GB/s
62020124672 bytes (62 GB, 58 GiB) copied, 2 s, 31.0 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 2.13716 s, 32.2 GB/s
60010004480 bytes (60 GB, 56 GiB) copied, 1 s, 60.0 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.14129 s, 60.2 GB/s
53212086272 bytes (53 GB, 50 GiB) copied, 1 s, 53.2 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.28398 s, 53.5 GB/s
55698259968 bytes (56 GB, 52 GiB) copied, 1 s, 55.7 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.22507 s, 56.1 GB/s
55306092544 bytes (55 GB, 52 GiB) copied, 1 s, 55.3 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.23647 s, 55.6 GB/s
54387539968 bytes (54 GB, 51 GiB) copied, 1 s, 54.4 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.25693 s, 54.7 GB/s
50566529024 bytes (51 GB, 47 GiB) copied, 1 s, 50.6 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.35096 s, 50.9 GB/s
58308165632 bytes (58 GB, 54 GiB) copied, 1 s, 58.3 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.17394 s, 58.5 GB/s

Now the same thing with smaller buffers:

for i in $(seq 1 10); do dd if=/dev/zero of=/dev/null bs=1M status=progress count=8192; done 2>&1 | grep copied
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.28485 s, 30.2 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.276112 s, 31.1 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.29136 s, 29.5 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.283803 s, 30.3 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.306503 s, 28.0 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.349169 s, 24.6 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.276912 s, 31.0 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.265356 s, 32.4 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.28464 s, 30.2 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.242998 s, 35.3 GB/s

So it is all magically alignment, microcode "activity" and other planets-alignment things of the uarch. It is not getting even close to 50GB/s with larger block sizes - 4M in this case:

dd if=/dev/zero of=/dev/null bs=4M status=progress count=65536
249334595584 bytes (249 GB, 232 GiB) copied, 10 s, 24.9 GB/s
65536+0 records in
65536+0 records out
274877906944 bytes (275 GB, 256 GiB) copied, 10.9976 s, 25.0 GB/s

so it is all a bit: "yes, we can go faster, but <do all those requirements first>"

:-)

> But yes, apparently that "rep stos" is *much* better with that /dev/zero test.
>
> That does imply that what it does is to avoid polluting some cache
> hierarchy, since your 'dd' test case doesn't actually ever *use* the
> end result of the zeroing.
>
> So yeah, memset and memcpy are just fundamentally hard to benchmark,
> because what matters more than the cost of the op itself is often how
> the end result interacts with the code around it.

Yap, and this discarding of the end result is silly but well...

> For example, one of the things that I hope FSRM really does well is
> when small copies (or memsets) are then used immediately afterwards -
> does the just stored data by the microcode get nicely forwarded from
> the store buffers (like it would if it was a loop of stores) or does
> it mean that the store buffer is bypassed and subsequent loads will
> then hit the L1 cache?
> That is *not* an issue in this situation, since any clear_user() won't
> be immediately loaded just a few instructions later, but it's
> traditionally an issue for the "small memset/memcpy" case, where the
> memset/memcpy destination is possibly accessed immediately afterwards
> (either to make further modifications, or to just be read).

Right.

> In a perfect world, you get all the memory forwarding logic kicking
> in, which can really shortcircuit things on an OoO core and take the
> memory pipeline out of the critical path, which then helps IPC.
>
> And that's an area that legacy microcoded 'rep stosb' has not been
> good at. Whether FSRM is quite there yet, I don't know.

Right.

> (Somebody could test: do a 'store register to memory', then do a
> 'memcpy()' of that memory to another memory area, and then do a
> register load from that new area - at least in _theory_ a very
> aggressive microarchitecture could actually do that whole forwarding,
> and make the latency from the original memory store to the final
> memory load be zero cycles.

Ha, that would be an interesting exercise. Hmm, but how would the hardware recognize it is the same data it has in the cache at that new virtual address? I presume it needs some smart tracking of cachelines. But smart tracking costs, so it needs to be something that happens a lot in all the insn traces hw guys look at when thinking of new "shortcuts" to raise IPC. :)

> I know AMD was supposedly doing that for some of the simpler cases,

Yap, the simpler cases are probably easy to track and I guess that's what the hw does properly and does the forwarding there, while for the more complex ones it simply does the whole run-around at least to a lower-level cache if not to mem.

> and it *does* actually matter for real world loads, because that
> memory indirection is often due to passing data in structures as
> function arguments. So it sounds stupid to store to memory and then
> immediately load it again, but it actually happens _all_the_time_ even
> for smart software).

Right.

Thx.
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index f78e2b3501a1..335d571e8a79 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -405,9 +405,6 @@ strncpy_from_user(char *dst, const char __user *src, long count);
 
 extern __must_check long strnlen_user(const char __user *str, long n);
 
-unsigned long __must_check clear_user(void __user *mem, unsigned long len);
-unsigned long __must_check __clear_user(void __user *mem, unsigned long len);
-
 #ifdef CONFIG_ARCH_HAS_COPY_MC
 unsigned long __must_check
 copy_mc_to_kernel(void *to, const void *from, unsigned len);
@@ -429,6 +426,8 @@ extern struct movsl_mask {
 #define ARCH_HAS_NOCACHE_UACCESS 1
 
 #ifdef CONFIG_X86_32
+unsigned long __must_check clear_user(void __user *mem, unsigned long len);
+unsigned long __must_check __clear_user(void __user *mem, unsigned long len);
 # include <asm/uaccess_32.h>
 #else
 # include <asm/uaccess_64.h>
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 45697e04d771..4ffefb4bb844 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -79,4 +79,49 @@ __copy_from_user_flushcache(void *dst, const void __user *src, unsigned size)
 	kasan_check_write(dst, size);
 	return __copy_user_flushcache(dst, src, size);
 }
+
+/*
+ * Zero Userspace.
+ */
+
+__must_check unsigned long
+clear_user_original(void __user *addr, unsigned long len);
+__must_check unsigned long
+clear_user_rep_good(void __user *addr, unsigned long len);
+__must_check unsigned long
+clear_user_erms(void __user *addr, unsigned long len);
+
+static __always_inline __must_check unsigned long __clear_user(void __user *addr, unsigned long size)
+{
+	might_fault();
+	stac();
+
+	/*
+	 * No memory constraint because it doesn't change any memory gcc
+	 * knows about.
+	 */
+	asm volatile(
+		"1:\n\t"
+		ALTERNATIVE_3("rep stosb",
+			      "call clear_user_erms",	  ALT_NOT(X86_FEATURE_FSRM),
+			      "call clear_user_rep_good", ALT_NOT(X86_FEATURE_ERMS),
+			      "call clear_user_original", ALT_NOT(X86_FEATURE_REP_GOOD))
+		"2:\n"
+	       _ASM_EXTABLE_UA(1b, 2b)
+	       : "+&c" (size), "+&D" (addr), ASM_CALL_CONSTRAINT
+	       : "a" (0)
+		/* rep_good clobbers %rdx */
+	       : "rdx");
+
+	clac();
+
+	return size;
+}
+
+static __always_inline unsigned long clear_user(void __user *to, unsigned long n)
+{
+	if (access_ok(to, n))
+		return __clear_user(to, n);
+	return n;
+}
 #endif /* _ASM_X86_UACCESS_64_H */
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index fe59b8ac4fcc..6dd6c7fd1eb8 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0-only */
 #include <linux/linkage.h>
+#include <asm/asm.h>
 #include <asm/export.h>
 
 /*
@@ -50,3 +51,144 @@ SYM_FUNC_START(clear_page_erms)
 	RET
 SYM_FUNC_END(clear_page_erms)
 EXPORT_SYMBOL_GPL(clear_page_erms)
+
+/*
+ * Default clear user-space.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rcx: uncleared bytes or 0 if successful.
+ */
+SYM_FUNC_START(clear_user_original)
+	/*
+	 * Copy only the lower 32 bits of size as that is enough to handle the rest bytes,
+	 * i.e., no need for a 'q' suffix and thus a REX prefix.
+	 */
+	mov %ecx,%eax
+	shr $3,%rcx
+	jz .Lrest_bytes
+
+	# do the qwords first
+	.p2align 4
+.Lqwords:
+	movq $0,(%rdi)
+	lea 8(%rdi),%rdi
+	dec %rcx
+	jnz .Lqwords
+
+.Lrest_bytes:
+	and $7, %eax
+	jz .Lexit
+
+	# now do the rest bytes
+.Lbytes:
+	movb $0,(%rdi)
+	inc %rdi
+	dec %eax
+	jnz .Lbytes
+
+.Lexit:
+	/*
+	 * %rax still needs to be cleared in the exception case because this function is called
+	 * from inline asm and the compiler expects %rax to be zero when exiting the inline asm,
+	 * in case it might reuse it somewhere.
+	 */
+	xor %eax,%eax
+	RET
+
+.Lqwords_exception:
+	# convert remaining qwords back into bytes to return to caller
+	shl $3, %rcx
+	and $7, %eax
+	add %rax,%rcx
+	jmp .Lexit
+
+.Lbytes_exception:
+	mov %eax,%ecx
+	jmp .Lexit
+
+	_ASM_EXTABLE_UA(.Lqwords, .Lqwords_exception)
+	_ASM_EXTABLE_UA(.Lbytes, .Lbytes_exception)
+SYM_FUNC_END(clear_user_original)
+EXPORT_SYMBOL(clear_user_original)
+
+/*
+ * Alternative clear user-space when CPU feature X86_FEATURE_REP_GOOD is
+ * present.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rcx: uncleared bytes or 0 if successful.
+ */
+SYM_FUNC_START(clear_user_rep_good)
+	# call the original thing for less than a cacheline
+	cmp $64, %rcx
+	ja .Lprep
+	call clear_user_original
+	RET
+
+.Lprep:
+	# copy lower 32-bits for rest bytes
+	mov %ecx, %edx
+	shr $3, %rcx
+	jz .Lrep_good_rest_bytes
+
+.Lrep_good_qwords:
+	rep stosq
+
+.Lrep_good_rest_bytes:
+	and $7, %edx
+	jz .Lrep_good_exit
+
+.Lrep_good_bytes:
+	mov %edx, %ecx
+	rep stosb
+
+.Lrep_good_exit:
+	# see .Lexit comment above
+	xor %eax, %eax
+	RET
+
+.Lrep_good_qwords_exception:
+	# convert remaining qwords back into bytes to return to caller
+	shl $3, %rcx
+	and $7, %edx
+	add %rdx, %rcx
+	jmp .Lrep_good_exit
+
+	_ASM_EXTABLE_UA(.Lrep_good_qwords, .Lrep_good_qwords_exception)
+	_ASM_EXTABLE_UA(.Lrep_good_bytes, .Lrep_good_exit)
+SYM_FUNC_END(clear_user_rep_good)
+EXPORT_SYMBOL(clear_user_rep_good)
+
+/*
+ * Alternative clear user-space when CPU feature X86_FEATURE_ERMS is present.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rcx: uncleared bytes or 0 if successful.
+ *
+ */
+SYM_FUNC_START(clear_user_erms)
+	# call the original thing for less than a cacheline
+	cmp $64, %rcx
+	ja .Lerms_bytes
+	call clear_user_original
+	RET
+
+.Lerms_bytes:
+	rep stosb
+
+.Lerms_exit:
+	xorl %eax,%eax
+	RET
+
+	_ASM_EXTABLE_UA(.Lerms_bytes, .Lerms_exit)
+SYM_FUNC_END(clear_user_erms)
+EXPORT_SYMBOL(clear_user_erms)
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 0ae6cf804197..6c1f8ac5e721 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -14,46 +14,6 @@
  * Zero Userspace
  */
 
-unsigned long __clear_user(void __user *addr, unsigned long size)
-{
-	long __d0;
-	might_fault();
-	/* no memory constraint because it doesn't change any memory gcc knows
-	   about */
-	stac();
-	asm volatile(
-		"	testq  %[size8],%[size8]\n"
-		"	jz     4f\n"
-		"	.align 16\n"
-		"0:	movq $0,(%[dst])\n"
-		"	addq   $8,%[dst]\n"
-		"	decl %%ecx ; jnz   0b\n"
-		"4:	movq  %[size1],%%rcx\n"
-		"	testl %%ecx,%%ecx\n"
-		"	jz     2f\n"
-		"1:	movb   $0,(%[dst])\n"
-		"	incq   %[dst]\n"
-		"	decl %%ecx ; jnz  1b\n"
-		"2:\n"
-
-		_ASM_EXTABLE_TYPE_REG(0b, 2b, EX_TYPE_UCOPY_LEN8, %[size1])
-		_ASM_EXTABLE_UA(1b, 2b)
-
-		: [size8] "=&c"(size), [dst] "=&D" (__d0)
-		: [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr));
-	clac();
-	return size;
-}
-EXPORT_SYMBOL(__clear_user);
-
-unsigned long clear_user(void __user *to, unsigned long n)
-{
-	if (access_ok(to, n))
-		return __clear_user(to, n);
-	return n;
-}
-EXPORT_SYMBOL(clear_user);
-
 #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
 /**
  * clean_cache_range - write back a cache range with CLWB
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index ca5b74603008..e460aa004b88 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1020,6 +1020,9 @@ static const char *uaccess_safe_builtin[] = {
 	"copy_mc_fragile_handle_tail",
 	"copy_mc_enhanced_fast_string",
 	"ftrace_likely_update", /* CONFIG_TRACE_BRANCH_PROFILING */
+	"clear_user_erms",
+	"clear_user_rep_good",
+	"clear_user_original",
 	NULL
 };
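The length split that clear_user_rep_good performs - len/8 qwords via "rep stosq", then the len%8 tail via "rep stosb" - can be modeled in plain C. This is a sketch of the control flow only (the function name is made up; the real code also handles faults mid-clear and returns the uncleared byte count):

```c
#include <string.h>

/* C model of clear_user_rep_good's happy path: clear len/8 qwords,
 * then the len & 7 remaining tail bytes.  memset() stands in for the
 * rep stosq / rep stosb instructions. */
static unsigned long clear_rep_good_model(unsigned char *dst, unsigned long len)
{
	unsigned long qwords = len >> 3;	/* shr $3, %rcx */
	unsigned long rest   = len & 7;		/* and $7, %edx */

	memset(dst, 0, qwords * 8);		/* rep stosq */
	memset(dst + qwords * 8, 0, rest);	/* rep stosb on the tail */

	return 0;				/* 0 == all bytes cleared */
}
```

The exception labels in the asm (.Lrep_good_qwords_exception) exist precisely because a fault can land mid-"rep stosq", at which point the remaining qword count in %rcx has to be scaled back to bytes (shl $3) and the tail re-added before returning to the caller.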