Message ID | 20250325185219.315319-1-andrew.cooper3@citrix.com |
---|---|
State | New |
Series | x86/bitops: Account for POPCNT errata on earlier Intel CPUs |
On 25.03.2025 19:52, Andrew Cooper wrote:
> Manually break the false dependency for the benefit of cases such as
> bitmap_weight() which is a reasonable hotpath.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
>
> Not sure if this warrants a Fixes tag or not, but:
>
> Fixes: 6978602334d9 ("x86/bitops: Use the POPCNT instruction when available")
>
> Many examples online suggest a 2x perf improvement on tight loops by
> breaking this dependency.  cpumasks in particular make frequent use of this
> loop.
>
> Still TODO:
>
> 1) Put a double CS prefix on the CALL instruction to avoid a trailing 2-byte
>    NOP, but this depends on x86_decode_lite() in order to work.
>
> 2) Revert a buggy GAS diagnostic:
>
>      ./arch/x86/include/asm/bitops.h: Assembler messages:
>      ./arch/x86/include/asm/bitops.h:493: Error: same type of prefix used twice
>      ./arch/x86/include/asm/bitops.h:493: Error: same type of prefix used twice
>
>    Multiple prefixes are not an error, and are sometimes the best choice
>    available.

Well.  Sane insns have every kind of prefix at most once.  The assembler has
(effectively) boolean markers for that.  Thing is that multiple different
prefixes of the same kind are ambiguous to encode: Which order should they be
emitted in?  One programmer may think "first wins" while another may expect
"last wins".

It may be possible to special-case "same prefix" by converting the
boolean-ness to a count.  Yet I expect that, more often than not, multiple
prefixes of the same type are a mistake.  Hence we would probably still want
to warn about that, requiring a new way to silence the warning.

However, imo it is better to leave the assembler unaltered, and continue to
require people who want such unusual encodings to specify the prefixes they
want as separate kind-of-insns.  Such unusual encodings are likely to also
imply a certain positioning of those prefixes relative to other possible
ones; there's no prescribed order in which the assembler would emit prefixes
of different kinds (REX and REX2 always coming last, of course) when they're
part of a real insn.

> It turns out that LZCNT/TZCNT have the same input dependent bug, prior to
> Skylake.

These two do, but BSF/BSR don't?  Pretty odd.

> --- a/xen/arch/x86/include/asm/bitops.h
> +++ b/xen/arch/x86/include/asm/bitops.h
> @@ -488,10 +488,16 @@ static always_inline unsigned int arch_hweightl(unsigned long x)
>       *
>       * This limits the POPCNT instruction to using the same ABI as a function
>       * call (input in %rdi, output in %eax) but that's fine.
> +     *
> +     * On Intel CPUs prior to Cannon Lake, the POPCNT instruction has a false
> +     * input dependency on its destination register (errata HSD146, SKL029
> +     * amongst others), impacting loops such as bitmap_weight().  Insert an
> +     * XOR to manually break the dependency.
>       */

With this being an Intel-only issue, wouldn't we better ...

>      alternative_io("call arch_generic_hweightl",
> +                   "xor %k[res], %k[res]\n\t"

... put this line in #ifdef CONFIG_INTEL then?  The extra overhead is small,
but I see no reason not to avoid it if we can.  (I realize that's not quite as
straightforward as it reads, for alternative_io() being a macro.)

Jan

>                     "popcnt %[val], %q[res]", X86_FEATURE_POPCNT,
> -                   ASM_OUTPUT2([res] "=a" (r) ASM_CALL_CONSTRAINT),
> +                   ASM_OUTPUT2([res] "=&a" (r) ASM_CALL_CONSTRAINT),
>                     [val] "D" (x));
>
>      return r;
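For illustration only, here is roughly what TODO (1) might look like with the
repeated CS prefix spelled as raw bytes, since GAS currently rejects writing
the same prefix twice on one instruction. This is a sketch, not part of the
patch as posted, and per the TODO it still depends on x86_decode_lite() before
the alternatives machinery could handle a prefixed CALL.

    /*
     * Hypothetical variant of the hunk under review (not the posted patch):
     * two CS override prefixes (0x2e) pad the 5-byte CALL to the 7 bytes of
     * the "xor + popcnt" replacement, avoiding the trailing 2-byte NOP.
     */
    alternative_io(".byte 0x2e, 0x2e\n\t"
                   "call arch_generic_hweightl",
                   "xor %k[res], %k[res]\n\t"
                   "popcnt %[val], %q[res]", X86_FEATURE_POPCNT,
                   ASM_OUTPUT2([res] "=&a" (r) ASM_CALL_CONSTRAINT),
                   [val] "D" (x));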
On 26/03/2025 9:20 am, Jan Beulich wrote:
> On 25.03.2025 19:52, Andrew Cooper wrote:
>> It turns out that LZCNT/TZCNT have the same input dependent bug, prior to
>> Skylake.
> These two do, but BSF/BSR don't?  Pretty odd.

BSF/BSR have true input dependencies.  They both have cases where the
destination register is left unmodified.

>
>> --- a/xen/arch/x86/include/asm/bitops.h
>> +++ b/xen/arch/x86/include/asm/bitops.h
>> @@ -488,10 +488,16 @@ static always_inline unsigned int arch_hweightl(unsigned long x)
>>       *
>>       * This limits the POPCNT instruction to using the same ABI as a function
>>       * call (input in %rdi, output in %eax) but that's fine.
>> +     *
>> +     * On Intel CPUs prior to Cannon Lake, the POPCNT instruction has a false
>> +     * input dependency on its destination register (errata HSD146, SKL029
>> +     * amongst others), impacting loops such as bitmap_weight().  Insert an
>> +     * XOR to manually break the dependency.
>>       */
> With this being an Intel-only issue, wouldn't we better ...
>
>>      alternative_io("call arch_generic_hweightl",
>> +                   "xor %k[res], %k[res]\n\t"
> ... put this line in #ifdef CONFIG_INTEL then?  The extra overhead is small,
> but I see no reason not to avoid it if we can.  (I realize that's not quite
> as straightforward as it reads, for alternative_io() being a macro.)

For an XOR, no; not worth the complexity.

Besides, this gets used in a whole 5 locations in the hypervisor, after I
cleaned up the paths which shouldn't have been using hweight() in the first
place.

~Andrew
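The distinction Andrew draws can be seen in how BSF is typically used. A
minimal sketch (not Xen code; the function name is invented), relying on the
behaviour described above where BSF leaves its destination unmodified for a
zero source:

static inline unsigned long bsf_or_default(unsigned long x, unsigned long def)
{
    unsigned long r = def;

    /*
     * "+r" marks the destination as both input and output: if x is zero,
     * BSF leaves 'def' in place, so the register's old value genuinely
     * matters.  That is a true dependency, unlike POPCNT/LZCNT/TZCNT where
     * the destination is only a (false) microarchitectural input.
     */
    asm ( "bsf %[val], %[res]"
          : [res] "+r" (r)
          : [val] "rm" (x)
          : "cc" );

    return r;
}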
On 26.03.2025 10:34, Andrew Cooper wrote:
> On 26/03/2025 9:20 am, Jan Beulich wrote:
>> On 25.03.2025 19:52, Andrew Cooper wrote:
>>> --- a/xen/arch/x86/include/asm/bitops.h
>>> +++ b/xen/arch/x86/include/asm/bitops.h
>>> @@ -488,10 +488,16 @@ static always_inline unsigned int arch_hweightl(unsigned long x)
>>>       *
>>>       * This limits the POPCNT instruction to using the same ABI as a function
>>>       * call (input in %rdi, output in %eax) but that's fine.
>>> +     *
>>> +     * On Intel CPUs prior to Cannon Lake, the POPCNT instruction has a false
>>> +     * input dependency on its destination register (errata HSD146, SKL029
>>> +     * amongst others), impacting loops such as bitmap_weight().  Insert an
>>> +     * XOR to manually break the dependency.
>>>       */
>> With this being an Intel-only issue, wouldn't we better ...
>>
>>>      alternative_io("call arch_generic_hweightl",
>>> +                   "xor %k[res], %k[res]\n\t"
>> ... put this line in #ifdef CONFIG_INTEL then?  The extra overhead is small,
>> but I see no reason not to avoid it if we can.  (I realize that's not quite
>> as straightforward as it reads, for alternative_io() being a macro.)
>
> For an XOR, no; not worth the complexity.
>
> Besides, this gets used in a whole 5 locations in the hypervisor, after I
> cleaned up the paths which shouldn't have been using hweight() in the first
> place.

Well, okay with me then:
Acked-by: Jan Beulich <jbeulich@suse.com>

Jan
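For reference, a rough sketch of how the CONFIG_INTEL variant weighed up above
could have been expressed, working around alternative_io() being a macro with
a small string helper. The POPCNT_DEP_BREAK name is invented for illustration;
the patch as acked keeps the XOR unconditional.

/* Sketch only: not part of the patch. */
#ifdef CONFIG_INTEL
# define POPCNT_DEP_BREAK "xor %k[res], %k[res]\n\t"
#else
# define POPCNT_DEP_BREAK ""   /* non-Intel builds skip the dependency-break */
#endif

    alternative_io("call arch_generic_hweightl",
                   POPCNT_DEP_BREAK
                   "popcnt %[val], %q[res]", X86_FEATURE_POPCNT,
                   ASM_OUTPUT2([res] "=&a" (r) ASM_CALL_CONSTRAINT),
                   [val] "D" (x));

With !CONFIG_INTEL the helper expands to an empty string, and the resulting
asm template is identical to the current upstream code.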
diff --git a/xen/arch/x86/include/asm/bitops.h b/xen/arch/x86/include/asm/bitops.h
index bb9d75646023..87eac7782f10 100644
--- a/xen/arch/x86/include/asm/bitops.h
+++ b/xen/arch/x86/include/asm/bitops.h
@@ -488,10 +488,16 @@ static always_inline unsigned int arch_hweightl(unsigned long x)
      *
      * This limits the POPCNT instruction to using the same ABI as a function
      * call (input in %rdi, output in %eax) but that's fine.
+     *
+     * On Intel CPUs prior to Cannon Lake, the POPCNT instruction has a false
+     * input dependency on its destination register (errata HSD146, SKL029
+     * amongst others), impacting loops such as bitmap_weight().  Insert an
+     * XOR to manually break the dependency.
      */
     alternative_io("call arch_generic_hweightl",
+                   "xor %k[res], %k[res]\n\t"
                    "popcnt %[val], %q[res]", X86_FEATURE_POPCNT,
-                   ASM_OUTPUT2([res] "=a" (r) ASM_CALL_CONSTRAINT),
+                   ASM_OUTPUT2([res] "=&a" (r) ASM_CALL_CONSTRAINT),
                    [val] "D" (x));

     return r;
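The same idiom, stripped of the alternatives machinery, as a standalone
inline-asm sketch (illustrative only; popcnt_demo is not a Xen function):

static inline unsigned int popcnt_demo(unsigned long x)
{
    unsigned long r;

    asm ( "xor %k[res], %k[res]\n\t"   /* break the false input dependency */
          "popcnt %[val], %[res]"
          : [res] "=&r" (r)            /* earlyclobber: written before x is read */
          : [val] "rm" (x) );

    return r;
}

The earlyclobber mirrors the "=a" -> "=&a" change in the hunk above: once the
asm writes the destination before consuming the input, the compiler must not
allocate both operands to the same register.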
Manually break the false dependency for the benefit of cases such as
bitmap_weight() which is a reasonable hotpath.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

Not sure if this warrants a Fixes tag or not, but:

Fixes: 6978602334d9 ("x86/bitops: Use the POPCNT instruction when available")

Many examples online suggest a 2x perf improvement on tight loops by breaking
this dependency.  cpumasks in particular make frequent use of this loop.

Still TODO:

1) Put a double CS prefix on the CALL instruction to avoid a trailing 2-byte
   NOP, but this depends on x86_decode_lite() in order to work.

2) Revert a buggy GAS diagnostic:

     ./arch/x86/include/asm/bitops.h: Assembler messages:
     ./arch/x86/include/asm/bitops.h:493: Error: same type of prefix used twice
     ./arch/x86/include/asm/bitops.h:493: Error: same type of prefix used twice

   Multiple prefixes are not an error, and are sometimes the best choice
   available.

It turns out that LZCNT/TZCNT have the same input dependent bug, prior to
Skylake.

There are no instructions in the "cleaned up" part of bitops yet, and I don't
expect any to survive cleaning.
---
 xen/arch/x86/include/asm/bitops.h | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
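The kind of loop the cover letter has in mind, sketched as a simplified
bitmap_weight() (illustrative only, not Xen's implementation):

static unsigned int bitmap_weight_demo(const unsigned long *bits,
                                       unsigned int nr_longs)
{
    unsigned int i, w = 0;

    /*
     * Each iteration's POPCNT writes the same destination register; on the
     * affected Intel CPUs the false dependency chains the iterations
     * together unless it is broken, as the patch does with the XOR.
     */
    for ( i = 0; i < nr_longs; i++ )
        w += arch_hweightl(bits[i]);

    return w;
}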