Message ID | acf0f5f6-f4da-cd88-1515-2546153322b4@suse.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | x86/shadow: misc tidying | expand |
On 05/01/2023 4:05 pm, Jan Beulich wrote: > The "n" input is a GFN value and hence bounded by the physical address > bits in use on a system. The one case where this isn't obviously true is in sh_audit(). It comes from a real MFN in the system, not a GFN, which will have the same property WRT PADDR_BITS. > The hash quality won't improve by also > including the upper always-zero bits in the calculation. To keep things > as compile-time-constant as they were before, use PADDR_BITS (not > paddr_bits) for loop bounding. This reduces loop iterations from 8 to 5. While this is all true, you'll get a much better improvement by not forcing 'n' onto the stack just to access it bytewise. Right now, the loop looks like: <shadow_hash_insert>: 48 83 ec 10 sub $0x10,%rsp 49 89 c9 mov %rcx,%r9 41 89 d0 mov %edx,%r8d 48 8d 44 24 08 lea 0x8(%rsp),%rax 48 8d 4c 24 10 lea 0x10(%rsp),%rcx 48 89 74 24 08 mov %rsi,0x8(%rsp) 0f 1f 80 00 00 00 00 nopl 0x0(%rax) /-> 0f b6 10 movzbl (%rax),%edx | 48 83 c0 01 add $0x1,%rax | 45 69 c0 3f 00 01 00 imul $0x1003f,%r8d,%r8d | 41 01 d0 add %edx,%r8d | 48 39 c1 cmp %rax,%rcx \-- 75 ea jne ffff82d0402efda0 <shadow_hash_insert+0x20> which doesn't even have a compile-time constant loop bound. It's runtime calculated by the second lea constructing the upper pointer bound. Given this further delta: diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c index 4a8bcec10fe8..902c749f2724 100644 --- a/xen/arch/x86/mm/shadow/common.c +++ b/xen/arch/x86/mm/shadow/common.c @@ -1397,13 +1397,12 @@ static unsigned int shadow_get_allocation(struct domain *d) typedef u32 key_t; static inline key_t sh_hash(unsigned long n, unsigned int t) { - unsigned char *p = (unsigned char *)&n; key_t k = t; int i; BUILD_BUG_ON(PADDR_BITS > BITS_PER_LONG + PAGE_SHIFT); - for ( i = 0; i < (PADDR_BITS - PAGE_SHIFT + 7) / 8; i++ ) - k = p[i] + (k << 6) + (k << 16) - k; + for ( i = 0; i < (PADDR_BITS - PAGE_SHIFT + 7) / 8; i++, n >>= 8 ) + k = (uint8_t)n + (k << 6) + (k << 16) - k; return k % SHADOW_HASH_BUCKETS; } the code gen becomes: <shadow_hash_insert>: 41 89 d0 mov %edx,%r8d 49 89 c9 mov %rcx,%r9 b8 05 00 00 00 mov $0x5,%eax /-> 45 69 c0 3f 00 01 00 imul $0x1003f,%r8d,%r8d | 40 0f b6 d6 movzbl %sil,%edx | 48 c1 ee 08 shr $0x8,%rsi | 41 01 d0 add %edx,%r8d | 83 e8 01 sub $0x1,%eax \-- 75 e9 jne ffff82d0402efd8b <shadow_hash_insert+0xb> with an actual constant loop bound, and not a memory operand in sight. This form (even at 8 iterations) will easily execute faster than the stack-spilled form. ~Andrew
On 06.01.2023 03:03, Andrew Cooper wrote: > On 05/01/2023 4:05 pm, Jan Beulich wrote: >> The "n" input is a GFN value and hence bounded by the physical address >> bits in use on a system. > > The one case where this isn't obviously true is in sh_audit(). It comes > from a real MFN in the system, not a GFN, which will have the same > property WRT PADDR_BITS. I'm afraid I was more wrong with that than just for the audit case. Only FL1 shadows use GFNs. All other shadows us MFNs. I'll update the sentence. >> The hash quality won't improve by also >> including the upper always-zero bits in the calculation. To keep things >> as compile-time-constant as they were before, use PADDR_BITS (not >> paddr_bits) for loop bounding. This reduces loop iterations from 8 to 5. > > While this is all true, you'll get a much better improvement by not > forcing 'n' onto the stack just to access it bytewise. Right now, the > loop looks like: > > <shadow_hash_insert>: > 48 83 ec 10 sub $0x10,%rsp > 49 89 c9 mov %rcx,%r9 > 41 89 d0 mov %edx,%r8d > 48 8d 44 24 08 lea 0x8(%rsp),%rax > 48 8d 4c 24 10 lea 0x10(%rsp),%rcx > 48 89 74 24 08 mov %rsi,0x8(%rsp) > 0f 1f 80 00 00 00 00 nopl 0x0(%rax) > /-> 0f b6 10 movzbl (%rax),%edx > | 48 83 c0 01 add $0x1,%rax > | 45 69 c0 3f 00 01 00 imul $0x1003f,%r8d,%r8d > | 41 01 d0 add %edx,%r8d > | 48 39 c1 cmp %rax,%rcx > \-- 75 ea jne ffff82d0402efda0 > <shadow_hash_insert+0x20> > > > which doesn't even have a compile-time constant loop bound. It's > runtime calculated by the second lea constructing the upper pointer bound. > > Given this further delta: > > diff --git a/xen/arch/x86/mm/shadow/common.c > b/xen/arch/x86/mm/shadow/common.c > index 4a8bcec10fe8..902c749f2724 100644 > --- a/xen/arch/x86/mm/shadow/common.c > +++ b/xen/arch/x86/mm/shadow/common.c > @@ -1397,13 +1397,12 @@ static unsigned int shadow_get_allocation(struct > domain *d) > typedef u32 key_t; > static inline key_t sh_hash(unsigned long n, unsigned int t) > { > - unsigned char *p = (unsigned char *)&n; > key_t k = t; > int i; > > BUILD_BUG_ON(PADDR_BITS > BITS_PER_LONG + PAGE_SHIFT); > - for ( i = 0; i < (PADDR_BITS - PAGE_SHIFT + 7) / 8; i++ ) > - k = p[i] + (k << 6) + (k << 16) - k; > + for ( i = 0; i < (PADDR_BITS - PAGE_SHIFT + 7) / 8; i++, n >>= 8 ) > + k = (uint8_t)n + (k << 6) + (k << 16) - k; > > return k % SHADOW_HASH_BUCKETS; > } > > the code gen becomes: > > <shadow_hash_insert>: > 41 89 d0 mov %edx,%r8d > 49 89 c9 mov %rcx,%r9 > b8 05 00 00 00 mov $0x5,%eax > /-> 45 69 c0 3f 00 01 00 imul $0x1003f,%r8d,%r8d > | 40 0f b6 d6 movzbl %sil,%edx > | 48 c1 ee 08 shr $0x8,%rsi > | 41 01 d0 add %edx,%r8d > | 83 e8 01 sub $0x1,%eax > \-- 75 e9 jne ffff82d0402efd8b > <shadow_hash_insert+0xb> > > with an actual constant loop bound, and not a memory operand in sight. > This form (even at 8 iterations) will easily execute faster than the > stack-spilled form. Oh, yes, good idea. Will adjust. Jan
--- a/xen/arch/x86/mm/shadow/common.c +++ b/xen/arch/x86/mm/shadow/common.c @@ -1400,7 +1400,11 @@ static inline key_t sh_hash(unsigned lon unsigned char *p = (unsigned char *)&n; key_t k = t; int i; - for ( i = 0; i < sizeof(n) ; i++ ) k = (u32)p[i] + (k<<6) + (k<<16) - k; + + BUILD_BUG_ON(PADDR_BITS > BITS_PER_LONG + PAGE_SHIFT); + for ( i = 0; i < (PADDR_BITS - PAGE_SHIFT + 7) / 8; i++ ) + k = p[i] + (k << 6) + (k << 16) - k; + return k % SHADOW_HASH_BUCKETS; }
The "n" input is a GFN value and hence bounded by the physical address bits in use on a system. The hash quality won't improve by also including the upper always-zero bits in the calculation. To keep things as compile-time-constant as they were before, use PADDR_BITS (not paddr_bits) for loop bounding. This reduces loop iterations from 8 to 5. While there also drop the unnecessary cast to u32. Signed-off-by: Jan Beulich <jbeulich@suse.com> --- I was tempted to also change the types of "p" (pointer to const) and "i" (unsigned) right here (and perhaps even the "byte" in the comment ahead of the function), but then thought this might be going too far ...