
vcpumask_to_pcpumask() case study

Message ID 3bb4e3fa-376b-4641-824d-61864b4e1e8e@citrix.com (mailing list archive)
State New, archived
Series: vcpumask_to_pcpumask() case study

Commit Message

Andrew Cooper June 1, 2024, 6:50 p.m. UTC
One of the followon items I had from the bitops clean-up is this:

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 648d6dd475ba..9c3a017606ed 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3425,7 +3425,7 @@ static int vcpumask_to_pcpumask(
             unsigned int cpu;
 
             vcpu_id = ffsl(vmask) - 1;
-            vmask &= ~(1UL << vcpu_id);
+            vmask &= vmask - 1;
             vcpu_id += vcpu_bias;
             if ( (vcpu_id >= d->max_vcpus) )
                 return 0;

which yields the following improvement:

  add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-34 (-34)
  Function                                     old     new   delta
  vcpumask_to_pcpumask                         519     485     -34


While I (the programmer) can reason the two expressions are equivalent,
the compiler can't, so we really are saving a SHL to re-calculate (1 <<
vcpu_id) and swapping it for a LEA -1(vmask) which happens to be hoisted
above the ffsl() call.

However, the majority of the savings came from the fact that the old
code used to hold 1 in %r15 (across the entire function!) so it could
calculate (1 << vcpu_id) on each loop iteration.  With %r15 now free for
other use, we have one fewer thing spilt to the stack.

Anyway - while it goes to show that while/ffs logic should be extra
careful to use x &= x - 1 for its loop condition logic, that's not all.
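
For reference, the shape of the idiom in isolation looks like this (a
minimal sketch, not the Xen code itself; process_bit() is a hypothetical
consumer):

/* Visit each set bit of @mask, lowest first. */
static void visit_set_bits(unsigned long mask,
                           void (*process_bit)(unsigned int bit))
{
    while ( mask )
    {
        unsigned int bit = ffsl(mask) - 1; /* index of lowest set bit */

        mask &= mask - 1;                  /* clear it, independently of 'bit' */
        process_bit(bit);
    }
}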

The rest of this function is crazy.  We're reading a guest-word at a
time in order to make a d->max_vcpus sized bitmap (with a reasonable
amount of opencoded logic to reassemble the fragments back into a vcpu
number).

PVH dom0 can reach here, and because it's not pv32, will be deemed to
have a guest word size of 64.  Also, word-at-a-time for any HVM guest is
an insane overhead in terms of the pagewalks behind the copy_from_hvm()
infrastructure.

Instead, we should calculate the size of the bitmap at
DIV_ROUND_UP(d->max_vcpus, 8), copy the bitmap in one whole go, and then
just have a single for_each_set_bit() looping over it.  Importantly this
avoids needing to know or care about the guest word size.
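
In rough outline (a sketch only: the bound, the copy helper and the error
handling are illustrative rather than the real API shapes, and it assumes
a Linux-style for_each_set_bit(bit, addr, size) over an in-memory bitmap):

unsigned long vmask[BITS_TO_LONGS(PV_MAX_VCPUS)] = {}; /* bound is illustrative */
unsigned int bytes = DIV_ROUND_UP(d->max_vcpus, 8);
unsigned int vcpu_id;

/* One bulk copy, with no dependency on the guest word size. */
if ( copy_bitmap_from_guest(vmask, bmap, bytes) ) /* hypothetical helper */
    return -EFAULT;

for_each_set_bit ( vcpu_id, vmask, d->max_vcpus )
{
    /* ... map vcpu_id to its pcpu, as the existing loop body does ... */
}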

This removes 4 of the 6 hiding lfences; the pair for calculating
is_native to start with, and then one of the two pairs behind the use of
is_native to select the type of the access.

The only complication is this:  Right now, PV max vCPUs is 8k, so we
could even get away with this being an on-stack bitmap.  However, the
vast majority of PV guests are small and a 64-bit bitmap would do fine.

But, given this is just one example, wouldn't it be better if we just
unconditionally had a marshalling buffer for hypercalls?  There's the
XLAT area, but it doesn't exist for PV64, and we've got various other pieces
of scratch per-cpu data.
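
To make the marshalling-buffer idea concrete (purely illustrative: the
name, size and lifetime rules are made up, not an existing Xen facility):

/* Hypothetical per-cpu scratch area for copying in hypercall arguments. */
struct hypercall_scratch {
    uint8_t buf[1024]; /* size chosen arbitrarily for the sketch */
};
static DEFINE_PER_CPU(struct hypercall_scratch, hypercall_scratch);

/* ... then, in a hypercall handler ... */
uint8_t *buf = this_cpu(hypercall_scratch).buf;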

If not, having a 128/256-bit bitmap on the stack will still be good
enough in practice, while still being ok at amortising the PVH dom0 costs.

Thoughts?

~Andrew

Comments

Jan Beulich June 3, 2024, 9:19 p.m. UTC | #1
On 01.06.2024 20:50, Andrew Cooper wrote:
> One of the followon items I had from the bitops clean-up is this:
> 
> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> index 648d6dd475ba..9c3a017606ed 100644
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -3425,7 +3425,7 @@ static int vcpumask_to_pcpumask(
>              unsigned int cpu;
>  
>              vcpu_id = ffsl(vmask) - 1;
> -            vmask &= ~(1UL << vcpu_id);
> +            vmask &= vmask - 1;
>              vcpu_id += vcpu_bias;
>              if ( (vcpu_id >= d->max_vcpus) )
>                  return 0;
> 
> which yields the following improvement:
> 
>   add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-34 (-34)
>   Function                                     old     new   delta
>   vcpumask_to_pcpumask                         519     485     -34

Nice. At the risk of getting flamed again for raising dumb questions:
Considering that elsewhere "trickery" like the &= mask - 1 here was
deemed not nice to have (at least wanting to be hidden behind a
suitably named macro; see e.g. ISOLATE_LSB()), wouldn't __clear_bit()
be suitable here too, and less at risk of being considered "trickery"?
But yes, that would eliminate the benefit of making the bit clearing
independent of the ffsl() result. And personally I'm fine anyway with
the form as suggested.

> While I (the programmer) can reason the two expressions are equivalent,
> the compiler can't,

Why is it you think it can't? There's no further knowledge that you
as a human need to rely on for this, afaics. If ffsl() uses the
built-in (as it now does), the compiler has full insight into what's
going on. It's just that compiler engineers may not deem it worth the
effort to carry out such a special-case optimization.

> so we really are saving a SHL to re-calculate (1 <<
> vcpu_id) and swapping it for a LEA -1(vmask) which happens to be hoisted
> above the ffsl() call.
> 
> However, the majority of the savings came from the fact that the old
> code used to hold 1 in %r15 (across the entire function!) so it could
> calculate (1 << vcpu_id) on each loop iteration.  With %r15 now free for
> other use, we have one fewer thing spilt to the stack.
> 
> Anyway - while it goes to show that while/ffs logic should be extra
> careful to use x &= x - 1 for its loop condition logic, that's not all.
> 
> The rest of this function is crazy.  We're reading a guest-word at a
> time in order to make a d->max_vcpus sized bitmap (with a reasonable
> amount of opencoded logic to reassemble the fragments back into a vcpu
> number).
> 
> PVH dom0 can reach here, and because it's not pv32, will be deemed to
> have a guest word size of 64.  Also, word-at-a-time for any HVM guest is
> an insane overhead in terms of the pagewalks behind the copy_from_hvm()
> infrastructure.
> 
> Instead, we should calculate the size of the bitmap at
> DIV_ROUND_UP(d->max_vcpus, 8), copy the bitmap in one whole go, and then
> just have a single for_each_set_bit() looping over it.  Importantly this
> avoids needing to know or care about the guest word size.
> 
> This removes 4 of the 6 hiding lfences; the pair for calculating
> is_native to start with, and then one of the two pairs behind the use of
> is_native to select the type of the access.
> 
> The only complication is this:  Right now, PV max vCPUs is 8k, so we
> could even get away with this being an on-stack bitmap.  However, the
> vast majority of PV guests are small and a 64-bit bitmap would do fine.
> 
> But, given this is just one example, wouldn't it be better if we just
> unconditionally had a marshalling buffer for hypercalls?  There's the
> XLAT area, but it doesn't exist for PV64, and we've got various other pieces
> of scratch per-cpu data.
> 
> If not, having a 128/256-bit bitmap on the stack will still be good
> enough in practice, while still being ok at amortising the PVH dom0 costs.
> 
> Thoughts?

Well, yes, the last of what you suggest might certainly be worthwhile
as a minimal improvement. The only slight concern with increasing the
granularity is that it'll be increasingly less likely that a 2nd or
further iteration of the loop would actually be exercised in routine
testing.

A marshalling buffer might also make sense to have, as long as we have
more than just one or two places wanting to use it. Since you say "just
one example", likely you have further uses in mind. In most other
places we use input data more directly, so I can't right away
come up with possible further use cases.

Jan
Andrew Cooper June 7, 2024, 12:35 p.m. UTC | #2
On 03/06/2024 10:19 pm, Jan Beulich wrote:
> On 01.06.2024 20:50, Andrew Cooper wrote:
>> One of the followon items I had from the bitops clean-up is this:
>>
>> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
>> index 648d6dd475ba..9c3a017606ed 100644
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -3425,7 +3425,7 @@ static int vcpumask_to_pcpumask(
>>              unsigned int cpu;
>>  
>>              vcpu_id = ffsl(vmask) - 1;
>> -            vmask &= ~(1UL << vcpu_id);
>> +            vmask &= vmask - 1;
>>              vcpu_id += vcpu_bias;
>>              if ( (vcpu_id >= d->max_vcpus) )
>>                  return 0;
>>
>> which yields the following improvement:
>>
>>   add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-34 (-34)
>>   Function                                     old     new   delta
>>   vcpumask_to_pcpumask                         519     485     -34
> Nice. At the risk of getting flamed again for raising dumb questions:
> Considering that elsewhere "trickery" like the &= mask - 1 here were
> deemed not nice to have (at least wanting to be hidden behind a
> suitably named macro; see e.g. ISOLATE_LSB()), wouldn't __clear_bit()
> be suitable here too, and less at risk of being considered "trickery"?

__clear_bit() is even worse, because it forces the bitmap to be spilled
to memory.  It hopefully won't when I've given the test/set helpers the
same treatment as ffs/fls.
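
To illustrate the difference (a sketch; the exact code generation
obviously depends on the compiler and on the eventual test/set rework):

__clear_bit(vcpu_id, &vmask); /* takes a pointer, so vmask must be addressable */
vmask &= vmask - 1;           /* pure arithmetic; vmask can stay in a register */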

I'm not really a fan of the bithack.  Like the others, it's completely
unintuitive unless you know it.

However, the improvements speak for themselves, and in this one case the
best chance of people recognising the idiom is to write it out in full;
hiding any part behind a macro (of any name) obscures things further.

> But yes, that would eliminate the benefit of making the bit clearing
> independent of the ffsl() result. And personally I'm fine anyway with
> the form as suggested.
>
>> While I (the programmer) can reason the two expressions are equivalent,
>> the compiler can't,
> Why is it you think it can't? There's no further knowledge that you
> as a human need to rely on for this, afaics. If ffsl() uses the
> built-in (as it now does), the compiler has full insight into what's
> going on. It's just that compiler engineers may not deem it worth the
> effort to carry out such a special-case optimization.

On x86, it's a block of asm, not the builtin.

But even with the builtin, just because there is a builtin transforming
a common expression into efficient assembly doesn't mean there's
semantic reasoning about the result.

https://godbolt.org/z/eah1374a1

I can't find any way to get any compiler to reason about the bit in order
to avoid the shift (or rotate, for Clang).
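
The kind of test case behind that link looks roughly like this (an
assumed reconstruction, not necessarily the exact Godbolt contents):

/* Assumes x != 0, as the surrounding loop guarantees. */
unsigned long clear_lowest(unsigned long x)
{
    unsigned int bit = __builtin_ffsl(x) - 1;

    /* Compilers emit a variable-count shift (or rotate) for the line
     * below, rather than folding the whole thing to x & (x - 1). */
    return x & ~(1UL << bit);
}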

~Andrew
Jan Beulich June 10, 2024, 7:15 a.m. UTC | #3
On 07.06.2024 14:35, Andrew Cooper wrote:
> On 03/06/2024 10:19 pm, Jan Beulich wrote:
>> On 01.06.2024 20:50, Andrew Cooper wrote:
>>> One of the followon items I had from the bitops clean-up is this:
>>>
>>> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
>>> index 648d6dd475ba..9c3a017606ed 100644
>>> --- a/xen/arch/x86/mm.c
>>> +++ b/xen/arch/x86/mm.c
>>> @@ -3425,7 +3425,7 @@ static int vcpumask_to_pcpumask(
>>>              unsigned int cpu;
>>>  
>>>              vcpu_id = ffsl(vmask) - 1;
>>> -            vmask &= ~(1UL << vcpu_id);
>>> +            vmask &= vmask - 1;
>>>              vcpu_id += vcpu_bias;
>>>              if ( (vcpu_id >= d->max_vcpus) )
>>>                  return 0;
>>>
>>> which yields the following improvement:
>>>
>>>   add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-34 (-34)
>>>   Function                                     old     new   delta
>>>   vcpumask_to_pcpumask                         519     485     -34
>> Nice. At the risk of getting flamed again for raising dumb questions:
>> Considering that elsewhere "trickery" like the &= mask - 1 here were
>> deemed not nice to have (at least wanting to be hidden behind a
>> suitably named macro; see e.g. ISOLATE_LSB()), wouldn't __clear_bit()
>> be suitable here too, and less at risk of being considered "trickery"?
> 
> __clear_bit() is even worse, because it forces the bitmap to be spilled
> to memory.  It hopefully wont when I've given the test/set helpers the
> same treatment as ffs/fls.

Sorry, not directly related here: When you're saying "when I've given"
does that mean you'd like Oleksii's "xen: introduce generic non-atomic
test_*bit()" to not go in once at least an Arm ack has arrived?

Jan
Andrew Cooper June 10, 2024, 10:12 a.m. UTC | #4
On 10/06/2024 8:15 am, Jan Beulich wrote:
> On 07.06.2024 14:35, Andrew Cooper wrote:
>> On 03/06/2024 10:19 pm, Jan Beulich wrote:
>>> On 01.06.2024 20:50, Andrew Cooper wrote:
>>>> One of the followon items I had from the bitops clean-up is this:
>>>>
>>>> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
>>>> index 648d6dd475ba..9c3a017606ed 100644
>>>> --- a/xen/arch/x86/mm.c
>>>> +++ b/xen/arch/x86/mm.c
>>>> @@ -3425,7 +3425,7 @@ static int vcpumask_to_pcpumask(
>>>>              unsigned int cpu;
>>>>  
>>>>              vcpu_id = ffsl(vmask) - 1;
>>>> -            vmask &= ~(1UL << vcpu_id);
>>>> +            vmask &= vmask - 1;
>>>>              vcpu_id += vcpu_bias;
>>>>              if ( (vcpu_id >= d->max_vcpus) )
>>>>                  return 0;
>>>>
>>>> which yields the following improvement:
>>>>
>>>>   add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-34 (-34)
>>>>   Function                                     old     new   delta
>>>>   vcpumask_to_pcpumask                         519     485     -34
>>> Nice. At the risk of getting flamed again for raising dumb questions:
>>> Considering that elsewhere "trickery" like the &= mask - 1 here were
>>> deemed not nice to have (at least wanting to be hidden behind a
>>> suitably named macro; see e.g. ISOLATE_LSB()), wouldn't __clear_bit()
>>> be suitable here too, and less at risk of being considered "trickery"?
>> __clear_bit() is even worse, because it forces the bitmap to be spilled
>> to memory.  It hopefully wont when I've given the test/set helpers the
>> same treatment as ffs/fls.
> Sorry, not directly related here: When you're saying "when I've given"
> does that mean you'd like Oleksii's "xen: introduce generic non-atomic
> test_*bit()" to not go in once at least an Arm ack has arrived?

If we weren't deep in a code freeze, I'd be insisting on some changes in
that patch.

For now, I'll settle for not introducing regressions, so it needs at
least one more spin (there's a MISRA and UB regression I spotted, but I
haven't reviewed it in detail yet).

But yes - they're going to end up rather different when I've applied all
the compile time optimisations which are available.

~Andrew
Jan Beulich June 10, 2024, 11:44 a.m. UTC | #5
On 10.06.2024 12:12, Andrew Cooper wrote:
> On 10/06/2024 8:15 am, Jan Beulich wrote:
>> On 07.06.2024 14:35, Andrew Cooper wrote:
>>> On 03/06/2024 10:19 pm, Jan Beulich wrote:
>>>> On 01.06.2024 20:50, Andrew Cooper wrote:
>>>>> One of the followon items I had from the bitops clean-up is this:
>>>>>
>>>>> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
>>>>> index 648d6dd475ba..9c3a017606ed 100644
>>>>> --- a/xen/arch/x86/mm.c
>>>>> +++ b/xen/arch/x86/mm.c
>>>>> @@ -3425,7 +3425,7 @@ static int vcpumask_to_pcpumask(
>>>>>              unsigned int cpu;
>>>>>  
>>>>>              vcpu_id = ffsl(vmask) - 1;
>>>>> -            vmask &= ~(1UL << vcpu_id);
>>>>> +            vmask &= vmask - 1;
>>>>>              vcpu_id += vcpu_bias;
>>>>>              if ( (vcpu_id >= d->max_vcpus) )
>>>>>                  return 0;
>>>>>
>>>>> which yields the following improvement:
>>>>>
>>>>>   add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-34 (-34)
>>>>>   Function                                     old     new   delta
>>>>>   vcpumask_to_pcpumask                         519     485     -34
>>>> Nice. At the risk of getting flamed again for raising dumb questions:
>>>> Considering that elsewhere "trickery" like the &= mask - 1 here were
>>>> deemed not nice to have (at least wanting to be hidden behind a
>>>> suitably named macro; see e.g. ISOLATE_LSB()), wouldn't __clear_bit()
>>>> be suitable here too, and less at risk of being considered "trickery"?
>>> __clear_bit() is even worse, because it forces the bitmap to be spilled
>>> to memory.  It hopefully wont when I've given the test/set helpers the
>>> same treatment as ffs/fls.
>> Sorry, not directly related here: When you're saying "when I've given"
>> does that mean you'd like Oleksii's "xen: introduce generic non-atomic
>> test_*bit()" to not go in once at least an Arm ack has arrived?
> 
> If we weren't deep in a code freeze, I'd be insisting on some changes in
> that patch.
> 
> For now, I'll settle for not introducing regressions, so it needs at
> least one more spin (there's a MISRA and UB regression I spotted, but I
> haven't reviewed it in detail yet).

Did you point this[1] out to him? I've just checked, and I can't seem to
find any reply of yours on any of the respective sub-threads, which
formally still means the patch would be fine to go in as is once an Arm ack
arrives (taking the same approach as you did elsewhere wrt a PPC one). It's
just that informally at least I now know to wait ...

Jan

[1] It'll likely be embarrassing to learn what I've overlooked during the
many rounds of review.

> But yes - they're going to end up rather different when I've applied all
> the compile time optimisations which are available.
> 
> ~Andrew

Patch

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 648d6dd475ba..9c3a017606ed 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3425,7 +3425,7 @@  static int vcpumask_to_pcpumask(
             unsigned int cpu;
 
             vcpu_id = ffsl(vmask) - 1;
-            vmask &= ~(1UL << vcpu_id);
+            vmask &= vmask - 1;
             vcpu_id += vcpu_bias;
             if ( (vcpu_id >= d->max_vcpus) )
                 return 0;