diff mbox series

[3/4] x86/hvm: Rework hpet_write() for improved code generation

Message ID 20240827135746.1908070-4-andrew.cooper3@citrix.com (mailing list archive)
State New
Headers show
Series xen/bitops: More for_each_bit() conversions | expand

Commit Message

Andrew Cooper Aug. 27, 2024, 1:57 p.m. UTC
In the HPET_STATUS handling, the use of __clear_bit(i, &new_val) is the only
thing causing it to be spilled to the stack.  Furthemore we only care about
the bottom 3 bits, so rewrite it to be a plain for loop.

For the {start,stop}_timer variables, these are spilled to the stack despite
the __{set,clear}_bit() calls.  Again we only care about the bottom 3 bits, so
shrink the variables from long to int.  Use for_each_set_bit() rather than
opencoding it at the end which amongst other things means the loop predicate
is no longer forced to the stack by the loop body.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

All in all, it's modest according to bloat-o-meter:

  add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-29 (-29)
  Function                                     old     new   delta
  hpet_write                                  2225    2196     -29

but we have shrunk the stack frame by 8 bytes; 0x28 as opposed to 0x30 before.
---
 xen/arch/x86/hvm/hpet.c | 29 ++++++++---------------------
 1 file changed, 8 insertions(+), 21 deletions(-)

Comments

Jan Beulich Aug. 28, 2024, 8:13 a.m. UTC | #1
On 27.08.2024 15:57, Andrew Cooper wrote:
> In the HPET_STATUS handling, the use of __clear_bit(i, &new_val) is the only
> thing causing it to be spilled to the stack.  Furthemore we only care about
> the bottom 3 bits, so rewrite it to be a plain for loop.
> 
> For the {start,stop}_timer variables, these are spilled to the stack despite
> the __{set,clear}_bit() calls.

That's an observation from what the compiler happens to do? I don't see any
other reason why they would need spilling; I expect it's merely a matter of
registers better be used for other variables. If we ever meant to build Xen
with APX fully in use, that might change. IOW may I at least ask for
s/are/happen to be/? I'm also a little irritated by "despite", but you're
the native speaker. It would have seemed to me that e.g. "irrespective of"
would better express what (I think) is meant.

>  Again we only care about the bottom 3 bits, so
> shrink the variables from long to int.  Use for_each_set_bit() rather than
> opencoding it at the end which amongst other things means the loop predicate
> is no longer forced to the stack by the loop body.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> 
> All in all, it's modest according to bloat-o-meter:
> 
>   add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-29 (-29)
>   Function                                     old     new   delta
>   hpet_write                                  2225    2196     -29
> 
> but we have shrunk the stack frame by 8 bytes; 0x28 as opposed to 0x30 before.

However, on the negative side all the first of the loops you touch now always
takes 3 iterations, when previously we may have got away with as little as
none. Is there a reason not to use

    for_each_set_bit ( i, new_val & ((1U << HPET_TIMER_NUM) - 1) )

there (with the masking of the low bit possibly pulled out)?

> @@ -533,19 +528,11 @@ static int cf_check hpet_write(
>      }
>  
>      /* stop/start timers whos state was changed by this write. */
> -    while (stop_timers)
> -    {
> -        i = ffsl(stop_timers) - 1;
> -        __clear_bit(i, &stop_timers);
> +    for_each_set_bit ( i, stop_timers )
>          hpet_stop_timer(h, i, guest_time);
> -    }
>  
> -    while (start_timers)
> -    {
> -        i = ffsl(start_timers) - 1;
> -        __clear_bit(i, &start_timers);
> +    for_each_set_bit ( i, start_timers )
>          hpet_set_timer(h, i, guest_time);
> -    }

To avoid variable shadowing, I think you don't want to use i in these two
loops. Alternatively the function scope i would need constraining to the
individual loops.

Unrelated to the change you make, but related to the code you touch: Isn't
there a bug there with the length != 8 handling ahead of the switch()? The
bits being write-1-to-clear, using the value read for parts the original
insn didn't write means we might clear ISR bits we weren't asked to clear.
I guess I'll make a patch, which may want to go ahead of yours for ease of
backporting. (Of course guests should have no need to write to other than
the bottom part of the register, but still.)

Jan
Andrew Cooper Aug. 28, 2024, 5:50 p.m. UTC | #2
On 28/08/2024 9:13 am, Jan Beulich wrote:
> On 27.08.2024 15:57, Andrew Cooper wrote:
>> In the HPET_STATUS handling, the use of __clear_bit(i, &new_val) is the only
>> thing causing it to be spilled to the stack.  Furthemore we only care about
>> the bottom 3 bits, so rewrite it to be a plain for loop.
>>
>> For the {start,stop}_timer variables, these are spilled to the stack despite
>> the __{set,clear}_bit() calls.
> That's an observation from what the compiler happens to do? I don't see any
> other reason why they would need spilling; I expect it's merely a matter of
> registers better be used for other variables.

It is a consequence of how our helpers are written.  I do expect it to
improve when I get around to reworking them.

For example, the Linux helpers have enough constant folding capabilities
to allow the compiler to turn:

{
    int foo = 0;
    ...
    __set_bit(1, &foo);

into:

{
    int foo = 1;


as well as being able to emit LOCK AND/OR/XOR in place of LOCK BT{C,S,R}
for a constant bit position.

One thing I want to do, which I haven't figured out how to do yet, is to
allow the arch form to emit BT?Q forms.

Right now, code generation for PGC_* and PGT_* suffers quite a lot.  We
mix between reg/imm logic, then spill to the stack because top bits
aren't within range for the "I" constraint on 32-bit instructions, issue
a BT?L reg/mem (which has much higher latency than any other form), then
pick it back off the stack to do more reg/imm logic.

I was wondering if, because of the always_inline, I could do something
like __builtin_constant_p(bit) && __builtin_object_size(addr, 0) >= 8
and emitting long-granular logic, which will be able to pick the imm/reg
form rather than turning into reg/mem.

But, I've not had time to experiment here, and I doubt I'll get around
to it soon.

Another optimisation we're lacking vs Linux is that our test_bit() has a
volatile pointer where Linux's is non-volatile.  This makes a massive
difference for the ability to optimise looking at multiple bits.


>  If we ever meant to build Xen
> with APX fully in use, that might change. IOW may I at least ask for
> s/are/happen to be/? I'm also a little irritated by "despite", but you're
> the native speaker. It would have seemed to me that e.g. "irrespective of"
> would better express what (I think) is meant.

"despite" isn't really the right term, but I also wouldn't have said it
was something to be irritated over.

What I was trying to say was "they're spilled to the stack even with the
__set_bit() calls removed".  Which makes sense; they're values held for
almost the full duration of the function, that are not used in ~every
step of logic.

Interestingly, given that they're spilled to the stack, the __set_bit()
form is more efficient than the plain C "|= (1u << i);", but I'd still
like an implementation which could make that determination itself.

>
>>  Again we only care about the bottom 3 bits, so
>> shrink the variables from long to int.  Use for_each_set_bit() rather than
>> opencoding it at the end which amongst other things means the loop predicate
>> is no longer forced to the stack by the loop body.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>> CC: Jan Beulich <JBeulich@suse.com>
>> CC: Roger Pau Monné <roger.pau@citrix.com>
>>
>> All in all, it's modest according to bloat-o-meter:
>>
>>   add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-29 (-29)
>>   Function                                     old     new   delta
>>   hpet_write                                  2225    2196     -29
>>
>> but we have shrunk the stack frame by 8 bytes; 0x28 as opposed to 0x30 before.
> However, on the negative side all the first of the loops you touch now always
> takes 3 iterations, when previously we may have got away with as little as
> none. Is there a reason not to use
>
>     for_each_set_bit ( i, new_val & ((1U << HPET_TIMER_NUM) - 1) )
>
> there (with the masking of the low bit possibly pulled out)?

There are multiple angles here.

First, I got an unexpected surprise on ARM with an expression, and while
this one won't pick up pointer const-ness, I can never remember what
MISRA's view on this is.

Second, this is the odd-loop-out compared to rest of the function, which
are all of the form "for ( i = 0; i < HPET_TIMER_NUM ;".

But perhaps most importantly, OSes don't touch this register.  Xen not
at all, and Linux only in _hpet_print_config().  Neither bother
preserving/clearing it on suspend/resume, even when running the HPET in
legacy replacement mode.

I haven't checked windows behaviour, but I don't expect it to differ
here.  This register simply isn't interesting for the preferred type of
interrupts (edge), and also isn't useful for an ISR handling a line
interrupt.

So my choice was based on which produced the smallest code, because it's
an dead-in-practice codepath.

>
>> @@ -533,19 +528,11 @@ static int cf_check hpet_write(
>>      }
>>  
>>      /* stop/start timers whos state was changed by this write. */
>> -    while (stop_timers)
>> -    {
>> -        i = ffsl(stop_timers) - 1;
>> -        __clear_bit(i, &stop_timers);
>> +    for_each_set_bit ( i, stop_timers )
>>          hpet_stop_timer(h, i, guest_time);
>> -    }
>>  
>> -    while (start_timers)
>> -    {
>> -        i = ffsl(start_timers) - 1;
>> -        __clear_bit(i, &start_timers);
>> +    for_each_set_bit ( i, start_timers )
>>          hpet_set_timer(h, i, guest_time);
>> -    }
> To avoid variable shadowing, I think you don't want to use i in these two
> loops. Alternatively the function scope i would need constraining to the
> individual loops.

Yeah, I was bitten by that on one of the ARM patches.  I'll adjust.

~Andrew
Jan Beulich Aug. 29, 2024, 6:25 a.m. UTC | #3
On 28.08.2024 19:50, Andrew Cooper wrote:
> On 28/08/2024 9:13 am, Jan Beulich wrote:
>> On 27.08.2024 15:57, Andrew Cooper wrote:
>>> In the HPET_STATUS handling, the use of __clear_bit(i, &new_val) is the only
>>> thing causing it to be spilled to the stack.  Furthemore we only care about
>>> the bottom 3 bits, so rewrite it to be a plain for loop.
>>>
>>> For the {start,stop}_timer variables, these are spilled to the stack despite
>>> the __{set,clear}_bit() calls.
>> That's an observation from what the compiler happens to do? I don't see any
>> other reason why they would need spilling; I expect it's merely a matter of
>> registers better be used for other variables.
> 
> It is a consequence of how our helpers are written.  I do expect it to
> improve when I get around to reworking them.
> 
> For example, the Linux helpers have enough constant folding capabilities
> to allow the compiler to turn:
> 
> {
>     int foo = 0;
>     ...
>     __set_bit(1, &foo);
> 
> into:
> 
> {
>     int foo = 1;
> 
> 
> as well as being able to emit LOCK AND/OR/XOR in place of LOCK BT{C,S,R}
> for a constant bit position.
> 
> One thing I want to do, which I haven't figured out how to do yet, is to
> allow the arch form to emit BT?Q forms.
> 
> Right now, code generation for PGC_* and PGT_* suffers quite a lot.  We
> mix between reg/imm logic, then spill to the stack because top bits
> aren't within range for the "I" constraint on 32-bit instructions, issue
> a BT?L reg/mem (which has much higher latency than any other form), then
> pick it back off the stack to do more reg/imm logic.
> 
> I was wondering if, because of the always_inline, I could do something
> like __builtin_constant_p(bit) && __builtin_object_size(addr, 0) >= 8
> and emitting long-granular logic, which will be able to pick the imm/reg
> form rather than turning into reg/mem.

That may work, provided there actually was always_inline.

>>  If we ever meant to build Xen
>> with APX fully in use, that might change. IOW may I at least ask for
>> s/are/happen to be/? I'm also a little irritated by "despite", but you're
>> the native speaker. It would have seemed to me that e.g. "irrespective of"
>> would better express what (I think) is meant.
> 
> "despite" isn't really the right term, but I also wouldn't have said it
> was something to be irritated over.
> 
> What I was trying to say was "they're spilled to the stack even with the
> __set_bit() calls removed".  Which makes sense; they're values held for
> almost the full duration of the function, that are not used in ~every
> step of logic.

Right, the "not a good use for a register var" reason that I had alluded to.

>>>  Again we only care about the bottom 3 bits, so
>>> shrink the variables from long to int.  Use for_each_set_bit() rather than
>>> opencoding it at the end which amongst other things means the loop predicate
>>> is no longer forced to the stack by the loop body.
>>>
>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>> ---
>>> CC: Jan Beulich <JBeulich@suse.com>
>>> CC: Roger Pau Monné <roger.pau@citrix.com>
>>>
>>> All in all, it's modest according to bloat-o-meter:
>>>
>>>   add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-29 (-29)
>>>   Function                                     old     new   delta
>>>   hpet_write                                  2225    2196     -29
>>>
>>> but we have shrunk the stack frame by 8 bytes; 0x28 as opposed to 0x30 before.
>> However, on the negative side all the first of the loops you touch now always
>> takes 3 iterations, when previously we may have got away with as little as
>> none. Is there a reason not to use
>>
>>     for_each_set_bit ( i, new_val & ((1U << HPET_TIMER_NUM) - 1) )
>>
>> there (with the masking of the low bit possibly pulled out)?
> 
> There are multiple angles here.
> 
> First, I got an unexpected surprise on ARM with an expression, and while
> this one won't pick up pointer const-ness, I can never remember what
> MISRA's view on this is.
> 
> Second, this is the odd-loop-out compared to rest of the function, which
> are all of the form "for ( i = 0; i < HPET_TIMER_NUM ;".
> 
> But perhaps most importantly, OSes don't touch this register.  Xen not
> at all, and Linux only in _hpet_print_config().  Neither bother
> preserving/clearing it on suspend/resume, even when running the HPET in
> legacy replacement mode.
> 
> I haven't checked windows behaviour, but I don't expect it to differ
> here.  This register simply isn't interesting for the preferred type of
> interrupts (edge), and also isn't useful for an ISR handling a line
> interrupt.

Yet there must have been an environment where the register is of use, or
else Roger wouldn't have been prompted to make what is now be07023be115
("x86/vhpet: add support for level triggered interrupts").

Jan
diff mbox series

Patch

diff --git a/xen/arch/x86/hvm/hpet.c b/xen/arch/x86/hvm/hpet.c
index 87642575f9cd..e3981d5e467c 100644
--- a/xen/arch/x86/hvm/hpet.c
+++ b/xen/arch/x86/hvm/hpet.c
@@ -349,8 +349,7 @@  static int cf_check hpet_write(
     unsigned int tn, i;
 
     /* Acculumate a bit mask of timers whos state is changed by this write. */
-    unsigned long start_timers = 0;
-    unsigned long stop_timers  = 0;
+    unsigned int start_timers = 0, stop_timers = 0;
 #define set_stop_timer(n)    (__set_bit((n), &stop_timers))
 #define set_start_timer(n)   (__set_bit((n), &start_timers))
 #define set_restart_timer(n) (set_stop_timer(n),set_start_timer(n))
@@ -405,16 +404,12 @@  static int cf_check hpet_write(
 
     case HPET_STATUS:
         /* write 1 to clear. */
-        while ( new_val )
+        for ( i = 0; i < HPET_TIMER_NUM; i++ )
         {
-            bool active;
-
-            i = ffsl(new_val) - 1;
-            if ( i >= HPET_TIMER_NUM )
-                break;
-            __clear_bit(i, &new_val);
-            active = __test_and_clear_bit(i, &h->hpet.isr);
-            if ( active )
+            if ( !(new_val & (1U << i)) )
+                continue;
+
+            if ( __test_and_clear_bit(i, &h->hpet.isr) )
             {
                 hvm_ioapic_deassert(v->domain, timer_int_route(h, i));
                 if ( hpet_enabled(h) && timer_enabled(h, i) &&
@@ -533,19 +528,11 @@  static int cf_check hpet_write(
     }
 
     /* stop/start timers whos state was changed by this write. */
-    while (stop_timers)
-    {
-        i = ffsl(stop_timers) - 1;
-        __clear_bit(i, &stop_timers);
+    for_each_set_bit ( i, stop_timers )
         hpet_stop_timer(h, i, guest_time);
-    }
 
-    while (start_timers)
-    {
-        i = ffsl(start_timers) - 1;
-        __clear_bit(i, &start_timers);
+    for_each_set_bit ( i, start_timers )
         hpet_set_timer(h, i, guest_time);
-    }
 
 #undef set_stop_timer
 #undef set_start_timer