mbox series

[v2,00/12] VMX/VSX instructions with gvec

Message ID 20221010191356.83659-1-lucas.araujo@eldorado.org.br (mailing list archive)
Headers show
Series VMX/VSX instructions with gvec | expand

Message

Lucas Mateus Martins Araujo e Castro Oct. 10, 2022, 7:13 p.m. UTC
From: "Lucas Mateus Castro (alqotel)" <lucas.araujo@eldorado.org.br>

Patches missing review: 3,5,9,11,12

v1 -> v2:
    - Implemented instructions with fni4/fni8 and dropped the helper:
        * VSUBCUW
        * VADDCUW
        * VPRTYBW
        * VPRTYBD
    - Reworked patch12 to only use gvec implementation with a few
      immediates.
    - Used bitsel_ver on patch9
    - Changed vec variables to tcg_constant_vec when possible

This patch series moves some instructions from decode legacy to
decodetree and translate said instructions with gvec. Some cases using
gvec ended up with a bigger, more complex and slower so those
instructions were only moved to decodetree.

In each patch there's a comparison of the execution time before the
patch being applied and after. Said result is the sum of 10 executions.

The program used to time the execution worked like this:

clock_t start = clock();
for (int i = 0; i < LOOP; i++) {
    asm (
         load values in registers, between 2 and 3 instructions
         ".rept REPT\n\t"
         "INSTRUCTION registers\n\t"
         ".endr\n\t"
         save result from register, 1 instruction
    );
}
clock_t end = clock();
printf("INSTRUCTION rept=REPT loop=LOOP, time taken: %.12lf\n",
       ((double)(end - start))/ CLOCKS_PER_SEC);

Where the column rept in the value used in .rept in the inline assembly
and loop column is the value used for the for loop. All of those tests
were executed on a Power9. When comparing the TCGop the data used was
gathered using '-d op' and '-d op_opt'.

Lucas Mateus Castro (alqotel) (12):
  target/ppc: Moved VMLADDUHM to decodetree and use gvec
  target/ppc: Move VMH[R]ADDSHS instruction to decodetree
  target/ppc: Move V(ADD|SUB)CUW to decodetree and use gvec
  target/ppc: Move VNEG[WD] to decodtree and use gvec
  target/ppc: Move VPRTYB[WDQ] to decodetree and use gvec
  target/ppc: Move VAVG[SU][BHW] to decodetree and use gvec
  target/ppc: Move VABSDU[BHW] to decodetree and use gvec
  target/ppc: Use gvec to decode XV[N]ABS[DS]P/XVNEG[DS]P
  target/ppc: Use gvec to decode XVCPSGN[SD]P
  target/ppc: Moved XVTSTDC[DS]P to decodetree
  target/ppc: Moved XSTSTDC[QDS]P to decodetree
  target/ppc: Use gvec to decode XVTSTDC[DS]P

 target/ppc/fpu_helper.c             | 140 +++++-----
 target/ppc/helper.h                 |  42 ++-
 target/ppc/insn32.decode            |  50 ++++
 target/ppc/int_helper.c             | 107 ++------
 target/ppc/translate.c              |   1 -
 target/ppc/translate/vmx-impl.c.inc | 364 +++++++++++++++++++++----
 target/ppc/translate/vmx-ops.c.inc  |  15 +-
 target/ppc/translate/vsx-impl.c.inc | 394 +++++++++++++++++++++++-----
 target/ppc/translate/vsx-ops.c.inc  |  21 --
 9 files changed, 808 insertions(+), 326 deletions(-)

Comments

Richard Henderson Oct. 10, 2022, 7:31 p.m. UTC | #1
On 10/10/22 12:13, Lucas Mateus Castro(alqotel) wrote:
> From: "Lucas Mateus Castro (alqotel)"<lucas.araujo@eldorado.org.br>
> 
> Moved XSTSTDCSP, XSTSTDCDP and XSTSTDCQP to decodetree and moved some of
> its decoding away from the helper as previously the DCMX, XB and BF were
> calculated in the helper with the help of cpu_env, now that part was
> moved to the decodetree with the rest.
> 
> xvtstdcsp:
> rept    loop    master             patch
> 8       12500   1,85393600         1,94683600 (+5.0%)
> 25      4000    1,78779800         1,92479000 (+7.7%)
> 100     1000    2,12775000         2,28895500 (+7.6%)
> 500     200     2,99655300         3,23102900 (+7.8%)
> 2500    40      6,89082200         7,44827500 (+8.1%)
> 8000    12     17,50585500        18,95152100 (+8.3%)
> 
> xvtstdcdp:
> rept    loop    master             patch
> 8       12500   1,39043100         1,33539800 (-4.0%)
> 25      4000    1,35731800         1,37347800 (+1.2%)
> 100     1000    1,51514800         1,56053000 (+3.0%)
> 500     200     2,21014400         2,47906000 (+12.2%)
> 2500    40      5,39488200         6,68766700 (+24.0%)
> 8000    12     13,98623900        18,17661900 (+30.0%)
> 
> xvtstdcdp:
> rept    loop    master             patch
> 8       12500   1,35123800         1,34455800 (-0.5%)
> 25      4000    1,36441200         1,36759600 (+0.2%)
> 100     1000    1,49763500         1,54138400 (+2.9%)
> 500     200     2,19020200         2,46196400 (+12.4%)
> 2500    40      5,39265700         6,68147900 (+23.9%)
> 8000    12     14,04163600        18,19669600 (+29.6%)
> 
> As some values are now decoded outside the helper and passed to it as an
> argument the number of arguments of the helper increased, the number
> of TCGop needed to load the arguments increased. I suspect that's why
> the slow-down in the tests with a high REPT but low LOOP.
> 
> Signed-off-by: Lucas Mateus Castro (alqotel)<lucas.araujo@eldorado.org.br>
> ---
>   target/ppc/fpu_helper.c             | 114 +++++++++-------------------
>   target/ppc/helper.h                 |   6 +-
>   target/ppc/insn32.decode            |   6 ++
>   target/ppc/translate/vsx-impl.c.inc |  20 ++++-
>   target/ppc/translate/vsx-ops.c.inc  |   4 -
>   5 files changed, 60 insertions(+), 90 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~
Lucas Mateus Martins Araujo e Castro Oct. 10, 2022, 7:53 p.m. UTC | #2
On 10/10/2022 16:42, Richard Henderson wrote:
>  
> On 10/10/22 12:13, Lucas Mateus Castro(alqotel) wrote:
>> +/* test if +Inf or -Inf */
>> +static void gen_is_any_inf(unsigned vece, TCGv_vec t, TCGv_vec b)
>> +{
>> +    uint64_t exp_msk = (vece == MO_32) ? (uint32_t)EXP_MASK_SP : 
>> EXP_MASK_DP;
>> +    uint64_t sgn_msk = (vece == MO_32) ? (uint32_t)SGN_MASK_SP : 
>> SGN_MASK_DP;
>> +    tcg_gen_andc_vec(vece, b, b, tcg_constant_vec_matching(t, vece, 
>> exp_msk));
>> +    tcg_gen_cmp_vec(TCG_COND_EQ, vece, t, b,
>> +                    tcg_constant_vec_matching(t, vece, sgn_msk));
>> +}
> 
> Should be clearing sign and comparing exp, not the other way.
Yeah this was a mistake, I'll fix it in the next version.
Kind of weird that my tests didn't caught this, probably should test 
that the '.out' risu file I'm using actually test every immediate value.
> 
>> +    GVecGen2 op = {
>> +        .fno = (vece == MO_32) ? gen_helper_XVTSTDCSP : 
>> gen_helper_XVTSTDCDP,
>> +        .vece = vece,
>> +        .opt_opc = vecop_list
>>       };
>>
>>       REQUIRE_VSX(ctx);
>>
>> -    tcg_gen_gvec_2i(vsr_full_offset(a->xt), vsr_full_offset(a->xb),
>> -                    16, 16, (int32_t)(a->uim), &op[vece - MO_32]);
>> +    switch (a->uim) {
>> +    case 0:
>> +        set_cpu_vsr(a->xt, tcg_constant_i64(0), true);
>> +        set_cpu_vsr(a->xt, tcg_constant_i64(0), false);
>> +        break;
>> +    case ((1 << 0) | (1 << 1)):
>> +        /* test if +Denormal or -Denormal */
>> +        op.fniv = gen_is_any_denormal,
>> +        tcg_gen_gvec_2(vsr_full_offset(a->xt), 
>> vsr_full_offset(a->xb), 16, 16,
>> +                       &op);
> 
> This default setting of .fno doesn't work, because the helper requires 
> simd_data set,
> which this invocation via tcg_gen_gvec_2 will not provide.
> 
> You could fix this by using GVecGen2i and tcg_gen_gvec_2i, and ignoring 
> the immediate
And I can remove the new #include from int_helper which was bothering me.
> argument added to the functions above.  Which also means...
> 
>> +    case (1 << 0):
>> +        /* test if -Denormal */
>> +        op.fniv = gen_is_neg_denormal,
>> +        tcg_gen_gvec_2(vsr_full_offset(a->xt), 
>> vsr_full_offset(a->xb), 16, 16,
>> +                       &op);
>> +        break;
>> +    case (1 << 1):
>> +        /* test if +Denormal */
>> +        op.fniv = gen_is_pos_denormal,
>> +        tcg_gen_gvec_2(vsr_full_offset(a->xt), 
>> vsr_full_offset(a->xb), 16, 16,
>> +                       &op);
>> +        break;
>> +    case ((1 << 2) | (1 << 3)):
>> +        /* test if +0 or -0 */
>> +        op.fniv = gen_is_any_zero,
>> +        tcg_gen_gvec_2(vsr_full_offset(a->xt), 
>> vsr_full_offset(a->xb), 16, 16,
>> +                       &op);
>> +        break;
>> +    case (1 << 2):
>> +        /* test if -0 */
>> +        op.fniv = gen_is_neg_zero,
>> +        tcg_gen_gvec_2(vsr_full_offset(a->xt), 
>> vsr_full_offset(a->xb), 16, 16,
>> +                       &op);
>> +        break;
>> +    case (1 << 3):
>> +        /* test if +0 */
>> +        op.fniv = gen_is_pos_zero,
>> +        tcg_gen_gvec_2(vsr_full_offset(a->xt), 
>> vsr_full_offset(a->xb), 16, 16,
>> +                       &op);
>> +        break;
>> +    case ((1 << 4) | (1 << 5)):
>> +        /* test if +Inf or -Inf */
>> +        op.fniv = gen_is_any_inf,
>> +        tcg_gen_gvec_2(vsr_full_offset(a->xt), 
>> vsr_full_offset(a->xb), 16, 16,
>> +                       &op);
>> +        break;
>> +    case (1 << 4):
>> +        /* test if -Inf */
>> +        op.fniv = gen_is_neg_inf,
>> +        tcg_gen_gvec_2(vsr_full_offset(a->xt), 
>> vsr_full_offset(a->xb), 16, 16,
>> +                       &op);
>> +        break;
>> +    case (1 << 5):
>> +        /* test if +Inf */
>> +        op.fniv = gen_is_pos_inf,
>> +        tcg_gen_gvec_2(vsr_full_offset(a->xt), 
>> vsr_full_offset(a->xb), 16, 16,
>> +                       &op);
>> +        break;
>> +    case (1 << 6):
>> +        /* test if NaN */
>> +        op.fniv = gen_is_nan,
>> +        tcg_gen_gvec_2(vsr_full_offset(a->xt), 
>> vsr_full_offset(a->xb), 16, 16,
>> +                       &op);
>> +        break;
>> +    default:
>> +        tcg_gen_gvec_2_ool(vsr_full_offset(a->xt), 
>> vsr_full_offset(a->xb), 16,
>> +                           16, (int32_t)(a->uim), op.fno);
>> +        break;
> 
> You can have only the store to op.fniv in the switch, remove the 
> default, and have a
> common call to tcg_gen_gvec_2i after the switch.
> 
> 
> r~
I'll send a v3 with these changes.