[v4,4/9] target-ppc: improve lxvw4x implementation

Message ID	1475040687-27523-5-git-send-email-nikunj@linux.vnet.ibm.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org> Gateway: Authorized Use Only! Violators will be prosecuted for <qemu-devel@nongnu.org> from <nikunj@linux.vnet.ibm.com>; Wed, 28 Sep 2016 15:31:45 +1000 Gateway: Authorized Use Only! Violators will be prosecuted; Wed, 28 Sep 2016 15:31:42 +1000 From: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> To: qemu-ppc@nongnu.org, david@gibson.dropbear.id.au, rth@twiddle.net Date: Wed, 28 Sep 2016 11:01:22 +0530 In-Reply-To: <1475040687-27523-1-git-send-email-nikunj@linux.vnet.ibm.com> References: <1475040687-27523-1-git-send-email-nikunj@linux.vnet.ibm.com> Message-Id: <1475040687-27523-5-git-send-email-nikunj@linux.vnet.ibm.com> Subject: [Qemu-devel] [PATCH v4 4/9] target-ppc: improve lxvw4x implementation Precedence: list Cc: qemu-devel@nongnu.org, nikunj@linux.vnet.ibm.com Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org>

Nikunj A. Dadhania Sept. 28, 2016, 5:31 a.m. UTC

Load 8byte at a time and manipulate.

Big-Endian Storage
+-------------+-------------+-------------+-------------+
| 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
+-------------+-------------+-------------+-------------+

Little-Endian Storage
+-------------+-------------+-------------+-------------+
| 33 22 11 00 | 77 66 55 44 | BB AA 99 88 | FF EE DD CC |
+-------------+-------------+-------------+-------------+

Vector load results in:
+-------------+-------------+-------------+-------------+
| 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
+-------------+-------------+-------------+-------------+

Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
---
 target-ppc/translate/vsx-impl.inc.c | 33 +++++++++++++++++++--------------
 1 file changed, 19 insertions(+), 14 deletions(-)

Richard Henderson Sept. 28, 2016, 4:07 p.m. UTC | #1

On 09/27/2016 10:31 PM, Nikunj A Dadhania wrote:
> Load 8byte at a time and manipulate.
> 
> Big-Endian Storage
> +-------------+-------------+-------------+-------------+
> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
> +-------------+-------------+-------------+-------------+
> 
> Little-Endian Storage
> +-------------+-------------+-------------+-------------+
> | 33 22 11 00 | 77 66 55 44 | BB AA 99 88 | FF EE DD CC |
> +-------------+-------------+-------------+-------------+
> 
> Vector load results in:
> +-------------+-------------+-------------+-------------+
> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
> +-------------+-------------+-------------+-------------+
> 
> Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
> ---
>  target-ppc/translate/vsx-impl.inc.c | 33 +++++++++++++++++++--------------
>  1 file changed, 19 insertions(+), 14 deletions(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

David Gibson Sept. 29, 2016, 1:38 a.m. UTC | #2

On Wed, Sep 28, 2016 at 11:01:22AM +0530, Nikunj A Dadhania wrote:
> Load 8byte at a time and manipulate.
> 
> Big-Endian Storage
> +-------------+-------------+-------------+-------------+
> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
> +-------------+-------------+-------------+-------------+
> 
> Little-Endian Storage
> +-------------+-------------+-------------+-------------+
> | 33 22 11 00 | 77 66 55 44 | BB AA 99 88 | FF EE DD CC |
> +-------------+-------------+-------------+-------------+
> 
> Vector load results in:
> +-------------+-------------+-------------+-------------+
> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
> +-------------+-------------+-------------+-------------+

Ok.  I'm guessing from this that implementing those GPR<->VSR
instructions showed that the earlier versions were endian-incorrect as
I suspected.

Have you verified that this new implementation is actually faster (or
at least no slower) on LE than the original implementation with
individual 32-bit stores?

> Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
> ---
>  target-ppc/translate/vsx-impl.inc.c | 33 +++++++++++++++++++--------------
>  1 file changed, 19 insertions(+), 14 deletions(-)
> 
> diff --git a/target-ppc/translate/vsx-impl.inc.c b/target-ppc/translate/vsx-impl.inc.c
> index 74d0533..1eca042 100644
> --- a/target-ppc/translate/vsx-impl.inc.c
> +++ b/target-ppc/translate/vsx-impl.inc.c
> @@ -75,7 +75,6 @@ static void gen_lxvdsx(DisasContext *ctx)
>  static void gen_lxvw4x(DisasContext *ctx)
>  {
>      TCGv EA;
> -    TCGv_i64 tmp;
>      TCGv_i64 xth = cpu_vsrh(xT(ctx->opcode));
>      TCGv_i64 xtl = cpu_vsrl(xT(ctx->opcode));
>      if (unlikely(!ctx->vsx_enabled)) {
> @@ -84,22 +83,28 @@ static void gen_lxvw4x(DisasContext *ctx)
>      }
>      gen_set_access_type(ctx, ACCESS_INT);
>      EA = tcg_temp_new();
> -    tmp = tcg_temp_new_i64();
>  
>      gen_addr_reg_index(ctx, EA);
> -    gen_qemu_ld32u_i64(ctx, tmp, EA);
> -    tcg_gen_addi_tl(EA, EA, 4);
> -    gen_qemu_ld32u_i64(ctx, xth, EA);
> -    tcg_gen_deposit_i64(xth, xth, tmp, 32, 32);
> -
> -    tcg_gen_addi_tl(EA, EA, 4);
> -    gen_qemu_ld32u_i64(ctx, tmp, EA);
> -    tcg_gen_addi_tl(EA, EA, 4);
> -    gen_qemu_ld32u_i64(ctx, xtl, EA);
> -    tcg_gen_deposit_i64(xtl, xtl, tmp, 32, 32);
> -
> +    if (ctx->le_mode) {
> +        TCGv_i64 t0, t1;
> +
> +        t0 = tcg_temp_new_i64();
> +        t1 = tcg_temp_new_i64();
> +        tcg_gen_qemu_ld_i64(t0, EA, ctx->mem_idx, MO_LEQ);
> +        tcg_gen_shri_i64(t1, t0, 32);
> +        tcg_gen_deposit_i64(xth, t1, t0, 32, 32);
> +        tcg_gen_addi_tl(EA, EA, 8);
> +        tcg_gen_qemu_ld_i64(t0, EA, ctx->mem_idx, MO_LEQ);
> +        tcg_gen_shri_i64(t1, t0, 32);
> +        tcg_gen_deposit_i64(xtl, t1, t0, 32, 32);
> +        tcg_temp_free_i64(t0);
> +        tcg_temp_free_i64(t1);
> +    } else {
> +        tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_BEQ);
> +        tcg_gen_addi_tl(EA, EA, 8);
> +        tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_BEQ);
> +    }
>      tcg_temp_free(EA);
> -    tcg_temp_free_i64(tmp);
>  }
>  
>  #define VSX_STORE_SCALAR(name, operation)                     \

Nikunj A. Dadhania Sept. 29, 2016, 2:34 a.m. UTC | #3

David Gibson <david@gibson.dropbear.id.au> writes:

> [ Unknown signature status ]
> On Wed, Sep 28, 2016 at 11:01:22AM +0530, Nikunj A Dadhania wrote:
>> Load 8byte at a time and manipulate.
>> 
>> Big-Endian Storage
>> +-------------+-------------+-------------+-------------+
>> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
>> +-------------+-------------+-------------+-------------+
>> 
>> Little-Endian Storage
>> +-------------+-------------+-------------+-------------+
>> | 33 22 11 00 | 77 66 55 44 | BB AA 99 88 | FF EE DD CC |
>> +-------------+-------------+-------------+-------------+
>> 
>> Vector load results in:
>> +-------------+-------------+-------------+-------------+
>> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
>> +-------------+-------------+-------------+-------------+
>
> Ok.  I'm guessing from this that implementing those GPR<->VSR
> instructions showed that the earlier versions were endian-incorrect as
> I suspected.
>
> Have you verified that this new implementation is actually faster (or
> at least no slower) on LE than the original implementation with
> individual 32-bit stores?

I haven't, will check it once and get back.

Regards
Nikunj

Nikunj A. Dadhania Sept. 29, 2016, 3:41 a.m. UTC | #4

David Gibson <david@gibson.dropbear.id.au> writes:

> [ Unknown signature status ]
> On Wed, Sep 28, 2016 at 11:01:22AM +0530, Nikunj A Dadhania wrote:
>> Load 8byte at a time and manipulate.
>> 
>> Big-Endian Storage
>> +-------------+-------------+-------------+-------------+
>> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
>> +-------------+-------------+-------------+-------------+
>> 
>> Little-Endian Storage
>> +-------------+-------------+-------------+-------------+
>> | 33 22 11 00 | 77 66 55 44 | BB AA 99 88 | FF EE DD CC |
>> +-------------+-------------+-------------+-------------+
>> 
>> Vector load results in:
>> +-------------+-------------+-------------+-------------+
>> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
>> +-------------+-------------+-------------+-------------+
>
> Ok.  I'm guessing from this that implementing those GPR<->VSR
> instructions showed that the earlier versions were endian-incorrect as
> I suspected.
>
> Have you verified that this new implementation is actually faster (or
> at least no slower) on LE than the original implementation with
> individual 32-bit stores?

Result of million lxvw4x, mfvsrd/mfvsrld and print

Without patch:
==============
[tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le  -cpu POWER9 le_lxvw4x  >/dev/null
real	0m2.812s
user	0m2.792s
sys	0m0.020s
[tcg_test]$

With patch:
===========
[tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le  -cpu POWER9 le_lxvw4x  >/dev/null
real	0m2.801s
user	0m2.783s
sys	0m0.018s
[tcg_test]$

Not much perceivable difference, is there a better way to benchmark?

Regards
Nikunj

Richard Henderson Sept. 29, 2016, 3:48 a.m. UTC | #5

On 09/28/2016 08:41 PM, Nikunj A Dadhania wrote:
> Without patch:
> ==============
> [tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le  -cpu POWER9 le_lxvw4x  >/dev/null
> real	0m2.812s
> user	0m2.792s
> sys	0m0.020s
> [tcg_test]$
>
> With patch:
> ===========
> [tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le  -cpu POWER9 le_lxvw4x  >/dev/null
> real	0m2.801s
> user	0m2.783s
> sys	0m0.018s
> [tcg_test]$
>
> Not much perceivable difference, is there a better way to benchmark?

There should be more of a difference for softmmu, since the tlb lookup for the 
memory is more expensive.


r~

David Gibson Sept. 29, 2016, 3:55 a.m. UTC | #6

On Thu, Sep 29, 2016 at 09:11:10AM +0530, Nikunj A Dadhania wrote:
> David Gibson <david@gibson.dropbear.id.au> writes:
> 
> > [ Unknown signature status ]
> > On Wed, Sep 28, 2016 at 11:01:22AM +0530, Nikunj A Dadhania wrote:
> >> Load 8byte at a time and manipulate.
> >> 
> >> Big-Endian Storage
> >> +-------------+-------------+-------------+-------------+
> >> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
> >> +-------------+-------------+-------------+-------------+
> >> 
> >> Little-Endian Storage
> >> +-------------+-------------+-------------+-------------+
> >> | 33 22 11 00 | 77 66 55 44 | BB AA 99 88 | FF EE DD CC |
> >> +-------------+-------------+-------------+-------------+
> >> 
> >> Vector load results in:
> >> +-------------+-------------+-------------+-------------+
> >> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
> >> +-------------+-------------+-------------+-------------+
> >
> > Ok.  I'm guessing from this that implementing those GPR<->VSR
> > instructions showed that the earlier versions were endian-incorrect as
> > I suspected.
> >
> > Have you verified that this new implementation is actually faster (or
> > at least no slower) on LE than the original implementation with
> > individual 32-bit stores?
> 
> Result of million lxvw4x, mfvsrd/mfvsrld and print
> 
> Without patch:
> ==============
> [tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le  -cpu POWER9 le_lxvw4x  >/dev/null
> real	0m2.812s
> user	0m2.792s
> sys	0m0.020s
> [tcg_test]$
> 
> With patch:
> ===========
> [tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le  -cpu POWER9 le_lxvw4x  >/dev/null
> real	0m2.801s
> user	0m2.783s
> sys	0m0.018s
> [tcg_test]$
> 
> Not much perceivable difference, is there a better way to benchmark?

Not dramatically, that I can think of.  A few tweaks you can make:
    * Increase the loop counter so the test simply runs for longer
    * Also run the test multiple times, so you can get an idea of how
      much the results vary from one run to another
    * Run the test on a system that's as idle of other activity as you
      can make it (at both host and guest level).

For out purposes the user time is probably the meaningful thing here,
and should show less variance than the system and real time.

Note that it would be interesting to get these results for both a
power and x86 host.

In any case the results above are enough to convince me that the
change isn't likely to be a significant regression.

David Gibson Sept. 29, 2016, 3:57 a.m. UTC | #7

On Wed, Sep 28, 2016 at 08:48:54PM -0700, Richard Henderson wrote:
> On 09/28/2016 08:41 PM, Nikunj A Dadhania wrote:
> > Without patch:
> > ==============
> > [tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le  -cpu POWER9 le_lxvw4x  >/dev/null
> > real	0m2.812s
> > user	0m2.792s
> > sys	0m0.020s
> > [tcg_test]$
> > 
> > With patch:
> > ===========
> > [tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le  -cpu POWER9 le_lxvw4x  >/dev/null
> > real	0m2.801s
> > user	0m2.783s
> > sys	0m0.018s
> > [tcg_test]$
> > 
> > Not much perceivable difference, is there a better way to benchmark?
> 
> There should be more of a difference for softmmu, since the tlb lookup for
> the memory is more expensive.

Good point.  Oh.. also, I'd remove the prints from the benchmark for
this purpose.  The time involved in the syscalls and whatnot for the
print will just add noise to the measurement (sending to /dev/null
reduces the impact, but it's probably still significant compared to a
simple math operation).

[v4,4/9] target-ppc: improve lxvw4x implementation

Commit Message

Comments

Patch