| Message ID | 1475040687-27523-5-git-send-email-nikunj@linux.vnet.ibm.com (mailing list archive) |
|---|---|
| State | New, archived |
On 09/27/2016 10:31 PM, Nikunj A Dadhania wrote:
> Load 8byte at a time and manipulate.
>
> Big-Endian Storage
> +-------------+-------------+-------------+-------------+
> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
> +-------------+-------------+-------------+-------------+
>
> Little-Endian Storage
> +-------------+-------------+-------------+-------------+
> | 33 22 11 00 | 77 66 55 44 | BB AA 99 88 | FF EE DD CC |
> +-------------+-------------+-------------+-------------+
>
> Vector load results in:
> +-------------+-------------+-------------+-------------+
> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
> +-------------+-------------+-------------+-------------+
>
> Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
> ---
>  target-ppc/translate/vsx-impl.inc.c | 33 +++++++++++++++++++--------------
>  1 file changed, 19 insertions(+), 14 deletions(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>

r~
On Wed, Sep 28, 2016 at 11:01:22AM +0530, Nikunj A Dadhania wrote:
> Load 8byte at a time and manipulate.
>
> Big-Endian Storage
> +-------------+-------------+-------------+-------------+
> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
> +-------------+-------------+-------------+-------------+
>
> Little-Endian Storage
> +-------------+-------------+-------------+-------------+
> | 33 22 11 00 | 77 66 55 44 | BB AA 99 88 | FF EE DD CC |
> +-------------+-------------+-------------+-------------+
>
> Vector load results in:
> +-------------+-------------+-------------+-------------+
> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
> +-------------+-------------+-------------+-------------+

Ok.  I'm guessing from this that implementing those GPR<->VSR
instructions showed that the earlier versions were endian-incorrect as
I suspected.

Have you verified that this new implementation is actually faster (or
at least no slower) on LE than the original implementation with
individual 32-bit stores?

> Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
> ---
>  target-ppc/translate/vsx-impl.inc.c | 33 +++++++++++++++++++--------------
>  1 file changed, 19 insertions(+), 14 deletions(-)
>
> diff --git a/target-ppc/translate/vsx-impl.inc.c b/target-ppc/translate/vsx-impl.inc.c
> index 74d0533..1eca042 100644
> --- a/target-ppc/translate/vsx-impl.inc.c
> +++ b/target-ppc/translate/vsx-impl.inc.c
> @@ -75,7 +75,6 @@ static void gen_lxvdsx(DisasContext *ctx)
>  static void gen_lxvw4x(DisasContext *ctx)
>  {
>      TCGv EA;
> -    TCGv_i64 tmp;
>      TCGv_i64 xth = cpu_vsrh(xT(ctx->opcode));
>      TCGv_i64 xtl = cpu_vsrl(xT(ctx->opcode));
>      if (unlikely(!ctx->vsx_enabled)) {
> @@ -84,22 +83,28 @@ static void gen_lxvw4x(DisasContext *ctx)
>      }
>      gen_set_access_type(ctx, ACCESS_INT);
>      EA = tcg_temp_new();
> -    tmp = tcg_temp_new_i64();
>
>      gen_addr_reg_index(ctx, EA);
> -    gen_qemu_ld32u_i64(ctx, tmp, EA);
> -    tcg_gen_addi_tl(EA, EA, 4);
> -    gen_qemu_ld32u_i64(ctx, xth, EA);
> -    tcg_gen_deposit_i64(xth, xth, tmp, 32, 32);
> -
> -    tcg_gen_addi_tl(EA, EA, 4);
> -    gen_qemu_ld32u_i64(ctx, tmp, EA);
> -    tcg_gen_addi_tl(EA, EA, 4);
> -    gen_qemu_ld32u_i64(ctx, xtl, EA);
> -    tcg_gen_deposit_i64(xtl, xtl, tmp, 32, 32);
> -
> +    if (ctx->le_mode) {
> +        TCGv_i64 t0, t1;
> +
> +        t0 = tcg_temp_new_i64();
> +        t1 = tcg_temp_new_i64();
> +        tcg_gen_qemu_ld_i64(t0, EA, ctx->mem_idx, MO_LEQ);
> +        tcg_gen_shri_i64(t1, t0, 32);
> +        tcg_gen_deposit_i64(xth, t1, t0, 32, 32);
> +        tcg_gen_addi_tl(EA, EA, 8);
> +        tcg_gen_qemu_ld_i64(t0, EA, ctx->mem_idx, MO_LEQ);
> +        tcg_gen_shri_i64(t1, t0, 32);
> +        tcg_gen_deposit_i64(xtl, t1, t0, 32, 32);
> +        tcg_temp_free_i64(t0);
> +        tcg_temp_free_i64(t1);
> +    } else {
> +        tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_BEQ);
> +        tcg_gen_addi_tl(EA, EA, 8);
> +        tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_BEQ);
> +    }
>      tcg_temp_free(EA);
> -    tcg_temp_free_i64(tmp);
>  }
>
>  #define VSX_STORE_SCALAR(name, operation) \
David Gibson <david@gibson.dropbear.id.au> writes:

> [ Unknown signature status ]
> On Wed, Sep 28, 2016 at 11:01:22AM +0530, Nikunj A Dadhania wrote:
>> Load 8byte at a time and manipulate.
>>
>> Big-Endian Storage
>> +-------------+-------------+-------------+-------------+
>> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
>> +-------------+-------------+-------------+-------------+
>>
>> Little-Endian Storage
>> +-------------+-------------+-------------+-------------+
>> | 33 22 11 00 | 77 66 55 44 | BB AA 99 88 | FF EE DD CC |
>> +-------------+-------------+-------------+-------------+
>>
>> Vector load results in:
>> +-------------+-------------+-------------+-------------+
>> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
>> +-------------+-------------+-------------+-------------+
>
> Ok.  I'm guessing from this that implementing those GPR<->VSR
> instructions showed that the earlier versions were endian-incorrect as
> I suspected.
>
> Have you verified that this new implementation is actually faster (or
> at least no slower) on LE than the original implementation with
> individual 32-bit stores?

I haven't, will check it once and get back.

Regards
Nikunj
David Gibson <david@gibson.dropbear.id.au> writes:

> [ Unknown signature status ]
> On Wed, Sep 28, 2016 at 11:01:22AM +0530, Nikunj A Dadhania wrote:
>> Load 8byte at a time and manipulate.
>>
>> Big-Endian Storage
>> +-------------+-------------+-------------+-------------+
>> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
>> +-------------+-------------+-------------+-------------+
>>
>> Little-Endian Storage
>> +-------------+-------------+-------------+-------------+
>> | 33 22 11 00 | 77 66 55 44 | BB AA 99 88 | FF EE DD CC |
>> +-------------+-------------+-------------+-------------+
>>
>> Vector load results in:
>> +-------------+-------------+-------------+-------------+
>> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
>> +-------------+-------------+-------------+-------------+
>
> Ok.  I'm guessing from this that implementing those GPR<->VSR
> instructions showed that the earlier versions were endian-incorrect as
> I suspected.
>
> Have you verified that this new implementation is actually faster (or
> at least no slower) on LE than the original implementation with
> individual 32-bit stores?

Result of million lxvw4x, mfvsrd/mfvsrld and print

Without patch:
==============
[tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le -cpu POWER9 le_lxvw4x >/dev/null

real    0m2.812s
user    0m2.792s
sys     0m0.020s
[tcg_test]$

With patch:
===========
[tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le -cpu POWER9 le_lxvw4x >/dev/null

real    0m2.801s
user    0m2.783s
sys     0m0.018s
[tcg_test]$

Not much perceivable difference, is there a better way to benchmark?

Regards
Nikunj
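[Editor's note: the le_lxvw4x test program itself is not posted in this thread. As a rough illustration only, a loop of the kind described above (a million lxvw4x executions, both doublewords moved back to GPRs with mfvsrd/mfvsrld, and a print per iteration) might look like the sketch below. The buffer contents, loop count, register choices, and build flags are assumptions, not the actual test source.]

```c
/*
 * Hypothetical reconstruction of the benchmark described above; the real
 * le_lxvw4x source is not part of this thread.  Build with a POWER9-aware
 * toolchain (e.g. gcc -mcpu=power9) so that mfvsrld is accepted.
 */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    /* The "Little-Endian Storage" byte pattern from the commit message. */
    static const uint8_t buf[16] __attribute__((aligned(16))) = {
        0x33, 0x22, 0x11, 0x00, 0x77, 0x66, 0x55, 0x44,
        0xBB, 0xAA, 0x99, 0x88, 0xFF, 0xEE, 0xDD, 0xCC
    };
    uint64_t hi, lo;

    for (int i = 0; i < 1000000; i++) {
        __asm__ volatile(
            "lxvw4x 32, 0, %2\n\t"  /* VSR32 <- four words from buf    */
            "mfvsrd %0, 32\n\t"     /* GPR   <- doubleword 0 of VSR32  */
            "mfvsrld %1, 32"        /* GPR   <- doubleword 1 (ISA 3.0) */
            : "=r" (hi), "=r" (lo)
            : "r" (buf)
            : "v0", "memory");
        printf("%016" PRIx64 " %016" PRIx64 "\n", hi, lo);
    }
    return 0;
}
```

[With endian-correct emulation, each line printed should read 0011223344556677 8899aabbccddeeff, i.e. the two doublewords from the "Vector load results in" table in the commit message.]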
On 09/28/2016 08:41 PM, Nikunj A Dadhania wrote:
> Without patch:
> ==============
> [tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le -cpu POWER9 le_lxvw4x >/dev/null
>
> real    0m2.812s
> user    0m2.792s
> sys     0m0.020s
> [tcg_test]$
>
> With patch:
> ===========
> [tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le -cpu POWER9 le_lxvw4x >/dev/null
>
> real    0m2.801s
> user    0m2.783s
> sys     0m0.018s
> [tcg_test]$
>
> Not much perceivable difference, is there a better way to benchmark?

There should be more of a difference for softmmu, since the tlb lookup for
the memory is more expensive.

r~
On Thu, Sep 29, 2016 at 09:11:10AM +0530, Nikunj A Dadhania wrote:
> David Gibson <david@gibson.dropbear.id.au> writes:
>
> > [ Unknown signature status ]
> > On Wed, Sep 28, 2016 at 11:01:22AM +0530, Nikunj A Dadhania wrote:
> >> Load 8byte at a time and manipulate.
> >>
> >> Big-Endian Storage
> >> +-------------+-------------+-------------+-------------+
> >> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
> >> +-------------+-------------+-------------+-------------+
> >>
> >> Little-Endian Storage
> >> +-------------+-------------+-------------+-------------+
> >> | 33 22 11 00 | 77 66 55 44 | BB AA 99 88 | FF EE DD CC |
> >> +-------------+-------------+-------------+-------------+
> >>
> >> Vector load results in:
> >> +-------------+-------------+-------------+-------------+
> >> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
> >> +-------------+-------------+-------------+-------------+
> >
> > Ok.  I'm guessing from this that implementing those GPR<->VSR
> > instructions showed that the earlier versions were endian-incorrect as
> > I suspected.
> >
> > Have you verified that this new implementation is actually faster (or
> > at least no slower) on LE than the original implementation with
> > individual 32-bit stores?
>
> Result of million lxvw4x, mfvsrd/mfvsrld and print
>
> Without patch:
> ==============
> [tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le -cpu POWER9 le_lxvw4x >/dev/null
>
> real    0m2.812s
> user    0m2.792s
> sys     0m0.020s
> [tcg_test]$
>
> With patch:
> ===========
> [tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le -cpu POWER9 le_lxvw4x >/dev/null
>
> real    0m2.801s
> user    0m2.783s
> sys     0m0.018s
> [tcg_test]$
>
> Not much perceivable difference, is there a better way to benchmark?

Not dramatically, that I can think of.  A few tweaks you can make:

 * Increase the loop counter so the test simply runs for longer

 * Also run the test multiple times, so you can get an idea of how much
   the results vary from one run to another

 * Run the test on a system that's as idle of other activity as you can
   make it (at both host and guest level).

For our purposes the user time is probably the meaningful thing here,
and should show less variance than the system and real time.

Note that it would be interesting to get these results for both a power
and x86 host.  In any case the results above are enough to convince me
that the change isn't likely to be a significant regression.
On Wed, Sep 28, 2016 at 08:48:54PM -0700, Richard Henderson wrote:
> On 09/28/2016 08:41 PM, Nikunj A Dadhania wrote:
> > Without patch:
> > ==============
> > [tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le -cpu POWER9 le_lxvw4x >/dev/null
> >
> > real    0m2.812s
> > user    0m2.792s
> > sys     0m0.020s
> > [tcg_test]$
> >
> > With patch:
> > ===========
> > [tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le -cpu POWER9 le_lxvw4x >/dev/null
> >
> > real    0m2.801s
> > user    0m2.783s
> > sys     0m0.018s
> > [tcg_test]$
> >
> > Not much perceivable difference, is there a better way to benchmark?
>
> There should be more of a difference for softmmu, since the tlb lookup for
> the memory is more expensive.

Good point.

Oh.. also, I'd remove the prints from the benchmark for this purpose.
The time involved in the syscalls and whatnot for the print will just
add noise to the measurement (sending to /dev/null reduces the impact,
but it's probably still significant compared to a simple math
operation).
diff --git a/target-ppc/translate/vsx-impl.inc.c b/target-ppc/translate/vsx-impl.inc.c
index 74d0533..1eca042 100644
--- a/target-ppc/translate/vsx-impl.inc.c
+++ b/target-ppc/translate/vsx-impl.inc.c
@@ -75,7 +75,6 @@ static void gen_lxvdsx(DisasContext *ctx)
 static void gen_lxvw4x(DisasContext *ctx)
 {
     TCGv EA;
-    TCGv_i64 tmp;
     TCGv_i64 xth = cpu_vsrh(xT(ctx->opcode));
     TCGv_i64 xtl = cpu_vsrl(xT(ctx->opcode));
     if (unlikely(!ctx->vsx_enabled)) {
@@ -84,22 +83,28 @@ static void gen_lxvw4x(DisasContext *ctx)
     }
     gen_set_access_type(ctx, ACCESS_INT);
     EA = tcg_temp_new();
-    tmp = tcg_temp_new_i64();

     gen_addr_reg_index(ctx, EA);
-    gen_qemu_ld32u_i64(ctx, tmp, EA);
-    tcg_gen_addi_tl(EA, EA, 4);
-    gen_qemu_ld32u_i64(ctx, xth, EA);
-    tcg_gen_deposit_i64(xth, xth, tmp, 32, 32);
-
-    tcg_gen_addi_tl(EA, EA, 4);
-    gen_qemu_ld32u_i64(ctx, tmp, EA);
-    tcg_gen_addi_tl(EA, EA, 4);
-    gen_qemu_ld32u_i64(ctx, xtl, EA);
-    tcg_gen_deposit_i64(xtl, xtl, tmp, 32, 32);
-
+    if (ctx->le_mode) {
+        TCGv_i64 t0, t1;
+
+        t0 = tcg_temp_new_i64();
+        t1 = tcg_temp_new_i64();
+        tcg_gen_qemu_ld_i64(t0, EA, ctx->mem_idx, MO_LEQ);
+        tcg_gen_shri_i64(t1, t0, 32);
+        tcg_gen_deposit_i64(xth, t1, t0, 32, 32);
+        tcg_gen_addi_tl(EA, EA, 8);
+        tcg_gen_qemu_ld_i64(t0, EA, ctx->mem_idx, MO_LEQ);
+        tcg_gen_shri_i64(t1, t0, 32);
+        tcg_gen_deposit_i64(xtl, t1, t0, 32, 32);
+        tcg_temp_free_i64(t0);
+        tcg_temp_free_i64(t1);
+    } else {
+        tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_BEQ);
+        tcg_gen_addi_tl(EA, EA, 8);
+        tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_BEQ);
+    }
     tcg_temp_free(EA);
-    tcg_temp_free_i64(tmp);
 }

 #define VSX_STORE_SCALAR(name, operation) \
Load 8 bytes at a time and manipulate.

Big-Endian Storage
+-------------+-------------+-------------+-------------+
| 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
+-------------+-------------+-------------+-------------+

Little-Endian Storage
+-------------+-------------+-------------+-------------+
| 33 22 11 00 | 77 66 55 44 | BB AA 99 88 | FF EE DD CC |
+-------------+-------------+-------------+-------------+

Vector load results in:
+-------------+-------------+-------------+-------------+
| 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
+-------------+-------------+-------------+-------------+

Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
---
 target-ppc/translate/vsx-impl.inc.c | 33 +++++++++++++++++++--------------
 1 file changed, 19 insertions(+), 14 deletions(-)
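[Editor's note: the shri/deposit pair in the le_mode branch of the patch above is simply a swap of the two 32-bit halves of each 64-bit little-endian load. The following is a minimal C sketch of that data transformation, not QEMU code; it assumes a little-endian host so that a plain 8-byte memcpy behaves like the MO_LEQ load.]

```c
#include <stdint.h>
#include <string.h>

/* Equivalent of one le_mode doubleword in gen_lxvw4x above:
 * load 8 bytes little-endian, then swap the two 32-bit halves. */
static uint64_t lxvw4x_le_doubleword(const uint8_t *p)
{
    uint64_t t0, t1;

    memcpy(&t0, p, 8);   /* tcg_gen_qemu_ld_i64(t0, EA, ..., MO_LEQ) */
    t1 = t0 >> 32;       /* tcg_gen_shri_i64(t1, t0, 32)             */
    /* tcg_gen_deposit_i64(xth, t1, t0, 32, 32): the low word of t0
     * becomes the high word of the result, and t1 (the old high word)
     * becomes the low word. */
    return (t0 << 32) | t1;
}
```

[For the first doubleword of the "Little-Endian Storage" example above (bytes 33 22 11 00 77 66 55 44 at increasing addresses), the load gives t0 = 0x4455667700112233 and the swap returns 0x0011223344556677, matching the first doubleword of the "Vector load results in" table.]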