From patchwork Mon Jul  1 11:18:53 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jan Beulich <JBeulich@suse.com>
X-Patchwork-Id: 11025629
From: Jan Beulich <JBeulich@suse.com>
To: "xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>
Thread-Topic: [PATCH v9 05/23] x86emul: support AVX512F gather insns
Thread-Index: AQHVL/7DlmNHyySWV0mdXoNuP5HddA==
Date: Mon, 1 Jul 2019 11:18:53 +0000
Message-ID: <95252da8-777b-9527-6f5b-1e1a5994f845@suse.com>
References: <f69ca82f-e2db-e85e-ff98-2060a8dc28a5@suse.com>
In-Reply-To: <f69ca82f-e2db-e85e-ff98-2060a8dc28a5@suse.com>
Accept-Language: en-US
Content-Language: en-US
Subject: [Xen-devel] [PATCH v9 05/23] x86emul: support AVX512F gather insns
X-BeenThere: xen-devel@lists.xenproject.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: Xen developer discussion <xen-devel.lists.xenproject.org>
List-Unsubscribe: <https://lists.xenproject.org/mailman/options/xen-devel>,
 <mailto:xen-devel-request@lists.xenproject.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xenproject.org>
List-Help: <mailto:xen-devel-request@lists.xenproject.org?subject=help>
List-Subscribe: <https://lists.xenproject.org/mailman/listinfo/xen-devel>,
 <mailto:xen-devel-request@lists.xenproject.org?subject=subscribe>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>, Wei Liu <wl@xen.org>,
 Roger Pau Monne <roger.pau@citrix.com>

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
v9: Suppress general register update upon failures. Split out ModR/M
     handling changes as well as independent test harness ones into
     prereq patches. Re-base.
v8: Re-base.
v7: Fix ByteOp register decode. Re-base.
v6: New.
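
For readers of the x86_emulate.c hunk, the element-count formula and the
per-element mask handling can be modelled outside the emulator. The sketch
below is not code from this patch: the sketch_* names, the SK_* status
codes and the "budget" reader are hypothetical stand-ins for ops->read()
and the X86EMUL_* return values. It assumes non-negative indexes into a
flat buffer. It only illustrates that a mask bit is cleared once its
element has been read, and that a non-faulting failure after partial
progress is turned into a retry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the X86EMUL_* return values. */
enum sketch_rc { SK_OKAY, SK_UNHANDLEABLE, SK_EXCEPTION, SK_RETRY };

/* Hypothetical memory model: a flat buffer plus a limited read budget. */
struct sketch_mem {
    const uint8_t *base;
    unsigned int budget; /* reads allowed before a non-faulting failure */
};

/* Element count, mirroring n = 1 << (2 + evex.lr - ((b & 1) | evex.w)). */
static unsigned int gather_elems(unsigned int lr, bool wide_idx, bool w)
{
    assert(lr <= 2);
    return 1u << (2 + lr - (unsigned int)(wide_idx | w));
}

/* Stand-in for ops->read(); fails without faulting once the budget is used up. */
static enum sketch_rc sketch_read(struct sketch_mem *m, int64_t off,
                                  void *dst, unsigned int bytes)
{
    if ( !m->budget )
        return SK_UNHANDLEABLE;
    --m->budget;
    memcpy(dst, m->base + off, bytes);
    return SK_OKAY;
}

/*
 * Read every element whose mask bit is set, clearing the bit on success.
 * On return *mask holds the elements still outstanding, so a retry only
 * needs to redo those.
 */
static enum sketch_rc gather_sketch(struct sketch_mem *m, void *dst,
                                    unsigned int elem_bytes,
                                    const int64_t *idx,
                                    unsigned int scale_shift,
                                    unsigned int n, uint16_t *mask)
{
    enum sketch_rc rc = SK_OKAY;
    bool done = false;
    unsigned int i;

    assert(n <= 16);
    *mask &= (1u << n) - 1; /* ignore mask bits beyond the element count */

    for ( i = 0; *mask; ++i )
    {
        if ( !(*mask & (1u << i)) )
            continue;

        rc = sketch_read(m, idx[i] * ((int64_t)1 << scale_shift),
                         (uint8_t *)dst + i * elem_bytes, elem_bytes);
        if ( rc != SK_OKAY )
        {
            /* Partial progress without a fault: ask for a retry. */
            if ( rc != SK_EXCEPTION && done )
                rc = SK_RETRY;
            break;
        }

        *mask &= ~(1u << i);
        done = true;
    }

    return rc;
}
```

Calling gather_sketch() a second time with the mask left behind by a
SK_RETRY result fills in the remaining elements without re-reading the
ones already gathered.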

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -18,7 +18,7 @@ CFLAGS += $(CFLAGS_xeninclude)
  
  SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq avx512er
  FMA := fma4 fma
-SG := avx2-sg
+SG := avx2-sg avx512f-sg avx512vl-sg
  TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
  
  OPMASK := avx512f avx512dq avx512bw
@@ -66,6 +66,14 @@ xop-flts := $(avx-flts)
  avx512f-vecs := 64 16 32
  avx512f-ints := 4 8
  avx512f-flts := 4 8
+avx512f-sg-vecs := 64
+avx512f-sg-idxs := 4 8
+avx512f-sg-ints := $(avx512f-ints)
+avx512f-sg-flts := $(avx512f-flts)
+avx512vl-sg-vecs := 16 32
+avx512vl-sg-idxs := $(avx512f-sg-idxs)
+avx512vl-sg-ints := $(avx512f-ints)
+avx512vl-sg-flts := $(avx512f-flts)
  avx512bw-vecs := $(avx512f-vecs)
  avx512bw-ints := 1 2
  avx512bw-flts :=
--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -176,6 +176,8 @@ static const struct test avx512f_all[] =
      INSN(fnmsub213,    66, 0f38, af,    el,     sd, el),
      INSN(fnmsub231,    66, 0f38, be,    vl,     sd, vl),
      INSN(fnmsub231,    66, 0f38, bf,    el,     sd, el),
+    INSN(gatherd,      66, 0f38, 92,    vl,     sd, el),
+    INSN(gatherq,      66, 0f38, 93,    vl,     sd, el),
      INSN(getexp,       66, 0f38, 42,    vl,     sd, vl),
      INSN(getexp,       66, 0f38, 43,    el,     sd, el),
      INSN(getmant,      66, 0f3a, 26,    vl,     sd, vl),
@@ -229,6 +231,8 @@ static const struct test avx512f_all[] =
      INSN(permt2,       66, 0f38, 7e,    vl,     dq, vl),
      INSN(permt2,       66, 0f38, 7f,    vl,     sd, vl),
      INSN(pexpand,      66, 0f38, 89,    vl,     dq, el),
+    INSN(pgatherd,     66, 0f38, 90,    vl,     dq, el),
+    INSN(pgatherq,     66, 0f38, 91,    vl,     dq, el),
      INSN(pmaxs,        66, 0f38, 3d,    vl,     dq, vl),
      INSN(pmaxu,        66, 0f38, 3f,    vl,     dq, vl),
      INSN(pmins,        66, 0f38, 39,    vl,     dq, vl),
--- a/tools/tests/x86_emulator/simd-sg.c
+++ b/tools/tests/x86_emulator/simd-sg.c
@@ -35,13 +35,78 @@ typedef long long __attribute__((vector_
  #define ITEM_COUNT (VEC_SIZE / ELEM_SIZE < IVEC_SIZE / IDX_SIZE ? \
                      VEC_SIZE / ELEM_SIZE : IVEC_SIZE / IDX_SIZE)
  
-#if VEC_SIZE == 16
-# define to_bool(cmp) __builtin_ia32_ptestc128(cmp, (vec_t){} == 0)
-#else
-# define to_bool(cmp) __builtin_ia32_ptestc256(cmp, (vec_t){} == 0)
-#endif
+#if defined(__AVX512F__)
+# define ALL_TRUE (~0ULL >> (64 - ELEM_COUNT))
+# if ELEM_SIZE == 4
+#  if IDX_SIZE == 4 || defined(__AVX512VL__)
+#   define to_mask(msk) B(ptestmd, , (vsi_t)(msk), (vsi_t)(msk), ~0)
+#   define eq(x, y) (B(pcmpeqd, _mask, (vsi_t)(x), (vsi_t)(y), -1) == ALL_TRUE)
+#  else
+#   define widen(x) __builtin_ia32_pmovzxdq512_mask((vsi_t)(x), (idi_t){}, ~0)
+#   define to_mask(msk) __builtin_ia32_ptestmq512(widen(msk), widen(msk), ~0)
+#   define eq(x, y) (__builtin_ia32_pcmpeqq512_mask(widen(x), widen(y), ~0) == ALL_TRUE)
+#  endif
+#  define BG_(dt, it, reg, mem, idx, msk, scl) \
+    __builtin_ia32_gather##it##dt(reg, mem, idx, to_mask(msk), scl)
+# else
+#  define eq(x, y) (B(pcmpeqq, _mask, (vdi_t)(x), (vdi_t)(y), -1) == ALL_TRUE)
+#  define BG_(dt, it, reg, mem, idx, msk, scl) \
+    __builtin_ia32_gather##it##dt(reg, mem, idx, B(ptestmq, , (vdi_t)(msk), (vdi_t)(msk), ~0), scl)
+# endif
+/*
+ * Instead of replicating the main IDX_SIZE conditional below three times, use
+ * a double layer of macro invocations, allowing for substitution of the
+ * respective relevant macro argument tokens.
+ */
+# define BG(dt, it, reg, mem, idx, msk, scl) BG_(dt, it, reg, mem, idx, msk, scl)
+# if VEC_MAX < 64
+/*
+ * The sub-512-bit built-ins have an extra "3" infix, presumably because the
+ * 512-bit names were chosen without the AVX512VL extension in mind (and hence
+ * making the latter collide with the AVX2 ones).
+ */
+#  define si 3si
+#  define di 3di
+# endif
+# if VEC_MAX == 16
+#  define v8df v2df
+#  define v8di v2di
+#  define v16sf v4sf
+#  define v16si v4si
+# elif VEC_MAX == 32
+#  define v8df v4df
+#  define v8di v4di
+#  define v16sf v8sf
+#  define v16si v8si
+# endif
+# if IDX_SIZE == 4
+#  if INT_SIZE == 4
+#   define gather(reg, mem, idx, msk, scl) BG(v16si, si, reg, mem, idx, msk, scl)
+#  elif INT_SIZE == 8
+#   define gather(reg, mem, idx, msk, scl) (vec_t)(BG(v8di, si, (vdi_t)(reg), mem, idx, msk, scl))
+#  elif FLOAT_SIZE == 4
+#   define gather(reg, mem, idx, msk, scl) BG(v16sf, si, reg, mem, idx, msk, scl)
+#  elif FLOAT_SIZE == 8
+#   define gather(reg, mem, idx, msk, scl) BG(v8df, si, reg, mem, idx, msk, scl)
+#  endif
+# elif IDX_SIZE == 8
+#  if INT_SIZE == 4
+#   define gather(reg, mem, idx, msk, scl) BG(v16si, di, reg, mem, (idi_t)(idx), msk, scl)
+#  elif INT_SIZE == 8
+#   define gather(reg, mem, idx, msk, scl) (vec_t)(BG(v8di, di, (vdi_t)(reg), mem, (idi_t)(idx), msk, scl))
+#  elif FLOAT_SIZE == 4
+#   define gather(reg, mem, idx, msk, scl) BG(v16sf, di, reg, mem, (idi_t)(idx), msk, scl)
+#  elif FLOAT_SIZE == 8
+#   define gather(reg, mem, idx, msk, scl) BG(v8df, di, reg, mem, (idi_t)(idx), msk, scl)
+#  endif
+# endif
+#elif defined(__AVX2__)
+# if VEC_SIZE == 16
+#  define to_bool(cmp) __builtin_ia32_ptestc128(cmp, (vec_t){} == 0)
+# else
+#  define to_bool(cmp) __builtin_ia32_ptestc256(cmp, (vec_t){} == 0)
+# endif
  
-#if defined(__AVX2__)
  # if VEC_MAX == 16
  #  if IDX_SIZE == 4
  #   if INT_SIZE == 4
@@ -111,6 +176,10 @@ typedef long long __attribute__((vector_
  # endif
  #endif
  
+#ifndef eq
+# define eq(x, y) to_bool((x) == (y))
+#endif
+
  #define GLUE_(x, y) x ## y
  #define GLUE(x, y) GLUE_(x, y)
  
@@ -119,6 +188,7 @@ typedef long long __attribute__((vector_
  #define PUT8(n)  PUT4(n),   PUT4((n) +  4)
  #define PUT16(n) PUT8(n),   PUT8((n) +  8)
  #define PUT32(n) PUT16(n), PUT16((n) + 16)
+#define PUT64(n) PUT32(n), PUT32((n) + 32)
  
  const typeof((vec_t){}[0]) array[] = {
      GLUE(PUT, VEC_MAX)(1),
@@ -174,7 +244,7 @@ int sg_test(void)
  
      y = gather(full, array + ITEM_COUNT, -idx, full, ELEM_SIZE);
  #if ITEM_COUNT == ELEM_COUNT
-    if ( !to_bool(y == x - 1) )
+    if ( !eq(y, x - 1) )
          return __LINE__;
  #else
      for ( i = 0; i < ITEM_COUNT; ++i )
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -22,6 +22,8 @@ asm ( ".pushsection .test, \"ax\", @prog
  #include "avx512dq-opmask.h"
  #include "avx512bw-opmask.h"
  #include "avx512f.h"
+#include "avx512f-sg.h"
+#include "avx512vl-sg.h"
  #include "avx512bw.h"
  #include "avx512dq.h"
  #include "avx512er.h"
@@ -90,11 +92,13 @@ static bool simd_check_avx512f(void)
      return cpu_has_avx512f;
  }
  #define simd_check_avx512f_opmask simd_check_avx512f
+#define simd_check_avx512f_sg simd_check_avx512f
  
  static bool simd_check_avx512f_vl(void)
  {
      return cpu_has_avx512f && cpu_has_avx512vl;
  }
+#define simd_check_avx512vl_sg simd_check_avx512f_vl
  
  static bool simd_check_avx512dq(void)
  {
@@ -291,6 +295,14 @@ static const struct {
      SIMD(AVX512F u32x16,      avx512f,      64u4),
      SIMD(AVX512F s64x8,       avx512f,      64i8),
      SIMD(AVX512F u64x8,       avx512f,      64u8),
+    SIMD(AVX512F S/G f32[16x32], avx512f_sg, 64x4f4),
+    SIMD(AVX512F S/G f64[ 8x32], avx512f_sg, 64x4f8),
+    SIMD(AVX512F S/G f32[ 8x64], avx512f_sg, 64x8f4),
+    SIMD(AVX512F S/G f64[ 8x64], avx512f_sg, 64x8f8),
+    SIMD(AVX512F S/G i32[16x32], avx512f_sg, 64x4i4),
+    SIMD(AVX512F S/G i64[ 8x32], avx512f_sg, 64x4i8),
+    SIMD(AVX512F S/G i32[ 8x64], avx512f_sg, 64x8i4),
+    SIMD(AVX512F S/G i64[ 8x64], avx512f_sg, 64x8i8),
      AVX512VL(VL f32x4,        avx512f,      16f4),
      AVX512VL(VL f64x2,        avx512f,      16f8),
      AVX512VL(VL f32x8,        avx512f,      32f4),
@@ -303,6 +315,22 @@ static const struct {
      AVX512VL(VL u64x2,        avx512f,      16u8),
      AVX512VL(VL s64x4,        avx512f,      32i8),
      AVX512VL(VL u64x4,        avx512f,      32u8),
+    SIMD(AVX512VL S/G f32[4x32], avx512vl_sg, 16x4f4),
+    SIMD(AVX512VL S/G f64[2x32], avx512vl_sg, 16x4f8),
+    SIMD(AVX512VL S/G f32[2x64], avx512vl_sg, 16x8f4),
+    SIMD(AVX512VL S/G f64[2x64], avx512vl_sg, 16x8f8),
+    SIMD(AVX512VL S/G f32[8x32], avx512vl_sg, 32x4f4),
+    SIMD(AVX512VL S/G f64[4x32], avx512vl_sg, 32x4f8),
+    SIMD(AVX512VL S/G f32[4x64], avx512vl_sg, 32x8f4),
+    SIMD(AVX512VL S/G f64[4x64], avx512vl_sg, 32x8f8),
+    SIMD(AVX512VL S/G i32[4x32], avx512vl_sg, 16x4i4),
+    SIMD(AVX512VL S/G i64[2x32], avx512vl_sg, 16x4i8),
+    SIMD(AVX512VL S/G i32[2x64], avx512vl_sg, 16x8i4),
+    SIMD(AVX512VL S/G i64[2x64], avx512vl_sg, 16x8i8),
+    SIMD(AVX512VL S/G i32[8x32], avx512vl_sg, 32x4i4),
+    SIMD(AVX512VL S/G i64[4x32], avx512vl_sg, 32x4i8),
+    SIMD(AVX512VL S/G i32[4x64], avx512vl_sg, 32x8i4),
+    SIMD(AVX512VL S/G i64[4x64], avx512vl_sg, 32x8i8),
      SIMD(AVX512BW s8x64,     avx512bw,      64i1),
      SIMD(AVX512BW u8x64,     avx512bw,      64u1),
      SIMD(AVX512BW s16x32,    avx512bw,      64i2),
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -499,7 +499,7 @@ static const struct ext0f38_table {
      [0x8c] = { .simd_size = simd_packed_int },
      [0x8d] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
      [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
-    [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1 },
+    [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1, .d8s = d8s_dq },
      [0x96 ... 0x98] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
      [0x99] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
      [0x9a] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
@@ -9100,6 +9100,133 @@ x86_emulate(
          put_stub(stub);
  
          if ( rc != X86EMUL_OKAY )
+            goto done;
+
+        state->simd_size = simd_none;
+        break;
+    }
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x90): /* vpgatherd{d,q} mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x91): /* vpgatherq{d,q} mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x92): /* vgatherdp{s,d} mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x93): /* vgatherqp{s,d} mem,[xyz]mm{k} */
+    {
+        typeof(evex) *pevex;
+        union {
+            int32_t dw[16];
+            int64_t qw[8];
+        } index;
+        bool done = false;
+
+        ASSERT(ea.type == OP_MEM);
+        generate_exception_if((!evex.opmsk || evex.brs || evex.z ||
+                               evex.reg != 0xf ||
+                               modrm_reg == state->sib_index),
+                              EXC_UD);
+        avx512_vlen_check(false);
+        host_and_vcpu_must_have(avx512f);
+        get_fpu(X86EMUL_FPU_zmm);
+
+        /* Read destination and index registers. */
+        opc = init_evex(stub);
+        pevex = copy_EVEX(opc, evex);
+        pevex->opcx = vex_0f;
+        opc[0] = 0x7f; /* vmovdqa{32,64} */
+        /*
+         * The register writeback below has to retain masked-off elements, but
+         * needs to clear upper portions in the index-wider-than-data cases.
+         * Therefore read (and write below) the full register. The alternative
+         * would have been to fiddle with the mask register used.
+         */
+        pevex->opmsk = 0;
+        /* Use (%rax) as destination and modrm_reg as source. */
+        pevex->b = 1;
+        opc[1] = (modrm_reg & 7) << 3;
+        pevex->RX = 1;
+        opc[2] = 0xc3;
+
+        invoke_stub("", "", "=m" (*mmvalp) : "a" (mmvalp));
+
+        pevex->pfx = vex_f3; /* vmovdqu{32,64} */
+        pevex->w = b & 1;
+        /* Switch to sib_index as source. */
+        pevex->r = !mode_64bit() || !(state->sib_index & 0x08);
+        pevex->R = !mode_64bit() || !(state->sib_index & 0x10);
+        opc[1] = (state->sib_index & 7) << 3;
+
+        invoke_stub("", "", "=m" (index) : "a" (&index));
+        put_stub(stub);
+
+        /* Clear untouched parts of the destination and mask values. */
+        n = 1 << (2 + evex.lr - ((b & 1) | evex.w));
+        op_bytes = 4 << evex.w;
+        memset((void *)mmvalp + n * op_bytes, 0, 64 - n * op_bytes);
+        op_mask &= (1 << n) - 1;
+
+        for ( i = 0; op_mask; ++i )
+        {
+            signed long idx = b & 1 ? index.qw[i] : index.dw[i];
+
+            if ( !(op_mask & (1 << i)) )
+                continue;
+
+            rc = ops->read(ea.mem.seg,
+                           truncate_ea(ea.mem.off + (idx << state->sib_scale)),
+                           (void *)mmvalp + i * op_bytes, op_bytes, ctxt);
+            if ( rc != X86EMUL_OKAY )
+            {
+                /*
+                 * If we've made some progress and the access did not fault,
+                 * force a retry instead. This is for example necessary to
+                 * cope with the limited capacity of HVM's MMIO cache.
+                 */
+                if ( rc != X86EMUL_EXCEPTION && done )
+                    rc = X86EMUL_RETRY;
+                break;
+            }
+
+            op_mask &= ~(1 << i);
+            done = true;
+
+#ifdef __XEN__
+            if ( op_mask && local_events_need_delivery() )
+            {
+                rc = X86EMUL_RETRY;
+                break;
+            }
+#endif
+        }
+
+        /* Write destination and mask registers. */
+        opc = init_evex(stub);
+        pevex = copy_EVEX(opc, evex);
+        pevex->opcx = vex_0f;
+        opc[0] = 0x6f; /* vmovdqa{32,64} */
+        pevex->opmsk = 0;
+        /* Use modrm_reg as destination and (%rax) as source. */
+        pevex->b = 1;
+        opc[1] = (modrm_reg & 7) << 3;
+        pevex->RX = 1;
+        opc[2] = 0xc3;
+
+        invoke_stub("", "", "+m" (*mmvalp) : "a" (mmvalp));
+
+        /*
+         * kmovw: This is VEX-encoded, so we can't use pevex. Avoid copy_VEX() etc
+         * as well, since we can easily use the 2-byte VEX form here.
+         */
+        opc -= EVEX_PFX_BYTES;
+        opc[0] = 0xc5;
+        opc[1] = 0xf8;
+        opc[2] = 0x90;
+        /* Use (%rax) as source. */
+        opc[3] = evex.opmsk << 3;
+        opc[4] = 0xc3;
+
+        invoke_stub("", "", "+m" (op_mask) : "a" (&op_mask));
+        put_stub(stub);
+
+        if ( rc != X86EMUL_OKAY )
              goto done;
  
          state->simd_size = simd_none;