[RFC,V1,7/7] crypto: aesni - AVX512 version of AESNI-GCM using VPCLMULQDQ

Introduce the AVX512 implementation that optimizes the AESNI-GCM encode
and decode routines using VPCLMULQDQ.

The glue code in AESNI module overrides the existing AVX2 GCM mode
encryption/decryption routines with the AX512 AES GCM mode ones when the
following criteria are met:
At compile time:
1. CONFIG_CRYPTO_AVX512 is enabled
2. toolchain(assembler) supports VPCLMULQDQ instructions
At runtime:
1. VPCLMULQDQ and AVX512VL features are supported on a platform
   (currently only Icelake)
2. aesni_intel.use_avx512 module parameter is set at boot time. For this
   algorithm, switching from AVX512 optimized version is not possible
   once set at boot time because of how the code is structured today.(Can
   be changed later if required)

The functions aesni_gcm_init_avx_512, aesni_gcm_enc_update_avx_512,
aesni_gcm_dec_update_avx_512 and aesni_gcm_finalize_avx_512 are adapted
from the Intel Optimized IPSEC Cryptographic library.

On a Icelake desktop, with turbo disabled and all CPUs running at
maximum frequency, the AVX512 GCM mode optimization shows better
performance across data & key sizes as measured by tcrypt.

The average performance improvement of the AVX512 version over the AVX2
version is as follows:
For all key sizes(128/192/256 bits),
        data sizes < 128 bytes/block, negligible improvement (~7.5%)
        data sizes > 128 bytes/block, there is an average improvement of
        40% for both encryption and decryption.

A typical run of tcrypt with AES GCM mode encryption/decryption of the
AVX2 and AVX512 optimization on a Icelake desktop shows the following
results:

  ----------------------------------------------------------------------
  |   key  | bytes | cycles/op (lower is better)   | Percentage gain/  |
  | length |   per |   encryption  |  decryption   |      loss         |
  | (bits) | block |-------------------------------|-------------------|
  |        |       | avx2 | avx512 | avx2 | avx512 | Encrypt | Decrypt |
  |---------------------------------------------------------------------
  |  128   | 16    | 689  |  701   | 689  |  707   |  -1.7   |  -2.61  |
  |  128   | 64    | 731  |  660   | 771  |  649   |   9.7   |  15.82  |
  |  128   | 256   | 911  |  750   | 900  |  721   |  17.67  |  19.88  |
  |  128   | 512   | 1181 |  814   | 1161 |  782   |  31.07  |  32.64  |
  |  128   | 1024  | 1676 |  1052  | 1685 |  1030  |  37.23  |  38.87  |
  |  128   | 2048  | 2475 |  1447  | 2456 |  1419  |  41.53  |  42.22  |
  |  128   | 4096  | 3806 |  2154  | 3820 |  2119  |  43.41  |  44.53  |
  |  128   | 8192  | 9169 |  3806  | 6997 |  3718  |  58.49  |  46.86  |
  |  192   | 16    | 754  |  683   | 737  |  672   |   9.42  |   8.82  |
  |  192   | 64    | 735  |  686   | 715  |  640   |   6.66  |  10.49  |
  |  192   | 256   | 949  |  738   | 2435 |  729   |  22.23  |  70     |
  |  192   | 512   | 1235 |  854   | 1200 |  833   |  30.85  |  30.58  |
  |  192   | 1024  | 1777 |  1084  | 1763 |  1051  |  38.99  |  40.39  |
  |  192   | 2048  | 2574 |  1497  | 2592 |  1459  |  41.84  |  43.71  |
  |  192   | 4096  | 4086 |  2317  | 4091 |  2244  |  43.29  |  45.14  |
  |  192   | 8192  | 7481 |  4054  | 7505 |  3953  |  45.81  |  47.32  |
  |  256   | 16    | 755  |  682   | 720  |  683   |   9.68  |   5.14  |
  |  256   | 64    | 744  |  677   | 719  |  658   |   9     |   8.48  |
  |  256   | 256   | 962  |  758   | 948  |  749   |  21.21  |  21     |
  |  256   | 512   | 1297 |  862   | 1276 |  836   |  33.54  |  34.48  |
  |  256   | 1024  | 1831 |  1114  | 1819 |  1095  |  39.16  |  39.8   |
  |  256   | 2048  | 2767 |  1566  | 2715 |  1524  |  43.4   |  43.87  |
  |  256   | 4096  | 4378 |  2382  | 4368 |  2354  |  45.6   |  46.11  |
  |  256   | 8192  | 8075 |  4262  | 8080 |  4186  |  47.22  |  48.19  |
  ----------------------------------------------------------------------

This work was inspired by the AES GCM mode optimization published in
Intel Optimized IPSEC Cryptographic library.
https://github.com/intel/intel-ipsec-mb/blob/master/lib/avx512/gcm_vaes_avx512.asm

Co-developed-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
Signed-off-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
Signed-off-by: Megha Dey <megha.dey@intel.com>
---
 arch/x86/crypto/Makefile                    |    1 +
 arch/x86/crypto/aesni-intel_avx512-x86_64.S | 1788 +++++++++++++++++++++++++++
 arch/x86/crypto/aesni-intel_glue.c          |   62 +-
 crypto/Kconfig                              |   12 +
 4 files changed, 1858 insertions(+), 5 deletions(-)
 create mode 100644 arch/x86/crypto/aesni-intel_avx512-x86_64.S

Message ID	1608325864-4033-8-git-send-email-megha.dey@intel.com (mailing list archive)
State	RFC
Delegated to:	Herbert Xu
Headers	show Return-Path: <linux-crypto-owner@kernel.org> IronPort-SDR: rjhGuJkA6jexgVVHitRb1X+zMhObrFN2n+qqJf8UvfEvOdRXENN7VUL000ViqXZQkRkbvfWlJP 3Efs/KFXtxEQ== IronPort-SDR: eGsLVvYL85huZm+2IgtMuoo4wcr9qfgQItMQ/JaknHF6jSP+XE6Aow/xi6cSibxHPHi2+CpcMN QUnlmV3NVzfg== From: Megha Dey <megha.dey@intel.com> To: herbert@gondor.apana.org.au, davem@davemloft.net Cc: linux-crypto@vger.kernel.org, linux-kernel@vger.kernel.org, ravi.v.shankar@intel.com, tim.c.chen@intel.com, andi.kleen@intel.com, dave.hansen@intel.com, megha.dey@intel.com, wajdi.k.feghali@intel.com, greg.b.tucker@intel.com, robert.a.kasten@intel.com, rajendrakumar.chinnaiyan@intel.com, tomasz.kantecki@intel.com, ryan.d.saffores@intel.com, ilya.albrekht@intel.com, kyung.min.park@intel.com, tony.luck@intel.com, ira.weiny@intel.com Subject: [RFC V1 7/7] crypto: aesni - AVX512 version of AESNI-GCM using VPCLMULQDQ Date: Fri, 18 Dec 2020 13:11:04 -0800 Message-Id: <1608325864-4033-8-git-send-email-megha.dey@intel.com> In-Reply-To: <1608325864-4033-1-git-send-email-megha.dey@intel.com> References: <1608325864-4033-1-git-send-email-megha.dey@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	Introduce AVX512 optimized crypto algorithms \| expand [RFC,V1,0/7] Introduce AVX512 optimized crypto algorithms [RFC,V1,1/7] x86: Probe assembler capabilities for VAES and VPLCMULQDQ support [RFC,V1,2/7] crypto: crct10dif - Accelerated CRC T10 DIF with vectorized instruction [RFC,V1,3/7] crypto: ghash - Optimized GHASH computations [RFC,V1,4/7] crypto: tcrypt - Add speed test for optimized GHASH computations [RFC,V1,5/7] crypto: aesni - AES CTR x86_64 "by16" AVX512 optimization [RFC,V1,6/7] crypto: aesni - fix coding style for if/else block [RFC,V1,7/7] crypto: aesni - AVX512 version of AESNI-GCM using VPCLMULQDQ

[RFC,V1,7/7] crypto: aesni - AVX512 version of AESNI-GCM using VPCLMULQDQ

Commit Message

Comments

Patch