
[RFC,V1,3/7] crypto: ghash - Optimized GHASH computations

Message ID 1608325864-4033-4-git-send-email-megha.dey@intel.com (mailing list archive)
State RFC
Delegated to: Herbert Xu
Series: Introduce AVX512 optimized crypto algorithms

Commit Message

Dey, Megha Dec. 18, 2020, 9:11 p.m. UTC
From: Kyung Min Park <kyung.min.park@intel.com>

Optimize GHASH computations with the 512-bit wide VPCLMULQDQ instruction,
which allows working on 4 x 16 byte blocks at a time.
For best parallelism and deeper out-of-order execution, the main loop of
the code works on 16 x 16 byte blocks at a time and performs a reduction
every 48 x 16 byte blocks. This approach needs 48 precomputed GHASH subkeys,
and the precompute operation has been optimized as well to leverage 512-bit
registers, parallel carry-less multiplication and reduction.
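
For reference, the aggregation that makes the wide multiplies pay off is
the standard one: instead of the serial Horner recurrence
Y = (Y ^ X_i) * H, the hash over a group of n blocks can be written as

  (Y ^ X1)*H^n ^ X2*H^(n-1) ^ ... ^ Xn*H

so the per-block carry-less multiplies become independent of each other and
a single reduction can be deferred to the end of the group; that is the role
of the 48 precomputed subkeys mentioned above. A minimal, unoptimized C
sketch of the idea (bitwise GF(2^128) multiply per NIST SP 800-38D; the
be128_t layout, helper names and byte-order handling are illustrative and
are not the interface used by this patch):

#include <stdint.h>

/* One 16-byte GHASH block, hi = leftmost (big-endian) 64 bits. */
typedef struct { uint64_t hi, lo; } be128_t;

/* Bitwise multiplication in GF(2^128), NIST SP 800-38D Algorithm 1. */
static be128_t gf128_mul(be128_t x, be128_t y)
{
	be128_t z = { 0, 0 };
	be128_t v = y;
	int i;

	for (i = 0; i < 128; i++) {
		/* bit i of x, counting from the most significant bit */
		uint64_t bit = (i < 64) ? (x.hi >> (63 - i)) & 1
					: (x.lo >> (127 - i)) & 1;
		uint64_t lsb = v.lo & 1;

		if (bit) {
			z.hi ^= v.hi;
			z.lo ^= v.lo;
		}
		/* v >>= 1, folding in the GHASH polynomial if a bit fell off */
		v.lo = (v.lo >> 1) | (v.hi << 63);
		v.hi >>= 1;
		if (lsb)
			v.hi ^= 0xE100000000000000ULL;
	}
	return z;
}

/* Precompute the key powers: hpow[i] = H^(n-i), so hpow[0] = H^n. */
static void ghash_precompute(be128_t h, be128_t *hpow, int n)
{
	int i;

	hpow[n - 1] = h;
	for (i = n - 2; i >= 0; i--)
		hpow[i] = gf128_mul(hpow[i + 1], h);
}

/*
 * Aggregated GHASH update over n blocks: every product is independent,
 * which is what the 4-wide VPCLMULQDQ lanes exploit; this reference
 * version simply does them serially and XORs the partial products.
 */
static be128_t ghash_update(be128_t y, const be128_t *blocks,
			    const be128_t *hpow, int n)
{
	be128_t acc = { 0, 0 };
	int i;

	for (i = 0; i < n; i++) {
		be128_t x = blocks[i];
		be128_t t;

		if (i == 0) {		/* fold the running hash into block 0 */
			x.hi ^= y.hi;
			x.lo ^= y.lo;
		}
		t = gf128_mul(x, hpow[i]);
		acc.hi ^= t.hi;
		acc.lo ^= t.lo;
	}
	return acc;
}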

The VPCLMULQDQ instruction is used to accelerate the most time-consuming
part of GHASH, the carry-less multiplication. VPCLMULQDQ together with
AVX-512F adds an EVEX-encoded 512-bit version of the PCLMULQDQ instruction.
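
At the intrinsics level, the 512-bit form amounts to four independent
64x64 -> 128-bit carry-less multiplies per instruction, one per 128-bit
lane, which covers the partial products GHASH needs for four blocks at
once. A hedged illustration (not code from this patch) using the
_mm512_clmulepi64_epi128() intrinsic:

#include <immintrin.h>

/*
 * One 512-bit VPCLMULQDQ does four independent 64x64 -> 128-bit
 * carry-less multiplies, one per 128-bit lane.  The immediate picks
 * which 64-bit half of each lane is multiplied: 0x00 = low*low,
 * 0x11 = high*high, 0x01/0x10 = the mixed products.
 * Build with e.g. -mavx512f -mvpclmulqdq.
 */
static inline __m512i clmul_lo_lo(__m512i a, __m512i b)
{
	return _mm512_clmulepi64_epi128(a, b, 0x00);
}

static inline __m512i clmul_hi_hi(__m512i a, __m512i b)
{
	return _mm512_clmulepi64_epi128(a, b, 0x11);
}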

The glue code in the ghash_clmulni_intel module overrides the existing
PCLMULQDQ version with the VPCLMULQDQ version when the following criteria
are met (see the sketch after this list):
At compile time:
1. CONFIG_CRYPTO_AVX512 is enabled
2. the toolchain (assembler) supports the VPCLMULQDQ instruction
At runtime:
1. the VPCLMULQDQ and AVX512VL features are supported on the platform
   (currently only Icelake)
2. if compiled as a built-in module, ghash_clmulni_intel.use_avx512 is set
   at boot time or /sys/module/ghash_clmulni_intel/parameters/use_avx512 is
   set to 1 after boot.
   If compiled as a loadable module, the use_avx512 module parameter must
   be set:
   modprobe ghash_clmulni_intel use_avx512=1
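
A rough sketch of how this compile-time plus runtime gating typically looks
in the glue code (the helper below and its name are illustrative, not lifted
from the patch; only module_param(), boot_cpu_has() and the Kconfig/feature
symbols are standard kernel interfaces):

#include <linux/module.h>
#include <asm/cpufeature.h>
#include <asm/cpufeatures.h>

/* Off by default; flipped at boot or via sysfs as described above. */
static bool use_avx512;
module_param(use_avx512, bool, 0644);
MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");

/*
 * Illustrative selection helper: take the VPCLMULQDQ path only when the
 * kernel was built with the AVX512 option, the CPU advertises both
 * VPCLMULQDQ and AVX512VL, and the administrator opted in.
 */
static bool ghash_use_vpclmulqdq(void)
{
	if (!IS_ENABLED(CONFIG_CRYPTO_GHASH_CLMUL_NI_AVX512))
		return false;
	if (!use_avx512)
		return false;
	return boot_cpu_has(X86_FEATURE_VPCLMULQDQ) &&
	       boot_cpu_has(X86_FEATURE_AVX512VL);
}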

With the new implementation, the tcrypt GHASH speed test shows about a 4x
to 10x speedup for GHASH calculation compared to the original PCLMULQDQ
implementation when the bytes per update size is 256 bytes or above.
Detailed results for a variety of block sizes and update sizes are in the
table below. The test was performed on an Icelake based platform with the
CPU frequency fixed.

The average performance improvement of the AVX512 version over the current
implementation is as follows:
For bytes per update >= 1KB, we see an average improvement of 882% (~8.8x).
For bytes per update < 1KB, we see an average improvement of 370% (~3.7x).

A typical tcrypt run of the GHASH calculation with the PCLMULQDQ and
VPCLMULQDQ instructions shows the following results.

---------------------------------------------------------------------------
|            |            |         cycles/operation         |            |
|            |            |       (the lower the better)     |            |
|    byte    |   bytes    |----------------------------------| percentage |
|   blocks   | per update |   GHASH test   |   GHASH test    | loss/gain  |
|            |            | with PCLMULQDQ | with VPCLMULQDQ |            |
|------------|------------|----------------|-----------------|------------|
|      16    |     16     |       144      |        233      |   -38.0    |
|      64    |     16     |       535      |        709      |   -24.5    |
|      64    |     64     |       210      |        146      |    43.8    |
|     256    |     16     |      1808      |       1911      |    -5.4    |
|     256    |     64     |       865      |        581      |    48.9    |
|     256    |    256     |       682      |        170      |   301.0    |
|    1024    |     16     |      6746      |       6935      |    -2.7    |
|    1024    |    256     |      2829      |        714      |   296.0    |
|    1024    |   1024     |      2543      |        341      |   645.0    |
|    2048    |     16     |     13219      |      13403      |    -1.3    |
|    2048    |    256     |      5435      |       1408      |   286.0    |
|    2048    |   1024     |      5218      |        685      |   661.0    |
|    2048    |   2048     |      5061      |        565      |   796.0    |
|    4096    |     16     |     40793      |      27615      |    47.8    |
|    4096    |    256     |     10662      |       2689      |   297.0    |
|    4096    |   1024     |     10196      |       1333      |   665.0    |
|    4096    |   4096     |     10049      |       1011      |   894.0    |
|    8192    |     16     |     51672      |      54599      |    -5.3    |
|    8192    |    256     |     21228      |       5284      |   301.0    |
|    8192    |   1024     |     20306      |       2556      |   694.0    |
|    8192    |   4096     |     20076      |       2044      |   882.0    |
|    8192    |   8192     |     20071      |       2017      |   895.0    |
---------------------------------------------------------------------------

This work was inspired by the AES GCM mode optimization published
in the Intel Optimized IPSEC Cryptographic library:
https://github.com/intel/intel-ipsec-mb/lib/avx512/gcm_vaes_avx512.asm

Co-developed-by: Greg Tucker <greg.b.tucker@intel.com>
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
Co-developed-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
Signed-off-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
Co-developed-by: Megha Dey <megha.dey@intel.com>
Signed-off-by: Megha Dey <megha.dey@intel.com>
---
 arch/x86/crypto/Makefile                     |    1 +
 arch/x86/crypto/avx512_vaes_common.S         | 1211 ++++++++++++++++++++++++++
 arch/x86/crypto/ghash-clmulni-intel_avx512.S |   68 ++
 arch/x86/crypto/ghash-clmulni-intel_glue.c   |   39 +-
 crypto/Kconfig                               |   12 +
 5 files changed, 1329 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/crypto/avx512_vaes_common.S
 create mode 100644 arch/x86/crypto/ghash-clmulni-intel_avx512.S

Comments

Ard Biesheuvel Dec. 19, 2020, 5:03 p.m. UTC | #1
On Fri, 18 Dec 2020 at 22:07, Megha Dey <megha.dey@intel.com> wrote:
>
> From: Kyung Min Park <kyung.min.park@intel.com>
>
> Optimize GHASH computations with the 512 bit wide VPCLMULQDQ instructions.
> The new instruction allows to work on 4 x 16 byte blocks at the time.
> For best parallelism and deeper out of order execution, the main loop of
> the code works on 16 x 16 byte blocks at the time and performs reduction
> every 48 x 16 byte blocks. Such approach needs 48 precomputed GHASH subkeys
> and the precompute operation has been optimized as well to leverage 512 bit
> registers, parallel carry less multiply and reduction.
>
> VPCLMULQDQ instruction is used to accelerate the most time-consuming
> part of GHASH, carry-less multiplication. VPCLMULQDQ instruction
> with AVX-512F adds EVEX encoded 512 bit version of PCLMULQDQ instruction.
>
> The glue code in ghash_clmulni_intel module overrides existing PCLMULQDQ
> version with the VPCLMULQDQ version when the following criteria are met:
> At compile time:
> 1. CONFIG_CRYPTO_AVX512 is enabled
> 2. toolchain(assembler) supports VPCLMULQDQ instructions
> At runtime:
> 1. VPCLMULQDQ and AVX512VL features are supported on a platform (currently
>    only Icelake)
> 2. If compiled as built-in module, ghash_clmulni_intel.use_avx512 is set at
>    boot time or /sys/module/ghash_clmulni_intel/parameters/use_avx512 is set
>    to 1 after boot.
>    If compiled as loadable module, use_avx512 module parameter must be set:
>    modprobe ghash_clmulni_intel use_avx512=1
>
> With new implementation, tcrypt ghash speed test shows about 4x to 10x
> speedup improvement for GHASH calculation compared to the original
> implementation with PCLMULQDQ when the bytes per update size is 256 Bytes
> or above. Detailed results for a variety of block sizes and update
> sizes are in the table below. The test was performed on Icelake based
> platform with constant frequency set for CPU.
>
> The average performance improvement of the AVX512 version over the current
> implementation is as follows:
> For bytes per update >= 1KB, we see the average improvement of 882%(~8.8x).
> For bytes per update < 1KB, we see the average improvement of 370%(~3.7x).
>
> A typical run of tcrypt with GHASH calculation with PCLMULQDQ instruction
> and VPCLMULQDQ instruction shows the following results.
>
> ---------------------------------------------------------------------------
> |            |            |         cycles/operation         |            |
> |            |            |       (the lower the better)     |            |
> |    byte    |   bytes    |----------------------------------| percentage |
> |   blocks   | per update |   GHASH test   |   GHASH test    | loss/gain  |
> |            |            | with PCLMULQDQ | with VPCLMULQDQ |            |
> |------------|------------|----------------|-----------------|------------|
> |      16    |     16     |       144      |        233      |   -38.0    |
> |      64    |     16     |       535      |        709      |   -24.5    |
> |      64    |     64     |       210      |        146      |    43.8    |
> |     256    |     16     |      1808      |       1911      |    -5.4    |
> |     256    |     64     |       865      |        581      |    48.9    |
> |     256    |    256     |       682      |        170      |   301.0    |
> |    1024    |     16     |      6746      |       6935      |    -2.7    |
> |    1024    |    256     |      2829      |        714      |   296.0    |
> |    1024    |   1024     |      2543      |        341      |   645.0    |
> |    2048    |     16     |     13219      |      13403      |    -1.3    |
> |    2048    |    256     |      5435      |       1408      |   286.0    |
> |    2048    |   1024     |      5218      |        685      |   661.0    |
> |    2048    |   2048     |      5061      |        565      |   796.0    |
> |    4096    |     16     |     40793      |      27615      |    47.8    |
> |    4096    |    256     |     10662      |       2689      |   297.0    |
> |    4096    |   1024     |     10196      |       1333      |   665.0    |
> |    4096    |   4096     |     10049      |       1011      |   894.0    |
> |    8192    |     16     |     51672      |      54599      |    -5.3    |
> |    8192    |    256     |     21228      |       5284      |   301.0    |
> |    8192    |   1024     |     20306      |       2556      |   694.0    |
> |    8192    |   4096     |     20076      |       2044      |   882.0    |
> |    8192    |   8192     |     20071      |       2017      |   895.0    |
> ---------------------------------------------------------------------------
>
> This work was inspired by the AES GCM mode optimization published
> in Intel Optimized IPSEC Cryptographic library.
> https://github.com/intel/intel-ipsec-mb/lib/avx512/gcm_vaes_avx512.asm
>
> Co-developed-by: Greg Tucker <greg.b.tucker@intel.com>
> Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
> Co-developed-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
> Signed-off-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
> Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
> Co-developed-by: Megha Dey <megha.dey@intel.com>
> Signed-off-by: Megha Dey <megha.dey@intel.com>

Hello Megha,

What is the purpose of this separate GHASH module? GHASH is only used
in combination with AES-CTR to produce GCM, and this series already
contains a GCM driver.

Do cores exist that implement PCLMULQDQ but not AES-NI?

If not, I think we should be able to drop this patch (and remove the
existing PCLMULQDQ GHASH driver as well)
Dey, Megha Jan. 16, 2021, 12:14 a.m. UTC | #2
Hi Ard,

On 12/19/2020 9:03 AM, Ard Biesheuvel wrote:
> On Fri, 18 Dec 2020 at 22:07, Megha Dey <megha.dey@intel.com> wrote:
>> From: Kyung Min Park <kyung.min.park@intel.com>
>>
>> Optimize GHASH computations with the 512 bit wide VPCLMULQDQ instructions.
>> The new instruction allows to work on 4 x 16 byte blocks at the time.
>> For best parallelism and deeper out of order execution, the main loop of
>> the code works on 16 x 16 byte blocks at the time and performs reduction
>> every 48 x 16 byte blocks. Such approach needs 48 precomputed GHASH subkeys
>> and the precompute operation has been optimized as well to leverage 512 bit
>> registers, parallel carry less multiply and reduction.
>>
>> VPCLMULQDQ instruction is used to accelerate the most time-consuming
>> part of GHASH, carry-less multiplication. VPCLMULQDQ instruction
>> with AVX-512F adds EVEX encoded 512 bit version of PCLMULQDQ instruction.
>>
>> The glue code in ghash_clmulni_intel module overrides existing PCLMULQDQ
>> version with the VPCLMULQDQ version when the following criteria are met:
>> At compile time:
>> 1. CONFIG_CRYPTO_AVX512 is enabled
>> 2. toolchain(assembler) supports VPCLMULQDQ instructions
>> At runtime:
>> 1. VPCLMULQDQ and AVX512VL features are supported on a platform (currently
>>     only Icelake)
>> 2. If compiled as built-in module, ghash_clmulni_intel.use_avx512 is set at
>>     boot time or /sys/module/ghash_clmulni_intel/parameters/use_avx512 is set
>>     to 1 after boot.
>>     If compiled as loadable module, use_avx512 module parameter must be set:
>>     modprobe ghash_clmulni_intel use_avx512=1
>>
>> With new implementation, tcrypt ghash speed test shows about 4x to 10x
>> speedup improvement for GHASH calculation compared to the original
>> implementation with PCLMULQDQ when the bytes per update size is 256 Bytes
>> or above. Detailed results for a variety of block sizes and update
>> sizes are in the table below. The test was performed on Icelake based
>> platform with constant frequency set for CPU.
>>
>> The average performance improvement of the AVX512 version over the current
>> implementation is as follows:
>> For bytes per update >= 1KB, we see the average improvement of 882%(~8.8x).
>> For bytes per update < 1KB, we see the average improvement of 370%(~3.7x).
>>
>> A typical run of tcrypt with GHASH calculation with PCLMULQDQ instruction
>> and VPCLMULQDQ instruction shows the following results.
>>
>> ---------------------------------------------------------------------------
>> |            |            |         cycles/operation         |            |
>> |            |            |       (the lower the better)     |            |
>> |    byte    |   bytes    |----------------------------------| percentage |
>> |   blocks   | per update |   GHASH test   |   GHASH test    | loss/gain  |
>> |            |            | with PCLMULQDQ | with VPCLMULQDQ |            |
>> |------------|------------|----------------|-----------------|------------|
>> |      16    |     16     |       144      |        233      |   -38.0    |
>> |      64    |     16     |       535      |        709      |   -24.5    |
>> |      64    |     64     |       210      |        146      |    43.8    |
>> |     256    |     16     |      1808      |       1911      |    -5.4    |
>> |     256    |     64     |       865      |        581      |    48.9    |
>> |     256    |    256     |       682      |        170      |   301.0    |
>> |    1024    |     16     |      6746      |       6935      |    -2.7    |
>> |    1024    |    256     |      2829      |        714      |   296.0    |
>> |    1024    |   1024     |      2543      |        341      |   645.0    |
>> |    2048    |     16     |     13219      |      13403      |    -1.3    |
>> |    2048    |    256     |      5435      |       1408      |   286.0    |
>> |    2048    |   1024     |      5218      |        685      |   661.0    |
>> |    2048    |   2048     |      5061      |        565      |   796.0    |
>> |    4096    |     16     |     40793      |      27615      |    47.8    |
>> |    4096    |    256     |     10662      |       2689      |   297.0    |
>> |    4096    |   1024     |     10196      |       1333      |   665.0    |
>> |    4096    |   4096     |     10049      |       1011      |   894.0    |
>> |    8192    |     16     |     51672      |      54599      |    -5.3    |
>> |    8192    |    256     |     21228      |       5284      |   301.0    |
>> |    8192    |   1024     |     20306      |       2556      |   694.0    |
>> |    8192    |   4096     |     20076      |       2044      |   882.0    |
>> |    8192    |   8192     |     20071      |       2017      |   895.0    |
>> ---------------------------------------------------------------------------
>>
>> This work was inspired by the AES GCM mode optimization published
>> in Intel Optimized IPSEC Cryptographic library.
>> https://github.com/intel/intel-ipsec-mb/lib/avx512/gcm_vaes_avx512.asm
>>
>> Co-developed-by: Greg Tucker <greg.b.tucker@intel.com>
>> Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
>> Co-developed-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
>> Signed-off-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
>> Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
>> Co-developed-by: Megha Dey <megha.dey@intel.com>
>> Signed-off-by: Megha Dey <megha.dey@intel.com>
> Hello Megha,
>
> What is the purpose of this separate GHASH module? GHASH is only used
> in combination with AES-CTR to produce GCM, and this series already
> contains a GCM driver.
>
> Do cores exist that implement PCLMULQDQ but not AES-NI?
>
> If not, I think we should be able to drop this patch (and remove the
> existing PCLMULQDQ GHASH driver as well)

AFAIK, dm-verity (an authenticated but not encrypted filesystem) is one
use case for authentication only.

Although I am not sure whether GHASH is specifically used for this, or SHA?

Also, I do not know of any cores that implement PCLMULQDQ and not AES-NI.

Megha
Dave Hansen Jan. 16, 2021, 12:20 a.m. UTC | #3
On 1/15/21 4:14 PM, Dey, Megha wrote:
> Also, I do not know of any cores that implement PCLMULQDQ and not AES-NI.

That's true, but it's also possible that a hypervisor could enumerate
support for PCLMULQDQ and not AES-NI.  In general, we've tried to
implement x86 CPU features independently, even if they never show up in
a real CPU independently.
Eric Biggers Jan. 16, 2021, 1:43 a.m. UTC | #4
On Fri, Jan 15, 2021 at 04:14:40PM -0800, Dey, Megha wrote:
> > Hello Megha,
> > 
> > What is the purpose of this separate GHASH module? GHASH is only used
> > in combination with AES-CTR to produce GCM, and this series already
> > contains a GCM driver.
> > 
> > Do cores exist that implement PCLMULQDQ but not AES-NI?
> > 
> > If not, I think we should be able to drop this patch (and remove the
> > existing PCLMULQDQ GHASH driver as well)
> 
> AFAIK, dm-verity (authenticated but not encrypted file system) is one use
> case for authentication only.
> 
> Although I am not sure if GHASH is specifically used for this or SHA?
> 
> Also, I do not know of any cores that implement PCLMULQDQ and not AES-NI.
> 

dm-verity only uses unkeyed hash algorithms.  So no, it doesn't use GHASH.

- Eric
Eric Biggers Jan. 16, 2021, 2:04 a.m. UTC | #5
On Fri, Jan 15, 2021 at 04:20:44PM -0800, Dave Hansen wrote:
> On 1/15/21 4:14 PM, Dey, Megha wrote:
> > Also, I do not know of any cores that implement PCLMULQDQ and not AES-NI.
> 
> That's true, bit it's also possible that a hypervisor could enumerate
> support for PCLMULQDQ and not AES-NI.  In general, we've tried to
> implement x86 CPU features independently, even if they never show up in
> a real CPU independently.

We only add optimized implementations of crypto algorithms if they are actually
useful, though.  If they would never be used in practice, that's not useful.

- Eric
Dey, Megha Jan. 16, 2021, 5:07 a.m. UTC | #6
On 1/15/2021 5:43 PM, Eric Biggers wrote:
> On Fri, Jan 15, 2021 at 04:14:40PM -0800, Dey, Megha wrote:
>>> Hello Megha,
>>>
>>> What is the purpose of this separate GHASH module? GHASH is only used
>>> in combination with AES-CTR to produce GCM, and this series already
>>> contains a GCM driver.
>>>
>>> Do cores exist that implement PCLMULQDQ but not AES-NI?
>>>
>>> If not, I think we should be able to drop this patch (and remove the
>>> existing PCLMULQDQ GHASH driver as well)
>> AFAIK, dm-verity (authenticated but not encrypted file system) is one use
>> case for authentication only.
>>
>> Although I am not sure if GHASH is specifically used for this or SHA?
>>
>> Also, I do not know of any cores that implement PCLMULQDQ and not AES-NI.
>>
> dm-verity only uses unkeyed hash algorithms.  So no, it doesn't use GHASH.

Hmm, I see. If that is the case, I am not aware of any other use case 
apart from GCM.

I see that the existing GHASH module has been in the kernel since 2009. I
am not sure if there was a use case then which is no longer valid now.

There may be out-of-tree kernel modules using it, but again that is only
speculation.

So, in the next version should I remove the existing GHASH module? (And 
of course remove this patch as well?)

-Megha

>
> - Eric
Dave Hansen Jan. 16, 2021, 5:13 a.m. UTC | #7
On 1/15/21 6:04 PM, Eric Biggers wrote:
> On Fri, Jan 15, 2021 at 04:20:44PM -0800, Dave Hansen wrote:
>> On 1/15/21 4:14 PM, Dey, Megha wrote:
>>> Also, I do not know of any cores that implement PCLMULQDQ and not AES-NI.
>> That's true, bit it's also possible that a hypervisor could enumerate
>> support for PCLMULQDQ and not AES-NI.  In general, we've tried to
>> implement x86 CPU features independently, even if they never show up in
>> a real CPU independently.
> We only add optimized implementations of crypto algorithms if they are actually
> useful, though.  If they would never be used in practice, that's not useful.

Yes, totally agree.  If it's not of practical use, it doesn't get merged.

I just wanted to share what we do for other related but independent CPU
features.
Ard Biesheuvel Jan. 16, 2021, 4:48 p.m. UTC | #8
On Sat, 16 Jan 2021 at 06:13, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 1/15/21 6:04 PM, Eric Biggers wrote:
> > On Fri, Jan 15, 2021 at 04:20:44PM -0800, Dave Hansen wrote:
> >> On 1/15/21 4:14 PM, Dey, Megha wrote:
> >>> Also, I do not know of any cores that implement PCLMULQDQ and not AES-NI.
> >> That's true, bit it's also possible that a hypervisor could enumerate
> >> support for PCLMULQDQ and not AES-NI.  In general, we've tried to
> >> implement x86 CPU features independently, even if they never show up in
> >> a real CPU independently.
> > We only add optimized implementations of crypto algorithms if they are actually
> > useful, though.  If they would never be used in practice, that's not useful.
>
> Yes, totally agree.  If it's not of practical use, it doesn't get merged.
>
> I just wanted to share what we do for other related but independent CPU
> features.

Thanks for the insight.

The issue with the current GHASH driver is that it uses infrastructure
that we may decide to remove (the async cryptd helper [0]). So adding
more dependencies on that without any proven benefit should obviously
be avoided at this time as well.

[0] https://lore.kernel.org/linux-arm-kernel/20201218170106.23280-1-ardb@kernel.org/

Patch

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index bf0b0fc..0a86cfb 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -70,6 +70,7 @@  blake2s-x86_64-y := blake2s-core.o blake2s-glue.o
 
 obj-$(CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL) += ghash-clmulni-intel.o
 ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o
+ghash-clmulni-intel-$(CONFIG_CRYPTO_GHASH_CLMUL_NI_AVX512) += ghash-clmulni-intel_avx512.o
 
 obj-$(CONFIG_CRYPTO_CRC32C_INTEL) += crc32c-intel.o
 crc32c-intel-y := crc32c-intel_glue.o
diff --git a/arch/x86/crypto/avx512_vaes_common.S b/arch/x86/crypto/avx512_vaes_common.S
new file mode 100644
index 0000000..f3ee898
--- /dev/null
+++ b/arch/x86/crypto/avx512_vaes_common.S
@@ -0,0 +1,1211 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright © 2020 Intel Corporation.
+ *
+ * Collection of macros which can be used by any crypto code using VAES,
+ * VPCLMULQDQ and AVX512 optimizations.
+ */
+
+#include <linux/linkage.h>
+#include <asm/inst.h>
+
+#define zmm31y ymm31
+#define zmm30y ymm30
+#define zmm29y ymm29
+#define zmm28y ymm28
+#define zmm27y ymm27
+#define zmm26y ymm26
+#define zmm25y ymm25
+#define zmm24y ymm24
+#define zmm23y ymm23
+#define zmm22y ymm22
+#define zmm21y ymm21
+#define zmm20y ymm20
+#define zmm19y ymm19
+#define zmm18y ymm18
+#define zmm17y ymm17
+#define zmm16y ymm16
+#define zmm15y ymm15
+#define zmm13y ymm13
+#define zmm12y ymm12
+#define zmm11y ymm11
+#define zmm10y ymm10
+#define zmm9y  ymm9
+#define zmm8y  ymm8
+#define zmm7y  ymm7
+#define zmm6y  ymm6
+#define zmm5y  ymm5
+#define zmm4y  ymm4
+#define zmm3y  ymm3
+#define zmm2y  ymm2
+#define zmm1y  ymm1
+#define zmm0y  ymm0
+
+#define zmm31x xmm31
+#define zmm30x xmm30
+#define zmm29x xmm29
+#define zmm28x xmm28
+#define zmm27x xmm27
+#define zmm26x xmm26
+#define zmm25x xmm25
+#define zmm24x xmm24
+#define zmm23x xmm23
+#define zmm22x xmm22
+#define zmm21x xmm21
+#define zmm20x xmm20
+#define zmm19x xmm19
+#define zmm18x xmm18
+#define zmm17x xmm17
+#define zmm16x xmm16
+#define zmm15x xmm15
+#define zmm14x xmm14
+#define zmm13x xmm13
+#define zmm12x xmm12
+#define zmm11x xmm11
+#define zmm10x xmm10
+#define zmm9x  xmm9
+#define zmm8x  xmm8
+#define zmm7x  xmm7
+#define zmm6x  xmm6
+#define zmm5x  xmm5
+#define zmm4x  xmm4
+#define zmm3x  xmm3
+#define zmm2x  xmm2
+#define zmm1x  xmm1
+#define zmm0x xmm0
+
+#define ymm5y ymm5
+#define ymm4y ymm4
+#define ymm3y ymm3
+#define ymm2y ymm2
+#define ymm1y ymm1
+
+#define ymm12x xmm12
+#define ymm11x xmm11
+#define ymm7x xmm7
+#define ymm6x xmm6
+#define ymm5x xmm5
+#define ymm4x xmm4
+#define ymm3x xmm3
+#define ymm2x xmm2
+#define ymm1x xmm1
+
+#define xmm14z zmm14
+#define xmm10z zmm10
+#define xmm2z zmm2
+#define xmm0z zmm0
+#define xmm5z zmm5
+#define xmm4z zmm4
+#define xmm3z zmm3
+#define xmm1z zmm1
+#define xmm6z zmm6
+#define xmm7z zmm7
+#define xmm8z zmm8
+#define xmm9z zmm9
+
+#define xmm11y ymm11
+#define xmm9y ymm9
+#define xmm5y ymm5
+#define xmm4y ymm4
+#define xmm3y ymm3
+#define xmm2y ymm2
+#define xmm1y ymm1
+#define xmm0y ymm0
+
+#define xmm14x xmm14
+#define xmm8x xmm8
+#define xmm7x xmm7
+#define xmm6x xmm6
+#define xmm5x xmm5
+#define xmm4x xmm4
+#define xmm3x xmm3
+#define xmm2x xmm2
+#define xmm1x xmm1
+#define xmm0x xmm0
+
+#define xmm0z  zmm0
+#define xmm0y  ymm0
+#define xmm0x  xmm0
+
+#define stringify(reg,y)       reg##y
+#define str(reg,y)	       stringify(reg,y)
+#define concat(reg,y)	       str(reg,y)
+
+#define YWORD(reg)     concat(reg, y)
+#define XWORD(reg)     concat(reg, x)
+#define ZWORD(reg)     concat(reg, z)
+#define DWORD(reg)     concat(reg, d)
+#define WORD(reg)      concat(reg, w)
+#define BYTE(reg)      concat(reg, b)
+
+#define arg1	%rdi
+#define arg2	%rsi
+#define arg3	%rdx
+#define arg4	%rcx
+#define arg5	%r8
+#define arg6	%r9
+
+#define STACK_LOCAL_OFFSET	  64
+#define LOCAL_STORAGE		  (48*16)	 //space for up to 128 AES blocks
+#define STACK_FRAME_SIZE_GHASH	  (STACK_LOCAL_OFFSET + LOCAL_STORAGE)
+
+#define HashKey_48	(16*0)
+#define HashKey_47	(16*1)
+#define HashKey_46	(16*2)
+#define HashKey_45	(16*3)
+#define HashKey_44	(16*4)
+#define HashKey_43	(16*5)
+#define HashKey_42	(16*6)
+#define HashKey_41	(16*7)
+#define HashKey_40	(16*8)
+#define HashKey_39	(16*9)
+#define HashKey_38	(16*10)
+#define HashKey_37	(16*11)
+#define HashKey_36	(16*12)
+#define HashKey_35	(16*13)
+#define HashKey_34	(16*14)
+#define HashKey_33	(16*15)
+#define HashKey_32	(16*16)
+#define HashKey_31	(16*17)
+#define HashKey_30	(16*18)
+#define HashKey_29	(16*19)
+#define HashKey_28	(16*20)
+#define HashKey_27	(16*21)
+#define HashKey_26	(16*22)
+#define HashKey_25	(16*23)
+#define HashKey_24	(16*24)
+#define HashKey_23	(16*25)
+#define HashKey_22     (16*26)
+#define HashKey_21     (16*27)
+#define HashKey_20     (16*28)
+#define HashKey_19     (16*29)
+#define HashKey_18     (16*30)
+#define HashKey_17     (16*31)
+#define HashKey_16	(16*32)
+#define HashKey_15	(16*33)
+#define HashKey_14	(16*34)
+#define HashKey_13	(16*35)
+#define HashKey_12	(16*36)
+#define HashKey_11	(16*37)
+#define HashKey_10	(16*38)
+#define HashKey_9      (16*39)
+#define HashKey_8      (16*40)
+#define HashKey_7      (16*41)
+#define HashKey_6      (16*42)
+#define HashKey_5      (16*43)
+#define HashKey_4      (16*44)
+#define HashKey_3      (16*45)
+#define HashKey_2      (16*46)
+#define HashKey_1      (16*47)
+#define HashKey      (16*47)
+
+.data
+
+.align 16
+ONE:
+.octa	0x00000000000000000000000000000001
+
+.align 16
+POLY:
+.octa	0xC2000000000000000000000000000001
+
+.align 16
+TWOONE:
+.octa	0x00000001000000000000000000000001
+
+/*
+ * Order of these constants should not change.
+ * ALL_F should follow SHIFT_MASK, ZERO should follow ALL_F
+ */
+.align 16
+SHIFT_MASK:
+.octa	0x0f0e0d0c0b0a09080706050403020100
+
+ALL_F:
+.octa	0xffffffffffffffffffffffffffffffff
+
+ZERO:
+.octa	0x00000000000000000000000000000000
+
+.align 16
+ONEf:
+.octa	0x01000000000000000000000000000000
+
+.align 64
+SHUF_MASK:
+.octa	0x000102030405060708090A0B0C0D0E0F
+.octa	0x000102030405060708090A0B0C0D0E0F
+.octa	0x000102030405060708090A0B0C0D0E0F
+.octa	0x000102030405060708090A0B0C0D0E0F
+
+.align 64
+byte_len_to_mask_table:
+.quad	0x0007000300010000
+.quad	0x007f003f001f000f
+.quad	0x07ff03ff01ff00ff
+.quad	0x7fff3fff1fff0fff
+.quad	0xffff
+
+.align 64
+byte64_len_to_mask_table:
+.octa	0x00000000000000010000000000000000
+.octa	0x00000000000000070000000000000003
+.octa	0x000000000000001f000000000000000f
+.octa	0x000000000000007f000000000000003f
+.octa	0x00000000000001ff00000000000000ff
+.octa	0x00000000000007ff00000000000003ff
+.octa	0x0000000000001fff0000000000000fff
+.octa	0x0000000000007fff0000000000003fff
+.octa	0x000000000001ffff000000000000ffff
+.octa	0x000000000007ffff000000000003ffff
+.octa	0x00000000001fffff00000000000fffff
+.octa	0x00000000007fffff00000000003fffff
+.octa	0x0000000001ffffff0000000000ffffff
+.octa	0x0000000007ffffff0000000003ffffff
+.octa	0x000000001fffffff000000000fffffff
+.octa	0x000000007fffffff000000003fffffff
+.octa	0x00000001ffffffff00000000ffffffff
+.octa	0x00000007ffffffff00000003ffffffff
+.octa	0x0000001fffffffff0000000fffffffff
+.octa	0x0000007fffffffff0000003fffffffff
+.octa	0x000001ffffffffff000000ffffffffff
+.octa	0x000007ffffffffff000003ffffffffff
+.octa	0x00001fffffffffff00000fffffffffff
+.octa	0x00007fffffffffff00003fffffffffff
+.octa	0x0001ffffffffffff0000ffffffffffff
+.octa	0x0007ffffffffffff0003ffffffffffff
+.octa	0x001fffffffffffff000fffffffffffff
+.octa	0x007fffffffffffff003fffffffffffff
+.octa	0x01ffffffffffffff00ffffffffffffff
+.octa	0x07ffffffffffffff03ffffffffffffff
+.octa	0x1fffffffffffffff0fffffffffffffff
+.octa	0x7fffffffffffffff3fffffffffffffff
+.octa	0xffffffffffffffff
+
+.align 64
+mask_out_top_block:
+.octa	0xffffffffffffffffffffffffffffffff
+.octa	0xffffffffffffffffffffffffffffffff
+.octa	0xffffffffffffffffffffffffffffffff
+.octa	0x00000000000000000000000000000000
+
+.align 64
+ddq_add_1234:
+.octa	0x00000000000000000000000000000001
+.octa	0x00000000000000000000000000000002
+.octa	0x00000000000000000000000000000003
+.octa	0x00000000000000000000000000000004
+
+.align 64
+ddq_add_5678:
+.octa	0x00000000000000000000000000000005
+.octa	0x00000000000000000000000000000006
+.octa	0x00000000000000000000000000000007
+.octa	0x00000000000000000000000000000008
+
+.align 64
+ddq_add_4444:
+.octa	0x00000000000000000000000000000004
+.octa	0x00000000000000000000000000000004
+.octa	0x00000000000000000000000000000004
+.octa	0x00000000000000000000000000000004
+
+.align 64
+ddq_add_8888:
+.octa	0x00000000000000000000000000000008
+.octa	0x00000000000000000000000000000008
+.octa	0x00000000000000000000000000000008
+.octa	0x00000000000000000000000000000008
+
+.align 64
+ddq_addbe_1234:
+.octa	0x01000000000000000000000000000000
+.octa	0x02000000000000000000000000000000
+.octa	0x03000000000000000000000000000000
+.octa	0x04000000000000000000000000000000
+
+.align 64
+ddq_addbe_4444:
+.octa	0x04000000000000000000000000000000
+.octa	0x04000000000000000000000000000000
+.octa	0x04000000000000000000000000000000
+.octa	0x04000000000000000000000000000000
+
+.align 64
+ddq_addbe_8888:
+.octa	0x08000000000000000000000000000000
+.octa	0x08000000000000000000000000000000
+.octa	0x08000000000000000000000000000000
+.octa	0x08000000000000000000000000000000
+
+.align 64
+POLY2:
+.octa	0xC20000000000000000000001C2000000
+.octa	0xC20000000000000000000001C2000000
+.octa	0xC20000000000000000000001C2000000
+.octa	0xC20000000000000000000001C2000000
+
+.align 16
+byteswap_const:
+.octa	0x000102030405060708090A0B0C0D0E0F
+
+.text
+
+/* Save register content for the caller */
+#define FUNC_SAVE_GHASH()			\
+	mov	%rsp, %rax;		\
+	sub	$STACK_FRAME_SIZE_GHASH, %rsp;\
+	and	$~63, %rsp;		\
+	mov	%r12, 0*8(%rsp);	\
+	mov	%r13, 1*8(%rsp);	\
+	mov	%r14, 2*8(%rsp);	\
+	mov	%r15, 3*8(%rsp);	\
+	mov	%rax, 4*8(%rsp);	\
+	mov	%rax, 4*8(%rsp);	\
+	mov	%rax, %r14;		\
+	mov	%rbp, 5*8(%rsp);	\
+	mov	%rbx, 6*8(%rsp);	\
+
+/* Restore register content for the caller */
+#define FUNC_RESTORE_GHASH()		  \
+	mov	5*8(%rsp), %rbp;	\
+	mov	6*8(%rsp), %rbx;	\
+	mov	0*8(%rsp), %r12;	\
+	mov	1*8(%rsp), %r13;	\
+	mov	2*8(%rsp), %r14;	\
+	mov	3*8(%rsp), %r15;	\
+	mov	4*8(%rsp), %rsp;	\
+
+/*
+ * GHASH school book multiplication
+ */
+#define GHASH_MUL(GH, HK, T1, T2, T3, T4, T5)			\
+	vpclmulqdq	$0x11, HK, GH, T1;			\
+	vpclmulqdq	$0x00, HK, GH, T2;			\
+	vpclmulqdq	$0x01, HK, GH, T3;			\
+	vpclmulqdq	$0x10, HK, GH, GH;			\
+	vpxorq		T3, GH, GH;				\
+	vpsrldq		$8, GH, T3;				\
+	vpslldq		$8, GH, GH;				\
+	vpxorq		T3, T1, T1;				\
+	vpxorq		T2, GH, GH;				\
+	vmovdqu64	POLY2(%rip), T3;			\
+	vpclmulqdq	$0x01, GH, T3, T2;			\
+	vpslldq		$8, T2, T2;				\
+	vpxorq		T2, GH, GH;				\
+	vpclmulqdq	$0x00, GH, T3, T2;			\
+	vpsrldq		$4, T2, T2;				\
+	vpclmulqdq	$0x10, GH, T3, GH;			\
+	vpslldq		$4, GH, GH;				\
+	vpternlogq	$0x96, T2, T1, GH;
+
+/*
+ * Precomputation of hash keys. These precomputated keys
+ * are saved in memory and reused for as many 8 blocks sets
+ * as necessary.
+ */
+#define PRECOMPUTE(GDATA, HK, T1, T2, T3, T4, T5, T6, T7, T8) \
+\
+	vmovdqa64	HK, T5; \
+	vinserti64x2	$3, HK, ZWORD(T7), ZWORD(T7); \
+	GHASH_MUL(T5, HK, T1, T3, T4, T6, T2) \
+	vmovdqu64	T5, HashKey_2(GDATA); \
+	vinserti64x2	$2, T5, ZWORD(T7), ZWORD(T7); \
+	GHASH_MUL(T5, HK, T1, T3, T4, T6, T2) \
+	vmovdqu64	T5, HashKey_3(GDATA); \
+	vinserti64x2	$1, T5, ZWORD(T7), ZWORD(T7); \
+	GHASH_MUL(T5, HK, T1, T3, T4, T6, T2) \
+	vmovdqu64	T5, HashKey_4(GDATA); \
+	vinserti64x2	$0, T5, ZWORD(T7), ZWORD(T7); \
+	vshufi64x2	$0x00, ZWORD(T5), ZWORD(T5), ZWORD(T5); \
+	vmovdqa64	ZWORD(T7), ZWORD(T8); \
+	GHASH_MUL(ZWORD(T7), ZWORD(T5), ZWORD(T1), ZWORD(T3), ZWORD(T4), ZWORD(T6), ZWORD(T2)) \
+	vmovdqu64	ZWORD(T7), HashKey_8(GDATA); \
+	vshufi64x2	$0x00, ZWORD(T7), ZWORD(T7), ZWORD(T5); \
+	GHASH_MUL(ZWORD(T8), ZWORD(T5), ZWORD(T1), ZWORD(T3), ZWORD(T4), ZWORD(T6), ZWORD(T2)) \
+	vmovdqu64 ZWORD(T8), HashKey_12(GDATA); \
+	GHASH_MUL(ZWORD(T7), ZWORD(T5), ZWORD(T1), ZWORD(T3), ZWORD(T4), ZWORD(T6), ZWORD(T2)) \
+	vmovdqu64 ZWORD(T7), HashKey_16(GDATA); \
+	GHASH_MUL(ZWORD(T8), ZWORD(T5), ZWORD(T1), ZWORD(T3), ZWORD(T4), ZWORD(T6), ZWORD(T2)) \
+	vmovdqu64 ZWORD(T8), HashKey_20(GDATA); \
+	GHASH_MUL(ZWORD(T7), ZWORD(T5), ZWORD(T1), ZWORD(T3), ZWORD(T4), ZWORD(T6), ZWORD(T2)) \
+	vmovdqu64 ZWORD(T7), HashKey_24(GDATA); \
+	GHASH_MUL(ZWORD(T8), ZWORD(T5), ZWORD(T1), ZWORD(T3), ZWORD(T4), ZWORD(T6), ZWORD(T2)) \
+	vmovdqu64 ZWORD(T8), HashKey_28(GDATA); \
+	GHASH_MUL(ZWORD(T7), ZWORD(T5), ZWORD(T1), ZWORD(T3), ZWORD(T4), ZWORD(T6), ZWORD(T2)) \
+	vmovdqu64 ZWORD(T7), HashKey_32(GDATA); \
+	GHASH_MUL(ZWORD(T8), ZWORD(T5), ZWORD(T1), ZWORD(T3), ZWORD(T4), ZWORD(T6), ZWORD(T2)) \
+	vmovdqu64 ZWORD(T8), HashKey_36(GDATA); \
+	GHASH_MUL(ZWORD(T7), ZWORD(T5), ZWORD(T1), ZWORD(T3), ZWORD(T4), ZWORD(T6), ZWORD(T2)) \
+	vmovdqu64 ZWORD(T7), HashKey_40(GDATA); \
+	GHASH_MUL(ZWORD(T8), ZWORD(T5), ZWORD(T1), ZWORD(T3), ZWORD(T4), ZWORD(T6), ZWORD(T2)) \
+	vmovdqu64 ZWORD(T8), HashKey_44(GDATA); \
+	GHASH_MUL(ZWORD(T7), ZWORD(T5), ZWORD(T1), ZWORD(T3), ZWORD(T4), ZWORD(T6), ZWORD(T2)) \
+	vmovdqu64 ZWORD(T7), HashKey_48(GDATA);
+
+#define VHPXORI4x128(REG,TMP)					\
+	vextracti64x4	$1, REG, YWORD(TMP);			\
+	vpxorq		YWORD(TMP), YWORD(REG), YWORD(REG);	\
+	vextracti32x4	$1, YWORD(REG), XWORD(TMP);		\
+	vpxorq		XWORD(TMP), XWORD(REG), XWORD(REG);
+
+#define VCLMUL_REDUCE(OUT, POLY, HI128, LO128, TMP0, TMP1)	\
+	vpclmulqdq	$0x01, LO128, POLY, TMP0;		\
+	vpslldq		$8, TMP0, TMP0;				\
+	vpxorq		TMP0, LO128, TMP0;			\
+	vpclmulqdq	$0x00, TMP0, POLY, TMP1;		\
+	vpsrldq		$4, TMP1, TMP1;				\
+	vpclmulqdq	$0x10, TMP0, POLY, OUT;			\
+	vpslldq		$4, OUT, OUT;				\
+	vpternlogq	$0x96, HI128, TMP1, OUT;
+
+/*
+ * GHASH 1 to 16 blocks of the input buffer.
+ *  - It performs reduction at the end.
+ *  - It can take intermediate GHASH sums as input.
+ */
+#define GHASH_1_TO_16(KP, OFFSET, GHASH, T1, T2, T3, T4, T5, T6, T7, T8, T9, AAD_HASH_IN, CIPHER_IN0, CIPHER_IN1, CIPHER_IN2, CIPHER_IN3, NUM_BLOCKS, BOOL, INSTANCE_TYPE, ROUND, HKEY_START, PREV_H, PREV_L, PREV_M1, PREV_M2) \
+.set	reg_idx, 0;	\
+.set	blocks_left, NUM_BLOCKS;	\
+.ifc INSTANCE_TYPE, single_call; \
+	.if BOOL == 1; \
+	.set	hashk, concat(HashKey_, NUM_BLOCKS);	\
+	.else; \
+	.set	hashk, concat(HashKey_, NUM_BLOCKS) + 0x11;	 \
+	.endif; \
+	.set	first_result, 1; \
+	.set	reduce, 1; \
+	vpxorq		AAD_HASH_IN, CIPHER_IN0, CIPHER_IN0; \
+.else;	\
+	.set	hashk, concat(HashKey_, HKEY_START);	\
+	.ifc ROUND, first; \
+		.set first_result, 1; \
+		.set reduce, 0; \
+		vpxorq		AAD_HASH_IN, CIPHER_IN0, CIPHER_IN0; \
+	.else; \
+		.ifc ROUND, mid; \
+		    .set first_result, 0; \
+		    .set reduce, 0; \
+		    vmovdqa64	    PREV_H, T1; \
+		    vmovdqa64	    PREV_L, T2; \
+		    vmovdqa64	    PREV_M1, T3; \
+		    vmovdqa64	    PREV_M2, T4; \
+		.else; \
+		    .set first_result, 0; \
+		    .set reduce, 1; \
+		    vmovdqa64	    PREV_H, T1; \
+		    vmovdqa64	    PREV_L, T2; \
+		    vmovdqa64	    PREV_M1, T3; \
+		    vmovdqa64	    PREV_M2, T4; \
+		.endif; \
+	.endif; \
+.endif; \
+.if NUM_BLOCKS < 4;	\
+	.if blocks_left == 1; \
+		.if first_result == 1;	\
+			vmovdqu64	hashk + OFFSET(KP), XWORD(T9); \
+			vpclmulqdq	$0x11, XWORD(T9), XWORD(CIPHER_IN0), XWORD(T1); \
+			vpclmulqdq	$0x00, XWORD(T9), XWORD(CIPHER_IN0), XWORD(T2); \
+			vpclmulqdq	$0x01, XWORD(T9), XWORD(CIPHER_IN0), XWORD(T3); \
+			vpclmulqdq	$0x10, XWORD(T9), XWORD(CIPHER_IN0), XWORD(T4); \
+		.else;	\
+			vmovdqu64	hashk + OFFSET(KP), XWORD(T9); \
+			vpclmulqdq	$0x11, XWORD(T9), XWORD(CIPHER_IN0), XWORD(T5); \
+			vpclmulqdq	$0x00, XWORD(T9), XWORD(CIPHER_IN0), XWORD(T6); \
+			vpclmulqdq	$0x01, XWORD(T9), XWORD(CIPHER_IN0), XWORD(T7); \
+			vpclmulqdq	$0x10, XWORD(T9), XWORD(CIPHER_IN0), XWORD(T8); \
+		.endif; \
+	.elseif blocks_left == 2; \
+		.if first_result == 1;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vpclmulqdq	$0x11, YWORD(T9), YWORD(CIPHER_IN0), YWORD(T1); \
+			vpclmulqdq	$0x00, YWORD(T9), YWORD(CIPHER_IN0), YWORD(T2); \
+			vpclmulqdq	$0x01, YWORD(T9), YWORD(CIPHER_IN0), YWORD(T3); \
+			vpclmulqdq	$0x10, YWORD(T9), YWORD(CIPHER_IN0), YWORD(T4); \
+		.else;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vpclmulqdq	$0x11, YWORD(T9), YWORD(CIPHER_IN0), YWORD(T5); \
+			vpclmulqdq	$0x00, YWORD(T9), YWORD(CIPHER_IN0), YWORD(T6); \
+			vpclmulqdq	$0x01, YWORD(T9), YWORD(CIPHER_IN0), YWORD(T7); \
+			vpclmulqdq	$0x10, YWORD(T9), YWORD(CIPHER_IN0), YWORD(T8); \
+		.endif; \
+	.elseif blocks_left == 3;	\
+		.if first_result == 1;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vinserti64x2	$2, 32 + hashk + OFFSET(KP), T9, T9; \
+			vpclmulqdq	$0x11, T9, CIPHER_IN0, T1; \
+			vpclmulqdq	$0x00, T9, CIPHER_IN0, T2; \
+			vpclmulqdq	$0x01, T9, CIPHER_IN0, T3; \
+			vpclmulqdq	$0x10, T9, CIPHER_IN0, T4; \
+		.else;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vinserti64x2	$2, 32 + hashk + OFFSET(KP), T9, T9; \
+			vpclmulqdq	$0x11, T9, CIPHER_IN0, T5; \
+			vpclmulqdq	$0x00, T9, CIPHER_IN0, T6; \
+			vpclmulqdq	$0x01, T9, CIPHER_IN0, T7; \
+			vpclmulqdq	$0x10, T9, CIPHER_IN0, T8; \
+		.endif; \
+	.endif; \
+	.if first_result != 1; \
+		 vpxorq		 T5, T1, T1; \
+		 vpxorq		 T6, T2, T2; \
+		 vpxorq		 T7, T3, T3; \
+		 vpxorq		 T8, T4, T4; \
+	.endif; \
+.elseif (NUM_BLOCKS >= 4) && (NUM_BLOCKS < 8); \
+	vmovdqu64	hashk + OFFSET(KP), T9; \
+	.if first_result == 1; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN0, T1; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN0, T2; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN0, T3; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN0, T4; \
+		.set first_result, 0; \
+	.else; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN0, T5; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN0, T6; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN0, T7; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN0, T8; \
+		vpxorq		T5, T1, T1; \
+		vpxorq		T6, T2, T2; \
+		vpxorq		T7, T3, T3; \
+		vpxorq		T8, T4, T4; \
+	.endif; \
+	.set hashk, hashk + 64; \
+	.set blocks_left, blocks_left - 4;	\
+	.set reg_idx, reg_idx + 1;	\
+	.if blocks_left > 0;	\
+	.if blocks_left == 1; \
+		.if first_result == 1;	\
+			vmovdqu64	hashk + OFFSET(KP), XWORD(T9); \
+			vpclmulqdq	$0x11, XWORD(T9), XWORD(CIPHER_IN1), XWORD(T1); \
+			vpclmulqdq	$0x00, XWORD(T9), XWORD(CIPHER_IN1), XWORD(T2); \
+			vpclmulqdq	$0x01, XWORD(T9), XWORD(CIPHER_IN1), XWORD(T3); \
+			vpclmulqdq	$0x10, XWORD(T9), XWORD(CIPHER_IN1), XWORD(T4); \
+		.else;	\
+			vmovdqu64	hashk + OFFSET(KP), XWORD(T9); \
+			vpclmulqdq	$0x11, XWORD(T9), XWORD(CIPHER_IN1), XWORD(T5); \
+			vpclmulqdq	$0x00, XWORD(T9), XWORD(CIPHER_IN1), XWORD(T6); \
+			vpclmulqdq	$0x01, XWORD(T9), XWORD(CIPHER_IN1), XWORD(T7); \
+			vpclmulqdq	$0x10, XWORD(T9), XWORD(CIPHER_IN1), XWORD(T8); \
+		.endif; \
+	.elseif blocks_left == 2; \
+		.if first_result == 1;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vpclmulqdq	$0x11, YWORD(T9), YWORD(CIPHER_IN1), YWORD(T1); \
+			vpclmulqdq	$0x00, YWORD(T9), YWORD(CIPHER_IN1), YWORD(T2); \
+			vpclmulqdq	$0x01, YWORD(T9), YWORD(CIPHER_IN1), YWORD(T3); \
+			vpclmulqdq	$0x10, YWORD(T9), YWORD(CIPHER_IN1), YWORD(T4); \
+		.else;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vpclmulqdq	$0x11, YWORD(T9), YWORD(CIPHER_IN1), YWORD(T5); \
+			vpclmulqdq	$0x00, YWORD(T9), YWORD(CIPHER_IN1), YWORD(T6); \
+			vpclmulqdq	$0x01, YWORD(T9), YWORD(CIPHER_IN1), YWORD(T7); \
+			vpclmulqdq	$0x10, YWORD(T9), YWORD(CIPHER_IN1), YWORD(T8); \
+		.endif; \
+	.elseif blocks_left == 3;  \
+		.if first_result == 1;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vinserti64x2	$2, 32 + hashk + OFFSET(KP), T9, T9; \
+			vpclmulqdq	$0x11, T9, CIPHER_IN1, T1; \
+			vpclmulqdq	$0x00, T9, CIPHER_IN1, T2; \
+			vpclmulqdq	$0x01, T9, CIPHER_IN1, T3; \
+			vpclmulqdq	$0x10, T9, CIPHER_IN1, T4; \
+		.else;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vinserti64x2	$2, 32 + hashk + OFFSET(KP), T9, T9; \
+			vpclmulqdq	$0x11, T9, CIPHER_IN1, T5; \
+			vpclmulqdq	$0x00, T9, CIPHER_IN1, T6; \
+			vpclmulqdq	$0x01, T9, CIPHER_IN1, T7; \
+			vpclmulqdq	$0x10, T9, CIPHER_IN1, T8; \
+		.endif; \
+	.endif; \
+	.if first_result != 1; \
+			vpxorq		T5, T1, T1; \
+			vpxorq		T6, T2, T2; \
+			vpxorq		T7, T3, T3; \
+			vpxorq		T8, T4, T4; \
+	.endif; \
+	.endif; \
+.elseif (NUM_BLOCKS >= 8) && (NUM_BLOCKS < 12); \
+	vmovdqu64	hashk + OFFSET(KP), T9; \
+	.if first_result == 1; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN0, T1; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN0, T2; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN0, T3; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN0, T4; \
+		.set first_result, 0; \
+	.else; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN0, T5; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN0, T6; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN0, T7; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN0, T8; \
+		vpxorq		T5, T1, T1; \
+		vpxorq		T6, T2, T2; \
+		vpxorq		T7, T3, T3; \
+		vpxorq		T8, T4, T4; \
+	.endif; \
+	.set hashk, hashk + 64; \
+	.set blocks_left, blocks_left - 4; \
+	.set reg_idx, reg_idx + 1;	\
+	vmovdqu64	hashk + OFFSET(KP), T9; \
+	.if first_result == 1; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN1, T1; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN1, T2; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN1, T3; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN1, T4; \
+		.set first_result, 0; \
+	.else; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN1, T5; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN1, T6; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN1, T7; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN1, T8; \
+		vpxorq		T5, T1, T1; \
+		vpxorq		T6, T2, T2; \
+		vpxorq		T7, T3, T3; \
+		vpxorq		T8, T4, T4; \
+	.endif; \
+	.set hashk, hashk + 64; \
+	.set blocks_left, blocks_left - 4; \
+	.set reg_idx, reg_idx + 1;	\
+	.if blocks_left > 0; \
+	.if blocks_left == 1; \
+		.if first_result == 1;	\
+			vmovdqu64	hashk + OFFSET(KP), XWORD(T9); \
+			vpclmulqdq	$0x11, XWORD(T9), XWORD(CIPHER_IN2), XWORD(T1); \
+			vpclmulqdq	$0x00, XWORD(T9), XWORD(CIPHER_IN2), XWORD(T2); \
+			vpclmulqdq	$0x01, XWORD(T9), XWORD(CIPHER_IN2), XWORD(T3); \
+			vpclmulqdq	$0x10, XWORD(T9), XWORD(CIPHER_IN2), XWORD(T4); \
+		.else;	\
+			vmovdqu64	hashk + OFFSET(KP), XWORD(T9); \
+			vpclmulqdq	$0x11, XWORD(T9), XWORD(CIPHER_IN2), XWORD(T5); \
+			vpclmulqdq	$0x00, XWORD(T9), XWORD(CIPHER_IN2), XWORD(T6); \
+			vpclmulqdq	$0x01, XWORD(T9), XWORD(CIPHER_IN2), XWORD(T7); \
+			vpclmulqdq	$0x10, XWORD(T9), XWORD(CIPHER_IN2), XWORD(T8); \
+		.endif; \
+	.elseif blocks_left == 2; \
+		.if first_result == 1;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vpclmulqdq	$0x11, YWORD(T9), YWORD(CIPHER_IN2), YWORD(T1); \
+			vpclmulqdq	$0x00, YWORD(T9), YWORD(CIPHER_IN2), YWORD(T2); \
+			vpclmulqdq	$0x01, YWORD(T9), YWORD(CIPHER_IN2), YWORD(T3); \
+			vpclmulqdq	$0x10, YWORD(T9), YWORD(CIPHER_IN2), YWORD(T4); \
+		.else;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vpclmulqdq	$0x11, YWORD(T9), YWORD(CIPHER_IN2), YWORD(T5); \
+			vpclmulqdq	$0x00, YWORD(T9), YWORD(CIPHER_IN2), YWORD(T6); \
+			vpclmulqdq	$0x01, YWORD(T9), YWORD(CIPHER_IN2), YWORD(T7); \
+			vpclmulqdq	$0x10, YWORD(T9), YWORD(CIPHER_IN2), YWORD(T8); \
+		.endif; \
+	.elseif blocks_left == 3;  \
+		.if first_result == 1;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vinserti64x2	$2, 32 + hashk + OFFSET(KP), T9, T9; \
+			vpclmulqdq	$0x11, T9, CIPHER_IN2, T1; \
+			vpclmulqdq	$0x00, T9, CIPHER_IN2, T2; \
+			vpclmulqdq	$0x01, T9, CIPHER_IN2, T3; \
+			vpclmulqdq	$0x10, T9, CIPHER_IN2, T4; \
+		.else;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vinserti64x2	$2, 32 + hashk + OFFSET(KP), T9, T9; \
+			vpclmulqdq	$0x11, T9, CIPHER_IN2, T5; \
+			vpclmulqdq	$0x00, T9, CIPHER_IN2, T6; \
+			vpclmulqdq	$0x01, T9, CIPHER_IN2, T7; \
+			vpclmulqdq	$0x10, T9, CIPHER_IN2, T8; \
+		.endif; \
+	.endif; \
+	.if first_result != 1; \
+		vpxorq		T5, T1, T1; \
+		vpxorq		T6, T2, T2; \
+		vpxorq		T7, T3, T3; \
+		vpxorq		T8, T4, T4; \
+	.endif; \
+	.endif; \
+.elseif (NUM_BLOCKS >= 12) && (NUM_BLOCKS < 16); \
+	vmovdqu64	hashk + OFFSET(KP), T9; \
+	.if first_result == 1; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN0, T1; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN0, T2; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN0, T3; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN0, T4; \
+		first_result = 0; \
+	.else; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN0, T5; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN0, T6; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN0, T7; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN0, T8; \
+		vpxorq		T5, T1, T1; \
+		vpxorq		T6, T2, T2; \
+		vpxorq		T7, T3, T3; \
+		vpxorq		T8, T4, T4; \
+	.endif; \
+	.set hashk, hashk + 64; \
+	.set blocks_left, blocks_left - 4; \
+	.set reg_idx, reg_idx + 1;	\
+	vmovdqu64	hashk + OFFSET(KP), T9; \
+	.if first_result == 1; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN1, T1; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN1, T2; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN1, T3; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN1, T4; \
+		first_result = 0; \
+	.else; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN1, T5; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN1, T6; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN1, T7; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN1, T8; \
+		vpxorq		T5, T1, T1; \
+		vpxorq		T6, T2, T2; \
+		vpxorq		T7, T3, T3; \
+		vpxorq		T8, T4, T4; \
+	.endif; \
+	.set hashk, hashk + 64; \
+	.set blocks_left, blocks_left - 4; \
+	.set reg_idx, reg_idx + 1;	\
+	vmovdqu64	hashk + OFFSET(KP), T9; \
+	.if first_result == 1; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN2, T1; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN2, T2; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN2, T3; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN2, T4; \
+		first_result = 0; \
+	.else; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN2, T5; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN2, T6; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN2, T7; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN2, T8; \
+		vpxorq		T5, T1, T1; \
+		vpxorq		T6, T2, T2; \
+		vpxorq		T7, T3, T3; \
+		vpxorq		T8, T4, T4; \
+	.endif; \
+	.set hashk, hashk + 64; \
+	.set blocks_left, blocks_left - 4; \
+	.set reg_idx, reg_idx + 1;	\
+	.if blocks_left > 0;	\
+	.if blocks_left == 1; \
+		.if first_result == 1;	\
+			vmovdqu64	hashk + OFFSET(KP), XWORD(T9); \
+			vpclmulqdq	$0x11, XWORD(T9), XWORD(CIPHER_IN3), XWORD(T1); \
+			vpclmulqdq	$0x00, XWORD(T9), XWORD(CIPHER_IN3), XWORD(T2); \
+			vpclmulqdq	$0x01, XWORD(T9), XWORD(CIPHER_IN3), XWORD(T3); \
+			vpclmulqdq	$0x10, XWORD(T9), XWORD(CIPHER_IN3), XWORD(T4); \
+		.else;	\
+			vmovdqu64	hashk + OFFSET(KP), XWORD(T9); \
+			vpclmulqdq	$0x11, XWORD(T9), XWORD(CIPHER_IN3), XWORD(T5); \
+			vpclmulqdq	$0x00, XWORD(T9), XWORD(CIPHER_IN3), XWORD(T6); \
+			vpclmulqdq	$0x01, XWORD(T9), XWORD(CIPHER_IN3), XWORD(T7); \
+			vpclmulqdq	$0x10, XWORD(T9), XWORD(CIPHER_IN3), XWORD(T8); \
+		.endif; \
+	.elseif blocks_left == 2; \
+		.if first_result == 1;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vpclmulqdq	$0x11, YWORD(T9), YWORD(CIPHER_IN3), YWORD(T1); \
+			vpclmulqdq	$0x00, YWORD(T9), YWORD(CIPHER_IN3), YWORD(T2); \
+			vpclmulqdq	$0x01, YWORD(T9), YWORD(CIPHER_IN3), YWORD(T3); \
+			vpclmulqdq	$0x10, YWORD(T9), YWORD(CIPHER_IN3), YWORD(T4); \
+		.else;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vpclmulqdq	$0x11, YWORD(T9), YWORD(CIPHER_IN3), YWORD(T5); \
+			vpclmulqdq	$0x00, YWORD(T9), YWORD(CIPHER_IN3), YWORD(T6); \
+			vpclmulqdq	$0x01, YWORD(T9), YWORD(CIPHER_IN3), YWORD(T7); \
+			vpclmulqdq	$0x10, YWORD(T9), YWORD(CIPHER_IN3), YWORD(T8); \
+		.endif; \
+	.elseif blocks_left == 3;  \
+		.if first_result == 1;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vinserti64x2	$2, 32 + hashk + OFFSET(KP), T9, T9; \
+			vpclmulqdq	$0x11, T9, CIPHER_IN3, T1; \
+			vpclmulqdq	$0x00, T9, CIPHER_IN3, T2; \
+			vpclmulqdq	$0x01, T9, CIPHER_IN3, T3; \
+			vpclmulqdq	$0x10, T9, CIPHER_IN3, T4; \
+		.else;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vinserti64x2	$2, 32 + hashk + OFFSET(KP), T9, T9; \
+			vpclmulqdq	$0x11, T9, CIPHER_IN3, T5; \
+			vpclmulqdq	$0x00, T9, CIPHER_IN3, T6; \
+			vpclmulqdq	$0x01, T9, CIPHER_IN3, T7; \
+			vpclmulqdq	$0x10, T9, CIPHER_IN3, T8; \
+		.endif; \
+	.endif; \
+	.if first_result != 1; \
+			vpxorq		T5, T1, T1; \
+			vpxorq		T6, T2, T2; \
+			vpxorq		T7, T3, T3; \
+			vpxorq		T8, T4, T4; \
+	.endif; \
+	.endif; \
+.else;	\
+	vmovdqu64	hashk + OFFSET(KP), T9; \
+	.if first_result == 1; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN0, T1; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN0, T2; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN0, T3; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN0, T4; \
+		first_result = 0; \
+	.else; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN0, T5; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN0, T6; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN0, T7; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN0, T8; \
+		vpxorq		T5, T1, T1; \
+		vpxorq		T6, T2, T2; \
+		vpxorq		T7, T3, T3; \
+		vpxorq		T8, T4, T4; \
+	.endif; \
+	.set hashk, hashk + 64; \
+	.set blocks_left, blocks_left - 4; \
+	.set reg_idx, reg_idx + 1;     \
+	vmovdqu64	hashk + OFFSET(KP), T9; \
+	.if first_result == 1; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN1, T1; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN1, T2; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN1, T3; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN1, T4; \
+		first_result = 0; \
+	.else; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN1, T5; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN1, T6; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN1, T7; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN1, T8; \
+		vpxorq		T5, T1, T1; \
+		vpxorq		T6, T2, T2; \
+		vpxorq		T7, T3, T3; \
+		vpxorq		T8, T4, T4; \
+	.endif; \
+	.set hashk, hashk + 64; \
+	.set blocks_left, blocks_left - 4; \
+	.set reg_idx, reg_idx + 1;	\
+	vmovdqu64	hashk + OFFSET(KP), T9; \
+	.if first_result == 1; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN2, T1; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN2, T2; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN2, T3; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN2, T4; \
+		first_result = 0; \
+	.else; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN2, T5; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN2, T6; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN2, T7; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN2, T8; \
+		vpxorq		T5, T1, T1; \
+		vpxorq		T6, T2, T2; \
+		vpxorq		T7, T3, T3; \
+		vpxorq		T8, T4, T4; \
+	.endif; \
+	.set hashk, hashk + 64; \
+	.set blocks_left, blocks_left - 4; \
+	.set reg_idx, reg_idx + 1;	\
+	vmovdqu64	hashk + OFFSET(KP), T9; \
+	.if first_result == 1; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN3, T1; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN3, T2; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN3, T3; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN3, T4; \
+		first_result = 0; \
+	.else; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN3, T5; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN3, T6; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN3, T7; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN3, T8; \
+		vpxorq		T5, T1, T1; \
+		vpxorq		T6, T2, T2; \
+		vpxorq		T7, T3, T3; \
+		vpxorq		T8, T4, T4; \
+	.endif; \
+	.set hashk, hashk + 64; \
+	.set blocks_left, blocks_left - 4; \
+	.set reg_idx, reg_idx + 1;	\
+.endif; \
+.if reduce == 1; \
+	vpxorq		T4, T3, T3; \
+	vpsrldq		$8, T3, T7; \
+	vpslldq		$8, T3, T8; \
+	vpxorq		T7, T1, T1; \
+	vpxorq		T8, T2, T2; \
+	VHPXORI4x128(T1, T7); \
+	VHPXORI4x128(T2, T8); \
+	vmovdqa64	POLY2(%rip), XWORD(T9); \
+	VCLMUL_REDUCE(XWORD(GHASH), XWORD(T9), XWORD(T1), XWORD(T2), XWORD(T3), XWORD(T4)) \
+.else; \
+	vmovdqa64	T1, PREV_H; \
+	vmovdqa64	T2, PREV_L; \
+	vmovdqa64	T3, PREV_M1; \
+	vmovdqa64	T4, PREV_M2; \
+.endif;
+
+/*
+ * Calculates the hash of the data which will not be encrypted.
+ * Input: The input data (A_IN), that data's length (A_LEN), and the hash key (GDATA_KEY).
+ * Output: The hash of the data (AAD_HASH).
+ */
+#define CALC_AAD_HASH(A_IN, A_LEN, AAD_HASH, GDATA_KEY, ZT0, ZT1, ZT2, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZT13, ZT14, ZT15, ZT16, ZT17, T1, T2, T3, MASKREG, OFFSET) \
+	mov	A_IN, T1; \
+	mov	A_LEN, T2; \
+	or	T2, T2; \
+	jz	0f; \
+	vmovdqa64	SHUF_MASK(%rip), ZT13; \
+20:; \
+	cmp	$(48*16), T2; \
+	jl	21f; \
+	vmovdqu64	64*0(T1), ZT1; \
+	vmovdqu64	64*1(T1), ZT2; \
+	vmovdqu64	64*2(T1), ZT3; \
+	vmovdqu64	64*3(T1), ZT4; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	vpshufb ZT13, ZT4, ZT4; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, ZT4, 16, 1, multi_call, first, 48, ZT14, ZT15, ZT16, ZT17) \
+	vmovdqu64     0 + 256(T1), ZT1; \
+	vmovdqu64     64 + 256(T1), ZT2; \
+	vmovdqu64     128 + 256(T1), ZT3; \
+	vmovdqu64     192 + 256(T1), ZT4; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	vpshufb ZT13, ZT4, ZT4; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, ZT4, 16, 1, multi_call, mid, 32, ZT14, ZT15, ZT16, ZT17) \
+	vmovdqu64     0 + 512(T1), ZT1; \
+	vmovdqu64     64 + 512(T1), ZT2; \
+	vmovdqu64     128 + 512(T1), ZT3; \
+	vmovdqu64     192 + 512(T1), ZT4; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	vpshufb ZT13, ZT4, ZT4; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, ZT4, 16, 1, multi_call, last, 16, ZT14, ZT15, ZT16, ZT17) \
+	sub	$(48*16), T2; \
+	je	0f; \
+	add	$(48*16), T1; \
+	jmp	20b; \
+21:; \
+	cmp	$(32*16), T2; \
+	jl	22f; \
+	vmovdqu64	64*0(T1), ZT1; \
+	vmovdqu64	64*1(T1), ZT2; \
+	vmovdqu64	64*2(T1), ZT3; \
+	vmovdqu64	64*3(T1), ZT4; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	vpshufb ZT13, ZT4, ZT4; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, ZT4, 16, 1, multi_call, first, 32, ZT14, ZT15, ZT16, ZT17) \
+	vmovdqu64     0 + 256(T1), ZT1; \
+	vmovdqu64     64 + 256(T1), ZT2; \
+	vmovdqu64     128 + 256(T1), ZT3; \
+	vmovdqu64     192 + 256(T1), ZT4; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	vpshufb ZT13, ZT4, ZT4; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, ZT4, 16, 1, multi_call, last, 16, ZT14, ZT15, ZT16, ZT17) \
+	sub	$(32*16), T2; \
+	je	0f; \
+	add	$(32*16), T1; \
+	jmp	23f; \
+22:; \
+	cmp	$(16*16), T2; \
+	jl	23f; \
+	vmovdqu64	64*0(T1), ZT1; \
+	vmovdqu64	64*1(T1), ZT2; \
+	vmovdqu64	64*2(T1), ZT3; \
+	vmovdqu64	64*3(T1), ZT4; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	vpshufb ZT13, ZT4, ZT4; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, ZT4, 16, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	sub	$(16*16), T2; \
+	je	0f; \
+	add	$(16*16), T1; \
+23:; \
+	lea	byte64_len_to_mask_table(%rip), T3; \
+	lea	(T3, T2, 8), T3; \
+	add	$15, T2; \
+	and	$-16, T2; \
+	shr	$4, T2; \
+	cmp	$1, T2; \
+	je	1f; \
+	cmp	$2, T2; \
+	je	2f; \
+	cmp	$3, T2; \
+	je	3f; \
+	cmp	$4, T2; \
+	je	4f; \
+	cmp	$5, T2; \
+	je	5f; \
+	cmp	$6, T2; \
+	je	6f; \
+	cmp	$7, T2; \
+	je	7f; \
+	cmp	$8, T2; \
+	je	8f; \
+	cmp	$9, T2; \
+	je	9f; \
+	cmp	$10, T2; \
+	je	10f; \
+	cmp	$11, T2; \
+	je	11f; \
+	cmp	$12, T2; \
+	je	12f; \
+	cmp	$13, T2; \
+	je	13f; \
+	cmp	$14, T2; \
+	je	14f; \
+	cmp	$15, T2; \
+	je	15f; \
+16:; \
+	sub $(64*3*8), T3; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1; \
+	vmovdqu8  64*1(T1), ZT2; \
+	vmovdqu8  64*2(T1), ZT3; \
+	vmovdqu8  64*3(T1), ZT4{MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	vpshufb ZT13, ZT4, ZT4; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, ZT4, 16, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+15:; \
+	sub $(64*3*8), T3; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1; \
+	vmovdqu8  64*1(T1), ZT2; \
+	vmovdqu8  64*2(T1), ZT3; \
+	vmovdqu8  64*3(T1), ZT4{MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	vpshufb ZT13, ZT4, ZT4; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, ZT4, 15, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+14:; \
+	sub $(64*3*8), T3; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1; \
+	vmovdqu8  64*1(T1), ZT2; \
+	vmovdqu8  64*2(T1), ZT3; \
+	vmovdqu8  64*3(T1), ZT4{MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	vpshufb ZT13, ZT4, ZT4; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, ZT4, 14, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+13:; \
+	sub $(64*3*8), T3; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1; \
+	vmovdqu8  64*1(T1), ZT2; \
+	vmovdqu8  64*2(T1), ZT3; \
+	vmovdqu8  64*3(T1), ZT4{MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	vpshufb ZT13, ZT4, ZT4; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, ZT4, 13, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+12:; \
+	sub $(64*2*8), T3; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1; \
+	vmovdqu8  64*1(T1), ZT2; \
+	vmovdqu8  64*2(T1), ZT3{MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, no_zmm, 12, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+11:; \
+	sub $(64*2*8), T3; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1; \
+	vmovdqu8  64*1(T1), ZT2; \
+	vmovdqu8  64*2(T1), ZT3{MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, no_zmm, 11, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+10:; \
+	sub $(64*2*8), T3; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1; \
+	vmovdqu8  64*1(T1), ZT2; \
+	vmovdqu8  64*2(T1), ZT3{MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, no_zmm, 10, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+9:; \
+	sub $(64*2*8), T3; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1; \
+	vmovdqu8  64*1(T1), ZT2; \
+	vmovdqu8  64*2(T1), ZT3{MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, no_zmm, 9, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+8:; \
+	sub $(64*8), T3; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1; \
+	vmovdqu8  64*1(T1), ZT2{MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZWORD(AAD_HASH), ZT1, ZT2, no_zmm, no_zmm, 8, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+7:; \
+	sub $(64*8), T3; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1; \
+	vmovdqu8  64*1(T1), ZT2{MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZWORD(AAD_HASH), ZT1, ZT2, no_zmm, no_zmm, 7, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+6:; \
+	sub $(64*8), T3; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1; \
+	vmovdqu8  64*1(T1), YWORD(ZT2){MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb YWORD(ZT13), YWORD(ZT2), YWORD(ZT2); \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZWORD(AAD_HASH), ZT1, ZT2, no_zmm, no_zmm, 6, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+5:; \
+	sub $(64*8), T3; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1; \
+	vmovdqu8  64*1(T1), XWORD(ZT2){MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb XWORD(ZT13), XWORD(ZT2), XWORD(ZT2); \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZWORD(AAD_HASH), ZT1, ZT2, no_zmm, no_zmm, 5, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+4:; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1{MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZWORD(AAD_HASH), ZT1, no_zmm, no_zmm, no_zmm, 4, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+3:; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1{MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZWORD(AAD_HASH), ZT1, no_zmm, no_zmm, no_zmm, 3, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+2:; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), YWORD(ZT1){MASKREG}{z}; \
+	vpshufb YWORD(ZT13), YWORD(ZT1), YWORD(ZT1); \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZWORD(AAD_HASH), ZT1, no_zmm, no_zmm, no_zmm, 2, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+1:; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), XWORD(ZT1){MASKREG}{z}; \
+	vpshufb XWORD(ZT13), XWORD(ZT1), XWORD(ZT1); \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZWORD(AAD_HASH), ZT1, no_zmm, no_zmm, no_zmm, 1, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+0:;
diff --git a/arch/x86/crypto/ghash-clmulni-intel_avx512.S b/arch/x86/crypto/ghash-clmulni-intel_avx512.S
new file mode 100644
index 0000000..9cbc40f
--- /dev/null
+++ b/arch/x86/crypto/ghash-clmulni-intel_avx512.S
@@ -0,0 +1,68 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright © 2020 Intel Corporation.
+ *
+ * GHASH calculation with AVX512 instructions (x86_64).
+ *
+ * Requires support for the Intel(R) AVX512F and VPCLMULQDQ instruction
+ * set extensions.
+ */
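+
+/*
+ * For reference, GHASH processes 16-byte blocks X_1..X_n as
+ *
+ *   Y_0 = 0,  Y_i = (Y_{i-1} xor X_i) * H
+ *
+ * where H is the hash key and the multiplication is carry-less, in
+ * GF(2^128) modulo x^128 + x^7 + x^2 + x + 1.
+ */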
+
+#include "avx512_vaes_common.S"
+
+/*
+ * void ghash_precomp_avx512(u8 *key_data);
+ */
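+/*
+ * A sketch of the expected behaviour, based on the glue code: key_data
+ * points at the hkey table in struct ghash_ctx, with the raw hash key
+ * already stored by ghash_setkey(). The code below converts it to the
+ * (HashKey << 1 mod poly) form and PRECOMPUTE() then derives the higher
+ * powers HashKey^2 .. HashKey^48 used by the bulk update path.
+ */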
+SYM_FUNC_START(ghash_precomp_avx512)
+        FUNC_SAVE_GHASH()
+
+        /* move original key to xmm6 */
+        vmovdqu HashKey_1(arg1), %xmm6
+
+        vpshufb SHUF_MASK(%rip), %xmm6, %xmm6
+
+        vmovdqa %xmm6, %xmm2
+        vpsllq  $1, %xmm6, %xmm6
+        vpsrlq  $63, %xmm2, %xmm2
+        vmovdqa %xmm2, %xmm1
+        vpslldq $8, %xmm2, %xmm2
+        vpsrldq $8, %xmm1, %xmm1
+        vpor %xmm2, %xmm6, %xmm6
+        vpshufd $36, %xmm1, %xmm2
+        vpcmpeqd TWOONE(%rip), %xmm2, %xmm2
+        vpand POLY(%rip), %xmm2, %xmm2
+        vpxor %xmm2, %xmm6, %xmm6
+        vmovdqu %xmm6, HashKey_1(arg1)
+
+        PRECOMPUTE(arg1, %xmm6, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm7, %xmm8)
+
+        FUNC_RESTORE_GHASH()
+
+        ret
+SYM_FUNC_END(ghash_precomp_avx512)
+
+/*
+ * void clmul_ghash_update_avx512
+ *      (char *dst,
+ *       const char *src,
+ *       unsigned int srclen,
+ *       u8 *key_data);
+ */
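+/*
+ * Note: the glue code is expected to pass srclen rounded down to a multiple
+ * of 16 bytes; any tail bytes are left to the caller's partial-block
+ * handling.
+ */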
+SYM_FUNC_START(clmul_ghash_update_avx512)
+        FUNC_SAVE_GHASH()
+
+        /* Read current hash value from dst */
+        vmovdqa (arg1), %xmm0
+
+        /* Bswap current hash value */
+        vpshufb SHUF_MASK(%rip), %xmm0, %xmm0
+
+        CALC_AAD_HASH(arg2, arg3, %xmm0, arg4, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7, %zmm8, %zmm9, %zmm10, %zmm11, %zmm12, %zmm13, %zmm15, %zmm16, %zmm17, %zmm18, %zmm19, %r10, %r11, %r12, %k1, 0)
+
+        /* Bswap current hash value before storing */
+        vpshufb SHUF_MASK(%rip), %xmm0, %xmm0
+        vmovdqu %xmm0, (arg1)
+
+        FUNC_RESTORE_GHASH()
+
+        ret
+SYM_FUNC_END(clmul_ghash_update_avx512)
diff --git a/arch/x86/crypto/ghash-clmulni-intel_glue.c b/arch/x86/crypto/ghash-clmulni-intel_glue.c
index 1f1a95f..3a3e8ea 100644
--- a/arch/x86/crypto/ghash-clmulni-intel_glue.c
+++ b/arch/x86/crypto/ghash-clmulni-intel_glue.c
@@ -22,18 +22,39 @@ 
 
 #define GHASH_BLOCK_SIZE	16
 #define GHASH_DIGEST_SIZE	16
+#define GHASH_KEY_LEN		16
+
+static bool use_avx512;
+module_param(use_avx512, bool, 0644);
+MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");
 
 void clmul_ghash_mul(char *dst, const u128 *shash);
 
 void clmul_ghash_update(char *dst, const char *src, unsigned int srclen,
 			const u128 *shash);
 
+extern void ghash_precomp_avx512(u8 *key_data);
+#ifdef CONFIG_CRYPTO_GHASH_CLMUL_NI_AVX512
+extern void clmul_ghash_update_avx512(char *dst, const char *src, unsigned int srclen,
+				      u8 *shash);
+#else
+static void clmul_ghash_update_avx512(char *dst, const char *src, unsigned int srclen,
+			       u8 *shash)
+{}
+#endif
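+
+/*
+ * When CONFIG_CRYPTO_GHASH_CLMUL_NI_AVX512 is not set, the IS_ENABLED()
+ * checks below turn the AVX512 call sites into dead code; the empty stub
+ * above exists only to keep the build happy.
+ */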
+
 struct ghash_async_ctx {
 	struct cryptd_ahash *cryptd_tfm;
 };
 
+/*
+ * Storage for the hash key powers used by the schoolbook multiply:
+ * (HashKey << 1 mod poly), (HashKey^2 << 1 mod poly), ...,
+ * (HashKey^48 << 1 mod poly)
+ */
 struct ghash_ctx {
 	u128 shash;
+	u8 hkey[GHASH_KEY_LEN * 48];
 };
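+
+/*
+ * hkey holds 48 consecutive 16-byte entries, one per hash key power, in the
+ * layout expected by the assembly (HashKey_1 etc., presumably defined in
+ * avx512_vaes_common.S). ghash_setkey() stores the raw key in its slot and
+ * ghash_precomp_avx512() derives the remaining powers.
+ */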
 
 struct ghash_desc_ctx {
@@ -56,6 +77,15 @@  static int ghash_setkey(struct crypto_shash *tfm,
 	struct ghash_ctx *ctx = crypto_shash_ctx(tfm);
 	be128 *x = (be128 *)key;
 	u64 a, b;
+	int i;
+
+	if (IS_ENABLED(CONFIG_CRYPTO_GHASH_CLMUL_NI_AVX512) &&
+	    cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ) && use_avx512 &&
+	    keylen == GHASH_BLOCK_SIZE) {
+		for (i = 0; i < GHASH_BLOCK_SIZE; i++)
+			ctx->hkey[(16 * 47) + i] = key[i];
+
+		ghash_precomp_avx512(ctx->hkey);
+	}
 
 	if (keylen != GHASH_BLOCK_SIZE)
 		return -EINVAL;
@@ -94,8 +124,13 @@  static int ghash_update(struct shash_desc *desc,
 		if (!dctx->bytes)
 			clmul_ghash_mul(dst, &ctx->shash);
 	}
-
-	clmul_ghash_update(dst, src, srclen, &ctx->shash);
+	if (IS_ENABLED(CONFIG_CRYPTO_GHASH_CLMUL_NI_AVX512) &&
+	    cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ) && use_avx512) {
+		/*
+		 * The assembly code only processes 16-byte multiples; leave
+		 * srclen untouched so the tail bytes are buffered below.
+		 */
+		clmul_ghash_update_avx512(dst, src, ALIGN_DOWN(srclen, 16),
+					  ctx->hkey);
+	} else {
+		clmul_ghash_update(dst, src, srclen, &ctx->shash);
+	}
 	kernel_fpu_end();
 
 	if (srclen & 0xf) {
diff --git a/crypto/Kconfig b/crypto/Kconfig
index b090f14..70d1d35 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -637,6 +637,18 @@  config CRYPTO_CRCT10DIF_AVX512
 	depends on CRYPTO_CRCT10DIF_PCLMUL
 	depends on AS_VPCLMULQDQ
 
+# We default CRYPTO_GHASH_CLMUL_NI_AVX512 to y but make it depend on
+# CRYPTO_AVX512 so that a single option (CRYPTO_AVX512) can select several
+# algorithms when they are supported. If the platform and/or toolchain does
+# not support VPCLMULQDQ, this algorithm must not be part of the set that
+# CRYPTO_AVX512 selects.
+config CRYPTO_GHASH_CLMUL_NI_AVX512
+	bool
+	default y
+	depends on CRYPTO_AVX512
+	depends on CRYPTO_GHASH_CLMUL_NI_INTEL
+	depends on AS_VPCLMULQDQ
+
 config CRYPTO_CRC32C_SPARC64
 	tristate "CRC32c CRC algorithm (SPARC64)"
 	depends on SPARC64