[RFC,V1,2/7] crypto: crct10dif - Accelerated CRC T10 DIF with vectorized instruction

Message ID: 1608325864-4033-3-git-send-email-megha.dey@intel.com
State: RFC
Delegated to: Herbert Xu
Series: Introduce AVX512 optimized crypto algorithms

Commit Message

Dey, Megha Dec. 18, 2020, 9:10 p.m. UTC
From: Kyung Min Park <kyung.min.park@intel.com>

Update the crc_pcl function that calculates the T10 Data Integrity Field
CRC16 (CRC T10 DIF) to use the VPCLMULQDQ instruction. The VPCLMULQDQ
extension, together with AVX-512F, adds an EVEX-encoded 512-bit version of
the PCLMULQDQ instruction. The advantage comes from packing four 128-bit
data lanes into each 512-bit register, reducing effective instruction
latency.
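
For illustration, here is a minimal C intrinsics sketch of one 512-bit
folding step (a hypothetical helper, not code from this patch; the assembly
below implements the same vpclmulqdq/vpxorq pattern):

	#include <immintrin.h>

	/*
	 * One fold step: each __m512i holds four 128-bit lanes, so a
	 * single VPCLMULQDQ performs four carry-less multiplies.
	 * Build with e.g. -mavx512f -mvpclmulqdq.
	 */
	static __m512i fold_512(__m512i acc, __m512i data, __m512i consts)
	{
		__m512i lo = _mm512_clmulepi64_epi128(acc, consts, 0x00);
		__m512i hi = _mm512_clmulepi64_epi128(acc, consts, 0x11);

		/* acc' = lo ^ hi ^ data; the asm fuses the XORs with vpternlogq */
		return _mm512_xor_si512(_mm512_xor_si512(lo, hi), data);
	}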

The glue code in the crct10dif module overrides the existing PCLMULQDQ
version with the VPCLMULQDQ version when the following criteria are met:
At compile time:
1. CONFIG_CRYPTO_AVX512 is enabled
2. The toolchain (assembler) supports the VPCLMULQDQ instructions
At runtime:
1. The VPCLMULQDQ and AVX512VL features are supported on the platform
   (currently only Icelake)
2. If built into the kernel, crct10dif_pclmul.use_avx512 is set on the
   kernel command line, or /sys/module/crct10dif_pclmul/parameters/use_avx512
   is set to 1 after boot.
   If built as a loadable module, the use_avx512 module parameter must be
   set: modprobe crct10dif_pclmul use_avx512=1

A typical tcrypt run of the CRC T10 DIF calculation comparing the
PCLMULQDQ and VPCLMULQDQ instructions shows the following results:
for bytes per update >= 1KB, we see an average improvement of 46% (~1.4x);
for bytes per update < 1KB, we see an average improvement of 13%.
The test was performed on an Icelake based platform with the CPU frequency
held constant.

Detailed results for a variety of block sizes and update sizes are in
the table below.

---------------------------------------------------------------------------
|            |            |         cycles/operation         |            |
|            |            |       (the lower the better)     |            |
| block size |   bytes    |----------------------------------| percentage |
|  (bytes)   | per update |   CRC T10 DIF  |  CRC T10 DIF    | loss/gain  |
|            |            | with PCLMULQDQ | with VPCLMULQDQ |            |
|------------|------------|----------------|-----------------|------------|
|      16    |     16     |        77      |        106      |   -27.0    |
|      64    |     16     |       411      |        390      |     5.4    |
|      64    |     64     |        71      |         85      |   -16.0    |
|     256    |     16     |      1224      |       1308      |    -6.4    |
|     256    |     64     |       393      |        407      |    -3.4    |
|     256    |    256     |        93      |         86      |     8.1    |
|    1024    |     16     |      4564      |       5020      |    -9.0    |
|    1024    |    256     |       486      |        475      |     2.3    |
|    1024    |   1024     |       221      |        148      |    49.3    |
|    2048    |     16     |      8945      |       9851      |    -9.1    |
|    2048    |    256     |       982      |        951      |     3.3    |
|    2048    |   1024     |       500      |        369      |    35.5    |
|    2048    |   2048     |       413      |        265      |    55.8    |
|    4096    |     16     |     17885      |      19351      |    -7.5    |
|    4096    |    256     |      1828      |       1713      |     6.7    |
|    4096    |   1024     |       968      |        805      |    20.0    |
|    4096    |   4096     |       739      |        475      |    55.6    |
|    8192    |     16     |     48339      |      41556      |    16.3    |
|    8192    |    256     |      3494      |       3342      |     4.5    |
|    8192    |   1024     |      1959      |       1462      |    34.0    |
|    8192    |   4096     |      1561      |       1036      |    50.7    |
|    8192    |   8192     |      1540      |       1004      |    53.4    |
---------------------------------------------------------------------------

This work was inspired by the CRC T10 DIF AVX512 optimization published
in the Intel Intelligent Storage Acceleration Library (ISA-L):
https://github.com/intel/isa-l/blob/master/crc/crc16_t10dif_by16_10.asm

Co-developed-by: Greg Tucker <greg.b.tucker@intel.com>
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
Co-developed-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
Signed-off-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
Signed-off-by: Megha Dey <megha.dey@intel.com>
---
 arch/x86/crypto/Makefile                  |   1 +
 arch/x86/crypto/crct10dif-avx512-asm_64.S | 482 ++++++++++++++++++++++++++++++
 arch/x86/crypto/crct10dif-pclmul_glue.c   |  24 +-
 arch/x86/include/asm/disabled-features.h  |   8 +-
 crypto/Kconfig                            |  23 ++
 5 files changed, 535 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/crypto/crct10dif-avx512-asm_64.S

Comments

Ard Biesheuvel Jan. 16, 2021, 5 p.m. UTC | #1
On Fri, 18 Dec 2020 at 22:07, Megha Dey <megha.dey@intel.com> wrote:
>
> From: Kyung Min Park <kyung.min.park@intel.com>
>
...
> diff --git a/arch/x86/crypto/crct10dif-pclmul_glue.c b/arch/x86/crypto/crct10dif-pclmul_glue.c
> index 71291d5a..26a6350 100644
> --- a/arch/x86/crypto/crct10dif-pclmul_glue.c
> +++ b/arch/x86/crypto/crct10dif-pclmul_glue.c
> @@ -35,6 +35,16 @@
>  #include <asm/simd.h>
>
>  asmlinkage u16 crc_t10dif_pcl(u16 init_crc, const u8 *buf, size_t len);
> +#ifdef CONFIG_CRYPTO_CRCT10DIF_AVX512
> +asmlinkage u16 crct10dif_pcl_avx512(u16 init_crc, const u8 *buf, size_t len);
> +#else
> +static u16 crct10dif_pcl_avx512(u16 init_crc, const u8 *buf, size_t len)
> +{ return 0; }
> +#endif
> +

Please drop the alternative definition. If you code the references
correctly, the alternative is never called.
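
For context, the IS_ENABLED() guard used below evaluates to a compile-time
constant 0 when CONFIG_CRYPTO_CRCT10DIF_AVX512 is off, so the compiler
discards the dead branch before symbol resolution and a bare declaration
suffices. A sketch (not part of the patch):

	/*
	 * A declaration alone is enough: every call site is guarded by
	 * IS_ENABLED(CONFIG_CRYPTO_CRCT10DIF_AVX512), a compile-time
	 * constant, so the reference is eliminated as dead code when
	 * the option is off and no dummy body is needed.
	 */
	asmlinkage u16 crct10dif_pcl_avx512(u16 init_crc, const u8 *buf,
					    size_t len);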

> +static bool use_avx512;
> +module_param(use_avx512, bool, 0644);
> +MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");
>
>  struct chksum_desc_ctx {
>         __u16 crc;
> @@ -56,7 +66,12 @@ static int chksum_update(struct shash_desc *desc, const u8 *data,
>
>         if (length >= 16 && crypto_simd_usable()) {
>                 kernel_fpu_begin();
> -               ctx->crc = crc_t10dif_pcl(ctx->crc, data, length);
> +               if (IS_ENABLED(CONFIG_CRYPTO_CRCT10DIF_AVX512) &&
> +                   cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ) &&
> +                   use_avx512)
> +                       ctx->crc = crct10dif_pcl_avx512(ctx->crc, data, length);
> +               else
> +                       ctx->crc = crc_t10dif_pcl(ctx->crc, data, length);

Please use a static call or static key here, and initialize its value
in the init code.

>                 kernel_fpu_end();
>         } else
>                 ctx->crc = crc_t10dif_generic(ctx->crc, data, length);
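
A minimal sketch of the static-call approach being requested (hypothetical
code, not part of this patch; it assumes the existing use_avx512 parameter
and the module's shash registration):

	#include <linux/static_call.h>

	/* default target chosen at build time */
	DEFINE_STATIC_CALL(crct10dif_fn, crc_t10dif_pcl);

	static int __init crct10dif_intel_mod_init(void)
	{
		if (IS_ENABLED(CONFIG_CRYPTO_CRCT10DIF_AVX512) &&
		    cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ) &&
		    use_avx512)
			static_call_update(crct10dif_fn, crct10dif_pcl_avx512);

		return crypto_register_shash(&alg);
	}

	/* the hot path then carries no conditional: */
	ctx->crc = static_call(crct10dif_fn)(ctx->crc, data, length);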
> @@ -75,7 +90,12 @@ static int __chksum_finup(__u16 crc, const u8 *data, unsigned int len, u8 *out)
>  {
>         if (len >= 16 && crypto_simd_usable()) {
>                 kernel_fpu_begin();
> -               *(__u16 *)out = crc_t10dif_pcl(crc, data, len);
> +               if (IS_ENABLED(CONFIG_CRYPTO_CRCT10DIF_AVX512) &&
> +                   cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ) &&
> +                   use_avx512)
> +                       *(__u16 *)out = crct10dif_pcl_avx512(crc, data, len);
> +               else
> +                       *(__u16 *)out = crc_t10dif_pcl(crc, data, len);

Same here.

>                 kernel_fpu_end();
>         } else
>                 *(__u16 *)out = crc_t10dif_generic(crc, data, len);
>
Dey, Megha Jan. 20, 2021, 10:46 p.m. UTC | #2
Hi Ard,

On 1/16/2021 9:00 AM, Ard Biesheuvel wrote:
> On Fri, 18 Dec 2020 at 22:07, Megha Dey <megha.dey@intel.com> wrote:
>> From: Kyung Min Park <kyung.min.park@intel.com>
>>
> ...
>> diff --git a/arch/x86/crypto/crct10dif-pclmul_glue.c b/arch/x86/crypto/crct10dif-pclmul_glue.c
>> index 71291d5a..26a6350 100644
>> --- a/arch/x86/crypto/crct10dif-pclmul_glue.c
>> +++ b/arch/x86/crypto/crct10dif-pclmul_glue.c
>> @@ -35,6 +35,16 @@
>>   #include <asm/simd.h>
>>
>>   asmlinkage u16 crc_t10dif_pcl(u16 init_crc, const u8 *buf, size_t len);
>> +#ifdef CONFIG_CRYPTO_CRCT10DIF_AVX512
>> +asmlinkage u16 crct10dif_pcl_avx512(u16 init_crc, const u8 *buf, size_t len);
>> +#else
>> +static u16 crct10dif_pcl_avx512(u16 init_crc, const u8 *buf, size_t len)
>> +{ return 0; }
>> +#endif
>> +
> Please drop the alternative definition. If you code the references
> correctly, the alternative is never called.
ok.
>
>> +static bool use_avx512;
>> +module_param(use_avx512, bool, 0644);
>> +MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");
>>
>>   struct chksum_desc_ctx {
>>          __u16 crc;
>> @@ -56,7 +66,12 @@ static int chksum_update(struct shash_desc *desc, const u8 *data,
>>
>>          if (length >= 16 && crypto_simd_usable()) {
>>                  kernel_fpu_begin();
>> -               ctx->crc = crc_t10dif_pcl(ctx->crc, data, length);
>> +               if (IS_ENABLED(CONFIG_CRYPTO_CRCT10DIF_AVX512) &&
>> +                   cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ) &&
>> +                   use_avx512)
>> +                       ctx->crc = crct10dif_pcl_avx512(ctx->crc, data, length);
>> +               else
>> +                       ctx->crc = crc_t10dif_pcl(ctx->crc, data, length);
> Please use a static call or static key here, and initialize its value
> in the init code.
Yeah, I'll make the change.
>
>>                  kernel_fpu_end();
>>          } else
>>                  ctx->crc = crc_t10dif_generic(ctx->crc, data, length);
>> @@ -75,7 +90,12 @@ static int __chksum_finup(__u16 crc, const u8 *data, unsigned int len, u8 *out)
>>   {
>>          if (len >= 16 && crypto_simd_usable()) {
>>                  kernel_fpu_begin();
>> -               *(__u16 *)out = crc_t10dif_pcl(crc, data, len);
>> +               if (IS_ENABLED(CONFIG_CRYPTO_CRCT10DIF_AVX512) &&
>> +                   cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ) &&
>> +                   use_avx512)
>> +                       *(__u16 *)out = crct10dif_pcl_avx512(crc, data, len);
>> +               else
>> +                       *(__u16 *)out = crc_t10dif_pcl(crc, data, len);
> Same here.

will do

-Megha


Patch

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index a31de0c..bf0b0fc 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -80,6 +80,7 @@  crc32-pclmul-y := crc32-pclmul_asm.o crc32-pclmul_glue.o
 
 obj-$(CONFIG_CRYPTO_CRCT10DIF_PCLMUL) += crct10dif-pclmul.o
 crct10dif-pclmul-y := crct10dif-pcl-asm_64.o crct10dif-pclmul_glue.o
+crct10dif-pclmul-$(CONFIG_CRYPTO_CRCT10DIF_AVX512) += crct10dif-avx512-asm_64.o
 
 obj-$(CONFIG_CRYPTO_POLY1305_X86_64) += poly1305-x86_64.o
 poly1305-x86_64-y := poly1305-x86_64-cryptogams.o poly1305_glue.o
diff --git a/arch/x86/crypto/crct10dif-avx512-asm_64.S b/arch/x86/crypto/crct10dif-avx512-asm_64.S
new file mode 100644
index 0000000..07c9371
--- /dev/null
+++ b/arch/x86/crypto/crct10dif-avx512-asm_64.S
@@ -0,0 +1,482 @@ 
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* Copyright(c) 2020 Intel Corporation.
+ *
+ * Implement CRC T10 DIF calculation with AVX512 instructions. (x86_64)
+ *
+ * This is CRC T10 DIF calculation with AVX512 instructions. It requires
+ * the support of Intel(R) AVX512F and VPCLMULQDQ instructions.
+ */
+
+#include <linux/linkage.h>
+
+.text
+#define		init_crc	%edi
+#define		buf		%rsi
+#define		len		%rdx
+#define		VARIABLE_OFFSET 16*2+8
+
+/*
+ * u16 crct10dif_pcl_avx512(u16 init_crc, const u8 *buf, size_t len);
+ */
+.align 16
+SYM_FUNC_START(crct10dif_pcl_avx512)
+
+	shl		$16, init_crc
+	/*
+	 * The code flow is exactly the same as for a 32-bit CRC. The only
+	 * difference is that before returning eax, we shift it right by 16
+	 * bits to scale back to 16 bits.
+	 */
+	sub		$(VARIABLE_OFFSET), %rsp
+
+	vbroadcasti32x4 SHUF_MASK(%rip), %zmm18
+
+	/* For sizes less than 256 bytes, we can't fold 256 bytes at a time. */
+	cmp		$256, len
+	jl		.less_than_256
+
+	/* load the initial crc value */
+	vmovd		init_crc, %xmm10
+
+	/*
+	 * crc value does not need to be byte-reflected, but it needs to be
+	 * moved to the high part of the register because data will be
+	 * byte-reflected and will align with initial crc at correct place.
+	 */
+	vpslldq		$12, %xmm10, %xmm10
+
+	/* receive the initial 64B data, xor the initial crc value. */
+	vmovdqu8	(buf), %zmm0
+	vmovdqu8	16*4(buf), %zmm4
+	vpshufb		%zmm18, %zmm0, %zmm0
+	vpshufb		%zmm18, %zmm4, %zmm4
+	vpxorq		%zmm10, %zmm0, %zmm0
+	vbroadcasti32x4	rk3(%rip), %zmm10
+
+	sub		$256, len
+	cmp		$256, len
+	jl		.fold_128_B_loop
+
+	vmovdqu8	16*8(buf), %zmm7
+	vmovdqu8	16*12(buf), %zmm8
+	vpshufb		%zmm18, %zmm7, %zmm7
+	vpshufb		%zmm18, %zmm8, %zmm8
+	vbroadcasti32x4 rk_1(%rip), %zmm16
+	sub		$256, len
+
+.fold_256_B_loop:
+	add		$256, buf
+	vmovdqu8	(buf), %zmm3
+	vpshufb		%zmm18, %zmm3, %zmm3
+	vpclmulqdq	$0x00, %zmm16, %zmm0, %zmm1
+	vpclmulqdq	$0x11, %zmm16, %zmm0, %zmm2
+	vpxorq		%zmm2, %zmm1, %zmm0
+	vpxorq		%zmm3, %zmm0, %zmm0
+
+	vmovdqu8	16*4(buf), %zmm9
+	vpshufb		%zmm18, %zmm9, %zmm9
+	vpclmulqdq	$0x00, %zmm16, %zmm4, %zmm5
+	vpclmulqdq	$0x11, %zmm16, %zmm4, %zmm6
+	vpxorq		%zmm6, %zmm5, %zmm4
+	vpxorq		%zmm9, %zmm4, %zmm4
+
+	vmovdqu8	16*8(buf), %zmm11
+	vpshufb		%zmm18, %zmm11, %zmm11
+	vpclmulqdq	$0x00, %zmm16, %zmm7, %zmm12
+	vpclmulqdq	$0x11, %zmm16, %zmm7, %zmm13
+	vpxorq		%zmm13, %zmm12, %zmm7
+	vpxorq		%zmm11, %zmm7, %zmm7
+
+	vmovdqu8	16*12(buf), %zmm17
+	vpshufb		%zmm18, %zmm17, %zmm17
+	vpclmulqdq	$0x00, %zmm16, %zmm8, %zmm14
+	vpclmulqdq	$0x11, %zmm16, %zmm8, %zmm15
+	vpxorq		%zmm15, %zmm14, %zmm8
+	vpxorq		%zmm17, %zmm8, %zmm8
+
+	sub		$256, len
+	jge		.fold_256_B_loop
+
+	/* Fold 256 into 128 */
+	add		$256, buf
+	vpclmulqdq	$0x00, %zmm10, %zmm0, %zmm1
+	vpclmulqdq	$0x11, %zmm10, %zmm0, %zmm2
+	vpternlogq	$0x96, %zmm2, %zmm1, %zmm7
+
+	vpclmulqdq	$0x00, %zmm10, %zmm4, %zmm5
+	vpclmulqdq	$0x11, %zmm10, %zmm4, %zmm6
+	vpternlogq	$0x96, %zmm6, %zmm5, %zmm8
+
+	vmovdqa32	%zmm7, %zmm0
+	vmovdqa32	%zmm8, %zmm4
+
+	add		$128, len
+	jmp		.fold_128_B_register
+
+	/*
+	 * At this point in the code, there are 128*x + y (0 <= y < 128) bytes
+	 * of buffer left. The fold_128_B_loop folds 128B at a time until we
+	 * have 128 + y bytes of buffer remaining.
+	 * This section of the code folds two zmm registers (eight 128-bit
+	 * lanes) in parallel.
+	 */
+.fold_128_B_loop:
+	add		$128, buf
+	vmovdqu8	(buf), %zmm8
+	vpshufb		%zmm18, %zmm8, %zmm8
+	vpclmulqdq	$0x00, %zmm10, %zmm0, %zmm2
+	vpclmulqdq	$0x11, %zmm10, %zmm0, %zmm1
+	vpxorq		%zmm1, %zmm2, %zmm0
+	vpxorq		%zmm8, %zmm0, %zmm0
+
+	vmovdqu8	16*4(buf), %zmm9
+	vpshufb		%zmm18, %zmm9, %zmm9
+	vpclmulqdq	$0x00, %zmm10, %zmm4, %zmm5
+	vpclmulqdq	$0x11, %zmm10, %zmm4, %zmm6
+	vpxorq		%zmm6, %zmm5, %zmm4
+	vpxorq		%zmm9, %zmm4, %zmm4
+
+	sub		$128, len
+	jge		.fold_128_B_loop
+
+	add		$128, buf
+
+	/*
+	 * At this point, the buffer pointer is pointing at the last y bytes
+	 * of the buffer, where 0 <= y < 128. The 128B of folded data are held
+	 * in two zmm registers: zmm0 and zmm4.
+	 */
+.fold_128_B_register:
+	/* Fold the 8 128-bit lanes into 1 xmm register, using distinct constants. */
+	vmovdqu8	rk9(%rip), %zmm16
+	vmovdqu8	rk17(%rip), %zmm11
+	vpclmulqdq	$0x00, %zmm16, %zmm0, %zmm1
+	vpclmulqdq	$0x11, %zmm16, %zmm0, %zmm2
+	vextracti64x2	$3, %zmm4, %xmm7
+
+	vpclmulqdq	$0x00, %zmm11, %zmm4, %zmm5
+	vpclmulqdq	$0x11, %zmm11, %zmm4, %zmm6
+	vmovdqa		rk1(%rip), %xmm10
+	vpternlogq	$0x96, %zmm5, %zmm2, %zmm1
+	vpternlogq	$0x96, %zmm7, %zmm6, %zmm1
+
+	vshufi64x2      $0x4e, %zmm1, %zmm1, %zmm8
+	vpxorq          %ymm1, %ymm8, %ymm8
+	vextracti64x2   $1, %ymm8, %xmm5
+	vpxorq          %xmm8, %xmm5, %xmm7
+
+	/*
+	 * Instead of 128, we add 128 - 16 to the loop counter to save one
+	 * instruction from the loop. Instead of a cmp instruction, we use
+	 * the negative flag with the jl instruction.
+	 */
+	add		$(128 - 16), len
+	jl		.final_reduction_for_128
+
+	/*
+	 * Now we have 16 + y bytes left to reduce: 16 bytes are in register
+	 * xmm7 and the rest is in memory. We can fold 16 bytes at a time if
+	 * y >= 16; continue folding 16B at a time.
+	 */
+.16B_reduction_loop:
+	vpclmulqdq	$0x11, %xmm10, %xmm7, %xmm8
+	vpclmulqdq	$0x00, %xmm10, %xmm7, %xmm7
+	vpxor		%xmm8, %xmm7, %xmm7
+	vmovdqu		(buf), %xmm0
+	vpshufb		%xmm18, %xmm0, %xmm0
+	vpxor		%xmm0, %xmm7, %xmm7
+	add		$16, buf
+	sub		$16, len
+
+	/*
+	 * Instead of a cmp instruction, we reuse the flags set by the sub
+	 * above (the equivalent of: cmp len, 16-16) with the jge instruction.
+	 * Check whether there are any more 16B chunks in the buffer to fold.
+	 */
+	jge		.16B_reduction_loop
+
+	/*
+	 * Now we have 16 + z bytes left to reduce, where 0 <= z < 16.
+	 * First, we reduce the data in the xmm7 register.
+	 */
+.final_reduction_for_128:
+	add		$16, len
+	je		.128_done
+
+	/*
+	 * Here we handle a tail of less than 16 bytes. Since we know that
+	 * there was data before the pointer, we can offset the input pointer
+	 * backwards so that we load exactly 16 bytes.
+	 * After that, the registers need to be adjusted.
+	 */
+.get_last_two_xmms:
+	vmovdqa		%xmm7, %xmm2
+	vmovdqu		-16(buf, len), %xmm1
+	vpshufb		%xmm18, %xmm1, %xmm1
+
+	/*
+	 * get rid of the extra data that was loaded before.
+	 * load the shift constant
+	 */
+	lea		16 + pshufb_shf_table(%rip), %rax
+	sub		len, %rax
+	vmovdqu		(%rax), %xmm0
+
+	vpshufb		%xmm0, %xmm2, %xmm2
+	vpxor		mask1(%rip), %xmm0, %xmm0
+	vpshufb		%xmm0, %xmm7, %xmm7
+	vpblendvb	%xmm0, %xmm2, %xmm1, %xmm1
+
+	vpclmulqdq	$0x11, %xmm10, %xmm7, %xmm8
+	vpclmulqdq	$0x00, %xmm10, %xmm7, %xmm7
+	vpxor		%xmm8, %xmm7, %xmm7
+	vpxor		%xmm1, %xmm7, %xmm7
+
+.128_done:
+	/* compute crc of a 128-bit value. */
+	vmovdqa		rk5(%rip), %xmm10
+	vmovdqa		%xmm7, %xmm0
+
+	vpclmulqdq	$0x01, %xmm10, %xmm7, %xmm7
+	vpslldq		$8, %xmm0, %xmm0
+	vpxor		%xmm0, %xmm7, %xmm7
+
+	vmovdqa		%xmm7, %xmm0
+	vpand		mask2(%rip), %xmm0, %xmm0
+	vpsrldq		$12, %xmm7, %xmm7
+	vpclmulqdq	$0x10, %xmm10, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+
+	/* barrett reduction */
+.barrett:
+	vmovdqa		rk7(%rip), %xmm10
+	vmovdqa		%xmm7, %xmm0
+	vpclmulqdq	$0x01, %xmm10, %xmm7, %xmm7
+	vpslldq		$4, %xmm7, %xmm7
+	vpclmulqdq	$0x11, %xmm10, %xmm7, %xmm7
+
+	vpslldq		$4, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+	vpextrd		$1, %xmm7, %eax
+
+.cleanup:
+	/* scale the result back to 16 bits. */
+	shr		$16, %eax
+	add		$(VARIABLE_OFFSET), %rsp
+	ret
+
+.align 16
+.less_than_256:
+	/* check if there is enough buffer to be able to fold 16B at a time. */
+	cmp		$32, len
+	jl		.less_than_32
+
+	/* If there is, load the constants. */
+	vmovdqa		rk1(%rip), %xmm10
+
+	/*
+	 * get the initial crc value and align it to its correct place.
+	 * And load the plaintext and byte-reflect it.
+	 */
+	vmovd		init_crc, %xmm0
+	vpslldq		$12, %xmm0, %xmm0
+	vmovdqu		(buf), %xmm7
+	vpshufb		%xmm18, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+
+	/* update the buffer pointer */
+	add		$16, buf
+
+	/* subtract 32 instead of 16 to save one instruction from the loop */
+	sub		$32, len
+
+	jmp		.16B_reduction_loop
+
+.align 16
+.less_than_32:
+	/*
+	 * Move the initial crc into the return value. This is necessary
+	 * for zero-length buffers.
+	 */
+	mov		init_crc, %eax
+	test		len, len
+	je		.cleanup
+
+	vmovd		init_crc, %xmm0
+	vpslldq		$12, %xmm0, %xmm0
+
+	cmp		$16, len
+	je		.exact_16_left
+	jl		.less_than_16_left
+
+	vmovdqu		(buf), %xmm7
+	vpshufb		%xmm18, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+	add		$16, buf
+	sub		$16, len
+	vmovdqa		rk1(%rip), %xmm10
+	jmp		.get_last_two_xmms
+
+.align 16
+.less_than_16_left:
+	/*
+	 * Use stack space to load fewer than 16 bytes of data; zero out
+	 * the 16B of stack memory first.
+	 */
+	vpxor		%xmm1, %xmm1, %xmm1
+	mov		%rsp, %r11
+	vmovdqa		%xmm1, (%r11)
+
+	cmp		$4, len
+	jl		.only_less_than_4
+
+	mov		len, %r9
+	cmp		$8, len
+	jl		.less_than_8_left
+
+	mov		(buf), %rax
+	mov		%rax, (%r11)
+	add		$8, %r11
+	sub		$8, len
+	add		$8, buf
+.less_than_8_left:
+	cmp		$4, len
+	jl		.less_than_4_left
+
+	mov		(buf), %eax
+	mov		%eax, (%r11)
+	add		$4, %r11
+	sub		$4, len
+	add		$4, buf
+
+.less_than_4_left:
+	cmp		$2, len
+	jl		.less_than_2_left
+
+	mov		(buf), %ax
+	mov		%ax, (%r11)
+	add		$2, %r11
+	sub		$2, len
+	add		$2, buf
+.less_than_2_left:
+	cmp		$1, len
+	jl		.zero_left
+
+	mov		(buf), %al
+	mov		%al, (%r11)
+
+.zero_left:
+	vmovdqa		(%rsp), %xmm7
+	vpshufb		%xmm18, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+
+	lea		16 + pshufb_shf_table(%rip), %rax
+	sub		%r9, %rax
+	vmovdqu		(%rax), %xmm0
+	vpxor		mask1(%rip), %xmm0, %xmm0
+
+	vpshufb		%xmm0,%xmm7, %xmm7
+	jmp		.128_done
+
+.align 16
+.exact_16_left:
+	vmovdqu		(buf), %xmm7
+	vpshufb		%xmm18, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+	jmp		.128_done
+
+.only_less_than_4:
+	cmp		$3, len
+	jl		.only_less_than_3
+
+	mov		(buf), %al
+	mov		%al, (%r11)
+
+	mov		1(buf), %al
+	mov		%al, 1(%r11)
+
+	mov		2(buf), %al
+	mov		%al, 2(%r11)
+
+	vmovdqa		(%rsp), %xmm7
+	vpshufb		%xmm18, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+
+	vpsrldq		$5, %xmm7, %xmm7
+	jmp		.barrett
+
+.only_less_than_3:
+	cmp		$2, len
+	jl		.only_less_than_2
+
+	mov		(buf), %al
+	mov		%al, (%r11)
+
+	mov		1(buf), %al
+	mov		%al, 1(%r11)
+
+	vmovdqa		(%rsp), %xmm7
+	vpshufb		%xmm18, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+
+	vpsrldq		$6, %xmm7, %xmm7
+	jmp		.barrett
+
+.only_less_than_2:
+	mov		(buf), %al
+	mov		%al, (%r11)
+
+	vmovdqa		(%rsp), %xmm7
+	vpshufb		%xmm18, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+
+	vpsrldq		$7, %xmm7, %xmm7
+	jmp		.barrett
+SYM_FUNC_END(crct10dif_pcl_avx512)
+
+.section        .data
+.align 32
+rk_1:		.quad 0xdccf000000000000
+rk_2:		.quad 0x4b0b000000000000
+rk1:		.quad 0x2d56000000000000
+rk2:		.quad 0x06df000000000000
+rk3:		.quad 0x9d9d000000000000
+rk4:		.quad 0x7cf5000000000000
+rk5:		.quad 0x2d56000000000000
+rk6:		.quad 0x1368000000000000
+rk7:		.quad 0x00000001f65a57f8
+rk8:		.quad 0x000000018bb70000
+rk9:		.quad 0xceae000000000000
+rk10:		.quad 0xbfd6000000000000
+rk11:		.quad 0x1e16000000000000
+rk12:		.quad 0x713c000000000000
+rk13:		.quad 0xf7f9000000000000
+rk14:		.quad 0x80a6000000000000
+rk15:		.quad 0x044c000000000000
+rk16:		.quad 0xe658000000000000
+rk17:		.quad 0xad18000000000000
+rk18:		.quad 0xa497000000000000
+rk19:		.quad 0x6ee3000000000000
+rk20:		.quad 0xe7b5000000000000
+rk_1b:		.quad 0x2d56000000000000
+rk_2b:		.quad 0x06df000000000000
+		.quad 0x0000000000000000
+		.quad 0x0000000000000000
+
+.align 16
+mask1:
+	.octa	0x80808080808080808080808080808080
+
+.align 16
+mask2:
+	.octa	0x00000000FFFFFFFFFFFFFFFFFFFFFFFF
+
+.align 16
+SHUF_MASK:
+	.octa	0x000102030405060708090A0B0C0D0E0F
+
+.align 16
+pshufb_shf_table:	.octa 0x8f8e8d8c8b8a89888786858483828100
+			.octa 0x000e0d0c0b0a09080706050403020100
+			.octa 0x0f0e0d0c0b0a09088080808080808080
+			.octa 0x80808080808080808080808080808080
diff --git a/arch/x86/crypto/crct10dif-pclmul_glue.c b/arch/x86/crypto/crct10dif-pclmul_glue.c
index 71291d5a..26a6350 100644
--- a/arch/x86/crypto/crct10dif-pclmul_glue.c
+++ b/arch/x86/crypto/crct10dif-pclmul_glue.c
@@ -35,6 +35,16 @@ 
 #include <asm/simd.h>
 
 asmlinkage u16 crc_t10dif_pcl(u16 init_crc, const u8 *buf, size_t len);
+#ifdef CONFIG_CRYPTO_CRCT10DIF_AVX512
+asmlinkage u16 crct10dif_pcl_avx512(u16 init_crc, const u8 *buf, size_t len);
+#else
+static u16 crct10dif_pcl_avx512(u16 init_crc, const u8 *buf, size_t len)
+{ return 0; }
+#endif
+
+static bool use_avx512;
+module_param(use_avx512, bool, 0644);
+MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");
 
 struct chksum_desc_ctx {
 	__u16 crc;
@@ -56,7 +66,12 @@  static int chksum_update(struct shash_desc *desc, const u8 *data,
 
 	if (length >= 16 && crypto_simd_usable()) {
 		kernel_fpu_begin();
-		ctx->crc = crc_t10dif_pcl(ctx->crc, data, length);
+		if (IS_ENABLED(CONFIG_CRYPTO_CRCT10DIF_AVX512) &&
+		    cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ) &&
+		    use_avx512)
+			ctx->crc = crct10dif_pcl_avx512(ctx->crc, data, length);
+		else
+			ctx->crc = crc_t10dif_pcl(ctx->crc, data, length);
 		kernel_fpu_end();
 	} else
 		ctx->crc = crc_t10dif_generic(ctx->crc, data, length);
@@ -75,7 +90,12 @@  static int __chksum_finup(__u16 crc, const u8 *data, unsigned int len, u8 *out)
 {
 	if (len >= 16 && crypto_simd_usable()) {
 		kernel_fpu_begin();
-		*(__u16 *)out = crc_t10dif_pcl(crc, data, len);
+		if (IS_ENABLED(CONFIG_CRYPTO_CRCT10DIF_AVX512) &&
+		    cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ) &&
+		    use_avx512)
+			*(__u16 *)out = crct10dif_pcl_avx512(crc, data, len);
+		else
+			*(__u16 *)out = crc_t10dif_pcl(crc, data, len);
 		kernel_fpu_end();
 	} else
 		*(__u16 *)out = crc_t10dif_generic(crc, data, len);
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 5861d34..1192dea 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -56,6 +56,12 @@ 
 # define DISABLE_PTI		(1 << (X86_FEATURE_PTI & 31))
 #endif
 
+#if defined(CONFIG_AS_VPCLMULQDQ)
+# define DISABLE_VPCLMULQDQ	0
+#else
+# define DISABLE_VPCLMULQDQ	(1 << (X86_FEATURE_VPCLMULQDQ & 31))
+#endif
+
 #ifdef CONFIG_IOMMU_SUPPORT
 # define DISABLE_ENQCMD	0
 #else
@@ -82,7 +88,7 @@ 
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
-			 DISABLE_ENQCMD)
+			 DISABLE_ENQCMD|DISABLE_VPCLMULQDQ)
 #define DISABLED_MASK17	0
 #define DISABLED_MASK18	0
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
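
The new DISABLED_MASK16 bit is what lets cpu_feature_enabled() fold to a
compile-time constant. A simplified sketch of the effect (not from the
patch):

	/*
	 * With CONFIG_AS_VPCLMULQDQ unset, DISABLE_VPCLMULQDQ is part of
	 * DISABLED_MASK16, so this test compiles to constant false; the
	 * branch referencing crct10dif_pcl_avx512 becomes dead code and
	 * no unresolved symbol is emitted for the unbuildable .S file.
	 */
	if (cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ) && use_avx512)
		crc = crct10dif_pcl_avx512(crc, data, len);
	else
		crc = crc_t10dif_pcl(crc, data, len);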
diff --git a/crypto/Kconfig b/crypto/Kconfig
index a367fcf..b090f14 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -613,6 +613,29 @@  config CRYPTO_CRC32C_VPMSUM
 	  (vpmsum) instructions, introduced in POWER8. Enable on POWER8
 	  and newer processors for improved performance.
 
+config CRYPTO_AVX512
+	bool "AVX512 hardware acceleration for crypto algorithms"
+	depends on X86
+	depends on 64BIT
+	help
+	  This option will compile in AVX512 hardware accelerated crypto
+	  algorithms. These optimized algorithms provide substantial (2-10x)
+	  improvements over the existing crypto algorithms for large data
+	  sizes. However, they may also incur a frequency penalty (a.k.a.
+	  "bin drops") and cause collateral damage to other workloads running
+	  on the same core.
+
+# We default CRYPTO_CRCT10DIF_AVX512 to y but make it depend on CRYPTO_AVX512
+# so that a single option (CRYPTO_AVX512) selects multiple algorithms where
+# supported. Specifically, if the platform and/or toolchain does not support
+# VPCLMULQDQ, then this algorithm should not be part of the set that
+# CRYPTO_AVX512 selects.
+config CRYPTO_CRCT10DIF_AVX512
+	bool
+	default y
+	depends on CRYPTO_AVX512
+	depends on CRYPTO_CRCT10DIF_PCLMUL
+	depends on AS_VPCLMULQDQ
 
 config CRYPTO_CRC32C_SPARC64
 	tristate "CRC32c CRC algorithm (SPARC64)"