Message ID | 1611386920-28579-1-git-send-email-megha.dey@intel.com (mailing list archive)
Series     | Introduce AVX512 optimized crypto algorithms
On Fri, Jan 22, 2021 at 11:29 PM Megha Dey <megha.dey@intel.com> wrote:
>
> Optimize crypto algorithms using AVX512 instructions - VAES and VPCLMULQDQ
> (first implemented on Intel's Ice Lake client and Xeon CPUs).
>
> These algorithms take advantage of the AVX512 registers to keep the CPU
> busy and increase memory bandwidth utilization. They provide substantial
> (2-10x) improvements over existing crypto algorithms when the update data
> size is greater than 128 bytes and do not have any significant impact when
> used on small amounts of data.
>
> However, these algorithms may also incur a frequency penalty and cause
> collateral damage to other workloads running on the same core (co-scheduled
> threads). These frequency drops are also known as bin drops, where 1 bin
> drop is around 100MHz. With the SpecCPU and ffmpeg benchmarks, a 0-1 bin
> drop (0-100MHz) is observed on the Ice Lake desktop and 0-2 bin drops
> (0-200MHz) are observed on the Ice Lake server.
>
> The AVX512 optimizations are disabled by default to avoid impact on other
> workloads. In order to use these optimized algorithms:
> 1. At compile time:
>    a. The user must enable the CONFIG_CRYPTO_AVX512 option
>    b. The toolchain (assembler) must support the VPCLMULQDQ and VAES
>       instructions
> 2. At run time:
>    a. The user must set the module parameter use_avx512 at boot time
>    b. The platform must support the VPCLMULQDQ and VAES features
>
> N.B. It is unclear whether these coarse-grained controls (a global module
> parameter) would meet all user needs. Perhaps some per-thread control
> might be useful? Looking for guidance here.

I've just been looking at some performance issues with in-kernel AVX,
and I have a whole pile of questions that I think should be answered
first:

What is the impact of using an AVX-512 instruction on the logical
thread, its siblings, and other cores on the package?

Does the impact depend on whether it's a 512-bit insn or a shorter EVEX
insn?

What is the impact on subsequent shorter EVEX, VEX, and legacy
SSE(2,3, etc) insns?

How does VZEROUPPER figure in? I can find an enormous amount of
misinformation online, but nothing authoritative.

What is the effect of the AVX-512 states (5-7) being "in use"? As far
as I can tell, the only operations that clear XINUSE[5-7] are XRSTOR
and its variants. Is this correct?

On AVX-512 capable CPUs, do we ever get a penalty for executing a
non-VEX insn followed by a large-width EVEX insn without an
intervening VZEROUPPER? The docs suggest no, since Broadwell and
before don't support EVEX, but I'd like to know for sure.

My current opinion is that we should not enable AVX-512 in-kernel
except on CPUs that we determine have good AVX-512 support. Based on
some reading, that seems to mean Ice Lake client and not anything
before it. I also think a bunch of the above questions should be
answered before we do any of this. Right now we have a regression of
unknown impact in regular AVX support in-kernel, we will have
performance issues in-kernel depending on what user code has done
recently, and I'm still trying to figure out what to do about it.
Throwing AVX-512 into the mix without real information is not going to
improve the situation.
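For reference, the boot-time gating described in the cover letter would
look roughly like the sketch below. The use_avx512 parameter name and the
CONFIG_CRYPTO_AVX512 option come from the series itself; the specific
feature checks and the register_avx512_crypto_algs() helper are
illustrative assumptions, not the actual patch contents.

/*
 * Minimal sketch of the gating described above: the AVX512 code paths are
 * only registered when the user opted in via the use_avx512 module
 * parameter and the CPU supports the required features.
 */
#include <linux/module.h>
#include <asm/cpufeatures.h>
#include <asm/cpufeature.h>

/* Hypothetical glue-code helper standing in for the series' real one. */
extern int register_avx512_crypto_algs(void);

static bool use_avx512;
module_param(use_avx512, bool, 0444);
MODULE_PARM_DESC(use_avx512, "Enable VAES/VPCLMULQDQ (AVX512) code paths");

static int __init crypto_avx512_mod_init(void)
{
	if (!use_avx512)
		return -ENODEV;

	/* Run-time check: the platform must support VAES and VPCLMULQDQ. */
	if (!boot_cpu_has(X86_FEATURE_VAES) ||
	    !boot_cpu_has(X86_FEATURE_VPCLMULQDQ) ||
	    !boot_cpu_has(X86_FEATURE_AVX512F))
		return -ENODEV;

	return register_avx512_crypto_algs();
}
module_init(crypto_avx512_mod_init);
MODULE_LICENSE("GPL");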
On 1/22/21 11:28 PM, Megha Dey wrote:
> Other implementations of these crypto algorithms are possible, which would
> result in lower crypto performance but would not cause collateral damage
> from frequency drops (AVX512L vs AVX512VL).

I don't think you told us anywhere what AVX512L and AVX512VL are, or why
they matter here.
Hi Andy,

On 1/24/2021 8:23 AM, Andy Lutomirski wrote:
> On Fri, Jan 22, 2021 at 11:29 PM Megha Dey <megha.dey@intel.com> wrote:
>> [...]
>
> I've just been looking at some performance issues with in-kernel AVX,
> and I have a whole pile of questions that I think should be answered
> first:
> [...]
> Throwing AVX-512 into the mix without real information is not going to
> improve the situation.

We are currently working on providing you with answers on the questions
you have raised regarding AVX.

Thanks,
Megha
On Tue, Feb 23, 2021 at 4:54 PM Dey, Megha <megha.dey@intel.com> wrote:
>
> Hi Andy,
>
> On 1/24/2021 8:23 AM, Andy Lutomirski wrote:
> > [...]
>
> We are currently working on providing you with answers on the questions
> you have raised regarding AVX.

Thanks!
Hi Andy,

Here are a few answers to your questions. Sorry for the delay. There's
more of this kind of stuff to come, so stay tuned.

On 1/24/21 8:23 AM, Andy Lutomirski wrote:
> What is the impact of using an AVX-512 instruction on the logical
> thread, its siblings, and other cores on the package?

There's a frequency penalty on the core using AVX-512, which means both
hyperthreads. The penalty duration is longer on Skylake than on Cascade
Lake, which in turn is longer than on Ice Lake. There's no direct penalty
to the other cores. They do all share an overall heat budget, of course,
and on systems with insufficient fans, heat can impact turbo-range
performance.

> Does the impact depend on whether it's a 512-bit insn or a shorter EVEX
> insn?

The impact is incurred when ZMM-specific registers are used; it is not
dependent on the encoding. On Ice Lake, the size of the drop depends on
the type of instruction: mov-like instructions incur little to no penalty,
while the VFMA family is the heaviest and incurs the largest penalty.

> What is the impact on subsequent shorter EVEX, VEX, and legacy
> SSE(2,3, etc) insns?

There's a "shadow" in time even after the last ZMM-using instruction
(hysteresis).

> How does VZEROUPPER figure in? I can find an enormous amount of
> misinformation online, but nothing authoritative.

VZEROUPPER exists to clear the AVX2 (and AVX-512) state so that subsequent
SSE operations don't get false data dependencies. It's not related to the
frequency impact.

> What is the effect of the AVX-512 states (5-7) being "in use"? As far
> as I can tell, the only operations that clear XINUSE[5-7] are XRSTOR
> and its variants. Is this correct?

XINUSE only impacts XSAVE*/XRSTOR*. Just having XINUSE[5-7]=0x7 will not
incur the frequency impact. In other words, the XSAVE*/XRSTOR* "use" of
ZMM-specific register state does not incur the frequency penalty.

> On AVX-512 capable CPUs, do we ever get a penalty for executing a
> non-VEX insn followed by a large-width EVEX insn without an
> intervening VZEROUPPER? The docs suggest no, since Broadwell and
> before don't support EVEX, but I'd like to know for sure.

It's the other way around: the false dependency is on the non-VEX side,
on state in the YMM/ZMM "upper half" that a non-VEX instruction is
required to preserve. An instruction cannot depend on a future
instruction, so a non-VEX insn followed by an (E)VEX insn has no false
dependency... so no VZEROUPPER is needed.
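As a side note on the XSAVE*/XRSTOR* point above: in-kernel SIMD users do
not issue those instructions themselves. They bracket any AVX/AVX-512
usage with kernel_fpu_begin()/kernel_fpu_end(), and the core FPU code
takes care of saving and restoring the user FPU state. A minimal sketch
follows; avx512_asm_core() is a placeholder name for whatever assembly
routine does the actual work, not a real kernel symbol.

#include <linux/types.h>
#include <asm/fpu/api.h>

/* Placeholder for the real ZMM-using assembly routine. */
void avx512_asm_core(const u8 *src, u8 *dst, unsigned int len);

static void crypto_update_block(const u8 *src, u8 *dst, unsigned int len)
{
	/*
	 * kernel_fpu_begin()/kernel_fpu_end() delimit the region in which
	 * vector registers may be used; preemption is disabled in between,
	 * and the core FPU code handles XSAVE/XRSTOR of the user state.
	 */
	kernel_fpu_begin();
	avx512_asm_core(src, dst, len);
	kernel_fpu_end();
}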
Hi all,

On 2/24/2021 9:42 AM, Andy Lutomirski wrote:
> On Tue, Feb 23, 2021 at 4:54 PM Dey, Megha <megha.dey@intel.com> wrote:
>> Hi Andy,
>>
>> On 1/24/2021 8:23 AM, Andy Lutomirski wrote:
>>> [...]
>> We are currently working on providing you with answers on the questions
>> you have raised regarding AVX.
> Thanks!

We had submitted this patch series last year; it uses the AVX512F, VAES,
and VPCLMULQDQ instructions and ZMM (512-bit) registers to optimize
certain crypto algorithms. As concluded, this approach could introduce a
frequency drop of 1-2 bins for sibling threads running on the same core
(512L instructions). The behavior is explained in article [1]. [2] covers
a similar topic to [1] but focuses on client processors.

Since then, we have worked on a new AES-GCM implementation using the
AVX512VL, VAES, and VPCLMULQDQ instructions and only 256-bit YMM
registers. With this implementation, we see a 1.5X improvement on ICX/ICL
for 16KB buffers compared to the existing kernel AES-GCM implementation
that works on 128-bit XMM registers. The instructions used in the new GCM
implementation fall into the 256L class, which maps onto Core License 2
and therefore causes no frequency reduction (Figure 6 in [1]); they
execute at the same frequency as SSE code.

Before we start work on an upstream-worthy patch, we would like to solicit
feedback to see whether this implementation approach receives interest
from the community. Please note that AES-GCM is the predominant cipher for
TLS and IPsec. Having an efficient, performant implementation of it in the
kernel will help customers and applications that rely on kTLS (like
CDN/TLS proxies) or kernel IPsec tunneling services.

[1] https://www.intel.com/content/www/us/en/architecture-and-technology/crypto-acceleration-in-xeon-scalable-processors-wp.html
[2] https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html

Thanks,
Megha
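To make the "256-bit only" point concrete, the sketch below shows the YMM
forms of the VAES and VPCLMULQDQ instructions via compiler intrinsics:
each instruction operates on two 128-bit AES blocks or carry-less
multiplications at once. This is only an illustration of the instruction
forms involved; the actual kernel implementation would presumably be
hand-written assembly rather than intrinsics, and the function names here
are made up for the example.

#include <immintrin.h>

/*
 * Illustration only. Requires a compiler with VAES/VPCLMULQDQ support,
 * e.g. -mvaes -mvpclmulqdq -mavx (or -mavx512vl for the EVEX forms).
 */
static inline __m256i aes_enc_round_x2(__m256i state, __m256i round_key)
{
	/* One AES encryption round on two blocks at once (VAES, YMM form). */
	return _mm256_aesenc_epi128(state, round_key);
}

static inline __m256i ghash_clmul_lo_x2(__m256i a, __m256i b)
{
	/* Two carry-less multiplies of the low 64-bit halves (VPCLMULQDQ). */
	return _mm256_clmulepi64_epi128(a, b, 0x00);
}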
On 1/31/22 10:43, Dey, Megha wrote:
> With this implementation, we see a 1.5X improvement on ICX/ICL for 16KB
> buffers compared to the existing kernel AES-GCM implementation that
> works on 128-bit XMM registers.

What is your best guess about how future-proof this implementation is?

Will this be an ICL/ICX one-off? Or, will implementations using 256-bit
YMM registers continue to enjoy a frequency advantage over the 512-bit
implementations for a long time?
Hi Dave,

On 1/31/2022 11:18 AM, Dave Hansen wrote:
> On 1/31/22 10:43, Dey, Megha wrote:
>> With this implementation, we see a 1.5X improvement on ICX/ICL for 16KB
>> buffers compared to the existing kernel AES-GCM implementation that
>> works on 128-bit XMM registers.
> What is your best guess about how future-proof this implementation is?
>
> Will this be an ICL/ICX one-off? Or, will implementations using 256-bit
> YMM registers continue to enjoy a frequency advantage over the 512-bit
> implementations for a long time?

This is not planned as an ICL/ICX one-off. AVX512VL code using YMM
registers is expected to have the same power license properties as AVX2
code, which implies it will have a frequency advantage over the current
AVX512 implementation until there are new implementations of the AVX512
instructions that do not have the frequency drop issue.
Hi all,

On 1/31/2022 11:18 AM, Dave Hansen wrote:
> On 1/31/22 10:43, Dey, Megha wrote:
>> With this implementation, we see a 1.5X improvement on ICX/ICL for 16KB
>> buffers compared to the existing kernel AES-GCM implementation that
>> works on 128-bit XMM registers.
> What is your best guess about how future-proof this implementation is?
>
> Will this be an ICL/ICX one-off? Or, will implementations using 256-bit
> YMM registers continue to enjoy a frequency advantage over the 512-bit
> implementations for a long time?

Dave,

This would not be an ICL/ICX one-off. For the foreseeable future, AVX512VL
YMM implementations will enjoy a frequency advantage over AVX512L ZMM
implementations, although over time ZMM and YMM are expected to converge
in performance.

Herbert/Andy,

Could you please let us know if this approach is a viable one and would be
acceptable to the community?

Optimizing crypto algorithms using AVX512VL instructions gives a 1.5X
performance improvement over the existing AES-GCM algorithm in the kernel
(which uses XMM registers), with no frequency drop.

Thanks,
Megha
On Thu, Feb 24, 2022, at 11:31 AM, Dey, Megha wrote:
> [...]
> Herbert/Andy,
>
> Could you please let us know if this approach is a viable one and would be
> acceptable to the community?
>
> Optimizing crypto algorithms using AVX512VL instructions gives a 1.5X
> performance improvement over the existing AES-GCM algorithm in the kernel
> (which uses XMM registers), with no frequency drop.

I'm assuming this would be enabled automatically without needing any
special command line options. If so, it seems reasonable to me.

--Andy
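For what it's worth, "enabled automatically" would presumably mean the
usual CPU-feature-gated module init with no command line option at all,
along the lines of the sketch below. The register_aes_gcm_avx512vl()
helper is hypothetical, and the exact feature set to check is an
assumption, not something the (not yet posted) patch has specified.

#include <linux/module.h>
#include <asm/cpufeatures.h>
#include <asm/cpufeature.h>

/* Hypothetical helper that registers the AVX512VL AES-GCM algorithm. */
extern int register_aes_gcm_avx512vl(void);

static int __init aes_gcm_avx512vl_mod_init(void)
{
	/*
	 * No module parameter: register the AVX512VL/VAES/VPCLMULQDQ
	 * AES-GCM implementation whenever the CPU supports it, and let the
	 * crypto API's priority mechanism prefer it over the XMM version.
	 */
	if (!boot_cpu_has(X86_FEATURE_VAES) ||
	    !boot_cpu_has(X86_FEATURE_VPCLMULQDQ) ||
	    !boot_cpu_has(X86_FEATURE_AVX512VL))
		return -ENODEV;

	return register_aes_gcm_avx512vl();
}
module_init(aes_gcm_avx512vl_mod_init);
MODULE_LICENSE("GPL");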