Message ID | 1611386920-28579-1-git-send-email-megha.dey@intel.com (mailing list archive)
Series     | Introduce AVX512 optimized crypto algorithms
On Fri, Jan 22, 2021 at 11:29 PM Megha Dey <megha.dey@intel.com> wrote:
>
> Optimize crypto algorithms using AVX512 instructions - VAES and VPCLMULQDQ
> (first implemented on Intel's Ice Lake client and Xeon CPUs).
>
> These algorithms take advantage of the AVX512 registers to keep the CPU
> busy and increase memory bandwidth utilization. They provide substantial
> (2-10x) improvements over existing crypto algorithms when the update data
> size is greater than 128 bytes and do not have any significant impact when
> used on small amounts of data.
>
> However, these algorithms may also incur a frequency penalty and cause
> collateral damage to other workloads running on the same core (co-scheduled
> threads). These frequency drops are also known as bin drops, where 1 bin
> drop is around 100MHz. With the SpecCPU and ffmpeg benchmarks, a 0-1 bin
> drop (0-100MHz) is observed on the Ice Lake desktop and 0-2 bin drops
> (0-200MHz) are observed on the Ice Lake server.
>
> The AVX512 optimizations are disabled by default to avoid impact on other
> workloads. In order to use these optimized algorithms:
> 1. At compile time:
>    a. The user must enable the CONFIG_CRYPTO_AVX512 option
>    b. The toolchain (assembler) must support the VPCLMULQDQ and VAES
>       instructions
> 2. At run time:
>    a. The user must set the module parameter use_avx512 at boot time
>    b. The platform must support the VPCLMULQDQ and VAES features
>
> N.B. It is unclear whether these coarse-grained controls (a global module
> parameter) would meet all user needs. Perhaps some per-thread control
> might be useful? Looking for guidance here.

I've just been looking at some performance issues with in-kernel AVX,
and I have a whole pile of questions that I think should be answered
first:

What is the impact of using an AVX-512 instruction on the logical
thread, its siblings, and other cores on the package?

Does the impact depend on whether it's a 512-bit insn or a shorter EVEX
insn?

What is the impact on subsequent shorter EVEX, VEX, and legacy
SSE(2,3, etc) insns?

How does VZEROUPPER figure in? I can find an enormous amount of
misinformation online, but nothing authoritative.

What is the effect of the AVX-512 states (5-7) being "in use"? As far
as I can tell, the only operations that clear XINUSE[5-7] are XRSTOR
and its variants. Is this correct?

On AVX-512 capable CPUs, do we ever get a penalty for executing a
non-VEX insn followed by a large-width EVEX insn without an
intervening VZEROUPPER? The docs suggest no, since Broadwell and
before don't support EVEX, but I'd like to know for sure.

My current opinion is that we should not enable AVX-512 in-kernel
except on CPUs that we determine have good AVX-512 support. Based on
some reading, that seems to mean Ice Lake client and not anything
before it. I also think a bunch of the above questions should be
answered before we do any of this. Right now we have a regression of
unknown impact in regular AVX support in-kernel, we will have
performance issues in-kernel depending on what user code has done
recently, and I'm still trying to figure out what to do about it.
Throwing AVX-512 into the mix without real information is not going to
improve the situation.
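For reference, the boot-time gating described in the cover letter would
look roughly like the sketch below. The use_avx512 parameter name and the
CONFIG_CRYPTO_AVX512 option come from the series itself; the specific
feature checks and the register_avx512_crypto_algs() helper are
illustrative assumptions, not the actual patch contents.

/*
 * Minimal sketch of the gating described above: the AVX512 code paths are
 * only registered when the user opted in via the use_avx512 module
 * parameter and the CPU supports the required features.
 */
#include <linux/module.h>
#include <asm/cpufeatures.h>
#include <asm/cpufeature.h>

/* Hypothetical glue-code helper standing in for the series' real one. */
extern int register_avx512_crypto_algs(void);

static bool use_avx512;
module_param(use_avx512, bool, 0444);
MODULE_PARM_DESC(use_avx512, "Enable VAES/VPCLMULQDQ (AVX512) code paths");

static int __init crypto_avx512_mod_init(void)
{
	if (!use_avx512)
		return -ENODEV;

	/* Run-time check: the platform must support VAES and VPCLMULQDQ. */
	if (!boot_cpu_has(X86_FEATURE_VAES) ||
	    !boot_cpu_has(X86_FEATURE_VPCLMULQDQ) ||
	    !boot_cpu_has(X86_FEATURE_AVX512F))
		return -ENODEV;

	return register_avx512_crypto_algs();
}
module_init(crypto_avx512_mod_init);
MODULE_LICENSE("GPL");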
On 1/22/21 11:28 PM, Megha Dey wrote:
> Other implementations of these crypto algorithms are possible, which would
> result in lower crypto performance but would not cause collateral damage
> from frequency drops (AVX512L vs AVX512VL).

I don't think you told us anywhere what AVX512L and AVX512VL are, or why
they matter here.
Hi Andy,

On 1/24/2021 8:23 AM, Andy Lutomirski wrote:
> On Fri, Jan 22, 2021 at 11:29 PM Megha Dey <megha.dey@intel.com> wrote:
>> [...]
>
> I've just been looking at some performance issues with in-kernel AVX,
> and I have a whole pile of questions that I think should be answered
> first:
> [...]
> Throwing AVX-512 into the mix without real information is not going to
> improve the situation.

We are currently working on providing you with answers on the questions
you have raised regarding AVX.

Thanks,
Megha
On Tue, Feb 23, 2021 at 4:54 PM Dey, Megha <megha.dey@intel.com> wrote:
>
> Hi Andy,
>
> On 1/24/2021 8:23 AM, Andy Lutomirski wrote:
> > [...]
>
> We are currently working on providing you with answers on the questions
> you have raised regarding AVX.

Thanks!
Hi Andy,

Here are a few answers to your questions. Sorry for the delay. There's
more of this kind of stuff to come, so stay tuned.

On 1/24/21 8:23 AM, Andy Lutomirski wrote:
> What is the impact of using an AVX-512 instruction on the logical
> thread, its siblings, and other cores on the package?

There's a frequency penalty on the core using AVX-512, which means both
hyperthreads. The penalty duration is longer on Skylake than on Cascade
Lake, which in turn is longer than on Ice Lake. There's no direct penalty
to the other cores. They do all share an overall heat budget, of course,
and on systems with insufficient fans, heat can impact turbo-range
performance.

> Does the impact depend on whether it's a 512-bit insn or a shorter EVEX
> insn?

The impact is incurred when ZMM-specific registers are used; it is not
dependent on the encoding. On Ice Lake, the size of the drop depends on
the type of instruction: mov-like instructions incur little to no penalty,
while the VFMA family is the heaviest and incurs the largest penalty.

> What is the impact on subsequent shorter EVEX, VEX, and legacy
> SSE(2,3, etc) insns?

There's a "shadow" in time even after the last ZMM-using instruction
(hysteresis).

> How does VZEROUPPER figure in? I can find an enormous amount of
> misinformation online, but nothing authoritative.

VZEROUPPER exists to clear the AVX2 (and AVX-512) state so that subsequent
SSE operations don't get false data dependencies. It's not related to the
frequency impact.

> What is the effect of the AVX-512 states (5-7) being "in use"? As far
> as I can tell, the only operations that clear XINUSE[5-7] are XRSTOR
> and its variants. Is this correct?

XINUSE only impacts XSAVE*/XRSTOR*. Just having XINUSE[5-7]=0x7 will not
incur the frequency impact. In other words, the XSAVE*/XRSTOR* "use" of
ZMM-specific register state does not incur the frequency penalty.

> On AVX-512 capable CPUs, do we ever get a penalty for executing a
> non-VEX insn followed by a large-width EVEX insn without an
> intervening VZEROUPPER? The docs suggest no, since Broadwell and
> before don't support EVEX, but I'd like to know for sure.

It's the other way around: the false dependency is on the non-VEX side,
on state in the YMM/ZMM "upper half" that a non-VEX instruction is
required to preserve. An instruction cannot depend on a future
instruction, so a non-VEX insn followed by an (E)VEX insn has no false
dependency... so no VZEROUPPER is needed.
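As a side note on the XSAVE*/XRSTOR* point above: in-kernel SIMD users do
not issue those instructions themselves. They bracket any AVX/AVX-512
usage with kernel_fpu_begin()/kernel_fpu_end(), and the core FPU code
takes care of saving and restoring the user FPU state. A minimal sketch
follows; avx512_asm_core() is a placeholder name for whatever assembly
routine does the actual work, not a real kernel symbol.

#include <linux/types.h>
#include <asm/fpu/api.h>

/* Placeholder for the real ZMM-using assembly routine. */
void avx512_asm_core(const u8 *src, u8 *dst, unsigned int len);

static void crypto_update_block(const u8 *src, u8 *dst, unsigned int len)
{
	/*
	 * kernel_fpu_begin()/kernel_fpu_end() delimit the region in which
	 * vector registers may be used; preemption is disabled in between,
	 * and the core FPU code handles XSAVE/XRSTOR of the user state.
	 */
	kernel_fpu_begin();
	avx512_asm_core(src, dst, len);
	kernel_fpu_end();
}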
Hi all,

On 2/24/2021 9:42 AM, Andy Lutomirski wrote:
> On Tue, Feb 23, 2021 at 4:54 PM Dey, Megha <megha.dey@intel.com> wrote:
>> Hi Andy,
>>
>> On 1/24/2021 8:23 AM, Andy Lutomirski wrote:
>>> [...]
>> We are currently working on providing you with answers on the questions
>> you have raised regarding AVX.
> Thanks!

We had submitted this patch series last year; it uses the AVX512F, VAES,
and VPCLMULQDQ instructions and ZMM (512-bit) registers to optimize
certain crypto algorithms. As concluded, this approach could introduce a
frequency drop of 1-2 bins for sibling threads running on the same core
(512L instructions). The behavior is explained in article [1]. [2] covers
a similar topic to [1] but focuses on client processors.

Since then, we have worked on a new AES-GCM implementation using the
AVX512VL, VAES, and VPCLMULQDQ instructions and only 256-bit YMM
registers. With this implementation, we see a 1.5X improvement on ICX/ICL
for 16KB buffers compared to the existing kernel AES-GCM implementation
that works on 128-bit XMM registers. The instructions used in the new GCM
implementation fall into the 256L class, which maps onto Core License 2
and therefore causes no frequency reduction (Figure 6 in [1]); they
execute at the same frequency as SSE code.

Before we start work on an upstream-worthy patch, we would like to solicit
feedback to see whether this implementation approach receives interest
from the community. Please note that AES-GCM is the predominant cipher for
TLS and IPsec. Having an efficient, performant implementation of it in the
kernel will help customers and applications that rely on kTLS (like
CDN/TLS proxies) or kernel IPsec tunneling services.

[1] https://www.intel.com/content/www/us/en/architecture-and-technology/crypto-acceleration-in-xeon-scalable-processors-wp.html
[2] https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html

Thanks,
Megha
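To make the "256-bit only" point concrete, the sketch below shows the YMM
forms of the VAES and VPCLMULQDQ instructions via compiler intrinsics:
each instruction operates on two 128-bit AES blocks or carry-less
multiplications at once. This is only an illustration of the instruction
forms involved; the actual kernel implementation would presumably be
hand-written assembly rather than intrinsics, and the function names here
are made up for the example.

#include <immintrin.h>

/*
 * Illustration only. Requires a compiler with VAES/VPCLMULQDQ support,
 * e.g. -mvaes -mvpclmulqdq -mavx (or -mavx512vl for the EVEX forms).
 */
static inline __m256i aes_enc_round_x2(__m256i state, __m256i round_key)
{
	/* One AES encryption round on two blocks at once (VAES, YMM form). */
	return _mm256_aesenc_epi128(state, round_key);
}

static inline __m256i ghash_clmul_lo_x2(__m256i a, __m256i b)
{
	/* Two carry-less multiplies of the low 64-bit halves (VPCLMULQDQ). */
	return _mm256_clmulepi64_epi128(a, b, 0x00);
}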
On 1/31/22 10:43, Dey, Megha wrote:
> With this implementation, we see a 1.5X improvement on ICX/ICL for 16KB
> buffers compared to the existing kernel AES-GCM implementation that
> works on 128-bit XMM registers.

What is your best guess about how future-proof this implementation is?

Will this be an ICL/ICX one-off? Or, will implementations using 256-bit
YMM registers continue to enjoy a frequency advantage over the 512-bit
implementations for a long time?
Hi Dave,

On 1/31/2022 11:18 AM, Dave Hansen wrote:
> On 1/31/22 10:43, Dey, Megha wrote:
>> With this implementation, we see a 1.5X improvement on ICX/ICL for 16KB
>> buffers compared to the existing kernel AES-GCM implementation that
>> works on 128-bit XMM registers.
> What is your best guess about how future-proof this implementation is?
>
> Will this be an ICL/ICX one-off? Or, will implementations using 256-bit
> YMM registers continue to enjoy a frequency advantage over the 512-bit
> implementations for a long time?

This is not planned as an ICL/ICX one-off. AVX512VL code using YMM
registers is expected to have the same power license properties as AVX2
code, which implies it will have a frequency advantage over the current
AVX512 implementation until there are new implementations of the AVX512
instructions that do not have the frequency drop issue.
Hi all,

On 1/31/2022 11:18 AM, Dave Hansen wrote:
> On 1/31/22 10:43, Dey, Megha wrote:
>> With this implementation, we see a 1.5X improvement on ICX/ICL for 16KB
>> buffers compared to the existing kernel AES-GCM implementation that
>> works on 128-bit XMM registers.
> What is your best guess about how future-proof this implementation is?
>
> Will this be an ICL/ICX one-off? Or, will implementations using 256-bit
> YMM registers continue to enjoy a frequency advantage over the 512-bit
> implementations for a long time?

Dave,

This would not be an ICL/ICX one-off. For the foreseeable future, AVX512VL
YMM implementations will enjoy a frequency advantage over AVX512L ZMM
implementations, although over time ZMM and YMM are expected to converge
in performance.

Herbert/Andy,

Could you please let us know if this approach is a viable one and would be
acceptable to the community?

Optimizing crypto algorithms using AVX512VL instructions gives a 1.5X
performance improvement over the existing AES-GCM algorithm in the kernel
(which uses XMM registers), with no frequency drop.

Thanks,
Megha
On Thu, Feb 24, 2022, at 11:31 AM, Dey, Megha wrote:
> [...]
> Herbert/Andy,
>
> Could you please let us know if this approach is a viable one and would be
> acceptable to the community?
>
> Optimizing crypto algorithms using AVX512VL instructions gives a 1.5X
> performance improvement over the existing AES-GCM algorithm in the kernel
> (which uses XMM registers), with no frequency drop.

I'm assuming this would be enabled automatically without needing any
special command line options. If so, it seems reasonable to me.

--Andy
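For what it's worth, "enabled automatically" would presumably mean the
usual CPU-feature-gated module init with no command line option at all,
along the lines of the sketch below. The register_aes_gcm_avx512vl()
helper is hypothetical, and the exact feature set to check is an
assumption, not something the (not yet posted) patch has specified.

#include <linux/module.h>
#include <asm/cpufeatures.h>
#include <asm/cpufeature.h>

/* Hypothetical helper that registers the AVX512VL AES-GCM algorithm. */
extern int register_aes_gcm_avx512vl(void);

static int __init aes_gcm_avx512vl_mod_init(void)
{
	/*
	 * No module parameter: register the AVX512VL/VAES/VPCLMULQDQ
	 * AES-GCM implementation whenever the CPU supports it, and let the
	 * crypto API's priority mechanism prefer it over the XMM version.
	 */
	if (!boot_cpu_has(X86_FEATURE_VAES) ||
	    !boot_cpu_has(X86_FEATURE_VPCLMULQDQ) ||
	    !boot_cpu_has(X86_FEATURE_AVX512VL))
		return -ENODEV;

	return register_aes_gcm_avx512vl();
}
module_init(aes_gcm_avx512vl_mod_init);
MODULE_LICENSE("GPL");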