crypto: x86/aes-ctr - rewrite AES-NI optimized CTR and add VAES support

Message ID 20250128063118.187690-1-ebiggers@kernel.org (mailing list archive)
State New
Series crypto: x86/aes-ctr - rewrite AES-NI optimized CTR and add VAES support

Commit Message

Eric Biggers Jan. 28, 2025, 6:31 a.m. UTC
From: Eric Biggers <ebiggers@google.com>

Delete aes_ctrby8_avx-x86_64.S and add a new assembly file
aes-ctr-avx-x86_64.S which follows a similar approach to
aes-xts-avx-x86_64.S in that it uses a "template" to provide AESNI+AVX,
VAES+AVX2, VAES+AVX10/256, and VAES+AVX10/512 code, instead of just
AESNI+AVX.  Wire it up to the crypto API accordingly.

This greatly improves the performance of AES-CTR and AES-XCTR on
VAES-capable CPUs, with the best case being AMD Zen 5, where throughput on
long messages increases by over 230% (i.e. it is over 3.3x faster).
Performance on older CPUs remains about the same.  There are some slight
regressions (less than 10%) on some short message lengths on some CPUs;
these are difficult to avoid, given how the previous code was so heavily
unrolled by message length, and they are not particularly important.
Detailed performance results are given in the tables below.

Support for both CTR and XCTR is retained.  The main loop remains
8-vector-wide, which differs from the 4-vector-wide main loops that are
used in the XTS and GCM code.  A wider loop is appropriate for CTR and
XCTR since they have fewer other instructions (such as vpclmulqdq) to
interleave with the AES instructions.
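
As a rough illustration of that point, consider a standalone AES-NI
intrinsics sketch (not code from this patch): each counter block is an
independent AES chain, so a wide main loop can advance many chains per
round and hide the vaesenc latency even with nothing else to interleave.

  /* Hypothetical userspace sketch; build with e.g. gcc -maes.
   * rk[] holds the nrounds+1 expanded round keys (nrounds = 10/12/14). */
  #include <immintrin.h>

  static void ctr_keystream_8wide(const __m128i *rk, int nrounds,
                                  const __m128i ctr[8], __m128i out[8])
  {
          __m128i b[8];
          int i, r;

          for (i = 0; i < 8; i++)        /* round 0: XOR with round key 0 */
                  b[i] = _mm_xor_si128(ctr[i], rk[0]);
          for (r = 1; r < nrounds; r++)  /* middle rounds: 8 independent chains */
                  for (i = 0; i < 8; i++)
                          b[i] = _mm_aesenc_si128(b[i], rk[r]);
          for (i = 0; i < 8; i++)        /* last round */
                  out[i] = _mm_aesenclast_si128(b[i], rk[nrounds]);
  }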

As was the case for AES-GCM, the new assembly code also has
a much smaller binary size, as it fixes the excessive unrolling by data
length and key length present in the old code.  Specifically, the new
assembly file compiles to about 9 KB of text vs. 28 KB for the old file.
This is despite 4x as many implementations being included.

The tables below give the detailed performance results.  Each entry is the
percentage improvement in single-threaded throughput for repeated
encryption of the given message length; an increase from 6000 MB/s to
12000 MB/s would be listed as 100%.  The results were collected by directly
measuring the Linux crypto API performance using a custom kernel module.
The tested CPUs were all server processors from Google Compute Engine,
except for Zen 5, which was a Ryzen 9 9950X desktop processor.
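
As a trivial aside (not part of the patch), each table entry is just the
percentage increase between two measured throughputs:

  /* Hypothetical helper, not in the kernel tree. */
  static int throughput_improvement_percent(unsigned long long old_mbps,
                                            unsigned long long new_mbps)
  {
          /* e.g. 6000 -> 12000 MB/s gives (12000 - 6000) * 100 / 6000 = 100 */
          return (int)((new_mbps - old_mbps) * 100 / old_mbps);
  }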

Table 1: AES-256-CTR throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                     | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
---------------------+-------+-------+-------+-------+-------+-------+
AMD Zen 5            |  232% |  203% |  212% |  143% |   71% |   95% |
Intel Emerald Rapids |  116% |  116% |  117% |   91% |   78% |   79% |
Intel Ice Lake       |  109% |  103% |  107% |   81% |   54% |   56% |
AMD Zen 4            |  109% |   91% |  100% |   70% |   43% |   59% |
AMD Zen 3            |   92% |   78% |   87% |   57% |   32% |   43% |
AMD Zen 2            |    9% |    8% |   14% |   12% |    8% |   21% |
Intel Skylake        |    7% |    7% |    8% |    5% |    3% |    8% |

                     |   300 |   200 |    64 |    63 |    16 |
---------------------+-------+-------+-------+-------+-------+
AMD Zen 5            |   57% |   39% |   -9% |    7% |   -7% |
Intel Emerald Rapids |   37% |   42% |   -0% |   13% |   -8% |
Intel Ice Lake       |   39% |   30% |   -1% |   14% |   -9% |
AMD Zen 4            |   42% |   38% |   -0% |   18% |   -3% |
AMD Zen 3            |   38% |   35% |    6% |   31% |    5% |
AMD Zen 2            |   24% |   23% |    5% |   30% |    3% |
Intel Skylake        |    9% |    1% |   -4% |   10% |   -7% |

Table 2: AES-256-XCTR throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                     | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
---------------------+-------+-------+-------+-------+-------+-------+
AMD Zen 5            |  240% |  201% |  216% |  151% |   75% |  108% |
Intel Emerald Rapids |  100% |   99% |  102% |   91% |   94% |  104% |
Intel Ice Lake       |   93% |   89% |   92% |   74% |   50% |   64% |
AMD Zen 4            |   86% |   75% |   83% |   60% |   41% |   52% |
AMD Zen 3            |   73% |   63% |   69% |   45% |   21% |   33% |
AMD Zen 2            |   -2% |   -2% |    2% |    3% |   -1% |   11% |
Intel Skylake        |   -1% |   -1% |    1% |    2% |   -1% |    9% |

                     |   300 |   200 |    64 |    63 |    16 |
---------------------+-------+-------+-------+-------+-------+
AMD Zen 5            |   78% |   56% |   -4% |   38% |   -2% |
Intel Emerald Rapids |   61% |   55% |    4% |   32% |   -5% |
Intel Ice Lake       |   57% |   42% |    3% |   44% |   -4% |
AMD Zen 4            |   35% |   28% |   -1% |   17% |   -3% |
AMD Zen 3            |   26% |   23% |   -3% |   11% |   -6% |
AMD Zen 2            |   13% |   24% |   -1% |   14% |   -3% |
Intel Skylake        |   16% |    8% |   -4% |   35% |   -3% |

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/x86/crypto/Makefile                |   2 +-
 arch/x86/crypto/aes-ctr-avx-x86_64.S    | 556 ++++++++++++++++++++++
 arch/x86/crypto/aes_ctrby8_avx-x86_64.S | 597 ------------------------
 arch/x86/crypto/aesni-intel_glue.c      | 450 ++++++++----------
 4 files changed, 760 insertions(+), 845 deletions(-)
 create mode 100644 arch/x86/crypto/aes-ctr-avx-x86_64.S
 delete mode 100644 arch/x86/crypto/aes_ctrby8_avx-x86_64.S


base-commit: 805ba04cb7ccfc7d72e834ebd796e043142156ba

Comments

Eric Biggers Jan. 28, 2025, 6:47 p.m. UTC | #1
On Mon, Jan 27, 2025 at 10:31:18PM -0800, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
> 
> Delete aes_ctrby8_avx-x86_64.S and add a new assembly file
> aes-ctr-avx-x86_64.S which follows a similar approach to
> aes-xts-avx-x86_64.S in that it uses a "template" to provide AESNI+AVX,
> VAES+AVX2, VAES+AVX10/256, and VAES+AVX10/512 code, instead of just
> AESNI+AVX.  Wire it up to the crypto API accordingly.

I realized there's a slight oversight in this patch: the existing AES-CTR had
both AVX and non-AVX variants, with the non-AVX assembly located in
aesni-intel_asm.S.  This patch deletes the non-AVX glue code but leaves the
non-AVX assembly, causing it to become unused.

The non-AVX AES-CTR code is x86_64 specific, so it is useful only in x86_64
kernels running on a CPU microarchitecture that supports AES-NI but not AVX:
namely Intel Westmere (2010) and the low-power Intel CPU microarchitectures
Silvermont (2013), Goldmont (2016), Goldmont Plus (2017), and Tremont (2020).
Tremont's successor, Gracemont (2021), supports AVX.

I'd lean towards just deleting the non-AVX AES-CTR code.  AES-CTR is less
important to optimize than AES-XTS and AES-GCM.  But it probably should be a
separate patch.

- Eric

Patch

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index 07b00bfca64b5..5d19f41bde583 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -46,11 +46,11 @@  obj-$(CONFIG_CRYPTO_CHACHA20_X86_64) += chacha-x86_64.o
 chacha-x86_64-y := chacha-avx2-x86_64.o chacha-ssse3-x86_64.o chacha_glue.o
 chacha-x86_64-$(CONFIG_AS_AVX512) += chacha-avx512vl-x86_64.o
 
 obj-$(CONFIG_CRYPTO_AES_NI_INTEL) += aesni-intel.o
 aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o
-aesni-intel-$(CONFIG_64BIT) += aes_ctrby8_avx-x86_64.o \
+aesni-intel-$(CONFIG_64BIT) += aes-ctr-avx-x86_64.o \
 			       aes-gcm-aesni-x86_64.o \
 			       aes-xts-avx-x86_64.o
 ifeq ($(CONFIG_AS_VAES)$(CONFIG_AS_VPCLMULQDQ),yy)
 aesni-intel-$(CONFIG_64BIT) += aes-gcm-avx10-x86_64.o
 endif
diff --git a/arch/x86/crypto/aes-ctr-avx-x86_64.S b/arch/x86/crypto/aes-ctr-avx-x86_64.S
new file mode 100644
index 0000000000000..2a1618e81e12a
--- /dev/null
+++ b/arch/x86/crypto/aes-ctr-avx-x86_64.S
@@ -0,0 +1,556 @@ 
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+//
+// Copyright 2025 Google LLC
+//
+// Author: Eric Biggers <ebiggers@google.com>
+//
+// This file contains x86_64 assembly implementations of AES-CTR and AES-XCTR
+// using the following sets of CPU features:
+//	- AES-NI && AVX
+//	- VAES && AVX2
+//	- VAES && (AVX10/256 || (AVX512BW && AVX512VL)) && BMI2
+//	- VAES && (AVX10/512 || (AVX512BW && AVX512VL)) && BMI2
+//
+// See the function definitions at the bottom of the file for more information.
+
+#include <linux/linkage.h>
+#include <linux/cfi_types.h>
+
+.section .rodata
+.p2align 4
+
+.Lbswap_mask:
+	.octa	0x000102030405060708090a0b0c0d0e0f
+
+.Lctr_pattern:
+	.quad	0, 0
+.Lone:
+	.quad	1, 0
+.Ltwo:
+	.quad	2, 0
+	.quad	3, 0
+
+.Lfour:
+	.quad	4, 0
+
+.text
+
+// Move a vector between memory and a register.
+// The register operand must be in the first 16 vector registers.
+.macro	_vmovdqu	src, dst
+.if VL < 64
+	vmovdqu		\src, \dst
+.else
+	vmovdqu8	\src, \dst
+.endif
+.endm
+
+// Move a vector between registers.
+// The registers must be in the first 16 vector registers.
+.macro	_vmovdqa	src, dst
+.if VL < 64
+	vmovdqa		\src, \dst
+.else
+	vmovdqa64	\src, \dst
+.endif
+.endm
+
+// Broadcast a 128-bit value from memory to all 128-bit lanes of a vector
+// register.  The register operand must be in the first 16 vector registers.
+.macro	_vbroadcast128	src, dst
+.if VL == 16
+	vmovdqu		\src, \dst
+.elseif VL == 32
+	vbroadcasti128	\src, \dst
+.else
+	vbroadcasti32x4	\src, \dst
+.endif
+.endm
+
+// XOR two vectors together.
+// Any register operands must be in the first 16 vector registers.
+.macro	_vpxor	src1, src2, dst
+.if VL < 64
+	vpxor		\src1, \src2, \dst
+.else
+	vpxord		\src1, \src2, \dst
+.endif
+.endm
+
+// Load 1 <= %ecx <= 15 bytes from the pointer \src into the xmm register \dst
+// and zeroize any remaining bytes.  Clobbers %rax, %rcx, and \tmp{64,32}.
+.macro	_load_partial_block	src, dst, tmp64, tmp32
+	sub		$8, %ecx		// LEN - 8
+	jle		.Lle8\@
+
+	// Load 9 <= LEN <= 15 bytes.
+	vmovq		(\src), \dst		// Load first 8 bytes
+	mov		(\src, %rcx), %rax	// Load last 8 bytes
+	neg		%ecx
+	shl		$3, %ecx
+	shr		%cl, %rax		// Discard overlapping bytes
+	vpinsrq		$1, %rax, \dst, \dst
+	jmp		.Ldone\@
+
+.Lle8\@:
+	add		$4, %ecx		// LEN - 4
+	jl		.Llt4\@
+
+	// Load 4 <= LEN <= 8 bytes.
+	mov		(\src), %eax		// Load first 4 bytes
+	mov		(\src, %rcx), \tmp32	// Load last 4 bytes
+	jmp		.Lcombine\@
+
+.Llt4\@:
+	// Load 1 <= LEN <= 3 bytes.
+	add		$2, %ecx		// LEN - 2
+	movzbl		(\src), %eax		// Load first byte
+	jl		.Lmovq\@
+	movzwl		(\src, %rcx), \tmp32	// Load last 2 bytes
+.Lcombine\@:
+	shl		$3, %ecx
+	shl		%cl, \tmp64
+	or		\tmp64, %rax		// Combine the two parts
+.Lmovq\@:
+	vmovq		%rax, \dst
+.Ldone\@:
+.endm
+
+// Store 1 <= %ecx <= 15 bytes from the xmm register \src to the pointer \dst.
+// Clobbers %rax, %rcx, and \tmp{64,32}.
+.macro	_store_partial_block	src, dst, tmp64, tmp32
+	sub		$8, %ecx		// LEN - 8
+	jl		.Llt8\@
+
+	// Store 8 <= LEN <= 15 bytes.
+	vpextrq		$1, \src, %rax
+	mov		%ecx, \tmp32
+	shl		$3, %ecx
+	ror		%cl, %rax
+	mov		%rax, (\dst, \tmp64)	// Store last LEN - 8 bytes
+	vmovq		\src, (\dst)		// Store first 8 bytes
+	jmp		.Ldone\@
+
+.Llt8\@:
+	add		$4, %ecx		// LEN - 4
+	jl		.Llt4\@
+
+	// Store 4 <= LEN <= 7 bytes.
+	vpextrd		$1, \src, %eax
+	mov		%ecx, \tmp32
+	shl		$3, %ecx
+	ror		%cl, %eax
+	mov		%eax, (\dst, \tmp64)	// Store last LEN - 4 bytes
+	vmovd		\src, (\dst)		// Store first 4 bytes
+	jmp		.Ldone\@
+
+.Llt4\@:
+	// Store 1 <= LEN <= 3 bytes.
+	vpextrb		$0, \src, 0(\dst)
+	cmp		$-2, %ecx		// LEN - 4 == -2, i.e. LEN == 2?
+	jl		.Ldone\@
+	vpextrb		$1, \src, 1(\dst)
+	je		.Ldone\@
+	vpextrb		$2, \src, 2(\dst)
+.Ldone\@:
+.endm
+
+// Prepare the next two vectors of AES inputs in AESDATA\i0 and AESDATA\i1 and
+// XOR each with the zero-th round key.  Also update LE_CTR if !\final.
+.macro	_prepare_2_ctr_vecs	is_xctr, i0, i1, final=0
+.if \is_xctr
+  .if USE_AVX10
+	_vmovdqa	LE_CTR, AESDATA\i0
+	vpternlogd	$0x96, XCTR_IV, RNDKEY0, AESDATA\i0
+  .else
+	vpxor		XCTR_IV, LE_CTR, AESDATA\i0
+	vpxor		RNDKEY0, AESDATA\i0, AESDATA\i0
+  .endif
+	vpaddq		LE_CTR_INC1, LE_CTR, AESDATA\i1
+
+  .if USE_AVX10
+	vpternlogd	$0x96, XCTR_IV, RNDKEY0, AESDATA\i1
+  .else
+	vpxor		XCTR_IV, AESDATA\i1, AESDATA\i1
+	vpxor		RNDKEY0, AESDATA\i1, AESDATA\i1
+  .endif
+.else
+	vpshufb		BSWAP_MASK, LE_CTR, AESDATA\i0
+	_vpxor		RNDKEY0, AESDATA\i0, AESDATA\i0
+	vpaddq		LE_CTR_INC1, LE_CTR, AESDATA\i1
+	vpshufb		BSWAP_MASK, AESDATA\i1, AESDATA\i1
+	_vpxor		RNDKEY0, AESDATA\i1, AESDATA\i1
+.endif
+.if !\final
+	vpaddq		LE_CTR_INC2, LE_CTR, LE_CTR
+.endif
+.endm
+
+// Do all AES rounds on the data in the given AESDATA vectors, excluding the
+// zero-th and last round.
+.macro	_aesenc_loop	vecs
+	mov		KEY, %rax
+1:
+	_vbroadcast128	(%rax), RNDKEY
+.irp i, \vecs
+	vaesenc		RNDKEY, AESDATA\i, AESDATA\i
+.endr
+	add		$16, %rax
+	cmp		%rax, RNDKEYLAST_PTR
+	jne		1b
+.endm
+
+// Finalize the keystream blocks in the given AESDATA vectors by doing the last
+// AES round, then XOR those keystream blocks with the corresponding data.
+// Reduce latency by doing the XOR before the vaesenclast, utilizing the
+// property vaesenclast(key, a) ^ b == vaesenclast(key ^ b, a).
+.macro	_aesenclast_and_xor	vecs
+.irp i, \vecs
+	_vpxor		\i*VL(SRC), RNDKEYLAST, RNDKEY
+	vaesenclast	RNDKEY, AESDATA\i, AESDATA\i
+.endr
+.irp i, \vecs
+	_vmovdqu	AESDATA\i, \i*VL(DST)
+.endr
+.endm
+
+// XOR the keystream blocks in the specified AESDATA vectors with the
+// corresponding data.
+.macro	_xor_data	vecs
+.irp i, \vecs
+	_vpxor		\i*VL(SRC), AESDATA\i, AESDATA\i
+.endr
+.irp i, \vecs
+	_vmovdqu	AESDATA\i, \i*VL(DST)
+.endr
+.endm
+
+.macro	_aes_ctr_crypt		is_xctr
+
+	// Define register aliases V0-V15 that map to the xmm, ymm, or zmm
+	// registers according to the selected Vector Length (VL).
+.irp i, 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
+  .if VL == 16
+	.set	V\i,		%xmm\i
+  .elseif VL == 32
+	.set	V\i,		%ymm\i
+  .elseif VL == 64
+	.set	V\i,		%zmm\i
+  .else
+	.error "Unsupported Vector Length (VL)"
+  .endif
+.endr
+
+	// Function arguments
+	.set	KEY,		%rdi	// Initially points to the start of the
+					// crypto_aes_ctx, then is advanced to
+					// point to the index 1 round key
+	.set	KEY32,		%edi	// Available as temp register after all
+					// keystream blocks have been generated
+	.set	SRC,		%rsi	// Pointer to next source data
+	.set	DST,		%rdx	// Pointer to next destination data
+	.set	LEN,		%ecx	// Remaining length in bytes.
+					// Note: _load_partial_block relies on
+					// this being in %ecx.
+	.set	LEN64,		%rcx	// Zero-extend LEN before using!
+	.set	LEN8,		%cl
+.if \is_xctr
+	.set	XCTR_IV_PTR,	%r8	// const u8 iv[AES_BLOCK_SIZE];
+	.set	XCTR_CTR,	%r9	// u64 ctr;
+	.set	LE_CTR_PTR,	nil
+.else
+	.set	LE_CTR_PTR,	%r8	// const u64 le_ctr[2];
+	.set	XCTR_IV_PTR,	nil
+	.set	XCTR_CTR,	nil
+.endif
+
+	// Additional local variables
+	.set	RNDKEYLAST_PTR,	%r10
+	.set	AESDATA0,	V0
+	.set	AESDATA0_XMM,	%xmm0
+	.set	AESDATA1,	V1
+	.set	AESDATA1_XMM,	%xmm1
+	.set	AESDATA2,	V2
+	.set	AESDATA3,	V3
+	.set	AESDATA4,	V4
+	.set	AESDATA5,	V5
+	.set	AESDATA6,	V6
+	.set	AESDATA7,	V7
+.if \is_xctr
+	.set	XCTR_IV,	V8
+	.set	BSWAP_MASK,	nil
+.else
+	.set	BSWAP_MASK,	V8
+.endif
+	.set	LE_CTR,		V9
+	.set	LE_CTR_XMM,	%xmm9
+	.set	LE_CTR_INC1,	V10
+	.set	LE_CTR_INC2,	V11
+	.set	RNDKEY0,	V12
+	.set	RNDKEYLAST,	V13
+	.set	RNDKEY,		V14
+
+	// Create the first vector of counters.
+.if \is_xctr
+  .if VL == 16
+	vmovq		XCTR_CTR, LE_CTR
+  .elseif VL == 32
+	vmovq		XCTR_CTR, LE_CTR_XMM
+	inc		XCTR_CTR
+	vmovq		XCTR_CTR, AESDATA0_XMM
+	vinserti128	$1, AESDATA0_XMM, LE_CTR, LE_CTR
+  .else
+	vpbroadcastq	XCTR_CTR, LE_CTR
+	vpsrldq		$8, LE_CTR, LE_CTR
+	vpaddq		.Lctr_pattern(%rip), LE_CTR, LE_CTR
+  .endif
+	_vbroadcast128	(XCTR_IV_PTR), XCTR_IV
+.else
+	_vbroadcast128	(LE_CTR_PTR), LE_CTR
+  .if VL > 16
+	vpaddq		.Lctr_pattern(%rip), LE_CTR, LE_CTR
+  .endif
+	_vbroadcast128	.Lbswap_mask(%rip), BSWAP_MASK
+.endif
+
+.if VL == 16
+	_vbroadcast128	.Lone(%rip), LE_CTR_INC1
+.elseif VL == 32
+	_vbroadcast128	.Ltwo(%rip), LE_CTR_INC1
+.else
+	_vbroadcast128	.Lfour(%rip), LE_CTR_INC1
+.endif
+	vpsllq		$1, LE_CTR_INC1, LE_CTR_INC2
+
+	// Load the AES key length: 16 (AES-128), 24 (AES-192), or 32 (AES-256).
+	movl		480(KEY), %eax
+
+	// Compute the pointer to the last round key.
+	lea		6*16(KEY, %rax, 4), RNDKEYLAST_PTR
+
+	// Load the zero-th and last round keys.
+	_vbroadcast128	(KEY), RNDKEY0
+	_vbroadcast128	(RNDKEYLAST_PTR), RNDKEYLAST
+
+	// Make KEY point to the first round key.
+	add		$16, KEY
+
+	// This is the main loop, which encrypts 8 vectors of data at a time.
+	add		$-8*VL, LEN
+	jl		.Lloop_8x_done\@
+.Lloop_8x\@:
+	_prepare_2_ctr_vecs	\is_xctr, 0, 1
+	_prepare_2_ctr_vecs	\is_xctr, 2, 3
+	_prepare_2_ctr_vecs	\is_xctr, 4, 5
+	_prepare_2_ctr_vecs	\is_xctr, 6, 7
+	_aesenc_loop	"0,1,2,3,4,5,6,7"
+	_aesenclast_and_xor "0,1,2,3,4,5,6,7"
+	sub		$-8*VL, SRC
+	sub		$-8*VL, DST
+	add		$-8*VL, LEN
+	jge		.Lloop_8x\@
+.Lloop_8x_done\@:
+	sub		$-8*VL, LEN
+	jz		.Ldone\@
+
+	// 1 <= LEN < 8*VL.  Generate 2, 4, or 8 more vectors of keystream blocks,
+	// depending on the remaining LEN.
+
+	_prepare_2_ctr_vecs	\is_xctr, 0, 1
+	_prepare_2_ctr_vecs	\is_xctr, 2, 3
+	cmp		$4*VL, LEN
+	jle		.Lenc_tail_atmost4vecs\@
+
+	// 4*VL < LEN < 8*VL.  Generate 8 vectors of keystream blocks.  Use the
+	// first 4 to XOR 4 full vectors of data.  Then XOR the remaining data.
+	_prepare_2_ctr_vecs	\is_xctr, 4, 5
+	_prepare_2_ctr_vecs	\is_xctr, 6, 7, final=1
+	_aesenc_loop	"0,1,2,3,4,5,6,7"
+	_aesenclast_and_xor "0,1,2,3"
+	vaesenclast	RNDKEYLAST, AESDATA4, AESDATA0
+	vaesenclast	RNDKEYLAST, AESDATA5, AESDATA1
+	vaesenclast	RNDKEYLAST, AESDATA6, AESDATA2
+	vaesenclast	RNDKEYLAST, AESDATA7, AESDATA3
+	sub		$-4*VL, SRC
+	sub		$-4*VL, DST
+	add		$-4*VL, LEN
+	cmp		$1*VL-1, LEN
+	jle		.Lxor_tail_partial_vec_0\@
+	_xor_data	"0"
+	cmp		$2*VL-1, LEN
+	jle		.Lxor_tail_partial_vec_1\@
+	_xor_data	"1"
+	cmp		$3*VL-1, LEN
+	jle		.Lxor_tail_partial_vec_2\@
+	_xor_data	"2"
+	cmp		$4*VL-1, LEN
+	jle		.Lxor_tail_partial_vec_3\@
+	_xor_data	"3"
+	jmp		.Ldone\@
+
+.Lenc_tail_atmost4vecs\@:
+	cmp		$2*VL, LEN
+	jle		.Lenc_tail_atmost2vecs\@
+
+	// 2*VL < LEN <= 4*VL.  Generate 4 vectors of keystream blocks.  Use the
+	// first 2 to XOR 2 full vectors of data.  Then XOR the remaining data.
+	_aesenc_loop	"0,1,2,3"
+	_aesenclast_and_xor "0,1"
+	vaesenclast	RNDKEYLAST, AESDATA2, AESDATA0
+	vaesenclast	RNDKEYLAST, AESDATA3, AESDATA1
+	sub		$-2*VL, SRC
+	sub		$-2*VL, DST
+	add		$-2*VL, LEN
+	jmp		.Lxor_tail_upto2vecs\@
+
+.Lenc_tail_atmost2vecs\@:
+	// 1 <= LEN <= 2*VL.  Generate 2 vectors of keystream blocks.  Then XOR
+	// the remaining data.
+	_aesenc_loop	"0,1"
+.irp i, 0,1
+	vaesenclast	RNDKEYLAST, AESDATA\i, AESDATA\i
+.endr
+
+.Lxor_tail_upto2vecs\@:
+	cmp		$1*VL-1, LEN
+	jle		.Lxor_tail_partial_vec_0\@
+	_xor_data	"0"
+	cmp		$2*VL-1, LEN
+	jle		.Lxor_tail_partial_vec_1\@
+	_xor_data	"1"
+	jmp		.Ldone\@
+
+.Lxor_tail_partial_vec_1\@:
+	add		$-1*VL, LEN
+	jz		.Ldone\@
+	sub		$-1*VL, SRC
+	sub		$-1*VL, DST
+	_vmovdqa	AESDATA1, AESDATA0
+	jmp		.Lxor_tail_partial_vec_0\@
+
+.Lxor_tail_partial_vec_2\@:
+	add		$-2*VL, LEN
+	jz		.Ldone\@
+	sub		$-2*VL, SRC
+	sub		$-2*VL, DST
+	_vmovdqa	AESDATA2, AESDATA0
+	jmp		.Lxor_tail_partial_vec_0\@
+
+.Lxor_tail_partial_vec_3\@:
+	add		$-3*VL, LEN
+	jz		.Ldone\@
+	sub		$-3*VL, SRC
+	sub		$-3*VL, DST
+	_vmovdqa	AESDATA3, AESDATA0
+
+.Lxor_tail_partial_vec_0\@:
+	// XOR the remaining 1 <= LEN < VL bytes.  It's easy if masked
+	// loads/stores are available; otherwise it's a bit harder...
+.if USE_AVX10
+  .if VL <= 32
+	mov		$-1, %eax
+	bzhi		LEN, %eax, %eax
+	kmovd		%eax, %k1
+  .else
+	mov		$-1, %rax
+	bzhi		LEN64, %rax, %rax
+	kmovq		%rax, %k1
+  .endif
+	vmovdqu8	(SRC), AESDATA1{%k1}{z}
+	_vpxor		AESDATA1, AESDATA0, AESDATA0
+	vmovdqu8	AESDATA0, (DST){%k1}
+.else
+  .if VL == 32
+	cmp		$16, LEN
+	jl		1f
+	vpxor		(SRC), AESDATA0_XMM, AESDATA1_XMM
+	vmovdqu		AESDATA1_XMM, (DST)
+	add		$16, SRC
+	add		$16, DST
+	sub		$16, LEN
+	jz		.Ldone\@
+	vextracti128	$1, AESDATA0, AESDATA0_XMM
+1:
+  .endif
+	mov		LEN, %r10d
+	_load_partial_block	SRC, AESDATA1_XMM, KEY, KEY32
+	vpxor		AESDATA1_XMM, AESDATA0_XMM, AESDATA0_XMM
+	mov		%r10d, %ecx
+	_store_partial_block	AESDATA0_XMM, DST, KEY, KEY32
+.endif
+
+.Ldone\@:
+.if VL > 16
+	vzeroupper
+.endif
+	RET
+.endm
+
+// Below are the definitions of the functions generated by the above macro.
+// They have the following prototypes:
+//
+//
+// void aes_ctr64_crypt_##suffix(const struct crypto_aes_ctx *key,
+//				 const u8 *src, u8 *dst, int len,
+//				 const u64 le_ctr[2]);
+//
+// void aes_xctr_crypt_##suffix(const struct crypto_aes_ctx *key,
+//				const u8 *src, u8 *dst, int len,
+//				const u8 iv[AES_BLOCK_SIZE], u64 ctr);
+//
+// Both functions generate |len| bytes of keystream, XOR it with the data from
+// |src|, and write the result to |dst|.  On non-final calls, |len| must be a
+// multiple of 16.  On the final call, |len| can be any value.
+//
+// aes_ctr64_crypt_* implement "regular" CTR, where the keystream is generated
+// from a 128-bit big endian block counter.  HOWEVER, to keep the assembly code
+// simple, some of the counter management is left to the caller.
+// aes_ctr64_crypt_* take the counter in little endian form, only increment the
+// low 64 bits internally, and don't write the updated counter back to memory.
+// The caller is responsible for converting the starting IV to the little endian
+// le_ctr, detecting the (very rare) case of a carry out of the low 64 bits
+// being needed and splitting at that point with a carry done in between, and
+// updating le_ctr after each part if the message is multi-part.
+//
+// aes_xctr_crypt_* implement XCTR as specified in "Length-preserving encryption
+// with HCTR2" (https://eprint.iacr.org/2021/1441.pdf).  XCTR is an
+// easier-to-implement variant of CTR that uses little endian byte order and
+// eliminates carries.  |ctr| is the per-message block counter starting at 1.
+
+.set	VL, 16
+.set	USE_AVX10, 0
+SYM_TYPED_FUNC_START(aes_ctr64_crypt_aesni_avx)
+	_aes_ctr_crypt	0
+SYM_FUNC_END(aes_ctr64_crypt_aesni_avx)
+SYM_TYPED_FUNC_START(aes_xctr_crypt_aesni_avx)
+	_aes_ctr_crypt	1
+SYM_FUNC_END(aes_xctr_crypt_aesni_avx)
+
+#if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ)
+.set	VL, 32
+.set	USE_AVX10, 0
+SYM_TYPED_FUNC_START(aes_ctr64_crypt_vaes_avx2)
+	_aes_ctr_crypt	0
+SYM_FUNC_END(aes_ctr64_crypt_vaes_avx2)
+SYM_TYPED_FUNC_START(aes_xctr_crypt_vaes_avx2)
+	_aes_ctr_crypt	1
+SYM_FUNC_END(aes_xctr_crypt_vaes_avx2)
+
+.set	VL, 32
+.set	USE_AVX10, 1
+SYM_TYPED_FUNC_START(aes_ctr64_crypt_vaes_avx10_256)
+	_aes_ctr_crypt	0
+SYM_FUNC_END(aes_ctr64_crypt_vaes_avx10_256)
+SYM_TYPED_FUNC_START(aes_xctr_crypt_vaes_avx10_256)
+	_aes_ctr_crypt	1
+SYM_FUNC_END(aes_xctr_crypt_vaes_avx10_256)
+
+.set	VL, 64
+.set	USE_AVX10, 1
+SYM_TYPED_FUNC_START(aes_ctr64_crypt_vaes_avx10_512)
+	_aes_ctr_crypt	0
+SYM_FUNC_END(aes_ctr64_crypt_vaes_avx10_512)
+SYM_TYPED_FUNC_START(aes_xctr_crypt_vaes_avx10_512)
+	_aes_ctr_crypt	1
+SYM_FUNC_END(aes_xctr_crypt_vaes_avx10_512)
+#endif // CONFIG_AS_VAES && CONFIG_AS_VPCLMULQDQ
diff --git a/arch/x86/crypto/aes_ctrby8_avx-x86_64.S b/arch/x86/crypto/aes_ctrby8_avx-x86_64.S
deleted file mode 100644
index 2402b9418cd7a..0000000000000
--- a/arch/x86/crypto/aes_ctrby8_avx-x86_64.S
+++ /dev/null
@@ -1,597 +0,0 @@ 
-/* SPDX-License-Identifier: GPL-2.0-only OR BSD-3-Clause */
-/*
- * AES CTR mode by8 optimization with AVX instructions. (x86_64)
- *
- * Copyright(c) 2014 Intel Corporation.
- *
- * Contact Information:
- * James Guilford <james.guilford@intel.com>
- * Sean Gulley <sean.m.gulley@intel.com>
- * Chandramouli Narayanan <mouli@linux.intel.com>
- */
-/*
- * This is AES128/192/256 CTR mode optimization implementation. It requires
- * the support of Intel(R) AESNI and AVX instructions.
- *
- * This work was inspired by the AES CTR mode optimization published
- * in Intel Optimized IPSEC Cryptographic library.
- * Additional information on it can be found at:
- *    https://github.com/intel/intel-ipsec-mb
- */
-
-#include <linux/linkage.h>
-
-#define VMOVDQ		vmovdqu
-
-/*
- * Note: the "x" prefix in these aliases means "this is an xmm register".  The
- * alias prefixes have no relation to XCTR where the "X" prefix means "XOR
- * counter".
- */
-#define xdata0		%xmm0
-#define xdata1		%xmm1
-#define xdata2		%xmm2
-#define xdata3		%xmm3
-#define xdata4		%xmm4
-#define xdata5		%xmm5
-#define xdata6		%xmm6
-#define xdata7		%xmm7
-#define xcounter	%xmm8	// CTR mode only
-#define xiv		%xmm8	// XCTR mode only
-#define xbyteswap	%xmm9	// CTR mode only
-#define xtmp		%xmm9	// XCTR mode only
-#define xkey0		%xmm10
-#define xkey4		%xmm11
-#define xkey8		%xmm12
-#define xkey12		%xmm13
-#define xkeyA		%xmm14
-#define xkeyB		%xmm15
-
-#define p_in		%rdi
-#define p_iv		%rsi
-#define p_keys		%rdx
-#define p_out		%rcx
-#define num_bytes	%r8
-#define counter		%r9	// XCTR mode only
-#define tmp		%r10
-#define	DDQ_DATA	0
-#define	XDATA		1
-#define KEY_128		1
-#define KEY_192		2
-#define KEY_256		3
-
-.section .rodata
-.align 16
-
-byteswap_const:
-	.octa 0x000102030405060708090A0B0C0D0E0F
-ddq_low_msk:
-	.octa 0x0000000000000000FFFFFFFFFFFFFFFF
-ddq_high_add_1:
-	.octa 0x00000000000000010000000000000000
-ddq_add_1:
-	.octa 0x00000000000000000000000000000001
-ddq_add_2:
-	.octa 0x00000000000000000000000000000002
-ddq_add_3:
-	.octa 0x00000000000000000000000000000003
-ddq_add_4:
-	.octa 0x00000000000000000000000000000004
-ddq_add_5:
-	.octa 0x00000000000000000000000000000005
-ddq_add_6:
-	.octa 0x00000000000000000000000000000006
-ddq_add_7:
-	.octa 0x00000000000000000000000000000007
-ddq_add_8:
-	.octa 0x00000000000000000000000000000008
-
-.text
-
-/* generate a unique variable for ddq_add_x */
-
-/* generate a unique variable for xmm register */
-.macro setxdata n
-	var_xdata = %xmm\n
-.endm
-
-/* club the numeric 'id' to the symbol 'name' */
-
-.macro club name, id
-.altmacro
-	.if \name == XDATA
-		setxdata %\id
-	.endif
-.noaltmacro
-.endm
-
-/*
- * do_aes num_in_par load_keys key_len
- * This increments p_in, but not p_out
- */
-.macro do_aes b, k, key_len, xctr
-	.set by, \b
-	.set load_keys, \k
-	.set klen, \key_len
-
-	.if (load_keys)
-		vmovdqa	0*16(p_keys), xkey0
-	.endif
-
-	.if \xctr
-		movq counter, xtmp
-		.set i, 0
-		.rept (by)
-			club XDATA, i
-			vpaddq	(ddq_add_1 + 16 * i)(%rip), xtmp, var_xdata
-			.set i, (i +1)
-		.endr
-		.set i, 0
-		.rept (by)
-			club	XDATA, i
-			vpxor	xiv, var_xdata, var_xdata
-			.set i, (i +1)
-		.endr
-	.else
-		vpshufb	xbyteswap, xcounter, xdata0
-		.set i, 1
-		.rept (by - 1)
-			club XDATA, i
-			vpaddq	(ddq_add_1 + 16 * (i - 1))(%rip), xcounter, var_xdata
-			vptest	ddq_low_msk(%rip), var_xdata
-			jnz 1f
-			vpaddq	ddq_high_add_1(%rip), var_xdata, var_xdata
-			vpaddq	ddq_high_add_1(%rip), xcounter, xcounter
-			1:
-			vpshufb	xbyteswap, var_xdata, var_xdata
-			.set i, (i +1)
-		.endr
-	.endif
-
-	vmovdqa	1*16(p_keys), xkeyA
-
-	vpxor	xkey0, xdata0, xdata0
-	.if \xctr
-		add $by, counter
-	.else
-		vpaddq	(ddq_add_1 + 16 * (by - 1))(%rip), xcounter, xcounter
-		vptest	ddq_low_msk(%rip), xcounter
-		jnz	1f
-		vpaddq	ddq_high_add_1(%rip), xcounter, xcounter
-		1:
-	.endif
-
-	.set i, 1
-	.rept (by - 1)
-		club XDATA, i
-		vpxor	xkey0, var_xdata, var_xdata
-		.set i, (i +1)
-	.endr
-
-	vmovdqa	2*16(p_keys), xkeyB
-
-	.set i, 0
-	.rept by
-		club XDATA, i
-		vaesenc	xkeyA, var_xdata, var_xdata		/* key 1 */
-		.set i, (i +1)
-	.endr
-
-	.if (klen == KEY_128)
-		.if (load_keys)
-			vmovdqa	3*16(p_keys), xkey4
-		.endif
-	.else
-		vmovdqa	3*16(p_keys), xkeyA
-	.endif
-
-	.set i, 0
-	.rept by
-		club XDATA, i
-		vaesenc	xkeyB, var_xdata, var_xdata		/* key 2 */
-		.set i, (i +1)
-	.endr
-
-	add	$(16*by), p_in
-
-	.if (klen == KEY_128)
-		vmovdqa	4*16(p_keys), xkeyB
-	.else
-		.if (load_keys)
-			vmovdqa	4*16(p_keys), xkey4
-		.endif
-	.endif
-
-	.set i, 0
-	.rept by
-		club XDATA, i
-		/* key 3 */
-		.if (klen == KEY_128)
-			vaesenc	xkey4, var_xdata, var_xdata
-		.else
-			vaesenc	xkeyA, var_xdata, var_xdata
-		.endif
-		.set i, (i +1)
-	.endr
-
-	vmovdqa	5*16(p_keys), xkeyA
-
-	.set i, 0
-	.rept by
-		club XDATA, i
-		/* key 4 */
-		.if (klen == KEY_128)
-			vaesenc	xkeyB, var_xdata, var_xdata
-		.else
-			vaesenc	xkey4, var_xdata, var_xdata
-		.endif
-		.set i, (i +1)
-	.endr
-
-	.if (klen == KEY_128)
-		.if (load_keys)
-			vmovdqa	6*16(p_keys), xkey8
-		.endif
-	.else
-		vmovdqa	6*16(p_keys), xkeyB
-	.endif
-
-	.set i, 0
-	.rept by
-		club XDATA, i
-		vaesenc	xkeyA, var_xdata, var_xdata		/* key 5 */
-		.set i, (i +1)
-	.endr
-
-	vmovdqa	7*16(p_keys), xkeyA
-
-	.set i, 0
-	.rept by
-		club XDATA, i
-		/* key 6 */
-		.if (klen == KEY_128)
-			vaesenc	xkey8, var_xdata, var_xdata
-		.else
-			vaesenc	xkeyB, var_xdata, var_xdata
-		.endif
-		.set i, (i +1)
-	.endr
-
-	.if (klen == KEY_128)
-		vmovdqa	8*16(p_keys), xkeyB
-	.else
-		.if (load_keys)
-			vmovdqa	8*16(p_keys), xkey8
-		.endif
-	.endif
-
-	.set i, 0
-	.rept by
-		club XDATA, i
-		vaesenc	xkeyA, var_xdata, var_xdata		/* key 7 */
-		.set i, (i +1)
-	.endr
-
-	.if (klen == KEY_128)
-		.if (load_keys)
-			vmovdqa	9*16(p_keys), xkey12
-		.endif
-	.else
-		vmovdqa	9*16(p_keys), xkeyA
-	.endif
-
-	.set i, 0
-	.rept by
-		club XDATA, i
-		/* key 8 */
-		.if (klen == KEY_128)
-			vaesenc	xkeyB, var_xdata, var_xdata
-		.else
-			vaesenc	xkey8, var_xdata, var_xdata
-		.endif
-		.set i, (i +1)
-	.endr
-
-	vmovdqa	10*16(p_keys), xkeyB
-
-	.set i, 0
-	.rept by
-		club XDATA, i
-		/* key 9 */
-		.if (klen == KEY_128)
-			vaesenc	xkey12, var_xdata, var_xdata
-		.else
-			vaesenc	xkeyA, var_xdata, var_xdata
-		.endif
-		.set i, (i +1)
-	.endr
-
-	.if (klen != KEY_128)
-		vmovdqa	11*16(p_keys), xkeyA
-	.endif
-
-	.set i, 0
-	.rept by
-		club XDATA, i
-		/* key 10 */
-		.if (klen == KEY_128)
-			vaesenclast	xkeyB, var_xdata, var_xdata
-		.else
-			vaesenc	xkeyB, var_xdata, var_xdata
-		.endif
-		.set i, (i +1)
-	.endr
-
-	.if (klen != KEY_128)
-		.if (load_keys)
-			vmovdqa	12*16(p_keys), xkey12
-		.endif
-
-		.set i, 0
-		.rept by
-			club XDATA, i
-			vaesenc	xkeyA, var_xdata, var_xdata	/* key 11 */
-			.set i, (i +1)
-		.endr
-
-		.if (klen == KEY_256)
-			vmovdqa	13*16(p_keys), xkeyA
-		.endif
-
-		.set i, 0
-		.rept by
-			club XDATA, i
-			.if (klen == KEY_256)
-				/* key 12 */
-				vaesenc	xkey12, var_xdata, var_xdata
-			.else
-				vaesenclast xkey12, var_xdata, var_xdata
-			.endif
-			.set i, (i +1)
-		.endr
-
-		.if (klen == KEY_256)
-			vmovdqa	14*16(p_keys), xkeyB
-
-			.set i, 0
-			.rept by
-				club XDATA, i
-				/* key 13 */
-				vaesenc	xkeyA, var_xdata, var_xdata
-				.set i, (i +1)
-			.endr
-
-			.set i, 0
-			.rept by
-				club XDATA, i
-				/* key 14 */
-				vaesenclast	xkeyB, var_xdata, var_xdata
-				.set i, (i +1)
-			.endr
-		.endif
-	.endif
-
-	.set i, 0
-	.rept (by / 2)
-		.set j, (i+1)
-		VMOVDQ	(i*16 - 16*by)(p_in), xkeyA
-		VMOVDQ	(j*16 - 16*by)(p_in), xkeyB
-		club XDATA, i
-		vpxor	xkeyA, var_xdata, var_xdata
-		club XDATA, j
-		vpxor	xkeyB, var_xdata, var_xdata
-		.set i, (i+2)
-	.endr
-
-	.if (i < by)
-		VMOVDQ	(i*16 - 16*by)(p_in), xkeyA
-		club XDATA, i
-		vpxor	xkeyA, var_xdata, var_xdata
-	.endif
-
-	.set i, 0
-	.rept by
-		club XDATA, i
-		VMOVDQ	var_xdata, i*16(p_out)
-		.set i, (i+1)
-	.endr
-.endm
-
-.macro do_aes_load val, key_len, xctr
-	do_aes \val, 1, \key_len, \xctr
-.endm
-
-.macro do_aes_noload val, key_len, xctr
-	do_aes \val, 0, \key_len, \xctr
-.endm
-
-/* main body of aes ctr load */
-
-.macro do_aes_ctrmain key_len, xctr
-	cmp	$16, num_bytes
-	jb	.Ldo_return2\xctr\key_len
-
-	.if \xctr
-		shr	$4, counter
-		vmovdqu	(p_iv), xiv
-	.else
-		vmovdqa	byteswap_const(%rip), xbyteswap
-		vmovdqu	(p_iv), xcounter
-		vpshufb	xbyteswap, xcounter, xcounter
-	.endif
-
-	mov	num_bytes, tmp
-	and	$(7*16), tmp
-	jz	.Lmult_of_8_blks\xctr\key_len
-
-	/* 1 <= tmp <= 7 */
-	cmp	$(4*16), tmp
-	jg	.Lgt4\xctr\key_len
-	je	.Leq4\xctr\key_len
-
-.Llt4\xctr\key_len:
-	cmp	$(2*16), tmp
-	jg	.Leq3\xctr\key_len
-	je	.Leq2\xctr\key_len
-
-.Leq1\xctr\key_len:
-	do_aes_load	1, \key_len, \xctr
-	add	$(1*16), p_out
-	and	$(~7*16), num_bytes
-	jz	.Ldo_return2\xctr\key_len
-	jmp	.Lmain_loop2\xctr\key_len
-
-.Leq2\xctr\key_len:
-	do_aes_load	2, \key_len, \xctr
-	add	$(2*16), p_out
-	and	$(~7*16), num_bytes
-	jz	.Ldo_return2\xctr\key_len
-	jmp	.Lmain_loop2\xctr\key_len
-
-
-.Leq3\xctr\key_len:
-	do_aes_load	3, \key_len, \xctr
-	add	$(3*16), p_out
-	and	$(~7*16), num_bytes
-	jz	.Ldo_return2\xctr\key_len
-	jmp	.Lmain_loop2\xctr\key_len
-
-.Leq4\xctr\key_len:
-	do_aes_load	4, \key_len, \xctr
-	add	$(4*16), p_out
-	and	$(~7*16), num_bytes
-	jz	.Ldo_return2\xctr\key_len
-	jmp	.Lmain_loop2\xctr\key_len
-
-.Lgt4\xctr\key_len:
-	cmp	$(6*16), tmp
-	jg	.Leq7\xctr\key_len
-	je	.Leq6\xctr\key_len
-
-.Leq5\xctr\key_len:
-	do_aes_load	5, \key_len, \xctr
-	add	$(5*16), p_out
-	and	$(~7*16), num_bytes
-	jz	.Ldo_return2\xctr\key_len
-	jmp	.Lmain_loop2\xctr\key_len
-
-.Leq6\xctr\key_len:
-	do_aes_load	6, \key_len, \xctr
-	add	$(6*16), p_out
-	and	$(~7*16), num_bytes
-	jz	.Ldo_return2\xctr\key_len
-	jmp	.Lmain_loop2\xctr\key_len
-
-.Leq7\xctr\key_len:
-	do_aes_load	7, \key_len, \xctr
-	add	$(7*16), p_out
-	and	$(~7*16), num_bytes
-	jz	.Ldo_return2\xctr\key_len
-	jmp	.Lmain_loop2\xctr\key_len
-
-.Lmult_of_8_blks\xctr\key_len:
-	.if (\key_len != KEY_128)
-		vmovdqa	0*16(p_keys), xkey0
-		vmovdqa	4*16(p_keys), xkey4
-		vmovdqa	8*16(p_keys), xkey8
-		vmovdqa	12*16(p_keys), xkey12
-	.else
-		vmovdqa	0*16(p_keys), xkey0
-		vmovdqa	3*16(p_keys), xkey4
-		vmovdqa	6*16(p_keys), xkey8
-		vmovdqa	9*16(p_keys), xkey12
-	.endif
-.align 16
-.Lmain_loop2\xctr\key_len:
-	/* num_bytes is a multiple of 8 and >0 */
-	do_aes_noload	8, \key_len, \xctr
-	add	$(8*16), p_out
-	sub	$(8*16), num_bytes
-	jne	.Lmain_loop2\xctr\key_len
-
-.Ldo_return2\xctr\key_len:
-	.if !\xctr
-		/* return updated IV */
-		vpshufb	xbyteswap, xcounter, xcounter
-		vmovdqu	xcounter, (p_iv)
-	.endif
-	RET
-.endm
-
-/*
- * routine to do AES128 CTR enc/decrypt "by8"
- * XMM registers are clobbered.
- * Saving/restoring must be done at a higher level
- * aes_ctr_enc_128_avx_by8(void *in, void *iv, void *keys, void *out,
- *			unsigned int num_bytes)
- */
-SYM_FUNC_START(aes_ctr_enc_128_avx_by8)
-	/* call the aes main loop */
-	do_aes_ctrmain KEY_128 0
-
-SYM_FUNC_END(aes_ctr_enc_128_avx_by8)
-
-/*
- * routine to do AES192 CTR enc/decrypt "by8"
- * XMM registers are clobbered.
- * Saving/restoring must be done at a higher level
- * aes_ctr_enc_192_avx_by8(void *in, void *iv, void *keys, void *out,
- *			unsigned int num_bytes)
- */
-SYM_FUNC_START(aes_ctr_enc_192_avx_by8)
-	/* call the aes main loop */
-	do_aes_ctrmain KEY_192 0
-
-SYM_FUNC_END(aes_ctr_enc_192_avx_by8)
-
-/*
- * routine to do AES256 CTR enc/decrypt "by8"
- * XMM registers are clobbered.
- * Saving/restoring must be done at a higher level
- * aes_ctr_enc_256_avx_by8(void *in, void *iv, void *keys, void *out,
- *			unsigned int num_bytes)
- */
-SYM_FUNC_START(aes_ctr_enc_256_avx_by8)
-	/* call the aes main loop */
-	do_aes_ctrmain KEY_256 0
-
-SYM_FUNC_END(aes_ctr_enc_256_avx_by8)
-
-/*
- * routine to do AES128 XCTR enc/decrypt "by8"
- * XMM registers are clobbered.
- * Saving/restoring must be done at a higher level
- * aes_xctr_enc_128_avx_by8(const u8 *in, const u8 *iv, const void *keys,
- * 	u8* out, unsigned int num_bytes, unsigned int byte_ctr)
- */
-SYM_FUNC_START(aes_xctr_enc_128_avx_by8)
-	/* call the aes main loop */
-	do_aes_ctrmain KEY_128 1
-
-SYM_FUNC_END(aes_xctr_enc_128_avx_by8)
-
-/*
- * routine to do AES192 XCTR enc/decrypt "by8"
- * XMM registers are clobbered.
- * Saving/restoring must be done at a higher level
- * aes_xctr_enc_192_avx_by8(const u8 *in, const u8 *iv, const void *keys,
- * 	u8* out, unsigned int num_bytes, unsigned int byte_ctr)
- */
-SYM_FUNC_START(aes_xctr_enc_192_avx_by8)
-	/* call the aes main loop */
-	do_aes_ctrmain KEY_192 1
-
-SYM_FUNC_END(aes_xctr_enc_192_avx_by8)
-
-/*
- * routine to do AES256 XCTR enc/decrypt "by8"
- * XMM registers are clobbered.
- * Saving/restoring must be done at a higher level
- * aes_xctr_enc_256_avx_by8(const u8 *in, const u8 *iv, const void *keys,
- * 	u8* out, unsigned int num_bytes, unsigned int byte_ctr)
- */
-SYM_FUNC_START(aes_xctr_enc_256_avx_by8)
-	/* call the aes main loop */
-	do_aes_ctrmain KEY_256 1
-
-SYM_FUNC_END(aes_xctr_enc_256_avx_by8)
diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index 11e95fc62636e..150ee2f690151 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -21,11 +21,10 @@ 
 #include <linux/types.h>
 #include <linux/module.h>
 #include <linux/err.h>
 #include <crypto/algapi.h>
 #include <crypto/aes.h>
-#include <crypto/ctr.h>
 #include <crypto/b128ops.h>
 #include <crypto/gcm.h>
 #include <crypto/xts.h>
 #include <asm/cpu_device_id.h>
 #include <asm/simd.h>
@@ -79,37 +78,10 @@  asmlinkage void aesni_xts_enc(const struct crypto_aes_ctx *ctx, u8 *out,
 			      const u8 *in, unsigned int len, u8 *iv);
 
 asmlinkage void aesni_xts_dec(const struct crypto_aes_ctx *ctx, u8 *out,
 			      const u8 *in, unsigned int len, u8 *iv);
 
-#ifdef CONFIG_X86_64
-
-asmlinkage void aesni_ctr_enc(struct crypto_aes_ctx *ctx, u8 *out,
-			      const u8 *in, unsigned int len, u8 *iv);
-DEFINE_STATIC_CALL(aesni_ctr_enc_tfm, aesni_ctr_enc);
-
-asmlinkage void aes_ctr_enc_128_avx_by8(const u8 *in, u8 *iv,
-		void *keys, u8 *out, unsigned int num_bytes);
-asmlinkage void aes_ctr_enc_192_avx_by8(const u8 *in, u8 *iv,
-		void *keys, u8 *out, unsigned int num_bytes);
-asmlinkage void aes_ctr_enc_256_avx_by8(const u8 *in, u8 *iv,
-		void *keys, u8 *out, unsigned int num_bytes);
-
-
-asmlinkage void aes_xctr_enc_128_avx_by8(const u8 *in, const u8 *iv,
-	const void *keys, u8 *out, unsigned int num_bytes,
-	unsigned int byte_ctr);
-
-asmlinkage void aes_xctr_enc_192_avx_by8(const u8 *in, const u8 *iv,
-	const void *keys, u8 *out, unsigned int num_bytes,
-	unsigned int byte_ctr);
-
-asmlinkage void aes_xctr_enc_256_avx_by8(const u8 *in, const u8 *iv,
-	const void *keys, u8 *out, unsigned int num_bytes,
-	unsigned int byte_ctr);
-#endif
-
 static inline struct crypto_aes_ctx *aes_ctx(void *raw_ctx)
 {
 	return aes_align_addr(raw_ctx);
 }
 
@@ -373,116 +345,10 @@  static int cts_cbc_decrypt(struct skcipher_request *req)
 	kernel_fpu_end();
 
 	return skcipher_walk_done(&walk, 0);
 }
 
-#ifdef CONFIG_X86_64
-static void aesni_ctr_enc_avx_tfm(struct crypto_aes_ctx *ctx, u8 *out,
-			      const u8 *in, unsigned int len, u8 *iv)
-{
-	/*
-	 * based on key length, override with the by8 version
-	 * of ctr mode encryption/decryption for improved performance
-	 * aes_set_key_common() ensures that key length is one of
-	 * {128,192,256}
-	 */
-	if (ctx->key_length == AES_KEYSIZE_128)
-		aes_ctr_enc_128_avx_by8(in, iv, (void *)ctx, out, len);
-	else if (ctx->key_length == AES_KEYSIZE_192)
-		aes_ctr_enc_192_avx_by8(in, iv, (void *)ctx, out, len);
-	else
-		aes_ctr_enc_256_avx_by8(in, iv, (void *)ctx, out, len);
-}
-
-static int ctr_crypt(struct skcipher_request *req)
-{
-	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
-	struct crypto_aes_ctx *ctx = aes_ctx(crypto_skcipher_ctx(tfm));
-	u8 keystream[AES_BLOCK_SIZE];
-	struct skcipher_walk walk;
-	unsigned int nbytes;
-	int err;
-
-	err = skcipher_walk_virt(&walk, req, false);
-
-	while ((nbytes = walk.nbytes) > 0) {
-		kernel_fpu_begin();
-		if (nbytes & AES_BLOCK_MASK)
-			static_call(aesni_ctr_enc_tfm)(ctx, walk.dst.virt.addr,
-						       walk.src.virt.addr,
-						       nbytes & AES_BLOCK_MASK,
-						       walk.iv);
-		nbytes &= ~AES_BLOCK_MASK;
-
-		if (walk.nbytes == walk.total && nbytes > 0) {
-			aesni_enc(ctx, keystream, walk.iv);
-			crypto_xor_cpy(walk.dst.virt.addr + walk.nbytes - nbytes,
-				       walk.src.virt.addr + walk.nbytes - nbytes,
-				       keystream, nbytes);
-			crypto_inc(walk.iv, AES_BLOCK_SIZE);
-			nbytes = 0;
-		}
-		kernel_fpu_end();
-		err = skcipher_walk_done(&walk, nbytes);
-	}
-	return err;
-}
-
-static void aesni_xctr_enc_avx_tfm(struct crypto_aes_ctx *ctx, u8 *out,
-				   const u8 *in, unsigned int len, u8 *iv,
-				   unsigned int byte_ctr)
-{
-	if (ctx->key_length == AES_KEYSIZE_128)
-		aes_xctr_enc_128_avx_by8(in, iv, (void *)ctx, out, len,
-					 byte_ctr);
-	else if (ctx->key_length == AES_KEYSIZE_192)
-		aes_xctr_enc_192_avx_by8(in, iv, (void *)ctx, out, len,
-					 byte_ctr);
-	else
-		aes_xctr_enc_256_avx_by8(in, iv, (void *)ctx, out, len,
-					 byte_ctr);
-}
-
-static int xctr_crypt(struct skcipher_request *req)
-{
-	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
-	struct crypto_aes_ctx *ctx = aes_ctx(crypto_skcipher_ctx(tfm));
-	u8 keystream[AES_BLOCK_SIZE];
-	struct skcipher_walk walk;
-	unsigned int nbytes;
-	unsigned int byte_ctr = 0;
-	int err;
-	__le32 block[AES_BLOCK_SIZE / sizeof(__le32)];
-
-	err = skcipher_walk_virt(&walk, req, false);
-
-	while ((nbytes = walk.nbytes) > 0) {
-		kernel_fpu_begin();
-		if (nbytes & AES_BLOCK_MASK)
-			aesni_xctr_enc_avx_tfm(ctx, walk.dst.virt.addr,
-				walk.src.virt.addr, nbytes & AES_BLOCK_MASK,
-				walk.iv, byte_ctr);
-		nbytes &= ~AES_BLOCK_MASK;
-		byte_ctr += walk.nbytes - nbytes;
-
-		if (walk.nbytes == walk.total && nbytes > 0) {
-			memcpy(block, walk.iv, AES_BLOCK_SIZE);
-			block[0] ^= cpu_to_le32(1 + byte_ctr / AES_BLOCK_SIZE);
-			aesni_enc(ctx, keystream, (u8 *)block);
-			crypto_xor_cpy(walk.dst.virt.addr + walk.nbytes -
-				       nbytes, walk.src.virt.addr + walk.nbytes
-				       - nbytes, keystream, nbytes);
-			byte_ctr += nbytes;
-			nbytes = 0;
-		}
-		kernel_fpu_end();
-		err = skcipher_walk_done(&walk, nbytes);
-	}
-	return err;
-}
-#endif
-
 static int xts_setkey_aesni(struct crypto_skcipher *tfm, const u8 *key,
 			    unsigned int keylen)
 {
 	struct aesni_xts_ctx *ctx = aes_xts_ctx(tfm);
 	int err;
@@ -713,29 +579,10 @@  static struct skcipher_alg aesni_skciphers[] = {
 		.ivsize		= AES_BLOCK_SIZE,
 		.walksize	= 2 * AES_BLOCK_SIZE,
 		.setkey		= aesni_skcipher_setkey,
 		.encrypt	= cts_cbc_encrypt,
 		.decrypt	= cts_cbc_decrypt,
-#ifdef CONFIG_X86_64
-	}, {
-		.base = {
-			.cra_name		= "__ctr(aes)",
-			.cra_driver_name	= "__ctr-aes-aesni",
-			.cra_priority		= 400,
-			.cra_flags		= CRYPTO_ALG_INTERNAL,
-			.cra_blocksize		= 1,
-			.cra_ctxsize		= CRYPTO_AES_CTX_SIZE,
-			.cra_module		= THIS_MODULE,
-		},
-		.min_keysize	= AES_MIN_KEY_SIZE,
-		.max_keysize	= AES_MAX_KEY_SIZE,
-		.ivsize		= AES_BLOCK_SIZE,
-		.chunksize	= AES_BLOCK_SIZE,
-		.setkey		= aesni_skcipher_setkey,
-		.encrypt	= ctr_crypt,
-		.decrypt	= ctr_crypt,
-#endif
 	}, {
 		.base = {
 			.cra_name		= "__xts(aes)",
 			.cra_driver_name	= "__xts-aes-aesni",
 			.cra_priority		= 401,
@@ -756,39 +603,109 @@  static struct skcipher_alg aesni_skciphers[] = {
 
 static
 struct simd_skcipher_alg *aesni_simd_skciphers[ARRAY_SIZE(aesni_skciphers)];
 
 #ifdef CONFIG_X86_64
-/*
- * XCTR does not have a non-AVX implementation, so it must be enabled
- * conditionally.
- */
-static struct skcipher_alg aesni_xctr = {
-	.base = {
-		.cra_name		= "__xctr(aes)",
-		.cra_driver_name	= "__xctr-aes-aesni",
-		.cra_priority		= 400,
-		.cra_flags		= CRYPTO_ALG_INTERNAL,
-		.cra_blocksize		= 1,
-		.cra_ctxsize		= CRYPTO_AES_CTX_SIZE,
-		.cra_module		= THIS_MODULE,
-	},
-	.min_keysize	= AES_MIN_KEY_SIZE,
-	.max_keysize	= AES_MAX_KEY_SIZE,
-	.ivsize		= AES_BLOCK_SIZE,
-	.chunksize	= AES_BLOCK_SIZE,
-	.setkey		= aesni_skcipher_setkey,
-	.encrypt	= xctr_crypt,
-	.decrypt	= xctr_crypt,
-};
-
-static struct simd_skcipher_alg *aesni_simd_xctr;
-
 asmlinkage void aes_xts_encrypt_iv(const struct crypto_aes_ctx *tweak_key,
 				   u8 iv[AES_BLOCK_SIZE]);
 
-#define DEFINE_XTS_ALG(suffix, driver_name, priority)			       \
+/* __always_inline to avoid indirect call */
+static __always_inline int
+ctr_crypt(struct skcipher_request *req,
+	  void (*ctr64_func)(const struct crypto_aes_ctx *key,
+			     const u8 *src, u8 *dst, int len,
+			     const u64 le_ctr[2]))
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	const struct crypto_aes_ctx *key = aes_ctx(crypto_skcipher_ctx(tfm));
+	unsigned int nbytes, p1_nbytes, nblocks;
+	struct skcipher_walk walk;
+	u64 le_ctr[2];
+	u64 ctr64;
+	int err;
+
+	ctr64 = le_ctr[0] = get_unaligned_be64(&req->iv[8]);
+	le_ctr[1] = get_unaligned_be64(&req->iv[0]);
+
+	err = skcipher_walk_virt(&walk, req, false);
+
+	while ((nbytes = walk.nbytes) != 0) {
+		if (nbytes < walk.total) {
+			/* Not the end yet, so keep the length block-aligned. */
+			nbytes = round_down(nbytes, AES_BLOCK_SIZE);
+			nblocks = nbytes / AES_BLOCK_SIZE;
+		} else {
+			/* It's the end, so include any final partial block. */
+			nblocks = DIV_ROUND_UP(nbytes, AES_BLOCK_SIZE);
+		}
+		ctr64 += nblocks;
+
+		kernel_fpu_begin();
+		if (likely(ctr64 >= nblocks)) {
+			/* The low 64 bits of the counter won't overflow. */
+			(*ctr64_func)(key, walk.src.virt.addr,
+				      walk.dst.virt.addr, nbytes, le_ctr);
+		} else {
+			/*
+			 * The low 64 bits of the counter will overflow.  The
+			 * assembly doesn't handle this case, so split the
+			 * operation into two at the point where the overflow
+			 * will occur.  After the first part, add the carry bit.
+			 */
+			p1_nbytes = min_t(unsigned int, nbytes,
+					  (nblocks - ctr64) * AES_BLOCK_SIZE);
+			(*ctr64_func)(key, walk.src.virt.addr,
+				      walk.dst.virt.addr, p1_nbytes, le_ctr);
+			le_ctr[0] = 0;
+			le_ctr[1]++;
+			(*ctr64_func)(key, walk.src.virt.addr + p1_nbytes,
+				      walk.dst.virt.addr + p1_nbytes,
+				      nbytes - p1_nbytes, le_ctr);
+		}
+		kernel_fpu_end();
+		le_ctr[0] = ctr64;
+
+		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
+	}
+
+	put_unaligned_be64(ctr64, &req->iv[8]);
+	put_unaligned_be64(le_ctr[1], &req->iv[0]);
+
+	return err;
+}
+
+/* __always_inline to avoid indirect call */
+static __always_inline int
+xctr_crypt(struct skcipher_request *req,
+	   void (*xctr_func)(const struct crypto_aes_ctx *key,
+			     const u8 *src, u8 *dst, int len,
+			     const u8 iv[AES_BLOCK_SIZE], u64 ctr))
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	const struct crypto_aes_ctx *key = aes_ctx(crypto_skcipher_ctx(tfm));
+	struct skcipher_walk walk;
+	unsigned int nbytes;
+	u64 ctr = 1;
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, false);
+	while ((nbytes = walk.nbytes) != 0) {
+		if (nbytes < walk.total)
+			nbytes = round_down(nbytes, AES_BLOCK_SIZE);
+
+		kernel_fpu_begin();
+		(*xctr_func)(key, walk.src.virt.addr, walk.dst.virt.addr,
+			     nbytes, req->iv, ctr);
+		kernel_fpu_end();
+
+		ctr += DIV_ROUND_UP(nbytes, AES_BLOCK_SIZE);
+		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
+	}
+	return err;
+}
+
+#define DEFINE_AVX_SKCIPHER_ALGS(suffix, driver_name_suffix, priority)	       \
 									       \
 asmlinkage void								       \
 aes_xts_encrypt_##suffix(const struct crypto_aes_ctx *key, const u8 *src,      \
 			 u8 *dst, int len, u8 tweak[AES_BLOCK_SIZE]);	       \
 asmlinkage void								       \
@@ -803,36 +720,84 @@  static int xts_encrypt_##suffix(struct skcipher_request *req)		       \
 static int xts_decrypt_##suffix(struct skcipher_request *req)		       \
 {									       \
 	return xts_crypt(req, aes_xts_encrypt_iv, aes_xts_decrypt_##suffix);   \
 }									       \
 									       \
-static struct skcipher_alg aes_xts_alg_##suffix = {			       \
-	.base = {							       \
-		.cra_name		= "__xts(aes)",			       \
-		.cra_driver_name	= "__" driver_name,		       \
-		.cra_priority		= priority,			       \
-		.cra_flags		= CRYPTO_ALG_INTERNAL,		       \
-		.cra_blocksize		= AES_BLOCK_SIZE,		       \
-		.cra_ctxsize		= XTS_AES_CTX_SIZE,		       \
-		.cra_module		= THIS_MODULE,			       \
-	},								       \
-	.min_keysize	= 2 * AES_MIN_KEY_SIZE,				       \
-	.max_keysize	= 2 * AES_MAX_KEY_SIZE,				       \
-	.ivsize		= AES_BLOCK_SIZE,				       \
-	.walksize	= 2 * AES_BLOCK_SIZE,				       \
-	.setkey		= xts_setkey_aesni,				       \
-	.encrypt	= xts_encrypt_##suffix,				       \
-	.decrypt	= xts_decrypt_##suffix,				       \
-};									       \
+asmlinkage void								       \
+aes_ctr64_crypt_##suffix(const struct crypto_aes_ctx *key,		       \
+			 const u8 *src, u8 *dst, int len, const u64 le_ctr[2]);\
+									       \
+static int ctr_crypt_##suffix(struct skcipher_request *req)		       \
+{									       \
+	return ctr_crypt(req, aes_ctr64_crypt_##suffix);		       \
+}									       \
+									       \
+asmlinkage void								       \
+aes_xctr_crypt_##suffix(const struct crypto_aes_ctx *key,		       \
+			const u8 *src, u8 *dst, int len,		       \
+			const u8 iv[AES_BLOCK_SIZE], u64 ctr);		       \
 									       \
-static struct simd_skcipher_alg *aes_xts_simdalg_##suffix
+static int xctr_crypt_##suffix(struct skcipher_request *req)		       \
+{									       \
+	return xctr_crypt(req, aes_xctr_crypt_##suffix);		       \
+}									       \
+									       \
+static struct skcipher_alg skcipher_algs_##suffix[] = {{		       \
+	.base.cra_name		= "__xts(aes)",				       \
+	.base.cra_driver_name	= "__xts-aes-" driver_name_suffix,	       \
+	.base.cra_priority	= priority,				       \
+	.base.cra_flags		= CRYPTO_ALG_INTERNAL,			       \
+	.base.cra_blocksize	= AES_BLOCK_SIZE,			       \
+	.base.cra_ctxsize	= XTS_AES_CTX_SIZE,			       \
+	.base.cra_module	= THIS_MODULE,				       \
+	.min_keysize		= 2 * AES_MIN_KEY_SIZE,			       \
+	.max_keysize		= 2 * AES_MAX_KEY_SIZE,			       \
+	.ivsize			= AES_BLOCK_SIZE,			       \
+	.walksize		= 2 * AES_BLOCK_SIZE,			       \
+	.setkey			= xts_setkey_aesni,			       \
+	.encrypt		= xts_encrypt_##suffix,			       \
+	.decrypt		= xts_decrypt_##suffix,			       \
+}, {									       \
+	.base.cra_name		= "__ctr(aes)",				       \
+	.base.cra_driver_name	= "__ctr-aes-" driver_name_suffix,	       \
+	.base.cra_priority	= priority,				       \
+	.base.cra_flags		= CRYPTO_ALG_INTERNAL,			       \
+	.base.cra_blocksize	= 1,					       \
+	.base.cra_ctxsize	= CRYPTO_AES_CTX_SIZE,			       \
+	.base.cra_module	= THIS_MODULE,				       \
+	.min_keysize		= AES_MIN_KEY_SIZE,			       \
+	.max_keysize		= AES_MAX_KEY_SIZE,			       \
+	.ivsize			= AES_BLOCK_SIZE,			       \
+	.chunksize		= AES_BLOCK_SIZE,			       \
+	.setkey			= aesni_skcipher_setkey,		       \
+	.encrypt		= ctr_crypt_##suffix,			       \
+	.decrypt		= ctr_crypt_##suffix,			       \
+}, {									       \
+	.base.cra_name		= "__xctr(aes)",			       \
+	.base.cra_driver_name	= "__xctr-aes-" driver_name_suffix,	       \
+	.base.cra_priority	= priority,				       \
+	.base.cra_flags		= CRYPTO_ALG_INTERNAL,			       \
+	.base.cra_blocksize	= 1,					       \
+	.base.cra_ctxsize	= CRYPTO_AES_CTX_SIZE,			       \
+	.base.cra_module	= THIS_MODULE,				       \
+	.min_keysize		= AES_MIN_KEY_SIZE,			       \
+	.max_keysize		= AES_MAX_KEY_SIZE,			       \
+	.ivsize			= AES_BLOCK_SIZE,			       \
+	.chunksize		= AES_BLOCK_SIZE,			       \
+	.setkey			= aesni_skcipher_setkey,		       \
+	.encrypt		= xctr_crypt_##suffix,			       \
+	.decrypt		= xctr_crypt_##suffix,			       \
+}};									       \
+									       \
+static struct simd_skcipher_alg *					       \
+simd_skcipher_algs_##suffix[ARRAY_SIZE(skcipher_algs_##suffix)]
 
-DEFINE_XTS_ALG(aesni_avx, "xts-aes-aesni-avx", 500);
+DEFINE_AVX_SKCIPHER_ALGS(aesni_avx, "aesni-avx", 500);
 #if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ)
-DEFINE_XTS_ALG(vaes_avx2, "xts-aes-vaes-avx2", 600);
-DEFINE_XTS_ALG(vaes_avx10_256, "xts-aes-vaes-avx10_256", 700);
-DEFINE_XTS_ALG(vaes_avx10_512, "xts-aes-vaes-avx10_512", 800);
+DEFINE_AVX_SKCIPHER_ALGS(vaes_avx2, "vaes-avx2", 600);
+DEFINE_AVX_SKCIPHER_ALGS(vaes_avx10_256, "vaes-avx10_256", 700);
+DEFINE_AVX_SKCIPHER_ALGS(vaes_avx10_512, "vaes-avx10_512", 800);
 #endif
 
 /* The common part of the x86_64 AES-GCM key struct */
 struct aes_gcm_key {
 	/* Expanded AES key and the AES key length in bytes */
@@ -1560,40 +1525,49 @@  static int __init register_avx_algs(void)
 {
 	int err;
 
 	if (!boot_cpu_has(X86_FEATURE_AVX))
 		return 0;
-	err = simd_register_skciphers_compat(&aes_xts_alg_aesni_avx, 1,
-					     &aes_xts_simdalg_aesni_avx);
+	err = simd_register_skciphers_compat(skcipher_algs_aesni_avx,
+					     ARRAY_SIZE(skcipher_algs_aesni_avx),
+					     simd_skcipher_algs_aesni_avx);
 	if (err)
 		return err;
 	err = simd_register_aeads_compat(aes_gcm_algs_aesni_avx,
 					 ARRAY_SIZE(aes_gcm_algs_aesni_avx),
 					 aes_gcm_simdalgs_aesni_avx);
 	if (err)
 		return err;
+	/*
+	 * Note: not all the algorithms registered below actually require
+	 * VPCLMULQDQ.  But in practice every CPU with VAES also has VPCLMULQDQ.
+	 * Similarly, the assembler support was added at about the same time.
+	 * For simplicity, just always check for VAES and VPCLMULQDQ together.
+	 */
 #if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ)
 	if (!boot_cpu_has(X86_FEATURE_AVX2) ||
 	    !boot_cpu_has(X86_FEATURE_VAES) ||
 	    !boot_cpu_has(X86_FEATURE_VPCLMULQDQ) ||
 	    !boot_cpu_has(X86_FEATURE_PCLMULQDQ) ||
 	    !cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL))
 		return 0;
-	err = simd_register_skciphers_compat(&aes_xts_alg_vaes_avx2, 1,
-					     &aes_xts_simdalg_vaes_avx2);
+	err = simd_register_skciphers_compat(skcipher_algs_vaes_avx2,
+					     ARRAY_SIZE(skcipher_algs_vaes_avx2),
+					     simd_skcipher_algs_vaes_avx2);
 	if (err)
 		return err;
 
 	if (!boot_cpu_has(X86_FEATURE_AVX512BW) ||
 	    !boot_cpu_has(X86_FEATURE_AVX512VL) ||
 	    !boot_cpu_has(X86_FEATURE_BMI2) ||
 	    !cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM |
 			       XFEATURE_MASK_AVX512, NULL))
 		return 0;
 
-	err = simd_register_skciphers_compat(&aes_xts_alg_vaes_avx10_256, 1,
-					     &aes_xts_simdalg_vaes_avx10_256);
+	err = simd_register_skciphers_compat(skcipher_algs_vaes_avx10_256,
+					     ARRAY_SIZE(skcipher_algs_vaes_avx10_256),
+					     simd_skcipher_algs_vaes_avx10_256);
 	if (err)
 		return err;
 	err = simd_register_aeads_compat(aes_gcm_algs_vaes_avx10_256,
 					 ARRAY_SIZE(aes_gcm_algs_vaes_avx10_256),
 					 aes_gcm_simdalgs_vaes_avx10_256);
@@ -1601,17 +1575,19 @@  static int __init register_avx_algs(void)
 		return err;
 
 	if (x86_match_cpu(zmm_exclusion_list)) {
 		int i;
 
-		aes_xts_alg_vaes_avx10_512.base.cra_priority = 1;
+		for (i = 0; i < ARRAY_SIZE(skcipher_algs_vaes_avx10_512); i++)
+			skcipher_algs_vaes_avx10_512[i].base.cra_priority = 1;
 		for (i = 0; i < ARRAY_SIZE(aes_gcm_algs_vaes_avx10_512); i++)
 			aes_gcm_algs_vaes_avx10_512[i].base.cra_priority = 1;
 	}
 
-	err = simd_register_skciphers_compat(&aes_xts_alg_vaes_avx10_512, 1,
-					     &aes_xts_simdalg_vaes_avx10_512);
+	err = simd_register_skciphers_compat(skcipher_algs_vaes_avx10_512,
+					     ARRAY_SIZE(skcipher_algs_vaes_avx10_512),
+					     simd_skcipher_algs_vaes_avx10_512);
 	if (err)
 		return err;
 	err = simd_register_aeads_compat(aes_gcm_algs_vaes_avx10_512,
 					 ARRAY_SIZE(aes_gcm_algs_vaes_avx10_512),
 					 aes_gcm_simdalgs_vaes_avx10_512);
@@ -1621,31 +1597,35 @@  static int __init register_avx_algs(void)
 	return 0;
 }
 
 static void unregister_avx_algs(void)
 {
-	if (aes_xts_simdalg_aesni_avx)
-		simd_unregister_skciphers(&aes_xts_alg_aesni_avx, 1,
-					  &aes_xts_simdalg_aesni_avx);
+	if (simd_skcipher_algs_aesni_avx[0])
+		simd_unregister_skciphers(skcipher_algs_aesni_avx,
+					  ARRAY_SIZE(skcipher_algs_aesni_avx),
+					  simd_skcipher_algs_aesni_avx);
 	if (aes_gcm_simdalgs_aesni_avx[0])
 		simd_unregister_aeads(aes_gcm_algs_aesni_avx,
 				      ARRAY_SIZE(aes_gcm_algs_aesni_avx),
 				      aes_gcm_simdalgs_aesni_avx);
 #if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ)
-	if (aes_xts_simdalg_vaes_avx2)
-		simd_unregister_skciphers(&aes_xts_alg_vaes_avx2, 1,
-					  &aes_xts_simdalg_vaes_avx2);
-	if (aes_xts_simdalg_vaes_avx10_256)
-		simd_unregister_skciphers(&aes_xts_alg_vaes_avx10_256, 1,
-					  &aes_xts_simdalg_vaes_avx10_256);
+	if (simd_skcipher_algs_vaes_avx2[0])
+		simd_unregister_skciphers(skcipher_algs_vaes_avx2,
+					  ARRAY_SIZE(skcipher_algs_vaes_avx2),
+					  simd_skcipher_algs_vaes_avx2);
+	if (simd_skcipher_algs_vaes_avx10_256[0])
+		simd_unregister_skciphers(skcipher_algs_vaes_avx10_256,
+					  ARRAY_SIZE(skcipher_algs_vaes_avx10_256),
+					  simd_skcipher_algs_vaes_avx10_256);
 	if (aes_gcm_simdalgs_vaes_avx10_256[0])
 		simd_unregister_aeads(aes_gcm_algs_vaes_avx10_256,
 				      ARRAY_SIZE(aes_gcm_algs_vaes_avx10_256),
 				      aes_gcm_simdalgs_vaes_avx10_256);
-	if (aes_xts_simdalg_vaes_avx10_512)
-		simd_unregister_skciphers(&aes_xts_alg_vaes_avx10_512, 1,
-					  &aes_xts_simdalg_vaes_avx10_512);
+	if (simd_skcipher_algs_vaes_avx10_512[0])
+		simd_unregister_skciphers(skcipher_algs_vaes_avx10_512,
+					  ARRAY_SIZE(skcipher_algs_vaes_avx10_512),
+					  simd_skcipher_algs_vaes_avx10_512);
 	if (aes_gcm_simdalgs_vaes_avx10_512[0])
 		simd_unregister_aeads(aes_gcm_algs_vaes_avx10_512,
 				      ARRAY_SIZE(aes_gcm_algs_vaes_avx10_512),
 				      aes_gcm_simdalgs_vaes_avx10_512);
 #endif
@@ -1674,17 +1654,10 @@  static int __init aesni_init(void)
 {
 	int err;
 
 	if (!x86_match_cpu(aesni_cpu_id))
 		return -ENODEV;
-#ifdef CONFIG_X86_64
-	if (boot_cpu_has(X86_FEATURE_AVX)) {
-		/* optimize performance of ctr mode encryption transform */
-		static_call_update(aesni_ctr_enc_tfm, aesni_ctr_enc_avx_tfm);
-		pr_info("AES CTR mode by8 optimization enabled\n");
-	}
-#endif /* CONFIG_X86_64 */
 
 	err = crypto_register_alg(&aesni_cipher_alg);
 	if (err)
 		return err;
 
@@ -1698,31 +1671,18 @@  static int __init aesni_init(void)
 					 ARRAY_SIZE(aes_gcm_algs_aesni),
 					 aes_gcm_simdalgs_aesni);
 	if (err)
 		goto unregister_skciphers;
 
-#ifdef CONFIG_X86_64
-	if (boot_cpu_has(X86_FEATURE_AVX))
-		err = simd_register_skciphers_compat(&aesni_xctr, 1,
-						     &aesni_simd_xctr);
-	if (err)
-		goto unregister_aeads;
-#endif /* CONFIG_X86_64 */
-
 	err = register_avx_algs();
 	if (err)
 		goto unregister_avx;
 
 	return 0;
 
 unregister_avx:
 	unregister_avx_algs();
-#ifdef CONFIG_X86_64
-	if (aesni_simd_xctr)
-		simd_unregister_skciphers(&aesni_xctr, 1, &aesni_simd_xctr);
-unregister_aeads:
-#endif /* CONFIG_X86_64 */
 	simd_unregister_aeads(aes_gcm_algs_aesni,
 			      ARRAY_SIZE(aes_gcm_algs_aesni),
 			      aes_gcm_simdalgs_aesni);
 unregister_skciphers:
 	simd_unregister_skciphers(aesni_skciphers, ARRAY_SIZE(aesni_skciphers),
@@ -1738,14 +1698,10 @@  static void __exit aesni_exit(void)
 			      ARRAY_SIZE(aes_gcm_algs_aesni),
 			      aes_gcm_simdalgs_aesni);
 	simd_unregister_skciphers(aesni_skciphers, ARRAY_SIZE(aesni_skciphers),
 				  aesni_simd_skciphers);
 	crypto_unregister_alg(&aesni_cipher_alg);
-#ifdef CONFIG_X86_64
-	if (boot_cpu_has(X86_FEATURE_AVX))
-		simd_unregister_skciphers(&aesni_xctr, 1, &aesni_simd_xctr);
-#endif /* CONFIG_X86_64 */
 	unregister_avx_algs();
 }
 
 module_init(aesni_init);
 module_exit(aesni_exit);