From patchwork Thu Nov 29 23:02:12 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Eric Biggers
X-Patchwork-Id: 10705489
X-Patchwork-Delegate: herbert@gondor.apana.org.au
From: Eric Biggers
To: linux-crypto@vger.kernel.org
Cc: Paul Crowley, Martin Willi, Milan Broz, "Jason A . Donenfeld", linux-kernel@vger.kernel.org
Subject: [PATCH v2 1/6] crypto: x86/nhpoly1305 - add SSE2 accelerated NHPoly1305
Date: Thu, 29 Nov 2018 15:02:12 -0800
Message-Id: <20181129230217.158038-2-ebiggers@kernel.org>
X-Mailer: git-send-email 2.20.0.rc0.387.gc7a69e6b6c-goog
In-Reply-To: <20181129230217.158038-1-ebiggers@kernel.org>
References: <20181129230217.158038-1-ebiggers@kernel.org>

From: Eric Biggers

Add a 64-bit SSE2 implementation of NHPoly1305, an ε-almost-∆-universal
hash function used in the Adiantum encryption mode. For now, only the
NH portion is actually SSE2-accelerated; the Poly1305 part is less
performance-critical so is just implemented in C.
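For readers who want to see what the NH portion computes, here is a minimal portable C sketch of NH as the SSE2 assembly in this patch evaluates it. There are four passes; each pass sums 32x32 => 64-bit products of (message word + key word) pairs, and the key window advances by four 32-bit words per pass and per 16-byte message stride. The names `nh_ref` and `load_le32`, and the host-endian `uint64_t` output, are assumptions made for illustration; this is not the kernel's generic implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read a little-endian 32-bit word from unaligned bytes. */
static uint32_t load_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Illustrative NH reference (the name is an assumption, not a kernel symbol).
 * 'key' must hold message_len / 4 + 12 words, and message_len % 16 == 0.
 */
static void nh_ref(const uint32_t *key, const uint8_t *message,
		   size_t message_len, uint64_t sums[4])
{
	size_t pass, i;

	assert(message_len % 16 == 0);
	for (pass = 0; pass < 4; pass++)
		sums[pass] = 0;

	while (message_len) {
		uint32_t m[4];

		for (i = 0; i < 4; i++)
			m[i] = load_le32(message + 4 * i);

		for (pass = 0; pass < 4; pass++) {
			/* Each pass sees the key shifted by four words. */
			const uint32_t *k = key + 4 * pass;

			/* Additions wrap mod 2^32; products are 64-bit. */
			sums[pass] += (uint64_t)(uint32_t)(m[0] + k[0]) *
				      (uint32_t)(m[2] + k[2]);
			sums[pass] += (uint64_t)(uint32_t)(m[1] + k[1]) *
				      (uint32_t)(m[3] + k[3]);
		}
		key += 4;
		message += 16;
		message_len -= 16;
	}
}
```

The pairing of words 0/2 and 1/3 follows the `pshufd $0x10` / `pshufd $0x32` shuffles before `pmuludq` in `_nh_stride`; the two partial sums per pass correspond to the A and B lanes that the assembly adds together at `.Ldone`.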

Signed-off-by: Eric Biggers
---
 arch/x86/crypto/Makefile               |   4 +
 arch/x86/crypto/nh-sse2-x86_64.S       | 123 +++++++++++++++++++++++++
 arch/x86/crypto/nhpoly1305-sse2-glue.c |  76 +++++++++++++++
 crypto/Kconfig                         |   8 ++
 4 files changed, 211 insertions(+)
 create mode 100644 arch/x86/crypto/nh-sse2-x86_64.S
 create mode 100644 arch/x86/crypto/nhpoly1305-sse2-glue.c

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index ce4e43642984..2a6acb4de373 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -47,6 +47,8 @@ obj-$(CONFIG_CRYPTO_MORUS1280_GLUE) += morus1280_glue.o
 obj-$(CONFIG_CRYPTO_MORUS640_SSE2) += morus640-sse2.o
 obj-$(CONFIG_CRYPTO_MORUS1280_SSE2) += morus1280-sse2.o
 
+obj-$(CONFIG_CRYPTO_NHPOLY1305_SSE2) += nhpoly1305-sse2.o
+
 # These modules require assembler to support AVX.
 ifeq ($(avx_supported),yes)
 	obj-$(CONFIG_CRYPTO_CAMELLIA_AESNI_AVX_X86_64) += \
@@ -85,6 +87,8 @@ aegis256-aesni-y := aegis256-aesni-asm.o aegis256-aesni-glue.o
 morus640-sse2-y := morus640-sse2-asm.o morus640-sse2-glue.o
 morus1280-sse2-y := morus1280-sse2-asm.o morus1280-sse2-glue.o
 
+nhpoly1305-sse2-y := nh-sse2-x86_64.o nhpoly1305-sse2-glue.o
+
 ifeq ($(avx_supported),yes)
 	camellia-aesni-avx-x86_64-y := camellia-aesni-avx-asm_64.o \
 					camellia_aesni_avx_glue.o
diff --git a/arch/x86/crypto/nh-sse2-x86_64.S b/arch/x86/crypto/nh-sse2-x86_64.S
new file mode 100644
index 000000000000..51f52d4ab4bb
--- /dev/null
+++ b/arch/x86/crypto/nh-sse2-x86_64.S
@@ -0,0 +1,123 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * NH - ε-almost-universal hash function, x86_64 SSE2 accelerated
+ *
+ * Copyright 2018 Google LLC
+ *
+ * Author: Eric Biggers
+ */
+
+#include
+
+#define PASS0_SUMS	%xmm0
+#define PASS1_SUMS	%xmm1
+#define PASS2_SUMS	%xmm2
+#define PASS3_SUMS	%xmm3
+#define K0		%xmm4
+#define K1		%xmm5
+#define K2		%xmm6
+#define K3		%xmm7
+#define T0		%xmm8
+#define T1		%xmm9
+#define T2		%xmm10
+#define T3		%xmm11
+#define T4		%xmm12
+#define T5		%xmm13
+#define T6		%xmm14
+#define T7		%xmm15
+#define KEY		%rdi
+#define MESSAGE		%rsi
+#define MESSAGE_LEN	%rdx
+#define HASH		%rcx
+
+.macro _nh_stride	k0, k1, k2, k3, offset
+
+	// Load next message stride
+	movdqu		\offset(MESSAGE), T1
+
+	// Load next key stride
+	movdqu		\offset(KEY), \k3
+
+	// Add message words to key words
+	movdqa		T1, T2
+	movdqa		T1, T3
+	paddd		T1, \k0    // reuse k0 to avoid a move
+	paddd		\k1, T1
+	paddd		\k2, T2
+	paddd		\k3, T3
+
+	// Multiply 32x32 => 64 and accumulate
+	pshufd		$0x10, \k0, T4
+	pshufd		$0x32, \k0, \k0
+	pshufd		$0x10, T1, T5
+	pshufd		$0x32, T1, T1
+	pshufd		$0x10, T2, T6
+	pshufd		$0x32, T2, T2
+	pshufd		$0x10, T3, T7
+	pshufd		$0x32, T3, T3
+	pmuludq		T4, \k0
+	pmuludq		T5, T1
+	pmuludq		T6, T2
+	pmuludq		T7, T3
+	paddq		\k0, PASS0_SUMS
+	paddq		T1, PASS1_SUMS
+	paddq		T2, PASS2_SUMS
+	paddq		T3, PASS3_SUMS
+.endm
+
+/*
+ * void nh_sse2(const u32 *key, const u8 *message, size_t message_len,
+ *		u8 hash[NH_HASH_BYTES])
+ *
+ * It's guaranteed that message_len % 16 == 0.
+ */
+ENTRY(nh_sse2)
+
+	movdqu		0x00(KEY), K0
+	movdqu		0x10(KEY), K1
+	movdqu		0x20(KEY), K2
+	add		$0x30, KEY
+	pxor		PASS0_SUMS, PASS0_SUMS
+	pxor		PASS1_SUMS, PASS1_SUMS
+	pxor		PASS2_SUMS, PASS2_SUMS
+	pxor		PASS3_SUMS, PASS3_SUMS
+
+	sub		$0x40, MESSAGE_LEN
+	jl		.Lloop4_done
+.Lloop4:
+	_nh_stride	K0, K1, K2, K3, 0x00
+	_nh_stride	K1, K2, K3, K0, 0x10
+	_nh_stride	K2, K3, K0, K1, 0x20
+	_nh_stride	K3, K0, K1, K2, 0x30
+	add		$0x40, KEY
+	add		$0x40, MESSAGE
+	sub		$0x40, MESSAGE_LEN
+	jge		.Lloop4
+
+.Lloop4_done:
+	and		$0x3f, MESSAGE_LEN
+	jz		.Ldone
+	_nh_stride	K0, K1, K2, K3, 0x00
+
+	sub		$0x10, MESSAGE_LEN
+	jz		.Ldone
+	_nh_stride	K1, K2, K3, K0, 0x10
+
+	sub		$0x10, MESSAGE_LEN
+	jz		.Ldone
+	_nh_stride	K2, K3, K0, K1, 0x20
+
+.Ldone:
+	// Sum the accumulators for each pass, then store the sums to 'hash'
+	movdqa		PASS0_SUMS, T0
+	movdqa		PASS2_SUMS, T1
+	punpcklqdq	PASS1_SUMS, T0		// => (PASS0_SUM_A PASS1_SUM_A)
+	punpcklqdq	PASS3_SUMS, T1		// => (PASS2_SUM_A PASS3_SUM_A)
+	punpckhqdq	PASS1_SUMS, PASS0_SUMS	// => (PASS0_SUM_B PASS1_SUM_B)
+	punpckhqdq	PASS3_SUMS, PASS2_SUMS	// => (PASS2_SUM_B PASS3_SUM_B)
+	paddq		PASS0_SUMS, T0
+	paddq		PASS2_SUMS, T1
+	movdqu		T0, 0x00(HASH)
+	movdqu		T1, 0x10(HASH)
+	ret
+ENDPROC(nh_sse2)
diff --git a/arch/x86/crypto/nhpoly1305-sse2-glue.c b/arch/x86/crypto/nhpoly1305-sse2-glue.c
new file mode 100644
index 000000000000..ed68d164ce14
--- /dev/null
+++ b/arch/x86/crypto/nhpoly1305-sse2-glue.c
@@ -0,0 +1,76 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * NHPoly1305 - ε-almost-∆-universal hash function for Adiantum
+ * (SSE2 accelerated version)
+ *
+ * Copyright 2018 Google LLC
+ */
+
+#include
+#include
+#include
+#include
+
+asmlinkage void nh_sse2(const u32 *key, const u8 *message, size_t message_len,
+			u8 hash[NH_HASH_BYTES]);
+
+/* wrapper to avoid indirect call to assembly, which doesn't work with CFI */
+static void _nh_sse2(const u32 *key, const u8 *message, size_t message_len,
+		     __le64 hash[NH_NUM_PASSES])
+{
+	nh_sse2(key, message, message_len, (u8 *)hash);
+}
+
+static int nhpoly1305_sse2_update(struct shash_desc *desc,
+				  const u8 *src, unsigned int srclen)
+{
+	if (srclen < 64 || !irq_fpu_usable())
+		return crypto_nhpoly1305_update(desc, src, srclen);
+
+	do {
+		unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE);
+
+		kernel_fpu_begin();
+		crypto_nhpoly1305_update_helper(desc, src, n, _nh_sse2);
+		kernel_fpu_end();
+		src += n;
+		srclen -= n;
+	} while (srclen);
+	return 0;
+}
+
+static struct shash_alg nhpoly1305_alg = {
+	.base.cra_name		= "nhpoly1305",
+	.base.cra_driver_name	= "nhpoly1305-sse2",
+	.base.cra_priority	= 200,
+	.base.cra_ctxsize	= sizeof(struct nhpoly1305_key),
+	.base.cra_module	= THIS_MODULE,
+	.digestsize		= POLY1305_DIGEST_SIZE,
+	.init			= crypto_nhpoly1305_init,
+	.update			= nhpoly1305_sse2_update,
+	.final			= crypto_nhpoly1305_final,
+	.setkey			= crypto_nhpoly1305_setkey,
+	.descsize		= sizeof(struct nhpoly1305_state),
+};
+
+static int __init nhpoly1305_mod_init(void)
+{
+	if (!boot_cpu_has(X86_FEATURE_XMM2))
+		return -ENODEV;
+
+	return crypto_register_shash(&nhpoly1305_alg);
+}
+
+static void __exit nhpoly1305_mod_exit(void)
+{
+	crypto_unregister_shash(&nhpoly1305_alg);
+}
+
+module_init(nhpoly1305_mod_init);
+module_exit(nhpoly1305_mod_exit);
+
+MODULE_DESCRIPTION("NHPoly1305 ε-almost-∆-universal hash function (SSE2-accelerated)");
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Eric Biggers ");
+MODULE_ALIAS_CRYPTO("nhpoly1305");
+MODULE_ALIAS_CRYPTO("nhpoly1305-sse2");
diff --git a/crypto/Kconfig b/crypto/Kconfig
index b6376d5d973e..b85133966d64 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -501,6 +501,14 @@ config CRYPTO_NHPOLY1305
 	select CRYPTO_HASH
 	select CRYPTO_POLY1305
 
+config CRYPTO_NHPOLY1305_SSE2
+	tristate "NHPoly1305 hash function (x86_64 SSE2 implementation)"
+	depends on X86 && 64BIT
+	select CRYPTO_NHPOLY1305
+	help
+	  SSE2 optimized implementation of the hash function used by the
+	  Adiantum encryption mode.
+
 config CRYPTO_ADIANTUM
 	tristate "Adiantum support"
 	select CRYPTO_CHACHA20

From patchwork Thu Nov 29 23:02:13 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Eric Biggers
X-Patchwork-Id: 10705501
X-Patchwork-Delegate: herbert@gondor.apana.org.au
From: Eric Biggers
To: linux-crypto@vger.kernel.org
Cc: Paul Crowley, Martin Willi, Milan Broz, "Jason A . Donenfeld", linux-kernel@vger.kernel.org
Subject: [PATCH v2 2/6] crypto: x86/nhpoly1305 - add AVX2 accelerated NHPoly1305
Date: Thu, 29 Nov 2018 15:02:13 -0800
Message-Id: <20181129230217.158038-3-ebiggers@kernel.org>
X-Mailer: git-send-email 2.20.0.rc0.387.gc7a69e6b6c-goog
In-Reply-To: <20181129230217.158038-1-ebiggers@kernel.org>
References: <20181129230217.158038-1-ebiggers@kernel.org>

From: Eric Biggers

Add a 64-bit AVX2 implementation of NHPoly1305, an ε-almost-∆-universal
hash function used in the Adiantum encryption mode. For now, only the
NH portion is actually AVX2-accelerated; the Poly1305 part is less
performance-critical so is just implemented in C.

Signed-off-by: Eric Biggers
---
 arch/x86/crypto/Makefile               |   3 +
 arch/x86/crypto/nh-avx2-x86_64.S       | 157 +++++++++++++++++++++++++
 arch/x86/crypto/nhpoly1305-avx2-glue.c |  77 ++++++++++++
 crypto/Kconfig                         |   8 ++
 4 files changed, 245 insertions(+)
 create mode 100644 arch/x86/crypto/nh-avx2-x86_64.S
 create mode 100644 arch/x86/crypto/nhpoly1305-avx2-glue.c

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index 2a6acb4de373..0b31b16f49d8 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -48,6 +48,7 @@ obj-$(CONFIG_CRYPTO_MORUS640_SSE2) += morus640-sse2.o
 obj-$(CONFIG_CRYPTO_MORUS1280_SSE2) += morus1280-sse2.o
 
 obj-$(CONFIG_CRYPTO_NHPOLY1305_SSE2) += nhpoly1305-sse2.o
+obj-$(CONFIG_CRYPTO_NHPOLY1305_AVX2) += nhpoly1305-avx2.o
 
 # These modules require assembler to support AVX.
 ifeq ($(avx_supported),yes)
@@ -106,6 +107,8 @@ ifeq ($(avx2_supported),yes)
 	serpent-avx2-y := serpent-avx2-asm_64.o serpent_avx2_glue.o
 
 	morus1280-avx2-y := morus1280-avx2-asm.o morus1280-avx2-glue.o
+
+	nhpoly1305-avx2-y := nh-avx2-x86_64.o nhpoly1305-avx2-glue.o
 endif
 
 ifeq ($(avx512_supported),yes)
diff --git a/arch/x86/crypto/nh-avx2-x86_64.S b/arch/x86/crypto/nh-avx2-x86_64.S
new file mode 100644
index 000000000000..f7946ea1b704
--- /dev/null
+++ b/arch/x86/crypto/nh-avx2-x86_64.S
@@ -0,0 +1,157 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * NH - ε-almost-universal hash function, x86_64 AVX2 accelerated
+ *
+ * Copyright 2018 Google LLC
+ *
+ * Author: Eric Biggers
+ */
+
+#include
+
+#define PASS0_SUMS	%ymm0
+#define PASS1_SUMS	%ymm1
+#define PASS2_SUMS	%ymm2
+#define PASS3_SUMS	%ymm3
+#define K0		%ymm4
+#define K0_XMM		%xmm4
+#define K1		%ymm5
+#define K1_XMM		%xmm5
+#define K2		%ymm6
+#define K2_XMM		%xmm6
+#define K3		%ymm7
+#define K3_XMM		%xmm7
+#define T0		%ymm8
+#define T1		%ymm9
+#define T2		%ymm10
+#define T2_XMM		%xmm10
+#define T3		%ymm11
+#define T3_XMM		%xmm11
+#define T4		%ymm12
+#define T5		%ymm13
+#define T6		%ymm14
+#define T7		%ymm15
+#define KEY		%rdi
+#define MESSAGE		%rsi
+#define MESSAGE_LEN	%rdx
+#define HASH		%rcx
+
+.macro _nh_2xstride	k0, k1, k2, k3
+
+	// Add message words to key words
+	vpaddd		\k0, T3, T0
+	vpaddd		\k1, T3, T1
+	vpaddd		\k2, T3, T2
+	vpaddd		\k3, T3, T3
+
+	// Multiply 32x32 => 64 and accumulate
+	vpshufd		$0x10, T0, T4
+	vpshufd		$0x32, T0, T0
+	vpshufd		$0x10, T1, T5
+	vpshufd		$0x32, T1, T1
+	vpshufd		$0x10, T2, T6
+	vpshufd		$0x32, T2, T2
+	vpshufd		$0x10, T3, T7
+	vpshufd		$0x32, T3, T3
+	vpmuludq	T4, T0, T0
+	vpmuludq	T5, T1, T1
+	vpmuludq	T6, T2, T2
+	vpmuludq	T7, T3, T3
+	vpaddq		T0, PASS0_SUMS, PASS0_SUMS
+	vpaddq		T1, PASS1_SUMS, PASS1_SUMS
+	vpaddq		T2, PASS2_SUMS, PASS2_SUMS
+	vpaddq		T3, PASS3_SUMS, PASS3_SUMS
+.endm
+
+/*
+ * void nh_avx2(const u32 *key, const u8 *message, size_t message_len,
+ *		u8 hash[NH_HASH_BYTES])
+ *
+ * It's guaranteed that message_len % 16 == 0.
+ */
+ENTRY(nh_avx2)
+
+	vmovdqu		0x00(KEY), K0
+	vmovdqu		0x10(KEY), K1
+	add		$0x20, KEY
+	vpxor		PASS0_SUMS, PASS0_SUMS, PASS0_SUMS
+	vpxor		PASS1_SUMS, PASS1_SUMS, PASS1_SUMS
+	vpxor		PASS2_SUMS, PASS2_SUMS, PASS2_SUMS
+	vpxor		PASS3_SUMS, PASS3_SUMS, PASS3_SUMS
+
+	sub		$0x40, MESSAGE_LEN
+	jl		.Lloop4_done
+.Lloop4:
+	vmovdqu		(MESSAGE), T3
+	vmovdqu		0x00(KEY), K2
+	vmovdqu		0x10(KEY), K3
+	_nh_2xstride	K0, K1, K2, K3
+
+	vmovdqu		0x20(MESSAGE), T3
+	vmovdqu		0x20(KEY), K0
+	vmovdqu		0x30(KEY), K1
+	_nh_2xstride	K2, K3, K0, K1
+
+	add		$0x40, MESSAGE
+	add		$0x40, KEY
+	sub		$0x40, MESSAGE_LEN
+	jge		.Lloop4
+
+.Lloop4_done:
+	and		$0x3f, MESSAGE_LEN
+	jz		.Ldone
+
+	cmp		$0x20, MESSAGE_LEN
+	jl		.Llast
+
+	// 2 or 3 strides remain; do 2 more.
+	vmovdqu		(MESSAGE), T3
+	vmovdqu		0x00(KEY), K2
+	vmovdqu		0x10(KEY), K3
+	_nh_2xstride	K0, K1, K2, K3
+	add		$0x20, MESSAGE
+	add		$0x20, KEY
+	sub		$0x20, MESSAGE_LEN
+	jz		.Ldone
+	vmovdqa		K2, K0
+	vmovdqa		K3, K1
+.Llast:
+	// Last stride. Zero the high 128 bits of the message and keys so they
+	// don't affect the result when processing them like 2 strides.
+	vmovdqu		(MESSAGE), T3_XMM
+	vmovdqa		K0_XMM, K0_XMM
+	vmovdqa		K1_XMM, K1_XMM
+	vmovdqu		0x00(KEY), K2_XMM
+	vmovdqu		0x10(KEY), K3_XMM
+	_nh_2xstride	K0, K1, K2, K3
+
+.Ldone:
+	// Sum the accumulators for each pass, then store the sums to 'hash'
+
+	// PASS0_SUMS is (0A 0B 0C 0D)
+	// PASS1_SUMS is (1A 1B 1C 1D)
+	// PASS2_SUMS is (2A 2B 2C 2D)
+	// PASS3_SUMS is (3A 3B 3C 3D)
+	// We need the horizontal sums:
+	//     (0A + 0B + 0C + 0D,
+	//      1A + 1B + 1C + 1D,
+	//      2A + 2B + 2C + 2D,
+	//      3A + 3B + 3C + 3D)
+	//
+
+	vpunpcklqdq	PASS1_SUMS, PASS0_SUMS, T0	// T0 = (0A 1A 0C 1C)
+	vpunpckhqdq	PASS1_SUMS, PASS0_SUMS, T1	// T1 = (0B 1B 0D 1D)
+	vpunpcklqdq	PASS3_SUMS, PASS2_SUMS, T2	// T2 = (2A 3A 2C 3C)
+	vpunpckhqdq	PASS3_SUMS, PASS2_SUMS, T3	// T3 = (2B 3B 2D 3D)
+
+	vinserti128	$0x1, T2_XMM, T0, T4		// T4 = (0A 1A 2A 3A)
+	vinserti128	$0x1, T3_XMM, T1, T5		// T5 = (0B 1B 2B 3B)
+	vperm2i128	$0x31, T2, T0, T0		// T0 = (0C 1C 2C 3C)
+	vperm2i128	$0x31, T3, T1, T1		// T1 = (0D 1D 2D 3D)
+
+	vpaddq		T5, T4, T4
+	vpaddq		T1, T0, T0
+	vpaddq		T4, T0, T0
+	vmovdqu		T0, (HASH)
+	ret
+ENDPROC(nh_avx2)
diff --git a/arch/x86/crypto/nhpoly1305-avx2-glue.c b/arch/x86/crypto/nhpoly1305-avx2-glue.c
new file mode 100644
index 000000000000..20d815ea4b6a
--- /dev/null
+++ b/arch/x86/crypto/nhpoly1305-avx2-glue.c
@@ -0,0 +1,77 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * NHPoly1305 - ε-almost-∆-universal hash function for Adiantum
+ * (AVX2 accelerated version)
+ *
+ * Copyright 2018 Google LLC
+ */
+
+#include
+#include
+#include
+#include
+
+asmlinkage void nh_avx2(const u32 *key, const u8 *message, size_t message_len,
+			u8 hash[NH_HASH_BYTES]);
+
+/* wrapper to avoid indirect call to assembly, which doesn't work with CFI */
+static void _nh_avx2(const u32 *key, const u8 *message, size_t message_len,
+		     __le64 hash[NH_NUM_PASSES])
+{
+	nh_avx2(key, message, message_len, (u8 *)hash);
+}
+
+static int nhpoly1305_avx2_update(struct shash_desc *desc,
+				  const u8 *src, unsigned int srclen)
+{
+	if (srclen < 64 || !irq_fpu_usable())
+		return crypto_nhpoly1305_update(desc, src, srclen);
+
+	do {
+		unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE);
+
+		kernel_fpu_begin();
+		crypto_nhpoly1305_update_helper(desc, src, n, _nh_avx2);
+		kernel_fpu_end();
+		src += n;
+		srclen -= n;
+	} while (srclen);
+	return 0;
+}
+
+static struct shash_alg nhpoly1305_alg = {
+	.base.cra_name		= "nhpoly1305",
+	.base.cra_driver_name	= "nhpoly1305-avx2",
+	.base.cra_priority	= 300,
+	.base.cra_ctxsize	= sizeof(struct nhpoly1305_key),
+	.base.cra_module	= THIS_MODULE,
+	.digestsize		= POLY1305_DIGEST_SIZE,
+	.init			= crypto_nhpoly1305_init,
+	.update			= nhpoly1305_avx2_update,
+	.final			= crypto_nhpoly1305_final,
+	.setkey			= crypto_nhpoly1305_setkey,
+	.descsize		= sizeof(struct nhpoly1305_state),
+};
+
+static int __init nhpoly1305_mod_init(void)
+{
+	if (!boot_cpu_has(X86_FEATURE_AVX2) ||
+	    !boot_cpu_has(X86_FEATURE_OSXSAVE))
+		return -ENODEV;
+
+	return crypto_register_shash(&nhpoly1305_alg);
+}
+
+static void __exit nhpoly1305_mod_exit(void)
+{
+	crypto_unregister_shash(&nhpoly1305_alg);
+}
+
+module_init(nhpoly1305_mod_init);
+module_exit(nhpoly1305_mod_exit);
+
+MODULE_DESCRIPTION("NHPoly1305 ε-almost-∆-universal hash function (AVX2-accelerated)");
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Eric Biggers ");
+MODULE_ALIAS_CRYPTO("nhpoly1305");
+MODULE_ALIAS_CRYPTO("nhpoly1305-avx2");
diff --git a/crypto/Kconfig b/crypto/Kconfig
index b85133966d64..e084e2fb6743 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -509,6 +509,14 @@ config CRYPTO_NHPOLY1305_SSE2
 	  SSE2 optimized implementation of the hash function used by the
 	  Adiantum encryption mode.
 
+config CRYPTO_NHPOLY1305_AVX2
+	tristate "NHPoly1305 hash function (x86_64 AVX2 implementation)"
+	depends on X86 && 64BIT
+	select CRYPTO_NHPOLY1305
+	help
+	  AVX2 optimized implementation of the hash function used by the
+	  Adiantum encryption mode.
+
 config CRYPTO_ADIANTUM
 	tristate "Adiantum support"
 	select CRYPTO_CHACHA20

From patchwork Thu Nov 29 23:02:14 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Eric Biggers
X-Patchwork-Id: 10705493
X-Patchwork-Delegate: herbert@gondor.apana.org.au
From: Eric Biggers
To: linux-crypto@vger.kernel.org
Cc: Paul Crowley, Martin Willi, Milan Broz, "Jason A . Donenfeld", linux-kernel@vger.kernel.org
Subject: [PATCH v2 3/6] crypto: x86/chacha20 - limit the preemption-disabled section
Date: Thu, 29 Nov 2018 15:02:14 -0800
Message-Id: <20181129230217.158038-4-ebiggers@kernel.org>
X-Mailer: git-send-email 2.20.0.rc0.387.gc7a69e6b6c-goog
In-Reply-To: <20181129230217.158038-1-ebiggers@kernel.org>
References: <20181129230217.158038-1-ebiggers@kernel.org>

From: Eric Biggers

To improve responsiveness, disable preemption for each step of the walk
(which is at most PAGE_SIZE) rather than for the entire
encryption/decryption operation.
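The idea can be sketched in plain user-space C as below. This is only an illustration under stated assumptions: `fpu_begin()` and `fpu_end()` are stubs standing in for `kernel_fpu_begin()` and `kernel_fpu_end()`, `CHUNK_MAX` stands in for PAGE_SIZE, and `process_chunk()`, `process_all()` and the counters are hypothetical names. The real code steps through an skcipher walk rather than a flat buffer.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK_MAX 4096		/* assumed stand-in for PAGE_SIZE */

static int in_section;		/* nonzero while "preemption is disabled" */
static unsigned int sections;	/* number of begin/end pairs executed */
static size_t current_section;	/* bytes handled inside the open pair */
static size_t longest_section;	/* most bytes handled inside one pair */

/* Stub for kernel_fpu_begin(): opens one bounded section. */
static void fpu_begin(void)
{
	assert(!in_section);
	in_section = 1;
	current_section = 0;
	sections++;
}

/* Stub for kernel_fpu_end(): closes the section and records its size. */
static void fpu_end(void)
{
	assert(in_section);
	in_section = 0;
	if (current_section > longest_section)
		longest_section = current_section;
}

/* Hypothetical SIMD work; here it only accounts for the bytes it saw. */
static void process_chunk(const unsigned char *src, size_t n)
{
	(void)src;
	assert(in_section);
	current_section += n;
}

/* Begin/end around each step instead of around the whole operation. */
static void process_all(const unsigned char *src, size_t len)
{
	while (len) {
		size_t n = len < CHUNK_MAX ? len : CHUNK_MAX;

		fpu_begin();
		process_chunk(src, n);
		fpu_end();
		src += n;
		len -= n;
	}
}
```

With this shape, the time spent between one begin/end pair is bounded by the chunk size rather than by the request length, at the cost of one extra begin/end pair per chunk.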

Signed-off-by: Eric Biggers
---
 arch/x86/crypto/chacha20_glue.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/x86/crypto/chacha20_glue.c b/arch/x86/crypto/chacha20_glue.c
index 773d075a1483..036de144aab6 100644
--- a/arch/x86/crypto/chacha20_glue.c
+++ b/arch/x86/crypto/chacha20_glue.c
@@ -135,26 +135,24 @@ static int chacha20_simd(struct skcipher_request *req)
 	if (req->cryptlen <= CHACHA_BLOCK_SIZE || !may_use_simd())
 		return crypto_chacha_crypt(req);
 
-	err = skcipher_walk_virt(&walk, req, true);
+	err = skcipher_walk_virt(&walk, req, false);
 
 	crypto_chacha_init(state, ctx, walk.iv);
 
-	kernel_fpu_begin();
-
 	while (walk.nbytes > 0) {
 		unsigned int nbytes = walk.nbytes;
 
 		if (nbytes < walk.total)
 			nbytes = round_down(nbytes, walk.stride);
 
+		kernel_fpu_begin();
 		chacha20_dosimd(state, walk.dst.virt.addr, walk.src.virt.addr,
 				nbytes);
+		kernel_fpu_end();
 
 		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
 	}
 
-	kernel_fpu_end();
-
 	return err;
 }
 

From patchwork Thu Nov 29 23:02:15 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Eric Biggers
X-Patchwork-Id: 10705497
X-Patchwork-Delegate: herbert@gondor.apana.org.au
From: Eric Biggers
To: linux-crypto@vger.kernel.org
Cc: Paul Crowley, Martin Willi, Milan Broz, "Jason A . Donenfeld", linux-kernel@vger.kernel.org
Subject: [PATCH v2 4/6] crypto: x86/chacha20 - add XChaCha20 support
Date: Thu, 29 Nov 2018 15:02:15 -0800
Message-Id: <20181129230217.158038-5-ebiggers@kernel.org>
X-Mailer: git-send-email 2.20.0.rc0.387.gc7a69e6b6c-goog
In-Reply-To: <20181129230217.158038-1-ebiggers@kernel.org>
References: <20181129230217.158038-1-ebiggers@kernel.org>

From: Eric Biggers

Add an XChaCha20 implementation that is hooked up to the x86_64 SIMD
implementations of ChaCha20. This can be used by Adiantum.
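As a rough illustration of how the XChaCha20 glue in this patch derives the IV for the ChaCha20 stream step, the two memcpy() calls in xchacha20_simd() can be isolated as below. The helper name `xchacha_real_iv` and the `XCHACHA_IV_LEN` constant are assumptions for this sketch; as we read the glue code, the first 16 IV bytes go into the HChaCha20 state, and only the last 16 bytes are rearranged here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define XCHACHA_IV_LEN 32	/* assumed to mirror XCHACHA_IV_SIZE */

/*
 * Hypothetical helper mirroring the two memcpy() calls in xchacha20_simd():
 * build the 16-byte IV handed to the ChaCha20 stream step.
 */
static void xchacha_real_iv(const uint8_t iv[XCHACHA_IV_LEN],
			    uint8_t real_iv[16])
{
	assert(iv && real_iv);
	memcpy(&real_iv[0], iv + 24, 8);	/* last 8 IV bytes go first */
	memcpy(&real_iv[8], iv + 16, 8);	/* then IV bytes 16..23 */
}
```

The swap places the trailing 8 bytes of the 32-byte IV in the position where ChaCha20 expects its block counter, followed by the remaining 8 nonce bytes.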

An SSSE3 implementation of single-block HChaCha20 is also added so that
XChaCha20 can use it rather than the generic implementation. This
required refactoring the ChaCha permutation into its own function.

Signed-off-by: Eric Biggers
Reviewed-by: Martin Willi
---
 arch/x86/crypto/chacha20-ssse3-x86_64.S | 76 ++++++++++++-------
 arch/x86/crypto/chacha20_glue.c         | 99 +++++++++++++++++++------
 crypto/Kconfig                          | 12 +--
 3 files changed, 129 insertions(+), 58 deletions(-)

diff --git a/arch/x86/crypto/chacha20-ssse3-x86_64.S b/arch/x86/crypto/chacha20-ssse3-x86_64.S
index d8ac75bb448f..45e4ccdd9c98 100644
--- a/arch/x86/crypto/chacha20-ssse3-x86_64.S
+++ b/arch/x86/crypto/chacha20-ssse3-x86_64.S
@@ -23,37 +23,24 @@ CTRINC:	.octa 0x00000003000000020000000100000000
 
 .text
 
-ENTRY(chacha20_block_xor_ssse3)
-	# %rdi: Input state matrix, s
-	# %rsi: up to 1 data block output, o
-	# %rdx: up to 1 data block input, i
-	# %rcx: input/output length in bytes
-
-	# This function encrypts one ChaCha20 block by loading the state matrix
-	# in four SSE registers. It performs matrix operation on four words in
-	# parallel, but requires shuffling to rearrange the words after each
-	# round. 8/16-bit word rotation is done with the slightly better
-	# performing SSSE3 byte shuffling, 7/12-bit word rotation uses
-	# traditional shift+OR.
-
-	# x0..3 = s0..3
-	movdqa		0x00(%rdi),%xmm0
-	movdqa		0x10(%rdi),%xmm1
-	movdqa		0x20(%rdi),%xmm2
-	movdqa		0x30(%rdi),%xmm3
-	movdqa		%xmm0,%xmm8
-	movdqa		%xmm1,%xmm9
-	movdqa		%xmm2,%xmm10
-	movdqa		%xmm3,%xmm11
+/*
+ * chacha20_permute - permute one block
+ *
+ * Permute one 64-byte block where the state matrix is in %xmm0-%xmm3. This
+ * function performs matrix operations on four words in parallel, but requires
+ * shuffling to rearrange the words after each round. 8/16-bit word rotation is
+ * done with the slightly better performing SSSE3 byte shuffling, 7/12-bit word
+ * rotation uses traditional shift+OR.
+ *
+ * Clobbers: %ecx, %xmm4-%xmm7
+ */
+chacha20_permute:
 
 	movdqa		ROT8(%rip),%xmm4
 	movdqa		ROT16(%rip),%xmm5
-
-	mov		%rcx,%rax
 	mov		$10,%ecx
 
 .Ldoubleround:
-
 	# x0 += x1, x3 = rotl32(x3 ^ x0, 16)
 	paddd		%xmm1,%xmm0
 	pxor		%xmm0,%xmm3
@@ -123,6 +110,28 @@ ENTRY(chacha20_block_xor_ssse3)
 	dec		%ecx
 	jnz		.Ldoubleround
 
+	ret
+ENDPROC(chacha20_permute)
+
+ENTRY(chacha20_block_xor_ssse3)
+	# %rdi: Input state matrix, s
+	# %rsi: up to 1 data block output, o
+	# %rdx: up to 1 data block input, i
+	# %rcx: input/output length in bytes
+
+	# x0..3 = s0..3
+	movdqa		0x00(%rdi),%xmm0
+	movdqa		0x10(%rdi),%xmm1
+	movdqa		0x20(%rdi),%xmm2
+	movdqa		0x30(%rdi),%xmm3
+	movdqa		%xmm0,%xmm8
+	movdqa		%xmm1,%xmm9
+	movdqa		%xmm2,%xmm10
+	movdqa		%xmm3,%xmm11
+
+	mov		%rcx,%rax
+	call		chacha20_permute
+
 	# o0 = i0 ^ (x0 + s0)
 	paddd		%xmm8,%xmm0
 	cmp		$0x10,%rax
@@ -189,6 +198,23 @@ ENTRY(chacha20_block_xor_ssse3)
 
 ENDPROC(chacha20_block_xor_ssse3)
 
+ENTRY(hchacha20_block_ssse3)
+	# %rdi: Input state matrix, s
+	# %rsi: output (8 32-bit words)
+
+	movdqa		0x00(%rdi),%xmm0
+	movdqa		0x10(%rdi),%xmm1
+	movdqa		0x20(%rdi),%xmm2
+	movdqa		0x30(%rdi),%xmm3
+
+	call		chacha20_permute
+
+	movdqu		%xmm0,0x00(%rsi)
+	movdqu		%xmm3,0x10(%rsi)
+
+	ret
+ENDPROC(hchacha20_block_ssse3)
+
 ENTRY(chacha20_4block_xor_ssse3)
 	# %rdi: Input state matrix, s
 	# %rsi: up to 4 data blocks output, o
diff --git a/arch/x86/crypto/chacha20_glue.c b/arch/x86/crypto/chacha20_glue.c
index 036de144aab6..2bea425acb76 100644
--- a/arch/x86/crypto/chacha20_glue.c
+++ b/arch/x86/crypto/chacha20_glue.c
@@ -23,6 +23,7 @@ asmlinkage void chacha20_block_xor_ssse3(u32 *state, u8 *dst, const u8 *src,
 					 unsigned int len);
 asmlinkage void chacha20_4block_xor_ssse3(u32 *state, u8 *dst, const u8 *src,
 					  unsigned int len);
+asmlinkage void hchacha20_block_ssse3(const u32 *state, u32 *out);
 #ifdef CONFIG_AS_AVX2
 asmlinkage void chacha20_2block_xor_avx2(u32 *state, u8 *dst, const u8 *src,
 					 unsigned int len);
@@ -121,10 +122,9 @@ static void chacha20_dosimd(u32 *state, u8 *dst, const u8 *src,
 	}
 }
 
-static int chacha20_simd(struct skcipher_request *req)
+static int chacha20_simd_stream_xor(struct skcipher_request *req,
+				    struct chacha_ctx *ctx, u8 *iv)
 {
-	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
-	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
 	u32 *state, state_buf[16 + 2] __aligned(8);
 	struct skcipher_walk walk;
 	int err;
@@ -132,12 +132,9 @@ static int chacha20_simd(struct skcipher_request *req)
 	BUILD_BUG_ON(CHACHA20_STATE_ALIGN != 16);
 	state = PTR_ALIGN(state_buf + 0, CHACHA20_STATE_ALIGN);
 
-	if (req->cryptlen <= CHACHA_BLOCK_SIZE || !may_use_simd())
-		return crypto_chacha_crypt(req);
-
 	err = skcipher_walk_virt(&walk, req, false);
 
-	crypto_chacha_init(state, ctx, walk.iv);
+	crypto_chacha_init(state, ctx, iv);
 
 	while (walk.nbytes > 0) {
 		unsigned int nbytes = walk.nbytes;
@@ -156,21 +153,73 @@ static int chacha20_simd(struct skcipher_request *req)
 	return err;
 }
 
-static struct skcipher_alg alg = {
-	.base.cra_name		= "chacha20",
-	.base.cra_driver_name	= "chacha20-simd",
-	.base.cra_priority	= 300,
-	.base.cra_blocksize	= 1,
-	.base.cra_ctxsize	= sizeof(struct chacha_ctx),
-	.base.cra_module	= THIS_MODULE,
-
-	.min_keysize		= CHACHA_KEY_SIZE,
-	.max_keysize		= CHACHA_KEY_SIZE,
-	.ivsize			= CHACHA_IV_SIZE,
-	.chunksize		= CHACHA_BLOCK_SIZE,
-	.setkey			= crypto_chacha20_setkey,
-	.encrypt		= chacha20_simd,
-	.decrypt		= chacha20_simd,
+static int chacha20_simd(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	if (req->cryptlen <= CHACHA_BLOCK_SIZE || !irq_fpu_usable())
+		return crypto_chacha_crypt(req);
+
+	return chacha20_simd_stream_xor(req, ctx, req->iv);
+}
+
+static int xchacha20_simd(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct chacha_ctx subctx;
+	u32 *state, state_buf[16 + 2] __aligned(8);
+	u8 real_iv[16];
+
+	if (req->cryptlen <= CHACHA_BLOCK_SIZE || !irq_fpu_usable())
+		return crypto_xchacha_crypt(req);
+
+	BUILD_BUG_ON(CHACHA20_STATE_ALIGN != 16);
+	state = PTR_ALIGN(state_buf + 0, CHACHA20_STATE_ALIGN);
+	crypto_chacha_init(state, ctx, req->iv);
+
+	kernel_fpu_begin();
+	hchacha20_block_ssse3(state, subctx.key);
+	kernel_fpu_end();
+
+	memcpy(&real_iv[0], req->iv + 24, 8);
+	memcpy(&real_iv[8], req->iv + 16, 8);
+	return chacha20_simd_stream_xor(req, &subctx, real_iv);
+}
+
+static struct skcipher_alg algs[] = {
+	{
+		.base.cra_name		= "chacha20",
+		.base.cra_driver_name	= "chacha20-simd",
+		.base.cra_priority	= 300,
+		.base.cra_blocksize	= 1,
+		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
+		.base.cra_module	= THIS_MODULE,
+
+		.min_keysize		= CHACHA_KEY_SIZE,
+		.max_keysize		= CHACHA_KEY_SIZE,
+		.ivsize			= CHACHA_IV_SIZE,
+		.chunksize		= CHACHA_BLOCK_SIZE,
+		.setkey			= crypto_chacha20_setkey,
+		.encrypt		= chacha20_simd,
+		.decrypt		= chacha20_simd,
+	}, {
+		.base.cra_name		= "xchacha20",
+		.base.cra_driver_name	= "xchacha20-simd",
+		.base.cra_priority	= 300,
+		.base.cra_blocksize	= 1,
+		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
+		.base.cra_module	= THIS_MODULE,
+
+		.min_keysize		= CHACHA_KEY_SIZE,
+		.max_keysize		= CHACHA_KEY_SIZE,
+		.ivsize			= XCHACHA_IV_SIZE,
+		.chunksize		= CHACHA_BLOCK_SIZE,
+		.setkey			= crypto_chacha20_setkey,
+		.encrypt		= xchacha20_simd,
+		.decrypt		= xchacha20_simd,
+	},
 };
 
 static int __init chacha20_simd_mod_init(void)
@@ -188,12 +237,12 @@ static int __init chacha20_simd_mod_init(void)
 				boot_cpu_has(X86_FEATURE_AVX512BW); /* kmovq */
 #endif
 #endif
-	return crypto_register_skcipher(&alg);
+	return crypto_register_skciphers(algs, ARRAY_SIZE(algs));
 }
 
 static void __exit chacha20_simd_mod_fini(void)
 {
-	crypto_unregister_skcipher(&alg);
+	crypto_unregister_skciphers(algs, ARRAY_SIZE(algs));
 }
 
 module_init(chacha20_simd_mod_init);
@@ -204,3 +253,5 @@ MODULE_AUTHOR("Martin Willi ");
 MODULE_DESCRIPTION("chacha20 cipher algorithm, SIMD
accelerated"); MODULE_ALIAS_CRYPTO("chacha20"); MODULE_ALIAS_CRYPTO("chacha20-simd"); +MODULE_ALIAS_CRYPTO("xchacha20"); +MODULE_ALIAS_CRYPTO("xchacha20-simd"); diff --git a/crypto/Kconfig b/crypto/Kconfig index e084e2fb6743..df466771e9bf 100644 --- a/crypto/Kconfig +++ b/crypto/Kconfig @@ -1468,19 +1468,13 @@ config CRYPTO_CHACHA20 in some performance-sensitive scenarios. config CRYPTO_CHACHA20_X86_64 - tristate "ChaCha20 cipher algorithm (x86_64/SSSE3/AVX2)" + tristate "ChaCha stream cipher algorithms (x86_64/SSSE3/AVX2/AVX-512VL)" depends on X86 && 64BIT select CRYPTO_BLKCIPHER select CRYPTO_CHACHA20 help - ChaCha20 cipher algorithm, RFC7539. - - ChaCha20 is a 256-bit high-speed stream cipher designed by Daniel J. - Bernstein and further specified in RFC7539 for use in IETF protocols. - This is the x86_64 assembler implementation using SIMD instructions. - - See also: - + SSSE3, AVX2, and AVX-512VL optimized implementations of the ChaCha20 + and XChaCha20 stream ciphers. config CRYPTO_SEED tristate "SEED cipher algorithm" From patchwork Thu Nov 29 23:02:16 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Eric Biggers X-Patchwork-Id: 10705491 X-Patchwork-Delegate: herbert@gondor.apana.org.au Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id AFE5113A4 for ; Thu, 29 Nov 2018 23:03:44 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 9B45F2B4D4 for ; Thu, 29 Nov 2018 23:03:44 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 8C82E2B697; Thu, 29 Nov 2018 23:03:44 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, 
DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 0C0B62B491 for ; Thu, 29 Nov 2018 23:03:43 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726616AbeK3KKn (ORCPT ); Fri, 30 Nov 2018 05:10:43 -0500 Received: from mail.kernel.org ([198.145.29.99]:44940 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726891AbeK3KKf (ORCPT ); Fri, 30 Nov 2018 05:10:35 -0500 Received: from ebiggers.mtv.corp.google.com (unknown [104.132.1.85]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 4372A21479; Thu, 29 Nov 2018 23:03:27 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1543532607; bh=4CDUl/a/7tTBYk9YWg8I+Xk7pfT7SorGHncYTmI9D7w=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=fD3r8Gbf+SIQEGm33Kf5GfgTzHaP2sPTOORAX3mt55uNwpno+XiIrIiTkCpZzbU/a cWxdCURW7q9a5uVY6G8xYVEopt0s85Woz4lP9uYGgXeXzR1ktizlChoZ8MgxncyoJK 8PxJxfAYsZ7Y6KH5crch9pCrI5iW86u6TMtezfRM= From: Eric Biggers To: linux-crypto@vger.kernel.org Cc: Paul Crowley , Martin Willi , Milan Broz , "Jason A . 
Donenfeld" , linux-kernel@vger.kernel.org
Subject: [PATCH v2 5/6] crypto: x86/chacha20 - refactor to allow varying number of rounds
Date: Thu, 29 Nov 2018 15:02:16 -0800
Message-Id: <20181129230217.158038-6-ebiggers@kernel.org>
X-Mailer: git-send-email 2.20.0.rc0.387.gc7a69e6b6c-goog
In-Reply-To: <20181129230217.158038-1-ebiggers@kernel.org>
References: <20181129230217.158038-1-ebiggers@kernel.org>
MIME-Version: 1.0
Sender: linux-crypto-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-crypto@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

From: Eric Biggers

In preparation for adding XChaCha12 support, rename/refactor the x86_64
SIMD implementations of ChaCha20 to support different numbers of rounds.

Signed-off-by: Eric Biggers
Reviewed-by: Martin Willi
---
 arch/x86/crypto/Makefile                      |   8 +-
 ...a20-avx2-x86_64.S => chacha-avx2-x86_64.S} |  33 ++--
 ...12vl-x86_64.S => chacha-avx512vl-x86_64.S} |  35 ++--
 ...0-ssse3-x86_64.S => chacha-ssse3-x86_64.S} |  41 ++---
 .../crypto/{chacha20_glue.c => chacha_glue.c} | 150 +++++++++---------
 5 files changed, 136 insertions(+), 131 deletions(-)
 rename arch/x86/crypto/{chacha20-avx2-x86_64.S => chacha-avx2-x86_64.S} (97%)
 rename arch/x86/crypto/{chacha20-avx512vl-x86_64.S => chacha-avx512vl-x86_64.S} (97%)
 rename arch/x86/crypto/{chacha20-ssse3-x86_64.S => chacha-ssse3-x86_64.S} (96%)
 rename arch/x86/crypto/{chacha20_glue.c => chacha_glue.c} (51%)

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index 0b31b16f49d8..45734e1cf967 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -24,7 +24,7 @@ obj-$(CONFIG_CRYPTO_CAMELLIA_X86_64) += camellia-x86_64.o
 obj-$(CONFIG_CRYPTO_BLOWFISH_X86_64) += blowfish-x86_64.o
 obj-$(CONFIG_CRYPTO_TWOFISH_X86_64) += twofish-x86_64.o
 obj-$(CONFIG_CRYPTO_TWOFISH_X86_64_3WAY) += twofish-x86_64-3way.o
-obj-$(CONFIG_CRYPTO_CHACHA20_X86_64) += chacha20-x86_64.o
+obj-$(CONFIG_CRYPTO_CHACHA20_X86_64) += chacha-x86_64.o
obj-$(CONFIG_CRYPTO_SERPENT_SSE2_X86_64) += serpent-sse2-x86_64.o obj-$(CONFIG_CRYPTO_AES_NI_INTEL) += aesni-intel.o obj-$(CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL) += ghash-clmulni-intel.o @@ -78,7 +78,7 @@ camellia-x86_64-y := camellia-x86_64-asm_64.o camellia_glue.o blowfish-x86_64-y := blowfish-x86_64-asm_64.o blowfish_glue.o twofish-x86_64-y := twofish-x86_64-asm_64.o twofish_glue.o twofish-x86_64-3way-y := twofish-x86_64-asm_64-3way.o twofish_glue_3way.o -chacha20-x86_64-y := chacha20-ssse3-x86_64.o chacha20_glue.o +chacha-x86_64-y := chacha-ssse3-x86_64.o chacha_glue.o serpent-sse2-x86_64-y := serpent-sse2-x86_64-asm_64.o serpent_sse2_glue.o aegis128-aesni-y := aegis128-aesni-asm.o aegis128-aesni-glue.o @@ -103,7 +103,7 @@ endif ifeq ($(avx2_supported),yes) camellia-aesni-avx2-y := camellia-aesni-avx2-asm_64.o camellia_aesni_avx2_glue.o - chacha20-x86_64-y += chacha20-avx2-x86_64.o + chacha-x86_64-y += chacha-avx2-x86_64.o serpent-avx2-y := serpent-avx2-asm_64.o serpent_avx2_glue.o morus1280-avx2-y := morus1280-avx2-asm.o morus1280-avx2-glue.o @@ -112,7 +112,7 @@ ifeq ($(avx2_supported),yes) endif ifeq ($(avx512_supported),yes) - chacha20-x86_64-y += chacha20-avx512vl-x86_64.o + chacha-x86_64-y += chacha-avx512vl-x86_64.o endif aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o diff --git a/arch/x86/crypto/chacha20-avx2-x86_64.S b/arch/x86/crypto/chacha-avx2-x86_64.S similarity index 97% rename from arch/x86/crypto/chacha20-avx2-x86_64.S rename to arch/x86/crypto/chacha-avx2-x86_64.S index b6ab082be657..32903fd450af 100644 --- a/arch/x86/crypto/chacha20-avx2-x86_64.S +++ b/arch/x86/crypto/chacha-avx2-x86_64.S @@ -1,5 +1,5 @@ /* - * ChaCha20 256-bit cipher algorithm, RFC7539, x64 AVX2 functions + * ChaCha 256-bit cipher algorithm, x64 AVX2 functions * * Copyright (C) 2015 Martin Willi * @@ -38,13 +38,14 @@ CTR4BL: .octa 0x00000000000000000000000000000002 .text -ENTRY(chacha20_2block_xor_avx2) +ENTRY(chacha_2block_xor_avx2) # %rdi: Input state matrix, s # %rsi: 
up to 2 data blocks output, o # %rdx: up to 2 data blocks input, i # %rcx: input/output length in bytes + # %r8d: nrounds - # This function encrypts two ChaCha20 blocks by loading the state + # This function encrypts two ChaCha blocks by loading the state # matrix twice across four AVX registers. It performs matrix operations # on four words in each matrix in parallel, but requires shuffling to # rearrange the words after each round. @@ -68,7 +69,6 @@ ENTRY(chacha20_2block_xor_avx2) vmovdqa ROT16(%rip),%ymm5 mov %rcx,%rax - mov $10,%ecx .Ldoubleround: @@ -138,7 +138,7 @@ ENTRY(chacha20_2block_xor_avx2) # x3 = shuffle32(x3, MASK(0, 3, 2, 1)) vpshufd $0x39,%ymm3,%ymm3 - dec %ecx + sub $2,%r8d jnz .Ldoubleround # o0 = i0 ^ (x0 + s0) @@ -228,15 +228,16 @@ ENTRY(chacha20_2block_xor_avx2) lea -8(%r10),%rsp jmp .Ldone2 -ENDPROC(chacha20_2block_xor_avx2) +ENDPROC(chacha_2block_xor_avx2) -ENTRY(chacha20_4block_xor_avx2) +ENTRY(chacha_4block_xor_avx2) # %rdi: Input state matrix, s # %rsi: up to 4 data blocks output, o # %rdx: up to 4 data blocks input, i # %rcx: input/output length in bytes + # %r8d: nrounds - # This function encrypts four ChaCha20 block by loading the state + # This function encrypts four ChaCha blocks by loading the state # matrix four times across eight AVX registers. It performs matrix # operations on four words in two matrices in parallel, sequentially # to the operations on the four words of the other two matrices. 
The @@ -269,7 +270,6 @@ ENTRY(chacha20_4block_xor_avx2) vmovdqa ROT16(%rip),%ymm9 mov %rcx,%rax - mov $10,%ecx .Ldoubleround4: @@ -389,7 +389,7 @@ ENTRY(chacha20_4block_xor_avx2) vpshufd $0x39,%ymm3,%ymm3 vpshufd $0x39,%ymm7,%ymm7 - dec %ecx + sub $2,%r8d jnz .Ldoubleround4 # o0 = i0 ^ (x0 + s0), first block @@ -533,15 +533,16 @@ ENTRY(chacha20_4block_xor_avx2) lea -8(%r10),%rsp jmp .Ldone4 -ENDPROC(chacha20_4block_xor_avx2) +ENDPROC(chacha_4block_xor_avx2) -ENTRY(chacha20_8block_xor_avx2) +ENTRY(chacha_8block_xor_avx2) # %rdi: Input state matrix, s # %rsi: up to 8 data blocks output, o # %rdx: up to 8 data blocks input, i # %rcx: input/output length in bytes + # %r8d: nrounds - # This function encrypts eight consecutive ChaCha20 blocks by loading + # This function encrypts eight consecutive ChaCha blocks by loading # the state matrix in AVX registers eight times. As we need some # scratch registers, we save the first four registers on the stack. The # algorithm performs each operation on the corresponding word of each @@ -588,8 +589,6 @@ ENTRY(chacha20_8block_xor_avx2) # x12 += counter values 0-3 vpaddd %ymm1,%ymm12,%ymm12 - mov $10,%ecx - .Ldoubleround8: # x0 += x4, x12 = rotl32(x12 ^ x0, 16) vpaddd 0x00(%rsp),%ymm4,%ymm0 @@ -775,7 +774,7 @@ ENTRY(chacha20_8block_xor_avx2) vpsrld $25,%ymm4,%ymm4 vpor %ymm0,%ymm4,%ymm4 - dec %ecx + sub $2,%r8d jnz .Ldoubleround8 # x0..15[0-3] += s[0..15] @@ -1023,4 +1022,4 @@ ENTRY(chacha20_8block_xor_avx2) jmp .Ldone8 -ENDPROC(chacha20_8block_xor_avx2) +ENDPROC(chacha_8block_xor_avx2) diff --git a/arch/x86/crypto/chacha20-avx512vl-x86_64.S b/arch/x86/crypto/chacha-avx512vl-x86_64.S similarity index 97% rename from arch/x86/crypto/chacha20-avx512vl-x86_64.S rename to arch/x86/crypto/chacha-avx512vl-x86_64.S index 55d34de29e3e..848f9c75fd4f 100644 --- a/arch/x86/crypto/chacha20-avx512vl-x86_64.S +++ b/arch/x86/crypto/chacha-avx512vl-x86_64.S @@ -1,6 +1,6 @@ /* SPDX-License-Identifier: GPL-2.0+ */ /* - * ChaCha20 256-bit cipher 
algorithm, RFC7539, x64 AVX-512VL functions + * ChaCha 256-bit cipher algorithm, x64 AVX-512VL functions * * Copyright (C) 2018 Martin Willi */ @@ -24,13 +24,14 @@ CTR8BL: .octa 0x00000003000000020000000100000000 .text -ENTRY(chacha20_2block_xor_avx512vl) +ENTRY(chacha_2block_xor_avx512vl) # %rdi: Input state matrix, s # %rsi: up to 2 data blocks output, o # %rdx: up to 2 data blocks input, i # %rcx: input/output length in bytes + # %r8d: nrounds - # This function encrypts two ChaCha20 blocks by loading the state + # This function encrypts two ChaCha blocks by loading the state # matrix twice across four AVX registers. It performs matrix operations # on four words in each matrix in parallel, but requires shuffling to # rearrange the words after each round. @@ -50,8 +51,6 @@ ENTRY(chacha20_2block_xor_avx512vl) vmovdqa %ymm2,%ymm10 vmovdqa %ymm3,%ymm11 - mov $10,%rax - .Ldoubleround: # x0 += x1, x3 = rotl32(x3 ^ x0, 16) @@ -108,7 +107,7 @@ ENTRY(chacha20_2block_xor_avx512vl) # x3 = shuffle32(x3, MASK(0, 3, 2, 1)) vpshufd $0x39,%ymm3,%ymm3 - dec %rax + sub $2,%r8d jnz .Ldoubleround # o0 = i0 ^ (x0 + s0) @@ -188,15 +187,16 @@ ENTRY(chacha20_2block_xor_avx512vl) jmp .Ldone2 -ENDPROC(chacha20_2block_xor_avx512vl) +ENDPROC(chacha_2block_xor_avx512vl) -ENTRY(chacha20_4block_xor_avx512vl) +ENTRY(chacha_4block_xor_avx512vl) # %rdi: Input state matrix, s # %rsi: up to 4 data blocks output, o # %rdx: up to 4 data blocks input, i # %rcx: input/output length in bytes + # %r8d: nrounds - # This function encrypts four ChaCha20 block by loading the state + # This function encrypts four ChaCha blocks by loading the state # matrix four times across eight AVX registers. It performs matrix # operations on four words in two matrices in parallel, sequentially # to the operations on the four words of the other two matrices. 
The @@ -225,8 +225,6 @@ ENTRY(chacha20_4block_xor_avx512vl) vmovdqa %ymm3,%ymm14 vmovdqa %ymm7,%ymm15 - mov $10,%rax - .Ldoubleround4: # x0 += x1, x3 = rotl32(x3 ^ x0, 16) @@ -321,7 +319,7 @@ ENTRY(chacha20_4block_xor_avx512vl) vpshufd $0x39,%ymm3,%ymm3 vpshufd $0x39,%ymm7,%ymm7 - dec %rax + sub $2,%r8d jnz .Ldoubleround4 # o0 = i0 ^ (x0 + s0), first block @@ -455,15 +453,16 @@ ENTRY(chacha20_4block_xor_avx512vl) jmp .Ldone4 -ENDPROC(chacha20_4block_xor_avx512vl) +ENDPROC(chacha_4block_xor_avx512vl) -ENTRY(chacha20_8block_xor_avx512vl) +ENTRY(chacha_8block_xor_avx512vl) # %rdi: Input state matrix, s # %rsi: up to 8 data blocks output, o # %rdx: up to 8 data blocks input, i # %rcx: input/output length in bytes + # %r8d: nrounds - # This function encrypts eight consecutive ChaCha20 blocks by loading + # This function encrypts eight consecutive ChaCha blocks by loading # the state matrix in AVX registers eight times. Compared to AVX2, this # mostly benefits from the new rotate instructions in VL and the # additional registers. 
@@ -508,8 +507,6 @@ ENTRY(chacha20_8block_xor_avx512vl) vmovdqa64 %ymm14,%ymm30 vmovdqa64 %ymm15,%ymm31 - mov $10,%eax - .Ldoubleround8: # x0 += x4, x12 = rotl32(x12 ^ x0, 16) vpaddd %ymm0,%ymm4,%ymm0 @@ -647,7 +644,7 @@ ENTRY(chacha20_8block_xor_avx512vl) vpxord %ymm9,%ymm4,%ymm4 vprold $7,%ymm4,%ymm4 - dec %eax + sub $2,%r8d jnz .Ldoubleround8 # x0..15[0-3] += s[0..15] @@ -836,4 +833,4 @@ ENTRY(chacha20_8block_xor_avx512vl) jmp .Ldone8 -ENDPROC(chacha20_8block_xor_avx512vl) +ENDPROC(chacha_8block_xor_avx512vl) diff --git a/arch/x86/crypto/chacha20-ssse3-x86_64.S b/arch/x86/crypto/chacha-ssse3-x86_64.S similarity index 96% rename from arch/x86/crypto/chacha20-ssse3-x86_64.S rename to arch/x86/crypto/chacha-ssse3-x86_64.S index 45e4ccdd9c98..613f80ae9857 100644 --- a/arch/x86/crypto/chacha20-ssse3-x86_64.S +++ b/arch/x86/crypto/chacha-ssse3-x86_64.S @@ -1,5 +1,5 @@ /* - * ChaCha20 256-bit cipher algorithm, RFC7539, x64 SSSE3 functions + * ChaCha 256-bit cipher algorithm, x64 SSSE3 functions * * Copyright (C) 2015 Martin Willi * @@ -24,7 +24,7 @@ CTRINC: .octa 0x00000003000000020000000100000000 .text /* - * chacha20_permute - permute one block + * chacha_permute - permute one block * * Permute one 64-byte block where the state matrix is in %xmm0-%xmm3. This * function performs matrix operations on four words in parallel, but requires @@ -32,13 +32,14 @@ CTRINC: .octa 0x00000003000000020000000100000000 * done with the slightly better performing SSSE3 byte shuffling, 7/12-bit word * rotation uses traditional shift+OR. * - * Clobbers: %ecx, %xmm4-%xmm7 + * The round count is given in %r8d. 
+ * + * Clobbers: %r8d, %xmm4-%xmm7 */ -chacha20_permute: +chacha_permute: movdqa ROT8(%rip),%xmm4 movdqa ROT16(%rip),%xmm5 - mov $10,%ecx .Ldoubleround: # x0 += x1, x3 = rotl32(x3 ^ x0, 16) @@ -107,17 +108,18 @@ chacha20_permute: # x3 = shuffle32(x3, MASK(0, 3, 2, 1)) pshufd $0x39,%xmm3,%xmm3 - dec %ecx + sub $2,%r8d jnz .Ldoubleround ret -ENDPROC(chacha20_permute) +ENDPROC(chacha_permute) -ENTRY(chacha20_block_xor_ssse3) +ENTRY(chacha_block_xor_ssse3) # %rdi: Input state matrix, s # %rsi: up to 1 data block output, o # %rdx: up to 1 data block input, i # %rcx: input/output length in bytes + # %r8d: nrounds # x0..3 = s0..3 movdqa 0x00(%rdi),%xmm0 @@ -130,7 +132,7 @@ ENTRY(chacha20_block_xor_ssse3) movdqa %xmm3,%xmm11 mov %rcx,%rax - call chacha20_permute + call chacha_permute # o0 = i0 ^ (x0 + s0) paddd %xmm8,%xmm0 @@ -196,32 +198,35 @@ ENTRY(chacha20_block_xor_ssse3) lea -8(%r10),%rsp jmp .Ldone -ENDPROC(chacha20_block_xor_ssse3) +ENDPROC(chacha_block_xor_ssse3) -ENTRY(hchacha20_block_ssse3) +ENTRY(hchacha_block_ssse3) # %rdi: Input state matrix, s # %rsi: output (8 32-bit words) + # %edx: nrounds movdqa 0x00(%rdi),%xmm0 movdqa 0x10(%rdi),%xmm1 movdqa 0x20(%rdi),%xmm2 movdqa 0x30(%rdi),%xmm3 - call chacha20_permute + mov %edx,%r8d + call chacha_permute movdqu %xmm0,0x00(%rsi) movdqu %xmm3,0x10(%rsi) ret -ENDPROC(hchacha20_block_ssse3) +ENDPROC(hchacha_block_ssse3) -ENTRY(chacha20_4block_xor_ssse3) +ENTRY(chacha_4block_xor_ssse3) # %rdi: Input state matrix, s # %rsi: up to 4 data blocks output, o # %rdx: up to 4 data blocks input, i # %rcx: input/output length in bytes + # %r8d: nrounds - # This function encrypts four consecutive ChaCha20 blocks by loading the + # This function encrypts four consecutive ChaCha blocks by loading the # the state matrix in SSE registers four times. As we need some scratch # registers, we save the first four registers on the stack. 
The # algorithm performs each operation on the corresponding word of each @@ -274,8 +279,6 @@ ENTRY(chacha20_4block_xor_ssse3) # x12 += counter values 0-3 paddd %xmm1,%xmm12 - mov $10,%ecx - .Ldoubleround4: # x0 += x4, x12 = rotl32(x12 ^ x0, 16) movdqa 0x00(%rsp),%xmm0 @@ -493,7 +496,7 @@ ENTRY(chacha20_4block_xor_ssse3) psrld $25,%xmm4 por %xmm0,%xmm4 - dec %ecx + sub $2,%r8d jnz .Ldoubleround4 # x0[0-3] += s0[0] @@ -784,4 +787,4 @@ ENTRY(chacha20_4block_xor_ssse3) jmp .Ldone4 -ENDPROC(chacha20_4block_xor_ssse3) +ENDPROC(chacha_4block_xor_ssse3) diff --git a/arch/x86/crypto/chacha20_glue.c b/arch/x86/crypto/chacha_glue.c similarity index 51% rename from arch/x86/crypto/chacha20_glue.c rename to arch/x86/crypto/chacha_glue.c index 2bea425acb76..83cfb450b816 100644 --- a/arch/x86/crypto/chacha20_glue.c +++ b/arch/x86/crypto/chacha_glue.c @@ -1,5 +1,6 @@ /* - * ChaCha20 256-bit cipher algorithm, RFC7539, SIMD glue code + * x64 SIMD accelerated ChaCha and XChaCha stream ciphers, + * including ChaCha20 (RFC7539) * * Copyright (C) 2015 Martin Willi * @@ -17,120 +18,124 @@ #include #include -#define CHACHA20_STATE_ALIGN 16 +#define CHACHA_STATE_ALIGN 16 -asmlinkage void chacha20_block_xor_ssse3(u32 *state, u8 *dst, const u8 *src, - unsigned int len); -asmlinkage void chacha20_4block_xor_ssse3(u32 *state, u8 *dst, const u8 *src, - unsigned int len); -asmlinkage void hchacha20_block_ssse3(const u32 *state, u32 *out); +asmlinkage void chacha_block_xor_ssse3(u32 *state, u8 *dst, const u8 *src, + unsigned int len, int nrounds); +asmlinkage void chacha_4block_xor_ssse3(u32 *state, u8 *dst, const u8 *src, + unsigned int len, int nrounds); +asmlinkage void hchacha_block_ssse3(const u32 *state, u32 *out, int nrounds); #ifdef CONFIG_AS_AVX2 -asmlinkage void chacha20_2block_xor_avx2(u32 *state, u8 *dst, const u8 *src, - unsigned int len); -asmlinkage void chacha20_4block_xor_avx2(u32 *state, u8 *dst, const u8 *src, - unsigned int len); -asmlinkage void chacha20_8block_xor_avx2(u32 
*state, u8 *dst, const u8 *src, - unsigned int len); -static bool chacha20_use_avx2; +asmlinkage void chacha_2block_xor_avx2(u32 *state, u8 *dst, const u8 *src, + unsigned int len, int nrounds); +asmlinkage void chacha_4block_xor_avx2(u32 *state, u8 *dst, const u8 *src, + unsigned int len, int nrounds); +asmlinkage void chacha_8block_xor_avx2(u32 *state, u8 *dst, const u8 *src, + unsigned int len, int nrounds); +static bool chacha_use_avx2; #ifdef CONFIG_AS_AVX512 -asmlinkage void chacha20_2block_xor_avx512vl(u32 *state, u8 *dst, const u8 *src, - unsigned int len); -asmlinkage void chacha20_4block_xor_avx512vl(u32 *state, u8 *dst, const u8 *src, - unsigned int len); -asmlinkage void chacha20_8block_xor_avx512vl(u32 *state, u8 *dst, const u8 *src, - unsigned int len); -static bool chacha20_use_avx512vl; +asmlinkage void chacha_2block_xor_avx512vl(u32 *state, u8 *dst, const u8 *src, + unsigned int len, int nrounds); +asmlinkage void chacha_4block_xor_avx512vl(u32 *state, u8 *dst, const u8 *src, + unsigned int len, int nrounds); +asmlinkage void chacha_8block_xor_avx512vl(u32 *state, u8 *dst, const u8 *src, + unsigned int len, int nrounds); +static bool chacha_use_avx512vl; #endif #endif -static unsigned int chacha20_advance(unsigned int len, unsigned int maxblocks) +static unsigned int chacha_advance(unsigned int len, unsigned int maxblocks) { len = min(len, maxblocks * CHACHA_BLOCK_SIZE); return round_up(len, CHACHA_BLOCK_SIZE) / CHACHA_BLOCK_SIZE; } -static void chacha20_dosimd(u32 *state, u8 *dst, const u8 *src, - unsigned int bytes) +static void chacha_dosimd(u32 *state, u8 *dst, const u8 *src, + unsigned int bytes, int nrounds) { #ifdef CONFIG_AS_AVX2 #ifdef CONFIG_AS_AVX512 - if (chacha20_use_avx512vl) { + if (chacha_use_avx512vl) { while (bytes >= CHACHA_BLOCK_SIZE * 8) { - chacha20_8block_xor_avx512vl(state, dst, src, bytes); + chacha_8block_xor_avx512vl(state, dst, src, bytes, + nrounds); bytes -= CHACHA_BLOCK_SIZE * 8; src += CHACHA_BLOCK_SIZE * 8; dst += 
CHACHA_BLOCK_SIZE * 8; state[12] += 8; } if (bytes > CHACHA_BLOCK_SIZE * 4) { - chacha20_8block_xor_avx512vl(state, dst, src, bytes); - state[12] += chacha20_advance(bytes, 8); + chacha_8block_xor_avx512vl(state, dst, src, bytes, + nrounds); + state[12] += chacha_advance(bytes, 8); return; } if (bytes > CHACHA_BLOCK_SIZE * 2) { - chacha20_4block_xor_avx512vl(state, dst, src, bytes); - state[12] += chacha20_advance(bytes, 4); + chacha_4block_xor_avx512vl(state, dst, src, bytes, + nrounds); + state[12] += chacha_advance(bytes, 4); return; } if (bytes) { - chacha20_2block_xor_avx512vl(state, dst, src, bytes); - state[12] += chacha20_advance(bytes, 2); + chacha_2block_xor_avx512vl(state, dst, src, bytes, + nrounds); + state[12] += chacha_advance(bytes, 2); return; } } #endif - if (chacha20_use_avx2) { + if (chacha_use_avx2) { while (bytes >= CHACHA_BLOCK_SIZE * 8) { - chacha20_8block_xor_avx2(state, dst, src, bytes); + chacha_8block_xor_avx2(state, dst, src, bytes, nrounds); bytes -= CHACHA_BLOCK_SIZE * 8; src += CHACHA_BLOCK_SIZE * 8; dst += CHACHA_BLOCK_SIZE * 8; state[12] += 8; } if (bytes > CHACHA_BLOCK_SIZE * 4) { - chacha20_8block_xor_avx2(state, dst, src, bytes); - state[12] += chacha20_advance(bytes, 8); + chacha_8block_xor_avx2(state, dst, src, bytes, nrounds); + state[12] += chacha_advance(bytes, 8); return; } if (bytes > CHACHA_BLOCK_SIZE * 2) { - chacha20_4block_xor_avx2(state, dst, src, bytes); - state[12] += chacha20_advance(bytes, 4); + chacha_4block_xor_avx2(state, dst, src, bytes, nrounds); + state[12] += chacha_advance(bytes, 4); return; } if (bytes > CHACHA_BLOCK_SIZE) { - chacha20_2block_xor_avx2(state, dst, src, bytes); - state[12] += chacha20_advance(bytes, 2); + chacha_2block_xor_avx2(state, dst, src, bytes, nrounds); + state[12] += chacha_advance(bytes, 2); return; } } #endif while (bytes >= CHACHA_BLOCK_SIZE * 4) { - chacha20_4block_xor_ssse3(state, dst, src, bytes); + chacha_4block_xor_ssse3(state, dst, src, bytes, nrounds); bytes -= 
CHACHA_BLOCK_SIZE * 4; src += CHACHA_BLOCK_SIZE * 4; dst += CHACHA_BLOCK_SIZE * 4; state[12] += 4; } if (bytes > CHACHA_BLOCK_SIZE) { - chacha20_4block_xor_ssse3(state, dst, src, bytes); - state[12] += chacha20_advance(bytes, 4); + chacha_4block_xor_ssse3(state, dst, src, bytes, nrounds); + state[12] += chacha_advance(bytes, 4); return; } if (bytes) { - chacha20_block_xor_ssse3(state, dst, src, bytes); + chacha_block_xor_ssse3(state, dst, src, bytes, nrounds); state[12]++; } } -static int chacha20_simd_stream_xor(struct skcipher_request *req, - struct chacha_ctx *ctx, u8 *iv) +static int chacha_simd_stream_xor(struct skcipher_request *req, + struct chacha_ctx *ctx, u8 *iv) { u32 *state, state_buf[16 + 2] __aligned(8); struct skcipher_walk walk; int err; - BUILD_BUG_ON(CHACHA20_STATE_ALIGN != 16); - state = PTR_ALIGN(state_buf + 0, CHACHA20_STATE_ALIGN); + BUILD_BUG_ON(CHACHA_STATE_ALIGN != 16); + state = PTR_ALIGN(state_buf + 0, CHACHA_STATE_ALIGN); err = skcipher_walk_virt(&walk, req, false); @@ -143,8 +148,8 @@ static int chacha20_simd_stream_xor(struct skcipher_request *req, nbytes = round_down(nbytes, walk.stride); kernel_fpu_begin(); - chacha20_dosimd(state, walk.dst.virt.addr, walk.src.virt.addr, - nbytes); + chacha_dosimd(state, walk.dst.virt.addr, walk.src.virt.addr, + nbytes, ctx->nrounds); kernel_fpu_end(); err = skcipher_walk_done(&walk, walk.nbytes - nbytes); @@ -153,7 +158,7 @@ static int chacha20_simd_stream_xor(struct skcipher_request *req, return err; } -static int chacha20_simd(struct skcipher_request *req) +static int chacha_simd(struct skcipher_request *req) { struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req); struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm); @@ -161,10 +166,10 @@ static int chacha20_simd(struct skcipher_request *req) if (req->cryptlen <= CHACHA_BLOCK_SIZE || !irq_fpu_usable()) return crypto_chacha_crypt(req); - return chacha20_simd_stream_xor(req, ctx, req->iv); + return chacha_simd_stream_xor(req, ctx, req->iv); } 
-static int xchacha20_simd(struct skcipher_request *req) +static int xchacha_simd(struct skcipher_request *req) { struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req); struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm); @@ -175,17 +180,18 @@ static int xchacha20_simd(struct skcipher_request *req) if (req->cryptlen <= CHACHA_BLOCK_SIZE || !irq_fpu_usable()) return crypto_xchacha_crypt(req); - BUILD_BUG_ON(CHACHA20_STATE_ALIGN != 16); - state = PTR_ALIGN(state_buf + 0, CHACHA20_STATE_ALIGN); + BUILD_BUG_ON(CHACHA_STATE_ALIGN != 16); + state = PTR_ALIGN(state_buf + 0, CHACHA_STATE_ALIGN); crypto_chacha_init(state, ctx, req->iv); kernel_fpu_begin(); - hchacha20_block_ssse3(state, subctx.key); + hchacha_block_ssse3(state, subctx.key, ctx->nrounds); kernel_fpu_end(); + subctx.nrounds = ctx->nrounds; memcpy(&real_iv[0], req->iv + 24, 8); memcpy(&real_iv[8], req->iv + 16, 8); - return chacha20_simd_stream_xor(req, &subctx, real_iv); + return chacha_simd_stream_xor(req, &subctx, real_iv); } static struct skcipher_alg algs[] = { @@ -202,8 +208,8 @@ static struct skcipher_alg algs[] = { .ivsize = CHACHA_IV_SIZE, .chunksize = CHACHA_BLOCK_SIZE, .setkey = crypto_chacha20_setkey, - .encrypt = chacha20_simd, - .decrypt = chacha20_simd, + .encrypt = chacha_simd, + .decrypt = chacha_simd, }, { .base.cra_name = "xchacha20", .base.cra_driver_name = "xchacha20-simd", @@ -217,40 +223,40 @@ static struct skcipher_alg algs[] = { .ivsize = XCHACHA_IV_SIZE, .chunksize = CHACHA_BLOCK_SIZE, .setkey = crypto_chacha20_setkey, - .encrypt = xchacha20_simd, - .decrypt = xchacha20_simd, + .encrypt = xchacha_simd, + .decrypt = xchacha_simd, }, }; -static int __init chacha20_simd_mod_init(void) +static int __init chacha_simd_mod_init(void) { if (!boot_cpu_has(X86_FEATURE_SSSE3)) return -ENODEV; #ifdef CONFIG_AS_AVX2 - chacha20_use_avx2 = boot_cpu_has(X86_FEATURE_AVX) && - boot_cpu_has(X86_FEATURE_AVX2) && - cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL); + chacha_use_avx2 = 
boot_cpu_has(X86_FEATURE_AVX) &&
+			  boot_cpu_has(X86_FEATURE_AVX2) &&
+			  cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL);
 #ifdef CONFIG_AS_AVX512
-	chacha20_use_avx512vl = chacha20_use_avx2 &&
-				boot_cpu_has(X86_FEATURE_AVX512VL) &&
-				boot_cpu_has(X86_FEATURE_AVX512BW); /* kmovq */
+	chacha_use_avx512vl = chacha_use_avx2 &&
+			      boot_cpu_has(X86_FEATURE_AVX512VL) &&
+			      boot_cpu_has(X86_FEATURE_AVX512BW); /* kmovq */
 #endif
 #endif
 	return crypto_register_skciphers(algs, ARRAY_SIZE(algs));
 }
 
-static void __exit chacha20_simd_mod_fini(void)
+static void __exit chacha_simd_mod_fini(void)
 {
 	crypto_unregister_skciphers(algs, ARRAY_SIZE(algs));
 }
 
-module_init(chacha20_simd_mod_init);
-module_exit(chacha20_simd_mod_fini);
+module_init(chacha_simd_mod_init);
+module_exit(chacha_simd_mod_fini);
 
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Martin Willi ");
-MODULE_DESCRIPTION("chacha20 cipher algorithm, SIMD accelerated");
+MODULE_DESCRIPTION("ChaCha and XChaCha stream ciphers (x64 SIMD accelerated)");
 MODULE_ALIAS_CRYPTO("chacha20");
 MODULE_ALIAS_CRYPTO("chacha20-simd");
 MODULE_ALIAS_CRYPTO("xchacha20");

From patchwork Thu Nov 29 23:02:17 2018
X-Patchwork-Submitter: Eric Biggers
X-Patchwork-Id: 10705499
X-Patchwork-Delegate: herbert@gondor.apana.org.au
From: Eric Biggers
To: linux-crypto@vger.kernel.org
Cc: Paul Crowley, Martin Willi, Milan Broz, "Jason A . Donenfeld", linux-kernel@vger.kernel.org
Subject: [PATCH v2 6/6] crypto: x86/chacha - add XChaCha12 support
Date: Thu, 29 Nov 2018 15:02:17 -0800
Message-Id: <20181129230217.158038-7-ebiggers@kernel.org>
In-Reply-To: <20181129230217.158038-1-ebiggers@kernel.org>
References: <20181129230217.158038-1-ebiggers@kernel.org>

From: Eric Biggers

Now that the x86_64 SIMD implementations of ChaCha20 and XChaCha20 have
been refactored to support varying the number of rounds, add support for
XChaCha12.  This is identical to XChaCha20 except for the number of
rounds, which is 12 instead of 20.  This can be used by Adiantum.

Signed-off-by: Eric Biggers
Reviewed-by: Martin Willi
---
 arch/x86/crypto/chacha_glue.c | 17 +++++++++++++++++
 crypto/Kconfig                |  4 ++--
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/arch/x86/crypto/chacha_glue.c b/arch/x86/crypto/chacha_glue.c
index 83cfb450b816..3db775852205 100644
--- a/arch/x86/crypto/chacha_glue.c
+++ b/arch/x86/crypto/chacha_glue.c
@@ -225,6 +225,21 @@ static struct skcipher_alg algs[] = {
 		.setkey			= crypto_chacha20_setkey,
 		.encrypt		= xchacha_simd,
 		.decrypt		= xchacha_simd,
+	}, {
+		.base.cra_name		= "xchacha12",
+		.base.cra_driver_name	= "xchacha12-simd",
+		.base.cra_priority	= 300,
+		.base.cra_blocksize	= 1,
+		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
+		.base.cra_module	= THIS_MODULE,
+
+		.min_keysize		= CHACHA_KEY_SIZE,
+		.max_keysize		= CHACHA_KEY_SIZE,
+		.ivsize			= XCHACHA_IV_SIZE,
+		.chunksize		= CHACHA_BLOCK_SIZE,
+		.setkey			= crypto_chacha12_setkey,
+		.encrypt		= xchacha_simd,
+		.decrypt		= xchacha_simd,
 	},
 };
 
@@ -261,3 +276,5 @@ MODULE_ALIAS_CRYPTO("chacha20");
 MODULE_ALIAS_CRYPTO("chacha20-simd");
 MODULE_ALIAS_CRYPTO("xchacha20");
 MODULE_ALIAS_CRYPTO("xchacha20-simd");
+MODULE_ALIAS_CRYPTO("xchacha12");
+MODULE_ALIAS_CRYPTO("xchacha12-simd");
diff --git a/crypto/Kconfig b/crypto/Kconfig
index df466771e9bf..29865c599b04 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1473,8 +1473,8 @@ config CRYPTO_CHACHA20_X86_64
 	select CRYPTO_BLKCIPHER
 	select CRYPTO_CHACHA20
 	help
-	  SSSE3, AVX2, and AVX-512VL optimized implementations of the ChaCha20
-	  and XChaCha20 stream ciphers.
+	  SSSE3, AVX2, and AVX-512VL optimized implementations of the ChaCha20,
+	  XChaCha20, and XChaCha12 stream ciphers.
 
 config CRYPTO_SEED
 	tristate "SEED cipher algorithm"
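
Illustration (not part of the patch): the commit message says XChaCha12
is identical to XChaCha20 except for the number of rounds. A minimal
userspace sketch of that idea follows, assuming the standard ChaCha
quarter round from RFC 7539. The names rotl32, chacha_quarter_round and
chacha_permute_sketch are hypothetical and are not the kernel's API; the
sketch covers only the core permutation, not key setup, the HChaCha
subkey derivation, or the SIMD code paths.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static inline uint32_t rotl32(uint32_t v, int n)
{
	return (v << n) | (v >> (32 - n));
}

/* Standard ChaCha quarter round on state words a, b, c, d. */
static void chacha_quarter_round(uint32_t x[16], int a, int b, int c, int d)
{
	x[a] += x[b]; x[d] = rotl32(x[d] ^ x[a], 16);
	x[c] += x[d]; x[b] = rotl32(x[b] ^ x[c], 12);
	x[a] += x[b]; x[d] = rotl32(x[d] ^ x[a], 8);
	x[c] += x[d]; x[b] = rotl32(x[b] ^ x[c], 7);
}

/*
 * Hypothetical helper: apply the ChaCha permutation in place.
 * nrounds is assumed to be even; 20 gives the ChaCha20 permutation and
 * 12 gives the ChaCha12 one. Nothing else differs between the two.
 */
static void chacha_permute_sketch(uint32_t x[16], int nrounds)
{
	int i;

	for (i = 0; i < nrounds; i += 2) {
		/* Column round */
		chacha_quarter_round(x, 0, 4, 8, 12);
		chacha_quarter_round(x, 1, 5, 9, 13);
		chacha_quarter_round(x, 2, 6, 10, 14);
		chacha_quarter_round(x, 3, 7, 11, 15);
		/* Diagonal round */
		chacha_quarter_round(x, 0, 5, 10, 15);
		chacha_quarter_round(x, 1, 6, 11, 12);
		chacha_quarter_round(x, 2, 7, 8, 13);
		chacha_quarter_round(x, 3, 4, 9, 14);
	}
}
```

With this shape, a caller picks the variant purely through the round
count, e.g. chacha_permute_sketch(state, 12) versus
chacha_permute_sketch(state, 20). This is presumably why the new
"xchacha12" entry can reuse xchacha_simd for .encrypt/.decrypt and
differs from the "xchacha20" entry only in its names and its .setkey
hook.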