crypto: arm/chacha20 - always use vrev for 16-bit rotates

Message ID	20180725012907.1614-1-ebiggers3@gmail.com (mailing list archive)
State	Accepted
Delegated to:	Herbert Xu
Headers
Return-Path: <linux-crypto-owner@kernel.org>
From: Eric Biggers <ebiggers3@gmail.com>
To: linux-crypto@vger.kernel.org,
        Herbert Xu <herbert@gondor.apana.org.au>
Cc: linux-arm-kernel@lists.infradead.org,
        Ard Biesheuvel <ard.biesheuvel@linaro.org>,
        Eric Biggers <ebiggers@google.com>
Subject: [PATCH] crypto: arm/chacha20 - always use vrev for 16-bit rotates
Date: Tue, 24 Jul 2018 18:29:07 -0700
Message-Id: <20180725012907.1614-1-ebiggers3@gmail.com>
Sender: linux-crypto-owner@vger.kernel.org
Precedence: bulk
Series	crypto: arm/chacha20 - always use vrev for 16-bit rotates

Commit Message

Eric Biggers July 25, 2018, 1:29 a.m. UTC

From: Eric Biggers <ebiggers@google.com>

The 4-way ChaCha20 NEON code implements 16-bit rotates with vrev32.16,
but the one-way code (used on remainder blocks) implements it with
vshl + vsri, which is slower.  Switch the one-way code to vrev32.16 too.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/arm/crypto/chacha20-neon-core.S | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)
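
The equivalence the commit message relies on can be checked with a scalar model: rotating a 32-bit word by 16 bits is the same as swapping its two 16-bit halves, which is what vrev32.16 does in each 32-bit lane. The following is a minimal C sketch of one lane, not the kernel code; the function names are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Generic 32-bit left rotate; n must be in the range 1..31. */
static uint32_t rotl32(uint32_t x, unsigned int n)
{
	return (x << n) | (x >> (32 - n));
}

/* One-lane model of the old sequence: vshl.u32 #16 then vsri.u32 #16. */
static uint32_t rot16_shift_insert(uint32_t x)
{
	uint32_t r = x << 16;                 /* vshl.u32: shift left by 16 */

	/* vsri.u32: keep the top 16 bits of r, insert x >> 16 below them */
	r = (r & 0xffff0000u) | (x >> 16);
	return r;
}

/* One-lane model of vrev32.16: swap the two 16-bit halves of the word. */
static uint32_t rot16_rev(uint32_t x)
{
	return (x >> 16) | (x << 16);
}

/* Returns 1 if both sequences agree with a plain 16-bit rotate of x. */
int rot16_variants_agree(uint32_t x)
{
	uint32_t expected = rotl32(x, 16);

	assert(rot16_rev(x) == expected);
	return rot16_shift_insert(x) == expected && rot16_rev(x) == expected;
}
```

In the diff, the vrev32.16 form also lets veor write its result straight into q3, so the temporary q4 is no longer needed for this step.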

Comments

Ard Biesheuvel July 25, 2018, 6:18 a.m. UTC | #1

On 25 July 2018 at 03:29, Eric Biggers <ebiggers3@gmail.com> wrote:
> From: Eric Biggers <ebiggers@google.com>
>
> The 4-way ChaCha20 NEON code implements 16-bit rotates with vrev32.16,
> but the one-way code (used on remainder blocks) implements it with
> vshl + vsri, which is slower.  Switch the one-way code to vrev32.16 too.
>
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

> ---
>  arch/arm/crypto/chacha20-neon-core.S | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm/crypto/chacha20-neon-core.S b/arch/arm/crypto/chacha20-neon-core.S
> index 3fecb2124c35..451a849ad518 100644
> --- a/arch/arm/crypto/chacha20-neon-core.S
> +++ b/arch/arm/crypto/chacha20-neon-core.S
> @@ -51,9 +51,8 @@ ENTRY(chacha20_block_xor_neon)
>  .Ldoubleround:
>         // x0 += x1, x3 = rotl32(x3 ^ x0, 16)
>         vadd.i32        q0, q0, q1
> -       veor            q4, q3, q0
> -       vshl.u32        q3, q4, #16
> -       vsri.u32        q3, q4, #16
> +       veor            q3, q3, q0
> +       vrev32.16       q3, q3
>
>         // x2 += x3, x1 = rotl32(x1 ^ x2, 12)
>         vadd.i32        q2, q2, q3
> @@ -82,9 +81,8 @@ ENTRY(chacha20_block_xor_neon)
>
>         // x0 += x1, x3 = rotl32(x3 ^ x0, 16)
>         vadd.i32        q0, q0, q1
> -       veor            q4, q3, q0
> -       vshl.u32        q3, q4, #16
> -       vsri.u32        q3, q4, #16
> +       veor            q3, q3, q0
> +       vrev32.16       q3, q3
>
>         // x2 += x3, x1 = rotl32(x1 ^ x2, 12)
>         vadd.i32        q2, q2, q3
> --
> 2.18.0
>

Herbert Xu Aug. 3, 2018, 1:59 p.m. UTC | #2

On Tue, Jul 24, 2018 at 06:29:07PM -0700, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
> 
> The 4-way ChaCha20 NEON code implements 16-bit rotates with vrev32.16,
> but the one-way code (used on remainder blocks) implements it with
> vshl + vsri, which is slower.  Switch the one-way code to vrev32.16 too.
> 
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Patch applied.  Thanks.

diff --git a/arch/arm/crypto/chacha20-neon-core.S b/arch/arm/crypto/chacha20-neon-core.S
index 3fecb2124c35..451a849ad518 100644
--- a/arch/arm/crypto/chacha20-neon-core.S
+++ b/arch/arm/crypto/chacha20-neon-core.S
@@ -51,9 +51,8 @@  ENTRY(chacha20_block_xor_neon)
 .Ldoubleround:
 	// x0 += x1, x3 = rotl32(x3 ^ x0, 16)
 	vadd.i32	q0, q0, q1
-	veor		q4, q3, q0
-	vshl.u32	q3, q4, #16
-	vsri.u32	q3, q4, #16
+	veor		q3, q3, q0
+	vrev32.16	q3, q3
 
 	// x2 += x3, x1 = rotl32(x1 ^ x2, 12)
 	vadd.i32	q2, q2, q3
@@ -82,9 +81,8 @@  ENTRY(chacha20_block_xor_neon)
 
 	// x0 += x1, x3 = rotl32(x3 ^ x0, 16)
 	vadd.i32	q0, q0, q1
-	veor		q4, q3, q0
-	vshl.u32	q3, q4, #16
-	vsri.u32	q3, q4, #16
+	veor		q3, q3, q0
+	vrev32.16	q3, q3
 
 	// x2 += x3, x1 = rotl32(x1 ^ x2, 12)
 	vadd.i32	q2, q2, q3
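
The comments in the diff ("x0 += x1, x3 = rotl32(x3 ^ x0, 16)" and so on) are the first steps of the ChaCha quarter round. For orientation, here is a scalar C sketch of the full quarter round as specified in RFC 7539; it is an illustration only, not the NEON implementation, and the function names are hypothetical. In the one-way NEON code each q register holds a row of four state words, so each vector instruction performs this step for four quarter rounds at once.

```c
#include <stdint.h>

/* 32-bit left rotate; n must be in the range 1..31. */
static uint32_t rotl32(uint32_t x, unsigned int n)
{
	return (x << n) | (x >> (32 - n));
}

/*
 * Scalar ChaCha quarter round on four state words.
 * The first line is the step changed by the patch: the rotate by 16
 * is the one that can be done with a 16-bit half swap.
 */
void chacha_quarter_round(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
	*a += *b; *d = rotl32(*d ^ *a, 16);
	*c += *d; *b = rotl32(*b ^ *c, 12);
	*a += *b; *d = rotl32(*d ^ *a, 8);
	*c += *d; *b = rotl32(*b ^ *c, 7);
}
```

The quarter-round test vector in RFC 7539, section 2.1.1 can be used to check a sketch like this one.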