| Message ID | 20241105160859.1459261-8-ardb+git@google.com (mailing list archive) |
|---|---|
| Series | Clean up and improve ARM/arm64 CRC-T10DIF code |
On Tue, Nov 05, 2024 at 05:09:00PM +0100, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
[...]

Reviewed-by: Eric Biggers <ebiggers@google.com>

- Eric
On Tue, Nov 05, 2024 at 05:09:00PM +0100, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
[...]

All applied.  Thanks.
From: Ard Biesheuvel <ardb@kernel.org>

I realized that the generic sequence implementing 64x64 polynomial
multiply using 8x8 PMULL instructions, which is used in the CRC-T10DIF
code to implement a fallback version for cores that lack the 64x64 PMULL
instruction, is not very efficient.

The folding coefficients that are used when processing the bulk of the
data are only 16 bits wide, and so 3/4 of the partial results of all those
8x8->16 bit multiplications do not contribute anything to the end result.

This means we can use a much faster implementation, producing a speedup
of 3.3x on Cortex-A72 without Crypto Extensions (Raspberry Pi 4).

The same logic can be ported to 32-bit ARM too, where it produces a
speedup of 6.6x compared with the generic C implementation on the same
platform.

Changes since v1:
- fix bug introduced in refactoring
- add asm comments to explain the fallback algorithm
- type 'u8 *out' parameter as 'u8 out[16]'
- avoid asm code for 16 byte inputs (a higher threshold might be more
  appropriate but 16 is nonsensical given that the folding routine
  returns a 16 byte output)

Ard Biesheuvel (6):
  crypto: arm64/crct10dif - Remove obsolete chunking logic
  crypto: arm64/crct10dif - Use faster 16x64 bit polynomial multiply
  crypto: arm64/crct10dif - Remove remaining 64x64 PMULL fallback code
  crypto: arm/crct10dif - Use existing mov_l macro instead of __adrl
  crypto: arm/crct10dif - Macroify PMULL asm code
  crypto: arm/crct10dif - Implement plain NEON variant

 arch/arm/crypto/crct10dif-ce-core.S   | 249 ++++++++++-----
 arch/arm/crypto/crct10dif-ce-glue.c   |  55 +++-
 arch/arm64/crypto/crct10dif-ce-core.S | 335 +++++++++-----------
 arch/arm64/crypto/crct10dif-ce-glue.c |  48 ++-
 4 files changed, 376 insertions(+), 311 deletions(-)
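To make the 3/4 figure concrete, here is a minimal scalar sketch (illustrative only, not the kernel code: `clmul8`, the sample `data` word and the coefficient value are placeholders for this example, not the real CRC-T10DIF constants). It models the schoolbook decomposition of a 64x64 carryless multiply into 8x8->16 bit partial products, which is what the old 8-bit PMULL fallback amounts to, and counts how many of those products can be nonzero when one operand is only a 16-bit folding coefficient.

```c
#include <stdint.h>
#include <stdio.h>

/* One 8x8 -> 16 bit carryless (GF(2)) multiply, i.e. a single 8-bit PMULL lane. */
static uint16_t clmul8(uint8_t a, uint8_t b)
{
	uint16_t r = 0;

	for (int i = 0; i < 8; i++)
		if (b & (1u << i))
			r ^= (uint16_t)a << i;
	return r;
}

int main(void)
{
	uint64_t data = 0x0123456789abcdefULL;	/* arbitrary 64-bit input word */
	uint64_t coeff = 0x1faa;		/* placeholder 16-bit folding coefficient */
	int nonzero = 0;

	/* Schoolbook 64x64 product: one 8x8 multiply per pair of operand bytes. */
	for (int i = 0; i < 8; i++)
		for (int j = 0; j < 8; j++)
			if (clmul8(data >> (8 * i), coeff >> (8 * j)))
				nonzero++;

	/* Only 2 of the 8 coefficient bytes are nonzero -> at most 16 of 64 products. */
	printf("nonzero 8x8 partial products: %d of 64\n", nonzero);
	return 0;
}
```

Only the two coefficient bytes contribute, so at most 16 of the 64 partial products are nonzero and the remaining three quarters are wasted work; a 16x64 bit polynomial multiply, as introduced by the second arm64 patch in the series, only has to form the products that actually contribute, which is where the speedup comes from.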