From patchwork Fri Aug 31 08:01:40 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Eric Biggers X-Patchwork-Id: 10583287 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id C99EB14BD for ; Fri, 31 Aug 2018 08:06:30 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id B94FA2B255 for ; Fri, 31 Aug 2018 08:06:30 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id AD93E2B565; Fri, 31 Aug 2018 08:06:30 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_NONE autolearn=unavailable version=3.3.1 Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 839BB2B255 for ; Fri, 31 Aug 2018 08:06:29 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20170209; h=Sender: Content-Transfer-Encoding:Content-Type:MIME-Version:Cc:List-Subscribe: List-Help:List-Post:List-Archive:List-Unsubscribe:List-Id:Message-Id:Date: Subject:To:From:Reply-To:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:In-Reply-To: References:List-Owner; bh=Id0mHfUou3NPshYX8KOTkEaYfNgvsHbQ5RGLHLiPRBA=; b=s5U WmVVbgVooE5+oZqR2W5zrtTPL3S1pCbG2Ylj2yL53BwdRIRw0Tcchg6cFg8dO0k1bg/yrk9XUTVsK nnh0B6cz4VT8q4hzwhDAmSta+73w3B8nlF99s3oP2vVLiV73KdYmwoO8uhfUlgRg/0FpB6tHDstZb Rp/BOIMBoIH2qlSKEnsPHWI9cI+A66sqopaPONA6ojQ2iP94Q2l/j3UwXkQ0UvjVpANph+8DWdBLA /3U7iKuy7yBlwUgASO5J9+Tvo7lPnmgQM0l5H07b43m+xDoUsQ/JNcveNkYxDaWcbrzQ7uBPUc02D nyg7u90PAXxvHqLC7BLAw2PcZhu3pnA==; Received: from localhost ([127.0.0.1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.90_1 #2 (Red Hat Linux)) id 1fveRh-00015H-Kh; Fri, 31 Aug 2018 08:06:21 +0000 Received: from mail.kernel.org ([198.145.29.99]) by bombadil.infradead.org with esmtps (Exim 4.90_1 #2 (Red Hat Linux)) id 1fveOb-0006XK-BG for linux-arm-kernel@lists.infradead.org; Fri, 31 Aug 2018 08:03:51 +0000 Received: from sol.localdomain (c-67-185-97-198.hsd1.wa.comcast.net [67.185.97.198]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 8E43A2083A; Fri, 31 Aug 2018 08:02:57 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1535702577; bh=GI8/1bB7uRn4oOTAgjw0GZ/qkembN5sv02vte6s5XzU=; h=From:To:Cc:Subject:Date:From; b=ccMKJSDdF4oFsTlJ3KpkaEP+KolSWVznX2bdN6zonovAXoeglAGrhr7MzXPXo78vp 3f3GlkHkoTd1fthV5qZn/HQRvg4626xdkuIDcVjMJFXu5HniuaLyDMI4po8339ret/ qWtqS5sfYI4bzumTZJwrb9VZzVJdlPXdF3wMEfgI= From: Eric Biggers To: linux-crypto@vger.kernel.org, Herbert Xu Subject: [PATCH] crypto: arm/chacha20 - faster 8-bit rotations and other optimizations Date: Fri, 31 Aug 2018 01:01:40 -0700 Message-Id: <20180831080140.20553-1-ebiggers@kernel.org> X-Mailer: git-send-email 2.18.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20180831_010309_510558_ADFED8F0 X-CRM114-Status: GOOD ( 13.49 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: linux-arm-kernel@lists.infradead.org, Ard Biesheuvel MIME-Version: 1.0 Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org X-Virus-Scanned: ClamAV using ClamSMTP From: Eric Biggers Optimize ChaCha20 NEON performance by: - Implementing the 8-bit rotations using the 'vtbl.8' instruction. - Streamlining the part that adds the original state and XORs the data. - Making some other small tweaks. On ARM Cortex-A7, these optimizations improve ChaCha20 performance from about 11.9 cycles per byte to 11.3. There is a tradeoff involved with the 'vtbl.8' rotation method since there is at least one CPU where it's not fastest. But it seems to be a better default; see the added comment. Signed-off-by: Eric Biggers --- arch/arm/crypto/chacha20-neon-core.S | 289 ++++++++++++++------------- 1 file changed, 147 insertions(+), 142 deletions(-) diff --git a/arch/arm/crypto/chacha20-neon-core.S b/arch/arm/crypto/chacha20-neon-core.S index 3fecb2124c35a..d381cebaba31d 100644 --- a/arch/arm/crypto/chacha20-neon-core.S +++ b/arch/arm/crypto/chacha20-neon-core.S @@ -18,6 +18,33 @@ * (at your option) any later version. */ + /* + * NEON doesn't have a rotate instruction. The alternatives are, more or less: + * + * (a) vshl.u32 + vsri.u32 (needs temporary register) + * (b) vshl.u32 + vshr.u32 + vorr (needs temporary register) + * (c) vrev32.16 (16-bit rotations only) + * (d) vtbl.8 + vtbl.8 (multiple of 8 bits rotations only, + * needs index vector) + * + * ChaCha20 has 16, 12, 8, and 7-bit rotations. For the 12 and 7-bit + * rotations, the only choices are (a) and (b). We use (a) since it takes + * two-thirds the cycles of (b) on both Cortex-A7 and Cortex-A53. + * + * For the 16-bit rotation, we use vrev32.16 since it's consistently fastest + * and doesn't need a temporary register. + * + * For the 8-bit rotation, we use vtbl.8 + vtbl.8. On Cortex-A7, this sequence + * is twice as fast as (a), even when doing (a) on multiple registers + * simultaneously to eliminate the stall between vshl and vsri. Also, it + * parallelizes better when temporary registers are scarce. + * + * A disadvantage is that on Cortex-A53, the vtbl sequence is the same speed as + * (a), so the need to load the rotation table actually makes the vtbl method + * slightly slower overall on that CPU. Still, it seems to be a good + * compromise to get a significant speed boost on some CPUs. + */ + #include .text @@ -46,6 +73,9 @@ ENTRY(chacha20_block_xor_neon) vmov q10, q2 vmov q11, q3 + ldr ip, =.Lrol8_table + vld1.8 {d10}, [ip, :64] + mov r3, #10 .Ldoubleround: @@ -63,9 +93,9 @@ ENTRY(chacha20_block_xor_neon) // x0 += x1, x3 = rotl32(x3 ^ x0, 8) vadd.i32 q0, q0, q1 - veor q4, q3, q0 - vshl.u32 q3, q4, #8 - vsri.u32 q3, q4, #24 + veor q3, q3, q0 + vtbl.8 d6, {d6}, d10 + vtbl.8 d7, {d7}, d10 // x2 += x3, x1 = rotl32(x1 ^ x2, 7) vadd.i32 q2, q2, q3 @@ -94,9 +124,9 @@ ENTRY(chacha20_block_xor_neon) // x0 += x1, x3 = rotl32(x3 ^ x0, 8) vadd.i32 q0, q0, q1 - veor q4, q3, q0 - vshl.u32 q3, q4, #8 - vsri.u32 q3, q4, #24 + veor q3, q3, q0 + vtbl.8 d6, {d6}, d10 + vtbl.8 d7, {d7}, d10 // x2 += x3, x1 = rotl32(x1 ^ x2, 7) vadd.i32 q2, q2, q3 @@ -143,11 +173,11 @@ ENDPROC(chacha20_block_xor_neon) .align 5 ENTRY(chacha20_4block_xor_neon) - push {r4-r6, lr} - mov ip, sp // preserve the stack pointer - sub r3, sp, #0x20 // allocate a 32 byte buffer - bic r3, r3, #0x1f // aligned to 32 bytes - mov sp, r3 + push {r4} + mov r4, sp // preserve the stack pointer + sub ip, sp, #0x20 // allocate a 32 byte buffer + bic ip, ip, #0x1f // aligned to 32 bytes + mov sp, ip // r0: Input state matrix, s // r1: 4 data blocks output, o @@ -157,25 +187,24 @@ ENTRY(chacha20_4block_xor_neon) // This function encrypts four consecutive ChaCha20 blocks by loading // the state matrix in NEON registers four times. The algorithm performs // each operation on the corresponding word of each state matrix, hence - // requires no word shuffling. For final XORing step we transpose the - // matrix by interleaving 32- and then 64-bit words, which allows us to - // do XOR in NEON registers. + // requires no word shuffling. The words are re-interleaved before the + // final addition of the original state and the XORing step. // - // x0..15[0-3] = s0..3[0..3] - add r3, r0, #0x20 + // x0..15[0-3] = s0..15[0-3] + add ip, r0, #0x20 vld1.32 {q0-q1}, [r0] - vld1.32 {q2-q3}, [r3] + vld1.32 {q2-q3}, [ip] - adr r3, CTRINC + adr ip, .Lctrinc1 vdup.32 q15, d7[1] vdup.32 q14, d7[0] - vld1.32 {q11}, [r3, :128] + vld1.32 {q4}, [ip, :128] vdup.32 q13, d6[1] vdup.32 q12, d6[0] - vadd.i32 q12, q12, q11 // x12 += counter values 0-3 vdup.32 q11, d5[1] vdup.32 q10, d5[0] + vadd.u32 q12, q12, q4 // x12 += counter values 0-3 vdup.32 q9, d4[1] vdup.32 q8, d4[0] vdup.32 q7, d3[1] @@ -187,6 +216,7 @@ ENTRY(chacha20_4block_xor_neon) vdup.32 q1, d0[1] vdup.32 q0, d0[0] + adr ip, .Lrol8_table mov r3, #10 .Ldoubleround4: @@ -238,24 +268,25 @@ ENTRY(chacha20_4block_xor_neon) // x1 += x5, x13 = rotl32(x13 ^ x1, 8) // x2 += x6, x14 = rotl32(x14 ^ x2, 8) // x3 += x7, x15 = rotl32(x15 ^ x3, 8) + vld1.8 {d16}, [ip, :64] vadd.i32 q0, q0, q4 vadd.i32 q1, q1, q5 vadd.i32 q2, q2, q6 vadd.i32 q3, q3, q7 - veor q8, q12, q0 - veor q9, q13, q1 - vshl.u32 q12, q8, #8 - vshl.u32 q13, q9, #8 - vsri.u32 q12, q8, #24 - vsri.u32 q13, q9, #24 + veor q12, q12, q0 + veor q13, q13, q1 + veor q14, q14, q2 + veor q15, q15, q3 - veor q8, q14, q2 - veor q9, q15, q3 - vshl.u32 q14, q8, #8 - vshl.u32 q15, q9, #8 - vsri.u32 q14, q8, #24 - vsri.u32 q15, q9, #24 + vtbl.8 d24, {d24}, d16 + vtbl.8 d25, {d25}, d16 + vtbl.8 d26, {d26}, d16 + vtbl.8 d27, {d27}, d16 + vtbl.8 d28, {d28}, d16 + vtbl.8 d29, {d29}, d16 + vtbl.8 d30, {d30}, d16 + vtbl.8 d31, {d31}, d16 vld1.32 {q8-q9}, [sp, :256] @@ -334,24 +365,25 @@ ENTRY(chacha20_4block_xor_neon) // x1 += x6, x12 = rotl32(x12 ^ x1, 8) // x2 += x7, x13 = rotl32(x13 ^ x2, 8) // x3 += x4, x14 = rotl32(x14 ^ x3, 8) + vld1.8 {d16}, [ip, :64] vadd.i32 q0, q0, q5 vadd.i32 q1, q1, q6 vadd.i32 q2, q2, q7 vadd.i32 q3, q3, q4 - veor q8, q15, q0 - veor q9, q12, q1 - vshl.u32 q15, q8, #8 - vshl.u32 q12, q9, #8 - vsri.u32 q15, q8, #24 - vsri.u32 q12, q9, #24 + veor q15, q15, q0 + veor q12, q12, q1 + veor q13, q13, q2 + veor q14, q14, q3 - veor q8, q13, q2 - veor q9, q14, q3 - vshl.u32 q13, q8, #8 - vshl.u32 q14, q9, #8 - vsri.u32 q13, q8, #24 - vsri.u32 q14, q9, #24 + vtbl.8 d30, {d30}, d16 + vtbl.8 d31, {d31}, d16 + vtbl.8 d24, {d24}, d16 + vtbl.8 d25, {d25}, d16 + vtbl.8 d26, {d26}, d16 + vtbl.8 d27, {d27}, d16 + vtbl.8 d28, {d28}, d16 + vtbl.8 d29, {d29}, d16 vld1.32 {q8-q9}, [sp, :256] @@ -381,104 +413,74 @@ ENTRY(chacha20_4block_xor_neon) vsri.u32 q6, q9, #25 subs r3, r3, #1 - beq 0f - - vld1.32 {q8-q9}, [sp, :256] - b .Ldoubleround4 - - // x0[0-3] += s0[0] - // x1[0-3] += s0[1] - // x2[0-3] += s0[2] - // x3[0-3] += s0[3] -0: ldmia r0!, {r3-r6} - vdup.32 q8, r3 - vdup.32 q9, r4 - vadd.i32 q0, q0, q8 - vadd.i32 q1, q1, q9 - vdup.32 q8, r5 - vdup.32 q9, r6 - vadd.i32 q2, q2, q8 - vadd.i32 q3, q3, q9 - - // x4[0-3] += s1[0] - // x5[0-3] += s1[1] - // x6[0-3] += s1[2] - // x7[0-3] += s1[3] - ldmia r0!, {r3-r6} - vdup.32 q8, r3 - vdup.32 q9, r4 - vadd.i32 q4, q4, q8 - vadd.i32 q5, q5, q9 - vdup.32 q8, r5 - vdup.32 q9, r6 - vadd.i32 q6, q6, q8 - vadd.i32 q7, q7, q9 - - // interleave 32-bit words in state n, n+1 - vzip.32 q0, q1 - vzip.32 q2, q3 - vzip.32 q4, q5 - vzip.32 q6, q7 - - // interleave 64-bit words in state n, n+2 - vswp d1, d4 - vswp d3, d6 - vswp d9, d12 - vswp d11, d14 - - // xor with corresponding input, write to output - vld1.8 {q8-q9}, [r2]! - veor q8, q8, q0 - veor q9, q9, q4 - vst1.8 {q8-q9}, [r1]! - vld1.32 {q8-q9}, [sp, :256] - - // x8[0-3] += s2[0] - // x9[0-3] += s2[1] - // x10[0-3] += s2[2] - // x11[0-3] += s2[3] - ldmia r0!, {r3-r6} - vdup.32 q0, r3 - vdup.32 q4, r4 - vadd.i32 q8, q8, q0 - vadd.i32 q9, q9, q4 - vdup.32 q0, r5 - vdup.32 q4, r6 - vadd.i32 q10, q10, q0 - vadd.i32 q11, q11, q4 - - // x12[0-3] += s3[0] - // x13[0-3] += s3[1] - // x14[0-3] += s3[2] - // x15[0-3] += s3[3] - ldmia r0!, {r3-r6} - vdup.32 q0, r3 - vdup.32 q4, r4 - adr r3, CTRINC - vadd.i32 q12, q12, q0 - vld1.32 {q0}, [r3, :128] - vadd.i32 q13, q13, q4 - vadd.i32 q12, q12, q0 // x12 += counter values 0-3 - - vdup.32 q0, r5 - vdup.32 q4, r6 - vadd.i32 q14, q14, q0 - vadd.i32 q15, q15, q4 - - // interleave 32-bit words in state n, n+1 - vzip.32 q8, q9 - vzip.32 q10, q11 - vzip.32 q12, q13 - vzip.32 q14, q15 - - // interleave 64-bit words in state n, n+2 - vswp d17, d20 - vswp d19, d22 - vswp d25, d28 + bne .Ldoubleround4 + + // re-interleave the 32-bit words of the four blocks + vzip.32 q14, q15 // => (14 15 14 15) (14 15 14 15) + vzip.32 q12, q13 // => (12 13 12 13) (12 13 12 13) + vzip.32 q10, q11 // => (10 11 10 11) (10 11 10 11) + vzip.32 q8, q9 // => (8 9 8 9) (8 9 8 9) + vzip.32 q6, q7 // => (6 7 6 7) (6 7 6 7) + vzip.32 q4, q5 // => (4 5 4 5) (4 5 4 5) + vzip.32 q2, q3 // => (2 3 2 3) (2 3 2 3) + vzip.32 q0, q1 // => (0 1 0 1) (0 1 0 1) vswp d27, d30 + vswp d25, d28 + vswp d19, d22 + vswp d17, d20 + vswp d11, d14 + vst1.32 {q14-q15}, [sp, :256] // free up two registers + vswp d9, d12 + vswp d3, d6 + vld1.32 {q14-q15}, [r0]! // load s0..7 + vswp d1, d4 - vmov q4, q1 + // Swap q1 and q4 so that we'll free up consecutive registers (q0-q1) + // after XORing the first 32 bytes. + vswp q1, q4 + + // blocks are: (q0 q1 q8 q12) (q2 q6 q10 q14) (q4 q5 q9 q13) (q3 q7 q11 q15) + + // x0..3[0-3] += s0..3[0-3] (add orig state to 1st row of each block) + vadd.u32 q0, q0, q14 + vadd.u32 q2, q2, q14 + vadd.u32 q4, q4, q14 + vadd.u32 q3, q3, q14 + + // x4..7[0-3] += s4..7[0-3] (add orig state to 2nd row of each block) + vadd.u32 q1, q1, q15 + vadd.u32 q6, q6, q15 + vadd.u32 q5, q5, q15 + vadd.u32 q7, q7, q15 + + // XOR first 32 bytes using keystream from first two rows of first block + vld1.8 {q14-q15}, [r2]! + veor q14, q14, q0 + veor q15, q15, q1 + vld1.32 {q0-q1}, [r0] // load s8..s15 + vst1.8 {q14-q15}, [r1]! + + // x8..11[0-3] += s8..11[0-3] (add orig state to 3rd row of each block) + vadd.u32 q8, q8, q0 + vadd.u32 q10, q10, q0 + vld1.32 {q14-q15}, [sp, :256] + vadd.u32 q9, q9, q0 + adr ip, .Lctrinc2 + vadd.u32 q11, q11, q0 + + // x12..15[0-3] += s12..15[0-3] (add orig state to 4th row of each block) + vld1.32 {q0}, [ip, :128] // load (1, 0, 2, 0) + vadd.u32 q15, q15, q1 + vadd.u32 q13, q13, q1 + vadd.u32 q14, q14, q1 + vadd.u32 q12, q12, q1 + vadd.u32 d30, d30, d1 // x12[3] += 2 + vadd.u32 d26, d26, d1 // x12[2] += 2 + vadd.u32 d28, d28, d0 // x12[1] += 1 + vadd.u32 d30, d30, d0 // x12[3] += 1 + + // XOR the rest of the data with the keystream vld1.8 {q0-q1}, [r2]! veor q0, q0, q8 @@ -511,13 +513,16 @@ ENTRY(chacha20_4block_xor_neon) vst1.8 {q0-q1}, [r1]! vld1.8 {q0-q1}, [r2] + mov sp, r4 // restore original stack pointer veor q0, q0, q11 veor q1, q1, q15 vst1.8 {q0-q1}, [r1] - mov sp, ip - pop {r4-r6, pc} + pop {r4} + bx lr ENDPROC(chacha20_4block_xor_neon) .align 4 -CTRINC: .word 0, 1, 2, 3 +.Lctrinc1: .word 0, 1, 2, 3 +.Lctrinc2: .word 1, 0, 2, 0 +.Lrol8_table: .byte 3, 0, 1, 2, 7, 4, 5, 6