
[v2,2/2] crypto: Fix out-of-bounds access of the AAD buffer in generic-gcm-aesni

Message ID: 20171220044259.61106-3-junaids@google.com (mailing list archive)
State: Changes Requested
Delegated to: Herbert Xu

Commit Message

Junaid Shahid Dec. 20, 2017, 4:42 a.m. UTC
The aesni_gcm_enc/dec functions can access memory after the end of
the AAD buffer if the AAD length is not a multiple of 4 bytes.
It didn't matter with rfc4106-gcm-aesni, as in that case the AAD was
always followed by the 8-byte IV, but that is no longer the case with
generic-gcm-aesni. This can potentially result in accessing a page that
is not mapped, and thus crashing the machine. This patch fixes that by
reading the last <16-byte block of the AAD byte-by-byte, optionally
preceded by an 8-byte load when the block is at least 8 bytes.

Fixes: 0487ccac ("crypto: aesni - make non-AVX AES-GCM work with any aadlen")
Signed-off-by: Junaid Shahid <junaids@google.com>
---
 arch/x86/crypto/aesni-intel_asm.S | 80 +++++----------------------------------
 1 file changed, 10 insertions(+), 70 deletions(-)
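
A rough C model of the read strategy described in the commit message
(illustrative only; the actual fix is the READ_PARTIAL_BLOCK assembly
macro added by this patch):

#include <stdint.h>
#include <string.h>

/* Read 'len' (1..15) bytes from 'src' into a zero-padded 16-byte block
 * without touching src[len] or anything beyond it. */
static void read_partial_block(const uint8_t *src, unsigned int len,
                               uint8_t block[16])
{
        unsigned int i = 0;

        memset(block, 0, 16);

        if (len >= 8) {
                /* a single 8-byte load is safe here (the movq case) */
                memcpy(block, src, 8);
                i = 8;
        }

        /* the remaining tail is read strictly byte-by-byte */
        for (; i < len; i++)
                block[i] = src[i];
}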

Comments

Eric Biggers Dec. 20, 2017, 8:42 a.m. UTC | #1
On Tue, Dec 19, 2017 at 08:42:59PM -0800, Junaid Shahid wrote:
> The aesni_gcm_enc/dec functions can access memory after the end of
> the AAD buffer if the AAD length is not a multiple of 4 bytes.
> It didn't matter with rfc4106-gcm-aesni, as in that case the AAD was
> always followed by the 8-byte IV, but that is no longer the case with
> generic-gcm-aesni. This can potentially result in accessing a page that
> is not mapped, and thus crashing the machine. This patch fixes that by
> reading the last <16-byte block of the AAD byte-by-byte, optionally
> preceded by an 8-byte load when the block is at least 8 bytes.
> 
> Fixes: 0487ccac ("crypto: aesni - make non-AVX AES-GCM work with any aadlen")
[...]
> -_get_AAD_rest0\num_initial_blocks\operation:
> -	/* finalize: shift out the extra bytes we read, and align
> -	left. since pslldq can only shift by an immediate, we use
> -	vpshufb and an array of shuffle masks */
> -	movq	   %r12, %r11
> -	salq	   $4, %r11
> -	movdqu	   aad_shift_arr(%r11), \TMP1
> -	PSHUFB_XMM \TMP1, %xmm\i

aad_shift_arr is no longer used, so it should be removed.

> -_get_AAD_rest_final\num_initial_blocks\operation:
> +	READ_PARTIAL_BLOCK %r10, %r12, %r11, \TMP1, \TMP2, %xmm\i

It seems that INITIAL_BLOCKS_DEC and INITIAL_BLOCKS_ENC maintain both %r11 and
%r12 as the AAD length.  %r11 is used for real earlier, but here %r12 is used
for real while %r11 is a temporary register.  Can this be simplified to have the
AAD length in %r11 only?

Eric
Junaid Shahid Dec. 20, 2017, 7:35 p.m. UTC | #2
On Wednesday, December 20, 2017 12:42:10 AM PST Eric Biggers wrote:
> > -_get_AAD_rest0\num_initial_blocks\operation:
> > -	/* finalize: shift out the extra bytes we read, and align
> > -	left. since pslldq can only shift by an immediate, we use
> > -	vpshufb and an array of shuffle masks */
> > -	movq	   %r12, %r11
> > -	salq	   $4, %r11
> > -	movdqu	   aad_shift_arr(%r11), \TMP1
> > -	PSHUFB_XMM \TMP1, %xmm\i
> 
> aad_shift_arr is no longer used, so it should be removed.

Ack.

> 
> > -_get_AAD_rest_final\num_initial_blocks\operation:
> > +	READ_PARTIAL_BLOCK %r10, %r12, %r11, \TMP1, \TMP2, %xmm\i
> 
> It seems that INITIAL_BLOCKS_DEC and INITIAL_BLOCKS_ENC maintain both %r11 and
> %r12 as the AAD length.  %r11 is used for real earlier, but here %r12 is used
> for real while %r11 is a temporary register.  Can this be simplified to have the
> AAD length in %r11 only?
> 

We do need both registers, though we could certainly swap their usage to make
r12 the temp register. The reason we need the second register is because we
need to keep the original length to perform the pshufb at the end. But, of
course, that will not be needed anymore if we avoid the pshufb by duplicating
the _read_last_lt8 block or utilizing pslldq some other way.

Thanks,
Junaid
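
A hedged C-intrinsics sketch of the shuffle-mask trick discussed above.
The mask layout below is hypothetical (the kernel's aad_shift_arr table
was laid out differently); it only illustrates why a mask table is
needed when pslldq's shift count must be an immediate:

#include <stdint.h>
#include <string.h>
#include <tmmintrin.h>  /* SSSE3 _mm_shuffle_epi8, i.e. PSHUFB */

/* Move the top 'len' bytes of 'v' to the low end of the register and
 * zero the rest. A mask byte with its high bit set zeroes the result
 * byte; otherwise it selects a source byte by index. */
static __m128i left_align(__m128i v, unsigned int len)
{
        uint8_t mask[16];
        unsigned int i;
        __m128i m;

        for (i = 0; i < 16; i++)
                mask[i] = (i < len) ? (uint8_t)(16 - len + i) : 0x80;

        memcpy(&m, mask, sizeof(mask));
        return _mm_shuffle_epi8(v, m);
}

In the assembly the mask is of course not built on the fly; it is loaded
from a precomputed per-length table, which is what aad_shift_arr provided.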
Eric Biggers Dec. 20, 2017, 9:12 p.m. UTC | #3
On Wed, Dec 20, 2017 at 11:35:44AM -0800, Junaid Shahid wrote:
> On Wednesday, December 20, 2017 12:42:10 AM PST Eric Biggers wrote:
> > > -_get_AAD_rest0\num_initial_blocks\operation:
> > > -	/* finalize: shift out the extra bytes we read, and align
> > > -	left. since pslldq can only shift by an immediate, we use
> > > -	vpshufb and an array of shuffle masks */
> > > -	movq	   %r12, %r11
> > > -	salq	   $4, %r11
> > > -	movdqu	   aad_shift_arr(%r11), \TMP1
> > > -	PSHUFB_XMM \TMP1, %xmm\i
> > 
> > aad_shift_arr is no longer used, so it should be removed.
> 
> Ack.
> 
> > 
> > > -_get_AAD_rest_final\num_initial_blocks\operation:
> > > +	READ_PARTIAL_BLOCK %r10, %r12, %r11, \TMP1, \TMP2, %xmm\i
> > 
> > It seems that INITIAL_BLOCKS_DEC and INITIAL_BLOCKS_ENC maintain both %r11 and
> > %r12 as the AAD length.  %r11 is used for real earlier, but here %r12 is used
> > for real while %r11 is a temporary register.  Can this be simplified to have the
> > AAD length in %r11 only?
> > 
> 
> We do need both registers, though we could certainly swap their usage to make
> r12 the temp register. The reason we need the second register is because we
> need to keep the original length to perform the pshufb at the end. But, of
> course, that will not be needed anymore if we avoid the pshufb by duplicating
> the _read_last_lt8 block or utilizing pslldq some other way.
> 

If READ_PARTIAL_BLOCK can clobber 'DLEN' that would simplify it even more (no
need for 'TMP1'), but what I am talking about here is how INITIAL_BLOCKS_DEC and
INITIAL_BLOCKS_ENC maintain two copies of the remaining length in lock-step in
r11 and r12:

_get_AAD_blocks\num_initial_blocks\operation:
        movdqu     (%r10), %xmm\i
        PSHUFB_XMM %xmm14, %xmm\i # byte-reflect the AAD data
        pxor       %xmm\i, \XMM2
        GHASH_MUL  \XMM2, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
        add        $16, %r10
        sub        $16, %r12
        sub        $16, %r11
        cmp        $16, %r11
        jge        _get_AAD_blocks\num_initial_blocks\operation

The code which you are replacing with READ_PARTIAL_BLOCK actually needed the two
copies, but now it seems that only one copy is needed, so it can be simplified
by only using r11.

Eric
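
A minimal C rendering of the control flow being described, with a
hypothetical ghash_block() standing in for the PSHUFB + GHASH_MUL body;
the point is just that a single length counter suffices:

#include <stddef.h>
#include <stdint.h>

static void ghash_block(const uint8_t block[16])
{
        (void)block;    /* hypothetical stand-in for the GHASH update */
}

static void hash_aad_blocks(const uint8_t *aad, size_t len)
{
        /* one remaining-length counter, playing the role of %r11 */
        while (len >= 16) {
                ghash_block(aad);
                aad += 16;
                len -= 16;
        }
        /* len < 16 here; the tail goes through READ_PARTIAL_BLOCK */
}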
Junaid Shahid Dec. 20, 2017, 9:51 p.m. UTC | #4
On Wednesday, December 20, 2017 1:12:54 PM PST Eric Biggers wrote:
> > 
> > We do need both registers, though we could certainly swap their usage to make
> > r12 the temp register. The reason we need the second register is because we
> > need to keep the original length to perform the pshufb at the end. But, of
> > course, that will not be needed anymore if we avoid the pshufb by duplicating
> > the _read_last_lt8 block or utilizing pslldq some other way.
> > 
> 
> If READ_PARTIAL_BLOCK can clobber 'DLEN' that would simplify it even more (no
> need for 'TMP1'), but what I am talking about here is how INITIAL_BLOCKS_DEC and
> INITIAL_BLOCKS_ENC maintain two copies of the remaining length in lock-step in
> r11 and r12:
> 
> _get_AAD_blocks\num_initial_blocks\operation:
>         movdqu     (%r10), %xmm\i
>         PSHUFB_XMM %xmm14, %xmm\i # byte-reflect the AAD data
>         pxor       %xmm\i, \XMM2
>         GHASH_MUL  \XMM2, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
>         add        $16, %r10
>         sub        $16, %r12
>         sub        $16, %r11
>         cmp        $16, %r11
>         jge        _get_AAD_blocks\num_initial_blocks\operation
> 
> The code which you are replacing with READ_PARTIAL_BLOCK actually needed the two
> copies, but now it seems that only one copy is needed, so it can be simplified
> by only using r11.
> 

Sorry, I misunderstood earlier. I'll remove the extra register from the
preceding code in INITIAL_BLOCKS_ENC/DEC.

Thanks,
Junaid

Patch

diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S
index 4b16f31cb359..03846f2d32c9 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -309,7 +309,7 @@  XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation
 	pxor       \XMM2, \XMM2
 
 	cmp	   $16, %r11
-	jl	   _get_AAD_rest8\num_initial_blocks\operation
+	jl	   _get_AAD_rest\num_initial_blocks\operation
 _get_AAD_blocks\num_initial_blocks\operation:
 	movdqu	   (%r10), %xmm\i
 	PSHUFB_XMM %xmm14, %xmm\i # byte-reflect the AAD data
@@ -322,43 +322,13 @@  _get_AAD_blocks\num_initial_blocks\operation:
 	jge	   _get_AAD_blocks\num_initial_blocks\operation
 
 	movdqu	   \XMM2, %xmm\i
+
+	/* read the last <16B of AAD */
+_get_AAD_rest\num_initial_blocks\operation:
 	cmp	   $0, %r11
 	je	   _get_AAD_done\num_initial_blocks\operation
 
-	pxor	   %xmm\i,%xmm\i
-
-	/* read the last <16B of AAD. since we have at least 4B of
-	data right after the AAD (the ICV, and maybe some CT), we can
-	read 4B/8B blocks safely, and then get rid of the extra stuff */
-_get_AAD_rest8\num_initial_blocks\operation:
-	cmp	   $4, %r11
-	jle	   _get_AAD_rest4\num_initial_blocks\operation
-	movq	   (%r10), \TMP1
-	add	   $8, %r10
-	sub	   $8, %r11
-	pslldq	   $8, \TMP1
-	psrldq	   $8, %xmm\i
-	pxor	   \TMP1, %xmm\i
-	jmp	   _get_AAD_rest8\num_initial_blocks\operation
-_get_AAD_rest4\num_initial_blocks\operation:
-	cmp	   $0, %r11
-	jle	   _get_AAD_rest0\num_initial_blocks\operation
-	mov	   (%r10), %eax
-	movq	   %rax, \TMP1
-	add	   $4, %r10
-	sub	   $4, %r10
-	pslldq	   $12, \TMP1
-	psrldq	   $4, %xmm\i
-	pxor	   \TMP1, %xmm\i
-_get_AAD_rest0\num_initial_blocks\operation:
-	/* finalize: shift out the extra bytes we read, and align
-	left. since pslldq can only shift by an immediate, we use
-	vpshufb and an array of shuffle masks */
-	movq	   %r12, %r11
-	salq	   $4, %r11
-	movdqu	   aad_shift_arr(%r11), \TMP1
-	PSHUFB_XMM \TMP1, %xmm\i
-_get_AAD_rest_final\num_initial_blocks\operation:
+	READ_PARTIAL_BLOCK %r10, %r12, %r11, \TMP1, \TMP2, %xmm\i
 	PSHUFB_XMM   %xmm14, %xmm\i # byte-reflect the AAD data
 	pxor	   \XMM2, %xmm\i
 	GHASH_MUL  %xmm\i, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
@@ -568,7 +538,7 @@  XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation
 	pxor	   \XMM2, \XMM2
 
 	cmp	   $16, %r11
-	jl	   _get_AAD_rest8\num_initial_blocks\operation
+	jl	   _get_AAD_rest\num_initial_blocks\operation
 _get_AAD_blocks\num_initial_blocks\operation:
 	movdqu	   (%r10), %xmm\i
 	PSHUFB_XMM   %xmm14, %xmm\i # byte-reflect the AAD data
@@ -581,43 +551,13 @@  _get_AAD_blocks\num_initial_blocks\operation:
 	jge	   _get_AAD_blocks\num_initial_blocks\operation
 
 	movdqu	   \XMM2, %xmm\i
+
+	/* read the last <16B of AAD */
+_get_AAD_rest\num_initial_blocks\operation:
 	cmp	   $0, %r11
 	je	   _get_AAD_done\num_initial_blocks\operation
 
-	pxor	   %xmm\i,%xmm\i
-
-	/* read the last <16B of AAD. since we have at least 4B of
-	data right after the AAD (the ICV, and maybe some PT), we can
-	read 4B/8B blocks safely, and then get rid of the extra stuff */
-_get_AAD_rest8\num_initial_blocks\operation:
-	cmp	   $4, %r11
-	jle	   _get_AAD_rest4\num_initial_blocks\operation
-	movq	   (%r10), \TMP1
-	add	   $8, %r10
-	sub	   $8, %r11
-	pslldq	   $8, \TMP1
-	psrldq	   $8, %xmm\i
-	pxor	   \TMP1, %xmm\i
-	jmp	   _get_AAD_rest8\num_initial_blocks\operation
-_get_AAD_rest4\num_initial_blocks\operation:
-	cmp	   $0, %r11
-	jle	   _get_AAD_rest0\num_initial_blocks\operation
-	mov	   (%r10), %eax
-	movq	   %rax, \TMP1
-	add	   $4, %r10
-	sub	   $4, %r10
-	pslldq	   $12, \TMP1
-	psrldq	   $4, %xmm\i
-	pxor	   \TMP1, %xmm\i
-_get_AAD_rest0\num_initial_blocks\operation:
-	/* finalize: shift out the extra bytes we read, and align
-	left. since pslldq can only shift by an immediate, we use
-	vpshufb and an array of shuffle masks */
-	movq	   %r12, %r11
-	salq	   $4, %r11
-	movdqu	   aad_shift_arr(%r11), \TMP1
-	PSHUFB_XMM \TMP1, %xmm\i
-_get_AAD_rest_final\num_initial_blocks\operation:
+	READ_PARTIAL_BLOCK %r10, %r12, %r11, \TMP1, \TMP2, %xmm\i
 	PSHUFB_XMM   %xmm14, %xmm\i # byte-reflect the AAD data
 	pxor	   \XMM2, %xmm\i
 	GHASH_MUL  %xmm\i, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1