Message ID | 20171220044259.61106-3-junaids@google.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Herbert Xu |
On Tue, Dec 19, 2017 at 08:42:59PM -0800, Junaid Shahid wrote:
> The aesni_gcm_enc/dec functions can access memory after the end of
> the AAD buffer if the AAD length is not a multiple of 4 bytes.
> It didn't matter with rfc4106-gcm-aesni as in that case the AAD was
> always followed by the 8 byte IV, but that is no longer the case with
> generic-gcm-aesni. This can potentially result in accessing a page that
> is not mapped and thus causing the machine to crash. This patch fixes
> that by reading the last <16 byte block of the AAD byte-by-byte and
> optionally via an 8-byte load if the block was at least 8 bytes.
>
> Fixes: 0487ccac ("crypto: aesni - make non-AVX AES-GCM work with any aadlen")
[...]
> -_get_AAD_rest0\num_initial_blocks\operation:
> -	/* finalize: shift out the extra bytes we read, and align
> -	left. since pslldq can only shift by an immediate, we use
> -	vpshufb and an array of shuffle masks */
> -	movq	   %r12, %r11
> -	salq	   $4, %r11
> -	movdqu	   aad_shift_arr(%r11), \TMP1
> -	PSHUFB_XMM \TMP1, %xmm\i

aad_shift_arr is no longer used, so it should be removed.

> -_get_AAD_rest_final\num_initial_blocks\operation:
> +	READ_PARTIAL_BLOCK %r10, %r12, %r11, \TMP1, \TMP2, %xmm\i

It seems that INITIAL_BLOCKS_DEC and INITIAL_BLOCKS_ENC maintain both %r11 and
%r12 as the AAD length. %r11 is used for real earlier, but here %r12 is used
for real while %r11 is a temporary register. Can this be simplified to have
the AAD length in %r11 only?

Eric
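The overread under discussion comes from the old tail-handling code, which consumed the last <16 bytes of AAD in fixed 8-byte and 4-byte loads. Below is a rough C sketch of that access pattern; the function name and simplified flow are illustrative only (not the kernel's code), and it models load bounds rather than the exact byte placement used for GHASH:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Rough model of the old tail handling for the last rem (< 16) bytes of AAD:
 * 8-byte loads while more than 4 bytes remain, then one 4-byte load.  The
 * wide loads can run past the end of the buffer.
 */
static void old_style_tail_read(const uint8_t *aad, size_t rem,
				uint8_t block[16])
{
	size_t off = 0;

	memset(block, 0, 16);
	while (rem > 4) {
		/* rem == 5..7 here means reading past aad + rem */
		memcpy(block + off, aad + off, 8);
		off += 8;
		rem = (rem > 8) ? rem - 8 : 0;
	}
	if (rem) {
		/* rem == 1..3 here also reads past aad + rem */
		memcpy(block + off, aad + off, 4);
	}
}
```

With, say, rem == 5, the first wide load already touches three bytes past the end of the AAD buffer, which is only safe when something mapped (such as the rfc4106 IV) is guaranteed to follow the AAD.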
On Wednesday, December 20, 2017 12:42:10 AM PST Eric Biggers wrote:
> > -_get_AAD_rest0\num_initial_blocks\operation:
> > -	/* finalize: shift out the extra bytes we read, and align
> > -	left. since pslldq can only shift by an immediate, we use
> > -	vpshufb and an array of shuffle masks */
> > -	movq	   %r12, %r11
> > -	salq	   $4, %r11
> > -	movdqu	   aad_shift_arr(%r11), \TMP1
> > -	PSHUFB_XMM \TMP1, %xmm\i
>
> aad_shift_arr is no longer used, so it should be removed.

Ack.

>
> > -_get_AAD_rest_final\num_initial_blocks\operation:
> > +	READ_PARTIAL_BLOCK %r10, %r12, %r11, \TMP1, \TMP2, %xmm\i
>
> It seems that INITIAL_BLOCKS_DEC and INITIAL_BLOCKS_ENC maintain both %r11 and
> %r12 as the AAD length. %r11 is used for real earlier, but here %r12 is used
> for real while %r11 is a temporary register. Can this be simplified to have the
> AAD length in %r11 only?
>

We do need both registers, though we could certainly swap their usage to make
r12 the temp register. The reason we need the second register is because we
need to keep the original length to perform the pshufb at the end. But, of
course, that will not be needed anymore if we avoid the pshufb by duplicating
the _read_last_lt8 block or utilizing pslldq some other way.

Thanks,
Junaid
On Wed, Dec 20, 2017 at 11:35:44AM -0800, Junaid Shahid wrote:
> On Wednesday, December 20, 2017 12:42:10 AM PST Eric Biggers wrote:
> > > -_get_AAD_rest0\num_initial_blocks\operation:
> > > -	/* finalize: shift out the extra bytes we read, and align
> > > -	left. since pslldq can only shift by an immediate, we use
> > > -	vpshufb and an array of shuffle masks */
> > > -	movq	   %r12, %r11
> > > -	salq	   $4, %r11
> > > -	movdqu	   aad_shift_arr(%r11), \TMP1
> > > -	PSHUFB_XMM \TMP1, %xmm\i
> >
> > aad_shift_arr is no longer used, so it should be removed.
>
> Ack.
>
> >
> > > -_get_AAD_rest_final\num_initial_blocks\operation:
> > > +	READ_PARTIAL_BLOCK %r10, %r12, %r11, \TMP1, \TMP2, %xmm\i
> >
> > It seems that INITIAL_BLOCKS_DEC and INITIAL_BLOCKS_ENC maintain both %r11 and
> > %r12 as the AAD length. %r11 is used for real earlier, but here %r12 is used
> > for real while %r11 is a temporary register. Can this be simplified to have the
> > AAD length in %r11 only?
> >
>
> We do need both registers, though we could certainly swap their usage to make
> r12 the temp register. The reason we need the second register is because we
> need to keep the original length to perform the pshufb at the end. But, of
> course, that will not be needed anymore if we avoid the pshufb by duplicating
> the _read_last_lt8 block or utilizing pslldq some other way.
>

If READ_PARTIAL_BLOCK can clobber 'DLEN' that would simplify it even more (no
need for 'TMP1'), but what I am talking about here is how INITIAL_BLOCKS_DEC
and INITIAL_BLOCKS_ENC maintain two copies of the remaining length in
lock-step in r11 and r12:

_get_AAD_blocks\num_initial_blocks\operation:
	movdqu	   (%r10), %xmm\i
	PSHUFB_XMM %xmm14, %xmm\i # byte-reflect the AAD data
	pxor	   %xmm\i, \XMM2
	GHASH_MUL  \XMM2, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
	add	   $16, %r10
	sub	   $16, %r12
	sub	   $16, %r11
	cmp	   $16, %r11
	jge	   _get_AAD_blocks\num_initial_blocks\operation

The code which you are replacing with READ_PARTIAL_BLOCK actually needed the
two copies, but now it seems that only one copy is needed, so it can be
simplified by only using r11.

Eric
On Wednesday, December 20, 2017 1:12:54 PM PST Eric Biggers wrote:
> >
> > We do need both registers, though we could certainly swap their usage to make
> > r12 the temp register. The reason we need the second register is because we
> > need to keep the original length to perform the pshufb at the end. But, of
> > course, that will not be needed anymore if we avoid the pshufb by duplicating
> > the _read_last_lt8 block or utilizing pslldq some other way.
> >
>
> If READ_PARTIAL_BLOCK can clobber 'DLEN' that would simplify it even more (no
> need for 'TMP1'), but what I am talking about here is how INITIAL_BLOCKS_DEC and
> INITIAL_BLOCKS_ENC maintain two copies of the remaining length in lock-step in
> r11 and r12:
>
> _get_AAD_blocks\num_initial_blocks\operation:
> 	movdqu	   (%r10), %xmm\i
> 	PSHUFB_XMM %xmm14, %xmm\i # byte-reflect the AAD data
> 	pxor	   %xmm\i, \XMM2
> 	GHASH_MUL  \XMM2, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
> 	add	   $16, %r10
> 	sub	   $16, %r12
> 	sub	   $16, %r11
> 	cmp	   $16, %r11
> 	jge	   _get_AAD_blocks\num_initial_blocks\operation
>
> The code which you are replacing with READ_PARTIAL_BLOCK actually needed the two
> copies, but now it seems that only one copy is needed, so it can be simplified
> by only using r11.
>

Sorry, I misunderstood earlier. I’ll remove the extra register from the
preceding code in INITIAL_BLOCKS_ENC/DEC.

Thanks,
Junaid
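In C terms, the simplification being agreed on here amounts to walking the AAD with a single remaining-length counter (the %r11 role) instead of decrementing two registers in lock-step. A minimal sketch, assuming hypothetical helpers ghash_one_block() and read_partial_block() standing in for the GHASH_MUL and READ_PARTIAL_BLOCK macros:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the GHASH_MUL and READ_PARTIAL_BLOCK macros. */
void ghash_one_block(uint8_t hash[16], const uint8_t block[16]);
void read_partial_block(const uint8_t *src, size_t len, uint8_t block[16]);

/*
 * AAD hashing flow with a single remaining-length counter; no second copy of
 * the length is kept once the tail read no longer needs the original value.
 */
static void hash_aad(uint8_t hash[16], const uint8_t *aad, size_t aad_len)
{
	uint8_t block[16];

	while (aad_len >= 16) {
		ghash_one_block(hash, aad);	/* full 16-byte blocks */
		aad += 16;
		aad_len -= 16;
	}
	if (aad_len) {				/* 1..15 trailing bytes */
		read_partial_block(aad, aad_len, block);
		ghash_one_block(hash, block);
	}
}
```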
diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S
index 4b16f31cb359..03846f2d32c9 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -309,7 +309,7 @@ XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation
 	pxor	   \XMM2, \XMM2
 
 	cmp	   $16, %r11
-	jl	   _get_AAD_rest8\num_initial_blocks\operation
+	jl	   _get_AAD_rest\num_initial_blocks\operation
 _get_AAD_blocks\num_initial_blocks\operation:
 	movdqu	   (%r10), %xmm\i
 	PSHUFB_XMM %xmm14, %xmm\i # byte-reflect the AAD data
@@ -322,43 +322,13 @@ _get_AAD_blocks\num_initial_blocks\operation:
 	jge	   _get_AAD_blocks\num_initial_blocks\operation
 
 	movdqu	   \XMM2, %xmm\i
+
+	/* read the last <16B of AAD */
+_get_AAD_rest\num_initial_blocks\operation:
 	cmp	   $0, %r11
 	je	   _get_AAD_done\num_initial_blocks\operation
 
-	pxor	   %xmm\i,%xmm\i
-
-	/* read the last <16B of AAD. since we have at least 4B of
-	data right after the AAD (the ICV, and maybe some CT), we can
-	read 4B/8B blocks safely, and then get rid of the extra stuff */
-_get_AAD_rest8\num_initial_blocks\operation:
-	cmp	   $4, %r11
-	jle	   _get_AAD_rest4\num_initial_blocks\operation
-	movq	   (%r10), \TMP1
-	add	   $8, %r10
-	sub	   $8, %r11
-	pslldq	   $8, \TMP1
-	psrldq	   $8, %xmm\i
-	pxor	   \TMP1, %xmm\i
-	jmp	   _get_AAD_rest8\num_initial_blocks\operation
-_get_AAD_rest4\num_initial_blocks\operation:
-	cmp	   $0, %r11
-	jle	   _get_AAD_rest0\num_initial_blocks\operation
-	mov	   (%r10), %eax
-	movq	   %rax, \TMP1
-	add	   $4, %r10
-	sub	   $4, %r10
-	pslldq	   $12, \TMP1
-	psrldq	   $4, %xmm\i
-	pxor	   \TMP1, %xmm\i
-_get_AAD_rest0\num_initial_blocks\operation:
-	/* finalize: shift out the extra bytes we read, and align
-	left. since pslldq can only shift by an immediate, we use
-	vpshufb and an array of shuffle masks */
-	movq	   %r12, %r11
-	salq	   $4, %r11
-	movdqu	   aad_shift_arr(%r11), \TMP1
-	PSHUFB_XMM \TMP1, %xmm\i
-_get_AAD_rest_final\num_initial_blocks\operation:
+	READ_PARTIAL_BLOCK %r10, %r12, %r11, \TMP1, \TMP2, %xmm\i
 	PSHUFB_XMM %xmm14, %xmm\i # byte-reflect the AAD data
 	pxor	   \XMM2, %xmm\i
 	GHASH_MUL  %xmm\i, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
@@ -568,7 +538,7 @@ XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation
 	pxor	   \XMM2, \XMM2
 
 	cmp	   $16, %r11
-	jl	   _get_AAD_rest8\num_initial_blocks\operation
+	jl	   _get_AAD_rest\num_initial_blocks\operation
 _get_AAD_blocks\num_initial_blocks\operation:
 	movdqu	   (%r10), %xmm\i
 	PSHUFB_XMM %xmm14, %xmm\i # byte-reflect the AAD data
@@ -581,43 +551,13 @@ _get_AAD_blocks\num_initial_blocks\operation:
 	jge	   _get_AAD_blocks\num_initial_blocks\operation
 
 	movdqu	   \XMM2, %xmm\i
+
+	/* read the last <16B of AAD */
+_get_AAD_rest\num_initial_blocks\operation:
 	cmp	   $0, %r11
 	je	   _get_AAD_done\num_initial_blocks\operation
 
-	pxor	   %xmm\i,%xmm\i
-
-	/* read the last <16B of AAD. since we have at least 4B of
-	data right after the AAD (the ICV, and maybe some PT), we can
-	read 4B/8B blocks safely, and then get rid of the extra stuff */
-_get_AAD_rest8\num_initial_blocks\operation:
-	cmp	   $4, %r11
-	jle	   _get_AAD_rest4\num_initial_blocks\operation
-	movq	   (%r10), \TMP1
-	add	   $8, %r10
-	sub	   $8, %r11
-	pslldq	   $8, \TMP1
-	psrldq	   $8, %xmm\i
-	pxor	   \TMP1, %xmm\i
-	jmp	   _get_AAD_rest8\num_initial_blocks\operation
-_get_AAD_rest4\num_initial_blocks\operation:
-	cmp	   $0, %r11
-	jle	   _get_AAD_rest0\num_initial_blocks\operation
-	mov	   (%r10), %eax
-	movq	   %rax, \TMP1
-	add	   $4, %r10
-	sub	   $4, %r10
-	pslldq	   $12, \TMP1
-	psrldq	   $4, %xmm\i
-	pxor	   \TMP1, %xmm\i
-_get_AAD_rest0\num_initial_blocks\operation:
-	/* finalize: shift out the extra bytes we read, and align
-	left. since pslldq can only shift by an immediate, we use
-	vpshufb and an array of shuffle masks */
-	movq	   %r12, %r11
-	salq	   $4, %r11
-	movdqu	   aad_shift_arr(%r11), \TMP1
-	PSHUFB_XMM \TMP1, %xmm\i
-_get_AAD_rest_final\num_initial_blocks\operation:
+	READ_PARTIAL_BLOCK %r10, %r12, %r11, \TMP1, \TMP2, %xmm\i
 	PSHUFB_XMM %xmm14, %xmm\i # byte-reflect the AAD data
 	pxor	   \XMM2, %xmm\i
 	GHASH_MUL  %xmm\i, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
The aesni_gcm_enc/dec functions can access memory after the end of the AAD
buffer if the AAD length is not a multiple of 4 bytes. It didn't matter with
rfc4106-gcm-aesni as in that case the AAD was always followed by the 8 byte
IV, but that is no longer the case with generic-gcm-aesni. This can
potentially result in accessing a page that is not mapped and thus causing
the machine to crash. This patch fixes that by reading the last <16 byte
block of the AAD byte-by-byte and optionally via an 8-byte load if the block
was at least 8 bytes.

Fixes: 0487ccac ("crypto: aesni - make non-AVX AES-GCM work with any aadlen")
Signed-off-by: Junaid Shahid <junaids@google.com>
---
 arch/x86/crypto/aesni-intel_asm.S | 80 +++++----------------------------------
 1 file changed, 10 insertions(+), 70 deletions(-)
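For reference, the access pattern of the partial-block read described above, modelled in C. This is a sketch of the bounds behaviour only (one 8-byte load when at least 8 bytes remain, then byte-by-byte), with an illustrative name; it is not the actual READ_PARTIAL_BLOCK assembly or its byte ordering:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Bounds model of the new tail handling, where rem is the 1..15 bytes of AAD
 * left: one 8-byte load only when at least 8 bytes remain, then byte-by-byte
 * for the rest, so no access ever goes beyond aad + rem.
 */
static void read_partial_block(const uint8_t *aad, size_t rem,
			       uint8_t block[16])
{
	size_t i = 0;

	memset(block, 0, 16);
	if (rem >= 8) {
		memcpy(block, aad, 8);	/* single wide load, fully in bounds */
		i = 8;
	}
	for (; i < rem; i++)		/* remaining 0-7 bytes, one at a time */
		block[i] = aad[i];
}
```

Every load stays within aad[0..rem-1], so the routine is safe even when the AAD is the last thing on a mapped page.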