Message ID: 20210505191804.4015873-1-keescook@chromium.org (mailing list archive)
State:      New, archived
Series:     Makefile: Introduce CONFIG_ZERO_CALL_USED_REGS
Hi Kees,

On Wed, May 05, 2021 at 12:18:04PM -0700, Kees Cook wrote:
> When CONFIG_ZERO_CALL_USED_REGS is enabled, build the kernel with
> "-fzero-call-used-regs=used-gpr" (in GCC 11). This option will zero any
> caller-used register contents just before returning from a function,
> ensuring that temporary values are not leaked beyond the function
> boundary. This means that register contents are less likely to be
> available for side channel attacks and information exposures.
[...]
> In parallel build tests, this has a less than 1% performance impact,
> and grows the image size less than 1%:
>
> $ size vmlinux.stock vmlinux.zero-call-regs
>     text    data      bss      dec     hex filename
> 22437676 8559152 14127340 45124168 2b08a48 vmlinux.stock
> 22453184 8563248 14110956 45127388 2b096dc vmlinux.zero-call-regs

FWIW, I gave this a go on arm64, and the size increase is a fair bit
larger:

| [mark@lakrids:~/src/linux]% ls -l Image*
| -rw-r--r-- 1 mark mark 31955456 May  6 13:36 Image.stock
| -rw-r--r-- 1 mark mark 33724928 May  6 13:23 Image.zero-call-regs

| [mark@lakrids:~/src/linux]% size vmlinux.stock vmlinux.zero-call-regs
|     text     data    bss      dec     hex filename
| 20728552 11086474 505540 32320566 1ed2c36 vmlinux.stock
| 22500688 11084298 505540 34090526 2082e1e vmlinux.zero-call-regs

The Image is ~5.5% bigger, and the .text in the vmlinux is ~8.5% bigger.

The resulting Image appears to work, but I haven't done anything beyond
booting, and I wasn't able to get ROPgadget.py going to quantify the
number of gadgets.

> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---
[...]
> +config ZERO_CALL_USED_REGS
> +	bool "Enable register zeroing on function exit"
> +	depends on CC_HAS_ZERO_CALL_USED_REGS
> +	help
[...]
> +	  image. This has a less than 1% performance impact on most
> +	  workloads, and grows the image size less than 1%.

I think the numbers need an "on x86" caveat, since they're not
necessarily representative of other architectures.

This shows up under the "Memory initialization" sub-menu, but I assume
it was meant to be directly under the "Kernel hardening options" menu...

> +
>  endmenu

... and should presumably be here?

Thanks,
Mark.

>
>  endmenu
> --
> 2.25.1
>
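For quick sanity-checking, the relative figures in this exchange can be re-derived from the absolute numbers quoted in the messages; a standalone sketch (all constants are copied from the thread, nothing is measured here):

```python
# x86 ROP gadget counts from Kees's commit message:
stock_gadgets, zcr_gadgets = 337245, 267175
gadget_drop = (stock_gadgets - zcr_gadgets) / stock_gadgets * 100
print(f"x86 gadget reduction: ~{gadget_drop:.1f}%")   # ~20.8%

# arm64 Image and vmlinux .text sizes from Mark's reply:
image_stock, image_zcr = 31955456, 33724928
text_stock, text_zcr = 20728552, 22500688
print(f"arm64 Image growth: ~{(image_zcr - image_stock) / image_stock * 100:.1f}%")  # ~5.5%
print(f"arm64 .text growth: ~{(text_zcr - text_stock) / text_stock * 100:.1f}%")     # ~8.5%

# For contrast, the x86 .text sizes from the commit message:
x86_text_stock, x86_text_zcr = 22437676, 22453184
print(f"x86 .text growth: ~{(x86_text_zcr - x86_text_stock) / x86_text_stock * 100:.2f}%")  # ~0.07%
```

This bears out the "on x86" caveat: the same flag costs under 0.1% of .text on x86 but ~8.5% on arm64.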
On Thu, May 06, 2021 at 01:54:57PM +0100, Mark Rutland wrote:
> Hi Kees,
>
> On Wed, May 05, 2021 at 12:18:04PM -0700, Kees Cook wrote:
> > When CONFIG_ZERO_CALL_USED_REGS is enabled, build the kernel with
> > "-fzero-call-used-regs=used-gpr" (in GCC 11).
> [...]
> FWIW, I gave this a go on arm64, and the size increase is a fair bit
> larger:
>
> | [mark@lakrids:~/src/linux]% ls -l Image*
> | -rw-r--r-- 1 mark mark 31955456 May  6 13:36 Image.stock
> | -rw-r--r-- 1 mark mark 33724928 May  6 13:23 Image.zero-call-regs
>
> | [mark@lakrids:~/src/linux]% size vmlinux.stock vmlinux.zero-call-regs
> |     text     data    bss      dec     hex filename
> | 20728552 11086474 505540 32320566 1ed2c36 vmlinux.stock
> | 22500688 11084298 505540 34090526 2082e1e vmlinux.zero-call-regs
>
> The Image is ~5.5% bigger, and the .text in the vmlinux is ~8.5% bigger

Woo, that's quite a bit larger! So much so that I struggle to imagine
the delta. That's almost 1 extra instruction for every 10. I don't
imagine functions are that short. There seem to be only r9..r15 as
call-used. Even if every one was cleared at every function exit (28
bytes), that implies 63,290 functions, with an average function size of
40 instructions?

> The resulting Image appears to work, but I haven't done anything beyond
> booting, and I wasn't able to get ROPgadget.py going to quantify the
> number of gadgets.

Does it not like arm64 machine code? I can go check and see if I can get
numbers...

Thanks for looking at this!

-Kees

[...]
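The 63,290 figure is a back-of-the-envelope division of the arm64 .text growth by a per-function zeroing cost; a sketch reproducing it (the 7-register, 4-bytes-per-instruction assumption is the one stated in the mail):

```python
# arm64 .text sizes from Mark's `size` output
text_delta = 22500688 - 20728552   # .text growth in bytes
bytes_per_exit = 7 * 4             # 7 zeroed GPRs x 4-byte AArch64 MOVs = 28 bytes
print(text_delta // bytes_per_exit)  # 63290: functions implied if every one paid the full cost once
```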
On Thu, May 06, 2021 at 02:24:18PM -0700, Kees Cook wrote:
> On Thu, May 06, 2021 at 01:54:57PM +0100, Mark Rutland wrote:
> > FWIW, I gave this a go on arm64, and the size increase is a fair bit
> > larger:
[...]
> > The Image is ~5.5% bigger, and the .text in the vmlinux is ~8.5% bigger
>
> Woo, that's quite a bit larger! So much so that I struggle to imagine
> the delta. That's almost 1 extra instruction for every 10.
About 31% of this seems to be due to GCC (almost) always clearing x16
and x17 (see further down for numbers). I suspect that's because GCC has
to assume that any (non-static) functions might be reached via a PLT
which would clobber x16 and x17 with specific values.

We also have a bunch of small functions with multiple returns, where
each return path gets the full complement of zeroing instructions, e.g.

Stock:

| <fpsimd_sync_to_sve>:
| d503245f  bti     c
| f9400001  ldr     x1, [x0]
| 7209003f  tst     w1, #0x800000
| 54000040  b.eq    ffff800010014cc4 <fpsimd_sync_to_sve+0x14>  // b.none
| d65f03c0  ret
| d503233f  paciasp
| a9bf7bfd  stp     x29, x30, [sp, #-16]!
| 910003fd  mov     x29, sp
| 97fffdac  bl      ffff800010014380 <fpsimd_to_sve>
| a8c17bfd  ldp     x29, x30, [sp], #16
| d50323bf  autiasp
| d65f03c0  ret

With zero-call-regs:

| <fpsimd_sync_to_sve>:
| d503245f  bti     c
| f9400001  ldr     x1, [x0]
| 7209003f  tst     w1, #0x800000
| 540000c0  b.eq    ffff8000100152a8 <fpsimd_sync_to_sve+0x24>  // b.none
| d2800000  mov     x0, #0x0   // #0
| d2800001  mov     x1, #0x0   // #0
| d2800010  mov     x16, #0x0  // #0
| d2800011  mov     x17, #0x0  // #0
| d65f03c0  ret
| d503233f  paciasp
| a9bf7bfd  stp     x29, x30, [sp, #-16]!
| 910003fd  mov     x29, sp
| 97fffd17  bl      ffff800010014710 <fpsimd_to_sve>
| a8c17bfd  ldp     x29, x30, [sp], #16
| d50323bf  autiasp
| d2800000  mov     x0, #0x0   // #0
| d2800001  mov     x1, #0x0   // #0
| d2800010  mov     x16, #0x0  // #0
| d2800011  mov     x17, #0x0  // #0
| d65f03c0  ret

... where we go from 12 instructions to 20, which is a ~67% bloat.

> I don't imagine functions are that short. There seem to be only r9..r15
> as call-used.

We have a bunch of cases like the above. Also note that per the AAPCS a
function can clobber x0-17 (and x18 if it's not reserved for something
like SCS), and I see a few places that clobber x1-x17.

> Even if every one was cleared at every function exit (28
> bytes), that implies 63,290 functions, with an average function size of
> 40 instructions?
I generated some (slightly dodgy) numbers by grepping the output of
objdump:

[mark@lakrids:~/src/linux]% usekorg 10.1.0 aarch64-linux-objdump -d vmlinux.stock | wc -l
3979677
[mark@lakrids:~/src/linux]% usekorg 10.1.0 aarch64-linux-objdump -d vmlinux.stock | grep 'mov\sx[0-9]\+, #0x0' | wc -l
50070
[mark@lakrids:~/src/linux]% usekorg 10.1.0 aarch64-linux-objdump -d vmlinux.stock | grep 'mov\sx1[67], #0x0' | wc -l
1
[mark@lakrids:~/src/linux]% usekorg 10.1.0 aarch64-linux-objdump -d vmlinux.zero-call-regs | wc -l
4422188
[mark@lakrids:~/src/linux]% usekorg 10.1.0 aarch64-linux-objdump -d vmlinux.zero-call-regs | grep 'mov\sx[0-9]\+, #0x0' | wc -l
491371
[mark@lakrids:~/src/linux]% usekorg 10.1.0 aarch64-linux-objdump -d vmlinux.zero-call-regs | grep 'mov\sx1[67], #0x0' | wc -l
135729

That's 441301 new MOVs, and the equivalent of 442511 new instructions
overall. There are 135728 new MOVs to x16 and x17 specifically, which
account for ~31% of that. Overall we go from MOVs being ~1.3% of all
instructions to 11%.

> > The resulting Image appears to work, but I haven't done anything beyond
> > booting, and I wasn't able to get ROPgadget.py going to quantify the
> > number of gadgets.
>
> Does it not like arm64 machine code? I can go check and see if I can get
> numbers...

It's supposed to, and I suspect it works fine, but I wasn't able to get
the tool running at all due to environment problems on my machine.

Thanks,
Mark.
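The derived figures follow directly from the raw counts; a standalone recomputation (with the same caveat as the mail: the line counts are slightly dodgy, since objdump output includes label and header lines, not just instructions):

```python
# Raw objdump|grep counts quoted in the thread
stock_lines, zcr_lines = 3979677, 4422188      # total objdump output lines
stock_movs, zcr_movs = 50070, 491371           # 'mov xN, #0x0' lines
stock_x16x17, zcr_x16x17 = 1, 135729           # 'mov x16/x17, #0x0' lines

new_movs = zcr_movs - stock_movs               # 441301 new zeroing MOVs
new_lines = zcr_lines - stock_lines            # 442511 new lines overall
new_x16x17 = zcr_x16x17 - stock_x16x17         # 135728 new x16/x17 MOVs

print(new_movs, new_lines, new_x16x17)
print(f"x16/x17 share of new MOVs: ~{new_x16x17 / new_movs * 100:.0f}%")          # ~31%
print(f"MOV density: ~{stock_movs / stock_lines * 100:.1f}% -> "
      f"~{zcr_movs / zcr_lines * 100:.0f}%")                                      # ~1.3% -> ~11%
```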
On Mon, May 10, 2021 at 02:45:03PM +0100, Mark Rutland wrote:
> About 31% of this seems to be due to GCC (almost) always clearing x16
> and x17 (see further down for numbers). I suspect that's because GCC has
> to assume that any (non-static) functions might be reached via a PLT
> which would clobber x16 and x17 with specific values.

Wheee.

> We also have a bunch of small functions with multiple returns, where
> each return path gets the full complement of zeroing instructions, e.g.
[...]
> ... where we go from 12 instructions to 20, which is a ~67% bloat.

Yikes. Yeah, so that is likely a good example of missed optimization
opportunity.

> We have a bunch of cases like the above. Also note that per the AAPCS a
> function can clobber x0-17 (and x18 if it's not reserved for something
> like SCS), and I see a few places that clobber x1-x17.

Ah, gotcha. I wasn't quite sure which registers might qualify.

> [...]
> That's 441301 new MOVs, and the equivalent of 442511 new instructions
> overall. There are 135728 new MOVs to x16 and x17 specifically, which
> account for ~31% of that.

I assume the x16/x17 case could be addressed by the compiler if it
examined the need for PLTs, or is that too late (in the sense that the
linker is doing that phase)?

Regardless, I will update the documentation on this feature. :)
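The ~67% bloat in the quoted fpsimd_sync_to_sve example is fully accounted for by the zeroing sequence alone: four MOVs (x0, x1, x16, x17) duplicated on each of the two return paths. A trivial check of that arithmetic:

```python
stock_insns = 12        # instructions in the stock function
movs_per_return = 4     # x0, x1, x16, x17 zeroed before each ret
return_paths = 2        # early-exit ret plus the normal epilogue ret

zcr_insns = stock_insns + movs_per_return * return_paths
print(zcr_insns)                                              # 20
print(round((zcr_insns - stock_insns) / stock_insns * 100))   # 67
```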
On Mon, May 10, 2021 at 03:01:48PM -0700, Kees Cook wrote:
> On Mon, May 10, 2021 at 02:45:03PM +0100, Mark Rutland wrote:
[...]
> > That's 441301 new MOVs, and the equivalent of 442511 new instructions
> > overall. There are 135728 new MOVs to x16 and x17 specifically, which
> > account for ~31% of that.
>
> I assume the x16/x17 case could be addressed by the compiler if it
> examined the need for PLTs, or is that too late (in the sense that the
> linker is doing that phase)?

Most (all?) PLTs will be created at link time, and IIUC the compiler
simply has to assume any non-static function might have a PLT, since the
AAPCS permits that. Maybe some of the smaller memory size models don't
permit PLTs, but I have no real knowledge of that area and I'm already
out on a limb. LTO could probably help with visibility, but otherwise I
don't see a way the compiler could be sure a PLT won't exist.

> Regardless, I will update the documentation on this feature. :)

Great; thanks!

Mark.
diff --git a/Makefile b/Makefile
index 31dcdb3d61fa..810600618490 100644
--- a/Makefile
+++ b/Makefile
@@ -811,6 +811,11 @@ KBUILD_CFLAGS	+= -ftrivial-auto-var-init=zero
 KBUILD_CFLAGS	+= -enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang
 endif
 
+# Clear used registers at func exit (to reduce data lifetime and ROP gadgets).
+ifdef CONFIG_ZERO_CALL_USED_REGS
+KBUILD_CFLAGS	+= -fzero-call-used-regs=used-gpr
+endif
+
 DEBUG_CFLAGS	:=
 
 # Workaround for GCC versions < 5.0
diff --git a/security/Kconfig.hardening b/security/Kconfig.hardening
index 269967c4fc1b..85f7f2036725 100644
--- a/security/Kconfig.hardening
+++ b/security/Kconfig.hardening
@@ -217,6 +217,23 @@ config INIT_ON_FREE_DEFAULT_ON
 	  touching "cold" memory areas. Most cases see 3-5% impact. Some
 	  synthetic workloads have measured as high as 8%.
 
+config CC_HAS_ZERO_CALL_USED_REGS
+	def_bool $(cc-option,-fzero-call-used-regs=used-gpr)
+
+config ZERO_CALL_USED_REGS
+	bool "Enable register zeroing on function exit"
+	depends on CC_HAS_ZERO_CALL_USED_REGS
+	help
+	  At the end of functions, always zero any caller-used register
+	  contents. This helps ensure that temporary values are not
+	  leaked beyond the function boundary. This means that register
+	  contents are less likely to be available for side channels
+	  and information exposures. Additionally, this helps reduce the
+	  number of useful ROP gadgets by about 20% (and removes compiler
+	  generated "write-what-where" gadgets) in the resulting kernel
+	  image. This has a less than 1% performance impact on most
+	  workloads, and grows the image size less than 1%.
+
 endmenu
 
 endmenu
When CONFIG_ZERO_CALL_USED_REGS is enabled, build the kernel with
"-fzero-call-used-regs=used-gpr" (in GCC 11). This option will zero any
caller-used register contents just before returning from a function,
ensuring that temporary values are not leaked beyond the function
boundary. This means that register contents are less likely to be
available for side channel attacks and information exposures.

Additionally this helps reduce the number of useful ROP gadgets in the
kernel image by about 20%:

$ ROPgadget.py --nosys --nojop --binary vmlinux.stock | tail -n1
Unique gadgets found: 337245

$ ROPgadget.py --nosys --nojop --binary vmlinux.zero-call-regs | tail -n1
Unique gadgets found: 267175

and more notably removes simple "write-what-where" gadgets:

$ ROPgadget.py --ropchain --binary vmlinux.stock | sed -n '/Step 1/,/Step 2/p'
- Step 1 -- Write-what-where gadgets

[+] Gadget found: 0xffffffff8102d76c mov qword ptr [rsi], rdx ; ret
[+] Gadget found: 0xffffffff81000cf5 pop rsi ; ret
[+] Gadget found: 0xffffffff8104d7c8 pop rdx ; ret
[-] Can't find the 'xor rdx, rdx' gadget. Try with another 'mov [reg], reg'

[+] Gadget found: 0xffffffff814c2b4c mov qword ptr [rsi], rdi ; ret
[+] Gadget found: 0xffffffff81000cf5 pop rsi ; ret
[+] Gadget found: 0xffffffff81001e51 pop rdi ; ret
[-] Can't find the 'xor rdi, rdi' gadget. Try with another 'mov [reg], reg'

[+] Gadget found: 0xffffffff81540d61 mov qword ptr [rsi], rdi ; pop rbx ; pop rbp ; ret
[+] Gadget found: 0xffffffff81000cf5 pop rsi ; ret
[+] Gadget found: 0xffffffff81001e51 pop rdi ; ret
[-] Can't find the 'xor rdi, rdi' gadget. Try with another 'mov [reg], reg'

[+] Gadget found: 0xffffffff8105341e mov qword ptr [rsi], rax ; ret
[+] Gadget found: 0xffffffff81000cf5 pop rsi ; ret
[+] Gadget found: 0xffffffff81029a11 pop rax ; ret
[+] Gadget found: 0xffffffff811f1c3b xor rax, rax ; ret

- Step 2 -- Init syscall number gadgets

$ ROPgadget.py --ropchain --binary vmlinux.zero* | sed -n '/Step 1/,/Step 2/p'
- Step 1 -- Write-what-where gadgets

[-] Can't find the 'mov qword ptr [r64], r64' gadget

In parallel build tests, this has a less than 1% performance impact,
and grows the image size less than 1%:

$ size vmlinux.stock vmlinux.zero-call-regs
    text    data      bss      dec     hex filename
22437676 8559152 14127340 45124168 2b08a48 vmlinux.stock
22453184 8563248 14110956 45127388 2b096dc vmlinux.zero-call-regs

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 Makefile                   |  5 +++++
 security/Kconfig.hardening | 17 +++++++++++++++++
 2 files changed, 22 insertions(+)