Message ID: 20210505191804.4015873-1-keescook@chromium.org (mailing list archive)
State:      New, archived
Series:     Makefile: Introduce CONFIG_ZERO_CALL_USED_REGS
Hi Kees,

On Wed, May 05, 2021 at 12:18:04PM -0700, Kees Cook wrote:
> When CONFIG_ZERO_CALL_USED_REGS is enabled, build the kernel with
> "-fzero-call-used-regs=used-gpr" (in GCC 11). This option will zero any
> caller-used register contents just before returning from a function,
> ensuring that temporary values are not leaked beyond the function
> boundary. This means that register contents are less likely to be
> available for side channel attacks and information exposures.
[...]
> In parallel build tests, this has a less than 1% performance impact,
> and grows the image size less than 1%:
>
> $ size vmlinux.stock vmlinux.zero-call-regs
>     text    data      bss      dec     hex filename
> 22437676 8559152 14127340 45124168 2b08a48 vmlinux.stock
> 22453184 8563248 14110956 45127388 2b096dc vmlinux.zero-call-regs

FWIW, I gave this a go on arm64, and the size increase is a fair bit
larger:

| [mark@lakrids:~/src/linux]% ls -l Image*
| -rw-r--r-- 1 mark mark 31955456 May  6 13:36 Image.stock
| -rw-r--r-- 1 mark mark 33724928 May  6 13:23 Image.zero-call-regs

| [mark@lakrids:~/src/linux]% size vmlinux.stock vmlinux.zero-call-regs
|     text     data    bss      dec     hex filename
| 20728552 11086474 505540 32320566 1ed2c36 vmlinux.stock
| 22500688 11084298 505540 34090526 2082e1e vmlinux.zero-call-regs

The Image is ~5.5% bigger, and the .text in the vmlinux is ~8.5% bigger.

The resulting Image appears to work, but I haven't done anything beyond
booting, and I wasn't able to get ROPgadget.py going to quantify the
number of gadgets.

> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---
[...]
> +config ZERO_CALL_USED_REGS
> +	bool "Enable register zeroing on function exit"
> +	depends on CC_HAS_ZERO_CALL_USED_REGS
> +	help
[...]
> +	  image. This has a less than 1% performance impact on most
> +	  workloads, and grows the image size less than 1%.

I think the numbers need an "on x86" caveat, since they're not
necessarily representative of other architectures.

This shows up under the "Memory initialization" sub-menu, but I assume
it was meant to be directly under the "Kernel hardening options" menu...

> +
>  endmenu

... and should presumably be here?

Thanks,
Mark.

>
>  endmenu
> --
> 2.25.1
>
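For quick sanity-checking, the relative figures in this exchange can be re-derived from the absolute numbers quoted in the messages; a standalone sketch (all constants are copied from the thread, nothing is measured here):

```python
# x86 ROP gadget counts from Kees's commit message:
stock_gadgets, zcr_gadgets = 337245, 267175
gadget_drop = (stock_gadgets - zcr_gadgets) / stock_gadgets * 100
print(f"x86 gadget reduction: ~{gadget_drop:.1f}%")   # ~20.8%

# arm64 Image and vmlinux .text sizes from Mark's reply:
image_stock, image_zcr = 31955456, 33724928
text_stock, text_zcr = 20728552, 22500688
print(f"arm64 Image growth: ~{(image_zcr - image_stock) / image_stock * 100:.1f}%")  # ~5.5%
print(f"arm64 .text growth: ~{(text_zcr - text_stock) / text_stock * 100:.1f}%")     # ~8.5%

# For contrast, the x86 .text sizes from the commit message:
x86_text_stock, x86_text_zcr = 22437676, 22453184
print(f"x86 .text growth: ~{(x86_text_zcr - x86_text_stock) / x86_text_stock * 100:.2f}%")  # ~0.07%
```

This bears out the "on x86" caveat: the same flag costs under 0.1% of .text on x86 but ~8.5% on arm64.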
On Thu, May 06, 2021 at 01:54:57PM +0100, Mark Rutland wrote:
> Hi Kees,
>
> On Wed, May 05, 2021 at 12:18:04PM -0700, Kees Cook wrote:
> > When CONFIG_ZERO_CALL_USED_REGS is enabled, build the kernel with
> > "-fzero-call-used-regs=used-gpr" (in GCC 11).
> [...]
> FWIW, I gave this a go on arm64, and the size increase is a fair bit
> larger:
>
> | [mark@lakrids:~/src/linux]% ls -l Image*
> | -rw-r--r-- 1 mark mark 31955456 May  6 13:36 Image.stock
> | -rw-r--r-- 1 mark mark 33724928 May  6 13:23 Image.zero-call-regs
>
> | [mark@lakrids:~/src/linux]% size vmlinux.stock vmlinux.zero-call-regs
> |     text     data    bss      dec     hex filename
> | 20728552 11086474 505540 32320566 1ed2c36 vmlinux.stock
> | 22500688 11084298 505540 34090526 2082e1e vmlinux.zero-call-regs
>
> The Image is ~5.5% bigger, and the .text in the vmlinux is ~8.5% bigger

Woo, that's quite a bit larger! So much so that I struggle to imagine
the delta. That's almost 1 extra instruction for every 10. I don't
imagine functions are that short. There seem to be only r9..r15 as
call-used. Even if every one was cleared at every function exit (28
bytes), that implies 63,290 functions, with an average function size of
40 instructions?

> The resulting Image appears to work, but I haven't done anything beyond
> booting, and I wasn't able to get ROPgadget.py going to quantify the
> number of gadgets.

Does it not like arm64 machine code? I can go check and see if I can get
numbers...

Thanks for looking at this!

-Kees

[...]
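The 63,290 figure is a back-of-the-envelope division of the arm64 .text growth by a per-function zeroing cost; a sketch reproducing it (the 7-register, 4-bytes-per-instruction assumption is the one stated in the mail):

```python
# arm64 .text sizes from Mark's `size` output
text_delta = 22500688 - 20728552   # .text growth in bytes
bytes_per_exit = 7 * 4             # 7 zeroed GPRs x 4-byte AArch64 MOVs = 28 bytes
print(text_delta // bytes_per_exit)  # 63290: functions implied if every one paid the full cost once
```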
On Thu, May 06, 2021 at 02:24:18PM -0700, Kees Cook wrote:
> On Thu, May 06, 2021 at 01:54:57PM +0100, Mark Rutland wrote:
> > FWIW, I gave this a go on arm64, and the size increase is a fair bit
> > larger:
[...]
> > The Image is ~5.5% bigger, and the .text in the vmlinux is ~8.5% bigger
>
> Woo, that's quite a bit larger! So much so that I struggle to imagine
> the delta. That's almost 1 extra instruction for every 10.
About 31% of this seems to be due to GCC (almost) always clearing x16
and x17 (see further down for numbers). I suspect that's because GCC has
to assume that any (non-static) functions might be reached via a PLT
which would clobber x16 and x17 with specific values.

We also have a bunch of small functions with multiple returns, where
each return path gets the full complement of zeroing instructions, e.g.

Stock:

| <fpsimd_sync_to_sve>:
| d503245f  bti     c
| f9400001  ldr     x1, [x0]
| 7209003f  tst     w1, #0x800000
| 54000040  b.eq    ffff800010014cc4 <fpsimd_sync_to_sve+0x14>  // b.none
| d65f03c0  ret
| d503233f  paciasp
| a9bf7bfd  stp     x29, x30, [sp, #-16]!
| 910003fd  mov     x29, sp
| 97fffdac  bl      ffff800010014380 <fpsimd_to_sve>
| a8c17bfd  ldp     x29, x30, [sp], #16
| d50323bf  autiasp
| d65f03c0  ret

With zero-call-regs:

| <fpsimd_sync_to_sve>:
| d503245f  bti     c
| f9400001  ldr     x1, [x0]
| 7209003f  tst     w1, #0x800000
| 540000c0  b.eq    ffff8000100152a8 <fpsimd_sync_to_sve+0x24>  // b.none
| d2800000  mov     x0, #0x0   // #0
| d2800001  mov     x1, #0x0   // #0
| d2800010  mov     x16, #0x0  // #0
| d2800011  mov     x17, #0x0  // #0
| d65f03c0  ret
| d503233f  paciasp
| a9bf7bfd  stp     x29, x30, [sp, #-16]!
| 910003fd  mov     x29, sp
| 97fffd17  bl      ffff800010014710 <fpsimd_to_sve>
| a8c17bfd  ldp     x29, x30, [sp], #16
| d50323bf  autiasp
| d2800000  mov     x0, #0x0   // #0
| d2800001  mov     x1, #0x0   // #0
| d2800010  mov     x16, #0x0  // #0
| d2800011  mov     x17, #0x0  // #0
| d65f03c0  ret

... where we go from 12 instructions to 20, which is a ~67% bloat.

> I don't imagine functions are that short. There seem to be only r9..r15
> as call-used.

We have a bunch of cases like the above. Also note that per the AAPCS a
function can clobber x0-17 (and x18 if it's not reserved for something
like SCS), and I see a few places that clobber x1-x17.

> Even if every one was cleared at every function exit (28
> bytes), that implies 63,290 functions, with an average function size of
> 40 instructions?
I generated some (slightly dodgy) numbers by grepping the output of
objdump:

[mark@lakrids:~/src/linux]% usekorg 10.1.0 aarch64-linux-objdump -d vmlinux.stock | wc -l
3979677
[mark@lakrids:~/src/linux]% usekorg 10.1.0 aarch64-linux-objdump -d vmlinux.stock | grep 'mov\sx[0-9]\+, #0x0' | wc -l
50070
[mark@lakrids:~/src/linux]% usekorg 10.1.0 aarch64-linux-objdump -d vmlinux.stock | grep 'mov\sx1[67], #0x0' | wc -l
1
[mark@lakrids:~/src/linux]% usekorg 10.1.0 aarch64-linux-objdump -d vmlinux.zero-call-regs | wc -l
4422188
[mark@lakrids:~/src/linux]% usekorg 10.1.0 aarch64-linux-objdump -d vmlinux.zero-call-regs | grep 'mov\sx[0-9]\+, #0x0' | wc -l
491371
[mark@lakrids:~/src/linux]% usekorg 10.1.0 aarch64-linux-objdump -d vmlinux.zero-call-regs | grep 'mov\sx1[67], #0x0' | wc -l
135729

That's 441301 new MOVs, and the equivalent of 442511 new instructions
overall. There are 135728 new MOVs to x16 and x17 specifically, which
account for ~31% of that. Overall we go from MOVs being ~1.3% of all
instructions to 11%.

> > The resulting Image appears to work, but I haven't done anything beyond
> > booting, and I wasn't able to get ROPgadget.py going to quantify the
> > number of gadgets.
>
> Does it not like arm64 machine code? I can go check and see if I can get
> numbers...

It's supposed to, and I suspect it works fine, but I wasn't able to get
the tool running at all due to environment problems on my machine.

Thanks,
Mark.
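The derived figures follow directly from the raw counts; a standalone recomputation (with the same caveat as the mail: the line counts are slightly dodgy, since objdump output includes label and header lines, not just instructions):

```python
# Raw objdump|grep counts quoted in the thread
stock_lines, zcr_lines = 3979677, 4422188      # total objdump output lines
stock_movs, zcr_movs = 50070, 491371           # 'mov xN, #0x0' lines
stock_x16x17, zcr_x16x17 = 1, 135729           # 'mov x16/x17, #0x0' lines

new_movs = zcr_movs - stock_movs               # 441301 new zeroing MOVs
new_lines = zcr_lines - stock_lines            # 442511 new lines overall
new_x16x17 = zcr_x16x17 - stock_x16x17         # 135728 new x16/x17 MOVs

print(new_movs, new_lines, new_x16x17)
print(f"x16/x17 share of new MOVs: ~{new_x16x17 / new_movs * 100:.0f}%")          # ~31%
print(f"MOV density: ~{stock_movs / stock_lines * 100:.1f}% -> "
      f"~{zcr_movs / zcr_lines * 100:.0f}%")                                      # ~1.3% -> ~11%
```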
On Mon, May 10, 2021 at 02:45:03PM +0100, Mark Rutland wrote:
> About 31% of this seems to be due to GCC (almost) always clearing x16
> and x17 (see further down for numbers). I suspect that's because GCC has
> to assume that any (non-static) functions might be reached via a PLT
> which would clobber x16 and x17 with specific values.

Wheee.

> We also have a bunch of small functions with multiple returns, where
> each return path gets the full complement of zeroing instructions, e.g.
[...]
> ... where we go from 12 instructions to 20, which is a ~67% bloat.

Yikes. Yeah, so that is likely a good example of missed optimization
opportunity.

> We have a bunch of cases like the above. Also note that per the AAPCS a
> function can clobber x0-17 (and x18 if it's not reserved for something
> like SCS), and I see a few places that clobber x1-x17.

Ah, gotcha. I wasn't quite sure which registers might qualify.

> [...]
> That's 441301 new MOVs, and the equivalent of 442511 new instructions
> overall. There are 135728 new MOVs to x16 and x17 specifically, which
> account for ~31% of that.

I assume the x16/x17 case could be addressed by the compiler if it
examined the need for PLTs, or is that too late (in the sense that the
linker is doing that phase)?

Regardless, I will update the documentation on this feature. :)
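The ~67% bloat in the quoted fpsimd_sync_to_sve example is fully accounted for by the zeroing sequence alone: four MOVs (x0, x1, x16, x17) duplicated on each of the two return paths. A trivial check of that arithmetic:

```python
stock_insns = 12        # instructions in the stock function
movs_per_return = 4     # x0, x1, x16, x17 zeroed before each ret
return_paths = 2        # early-exit ret plus the normal epilogue ret

zcr_insns = stock_insns + movs_per_return * return_paths
print(zcr_insns)                                              # 20
print(round((zcr_insns - stock_insns) / stock_insns * 100))   # 67
```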
On Mon, May 10, 2021 at 03:01:48PM -0700, Kees Cook wrote:
> On Mon, May 10, 2021 at 02:45:03PM +0100, Mark Rutland wrote:
[...]
> > That's 441301 new MOVs, and the equivalent of 442511 new instructions
> > overall. There are 135728 new MOVs to x16 and x17 specifically, which
> > account for ~31% of that.
>
> I assume the x16/x17 case could be addressed by the compiler if it
> examined the need for PLTs, or is that too late (in the sense that the
> linker is doing that phase)?

Most (all?) PLTs will be created at link time, and IIUC the compiler
simply has to assume any non-static function might have a PLT, since the
AAPCS permits that. Maybe some of the smaller memory size models don't
permit PLTs, but I have no real knowledge of that area and I'm already
out on a limb. LTO could probably help with visibility, but otherwise I
don't see a way the compiler could be sure a PLT won't exist.

> Regardless, I will update the documentation on this feature. :)

Great; thanks!

Mark.
diff --git a/Makefile b/Makefile
index 31dcdb3d61fa..810600618490 100644
--- a/Makefile
+++ b/Makefile
@@ -811,6 +811,11 @@ KBUILD_CFLAGS	+= -ftrivial-auto-var-init=zero
 KBUILD_CFLAGS	+= -enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang
 endif
 
+# Clear used registers at func exit (to reduce data lifetime and ROP gadgets).
+ifdef CONFIG_ZERO_CALL_USED_REGS
+KBUILD_CFLAGS	+= -fzero-call-used-regs=used-gpr
+endif
+
 DEBUG_CFLAGS	:=
 
 # Workaround for GCC versions < 5.0
diff --git a/security/Kconfig.hardening b/security/Kconfig.hardening
index 269967c4fc1b..85f7f2036725 100644
--- a/security/Kconfig.hardening
+++ b/security/Kconfig.hardening
@@ -217,6 +217,23 @@ config INIT_ON_FREE_DEFAULT_ON
 	  touching "cold" memory areas. Most cases see 3-5% impact. Some
 	  synthetic workloads have measured as high as 8%.
 
+config CC_HAS_ZERO_CALL_USED_REGS
+	def_bool $(cc-option,-fzero-call-used-regs=used-gpr)
+
+config ZERO_CALL_USED_REGS
+	bool "Enable register zeroing on function exit"
+	depends on CC_HAS_ZERO_CALL_USED_REGS
+	help
+	  At the end of functions, always zero any caller-used register
+	  contents. This helps ensure that temporary values are not
+	  leaked beyond the function boundary. This means that register
+	  contents are less likely to be available for side channels
+	  and information exposures. Additionally, this helps reduce the
+	  number of useful ROP gadgets by about 20% (and removes compiler
+	  generated "write-what-where" gadgets) in the resulting kernel
+	  image. This has a less than 1% performance impact on most
+	  workloads, and grows the image size less than 1%.
+
 endmenu
 
 endmenu
When CONFIG_ZERO_CALL_USED_REGS is enabled, build the kernel with
"-fzero-call-used-regs=used-gpr" (in GCC 11). This option will zero any
caller-used register contents just before returning from a function,
ensuring that temporary values are not leaked beyond the function
boundary. This means that register contents are less likely to be
available for side channel attacks and information exposures.

Additionally this helps reduce the number of useful ROP gadgets in the
kernel image by about 20%:

$ ROPgadget.py --nosys --nojop --binary vmlinux.stock | tail -n1
Unique gadgets found: 337245

$ ROPgadget.py --nosys --nojop --binary vmlinux.zero-call-regs | tail -n1
Unique gadgets found: 267175

and more notably removes simple "write-what-where" gadgets:

$ ROPgadget.py --ropchain --binary vmlinux.stock | sed -n '/Step 1/,/Step 2/p'
- Step 1 -- Write-what-where gadgets

[+] Gadget found: 0xffffffff8102d76c mov qword ptr [rsi], rdx ; ret
[+] Gadget found: 0xffffffff81000cf5 pop rsi ; ret
[+] Gadget found: 0xffffffff8104d7c8 pop rdx ; ret
[-] Can't find the 'xor rdx, rdx' gadget. Try with another 'mov [reg], reg'

[+] Gadget found: 0xffffffff814c2b4c mov qword ptr [rsi], rdi ; ret
[+] Gadget found: 0xffffffff81000cf5 pop rsi ; ret
[+] Gadget found: 0xffffffff81001e51 pop rdi ; ret
[-] Can't find the 'xor rdi, rdi' gadget. Try with another 'mov [reg], reg'

[+] Gadget found: 0xffffffff81540d61 mov qword ptr [rsi], rdi ; pop rbx ; pop rbp ; ret
[+] Gadget found: 0xffffffff81000cf5 pop rsi ; ret
[+] Gadget found: 0xffffffff81001e51 pop rdi ; ret
[-] Can't find the 'xor rdi, rdi' gadget. Try with another 'mov [reg], reg'

[+] Gadget found: 0xffffffff8105341e mov qword ptr [rsi], rax ; ret
[+] Gadget found: 0xffffffff81000cf5 pop rsi ; ret
[+] Gadget found: 0xffffffff81029a11 pop rax ; ret
[+] Gadget found: 0xffffffff811f1c3b xor rax, rax ; ret

- Step 2 -- Init syscall number gadgets

$ ROPgadget.py --ropchain --binary vmlinux.zero* | sed -n '/Step 1/,/Step 2/p'
- Step 1 -- Write-what-where gadgets

[-] Can't find the 'mov qword ptr [r64], r64' gadget

In parallel build tests, this has a less than 1% performance impact,
and grows the image size less than 1%:

$ size vmlinux.stock vmlinux.zero-call-regs
    text    data      bss      dec     hex filename
22437676 8559152 14127340 45124168 2b08a48 vmlinux.stock
22453184 8563248 14110956 45127388 2b096dc vmlinux.zero-call-regs

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 Makefile                   |  5 +++++
 security/Kconfig.hardening | 17 +++++++++++++++++
 2 files changed, 22 insertions(+)