Message ID:   20211030074758.GT174703@worktop.programming.kicks-ass.net (mailing list archive)
State:        Mainlined
Commit:       2105a92748e83e2e3ee6be539da959706bbb3898
Delegated to: Kees Cook
Series:       static_call,x86: Robustify trampoline patching
On Sat, Oct 30, 2021 at 09:47:58AM +0200, Peter Zijlstra wrote:
> On Fri, Oct 29, 2021 at 10:03:24PM +0200, Peter Zijlstra wrote:
>
> > So I had a bit of a peek at what clang generates:
> >
> >     3fa4: 48 c7 c7 00 00 00 00    mov $0x0,%rdi    3fa7: R_X86_64_32S __SCK__x86_pmu_handle_irq
> >     3fab: 48 c7 c6 00 00 00 00    mov $0x0,%rsi    3fae: R_X86_64_32S __SCT__x86_pmu_handle_irq.cfi_jt
> >     3fb2: e8 00 00 00 00          call 3fb7 <init_hw_perf_events+0x1dc>    3fb3: R_X86_64_PLT32 __static_call_update-0x4
> >
> > So this then gives the trampoline jump table entry to
> > __static_call_update(), with the result that it will rewrite the
> > jump-table entry, not the trampoline!
> >
> > Now it so happens that the trampoline looks *exactly* like the
> > jump-table entry (one jmp.d32 instruction), so in that regard it'll
> > again 'work'.
> >
> > But this is all really, as in *really*, wrong. And I'm really sad I'm
> > the one to have to discover this, even though I've mentioned
> > static_call()s being tricky in previous reviews.
>
> The below makes the clang-cfi build properly sick:
>
> [    0.000000] trampoline signature fail
> [    0.000000] ------------[ cut here ]------------
> [    0.000000] kernel BUG at arch/x86/kernel/static_call.c:65!

So fundamentally I think the whole notion that the address of a function
is something different than 'the address of that function' is an *utter*
fail.

So things like FineIBT use a scheme where they pass a hash value along
with the indirect call, which is verified at the other end. Can't we
adjust it like:

foo.cfi:
        endbr
        xorl $0xdeadbeef, %r10d
        jz foo
        ud2
        nop     # make it an even 16 bytes
foo:
        # actual function text

Then have the address of foo, be the address of foo, like any normal
sane person would expect. Have direct calls to foo, go to foo, again, as
expected.

When doing an indirect call (to r11, as clang does), then, and only
then, do:

        movl $0xdeadbeef, %r10d
        subq $0x10, %r11
        call *%r11

        # if the r11 lives, add:
        addq $0x10, %r11

Then only when caller and callee agree 0xdeadbeef is the password, does
the indirect call go through.

Why isn't this a suitable CFI scheme even without IBT?
On Sat, 30 Oct 2021 at 09:49, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Oct 29, 2021 at 10:03:24PM +0200, Peter Zijlstra wrote:
>
> > So I had a bit of a peek at what clang generates:
> >
> >     3fa4: 48 c7 c7 00 00 00 00    mov $0x0,%rdi    3fa7: R_X86_64_32S __SCK__x86_pmu_handle_irq
> >     3fab: 48 c7 c6 00 00 00 00    mov $0x0,%rsi    3fae: R_X86_64_32S __SCT__x86_pmu_handle_irq.cfi_jt
> >     3fb2: e8 00 00 00 00          call 3fb7 <init_hw_perf_events+0x1dc>    3fb3: R_X86_64_PLT32 __static_call_update-0x4
> >
> > So this then gives the trampoline jump table entry to
> > __static_call_update(), with the result that it will rewrite the
> > jump-table entry, not the trampoline!
> >
> > Now it so happens that the trampoline looks *exactly* like the
> > jump-table entry (one jmp.d32 instruction), so in that regard it'll
> > again 'work'.
> >
> > But this is all really, as in *really*, wrong. And I'm really sad I'm
> > the one to have to discover this, even though I've mentioned
> > static_call()s being tricky in previous reviews.
>
> The below makes the clang-cfi build properly sick:
>
> [    0.000000] trampoline signature fail
> [    0.000000] ------------[ cut here ]------------
> [    0.000000] kernel BUG at arch/x86/kernel/static_call.c:65!
>
> ---
> Subject: static_call,x86: Robustify trampoline patching
>
> Add a few signature bytes after the static call trampoline and verify
> those bytes match before patching the trampoline. This avoids patching
> random other JMPs (such as CFI jump-table entries) instead.
>
> These bytes decode as:
>
>    d:  53               push   %rbx
>    e:  43 54            rex.XB push %r12
>
> And happen to spell "SCT".
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

I just realized that arm64 has the exact same problem, which is not
being addressed by my v5 of the static call support patch.

As it turns out, the v11 Clang that I have been testing with is broken
wrt BTI landing pads, and omits them from the jump table entries.
Clang 12+ adds them properly, which means that both the jump table
entry and the static call trampoline may start with BTI C + direct
branch, and we also need additional checks to disambiguate.
On Sat, Oct 30, 2021 at 07:19:53PM +0200, Ard Biesheuvel wrote:
> I just realized that arm64 has the exact same problem, which is not
> being addressed by my v5 of the static call support patch.

Yeah, it would.

> As it turns out, the v11 Clang that I have been testing with is broken
> wrt BTI landing pads, and omits them from the jump table entries.
> Clang 12+ adds them properly, which means that both the jump table
> entry and the static call trampoline may start with BTI C + direct
> branch, and we also need additional checks to disambiguate.

I'm not sure, why would the static_call trampoline need a BTI C? The
whole point of static_call() is to be a direct call, we should never
have an indirect call to the trampoline, that would defeat the whole
purpose.
On Sat, 30 Oct 2021 at 20:03, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sat, Oct 30, 2021 at 07:19:53PM +0200, Ard Biesheuvel wrote:
> > I just realized that arm64 has the exact same problem, which is not
> > being addressed by my v5 of the static call support patch.
>
> Yeah, it would.
>
> > As it turns out, the v11 Clang that I have been testing with is broken
> > wrt BTI landing pads, and omits them from the jump table entries.
> > Clang 12+ adds them properly, which means that both the jump table
> > entry and the static call trampoline may start with BTI C + direct
> > branch, and we also need additional checks to disambiguate.
>
> I'm not sure, why would the static_call trampoline need a BTI C? The
> whole point of static_call() is to be a direct call, we should never
> have an indirect call to the trampoline, that would defeat the whole
> purpose.

This might happen when the distance between the caller and the
trampoline is more than 128 MB, in which case we emit a veneer that
uses an indirect call as well. So we definitely need the landing pad
in the trampoline.
On Sat, 30 Oct 2021 at 20:55, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Sat, 30 Oct 2021 at 20:03, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Sat, Oct 30, 2021 at 07:19:53PM +0200, Ard Biesheuvel wrote:
> > > I just realized that arm64 has the exact same problem, which is not
> > > being addressed by my v5 of the static call support patch.
> >
> > Yeah, it would.
> >
> > > As it turns out, the v11 Clang that I have been testing with is broken
> > > wrt BTI landing pads, and omits them from the jump table entries.
> > > Clang 12+ adds them properly, which means that both the jump table
> > > entry and the static call trampoline may start with BTI C + direct
> > > branch, and we also need additional checks to disambiguate.
> >
> > I'm not sure, why would the static_call trampoline need a BTI C? The
> > whole point of static_call() is to be a direct call, we should never
> > have an indirect call to the trampoline, that would defeat the whole
> > purpose.
>
> This might happen when the distance between the caller and the
> trampoline is more than 128 MB, in which case we emit a veneer that
> uses an indirect call as well. So we definitely need the landing pad
> in the trampoline.

Something like the below seems to work to prevent getting the wrong
trampoline address into arch_static_call_transform:

diff --git a/arch/x86/include/asm/static_call.h b/arch/x86/include/asm/static_call.h
index cbb67b6030f9..c3704ea21bee 100644
--- a/arch/x86/include/asm/static_call.h
+++ b/arch/x86/include/asm/static_call.h
@@ -25,7 +25,9 @@
 asm(".pushsection .static_call.text, \"ax\"		\n"	\
     ".align 4						\n"	\
     ".globl " STATIC_CALL_TRAMP_STR(name) "		\n"	\
+    ".globl " STATIC_CALL_TRAMP_P_STR(name) "		\n"	\
     STATIC_CALL_TRAMP_STR(name) ":			\n"	\
+    STATIC_CALL_TRAMP_P_STR(name) ":			\n"	\
     insns "						\n"	\
     ".type " STATIC_CALL_TRAMP_STR(name) ", @function	\n"	\
     ".size " STATIC_CALL_TRAMP_STR(name) ", . - " STATIC_CALL_TRAMP_STR(name) " \n"	\
diff --git a/include/linux/static_call.h b/include/linux/static_call.h
index 19dc210214c0..46777a3395d3 100644
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -143,7 +143,7 @@
  */
 extern void arch_static_call_transform(void *site, void *tramp, void *func, bool tail);
 
-#define STATIC_CALL_TRAMP_ADDR(name) &STATIC_CALL_TRAMP(name)
+#define STATIC_CALL_TRAMP_ADDR(name) &STATIC_CALL_TRAMP_P(name)
 
 #else
 #define STATIC_CALL_TRAMP_ADDR(name) NULL
diff --git a/include/linux/static_call_types.h b/include/linux/static_call_types.h
index 5a00b8b2cf9f..98a448f5ae45 100644
--- a/include/linux/static_call_types.h
+++ b/include/linux/static_call_types.h
@@ -18,6 +18,8 @@
 #define STATIC_CALL_TRAMP(name)	__PASTE(STATIC_CALL_TRAMP_PREFIX, name)
 #define STATIC_CALL_TRAMP_STR(name)	__stringify(STATIC_CALL_TRAMP(name))
 
+#define STATIC_CALL_TRAMP_P(name)	__PASTE(STATIC_CALL_TRAMP(name), _p)
+#define STATIC_CALL_TRAMP_P_STR(name)	__stringify(STATIC_CALL_TRAMP_P(name))
+
 /*
  * Flags in the low bits of static_call_site::key.
  */
@@ -36,7 +38,8 @@
 #define DECLARE_STATIC_CALL(name, func)				\
 	extern struct static_call_key STATIC_CALL_KEY(name);	\
-	extern typeof(func) STATIC_CALL_TRAMP(name);
+	extern typeof(func) STATIC_CALL_TRAMP(name);		\
+	extern u8 STATIC_CALL_TRAMP_P(name);
 
 #ifdef CONFIG_HAVE_STATIC_CALL

That leaves the 'func' argument, which ideally should not go through
the jump table either, but at least it is not terminally broken there.
On Sun, Oct 31, 2021 at 05:24:13PM +0100, Ard Biesheuvel wrote:
> On Sat, 30 Oct 2021 at 20:55, Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > On Sat, 30 Oct 2021 at 20:03, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Sat, Oct 30, 2021 at 07:19:53PM +0200, Ard Biesheuvel wrote:
> > > > I just realized that arm64 has the exact same problem, which is not
> > > > being addressed by my v5 of the static call support patch.
> > >
> > > Yeah, it would.
> > >
> > > > As it turns out, the v11 Clang that I have been testing with is broken
> > > > wrt BTI landing pads, and omits them from the jump table entries.
> > > > Clang 12+ adds them properly, which means that both the jump table
> > > > entry and the static call trampoline may start with BTI C + direct
> > > > branch, and we also need additional checks to disambiguate.
> > >
> > > I'm not sure, why would the static_call trampoline need a BTI C? The
> > > whole point of static_call() is to be a direct call, we should never
> > > have an indirect call to the trampoline, that would defeat the whole
> > > purpose.
> >
> > This might happen when the distance between the caller and the
> > trampoline is more than 128 MB, in which case we emit a veneer that
> > uses an indirect call as well. So we definitely need the landing pad
> > in the trampoline.
>
> Something like the below seems to work to prevent getting the wrong
> trampoline address into arch_static_call_transform:

It is also a terribly gross hack. I really want the clang-cfi stuff to
improve, not add layers of hacks on top of it.
On Sun, 31 Oct 2021 at 17:39, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Oct 31, 2021 at 05:24:13PM +0100, Ard Biesheuvel wrote:
> > Something like the below seems to work to prevent getting the wrong
> > trampoline address into arch_static_call_transform:
>
> It is also a terribly gross hack. I really want the clang-cfi stuff to
> improve, not add layers of hacks on top of it.

I'm just as annoyed as you are about the apparent need for this.
However, emitting an alias at build time is far better IMHO than
adding a magic byte sequence and having to check it at runtime.
On Sun, Oct 31, 2021 at 05:44:04PM +0100, Ard Biesheuvel wrote:
> > It is also a terribly gross hack. I really want the clang-cfi stuff to
> > improve, not add layers of hacks on top of it.
>
> I'm just as annoyed as you are about the apparent need for this.
> However, emitting an alias at build time is far better IMHO than
> adding a magic byte sequence and having to check it at runtime.

Oh, I'm keeping that magic sequence :-) That's hardening in general, and
I don't ever want to debug a wrong poke like that again.

Adding an extra label fixes this thing, but there's still the other
cases where we need/want/desire a *real* function pointer.

I'm very close to saying that anything that mucks up function pointers
like this is a complete non-starter. Let's re-start this whole CFI
endeavour from the start.
On Sun, 31 Oct 2021 at 21:11, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Oct 31, 2021 at 05:44:04PM +0100, Ard Biesheuvel wrote:
> > > It is also a terribly gross hack. I really want the clang-cfi stuff to
> > > improve, not add layers of hacks on top of it.
> >
> > I'm just as annoyed as you are about the apparent need for this.
> > However, emitting an alias at build time is far better IMHO than
> > adding a magic byte sequence and having to check it at runtime.
>
> Oh, I'm keeping that magic sequence :-) That's hardening in general, and
> I don't ever want to debug a wrong poke like that again.
>
> Adding an extra label fixes this thing, but there's still the other
> cases where we need/want/desire a *real* function pointer.
>
> I'm very close to saying that anything that mucks up function pointers
> like this is a complete non-starter. Let's re-start this whole CFI
> endeavour from the start.

Well, CFI is already in mainline for arm64, whereas static call support
is not. So we have to deal with it one way or the other.

So for the static call targets, I agree that we want to support any
expression that produces a function pointer, but that part is not
actually broken, it is just sub-optimal iff you are using CFI Clang.

For taking the address of the trampoline, I think the solutions we have
are sufficient (although I am not inclined to add the magic sig to
arm64 if the label is sufficient).

That means we can support static calls on arm64 now without breaking
Clang CFI, and work on a solution for the redundant jumps on a more
relaxed schedule.
On Sun, Oct 31, 2021 at 09:21:56PM +0100, Ard Biesheuvel wrote:
> That means we can support static calls on arm64 now without breaking
> Clang CFI, and work on a solution for the redundant jumps on a more
> relaxed schedule.

Yes, arm64 has a 'problem' with having already merged the clang-cfi
stuff :/

I'm hoping the x86 solution can be an alternative CFI scheme, I'm
starting to really hate this one. And I'm not at all convinced the
proposed scheme is the best possible scheme given the constraints of
kernel code. AFAICT it's a compromise made in userspace.
On Sun, 31 Oct 2021 at 21:45, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Oct 31, 2021 at 09:21:56PM +0100, Ard Biesheuvel wrote:
>
> > That means we can support static calls on arm64 now without breaking
> > Clang CFI, and work on a solution for the redundant jumps on a more
> > relaxed schedule.
>
> Yes, arm64 has a 'problem' with having already merged the clang-cfi
> stuff :/
>
> I'm hoping the x86 solution can be an alternative CFI scheme, I'm
> starting to really hate this one. And I'm not at all convinced the
> proposed scheme is the best possible scheme given the constraints of
> kernel code. AFAICT it's a compromise made in userspace.

Your scheme only works with IBT: the value of %r11 is under the
adversary's control, so it could just point it at 'foo+0x10' if it
wants to call foo indirectly, and circumvent the check. So without IBT
(or BTI), I think the check fundamentally belongs in the caller, not
in the callee.
On Mon, Nov 01, 2021 at 12:36:18AM +0100, Ard Biesheuvel wrote:
> On Sun, 31 Oct 2021 at 21:45, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > I'm hoping the x86 solution can be an alternative CFI scheme, I'm
> > starting to really hate this one. And I'm not at all convinced the
> > proposed scheme is the best possible scheme given the constraints of
> > kernel code. AFAICT it's a compromise made in userspace.
>
> Your scheme only works with IBT: the value of %r11 is under the
> adversary's control, so it could just point it at 'foo+0x10' if it
> wants to call foo indirectly, and circumvent the check. So without IBT
> (or BTI), I think the check fundamentally belongs in the caller, not
> in the callee.

How is that not true for the jump table approach? Like I showed earlier,
it is *trivial* to reconstruct the actual function pointer from a
jump-table entry pointer.

In any case, I really want the discussion to start at square one, and
show/explain why any chosen CFI scheme is actually good for the kernel.
Just because clang happened to have implemented it, doesn't make it the
most suitable scheme for the kernel.
From: Peter Zijlstra
> Sent: 01 November 2021 09:02
...
> In any case, I really want the discussion to start at square one, and
> show/explain why any chosen CFI scheme is actually good for the kernel.
> Just because clang happened to have implemented it, doesn't make it the
> most suitable scheme for the kernel.

How much overhead does it add to write("/dev/null", "", 1)?

You've two large jump tables: one for the syscall entry (all the
syscalls have the same prototype), and a second for selecting the
correct device driver's 'write' entry point. You really don't want to
be doing any kind of search.

Hardware that supported a (say) 16-bit constant in both the 'landing
pad' and the call indirect instruction, and trapped if they differed,
would be useful - but I doubt any hardware that checks landing pads is
anywhere near that useful.

	David
On Mon, 1 Nov 2021 at 10:05, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Nov 01, 2021 at 12:36:18AM +0100, Ard Biesheuvel wrote:
> > On Sun, 31 Oct 2021 at 21:45, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Sun, Oct 31, 2021 at 09:21:56PM +0100, Ard Biesheuvel wrote:
> > >
> > > > That means we can support static calls on arm64 now without breaking
> > > > Clang CFI, and work on a solution for the redundant jumps on a more
> > > > relaxed schedule.
> > >
> > > Yes, arm64 has a 'problem' with having already merged the clang-cfi
> > > stuff :/
> > >
> > > I'm hoping the x86 solution can be an alternative CFI scheme, I'm
> > > starting to really hate this one. And I'm not at all convinced the
> > > proposed scheme is the best possible scheme given the constraints of
> > > kernel code. AFAICT it's a compromise made in userspace.
> >
> > Your scheme only works with IBT: the value of %r11 is under the
> > adversary's control, so it could just point it at 'foo+0x10' if it
> > wants to call foo indirectly, and circumvent the check. So without IBT
> > (or BTI), I think the check fundamentally belongs in the caller, not
> > in the callee.
>
> How is that not true for the jump table approach? Like I showed earlier,
> it is *trivial* to reconstruct the actual function pointer from a
> jump-table entry pointer.

That is not the point. The point is that Clang instruments every
indirect call that it emits, to check whether the type of the jump
table entry it is about to call matches the type of the caller. IOW,
the indirect calls can only branch into jump tables, and all jump
table entries in a table each branch to the start of some function of
the same type.

So the only thing you could achieve by adding or subtracting a
constant value from the indirect call address is either calling
another function of the same type (if you are hitting another entry in
the same table), or failing the CFI type check.

Instrumenting the callee only needs something like BTI, and a
consistent use of the landing pads to ensure that you cannot trivially
omit the check by landing right after it.

> In any case, I really want the discussion to start at square one, and
> show/explain why any chosen CFI scheme is actually good for the kernel.
> Just because clang happened to have implemented it, doesn't make it the
> most suitable scheme for the kernel.

Not disagreeing with that for x86, but again, we're already past that
for arm64.
On Mon, Nov 01, 2021 at 03:14:41PM +0100, Ard Biesheuvel wrote:
> On Mon, 1 Nov 2021 at 10:05, Peter Zijlstra <peterz@infradead.org> wrote:
> > How is that not true for the jump table approach? Like I showed earlier,
> > it is *trivial* to reconstruct the actual function pointer from a
> > jump-table entry pointer.
>
> That is not the point. The point is that Clang instruments every
> indirect call that it emits, to check whether the type of the jump
> table entry it is about to call matches the type of the caller. IOW,
> the indirect calls can only branch into jump tables, and all jump
> table entries in a table each branch to the start of some function of
> the same type.
>
> So the only thing you could achieve by adding or subtracting a
> constant value from the indirect call address is either calling
> another function of the same type (if you are hitting another entry in
> the same table), or failing the CFI type check.

Ah, I see, so the call-site needs to have a branch around the indirect
call instruction.

> Instrumenting the callee only needs something like BTI, and a
> consistent use of the landing pads to ensure that you cannot trivially
> omit the check by landing right after it.

That does bring up another point tho; how are we going to do a kernel
that's optimal for both software CFI and hardware aided CFI?

All questions that need answering I think.

So how insane is something like this, have each function:

foo.cfi:
        endbr64
        xorl $0xdeadbeef, %r10d
        jz foo
        ud2
        nop     # make it 16 bytes
foo:
        # actual function text goes here

And for each hash have two thunks:

# arg: r11
# clobbers: r10, r11
__x86_indirect_cfi_deadbeef:
        movl -9(%r11), %r10     # immediate in foo.cfi
        xorl $0xdeadbeef, %r10  # our immediate
        jz 1f
        ud2
1:      ALTERNATIVE_2 "jmp *%r11",
                      "jmp __x86_indirect_thunk_r11", X86_FEATURE_RETPOLINE
                      "lfence; jmp *%r11", X86_FEATURE_RETPOLINE_AMD

# arg: r11
# clobbers: r10, r11
__x86_indirect_ibt_deadbeef:
        movl $0xdeadbeef, %r10
        subq $0x10, %r11
        ALTERNATIVE "", "lfence", X86_FEATURE_RETPOLINE
        jmp *%r11

And have the actual indirect callsite look like:

# r11 - &foo
        ALTERNATIVE_2 "cs call __x86_indirect_thunk_r11",
                      "cs call __x86_indirect_cfi_deadbeef", X86_FEATURE_CFI
                      "cs call __x86_indirect_ibt_deadbeef", X86_FEATURE_IBT

Although if the compiler were to emit:

        cs call __x86_indirect_cfi_deadbeef

we could probably fix it up from there.

Then we can at runtime decide between:

        {!cfi, cfi, ibt} x {!retpoline, retpoline, retpoline-amd}
On Tue, Nov 02, 2021 at 01:57:44PM +0100, Peter Zijlstra wrote:
> So how insane is something like this, have each function:
>
> foo.cfi:
>         endbr64
>         xorl $0xdeadbeef, %r10d
>         jz foo
>         ud2
>         nop     # make it 16 bytes
> foo:
>         # actual function text goes here
>
> And for each hash have two thunks:
>
> # arg: r11
> # clobbers: r10, r11
> __x86_indirect_cfi_deadbeef:
>         movl -9(%r11), %r10     # immediate in foo.cfi
>         xorl $0xdeadbeef, %r10  # our immediate
>         jz 1f
>         ud2
> 1:      ALTERNATIVE_2 "jmp *%r11",
>                       "jmp __x86_indirect_thunk_r11", X86_FEATURE_RETPOLINE
>                       "lfence; jmp *%r11", X86_FEATURE_RETPOLINE_AMD
>
> # arg: r11
> # clobbers: r10, r11
> __x86_indirect_ibt_deadbeef:
>         movl $0xdeadbeef, %r10
>         subq $0x10, %r11
>         ALTERNATIVE "", "lfence", X86_FEATURE_RETPOLINE
>         jmp *%r11

These two thunks could of course be one big alternative.

> And have the actual indirect callsite look like:
>
> # r11 - &foo
>         ALTERNATIVE_2 "cs call __x86_indirect_thunk_r11",
>                       "cs call __x86_indirect_cfi_deadbeef", X86_FEATURE_CFI
>                       "cs call __x86_indirect_ibt_deadbeef", X86_FEATURE_IBT

Also simplifying this.

> Although if the compiler were to emit:
>
>         cs call __x86_indirect_cfi_deadbeef
>
> we could probably fix it up from there.
>
> Then we can at runtime decide between:
>
>         {!cfi, cfi, ibt} x {!retpoline, retpoline, retpoline-amd}
On Sat, Oct 30, 2021 at 10:16:31AM +0200, Peter Zijlstra wrote:
> foo.cfi:
>         endbr
>         xorl $0xdeadbeef, %r10d
>         jz foo
>         ud2
>         nop     # make it an even 16 bytes
> foo:
>         # actual function text
>
> Then have the address of foo, be the address of foo, like any normal
> sane person would expect. Have direct calls to foo, go to foo, again, as
> expected.
>
> When doing an indirect call (to r11, as clang does), then, and only
> then, do:
>
>         movl $0xdeadbeef, %r10d
>         subq $0x10, %r11
>         call *%r11
>
>         # if the r11 lives, add:
>         addq $0x10, %r11
>
> Then only when caller and callee agree 0xdeadbeef is the password, does
> the indirect call go through.
>
> Why isn't this a suitable CFI scheme even without IBT?

The trouble is that the callee is doing the verification. There's no
protection against calling into a callee that doesn't perform a check
(e.g. BPF JIT, or otherwise constructed executable memory, etc). The
caller needs to do the verification that what they're calling into is
safe before it makes the call.
On Tue, 2 Nov 2021 at 16:15, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Nov 02, 2021 at 01:57:44PM +0100, Peter Zijlstra wrote:
>
> > So how insane is something like this, have each function:
> >
> > foo.cfi:
> >         endbr64
> >         xorl $0xdeadbeef, %r10d
> >         jz foo
> >         ud2
> >         nop     # make it 16 bytes
> > foo:
> >         # actual function text goes here
> >
> > And for each hash have two thunks:
> >
> > # arg: r11
> > # clobbers: r10, r11
> > __x86_indirect_cfi_deadbeef:
> >         movl -9(%r11), %r10     # immediate in foo.cfi
> >         xorl $0xdeadbeef, %r10  # our immediate
> >         jz 1f
> >         ud2
> > 1:      ALTERNATIVE_2 "jmp *%r11",
> >                       "jmp __x86_indirect_thunk_r11", X86_FEATURE_RETPOLINE
> >                       "lfence; jmp *%r11", X86_FEATURE_RETPOLINE_AMD

So are these supposed to go into the jump tables? If so, there still
needs to be a check against the boundary of the table at the call
site, to ensure that we are not calling something that we shouldn't.

If they are not going into the jump tables, I don't see the point of
having them, as only happy flow/uncompromised code would bother to use
them.

> > # arg: r11
> > # clobbers: r10, r11
> > __x86_indirect_ibt_deadbeef:
> >         movl $0xdeadbeef, %r10
> >         subq $0x10, %r11
> >         ALTERNATIVE "", "lfence", X86_FEATURE_RETPOLINE
> >         jmp *%r11
>
> These two thunks could of course be one big alternative.
>
> > And have the actual indirect callsite look like:
> >
> > # r11 - &foo
> >         ALTERNATIVE_2 "cs call __x86_indirect_thunk_r11",
> >                       "cs call __x86_indirect_cfi_deadbeef", X86_FEATURE_CFI
> >                       "cs call __x86_indirect_ibt_deadbeef", X86_FEATURE_IBT
>
> Also simplifying this.
>
> > Although if the compiler were to emit:
> >
> >         cs call __x86_indirect_cfi_deadbeef
> >
> > we could probably fix it up from there.
> >
> > Then we can at runtime decide between:
> >
> >         {!cfi, cfi, ibt} x {!retpoline, retpoline, retpoline-amd}
On Tue, Nov 02, 2021 at 01:57:44PM +0100, Peter Zijlstra wrote: > On Mon, Nov 01, 2021 at 03:14:41PM +0100, Ard Biesheuvel wrote: > > On Mon, 1 Nov 2021 at 10:05, Peter Zijlstra <peterz@infradead.org> wrote: > > > > How is that not true for the jump table approach? Like I showed earlier, > > > it is *trivial* to reconstruct the actual function pointer from a > > > jump-table entry pointer. > > > > > > > That is not the point. The point is that Clang instruments every > > indirect call that it emits, to check whether the type of the jump > > table entry it is about to call matches the type of the caller. IOW, > > the indirect calls can only branch into jump tables, and all jump > > table entries in a table each branch to the start of some function of > > the same type. > > > > So the only thing you could achieve by adding or subtracting a > > constant value from the indirect call address is either calling > > another function of the same type (if you are hitting another entry in > > the same table), or failing the CFI type check. > > Ah, I see, so the call-site needs to have a branch around the indirect > call instruction. > > > Instrumenting the callee only needs something like BTI, and a > > consistent use of the landing pads to ensure that you cannot trivially > > omit the check by landing right after it. > > That does bring up another point tho; how are we going to do a kernel > that's optimal for both software CFI and hardware aided CFI? > > All questions that need answering I think. I'm totally fine with designing a new CFI for a future option, but blocking the existing (working) one does not best serve our end users. There are already people waiting on x86 CFI because having the extra layer of defense is valuable for them. No, it's not perfect, but it works right now, and evidence from Android shows that it has significant real-world defensive value. 
Some of the more adventurous are actually patching their kernels with the CFI support already, and happily running their workloads, etc. Supporting Clang CFI means we actually have something to evolve from, where as starting completely over means (likely significant) delays leaving folks without the option available at all. I think the various compiler and kernel tweaks needed to improve kernel support are reasonable, but building a totally new CFI implementation is not: it _does_ work today on x86. Yes, it's weird, but not outrageously so. (And just to state the obvious, CFI is an _optional_ CONFIG: not everyone wants CFI, so it's okay if there are some sharp edges under some CONFIG combinations.) Regardless, speaking to a new CFI design below: > So how insane is something like this, have each function: > > foo.cfi: > endbr64 > xorl $0xdeadbeef, %r10d > jz foo > ud2 > nop # make it 16 bytes > foo: > # actual function text goes here > > > And for each hash have two thunks: > > > # arg: r11 > # clobbers: r10, r11 > __x86_indirect_cfi_deadbeef: > movl -9(%r11), %r10 # immediate in foo.cfi This requires the text be readable. I have been hoping to avoid this for a CFI implementation so we could gain the benefit of execute-only memory (available soon on arm64, and possible today on x86 under a hypervisor). But, yes, FWIW, this is very similar to what PaX RAP CFI does: the caller reads "$dest - offset" for a hash, and compares it against the hard-coded hash at the call site, before "call $dest". 
> xorl $0xdeadbeef, %r10 # our immediate > jz 1f > ud2 > 1: ALTERNATIVE_2 "jmp *%r11", > "jmp __x86_indirect_thunk_r11", X86_FEATURE_RETPOLINE > "lfence; jmp *%r11", X86_FEATURE_RETPOLINE_AMD > > > > # arg: r11 > # clobbers: r10, r11 > __x86_indirect_ibt_deadbeef: > movl $0xdeadbeef, %r10 > subq $0x10, %r11 > ALTERNATIVE "", "lfence", X86_FEATURE_RETPOLINE > jmp *%r11 > > > > And have the actual indirect callsite look like: > > # r11 - &foo > ALTERNATIVE_2 "cs call __x86_indirect_thunk_r11", > "cs call __x86_indirect_cfi_deadbeef", X86_FEATURE_CFI > "cs call __x86_indirect_ibt_deadbeef", X86_FEATURE_IBT > > Although if the compiler were to emit: > > cs call __x86_indirect_cfi_deadbeef > > we could probably fix it up from there. It seems like this could work for any CFI implementation, though, if the CFI implementation always performed a call, or if the bounds of the inline checking were known? i.e. objtool could also find the inline checks just as well as the call, though? > Then we can at runtime decide between: > > {!cfi, cfi, ibt} x {!retpoline, retpoline, retpoline-amd} This does look pretty powerful, but I still don't think it precludes using the existing Clang CFI. I don't want perfect to be the enemy of good. :)
On Tue, Nov 02, 2021 at 06:44:56PM +0100, Ard Biesheuvel wrote: > On Tue, 2 Nov 2021 at 16:15, Peter Zijlstra <peterz@infradead.org> wrote: > > > > On Tue, Nov 02, 2021 at 01:57:44PM +0100, Peter Zijlstra wrote: > > > > > So how insane is something like this, have each function: > > > > > > foo.cfi: > > > endbr64 > > > xorl $0xdeadbeef, %r10d > > > jz foo > > > ud2 > > > nop # make it 16 bytes > > > foo: > > > # actual function text goes here > > > > > > > > > And for each hash have two thunks: > > > > > > > > > # arg: r11 > > > # clobbers: r10, r11 > > > __x86_indirect_cfi_deadbeef: > > > movl -9(%r11), %r10 # immediate in foo.cfi > > > xorl $0xdeadbeef, %r10 # our immediate > > > jz 1f > > > ud2 > > > 1: ALTERNATIVE_2 "jmp *%r11", > > > "jmp __x86_indirect_thunk_r11", X86_FEATURE_RETPOLINE > > > "lfence; jmp *%r11", X86_FEATURE_RETPOLINE_AMD > > > > > So are these supposed to go into the jump tables? If so, there still > needs to be a check against the boundary of the table at the call > site, to ensure that we are not calling something that we shouldn't. > > If they are not going into the jump tables, I don't see the point of > having them, as only happy flow/uncompromised code would bother to use > them. I don't understand. If you can scribble your own code, you can circumvent pretty much any range check anyway. But if you can't scribble your own code, you get to use the branch here and that checks the callsite and callee signature. The range check isn't fundamental to CFI, having a check is the important thing AFAIU.
On Tue, Nov 02, 2021 at 10:35:30AM -0700, Kees Cook wrote: > On Sat, Oct 30, 2021 at 10:16:31AM +0200, Peter Zijlstra wrote: > > foo.cfi: > > endbr > > xorl $0xdeadbeef, %r10d > > jz foo > > ud2 > > nop # make it an even 16 bytes > > foo: > > # actual function text > > > > > > Then have the address of foo, be the address of foo, like any normal > > sane person would expect. Have direct calls to foo, go to foo, again, as > > expected. > > > > When doing an indirect call (to r11, as clang does), then, and only > > then, do: > > > > movl $0xdeadbeef, %r10d > > subq $0x10, %r11 > > call *%r11 > > > > # if the r11 lives, add: > > addq $0x10, %r11 > > > > > > Then only when caller and callee agree 0xdeadbeef is the password, does > > the indirect call go through. > > > > Why isn't this a suitable CFI scheme even without IBT? > > The trouble is that the callee is doing the verification. There's no > protection against calling into a callee that doesn't perform a check > (e.g. BPF JIT, or otherwise constructed executable memory, etc). The > caller needs to do the verification that what they're calling into is > safe before it makes the call. Right, Ard said the same, see new crackpot scheme here: https://lkml.kernel.org/r/YYE1yPClPMHvyvIt@hirez.programming.kicks-ass.net
On Tue, Nov 02, 2021 at 07:14:25PM +0100, Peter Zijlstra wrote: > On Tue, Nov 02, 2021 at 06:44:56PM +0100, Ard Biesheuvel wrote: > > On Tue, 2 Nov 2021 at 16:15, Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > On Tue, Nov 02, 2021 at 01:57:44PM +0100, Peter Zijlstra wrote: > > > > > > > So how insane is something like this, have each function: > > > > > > > > foo.cfi: > > > > endbr64 > > > > xorl $0xdeadbeef, %r10d > > > > jz foo > > > > ud2 > > > > nop # make it 16 bytes > > > > foo: > > > > # actual function text goes here > > > > > > > > > > > > And for each hash have two thunks: > > > > > > > > > > > > # arg: r11 > > > > # clobbers: r10, r11 > > > > __x86_indirect_cfi_deadbeef: > > > > movl -9(%r11), %r10 # immediate in foo.cfi > > > > xorl $0xdeadbeef, %r10 # our immediate > > > > jz 1f > > > > ud2 > > > > 1: ALTERNATIVE_2 "jmp *%r11", > > > > "jmp __x86_indirect_thunk_r11", X86_FEATURE_RETPOLINE > > > > "lfence; jmp *%r11", X86_FEATURE_RETPOLINE_AMD > > > > > > > > So are these supposed to go into the jump tables? If so, there still > > needs to be a check against the boundary of the table at the call > > site, to ensure that we are not calling something that we shouldn't. > > > > If they are not going into the jump tables, I don't see the point of > > having them, as only happy flow/uncompromised code would bother to use > > them. > > I don't understand. If you can scribble your own code, you can > circumvent pretty much any range check anyway. But if you can't scribble > your own code, you get to use the branch here and that checks the > callsite and callee signature. > > The range check isn't fundamental to CFI, having a check is the > important thing AFAIU. That is, how is a jump-table/range-check better than a hash-value match check?
On Tue, 2 Nov 2021 at 19:14, Peter Zijlstra <peterz@infradead.org> wrote: > > On Tue, Nov 02, 2021 at 06:44:56PM +0100, Ard Biesheuvel wrote: > > On Tue, 2 Nov 2021 at 16:15, Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > On Tue, Nov 02, 2021 at 01:57:44PM +0100, Peter Zijlstra wrote: > > > > > > > So how insane is something like this, have each function: > > > > > > > > foo.cfi: > > > > endbr64 > > > > xorl $0xdeadbeef, %r10d > > > > jz foo > > > > ud2 > > > > nop # make it 16 bytes > > > > foo: > > > > # actual function text goes here > > > > > > > > > > > > And for each hash have two thunks: > > > > > > > > > > > > # arg: r11 > > > > # clobbers: r10, r11 > > > > __x86_indirect_cfi_deadbeef: > > > > movl -9(%r11), %r10 # immediate in foo.cfi > > > > xorl $0xdeadbeef, %r10 # our immediate > > > > jz 1f > > > > ud2 > > > > 1: ALTERNATIVE_2 "jmp *%r11", > > > > "jmp __x86_indirect_thunk_r11", X86_FEATURE_RETPOLINE > > > > "lfence; jmp *%r11", X86_FEATURE_RETPOLINE_AMD > > > > > > > > So are these supposed to go into the jump tables? If so, there still > > needs to be a check against the boundary of the table at the call > > site, to ensure that we are not calling something that we shouldn't. > > > > If they are not going into the jump tables, I don't see the point of > > having them, as only happy flow/uncompromised code would bother to use > > them. > > I don't understand. If you can scribble your own code, you can > circumvent pretty much any range check anyway. A function pointer is data, not code. > But if you can't scribble > your own code, you get to use the branch here and that checks the > callsite and callee signature. > OK, so the call site has a direct branch to this trampoline then? That wasn't clear to me. > The range check isn't fundamental to CFI, having a check is the > important thing AFAIU. Agreed. If the call site has a direct branch, it doesn't need the range check.
On Tue, Nov 2, 2021, at 11:10 AM, Kees Cook wrote: > On Tue, Nov 02, 2021 at 01:57:44PM +0100, Peter Zijlstra wrote: >> On Mon, Nov 01, 2021 at 03:14:41PM +0100, Ard Biesheuvel wrote: >> > On Mon, 1 Nov 2021 at 10:05, Peter Zijlstra <peterz@infradead.org> wrote: >> >> > > How is that not true for the jump table approach? Like I showed earlier, >> > > it is *trivial* to reconstruct the actual function pointer from a >> > > jump-table entry pointer. >> > > >> > >> > That is not the point. The point is that Clang instruments every >> > indirect call that it emits, to check whether the type of the jump >> > table entry it is about to call matches the type of the caller. IOW, >> > the indirect calls can only branch into jump tables, and all jump >> > table entries in a table each branch to the start of some function of >> > the same type. >> > >> > So the only thing you could achieve by adding or subtracting a >> > constant value from the indirect call address is either calling >> > another function of the same type (if you are hitting another entry in >> > the same table), or failing the CFI type check. >> >> Ah, I see, so the call-site needs to have a branch around the indirect >> call instruction. >> >> > Instrumenting the callee only needs something like BTI, and a >> > consistent use of the landing pads to ensure that you cannot trivially >> > omit the check by landing right after it. >> >> That does bring up another point tho; how are we going to do a kernel >> that's optimal for both software CFI and hardware aided CFI? >> >> All questions that need answering I think. > > I'm totally fine with designing a new CFI for a future option, > but blocking the existing (working) one does not best serve our end > users. I like security, but I also like building working systems, and I think I disagree with you. 
There are a whole bunch of CFI schemes out there, with varying hardware requirements, and they provide varying degrees of fine grained protection and varying degrees of protection against improper speculation. We do not want to merge clang CFI just because it’s “ready” and end up with a mess that makes it harder to support other schemes in the kernel. So, yes, a good CFI scheme needs caller-side protection, especially if IBT isn’t in use. But a good CFI scheme also needs to interoperate with the rest of the kernel, and this whole “canonical” and symbol-based lookup and static_call thing is nonsense. I think we need a better implementation, whether it uses intrinsics or little C helpers or whatever. I’m not saying this needs to be incompatible with current clang releases, but I do think we need a clear story for how operations like static call patching are supposed to work. FYI, Ard, many years ago we merged kernel support for the original gcc stack protector. We have since *removed* it on x86_32 in favor of a nicer implementation that requires a newer toolchain.
On Tue, Nov 02, 2021 at 11:10:10AM -0700, Kees Cook wrote: > On Tue, Nov 02, 2021 at 01:57:44PM +0100, Peter Zijlstra wrote: > > All questions that need answering I think. > > I'm totally fine with designing a new CFI for a future option, > but blocking the existing (working) one does not best serve our end > users. Accepting a half arsed CFI now, just because it is what we have, will only make it ever so much more difficult to get new/better things implemented in the toolchains, because the pressure is off. Also, it will make enabling IBT more difficult, or we'll end up having CFI and IBT mutually exclusive, which is a crap position to be in. > There are already people waiting on x86 CFI because having the > extra layer of defense is valuable for them. No, it's not perfect, > but it works right now, and evidence from Android shows that it has > significant real-world defensive value. Some of the more adventurous > are actually patching their kernels with the CFI support already, and > happily running their workloads, etc. It works by accident, not design. Which is a giant red flag in my book. > Supporting Clang CFI means we actually have something to evolve > from, I don't want to evolve from a place that's crazy and we shouldn't have been in to begin with. Redefining what a function pointer is is insane, and the required work-arounds are ugly at best. The ARM64 folks have expressed regret from having merged it. Why should x86 do something that's known to cause regret? > where as starting completely over means (likely significant) > delays leaving folks without the option available at all. See above. Maybe some of those folks will get motivated to make it happen faster. > I think the various compiler and kernel tweaks needed to improve > kernel support are reasonable, but building a totally new CFI > implementation is not: it _does_ work today on x86. By sodding accident; see that one static call patch that makes it burn. 
If someone were to use the jump-table entry after we'd scribbled it, things will go sideways badly. > Yes, it's weird, but not outrageously so. Clearly we disagree. > Regardless, speaking to a new CFI design below: > > > So how insane is something like this, have each function: > > > > foo.cfi: > > endbr64 > > xorl $0xdeadbeef, %r10d > > jz foo > > ud2 > > nop # make it 16 bytes > > foo: > > # actual function text goes here > > > > > > And for each hash have two thunks: > > > > > > # arg: r11 > > # clobbers: r10, r11 > > __x86_indirect_cfi_deadbeef: > > movl -9(%r11), %r10 # immediate in foo.cfi > > This requires the text be readable. I have been hoping to avoid this for > a CFI implementation so we could gain the benefit of execute-only > memory (available soon on arm64, and possible today on x86 under a > hypervisor). It's only needed for the 'legacy' case of software only CFI if that makes you feel better. BTI/IBT based CFI doesn't need this. (also, I'm very much a bare metal kinda guy) > > xorl $0xdeadbeef, %r10 # our immediate > > jz 1f > > ud2 > > 1: ALTERNATIVE_2 "jmp *%r11", > > "jmp __x86_indirect_thunk_r11", X86_FEATURE_RETPOLINE > > "lfence; jmp *%r11", X86_FEATURE_RETPOLINE_AMD > > > > > > > > # arg: r11 > > # clobbers: r10, r11 > > __x86_indirect_ibt_deadbeef: > > movl $0xdeadbeef, %r10 > > subq $0x10, %r11 > > ALTERNATIVE "", "lfence", X86_FEATURE_RETPOLINE > > jmp *%r11 > > IBT case ^ doesn't actually read from the code. It simply jumps in front of the real function by a set amount to get the extra cruft / landing-pad required for indirect calls. Also note that direct calls never make use of that pad, which means it can be stripped from all symbols that never have their address taken, which means fewer indirect targets. 
(Or poison the thing with UD2 and write 0 in the immediate) > > And have the actual indirect callsite look like: > > > > # r11 - &foo > > ALTERNATIVE_2 "cs call __x86_indirect_thunk_r11", > > "cs call __x86_indirect_cfi_deadbeef", X86_FEATURE_CFI > > "cs call __x86_indirect_ibt_deadbeef", X86_FEATURE_IBT > > > > Although if the compiler were to emit: > > > > cs call __x86_indirect_cfi_deadbeef > > > > we could probably fix it up from there. > > It seems like this could work for any CFI implementation, though, if The strong suggestion is that function pointers are sane, and there's a few other considerations, but yes. > the CFI implementation always performed a call, or if the bounds of the > inline checking were known? i.e. objtool could also find the inline > checks just as well as the call, though? Somewhat hard, it's much easier to find a single instruction than a pattern. Also, footprint. Smaller really is better. > > Then we can at runtime decide between: > > > > {!cfi, cfi, ibt} x {!retpoline, retpoline, retpoline-amd} > > This does look pretty powerful, but I still don't think it precludes > using the existing Clang CFI. I don't want perfect to be the enemy of > good. :) As stated above, we're in violent disagreement that the proposed scheme comes anywhere near good. I utterly hate the redefinition of function pointers and I also think the range compare goes sideways in the face of modules. That's two marks against jump-tables. (Also, in the !CFI case, you can't actually free the jump-tables, since you're unconditionally s(t)uck with them because of how &func is wrecked.)
On Tue, Nov 02, 2021 at 07:18:53PM +0100, Ard Biesheuvel wrote: > > The range check isn't fundamental to CFI, having a check is the > > important thing AFAIU. > > Agreed. If the call site has a direct branch, it doesn't need the range check. That, from the earlier email: | And have the actual indirect callsite look like: | | # r11 - &foo | ALTERNATIVE_2 "cs call __x86_indirect_thunk_r11", | "cs call __x86_indirect_cfi_deadbeef", X86_FEATURE_CFI | "cs call __x86_indirect_ibt_deadbeef", X86_FEATURE_IBT So the callsite has a direct call to the hash-specific and cfi-type specific thunk, which then does an (indirect) tail-call. The CFI one does the hash check in the thunk and jumps to the function proper, the IBT one does it in the landing-pad. The !CFI one ignores it all and simply does an indirect call (retpoline aided or otherwise) to the function proper -- in which case we can free all the thunks.
On Tue, Nov 02, 2021 at 02:02:38PM -0700, Andy Lutomirski wrote: > > > On Tue, Nov 2, 2021, at 11:10 AM, Kees Cook wrote: > > On Tue, Nov 02, 2021 at 01:57:44PM +0100, Peter Zijlstra wrote: > >> On Mon, Nov 01, 2021 at 03:14:41PM +0100, Ard Biesheuvel wrote: > >> > On Mon, 1 Nov 2021 at 10:05, Peter Zijlstra <peterz@infradead.org> wrote: > >> > >> > > How is that not true for the jump table approach? Like I showed earlier, > >> > > it is *trivial* to reconstruct the actual function pointer from a > >> > > jump-table entry pointer. > >> > > > >> > > >> > That is not the point. The point is that Clang instruments every > >> > indirect call that it emits, to check whether the type of the jump > >> > table entry it is about to call matches the type of the caller. IOW, > >> > the indirect calls can only branch into jump tables, and all jump > >> > table entries in a table each branch to the start of some function of > >> > the same type. > >> > > >> > So the only thing you could achieve by adding or subtracting a > >> > constant value from the indirect call address is either calling > >> > another function of the same type (if you are hitting another entry in > >> > the same table), or failing the CFI type check. > >> > >> Ah, I see, so the call-site needs to have a branch around the indirect > >> call instruction. > >> > >> > Instrumenting the callee only needs something like BTI, and a > >> > consistent use of the landing pads to ensure that you cannot trivially > >> > omit the check by landing right after it. > >> > >> That does bring up another point tho; how are we going to do a kernel > >> that's optimal for both software CFI and hardware aided CFI? > >> > >> All questions that need answering I think. > > > > I'm totally fine with designing a new CFI for a future option, > > but blocking the existing (working) one does not best serve our end > > users. > > I like security, but I also like building working systems, and I think > I disagree with you. 
There are a whole bunch of CFI schemes out there, > with varying hardware requirements, and they provide varying degrees > of fine grained protection and varying degrees of protection against > improper speculation. We do not want to merge clang CFI just because > it’s “ready” and end up with a mess that makes it harder to support > other schemes in the kernel. Right, and I see the difficulties here. And speaking to Peter's observation that CFI "accidentally" worked with static_calls, I don't see it that way: it worked because it was designed to be as "invisible" as possible. It's just that at a certain point of extreme binary output control, it becomes an issue and I think that's going to be true for *any* CFI system: they each will have different design characteristics. One of the goals of the Clang CFI use in Linux was to make it as minimally invasive as possible (and you can see this guiding Sami's choices: e.g. he didn't go change all the opaque address uses to need a "&" prefix added, etc). I think we're always going to have some push/pull between the compiler's "general"ness and the kernel's "specific"ness. > So, yes, a good CFI scheme needs caller-side protection, especially if > IBT isn’t in use. But a good CFI scheme also needs to interoperate with > the rest of the kernel, and this whole “canonical” and symbol-based > lookup and static_call thing is nonsense. I think we need a better > implementation, whether it uses intrinsics or little C helpers or > whatever. I think we're very close already. Like I said, I think it's fine to nail down some of these interoperability requirements; we've been doing it all along. We got there with arm64, and it looks to me like we're almost there on x86. There is this particular case with static_calls now, but I don't think it's insurmountable. 
> I’m not saying this needs to be incompatible with current clang > releases, but I do think we need a clear story for how operations like > static call patching are supposed to work. Right -- and until very recently, it did all Just Work. I think we'll always have these kinds of issues with whatever CFI implementations are made available, and since we've already cleaned up so much of what's needed to make the kernel work with *any* CFI, it makes sense to continue with the one we've already got working for arm64. > FYI, Ard, many years ago we merged kernel support for the original > gcc stack protector. We have since *removed* it on x86_32 in favor of > a nicer implementation that requires a newer toolchain. Sure, and this is the kind of thing I mean: we had an awkward implementation of a meaningful defense, and we improved on it. I think it's important to be able to make these kinds of concessions to gain the defensive features they provide. And yes, we can continue to improve it, but in the meantime, we can stop entire classes of problems from happening to our user base.
On 11/2/21 16:13, Kees Cook wrote: > On Tue, Nov 02, 2021 at 02:02:38PM -0700, Andy Lutomirski wrote: >> >> >> On Tue, Nov 2, 2021, at 11:10 AM, Kees Cook wrote: >>> On Tue, Nov 02, 2021 at 01:57:44PM +0100, Peter Zijlstra wrote: >>>> On Mon, Nov 01, 2021 at 03:14:41PM +0100, Ard Biesheuvel wrote: >>>>> On Mon, 1 Nov 2021 at 10:05, Peter Zijlstra <peterz@infradead.org> wrote: >>>> >>>>>> How is that not true for the jump table approach? Like I showed earlier, >>>>>> it is *trivial* to reconstruct the actual function pointer from a >>>>>> jump-table entry pointer. >>>>>> >>>>> >>>>> That is not the point. The point is that Clang instruments every >>>>> indirect call that it emits, to check whether the type of the jump >>>>> table entry it is about to call matches the type of the caller. IOW, >>>>> the indirect calls can only branch into jump tables, and all jump >>>>> table entries in a table each branch to the start of some function of >>>>> the same type. >>>>> >>>>> So the only thing you could achieve by adding or subtracting a >>>>> constant value from the indirect call address is either calling >>>>> another function of the same type (if you are hitting another entry in >>>>> the same table), or failing the CFI type check. >>>> >>>> Ah, I see, so the call-site needs to have a branch around the indirect >>>> call instruction. >>>> >>>>> Instrumenting the callee only needs something like BTI, and a >>>>> consistent use of the landing pads to ensure that you cannot trivially >>>>> omit the check by landing right after it. >>>> >>>> That does bring up another point tho; how are we going to do a kernel >>>> that's optimal for both software CFI and hardware aided CFI? >>>> >>>> All questions that need answering I think. >>> >>> I'm totally fine with designing a new CFI for a future option, >>> but blocking the existing (working) one does not best serve our end >>> users. 
>> >> I like security, but I also like building working systems, and I think >> I disagree with you. There are a whole bunch of CFI schemes out there, >> with varying hardware requirements, and they provide varying degrees >> of fine grained protection and varying degrees of protection against >> improper speculation. We do not want to merge clang CFI just because >> it’s “ready” and end up with a mess that makes it harder to support >> other schemes in the kernel. > > Right, and I see the difficulties here. And speaking to Peter's > observation that CFI "accidentally" worked with static_calls, I don't > see it that way: it worked because it was designed to be as "invisible" > as possible. It's just that at a certain point of extreme binary output > control, it becomes an issue and I think that's going to be true for > *any* CFI system: they each will have different design characteristics. > > One of the goals of the Clang CFI use in Linux was to make it as > minimally invasive as possible (and you can see this guiding Sami's > choices: e.g. he didn't go change all the opaque address uses to need a > "&" prefix added, etc). I think we're always going to have some > push/pull between the compiler's "general"ness and the kernel's > "specific"ness. The "&" thing was the wrong way around. That part of CFI was Linux being sloppy, and the & part is a straight-up improvement. Improvements can be invasive. The problem with invasive code is when it invades places it doesn't belong. > >> So, yes, a good CFI scheme needs caller-side protection, especially if >> IBT isn’t in use. But a good CFI scheme also needs to interoperate with >> the rest of the kernel, and this whole “canonical” and symbol-based >> lookup and static_call thing is nonsense. I think we need a better >> implementation, whether it uses intrinsics or little C helpers or >> whatever. > > I think we're very close already. 
Like I said, I think it's fine to nail > down some of these interoperability requirements; we've been doing it > all along. We got there with arm64, and it looks to me like we're almost > there on x86. There is this particular case with static_calls now, but I > don't think it's insurmountable. So fundamentally, I think there's a sort of type system issue here, and it's more or less the same issue with IBT and with various fine-grained refinements to IBT. Using syntax more or less like what's been floating around this thread, the asm works like this: [possible magic here depending on the precise scheme] func.indirect: ENDBR (if it's an IBT build) [magic here?] [in a jump table scheme, there's a jmp here and possibly a large or negative gap.] func: actual body The point being that there's the actual function body (where one should *direct* jump to) and there's the address that goes in a C function pointer. Indirect calls/jumps go to (or pretend to go to, depending on the scheme) func.indirect. While one can plausibly kludge up the static call patching (and who knows what else -- eBPF, kprobes, mcount stuff, etc) using symbol-based hackery, I think we actually just want a way to convert from a function pointer to a function address. It doesn't need to be fast, but it needs to work. And I think this can probably be mostly done in plain C based on clang's implementation. Ideally it would have a typecheck: unsigned long __cfi_decode_funcpointer(funcptr type); but that's not happening in plain C because I don't think there's a way to get the magic metadata. But I bet we could do: unsigned long __cfi_decode_funcpointer(funcptr val, funcptr ref) { BUG_ON(ref and val aren't the same type); read insn bytes at val; return the jump target; } If this won't work with current clang, let's ask to get whatever basic functionality is needed to enable it. <rambling> Sadly, in clang, this is wrapped up in the incomprehensible "canonical" stuff. 
I think that's a big mistake -- any sane ENDBR-using scheme would really prefer that ENDBR to be right next to the actual function body, and really any scheme would benefit due to better cache locality. But, more importantly, IMO any sane ENDBR-using scheme wants to generate the indirect stub as part of code gen for the actual function. In an IBT build, this really doesn't deserve to work: non-IBT C file generating: func: ret IBT-aware C file: extern void func(void); ptr = func; clang has horrible magic to generate the indirect stub in the caller translation unit, and I think that's solving a problem that doesn't really need solving. Sure, it's nifty, but it IMO should be opt-in, at least in a world where everyone agrees to recompile for CFI. (Sure, using that hack for userspace IBT would be cute, and it would work.) There's another tradeoff, though: errors like this are possible: translation unit A: void func(int) { [body here] } translation unit B: extern void func(char); /* wrong signature! */ ptr = func; ptr(); The callee-generates-the-stub model will not catch this. The caller-generates-the-stub model will. > > Sure, and this is the kind of thing I mean: we had an awkward > implementation of a meaningful defense, and we improved on it. I think > it's important to be able to make these kinds of concessions to gain the > defensive features they provide. And yes, we can continue to improve it, > but in the meantime, we can stop entire classes of problems from > happening to our user base. > In retrospect, we should have put our feet down with stack protector, though. The problem with it was gcc hardcoding things that shouldn't have been hardcoded, and the gcc fixes were trivial.
On Tue, Nov 02, 2021 at 05:20:05PM -0700, Andy Lutomirski wrote: > I think that's a big mistake -- any sane ENDBR-using scheme would > really prefer that ENDBR to be right next to the actual function body, > and really any scheme would benefit due to better cache locality. Agreed, IBT/BTI want the landing pad in front of the actual function. > But, more importantly, IMO any sane ENDBR-using scheme wants to > generate the indirect stub as part of code gen for the actual > function. Sorta, I really want to be able to not have a landing pad for functions whose address is never taken. At that point it doesn't matter if it gets generated along with the function and then stripped/poisoned later, or generated later. As such, the landing pad should not be part of the function proper, direct calls should never observe it. Less landing pads is more better.
From: Peter Zijlstra > Sent: 03 November 2021 08:36 > > On Tue, Nov 02, 2021 at 05:20:05PM -0700, Andy Lutomirski wrote: > > I think that's a big mistake -- any sane ENDBR-using scheme would > > really prefer that ENDBR to be right next to the actual function body, > > and really any scheme would benefit due to better cache locality. > > Agreed, IBT/BTI want the landing pad in front of the actual function. > > > But, more importantly, IMO any sane ENDBR-using scheme wants to > > generate the indirect stub as part of code gen for the actual > > function. > > Sorta, I really want to be able to not have a landing pad for functions > whose address is never taken. At that point it doesn't matter if it gets > generated along with the function and then stripped/poisoned later, or > generated later. > > As such, the landing pad should not be part of the function proper, > direct calls should never observe it. > > Less landing pads is more better. One problem is when a direct call is 'too far' for a call instruction. IIRC this can happen in arm64 with modules (all 64bit except x86?). So an indirect call has to be used instead - which needs the landing pad. Although it may actually be better to put a trampoline (landing pad + near jump) elsewhere and have the module loader do the correct fixup. (Is the loader already generating a trampoline in the module code?) The function body can then be cache-line aligned - with its benefits. Can't anything that can write instructions always use a retpoline to implement a jump indirect to an arbitrary address? (Not to mention just generating the required code rather than a call.) AFAICT CFI is all about detecting invalid values in function pointer tables. It doesn't really protect in any way from JIT code doing incorrect things. David
On 11/3/21 01:35, Peter Zijlstra wrote:
> On Tue, Nov 02, 2021 at 05:20:05PM -0700, Andy Lutomirski wrote:
>> I think that's a big mistake -- any sane ENDBR-using scheme would
>> really prefer that ENDBR to be right next to the actual function body,
>> and really any scheme would benefit due to better cache locality.
>
> Agreed, IBT/BTI want the landing pad in front of the actual function.
>
>> But, more importantly, IMO any sane ENDBR-using scheme wants to
>> generate the indirect stub as part of code gen for the actual
>> function.
>
> Sorta, I really want to be able to not have a landing pad for functions
> whose address is never taken. At that point it doesn't matter if it gets
> generated along with the function and then stripped/poisoned later, or
> generated later.

Stripping is conceptually straightforward even without LTO.

foo.indirect:
	ENDBR
foo:
	...

and the linker learns (using magic function sections?) that, if
foo.indirect is not referenced, then it should not be generated.
Alternatively, a very straightforward walk over all the relocations in
an object could be done to poison the unused .indirect entries.

Making this work with DSOs, EXPORT_SYMBOL, etc will be somewhat
nontrivial, but not impossible.

--Andy
On 30/10/2021 10.16, Peter Zijlstra wrote:
> On Sat, Oct 30, 2021 at 09:47:58AM +0200, Peter Zijlstra wrote:
>
> So fundamentally I think the whole notion that the address of a function
> is something different than 'the address of that function' is an *utter*
> fail.

Agreed. IMHO, commits like 0a5b412891df, 981731129e0f should have been a
big red flag saying "ok, perhaps we shall not mess with the semantics of
function pointer comparisons".

Yes yes, the commit log in cf68fffb66d6 did dutifully say "This may break
code that relies on comparing addresses passed from other components.".
But it did not spell out that this, for example, means that i2c bus
recovery now doesn't result in recovering the bus but instead in a NULL
pointer deref and kernel panic:

Some i2c drivers set ->recover_bus = i2c_generic_scl_recovery and
provide ->sda_gpiod, ->scl_gpiod, then rely on

  if (bri->scl_gpiod && bri->recover_bus == i2c_generic_scl_recovery)

in i2c-core-base.c causing ->get_scl/->set_scl to be filled in. That's
of course broken with ->recover_bus pointing to the module's jump table,
so when ->recover_bus is actually called, the bri->set_scl() call in
i2c_generic_scl_recovery() is a NULL pointer deref. [I've tested that
this actually happens on an imx8mp_evk board.]

And who knows how much other code happens to rely on function pointer
comparison working as specified in the C standard for correctness.

Rasmus
diff --git a/arch/x86/include/asm/static_call.h b/arch/x86/include/asm/static_call.h
index cbb67b6030f9..39ebe0511869 100644
--- a/arch/x86/include/asm/static_call.h
+++ b/arch/x86/include/asm/static_call.h
@@ -27,6 +27,7 @@
 	    ".globl " STATIC_CALL_TRAMP_STR(name) "		\n"	\
 	    STATIC_CALL_TRAMP_STR(name) ":			\n"	\
 	    insns "						\n"	\
+	    ".byte 0x53, 0x43, 0x54				\n"	\
 	    ".type " STATIC_CALL_TRAMP_STR(name) ", @function	\n"	\
 	    ".size " STATIC_CALL_TRAMP_STR(name) ", . - " STATIC_CALL_TRAMP_STR(name) " \n" \
 	    ".popsection					\n")
diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
index ea028e736831..9c407a33a774 100644
--- a/arch/x86/kernel/static_call.c
+++ b/arch/x86/kernel/static_call.c
@@ -56,10 +56,15 @@ static void __ref __static_call_transform(void *insn, enum insn_type type, void
 	text_poke_bp(insn, code, size, emulate);
 }
 
-static void __static_call_validate(void *insn, bool tail)
+static void __static_call_validate(void *insn, bool tail, bool tramp)
 {
 	u8 opcode = *(u8 *)insn;
 
+	if (tramp && memcmp(insn+5, "SCT", 3)) {
+		pr_err("trampoline signature fail");
+		BUG();
+	}
+
 	if (tail) {
 		if (opcode == JMP32_INSN_OPCODE ||
 		    opcode == RET_INSN_OPCODE)
@@ -74,7 +79,8 @@ static void __static_call_validate(void *insn, bool tail)
 	/*
 	 * If we ever trigger this, our text is corrupt, we'll probably not live long.
 	 */
-	WARN_ONCE(1, "unexpected static_call insn opcode 0x%x at %pS\n", opcode, insn);
+	pr_err("unexpected static_call insn opcode 0x%x at %pS\n", opcode, insn);
+	BUG();
 }
 
 static inline enum insn_type __sc_insn(bool null, bool tail)
@@ -97,12 +103,12 @@ void arch_static_call_transform(void *site, void *tramp, void *func, bool tail)
 	mutex_lock(&text_mutex);
 
 	if (tramp) {
-		__static_call_validate(tramp, true);
+		__static_call_validate(tramp, true, true);
 		__static_call_transform(tramp, __sc_insn(!func, true), func);
 	}
 
 	if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE) && site) {
-		__static_call_validate(site, tail);
+		__static_call_validate(site, tail, false);
 		__static_call_transform(site, __sc_insn(!func, tail), func);
 	}
 
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index fb3f251ea021..6d5951113d53 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -3310,6 +3310,9 @@ static bool ignore_unreachable_insn(struct objtool_file *file, struct instruction *insn)
 	if (!insn->func)
 		return false;
 
+	if (insn->func->static_call_tramp)
+		return true;
+
 	/*
 	 * CONFIG_UBSAN_TRAP inserts a UD2 when it sees
 	 * __builtin_unreachable(). The BUG() macro has an unreachable() after