Message ID | 20221209112500.GA3116@willie-the-truck (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [GIT,PULL] arm64 updates for 6.2 | expand |
On Fri, Dec 9, 2022 at 3:25 AM Will Deacon <will@kernel.org> wrote: > > Dynamic SCS: > * Support for dynamic shadow call stacks to allow switching at > runtime between Clang's SCS implementation and the CPU's > pointer authentication feature when it is supported (complete > with scary DWARF parser!) I've pulled this thing, but this part makes me nervous. There's some bad history with debug information not being 100% reliable probably simply because it gets very little correctness testing. It might be worth thinking about at least verifying the information using something like objtool, so that you at least catch problem cases at *build* time rather than runtime. For example, that whole default: pr_err("unhandled opcode: %02x in FDE frame %lx\n", opcode[-1], (uintptr_t)frame); return -ENOEXEC; really makes me go "this should have been verified at build time, it's much too late to notice now that you don't understand the dwarf data". Hmm? Linus
The pull request you sent on Fri, 9 Dec 2022 11:25:01 +0000:
> git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git tags/arm64-upstream
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/06cff4a58e7dfa018c5f8a6ebdc3ff12745e0bae
Thank you!
Hi Linus, [+Ard] On Mon, Dec 12, 2022 at 10:05:07AM -0800, Linus Torvalds wrote: > On Fri, Dec 9, 2022 at 3:25 AM Will Deacon <will@kernel.org> wrote: > > > > Dynamic SCS: > > * Support for dynamic shadow call stacks to allow switching at > > runtime between Clang's SCS implementation and the CPU's > > pointer authentication feature when it is supported (complete > > with scary DWARF parser!) > > I've pulled this thing, but this part makes me nervous. There's some > bad history with debug information not being 100% reliable probably > simply because it gets very little correctness testing. Hey, I did use the word "scary"! This is, at least, very easy to back out (it's effectively an optimisation) if the DWARF info ends up being too unreliable and causes issues in practice. We're also only looking at .eh_frame here, which should hopefully get a lot more correctness testing when compared to the .debug sections due to exception unwinding. > It might be worth thinking about at least verifying the information > using something like objtool, so that you at least catch problem cases > at *build* time rather than runtime. Checking that the DWARF data looks sensible at build time isn't a bad idea, but see below as I think we can probably still produce a functional kernel Image in this case. > For example, that whole > > default: > pr_err("unhandled opcode: %02x in FDE frame %lx\n", > opcode[-1], (uintptr_t)frame); > return -ENOEXEC; > > really makes me go "this should have been verified at build time, it's > much too late to notice now that you don't understand the dwarf data". This isn't actually as bad as it looks -- the patching operation here only kicks in on CPUs which do not implement the pointer authentication instructions (i.e. where the CPU executes these as NOPs). Therefore, if patching bails out half way due to the "unhandled opcode" above, we should be ok, albeit missing some SCS coverage. I say "should" because if we fail within a frame after patching in the SCS "push" but before patching in the "pop", then we'd end up with a corrupt SCS pointer. Ard -- do you think we could tweak the patching so that we patch the push and the pop together (e.g. by tracking the two locations on a per-frame basis and postponing the text poking until just before we return from scs_handle_fde_frame())? Will
l On Tue, 13 Dec 2022 at 13:11, Will Deacon <will@kernel.org> wrote: > > Hi Linus, > > [+Ard] > > On Mon, Dec 12, 2022 at 10:05:07AM -0800, Linus Torvalds wrote: > > On Fri, Dec 9, 2022 at 3:25 AM Will Deacon <will@kernel.org> wrote: > > > > > > Dynamic SCS: > > > * Support for dynamic shadow call stacks to allow switching at > > > runtime between Clang's SCS implementation and the CPU's > > > pointer authentication feature when it is supported (complete > > > with scary DWARF parser!) > > > > I've pulled this thing, but this part makes me nervous. There's some > > bad history with debug information not being 100% reliable probably > > simply because it gets very little correctness testing. > > Hey, I did use the word "scary"! This is, at least, very easy to back > out (it's effectively an optimisation) if the DWARF info ends up being > too unreliable and causes issues in practice. We're also only looking > at .eh_frame here, which should hopefully get a lot more correctness > testing when compared to the .debug sections due to exception unwinding. > Indeed. And this is Clang 15+ at the moment, for precisely this reason. > > It might be worth thinking about at least verifying the information > > using something like objtool, so that you at least catch problem cases > > at *build* time rather than runtime. > > Checking that the DWARF data looks sensible at build time isn't a bad > idea, but see below as I think we can probably still produce a functional > kernel Image in this case. > > > For example, that whole > > > > default: > > pr_err("unhandled opcode: %02x in FDE frame %lx\n", > > opcode[-1], (uintptr_t)frame); > > return -ENOEXEC; > > > > really makes me go "this should have been verified at build time, it's > > much too late to notice now that you don't understand the dwarf data". > > This isn't actually as bad as it looks -- the patching operation here > only kicks in on CPUs which do not implement the pointer authentication > instructions (i.e. where the CPU executes these as NOPs). Therefore, if > patching bails out half way due to the "unhandled opcode" above, we > should be ok, albeit missing some SCS coverage. Indeed. > I say "should" because > if we fail within a frame after patching in the SCS "push" but before > patching in the "pop", then we'd end up with a corrupt SCS pointer. > > Ard -- do you think we could tweak the patching so that we patch the push > and the pop together (e.g. by tracking the two locations on a per-frame > basis and postponing the text poking until just before we return from > scs_handle_fde_frame())? > The push and the pop are not necessarily balanced (there may be more than one pop for each push), and the opcode we look for (DW_CFA_negate_ra_state) may occur in places which are not actually a pop, so tracking these is not as straight-forward as this. What we could do is track the push and the first pop on a first pass, and if we don't encounter any unexpected opcodes, patch the push and do a second pass starting from the first pop. Or just simply run it twice and do no patching the first time around (the DWARF frames are not very big)
On Tue, Dec 13, 2022 at 01:36:09PM +0100, Ard Biesheuvel wrote: > On Tue, 13 Dec 2022 at 13:11, Will Deacon <will@kernel.org> wrote: > > Ard -- do you think we could tweak the patching so that we patch the push > > and the pop together (e.g. by tracking the two locations on a per-frame > > basis and postponing the text poking until just before we return from > > scs_handle_fde_frame())? > > > > The push and the pop are not necessarily balanced (there may be more > than one pop for each push), and the opcode we look for > (DW_CFA_negate_ra_state) may occur in places which are not actually a > pop, so tracking these is not as straight-forward as this. Duh, yes, of course. You only _execute_ one of the pops for a given run through the function, but there could be numerous return points. So my idea doesn't work at all :) > What we could do is track the push and the first pop on a first pass, > and if we don't encounter any unexpected opcodes, patch the push and > do a second pass starting from the first pop. Or just simply run it > twice and do no patching the first time around (the DWARF frames are > not very big) Doing a dry-run first sounds fairly easy to implement, so it would probably be a good starting point. It also means that if anybody complains about the overhead, then we can get them to work on doing it at build time instead! Will