
[0/6] Fix unwinding through sigreturn trampolines

Message ID: 20200623085436.3696-1-will@kernel.org

Message

Will Deacon June 23, 2020, 8:54 a.m. UTC
Hi all,

This series fixes unwinding through our 64-bit sigreturn trampoline (by
effectively removing the incomplete CFI support enabled during the merge
window) and the 32-bit sigreturn trampoline in the compat vDSO (by removing
it in favour of the sigpage). This forces the unwinder to fall back on the
heuristics it was using previously, which had no reported issues.

The downside is that this series undoes the LLVM unwinder support added
during the merge window, in favour of not regressing libgcc. Although Ard
and I tried very hard to make this all work with CFI, it became obvious
that it's not 5.8 material, particularly as testing this stuff is extremely
difficult and appears to be fragile on other architectures too. For example,
trying to unwind out of a SIGCANCEL handler triggered from within a leaf
function with pthread cleanup handlers stacked further up in the callchain
doesn't appear to work at all when compiling with -fexceptions, and works
partially (invoking a subset of the handlers...) if you reduce the
optimisation level.

The compat vDSO changes are a result of me auditing our 32-bit sigreturn,
realising the compat vDSO trampoline is broken and then deciding that we
may as well just use the sigpage unconditionally, like arch/arm/ does.

Thanks to Ard for wasting a significant chunk of his weekend on this mess.

Patches 1-4 are targeting 5.8; the remaining two are cosmetic.

Will

Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Tamas Zsoldos <tamas.zsoldos@arm.com>
Cc: Szabolcs Nagy <szabolcs.nagy@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Kiss <daniel.kiss@arm.com>
Cc: Dave Martin <dave.martin@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: <kernel-team@android.com>

--->8

Will Deacon (6):
  arm64: vdso: Disable dwarf unwinding through the sigreturn trampoline
  arm64: compat: Allow 32-bit vdso and sigpage to co-exist
  arm64: compat: Always use sigpage for sigreturn trampoline
  arm64: compat: Remove 32-bit sigreturn code from the vDSO
  arm64: vdso32: Remove a bunch of #ifdef CONFIG_COMPAT_VDSO guards
  arm64: vdso: Fix unusual formatting in *setup_additional_pages()

 arch/arm64/include/asm/mmu.h         |  3 +
 arch/arm64/kernel/Makefile           |  2 -
 arch/arm64/kernel/signal32.c         | 27 +-------
 arch/arm64/kernel/vdso.c             | 98 +++++++++++-----------------
 arch/arm64/kernel/vdso/Makefile      |  2 +-
 arch/arm64/kernel/vdso/sigreturn.S   | 54 +++++++++------
 arch/arm64/kernel/vdso32/Makefile    |  1 -
 arch/arm64/kernel/vdso32/sigreturn.S | 58 ----------------
 arch/arm64/kernel/vdso32/vdso.lds.S  | 12 ----
 9 files changed, 77 insertions(+), 180 deletions(-)
 delete mode 100644 arch/arm64/kernel/vdso32/sigreturn.S

Comments

Ard Biesheuvel June 23, 2020, 9:46 a.m. UTC | #1
On Tue, 23 Jun 2020 at 10:54, Will Deacon <will@kernel.org> wrote:
>
> Hi all,
>
> This series fixes unwinding through our 64-bit sigreturn trampoline (by
> effectively removing the incomplete CFI support enabled during the merge
> window) and the 32-bit sigreturn trampoline in the compat vDSO (by removing
> it in favour of the sigpage). This forces the unwinder to fall back on the
> heuristics it was using previously, which had no reported issues.
>
> The downside is that this series undoes the LLVM unwinder support added
> during the merge window, in favour of not regressing libgcc. Although Ard
> and I tried very hard to make this all work with CFI, it became obvious
> that it's not 5.8 material, particularly as testing this stuff is extremely
> difficult and appears to be fragile on other architectures too. For example,
> trying to unwind out of a SIGCANCEL handler triggered from within a leaf
> function with pthread cleanup handlers stacked further up in the callchain
> doesn't appear to work at all when compiling with -fexceptions, and works
> partially (invoking a subset of the handlers...) if you reduce the
> optimisation level.
>
> The compat vDSO changes are a result of me auditing our 32-bit sigreturn,
> realising the compat vDSO trampoline is broken and then deciding that we
> may as well just use the sigpage unconditionally, like arch/arm/ does.
>
> Thanks to Ard for wasting a significant chunk of his weekend on this mess.
>
> Patches 1-4 are targeting 5.8; the remaining two are cosmetic.
>
> Will
>
> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
> Cc: Tamas Zsoldos <tamas.zsoldos@arm.com>
> Cc: Szabolcs Nagy <szabolcs.nagy@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Daniel Kiss <daniel.kiss@arm.com>
> Cc: Dave Martin <dave.martin@arm.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: <kernel-team@android.com>
>
> --->8
>
> Will Deacon (6):
>   arm64: vdso: Disable dwarf unwinding through the sigreturn trampoline
>   arm64: compat: Allow 32-bit vdso and sigpage to co-exist
>   arm64: compat: Always use sigpage for sigreturn trampoline
>   arm64: compat: Remove 32-bit sigreturn code from the vDSO
>   arm64: vdso32: Remove a bunch of #ifdef CONFIG_COMPAT_VDSO guards
>   arm64: vdso: Fix unusual formatting in *setup_additional_pages()
>

For the series,

Reviewed-by: Ard Biesheuvel <ardb@kernel.org>

Once this is queued up, I will send out my proposed fixes to go on top
for the CFI directives in the 64-bit sigreturn trampoline, and we can
start the discussion on how to fix this properly for v5.9.




>  arch/arm64/include/asm/mmu.h         |  3 +
>  arch/arm64/kernel/Makefile           |  2 -
>  arch/arm64/kernel/signal32.c         | 27 +-------
>  arch/arm64/kernel/vdso.c             | 98 +++++++++++-----------------
>  arch/arm64/kernel/vdso/Makefile      |  2 +-
>  arch/arm64/kernel/vdso/sigreturn.S   | 54 +++++++++------
>  arch/arm64/kernel/vdso32/Makefile    |  1 -
>  arch/arm64/kernel/vdso32/sigreturn.S | 58 ----------------
>  arch/arm64/kernel/vdso32/vdso.lds.S  | 12 ----
>  9 files changed, 77 insertions(+), 180 deletions(-)
>  delete mode 100644 arch/arm64/kernel/vdso32/sigreturn.S
>
> --
> 2.27.0.111.gc72c7da667-goog
>
Dave Martin June 23, 2020, 9:54 a.m. UTC | #2
On Tue, Jun 23, 2020 at 09:54:30AM +0100, Will Deacon wrote:
> Hi all,
> 
> This series fixes unwinding through our 64-bit sigreturn trampoline (by
> effectively removing the incomplete CFI support enabled during the merge
> window) and the 32-bit sigreturn trampoline in the compat vDSO (by removing
> it in favour of the sigpage). This forces the unwinder to fall back on the
> heuristics it was using previously, which had no reported issues.
> 
> The downside is that this series undoes the LLVM unwinder support added
> during the merge window, in favour of not regressing libgcc. Although Ard
> and I tried very hard to make this all work with CFI, it became obvious
> that it's not 5.8 material, particularly as testing this stuff is extremely
> difficult and appears to be fragile on other architectures too. For example,
> trying to unwind out of a SIGCANCEL handler triggered from within a leaf
> function with pthread cleanup handlers stacked further up in the callchain
> doesn't appear to work at all when compiling with -fexceptions, and works
> partially (invoking a subset of the handlers...) if you reduce the
> optimisation level.
> 
> The compat vDSO changes are a result of me auditing our 32-bit sigreturn,
> realising the compat vDSO trampoline is broken and then deciding that we
> may as well just use the sigpage unconditionally, like arch/arm/ does.
> 
> Thanks to Ard for wasting a significant chunk of his weekend on this mess.
> 
> Patches 1-4 are targeting 5.8; the remaining two are cosmetic.

In the longer term, we should perhaps keep .cfi_signal_frame, and ditch
all the other annotations.  If the unwinder knows for sure it's the
signal frame (and understands this annotation), it can use its own
knowledge to do the right thing.  Otherwise, it can fall back on its
other heuristics.

This is probably better than having rich annotations that are
pedantically correct, but only half-supported by unwinders out in the
wild.

(Just my view)

Cheers
---Dave
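[A sketch of what Dave is proposing, for illustration only (untested; the symbol name follows the arm64 vDSO's conventions, and `__NR_rt_sigreturn` is 139 on arm64). The trampoline is marked as a signal frame but advertises no register rules, so an unwinder that understands the annotation can apply its own knowledge, and one that does not finds nothing misleading to interpret:]

```asm
	.text
	.globl	__kernel_rt_sigreturn
	.cfi_startproc
	.cfi_signal_frame		// positively identify the signal frame,
					// but emit no register rules: unwinders
					// that understand the annotation use
					// their own logic, the rest fall back
					// on their usual heuristics
__kernel_rt_sigreturn:
	mov	x8, #__NR_rt_sigreturn	// 139 on arm64
	svc	#0
	.cfi_endproc
```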
Ard Biesheuvel June 23, 2020, 10:11 a.m. UTC | #3
On Tue, 23 Jun 2020 at 11:54, Dave Martin <Dave.Martin@arm.com> wrote:
>
> On Tue, Jun 23, 2020 at 09:54:30AM +0100, Will Deacon wrote:
> > Hi all,
> >
> > This series fixes unwinding through our 64-bit sigreturn trampoline (by
> > effectively removing the incomplete CFI support enabled during the merge
> > window) and the 32-bit sigreturn trampoline in the compat vDSO (by removing
> > it in favour of the sigpage). This forces the unwinder to fall back on the
> > heuristics it was using previously, which had no reported issues.
> >
> > The downside is that this series undoes the LLVM unwinder support added
> > during the merge window, in favour of not regressing libgcc. Although Ard
> > and I tried very hard to make this all work with CFI, it became obvious
> > that it's not 5.8 material, particularly as testing this stuff is extremely
> > difficult and appears to be fragile on other architectures too. For example,
> > trying to unwind out of a SIGCANCEL handler triggered from within a leaf
> > function with pthread cleanup handlers stacked further up in the callchain
> > doesn't appear to work at all when compiling with -fexceptions, and works
> > partially (invoking a subset of the handlers...) if you reduce the
> > optimisation level.
> >
> > The compat vDSO changes are a result of me auditing our 32-bit sigreturn,
> > realising the compat vDSO trampoline is broken and then deciding that we
> > may as well just use the sigpage unconditionally, like arch/arm/ does.
> >
> > Thanks to Ard for wasting a significant chunk of his weekend on this mess.
> >
> > Patches 1-4 are targeting 5.8; the remaining two are cosmetic.
>
> In the longer term, we should perhaps keep .cfi_signal_frame, and ditch
> all the other annotations.  If the unwinder knows for sure it's the
> signal frame (and understands this annotation), it can use its own
> knowledge to do the right thing.  Otherwise, it can fall back on its
> other heuristics.
>

What we found out is that libunwind unconditionally uses its own
unwind strategy when it spots the mov/svc instruction pair at the
return address, and so what we put here does not matter at all.
The libgcc case is different, as it only applies the heuristics if it
doesn't find any unwind info covering the return address. So this is
what bit us here: by fixing the way the CFI directives are emitted,
libgcc no longer uses its own baked in unwind strategy for the signal
frame, but interprets the one we provide, which results in a bogus
address for the /next/ frame. Since that is not covered by unwind info
either, libgcc attempts to apply the heuristics at this level, and
explodes because the bogus address cannot be dereferenced to get at
the opcodes.

So libgcc doesn't even check whether it looks like a signal frame
unless the unwind info is missing.

> This is probably better than having rich annotations that are
> pedantically correct, but only half-supported by unwinders out in the
> wild.
>

Agreed. But without providing any annotations, other unwinders will
fail to unwind through the sigreturn trampoline, which is why the
change was made in the first place.

So the safest way to fix this (imho) is to add CFI directives that
mirror what libunwind does unconditionally, and libgcc when it spots
the mov/svc pair on unannotated addresses, which is to restore all the
registers from the sigframe, and to use regs->pc as the return address
(which is where the unwind should proceed, rather than at whatever
value happened to reside in x30 at the point the signal was taken)

The only hairy bit is the FP/SIMD registers, which are ignored by
libunwind, and only half-restored by libgcc (which appears to assume
that only d8..d15 need to be restored). This means throwing a C++
exception in a signal handler and catching it outside of it is going
to be difficult to get 100% correct, but this is something that
appears to be broken already everywhere. (Unless there are some
AAPCS/ABI rules that specify that FP/SIMD registers have to be assumed
clobbered when catching an exception)
Will Deacon June 23, 2020, 10:58 a.m. UTC | #4
On Tue, Jun 23, 2020 at 12:11:24PM +0200, Ard Biesheuvel wrote:
> On Tue, 23 Jun 2020 at 11:54, Dave Martin <Dave.Martin@arm.com> wrote:
> > In the longer term, we should perhaps keep .cfi_signal_frame, and ditch
> > all the other annotations.  If the unwinder knows for sure it's the
> > signal frame (and understands this annotation), it can use its own
> > knowledge to do the right thing.  Otherwise, it can fall back on its
> > other heuristics.
> 
> What we found out is that libunwind unconditionally uses its own
> unwind strategy when it spots the mov/svc instruction pair at the
> return address, and so what we put here does not matter at all.
> The libgcc case is different, as it only applies the heuristics if it
> doesn't find any unwind info covering the return address. So this is
> what bit us here: by fixing the way the CFI directives are emitted,
> libgcc no longer uses its own baked in unwind strategy for the signal
> frame, but interprets the one we provide, which results in a bogus
> address for the /next/ frame. Since that is not covered by unwind info
> either, libgcc attempts to apply the heuristics at this level, and
> explodes because the bogus address cannot be dereferenced to get at
> the opcodes.
> 
> So libgcc doesn't even check whether it looks like a signal frame
> unless the unwind info is missing.
> 
> > This is probably better than having rich annotations that are
> > pedantically correct, but only half-supported by unwinders out in the
> > wild.
> >
> 
> Agreed. But without providing any annotations, other unwinders will
> fail to unwind through the sigreturn trampoline, which is why the
> change was made in the first place.
> 
> So the safest way to fix this (imho) is to add CFI directives that
> mirror what libunwind does unconditionally, and libgcc when it spots
> the mov/svc pair on unannotated addresses, which is to restore all the
> registers from the sigframe, and to use regs->pc as the return address
> (which is where the unwind should proceed, rather than at whatever
> value happened to reside in x30 at the point the signal was taken)
> 
> The only hairy bit is the FP/SIMD registers, which are ignored by
> libunwind, and only half-restored by libgcc (which appears to assume
> that only d8..d15 need to be restored). This means throwing a C++
> exception in a signal handler and catching it outside of it is going
> to be difficult to get 100% correct, but this is something that
> appears to be broken already everywhere. (Unless there are some
> AAPCS/ABI rules that specify that FP/SIMD registers have to be assumed
> clobbered when catching an exception)

I think Dave makes an interesting point about .cfi_signal_frame, though.
With any released arm64 kernel, unwinding here requires heuristics to
identify the signal frame. The usual approach seems to pattern match the
instruction stream (and I'm not convinced libunwind works on big-endian
here), but you could also have looked up pc+4 to see if the function was
marked with .cfi_signal_frame. You can't do that anymore after this patch.
We have no evidence anybody bothered with that (it seems like more trouble
than it's worth), but it's an interesting observation. Furthermore, maybe
the right answer *is* to punt this all to the unwinder given that we're
already forced to keep the instruction sequence exactly as it is. I suspect
the LLVM unwinder has to use heuristics for most other architectures
anyway, so it should already have whatever hooks it needs.

It would be a different story if most unwinders were expecting to use
CFI and it generally worked well on other architectures, but that really
doesn't appear to be the case.

Will
Ard Biesheuvel June 23, 2020, 11:08 a.m. UTC | #5
On Tue, 23 Jun 2020 at 12:58, Will Deacon <will@kernel.org> wrote:
>
> On Tue, Jun 23, 2020 at 12:11:24PM +0200, Ard Biesheuvel wrote:
> > On Tue, 23 Jun 2020 at 11:54, Dave Martin <Dave.Martin@arm.com> wrote:
> > > In the longer term, we should perhaps keep .cfi_signal_frame, and ditch
> > > all the other annotations.  If the unwinder knows for sure it's the
> > > signal frame (and understands this annotation), it can use its own
> > > knowledge to do the right thing.  Otherwise, it can fall back on its
> > > other heuristics.
> >
> > What we found out is that libunwind unconditionally uses its own
> > unwind strategy when it spots the mov/svc instruction pair at the
> > return address, and so what we put here does not matter at all.
> > The libgcc case is different, as it only applies the heuristics if it
> > doesn't find any unwind info covering the return address. So this is
> > what bit us here: by fixing the way the CFI directives are emitted,
> > libgcc no longer uses its own baked in unwind strategy for the signal
> > frame, but interprets the one we provide, which results in a bogus
> > address for the /next/ frame. Since that is not covered by unwind info
> > either, libgcc attempts to apply the heuristics at this level, and
> > explodes because the bogus address cannot be dereferenced to get at
> > the opcodes.
> >
> > So libgcc doesn't even check whether it looks like a signal frame
> > unless the unwind info is missing.
> >
> > > This is probably better than having rich annotations that are
> > > pedantically correct, but only half-supported by unwinders out in the
> > > wild.
> > >
> >
> > Agreed. But without providing any annotations, other unwinders will
> > fail to unwind through the sigreturn trampoline, which is why the
> > change was made in the first place.
> >
> > So the safest way to fix this (imho) is to add CFI directives that
> > mirror what libunwind does unconditionally, and libgcc when it spots
> > the mov/svc pair on unannotated addresses, which is to restore all the
> > registers from the sigframe, and to use regs->pc as the return address
> > (which is where the unwind should proceed, rather than at whatever
> > value happened to reside in x30 at the point the signal was taken)
> >
> > The only hairy bit is the FP/SIMD registers, which are ignored by
> > libunwind, and only half-restored by libgcc (which appears to assume
> > that only d8..d15 need to be restored). This means throwing a C++
> > exception in a signal handler and catching it outside of it is going
> > to be difficult to get 100% correct, but this is something that
> > appears to be broken already everywhere. (Unless there are some
> > AAPCS/ABI rules that specify that FP/SIMD registers have to be assumed
> > clobbered when catching an exception)
>
> I think Dave makes an interesting point about .cfi_signal_frame, though.
> With any released arm64 kernel, unwinding here requires heuristics to
> identify the signal frame. The usual approach seems to pattern match the
> instruction stream (and I'm not convinced libunwind works on big-endian
> here), but you could also have looked up pc+4 to see if the function was
> marked with .cfi_signal_frame. You can't do that anymore after this patch.

Good point.

> We have no evidence anybody bothered with that (it seems like more trouble
> than it's worth), but it's an interesting observation. Furthermore, maybe
> the right answer *is* to punt this all to the unwinder given that we're
> already forced to keep the instruction sequence exactly as it is. I suspect
> the LLVM unwinder has to use heuristics for most other architectures
> anyway, so it should already have whatever hooks it needs.
>

The unwinder is definitely better equipped to deal with this, given
how difficult it is to come up with CFI directives at compile time
that apply universally to any kind of sigframe we may instantiate at
runtime.

> It would be a different story if most unwinders were expecting to use
> CFI and it generally worked well on other architectures, but that really
> doesn't appear to be the case.
>

If fixing LLVM unwinding is no longer a requirement, reverting to what
we had is probably the best, although it is rather unfortunate that we
will be encouraging the LLVM developers to put another bodge in user
space. Could we at least encourage them to take the libgcc approach,
where the builtin strategy is only used as a fallback if the return
address is not covered by unwind info? Otherwise, we are painting
ourselves into a corner even more than we have already with respect to
our ability to ever start doing this properly in the future.
Szabolcs Nagy June 23, 2020, 11:11 a.m. UTC | #6
The 06/23/2020 12:11, Ard Biesheuvel wrote:
> So the safest way to fix this (imho) is to add CFI directives that
> mirror what libunwind does unconditionally, and libgcc when it spots
> the mov/svc pair on unannotated addresses, which is to restore all the
> registers from the sigframe, and to use regs->pc as the return address
> (which is where the unwind should proceed, rather than at whatever
> value happened to reside in x30 at the point the signal was taken)
> 
> The only hairy bit is the FP/SIMD registers, which are ignored by
> libunwind, and only half-restored by libgcc (which appears to assume
> that only d8..d15 need to be restored). This means throwing a C++
> exception in a signal handler and catching it outside of it is going
> to be difficult to get 100% correct, but this is something that
> appears to be broken already everywhere. (Unless there are some
> AAPCS/ABI rules that specify that FP/SIMD registers have to be assumed
> clobbered when catching an exception)

all fp/simd registers are assumed to be clobbered after
a function call, except for the bottom part of v8-v15
(sve calls preserve more: v8-v23 which may be a problem)

this means if the exception is caught at least one call
frame above the interrupted function then the libgcc
logic should work.

if you want to catch exceptions at the same call frame
that is interrupted by the signal, then you need to
compile with -fnon-call-exceptions which has limitations
anyway and not expected to be reliable (apparently on
aarch64 libgcc does not even support this usage with
respect to fp/simd regs, which is good to know..)

this is all fine because c++ does not define exception
handling across signal handlers.

as for thread cancellation in glibc: it uses exception
mechanism for cleanups, but the default cancel state
is PTHREAD_CANCEL_DEFERRED which means only blocking
libc calls throw (so -fexceptions is enough and the
libgcc logic is fine), if you switch to
PTHREAD_CANCEL_ASYNCHRONOUS then there may be a problem
but you can only do pure computations in that state,
(only 3 libc functions are defined to be async cancel
safe), i think you cannot register cleanup handlers
that run on the same stack frame that may be async
interrupted.
Ard Biesheuvel June 23, 2020, 12:39 p.m. UTC | #7
On Tue, 23 Jun 2020 at 13:11, Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
>
> The 06/23/2020 12:11, Ard Biesheuvel wrote:
> > So the safest way to fix this (imho) is to add CFI directives that
> > mirror what libunwind does unconditionally, and libgcc when it spots
> > the mov/svc pair on unannotated addresses, which is to restore all the
> > registers from the sigframe, and to use regs->pc as the return address
> > (which is where the unwind should proceed, rather than at whatever
> > value happened to reside in x30 at the point the signal was taken)
> >
> > The only hairy bit is the FP/SIMD registers, which are ignored by
> > libunwind, and only half-restored by libgcc (which appears to assume
> > that only d8..d15 need to be restored). This means throwing a C++
> > exception in a signal handler and catching it outside of it is going
> > to be difficult to get 100% correct, but this is something that
> > appears to be broken already everywhere. (Unless there are some
> > AAPCS/ABI rules that specify that FP/SIMD registers have to be assumed
> > clobbered when catching an exception)
>
> all fp/simd registers are assumed to be clobbered after
> a function call, except for the bottom part of v8-v15
> (sve calls preserve more: v8-v23 which may be a problem)
>
> this means if the exception is caught at least one call
> frame above the interrupted function then the libgcc
> logic should work.
>
> if you want to catch exceptions at the same call frame
> that is interrupted by the signal, then you need to
> compile with -fnon-call-exceptions which has limitations
> anyway and not expected to be reliable (apparently on
> aarch64 libgcc does not even support this usage with
> respect to fp/simd regs, which is good to know..)
>
> this is all fine because c++ does not define exception
> handling across signal handlers.
>

It doesn't? In that case, all we need to take into account is cleanup
handlers, which are not permitted to catch exceptions.

I did notice that GCC permits cleanup handlers to be nested functions,
which means they could potentially attempt to access the outer
function's scope. Not sure what that means for unwinding, though.

> as for thread cancellation in glibc: it uses exception
> mechanism for cleanups, but the default cancel state
> is PTHREAD_CANCEL_DEFERRED which means only blocking
> libc calls throw (so -fexceptions is enough and the
> libgcc logic is fine), if you switch to
> PTHREAD_CANCEL_ASYNCHRONOUS then there may be a problem
> but you can only do pure computations in that state,
> (only 3 libc functions are defined to be async cancel
> safe), i think you cannot register cleanup handlers
> that run on the same stack frame that may be async
> interrupted.
Will Deacon June 23, 2020, 1:20 p.m. UTC | #8
Hi Szabolcs,

Cheers for the reply.

On Tue, Jun 23, 2020 at 12:11:09PM +0100, Szabolcs Nagy wrote:
> as for thread cancellation in glibc: it uses exception
> mechanism for cleanups, but the default cancel state
> is PTHREAD_CANCEL_DEFERRED which means only blocking
> libc calls throw (so -fexceptions is enough and the
> libgcc logic is fine), if you switch to
> PTHREAD_CANCEL_ASYNCHRONOUS then there may be a problem
> but you can only do pure computations in that state,
> (only 3 libc functions are defined to be async cancel
> safe), i think you cannot register cleanup handlers
> that run on the same stack frame that may be async
> interrupted.

Ah, I was trying to print a message, so I suppose that's out. Even so,
debugging with gdb and putting a breakpoint on the callback showed that
it wasn't getting invoked.

My code is below just as an FYI, since being able to derive a test from
this would be useful should we try to fix the CFI directives in future.

I get different results based on different combinations of
architecture, toolchain and optimisation level.

Will

--->8

/* cc -O2 -g -o tc tc.c -pthread -fexceptions */

#include <errno.h>
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>

void cb(void *arg)
{
	printf("Cleanup handler called %p\n", arg);
}

int __attribute__((noinline)) baz(volatile int x)
{
	while (x);
	return x;
}

void __attribute__((noinline)) bar(volatile int x)
{
	pthread_cleanup_push(cb, (void *)2);
	x = baz(x);
	pthread_cleanup_pop(0);
}

void *foo(void *unused)
{
	int otype;

	if (pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, &otype))
		perror("pthread_setcanceltype");

	pthread_cleanup_push(cb, (void *)1);
	bar(1);
	pthread_cleanup_pop(0);
	return NULL;
}

int main(void)
{
	pthread_t thread;

	if (pthread_create(&thread, NULL, foo, NULL)) {
		perror("pthread_create");
		return -1;
	}

	sleep(1);
	pthread_cancel(thread);
	pthread_join(thread, NULL);
	return 0;
}
Szabolcs Nagy June 23, 2020, 1:56 p.m. UTC | #9
The 06/23/2020 14:20, Will Deacon wrote:
> Hi Szabolcs,
> 
> Cheers for the reply.
> 
> On Tue, Jun 23, 2020 at 12:11:09PM +0100, Szabolcs Nagy wrote:
> > as for thread cancellation in glibc: it uses exception
> > mechanism for cleanups, but the default cancel state
> > is PTHREAD_CANCEL_DEFERRED which means only blocking
> > libc calls throw (so -fexceptions is enough and the
> > libgcc logic is fine), if you switch to
> > PTHREAD_CANCEL_ASYNCHRONOUS then there may be a problem
> > but you can only do pure computations in that state,
> > (only 3 libc functions are defined to be async cancel
> > safe), i think you cannot register cleanup handlers
> > that run on the same stack frame that may be async
> > interrupted.
> 
> Ah, I was trying to print a message, so I suppose that's out. Even so,
> debugging with gdb and putting a breakpoint on the callback showed that
> it wasn't getting invoked.
> 
> My code is below just as an FYI, since being able to derive a test from
> this would be useful should we try to fix the CFI directives in future.
> 
> I get different results based on different combinations of
> architecture, toolchain and optimisation level.

with -fexceptions gcc only emits the cleanup begin/end
labels around function calls, i.e. it only expects a throw
from functions (the cleanup handler is called if the pc is
between the begin/end labels during unwind), if an
instruction is interrupted and you throw from there then
cleanup may work if the instruction happens to be in the
range covered by the begin/end labels, but gcc does not
try to make that happen.

with -fnon-call-exceptions i think the test is supposed
to work and here it works, i get:

Cleanup handler called 0x2
Cleanup handler called 0x1

i think posix does not allow pthread_cleanup_push in
async cancel state (but you can change the cancel
state before and after it, which is valid i think),
i think printf is valid in the cleanup handler:
the cancel state is reset (and cancellation is disabled)
when libc acts on cancellation. (and if the interrupted
code was async cancel safe it should work).

(that said i've seen issues with -fnon-call-exceptions
so i consider the musl cancellation design more robust:
just add the cleanup ptr to a libc internal list that
is called on cancellation, no unwinding is involved.
this does not work with c++ dtors though, but c++ never
defined dtor vs posix cancellation semantics so
cancelling c++ code is just undefined.)

i think unwinding from arbitrary instruction should
work on aarch64 when c/c++ code is interrupted, but
exception handlers don't need to.
Will Deacon June 23, 2020, 2:34 p.m. UTC | #10
On Tue, Jun 23, 2020 at 02:56:14PM +0100, Szabolcs Nagy wrote:
> The 06/23/2020 14:20, Will Deacon wrote:
> > On Tue, Jun 23, 2020 at 12:11:09PM +0100, Szabolcs Nagy wrote:
> > > as for thread cancellation in glibc: it uses exception
> > > mechanism for cleanups, but the default cancel state
> > > is PTHREAD_CANCEL_DEFERRED which means only blocking
> > > libc calls throw (so -fexceptions is enough and the
> > > libgcc logic is fine), if you switch to
> > > PTHREAD_CANCEL_ASYNCHRONOUS then there may be a problem
> > > but you can only do pure computations in that state,
> > > (only 3 libc functions are defined to be async cancel
> > > safe), i think you cannot register cleanup handlers
> > > that run on the same stack frame that may be async
> > > interrupted.
> > 
> > Ah, I was trying to print a message, so I suppose that's out. Even so,
> > debugging with gdb and putting a breakpoint on the callback showed that
> > it wasn't getting invoked.
> > 
> > My code is below just as an FYI, since being able to derive a test from
> > this would be useful should we try to fix the CFI directives in future.
> > 
> > I get different results based on different combinations of
> > architecture, toolchain and optimisation level.
> 
> with -fexceptions gcc only emits the cleanup begin/end
> labels around function calls, i.e. it only expects a throw
> from functions (the cleanup handler is called if the pc is
> between the begin/end labels during unwind), if an
> instruction is interrupted and you throw from there then
> cleanup may work if the instruction happens to be in the
> range covered by the begin/end labels, but gcc does not
> try to make that happen.

Interesting. That's not mentioned anywhere in the pthread_cleanup_push
man page!

> with -fnon-call-exceptions i think the test is supposed
> to work and here it works, i get:
> 
> Cleanup handler called 0x2
> Cleanup handler called 0x1

Thanks, that's much better. I could even replace baz with some out-of-line
assembly:

	.text
	.align	2
	.globl	baz
	.cfi_startproc
baz:
	mov	x0, x29
	.cfi_register x29, x0
	mov	x1, x30
	.cfi_register x30, x1
	mov	x29, #42
	mov	x30, #42
	b	.
	ret
	.cfi_endproc

and I can see the unwinder segfaulting if I remove the CFI directives.

> i think posix does not allow pthread_cleanup_push in
> async cancel state (but you can change the cancel
> state before and after it, which is valid i think),

That makes sense to avoid racing with a signal when installing the things,
I suppose.

> i think printf is valid in the cleanup handler:
> the cancel state is reset (and cancellation is disabled)
> when libc acts on cancellation. (and if the interrupted
> code was async cancel safe it should work).
> 
> (that said i've seen issues with -fnon-call-exceptions
> so i consider the musl cancellation design more robust:
> just add the cleanup ptr to a libc internal list that
> is called on cancellation, no unwinding is involved.
> this does not work with c++ dtors though, but c++ never
> defined dtor vs posix cancellation semantics so
> cancelling c++ code is just undefined.)

Yes, that makes a tonne more sense to me. I couldn't figure out why
unwinding was necessary at all for this, since the context for the longjmp
is stored in TLS *anyway* afaiu.

Will
Daniel Kiss June 23, 2020, 3:34 p.m. UTC | #11
Hi Will,

The CFI is correct for PowerPC[1] and X86[2] at least; I did not check others.
So LLVM’s unwinder can unwind/backtrace sigreturn without any special hack on those architectures.
The a4eb355a3fda change implements the same CFI as PowerPC and X86 have, so the generic unwind logic is able to process the sigreturn.
IMHO the kernel change was fine.

I’m not comfortable saying that the instruction sequence itself is the ABI that describes the unwinding information.
I guess the pattern matching was implemented because the CFI was wrong.

The .cfi_signal_frame annotation could be useful once the entry is found, but it needs the right CFI annotations in the right place.

This change is definitely not good for LLVM’s unwinder.

Cheers,
Daniel

[1] PowerPC:
The label (.Lsigrt_start) that is used for the CFI info includes the nop:
https://elixir.bootlin.com/linux/latest/source/arch/powerpc/kernel/vdso64/sigtramp.S#L25
https://elixir.bootlin.com/linux/latest/source/arch/powerpc/kernel/vdso64/sigtramp.S#L291

[2] X86:
https://elixir.bootlin.com/linux/latest/source/arch/x86/entry/vdso/vdso32/sigreturn.S#L13
https://elixir.bootlin.com/linux/latest/source/arch/x86/entry/vdso/vdso32/sigreturn.S#L59

> On 23 Jun 2020, at 13:08, Ard Biesheuvel <ardb@kernel.org> wrote:
> 
> On Tue, 23 Jun 2020 at 12:58, Will Deacon <will@kernel.org> wrote:
>> 
>> On Tue, Jun 23, 2020 at 12:11:24PM +0200, Ard Biesheuvel wrote:
>>> On Tue, 23 Jun 2020 at 11:54, Dave Martin <Dave.Martin@arm.com> wrote:
>>>> In the longer term, we should perhaps keep .cfi_signal_frame, and ditch
>>>> all the other annotations.  If the unwinder knows for sure it's the
>>>> signal frame (and understands this annotation), it can use its own
>>>> knowledge to do the right thing.  Otherwise, it can fall back on its
>>>> other heuristics.
>>> 
>>> What we found out is that libunwind unconditionally uses its own
>>> unwind strategy when it spots the mov/svc instruction pair at the
>>> return address, and so what we put here does not matter at all.
>>> The libgcc case is different, as it only applies the heuristics if it
>>> doesn't find any unwind info covering the return address. So this is
>>> what bit us here: by fixing the way the CFI directives are emitted,
>>> libgcc no longer uses its own baked in unwind strategy for the signal
>>> frame, but interprets the one we provide, which results in a bogus
>>> address for the /next/ frame. Since that is not covered by unwind info
>>> either, libgcc attempts to apply the heuristics at this level, and
>>> explodes because the bogus address cannot be dereferenced to get at
>>> the opcodes.
>>> 
>>> So libgcc doesn't even check whether it looks like a signal frame
>>> unless the unwind info is missing.
>>> 
>>>> This is probably better than having rich annotations that are
>>>> pedantically correct, but only half-supported by unwinders out in the
>>>> wild.
>>>> 
>>> 
>>> Agreed. But without providing any annotations, other unwinders will
>>> fail to unwind through the sigreturn trampoline, which is why the
>>> change was made in the first place.
>>> 
>>> So the safest way to fix this (imho) is to add CFI directives that
>>> mirror what libunwind does unconditionally, and libgcc when it spots
>>> the mov/svc pair on unannotated addresses, which is to restore all the
>>> registers from the sigframe, and to use regs->pc as the return address
>>> (which is where the unwind should proceed, rather than at whatever
>>> value happened to reside in x30 at the point the signal was taken)
>>> 
>>> The only hairy bit is the FP/SIMD registers, which are ignored by
>>> libunwind, and only half-restored by libgcc (which appears to assume
>>> that only d8..d15 need to be restored). This means throwing a C++
>>> exception in a signal handler and catching it outside of it is going
>>> to be difficult to get 100% correct, but this is something that
>>> appears to be broken already everywhere. (Unless there are some
>>> AAPCS/ABI rules that specify that FP/SIMD registers have to be assumed
>>> clobbered when catching an exception)
>> 
>> I think Dave makes an interesting point about .cfi_signal_frame, though.
>> With any released arm64 kernel, unwinding here requires heuristics to
>> identify the signal frame. The usual approach seems to pattern match the
>> instruction stream (and I'm not convinced libunwind works on big-endian
>> here), but you could also have looked up pc+4 to see if the function was
>> marked with .cfi_signal_frame. You can't do that anymore after this patch.
> 
> Good point.
> 
>> We have no evidence anybody bothered with that (it seems like more trouble
>> than it's worth), but it's an interesting observation. Furthermore, maybe
>> the right answer *is* to punt this all to the unwinder given that we're
>> already forced to keep the instruction sequence exactly as it is. I suspect
>> the LLVM unwinder has to use heuristics for most other architectures
>> anyway, so it should already have whatever hooks it needs.
>> 
> 
> The unwinder is definitely better equipped to deal with this, given
> how difficult it is to come up with CFI directives at compile time
> that apply universally to any kind of sigframe we may instantiate at
> runtime.
> 
>> It would be a different story if most unwinders were expecting to use
>> CFI and it generally worked well on other architectures, but that really
>> doesn't appear to be the case.
>> 
> 
> If fixing LLVM unwinding is no longer a requirement, reverting to what
> we had is probably the best, although it is rather unfortunate that we
> will be encouraging the LLVM developers to put another bodge in user
> space. Could we at least encourage them to take the libgcc approach,
> where the builtin strategy is only used as a fallback if the return
> address is not covered by unwind info? Otherwise, we are painting
> ourselves into a corner even more than we have already with respect to
> our ability to ever start doing this properly in the future.
Ard Biesheuvel June 23, 2020, 3:43 p.m. UTC | #12
On Tue, 23 Jun 2020 at 17:34, Daniel Kiss <Daniel.Kiss@arm.com> wrote:
>
> Hi Will,
>
> The CFI is correct for PowerPC[1] and X86[2] at least; I did not check others.
> so LLVM’s unwinder can unwind/backtrace sigreturn without any specific hack on those architectures.
> The a4eb355a3fda change implements the same CFI as PowerPC and X86 have so the generic unwind logic is able to process the sigreturn.

It most certainly did not implement the same CFI. The x86 and PowerPC
examples you quote have elaborate asm foo to emit the eh_frame by
hand.

> IMHO the kernel change was fine.
>

It creates easily reproducible segfaults in the libgcc unwinder.

> I’m not comfortable to say the instruction sequence is the ABI that describe the unwinding information.
> I guess the pattern matching was implemented due to the CFI was wrong.
>
> The .cfi_signal_frame could be useful once the entry is found, but needs the right .CFI annotations at the right place.
>
> This change definitely a not good for LLVM’s unwinder.
>

I agree that it would be better for the CFI to be correct, but that
takes a bit of work at the very least, and even then, the variable
nature of our sigframe may make it impossible to restore the FP/SIMD
register state reliably.

I did some tests with the below CFI directives, where
ARM64_SIGFRAME_REGS_OFFSET is emitted by asm-offsets as the offset
from the top of the sigframe to the regs[] array. This fixes the
segfaults, and seems to do the right thing if I single-step through
the unwinder as it unwinds the stack in response to a call to
pthread_cancel(). But we need a lot of testing to ensure that this is
correct.


.cfi_def_cfa_offset ARM64_SIGFRAME_REGS_OFFSET
.irp r, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, \
    13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,  \
    24, 25, 26, 27, 28, 29, 30, 31
.cfi_offset \r, \r * 8
.endr

.cfi_offset 96, 32 * 8 // regs->pc
.cfi_return_column 96
Daniel Kiss July 6, 2020, 9:29 a.m. UTC | #13
Hi Ard,

I like your suggestion and tuned it a bit; it now works with LLVM’s unwinder.

Register 96 is outside the DWARF spec[1] and would collide with the SVE registers[2], so 32, the register number reserved for the PC, is the better choice.

my version:

#define ARM64_SIGFRAME_REGS_OFFSET 312 /* offsetof (struct rt_sigframe, uc.uc_mcontext.regs) */

    .text
    .cfi_startproc
    .cfi_signal_frame
    
    .cfi_def_cfa    sp, ARM64_SIGFRAME_REGS_OFFSET
    .irp x, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, \
        13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,  \
        24, 25, 26, 27, 28, 29, 30, 31
    .cfi_offset \x, \x * 8
    .endr

    .cfi_offset 32, 32 * 8 // regs->pc
    .cfi_return_column 32

    nop
ENTRY(__kernel_rt_sigreturn)
    mov x8, #__NR_rt_sigreturn
    svc #0
    .cfi_endproc
ENDPROC(__kernel_rt_sigreturn)

[1] https://developer.arm.com/documentation/ihi0057/e
[2] https://developer.arm.com/documentation/100985/0000/

Ard Biesheuvel July 6, 2020, 9:41 a.m. UTC | #14
On Mon, 6 Jul 2020 at 12:29, Daniel Kiss <Daniel.Kiss@arm.com> wrote:
>
> Hi Ard,
>
> I like your suggestions and tuned a bit and now it works with the LLVM’s unwinders.
>
> Register 96 is out of the DWARF spec[1] and will collide with SVE registers[2] so 32 is better which is the reserved register for PC.
>

I just copied that from libgcc's implementation, so using the existing
reserved register for that indeed seems like a much better approach.

The only remaining question is how to deal with the FP/SIMD and SVE
registers, but given Szabolcs's assertion that exceptions thrown in
signal handlers cannot be caught outside of them anyway, this does not
seem like a huge deal, with the exception of the nested-function case
for pthread cleanup handlers; perhaps the solution for that is
'don't do that'.




Dave Martin July 8, 2020, 4:57 p.m. UTC | #15
On Mon, Jul 06, 2020 at 09:29:24AM +0000, Daniel Kiss wrote:
> Hi Ard,
> 
> I like your suggestions and tuned a bit and now it works with the LLVM’s unwinders.
> 
> Register 96 is out of the DWARF spec[1] and will collide with SVE registers[2] so 32 is better which is the reserved register for PC.
> 
> my version:
> 
> #define ARM64_SIGFRAME_REGS_OFFSET 312 /* offsetof (struct rt_sigframe, uc.uc_mcontext.regs) */
> 
>     .text
>     .cfi_startproc
>     .cfi_signal_frame
>     
>     .cfi_def_cfa    sp, ARM64_SIGFRAME_REGS_OFFSET
>     .irp x, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, \
>         13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,  \
>         24, 25, 26, 27, 28, 29, 30, 31
>     .cfi_offset \x, \x * 8
>     .endr
> 
>     .cfi_offset 32, 32 * 8 // regs->pc
>     .cfi_return_column 32

Have you verified that this works with the GNU unwinders?  I seem to
remember experimenting with .cfi_return_column in the past and hitting
problems when trying to use a fake register.

At the time I just assumed I was doing something wrong and didn't go
digging.  Maybe I _was_ doing something wrong...

Cheers
---Dave
Daniel Kiss July 10, 2020, 5:08 p.m. UTC | #16
> Have you verified that this works with the GNU unwinders?  I seem to
> remember experimenting with .cfi_return_column in the past and hitting
> problems when trying to use a fake register.
> 
> At the time I just assumed I was doing something wrong and didn't go
> digging.  Maybe I _was_ doing something wrong…

Not yet; I hope there is a CI somewhere that will do it for me :)

Cheers,
Daniel