diff mbox series

[RFC,v1,1/1] arm64: Unwinder enhancements for reliable stack trace

Message ID 20210223181243.6776-2-madvenka@linux.microsoft.com (mailing list archive)
State New, archived
Headers show
Series arm64: Unwinder enhancements for reliable stack trace | expand

Commit Message

Madhavan T. Venkataraman Feb. 23, 2021, 6:12 p.m. UTC
From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>

Unwinder changes
================

	Termination
	===========

	Currently, the unwinder terminates when both the FP (frame pointer)
	and the PC (return address) of a frame are 0. But a frame could get
	corrupted and zeroed. There needs to be a better check.

	The following special terminating frame and function have been
	defined for this purpose:

	const u64    arm64_last_frame[2] __attribute__ ((aligned (16)));

	void arm64_last_func(void)
	{
	}

	So, set the FP to arm64_last_frame and the PC to arm64_last_func in
	the bottom most frame.

	Exception/Interrupt detection
	=============================

	An EL1 exception renders the stack trace unreliable as it can happen
	anywhere including the frame pointer prolog and epilog. The
	unwinder needs to be able to detect the exception on the stack.

	Currently, the EL1 exception handler sets up pt_regs on the stack
	and chains pt_regs->stackframe with the other frames on the stack.
	But, the unwinder does not know where this exception frame is in
	the stack trace.

	Set the LSB of the exception frame FP to allow the unwinder to
	detect the exception frame. When the unwinder detects the frame,
	it needs to make sure that it is really an exception frame and
	not the result of any stack corruption.

	It can do this if the FP and PC are also recorded elsewhere in the
	pt_regs for comparison. Currently, the FP is also stored in
	regs->regs[29]. The PC is stored in regs->pc. However, regs->pc can
	be changed by lower level functions.

	Create a new field, pt_regs->orig_pc, and record the return address
	PC there. With this, the unwinder can validate the exception frame
	and set a flag so that the caller of the unwinder can know when
	an exception frame is encountered.

	Unwinder return value
	=====================

	Currently, the unwinder returns -EINVAL for stack trace termination
	as well as stack trace error. Return -ENOENT for stack trace
	termination and -EINVAL for error to disambiguate. This idea has
	been borrowed from Mark Brown.

Reliable stack trace function
=============================

Implement arch_stack_walk_reliable(). This function walks the stack like
the existing stack trace functions with a couple of additional checks:

	Return address check
	--------------------

	For each frame, check the return address to see if it is a
	proper kernel text address. If not, return -EINVAL.

	Exception frame check
	---------------------

	Check each frame to see if it is an EL1 exception frame. If it is,
	return -EINVAL.

Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
 arch/arm64/include/asm/processor.h  |   2 +
 arch/arm64/include/asm/ptrace.h     |   7 ++
 arch/arm64/include/asm/stacktrace.h |   5 ++
 arch/arm64/kernel/asm-offsets.c     |   1 +
 arch/arm64/kernel/entry.S           |  14 +++-
 arch/arm64/kernel/head.S            |   8 +--
 arch/arm64/kernel/process.c         |  12 ++++
 arch/arm64/kernel/stacktrace.c      | 103 +++++++++++++++++++++++++---
 8 files changed, 137 insertions(+), 15 deletions(-)

Comments

Mark Brown Feb. 23, 2021, 7:02 p.m. UTC | #1
On Tue, Feb 23, 2021 at 12:12:43PM -0600, madvenka@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
> 
> Unwinder changes
> ================

This is making several different changes so should be split into a patch
series - for example the change to terminate on a specific function
pointer rather than NULL and the changes to the exception/interupt
detection should be split.  Please see submitting-patches.rst for some
discussion about how to split things up.  In general if you've got a
changelog enumerating a number of different changes in a patch that's a
warning sign that it might be good split things up.

You should also copy the architecture maintainers (Catalin and Will) on
any arch/arm64 submissions.

> 	Unwinder return value
> 	=====================
> 
> 	Currently, the unwinder returns -EINVAL for stack trace termination
> 	as well as stack trace error. Return -ENOENT for stack trace
> 	termination and -EINVAL for error to disambiguate. This idea has
> 	been borrowed from Mark Brown.

You could just include my patch for this in your series.

> Reliable stack trace function
> =============================
> 
> Implement arch_stack_walk_reliable(). This function walks the stack like
> the existing stack trace functions with a couple of additional checks:
> 
> 	Return address check
> 	--------------------
> 
> 	For each frame, check the return address to see if it is a
> 	proper kernel text address. If not, return -EINVAL.
> 
> 	Exception frame check
> 	---------------------
> 
> 	Check each frame to see if it is an EL1 exception frame. If it is,
> 	return -EINVAL.

Again, this should be at least one separate patch.  How does this ensure
that we don't have any issues with any of the various probe mechanisms?
If there's no need to explicitly check anything that should be called
out in the changelog.

Since all these changes are mixed up this is a fairly superficial
review of the actual code.

> +static notrace struct pt_regs *get_frame_regs(struct task_struct *task,
> +					      struct stackframe *frame)
> +{
> +	unsigned long stackframe, regs_start, regs_end;
> +	struct stack_info info;
> +
> +	stackframe = frame->prev_fp;
> +	if (!stackframe)
> +		return NULL;
> +
> +	(void) on_accessible_stack(task, stackframe, &info);

Shouldn't we return NULL if we are not on an accessible stack?

> +static notrace int update_frame(struct task_struct *task,
> +				struct stackframe *frame)

This function really needs some documentation, the function is just
called update_frame() which doesn't say what sort of updates it's
supposed to do and most of the checks aren't explained, not all of them
are super obvious.

> +{
> +	unsigned long lsb = frame->fp & 0xf;
> +	unsigned long fp = frame->fp & ~lsb;
> +	unsigned long pc = frame->pc;
> +	struct pt_regs *regs;
> +
> +	frame->exception_frame = false;
> +
> +	if (fp == (unsigned long) arm64_last_frame &&
> +	    pc == (unsigned long) arm64_last_func)
> +		return -ENOENT;
> +
> +	if (!lsb)
> +		return 0;
> +	if (lsb != 1)
> +		return -EINVAL;
> +
> +	/*
> +	 * This looks like an EL1 exception frame.

For clarity it would be good to spell out the properties of an EL1
exception frame.  It is not clear to me why we don't reference the frame
type information the unwinder already records as part of these checks.

In general, especially for the bits specific to reliable stack trace, I
think we want to err on the side of verbosity here so that it is crystal
clear what all the checks are supposed to be doing and it's that much
easier to tie everything through to the requirements document.
Madhavan T. Venkataraman Feb. 23, 2021, 7:20 p.m. UTC | #2
On 2/23/21 1:02 PM, Mark Brown wrote:
> On Tue, Feb 23, 2021 at 12:12:43PM -0600, madvenka@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>
>> Unwinder changes
>> ================
> 
> This is making several different changes so should be split into a patch
> series - for example the change to terminate on a specific function
> pointer rather than NULL and the changes to the exception/interupt
> detection should be split.  Please see submitting-patches.rst for some
> discussion about how to split things up.  In general if you've got a
> changelog enumerating a number of different changes in a patch that's a
> warning sign that it might be good split things up.
> 

Will do.

> You should also copy the architecture maintainers (Catalin and Will) on
> any arch/arm64 submissions.
> 

Will do when I resubmit.

>> 	Unwinder return value
>> 	=====================
>>
>> 	Currently, the unwinder returns -EINVAL for stack trace termination
>> 	as well as stack trace error. Return -ENOENT for stack trace
>> 	termination and -EINVAL for error to disambiguate. This idea has
>> 	been borrowed from Mark Brown.
> 
> You could just include my patch for this in your series.
> 

OK.

>> Reliable stack trace function
>> =============================
>>
>> Implement arch_stack_walk_reliable(). This function walks the stack like
>> the existing stack trace functions with a couple of additional checks:
>>
>> 	Return address check
>> 	--------------------
>>
>> 	For each frame, check the return address to see if it is a
>> 	proper kernel text address. If not, return -EINVAL.
>>
>> 	Exception frame check
>> 	---------------------
>>
>> 	Check each frame to see if it is an EL1 exception frame. If it is,
>> 	return -EINVAL.
> 
> Again, this should be at least one separate patch.  How does this ensure
> that we don't have any issues with any of the various probe mechanisms?
> If there's no need to explicitly check anything that should be called
> out in the changelog.
> 

I am trying to do this in an incremental fashion. I have to study the probe
mechanisms a little bit more before I can come up with a solution. But
if you want to see that addressed in this patch set, I could do that.
It will take a little bit of time. That is all.

> Since all these changes are mixed up this is a fairly superficial
> review of the actual code.
> 

Understood. I will split things up and we can take it from there.

>> +static notrace struct pt_regs *get_frame_regs(struct task_struct *task,
>> +					      struct stackframe *frame)
>> +{
>> +	unsigned long stackframe, regs_start, regs_end;
>> +	struct stack_info info;
>> +
>> +	stackframe = frame->prev_fp;
>> +	if (!stackframe)
>> +		return NULL;
>> +
>> +	(void) on_accessible_stack(task, stackframe, &info);
> 
> Shouldn't we return NULL if we are not on an accessible stack?
> 

The prev_fp has already been checked by the unwinder in the previous
frame. That is why I don't check the return value. If that is acceptable,
I will add a comment.

>> +static notrace int update_frame(struct task_struct *task,
>> +				struct stackframe *frame)
> 
> This function really needs some documentation, the function is just
> called update_frame() which doesn't say what sort of updates it's
> supposed to do and most of the checks aren't explained, not all of them
> are super obvious.
> 

I will add the documentation as well as try think of a better name.

>> +{
>> +	unsigned long lsb = frame->fp & 0xf;
>> +	unsigned long fp = frame->fp & ~lsb;
>> +	unsigned long pc = frame->pc;
>> +	struct pt_regs *regs;
>> +
>> +	frame->exception_frame = false;
>> +
>> +	if (fp == (unsigned long) arm64_last_frame &&
>> +	    pc == (unsigned long) arm64_last_func)
>> +		return -ENOENT;
>> +
>> +	if (!lsb)
>> +		return 0;
>> +	if (lsb != 1)
>> +		return -EINVAL;
>> +
>> +	/*
>> +	 * This looks like an EL1 exception frame.
> 
> For clarity it would be good to spell out the properties of an EL1
> exception frame.  It is not clear to me why we don't reference the frame
> type information the unwinder already records as part of these checks.
> 
> In general, especially for the bits specific to reliable stack trace, I
> think we want to err on the side of verbosity here so that it is crystal
> clear what all the checks are supposed to be doing and it's that much
> easier to tie everything through to the requirements document.

OK. I will improve the documentation.

Madhavan
Mark Rutland Feb. 24, 2021, 12:17 p.m. UTC | #3
Hi Madhavan,

As Mark Brown says, I think this needs to be split into several
patches. i have some comments on the general approach, but I'll save
in-depth review until this has been split.

On Tue, Feb 23, 2021 at 12:12:43PM -0600, madvenka@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
> 
> Unwinder changes
> ================
> 
> 	Termination
> 	===========
> 
> 	Currently, the unwinder terminates when both the FP (frame pointer)
> 	and the PC (return address) of a frame are 0. But a frame could get
> 	corrupted and zeroed. There needs to be a better check.
> 
> 	The following special terminating frame and function have been
> 	defined for this purpose:
> 
> 	const u64    arm64_last_frame[2] __attribute__ ((aligned (16)));
> 
> 	void arm64_last_func(void)
> 	{
> 	}
> 
> 	So, set the FP to arm64_last_frame and the PC to arm64_last_func in
> 	the bottom most frame.

My expectation was that we'd do this per-task, creating an empty frame
record (i.e. with fp=NULL and lr=NULL) on the task's stack at the
instant it was created, and chaining this into x29. That way the address
is known (since it can be derived from the task), and the frame will
also implicitly check that the callchain terminates on the task stack
without loops. That also means that we can use it to detect the entry
code going wrong (e.g. if the SP gets corrupted), since in that case the
entry code would place the record at a different location.

> 
> 	Exception/Interrupt detection
> 	=============================
> 
> 	An EL1 exception renders the stack trace unreliable as it can happen
> 	anywhere including the frame pointer prolog and epilog. The
> 	unwinder needs to be able to detect the exception on the stack.
> 
> 	Currently, the EL1 exception handler sets up pt_regs on the stack
> 	and chains pt_regs->stackframe with the other frames on the stack.
> 	But, the unwinder does not know where this exception frame is in
> 	the stack trace.
> 
> 	Set the LSB of the exception frame FP to allow the unwinder to
> 	detect the exception frame. When the unwinder detects the frame,
> 	it needs to make sure that it is really an exception frame and
> 	not the result of any stack corruption.

I'm not keen on messing with the encoding of the frame record as this
will break external unwinders (e.g. using GDB on a kernel running under
QEMU). I'd rather that we detected the exception boundary based on the
LR, similar to what we did in commit:

  7326749801396105 ("arm64: unwind: reference pt_regs via embedded stack frame")

... I reckon once we've moved the last of the exception triage out to C
this will be relatively simple, since all of the exception handlers will
look like:

| SYM_CODE_START_LOCAL(elX_exception)
| 	kernel_entry X
| 	mov	x0, sp
| 	bl	elX_exception_handler
| 	kernel_exit X
| SYM_CODE_END(elX_exception)

... and so we just need to identify the set of elX_exception functions
(which we'll never expect to take exceptions from directly). We could be
strict and reject unwinding into arbitrary bits of the entry code (e.g.
if we took an unexpected exception), and only permit unwinding to the
BL.

> 	It can do this if the FP and PC are also recorded elsewhere in the
> 	pt_regs for comparison. Currently, the FP is also stored in
> 	regs->regs[29]. The PC is stored in regs->pc. However, regs->pc can
> 	be changed by lower level functions.
> 
> 	Create a new field, pt_regs->orig_pc, and record the return address
> 	PC there. With this, the unwinder can validate the exception frame
> 	and set a flag so that the caller of the unwinder can know when
> 	an exception frame is encountered.

I don't understand the case you're trying to solve here. When is
regs->pc changed in a way that's problematic?

> 	Unwinder return value
> 	=====================
> 
> 	Currently, the unwinder returns -EINVAL for stack trace termination
> 	as well as stack trace error. Return -ENOENT for stack trace
> 	termination and -EINVAL for error to disambiguate. This idea has
> 	been borrowed from Mark Brown.

IIRC Mark Brown already has a patch for this (and it could be queued on
its own if it hasn't already been).

Thanks,
Mark.

> 
> Reliable stack trace function
> =============================
> 
> Implement arch_stack_walk_reliable(). This function walks the stack like
> the existing stack trace functions with a couple of additional checks:
> 
> 	Return address check
> 	--------------------
> 
> 	For each frame, check the return address to see if it is a
> 	proper kernel text address. If not, return -EINVAL.
> 
> 	Exception frame check
> 	---------------------
> 
> 	Check each frame to see if it is an EL1 exception frame. If it is,
> 	return -EINVAL.
> 
> Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
> ---
>  arch/arm64/include/asm/processor.h  |   2 +
>  arch/arm64/include/asm/ptrace.h     |   7 ++
>  arch/arm64/include/asm/stacktrace.h |   5 ++
>  arch/arm64/kernel/asm-offsets.c     |   1 +
>  arch/arm64/kernel/entry.S           |  14 +++-
>  arch/arm64/kernel/head.S            |   8 +--
>  arch/arm64/kernel/process.c         |  12 ++++
>  arch/arm64/kernel/stacktrace.c      | 103 +++++++++++++++++++++++++---
>  8 files changed, 137 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
> index ca2cd75d3286..d268c74d262e 100644
> --- a/arch/arm64/include/asm/processor.h
> +++ b/arch/arm64/include/asm/processor.h
> @@ -195,6 +195,8 @@ static inline void start_thread_common(struct pt_regs *regs, unsigned long pc)
>  	memset(regs, 0, sizeof(*regs));
>  	forget_syscall(regs);
>  	regs->pc = pc;
> +	regs->stackframe[0] = (u64) arm64_last_frame;
> +	regs->stackframe[1] = (u64) arm64_last_func;
>  
>  	if (system_uses_irq_prio_masking())
>  		regs->pmr_save = GIC_PRIO_IRQON;
> diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
> index e58bca832dff..a15750a9f6e5 100644
> --- a/arch/arm64/include/asm/ptrace.h
> +++ b/arch/arm64/include/asm/ptrace.h
> @@ -201,8 +201,15 @@ struct pt_regs {
>  	/* Only valid for some EL1 exceptions. */
>  	u64 lockdep_hardirqs;
>  	u64 exit_rcu;
> +
> +	/* Only valid for EL1 exceptions. */
> +	u64 orig_pc;
> +	u64 unused1;
>  };
>  
> +extern const u64 arm64_last_frame[2];
> +extern void arm64_last_func(void);
> +
>  static inline bool in_syscall(struct pt_regs const *regs)
>  {
>  	return regs->syscallno != NO_SYSCALL;
> diff --git a/arch/arm64/include/asm/stacktrace.h b/arch/arm64/include/asm/stacktrace.h
> index eb29b1fe8255..9760ceddbd78 100644
> --- a/arch/arm64/include/asm/stacktrace.h
> +++ b/arch/arm64/include/asm/stacktrace.h
> @@ -49,6 +49,9 @@ struct stack_info {
>   *
>   * @graph:       When FUNCTION_GRAPH_TRACER is selected, holds the index of a
>   *               replacement lr value in the ftrace graph stack.
> + *
> + * @exception_frame
> + *		EL1 exception frame.
>   */
>  struct stackframe {
>  	unsigned long fp;
> @@ -59,6 +62,7 @@ struct stackframe {
>  #ifdef CONFIG_FUNCTION_GRAPH_TRACER
>  	int graph;
>  #endif
> +	bool exception_frame;
>  };
>  
>  extern int unwind_frame(struct task_struct *tsk, struct stackframe *frame);
> @@ -169,6 +173,7 @@ static inline void start_backtrace(struct stackframe *frame,
>  	bitmap_zero(frame->stacks_done, __NR_STACK_TYPES);
>  	frame->prev_fp = 0;
>  	frame->prev_type = STACK_TYPE_UNKNOWN;
> +	frame->exception_frame = false;
>  }
>  
>  #endif	/* __ASM_STACKTRACE_H */
> diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
> index 301784463587..a9fbe1ca6d8a 100644
> --- a/arch/arm64/kernel/asm-offsets.c
> +++ b/arch/arm64/kernel/asm-offsets.c
> @@ -75,6 +75,7 @@ int main(void)
>    DEFINE(S_SDEI_TTBR1,		offsetof(struct pt_regs, sdei_ttbr1));
>    DEFINE(S_PMR_SAVE,		offsetof(struct pt_regs, pmr_save));
>    DEFINE(S_STACKFRAME,		offsetof(struct pt_regs, stackframe));
> +  DEFINE(S_ORIG_PC,		offsetof(struct pt_regs, orig_pc));
>    DEFINE(PT_REGS_SIZE,		sizeof(struct pt_regs));
>    BLANK();
>  #ifdef CONFIG_COMPAT
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index c9bae73f2621..b2d6c73dd054 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -264,10 +264,21 @@ alternative_else_nop_endif
>  	 * In order to be able to dump the contents of struct pt_regs at the
>  	 * time the exception was taken (in case we attempt to walk the call
>  	 * stack later), chain it together with the stack frames.
> +	 *
> +	 * Set up a synthetic EL0 frame such that the unwinder can recognize
> +	 * it and stop the unwind.
> +	 *
> +	 * Set up a synthetic EL1 frame such that the unwinder can recognize
> +	 * it. For a reliable stack trace, the unwinder stops here. Else, it
> +	 * continues. Also, record the return address in regs->orig_pc for
> +	 * the unwinder's benefit because regs->pc can be changed.
>  	 */
>  	.if \el == 0
> -	stp	xzr, xzr, [sp, #S_STACKFRAME]
> +	ldr	x29, =arm64_last_frame
> +	ldr	x17, =arm64_last_func
> +	stp	x29, x17, [sp, #S_STACKFRAME]
>  	.else
> +	orr	x29, x29, #1
>  	stp	x29, x22, [sp, #S_STACKFRAME]
>  	.endif
>  	add	x29, sp, #S_STACKFRAME
> @@ -279,6 +290,7 @@ alternative_else_nop_endif
>  #endif
>  
>  	stp	x22, x23, [sp, #S_PC]
> +	str	x22, [sp, #S_ORIG_PC]
>  
>  	/* Not in a syscall by default (el0_svc overwrites for real syscall) */
>  	.if	\el == 0
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index a0dc987724ed..2cce019f29fa 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -448,8 +448,8 @@ SYM_FUNC_START_LOCAL(__primary_switched)
>  0:
>  #endif
>  	add	sp, sp, #16
> -	mov	x29, #0
> -	mov	x30, #0
> +	ldr	x29, =arm64_last_frame
> +	ldr	x30, =arm64_last_func
>  	b	start_kernel
>  SYM_FUNC_END(__primary_switched)
>  
> @@ -644,8 +644,8 @@ SYM_FUNC_START_LOCAL(__secondary_switched)
>  	cbz	x2, __secondary_too_slow
>  	msr	sp_el0, x2
>  	scs_load x2, x3
> -	mov	x29, #0
> -	mov	x30, #0
> +	ldr	x29, =arm64_last_frame
> +	ldr	x30, =arm64_last_func
>  
>  #ifdef CONFIG_ARM64_PTR_AUTH
>  	ptrauth_keys_init_cpu x2, x3, x4, x5
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index 6616486a58fe..bac13fc33914 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -380,6 +380,12 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
>  
>  asmlinkage void ret_from_fork(void) asm("ret_from_fork");
>  
> +const u64	arm64_last_frame[2] __attribute__ ((aligned (16)));
> +
> +void arm64_last_func(void)
> +{
> +}
> +
>  int copy_thread(unsigned long clone_flags, unsigned long stack_start,
>  		unsigned long stk_sz, struct task_struct *p, unsigned long tls)
>  {
> @@ -437,6 +443,12 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
>  	}
>  	p->thread.cpu_context.pc = (unsigned long)ret_from_fork;
>  	p->thread.cpu_context.sp = (unsigned long)childregs;
> +	/*
> +	 * Set up a special termination stack frame for the task.
> +	 */
> +	p->thread.cpu_context.fp = (unsigned long)childregs->stackframe;
> +	childregs->stackframe[0] = (u64) arm64_last_frame;
> +	childregs->stackframe[1] = (u64) arm64_last_func;
>  
>  	ptrace_hw_copy_thread(p);
>  
> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
> index fa56af1a59c3..26ac4dd54eaf 100644
> --- a/arch/arm64/kernel/stacktrace.c
> +++ b/arch/arm64/kernel/stacktrace.c
> @@ -18,6 +18,60 @@
>  #include <asm/stack_pointer.h>
>  #include <asm/stacktrace.h>
>  
> +static notrace struct pt_regs *get_frame_regs(struct task_struct *task,
> +					      struct stackframe *frame)
> +{
> +	unsigned long stackframe, regs_start, regs_end;
> +	struct stack_info info;
> +
> +	stackframe = frame->prev_fp;
> +	if (!stackframe)
> +		return NULL;
> +
> +	(void) on_accessible_stack(task, stackframe, &info);
> +
> +	regs_start = stackframe - offsetof(struct pt_regs, stackframe);
> +	if (regs_start < info.low)
> +		return NULL;
> +	regs_end = regs_start + sizeof(struct pt_regs);
> +	if (regs_end > info.high)
> +		return NULL;
> +	return (struct pt_regs *) regs_start;
> +}
> +
> +static notrace int update_frame(struct task_struct *task,
> +				struct stackframe *frame)
> +{
> +	unsigned long lsb = frame->fp & 0xf;
> +	unsigned long fp = frame->fp & ~lsb;
> +	unsigned long pc = frame->pc;
> +	struct pt_regs *regs;
> +
> +	frame->exception_frame = false;
> +
> +	if (fp == (unsigned long) arm64_last_frame &&
> +	    pc == (unsigned long) arm64_last_func)
> +		return -ENOENT;
> +
> +	if (!lsb)
> +		return 0;
> +	if (lsb != 1)
> +		return -EINVAL;
> +
> +	/*
> +	 * This looks like an EL1 exception frame.
> +	 * Make sure the frame matches the EL1 pt_regs.
> +	 */
> +	regs = get_frame_regs(task, frame);
> +	if (!regs || fp != READ_ONCE_NOCHECK(regs->regs[29]) ||
> +	   pc != regs->orig_pc)
> +		return -EINVAL;
> +
> +	frame->exception_frame = true;
> +	frame->fp = fp;
> +	return 0;
> +}
> +
>  /*
>   * AArch64 PCS assigns the frame pointer to x29.
>   *
> @@ -104,16 +158,7 @@ int notrace unwind_frame(struct task_struct *tsk, struct stackframe *frame)
>  
>  	frame->pc = ptrauth_strip_insn_pac(frame->pc);
>  
> -	/*
> -	 * Frames created upon entry from EL0 have NULL FP and PC values, so
> -	 * don't bother reporting these. Frames created by __noreturn functions
> -	 * might have a valid FP even if PC is bogus, so only terminate where
> -	 * both are NULL.
> -	 */
> -	if (!frame->fp && !frame->pc)
> -		return -EINVAL;
> -
> -	return 0;
> +	return update_frame(tsk, frame);
>  }
>  NOKPROBE_SYMBOL(unwind_frame);
>  
> @@ -217,4 +262,42 @@ void arch_stack_walk(stack_trace_consume_fn consume_entry, void *cookie,
>  	walk_stackframe(task, &frame, consume_entry, cookie);
>  }
>  
> +int arch_stack_walk_reliable(stack_trace_consume_fn consume_entry,
> +			      void *cookie, struct task_struct *task)
> +{
> +	struct stackframe frame;
> +	int ret = 0;
> +
> +	if (task == current) {
> +		start_backtrace(&frame,
> +				(unsigned long)__builtin_frame_address(0),
> +				(unsigned long)arch_stack_walk_reliable);
> +	} else {
> +		start_backtrace(&frame, thread_saved_fp(task),
> +				thread_saved_pc(task));
> +	}
> +
> +	while (!ret) {
> +		/*
> +		 * If the task encountered an EL1 exception, the stack trace
> +		 * is unreliable.
> +		 */
> +		if (frame.exception_frame)
> +			return -EINVAL;
> +
> +		/*
> +		 * A NULL or invalid return address probably means there's
> +		 * some generated code which __kernel_text_address() doesn't
> +		 * know about.
> +		 */
> +		if (!__kernel_text_address(frame.pc))
> +			return -EINVAL;
> +		if (!consume_entry(cookie, frame.pc))
> +			return -EINVAL;
> +		ret = unwind_frame(task, &frame);
> +	}
> +
> +	return ret == -ENOENT ? 0 : -EINVAL;
> +}
> +
>  #endif
> -- 
> 2.25.1
>
Mark Brown Feb. 24, 2021, 12:33 p.m. UTC | #4
On Tue, Feb 23, 2021 at 01:20:49PM -0600, Madhavan T. Venkataraman wrote:
> On 2/23/21 1:02 PM, Mark Brown wrote:
> > On Tue, Feb 23, 2021 at 12:12:43PM -0600, madvenka@linux.microsoft.com wrote:

> >> Reliable stack trace function
> >> =============================
> >>
> >> Implement arch_stack_walk_reliable(). This function walks the stack like
> >> the existing stack trace functions with a couple of additional checks:

> > Again, this should be at least one separate patch.  How does this ensure
> > that we don't have any issues with any of the various probe mechanisms?
> > If there's no need to explicitly check anything that should be called
> > out in the changelog.

> I am trying to do this in an incremental fashion. I have to study the probe
> mechanisms a little bit more before I can come up with a solution. But
> if you want to see that addressed in this patch set, I could do that.
> It will take a little bit of time. That is all.

Handling of the probes stuff seems like it's critical to reliable stack
walk so we shouldn't claim to have support for reliable stack walk
without it.  If it was a working implementation we could improve that'd
be one thing but this would be buggy which is a different thing.

> >> +	(void) on_accessible_stack(task, stackframe, &info);

> > Shouldn't we return NULL if we are not on an accessible stack?

> The prev_fp has already been checked by the unwinder in the previous
> frame. That is why I don't check the return value. If that is acceptable,
> I will add a comment.

TBH if you're adding the comment it seems like you may as well add the
check, it's not like it's expensive and it means there's no possibility
that some future change could result in this assumption being broken.
Madhavan T. Venkataraman Feb. 24, 2021, 7:26 p.m. UTC | #5
On 2/24/21 6:33 AM, Mark Brown wrote:
> On Tue, Feb 23, 2021 at 01:20:49PM -0600, Madhavan T. Venkataraman wrote:
>> On 2/23/21 1:02 PM, Mark Brown wrote:
>>> On Tue, Feb 23, 2021 at 12:12:43PM -0600, madvenka@linux.microsoft.com wrote:
> 
>>>> Reliable stack trace function
>>>> =============================
>>>>
>>>> Implement arch_stack_walk_reliable(). This function walks the stack like
>>>> the existing stack trace functions with a couple of additional checks:
> 
>>> Again, this should be at least one separate patch.  How does this ensure
>>> that we don't have any issues with any of the various probe mechanisms?
>>> If there's no need to explicitly check anything that should be called
>>> out in the changelog.
> 
>> I am trying to do this in an incremental fashion. I have to study the probe
>> mechanisms a little bit more before I can come up with a solution. But
>> if you want to see that addressed in this patch set, I could do that.
>> It will take a little bit of time. That is all.
> 
> Handling of the probes stuff seems like it's critical to reliable stack
> walk so we shouldn't claim to have support for reliable stack walk
> without it.  If it was a working implementation we could improve that'd
> be one thing but this would be buggy which is a different thing.
> 

OK. I will address the probe stuff in my resend.

>>>> +	(void) on_accessible_stack(task, stackframe, &info);
> 
>>> Shouldn't we return NULL if we are not on an accessible stack?
> 
>> The prev_fp has already been checked by the unwinder in the previous
>> frame. That is why I don't check the return value. If that is acceptable,
>> I will add a comment.
> 
> TBH if you're adding the comment it seems like you may as well add the
> check, it's not like it's expensive and it means there's no possibility
> that some future change could result in this assumption being broken.
> 

OK. I will add the check.

Thanks.

Madhavan
Madhavan T. Venkataraman Feb. 24, 2021, 7:34 p.m. UTC | #6
On 2/24/21 6:17 AM, Mark Rutland wrote:
> Hi Madhavan,
> 
> As Mark Brown says, I think this needs to be split into several
> patches. i have some comments on the general approach, but I'll save
> in-depth review until this has been split.
> 

OK.

> On Tue, Feb 23, 2021 at 12:12:43PM -0600, madvenka@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>
>> Unwinder changes
>> ================
>>
>> 	Termination
>> 	===========
>>
>> 	Currently, the unwinder terminates when both the FP (frame pointer)
>> 	and the PC (return address) of a frame are 0. But a frame could get
>> 	corrupted and zeroed. There needs to be a better check.
>>
>> 	The following special terminating frame and function have been
>> 	defined for this purpose:
>>
>> 	const u64    arm64_last_frame[2] __attribute__ ((aligned (16)));
>>
>> 	void arm64_last_func(void)
>> 	{
>> 	}
>>
>> 	So, set the FP to arm64_last_frame and the PC to arm64_last_func in
>> 	the bottom most frame.
> 
> My expectation was that we'd do this per-task, creating an empty frame
> record (i.e. with fp=NULL and lr=NULL) on the task's stack at the
> instant it was created, and chaining this into x29. That way the address
> is known (since it can be derived from the task), and the frame will
> also implicitly check that the callchain terminates on the task stack
> without loops. That also means that we can use it to detect the entry
> code going wrong (e.g. if the SP gets corrupted), since in that case the
> entry code would place the record at a different location.
> 

That is exactly what this is doing. arm64_last_frame[] is a marker frame
that contains fp=0 and pc=0.

>>
>> 	Exception/Interrupt detection
>> 	=============================
>>
>> 	An EL1 exception renders the stack trace unreliable as it can happen
>> 	anywhere including the frame pointer prolog and epilog. The
>> 	unwinder needs to be able to detect the exception on the stack.
>>
>> 	Currently, the EL1 exception handler sets up pt_regs on the stack
>> 	and chains pt_regs->stackframe with the other frames on the stack.
>> 	But, the unwinder does not know where this exception frame is in
>> 	the stack trace.
>>
>> 	Set the LSB of the exception frame FP to allow the unwinder to
>> 	detect the exception frame. When the unwinder detects the frame,
>> 	it needs to make sure that it is really an exception frame and
>> 	not the result of any stack corruption.
> 
> I'm not keen on messing with the encoding of the frame record as this
> will break external unwinders (e.g. using GDB on a kernel running under
> QEMU). I'd rather that we detected the exception boundary based on the
> LR, similar to what we did in commit:
> 

OK. I will take a look at the commit you mentioned.

>   7326749801396105 ("arm64: unwind: reference pt_regs via embedded stack frame")
> 
> ... I reckon once we've moved the last of the exception triage out to C
> this will be relatively simple, since all of the exception handlers will
> look like:
> 
> | SYM_CODE_START_LOCAL(elX_exception)
> | 	kernel_entry X
> | 	mov	x0, sp
> | 	bl	elX_exception_handler
> | 	kernel_exit X
> | SYM_CODE_END(elX_exception)
> 
> ... and so we just need to identify the set of elX_exception functions
> (which we'll never expect to take exceptions from directly). We could be
> strict and reject unwinding into arbitrary bits of the entry code (e.g.
> if we took an unexpected exception), and only permit unwinding to the
> BL.
> 
>> 	It can do this if the FP and PC are also recorded elsewhere in the
>> 	pt_regs for comparison. Currently, the FP is also stored in
>> 	regs->regs[29]. The PC is stored in regs->pc. However, regs->pc can
>> 	be changed by lower level functions.
>>
>> 	Create a new field, pt_regs->orig_pc, and record the return address
>> 	PC there. With this, the unwinder can validate the exception frame
>> 	and set a flag so that the caller of the unwinder can know when
>> 	an exception frame is encountered.
> 
> I don't understand the case you're trying to solve here. When is
> regs->pc changed in a way that's problematic?
> 

For instance, I used a test driver in which the driver calls a function
pointer which is NULL. The low level fault handler sends a signal to the
task. Looks like it changes regs->pc for this. When I dump the stack
from the low level handler, the comparison with regs->pc does not work.
But comparison with regs->orig_pc works.

>> 	Unwinder return value
>> 	=====================
>>
>> 	Currently, the unwinder returns -EINVAL for stack trace termination
>> 	as well as stack trace error. Return -ENOENT for stack trace
>> 	termination and -EINVAL for error to disambiguate. This idea has
>> 	been borrowed from Mark Brown.
> 
> IIRC Mark Brown already has a patch for this (and it could be queued on
> its own if it hasn't already been).
> 

I saw it. That is fine.

Thanks.

Madhavan
Mark Rutland Feb. 25, 2021, 11:58 a.m. UTC | #7
On Wed, Feb 24, 2021 at 01:34:09PM -0600, Madhavan T. Venkataraman wrote:
> On 2/24/21 6:17 AM, Mark Rutland wrote:
> > On Tue, Feb 23, 2021 at 12:12:43PM -0600, madvenka@linux.microsoft.com wrote:
> >> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
> >> 	Termination
> >> 	===========
> >>
> >> 	Currently, the unwinder terminates when both the FP (frame pointer)
> >> 	and the PC (return address) of a frame are 0. But a frame could get
> >> 	corrupted and zeroed. There needs to be a better check.
> >>
> >> 	The following special terminating frame and function have been
> >> 	defined for this purpose:
> >>
> >> 	const u64    arm64_last_frame[2] __attribute__ ((aligned (16)));
> >>
> >> 	void arm64_last_func(void)
> >> 	{
> >> 	}
> >>
> >> 	So, set the FP to arm64_last_frame and the PC to arm64_last_func in
> >> 	the bottom most frame.
> > 
> > My expectation was that we'd do this per-task, creating an empty frame
> > record (i.e. with fp=NULL and lr=NULL) on the task's stack at the
> > instant it was created, and chaining this into x29. That way the address
> > is known (since it can be derived from the task), and the frame will
> > also implicitly check that the callchain terminates on the task stack
> > without loops. That also means that we can use it to detect the entry
> > code going wrong (e.g. if the SP gets corrupted), since in that case the
> > entry code would place the record at a different location.
> 
> That is exactly what this is doing. arm64_last_frame[] is a marker frame
> that contains fp=0 and pc=0.

Almost! What I meant was that rather that each task should have its own
final/marker frame record on its task task rather than sharing a common
global variable.

That way a check for that frame record implicitly checks that a task
started at the expected location *on that stack*, which catches
additional stack corruption cases (e.g. if we left data on the stack
prior to an EL0 entry).

[...]

> > ... I reckon once we've moved the last of the exception triage out to C
> > this will be relatively simple, since all of the exception handlers will
> > look like:
> > 
> > | SYM_CODE_START_LOCAL(elX_exception)
> > | 	kernel_entry X
> > | 	mov	x0, sp
> > | 	bl	elX_exception_handler
> > | 	kernel_exit X
> > | SYM_CODE_END(elX_exception)
> > 
> > ... and so we just need to identify the set of elX_exception functions
> > (which we'll never expect to take exceptions from directly). We could be
> > strict and reject unwinding into arbitrary bits of the entry code (e.g.
> > if we took an unexpected exception), and only permit unwinding to the
> > BL.
> > 
> >> 	It can do this if the FP and PC are also recorded elsewhere in the
> >> 	pt_regs for comparison. Currently, the FP is also stored in
> >> 	regs->regs[29]. The PC is stored in regs->pc. However, regs->pc can
> >> 	be changed by lower level functions.
> >>
> >> 	Create a new field, pt_regs->orig_pc, and record the return address
> >> 	PC there. With this, the unwinder can validate the exception frame
> >> 	and set a flag so that the caller of the unwinder can know when
> >> 	an exception frame is encountered.
> > 
> > I don't understand the case you're trying to solve here. When is
> > regs->pc changed in a way that's problematic?
> > 
> 
> For instance, I used a test driver in which the driver calls a function
> pointer which is NULL. The low level fault handler sends a signal to the
> task. Looks like it changes regs->pc for this.

I'm struggling to follow what you mean here.

If the kernel branches to NULL, the CPU should raise an instruction
abort from the current EL, and our handling of that should terminate the
thread via die_kernel_fault(), without returning to the faulting
context, and without altering the PC in the faulting context.

Signals are delivered to userspace and alter the userspace PC, not a
kernel PC, so this doesn't seem relevant. Do you mean an exception fixup
handler rather than a signal?

> When I dump the stack from the low level handler, the comparison with
> regs->pc does not work.  But comparison with regs->orig_pc works.

As above, I'm struggling with this; could you share a concrete example? 

Thanks,
Mark.
Madhavan T. Venkataraman March 1, 2021, 4:58 p.m. UTC | #8
On 2/25/21 5:58 AM, Mark Rutland wrote:
> On Wed, Feb 24, 2021 at 01:34:09PM -0600, Madhavan T. Venkataraman wrote:
>> On 2/24/21 6:17 AM, Mark Rutland wrote:
>>> On Tue, Feb 23, 2021 at 12:12:43PM -0600, madvenka@linux.microsoft.com wrote:
>>>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>>> 	Termination
>>>> 	===========
>>>>
>>>> 	Currently, the unwinder terminates when both the FP (frame pointer)
>>>> 	and the PC (return address) of a frame are 0. But a frame could get
>>>> 	corrupted and zeroed. There needs to be a better check.
>>>>
>>>> 	The following special terminating frame and function have been
>>>> 	defined for this purpose:
>>>>
>>>> 	const u64    arm64_last_frame[2] __attribute__ ((aligned (16)));
>>>>
>>>> 	void arm64_last_func(void)
>>>> 	{
>>>> 	}
>>>>
>>>> 	So, set the FP to arm64_last_frame and the PC to arm64_last_func in
>>>> 	the bottom most frame.
>>>
>>> My expectation was that we'd do this per-task, creating an empty frame
>>> record (i.e. with fp=NULL and lr=NULL) on the task's stack at the
>>> instant it was created, and chaining this into x29. That way the address
>>> is known (since it can be derived from the task), and the frame will
>>> also implicitly check that the callchain terminates on the task stack
>>> without loops. That also means that we can use it to detect the entry
>>> code going wrong (e.g. if the SP gets corrupted), since in that case the
>>> entry code would place the record at a different location.
>>
>> That is exactly what this is doing. arm64_last_frame[] is a marker frame
>> that contains fp=0 and pc=0.
> 
> Almost! What I meant was that rather that each task should have its own
> final/marker frame record on its task task rather than sharing a common
> global variable.
> 
> That way a check for that frame record implicitly checks that a task
> started at the expected location *on that stack*, which catches
> additional stack corruption cases (e.g. if we left data on the stack
> prior to an EL0 entry).
> 

Ok. I will think about this.

> [...]
> 
>>> ... I reckon once we've moved the last of the exception triage out to C
>>> this will be relatively simple, since all of the exception handlers will
>>> look like:
>>>
>>> | SYM_CODE_START_LOCAL(elX_exception)
>>> | 	kernel_entry X
>>> | 	mov	x0, sp
>>> | 	bl	elX_exception_handler
>>> | 	kernel_exit X
>>> | SYM_CODE_END(elX_exception)
>>>
>>> ... and so we just need to identify the set of elX_exception functions
>>> (which we'll never expect to take exceptions from directly). We could be
>>> strict and reject unwinding into arbitrary bits of the entry code (e.g.
>>> if we took an unexpected exception), and only permit unwinding to the
>>> BL.
>>>
>>>> 	It can do this if the FP and PC are also recorded elsewhere in the
>>>> 	pt_regs for comparison. Currently, the FP is also stored in
>>>> 	regs->regs[29]. The PC is stored in regs->pc. However, regs->pc can
>>>> 	be changed by lower level functions.
>>>>
>>>> 	Create a new field, pt_regs->orig_pc, and record the return address
>>>> 	PC there. With this, the unwinder can validate the exception frame
>>>> 	and set a flag so that the caller of the unwinder can know when
>>>> 	an exception frame is encountered.
>>>
>>> I don't understand the case you're trying to solve here. When is
>>> regs->pc changed in a way that's problematic?
>>>
>>
>> For instance, I used a test driver in which the driver calls a function
>> pointer which is NULL. The low level fault handler sends a signal to the
>> task. Looks like it changes regs->pc for this.
> 
> I'm struggling to follow what you mean here.
> 
> If the kernel branches to NULL, the CPU should raise an instruction
> abort from the current EL, and our handling of that should terminate the
> thread via die_kernel_fault(), without returning to the faulting
> context, and without altering the PC in the faulting context.
> 
> Signals are delivered to userspace and alter the userspace PC, not a
> kernel PC, so this doesn't seem relevant. Do you mean an exception fixup
> handler rather than a signal?
> 
>> When I dump the stack from the low level handler, the comparison with
>> regs->pc does not work.  But comparison with regs->orig_pc works.
> 
> As above, I'm struggling with this; could you share a concrete example? 
> 

Actually, my bad. I needed the orig_pc because of something that my test
driver did. And, it slipped my mind entirely.

Thanks for pointing it out. I will fix this in my resend.

Thanks again for your comments.

I am currently studying probing/tracing. As soon as I have a solution for that,
I will send out the next version. I look forward to the in-depth review.

Thanks,

Madhavan
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
index ca2cd75d3286..d268c74d262e 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -195,6 +195,8 @@  static inline void start_thread_common(struct pt_regs *regs, unsigned long pc)
 	memset(regs, 0, sizeof(*regs));
 	forget_syscall(regs);
 	regs->pc = pc;
+	regs->stackframe[0] = (u64) arm64_last_frame;
+	regs->stackframe[1] = (u64) arm64_last_func;
 
 	if (system_uses_irq_prio_masking())
 		regs->pmr_save = GIC_PRIO_IRQON;
diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
index e58bca832dff..a15750a9f6e5 100644
--- a/arch/arm64/include/asm/ptrace.h
+++ b/arch/arm64/include/asm/ptrace.h
@@ -201,8 +201,15 @@  struct pt_regs {
 	/* Only valid for some EL1 exceptions. */
 	u64 lockdep_hardirqs;
 	u64 exit_rcu;
+
+	/* Only valid for EL1 exceptions. */
+	u64 orig_pc;
+	u64 unused1;
 };
 
+extern const u64 arm64_last_frame[2];
+extern void arm64_last_func(void);
+
 static inline bool in_syscall(struct pt_regs const *regs)
 {
 	return regs->syscallno != NO_SYSCALL;
diff --git a/arch/arm64/include/asm/stacktrace.h b/arch/arm64/include/asm/stacktrace.h
index eb29b1fe8255..9760ceddbd78 100644
--- a/arch/arm64/include/asm/stacktrace.h
+++ b/arch/arm64/include/asm/stacktrace.h
@@ -49,6 +49,9 @@  struct stack_info {
  *
  * @graph:       When FUNCTION_GRAPH_TRACER is selected, holds the index of a
  *               replacement lr value in the ftrace graph stack.
+ *
+ * @exception_frame
+ *		EL1 exception frame.
  */
 struct stackframe {
 	unsigned long fp;
@@ -59,6 +62,7 @@  struct stackframe {
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
 	int graph;
 #endif
+	bool exception_frame;
 };
 
 extern int unwind_frame(struct task_struct *tsk, struct stackframe *frame);
@@ -169,6 +173,7 @@  static inline void start_backtrace(struct stackframe *frame,
 	bitmap_zero(frame->stacks_done, __NR_STACK_TYPES);
 	frame->prev_fp = 0;
 	frame->prev_type = STACK_TYPE_UNKNOWN;
+	frame->exception_frame = false;
 }
 
 #endif	/* __ASM_STACKTRACE_H */
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 301784463587..a9fbe1ca6d8a 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -75,6 +75,7 @@  int main(void)
   DEFINE(S_SDEI_TTBR1,		offsetof(struct pt_regs, sdei_ttbr1));
   DEFINE(S_PMR_SAVE,		offsetof(struct pt_regs, pmr_save));
   DEFINE(S_STACKFRAME,		offsetof(struct pt_regs, stackframe));
+  DEFINE(S_ORIG_PC,		offsetof(struct pt_regs, orig_pc));
   DEFINE(PT_REGS_SIZE,		sizeof(struct pt_regs));
   BLANK();
 #ifdef CONFIG_COMPAT
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index c9bae73f2621..b2d6c73dd054 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -264,10 +264,21 @@  alternative_else_nop_endif
 	 * In order to be able to dump the contents of struct pt_regs at the
 	 * time the exception was taken (in case we attempt to walk the call
 	 * stack later), chain it together with the stack frames.
+	 *
+	 * Set up a synthetic EL0 frame such that the unwinder can recognize
+	 * it and stop the unwind.
+	 *
+	 * Set up a synthetic EL1 frame such that the unwinder can recognize
+	 * it. For a reliable stack trace, the unwinder stops here. Else, it
+	 * continues. Also, record the return address in regs->orig_pc for
+	 * the unwinder's benefit because regs->pc can be changed.
 	 */
 	.if \el == 0
-	stp	xzr, xzr, [sp, #S_STACKFRAME]
+	ldr	x29, =arm64_last_frame
+	ldr	x17, =arm64_last_func
+	stp	x29, x17, [sp, #S_STACKFRAME]
 	.else
+	orr	x29, x29, #1
 	stp	x29, x22, [sp, #S_STACKFRAME]
 	.endif
 	add	x29, sp, #S_STACKFRAME
@@ -279,6 +290,7 @@  alternative_else_nop_endif
 #endif
 
 	stp	x22, x23, [sp, #S_PC]
+	str	x22, [sp, #S_ORIG_PC]
 
 	/* Not in a syscall by default (el0_svc overwrites for real syscall) */
 	.if	\el == 0
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index a0dc987724ed..2cce019f29fa 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -448,8 +448,8 @@  SYM_FUNC_START_LOCAL(__primary_switched)
 0:
 #endif
 	add	sp, sp, #16
-	mov	x29, #0
-	mov	x30, #0
+	ldr	x29, =arm64_last_frame
+	ldr	x30, =arm64_last_func
 	b	start_kernel
 SYM_FUNC_END(__primary_switched)
 
@@ -644,8 +644,8 @@  SYM_FUNC_START_LOCAL(__secondary_switched)
 	cbz	x2, __secondary_too_slow
 	msr	sp_el0, x2
 	scs_load x2, x3
-	mov	x29, #0
-	mov	x30, #0
+	ldr	x29, =arm64_last_frame
+	ldr	x30, =arm64_last_func
 
 #ifdef CONFIG_ARM64_PTR_AUTH
 	ptrauth_keys_init_cpu x2, x3, x4, x5
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 6616486a58fe..bac13fc33914 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -380,6 +380,12 @@  int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 
 asmlinkage void ret_from_fork(void) asm("ret_from_fork");
 
+const u64	arm64_last_frame[2] __attribute__ ((aligned (16)));
+
+void arm64_last_func(void)
+{
+}
+
 int copy_thread(unsigned long clone_flags, unsigned long stack_start,
 		unsigned long stk_sz, struct task_struct *p, unsigned long tls)
 {
@@ -437,6 +443,12 @@  int copy_thread(unsigned long clone_flags, unsigned long stack_start,
 	}
 	p->thread.cpu_context.pc = (unsigned long)ret_from_fork;
 	p->thread.cpu_context.sp = (unsigned long)childregs;
+	/*
+	 * Set up a special termination stack frame for the task.
+	 */
+	p->thread.cpu_context.fp = (unsigned long)childregs->stackframe;
+	childregs->stackframe[0] = (u64) arm64_last_frame;
+	childregs->stackframe[1] = (u64) arm64_last_func;
 
 	ptrace_hw_copy_thread(p);
 
diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
index fa56af1a59c3..26ac4dd54eaf 100644
--- a/arch/arm64/kernel/stacktrace.c
+++ b/arch/arm64/kernel/stacktrace.c
@@ -18,6 +18,60 @@ 
 #include <asm/stack_pointer.h>
 #include <asm/stacktrace.h>
 
+static notrace struct pt_regs *get_frame_regs(struct task_struct *task,
+					      struct stackframe *frame)
+{
+	unsigned long stackframe, regs_start, regs_end;
+	struct stack_info info;
+
+	stackframe = frame->prev_fp;
+	if (!stackframe)
+		return NULL;
+
+	(void) on_accessible_stack(task, stackframe, &info);
+
+	regs_start = stackframe - offsetof(struct pt_regs, stackframe);
+	if (regs_start < info.low)
+		return NULL;
+	regs_end = regs_start + sizeof(struct pt_regs);
+	if (regs_end > info.high)
+		return NULL;
+	return (struct pt_regs *) regs_start;
+}
+
+static notrace int update_frame(struct task_struct *task,
+				struct stackframe *frame)
+{
+	unsigned long lsb = frame->fp & 0xf;
+	unsigned long fp = frame->fp & ~lsb;
+	unsigned long pc = frame->pc;
+	struct pt_regs *regs;
+
+	frame->exception_frame = false;
+
+	if (fp == (unsigned long) arm64_last_frame &&
+	    pc == (unsigned long) arm64_last_func)
+		return -ENOENT;
+
+	if (!lsb)
+		return 0;
+	if (lsb != 1)
+		return -EINVAL;
+
+	/*
+	 * This looks like an EL1 exception frame.
+	 * Make sure the frame matches the EL1 pt_regs.
+	 */
+	regs = get_frame_regs(task, frame);
+	if (!regs || fp != READ_ONCE_NOCHECK(regs->regs[29]) ||
+	   pc != regs->orig_pc)
+		return -EINVAL;
+
+	frame->exception_frame = true;
+	frame->fp = fp;
+	return 0;
+}
+
 /*
  * AArch64 PCS assigns the frame pointer to x29.
  *
@@ -104,16 +158,7 @@  int notrace unwind_frame(struct task_struct *tsk, struct stackframe *frame)
 
 	frame->pc = ptrauth_strip_insn_pac(frame->pc);
 
-	/*
-	 * Frames created upon entry from EL0 have NULL FP and PC values, so
-	 * don't bother reporting these. Frames created by __noreturn functions
-	 * might have a valid FP even if PC is bogus, so only terminate where
-	 * both are NULL.
-	 */
-	if (!frame->fp && !frame->pc)
-		return -EINVAL;
-
-	return 0;
+	return update_frame(tsk, frame);
 }
 NOKPROBE_SYMBOL(unwind_frame);
 
@@ -217,4 +262,42 @@  void arch_stack_walk(stack_trace_consume_fn consume_entry, void *cookie,
 	walk_stackframe(task, &frame, consume_entry, cookie);
 }
 
+int arch_stack_walk_reliable(stack_trace_consume_fn consume_entry,
+			      void *cookie, struct task_struct *task)
+{
+	struct stackframe frame;
+	int ret = 0;
+
+	if (task == current) {
+		start_backtrace(&frame,
+				(unsigned long)__builtin_frame_address(0),
+				(unsigned long)arch_stack_walk_reliable);
+	} else {
+		start_backtrace(&frame, thread_saved_fp(task),
+				thread_saved_pc(task));
+	}
+
+	while (!ret) {
+		/*
+		 * If the task encountered an EL1 exception, the stack trace
+		 * is unreliable.
+		 */
+		if (frame.exception_frame)
+			return -EINVAL;
+
+		/*
+		 * A NULL or invalid return address probably means there's
+		 * some generated code which __kernel_text_address() doesn't
+		 * know about.
+		 */
+		if (!__kernel_text_address(frame.pc))
+			return -EINVAL;
+		if (!consume_entry(cookie, frame.pc))
+			return -EINVAL;
+		ret = unwind_frame(task, &frame);
+	}
+
+	return ret == -ENOENT ? 0 : -EINVAL;
+}
+
 #endif