[RFC,v5,1/2] arm64: Introduce stack trace reliability checks in the unwinder

Message ID 20210526214917.20099-2-madvenka@linux.microsoft.com (mailing list archive)
State New, archived
Series arm64: Implement stack trace reliability checks

Commit Message

Madhavan T. Venkataraman May 26, 2021, 9:49 p.m. UTC
From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>

The unwinder should check for the presence of various features and
conditions that can render the stack trace unreliable and mark the
stack trace as unreliable for the benefit of the caller.

Introduce the first reliability check - If a return PC is not a valid
kernel text address, consider the stack trace unreliable. It could be
some generated code.

Other reliability checks will be added in the future.

Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
 arch/arm64/include/asm/stacktrace.h |  9 +++++++
 arch/arm64/kernel/stacktrace.c      | 38 +++++++++++++++++++++++++----
 2 files changed, 42 insertions(+), 5 deletions(-)

Comments

Mark Rutland June 24, 2021, 2:40 p.m. UTC | #1
Hi Madhavan,

On Wed, May 26, 2021 at 04:49:16PM -0500, madvenka@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
> 
> The unwinder should check for the presence of various features and
> conditions that can render the stack trace unreliable and mark the
> stack trace as unreliable for the benefit of the caller.
> 
> Introduce the first reliability check - If a return PC is not a valid
> kernel text address, consider the stack trace unreliable. It could be
> some generated code.
> 
> Other reliability checks will be added in the future.
> 
> Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>

At a high-level, I'm on-board with keeping track of this per unwind
step, but if we do that then I want to be able to use this during
regular unwinds (e.g. so that we can have a backtrace indicate when a
step is not reliable, like x86 does with '?'), and to do that we need to
be a little more accurate.

I think we first need to do some more preparatory work for that, but
regardless, I have some comments below.

> ---
>  arch/arm64/include/asm/stacktrace.h |  9 +++++++
>  arch/arm64/kernel/stacktrace.c      | 38 +++++++++++++++++++++++++----
>  2 files changed, 42 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/stacktrace.h b/arch/arm64/include/asm/stacktrace.h
> index eb29b1fe8255..4c822ef7f588 100644
> --- a/arch/arm64/include/asm/stacktrace.h
> +++ b/arch/arm64/include/asm/stacktrace.h
> @@ -49,6 +49,13 @@ struct stack_info {
>   *
>   * @graph:       When FUNCTION_GRAPH_TRACER is selected, holds the index of a
>   *               replacement lr value in the ftrace graph stack.
> + *
> + * @reliable:	Is this stack frame reliable? There are several checks that
> + *              need to be performed in unwind_frame() before a stack frame
> + *              is truly reliable. Until all the checks are present, this flag
> + *              is just a place holder. Once all the checks are implemented,
> + *              this comment will be updated and the flag can be used by the
> + *              caller of unwind_frame().

I'd prefer that we state the high-level semantic first, then drill down
into detail, e.g.

| @reliable: Indicates whether this frame is believed to be a reliable
|            unwinding from the parent stackframe. This may be set
|            regardless of whether the parent stackframe was reliable.
|            
|            This is set only if all the following are true:
| 
|            * @pc is a valid text address.
| 
|            Note: this is currently incomplete.

>   */
>  struct stackframe {
>  	unsigned long fp;
> @@ -59,6 +66,7 @@ struct stackframe {
>  #ifdef CONFIG_FUNCTION_GRAPH_TRACER
>  	int graph;
>  #endif
> +	bool reliable;
>  };
>  
>  extern int unwind_frame(struct task_struct *tsk, struct stackframe *frame);
> @@ -169,6 +177,7 @@ static inline void start_backtrace(struct stackframe *frame,
>  	bitmap_zero(frame->stacks_done, __NR_STACK_TYPES);
>  	frame->prev_fp = 0;
>  	frame->prev_type = STACK_TYPE_UNKNOWN;
> +	frame->reliable = true;
>  }

I think we need more data than this to be accurate.

Consider arch_stack_walk() starting from a pt_regs -- the initial state
(the PC from the regs) is accurate, but the first unwind from that will
not be, and we don't account for that at all.

I think we need to capture an unwind type in struct stackframe, which we
can pass into start_backtrace(), e.g.

| enum unwind_type {
|         /*
|          * The next frame is indicated by the frame pointer.
|          * The next unwind may or may not be reliable.
|          */
|         UNWIND_TYPE_FP,
| 
|         /*
|          * The next frame is indicated by the LR in pt_regs.
|          * The next unwind is not reliable.
|          */
|         UNWIND_TYPE_REGS_LR,
| 
|         /*
|          * We do not know how to unwind to the next frame.
|          * The next unwind is not reliable.
|          */
|         UNWIND_TYPE_UNKNOWN
| };

That should be simple enough to set up around start_backtrace(), but
we'll need further rework to make that simple at exception boundaries.
With the entry rework I have queued for v5.14, we're *almost* down to a
single asm<->c transition point for all vectors, and I'm hoping to
factor the remainder out to C for v5.15, whereupon we can annotate that
BL with some metadata for unwinding (with something similar to x86's
UNWIND_HINT, but retained for runtime).

>  
>  #endif	/* __ASM_STACKTRACE_H */
> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
> index d55bdfb7789c..9061375c8785 100644
> --- a/arch/arm64/kernel/stacktrace.c
> +++ b/arch/arm64/kernel/stacktrace.c
> @@ -44,21 +44,29 @@ int notrace unwind_frame(struct task_struct *tsk, struct stackframe *frame)
>  	unsigned long fp = frame->fp;
>  	struct stack_info info;
>  
> +	frame->reliable = true;

I'd prefer to do this the other way around, e.g. here do:

|        /*
|         * Assume that an unwind step is unreliable until it has passed
|         * all relevant checks.
|         */
|        frame->reliable = false;

... then only set this to true once we're certain the step is reliable.

That requires fewer changes below, and would also be more robust as if
we forget to update this we'd accidentally mark an entry as unreliable
rather than accidentally marking it as reliable.
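
As a rough userspace illustration of this default-to-unreliable ordering
(not the kernel code: `sketch_frame`, `sketch_is_text()`, the text range and
the simplified step function are all hypothetical stand-ins), every early
exit leaves the flag false without needing its own assignment:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the kernel's struct stackframe. */
struct sketch_frame {
	uintptr_t fp;
	uintptr_t pc;
	bool reliable;
};

/* Hypothetical text range standing in for __kernel_text_address(). */
#define SKETCH_TEXT_START 0x1000u
#define SKETCH_TEXT_END   0x2000u

static bool sketch_is_text(uintptr_t pc)
{
	return pc >= SKETCH_TEXT_START && pc < SKETCH_TEXT_END;
}

/*
 * One simplified unwind step. Returns 0 to continue, -1 if the next
 * frame record is malformed. The caller supplies the next fp/pc directly
 * instead of reading them from a real stack.
 */
static int sketch_unwind_step(struct sketch_frame *frame,
			      uintptr_t next_fp, uintptr_t next_pc)
{
	/* Assume unreliable until every check has passed. */
	frame->reliable = false;

	/* Misaligned frame pointer: bail out, flag is already false. */
	if (next_fp & 0xf)
		return -1;

	frame->fp = next_fp;
	frame->pc = next_pc;

	/* The step succeeded, but the PC is not known text. */
	if (!sketch_is_text(frame->pc))
		return 0;

	/* Only now, with all checks passed, mark the step reliable. */
	frame->reliable = true;
	return 0;
}
```

A check that is added later and forgets to touch the flag then errs on the
side of "unreliable", which is the safer failure mode.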

> +
>  	/* Terminal record; nothing to unwind */
>  	if (!fp)
>  		return -ENOENT;
>  
> -	if (fp & 0xf)
> +	if (fp & 0xf) {
> +		frame->reliable = false;
>  		return -EINVAL;
> +	}
>  
>  	if (!tsk)
>  		tsk = current;
>  
> -	if (!on_accessible_stack(tsk, fp, &info))
> +	if (!on_accessible_stack(tsk, fp, &info)) {
> +		frame->reliable = false;
>  		return -EINVAL;
> +	}
>  
> -	if (test_bit(info.type, frame->stacks_done))
> +	if (test_bit(info.type, frame->stacks_done)) {
> +		frame->reliable = false;
>  		return -EINVAL;
> +	}
>  
>  	/*
>  	 * As stacks grow downward, any valid record on the same stack must be
> @@ -74,8 +82,10 @@ int notrace unwind_frame(struct task_struct *tsk, struct stackframe *frame)
>  	 * stack.
>  	 */
>  	if (info.type == frame->prev_type) {
> -		if (fp <= frame->prev_fp)
> +		if (fp <= frame->prev_fp) {
> +			frame->reliable = false;
>  			return -EINVAL;
> +		}
>  	} else {
>  		set_bit(frame->prev_type, frame->stacks_done);
>  	}
> @@ -100,14 +110,32 @@ int notrace unwind_frame(struct task_struct *tsk, struct stackframe *frame)
>  		 * So replace it to an original value.
>  		 */
>  		ret_stack = ftrace_graph_get_ret_stack(tsk, frame->graph++);
> -		if (WARN_ON_ONCE(!ret_stack))
> +		if (WARN_ON_ONCE(!ret_stack)) {
> +			frame->reliable = false;
>  			return -EINVAL;
> +		}
>  		frame->pc = ret_stack->ret;
>  	}
>  #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
>  
>  	frame->pc = ptrauth_strip_insn_pac(frame->pc);
>  
> +	/*
> +	 * Check the return PC for conditions that make unwinding unreliable.
> +	 * In each case, mark the stack trace as such.
> +	 */
> +
> +	/*
> +	 * Make sure that the return address is a proper kernel text address.
> +	 * A NULL or invalid return address could mean:
> +	 *
> +	 *	- generated code such as eBPF and optprobe trampolines
> +	 *	- Foreign code (e.g. EFI runtime services)
> +	 *	- Procedure Linkage Table (PLT) entries and veneer functions
> +	 */
> +	if (!__kernel_text_address(frame->pc))
> +		frame->reliable = false;

I don't think we should mention PLTs here. They appear in regular kernel
text, and on arm64 they are generally not problematic for unwinding. The
cases in which they are problematic are those where they interpose a
trampoline call that isn't following the AAPCS (e.g. ftrace calls from a
module, or calls to __hwasan_tag_mismatch generally), and we'll have to
catch those explicitly (or forbid RELIABLE_STACKTRACE with HWASAN).

From a backtrace perspective, the PC itself *is* reliable, but the next
unwind from this frame will not be, so I'd like to mark this as
reliable and the next unwind as unreliable. We can do that with the
UNWIND_TYPE_UNKNOWN suggestion above.

For the comment here, how about:

|	/*
|	 * If the PC is not a known kernel text address, then we cannot
|	 * be sure that a subsequent unwind will be reliable, as we
|	 * don't know that the code follows our unwind requirements.
|	 */
|	if (!__kernel_text_address(frame->pc))
|		frame->unwind = UNWIND_TYPE_UNKNOWN;
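
To make the "this PC is reliable, the next unwind is not" distinction
concrete, here is a minimal userspace sketch. It assumes a simplified model
in which an FP-based step is treated as reliable (the real checks that make
UNWIND_TYPE_FP only "may or may not be reliable" are omitted), and
`typed_frame`, `typed_is_text()` and `typed_unwind_step()` are hypothetical
names, not proposed kernel interfaces:

```c
#include <stdbool.h>
#include <stdint.h>

enum unwind_type {
	UNWIND_TYPE_FP,		/* next frame found via the frame pointer */
	UNWIND_TYPE_REGS_LR,	/* next frame found via the LR in pt_regs */
	UNWIND_TYPE_UNKNOWN,	/* no known way to find the next frame */
};

/* Hypothetical, simplified frame state. */
struct typed_frame {
	uintptr_t pc;
	enum unwind_type unwind;	/* how the *next* frame will be found */
	bool reliable;			/* was the step that produced this frame reliable? */
};

/* Hypothetical text range standing in for __kernel_text_address(). */
static bool typed_is_text(uintptr_t pc)
{
	return pc >= 0x1000u && pc < 0x2000u;
}

/* One simplified step; next_pc is supplied by the caller. */
static void typed_unwind_step(struct typed_frame *frame, uintptr_t next_pc)
{
	/* Reliability of the new frame depends on how we got to it. */
	bool step_reliable = (frame->unwind == UNWIND_TYPE_FP);

	frame->pc = next_pc;
	frame->reliable = step_reliable;

	/* Decide how the following step will be taken. */
	frame->unwind = UNWIND_TYPE_FP;
	if (!typed_is_text(frame->pc))
		frame->unwind = UNWIND_TYPE_UNKNOWN;
}
```

In this model a frame whose PC lies outside known text is still reported as
reliable; it is the frame after it that gets flagged.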

Thanks,
Mark.

>  	return 0;
>  }
>  NOKPROBE_SYMBOL(unwind_frame);
> -- 
> 2.25.1
>
Mark Brown June 24, 2021, 4:03 p.m. UTC | #2
On Thu, Jun 24, 2021 at 03:40:21PM +0100, Mark Rutland wrote:

> regular unwinds (e.g. so that we can have a backtrace indicate when a
> step is not reliable, like x86 does with '?'), and to do that we need to
> be a little more accurate.

There was the idea that was discussed a bit when I was more actively
working on this of just refactoring our unwinder infrastructure to be a
lot more like the x86 and (IIRC) S/390 in form.  Part of the thing there
was that it'd mean that even where we're not able to actually share code
we'd have more of a common baseline for how things work and what works.
It'd make review, especially cross architecture review, of what's going
on a bit easier too - see some of the concerns Josh had about the
differences here for example.  It'd be a relatively big bit of
refactoring though.
Madhavan T. Venkataraman June 25, 2021, 3:39 p.m. UTC | #3
On 6/24/21 9:40 AM, Mark Rutland wrote:
> Hi Madhavan,
> 
> On Wed, May 26, 2021 at 04:49:16PM -0500, madvenka@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>
>> The unwinder should check for the presence of various features and
>> conditions that can render the stack trace unreliable and mark the
>> stack trace as unreliable for the benefit of the caller.
>>
>> Introduce the first reliability check - If a return PC is not a valid
>> kernel text address, consider the stack trace unreliable. It could be
>> some generated code.
>>
>> Other reliability checks will be added in the future.
>>
>> Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
> 
> At a high-level, I'm on-board with keeping track of this per unwind
> step, but if we do that then I want to be able to use this during
> regular unwinds (e.g. so that we can have a backtrace indicate when a
> step is not reliable, like x86 does with '?'), and to do that we need to
> be a little more accurate.
> 

The only consumer of frame->reliable is livepatch. So, in retrospect, my
original per-frame reliability flag was an overkill. I was just trying to
provide extra per-frame debug information which is not really a requirement
for livepatch.

So, let us separate the two. I will rename frame->reliable to frame->livepatch_safe.
This will apply to the whole stacktrace and not to every frame.

Pass a livepatch_safe flag to start_backtrace(). This will be the initial value
of frame->livepatch_safe. So, if the caller knows that the starting frame is
unreliable, he can pass "false" to start_backtrace().

Whenever a reliability check fails, frame->livepatch_safe = false. After that
point, it will remain false till the end of the stacktrace. This keeps it simple.

Also, once livepatch_safe is set to false, further reliability checks will not
be performed (what would be the point?).
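
A minimal userspace sketch of that sticky behaviour, under the assumption
that the flag lives in a per-trace structure (`sticky_trace`,
`sticky_start()`, `sticky_step()` and the text range are hypothetical and
only model the idea):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-trace state; checks_run exists only to observe skipping. */
struct sticky_trace {
	bool livepatch_safe;
	unsigned int checks_run;
};

/* The caller supplies the initial value, as proposed for start_backtrace(). */
static void sticky_start(struct sticky_trace *t, bool livepatch_safe)
{
	t->livepatch_safe = livepatch_safe;
	t->checks_run = 0;
}

/* Hypothetical text range standing in for __kernel_text_address(). */
static bool sticky_pc_ok(uintptr_t pc)
{
	return pc >= 0x1000u && pc < 0x2000u;
}

/* One step: once the flag is false, no further checks are performed. */
static void sticky_step(struct sticky_trace *t, uintptr_t pc)
{
	if (!t->livepatch_safe)
		return;

	t->checks_run++;
	if (!sticky_pc_ok(pc))
		t->livepatch_safe = false;
}
```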

Finally, it might be a good idea to perform reliability checks even in
start_backtrace() so we don't assume that the starting frame is reliable even
if the caller passes livepatch_safe=true. What do you think?

> I think we first need to do some more preparatory work for that, but
> regardless, I have some comments below.
> 

I agree that some more work is required to provide per-frame debug information
and tracking. That can be done later. It is not a requirement for livepatch.

>> ---
>>  arch/arm64/include/asm/stacktrace.h |  9 +++++++
>>  arch/arm64/kernel/stacktrace.c      | 38 +++++++++++++++++++++++++----
>>  2 files changed, 42 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/stacktrace.h b/arch/arm64/include/asm/stacktrace.h
>> index eb29b1fe8255..4c822ef7f588 100644
>> --- a/arch/arm64/include/asm/stacktrace.h
>> +++ b/arch/arm64/include/asm/stacktrace.h
>> @@ -49,6 +49,13 @@ struct stack_info {
>>   *
>>   * @graph:       When FUNCTION_GRAPH_TRACER is selected, holds the index of a
>>   *               replacement lr value in the ftrace graph stack.
>> + *
>> + * @reliable:	Is this stack frame reliable? There are several checks that
>> + *              need to be performed in unwind_frame() before a stack frame
>> + *              is truly reliable. Until all the checks are present, this flag
>> + *              is just a place holder. Once all the checks are implemented,
>> + *              this comment will be updated and the flag can be used by the
>> + *              caller of unwind_frame().
> 
> I'd prefer that we state the high-level semantic first, then drill down
> into detail, e.g.
> 
> | @reliable: Indicates whether this frame is believed to be a reliable
> |            unwinding from the parent stackframe. This may be set
> |            regardless of whether the parent stackframe was reliable.
> |            
> |            This is set only if all the following are true:
> | 
> |            * @pc is a valid text address.
> | 
> |            Note: this is currently incomplete.
> 

I will change the name of the flag. I will change the comment accordingly.

>>   */
>>  struct stackframe {
>>  	unsigned long fp;
>> @@ -59,6 +66,7 @@ struct stackframe {
>>  #ifdef CONFIG_FUNCTION_GRAPH_TRACER
>>  	int graph;
>>  #endif
>> +	bool reliable;
>>  };
>>  
>>  extern int unwind_frame(struct task_struct *tsk, struct stackframe *frame);
>> @@ -169,6 +177,7 @@ static inline void start_backtrace(struct stackframe *frame,
>>  	bitmap_zero(frame->stacks_done, __NR_STACK_TYPES);
>>  	frame->prev_fp = 0;
>>  	frame->prev_type = STACK_TYPE_UNKNOWN;
>> +	frame->reliable = true;
>>  }
> 
> I think we need more data than this to be accurate.
> 
> Consider arch_stack_walk() starting from a pt_regs -- the initial state
> (the PC from the regs) is accurate, but the first unwind from that will
> not be, and we don't account for that at all.
> 
> I think we need to capture an unwind type in struct stackframe, which we
> can pass into start_backtrace(), e.g.
> 

> | enum unwind_type {
> |         /*
> |          * The next frame is indicated by the frame pointer.
> |          * The next unwind may or may not be reliable.
> |          */
> |         UNWIND_TYPE_FP,
> | 
> |         /*
> |          * The next frame is indicated by the LR in pt_regs.
> |          * The next unwind is not reliable.
> |          */
> |         UNWIND_TYPE_REGS_LR,
> | 
> |         /*
> |          * We do not know how to unwind to the next frame.
> |          * The next unwind is not reliable.
> |          */
> |         UNWIND_TYPE_UNKNOWN
> | };
> 
> That should be simple enough to set up around start_backtrace(), but
> we'll need further rework to make that simple at exception boundaries.
> With the entry rework I have queued for v5.14, we're *almost* down to a
> single asm<->c transition point for all vectors, and I'm hoping to
> factor the remainder out to C for v5.15, whereupon we can annotate that
> BL with some metadata for unwinding (with something similar to x86's
> UNWIND_HINT, but retained for runtime).
> 

I understood UNWIND_TYPE_FP and UNWIND_TYPE_REGS_LR. When would UNWIND_TYPE_UNKNOWN
be passed to start_backtrace? Could you elaborate?

Regardless, the above comment applies only to per-frame tracking when it is eventually
implemented. For livepatch, it is not needed. At exception boundaries, if stack metadata
is available, then use that to unwind safely. Else, livepatch_safe = false. The latter
is what is being done in my patch series. So, we can go with that until stack metadata
becomes available.

For the UNWIND_TYPE_REGS_LR and UNWIND_TYPE_UNKNOWN cases, the caller will
pass livepatch_safe=false to start_backtrace(). For UNWIND_TYPE_FP, the caller will
pass livepatch_safe=true. So, only UNWIND_TYPE_FP matters for livepatch.

>>  
>>  #endif	/* __ASM_STACKTRACE_H */
>> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
>> index d55bdfb7789c..9061375c8785 100644
>> --- a/arch/arm64/kernel/stacktrace.c
>> +++ b/arch/arm64/kernel/stacktrace.c
>> @@ -44,21 +44,29 @@ int notrace unwind_frame(struct task_struct *tsk, struct stackframe *frame)
>>  	unsigned long fp = frame->fp;
>>  	struct stack_info info;
>>  
>> +	frame->reliable = true;
> 
> I'd prefer to do this the other way around, e.g. here do:
> 
> |        /*
> |         * Assume that an unwind step is unreliable until it has passed
> |         * all relevant checks.
> |         */
> |        frame->reliable = false;
> 
> ... then only set this to true once we're certain the step is reliable.
> 
> That requires fewer changes below, and would also be more robust as if
> we forget to update this we'd accidentally mark an entry as unreliable
> rather than accidentally marking it as reliable.
> 

For livepatch_safe, the initial statement setting it to true at the
beginning of unwind_frame() goes away. But whenever a reliability check fails,
livepatch_safe has to be set to false.

>> +
>>  	/* Terminal record; nothing to unwind */
>>  	if (!fp)
>>  		return -ENOENT;
>>  
>> -	if (fp & 0xf)
>> +	if (fp & 0xf) {
>> +		frame->reliable = false;
>>  		return -EINVAL;
>> +	}
>>  
>>  	if (!tsk)
>>  		tsk = current;
>>  
>> -	if (!on_accessible_stack(tsk, fp, &info))
>> +	if (!on_accessible_stack(tsk, fp, &info)) {
>> +		frame->reliable = false;
>>  		return -EINVAL;
>> +	}
>>  
>> -	if (test_bit(info.type, frame->stacks_done))
>> +	if (test_bit(info.type, frame->stacks_done)) {
>> +		frame->reliable = false;
>>  		return -EINVAL;
>> +	}
>>  
>>  	/*
>>  	 * As stacks grow downward, any valid record on the same stack must be
>> @@ -74,8 +82,10 @@ int notrace unwind_frame(struct task_struct *tsk, struct stackframe *frame)
>>  	 * stack.
>>  	 */
>>  	if (info.type == frame->prev_type) {
>> -		if (fp <= frame->prev_fp)
>> +		if (fp <= frame->prev_fp) {
>> +			frame->reliable = false;
>>  			return -EINVAL;
>> +		}
>>  	} else {
>>  		set_bit(frame->prev_type, frame->stacks_done);
>>  	}
>> @@ -100,14 +110,32 @@ int notrace unwind_frame(struct task_struct *tsk, struct stackframe *frame)
>>  		 * So replace it to an original value.
>>  		 */
>>  		ret_stack = ftrace_graph_get_ret_stack(tsk, frame->graph++);
>> -		if (WARN_ON_ONCE(!ret_stack))
>> +		if (WARN_ON_ONCE(!ret_stack)) {
>> +			frame->reliable = false;
>>  			return -EINVAL;
>> +		}
>>  		frame->pc = ret_stack->ret;
>>  	}
>>  #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
>>  
>>  	frame->pc = ptrauth_strip_insn_pac(frame->pc);
>>  
>> +	/*
>> +	 * Check the return PC for conditions that make unwinding unreliable.
>> +	 * In each case, mark the stack trace as such.
>> +	 */
>> +
>> +	/*
>> +	 * Make sure that the return address is a proper kernel text address.
>> +	 * A NULL or invalid return address could mean:
>> +	 *
>> +	 *	- generated code such as eBPF and optprobe trampolines
>> +	 *	- Foreign code (e.g. EFI runtime services)
>> +	 *	- Procedure Linkage Table (PLT) entries and veneer functions
>> +	 */
>> +	if (!__kernel_text_address(frame->pc))
>> +		frame->reliable = false;
> 
> I don't think we should mention PLTs here. They appear in regular kernel
> text, and on arm64 they are generally not problematic for unwinding. The
> cases in which they are problematic are those where they interpose a
> trampoline call that isn't following the AAPCS (e.g. ftrace calls from a
> module, or calls to __hwasan_tag_mismatch generally), and we'll have to
> catch those explicitly (or forbid RELIABLE_STACKTRACE with HWASAN).
> 

I will remove the mention of PLTs.

> From a backtrace perspective, the PC itself *is* reliable, but the next
> unwind from this frame will not be, so I'd like to mark this as
> reliable and the next unwind as unreliable. We can do that with the
> UNWIND_TYPE_UNKNOWN suggestion above.
> 

In the livepatch_safe approach, it can be set to false as soon as the unwinder
realizes that there is unreliability, even if the unreliability is in the next
frame. Actually, this would avoid one extra unwind step for livepatch.

> For the comment here, how about:
> 
> |	/*
> |	 * If the PC is not a known kernel text address, then we cannot
> |	 * be sure that a subsequent unwind will be reliable, as we
> |	 * don't know that the code follows our unwind requirements.
> |	 */
> |	if (!__kernel_text_address(frame->pc))
> |		frame->unwind = UNWIND_TYPE_UNKNOWN;
> 

OK. I can change the comment.

Thanks!

Madhavan
Mark Brown June 25, 2021, 3:51 p.m. UTC | #4
On Fri, Jun 25, 2021 at 10:39:57AM -0500, Madhavan T. Venkataraman wrote:
> On 6/24/21 9:40 AM, Mark Rutland wrote:

> > At a high-level, I'm on-board with keeping track of this per unwind
> > step, but if we do that then I want to be able to use this during
> > regular unwinds (e.g. so that we can have a backtrace indicate when a
> > step is not reliable, like x86 does with '?'), and to do that we need to
> > be a little more accurate.

> The only consumer of frame->reliable is livepatch. So, in retrospect, my
> original per-frame reliability flag was an overkill. I was just trying to
> provide extra per-frame debug information which is not really a requirement
> for livepatch.

It's not a requirement for livepatch, but if it's there a per-frame
reliability flag would have other uses. For example, Mark has mentioned
the way x86 prints a '?' next to unreliable entries in oops output;
that'd be handy for people debugging issues and would have the
added bonus of ensuring that there's more constant and widespread
exercising of the reliability stuff than if it's just used for livepatch,
which is a bit niche.

> So, let us separate the two. I will rename frame->reliable to frame->livepatch_safe.
> This will apply to the whole stacktrace and not to every frame.

I'd rather keep it as reliable, even with only the livepatch usage I
think it's clearer.

> Finally, it might be a good idea to perform reliability checks even in
> start_backtrace() so we don't assume that the starting frame is reliable even
> if the caller passes livepatch_safe=true. What do you think?

That makes sense to me.
Madhavan T. Venkataraman June 25, 2021, 5:05 p.m. UTC | #5
On 6/25/21 10:51 AM, Mark Brown wrote:
> On Fri, Jun 25, 2021 at 10:39:57AM -0500, Madhavan T. Venkataraman wrote:
>> On 6/24/21 9:40 AM, Mark Rutland wrote:
> 
>>> At a high-level, I'm on-board with keeping track of this per unwind
>>> step, but if we do that then I want to be able to use this during
>>> regular unwinds (e.g. so that we can have a backtrace indicate when a
>>> step is not reliable, like x86 does with '?'), and to do that we need to
>>> be a little more accurate.
> 
>> The only consumer of frame->reliable is livepatch. So, in retrospect, my
>> original per-frame reliability flag was an overkill. I was just trying to
>> provide extra per-frame debug information which is not really a requirement
>> for livepatch.
> 
> It's not a requirement for livepatch, but if it's there a per-frame
> reliability flag would have other uses. For example, Mark has mentioned
> the way x86 prints a '?' next to unreliable entries in oops output;
> that'd be handy for people debugging issues and would have the
> added bonus of ensuring that there's more constant and widespread
> exercising of the reliability stuff than if it's just used for livepatch,
> which is a bit niche.
> 

I agree. That is why I introduced the per-frame flag.

So, let us try a different approach.

First, let us get rid of the frame->reliable flag from this patch series. That flag
can be implemented when all of the pieces are in place for per-frame debug and tracking.

For consumers such as livepatch that don't really care about per-frame stuff, let us
solve it more cleanly via the return value of unwind_frame().

Currently, the return value from unwind_frame() is a tri-state return value which is
somewhat confusing.

	0	means continue unwinding
	-error	means stop unwinding. However,
			-ENOENT means successful termination
			Other values mean an error has happened.

Instead, let unwind_frame() return one of 3 values:

enum {
	UNWIND_CONTINUE,
	UNWIND_CONTINUE_WITH_ERRORS,
	UNWIND_STOP,
};

All consumers will stop unwinding upon seeing UNWIND_STOP.

Livepatch type consumers will stop unwinding upon seeing anything other than UNWIND_CONTINUE.

Debug type consumers can choose to continue upon seeing UNWIND_CONTINUE_WITH_ERRORS.

When we eventually implement per-frame stuff, debug consumers can examine the
frame for more information when they see UNWIND_CONTINUE_WITH_ERRORS.

This way, my patch series does not have a dependency on the per-frame enhancements.

>> So, let us separate the two. I will rename frame->reliable to frame->livepatch_safe.
>> This will apply to the whole stacktrace and not to every frame.
> 
> I'd rather keep it as reliable, even with only the livepatch usage I
> think it's clearer.
> 

See suggestion above.

>> Finally, it might be a good idea to perform reliability checks even in
>> start_backtrace() so we don't assume that the starting frame is reliable even
>> if the caller passes livepatch_safe=true. What do you think?
> 
> That makes sense to me.
> 

Thanks.

Madhavan
Madhavan T. Venkataraman June 25, 2021, 5:18 p.m. UTC | #6
On 6/25/21 12:05 PM, Madhavan T. Venkataraman wrote:
> 
> 
> On 6/25/21 10:51 AM, Mark Brown wrote:
>> On Fri, Jun 25, 2021 at 10:39:57AM -0500, Madhavan T. Venkataraman wrote:
>>> On 6/24/21 9:40 AM, Mark Rutland wrote:
>>
>>>> At a high-level, I'm on-board with keeping track of this per unwind
>>>> step, but if we do that then I want to be able to use this during
>>>> regular unwinds (e.g. so that we can have a backtrace indicate when a
>>>> step is not reliable, like x86 does with '?'), and to do that we need to
>>>> be a little more accurate.
>>
>>> The only consumer of frame->reliable is livepatch. So, in retrospect, my
>>> original per-frame reliability flag was an overkill. I was just trying to
>>> provide extra per-frame debug information which is not really a requirement
>>> for livepatch.
>>
>> It's not a requirement for livepatch, but if it's there a per-frame
>> reliability flag would have other uses. For example, Mark has mentioned
>> the way x86 prints a '?' next to unreliable entries in oops output;
>> that'd be handy for people debugging issues and would have the
>> added bonus of ensuring that there's more constant and widespread
>> exercising of the reliability stuff than if it's just used for livepatch,
>> which is a bit niche.
>>
> 
> I agree. That is why I introduced the per-frame flag.
> 
> So, let us try a different approach.
> 
> First, let us get rid of the frame->reliable flag from this patch series. That flag
> can be implemented when all of the pieces are in place for per-frame debug and tracking.
> 
> For consumers such as livepatch that don't really care about per-frame stuff, let us
> solve it more cleanly via the return value of unwind_frame().
> 
> Currently, the return value from unwind_frame() is a tri-state return value which is
> somewhat confusing.
> 
> 	0	means continue unwinding
> 	-error	means stop unwinding. However,
> 			-ENOENT means successful termination
> 			Other values mean an error has happened.
> 
> Instead, let unwind_frame() return one of 3 values:
> 
> enum {
> 	UNWIND_CONTINUE,
> 	UNWIND_CONTINUE_WITH_ERRORS,
> 	UNWIND_STOP,
> };
> 

Sorry. I need to add one more value to this. So, the enum will be:

enum {
	UNWIND_CONTINUE,
	UNWIND_CONTINUE_WITH_ERRORS,
	UNWIND_STOP,
	UNWIND_STOP_WITH_ERRORS,
};

UNWIND_CONTINUE (what used to be a return value of 0)
	Continue with the unwind.

UNWIND_CONTINUE_WITH_ERRORS (new return value)
	Errors encountered. But the errors are not fatal errors like stack corruption.

UNWIND_STOP (what used to be -ENOENT)
	Successful termination of unwind.

UNWIND_STOP_WITH_ERRORS (what used to be -EINVAL, etc)
	Unsuccessful termination.

Sorry I missed this the last time.

So, to reiterate:

All consumers will stop unwinding when they see UNWIND_STOP and UNWIND_STOP_WITH_ERRORS.

Debug type consumers can choose to continue when they see UNWIND_CONTINUE_WITH_ERRORS.

Livepatch type consumers will only continue on UNWIND_CONTINUE.

This way, my patch series does not have a dependency on the per-frame enhancements.
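
To illustrate the proposal, here is a small userspace sketch of how the old
return values could map onto the four states and how the two kinds of
consumer would react. This is only a model under stated assumptions: the
helper names (`unwind_result_from_errno()`, `walk_livepatch()`,
`walk_debug()`) are hypothetical, and the walkers replay a pre-recorded
array of results rather than calling a real unwind_frame().

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum unwind_result {
	UNWIND_CONTINUE,
	UNWIND_CONTINUE_WITH_ERRORS,
	UNWIND_STOP,
	UNWIND_STOP_WITH_ERRORS,
};

/* Map the existing 0 / -ENOENT / other -errno convention onto the enum. */
static enum unwind_result unwind_result_from_errno(int ret)
{
	if (ret == 0)
		return UNWIND_CONTINUE;
	if (ret == -ENOENT)
		return UNWIND_STOP;
	return UNWIND_STOP_WITH_ERRORS;
}

/*
 * Livepatch-type consumer: the trace is usable only if every step was
 * UNWIND_CONTINUE and the walk ended with a clean UNWIND_STOP.
 */
static bool walk_livepatch(const enum unwind_result *steps, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (steps[i] == UNWIND_STOP)
			return true;
		if (steps[i] != UNWIND_CONTINUE)
			return false;
	}
	return false;	/* ran out of steps without a clean termination */
}

/*
 * Debug-type consumer: keeps going past non-fatal errors and reports how
 * many frames it visited and whether any error was seen.
 */
static size_t walk_debug(const enum unwind_result *steps, size_t n,
			 bool *saw_errors)
{
	size_t frames = 0;

	*saw_errors = false;
	for (size_t i = 0; i < n; i++) {
		switch (steps[i]) {
		case UNWIND_CONTINUE:
			frames++;
			break;
		case UNWIND_CONTINUE_WITH_ERRORS:
			frames++;
			*saw_errors = true;
			break;
		case UNWIND_STOP:
			return frames;
		case UNWIND_STOP_WITH_ERRORS:
			*saw_errors = true;
			return frames;
		}
	}
	return frames;
}
```

With these policies, a walk that contains a single
UNWIND_CONTINUE_WITH_ERRORS is rejected by the livepatch-type consumer but
still yields a full (flagged) trace for the debug-type consumer.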

Thanks!

Madhavan
Madhavan T. Venkataraman June 26, 2021, 3:35 p.m. UTC | #7
I will send out the next version without frame->reliable as implementing
a per-frame reliability thing obviously needs other changes and needs
Mark Rutland's code reorg.

Thanks!

Madhavan

On 6/25/21 10:39 AM, Madhavan T. Venkataraman wrote:
> 
> 
> On 6/24/21 9:40 AM, Mark Rutland wrote:
>> Hi Madhavan,
>>
>> On Wed, May 26, 2021 at 04:49:16PM -0500, madvenka@linux.microsoft.com wrote:
>>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>>
>>> The unwinder should check for the presence of various features and
>>> conditions that can render the stack trace unreliable and mark the
>>> stack trace as unreliable for the benefit of the caller.
>>>
>>> Introduce the first reliability check - If a return PC is not a valid
>>> kernel text address, consider the stack trace unreliable. It could be
>>> some generated code.
>>>
>>> Other reliability checks will be added in the future.
>>>
>>> Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
>>
>> At a high-level, I'm on-board with keeping track of this per unwind
>> step, but if we do that then I want to be able to use this during
>> regular unwinds (e.g. so that we can have a backtrace indicate when a
>> step is not reliable, like x86 does with '?'), and to do that we need to
>> be a little more accurate.
>>
> 
> The only consumer of frame->reliable is livepatch. So, in retrospect, my
> original per-frame reliability flag was an overkill. I was just trying to
> provide extra per-frame debug information which is not really a requirement
> for livepatch.
> 
> So, let us separate the two. I will rename frame->reliable to frame->livepatch_safe.
> This will apply to the whole stacktrace and not to every frame.
> 
> Pass a livepatch_safe flag to start_backtrace(). This will be the initial value
> of frame->livepatch_safe. So, if the caller knows that the starting frame is
> unreliable, he can pass "false" to start_backtrace().
> 
> Whenever a reliability check fails, frame->livepatch_safe = false. After that
> point, it will remain false till the end of the stacktrace. This keeps it simple.
> 
> Also, once livepatch_safe is set to false, further reliability checks will not
> be performed (what would be the point?).
> 
> Finally, it might be a good idea to perform reliability checks even in
> start_backtrace() so we don't assume that the starting frame is reliable even
> if the caller passes livepatch_safe=true. What do you think?
> 
>> I think we first need to do some more preparatory work for that, but
>> regardless, I have some comments below.
>>
> 
> I agree that some more work is required to provide per-frame debug information
> and tracking. That can be done later. It is not a requirement for livepatch.
> 
>>> ---
>>>  arch/arm64/include/asm/stacktrace.h |  9 +++++++
>>>  arch/arm64/kernel/stacktrace.c      | 38 +++++++++++++++++++++++++----
>>>  2 files changed, 42 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/stacktrace.h b/arch/arm64/include/asm/stacktrace.h
>>> index eb29b1fe8255..4c822ef7f588 100644
>>> --- a/arch/arm64/include/asm/stacktrace.h
>>> +++ b/arch/arm64/include/asm/stacktrace.h
>>> @@ -49,6 +49,13 @@ struct stack_info {
>>>   *
>>>   * @graph:       When FUNCTION_GRAPH_TRACER is selected, holds the index of a
>>>   *               replacement lr value in the ftrace graph stack.
>>> + *
>>> + * @reliable:	Is this stack frame reliable? There are several checks that
>>> + *              need to be performed in unwind_frame() before a stack frame
>>> + *              is truly reliable. Until all the checks are present, this flag
>>> + *              is just a place holder. Once all the checks are implemented,
>>> + *              this comment will be updated and the flag can be used by the
>>> + *              caller of unwind_frame().
>>
>> I'd prefer that we state the high-level semantic first, then drill down
>> into detail, e.g.
>>
>> | @reliable: Indicates whether this frame is believed to be a reliable
>> |            unwinding from the parent stackframe. This may be set
>> |            regardless of whether the parent stackframe was reliable.
>> |            
>> |            This is set only if all the following are true:
>> | 
>> |            * @pc is a valid text address.
>> | 
>> |            Note: this is currently incomplete.
>>
> 
> I will change the name of the flag. I will change the comment accordingly.
> 
>>>   */
>>>  struct stackframe {
>>>  	unsigned long fp;
>>> @@ -59,6 +66,7 @@ struct stackframe {
>>>  #ifdef CONFIG_FUNCTION_GRAPH_TRACER
>>>  	int graph;
>>>  #endif
>>> +	bool reliable;
>>>  };
>>>  
>>>  extern int unwind_frame(struct task_struct *tsk, struct stackframe *frame);
>>> @@ -169,6 +177,7 @@ static inline void start_backtrace(struct stackframe *frame,
>>>  	bitmap_zero(frame->stacks_done, __NR_STACK_TYPES);
>>>  	frame->prev_fp = 0;
>>>  	frame->prev_type = STACK_TYPE_UNKNOWN;
>>> +	frame->reliable = true;
>>>  }
>>
>> I think we need more data than this to be accurate.
>>
>> Consider arch_stack_walk() starting from a pt_regs -- the initial state
>> (the PC from the regs) is accurate, but the first unwind from that will
>> not be, and we don't account for that at all.
>>
>> I think we need to capture an unwind type in struct stackframe, which we
>> can pass into start_backtrace(), e.g.
>>
> 
>> | enum unwind_type {
>> |         /*
>> |          * The next frame is indicated by the frame pointer.
>> |          * The next unwind may or may not be reliable.
>> |          */
>> |         UNWIND_TYPE_FP,
>> | 
>> |         /*
>> |          * The next frame is indicated by the LR in pt_regs.
>> |          * The next unwind is not reliable.
>> |          */
>> |         UNWIND_TYPE_REGS_LR,
>> | 
>> |         /*
>> |          * We do not know how to unwind to the next frame.
>> |          * The next unwind is not reliable.
>> |          */
>> |         UNWIND_TYPE_UNKNOWN
>> | };
>>
>> That should be simple enough to set up around start_backtrace(), but
>> we'll need further rework to make that simple at exception boundaries.
>> With the entry rework I have queued for v5.14, we're *almost* down to a
>> single asm<->c transition point for all vectors, and I'm hoping to
>> factor the remainder out to C for v5.15, whereupon we can annotate that
>> BL with some metadata for unwinding (with something similar to x86's
>> UNWIND_HINT, but retained for runtime).
>>
> 
> I understood UNWIND_TYPE_FP and UNWIND_TYPE_REGS_LR. When would UNWIND_TYPE_UNKNOWN
> be passed to start_backtrace? Could you elaborate?
> 
> Regardless, the above comment applies only to per-frame tracking when it is eventually
> implemented. For livepatch, it is not needed. At exception boundaries, if stack metadata
> is available, then use that to unwind safely. Else, livepatch_safe = false. The latter
> is what is being done in my patch series. So, we can go with that until stack metadata
> becomes available.
> 
> For the UNWIND_TYPE_REGS_LR and UNWIND_TYPE_UNKNOWN cases, the caller will
> pass livepatch_safe=false to start_backtrace(). For UNWIND_TYPE_FP, the caller will
> pass livepatch_safe=true. So, only UNWIND_TYPE_FP matters for livepatch.
> 
>>>  
>>>  #endif	/* __ASM_STACKTRACE_H */
>>> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
>>> index d55bdfb7789c..9061375c8785 100644
>>> --- a/arch/arm64/kernel/stacktrace.c
>>> +++ b/arch/arm64/kernel/stacktrace.c
>>> @@ -44,21 +44,29 @@ int notrace unwind_frame(struct task_struct *tsk, struct stackframe *frame)
>>>  	unsigned long fp = frame->fp;
>>>  	struct stack_info info;
>>>  
>>> +	frame->reliable = true;
>>
>> I'd prefer to do this the other way around, e.g. here do:
>>
>> |        /*
>> |         * Assume that an unwind step is unreliable until it has passed
>> |         * all relevant checks.
>> |         */
>> |        frame->reliable = false;
>>
>> ... then only set this to true once we're certain the step is reliable.
>>
>> That requires fewer changes below, and would also be more robust as if
>> we forget to update this we'd accidentally mark an entry as unreliable
>> rather than accidentally marking it as reliable.
>>
> 
> For livepatch_safe, the initial statement setting it to true at the
> beginning of unwind_frame() goes away. But whenever a reliability check fails,
> livepatch_safe has to be set to false.
> 
>>> +
>>>  	/* Terminal record; nothing to unwind */
>>>  	if (!fp)
>>>  		return -ENOENT;
>>>  
>>> -	if (fp & 0xf)
>>> +	if (fp & 0xf) {
>>> +		frame->reliable = false;
>>>  		return -EINVAL;
>>> +	}
>>>  
>>>  	if (!tsk)
>>>  		tsk = current;
>>>  
>>> -	if (!on_accessible_stack(tsk, fp, &info))
>>> +	if (!on_accessible_stack(tsk, fp, &info)) {
>>> +		frame->reliable = false;
>>>  		return -EINVAL;
>>> +	}
>>>  
>>> -	if (test_bit(info.type, frame->stacks_done))
>>> +	if (test_bit(info.type, frame->stacks_done)) {
>>> +		frame->reliable = false;
>>>  		return -EINVAL;
>>> +	}
>>>  
>>>  	/*
>>>  	 * As stacks grow downward, any valid record on the same stack must be
>>> @@ -74,8 +82,10 @@ int notrace unwind_frame(struct task_struct *tsk, struct stackframe *frame)
>>>  	 * stack.
>>>  	 */
>>>  	if (info.type == frame->prev_type) {
>>> -		if (fp <= frame->prev_fp)
>>> +		if (fp <= frame->prev_fp) {
>>> +			frame->reliable = false;
>>>  			return -EINVAL;
>>> +		}
>>>  	} else {
>>>  		set_bit(frame->prev_type, frame->stacks_done);
>>>  	}
>>> @@ -100,14 +110,32 @@ int notrace unwind_frame(struct task_struct *tsk, struct stackframe *frame)
>>>  		 * So replace it to an original value.
>>>  		 */
>>>  		ret_stack = ftrace_graph_get_ret_stack(tsk, frame->graph++);
>>> -		if (WARN_ON_ONCE(!ret_stack))
>>> +		if (WARN_ON_ONCE(!ret_stack)) {
>>> +			frame->reliable = false;
>>>  			return -EINVAL;
>>> +		}
>>>  		frame->pc = ret_stack->ret;
>>>  	}
>>>  #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
>>>  
>>>  	frame->pc = ptrauth_strip_insn_pac(frame->pc);
>>>  
>>> +	/*
>>> +	 * Check the return PC for conditions that make unwinding unreliable.
>>> +	 * In each case, mark the stack trace as such.
>>> +	 */
>>> +
>>> +	/*
>>> +	 * Make sure that the return address is a proper kernel text address.
>>> +	 * A NULL or invalid return address could mean:
>>> +	 *
>>> +	 *	- generated code such as eBPF and optprobe trampolines
>>> +	 *	- Foreign code (e.g. EFI runtime services)
>>> +	 *	- Procedure Linkage Table (PLT) entries and veneer functions
>>> +	 */
>>> +	if (!__kernel_text_address(frame->pc))
>>> +		frame->reliable = false;
>>
>> I don't think we should mention PLTs here. They appear in regular kernel
>> text, and on arm64 they are generally not problematic for unwinding. The
>> cases in which they are problematic are where they interpose a
>> trampoline call that isn't following the AAPCS (e.g. ftrace calls from a
>> module, or calls to __hwasan_tag_mismatch generally), and we'll have to
>> catch those explicitly (or forbid RELIABLE_STACKTRACE with HWASAN).
>>
> 
> I will remove the mention of PLTs.
> 
>> From a backtrace perspective, the PC itself *is* reliable, but the next
>> unwind from this frame will not be, so I'd like to mark this as
>> reliable and the next unwind as unreliable. We can do that with the
>> UNWIND_TYPE_UNKNOWN suggestion above.
>>
> 
> In the livepatch_safe approach, it can be set to false as soon as the unwinder
> realizes that there is unreliability, even if the unreliability is in the next
> frame. Actually, this would avoid one extra unwind step for livepatch.
> 
>> For the comment here, how about:
>>
>> |	/*
>> |	 * If the PC is not a known kernel text address, then we cannot
>> |	 * be sure that a subsequent unwind will be reliable, as we
>> |	 * don't know that the code follows our unwind requirements.
>> |	 */
>> |	if (!__kernel_text_address(frame->pc))
>> |		frame->unwind = UNWIND_TYPE_UNKNOWN;
>>
> 
> OK. I can change the comment.
> 
> Thanks!
> 
> Madhavan
>
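
For readers following the livepatch_safe proposal quoted above, here is a
minimal user-space sketch of the "sticky flag" behaviour it describes. It is
not kernel code and not part of the series: struct model_frame,
model_text_address(), model_start() and model_step() are hypothetical names,
and the fixed address range merely stands in for __kernel_text_address().
The starting PC is deliberately not checked, since whether start_backtrace()
should do that is still an open question in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct stackframe; only the fields used here. */
struct model_frame {
	unsigned long pc;
	bool livepatch_safe;	/* sticky: once false, it stays false */
};

/* Assumption: a fixed fake text range replaces __kernel_text_address(). */
static bool model_text_address(unsigned long pc)
{
	return pc >= 0x1000UL && pc < 0x2000UL;
}

/* The caller supplies the initial value, as proposed for start_backtrace(). */
static void model_start(struct model_frame *frame, unsigned long pc,
			bool livepatch_safe)
{
	assert(frame != NULL);
	frame->pc = pc;
	frame->livepatch_safe = livepatch_safe;
}

/* One unwind step to next_pc; checks are skipped once the flag is false. */
static void model_step(struct model_frame *frame, unsigned long next_pc)
{
	assert(frame != NULL);
	frame->pc = next_pc;

	if (!frame->livepatch_safe)
		return;		/* already unreliable: no further checks */

	if (!model_text_address(frame->pc))
		frame->livepatch_safe = false;
}
```

A caller that starts from pt_regs would pass false to model_start(), and the
flag then never becomes true again no matter how many valid text addresses
follow.
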
Josh Poimboeuf June 29, 2021, 4:47 p.m. UTC | #8
On Thu, Jun 24, 2021 at 03:40:21PM +0100, Mark Rutland wrote:
> Hi Madhavan,
> 
> On Wed, May 26, 2021 at 04:49:16PM -0500, madvenka@linux.microsoft.com wrote:
> > From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
> > 
> > The unwinder should check for the presence of various features and
> > conditions that can render the stack trace unreliable and mark the
> > stack trace as unreliable for the benefit of the caller.
> > 
> > Introduce the first reliability check - If a return PC is not a valid
> > kernel text address, consider the stack trace unreliable. It could be
> > some generated code.
> > 
> > Other reliability checks will be added in the future.
> > 
> > Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
> 
> At a high-level, I'm on-board with keeping track of this per unwind
> step, but if we do that then I want to be able to use this during
> regular unwinds (e.g. so that we can have a backtrace indicate when a
> step is not reliable, like x86 does with '?'), and to do that we need to
> be a little more accurate.

On x86, the '?' entries don't come from the unwinder's determination of
whether a frame is reliable.  (And the x86 unwinder doesn't track
reliable-ness on a per-frame basis anyway; it keeps a per-unwind global
error state.)

The stack dumping code blindly scans the stack for kernel text
addresses, in lockstep with calls to the unwinder.  Any text addresses
which aren't also reported by the unwinder are prepended with '?'.

The point is two-fold:

  a) failsafe in case the unwinder fails or skips a frame;

  b) showing of breadcrumbs from previous execution contexts which can
     help the debugging of more difficult scenarios.
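
The lockstep scan described above can be modelled roughly as follows. This
is a simplified user-space sketch, not the x86 implementation: the stack is
a plain array of words, the unwinder's output is assumed to be a list of
stack slot indexes in ascending order, and model_text_address() and
model_scan() are hypothetical names with a made-up text range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: a fixed fake text range stands in for the kernel's check. */
static bool model_text_address(unsigned long addr)
{
	return addr >= 0x1000UL && addr < 0x2000UL;
}

/*
 * Scan stack[0..nwords) and write one marker per word into out[]:
 *   ' '  not a text address, so the dump skips it
 *   'R'  text address that the unwinder also reported
 *   '?'  text address found only by the blind scan
 * ret_slots[] lists, in ascending order, the slot indexes reported by the
 * (hypothetical) unwinder. Returns the number of '?' entries.
 */
static size_t model_scan(const unsigned long *stack, size_t nwords,
			 const size_t *ret_slots, size_t nslots, char *out)
{
	size_t next = 0;	/* next unwinder-reported slot to match */
	size_t guesses = 0;

	assert(stack != NULL && out != NULL);

	for (size_t i = 0; i < nwords; i++) {
		/* Drop unwinder entries the scan has already passed. */
		while (next < nslots && ret_slots[next] < i)
			next++;

		if (!model_text_address(stack[i])) {
			out[i] = ' ';
			continue;
		}

		if (next < nslots && ret_slots[next] == i) {
			out[i] = 'R';
			next++;		/* advance the unwinder in lockstep */
		} else {
			out[i] = '?';
			guesses++;
		}
	}
	return guesses;
}
```

With a five-word stack where the unwinder reports slots 1 and 4, a stale
text address left in slot 2 would be the only entry printed with '?'.
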
Patch

diff --git a/arch/arm64/include/asm/stacktrace.h b/arch/arm64/include/asm/stacktrace.h
index eb29b1fe8255..4c822ef7f588 100644
--- a/arch/arm64/include/asm/stacktrace.h
+++ b/arch/arm64/include/asm/stacktrace.h
@@ -49,6 +49,13 @@  struct stack_info {
  *
  * @graph:       When FUNCTION_GRAPH_TRACER is selected, holds the index of a
  *               replacement lr value in the ftrace graph stack.
+ *
+ * @reliable:	Is this stack frame reliable? There are several checks that
+ *              need to be performed in unwind_frame() before a stack frame
+ *              is truly reliable. Until all the checks are present, this flag
+ *              is just a place holder. Once all the checks are implemented,
+ *              this comment will be updated and the flag can be used by the
+ *              caller of unwind_frame().
  */
 struct stackframe {
 	unsigned long fp;
@@ -59,6 +66,7 @@  struct stackframe {
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
 	int graph;
 #endif
+	bool reliable;
 };
 
 extern int unwind_frame(struct task_struct *tsk, struct stackframe *frame);
@@ -169,6 +177,7 @@  static inline void start_backtrace(struct stackframe *frame,
 	bitmap_zero(frame->stacks_done, __NR_STACK_TYPES);
 	frame->prev_fp = 0;
 	frame->prev_type = STACK_TYPE_UNKNOWN;
+	frame->reliable = true;
 }
 
 #endif	/* __ASM_STACKTRACE_H */
diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
index d55bdfb7789c..9061375c8785 100644
--- a/arch/arm64/kernel/stacktrace.c
+++ b/arch/arm64/kernel/stacktrace.c
@@ -44,21 +44,29 @@  int notrace unwind_frame(struct task_struct *tsk, struct stackframe *frame)
 	unsigned long fp = frame->fp;
 	struct stack_info info;
 
+	frame->reliable = true;
+
 	/* Terminal record; nothing to unwind */
 	if (!fp)
 		return -ENOENT;
 
-	if (fp & 0xf)
+	if (fp & 0xf) {
+		frame->reliable = false;
 		return -EINVAL;
+	}
 
 	if (!tsk)
 		tsk = current;
 
-	if (!on_accessible_stack(tsk, fp, &info))
+	if (!on_accessible_stack(tsk, fp, &info)) {
+		frame->reliable = false;
 		return -EINVAL;
+	}
 
-	if (test_bit(info.type, frame->stacks_done))
+	if (test_bit(info.type, frame->stacks_done)) {
+		frame->reliable = false;
 		return -EINVAL;
+	}
 
 	/*
 	 * As stacks grow downward, any valid record on the same stack must be
@@ -74,8 +82,10 @@  int notrace unwind_frame(struct task_struct *tsk, struct stackframe *frame)
 	 * stack.
 	 */
 	if (info.type == frame->prev_type) {
-		if (fp <= frame->prev_fp)
+		if (fp <= frame->prev_fp) {
+			frame->reliable = false;
 			return -EINVAL;
+		}
 	} else {
 		set_bit(frame->prev_type, frame->stacks_done);
 	}
@@ -100,14 +110,32 @@  int notrace unwind_frame(struct task_struct *tsk, struct stackframe *frame)
 		 * So replace it to an original value.
 		 */
 		ret_stack = ftrace_graph_get_ret_stack(tsk, frame->graph++);
-		if (WARN_ON_ONCE(!ret_stack))
+		if (WARN_ON_ONCE(!ret_stack)) {
+			frame->reliable = false;
 			return -EINVAL;
+		}
 		frame->pc = ret_stack->ret;
 	}
 #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
 
 	frame->pc = ptrauth_strip_insn_pac(frame->pc);
 
+	/*
+	 * Check the return PC for conditions that make unwinding unreliable.
+	 * In each case, mark the stack trace as such.
+	 */
+
+	/*
+	 * Make sure that the return address is a proper kernel text address.
+	 * A NULL or invalid return address could mean:
+	 *
+	 *	- generated code such as eBPF and optprobe trampolines
+	 *	- Foreign code (e.g. EFI runtime services)
+	 *	- Procedure Linkage Table (PLT) entries and veneer functions
+	 */
+	if (!__kernel_text_address(frame->pc))
+		frame->reliable = false;
+
 	return 0;
 }
 NOKPROBE_SYMBOL(unwind_frame);