[RFC,v2,0/4] arm64: Implement stack trace reliability checks

Message ID 20210405204313.21346-1-madvenka@linux.microsoft.com (mailing list archive)

Message

Madhavan T. Venkataraman April 5, 2021, 8:43 p.m. UTC
From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>

There are a number of places in kernel code where the stack trace is not
reliable. Enhance the unwinder to check for those cases and mark the
stack trace as unreliable. Once all of the checks are in place, the unwinder
can provide a reliable stack trace. But before this can be used for livepatch,
some other entity needs to guarantee that the frame pointers are all set up
correctly in kernel functions. objtool is currently being worked on to
fill that gap.

Except for the return address check, all the other checks involve checking
the return PC of every frame against certain kernel functions. To do this,
implement some infrastructure code:

	- Define a special_functions[] array and populate the array with
	  the special functions

	- Using kallsyms_lookup(), lookup the symbol table entries for the
	  functions and record their address ranges

	- Define an is_reliable_function(pc) to match a return PC against
	  the special functions.

The unwinder calls is_reliable_function(pc) for every return PC and marks
the stack trace as reliable or unreliable accordingly.
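
As a rough sketch, the infrastructure could look like the following. The
names follow the description above; the actual patch may differ in detail,
and the array entries are shown later, after each case has been described:

#include <linux/kallsyms.h>

struct function_range {
	unsigned long	start;
	unsigned long	end;
};

static struct function_range special_functions[] = {
	/* Entries for the cases described below go here. */
	{ /* sentinel */ }
};

/* Record the address range of each special function (e.g. from an initcall). */
static void init_special_functions(void)
{
	unsigned long size, offset;
	char name[KSYM_NAME_LEN];
	struct function_range *func;

	for (func = special_functions; func->start; func++) {
		if (kallsyms_lookup(func->start, &size, &offset, NULL, name))
			func->end = func->start - offset + size;
	}
}

/* Return false if the return PC falls within any special function. */
static bool is_reliable_function(unsigned long pc)
{
	struct function_range *func;

	for (func = special_functions; func->start; func++) {
		if (pc >= func->start && pc < func->end)
			return false;
	}
	return true;
}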

Return address check
====================

Check the return PC of every stack frame to make sure that it is a valid
kernel text address (and not some generated code, for example).
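
A minimal sketch of the per-frame check, assuming the "reliable" flag
added to struct stackframe (see the changelog below):

static void check_reliability(struct stackframe *frame)
{
	/* Once a trace is marked unreliable, it stays unreliable. */
	if (!frame->reliable)
		return;

	/* The return PC must be a valid kernel text address ... */
	if (!__kernel_text_address(frame->pc)) {
		frame->reliable = false;
		return;
	}

	/* ... and must not fall within any special function. */
	if (!is_reliable_function(frame->pc))
		frame->reliable = false;
}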

Detect EL1 exception frame
==========================

EL1 exceptions can happen on any instruction including instructions in
the frame pointer prolog or epilog. Depending on where exactly they happen,
they could render the stack trace unreliable.

Add all of the EL1 exception handlers to special_functions[].

	- el1_sync()
	- el1_irq()
	- el1_error()
	- el1_sync_invalid()
	- el1_irq_invalid()
	- el1_fiq_invalid()
	- el1_error_invalid()

Detect ftrace frame
===================

When FTRACE executes at the beginning of a traced function, it creates two
frames and calls the tracer function:

	- One frame for the traced function

	- One frame for the caller of the traced function

That gives a sensible stack trace while executing in the tracer function.
When FTRACE returns to the traced function, the frames are popped and
everything is back to normal.

However, in cases like live patch, the tracer function redirects execution
to a different function. When FTRACE returns, control will go to that target
function. A stack trace taken in the tracer function will not show the target
function. The target function is the real function that we want to track.
So, the stack trace is unreliable.

To detect stack traces from a tracer function, add the following to
special_functions[]:

	- ftrace_call + 4

ftrace_call is the label at which the tracer function is patched in. So,
ftrace_call + 4 is its return address. This is what will show up in a
stack trace taken from the tracer function.

When Function Graph Tracing is on, ftrace_graph_caller is patched in
at the label ftrace_graph_call. If a tracer function called before it has
redirected execution as described above, stack traces taken from within
ftrace_graph_caller will be unreliable for the same reason. So, add
ftrace_graph_caller to special_functions[] as well.

Also, the Function Graph Tracer modifies the return address of a traced
function to a return trampoline (return_to_handler()) to gather tracing
data on function return. Stack traces taken from the traced function and
functions it calls will not show the original caller of the traced function.
The unwinder handles this case by getting the original caller from FTRACE.

However, stack traces taken from the trampoline itself and functions it calls
are unreliable as the original return address may not be available in
that context. This is because the trampoline calls FTRACE to gather trace
data as well as to obtain the actual return address and FTRACE discards the
record of the original return address along the way.

Add return_to_handler() to special_functions[].
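
For reference, outside the trampoline the existing arm64 unwinder already
recovers the original caller; simplified from unwind_frame(), with the
pointer-authentication handling omitted:

#ifdef CONFIG_FUNCTION_GRAPH_TRACER
	if (tsk->ret_stack &&
	    frame->pc == (unsigned long)return_to_handler) {
		struct ftrace_ret_stack *ret_stack;

		/*
		 * The return address was rewritten to return_to_handler;
		 * get the original caller back from the graph return stack.
		 */
		ret_stack = ftrace_graph_get_ret_stack(tsk, frame->graph++);
		if (WARN_ON_ONCE(!ret_stack))
			return -EINVAL;
		frame->pc = ret_stack->ret;
	}
#endif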

Check for kretprobe
===================

For functions with a kretprobe set up, probe code executes on entry
to the function and replaces the return address in the stack frame with a
kretprobe trampoline. Whenever the function returns, control is
transferred to the trampoline. The trampoline eventually returns to the
original return address.

A stack trace taken while executing in the function (or in functions that
get called from the function) will not show the original return address.
Similarly, a stack trace taken while executing in the trampoline itself
(and functions that get called from the trampoline) will not show the
original return address. This means that the caller of the probed function
will not show. This makes the stack trace unreliable.

Add the kretprobe trampoline to special_functions[].
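
Putting the cases above together, the special_functions[] table from the
earlier sketch could be populated roughly as follows. This is illustrative
only: the extern declarations are assumed to be provided by the relevant
headers or added by the patches, and the exact config guards may differ:

extern void el1_sync(void);
extern void el1_irq(void);
extern void el1_error(void);
extern void el1_sync_invalid(void);
extern void el1_irq_invalid(void);
extern void el1_fiq_invalid(void);
extern void el1_error_invalid(void);
#ifdef CONFIG_DYNAMIC_FTRACE
extern void ftrace_call(void);
#endif
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
extern void ftrace_graph_caller(void);
extern void return_to_handler(void);
#endif
#ifdef CONFIG_KRETPROBES
extern void kretprobe_trampoline(void);
#endif

static struct function_range special_functions[] = {
	/* EL1 exception handlers */
	{ (unsigned long)el1_sync },
	{ (unsigned long)el1_irq },
	{ (unsigned long)el1_error },
	{ (unsigned long)el1_sync_invalid },
	{ (unsigned long)el1_irq_invalid },
	{ (unsigned long)el1_fiq_invalid },
	{ (unsigned long)el1_error_invalid },

	/* FTRACE trampolines */
#ifdef CONFIG_DYNAMIC_FTRACE
	/* ftrace_call is a label; +4 is the tracer's return address. */
	{ (unsigned long)ftrace_call + 4 },
#endif
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
	{ (unsigned long)ftrace_graph_caller },
	{ (unsigned long)return_to_handler },
#endif

	/* kretprobe trampoline */
#ifdef CONFIG_KRETPROBES
	{ (unsigned long)kretprobe_trampoline },
#endif

	{ /* sentinel */ }
};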

Optprobes
=========

Optprobes may be implemented in the future for arm64. For optprobes,
the relevant trampoline(s) can be added to special_functions[].
---
Changelog:

v1
	- Define a bool field in struct stackframe. This will indicate if
	  a stack trace is reliable.

	- Implement a special_functions[] array that will be populated
	  with special functions in which the stack trace is considered
	  unreliable.
	
	- Using kallsyms_lookup(), get the address ranges for the special
	  functions and record them.

	- Implement an is_reliable_function(pc). This function will check
	  if a given return PC falls in any of the special functions. If
	  it does, the stack trace is unreliable.

	- Implement a check_reliability() function that will check if a
	  stack frame is reliable. Call is_reliable_function() from
	  check_reliability().

	- Before a return PC is checked against special_functions[], it
	  must be validated as a proper kernel text address. Call
	  __kernel_text_address() from check_reliability().

	- Finally, call check_reliability() from unwind_frame() for
	  each stack frame.

	- Add EL1 exception handlers to special_functions[].

		el1_sync();
		el1_irq();
		el1_error();
		el1_sync_invalid();
		el1_irq_invalid();
		el1_fiq_invalid();
		el1_error_invalid();

	- The above functions are currently defined as LOCAL symbols.
	  Make them global so that they can be referenced from the
	  unwinder code.

	- Add FTRACE trampolines to special_functions[]:

		ftrace_graph_call()
		ftrace_graph_caller()
		return_to_handler()

	- Add the kretprobe trampoline to special_functions[]:

		kretprobe_trampoline()

v2
	- Removed the terminating entry { 0, 0 } in special_functions[]
	  and replaced it with the idiom { /* sentinel */ }.

	- Changed the ftrace trampoline entry ftrace_graph_call in
	  special_functions[] to ftrace_call + 4 and added explanatory
	  comments.

	- Unnested #ifdefs in special_functions[] for FTRACE.

Madhavan T. Venkataraman (4):
  arm64: Implement infrastructure for stack trace reliability checks
  arm64: Mark a stack trace unreliable if an EL1 exception frame is
    detected
  arm64: Detect FTRACE cases that make the stack trace unreliable
  arm64: Mark stack trace as unreliable if kretprobed functions are
    present

 arch/arm64/include/asm/exception.h  |   8 ++
 arch/arm64/include/asm/stacktrace.h |   2 +
 arch/arm64/kernel/entry-ftrace.S    |  12 ++
 arch/arm64/kernel/entry.S           |  14 +-
 arch/arm64/kernel/stacktrace.c      | 215 ++++++++++++++++++++++++++++
 5 files changed, 244 insertions(+), 7 deletions(-)


base-commit: 0d02ec6b3136c73c09e7859f0d0e4e2c4c07b49b

Comments

Mark Rutland April 9, 2021, 12:09 p.m. UTC | #1
Hi Madhavan,

I've noted some concerns below. At a high-level, I'm not keen on the
blacklisting approach, and I think there's some other preparatory work
that would be more valuable in the short term.

On Mon, Apr 05, 2021 at 03:43:09PM -0500, madvenka@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
> 
> There are a number of places in kernel code where the stack trace is not
> reliable. Enhance the unwinder to check for those cases and mark the
> stack trace as unreliable. Once all of the checks are in place, the unwinder
> can provide a reliable stack trace. But before this can be used for livepatch,
> some other entity needs to guarantee that the frame pointers are all set up
> correctly in kernel functions. objtool is currently being worked on to
> fill that gap.
> 
> Except for the return address check, all the other checks involve checking
> the return PC of every frame against certain kernel functions. To do this,
> implement some infrastructure code:
> 
> 	- Define a special_functions[] array and populate the array with
> 	  the special functions

I'm not too keen on having to manually collate this within the unwinder,
as it's very painful from a maintenance perspective. I'd much rather we
could associate this information with the implementations of these
functions, so that they're more likely to stay in sync.

Further, I believe all the special cases are assembly functions, and
most of those are already in special sections to begin with. I reckon
it'd be simpler and more robust to reject unwinding based on the
section. If we need to unwind across specific functions in those
sections, we could opt-in with some metadata. So e.g. we could reject
all functions in ".entry.text", special casing the EL0 entry functions
if necessary.
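
As a purely illustrative sketch (not something from this series), the
unwinder could check return PCs against the existing .entry.text bounds:

#include <asm/sections.h>

/* Reject return PCs that fall within .entry.text. */
static bool pc_in_entry_text(unsigned long pc)
{
	return pc >= (unsigned long)__entry_text_start &&
	       pc < (unsigned long)__entry_text_end;
}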

As I mentioned before, I'm currently reworking the entry assembly to
make this simpler to do. I'd prefer to not make invasive changes in that
area until that's sorted.

I think there's a lot more code that we cannot unwind, e.g. KVM
exception code, or almost anything marked with SYM_CODE_END().

> 	- Using kallsyms_lookup(), lookup the symbol table entries for the
> 	  functions and record their address ranges
> 
> 	- Define an is_reliable_function(pc) to match a return PC against
> 	  the special functions.
> 
> The unwinder calls is_reliable_function(pc) for every return PC and marks
> the stack trace as reliable or unreliable accordingly.
> 
> Return address check
> ====================
> 
> Check the return PC of every stack frame to make sure that it is a valid
> kernel text address (and not some generated code, for example).
> 
> Detect EL1 exception frame
> ==========================
> 
> EL1 exceptions can happen on any instruction including instructions in
> the frame pointer prolog or epilog. Depending on where exactly they happen,
> they could render the stack trace unreliable.
> 
> Add all of the EL1 exception handlers to special_functions[].
> 
> 	- el1_sync()
> 	- el1_irq()
> 	- el1_error()
> 	- el1_sync_invalid()
> 	- el1_irq_invalid()
> 	- el1_fiq_invalid()
> 	- el1_error_invalid()
> 
> Detect ftrace frame
> ===================
> 
> When FTRACE executes at the beginning of a traced function, it creates two
> frames and calls the tracer function:
> 
> 	- One frame for the traced function
> 
> 	- One frame for the caller of the traced function
> 
> That gives a sensible stack trace while executing in the tracer function.
> When FTRACE returns to the traced function, the frames are popped and
> everything is back to normal.
> 
> However, in cases like live patch, the tracer function redirects execution
> to a different function. When FTRACE returns, control will go to that target
> function. A stack trace taken in the tracer function will not show the target
> function. The target function is the real function that we want to track.
> So, the stack trace is unreliable.

This doesn't match my understanding of the reliable stacktrace
requirements, but I might have misunderstood what you're saying here.

IIUC what you're describing here is:

1) A calls B
2) B is traced
3) tracer replaces B with TARGET
4) tracer returns to TARGET

... and if a stacktrace is taken at step 3 (before the return address is
patched), the trace will show B rather than TARGET.

My understanding is that this is legitimate behaviour.

> To detect stack traces from a tracer function, add the following to
> special_functions[]:
> 
> 	- ftrace_call + 4
> 
> ftrace_call is the label at which the tracer function is patched in. So,
> ftrace_call + 4 is its return address. This is what will show up in a
> stack trace taken from the tracer function.
> 
> When Function Graph Tracing is on, ftrace_graph_caller is patched in
> at the label ftrace_graph_call. If a tracer function called before it has
> redirected execution as mentioned above, the stack traces taken from within
> ftrace_graph_caller will also be unreliable for the same reason as mentioned
> above. So, add ftrace_graph_caller to special_functions[] as well.
> 
> Also, the Function Graph Tracer modifies the return address of a traced
> function to a return trampoline (return_to_handler()) to gather tracing
> data on function return. Stack traces taken from the traced function and
> functions it calls will not show the original caller of the traced function.
> The unwinder handles this case by getting the original caller from FTRACE.
> 
> However, stack traces taken from the trampoline itself and functions it calls
> are unreliable as the original return address may not be available in
> that context. This is because the trampoline calls FTRACE to gather trace
> data as well as to obtain the actual return address and FTRACE discards the
> record of the original return address along the way.

The reason we cannot unwind the trampolines in the usual way is because
they are not AAPCS compliant functions. We don't discard the original
return address, but it's not in the usual location.  With care, we could
write a special case unwinder for them. Note that we also cannot unwind
from any PLT on the way to the trampolines, so we'd also need to
identify those.  Luckily we're in charge of creating those, and (for
now) we only need to care about the module PLTs.

The bigger problem is return_to_handler, since there's a transient
period when C code removes the return address from the graph return
stack before passing this to assembly in a register, and so we can't
reliably find the correct return address during this period. With care
we could special case unwinding immediately before/after this.

If we could find a way to restructure return_to_handler such that we can
reliably find the correct return address, that would be a useful
improvement today, and would mean that we don't have to blacklist it for
reliable stacktrace.

Thanks,
Mark.

> Add return_to_handler() to special_functions[].
> 
> Check for kretprobe
> ===================
> 
> For functions with a kretprobe set up, probe code executes on entry
> to the function and replaces the return address in the stack frame with a
> kretprobe trampoline. Whenever the function returns, control is
> transferred to the trampoline. The trampoline eventually returns to the
> original return address.
> 
> A stack trace taken while executing in the function (or in functions that
> get called from the function) will not show the original return address.
> Similarly, a stack trace taken while executing in the trampoline itself
> (and functions that get called from the trampoline) will not show the
> original return address. This means that the caller of the probed function
> will not show. This makes the stack trace unreliable.
> 
> Add the kretprobe trampoline to special_functions[].
> 
> Optprobes
> =========
> 
> Optprobes may be implemented in the future for arm64. For optprobes,
> the relevant trampoline(s) can be added to special_functions[].
> ---
> Changelog:
> 
> v1
> 	- Define a bool field in struct stackframe. This will indicate if
> 	  a stack trace is reliable.
> 
> 	- Implement a special_functions[] array that will be populated
> 	  with special functions in which the stack trace is considered
> 	  unreliable.
> 	
> 	- Using kallsyms_lookup(), get the address ranges for the special
> 	  functions and record them.
> 
> 	- Implement an is_reliable_function(pc). This function will check
> 	  if a given return PC falls in any of the special functions. If
> 	  it does, the stack trace is unreliable.
> 
> 	- Implement check_reliability() function that will check if a
> 	  stack frame is reliable. Call is_reliable_function() from
> 	  check_reliability().
> 
> 	- Before a return PC is checked against special_functions[], it
> 	  must be validated as a proper kernel text address. Call
> 	  __kernel_text_address() from check_reliability().
> 
> 	- Finally, call check_reliability() from unwind_frame() for
> 	  each stack frame.
> 
> 	- Add EL1 exception handlers to special_functions[].
> 
> 		el1_sync();
> 		el1_irq();
> 		el1_error();
> 		el1_sync_invalid();
> 		el1_irq_invalid();
> 		el1_fiq_invalid();
> 		el1_error_invalid();
> 
> 	- The above functions are currently defined as LOCAL symbols.
> 	  Make them global so that they can be referenced from the
> 	  unwinder code.
> 
> 	- Add FTRACE trampolines to special_functions[]:
> 
> 		ftrace_graph_call()
> 		ftrace_graph_caller()
> 		return_to_handler()
> 
> 	- Add the kretprobe trampoline to special_functions[]:
> 
> 		kretprobe_trampoline()
> 
> v2
> 	- Removed the terminating entry { 0, 0 } in special_functions[]
> 	  and replaced it with the idiom { /* sentinel */ }.
> 
> 	- Change the ftrace trampoline entry ftrace_graph_call in
> 	  special_functions[] to ftrace_call + 4 and added explanatory
> 	  comments.
> 
> 	- Unnested #ifdefs in special_functions[] for FTRACE.
> 
> Madhavan T. Venkataraman (4):
>   arm64: Implement infrastructure for stack trace reliability checks
>   arm64: Mark a stack trace unreliable if an EL1 exception frame is
>     detected
>   arm64: Detect FTRACE cases that make the stack trace unreliable
>   arm64: Mark stack trace as unreliable if kretprobed functions are
>     present
> 
>  arch/arm64/include/asm/exception.h  |   8 ++
>  arch/arm64/include/asm/stacktrace.h |   2 +
>  arch/arm64/kernel/entry-ftrace.S    |  12 ++
>  arch/arm64/kernel/entry.S           |  14 +-
>  arch/arm64/kernel/stacktrace.c      | 215 ++++++++++++++++++++++++++++
>  5 files changed, 244 insertions(+), 7 deletions(-)
> 
> 
> base-commit: 0d02ec6b3136c73c09e7859f0d0e4e2c4c07b49b
> -- 
> 2.25.1
>
Madhavan T. Venkataraman April 9, 2021, 5:16 p.m. UTC | #2
On 4/9/21 7:09 AM, Mark Rutland wrote:
> Hi Madhavan,
> 
> I've noted some concerns below. At a high-level, I'm not keen on the
> blacklisting approach, and I think there's some other preparatory work
> that would be more valuable in the short term.
> 

Some kind of blacklisting has to be done whichever way you do it.

> On Mon, Apr 05, 2021 at 03:43:09PM -0500, madvenka@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>
>> There are a number of places in kernel code where the stack trace is not
>> reliable. Enhance the unwinder to check for those cases and mark the
>> stack trace as unreliable. Once all of the checks are in place, the unwinder
>> can provide a reliable stack trace. But before this can be used for livepatch,
>> some other entity needs to guarantee that the frame pointers are all set up
>> correctly in kernel functions. objtool is currently being worked on to
>> fill that gap.
>>
>> Except for the return address check, all the other checks involve checking
>> the return PC of every frame against certain kernel functions. To do this,
>> implement some infrastructure code:
>>
>> 	- Define a special_functions[] array and populate the array with
>> 	  the special functions
> 
> I'm not too keen on having to manually collate this within the unwinder,
> as it's very painful from a maintenance perspective. I'd much rather we
> could associate this information with the implementations of these
> functions, so that they're more likely to stay in sync.
> 
> Further, I believe all the special cases are assembly functions, and
> most of those are already in special sections to begin with. I reckon
> it'd be simpler and more robust to reject unwinding based on the
> section. If we need to unwind across specific functions in those
> sections, we could opt-in with some metadata. So e.g. we could reject
> all functions in ".entry.text", special casing the EL0 entry functions
> if necessary.
> 

Yes. I have already agreed that using sections is the way to go. I am working
on that now.

> As I mentioned before, I'm currently reworking the entry assembly to
> make this simpler to do. I'd prefer to not make invasive changes in that
> area until that's sorted.
> 

I don't plan to make any invasive changes. But a couple of cosmetic changes may be
necessary. I don't know yet. But I will keep in mind that you don't want
any invasive changes there.

> I think there's a lot more code that we cannot unwind, e.g. KVM
> exception code, or almost anything marked with SYM_CODE_END().
> 

As Mark Brown suggested, I will take a look at all code that is marked as
SYM_CODE. His idea of placing all SYM_CODE in a separate section and blacklisting
that to begin with and refining things as we go along appears to me to be
a reasonable approach.

>> 	- Using kallsyms_lookup(), lookup the symbol table entries for the
>> 	  functions and record their address ranges
>>
>> 	- Define an is_reliable_function(pc) to match a return PC against
>> 	  the special functions.
>>
>> The unwinder calls is_reliable_function(pc) for every return PC and marks
>> the stack trace as reliable or unreliable accordingly.
>>
>> Return address check
>> ====================
>>
>> Check the return PC of every stack frame to make sure that it is a valid
>> kernel text address (and not some generated code, for example).
>>
>> Detect EL1 exception frame
>> ==========================
>>
>> EL1 exceptions can happen on any instruction including instructions in
>> the frame pointer prolog or epilog. Depending on where exactly they happen,
>> they could render the stack trace unreliable.
>>
>> Add all of the EL1 exception handlers to special_functions[].
>>
>> 	- el1_sync()
>> 	- el1_irq()
>> 	- el1_error()
>> 	- el1_sync_invalid()
>> 	- el1_irq_invalid()
>> 	- el1_fiq_invalid()
>> 	- el1_error_invalid()
>>
>> Detect ftrace frame
>> ===================
>>
>> When FTRACE executes at the beginning of a traced function, it creates two
>> frames and calls the tracer function:
>>
>> 	- One frame for the traced function
>>
>> 	- One frame for the caller of the traced function
>>
>> That gives a sensible stack trace while executing in the tracer function.
>> When FTRACE returns to the traced function, the frames are popped and
>> everything is back to normal.
>>
>> However, in cases like live patch, the tracer function redirects execution
>> to a different function. When FTRACE returns, control will go to that target
>> function. A stack trace taken in the tracer function will not show the target
>> function. The target function is the real function that we want to track.
>> So, the stack trace is unreliable.
> 
> This doesn't match my understanding of the reliable stacktrace
> requirements, but I might have misunderstood what you're saying here.
> 
> IIUC what you're describing here is:
> 
> 1) A calls B
> 2) B is traced
> 3) tracer replaces B with TARGET
> 4) tracer returns to TARGET
> 
> ... and if a stacktrace is taken at step 3 (before the return address is
> patched), the trace will show B rather than TARGET.
> 
> My understanding is that this is legitimate behaviour.
> 

My understanding is as follows (correct me if I am wrong):

- Before B is traced, the situation is "A calls B".

- A trace is placed on B to redirect execution to TARGET. Semantically, it
  becomes "A calls TARGET" beyond that point and B is irrelevant.

- But temporarily, the stack trace will show A -> B.

>> To detect stack traces from a tracer function, add the following to
>> special_functions[]:
>>
>> 	- ftrace_call + 4
>>
>> ftrace_call is the label at which the tracer function is patched in. So,
>> ftrace_call + 4 is its return address. This is what will show up in a
>> stack trace taken from the tracer function.
>>
>> When Function Graph Tracing is on, ftrace_graph_caller is patched in
>> at the label ftrace_graph_call. If a tracer function called before it has
>> redirected execution as mentioned above, the stack traces taken from within
>> ftrace_graph_caller will also be unreliable for the same reason as mentioned
>> above. So, add ftrace_graph_caller to special_functions[] as well.
>>
>> Also, the Function Graph Tracer modifies the return address of a traced
>> function to a return trampoline (return_to_handler()) to gather tracing
>> data on function return. Stack traces taken from the traced function and
>> functions it calls will not show the original caller of the traced function.
>> The unwinder handles this case by getting the original caller from FTRACE.
>>
>> However, stack traces taken from the trampoline itself and functions it calls
>> are unreliable as the original return address may not be available in
>> that context. This is because the trampoline calls FTRACE to gather trace
>> data as well as to obtain the actual return address and FTRACE discards the
>> record of the original return address along the way.
> 
> The reason we cannot unwind the trampolines in the usual way is because
> they are not AAPCS compliant functions. We don't discard the original
> return address, but it's not in the usual location.  With care, we could
> write a special case unwinder for them. Note that we also cannot unwind
> from any PLT on the way to the trampolines, so we'd also need to
> identify those.  Luckily we're in charge of creating those, and (for
> now) we only need to care about the module PLTs.
> 
> The bigger problem is return_to_handler, since there's a transient
> period when C code removes the return address from the graph return
> stack before passing this to assembly in a register, and so we can't
> reliably find the correct return address during this period. With care
> we could special case unwinding immediately before/after this.
> 

This is what I meant when I said "as the original return address may not be available in
that context" because the original address is popped off the return address stack
by the ftrace code called from the trampoline.

> If we could find a way to restructure return_to_handler such that we can
> reliably find the correct return address, that would be a useful
> improvement today, and would mean that we don't have to blacklist it for
> reliable stacktrace.
> 

Agreed. But until then it needs to be blacklisted. Rather than wait for that restructuring
to be done, we could initially blacklist it and remove the blacklist if and when the
restructuring is done.

Thanks.

Madhavan
Josh Poimboeuf April 9, 2021, 9:37 p.m. UTC | #3
On Fri, Apr 09, 2021 at 01:09:09PM +0100, Mark Rutland wrote:
> On Mon, Apr 05, 2021 at 03:43:09PM -0500, madvenka@linux.microsoft.com wrote:
> > From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
> > 
> > There are a number of places in kernel code where the stack trace is not
> > reliable. Enhance the unwinder to check for those cases and mark the
> > stack trace as unreliable. Once all of the checks are in place, the unwinder
> > can provide a reliable stack trace. But before this can be used for livepatch,
> > some other entity needs to guarantee that the frame pointers are all set up
> > correctly in kernel functions. objtool is currently being worked on to
> > fill that gap.
> > 
> > Except for the return address check, all the other checks involve checking
> > the return PC of every frame against certain kernel functions. To do this,
> > implement some infrastructure code:
> > 
> > 	- Define a special_functions[] array and populate the array with
> > 	  the special functions
> 
> I'm not too keen on having to manually collate this within the unwinder,
> as it's very painful from a maintenance perspective.

Agreed.

> I'd much rather we could associate this information with the
> implementations of these functions, so that they're more likely to
> stay in sync.
> 
> Further, I believe all the special cases are assembly functions, and
> most of those are already in special sections to begin with. I reckon
> it'd be simpler and more robust to reject unwinding based on the
> section. If we need to unwind across specific functions in those
> sections, we could opt-in with some metadata. So e.g. we could reject
> all functions in ".entry.text", special casing the EL0 entry functions
> if necessary.

Couldn't this also end up being somewhat fragile?  Saying "certain
sections are deemed unreliable" isn't necessarily obvious to somebody
who doesn't already know about it, and it could be overlooked or
forgotten over time.  And there's no way to enforce it stays that way.

FWIW, over the years we've had zero issues with encoding the frame
pointer on x86.  After you save pt_regs, you encode the frame pointer to
point to it.  Ideally in the same macro so it's hard to overlook.
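
As an illustration only (the names below are made up; on x86-64 the
encoding is a low bit set on the saved frame pointer when it points at
pt_regs), the unwinder side of such an encoding looks roughly like:

static struct pt_regs *decode_frame_pointer(unsigned long fp)
{
	/* Low bit clear: an ordinary frame record, not pt_regs. */
	if (!(fp & 0x1))
		return NULL;

	return (struct pt_regs *)(fp & ~0x1UL);
}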

If you're concerned about debuggers getting confused by the encoding -
which debuggers specifically?  In my experience, if vmlinux has
debuginfo, gdb and most other debuggers will use DWARF (which is already
broken in asm code) and completely ignore frame pointers.

> I think there's a lot more code that we cannot unwind, e.g. KVM
> exception code, or almost anything marked with SYM_CODE_END().

Just a reminder that livepatch only unwinds blocked tasks (plus the
'current' task which calls into livepatch).  So practically speaking, it
doesn't matter whether the 'unreliable' detection has full coverage.
The only exceptions which really matter are those which end up calling
schedule(), e.g. preemption or page faults.

Being able to consistently detect *all* possible unreliable paths would
be nice in theory, but it's unnecessary and may not be worth the extra
complexity.
Madhavan T. Venkataraman April 9, 2021, 10:05 p.m. UTC | #4
On 4/9/21 4:37 PM, Josh Poimboeuf wrote:
> On Fri, Apr 09, 2021 at 01:09:09PM +0100, Mark Rutland wrote:
>> On Mon, Apr 05, 2021 at 03:43:09PM -0500, madvenka@linux.microsoft.com wrote:
>>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>>
>>> There are a number of places in kernel code where the stack trace is not
>>> reliable. Enhance the unwinder to check for those cases and mark the
>>> stack trace as unreliable. Once all of the checks are in place, the unwinder
>>> can provide a reliable stack trace. But before this can be used for livepatch,
>>> some other entity needs to guarantee that the frame pointers are all set up
>>> correctly in kernel functions. objtool is currently being worked on to
>>> fill that gap.
>>>
>>> Except for the return address check, all the other checks involve checking
>>> the return PC of every frame against certain kernel functions. To do this,
>>> implement some infrastructure code:
>>>
>>> 	- Define a special_functions[] array and populate the array with
>>> 	  the special functions
>>
>> I'm not too keen on having to manually collate this within the unwinder,
>> as it's very painful from a maintenance perspective.
> 
> Agreed.
> 
>> I'd much rather we could associate this information with the
>> implementations of these functions, so that they're more likely to
>> stay in sync.
>>
>> Further, I believe all the special cases are assembly functions, and
>> most of those are already in special sections to begin with. I reckon
>> it'd be simpler and more robust to reject unwinding based on the
>> section. If we need to unwind across specific functions in those
>> sections, we could opt-in with some metadata. So e.g. we could reject
>> all functions in ".entry.text", special casing the EL0 entry functions
>> if necessary.
> 
> Couldn't this also end up being somewhat fragile?  Saying "certain
> sections are deemed unreliable" isn't necessarily obvious to somebody
> who doesn't already know about it, and it could be overlooked or
> forgotten over time.  And there's no way to enforce it stays that way.
> 

Good point!

> FWIW, over the years we've had zero issues with encoding the frame
> pointer on x86.  After you save pt_regs, you encode the frame pointer to
> point to it.  Ideally in the same macro so it's hard to overlook.
> 

I had the same opinion. In fact, in my encoding scheme, I have additional
checks to make absolutely sure that it is a true encoding and not stack
corruption. The chances of all of those values accidentally matching are,
well, null.

> If you're concerned about debuggers getting confused by the encoding -
> which debuggers specifically?  In my experience, if vmlinux has
> debuginfo, gdb and most other debuggers will use DWARF (which is already
> broken in asm code) and completely ignore frame pointers.
> 

Yes. I checked gdb actually. It did not show a problem.

>> I think there's a lot more code that we cannot unwind, e.g. KVM
>> exception code, or almost anything marked with SYM_CODE_END().
> 
> Just a reminder that livepatch only unwinds blocked tasks (plus the
> 'current' task which calls into livepatch).  So practically speaking, it
> doesn't matter whether the 'unreliable' detection has full coverage.
> The only exceptions which really matter are those which end up calling
> schedule(), e.g. preemption or page faults.
> 
> Being able to consistently detect *all* possible unreliable paths would
> be nice in theory, but it's unnecessary and may not be worth the extra
> complexity.
> 

You do have a point. I tried to think of arch_stack_walk_reliable() as
something that should be implemented independent of livepatching. But
I could not really come up with a single example of where else it would
really be useful.

So, if we assume that the reliable stack trace is solely for the purpose
of livepatching, I agree with your earlier comments as well.

Thanks!

Madhavan
Josh Poimboeuf April 9, 2021, 10:32 p.m. UTC | #5
On Fri, Apr 09, 2021 at 05:05:58PM -0500, Madhavan T. Venkataraman wrote:
> > FWIW, over the years we've had zero issues with encoding the frame
> > pointer on x86.  After you save pt_regs, you encode the frame pointer to
> > point to it.  Ideally in the same macro so it's hard to overlook.
> > 
> 
> I had the same opinion. In fact, in my encoding scheme, I have additional
> checks to make absolutely sure that it is a true encoding and not stack
> corruption. The chances of all of those values accidentally matching are,
> well, null.

Right, stack corruption -- which is already exceedingly rare -- would
have to be combined with a miracle or two in order to come out of the
whole thing marked as 'reliable' :-)

And really, we already take a similar risk today by "trusting" the frame
pointer value on the stack to a certain extent.

> >> I think there's a lot more code that we cannot unwind, e.g. KVM
> >> exception code, or almost anything marked with SYM_CODE_END().
> > 
> > Just a reminder that livepatch only unwinds blocked tasks (plus the
> > 'current' task which calls into livepatch).  So practically speaking, it
> > doesn't matter whether the 'unreliable' detection has full coverage.
> > The only exceptions which really matter are those which end up calling
> > schedule(), e.g. preemption or page faults.
> > 
> > Being able to consistently detect *all* possible unreliable paths would
> > be nice in theory, but it's unnecessary and may not be worth the extra
> > complexity.
> > 
> 
> You do have a point. I tried to think of arch_stack_walk_reliable() as
> something that should be implemented independent of livepatching. But
> I could not really come up with a single example of where else it would
> really be useful.
> 
> So, if we assume that the reliable stack trace is solely for the purpose
> of livepatching, I agree with your earlier comments as well.

One thought: if folks really view this as a problem, it might help to
just rename things to reduce confusion.

For example, instead of calling it 'reliable', we could call it
something more precise, like 'klp_reliable', to indicate that it's
reliable enough for live patching.

Then have a comment above 'klp_reliable' and/or
stack_trace_save_tsk_klp_reliable() which describes what that means.

Hm, for that matter, even without renaming things, a comment above
stack_trace_save_tsk_reliable() describing the meaning of "reliable"
would be a good idea.
Josh Poimboeuf April 9, 2021, 10:53 p.m. UTC | #6
On Fri, Apr 09, 2021 at 05:32:27PM -0500, Josh Poimboeuf wrote:
> On Fri, Apr 09, 2021 at 05:05:58PM -0500, Madhavan T. Venkataraman wrote:
> > > FWIW, over the years we've had zero issues with encoding the frame
> > > pointer on x86.  After you save pt_regs, you encode the frame pointer to
> > > point to it.  Ideally in the same macro so it's hard to overlook.
> > > 
> > 
> > I had the same opinion. In fact, in my encoding scheme, I have additional
> > checks to make absolutely sure that it is a true encoding and not stack
> > corruption. The chances of all of those values accidentally matching are,
> > well, null.
> 
> Right, stack corruption -- which is already exceedingly rare -- would
> have to be combined with a miracle or two in order to come out of the
> whole thing marked as 'reliable' :-)
> 
> And really, we already take a similar risk today by "trusting" the frame
> pointer value on the stack to a certain extent.

Oh yeah, I forgot to mention some more benefits of encoding the frame
pointer (or marking pt_regs in some other way):

a) Stack addresses can be printed properly: '%pS' for printing regs->pc
   and '%pB' for printing call returns.

   Using '%pS' for call returns (as arm64 seems to do today) will result
   in printing the wrong function when you have tail calls to noreturn
   functions on the stack (which is actually quite common for calls to
   panic(), die(), etc).

   More details:

   https://lkml.kernel.org/r/20210403155948.ubbgtwmlsdyar7yp@treble

b) Stack dumps to the console can dump the exception registers they find
   along the way.  This is actually quite nice for debugging.
Madhavan T. Venkataraman April 11, 2021, 5:54 p.m. UTC | #7
On 4/9/21 5:53 PM, Josh Poimboeuf wrote:
> On Fri, Apr 09, 2021 at 05:32:27PM -0500, Josh Poimboeuf wrote:
>> On Fri, Apr 09, 2021 at 05:05:58PM -0500, Madhavan T. Venkataraman wrote:
>>>> FWIW, over the years we've had zero issues with encoding the frame
>>>> pointer on x86.  After you save pt_regs, you encode the frame pointer to
>>>> point to it.  Ideally in the same macro so it's hard to overlook.
>>>>
>>>
>>> I had the same opinion. In fact, in my encoding scheme, I have additional
>>> checks to make absolutely sure that it is a true encoding and not stack
>>> corruption. The chances of all of those values accidentally matching are,
>>> well, null.
>>
>> Right, stack corruption -- which is already exceedingly rare -- would
>> have to be combined with a miracle or two in order to come out of the
>> whole thing marked as 'reliable' :-)
>>
>> And really, we already take a similar risk today by "trusting" the frame
>> pointer value on the stack to a certain extent.
> 
> Oh yeah, I forgot to mention some more benefits of encoding the frame
> pointer (or marking pt_regs in some other way):
> 
> a) Stack addresses can be printed properly: '%pS' for printing regs->pc
>    and '%pB' for printing call returns.
> 
>    Using '%pS' for call returns (as arm64 seems to do today) will result
>    in printing the wrong function when you have tail calls to noreturn
>    functions on the stack (which is actually quite common for calls to
>    panic(), die(), etc).
> 
>    More details:
> 
>    https://lkml.kernel.org/r/20210403155948.ubbgtwmlsdyar7yp@treble
> 
> b) Stack dumps to the console can dump the exception registers they find
>    along the way.  This is actually quite nice for debugging.
> 
> 

Great.

I am preparing version 3 taking into account comments from yourself,
Mark Rutland and Mark Brown.

Stay tuned.

Madhavan
Mark Brown April 12, 2021, 4:59 p.m. UTC | #8
On Fri, Apr 09, 2021 at 05:32:27PM -0500, Josh Poimboeuf wrote:

> Hm, for that matter, even without renaming things, a comment above
> stack_trace_save_tsk_reliable() describing the meaning of "reliable"
> would be a good idea.

Might be better to place something at the prototype for
arch_stack_walk_reliable() or cross link the two since that's where any
new architectures should be starting, or perhaps even better to extend
the document that Mark wrote further and point to that from both places.  

Some more explicit pointer to live patching as the only user would
definitely be good but I think the more important thing would be writing
down any assumptions in the API that aren't already written down and
we're supposed to be relying on.  Mark's document captured a lot of it
but it sounds like there's more here, and even with knowing that this
interface is only used by live patch and digging into what it does it's
not always clear what happens to work with the code right now and what's
something that's suitable to be relied on.
Mark Brown April 12, 2021, 5:36 p.m. UTC | #9
On Fri, Apr 09, 2021 at 04:37:41PM -0500, Josh Poimboeuf wrote:
> On Fri, Apr 09, 2021 at 01:09:09PM +0100, Mark Rutland wrote:

> > Further, I believe all the special cases are assembly functions, and
> > most of those are already in special sections to begin with. I reckon
> > it'd be simpler and more robust to reject unwinding based on the
> > section. If we need to unwind across specific functions in those
> > sections, we could opt-in with some metadata. So e.g. we could reject
> > all functions in ".entry.text", special casing the EL0 entry functions
> > if necessary.

> Couldn't this also end up being somewhat fragile?  Saying "certain
> sections are deemed unreliable" isn't necessarily obvious to somebody
> who doesn't already know about it, and it could be overlooked or
> forgotten over time.  And there's no way to enforce it stays that way.

Anything in this area is going to have some opportunity for fragility
and missed assumptions somewhere.  I do find the idea of using the
SYM_CODE annotations that we already have and use for other purposes to
flag code that we don't expect to be suitable for reliable unwinding
appealing from that point of view.  It's pretty clear at the points
where they're used that they're needed, even with a pretty surface level
review, and the bit actually pushing things into a section is going to
be in a single place where the macro is defined.  That seems relatively
robust as these things go, it seems no worse than our reliance on
SYM_FUNC to create BTI annotations.  Missing those causes oopses when we
try to branch to the function.
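
To make that concrete (purely illustrative - nothing like this exists
today): if SYM_CODE_START()/SYM_CODE_END() pushed their contents into a
dedicated section bounded by, say, __sym_code_text_start/__sym_code_text_end,
the reliability check would reduce to something like:

/* Hypothetical section bounds emitted by the SYM_CODE annotations. */
extern char __sym_code_text_start[], __sym_code_text_end[];

static bool pc_is_sym_code(unsigned long pc)
{
	return pc >= (unsigned long)__sym_code_text_start &&
	       pc < (unsigned long)__sym_code_text_end;
}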
Madhavan T. Venkataraman April 12, 2021, 7:55 p.m. UTC | #10
On 4/12/21 12:36 PM, Mark Brown wrote:
> On Fri, Apr 09, 2021 at 04:37:41PM -0500, Josh Poimboeuf wrote:
>> On Fri, Apr 09, 2021 at 01:09:09PM +0100, Mark Rutland wrote:
> 
>>> Further, I believe all the special cases are assembly functions, and
>>> most of those are already in special sections to begin with. I reckon
>>> it'd be simpler and more robust to reject unwinding based on the
>>> section. If we need to unwind across specific functions in those
>>> sections, we could opt-in with some metadata. So e.g. we could reject
>>> all functions in ".entry.text", special casing the EL0 entry functions
>>> if necessary.
> 
>> Couldn't this also end up being somewhat fragile?  Saying "certain
>> sections are deemed unreliable" isn't necessarily obvious to somebody
>> who doesn't already know about it, and it could be overlooked or
>> forgotten over time.  And there's no way to enforce it stays that way.
> 
> Anything in this area is going to have some opportunity for fragility
> and missed assumptions somewhere.  I do find the idea of using the
> SYM_CODE annotations that we already have and use for other purposes to
> flag code that we don't expect to be suitable for reliable unwinding
> appealing from that point of view.  It's pretty clear at the points
> where they're used that they're needed, even with a pretty surface level
> review, and the bit actually pushing things into a section is going to
> be in a single place where the macro is defined.  That seems relatively
> robust as these things go, it seems no worse than our reliance on
> SYM_FUNC to create BTI annotations.  Missing those causes oopses when we
> try to branch to the function.
> 

OK. Just so I am clear on the whole picture, let me state my understanding so far.
Correct me if I am wrong.

1. We are hoping that we can convert a significant number of SYM_CODE functions to
   SYM_FUNC functions by providing them with a proper FP prolog and epilog so that
   we can get objtool coverage for them. These don't need any blacklisting.

2. If we can locate the pt_regs structures created on the stack cleanly for EL1
   exceptions, etc, then we can handle those cases in the unwinder without needing
   any black listing.

   I have a solution for this in version 3 that does it without encoding the FP or
   matching values on the stack. I have addressed all of the objections so far on
   that count. I will send the patch series out soon.

3. We are going to assume that the reliable unwinder is only for livepatch purposes
   and will only be invoked on a task that is not currently running. The task either
   voluntarily gave up the CPU or was pre-empted. We can safely ignore all SYM_CODE
   functions that will never voluntarily give up the CPU. They can only be pre-empted
   and pre-emption is already handled in (2). We don't need to blacklist any of these
   functions.

4. So, the only functions that will need blacklisting are the remaining SYM_CODE functions
   that might give up the CPU voluntarily. At this point, I am not even sure how
   many of these will exist. One hopes that all of these would have ended up as
   SYM_FUNC functions in (1).

So, IMHO, placing code in a blacklisted section should be the last step and not the first
one. This also satisfies Mark Rutland's requirement that no one muck with the entry text
while he is sorting out that code.

I suggest we do (3) first. Then, review the assembly functions to do (1). Then, review the
remaining ones to see which ones must be blacklisted, if any.

Do you agree?

Madhavan
Mark Brown April 13, 2021, 11:02 a.m. UTC | #11
On Mon, Apr 12, 2021 at 02:55:35PM -0500, Madhavan T. Venkataraman wrote:

> 
> OK. Just so I am clear on the whole picture, let me state my understanding so far.
> Correct me if I am wrong.

> 1. We are hoping that we can convert a significant number of SYM_CODE functions to
>    SYM_FUNC functions by providing them with a proper FP prolog and epilog so that
>    we can get objtool coverage for them. These don't need any blacklisting.

I wouldn't expect to be converting lots of SYM_CODE to SYM_FUNC.  I'd
expect the overwhelming majority of SYM_CODE to be SYM_CODE because it's
required to be non standard due to some external interface - things like
the exception vectors, ftrace, and stuff around suspend/hibernate.  A
quick grep seems to confirm this.

> 3. We are going to assume that the reliable unwinder is only for livepatch purposes
>    and will only be invoked on a task that is not currently running. The task either

The reliable unwinder can also be invoked on itself.

> 4. So, the only functions that will need blacklisting are the remaining SYM_CODE functions
>    that might give up the CPU voluntarily. At this point, I am not even sure how
>    many of these will exist. One hopes that all of these would have ended up as
>    SYM_FUNC functions in (1).

There's stuff like ret_from_fork there.

> I suggest we do (3) first. Then, review the assembly functions to do (1). Then, review the
> remaining ones to see which ones must be blacklisted, if any.

I'm not clear what the concrete steps you're planning to do first are
there - your 3 seems like a statement of assumptions.  For flagging
functions I do think it'd be safer to default to assuming that all
SYM_CODE functions can't be unwound reliably rather than only explicitly
listing ones that cause problems.
Josh Poimboeuf April 13, 2021, 10:53 p.m. UTC | #12
On Mon, Apr 12, 2021 at 05:59:33PM +0100, Mark Brown wrote:
> On Fri, Apr 09, 2021 at 05:32:27PM -0500, Josh Poimboeuf wrote:
> 
> > Hm, for that matter, even without renaming things, a comment above
> > stack_trace_save_tsk_reliable() describing the meaning of "reliable"
> > would be a good idea.
> 
> Might be better to place something at the prototype for
> arch_stack_walk_reliable() or cross link the two since that's where any
> new architectures should be starting, or perhaps even better to extend
> the document that Mark wrote further and point to that from both places.  
> 
> Some more explict pointer to live patching as the only user would
> definitely be good but I think the more important thing would be writing
> down any assumptions in the API that aren't already written down and
> we're supposed to be relying on.  Mark's document captured a lot of it
> but it sounds like there's more here, and even with knowing that this
> interface is only used by live patch and digging into what it does it's
> not always clear what happens to work with the code right now and what's
> something that's suitable to be relied on.

Something like so?

From: Josh Poimboeuf <jpoimboe@redhat.com>
Subject: [PATCH] livepatch: Clarify the meaning of 'reliable'

Update the comments and documentation to reflect what 'reliable'
unwinding actually means, in the context of live patching.

Suggested-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 .../livepatch/reliable-stacktrace.rst         | 26 +++++++++++++----
 arch/x86/kernel/stacktrace.c                  |  6 ----
 include/linux/stacktrace.h                    | 29 +++++++++++++++++--
 kernel/stacktrace.c                           |  7 ++++-
 4 files changed, 53 insertions(+), 15 deletions(-)

diff --git a/Documentation/livepatch/reliable-stacktrace.rst b/Documentation/livepatch/reliable-stacktrace.rst
index 67459d2ca2af..e325efc7e952 100644
--- a/Documentation/livepatch/reliable-stacktrace.rst
+++ b/Documentation/livepatch/reliable-stacktrace.rst
@@ -72,7 +72,21 @@ The unwinding process varies across architectures, their respective procedure
 call standards, and kernel configurations. This section describes common
 details that architectures should consider.
 
-4.1 Identifying successful termination
+4.1 Only preemptible code needs reliability detection
+-----------------------------------------------------
+
+The only current user of reliable stacktracing is livepatch, which only
+calls it for a) inactive tasks; or b) the current task in task context.
+
+Therefore, the unwinder only needs to detect the reliability of stacks
+involving *preemptible* code.
+
+Practically speaking, reliability of stacks involving *non-preemptible*
+code is a "don't-care".  It may help to return a wrong reliability
+result for such cases, if it results in reduced complexity, since such
+cases will not happen in practice.
+
+4.2 Identifying successful termination
 --------------------------------------
 
 Unwinding may terminate early for a number of reasons, including:
@@ -95,7 +109,7 @@ architectures verify that a stacktrace ends at an expected location, e.g.
 * On a specific stack expected for a kernel entry point (e.g. if the
   architecture has separate task and IRQ stacks).
 
-4.2 Identifying unwindable code
+4.3 Identifying unwindable code
 -------------------------------
 
 Unwinding typically relies on code following specific conventions (e.g.
@@ -129,7 +143,7 @@ unreliable to unwind from, e.g.
 
 * Identifying specific portions of code using bounds information.
 
-4.3 Unwinding across interrupts and exceptions
+4.4 Unwinding across interrupts and exceptions
 ----------------------------------------------
 
 At function call boundaries the stack and other unwind state is expected to be
@@ -156,7 +170,7 @@ have no such cases) should attempt to unwind across exception boundaries, as
 doing so can prevent unnecessarily stalling livepatch consistency checks and
 permits livepatch transitions to complete more quickly.
 
-4.4 Rewriting of return addresses
+4.5 Rewriting of return addresses
 ---------------------------------
 
 Some trampolines temporarily modify the return address of a function in order
@@ -222,7 +236,7 @@ middle of return_to_handler and can report this as unreliable. Architectures
 are not required to unwind from other trampolines which modify the return
 address.
 
-4.5 Obscuring of return addresses
+4.6 Obscuring of return addresses
 ---------------------------------
 
 Some trampolines do not rewrite the return address in order to intercept
@@ -249,7 +263,7 @@ than the link register as would usually be the case.
 Architectures must either ensure that unwinders either reliably unwind
 such cases, or report the unwinding as unreliable.
 
-4.6 Link register unreliability
+4.7 Link register unreliability
 -------------------------------
 
 On some other architectures, 'call' instructions place the return address into a
diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index 8627fda8d993..15b058eefc4e 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -29,12 +29,6 @@ void arch_stack_walk(stack_trace_consume_fn consume_entry, void *cookie,
 	}
 }
 
-/*
- * This function returns an error if it detects any unreliable features of the
- * stack.  Otherwise it guarantees that the stack trace is reliable.
- *
- * If the task is not 'current', the caller *must* ensure the task is inactive.
- */
 int arch_stack_walk_reliable(stack_trace_consume_fn consume_entry,
 			     void *cookie, struct task_struct *task)
 {
diff --git a/include/linux/stacktrace.h b/include/linux/stacktrace.h
index 50e2df30b0aa..1b6a65a0ad22 100644
--- a/include/linux/stacktrace.h
+++ b/include/linux/stacktrace.h
@@ -26,7 +26,7 @@ unsigned int stack_trace_save_user(unsigned long *store, unsigned int size);
 #ifdef CONFIG_ARCH_STACKWALK
 
 /**
- * stack_trace_consume_fn - Callback for arch_stack_walk()
+ * stack_trace_consume_fn() - Callback for arch_stack_walk()
  * @cookie:	Caller supplied pointer handed back by arch_stack_walk()
  * @addr:	The stack entry address to consume
  *
@@ -35,7 +35,7 @@ unsigned int stack_trace_save_user(unsigned long *store, unsigned int size);
  */
 typedef bool (*stack_trace_consume_fn)(void *cookie, unsigned long addr);
 /**
- * arch_stack_walk - Architecture specific function to walk the stack
+ * arch_stack_walk() - Architecture specific function to walk the stack
  * @consume_entry:	Callback which is invoked by the architecture code for
  *			each entry.
  * @cookie:		Caller supplied pointer which is handed back to
@@ -52,8 +52,33 @@ typedef bool (*stack_trace_consume_fn)(void *cookie, unsigned long addr);
  */
 void arch_stack_walk(stack_trace_consume_fn consume_entry, void *cookie,
 		     struct task_struct *task, struct pt_regs *regs);
+
+/**
+ * arch_stack_walk_reliable() - Architecture specific function to walk the
+ *				stack, with stack reliability check
+ * @consume_entry:	Callback which is invoked by the architecture code for
+ *			each entry.
+ * @cookie:		Caller supplied pointer which is handed back to
+ *			@consume_entry
+ * @task:		Pointer to a task struct, can be NULL for current
+ *
+ * Return: 0 if the stack trace is considered reliable for livepatch; else < 0.
+ *
+ * NOTE: This interface is only used by livepatch.  The caller must ensure that
+ *	 it's only called in one of the following two scenarios:
+ *
+ *	 a) the task is inactive (and guaranteed to remain so); or
+ *
+ *	 b) the task is 'current', running in task context.
+ *
+ * Effectively, this means the arch unwinder doesn't need to detect the
+ * reliability of stacks involving non-preemptible code.
+ *
+ * For more details, see Documentation/livepatch/reliable-stacktrace.rst.
+ */
 int arch_stack_walk_reliable(stack_trace_consume_fn consume_entry, void *cookie,
 			     struct task_struct *task);
+
 void arch_stack_walk_user(stack_trace_consume_fn consume_entry, void *cookie,
 			  const struct pt_regs *regs);
 
diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c
index 9f8117c7cfdd..a198fd194fed 100644
--- a/kernel/stacktrace.c
+++ b/kernel/stacktrace.c
@@ -185,7 +185,12 @@ unsigned int stack_trace_save_regs(struct pt_regs *regs, unsigned long *store,
  *		stack. Otherwise it guarantees that the stack trace is
  *		reliable and returns the number of entries stored.
  *
- * If the task is not 'current', the caller *must* ensure the task is inactive.
+ * NOTE: This interface is only used by livepatch.  The caller must ensure that
+ *	 it's only called in one of the following two scenarios:
+ *
+ *	 a) the task is inactive (and guaranteed to remain so); or
+ *
+ *	 b) the task is 'current', running in task context.
  */
 int stack_trace_save_tsk_reliable(struct task_struct *tsk, unsigned long *store,
 				  unsigned int size)
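
As an illustration of the constraints documented above (and assuming CONFIG_ARCH_STACKWALK
and HAVE_RELIABLE_STACKTRACE), a hypothetical caller of the reliable interface might look
like the sketch below; the callback, helper and cookie names are invented for illustration
and are not part of the patch:

#include <linux/sched.h>
#include <linux/stacktrace.h>

/* Hypothetical consume callback: count the entries of a reliable trace. */
static bool count_entry(void *cookie, unsigned long addr)
{
        unsigned int *count = cookie;

        (*count)++;
        return true;            /* keep walking */
}

static int count_reliable_entries(struct task_struct *task,
                                  unsigned int *count)
{
        *count = 0;

        /* The caller must satisfy scenario a) or b) from the comment above. */
        return arch_stack_walk_reliable(count_entry, count, task);
}
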
Madhavan T. Venkataraman April 14, 2021, 10:23 a.m. UTC | #13
On 4/13/21 6:02 AM, Mark Brown wrote:
> On Mon, Apr 12, 2021 at 02:55:35PM -0500, Madhavan T. Venkataraman wrote:
> 
>>
>> OK. Just so I am clear on the whole picture, let me state my understanding so far.
>> Correct me if I am wrong.
> 
>> 1. We are hoping that we can convert a significant number of SYM_CODE functions to
>>    SYM_FUNC functions by providing them with a proper FP prolog and epilog so that
>>    we can get objtool coverage for them. These don't need any blacklisting.
> 
> I wouldn't expect to be converting lots of SYM_CODE to SYM_FUNC.  I'd
> expect the overwhelming majority of SYM_CODE to be SYM_CODE because it's
> required to be non standard due to some external interface - things like
> the exception vectors, ftrace, and stuff around suspend/hibernate.  A
> quick grep seems to confirm this.
> 

OK. Fair enough.

>> 3. We are going to assume that the reliable unwinder is only for livepatch purposes
>>    and will only be invoked on a task that is not currently running. The task either
> 
> The reliable unwinder can also be invoked on itself.
> 

I have not called out the self-directed case because I am assuming that the reliable unwinder
is only used for livepatch. So, AFAICT, this is applicable to the task that performs the
livepatch operation itself. In this case, there should be no unreliable functions on the
self-directed stack trace (otherwise, livepatching would always fail).
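
In other words, something like the following self-check (a sketch only; the helper name
and buffer size are made up) would be expected to succeed on the task performing the
livepatch operation:

#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/stacktrace.h>

/* Sketch: scenario b) - check current's own stack while in task context. */
static bool current_stack_is_reliable(void)
{
        unsigned long entries[100];     /* arbitrary size for illustration */
        int ret;

        ret = stack_trace_save_tsk_reliable(current, entries,
                                            ARRAY_SIZE(entries));
        return ret >= 0;                /* >= 0: number of entries saved */
}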

>> 4. So, the only functions that will need blacklisting are the remaining SYM_CODE functions
>>    that might give up the CPU voluntarily. At this point, I am not even sure how
>>    many of these will exist. One hopes that all of these would have ended up as
>>    SYM_FUNC functions in (1).
> 
> There's stuff like ret_from_fork there.
> 

OK. There would be a few functions that fit this category. I agree.

>> I suggest we do (3) first. Then, review the assembly functions to do (1). Then, review the
>> remaining ones to see which ones must be blacklisted, if any.
> 
> I'm not clear what the concrete steps you're planning to do first are
> there - your 3 seems like a statement of assumptions.  For flagging
> functions I do think it'd be safer to default to assuming that all
> SYM_CODE functions can't be unwound reliably rather than only explicitly
> listing ones that cause problems.
> 

They are not assumptions; they are true statements, although I probably did not do a good
job of explaining them. Josh has since sent out a patch that updates the documentation and
explains what I meant a lot better.

In any case, I have absolutely no problem implementing your section idea. I will
attempt to do that in version 3 of my patch series.

Stay tuned.

And, thanks for all the input. It is very helpful.

Madhavan
Mark Brown April 14, 2021, 12:24 p.m. UTC | #14
On Tue, Apr 13, 2021 at 05:53:10PM -0500, Josh Poimboeuf wrote:
> On Mon, Apr 12, 2021 at 05:59:33PM +0100, Mark Brown wrote:

> > Some more explicit pointer to live patching as the only user would
> > definitely be good but I think the more important thing would be writing
> > down any assumptions in the API that aren't already written down and

> Something like so?

Yeah, looks reasonable - it'll need rebasing against current code as I
moved the docs in the source out of the arch code into the header this
cycle (they were copied verbatim in a couple of places).

>  #ifdef CONFIG_ARCH_STACKWALK
>  
>  /**
> - * stack_trace_consume_fn - Callback for arch_stack_walk()
> + * stack_trace_consume_fn() - Callback for arch_stack_walk()
>   * @cookie:	Caller supplied pointer handed back by arch_stack_walk()
>   * @addr:	The stack entry address to consume
>   *

> @@ -35,7 +35,7 @@ unsigned int stack_trace_save_user(unsigned long *store, unsigned int size);
>   */
>  typedef bool (*stack_trace_consume_fn)(void *cookie, unsigned long addr);
>  /**
> - * arch_stack_walk - Architecture specific function to walk the stack
> + * arch_stack_walk() - Architecture specific function to walk the stack
>   * @consume_entry:	Callback which is invoked by the architecture code for
>   *			each entry.
>   * @cookie:		Caller supplied pointer which is handed back to

These two should be separated.
Mark Brown April 14, 2021, 12:35 p.m. UTC | #15
On Wed, Apr 14, 2021 at 05:23:38AM -0500, Madhavan T. Venkataraman wrote:
> On 4/13/21 6:02 AM, Mark Brown wrote:
> > On Mon, Apr 12, 2021 at 02:55:35PM -0500, Madhavan T. Venkataraman wrote:

> >> 3. We are going to assume that the reliable unwinder is only for livepatch purposes
> >>    and will only be invoked on a task that is not currently running. The task either
> > 
> > The reliable unwinder can also be invoked on itself.

> I have not called out the self-directed case because I am assuming that the reliable unwinder
> is only used for livepatch. So, AFAICT, this is applicable to the task that performs the
> livepatch operation itself. In this case, there should be no unreliable functions on the
> self-directed stack trace (otherwise, livepatching would always fail).

Someone might've added a probe of some kind which upsets things, so
there's a possibility things might fail.  Like you say, there's no way a
system in such a state can successfully apply a live patch, but we might
still run into that situation.

> >> I suggest we do (3) first. Then, review the assembly functions to do (1). Then, review the
> >> remaining ones to see which ones must be blacklisted, if any.

> > I'm not clear what the concrete steps you're planning to do first are
> > there - your 3 seems like a statement of assumptions.  For flagging
> > functions I do think it'd be safer to default to assuming that all
> > SYM_CODE functions can't be unwound reliably rather than only explicitly
> > listing ones that cause problems.

> They are not assumptions. They are true statements. But I probably did not do a good
> job of explaining. But Josh sent out a patch that updates the documentation that
> explains what I said a lot better.

You say true statements, I say assumptions :)
Madhavan T. Venkataraman April 16, 2021, 2:43 p.m. UTC | #16
On 4/14/21 5:23 AM, Madhavan T. Venkataraman wrote:
> In any case, I have absolutely no problem implementing your section idea. I will
> attempt to do that in version 3 of my patch series.

So, I attempted a patch that declares all .entry.text functions unreliable simply by
checking the section bounds. It does work for EL1 exceptions, but some functions that are
actually reliable still show up as unreliable. The example in my test is el0_sync(), which
is at the base of all system call stacks.
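
The check I tried was essentially the following (a minimal sketch using the generic
section symbols, not the actual patch):

#include <asm/sections.h>

/* Sketch: flag every return PC that falls inside .entry.text. */
static bool unwind_pc_is_reliable(unsigned long pc)
{
        if (pc >= (unsigned long)__entry_text_start &&
            pc < (unsigned long)__entry_text_end)
                return false;   /* el0_sync() also lands here */

        return true;
}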

How would you prefer I handle this? Should I place all SYM_CODE functions that are
actually safe for the unwinder in a separate section? I could just pick an approach and
solve this, but I would like to get your opinion and Mark Rutland's opinion so we are all
on the same page.
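
One hypothetical shape for that (all names below are invented for illustration, not
actual code) would be to export bounds for the new section and whitelist it before the
.entry.text check:

#include <asm/sections.h>

/* Sketch: bounds of a hypothetical section holding unwinder-safe SYM_CODE. */
extern char __unwind_safe_text_start[], __unwind_safe_text_end[];

static bool unwind_pc_is_reliable(unsigned long pc)
{
        /* Known-safe SYM_CODE functions, e.g. el0_sync(), would go here. */
        if (pc >= (unsigned long)__unwind_safe_text_start &&
            pc < (unsigned long)__unwind_safe_text_end)
                return true;

        /* Everything else in .entry.text stays unreliable. */
        if (pc >= (unsigned long)__entry_text_start &&
            pc < (unsigned long)__entry_text_end)
                return false;

        return true;
}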

Please let me know.

Madhavan
Mark Brown April 16, 2021, 3:36 p.m. UTC | #17
On Fri, Apr 16, 2021 at 09:43:48AM -0500, Madhavan T. Venkataraman wrote:

> How would you prefer I handle this? Should I place all SYM_CODE functions that
> are actually safe for the unwinder in a separate section? I could just take
> some approach and solve this. But I would like to get your opinion and Mark
> Rutland's opinion so we are all on the same page.

That sounds reasonable to me; obviously we'd have to look at exactly how
the annotation ends up getting done, plus the usual bikeshed colour
discussions.  I'm not sure if we want a single "safe for unwinder"
section or to split things up into sections per reason things are safe
for the unwinder (kind of like what you were proposing for flagging
things as a problem); that might end up being useful for other things at
some point.