diff mbox series

[v1] kexec_core: Add and update comments regarding the KEXEC_JUMP flow

Message ID 4968818.GXAFRqVoOG@rjwysocki.net (mailing list archive)
State Handled Elsewhere, archived
Headers show
Series [v1] kexec_core: Add and update comments regarding the KEXEC_JUMP flow | expand

Commit Message

Rafael J. Wysocki Dec. 16, 2024, 1:39 p.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

The KEXEC_JUMP flow is analogous to hibernation flows occurring before
and after creating an image and before and after jumping from the
restore kernel to the image one, which is why it uses the same device
callbacks as those hibernation flows.

Add comments explaining that to the code in question and update an
existing comment in it which appears a bit out of context.

No functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

The kexec_jump code has a few problems AFAICS.

First off, it should use lock_system_sleep() or "interesting" things may
happen when it is run in parallel to a system-wide PM transition.

Second, it looks like it should use pm_sleep_disable/enable_decondary_cpus()
instead of the "raw" suspend_disable/enable_secondary_cpus() because running
it with unpaused cpuidle is kind of a slippery slope.

Moreover, it wouldn't hurt to somehow call acpi_pm_freeze() somewhere during
it to prevent background platform activity from interfering with the "resume"
part of it.

It also might be useful to unify it with the analogous hibernation flows more
directly, but that would require some rearrangements of the latter.

I'm going to send patches along these lines at one point in the future
unless I'm told that this is a bad idea.

Thanks!

---
 kernel/kexec_core.c |   23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

Comments

David Woodhouse Dec. 16, 2024, 1:43 p.m. UTC | #1
On Mon, 2024-12-16 at 14:39 +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> The KEXEC_JUMP flow is analogous to hibernation flows occurring
> before
> and after creating an image and before and after jumping from the
> restore kernel to the image one, which is why it uses the same device
> callbacks as those hibernation flows.
> 
> Add comments explaining that to the code in question and update an
> existing comment in it which appears a bit out of context.
> 
> No functional changes.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Thanks. I'll round that up into my kexec-debug tree, which Ingo has
been taking into tip/x86/boot. Once I'm done fighting with objtool(qv).
Eric W. Biederman Dec. 16, 2024, 4 p.m. UTC | #2
David Woodhouse <dwmw2@infradead.org> writes:

> On Mon, 2024-12-16 at 14:39 +0100, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> 
>> The KEXEC_JUMP flow is analogous to hibernation flows occurring
>> before
>> and after creating an image and before and after jumping from the
>> restore kernel to the image one, which is why it uses the same device
>> callbacks as those hibernation flows.
>> 
>> Add comments explaining that to the code in question and update an
>> existing comment in it which appears a bit out of context.
>> 
>> No functional changes.
>> 
>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Thanks. I'll round that up into my kexec-debug tree, which Ingo has
> been taking into tip/x86/boot. Once I'm done fighting with
> objtool(qv).

I have no objection to getting kexec jump more in sync with the
rest of the power management code.

I do have a question though.  Does anyone actually use kexec jump?

It is fine if folks do, but I haven't actually heard of anyone using
it.  If folks aren't using it, it might make sense to just use the fact
that it is broken as a nudge to remove that option.

Eric
David Woodhouse Dec. 16, 2024, 6:14 p.m. UTC | #3
On 16 December 2024 16:00:00 GMT, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>David Woodhouse <dwmw2@infradead.org> writes:
>
>> On Mon, 2024-12-16 at 14:39 +0100, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>> 
>>> The KEXEC_JUMP flow is analogous to hibernation flows occurring
>>> before
>>> and after creating an image and before and after jumping from the
>>> restore kernel to the image one, which is why it uses the same device
>>> callbacks as those hibernation flows.
>>> 
>>> Add comments explaining that to the code in question and update an
>>> existing comment in it which appears a bit out of context.
>>> 
>>> No functional changes.
>>> 
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> Thanks. I'll round that up into my kexec-debug tree, which Ingo has
>> been taking into tip/x86/boot. Once I'm done fighting with
>> objtool(qv).
>
>I have no objection to getting kexec jump more in sync with the
>rest of the power management code.
>
>I do have a question though.  Does anyone actually use kexec jump?
>
>It is fine if folks do, but I haven't actually heard of anyone using
>it.  If folks aren't using it, it might make sense to just use the fact
>that it is broken as a nudge to remove that option.
>
>Eric
>
>
>

It isn't broken. I know of it being used a few million times a week.

There are corner cases which have never worked right, like the callee putting a different return address for its next invocation, on the stack *above* the address it 'ret's to. Which since the first kjump patch has been the first word of the page *after* the swap page (and is now fixed in my tree). But fundamentally it *does* work.

I only started messing with it because I was working on relocate_kernel() and needed to write a test case for it; the fact that I know of it being used in production is actually just a coincidence.
Eric W. Biederman Dec. 16, 2024, 6:21 p.m. UTC | #4
David Woodhouse <dwmw2@infradead.org> writes:

> It isn't broken. I know of it being used a few million times a week.
>
> There are corner cases which have never worked right, like the callee
> putting a different return address for its next invocation, on the
> stack *above* the address it 'ret's to. Which since the first kjump
> patch has been the first word of the page *after* the swap page (and
> is now fixed in my tree). But fundamentally it *does* work.
>
> I only started messing with it because I was working on
> relocate_kernel() and needed to write a test case for it; the fact
> that I know of it being used in production is actually just a
> coincidence.

Cool.  I had the sense that the original developer never got around
to using it, so I figured I should check.

Mind if I ask what you know of it being used for?

I had imagined it might be a way to call firmware code preventing the
need to code of a specific interface for each type of firmware.

Eric
diff mbox series

Patch

--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -1001,6 +1001,12 @@ 
 
 #ifdef CONFIG_KEXEC_JUMP
 	if (kexec_image->preserve_context) {
+		/*
+		 * This flow is analogous to hibernation flows that occur before
+		 * creating an image and before jumping from the restore kernel
+		 * to the image one, so it uses the same device device callbacks
+		 * as those two flows.
+		 */
 		pm_prepare_console();
 		error = freeze_processes();
 		if (error) {
@@ -1011,12 +1017,10 @@ 
 		error = dpm_suspend_start(PMSG_FREEZE);
 		if (error)
 			goto Resume_console;
-		/* At this point, dpm_suspend_start() has been called,
-		 * but *not* dpm_suspend_end(). We *must* call
-		 * dpm_suspend_end() now.  Otherwise, drivers for
-		 * some devices (e.g. interrupt controllers) become
-		 * desynchronized with the actual state of the
-		 * hardware at resume time, and evil weirdness ensues.
+		/*
+		 * dpm_suspend_end() must be called after dpm_suspend_start()
+		 * to complete the transition, like in the hibernation flows
+		 * mentioned above.
 		 */
 		error = dpm_suspend_end(PMSG_FREEZE);
 		if (error)
@@ -1052,6 +1056,13 @@ 
 
 #ifdef CONFIG_KEXEC_JUMP
 	if (kexec_image->preserve_context) {
+		/*
+		 * This flow is analogous to hibernation flows that occur after
+		 * creating an image and after the image hernel has got control
+		 * back, and in case the devices have been reset or otherwise
+		 * manipulated in the meantime, it uses the device callbacks
+		 * used by the latter.
+		 */
 		syscore_resume();
  Enable_irqs:
 		local_irq_enable();