[0/2] Triggering a softlockup panic during SMP boot

Message ID cover.1698441495.git.kjlx@templeofstupid.com (mailing list archive)

Message

Krister Johansen Oct. 27, 2023, 9:46 p.m. UTC
Hi,
This pair of patches was the result of an unsuccessful attempt to set
softlockup_panic before SMP boot.  The rationale for wanting to set this
parameter is that some of the VMs that my team runs will occasionally
get stuck while onlining the non-boot processors as part of SMP boot.

In the cases where this happens, we find out about it after the instance
successfully boots; however, the machines can get stuck for tens of
minutes at a time before finally completing onlining processors.  Since
we pay per minute for many of these VMs, there were two goals for setting
this value on boot: first, fail fast and hope that a subsequent boot
attempt will be successful.  Second, a panic is a little easier to keep
track of, especially if we're scraping serial logs after the fact.  In
essence, the goal is to trigger the failure earlier and hopefully get
more useful information for further debugging the problem as well.

While testing to make sure that this value was getting correctly set on
boot, I ran into a pair of surprises.  First, when the softlockup_panic
parameter was migrated to a sysctl alias, it had the side effect of
setting the parameter value after SMP boot has occurred, when it used to
be set before this.  Second, testing revealed that even though the
aliases were being correctly processed, the kernel was reporting the
command-line arguments as unrecognized.  This generated a message in the
logs about an unrecognized parameter (even though it was recognized), and
the parameter was passed as an environment variable to init.

The first patch ensures that aliased sysctl arguments are not reported
as unrecognized boot arguments.
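
As a rough user-space illustration of that idea (a sketch, not the patch
itself): the function names, struct layout, and table contents below are
hypothetical and only model the behaviour described above, where a boot
option that matches a sysctl alias is neither reported as unrecognized
nor forwarded to init.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical alias table: legacy boot parameter -> sysctl name. */
struct sysctl_alias {
	const char *kernel_param;
	const char *sysctl_param;
};

/* Illustrative entries only; the real table lives in the kernel source. */
static const struct sysctl_alias aliases[] = {
	{ "softlockup_panic", "kernel.softlockup_panic" },
	{ "hung_task_panic",  "kernel.hung_task_panic"  },
};

/* Return true when param (the text before '=') names a known alias. */
static bool is_sysctl_alias(const char *param)
{
	for (size_t i = 0; i < sizeof(aliases) / sizeof(aliases[0]); i++) {
		if (strcmp(param, aliases[i].kernel_param) == 0)
			return true;
	}
	return false;
}

/*
 * Decide whether an otherwise unhandled boot option should be reported
 * as unrecognized and handed to init.  Aliases are skipped because a
 * later stage applies them as sysctls.
 */
static bool forward_to_init(const char *param)
{
	return !is_sysctl_alias(param);
}
```

With this check in place, forward_to_init("softlockup_panic") is false,
while an option that matches nothing is still forwarded.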

The second patch moves the setting of softlockup_panic earlier in boot,
where it can take effect before SMP boot begins.
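
The ordering problem can be modelled with a small user-space simulation.
This is a sketch under stated assumptions, not kernel code: the three
stages, the handler name, and the use_early_handler switch are all
hypothetical stand-ins for "early parameter parsing", "SMP bring-up",
and "late sysctl alias processing".

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

static int softlockup_panic;	/* stand-in for the kernel tunable */

/* Hypothetical early handler: parse the value as soon as it is seen. */
static void early_softlockup_panic(const char *val)
{
	softlockup_panic = (int)strtol(val, NULL, 0);
}

/*
 * Simulate one boot with a single command-line argument and return the
 * value of the tunable that is visible while secondary CPUs come up.
 */
static int value_during_smp_boot(const char *arg, bool use_early_handler)
{
	const char *key = "softlockup_panic=";
	size_t klen = strlen(key);
	bool match = strncmp(arg, key, klen) == 0;
	int seen;

	softlockup_panic = 0;

	/* Stage 1: early parameters are handled before SMP bring-up. */
	if (match && use_early_handler)
		early_softlockup_panic(arg + klen);

	/* Stage 2: SMP bring-up; a lockup here would consult the tunable. */
	seen = softlockup_panic;

	/* Stage 3: sysctl aliases are applied only after SMP bring-up. */
	if (match && !use_early_handler)
		softlockup_panic = (int)strtol(arg + klen, NULL, 0);

	return seen;
}
```

In this model, "softlockup_panic=1" is visible during stage 2 only when
the early handler is used; on the alias path the value is still 0 while
the processors are being onlined, and becomes 1 afterwards.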

Thanks,

-K

Krister Johansen (2):
  proc: sysctl: prevent aliased sysctls from getting passed to init
  watchdog: move softlockup_panic back to early_param

 fs/proc/proc_sysctl.c  | 8 +++++++-
 include/linux/sysctl.h | 6 ++++++
 init/main.c            | 4 ++++
 kernel/watchdog.c      | 7 +++++++
 4 files changed, 24 insertions(+), 1 deletion(-)

Comments

Luis Chamberlain Oct. 27, 2023, 10:04 p.m. UTC | #1
On Fri, Oct 27, 2023 at 02:46:26PM -0700, Krister Johansen wrote:
> The second patch moves the setting of softlockup_panic earlier in boot,
> where it can take effect before SMP boot begins.

Sounds all great but I only got the cover letter, so maybe resend?

  Luis

Krister Johansen Oct. 27, 2023, 11:06 p.m. UTC | #2
On Fri, Oct 27, 2023 at 03:04:56PM -0700, Luis Chamberlain wrote:
> Sounds all great but I only got the cover letter, so maybe resend?

Apologies, I'm not sure quite what went wrong there.  I've resent the
patches to the people in the To: of the original messages, in an attempt
to avoid sending copies to everybody a second time.

The entire set seems to have made it to lore:

https://lore.kernel.org/linux-fsdevel/ZTw0CACF3jtT3%2FdX@bombadil.infradead.org/T/#r831972d73aad653c3b732e4e36e743cd53673847

If you still haven't got the copies, please let me know and I'll see
if there's something else I can do to get them to you.

Sorry about this. :/

-K

Luis Chamberlain Nov. 1, 2023, 7:10 p.m. UTC | #3
On Fri, Oct 27, 2023 at 02:46:26PM -0700, Krister Johansen wrote:
> The second patch moves the setting of softlockup_panic earlier in boot,
> where it can take effect before SMP boot begins.

Thanks! Looks good, merged and will push to Linus soon.

  Luis