Message ID | 20170412202238.5d327vmwjqvbzzop@pd.tnic (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Wed, 2017-04-12 at 22:22 +0200, Borislav Petkov wrote: > On Wed, Apr 12, 2017 at 01:59:03PM -0600, Vishal Verma wrote: > > I don't think we can do anything about the panic path errors. > > Then the fix should be a lot easier: > > --- > diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c > index 3ba1c3472cf9..44c092ec2ac9 100644 > --- a/drivers/acpi/nfit/mce.c > +++ b/drivers/acpi/nfit/mce.c > @@ -25,6 +25,9 @@ static int nfit_handle_mce(struct notifier_block > *nb, unsigned long val, > struct acpi_nfit_desc *acpi_desc; > struct nfit_spa *nfit_spa; > > + if (in_atomic()) > + return NOTIFY_DONE; But isn't the atomic notifier call chain always called in atomic context? > + > /* We only care about memory errors */ > if (!(mce->status & MCACOD)) > return NOTIFY_DONE; > > > -- > Regards/Gruss, > Boris. > > SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, > HRB 21284 (AG Nürnberg)
On Wed, Apr 12, 2017 at 01:27:05PM -0700, Verma, Vishal L wrote: > > /* We only care about memory errors */ > > if (!(mce->status & MCACOD)) > > return NOTIFY_DONE; N.B. that isn't a valid test that this is a memory error. You need if (!(m->status & 0xef80) == BIT(7)) return NOTIFY_DONE; See: Intel SDM Volume 3B - 15.9.2 Compound Error Codes -Tony
On Wed, Apr 12, 2017 at 1:52 PM, Luck, Tony <tony.luck@intel.com> wrote: > On Wed, Apr 12, 2017 at 01:27:05PM -0700, Verma, Vishal L wrote: >> > /* We only care about memory errors */ >> > if (!(mce->status & MCACOD)) >> > return NOTIFY_DONE; > > N.B. that isn't a valid test that this is a memory error. You need > > > if (!(m->status & 0xef80) == BIT(7)) > return NOTIFY_DONE; > > See: Intel SDM Volume 3B - 15.9.2 Compound Error Codes But Vishal's point is that even if we get this check correct the notifier still requires no sleeping operations. So we would need to move recoverable notifications to a separate blocking notifier chain.
On Wed, 12 Apr 2017, Dan Williams wrote: > On Wed, Apr 12, 2017 at 1:52 PM, Luck, Tony <tony.luck@intel.com> wrote: > > On Wed, Apr 12, 2017 at 01:27:05PM -0700, Verma, Vishal L wrote: > >> > /* We only care about memory errors */ > >> > if (!(mce->status & MCACOD)) > >> > return NOTIFY_DONE; > > > > N.B. that isn't a valid test that this is a memory error. You need > > > > > > if (!(m->status & 0xef80) == BIT(7)) > > return NOTIFY_DONE; > > > > See: Intel SDM Volume 3B - 15.9.2 Compound Error Codes > > But Vishal's point is that even if we get this check correct the > notifier still requires no sleeping operations. So we would need to > move recoverable notifications to a separate blocking notifier chain. There is another solution: Convert the notifier to a blocking notifier and in the panic case, ignore the locking and invoke the notifier chain directly. That needs some minimal surgery in the notifier code to allow that, but that's certainly less ugly than splitting stuff up into two chains. Thanks, tglx
On Wed, Apr 12, 2017 at 08:27:05PM +0000, Verma, Vishal L wrote: > But isn't the atomic notifier call chain always called in atomic > context? No, it isn't. We're calling it in normal process context in mce_gen_pool_process() too. So this early exit will avoid any sleeping in atomic context. And since there's nothing you can do about the errors reported in atomic context, we can actually use that fact.
On Wed, Apr 12, 2017 at 11:12:21PM +0200, Thomas Gleixner wrote: > There is another solution: > > Convert the notifier to a blocking notifier and in the panic case, ignore > the locking and invoke the notifier chain directly. That needs some minimal > surgery in the notifier code to allow that, but that's certainly less ugly > than splitting stuff up into two chains. But I wonder whether we actually want two chains. We've been adding a bunch of general run-time logging and recovery stuff to this chain. So now we have things there that aren't needed or useful in the panic case. E.g. srao_decode_notifier() (which tries to offline a page that reported an uncorrected error out of the execution path) and Boris's new CEC code. -Tony
On Wed, Apr 12, 2017 at 02:19:32PM -0700, Luck, Tony wrote: > On Wed, Apr 12, 2017 at 11:12:21PM +0200, Thomas Gleixner wrote: > > There is another solution: > > > > Convert the notifier to a blocking notifier and in the panic case, ignore > > the locking and invoke the notifier chain directly. That needs some minimal > > surgery in the notifier code to allow that, but that's certainly less ugly > > than splitting stuff up into two chains. > > But I wonder whether we actually want two chains. We've been adding a bunch > of general run-time logging and recovery stuff to this chain. So now we have > things there that aren't needed or useful in the panic case. E.g. > srao_decode_notifier() (which tries to offline a page that reported an > uncorrected error out of the execution path) and Boris's new CEC code. I guess we'll have to. The CEC thing does mutex_lock() too and the atomic notifier disables preemption: __atomic_notifier_call_chain() rcu_read_lock() __rcu_read_lock() if (IS_ENABLED(CONFIG_PREEMPT_COUNT)) preempt_disable(); so we need to think about something better to handle events down the whole chain. Maybe route events from the atomic path to the blocking path where the sleeping notifier callbacks can sleep as much as they want to...
On Wed, 12 Apr 2017, Borislav Petkov wrote: > On Wed, Apr 12, 2017 at 08:27:05PM +0000, Verma, Vishal L wrote: > > But isn't the atomic notifier call chain always called in atomic > > context? > > No, it isn't. We're calling it in normal process context in > mce_gen_pool_process() too. > > So this early exit will avoid any sleeping in atomic context. And since > there's nothing you can do about the errors reported in atomic context, > we can actually use that fact. No, you can't. CONFIG_RCU_PREEMPT=n + CONFIG_PREEMPT_COUNT will disable preemption from within __atomic_notifier_call_chain() via rcu_read_lock(). Ergo you wont ever enter the handler. The behaviour in the RCU code is inconsistent. CONFIG_RCU_PREEMPT=y does obviouly not disable preemption, but it should still trigger the might_sleep() check when a blocking function is called from within a rcu read side critical section. Thanks, tglx
On Wed, Apr 12, 2017 at 11:47:49PM +0200, Borislav Petkov wrote: > so we need to think about something better to handle events down the > whole chain. Maybe route events from the atomic path to the blocking > path where the sleeping notifier callbacks can sleep as much as they > want to... ... and it is midnight here so I could be talking crap but we probably should really split the reporting "chain" into two: 1. atomic context which doesn't do any notifier wankery. We simply call non-sleeping functions one after the other. And those should be fast and not taking any locks. For example, I'm thinking you want to decode the error and that's it. So we can replace the notifier call in print_mce() with our atomic context function. Then: 2. move the notifier completely in process context, convert it to a blocking one and connect there all the noodling stuff like EDAC, NFIT, mcelog, EXTLOG, ... I.e., all the logging machinery. Problem solved. And here's where you come in and say, "Boris, you're talking crap.." :-) Lemme sleep on it.
On Thu, Apr 13, 2017 at 12:16:39AM +0200, Borislav Petkov wrote: > ... and it is midnight here so I could be talking crap but we probably > should really split the reporting "chain" into two: This shouldn't be too painful. Users ask to be put on the chain via the wrapper: void mce_register_decode_chain(struct notifier_block *nb) { atomic_inc(&num_notifiers); WARN_ON(nb->priority > MCE_PRIO_LOWEST && nb->priority < MCE_PRIO_EDAC); atomic_notifier_chain_register(&x86_mce_decoder_chain, nb); } We can futz with that and have them specify which chain (or both) that they want to be added to. -Tony
On Wed, Apr 12, 2017 at 03:26:19PM -0700, Luck, Tony wrote: > We can futz with that and have them specify which chain (or both) > that they want to be added to. Well, I didn't want the atomic chain to be a notifier because we can keep it simple and non-blocking. Only the process context one will be. So the question is, do we even have a use case for outside consumers hanging on the atomic chain? Because if not, we're good to go.
diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c index 3ba1c3472cf9..44c092ec2ac9 100644 --- a/drivers/acpi/nfit/mce.c +++ b/drivers/acpi/nfit/mce.c @@ -25,6 +25,9 @@ static int nfit_handle_mce(struct notifier_block *nb, unsigned long val, struct acpi_nfit_desc *acpi_desc; struct nfit_spa *nfit_spa; + if (in_atomic()) + return NOTIFY_DONE; + /* We only care about memory errors */ if (!(mce->status & MCACOD)) return NOTIFY_DONE;
On Wed, Apr 12, 2017 at 01:59:03PM -0600, Vishal Verma wrote: > I don't think we can do anything about the panic path errors. Then the fix should be a lot easier: ---