Message ID | 20210111214452.1826-2-tony.luck@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Fix infinite machine check loop in futex_wait_setup() | expand |
> On Jan 11, 2021, at 1:45 PM, Tony Luck <tony.luck@intel.com> wrote: > > Recovery action when get_user() triggers a machine check uses the fixup > path to make get_user() return -EFAULT. Also queue_task_work() sets up > so that kill_me_maybe() will be called on return to user mode to send a > SIGBUS to the current process. > > But there are places in the kernel where the code assumes that this > EFAULT return was simply because of a page fault. The code takes some > action to fix that, and then retries the access. This results in a second > machine check. > > While processing this second machine check queue_task_work() is called > again. But since this uses the same callback_head structure that > was used in the first call, the net result is an entry on the > current->task_works list that points to itself. Is this happening in pagefault_disable context or normal sleepable fault context? If the latter, maybe we should reconsider finding a way for the machine check code to do its work inline instead of deferring it. Yes, I realize this is messy, but maybe it’s not that messy. Conceptually, we just (famous last words) need to arrange for an MCE with IF=1 to switch off the IST stack and run like a normal exception.
On Mon, Jan 11, 2021 at 02:11:56PM -0800, Andy Lutomirski wrote: > > > On Jan 11, 2021, at 1:45 PM, Tony Luck <tony.luck@intel.com> wrote: > > > > Recovery action when get_user() triggers a machine check uses the fixup > > path to make get_user() return -EFAULT. Also queue_task_work() sets up > > so that kill_me_maybe() will be called on return to user mode to send a > > SIGBUS to the current process. > > > > But there are places in the kernel where the code assumes that this > > EFAULT return was simply because of a page fault. The code takes some > > action to fix that, and then retries the access. This results in a second > > machine check. > > > > While processing this second machine check queue_task_work() is called > > again. But since this uses the same callback_head structure that > > was used in the first call, the net result is an entry on the > > current->task_works list that points to itself. > > Is this happening in pagefault_disable context or normal sleepable fault context? If the latter, maybe we should reconsider finding a way for the machine check code to do its work inline instead of deferring it. The first machine check is in pagefault_disable() context. static int get_futex_value_locked(u32 *dest, u32 __user *from) { int ret; pagefault_disable(); ret = __get_user(*dest, from); pagefault_enable(); return (ret == -ENXIO) ? ret : ret ? -EFAULT : 0; } -Tony
> On Jan 11, 2021, at 2:21 PM, Luck, Tony <tony.luck@intel.com> wrote: > > On Mon, Jan 11, 2021 at 02:11:56PM -0800, Andy Lutomirski wrote: >> >>>> On Jan 11, 2021, at 1:45 PM, Tony Luck <tony.luck@intel.com> wrote: >>> >>> Recovery action when get_user() triggers a machine check uses the fixup >>> path to make get_user() return -EFAULT. Also queue_task_work() sets up >>> so that kill_me_maybe() will be called on return to user mode to send a >>> SIGBUS to the current process. >>> >>> But there are places in the kernel where the code assumes that this >>> EFAULT return was simply because of a page fault. The code takes some >>> action to fix that, and then retries the access. This results in a second >>> machine check. >>> >>> While processing this second machine check queue_task_work() is called >>> again. But since this uses the same callback_head structure that >>> was used in the first call, the net result is an entry on the >>> current->task_works list that points to itself. >> >> Is this happening in pagefault_disable context or normal sleepable fault context? If the latter, maybe we should reconsider finding a way for the machine check code to do its work inline instead of deferring it. > > The first machine check is in pagefault_disable() context. > > static int get_futex_value_locked(u32 *dest, u32 __user *from) > { > int ret; > > pagefault_disable(); > ret = __get_user(*dest, from); I have very mixed feelings as to whether we should even try to recover from the first failure like this. If we actually want to try to recover, perhaps we should instead arrange for the second MCE to recover successfully instead of panicking. --Andy > pagefault_enable(); > > return (ret == -ENXIO) ? ret : ret ? -EFAULT : 0; > } > > -Tony
On Tue, Jan 12, 2021 at 09:00:14AM -0800, Andy Lutomirski wrote: > > On Jan 11, 2021, at 2:21 PM, Luck, Tony <tony.luck@intel.com> wrote: > > > > On Mon, Jan 11, 2021 at 02:11:56PM -0800, Andy Lutomirski wrote: > >> > >>>> On Jan 11, 2021, at 1:45 PM, Tony Luck <tony.luck@intel.com> wrote: > >>> > >>> Recovery action when get_user() triggers a machine check uses the fixup > >>> path to make get_user() return -EFAULT. Also queue_task_work() sets up > >>> so that kill_me_maybe() will be called on return to user mode to send a > >>> SIGBUS to the current process. > >>> > >>> But there are places in the kernel where the code assumes that this > >>> EFAULT return was simply because of a page fault. The code takes some > >>> action to fix that, and then retries the access. This results in a second > >>> machine check. > >>> > >>> While processing this second machine check queue_task_work() is called > >>> again. But since this uses the same callback_head structure that > >>> was used in the first call, the net result is an entry on the > >>> current->task_works list that points to itself. > >> > >> Is this happening in pagefault_disable context or normal sleepable fault context? If the latter, maybe we should reconsider finding a way for the machine check code to do its work inline instead of deferring it. > > > > The first machine check is in pagefault_disable() context. > > > > static int get_futex_value_locked(u32 *dest, u32 __user *from) > > { > > int ret; > > > > pagefault_disable(); > > ret = __get_user(*dest, from); > > I have very mixed feelings as to whether we should even try to recover > from the first failure like this. If we actually want to try to > recover, perhaps we should instead arrange for the second MCE to > recover successfully instead of panicking. Well we obviously have to "recover" from the first machine check in order to get to the second. Are you saying you don't like the different return value from get_user()? In my initial playing around with this I just had the second machine check simply skip the task_work_add(). This worked for this case, but only because there wasn't a third, fourth, etc. access to the poisoned data. If the caller keeps peeking, then we'll keep taking more machine checks - possibly forever. Even if we do recover with just one extra machine check ... that's one more than was necessary. -Tony
On Tue, Jan 12, 2021 at 9:16 AM Luck, Tony <tony.luck@intel.com> wrote: > > On Tue, Jan 12, 2021 at 09:00:14AM -0800, Andy Lutomirski wrote: > > > On Jan 11, 2021, at 2:21 PM, Luck, Tony <tony.luck@intel.com> wrote: > > > > > > On Mon, Jan 11, 2021 at 02:11:56PM -0800, Andy Lutomirski wrote: > > >> > > >>>> On Jan 11, 2021, at 1:45 PM, Tony Luck <tony.luck@intel.com> wrote: > > >>> > > >>> Recovery action when get_user() triggers a machine check uses the fixup > > >>> path to make get_user() return -EFAULT. Also queue_task_work() sets up > > >>> so that kill_me_maybe() will be called on return to user mode to send a > > >>> SIGBUS to the current process. > > >>> > > >>> But there are places in the kernel where the code assumes that this > > >>> EFAULT return was simply because of a page fault. The code takes some > > >>> action to fix that, and then retries the access. This results in a second > > >>> machine check. > > >>> > > >>> While processing this second machine check queue_task_work() is called > > >>> again. But since this uses the same callback_head structure that > > >>> was used in the first call, the net result is an entry on the > > >>> current->task_works list that points to itself. > > >> > > >> Is this happening in pagefault_disable context or normal sleepable fault context? If the latter, maybe we should reconsider finding a way for the machine check code to do its work inline instead of deferring it. > > > > > > The first machine check is in pagefault_disable() context. > > > > > > static int get_futex_value_locked(u32 *dest, u32 __user *from) > > > { > > > int ret; > > > > > > pagefault_disable(); > > > ret = __get_user(*dest, from); > > > > I have very mixed feelings as to whether we should even try to recover > > from the first failure like this. If we actually want to try to > > recover, perhaps we should instead arrange for the second MCE to > > recover successfully instead of panicking. > > Well we obviously have to "recover" from the first machine check > in order to get to the second. Are you saying you don't like the > different return value from get_user()? > > In my initial playing around with this I just had the second machine > check simply skip the task_work_add(). This worked for this case, but > only because there wasn't a third, fourth, etc. access to the poisoned > data. If the caller keeps peeking, then we'll keep taking more machine > checks - possibly forever. > > Even if we do recover with just one extra machine check ... that's one > more than was necessary. Well, we need to do *something* when the first __get_user() trips the #MC. It would be nice if we could actually fix up the page tables inside the #MC handler, but, if we're in a pagefault_disable() context we might have locks held. Heck, we could have the pagetable lock held, be inside NMI, etc. Skipping the task_work_add() might actually make sense if we get a second one. We won't actually infinite loop in pagefault_disable() context -- if we would, then we would also infinite loop just from a regular page fault, too.
On Tue, Jan 12, 2021 at 09:21:21AM -0800, Andy Lutomirski wrote: > Well, we need to do *something* when the first __get_user() trips the > #MC. It would be nice if we could actually fix up the page tables > inside the #MC handler, but, if we're in a pagefault_disable() context > we might have locks held. Heck, we could have the pagetable lock > held, be inside NMI, etc. Skipping the task_work_add() might actually > make sense if we get a second one. > > We won't actually infinite loop in pagefault_disable() context -- if > we would, then we would also infinite loop just from a regular page > fault, too. Fixing the page tables inside the #MC handler to unmap the poison page would indeed be a good solution. But, as you point out, not possible because of locks. Could we take a more drastic approach? We know that this case the kernel is accessing a user address for the current process. Could the machine check handler just re-write %cr3 to point to a kernel-only page table[1]. I.e. unmap the entire current user process. Then any subsequent access to user space will page fault. Maybe have a flag in the task structure to help the #PF handler understand what just happened. The code we execute in the task_work handler can restore %cr3 -Tony [1] Does such a thing already exist? If not, I'd need some help/pointers to create it.
On Tue, Jan 12, 2021 at 10:24 AM Luck, Tony <tony.luck@intel.com> wrote: > > On Tue, Jan 12, 2021 at 09:21:21AM -0800, Andy Lutomirski wrote: > > Well, we need to do *something* when the first __get_user() trips the > > #MC. It would be nice if we could actually fix up the page tables > > inside the #MC handler, but, if we're in a pagefault_disable() context > > we might have locks held. Heck, we could have the pagetable lock > > held, be inside NMI, etc. Skipping the task_work_add() might actually > > make sense if we get a second one. > > > > We won't actually infinite loop in pagefault_disable() context -- if > > we would, then we would also infinite loop just from a regular page > > fault, too. > > Fixing the page tables inside the #MC handler to unmap the poison > page would indeed be a good solution. But, as you point out, not possible > because of locks. > > Could we take a more drastic approach? We know that this case the kernel > is accessing a user address for the current process. Could the machine > check handler just re-write %cr3 to point to a kernel-only page table[1]. > I.e. unmap the entire current user process. That seems scary, especially if we're in the middle of a context switch when this happens. We *could* make it work, but I'm not at all convinced it's wise. > > Then any subsequent access to user space will page fault. Maybe have a > flag in the task structure to help the #PF handler understand what just > happened. > > The code we execute in the task_work handler can restore %cr3 This would need to be integrated with something much more local IMO. Maybe it could be scoped to pagefault_disable()/pagefault_enable()? --Andy
On Tue, Jan 12, 2021 at 10:57:07AM -0800, Andy Lutomirski wrote: > On Tue, Jan 12, 2021 at 10:24 AM Luck, Tony <tony.luck@intel.com> wrote: > > > > On Tue, Jan 12, 2021 at 09:21:21AM -0800, Andy Lutomirski wrote: > > > Well, we need to do *something* when the first __get_user() trips the > > > #MC. It would be nice if we could actually fix up the page tables > > > inside the #MC handler, but, if we're in a pagefault_disable() context > > > we might have locks held. Heck, we could have the pagetable lock > > > held, be inside NMI, etc. Skipping the task_work_add() might actually > > > make sense if we get a second one. > > > > > > We won't actually infinite loop in pagefault_disable() context -- if > > > we would, then we would also infinite loop just from a regular page > > > fault, too. > > > > Fixing the page tables inside the #MC handler to unmap the poison > > page would indeed be a good solution. But, as you point out, not possible > > because of locks. > > > > Could we take a more drastic approach? We know that this case the kernel > > is accessing a user address for the current process. Could the machine > > check handler just re-write %cr3 to point to a kernel-only page table[1]. > > I.e. unmap the entire current user process. > > That seems scary, especially if we're in the middle of a context > switch when this happens. We *could* make it work, but I'm not at all > convinced it's wise. Scary? It's terrifying! But we know that the fault happend in a get_user() or copy_from_user() call (i.e. an RIP with an extable recovery address). Does context switch access user memory? -Tony
> On Jan 12, 2021, at 12:52 PM, Luck, Tony <tony.luck@intel.com> wrote: > > On Tue, Jan 12, 2021 at 10:57:07AM -0800, Andy Lutomirski wrote: >>> On Tue, Jan 12, 2021 at 10:24 AM Luck, Tony <tony.luck@intel.com> wrote: >>> >>> On Tue, Jan 12, 2021 at 09:21:21AM -0800, Andy Lutomirski wrote: >>>> Well, we need to do *something* when the first __get_user() trips the >>>> #MC. It would be nice if we could actually fix up the page tables >>>> inside the #MC handler, but, if we're in a pagefault_disable() context >>>> we might have locks held. Heck, we could have the pagetable lock >>>> held, be inside NMI, etc. Skipping the task_work_add() might actually >>>> make sense if we get a second one. >>>> >>>> We won't actually infinite loop in pagefault_disable() context -- if >>>> we would, then we would also infinite loop just from a regular page >>>> fault, too. >>> >>> Fixing the page tables inside the #MC handler to unmap the poison >>> page would indeed be a good solution. But, as you point out, not possible >>> because of locks. >>> >>> Could we take a more drastic approach? We know that this case the kernel >>> is accessing a user address for the current process. Could the machine >>> check handler just re-write %cr3 to point to a kernel-only page table[1]. >>> I.e. unmap the entire current user process. >> >> That seems scary, especially if we're in the middle of a context >> switch when this happens. We *could* make it work, but I'm not at all >> convinced it's wise. > > Scary? It's terrifying! > > But we know that the fault happend in a get_user() or copy_from_user() call > (i.e. an RIP with an extable recovery address). Does context switch > access user memory? No, but NMI can. The case that would be very very hard to deal with is if we get an NMI just before IRET/SYSRET and get #MC inside that NMI. What we should probably do is have a percpu list of pending memory failure cleanups and just accept that we’re going to sometimes get a second MCE (or third or fourth) before we can get to it. Can we do the cleanup from an interrupt? IPI-to-self might be a credible approach, if so.
On Tue, Jan 12, 2021 at 02:04:55PM -0800, Andy Lutomirski wrote: > > But we know that the fault happend in a get_user() or copy_from_user() call > > (i.e. an RIP with an extable recovery address). Does context switch > > access user memory? > > No, but NMI can. > > The case that would be very very hard to deal with is if we get an NMI just before IRET/SYSRET and get #MC inside that NMI. > > What we should probably do is have a percpu list of pending memory failure cleanups and just accept that we’re going to sometimes get a second MCE (or third or fourth) before we can get to it. > > Can we do the cleanup from an interrupt? IPI-to-self might be a credible approach, if so. You seem to be looking for a solution that is entirely contained within the machine check handling code. Willing to allow for repeated machine checks from the same poison address in order to achieve that. I'm opposed to mutliple machine checks. Willing to make some changes in core code to avoid repeated access to the same poison location. We need a tie-breaker. -Tony
> On Jan 12, 2021, at 5:50 PM, Luck, Tony <tony.luck@intel.com> wrote: > > On Tue, Jan 12, 2021 at 02:04:55PM -0800, Andy Lutomirski wrote: >>> But we know that the fault happend in a get_user() or copy_from_user() call >>> (i.e. an RIP with an extable recovery address). Does context switch >>> access user memory? >> >> No, but NMI can. >> >> The case that would be very very hard to deal with is if we get an NMI just before IRET/SYSRET and get #MC inside that NMI. >> >> What we should probably do is have a percpu list of pending memory failure cleanups and just accept that we’re going to sometimes get a second MCE (or third or fourth) before we can get to it. >> >> Can we do the cleanup from an interrupt? IPI-to-self might be a credible approach, if so. > > You seem to be looking for a solution that is entirely contained within > the machine check handling code. Willing to allow for repeated machine > checks from the same poison address in order to achieve that. > > I'm opposed to mutliple machine checks. Willing to make some changes > in core code to avoid repeated access to the same poison location. How about more questions before the showdown? If we made core code changes to avoid this, what would they be? We really can do user access from NMI and #DB, and those can happen in horrible places. We could have the pagetable lock held, be in the middle of CoWing the very page we tripped over, etc. I think we really can’t count on being able to write to the PTEs from #MC. Similarly, we might have IRQs off, so we can’t broadcast a TLB flush. And we might be in the middle of entry, exit, or CR3 switches, and I don’t see a particularly clean way to write CR3 without risking accidentally returning to user mode with the wrong CR3. So I’m sort of at a loss as to what we can do. All user memory accessors can handle failure, and they will all avoid infinite looping. If we can tolerate repeated MCE, we can survive. But there’s not a whole lot we can do from these horrible contexts. Hmm. Maybe if we have SMAP we could play with EFLAGS.AC? I can imagine this having various regrettable side effects, though.
On Tue, Jan 12, 2021 at 08:15:46PM -0800, Andy Lutomirski wrote:
> So I’m sort of at a loss as to what we can do.
I think you guys have veered off into the weeds with this. The current
solution - modulo error messages not destined for humans :) - is not soo
bad, considering the whole MCA trainwreck.
Or am I missing something completely unacceptable?
> I think you guys have veered off into the weeds with this. The current > solution - modulo error messages not destined for humans :) - is not soo > bad, considering the whole MCA trainwreck. > > Or am I missing something completely unacceptable? Maybe the other difference in approach between Andy and me is whether to go for a solution that covers all the corner cases, or just make an incremental improvement that allows for recover in some useful subset of remaining fatal cases, but still dies in other cases. I'm happy to replace error messages with ones that are more descriptive and helpful to humans. -Tony
On Wed, Jan 13, 2021 at 04:06:09PM +0000, Luck, Tony wrote: > Maybe the other difference in approach between Andy and me is whether to > go for a solution that covers all the corner cases, or just make an incremental > improvement that allows for recover in some useful subset of remaining fatal > cases, but still dies in other cases. Does that mean more core code surgery? > I'm happy to replace error messages with ones that are more descriptive and > helpful to humans. Yap, that: "Multiple copyin" with something more understandable to users... Thx.
>> Maybe the other difference in approach between Andy and me is whether to >> go for a solution that covers all the corner cases, or just make an incremental >> improvement that allows for recover in some useful subset of remaining fatal >> cases, but still dies in other cases. > > Does that mean more core code surgery? Yes. I need to look at other user access inside pagefault_disable/enable() as likely spots where the code may continue after a machine check and retry the access. So expect some more "if (ret == ENXIO) { do something to give up gracefully }" >> I'm happy to replace error messages with ones that are more descriptive and >> helpful to humans. > > Yap, that: "Multiple copyin" with something more understandable to users... I'll work on it. We tend not to have essay length messages as panic() strings. But I can add a comment in the code there so that people who grep whatever panic message we choose can get more details on what happened and what to do. -Tony
On Wed, Jan 13, 2021 at 04:32:54PM +0000, Luck, Tony wrote: > I'll work on it. We tend not to have essay length messages as panic() strings. But I can > add a comment in the code there so that people who grep whatever panic message > we choose can get more details on what happened and what to do. I did not mean to write an essay but to simply make it more palatable for "normal" users. Something like: "PANIC: Second machine check exception while accessing user data." or along those lines to explain *why* the machine panics.
On Mon, Jan 11, 2021 at 01:44:50PM -0800, Tony Luck wrote: > @@ -1431,8 +1433,11 @@ noinstr void do_machine_check(struct pt_regs *regs) > mce_panic("Failed kernel mode recovery", &m, msg); > } > > - if (m.kflags & MCE_IN_KERNEL_COPYIN) > + if (m.kflags & MCE_IN_KERNEL_COPYIN) { > + if (current->mce_busy) > + mce_panic("Multiple copyin", &m, msg); So this: we're currently busy handling the first MCE, why do we must panic? Can we simply ignore all follow-up MCEs to that page? I.e., the page will get poisoned eventually and that poisoning is currently executing so all following MCEs are simply nothing new and we can ignore them. It's not like we're going to corrupt more data - we already are "corrupting" whole 4K. Am I making sense? Because if we do this, we won't have to pay attention to any get_user() callers and whatnot - we simply ignore and the solution is simple and you won't have to touch any get_user() callers... Hmmm?
On Thu, Jan 14, 2021 at 09:22:13PM +0100, Borislav Petkov wrote: > On Mon, Jan 11, 2021 at 01:44:50PM -0800, Tony Luck wrote: > > @@ -1431,8 +1433,11 @@ noinstr void do_machine_check(struct pt_regs *regs) > > mce_panic("Failed kernel mode recovery", &m, msg); > > } > > > > - if (m.kflags & MCE_IN_KERNEL_COPYIN) > > + if (m.kflags & MCE_IN_KERNEL_COPYIN) { > > + if (current->mce_busy) > > + mce_panic("Multiple copyin", &m, msg); > > So this: we're currently busy handling the first MCE, why do we must > panic? > > Can we simply ignore all follow-up MCEs to that page? If we s/all/some/ you are saying the same as Andy: > So I tend to think that the machine check code should arrange to > survive some reasonable number of duplicate machine checks. > I.e., the page will get poisoned eventually and that poisoning is > currently executing so all following MCEs are simply nothing new and we > can ignore them. > > It's not like we're going to corrupt more data - we already are > "corrupting" whole 4K. > > Am I making sense? > > Because if we do this, we won't have to pay attention to any get_user() > callers and whatnot - we simply ignore and the solution is simple and > you won't have to touch any get_user() callers... Changing get_user() is a can of worms. I don't think its a very big can. Perhaps two or three dozen places where code needs to change to account for the -ENXIO return ... but touching a bunch of different subsystems it is likley to take a while to get everyone in agreement. I'll try out this new approach, and if it works, I'll post a v3 patch. Thanks -Tony
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 13d3f1cbda17..1bf11213e093 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -1246,6 +1246,7 @@ static void kill_me_maybe(struct callback_head *cb) struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me); int flags = MF_ACTION_REQUIRED; + p->mce_busy = 0; pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr); if (!p->mce_ripv) @@ -1268,6 +1269,7 @@ static void kill_me_maybe(struct callback_head *cb) static void queue_task_work(struct mce *m, int kill_current_task) { + current->mce_busy = 1; current->mce_addr = m->addr; current->mce_kflags = m->kflags; current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV); @@ -1431,8 +1433,11 @@ noinstr void do_machine_check(struct pt_regs *regs) mce_panic("Failed kernel mode recovery", &m, msg); } - if (m.kflags & MCE_IN_KERNEL_COPYIN) + if (m.kflags & MCE_IN_KERNEL_COPYIN) { + if (current->mce_busy) + mce_panic("Multiple copyin", &m, msg); queue_task_work(&m, kill_current_task); + } } out: mce_wrmsrl(MSR_IA32_MCG_STATUS, 0); diff --git a/include/linux/sched.h b/include/linux/sched.h index 6e3a5eeec509..a763a76eac57 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1360,7 +1360,8 @@ struct task_struct { u64 mce_addr; __u64 mce_ripv : 1, mce_whole_page : 1, - __mce_reserved : 62; + mce_busy : 1, + __mce_reserved : 61; struct callback_head mce_kill_me; #endif
Recovery action when get_user() triggers a machine check uses the fixup path to make get_user() return -EFAULT. Also queue_task_work() sets up so that kill_me_maybe() will be called on return to user mode to send a SIGBUS to the current process. But there are places in the kernel where the code assumes that this EFAULT return was simply because of a page fault. The code takes some action to fix that, and then retries the access. This results in a second machine check. While processing this second machine check queue_task_work() is called again. But since this uses the same callback_head structure that was used in the first call, the net result is an entry on the current->task_works list that points to itself. When task_work_run() is called it loops forever in this code: do { next = work->next; work->func(work); work = next; cond_resched(); } while (work); Add a "mce_busy" flag bit to detect this situation and panic when it happens. Signed-off-by: Tony Luck <tony.luck@intel.com> --- arch/x86/kernel/cpu/mce/core.c | 7 ++++++- include/linux/sched.h | 3 ++- 2 files changed, 8 insertions(+), 2 deletions(-)