
[v4] x86/mce: Avoid infinite loop for copy from user recovery

Message ID 20210115205103.GA5920@agluck-desk2.amr.corp.intel.com (mailing list archive)
State New, archived
Series: [v4] x86/mce: Avoid infinite loop for copy from user recovery

Commit Message

Tony Luck Jan. 15, 2021, 8:51 p.m. UTC
Recovery action when get_user() triggers a machine check uses the fixup
path to make get_user() return -EFAULT. queue_task_work() also sets things
up so that kill_me_maybe() will be called on return to user mode to send a
SIGBUS to the current process.

But there are places in the kernel where the code assumes that this
EFAULT return was simply because of a page fault. The code takes some
action to fix that, and then retries the access. This results in a second
machine check.

While processing this second machine check queue_task_work() is called
again. But since this uses the same callback_head structure that
was used in the first call, the net result is an entry on the
current->task_works list that points to itself. When task_work_run()
is called it loops forever in this code:

		do {
			next = work->next;
			work->func(work);
			work = next;
			cond_resched();
		} while (work);
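
The self-referencing entry comes from reusing the same callback_head. A
simplified sketch of the task_work_add() insertion (the real code in
kernel/task_work.c also handles the work_exited case and notification):

		do {
			head = READ_ONCE(task->task_works);
			work->next = head;	/* 2nd add of same entry: head == work, so work->next = work */
		} while (cmpxchg(&task->task_works, head, work) != head);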

Add a counter (current->mce_count) to keep track of repeated machine checks
before the task_work callback is called. The first machine check saves the
address information and calls task_work_add(). Subsequent machine checks that
arrive before the callback runs check that the address is in the same page as
the first machine check (since the callback will offline exactly one page).

The expected worst case is two machine checks before moving on (e.g. one user
access with page faults disabled, then a repeat to the same address with
page faults enabled). Just in case there is some code that loops forever,
enforce a limit of 10.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---

V4:
	Fix bug with "||" where I meant "&&"
	Update stale commit comment referring to mce_busy field
	in task structure (now called mce_count).
	Add some comments to queue_task_work()

 arch/x86/kernel/cpu/mce/core.c | 31 ++++++++++++++++++++++++-------
 include/linux/sched.h          |  1 +
 2 files changed, 25 insertions(+), 7 deletions(-)

Comments

Tony Luck Jan. 15, 2021, 11:23 p.m. UTC | #1
On Fri, Jan 15, 2021 at 12:51:03PM -0800, Luck, Tony wrote:
>  static void kill_me_now(struct callback_head *ch)
>  {
> +	p->mce_count = 0;
>  	force_sig(SIGBUS);
>  }

Brown paper bag time ... I just pasted that line from kill_me_maybe()
and I thought I did a re-compile ... but obviously not since it gives

error: ‘p’ undeclared (first use in this function)

Option a) (just like kill_me_maybe)

struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);

Option b) (simpler ... not sure why PeterZ did the container_of thing)

	current->mce_count = 0;

-Tony
Borislav Petkov Jan. 19, 2021, 10:56 a.m. UTC | #2
On Fri, Jan 15, 2021 at 03:23:46PM -0800, Luck, Tony wrote:
> On Fri, Jan 15, 2021 at 12:51:03PM -0800, Luck, Tony wrote:
> >  static void kill_me_now(struct callback_head *ch)
> >  {
> > +	p->mce_count = 0;
> >  	force_sig(SIGBUS);
> >  }
> 
> Brown paper bag time ... I just pasted that line from kill_me_maybe()
> and I thought I did a re-compile ... but obviously not since it gives
> 
> error: ‘p’ undeclared (first use in this function)
> 
> Option a) (just like kill_me_maybe)
> 
> struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
> 
> Option b) (simpler ... not sure why PeterZ did the container_of thing)
> 
> 	current->mce_count = 0;

Right, he says it is the canonical way to get it out of callback_head.
I don't think current will change while the #MC handler runs but we can
adhere to the design pattern here and do container_of() ...
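
For reference, container_of() is plain pointer arithmetic from the embedded
member back to its enclosing structure; in simplified form (the kernel macro
adds type checking on top of this):

	#define container_of(ptr, type, member) \
		((type *)((char *)(ptr) - offsetof(type, member)))

	/* so in the callback: */
	struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);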
Tony Luck Jan. 19, 2021, 11:57 p.m. UTC | #3
On Tue, Jan 19, 2021 at 11:56:32AM +0100, Borislav Petkov wrote:
> On Fri, Jan 15, 2021 at 03:23:46PM -0800, Luck, Tony wrote:
> > On Fri, Jan 15, 2021 at 12:51:03PM -0800, Luck, Tony wrote:
> > >  static void kill_me_now(struct callback_head *ch)
> > >  {
> > > +	p->mce_count = 0;
> > >  	force_sig(SIGBUS);
> > >  }
> > 
> > Brown paper bag time ... I just pasted that line from kill_me_maybe()
> > and I thought I did a re-compile ... but obviously not since it gives
> > 
> > error: ‘p’ undeclared (first use in this function)
> > 
> > Option a) (just like kill_me_maybe)
> > 
> > struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
> > 
> > Option b) (simpler ... not sure why PeterZ did the container_of thing)
> > 
> > 	current->mce_count = 0;
> 
> Right, he says it is the canonical way to get it out of callback_head.
> I don't think current will change while the #MC handler runs but we can
> adhere to the design pattern here and do container_of() ...

Ok ... I'll use the canonical way.

But now I've run into a weird issue. I'd run some basic tests with a
dozen machine checks in each of:
1) user access
2) kernel copyin
3) futex (multiple accesses from kernel before task_work())

and it passed my tests before I posted.

But the real validation folks took my patch and found that it has
destabilized cases 1 & 2 (and case 3 also chokes if you repeat a few
more times). System either hangs or panics. Generally before 100
injection/consumption cycles.

Their tests are still just doing one at a time (i.e. complete recovery
of one machine check before injecting the next error). So there aren't
any complicated race conditions.

So if you see anything obviously broken, let me know. Otherwise I'll
be poking around at the patch to figure out what is wrong.

-Tony
Borislav Petkov Jan. 20, 2021, 12:18 p.m. UTC | #4
On Tue, Jan 19, 2021 at 03:57:59PM -0800, Luck, Tony wrote:
> But the real validation folks took my patch and found that it has
> destabilized cases 1 & 2 (and case 3 also chokes if you repeat a few
> more times). System either hangs or panics. Generally before 100
> injection/consumption cycles.

Hmm, that mce_count increment and check stuff might turn out to be
fragile in the face of a race condition. But I don't see it...

> So if you see anything obviously broken, let me know. Otherwise I'll
> be poking around at the patch to figure out what is wrong.

Yah, some printk sprinkling might be a good start. But debugging in that
atomic context is always nasty. ;-\
Tony Luck Jan. 20, 2021, 5:17 p.m. UTC | #5
> Yah, some printk sprinkling might be a good start. But debugging in that
> atomic context is always nasty. ;-\

Some very light printk sprinkling (one message in queue_task_work() in atomic
context, one each in kill_me_now() and kill_me_maybe() to check when task_work
actually called them).

Cases 1 & 2 (user & normal copyin) now work just fine (hundreds of iterations).

Case 3 (my futex test) just hangs, with only two characters ("[ ") making it to the serial port.

Deeply strange.

-Tony
Tony Luck Jan. 21, 2021, 9:09 p.m. UTC | #6
On Wed, Jan 20, 2021 at 01:18:12PM +0100, Borislav Petkov wrote:
> On Tue, Jan 19, 2021 at 03:57:59PM -0800, Luck, Tony wrote:
> > But the real validation folks took my patch and found that it has
> > destabilized cases 1 & 2 (and case 3 also chokes if you repeat a few
> > more times). System either hangs or panics. Generally before 100
> > injection/consumption cycles.
> 
> Hmm, that mce_count increment and check stuff might turn out to be
> fragile in the face of a race condition. But I don't see it...

I can't see what the race is either. As far as I can see the flow
looks like this (for a machine check in user mode):

	user-mode

		do_machine_check() [in atomic context on IST etc.]

			queue_task_work()
				increments current->mce_count
				first (only) call saves address etc. and calls task_work_add()
		return from machine check

Plausibly we might context switch to another process here. We could even
resume on a different CPU.

		task_work_run() [in process context]

			kill_me_maybe()
				sets mce_count back to zero, ready for another #MC
				memory_failure()
				send SIGBUS

The kernel get_user() case is similar, but may take extra machine checks
before the calling code decides get_user() really is going to keep saying
-EFAULT.
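
For illustration, the pattern that generates the repeat looks roughly like
this (a sketch modeled on the futex fault path, not the exact kernel code):

	u32 val;
	int ret;

	pagefault_disable();
	ret = __get_user(val, uaddr);	/* #MC #1: fixup path returns -EFAULT */
	pagefault_enable();

	if (ret) {
		/* caller assumes a plain page fault, retries with faults enabled */
		if (get_user(val, uaddr))	/* #MC #2: same page as the first */
			return -EFAULT;
	}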


But, on a whim, I changed the type of mce_count from "int" to "atomic_t" and
fixed up the increment & clear to use atomic_inc_return() and atomic_set().
See updated patch below.

I can't see why this should have made any difference. But the "int" version
crashes in a few dozen machine checks. The "atomic_t" version has run many
thousands of machine checks (10213 in the current boot according to /proc/interrupts)
with no issues.

-Tony

From d1b003f1de92f87c8dc53dd710fd79ad6e277364 Mon Sep 17 00:00:00 2001
From: Tony Luck <tony.luck@intel.com>
Date: Wed, 20 Jan 2021 12:40:50 -0800
Subject: [PATCH] x86/mce: Avoid infinite loop for copy from user recovery

Recovery action when get_user() triggers a machine check uses the fixup
path to make get_user() return -EFAULT. queue_task_work() also sets things
up so that kill_me_maybe() will be called on return to user mode to send a
SIGBUS to the current process.

But there are places in the kernel where the code assumes that this
EFAULT return was simply because of a page fault. The code takes some
action to fix that, and then retries the access. This results in a second
machine check.

While processing this second machine check queue_task_work() is called
again. But since this uses the same callback_head structure that
was used in the first call, the net result is an entry on the
current->task_works list that points to itself. When task_work_run()
is called it loops forever in this code:

		do {
			next = work->next;
			work->func(work);
			work = next;
			cond_resched();
		} while (work);

Add a counter (current->mce_count) to keep track of repeated machine checks
before the task_work callback is called. The first machine check saves the
address information and calls task_work_add(). Subsequent machine checks that
arrive before the callback runs check that the address is in the same page as
the first machine check (since the callback will offline exactly one page).

The expected worst case is two machine checks before moving on (e.g. one user
access with page faults disabled, then a repeat to the same address with
page faults enabled). Just in case there is some code that loops forever,
enforce a limit of 10.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mce/core.c | 35 +++++++++++++++++++++++++++-------
 include/linux/sched.h          |  1 +
 2 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 13d3f1cbda17..4a8660058b67 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1238,6 +1238,9 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
 
 static void kill_me_now(struct callback_head *ch)
 {
+	struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);
+
+	atomic_set(&p->mce_count, 0);
 	force_sig(SIGBUS);
 }
 
@@ -1246,6 +1249,7 @@ static void kill_me_maybe(struct callback_head *cb)
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
 
+	atomic_set(&p->mce_count, 0);
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 
 	if (!p->mce_ripv)
@@ -1266,12 +1270,29 @@ static void kill_me_maybe(struct callback_head *cb)
 	}
 }
 
-static void queue_task_work(struct mce *m, int kill_current_task)
+static void queue_task_work(struct mce *m, char *msg, int kill_current_task)
 {
-	current->mce_addr = m->addr;
-	current->mce_kflags = m->kflags;
-	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
-	current->mce_whole_page = whole_page(m);
+	int count = atomic_inc_return(&current->mce_count);
+
+	/* First call, save all the details */
+	if (count == 1) {
+		current->mce_addr = m->addr;
+		current->mce_kflags = m->kflags;
+		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
+		current->mce_whole_page = whole_page(m);
+	}
+
+	/* Ten is likely overkill. Don't expect more than two faults before task_work() */
+	if (count > 10)
+		mce_panic("Too many machine checks while accessing user data", m, msg);
+
+	/* Second or later call, make sure page address matches the one from first call */
+	if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
+		mce_panic("Machine checks to different user pages", m, msg);
+
+	/* Do not call task_work_add() more than once */
+	if (count > 1)
+		return;
 
 	if (kill_current_task)
 		current->mce_kill_me.func = kill_me_now;
@@ -1414,7 +1435,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		/* If this triggers there is no way to recover. Die hard. */
 		BUG_ON(!on_thread_stack() || !user_mode(regs));
 
-		queue_task_work(&m, kill_current_task);
+		queue_task_work(&m, msg, kill_current_task);
 
 	} else {
 		/*
@@ -1432,7 +1453,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		}
 
 		if (m.kflags & MCE_IN_KERNEL_COPYIN)
-			queue_task_work(&m, kill_current_task);
+			queue_task_work(&m, msg, kill_current_task);
 	}
 out:
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e3a5eeec509..5727d59ab737 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1362,6 +1362,7 @@ struct task_struct {
 					mce_whole_page : 1,
 					__mce_reserved : 62;
 	struct callback_head		mce_kill_me;
+	atomic_t			mce_count;
 #endif
 
 #ifdef CONFIG_KRETPROBES

Patch

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 13d3f1cbda17..5460c146edb5 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1238,6 +1238,7 @@  static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
 
 static void kill_me_now(struct callback_head *ch)
 {
+	p->mce_count = 0;
 	force_sig(SIGBUS);
 }
 
@@ -1246,6 +1247,7 @@  static void kill_me_maybe(struct callback_head *cb)
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
 
+	p->mce_count = 0;
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 
 	if (!p->mce_ripv)
@@ -1266,12 +1268,27 @@  static void kill_me_maybe(struct callback_head *cb)
 	}
 }
 
-static void queue_task_work(struct mce *m, int kill_current_task)
+static void queue_task_work(struct mce *m, char *msg, int kill_current_task)
 {
-	current->mce_addr = m->addr;
-	current->mce_kflags = m->kflags;
-	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
-	current->mce_whole_page = whole_page(m);
+	/* First call, save all the details */
+	if (current->mce_count++ == 0) {
+		current->mce_addr = m->addr;
+		current->mce_kflags = m->kflags;
+		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
+		current->mce_whole_page = whole_page(m);
+	}
+
+	/* Ten is likely overkill. Don't expect more than two faults before task_work() */
+	if (current->mce_count > 10)
+		mce_panic("Too many machine checks while accessing user data", m, msg);
+
+	/* Second or later call, make sure page address matches the one from first call */
+	if (current->mce_count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
+		mce_panic("Machine checks to different user pages", m, msg);
+
+	/* Do not call task_work_add() more than once */
+	if (current->mce_count > 1)
+		return;
 
 	if (kill_current_task)
 		current->mce_kill_me.func = kill_me_now;
@@ -1414,7 +1431,7 @@  noinstr void do_machine_check(struct pt_regs *regs)
 		/* If this triggers there is no way to recover. Die hard. */
 		BUG_ON(!on_thread_stack() || !user_mode(regs));
 
-		queue_task_work(&m, kill_current_task);
+		queue_task_work(&m, msg, kill_current_task);
 
 	} else {
 		/*
@@ -1432,7 +1449,7 @@  noinstr void do_machine_check(struct pt_regs *regs)
 		}
 
 		if (m.kflags & MCE_IN_KERNEL_COPYIN)
-			queue_task_work(&m, kill_current_task);
+			queue_task_work(&m, msg, kill_current_task);
 	}
 out:
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e3a5eeec509..386366c9c757 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1362,6 +1362,7 @@  struct task_struct {
 					mce_whole_page : 1,
 					__mce_reserved : 62;
 	struct callback_head		mce_kill_me;
+	int				mce_count;
 #endif
 
 #ifdef CONFIG_KRETPROBES