
[v3] x86/mce: Avoid infinite loop for copy from user recovery

Message ID 20210115003817.23657-1-tony.luck@intel.com (mailing list archive)
State New, archived
Series: [v3] x86/mce: Avoid infinite loop for copy from user recovery

Commit Message

Tony Luck Jan. 15, 2021, 12:38 a.m. UTC
Recovery action when get_user() triggers a machine check uses the fixup
path to make get_user() return -EFAULT.  Also queue_task_work() sets up
so that kill_me_maybe() will be called on return to user mode to send a
SIGBUS to the current process.

But there are places in the kernel where the code assumes that this
EFAULT return was simply because of a page fault. The code takes some
action to fix that, and then retries the access. This results in a second
machine check.
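
The retry pattern in question looks roughly like this (a condensed
sketch of the futex-style fault-in-and-retry loop, not the exact kernel
source; fault_in_user_writeable() is the helper in kernel/futex.c):

	retry:
		if (get_user(uval, uaddr)) {	/* #MC -> fixup -> -EFAULT */
			/* assume it was an ordinary page fault and "fix" it */
			if (fault_in_user_writeable(uaddr))
				return -EFAULT;
			goto retry;		/* touches the poison again: 2nd #MC */
		}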

While processing this second machine check queue_task_work() is called
again. But since this uses the same callback_head structure that
was used in the first call, the net result is an entry on the
current->task_works list that points to itself. When task_work_run()
is called it loops forever in this code:

		do {
			next = work->next;
			work->func(work);
			work = next;
			cond_resched();
		} while (work);
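
That self-referencing entry is the result of queueing the same
callback_head twice (illustrative sketch, simplified from the
do_machine_check()/task_work_add() interaction):

	task_work_add(current, &current->mce_kill_me, TWA_RESUME); /* 1st #MC */
	/* ... get_user() retried, poison hit again ... */
	task_work_add(current, &current->mce_kill_me, TWA_RESUME); /* 2nd #MC */
	/*
	 * The second add sets mce_kill_me.next to the current list head,
	 * which is mce_kill_me itself, so work->next == work and the
	 * loop above never terminates.
	 */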

Add a "mce_busy" counter so that task_work_add() is only called once
per faulty page in this task.

Do not allow too many repeated machine checks, or machine checks to
a different page from the first.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---

V3: Thanks to extensive commentary from Andy & Boris

Throws out the changes to get_user() and subsequent changes to core
code. Everything is now handled in the machine check code. Downside is
that we can (and do) take multiple machine checks from a single poisoned
page before generic kernel code finally gets the message that a page is
really and truly gone (but all the failed get_user() calls still return
the legacy -EFAULT code, so none of that code will ever mistakenly use
a value from a bad page). But even on an old machine that does broadcast
interrupts for each machine check things survive multiple cycles of my
test injection into a futex operation.

I picked "10" as the magic upper limit for how many times the machine
check code will allow a fault from the same page before deciding to
panic.  We can bike shed that value if you like.

 arch/x86/kernel/cpu/mce/core.c | 27 ++++++++++++++++++++-------
 include/linux/sched.h          |  1 +
 2 files changed, 21 insertions(+), 7 deletions(-)

Comments

Borislav Petkov Jan. 15, 2021, 3:27 p.m. UTC | #1
On Thu, Jan 14, 2021 at 04:38:17PM -0800, Tony Luck wrote:
> Recovery action when get_user() triggers a machine check uses the fixup
> path to make get_user() return -EFAULT.  Also queue_task_work() sets up
> so that kill_me_maybe() will be called on return to user mode to send a
> SIGBUS to the current process.
> 
> But there are places in the kernel where the code assumes that this
> EFAULT return was simply because of a page fault. The code takes some
> action to fix that, and then retries the access. This results in a second
> machine check.
> 
> While processing this second machine check queue_task_work() is called
> again. But since this uses the same callback_head structure that
> was used in the first call, the net result is an entry on the
> current->task_works list that points to itself. When task_work_run()
> is called it loops forever in this code:
> 
> 		do {
> 			next = work->next;
> 			work->func(work);
> 			work = next;
> 			cond_resched();
> 		} while (work);
> 
> Add a "mce_busy" counter so that task_work_add() is only called once
> per faulty page in this task.

Yeah, that sentence can be removed now too.

> Do not allow too many repeated machine checks, or machine checks to
> a different page from the first.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> 
> V3: Thanks to extensive commentary from Andy & Boris
> 
> Throws out the changes to get_user() and subsequent changes to core
> code. Everything is now handled in the machine check code. Downside is
> that we can (and do) take multiple machine checks from a single poisoned
> page before generic kernel code finally gets the message that a page is
> really and truly gone (but all the failed get_user() calls still return
> the legacy -EFAULT code, so none of that code will ever mistakenly use
> a value from a bad page). But even on an old machine that does broadcast
> interrupts for each machine check things survive multiple cycles of my
> test injection into a futex operation.

Nice.

> 
> I picked "10" as the magic upper limit for how many times the machine
> check code will allow a fault from the same page before deciding to
> panic.  We can bike shed that value if you like.
> 
>  arch/x86/kernel/cpu/mce/core.c | 27 ++++++++++++++++++++-------
>  include/linux/sched.h          |  1 +
>  2 files changed, 21 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 13d3f1cbda17..25daf6517dc9 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -1246,6 +1246,7 @@ static void kill_me_maybe(struct callback_head *cb)
>  	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
>  	int flags = MF_ACTION_REQUIRED;
>  
> +	p->mce_count = 0;
>  	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
>  
>  	if (!p->mce_ripv)
> @@ -1266,12 +1267,24 @@ static void kill_me_maybe(struct callback_head *cb)
>  	}
>  }
>  
> -static void queue_task_work(struct mce *m, int kill_current_task)
> +static void queue_task_work(struct mce *m, char *msg, int kill_current_task)

So this function gets called in the user mode MCE case too:

	if ((m.cs & 3) == 3) {

		queue_task_work(&m, msg, kill_current_task);
	}

Do we want to panic for multiple MCEs to different addresses in user
mode?

I don't think so - that should go down the memory failure page
offlining path...
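
For reference, that path runs through kill_me_maybe() above -- very
roughly, condensed from the existing flow rather than quoted verbatim:

	p->mce_count = 0;			/* with this patch */
	if (!p->mce_ripv)
		flags |= MF_MUST_KILL;
	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags))
		return;				/* page poisoned and offlined */
	/* otherwise the task is killed / gets a BUS_MCEERR_AR SIGBUS */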

> -	current->mce_addr = m->addr;
> -	current->mce_kflags = m->kflags;
> -	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> -	current->mce_whole_page = whole_page(m);
> +	if (current->mce_count++ == 0) {
> +		current->mce_addr = m->addr;
> +		current->mce_kflags = m->kflags;
> +		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> +		current->mce_whole_page = whole_page(m);
> +	}
> +

	/* Magic number should be large enough */

> +	if (current->mce_count > 10)
> +		mce_panic("Too many machine checks while accessing user data", m, msg);
> +
> +	if (current->mce_count > 1 || (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
> +		mce_panic("Machine checks to different user pages", m, msg);

Will this second part of the test expression, after the "||" ever hit?

You do above in the first branch:

	if (current->mce_count++ == 0) {

		...

		current->mce_addr = m->addr;

and ->mce_count becomes 1.

In that case that

	(current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT)

gets tested but that won't ever be true because ->mce_addr = m->addr
above.

And then, for other values of mce_count, mce_count > 1 will hit.

In any case, what are you trying to catch with this? Two get_user() to
different pages both catching MCEs?

> +
> +	/* Do not call task_work_add() more than once */
> +	if (current->mce_count > 1)
> +		return;

That won't happen either, AFAICT. It'll panic above.

Regardless, I like how this is all confined to the MCE code and there's
no need to touch stuff outside...

Thx.
Tony Luck Jan. 15, 2021, 7:34 p.m. UTC | #2
On Fri, Jan 15, 2021 at 04:27:54PM +0100, Borislav Petkov wrote:
> On Thu, Jan 14, 2021 at 04:38:17PM -0800, Tony Luck wrote:
> > Add a "mce_busy" counter so that task_work_add() is only called once
> > per faulty page in this task.
> 
> Yeah, that sentence can be removed now too.

I will update with new name "mce_count" and some details.

> > -static void queue_task_work(struct mce *m, int kill_current_task)
> > +static void queue_task_work(struct mce *m, char *msg, int kill_current_task)
> 
> So this function gets called in the user mode MCE case too:
> 
> 	if ((m.cs & 3) == 3) {
> 
> 		queue_task_work(&m, msg, kill_current_task);
> 	}
> 
> Do we want to panic for multiple MCEs to different addresses in user
> mode?

In the user mode case we should only bump mce_count to "1", and that
happens before task_work() gets called. It shouldn't hurt to do the
same checks. Maybe it will catch something weird - like an NMI
handler on return from the machine check doing a get_user() that
hits another machine check during the return from this machine check.

AndyL has made me extra paranoid. :-)

> > -	current->mce_addr = m->addr;
> > -	current->mce_kflags = m->kflags;
> > -	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> > -	current->mce_whole_page = whole_page(m);
> > +	if (current->mce_count++ == 0) {
> > +		current->mce_addr = m->addr;
> > +		current->mce_kflags = m->kflags;
> > +		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> > +		current->mce_whole_page = whole_page(m);
> > +	}
> > +
> 
> 	/* Magic number should be large enough */
>
> > +	if (current->mce_count > 10)

Will add similar comment here ... and to other tests in this function
since it may not be obvious to me next year what I was thinking now :-)

> > +	if (current->mce_count > 10)
> > +		mce_panic("Too many machine checks while accessing user data", m, msg);
> > +
> > +	if (current->mce_count > 1 || (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
> > +		mce_panic("Machine checks to different user pages", m, msg);
> 
> Will this second part of the test expression, after the "||" ever hit?

No :-( This code is wrong. Should be "&&" not "||". Then it makes more sense.
Will fix for v4.
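
I.e. the intent for v4 is roughly (sketch of the corrected check, not
yet posted code):

	/* Second or later machine check, but to a *different* user page */
	if (current->mce_count > 1 &&
	    (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
		mce_panic("Machine checks to different user pages", m, msg);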

> In any case, what are you trying to catch with this? Two get_user() to
> different pages both catching MCEs?

Yes. Trying to catch two accesses to different pages. Need to do this
because kill_me_maybe() is only going to offline one page.

I'm not expecting that this would ever hit.  It means that calling code
took a machine check on one page and get_user() said -EFAULT. Then the
code decided to access a different page *and* that other page also triggered
a machine check.

> > +	/* Do not call task_work_add() more than once */
> > +	if (current->mce_count > 1)
> > +		return;
> 
> That won't happen either, AFAICT. It'll panic above.

With the s/||/&&/ above, we can get here.
> 
> Regardless, I like how this is all confined to the MCE code and there's
> no need to touch stuff outside...

Thanks for the review.

-Tony
Borislav Petkov Jan. 18, 2021, 3:39 p.m. UTC | #3
On Fri, Jan 15, 2021 at 11:34:35AM -0800, Luck, Tony wrote:
> In the user mode case we should only bump mce_count to "1", and that
> happens before task_work() gets called.

Ok, right, it should not be possible to trigger a second MCE while
queue_task_work() runs when it is a user MCE. The handler itself won't
touch the page with the hw error so our assumption is that it'll get
poisoned.

If it doesn't, I guess the memory failure code will kill the process
yadda yadda...

> It shouldn't hurt to do the same checks. Maybe it will catch something
> weird - like an NMI handler on return from the machine check doing a
> get_user() that hits another machine check during the return from this
> machine check.

Eww.

> AndyL has made me extra paranoid. :-)

Yeah, he comes up with the nuttiest scenarios. :-)

Patch

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 13d3f1cbda17..25daf6517dc9 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1246,6 +1246,7 @@  static void kill_me_maybe(struct callback_head *cb)
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
 
+	p->mce_count = 0;
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 
 	if (!p->mce_ripv)
@@ -1266,12 +1267,24 @@  static void kill_me_maybe(struct callback_head *cb)
 	}
 }
 
-static void queue_task_work(struct mce *m, int kill_current_task)
+static void queue_task_work(struct mce *m, char *msg, int kill_current_task)
 {
-	current->mce_addr = m->addr;
-	current->mce_kflags = m->kflags;
-	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
-	current->mce_whole_page = whole_page(m);
+	if (current->mce_count++ == 0) {
+		current->mce_addr = m->addr;
+		current->mce_kflags = m->kflags;
+		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
+		current->mce_whole_page = whole_page(m);
+	}
+
+	if (current->mce_count > 10)
+		mce_panic("Too many machine checks while accessing user data", m, msg);
+
+	if (current->mce_count > 1 || (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
+		mce_panic("Machine checks to different user pages", m, msg);
+
+	/* Do not call task_work_add() more than once */
+	if (current->mce_count > 1)
+		return;
 
 	if (kill_current_task)
 		current->mce_kill_me.func = kill_me_now;
@@ -1414,7 +1427,7 @@  noinstr void do_machine_check(struct pt_regs *regs)
 		/* If this triggers there is no way to recover. Die hard. */
 		BUG_ON(!on_thread_stack() || !user_mode(regs));
 
-		queue_task_work(&m, kill_current_task);
+		queue_task_work(&m, msg, kill_current_task);
 
 	} else {
 		/*
@@ -1432,7 +1445,7 @@  noinstr void do_machine_check(struct pt_regs *regs)
 		}
 
 		if (m.kflags & MCE_IN_KERNEL_COPYIN)
-			queue_task_work(&m, kill_current_task);
+			queue_task_work(&m, msg, kill_current_task);
 	}
 out:
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e3a5eeec509..386366c9c757 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1362,6 +1362,7 @@  struct task_struct {
 					mce_whole_page : 1,
 					__mce_reserved : 62;
 	struct callback_head		mce_kill_me;
+	int				mce_count;
 #endif
 
 #ifdef CONFIG_KRETPROBES