Message ID | 20210818002942.1607544-2-tony.luck@intel.com
---|---
State | New
Series | [v2,1/3] x86/mce: Avoid infinite loop for copy from user recovery
On Tue, Aug 17, 2021 at 05:29:40PM -0700, Tony Luck wrote:
> @@ -1287,17 +1291,34 @@ static void kill_me_maybe(struct callback_head *cb)
>  	}
>  }
>  
> -static void queue_task_work(struct mce *m, int kill_current_task)
> +static void queue_task_work(struct mce *m, char *msg, int kill_current_task)
>  {
> -	current->mce_addr = m->addr;
> -	current->mce_kflags = m->kflags;
> -	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> -	current->mce_whole_page = whole_page(m);
> +	int count = ++current->mce_count;
>  
> -	if (kill_current_task)
> -		current->mce_kill_me.func = kill_me_now;
> -	else
> -		current->mce_kill_me.func = kill_me_maybe;
> +	/* First call, save all the details */
> +	if (count == 1) {
> +		current->mce_addr = m->addr;
> +		current->mce_kflags = m->kflags;
> +		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> +		current->mce_whole_page = whole_page(m);
> +
> +		if (kill_current_task)
> +			current->mce_kill_me.func = kill_me_now;
> +		else
> +			current->mce_kill_me.func = kill_me_maybe;
> +	}
> +
> +	/* Ten is likley overkill. Don't expect more than two faults before task_work() */

"likely"

> +	if (count > 10)
> +		mce_panic("Too many machine checks while accessing user data", m, msg);

Ok, aren't we too nasty here? Why should we panic the whole box even
with 10 MCEs? It is still user memory...

IOW, why not:

	if (count > 10)
		current->mce_kill_me.func = kill_me_now;

and when we return, that user process dies immediately.

> +	/* Second or later call, make sure page address matches the one from first call */
> +	if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
> +		mce_panic("Machine checks to different user pages", m, msg);

Same question here.
On Fri, Aug 20, 2021 at 07:31:43PM +0200, Borislav Petkov wrote:
> On Tue, Aug 17, 2021 at 05:29:40PM -0700, Tony Luck wrote:
> > +	/* Ten is likley overkill. Don't expect more than two faults before task_work() */
>
> "likely"

Oops.

> > +	if (count > 10)
> > +		mce_panic("Too many machine checks while accessing user data", m, msg);
>
> Ok, aren't we too nasty here? Why should we panic the whole box even
> with 10 MCEs? It is still user memory...
>
> IOW, why not:
>
> 	if (count > 10)
> 		current->mce_kill_me.func = kill_me_now;
>
> and when we return, that user process dies immediately.

It's the "when we return" part that is the problem here. Logical
trace looks like:

user-syscall:

  kernel does get_user() or copyin(), hits user poison address

    machine check
	sees that this was kernel get_user()/copyin() and
	uses extable to "return" to exception path

  still in kernel, see that get_user() or copyin() failed

  Kernel does another get_user() or copyin() (maybe the first
  was inside a pagefault_disable() region, and kernel is trying
  again to see if the error was a fixable page fault. But that
  wasn't the problem so ...

    machine check
	sees that this was kernel get_user()/copyin() and
	uses extable to "return" to exception path

  still in kernel ... but persistently thinks that just trying
  again might fix it.

    machine check
	sees that this was kernel get_user()/copyin() and
	uses extable to "return" to exception path

  still in kernel ... this time for sure! get_user()

    machine check
	sees that this was kernel get_user()/copyin() and
	uses extable to "return" to exception path

  still in kernel ... but you may see the pattern

  get_user()

    machine check
	sees that this was kernel get_user()/copyin() and
	uses extable to "return" to exception path

  I'm bored typing this, but the kernel may not ever give up

    machine check
	sees that this was kernel get_user()/copyin() and
	uses extable to "return" to exception path

I.e. the kernel doesn't ever get to call current->mce_kill_me.func()

I do have tests that show as many as 4 consecutive machine checks
before the kernel gives up trying and returns to the user to complete
recovery.

Maybe the message could be clearer?

	mce_panic("Too many consecutive machine checks in kernel while accessing user data", m, msg);

> > +	/* Second or later call, make sure page address matches the one from first call */
> > +	if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
> > +		mce_panic("Machine checks to different user pages", m, msg);
>
> Same question here.

Not quite the same answer ... but similar. We could in theory handle
multiple different machine check addresses by turning the "mce_addr"
field in the task structure into an array and saving each address so
that when the kernel eventually gives up poking at poison and tries
to return to user kill_me_maybe() could loop through them and deal
with each poison page.

I don't think this can happen. Jue Wang suggested that multiple
poisoned pages passed to a single write(2) syscall might trigger this
panic (and because of a bug in my earlier version, he managed to
trigger this "different user pages" panic). But this fixed up version
survives the "Jue test".

-Tony
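A minimal sketch of the access pattern described above (illustrative only; the
function name, variable names and surrounding context are assumed, not lifted
from the kernel source). Both accesses land on the same poisoned user page, so
each one takes its own machine check and the extable fixup merely turns it into
-EFAULT:

	/*
	 * Hypothetical illustration of the retry pattern.  The first access
	 * runs with page faults disabled; when it "fails" the caller assumes
	 * an ordinary page fault and retries with faults enabled.  Both
	 * accesses hit the same poison, so both take a machine check before
	 * the queued task_work ever gets a chance to run.
	 */
	static int read_user_word(u32 __user *uaddr, u32 *val)
	{
		int ret;

		pagefault_disable();
		ret = __get_user(*val, uaddr);		/* #MC #1, fixed up to -EFAULT */
		pagefault_enable();

		if (ret)				/* looks like a page fault ...  */
			ret = get_user(*val, uaddr);	/* #MC #2 on the same poison    */

		return ret;
	}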
On Fri, Aug 20, 2021 at 11:59:45AM -0700, Luck, Tony wrote:
> It's the "when we return" part that is the problem here. Logical
> trace looks like:
>
> user-syscall:
>
>   kernel does get_user() or copyin(), hits user poison address
>
>     machine check
> 	sees that this was kernel get_user()/copyin() and
> 	uses extable to "return" to exception path
>
>   still in kernel, see that get_user() or copyin() failed
>
>   Kernel does another get_user() or copyin() (maybe the first

I forgot all the details we were talking about at the time, but there's
no way to tell the kernel to back off here, is there?

As in: there was an MCE while trying to access this user memory, you
should not do get_user anymore. You did add that

	 * Return zero to pretend that this copy succeeded. This
	 * is counter-intuitive, but needed to prevent the code
	 * in lib/iov_iter.c from retrying and running back into

which you're removing with the last patch so I'm confused.

IOW, the problem is that with repeated MCEs while the kernel is
accessing that memory, it should be the kernel which should back off.
And then we should kill that process too but apparently we don't even
come to that.

> Maybe the message could be clearer?
>
> 	mce_panic("Too many consecutive machine checks in kernel while accessing user data", m, msg);

That's not my point - it is rather: this is a recoverable error because
it is in user memory even if it is the kernel which tries to access it.
And maybe we should not panic the whole box but try to cordon off the
faulty memory only and poison it after having killed the process using
it...

> Not quite the same answer ... but similar. We could in theory handle
> multiple different machine check addresses by turning the "mce_addr"
> field in the task structure into an array and saving each address so
> that when the kernel eventually gives up poking at poison and tries
> to return to user kill_me_maybe() could loop through them and deal
> with each poison page.

Yes, I like the aspect of making the kernel give up poking at poison
and when we return we should kill the process and poison all pages
collected so that the error source is hopefully contained.

But again, I think the important thing is how to make the kernel back
off quicker so that we can poison the pages at all...
On Fri, Aug 20, 2021 at 09:27:44PM +0200, Borislav Petkov wrote:
> On Fri, Aug 20, 2021 at 11:59:45AM -0700, Luck, Tony wrote:
> > It's the "when we return" part that is the problem here. Logical
> > trace looks like:
> >
> > user-syscall:
> >
> >   kernel does get_user() or copyin(), hits user poison address
> >
> >     machine check
> > 	sees that this was kernel get_user()/copyin() and
> > 	uses extable to "return" to exception path
> >
> >   still in kernel, see that get_user() or copyin() failed
> >
> >   Kernel does another get_user() or copyin() (maybe the first
>
> I forgot all the details we were talking about at the time, but there's
> no way to tell the kernel to back off here, is there?
>
> As in: there was an MCE while trying to access this user memory, you
> should not do get_user anymore. You did add that
>
> 	 * Return zero to pretend that this copy succeeded. This
> 	 * is counter-intuitive, but needed to prevent the code
> 	 * in lib/iov_iter.c from retrying and running back into
>
> which you're removing with the last patch so I'm confused.
>
> IOW, the problem is that with repeated MCEs while the kernel is
> accessing that memory, it should be the kernel which should back off.
> And then we should kill that process too but apparently we don't even
> come to that.

My first foray into this just tried to fix the futex() case ... which
is one where the first copy is done with pagefault_disable() set, then
tried again with it clear.

My attempt there was to make get_user() return -EHWPOISON, and change
the futex code to give up immediately when it saw that code.

AndyL (and maybe others) barfed all over that (and rightly so ... there
are thousands of get_user() and copyin() calls ... making sure all of
them did the right thing with a new error code would be a huge effort).
Very error prone (because testing all these paths is hard).

New direction was to just deal with the fact that we might take more
than one machine check before the kernel is finished poking at the
poison.

> > Maybe the message could be clearer?
> >
> > 	mce_panic("Too many consecutive machine checks in kernel while accessing user data", m, msg);
>
> That's not my point - it is rather: this is a recoverable error because
> it is in user memory even if it is the kernel which tries to access it.
> And maybe we should not panic the whole box but try to cordon off the
> faulty memory only and poison it after having killed the process using
> it...

To recover we need to have some other place to jump to (besides the
normal extable error return ... which isn't working if we find ourselves
in this situation) when we hit a fault covered by an extable entry. And
also know how many machine checks is "normal" before taking the other
path.

For futex(2) things resolve in two machine checks (one with
pagefault_disable() and then one without). For write(2) I see up to
four machine checks (I didn't do a detailed trace ... but I think it is
because copyin() does a retry to see if a failure in a
many-bytes-at-a-time copy might be able to get a few more bytes by
doing byte-at-a-time).

The logical spot for the alternate return would be to unwind the stack
back to the syscall entry point, and then force an error return from
there. But that seems a perilous path ... what if some code between
syscall entry and the copyin() grabbed a mutex? We would also need to
know about that and release it as part of the recovery.

Another failed approach was to mark the page not present in the page
tables of the process accessing the poison. That would get us out of
the machine check loop. But walking page tables and changing them while
still in machine check context can't be done in a safe way (the process
may be multi-threaded and other threads could still be running on other
cores).

Bottom line is that I don't think this panic can actually happen unless
there is some buggy kernel code that retries get_user() or copyin()
indefinitely.

Probably the same for the two different addresses case ... though I'm
not 100% confident about that. There could be some ioctl() that peeks
at two parts of a passed in structure, and the user might pass in a
structure that spans across a page boundary with both pages poisoned.
But that would only hit if the driver code ignored the failure of the
first get_user() and blindly tried the second. So I'd count that as a
critically bad driver bug.

-Tony
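For reference, a rough sketch of the abandoned -EHWPOISON idea mentioned above
(hypothetical code; the helper name is invented for illustration and nothing
like this was merged). The pain point is that every get_user()/copyin() caller
would need to learn the new error code:

	ret = __get_user(uval, uaddr);			/* under pagefault_disable()          */
	if (ret == -EHWPOISON)				/* hypothetical poison return value   */
		return ret;				/* give up immediately, do not retry  */
	if (ret)					/* ordinary -EFAULT: fixable fault    */
		ret = fault_in_and_retry(uaddr, &uval);	/* assumed helper: retry with page    */
							/* faults enabled                     */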
On Fri, Aug 20, 2021 at 09:27:44PM +0200, Borislav Petkov wrote:
> On Fri, Aug 20, 2021 at 11:59:45AM -0700, Luck, Tony wrote:
> As in: there was an MCE while trying to access this user memory, you
> should not do get_user anymore. You did add that
>
> 	 * Return zero to pretend that this copy succeeded. This
> 	 * is counter-intuitive, but needed to prevent the code
> 	 * in lib/iov_iter.c from retrying and running back into
>
> which you're removing with the last patch so I'm confused.

Forgot to address this part in the earlier reply.

My original code that forced a zero return was a hack. It allowed
recovery to complete, but only because there was going to be a SIGBUS.
There were some unpleasant side effects. E.g. on a write syscall the
file size was updated as if the write had succeeded. That would be very
confusing for anyone trying to clean up afterwards as the file would
have good data that was copied from the user up to the point where the
machine check interrupted things. Then NUL bytes after (because the
kernel clears pages that are allocated into the page cache).

The new version (thanks to Al fixing iov_iter.c) now does exactly what
POSIX says should happen. If I have a buffer with poison at offset 213,
and I do this:

	ret = write(fd, buf, 512);

Then the return from write is 213, and the first 213 bytes from the
buffer appear in the file, and the file size is incremented by 213
(assuming the write started with the lseek offset at the original size
of the file).

-Tony
On Fri, Aug 20, 2021 at 1:25 PM Luck, Tony <tony.luck@intel.com> wrote:
> Probably the same for the two different addresses case ... though I'm
> not 100% confident about that. There could be some ioctl() that peeks
> at two parts of a passed in structure, and the user might pass in a
> structure that spans across a page boundary with both pages poisoned.
> But that would only hit if the driver code ignored the failure of the
> first get_user() and blindly tried the second. So I'd count that as a
> critically bad driver bug.

Or maybe driver writers are just evil :-(

	for (i = 0; i < len; i++) {
		tx_wait(10);
		get_user(dsp56k_host_interface.data.b[1], bin++);
		get_user(dsp56k_host_interface.data.b[2], bin++);
		get_user(dsp56k_host_interface.data.b[3], bin++);
	}

-Tony
On Fri, Aug 20, 2021 at 09:51:41PM -0700, Tony Luck wrote:
> On Fri, Aug 20, 2021 at 1:25 PM Luck, Tony <tony.luck@intel.com> wrote:
> > Probably the same for the two different addresses case ... though I'm
> > not 100% confident about that. There could be some ioctl() that peeks
> > at two parts of a passed in structure, and the user might pass in a
> > structure that spans across a page boundary with both pages poisoned.
> > But that would only hit if the driver code ignored the failure of the
> > first get_user() and blindly tried the second. So I'd count that as a
> > critically bad driver bug.
>
> Or maybe driver writers are just evil :-(
>
> 	for (i = 0; i < len; i++) {
> 		tx_wait(10);
> 		get_user(dsp56k_host_interface.data.b[1], bin++);
> 		get_user(dsp56k_host_interface.data.b[2], bin++);
> 		get_user(dsp56k_host_interface.data.b[3], bin++);
> 	}

Almost any unchecked get_user()/put_user() is a bug. Fortunately,
there's not a lot of them <greps> 93 for put_user() and 73 for
get_user(). _Some_ of the former variety might be legitimate, but most
should be taken out and shot.

And dsp56k should be taken out and shot, period ;-/ This is far from
the worst in there...
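For contrast with the dsp56k loop quoted above, a sketch of the checked variant
being asked for (a hypothetical rewrite for illustration, not a proposed dsp56k
patch): bail out on the first get_user() failure instead of blindly issuing the
next access, which on poisoned memory would mean yet another machine check:

	for (i = 0; i < len; i++) {
		tx_wait(10);
		/* stop at the first failed copy instead of ploughing on */
		if (get_user(dsp56k_host_interface.data.b[1], bin++) ||
		    get_user(dsp56k_host_interface.data.b[2], bin++) ||
		    get_user(dsp56k_host_interface.data.b[3], bin++))
			return -EFAULT;
	}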
On Fri, Aug 20, 2021 at 01:23:46PM -0700, Luck, Tony wrote:
> To recover we need to have some other place to jump to (besides the
> normal extable error return ... which isn't working if we find ourselves
> in this situation) when we hit a fault covered by an extable entry. And
> also know how many machine checks is "normal" before taking the other path.

Hohumm, we're on the same page here.

...

> Bottom line is that I don't think this panic can actually happen unless
> there is some buggy kernel code that retries get_user() or copyin()
> indefinitely.

You know how such statements of "well, this should not really happen in
practice" get disproved by, well, practice. :-)

I guess we'll see anyway what actually happens in practice.

> Probably the same for the two different addresses case ... though I'm
> not 100% confident about that. There could be some ioctl() that peeks
> at two parts of a passed in structure, and the user might pass in a
> structure that spans across a page boundary with both pages poisoned.
> But that would only hit if the driver code ignored the failure of the
> first get_user() and blindly tried the second. So I'd count that as a
> critically bad driver bug.

Right.
On Fri, Aug 20, 2021 at 01:33:56PM -0700, Luck, Tony wrote:
> The new version (thanks to Al fixing iov_iter.c) now does
> exactly what POSIX says should happen. If I have a buffer
> with poison at offset 213, and I do this:
>
> 	ret = write(fd, buf, 512);
>
> Then the return from write is 213, and the first 213 bytes
> from the buffer appear in the file, and the file size is
> incremented by 213 (assuming the write started with the lseek
> offset at the original size of the file).

... and the user still gets a SIGBUS so that it gets a chance to handle
the encountered poison? I.e., not retry the write for the remaining
512 - 213 bytes?

If so, do we document that somewhere so that application writers can
know what they should do in such cases?
On Sun, Aug 22, 2021 at 04:46:14PM +0200, Borislav Petkov wrote:
> On Fri, Aug 20, 2021 at 01:33:56PM -0700, Luck, Tony wrote:
> > The new version (thanks to Al fixing iov_iter.c) now does
> > exactly what POSIX says should happen. If I have a buffer
> > with poison at offset 213, and I do this:
> >
> > 	ret = write(fd, buf, 512);
> >
> > Then the return from write is 213, and the first 213 bytes
> > from the buffer appear in the file, and the file size is
> > incremented by 213 (assuming the write started with the lseek
> > offset at the original size of the file).
>
> ... and the user still gets a SIGBUS so that it gets a chance to handle
> the encountered poison? I.e., not retry the write for the remaining
> 512 - 213 bytes?

Whether the user gets a SIGBUS depends on what they do next. In a
typical user loop trying to do a write:

	while (nbytes) {
		ret = write(fd, buf, nbytes);
		if (ret == -1)
			return ret;
		buf += ret;
		nbytes -= ret;
	}

The next iteration after the short write caused by the machine check
will return ret == -1, errno = EFAULT.

Andy Lutomirski convinced me that the kernel should not send a SIGBUS
to an application when the kernel accesses the poison in user memory.
If the user tries to access the page with the poison directly they'll
get a SIGBUS (page was unmapped so user gets a #PF, but the x86 fault
handler sees that the page was unmapped because of poison, so sends a
SIGBUS).

> If so, do we document that somewhere so that application writers can
> know what they should do in such cases?

Applications see a failed write ... they should do whatever they would
normally do for a failed write.

-Tony
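A short user-space illustration of the behaviour described above (hypothetical
snippet; it assumes the poisoned-page-at-offset-213 setup from earlier in the
thread). The kernel's own access only produces a short write; the SIGBUS comes
later, if and when user space touches the poisoned page directly:

	ssize_t ret = write(fd, buf, 512);	/* returns 213: short write, no signal */

	char c = *(volatile char *)(buf + 213);	/* direct user access: the page has been
						   unmapped by recovery, so this faults
						   and the #PF handler delivers SIGBUS */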
On Tue, Aug 17, 2021 at 05:29:40PM -0700, Tony Luck wrote:
> Recovery action when get_user() triggers a machine check uses the fixup
> path to make get_user() return -EFAULT. Also queue_task_work() sets up
> so that kill_me_maybe() will be called on return to user mode to send
> a SIGBUS to the current process.
>
> But there are places in the kernel where the code assumes that this
> EFAULT return was simply because of a page fault. The code takes some
> action to fix that, and then retries the access. This results in a second
> machine check.
>
> While processing this second machine check queue_task_work() is called
> again. But since this uses the same callback_head structure that was used
> in the first call, the net result is an entry on the current->task_works
> list that points to itself. When task_work_run() is called it loops
> forever in this code:
>
> 	do {
> 		next = work->next;
> 		work->func(work);
> 		work = next;
> 		cond_resched();
> 	} while (work);
>
> Add a counter (current->mce_count) to keep track of repeated machine
> checks before task_work() is called. First machine check saves the address
> information and calls task_work_add(). Subsequent machine checks before
> that task_work call back is executed check that the address is in the
> same page as the first machine check (since the callback will offline
> exactly one page).
>
> Expected worst case is two machine checks before moving on (e.g. one user
> access with page faults disabled, then a repeat to the same address with
> page faults enabled). Just in case there is some code that loops forever
> enforce a limit of 10.
>
> Cc: <stable@vger.kernel.org>

What about a Fixes: tag?

I guess backporting this to the respective kernels is predicated upon
the existence of those other "places" in the kernel where code assumes
the EFAULT was because of a #PF. Hmmm?
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 22791aadc085..94830ee9581c 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1250,6 +1250,9 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
 
 static void kill_me_now(struct callback_head *ch)
 {
+	struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);
+
+	p->mce_count = 0;
 	force_sig(SIGBUS);
 }
 
@@ -1259,6 +1262,7 @@ static void kill_me_maybe(struct callback_head *cb)
 	int flags = MF_ACTION_REQUIRED;
 	int ret;
 
+	p->mce_count = 0;
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 
 	if (!p->mce_ripv)
@@ -1287,17 +1291,34 @@ static void kill_me_maybe(struct callback_head *cb)
 	}
 }
 
-static void queue_task_work(struct mce *m, int kill_current_task)
+static void queue_task_work(struct mce *m, char *msg, int kill_current_task)
 {
-	current->mce_addr = m->addr;
-	current->mce_kflags = m->kflags;
-	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
-	current->mce_whole_page = whole_page(m);
+	int count = ++current->mce_count;
 
-	if (kill_current_task)
-		current->mce_kill_me.func = kill_me_now;
-	else
-		current->mce_kill_me.func = kill_me_maybe;
+	/* First call, save all the details */
+	if (count == 1) {
+		current->mce_addr = m->addr;
+		current->mce_kflags = m->kflags;
+		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
+		current->mce_whole_page = whole_page(m);
+
+		if (kill_current_task)
+			current->mce_kill_me.func = kill_me_now;
+		else
+			current->mce_kill_me.func = kill_me_maybe;
+	}
+
+	/* Ten is likley overkill. Don't expect more than two faults before task_work() */
+	if (count > 10)
+		mce_panic("Too many machine checks while accessing user data", m, msg);
+
+	/* Second or later call, make sure page address matches the one from first call */
+	if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
+		mce_panic("Machine checks to different user pages", m, msg);
+
+	/* Do not call task_work_add() more than once */
+	if (count > 1)
+		return;
 
 	task_work_add(current, &current->mce_kill_me, TWA_RESUME);
 }
@@ -1435,7 +1456,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		/* If this triggers there is no way to recover. Die hard. */
 		BUG_ON(!on_thread_stack() || !user_mode(regs));
 
-		queue_task_work(&m, kill_current_task);
+		queue_task_work(&m, msg, kill_current_task);
 
 	} else {
 		/*
@@ -1453,7 +1474,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		}
 
 		if (m.kflags & MCE_IN_KERNEL_COPYIN)
-			queue_task_work(&m, kill_current_task);
+			queue_task_work(&m, msg, kill_current_task);
 	}
 out:
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ec8d07d88641..f6935787e7e8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1394,6 +1394,7 @@ struct task_struct {
 			mce_whole_page : 1,
 			__mce_reserved : 62;
 	struct callback_head	mce_kill_me;
+	int			mce_count;
 #endif
 
 #ifdef CONFIG_KRETPROBES
Recovery action when get_user() triggers a machine check uses the fixup
path to make get_user() return -EFAULT. Also queue_task_work() sets up
so that kill_me_maybe() will be called on return to user mode to send
a SIGBUS to the current process.

But there are places in the kernel where the code assumes that this
EFAULT return was simply because of a page fault. The code takes some
action to fix that, and then retries the access. This results in a second
machine check.

While processing this second machine check queue_task_work() is called
again. But since this uses the same callback_head structure that was used
in the first call, the net result is an entry on the current->task_works
list that points to itself. When task_work_run() is called it loops
forever in this code:

	do {
		next = work->next;
		work->func(work);
		work = next;
		cond_resched();
	} while (work);

Add a counter (current->mce_count) to keep track of repeated machine
checks before task_work() is called. First machine check saves the address
information and calls task_work_add(). Subsequent machine checks before
that task_work call back is executed check that the address is in the
same page as the first machine check (since the callback will offline
exactly one page).

Expected worst case is two machine checks before moving on (e.g. one user
access with page faults disabled, then a repeat to the same address with
page faults enabled). Just in case there is some code that loops forever
enforce a limit of 10.

Cc: <stable@vger.kernel.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mce/core.c | 43 +++++++++++++++++++++++++---------
 include/linux/sched.h          |  1 +
 2 files changed, 33 insertions(+), 11 deletions(-)
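To make the self-referencing task_works entry described in the commit message
concrete, here is a small stand-alone sketch (user-space code with a
deliberately simplified callback_head and list head; not kernel code) showing
how reusing the same callback_head on the second machine check produces an
entry that points at itself, so the task_work_run() walk never terminates:

	#include <stdio.h>

	struct callback_head {
		struct callback_head *next;
	};

	int main(void)
	{
		struct callback_head mce_kill_me = { .next = NULL };
		struct callback_head *head = NULL;	/* stands in for current->task_works */

		/* First machine check: task_work_add() prepends the callback. */
		mce_kill_me.next = head;
		head = &mce_kill_me;

		/* Second machine check reuses the same callback_head ... */
		mce_kill_me.next = head;		/* ... so it now points at itself */
		head = &mce_kill_me;

		/* The do/while walk in task_work_run() would never reach NULL. */
		printf("self-loop: %s\n", head->next == head ? "yes" : "no");
		return 0;
	}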