Message ID | 1532103744-31902-2-git-send-email-joro@8bytes.org (mailing list archive) |
---|---|
State | New, archived |
> On Jul 20, 2018, at 6:22 AM, Joerg Roedel <joro@8bytes.org> wrote:
>
> From: Joerg Roedel <jroedel@suse.de>
>
> The ring-buffer is accessed in the NMI handler, so we better
> avoid faulting on it. Sync the vmalloc range with all
> page-tables in the system to make sure everyone has it mapped.
>
> This fixes a WARN_ON_ONCE() that can be triggered with PTI
> enabled on x86-32:
>
> WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230
>
> This triggers because with PTI enabled on a PAE kernel the
> PMDs are no longer shared between the page-tables, so the
> vmalloc changes do not propagate automatically.

It seems like it would be much more robust to fix the vmalloc_fault()
code instead.
On Fri, Jul 20, 2018 at 10:06:54AM -0700, Andy Lutomirski wrote:
> > On Jul 20, 2018, at 6:22 AM, Joerg Roedel <joro@8bytes.org> wrote:
> >
> > From: Joerg Roedel <jroedel@suse.de>
> >
> > The ring-buffer is accessed in the NMI handler, so we better
> > avoid faulting on it. Sync the vmalloc range with all
> > page-tables in the system to make sure everyone has it mapped.
> >
> > This fixes a WARN_ON_ONCE() that can be triggered with PTI
> > enabled on x86-32:
> >
> > WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230
> >
> > This triggers because with PTI enabled on a PAE kernel the
> > PMDs are no longer shared between the page-tables, so the
> > vmalloc changes do not propagate automatically.
>
> It seems like it would be much more robust to fix the vmalloc_fault()
> code instead.

The question is whether the NMI path is nesting-safe; if it is, we can
remove the WARN_ON_ONCE(in_nmi()) in the vmalloc_fault path. It should
be nesting-safe on x86-32 because of the way the stack-switch happens
there. If it is also nesting-safe on x86-64, the warning there can be
removed.

Or did you think of something else to fix there?

Thanks,

	Joerg
On Fri, 20 Jul 2018, Andy Lutomirski wrote:
> > On Jul 20, 2018, at 6:22 AM, Joerg Roedel <joro@8bytes.org> wrote:
> >
> > From: Joerg Roedel <jroedel@suse.de>
> >
> > The ring-buffer is accessed in the NMI handler, so we better
> > avoid faulting on it. Sync the vmalloc range with all
> > page-tables in the system to make sure everyone has it mapped.
> >
> > This fixes a WARN_ON_ONCE() that can be triggered with PTI
> > enabled on x86-32:
> >
> > WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230
> >
> > This triggers because with PTI enabled on a PAE kernel the
> > PMDs are no longer shared between the page-tables, so the
> > vmalloc changes do not propagate automatically.
>
> It seems like it would be much more robust to fix the vmalloc_fault()
> code instead.

Right, but for now the obvious fix for the issue at hand is this. We
surely should revisit it.

Thanks,

	tglx
On Fri, Jul 20, 2018 at 10:48 AM, Joerg Roedel <joro@8bytes.org> wrote:
> On Fri, Jul 20, 2018 at 10:06:54AM -0700, Andy Lutomirski wrote:
>> > On Jul 20, 2018, at 6:22 AM, Joerg Roedel <joro@8bytes.org> wrote:
>> >
>> > From: Joerg Roedel <jroedel@suse.de>
>> >
>> > The ring-buffer is accessed in the NMI handler, so we better
>> > avoid faulting on it. Sync the vmalloc range with all
>> > page-tables in the system to make sure everyone has it mapped.
>> >
>> > This fixes a WARN_ON_ONCE() that can be triggered with PTI
>> > enabled on x86-32:
>> >
>> > WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230
>> >
>> > This triggers because with PTI enabled on a PAE kernel the
>> > PMDs are no longer shared between the page-tables, so the
>> > vmalloc changes do not propagate automatically.
>>
>> It seems like it would be much more robust to fix the vmalloc_fault()
>> code instead.
>
> The question is whether the NMI path is nesting-safe; if it is, we can
> remove the WARN_ON_ONCE(in_nmi()) in the vmalloc_fault path. It should
> be nesting-safe on x86-32 because of the way the stack-switch happens
> there. If it is also nesting-safe on x86-64, the warning there can be
> removed.
>
> Or did you think of something else to fix there?

I'm just reading your changelog, and you said the PMDs are no longer
shared between the page tables. So this presumably means that
vmalloc_fault() no longer actually works correctly on PTI systems. I
didn't read the code to figure out *why* it doesn't work, but throwing
random vmalloc_sync_all() calls around is wrong.

Or maybe the bug really just is the warning. The warning can probably
go.

>
> Thanks,
>
>	Joerg
On Fri, Jul 20, 2018 at 12:27 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, 20 Jul 2018, Andy Lutomirski wrote:
>> > On Jul 20, 2018, at 6:22 AM, Joerg Roedel <joro@8bytes.org> wrote:
>> >
>> > From: Joerg Roedel <jroedel@suse.de>
>> >
>> > The ring-buffer is accessed in the NMI handler, so we better
>> > avoid faulting on it. Sync the vmalloc range with all
>> > page-tables in the system to make sure everyone has it mapped.
>> >
>> > This fixes a WARN_ON_ONCE() that can be triggered with PTI
>> > enabled on x86-32:
>> >
>> > WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230
>> >
>> > This triggers because with PTI enabled on a PAE kernel the
>> > PMDs are no longer shared between the page-tables, so the
>> > vmalloc changes do not propagate automatically.
>>
>> It seems like it would be much more robust to fix the vmalloc_fault()
>> code instead.
>
> Right, but for now the obvious fix for the issue at hand is this. We
> surely should revisit it.

If you commit this under this reasoning, then please at least make it
say:

/* XXX: The vmalloc_fault() code is buggy on PTI+PAE systems, and this
   is a workaround. */

Let's not have code in the kernel that pretends to make sense but is
actually voodoo magic that works around bugs elsewhere. It's no fun to
maintain down the road.
On Fri, 20 Jul 2018, Andy Lutomirski wrote:
> On Fri, Jul 20, 2018 at 12:27 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Fri, 20 Jul 2018, Andy Lutomirski wrote:
> >> It seems like it would be much more robust to fix the vmalloc_fault()
> >> code instead.
> >
> > Right, but for now the obvious fix for the issue at hand is this. We
> > surely should revisit it.
>
> If you commit this under this reasoning, then please at least make it say:
>
> /* XXX: The vmalloc_fault() code is buggy on PTI+PAE systems, and this
>    is a workaround. */
>
> Let's not have code in the kernel that pretends to make sense but is
> actually voodoo magic that works around bugs elsewhere. It's no fun
> to maintain down the road.

Fair enough. Lemme amend it. Joerg is looking into it, but I surely want
to get that stuff some exposure in -next ASAP.

Thanks,

	tglx
On Fri, Jul 20, 2018 at 12:32:10PM -0700, Andy Lutomirski wrote:
> I'm just reading your changelog, and you said the PMDs are no longer
> shared between the page tables. So this presumably means that
> vmalloc_fault() no longer actually works correctly on PTI systems. I
> didn't read the code to figure out *why* it doesn't work, but throwing
> random vmalloc_sync_all() calls around is wrong.

Hmm, so the whole point of vmalloc_fault() is to sync changes from
swapper_pg_dir to process page-tables when the relevant parts of the
kernel page-table are not shared, no?

That is also the reason we don't see this on 64-bit: there these parts
*are* shared.

So with that reasoning vmalloc_fault() works as designed, except that a
warning is issued when it happens in the NMI path. That warning comes
from

	ebc8827f75954 x86: Barf when vmalloc and kmemcheck faults happen in NMI

which went into 2.6.37 and was added because the NMI handlers were not
nesting-safe back then. The reason was probably that the handler on
64-bit has to use an IST stack, so a nested NMI would overwrite the
stack of the outer handler. We don't have this problem on 32-bit, as a
nested NMI will not do another stack-switch there.

I am not sure about 64-bit, but there is a lot of assembly magic to make
NMIs nesting-safe, so I guess the problem should be gone there too.

Regards,

	Joerg
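[Editorial note: the lazy-sync scheme Joerg describes can be illustrated
with a tiny user-space model. This is purely illustrative; the fake_*
names are invented here, and the real kernel walks multi-level paging
structures rather than a flat array.]

#include <stddef.h>

#define NPMD 4

/*
 * Toy model: a "page table" is just an array of top-level (PMD-like)
 * entries; 0 means "not present".
 */
struct fake_mm { unsigned long pmd[NPMD]; };

struct fake_mm master;   /* stands in for swapper_pg_dir */
struct fake_mm process;  /* a per-process page table     */

/* A vmalloc-style mapping only updates the master table... */
static void fake_vmalloc_map(size_t idx, unsigned long entry)
{
	master.pmd[idx] = entry;
}

/*
 * ...and a process table catches up lazily on first access, the way
 * vmalloc_fault() copies the missing entry on a kernel-space fault.
 */
static unsigned long fake_access(struct fake_mm *mm, size_t idx)
{
	if (mm->pmd[idx] == 0 && master.pmd[idx] != 0)
		mm->pmd[idx] = master.pmd[idx]; /* the "vmalloc fault" */
	return mm->pmd[idx];
}

With PTI on a PAE kernel the per-process copy really is separate, so
this lazy copy (i.e. taking the fault) is the only way the entry
arrives; taking that fault from an NMI handler is what trips the
WARN_ON_ONCE() discussed above.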
> On Jul 20, 2018, at 11:37 AM, Joerg Roedel <jroedel@suse.de> wrote:
>
> Hmm, so the whole point of vmalloc_fault() is to sync changes from
> swapper_pg_dir to process page-tables when the relevant parts of the
> kernel page-table are not shared, no?
>
> That is also the reason we don't see this on 64-bit: there these parts
> *are* shared.
>
> So with that reasoning vmalloc_fault() works as designed, except that a
> warning is issued when it happens in the NMI path. That warning comes
> from
>
>	ebc8827f75954 x86: Barf when vmalloc and kmemcheck faults happen in NMI
>
> which went into 2.6.37 and was added because the NMI handlers were not
> nesting-safe back then. The reason was probably that the handler on
> 64-bit has to use an IST stack, so a nested NMI would overwrite the
> stack of the outer handler. We don't have this problem on 32-bit, as a
> nested NMI will not do another stack-switch there.

Thanks for digging! The problem was presumably that vmalloc_fault()
will IRET and re-enable NMIs on the way out. But we've supported page
faults on user memory in NMI handlers on 32-bit and 64-bit for quite a
while, and it's fine now.

I would remove the warning, re-test, and revert the other patch.

The one case we can't handle in vmalloc_fault() is a fault on a stack
access. I don't expect this to be a problem for PTI. It was a problem
for CONFIG_VMAP_STACK, though.

> I am not sure about 64-bit, but there is a lot of assembly magic to make
> NMIs nesting-safe, so I guess the problem should be gone there too.
>
> Regards,
>
>	Joerg
On Fri, Jul 20, 2018 at 3:20 PM Andy Lutomirski <luto@amacapital.net> wrote:
> Thanks for digging! The problem was presumably that vmalloc_fault()
> will IRET and re-enable NMIs on the way out. But we've supported page
> faults on user memory in NMI handlers on 32-bit and 64-bit for quite a
> while, and it's fine now.
>
> I would remove the warning, re-test, and revert the other patch.

Agreed. I don't think we have any issues with page faults during NMI
any more. AFAIK the kprobes people depend on it.

That said, 64-bit mode has that scary PV-op case
(arch_flush_lazy_mmu_mode). Being PV mode, I can't find it in myself to
worry about it; I'm assuming it's ok.

	Linus
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 5d3cf40..7b0e9aa 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -814,6 +814,9 @@ static void rb_free_work(struct work_struct *work)
 	vfree(base);
 	kfree(rb);
+
+	/* Make sure buffer is unmapped in all page-tables */
+	vmalloc_sync_all();
 }
 
 void rb_free(struct ring_buffer *rb)
@@ -840,6 +843,13 @@ struct ring_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
 	if (!all_buf)
 		goto fail_all_buf;
 
+	/*
+	 * The buffer is accessed in NMI handlers, make sure it is
+	 * mapped in all page-tables in the system so that we don't
+	 * fault on the range in an NMI handler.
+	 */
+	vmalloc_sync_all();
+
 	rb->user_page = all_buf;
 	rb->data_pages[0] = all_buf + PAGE_SIZE;
 	if (nr_pages) {
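[Editorial note: the effect of the vmalloc_sync_all() calls the patch
adds can be modeled with another toy sketch (invented toy_* names, not
kernel code): syncing eagerly at allocation time pushes the new
top-level entries into every page table up front, so the later NMI-time
access never needs the lazy fault path at all.]

#define NPMD  4
#define NPROC 3

/* Same flat toy model as before: 0 means "entry not present". */
struct toy_mm { unsigned long pmd[NPMD]; };

struct toy_mm toy_master;        /* swapper_pg_dir stand-in      */
struct toy_mm toy_procs[NPROC];  /* all per-process page tables  */

/*
 * Analogue of vmalloc_sync_all(): copy every present master entry
 * into every process table, so no lazy fault is needed later.
 */
static void toy_sync_all(void)
{
	for (int p = 0; p < NPROC; p++)
		for (int i = 0; i < NPMD; i++)
			if (toy_master.pmd[i] != 0)
				toy_procs[p].pmd[i] = toy_master.pmd[i];
}

In this model, calling toy_sync_all() right after updating toy_master
mirrors the patch's call to vmalloc_sync_all() right after allocating
(or freeing) the ring buffer.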