| Message ID | 1454976038-22486-1-git-send-email-toshi.kani@hpe.com (mailing list archive) |
|---|---|
| State | Accepted |
| Commit | f4eafd8bcd52 |
* Toshi Kani <toshi.kani@hpe.com> wrote:

> Since 4.1, ioremap() supports large page (pud/pmd) mappings in x86_64 and
> PAE. vmalloc_fault() however assumes that the vmalloc range is limited to
> pte mappings.
>
> pgd_ctor() sets the kernel's pgd entries to user's during fork(), which
> makes user processes share the same page tables for the kernel ranges.
> When a call to ioremap() is made at run-time that leads to allocate a new
> 2nd level table (pud in 64-bit and pmd in PAE), user process needs to
> re-sync with the updated kernel pgd entry with vmalloc_fault().
>
> Following changes are made to vmalloc_fault().

So what were the effects of this shortcoming? Were large page ioremap()s
unusable? Was this harmless because no driver used this facility?

If so then the changelog needs to spell this out clearly ...

Thanks,

	Ingo
On Tue, 9 Feb 2016 10:10:03 +0100
Ingo Molnar <mingo@kernel.org> wrote:

> * Toshi Kani <toshi.kani@hpe.com> wrote:
>
> > Since 4.1, ioremap() supports large page (pud/pmd) mappings in
> > x86_64 and PAE. vmalloc_fault() however assumes that the vmalloc
> > range is limited to pte mappings.
> >
> > pgd_ctor() sets the kernel's pgd entries to user's during fork(),
> > which makes user processes share the same page tables for the
> > kernel ranges. When a call to ioremap() is made at run-time that
> > leads to allocate a new 2nd level table (pud in 64-bit and pmd in
> > PAE), user process needs to re-sync with the updated kernel pgd
> > entry with vmalloc_fault().
> >
> > Following changes are made to vmalloc_fault().
>
> So what were the effects of this shortcoming? Were large page
> ioremap()s unusable? Was this harmless because no driver used this
> facility?

Drivers do use huge ioremap()s. Now if a pre-existing mm is used to
access the device memory, a #PF occurs and the resulting call to
vmalloc_fault() would eventually make the kernel treat device memory as
if it were a pagetable.
The results are illegal reads/writes on iomem and dereferencing iomem
content as if it were a pointer to a lower-level pagetable.
- #PF if you are lucky
- funny modification of arbitrary memory possible
- can be abused with uio or regular userland ??

Henning

> If so then the changelog needs to spell this out clearly ...
>
> Thanks,
>
> 	Ingo
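To make the failure mode above concrete, here is a minimal, self-contained userspace sketch (plain C, not kernel code; the bit names, address mask and the 0x480000000 MMIO address are illustrative assumptions) of what goes wrong when a page-table walker does not stop at a huge pud entry: the physical address the entry maps, which is device memory for an ioremapped range, gets treated as the base of a pmd table.

```c
#include <stdio.h>
#include <stdint.h>

#define PAGE_PSE  (1ULL << 7)                 /* "maps a huge page" bit, as on x86 */
#define ADDR_MASK 0x000ffffffffff000ULL       /* physical-address bits of an entry */

/* Walk one pud entry; 'check_huge' selects pre-fix vs. post-fix behaviour. */
static void walk_pud(uint64_t pud_entry, int check_huge)
{
    uint64_t phys = pud_entry & ADDR_MASK;

    if (check_huge && (pud_entry & PAGE_PSE)) {
        /* Post-fix behaviour: a huge pud terminates the walk. */
        printf("huge pud: mapping ends here, phys base 0x%llx\n",
               (unsigned long long)phys);
        return;
    }
    /* Pre-fix behaviour: the walker would now read a "pmd table" at this
     * physical address, i.e. dereference device memory as page-table data. */
    printf("descending into a 'pmd table' at phys 0x%llx (iomem!)\n",
           (unsigned long long)phys);
}

int main(void)
{
    /* A present, huge pud entry for a 1 GB ioremap of MMIO (made-up address). */
    uint64_t huge_pud = 0x480000000ULL | PAGE_PSE | 1ULL;

    walk_pud(huge_pud, 0);   /* what vmalloc_fault() effectively did before the fix */
    walk_pud(huge_pud, 1);   /* what it does with the pud_huge()/pmd_huge() checks */
    return 0;
}
```

The patch later in this thread adds exactly this kind of early return (pud_huge()/pmd_huge()) to the real vmalloc_fault() walk.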
* Henning Schild <henning.schild@siemens.com> wrote:

> On Tue, 9 Feb 2016 10:10:03 +0100
> Ingo Molnar <mingo@kernel.org> wrote:
>
> > * Toshi Kani <toshi.kani@hpe.com> wrote:
> >
> > > Since 4.1, ioremap() supports large page (pud/pmd) mappings in
> > > x86_64 and PAE. vmalloc_fault() however assumes that the vmalloc
> > > range is limited to pte mappings.
> > >
> > > pgd_ctor() sets the kernel's pgd entries to user's during fork(),
> > > which makes user processes share the same page tables for the
> > > kernel ranges. When a call to ioremap() is made at run-time that
> > > leads to allocate a new 2nd level table (pud in 64-bit and pmd in
> > > PAE), user process needs to re-sync with the updated kernel pgd
> > > entry with vmalloc_fault().
> > >
> > > Following changes are made to vmalloc_fault().
> >
> > So what were the effects of this shortcoming? Were large page
> > ioremap()s unusable? Was this harmless because no driver used this
> > facility?
>
> Drivers do use huge ioremap()s. Now if a pre-existing mm is used to
> access the device memory a #PF and the call to vmalloc_fault would
> eventually make the kernel treat device memory as if it was a
> pagetable.
> The results are illegal reads/writes on iomem and dereferencing iomem
> content like it was a pointer to a lower level pagetable.
> - #PF if you are lucky
> - funny modification of arbitrary memory possible
> - can be abused with uio or regular userland ??

Ok, so this is a serious live bug exposed to drivers, that also requires
a Cc: stable tag.

All of this should have been in the changelog!

Thanks,

	Ingo
On Tue, 9 Feb 2016 11:22:35 +0100
Ingo Molnar <mingo@kernel.org> wrote:

> * Henning Schild <henning.schild@siemens.com> wrote:
>
> > On Tue, 9 Feb 2016 10:10:03 +0100
> > Ingo Molnar <mingo@kernel.org> wrote:
> >
> > > * Toshi Kani <toshi.kani@hpe.com> wrote:
> > >
> > > > Since 4.1, ioremap() supports large page (pud/pmd) mappings in
> > > > x86_64 and PAE. vmalloc_fault() however assumes that the vmalloc
> > > > range is limited to pte mappings.
> > > >
> > > > pgd_ctor() sets the kernel's pgd entries to user's during
> > > > fork(), which makes user processes share the same page tables
> > > > for the kernel ranges. When a call to ioremap() is made at
> > > > run-time that leads to allocate a new 2nd level table (pud in
> > > > 64-bit and pmd in PAE), user process needs to re-sync with the
> > > > updated kernel pgd entry with vmalloc_fault().
> > > >
> > > > Following changes are made to vmalloc_fault().
> > >
> > > So what were the effects of this shortcoming? Were large page
> > > ioremap()s unusable? Was this harmless because no driver used this
> > > facility?
> >
> > Drivers do use huge ioremap()s. Now if a pre-existing mm is used to
> > access the device memory a #PF and the call to vmalloc_fault would
> > eventually make the kernel treat device memory as if it was a
> > pagetable.
> > The results are illegal reads/writes on iomem and dereferencing
> > iomem content like it was a pointer to a lower level pagetable.
> > - #PF if you are lucky
> > - funny modification of arbitrary memory possible
> > - can be abused with uio or regular userland ??

Looking over the code again, I am not sure the last two are even
possible; it is just the pointer deref that can cause a #PF.
If the pointer turns out to "work", the code will just read and
eventually BUG().

> Ok, so this is a serious live bug exposed to drivers, that also
> requires a Cc: stable tag.
>
> All of this should have been in the changelog!
>
> Thanks,
>
> 	Ingo
On Tue, 2016-02-09 at 10:10 +0100, Ingo Molnar wrote:
> * Toshi Kani <toshi.kani@hpe.com> wrote:
>
> > Since 4.1, ioremap() supports large page (pud/pmd) mappings in x86_64
> > and PAE. vmalloc_fault() however assumes that the vmalloc range is
> > limited to pte mappings.
> >
> > pgd_ctor() sets the kernel's pgd entries to user's during fork(), which
> > makes user processes share the same page tables for the kernel
> > ranges. When a call to ioremap() is made at run-time that leads to
> > allocate a new 2nd level table (pud in 64-bit and pmd in PAE), user
> > process needs to re-sync with the updated kernel pgd entry with
> > vmalloc_fault().
> >
> > Following changes are made to vmalloc_fault().
>
> So what were the effects of this shortcoming? Were large page ioremap()s
> unusable? Was this harmless because no driver used this facility?
>
> If so then the changelog needs to spell this out clearly ...

Large page support of ioremap() has been used for persistent memory
mappings for a while. In order to hit this problem, i.e. causing a
vmalloc fault, a large amount of ioremap allocations at run-time is
required. The following example repeats allocation of a 16GB range.

    # cat /proc/vmallocinfo | grep memremap
    0xffffc90040000000-0xffffc90440001000 17179873280 memremap+0xb4/0x110 phys=480000000 ioremap
    0xffffc90480000000-0xffffc90880001000 17179873280 memremap+0xb4/0x110 phys=480000000 ioremap
    0xffffc908c0000000-0xffffc90cc0001000 17179873280 memremap+0xb4/0x110 phys=c80000000 ioremap
    0xffffc90d00000000-0xffffc91100001000 17179873280 memremap+0xb4/0x110 phys=c80000000 ioremap
    0xffffc91140000000-0xffffc91540001000 17179873280 memremap+0xb4/0x110 phys=480000000 ioremap
     :
    0xffffc97300000000-0xffffc97700001000 17179873280 memremap+0xb4/0x110 phys=c80000000 ioremap
    0xffffc97740000000-0xffffc97b40001000 17179873280 memremap+0xb4/0x110 phys=480000000 ioremap
    0xffffc97b80000000-0xffffc97f80001000 17179873280 memremap+0xb4/0x110 phys=c80000000 ioremap
    0xffffc97fc0000000-0xffffc983c0001000 17179873280 memremap+0xb4/0x110 phys=480000000 ioremap

The last ioremap call above crossed a 512GB boundary (0x8000000000),
which allocated a new pud table and updated the kernel pgd entry to
point to it. Because the user process's page table does not have this
pgd entry update, a read/write syscall request to the range will hit a
vmalloc fault. Since vmalloc_fault() does not handle a large page
properly, this causes an Oops as follows.

    BUG: unable to handle kernel paging request at ffff880840000ff8
    IP: [<ffffffff810664ae>] vmalloc_fault+0x1be/0x300
    PGD c7f03a067 PUD 0
    Oops: 0000 [#1] SM
     :
    Call Trace:
     [<ffffffff81067335>] __do_page_fault+0x285/0x3e0
     [<ffffffff810674bf>] do_page_fault+0x2f/0x80
     [<ffffffff810d6d85>] ? put_prev_entity+0x35/0x7a0
     [<ffffffff817a6888>] page_fault+0x28/0x30
     [<ffffffff813bb976>] ? memcpy_erms+0x6/0x10
     [<ffffffff817a0845>] ? schedule+0x35/0x80
     [<ffffffffa006350a>] ? pmem_rw_bytes+0x6a/0x190 [nd_pmem]
     [<ffffffff817a3713>] ? schedule_timeout+0x183/0x240
     [<ffffffffa028d2b3>] btt_log_read+0x63/0x140 [nd_btt]
     :
     [<ffffffff811201d0>] ? __symbol_put+0x60/0x60
     [<ffffffff8122dc60>] ? kernel_read+0x50/0x80
     [<ffffffff81124489>] SyS_finit_module+0xb9/0xf0
     [<ffffffff817a4632>] entry_SYSCALL_64_fastpath+0x1a/0xa4

Note that this issue is limited to 64-bit. 32-bit only uses index 3 of
the pgd entry to cover the entire vmalloc range, which is always valid.

I will add this information to the change log.

Thanks,
-Toshi
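To see why that last range crosses into a new pgd slot, here is a small worked example (plain C, assuming the standard x86_64 4-level constants PGDIR_SHIFT = 39 and 512 pgd entries, which the mail does not spell out):

```c
#include <stdio.h>
#include <stdint.h>

#define PGDIR_SHIFT  39      /* each x86_64 pgd entry covers 512 GB */
#define PTRS_PER_PGD 512

static unsigned int pgd_index(uint64_t addr)
{
    return (addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1);
}

int main(void)
{
    /* Start and end of the last memremap range from the listing above. */
    uint64_t start = 0xffffc97fc0000000ULL;
    uint64_t end   = 0xffffc983c0001000ULL;

    printf("pgd_index(start) = %u\n", pgd_index(start));   /* prints 402 */
    printf("pgd_index(end)   = %u\n", pgd_index(end));     /* prints 403 */
    return 0;
}
```

The start and end land in different pgd slots, so the mapping needed a pud table for the new slot; ioremap() installed it only in the kernel's pgd, and any user page table still holding the stale copy takes a vmalloc fault on first access to that range.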
On Tue, 2016-02-09 at 13:26 +0100, Henning Schild wrote:
> On Tue, 9 Feb 2016 11:22:35 +0100
> Ingo Molnar <mingo@kernel.org> wrote:
>
> > * Henning Schild <henning.schild@siemens.com> wrote:
> >
> > > On Tue, 9 Feb 2016 10:10:03 +0100
> > > Ingo Molnar <mingo@kernel.org> wrote:
> > >
> > > > * Toshi Kani <toshi.kani@hpe.com> wrote:
> > > >
> > > > > Since 4.1, ioremap() supports large page (pud/pmd) mappings in
> > > > > x86_64 and PAE. vmalloc_fault() however assumes that the vmalloc
> > > > > range is limited to pte mappings.
> > > > >
> > > > > pgd_ctor() sets the kernel's pgd entries to user's during
> > > > > fork(), which makes user processes share the same page tables
> > > > > for the kernel ranges. When a call to ioremap() is made at
> > > > > run-time that leads to allocate a new 2nd level table (pud in
> > > > > 64-bit and pmd in PAE), user process needs to re-sync with the
> > > > > updated kernel pgd entry with vmalloc_fault().
> > > > >
> > > > > Following changes are made to vmalloc_fault().
> > > >
> > > > So what were the effects of this shortcoming? Were large page
> > > > ioremap()s unusable? Was this harmless because no driver used this
> > > > facility?
> > >
> > > Drivers do use huge ioremap()s. Now if a pre-existing mm is used to
> > > access the device memory a #PF and the call to vmalloc_fault would
> > > eventually make the kernel treat device memory as if it was a
> > > pagetable.
> > > The results are illegal reads/writes on iomem and dereferencing
> > > iomem content like it was a pointer to a lower level pagetable.
> > > - #PF if you are lucky

#PF -> vmalloc_fault -> oops

> > > - funny modification of arbitrary memory possible
> > > - can be abused with uio or regular userland ??
>
> Looking over the code again I am not sure the last two are even
> possible, it is just the pointer deref that can cause a #PF.
> If the pointer turns out to "work" the code will just read and
> eventually BUG().

The last two cases are not possible.

> > Ok, so this is a serious live bug exposed to drivers, that also
> > requires a Cc: stable tag.

Yes, the fix should go to stable as well.

Thanks,
-Toshi
```diff
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index eef44d9..e830c71 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -287,6 +287,9 @@ static noinline int vmalloc_fault(unsigned long address)
 	if (!pmd_k)
 		return -1;
 
+	if (pmd_huge(*pmd_k))
+		return 0;
+
 	pte_k = pte_offset_kernel(pmd_k, address);
 	if (!pte_present(*pte_k))
 		return -1;
@@ -360,8 +363,6 @@ void vmalloc_sync_all(void)
  * 64-bit:
  *
  * Handle a fault on the vmalloc area
- *
- * This assumes no large pages in there.
  */
 static noinline int vmalloc_fault(unsigned long address)
 {
@@ -403,17 +404,23 @@ static noinline int vmalloc_fault(unsigned long address)
 	if (pud_none(*pud_ref))
 		return -1;
 
-	if (pud_none(*pud) || pud_page_vaddr(*pud) != pud_page_vaddr(*pud_ref))
+	if (pud_none(*pud) || pud_pfn(*pud) != pud_pfn(*pud_ref))
 		BUG();
 
+	if (pud_huge(*pud))
+		return 0;
+
 	pmd = pmd_offset(pud, address);
 	pmd_ref = pmd_offset(pud_ref, address);
 	if (pmd_none(*pmd_ref))
 		return -1;
 
-	if (pmd_none(*pmd) || pmd_page(*pmd) != pmd_page(*pmd_ref))
+	if (pmd_none(*pmd) || pmd_pfn(*pmd) != pmd_pfn(*pmd_ref))
 		BUG();
 
+	if (pmd_huge(*pmd))
+		return 0;
+
 	pte_ref = pte_offset_kernel(pmd_ref, address);
 	if (!pte_present(*pte_ref))
 		return -1;
```
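Restated as a compact, standalone C model (illustrative bit names and masks, not the kernel's actual types), the per-level logic the patch establishes in the 64-bit path is: a missing init_mm entry means the fault is not a vmalloc fault, a present entry whose pfn disagrees with init_mm is a bug, a huge entry ends the walk successfully, and only a regular table entry descends another level.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define ENTRY_PRESENT (1ULL << 0)
#define ENTRY_HUGE    (1ULL << 7)              /* PSE-style bit, illustrative */
#define PFN_MASK      0x000ffffffffff000ULL

enum walk { NOT_VMALLOC = -1, HANDLED = 0, DESCEND = 1 };

/* One level of the synced walk: 'entry' comes from the faulting mm,
 * 'ref' is the corresponding init_mm entry. */
static enum walk check_level(uint64_t entry, uint64_t ref)
{
    if (!(ref & ENTRY_PRESENT))
        return NOT_VMALLOC;                    /* vmalloc_fault() returns -1 */

    if (!(entry & ENTRY_PRESENT) ||
        (entry & PFN_MASK) != (ref & PFN_MASK))
        abort();                               /* kernel: BUG() on pfn mismatch */

    if (entry & ENTRY_HUGE)                    /* pud_huge() / pmd_huge() */
        return HANDLED;                        /* huge mapping: stop, return 0 */

    return DESCEND;                            /* regular table: next level down */
}

int main(void)
{
    uint64_t huge = 0x480000000ULL | ENTRY_PRESENT | ENTRY_HUGE;

    /* A synced huge entry is handled without walking any deeper. */
    printf("%d\n", check_level(huge, huge));   /* prints 0 */
    return 0;
}
```

This also mirrors why the diff replaces pud_page_vaddr()/pmd_page() with pud_pfn()/pmd_pfn(): for an ioremapped huge entry there is no direct-map virtual address or struct page to compare, but a pfn comparison is always meaningful.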
Since 4.1, ioremap() supports large page (pud/pmd) mappings in x86_64 and
PAE. vmalloc_fault() however assumes that the vmalloc range is limited to
pte mappings.

pgd_ctor() sets the kernel's pgd entries to user's during fork(), which
makes user processes share the same page tables for the kernel ranges.
When a call to ioremap() is made at run-time that leads to allocating a
new 2nd level table (pud in 64-bit and pmd in PAE), a user process needs
to re-sync with the updated kernel pgd entry via vmalloc_fault().

The following changes are made to vmalloc_fault().

64-bit:
 - No change for the sync operation, as set_pgd() takes care of huge
   pages as well.
 - Add pud_huge() and pmd_huge() to the validation code to handle huge
   pages.
 - Change pud_page_vaddr() to pud_pfn() since an ioremap range is not
   directly mapped (although the if-statement still works with a bogus
   addr).
 - Change pmd_page() to pmd_pfn() since an ioremap range is not backed by
   struct page (although the if-statement still works with a bogus addr).

PAE:
 - No change for the sync operation since the index3 pgd entry covers the
   entire vmalloc range, which is always valid. (A separate change will be
   needed if this assumption gets changed regardless of the page size.)
 - Add pmd_huge() to the validation code to handle huge pages. This is
   only for completeness since vmalloc_fault() won't happen for ioremap'd
   ranges as its pgd entry is always valid. (I was unable to test this
   part of the changes as a result.)

Reported-by: Henning Schild <henning.schild@siemens.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@alien8.de>
---
When this patch is accepted, please copy to stable up to 4.1.
---
 arch/x86/mm/fault.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)
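A tiny worked example for the PAE note above ("the index3 pgd entry covers the entire vmalloc range"), hedged: it assumes the usual 3G/1G split in which all kernel and vmalloc addresses lie above 0xC0000000, and the two addresses are purely illustrative. Each of the four PAE top-level entries maps 1 GB, so every vmalloc address selects entry 3.

```c
#include <stdio.h>
#include <stdint.h>

#define PAE_PGDIR_SHIFT 30   /* each PAE top-level (pgd) entry maps 1 GB */

int main(void)
{
    uint32_t vmalloc_addr = 0xf7800000u;   /* some vmalloc/ioremap address */
    uint32_t fixmap_addr  = 0xfffff000u;   /* near the top of kernel space */

    printf("index(0x%08x) = %u\n", (unsigned)vmalloc_addr,
           (unsigned)(vmalloc_addr >> PAE_PGDIR_SHIFT));   /* prints 3 */
    printf("index(0x%08x) = %u\n", (unsigned)fixmap_addr,
           (unsigned)(fixmap_addr >> PAE_PGDIR_SHIFT));    /* prints 3 */
    return 0;
}
```

Because fork()-time pgd copying already covers that single entry, a PAE kernel never has to sync a new top-level entry for vmalloc/ioremap space; only the validation path needed the pmd_huge() check.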