Message ID | f1ac3700970365fb979533294774af0b0dd84b3b.1554248002.git.khalid.aziz@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Series | Add support for eXclusive Page Frame Ownership |
On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote: > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 2c471a2c43fa..d17d33f36a01 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -204,6 +204,14 @@ struct page { > #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS > int _last_cpupid; > #endif > + > +#ifdef CONFIG_XPFO > + /* Counts the number of times this page has been kmapped. */ > + atomic_t xpfo_mapcount; > + > + /* Serialize kmap/kunmap of this page */ > + spinlock_t xpfo_lock; NAK, see ALLOC_SPLIT_PTLOCKS spinlock_t can be _huge_ (CONFIG_PROVE_LOCKING=y), also are you _really_ sure you want spinlock_t and not raw_spinlock_t ? For CONFIG_PREEMPT_FULL spinlock_t turns into a rtmutex. > +#endif Growing the page-frame by 8 bytes (in the good case) is really sad, that's a _lot_ of memory. > } _struct_page_alignment;
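For reference, the ALLOC_SPLIT_PTLOCKS pattern Peter points to looks roughly like this in today's struct page (shape only, simplified; an xpfo_lock would face the same constraint): when spinlock_t is too large to embed cheaply, struct page carries only a pointer and the lock itself is allocated separately on demand.

#if ALLOC_SPLIT_PTLOCKS
	spinlock_t *ptl;	/* allocated separately, e.g. via ptlock_alloc() */
#else
	spinlock_t ptl;		/* small enough to embed in struct page */
#endif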
You must be so glad I no longer use kmap_atomic from NMI context :-) On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote: > +static inline void xpfo_kmap(void *kaddr, struct page *page) > +{ > + unsigned long flags; > + > + if (!static_branch_unlikely(&xpfo_inited)) > + return; > + > + if (!PageXpfoUser(page)) > + return; > + > + /* > + * The page was previously allocated to user space, so > + * map it back into the kernel if needed. No TLB flush required. > + */ > + spin_lock_irqsave(&page->xpfo_lock, flags); > + > + if ((atomic_inc_return(&page->xpfo_mapcount) == 1) && > + TestClearPageXpfoUnmapped(page)) > + set_kpte(kaddr, page, PAGE_KERNEL); > + > + spin_unlock_irqrestore(&page->xpfo_lock, flags); That's a really sad sequence, not wrong, but sad. _3_ atomic operations, 2 on likely the same cacheline. And mostly all pointless. This patch makes xpfo_mapcount an atomic, but then all modifications are under the spinlock, what gives? Anyway, a possibly saner sequence might be: if (atomic_inc_not_zero(&page->xpfo_mapcount)) return; spin_lock_irqsave(&page->xpfo_lock, flag); if ((atomic_inc_return(&page->xpfo_mapcount) == 1) && TestClearPageXpfoUnmapped(page)) set_kpte(kaddr, page, PAGE_KERNEL); spin_unlock_irqrestore(&page->xpfo_lock, flags); > +} > + > +static inline void xpfo_kunmap(void *kaddr, struct page *page) > +{ > + unsigned long flags; > + > + if (!static_branch_unlikely(&xpfo_inited)) > + return; > + > + if (!PageXpfoUser(page)) > + return; > + > + /* > + * The page is to be allocated back to user space, so unmap it from > + * the kernel, flush the TLB and tag it as a user page. > + */ > + spin_lock_irqsave(&page->xpfo_lock, flags); > + > + if (atomic_dec_return(&page->xpfo_mapcount) == 0) { > +#ifdef CONFIG_XPFO_DEBUG > + WARN_ON(PageXpfoUnmapped(page)); > +#endif > + SetPageXpfoUnmapped(page); > + set_kpte(kaddr, page, __pgprot(0)); > + xpfo_flush_kernel_tlb(page, 0); You didn't speak about the TLB invalidation anywhere. But basically this is one that x86 cannot do. > + } > + > + spin_unlock_irqrestore(&page->xpfo_lock, flags); Idem: if (atomic_add_unless(&page->xpfo_mapcount, -1, 1)) return; .... > +} Also I'm failing to see the point of PG_xpfo_unmapped, afaict it is identical to !atomic_read(&page->xpfo_mapcount).
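Pulling Peter's two suggestions together, the revised helpers would look roughly like the sketch below. It reuses the patch's own helpers (set_kpte(), xpfo_flush_kernel_tlb(), the Xpfo page flags) and only adds the lock-free fast paths; this is a sketch of the suggested sequence, not the code as posted.

static inline void xpfo_kmap(void *kaddr, struct page *page)
{
	unsigned long flags;

	if (!static_branch_unlikely(&xpfo_inited) || !PageXpfoUser(page))
		return;

	/* Fast path: the page is already mapped, just take a reference. */
	if (atomic_inc_not_zero(&page->xpfo_mapcount))
		return;

	/* Slow path: the first mapper restores the kernel PTE. */
	spin_lock_irqsave(&page->xpfo_lock, flags);
	if ((atomic_inc_return(&page->xpfo_mapcount) == 1) &&
	    TestClearPageXpfoUnmapped(page))
		set_kpte(kaddr, page, PAGE_KERNEL);
	spin_unlock_irqrestore(&page->xpfo_lock, flags);
}

static inline void xpfo_kunmap(void *kaddr, struct page *page)
{
	unsigned long flags;

	if (!static_branch_unlikely(&xpfo_inited) || !PageXpfoUser(page))
		return;

	/* Fast path: drop a reference unless we are the last mapper. */
	if (atomic_add_unless(&page->xpfo_mapcount, -1, 1))
		return;

	/* Slow path: the last mapper unmaps the page and flushes the TLB. */
	spin_lock_irqsave(&page->xpfo_lock, flags);
	if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
		SetPageXpfoUnmapped(page);
		set_kpte(kaddr, page, __pgprot(0));
		xpfo_flush_kernel_tlb(page, 0);
	}
	spin_unlock_irqrestore(&page->xpfo_lock, flags);
}

With these fast paths, the common case of re-mapping an already-mapped page costs a single atomic instead of a lock round trip plus two more atomics.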
On Thu, Apr 04, 2019 at 09:21:52AM +0200, Peter Zijlstra wrote: > On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote: > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > > index 2c471a2c43fa..d17d33f36a01 100644 > > --- a/include/linux/mm_types.h > > +++ b/include/linux/mm_types.h > > @@ -204,6 +204,14 @@ struct page { > > #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS > > int _last_cpupid; > > #endif > > + > > +#ifdef CONFIG_XPFO > > + /* Counts the number of times this page has been kmapped. */ > > + atomic_t xpfo_mapcount; > > + > > + /* Serialize kmap/kunmap of this page */ > > + spinlock_t xpfo_lock; > > NAK, see ALLOC_SPLIT_PTLOCKS > > spinlock_t can be _huge_ (CONFIG_PROVE_LOCKING=y), also are you _really_ > sure you want spinlock_t and not raw_spinlock_t ? For > CONFIG_PREEMPT_FULL spinlock_t turns into a rtmutex. > > > +#endif > > Growing the page-frame by 8 bytes (in the good case) is really sad, > that's a _lot_ of memory. So if you use the original kmap_atomic/kmap code from i386 and create an alias per user you can do away with all that. Now, that leaves you with the fixmap kmap_atomic code, which I also hate, but it gets rid of a lot of the ugly you introduce in these here patches. As to the fixmap kmap_atomic; so fundamentally the PTEs are only used on a single CPU and therefore CPU local TLB invalidation _should_ suffice. However, speculation... Another CPU can speculatively hit upon a fixmap entry for another CPU and populate it's own TLB entry. Then the TLB invalidate is insufficient, it leaves a stale entry in a remote TLB. If the remote CPU then re-uses that fixmap slot to alias another page, we have two CPUs with different translations for the same VA, a condition that AMD CPU's dislike enough to machine check on (IIRC). Actually hitting that is incredibly difficult (we have to have speculation, fixmap reuse and not get a full TLB invalidate in between), but, afaict, not impossible. Your monstrosity from the last patch avoids this particular issue by not aliasing in this manner, but it comes at the cost of this page-frame bloat. Also, I'm still not sure there's not other problems with it. Bah..
On Thu, Apr 04, 2019 at 09:21:52AM +0200, Peter Zijlstra wrote: > On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote: > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > > index 2c471a2c43fa..d17d33f36a01 100644 > > --- a/include/linux/mm_types.h > > +++ b/include/linux/mm_types.h > > @@ -204,6 +204,14 @@ struct page { > > #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS > > int _last_cpupid; > > #endif > > + > > +#ifdef CONFIG_XPFO > > + /* Counts the number of times this page has been kmapped. */ > > + atomic_t xpfo_mapcount; > > + > > + /* Serialize kmap/kunmap of this page */ > > + spinlock_t xpfo_lock; > > NAK, see ALLOC_SPLIT_PTLOCKS > > spinlock_t can be _huge_ (CONFIG_PROVE_LOCKING=y), also are you _really_ > sure you want spinlock_t and not raw_spinlock_t ? For > CONFIG_PREEMPT_FULL spinlock_t turns into a rtmutex. > > > +#endif > > Growing the page-frame by 8 bytes (in the good case) is really sad, > that's a _lot_ of memory. Originally we had this in page_ext, it's not really clear to me why we moved it out. Julien? Tycho
On 4/4/19 1:43 AM, Peter Zijlstra wrote: > > You must be so glad I no longer use kmap_atomic from NMI context :-) > > On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote: >> +static inline void xpfo_kmap(void *kaddr, struct page *page) >> +{ >> + unsigned long flags; >> + >> + if (!static_branch_unlikely(&xpfo_inited)) >> + return; >> + >> + if (!PageXpfoUser(page)) >> + return; >> + >> + /* >> + * The page was previously allocated to user space, so >> + * map it back into the kernel if needed. No TLB flush required. >> + */ >> + spin_lock_irqsave(&page->xpfo_lock, flags); >> + >> + if ((atomic_inc_return(&page->xpfo_mapcount) == 1) && >> + TestClearPageXpfoUnmapped(page)) >> + set_kpte(kaddr, page, PAGE_KERNEL); >> + >> + spin_unlock_irqrestore(&page->xpfo_lock, flags); > > That's a really sad sequence, not wrong, but sad. _3_ atomic operations, > 2 on likely the same cacheline. And mostly all pointless. > > This patch makes xpfo_mapcount an atomic, but then all modifications are > under the spinlock, what gives? > > Anyway, a possibly saner sequence might be: > > if (atomic_inc_not_zero(&page->xpfo_mapcount)) > return; > > spin_lock_irqsave(&page->xpfo_lock, flag); > if ((atomic_inc_return(&page->xpfo_mapcount) == 1) && > TestClearPageXpfoUnmapped(page)) > set_kpte(kaddr, page, PAGE_KERNEL); > spin_unlock_irqrestore(&page->xpfo_lock, flags); > >> +} >> + >> +static inline void xpfo_kunmap(void *kaddr, struct page *page) >> +{ >> + unsigned long flags; >> + >> + if (!static_branch_unlikely(&xpfo_inited)) >> + return; >> + >> + if (!PageXpfoUser(page)) >> + return; >> + >> + /* >> + * The page is to be allocated back to user space, so unmap it from >> + * the kernel, flush the TLB and tag it as a user page. >> + */ >> + spin_lock_irqsave(&page->xpfo_lock, flags); >> + >> + if (atomic_dec_return(&page->xpfo_mapcount) == 0) { >> +#ifdef CONFIG_XPFO_DEBUG >> + WARN_ON(PageXpfoUnmapped(page)); >> +#endif >> + SetPageXpfoUnmapped(page); >> + set_kpte(kaddr, page, __pgprot(0)); >> + xpfo_flush_kernel_tlb(page, 0); > > You didn't speak about the TLB invalidation anywhere. But basically this > is one that x86 cannot do. > >> + } >> + >> + spin_unlock_irqrestore(&page->xpfo_lock, flags); > > Idem: > > if (atomic_add_unless(&page->xpfo_mapcount, -1, 1)) > return; > > .... > > >> +} > > Also I'm failing to see the point of PG_xpfo_unmapped, afaict it > is identical to !atomic_read(&page->xpfo_mapcount). > Thanks Peter. I really appreciate your review. Your feedback helps make this code better and closer to where I can feel comfortable not calling it RFC any more. The more I look at xpfo_kmap()/xpfo_kunmap() code, the more I get uncomfortable with it. As you pointed out about calling kmap_atomic from NMI context, that just makes the kmap_atomic code look even worse. I pointed out one problem with this code in cover letter and suggested a rewrite. I see these problems with this code: 1. When xpfo_kmap maps a page back in physmap, it opens up the ret2dir attack security hole again even if just for the duration of kmap. A kmap can stay around for some time if the page is being used for I/O. 2. This code uses spinlock which leads to problems. If it does not disable IRQ, it is exposed to deadlock around xpfo_lock. If it disables IRQ, I think it can still deadlock around pgd_lock. I think a better implementation of xpfo_kmap()/xpfo_kunmap() would map the page at a new virtual address similar to what kmap_high for i386 does. This avoids re-opening the ret2dir security hole. 
We can also possibly do away with xpfo_lock saving bytes in page-frame and the not so sane code sequence can go away. Good point about PG_xpfo_unmapped flag. You are right that it can be replaced with !atomic_read(&page->xpfo_mapcount). Thanks, Khalid
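A rough sketch of that direction, using vmap() as a stand-in for a kmap_high-style private mapping area (these helpers are hypothetical and ignore the atomic-context and TLB-shootdown problems discussed elsewhere in the thread; they only illustrate never restoring the physmap alias):

#include <linux/vmalloc.h>

static void *xpfo_kmap_alias(struct page *page)
{
	/* Map the page at a fresh kernel virtual address... */
	return vmap(&page, 1, VM_MAP, PAGE_KERNEL);
}

static void xpfo_kunmap_alias(void *kaddr)
{
	/* ...and tear down only that private alias on kunmap. */
	vunmap(kaddr);
}

The physmap entry for the page stays unmapped for as long as the page belongs to user space, so the ret2dir window never reopens.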
On Thu, Apr 04, 2019 at 09:15:46AM -0600, Khalid Aziz wrote: > Thanks Peter. I really appreciate your review. Your feedback helps make > this code better and closer to where I can feel comfortable not calling > it RFC any more. > > The more I look at xpfo_kmap()/xpfo_kunmap() code, the more I get > uncomfortable with it. As you pointed out about calling kmap_atomic from > NMI context, that just makes the kmap_atomic code look even worse. I > pointed out one problem with this code in cover letter and suggested a > rewrite. I see these problems with this code: Well, I no longer use it from NMI context, but I did do that for a while. We now have a giant heap of magic in the NMI path that allows us to take faults from NMI context (please don't ask), this means we can mostly do copy_from_user_inatomic() now. > 1. When xpfo_kmap maps a page back in physmap, it opens up the ret2dir > attack security hole again even if just for the duration of kmap. A kmap > can stay around for some time if the page is being used for I/O. Correct. > 2. This code uses spinlock which leads to problems. If it does not > disable IRQ, it is exposed to deadlock around xpfo_lock. If it disables > IRQ, I think it can still deadlock around pgd_lock. I've not spotted that inversion yet, but then I didn't look at the lock usage outside of k{,un}map_xpfo(). > I think a better implementation of xpfo_kmap()/xpfo_kunmap() would map > the page at a new virtual address similar to what kmap_high for i386 > does. This avoids re-opening the ret2dir security hole. We can also > possibly do away with xpfo_lock saving bytes in page-frame and the not > so sane code sequence can go away. Right, the TLB invalidation issues are still tricky, even there :/
[ Sorry, had to trim the Cc: list from hell. Tried to keep all the mailing lists and all x86 developers. ] * Khalid Aziz <khalid.aziz@oracle.com> wrote: > From: Juerg Haefliger <juerg.haefliger@canonical.com> > > This patch adds basic support infrastructure for XPFO which protects > against 'ret2dir' kernel attacks. The basic idea is to enforce > exclusive ownership of page frames by either the kernel or userspace, > unless explicitly requested by the kernel. Whenever a page destined for > userspace is allocated, it is unmapped from physmap (the kernel's page > table). When such a page is reclaimed from userspace, it is mapped back > to physmap. Individual architectures can enable full XPFO support using > this infrastructure by supplying architecture specific pieces. I have a higher level, meta question: Is there any updated analysis outlining why this XPFO overhead would be required on x86-64 kernels running on SMAP/SMEP CPUs which should be all recent Intel and AMD CPUs, and with kernel that mark all direct kernel mappings as non-executable - which should be all reasonably modern kernels later than v4.0 or so? I.e. the original motivation of the XPFO patches was to prevent execution of direct kernel mappings. Is this motivation still present if those mappings are non-executable? (Sorry if this has been asked and answered in previous discussions.) Thanks, Ingo
On 4/17/19 10:15 AM, Ingo Molnar wrote: > > [ Sorry, had to trim the Cc: list from hell. Tried to keep all the > mailing lists and all x86 developers. ] > > * Khalid Aziz <khalid.aziz@oracle.com> wrote: > >> From: Juerg Haefliger <juerg.haefliger@canonical.com> >> >> This patch adds basic support infrastructure for XPFO which protects >> against 'ret2dir' kernel attacks. The basic idea is to enforce >> exclusive ownership of page frames by either the kernel or userspace, >> unless explicitly requested by the kernel. Whenever a page destined for >> userspace is allocated, it is unmapped from physmap (the kernel's page >> table). When such a page is reclaimed from userspace, it is mapped back >> to physmap. Individual architectures can enable full XPFO support using >> this infrastructure by supplying architecture specific pieces. > > I have a higher level, meta question: > > Is there any updated analysis outlining why this XPFO overhead would be > required on x86-64 kernels running on SMAP/SMEP CPUs which should be all > recent Intel and AMD CPUs, and with kernel that mark all direct kernel > mappings as non-executable - which should be all reasonably modern > kernels later than v4.0 or so? > > I.e. the original motivation of the XPFO patches was to prevent execution > of direct kernel mappings. Is this motivation still present if those > mappings are non-executable? > > (Sorry if this has been asked and answered in previous discussions.) Hi Ingo, That is a good question. Because of the cost of XPFO, we have to be very sure we need this protection. The paper from Vasileios, Michalis and Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>, does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1 and 6.2. Thanks, Khalid
* Khalid Aziz <khalid.aziz@oracle.com> wrote: > > I.e. the original motivation of the XPFO patches was to prevent execution > > of direct kernel mappings. Is this motivation still present if those > > mappings are non-executable? > > > > (Sorry if this has been asked and answered in previous discussions.) > > Hi Ingo, > > That is a good question. Because of the cost of XPFO, we have to be very > sure we need this protection. The paper from Vasileios, Michalis and > Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>, > does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1 > and 6.2. So it would be nice if you could generally summarize external arguments when defending a patchset, instead of me having to dig through a PDF which not only causes me to spend time that you probably already spent reading that PDF, but I might also interpret it incorrectly. ;-) The PDF you cited says this: "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced in many platforms, including x86-64. In our example, the content of user address 0xBEEF000 is also accessible through kernel address 0xFFFF87FF9F080000 as plain, executable code." Is this actually true of modern x86-64 kernels? We've locked down W^X protections in general. I.e. this conclusion: "Therefore, by simply overwriting kfptr with 0xFFFF87FF9F080000 and triggering the kernel to dereference it, an attacker can directly execute shell code with kernel privileges." ... appears to be predicated on imperfect W^X protections on the x86-64 kernel. Do such holes exist on the latest x86-64 kernel? If yes, is there a reason to believe that these W^X holes cannot be fixed, or that any fix would be more expensive than XPFO? Thanks, Ingo
> On Apr 17, 2019, at 10:09 AM, Ingo Molnar <mingo@kernel.org> wrote: > > > * Khalid Aziz <khalid.aziz@oracle.com> wrote: > >>> I.e. the original motivation of the XPFO patches was to prevent execution >>> of direct kernel mappings. Is this motivation still present if those >>> mappings are non-executable? >>> >>> (Sorry if this has been asked and answered in previous discussions.) >> >> Hi Ingo, >> >> That is a good question. Because of the cost of XPFO, we have to be very >> sure we need this protection. The paper from Vasileios, Michalis and >> Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>, >> does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1 >> and 6.2. > > So it would be nice if you could generally summarize external arguments > when defending a patchset, instead of me having to dig through a PDF > which not only causes me to spend time that you probably already spent > reading that PDF, but I might also interpret it incorrectly. ;-) > > The PDF you cited says this: > > "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced > in many platforms, including x86-64. In our example, the content of > user address 0xBEEF000 is also accessible through kernel address > 0xFFFF87FF9F080000 as plain, executable code." > > Is this actually true of modern x86-64 kernels? We've locked down W^X > protections in general. As I was curious, I looked at the paper. Here is a quote from it: "In x86-64, however, the permissions of physmap are not in sane state. Kernels up to v3.8.13 violate the W^X property by mapping the entire region as “readable, writeable, and executable” (RWX)—only very recent kernels (≥v3.9) use the more conservative RW mapping.”
* Nadav Amit <nadav.amit@gmail.com> wrote: > > On Apr 17, 2019, at 10:09 AM, Ingo Molnar <mingo@kernel.org> wrote: > > > > > > * Khalid Aziz <khalid.aziz@oracle.com> wrote: > > > >>> I.e. the original motivation of the XPFO patches was to prevent execution > >>> of direct kernel mappings. Is this motivation still present if those > >>> mappings are non-executable? > >>> > >>> (Sorry if this has been asked and answered in previous discussions.) > >> > >> Hi Ingo, > >> > >> That is a good question. Because of the cost of XPFO, we have to be very > >> sure we need this protection. The paper from Vasileios, Michalis and > >> Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>, > >> does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1 > >> and 6.2. > > > > So it would be nice if you could generally summarize external arguments > > when defending a patchset, instead of me having to dig through a PDF > > which not only causes me to spend time that you probably already spent > > reading that PDF, but I might also interpret it incorrectly. ;-) > > > > The PDF you cited says this: > > > > "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced > > in many platforms, including x86-64. In our example, the content of > > user address 0xBEEF000 is also accessible through kernel address > > 0xFFFF87FF9F080000 as plain, executable code." > > > > Is this actually true of modern x86-64 kernels? We've locked down W^X > > protections in general. > > As I was curious, I looked at the paper. Here is a quote from it: > > "In x86-64, however, the permissions of physmap are not in sane state. > Kernels up to v3.8.13 violate the W^X property by mapping the entire region > as “readable, writeable, and executable” (RWX)—only very recent kernels > (≥v3.9) use the more conservative RW mapping.” But v3.8.13 is a 5+ years old kernel, it doesn't count as a "modern" kernel in any sense of the word. For any proposed patchset with significant complexity and non-trivial costs the benchmark version threshold is the "current upstream kernel". So does that quote address my followup questions: > Is this actually true of modern x86-64 kernels? We've locked down W^X > protections in general. > > I.e. this conclusion: > > "Therefore, by simply overwriting kfptr with 0xFFFF87FF9F080000 and > triggering the kernel to dereference it, an attacker can directly > execute shell code with kernel privileges." > > ... appears to be predicated on imperfect W^X protections on the x86-64 > kernel. > > Do such holes exist on the latest x86-64 kernel? If yes, is there a > reason to believe that these W^X holes cannot be fixed, or that any fix > would be more expensive than XPFO? ? What you are proposing here is a XPFO patch-set against recent kernels with significant runtime overhead, so my questions about the W^X holes are warranted. Thanks, Ingo
On 4/17/19 11:09 AM, Ingo Molnar wrote: > > * Khalid Aziz <khalid.aziz@oracle.com> wrote: > >>> I.e. the original motivation of the XPFO patches was to prevent execution >>> of direct kernel mappings. Is this motivation still present if those >>> mappings are non-executable? >>> >>> (Sorry if this has been asked and answered in previous discussions.) >> >> Hi Ingo, >> >> That is a good question. Because of the cost of XPFO, we have to be very >> sure we need this protection. The paper from Vasileios, Michalis and >> Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>, >> does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1 >> and 6.2. > > So it would be nice if you could generally summarize external arguments > when defending a patchset, instead of me having to dig through a PDF > which not only causes me to spend time that you probably already spent > reading that PDF, but I might also interpret it incorrectly. ;-) Sorry, you are right. Even though that paper explains it well, a summary is always useful. > > The PDF you cited says this: > > "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced > in many platforms, including x86-64. In our example, the content of > user address 0xBEEF000 is also accessible through kernel address > 0xFFFF87FF9F080000 as plain, executable code." > > Is this actually true of modern x86-64 kernels? We've locked down W^X > protections in general. > > I.e. this conclusion: > > "Therefore, by simply overwriting kfptr with 0xFFFF87FF9F080000 and > triggering the kernel to dereference it, an attacker can directly > execute shell code with kernel privileges." > > ... appears to be predicated on imperfect W^X protections on the x86-64 > kernel. > > Do such holes exist on the latest x86-64 kernel? If yes, is there a > reason to believe that these W^X holes cannot be fixed, or that any fix > would be more expensive than XPFO? Even if physmap is not executable, return-oriented programming (ROP) can still be used to launch an attack. Instead of placing executable code at user address 0xBEEF000, attacker can place an ROP payload there. kfptr is then overwritten to point to a stack-pivoting gadget. Using the physmap address aliasing, the ROP payload becomes kernel-mode stack. The execution can then be hijacked upon execution of ret instruction. This is a gist of the subsection titled "Non-executable physmap" under section 6.2 and it looked convincing enough to me. If you have a different take on this, I am very interested in your point of view. Thanks, Khalid
> On Apr 17, 2019, at 10:26 AM, Ingo Molnar <mingo@kernel.org> wrote: > > > * Nadav Amit <nadav.amit@gmail.com> wrote: > >>> On Apr 17, 2019, at 10:09 AM, Ingo Molnar <mingo@kernel.org> wrote: >>> >>> >>> * Khalid Aziz <khalid.aziz@oracle.com> wrote: >>> >>>>> I.e. the original motivation of the XPFO patches was to prevent execution >>>>> of direct kernel mappings. Is this motivation still present if those >>>>> mappings are non-executable? >>>>> >>>>> (Sorry if this has been asked and answered in previous discussions.) >>>> >>>> Hi Ingo, >>>> >>>> That is a good question. Because of the cost of XPFO, we have to be very >>>> sure we need this protection. The paper from Vasileios, Michalis and >>>> Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>, >>>> does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1 >>>> and 6.2. >>> >>> So it would be nice if you could generally summarize external arguments >>> when defending a patchset, instead of me having to dig through a PDF >>> which not only causes me to spend time that you probably already spent >>> reading that PDF, but I might also interpret it incorrectly. ;-) >>> >>> The PDF you cited says this: >>> >>> "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced >>> in many platforms, including x86-64. In our example, the content of >>> user address 0xBEEF000 is also accessible through kernel address >>> 0xFFFF87FF9F080000 as plain, executable code." >>> >>> Is this actually true of modern x86-64 kernels? We've locked down W^X >>> protections in general. >> >> As I was curious, I looked at the paper. Here is a quote from it: >> >> "In x86-64, however, the permissions of physmap are not in sane state. >> Kernels up to v3.8.13 violate the W^X property by mapping the entire region >> as “readable, writeable, and executable” (RWX)—only very recent kernels >> (≥v3.9) use the more conservative RW mapping.” > > But v3.8.13 is a 5+ years old kernel, it doesn't count as a "modern" > kernel in any sense of the word. For any proposed patchset with > significant complexity and non-trivial costs the benchmark version > threshold is the "current upstream kernel". > > So does that quote address my followup questions: > >> Is this actually true of modern x86-64 kernels? We've locked down W^X >> protections in general. >> >> I.e. this conclusion: >> >> "Therefore, by simply overwriting kfptr with 0xFFFF87FF9F080000 and >> triggering the kernel to dereference it, an attacker can directly >> execute shell code with kernel privileges." >> >> ... appears to be predicated on imperfect W^X protections on the x86-64 >> kernel. >> >> Do such holes exist on the latest x86-64 kernel? If yes, is there a >> reason to believe that these W^X holes cannot be fixed, or that any fix >> would be more expensive than XPFO? > > ? > > What you are proposing here is a XPFO patch-set against recent kernels > with significant runtime overhead, so my questions about the W^X holes > are warranted. > Just to clarify - I am an innocent bystander and have no part in this work. I was just looking (again) at the paper, as I was curious due to the recent patches that I sent that improve W^X protection.
On Wed, Apr 17, 2019 at 10:33 AM Khalid Aziz <khalid.aziz@oracle.com> wrote: > > On 4/17/19 11:09 AM, Ingo Molnar wrote: > > > > * Khalid Aziz <khalid.aziz@oracle.com> wrote: > > > >>> I.e. the original motivation of the XPFO patches was to prevent execution > >>> of direct kernel mappings. Is this motivation still present if those > >>> mappings are non-executable? > >>> > >>> (Sorry if this has been asked and answered in previous discussions.) > >> > >> Hi Ingo, > >> > >> That is a good question. Because of the cost of XPFO, we have to be very > >> sure we need this protection. The paper from Vasileios, Michalis and > >> Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>, > >> does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1 > >> and 6.2. > > > > So it would be nice if you could generally summarize external arguments > > when defending a patchset, instead of me having to dig through a PDF > > which not only causes me to spend time that you probably already spent > > reading that PDF, but I might also interpret it incorrectly. ;-) > > Sorry, you are right. Even though that paper explains it well, a summary > is always useful. > > > > > The PDF you cited says this: > > > > "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced > > in many platforms, including x86-64. In our example, the content of > > user address 0xBEEF000 is also accessible through kernel address > > 0xFFFF87FF9F080000 as plain, executable code." > > > > Is this actually true of modern x86-64 kernels? We've locked down W^X > > protections in general. > > > > I.e. this conclusion: > > > > "Therefore, by simply overwriting kfptr with 0xFFFF87FF9F080000 and > > triggering the kernel to dereference it, an attacker can directly > > execute shell code with kernel privileges." > > > > ... appears to be predicated on imperfect W^X protections on the x86-64 > > kernel. > > > > Do such holes exist on the latest x86-64 kernel? If yes, is there a > > reason to believe that these W^X holes cannot be fixed, or that any fix > > would be more expensive than XPFO? > > Even if physmap is not executable, return-oriented programming (ROP) can > still be used to launch an attack. Instead of placing executable code at > user address 0xBEEF000, attacker can place an ROP payload there. kfptr > is then overwritten to point to a stack-pivoting gadget. Using the > physmap address aliasing, the ROP payload becomes kernel-mode stack. The > execution can then be hijacked upon execution of ret instruction. This > is a gist of the subsection titled "Non-executable physmap" under > section 6.2 and it looked convincing enough to me. If you have a > different take on this, I am very interested in your point of view. My issue with all this is that XPFO is really very expensive. I think that, if we're going to seriously consider upstreaming expensive exploit mitigations like this, we should consider others first, in particular CFI techniques. grsecurity's RAP would be a great start. I also proposed using a gcc plugin (or upstream gcc feature) to add some instrumentation to any code that pops RSP to verify that the resulting (unsigned) change in RSP is between 0 and THREAD_SIZE bytes. This will make ROP quite a bit harder.
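For context, the invariant such instrumentation would enforce can be written down as a small check (this is not Andy's plugin, just an illustration using the kernel's current_stack_pointer and THREAD_SIZE; a real plugin would emit the equivalent after every instruction sequence that reloads RSP):

static __always_inline void check_stack_pivot(unsigned long old_rsp)
{
	unsigned long new_rsp = current_stack_pointer;

	/*
	 * Popping into RSP may only move the stack pointer forward within
	 * the current kernel stack; an unsigned delta of THREAD_SIZE or
	 * more means RSP now points somewhere else, i.e. a stack pivot.
	 */
	if (unlikely(new_rsp - old_rsp >= THREAD_SIZE))
		BUG();	/* hypothetical response; the real check might trap or panic */
}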
On Wed, Apr 17, 2019 at 12:49:04PM -0700, Andy Lutomirski wrote: > I also proposed using a gcc plugin (or upstream gcc feature) to add > some instrumentation to any code that pops RSP to verify that the > resulting (unsigned) change in RSP is between 0 and THREAD_SIZE bytes. > This will make ROP quite a bit harder. I've been playing around with this for a bit, and hope to have something to post Soon :) Tycho
On 4/17/19 1:49 PM, Andy Lutomirski wrote: > On Wed, Apr 17, 2019 at 10:33 AM Khalid Aziz <khalid.aziz@oracle.com> wrote: >> >> On 4/17/19 11:09 AM, Ingo Molnar wrote: >>> >>> * Khalid Aziz <khalid.aziz@oracle.com> wrote: >>> >>>>> I.e. the original motivation of the XPFO patches was to prevent execution >>>>> of direct kernel mappings. Is this motivation still present if those >>>>> mappings are non-executable? >>>>> >>>>> (Sorry if this has been asked and answered in previous discussions.) >>>> >>>> Hi Ingo, >>>> >>>> That is a good question. Because of the cost of XPFO, we have to be very >>>> sure we need this protection. The paper from Vasileios, Michalis and >>>> Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>, >>>> does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1 >>>> and 6.2. >>> >>> So it would be nice if you could generally summarize external arguments >>> when defending a patchset, instead of me having to dig through a PDF >>> which not only causes me to spend time that you probably already spent >>> reading that PDF, but I might also interpret it incorrectly. ;-) >> >> Sorry, you are right. Even though that paper explains it well, a summary >> is always useful. >> >>> >>> The PDF you cited says this: >>> >>> "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced >>> in many platforms, including x86-64. In our example, the content of >>> user address 0xBEEF000 is also accessible through kernel address >>> 0xFFFF87FF9F080000 as plain, executable code." >>> >>> Is this actually true of modern x86-64 kernels? We've locked down W^X >>> protections in general. >>> >>> I.e. this conclusion: >>> >>> "Therefore, by simply overwriting kfptr with 0xFFFF87FF9F080000 and >>> triggering the kernel to dereference it, an attacker can directly >>> execute shell code with kernel privileges." >>> >>> ... appears to be predicated on imperfect W^X protections on the x86-64 >>> kernel. >>> >>> Do such holes exist on the latest x86-64 kernel? If yes, is there a >>> reason to believe that these W^X holes cannot be fixed, or that any fix >>> would be more expensive than XPFO? >> >> Even if physmap is not executable, return-oriented programming (ROP) can >> still be used to launch an attack. Instead of placing executable code at >> user address 0xBEEF000, attacker can place an ROP payload there. kfptr >> is then overwritten to point to a stack-pivoting gadget. Using the >> physmap address aliasing, the ROP payload becomes kernel-mode stack. The >> execution can then be hijacked upon execution of ret instruction. This >> is a gist of the subsection titled "Non-executable physmap" under >> section 6.2 and it looked convincing enough to me. If you have a >> different take on this, I am very interested in your point of view. > > My issue with all this is that XPFO is really very expensive. I think > that, if we're going to seriously consider upstreaming expensive > exploit mitigations like this, we should consider others first, in > particular CFI techniques. grsecurity's RAP would be a great start. > I also proposed using a gcc plugin (or upstream gcc feature) to add > some instrumentation to any code that pops RSP to verify that the > resulting (unsigned) change in RSP is between 0 and THREAD_SIZE bytes. > This will make ROP quite a bit harder. > Yes, XPFO is expensive. I have been able to reduce the overhead of XPFO from 2537% to 28% (on large servers) but 28% is still quite significant. 
Alternative mitigation techniques with lower impact would easily be more acceptable as long as they provide same level of protection. If we have to go with XPFO, we will continue to look for more performance improvement to bring that number down further from 28%. Hopefully what Tycho is working on will yield better results. I am continuing to look for improvements to XPFO in parallel. Thanks, Khalid
On Wed, 17 Apr 2019, Nadav Amit wrote: > > On Apr 17, 2019, at 10:26 AM, Ingo Molnar <mingo@kernel.org> wrote: > >> As I was curious, I looked at the paper. Here is a quote from it: > >> > >> "In x86-64, however, the permissions of physmap are not in sane state. > >> Kernels up to v3.8.13 violate the W^X property by mapping the entire region > >> as “readable, writeable, and executable” (RWX)—only very recent kernels > >> (≥v3.9) use the more conservative RW mapping.” > > > > But v3.8.13 is a 5+ years old kernel, it doesn't count as a "modern" > > kernel in any sense of the word. For any proposed patchset with > > significant complexity and non-trivial costs the benchmark version > > threshold is the "current upstream kernel". > > > > So does that quote address my followup questions: > > > >> Is this actually true of modern x86-64 kernels? We've locked down W^X > >> protections in general. > >> > >> I.e. this conclusion: > >> > >> "Therefore, by simply overwriting kfptr with 0xFFFF87FF9F080000 and > >> triggering the kernel to dereference it, an attacker can directly > >> execute shell code with kernel privileges." > >> > >> ... appears to be predicated on imperfect W^X protections on the x86-64 > >> kernel. > >> > >> Do such holes exist on the latest x86-64 kernel? If yes, is there a > >> reason to believe that these W^X holes cannot be fixed, or that any fix > >> would be more expensive than XPFO? > > > > ? > > > > What you are proposing here is a XPFO patch-set against recent kernels > > with significant runtime overhead, so my questions about the W^X holes > > are warranted. > > > > Just to clarify - I am an innocent bystander and have no part in this work. > I was just looking (again) at the paper, as I was curious due to the recent > patches that I sent that improve W^X protection. It's not necessarily a W+X issue. The user space text is mapped in the kernel as well and even if it is mapped RX then this can happen. So any kernel mappings of user space text need to be mapped NX! Thanks, tglx
On Wed, Apr 17, 2019, 14:20 Thomas Gleixner <tglx@linutronix.de> wrote: > > It's not necessarily a W+X issue. The user space text is mapped in the > kernel as well and even if it is mapped RX then this can happen. So any > kernel mappings of user space text need to be mapped NX! With SMEP, user space pages are always NX. I really think SM[AE]P is something we can already take for granted. People who have old CPUs without it are simply not serious about security anyway. There is no point in saying "we can do it badly in software". Linus
On Wed, 17 Apr 2019, Linus Torvalds wrote: > On Wed, Apr 17, 2019, 14:20 Thomas Gleixner <tglx@linutronix.de> wrote: > > > > > It's not necessarily a W+X issue. The user space text is mapped in the > > kernel as well and even if it is mapped RX then this can happen. So any > > kernel mappings of user space text need to be mapped NX! > > With SMEP, user space pages are always NX. We talk past each other. The user space page in the ring3 valid virtual address space (non negative) is of course protected by SMEP. The attack utilizes the kernel linear mapping of the physical memory. I.e. user space address 0x43210 has a kernel equivalent at 0xfxxxxxxxxxx. So if the attack manages to trick the kernel to that valid kernel address and that is mapped X --> game over. SMEP does not help there. From the top of my head I'd say this is a non issue as those kernel address space mappings _should_ be NX, but we got bitten by _should_ in the past:) Thanks, tglx
On Wed, Apr 17, 2019 at 4:42 PM Thomas Gleixner <tglx@linutronix.de> wrote: > > On Wed, 17 Apr 2019, Linus Torvalds wrote: > > > With SMEP, user space pages are always NX. > > We talk past each other. The user space page in the ring3 valid virtual > address space (non negative) is of course protected by SMEP. > > The attack utilizes the kernel linear mapping of the physical > memory. I.e. user space address 0x43210 has a kernel equivalent at > 0xfxxxxxxxxxx. So if the attack manages to trick the kernel to that valid > kernel address and that is mapped X --> game over. SMEP does not help > there. Oh, agreed. But that would simply be a kernel bug. We should only map kernel pages executable when we have kernel code in them, and we should certainly not allow those pages to be mapped writably in user space. That kind of "executable in kernel, writable in user" would be a horrendous and major bug. So i think it's a non-issue. > From the top of my head I'd say this is a non issue as those kernel address > space mappings _should_ be NX, but we got bitten by _should_ in the past:) I do agree that bugs can happen, obviously, and we might have missed something. But in the context of XPFO, I would argue (*very* strongly) that the likelihood of the above kind of bug is absolutely *miniscule* compared to the likelihood that we'd have something wrong in the software implementation of XPFO. So if the argument is "we might have bugs in software", then I think that's an argument _against_ XPFO rather than for it. Linus
On Wed, Apr 17, 2019 at 5:00 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Wed, Apr 17, 2019 at 4:42 PM Thomas Gleixner <tglx@linutronix.de> wrote: > > > > On Wed, 17 Apr 2019, Linus Torvalds wrote: > > > > > With SMEP, user space pages are always NX. > > > > We talk past each other. The user space page in the ring3 valid virtual > > address space (non negative) is of course protected by SMEP. > > > > The attack utilizes the kernel linear mapping of the physical > > memory. I.e. user space address 0x43210 has a kernel equivalent at > > 0xfxxxxxxxxxx. So if the attack manages to trick the kernel to that valid > > kernel address and that is mapped X --> game over. SMEP does not help > > there. > > Oh, agreed. > > But that would simply be a kernel bug. We should only map kernel pages > executable when we have kernel code in them, and we should certainly > not allow those pages to be mapped writably in user space. > > That kind of "executable in kernel, writable in user" would be a > horrendous and major bug. > > So i think it's a non-issue. > > > From the top of my head I'd say this is a non issue as those kernel address > > space mappings _should_ be NX, but we got bitten by _should_ in the past:) > > I do agree that bugs can happen, obviously, and we might have missed something. > > But in the context of XPFO, I would argue (*very* strongly) that the > likelihood of the above kind of bug is absolutely *miniscule* compared > to the likelihood that we'd have something wrong in the software > implementation of XPFO. > > So if the argument is "we might have bugs in software", then I think > that's an argument _against_ XPFO rather than for it. > I don't think this type of NX goof was ever the argument for XPFO. The main argument I've heard is that a malicious user program writes a ROP payload into user memory (regular anonymous user memory) and then gets the kernel to erroneously set RSP (*not* RIP) to point there. I find this argument fairly weak for a couple reasons. First, if we're worried about this, let's do in-kernel CFI, not XPFO, to mitigate it. Second, I don't see why the exact same attack can't be done using, say, page cache, and unless I'm missing something, XPFO doesn't protect page cache. Or network buffers, or pipe buffers, etc.
On Wed, Apr 17, 2019 at 11:41 PM Andy Lutomirski <luto@kernel.org> wrote: > I don't think this type of NX goof was ever the argument for XPFO. > The main argument I've heard is that a malicious user program writes a > ROP payload into user memory (regular anonymous user memory) and then > gets the kernel to erroneously set RSP (*not* RIP) to point there. Well, more than just ROP. Any of the various attack primitives. The NX stuff is about moving RIP: SMEP-bypassing. But there is still basic SMAP-bypassing for putting a malicious structure in userspace and having the kernel access it via the linear mapping, etc. > I find this argument fairly weak for a couple reasons. First, if > we're worried about this, let's do in-kernel CFI, not XPFO, to CFI is getting much closer. Getting the kernel happy under Clang, LTO, and CFI is under active development. (It's functional for arm64 already, and pieces have been getting upstreamed.) > mitigate it. Second, I don't see why the exact same attack can't be > done using, say, page cache, and unless I'm missing something, XPFO > doesn't protect page cache. Or network buffers, or pipe buffers, etc. My understanding is that it's much easier to feel out the linear mapping address than for the others. But yes, all of those same attack primitives are possible in other memory areas (though most are NX), and plenty of exploits have done such things.
On Wed, 17 Apr 2019, Linus Torvalds wrote: > On Wed, Apr 17, 2019 at 4:42 PM Thomas Gleixner <tglx@linutronix.de> wrote: > > On Wed, 17 Apr 2019, Linus Torvalds wrote: > > > With SMEP, user space pages are always NX. > > > > We talk past each other. The user space page in the ring3 valid virtual > > address space (non negative) is of course protected by SMEP. > > > > The attack utilizes the kernel linear mapping of the physical > > memory. I.e. user space address 0x43210 has a kernel equivalent at > > 0xfxxxxxxxxxx. So if the attack manages to trick the kernel to that valid > > kernel address and that is mapped X --> game over. SMEP does not help > > there. > > Oh, agreed. > > But that would simply be a kernel bug. We should only map kernel pages > executable when we have kernel code in them, and we should certainly > not allow those pages to be mapped writably in user space. > > That kind of "executable in kernel, writable in user" would be a > horrendous and major bug. > > So i think it's a non-issue. Pretty much. > > From the top of my head I'd say this is a non issue as those kernel address > > space mappings _should_ be NX, but we got bitten by _should_ in the past:) > > I do agree that bugs can happen, obviously, and we might have missed something. > > But in the context of XPFO, I would argue (*very* strongly) that the > likelihood of the above kind of bug is absolutely *miniscule* compared > to the likelihood that we'd have something wrong in the software > implementation of XPFO. > > So if the argument is "we might have bugs in software", then I think > that's an argument _against_ XPFO rather than for it. No argument from my side. We better spend time to make sure that a bogus kernel side X mapping is caught, like we catch other things. Thanks, tglx
On 4/17/19 11:41 PM, Kees Cook wrote: > On Wed, Apr 17, 2019 at 11:41 PM Andy Lutomirski <luto@kernel.org> wrote: >> I don't think this type of NX goof was ever the argument for XPFO. >> The main argument I've heard is that a malicious user program writes a >> ROP payload into user memory (regular anonymous user memory) and then >> gets the kernel to erroneously set RSP (*not* RIP) to point there. > > Well, more than just ROP. Any of the various attack primitives. The NX > stuff is about moving RIP: SMEP-bypassing. But there is still basic > SMAP-bypassing for putting a malicious structure in userspace and > having the kernel access it via the linear mapping, etc. > >> I find this argument fairly weak for a couple reasons. First, if >> we're worried about this, let's do in-kernel CFI, not XPFO, to > > CFI is getting much closer. Getting the kernel happy under Clang, LTO, > and CFI is under active development. (It's functional for arm64 > already, and pieces have been getting upstreamed.) > CFI theoretically offers protection with fairly low overhead. I have not played much with CFI in clang. I agree with Linus that probability of bugs in XPFO implementation itself is a cause of concern. If CFI in Clang can provide us the same level of protection as XPFO does, I wouldn't want to push for an expensive change like XPFO. If Clang/CFI can't get us there for extended period of time, does it make sense to continue to poke at XPFO? Thanks, Khalid
On 4/18/19 8:34 AM, Khalid Aziz wrote: > On 4/17/19 11:41 PM, Kees Cook wrote: >> On Wed, Apr 17, 2019 at 11:41 PM Andy Lutomirski <luto@kernel.org> wrote: >>> I don't think this type of NX goof was ever the argument for XPFO. >>> The main argument I've heard is that a malicious user program writes a >>> ROP payload into user memory (regular anonymous user memory) and then >>> gets the kernel to erroneously set RSP (*not* RIP) to point there. >> >> Well, more than just ROP. Any of the various attack primitives. The NX >> stuff is about moving RIP: SMEP-bypassing. But there is still basic >> SMAP-bypassing for putting a malicious structure in userspace and >> having the kernel access it via the linear mapping, etc. >> >>> I find this argument fairly weak for a couple reasons. First, if >>> we're worried about this, let's do in-kernel CFI, not XPFO, to >> >> CFI is getting much closer. Getting the kernel happy under Clang, LTO, >> and CFI is under active development. (It's functional for arm64 >> already, and pieces have been getting upstreamed.) >> > > CFI theoretically offers protection with fairly low overhead. I have not > played much with CFI in clang. I agree with Linus that probability of > bugs in XPFO implementation itself is a cause of concern. If CFI in > Clang can provide us the same level of protection as XPFO does, I > wouldn't want to push for an expensive change like XPFO. > > If Clang/CFI can't get us there for extended period of time, does it > make sense to continue to poke at XPFO? Any feedback on continued effort on XPFO? If it makes sense to have XPFO available as a solution for ret2dir issue in case Clang/CFI does not work out, I will continue to refine it. -- Khalid
On Thu, Apr 18, 2019 at 7:35 AM Khalid Aziz <khalid.aziz@oracle.com> wrote: > > On 4/17/19 11:41 PM, Kees Cook wrote: > > On Wed, Apr 17, 2019 at 11:41 PM Andy Lutomirski <luto@kernel.org> wrote: > >> I don't think this type of NX goof was ever the argument for XPFO. > >> The main argument I've heard is that a malicious user program writes a > >> ROP payload into user memory (regular anonymous user memory) and then > >> gets the kernel to erroneously set RSP (*not* RIP) to point there. > > > > Well, more than just ROP. Any of the various attack primitives. The NX > > stuff is about moving RIP: SMEP-bypassing. But there is still basic > > SMAP-bypassing for putting a malicious structure in userspace and > > having the kernel access it via the linear mapping, etc. > > > >> I find this argument fairly weak for a couple reasons. First, if > >> we're worried about this, let's do in-kernel CFI, not XPFO, to > > > > CFI is getting much closer. Getting the kernel happy under Clang, LTO, > > and CFI is under active development. (It's functional for arm64 > > already, and pieces have been getting upstreamed.) > > > > CFI theoretically offers protection with fairly low overhead. I have not > played much with CFI in clang. I agree with Linus that probability of > bugs in XPFO implementation itself is a cause of concern. If CFI in > Clang can provide us the same level of protection as XPFO does, I > wouldn't want to push for an expensive change like XPFO. > > If Clang/CFI can't get us there for extended period of time, does it > make sense to continue to poke at XPFO? Well, I think CFI will certainly vastly narrow the execution paths available to an attacker, but what I continue to see XPFO useful for is stopping attacks that need to locate something in memory. (i.e. not ret2dir but, like, read2dir.) It's arguable that such attacks would just use heap, stack, etc to hold such things, but the linear map remains relatively easy to find/target. But I agree: the protection is getting more and more narrow (especially with CFI coming down the pipe), and if it's still a 28% hit, that's not going to be tenable for anyone but the truly paranoid. :) All that said, there isn't a very good backward-edge CFI protection (i.e. ROP defense) on x86 in Clang. The forward-edge looks decent, but requires LTO, etc. My point is there is still a long path to gaining CFI in upstream.
On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote: > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index 858b6c0b9a15..9b36da94760e 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -2997,6 +2997,12 @@ > > nox2apic [X86-64,APIC] Do not enable x2APIC mode. > > + noxpfo [XPFO] Disable eXclusive Page Frame Ownership (XPFO) > + when CONFIG_XPFO is on. Physical pages mapped into > + user applications will also be mapped in the > + kernel's address space as if CONFIG_XPFO was not > + enabled. > + > cpu0_hotplug [X86] Turn on CPU0 hotplug feature when > CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off. > Some features depend on CPU0. Known dependencies are: Given the big performance impact that XPFO can have, it should be off by default when configured. Instead, the xpfo option should be used to enable it. Cheers, Longman
On 5/1/19 8:49 AM, Waiman Long wrote: > On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote: >> diff --git a/Documentation/admin-guide/kernel-parameters.txt > b/Documentation/admin-guide/kernel-parameters.txt > >> index 858b6c0b9a15..9b36da94760e 100644 >> --- a/Documentation/admin-guide/kernel-parameters.txt >> +++ b/Documentation/admin-guide/kernel-parameters.txt >> @@ -2997,6 +2997,12 @@ >> >> nox2apic [X86-64,APIC] Do not enable x2APIC mode. >> >> + noxpfo [XPFO] Disable eXclusive Page Frame Ownership (XPFO) >> + when CONFIG_XPFO is on. Physical pages mapped into >> + user applications will also be mapped in the >> + kernel's address space as if CONFIG_XPFO was not >> + enabled. >> + >> cpu0_hotplug [X86] Turn on CPU0 hotplug feature when >> CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off. >> Some features depend on CPU0. Known dependencies are: > > Given the big performance impact that XPFO can have, it should be off by > default when configured. Instead, the xpfo option should be used to > enable it. Agreed. I plan to disable it by default in the next version of the patch. This is likely to end up being a feature for extremely security-conscious folks only, unless I or someone else comes up with a further significant performance boost. Thanks, Khalid
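A minimal sketch of that opt-in behaviour, assuming the patch's xpfo_inited static key is switched to default-off (hypothetical code, not part of the posted series): boot with "xpfo" to enable the feature rather than "noxpfo" to disable it.

DEFINE_STATIC_KEY_FALSE(xpfo_inited);

static bool xpfo_enabled __initdata;

static int __init xpfo_parse_enable(char *str)
{
	xpfo_enabled = true;
	return 0;
}
early_param("xpfo", xpfo_parse_enable);

static int __init xpfo_init(void)
{
	if (xpfo_enabled)
		static_branch_enable(&xpfo_inited);
	return 0;
}
early_initcall(xpfo_init);

The existing static_branch_unlikely(&xpfo_inited) checks then keep all XPFO paths compiled in but patched out unless the administrator explicitly asks for them.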
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 858b6c0b9a15..9b36da94760e 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2997,6 +2997,12 @@ nox2apic [X86-64,APIC] Do not enable x2APIC mode. + noxpfo [XPFO] Disable eXclusive Page Frame Ownership (XPFO) + when CONFIG_XPFO is on. Physical pages mapped into + user applications will also be mapped in the + kernel's address space as if CONFIG_XPFO was not + enabled. + cpu0_hotplug [X86] Turn on CPU0 hotplug feature when CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off. Some features depend on CPU0. Known dependencies are: diff --git a/include/linux/highmem.h b/include/linux/highmem.h index ea5cdbd8c2c3..59a1a5fa598d 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -8,6 +8,7 @@ #include <linux/mm.h> #include <linux/uaccess.h> #include <linux/hardirq.h> +#include <linux/xpfo.h> #include <asm/cacheflush.h> @@ -77,36 +78,6 @@ static inline struct page *kmap_to_page(void *addr) static inline unsigned long totalhigh_pages(void) { return 0UL; } -#ifndef ARCH_HAS_KMAP -static inline void *kmap(struct page *page) -{ - might_sleep(); - return page_address(page); -} - -static inline void kunmap(struct page *page) -{ -} - -static inline void *kmap_atomic(struct page *page) -{ - preempt_disable(); - pagefault_disable(); - return page_address(page); -} -#define kmap_atomic_prot(page, prot) kmap_atomic(page) - -static inline void __kunmap_atomic(void *addr) -{ - pagefault_enable(); - preempt_enable(); -} - -#define kmap_atomic_pfn(pfn) kmap_atomic(pfn_to_page(pfn)) - -#define kmap_flush_unused() do {} while(0) -#endif - #endif /* CONFIG_HIGHMEM */ #if defined(CONFIG_HIGHMEM) || defined(CONFIG_X86_32) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 2c471a2c43fa..d17d33f36a01 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -204,6 +204,14 @@ struct page { #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS int _last_cpupid; #endif + +#ifdef CONFIG_XPFO + /* Counts the number of times this page has been kmapped. 
*/ + atomic_t xpfo_mapcount; + + /* Serialize kmap/kunmap of this page */ + spinlock_t xpfo_lock; +#endif } _struct_page_alignment; /* diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 39b4494e29f1..3622e8c33522 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -101,6 +101,10 @@ enum pageflags { #if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT) PG_young, PG_idle, +#endif +#ifdef CONFIG_XPFO + PG_xpfo_user, /* Page is allocated to user-space */ + PG_xpfo_unmapped, /* Page is unmapped from the linear map */ #endif __NR_PAGEFLAGS, @@ -398,6 +402,22 @@ TESTCLEARFLAG(Young, young, PF_ANY) PAGEFLAG(Idle, idle, PF_ANY) #endif +#ifdef CONFIG_XPFO +PAGEFLAG(XpfoUser, xpfo_user, PF_ANY) +TESTCLEARFLAG(XpfoUser, xpfo_user, PF_ANY) +TESTSETFLAG(XpfoUser, xpfo_user, PF_ANY) +#define __PG_XPFO_USER (1UL << PG_xpfo_user) +PAGEFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY) +TESTCLEARFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY) +TESTSETFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY) +#define __PG_XPFO_UNMAPPED (1UL << PG_xpfo_unmapped) +#else +#define __PG_XPFO_USER 0 +PAGEFLAG_FALSE(XpfoUser) +#define __PG_XPFO_UNMAPPED 0 +PAGEFLAG_FALSE(XpfoUnmapped) +#endif + /* * On an anonymous page mapped into a user virtual memory area, * page->mapping points to its anon_vma, not to a struct address_space; @@ -780,7 +800,8 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) * alloc-free cycle to prevent from reusing the page. */ #define PAGE_FLAGS_CHECK_AT_PREP \ - (((1UL << NR_PAGEFLAGS) - 1) & ~__PG_HWPOISON) + (((1UL << NR_PAGEFLAGS) - 1) & ~__PG_HWPOISON & ~__PG_XPFO_USER & \ + ~__PG_XPFO_UNMAPPED) #define PAGE_FLAGS_PRIVATE \ (1UL << PG_private | 1UL << PG_private_2) diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h new file mode 100644 index 000000000000..93a1b5aceca3 --- /dev/null +++ b/include/linux/xpfo.h @@ -0,0 +1,147 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2017 Docker, Inc. + * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P. + * Copyright (C) 2016 Brown University. All rights reserved. + * + * Authors: + * Juerg Haefliger <juerg.haefliger@hpe.com> + * Vasileios P. Kemerlis <vpk@cs.brown.edu> + * Tycho Andersen <tycho@docker.com> + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2 as published by + * the Free Software Foundation. + */ + +#ifndef _LINUX_XPFO_H +#define _LINUX_XPFO_H + +#include <linux/types.h> +#include <linux/dma-direction.h> +#include <linux/uaccess.h> + +struct page; + +#ifdef CONFIG_XPFO + +DECLARE_STATIC_KEY_TRUE(xpfo_inited); + +/* Architecture specific implementations */ +void set_kpte(void *kaddr, struct page *page, pgprot_t prot); +void xpfo_flush_kernel_tlb(struct page *page, int order); + +void xpfo_init_single_page(struct page *page); + +static inline void xpfo_kmap(void *kaddr, struct page *page) +{ + unsigned long flags; + + if (!static_branch_unlikely(&xpfo_inited)) + return; + + if (!PageXpfoUser(page)) + return; + + /* + * The page was previously allocated to user space, so + * map it back into the kernel if needed. No TLB flush required. 
+ */ + spin_lock_irqsave(&page->xpfo_lock, flags); + + if ((atomic_inc_return(&page->xpfo_mapcount) == 1) && + TestClearPageXpfoUnmapped(page)) + set_kpte(kaddr, page, PAGE_KERNEL); + + spin_unlock_irqrestore(&page->xpfo_lock, flags); +} + +static inline void xpfo_kunmap(void *kaddr, struct page *page) +{ + unsigned long flags; + + if (!static_branch_unlikely(&xpfo_inited)) + return; + + if (!PageXpfoUser(page)) + return; + + /* + * The page is to be allocated back to user space, so unmap it from + * the kernel, flush the TLB and tag it as a user page. + */ + spin_lock_irqsave(&page->xpfo_lock, flags); + + if (atomic_dec_return(&page->xpfo_mapcount) == 0) { +#ifdef CONFIG_XPFO_DEBUG + WARN_ON(PageXpfoUnmapped(page)); +#endif + SetPageXpfoUnmapped(page); + set_kpte(kaddr, page, __pgprot(0)); + xpfo_flush_kernel_tlb(page, 0); + } + + spin_unlock_irqrestore(&page->xpfo_lock, flags); +} + +void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp, bool will_map); +void xpfo_free_pages(struct page *page, int order); + +#else /* !CONFIG_XPFO */ + +static inline void xpfo_init_single_page(struct page *page) { } + +static inline void xpfo_kmap(void *kaddr, struct page *page) { } +static inline void xpfo_kunmap(void *kaddr, struct page *page) { } +static inline void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp, + bool will_map) { } +static inline void xpfo_free_pages(struct page *page, int order) { } + +static inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot) { } +static inline void xpfo_flush_kernel_tlb(struct page *page, int order) { } + +#endif /* CONFIG_XPFO */ + +#if (!defined(CONFIG_HIGHMEM)) && (!defined(ARCH_HAS_KMAP)) +static inline void *kmap(struct page *page) +{ + void *kaddr; + + might_sleep(); + kaddr = page_address(page); + xpfo_kmap(kaddr, page); + return kaddr; +} + +static inline void kunmap(struct page *page) +{ + xpfo_kunmap(page_address(page), page); +} + +static inline void *kmap_atomic(struct page *page) +{ + void *kaddr; + + preempt_disable(); + pagefault_disable(); + kaddr = page_address(page); + xpfo_kmap(kaddr, page); + return kaddr; +} + +#define kmap_atomic_prot(page, prot) kmap_atomic(page) + +static inline void __kunmap_atomic(void *addr) +{ + xpfo_kunmap(addr, virt_to_page(addr)); + pagefault_enable(); + preempt_enable(); +} + +#define kmap_atomic_pfn(pfn) kmap_atomic(pfn_to_page(pfn)) + +#define kmap_flush_unused() do {} while (0) + +#endif + +#endif /* _LINUX_XPFO_H */ diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index a1675d43777e..6bb000bb366f 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -79,6 +79,12 @@ #define IF_HAVE_PG_IDLE(flag,string) #endif +#ifdef CONFIG_XPFO +#define IF_HAVE_PG_XPFO(flag,string) ,{1UL << flag, string} +#else +#define IF_HAVE_PG_XPFO(flag,string) +#endif + #define __def_pageflag_names \ {1UL << PG_locked, "locked" }, \ {1UL << PG_waiters, "waiters" }, \ @@ -105,7 +111,9 @@ IF_HAVE_PG_MLOCK(PG_mlocked, "mlocked" ) \ IF_HAVE_PG_UNCACHED(PG_uncached, "uncached" ) \ IF_HAVE_PG_HWPOISON(PG_hwpoison, "hwpoison" ) \ IF_HAVE_PG_IDLE(PG_young, "young" ) \ -IF_HAVE_PG_IDLE(PG_idle, "idle" ) +IF_HAVE_PG_IDLE(PG_idle, "idle" ) \ +IF_HAVE_PG_XPFO(PG_xpfo_user, "xpfo_user" ) \ +IF_HAVE_PG_XPFO(PG_xpfo_unmapped, "xpfo_unmapped" ) \ #define show_page_flags(flags) \ (flags) ? 
__print_flags(flags, "|", \ diff --git a/mm/Makefile b/mm/Makefile index d210cc9d6f80..e99e1e6ae5ae 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -99,3 +99,4 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o obj-$(CONFIG_HMM) += hmm.o obj-$(CONFIG_MEMFD_CREATE) += memfd.o +obj-$(CONFIG_XPFO) += xpfo.o diff --git a/mm/compaction.c b/mm/compaction.c index ef29490b0f46..fdd5d9783adb 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -78,7 +78,7 @@ static void map_pages(struct list_head *list) order = page_private(page); nr_pages = 1 << order; - post_alloc_hook(page, order, __GFP_MOVABLE); + post_alloc_hook(page, order, __GFP_MOVABLE, false); if (order) split_page(page, order); diff --git a/mm/internal.h b/mm/internal.h index f4a7bb02decf..e076e51376df 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -165,7 +165,7 @@ extern void memblock_free_pages(struct page *page, unsigned long pfn, unsigned int order); extern void prep_compound_page(struct page *page, unsigned int order); extern void post_alloc_hook(struct page *page, unsigned int order, - gfp_t gfp_flags); + gfp_t gfp_flags, bool will_map); extern int user_min_free_kbytes; #if defined CONFIG_COMPACTION || defined CONFIG_CMA diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0b9f577b1a2a..2e0dda1322a2 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1062,6 +1062,7 @@ static __always_inline bool free_pages_prepare(struct page *page, if (bad) return false; + xpfo_free_pages(page, order); page_cpupid_reset_last(page); page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; reset_page_owner(page, order); @@ -1229,6 +1230,7 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn, if (!is_highmem_idx(zone)) set_page_address(page, __va(pfn << PAGE_SHIFT)); #endif + xpfo_init_single_page(page); } #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT @@ -1938,7 +1940,7 @@ static bool check_new_pages(struct page *page, unsigned int order) } inline void post_alloc_hook(struct page *page, unsigned int order, - gfp_t gfp_flags) + gfp_t gfp_flags, bool will_map) { set_page_private(page, 0); set_page_refcounted(page); @@ -1947,6 +1949,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order, kernel_map_pages(page, 1 << order, 1); kernel_poison_pages(page, 1 << order, 1); kasan_alloc_pages(page, order); + xpfo_alloc_pages(page, order, gfp_flags, will_map); set_page_owner(page, order, gfp_flags); } @@ -1954,10 +1957,11 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags unsigned int alloc_flags) { int i; + bool needs_zero = !free_pages_prezeroed() && (gfp_flags & __GFP_ZERO); - post_alloc_hook(page, order, gfp_flags); + post_alloc_hook(page, order, gfp_flags, needs_zero); - if (!free_pages_prezeroed() && (gfp_flags & __GFP_ZERO)) + if (needs_zero) for (i = 0; i < (1 << order); i++) clear_highpage(page + i); diff --git a/mm/page_isolation.c b/mm/page_isolation.c index ce323e56b34d..d8730dd134a9 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -137,7 +137,7 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype) out: spin_unlock_irqrestore(&zone->lock, flags); if (isolated_page) { - post_alloc_hook(page, order, __GFP_MOVABLE); + post_alloc_hook(page, order, __GFP_MOVABLE, false); __free_pages(page, order); } } diff --git a/mm/xpfo.c b/mm/xpfo.c new file mode 100644 index 000000000000..b74fee0479e7 --- /dev/null +++ b/mm/xpfo.c @@ -0,0 +1,106 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2017 Docker, Inc. 
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P. + * Copyright (C) 2016 Brown University. All rights reserved. + * + * Authors: + * Juerg Haefliger <juerg.haefliger@hpe.com> + * Vasileios P. Kemerlis <vpk@cs.brown.edu> + * Tycho Andersen <tycho@docker.com> + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2 as published by + * the Free Software Foundation. + */ + +#include <linux/mm.h> +#include <linux/module.h> +#include <linux/xpfo.h> + +#include <asm/tlbflush.h> + +DEFINE_STATIC_KEY_TRUE(xpfo_inited); +EXPORT_SYMBOL_GPL(xpfo_inited); + +static int __init noxpfo_param(char *str) +{ + static_branch_disable(&xpfo_inited); + + return 0; +} + +early_param("noxpfo", noxpfo_param); + +bool __init xpfo_enabled(void) +{ + if (!static_branch_unlikely(&xpfo_inited)) + return false; + else + return true; +} + +void __meminit xpfo_init_single_page(struct page *page) +{ + spin_lock_init(&page->xpfo_lock); +} + +void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp, bool will_map) +{ + int i; + bool flush_tlb = false; + + if (!static_branch_unlikely(&xpfo_inited)) + return; + + for (i = 0; i < (1 << order); i++) { +#ifdef CONFIG_XPFO_DEBUG + WARN_ON(PageXpfoUser(page + i)); + WARN_ON(PageXpfoUnmapped(page + i)); + lockdep_assert_held(&(page + i)->xpfo_lock); + WARN_ON(atomic_read(&(page + i)->xpfo_mapcount)); +#endif + if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) { + /* + * Tag the page as a user page and flush the TLB if it + * was previously allocated to the kernel. + */ + if ((!TestSetPageXpfoUser(page + i)) || !will_map) { + SetPageXpfoUnmapped(page + i); + flush_tlb = true; + } + } else { + /* Tag the page as a non-user (kernel) page */ + ClearPageXpfoUser(page + i); + } + } + + if (flush_tlb) { + set_kpte(page_address(page), page, __pgprot(0)); + xpfo_flush_kernel_tlb(page, order); + } +} + +void xpfo_free_pages(struct page *page, int order) +{ + int i; + + if (!static_branch_unlikely(&xpfo_inited)) + return; + + for (i = 0; i < (1 << order); i++) { +#ifdef CONFIG_XPFO_DEBUG + WARN_ON(atomic_read(&(page + i)->xpfo_mapcount)); +#endif + + /* + * Map the page back into the kernel if it was previously + * allocated to user space. + */ + if (TestClearPageXpfoUser(page + i)) { + ClearPageXpfoUnmapped(page + i); + set_kpte(page_address(page + i), page + i, + PAGE_KERNEL); + } + } +} diff --git a/security/Kconfig b/security/Kconfig index e4fe2f3c2c65..3636ba7e2615 100644 --- a/security/Kconfig +++ b/security/Kconfig @@ -6,6 +6,33 @@ menu "Security options" source "security/keys/Kconfig" +config ARCH_SUPPORTS_XPFO + bool + +config XPFO + bool "Enable eXclusive Page Frame Ownership (XPFO)" + depends on ARCH_SUPPORTS_XPFO + help + This option offers protection against 'ret2dir' kernel attacks. + When enabled, every time a page frame is allocated to user space, it + is unmapped from the direct mapped RAM region in kernel space + (physmap). Similarly, when a page frame is freed/reclaimed, it is + mapped back to physmap. + + There is a slight performance impact when this option is enabled. + + If in doubt, say "N". + +config XPFO_DEBUG + bool "Enable debugging of XPFO" + depends on XPFO + help + Enables additional checking of XPFO data structures that help find + bugs in the XPFO implementation. This option comes with a slight + performance cost. + + If in doubt, say "N". + config SECURITY_DMESG_RESTRICT bool "Restrict unprivileged access to the kernel syslog" default n
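A minimal usage sketch, for readers following the series: how a kernel path would touch a user-owned page through the XPFO-aware kmap()/kunmap() wrappers added above. The helper name and the memcpy() access are purely illustrative and not part of the patch; the point is that the first kmap() transparently restores the page's physmap entry and the last kunmap() removes it again and flushes the TLB.

#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/string.h>

/* Illustrative only -- not part of this series. */
static void xpfo_example_read_page(struct page *page, void *dst)
{
	void *kaddr;

	kaddr = kmap(page);		/* xpfo_kmap(): page re-enters the physmap */
	memcpy(dst, kaddr, PAGE_SIZE);	/* kernel-visible only inside this window */
	kunmap(page);			/* xpfo_kunmap(): unmap + TLB flush on last unmap */
}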
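This patch only declares the arch hooks set_kpte() and xpfo_flush_kernel_tlb(); the x86 implementations live elsewhere in the series. As a rough idea of what they have to do, a sketch along these lines -- assuming the direct map has already been split down to 4K pages, which the real arch patches take care of:

/* Sketch only: not the series' actual x86 code. */
void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
{
	unsigned int level;
	pte_t *pte = lookup_address((unsigned long)kaddr, &level);

	if (WARN_ON_ONCE(!pte || level != PG_LEVEL_4K))
		return;		/* huge mapping not split; bail out */

	/* pgprot_val(prot) == 0 yields a non-present PTE, i.e. an unmap. */
	set_pte(pte, pfn_pte(page_to_pfn(page), prot));
}

void xpfo_flush_kernel_tlb(struct page *page, int order)
{
	unsigned long kaddr = (unsigned long)page_address(page);

	flush_tlb_kernel_range(kaddr, kaddr + (PAGE_SIZE << order));
}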
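Also for reference, the PAGEFLAG()/TESTSETFLAG()/TESTCLEARFLAG() lines added to page-flags.h expand into the helper family used throughout this patch. A simplified sketch of what the generated helpers boil down to (the PF_ANY policy means no compound_head() redirection):

static inline int PageXpfoUser(struct page *page)
{
	return test_bit(PG_xpfo_user, &page->flags);
}

static inline void SetPageXpfoUser(struct page *page)
{
	set_bit(PG_xpfo_user, &page->flags);
}

static inline int TestSetPageXpfoUser(struct page *page)
{
	return test_and_set_bit(PG_xpfo_user, &page->flags);
}

static inline int TestClearPageXpfoUser(struct page *page)
{
	return test_and_clear_bit(PG_xpfo_user, &page->flags);
}

(and likewise for the PG_xpfo_unmapped variants).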