| Message ID | 20200228154322.329228-4-imbrenda@linux.ibm.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | add callbacks for inaccessible pages |
Andrew,

while patch 1 is a fixup for the FOLL_PIN work in your patch queue,
I would really love to see this patch in 5.7. The exploitation code
of kvm/s390 is in Linux next also scheduled for 5.7.

Christian

On 28.02.20 16:43, Claudio Imbrenda wrote:
> With the introduction of protected KVM guests on s390 there is now a
> concept of inaccessible pages. These pages need to be made accessible
> before the host can access them.
> 
> While cpu accesses will trigger a fault that can be resolved, I/O
> accesses will just fail. We need to add a callback into architecture
> code for places that will do I/O, namely when writeback is started or
> when a page reference is taken.
> 
> This is not only to enable paging, file backing etc, it is also
> necessary to protect the host against a malicious user space. For
> example a bad QEMU could simply start direct I/O on such protected
> memory. We do not want userspace to be able to trigger I/O errors and
> thus we the logic is "whenever somebody accesses that page (gup) or
> does I/O, make sure that this page can be accessed". When the guest
> tries to access that page we will wait in the page fault handler for
> writeback to have finished and for the page_ref to be the expected
> value.
> 
> On s390x the function is not supposed to fail, so it is ok to use a
> WARN_ON on failure. If we ever need some more finegrained handling
> we can tackle this when we know the details.
> 
> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> Acked-by: Will Deacon <will@kernel.org>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
> ---
>  include/linux/gfp.h |  6 ++++++
>  mm/gup.c            | 19 ++++++++++++++++---
>  mm/page-writeback.c |  5 +++++
>  3 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index e5b817cb86e7..be2754841369 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -485,6 +485,12 @@ static inline void arch_free_page(struct page *page, int order) { }
>  #ifndef HAVE_ARCH_ALLOC_PAGE
>  static inline void arch_alloc_page(struct page *page, int order) { }
>  #endif
> +#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE
> +static inline int arch_make_page_accessible(struct page *page)
> +{
> +	return 0;
> +}
> +#endif
>  
>  struct page *
>  __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> diff --git a/mm/gup.c b/mm/gup.c
> index 0b9a806898f3..86fff6e4e4f3 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -391,6 +391,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  	struct page *page;
>  	spinlock_t *ptl;
>  	pte_t *ptep, pte;
> +	int ret;
>  
>  	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
>  	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
> @@ -449,8 +450,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  	if (is_zero_pfn(pte_pfn(pte))) {
>  		page = pte_page(pte);
>  	} else {
> -		int ret;
> -
>  		ret = follow_pfn_pte(vma, address, ptep, flags);
>  		page = ERR_PTR(ret);
>  		goto out;
> @@ -458,7 +457,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  	}
>  
>  	if (flags & FOLL_SPLIT && PageTransCompound(page)) {
> -		int ret;
>  		get_page(page);
>  		pte_unmap_unlock(ptep, ptl);
>  		lock_page(page);
> @@ -475,6 +473,14 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  		page = ERR_PTR(-ENOMEM);
>  		goto out;
>  	}
> +	if (flags & FOLL_PIN) {
> +		ret = arch_make_page_accessible(page);
> +		if (ret) {
> +			unpin_user_page(page);
> +			page = ERR_PTR(ret);
> +			goto out;
> +		}
> +	}
>  	if (flags & FOLL_TOUCH) {
>  		if ((flags & FOLL_WRITE) &&
>  		    !pte_dirty(pte) && !PageDirty(page))
> @@ -2143,6 +2149,13 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  
>  		VM_BUG_ON_PAGE(compound_head(page) != head, page);
>  
> +		if (flags & FOLL_PIN) {
> +			ret = arch_make_page_accessible(page);
> +			if (ret) {
> +				unpin_user_page(page);
> +				goto pte_unmap;
> +			}
> +		}
>  		SetPageReferenced(page);
>  		pages[*nr] = page;
>  		(*nr)++;
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index ab5a3cee8ad3..8384be5a2758 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2807,6 +2807,11 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
>  		inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
>  	}
>  	unlock_page_memcg(page);
> +	/*
> +	 * If writeback has been triggered on a page that cannot be made
> +	 * accessible, it is too late.
> +	 */
> +	WARN_ON(arch_make_page_accessible(page));
>  	return ret;
>  
>  }
On 2/28/20 8:08 AM, Christian Borntraeger wrote:
> Andrew,
> 
> while patch 1 is a fixup for the FOLL_PIN work in your patch queue,
> I would really love to see this patch in 5.7. The exploitation code
> of kvm/s390 is in Linux next also scheduled for 5.7.
> 
> Christian
> 
> On 28.02.20 16:43, Claudio Imbrenda wrote:
>> With the introduction of protected KVM guests on s390 there is now a
>> concept of inaccessible pages. These pages need to be made accessible
>> before the host can access them.
>>
>> While cpu accesses will trigger a fault that can be resolved, I/O
>> accesses will just fail. We need to add a callback into architecture
>> code for places that will do I/O, namely when writeback is started or
>> when a page reference is taken.
>>
>> This is not only to enable paging, file backing etc, it is also
>> necessary to protect the host against a malicious user space. For
>> example a bad QEMU could simply start direct I/O on such protected
>> memory. We do not want userspace to be able to trigger I/O errors and
>> thus we the logic is "whenever somebody accesses that page (gup) or

I actually kind of like the sound of that: "We the logic of the kernel,
in order to form a more perfect computer..." :)

Probably this wording is what you want, though:

"thus the logic is "whenever somebody (gup) accesses that page or"

...

>> @@ -458,7 +457,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>>  	}
>>  
>>  	if (flags & FOLL_SPLIT && PageTransCompound(page)) {
>> -		int ret;
>>  		get_page(page);
>>  		pte_unmap_unlock(ptep, ptl);
>>  		lock_page(page);
>> @@ -475,6 +473,14 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>>  		page = ERR_PTR(-ENOMEM);
>>  		goto out;
>>  	}
>> +	if (flags & FOLL_PIN) {

What about FOLL_GET? Unless your calling code has some sort of
BUG_ON(flags & FOLL_GET), I'm not sure it's a good idea to leave that
case unhandled.

>> +		ret = arch_make_page_accessible(page);
>> +		if (ret) {
>> +			unpin_user_page(page);
>> +			page = ERR_PTR(ret);
>> +			goto out;
>> +		}
>> +	}
>>  	if (flags & FOLL_TOUCH) {
>>  		if ((flags & FOLL_WRITE) &&
>>  		    !pte_dirty(pte) && !PageDirty(page))
>> @@ -2143,6 +2149,13 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>>  
>>  		VM_BUG_ON_PAGE(compound_head(page) != head, page);
>>  
>> +		if (flags & FOLL_PIN) {
>> +			ret = arch_make_page_accessible(page);
>> +			if (ret) {
>> +				unpin_user_page(page);

Same concern as above, about leaving FOLL_GET unhandled.

thanks,
On Fri, 28 Feb 2020 16:08:23 -0800
John Hubbard <jhubbard@nvidia.com> wrote:

> On 2/28/20 8:08 AM, Christian Borntraeger wrote:
> > Andrew,
> > 
> > while patch 1 is a fixup for the FOLL_PIN work in your patch queue,
> > I would really love to see this patch in 5.7. The exploitation code
> > of kvm/s390 is in Linux next also scheduled for 5.7.
> > 
> > Christian
> > 
> > On 28.02.20 16:43, Claudio Imbrenda wrote:
> >> With the introduction of protected KVM guests on s390 there is now
> >> a concept of inaccessible pages. These pages need to be made
> >> accessible before the host can access them.
> >>
> >> While cpu accesses will trigger a fault that can be resolved, I/O
> >> accesses will just fail. We need to add a callback into
> >> architecture code for places that will do I/O, namely when
> >> writeback is started or when a page reference is taken.
> >>
> >> This is not only to enable paging, file backing etc, it is also
> >> necessary to protect the host against a malicious user space. For
> >> example a bad QEMU could simply start direct I/O on such protected
> >> memory. We do not want userspace to be able to trigger I/O errors
> >> and thus we the logic is "whenever somebody accesses that page
> >> (gup) or
> 
> I actually kind of like the sound of that: "We the logic of the
> kernel, in order to form a more perfect computer..." :)
> 
> Probably this wording is what you want, though:
> 
> "thus the logic is "whenever somebody (gup) accesses that page or"
> 
> ...
> >> @@ -458,7 +457,6 @@ static struct page *follow_page_pte(struct
> >> vm_area_struct *vma, }
> >>
> >> if (flags & FOLL_SPLIT && PageTransCompound(page)) {
> >> -	int ret;
> >> 	get_page(page);
> >> 	pte_unmap_unlock(ptep, ptl);
> >> 	lock_page(page);
> >> @@ -475,6 +473,14 @@ static struct page *follow_page_pte(struct
> >> vm_area_struct *vma, page = ERR_PTR(-ENOMEM);
> >> 	goto out;
> >> }
> >> +	if (flags & FOLL_PIN) {
> 
> What about FOLL_GET? Unless your calling code has some sort of
> BUG_ON(flags & FOLL_GET), I'm not sure it's a good idea to leave that
> case unhandled.

if I understood the semantics of FOLL_PIN correctly, then we don't need
to make the page accessible for FOLL_GET. FOLL_PIN indicates intent to
access the content of the page, whereas FOLL_GET is only for the struct
page.

if we are not touching the content of the page, there is no need to
make it accessible

> >> +	ret = arch_make_page_accessible(page);
> >> +	if (ret) {
> >> +		unpin_user_page(page);
> >> +		page = ERR_PTR(ret);
> >> +		goto out;
> >> +	}
> >> +	}
> >> 	if (flags & FOLL_TOUCH) {
> >> 		if ((flags & FOLL_WRITE) &&
> >> 		    !pte_dirty(pte) && !PageDirty(page))
> >> @@ -2143,6 +2149,13 @@ static int gup_pte_range(pmd_t pmd,
> >> unsigned long addr, unsigned long end,
> >> 	VM_BUG_ON_PAGE(compound_head(page) != head, page);
> >>
> >> +	if (flags & FOLL_PIN) {
> >> +		ret = arch_make_page_accessible(page);
> >> +		if (ret) {
> >> +			unpin_user_page(page);
> 
> Same concern as above, about leaving FOLL_GET unhandled.

and same answer as above :)
On 2/29/20 2:49 AM, Claudio Imbrenda wrote:
>> ...
>>>> @@ -458,7 +457,6 @@ static struct page *follow_page_pte(struct
>>>> vm_area_struct *vma, }
>>>>
>>>> if (flags & FOLL_SPLIT && PageTransCompound(page)) {
>>>> -	int ret;
>>>> 	get_page(page);
>>>> 	pte_unmap_unlock(ptep, ptl);
>>>> 	lock_page(page);
>>>> @@ -475,6 +473,14 @@ static struct page *follow_page_pte(struct
>>>> vm_area_struct *vma, page = ERR_PTR(-ENOMEM);
>>>> 	goto out;
>>>> }
>>>> +	if (flags & FOLL_PIN) {
>>
>> What about FOLL_GET? Unless your calling code has some sort of
>> BUG_ON(flags & FOLL_GET), I'm not sure it's a good idea to leave that
>> case unhandled.
> 
> if I understood the semantics of FOLL_PIN correctly, then we don't need
> to make the page accessible for FOLL_GET. FOLL_PIN indicates intent to
> access the content of the page, whereas FOLL_GET is only for the struct
> page.
> 
> if we are not touching the content of the page, there is no need to
> make it accessible

OK, I hope I'm not overlooking anything, but that sounds correct to me.

thanks,
On Fri, 28 Feb 2020 17:08:19 +0100
Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> while patch 1 is a fixup for the FOLL_PIN work in your patch queue,
> I would really love to see this patch in 5.7. The exploitation code
> of kvm/s390 is in Linux next also scheduled for 5.7.

Sounds good. My inbox eagerly awaits v2 ;)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index e5b817cb86e7..be2754841369 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -485,6 +485,12 @@ static inline void arch_free_page(struct page *page, int order) { }
 #ifndef HAVE_ARCH_ALLOC_PAGE
 static inline void arch_alloc_page(struct page *page, int order) { }
 #endif
+#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE
+static inline int arch_make_page_accessible(struct page *page)
+{
+	return 0;
+}
+#endif
 
 struct page *
 __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
diff --git a/mm/gup.c b/mm/gup.c
index 0b9a806898f3..86fff6e4e4f3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -391,6 +391,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	struct page *page;
 	spinlock_t *ptl;
 	pte_t *ptep, pte;
+	int ret;
 
 	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
 	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
@@ -449,8 +450,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	if (is_zero_pfn(pte_pfn(pte))) {
 		page = pte_page(pte);
 	} else {
-		int ret;
-
 		ret = follow_pfn_pte(vma, address, ptep, flags);
 		page = ERR_PTR(ret);
 		goto out;
@@ -458,7 +457,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 
 	if (flags & FOLL_SPLIT && PageTransCompound(page)) {
-		int ret;
 		get_page(page);
 		pte_unmap_unlock(ptep, ptl);
 		lock_page(page);
@@ -475,6 +473,14 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		page = ERR_PTR(-ENOMEM);
 		goto out;
 	}
+	if (flags & FOLL_PIN) {
+		ret = arch_make_page_accessible(page);
+		if (ret) {
+			unpin_user_page(page);
+			page = ERR_PTR(ret);
+			goto out;
+		}
+	}
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
@@ -2143,6 +2149,13 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 
+		if (flags & FOLL_PIN) {
+			ret = arch_make_page_accessible(page);
+			if (ret) {
+				unpin_user_page(page);
+				goto pte_unmap;
+			}
+		}
 		SetPageReferenced(page);
 		pages[*nr] = page;
 		(*nr)++;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index ab5a3cee8ad3..8384be5a2758 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2807,6 +2807,11 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 		inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 	}
 	unlock_page_memcg(page);
+	/*
+	 * If writeback has been triggered on a page that cannot be made
+	 * accessible, it is too late.
+	 */
+	WARN_ON(arch_make_page_accessible(page));
 	return ret;
 
 }