Message ID | 13-v2-987e91b59705+36b-gup_tidy_jgg@nvidia.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Simplify the external interface for GUP | expand |
On 1/24/23 12:34, Jason Gunthorpe wrote: > Move the flags that should not/are not used outside gup.c and related into > mm/internal.h to discourage driver abuse. > > To make this more maintainable going forward compact the two FOLL ranges > with new bit numbers from 0 to 11 and 16 to 21, using shifts so it is > explict. > > Switch to an enum so the whole thing is easier to read. > > Cc: David Howells <dhowells@redhat.com> > Cc: Christoph Hellwig <hch@infradead.org> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> > --- > include/linux/mm_types.h | 57 ++++++++++++++++++++++++---------------- > mm/internal.h | 15 +++++++++++ > 2 files changed, 50 insertions(+), 22 deletions(-) Reviewed-by: John Hubbard <jhubbard@nvidia.com> thanks,
On 24.01.23 21:34, Jason Gunthorpe wrote: > Move the flags that should not/are not used outside gup.c and related into > mm/internal.h to discourage driver abuse. > > To make this more maintainable going forward compact the two FOLL ranges > with new bit numbers from 0 to 11 and 16 to 21, using shifts so it is > explict. > > Switch to an enum so the whole thing is easier to read. Using a __bitwise type would be even better, but that requires quite some adjustments ... The primary leftover for FOLL_GET seems to be follow_page(). IIRC, there is only one caller that doesn't pass FOLL_GET (s390). We could either add a new function to "probe" that anything is mapped (IIRC that's the use case), or simply ref+unref. So we might just be able to make FOLL_GET internal fairly easily, too ... Anyhow Acked-by: David Hildenbrand <david@redhat.com>
On Thu, Jan 26, 2023 at 01:48:46PM +0100, David Hildenbrand wrote: > On 24.01.23 21:34, Jason Gunthorpe wrote: > > Move the flags that should not/are not used outside gup.c and related into > > mm/internal.h to discourage driver abuse. > > > > To make this more maintainable going forward compact the two FOLL ranges > > with new bit numbers from 0 to 11 and 16 to 21, using shifts so it is > > explict. > > > > Switch to an enum so the whole thing is easier to read. > > Using a __bitwise type would be even better, but that requires quite some > adjustments ... > > The primary leftover for FOLL_GET seems to be follow_page(). IIRC, there is > only one caller that doesn't pass FOLL_GET (s390). We could either add a new > function to "probe" that anything is mapped (IIRC that's the use case), or > simply ref+unref. Is that code even safe as written? I don't really understand how it can safely call lock_page() on something it doesn't have a reference too ? So adding the FOLL_GET and put_page seems like a good idea to me? At a minimum this should get a comment to explain why it is OK. S390 people? int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb) { [..] rc = -ENXIO; page = follow_page(vma, uaddr, FOLL_WRITE); if (IS_ERR_OR_NULL(page)) goto out; lock_page(page); ptep = get_locked_pte(gmap->mm, uaddr, &ptelock); Jason
On 26.01.23 13:55, Jason Gunthorpe wrote: > On Thu, Jan 26, 2023 at 01:48:46PM +0100, David Hildenbrand wrote: >> On 24.01.23 21:34, Jason Gunthorpe wrote: >>> Move the flags that should not/are not used outside gup.c and related into >>> mm/internal.h to discourage driver abuse. >>> >>> To make this more maintainable going forward compact the two FOLL ranges >>> with new bit numbers from 0 to 11 and 16 to 21, using shifts so it is >>> explict. >>> >>> Switch to an enum so the whole thing is easier to read. >> >> Using a __bitwise type would be even better, but that requires quite some >> adjustments ... >> >> The primary leftover for FOLL_GET seems to be follow_page(). IIRC, there is >> only one caller that doesn't pass FOLL_GET (s390). We could either add a new >> function to "probe" that anything is mapped (IIRC that's the use case), or >> simply ref+unref. > > Is that code even safe as written? I don't really understand how it > can safely call lock_page() on something it doesn't have a reference > too ? Let me look into the details ... I remember reviewing that before I got to study the beauty of GUP in more detail. CCin Claudio > > So adding the FOLL_GET and put_page seems like a good idea to me? At a > minimum this should get a comment to explain why it is OK. > At first sight, it really feels like the right thing to do. Maybe another good reason to handle FOLL_GET vs. FOLL_PIN completely internal. Harder to abuse. > S390 people? > > int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb) > { > [..] > rc = -ENXIO; > page = follow_page(vma, uaddr, FOLL_WRITE); > if (IS_ERR_OR_NULL(page)) > goto out; > > lock_page(page); > ptep = get_locked_pte(gmap->mm, uaddr, &ptelock); > > Jason >
On Thu, 26 Jan 2023 08:55:27 -0400 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Thu, Jan 26, 2023 at 01:48:46PM +0100, David Hildenbrand wrote: > > On 24.01.23 21:34, Jason Gunthorpe wrote: > > > Move the flags that should not/are not used outside gup.c and related into > > > mm/internal.h to discourage driver abuse. > > > > > > To make this more maintainable going forward compact the two FOLL ranges > > > with new bit numbers from 0 to 11 and 16 to 21, using shifts so it is > > > explict. > > > > > > Switch to an enum so the whole thing is easier to read. > > > > Using a __bitwise type would be even better, but that requires quite some > > adjustments ... > > > > The primary leftover for FOLL_GET seems to be follow_page(). IIRC, there is > > only one caller that doesn't pass FOLL_GET (s390). We could either add a new > > function to "probe" that anything is mapped (IIRC that's the use case), or > > simply ref+unref. > > Is that code even safe as written? I don't really understand how it yes (surprisingly) it is > can safely call lock_page() on something it doesn't have a reference > too ? the code between lock_page and unlock_page will behave "properly" and do nothing or at worst cause a tiny performance issue in the rare case something changes between the follow_page and the page_lock, i.e. if things are done on the wrong page. make_secure_pte does some very hacky stuff involving page_ref_freeze, with expected_page_refs counting how many references are expected. the code is how it is because, when it was written, FOLL_PIN did not exist, and FOLL_GET caused the page to be converted to unsecure. I guess maybe we can make it work with FOLL_GET if expected_page_refs takes into account the extra reference? (and then maybe we could a put_page after unlock_page) > > So adding the FOLL_GET and put_page seems like a good idea to me? At a > minimum this should get a comment to explain why it is OK. > > S390 people? > > int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb) > { > [..] > rc = -ENXIO; > page = follow_page(vma, uaddr, FOLL_WRITE); > if (IS_ERR_OR_NULL(page)) > goto out; > > lock_page(page); > ptep = get_locked_pte(gmap->mm, uaddr, &ptelock); > > Jason
On 26.01.23 15:41, Claudio Imbrenda wrote: > On Thu, 26 Jan 2023 08:55:27 -0400 > Jason Gunthorpe <jgg@nvidia.com> wrote: > >> On Thu, Jan 26, 2023 at 01:48:46PM +0100, David Hildenbrand wrote: >>> On 24.01.23 21:34, Jason Gunthorpe wrote: >>>> Move the flags that should not/are not used outside gup.c and related into >>>> mm/internal.h to discourage driver abuse. >>>> >>>> To make this more maintainable going forward compact the two FOLL ranges >>>> with new bit numbers from 0 to 11 and 16 to 21, using shifts so it is >>>> explict. >>>> >>>> Switch to an enum so the whole thing is easier to read. >>> >>> Using a __bitwise type would be even better, but that requires quite some >>> adjustments ... >>> >>> The primary leftover for FOLL_GET seems to be follow_page(). IIRC, there is >>> only one caller that doesn't pass FOLL_GET (s390). We could either add a new >>> function to "probe" that anything is mapped (IIRC that's the use case), or >>> simply ref+unref. >> >> Is that code even safe as written? I don't really understand how it > > yes (surprisingly) it is > >> can safely call lock_page() on something it doesn't have a reference >> too ? > > the code between lock_page and unlock_page will behave "properly" and > do nothing or at worst cause a tiny performance issue in the rare case > something changes between the follow_page and the page_lock, i.e. if > things are done on the wrong page. What prevents the page from getting unmapped (MADV_DONTNEED), freed, reallocated as a larger folio and the unlock_page() would target the wrong bit? I think even while freeing a locked page we might run into trouble ...
On Thu, Jan 26, 2023 at 03:46:09PM +0100, David Hildenbrand wrote: > On 26.01.23 15:41, Claudio Imbrenda wrote: > > On Thu, 26 Jan 2023 08:55:27 -0400 > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > On Thu, Jan 26, 2023 at 01:48:46PM +0100, David Hildenbrand wrote: > > > > On 24.01.23 21:34, Jason Gunthorpe wrote: > > > > > Move the flags that should not/are not used outside gup.c and related into > > > > > mm/internal.h to discourage driver abuse. > > > > > > > > > > To make this more maintainable going forward compact the two FOLL ranges > > > > > with new bit numbers from 0 to 11 and 16 to 21, using shifts so it is > > > > > explict. > > > > > > > > > > Switch to an enum so the whole thing is easier to read. > > > > > > > > Using a __bitwise type would be even better, but that requires quite some > > > > adjustments ... > > > > > > > > The primary leftover for FOLL_GET seems to be follow_page(). IIRC, there is > > > > only one caller that doesn't pass FOLL_GET (s390). We could either add a new > > > > function to "probe" that anything is mapped (IIRC that's the use case), or > > > > simply ref+unref. > > > > > > Is that code even safe as written? I don't really understand how it > > > > yes (surprisingly) it is > > > > > can safely call lock_page() on something it doesn't have a reference > > > too ? > > > > the code between lock_page and unlock_page will behave "properly" and > > do nothing or at worst cause a tiny performance issue in the rare case > > something changes between the follow_page and the page_lock, i.e. if > > things are done on the wrong page. > > What prevents the page from getting unmapped (MADV_DONTNEED), freed, > reallocated as a larger folio and the unlock_page() would target the wrong > bit? I think even while freeing a locked page we might run into trouble ... Yep. The issue is you can't call lock_page() on something you don't have a ref to. The worst case would be the memory got unmapped from the VMA and the entire memory space was hot-unpluged eg it was DAX or something. Now the page pointer will oops if you call lock_page. Why not just use the get_locked_pte() exclusively and do -EAGAIN or -EBUSY if folio_try_lock fails, under the PTL? This already happens for PageWriteback caes. Jason
On Thu, 26 Jan 2023 11:05:27 -0400 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Thu, Jan 26, 2023 at 03:46:09PM +0100, David Hildenbrand wrote: > > On 26.01.23 15:41, Claudio Imbrenda wrote: > > > On Thu, 26 Jan 2023 08:55:27 -0400 > > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > > > On Thu, Jan 26, 2023 at 01:48:46PM +0100, David Hildenbrand wrote: > > > > > On 24.01.23 21:34, Jason Gunthorpe wrote: > > > > > > Move the flags that should not/are not used outside gup.c and related into > > > > > > mm/internal.h to discourage driver abuse. > > > > > > > > > > > > To make this more maintainable going forward compact the two FOLL ranges > > > > > > with new bit numbers from 0 to 11 and 16 to 21, using shifts so it is > > > > > > explict. > > > > > > > > > > > > Switch to an enum so the whole thing is easier to read. > > > > > > > > > > Using a __bitwise type would be even better, but that requires quite some > > > > > adjustments ... > > > > > > > > > > The primary leftover for FOLL_GET seems to be follow_page(). IIRC, there is > > > > > only one caller that doesn't pass FOLL_GET (s390). We could either add a new > > > > > function to "probe" that anything is mapped (IIRC that's the use case), or > > > > > simply ref+unref. > > > > > > > > Is that code even safe as written? I don't really understand how it > > > > > > yes (surprisingly) it is > > > > > > > can safely call lock_page() on something it doesn't have a reference > > > > too ? > > > > > > the code between lock_page and unlock_page will behave "properly" and > > > do nothing or at worst cause a tiny performance issue in the rare case > > > something changes between the follow_page and the page_lock, i.e. if > > > things are done on the wrong page. > > > > What prevents the page from getting unmapped (MADV_DONTNEED), freed, > > reallocated as a larger folio and the unlock_page() would target the wrong > > bit? I think even while freeing a locked page we might run into trouble ... > > Yep. > > The issue is you can't call lock_page() on something you don't have a > ref to. so we have been doing this wrong the whole time? oops > > The worst case would be the memory got unmapped from the VMA and the > entire memory space was hot-unpluged eg it was DAX or something. Now > the page pointer will oops if you call lock_page. we do not have memory mapped devices or anything, so this scenario is highly unlikely (at last this) > > Why not just use the get_locked_pte() exclusively and do -EAGAIN or > -EBUSY if folio_try_lock fails, under the PTL? This already happens > for PageWriteback caes. I think I will need some time to process this sentence I can tell you that the original goal of that function is to make sure that there are no extra references. in particular, we want to prevent I/O of any kind to be ongoing while the page becomes secure. (the I/O will fail and, depending on which device it was, the whole system might end up in a rather unhappy state) transitioning from secure to non-secure instead is much easier > > Jason
On Thu, Jan 26, 2023 at 04:39:02PM +0100, Claudio Imbrenda wrote: > I can tell you that the original goal of that function is to make sure > that there are no extra references. in particular, we want to prevent > I/O of any kind to be ongoing while the page becomes secure. (the I/O > will fail and, depending on which device it was, the whole system might > end up in a rather unhappy state) Sure, but if there is concurrent IO you just try again right? It doesn't wait for refs to drop for instance. So make the lock_page work the same way: diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c index 9f18a4af9c1319..847ee50b8672c6 100644 --- a/arch/s390/kernel/uv.c +++ b/arch/s390/kernel/uv.c @@ -193,20 +193,11 @@ static int expected_page_refs(struct page *page) } static int make_secure_pte(pte_t *ptep, unsigned long addr, - struct page *exp_page, struct uv_cb_header *uvcb) + struct page *page, struct uv_cb_header *uvcb) { pte_t entry = READ_ONCE(*ptep); - struct page *page; int expected, cc = 0; - if (!pte_present(entry)) - return -ENXIO; - if (pte_val(entry) & _PAGE_INVALID) - return -ENXIO; - - page = pte_page(entry); - if (page != exp_page) - return -ENXIO; if (PageWriteback(page)) return -EAGAIN; expected = expected_page_refs(page); @@ -304,17 +295,25 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb) goto out; rc = -ENXIO; - page = follow_page(vma, uaddr, FOLL_WRITE); - if (IS_ERR_OR_NULL(page)) - goto out; - - lock_page(page); ptep = get_locked_pte(gmap->mm, uaddr, &ptelock); + + if (!pte_present(entry)) + goto out_unlock_pte; + if (pte_val(entry) & _PAGE_INVALID) + goto out_unlock_pte; + page = pte_page(entry); + + if (!trylock_page(page)) { + rc = -EAGAIN; + goto out_unlock_pte; + } + if (should_export_before_import(uvcb, gmap->mm)) uv_convert_from_secure(page_to_phys(page)); rc = make_secure_pte(ptep, uaddr, page, uvcb); - pte_unmap_unlock(ptep, ptelock); unlock_page(page); +out_unlock_pte: + pte_unmap_unlock(ptep, ptelock); out: mmap_read_unlock(gmap->mm);
On Thu, 26 Jan 2023 12:35:37 -0400 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Thu, Jan 26, 2023 at 04:39:02PM +0100, Claudio Imbrenda wrote: > > > I can tell you that the original goal of that function is to make sure > > that there are no extra references. in particular, we want to prevent > > I/O of any kind to be ongoing while the page becomes secure. (the I/O > > will fail and, depending on which device it was, the whole system might > > end up in a rather unhappy state) > > Sure, but if there is concurrent IO you just try again right? It > doesn't wait for refs to drop for instance. > > So make the lock_page work the same way: I'll need to think carefully about this. I remember that there were good reasons to do things the way we did, so I want to make sure we don't break something by changing things around. > > diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c > index 9f18a4af9c1319..847ee50b8672c6 100644 > --- a/arch/s390/kernel/uv.c > +++ b/arch/s390/kernel/uv.c > @@ -193,20 +193,11 @@ static int expected_page_refs(struct page *page) > } > > static int make_secure_pte(pte_t *ptep, unsigned long addr, > - struct page *exp_page, struct uv_cb_header *uvcb) > + struct page *page, struct uv_cb_header *uvcb) > { > pte_t entry = READ_ONCE(*ptep); > - struct page *page; > int expected, cc = 0; > > - if (!pte_present(entry)) > - return -ENXIO; > - if (pte_val(entry) & _PAGE_INVALID) > - return -ENXIO; > - > - page = pte_page(entry); > - if (page != exp_page) > - return -ENXIO; > if (PageWriteback(page)) > return -EAGAIN; > expected = expected_page_refs(page); > @@ -304,17 +295,25 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb) > goto out; > > rc = -ENXIO; > - page = follow_page(vma, uaddr, FOLL_WRITE); > - if (IS_ERR_OR_NULL(page)) > - goto out; > - > - lock_page(page); > ptep = get_locked_pte(gmap->mm, uaddr, &ptelock); > + > + if (!pte_present(entry)) > + goto out_unlock_pte; > + if (pte_val(entry) & _PAGE_INVALID) > + goto out_unlock_pte; > + page = pte_page(entry); > + > + if (!trylock_page(page)) { > + rc = -EAGAIN; > + goto out_unlock_pte; > + } > + > if (should_export_before_import(uvcb, gmap->mm)) > uv_convert_from_secure(page_to_phys(page)); > rc = make_secure_pte(ptep, uaddr, page, uvcb); > - pte_unmap_unlock(ptep, ptelock); > unlock_page(page); > +out_unlock_pte: > + pte_unmap_unlock(ptep, ptelock); > out: > mmap_read_unlock(gmap->mm); >
On Thu, 26 Jan 2023 12:35:37 -0400 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Thu, Jan 26, 2023 at 04:39:02PM +0100, Claudio Imbrenda wrote: > > > I can tell you that the original goal of that function is to make sure > > that there are no extra references. in particular, we want to prevent > > I/O of any kind to be ongoing while the page becomes secure. (the I/O > > will fail and, depending on which device it was, the whole system might > > end up in a rather unhappy state) > > Sure, but if there is concurrent IO you just try again right? It > doesn't wait for refs to drop for instance. > > So make the lock_page work the same way: the more I look at this, the less I understand why I wrote the code like that. I do have a comment, though > > diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c > index 9f18a4af9c1319..847ee50b8672c6 100644 > --- a/arch/s390/kernel/uv.c > +++ b/arch/s390/kernel/uv.c > @@ -193,20 +193,11 @@ static int expected_page_refs(struct page *page) > } > > static int make_secure_pte(pte_t *ptep, unsigned long addr, > - struct page *exp_page, struct uv_cb_header *uvcb) > + struct page *page, struct uv_cb_header *uvcb) > { > pte_t entry = READ_ONCE(*ptep); > - struct page *page; > int expected, cc = 0; > > - if (!pte_present(entry)) > - return -ENXIO; > - if (pte_val(entry) & _PAGE_INVALID) > - return -ENXIO; > - > - page = pte_page(entry); > - if (page != exp_page) > - return -ENXIO; > if (PageWriteback(page)) > return -EAGAIN; > expected = expected_page_refs(page); > @@ -304,17 +295,25 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb) > goto out; > > rc = -ENXIO; > - page = follow_page(vma, uaddr, FOLL_WRITE); > - if (IS_ERR_OR_NULL(page)) > - goto out; > - > - lock_page(page); > ptep = get_locked_pte(gmap->mm, uaddr, &ptelock); > + > + if (!pte_present(entry)) > + goto out_unlock_pte; > + if (pte_val(entry) & _PAGE_INVALID) > + goto out_unlock_pte; I guess we also need to make sure the page was writable? FOLL_WRITE made sure of that so I guess something like: if (!pte_write(entry)) goto out_unlock_pte; > + page = pte_page(entry); > + > + if (!trylock_page(page)) { > + rc = -EAGAIN; > + goto out_unlock_pte; > + } > + > if (should_export_before_import(uvcb, gmap->mm)) > uv_convert_from_secure(page_to_phys(page)); > rc = make_secure_pte(ptep, uaddr, page, uvcb); > - pte_unmap_unlock(ptep, ptelock); > unlock_page(page); > +out_unlock_pte: > + pte_unmap_unlock(ptep, ptelock); > out: > mmap_read_unlock(gmap->mm); >
On Mon, Jan 30, 2023 at 07:21:04PM +0100, Claudio Imbrenda wrote: > On Thu, 26 Jan 2023 12:35:37 -0400 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Thu, Jan 26, 2023 at 04:39:02PM +0100, Claudio Imbrenda wrote: > > > > > I can tell you that the original goal of that function is to make sure > > > that there are no extra references. in particular, we want to prevent > > > I/O of any kind to be ongoing while the page becomes secure. (the I/O > > > will fail and, depending on which device it was, the whole system might > > > end up in a rather unhappy state) > > > > Sure, but if there is concurrent IO you just try again right? It > > doesn't wait for refs to drop for instance. > > > > So make the lock_page work the same way: > > the more I look at this, the less I understand why I wrote the code > like that. I do have a comment, though > > > > > diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c > > index 9f18a4af9c1319..847ee50b8672c6 100644 > > --- a/arch/s390/kernel/uv.c > > +++ b/arch/s390/kernel/uv.c > > @@ -193,20 +193,11 @@ static int expected_page_refs(struct page *page) > > } > > > > static int make_secure_pte(pte_t *ptep, unsigned long addr, > > - struct page *exp_page, struct uv_cb_header *uvcb) > > + struct page *page, struct uv_cb_header *uvcb) > > { > > pte_t entry = READ_ONCE(*ptep); > > - struct page *page; > > int expected, cc = 0; > > > > - if (!pte_present(entry)) > > - return -ENXIO; > > - if (pte_val(entry) & _PAGE_INVALID) > > - return -ENXIO; > > - > > - page = pte_page(entry); > > - if (page != exp_page) > > - return -ENXIO; > > if (PageWriteback(page)) > > return -EAGAIN; > > expected = expected_page_refs(page); > > @@ -304,17 +295,25 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb) > > goto out; > > > > rc = -ENXIO; > > - page = follow_page(vma, uaddr, FOLL_WRITE); > > - if (IS_ERR_OR_NULL(page)) > > - goto out; > > - > > - lock_page(page); > > ptep = get_locked_pte(gmap->mm, uaddr, &ptelock); > > + > > + if (!pte_present(entry)) > > + goto out_unlock_pte; > > + if (pte_val(entry) & _PAGE_INVALID) > > + goto out_unlock_pte; > > I guess we also need to make sure the page was writable? > FOLL_WRITE made sure of that > > so I guess something like: > > if (!pte_write(entry)) > goto out_unlock_pte; Probably, that looks like an existing race that it wasn't re-checked :\ Jason
On Mon, 30 Jan 2023 14:24:40 -0400 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Mon, Jan 30, 2023 at 07:21:04PM +0100, Claudio Imbrenda wrote: > > On Thu, 26 Jan 2023 12:35:37 -0400 > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > On Thu, Jan 26, 2023 at 04:39:02PM +0100, Claudio Imbrenda wrote: > > > > > > > I can tell you that the original goal of that function is to make sure > > > > that there are no extra references. in particular, we want to prevent > > > > I/O of any kind to be ongoing while the page becomes secure. (the I/O > > > > will fail and, depending on which device it was, the whole system might > > > > end up in a rather unhappy state) > > > > > > Sure, but if there is concurrent IO you just try again right? It > > > doesn't wait for refs to drop for instance. > > > > > > So make the lock_page work the same way: > > > > the more I look at this, the less I understand why I wrote the code > > like that. I do have a comment, though > > > > > > > > diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c > > > index 9f18a4af9c1319..847ee50b8672c6 100644 > > > --- a/arch/s390/kernel/uv.c > > > +++ b/arch/s390/kernel/uv.c > > > @@ -193,20 +193,11 @@ static int expected_page_refs(struct page *page) > > > } > > > > > > static int make_secure_pte(pte_t *ptep, unsigned long addr, > > > - struct page *exp_page, struct uv_cb_header *uvcb) > > > + struct page *page, struct uv_cb_header *uvcb) > > > { > > > pte_t entry = READ_ONCE(*ptep); > > > - struct page *page; > > > int expected, cc = 0; > > > > > > - if (!pte_present(entry)) > > > - return -ENXIO; > > > - if (pte_val(entry) & _PAGE_INVALID) > > > - return -ENXIO; > > > - > > > - page = pte_page(entry); > > > - if (page != exp_page) > > > - return -ENXIO; > > > if (PageWriteback(page)) > > > return -EAGAIN; > > > expected = expected_page_refs(page); > > > @@ -304,17 +295,25 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb) > > > goto out; > > > > > > rc = -ENXIO; > > > - page = follow_page(vma, uaddr, FOLL_WRITE); > > > - if (IS_ERR_OR_NULL(page)) > > > - goto out; > > > - > > > - lock_page(page); > > > ptep = get_locked_pte(gmap->mm, uaddr, &ptelock); > > > + > > > + if (!pte_present(entry)) > > > + goto out_unlock_pte; > > > + if (pte_val(entry) & _PAGE_INVALID) > > > + goto out_unlock_pte; > > > > I guess we also need to make sure the page was writable? > > FOLL_WRITE made sure of that > > > > so I guess something like: > > > > if (!pte_write(entry)) > > goto out_unlock_pte; > > Probably, that looks like an existing race that it wasn't re-checked :\ > > Jason how do we want to proceed with this? I can write a patch based on the above and see if anything breaks, but it will take some thinking and testing and reviewing before I'll be comfortable with it.
On Tue, Feb 07, 2023 at 12:31:12PM +0100, Claudio Imbrenda wrote: > On Mon, 30 Jan 2023 14:24:40 -0400 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Mon, Jan 30, 2023 at 07:21:04PM +0100, Claudio Imbrenda wrote: > > > On Thu, 26 Jan 2023 12:35:37 -0400 > > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > > > On Thu, Jan 26, 2023 at 04:39:02PM +0100, Claudio Imbrenda wrote: > > > > > > > > > I can tell you that the original goal of that function is to make sure > > > > > that there are no extra references. in particular, we want to prevent > > > > > I/O of any kind to be ongoing while the page becomes secure. (the I/O > > > > > will fail and, depending on which device it was, the whole system might > > > > > end up in a rather unhappy state) > > > > > > > > Sure, but if there is concurrent IO you just try again right? It > > > > doesn't wait for refs to drop for instance. > > > > > > > > So make the lock_page work the same way: > > > > > > the more I look at this, the less I understand why I wrote the code > > > like that. I do have a comment, though > > > > > > > > > > > diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c > > > > index 9f18a4af9c1319..847ee50b8672c6 100644 > > > > --- a/arch/s390/kernel/uv.c > > > > +++ b/arch/s390/kernel/uv.c > > > > @@ -193,20 +193,11 @@ static int expected_page_refs(struct page *page) > > > > } > > > > > > > > static int make_secure_pte(pte_t *ptep, unsigned long addr, > > > > - struct page *exp_page, struct uv_cb_header *uvcb) > > > > + struct page *page, struct uv_cb_header *uvcb) > > > > { > > > > pte_t entry = READ_ONCE(*ptep); > > > > - struct page *page; > > > > int expected, cc = 0; > > > > > > > > - if (!pte_present(entry)) > > > > - return -ENXIO; > > > > - if (pte_val(entry) & _PAGE_INVALID) > > > > - return -ENXIO; > > > > - > > > > - page = pte_page(entry); > > > > - if (page != exp_page) > > > > - return -ENXIO; > > > > if (PageWriteback(page)) > > > > return -EAGAIN; > > > > expected = expected_page_refs(page); > > > > @@ -304,17 +295,25 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb) > > > > goto out; > > > > > > > > rc = -ENXIO; > > > > - page = follow_page(vma, uaddr, FOLL_WRITE); > > > > - if (IS_ERR_OR_NULL(page)) > > > > - goto out; > > > > - > > > > - lock_page(page); > > > > ptep = get_locked_pte(gmap->mm, uaddr, &ptelock); > > > > + > > > > + if (!pte_present(entry)) > > > > + goto out_unlock_pte; > > > > + if (pte_val(entry) & _PAGE_INVALID) > > > > + goto out_unlock_pte; > > > > > > I guess we also need to make sure the page was writable? > > > FOLL_WRITE made sure of that > > > > > > so I guess something like: > > > > > > if (!pte_write(entry)) > > > goto out_unlock_pte; > > > > Probably, that looks like an existing race that it wasn't re-checked :\ > > > > Jason > > how do we want to proceed with this? > > I can write a patch based on the above and see if anything breaks, but > it will take some thinking and testing and reviewing before I'll be > comfortable with it. I would be happy if you did that I think the code is clearly racey as written now Jason
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 518617431f8085..dbfc3a7e9559b4 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1039,9 +1039,6 @@ typedef unsigned int __bitwise zap_flags_t; * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each * other. Here is what they mean, and how to use them: * - * FOLL_LONGTERM indicates that the page will be held for an indefinite time - * period _often_ under userspace control. This is in contrast to - * iov_iter_get_pages(), whose usages are transient. * * FIXME: For pages which are part of a filesystem, mappings are subject to the * lifetime enforced by the filesystem and we need guarantees that longterm @@ -1085,24 +1082,40 @@ typedef unsigned int __bitwise zap_flags_t; * Please see Documentation/core-api/pin_user_pages.rst for more information. */ -#define FOLL_WRITE 0x01 /* check pte is writable */ -#define FOLL_TOUCH 0x02 /* mark page accessed */ -#define FOLL_GET 0x04 /* do get_page on page */ -#define FOLL_DUMP 0x08 /* give error on hole if it would be zero */ -#define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */ -#define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO - * and return without waiting upon it */ -#define FOLL_NOFAULT 0x80 /* do not fault in pages */ -#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */ -#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */ -#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */ -#define FOLL_ANON 0x8000 /* don't do file mappings */ -#define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */ -#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */ -#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */ -#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */ -#define FOLL_PCI_P2PDMA 0x100000 /* allow returning PCI P2PDMA pages */ -#define FOLL_INTERRUPTIBLE 0x200000 /* allow interrupts from generic signals */ -#define FOLL_UNLOCKABLE 0x400000 /* allow unlocking the mmap lock (internal only) */ +enum { + /* check pte is writable */ + FOLL_WRITE = 1 << 0, + /* do get_page on page */ + FOLL_GET = 1 << 1, + /* give error on hole if it would be zero */ + FOLL_DUMP = 1 << 2, + /* get_user_pages read/write w/o permission */ + FOLL_FORCE = 1 << 3, + /* + * if a disk transfer is needed, start the IO and return without waiting + * upon it + */ + FOLL_NOWAIT = 1 << 4, + /* do not fault in pages */ + FOLL_NOFAULT = 1 << 5, + /* check page is hwpoisoned */ + FOLL_HWPOISON = 1 << 6, + /* don't do file mappings */ + FOLL_ANON = 1 << 7, + /* + * FOLL_LONGTERM indicates that the page will be held for an indefinite + * time period _often_ under userspace control. This is in contrast to + * iov_iter_get_pages(), whose usages are transient. + */ + FOLL_LONGTERM = 1 << 8, + /* split huge pmd before returning */ + FOLL_SPLIT_PMD = 1 << 9, + /* allow returning PCI P2PDMA pages */ + FOLL_PCI_P2PDMA = 1 << 10, + /* allow interrupts from generic signals */ + FOLL_INTERRUPTIBLE = 1 << 11, + + /* See also internal only FOLL flags in mm/internal.h */ +}; #endif /* _LINUX_MM_TYPES_H */ diff --git a/mm/internal.h b/mm/internal.h index 5c1310b98db64d..673b4b1d5f4c79 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -854,6 +854,21 @@ int migrate_device_coherent_page(struct page *page); struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags); int __must_check try_grab_page(struct page *page, unsigned int flags); +enum { + /* mark page accessed */ + FOLL_TOUCH = 1 << 16, + /* a retry, previous pass started an IO */ + FOLL_TRIED = 1 << 17, + /* we are working on non-current tsk/mm */ + FOLL_REMOTE = 1 << 18, + /* pages must be released via unpin_user_page */ + FOLL_PIN = 1 << 19, + /* gup_fast: prevent fall-back to slow gup */ + FOLL_FAST_ONLY = 1 << 20, + /* allow unlocking the mmap lock */ + FOLL_UNLOCKABLE = 1 << 21, +}; + /* * Indicates for which pages that are write-protected in the page table, * whether GUP has to trigger unsharing via FAULT_FLAG_UNSHARE such that the
Move the flags that should not/are not used outside gup.c and related into mm/internal.h to discourage driver abuse. To make this more maintainable going forward compact the two FOLL ranges with new bit numbers from 0 to 11 and 16 to 21, using shifts so it is explict. Switch to an enum so the whole thing is easier to read. Cc: David Howells <dhowells@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> --- include/linux/mm_types.h | 57 ++++++++++++++++++++++++---------------- mm/internal.h | 15 +++++++++++ 2 files changed, 50 insertions(+), 22 deletions(-)