Message ID | 20230526214142.958751-2-dhowells@redhat.com (mailing list archive)
---|---
State | New
Series | block: Make old dio use iov_iter_extract_pages() and page pinning
On Fri, May 26, 2023 at 10:41:40PM +0100, David Howells wrote:
> Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
> to it from the page tables and make unpin_user_page*() correspondingly
> ignore a ZERO_PAGE when unpinning.  We don't want to risk overrunning a
> zero page's refcount as we're only allowed ~2 million pins on it -
> something that userspace can conceivably trigger.
>
> Add a pair of functions to test whether a page or a folio is a ZERO_PAGE.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Christoph Hellwig <hch@infradead.org>
> cc: David Hildenbrand <david@redhat.com>
> cc: Lorenzo Stoakes <lstoakes@gmail.com>
> cc: Andrew Morton <akpm@linux-foundation.org>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: Al Viro <viro@zeniv.linux.org.uk>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: Jan Kara <jack@suse.cz>
> cc: Jeff Layton <jlayton@kernel.org>
> cc: Jason Gunthorpe <jgg@nvidia.com>
> cc: Logan Gunthorpe <logang@deltatee.com>
> cc: Hillf Danton <hdanton@sina.com>
> cc: Christian Brauner <brauner@kernel.org>
> cc: Linus Torvalds <torvalds@linux-foundation.org>
> cc: linux-fsdevel@vger.kernel.org
> cc: linux-block@vger.kernel.org
> cc: linux-kernel@vger.kernel.org
> cc: linux-mm@kvack.org
> ---
>
> Notes:
>     ver #3)
>      - Move is_zero_page() and is_zero_folio() to mm.h for dependency reasons.
>      - Add more comments and adjust the docs.
>
>     ver #2)
>      - Fix use of ZERO_PAGE().
>      - Add is_zero_page() and is_zero_folio() wrappers.
>      - Return the zero page obtained, not ZERO_PAGE(0) unconditionally.
>
>  Documentation/core-api/pin_user_pages.rst |  6 +++++
>  include/linux/mm.h                        | 26 +++++++++++++++++--
>  mm/gup.c                                  | 31 ++++++++++++++++++++++-
>  3 files changed, 60 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
> index 9fb0b1080d3b..d3c1f6d8c0e0 100644
> --- a/Documentation/core-api/pin_user_pages.rst
> +++ b/Documentation/core-api/pin_user_pages.rst
> @@ -112,6 +112,12 @@ pages:
>  This also leads to limitations: there are only 31-10==21 bits available for a
>  counter that increments 10 bits at a time.
>
> +* Because of that limitation, special handling is applied to the zero pages
> +  when using FOLL_PIN.  We only pretend to pin a zero page - we don't alter its
> +  refcount or pincount at all (it is permanent, so there's no need).  The
> +  unpinning functions also don't do anything to a zero page.  This is
> +  transparent to the caller.
> +

Great! Cheers. :)

>  * Callers must specifically request "dma-pinned tracking of pages". In other
>    words, just calling get_user_pages() will not suffice; a new set of functions,
>    pin_user_page() and related, must be used.
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 27ce77080c79..3c2f6b452586 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1910,6 +1910,28 @@ static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
>  	return page_maybe_dma_pinned(page);
>  }
>
> +/**
> + * is_zero_page - Query if a page is a zero page
> + * @page: The page to query
> + *
> + * This returns true if @page is one of the permanent zero pages.
> + */
> +static inline bool is_zero_page(const struct page *page)
> +{
> +	return is_zero_pfn(page_to_pfn(page));
> +}
> +
> +/**
> + * is_zero_folio - Query if a folio is a zero page
> + * @folio: The folio to query
> + *
> + * This returns true if @folio is one of the permanent zero pages.
> + */
> +static inline bool is_zero_folio(const struct folio *folio)
> +{
> +	return is_zero_page(&folio->page);
> +}
> +
>  /* MIGRATE_CMA and ZONE_MOVABLE do not allow pin pages */
>  #ifdef CONFIG_MIGRATION
>  static inline bool is_longterm_pinnable_page(struct page *page)
> @@ -1920,8 +1942,8 @@ static inline bool is_longterm_pinnable_page(struct page *page)
>  	if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE)
>  		return false;
>  #endif
> -	/* The zero page may always be pinned */
> -	if (is_zero_pfn(page_to_pfn(page)))
> +	/* The zero page can be "pinned" but gets special handling. */
> +	if (is_zero_page(page))
>  		return true;
>
>  	/* Coherent device memory must always allow eviction. */
> diff --git a/mm/gup.c b/mm/gup.c
> index bbe416236593..ad28261dcafd 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -51,7 +51,8 @@ static inline void sanity_check_pinned_pages(struct page **pages,
>  		struct page *page = *pages;
>  		struct folio *folio = page_folio(page);
>
> -		if (!folio_test_anon(folio))
> +		if (is_zero_page(page) ||
> +		    !folio_test_anon(folio))
>  			continue;
>  		if (!folio_test_large(folio) || folio_test_hugetlb(folio))
>  			VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page), page);
> @@ -131,6 +132,13 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>  	else if (flags & FOLL_PIN) {
>  		struct folio *folio;
>
> +		/*
> +		 * Don't take a pin on the zero page - it's not going anywhere
> +		 * and it is used in a *lot* of places.
> +		 */
> +		if (is_zero_page(page))
> +			return page_folio(page);
> +
>  		/*
>  		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
>  		 * right zone, so fail and let the caller fall back to the slow
> @@ -180,6 +188,8 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>  static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
>  {
>  	if (flags & FOLL_PIN) {
> +		if (is_zero_folio(folio))
> +			return;
>  		node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
>  		if (folio_test_large(folio))
>  			atomic_sub(refs, &folio->_pincount);
> @@ -224,6 +234,13 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
>  	if (flags & FOLL_GET)
>  		folio_ref_inc(folio);
>  	else if (flags & FOLL_PIN) {
> +		/*
> +		 * Don't take a pin on the zero page - it's not going anywhere
> +		 * and it is used in a *lot* of places.
> +		 */
> +		if (is_zero_page(page))
> +			return 0;
> +
>  		/*
>  		 * Similar to try_grab_folio(): be sure to *also*
>  		 * increment the normal page refcount field at least once,
> @@ -3079,6 +3096,9 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
>   *
>   * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
>   * see Documentation/core-api/pin_user_pages.rst for further details.
> + *
> + * Note that if a zero_page is amongst the returned pages, it will not have
> + * pins in it and unpin_user_page() will not remove pins from it.
>   */
>  int pin_user_pages_fast(unsigned long start, int nr_pages,
>  			unsigned int gup_flags, struct page **pages)
> @@ -3110,6 +3130,9 @@ EXPORT_SYMBOL_GPL(pin_user_pages_fast);
>   *
>   * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
>   * see Documentation/core-api/pin_user_pages.rst for details.
> + *
> + * Note that if a zero_page is amongst the returned pages, it will not have
> + * pins in it and unpin_user_page*() will not remove pins from it.
>   */
>  long pin_user_pages_remote(struct mm_struct *mm,
>  			   unsigned long start, unsigned long nr_pages,
> @@ -3143,6 +3166,9 @@ EXPORT_SYMBOL(pin_user_pages_remote);
>   *
>   * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
>   * see Documentation/core-api/pin_user_pages.rst for details.
> + *
> + * Note that if a zero_page is amongst the returned pages, it will not have
> + * pins in it and unpin_user_page*() will not remove pins from it.
>   */
>  long pin_user_pages(unsigned long start, unsigned long nr_pages,
>  		    unsigned int gup_flags, struct page **pages,
> @@ -3161,6 +3187,9 @@ EXPORT_SYMBOL(pin_user_pages);
>   * pin_user_pages_unlocked() is the FOLL_PIN variant of
>   * get_user_pages_unlocked(). Behavior is the same, except that this one sets
>   * FOLL_PIN and rejects FOLL_GET.
> + *
> + * Note that if a zero_page is amongst the returned pages, it will not have
> + * pins in it and unpin_user_page*() will not remove pins from it.
>   */
>  long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
>  			     struct page **pages, unsigned int gup_flags)

All looks good, thanks for adding comments/updating doc!

Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
On 26.05.23 23:41, David Howells wrote:
> Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
> to it from the page tables and make unpin_user_page*() correspondingly
> ignore a ZERO_PAGE when unpinning.  We don't want to risk overrunning a
> zero page's refcount as we're only allowed ~2 million pins on it -
> something that userspace can conceivably trigger.

2 million pins (FOLL_PIN, which increments the refcount by 1024) or 2 million
references?

>
> Add a pair of functions to test whether a page or a folio is a ZERO_PAGE.
>

[...]

> diff --git a/mm/gup.c b/mm/gup.c
> index bbe416236593..ad28261dcafd 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -51,7 +51,8 @@ static inline void sanity_check_pinned_pages(struct page **pages,
>  		struct page *page = *pages;
>  		struct folio *folio = page_folio(page);
>
> -		if (!folio_test_anon(folio))
> +		if (is_zero_page(page) ||
> +		    !folio_test_anon(folio))

Nit: fits into a single line (without harming readability, IMHO).

>  			continue;
>  		if (!folio_test_large(folio) || folio_test_hugetlb(folio))
>  			VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page), page);
> @@ -131,6 +132,13 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>  	else if (flags & FOLL_PIN) {
>  		struct folio *folio;
>
> +		/*
> +		 * Don't take a pin on the zero page - it's not going anywhere
> +		 * and it is used in a *lot* of places.
> +		 */
> +		if (is_zero_page(page))
> +			return page_folio(page);
> +
>  		/*
>  		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
>  		 * right zone, so fail and let the caller fall back to the slow
> @@ -180,6 +188,8 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>  static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
>  {
>  	if (flags & FOLL_PIN) {
> +		if (is_zero_folio(folio))
> +			return;
>  		node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
>  		if (folio_test_large(folio))
>  			atomic_sub(refs, &folio->_pincount);
> @@ -224,6 +234,13 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
>  	if (flags & FOLL_GET)
>  		folio_ref_inc(folio);
>  	else if (flags & FOLL_PIN) {
> +		/*
> +		 * Don't take a pin on the zero page - it's not going anywhere
> +		 * and it is used in a *lot* of places.
> +		 */
> +		if (is_zero_page(page))
> +			return 0;
> +
>  		/*
>  		 * Similar to try_grab_folio(): be sure to *also*
>  		 * increment the normal page refcount field at least once,
> @@ -3079,6 +3096,9 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
>   *
>   * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
>   * see Documentation/core-api/pin_user_pages.rst for further details.
> + *
> + * Note that if a zero_page is amongst the returned pages, it will not have
> + * pins in it and unpin_user_page() will not remove pins from it.
>   */

"it will not have pins in it" sounds fairly weird to a non-native speaker.

"Note that the refcount of any zero_pages returned among the pinned pages
will not be incremented, and unpin_user_page() will similarly not decrement
it."

Acked-by: David Hildenbrand <david@redhat.com>
David Hildenbrand <david@redhat.com> wrote:

> > Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
> > to it from the page tables and make unpin_user_page*() correspondingly
> > ignore a ZERO_PAGE when unpinning.  We don't want to risk overrunning a
> > zero page's refcount as we're only allowed ~2 million pins on it -
> > something that userspace can conceivably trigger.
>
> 2 million pins (FOLL_PIN, which increments the refcount by 1024) or 2 million
> references?

Definitely pins.  It's tricky because we've been using "pinned" to mean held
by a refcount or held by a flag too.

2 million pins on the zero page is in the realms of possibility.  It only
takes 32768 64-page DIO writes.

> > @@ -3079,6 +3096,9 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
> >   *
> >   * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
> >   * see Documentation/core-api/pin_user_pages.rst for further details.
> > + *
> > + * Note that if a zero_page is amongst the returned pages, it will not have
> > + * pins in it and unpin_user_page() will not remove pins from it.
> >   */
>
> "it will not have pins in it" sounds fairly weird to a non-native speaker.

Oh, I know.  The problem is that "pin" is now really ambiguous.  Can we change
"FOLL_PIN" to "FOLL_NAIL"?  Or maybe "FOLL_SCREW" - your pages are screwed if
you use DIO and fork at the same time.

> "Note that the refcount of any zero_pages returned among the pinned pages will
> not be incremented, and unpin_user_page() will similarly not decrement it."

That's not really right (although it happens to be true), because we're
talking primarily about the pin counter, not the refcount - and they may be
separate.

David
On 31.05.23 10:35, David Howells wrote:
> David Hildenbrand <david@redhat.com> wrote:
>
>>> Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
>>> to it from the page tables and make unpin_user_page*() correspondingly
>>> ignore a ZERO_PAGE when unpinning.  We don't want to risk overrunning a
>>> zero page's refcount as we're only allowed ~2 million pins on it -
>>> something that userspace can conceivably trigger.
>>
>> 2 million pins (FOLL_PIN, which increments the refcount by 1024) or 2 million
>> references?
>
> Definitely pins.  It's tricky because we've been using "pinned" to mean held
> by a refcount or held by a flag too.

Yes, it would be clearer if we used "pinned" only for FOLL_PIN and called
everything else simply "taking a temporary reference on the page".

> 2 million pins on the zero page is in the realms of possibility.  It only
> takes 32768 64-page DIO writes.
>
>>> @@ -3079,6 +3096,9 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
>>>   *
>>>   * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
>>>   * see Documentation/core-api/pin_user_pages.rst for further details.
>>> + *
>>> + * Note that if a zero_page is amongst the returned pages, it will not have
>>> + * pins in it and unpin_user_page() will not remove pins from it.
>>>   */
>>
>> "it will not have pins in it" sounds fairly weird to a non-native speaker.
>
> Oh, I know.  The problem is that "pin" is now really ambiguous.  Can we change
> "FOLL_PIN" to "FOLL_NAIL"?  Or maybe "FOLL_SCREW" - your pages are screwed if
> you use DIO and fork at the same time.

I'm hoping that "pinning" will mean "FOLL_PIN" (the intention to access page
content) and everything else will simply be "taking a temporary page
reference".

>> "Note that the refcount of any zero_pages returned among the pinned pages will
>> not be incremented, and unpin_user_page() will similarly not decrement it."
>
> That's not really right (although it happens to be true), because we're
> talking primarily about the pin counter, not the refcount - and they may be
> separate.

In either case (FOLL_PIN/FOLL_GET) you increment/decrement the refcount.  If
we have a separate pincount, we still increment/decrement the refcount by 1
when (un)pinning.  Sure, if we had a separate pincount we'd also not be
modifying it.
David Hildenbrand <david@redhat.com> wrote:

> Yes, it would be clearer if we used "pinned" only for FOLL_PIN

You're not likely to get that.  "To pin" is too useful a verb that gets used
in other contexts too.  For that reason, I think FOLL_PIN was a poor choice of
name :-/.  I guess the English language has got somewhat overloaded.  Maybe
FOLL_PEG? ;-)

> and called everything else simply "taking a temporary reference on the page".

Excluding refs taken with pins, many refs are more permanent than pins as, so
far as I'm aware, pins only last for the duration of an I/O operation.

> >> "Note that the refcount of any zero_pages returned among the pinned pages will
> >> not be incremented, and unpin_user_page() will similarly not decrement it."
> > That's not really right (although it happens to be true), because we're
> > talking primarily about the pin counter, not the refcount - and they may be
> > separate.
>
> In either case (FOLL_PIN/FOLL_GET) you increment/decrement the refcount.  If
> we have a separate pincount, we still increment/decrement the refcount by 1
> when (un)pinning.

FOLL_GET isn't relevant here - only FOLL_PIN.  Yes, as it happens, we count a
ref if we count a pin, but that's kind of irrelevant; what matters is that the
effect must be undone with un-PUP.

It would be nice not to get a ref on the zero page in FOLL_GET, but I don't
think we can do that yet.  Too many places assume that GUP will give them a
ref they can release later via ordinary methods.

David
On Wed, May 31, 2023 at 02:55:35PM +0100, David Howells wrote:
> David Hildenbrand <david@redhat.com> wrote:
>
> > Yes, it would be clearer if we used "pinned" only for FOLL_PIN
>
> You're not likely to get that.  "To pin" is too useful a verb that gets used
> in other contexts too.  For that reason, I think FOLL_PIN was a poor choice of
> name :-/.  I guess the English language has got somewhat overloaded.  Maybe
> FOLL_PEG? ;-)

Might I suggest FOLL_FOLL?  As we are essentially 'following' the page once
'pinned'.  We could differentiate between what is currently FOLL_GET and
FOLL_PIN by using FOLL_FOLL and FOLL_FOLL_FOLL.

Patch series coming soon...

> > and called everything else simply "taking a temporary reference on the page".
>
> Excluding refs taken with pins, many refs are more permanent than pins as, so
> far as I'm aware, pins only last for the duration of an I/O operation.
>
> > >> "Note that the refcount of any zero_pages returned among the pinned pages will
> > >> not be incremented, and unpin_user_page() will similarly not decrement it."
> > > That's not really right (although it happens to be true), because we're
> > > talking primarily about the pin counter, not the refcount - and they may be
> > > separate.
> >
> > In either case (FOLL_PIN/FOLL_GET) you increment/decrement the refcount.  If
> > we have a separate pincount, we still increment/decrement the refcount by 1
> > when (un)pinning.
>
> FOLL_GET isn't relevant here - only FOLL_PIN.  Yes, as it happens, we count a
> ref if we count a pin, but that's kind of irrelevant; what matters is that the
> effect must be undone with un-PUP.
>
> It would be nice not to get a ref on the zero page in FOLL_GET, but I don't
> think we can do that yet.  Too many places assume that GUP will give them a
> ref they can release later via ordinary methods.
>
> David
diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
index 9fb0b1080d3b..d3c1f6d8c0e0 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -112,6 +112,12 @@ pages:
 This also leads to limitations: there are only 31-10==21 bits available for a
 counter that increments 10 bits at a time.

+* Because of that limitation, special handling is applied to the zero pages
+  when using FOLL_PIN.  We only pretend to pin a zero page - we don't alter its
+  refcount or pincount at all (it is permanent, so there's no need).  The
+  unpinning functions also don't do anything to a zero page.  This is
+  transparent to the caller.
+
 * Callers must specifically request "dma-pinned tracking of pages". In other
   words, just calling get_user_pages() will not suffice; a new set of functions,
   pin_user_page() and related, must be used.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..3c2f6b452586 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1910,6 +1910,28 @@ static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
 	return page_maybe_dma_pinned(page);
 }

+/**
+ * is_zero_page - Query if a page is a zero page
+ * @page: The page to query
+ *
+ * This returns true if @page is one of the permanent zero pages.
+ */
+static inline bool is_zero_page(const struct page *page)
+{
+	return is_zero_pfn(page_to_pfn(page));
+}
+
+/**
+ * is_zero_folio - Query if a folio is a zero page
+ * @folio: The folio to query
+ *
+ * This returns true if @folio is one of the permanent zero pages.
+ */
+static inline bool is_zero_folio(const struct folio *folio)
+{
+	return is_zero_page(&folio->page);
+}
+
 /* MIGRATE_CMA and ZONE_MOVABLE do not allow pin pages */
 #ifdef CONFIG_MIGRATION
 static inline bool is_longterm_pinnable_page(struct page *page)
@@ -1920,8 +1942,8 @@ static inline bool is_longterm_pinnable_page(struct page *page)
 	if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE)
 		return false;
 #endif
-	/* The zero page may always be pinned */
-	if (is_zero_pfn(page_to_pfn(page)))
+	/* The zero page can be "pinned" but gets special handling. */
+	if (is_zero_page(page))
 		return true;

 	/* Coherent device memory must always allow eviction. */
diff --git a/mm/gup.c b/mm/gup.c
index bbe416236593..ad28261dcafd 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -51,7 +51,8 @@ static inline void sanity_check_pinned_pages(struct page **pages,
 		struct page *page = *pages;
 		struct folio *folio = page_folio(page);

-		if (!folio_test_anon(folio))
+		if (is_zero_page(page) ||
+		    !folio_test_anon(folio))
 			continue;
 		if (!folio_test_large(folio) || folio_test_hugetlb(folio))
 			VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page), page);
@@ -131,6 +132,13 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 	else if (flags & FOLL_PIN) {
 		struct folio *folio;

+		/*
+		 * Don't take a pin on the zero page - it's not going anywhere
+		 * and it is used in a *lot* of places.
+		 */
+		if (is_zero_page(page))
+			return page_folio(page);
+
 		/*
 		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
 		 * right zone, so fail and let the caller fall back to the slow
@@ -180,6 +188,8 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
 {
 	if (flags & FOLL_PIN) {
+		if (is_zero_folio(folio))
+			return;
 		node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
 		if (folio_test_large(folio))
 			atomic_sub(refs, &folio->_pincount);
@@ -224,6 +234,13 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
 	if (flags & FOLL_GET)
 		folio_ref_inc(folio);
 	else if (flags & FOLL_PIN) {
+		/*
+		 * Don't take a pin on the zero page - it's not going anywhere
+		 * and it is used in a *lot* of places.
+		 */
+		if (is_zero_page(page))
+			return 0;
+
 		/*
 		 * Similar to try_grab_folio(): be sure to *also*
 		 * increment the normal page refcount field at least once,
@@ -3079,6 +3096,9 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
  *
  * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
  * see Documentation/core-api/pin_user_pages.rst for further details.
+ *
+ * Note that if a zero_page is amongst the returned pages, it will not have
+ * pins in it and unpin_user_page() will not remove pins from it.
  */
 int pin_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages)
@@ -3110,6 +3130,9 @@ EXPORT_SYMBOL_GPL(pin_user_pages_fast);
  *
  * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
  * see Documentation/core-api/pin_user_pages.rst for details.
+ *
+ * Note that if a zero_page is amongst the returned pages, it will not have
+ * pins in it and unpin_user_page*() will not remove pins from it.
  */
 long pin_user_pages_remote(struct mm_struct *mm,
 			   unsigned long start, unsigned long nr_pages,
@@ -3143,6 +3166,9 @@ EXPORT_SYMBOL(pin_user_pages_remote);
  *
  * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
  * see Documentation/core-api/pin_user_pages.rst for details.
+ *
+ * Note that if a zero_page is amongst the returned pages, it will not have
+ * pins in it and unpin_user_page*() will not remove pins from it.
  */
 long pin_user_pages(unsigned long start, unsigned long nr_pages,
 		    unsigned int gup_flags, struct page **pages,
@@ -3161,6 +3187,9 @@ EXPORT_SYMBOL(pin_user_pages);
  * pin_user_pages_unlocked() is the FOLL_PIN variant of
  * get_user_pages_unlocked(). Behavior is the same, except that this one sets
  * FOLL_PIN and rejects FOLL_GET.
+ *
+ * Note that if a zero_page is amongst the returned pages, it will not have
+ * pins in it and unpin_user_page*() will not remove pins from it.
  */
 long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 			     struct page **pages, unsigned int gup_flags)
Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
to it from the page tables and make unpin_user_page*() correspondingly
ignore a ZERO_PAGE when unpinning.  We don't want to risk overrunning a
zero page's refcount as we're only allowed ~2 million pins on it -
something that userspace can conceivably trigger.

Add a pair of functions to test whether a page or a folio is a ZERO_PAGE.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@infradead.org>
cc: David Hildenbrand <david@redhat.com>
cc: Lorenzo Stoakes <lstoakes@gmail.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jason Gunthorpe <jgg@nvidia.com>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: Hillf Danton <hdanton@sina.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-kernel@vger.kernel.org
cc: linux-mm@kvack.org
---

Notes:
    ver #3)
     - Move is_zero_page() and is_zero_folio() to mm.h for dependency reasons.
     - Add more comments and adjust the docs.

    ver #2)
     - Fix use of ZERO_PAGE().
     - Add is_zero_page() and is_zero_folio() wrappers.
     - Return the zero page obtained, not ZERO_PAGE(0) unconditionally.

 Documentation/core-api/pin_user_pages.rst |  6 +++++
 include/linux/mm.h                        | 26 +++++++++++++++++--
 mm/gup.c                                  | 31 ++++++++++++++++++++++-
 3 files changed, 60 insertions(+), 3 deletions(-)