Message ID | 20240627221413.671680-1-yang@os.amperecomputing.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v2] mm: gup: do not call try_grab_folio() in slow path | expand |
On Thu, 27 Jun 2024 15:14:13 -0700 Yang Shi <yang@os.amperecomputing.com> wrote: > The try_grab_folio() is supposed to be used in fast path and it elevates > folio refcount by using add ref unless zero. We are guaranteed to have > at least one stable reference in slow path, so the simple atomic add > could be used. The performance difference should be trivial, but the > misuse may be confusing and misleading. > > In another thread [1] a kernel warning was reported when pinning folio > in CMA memory when launching SEV virtual machine. The splat looks like: > > [ 464.325306] WARNING: CPU: 13 PID: 6734 at mm/gup.c:1313 __get_user_pages+0x423/0x520 > [ 464.325464] CPU: 13 PID: 6734 Comm: qemu-kvm Kdump: loaded Not tainted 6.6.33+ #6 > [ 464.325477] RIP: 0010:__get_user_pages+0x423/0x520 > [ 464.325515] Call Trace: > [ 464.325520] <TASK> > [ 464.325523] ? __get_user_pages+0x423/0x520 > [ 464.325528] ? __warn+0x81/0x130 > [ 464.325536] ? __get_user_pages+0x423/0x520 > [ 464.325541] ? report_bug+0x171/0x1a0 > [ 464.325549] ? handle_bug+0x3c/0x70 > [ 464.325554] ? exc_invalid_op+0x17/0x70 > [ 464.325558] ? asm_exc_invalid_op+0x1a/0x20 > [ 464.325567] ? __get_user_pages+0x423/0x520 > [ 464.325575] __gup_longterm_locked+0x212/0x7a0 > [ 464.325583] internal_get_user_pages_fast+0xfb/0x190 > [ 464.325590] pin_user_pages_fast+0x47/0x60 > [ 464.325598] sev_pin_memory+0xca/0x170 [kvm_amd] > [ 464.325616] sev_mem_enc_register_region+0x81/0x130 [kvm_amd] > > ... > > Fixes: 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"") > Cc: <stable@vger.kernel.org> [6.6+] So we want something against Linus mainline for backporting ease. > 3. Rebased onto the latest mm-unstable mm-unstable is quite different - memfd_pin_folios() doesn't exist in mainline! So can you please prepare the fix against current -linus? I'll hang onto this patch to guide myself when I redo Vivek's "mm/gup: Introduce memfd_pin_folios() for pinning memfd folios" series on top.
On Thu, Jun 27, 2024 at 3:54 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Thu, 27 Jun 2024 15:14:13 -0700 Yang Shi <yang@os.amperecomputing.com> wrote: > > > The try_grab_folio() is supposed to be used in fast path and it elevates > > folio refcount by using add ref unless zero. We are guaranteed to have > > at least one stable reference in slow path, so the simple atomic add > > could be used. The performance difference should be trivial, but the > > misuse may be confusing and misleading. > > > > In another thread [1] a kernel warning was reported when pinning folio > > in CMA memory when launching SEV virtual machine. The splat looks like: > > > > [ 464.325306] WARNING: CPU: 13 PID: 6734 at mm/gup.c:1313 __get_user_pages+0x423/0x520 > > [ 464.325464] CPU: 13 PID: 6734 Comm: qemu-kvm Kdump: loaded Not tainted 6.6.33+ #6 > > [ 464.325477] RIP: 0010:__get_user_pages+0x423/0x520 > > [ 464.325515] Call Trace: > > [ 464.325520] <TASK> > > [ 464.325523] ? __get_user_pages+0x423/0x520 > > [ 464.325528] ? __warn+0x81/0x130 > > [ 464.325536] ? __get_user_pages+0x423/0x520 > > [ 464.325541] ? report_bug+0x171/0x1a0 > > [ 464.325549] ? handle_bug+0x3c/0x70 > > [ 464.325554] ? exc_invalid_op+0x17/0x70 > > [ 464.325558] ? asm_exc_invalid_op+0x1a/0x20 > > [ 464.325567] ? __get_user_pages+0x423/0x520 > > [ 464.325575] __gup_longterm_locked+0x212/0x7a0 > > [ 464.325583] internal_get_user_pages_fast+0xfb/0x190 > > [ 464.325590] pin_user_pages_fast+0x47/0x60 > > [ 464.325598] sev_pin_memory+0xca/0x170 [kvm_amd] > > [ 464.325616] sev_mem_enc_register_region+0x81/0x130 [kvm_amd] > > > > ... > > > > Fixes: 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"") > > Cc: <stable@vger.kernel.org> [6.6+] > > So we want something against Linus mainline for backporting ease. > > > 3. Rebased onto the latest mm-unstable > > mm-unstable is quite different - memfd_pin_folios() doesn't exist in > mainline! > > So can you please prepare the fix against current -linus? I'll hang > onto this patch to guide myself when I redo Vivek's "mm/gup: Introduce > memfd_pin_folios() for pinning memfd folios" series on top. Sure, I'm going to come up with another patch on top of Linus's tree. > > >
Yang, On Thu, Jun 27, 2024 at 03:14:13PM -0700, Yang Shi wrote: > The try_grab_folio() is supposed to be used in fast path and it elevates > folio refcount by using add ref unless zero. We are guaranteed to have > at least one stable reference in slow path, so the simple atomic add > could be used. The performance difference should be trivial, but the > misuse may be confusing and misleading. This first paragraph is IMHO misleading itself.. I think we should mention upfront the important bit, on the user impact. Here IMO the user impact should be: Linux may fail longterm pin in some releavnt paths when applied over CMA reserved blocks. And if to extend a bit, that include not only slow-gup but also the new memfd pinning, because both of them used try_grab_folio() which used to be only for fast-gup. It's great this patch renamed try_grab_folio() to try_grab_folio_fast(), I think that definitely helps on reducing the abuse in the future. However then with that the subject becomes misleading, because it says "do not call try_grab_folio()" however after this patch we keep using it. Maybe rename the subject to "mm: Fix longterm pin on slow gup and memfd pin regress"? > > In another thread [1] a kernel warning was reported when pinning folio > in CMA memory when launching SEV virtual machine. The splat looks like: > > [ 464.325306] WARNING: CPU: 13 PID: 6734 at mm/gup.c:1313 __get_user_pages+0x423/0x520 > [ 464.325464] CPU: 13 PID: 6734 Comm: qemu-kvm Kdump: loaded Not tainted 6.6.33+ #6 > [ 464.325477] RIP: 0010:__get_user_pages+0x423/0x520 > [ 464.325515] Call Trace: > [ 464.325520] <TASK> > [ 464.325523] ? __get_user_pages+0x423/0x520 > [ 464.325528] ? __warn+0x81/0x130 > [ 464.325536] ? __get_user_pages+0x423/0x520 > [ 464.325541] ? report_bug+0x171/0x1a0 > [ 464.325549] ? handle_bug+0x3c/0x70 > [ 464.325554] ? exc_invalid_op+0x17/0x70 > [ 464.325558] ? asm_exc_invalid_op+0x1a/0x20 > [ 464.325567] ? __get_user_pages+0x423/0x520 > [ 464.325575] __gup_longterm_locked+0x212/0x7a0 > [ 464.325583] internal_get_user_pages_fast+0xfb/0x190 > [ 464.325590] pin_user_pages_fast+0x47/0x60 > [ 464.325598] sev_pin_memory+0xca/0x170 [kvm_amd] > [ 464.325616] sev_mem_enc_register_region+0x81/0x130 [kvm_amd] > > Per the analysis done by yangge, when starting the SEV virtual machine, > it will call pin_user_pages_fast(..., FOLL_LONGTERM, ...) to pin the > memory. But the page is in CMA area, so fast GUP will fail then > fallback to the slow path due to the longterm pinnalbe check in > try_grab_folio(). > The slow path will try to pin the pages then migrate them out of CMA > area. But the slow path also uses try_grab_folio() to pin the page, > it will also fail due to the same check then the above warning > is triggered. > > [1] https://lore.kernel.org/linux-mm/1719478388-31917-1-git-send-email-yangge1116@126.com/ > > Fixes: 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"") > Cc: <stable@vger.kernel.org> [6.6+] > Reported-by: yangge <yangge1116@126.com> > Signed-off-by: Yang Shi <yang@os.amperecomputing.com> The patch itself looks mostly ok to me. There's still some "cleanup" part mangled together, e.g., the real meat should be avoiding the folio_is_longterm_pinnable() check in relevant paths. The rest (e.g. switch slow-gup / memfd pin to use folio_ref_add() not try_get_folio(), and renames) could be good cleanups. So a smaller fix might be doable, but again I don't have a strong opinion here. Thanks,
On Thu, 27 Jun 2024 19:19:40 -0400 Peter Xu <peterx@redhat.com> wrote: > Yang, > > On Thu, Jun 27, 2024 at 03:14:13PM -0700, Yang Shi wrote: > > The try_grab_folio() is supposed to be used in fast path and it elevates > > folio refcount by using add ref unless zero. We are guaranteed to have > > at least one stable reference in slow path, so the simple atomic add > > could be used. The performance difference should be trivial, but the > > misuse may be confusing and misleading. > > This first paragraph is IMHO misleading itself.. > > I think we should mention upfront the important bit, on the user impact. > > Here IMO the user impact should be: Linux may fail longterm pin in some > releavnt paths when applied over CMA reserved blocks. And if to extend a > bit, that include not only slow-gup but also the new memfd pinning, because > both of them used try_grab_folio() which used to be only for fast-gup. It's still unclear how users will be affected. What do the *users* see? If it's a slight slowdown, do we need to backport this at all? > > The patch itself looks mostly ok to me. > > There's still some "cleanup" part mangled together, e.g., the real meat > should be avoiding the folio_is_longterm_pinnable() check in relevant > paths. The rest (e.g. switch slow-gup / memfd pin to use folio_ref_add() > not try_get_folio(), and renames) could be good cleanups. > > So a smaller fix might be doable, but again I don't have a strong opinion > here. The smaller the better for backporting, of course.
On Thu, Jun 27, 2024 at 4:32 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Thu, 27 Jun 2024 19:19:40 -0400 Peter Xu <peterx@redhat.com> wrote: > > > Yang, > > > > On Thu, Jun 27, 2024 at 03:14:13PM -0700, Yang Shi wrote: > > > The try_grab_folio() is supposed to be used in fast path and it elevates > > > folio refcount by using add ref unless zero. We are guaranteed to have > > > at least one stable reference in slow path, so the simple atomic add > > > could be used. The performance difference should be trivial, but the > > > misuse may be confusing and misleading. > > > > This first paragraph is IMHO misleading itself.. > > > > I think we should mention upfront the important bit, on the user impact. > > > > Here IMO the user impact should be: Linux may fail longterm pin in some > > releavnt paths when applied over CMA reserved blocks. And if to extend a > > bit, that include not only slow-gup but also the new memfd pinning, because > > both of them used try_grab_folio() which used to be only for fast-gup. > > It's still unclear how users will be affected. What do the *users* > see? If it's a slight slowdown, do we need to backport this at all? I think Peter meant the warning reported by yangge? Peter also mentioned the patch subject is misleading. I agree. So how's about "mm: gup: stop abusing try_grab_folio()"? > > > > > The patch itself looks mostly ok to me. > > > > There's still some "cleanup" part mangled together, e.g., the real meat > > should be avoiding the folio_is_longterm_pinnable() check in relevant > > paths. The rest (e.g. switch slow-gup / memfd pin to use folio_ref_add() > > not try_get_folio(), and renames) could be good cleanups. > > > > So a smaller fix might be doable, but again I don't have a strong opinion > > here. > > The smaller the better for backporting, of course. I view the fix to the warning as just by-product of the clean up. The whole patch is naturally integral IMHO. We can generate a smaller fix if it is too hard to backport. However, it should be ok since we just need to backport to 6.6+. >
On Thu, Jun 27, 2024 at 04:32:42PM -0700, Andrew Morton wrote: > On Thu, 27 Jun 2024 19:19:40 -0400 Peter Xu <peterx@redhat.com> wrote: > > > Yang, > > > > On Thu, Jun 27, 2024 at 03:14:13PM -0700, Yang Shi wrote: > > > The try_grab_folio() is supposed to be used in fast path and it elevates > > > folio refcount by using add ref unless zero. We are guaranteed to have > > > at least one stable reference in slow path, so the simple atomic add > > > could be used. The performance difference should be trivial, but the > > > misuse may be confusing and misleading. > > > > This first paragraph is IMHO misleading itself.. > > > > I think we should mention upfront the important bit, on the user impact. > > > > Here IMO the user impact should be: Linux may fail longterm pin in some > > releavnt paths when applied over CMA reserved blocks. And if to extend a > > bit, that include not only slow-gup but also the new memfd pinning, because > > both of them used try_grab_folio() which used to be only for fast-gup. > > It's still unclear how users will be affected. What do the *users* > see? If it's a slight slowdown, do we need to backport this at all? The user will see the pin fails, for gpu-slow it further triggers the WARN right below that failure (as in the original report): folio = try_grab_folio(page, page_increm - 1, foll_flags); if (WARN_ON_ONCE(!folio)) { <------------------------ here /* * Release the 1st page ref if the * folio is problematic, fail hard. */ gup_put_folio(page_folio(page), 1, foll_flags); ret = -EFAULT; goto out; } For memfd pin and hugepd paths, they should just observe GUP failure on those longterm pins, and it'll be the caller context to decide what user can see, I think. > > > > > The patch itself looks mostly ok to me. > > > > There's still some "cleanup" part mangled together, e.g., the real meat > > should be avoiding the folio_is_longterm_pinnable() check in relevant > > paths. The rest (e.g. switch slow-gup / memfd pin to use folio_ref_add() > > not try_get_folio(), and renames) could be good cleanups. > > > > So a smaller fix might be doable, but again I don't have a strong opinion > > here. > > The smaller the better for backporting, of course. I think a smaller version might be yangge's patch, plus Yang's hugepd "fast" parameter for the hugepd stack, then hugepd can also use try_grab_page(). memfd-pin change can be a separate small patch perhaps squashed. I'll leave how to move on to Yang. Thanks,
在 2024/6/28 7:43, Peter Xu 写道: > On Thu, Jun 27, 2024 at 04:32:42PM -0700, Andrew Morton wrote: >> On Thu, 27 Jun 2024 19:19:40 -0400 Peter Xu <peterx@redhat.com> wrote: >> >>> Yang, >>> >>> On Thu, Jun 27, 2024 at 03:14:13PM -0700, Yang Shi wrote: >>>> The try_grab_folio() is supposed to be used in fast path and it elevates >>>> folio refcount by using add ref unless zero. We are guaranteed to have >>>> at least one stable reference in slow path, so the simple atomic add >>>> could be used. The performance difference should be trivial, but the >>>> misuse may be confusing and misleading. >>> >>> This first paragraph is IMHO misleading itself.. >>> >>> I think we should mention upfront the important bit, on the user impact. >>> >>> Here IMO the user impact should be: Linux may fail longterm pin in some >>> releavnt paths when applied over CMA reserved blocks. And if to extend a >>> bit, that include not only slow-gup but also the new memfd pinning, because >>> both of them used try_grab_folio() which used to be only for fast-gup. >> >> It's still unclear how users will be affected. What do the *users* >> see? If it's a slight slowdown, do we need to backport this at all? > > The user will see the pin fails, for gpu-slow it further triggers the WARN > right below that failure (as in the original report): > > folio = try_grab_folio(page, page_increm - 1, > foll_flags); > if (WARN_ON_ONCE(!folio)) { <------------------------ here > /* > * Release the 1st page ref if the > * folio is problematic, fail hard. > */ > gup_put_folio(page_folio(page), 1, > foll_flags); > ret = -EFAULT; > goto out; > } > > For memfd pin and hugepd paths, they should just observe GUP failure on > those longterm pins, and it'll be the caller context to decide what user > can see, I think. > >> >>> >>> The patch itself looks mostly ok to me. >>> >>> There's still some "cleanup" part mangled together, e.g., the real meat >>> should be avoiding the folio_is_longterm_pinnable() check in relevant >>> paths. The rest (e.g. switch slow-gup / memfd pin to use folio_ref_add() >>> not try_get_folio(), and renames) could be good cleanups. >>> >>> So a smaller fix might be doable, but again I don't have a strong opinion >>> here. >> >> The smaller the better for backporting, of course. > > I think a smaller version might be yangge's patch, plus Yang's hugepd > "fast" parameter for the hugepd stack, then hugepd can also use > try_grab_page(). memfd-pin change can be a separate small patch perhaps > squashed. > If needed, I can submit a new version based on Yang's V1 version. > I'll leave how to move on to Yang. > > Thanks, >
在 2024/6/28 7:43, Peter Xu 写道: > On Thu, Jun 27, 2024 at 04:32:42PM -0700, Andrew Morton wrote: >> On Thu, 27 Jun 2024 19:19:40 -0400 Peter Xu <peterx@redhat.com> wrote: >> >>> Yang, >>> >>> On Thu, Jun 27, 2024 at 03:14:13PM -0700, Yang Shi wrote: >>>> The try_grab_folio() is supposed to be used in fast path and it elevates >>>> folio refcount by using add ref unless zero. We are guaranteed to have >>>> at least one stable reference in slow path, so the simple atomic add >>>> could be used. The performance difference should be trivial, but the >>>> misuse may be confusing and misleading. >>> >>> This first paragraph is IMHO misleading itself.. >>> >>> I think we should mention upfront the important bit, on the user impact. >>> >>> Here IMO the user impact should be: Linux may fail longterm pin in some >>> releavnt paths when applied over CMA reserved blocks. And if to extend a >>> bit, that include not only slow-gup but also the new memfd pinning, because >>> both of them used try_grab_folio() which used to be only for fast-gup. >> >> It's still unclear how users will be affected. What do the *users* >> see? If it's a slight slowdown, do we need to backport this at all? > > The user will see the pin fails, for gpu-slow it further triggers the WARN > right below that failure (as in the original report): > > folio = try_grab_folio(page, page_increm - 1, > foll_flags); > if (WARN_ON_ONCE(!folio)) { <------------------------ here > /* > * Release the 1st page ref if the > * folio is problematic, fail hard. > */ > gup_put_folio(page_folio(page), 1, > foll_flags); > ret = -EFAULT; > goto out; > } > > For memfd pin and hugepd paths, they should just observe GUP failure on > those longterm pins, and it'll be the caller context to decide what user > can see, I think. > >> >>> >>> The patch itself looks mostly ok to me. >>> >>> There's still some "cleanup" part mangled together, e.g., the real meat >>> should be avoiding the folio_is_longterm_pinnable() check in relevant >>> paths. The rest (e.g. switch slow-gup / memfd pin to use folio_ref_add() >>> not try_get_folio(), and renames) could be good cleanups. >>> >>> So a smaller fix might be doable, but again I don't have a strong opinion >>> here. >> >> The smaller the better for backporting, of course. > > I think a smaller version might be yangge's patch, plus Yang's hugepd > "fast" parameter for the hugepd stack, then hugepd can also use > try_grab_page(). memfd-pin change can be a separate small patch perhaps > squashed. Thanks, new version: https://lore.kernel.org/all/1719554518-11006-1-git-send-email-yangge1116@126.com/ > > I'll leave how to move on to Yang. > > Thanks, >
> +int __must_check try_grab_folio(struct folio *folio, int refs, unsigned int flags) Overly long line (same for the external declaration) > + struct page *page = &folio->page; Page is only used for is_pci_p2pdma_page and is_zero_page, and for the latter a is_zero_folio already exist. Maybe remove the local variable, use is_zero_folio and just open code the dereference in the is_pci_p2pdma_page call? > + ret = gup_hugepte(vma, ptep, sz, addr, end, flags, pages, nr, fast); Overly lone line. > + folio_ref_add(folio, > + refs * (GUP_PIN_COUNTING_BIAS - 1)); Nit: this easily fits onto a single line. > if (gup_hugepd(NULL, __hugepd(pmd_val(pmd)), addr, > - PMD_SHIFT, next, flags, pages, nr) != 1) > + PMD_SHIFT, next, flags, pages, nr, true) != 1) Overly long lin (same in the similar calls below)
On Thu, Jun 27, 2024 at 11:35 PM Christoph Hellwig <hch@infradead.org> wrote: > > > +int __must_check try_grab_folio(struct folio *folio, int refs, unsigned int flags) > > Overly long line (same for the external declaration) > > > + struct page *page = &folio->page; > > Page is only used for is_pci_p2pdma_page and is_zero_page, and for > the latter a is_zero_folio already exist. Maybe remove the local > variable, use is_zero_folio and just open code the dereference in the > is_pci_p2pdma_page call? Thanks, Christoph. Good point, I think we can use is_zero_folio and open coeded it in is_pci_p2pdma_page. And all the format problems will be solved in v3. > > > + ret = gup_hugepte(vma, ptep, sz, addr, end, flags, pages, nr, fast); > > Overly lone line. > > > + folio_ref_add(folio, > > + refs * (GUP_PIN_COUNTING_BIAS - 1)); > > Nit: this easily fits onto a single line. > > > if (gup_hugepd(NULL, __hugepd(pmd_val(pmd)), addr, > > - PMD_SHIFT, next, flags, pages, nr) != 1) > > + PMD_SHIFT, next, flags, pages, nr, true) != 1) > > Overly long lin (same in the similar calls below) > >
diff --git a/mm/gup.c b/mm/gup.c index 8bea9ad80984..7439359d0b71 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -99,95 +99,6 @@ static inline struct folio *try_get_folio(struct page *page, int refs) return folio; } -/** - * try_grab_folio() - Attempt to get or pin a folio. - * @page: pointer to page to be grabbed - * @refs: the value to (effectively) add to the folio's refcount - * @flags: gup flags: these are the FOLL_* flag values. - * - * "grab" names in this file mean, "look at flags to decide whether to use - * FOLL_PIN or FOLL_GET behavior, when incrementing the folio's refcount. - * - * Either FOLL_PIN or FOLL_GET (or neither) must be set, but not both at the - * same time. (That's true throughout the get_user_pages*() and - * pin_user_pages*() APIs.) Cases: - * - * FOLL_GET: folio's refcount will be incremented by @refs. - * - * FOLL_PIN on large folios: folio's refcount will be incremented by - * @refs, and its pincount will be incremented by @refs. - * - * FOLL_PIN on single-page folios: folio's refcount will be incremented by - * @refs * GUP_PIN_COUNTING_BIAS. - * - * Return: The folio containing @page (with refcount appropriately - * incremented) for success, or NULL upon failure. If neither FOLL_GET - * nor FOLL_PIN was set, that's considered failure, and furthermore, - * a likely bug in the caller, so a warning is also emitted. - */ -struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags) -{ - struct folio *folio; - - if (WARN_ON_ONCE((flags & (FOLL_GET | FOLL_PIN)) == 0)) - return NULL; - - if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page))) - return NULL; - - if (flags & FOLL_GET) - return try_get_folio(page, refs); - - /* FOLL_PIN is set */ - - /* - * Don't take a pin on the zero page - it's not going anywhere - * and it is used in a *lot* of places. - */ - if (is_zero_page(page)) - return page_folio(page); - - folio = try_get_folio(page, refs); - if (!folio) - return NULL; - - /* - * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a - * right zone, so fail and let the caller fall back to the slow - * path. - */ - if (unlikely((flags & FOLL_LONGTERM) && - !folio_is_longterm_pinnable(folio))) { - if (!put_devmap_managed_folio_refs(folio, refs)) - folio_put_refs(folio, refs); - return NULL; - } - - /* - * When pinning a large folio, use an exact count to track it. - * - * However, be sure to *also* increment the normal folio - * refcount field at least once, so that the folio really - * is pinned. That's why the refcount from the earlier - * try_get_folio() is left intact. - */ - if (folio_test_large(folio)) - atomic_add(refs, &folio->_pincount); - else - folio_ref_add(folio, - refs * (GUP_PIN_COUNTING_BIAS - 1)); - /* - * Adjust the pincount before re-checking the PTE for changes. - * This is essentially a smp_mb() and is paired with a memory - * barrier in folio_try_share_anon_rmap_*(). - */ - smp_mb__after_atomic(); - - node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs); - - return folio; -} - static void gup_put_folio(struct folio *folio, int refs, unsigned int flags) { if (flags & FOLL_PIN) { @@ -205,28 +116,31 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags) } /** - * try_grab_page() - elevate a page's refcount by a flag-dependent amount - * @page: pointer to page to be grabbed - * @flags: gup flags: these are the FOLL_* flag values. + * try_grab_folio() - add a folio's refcount by a flag-dependent amount + * @folio: pointer to folio to be grabbed + * @refs: the value to (effectively) add to the folio's refcount + * @flags: gup flags: these are the FOLL_* flag values * * This might not do anything at all, depending on the flags argument. * * "grab" names in this file mean, "look at flags to decide whether to use - * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount. + * FOLL_PIN or FOLL_GET behavior, when incrementing the folio's refcount. * * Either FOLL_PIN or FOLL_GET (or neither) may be set, but not both at the same - * time. Cases: please see the try_grab_folio() documentation, with - * "refs=1". + * time. * * Return: 0 for success, or if no action was required (if neither FOLL_PIN * nor FOLL_GET was set, nothing is done). A negative error code for failure: * - * -ENOMEM FOLL_GET or FOLL_PIN was set, but the page could not + * -ENOMEM FOLL_GET or FOLL_PIN was set, but the folio could not * be grabbed. + * + * It is called when we have a stable reference for the folio, typically in + * GUP slow path. */ -int __must_check try_grab_page(struct page *page, unsigned int flags) +int __must_check try_grab_folio(struct folio *folio, int refs, unsigned int flags) { - struct folio *folio = page_folio(page); + struct page *page = &folio->page; if (WARN_ON_ONCE(folio_ref_count(folio) <= 0)) return -ENOMEM; @@ -235,7 +149,7 @@ int __must_check try_grab_page(struct page *page, unsigned int flags) return -EREMOTEIO; if (flags & FOLL_GET) - folio_ref_inc(folio); + folio_ref_add(folio, refs); else if (flags & FOLL_PIN) { /* * Don't take a pin on the zero page - it's not going anywhere @@ -245,18 +159,18 @@ int __must_check try_grab_page(struct page *page, unsigned int flags) return 0; /* - * Similar to try_grab_folio(): be sure to *also* - * increment the normal page refcount field at least once, + * Increment the normal page refcount field at least once, * so that the page really is pinned. */ if (folio_test_large(folio)) { - folio_ref_add(folio, 1); - atomic_add(1, &folio->_pincount); + folio_ref_add(folio, refs); + atomic_add(refs, &folio->_pincount); } else { - folio_ref_add(folio, GUP_PIN_COUNTING_BIAS); + folio_ref_add(folio, + refs * GUP_PIN_COUNTING_BIAS); } - node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1); + node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs); } return 0; @@ -584,7 +498,7 @@ static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end, */ static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz, unsigned long addr, unsigned long end, unsigned int flags, - struct page **pages, int *nr) + struct page **pages, int *nr, bool fast) { unsigned long pte_end; struct page *page; @@ -607,9 +521,15 @@ static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz page = pte_page(pte); refs = record_subpages(page, sz, addr, end, pages + *nr); - folio = try_grab_folio(page, refs, flags); - if (!folio) - return 0; + if (fast) { + folio = try_grab_folio_fast(page, refs, flags); + if (!folio) + return 0; + } else { + folio = page_folio(page); + if (try_grab_folio(folio, refs, flags)) + return 0; + } if (unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) { gup_put_folio(folio, refs, flags); @@ -637,7 +557,7 @@ static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz static int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd, unsigned long addr, unsigned int pdshift, unsigned long end, unsigned int flags, - struct page **pages, int *nr) + struct page **pages, int *nr, bool fast) { pte_t *ptep; unsigned long sz = 1UL << hugepd_shift(hugepd); @@ -647,7 +567,7 @@ static int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd, ptep = hugepte_offset(hugepd, addr, pdshift); do { next = hugepte_addr_end(addr, end, sz); - ret = gup_hugepte(vma, ptep, sz, addr, end, flags, pages, nr); + ret = gup_hugepte(vma, ptep, sz, addr, end, flags, pages, nr, fast); if (ret != 1) return ret; } while (ptep++, addr = next, addr != end); @@ -674,7 +594,7 @@ static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd, ptep = hugepte_offset(hugepd, addr, pdshift); ptl = huge_pte_lock(h, vma->vm_mm, ptep); ret = gup_hugepd(vma, hugepd, addr, pdshift, addr + PAGE_SIZE, - flags, &page, &nr); + flags, &page, &nr, false); spin_unlock(ptl); if (ret == 1) { @@ -691,7 +611,7 @@ static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd, static inline int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd, unsigned long addr, unsigned int pdshift, unsigned long end, unsigned int flags, - struct page **pages, int *nr) + struct page **pages, int *nr, bool fast) { return 0; } @@ -778,7 +698,7 @@ static struct page *follow_huge_pud(struct vm_area_struct *vma, gup_must_unshare(vma, flags, page)) return ERR_PTR(-EMLINK); - ret = try_grab_page(page, flags); + ret = try_grab_folio(page_folio(page), 1, flags); if (ret) page = ERR_PTR(ret); else @@ -855,7 +775,7 @@ static struct page *follow_huge_pmd(struct vm_area_struct *vma, VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) && !PageAnonExclusive(page), page); - ret = try_grab_page(page, flags); + ret = try_grab_folio(page_folio(page), 1, flags); if (ret) return ERR_PTR(ret); @@ -1017,8 +937,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) && !PageAnonExclusive(page), page); - /* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */ - ret = try_grab_page(page, flags); + /* try_grab_folio() does nothing unless FOLL_GET or FOLL_PIN is set. */ + ret = try_grab_folio(page_folio(page), 1, flags); if (unlikely(ret)) { page = ERR_PTR(ret); goto out; @@ -1282,7 +1202,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address, goto unmap; *page = pte_page(entry); } - ret = try_grab_page(*page, gup_flags); + ret = try_grab_folio(page_folio(*page), 1, gup_flags); if (unlikely(ret)) goto unmap; out: @@ -1685,20 +1605,19 @@ static long __get_user_pages(struct mm_struct *mm, * pages. */ if (page_increm > 1) { - struct folio *folio; + struct folio *folio = page_folio(page); /* * Since we already hold refcount on the * large folio, this should never fail. */ - folio = try_grab_folio(page, page_increm - 1, - foll_flags); - if (WARN_ON_ONCE(!folio)) { + if (try_grab_folio(folio, page_increm - 1, + foll_flags)) { /* * Release the 1st page ref if the * folio is problematic, fail hard. */ - gup_put_folio(page_folio(page), 1, + gup_put_folio(folio, 1, foll_flags); ret = -EFAULT; goto out; @@ -2876,6 +2795,101 @@ EXPORT_SYMBOL(get_user_pages_unlocked); * This code is based heavily on the PowerPC implementation by Nick Piggin. */ #ifdef CONFIG_HAVE_GUP_FAST +/** + * try_grab_folio_fast() - Attempt to get or pin a folio in fast path. + * @page: pointer to page to be grabbed + * @refs: the value to (effectively) add to the folio's refcount + * @flags: gup flags: these are the FOLL_* flag values. + * + * "grab" names in this file mean, "look at flags to decide whether to use + * FOLL_PIN or FOLL_GET behavior, when incrementing the folio's refcount. + * + * Either FOLL_PIN or FOLL_GET (or neither) must be set, but not both at the + * same time. (That's true throughout the get_user_pages*() and + * pin_user_pages*() APIs.) Cases: + * + * FOLL_GET: folio's refcount will be incremented by @refs. + * + * FOLL_PIN on large folios: folio's refcount will be incremented by + * @refs, and its pincount will be incremented by @refs. + * + * FOLL_PIN on single-page folios: folio's refcount will be incremented by + * @refs * GUP_PIN_COUNTING_BIAS. + * + * Return: The folio containing @page (with refcount appropriately + * incremented) for success, or NULL upon failure. If neither FOLL_GET + * nor FOLL_PIN was set, that's considered failure, and furthermore, + * a likely bug in the caller, so a warning is also emitted. + * + * It uses add ref unless zero to elevate the folio refcount and must be called + * in fast path only. + */ +static struct folio *try_grab_folio_fast(struct page *page, int refs, + unsigned int flags) +{ + struct folio *folio; + + /* Raise warn if it is not called in fast GUP */ + VM_WARN_ON_ONCE(!irqs_disabled()); + + if (WARN_ON_ONCE((flags & (FOLL_GET | FOLL_PIN)) == 0)) + return NULL; + + if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page))) + return NULL; + + if (flags & FOLL_GET) + return try_get_folio(page, refs); + + /* FOLL_PIN is set */ + + /* + * Don't take a pin on the zero page - it's not going anywhere + * and it is used in a *lot* of places. + */ + if (is_zero_page(page)) + return page_folio(page); + + folio = try_get_folio(page, refs); + if (!folio) + return NULL; + + /* + * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a + * right zone, so fail and let the caller fall back to the slow + * path. + */ + if (unlikely((flags & FOLL_LONGTERM) && + !folio_is_longterm_pinnable(folio))) { + if (!put_devmap_managed_folio_refs(folio, refs)) + folio_put_refs(folio, refs); + return NULL; + } + + /* + * When pinning a large folio, use an exact count to track it. + * + * However, be sure to *also* increment the normal folio + * refcount field at least once, so that the folio really + * is pinned. That's why the refcount from the earlier + * try_get_folio() is left intact. + */ + if (folio_test_large(folio)) + atomic_add(refs, &folio->_pincount); + else + folio_ref_add(folio, + refs * (GUP_PIN_COUNTING_BIAS - 1)); + /* + * Adjust the pincount before re-checking the PTE for changes. + * This is essentially a smp_mb() and is paired with a memory + * barrier in folio_try_share_anon_rmap_*(). + */ + smp_mb__after_atomic(); + + node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs); + + return folio; +} /* * Used in the GUP-fast path to determine whether GUP is permitted to work on @@ -3041,7 +3055,7 @@ static int gup_fast_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, VM_BUG_ON(!pfn_valid(pte_pfn(pte))); page = pte_page(pte); - folio = try_grab_folio(page, 1, flags); + folio = try_grab_folio_fast(page, 1, flags); if (!folio) goto pte_unmap; @@ -3128,7 +3142,7 @@ static int gup_fast_devmap_leaf(unsigned long pfn, unsigned long addr, break; } - folio = try_grab_folio(page, 1, flags); + folio = try_grab_folio_fast(page, 1, flags); if (!folio) { gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages); break; @@ -3217,7 +3231,7 @@ static int gup_fast_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr, page = pmd_page(orig); refs = record_subpages(page, PMD_SIZE, addr, end, pages + *nr); - folio = try_grab_folio(page, refs, flags); + folio = try_grab_folio_fast(page, refs, flags); if (!folio) return 0; @@ -3261,7 +3275,7 @@ static int gup_fast_pud_leaf(pud_t orig, pud_t *pudp, unsigned long addr, page = pud_page(orig); refs = record_subpages(page, PUD_SIZE, addr, end, pages + *nr); - folio = try_grab_folio(page, refs, flags); + folio = try_grab_folio_fast(page, refs, flags); if (!folio) return 0; @@ -3301,7 +3315,7 @@ static int gup_fast_pgd_leaf(pgd_t orig, pgd_t *pgdp, unsigned long addr, page = pgd_page(orig); refs = record_subpages(page, PGDIR_SIZE, addr, end, pages + *nr); - folio = try_grab_folio(page, refs, flags); + folio = try_grab_folio_fast(page, refs, flags); if (!folio) return 0; @@ -3355,7 +3369,7 @@ static int gup_fast_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, * pmd format and THP pmd format */ if (gup_hugepd(NULL, __hugepd(pmd_val(pmd)), addr, - PMD_SHIFT, next, flags, pages, nr) != 1) + PMD_SHIFT, next, flags, pages, nr, true) != 1) return 0; } else if (!gup_fast_pte_range(pmd, pmdp, addr, next, flags, pages, nr)) @@ -3385,7 +3399,7 @@ static int gup_fast_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, return 0; } else if (unlikely(is_hugepd(__hugepd(pud_val(pud))))) { if (gup_hugepd(NULL, __hugepd(pud_val(pud)), addr, - PUD_SHIFT, next, flags, pages, nr) != 1) + PUD_SHIFT, next, flags, pages, nr, true) != 1) return 0; } else if (!gup_fast_pmd_range(pudp, pud, addr, next, flags, pages, nr)) @@ -3412,7 +3426,7 @@ static int gup_fast_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, BUILD_BUG_ON(p4d_leaf(p4d)); if (unlikely(is_hugepd(__hugepd(p4d_val(p4d))))) { if (gup_hugepd(NULL, __hugepd(p4d_val(p4d)), addr, - P4D_SHIFT, next, flags, pages, nr) != 1) + P4D_SHIFT, next, flags, pages, nr, true) != 1) return 0; } else if (!gup_fast_pud_range(p4dp, p4d, addr, next, flags, pages, nr)) @@ -3441,7 +3455,7 @@ static void gup_fast_pgd_range(unsigned long addr, unsigned long end, return; } else if (unlikely(is_hugepd(__hugepd(pgd_val(pgd))))) { if (gup_hugepd(NULL, __hugepd(pgd_val(pgd)), addr, - PGDIR_SHIFT, next, flags, pages, nr) != 1) + PGDIR_SHIFT, next, flags, pages, nr, true) != 1) return; } else if (!gup_fast_p4d_range(pgdp, pgd, addr, next, flags, pages, nr)) @@ -3842,14 +3856,15 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end, next_idx != folio_index(fbatch.folios[i])) continue; - folio = try_grab_folio(&fbatch.folios[i]->page, - 1, FOLL_PIN); - if (!folio) { + if (try_grab_folio(fbatch.folios[i], + 1, FOLL_PIN)) { folio_batch_release(&fbatch); ret = -EINVAL; goto err; } + folio = fbatch.folios[i]; + if (nr_folios == 0) *offset = offset_in_folio(folio, start); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index c7ce28f6b7f3..954c63575917 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1333,7 +1333,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr, if (!*pgmap) return ERR_PTR(-EFAULT); page = pfn_to_page(pfn); - ret = try_grab_page(page, flags); + ret = try_grab_folio(page_folio(page), 1, flags); if (ret) page = ERR_PTR(ret); diff --git a/mm/internal.h b/mm/internal.h index 2ea9a88dcb95..b264a7dabefe 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1226,8 +1226,7 @@ int migrate_device_coherent_page(struct page *page); /* * mm/gup.c */ -struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags); -int __must_check try_grab_page(struct page *page, unsigned int flags); +int __must_check try_grab_folio(struct folio *folio, int refs, unsigned int flags); /* * mm/huge_memory.c
The try_grab_folio() is supposed to be used in fast path and it elevates folio refcount by using add ref unless zero. We are guaranteed to have at least one stable reference in slow path, so the simple atomic add could be used. The performance difference should be trivial, but the misuse may be confusing and misleading. In another thread [1] a kernel warning was reported when pinning folio in CMA memory when launching SEV virtual machine. The splat looks like: [ 464.325306] WARNING: CPU: 13 PID: 6734 at mm/gup.c:1313 __get_user_pages+0x423/0x520 [ 464.325464] CPU: 13 PID: 6734 Comm: qemu-kvm Kdump: loaded Not tainted 6.6.33+ #6 [ 464.325477] RIP: 0010:__get_user_pages+0x423/0x520 [ 464.325515] Call Trace: [ 464.325520] <TASK> [ 464.325523] ? __get_user_pages+0x423/0x520 [ 464.325528] ? __warn+0x81/0x130 [ 464.325536] ? __get_user_pages+0x423/0x520 [ 464.325541] ? report_bug+0x171/0x1a0 [ 464.325549] ? handle_bug+0x3c/0x70 [ 464.325554] ? exc_invalid_op+0x17/0x70 [ 464.325558] ? asm_exc_invalid_op+0x1a/0x20 [ 464.325567] ? __get_user_pages+0x423/0x520 [ 464.325575] __gup_longterm_locked+0x212/0x7a0 [ 464.325583] internal_get_user_pages_fast+0xfb/0x190 [ 464.325590] pin_user_pages_fast+0x47/0x60 [ 464.325598] sev_pin_memory+0xca/0x170 [kvm_amd] [ 464.325616] sev_mem_enc_register_region+0x81/0x130 [kvm_amd] Per the analysis done by yangge, when starting the SEV virtual machine, it will call pin_user_pages_fast(..., FOLL_LONGTERM, ...) to pin the memory. But the page is in CMA area, so fast GUP will fail then fallback to the slow path due to the longterm pinnalbe check in try_grab_folio(). The slow path will try to pin the pages then migrate them out of CMA area. But the slow path also uses try_grab_folio() to pin the page, it will also fail due to the same check then the above warning is triggered. [1] https://lore.kernel.org/linux-mm/1719478388-31917-1-git-send-email-yangge1116@126.com/ Fixes: 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"") Cc: <stable@vger.kernel.org> [6.6+] Reported-by: yangge <yangge1116@126.com> Signed-off-by: Yang Shi <yang@os.amperecomputing.com> --- mm/gup.c | 285 +++++++++++++++++++++++++---------------------- mm/huge_memory.c | 2 +- mm/internal.h | 3 +- 3 files changed, 152 insertions(+), 138 deletions(-) v2: 1. Fixed the build warning 2. Reworked the commit log to include the bug report and analysis (reworded by me) from yangge 3. Rebased onto the latest mm-unstable