| Message ID | 165490039431.944052.12458624139225785964.stgit@omen (mailing list archive) |
|---|---|
| State | New |
| Series | mm: Re-allow pinning of zero pfns |
On Fri, Jun 10, 2022 at 04:35:13PM -0600, Alex Williamson wrote:
> The commit referenced below subtly and inadvertently changed the logic
> to disallow pinning of zero pfns. This breaks device assignment with
> vfio and potentially various other users of gup. Exclude the zero page
> test from the negation.
>
> Fixes: 1c563432588d ("mm: fix is_pinnable_page against a cma page")
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>
> At least I assume this was inadvertent... If there's a better fix,
> please run with it as I'm out of the office the 1st half of next
> week and would like to see this fixed ASAP. Thanks!
>
>  include/linux/mm.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bc8f326be0ce..781fae17177d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1600,7 +1600,7 @@ static inline bool is_pinnable_page(struct page *page)
>  	if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE)
>  		return false;
>  #endif
> -	return !(is_zone_movable_page(page) || is_zero_pfn(page_to_pfn(page)));
> +	return !is_zone_movable_page(page) || is_zero_pfn(page_to_pfn(page));

Thanks for catching! I don't think the zero pfn could stay in the movable
zone or a CMA area.

Acked-by: Minchan Kim <minchan@kernel.org>
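[Editor's note: a minimal standalone sketch, not kernel code, of the boolean change being discussed. The shared zero page sits in a regular, non-movable zone, so the parenthesised negation introduced by commit 1c563432588d reports it as not pinnable, while the fixed expression allows the pin. The two booleans merely stand in for the real predicates.]

```c
#include <stdbool.h>
#include <stdio.h>

int main(void)
{
	bool movable = false;	/* the zero page is not in ZONE_MOVABLE */
	bool zero_pfn = true;	/* it is the zero pfn */

	/* after 1c563432588d: zero page reported as not pinnable (the regression) */
	printf("broken: %d\n", !(movable || zero_pfn));	/* prints 0 */

	/* with this patch: zero page is pinnable again */
	printf("fixed:  %d\n", !movable || zero_pfn);	/* prints 1 */

	return 0;
}
```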
On 11.06.22 00:35, Alex Williamson wrote:
> The commit referenced below subtly and inadvertently changed the logic
> to disallow pinning of zero pfns. This breaks device assignment with
> vfio and potentially various other users of gup. Exclude the zero page
> test from the negation.

I wonder which setups can reliably work with a long-term pin on a shared
zeropage. In a MAP_PRIVATE mapping, any write access via the page tables
will end up replacing the shared zeropage with an anonymous page.
Something similar should apply in MAP_SHARED mappings, when lazily
allocating disk blocks.

In the future, we might trigger unsharing when taking a R/O pin for the
shared zeropage, just like we do as of now upstream for shared anonymous
pages (!PageAnonExclusive). And something similar could then be done
when finding a !anon page in a MAP_SHARED mapping.

> Fixes: 1c563432588d ("mm: fix is_pinnable_page against a cma page")

Having that said, it indeed looks like that was an unintended change.

Acked-by: David Hildenbrand <david@redhat.com>

> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>
> At least I assume this was inadvertent... If there's a better fix,
> please run with it as I'm out of the office the 1st half of next
> week and would like to see this fixed ASAP. Thanks!
>
>  include/linux/mm.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bc8f326be0ce..781fae17177d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1600,7 +1600,7 @@ static inline bool is_pinnable_page(struct page *page)
>  	if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE)
>  		return false;
>  #endif
> -	return !(is_zone_movable_page(page) || is_zero_pfn(page_to_pfn(page)));
> +	return !is_zone_movable_page(page) || is_zero_pfn(page_to_pfn(page));
>  }
>  #else
>  static inline bool is_pinnable_page(struct page *page)
>
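[Editor's illustration, not from the thread: the MAP_PRIVATE behaviour David describes can be observed from plain userspace C. A read fault installs the shared zeropage in the PTE, and the first write replaces it with a private anonymous page, so a long-term pin taken on the zeropage beforehand would no longer cover the data that is written later.]

```c
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	unsigned char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	unsigned char before = p[0];	/* read fault: PTE now points at the shared zeropage */
	p[0] = 0x42;			/* write fault: CoW installs a private anonymous page */

	printf("before=%u after=%u\n", before, p[0]);
	munmap(p, 4096);
	return 0;
}
```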
On Sat, Jun 11, 2022 at 08:29:47PM +0200, David Hildenbrand wrote:
> On 11.06.22 00:35, Alex Williamson wrote:
> > The commit referenced below subtly and inadvertently changed the logic
> > to disallow pinning of zero pfns. This breaks device assignment with
> > vfio and potentially various other users of gup. Exclude the zero page
> > test from the negation.
>
> I wonder which setups can reliably work with a long-term pin on a shared
> zeropage. In a MAP_PRIVATE mapping, any write access via the page tables
> will end up replacing the shared zeropage with an anonymous page.
> Something similar should apply in MAP_SHARED mappings, when lazily
> allocating disk blocks.
>
> In the future, we might trigger unsharing when taking a R/O pin for the
> shared zeropage, just like we do as of now upstream for shared anonymous
> pages (!PageAnonExclusive). And something similar could then be done
> when finding a !anon page in a MAP_SHARED mapping.

I'm also confused how qemu is hitting this and it isn't already a bug?

It is arising because vfio doesn't use FOLL_FORCE|FOLL_WRITE to move
away the zero page in most cases.

And why does Yishai say it causes an infinite loop in the kernel?

Jason
On 15.06.22 17:56, Jason Gunthorpe wrote:
> On Sat, Jun 11, 2022 at 08:29:47PM +0200, David Hildenbrand wrote:
>> On 11.06.22 00:35, Alex Williamson wrote:
>>> The commit referenced below subtly and inadvertently changed the logic
>>> to disallow pinning of zero pfns. This breaks device assignment with
>>> vfio and potentially various other users of gup. Exclude the zero page
>>> test from the negation.
>>
>> I wonder which setups can reliably work with a long-term pin on a shared
>> zeropage. In a MAP_PRIVATE mapping, any write access via the page tables
>> will end up replacing the shared zeropage with an anonymous page.
>> Something similar should apply in MAP_SHARED mappings, when lazily
>> allocating disk blocks.

^ correction, the shared zeropage is never used in MAP_SHARED mappings
(fortunately).

>>
>> In the future, we might trigger unsharing when taking a R/O pin for the
>> shared zeropage, just like we do as of now upstream for shared anonymous
>> pages (!PageAnonExclusive). And something similar could then be done
>> when finding a !anon page in a MAP_SHARED mapping.
>
> I'm also confused how qemu is hitting this and it isn't already a bug?
>

I assume it's just some random thingy mapped into the guest physical
address space (by the bios? R/O?) that actually never ends up getting
used by a device.

So vfio simply only needs this to keep working ... but we won't actually
ever use that data.

But this is just my best guess after thinking about it.

> It is arising because vfio doesn't use FOLL_FORCE|FOLL_WRITE to move
> away the zero page in most cases.
>
> And why does Yishai say it causes an infinite loop in the kernel?

Good question. Maybe $something keeps retrying if pinning fails, either
in the kernel (which would be bad) or in user space. At least QEMU seems
to just fail if pinning fails, but maybe it's a different user space?
On Thu, 23 Jun 2022 20:07:14 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 15.06.22 17:56, Jason Gunthorpe wrote:
> > On Sat, Jun 11, 2022 at 08:29:47PM +0200, David Hildenbrand wrote:
> >> On 11.06.22 00:35, Alex Williamson wrote:
> >>> The commit referenced below subtly and inadvertently changed the logic
> >>> to disallow pinning of zero pfns. This breaks device assignment with
> >>> vfio and potentially various other users of gup. Exclude the zero page
> >>> test from the negation.
> >>
> >> I wonder which setups can reliably work with a long-term pin on a shared
> >> zeropage. In a MAP_PRIVATE mapping, any write access via the page tables
> >> will end up replacing the shared zeropage with an anonymous page.
> >> Something similar should apply in MAP_SHARED mappings, when lazily
> >> allocating disk blocks.
>
> ^ correction, the shared zeropage is never used in MAP_SHARED mappings
> (fortunately).
>
> >>
> >> In the future, we might trigger unsharing when taking a R/O pin for the
> >> shared zeropage, just like we do as of now upstream for shared anonymous
> >> pages (!PageAnonExclusive). And something similar could then be done
> >> when finding a !anon page in a MAP_SHARED mapping.
> >
> > I'm also confused how qemu is hitting this and it isn't already a bug?
> >
>
> I assume it's just some random thingy mapped into the guest physical
> address space (by the bios? R/O?) that actually never ends up getting
> used by a device.
>
> So vfio simply only needs this to keep working ... but we won't actually
> ever use that data.
>
> But this is just my best guess after thinking about it.

Good guess.

> > It is arising because vfio doesn't use FOLL_FORCE|FOLL_WRITE to move
> > away the zero page in most cases.
> >
> > And why does Yishai say it causes an infinite loop in the kernel?
>
> Good question. Maybe $something keeps retrying if pinning fails, either
> in the kernel (which would be bad) or in user space. At least QEMU seems
> to just fail if pinning fails, but maybe it's a different user space?

The loop is in __gup_longterm_locked():

	do {
		rc = __get_user_pages_locked(mm, start, nr_pages, pages,
					     vmas, NULL, gup_flags);
		if (rc <= 0)
			break;
		rc = check_and_migrate_movable_pages(rc, pages, gup_flags);
	} while (!rc);

It appears we're pinning a 32 page (128K) range, __get_user_pages_locked()
returns 32, but check_and_migrate_movable_pages() perpetually returns
zero. I believe this is because folio_is_pinnable() previously returned
true, and now returns false. Therefore we drop down to fail at
folio_isolate_lru(), incrementing isolation_error_count. From there we
do nothing more than unpin the pages, return zero, and hope for better
luck next time, which obviously doesn't happen.

If I generate an errno here, QEMU reports failing on the pc.rom memory
region at 0xc0000. Thanks,

Alex
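[Editor's sketch, illustrative only: a condensed userspace model of the retry loop Alex quotes. The function names are stand-ins, the 32-page count is taken from his description, and the retry cap exists only so the demo terminates; the real kernel loop has no cap. It shows why a persistent 0 return from the migrate step makes the caller spin.]

```c
#include <stdio.h>

static long fake_get_user_pages(void)
{
	return 32;	/* all 32 pages of the 128K range were pinned */
}

static long fake_check_and_migrate(void)
{
	/* zero page deemed unpinnable, isolation fails, pages are unpinned,
	 * and 0 is returned, i.e. "please retry", every single time */
	return 0;
}

int main(void)
{
	long rc;
	int tries = 0;

	do {
		rc = fake_get_user_pages();
		if (rc <= 0)
			break;
		rc = fake_check_and_migrate();
	} while (!rc && ++tries < 5);	/* the kernel loop has no such cap */

	printf("still rc=%ld after %d retries\n", rc, tries);
	return 0;
}
```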
On Thu, Jun 23, 2022 at 02:21:39PM -0600, Alex Williamson wrote:

> check_and_migrate_movable_pages() perpetually returns zero. I believe
> this is because folio_is_pinnable() previously returned true, and now
> returns false.

Indeed, it is a bug that check_and_migrate_movable_pages() returns
0 when it didn't do anything. It should return an error code.

Hum.. Alistair, maybe you should look at this as well, I'm struggling
a lot to understand how it is safe to drop the reference on the page
but hold a pointer to it on the movable_page_list - sure it was
isolated - but why does that mean it won't be concurrently unmapped
and freed?

Anyhow, it looks like the problem is the tortured logic in this
function, what do you think about this:

diff --git a/mm/gup.c b/mm/gup.c
index 5512644076246d..2ffcb3f4ff4a7b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1879,10 +1879,15 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
 					    unsigned int gup_flags)
 {
 	unsigned long isolation_error_count = 0, i;
+	struct migration_target_control mtc = {
+		.nid = NUMA_NO_NODE,
+		.gfp_mask = GFP_USER | __GFP_NOWARN,
+	};
 	struct folio *prev_folio = NULL;
 	LIST_HEAD(movable_page_list);
 	bool drain_allow = true;
-	int ret = 0;
+	int not_migrated;
+	int ret;

 	for (i = 0; i < nr_pages; i++) {
 		struct folio *folio = page_folio(pages[i]);
@@ -1919,16 +1924,13 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
 					    folio_nr_pages(folio));
 	}

-	if (!list_empty(&movable_page_list) || isolation_error_count)
-		goto unpin_pages;
-
 	/*
 	 * If list is empty, and no isolation errors, means that all pages are
-	 * in the correct zone.
+	 * in the correct zone, nothing to do.
 	 */
-	return nr_pages;
+	if (list_empty(&movable_page_list) && !isolation_error_count)
+		return nr_pages;

-unpin_pages:
 	if (gup_flags & FOLL_PIN) {
 		unpin_user_pages(pages, nr_pages);
 	} else {
@@ -1936,20 +1938,22 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
 			put_page(pages[i]);
 	}

-	if (!list_empty(&movable_page_list)) {
-		struct migration_target_control mtc = {
-			.nid = NUMA_NO_NODE,
-			.gfp_mask = GFP_USER | __GFP_NOWARN,
-		};
+	if (isolation_error_count) {
+		ret = -EINVAL;
+		goto err_putback;
+	}

-		ret = migrate_pages(&movable_page_list, alloc_migration_target,
-				    NULL, (unsigned long)&mtc, MIGRATE_SYNC,
-				    MR_LONGTERM_PIN, NULL);
-		if (ret > 0) /* number of pages not migrated */
-			ret = -ENOMEM;
+	not_migrated = migrate_pages(&movable_page_list, alloc_migration_target,
+				     NULL, (unsigned long)&mtc, MIGRATE_SYNC,
+				     MR_LONGTERM_PIN, NULL);
+	if (not_migrated > 0) {
+		ret = -ENOMEM;
+		goto err_putback;
 	}
+	return 0;

-	if (ret && !list_empty(&movable_page_list))
+err_putback:
+	if (!list_empty(&movable_page_list))
 		putback_movable_pages(&movable_page_list);
 	return ret;
 }

> If I generate an errno here, QEMU reports failing on the pc.rom memory
> region at 0xc0000. Thanks,

Ah, a ROM region that is all zero'd makes some sense why it has gone
unnoticed as a bug.

Jason
Jason Gunthorpe <jgg@nvidia.com> writes:

> On Thu, Jun 23, 2022 at 02:21:39PM -0600, Alex Williamson wrote:
>
>> check_and_migrate_movable_pages() perpetually returns zero. I believe
>> this is because folio_is_pinnable() previously returned true, and now
>> returns false.
>
> Indeed, it is a bug that check_and_migrate_movable_pages() returns
> 0 when it didn't do anything. It should return an error code.
>
> Hum.. Alistair, maybe you should look at this as well, I'm struggling
> a lot to understand how it is safe to drop the reference on the page
> but hold a pointer to it on the movable_page_list - sure it was
> isolated - but why does that mean it won't be concurrently unmapped
> and freed?

folio_isolate_lru() takes a reference on the page so you're safe from it
being freed. If it gets unmapped it will be freed when the matching
putback_movable_pages() is called.

> Anyhow, it looks like the problem is the tortured logic in this
> function, what do you think about this:

At a glance it seems reasonable, although I fear it might conflict with
my changes for device coherent migration. Agree the whole
check_and_migrate_movable_pages() logic is pretty tortured though, and I
don't think I'm making it better so would be happy to try cleaning it up
further once the device coherent changes are in.

> diff --git a/mm/gup.c b/mm/gup.c
> index 5512644076246d..2ffcb3f4ff4a7b 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1879,10 +1879,15 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>  					    unsigned int gup_flags)
>  {
>  	unsigned long isolation_error_count = 0, i;
> +	struct migration_target_control mtc = {
> +		.nid = NUMA_NO_NODE,
> +		.gfp_mask = GFP_USER | __GFP_NOWARN,
> +	};
>  	struct folio *prev_folio = NULL;
>  	LIST_HEAD(movable_page_list);
>  	bool drain_allow = true;
> -	int ret = 0;
> +	int not_migrated;
> +	int ret;
>
>  	for (i = 0; i < nr_pages; i++) {
>  		struct folio *folio = page_folio(pages[i]);
> @@ -1919,16 +1924,13 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>  					    folio_nr_pages(folio));
>  	}
>
> -	if (!list_empty(&movable_page_list) || isolation_error_count)
> -		goto unpin_pages;
> -
>  	/*
>  	 * If list is empty, and no isolation errors, means that all pages are
> -	 * in the correct zone.
> +	 * in the correct zone, nothing to do.
>  	 */
> -	return nr_pages;
> +	if (list_empty(&movable_page_list) && !isolation_error_count)
> +		return nr_pages;
>
> -unpin_pages:
>  	if (gup_flags & FOLL_PIN) {
>  		unpin_user_pages(pages, nr_pages);
>  	} else {
> @@ -1936,20 +1938,22 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>  			put_page(pages[i]);
>  	}
>
> -	if (!list_empty(&movable_page_list)) {
> -		struct migration_target_control mtc = {
> -			.nid = NUMA_NO_NODE,
> -			.gfp_mask = GFP_USER | __GFP_NOWARN,
> -		};
> +	if (isolation_error_count) {
> +		ret = -EINVAL;
> +		goto err_putback;
> +	}
>
> -		ret = migrate_pages(&movable_page_list, alloc_migration_target,
> -				    NULL, (unsigned long)&mtc, MIGRATE_SYNC,
> -				    MR_LONGTERM_PIN, NULL);
> -		if (ret > 0) /* number of pages not migrated */
> -			ret = -ENOMEM;
> +	not_migrated = migrate_pages(&movable_page_list, alloc_migration_target,
> +				     NULL, (unsigned long)&mtc, MIGRATE_SYNC,
> +				     MR_LONGTERM_PIN, NULL);
> +	if (not_migrated > 0) {
> +		ret = -ENOMEM;
> +		goto err_putback;
>  	}
> +	return 0;
>
> -	if (ret && !list_empty(&movable_page_list))
> +err_putback:
> +	if (!list_empty(&movable_page_list))
>  		putback_movable_pages(&movable_page_list);
>  	return ret;
>  }
>
>> If I generate an errno here, QEMU reports failing on the pc.rom memory
>> region at 0xc0000. Thanks,
>
> Ah, a ROM region that is all zero'd makes some sense why it has gone
> unnoticed as a bug.
>
> Jason
On Fri, Jun 24, 2022 at 10:11:01AM +1000, Alistair Popple wrote:

> > Hum.. Alistair, maybe you should look at this as well, I'm struggling
> > a lot to understand how it is safe to drop the reference on the page
> > but hold a pointer to it on the movable_page_list - sure it was
> > isolated - but why does that mean it won't be concurrently unmapped
> > and freed?
>
> folio_isolate_lru() takes a reference on the page so you're safe from it
> being freed. If it gets unmapped it will be freed when the matching
> putback_movable_pages() is called.

Hm, I guess I didn't dig deep enough into that call chain..

> > Anyhow, it looks like the problem is the tortured logic in this
> > function, what do you think about this:
>
> At a glance it seems reasonable, although I fear it might conflict with
> my changes for device coherent migration. Agree the whole
> check_and_migrate_movable_pages() logic is pretty tortured though, and I
> don't think I'm making it better so would be happy to try cleaning it up
> further once the device coherent changes are in.

OK, can I leave this patch with you then? I have no way to test it..

Thanks,
Jason
Jason Gunthorpe <jgg@nvidia.com> writes:

> On Fri, Jun 24, 2022 at 10:11:01AM +1000, Alistair Popple wrote:
>
>> > Hum.. Alistair, maybe you should look at this as well, I'm struggling
>> > a lot to understand how it is safe to drop the reference on the page
>> > but hold a pointer to it on the movable_page_list - sure it was
>> > isolated - but why does that mean it won't be concurrently unmapped
>> > and freed?
>>
>> folio_isolate_lru() takes a reference on the page so you're safe from it
>> being freed. If it gets unmapped it will be freed when the matching
>> putback_movable_pages() is called.
>
> Hm, I guess I didn't dig deep enough into that call chain..
>
>> > Anyhow, it looks like the problem is the tortured logic in this
>> > function, what do you think about this:
>>
>> At a glance it seems reasonable, although I fear it might conflict with
>> my changes for device coherent migration. Agree the whole
>> check_and_migrate_movable_pages() logic is pretty tortured though, and I
>> don't think I'm making it better so would be happy to try cleaning it up
>> further once the device coherent changes are in.
>
> OK, can I leave this patch with you then? I have no way to test it..

Yep, no worries.

> Thanks,
> Jason
Looks like the original patch might need rebasing. I am about to post a
clean-up for the tortured logic in check_and_migrate_movable_pages() so
I can incorporate it there, but I'm wondering what the consensus was for
pinning of the zero pfn?

Currently my clean-up will result in PUP returning an error for the zero
pfn rather than looping indefinitely in the kernel. However it wasn't
clear from this thread if returning an error is ok, or if R/O pinning
of the zero pfn should succeed?

 - Alistair

Alistair Popple <apopple@nvidia.com> writes:

> Jason Gunthorpe <jgg@nvidia.com> writes:
>
>> On Fri, Jun 24, 2022 at 10:11:01AM +1000, Alistair Popple wrote:
>>
>>> > Hum.. Alistair, maybe you should look at this as well, I'm struggling
>>> > a lot to understand how it is safe to drop the reference on the page
>>> > but hold a pointer to it on the movable_page_list - sure it was
>>> > isolated - but why does that mean it won't be concurrently unmapped
>>> > and freed?
>>>
>>> folio_isolate_lru() takes a reference on the page so you're safe from it
>>> being freed. If it gets unmapped it will be freed when the matching
>>> putback_movable_pages() is called.
>>
>> Hm, I guess I didn't dig deep enough into that call chain..
>>
>>> > Anyhow, it looks like the problem is the tortured logic in this
>>> > function, what do you think about this:
>>>
>>> At a glance it seems reasonable, although I fear it might conflict with
>>> my changes for device coherent migration. Agree the whole
>>> check_and_migrate_movable_pages() logic is pretty tortured though, and I
>>> don't think I'm making it better so would be happy to try cleaning it up
>>> further once the device coherent changes are in.
>>
>> OK, can I leave this patch with you then? I have no way to test it..
>
> Yep, no worries.
>
>> Thanks,
>> Jason
On 28.07.22 10:45, Alistair Popple wrote:
>
> Looks like the original patch might need rebasing. I am about to post a
> clean-up for the tortured logic in check_and_migrate_movable_pages() so
> I can incorporate it there, but I'm wondering what the consensus was for
> pinning of the zero pfn?

We have to keep it working right now, but in most cases (inside
MAP_PRIVATE regions) it's shaky and undesired.

>
> Currently my clean-up will result in PUP returning an error for the zero
> pfn rather than looping indefinitely in the kernel. However it wasn't
> clear from this thread if returning an error is ok, or if R/O pinning
> of the zero pfn should succeed?

I'm working on proper COW breaking in MAP_PRIVATE mappings, which will,
for example, unshare the shared zeropage and properly replace it by
exclusive anon pages first in the FOLL_LONGTERM case.
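[Editor's aside, an assumption and not something proposed in the thread: until such kernel-side unsharing exists, a userspace that wants to keep long-term pins off the zeropage can get a similar effect itself by write-populating the region before registering it, for example with MADV_POPULATE_WRITE on Linux 5.14 or newer. A minimal sketch:]

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23	/* kernel UAPI value; newer glibc headers define it */
#endif

int main(void)
{
	size_t len = 2 * 1024 * 1024;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	/* write-fault the whole range so no PTE still points at the zeropage */
	if (madvise(p, len, MADV_POPULATE_WRITE))
		perror("madvise");

	munmap(p, len);
	return 0;
}
```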
David Hildenbrand <david@redhat.com> writes:

> On 28.07.22 10:45, Alistair Popple wrote:
>>
>> Looks like the original patch might need rebasing. I am about to post a
>> clean-up for the tortured logic in check_and_migrate_movable_pages() so
>> I can incorporate it there, but I'm wondering what the consensus was for
>> pinning of the zero pfn?
>
> We have to keep it working right now, but in most cases (inside
> MAP_PRIVATE regions) it's shaky and undesired.

Ok. Well I've looked at this now so may as well stick

Reviewed-by: Alistair Popple <apopple@nvidia.com>

on it. However I think it needs rebasing, should I send an updated
version?

>>
>> Currently my clean-up will result in PUP returning an error for the zero
>> pfn rather than looping indefinitely in the kernel. However it wasn't
>> clear from this thread if returning an error is ok, or if R/O pinning
>> of the zero pfn should succeed?
>
> I'm working on proper COW breaking in MAP_PRIVATE mappings, which will,
> for example, unshare the shared zeropage and properly replace it by
> exclusive anon pages first in the FOLL_LONGTERM case.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bc8f326be0ce..781fae17177d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1600,7 +1600,7 @@ static inline bool is_pinnable_page(struct page *page)
 	if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE)
 		return false;
 #endif
-	return !(is_zone_movable_page(page) || is_zero_pfn(page_to_pfn(page)));
+	return !is_zone_movable_page(page) || is_zero_pfn(page_to_pfn(page));
 }
 #else
 static inline bool is_pinnable_page(struct page *page)
The commit referenced below subtly and inadvertently changed the logic
to disallow pinning of zero pfns. This breaks device assignment with
vfio and potentially various other users of gup. Exclude the zero page
test from the negation.

Fixes: 1c563432588d ("mm: fix is_pinnable_page against a cma page")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

At least I assume this was inadvertent... If there's a better fix,
please run with it as I'm out of the office the 1st half of next
week and would like to see this fixed ASAP. Thanks!

 include/linux/mm.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)