Message ID: 20201217185243.3288048-9-pasha.tatashin@soleen.com
State: New
Series: prohibit pinning pages in ZONE_MOVABLE
On Thu, Dec 17, 2020 at 01:52:41PM -0500, Pavel Tatashin wrote:
> +/*
> + * Verify that there are no unpinnable (movable) pages, if so return true.
> + * Otherwise an unpinnable pages is found return false, and unpin all pages.
> + */
> +static bool check_and_unpin_pages(unsigned long nr_pages, struct page **pages,
> +                                  unsigned int gup_flags)
> +{
> +        unsigned long i, step;
> +
> +        for (i = 0; i < nr_pages; i += step) {
> +                struct page *head = compound_head(pages[i]);
> +
> +                step = compound_nr(head) - (pages[i] - head);

You can't assume that all of a compound head is in the pages array,
this assumption would only work inside the page walkers if the page
was found in a PMD or something.

> +        if (gup_flags & FOLL_PIN) {
> +                unpin_user_pages(pages, nr_pages);

So we throw everything away? Why? That isn't how the old algorithm worked.

> @@ -1654,22 +1664,55 @@ static long __gup_longterm_locked(struct mm_struct *mm,
>                                   struct vm_area_struct **vmas,
>                                   unsigned int gup_flags)
>  {
> -        unsigned long flags = 0;
> +        int migrate_retry = 0;
> +        int isolate_retry = 0;
> +        unsigned int flags;
>          long rc;
>
> -        if (gup_flags & FOLL_LONGTERM)
> -                flags = memalloc_pin_save();
> +        if (!(gup_flags & FOLL_LONGTERM))
> +                return __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
> +                                               NULL, gup_flags);
>
> -        rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas, NULL,
> -                                     gup_flags);
> +        /*
> +         * Without FOLL_WRITE fault handler may return zero page, which can
> +         * be in a movable zone, and also will fail to isolate during migration,
> +         * thus the longterm pin will fail.
> +         */
> +        gup_flags &= FOLL_WRITE;

Is &= what you mean here? |= right?

Seems like we've ended up in a weird place if FOLL_LONGTERM always
includes FOLL_WRITE. Putting the zero page in ZONE_MOVABLE seems like
a bad idea, no?

> +        /*
> +         * Migration may fail, we retry before giving up. Also, because after
> +         * migration pages[] becomes outdated, we unpin and repin all pages
> +         * in the range, so pages array is repopulated with new values.
> +         * Also, because of this we cannot retry migration failures in a loop
> +         * without pinning/unpinnig pages.
> +         */

The old algorithm made continuous forward progress and only went back
to the first migration point.

> +        for (; ; ) {

while (true)?

> +                rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
> +                                             NULL, gup_flags);
>
> +                /* Return if error or if all pages are pinnable */
> +                if (rc <= 0 || check_and_unpin_pages(rc, pages, gup_flags))
> +                        break;

So we sweep the pages list twice now?

> +                /* Some pages are not pinnable, migrate them */
> +                rc = migrate_movable_pages(rc, pages);
> +
> +                /*
> +                 * If there is an error, and we tried maximum number of times
> +                 * bail out. Notice: we return an error code, and all pages are
> +                 * unpinned
> +                 */
> +                if (rc < 0 && migrate_retry++ >= PINNABLE_MIGRATE_MAX) {
> +                        break;
> +                } else if (rc > 0 && isolate_retry++ >= PINNABLE_ISOLATE_MAX) {
> +                        rc = -EBUSY;

I don't like this at all. It shouldn't be so flakey.

Can you do migration without the LRU?

Jason
Hi Jason,

Thank you for your comments. My replies below.

On Thu, Dec 17, 2020 at 3:50 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Dec 17, 2020 at 01:52:41PM -0500, Pavel Tatashin wrote:
> > +/*
> > + * Verify that there are no unpinnable (movable) pages, if so return true.
> > + * Otherwise an unpinnable pages is found return false, and unpin all pages.
> > + */
> > +static bool check_and_unpin_pages(unsigned long nr_pages, struct page **pages,
> > +                                  unsigned int gup_flags)
> > +{
> > +        unsigned long i, step;
> > +
> > +        for (i = 0; i < nr_pages; i += step) {
> > +                struct page *head = compound_head(pages[i]);
> > +
> > +                step = compound_nr(head) - (pages[i] - head);
>
> You can't assume that all of a compound head is in the pages array,
> this assumption would only work inside the page walkers if the page
> was found in a PMD or something.

I am not sure I understand your comment. The compound head is not
taken from the pages array, and not assumed to be in it. It is exactly
the same logic as we currently have:
https://soleen.com/source/xref/linux/mm/gup.c?r=a00cda3f#1565

> > +        if (gup_flags & FOLL_PIN) {
> > +                unpin_user_pages(pages, nr_pages);
>
> So we throw everything away? Why? That isn't how the old algorithm worked.

It is exactly how the old algorithm worked: if there are pages to be
migrated (not pinnable pages) we unpinned everything. See here:
https://soleen.com/source/xref/linux/mm/gup.c?r=a00cda3f#1603

If cma_pages_list is not empty, unpin everything. The list is not
empty if we isolated some pages, and we isolated some pages if there
are some pages which are not pinnable.

Now we do exactly the same thing, but
cleaner, and handle errors. We must unpin everything because if we
fail, no pages should stay pinned, and also if we migrated some pages,
the pages array must be updated, so we need to call
__get_user_pages_locked() to pin and repopulate the pages array.

> > @@ -1654,22 +1664,55 @@ static long __gup_longterm_locked(struct mm_struct *mm,
> >                                   struct vm_area_struct **vmas,
> >                                   unsigned int gup_flags)
> >  {
> > -        unsigned long flags = 0;
> > +        int migrate_retry = 0;
> > +        int isolate_retry = 0;
> > +        unsigned int flags;
> >          long rc;
> >
> > -        if (gup_flags & FOLL_LONGTERM)
> > -                flags = memalloc_pin_save();
> > +        if (!(gup_flags & FOLL_LONGTERM))
> > +                return __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
> > +                                               NULL, gup_flags);
> >
> > -        rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas, NULL,
> > -                                     gup_flags);
> > +        /*
> > +         * Without FOLL_WRITE fault handler may return zero page, which can
> > +         * be in a movable zone, and also will fail to isolate during migration,
> > +         * thus the longterm pin will fail.
> > +         */
> > +        gup_flags &= FOLL_WRITE;
>
> Is &= what you mean here? |= right?

Right, I meant |=.

> Seems like we've ended up in a weird place if FOLL_LONGTERM always
> includes FOLL_WRITE. Putting the zero page in ZONE_MOVABLE seems like
> a bad idea, no?

I am not sure; I just found this problem during testing, and this is
the solution I am proposing. I am worried about limiting the zero page
to a non-movable zone, but let's see what others think about this.

> > +        /*
> > +         * Migration may fail, we retry before giving up. Also, because after
> > +         * migration pages[] becomes outdated, we unpin and repin all pages
> > +         * in the range, so pages array is repopulated with new values.
> > +         * Also, because of this we cannot retry migration failures in a loop
> > +         * without pinning/unpinnig pages.
> > +         */
>
> The old algorithm made continuous forward progress and only went back
> to the first migration point.

That is not right, the old code went back to the beginning. Making
continuous progress is possible, but we won't see any performance
benefit from it, because migration failures are already exceptional
scenarios where the machine is under memory stress. The truth is, if
we fail to migrate, it is unlikely we will succeed if we retry right
away, so giving some time between retries may even be beneficial.
Also, with continuous progress we would need to take care of some
corner cases where we have to unpin already-succeeded pages when
forward progress is not possible, and also adjust the pages array,
start address, etc.

> > +        for (; ; ) {
>
> while (true)?

Hm, the same thing? :)

> > +                rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
> > +                                             NULL, gup_flags);
> >
> > +                /* Return if error or if all pages are pinnable */
> > +                if (rc <= 0 || check_and_unpin_pages(rc, pages, gup_flags))
> > +                        break;
>
> So we sweep the pages list twice now?

Yes, but O(N) is the same. No new operation is added. Before we had
something like this:

        while (npages)
                check if pinnable
                isolate
        while (npages)
                unpin
        migrate
        while (npages)
                pin

Now:

        while (npages)
                check if pinnable
        while (npages)
                unpin
        while (npages)
                isolate
        migrate
        pin

> > +                /* Some pages are not pinnable, migrate them */
> > +                rc = migrate_movable_pages(rc, pages);
> > +
> > +                /*
> > +                 * If there is an error, and we tried maximum number of times
> > +                 * bail out. Notice: we return an error code, and all pages are
> > +                 * unpinned
> > +                 */
> > +                if (rc < 0 && migrate_retry++ >= PINNABLE_MIGRATE_MAX) {
> > +                        break;
> > +                } else if (rc > 0 && isolate_retry++ >= PINNABLE_ISOLATE_MAX) {
> > +                        rc = -EBUSY;
>
> I don't like this at all. It shouldn't be so flakey.
>
> Can you do migration without the LRU?

I do not think it is possible, we must isolate pages before migration.

Pasha
On Thu 17-12-20 13:52:41, Pavel Tatashin wrote:
[...]
> +#define PINNABLE_MIGRATE_MAX 10
> +#define PINNABLE_ISOLATE_MAX 100

Why would we need to limit the isolation retries? Those should always
be temporary failures unless I am missing something.

I am not sure about the
PINNABLE_MIGRATE_MAX either. Why do we want to limit that? migrate_pages
already implements its retry logic, why do you want to count retries on
top of that? I do agree that the existing logic is suboptimal because
the migration failure might be ephemeral or permanent, but that should be
IMHO addressed at migrate_pages (resp. unmap_and_move) and simply report
failures that are permanent - e.g. any potential pre-existing long term
pin - if that is possible at all. If not, what would cause a permanent
migration failure? OOM?
On Fri, Dec 18, 2020 at 5:46 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 17-12-20 13:52:41, Pavel Tatashin wrote:
> [...]
> > +#define PINNABLE_MIGRATE_MAX 10
> > +#define PINNABLE_ISOLATE_MAX 100
>
> Why would we need to limit the isolation retries? Those should always
> be temporary failures unless I am missing something.

Actually, during development, I was retrying isolate errors
infinitely, but during testing found a hang where, when FOLL_TOUCH
without FOLL_WRITE is passed (fault in kernel without the write flag),
the zero page is faulted. The isolation of the zero page was failing
every time, therefore the process was hanging.

Since then, I fixed this problem by adding FOLL_WRITE unconditionally
to FOLL_LONGTERM, but I was worried about other possible bugs that
would cause hangs, so decided to limit isolation errors. If you think
it is not necessary, I can unlimit isolate retries.

> I am not sure about the
> PINNABLE_MIGRATE_MAX either. Why do we want to limit that? migrate_pages
> already implements its retry logic, why do you want to count retries on
> top of that? I do agree that the existing logic is suboptimal because

True, but again, just recently I worked on a race bug where pages can
end up on a per-cpu list after lru_add_drain_all() but before
isolation, so I think a retry is necessary.

> the migration failure might be ephemeral or permanent, but that should be
> IMHO addressed at migrate_pages (resp. unmap_and_move) and simply report
> failures that are permanent - e.g. any potential pre-existing long term
> pin - if that is possible at all. If not, what would cause a permanent
> migration failure? OOM?

Yes, OOM is the main cause for migration failures. And also a few
cases described in the movable zone comment, where it is possible
during boot some pages can be allocated by memblock in the movable
zone due to lack of memory resources (even if those resources were
added later); hardware page poisoning is another rare example.

> --
> Michal Hocko
> SUSE Labs
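The fix Pavel describes amounts to forcing a write fault for longterm
pins so the zero page is never inserted in the first place. A minimal
sketch of that idea, written with the |= that Jason pointed out
earlier (the posted diff has "gup_flags &= FOLL_WRITE;", a typo):

/*
 * Sketch of the fix described above: force a write fault for longterm
 * pins so the fault handler instantiates a real (pinnable) page
 * instead of mapping the shared zero page, which cannot be isolated.
 */
if (gup_flags & FOLL_LONGTERM)
        gup_flags |= FOLL_WRITE;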
> On 18.12.2020 at 13:43, Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
>
> On Fri, Dec 18, 2020 at 5:46 AM Michal Hocko <mhocko@suse.com> wrote:
>>
>> On Thu 17-12-20 13:52:41, Pavel Tatashin wrote:
>> [...]
>>> +#define PINNABLE_MIGRATE_MAX 10
>>> +#define PINNABLE_ISOLATE_MAX 100
>>
>> Why would we need to limit the isolation retries? Those should always
>> be temporary failures unless I am missing something.
>
> Actually, during development, I was retrying isolate errors
> infinitely, but during testing found a hang where, when FOLL_TOUCH
> without FOLL_WRITE is passed (fault in kernel without the write flag),
> the zero page is faulted. The isolation of the zero page was failing
> every time, therefore the process was hanging.
>
> Since then, I fixed this problem by adding FOLL_WRITE unconditionally
> to FOLL_LONGTERM, but I was worried about other possible bugs that
> would cause hangs, so decided to limit isolation errors. If you think
> it is not necessary, I can unlimit isolate retries.
>
>> I am not sure about the
>> PINNABLE_MIGRATE_MAX either. Why do we want to limit that? migrate_pages
>> already implements its retry logic, why do you want to count retries on
>> top of that? I do agree that the existing logic is suboptimal because
>
> True, but again, just recently I worked on a race bug where pages can
> end up on a per-cpu list after lru_add_drain_all() but before
> isolation, so I think a retry is necessary.
>
>> the migration failure might be ephemeral or permanent, but that should be
>> IMHO addressed at migrate_pages (resp. unmap_and_move) and simply report
>> failures that are permanent - e.g. any potential pre-existing long term
>> pin - if that is possible at all. If not, what would cause a permanent
>> migration failure? OOM?
>
> Yes, OOM is the main cause for migration failures. And also a few
> cases described in the movable zone comment, where it is possible
> during boot some pages can be allocated by memblock in the movable
> zone due to lack of memory resources (even if those resources were
> added later); hardware page poisoning is another rare example.

How is concurrent migration handled? Like memory offlining, compaction,
alloc_contig_range() while trying to pin?

>> --
>> Michal Hocko
>> SUSE Labs
On Fri 18-12-20 07:43:15, Pavel Tatashin wrote:
> On Fri, Dec 18, 2020 at 5:46 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 17-12-20 13:52:41, Pavel Tatashin wrote:
> > [...]
> > > +#define PINNABLE_MIGRATE_MAX 10
> > > +#define PINNABLE_ISOLATE_MAX 100
> >
> > Why would we need to limit the isolation retries? Those should always
> > be temporary failures unless I am missing something.
>
> Actually, during development, I was retrying isolate errors
> infinitely, but during testing found a hang where, when FOLL_TOUCH
> without FOLL_WRITE is passed (fault in kernel without the write flag),
> the zero page is faulted. The isolation of the zero page was failing
> every time, therefore the process was hanging.

Why would you migrate the zero page in the first place? Simply
instantiate it.

> Since then, I fixed this problem by adding FOLL_WRITE unconditionally
> to FOLL_LONGTERM, but I was worried about other possible bugs that
> would cause hangs, so decided to limit isolation errors. If you think
> it is not necessary, I can unlimit isolate retries.

It should have a really good reason to exist. Worries about some corner
cases are definitely not a reason to put in an awkward retry mechanism.
My historical experience is that these things are extremely hard to get
rid of later.

> > I am not sure about the
> > PINNABLE_MIGRATE_MAX either. Why do we want to limit that? migrate_pages
> > already implements its retry logic, why do you want to count retries on
> > top of that? I do agree that the existing logic is suboptimal because
>
> True, but again, just recently I worked on a race bug where pages can
> end up on a per-cpu list after lru_add_drain_all() but before
> isolation, so I think a retry is necessary.

There are ways to make sure pages are not ending up on the pcp list.
Have a look at how hotplug does that.

> > the migration failure might be ephemeral or permanent, but that should be
> > IMHO addressed at migrate_pages (resp. unmap_and_move) and simply report
> > failures that are permanent - e.g. any potential pre-existing long term
> > pin - if that is possible at all. If not, what would cause a permanent
> > migration failure? OOM?
>
> Yes, OOM is the main cause for migration failures.

Then you can treat ENOMEM as a permanent failure.

> And also a few
> cases described in the movable zone comment, where it is possible
> during boot some pages can be allocated by memblock in the movable
> zone due to lack of memory resources (even if those resources were
> added later);

Do you have any examples? I find it hard to follow that somebody would
be pinning early boot allocations.

> hardware page poisoning is another rare example.

Could you elaborate please?
On Thu, Dec 17, 2020 at 05:02:03PM -0500, Pavel Tatashin wrote:
> Hi Jason,
>
> Thank you for your comments. My replies below.
>
> On Thu, Dec 17, 2020 at 3:50 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Thu, Dec 17, 2020 at 01:52:41PM -0500, Pavel Tatashin wrote:
> > > +/*
> > > + * Verify that there are no unpinnable (movable) pages, if so return true.
> > > + * Otherwise an unpinnable pages is found return false, and unpin all pages.
> > > + */
> > > +static bool check_and_unpin_pages(unsigned long nr_pages, struct page **pages,
> > > +                                  unsigned int gup_flags)
> > > +{
> > > +        unsigned long i, step;
> > > +
> > > +        for (i = 0; i < nr_pages; i += step) {
> > > +                struct page *head = compound_head(pages[i]);
> > > +
> > > +                step = compound_nr(head) - (pages[i] - head);
> >
> > You can't assume that all of a compound head is in the pages array,
> > this assumption would only work inside the page walkers if the page
> > was found in a PMD or something.
>
> I am not sure I understand your comment. The compound head is not
> taken from the pages array, and not assumed to be in it. It is exactly
> the same logic as we currently have:
> https://soleen.com/source/xref/linux/mm/gup.c?r=a00cda3f#1565

Oh, that existing logic is wrong too :( Another bug.

You can't skip pages in the pages[] array under the assumption they
are contiguous. ie the i+=step is wrong.

> > > +        if (gup_flags & FOLL_PIN) {
> > > +                unpin_user_pages(pages, nr_pages);
> >
> > So we throw everything away? Why? That isn't how the old algorithm worked.
>
> It is exactly how the old algorithm worked: if there are pages to be
> migrated (not pinnable pages) we unpinned everything. See here:
> https://soleen.com/source/xref/linux/mm/gup.c?r=a00cda3f#1603

Hmm, OK, but I'm not sure that is great either.

> cleaner, and handle errors. We must unpin everything because if we
> fail, no pages should stay pinned, and also if we migrated some pages,
> the pages array must be updated, so we need to call
> __get_user_pages_locked() to pin and repopulate the pages array.

However the page can't be unpinned until it is put on the LRU (and I'm
hoping that the LRU is enough of a 'lock' to make that safe, no idea)

> > I don't like this at all. It shouldn't be so flakey.
> >
> > Can you do migration without the LRU?
>
> I do not think it is possible, we must isolate pages before migration.

I don't like this at all :( Lots of stuff relies on GUP, introducing a
random flakiness like this is not good.

Jason
On Fri, Dec 18, 2020 at 9:19 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Dec 17, 2020 at 05:02:03PM -0500, Pavel Tatashin wrote:
> > Hi Jason,
> >
> > Thank you for your comments. My replies below.
> >
> > On Thu, Dec 17, 2020 at 3:50 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Thu, Dec 17, 2020 at 01:52:41PM -0500, Pavel Tatashin wrote:
> > > > +/*
> > > > + * Verify that there are no unpinnable (movable) pages, if so return true.
> > > > + * Otherwise an unpinnable pages is found return false, and unpin all pages.
> > > > + */
> > > > +static bool check_and_unpin_pages(unsigned long nr_pages, struct page **pages,
> > > > +                                  unsigned int gup_flags)
> > > > +{
> > > > +        unsigned long i, step;
> > > > +
> > > > +        for (i = 0; i < nr_pages; i += step) {
> > > > +                struct page *head = compound_head(pages[i]);
> > > > +
> > > > +                step = compound_nr(head) - (pages[i] - head);
> > >
> > > You can't assume that all of a compound head is in the pages array,
> > > this assumption would only work inside the page walkers if the page
> > > was found in a PMD or something.
> >
> > I am not sure I understand your comment. The compound head is not
> > taken from the pages array, and not assumed to be in it. It is exactly
> > the same logic as we currently have:
> > https://soleen.com/source/xref/linux/mm/gup.c?r=a00cda3f#1565
>
> Oh, that existing logic is wrong too :( Another bug.

I do not think there is a bug.

> You can't skip pages in the pages[] array under the assumption they
> are contiguous. ie the i+=step is wrong.

If pages[i] is part of a compound page, the other parts of this page
must be sequential in this array for this compound page (it might
start in the middle, though). If they are not sequential then the
translation is broken, as these pages also correspond to virtual
addresses from [start, start + nr_pages) in __gup_longterm_locked.

For example, when __gup_longterm_locked() returns, the following must
be true:

PHYSICAL                                VIRTUAL
page_to_phys(pages[0])            ->    start + 0 * PAGE_SIZE
page_to_phys(pages[1])            ->    start + 1 * PAGE_SIZE
page_to_phys(pages[2])            ->    start + 2 * PAGE_SIZE
page_to_phys(pages[3])            ->    start + 3 * PAGE_SIZE
...
page_to_phys(pages[nr_pages - 1]) ->    start + (nr_pages - 1) * PAGE_SIZE

If any pages[i] is part of a compound page (i.e. huge page), we can't
have other pages in the middle of that page in the array.

> > > > +        if (gup_flags & FOLL_PIN) {
> > > > +                unpin_user_pages(pages, nr_pages);
> > >
> > > So we throw everything away? Why? That isn't how the old algorithm worked.
> >
> > It is exactly how the old algorithm worked: if there are pages to be
> > migrated (not pinnable pages) we unpinned everything. See here:
> > https://soleen.com/source/xref/linux/mm/gup.c?r=a00cda3f#1603
>
> Hmm, OK, but I'm not sure that is great either.

I will send out a new series. We can discuss it there if you have
suggestions for improvement here.

> > cleaner, and handle errors. We must unpin everything because if we
> > fail, no pages should stay pinned, and also if we migrated some pages,
> > the pages array must be updated, so we need to call
> > __get_user_pages_locked() to pin and repopulate the pages array.
>
> However the page can't be unpinned until it is put on the LRU (and I'm
> hoping that the LRU is enough of a 'lock' to make that safe, no idea)
>
> > > I don't like this at all. It shouldn't be so flakey.
> > >
> > > Can you do migration without the LRU?
> >
> > I do not think it is possible, we must isolate pages before migration.
>
> I don't like this at all :( Lots of stuff relies on GUP, introducing a
> random flakiness like this is not good.

This is actually standard migration procedure; elsewhere in the kernel
we migrate pages in exactly the same fashion: isolate and later
migrate. The isolation works for LRU pages only.

> Jason
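For reference, a minimal sketch of the isolate-then-migrate pattern
Pavel is referring to, built from the same mm/ helpers this patch
calls in the diff at the end of this thread; the surrounding setup is
an illustration, not code from the series:

/*
 * Sketch only: the standard two-step migration pattern. A page is
 * first taken off the LRU (isolation can fail transiently, e.g. when
 * the page still sits on a per-cpu pagevec), and only then handed to
 * migrate_pages(). The mtc values mirror the posted patch.
 */
LIST_HEAD(movable_page_list);
struct migration_target_control mtc = {
        .nid = NUMA_NO_NODE,
        .gfp_mask = GFP_USER | __GFP_NOWARN,
};

if (!isolate_lru_page(page))            /* step 1: isolate from the LRU */
        list_add_tail(&page->lru, &movable_page_list);

if (!list_empty(&movable_page_list))    /* step 2: migrate isolated pages */
        migrate_pages(&movable_page_list, alloc_migration_target, NULL,
                      (unsigned long)&mtc, MIGRATE_SYNC, MR_LONGTERM_PIN);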
On Fri, Dec 18, 2020 at 8:14 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 18-12-20 07:43:15, Pavel Tatashin wrote:
> > On Fri, Dec 18, 2020 at 5:46 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 17-12-20 13:52:41, Pavel Tatashin wrote:
> > > [...]
> > > > +#define PINNABLE_MIGRATE_MAX 10
> > > > +#define PINNABLE_ISOLATE_MAX 100
> > >
> > > Why would we need to limit the isolation retries? Those should always
> > > be temporary failures unless I am missing something.
> >
> > Actually, during development, I was retrying isolate errors
> > infinitely, but during testing found a hang where, when FOLL_TOUCH
> > without FOLL_WRITE is passed (fault in kernel without the write flag),
> > the zero page is faulted. The isolation of the zero page was failing
> > every time, therefore the process was hanging.
>
> Why would you migrate the zero page in the first place? Simply
> instantiate it.

This is exactly the idea behind FOLL_WRITE: it causes zero pages to be
created in the right zone right away, and no migration is necessary.

> > Since then, I fixed this problem by adding FOLL_WRITE unconditionally
> > to FOLL_LONGTERM, but I was worried about other possible bugs that
> > would cause hangs, so decided to limit isolation errors. If you think
> > it is not necessary, I can unlimit isolate retries.
>
> It should have a really good reason to exist. Worries about some corner
> cases are definitely not a reason to put in an awkward retry mechanism.
> My historical experience is that these things are extremely hard to get
> rid of later.
>
> > > I am not sure about the
> > > PINNABLE_MIGRATE_MAX either. Why do we want to limit that? migrate_pages
> > > already implements its retry logic, why do you want to count retries on
> > > top of that? I do agree that the existing logic is suboptimal because
> >
> > True, but again, just recently I worked on a race bug where pages can
> > end up on a per-cpu list after lru_add_drain_all() but before
> > isolation, so I think a retry is necessary.
>
> There are ways to make sure pages are not ending up on the pcp list.
> Have a look at how hotplug does that.

Sounds good to me, I will remove PINNABLE_MIGRATE_MAX, and leave only
PINNABLE_ISOLATE_MAX for transient isolation errors.

> > > the migration failure might be ephemeral or permanent, but that should be
> > > IMHO addressed at migrate_pages (resp. unmap_and_move) and simply report
> > > failures that are permanent - e.g. any potential pre-existing long term
> > > pin - if that is possible at all. If not, what would cause a permanent
> > > migration failure? OOM?
> >
> > Yes, OOM is the main cause for migration failures.
>
> Then you can treat ENOMEM as a permanent failure.

Sounds good.
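A hedged sketch of the policy converged on above. The names
migrate_movable_pages() and PINNABLE_ISOLATE_MAX come from this
series; the loop shape itself is an assumption, not the posted patch,
and the isolation cap is dropped later in the thread:

/*
 * Sketch, not the posted patch: rely on migrate_pages()'s internal
 * retries for transient migration errors, treat -ENOMEM as permanent,
 * and bound only the isolation retries. The real loop also re-pins
 * via __get_user_pages_locked() on every pass, elided here.
 */
long rc;
int isolate_retry = 0;

for (;;) {
        rc = migrate_movable_pages(nr_pages, pages);
        if (rc == 0)
                break;                  /* all movable pages migrated */
        if (rc == -ENOMEM)
                break;                  /* permanent failure: report it */
        if (rc < 0)
                continue;               /* other failures: retry migration */
        if (isolate_retry++ >= PINNABLE_ISOLATE_MAX) {
                rc = -EBUSY;            /* isolation kept failing */
                break;
        }
}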
On Wed, Jan 13, 2021 at 02:43:50PM -0500, Pavel Tatashin wrote:
> On Fri, Dec 18, 2020 at 9:19 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Thu, Dec 17, 2020 at 05:02:03PM -0500, Pavel Tatashin wrote:
> > > Hi Jason,
> > >
> > > Thank you for your comments. My replies below.
> > >
> > > On Thu, Dec 17, 2020 at 3:50 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > >
> > > > On Thu, Dec 17, 2020 at 01:52:41PM -0500, Pavel Tatashin wrote:
> > > > > +/*
> > > > > + * Verify that there are no unpinnable (movable) pages, if so return true.
> > > > > + * Otherwise an unpinnable pages is found return false, and unpin all pages.
> > > > > + */
> > > > > +static bool check_and_unpin_pages(unsigned long nr_pages, struct page **pages,
> > > > > +                                  unsigned int gup_flags)
> > > > > +{
> > > > > +        unsigned long i, step;
> > > > > +
> > > > > +        for (i = 0; i < nr_pages; i += step) {
> > > > > +                struct page *head = compound_head(pages[i]);
> > > > > +
> > > > > +                step = compound_nr(head) - (pages[i] - head);
> > > >
> > > > You can't assume that all of a compound head is in the pages array,
> > > > this assumption would only work inside the page walkers if the page
> > > > was found in a PMD or something.
> > >
> > > I am not sure I understand your comment. The compound head is not
> > > taken from the pages array, and not assumed to be in it. It is exactly
> > > the same logic as we currently have:
> > > https://soleen.com/source/xref/linux/mm/gup.c?r=a00cda3f#1565
> >
> > Oh, that existing logic is wrong too :( Another bug.
>
> I do not think there is a bug.
>
> > You can't skip pages in the pages[] array under the assumption they
> > are contiguous. ie the i+=step is wrong.
>
> If pages[i] is part of a compound page, the other parts of this page
> must be sequential in this array for this compound page

That is true only if the PMD points to the page. If the PTE points to
a tail page then there is no requirement that other PTEs are
contiguous with the compound page.

At this point we have no idea if the GUP logic got this compound page
as a head page in a PMD or as a tail page from a PTE, so we can't
assume a contiguous run of addresses.

Look at split_huge_pmd() - it doesn't break up the compound page, it
just converts the PMD to a PTE array and scatters the tail pages to
the PTEs.

I understand Matt is pushing on this idea more by having compound
pages in the page cache, but still mapping tail pages when required.

> This is actually standard migration procedure; elsewhere in the kernel
> we migrate pages in exactly the same fashion: isolate and later
> migrate. The isolation works for LRU pages only.

But do other places cause a userspace visible random failure when LRU
isolation fails?

I don't like it at all, what is the user supposed to do?

Jason
> > > Oh, that existing logic is wrong too :(
> >
> > I do not think there is a bug.
> >
> > > You can't skip pages in the pages[] array under the assumption they
> > > are contiguous. ie the i+=step is wrong.
> >
> > If pages[i] is part of a compound page, the other parts of this page
> > must be sequential in this array for this compound page
>
> That is true only if the PMD points to the page. If the PTE points to
> a tail page then there is no requirement that other PTEs are
> contiguous with the compound page.
>
> At this point we have no idea if the GUP logic got this compound page
> as a head page in a PMD or as a tail page from a PTE, so we can't
> assume a contiguous run of addresses.

I see. I will fix this bug upstream as a separate patch in my series,
and keep the fix when my fixes are applied.

> Look at split_huge_pmd() - it doesn't break up the compound page, it
> just converts the PMD to a PTE array and scatters the tail pages to
> the PTEs.

Got it. Unfortunately, the fix will deoptimize the code by having to
check every page to see whether it is part of a previous compound page
or not.

> I understand Matt is pushing on this idea more by having compound
> pages in the page cache, but still mapping tail pages when required.
>
> > This is actually standard migration procedure; elsewhere in the kernel
> > we migrate pages in exactly the same fashion: isolate and later
> > migrate. The isolation works for LRU pages only.
>
> But do other places cause a userspace visible random failure when LRU
> isolation fails?

Makes sense. I will remove the maximum retries for isolation and retry
indefinitely, the same as is done during memory hot-remove. So, we
will fail only when migration fails.

> I don't like it at all, what is the user supposed to do?
>
> Jason
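For reference, a sketch of the "deoptimized" per-page check Pavel
describes. is_pinnable_page() is the helper this series introduces;
the function name here is made up for illustration:

/*
 * Sketch only: inspect every entry of pages[] individually, with no
 * compound_nr() skipping, because a PTE-mapped compound page gives no
 * guarantee that neighbouring entries belong to the same compound page.
 */
static bool all_pages_pinnable(unsigned long nr_pages, struct page **pages)
{
        unsigned long i;

        for (i = 0; i < nr_pages; i++) {
                if (!is_pinnable_page(compound_head(pages[i])))
                        return false;
        }
        return true;
}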
On Wed, Jan 13, 2021 at 03:05:41PM -0500, Pavel Tatashin wrote:
> Makes sense. I will remove the maximum retries for isolation and retry
> indefinitely, the same as is done during memory hot-remove. So, we
> will fail only when migration fails.

It would be good to make this a stand-alone pre-fix as well.

Thanks,
Jason
On Wed, Jan 13, 2021 at 3:05 PM Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
>
> > > > Oh, that existing logic is wrong too :(
> > >
> > > I do not think there is a bug.
> > >
> > > > You can't skip pages in the pages[] array under the assumption they
> > > > are contiguous. ie the i+=step is wrong.
> > >
> > > If pages[i] is part of a compound page, the other parts of this page
> > > must be sequential in this array for this compound page
> >
> > That is true only if the PMD points to the page. If the PTE points to
> > a tail page then there is no requirement that other PTEs are
> > contiguous with the compound page.
> >
> > At this point we have no idea if the GUP logic got this compound page
> > as a head page in a PMD or as a tail page from a PTE, so we can't
> > assume a contiguous run of addresses.
>
> I see. I will fix this bug upstream as a separate patch in my series,
> and keep the fix when my fixes are applied.
>
> > Look at split_huge_pmd() - it doesn't break up the compound page, it
> > just converts the PMD to a PTE array and scatters the tail pages to
> > the PTEs.

Hi Jason,

I've been thinking about this some more. Again, I am not sure this is
a bug. I understand split_huge_pmd() may split the PMD-sized page into
PTEs and leave the compound page intact. However, in order for pages[]
to have non-sequential addresses within a compound page, those PTEs
must also be migrated after split_huge_pmd(); but when we migrate
them, we will either migrate the whole compound page or do
split_huge_page_to_list(), which will in turn do ClearPageCompound().
Please let me know if I am missing something.

Thank you,
Pasha

> Got it. Unfortunately, the fix will deoptimize the code by having to
> check every page to see whether it is part of a previous compound page
> or not.
>
> > I understand Matt is pushing on this idea more by having compound
> > pages in the page cache, but still mapping tail pages when required.
> >
> > > This is actually standard migration procedure; elsewhere in the kernel
> > > we migrate pages in exactly the same fashion: isolate and later
> > > migrate. The isolation works for LRU pages only.
> >
> > But do other places cause a userspace visible random failure when LRU
> > isolation fails?
>
> Makes sense. I will remove the maximum retries for isolation and retry
> indefinitely, the same as is done during memory hot-remove. So, we
> will fail only when migration fails.
>
> > I don't like it at all, what is the user supposed to do?
> >
> > Jason
On Fri, Jan 15, 2021 at 01:10:27PM -0500, Pavel Tatashin wrote:
> I've been thinking about this some more. Again, I am not sure this is
> a bug. I understand split_huge_pmd() may split the PMD-sized page into
> PTEs and leave the compound page intact. However, in order for pages[]
> to have non-sequential addresses within a compound page, those PTEs
> must also be migrated after split_huge_pmd(),

Why focus on migration? Anything could happen to those PTEs: they
could be COW'd, they could be mmap/munmap'd, etc.

Jason
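To make Jason's point concrete, here is a hedged userspace sketch of
one way pages[] can end up with a non-contiguous compound page:
mremap() of a single subpage out of a THP-backed range splits the PMD
into PTEs while leaving the compound page intact. Whether a huge page
is actually faulted depends on the system's THP settings and mapping
alignment, so this is illustrative only (error checks omitted):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define THP_SIZE (2UL << 20)    /* assumes a 2MiB PMD size, as on x86-64 */
#define PAGE_SZ  4096UL

int main(void)
{
        /* Map and fault in a THP-eligible anonymous region. */
        char *buf = mmap(NULL, THP_SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        madvise(buf, THP_SIZE, MADV_HUGEPAGE);
        memset(buf, 1, THP_SIZE);

        /* Reserve a destination, then move one 4KiB subpage there.
         * Moving part of the mapping forces split_huge_pmd(): the PMD
         * becomes a PTE array, but the page stays compound. */
        char *dst = mmap(NULL, PAGE_SZ, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        mremap(buf + PAGE_SZ, PAGE_SZ, PAGE_SZ,
               MREMAP_MAYMOVE | MREMAP_FIXED, dst);

        /* A later pin of [buf, buf + THP_SIZE) faults a fresh page into
         * the hole at index 1, so pages[0] is the compound head while
         * pages[1] is an unrelated page: skipping by compound_nr()
         * would walk right past it. */
        printf("subpage moved to %p\n", (void *)dst);
        return 0;
}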
diff --git a/mm/gup.c b/mm/gup.c
index 1ebb7cc2fbe4..70cc8b8f67c4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1550,27 +1550,57 @@ struct page *get_dump_page(unsigned long addr)
 }
 #endif /* CONFIG_ELF_CORE */
 
-static long check_and_migrate_movable_pages(struct mm_struct *mm,
-                                            unsigned long start,
-                                            unsigned long nr_pages,
-                                            struct page **pages,
-                                            struct vm_area_struct **vmas,
-                                            unsigned int gup_flags)
-{
-        unsigned long i;
-        unsigned long step;
-        bool drain_allow = true;
-        bool migrate_allow = true;
+/*
+ * Verify that there are no unpinnable (movable) pages, if so return true.
+ * Otherwise an unpinnable pages is found return false, and unpin all pages.
+ */
+static bool check_and_unpin_pages(unsigned long nr_pages, struct page **pages,
+                                  unsigned int gup_flags)
+{
+        unsigned long i, step;
+
+        for (i = 0; i < nr_pages; i += step) {
+                struct page *head = compound_head(pages[i]);
+
+                step = compound_nr(head) - (pages[i] - head);
+                if (!is_pinnable_page(head))
+                        break;
+        }
+
+        if (i >= nr_pages)
+                return true;
+
+        if (gup_flags & FOLL_PIN) {
+                unpin_user_pages(pages, nr_pages);
+        } else {
+                for (i = 0; i < nr_pages; i++)
+                        put_page(pages[i]);
+        }
+
+        return false;
+}
+
+#define PINNABLE_MIGRATE_MAX 10
+#define PINNABLE_ISOLATE_MAX 100
+
+/*
+ * Migrate pages that cannot be pinned. Return zero on success and error code
+ * on migration failure. If migration was successful but page isolation had
+ * failures return number of pages that failed to be isolated.
+ */
+static long migrate_movable_pages(unsigned long nr_pages, struct page **pages)
+{
+        unsigned long i, step;
         LIST_HEAD(movable_page_list);
-        long ret = nr_pages;
+        long ret = 0;
+        long error_count = 0;
         struct migration_target_control mtc = {
                 .nid = NUMA_NO_NODE,
                 .gfp_mask = GFP_USER | __GFP_NOWARN,
         };
 
-check_again:
-        for (i = 0; i < nr_pages;) {
-
+        lru_add_drain_all();
+        for (i = 0; i < nr_pages; i += step) {
                 struct page *head = compound_head(pages[i]);
 
                 /*
@@ -1583,62 +1613,42 @@ static long check_and_migrate_movable_pages(struct mm_struct *mm,
                  * these entries, try to move them out if possible.
                  */
                 if (!is_pinnable_page(head)) {
-                        if (PageHuge(head))
-                                isolate_huge_page(head, &movable_page_list);
-                        else {
-                                if (!PageLRU(head) && drain_allow) {
-                                        lru_add_drain_all();
-                                        drain_allow = false;
-                                }
-
+                        if (PageHuge(head)) {
+                                if (!isolate_huge_page(head, &movable_page_list))
+                                        error_count += step;
+                        } else {
                                 if (!isolate_lru_page(head)) {
                                         list_add_tail(&head->lru,
                                                       &movable_page_list);
                                         mod_node_page_state(page_pgdat(head),
                                                             NR_ISOLATED_ANON +
                                                             page_is_file_lru(head),
                                                             thp_nr_pages(head));
+                                } else {
+                                        error_count += step;
                                 }
                         }
                 }
-
-                i += step;
         }
 
         if (!list_empty(&movable_page_list)) {
-                /*
-                 * drop the above get_user_pages reference.
-                 */
-                if (gup_flags & FOLL_PIN)
-                        unpin_user_pages(pages, nr_pages);
-                else
-                        for (i = 0; i < nr_pages; i++)
-                                put_page(pages[i]);
+                ret = migrate_pages(&movable_page_list, alloc_migration_target,
+                                    NULL, (unsigned long)&mtc, MIGRATE_SYNC,
+                                    MR_LONGTERM_PIN);
+                /* Assume -EBUSY failure if some pages were not migrated */
+                if (ret > 0)
+                        ret = -EBUSY;
+        }
 
-                if (migrate_pages(&movable_page_list, alloc_migration_target, NULL,
-                        (unsigned long)&mtc, MIGRATE_SYNC, MR_LONGTERM_PIN)) {
-                        /*
-                         * some of the pages failed migration. Do get_user_pages
-                         * without migration.
-                         */
-                        migrate_allow = false;
+        if (ret && !list_empty(&movable_page_list))
+                putback_movable_pages(&movable_page_list);
 
-                        if (!list_empty(&movable_page_list))
-                                putback_movable_pages(&movable_page_list);
-                }
-                /*
-                 * We did migrate all the pages, Try to get the page references
-                 * again migrating any pages which we failed to isolate earlier.
-                 */
-                ret = __get_user_pages_locked(mm, start, nr_pages,
-                                              pages, vmas, NULL,
-                                              gup_flags);
-
-                if ((ret > 0) && migrate_allow) {
-                        nr_pages = ret;
-                        drain_allow = true;
-                        goto check_again;
-                }
-        }
+        /*
+         * Check if there were isolation errors, if so they should not be
+         * counted toward PINNABLE_MIGRATE_MAX, so separate them, by
+         * returning number of pages failed to isolate.
+         */
+        if (!ret && error_count)
+                ret = error_count;
 
         return ret;
 }
@@ -1654,22 +1664,55 @@ static long __gup_longterm_locked(struct mm_struct *mm,
                                   struct vm_area_struct **vmas,
                                   unsigned int gup_flags)
 {
-        unsigned long flags = 0;
+        int migrate_retry = 0;
+        int isolate_retry = 0;
+        unsigned int flags;
         long rc;
 
-        if (gup_flags & FOLL_LONGTERM)
-                flags = memalloc_pin_save();
+        if (!(gup_flags & FOLL_LONGTERM))
+                return __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
+                                               NULL, gup_flags);
 
-        rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas, NULL,
-                                     gup_flags);
+        /*
+         * Without FOLL_WRITE fault handler may return zero page, which can
+         * be in a movable zone, and also will fail to isolate during migration,
+         * thus the longterm pin will fail.
+         */
+        gup_flags &= FOLL_WRITE;
 
-        if (gup_flags & FOLL_LONGTERM) {
-                if (rc > 0)
-                        rc = check_and_migrate_movable_pages(mm, start, rc,
-                                                             pages, vmas,
-                                                             gup_flags);
-                memalloc_pin_restore(flags);
+        flags = memalloc_pin_save();
+        /*
+         * Migration may fail, we retry before giving up. Also, because after
+         * migration pages[] becomes outdated, we unpin and repin all pages
+         * in the range, so pages array is repopulated with new values.
+         * Also, because of this we cannot retry migration failures in a loop
+         * without pinning/unpinnig pages.
+         */
+        for (; ; ) {
+                rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
+                                             NULL, gup_flags);
+
+                /* Return if error or if all pages are pinnable */
+                if (rc <= 0 || check_and_unpin_pages(rc, pages, gup_flags))
+                        break;
+
+                /* Some pages are not pinnable, migrate them */
+                rc = migrate_movable_pages(rc, pages);
+
+                /*
+                 * If there is an error, and we tried maximum number of times
+                 * bail out. Notice: we return an error code, and all pages are
+                 * unpinned
+                 */
+                if (rc < 0 && migrate_retry++ >= PINNABLE_MIGRATE_MAX) {
+                        break;
+                } else if (rc > 0 && isolate_retry++ >= PINNABLE_ISOLATE_MAX) {
+                        rc = -EBUSY;
+                        break;
+                }
         }
+        memalloc_pin_restore(flags);
+
         return rc;
 }
check_and_migrate_movable_pages() does not honor isolation errors, and
also retries migration failures indefinitely.

Fix both of the above issues: add a new function,
check_and_unpin_pages(), that checks and unpins a range of pages. Move
the retry loop from check_and_migrate_movable_pages() to
__gup_longterm_locked(). Rename check_and_migrate_movable_pages() to
migrate_movable_pages() and make this function accept already unpinned
pages. Also, track the errors during isolation, so they can be retried
with a different maximum limit; the isolation errors should be
ephemeral.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
 mm/gup.c | 179 ++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 111 insertions(+), 68 deletions(-)