Message ID | 20210125194751.1275316-3-pasha.tatashin@soleen.com (mailing list archive) |
---|---|
State | New |
Series | prohibit pinning pages in ZONE_MOVABLE |
On 1/25/21 7:47 PM, Pavel Tatashin wrote:
> When pages are isolated in check_and_migrate_movable_pages() we skip
> compound number of pages at a time. However, as Jason noted, it is
> not necessarily correct that pages[i] corresponds to the pages that
> we skipped. This is because it is possible that the addresses in
> this range had split_huge_pmd()/split_huge_pud(), and these functions
> do not update the compound page metadata.
>
> The problem can be reproduced if something like this occurs:
>
> 1. User faulted huge pages.
> 2. split_huge_pmd() was called for some reason
> 3. User has unmapped some sub-pages in the range
> 4. User tries to longterm pin the addresses.
>
> The resulting pages[i] might end up having pages which are not compound
> size page aligned.
>
> Fixes: aa712399c1e8 ("mm/gup: speed up check_and_migrate_cma_pages() on huge page")
> Reported-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> ---

[...]

>                 /*
>                  * If we get a page from the CMA zone, since we are going to
>                  * be pinning these entries, we might as well move them out
> @@ -1599,8 +1596,6 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
>                                 }
>                         }
>                 }
> -
> -               i += step;
>         }

With this, longterm gup will 'regress' for hugetlbfs e.g. from ~6k -> ~32k usecs when
pinning a 16G hugetlb file.

Splitting can only occur on THP right? If so, perhaps we could retain the @step increment
for compound pages but when !is_transparent_hugepage(head) or just PageHuge(head) like:

+       if (!is_transparent_hugepage(head) && PageCompound(page))
+               i += (compound_nr(head) - (pages[i] - head));

Or making specific to hugetlbfs:

+       if (PageHuge(head))
+               i += (compound_nr(head) - (pages[i] - head));
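As context for the discussion above, a minimal annotated sketch of the skip logic being removed, reconstructed from the full diff at the end of this thread; the comments mark the assumption that split_huge_pmd()/split_huge_pud() invalidates.

        /* Sketch of the pre-fix skip logic in check_and_migrate_cma_pages(). */
        for (i = 0; i < nr_pages;) {
                struct page *head = compound_head(pages[i]);

                /*
                 * Assumes the next (compound_nr(head) - (pages[i] - head))
                 * entries of pages[] are the remaining tail pages of this
                 * compound page.  After split_huge_pmd()/split_huge_pud()
                 * with some sub-pages unmapped, pages[] may hold an
                 * unrelated mix of pages here, so this step can jump over
                 * entries that were never checked for CMA.
                 */
                unsigned long step = compound_nr(head) - (pages[i] - head);

                /* ... isolate head onto cma_page_list if it is a CMA page ... */

                i += step;
        }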
On Wed, Feb 3, 2021 at 8:23 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 1/25/21 7:47 PM, Pavel Tatashin wrote:
> > When pages are isolated in check_and_migrate_movable_pages() we skip
> > compound number of pages at a time. However, as Jason noted, it is
> > not necessarily correct that pages[i] corresponds to the pages that
> > we skipped. This is because it is possible that the addresses in
> > this range had split_huge_pmd()/split_huge_pud(), and these functions
> > do not update the compound page metadata.
> >
> > The problem can be reproduced if something like this occurs:
> >
> > 1. User faulted huge pages.
> > 2. split_huge_pmd() was called for some reason
> > 3. User has unmapped some sub-pages in the range
> > 4. User tries to longterm pin the addresses.
> >
> > The resulting pages[i] might end up having pages which are not compound
> > size page aligned.
> >
> > Fixes: aa712399c1e8 ("mm/gup: speed up check_and_migrate_cma_pages() on huge page")
> > Reported-by: Jason Gunthorpe <jgg@nvidia.com>
> > Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > ---
>
> [...]
>
> >                 /*
> >                  * If we get a page from the CMA zone, since we are going to
> >                  * be pinning these entries, we might as well move them out
> > @@ -1599,8 +1596,6 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> >                                 }
> >                         }
> >                 }
> > -
> > -               i += step;
> >         }
>

Hi Joao,

> With this, longterm gup will 'regress' for hugetlbfs e.g. from ~6k -> ~32k usecs when
> pinning a 16G hugetlb file.

An estimate, or did you actually measure it?

> Splitting can only occur on THP right? If so, perhaps we could retain the @step increment

Yes, I do not think we can split HugePage, only THP.

> for compound pages but when !is_transparent_hugepage(head) or just PageHuge(head) like:
>
> +       if (!is_transparent_hugepage(head) && PageCompound(page))
> +               i += (compound_nr(head) - (pages[i] - head));
>
> Or making specific to hugetlbfs:
>
> +       if (PageHuge(head))
> +               i += (compound_nr(head) - (pages[i] - head));

Yes, this is a reasonable optimization. I will submit a follow-up patch
against linux-next.

Thank you,
Pasha
On Wed, Feb 03, 2021 at 01:22:18PM +0000, Joao Martins wrote:

> With this, longterm gup will 'regress' for hugetlbfs e.g. from ~6k -> ~32k usecs when
> pinning a 16G hugetlb file.

Yes, but correctness demands it.

The solution is to track these pages as we discover them so we know when
a PMD/PUD points at them and can directly skip the duplicated work.

> Splitting can only occur on THP right? If so, perhaps we could
> retain the @step increment for compound pages but when
> !is_transparent_hugepage(head) or just PageHuge(head) like:

Honestly I'd rather see it fixed properly which will give even bigger
performance gains - avoiding the entire rescan of the page list will
be a win

Jason
On 2/3/21 2:51 PM, Pavel Tatashin wrote:
> On Wed, Feb 3, 2021 at 8:23 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> On 1/25/21 7:47 PM, Pavel Tatashin wrote:
>>> When pages are isolated in check_and_migrate_movable_pages() we skip
>>> compound number of pages at a time. However, as Jason noted, it is
>>> not necessarily correct that pages[i] corresponds to the pages that
>>> we skipped. This is because it is possible that the addresses in
>>> this range had split_huge_pmd()/split_huge_pud(), and these functions
>>> do not update the compound page metadata.
>>>
>>> The problem can be reproduced if something like this occurs:
>>>
>>> 1. User faulted huge pages.
>>> 2. split_huge_pmd() was called for some reason
>>> 3. User has unmapped some sub-pages in the range
>>> 4. User tries to longterm pin the addresses.
>>>
>>> The resulting pages[i] might end up having pages which are not compound
>>> size page aligned.
>>>
>>> Fixes: aa712399c1e8 ("mm/gup: speed up check_and_migrate_cma_pages() on huge page")
>>> Reported-by: Jason Gunthorpe <jgg@nvidia.com>
>>> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
>>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>>> ---
>>
>> [...]
>>
>>>                 /*
>>>                  * If we get a page from the CMA zone, since we are going to
>>>                  * be pinning these entries, we might as well move them out
>>> @@ -1599,8 +1596,6 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
>>>                                 }
>>>                         }
>>>                 }
>>> -
>>> -               i += step;
>>>         }
>>
>
> Hi Joao,
>
>> With this, longterm gup will 'regress' for hugetlbfs e.g. from ~6k -> ~32k usecs when
>> pinning a 16G hugetlb file.
>
> An estimate, or did you actually measure it?
>
It's what I had measured before sending. The ~ is because there's error
variance.

>
>> Splitting can only occur on THP right? If so, perhaps we could retain the @step increment
>
> Yes, I do not think we can split HugePage, only THP.
>
Right, that's my impression too.

>> for compound pages but when !is_transparent_hugepage(head) or just PageHuge(head) like:
>>
>> +       if (!is_transparent_hugepage(head) && PageCompound(page))
>> +               i += (compound_nr(head) - (pages[i] - head));
>>
>> Or making specific to hugetlbfs:
>>
>> +       if (PageHuge(head))
>> +               i += (compound_nr(head) - (pages[i] - head));
>
> Yes, this is a reasonable optimization. I will submit a follow-up patch
> against linux-next.
On 2/3/21 2:53 PM, Jason Gunthorpe wrote:
> On Wed, Feb 03, 2021 at 01:22:18PM +0000, Joao Martins wrote:
>
>> With this, longterm gup will 'regress' for hugetlbfs e.g. from ~6k -> ~32k usecs when
>> pinning a 16G hugetlb file.
>
> Yes, but correctness demands it.
>
Hence the quotes around the word regress. But still, we are adding unnecessary
overhead (previously nonexistent) to compound pages, even those where the issue
doesn't exist (hugetlb).

> The solution is to track these pages as we discover them so we know when
> a PMD/PUD points at them and can directly skip the duplicated work.
>
That looks better. It would require moving check_and_migrate_movable_pages()
elsewhere, and/or possibly reworking the entire series?

>> Splitting can only occur on THP right? If so, perhaps we could
>> retain the @step increment for compound pages but when
>> !is_transparent_hugepage(head) or just PageHuge(head) like:
>
> Honestly I'd rather see it fixed properly which will give even bigger
> performance gains - avoiding the entire rescan of the page list will
> be a win
>
If check_and_migrate_movable_pages() is meant to migrate unpinned pages, then
rather than pinning+unpinning+moving, perhaps it could be called from
__get_user_pages(), at the place where we are walking page tables and know if
it's a PUD/PMD, and can skip all the subpages and just record and migrate
those instead. Was that your thinking?

Joao
On 2/3/21 3:32 PM, Joao Martins wrote:
> On 2/3/21 2:51 PM, Pavel Tatashin wrote:
>> On Wed, Feb 3, 2021 at 8:23 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>> On 1/25/21 7:47 PM, Pavel Tatashin wrote:
>>> for compound pages but when !is_transparent_hugepage(head) or just PageHuge(head) like:
>>>
>>> +       if (!is_transparent_hugepage(head) && PageCompound(page))
>>> +               i += (compound_nr(head) - (pages[i] - head));
>>>
>>> Or making specific to hugetlbfs:
>>>
>>> +       if (PageHuge(head))
>>> +               i += (compound_nr(head) - (pages[i] - head));
>>
>> Yes, this is a reasonable optimization. I will submit a follow-up patch
>> against linux-next.

Realized it late, but the previous step was already broken, and I inherited
its brokenness when copy-pasting the deleted chunk: the @step should be
capped at the remaining pages to iterate:

        i += min(nr_pages - i, compound_nr(head) - (pages[i] - head));

Joao
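A sketch of how the discussed follow-up could sit on top of the per-page loop from the patch at the end of this thread. This is only an illustration of the idea (hugetlb pages cannot be split, and the skip is capped at the remaining entries), not the actual follow-up patch.

        /*
         * Illustration only: re-add the skip for hugetlb pages on top of
         * the new per-page loop, capped at nr_pages.  The "- 1" accounts
         * for the loop's own i++.
         */
        prev_head = NULL;
        for (i = 0; i < nr_pages; i++) {
                head = compound_head(pages[i]);
                if (head == prev_head)
                        continue;
                prev_head = head;

                /* ... CMA check / isolation on head, as in the patch below ... */

                if (PageHuge(head))
                        i += min(nr_pages - i,
                                 compound_nr(head) - (pages[i] - head)) - 1;
        }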
On Wed, Feb 03, 2021 at 04:13:21PM +0000, Joao Martins wrote:

> If check_and_migrate_movable_pages() is meant to migrate unpinned
> pages, then rather than pinning+unpinning+moving, perhaps it could
> be called from __get_user_pages(), at the place where we are walking
> page tables and know if it's a PUD/PMD, and can skip all the subpages
> and just record and migrate those instead. Was that your thinking?

I think a reasonable approach is to detect non-pinnable pages while
walking the VMAs; when found, isolate them and thread them on a linked
list. When the VMA walk is done you'll have a linked list of isolated
pages that need migration.

So check_and_migrate_movable_pages() gets split into a top half diffused
into the VMA walk and a bottom half still called after
__get_user_pages() returns.

Jason
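A rough, hypothetical sketch of that shape, purely for illustration: the helper name below is invented here and does not exist in the kernel; only the isolation calls mirror what check_and_migrate_cma_pages() already does.

        /*
         * Hypothetical outline of the split discussed above.  Top half:
         * while __get_user_pages() walks the page tables, each head page
         * is seen exactly once, so unpinnable pages can be isolated onto
         * a list as they are discovered, with no rescan of pages[] later.
         */
        static void gup_collect_movable(struct page *head, struct list_head *movable)
        {
                if (!is_migrate_cma_page(head))
                        return;

                if (PageHuge(head)) {
                        isolate_huge_page(head, movable);
                } else {
                        if (!isolate_lru_page(head))
                                list_add_tail(&head->lru, movable);
                }
        }

        /*
         * Bottom half: after the walk, migrate whatever was collected
         * (e.g. with migrate_pages(), much as check_and_migrate_cma_pages()
         * does today) and retry the pin if anything had to move.
         */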
diff --git a/mm/gup.c b/mm/gup.c
index 24f25b1e9103..16f10d5a9eb6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1556,26 +1556,23 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
                                         unsigned int gup_flags)
 {
         unsigned long i;
-        unsigned long step;
         bool drain_allow = true;
         bool migrate_allow = true;
         LIST_HEAD(cma_page_list);
         long ret = nr_pages;
+        struct page *prev_head, *head;
         struct migration_target_control mtc = {
                 .nid = NUMA_NO_NODE,
                 .gfp_mask = GFP_USER | __GFP_NOWARN,
         };
 
 check_again:
-        for (i = 0; i < nr_pages;) {
-
-                struct page *head = compound_head(pages[i]);
-
-                /*
-                 * gup may start from a tail page. Advance step by the left
-                 * part.
-                 */
-                step = compound_nr(head) - (pages[i] - head);
+        prev_head = NULL;
+        for (i = 0; i < nr_pages; i++) {
+                head = compound_head(pages[i]);
+                if (head == prev_head)
+                        continue;
+                prev_head = head;
                 /*
                  * If we get a page from the CMA zone, since we are going to
                  * be pinning these entries, we might as well move them out
@@ -1599,8 +1596,6 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
                                 }
                         }
                 }
-
-                i += step;
         }
 
         if (!list_empty(&cma_page_list)) {