Message ID | 1578993378-10860-2-git-send-email-lixinhai.lxh@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/2] mm/mempolicy: Checking hstate for hugetlbfs page in vma_migratable | expand |
Add cc to Yang Shi <yang.shi@linux.alibaba.com> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> , who has been worked on this part On 2020-01-14 at 17:16 Li Xinhai wrote: >Checking MPOL_MF_STRICT is ignored for HUGETLB vma according to mbind man >page: > >Notes >MPOL_MF_STRICT is ignored on huge page mappings. > >If MPOL_MF_STRICT is specified alone without any MOVE flag, we should >indicate, from test_walk, that walking this vma should be skipped even if >there are misplaced pages. > >Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com> >Cc: Michal Hocko <mhocko@suse.com> >Cc: Mike Kravetz <mike.kravetz@oracle.com> >--- > mm/mempolicy.c | 7 +++++++ > 1 file changed, 7 insertions(+) > >diff --git a/mm/mempolicy.c b/mm/mempolicy.c >index 067cf7d..c117b5f 100644 >--- a/mm/mempolicy.c >+++ b/mm/mempolicy.c >@@ -656,6 +656,13 @@ static int queue_pages_test_walk(unsigned long start, unsigned long end, > return 1; > } > >+ /* MPOL_MF_STRICT is ignored for huge page, skip checking >+ * misplaced pages >+ */ >+ if ((flags & MPOL_MF_VALID) == MPOL_MF_STRICT && >+ is_vm_hugetlb_page(vma)) >+ return 1; >+ > /* queue pages from current vma */ > if (flags & MPOL_MF_VALID) > return 0; >-- >1.8.3.1 >
On 1/14/20 6:09 AM, Li Xinhai wrote: > Add cc to > Yang Shi <yang.shi@linux.alibaba.com> > Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> > , who has been worked on this part > > On 2020-01-14 at 17:16 Li Xinhai wrote: >> Checking MPOL_MF_STRICT is ignored for HUGETLB vma according to mbind man >> page: >> >> Notes >> MPOL_MF_STRICT is ignored on huge page mappings. >> >> If MPOL_MF_STRICT is specified alone without any MOVE flag, we should >> indicate, from test_walk, that walking this vma should be skipped even if >> there are misplaced pages. >> >> Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com> >> Cc: Michal Hocko <mhocko@suse.com> >> Cc: Mike Kravetz <mike.kravetz@oracle.com> >> --- >> mm/mempolicy.c | 7 +++++++ >> 1 file changed, 7 insertions(+) >> >> diff --git a/mm/mempolicy.c b/mm/mempolicy.c >> index 067cf7d..c117b5f 100644 >> --- a/mm/mempolicy.c >> +++ b/mm/mempolicy.c >> @@ -656,6 +656,13 @@ static int queue_pages_test_walk(unsigned long start, unsigned long end, >> return 1; >> } >> >> + /* MPOL_MF_STRICT is ignored for huge page, skip checking >> + * misplaced pages >> + */ >> + if ((flags & MPOL_MF_VALID) == MPOL_MF_STRICT && >> + is_vm_hugetlb_page(vma)) >> + return 1; It makes some sense to me. We do save some effort by not acquiring/releasing ptl and walking page tables. Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com> >> /* queue pages from current vma */ >> if (flags & MPOL_MF_VALID) >> return 0; >> -- >> 1.8.3.1 > >
On 1/14/20 6:09 AM, Li Xinhai wrote: > Add cc to > Yang Shi <yang.shi@linux.alibaba.com> > Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> > , who has been worked on this part > > On 2020-01-14 at 17:16 Li Xinhai wrote: >> Checking MPOL_MF_STRICT is ignored for HUGETLB vma according to mbind man >> page: >> >> Notes >> MPOL_MF_STRICT is ignored on huge page mappings. >> >> If MPOL_MF_STRICT is specified alone without any MOVE flag, we should >> indicate, from test_walk, that walking this vma should be skipped even if >> there are misplaced pages. >> >> Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com> >> Cc: Michal Hocko <mhocko@suse.com> >> Cc: Mike Kravetz <mike.kravetz@oracle.com> I do not necessarily disagree with the change. However, this has made me question a couple things: 1) Why does the man page say MPOL_MF_STRICT is ignored on huge page mappings? - Is that leftover from the the days when huge page migration was not supported? - Is it just because huge page migration is more likely to fail than base page migration. 2) Does the mbind code function properly when unable to migrate a huge page MPOL_MF_STRICT is set? A quick look at the code looks like it returns EIO. I will look into these questions. However, if someone already knows the answer that would be appreciated.
On 1/14/20 5:07 PM, Mike Kravetz wrote: > On 1/14/20 6:09 AM, Li Xinhai wrote: >> Add cc to >> Yang Shi <yang.shi@linux.alibaba.com> >> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> >> , who has been worked on this part >> >> On 2020-01-14 at 17:16 Li Xinhai wrote: >>> Checking MPOL_MF_STRICT is ignored for HUGETLB vma according to mbind man >>> page: >>> >>> Notes >>> MPOL_MF_STRICT is ignored on huge page mappings. >>> >>> If MPOL_MF_STRICT is specified alone without any MOVE flag, we should >>> indicate, from test_walk, that walking this vma should be skipped even if >>> there are misplaced pages. >>> >>> Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com> >>> Cc: Michal Hocko <mhocko@suse.com> >>> Cc: Mike Kravetz <mike.kravetz@oracle.com> > I do not necessarily disagree with the change. However, this has made me > question a couple things: > 1) Why does the man page say MPOL_MF_STRICT is ignored on huge page mappings? > - Is that leftover from the the days when huge page migration was not > supported? > - Is it just because huge page migration is more likely to fail than > base page migration. > 2) Does the mbind code function properly when unable to migrate a huge page > MPOL_MF_STRICT is set? A quick look at the code looks like it returns > EIO. I don't know the answer about question #1 I didn't dig into the history. The queue_pages_hugetlb() returns 0 unconditionally, I think this is what "MPOL_MF_STRICT is ignored on huge page mappings" means in code. It would return -EIO for base pages or THP as what the manpage describes. > > I will look into these questions. However, if someone already knows the > answer that would be appreciated.
On 1/14/20 5:24 PM, Yang Shi wrote: > > > On 1/14/20 5:07 PM, Mike Kravetz wrote: >> On 1/14/20 6:09 AM, Li Xinhai wrote: >>> Add cc to >>> Yang Shi <yang.shi@linux.alibaba.com> >>> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> >>> , who has been worked on this part >>> >>> On 2020-01-14 at 17:16 Li Xinhai wrote: >>>> Checking MPOL_MF_STRICT is ignored for HUGETLB vma according to mbind man >>>> page: >>>> >>>> Notes >>>> MPOL_MF_STRICT is ignored on huge page mappings. >>>> >>>> If MPOL_MF_STRICT is specified alone without any MOVE flag, we should >>>> indicate, from test_walk, that walking this vma should be skipped even if >>>> there are misplaced pages. >>>> >>>> Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com> >>>> Cc: Michal Hocko <mhocko@suse.com> >>>> Cc: Mike Kravetz <mike.kravetz@oracle.com> >> I do not necessarily disagree with the change. However, this has made me >> question a couple things: >> 1) Why does the man page say MPOL_MF_STRICT is ignored on huge page mappings? >> - Is that leftover from the the days when huge page migration was not >> supported? >> - Is it just because huge page migration is more likely to fail than >> base page migration. >> 2) Does the mbind code function properly when unable to migrate a huge page >> MPOL_MF_STRICT is set? A quick look at the code looks like it returns >> EIO. > > I don't know the answer about question #1 I didn't dig into the history. The queue_pages_hugetlb() returns 0 unconditionally, I think this is what "MPOL_MF_STRICT is ignored on huge page mappings" means in code. > > It would return -EIO for base pages or THP as what the manpage describes. > I was thinking about a migration failure after isolation. This block of code in do_mbind() after queue_pages_range() and mbind_range(). if (!err) { int nr_failed = 0; if (!list_empty(&pagelist)) { WARN_ON_ONCE(flags & MPOL_MF_LAZY); nr_failed = migrate_pages(&pagelist, new_page, NULL, start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND); if (nr_failed) putback_movable_pages(&pagelist); } if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT))) err = -EIO;
On 1/14/20 8:28 PM, Mike Kravetz wrote: > On 1/14/20 5:24 PM, Yang Shi wrote: >> >> On 1/14/20 5:07 PM, Mike Kravetz wrote: >>> On 1/14/20 6:09 AM, Li Xinhai wrote: >>>> Add cc to >>>> Yang Shi <yang.shi@linux.alibaba.com> >>>> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> >>>> , who has been worked on this part >>>> >>>> On 2020-01-14 at 17:16 Li Xinhai wrote: >>>>> Checking MPOL_MF_STRICT is ignored for HUGETLB vma according to mbind man >>>>> page: >>>>> >>>>> Notes >>>>> MPOL_MF_STRICT is ignored on huge page mappings. >>>>> >>>>> If MPOL_MF_STRICT is specified alone without any MOVE flag, we should >>>>> indicate, from test_walk, that walking this vma should be skipped even if >>>>> there are misplaced pages. >>>>> >>>>> Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com> >>>>> Cc: Michal Hocko <mhocko@suse.com> >>>>> Cc: Mike Kravetz <mike.kravetz@oracle.com> >>> I do not necessarily disagree with the change. However, this has made me >>> question a couple things: >>> 1) Why does the man page say MPOL_MF_STRICT is ignored on huge page mappings? >>> - Is that leftover from the the days when huge page migration was not >>> supported? >>> - Is it just because huge page migration is more likely to fail than >>> base page migration. >>> 2) Does the mbind code function properly when unable to migrate a huge page >>> MPOL_MF_STRICT is set? A quick look at the code looks like it returns >>> EIO. >> I don't know the answer about question #1 I didn't dig into the history. The queue_pages_hugetlb() returns 0 unconditionally, I think this is what "MPOL_MF_STRICT is ignored on huge page mappings" means in code. >> >> It would return -EIO for base pages or THP as what the manpage describes. >> > I was thinking about a migration failure after isolation. This block of > code in do_mbind() after queue_pages_range() and mbind_range(). > > if (!err) { > int nr_failed = 0; > > if (!list_empty(&pagelist)) { > WARN_ON_ONCE(flags & MPOL_MF_LAZY); > nr_failed = migrate_pages(&pagelist, new_page, NULL, > start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND); > if (nr_failed) > putback_movable_pages(&pagelist); > } > > if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT))) > err = -EIO; Hmm.. I agree this part in man page does look ambiguous. We may assume "MPOL_MF_STRICT is ignored on huge page mappings." implies if MPOL_MF_STRICT is specified alone? If MOVE flag is specified it should return -EIO if some pages could not be moved as what the man page describes. I don't know what the intention was at the first place. We may have to dig into the history. >
On 2020-01-15 at 13:23 Yang Shi wrote: > > >On 1/14/20 8:28 PM, Mike Kravetz wrote: >> On 1/14/20 5:24 PM, Yang Shi wrote: >>> >>> On 1/14/20 5:07 PM, Mike Kravetz wrote: >>>> On 1/14/20 6:09 AM, Li Xinhai wrote: >>>>> Add cc to >>>>> Yang Shi <yang.shi@linux.alibaba.com> >>>>> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> >>>>> , who has been worked on this part >>>>> >>>>> On 2020-01-14 at 17:16 Li Xinhai wrote: >>>>>> Checking MPOL_MF_STRICT is ignored for HUGETLB vma according to mbind man >>>>>> page: >>>>>> >>>>>> Notes >>>>>> MPOL_MF_STRICT is ignored on huge page mappings. >>>>>> >>>>>> If MPOL_MF_STRICT is specified alone without any MOVE flag, we should >>>>>> indicate, from test_walk, that walking this vma should be skipped even if >>>>>> there are misplaced pages. >>>>>> >>>>>> Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com> >>>>>> Cc: Michal Hocko <mhocko@suse.com> >>>>>> Cc: Mike Kravetz <mike.kravetz@oracle.com> >>>> I do not necessarily disagree with the change. However, this has made me >>>> question a couple things: >>>> 1) Why does the man page say MPOL_MF_STRICT is ignored on huge page mappings? >>>> - Is that leftover from the the days when huge page migration was not >>>> supported? >>>> - Is it just because huge page migration is more likely to fail than >>>> base page migration. >>>> 2) Does the mbind code function properly when unable to migrate a huge page >>>> MPOL_MF_STRICT is set? A quick look at the code looks like it returns >>>> EIO. for question (2), look into queue_pages_hugetlb(), the misplaced page would not cause -EIO reported, for both STRICT set alone and STRICT set with MOVE*; that means STRICT been effectively ignored during isolation phase. In unmap and move phase, -EIO is reported if failed to move page and STRICT is set. >>> I don't know the answer about question #1 I didn't dig into the history. The queue_pages_hugetlb() returns 0 unconditionally, I think this is what "MPOL_MF_STRICT is ignored on huge page mappings" means in code. >>> >>> It would return -EIO for base pages or THP as what the manpage describes. >>> >> I was thinking about a migration failure after isolation. This block of >> code in do_mbind() after queue_pages_range() and mbind_range(). >> >> if (!err) { >> int nr_failed = 0; >> >> if (!list_empty(&pagelist)) { >> WARN_ON_ONCE(flags & MPOL_MF_LAZY); >> nr_failed = migrate_pages(&pagelist, new_page, NULL, >> start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND); >> if (nr_failed) >> putback_movable_pages(&pagelist); >> } >> >> if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT))) >> err = -EIO; > >Hmm.. I agree this part in man page does look ambiguous. We may assume >"MPOL_MF_STRICT is ignored on huge page mappings." implies if >MPOL_MF_STRICT is specified alone? If MOVE flag is specified it should >return -EIO if some pages could not be moved as what the man page describes. > It looks to me that current code has no feasible way to ignore STRICT flag for hugetlb page when failure happen in unmap&move phase, because mbind is about to handle multiple vma(i.e., hugetlb vma mixed with other vma) in one call. >I don't know what the intention was at the first place. We may have to >dig into the history. > >> > >
On 1/14/20 11:36 PM, Li Xinhai wrote: > On 2020-01-15 at 13:23 Yang Shi wrote: >> >> On 1/14/20 8:28 PM, Mike Kravetz wrote: >>> On 1/14/20 5:24 PM, Yang Shi wrote: >>>> On 1/14/20 5:07 PM, Mike Kravetz wrote: >>>>> On 1/14/20 6:09 AM, Li Xinhai wrote: >>>>>> Add cc to >>>>>> Yang Shi <yang.shi@linux.alibaba.com> >>>>>> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> >>>>>> , who has been worked on this part >>>>>> >>>>>> On 2020-01-14 at 17:16 Li Xinhai wrote: >>>>>>> Checking MPOL_MF_STRICT is ignored for HUGETLB vma according to mbind man >>>>>>> page: >>>>>>> >>>>>>> Notes >>>>>>> MPOL_MF_STRICT is ignored on huge page mappings. >>>>>>> >>>>>>> If MPOL_MF_STRICT is specified alone without any MOVE flag, we should >>>>>>> indicate, from test_walk, that walking this vma should be skipped even if >>>>>>> there are misplaced pages. >>>>>>> >>>>>>> Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com> >>>>>>> Cc: Michal Hocko <mhocko@suse.com> >>>>>>> Cc: Mike Kravetz <mike.kravetz@oracle.com> >>>>> I do not necessarily disagree with the change. However, this has made me >>>>> question a couple things: >>>>> 1) Why does the man page say MPOL_MF_STRICT is ignored on huge page mappings? >>>>> - Is that leftover from the the days when huge page migration was not >>>>> supported? >>>>> - Is it just because huge page migration is more likely to fail than >>>>> base page migration. >>>>> 2) Does the mbind code function properly when unable to migrate a huge page >>>>> MPOL_MF_STRICT is set? A quick look at the code looks like it returns >>>>> EIO. > for question (2), > look into queue_pages_hugetlb(), the misplaced page would not > cause -EIO reported, for both STRICT set alone and STRICT set with MOVE*; > that means STRICT been effectively ignored during isolation phase. > > In unmap and move phase, -EIO is reported if failed to move page and > STRICT is set. > >>>> I don't know the answer about question #1 I didn't dig into the history. The queue_pages_hugetlb() returns 0 unconditionally, I think this is what "MPOL_MF_STRICT is ignored on huge page mappings" means in code. >>>> >>>> It would return -EIO for base pages or THP as what the manpage describes. >>>> >>> I was thinking about a migration failure after isolation. This block of >>> code in do_mbind() after queue_pages_range() and mbind_range(). >>> >>> if (!err) { >>> int nr_failed = 0; >>> >>> if (!list_empty(&pagelist)) { >>> WARN_ON_ONCE(flags & MPOL_MF_LAZY); >>> nr_failed = migrate_pages(&pagelist, new_page, NULL, >>> start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND); >>> if (nr_failed) >>> putback_movable_pages(&pagelist); >>> } >>> >>> if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT))) >>> err = -EIO; >> Hmm.. I agree this part in man page does look ambiguous. We may assume >> "MPOL_MF_STRICT is ignored on huge page mappings." implies if >> MPOL_MF_STRICT is specified alone? If MOVE flag is specified it should >> return -EIO if some pages could not be moved as what the man page describes. >> > It looks to me that current code has no feasible way to ignore STRICT > flag for hugetlb page when failure happen in unmap&move phase, > because mbind is about to handle multiple vma(i.e., hugetlb vma mixed with > other vma) in one call. Yes, if you have both MPOL_MF_STRICT and MPOL_MF_MOVE_{ALL} specified, so I speculate the condition is MPOL_MF_STRICT is specified alone, but the man page doesn't elaborate this. This is also what the code does. But, I'm not sure if my understanding is correct or not. > >> I don't know what the intention was at the first place. We may have to >> dig into the history. >> > >
Naoya, would appreciate any comments you have as this seems to be related to your changes to add huge page migration support. On 1/14/20 5:07 PM, Mike Kravetz wrote: > 1) Why does the man page say MPOL_MF_STRICT is ignored on huge page mappings? > - Is that leftover from the the days when huge page migration was not > supported? > - Is it just because huge page migration is more likely to fail than > base page migration. According to man page, mbind was added to v2.6.7 kernel. git history does not go back that far, so I first looked at v2.6.12. Definitions from include/linux/mempolicy.h ------------------------------------------ /* Policies */ #define MPOL_DEFAULT 0 #define MPOL_PREFERRED 1 #define MPOL_BIND 2 #define MPOL_INTERLEAVE 3 /* Flags for mbind */ #define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */ Note the MPOL_MF_STRICT would simply verify current page placement. At this time, hugetlb pages were not part of this verification. From v2.6.12 routine check_range() ... for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) { if (!vma->vm_next && vma->vm_end < end) return ERR_PTR(-EFAULT); if (prev && prev->vm_end < vma->vm_start) return ERR_PTR(-EFAULT); if ((flags & MPOL_MF_STRICT) && !is_vm_hugetlb_page(vma)) { err = verify_pages(vma->vm_mm, vma->vm_start, vma->vm_end, nodes); if (err) { first = ERR_PTR(err); break; } } prev = vma; } ... man page history does not go back this far. The first mbind man page has some things that are interesting: 1) It DOES NOT have the note saying MPOL_MF_STRICT is ignored on huge page mappings (even though the code does this). 2) Support for huge page policy was added with 2.6.16 3) For MPOL_MF_STRICT, in 2.6.16 or later the kernel will also try to move pages to the requested node with this flag. Next look at v2.6.16 kernel ========================== This kernel introduces the MPOL_MF_MOVE* flags #define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform to mapping */ #define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to mapping */ Once again, the up front check_range() routine will skip hugetlb vma's. Note that check_range() will also isolate pages. So since hugetlb vma's are skipped, no hugetlb pages will be isolated. Since no hugetlb pages are isolated, none will be on the list to be migrate. Therefore, hugetlb pages are effectively ignored for the new MPOL_MF_MOVE* flags. This makes sense as huge page migration support was not added until the v3.12 kernel. Note that at when MPOL_MF_MOVE* flags were added to the mbind man page, the statement "MPOL_MF_STRICT is ignored on huge page mappings right now." was added. This was later changed to "MPOL_MF_STRICT is ignored on huge page mappings." Next look at v3.12 kernel ========================= The v3.12 code looks similiar to today's code. In the verify/isolation phase, v3.12 routine queue_pages_hugetlb_pmd_range() looks very similiar to queue_pages_hugetlb(). Neither will cause errors at this point in the process. And, hugetlb pages with a mapcount == 1 will potentially be isolated and added to the list of pages to be migrated. In addition, if the subsequent call to migrate_pages() fails to migrate ANY page, we return -EIO if MPOL_MF_STRICT is set. This is true even if the only page(s) that failed to migrate were hugetlb pages. It should also be noted that no mbind man page updates were made WRT hugetlb pages after migration support was added. Summary ======= It 'looks' like the statement "MPOL_MF_STRICT is ignored on huge page mappings." is left over from the original mbind implementation. When the huge page migration support was added, I can not be sure if ignoring MPOL_MF_STRICT for huge pages during the verify/isolation phase was intentional. It seems like it was as the return value from isolate_huge_page() is ignored. What should we do? ================== 1) Nothing more than optimizations by Li Xinhai. Behavior that could be seen as conflicting with man page has existed since v3.12 and I am not aware of any complaints. 2) In addition to optimizations by Li Xinhai, modify code to truly ignore MPOL_MF_STRICT for huge page mappings. This would be fairly easy to do after a failure of migrate_pages(). We could simply traverse the list of pages that were not migrated looking for any non-hugetlb page. 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings." and modify code accordingly. My suggestion would be for 1 or 2. Thoughts?
On 1/15/20 1:07 PM, Mike Kravetz wrote: > Naoya, would appreciate any comments you have as this seems to be related > to your changes to add huge page migration support. > > On 1/14/20 5:07 PM, Mike Kravetz wrote: >> 1) Why does the man page say MPOL_MF_STRICT is ignored on huge page mappings? >> - Is that leftover from the the days when huge page migration was not >> supported? >> - Is it just because huge page migration is more likely to fail than >> base page migration. > > According to man page, mbind was added to v2.6.7 kernel. git history does > not go back that far, so I first looked at v2.6.12. > > Definitions from include/linux/mempolicy.h > ------------------------------------------ > /* Policies */ > #define MPOL_DEFAULT 0 > #define MPOL_PREFERRED 1 > #define MPOL_BIND 2 > #define MPOL_INTERLEAVE 3 > > /* Flags for mbind */ > #define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */ > > Note the MPOL_MF_STRICT would simply verify current page placement. At > this time, hugetlb pages were not part of this verification. > > From v2.6.12 routine check_range() > ... > for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) { > if (!vma->vm_next && vma->vm_end < end) > return ERR_PTR(-EFAULT); > if (prev && prev->vm_end < vma->vm_start) > return ERR_PTR(-EFAULT); > if ((flags & MPOL_MF_STRICT) && !is_vm_hugetlb_page(vma)) { > err = verify_pages(vma->vm_mm, > vma->vm_start, vma->vm_end, nodes); > if (err) { > first = ERR_PTR(err); > break; > } > } > prev = vma; > } > ... > > man page history does not go back this far. The first mbind man page has > some things that are interesting: > 1) It DOES NOT have the note saying MPOL_MF_STRICT is ignored on huge > page mappings (even though the code does this). > 2) Support for huge page policy was added with 2.6.16 > 3) For MPOL_MF_STRICT, in 2.6.16 or later the kernel will also try to > move pages to the requested node with this flag. > > Next look at v2.6.16 kernel > ========================== > This kernel introduces the MPOL_MF_MOVE* flags > #define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform > to mapping */ > #define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to mapping */ > > Once again, the up front check_range() routine will skip hugetlb vma's. > Note that check_range() will also isolate pages. So since hugetlb vma's > are skipped, no hugetlb pages will be isolated. Since no hugetlb pages > are isolated, none will be on the list to be migrate. Therefore, hugetlb > pages are effectively ignored for the new MPOL_MF_MOVE* flags. This makes > sense as huge page migration support was not added until the v3.12 kernel. > > Note that at when MPOL_MF_MOVE* flags were added to the mbind man page, > the statement "MPOL_MF_STRICT is ignored on huge page mappings right now." > was added. This was later changed to "MPOL_MF_STRICT is ignored on huge > page mappings." > > Next look at v3.12 kernel > ========================= > The v3.12 code looks similiar to today's code. In the verify/isolation > phase, v3.12 routine queue_pages_hugetlb_pmd_range() looks very similiar > to queue_pages_hugetlb(). Neither will cause errors at this point in the > process. And, hugetlb pages with a mapcount == 1 will potentially be > isolated and added to the list of pages to be migrated. In addition, if > the subsequent call to migrate_pages() fails to migrate ANY page, we > return -EIO if MPOL_MF_STRICT is set. This is true even if the only page(s) > that failed to migrate were hugetlb pages. > > It should also be noted that no mbind man page updates were made WRT > hugetlb pages after migration support was added. > > Summary > ======= > It 'looks' like the statement "MPOL_MF_STRICT is ignored on huge page > mappings." is left over from the original mbind implementation. When > the huge page migration support was added, I can not be sure if ignoring > MPOL_MF_STRICT for huge pages during the verify/isolation phase was > intentional. It seems like it was as the return value from > isolate_huge_page() is ignored. Thanks a lot for digging into the history. > > What should we do? > ================== > 1) Nothing more than optimizations by Li Xinhai. Behavior that could be > seen as conflicting with man page has existed since v3.12 and I am > not aware of any complaints. > 2) In addition to optimizations by Li Xinhai, modify code to truly ignore > MPOL_MF_STRICT for huge page mappings. This would be fairly easy to do > after a failure of migrate_pages(). We could simply traverse the list > of pages that were not migrated looking for any non-hugetlb page. I don't think we can do this easily since migrate_pages() would put the migration failed hugetlb pages back to hugepage_activelist so there should not any hugetlb page reside on the pagelist regardless of failure if I read the code correctly. Other than that traversing page list to look for a certain type page doesn't sound scalable to me. > 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings." > and modify code accordingly. > > My suggestion would be for 1 or 2. Thoughts? By rethinking the history (thanks again for digging into it), it sounds #3 should be more reasonable. It sounds like the behavior was changed since hugetlb migration was added but the man page was not updated to reflect the change.
On 1/15/20 1:30 PM, Yang Shi wrote: > On 1/15/20 1:07 PM, Mike Kravetz wrote: >> What should we do? >> ================== >> 1) Nothing more than optimizations by Li Xinhai. Behavior that could be >> seen as conflicting with man page has existed since v3.12 and I am >> not aware of any complaints. >> 2) In addition to optimizations by Li Xinhai, modify code to truly ignore >> MPOL_MF_STRICT for huge page mappings. This would be fairly easy to do >> after a failure of migrate_pages(). We could simply traverse the list >> of pages that were not migrated looking for any non-hugetlb page. > > I don't think we can do this easily since migrate_pages() would put the migration failed hugetlb pages back to hugepage_activelist so there should not any hugetlb page reside on the pagelist regardless of failure if I read the code correctly. > You are correct. I made an assumption without actually looking at the code. :( > Other than that traversing page list to look for a certain type page doesn't sound scalable to me. > >> 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings." >> and modify code accordingly. >> >> My suggestion would be for 1 or 2. Thoughts? > > By rethinking the history (thanks again for digging into it), it sounds #3 should be more reasonable. It sounds like the behavior was changed since hugetlb migration was added but the man page was not updated to reflect the change. > Let's hope Naoya comments. My only concern with #3 is that we will be changing behavior. I do not think many people (if any) depend on existing behavior. However, you can never be sure.
On 1/15/20 1:45 PM, Mike Kravetz wrote: > On 1/15/20 1:30 PM, Yang Shi wrote: >> On 1/15/20 1:07 PM, Mike Kravetz wrote: >>> What should we do? >>> ================== >>> 1) Nothing more than optimizations by Li Xinhai. Behavior that could be >>> seen as conflicting with man page has existed since v3.12 and I am >>> not aware of any complaints. >>> 2) In addition to optimizations by Li Xinhai, modify code to truly ignore >>> MPOL_MF_STRICT for huge page mappings. This would be fairly easy to do >>> after a failure of migrate_pages(). We could simply traverse the list >>> of pages that were not migrated looking for any non-hugetlb page. >> I don't think we can do this easily since migrate_pages() would put the migration failed hugetlb pages back to hugepage_activelist so there should not any hugetlb page reside on the pagelist regardless of failure if I read the code correctly. >> > You are correct. I made an assumption without actually looking at the code. :( > >> Other than that traversing page list to look for a certain type page doesn't sound scalable to me. >> >>> 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings." >>> and modify code accordingly. >>> >>> My suggestion would be for 1 or 2. Thoughts? >> By rethinking the history (thanks again for digging into it), it sounds #3 should be more reasonable. It sounds like the behavior was changed since hugetlb migration was added but the man page was not updated to reflect the change. >> > Let's hope Naoya comments. My only concern with #3 is that we will be changing > behavior. I do not think many people (if any) depend on existing behavior. > However, you can never be sure. Yes, this would change the bahavior, but I don't see why we have to treat hugetlb specially nowadays with migration supported.
On Wed 15-01-20 13:07:17, Mike Kravetz wrote: [...] > Summary > ======= > It 'looks' like the statement "MPOL_MF_STRICT is ignored on huge page > mappings." is left over from the original mbind implementation. When > the huge page migration support was added, I can not be sure if ignoring > MPOL_MF_STRICT for huge pages during the verify/isolation phase was > intentional. It seems like it was as the return value from > isolate_huge_page() is ignored. THanks for the tedious work of studying the mess^Whistory. > > What should we do? > ================== > 1) Nothing more than optimizations by Li Xinhai. Behavior that could be > seen as conflicting with man page has existed since v3.12 and I am > not aware of any complaints. > 2) In addition to optimizations by Li Xinhai, modify code to truly ignore > MPOL_MF_STRICT for huge page mappings. This would be fairly easy to do > after a failure of migrate_pages(). We could simply traverse the list > of pages that were not migrated looking for any non-hugetlb page. > 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings." > and modify code accordingly. > > My suggestion would be for 1 or 2. Thoughts? And why do we exactly need to do anything at all? There is an inconsistency that has been there for years without anybody noticing. NUMA API is a mess on its own and unfixable at this stage, there will always be some corner cases. If there is no real workload hitting this incosistency and suffering, I would rather not touch this at all. Unless the change would clean up the code or make it more maintainable.
Hi everyone, Thank you all for finding and digging the issue. > Summary > ======= > It 'looks' like the statement "MPOL_MF_STRICT is ignored on huge page > mappings." is left over from the original mbind implementation. When > the huge page migration support was added, I can not be sure if ignoring > MPOL_MF_STRICT for huge pages during the verify/isolation phase was > intentional. It seems like it was as the return value from > isolate_huge_page() is ignored. This summary is totally correct. I've simply missed considering MPOL_MF_STRICT flag when implementing hugetlb migration. As you pointed out, the discrepacy between the manpage and the code is also due to the lack of updates on the "MPOL_MF_STRICT is ignored on huge page mappings." statement. On Wed, Jan 15, 2020 at 01:59:14PM -0800, Yang Shi wrote: > > On 1/15/20 1:45 PM, Mike Kravetz wrote: > > On 1/15/20 1:30 PM, Yang Shi wrote: > > > On 1/15/20 1:07 PM, Mike Kravetz wrote: > > > > What should we do? > > > > ================== > > > > 1) Nothing more than optimizations by Li Xinhai. Behavior that could be > > > > seen as conflicting with man page has existed since v3.12 and I am > > > > not aware of any complaints. > > > > 2) In addition to optimizations by Li Xinhai, modify code to truly ignore > > > > MPOL_MF_STRICT for huge page mappings. This would be fairly easy to do > > > > after a failure of migrate_pages(). We could simply traverse the list > > > > of pages that were not migrated looking for any non-hugetlb page. > > > I don't think we can do this easily since migrate_pages() would put the migration failed hugetlb pages back to hugepage_activelist so there should not any hugetlb page reside on the pagelist regardless of failure if I read the code correctly. Although this behavior seems to me not prevent from finding non-hugetlb pages in migration list, this is a difference in migration behavior between normal pages and hugepages that might be better to be optimized. Maybe hugepages failed to migrate should remain in migration list after migrate_pages() returns and the should be put back via putback_movable_pages(). > > > > > You are correct. I made an assumption without actually looking at the code. :( > > > > > Other than that traversing page list to look for a certain type page doesn't sound scalable to me. > > > > > > > 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings." > > > > and modify code accordingly. > > > > > > > > My suggestion would be for 1 or 2. Thoughts? > > > By rethinking the history (thanks again for digging into it), it sounds #3 should be more reasonable. It sounds like the behavior was changed since hugetlb migration was added but the man page was not updated to reflect the change. > > > > > Let's hope Naoya comments. My only concern with #3 is that we will be changing > > behavior. I do not think many people (if any) depend on existing behavior. > > However, you can never be sure. > > Yes, this would change the bahavior, but I don't see why we have to treat > hugetlb specially nowadays with migration supported. (Option #1 is good for short term solution, but eventually) I agree with option #3. We have no reason to handle hugetlb differently about MPOL_MF_STRICT flag. Thanks, Naoya Horiguchi
On 2020-01-16 at 16:07 HORIGUCHI NAOYA(堀口 直也) wrote: >Hi everyone, > >Thank you all for finding and digging the issue. > >> Summary >> ======= >> It 'looks' like the statement "MPOL_MF_STRICT is ignored on huge page >> mappings." is left over from the original mbind implementation. When >> the huge page migration support was added, I can not be sure if ignoring >> MPOL_MF_STRICT for huge pages during the verify/isolation phase was >> intentional. It seems like it was as the return value from >> isolate_huge_page() is ignored. > >This summary is totally correct. I've simply missed considering MPOL_MF_STRICT >flag when implementing hugetlb migration. As you pointed out, the discrepacy >between the manpage and the code is also due to the lack of updates on the >"MPOL_MF_STRICT is ignored on huge page mappings." statement. > >On Wed, Jan 15, 2020 at 01:59:14PM -0800, Yang Shi wrote: >> >> On 1/15/20 1:45 PM, Mike Kravetz wrote: >> > On 1/15/20 1:30 PM, Yang Shi wrote: >> > > On 1/15/20 1:07 PM, Mike Kravetz wrote: >> > > > What should we do? >> > > > ================== >> > > > 1) Nothing more than optimizations by Li Xinhai. Behavior that could be >> > > > seen as conflicting with man page has existed since v3.12 and I am >> > > > not aware of any complaints. >> > > > 2) In addition to optimizations by Li Xinhai, modify code to truly ignore >> > > > MPOL_MF_STRICT for huge page mappings. This would be fairly easy to do >> > > > after a failure of migrate_pages(). We could simply traverse the list >> > > > of pages that were not migrated looking for any non-hugetlb page. >> > > I don't think we can do this easily since migrate_pages() would put the migration failed hugetlb pages back to hugepage_activelist so there should not any hugetlb page reside on the pagelist regardless of failure if I read the code correctly. > >Although this behavior seems to me not prevent from finding non-hugetlb >pages in migration list, this is a difference in migration behavior between >normal pages and hugepages that might be better to be optimized. >Maybe hugepages failed to migrate should remain in migration list after >migrate_pages() returns and the should be put back via putback_movable_pages(). > >> > > >> > You are correct. I made an assumption without actually looking at the code. :( >> > >> > > Other than that traversing page list to look for a certain type page doesn't sound scalable to me. >> > > >> > > > 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings." >> > > > and modify code accordingly. >> > > > >> > > > My suggestion would be for 1 or 2. Thoughts? >> > > By rethinking the history (thanks again for digging into it), it sounds #3 should be more reasonable. It sounds like the behavior was changed since hugetlb migration was added but the man page was not updated to reflect the change. >> > > >> > Let's hope Naoya comments. My only concern with #3 is that we will be changing >> > behavior. I do not think many people (if any) depend on existing behavior. >> > However, you can never be sure. >> >> Yes, this would change the bahavior, but I don't see why we have to treat >> hugetlb specially nowadays with migration supported. > >(Option #1 is good for short term solution, but eventually) I agree with option #3. >We have no reason to handle hugetlb differently about MPOL_MF_STRICT flag. Thanks. Same thoughts for option #3, but it seems better not change current behavior. Add more about current behavior of code: - In unmap&move phase, there is no different behavior of handling hugepage and non-hugepage, that is when STRICT is set, report -EIO if any page can't move, when STRICT is not set, don't report when failed to move; - In isolation phase, STRICT is effective for non-hugepge, that means set STRICT alone will cause -EIO if found misplaced pages, and set STRICT with MOVE* will cause -EIO if failed to isolate pages; for hugepage, STRICT is ignored, it don't detect misplaced pages nor report -EIO if isolation failed. This patch don't change any part of current behavior, only avoids walking page table, where currently do nothing if STRICT is set alone. > >Thanks, >Naoya Horiguchi
On 1/15/20 11:59 PM, Michal Hocko wrote: > On Wed 15-01-20 13:07:17, Mike Kravetz wrote: >> What should we do? >> ================== >> 1) Nothing more than optimizations by Li Xinhai. Behavior that could be >> seen as conflicting with man page has existed since v3.12 and I am >> not aware of any complaints. >> 2) In addition to optimizations by Li Xinhai, modify code to truly ignore >> MPOL_MF_STRICT for huge page mappings. This would be fairly easy to do >> after a failure of migrate_pages(). We could simply traverse the list >> of pages that were not migrated looking for any non-hugetlb page. >> 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings." >> and modify code accordingly. >> >> My suggestion would be for 1 or 2. Thoughts? > > And why do we exactly need to do anything at all? There is an > inconsistency that has been there for years without anybody noticing. > NUMA API is a mess on its own and unfixable at this stage, there will > always be some corner cases. If there is no real workload hitting this > incosistency and suffering, I would rather not touch this at all. > Unless the change would clean up the code or make it more maintainable. That is a very valid point. Sometimes we as developers get focused on the actual code changes and fail to ask the question "does this really need to be changed?" or "what value do the code changes provide?". Li Xinhai came up with two optimizations in how the mbind code deals with hugetlb pages. This 'sub-optimal' code has existed for more than 6 years. Unless I am mistaken, nobody has actually complained or noticed this behavior. I believe Li Xinhai noticed this inefficient code via code inspection. Of course, based on what we know today one could write a test program to show the inefficient behavior. However, no real users have noticed this during the past 6 years. The proposed code changes are fairly simple. However, I would not say that they clean up the code or make it more maintainable. They essentially add or modify two checks to bail out early for hugetlb vma's if the flag which is documented to not apply to hugetlb pages (MPOL_MF_STRICT) is specified. If one is trying to follow the entire mbind code path for hugetlb pages, these patches will make that easier follow/understand. That is simply because one can ignore downstream code/functionality. Based on Michal's criteria above, I now believe the code changes should not be made. Yes, they are fairly simple. However, even simple changes have the potential to break something (build breakage with v1 of patch). We should leave this code as is unless issues are reported by users.
On 1/16/20 11:22 AM, Mike Kravetz wrote: > On 1/15/20 11:59 PM, Michal Hocko wrote: >> On Wed 15-01-20 13:07:17, Mike Kravetz wrote: >>> What should we do? >>> ================== >>> 1) Nothing more than optimizations by Li Xinhai. Behavior that could be >>> seen as conflicting with man page has existed since v3.12 and I am >>> not aware of any complaints. >>> 2) In addition to optimizations by Li Xinhai, modify code to truly ignore >>> MPOL_MF_STRICT for huge page mappings. This would be fairly easy to do >>> after a failure of migrate_pages(). We could simply traverse the list >>> of pages that were not migrated looking for any non-hugetlb page. >>> 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings." >>> and modify code accordingly. >>> >>> My suggestion would be for 1 or 2. Thoughts? >> And why do we exactly need to do anything at all? There is an >> inconsistency that has been there for years without anybody noticing. >> NUMA API is a mess on its own and unfixable at this stage, there will >> always be some corner cases. If there is no real workload hitting this >> incosistency and suffering, I would rather not touch this at all. >> Unless the change would clean up the code or make it more maintainable. > That is a very valid point. Sometimes we as developers get focused on the > actual code changes and fail to ask the question "does this really need to > be changed?" or "what value do the code changes provide?". > > Li Xinhai came up with two optimizations in how the mbind code deals with > hugetlb pages. This 'sub-optimal' code has existed for more than 6 years. > Unless I am mistaken, nobody has actually complained or noticed this behavior. > I believe Li Xinhai noticed this inefficient code via code inspection. Of > course, based on what we know today one could write a test program to show > the inefficient behavior. However, no real users have noticed this during > the past 6 years. > > The proposed code changes are fairly simple. However, I would not say that > they clean up the code or make it more maintainable. They essentially add > or modify two checks to bail out early for hugetlb vma's if the flag which > is documented to not apply to hugetlb pages (MPOL_MF_STRICT) is specified. > If one is trying to follow the entire mbind code path for hugetlb pages, > these patches will make that easier follow/understand. That is simply > because one can ignore downstream code/functionality. > > Based on Michal's criteria above, I now believe the code changes should not > be made. Yes, they are fairly simple. However, even simple changes have > the potential to break something (build breakage with v1 of patch). We should > leave this code as is unless issues are reported by users. I tend to agree with you. And, according to what Horiguchi explained, the intention was not ignoring hugetlb mappings when hugetlb migration was added at the first place. And, I'm supposed all of us agree hugetlb pages should be not treated specially although it is not a good timing or there is not strong motivation to fix it right now (we may correct the behavior in the future). The patch may convey the wrong information. And, the code path is definitely not a hot path, so I'm fine to drop it. And, I'm wondering if we need add some comments in the code to explain the edge case just in case someone else repeat all the tedious history digging.
On 2020-01-17 at 03:22 Mike Kravetz wrote: >On 1/15/20 11:59 PM, Michal Hocko wrote: >> On Wed 15-01-20 13:07:17, Mike Kravetz wrote: >>> What should we do? >>> ================== >>> 1) Nothing more than optimizations by Li Xinhai. Behavior that could be >>> seen as conflicting with man page has existed since v3.12 and I am >>> not aware of any complaints. >>> 2) In addition to optimizations by Li Xinhai, modify code to truly ignore >>> MPOL_MF_STRICT for huge page mappings. This would be fairly easy to do >>> after a failure of migrate_pages(). We could simply traverse the list >>> of pages that were not migrated looking for any non-hugetlb page. >>> 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings." >>> and modify code accordingly. >>> >>> My suggestion would be for 1 or 2. Thoughts? >> >> And why do we exactly need to do anything at all? There is an >> inconsistency that has been there for years without anybody noticing. >> NUMA API is a mess on its own and unfixable at this stage, there will >> always be some corner cases. If there is no real workload hitting this >> incosistency and suffering, I would rather not touch this at all. >> Unless the change would clean up the code or make it more maintainable. > >That is a very valid point. Sometimes we as developers get focused on the >actual code changes and fail to ask the question "does this really need to >be changed?" or "what value do the code changes provide?". > >Li Xinhai came up with two optimizations in how the mbind code deals with >hugetlb pages. This 'sub-optimal' code has existed for more than 6 years. >Unless I am mistaken, nobody has actually complained or noticed this behavior. >I believe Li Xinhai noticed this inefficient code via code inspection. Of >course, based on what we know today one could write a test program to show >the inefficient behavior. However, no real users have noticed this during >the past 6 years. > >The proposed code changes are fairly simple. However, I would not say that >they clean up the code or make it more maintainable. They essentially add >or modify two checks to bail out early for hugetlb vma's if the flag which >is documented to not apply to hugetlb pages (MPOL_MF_STRICT) is specified. >If one is trying to follow the entire mbind code path for hugetlb pages, >these patches will make that easier follow/understand. That is simply >because one can ignore downstream code/functionality. > >Based on Michal's criteria above, I now believe the code changes should not >be made. Yes, they are fairly simple. However, even simple changes have >the potential to break something (build breakage with v1 of patch). We should >leave this code as is unless issues are reported by users. Acctually I am the user of this API, and when using STRICT alone to know if pages are misplaced and take action later, it seems don't work consistantly on hugepage. Initially, I assumed that I didn't use it correctly, that flag must use with MOVE*? After check the code, found that flag didn't been handled, and finally noticed that there is a NOTE about it in MAN page. I'd like the STRICT been handled, but at present since this is not going to be implemented, I have to handle differently on hugetlb mapping. At meantime, I thought that this patch can be useful and posted it, because for scenarios where hugetlb mapping are handled with other mappings, less cost for the mbind call. (although it didn't help my current case) I think if we could provid better performance, why not do that only because that code has exists for 6 years? >-- >Mike Kravetz
On Fri 17-01-20 10:38:21, Li Xinhai wrote: > On 2020-01-17 at 03:22 Mike Kravetz wrote: > >On 1/15/20 11:59 PM, Michal Hocko wrote: > >> On Wed 15-01-20 13:07:17, Mike Kravetz wrote: > >>> What should we do? > >>> ================== > >>> 1) Nothing more than optimizations by Li Xinhai. Behavior that could be > >>> seen as conflicting with man page has existed since v3.12 and I am > >>> not aware of any complaints. > >>> 2) In addition to optimizations by Li Xinhai, modify code to truly ignore > >>> MPOL_MF_STRICT for huge page mappings. This would be fairly easy to do > >>> after a failure of migrate_pages(). We could simply traverse the list > >>> of pages that were not migrated looking for any non-hugetlb page. > >>> 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings." > >>> and modify code accordingly. > >>> > >>> My suggestion would be for 1 or 2. Thoughts? > >> > >> And why do we exactly need to do anything at all? There is an > >> inconsistency that has been there for years without anybody noticing. > >> NUMA API is a mess on its own and unfixable at this stage, there will > >> always be some corner cases. If there is no real workload hitting this > >> incosistency and suffering, I would rather not touch this at all. > >> Unless the change would clean up the code or make it more maintainable. > > > >That is a very valid point. Sometimes we as developers get focused on the > >actual code changes and fail to ask the question "does this really need to > >be changed?" or "what value do the code changes provide?". > > > >Li Xinhai came up with two optimizations in how the mbind code deals with > >hugetlb pages. This 'sub-optimal' code has existed for more than 6 years. > >Unless I am mistaken, nobody has actually complained or noticed this behavior. > >I believe Li Xinhai noticed this inefficient code via code inspection. Of > >course, based on what we know today one could write a test program to show > >the inefficient behavior. However, no real users have noticed this during > >the past 6 years. > > > >The proposed code changes are fairly simple. However, I would not say that > >they clean up the code or make it more maintainable. They essentially add > >or modify two checks to bail out early for hugetlb vma's if the flag which > >is documented to not apply to hugetlb pages (MPOL_MF_STRICT) is specified. > >If one is trying to follow the entire mbind code path for hugetlb pages, > >these patches will make that easier follow/understand. That is simply > >because one can ignore downstream code/functionality. > > > >Based on Michal's criteria above, I now believe the code changes should not > >be made. Yes, they are fairly simple. However, even simple changes have > >the potential to break something (build breakage with v1 of patch). We should > >leave this code as is unless issues are reported by users. > > Acctually I am the user of this API, and when using STRICT alone to know if > pages are misplaced and take action later, it seems don't work consistantly > on hugepage. Initially, I assumed that I didn't use it correctly, that flag must > use with MOVE*? After check the code, found that flag didn't been handled, > and finally noticed that there is a NOTE about it in MAN page. This is the first time you are mentioning an actual use case. This should have been expressed from the very begining. Including an explanation of what the use case is and how it is affected by the current behavior. > I'd like the STRICT been handled, but at present since this is not going to > be implemented, I have to handle differently on hugetlb mapping. > > At meantime, I thought that this patch can be useful and posted it, because > for scenarios where hugetlb mapping are handled with other mappings, less > cost for the mbind call. (although it didn't help my current case) > > I think if we could provid better performance, why not do that only because > that code has exists for 6 years? Do you have any numbers to prove performance improvements? I believe arguments against the patch have been already presented.
On 2020-01-17 at 15:57 Michal Hocko wrote: >On Fri 17-01-20 10:38:21, Li Xinhai wrote: >> On 2020-01-17 at 03:22 Mike Kravetz wrote: >> >On 1/15/20 11:59 PM, Michal Hocko wrote: >> >> On Wed 15-01-20 13:07:17, Mike Kravetz wrote: >> >>> What should we do? >> >>> ================== >> >>> 1) Nothing more than optimizations by Li Xinhai. Behavior that could be >> >>> seen as conflicting with man page has existed since v3.12 and I am >> >>> not aware of any complaints. >> >>> 2) In addition to optimizations by Li Xinhai, modify code to truly ignore >> >>> MPOL_MF_STRICT for huge page mappings. This would be fairly easy to do >> >>> after a failure of migrate_pages(). We could simply traverse the list >> >>> of pages that were not migrated looking for any non-hugetlb page. >> >>> 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings." >> >>> and modify code accordingly. >> >>> >> >>> My suggestion would be for 1 or 2. Thoughts? >> >> >> >> And why do we exactly need to do anything at all? There is an >> >> inconsistency that has been there for years without anybody noticing. >> >> NUMA API is a mess on its own and unfixable at this stage, there will >> >> always be some corner cases. If there is no real workload hitting this >> >> incosistency and suffering, I would rather not touch this at all. >> >> Unless the change would clean up the code or make it more maintainable. >> > >> >That is a very valid point. Sometimes we as developers get focused on the >> >actual code changes and fail to ask the question "does this really need to >> >be changed?" or "what value do the code changes provide?". >> > >> >Li Xinhai came up with two optimizations in how the mbind code deals with >> >hugetlb pages. This 'sub-optimal' code has existed for more than 6 years. >> >Unless I am mistaken, nobody has actually complained or noticed this behavior. >> >I believe Li Xinhai noticed this inefficient code via code inspection. Of >> >course, based on what we know today one could write a test program to show >> >the inefficient behavior. However, no real users have noticed this during >> >the past 6 years. >> > >> >The proposed code changes are fairly simple. However, I would not say that >> >they clean up the code or make it more maintainable. They essentially add >> >or modify two checks to bail out early for hugetlb vma's if the flag which >> >is documented to not apply to hugetlb pages (MPOL_MF_STRICT) is specified. >> >If one is trying to follow the entire mbind code path for hugetlb pages, >> >these patches will make that easier follow/understand. That is simply >> >because one can ignore downstream code/functionality. >> > >> >Based on Michal's criteria above, I now believe the code changes should not >> >be made. Yes, they are fairly simple. However, even simple changes have >> >the potential to break something (build breakage with v1 of patch). We should >> >leave this code as is unless issues are reported by users. >> >> Acctually I am the user of this API, and when using STRICT alone to know if >> pages are misplaced and take action later, it seems don't work consistantly >> on hugepage. Initially, I assumed that I didn't use it correctly, that flag must >> use with MOVE*? After check the code, found that flag didn't been handled, >> and finally noticed that there is a NOTE about it in MAN page. > >This is the first time you are mentioning an actual use case. This >should have been expressed from the very begining. Including an >explanation of what the use case is and how it is affected by the >current behavior. > OK, that is better practice, thanks. >> I'd like the STRICT been handled, but at present since this is not going to >> be implemented, I have to handle differently on hugetlb mapping. >> >> At meantime, I thought that this patch can be useful and posted it, because >> for scenarios where hugetlb mapping are handled with other mappings, less >> cost for the mbind call. (although it didn't help my current case) >> >> I think if we could provid better performance, why not do that only because >> that code has exists for 6 years? > >Do you have any numbers to prove performance improvements? I believe >arguments against the patch have been already presented. I didn't test to know difference with this patch since it doesn't help for my case. If walking the page table and checking the present PTE is basically no cost according to existing experience, then it is not needed. Now, for my case, I am doing workaround by call mbind then check the numa_maps, instead of just call mbind. So, I'd like option #3 is there. That is a simple change, and I am glad to add this function if no objection. >-- >Michal Hocko >SUSE Labs
On Fri 17-01-20 20:05:23, Li Xinhai wrote: [...] > Now, for my case, I am doing workaround workaround for what? > by call mbind then check the > numa_maps, instead of just call mbind. So, I'd like option #3 is there. That is > a simple change, and I am glad to add this function if no objection. Any reason you cannot use move_pages syscall?
On 2020-01-17 at 23:20 Michal Hocko wrote: >On Fri 17-01-20 20:05:23, Li Xinhai wrote: >[...] >> Now, for my case, I am doing workaround > >workaround for what? > I mean now call mbind without MOVE* or STRICT, only apply policy, because my purpose is to take action later on misplaced pages. If STRICT give correct feedback, I don't need to check numa_maps. For above description, I don't use MOVE*, as want to take action later. >> by call mbind then check the >> numa_maps, instead of just call mbind. So, I'd like option #3 is there. That is >> a simple change, and I am glad to add this function if no objection. > >Any reason you cannot use move_pages syscall? I use it, that is the 'take action later' mentioned above, I'd like to know if there are misplace pages before call it if mbind can help to give answer. >-- >Michal Hocko >SUSE Labs
On Fri 17-01-20 23:46:03, Li Xinhai wrote: [...] > >> by call mbind then check the > >> numa_maps, instead of just call mbind. So, I'd like option #3 is there. That is > >> a simple change, and I am glad to add this function if no objection. > > > >Any reason you cannot use move_pages syscall? > > I use it, that is the 'take action later' mentioned above, I'd like to > know if there are misplace pages before call it if mbind can help > to give answer. I am sorry but I do not follow? Why do you need to do two steps? Why don't you simply call move_pages right away?
On 2020-01-20 at 20:45 Michal Hocko wrote: >On Fri 17-01-20 23:46:03, Li Xinhai wrote: >[...] >> >> by call mbind then check the >> >> numa_maps, instead of just call mbind. So, I'd like option #3 is there. That is >> >> a simple change, and I am glad to add this function if no objection. >> > >> >Any reason you cannot use move_pages syscall? >> >> I use it, that is the 'take action later' mentioned above, I'd like to >> know if there are misplace pages before call it if mbind can help >> to give answer. > >I am sorry but I do not follow? Why do you need to do two steps? Why >don't you simply call move_pages right away? I am trying to decrease the number of call to move_pages(). For call move_pages(), we need prepare parameters in array for 'pages', 'nodes' (i.e., it do not support moving pages as specified through start addr and length), which could be cost. One way for handling my case is - mbind: set policy without detecting if any pages misplaced nor moving pages. (continue below part when timing is right) - move_pages: call on pages which are within range of above mbind. The other way is: - mbind: set policy and detects if any pages are misplaced. (continue below part when timing is right) - move_pages: call on pages which are within range has misplaced pages. Knowing about misplaced pages from mbind will give me flexibility for choosing different solutions. > >-- >Michal Hocko >SUSE Labs
On Tue 21-01-20 22:15:43, Li Xinhai wrote: [...] > Knowing about misplaced pages from mbind will give me flexibility for > choosing different solutions. Would /proc/pid/numa_maps help to give you rough idea which mappings of interest have misplaced pages and move_pages for those? It is still allocating arrays but is that really a big deal?
On 2020-01-21 at 22:53 Michal Hocko wrote: >On Tue 21-01-20 22:15:43, Li Xinhai wrote: >[...] >> Knowing about misplaced pages from mbind will give me flexibility for >> choosing different solutions. > >Would /proc/pid/numa_maps help to give you rough idea which mappings >of interest have misplaced pages and move_pages for those? It is still >allocating arrays but is that really a big deal? numa_maps can be checked about misplaced pages, and I mentioned it in previous mail for this purpose. For my case, using move_pages() is necessary for later action. The only difference is about mbind() could give feedback of misplaced page on impacted range, numa_mpas requres me iterate over maps (that will collect states on those maps although don't been used) to reach impacted range and check it. I may assume there is no much difference on overall performance, it is more about coding effort in user sapce application. I've being try to apply patch for mbind(i.e., apply STRICT on hugetlb mapping), and may sharing it after verification and let's see if that is usable, thanks. >-- >Michal Hocko >SUSE Labs
diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 067cf7d..c117b5f 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -656,6 +656,13 @@ static int queue_pages_test_walk(unsigned long start, unsigned long end, return 1; } + /* MPOL_MF_STRICT is ignored for huge page, skip checking + * misplaced pages + */ + if ((flags & MPOL_MF_VALID) == MPOL_MF_STRICT && + is_vm_hugetlb_page(vma)) + return 1; + /* queue pages from current vma */ if (flags & MPOL_MF_VALID) return 0;
Checking MPOL_MF_STRICT is ignored for HUGETLB vma according to mbind man page: Notes MPOL_MF_STRICT is ignored on huge page mappings. If MPOL_MF_STRICT is specified alone without any MOVE flag, we should indicate, from test_walk, that walking this vma should be skipped even if there are misplaced pages. Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> --- mm/mempolicy.c | 7 +++++++ 1 file changed, 7 insertions(+)