
mm: mempolicy: don't have to split pmd for huge zero page

Message ID 20210604203513.240709-1-shy828301@gmail.com (mailing list archive)
State New, archived

Commit Message

Yang Shi June 4, 2021, 8:35 p.m. UTC
When trying to migrate pages to obey mempolicy, the huge zero page is
split, then the page table walk at PTE level just skips the zero page.  So
it seems pointless to split the huge zero page; it could just be skipped
like the base zero page.

Set ACTION_CONTINUE to prevent walk_page_range() from splitting the pmd in
this case.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/mempolicy.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

Comments

Zi Yan June 4, 2021, 9:23 p.m. UTC | #1
On 4 Jun 2021, at 16:35, Yang Shi wrote:

> When trying to migrate pages to obey mempolicy, the huge zero page is
> split then the page table walk at PTE level just skips zero page.  So it
> seems pointless to split huge zero page, it could be just skipped like
> base zero page.
>
> Set ACTION_CONTINUE to prevent the walk_page_range() split the pmd for
> this case.
>
> Signed-off-by: Yang Shi <shy828301@gmail.com>

LGTM. Thanks.

Reviewed-by: Zi Yan <ziy@nvidia.com>


—
Best Regards,
Yan, Zi
Michal Hocko June 7, 2021, 6:21 a.m. UTC | #2
On Fri 04-06-21 13:35:13, Yang Shi wrote:
> When trying to migrate pages to obey mempolicy, the huge zero page is
> split then the page table walk at PTE level just skips zero page.  So it
> seems pointless to split huge zero page, it could be just skipped like
> base zero page.

My THP knowledge is not the best but this is incorrect AFAICS. The huge
zero page is not split. We do split the pmd which is mapping the said
page. I suspect you refer to vm_normal_page when talking about a zero
page, but please be aware that the huge zero page is not a normal zero
page. It is allocated dynamically (see get_huge_zero_page).

So in the end your patch disables mbind of zero pages to a target node,
and that is a regression.

Have you tested the patch?

> Set ACTION_CONTINUE to prevent the walk_page_range() split the pmd for
> this case.

Btw. this changelog is missing a problem statement. I suspect there is
no actual problem that it should fix and it is likely driven by reading
the code. Right?

> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  mm/mempolicy.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index b5f4f584009b..205c1a768775 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -436,7 +436,8 @@ static inline bool queue_pages_required(struct page *page,
>  
>  /*
>   * queue_pages_pmd() has four possible return values:
> - * 0 - pages are placed on the right node or queued successfully.
> + * 0 - pages are placed on the right node or queued successfully, or
> + *     special page is met, i.e. huge zero page.
>   * 1 - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
>   *     specified.
>   * 2 - THP was split.
> @@ -460,8 +461,7 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
>  	page = pmd_page(*pmd);
>  	if (is_huge_zero_page(page)) {
>  		spin_unlock(ptl);
> -		__split_huge_pmd(walk->vma, pmd, addr, false, NULL);
> -		ret = 2;
> +		walk->action = ACTION_CONTINUE;
>  		goto out;
>  	}
>  	if (!queue_pages_required(page, qp))
> @@ -488,7 +488,8 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
>   * and move them to the pagelist if they do.
>   *
>   * queue_pages_pte_range() has three possible return values:
> - * 0 - pages are placed on the right node or queued successfully.
> + * 0 - pages are placed on the right node or queued successfully, or
> + *     special page is met, i.e. zero page.
>   * 1 - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
>   *     specified.
>   * -EIO - only MPOL_MF_STRICT was specified and an existing page was already
> -- 
> 2.26.2
Yang Shi June 7, 2021, 5 p.m. UTC | #3
On Sun, Jun 6, 2021 at 11:21 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 04-06-21 13:35:13, Yang Shi wrote:
> > When trying to migrate pages to obey mempolicy, the huge zero page is
> > split then the page table walk at PTE level just skips zero page.  So it
> > seems pointless to split huge zero page, it could be just skipped like
> > base zero page.
>
> My THP knowledge is not the best but this is incorrect AIACS. Huge zero
> page is not split. We do split the pmd which is mapping the said page. I
> suspect you refer to vm_normal_page when talking about a zero page but
> please be aware that huge zero page is not a normal zero page. It is
> allocated dynamically (see get_huge_zero_page).

For a normal huge page, yes, split_huge_pmd() just splits pmd. But
actually the base zero pfn will be inserted to PTEs when splitting
huge zero pmd. Please check __split_huge_zero_page_pmd() out.

I should make this point clearer in the commit log. Sorry for the confusion.

>
> So in the end you patch disables mbind of zero pages to a target node
> and that is a regression.

Do we really migrate zero page? IIUC zero page is just skipped by
vm_normal_page() check in queue_pages_pte_range(), isn't it?

>
> Have you tested the patch?

No, just a build test. I thought this change was straightforward.

>
> > Set ACTION_CONTINUE to prevent the walk_page_range() split the pmd for
> > this case.
>
> Btw. this changelog is missing a problem statement. I suspect there is
> no actual problem that it should fix and it is likely driven by reading
> the code. Right?

The actual problem is that it is pointless to split a huge zero pmd. Yes,
it is driven by visual inspection.

The behavior before the patch for the huge zero page is:
split the huge zero pmd (insert the base zero pfn into the ptes)
walk the ptes
skip the zero pfn

So why not just skip the huge zero page in the first place?

>
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  mm/mempolicy.c | 9 +++++----
> >  1 file changed, 5 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index b5f4f584009b..205c1a768775 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -436,7 +436,8 @@ static inline bool queue_pages_required(struct page *page,
> >
> >  /*
> >   * queue_pages_pmd() has four possible return values:
> > - * 0 - pages are placed on the right node or queued successfully.
> > + * 0 - pages are placed on the right node or queued successfully, or
> > + *     special page is met, i.e. huge zero page.
> >   * 1 - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
> >   *     specified.
> >   * 2 - THP was split.
> > @@ -460,8 +461,7 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
> >       page = pmd_page(*pmd);
> >       if (is_huge_zero_page(page)) {
> >               spin_unlock(ptl);
> > -             __split_huge_pmd(walk->vma, pmd, addr, false, NULL);
> > -             ret = 2;
> > +             walk->action = ACTION_CONTINUE;
> >               goto out;
> >       }
> >       if (!queue_pages_required(page, qp))
> > @@ -488,7 +488,8 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
> >   * and move them to the pagelist if they do.
> >   *
> >   * queue_pages_pte_range() has three possible return values:
> > - * 0 - pages are placed on the right node or queued successfully.
> > + * 0 - pages are placed on the right node or queued successfully, or
> > + *     special page is met, i.e. zero page.
> >   * 1 - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
> >   *     specified.
> >   * -EIO - only MPOL_MF_STRICT was specified and an existing page was already
> > --
> > 2.26.2
>
> --
> Michal Hocko
> SUSE Labs
Yang Shi June 7, 2021, 6:41 p.m. UTC | #4
On Mon, Jun 7, 2021 at 10:00 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Sun, Jun 6, 2021 at 11:21 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 04-06-21 13:35:13, Yang Shi wrote:
> > > When trying to migrate pages to obey mempolicy, the huge zero page is
> > > split then the page table walk at PTE level just skips zero page.  So it
> > > seems pointless to split huge zero page, it could be just skipped like
> > > base zero page.
> >
> > My THP knowledge is not the best but this is incorrect AIACS. Huge zero
> > page is not split. We do split the pmd which is mapping the said page. I
> > suspect you refer to vm_normal_page when talking about a zero page but
> > please be aware that huge zero page is not a normal zero page. It is
> > allocated dynamically (see get_huge_zero_page).
>
> For a normal huge page, yes, split_huge_pmd() just splits pmd. But
> actually the base zero pfn will be inserted to PTEs when splitting
> huge zero pmd. Please check __split_huge_zero_page_pmd() out.
>
> I should make this point clearer in the commit log. Sorry for the confusion.
>
> >
> > So in the end you patch disables mbind of zero pages to a target node
> > and that is a regression.
>
> Do we really migrate zero page? IIUC zero page is just skipped by
> vm_normal_page() check in queue_pages_pte_range(), isn't it?
>
> >
> > Have you tested the patch?
>
> No, just build test. I thought this change was straightforward.

I just came up with a quick test. It tries to mbind a 1G address space
(backed by the huge zero page) to another node; the result is:

w/o patch:
pgmigrate_success 0
pgmigrate_fail 0
thp_migration_success 0
thp_migration_fail 0
thp_migration_split 0

thp_split_pmd 512
thp_split_pud 0
thp_zero_page_alloc 1


w/ patch:
pgmigrate_success 0
pgmigrate_fail 0
thp_migration_success 0
thp_migration_fail 0
thp_migration_split 0

thp_split_pmd 0
thp_split_pud 0
thp_zero_page_alloc 1


We can tell that neither the huge zero page nor the base zero page was
migrated, even before the patch. The patch just kills the pointless pmd
split and keeps the huge zero page.

>
> >
> > > Set ACTION_CONTINUE to prevent the walk_page_range() split the pmd for
> > > this case.
> >
> > Btw. this changelog is missing a problem statement. I suspect there is
> > no actual problem that it should fix and it is likely driven by reading
> > the code. Right?
>
> The actual problem is it is pointless to split a huge zero pmd. Yes,
> it is driven by visual inspection.
>
> The behavior before the patch for huge zero page is:
> split huge zero pmd (insert base zero pfn to ptes)
> walk ptes
> skip zero pfn
>
> So why not just skip the huge zero page in the first place?
>
> >
> > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > > ---
> > >  mm/mempolicy.c | 9 +++++----
> > >  1 file changed, 5 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > > index b5f4f584009b..205c1a768775 100644
> > > --- a/mm/mempolicy.c
> > > +++ b/mm/mempolicy.c
> > > @@ -436,7 +436,8 @@ static inline bool queue_pages_required(struct page *page,
> > >
> > >  /*
> > >   * queue_pages_pmd() has four possible return values:
> > > - * 0 - pages are placed on the right node or queued successfully.
> > > + * 0 - pages are placed on the right node or queued successfully, or
> > > + *     special page is met, i.e. huge zero page.
> > >   * 1 - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
> > >   *     specified.
> > >   * 2 - THP was split.
> > > @@ -460,8 +461,7 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
> > >       page = pmd_page(*pmd);
> > >       if (is_huge_zero_page(page)) {
> > >               spin_unlock(ptl);
> > > -             __split_huge_pmd(walk->vma, pmd, addr, false, NULL);
> > > -             ret = 2;
> > > +             walk->action = ACTION_CONTINUE;
> > >               goto out;
> > >       }
> > >       if (!queue_pages_required(page, qp))
> > > @@ -488,7 +488,8 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
> > >   * and move them to the pagelist if they do.
> > >   *
> > >   * queue_pages_pte_range() has three possible return values:
> > > - * 0 - pages are placed on the right node or queued successfully.
> > > + * 0 - pages are placed on the right node or queued successfully, or
> > > + *     special page is met, i.e. zero page.
> > >   * 1 - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
> > >   *     specified.
> > >   * -EIO - only MPOL_MF_STRICT was specified and an existing page was already
> > > --
> > > 2.26.2
> >
> > --
> > Michal Hocko
> > SUSE Labs
Michal Hocko June 7, 2021, 6:55 p.m. UTC | #5
On Mon 07-06-21 10:00:01, Yang Shi wrote:
> On Sun, Jun 6, 2021 at 11:21 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 04-06-21 13:35:13, Yang Shi wrote:
> > > When trying to migrate pages to obey mempolicy, the huge zero page is
> > > split then the page table walk at PTE level just skips zero page.  So it
> > > seems pointless to split huge zero page, it could be just skipped like
> > > base zero page.
> >
> > My THP knowledge is not the best but this is incorrect AIACS. Huge zero
> > page is not split. We do split the pmd which is mapping the said page. I
> > suspect you refer to vm_normal_page when talking about a zero page but
> > please be aware that huge zero page is not a normal zero page. It is
> > allocated dynamically (see get_huge_zero_page).
> 
> For a normal huge page, yes, split_huge_pmd() just splits pmd. But
> actually the base zero pfn will be inserted to PTEs when splitting
> huge zero pmd. Please check __split_huge_zero_page_pmd() out.

My bad. I didn't look all the way down there. The naming suggested that
this is purely a page table operation, and I suspected that the ptes
would just point into the THP at the corresponding offsets.

But I am obviously wrong here. Sorry about that.

> I should make this point clearer in the commit log. Sorry for the confusion.
> 
> >
> > So in the end you patch disables mbind of zero pages to a target node
> > and that is a regression.
> 
> Do we really migrate zero page? IIUC zero page is just skipped by
> vm_normal_page() check in queue_pages_pte_range(), isn't it?

Yeah, normal zero pages are skipped indeed. I haven't studied why this
is the case yet. It surely sounds a bit suspicious, because this is an
explicit request to migrate memory, and if the zero page is misplaced it
should be moved. On the other hand this would increase RSS, so maybe
that is the point.

> > Have you tested the patch?
> 
> No, just build test. I thought this change was straightforward.
> 
> >
> > > Set ACTION_CONTINUE to prevent the walk_page_range() split the pmd for
> > > this case.
> >
> > Btw. this changelog is missing a problem statement. I suspect there is
> > no actual problem that it should fix and it is likely driven by reading
> > the code. Right?
> 
> The actual problem is it is pointless to split a huge zero pmd. Yes,
> it is driven by visual inspection.

Is there any actual workload that cares? This is quite a subtle area, so
I would be careful about making changes just because...
Yang Shi June 7, 2021, 10:02 p.m. UTC | #6
On Mon, Jun 7, 2021 at 11:55 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 07-06-21 10:00:01, Yang Shi wrote:
> > On Sun, Jun 6, 2021 at 11:21 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 04-06-21 13:35:13, Yang Shi wrote:
> > > > When trying to migrate pages to obey mempolicy, the huge zero page is
> > > > split then the page table walk at PTE level just skips zero page.  So it
> > > > seems pointless to split huge zero page, it could be just skipped like
> > > > base zero page.
> > >
> > > My THP knowledge is not the best but this is incorrect AIACS. Huge zero
> > > page is not split. We do split the pmd which is mapping the said page. I
> > > suspect you refer to vm_normal_page when talking about a zero page but
> > > please be aware that huge zero page is not a normal zero page. It is
> > > allocated dynamically (see get_huge_zero_page).
> >
> > For a normal huge page, yes, split_huge_pmd() just splits pmd. But
> > actually the base zero pfn will be inserted to PTEs when splitting
> > huge zero pmd. Please check __split_huge_zero_page_pmd() out.
>
> My bad. I didn't have a look all the way down there. The naming
> suggested that this is purely page table operations and I have suspected
> that ptes just point to the offset of the THP.
>
> But I am obviously wrong here. Sorry about that.
>
> > I should make this point clearer in the commit log. Sorry for the confusion.
> >
> > >
> > > So in the end you patch disables mbind of zero pages to a target node
> > > and that is a regression.
> >
> > Do we really migrate zero page? IIUC zero page is just skipped by
> > vm_normal_page() check in queue_pages_pte_range(), isn't it?
>
> Yeah, normal zero pages are skipped indeed. I haven't studied why this
> is the case yet. It surely sounds a bit suspicious because this is an
> explicit request to migrate memory and if the zero page is misplaced it
> should be moved. On the hand this would increase RSS so maybe this is
> the point.

The zero page is a global shared page; I don't think "misplaced"
applies to it. It doesn't make much sense to migrate a shared page.
Actually, there is a page mapcount check in migrate_page_add() that
skips shared normal pages as well.

>
> > > Have you tested the patch?
> >
> > No, just build test. I thought this change was straightforward.
> >
> > >
> > > > Set ACTION_CONTINUE to prevent the walk_page_range() split the pmd for
> > > > this case.
> > >
> > > Btw. this changelog is missing a problem statement. I suspect there is
> > > no actual problem that it should fix and it is likely driven by reading
> > > the code. Right?
> >
> > The actual problem is it is pointless to split a huge zero pmd. Yes,
> > it is driven by visual inspection.
>
> Is there any actual workload that cares? This is quite a subtle area so
> I would be careful to do changes just because...

I'm not sure whether there is a measurable improvement for actual
workloads, but I believe this change does eliminate some unnecessary
work.

I think the test shown in the previous email gives us some confidence
that the change doesn't introduce a regression.

> --
> Michal Hocko
> SUSE Labs
Michal Hocko June 8, 2021, 6:41 a.m. UTC | #7
On Mon 07-06-21 15:02:39, Yang Shi wrote:
> On Mon, Jun 7, 2021 at 11:55 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 07-06-21 10:00:01, Yang Shi wrote:
> > > On Sun, Jun 6, 2021 at 11:21 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Fri 04-06-21 13:35:13, Yang Shi wrote:
> > > > > When trying to migrate pages to obey mempolicy, the huge zero page is
> > > > > split then the page table walk at PTE level just skips zero page.  So it
> > > > > seems pointless to split huge zero page, it could be just skipped like
> > > > > base zero page.
> > > >
> > > > My THP knowledge is not the best but this is incorrect AIACS. Huge zero
> > > > page is not split. We do split the pmd which is mapping the said page. I
> > > > suspect you refer to vm_normal_page when talking about a zero page but
> > > > please be aware that huge zero page is not a normal zero page. It is
> > > > allocated dynamically (see get_huge_zero_page).
> > >
> > > For a normal huge page, yes, split_huge_pmd() just splits pmd. But
> > > actually the base zero pfn will be inserted to PTEs when splitting
> > > huge zero pmd. Please check __split_huge_zero_page_pmd() out.
> >
> > My bad. I didn't have a look all the way down there. The naming
> > suggested that this is purely page table operations and I have suspected
> > that ptes just point to the offset of the THP.
> >
> > But I am obviously wrong here. Sorry about that.
> >
> > > I should make this point clearer in the commit log. Sorry for the confusion.
> > >
> > > >
> > > > So in the end you patch disables mbind of zero pages to a target node
> > > > and that is a regression.
> > >
> > > Do we really migrate zero page? IIUC zero page is just skipped by
> > > vm_normal_page() check in queue_pages_pte_range(), isn't it?
> >
> > Yeah, normal zero pages are skipped indeed. I haven't studied why this
> > is the case yet. It surely sounds a bit suspicious because this is an
> > explicit request to migrate memory and if the zero page is misplaced it
> > should be moved. On the hand this would increase RSS so maybe this is
> > the point.
> 
> The zero page is a global shared page, I don't think "misplace"
> applies to it. It doesn't make too much sense to migrate a shared
> page. Actually there is page mapcount check in migrate_page_add() to
> skip shared normal pages as well.

I didn't really mean to migrate the zero page itself. What I meant was
to instantiate a new page when the global one is on a different NUMA
node than the one bind() requests. This could be done either by having
a per-NUMA-node zero page or by simply allocating a new page for the
exclusive mapping.

> > > > Have you tested the patch?
> > >
> > > No, just build test. I thought this change was straightforward.
> > >
> > > >
> > > > > Set ACTION_CONTINUE to prevent the walk_page_range() split the pmd for
> > > > > this case.
> > > >
> > > > Btw. this changelog is missing a problem statement. I suspect there is
> > > > no actual problem that it should fix and it is likely driven by reading
> > > > the code. Right?
> > >
> > > The actual problem is it is pointless to split a huge zero pmd. Yes,
> > > it is driven by visual inspection.
> >
> > Is there any actual workload that cares? This is quite a subtle area so
> > I would be careful to do changes just because...
> 
> I'm not sure whether there is measurable improvement for actual
> workloads, but I believe this change does eliminate some unnecessary
> work.

I can see why being consistent here is a good argument. On the other
hand it would be IMHO better to look for reasons why zero pages are left
misplaced before making the code consistent. From a very quick git
archeology it seems that vm_normal_page has been used since MPOL_MF_MOVE
was introduced. At the time (dc9aa5b9d65fd) vm_normal_page hadn't
skipped the zero page AFAICS. I do not remember all the details of the
zero page (wrt. pte special) handling though, so it might be hidden
somewhere else.

In any case the existing code doesn't really work properly. The question
is whether anybody actually cares, but this is definitely something worth
looking into IMHO.
 
> I think the test shown in the previous email gives us some confidence
> that the change doesn't have regression.

Yes, this is true.
Yang Shi June 8, 2021, 5:15 p.m. UTC | #8
On Mon, Jun 7, 2021 at 11:41 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 07-06-21 15:02:39, Yang Shi wrote:
> > On Mon, Jun 7, 2021 at 11:55 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 07-06-21 10:00:01, Yang Shi wrote:
> > > > On Sun, Jun 6, 2021 at 11:21 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Fri 04-06-21 13:35:13, Yang Shi wrote:
> > > > > > When trying to migrate pages to obey mempolicy, the huge zero page is
> > > > > > split then the page table walk at PTE level just skips zero page.  So it
> > > > > > seems pointless to split huge zero page, it could be just skipped like
> > > > > > base zero page.
> > > > >
> > > > > My THP knowledge is not the best but this is incorrect AIACS. Huge zero
> > > > > page is not split. We do split the pmd which is mapping the said page. I
> > > > > suspect you refer to vm_normal_page when talking about a zero page but
> > > > > please be aware that huge zero page is not a normal zero page. It is
> > > > > allocated dynamically (see get_huge_zero_page).
> > > >
> > > > For a normal huge page, yes, split_huge_pmd() just splits pmd. But
> > > > actually the base zero pfn will be inserted to PTEs when splitting
> > > > huge zero pmd. Please check __split_huge_zero_page_pmd() out.
> > >
> > > My bad. I didn't have a look all the way down there. The naming
> > > suggested that this is purely page table operations and I have suspected
> > > that ptes just point to the offset of the THP.
> > >
> > > But I am obviously wrong here. Sorry about that.
> > >
> > > > I should make this point clearer in the commit log. Sorry for the confusion.
> > > >
> > > > >
> > > > > So in the end you patch disables mbind of zero pages to a target node
> > > > > and that is a regression.
> > > >
> > > > Do we really migrate zero page? IIUC zero page is just skipped by
> > > > vm_normal_page() check in queue_pages_pte_range(), isn't it?
> > >
> > > Yeah, normal zero pages are skipped indeed. I haven't studied why this
> > > is the case yet. It surely sounds a bit suspicious because this is an
> > > explicit request to migrate memory and if the zero page is misplaced it
> > > should be moved. On the hand this would increase RSS so maybe this is
> > > the point.
> >
> > The zero page is a global shared page, I don't think "misplace"
> > applies to it. It doesn't make too much sense to migrate a shared
> > page. Actually there is page mapcount check in migrate_page_add() to
> > skip shared normal pages as well.
>
> I didn't really mean to migrate zero page itself. What I meant was to
> instanciate a new page when the global one is on a different NUMA node
> than the bind() requests. This can be either done by having per NUMA
> zero page or simply allocate a new page for the exclusive mapping.

IMHO, isn't that overkill?

>
> > > > > Have you tested the patch?
> > > >
> > > > No, just build test. I thought this change was straightforward.
> > > >
> > > > >
> > > > > > Set ACTION_CONTINUE to prevent the walk_page_range() split the pmd for
> > > > > > this case.
> > > > >
> > > > > Btw. this changelog is missing a problem statement. I suspect there is
> > > > > no actual problem that it should fix and it is likely driven by reading
> > > > > the code. Right?
> > > >
> > > > The actual problem is it is pointless to split a huge zero pmd. Yes,
> > > > it is driven by visual inspection.
> > >
> > > Is there any actual workload that cares? This is quite a subtle area so
> > > I would be careful to do changes just because...
> >
> > I'm not sure whether there is measurable improvement for actual
> > workloads, but I believe this change does eliminate some unnecessary
> > work.
>
> I can see why being consistent here is a good argument. On the other
> hand it would be imho better to look for reasons why zero pages are left
> misplaced before making the code consistent. From a very quick git

Typically the zero page is allocated in the kernel's BSS section, for
example on x86. I suppose the kernel image itself is always loaded on
node #0.

> archeology it seems that vm_normal_page has been used since MPOL_MF_MOVE
> was introduced. At the time (dc9aa5b9d65fd) vm_normal_page hasn't
> skipped through zero page AFAICS. I do not remember all the details
> about zero page (wrt. pte special) handling though so it might be hidden
> at some other place.

I did some archeology; the findings are:

The zero page has the PageReserved flag set, so it has been skipped by
the explicit PageReserved check in mempolicy.c since commit f4598c8b3678
("[PATCH] migration: make sure there is no attempt to migrate reserved
pages."). The zero page was no longer used by do_anonymous_page() as of
2.6.24 (commit 557ed1fa2620, "remove ZERO_PAGE"), then reinstated by
commit a13ea5b759645 ("mm: reinstate ZERO_PAGE"), which also added the
zero page check to vm_normal_page(); so since then mempolicy no longer
depends on the PageReserved check to skip the zero page.

So the zero page is skipped by mempolicy.c since 2.6.16.

>
> In any case the existing code doesn't really work properly. The question
> is whether anybody actually cares but this is definitely something worth
> looking into IMHO.
>
> > I think the test shown in the previous email gives us some confidence
> > that the change doesn't have regression.
>
> Yes, this is true.
> --
> Michal Hocko
> SUSE Labs
Michal Hocko June 8, 2021, 5:49 p.m. UTC | #9
On Tue 08-06-21 10:15:36, Yang Shi wrote:
[...]
> I did some archeology, the findings are:
> 
> The zero page has PageReserved flag set, it was skipped by the
> explicit PageReserved check in mempolicy.c since commit f4598c8b3678
> ("[PATCH] migration: make sure there is no attempt to migrate reserved
> pages."). The zero page was not used anymore by do_anonymous_page()
> since 2.6.24 by commit 557ed1fa2620 ("remove ZERO_PAGE"), then
> reinstated by commit a13ea5b759645 ("mm: reinstate ZERO_PAGE") and
> this commit added zero page check in vm_normal_page(), so mempolicy
> doesn't depend on PageReserved check to skip zero page anymore since
> then.
> 
> So the zero page is skipped by mempolicy.c since 2.6.16.

Thanks a lot! This is really useful. Can you just add it to the
changelog so others do not have to go through the painful archeology?

With that, feel free to add
Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!
Yang Shi June 8, 2021, 7:36 p.m. UTC | #10
On Tue, Jun 8, 2021 at 10:49 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 08-06-21 10:15:36, Yang Shi wrote:
> [...]
> > I did some archeology, the findings are:
> >
> > The zero page has PageReserved flag set, it was skipped by the
> > explicit PageReserved check in mempolicy.c since commit f4598c8b3678
> > ("[PATCH] migration: make sure there is no attempt to migrate reserved
> > pages."). The zero page was not used anymore by do_anonymous_page()
> > since 2.6.24 by commit 557ed1fa2620 ("remove ZERO_PAGE"), then
> > reinstated by commit a13ea5b759645 ("mm: reinstate ZERO_PAGE") and
> > this commit added zero page check in vm_normal_page(), so mempolicy
> > doesn't depend on PageReserved check to skip zero page anymore since
> > then.
> >
> > So the zero page is skipped by mempolicy.c since 2.6.16.
>
> Thanks a lot! This is really useful. Can you just add it to the
> changelog so others do not have to go through the painful archeology.
>
> With that, feel free to add
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks. Will add that into v2.

>
> Thanls!
> --
> Michal Hocko
> SUSE Labs

Patch

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b5f4f584009b..205c1a768775 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -436,7 +436,8 @@  static inline bool queue_pages_required(struct page *page,
 
 /*
  * queue_pages_pmd() has four possible return values:
- * 0 - pages are placed on the right node or queued successfully.
+ * 0 - pages are placed on the right node or queued successfully, or
+ *     special page is met, i.e. huge zero page.
  * 1 - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
  *     specified.
  * 2 - THP was split.
@@ -460,8 +461,7 @@  static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
 	page = pmd_page(*pmd);
 	if (is_huge_zero_page(page)) {
 		spin_unlock(ptl);
-		__split_huge_pmd(walk->vma, pmd, addr, false, NULL);
-		ret = 2;
+		walk->action = ACTION_CONTINUE;
 		goto out;
 	}
 	if (!queue_pages_required(page, qp))
@@ -488,7 +488,8 @@  static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
  * and move them to the pagelist if they do.
  *
  * queue_pages_pte_range() has three possible return values:
- * 0 - pages are placed on the right node or queued successfully.
+ * 0 - pages are placed on the right node or queued successfully, or
+ *     special page is met, i.e. zero page.
  * 1 - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
  *     specified.
  * -EIO - only MPOL_MF_STRICT was specified and an existing page was already