Message ID | 20190221094212.16906-1-osalvador@suse.de (mailing list archive)
State      | New, archived
Series     | [RFC] mm,memory_hotplug: Unlock 1GB-hugetlb on x86_64
On 2/21/19 1:42 AM, Oscar Salvador wrote:
> On x86_64, 1GB-hugetlb pages could never be offlined because hugepage_migration_supported() returned false for PUD_SHIFT. So whenever we wanted to offline a memblock containing a gigantic hugetlb page, we never got beyond the has_unmovable_pages() check. This changed with [1], where we now also return true for PUD_SHIFT.
>
> After that patch, the checks in has_unmovable_pages() and scan_movable_pages() returned true, but we still had a final barrier in do_migrate_range():
>
>     if (compound_order(head) > PFN_SECTION_SHIFT) {
>         ret = -EBUSY;
>         break;
>     }
>
> This is not really nice, and we do not really need it. It is perfectly possible to migrate a gigantic page as long as another node has a spare gigantic page for us. In alloc_huge_page_nodemask(), we calculate the __real__ number of free pages, and if there are any, we try to dequeue one from another node.
>
> This all works fine when we do have another node with a spare gigantic page, but if that is not the case, alloc_huge_page_nodemask() ends up calling alloc_migrate_huge_page(), which bails out if the wanted page is gigantic. That is mainly because finding 1GB (or even 16GB on powerpc) of contiguous memory is quite unlikely when the system has been running for a while.

I suspect the reason for the check is that it was there before the ability to migrate gigantic pages was added, and nobody thought to remove it. As you say, the likelihood of finding a gigantic page after running for some time is not too good. I wonder if we should remove that check? Just trying to create a gigantic page could result in a bunch of migrations which could impact the system. But, this is the result of a memory offline operation, which one would expect to have some negative impact.

> In that situation, we will keep looping forever because scan_movable_pages() will give us the same page and we will fail again, because there is no node we can dequeue a gigantic page from. This is not nice, and I wish we could differentiate a fatal error from a transient error in do_migrate_range()->migrate_pages(), but I do not really see a way to do that now.

Michal may have some thoughts here. Note that the repeat loop does not even consider the return value from do_migrate_range(). Since this is the result of an offline, I am thinking it was designed to retry forever. But perhaps there are some errors/ret codes where we should give up?

> Anyway, I would tend to say that this is the administrator's job: to make sure that the system can keep up with the memory to be offlined. That would mean that if we want to use gigantic pages, we make sure that the other nodes have at least enough gigantic pages to keep up in case we need to offline memory.
>
> Just for the sake of completeness, this is one of the tests done:
>
>     # echo 1 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
>     # echo 1 > /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/nr_hugepages
>
>     # cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
>     1
>     # cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
>     1
>
>     # cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/nr_hugepages
>     1
>     # cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/free_hugepages
>     1
>
> (hugetlb1gb is a program that maps a 1GB region using MAP_HUGE_1GB)
>
>     # numactl -m 1 ./hugetlb1gb
>     # cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
>     0
>     # cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/free_hugepages
>     1
>
>     # offline node1 memory
>     # cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/free_hugepages
>     0
>
> [1] https://lore.kernel.org/patchwork/patch/998796/
>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  mm/memory_hotplug.c | 7 +------
>  1 file changed, 1 insertion(+), 6 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index d5f7afda67db..04f6695b648c 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1337,8 +1337,7 @@ static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
>  		if (!PageHuge(page))
>  			continue;
>  		head = compound_head(page);
> -		if (hugepage_migration_supported(page_hstate(head)) &&
> -		    page_huge_active(head))
> +		if (page_huge_active(head))

I'm confused as to why the removal of the hugepage_migration_supported() check is required. Seems that commit aa9d95fa40a2 ("mm/hugetlb: enable arch specific huge page size support for migration") should make the check work as desired for all architectures.
On Thu, Feb 21, 2019 at 02:12:19PM -0800, Mike Kravetz wrote:
> I suspect the reason for the check is that it was there before the ability to migrate gigantic pages was added, and nobody thought to remove it. As you say, the likelihood of finding a gigantic page after running for some time is not too good. I wonder if we should remove that check? Just trying to create a gigantic page could result in a bunch of migrations which could impact the system. But, this is the result of a memory offline operation, which one would expect to have some negative impact.

The check was introduced by commit ab5ac90aecf56 ("mm, hugetlb: do not rely on overcommit limit during migration"), but I would have to do some research into the changes that came after. I am not really sure about removing the check. I can see that it is perfectly fine to migrate gigantic pages as long as the other nodes can back us up, but trying to allocate them at runtime seems likely to fail more often than succeed. I might be wrong, of course. I would rather leave it as it is.

> > In that situation, we will keep looping forever because scan_movable_pages() will give us the same page and we will fail again, because there is no node we can dequeue a gigantic page from. This is not nice, and I wish we could differentiate a fatal error from a transient error in do_migrate_range()->migrate_pages(), but I do not really see a way to do that now.
>
> Michal may have some thoughts here. Note that the repeat loop does not even consider the return value from do_migrate_range(). Since this is the result of an offline, I am thinking it was designed to retry forever. But perhaps there are some errors/ret codes where we should give up?

Well, it has changed a bit over time. It used to be a sort of retry-timer before, where we bailed out after a while. But that turned out to be too easy to fail, and the timer logic was removed in commit ecde0f3e7f9ed ("mm, memory_hotplug: remove timeout from __offline_memory").

I think that returning a valuable error code from migrate_pages() back to do_migrate_range() has always been a bit difficult. What should be considered a fatal error? For the purpose here, the error we would return is -ENOMEM when we do not have nodes containing spare gigantic pages. Maybe that could be one of the reasons to bail out: if we are short of memory, offlining more memory will not do anything but apply more pressure to the system. But I am a bit worried about actually starting to back off due to that, since at the moment the only way to back off from an offlining operation is to send a signal to the process.

I would have to think a bit more, but another possibility that comes to my mind is:

*) Try to check whether the hstate has free pages in has_unmovable_pages(). If not, report the gigantic page as unmovable. This would follow the hugepage_migration_supported() check in has_unmovable_pages().

If not, as I said, we could leave it as it is. It should be the sysadmin's responsibility to check in advance that the system is ready to take over the memory to be offlined.

> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index d5f7afda67db..04f6695b648c 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1337,8 +1337,7 @@ static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
> >  		if (!PageHuge(page))
> >  			continue;
> >  		head = compound_head(page);
> > -		if (hugepage_migration_supported(page_hstate(head)) &&
> > -		    page_huge_active(head))
> > +		if (page_huge_active(head))
>
> I'm confused as to why the removal of the hugepage_migration_supported() check is required. Seems that commit aa9d95fa40a2 ("mm/hugetlb: enable arch specific huge page size support for migration") should make the check work as desired for all architectures.

has_unmovable_pages() should already cover us in case the hstate is not migratable:

	if (PageHuge(page)) {
		struct page *head = compound_head(page);
		unsigned int skip_pages;

		if (!hugepage_migration_supported(page_hstate(head)))
			goto unmovable;

		skip_pages = (1 << compound_order(head)) - (page - head);
		iter += skip_pages - 1;
		continue;
	}

Should the hstate not be migratable, we report unmovable pages within the range, start_isolate_page_range() will report the failure to __offline_pages(), and so we will not go further.
On Thu, Feb 21, 2019 at 10:42:12AM +0100, Oscar Salvador wrote:
> [1] https://lore.kernel.org/patchwork/patch/998796/
>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

Any further comments on this? I do have a "concern" I would like to sort out before dropping the RFC:

It is the fact that unless we have spare gigantic pages on other nodes, the offlining operation will loop forever (until the user cancels the operation). While I do not really like that, I do think that memory offlining should be done with some sanity, and the administrator should know in advance whether the system is going to be able to keep up with the memory pressure, aka: make sure we have what we need in order to make the offlining operation succeed. That translates to making sure that we have spare gigantic pages and that other nodes can take them.

That being said, another thing I thought about is that we could check whether we have spare gigantic pages at has_unmovable_pages() time. Something like checking "h->free_huge_pages - h->resv_huge_pages > 0", and if it turns out that we do not have gigantic pages anywhere, just return as if we had non-movable pages. But I would rather not convolute has_unmovable_pages() with such checks and "trust" the administrator.
> ---
>  mm/memory_hotplug.c | 7 +------
>  1 file changed, 1 insertion(+), 6 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index d5f7afda67db..04f6695b648c 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1337,8 +1337,7 @@ static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
>  		if (!PageHuge(page))
>  			continue;
>  		head = compound_head(page);
> -		if (hugepage_migration_supported(page_hstate(head)) &&
> -		    page_huge_active(head))
> +		if (page_huge_active(head))
>  			return pfn;
>  		skip = (1 << compound_order(head)) - (page - head);
>  		pfn += skip - 1;
> @@ -1378,10 +1377,6 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>
>  	if (PageHuge(page)) {
>  		struct page *head = compound_head(page);
> -		if (compound_order(head) > PFN_SECTION_SHIFT) {
> -			ret = -EBUSY;
> -			break;
> -		}
>  		pfn = page_to_pfn(head) + (1<<compound_order(head)) - 1;
>  		isolate_huge_page(head, &source);
>  		continue;
> --
> 2.13.7
On 2/27/19 1:51 PM, Oscar Salvador wrote:
> On Thu, Feb 21, 2019 at 10:42:12AM +0100, Oscar Salvador wrote:
>> [1] https://lore.kernel.org/patchwork/patch/998796/
>>
>> Signed-off-by: Oscar Salvador <osalvador@suse.de>
>
> Any further comments on this? I do have a "concern" I would like to sort out before dropping the RFC:
>
> It is the fact that unless we have spare gigantic pages on other nodes, the offlining operation will loop forever (until the user cancels the operation). While I do not really like that, I do think that memory offlining should be done with some sanity, and the administrator should know in advance whether the system is going to be able to keep up with the memory pressure, aka: make sure we have what we need in order to make the offlining operation succeed. That translates to making sure that we have spare gigantic pages and that other nodes can take them.
>
> That being said, another thing I thought about is that we could check whether we have spare gigantic pages at has_unmovable_pages() time. Something like checking "h->free_huge_pages - h->resv_huge_pages > 0", and if it turns out that we do not have gigantic pages anywhere, just return as if we had non-movable pages.

Of course, that check would be racy. Even if there is an available gigantic page at has_unmovable_pages() time, there is no guarantee it will be there when we want to allocate/use it. But you would at least catch 'most' cases of looping forever.

> But I would rather not convolute has_unmovable_pages() with such checks and "trust" the administrator.

Agree
On 27.02.19 23:00, Mike Kravetz wrote:
> On 2/27/19 1:51 PM, Oscar Salvador wrote:
>> [...]
>> That being said, another thing I thought about is that we could check whether we have spare gigantic pages at has_unmovable_pages() time. Something like checking "h->free_huge_pages - h->resv_huge_pages > 0", and if it turns out that we do not have gigantic pages anywhere, just return as if we had non-movable pages.
>
> Of course, that check would be racy. Even if there is an available gigantic page at has_unmovable_pages() time, there is no guarantee it will be there when we want to allocate/use it. But you would at least catch 'most' cases of looping forever.
>
>> But I would rather not convolute has_unmovable_pages() with such checks and "trust" the administrator.

I think we have the exact same issue already with huge/ordinary pages if we are low on memory. We could loop forever.

In the long run, we should properly detect such issues and abort instead of looping forever, I guess. But as we all know, error handling in the whole offlining part is still far away from being perfect ...
On Thu 28-02-19 08:38:34, David Hildenbrand wrote:
> On 27.02.19 23:00, Mike Kravetz wrote:
>> [...]
>
> I think we have the exact same issue already with huge/ordinary pages if we are low on memory. We could loop forever.
>
> In the long run, we should properly detect such issues and abort instead of looping forever, I guess. But as we all know, error handling in the whole offlining part is still far away from being perfect ...

Migration allocation callbacks use __GFP_RETRY_MAYFAIL to not be disruptive, so they do not trigger the OOM killer and rely on somebody else to pull the trigger instead. This means that if there is no other activity on the system, the hotplug migration would just loop forever, or until interrupted by userspace. The latter is important: the user might define a policy for when to terminate, and keeping retrying is not necessarily a wrong thing. One can simply do

	timeout $TIMEOUT echo 0 > $PATH_TO_MEMBLOCK/online

and then ENOMEM handling is not important. But I can see how people might want to bail out early instead. So I do not have a strong opinion here. We can try to consider ENOMEM from the migration as a hard failure, bail out, and see whether that works in practice.
On Thu 21-02-19 10:42:12, Oscar Salvador wrote:
[...]
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index d5f7afda67db..04f6695b648c 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1337,8 +1337,7 @@ static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
>  		if (!PageHuge(page))
>  			continue;
>  		head = compound_head(page);
> -		if (hugepage_migration_supported(page_hstate(head)) &&
> -		    page_huge_active(head))
> +		if (page_huge_active(head))
>  			return pfn;
>  		skip = (1 << compound_order(head)) - (page - head);
>  		pfn += skip - 1;

Is this part correct? Say we have a gigantic page which is migratable. Now scan_movable_pages() would skip it and we will not migrate it, no?

> @@ -1378,10 +1377,6 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>
>  	if (PageHuge(page)) {
>  		struct page *head = compound_head(page);
> -		if (compound_order(head) > PFN_SECTION_SHIFT) {
> -			ret = -EBUSY;
> -			break;
> -		}
>  		pfn = page_to_pfn(head) + (1<<compound_order(head)) - 1;
>  		isolate_huge_page(head, &source);
>  		continue;

I think it would be much easier to have only this check removed in this patch, because it is obviously bogus and wrong as well. The other check might be considered in a separate patch.
On Thu, Feb 28, 2019 at 10:21:54AM +0100, Michal Hocko wrote:
> Is this part correct? Say we have a gigantic page which is migratable. Now scan_movable_pages() would skip it and we will not migrate it, no?

All non-migratable hugepages should have been caught in has_unmovable_pages():

	if (PageHuge(page)) {
		struct page *head = compound_head(page);
		unsigned int skip_pages;

		if (!hugepage_migration_supported(page_hstate(head)))
			goto unmovable;

So there is no need to check again for migratability here, as it is something that does not change. To put it another way: all huge pages found in scan_movable_pages() should be migratable. In scan_movable_pages() we just need to check whether the hugepage, gigantic or not, is in use (aka active) in order to migrate it.

> I think it would be much easier to have only this check removed in this patch, because it is obviously bogus and wrong as well. The other check might be considered in a separate patch.

I do not have an issue with sending both changes separately. I mean, this check is the one we need to remove in order to let 1GB-hugetlb offlining proceed. The check removed from scan_movable_pages() is only removed because it is redundant: we already checked for that condition in has_unmovable_pages() (when isolating the range).
On Thu 28-02-19 10:41:08, Oscar Salvador wrote:
> All non-migratable hugepages should have been caught in has_unmovable_pages(). So there is no need to check again for migratability here, as it is something that does not change. To put it another way: all huge pages found in scan_movable_pages() should be migratable. In scan_movable_pages() we just need to check whether the hugepage, gigantic or not, is in use (aka active) in order to migrate it.

You seem to have missed my point, or I am wrong here. If scan_movable_pages() skips over a hugetlb page, then nothing will migrate it; it will stay in the pfn range and the range will not become idle.
On Thu, Feb 28, 2019 at 10:55:35AM +0100, Michal Hocko wrote:
> You seem to have missed my point, or I am wrong here. If scan_movable_pages() skips over a hugetlb page, then nothing will migrate it; it will stay in the pfn range and the range will not become idle.

I might be misunderstanding you, but I am not sure I get you.

scan_movable_pages() can either skip a hugetlb page or not. In case it does, pfn will be incremented to skip the whole hugetlb range. If that happens, pfn will hold the next non-hugetlb page. If it happens that the end of the hugetlb page is also the end of the memory range, scan_movable_pages() will return 0 and we will eventually break the loop in __offline_pages().

If this is not what you meant, could you please elaborate your concern a bit more?
On Thu 28-02-19 11:19:52, Oscar Salvador wrote:
> I might be misunderstanding you, but I am not sure I get you.
>
> scan_movable_pages() can either skip a hugetlb page or not. In case it does, pfn will be incremented to skip the whole hugetlb range. If that happens, pfn will hold the next non-hugetlb page.

And as a result the previous hugetlb page doesn't get migrated, right? What does that mean? Well, the page is still in use and we cannot proceed with offlining because the full range is not isolated, right?
On Thu, Feb 28, 2019 at 01:11:15PM +0100, Michal Hocko wrote:
> And as a result the previous hugetlb page doesn't get migrated, right? What does that mean? Well, the page is still in use and we cannot proceed with offlining because the full range is not isolated, right?

I might be clumsy today, but I still fail to see the point of concern here. Let us start from the beginning.

start_isolate_page_range() will mark the range as isolated unless we happen to have unmovable pages within it (for the exercise here, those would be non-migratable hugetlb pages). If we pass that point, it means that all hugetlb pages found can really be migrated.

Later, scan_movable_pages() will scan them, and it will only take those that are in use (active), as those are the ones we are interested in. We will skip those that are not being used (non-active).

If it happens that we skip a hugetlb page and return the next non-hugetlb page to be migrated, do_migrate_range() will proceed as usual, and eventually we will break the main loop once the whole range has been scanned, etc. If it happens that the whole range spans a gigantic hugetlb page and it is not in use, we will skip the whole range and break the main loop by returning "0".

Either way, I do not see how this changes the picture.
On Thu 28-02-19 14:40:54, Oscar Salvador wrote:
> I might be clumsy today, but I still fail to see the point of concern here.

No, it's me who is daft. I had misread the patch and thought that page_huge_active had also been removed. Now it makes perfect sense to me, because active pages are still handled properly.

I will leave the decision whether to split up the patch to you.

Acked-by: Michal Hocko <mhocko@suse.com>

and sorry for being dense here.
On Thu, Feb 28, 2019 at 03:08:17PM +0100, Michal Hocko wrote:
> No, it's me who is daft. I had misread the patch and thought that page_huge_active had also been removed. Now it makes perfect sense to me, because active pages are still handled properly.

Heh, no worries. Glad we got to the point; I was just scratching my head like a monkey.

> I will leave the decision whether to split up the patch to you.

On second thought, I will split it up. One of the changes merely removes a redundant check, while the other is the one that actually enables the system to proceed with gigantic pages, so they are not really related.

> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d5f7afda67db..04f6695b648c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1337,8 +1337,7 @@ static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
 		if (!PageHuge(page))
 			continue;
 		head = compound_head(page);
-		if (hugepage_migration_supported(page_hstate(head)) &&
-		    page_huge_active(head))
+		if (page_huge_active(head))
 			return pfn;
 		skip = (1 << compound_order(head)) - (page - head);
 		pfn += skip - 1;
@@ -1378,10 +1377,6 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)

 	if (PageHuge(page)) {
 		struct page *head = compound_head(page);
-		if (compound_order(head) > PFN_SECTION_SHIFT) {
-			ret = -EBUSY;
-			break;
-		}
 		pfn = page_to_pfn(head) + (1<<compound_order(head)) - 1;
 		isolate_huge_page(head, &source);
 		continue;
On x86_64, 1GB-hugetlb pages could never be offlined because hugepage_migration_supported() returned false for PUD_SHIFT. So whenever we wanted to offline a memblock containing a gigantic hugetlb page, we never got beyond the has_unmovable_pages() check. This changed with [1], where we now also return true for PUD_SHIFT.

After that patch, the checks in has_unmovable_pages() and scan_movable_pages() returned true, but we still had a final barrier in do_migrate_range():

	if (compound_order(head) > PFN_SECTION_SHIFT) {
		ret = -EBUSY;
		break;
	}

This is not really nice, and we do not really need it. It is perfectly possible to migrate a gigantic page as long as another node has a spare gigantic page for us. In alloc_huge_page_nodemask(), we calculate the __real__ number of free pages, and if there are any, we try to dequeue one from another node.

This all works fine when we do have another node with a spare gigantic page, but if that is not the case, alloc_huge_page_nodemask() ends up calling alloc_migrate_huge_page(), which bails out if the wanted page is gigantic. That is mainly because finding 1GB (or even 16GB on powerpc) of contiguous memory is quite unlikely when the system has been running for a while.

In that situation, we will keep looping forever because scan_movable_pages() will give us the same page and we will fail again, because there is no node we can dequeue a gigantic page from. This is not nice, and I wish we could differentiate a fatal error from a transient error in do_migrate_range()->migrate_pages(), but I do not really see a way to do that now.

Anyway, I would tend to say that this is the administrator's job: to make sure that the system can keep up with the memory to be offlined. That would mean that if we want to use gigantic pages, we make sure that the other nodes have at least enough gigantic pages to keep up in case we need to offline memory.

Just for the sake of completeness, this is one of the tests done:

	# echo 1 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
	# echo 1 > /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/nr_hugepages

	# cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
	1
	# cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
	1

	# cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/nr_hugepages
	1
	# cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/free_hugepages
	1

(hugetlb1gb is a program that maps a 1GB region using MAP_HUGE_1GB)

	# numactl -m 1 ./hugetlb1gb
	# cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
	0
	# cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/free_hugepages
	1

	# offline node1 memory
	# cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/free_hugepages
	0

[1] https://lore.kernel.org/patchwork/patch/998796/

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/memory_hotplug.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)