Message ID | 20190221094212.16906-1-osalvador@suse.de (mailing list archive)
State      | New, archived
Series     | [RFC] mm,memory_hotplug: Unlock 1GB-hugetlb on x86_64
On 2/21/19 1:42 AM, Oscar Salvador wrote:
> On x86_64, 1GB-hugetlb pages could never be offlined because hugepage_migration_supported() returned false for PUD_SHIFT. So whenever we wanted to offline a memblock containing a gigantic hugetlb page, we never got beyond the has_unmovable_pages() check. This changed with [1], where we now also return true for PUD_SHIFT.
>
> After that patch, the checks in has_unmovable_pages() and scan_movable_pages() returned true, but we still had a final barrier in do_migrate_range():
>
>     if (compound_order(head) > PFN_SECTION_SHIFT) {
>         ret = -EBUSY;
>         break;
>     }
>
> This is not really nice, and we do not really need it. It is perfectly possible to migrate a gigantic page as long as another node has a spare gigantic page for us. In alloc_huge_page_nodemask(), we calculate the __real__ number of free pages, and if there are any, we try to dequeue one from another node.
>
> This all works fine when we do have another node with a spare gigantic page, but if that is not the case, alloc_huge_page_nodemask() ends up calling alloc_migrate_huge_page(), which bails out if the wanted page is gigantic. That is mainly because finding 1GB (or even 16GB on powerpc) of contiguous memory is quite unlikely when the system has been running for a while.

I suspect the reason for the check is that it was there before the ability to migrate gigantic pages was added, and nobody thought to remove it. As you say, the likelihood of finding a gigantic page after running for some time is not too good. I wonder if we should remove that check? Just trying to create a gigantic page could result in a bunch of migrations which could impact the system. But, this is the result of a memory offline operation, which one would expect to have some negative impact.

> In that situation, we will keep looping forever because scan_movable_pages() will give us the same page and we will fail again, because there is no node we can dequeue a gigantic page from. This is not nice, and I wish we could differentiate a fatal error from a transient error in do_migrate_range()->migrate_pages(), but I do not really see a way to do that now.

Michal may have some thoughts here. Note that the repeat loop does not even consider the return value from do_migrate_range(). Since this is the result of an offline, I am thinking it was designed to retry forever. But perhaps there are some errors/ret codes where we should give up?

> Anyway, I would tend to say that this is the administrator's job: to make sure that the system can keep up with the memory to be offlined. That would mean that if we want to use gigantic pages, we make sure that the other nodes have at least enough gigantic pages to keep up in case we need to offline memory.
>
> Just for the sake of completeness, this is one of the tests done:
>
>     # echo 1 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
>     # echo 1 > /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/nr_hugepages
>
>     # cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
>     1
>     # cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
>     1
>
>     # cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/nr_hugepages
>     1
>     # cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/free_hugepages
>     1
>
> (hugetlb1gb is a program that maps a 1GB region using MAP_HUGE_1GB)
>
>     # numactl -m 1 ./hugetlb1gb
>     # cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
>     0
>     # cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/free_hugepages
>     1
>
>     # offline node1 memory
>     # cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/free_hugepages
>     0
>
> [1] https://lore.kernel.org/patchwork/patch/998796/
>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  mm/memory_hotplug.c | 7 +------
>  1 file changed, 1 insertion(+), 6 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index d5f7afda67db..04f6695b648c 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1337,8 +1337,7 @@ static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
>  		if (!PageHuge(page))
>  			continue;
>  		head = compound_head(page);
> -		if (hugepage_migration_supported(page_hstate(head)) &&
> -		    page_huge_active(head))
> +		if (page_huge_active(head))

I'm confused as to why the removal of the hugepage_migration_supported() check is required. Seems that commit aa9d95fa40a2 ("mm/hugetlb: enable arch specific huge page size support for migration") should make the check work as desired for all architectures.
On Thu, Feb 21, 2019 at 02:12:19PM -0800, Mike Kravetz wrote:
> I suspect the reason for the check is that it was there before the ability to migrate gigantic pages was added, and nobody thought to remove it. As you say, the likelihood of finding a gigantic page after running for some time is not too good. I wonder if we should remove that check? Just trying to create a gigantic page could result in a bunch of migrations which could impact the system. But, this is the result of a memory offline operation, which one would expect to have some negative impact.

The check was introduced by commit ab5ac90aecf56 ("mm, hugetlb: do not rely on overcommit limit during migration"), but I would have to do some research into the changes that came after. I am not really sure about removing the check. I can see that it is perfectly fine to migrate gigantic pages as long as the other nodes can back us up, but trying to allocate them at runtime seems likely to fail more often than succeed. I might be wrong, of course. I would rather leave it as it is.

> > In that situation, we will keep looping forever because scan_movable_pages() will give us the same page and we will fail again, because there is no node we can dequeue a gigantic page from. This is not nice, and I wish we could differentiate a fatal error from a transient error in do_migrate_range()->migrate_pages(), but I do not really see a way to do that now.
>
> Michal may have some thoughts here. Note that the repeat loop does not even consider the return value from do_migrate_range(). Since this is the result of an offline, I am thinking it was designed to retry forever. But perhaps there are some errors/ret codes where we should give up?

Well, it has changed a bit over time. It used to be a sort of retry-timer before, where we bailed out after a while. But that turned out to be too easy to fail, and the timer logic was removed in commit ecde0f3e7f9ed ("mm, memory_hotplug: remove timeout from __offline_memory").

I think that returning a valuable error code from migrate_pages() back to do_migrate_range() has always been a bit difficult. What should be considered a fatal error? For the purpose here, the error we would return is -ENOMEM when we do not have nodes containing spare gigantic pages. Maybe that could be one of the reasons to bail out: if we are short of memory, offlining more memory will not do anything but apply more pressure to the system. But I am a bit worried about actually starting to back off due to that, since at the moment the only way to back off from an offlining operation is to send a signal to the process.

I would have to think a bit more, but another possibility that comes to my mind is:

*) Try to check whether the hstate has free pages in has_unmovable_pages(). If not, report the gigantic page as unmovable. This would follow the hugepage_migration_supported() check in has_unmovable_pages().

If not, as I said, we could leave it as it is. It should be the sysadmin's responsibility to check in advance that the system is ready to take over the memory to be offlined.

> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index d5f7afda67db..04f6695b648c 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1337,8 +1337,7 @@ static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
> >  		if (!PageHuge(page))
> >  			continue;
> >  		head = compound_head(page);
> > -		if (hugepage_migration_supported(page_hstate(head)) &&
> > -		    page_huge_active(head))
> > +		if (page_huge_active(head))
>
> I'm confused as to why the removal of the hugepage_migration_supported() check is required. Seems that commit aa9d95fa40a2 ("mm/hugetlb: enable arch specific huge page size support for migration") should make the check work as desired for all architectures.

has_unmovable_pages() should already cover us in case the hstate is not migratable:

	if (PageHuge(page)) {
		struct page *head = compound_head(page);
		unsigned int skip_pages;

		if (!hugepage_migration_supported(page_hstate(head)))
			goto unmovable;

		skip_pages = (1 << compound_order(head)) - (page - head);
		iter += skip_pages - 1;
		continue;
	}

Should the hstate not be migratable, we report unmovable pages within the range, start_isolate_page_range() will report the failure to __offline_pages(), and so we will not go further.
On Thu, Feb 21, 2019 at 10:42:12AM +0100, Oscar Salvador wrote:
> [1] https://lore.kernel.org/patchwork/patch/998796/
>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

Any further comments on this? I do have a "concern" I would like to sort out before dropping the RFC:

It is the fact that unless we have spare gigantic pages on other nodes, the offlining operation will loop forever (until the user cancels the operation). While I do not really like that, I do think that memory offlining should be done with some sanity, and the administrator should know in advance whether the system is going to be able to keep up with the memory pressure, aka: make sure we have what we need in order to make the offlining operation succeed. That translates to making sure that we have spare gigantic pages and that other nodes can take them.

That being said, another thing I thought about is that we could check whether we have spare gigantic pages at has_unmovable_pages() time. Something like checking "h->free_huge_pages - h->resv_huge_pages > 0", and if it turns out that we do not have gigantic pages anywhere, just return as if we had non-movable pages. But I would rather not convolute has_unmovable_pages() with such checks and "trust" the administrator.
> ---
>  mm/memory_hotplug.c | 7 +------
>  1 file changed, 1 insertion(+), 6 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index d5f7afda67db..04f6695b648c 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1337,8 +1337,7 @@ static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
>  		if (!PageHuge(page))
>  			continue;
>  		head = compound_head(page);
> -		if (hugepage_migration_supported(page_hstate(head)) &&
> -		    page_huge_active(head))
> +		if (page_huge_active(head))
>  			return pfn;
>  		skip = (1 << compound_order(head)) - (page - head);
>  		pfn += skip - 1;
> @@ -1378,10 +1377,6 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>
>  	if (PageHuge(page)) {
>  		struct page *head = compound_head(page);
> -		if (compound_order(head) > PFN_SECTION_SHIFT) {
> -			ret = -EBUSY;
> -			break;
> -		}
>  		pfn = page_to_pfn(head) + (1<<compound_order(head)) - 1;
>  		isolate_huge_page(head, &source);
>  		continue;
> --
> 2.13.7
On 2/27/19 1:51 PM, Oscar Salvador wrote:
> On Thu, Feb 21, 2019 at 10:42:12AM +0100, Oscar Salvador wrote:
>> [1] https://lore.kernel.org/patchwork/patch/998796/
>>
>> Signed-off-by: Oscar Salvador <osalvador@suse.de>
>
> Any further comments on this? I do have a "concern" I would like to sort out before dropping the RFC:
>
> It is the fact that unless we have spare gigantic pages on other nodes, the offlining operation will loop forever (until the user cancels the operation). While I do not really like that, I do think that memory offlining should be done with some sanity, and the administrator should know in advance whether the system is going to be able to keep up with the memory pressure, aka: make sure we have what we need in order to make the offlining operation succeed. That translates to making sure that we have spare gigantic pages and that other nodes can take them.
>
> That being said, another thing I thought about is that we could check whether we have spare gigantic pages at has_unmovable_pages() time. Something like checking "h->free_huge_pages - h->resv_huge_pages > 0", and if it turns out that we do not have gigantic pages anywhere, just return as if we had non-movable pages.

Of course, that check would be racy. Even if there is an available gigantic page at has_unmovable_pages() time, there is no guarantee it will be there when we want to allocate/use it. But you would at least catch 'most' cases of looping forever.

> But I would rather not convolute has_unmovable_pages() with such checks and "trust" the administrator.

Agree
On 27.02.19 23:00, Mike Kravetz wrote:
> On 2/27/19 1:51 PM, Oscar Salvador wrote:
>> [...]
>> That being said, another thing I thought about is that we could check whether we have spare gigantic pages at has_unmovable_pages() time. Something like checking "h->free_huge_pages - h->resv_huge_pages > 0", and if it turns out that we do not have gigantic pages anywhere, just return as if we had non-movable pages.
>
> Of course, that check would be racy. Even if there is an available gigantic page at has_unmovable_pages() time, there is no guarantee it will be there when we want to allocate/use it. But you would at least catch 'most' cases of looping forever.
>
>> But I would rather not convolute has_unmovable_pages() with such checks and "trust" the administrator.

I think we have the exact same issue already with huge/ordinary pages if we are low on memory. We could loop forever.

In the long run, we should properly detect such issues and abort instead of looping forever, I guess. But as we all know, error handling in the whole offlining part is still far away from being perfect ...
On Thu 28-02-19 08:38:34, David Hildenbrand wrote:
> On 27.02.19 23:00, Mike Kravetz wrote:
>> [...]
>
> I think we have the exact same issue already with huge/ordinary pages if we are low on memory. We could loop forever.
>
> In the long run, we should properly detect such issues and abort instead of looping forever, I guess. But as we all know, error handling in the whole offlining part is still far away from being perfect ...

Migration allocation callbacks use __GFP_RETRY_MAYFAIL to not be disruptive, so they do not trigger the OOM killer and rely on somebody else to pull the trigger instead. This means that if there is no other activity on the system, the hotplug migration would just loop forever, or until interrupted by userspace. The latter is important: the user might define a policy for when to terminate, and keeping retrying is not necessarily a wrong thing. One can simply do

	timeout $TIMEOUT echo 0 > $PATH_TO_MEMBLOCK/online

and then ENOMEM handling is not important. But I can see how people might want to bail out early instead. So I do not have a strong opinion here. We can try to consider ENOMEM from the migration as a hard failure, bail out, and see whether that works in practice.
On Thu 21-02-19 10:42:12, Oscar Salvador wrote:
[...]
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index d5f7afda67db..04f6695b648c 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1337,8 +1337,7 @@ static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
>  		if (!PageHuge(page))
>  			continue;
>  		head = compound_head(page);
> -		if (hugepage_migration_supported(page_hstate(head)) &&
> -		    page_huge_active(head))
> +		if (page_huge_active(head))
>  			return pfn;
>  		skip = (1 << compound_order(head)) - (page - head);
>  		pfn += skip - 1;

Is this part correct? Say we have a gigantic page which is migratable. Now scan_movable_pages() would skip it and we will not migrate it, no?

> @@ -1378,10 +1377,6 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>
>  	if (PageHuge(page)) {
>  		struct page *head = compound_head(page);
> -		if (compound_order(head) > PFN_SECTION_SHIFT) {
> -			ret = -EBUSY;
> -			break;
> -		}
>  		pfn = page_to_pfn(head) + (1<<compound_order(head)) - 1;
>  		isolate_huge_page(head, &source);
>  		continue;

I think it would be much easier to have only this check removed in this patch, because it is obviously bogus and wrong as well. The other check might be considered in a separate patch.
On Thu, Feb 28, 2019 at 10:21:54AM +0100, Michal Hocko wrote:
> Is this part correct? Say we have a gigantic page which is migratable. Now scan_movable_pages() would skip it and we will not migrate it, no?

All non-migratable hugepages should have been caught in has_unmovable_pages():

	if (PageHuge(page)) {
		struct page *head = compound_head(page);
		unsigned int skip_pages;

		if (!hugepage_migration_supported(page_hstate(head)))
			goto unmovable;

So there is no need to check again for migratability here, as it is something that does not change. To put it another way: all huge pages found in scan_movable_pages() should be migratable. In scan_movable_pages() we just need to check whether the hugepage, gigantic or not, is in use (aka active) in order to migrate it.

> I think it would be much easier to have only this check removed in this patch, because it is obviously bogus and wrong as well. The other check might be considered in a separate patch.

I do not have an issue with sending both changes separately. I mean, this check is the one we need to remove in order to let 1GB-hugetlb offlining proceed. The check removed from scan_movable_pages() is only removed because it is redundant: we already checked for that condition in has_unmovable_pages() (when isolating the range).
On Thu 28-02-19 10:41:08, Oscar Salvador wrote:
> All non-migratable hugepages should have been caught in has_unmovable_pages(). So there is no need to check again for migratability here, as it is something that does not change. To put it another way: all huge pages found in scan_movable_pages() should be migratable. In scan_movable_pages() we just need to check whether the hugepage, gigantic or not, is in use (aka active) in order to migrate it.

You seem to have missed my point, or I am wrong here. If scan_movable_pages() skips over a hugetlb page, then nothing will migrate it; it will stay in the pfn range and the range will not become idle.
On Thu, Feb 28, 2019 at 10:55:35AM +0100, Michal Hocko wrote:
> You seem to have missed my point, or I am wrong here. If scan_movable_pages() skips over a hugetlb page, then nothing will migrate it; it will stay in the pfn range and the range will not become idle.

I might be misunderstanding you, but I am not sure I get you.

scan_movable_pages() can either skip a hugetlb page or not. In case it does, pfn will be incremented to skip the whole hugetlb range. If that happens, pfn will hold the next non-hugetlb page. If it happens that the end of the hugetlb page is also the end of the memory range, scan_movable_pages() will return 0 and we will eventually break the loop in __offline_pages().

If this is not what you meant, could you please elaborate your concern a bit more?
On Thu 28-02-19 11:19:52, Oscar Salvador wrote:
> I might be misunderstanding you, but I am not sure I get you.
>
> scan_movable_pages() can either skip a hugetlb page or not. In case it does, pfn will be incremented to skip the whole hugetlb range. If that happens, pfn will hold the next non-hugetlb page.

And as a result the previous hugetlb page doesn't get migrated, right? What does that mean? Well, the page is still in use and we cannot proceed with offlining because the full range is not isolated, right?
On Thu, Feb 28, 2019 at 01:11:15PM +0100, Michal Hocko wrote:
> And as a result the previous hugetlb page doesn't get migrated, right? What does that mean? Well, the page is still in use and we cannot proceed with offlining because the full range is not isolated, right?

I might be clumsy today, but I still fail to see the point of concern here. Let us start from the beginning.

start_isolate_page_range() will mark the range as isolated unless we happen to have unmovable pages within it (for the exercise here, those would be non-migratable hugetlb pages). If we pass that point, it means that all hugetlb pages found can really be migrated.

Later, scan_movable_pages() will scan them, and it will only take those that are in use (active), as those are the ones we are interested in. We will skip those that are not being used (non-active).

If it happens that we skip a hugetlb page and return the next non-hugetlb page to be migrated, do_migrate_range() will proceed as usual, and eventually we will break the main loop once the whole range has been scanned, etc. If it happens that the whole range spans a gigantic hugetlb page and it is not in use, we will skip the whole range and break the main loop by returning "0".

Either way, I do not see how this changes the picture.
On Thu 28-02-19 14:40:54, Oscar Salvador wrote:
> I might be clumsy today, but I still fail to see the point of concern here.

No, it's me who is daft. I had misread the patch and thought that page_huge_active had also been removed. Now it makes perfect sense to me, because active pages are still handled properly.

I will leave the decision whether to split up the patch to you.

Acked-by: Michal Hocko <mhocko@suse.com>

and sorry for being dense here.
On Thu, Feb 28, 2019 at 03:08:17PM +0100, Michal Hocko wrote:
> No, it's me who is daft. I had misread the patch and thought that page_huge_active had also been removed. Now it makes perfect sense to me, because active pages are still handled properly.

Heh, no worries. Glad we got to the point; I was just scratching my head like a monkey.

> I will leave the decision whether to split up the patch to you.

On second thought, I will split it up. One of the changes merely removes a redundant check, while the other is the one that actually enables the system to proceed with gigantic pages, so they are not really related.

> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d5f7afda67db..04f6695b648c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1337,8 +1337,7 @@ static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
 		if (!PageHuge(page))
 			continue;
 		head = compound_head(page);
-		if (hugepage_migration_supported(page_hstate(head)) &&
-		    page_huge_active(head))
+		if (page_huge_active(head))
 			return pfn;
 		skip = (1 << compound_order(head)) - (page - head);
 		pfn += skip - 1;
@@ -1378,10 +1377,6 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)

 	if (PageHuge(page)) {
 		struct page *head = compound_head(page);
-		if (compound_order(head) > PFN_SECTION_SHIFT) {
-			ret = -EBUSY;
-			break;
-		}
 		pfn = page_to_pfn(head) + (1<<compound_order(head)) - 1;
 		isolate_huge_page(head, &source);
 		continue;
On x86_64, 1GB-hugetlb pages could never be offlined because hugepage_migration_supported() returned false for PUD_SHIFT. So whenever we wanted to offline a memblock containing a gigantic hugetlb page, we never got beyond the has_unmovable_pages() check. This changed with [1], where we now also return true for PUD_SHIFT.

After that patch, the checks in has_unmovable_pages() and scan_movable_pages() returned true, but we still had a final barrier in do_migrate_range():

	if (compound_order(head) > PFN_SECTION_SHIFT) {
		ret = -EBUSY;
		break;
	}

This is not really nice, and we do not really need it. It is perfectly possible to migrate a gigantic page as long as another node has a spare gigantic page for us. In alloc_huge_page_nodemask(), we calculate the __real__ number of free pages, and if there are any, we try to dequeue one from another node.

This all works fine when we do have another node with a spare gigantic page, but if that is not the case, alloc_huge_page_nodemask() ends up calling alloc_migrate_huge_page(), which bails out if the wanted page is gigantic. That is mainly because finding 1GB (or even 16GB on powerpc) of contiguous memory is quite unlikely when the system has been running for a while.

In that situation, we will keep looping forever because scan_movable_pages() will give us the same page and we will fail again, because there is no node we can dequeue a gigantic page from. This is not nice, and I wish we could differentiate a fatal error from a transient error in do_migrate_range()->migrate_pages(), but I do not really see a way to do that now.

Anyway, I would tend to say that this is the administrator's job: to make sure that the system can keep up with the memory to be offlined. That would mean that if we want to use gigantic pages, we make sure that the other nodes have at least enough gigantic pages to keep up in case we need to offline memory.

Just for the sake of completeness, this is one of the tests done:

	# echo 1 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
	# echo 1 > /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/nr_hugepages

	# cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
	1
	# cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
	1

	# cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/nr_hugepages
	1
	# cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/free_hugepages
	1

(hugetlb1gb is a program that maps a 1GB region using MAP_HUGE_1GB)

	# numactl -m 1 ./hugetlb1gb
	# cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
	0
	# cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/free_hugepages
	1

	# offline node1 memory
	# cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/free_hugepages
	0

[1] https://lore.kernel.org/patchwork/patch/998796/

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/memory_hotplug.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)