
[v2,2/2] mm: hugetlb: support gigantic surplus pages

Message ID 1478675294-2507-1-git-send-email-shijie.huang@arm.com (mailing list archive)
State New, archived

Commit Message

Huang Shijie Nov. 9, 2016, 7:08 a.m. UTC
When testing gigantic pages, whose order is too large for the buddy
allocator, the libhugetlbfs test case "counter.sh" fails.

The failure is caused by:
 1) The kernel fails to allocate a gigantic page for the surplus case,
    so gather_surplus_pages() ends up returning NULL.

 2) The condition checks for "over-commit" are wrong.

This patch adds code to allocate gigantic pages in
__alloc_huge_page(). After this patch, gather_surplus_pages()
can return a gigantic page for the surplus case.

This patch changes the condition checks for:
     return_unused_surplus_pages()
     nr_overcommit_hugepages_store()
     hugetlb_overcommit_handler()

This patch also sets @nid to a proper value when NUMA_NO_NODE is
passed to alloc_gigantic_page().

After this patch, counter.sh passes for gigantic pages.

Acked-by: Steve Capper <steve.capper@arm.com>
Signed-off-by: Huang Shijie <shijie.huang@arm.com>
---
  1. fix the wrong check in hugetlb_overcommit_handler();
  2. try to fix the s390 issue.
---
 mm/hugetlb.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

Comments

Gerald Schaefer Nov. 9, 2016, 3:55 p.m. UTC | #1
On Wed, 9 Nov 2016 15:08:14 +0800
Huang Shijie <shijie.huang@arm.com> wrote:

> When testing gigantic pages, whose order is too large for the buddy
> allocator, the libhugetlbfs test case "counter.sh" fails.
> 
> The failure is caused by:
>  1) The kernel fails to allocate a gigantic page for the surplus case,
>     so gather_surplus_pages() ends up returning NULL.
> 
>  2) The condition checks for "over-commit" are wrong.
> 
> This patch adds code to allocate gigantic pages in
> __alloc_huge_page(). After this patch, gather_surplus_pages()
> can return a gigantic page for the surplus case.
> 
> This patch changes the condition checks for:
>      return_unused_surplus_pages()
>      nr_overcommit_hugepages_store()
>      hugetlb_overcommit_handler()
> 
> This patch also sets @nid to a proper value when NUMA_NO_NODE is
> passed to alloc_gigantic_page().
> 
> After this patch, counter.sh passes for gigantic pages.
> 
> Acked-by: Steve Capper <steve.capper@arm.com>
> Signed-off-by: Huang Shijie <shijie.huang@arm.com>
> ---
>   1. fix the wrong check in hugetlb_overcommit_handler();
>   2. try to fix the s390 issue.
> ---
>  mm/hugetlb.c | 20 ++++++++++++++------
>  1 file changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 9fdfc24..5dbfd62 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1095,6 +1095,9 @@ static struct page *alloc_gigantic_page(int nid, unsigned int order)
>  	unsigned long ret, pfn, flags;
>  	struct zone *z;
> 
> +	if (nid == NUMA_NO_NODE)
> +		nid = numa_mem_id();
> +

Now counter.sh works (on s390) w/o the lockdep warning. However, it looks
like this change will result in behavior that is inconsistent with the
normal-sized huge pages regarding surplus page allocation. Setting nid to
numa_mem_id() means that only the node of the current CPU will be considered
for allocating a gigantic page, as opposed to just "preferring" the current
node in the normal size case (__hugetlb_alloc_buddy_huge_page() ->
alloc_pages_node()) with a fallback to using other nodes.

I am not really familiar with NUMA, and I might be wrong here, but if
this is true then gigantic pages, which may be hard to allocate at runtime
in general, will be even harder to find (as surplus pages) because you
only look on the current node.

I honestly do not understand why alloc_gigantic_page() needs a nid
parameter at all, since it looks like it will only be called from
alloc_fresh_gigantic_page_node(), which in turn is only called
from alloc_fresh_gigantic_page() in a "for_each_node" loop (at least
before your patch).

Now it could be an option to also use alloc_fresh_gigantic_page()
in your patch, instead of directly calling alloc_gigantic_page(),
in __alloc_huge_page(). This would fix the "local node only" issue,
but I am not sure how to handle the required nodes_allowed parameter.

Maybe someone with more NUMA insight could have a look at this.
The patch as it is also seems to work, with the "local node only"
restriction, so it may be an option to just accept this restriction.
Huang Shijie Nov. 10, 2016, 7:03 a.m. UTC | #2
On Wed, Nov 09, 2016 at 04:55:49PM +0100, Gerald Schaefer wrote:
> > index 9fdfc24..5dbfd62 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1095,6 +1095,9 @@ static struct page *alloc_gigantic_page(int nid, unsigned int order)
> >  	unsigned long ret, pfn, flags;
> >  	struct zone *z;
> > 
> > +	if (nid == NUMA_NO_NODE)
> > +		nid = numa_mem_id();
> > +
> 
> Now counter.sh works (on s390) w/o the lockdep warning. However, it looks
Good news to me :)
We have found the root cause of the s390 issue.

> like this change will now result in inconsistent behavior compared to the
> normal sized hugepages, regarding surplus page allocation. Setting nid to
> numa_mem_id() means that only the node of the current CPU will be considered
> for allocating a gigantic page, as opposed to just "preferring" the current
> node in the normal size case (__hugetlb_alloc_buddy_huge_page() ->
> alloc_pages_node()) with a fallback to using other nodes.
Yes.

> 
> I am not really familiar with NUMA, and I might be wrong here, but if
> this is true then gigantic pages, which may be hard allocate at runtime
> in general, will be even harder to find (as surplus pages) because you
> only look on the current node.
Okay, I will try to fix this in the next version.

> 
> I honestly do not understand why alloc_gigantic_page() needs a nid
> parameter at all, since it looks like it will only be called from
> alloc_fresh_gigantic_page_node(), which in turn is only called
> from alloc_fresh_gigantic_page() in a "for_each_node" loop (at least
> before your patch).
> 
> Now it could be an option to also use alloc_fresh_gigantic_page()
> in your patch, instead of directly calling alloc_gigantic_page(),
Yes, a good suggestion. But I need to make some changes to
alloc_fresh_gigantic_page().


Thanks
Huang Shijie

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9fdfc24..5dbfd62 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1095,6 +1095,9 @@  static struct page *alloc_gigantic_page(int nid, unsigned int order)
 	unsigned long ret, pfn, flags;
 	struct zone *z;
 
+	if (nid == NUMA_NO_NODE)
+		nid = numa_mem_id();
+
 	z = NODE_DATA(nid)->node_zones;
 	for (; z - NODE_DATA(nid)->node_zones < MAX_NR_ZONES; z++) {
 		spin_lock_irqsave(&z->lock, flags);
@@ -1578,7 +1581,7 @@  static struct page *__alloc_huge_page(struct hstate *h,
 	struct page *page;
 	unsigned int r_nid;
 
-	if (hstate_is_gigantic(h))
+	if (hstate_is_gigantic(h) && !gigantic_page_supported())
 		return NULL;
 
 	/*
@@ -1623,7 +1626,13 @@  static struct page *__alloc_huge_page(struct hstate *h,
 	}
 	spin_unlock(&hugetlb_lock);
 
-	page = __hugetlb_alloc_buddy_huge_page(h, vma, addr, nid);
+	if (hstate_is_gigantic(h))  {
+		page = alloc_gigantic_page(nid, huge_page_order(h));
+		if (page)
+			prep_compound_gigantic_page(page, huge_page_order(h));
+	} else {
+		page = __hugetlb_alloc_buddy_huge_page(h, vma, addr, nid);
+	}
 
 	spin_lock(&hugetlb_lock);
 	if (page) {
@@ -1790,8 +1799,7 @@  static void return_unused_surplus_pages(struct hstate *h,
 	/* Uncommit the reservation */
 	h->resv_huge_pages -= unused_resv_pages;
 
-	/* Cannot return gigantic pages currently */
-	if (hstate_is_gigantic(h))
+	if (hstate_is_gigantic(h) && !gigantic_page_supported())
 		return;
 
 	nr_pages = min(unused_resv_pages, h->surplus_huge_pages);
@@ -2443,7 +2451,7 @@  static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
 	unsigned long input;
 	struct hstate *h = kobj_to_hstate(kobj, NULL);
 
-	if (hstate_is_gigantic(h))
+	if (hstate_is_gigantic(h) && !gigantic_page_supported())
 		return -EINVAL;
 
 	err = kstrtoul(buf, 10, &input);
@@ -2884,7 +2892,7 @@  int hugetlb_overcommit_handler(struct ctl_table *table, int write,
 
 	tmp = h->nr_overcommit_huge_pages;
 
-	if (write && hstate_is_gigantic(h))
+	if (write && hstate_is_gigantic(h) && !gigantic_page_supported())
 		return -EINVAL;
 
 	table->data = &tmp;