Message ID | 20190724175014.9935-4-mike.kravetz@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Series | fix hugetlb page allocation stalls |
On Wed, Jul 24, 2019 at 10:50:14AM -0700, Mike Kravetz wrote:
> When allocating hugetlbfs pool pages via /proc/sys/vm/nr_hugepages,
> the pages will be interleaved between all nodes of the system. If
> nodes are not equal, it is quite possible for one node to fill up
> before the others. When this happens, the code still attempts to
> allocate pages from the full node. This results in calls to direct
> reclaim and compaction which slow things down considerably.
>
> When allocating pool pages, note the state of the previous allocation
> for each node. If previous allocation failed, do not use the
> aggressive retry algorithm on successive attempts. The allocation
> will still succeed if there is memory available, but it will not try
> as hard to free up memory.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

set_max_huge_pages can fail the NODEMASK_ALLOC() alloc which you handle
*but* in the event of an allocation failure this bug can silently recur.
An informational message might be justified in that case in case the
stall should recur with no hint as to why. Technically passing NULL into
NODEMASK_FREE is also safe as kfree (if used for that kernel config) can
handle freeing of a NULL pointer. However, that is cosmetic more than
anything. Whether you decide to change either or not;

Acked-by: Mel Gorman <mgorman@suse.de>

On 7/25/19 1:13 AM, Mel Gorman wrote:
> On Wed, Jul 24, 2019 at 10:50:14AM -0700, Mike Kravetz wrote:
>> When allocating hugetlbfs pool pages via /proc/sys/vm/nr_hugepages,
>> the pages will be interleaved between all nodes of the system. If
>> nodes are not equal, it is quite possible for one node to fill up
>> before the others. When this happens, the code still attempts to
>> allocate pages from the full node. This results in calls to direct
>> reclaim and compaction which slow things down considerably.
>>
>> When allocating pool pages, note the state of the previous allocation
>> for each node. If previous allocation failed, do not use the
>> aggressive retry algorithm on successive attempts. The allocation
>> will still succeed if there is memory available, but it will not try
>> as hard to free up memory.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>
> set_max_huge_pages can fail the NODEMASK_ALLOC() alloc which you handle
> *but* in the event of an allocation failure this bug can silently recur.
> An informational message might be justified in that case in case the
> stall should recur with no hint as to why.

Right.
Perhaps a NODEMASK_ALLOC() failure should just result in a quick exit/error.
If we can't allocate a node mask, it is unlikely we will be able to allocate
a/any huge pages. And, the system must be extremely low on memory and there
are likely other bigger issues.

There have been discussions elsewhere about discontinuing the use of
NODEMASK_ALLOC() and just putting the mask on the stack. That may be
acceptable here as well.

> Technically passing NULL into
> NODEMASK_FREE is also safe as kfree (if used for that kernel config) can
> handle freeing of a NULL pointer. However, that is cosmetic more than
> anything. Whether you decide to change either or not;

Yes.
I will clean up with an updated series after more feedback.

>
> Acked-by: Mel Gorman <mgorman@suse.de>
>

Thanks!

On Thu, Jul 25, 2019 at 10:15:29AM -0700, Mike Kravetz wrote:
> On 7/25/19 1:13 AM, Mel Gorman wrote:
> > On Wed, Jul 24, 2019 at 10:50:14AM -0700, Mike Kravetz wrote:
> >> When allocating hugetlbfs pool pages via /proc/sys/vm/nr_hugepages,
> >> the pages will be interleaved between all nodes of the system. If
> >> nodes are not equal, it is quite possible for one node to fill up
> >> before the others. When this happens, the code still attempts to
> >> allocate pages from the full node. This results in calls to direct
> >> reclaim and compaction which slow things down considerably.
> >>
> >> When allocating pool pages, note the state of the previous allocation
> >> for each node. If previous allocation failed, do not use the
> >> aggressive retry algorithm on successive attempts. The allocation
> >> will still succeed if there is memory available, but it will not try
> >> as hard to free up memory.
> >>
> >> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> >
> > set_max_huge_pages can fail the NODEMASK_ALLOC() alloc which you handle
> > *but* in the event of an allocation failure this bug can silently recur.
> > An informational message might be justified in that case in case the
> > stall should recur with no hint as to why.
>
> Right.
> Perhaps a NODEMASK_ALLOC() failure should just result in a quick exit/error.
> If we can't allocate a node mask, it is unlikely we will be able to allocate
> a/any huge pages. And, the system must be extremely low on memory and there
> are likely other bigger issues.
>

That might be better overall, you make a valid point that a failed
kmalloc is not a good sign for hugetlbfs allocations.

> There have been discussions elsewhere about discontinuing the use of
> NODEMASK_ALLOC() and just putting the mask on the stack. That may be
> acceptable here as well.
>

They can be big and, while this particular path would be relatively safe,
I think there will not be much functional difference between allocating
on the stack and a failed kmalloc in terms of hugetlbfs allocation
success rates.

> > Technically passing NULL into
> > NODEMASK_FREE is also safe as kfree (if used for that kernel config) can
> > handle freeing of a NULL pointer. However, that is cosmetic more than
> > anything. Whether you decide to change either or not;
>
> Yes.
> I will clean up with an updated series after more feedback.
>

Thanks.

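To give a rough sense of how big such a mask can get, here is a minimal userspace sketch (not kernel code). It assumes a node mask is a plain bitmap of unsigned long words with one bit per possible node; `sim_nodemask_bytes` and its parameter are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a kernel API): bytes needed for a bitmap with
 * one bit per node, rounded up to whole unsigned long words.
 */
size_t sim_nodemask_bytes(size_t max_nodes)
{
	const size_t bits_per_long = sizeof(unsigned long) * CHAR_BIT;

	assert(max_nodes > 0);
	return ((max_nodes + bits_per_long - 1) / bits_per_long) *
	       sizeof(unsigned long);
}
```

Under these assumptions a mask covering 1024 possible nodes takes 128 bytes, while one covering 256 nodes takes 32 bytes, which is the kind of difference that decides whether a stack copy is comfortable.
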
On 7/25/19 7:15 PM, Mike Kravetz wrote:
> On 7/25/19 1:13 AM, Mel Gorman wrote:
>> On Wed, Jul 24, 2019 at 10:50:14AM -0700, Mike Kravetz wrote:
>>> When allocating hugetlbfs pool pages via /proc/sys/vm/nr_hugepages,
>>> the pages will be interleaved between all nodes of the system. If
>>> nodes are not equal, it is quite possible for one node to fill up
>>> before the others. When this happens, the code still attempts to
>>> allocate pages from the full node. This results in calls to direct
>>> reclaim and compaction which slow things down considerably.
>>>
>>> When allocating pool pages, note the state of the previous allocation
>>> for each node. If previous allocation failed, do not use the
>>> aggressive retry algorithm on successive attempts. The allocation
>>> will still succeed if there is memory available, but it will not try
>>> as hard to free up memory.
>>>
>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>
>> set_max_huge_pages can fail the NODEMASK_ALLOC() alloc which you handle
>> *but* in the event of an allocation failure this bug can silently recur.
>> An informational message might be justified in that case in case the
>> stall should recur with no hint as to why.
>
> Right.
> Perhaps a NODEMASK_ALLOC() failure should just result in a quick exit/error.
> If we can't allocate a node mask, it is unlikely we will be able to allocate
> a/any huge pages. And, the system must be extremely low on memory and there
> are likely other bigger issues.

Agreed. But I would perhaps drop __GFP_NORETRY from the mask allocation
as that can fail for transient conditions.

> There have been discussions elsewhere about discontinuing the use of
> NODEMASK_ALLOC() and just putting the mask on the stack. That may be
> acceptable here as well.
>
>> Technically passing NULL into
>> NODEMASK_FREE is also safe as kfree (if used for that kernel config) can
>> handle freeing of a NULL pointer. However, that is cosmetic more than
>> anything. Whether you decide to change either or not;
>
> Yes.
> I will clean up with an updated series after more feedback.
>
>>
>> Acked-by: Mel Gorman <mgorman@suse.de>
>>
>
> Thanks!
>

On 7/31/19 6:23 AM, Vlastimil Babka wrote:
> On 7/25/19 7:15 PM, Mike Kravetz wrote:
>> On 7/25/19 1:13 AM, Mel Gorman wrote:
>>> On Wed, Jul 24, 2019 at 10:50:14AM -0700, Mike Kravetz wrote:
>>>
>>> set_max_huge_pages can fail the NODEMASK_ALLOC() alloc which you handle
>>> *but* in the event of an allocation failure this bug can silently recur.
>>> An informational message might be justified in that case in case the
>>> stall should recur with no hint as to why.
>>
>> Right.
>> Perhaps a NODEMASK_ALLOC() failure should just result in a quick exit/error.
>> If we can't allocate a node mask, it is unlikely we will be able to allocate
>> a/any huge pages. And, the system must be extremely low on memory and there
>> are likely other bigger issues.
>
> Agreed. But I would perhaps drop __GFP_NORETRY from the mask allocation
> as that can fail for transient conditions.

Thanks, I was unsure if adding __GFP_NORETRY would be a good idea.

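The two cleanups discussed in this thread (fail fast when the mask cannot be allocated, and free it unconditionally because freeing NULL is safe) might look roughly like the userspace analogue below. This is only a sketch under stated assumptions: `sim_resize_pool` and `force_alloc_failure` are hypothetical names, calloc()/free() stand in for the kernel allocator, and the actual pool adjustment is elided.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical userspace analogue of the discussed error handling; this
 * is not the kernel's set_max_huge_pages().
 */
int sim_resize_pool(size_t mask_bytes, bool force_alloc_failure)
{
	unsigned char *noretry_mask;
	int ret = 0;

	assert(mask_bytes > 0);

	/* calloc() returns a zeroed mask, so no separate clear step. */
	noretry_mask = force_alloc_failure ? NULL : calloc(1, mask_bytes);
	if (!noretry_mask) {
		/* Quick exit: do not continue without the state mask. */
		ret = -ENOMEM;
		goto out;
	}

	/* ... the pool adjustment loop would use noretry_mask here ... */

out:
	/* free(NULL) is a no-op, so no "if (noretry_mask)" guard is needed. */
	free(noretry_mask);
	return ret;
}
```

A single exit label keeps the free in one place, which is the cosmetic simplification hinted at above; whether a real failure should return an error or only log a message is left open in the thread.
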
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ede7e7f5d1ab..f3c50344a9b4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1405,12 +1405,27 @@ pgoff_t __basepage_index(struct page *page)
 }
 
 static struct page *alloc_buddy_huge_page(struct hstate *h,
-		gfp_t gfp_mask, int nid, nodemask_t *nmask)
+		gfp_t gfp_mask, int nid, nodemask_t *nmask,
+		nodemask_t *node_alloc_noretry)
 {
 	int order = huge_page_order(h);
 	struct page *page;
+	bool alloc_try_hard = true;
 
-	gfp_mask |= __GFP_COMP|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
+	/*
+	 * By default we always try hard to allocate the page with
+	 * __GFP_RETRY_MAYFAIL flag. However, if we are allocating pages in
+	 * a loop (to adjust global huge page counts) and previous allocation
+	 * failed, do not continue to try hard on the same node. Use the
+	 * node_alloc_noretry bitmap to manage this state information.
+	 */
+	if (node_alloc_noretry && node_isset(nid, *node_alloc_noretry))
+		alloc_try_hard = false;
+	gfp_mask |= __GFP_COMP|__GFP_NOWARN;
+	if (alloc_try_hard)
+		gfp_mask |= __GFP_RETRY_MAYFAIL;
+	else
+		gfp_mask |= __GFP_NORETRY;
 	if (nid == NUMA_NO_NODE)
 		nid = numa_mem_id();
 	page = __alloc_pages_nodemask(gfp_mask, order, nid, nmask);
@@ -1419,6 +1434,22 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 	else
 		__count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
 
+	/*
+	 * If we did not specify __GFP_RETRY_MAYFAIL, but still got a page this
+	 * indicates an overall state change. Clear bit so that we resume
+	 * normal 'try hard' allocations.
+	 */
+	if (node_alloc_noretry && page && !alloc_try_hard)
+		node_clear(nid, *node_alloc_noretry);
+
+	/*
+	 * If we tried hard to get a page but failed, set bit so that
+	 * subsequent attempts will not try as hard until there is an
+	 * overall state change.
+	 */
+	if (node_alloc_noretry && !page && alloc_try_hard)
+		node_set(nid, *node_alloc_noretry);
+
 	return page;
 }
 
@@ -1427,7 +1458,8 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
  * should use this function to get new hugetlb pages
  */
 static struct page *alloc_fresh_huge_page(struct hstate *h,
-		gfp_t gfp_mask, int nid, nodemask_t *nmask)
+		gfp_t gfp_mask, int nid, nodemask_t *nmask,
+		nodemask_t *node_alloc_noretry)
 {
 	struct page *page;
 
@@ -1435,7 +1467,7 @@ static struct page *alloc_fresh_huge_page(struct hstate *h,
 		page = alloc_gigantic_page(h, gfp_mask, nid, nmask);
 	else
 		page = alloc_buddy_huge_page(h, gfp_mask,
-				nid, nmask);
+				nid, nmask, node_alloc_noretry);
 	if (!page)
 		return NULL;
 
@@ -1450,14 +1482,16 @@ static struct page *alloc_fresh_huge_page(struct hstate *h,
  * Allocates a fresh page to the hugetlb allocator pool in the node interleaved
  * manner.
  */
-static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
+static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
+				nodemask_t *node_alloc_noretry)
 {
 	struct page *page;
 	int nr_nodes, node;
 	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
 
 	for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
-		page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed);
+		page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed,
+						node_alloc_noretry);
 		if (page)
 			break;
 	}
@@ -1601,7 +1635,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 		goto out_unlock;
 	spin_unlock(&hugetlb_lock);
 
-	page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask);
+	page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
 	if (!page)
 		return NULL;
 
@@ -1637,7 +1671,7 @@ struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
 	if (hstate_is_gigantic(h))
 		return NULL;
 
-	page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask);
+	page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
 	if (!page)
 		return NULL;
 
@@ -2207,13 +2241,31 @@ static void __init gather_bootmem_prealloc(void)
 static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 {
 	unsigned long i;
+	nodemask_t *node_alloc_noretry;
+
+	if (!hstate_is_gigantic(h)) {
+		/*
+		 * bit mask controlling how hard we retry per-node
+		 * allocations.
+		 */
+		node_alloc_noretry = kmalloc(sizeof(*node_alloc_noretry),
+						GFP_KERNEL | __GFP_NORETRY);
+	} else {
+		/* allocations done at boot time */
+		node_alloc_noretry = NULL;
+	}
+
+	/* bit mask controlling how hard we retry per-node allocations */
+	if (node_alloc_noretry)
+		nodes_clear(*node_alloc_noretry);
 
 	for (i = 0; i < h->max_huge_pages; ++i) {
 		if (hstate_is_gigantic(h)) {
 			if (!alloc_bootmem_huge_page(h))
 				break;
 		} else if (!alloc_pool_huge_page(h,
-					 &node_states[N_MEMORY]))
+					 &node_states[N_MEMORY],
+					 node_alloc_noretry))
 			break;
 		cond_resched();
 	}
@@ -2225,6 +2277,9 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 			h->max_huge_pages, buf, i);
 		h->max_huge_pages = i;
 	}
+
+	if (node_alloc_noretry)
+		kfree(node_alloc_noretry);
 }
 
 static void __init hugetlb_init_hstates(void)
@@ -2323,6 +2378,12 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 			      nodemask_t *nodes_allowed)
 {
 	unsigned long min_count, ret;
+	NODEMASK_ALLOC(nodemask_t, node_alloc_noretry,
+						GFP_KERNEL | __GFP_NORETRY);
+
+	/* bit mask controlling how hard we retry per-node allocations */
+	if (node_alloc_noretry)
+		nodes_clear(*node_alloc_noretry);
 
 	spin_lock(&hugetlb_lock);
 
@@ -2356,6 +2417,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
 		if (count > persistent_huge_pages(h)) {
 			spin_unlock(&hugetlb_lock);
+			if (node_alloc_noretry)
+				NODEMASK_FREE(node_alloc_noretry);
 			return -EINVAL;
 		}
 		/* Fall through to decrease pool */
@@ -2388,7 +2451,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 		/* yield cpu to avoid soft lockup */
 		cond_resched();
 
-		ret = alloc_pool_huge_page(h, nodes_allowed);
+		ret = alloc_pool_huge_page(h, nodes_allowed,
+						node_alloc_noretry);
 		spin_lock(&hugetlb_lock);
 		if (!ret)
 			goto out;
@@ -2429,6 +2493,9 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	h->max_huge_pages = persistent_huge_pages(h);
 	spin_unlock(&hugetlb_lock);
 
+	if (node_alloc_noretry)
+		NODEMASK_FREE(node_alloc_noretry);
+
 	return 0;
 }

When allocating hugetlbfs pool pages via /proc/sys/vm/nr_hugepages,
the pages will be interleaved between all nodes of the system. If
nodes are not equal, it is quite possible for one node to fill up
before the others. When this happens, the code still attempts to
allocate pages from the full node. This results in calls to direct
reclaim and compaction which slow things down considerably.

When allocating pool pages, note the state of the previous allocation
for each node. If previous allocation failed, do not use the
aggressive retry algorithm on successive attempts. The allocation
will still succeed if there is memory available, but it will not try
as hard to free up memory.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 77 insertions(+), 10 deletions(-)
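
The per-node bookkeeping described in this changelog can be modelled in userspace as a small state machine. The sketch below is an illustration only, not the kernel implementation: `sim_nodemask_t`, `sim_pick_effort` and `sim_record_result` are hypothetical names, a single 64-bit word stands in for nodemask_t, and the allocation outcome is passed in by the caller rather than coming from a real allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for nodemask_t: one bit per node, up to 64 nodes. */
typedef unsigned long long sim_nodemask_t;

enum sim_effort { SIM_TRY_HARD, SIM_NO_RETRY };

/* Pick the effort level for node nid; a NULL mask means "always try hard". */
enum sim_effort sim_pick_effort(const sim_nodemask_t *noretry, int nid)
{
	assert(nid >= 0 && nid < 64);
	if (noretry && (*noretry & (1ULL << nid)))
		return SIM_NO_RETRY;
	return SIM_TRY_HARD;
}

/* Update the per-node state after one allocation attempt. */
void sim_record_result(sim_nodemask_t *noretry, int nid,
		       enum sim_effort effort, bool got_page)
{
	assert(nid >= 0 && nid < 64);
	if (!noretry)
		return;
	/* A cheap attempt succeeded: resume trying hard on this node. */
	if (got_page && effort == SIM_NO_RETRY)
		*noretry &= ~(1ULL << nid);
	/* A hard attempt failed: stop trying hard on this node for now. */
	if (!got_page && effort == SIM_TRY_HARD)
		*noretry |= 1ULL << nid;
}
```

In this model a node that fails a hard attempt is downgraded to cheap attempts, stays downgraded while cheap attempts keep failing, and returns to hard attempts as soon as one cheap attempt succeeds. Other nodes are unaffected, which matches the intent of keeping the interleave going without repeatedly stalling on a full node.
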