Message ID: 20210319224209.150047-6-mike.kravetz@oracle.com (mailing list archive)
State:      New, archived
Series:     make hugetlb put_page safe for all calling contexts
On Fri 19-03-21 15:42:06, Mike Kravetz wrote:
[...]
> @@ -2090,9 +2084,15 @@ static void return_unused_surplus_pages(struct hstate *h,
>  	while (nr_pages--) {
>  		h->resv_huge_pages--;
>  		unused_resv_pages--;
> -		if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
> +		page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
> +		if (!page)
>  			goto out;
> -		cond_resched_lock(&hugetlb_lock);
> +
> +		/* Drop lock and free page to buddy as it could sleep */
> +		spin_unlock(&hugetlb_lock);
> +		update_and_free_page(h, page);
> +		cond_resched();
> +		spin_lock(&hugetlb_lock);
>  	}
> 
>  out:

This is likely a matter of taste but the repeated pattern of unlock,
update_and_free_page, cond_resched and lock seems rather clumsy.
Would it be slightly better/nicer to remove_pool_huge_page into a
list_head under a single lock invocation and then free up the whole lot
after the lock is dropped?
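For illustration, the batching suggested here could look roughly like the
sketch below: gather the removed pages on a local list under a single
hugetlb_lock hold, then free the whole lot to buddy after the lock is
dropped.  This is only a sketch built from the symbols visible in this
series (remove_pool_huge_page, update_and_free_page, hugetlb_lock); the
function name and the exact accounting are simplified and hypothetical,
not part of the posted patch.

/*
 * Sketch only -- not part of the posted patch.  Batch the removals under
 * one hugetlb_lock hold, then free the pages to the low level allocators
 * with the lock dropped.  Accounting details of the real
 * return_unused_surplus_pages() are omitted.
 */
static void return_unused_surplus_pages_batched(struct hstate *h,
						unsigned long unused_resv_pages)
{
	LIST_HEAD(page_list);		/* private list; pages are off all hugetlb lists */
	struct page *page, *next;
	unsigned long nr_pages = min(unused_resv_pages, h->surplus_huge_pages);

	spin_lock(&hugetlb_lock);
	while (nr_pages--) {
		h->resv_huge_pages--;
		page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
		if (!page)
			break;
		list_add(&page->lru, &page_list);
	}
	spin_unlock(&hugetlb_lock);

	/* The heavy work happens here, without hugetlb_lock held */
	list_for_each_entry_safe(page, next, &page_list, lru) {
		list_del(&page->lru);
		update_and_free_page(h, page);
		cond_resched();
	}
}

Mike's concern in the reply below is exactly about the window in which
pages sit on page_list: they are no longer accounted in hugetlb but not
yet returned to buddy (or CMA).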
On 3/22/21 7:31 AM, Michal Hocko wrote:
> On Fri 19-03-21 15:42:06, Mike Kravetz wrote:
> [...]
>> @@ -2090,9 +2084,15 @@ static void return_unused_surplus_pages(struct hstate *h,
>>  	while (nr_pages--) {
>>  		h->resv_huge_pages--;
>>  		unused_resv_pages--;
>> -		if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
>> +		page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
>> +		if (!page)
>>  			goto out;
>> -		cond_resched_lock(&hugetlb_lock);
>> +
>> +		/* Drop lock and free page to buddy as it could sleep */
>> +		spin_unlock(&hugetlb_lock);
>> +		update_and_free_page(h, page);
>> +		cond_resched();
>> +		spin_lock(&hugetlb_lock);
>>  	}
>> 
>>  out:
> 
> This is likely a matter of taste but the repeated pattern of unlock,
> update_and_free_page, cond_resched and lock seems rather clumsy.
> Would it be slightly better/nicer to remove_pool_huge_page into a
> list_head under a single lock invocation and then free up the whole lot
> after the lock is dropped?

Yes, we can certainly do that.

One downside I see is that the list can contain a bunch of pages not
accounted for in hugetlb and not free in buddy (or cma).  Ideally, we
would want to keep those in sync if possible.  Also, the commit that
added the cond_resched talked about freeing up 12 TB worth of huge pages
and holding the lock for 150 seconds.  The new code is not holding the
lock while calling free to buddy, but I wonder how long it would take to
remove 12 TB worth of huge pages and add them to a separate list?

I do not know how realistic the 12 TB number is.  But, I certainly am
aware of pools that are a few TB in size.
On Mon 22-03-21 16:28:07, Mike Kravetz wrote:
> On 3/22/21 7:31 AM, Michal Hocko wrote:
> > On Fri 19-03-21 15:42:06, Mike Kravetz wrote:
> > [...]
> >> @@ -2090,9 +2084,15 @@ static void return_unused_surplus_pages(struct hstate *h,
> >>  	while (nr_pages--) {
> >>  		h->resv_huge_pages--;
> >>  		unused_resv_pages--;
> >> -		if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
> >> +		page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
> >> +		if (!page)
> >>  			goto out;
> >> -		cond_resched_lock(&hugetlb_lock);
> >> +
> >> +		/* Drop lock and free page to buddy as it could sleep */
> >> +		spin_unlock(&hugetlb_lock);
> >> +		update_and_free_page(h, page);
> >> +		cond_resched();
> >> +		spin_lock(&hugetlb_lock);
> >>  	}
> >> 
> >>  out:
> > 
> > This is likely a matter of taste but the repeated pattern of unlock,
> > update_and_free_page, cond_resched and lock seems rather clumsy.
> > Would it be slightly better/nicer to remove_pool_huge_page into a
> > list_head under a single lock invocation and then free up the whole lot
> > after the lock is dropped?
> 
> Yes, we can certainly do that.
> One downside I see is that the list can contain a bunch of pages not
> accounted for in hugetlb and not free in buddy (or cma).  Ideally, we
> would want to keep those in sync if possible.  Also, the commit that
> added the cond_resched talked about freeing up 12 TB worth of huge pages
> and holding the lock for 150 seconds.  The new code is not holding the
> lock while calling free to buddy, but I wonder how long it would take to
> remove 12 TB worth of huge pages and add them to a separate list?

Well, remove_pool_huge_page is just an accounting part and that should be
pretty invisible even when the number of pages is large.  The lockless
nature (from hugetlb POV) of the final page release is the heavy weight
operation, and whether you do it in chunks or in a single go (with
cond_resched) shouldn't be visible either way.  We already do the same
thing when uncharging memcg pages (mem_cgroup_uncharge_list).

So I would agree with you that this would be a much bigger problem if
both the hugetlb and the freeing paths were equally heavy weight and the
delay between the first pages being unaccounted and freed would be
noticeable.

But I do not want to push for this.  I just hated the hugetlb_lock dances
as this is an ugly and repetitive pattern.
On 3/23/21 12:57 AM, Michal Hocko wrote:
> On Mon 22-03-21 16:28:07, Mike Kravetz wrote:
>> On 3/22/21 7:31 AM, Michal Hocko wrote:
>>> On Fri 19-03-21 15:42:06, Mike Kravetz wrote:
>>> [...]
>>>> @@ -2090,9 +2084,15 @@ static void return_unused_surplus_pages(struct hstate *h,
>>>>  	while (nr_pages--) {
>>>>  		h->resv_huge_pages--;
>>>>  		unused_resv_pages--;
>>>> -		if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
>>>> +		page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
>>>> +		if (!page)
>>>>  			goto out;
>>>> -		cond_resched_lock(&hugetlb_lock);
>>>> +
>>>> +		/* Drop lock and free page to buddy as it could sleep */
>>>> +		spin_unlock(&hugetlb_lock);
>>>> +		update_and_free_page(h, page);
>>>> +		cond_resched();
>>>> +		spin_lock(&hugetlb_lock);
>>>>  	}
>>>> 
>>>>  out:
>>>
>>> This is likely a matter of taste but the repeated pattern of unlock,
>>> update_and_free_page, cond_resched and lock seems rather clumsy.
>>> Would it be slightly better/nicer to remove_pool_huge_page into a
>>> list_head under a single lock invocation and then free up the whole lot
>>> after the lock is dropped?
>>
>> Yes, we can certainly do that.
>> One downside I see is that the list can contain a bunch of pages not
>> accounted for in hugetlb and not free in buddy (or cma).  Ideally, we
>> would want to keep those in sync if possible.  Also, the commit that
>> added the cond_resched talked about freeing up 12 TB worth of huge pages
>> and holding the lock for 150 seconds.  The new code is not holding the
>> lock while calling free to buddy, but I wonder how long it would take to
>> remove 12 TB worth of huge pages and add them to a separate list?
>
> Well, remove_pool_huge_page is just an accounting part and that should be
> pretty invisible even when the number of pages is large.  The lockless
> nature (from hugetlb POV) of the final page release is the heavy weight
> operation, and whether you do it in chunks or in a single go (with
> cond_resched) shouldn't be visible either way.  We already do the same
> thing when uncharging memcg pages (mem_cgroup_uncharge_list).
>
> So I would agree with you that this would be a much bigger problem if
> both the hugetlb and the freeing paths were equally heavy weight and the
> delay between the first pages being unaccounted and freed would be
> noticeable.
>
> But I do not want to push for this.  I just hated the hugetlb_lock dances
> as this is an ugly and repetitive pattern.

As you may have seen in my reply to patch 3, I am going to use this
batching approach for all places we do remove/free a hugetlb page.

Since you brought up cgroups ... what is your opinion on lock hold time
in hugetlb_cgroup_css_offline?  We could potentially be calling
hugetlb_cgroup_move_parent for every hugetlb page while holding the lock
with interrupts disabled.
On Tue 23-03-21 18:03:07, Mike Kravetz wrote:
[...]
> Since you brought up cgroups ... what is your opinion on lock hold time
> in hugetlb_cgroup_css_offline?  We could potentially be calling
> hugetlb_cgroup_move_parent for every hugetlb page while holding the lock
> with interrupts disabled.

I am not familiar with the hugetlb cgroup code TBH.  But from a quick
look there is not much heavy lifting there.  If we find out that this is
really visible we can do the lock dance with cond_resched and retry the
iteration again.  Or is there any strong reason to process the list in a
single go?
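To make the suggestion concrete, the "lock dance" could look something
like the sketch below.  It is purely illustrative: page_belongs_to_cgroup()
and move_page_to_parent_cgroup() are hypothetical stand-ins for the
existing hugetlb_cgroup logic, and the irq-disabling lock form only
mirrors the "interrupts disabled" point raised above.  Because
hugetlb_lock protects hugepage_activelist, the walk restarts from the
head each time the lock is dropped; forward progress is still guaranteed
because a restart only happens after at least one page has been moved.

/*
 * Sketch only -- not the in-tree hugetlb_cgroup_css_offline().  Walk the
 * active list under hugetlb_lock, but periodically drop the lock,
 * reschedule, and restart the walk from the head since the list may have
 * changed while the lock was not held.
 */
static void reparent_hugetlb_pages(struct hstate *h, struct hugetlb_cgroup *h_cg)
{
	struct page *page;

restart:
	spin_lock_irq(&hugetlb_lock);
	list_for_each_entry(page, &h->hugepage_activelist, lru) {
		if (!page_belongs_to_cgroup(page, h_cg))	/* hypothetical helper */
			continue;
		move_page_to_parent_cgroup(h, page, h_cg);	/* hypothetical helper */
		if (need_resched()) {
			spin_unlock_irq(&hugetlb_lock);
			cond_resched();
			/* list may have changed while unlocked; rescan */
			goto restart;
		}
	}
	spin_unlock_irq(&hugetlb_lock);
}

Mike's point in the next message is the catch: once the lock is dropped
the list can change under the walker, so either the scan restarts from
the head (as above) or pages have to be pulled onto a private list first.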
On 3/24/21 1:40 AM, Michal Hocko wrote:
> On Tue 23-03-21 18:03:07, Mike Kravetz wrote:
> [...]
>> Since you brought up cgroups ... what is your opinion on lock hold time
>> in hugetlb_cgroup_css_offline?  We could potentially be calling
>> hugetlb_cgroup_move_parent for every hugetlb page while holding the lock
>> with interrupts disabled.
>
> I am not familiar with the hugetlb cgroup code TBH.  But from a quick
> look there is not much heavy lifting there.  If we find out that this is
> really visible we can do the lock dance with cond_resched and retry the
> iteration again.  Or is there any strong reason to process the list in a
> single go?

AFAICT, the primary reason for processing the list in a single go is
that the lock protects the list.  If you drop the lock, the list can
change ...

I have come up with a (not so pretty) way of processing the list in
batches of pages.  But, I did not want to introduce that if there is no
need.  Perhaps just take a wait and see approach for now.

I'll see if I can come up with some timing information to determine
if/when we may have an issue.
On Wed 24-03-21 09:38:17, Mike Kravetz wrote:
> On 3/24/21 1:40 AM, Michal Hocko wrote:
> > On Tue 23-03-21 18:03:07, Mike Kravetz wrote:
> > [...]
> >> Since you brought up cgroups ... what is your opinion on lock hold time
> >> in hugetlb_cgroup_css_offline?  We could potentially be calling
> >> hugetlb_cgroup_move_parent for every hugetlb page while holding the lock
> >> with interrupts disabled.
> > 
> > I am not familiar with the hugetlb cgroup code TBH.  But from a quick
> > look there is not much heavy lifting there.  If we find out that this is
> > really visible we can do the lock dance with cond_resched and retry the
> > iteration again.  Or is there any strong reason to process the list in a
> > single go?
> 
> AFAICT, the primary reason for processing the list in a single go is
> that the lock protects the list.  If you drop the lock, the list can
> change ...
> 
> I have come up with a (not so pretty) way of processing the list in
> batches of pages.  But, I did not want to introduce that if there is no
> need.  Perhaps just take a wait and see approach for now.
> 
> I'll see if I can come up with some timing information to determine
> if/when we may have an issue.

I wouldn't bother at this stage. This can be done on top.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3028cf10d504..f60a24e326c2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1184,7 +1184,7 @@ static int hstate_next_node_to_alloc(struct hstate *h,
 }
 
 /*
- * helper for free_pool_huge_page() - return the previously saved
+ * helper for remove_pool_huge_page() - return the previously saved
  * node ["this node"] from which to free a huge page. Advance the
  * next node id whether or not we find a free huge page to free so
  * that the next attempt to free addresses the next node.
@@ -1699,16 +1699,18 @@ static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 }
 
 /*
- * Free huge page from pool from next node to free.
- * Attempt to keep persistent huge pages more or less
- * balanced over allowed nodes.
+ * Remove huge page from pool from next node to free. Attempt to keep
+ * persistent huge pages more or less balanced over allowed nodes.
+ * This routine only 'removes' the hugetlb page. The caller must make
+ * an additional call to free the page to low level allocators.
  * Called with hugetlb_lock locked.
  */
-static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
-							 bool acct_surplus)
+static struct page *remove_pool_huge_page(struct hstate *h,
+						nodemask_t *nodes_allowed,
+						 bool acct_surplus)
 {
 	int nr_nodes, node;
-	int ret = 0;
+	struct page *page = NULL;
 
 	for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
 		/*
@@ -1717,23 +1719,14 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 		 */
 		if ((!acct_surplus || h->surplus_huge_pages_node[node]) &&
 		    !list_empty(&h->hugepage_freelists[node])) {
-			struct page *page =
-				list_entry(h->hugepage_freelists[node].next,
+			page = list_entry(h->hugepage_freelists[node].next,
 					  struct page, lru);
 			remove_hugetlb_page(h, page, acct_surplus);
-			/*
-			 * unlock/lock around update_and_free_page is temporary
-			 * and will be removed with subsequent patch.
-			 */
-			spin_unlock(&hugetlb_lock);
-			update_and_free_page(h, page);
-			spin_lock(&hugetlb_lock);
-			ret = 1;
 			break;
 		}
 	}
 
-	return ret;
+	return page;
 }
 
 /*
@@ -2064,6 +2057,7 @@ static void return_unused_surplus_pages(struct hstate *h,
 					unsigned long unused_resv_pages)
 {
 	unsigned long nr_pages;
+	struct page *page;
 
 	/* Cannot return gigantic pages currently */
 	if (hstate_is_gigantic(h))
@@ -2080,7 +2074,7 @@ static void return_unused_surplus_pages(struct hstate *h,
 	 * evenly across all nodes with memory. Iterate across these nodes
 	 * until we can no longer free unreserved surplus pages. This occurs
 	 * when the nodes with surplus pages have no free pages.
-	 * free_pool_huge_page() will balance the freed pages across the
+	 * remove_pool_huge_page() will balance the freed pages across the
 	 * on-line nodes with memory and will handle the hstate accounting.
 	 *
 	 * Note that we decrement resv_huge_pages as we free the pages. If
@@ -2090,9 +2084,15 @@ static void return_unused_surplus_pages(struct hstate *h,
 	while (nr_pages--) {
 		h->resv_huge_pages--;
 		unused_resv_pages--;
-		if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
+		page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
+		if (!page)
 			goto out;
-		cond_resched_lock(&hugetlb_lock);
+
+		/* Drop lock and free page to buddy as it could sleep */
+		spin_unlock(&hugetlb_lock);
+		update_and_free_page(h, page);
+		cond_resched();
+		spin_lock(&hugetlb_lock);
 	}
 
 out:
@@ -2631,6 +2631,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 			      nodemask_t *nodes_allowed)
 {
 	unsigned long min_count, ret;
+	struct page *page;
 	NODEMASK_ALLOC(nodemask_t, node_alloc_noretry, GFP_KERNEL);
 
 	/*
@@ -2740,9 +2741,15 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	min_count = min_hp_count(h, count);
 	try_to_free_low(h, count, nodes_allowed);
 	while (min_count < persistent_huge_pages(h)) {
-		if (!free_pool_huge_page(h, nodes_allowed, 0))
+		page = remove_pool_huge_page(h, nodes_allowed, 0);
+		if (!page)
 			break;
-		cond_resched_lock(&hugetlb_lock);
+
+		/* Drop lock as free routines may sleep */
+		spin_unlock(&hugetlb_lock);
+		update_and_free_page(h, page);
+		cond_resched();
+		spin_lock(&hugetlb_lock);
 
 		/* Recompute min_count in case hugetlb_lock was dropped */
 		min_count = min_hp_count(h, count);
free_pool_huge_page was called with hugetlb_lock held.  It would remove
a hugetlb page, and then free the corresponding pages to the lower level
allocators such as buddy.  free_pool_huge_page was called in a loop to
remove hugetlb pages and these loops could hold the hugetlb_lock for a
considerable time.

Create new routine remove_pool_huge_page to replace free_pool_huge_page.
remove_pool_huge_page will remove the hugetlb page, and it must be called
with the hugetlb_lock held.  It will return the removed page and it is
the responsibility of the caller to free the page to the lower level
allocators.  The hugetlb_lock is dropped before freeing to these
allocators which results in shorter lock hold times.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 53 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 30 insertions(+), 23 deletions(-)