
[RFC,7/8] hugetlb: add update_and_free_page_no_sleep for irq context

Message ID 20210319224209.150047-8-mike.kravetz@oracle.com (mailing list archive)
State New, archived
Series make hugetlb put_page safe for all calling contexts

Commit Message

Mike Kravetz March 19, 2021, 10:42 p.m. UTC
The locks acquired in free_huge_page are irq safe.  However, in certain
circumstances the routine update_and_free_page could sleep.  Since
free_huge_page can be called from any context, it cannot sleep.

Use a workqueue to defer freeing of pages if the operation may sleep.  A
new routine update_and_free_page_no_sleep provides this functionality
and is only called from free_huge_page.

Note that any 'pages' sent to the workqueue for deferred freeing have
already been removed from the hugetlb subsystem.  What is actually
deferred is returning those base pages to the low level allocator.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h | 12 +++++-
 mm/hugetlb.c            | 86 +++++++++++++++++++++++++++++++++++++++--
 2 files changed, 94 insertions(+), 4 deletions(-)

Comments

Hillf Danton March 20, 2021, 1:18 a.m. UTC | #1
On Fri, 19 Mar 2021 15:42:08 -0700  Mike Kravetz wrote:
> +
> +	if (!can_sleep && free_page_may_sleep(h, page)) {
> +		/*
> +		 * Send page freeing to workqueue
> +		 *
> +		 * Only call schedule_work() if hpage_freelist is previously
> +		 * empty. Otherwise, schedule_work() had been called but the
> +		 * workfn hasn't retrieved the list yet.
> +		 */
> +		if (llist_add((struct llist_node *)&page->mapping,
> +					&hpage_freelist))
> +			schedule_work(&free_hpage_work);
> +		return;
> +	}

Queue the work on system_unbound_wq instead of system_wq, since the work can block.
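
A minimal sketch of that change, reusing the patch's hpage_freelist and
free_hpage_work (illustrative only, not a tested diff):

	if (llist_add((struct llist_node *)&page->mapping, &hpage_freelist))
		queue_work(system_unbound_wq, &free_hpage_work);

system_unbound_wq is meant for long-running or blocking work items,
which are discouraged on the per-CPU system_wq.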
Peter Zijlstra March 22, 2021, 8:41 a.m. UTC | #2
On Fri, Mar 19, 2021 at 03:42:08PM -0700, Mike Kravetz wrote:
> The locks acquired in free_huge_page are irq safe.  However, in certain
> circumstances the routine update_and_free_page could sleep.  Since
> free_huge_page can be called from any context, it cannot sleep.
> 
> Use a workqueue to defer freeing of pages if the operation may sleep.  A
> new routine update_and_free_page_no_sleep provides this functionality
> and is only called from free_huge_page.
> 
> Note that any 'pages' sent to the workqueue for deferred freeing have
> already been removed from the hugetlb subsystem.  What is actually
> deferred is returning those base pages to the low level allocator.

So maybe I'm stupid, but why do you need that work in hugetlb? Afaict it
should be in cma_release().

Also, afaict cma_release() does free_contig_range() *first*, and then
does the 'difficult' bits. So how about you re-order
free_gigantic_page() a bit to make it unconditionally do
free_contig_range() and *then* call into CMA, which can then do a
workqueue thingy if it feels like it.

That way none of the hugetlb accounting is delayed, and only CMA gets to
suffer.
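
Not working code, but a sketch of that reordering; cma_clear_bitmap_deferred()
is a hypothetical helper (assumed to no-op for non-CMA ranges), since today
cma_release() itself calls free_contig_range() and the two cannot simply be
called back to back:

	static void free_gigantic_page(struct page *page, unsigned int order)
	{
		free_contig_range(page_to_pfn(page), 1 << order);
	#ifdef CONFIG_CMA
		/* CMA bookkeeping only; could punt to a workqueue internally */
		cma_clear_bitmap_deferred(hugetlb_cma[page_to_nid(page)],
					  page, 1 << order);
	#endif
	}

That way the hugetlb accounting completes immediately and only the CMA
bitmap update is deferred.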
Michal Hocko March 22, 2021, 2:42 p.m. UTC | #3
On Fri 19-03-21 15:42:08, Mike Kravetz wrote:
> The locks acquired in free_huge_page are irq safe.  However, in certain
> circumstances the routine update_and_free_page could sleep.  Since
> free_huge_page can be called from any context, it cannot sleep.
> 
> Use a workqueue to defer freeing of pages if the operation may sleep.  A
> new routine update_and_free_page_no_sleep provides this functionality
> and is only called from free_huge_page.
> 
> Note that any 'pages' sent to the workqueue for deferred freeing have
> already been removed from the hugetlb subsystem.  What is actually
> deferred is returning those base pages to the low level allocator.

This patch or its alternative would need to be applied prior to patch 6
which makes the whole context IRQ safe.

Besides that, the changelog doesn't really say anything about the
changed user visible behavior. Now if somebody decreases the GB huge
pool from userspace, the real effect on the freed up memory will be
postponed to some later time. That "later" is unpredictable as it
depends on WQ utilization. We definitely need some sort of wait for
in-flight pages. One way to do that would be to have a dedicated WQ,
schedule a sync work item after the pool has been shrunk, and wait for
that item.
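
A sketch of that scheme, with hpage_free_wq as a hypothetical dedicated
queue for the deferred frees:

	static struct workqueue_struct *hpage_free_wq;	/* hypothetical */

	static void wait_for_inflight_frees(void)
	{
		/* sleeps until all work queued so far has finished */
		flush_workqueue(hpage_free_wq);
	}

set_max_huge_pages() could then call wait_for_inflight_frees() after
shrinking the pool.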
Michal Hocko March 22, 2021, 2:46 p.m. UTC | #4
On Mon 22-03-21 15:42:27, Michal Hocko wrote:
[...]
> We definitely need some sort of wait for in-flight pages. One way to
> do that would be to have a dedicated WQ, schedule a sync work item
> after the pool has been shrunk, and wait for that item.

Scratch that. It is not really clear from the patch context, but after
looking at the resulting code, set_max_huge_pages will use the blockable
update_and_free_page, so we should be fine.

Sorry about the noise!
Mike Kravetz March 22, 2021, 5:42 p.m. UTC | #5
Cc: Roman, Christoph

On 3/22/21 1:41 AM, Peter Zijlstra wrote:
> So maybe I'm stupid, but why do you need that work in hugetlb? Afaict it
> should be in cma_release().

My thinking (which could be totally wrong) is that cma_release makes no
claims about calling context.  From the code, it is pretty clear that it
can only be called from task context with no locks held.  Although there
could be code incorrectly calling it today, as hugetlb does.  Since
hugetlb is the only code with this new requirement, it should do the
work.

Wait!!!  That made me remember something.
Roman had code to create a non-blocking version of cma_release().
https://lore.kernel.org/linux-mm/20201022225308.2927890-1-guro@fb.com/

There were no objections, and Christoph even thought there may be
problems with callers of dma_free_contiguous.

Perhaps, we should just move forward with Roman's patches to create
cma_release_nowait() and avoid this workqueue stuff?
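
For illustration, a sketch of how free_gigantic_page() could use the
non-blocking release (the cma_release_nowait() signature is taken from
the linked posting):

	static void free_gigantic_page(struct page *page, unsigned int order)
	{
	#ifdef CONFIG_CMA
		/* true if the pages belonged to a CMA region */
		if (cma_release_nowait(hugetlb_cma[page_to_nid(page)], page,
				       1 << order))
			return;
	#endif
		free_contig_range(page_to_pfn(page), 1 << order);
	}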
Roman Gushchin March 22, 2021, 6:10 p.m. UTC | #6
On Mon, Mar 22, 2021 at 10:42:23AM -0700, Mike Kravetz wrote:
[...]
> Perhaps, we should just move forward with Roman's patches to create
> cma_release_nowait() and avoid this workqueue stuff?

Sounds good to me. If it's the preferred path, I can rebase and resend
those patches (they've been carried for some time by Zi Yan for his 1GB THP work,
but they are completely independent).

Thanks!


Peter Zijlstra March 22, 2021, 8:43 p.m. UTC | #7
On Mon, Mar 22, 2021 at 10:42:23AM -0700, Mike Kravetz wrote:
[...]
> Perhaps, we should just move forward with Roman's patches to create
> cma_release_nowait() and avoid this workqueue stuff?

Ha!, that basically does what I suggested. Using that page is unfortunate
in that it will keep the contig range unavailable for allocations until
the work happens, but I'm not sure I see a nice alternative.
Mike Kravetz March 23, 2021, 6:51 p.m. UTC | #8
On 3/22/21 11:10 AM, Roman Gushchin wrote:
[...]
> Sounds good to me. If it's the preferred path, I can rebase and resend
> those patches (they've been carried for some time by Zi Yan for his 1GB THP work,
> but they are completely independent).

Thanks Roman,

Yes, this is the preferred path.  If there is a non-blocking version of
cma_release, it makes fixing up the hugetlb put_page path much easier.

If you would prefer, I can rebase your patches and send with this series.
Roman Gushchin March 23, 2021, 7:07 p.m. UTC | #9
On Tue, Mar 23, 2021 at 11:51:04AM -0700, Mike Kravetz wrote:
[...]
> Yes, this is the preferred path.  If there is a non-blocking version of
> cma_release, it makes fixing up the hugetlb put_page path much easier.
> 
> If you would prefer, I can rebase your patches and send with this series.

Sounds good! Please proceed, and please let me know if I can help.

Thanks!
Michal Hocko March 24, 2021, 8:43 a.m. UTC | #10
On Tue 23-03-21 11:51:04, Mike Kravetz wrote:
[...]
> Yes, this is the preferred path.  If there is a non-blocking version of
> cma_release, it makes fixing up the hugetlb put_page path much easier.

I do not object to the plan; I just want to point out that the sparse
vmemmap work for hugetlb pages will need to recognize sleep/nosleep
variants of the freeing path as well, to handle its vmemmap repopulate
games.
Mike Kravetz March 24, 2021, 4:53 p.m. UTC | #11
On 3/24/21 1:43 AM, Michal Hocko wrote:
[...]
> I do not object to the plan; I just want to point out that the sparse
> vmemmap work for hugetlb pages will need to recognize sleep/nosleep
> variants of the freeing path as well, to handle its vmemmap repopulate
> games.

Yes,

I also commented elsewhere that we will likely want to drop and
reacquire the lock for each page in the looping page free routines when
adding the vmemmap freeing support.
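
A sketch of that drop/reacquire pattern (hypothetical loop, not part of
this patch; pick_page_to_free() is an invented helper):

	spin_lock_irq(&hugetlb_lock);
	while ((page = pick_page_to_free(h)) != NULL) {
		remove_hugetlb_page(h, page, false);
		spin_unlock_irq(&hugetlb_lock);
		update_and_free_page(h, page);	/* may sleep */
		cond_resched();
		spin_lock_irq(&hugetlb_lock);
	}
	spin_unlock_irq(&hugetlb_lock);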

Unless someone thinks otherwise, I still think it is better to first fix
the hugetlb put_page/free_huge_page path with this series.  Then move on
to the free vmemmap series.
Mike Kravetz March 25, 2021, 12:26 a.m. UTC | #12
On 3/19/21 6:18 PM, Hillf Danton wrote:
[...]
> Queue the work on system_unbound_wq instead of system_wq, since the work can block.

Thanks Hillf,

I am dropping this patch and going with Roman's patches to create a
version of cma_release that will not sleep.  A workqueue handoff like
this may be needed in the vmemmap reduction series, so I will keep it in
mind.

Patch

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index f42d44050548..a81ca39c06be 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -666,9 +666,14 @@  static inline unsigned huge_page_shift(struct hstate *h)
 	return h->order + PAGE_SHIFT;
 }
 
+static inline bool order_is_gigantic(unsigned int order)
+{
+	return order >= MAX_ORDER;
+}
+
 static inline bool hstate_is_gigantic(struct hstate *h)
 {
-	return huge_page_order(h) >= MAX_ORDER;
+	return order_is_gigantic(huge_page_order(h));
 }
 
 static inline unsigned int pages_per_huge_page(struct hstate *h)
@@ -942,6 +947,11 @@  static inline unsigned int huge_page_shift(struct hstate *h)
 	return PAGE_SHIFT;
 }
 
+static inline bool order_is_gigantic(unsigned int order)
+{
+	return false;
+}
+
 static inline bool hstate_is_gigantic(struct hstate *h)
 {
 	return false;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 82614bbe7bb9..b8304b290a73 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1351,7 +1351,60 @@  static void remove_hugetlb_page(struct hstate *h, struct page *page,
 	h->nr_huge_pages_node[nid]--;
 }
 
-static void update_and_free_page(struct hstate *h, struct page *page)
+/*
+ * free_huge_page() can be called from any context.  However, the freeing
+ * of a hugetlb page can potentially sleep.  If freeing will sleep, defer
+ * the actual freeing to a workqueue to prevent sleeping in contexts where
+ * sleeping is not allowed.
+ *
+ * Use the page->mapping pointer as an llist_node structure for the lockless
+ * linked list of pages to be freed.  free_hpage_workfn() locklessly
+ * retrieves the linked list of pages to be freed and frees them one-by-one.
+ *
+ * The page passed to __free_huge_page is technically not a hugetlb page, so
+ * we cannot use interfaces such as page_hstate().
+ */
+static void __free_huge_page(struct page *page)
+{
+	unsigned int order = compound_order(page);
+
+	if (order_is_gigantic(order)) {
+		destroy_compound_gigantic_page(page, order);
+		free_gigantic_page(page, order);
+	} else {
+		__free_pages(page, order);
+	}
+}
+
+static LLIST_HEAD(hpage_freelist);
+
+static void free_hpage_workfn(struct work_struct *work)
+{
+	struct llist_node *node;
+	struct page *page;
+
+	node = llist_del_all(&hpage_freelist);
+
+	while (node) {
+		page = container_of((struct address_space **)node,
+				     struct page, mapping);
+		node = node->next;
+		__free_huge_page(page);
+	}
+}
+static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
+
+static bool free_page_may_sleep(struct hstate *h, struct page *page)
+{
+	/* freeing gigantic pages in CMA may sleep */
+	if (hstate_is_gigantic(h))
+		return true;
+
+	return false;
+}
+
+static void __update_and_free_page(struct hstate *h, struct page *page,
+								bool can_sleep)
 {
 	int i;
 	struct page *subpage = page;
@@ -1366,6 +1419,21 @@  static void update_and_free_page(struct hstate *h, struct page *page)
 				1 << PG_active | 1 << PG_private |
 				1 << PG_writeback);
 	}
+
+	if (!can_sleep && free_page_may_sleep(h, page)) {
+		/*
+		 * Send page freeing to workqueue
+		 *
+		 * Only call schedule_work() if hpage_freelist is previously
+		 * empty. Otherwise, schedule_work() had been called but the
+		 * workfn hasn't retrieved the list yet.
+		 */
+		if (llist_add((struct llist_node *)&page->mapping,
+					&hpage_freelist))
+			schedule_work(&free_hpage_work);
+		return;
+	}
+
 	if (hstate_is_gigantic(h)) {
 		destroy_compound_gigantic_page(page, huge_page_order(h));
 		free_gigantic_page(page, huge_page_order(h));
@@ -1374,6 +1442,18 @@  static void update_and_free_page(struct hstate *h, struct page *page)
 	}
 }
 
+static void update_and_free_page_no_sleep(struct hstate *h, struct page *page)
+{
+	/* cannot sleep */
+	return __update_and_free_page(h, page, false);
+}
+
+static void update_and_free_page(struct hstate *h, struct page *page)
+{
+	/* can sleep */
+	return __update_and_free_page(h, page, true);
+}
+
 struct hstate *size_to_hstate(unsigned long size)
 {
 	struct hstate *h;
@@ -1436,12 +1516,12 @@  void free_huge_page(struct page *page)
 	if (HPageTemporary(page)) {
 		remove_hugetlb_page(h, page, false);
 		spin_unlock_irqrestore(&hugetlb_lock, flags);
-		update_and_free_page(h, page);
+		update_and_free_page_no_sleep(h, page);
 	} else if (h->surplus_huge_pages_node[nid]) {
 		/* remove the page from active list */
 		remove_hugetlb_page(h, page, true);
 		spin_unlock_irqrestore(&hugetlb_lock, flags);
-		update_and_free_page(h, page);
+		update_and_free_page_no_sleep(h, page);
 	} else {
 		arch_clear_hugepage_flags(page);
 		enqueue_huge_page(h, page);