Message ID | 20210117151053.24600-5-songmuchun@bytedance.com (mailing list archive)
---|---
State | New, archived
Series | Free some vmemmap pages of HugeTLB page
On Sun, 17 Jan 2021, Muchun Song wrote:

> In the subsequent patch, we should allocate the vmemmap pages when
> freeing HugeTLB pages. But update_and_free_page() is always called
> with holding hugetlb_lock, so we cannot use GFP_KERNEL to allocate
> vmemmap pages. However, we can defer the actual freeing in a kworker
> to prevent from using GFP_ATOMIC to allocate the vmemmap pages.
>
> The update_hpage_vmemmap_workfn() is where the call to allocate
> vmemmmap pages will be inserted.
>

I think it's reasonable to assume that userspace can release free hugetlb
pages from the pool on oom conditions when reclaim has become too
expensive. This approach now requires that we can allocate vmemmap pages
in a potential oom condition as a prerequisite for freeing memory, which
seems less than ideal.

And, by doing this through a kworker, we can presumably get queued behind
another work item that requires memory to make forward progress in this
oom condition.

Two thoughts:

- We're going to be freeing the hugetlb page after we can allocate the
  vmemmap pages, so why do we need to allocate with GFP_KERNEL? Can't we
  simply dip into memory reserves using GFP_ATOMIC (and thus can be
  holding hugetlb_lock) because we know we'll be freeing more memory than
  we'll be allocating? I think requiring a GFP_KERNEL allocation to block
  to free memory for vmemmap when we'll be freeing memory ourselves is
  dubious. This simplifies all of this.

- If the answer is that we actually have to use GFP_KERNEL for other
  reasons, what are your thoughts on pre-allocating the vmemmap as opposed
  to deferring to a kworker? In other words, preallocate the necessary
  memory with GFP_KERNEL and put it on a linked list in struct hstate
  before acquiring hugetlb_lock.

> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/hugetlb.c         | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>  mm/hugetlb_vmemmap.c | 12 ---------
>  mm/hugetlb_vmemmap.h | 17 ++++++++++++
>  3 files changed, 89 insertions(+), 14 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 140135fc8113..c165186ec2cf 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1292,15 +1292,85 @@ static inline void destroy_compound_gigantic_page(struct page *page,
>  						unsigned int order) { }
>  #endif
>
> -static void update_and_free_page(struct hstate *h, struct page *page)
> +static void __free_hugepage(struct hstate *h, struct page *page);
> +
> +/*
> + * As update_and_free_page() is always called with holding hugetlb_lock, so we
> + * cannot use GFP_KERNEL to allocate vmemmap pages. However, we can defer the
> + * actual freeing in a workqueue to prevent from using GFP_ATOMIC to allocate
> + * the vmemmap pages.
> + *
> + * The update_hpage_vmemmap_workfn() is where the call to allocate vmemmmap
> + * pages will be inserted.
> + *
> + * update_hpage_vmemmap_workfn() locklessly retrieves the linked list of pages
> + * to be freed and frees them one-by-one. As the page->mapping pointer is going
> + * to be cleared in update_hpage_vmemmap_workfn() anyway, it is reused as the
> + * llist_node structure of a lockless linked list of huge pages to be freed.
> + */
> +static LLIST_HEAD(hpage_update_freelist);
> +
> +static void update_hpage_vmemmap_workfn(struct work_struct *work)
>  {
> -	int i;
> +	struct llist_node *node;
> +
> +	node = llist_del_all(&hpage_update_freelist);
> +
> +	while (node) {
> +		struct page *page;
> +		struct hstate *h;
> +
> +		page = container_of((struct address_space **)node,
> +				    struct page, mapping);
> +		node = node->next;
> +		page->mapping = NULL;
> +		h = page_hstate(page);
> +
> +		spin_lock(&hugetlb_lock);
> +		__free_hugepage(h, page);
> +		spin_unlock(&hugetlb_lock);
>
> +		cond_resched();

Wouldn't it be better to hold hugetlb_lock for the iteration rather than
constantly dropping it and reacquiring it? Use
cond_resched_lock(&hugetlb_lock) instead?
On Mon, Jan 25, 2021 at 7:55 AM David Rientjes <rientjes@google.com> wrote:
>
>
> On Sun, 17 Jan 2021, Muchun Song wrote:
>
> > In the subsequent patch, we should allocate the vmemmap pages when
> > freeing HugeTLB pages. But update_and_free_page() is always called
> > with holding hugetlb_lock, so we cannot use GFP_KERNEL to allocate
> > vmemmap pages. However, we can defer the actual freeing in a kworker
> > to prevent from using GFP_ATOMIC to allocate the vmemmap pages.
> >
> > The update_hpage_vmemmap_workfn() is where the call to allocate
> > vmemmmap pages will be inserted.
> >
>
> I think it's reasonable to assume that userspace can release free hugetlb
> pages from the pool on oom conditions when reclaim has become too
> expensive. This approach now requires that we can allocate vmemmap pages
> in a potential oom condition as a prerequisite for freeing memory, which
> seems less than ideal.
>
> And, by doing this through a kworker, we can presumably get queued behind
> another work item that requires memory to make forward progress in this
> oom condition.
>
> Two thoughts:
>
> - We're going to be freeing the hugetlb page after we can allocate the
>   vmemmap pages, so why do we need to allocate with GFP_KERNEL? Can't we
>   simply dip into memory reserves using GFP_ATOMIC (and thus can be
>   holding hugetlb_lock) because we know we'll be freeing more memory than
>   we'll be allocating?

Right.

> I think requiring a GFP_KERNEL allocation to block
>   to free memory for vmemmap when we'll be freeing memory ourselves is
>   dubious. This simplifies all of this.

Thanks for your thoughts. I just thought that we can go to reclaim
when there is no memory in the system. But we cannot block when using
GFP_KERNEL. Actually, we cannot deal with fail of memory allocating.
In the next patch, I try to sleep 100ms and then try again to allocate
memory when allocating memory fails.

>
> - If the answer is that we actually have to use GFP_KERNEL for other
>   reasons, what are your thoughts on pre-allocating the vmemmap as opposed
>   to deferring to a kworker? In other words, preallocate the necessary
>   memory with GFP_KERNEL and put it on a linked list in struct hstate
>   before acquiring hugetlb_lock.

put_page() can be used in an atomic context. Actually, we cannot sleep
in the __free_huge_page(). It seems a little tricky. Right?

>
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> > ---
> >  mm/hugetlb.c         | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> >  mm/hugetlb_vmemmap.c | 12 ---------
> >  mm/hugetlb_vmemmap.h | 17 ++++++++++++
> >  3 files changed, 89 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 140135fc8113..c165186ec2cf 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1292,15 +1292,85 @@ static inline void destroy_compound_gigantic_page(struct page *page,
> >  						unsigned int order) { }
> >  #endif
> >
> > -static void update_and_free_page(struct hstate *h, struct page *page)
> > +static void __free_hugepage(struct hstate *h, struct page *page);
> > +
> > +/*
> > + * As update_and_free_page() is always called with holding hugetlb_lock, so we
> > + * cannot use GFP_KERNEL to allocate vmemmap pages. However, we can defer the
> > + * actual freeing in a workqueue to prevent from using GFP_ATOMIC to allocate
> > + * the vmemmap pages.
> > + *
> > + * The update_hpage_vmemmap_workfn() is where the call to allocate vmemmmap
> > + * pages will be inserted.
> > + *
> > + * update_hpage_vmemmap_workfn() locklessly retrieves the linked list of pages
> > + * to be freed and frees them one-by-one. As the page->mapping pointer is going
> > + * to be cleared in update_hpage_vmemmap_workfn() anyway, it is reused as the
> > + * llist_node structure of a lockless linked list of huge pages to be freed.
> > + */
> > +static LLIST_HEAD(hpage_update_freelist);
> > +
> > +static void update_hpage_vmemmap_workfn(struct work_struct *work)
> >  {
> > -	int i;
> > +	struct llist_node *node;
> > +
> > +	node = llist_del_all(&hpage_update_freelist);
> > +
> > +	while (node) {
> > +		struct page *page;
> > +		struct hstate *h;
> > +
> > +		page = container_of((struct address_space **)node,
> > +				    struct page, mapping);
> > +		node = node->next;
> > +		page->mapping = NULL;
> > +		h = page_hstate(page);
> > +
> > +		spin_lock(&hugetlb_lock);
> > +		__free_hugepage(h, page);
> > +		spin_unlock(&hugetlb_lock);
> >
> > +		cond_resched();
>
> Wouldn't it be better to hold hugetlb_lock for the iteration rather than
> constantly dropping it and reacquiring it? Use
> cond_resched_lock(&hugetlb_lock) instead?

Great. We can use it. Thanks.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 140135fc8113..c165186ec2cf 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1292,15 +1292,85 @@ static inline void destroy_compound_gigantic_page(struct page *page,
 						unsigned int order) { }
 #endif
 
-static void update_and_free_page(struct hstate *h, struct page *page)
+static void __free_hugepage(struct hstate *h, struct page *page);
+
+/*
+ * As update_and_free_page() is always called with holding hugetlb_lock, so we
+ * cannot use GFP_KERNEL to allocate vmemmap pages. However, we can defer the
+ * actual freeing in a workqueue to prevent from using GFP_ATOMIC to allocate
+ * the vmemmap pages.
+ *
+ * The update_hpage_vmemmap_workfn() is where the call to allocate vmemmmap
+ * pages will be inserted.
+ *
+ * update_hpage_vmemmap_workfn() locklessly retrieves the linked list of pages
+ * to be freed and frees them one-by-one. As the page->mapping pointer is going
+ * to be cleared in update_hpage_vmemmap_workfn() anyway, it is reused as the
+ * llist_node structure of a lockless linked list of huge pages to be freed.
+ */
+static LLIST_HEAD(hpage_update_freelist);
+
+static void update_hpage_vmemmap_workfn(struct work_struct *work)
 {
-	int i;
+	struct llist_node *node;
+
+	node = llist_del_all(&hpage_update_freelist);
+
+	while (node) {
+		struct page *page;
+		struct hstate *h;
+
+		page = container_of((struct address_space **)node,
+				    struct page, mapping);
+		node = node->next;
+		page->mapping = NULL;
+		h = page_hstate(page);
+
+		spin_lock(&hugetlb_lock);
+		__free_hugepage(h, page);
+		spin_unlock(&hugetlb_lock);
 
+		cond_resched();
+	}
+}
+static DECLARE_WORK(hpage_update_work, update_hpage_vmemmap_workfn);
+
+static inline void __update_and_free_page(struct hstate *h, struct page *page)
+{
+	/* No need to allocate vmemmap pages */
+	if (!free_vmemmap_pages_per_hpage(h)) {
+		__free_hugepage(h, page);
+		return;
+	}
+
+	/*
+	 * Defer freeing to avoid using GFP_ATOMIC to allocate vmemmap
+	 * pages.
+	 *
+	 * Only call schedule_work() if hpage_update_freelist is previously
+	 * empty. Otherwise, schedule_work() had been called but the workfn
+	 * hasn't retrieved the list yet.
+	 */
+	if (llist_add((struct llist_node *)&page->mapping,
+		      &hpage_update_freelist))
+		schedule_work(&hpage_update_work);
+}
+
+static void update_and_free_page(struct hstate *h, struct page *page)
+{
 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
 		return;
 
 	h->nr_huge_pages--;
 	h->nr_huge_pages_node[page_to_nid(page)]--;
+
+	__update_and_free_page(h, page);
+}
+
+static void __free_hugepage(struct hstate *h, struct page *page)
+{
+	int i;
+
 	for (i = 0; i < pages_per_huge_page(h); i++) {
 		page[i].flags &= ~(1 << PG_locked | 1 << PG_error |
 				1 << PG_referenced | 1 << PG_dirty |
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 4ffa2a4ae2a8..19f1898aaede 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -178,18 +178,6 @@
 #define RESERVE_VMEMMAP_NR		2U
 #define RESERVE_VMEMMAP_SIZE		(RESERVE_VMEMMAP_NR << PAGE_SHIFT)
 
-/*
- * How many vmemmap pages associated with a HugeTLB page that can be freed
- * to the buddy allocator.
- *
- * Todo: Returns zero for now, which means the feature is disabled. We will
- * enable it once all the infrastructure is there.
- */
-static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
-{
-	return 0;
-}
-
 static inline unsigned long free_vmemmap_pages_size_per_hpage(struct hstate *h)
 {
 	return (unsigned long)free_vmemmap_pages_per_hpage(h) << PAGE_SHIFT;
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 6923f03534d5..01f8637adbe0 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -12,9 +12,26 @@
 #ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
 void free_huge_page_vmemmap(struct hstate *h, struct page *head);
+
+/*
+ * How many vmemmap pages associated with a HugeTLB page that can be freed
+ * to the buddy allocator.
+ *
+ * Todo: Returns zero for now, which means the feature is disabled. We will
+ * enable it once all the infrastructure is there.
+ */
+static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
+{
+	return 0;
+}
 #else
 static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head)
 {
 }
+
+static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
+{
+	return 0;
+}
 #endif /* CONFIG_HUGETLB_PAGE_FREE_VMEMMAP */
 
 #endif /* _LINUX_HUGETLB_VMEMMAP_H */
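[Editorial note, added for illustration: the hugetlb.c hunk above relies on
the fact that page->mapping and struct llist_node are both a single
pointer-sized, pointer-aligned slot, so the address of page->mapping can
stand in for an llist_node while the page waits on the freelist, and the
worker recovers the page with container_of() on the same field. The two
helpers below are hypothetical (the patch open-codes the casts); they only
restate that round trip.]

#include <linux/kernel.h>
#include <linux/llist.h>
#include <linux/mm_types.h>

/* Treat the page's (currently unused) mapping slot as the llist link. */
static inline struct llist_node *hpage_freelist_node(struct page *page)
{
        return (struct llist_node *)&page->mapping;
}

/* Invert the cast above: recover the page from its mapping slot. */
static inline struct page *hpage_from_freelist_node(struct llist_node *node)
{
        return container_of((struct address_space **)node,
                            struct page, mapping);
}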