Message ID | 20210319224209.150047-8-mike.kravetz@oracle.com (mailing list archive)
---|---
State | New, archived
Series | make hugetlb put_page safe for all calling contexts
On Fri, 19 Mar 2021 15:42:08 -0700 Mike Kravetz wrote:
> +
> +	if (!can_sleep && free_page_may_sleep(h, page)) {
> +		/*
> +		 * Send page freeing to workqueue
> +		 *
> +		 * Only call schedule_work() if hpage_freelist is previously
> +		 * empty.  Otherwise, schedule_work() had been called but the
> +		 * workfn hasn't retrieved the list yet.
> +		 */
> +		if (llist_add((struct llist_node *)&page->mapping,
> +				&hpage_freelist))
> +			schedule_work(&free_hpage_work);
> +		return;
> +	}

Queue work on system_unbound_wq instead of system_wq because of blocking work.
On Fri, Mar 19, 2021 at 03:42:08PM -0700, Mike Kravetz wrote:
> The locks acquired in free_huge_page are irq safe. However, in certain
> circumstances the routine update_and_free_page could sleep. Since
> free_huge_page can be called from any context, it can not sleep.
>
> Use a workqueue to defer freeing of pages if the operation may sleep. A
> new routine update_and_free_page_no_sleep provides this functionality
> and is only called from free_huge_page.
>
> Note that any 'pages' sent to the workqueue for deferred freeing have
> already been removed from the hugetlb subsystem. What is actually
> deferred is returning those base pages to the low level allocator.

So maybe I'm stupid, but why do you need that work in hugetlb? Afaict it
should be in cma_release().

Also, afaict cma_release() does free_contig_range() *first*, and then
does the 'difficult' bits. So how about you re-order free_gigantic_page()
a bit to make it unconditionally do free_contig_range() and *then* call
into CMA, which can then do a workqueue thingy if it feels like it.

That way none of the hugetlb accounting is delayed, and only CMA gets to
suffer.
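[Editor's sketch] The re-ordering proposed above can be modelled in plain user-space C with stub functions standing in for the kernel routines. The names mirror free_contig_range() and a hypothetical asynchronous CMA hook; none of this is kernel API, it only demonstrates the call ordering being argued for:

```c
#include <assert.h>
#include <string.h>

/* Trace of calls, to make the ordering observable; the real kernel
 * routines of course free memory instead of logging. */
static char trace[128];

static void log_call(const char *s)
{
	strcat(trace, s);
	strcat(trace, ";");
}

/* Stand-ins for the real routines. */
static void free_contig_range_stub(void) { log_call("free_contig_range"); }
/* Hypothetical: CMA defers its mutex-taking bitmap work internally. */
static void cma_release_async_stub(void) { log_call("cma_release_async"); }

/*
 * Re-ordered free_gigantic_page(): return the pages (and thus complete
 * all hugetlb accounting) immediately, then hand the potentially
 * blocking CMA bookkeeping to CMA itself, which may defer it.
 */
static void free_gigantic_page_reordered(void)
{
	free_contig_range_stub();   /* always first: accounting not delayed */
	cma_release_async_stub();   /* only CMA's bitmap update is deferred */
}
```

The point of the ordering is that nothing hugetlb-visible waits on the workqueue; only CMA's internal state lags.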
On Fri 19-03-21 15:42:08, Mike Kravetz wrote:
> The locks acquired in free_huge_page are irq safe. However, in certain
> circumstances the routine update_and_free_page could sleep. Since
> free_huge_page can be called from any context, it can not sleep.
>
> Use a workqueue to defer freeing of pages if the operation may sleep. A
> new routine update_and_free_page_no_sleep provides this functionality
> and is only called from free_huge_page.
>
> Note that any 'pages' sent to the workqueue for deferred freeing have
> already been removed from the hugetlb subsystem. What is actually
> deferred is returning those base pages to the low level allocator.

This patch or its alternative would need to be applied prior to patch 6
which makes the whole context IRQ safe.

Besides that the changelog doesn't really say anything about changed
user visible behavior. Now if somebody decreases the GB huge pool from
the userspace the real effect on the freed up memory will be postponed
to some later time. That "later" is unpredictable as it depends on WQ
utilization. We definitely need some sort of wait_for_inflight pages.
One way to do that would be to have a dedicated WQ and schedule a sync
work item after the pool has been shrunk and wait for that item.
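[Editor's sketch] The "dedicated WQ plus sync work item" idea (withdrawn in the follow-up message, but worth illustrating) amounts to flushing a queue of deferred free operations before returning to userspace. A minimal single-threaded model of that flush semantic, with illustrative names rather than kernel API:

```c
#include <assert.h>

/* Toy model of a dedicated workqueue: a FIFO of work functions.
 * In the kernel this would be an alloc_workqueue() queue and the
 * "wait for the sync item" step would be flush_work()/flush_workqueue(). */
#define MAX_WORK 16
static void (*work_ring[MAX_WORK])(void);
static int work_head, work_tail;

static void queue_work_item(void (*fn)(void))
{
	work_ring[work_tail++ % MAX_WORK] = fn;
}

/* Run every queued item.  Returning only once the queue is empty models
 * scheduling a sync work item after the pool shrink and waiting on it:
 * when this returns, no freeing is still in flight. */
static void flush_queue(void)
{
	while (work_head != work_tail)
		work_ring[work_head++ % MAX_WORK]();
}

static int freed_pages;
static void free_hpage_item(void) { freed_pages++; }
```

The guarantee userspace would get is exactly the post-condition here: after the flush, the number of freed pages matches the number of pages handed off, with nothing pending.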
On Mon 22-03-21 15:42:27, Michal Hocko wrote:
[...]
> Besides that the changelog doesn't really say anything about changed
> user visible behavior. Now if somebody decreases the GB huge pool from
> the userspace the real effect on the freed up memory will be postponed
> to some later time. That "later" is unpredictable as it depends on WQ
> utilization. We definitely need some sort of wait_for_inflight pages.
> One way to do that would be to have a dedicated WQ and schedule a sync
> work item after the pool has been shrunk and wait for that item.

Scratch that. It is not really clear from the patch context but after
looking at the resulting code set_max_huge_pages will use the blockable
update_and_free_page so we should be fine. Sorry about the noise!
Cc: Roman, Christoph

On 3/22/21 1:41 AM, Peter Zijlstra wrote:
> On Fri, Mar 19, 2021 at 03:42:08PM -0700, Mike Kravetz wrote:
>> The locks acquired in free_huge_page are irq safe. However, in certain
>> circumstances the routine update_and_free_page could sleep. Since
>> free_huge_page can be called from any context, it can not sleep.
>>
>> Use a workqueue to defer freeing of pages if the operation may sleep. A
>> new routine update_and_free_page_no_sleep provides this functionality
>> and is only called from free_huge_page.
>>
>> Note that any 'pages' sent to the workqueue for deferred freeing have
>> already been removed from the hugetlb subsystem. What is actually
>> deferred is returning those base pages to the low level allocator.
>
> So maybe I'm stupid, but why do you need that work in hugetlb? Afaict it
> should be in cma_release().

My thinking (which could be totally wrong) is that cma_release makes no
claims about calling context. From the code, it is pretty clear that it
can only be called from task context with no locks held. Although, there
could be code incorrectly calling it today, as hugetlb does. Since
hugetlb is the only code with this new requirement, it should do the
work.

Wait!!! That made me remember something.
Roman had code to create a non-blocking version of cma_release().
https://lore.kernel.org/linux-mm/20201022225308.2927890-1-guro@fb.com/

There were no objections, and Christoph even thought there may be
problems with callers of dma_free_contiguous.

Perhaps, we should just move forward with Roman's patches to create
cma_release_nowait() and avoid this workqueue stuff?
On Mon, Mar 22, 2021 at 10:42:23AM -0700, Mike Kravetz wrote:
> Cc: Roman, Christoph
>
> On 3/22/21 1:41 AM, Peter Zijlstra wrote:
> > On Fri, Mar 19, 2021 at 03:42:08PM -0700, Mike Kravetz wrote:
> >> The locks acquired in free_huge_page are irq safe. However, in certain
> >> circumstances the routine update_and_free_page could sleep. Since
> >> free_huge_page can be called from any context, it can not sleep.
> >>
> >> Use a workqueue to defer freeing of pages if the operation may sleep. A
> >> new routine update_and_free_page_no_sleep provides this functionality
> >> and is only called from free_huge_page.
> >>
> >> Note that any 'pages' sent to the workqueue for deferred freeing have
> >> already been removed from the hugetlb subsystem. What is actually
> >> deferred is returning those base pages to the low level allocator.
> >
> > So maybe I'm stupid, but why do you need that work in hugetlb? Afaict it
> > should be in cma_release().
>
> My thinking (which could be totally wrong) is that cma_release makes no
> claims about calling context. From the code, it is pretty clear that it
> can only be called from task context with no locks held. Although, there
> could be code incorrectly calling it today, as hugetlb does. Since
> hugetlb is the only code with this new requirement, it should do the
> work.
>
> Wait!!! That made me remember something.
> Roman had code to create a non-blocking version of cma_release().
> https://lore.kernel.org/linux-mm/20201022225308.2927890-1-guro@fb.com/
>
> There were no objections, and Christoph even thought there may be
> problems with callers of dma_free_contiguous.
>
> Perhaps, we should just move forward with Roman's patches to create
> cma_release_nowait() and avoid this workqueue stuff?

Sounds good to me. If it's the preferred path, I can rebase and resend
those patches (they've been carried for some time by Zi Yan for his 1GB
THP work, but they are completely independent).

Thanks!

> --
> Mike Kravetz
On Mon, Mar 22, 2021 at 10:42:23AM -0700, Mike Kravetz wrote:
> Cc: Roman, Christoph
>
> On 3/22/21 1:41 AM, Peter Zijlstra wrote:
> > On Fri, Mar 19, 2021 at 03:42:08PM -0700, Mike Kravetz wrote:
> >> The locks acquired in free_huge_page are irq safe. However, in certain
> >> circumstances the routine update_and_free_page could sleep. Since
> >> free_huge_page can be called from any context, it can not sleep.
> >>
> >> Use a workqueue to defer freeing of pages if the operation may sleep. A
> >> new routine update_and_free_page_no_sleep provides this functionality
> >> and is only called from free_huge_page.
> >>
> >> Note that any 'pages' sent to the workqueue for deferred freeing have
> >> already been removed from the hugetlb subsystem. What is actually
> >> deferred is returning those base pages to the low level allocator.
> >
> > So maybe I'm stupid, but why do you need that work in hugetlb? Afaict it
> > should be in cma_release().
>
> My thinking (which could be totally wrong) is that cma_release makes no
> claims about calling context. From the code, it is pretty clear that it
> can only be called from task context with no locks held. Although, there
> could be code incorrectly calling it today, as hugetlb does. Since
> hugetlb is the only code with this new requirement, it should do the
> work.
>
> Wait!!! That made me remember something.
> Roman had code to create a non-blocking version of cma_release().
> https://lore.kernel.org/linux-mm/20201022225308.2927890-1-guro@fb.com/
>
> There were no objections, and Christoph even thought there may be
> problems with callers of dma_free_contiguous.
>
> Perhaps, we should just move forward with Roman's patches to create
> cma_release_nowait() and avoid this workqueue stuff?

Ha!, that basically does as I suggested. Using that page is unfortunate
in that it will destroy the contig range for allocations until the work
happens, but I'm not sure I see a nice alternative.
On 3/22/21 11:10 AM, Roman Gushchin wrote:
> On Mon, Mar 22, 2021 at 10:42:23AM -0700, Mike Kravetz wrote:
>> Cc: Roman, Christoph
>>
>> On 3/22/21 1:41 AM, Peter Zijlstra wrote:
>>> On Fri, Mar 19, 2021 at 03:42:08PM -0700, Mike Kravetz wrote:
>>>> The locks acquired in free_huge_page are irq safe. However, in certain
>>>> circumstances the routine update_and_free_page could sleep. Since
>>>> free_huge_page can be called from any context, it can not sleep.
>>>>
>>>> Use a workqueue to defer freeing of pages if the operation may sleep. A
>>>> new routine update_and_free_page_no_sleep provides this functionality
>>>> and is only called from free_huge_page.
>>>>
>>>> Note that any 'pages' sent to the workqueue for deferred freeing have
>>>> already been removed from the hugetlb subsystem. What is actually
>>>> deferred is returning those base pages to the low level allocator.
>>>
>>> So maybe I'm stupid, but why do you need that work in hugetlb? Afaict it
>>> should be in cma_release().
>>
>> My thinking (which could be totally wrong) is that cma_release makes no
>> claims about calling context. From the code, it is pretty clear that it
>> can only be called from task context with no locks held. Although, there
>> could be code incorrectly calling it today, as hugetlb does. Since
>> hugetlb is the only code with this new requirement, it should do the
>> work.
>>
>> Wait!!! That made me remember something.
>> Roman had code to create a non-blocking version of cma_release().
>> https://lore.kernel.org/linux-mm/20201022225308.2927890-1-guro@fb.com/
>>
>> There were no objections, and Christoph even thought there may be
>> problems with callers of dma_free_contiguous.
>>
>> Perhaps, we should just move forward with Roman's patches to create
>> cma_release_nowait() and avoid this workqueue stuff?
>
> Sounds good to me. If it's the preferred path, I can rebase and resend
> those patches (they've been carried for some time by Zi Yan for his 1GB
> THP work, but they are completely independent).

Thanks Roman,

Yes, this is the preferred path. If there is a non blocking version of
cma_release, then it makes fixup of hugetlb put_page path much easier.

If you would prefer, I can rebase your patches and send with this series.
On Tue, Mar 23, 2021 at 11:51:04AM -0700, Mike Kravetz wrote:
> On 3/22/21 11:10 AM, Roman Gushchin wrote:
> > On Mon, Mar 22, 2021 at 10:42:23AM -0700, Mike Kravetz wrote:
> >> Cc: Roman, Christoph
> >>
> >> On 3/22/21 1:41 AM, Peter Zijlstra wrote:
> >>> On Fri, Mar 19, 2021 at 03:42:08PM -0700, Mike Kravetz wrote:
> >>>> The locks acquired in free_huge_page are irq safe. However, in certain
> >>>> circumstances the routine update_and_free_page could sleep. Since
> >>>> free_huge_page can be called from any context, it can not sleep.
> >>>>
> >>>> Use a workqueue to defer freeing of pages if the operation may sleep. A
> >>>> new routine update_and_free_page_no_sleep provides this functionality
> >>>> and is only called from free_huge_page.
> >>>>
> >>>> Note that any 'pages' sent to the workqueue for deferred freeing have
> >>>> already been removed from the hugetlb subsystem. What is actually
> >>>> deferred is returning those base pages to the low level allocator.
> >>>
> >>> So maybe I'm stupid, but why do you need that work in hugetlb? Afaict it
> >>> should be in cma_release().
> >>
> >> My thinking (which could be totally wrong) is that cma_release makes no
> >> claims about calling context. From the code, it is pretty clear that it
> >> can only be called from task context with no locks held. Although, there
> >> could be code incorrectly calling it today, as hugetlb does. Since
> >> hugetlb is the only code with this new requirement, it should do the
> >> work.
> >>
> >> Wait!!! That made me remember something.
> >> Roman had code to create a non-blocking version of cma_release().
> >> https://lore.kernel.org/linux-mm/20201022225308.2927890-1-guro@fb.com/
> >>
> >> There were no objections, and Christoph even thought there may be
> >> problems with callers of dma_free_contiguous.
> >>
> >> Perhaps, we should just move forward with Roman's patches to create
> >> cma_release_nowait() and avoid this workqueue stuff?
> >
> > Sounds good to me. If it's the preferred path, I can rebase and resend
> > those patches (they've been carried for some time by Zi Yan for his 1GB
> > THP work, but they are completely independent).
>
> Thanks Roman,
>
> Yes, this is the preferred path. If there is a non blocking version of
> cma_release, then it makes fixup of hugetlb put_page path much easier.
>
> If you would prefer, I can rebase your patches and send with this series.

Sounds good! Please, proceed. And, please, let me know if I can help.

Thanks!
On Tue 23-03-21 11:51:04, Mike Kravetz wrote:
> On 3/22/21 11:10 AM, Roman Gushchin wrote:
> > On Mon, Mar 22, 2021 at 10:42:23AM -0700, Mike Kravetz wrote:
> >> Cc: Roman, Christoph
> >>
> >> On 3/22/21 1:41 AM, Peter Zijlstra wrote:
> >>> On Fri, Mar 19, 2021 at 03:42:08PM -0700, Mike Kravetz wrote:
> >>>> The locks acquired in free_huge_page are irq safe. However, in certain
> >>>> circumstances the routine update_and_free_page could sleep. Since
> >>>> free_huge_page can be called from any context, it can not sleep.
> >>>>
> >>>> Use a workqueue to defer freeing of pages if the operation may sleep. A
> >>>> new routine update_and_free_page_no_sleep provides this functionality
> >>>> and is only called from free_huge_page.
> >>>>
> >>>> Note that any 'pages' sent to the workqueue for deferred freeing have
> >>>> already been removed from the hugetlb subsystem. What is actually
> >>>> deferred is returning those base pages to the low level allocator.
> >>>
> >>> So maybe I'm stupid, but why do you need that work in hugetlb? Afaict it
> >>> should be in cma_release().
> >>
> >> My thinking (which could be totally wrong) is that cma_release makes no
> >> claims about calling context. From the code, it is pretty clear that it
> >> can only be called from task context with no locks held. Although, there
> >> could be code incorrectly calling it today, as hugetlb does. Since
> >> hugetlb is the only code with this new requirement, it should do the
> >> work.
> >>
> >> Wait!!! That made me remember something.
> >> Roman had code to create a non-blocking version of cma_release().
> >> https://lore.kernel.org/linux-mm/20201022225308.2927890-1-guro@fb.com/
> >>
> >> There were no objections, and Christoph even thought there may be
> >> problems with callers of dma_free_contiguous.
> >>
> >> Perhaps, we should just move forward with Roman's patches to create
> >> cma_release_nowait() and avoid this workqueue stuff?
> >
> > Sounds good to me. If it's the preferred path, I can rebase and resend
> > those patches (they've been carried for some time by Zi Yan for his 1GB
> > THP work, but they are completely independent).
>
> Thanks Roman,
>
> Yes, this is the preferred path. If there is a non blocking version of
> cma_release, then it makes fixup of hugetlb put_page path much easier.

I do not object to the plan, I just want to point out that the sparse
vmemmap for hugetlb pages will need to recognize sleep/nosleep variants
of the freeing path as well to handle its vmemmap repopulate games.
On 3/24/21 1:43 AM, Michal Hocko wrote:
> On Tue 23-03-21 11:51:04, Mike Kravetz wrote:
>> On 3/22/21 11:10 AM, Roman Gushchin wrote:
>>> On Mon, Mar 22, 2021 at 10:42:23AM -0700, Mike Kravetz wrote:
>>>> Cc: Roman, Christoph
>>>>
>>>> On 3/22/21 1:41 AM, Peter Zijlstra wrote:
>>>>> On Fri, Mar 19, 2021 at 03:42:08PM -0700, Mike Kravetz wrote:
>>>>>> The locks acquired in free_huge_page are irq safe. However, in certain
>>>>>> circumstances the routine update_and_free_page could sleep. Since
>>>>>> free_huge_page can be called from any context, it can not sleep.
>>>>>>
>>>>>> Use a workqueue to defer freeing of pages if the operation may sleep. A
>>>>>> new routine update_and_free_page_no_sleep provides this functionality
>>>>>> and is only called from free_huge_page.
>>>>>>
>>>>>> Note that any 'pages' sent to the workqueue for deferred freeing have
>>>>>> already been removed from the hugetlb subsystem. What is actually
>>>>>> deferred is returning those base pages to the low level allocator.
>>>>>
>>>>> So maybe I'm stupid, but why do you need that work in hugetlb? Afaict it
>>>>> should be in cma_release().
>>>>
>>>> My thinking (which could be totally wrong) is that cma_release makes no
>>>> claims about calling context. From the code, it is pretty clear that it
>>>> can only be called from task context with no locks held. Although, there
>>>> could be code incorrectly calling it today, as hugetlb does. Since
>>>> hugetlb is the only code with this new requirement, it should do the
>>>> work.
>>>>
>>>> Wait!!! That made me remember something.
>>>> Roman had code to create a non-blocking version of cma_release().
>>>> https://lore.kernel.org/linux-mm/20201022225308.2927890-1-guro@fb.com/
>>>>
>>>> There were no objections, and Christoph even thought there may be
>>>> problems with callers of dma_free_contiguous.
>>>>
>>>> Perhaps, we should just move forward with Roman's patches to create
>>>> cma_release_nowait() and avoid this workqueue stuff?
>>>
>>> Sounds good to me. If it's the preferred path, I can rebase and resend
>>> those patches (they've been carried for some time by Zi Yan for his 1GB
>>> THP work, but they are completely independent).
>>
>> Thanks Roman,
>>
>> Yes, this is the preferred path. If there is a non blocking version of
>> cma_release, then it makes fixup of hugetlb put_page path much easier.
>
> I do not object to the plan, I just want to point out that the sparse
> vmemmap for hugetlb pages will need to recognize sleep/nosleep variants
> of the freeing path as well to handle its vmemmap repopulate games.

Yes, I also commented elsewhere that we will likely want to do the
drop/reacquire lock for each page in the looping page free routines when
adding the vmemmap freeing support.

Unless someone thinks otherwise, I still think it is better to first fix
the hugetlb put_page/free_huge_page path with this series. Then move on
to the free vmemmap series.
On 3/19/21 6:18 PM, Hillf Danton wrote:
> On Fri, 19 Mar 2021 15:42:08 -0700 Mike Kravetz wrote:
>> +
>> +	if (!can_sleep && free_page_may_sleep(h, page)) {
>> +		/*
>> +		 * Send page freeing to workqueue
>> +		 *
>> +		 * Only call schedule_work() if hpage_freelist is previously
>> +		 * empty.  Otherwise, schedule_work() had been called but the
>> +		 * workfn hasn't retrieved the list yet.
>> +		 */
>> +		if (llist_add((struct llist_node *)&page->mapping,
>> +				&hpage_freelist))
>> +			schedule_work(&free_hpage_work);
>> +		return;
>> +	}
>
> Queue work on system_unbound_wq instead of system_wq because of blocking work.

Thanks Hillf,

I am dropping this patch and going with Roman's patches to create a
version of cma_release that will not sleep. A workqueue handoff like
this may be needed in the vmemmap reduction series, so will keep this
in mind.
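[Editor's sketch] For readers unfamiliar with the llist handoff used in the dropped patch, the push/detach-all pattern can be reproduced with user-space C11 atomics. This models the pattern only; the kernel's <linux/llist.h> has its own lock-free implementation, and the names below are not kernel API:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct lnode { struct lnode *next; };

/* Lockless singly-linked stack of pages awaiting freeing. */
static _Atomic(struct lnode *) freelist;

/* Push one node.  Returns 1 if the list was previously empty, mirroring
 * llist_add()'s return value, which is what gates the schedule_work()
 * call: only the producer that installs the first entry kicks the worker. */
static int list_add_node(struct lnode *n)
{
	struct lnode *first = atomic_load(&freelist);

	do {
		n->next = first;   /* re-linked on every CAS retry */
	} while (!atomic_compare_exchange_weak(&freelist, &first, n));

	return first == NULL;
}

/* Detach the entire list in one atomic step, as llist_del_all() does;
 * the worker then walks it without any further synchronization. */
static struct lnode *list_del_all_nodes(void)
{
	return atomic_exchange(&freelist, (struct lnode *)NULL);
}
```

Because producers push at the head, the worker receives the nodes in LIFO order, which is harmless here since every node gets the same treatment.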
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index f42d44050548..a81ca39c06be 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -666,9 +666,14 @@ static inline unsigned huge_page_shift(struct hstate *h)
 	return h->order + PAGE_SHIFT;
 }
 
+static inline bool order_is_gigantic(unsigned int order)
+{
+	return order >= MAX_ORDER;
+}
+
 static inline bool hstate_is_gigantic(struct hstate *h)
 {
-	return huge_page_order(h) >= MAX_ORDER;
+	return order_is_gigantic(huge_page_order(h));
 }
 
 static inline unsigned int pages_per_huge_page(struct hstate *h)
@@ -942,6 +947,11 @@ static inline unsigned int huge_page_shift(struct hstate *h)
 	return PAGE_SHIFT;
 }
 
+static inline bool order_is_gigantic(unsigned int order)
+{
+	return false;
+}
+
 static inline bool hstate_is_gigantic(struct hstate *h)
 {
 	return false;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 82614bbe7bb9..b8304b290a73 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1351,7 +1351,60 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
 	h->nr_huge_pages_node[nid]--;
 }
 
-static void update_and_free_page(struct hstate *h, struct page *page)
+/*
+ * free_huge_page() can be called from any context.  However, the freeing
+ * of a hugetlb page can potentially sleep.  If freeing will sleep, defer
+ * the actual freeing to a workqueue to prevent sleeping in contexts where
+ * sleeping is not allowed.
+ *
+ * Use the page->mapping pointer as a llist_node structure for the lockless
+ * linked list of pages to be freed.  free_hpage_workfn() locklessly
+ * retrieves the linked list of pages to be freed and frees them one-by-one.
+ *
+ * The page passed to __free_huge_page is technically not a hugetlb page, so
+ * we can not use interfaces such as page_hstate().
+ */
+static void __free_huge_page(struct page *page)
+{
+	unsigned int order = compound_order(page);
+
+	if (order_is_gigantic(order)) {
+		destroy_compound_gigantic_page(page, order);
+		free_gigantic_page(page, order);
+	} else {
+		__free_pages(page, order);
+	}
+}
+
+static LLIST_HEAD(hpage_freelist);
+
+static void free_hpage_workfn(struct work_struct *work)
+{
+	struct llist_node *node;
+	struct page *page;
+
+	node = llist_del_all(&hpage_freelist);
+
+	while (node) {
+		page = container_of((struct address_space **)node,
+					struct page, mapping);
+		node = node->next;
+		__free_huge_page(page);
+	}
+}
+static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
+
+static bool free_page_may_sleep(struct hstate *h, struct page *page)
+{
+	/* freeing gigantic pages in CMA may sleep */
+	if (hstate_is_gigantic(h))
+		return true;
+
+	return false;
+}
+
+static void __update_and_free_page(struct hstate *h, struct page *page,
+					bool can_sleep)
 {
 	int i;
 	struct page *subpage = page;
@@ -1366,6 +1419,21 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 				1 << PG_active | 1 << PG_private |
 				1 << PG_writeback);
 	}
+
+	if (!can_sleep && free_page_may_sleep(h, page)) {
+		/*
+		 * Send page freeing to workqueue
+		 *
+		 * Only call schedule_work() if hpage_freelist is previously
+		 * empty.  Otherwise, schedule_work() had been called but the
+		 * workfn hasn't retrieved the list yet.
+		 */
+		if (llist_add((struct llist_node *)&page->mapping,
+				&hpage_freelist))
+			schedule_work(&free_hpage_work);
+		return;
+	}
+
 	if (hstate_is_gigantic(h)) {
 		destroy_compound_gigantic_page(page, huge_page_order(h));
 		free_gigantic_page(page, huge_page_order(h));
@@ -1374,6 +1442,18 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 	}
 }
 
+static void update_and_free_page_no_sleep(struct hstate *h, struct page *page)
+{
+	/* can not sleep */
+	return __update_and_free_page(h, page, false);
+}
+
+static void update_and_free_page(struct hstate *h, struct page *page)
+{
+	/* can sleep */
+	return __update_and_free_page(h, page, true);
+}
+
 struct hstate *size_to_hstate(unsigned long size)
 {
 	struct hstate *h;
@@ -1436,12 +1516,12 @@ void free_huge_page(struct page *page)
 	if (HPageTemporary(page)) {
 		remove_hugetlb_page(h, page, false);
 		spin_unlock_irqrestore(&hugetlb_lock, flags);
-		update_and_free_page(h, page);
+		update_and_free_page_no_sleep(h, page);
 	} else if (h->surplus_huge_pages_node[nid]) {
 		/* remove the page from active list */
 		remove_hugetlb_page(h, page, true);
 		spin_unlock_irqrestore(&hugetlb_lock, flags);
-		update_and_free_page(h, page);
+		update_and_free_page_no_sleep(h, page);
 	} else {
 		arch_clear_hugepage_flags(page);
 		enqueue_huge_page(h, page);
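[Editor's sketch] The patch's trick of overlaying an llist_node on page->mapping and recovering the page with container_of() can be demonstrated stand-alone. Here `struct fakepage` is a made-up stand-in for struct page, and the macro is the usual container_of idiom, not kernel-specific code:

```c
#include <assert.h>
#include <stddef.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct lnode { struct lnode *next; };

/* Stand-in for struct page: while the page is being freed, nothing
 * uses ->mapping, so its storage can double as a list node. */
struct fakepage {
	unsigned long flags;
	void *mapping;
};

/* Push a page onto a lockless-style list by writing the next pointer
 * into the storage normally occupied by ->mapping. */
static struct lnode *push_page(struct lnode *head, struct fakepage *p)
{
	struct lnode *n = (struct lnode *)&p->mapping;

	n->next = head;
	return n;
}

/* Recover the enclosing page from the node address: the node *is* the
 * mapping field, so subtract its offset within the struct. */
static struct fakepage *node_to_page(struct lnode *n)
{
	return container_of((void **)n, struct fakepage, mapping);
}
```

This is why the patch can defer freeing without allocating any memory for the queue: the pages being freed carry their own list linkage.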
The locks acquired in free_huge_page are irq safe. However, in certain
circumstances the routine update_and_free_page could sleep. Since
free_huge_page can be called from any context, it can not sleep.

Use a workqueue to defer freeing of pages if the operation may sleep. A
new routine update_and_free_page_no_sleep provides this functionality
and is only called from free_huge_page.

Note that any 'pages' sent to the workqueue for deferred freeing have
already been removed from the hugetlb subsystem. What is actually
deferred is returning those base pages to the low level allocator.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h | 12 +++++-
 mm/hugetlb.c            | 86 +++++++++++++++++++++++++++++++++++++++--
 2 files changed, 94 insertions(+), 4 deletions(-)