Message ID | 20200716123810.25292-13-osalvador@suse.de (mailing list archive) |
---|---|
State | New, archived |
Series | Hwpoison soft-offline rework |
On Thu, Jul 16, 2020 at 02:38:06PM +0200, Oscar Salvador wrote:
> This patch changes the way we set and handle in-use poisoned pages.
> Until now, poisoned pages were released to the buddy allocator, trusting
> that the checks that take place prior to delivering the page to its end
> user would act as a safety net and would skip that page.
>
> This has proved to be wrong, as we got some pfn walkers out there, like
> compaction, that all they care about is that the page is PageBuddy and
> sits in a freelist.
> Although this might not be the only user, having poisoned pages
> in the buddy allocator seems a bad idea, as we should only have
> free pages that are ready and meant to be used as such.
>
> Before explaining the approach taken, let us break down the kinds
> of pages we can soft offline.
>
> - Anonymous THP (after the split, they end up being 4K pages)
> - Hugetlb
> - Order-0 pages (that can be either migrated or invalidated)
>
> * Normal pages (order-0 and anon-THP)
>
>   - If they are clean and unmapped page cache pages, we invalidate
>     them by means of invalidate_inode_page().
>   - If they are mapped/dirty, we do the isolate-and-migrate dance.
>
> Either way, we do not call put_page directly from those paths.
> Instead, we keep the page and send it to page_handle_poison to perform
> the right handling.
>
> Among other things, page_handle_poison() sets the HWPoison flag and does
> the last put_page.
> This call to put_page is mainly to be able to call __page_cache_release,
> since this function is not exported.
>
> Down the chain, we placed a check for HWPoison pages in
> free_pages_prepare, which just skips any poisoned page, so those pages
> do not end up either in a pcplist or in a buddy freelist.
>
> After that, we set the refcount on the page to 1 and we increment
> the poisoned pages counter.
>
> We could do as we do for free pages:
> 1) wait until the page hits buddy's freelists
> 2) take it off
> 3) flag it
>
> The problem is that we could race with an allocation, so by the time we
> want to take the page off the buddy, the page is already allocated, and
> we cannot soft-offline it.
> This is not fatal, of course, but it is better if we can close the race,
> as doing so does not require a lot of code.
>
> * Hugetlb pages
>
>   - We isolate-and-migrate them
>
> There is no magic in here, we just isolate and migrate them.
> A new set of internal functions has been added to flag a hugetlb page as
> poisoned (SetPageHugePoisoned(), PageHugePoisoned(), ClearPageHugePoisoned()).
> This allows us to flag the page when we migrate it, back in
> move_hugetlb_state().
>
> Later on we check whether the page is poisoned in __free_huge_page,
> and we bail out in that case before sending the page to, e.g., the
> active free list.
> This gives us full control of the page, and we can handle it in
> page_handle_poison().
>
> In other words, we do not allow migrated hugepages to get back to the
> freelists.
>
> Since now the page has no user and has been migrated, we can call
> dissolve_free_huge_page, which will end up calling update_and_free_page.
> In update_and_free_page(), we check whether the page is poisoned.
> If so, we handle it as we handle gigantic pages, i.e., we break down
> the page into order-0 pages and free them one by one.
> Doing so allows free_pages_prepare to skip poisoned pages.
>
> Because of the way we now handle in-use pages, we no longer need the
> put-as-isolation-migratetype dance that guarded against poisoned pages
> ending up in pcplists.
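[Editor's aside: the order-0 flow described in the changelog above condenses
to roughly the following sketch. It is pieced together from the hunks quoted
further down, not the literal upstream code; new_page() stands in for the
allocation callback that the real caller passes to migrate_pages(), and most
error handling is trimmed.]

/* Sketch only: condensed from the patch below, not the exact code. */
static int soft_offline_order0_sketch(struct page *page)
{
        LIST_HEAD(pagelist);

        /* Clean, unmapped page cache: invalidate_inode_page() drops it. */
        if (invalidate_inode_page(page)) {
                /* Flag it, release the last reference, bump the counter. */
                page_handle_poison(page, true, true, false);
                return 0;
        }

        /* Mapped or dirty: the isolate-and-migrate dance. */
        if (!isolate_lru_page(page)) {
                list_add(&page->lru, &pagelist);
                if (!migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
                                   MIGRATE_SYNC, MR_MEMORY_FAILURE)) {
                        page_handle_poison(page, true, true, false);
                        return 0;
                }
                putback_movable_pages(&pagelist);
        }
        return -EBUSY;
}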
I ran Qian Cai's test program (https://github.com/cailca/linux-mm) on a
small (4GB memory) VM, and weirdly found that (1) the target hugepages
are not always dissolved and (2) dissolved hugepages are still counted
in "HugePages_Total:". See below:

$ ./random 1
- start: migrate_huge_offline
- use NUMA nodes 0,1.
- mmap and free 8388608 bytes hugepages on node 0
- mmap and free 8388608 bytes hugepages on node 1
madvise: Cannot allocate memory

$ cat /proc/meminfo
MemTotal:        4026772 kB
MemFree:          976300 kB
MemAvailable:     892840 kB
Buffers:           20936 kB
Cached:            99768 kB
SwapCached:         5904 kB
Active:            84332 kB
Inactive:         116328 kB
Active(anon):      27944 kB
Inactive(anon):    68524 kB
Active(file):      56388 kB
Inactive(file):    47804 kB
Unevictable:        7532 kB
Mlocked:               0 kB
SwapTotal:       2621436 kB
SwapFree:        2609844 kB
Dirty:                56 kB
Writeback:             0 kB
AnonPages:         81764 kB
Mapped:            54348 kB
Shmem:              8948 kB
KReclaimable:      22744 kB
Slab:              52056 kB
SReclaimable:      22744 kB
SUnreclaim:        29312 kB
KernelStack:        3888 kB
PageTables:         2804 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     3260612 kB
Committed_AS:     828196 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       19260 kB
VmallocChunk:          0 kB
Percpu:             5120 kB
HardwareCorrupted:  5368 kB
AnonHugePages:     18432 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:    1342    // still counted as hugetlb pages.
HugePages_Free:        0    // all hugepages are still allocated (or leaked?)
HugePages_Rsvd:        0
HugePages_Surp:      762    // some are counted in surplus.
Hugepagesize:       2048 kB
Hugetlb:         2748416 kB
DirectMap4k:      112480 kB
DirectMap2M:     4081664 kB

$ page-types -b hwpoison
             flags      page-count       MB  symbolic-flags                               long-symbolic-flags
0x0000000000080008             421        1  ___U_______________X_______________________  uptodate,hwpoison
0x00000000000a8018               1        0  ___UD__________H_G_X_______________________  uptodate,dirty,compound_head,huge,hwpoison
0x00000000000a801c             920        3  __RUD__________H_G_X_______________________  referenced,uptodate,dirty,compound_head,huge,hwpoison
             total            1342        5

This means that some hugepages are dissolved but others are not, which
is probably not desirable.
I'll dig into this more later, but just let me share it first.

A few minor comments below ...
>
> Signed-off-by: Oscar Salvador <osalvador@suse.com>
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> ---
>  include/linux/page-flags.h |  5 ----
>  mm/hugetlb.c               | 60 +++++++++++++++++++++++++++++++++-----
>  mm/memory-failure.c        | 53 +++++++++++++--------------------
>  mm/migrate.c               | 11 ++-----
>  mm/page_alloc.c            | 38 +++++++-----------------
>  5 files changed, 86 insertions(+), 81 deletions(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 01baf6d426ff..2ac8bfa0cf20 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -426,13 +426,8 @@ PAGEFLAG(HWPoison, hwpoison, PF_ANY)
>  TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
>  #define __PG_HWPOISON (1UL << PG_hwpoison)
>  extern bool take_page_off_buddy(struct page *page);
> -extern bool set_hwpoison_free_buddy_page(struct page *page);
>  #else
>  PAGEFLAG_FALSE(HWPoison)
> -static inline bool set_hwpoison_free_buddy_page(struct page *page)
> -{
> -        return 0;
> -}
>  #define __PG_HWPOISON 0
>  #endif
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 7badb01d15e3..1c6397936512 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -29,6 +29,7 @@
>  #include <linux/numa.h>
>  #include <linux/llist.h>
>  #include <linux/cma.h>
> +#include <linux/migrate.h>
>
>  #include <asm/page.h>
>  #include <asm/pgalloc.h>
> @@ -1209,9 +1210,26 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
>                  ((node = hstate_next_node_to_free(hs, mask)) || 1);        \
>                  nr_nodes--)
>
> -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
> -static void destroy_compound_gigantic_page(struct page *page,
> -                                        unsigned int order)
> +static inline bool PageHugePoisoned(struct page *page)
> +{
> +        if (!PageHuge(page))
> +                return false;
> +
> +        return (unsigned long)page[3].mapping == -1U;
> +}
> +
> +static inline void SetPageHugePoisoned(struct page *page)
> +{
> +        page[3].mapping = (void *)-1U;
> +}
> +
> +static inline void ClearPageHugePoisoned(struct page *page)
> +{
> +        page[3].mapping = NULL;
> +}
> +
> +static void destroy_compound_gigantic_page(struct hstate *h, struct page *page,
> +                                        unsigned int order)
>  {
>          int i;
>          int nr_pages = 1 << order;
> @@ -1222,14 +1240,19 @@ static void destroy_compound_gigantic_page(struct page *page,
>          atomic_set(compound_pincount_ptr(page), 0);
>
>          for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
> +                if (!hstate_is_gigantic(h))
> +                        p->mapping = NULL;
>                  clear_compound_head(p);
>                  set_page_refcounted(p);
>          }
>
> +        if (PageHugePoisoned(page))
> +                ClearPageHugePoisoned(page);
>          set_compound_order(page, 0);
>          __ClearPageHead(page);
>  }
>
> +#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
>  static void free_gigantic_page(struct page *page, unsigned int order)
>  {
>          /*
> @@ -1284,13 +1307,16 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
>          return NULL;
>  }
>  static inline void free_gigantic_page(struct page *page, unsigned int order) { }
> -static inline void destroy_compound_gigantic_page(struct page *page,
> -                                                unsigned int order) { }
> +static inline void destroy_compound_gigantic_page(struct hstate *h,
> +                                                  struct page *page,
> +                                                  unsigned int order) { }
>  #endif
>
>  static void update_and_free_page(struct hstate *h, struct page *page)
>  {
>          int i;
> +        bool poisoned = PageHugePoisoned(page);
> +        unsigned int order = huge_page_order(h);
>
>          if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
>                  return;
> @@ -1313,11 +1339,21 @@ static void update_and_free_page(struct hstate *h, struct page *page)
>           * we might block in free_gigantic_page().
>           */
>          spin_unlock(&hugetlb_lock);
> -        destroy_compound_gigantic_page(page, huge_page_order(h));
> -        free_gigantic_page(page, huge_page_order(h));
> +        destroy_compound_gigantic_page(h, page, order);
> +        free_gigantic_page(page, order);
>          spin_lock(&hugetlb_lock);
>  } else {
> -        __free_pages(page, huge_page_order(h));
> +        if (unlikely(poisoned)) {
> +                /*
> +                 * If the hugepage is poisoned, do as we do for
> +                 * gigantic pages and free the pages as order-0.
> +                 * free_pages_prepare will skip over the poisoned ones.
> +                 */
> +                destroy_compound_gigantic_page(h, page, order);

This function is for gigantic pages, judging from its name, so it shouldn't
be called for non-gigantic huge pages. Maybe renaming it and/or introducing
some inner function layer to factor out the common part would be better.

> +                free_contig_range(page_to_pfn(page), 1 << order);
> +        } else {
> +                __free_pages(page, huge_page_order(h));
> +        }
>  }
>  }
>
> @@ -1427,6 +1463,11 @@ static void __free_huge_page(struct page *page)
>          if (restore_reserve)
>                  h->resv_huge_pages++;
>
> +        if (PageHugePoisoned(page)) {
> +                spin_unlock(&hugetlb_lock);
> +                return;
> +        }
> +
>          if (PageHugeTemporary(page)) {
>                  list_del(&page->lru);
>                  ClearPageHugeTemporary(page);
> @@ -5642,6 +5683,9 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
>          hugetlb_cgroup_migrate(oldpage, newpage);
>          set_page_owner_migrate_reason(newpage, reason);
>
> +        if (reason == MR_MEMORY_FAILURE)
> +                SetPageHugePoisoned(oldpage);
> +
>          /*
>           * transfer temporary state of the new huge page. This is
>           * reverse to other transitions because the newpage is going to
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index caf012d34607..c0ebab4eed4c 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -65,9 +65,17 @@ int sysctl_memory_failure_recovery __read_mostly = 1;
>
>  atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
>
> -static void page_handle_poison(struct page *page)
> +static void page_handle_poison(struct page *page, bool release, bool set_flag,
> +                               bool huge_flag)
>  {
> -        SetPageHWPoison(page);
> +        if (set_flag)
> +                SetPageHWPoison(page);
> +
> +        if (huge_flag)
> +                dissolve_free_huge_page(page);
> +        else if (release)
> +                put_page(page);
> +

The indentation seems to be broken; you can run checkpatch.pl to find the
details.

Thanks,
Naoya Horiguchi

>          page_ref_inc(page);
>          num_poisoned_pages_inc();
>  }
> @@ -1717,7 +1725,7 @@ static int get_any_page(struct page *page, unsigned long pfn)
>
>  static int soft_offline_huge_page(struct page *page)
>  {
> -        int ret;
> +        int ret = -EBUSY;
>          unsigned long pfn = page_to_pfn(page);
>          struct page *hpage = compound_head(page);
>          LIST_HEAD(pagelist);
> @@ -1757,19 +1765,12 @@ static int soft_offline_huge_page(struct page *page)
>                  ret = -EIO;
>          } else {
>                  /*
> -                 * We set PG_hwpoison only when the migration source hugepage
> -                 * was successfully dissolved, because otherwise hwpoisoned
> -                 * hugepage remains on free hugepage list, then userspace will
> -                 * find it as SIGBUS by allocation failure. That's not expected
> -                 * in soft-offlining.
> +                 * At this point the page cannot be in-use since we do not
> +                 * let the page to go back to hugetlb freelists.
> +                 * In that case we just need to dissolve it.
> +                 * page_handle_poison will take care of it.
>                   */
> -                ret = dissolve_free_huge_page(page);
> -                if (!ret) {
> -                        if (set_hwpoison_free_buddy_page(page))
> -                                num_poisoned_pages_inc();
> -                        else
> -                                ret = -EBUSY;
> -                }
> +                page_handle_poison(page, true, true, true);
>          }
>          return ret;
>  }
> @@ -1804,10 +1805,8 @@ static int __soft_offline_page(struct page *page)
>           * would need to fix isolation locking first.
>           */
>          if (ret == 1) {
> -                put_page(page);
>                  pr_info("soft_offline: %#lx: invalidated\n", pfn);
> -                SetPageHWPoison(page);
> -                num_poisoned_pages_inc();
> +                page_handle_poison(page, true, true, false);
>                  return 0;
>          }
>
> @@ -1838,7 +1837,9 @@ static int __soft_offline_page(struct page *page)
>          list_add(&page->lru, &pagelist);
>          ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
>                                  MIGRATE_SYNC, MR_MEMORY_FAILURE);
> -        if (ret) {
> +        if (!ret) {
> +                page_handle_poison(page, true, true, false);
> +        } else {
>                  if (!list_empty(&pagelist))
>                          putback_movable_pages(&pagelist);
>
> @@ -1857,37 +1858,25 @@ static int __soft_offline_page(struct page *page)
>  static int soft_offline_in_use_page(struct page *page)
>  {
>          int ret;
> -        int mt;
>          struct page *hpage = compound_head(page);
>
>          if (!PageHuge(page) && PageTransHuge(hpage))
>                  if (try_to_split_thp_page(page, "soft offline") < 0)
>                          return -EBUSY;
>
> -        /*
> -         * Setting MIGRATE_ISOLATE here ensures that the page will be linked
> -         * to free list immediately (not via pcplist) when released after
> -         * successful page migration. Otherwise we can't guarantee that the
> -         * page is really free after put_page() returns, so
> -         * set_hwpoison_free_buddy_page() highly likely fails.
> -         */
> -        mt = get_pageblock_migratetype(page);
> -        set_pageblock_migratetype(page, MIGRATE_ISOLATE);
>          if (PageHuge(page))
>                  ret = soft_offline_huge_page(page);
>          else
>                  ret = __soft_offline_page(page);
> -        set_pageblock_migratetype(page, mt);
>          return ret;
>  }
>
>  static int soft_offline_free_page(struct page *page)
>  {
>          int rc = -EBUSY;
> -        int rc = dissolve_free_huge_page(page);
>
>          if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) {
> -                page_handle_poison(page);
> +                page_handle_poison(page, false, true, false);
>                  rc = 0;
>          }
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 75c10d81e833..a68d81d0ae6e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1222,16 +1222,11 @@ static int unmap_and_move(new_page_t get_new_page,
>           * we want to retry.
>           */
>          if (rc == MIGRATEPAGE_SUCCESS) {
> -                put_page(page);
> -                if (reason == MR_MEMORY_FAILURE) {
> +                if (reason != MR_MEMORY_FAILURE)
>                          /*
> -                         * Set PG_HWPoison on just freed page
> -                         * intentionally. Although it's rather weird,
> -                         * it's how HWPoison flag works at the moment.
> +                         * We handle poisoned pages in page_handle_poison.
>                           */
> -                        if (set_hwpoison_free_buddy_page(page))
> -                                num_poisoned_pages_inc();
> -                }
> +                        put_page(page);
>          } else {
>                  if (rc != -EAGAIN) {
>                          if (likely(!__PageMovable(page))) {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4fa0e0887c07..11df51fc2718 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1175,6 +1175,16 @@ static __always_inline bool free_pages_prepare(struct page *page,
>
>          trace_mm_page_free(page, order);
>
> +        if (unlikely(PageHWPoison(page)) && !order) {
> +                /*
> +                 * Untie memcg state and reset page's owner
> +                 */
> +                if (memcg_kmem_enabled() && PageKmemcg(page))
> +                        __memcg_kmem_uncharge_page(page, order);
> +                reset_page_owner(page, order);
> +                return false;
> +        }
> +
>          /*
>           * Check tail pages before head page information is cleared to
>           * avoid checking PageCompound for order-0 pages.
> @@ -8844,32 +8854,4 @@ bool take_page_off_buddy(struct page *page)
>          spin_unlock_irqrestore(&zone->lock, flags);
>          return ret;
>  }
> -
> -/*
> - * Set PG_hwpoison flag if a given page is confirmed to be a free page. This
> - * test is performed under the zone lock to prevent a race against page
> - * allocation.
> - */
> -bool set_hwpoison_free_buddy_page(struct page *page)
> -{
> -        struct zone *zone = page_zone(page);
> -        unsigned long pfn = page_to_pfn(page);
> -        unsigned long flags;
> -        unsigned int order;
> -        bool hwpoisoned = false;
> -
> -        spin_lock_irqsave(&zone->lock, flags);
> -        for (order = 0; order < MAX_ORDER; order++) {
> -                struct page *page_head = page - (pfn & ((1 << order) - 1));
> -
> -                if (PageBuddy(page_head) && page_order(page_head) >= order) {
> -                        if (!TestSetPageHWPoison(page))
> -                                hwpoisoned = true;
> -                        break;
> -                }
> -        }
> -        spin_unlock_irqrestore(&zone->lock, flags);
> -
> -        return hwpoisoned;
> -}
>  #endif
> --
> 2.26.2
>
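[Editor's aside: to summarize the heart of the rework, here is the new
page_handle_poison() from the mm/memory-failure.c hunk above, with comments
added to map each flag to its caller. The comments are the editor's reading
of the patch, not part of it.]

static void page_handle_poison(struct page *page, bool release, bool set_flag,
                               bool huge_flag)
{
        if (set_flag)           /* all callers in this patch pass true */
                SetPageHWPoison(page);

        if (huge_flag)
                /*
                 * soft_offline_huge_page(): the migrated source hugepage
                 * is not allowed back on the hugetlb freelists, so it is
                 * dissolved into order-0 pages here.
                 */
                dissolve_free_huge_page(page);
        else if (release)
                /*
                 * __soft_offline_page(): drop the last reference;
                 * free_pages_prepare() skips the poisoned page, so it
                 * reaches neither a pcplist nor a buddy freelist.
                 */
                put_page(page);

        /* Keep the page with a refcount of 1 and account it as poisoned. */
        page_ref_inc(page);
        num_poisoned_pages_inc();
}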
On 2020-07-17 08:55, HORIGUCHI NAOYA wrote:
> I ran Qian Cai's test program (https://github.com/cailca/linux-mm) on a
> small (4GB memory) VM, and weirdly found that (1) the target hugepages
> are not always dissolved and (2) dissolved hugepages are still counted
> in "HugePages_Total:". See below:
>
> [... /proc/meminfo and page-types output quoted in full above snipped ...]
>
> This means that some hugepages are dissolved but others are not, which
> is probably not desirable.
> I'll dig into this more later, but just let me share it first.
>
> A few minor comments below ...

Uhm, weird.

I will be taking a look today.

Thanks
On 2020-07-20 10:27, osalvador@suse.de wrote:
> On 2020-07-17 08:55, HORIGUCHI NAOYA wrote:
>> I ran Qian Cai's test program (https://github.com/cailca/linux-mm) on a
>> small (4GB memory) VM, and weirdly found that (1) the target hugepages
>> are not always dissolved and (2) dissolved hugepages are still counted
>> in "HugePages_Total:". See below:
>>
>> [... test output snipped; see the full report earlier in the thread ...]
>>
>> This means that some hugepages are dissolved but others are not, which
>> is probably not desirable.
>> I'll dig into this more later, but just let me share it first.
>>
>> A few minor comments below ...
>
> Uhm, weird.
>
> I will be taking a look today.

After some digging, I __think__ I found the problem.
I will try to fix it up and I will be running tests.
I might reach out to you once I am done, because I remember you had a
test suite that worked quite well, so you can give it a spin there.

Thanks
On Wed, Jul 22, 2020 at 10:08:59AM +0200, osalvador@suse.de wrote:
> On 2020-07-20 10:27, osalvador@suse.de wrote:
> > > This means that some hugepages are dissolved but others are not,
> > > which is probably not desirable.
> > > I'll dig into this more later, but just let me share it first.
> > >
> > > A few minor comments below ...
> >
> > Uhm, weird.
> >
> > I will be taking a look today.
>
> After some digging, I __think__ I found the problem.
> I will try to fix it up and I will be running tests.

I found the problem.
I re-ran the tests with small and large memory, and the stats look
correct this time.
After some more testing, I also fixed a list corruption that was
happening due to the same problem.

I am creating a git branch so you can re-run your tests on it as well.
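[Editor's aside: for readers who want to reproduce the behaviour discussed
in this thread, the reproducer drives soft offline from userspace via
madvise(MADV_SOFT_OFFLINE) on hugetlb mappings. The following is a minimal
stand-in, not Qian Cai's actual code; the 2 MB hugepage size and the
fallback #define for older headers are assumptions.]

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#ifndef MADV_SOFT_OFFLINE
#define MADV_SOFT_OFFLINE 101   /* value from asm-generic/mman-common.h */
#endif

#define HPAGE_SIZE (2UL << 20)  /* assumes 2 MB default hugepages */

int main(void)
{
        void *p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return EXIT_FAILURE;
        }

        /* Touch the page so an in-use hugepage gets soft-offlined. */
        *(volatile char *)p = 1;

        /* Ask the kernel to migrate the contents away and poison the page. */
        if (madvise(p, HPAGE_SIZE, MADV_SOFT_OFFLINE))
                perror("madvise");      /* ENOMEM is what the report above hit */

        munmap(p, HPAGE_SIZE);
        return 0;
}

This needs CAP_SYS_ADMIN and a kernel with CONFIG_MEMORY_FAILURE, plus
reserved hugepages (e.g. via /proc/sys/vm/nr_hugepages). After a successful
run, HardwareCorrupted in /proc/meminfo grows and "page-types -b hwpoison"
shows the flagged pages, as in the report above.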