From patchwork Wed Sep 7 14:45:13 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Zach O'Keefe
X-Patchwork-Id: 12969071
Date: Wed, 7 Sep 2022 07:45:13 -0700
In-Reply-To: <20220907144521.3115321-1-zokeefe@google.com>
References: <20220907144521.3115321-1-zokeefe@google.com>
X-Mailer: git-send-email 2.37.2.789.g6183377224-goog
Message-ID: <20220907144521.3115321-3-zokeefe@google.com>
Subject: [PATCH mm-unstable v3 02/10] mm/khugepaged: attempt to map
 file/shmem-backed pte-mapped THPs by pmds
From: "Zach O'Keefe"
To: linux-mm@kvack.org
Cc: Andrew Morton, linux-api@vger.kernel.org, Axel Rasmussen,
 James Houghton, Hugh Dickins, Yang Shi, Miaohe Lin, David Hildenbrand,
 David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu, Rongwei Wang,
 SeongJae Park, Song Liu, Vlastimil Babka, Chris Kennelly,
 "Kirill A. Shutemov", Minchan Kim, Patrick Xia, "Zach O'Keefe"

The main benefit of THPs is that they can be mapped at the pmd level,
increasing the likelihood of a TLB hit and spending fewer cycles in page
table walks. pte-mapped hugepages (that is, hugepage-aligned compound
pages of order HPAGE_PMD_ORDER mapped by ptes), although contiguous in
physical memory, don't have this advantage. In fact, one could argue
they are detrimental to system performance overall, since they occupy a
precious hugepage-aligned/sized region of physical memory that could
otherwise be used more effectively. Additionally, pte-mapped hugepages
can be the cheapest memory to collapse for khugepaged, since no new
hugepage allocation or copying of memory contents is necessary: we only
need to update the mapping page tables.

In the anonymous collapse path, we are able to collapse pte-mapped
hugepages (albeit, perhaps suboptimally), but the file/shmem path makes
no effort when compound pages (of any order) are encountered.

Identify pte-mapped hugepages in the file/shmem collapse path. The final
step of this identification makes a racy check of the value of the pmd
to ensure it maps a pte table. This should be fine, since races that
result in false positives (i.e. we attempt collapse even though we
shouldn't) will fail later in collapse_pte_mapped_thp() once we actually
lock mmap_lock and reinspect the pmd value. Races that result in false
negatives (i.e. we decide not to attempt collapse, but should have)
shouldn't be an issue, since in the worst case we do nothing, which is
what we've done up to this point. We make a similar check in
retract_page_tables(). If we do think we've found a pte-mapped hugepage
in khugepaged context, attempt to update page tables mapping this
hugepage.

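For readers who want the identification rule in isolation, here is a
minimal userspace sketch of it. It is illustrative only and not part of
this patch: struct mock_head and classify_compound() are hypothetical
stand-ins for struct page, compound_head() and compound_order(), and
HPAGE_PMD_ORDER is assumed to be 9 (4 KiB base pages, 2 MiB pmds).

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4 KiB base pages with 2 MiB PMD-sized hugepages. */
#define HPAGE_PMD_ORDER 9

enum scan_result {
	SCAN_PAGE_COMPOUND,
	SCAN_PTE_MAPPED_HUGEPAGE,
};

/* Hypothetical stand-in for the head page of a compound page. */
struct mock_head {
	unsigned int order;	/* models compound_order(head) */
	unsigned long index;	/* models head->index (page cache offset) */
};

/*
 * A compound page is a candidate pte-mapped hugepage only if it is
 * PMD-order and its head sits exactly at the start of the range being
 * scanned. Anything else stays SCAN_PAGE_COMPOUND, as before the patch.
 * Whether the range is already pmd-mapped is checked separately.
 */
static enum scan_result classify_compound(const struct mock_head *head,
					  unsigned long start)
{
	assert(head != NULL);

	return (head->order == HPAGE_PMD_ORDER && head->index == start)
		? SCAN_PTE_MAPPED_HUGEPAGE
		: SCAN_PAGE_COMPOUND;
}
```

With a head of order 9 at index 512, classify_compound(&head, 512)
yields SCAN_PTE_MAPPED_HUGEPAGE, while a different start or a smaller
order yields SCAN_PAGE_COMPOUND.
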
Note that these collapses still count towards the
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed counter,
and if the pte-mapped hugepage was also mapped into multiple processes'
address spaces, the counter could be incremented for each page table
update. Since we increment the counter when a pte-mapped hugepage is
successfully added to the list of to-collapse pte-mapped THPs, it's
possible that we never actually update the page table either. This is
different from how file/shmem pages_collapsed accounting works today,
where only a successful page cache update is counted (it's also possible
here that no page tables are actually changed). Though it incurs some
slop, this is preferred to either not accounting for the event at all,
or plumbing through data in struct mm_slot on whether to account for the
collapse or not.

Also note that work still needs to be done to support arbitrary compound
pages, and that this should all be converted to using folios.

Signed-off-by: Zach O'Keefe
---
 include/trace/events/huge_memory.h |  1 +
 mm/khugepaged.c                    | 67 +++++++++++++++++++++++++++---
 2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 55392bf30a03..fbbb25494d60 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -17,6 +17,7 @@
 	EM( SCAN_EXCEED_SHARED_PTE,	"exceed_shared_pte")		\
 	EM( SCAN_PTE_NON_PRESENT,	"pte_non_present")		\
 	EM( SCAN_PTE_UFFD_WP,		"pte_uffd_wp")			\
+	EM( SCAN_PTE_MAPPED_HUGEPAGE,	"pte_mapped_hugepage")		\
 	EM( SCAN_PAGE_RO,		"no_writable_page")		\
 	EM( SCAN_LACK_REFERENCED_PAGE,	"lack_referenced_page")		\
 	EM( SCAN_PAGE_NULL,		"page_null")			\
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 55c8625ed950..31ccf49cf279 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -35,6 +35,7 @@ enum scan_result {
 	SCAN_EXCEED_SHARED_PTE,
 	SCAN_PTE_NON_PRESENT,
 	SCAN_PTE_UFFD_WP,
+	SCAN_PTE_MAPPED_HUGEPAGE,
 	SCAN_PAGE_RO,
 	SCAN_LACK_REFERENCED_PAGE,
 	SCAN_PAGE_NULL,
@@ -1318,20 +1319,24 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
  * Notify khugepaged that given addr of the mm is pte-mapped THP. Then
  * khugepaged should try to collapse the page table.
  */
-static void khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
+static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
 					  unsigned long addr)
 {
 	struct khugepaged_mm_slot *mm_slot;
 	struct mm_slot *slot;
+	bool ret = false;
 
 	VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
 
 	spin_lock(&khugepaged_mm_lock);
 	slot = mm_slot_lookup(mm_slots_hash, mm);
 	mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
-	if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP))
+	if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) {
 		mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
+		ret = true;
+	}
 	spin_unlock(&khugepaged_mm_lock);
+	return ret;
 }
 
 static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -1368,9 +1373,16 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 	pte_t *start_pte, *pte;
 	pmd_t *pmd;
 	spinlock_t *ptl;
-	int count = 0;
+	int count = 0, result = SCAN_FAIL;
 	int i;
 
+	mmap_assert_write_locked(mm);
+
+	/* Fast check before locking page if already PMD-mapped */
+	result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
+	if (result != SCAN_SUCCEED)
+		return;
+
 	if (!vma || !vma->vm_file ||
 	    !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
 		return;
@@ -1721,9 +1733,16 @@ static int collapse_file(struct mm_struct *mm, struct file *file,
 		/*
 		 * If file was truncated then extended, or hole-punched, before
 		 * we locked the first page, then a THP might be there already.
+		 * This will be discovered on the first iteration.
 		 */
 		if (PageTransCompound(page)) {
-			result = SCAN_PAGE_COMPOUND;
+			struct page *head = compound_head(page);
+
+			result = compound_order(head) == HPAGE_PMD_ORDER &&
+					head->index == start
+					/* Maybe PMD-mapped */
+					? SCAN_PTE_MAPPED_HUGEPAGE
+					: SCAN_PAGE_COMPOUND;
 			goto out_unlock;
 		}
 
@@ -1961,7 +1980,19 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
 		 * into a PMD sized page
 		 */
 		if (PageTransCompound(page)) {
-			result = SCAN_PAGE_COMPOUND;
+			struct page *head = compound_head(page);
+
+			result = compound_order(head) == HPAGE_PMD_ORDER &&
+					head->index == start
+					/* Maybe PMD-mapped */
+					? SCAN_PTE_MAPPED_HUGEPAGE
+					: SCAN_PAGE_COMPOUND;
+			/*
+			 * For SCAN_PTE_MAPPED_HUGEPAGE, further processing
+			 * by the caller won't touch the page cache, and so
+			 * it's safe to skip LRU and refcount checks before
+			 * returning.
+			 */
 			break;
 		}
 
@@ -2021,6 +2052,12 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
 static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
 {
 }
+
+static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
+					  unsigned long addr)
+{
+	return false;
+}
 #endif
 
 static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
@@ -2115,8 +2152,26 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 								  &mmap_locked, cc);
 			}
 
-			if (*result == SCAN_SUCCEED)
+			switch (*result) {
+			case SCAN_PTE_MAPPED_HUGEPAGE: {
+				pmd_t *pmd;
+
+				*result = find_pmd_or_thp_or_none(mm,
+								  khugepaged_scan.address,
+								  &pmd);
+				if (*result != SCAN_SUCCEED)
+					break;
+				if (!khugepaged_add_pte_mapped_thp(mm,
+								   khugepaged_scan.address))
+					break;
+			} fallthrough;
+			case SCAN_SUCCEED:
 				++khugepaged_pages_collapsed;
+				break;
+			default:
+				break;
+			}
+
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
 			progress += HPAGE_PMD_NR;
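
The accounting flow in the last hunk can be modelled in userspace
roughly as follows. This is a sketch for illustration, not kernel code:
struct mock_slot, add_pte_mapped_thp() and account_result() are
hypothetical names, the pmd recheck done by find_pmd_or_thp_or_none() is
reduced to a boolean argument, and MAX_PTE_MAPPED_THP is assumed to be 8.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: size of the per-mm list of to-collapse addresses. */
#define MAX_PTE_MAPPED_THP 8

enum scan_result {
	SCAN_FAIL,
	SCAN_SUCCEED,
	SCAN_PAGE_COMPOUND,
	SCAN_PTE_MAPPED_HUGEPAGE,
};

/* Hypothetical stand-in for the pte-mapped THP list in the mm slot. */
struct mock_slot {
	int nr_pte_mapped_thp;
	unsigned long pte_mapped_thp[MAX_PTE_MAPPED_THP];
};

/* Queue addr for a later page table update; false if no slot or full. */
static bool add_pte_mapped_thp(struct mock_slot *slot, unsigned long addr)
{
	if (slot && slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP) {
		slot->pte_mapped_thp[slot->nr_pte_mapped_thp++] = addr;
		return true;
	}
	return false;
}

/*
 * Returns how much to add to pages_collapsed (0 or 1). A pte-mapped
 * hugepage is counted only if the (racy) pmd recheck passes AND the
 * address was queued; the page tables themselves are updated later,
 * which is the accounting slop described in the commit message.
 */
static unsigned int account_result(struct mock_slot *slot,
				   enum scan_result *result,
				   bool pmd_maps_pte_table,
				   unsigned long addr)
{
	switch (*result) {
	case SCAN_PTE_MAPPED_HUGEPAGE:
		/* Models the find_pmd_or_thp_or_none() recheck. */
		*result = pmd_maps_pte_table ? SCAN_SUCCEED : SCAN_FAIL;
		if (*result != SCAN_SUCCEED)
			break;
		if (!add_pte_mapped_thp(slot, addr))
			break;
		/* fall through */
	case SCAN_SUCCEED:
		return 1;
	default:
		break;
	}
	return 0;
}
```

Note that when the list is already full, *result is left at SCAN_SUCCEED
but nothing is counted, which mirrors the two separate break statements
in the hunk above.
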