From patchwork Thu Jan 18 11:10:34 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Barry Song <21cnbao@gmail.com>
X-Patchwork-Id: 13522712
From: Barry Song <21cnbao@gmail.com>
To: ryan.roberts@arm.com, akpm@linux-foundation.org, david@redhat.com,
 linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, mhocko@suse.com, shy828301@gmail.com,
 wangkefeng.wang@huawei.com, willy@infradead.org, xiang@kernel.org,
 ying.huang@intel.com, yuzhao@google.com, surenb@google.com,
 steven.price@arm.com, Chuanhua Han, Barry Song
Subject: [PATCH RFC 4/6] mm: support large folios swapin as a whole
Date: Fri, 19 Jan 2024 00:10:34 +1300
Message-Id: <20240118111036.72641-5-21cnbao@gmail.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20240118111036.72641-1-21cnbao@gmail.com>
References: <20231025144546.577640-1-ryan.roberts@arm.com>
 <20240118111036.72641-1-21cnbao@gmail.com>

From: Chuanhua Han

On an embedded system like Android, more than half of anonymous memory is
actually in swap devices such as zRAM. For example, while an app is switched
to the background, most of its memory might be swapped out.

Now we have the mTHP feature, but unfortunately, if we don't support large
folio swap-in, then once those large folios are swapped out we immediately
lose the performance gain we can get through large folios and hardware
optimizations such as CONT-PTE.

This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
to contiguous swap entries which were likely swapped out from an mTHP as a
whole.

On the other hand, the current implementation only covers the
SWAP_SYNCHRONOUS case; it doesn't support swapping in large folios through
swapin_readahead() yet.

Right now, we also re-fault large folios which are still in the swapcache as
a whole. This effectively reduces the extra loops and early exits that we
added to arch_swap_restore() when MTE restore was extended from single pages
to folios.

Signed-off-by: Chuanhua Han
Co-developed-by: Barry Song
Signed-off-by: Barry Song
---
 mm/memory.c | 108 +++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 94 insertions(+), 14 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f61a48929ba7..928b3f542932 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -107,6 +107,8 @@ EXPORT_SYMBOL(mem_map);
 static vm_fault_t do_fault(struct vm_fault *vmf);
 static vm_fault_t do_anonymous_page(struct vm_fault *vmf);
 static bool vmf_pte_changed(struct vm_fault *vmf);
+static struct folio *alloc_anon_folio(struct vm_fault *vmf,
+                                      bool (*pte_range_check)(pte_t *, int));
 
 /*
  * Return true if the original pte was a uffd-wp pte marker (so the pte was
@@ -3784,6 +3786,34 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
         return VM_FAULT_SIGBUS;
 }
 
+static bool pte_range_swap(pte_t *pte, int nr_pages)
+{
+        int i;
+        swp_entry_t entry;
+        unsigned type;
+        pgoff_t start_offset;
+
+        entry = pte_to_swp_entry(ptep_get_lockless(pte));
+        if (non_swap_entry(entry))
+                return false;
+        start_offset = swp_offset(entry);
+        if (start_offset % nr_pages)
+                return false;
+
+        type = swp_type(entry);
+        for (i = 1; i < nr_pages; i++) {
+                entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
+                if (non_swap_entry(entry))
+                        return false;
+                if (swp_offset(entry) != start_offset + i)
+                        return false;
+                if (swp_type(entry) != type)
+                        return false;
+        }
+
+        return true;
+}
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -3804,6 +3834,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
         pte_t pte;
         vm_fault_t ret = 0;
         void *shadow = NULL;
+        int nr_pages = 1;
+        unsigned long start_address;
+        pte_t *start_pte;
 
         if (!pte_unmap_same(vmf))
                 goto out;
@@ -3868,13 +3901,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
         if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
             __swap_count(entry) == 1) {
                 /* skip swapcache */
-                folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
-                                        vma, vmf->address, false);
+                folio = alloc_anon_folio(vmf, pte_range_swap);
                 page = &folio->page;
                 if (folio) {
                         __folio_set_locked(folio);
                         __folio_set_swapbacked(folio);
 
+                        if (folio_test_large(folio)) {
+                                unsigned long start_offset;
+
+                                nr_pages = folio_nr_pages(folio);
+                                start_offset = swp_offset(entry) & ~(nr_pages - 1);
+                                entry = swp_entry(swp_type(entry), start_offset);
+                        }
+
                         if (mem_cgroup_swapin_charge_folio(folio,
                                                 vma->vm_mm, GFP_KERNEL,
                                                 entry)) {
@@ -3980,6 +4020,39 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
          */
         vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
                         vmf->address, &vmf->ptl);
+
+        start_address = vmf->address;
+        start_pte = vmf->pte;
+        if (folio_test_large(folio)) {
+                unsigned long nr = folio_nr_pages(folio);
+                unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
+                pte_t *pte_t = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
+
+                /*
+                 * case 1: we are allocating large_folio, try to map it as a whole
+                 * iff the swap entries are still entirely mapped;
+                 * case 2: we hit a large folio in swapcache, and all swap entries
+                 * are still entirely mapped, try to map a large folio as a whole.
+                 * otherwise, map only the faulting page within the large folio
+                 * which is swapcache
+                 */
+                if (pte_range_swap(pte_t, nr)) {
+                        start_address = addr;
+                        start_pte = pte_t;
+                        if (unlikely(folio == swapcache)) {
+                                /*
+                                 * the below has been done before swap_read_folio()
+                                 * for case 1
+                                 */
+                                nr_pages = nr;
+                                entry = pte_to_swp_entry(ptep_get(start_pte));
+                                page = &folio->page;
+                        }
+                } else if (nr_pages > 1) { /* ptes have changed for case 1 */
+                        goto out_nomap;
+                }
+        }
+
         if (unlikely(!vmf->pte ||
                      !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
                 goto out_nomap;
@@ -4047,12 +4120,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
          * We're already holding a reference on the page but haven't mapped it
          * yet.
          */
-        swap_free(entry);
+        swap_nr_free(entry, nr_pages);
         if (should_try_to_free_swap(folio, vma, vmf->flags))
                 folio_free_swap(folio);
 
-        inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
-        dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
+        folio_ref_add(folio, nr_pages - 1);
+        add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+        add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
+
         pte = mk_pte(page, vma->vm_page_prot);
 
         /*
@@ -4062,14 +4137,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
          * exclusivity.
          */
         if (!folio_test_ksm(folio) &&
-            (exclusive || folio_ref_count(folio) == 1)) {
+            (exclusive || folio_ref_count(folio) == nr_pages)) {
                 if (vmf->flags & FAULT_FLAG_WRITE) {
                         pte = maybe_mkwrite(pte_mkdirty(pte), vma);
                         vmf->flags &= ~FAULT_FLAG_WRITE;
                 }
                 rmap_flags |= RMAP_EXCLUSIVE;
         }
-        flush_icache_page(vma, page);
+        flush_icache_pages(vma, page, nr_pages);
         if (pte_swp_soft_dirty(vmf->orig_pte))
                 pte = pte_mksoft_dirty(pte);
         if (pte_swp_uffd_wp(vmf->orig_pte))
@@ -4081,14 +4156,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
                 folio_add_new_anon_rmap(folio, vma, vmf->address);
                 folio_add_lru_vma(folio, vma);
         } else {
-                folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
+                folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
                                         rmap_flags);
         }
 
         VM_BUG_ON(!folio_test_anon(folio) ||
                         (pte_write(pte) && !PageAnonExclusive(page)));
-        set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
-        arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
+        set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
+
+        arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
 
         folio_unlock(folio);
         if (folio != swapcache && swapcache) {
@@ -4105,6 +4181,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
         }
 
         if (vmf->flags & FAULT_FLAG_WRITE) {
+                if (folio_test_large(folio) && nr_pages > 1)
+                        vmf->orig_pte = ptep_get(vmf->pte);
+
                 ret |= do_wp_page(vmf);
                 if (ret & VM_FAULT_ERROR)
                         ret &= VM_FAULT_ERROR;
@@ -4112,7 +4191,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
         }
 
         /* No need to invalidate - it was non-present before */
-        update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+        update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
 unlock:
         if (vmf->pte)
                 pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -4148,7 +4227,8 @@ static bool pte_range_none(pte_t *pte, int nr_pages)
         return true;
 }
 
-static struct folio *alloc_anon_folio(struct vm_fault *vmf)
+static struct folio *alloc_anon_folio(struct vm_fault *vmf,
+                                      bool (*pte_range_check)(pte_t *, int))
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
         struct vm_area_struct *vma = vmf->vma;
@@ -4190,7 +4270,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
         order = highest_order(orders);
         while (orders) {
                 addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
-                if (pte_range_none(pte + pte_index(addr), 1 << order))
+                if (pte_range_check(pte + pte_index(addr), 1 << order))
                         break;
                 order = next_order(&orders, order);
         }
@@ -4269,7 +4349,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
         if (unlikely(anon_vma_prepare(vma)))
                 goto oom;
         /* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
-        folio = alloc_anon_folio(vmf);
+        folio = alloc_anon_folio(vmf, pte_range_none);
         if (IS_ERR(folio))
                 return 0;
         if (!folio)
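
As a usage note (not part of the patch itself): a minimal userspace sketch for
exercising this swap-in path could look like the program below. It is only an
illustration under stated assumptions, not a test from this series. It assumes
a kernel with this series applied, zRAM (or another SWP_SYNCHRONOUS_IO device)
as the active swap backend, and mTHP enabled for anonymous memory; the region
size and constants are arbitrary, and the program cannot by itself confirm
that a single large folio was mapped back (that still needs kernel-side
observation, e.g. tracing around do_swap_page()).

/*
 * Hypothetical exerciser for the large folio swap-in path (not part of
 * this series). Fault in anon memory as (m)THP, push it to swap with
 * MADV_PAGEOUT, then touch it again so do_swap_page() runs.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21			/* reclaim these pages (Linux 5.4+) */
#endif

#define REGION_SIZE (64UL << 20)	/* 64MB of anonymous memory */

int main(void)
{
	size_t i;
	volatile long sum = 0;
	char *buf = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Ask for (m)THP so the region is backed by large folios. */
	madvise(buf, REGION_SIZE, MADV_HUGEPAGE);

	/* Dirty every byte so real anonymous folios are allocated. */
	memset(buf, 0x5a, REGION_SIZE);

	/* Reclaim the region; with swap configured it gets swapped out. */
	if (madvise(buf, REGION_SIZE, MADV_PAGEOUT))
		perror("madvise(MADV_PAGEOUT)");

	/*
	 * Touch the region again: each fault goes through do_swap_page(),
	 * and with this patch contiguous swap entries can come back as one
	 * large folio instead of one base page at a time.
	 */
	for (i = 0; i < REGION_SIZE; i += 4096)
		sum += buf[i];

	printf("swap-in pass done, checksum %ld\n", sum);
	munmap(buf, REGION_SIZE);
	return 0;
}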