From patchwork Fri Jun 21 21:37:17 2024
Date: Fri, 21 Jun 2024 15:37:17 -0600
Message-ID: <20240621213717.1099079-1-yuzhao@google.com>
Subject: [RFC PATCH] mm/hugetlb_vmemmap: fix race with speculative PFN walkers
From: Yu Zhao
To: Muchun Song, "Matthew Wilcox (Oracle)"
Cc: Andrew Morton, linux-mm@kvack.org, David Hildenbrand, Frank van der Linden, Yu Zhao

While investigating HVO for THPs [1], it turns out that speculative
PFN walkers like compaction can race with vmemmap modifications, e.g.,

  CPU 1 (vmemmap modifier)         CPU 2 (speculative PFN walker)
  -------------------------------  ------------------------------
  Allocates an LRU folio page1
                                   Sees page1
  Frees page1

  Allocates a hugeTLB folio page2
  (page1 being a tail of page2)

  Updates vmemmap mapping page1
                                   get_page_unless_zero(page1)

Even though page1 has a zero refcnt after HVO, get_page_unless_zero()
can still try to modify its read-only struct page, resulting in a
crash.

An independent report [2] confirmed this race.

There are two discussed approaches to fix this race:
1. Make RO vmemmap RW so that get_page_unless_zero() can fail without
   triggering a PF.
2. Use RCU to make sure get_page_unless_zero() either sees zero
   refcnts through the old vmemmap or non-zero refcnts through the
   new one.

The second approach is preferred here because:
1. It can prevent illegal modifications to struct page[] that has been
   HVO'ed;
2. It can be generalized, in a way similar to ZERO_PAGE(), to fix
   similar races in other places, e.g., arch_remove_memory() on x86
   [3], which frees vmemmap mapping offlined struct page[].

While adding synchronize_rcu(), the goal is to be surgical, rather
than optimized. Specifically, calls to synchronize_rcu() on the error
handling paths can be coalesced, but it is not done for the sake of
simplicity: noticeably, this fix removes ~50% more lines than it adds.
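For orientation before reading the diff, the two sides of the RCU
pairing are condensed below. The reader side is the new
page_ref_add_unless() from this patch (tracepoint hook dropped); the
writer side shows only where hugetlb_vmemmap_optimize_folio() now
waits, with the actual remapping elided. This is an illustrative
sketch of the hunks that follow, not standalone buildable code.

  /* Speculative PFN walker (e.g. compaction), reader side: */
  static inline bool page_ref_add_unless(struct page *page, int nr, int u)
  {
          bool ret = false;

          rcu_read_lock();
          /*
           * Avoid the atomic RMW when the page is a fake head or its
           * refcount already equals u (zero for get_page_unless_zero()),
           * i.e. when the underlying vmemmap may be read-only.
           */
          if (!page_is_fake_head(page) && page_ref_count(page) != u)
                  ret = atomic_add_unless(&page->_refcount, nr, u);
          rcu_read_unlock();

          return ret;
  }

  /* vmemmap modifier (HVO), writer side: */
  void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio)
  {
          /*
           * Wait for walkers that may still see the old, writable
           * vmemmap before remapping it read-only.
           */
          synchronize_rcu();

          /* ... remap the folio's tail struct pages read-only ... */
  }

Any walker whose RCU read-side critical section began before
synchronize_rcu() returned completes its refcount update against the
still-writable vmemmap; a walker that starts afterwards sees either
the fake head or a zero refcnt and backs off without writing.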
[1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@google.com/
[2] https://lore.kernel.org/917FFC7F-0615-44DD-90EE-9F85F8EA9974@linux.dev/
[3] https://lore.kernel.org/be130a96-a27e-4240-ad78-776802f57cad@redhat.com/

Signed-off-by: Yu Zhao
---
 include/linux/page_ref.h |  8 ++++++-
 mm/hugetlb.c             | 50 +++++-----------------------------------
 mm/hugetlb_vmemmap.c     | 16 +++++++++++++
 3 files changed, 29 insertions(+), 45 deletions(-)

base-commit: 264efe488fd82cf3145a3dc625f394c61db99934
prerequisite-patch-id: 5029fb66d9bf40b84903a5b4f066e85101169e84
prerequisite-patch-id: 7889e5ee16b8e91cccde12468f1d2c3f65500336
prerequisite-patch-id: 0d4c19afc7b92f16bee9e9cf2b6832406389742a
prerequisite-patch-id: c56f06d4bb3e738aea489ec30313ed0c1dbac325

diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 1acf5bac7f50..add92e8f31b2 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -230,7 +230,13 @@ static inline int folio_ref_dec_return(struct folio *folio)
 
 static inline bool page_ref_add_unless(struct page *page, int nr, int u)
 {
-        bool ret = atomic_add_unless(&page->_refcount, nr, u);
+        bool ret = false;
+
+        rcu_read_lock();
+        /* avoid writing to the vmemmap area being remapped */
+        if (!page_is_fake_head(page) && page_ref_count(page) != u)
+                ret = atomic_add_unless(&page->_refcount, nr, u);
+        rcu_read_unlock();
 
         if (page_ref_tracepoint_active(page_ref_mod_unless))
                 __page_ref_mod_unless(page, nr, ret);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f35abff8be60..271d83a7cde0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1629,9 +1629,8 @@ static inline void destroy_compound_gigantic_folio(struct folio *folio,
  *
  * Must be called with hugetlb lock held.
  */
-static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio,
-                                                        bool adjust_surplus,
-                                                        bool demote)
+static void remove_hugetlb_folio(struct hstate *h, struct folio *folio,
+                                                        bool adjust_surplus)
 {
         int nid = folio_nid(folio);
 
@@ -1661,33 +1660,13 @@ static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio,
         if (!folio_test_hugetlb_vmemmap_optimized(folio))
                 __folio_clear_hugetlb(folio);
 
-        /*
-         * In the case of demote we do not ref count the page as it will soon
-         * be turned into a page of smaller size.
-         */
-        if (!demote)
-                folio_ref_unfreeze(folio, 1);
-
         h->nr_huge_pages--;
         h->nr_huge_pages_node[nid]--;
 }
 
-static void remove_hugetlb_folio(struct hstate *h, struct folio *folio,
-                                                        bool adjust_surplus)
-{
-        __remove_hugetlb_folio(h, folio, adjust_surplus, false);
-}
-
-static void remove_hugetlb_folio_for_demote(struct hstate *h, struct folio *folio,
-                                                        bool adjust_surplus)
-{
-        __remove_hugetlb_folio(h, folio, adjust_surplus, true);
-}
-
 static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
                               bool adjust_surplus)
 {
-        int zeroed;
         int nid = folio_nid(folio);
 
         VM_BUG_ON_FOLIO(!folio_test_hugetlb_vmemmap_optimized(folio), folio);
@@ -1711,21 +1690,6 @@ static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
          */
         folio_set_hugetlb_vmemmap_optimized(folio);
 
-        /*
-         * This folio is about to be managed by the hugetlb allocator and
-         * should have no users. Drop our reference, and check for others
-         * just in case.
-         */
-        zeroed = folio_put_testzero(folio);
-        if (unlikely(!zeroed))
-                /*
-                 * It is VERY unlikely soneone else has taken a ref
-                 * on the folio. In this case, we simply return as
-                 * free_huge_folio() will be called when this other ref
-                 * is dropped.
-                 */
-                return;
-
         arch_clear_hugetlb_flags(folio);
         enqueue_hugetlb_folio(h, folio);
 }
@@ -1779,6 +1743,8 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
                 spin_unlock_irq(&hugetlb_lock);
         }
 
+        folio_ref_unfreeze(folio, 1);
+
         /*
          * Non-gigantic pages demoted from CMA allocated gigantic pages
          * need to be given back to CMA in free_gigantic_folio.
@@ -3079,11 +3045,8 @@ static int alloc_and_dissolve_hugetlb_folio(struct hstate *h,
 
 free_new:
         spin_unlock_irq(&hugetlb_lock);
-        if (new_folio) {
-                /* Folio has a zero ref count, but needs a ref to be freed */
-                folio_ref_unfreeze(new_folio, 1);
+        if (new_folio)
                 update_and_free_hugetlb_folio(h, new_folio, false);
-        }
 
         return ret;
 }
@@ -3938,7 +3901,7 @@ static int demote_free_hugetlb_folio(struct hstate *h, struct folio *folio)
 
         target_hstate = size_to_hstate(PAGE_SIZE << h->demote_order);
 
-        remove_hugetlb_folio_for_demote(h, folio, false);
+        remove_hugetlb_folio(h, folio, false);
         spin_unlock_irq(&hugetlb_lock);
 
         /*
@@ -3952,7 +3915,6 @@ static int demote_free_hugetlb_folio(struct hstate *h, struct folio *folio)
         if (rc) {
                 /* Allocation of vmemmmap failed, we can not demote folio */
                 spin_lock_irq(&hugetlb_lock);
-                folio_ref_unfreeze(folio, 1);
                 add_hugetlb_folio(h, folio, false);
                 return rc;
         }
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index b9a55322e52c..8193906515c6 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -446,6 +446,8 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
         unsigned long vmemmap_reuse;
 
         VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(folio), folio);
+        VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio), folio);
+
         if (!folio_test_hugetlb_vmemmap_optimized(folio))
                 return 0;
 
@@ -481,6 +483,9 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
  */
 int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio)
 {
+        /* avoid writes from page_ref_add_unless() while unfolding vmemmap */
+        synchronize_rcu();
+
         return __hugetlb_vmemmap_restore_folio(h, folio, 0);
 }
 
@@ -505,6 +510,9 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
         long restored = 0;
         long ret = 0;
 
+        /* avoid writes from page_ref_add_unless() while unfolding vmemmap */
+        synchronize_rcu();
+
         list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
                 if (folio_test_hugetlb_vmemmap_optimized(folio)) {
                         ret = __hugetlb_vmemmap_restore_folio(h, folio,
@@ -550,6 +558,8 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
         unsigned long vmemmap_reuse;
 
         VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(folio), folio);
+        VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio), folio);
+
         if (!vmemmap_should_optimize_folio(h, folio))
                 return ret;
 
@@ -601,6 +611,9 @@ void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio)
 {
         LIST_HEAD(vmemmap_pages);
 
+        /* avoid writes from page_ref_add_unless() while folding vmemmap */
+        synchronize_rcu();
+
         __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, 0);
         free_vmemmap_page_list(&vmemmap_pages);
 }
@@ -644,6 +657,9 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
 
         flush_tlb_all();
 
+        /* avoid writes from page_ref_add_unless() while folding vmemmap */
+        synchronize_rcu();
+
         list_for_each_entry(folio, folio_list, lru) {
                 int ret;