From patchwork Tue Oct 3 09:27:47 2023
X-Patchwork-Submitter: Hugh Dickins <hughd@google.com>
X-Patchwork-Id: 13407345
Date: Tue, 3 Oct 2023 02:27:47 -0700 (PDT)
From: Hugh Dickins <hughd@google.com>
To: Andrew Morton
Cc: Andi Kleen, Christoph Lameter, Matthew Wilcox, Mike Kravetz,
    David Hildenbrand, Suren Baghdasaryan, Yang Shi, Sidhartha Kumar,
    Vishal Moola, Kefeng Wang, Greg Kroah-Hartman, Tejun Heo,
    Mel Gorman, Michal Hocko, "Huang, Ying",
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH v2 11/12] mempolicy: mmap_lock is not needed while migrating folios
Message-ID: <21e564e8-269f-6a89-7ee2-fd612831c289@google.com>
mbind(2) holds down_write of current task's mmap_lock throughout
(exclusive because it needs to set the new mempolicy on the vmas);
migrate_pages(2) holds down_read of pid's mmap_lock throughout.
They both hold mmap_lock across the internal migrate_pages(), under
which all new page allocations (huge or small) are made.  I'm nervous
about it; and migrate_pages() certainly does not need mmap_lock itself.
It's done this way for mbind(2), because its page allocator is
vma_alloc_folio() or alloc_hugetlb_folio_vma(), both of which depend
on vma and address.

Now that we have alloc_pages_mpol(), depending on (refcounted) memory
policy and interleave index, mbind(2) can be modified to use that or
alloc_hugetlb_folio_nodemask(), and then not need mmap_lock across the
internal migrate_pages() at all: add alloc_migration_target_by_mpol()
to replace mbind's new_folio().

(After that change, alloc_hugetlb_folio_vma() is used by nothing but a
userfaultfd function: move it out of hugetlb.h and into the #ifdef.)

migrate_pages(2) has chosen its target node before migrating, so can
continue to use the standard alloc_migration_target(); but let it take
and drop mmap_lock just around migrate_to_node()'s queue_pages_range():
neither the node-to-node calculations nor the page migrations need it.

It seems unlikely, but it is conceivable that some userspace depends on
the kernel's mmap_lock exclusion here, instead of doing its own locking:
more likely in a testsuite than in real life.  It is also possible, of
course, that some pages on the list will be munmapped by another thread
before they are migrated, or a newer memory policy applied to the range
by that time: but such races could happen before, as soon as mmap_lock
was dropped, so it does not appear to be a concern.
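In outline, the new arrangement is roughly this (a simplified sketch,
eliding the hugetlb and THP cases and the error handling; the real code
is in the mm/mempolicy.c hunks below): do_mbind() drops mmap_lock, then
hands the refcounted mempolicy to migrate_pages() as its private
argument, and the allocation callback works from that policy plus the
folio's index, with no vma or address involved:

	/* Sketch only: see the mm/mempolicy.c changes below */
	mmap_write_unlock(mm);
	nr_failed |= migrate_pages(&pagelist,
			alloc_migration_target_by_mpol, NULL,
			(unsigned long)new, MIGRATE_SYNC,
			MR_MEMPOLICY_MBIND, NULL);

	static struct folio *alloc_migration_target_by_mpol(struct folio *src,
							unsigned long private)
	{
		struct mempolicy *pol = (struct mempolicy *)private;
		pgoff_t ilx = src->index >> folio_order(src);
		int nid = numa_node_id();
		gfp_t gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL | __GFP_COMP;

		/* allocate by policy and interleave index, not by vma */
		return page_rmappable_folio(alloc_pages_mpol(gfp,
					folio_order(src), pol, ilx, nid));
	}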
Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/hugetlb.h |  9 -----
 mm/hugetlb.c            | 38 ++++++++++----------
 mm/mempolicy.c          | 83 ++++++++++++++++++++++---------------------
 3 files changed, 63 insertions(+), 67 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a574e26e18a2..7c6faee07b42 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -716,8 +716,6 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
 struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
 				nodemask_t *nmask, gfp_t gfp_mask);
-struct folio *alloc_hugetlb_folio_vma(struct hstate *h, struct vm_area_struct *vma,
-				unsigned long address);
 int hugetlb_add_to_page_cache(struct folio *folio, struct address_space *mapping,
 			pgoff_t idx);
 void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
@@ -1040,13 +1038,6 @@ alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
 	return NULL;
 }
 
-static inline struct folio *alloc_hugetlb_folio_vma(struct hstate *h,
-					struct vm_area_struct *vma,
-					unsigned long address)
-{
-	return NULL;
-}
-
 static inline int __alloc_bootmem_huge_page(struct hstate *h)
 {
 	return 0;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9d5b7f208dac..68ff79061f88 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2458,24 +2458,6 @@ struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
 	return alloc_migrate_hugetlb_folio(h, gfp_mask, preferred_nid, nmask);
 }
 
-/* mempolicy aware migration callback */
-struct folio *alloc_hugetlb_folio_vma(struct hstate *h, struct vm_area_struct *vma,
-		unsigned long address)
-{
-	struct mempolicy *mpol;
-	nodemask_t *nodemask;
-	struct folio *folio;
-	gfp_t gfp_mask;
-	int node;
-
-	gfp_mask = htlb_alloc_mask(h);
-	node = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
-	folio = alloc_hugetlb_folio_nodemask(h, node, nodemask, gfp_mask);
-	mpol_cond_put(mpol);
-
-	return folio;
-}
-
 /*
  * Increase the hugetlb pool such that it can accommodate a reservation
  * of size 'delta'.
@@ -6279,6 +6261,26 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 #ifdef CONFIG_USERFAULTFD
+/*
+ * Can probably be eliminated, but still used by hugetlb_mfill_atomic_pte().
+ */
+static struct folio *alloc_hugetlb_folio_vma(struct hstate *h,
+		struct vm_area_struct *vma, unsigned long address)
+{
+	struct mempolicy *mpol;
+	nodemask_t *nodemask;
+	struct folio *folio;
+	gfp_t gfp_mask;
+	int node;
+
+	gfp_mask = htlb_alloc_mask(h);
+	node = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
+	folio = alloc_hugetlb_folio_nodemask(h, node, nodemask, gfp_mask);
+	mpol_cond_put(mpol);
+
+	return folio;
+}
+
 /*
  * Used by userfaultfd UFFDIO_* ioctls. Based on userfaultfd's mfill_atomic_pte
  * with modifications for hugetlb pages.
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8cf76de12acd..a7b34b9c00ef 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -417,6 +417,8 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 
 static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
 				unsigned long flags);
+static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
+				pgoff_t ilx, int *nid);
 
 static bool strictly_unmovable(unsigned long flags)
 {
@@ -1043,6 +1045,8 @@ static long migrate_to_node(struct mm_struct *mm, int source, int dest,
 	node_set(source, nmask);
 
 	VM_BUG_ON(!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)));
+
+	mmap_read_lock(mm);
 	vma = find_vma(mm, 0);
 
 	/*
@@ -1053,6 +1057,7 @@ static long migrate_to_node(struct mm_struct *mm, int source, int dest,
 	 */
 	nr_failed = queue_pages_range(mm, vma->vm_start, mm->task_size, &nmask,
 			flags | MPOL_MF_DISCONTIG_OK, &pagelist);
+	mmap_read_unlock(mm);
 
 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, alloc_migration_target, NULL,
@@ -1081,8 +1086,6 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
 
 	lru_cache_disable();
 
-	mmap_read_lock(mm);
-
 	/*
 	 * Find a 'source' bit set in 'tmp' whose corresponding 'dest'
 	 * bit in 'to' is not also set in 'tmp'.  Clear the found 'source'
@@ -1162,7 +1165,6 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
 		if (err < 0)
 			break;
 	}
-	mmap_read_unlock(mm);
 
 	lru_cache_enable();
 	if (err < 0)
@@ -1171,44 +1173,38 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
 }
 
 /*
- * Allocate a new page for page migration based on vma policy.
- * Start by assuming the page is mapped by the same vma as contains @start.
- * Search forward from there, if not.  N.B., this assumes that the
- * list of pages handed to migrate_pages()--which is how we get here--
- * is in virtual address order.
+ * Allocate a new folio for page migration, according to NUMA mempolicy.
  */
-static struct folio *new_folio(struct folio *src, unsigned long start)
+static struct folio *alloc_migration_target_by_mpol(struct folio *src,
+						unsigned long private)
 {
-	struct vm_area_struct *vma;
-	unsigned long address;
-	VMA_ITERATOR(vmi, current->mm, start);
-	gfp_t gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL;
+	struct mempolicy *pol = (struct mempolicy *)private;
+	pgoff_t ilx = 0;	/* improve on this later */
+	struct page *page;
+	unsigned int order;
+	int nid = numa_node_id();
+	gfp_t gfp;
 
-	for_each_vma(vmi, vma) {
-		address = page_address_in_vma(&src->page, vma);
-		if (address != -EFAULT)
-			break;
-	}
-
-	/*
-	 * __get_vma_policy() now expects a genuine non-NULL vma. Return NULL
-	 * when the page can no longer be located in a vma: that is not ideal
-	 * (migrate_pages() will give up early, presuming ENOMEM), but good
-	 * enough to avoid a crash by syzkaller or concurrent holepunch.
-	 */
-	if (!vma)
-		return NULL;
+	order = folio_order(src);
+	ilx += src->index >> order;
 
 	if (folio_test_hugetlb(src)) {
-		return alloc_hugetlb_folio_vma(folio_hstate(src),
-				vma, address);
+		nodemask_t *nodemask;
+		struct hstate *h;
+
+		h = folio_hstate(src);
+		gfp = htlb_alloc_mask(h);
+		nodemask = policy_nodemask(gfp, pol, ilx, &nid);
+		return alloc_hugetlb_folio_nodemask(h, nid, nodemask, gfp);
 	}
 
 	if (folio_test_large(src))
 		gfp = GFP_TRANSHUGE;
+	else
+		gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL | __GFP_COMP;
 
-	return vma_alloc_folio(gfp, folio_order(src), vma, address,
-			folio_test_large(src));
+	page = alloc_pages_mpol(gfp, order, pol, ilx, nid);
+	return page_rmappable_folio(page);
 }
 
 #else
@@ -1224,7 +1220,8 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
 	return -ENOSYS;
 }
 
-static struct folio *new_folio(struct folio *src, unsigned long start)
+static struct folio *alloc_migration_target_by_mpol(struct folio *src,
+						unsigned long private)
 {
 	return NULL;
 }
@@ -1298,6 +1295,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 
 	if (nr_failed < 0) {
 		err = nr_failed;
+		nr_failed = 0;
 	} else {
 		vma_iter_init(&vmi, mm, start);
 		prev = vma_prev(&vmi);
@@ -1308,19 +1306,24 @@ static long do_mbind(unsigned long start, unsigned long len,
 		}
 	}
 
-	if (!err) {
-		if (!list_empty(&pagelist)) {
-			nr_failed |= migrate_pages(&pagelist, new_folio, NULL,
-				start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND, NULL);
+	mmap_write_unlock(mm);
+
+	if (!err && !list_empty(&pagelist)) {
+		/* Convert MPOL_DEFAULT's NULL to task or default policy */
+		if (!new) {
+			new = get_task_policy(current);
+			mpol_get(new);
 		}
-		if (nr_failed && (flags & MPOL_MF_STRICT))
-			err = -EIO;
+		nr_failed |= migrate_pages(&pagelist,
+				alloc_migration_target_by_mpol, NULL,
+				(unsigned long)new, MIGRATE_SYNC,
+				MR_MEMPOLICY_MBIND, NULL);
 	}
 
+	if (nr_failed && (flags & MPOL_MF_STRICT))
+		err = -EIO;
 	if (!list_empty(&pagelist))
 		putback_movable_pages(&pagelist);
-
-	mmap_write_unlock(mm);
 mpol_out:
 	mpol_put(new);
 	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))