From patchwork Mon Sep 25 08:35:03 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Hugh Dickins
X-Patchwork-Id: 13397490
Date: Mon, 25 Sep 2023 01:35:03 -0700 (PDT)
From: Hugh Dickins
To: Andrew Morton
cc: Andi Kleen, Christoph Lameter, Matthew Wilcox, Mike Kravetz,
    David Hildenbrand, Suren Baghdasaryan, Yang Shi, Sidhartha Kumar,
    Vishal Moola, Kefeng Wang, Greg Kroah-Hartman, Tejun Heo, Mel Gorman,
    Michal Hocko, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH 11/12] mempolicy: mmap_lock is not needed while migrating folios
In-Reply-To: <2d872cef-7787-a7ca-10e-9d45a64c80b4@google.com>
Message-ID: <73183de1-6529-b146-f2cc-fcd5b812166@google.com>
References: <2d872cef-7787-a7ca-10e-9d45a64c80b4@google.com>

mbind(2) holds down_write of current task's mmap_lock throughout
(exclusive because it needs to set the new mempolicy on the vmas);
migrate_pages(2) holds down_read of pid's mmap_lock throughout.

They both hold mmap_lock across the internal migrate_pages(), under
which all new page allocations (huge or small) are made.
I'm nervous about it; and migrate_pages() certainly does not need
mmap_lock itself. It's done this way for mbind(2), because its page
allocator is vma_alloc_folio() or alloc_hugetlb_folio_vma(), both of
which depend on vma and address.

Now that we have alloc_pages_mpol(), depending on (refcounted) memory
policy and interleave index, mbind(2) can be modified to use that or
alloc_hugetlb_folio_nodemask(), and then not need mmap_lock across the
internal migrate_pages() at all: add alloc_migration_target_by_mpol() to
replace mbind's new_folio().

(After that change, alloc_hugetlb_folio_vma() is used by nothing but a
userfaultfd function: move it out of hugetlb.h and into the #ifdef.)

migrate_pages(2) has chosen its target node before migrating, so can
continue to use the standard alloc_migration_target(); but let it take
and drop mmap_lock just around migrate_to_node()'s queue_pages_range():
neither the node-to-node calculations nor the page migrations need it.

It seems unlikely, but it is conceivable that some userspace depends on
the kernel's mmap_lock exclusion here, instead of doing its own locking:
more likely in a testsuite than in real life. It is also possible, of
course, that some pages on the list will be munmapped by another thread
before they are migrated, or a newer memory policy applied to the range
by that time: but such races could happen before, as soon as mmap_lock
was dropped, so it does not appear to be a concern.

Signed-off-by: Hugh Dickins
---
 include/linux/hugetlb.h |  9 -----
 mm/hugetlb.c            | 38 +++++++++---------
 mm/mempolicy.c          | 85 +++++++++++++++++++++--------------------
 3 files changed, 64 insertions(+), 68 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 6522eb3cd007..9c4265c73f76 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -714,8 +714,6 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
 struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
 				nodemask_t *nmask, gfp_t gfp_mask);
-struct folio *alloc_hugetlb_folio_vma(struct hstate *h, struct vm_area_struct *vma,
-				unsigned long address);
 int hugetlb_add_to_page_cache(struct folio *folio, struct address_space *mapping,
 			pgoff_t idx);
 void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
@@ -1024,13 +1022,6 @@ alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
 	return NULL;
 }
 
-static inline struct folio *alloc_hugetlb_folio_vma(struct hstate *h,
-					       struct vm_area_struct *vma,
-					       unsigned long address)
-{
-	return NULL;
-}
-
 static inline int __alloc_bootmem_huge_page(struct hstate *h)
 {
 	return 0;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ba6d39b71cb1..1af54dbbd7cc 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2479,24 +2479,6 @@ struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
 	return alloc_migrate_hugetlb_folio(h, gfp_mask, preferred_nid, nmask);
 }
 
-/* mempolicy aware migration callback */
-struct folio *alloc_hugetlb_folio_vma(struct hstate *h, struct vm_area_struct *vma,
-		unsigned long address)
-{
-	struct mempolicy *mpol;
-	nodemask_t *nodemask;
-	struct folio *folio;
-	gfp_t gfp_mask;
-	int node;
-
-	gfp_mask = htlb_alloc_mask(h);
-	node = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
-	folio = alloc_hugetlb_folio_nodemask(h, node, nodemask, gfp_mask);
-	mpol_cond_put(mpol);
-
-	return folio;
-}
-
 /*
  * Increase the hugetlb pool such that it can accommodate a reservation
  * of size 'delta'.
@@ -6225,6 +6207,26 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 #ifdef CONFIG_USERFAULTFD
+/*
+ * Can probably be eliminated, but still used by hugetlb_mfill_atomic_pte().
+ */
+static struct folio *alloc_hugetlb_folio_vma(struct hstate *h,
+		struct vm_area_struct *vma, unsigned long address)
+{
+	struct mempolicy *mpol;
+	nodemask_t *nodemask;
+	struct folio *folio;
+	gfp_t gfp_mask;
+	int node;
+
+	gfp_mask = htlb_alloc_mask(h);
+	node = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
+	folio = alloc_hugetlb_folio_nodemask(h, node, nodemask, gfp_mask);
+	mpol_cond_put(mpol);
+
+	return folio;
+}
+
 /*
  * Used by userfaultfd UFFDIO_* ioctls. Based on userfaultfd's mfill_atomic_pte
  * with modifications for hugetlb pages.
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d74df1e1b14a..74b1894d29c1 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -417,6 +417,8 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 
 static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
 				unsigned long flags);
+static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
+				pgoff_t ilx, int *nid);
 
 static bool strictly_unmovable(unsigned long flags)
 {
@@ -1040,6 +1042,8 @@ static long migrate_to_node(struct mm_struct *mm, int source, int dest,
 	node_set(source, nmask);
 
 	VM_BUG_ON(!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)));
+
+	mmap_read_lock(mm);
 	vma = find_vma(mm, 0);
 
 	/*
@@ -1050,6 +1054,7 @@ static long migrate_to_node(struct mm_struct *mm, int source, int dest,
 	 */
 	nr_failed = queue_pages_range(mm, vma->vm_start, mm->task_size, &nmask,
 				      flags | MPOL_MF_DISCONTIG_OK, &pagelist);
+	mmap_read_unlock(mm);
 
 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, alloc_migration_target, NULL,
@@ -1078,8 +1083,6 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
 
 	lru_cache_disable();
 
-	mmap_read_lock(mm);
-
 	/*
 	 * Find a 'source' bit set in 'tmp' whose corresponding 'dest'
 	 * bit in 'to' is not also set in 'tmp'. Clear the found 'source'
@@ -1159,7 +1162,6 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
 		if (err < 0)
 			break;
 	}
-	mmap_read_unlock(mm);
 
 	lru_cache_enable();
 	if (err < 0)
@@ -1168,44 +1170,38 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
 }
 
 /*
- * Allocate a new page for page migration based on vma policy.
- * Start by assuming the page is mapped by the same vma as contains @start.
- * Search forward from there, if not. N.B., this assumes that the
- * list of pages handed to migrate_pages()--which is how we get here--
- * is in virtual address order.
+ * Allocate a new folio for page migration, according to NUMA mempolicy.
  */
-static struct folio *new_folio(struct folio *src, unsigned long start)
+static struct folio *alloc_migration_target_by_mpol(struct folio *src,
+						    unsigned long private)
 {
-	struct vm_area_struct *vma;
-	unsigned long address;
-	VMA_ITERATOR(vmi, current->mm, start);
-	gfp_t gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL;
-
-	for_each_vma(vmi, vma) {
-		address = page_address_in_vma(&src->page, vma);
-		if (address != -EFAULT)
-			break;
-	}
-
-	/*
-	 * __get_vma_policy() now expects a genuine non-NULL vma. Return NULL
-	 * when the page can no longer be located in a vma: that is not ideal
-	 * (migrate_pages() will give up early, presuming ENOMEM), but good
-	 * enough to avoid a crash by syzkaller or concurrent holepunch.
-	 */
-	if (!vma)
-		return NULL;
+	struct mempolicy *pol = (struct mempolicy *)private;
+	pgoff_t ilx = 0;	/* improve on this later */
+	struct page *page;
+	unsigned int order;
+	int nid = numa_node_id();
+	gfp_t gfp;
 
 	if (folio_test_hugetlb(src)) {
-		return alloc_hugetlb_folio_vma(folio_hstate(src),
-				vma, address);
+		nodemask_t *nodemask;
+		struct hstate *h;
+
+		ilx += src->index;	/* HugeTLBfs indexes in hpage_size */
+		h = folio_hstate(src);
+		gfp = htlb_alloc_mask(h);
+		nodemask = policy_nodemask(gfp, pol, ilx, &nid);
+		return alloc_hugetlb_folio_nodemask(h, nid, nodemask, gfp);
 	}
 
 	if (folio_test_large(src))
 		gfp = GFP_TRANSHUGE;
+	else
+		gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL | __GFP_COMP;
 
-	return vma_alloc_folio(gfp, folio_order(src), vma, address,
-			folio_test_large(src));
+	order = folio_order(src);
+	ilx += src->index >> order;
+	page = alloc_pages_mpol(gfp, order, pol, ilx, nid);
+	return page_rmappable_folio(page);
 }
 #else
 
@@ -1221,7 +1217,8 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
 	return -ENOSYS;
 }
 
-static struct folio *new_folio(struct folio *src, unsigned long start)
+static struct folio *alloc_migration_target_by_mpol(struct folio *src,
+						    unsigned long private)
 {
 	return NULL;
 }
@@ -1295,6 +1292,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 
 	if (nr_failed < 0) {
 		err = nr_failed;
+		nr_failed = 0;
 	} else {
 		vma_iter_init(&vmi, mm, start);
 		prev = vma_prev(&vmi);
@@ -1305,19 +1303,24 @@ static long do_mbind(unsigned long start, unsigned long len,
 		}
 	}
 
-	if (!err) {
-		if (!list_empty(&pagelist)) {
-			nr_failed |= migrate_pages(&pagelist, new_folio, NULL,
-				start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND, NULL);
+	mmap_write_unlock(mm);
+
+	if (!err && !list_empty(&pagelist)) {
+		/* Convert MPOL_DEFAULT's NULL to task or default policy */
+		if (!new) {
+			new = get_task_policy(current);
+			mpol_get(new);
 		}
-		if (nr_failed && (flags & MPOL_MF_STRICT))
-			err = -EIO;
+		nr_failed |= migrate_pages(&pagelist,
+				alloc_migration_target_by_mpol, NULL,
+				(unsigned long)new, MIGRATE_SYNC,
+				MR_MEMPOLICY_MBIND, NULL);
 	}
 
+	if (nr_failed && (flags & MPOL_MF_STRICT))
+		err = -EIO;
 	if (!list_empty(&pagelist))
 		putback_movable_pages(&pagelist);
-
-	mmap_write_unlock(mm);
 mpol_out:
 	mpol_put(new);
 	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))