From patchwork Fri Apr 29 13:35:35 2022
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 12832006
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com,
    mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org,
    dennis@kernel.org, ming.lei@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [RFC PATCH 01/18] x86/mm/encrypt: add the missing pte_unmap() call
Date: Fri, 29 Apr 2022 21:35:35 +0800
Message-Id: <20220429133552.33768-2-zhengqi.arch@bytedance.com>
In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com>
References: <20220429133552.33768-1-zhengqi.arch@bytedance.com>

The paired pte_unmap() call is missing before sme_populate_pgd() returns.
Although this code only runs under CONFIG_X86_64, where the unmap is a no-op,
it is still necessary to add the paired pte_unmap() call for the correctness
of the code semantics.

Signed-off-by: Qi Zheng
---
 arch/x86/mm/mem_encrypt_identity.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index b43bc24d2bb6..6d323230320a 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -190,6 +190,7 @@ static void __init sme_populate_pgd(struct sme_populate_pgd_data *ppd)
 	pte = pte_offset_map(pmd, ppd->vaddr);
 	if (pte_none(*pte))
 		set_pte(pte, __pte(ppd->paddr | ppd->pte_flags));
+	pte_unmap(pte);
 }
 
 static void __init __sme_map_range_pmd(struct sme_populate_pgd_data *ppd)
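For context, the sketch below (illustrative only, not part of the patch; the
function name is made up) shows the pairing convention the fix restores: every
pte_offset_map() is expected to be balanced by a pte_unmap(), even on
configurations where both compile to no-ops.

	/* Illustrative sketch of the pte_offset_map()/pte_unmap() pairing rule. */
	static void example_touch_pte(pmd_t *pmd, unsigned long addr)
	{
		pte_t *pte = pte_offset_map(pmd, addr);	/* may create a temporary mapping */

		if (pte_none(*pte)) {
			/* ... populate or inspect the entry ... */
		}

		pte_unmap(pte);		/* always balance the map */
	}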
From patchwork Fri Apr 29 13:35:36 2022
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 12832007
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com,
    mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org,
    dennis@kernel.org, ming.lei@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [RFC PATCH 02/18] percpu_ref: make ref stable after percpu_ref_switch_to_atomic_sync() returns
Date: Fri, 29 Apr 2022 21:35:36 +0800
Message-Id: <20220429133552.33768-3-zhengqi.arch@bytedance.com>
In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com>
References: <20220429133552.33768-1-zhengqi.arch@bytedance.com>

In percpu_ref_call_confirm_rcu(), wake_up_all() is called before
percpu_ref_put(), which causes the
value of the percpu_ref to be unstable when percpu_ref_switch_to_atomic_sync()
returns:

        CPU0                                    CPU1

  percpu_ref_switch_to_atomic_sync(&ref)
  --> percpu_ref_switch_to_atomic(&ref)
      --> percpu_ref_get(ref);      /* put after confirmation */
          call_rcu(&ref->data->rcu,
                   percpu_ref_switch_to_atomic_rcu);

                                        percpu_ref_switch_to_atomic_rcu
                                        --> percpu_ref_call_confirm_rcu
                                            --> data->confirm_switch = NULL;
                                                wake_up_all(&percpu_ref_switch_waitq);

  /* here waiting to wake up */
  wait_event(percpu_ref_switch_waitq,
             !ref->data->confirm_switch);
                                                (A)percpu_ref_put(ref);
  /* The value of &ref is unstable! */
  percpu_ref_is_zero(&ref)
                                                (B)percpu_ref_put(ref);

As shown above, assuming that the counts on each CPU add up to 0 before
percpu_ref_switch_to_atomic_sync() is called, we expect percpu_ref_is_zero()
to return true after the switch to atomic mode. But it actually returns
different values in the two cases A and B, which is not what we expect.

There are currently two users of percpu_ref_switch_to_atomic_sync() in the
kernel:

 i.  mddev->writes_pending in drivers/md/md.c
 ii. q->q_usage_counter in block/blk-pm.c

Both use it as shown above. In the worst case, percpu_ref_is_zero() may never
hold because case B happens every time. While this is unlikely to occur in a
production environment, it is still a problem.

This patch moves the percpu_ref_put() out of the RCU handler and calls it
after wait_event(), which makes the ref stable once
percpu_ref_switch_to_atomic_sync() returns. In the example above,
percpu_ref_is_zero() then sees a steady zero value, which is what we expect.

Signed-off-by: Qi Zheng
---
 include/linux/percpu-refcount.h |  4 ++-
 lib/percpu-refcount.c           | 56 +++++++++++++++++++++++----------
 2 files changed, 43 insertions(+), 17 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index d73a1c08c3e3..75844939a965 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -98,6 +98,7 @@ struct percpu_ref_data {
 	percpu_ref_func_t *confirm_switch;
 	bool force_atomic:1;
 	bool allow_reinit:1;
+	bool sync:1;
 	struct rcu_head rcu;
 	struct percpu_ref *ref;
 };
@@ -123,7 +124,8 @@ int __must_check percpu_ref_init(struct percpu_ref *ref,
 				 gfp_t gfp);
 void percpu_ref_exit(struct percpu_ref *ref);
 void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
-				 percpu_ref_func_t *confirm_switch);
+				 percpu_ref_func_t *confirm_switch,
+				 bool sync);
 void percpu_ref_switch_to_atomic_sync(struct percpu_ref *ref);
 void percpu_ref_switch_to_percpu(struct percpu_ref *ref);
 void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index af9302141bcf..3a8906715e09 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -99,6 +99,7 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release,
 	data->release = release;
 	data->confirm_switch = NULL;
 	data->ref = ref;
+	data->sync = false;
 	ref->data = data;
 	return 0;
 }
@@ -146,21 +147,33 @@ void percpu_ref_exit(struct percpu_ref *ref)
 }
 EXPORT_SYMBOL_GPL(percpu_ref_exit);
 
+static inline void percpu_ref_switch_to_atomic_post(struct percpu_ref *ref)
+{
+	struct percpu_ref_data *data = ref->data;
+
+	if (!data->allow_reinit)
+		__percpu_ref_exit(ref);
+
+	/* drop ref from percpu_ref_switch_to_atomic() */
+	percpu_ref_put(ref);
+}
+
 static void percpu_ref_call_confirm_rcu(struct rcu_head *rcu)
 {
 	struct percpu_ref_data *data = container_of(rcu,
 			struct percpu_ref_data, rcu);
 	struct percpu_ref *ref = data->ref;
+	bool need_put = true;
+
+	if (data->sync)
+		need_put = data->sync = false;
 
 	data->confirm_switch(ref);
 	data->confirm_switch = NULL;
 	wake_up_all(&percpu_ref_switch_waitq);
 
-	if (!data->allow_reinit)
-		__percpu_ref_exit(ref);
-
-	/* drop ref from percpu_ref_switch_to_atomic() */
-	percpu_ref_put(ref);
+	if (need_put)
+		percpu_ref_switch_to_atomic_post(ref);
 }
 
 static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu)
@@ -210,14 +223,19 @@ static void percpu_ref_noop_confirm_switch(struct percpu_ref *ref)
 }
 
 static void __percpu_ref_switch_to_atomic(struct percpu_ref *ref,
-					  percpu_ref_func_t *confirm_switch)
+					  percpu_ref_func_t *confirm_switch,
+					  bool sync)
 {
 	if (ref->percpu_count_ptr & __PERCPU_REF_ATOMIC) {
 		if (confirm_switch)
 			confirm_switch(ref);
+		if (sync)
+			percpu_ref_get(ref);
 		return;
 	}
 
+	ref->data->sync = sync;
+
 	/* switching from percpu to atomic */
 	ref->percpu_count_ptr |= __PERCPU_REF_ATOMIC;
 
@@ -232,13 +250,16 @@ static void __percpu_ref_switch_to_atomic(struct percpu_ref *ref,
 	call_rcu(&ref->data->rcu, percpu_ref_switch_to_atomic_rcu);
 }
 
-static void __percpu_ref_switch_to_percpu(struct percpu_ref *ref)
+static void __percpu_ref_switch_to_percpu(struct percpu_ref *ref, bool sync)
 {
 	unsigned long __percpu *percpu_count = percpu_count_ptr(ref);
 	int cpu;
 
 	BUG_ON(!percpu_count);
 
+	if (sync)
+		percpu_ref_get(ref);
+
 	if (!(ref->percpu_count_ptr & __PERCPU_REF_ATOMIC))
 		return;
 
@@ -261,7 +282,8 @@ static void __percpu_ref_switch_to_percpu(struct percpu_ref *ref)
 }
 
 static void __percpu_ref_switch_mode(struct percpu_ref *ref,
-				     percpu_ref_func_t *confirm_switch)
+				     percpu_ref_func_t *confirm_switch,
+				     bool sync)
 {
 	struct percpu_ref_data *data = ref->data;
 
@@ -276,9 +298,9 @@ static void __percpu_ref_switch_mode(struct percpu_ref *ref,
 			    percpu_ref_switch_lock);
 
 	if (data->force_atomic || percpu_ref_is_dying(ref))
-		__percpu_ref_switch_to_atomic(ref, confirm_switch);
+		__percpu_ref_switch_to_atomic(ref, confirm_switch, sync);
 	else
-		__percpu_ref_switch_to_percpu(ref);
+		__percpu_ref_switch_to_percpu(ref, sync);
 }
 
 /**
@@ -302,14 +324,15 @@ static void __percpu_ref_switch_mode(struct percpu_ref *ref,
  * switching to atomic mode, this function can be called from any context.
  */
 void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
-				 percpu_ref_func_t *confirm_switch)
+				 percpu_ref_func_t *confirm_switch,
+				 bool sync)
 {
 	unsigned long flags;
 
 	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
 
 	ref->data->force_atomic = true;
-	__percpu_ref_switch_mode(ref, confirm_switch);
+	__percpu_ref_switch_mode(ref, confirm_switch, sync);
 
 	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
 }
@@ -325,8 +348,9 @@ EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic);
  */
 void percpu_ref_switch_to_atomic_sync(struct percpu_ref *ref)
 {
-	percpu_ref_switch_to_atomic(ref, NULL);
+	percpu_ref_switch_to_atomic(ref, NULL, true);
 	wait_event(percpu_ref_switch_waitq, !ref->data->confirm_switch);
+	percpu_ref_switch_to_atomic_post(ref);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic_sync);
 
@@ -355,7 +379,7 @@ void percpu_ref_switch_to_percpu(struct percpu_ref *ref)
 	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
 
 	ref->data->force_atomic = false;
-	__percpu_ref_switch_mode(ref, NULL);
+	__percpu_ref_switch_mode(ref, NULL, false);
 
 	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
 }
@@ -390,7 +414,7 @@ void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 		  ref->data->release);
 
 	ref->percpu_count_ptr |= __PERCPU_REF_DEAD;
-	__percpu_ref_switch_mode(ref, confirm_kill);
+	__percpu_ref_switch_mode(ref, confirm_kill, false);
 	percpu_ref_put(ref);
 
 	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
@@ -470,7 +494,7 @@ void percpu_ref_resurrect(struct percpu_ref *ref)
 	ref->percpu_count_ptr &= ~__PERCPU_REF_DEAD;
 
 	percpu_ref_get(ref);
-	__percpu_ref_switch_mode(ref, NULL);
+	__percpu_ref_switch_mode(ref, NULL, false);
 
 	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
 }
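To make the intent concrete, here is a hedged sketch (hypothetical caller, not
from the patch) of the usage pattern the two in-tree users follow; with this
change the counter is guaranteed to be stable by the time the
percpu_ref_is_zero() check runs.

	/* Hypothetical caller illustrating the pattern described above. */
	static bool example_quiesce(struct percpu_ref *ref)
	{
		/* Switch to atomic mode and wait for the confirmation callback. */
		percpu_ref_switch_to_atomic_sync(ref);

		/*
		 * After this patch, the put that balances the internal get has
		 * already happened here, so the check below sees a stable count.
		 */
		return percpu_ref_is_zero(ref);
	}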
From patchwork Fri Apr 29 13:35:37 2022
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 12832008
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com,
    mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org,
    dennis@kernel.org, ming.lei@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [RFC PATCH 03/18] percpu_ref: make percpu_ref_switch_lock per percpu_ref
Date: Fri, 29 Apr 2022 21:35:37 +0800
Message-Id: <20220429133552.33768-4-zhengqi.arch@bytedance.com>
In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com>
References: <20220429133552.33768-1-zhengqi.arch@bytedance.com>

Currently, percpu_ref uses the global percpu_ref_switch_lock to protect the
mode-switching operation. When multiple percpu_refs switch modes at the same
time, this lock may become a performance bottleneck.

This patch introduces a per-percpu_ref percpu_ref_switch_lock to fix this.
Signed-off-by: Qi Zheng
---
 include/linux/percpu-refcount.h |  2 ++
 lib/percpu-refcount.c           | 30 +++++++++++++++---------------
 2 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 75844939a965..eb8695e578fd 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -110,6 +110,8 @@ struct percpu_ref {
 	 */
 	unsigned long percpu_count_ptr;
 
+	spinlock_t percpu_ref_switch_lock;
+
 	/*
 	 * 'percpu_ref' is often embedded into user structure, and only
 	 * 'percpu_count_ptr' is required in fast path, move other fields
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 3a8906715e09..4336fd1bd77a 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -36,7 +36,6 @@
 
 #define PERCPU_COUNT_BIAS	(1LU << (BITS_PER_LONG - 1))
 
-static DEFINE_SPINLOCK(percpu_ref_switch_lock);
 static DECLARE_WAIT_QUEUE_HEAD(percpu_ref_switch_waitq);
 
 static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
@@ -95,6 +94,7 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release,
 		start_count++;
 
 	atomic_long_set(&data->count, start_count);
+	spin_lock_init(&ref->percpu_ref_switch_lock);
 
 	data->release = release;
 	data->confirm_switch = NULL;
@@ -137,11 +137,11 @@ void percpu_ref_exit(struct percpu_ref *ref)
 	if (!data)
 		return;
 
-	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+	spin_lock_irqsave(&ref->percpu_ref_switch_lock, flags);
 	ref->percpu_count_ptr |= atomic_long_read(&ref->data->count) <<
 		__PERCPU_REF_FLAG_BITS;
 	ref->data = NULL;
-	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+	spin_unlock_irqrestore(&ref->percpu_ref_switch_lock, flags);
 
 	kfree(data);
 }
@@ -287,7 +287,7 @@ static void __percpu_ref_switch_mode(struct percpu_ref *ref,
 {
 	struct percpu_ref_data *data = ref->data;
 
-	lockdep_assert_held(&percpu_ref_switch_lock);
+	lockdep_assert_held(&ref->percpu_ref_switch_lock);
 
 	/*
 	 * If the previous ATOMIC switching hasn't finished yet, wait for
@@ -295,7 +295,7 @@ static void __percpu_ref_switch_mode(struct percpu_ref *ref,
 	 * isn't in progress, this function can be called from any context.
 	 */
 	wait_event_lock_irq(percpu_ref_switch_waitq, !data->confirm_switch,
-			    percpu_ref_switch_lock);
+			    ref->percpu_ref_switch_lock);
 
 	if (data->force_atomic || percpu_ref_is_dying(ref))
 		__percpu_ref_switch_to_atomic(ref, confirm_switch, sync);
@@ -329,12 +329,12 @@ void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+	spin_lock_irqsave(&ref->percpu_ref_switch_lock, flags);
 
 	ref->data->force_atomic = true;
 	__percpu_ref_switch_mode(ref, confirm_switch, sync);
 
-	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+	spin_unlock_irqrestore(&ref->percpu_ref_switch_lock, flags);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic);
 
@@ -376,12 +376,12 @@ void percpu_ref_switch_to_percpu(struct percpu_ref *ref)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+	spin_lock_irqsave(&ref->percpu_ref_switch_lock, flags);
 
 	ref->data->force_atomic = false;
 	__percpu_ref_switch_mode(ref, NULL, false);
 
-	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+	spin_unlock_irqrestore(&ref->percpu_ref_switch_lock, flags);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_switch_to_percpu);
 
@@ -407,7 +407,7 @@ void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+	spin_lock_irqsave(&ref->percpu_ref_switch_lock, flags);
 
 	WARN_ONCE(percpu_ref_is_dying(ref),
 		  "%s called more than once on %ps!", __func__,
@@ -417,7 +417,7 @@ void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 	__percpu_ref_switch_mode(ref, confirm_kill, false);
 	percpu_ref_put(ref);
 
-	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+	spin_unlock_irqrestore(&ref->percpu_ref_switch_lock, flags);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_kill_and_confirm);
 
@@ -438,12 +438,12 @@ bool percpu_ref_is_zero(struct percpu_ref *ref)
 		return false;
 
 	/* protect us from being destroyed */
-	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+	spin_lock_irqsave(&ref->percpu_ref_switch_lock, flags);
 	if (ref->data)
 		count = atomic_long_read(&ref->data->count);
 	else
 		count = ref->percpu_count_ptr >> __PERCPU_REF_FLAG_BITS;
-	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+	spin_unlock_irqrestore(&ref->percpu_ref_switch_lock, flags);
 
 	return count == 0;
 }
@@ -487,7 +487,7 @@ void percpu_ref_resurrect(struct percpu_ref *ref)
 	unsigned long __percpu *percpu_count;
 	unsigned long flags;
 
-	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+	spin_lock_irqsave(&ref->percpu_ref_switch_lock, flags);
 
 	WARN_ON_ONCE(!percpu_ref_is_dying(ref));
 	WARN_ON_ONCE(__ref_is_percpu(ref, &percpu_count));
@@ -496,6 +496,6 @@ void percpu_ref_resurrect(struct percpu_ref *ref)
 	percpu_ref_get(ref);
 	__percpu_ref_switch_mode(ref, NULL, false);
 
-	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+	spin_unlock_irqrestore(&ref->percpu_ref_switch_lock, flags);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_resurrect);
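As an illustration (hypothetical code, not part of the patch), imagine each of
the helpers below running on a different CPU at the same time. Before this
patch both would contend on the single global percpu_ref_switch_lock; now each
only takes the lock embedded in its own ref.

	/* Hypothetical illustration: the two switches no longer serialize. */
	static void cpu0_work(struct percpu_ref *a)
	{
		percpu_ref_switch_to_atomic_sync(a);	/* a->percpu_ref_switch_lock */
	}

	static void cpu1_work(struct percpu_ref *b)
	{
		percpu_ref_switch_to_percpu(b);		/* b->percpu_ref_switch_lock */
	}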
From patchwork Fri Apr 29 13:35:38 2022
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 12832009
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com,
    mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org,
    dennis@kernel.org, ming.lei@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [RFC PATCH 04/18] mm: convert to use ptep_clear() in pte_clear_not_present_full()
Date: Fri, 29 Apr 2022 21:35:38 +0800
Message-Id: <20220429133552.33768-5-zhengqi.arch@bytedance.com>
In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com>
References: <20220429133552.33768-1-zhengqi.arch@bytedance.com>
After commit 08d5b29eac7d ("mm: ptep_clear() page table helper"), ptep_clear()
can be used to track the clearing of PTE page table entries, but
pte_clear_not_present_full() is not covered by it. Convert it to use
ptep_clear() as well; subsequent patches will rely on this.

Signed-off-by: Qi Zheng
---
 include/linux/pgtable.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f4f4077b97aa..bed9a559d45b 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -423,7 +423,7 @@ static inline void pte_clear_not_present_full(struct mm_struct *mm,
 					      pte_t *ptep,
 					      int full)
 {
-	pte_clear(mm, address, ptep);
+	ptep_clear(mm, address, ptep);
 }
 #endif
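The value of the conversion is that ptep_clear() gives the mm a single choke
point for PTE clears. The sketch below is purely conceptual (the helper name
is hypothetical and this is not the kernel's actual ptep_clear() definition);
it only shows why later patches can hook this path.

	/* Conceptual sketch only -- not the real ptep_clear() implementation. */
	static inline void ptep_clear_sketch(struct mm_struct *mm,
					     unsigned long addr, pte_t *ptep)
	{
		pte_clear(mm, addr, ptep);
		/*
		 * Single hook point: e.g. account cleared entries so that an
		 * empty PTE page table can later be detected and freed.
		 */
	}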
From patchwork Fri Apr 29 13:35:39 2022
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 12832010
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com,
    mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org,
    dennis@kernel.org, ming.lei@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [RFC PATCH 05/18] mm: split the related definitions of pte_offset_map_lock() into pgtable.h
Date: Fri, 29 Apr 2022 21:35:39 +0800
Message-Id: <20220429133552.33768-6-zhengqi.arch@bytedance.com>
In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com>
References: <20220429133552.33768-1-zhengqi.arch@bytedance.com>

pte_offset_map_lock() and its friend pte_offset_map() live in mm.h and
pgtable.h respectively; it would be better to have them in one file. Since
they are all helper functions related to page tables, move
pte_offset_map_lock() to pgtable.h.

pte_lockptr() is required by pte_offset_map_lock(), so move it and its
friends {pmd,pud}_lockptr() to pgtable.h as well.
Signed-off-by: Qi Zheng
---
 include/linux/mm.h      | 149 ----------------------------------------
 include/linux/pgtable.h | 149 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 149 insertions(+), 149 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e34edb775334..0afd3b097e90 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2252,70 +2252,6 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
 }
 #endif /* CONFIG_MMU */
 
-#if USE_SPLIT_PTE_PTLOCKS
-#if ALLOC_SPLIT_PTLOCKS
-void __init ptlock_cache_init(void);
-extern bool ptlock_alloc(struct page *page);
-extern void ptlock_free(struct page *page);
-
-static inline spinlock_t *ptlock_ptr(struct page *page)
-{
-	return page->ptl;
-}
-#else /* ALLOC_SPLIT_PTLOCKS */
-static inline void ptlock_cache_init(void)
-{
-}
-
-static inline bool ptlock_alloc(struct page *page)
-{
-	return true;
-}
-
-static inline void ptlock_free(struct page *page)
-{
-}
-
-static inline spinlock_t *ptlock_ptr(struct page *page)
-{
-	return &page->ptl;
-}
-#endif /* ALLOC_SPLIT_PTLOCKS */
-
-static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
-	return ptlock_ptr(pmd_page(*pmd));
-}
-
-static inline bool ptlock_init(struct page *page)
-{
-	/*
-	 * prep_new_page() initialize page->private (and therefore page->ptl)
-	 * with 0. Make sure nobody took it in use in between.
-	 *
-	 * It can happen if arch try to use slab for page table allocation:
-	 * slab code uses page->slab_cache, which share storage with page->ptl.
-	 */
-	VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
-	if (!ptlock_alloc(page))
-		return false;
-	spin_lock_init(ptlock_ptr(page));
-	return true;
-}
-
-#else /* !USE_SPLIT_PTE_PTLOCKS */
-/*
- * We use mm->page_table_lock to guard all pagetable pages of the mm.
- */
-static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
-	return &mm->page_table_lock;
-}
-static inline void ptlock_cache_init(void) {}
-static inline bool ptlock_init(struct page *page) { return true; }
-static inline void ptlock_free(struct page *page) {}
-#endif /* USE_SPLIT_PTE_PTLOCKS */
-
 static inline void pgtable_init(void)
 {
 	ptlock_cache_init();
@@ -2338,20 +2274,6 @@ static inline void pgtable_pte_page_dtor(struct page *page)
 	dec_lruvec_page_state(page, NR_PAGETABLE);
 }
 
-#define pte_offset_map_lock(mm, pmd, address, ptlp)	\
-({							\
-	spinlock_t *__ptl = pte_lockptr(mm, pmd);	\
-	pte_t *__pte = pte_offset_map(pmd, address);	\
-	*(ptlp) = __ptl;				\
-	spin_lock(__ptl);				\
-	__pte;						\
-})
-
-#define pte_unmap_unlock(pte, ptl)	do {		\
-	spin_unlock(ptl);				\
-	pte_unmap(pte);					\
-} while (0)
-
 #define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
 
 #define pte_alloc_map(mm, pmd, address)			\
@@ -2365,58 +2287,6 @@ static inline void pgtable_pte_page_dtor(struct page *page)
 	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
 		NULL: pte_offset_kernel(pmd, address))
 
-#if USE_SPLIT_PMD_PTLOCKS
-
-static struct page *pmd_to_page(pmd_t *pmd)
-{
-	unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
-	return virt_to_page((void *)((unsigned long) pmd & mask));
-}
-
-static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
-	return ptlock_ptr(pmd_to_page(pmd));
-}
-
-static inline bool pmd_ptlock_init(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	page->pmd_huge_pte = NULL;
-#endif
-	return ptlock_init(page);
-}
-
-static inline void pmd_ptlock_free(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	VM_BUG_ON_PAGE(page->pmd_huge_pte, page);
-#endif
-	ptlock_free(page);
-}
-
-#define pmd_huge_pte(mm, pmd) (pmd_to_page(pmd)->pmd_huge_pte)
-
-#else
-
-static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
-	return &mm->page_table_lock;
-}
-
-static inline bool pmd_ptlock_init(struct page *page) { return true; }
-static inline void pmd_ptlock_free(struct page *page) {}
-
-#define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)
-
-#endif
-
-static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
-{
-	spinlock_t *ptl = pmd_lockptr(mm, pmd);
-	spin_lock(ptl);
-	return ptl;
-}
-
 static inline bool pgtable_pmd_page_ctor(struct page *page)
 {
 	if (!pmd_ptlock_init(page))
@@ -2433,25 +2303,6 @@ static inline void pgtable_pmd_page_dtor(struct page *page)
 	dec_lruvec_page_state(page, NR_PAGETABLE);
 }
 
-/*
- * No scalability reason to split PUD locks yet, but follow the same pattern
- * as the PMD locks to make it easier if we decide to. The VM should not be
- * considered ready to switch to split PUD locks yet; there may be places
- * which need to be converted from page_table_lock.
- */
-static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
-{
-	return &mm->page_table_lock;
-}
-
-static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
-{
-	spinlock_t *ptl = pud_lockptr(mm, pud);
-
-	spin_lock(ptl);
-	return ptl;
-}
-
 extern void __init pagecache_init(void);
 extern void free_initmem(void);
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index bed9a559d45b..0928acca6b48 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -85,6 +85,141 @@ static inline unsigned long pud_index(unsigned long address)
 #define pgd_index(a)  (((a) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
 #endif
 
+#if USE_SPLIT_PTE_PTLOCKS
+#if ALLOC_SPLIT_PTLOCKS
+void __init ptlock_cache_init(void);
+extern bool ptlock_alloc(struct page *page);
+extern void ptlock_free(struct page *page);
+
+static inline spinlock_t *ptlock_ptr(struct page *page)
+{
+	return page->ptl;
+}
+#else /* ALLOC_SPLIT_PTLOCKS */
+static inline void ptlock_cache_init(void)
+{
+}
+
+static inline bool ptlock_alloc(struct page *page)
+{
+	return true;
+}
+
+static inline void ptlock_free(struct page *page)
+{
+}
+
+static inline spinlock_t *ptlock_ptr(struct page *page)
+{
+	return &page->ptl;
+}
+#endif /* ALLOC_SPLIT_PTLOCKS */
+
+static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+	return ptlock_ptr(pmd_page(*pmd));
+}
+
+static inline bool ptlock_init(struct page *page)
+{
+	/*
+	 * prep_new_page() initialize page->private (and therefore page->ptl)
+	 * with 0. Make sure nobody took it in use in between.
+	 *
+	 * It can happen if arch try to use slab for page table allocation:
+	 * slab code uses page->slab_cache, which share storage with page->ptl.
+	 */
+	VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
+	if (!ptlock_alloc(page))
+		return false;
+	spin_lock_init(ptlock_ptr(page));
+	return true;
+}
+
+#else /* !USE_SPLIT_PTE_PTLOCKS */
+/*
+ * We use mm->page_table_lock to guard all pagetable pages of the mm.
+ */
+static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+	return &mm->page_table_lock;
+}
+static inline void ptlock_cache_init(void) {}
+static inline bool ptlock_init(struct page *page) { return true; }
+static inline void ptlock_free(struct page *page) {}
+#endif /* USE_SPLIT_PTE_PTLOCKS */
+
+#if USE_SPLIT_PMD_PTLOCKS
+
+static struct page *pmd_to_page(pmd_t *pmd)
+{
+	unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
+	return virt_to_page((void *)((unsigned long) pmd & mask));
+}
+
+static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+	return ptlock_ptr(pmd_to_page(pmd));
+}
+
+static inline bool pmd_ptlock_init(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	page->pmd_huge_pte = NULL;
+#endif
+	return ptlock_init(page);
+}
+
+static inline void pmd_ptlock_free(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	VM_BUG_ON_PAGE(page->pmd_huge_pte, page);
+#endif
+	ptlock_free(page);
+}
+
+#define pmd_huge_pte(mm, pmd) (pmd_to_page(pmd)->pmd_huge_pte)
+
+#else /* !USE_SPLIT_PMD_PTLOCKS */
+
+static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+	return &mm->page_table_lock;
+}
+
+static inline bool pmd_ptlock_init(struct page *page) { return true; }
+static inline void pmd_ptlock_free(struct page *page) {}
+
+#define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)
+
+#endif /* USE_SPLIT_PMD_PTLOCKS */
+
+static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
+{
+	spinlock_t *ptl = pmd_lockptr(mm, pmd);
+	spin_lock(ptl);
+	return ptl;
+}
+
+/*
+ * No scalability reason to split PUD locks yet, but follow the same pattern
+ * as the PMD locks to make it easier if we decide to. The VM should not be
+ * considered ready to switch to split PUD locks yet; there may be places
+ * which need to be converted from page_table_lock.
+ */
+static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
+{
+	return &mm->page_table_lock;
+}
+
+static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
+{
+	spinlock_t *ptl = pud_lockptr(mm, pud);
+
+	spin_lock(ptl);
+	return ptl;
+}
+
 #ifndef pte_offset_kernel
 static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 {
@@ -103,6 +238,20 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 #define pte_unmap(pte) ((void)(pte))	/* NOP */
 #endif
 
+#define pte_offset_map_lock(mm, pmd, address, ptlp)	\
+({							\
+	spinlock_t *__ptl = pte_lockptr(mm, pmd);	\
+	pte_t *__pte = pte_offset_map(pmd, address);	\
+	*(ptlp) = __ptl;				\
+	spin_lock(__ptl);				\
+	__pte;						\
+})
+
+#define pte_unmap_unlock(pte, ptl)	do {		\
+	spin_unlock(ptl);				\
+	pte_unmap(pte);					\
+} while (0)
+
 /* Find an entry in the second-level page table..
  */
 #ifndef pmd_offset
 static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
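For reference, the classic consumer of the moved helpers looks roughly like
the sketch below (illustrative only, hypothetical function name); after this
patch such locking helpers are available to anything that includes
<linux/pgtable.h>.

	/* Illustrative walk using the helpers that now live in pgtable.h. */
	static void example_walk_one(struct mm_struct *mm, pmd_t *pmd,
				     unsigned long addr)
	{
		spinlock_t *ptl;
		pte_t *pte = pte_offset_map_lock(mm, pmd, addr, &ptl);

		if (!pte_none(*pte)) {
			/* ... inspect or modify the entry under the PTE lock ... */
		}

		pte_unmap_unlock(pte, ptl);
	}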
From patchwork Fri Apr 29 13:35:40 2022
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 12832011
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com,
    mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org,
    dennis@kernel.org, ming.lei@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [RFC PATCH 06/18] mm: introduce CONFIG_FREE_USER_PTE
Date: Fri, 29 Apr 2022 21:35:40 +0800
Message-Id: <20220429133552.33768-7-zhengqi.arch@bytedance.com>
In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com>
References: <20220429133552.33768-1-zhengqi.arch@bytedance.com>

This configuration variable will be used to build the code needed to free
user PTE page table pages. The PTE page table setting and clearing functions
(such as set_pte_at()) live in each architecture's files, and these functions
will be hooked to implement FREE_USER_PTE, so architecture support is needed.

Signed-off-by: Qi Zheng
---
 mm/Kconfig | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 034d87953600..af99ed626732 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -909,6 +909,16 @@ config ANON_VMA_NAME
 	  area from being merged with adjacent virtual memory areas due to the
 	  difference in their name.
 
+config ARCH_SUPPORTS_FREE_USER_PTE
+	def_bool n
+
+config FREE_USER_PTE
+	bool "Free user PTE page tables"
+	default y
+	depends on ARCH_SUPPORTS_FREE_USER_PTE && MMU && SMP
+	help
+	  Try to free user PTE page table page when its all entries are none.
+
 source "mm/damon/Kconfig"
 
 endmenu
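The new option is gated on architecture opt-in. A hypothetical architecture
that has hooked its PTE set/clear helpers would advertise support from its own
Kconfig along these lines (sketch only, not part of this patch):

	config FOO_ARCH			# hypothetical architecture entry
		def_bool y
		select ARCH_SUPPORTS_FREE_USER_PTE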
mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org, dennis@kernel.org, ming.lei@redhat.com Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng Subject: [RFC PATCH 07/18] mm: add pte_to_page() helper Date: Fri, 29 Apr 2022 21:35:41 +0800 Message-Id: <20220429133552.33768-8-zhengqi.arch@bytedance.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com> References: <20220429133552.33768-1-zhengqi.arch@bytedance.com> MIME-Version: 1.0 X-Stat-Signature: fdqmf6q7ti695bojixxrd4ha6e97zd1g X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: EBA1914003A Authentication-Results: imf23.hostedemail.com; dkim=pass header.d=bytedance-com.20210112.gappssmtp.com header.s=20210112 header.b=Pp9YaWXF; spf=pass (imf23.hostedemail.com: domain of zhengqi.arch@bytedance.com designates 209.85.215.170 as permitted sender) smtp.mailfrom=zhengqi.arch@bytedance.com; dmarc=pass (policy=none) header.from=bytedance.com X-Rspam-User: X-HE-Tag: 1651239411-915043 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Add pte_to_page() helper similar to pmd_to_page(), which will be used to get the struct page of the PTE page table. Signed-off-by: Qi Zheng --- include/linux/pgtable.h | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 0928acca6b48..d1218cb1013e 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -85,6 +85,14 @@ static inline unsigned long pud_index(unsigned long address) #define pgd_index(a) (((a) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1)) #endif +#ifdef CONFIG_FREE_USER_PTE +static inline struct page *pte_to_page(pte_t *pte) +{ + unsigned long mask = ~(PTRS_PER_PTE * sizeof(pte_t) - 1); + return virt_to_page((void *)((unsigned long) pte & mask)); +} +#endif + #if USE_SPLIT_PTE_PTLOCKS #if ALLOC_SPLIT_PTLOCKS void __init ptlock_cache_init(void); From patchwork Fri Apr 29 13:35:42 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qi Zheng X-Patchwork-Id: 12832013 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id ECCD4C433F5 for ; Fri, 29 Apr 2022 13:37:05 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 8E4E66B0073; Fri, 29 Apr 2022 09:37:05 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 895146B007E; Fri, 29 Apr 2022 09:37:05 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 735C16B0081; Fri, 29 Apr 2022 09:37:05 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (relay.hostedemail.com [64.99.140.25]) by kanga.kvack.org (Postfix) with ESMTP id 678266B007E for ; Fri, 29 Apr 2022 09:37:05 -0400 (EDT) Received: from smtpin16.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id 409FB26AA7 for ; Fri, 29 Apr 2022 13:37:05 +0000 (UTC) X-FDA: 79410017610.16.9F1269A Received: from mail-pg1-f178.google.com (mail-pg1-f178.google.com [209.85.215.178]) by imf24.hostedemail.com (Postfix) with ESMTP id 80665180016 for ; Fri, 29 Apr 2022 13:37:00 +0000 (UTC) 
Received: by mail-pg1-f178.google.com with SMTP id q76so3501300pgq.10 for ; Fri, 29 Apr 2022 06:37:04 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance-com.20210112.gappssmtp.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=a0pdWx48nGjNlGaELTzsLa2nKTJtILEQX8NtQw3N2p4=; b=vOzMg0WwZNZJ2aJySF61oOZUFo15rtaSv8ZgCsHQW+x4vUg2gpucSoyYCNUztLrAxj YcWyZiXJ4nMk4ksCagIhUghKnVR27Rn7j5BSZ6aUVnVn3NAPYBai3hPJK2hBYOmr0AY2 gVgLHui6SE2OT16oLJPwJ+QyDgjiWtWpf8tActM2PfHuUuZcatMxnqeHas/8t6014yuM KVp7m4lEzHdo2JHc5dhmn11rh2qarw9f3jRr2hhwGlBagI8nFT6RMAAMvwp/3kKUjbVb VSXDJPDXDGANDA3YkyThMCz5buYTS7SBPVfY+iHQXizpWs0u+IFSRVckBCZyha5Y/rwP zzGA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=a0pdWx48nGjNlGaELTzsLa2nKTJtILEQX8NtQw3N2p4=; b=xjKMW65NwhaoGl3cI03OmViPxEUvjjlOD3tTppRpqKkJh/Pb5X84aKi3HowYfJMPlQ AS6rcKaQUpPsFMXy1jbjan3tk9XYHe2JVA0y7cSQ6rlw2KXq7sNLv9A6OwdYL4N+Dspy X54Q8aBvkEvjBqjQFuYSl6h2XAVsCOPZ5USWahUn3zj4uoVlwzP91ycmVcQ5Yzbg6NMT FNg+2pAI62bRZ7nuaAYCv2GTdWzv5muL0o4kUPCLqNNO4nWHmLmyeA2Q3SYUzvM1wrng +COOdJhzo1vLsce20Tk1ywwTHhLnlyPvVc6HAiRZ4nN+TsNBxWBz7P+wwhI1ROUIZa/U HEuQ== X-Gm-Message-State: AOAM53137IVM4pL3ZOx73BGR54FpEpftkRtKjdg8fwUfWUVswsjvCNxX ofQFAk5V3GtOY0c6eQUev4Z6XA== X-Google-Smtp-Source: ABdhPJwB5nqK52ryAKPNOtkptHUR5XR7j8KK8PoAy/NWEG3XiuyosiUypONBAmFBtJovBvthn4Ixbw== X-Received: by 2002:a65:60d3:0:b0:39c:f431:5859 with SMTP id r19-20020a6560d3000000b0039cf4315859mr32421659pgv.442.1651239423704; Fri, 29 Apr 2022 06:37:03 -0700 (PDT) Received: from localhost.localdomain ([139.177.225.240]) by smtp.gmail.com with ESMTPSA id m8-20020a17090a414800b001d81a30c437sm10681977pjg.50.2022.04.29.06.36.58 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 29 Apr 2022 06:37:03 -0700 (PDT) From: Qi Zheng To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org, dennis@kernel.org, ming.lei@redhat.com Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng Subject: [RFC PATCH 08/18] mm: introduce percpu_ref for user PTE page table page Date: Fri, 29 Apr 2022 21:35:42 +0800 Message-Id: <20220429133552.33768-9-zhengqi.arch@bytedance.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com> References: <20220429133552.33768-1-zhengqi.arch@bytedance.com> MIME-Version: 1.0 X-Stat-Signature: wn31nqbqob8o1b8qq58qkj3rz1n13x8k X-Rspamd-Server: rspam07 X-Rspamd-Queue-Id: 80665180016 X-Rspam-User: Authentication-Results: imf24.hostedemail.com; dkim=pass header.d=bytedance-com.20210112.gappssmtp.com header.s=20210112 header.b=vOzMg0Ww; dmarc=pass (policy=none) header.from=bytedance.com; spf=pass (imf24.hostedemail.com: domain of zhengqi.arch@bytedance.com designates 209.85.215.178 as permitted sender) smtp.mailfrom=zhengqi.arch@bytedance.com X-HE-Tag: 1651239420-681251 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Now in order to pursue high performance, applications mostly use some high-performance user-mode memory allocators, such as jemalloc or tcmalloc. 
These memory allocators use madvise(MADV_DONTNEED or MADV_FREE) to release physical memory for the following reasons:

First of all, we should hold as few write locks of mmap_lock as possible, since the mmap_lock semaphore has long been a contention point in the memory management subsystem. mmap()/munmap() take the write lock, while madvise(MADV_DONTNEED or MADV_FREE) takes the read lock, so using madvise() instead of munmap() to release physical memory reduces contention on the mmap_lock.

Secondly, after using madvise() to release physical memory, there is no need to rebuild the vma or allocate page tables again when the same virtual address is accessed, which also saves some time.

The following table shows the largest amount of user page table memory that can be allocated by a single user process on a 32-bit and a 64-bit system:

+---------------------------+--------+---------+
|                           | 32-bit | 64-bit  |
+===========================+========+=========+
| user PTE page table pages | 3 MiB  | 512 GiB |
+---------------------------+--------+---------+
| user PMD page table pages | 3 KiB  | 1 GiB   |
+---------------------------+--------+---------+

(For 32-bit, take a 3G user address space and 4K page size as an example; for 64-bit, take a 48-bit address width and 4K page size as an example.)

After using madvise(), everything looks good, but as can be seen from the above table, a single process can create a large number of PTE page tables on a 64-bit system, since neither MADV_DONTNEED nor MADV_FREE releases page table memory. And before the process exits or calls munmap(), the kernel cannot reclaim these pages even if these PTE page tables do not map anything.

To fix the situation, this patchset introduces a percpu_ref for each user PTE page table page. The following will hold a percpu_ref:

- any !pte_none() entry, such as a regular page table entry that maps a physical page, a swap entry, a migration entry, etc.
- any visitor to the PTE page table entries, such as a page table walker.

Each ``!pte_none()`` entry and each visitor can be regarded as a user of its PTE page table page. When the percpu_ref is reduced to 0 (we need to switch it to atomic mode first to check), no one is using the PTE page table page, and this empty PTE page table page can then be reclaimed.
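As a rough user-space illustration of the scenario above (a minimal sketch for this description only, not part of the patch), a process can fault in a large mapping and then hand the physical memory back with madvise(MADV_DONTNEED), while the now-empty PTE page table pages stay allocated until munmap() or exit:

/*
 * Illustrative sketch only: after the madvise(MADV_DONTNEED) call the
 * data pages are freed, but the PTE page table pages that were built
 * to map the 1 GiB range remain allocated.
 */
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 30;	/* 1 GiB */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return EXIT_FAILURE;

	memset(buf, 0x5a, len);			/* fault in pages and PTE tables */
	madvise(buf, len, MADV_DONTNEED);	/* data pages go, PTE tables stay */

	/* ... the empty PTE page table pages linger here ... */
	return EXIT_SUCCESS;
}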
Signed-off-by: Qi Zheng --- include/linux/mm.h | 9 +++++++- include/linux/mm_types.h | 1 + include/linux/pte_ref.h | 29 +++++++++++++++++++++++++ mm/Makefile | 2 +- mm/pte_ref.c | 47 ++++++++++++++++++++++++++++++++++++++++ 5 files changed, 86 insertions(+), 2 deletions(-) create mode 100644 include/linux/pte_ref.h create mode 100644 mm/pte_ref.c diff --git a/include/linux/mm.h b/include/linux/mm.h index 0afd3b097e90..1a6bc79c351b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -28,6 +28,7 @@ #include #include #include +#include struct mempolicy; struct anon_vma; @@ -2260,11 +2261,16 @@ static inline void pgtable_init(void) static inline bool pgtable_pte_page_ctor(struct page *page) { - if (!ptlock_init(page)) + if (!pte_ref_init(page)) return false; + if (!ptlock_init(page)) + goto free_pte_ref; __SetPageTable(page); inc_lruvec_page_state(page, NR_PAGETABLE); return true; +free_pte_ref: + pte_ref_free(page); + return false; } static inline void pgtable_pte_page_dtor(struct page *page) @@ -2272,6 +2278,7 @@ static inline void pgtable_pte_page_dtor(struct page *page) ptlock_free(page); __ClearPageTable(page); dec_lruvec_page_state(page, NR_PAGETABLE); + pte_ref_free(page); } #define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd)) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 8834e38c06a4..650bfb22b0e2 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -153,6 +153,7 @@ struct page { union { struct mm_struct *pt_mm; /* x86 pgds only */ atomic_t pt_frag_refcount; /* powerpc */ + struct percpu_ref *pte_ref; /* PTE page only */ }; #if ALLOC_SPLIT_PTLOCKS spinlock_t *ptl; diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h new file mode 100644 index 000000000000..d3963a151ca5 --- /dev/null +++ b/include/linux/pte_ref.h @@ -0,0 +1,29 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2022, ByteDance. All rights reserved. + * + * Author: Qi Zheng + */ + +#ifndef _LINUX_PTE_REF_H +#define _LINUX_PTE_REF_H + +#ifdef CONFIG_FREE_USER_PTE + +bool pte_ref_init(pgtable_t pte); +void pte_ref_free(pgtable_t pte); + +#else /* !CONFIG_FREE_USER_PTE */ + +static inline bool pte_ref_init(pgtable_t pte) +{ + return true; +} + +static inline void pte_ref_free(pgtable_t pte) +{ +} + +#endif /* CONFIG_FREE_USER_PTE */ + +#endif /* _LINUX_PTE_REF_H */ diff --git a/mm/Makefile b/mm/Makefile index 4cc13f3179a5..b9711510f84f 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -54,7 +54,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ mm_init.o percpu.o slab_common.o \ compaction.o vmacache.o \ interval_tree.o list_lru.o workingset.o \ - debug.o gup.o mmap_lock.o $(mmu-y) + debug.o gup.o mmap_lock.o $(mmu-y) pte_ref.o # Give 'page_alloc' its own module-parameter namespace page-alloc-y := page_alloc.o diff --git a/mm/pte_ref.c b/mm/pte_ref.c new file mode 100644 index 000000000000..52e31be00de4 --- /dev/null +++ b/mm/pte_ref.c @@ -0,0 +1,47 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2022, ByteDance. All rights reserved. 
+ * + * Author: Qi Zheng + */ +#include +#include +#include +#include + +#ifdef CONFIG_FREE_USER_PTE + +static void no_op(struct percpu_ref *r) {} + +bool pte_ref_init(pgtable_t pte) +{ + struct percpu_ref *pte_ref; + + pte_ref = kmalloc(sizeof(struct percpu_ref), GFP_KERNEL); + if (!pte_ref) + return false; + if (percpu_ref_init(pte_ref, no_op, + PERCPU_REF_ALLOW_REINIT, GFP_KERNEL) < 0) + goto free_ref; + /* We want to start with the refcount at zero */ + percpu_ref_put(pte_ref); + + pte->pte_ref = pte_ref; + return true; +free_ref: + kfree(pte_ref); + return false; +} + +void pte_ref_free(pgtable_t pte) +{ + struct percpu_ref *ref = pte->pte_ref; + if (!ref) + return; + + pte->pte_ref = NULL; + percpu_ref_exit(ref); + kfree(ref); +} + +#endif /* CONFIG_FREE_USER_PTE */ From patchwork Fri Apr 29 13:35:43 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qi Zheng X-Patchwork-Id: 12832014 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9E388C433F5 for ; Fri, 29 Apr 2022 13:37:11 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 38EA26B0074; Fri, 29 Apr 2022 09:37:11 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 33D8B6B007E; Fri, 29 Apr 2022 09:37:11 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 205856B0080; Fri, 29 Apr 2022 09:37:11 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (relay.a.hostedemail.com [64.99.140.24]) by kanga.kvack.org (Postfix) with ESMTP id 139116B0074 for ; Fri, 29 Apr 2022 09:37:11 -0400 (EDT) Received: from smtpin10.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id DED3D265EA for ; Fri, 29 Apr 2022 13:37:10 +0000 (UTC) X-FDA: 79410017820.10.5B3C809 Received: from mail-pl1-f177.google.com (mail-pl1-f177.google.com [209.85.214.177]) by imf02.hostedemail.com (Postfix) with ESMTP id BBBB18004A for ; Fri, 29 Apr 2022 13:37:06 +0000 (UTC) Received: by mail-pl1-f177.google.com with SMTP id p6so7151017plf.9 for ; Fri, 29 Apr 2022 06:37:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance-com.20210112.gappssmtp.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=DGl50d69T/46DVPYmj2DL1x7ncKmmZ54JyK/lw6nQrc=; b=ARePECzUIuI/o7LaJ6U30K3W4hzXRsvLwyPDJjZN5Ir39F24/V1LJiWa1QSu1EMnH5 A1GdbC0XtJEshbkbFIKnlj/USWQUZJ1ZbMvKe6cExn1vdQNp2EvKrmONYlgc4UfzDqDN NPrE/d77mWvSMe8t+Pj6PUpwZJtO1lp6YSqboATYrrZ7UX43LUsVHKLtdYbl0QShre5S KTKyDJkzYVYEsW858cbU/F6b9nm45sIAiNbvtYGtEbXr8MjkUHvyh6vh9G95S67sc/Qd 5bayvVPY/1MQqIcZ8qKIQkAnNN6I1m7O69zc83csx5Wy9gJ5vcWF0bOjLvSmoOPA1ADx MHpQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=DGl50d69T/46DVPYmj2DL1x7ncKmmZ54JyK/lw6nQrc=; b=K6ptEpY1aKyii38H9ixjBA+uccSCiiupl4l9UnvdMi0awgai4twMc3ms5XVKEOIFm3 rL+JetBvTF2h6vuv1uA8bePPNzG+W/JVyzuEm9ZCM5Z4O3XvJfI6TRJk1UwmmHcSZ2FA E1HzO6UJ4/TardmzK+3c6bIkGgy/yL3BBQWxFnJgD9dXclCnHzkOUisKBEXHlYIZHG8K ANr/lmSAjYZpNUbKHaFjkruTG9cWtdSbo4VxI5XZf2QgguuxDUToQdmM+29kyh/CajBN kFfN75OL5rlUY4ZywYOHG0g9tgd7gCrcI4pXG8t/qGjXOG7MYzE37lYskyoFSjGiK0yQ 
gz1g== X-Gm-Message-State: AOAM5330PG0TJRd0UBnZj+GiCZioeh1+QW9KvwS1+Ph0t24h6Id73f71 84iR+UPi5TJEVIkhR9EUo4u+jg== X-Google-Smtp-Source: ABdhPJwkDEFr/eSiIXsECiLskFgBvhXhvx+Ht6goriTOJRM12m+x++WpdKjE8bfOm4QcHnGGrKiSMQ== X-Received: by 2002:a17:90b:380e:b0:1da:2943:b975 with SMTP id mq14-20020a17090b380e00b001da2943b975mr4006350pjb.42.1651239429455; Fri, 29 Apr 2022 06:37:09 -0700 (PDT) Received: from localhost.localdomain ([139.177.225.240]) by smtp.gmail.com with ESMTPSA id m8-20020a17090a414800b001d81a30c437sm10681977pjg.50.2022.04.29.06.37.04 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 29 Apr 2022 06:37:08 -0700 (PDT) From: Qi Zheng To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org, dennis@kernel.org, ming.lei@redhat.com Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng Subject: [RFC PATCH 09/18] pte_ref: add pte_tryget() and {__,}pte_put() helper Date: Fri, 29 Apr 2022 21:35:43 +0800 Message-Id: <20220429133552.33768-10-zhengqi.arch@bytedance.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com> References: <20220429133552.33768-1-zhengqi.arch@bytedance.com> MIME-Version: 1.0 X-Rspamd-Queue-Id: BBBB18004A X-Stat-Signature: 5yw8y8wmr338qxqipjctr7cg7swxgt6j Authentication-Results: imf02.hostedemail.com; dkim=pass header.d=bytedance-com.20210112.gappssmtp.com header.s=20210112 header.b=ARePECzU; dmarc=pass (policy=none) header.from=bytedance.com; spf=pass (imf02.hostedemail.com: domain of zhengqi.arch@bytedance.com designates 209.85.214.177 as permitted sender) smtp.mailfrom=zhengqi.arch@bytedance.com X-Rspam-User: X-Rspamd-Server: rspam08 X-HE-Tag: 1651239426-342439 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: The user PTE page table page may be freed when the last percpu_ref is dropped. So we need to try to get its percpu_ref before accessing the PTE page to prevent it from being freed during the access. This patch adds pte_tryget() and {__,}pte_put() to help us get and put the percpu_ref of user PTE page table pages.
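To illustrate the intended calling pattern, here is a minimal sketch (illustrative only; example_touch_pte() is a hypothetical caller, not something added by this patch):

/*
 * Hypothetical helper that pins a user PTE page table page around an
 * access, using the pte_tryget()/__pte_put() interfaces added here.
 * Note that at this point in the series pte_unmap() is still a plain
 * unmap; a later patch folds the put into pte_unmap() itself.
 */
static void example_touch_pte(struct mm_struct *mm, pmd_t *pmd,
			      unsigned long addr)
{
	pgtable_t page;
	pte_t *ptep;

	if (!pte_tryget(mm, pmd, addr))
		return;			/* no PTE page, or it is being freed */

	page = pmd_pgtable(*pmd);	/* the PTE page whose ref we now hold */
	ptep = pte_offset_map(pmd, addr);

	/* ... read or modify *ptep while the reference is held ... */

	pte_unmap(ptep);
	__pte_put(page);		/* drop the reference from pte_tryget() */
}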
Signed-off-by: Qi Zheng --- include/linux/pte_ref.h | 23 ++++++++++++++++ mm/pte_ref.c | 58 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 81 insertions(+) diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h index d3963a151ca5..bfe620038699 100644 --- a/include/linux/pte_ref.h +++ b/include/linux/pte_ref.h @@ -12,6 +12,10 @@ bool pte_ref_init(pgtable_t pte); void pte_ref_free(pgtable_t pte); +void free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr); +bool pte_tryget(struct mm_struct *mm, pmd_t *pmd, unsigned long addr); +void __pte_put(pgtable_t page); +void pte_put(pte_t *ptep); #else /* !CONFIG_FREE_USER_PTE */ @@ -24,6 +28,25 @@ static inline void pte_ref_free(pgtable_t pte) { } +static inline void free_user_pte(struct mm_struct *mm, pmd_t *pmd, + unsigned long addr) +{ +} + +static inline bool pte_tryget(struct mm_struct *mm, pmd_t *pmd, + unsigned long addr) +{ + return true; +} + +static inline void __pte_put(pgtable_t page) +{ +} + +static inline void pte_put(pte_t *ptep) +{ +} + #endif /* CONFIG_FREE_USER_PTE */ #endif /* _LINUX_PTE_REF_H */ diff --git a/mm/pte_ref.c b/mm/pte_ref.c index 52e31be00de4..5b382445561e 100644 --- a/mm/pte_ref.c +++ b/mm/pte_ref.c @@ -44,4 +44,62 @@ void pte_ref_free(pgtable_t pte) kfree(ref); } +void free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) {} + +/* + * pte_tryget - try to get the pte_ref of the user PTE page table page + * @mm: pointer the target address space + * @pmd: pointer to a PMD. + * @addr: virtual address associated with pmd. + * + * Return: true if getting the pte_ref succeeded. And false otherwise. + * + * Before accessing the user PTE page table, we need to hold a refcount to + * protect against the concurrent release of the PTE page table. + * But we will fail in the following case: + * - The content mapped in @pmd is not a PTE page + * - The pte_ref is zero, it may be reclaimed + */ +bool pte_tryget(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) +{ + bool retval = true; + pmd_t pmdval; + pgtable_t pte; + + rcu_read_lock(); + pmdval = READ_ONCE(*pmd); + pte = pmd_pgtable(pmdval); + if (unlikely(pmd_none(pmdval) || pmd_leaf(pmdval))) { + retval = false; + } else if (!percpu_ref_tryget(pte->pte_ref)) { + rcu_read_unlock(); + /* + * Also do free_user_pte() here to prevent missed reclaim due + * to race condition. 
+ */ + free_user_pte(mm, pmd, addr & PMD_MASK); + return false; + } + rcu_read_unlock(); + + return retval; +} + +void __pte_put(pgtable_t page) +{ + percpu_ref_put(page->pte_ref); +} + +void pte_put(pte_t *ptep) +{ + pgtable_t page; + + if (pte_huge(*ptep)) + return; + + page = pte_to_page(ptep); + __pte_put(page); +} +EXPORT_SYMBOL(pte_put); + #endif /* CONFIG_FREE_USER_PTE */ From patchwork Fri Apr 29 13:35:44 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qi Zheng X-Patchwork-Id: 12832015 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 28A6FC433EF for ; Fri, 29 Apr 2022 13:37:17 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id B7D876B0075; Fri, 29 Apr 2022 09:37:16 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id B2D556B007E; Fri, 29 Apr 2022 09:37:16 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 9CE1A6B0080; Fri, 29 Apr 2022 09:37:16 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (relay.hostedemail.com [64.99.140.26]) by kanga.kvack.org (Postfix) with ESMTP id 90E966B0075 for ; Fri, 29 Apr 2022 09:37:16 -0400 (EDT) Received: from smtpin03.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay13.hostedemail.com (Postfix) with ESMTP id 72DE960401 for ; Fri, 29 Apr 2022 13:37:16 +0000 (UTC) X-FDA: 79410018072.03.2814407 Received: from mail-pg1-f170.google.com (mail-pg1-f170.google.com [209.85.215.170]) by imf17.hostedemail.com (Postfix) with ESMTP id B7AD64002B for ; Fri, 29 Apr 2022 13:37:06 +0000 (UTC) Received: by mail-pg1-f170.google.com with SMTP id i62so6551362pgd.6 for ; Fri, 29 Apr 2022 06:37:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance-com.20210112.gappssmtp.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=Ti47oOTvJFCNXT5uBN/QcWBt/mztyaSPv2YCzaUIEks=; b=3g9d4G3AVSSQh+94A2C6QzQqKt1PRF8js91ZL0nJcPA3Z2pqrsS736awrXrhB5qPNL sUEJ1N/4NDvBYvnmd6ICC5/nmqG4YbNK2tVYSwte7tSPMcq+itTorKQlC5NV96S78vve 49KHlqBtvfS6ikyeLIE98STuW16ozIxTWc4DS49i0ZPiYyllxwnGRYZcxC6V30CKju9o CL9fjfO7XWQ7Yvc9MmOIn0mD/1tlgEEmvZfM2zZieTtyLlokUilgkaABmZykB1TcatHq pjHEGHSkjNa+P+/dvoKyd/UP7gMxeI4LAtJltzdrP0xKNx+P2UQhK6u7RqFWZSdZOLE4 X8LA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=Ti47oOTvJFCNXT5uBN/QcWBt/mztyaSPv2YCzaUIEks=; b=eWVqRNSYp0bXeEuAmPuCWSFrJ514ibKYMQ4deYi7b0QcXXcxF/UNYmEv1w1VYsl7hM Urs8YD8GSwqJmWQxRJE9bDRg4KeiraCJEhAGhdEB51ORL7kEpHLZ0ilv7iYafiX9MHHN T39u+f0/0hu+jf9+y2qb2zNN5UDm8g1ibOtFqxh+Z49FolwQK1MTegXR1qYr4CSGSnOQ 7+GTuWeBa9Uqt982iTsx5mZlf6j9ZOc1ZcphFX340G5tWoB6UYWj5/1a+ZZpRszcxxFa 1U0fM9MpQymRN1XLDQGincoXUq5FecB2drD6pLudAAX/3Zi6C8fTT4vLkr/TU3E9sN/S iawA== X-Gm-Message-State: AOAM533U/eo9e3ERIdTpv6i1LUbKhoP0jf9n8ACKAWubqIX2Ar1zCJoQ 7VEtH+//zxxCblhgRPOYuPBX/g== X-Google-Smtp-Source: ABdhPJxMIKA/P0RDHv9STrZjiOvbOSFfoH4N5N3pT1umtFqilIbIuxsCtayDjWtKg053GReSSR2iXA== X-Received: by 2002:a65:4006:0:b0:3aa:1cb6:e2f8 with SMTP id f6-20020a654006000000b003aa1cb6e2f8mr32435549pgp.274.1651239435027; Fri, 29 Apr 2022 06:37:15 -0700 (PDT) Received: from 
localhost.localdomain ([139.177.225.240]) by smtp.gmail.com with ESMTPSA id m8-20020a17090a414800b001d81a30c437sm10681977pjg.50.2022.04.29.06.37.09 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 29 Apr 2022 06:37:14 -0700 (PDT) From: Qi Zheng To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org, dennis@kernel.org, ming.lei@redhat.com Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng Subject: [RFC PATCH 10/18] mm: add pte_tryget_map{_lock}() helper Date: Fri, 29 Apr 2022 21:35:44 +0800 Message-Id: <20220429133552.33768-11-zhengqi.arch@bytedance.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com> References: <20220429133552.33768-1-zhengqi.arch@bytedance.com> MIME-Version: 1.0 X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: B7AD64002B Authentication-Results: imf17.hostedemail.com; dkim=pass header.d=bytedance-com.20210112.gappssmtp.com header.s=20210112 header.b=3g9d4G3A; spf=pass (imf17.hostedemail.com: domain of zhengqi.arch@bytedance.com designates 209.85.215.170 as permitted sender) smtp.mailfrom=zhengqi.arch@bytedance.com; dmarc=pass (policy=none) header.from=bytedance.com X-Rspam-User: X-Stat-Signature: y9fubddgzk5au4sghrcxpmcqi71pgq37 X-HE-Tag: 1651239426-131811 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Now, we usually use pte_offset_map{_lock}() to get the pte_t pointer before accessing the PTE page table page. After adding FREE_USER_PTE, we also need to call pte_tryget() before calling pte_offset_map{_lock}() in order to take a reference on the PTE page table page and prevent it from being freed during the access. This patch adds pte_tryget_map{_lock}() to help us do that. A return value of NULL indicates that we failed to get the percpu_ref: a concurrent thread is releasing this PTE page table page (or it has already been released). This case needs to be treated the same as pte_none().
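As a minimal sketch of the resulting calling pattern (illustrative only; example_pte_range() is a hypothetical pte-range walker, not part of this patch):

/*
 * Hypothetical walker: a NULL return from pte_tryget_map_lock() means
 * the PTE page table is gone or being freed, and is handled like an
 * empty (pte_none) page table.
 */
static int example_pte_range(struct mm_struct *mm, pmd_t *pmd,
			     unsigned long addr, unsigned long end)
{
	spinlock_t *ptl;
	pte_t *pte;

	pte = pte_tryget_map_lock(mm, pmd, addr, &ptl);
	if (!pte)			/* nothing mapped here, nothing to do */
		return 0;

	for (; addr != end; pte++, addr += PAGE_SIZE) {
		/* ... handle *pte ... */
	}

	pte_unmap_unlock(pte - 1, ptl);	/* also drops the pte_tryget() reference */
	return 0;
}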
Signed-off-by: Qi Zheng --- include/linux/pgtable.h | 37 +++++++++++++++++++++++++++++++++++-- 1 file changed, 35 insertions(+), 2 deletions(-) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index d1218cb1013e..6f205fee6348 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -228,6 +228,8 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud) return ptl; } +#include + #ifndef pte_offset_kernel static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address) { @@ -240,12 +242,38 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address) #define pte_offset_map(dir, address) \ ((pte_t *)kmap_atomic(pmd_page(*(dir))) + \ pte_index((address))) -#define pte_unmap(pte) kunmap_atomic((pte)) +#define __pte_unmap(pte) kunmap_atomic((pte)) #else #define pte_offset_map(dir, address) pte_offset_kernel((dir), (address)) -#define pte_unmap(pte) ((void)(pte)) /* NOP */ +#define __pte_unmap(pte) ((void)(pte)) /* NOP */ #endif +#define pte_tryget_map(mm, pmd, address) \ +({ \ + pte_t *__pte = NULL; \ + if (pte_tryget(mm, pmd, address)) \ + __pte = pte_offset_map(pmd, address); \ + __pte; \ +}) + +#define pte_unmap(pte) do { \ + pte_put(pte); \ + __pte_unmap(pte); \ +} while (0) + +#define pte_tryget_map_lock(mm, pmd, address, ptlp) \ +({ \ + spinlock_t *__ptl = NULL; \ + pte_t *__pte = NULL; \ + if (pte_tryget(mm, pmd, address)) { \ + __ptl = pte_lockptr(mm, pmd); \ + __pte = pte_offset_map(pmd, address); \ + *(ptlp) = __ptl; \ + spin_lock(__ptl); \ + } \ + __pte; \ +}) + #define pte_offset_map_lock(mm, pmd, address, ptlp) \ ({ \ spinlock_t *__ptl = pte_lockptr(mm, pmd); \ @@ -260,6 +288,11 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address) pte_unmap(pte); \ } while (0) +#define __pte_unmap_unlock(pte, ptl) do { \ + spin_unlock(ptl); \ + __pte_unmap(pte); \ +} while (0) + /* Find an entry in the second-level page table.. 
*/ #ifndef pmd_offset static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address) From patchwork Fri Apr 29 13:35:45 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qi Zheng X-Patchwork-Id: 12832016 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 29584C433F5 for ; Fri, 29 Apr 2022 13:37:23 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id B7D356B007E; Fri, 29 Apr 2022 09:37:22 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id ADEDF6B0080; Fri, 29 Apr 2022 09:37:22 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 8BC6C6B0081; Fri, 29 Apr 2022 09:37:22 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (relay.hostedemail.com [64.99.140.26]) by kanga.kvack.org (Postfix) with ESMTP id 7867F6B007E for ; Fri, 29 Apr 2022 09:37:22 -0400 (EDT) Received: from smtpin31.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay13.hostedemail.com (Postfix) with ESMTP id 56D14604F1 for ; Fri, 29 Apr 2022 13:37:22 +0000 (UTC) X-FDA: 79410018324.31.7ED56F9 Received: from mail-pl1-f181.google.com (mail-pl1-f181.google.com [209.85.214.181]) by imf20.hostedemail.com (Postfix) with ESMTP id 8AD231C0039 for ; Fri, 29 Apr 2022 13:37:17 +0000 (UTC) Received: by mail-pl1-f181.google.com with SMTP id s14so7159204plk.8 for ; Fri, 29 Apr 2022 06:37:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance-com.20210112.gappssmtp.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=vfiCmddC+7ZKj47uoVoXhmc2OB+ra/xXEY2BBYUTQ0I=; b=N8g8Y0NlYBcVwLLMJUrGJv7JOI9uCFz5c4LZHV8Ana9n4lqzBs5+xsmPmVEYIbfKsg odMCYNM1+2iRANbOVyDsE4CxFIDC6tfN5Q4KRP+UGYgc76Av12gM0j7qlThiAfAuXs3h /VAXQKUdsKtp70HAo+Dtez0N9tQfzuOKJM24kBsHm3sFWq42S4JPVlh0rLXuWPylEvW4 ZcJ8Qm3Rj10ZVraesw1ftsomuWK7lNtzefDcu8/9mp82Y3WihTOadO9/7h8Zf7lNnun/ /XVssHvGBkLS5je51Whqn3VO6awKzUQbPc9s9/5Mjp//R1lTImRXCyRGVWRCQ1a1s3qH Zk7g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=vfiCmddC+7ZKj47uoVoXhmc2OB+ra/xXEY2BBYUTQ0I=; b=UhT7RPfOOdmQfu5efBdyvaPg9kC8c9qLDzAGFDpQkWXQi3vahADHRdkDrhgicomgoQ +oT/3zNenGGn6l8NJyMHqzP4FGogZ2Rq79yahj1QZ0ZX9uFi+NfElzK988XJVop3Ulr0 538IMHdioGWkHqHVJQSnIrlkJUT9/zmM++yYVcSimNh7iRfqP9fWu61U0IU5tSsxqxsX GgyG85RMOi6E9O6LpJywLCvK6QkSRBwMa2IFxgJI6mqeF6hUCb0Jorz+F323pUJ0Ea9Z rLiryBVJTz3EIYAG9YZ4R4hKpjoO3HFjmxNPdy7xmOqN8fLGskPP3+TkwfuHloj31WeB C9Ow== X-Gm-Message-State: AOAM532wOL8+VXoq7SWBs5itvWebcYqFzm6ALOZtKDZj3v3CGlYhOi1z XqiEBL3zIBzKbVnG7qoU+atM/w== X-Google-Smtp-Source: ABdhPJxU50DpM9zXS10Ztlg3jVeqDyyHMnBwcSuzKIki0feYMT5Nf8Fd0pD18GCFTDesQcItxVzKcA== X-Received: by 2002:a17:902:e5cd:b0:15d:57c7:b9fb with SMTP id u13-20020a170902e5cd00b0015d57c7b9fbmr12006973plf.101.1651239440662; Fri, 29 Apr 2022 06:37:20 -0700 (PDT) Received: from localhost.localdomain ([139.177.225.240]) by smtp.gmail.com with ESMTPSA id m8-20020a17090a414800b001d81a30c437sm10681977pjg.50.2022.04.29.06.37.15 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 29 Apr 2022 06:37:20 -0700 (PDT) From: Qi Zheng To: akpm@linux-foundation.org, 
tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org, dennis@kernel.org, ming.lei@redhat.com Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng Subject: [RFC PATCH 11/18] mm: convert to use pte_tryget_map_lock() Date: Fri, 29 Apr 2022 21:35:45 +0800 Message-Id: <20220429133552.33768-12-zhengqi.arch@bytedance.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com> References: <20220429133552.33768-1-zhengqi.arch@bytedance.com> MIME-Version: 1.0 X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: 8AD231C0039 Authentication-Results: imf20.hostedemail.com; dkim=pass header.d=bytedance-com.20210112.gappssmtp.com header.s=20210112 header.b=N8g8Y0Nl; spf=pass (imf20.hostedemail.com: domain of zhengqi.arch@bytedance.com designates 209.85.214.181 as permitted sender) smtp.mailfrom=zhengqi.arch@bytedance.com; dmarc=pass (policy=none) header.from=bytedance.com X-Rspam-User: X-Stat-Signature: swp3tbnz7bg9ehuqmz1b5oao9fqujebx X-HE-Tag: 1651239437-420430 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Use pte_tryget_map_lock() to try to get the refcount of the PTE page table page we want to access, which prevents the page from being freed during the access. For the following cases, the PTE page table page is stable:

- the refcount of the PTE page table page is already held
- there are no concurrent threads (e.g. the write lock of mmap_lock is acquired)
- the PTE page table page is not yet visible
- local CPU interrupts are disabled or the RCU read lock is held (e.g. the GUP fast path)
- the PTE page table page is a kernel PTE page table page

In those cases we keep using pte_offset_map_lock() and replace pte_unmap_unlock() with __pte_unmap_unlock(), which does not drop the refcount. Signed-off-by: Qi Zheng --- fs/proc/task_mmu.c | 16 ++++-- include/linux/mm.h | 2 +- mm/damon/vaddr.c | 30 ++++++---- mm/debug_vm_pgtable.c | 2 +- mm/filemap.c | 4 +- mm/gup.c | 4 +- mm/khugepaged.c | 10 +++- mm/ksm.c | 4 +- mm/madvise.c | 30 +++++++--- mm/memcontrol.c | 8 ++- mm/memory-failure.c | 4 +- mm/memory.c | 125 +++++++++++++++++++++++++++++------------- mm/mempolicy.c | 4 +- mm/migrate_device.c | 22 +++++--- mm/mincore.c | 5 +- mm/mlock.c | 5 +- mm/mprotect.c | 4 +- mm/mremap.c | 5 +- mm/pagewalk.c | 4 +- mm/swapfile.c | 13 +++-- mm/userfaultfd.c | 11 +++- 21 files changed, 219 insertions(+), 93 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index f46060eb91b5..5fff96659e4f 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -625,7 +625,9 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, * keeps khugepaged out of here and from collapsing things * in here.
*/ - pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl); + if (!pte) + goto out; for (; addr != end; pte++, addr += PAGE_SIZE) smaps_pte_entry(pte, addr, walk); pte_unmap_unlock(pte - 1, ptl); @@ -1178,7 +1180,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, if (pmd_trans_unstable(pmd)) return 0; - pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl); + if (!pte) + return 0; for (; addr != end; pte++, addr += PAGE_SIZE) { ptent = *pte; @@ -1515,7 +1519,9 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, * We can assume that @vma always points to a valid one and @end never * goes beyond vma->vm_end. */ - orig_pte = pte = pte_offset_map_lock(walk->mm, pmdp, addr, &ptl); + orig_pte = pte = pte_tryget_map_lock(walk->mm, pmdp, addr, &ptl); + if (!pte) + return 0; for (; addr < end; pte++, addr += PAGE_SIZE) { pagemap_entry_t pme; @@ -1849,7 +1855,9 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr, if (pmd_trans_unstable(pmd)) return 0; #endif - orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); + orig_pte = pte = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl); + if (!pte) + return 0; do { struct page *page = can_gather_numa_stats(*pte, vma, addr); if (!page) diff --git a/include/linux/mm.h b/include/linux/mm.h index 1a6bc79c351b..04f7a6c36dc7 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2288,7 +2288,7 @@ static inline void pgtable_pte_page_dtor(struct page *page) #define pte_alloc_map_lock(mm, pmd, address, ptlp) \ (pte_alloc(mm, pmd) ? \ - NULL : pte_offset_map_lock(mm, pmd, address, ptlp)) + NULL : pte_tryget_map_lock(mm, pmd, address, ptlp)) #define pte_alloc_kernel(pmd, address) \ ((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? 
\ diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c index b2ec0aa1ff45..4aa9e252c081 100644 --- a/mm/damon/vaddr.c +++ b/mm/damon/vaddr.c @@ -372,10 +372,13 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr, { pte_t *pte; spinlock_t *ptl; + pmd_t pmdval; - if (pmd_huge(*pmd)) { +retry: + pmdval = READ_ONCE(*pmd); + if (pmd_huge(pmdval)) { ptl = pmd_lock(walk->mm, pmd); - if (pmd_huge(*pmd)) { + if (pmd_huge(pmdval)) { damon_pmdp_mkold(pmd, walk->mm, addr); spin_unlock(ptl); return 0; @@ -383,9 +386,11 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr, spin_unlock(ptl); } - if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd))) + if (pmd_none(pmdval) || unlikely(pmd_bad(pmdval))) return 0; - pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); + pte = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl); + if (!pte) + goto retry; if (!pte_present(*pte)) goto out; damon_ptep_mkold(pte, walk->mm, addr); @@ -499,18 +504,21 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr, spinlock_t *ptl; struct page *page; struct damon_young_walk_private *priv = walk->private; + pmd_t pmdval; +retry: + pmdval = READ_ONCE(*pmd); #ifdef CONFIG_TRANSPARENT_HUGEPAGE - if (pmd_huge(*pmd)) { + if (pmd_huge(pmdval)) { ptl = pmd_lock(walk->mm, pmd); - if (!pmd_huge(*pmd)) { + if (!pmd_huge(pmdval)) { spin_unlock(ptl); goto regular_page; } - page = damon_get_page(pmd_pfn(*pmd)); + page = damon_get_page(pmd_pfn(pmdval)); if (!page) goto huge_out; - if (pmd_young(*pmd) || !page_is_idle(page) || + if (pmd_young(pmdval) || !page_is_idle(page) || mmu_notifier_test_young(walk->mm, addr)) { *priv->page_sz = ((1UL) << HPAGE_PMD_SHIFT); @@ -525,9 +533,11 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr, regular_page: #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ - if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd))) + if (pmd_none(pmdval) || unlikely(pmd_bad(pmdval))) return -EINVAL; - pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); + pte = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl); + if (!pte) + goto retry; if (!pte_present(*pte)) goto out; page = damon_get_page(pte_pfn(*pte)); diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index db2abd9e415b..91c4400ca13c 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -1303,7 +1303,7 @@ static int __init debug_vm_pgtable(void) * proper page table lock. 
*/ - args.ptep = pte_offset_map_lock(args.mm, args.pmdp, args.vaddr, &ptl); + args.ptep = pte_tryget_map_lock(args.mm, args.pmdp, args.vaddr, &ptl); pte_clear_tests(&args); pte_advanced_tests(&args); pte_unmap_unlock(args.ptep, ptl); diff --git a/mm/filemap.c b/mm/filemap.c index 3a5ffb5587cd..fc156922147b 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -3368,7 +3368,9 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, } addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT); - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); + vmf->pte = pte_tryget_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); + if (!vmf->pte) + goto out; do { again: page = folio_file_page(folio, xas.xa_index); diff --git a/mm/gup.c b/mm/gup.c index f598a037eb04..d2c24181fb04 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -451,7 +451,9 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, if (unlikely(pmd_bad(*pmd))) return no_page_table(vma, flags); - ptep = pte_offset_map_lock(mm, pmd, address, &ptl); + ptep = pte_tryget_map_lock(mm, pmd, address, &ptl); + if (!ptep) + return no_page_table(vma, flags); pte = *ptep; if (!pte_present(pte)) { swp_entry_t entry; diff --git a/mm/khugepaged.c b/mm/khugepaged.c index a4e5eaf3eb01..3776cc315294 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1227,7 +1227,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, } memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load)); - pte = pte_offset_map_lock(mm, pmd, address, &ptl); + pte = pte_tryget_map_lock(mm, pmd, address, &ptl); + if (!pte) { + result = SCAN_PMD_NULL; + goto out; + } for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++, _address += PAGE_SIZE) { pte_t pteval = *_pte; @@ -1505,7 +1509,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) page_remove_rmap(page, vma, false); } - pte_unmap_unlock(start_pte, ptl); + __pte_unmap_unlock(start_pte, ptl); /* step 3: set proper refcount and mm_counters. 
*/ if (count) { @@ -1521,7 +1525,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) return; abort: - pte_unmap_unlock(start_pte, ptl); + __pte_unmap_unlock(start_pte, ptl); goto drop_hpage; } diff --git a/mm/ksm.c b/mm/ksm.c index 063a48eeb5ee..64a5f965cfc5 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1138,7 +1138,9 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, addr + PAGE_SIZE); mmu_notifier_invalidate_range_start(&range); - ptep = pte_offset_map_lock(mm, pmd, addr, &ptl); + ptep = pte_tryget_map_lock(mm, pmd, addr, &ptl); + if (!ptep) + goto out_mn; if (!pte_same(*ptep, orig_pte)) { pte_unmap_unlock(ptep, ptl); goto out_mn; diff --git a/mm/madvise.c b/mm/madvise.c index 1873616a37d2..8123397f14c8 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -207,7 +207,9 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start, struct page *page; spinlock_t *ptl; - orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl); + orig_pte = pte_tryget_map_lock(vma->vm_mm, pmd, start, &ptl); + if (!orig_pte) + break; pte = *(orig_pte + ((index - start) / PAGE_SIZE)); pte_unmap_unlock(orig_pte, ptl); @@ -400,7 +402,9 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, return 0; #endif tlb_change_page_size(tlb, PAGE_SIZE); - orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + orig_pte = pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl); + if (!orig_pte) + return 0; flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); for (; addr < end; pte++, addr += PAGE_SIZE) { @@ -432,12 +436,14 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, if (split_huge_page(page)) { unlock_page(page); put_page(page); - pte_offset_map_lock(mm, pmd, addr, &ptl); + orig_pte = pte = pte_tryget_map_lock(mm, pmd, addr, &ptl); break; } unlock_page(page); put_page(page); - pte = pte_offset_map_lock(mm, pmd, addr, &ptl); + orig_pte = pte = pte_tryget_map_lock(mm, pmd, addr, &ptl); + if (!pte) + break; pte--; addr -= PAGE_SIZE; continue; @@ -477,7 +483,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, } arch_leave_lazy_mmu_mode(); - pte_unmap_unlock(orig_pte, ptl); + if (orig_pte) + pte_unmap_unlock(orig_pte, ptl); if (pageout) reclaim_pages(&page_list); cond_resched(); @@ -602,7 +609,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, return 0; tlb_change_page_size(tlb, PAGE_SIZE); - orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl); + orig_pte = pte = pte_tryget_map_lock(mm, pmd, addr, &ptl); + if (!orig_pte) + return 0; flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); for (; addr != end; pte++, addr += PAGE_SIZE) { @@ -648,12 +657,14 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, if (split_huge_page(page)) { unlock_page(page); put_page(page); - pte_offset_map_lock(mm, pmd, addr, &ptl); + orig_pte = pte = pte_tryget_map_lock(mm, pmd, addr, &ptl); goto out; } unlock_page(page); put_page(page); - pte = pte_offset_map_lock(mm, pmd, addr, &ptl); + orig_pte = pte = pte_tryget_map_lock(mm, pmd, addr, &ptl); + if (!pte) + goto out; pte--; addr -= PAGE_SIZE; continue; @@ -707,7 +718,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, add_mm_counter(mm, MM_SWAPENTS, nr_swap); } arch_leave_lazy_mmu_mode(); - pte_unmap_unlock(orig_pte, ptl); + if (orig_pte) + pte_unmap_unlock(orig_pte, ptl); cond_resched(); next: return 0; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 725f76723220..ad51ec9043b7 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ 
-5736,7 +5736,9 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd, if (pmd_trans_unstable(pmd)) return 0; - pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl); + if (!pte) + return 0; for (; addr != end; pte++, addr += PAGE_SIZE) if (get_mctgt_type(vma, addr, *pte, NULL)) mc.precharge++; /* increment precharge temporarily */ @@ -5955,7 +5957,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, if (pmd_trans_unstable(pmd)) return 0; retry: - pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl); + if (!pte) + return 0; for (; addr != end; addr += PAGE_SIZE) { pte_t ptent = *(pte++); bool device = false; diff --git a/mm/memory-failure.c b/mm/memory-failure.c index dcb6bb9cf731..5247932df3fa 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -637,8 +637,10 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr, if (pmd_trans_unstable(pmdp)) goto out; - mapped_pte = ptep = pte_offset_map_lock(walk->vma->vm_mm, pmdp, + mapped_pte = ptep = pte_tryget_map_lock(walk->vma->vm_mm, pmdp, addr, &ptl); + if (!mapped_pte) + goto out; for (; addr != end; ptep++, addr += PAGE_SIZE) { ret = check_hwpoisoned_entry(*ptep, addr, PAGE_SHIFT, hwp->pfn, &hwp->tk); diff --git a/mm/memory.c b/mm/memory.c index 76e3af9639d9..ca03006b32cb 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1352,7 +1352,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, tlb_change_page_size(tlb, PAGE_SIZE); again: init_rss_vec(rss); - start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl); + start_pte = pte_tryget_map_lock(mm, pmd, addr, &ptl); + if (!start_pte) + return end; pte = start_pte; flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); @@ -1846,7 +1848,9 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr, int pte_idx = 0; const int batch_size = min_t(int, pages_to_write_in_pmd, 8); - start_pte = pte_offset_map_lock(mm, pmd, addr, &pte_lock); + start_pte = pte_tryget_map_lock(mm, pmd, addr, &pte_lock); + if (!start_pte) + break; for (pte = start_pte; pte_idx < batch_size; ++pte, ++pte_idx) { int err = insert_page_in_batch_locked(vma, pte, addr, pages[curr_page_idx], prot); @@ -2532,9 +2536,13 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd, if (!pte) return -ENOMEM; } else { - mapped_pte = pte = (mm == &init_mm) ? 
- pte_offset_kernel(pmd, addr) : - pte_offset_map_lock(mm, pmd, addr, &ptl); + if (mm == &init_mm) { + mapped_pte = pte = pte_offset_kernel(pmd, addr); + } else { + mapped_pte = pte = pte_tryget_map_lock(mm, pmd, addr, &ptl); + if (!mapped_pte) + return err; + } } BUG_ON(pmd_huge(*pmd)); @@ -2787,7 +2795,11 @@ static inline bool cow_user_page(struct page *dst, struct page *src, if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) { pte_t entry; - vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl); + vmf->pte = pte_tryget_map_lock(mm, vmf->pmd, addr, &vmf->ptl); + if (!vmf->pte) { + ret = false; + goto pte_unlock; + } locked = true; if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) { /* @@ -2815,7 +2827,11 @@ static inline bool cow_user_page(struct page *dst, struct page *src, goto warn; /* Re-validate under PTL if the page is still mapped */ - vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl); + vmf->pte = pte_tryget_map_lock(mm, vmf->pmd, addr, &vmf->ptl); + if (!vmf->pte) { + ret = false; + goto pte_unlock; + } locked = true; if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) { /* The PTE changed under us, update local tlb */ @@ -3005,6 +3021,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) pte_t entry; int page_copied = 0; struct mmu_notifier_range range; + vm_fault_t ret = VM_FAULT_OOM; if (unlikely(anon_vma_prepare(vma))) goto oom; @@ -3048,7 +3065,12 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) /* * Re-check the pte - we dropped the lock */ - vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl); + vmf->pte = pte_tryget_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl); + if (!vmf->pte) { + mmu_notifier_invalidate_range_only_end(&range); + ret = VM_FAULT_RETRY; + goto uncharge; + } if (likely(pte_same(*vmf->pte, vmf->orig_pte))) { if (old_page) { if (!PageAnon(old_page)) { @@ -3129,12 +3151,14 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) put_page(old_page); } return page_copied ? VM_FAULT_WRITE : 0; +uncharge: + mem_cgroup_uncharge(page_folio(new_page)); oom_free_new: put_page(new_page); oom: if (old_page) put_page(old_page); - return VM_FAULT_OOM; + return ret; } /** @@ -3156,8 +3180,10 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf) { WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED)); - vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, + vmf->pte = pte_tryget_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); + if (!vmf->pte) + return VM_FAULT_NOPAGE; /* * We might have raced with another page fault while we released the * pte_offset_map_lock. 
@@ -3469,6 +3495,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf) struct page *page = vmf->page; struct vm_area_struct *vma = vmf->vma; struct mmu_notifier_range range; + vm_fault_t ret = 0; if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) return VM_FAULT_RETRY; @@ -3477,16 +3504,21 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf) (vmf->address & PAGE_MASK) + PAGE_SIZE, NULL); mmu_notifier_invalidate_range_start(&range); - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, + vmf->pte = pte_tryget_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); + if (!vmf->pte) { + ret = VM_FAULT_RETRY; + goto out; + } if (likely(pte_same(*vmf->pte, vmf->orig_pte))) restore_exclusive_pte(vma, page, vmf->address, vmf->pte); pte_unmap_unlock(vmf->pte, vmf->ptl); +out: unlock_page(page); mmu_notifier_invalidate_range_end(&range); - return 0; + return ret; } static inline bool should_try_to_free_swap(struct page *page, @@ -3599,8 +3631,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) * Back out if somebody else faulted in this pte * while we released the pte lock. */ - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->pte = pte_tryget_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); + if (!vmf->pte) { + ret = VM_FAULT_OOM; + goto out; + } if (likely(pte_same(*vmf->pte, vmf->orig_pte))) ret = VM_FAULT_OOM; goto unlock; @@ -3666,8 +3702,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) /* * Back out if somebody else already faulted in this pte. */ - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, + vmf->pte = pte_tryget_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); + if (!vmf->pte) + goto out_page; if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) goto out_nomap; @@ -3781,6 +3819,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) if (vma->vm_flags & VM_SHARED) return VM_FAULT_SIGBUS; +retry: /* * Use pte_alloc() instead of pte_alloc_map(). 
We can't run * pte_offset_map() on pmds where a huge pmd might be created @@ -3803,8 +3842,10 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) !mm_forbids_zeropage(vma->vm_mm)) { entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address), vma->vm_page_prot)); - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->pte = pte_tryget_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); + if (!vmf->pte) + goto retry; if (!pte_none(*vmf->pte)) { update_mmu_tlb(vma, vmf->address, vmf->pte); goto unlock; @@ -3843,8 +3884,10 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) if (vma->vm_flags & VM_WRITE) entry = pte_mkwrite(pte_mkdirty(entry)); - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, + vmf->pte = pte_tryget_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); + if (!vmf->pte) + goto uncharge; if (!pte_none(*vmf->pte)) { update_mmu_cache(vma, vmf->address, vmf->pte); goto release; @@ -3875,6 +3918,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) release: put_page(page); goto unlock; +uncharge: + mem_cgroup_uncharge(page_folio(page)); oom_free_page: put_page(page); oom: @@ -4112,8 +4157,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf) if (pmd_devmap_trans_unstable(vmf->pmd)) return 0; - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->pte = pte_tryget_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); + if (!vmf->pte) + return 0; ret = 0; /* Re-check under ptl */ if (likely(pte_none(*vmf->pte))) @@ -4340,31 +4387,27 @@ static vm_fault_t do_fault(struct vm_fault *vmf) * The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */ if (!vma->vm_ops->fault) { - /* - * If we find a migration pmd entry or a none pmd entry, which - * should never happen, return SIGBUS - */ - if (unlikely(!pmd_present(*vmf->pmd))) - ret = VM_FAULT_SIGBUS; - else { - vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, + vmf->pte = pte_tryget_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); - /* - * Make sure this is not a temporary clearing of pte - * by holding ptl and checking again. A R/M/W update - * of pte involves: take ptl, clearing the pte so that - * we don't have concurrent modification by hardware - * followed by an update. - */ - if (unlikely(pte_none(*vmf->pte))) - ret = VM_FAULT_SIGBUS; - else - ret = VM_FAULT_NOPAGE; - - pte_unmap_unlock(vmf->pte, vmf->ptl); + if (!vmf->pte) { + ret = VM_FAULT_RETRY; + goto out; } + /* + * Make sure this is not a temporary clearing of pte + * by holding ptl and checking again. A R/M/W update + * of pte involves: take ptl, clearing the pte so that + * we don't have concurrent modification by hardware + * followed by an update. 
+ */ + if (unlikely(pte_none(*vmf->pte))) + ret = VM_FAULT_SIGBUS; + else + ret = VM_FAULT_NOPAGE; + + pte_unmap_unlock(vmf->pte, vmf->ptl); } else if (!(vmf->flags & FAULT_FLAG_WRITE)) ret = do_read_fault(vmf); else if (!(vma->vm_flags & VM_SHARED)) @@ -4372,6 +4415,7 @@ static vm_fault_t do_fault(struct vm_fault *vmf) else ret = do_shared_fault(vmf); +out: /* preallocated pagetable is unused: free it */ if (vmf->prealloc_pte) { pte_free(vm_mm, vmf->prealloc_pte); @@ -5003,13 +5047,16 @@ int follow_invalidate_pte(struct mm_struct *mm, unsigned long address, (address & PAGE_MASK) + PAGE_SIZE); mmu_notifier_invalidate_range_start(range); } - ptep = pte_offset_map_lock(mm, pmd, address, ptlp); + ptep = pte_tryget_map_lock(mm, pmd, address, ptlp); + if (!ptep) + goto invalid; if (!pte_present(*ptep)) goto unlock; *ptepp = ptep; return 0; unlock: pte_unmap_unlock(ptep, *ptlp); +invalid: if (range) mmu_notifier_invalidate_range_end(range); out: diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 8c74107a2b15..a846666c64c3 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -523,7 +523,9 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr, if (pmd_trans_unstable(pmd)) return 0; - mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); + mapped_pte = pte = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl); + if (!mapped_pte) + return 0; for (; addr != end; pte++, addr += PAGE_SIZE) { if (!pte_present(*pte)) continue; diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 70c7dc05bbfc..260471f37470 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -64,21 +64,23 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, unsigned long addr = start, unmapped = 0; spinlock_t *ptl; pte_t *ptep; + pmd_t pmdval; again: - if (pmd_none(*pmdp)) + pmdval = READ_ONCE(*pmdp); + if (pmd_none(pmdval)) return migrate_vma_collect_hole(start, end, -1, walk); - if (pmd_trans_huge(*pmdp)) { + if (pmd_trans_huge(pmdval)) { struct page *page; ptl = pmd_lock(mm, pmdp); - if (unlikely(!pmd_trans_huge(*pmdp))) { + if (unlikely(!pmd_trans_huge(pmdval))) { spin_unlock(ptl); goto again; } - page = pmd_page(*pmdp); + page = pmd_page(pmdval); if (is_huge_zero_page(page)) { spin_unlock(ptl); split_huge_pmd(vma, pmdp, addr); @@ -99,16 +101,18 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, if (ret) return migrate_vma_collect_skip(start, end, walk); - if (pmd_none(*pmdp)) + if (pmd_none(pmdval)) return migrate_vma_collect_hole(start, end, -1, walk); } } - if (unlikely(pmd_bad(*pmdp))) + if (unlikely(pmd_bad(pmdval))) return migrate_vma_collect_skip(start, end, walk); - ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl); + ptep = pte_tryget_map_lock(mm, pmdp, addr, &ptl); + if (!ptep) + goto again; arch_enter_lazy_mmu_mode(); for (; addr < end; addr += PAGE_SIZE, ptep++) { @@ -588,7 +592,9 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, entry = pte_mkwrite(pte_mkdirty(entry)); } - ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl); + ptep = pte_tryget_map_lock(mm, pmdp, addr, &ptl); + if (!ptep) + goto abort; if (check_stable_address_space(mm)) goto unlock_abort; diff --git a/mm/mincore.c b/mm/mincore.c index 9122676b54d6..337f8a45ded0 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -105,6 +105,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, unsigned char *vec = walk->private; int nr = (end - addr) >> PAGE_SHIFT; +again: ptl = pmd_trans_huge_lock(pmd, vma); if (ptl) { memset(vec, 1, nr); @@ -117,7 +118,9 @@ static int mincore_pte_range(pmd_t 
*pmd, unsigned long addr, unsigned long end, goto out; } - ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); + ptep = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl); + if (!ptep) + goto again; for (; addr != end; ptep++, addr += PAGE_SIZE) { pte_t pte = *ptep; diff --git a/mm/mlock.c b/mm/mlock.c index 716caf851043..89f7de636efc 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -314,6 +314,7 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr, pte_t *start_pte, *pte; struct page *page; +again: ptl = pmd_trans_huge_lock(pmd, vma); if (ptl) { if (!pmd_present(*pmd)) @@ -328,7 +329,9 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr, goto out; } - start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + start_pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl); + if (!start_pte) + goto again; for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) { if (!pte_present(*pte)) continue; diff --git a/mm/mprotect.c b/mm/mprotect.c index b69ce7a7b2b7..aa09cd34ea30 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -63,7 +63,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, * from under us even if the mmap_lock is only hold for * reading. */ - pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl); + if (!pte) + return 0; /* Get target node for single threaded private VMAs */ if (prot_numa && !(vma->vm_flags & VM_SHARED) && diff --git a/mm/mremap.c b/mm/mremap.c index 303d3290b938..d5ea5ce8a22a 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -167,7 +167,9 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, * We don't have to worry about the ordering of src and dst * pte locks because exclusive mmap_lock prevents deadlock. */ - old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl); + old_pte = pte_tryget_map_lock(mm, old_pmd, old_addr, &old_ptl); + if (!old_pte) + goto drop_lock; new_pte = pte_offset_map(new_pmd, new_addr); new_ptl = pte_lockptr(mm, new_pmd); if (new_ptl != old_ptl) @@ -206,6 +208,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, spin_unlock(new_ptl); pte_unmap(new_pte - 1); pte_unmap_unlock(old_pte - 1, old_ptl); +drop_lock: if (need_rmap_locks) drop_rmap_locks(vma); } diff --git a/mm/pagewalk.c b/mm/pagewalk.c index 9b3db11a4d1d..264b717e24ef 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -50,7 +50,9 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, err = walk_pte_range_inner(pte, addr, end, walk); pte_unmap(pte); } else { - pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); + pte = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl); + if (!pte) + return end; err = walk_pte_range_inner(pte, addr, end, walk); pte_unmap_unlock(pte, ptl); } diff --git a/mm/swapfile.c b/mm/swapfile.c index 63c61f8b2611..710fbeec9e58 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1790,10 +1790,14 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, if (unlikely(!page)) return -ENOMEM; - pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl); + if (!pte) { + ret = -EAGAIN; + goto out; + } if (unlikely(!pte_same_as_swp(*pte, swp_entry_to_pte(entry)))) { ret = 0; - goto out; + goto unlock; } dec_mm_counter(vma->vm_mm, MM_SWAPENTS); @@ -1808,8 +1812,9 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, set_pte_at(vma->vm_mm, addr, pte, pte_mkold(mk_pte(page, vma->vm_page_prot))); swap_free(entry); -out: +unlock: 
pte_unmap_unlock(pte, ptl); +out: if (page != swapcache) { unlock_page(page); put_page(page); @@ -1897,7 +1902,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud, if (pmd_none_or_trans_huge_or_clear_bad(pmd)) continue; ret = unuse_pte_range(vma, pmd, addr, next, type); - if (ret) + if (ret && ret != -EAGAIN) return ret; } while (pmd++, addr = next, addr != end); return 0; diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 0cb8e5ef1713..c1bce9cf5657 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -79,7 +79,9 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, _dst_pte = pte_mkwrite(_dst_pte); } - dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl); + dst_pte = pte_tryget_map_lock(dst_mm, dst_pmd, dst_addr, &ptl); + if (!dst_pte) + return -EAGAIN; if (vma_is_shmem(dst_vma)) { /* serialize against truncate with the page table lock */ @@ -194,7 +196,9 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm, _dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr), dst_vma->vm_page_prot)); - dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl); + dst_pte = pte_tryget_map_lock(dst_mm, dst_pmd, dst_addr, &ptl); + if (!dst_pte) + return -EAGAIN; if (dst_vma->vm_file) { /* the shmem MAP_PRIVATE case requires checking the i_size */ inode = dst_vma->vm_file->f_inode; @@ -587,6 +591,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, break; } +again: dst_pmdval = pmd_read_atomic(dst_pmd); /* * If the dst_pmd is mapped as THP don't @@ -612,6 +617,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, src_addr, &page, mcopy_mode, wp_copy); + if (err == -EAGAIN) + goto again; cond_resched(); if (unlikely(err == -ENOENT)) { From patchwork Fri Apr 29 13:35:46 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qi Zheng X-Patchwork-Id: 12832017 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 91907C433EF for ; Fri, 29 Apr 2022 13:37:28 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 3185B6B0080; Fri, 29 Apr 2022 09:37:28 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 2A0F86B0081; Fri, 29 Apr 2022 09:37:28 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 0CDCD6B0082; Fri, 29 Apr 2022 09:37:28 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (relay.a.hostedemail.com [64.99.140.24]) by kanga.kvack.org (Postfix) with ESMTP id EC1876B0080 for ; Fri, 29 Apr 2022 09:37:27 -0400 (EDT) Received: from smtpin31.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay12.hostedemail.com (Postfix) with ESMTP id C1FD8120679 for ; Fri, 29 Apr 2022 13:37:27 +0000 (UTC) X-FDA: 79410018534.31.08CA205 Received: from mail-pg1-f182.google.com (mail-pg1-f182.google.com [209.85.215.182]) by imf03.hostedemail.com (Postfix) with ESMTP id 0C2192005E for ; Fri, 29 Apr 2022 13:37:22 +0000 (UTC) Received: by mail-pg1-f182.google.com with SMTP id v10so6531007pgl.11 for ; Fri, 29 Apr 2022 06:37:27 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance-com.20210112.gappssmtp.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references 
:mime-version:content-transfer-encoding; bh=dRcr9R3ZgGJ3jIPCrJCelPkeBao75RZpxc/JH0Fqwmk=; b=On/bg069CRwx7Wor8888fyLbdp/Gc2rCyWaQOJucO5HhZjg4SG7Y7E+0wvaS2KVGCr rYVMsY9Pcx4ZkQBTX5xuyYpXhUf4C6OuINok1WrDesCw8g1UcjrCtHFP9Uc4bVMEZxjt OIcwN7j795Kjz5t35Z0xUEpDbRAji8lwd5XFDeha1a52CKhYwwVvCiV/cG99uz9myna2 P7rQ2e+2Kp2FbsCe8MmnE/QTIhVGwjnukQzyLdsopG1ty8vFAwWvsuTtbmFag60JVvpG kVfMWnwMlMZZEg4iQa0V1QUkR1j9mj3tvwjXdAGJmtu46/ZcKjb8ktKLMIi02Z/qhfWt /19A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=dRcr9R3ZgGJ3jIPCrJCelPkeBao75RZpxc/JH0Fqwmk=; b=Bnk2NUeB4po5sq/jsc3X2bOYB2M//su+Ve6CB9ivOnAMo15MPT4vrEc910zCUzOJ0z L34U5qCTaQCkBv1CnK6Lcc69ozZcH+ykOhhLQJTlmwS7QVCu3p3NpUTFGJZh+BzQX7F4 nBhP2Keq5Jas37p2BTJ3htdOy/AB8gDQVwAnvf/ZTuabVpFfm5HJszYOeGF9bPwdNOr2 eoVs6xLsOIYtyeAcKWdn9M7T6nZNE3/ffHUiUSjPhgtiWc1WcVT61WAOcSZkd9eLkmSg 22UKJKCrMTB2jpdVDZt+F1t/scLoldnvBh17K+8O44XHIpcKX889Old4MWqqjYPjPneu EsVQ== X-Gm-Message-State: AOAM530UTcA4LuuFbItyA4ILOJG6IHVFlpHEazeozCYB9yill9GReRWp yU8VQXC2g3Zjg0yHdPs+llDJoA== X-Google-Smtp-Source: ABdhPJyRLve9t7CiZZ1B6242P2dY8VW+CeFhX8q+DbR17XJUJmfD+8d8bMjGyknZbnYW2wjWl9d0Kw== X-Received: by 2002:a05:6a00:140c:b0:4e1:530c:edc0 with SMTP id l12-20020a056a00140c00b004e1530cedc0mr39851295pfu.18.1651239446323; Fri, 29 Apr 2022 06:37:26 -0700 (PDT) Received: from localhost.localdomain ([139.177.225.240]) by smtp.gmail.com with ESMTPSA id m8-20020a17090a414800b001d81a30c437sm10681977pjg.50.2022.04.29.06.37.21 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 29 Apr 2022 06:37:25 -0700 (PDT) From: Qi Zheng To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org, dennis@kernel.org, ming.lei@redhat.com Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng Subject: [RFC PATCH 12/18] mm: convert to use pte_tryget_map() Date: Fri, 29 Apr 2022 21:35:46 +0800 Message-Id: <20220429133552.33768-13-zhengqi.arch@bytedance.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com> References: <20220429133552.33768-1-zhengqi.arch@bytedance.com> MIME-Version: 1.0 X-Rspamd-Queue-Id: 0C2192005E X-Stat-Signature: g7w8bqykworrcaiwoug9dmouaayagyso Authentication-Results: imf03.hostedemail.com; dkim=pass header.d=bytedance-com.20210112.gappssmtp.com header.s=20210112 header.b="On/bg069"; dmarc=pass (policy=none) header.from=bytedance.com; spf=pass (imf03.hostedemail.com: domain of zhengqi.arch@bytedance.com designates 209.85.215.182 as permitted sender) smtp.mailfrom=zhengqi.arch@bytedance.com X-Rspam-User: X-Rspamd-Server: rspam08 X-HE-Tag: 1651239442-841464 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Use pte_tryget_map() to help us to try to get the refcount of the PTE page table page we want to access, which can prevents the page from being freed during access. For unuse_pte_range(), there are multiple locations where pte_offset_map() is called, and it is inconvenient to handle error conditions, so we perform pte_tryget() in advance in unuse_pmd_range(). 
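To make the conversion pattern concrete, here is an illustrative sketch (not part of the patch; the walker name is made up, but pte_tryget_map_lock(), pte_unmap_unlock() and the retry-on-NULL convention are exactly what the diff below uses at every converted call site). The key change for callers is that the map-and-lock helper can now return NULL when the PTE page table page has already been freed, so the caller re-reads the pmd and retries, or treats the range as unmapped; a later pte_unmap()/pte_unmap_unlock() also drops the reference taken by the tryget.

#include <linux/mm.h>
#include <linux/pte_ref.h>

/* Illustrative only: a PTE walker following the converted convention. */
static int example_walk_pte_range(struct mm_struct *mm, pmd_t *pmd,
				  unsigned long addr, unsigned long end)
{
	pte_t *start_pte, *pte;
	spinlock_t *ptl;

again:
	if (pmd_none(READ_ONCE(*pmd)))
		return 0;		/* the whole range is gone */

	start_pte = pte_tryget_map_lock(mm, pmd, addr, &ptl);
	if (!start_pte)
		goto again;		/* PTE page was freed under us */

	for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
		/* ... operate on *pte while holding ptl ... */
	}

	/* unlocks ptl and, in this series, also drops the tryget reference */
	pte_unmap_unlock(start_pte, ptl);
	return 0;
}
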
For the following cases, the PTE page table page is stable: - got the refcount of PTE page table page already - has no concurrent threads(e.g. the write lock of mmap_lock is acquired) - the PTE page table page is not yet visible - turn off the local cpu interrupt or hold the rcu lock (e.g. GUP fast path) - the PTE page table page is kernel PTE page table page So we still keep using pte_offset_map() and replace pte_unmap() with __pte_unmap() which doesn't reduce the refcount. Signed-off-by: Qi Zheng --- arch/x86/mm/mem_encrypt_identity.c | 11 ++++++++--- fs/userfaultfd.c | 10 +++++++--- include/linux/mm.h | 2 +- include/linux/swapops.h | 4 ++-- kernel/events/core.c | 5 ++++- mm/gup.c | 16 +++++++++++----- mm/hmm.c | 9 +++++++-- mm/huge_memory.c | 4 ++-- mm/khugepaged.c | 8 +++++--- mm/memory-failure.c | 11 ++++++++--- mm/memory.c | 19 +++++++++++++------ mm/migrate.c | 8 ++++++-- mm/mremap.c | 5 ++++- mm/page_table_check.c | 2 +- mm/page_vma_mapped.c | 13 ++++++++++--- mm/pagewalk.c | 2 +- mm/swap_state.c | 4 ++-- mm/swapfile.c | 9 ++++++--- mm/vmalloc.c | 2 +- 19 files changed, 99 insertions(+), 45 deletions(-) diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c index 6d323230320a..37a3f4da7bd2 100644 --- a/arch/x86/mm/mem_encrypt_identity.c +++ b/arch/x86/mm/mem_encrypt_identity.c @@ -171,26 +171,31 @@ static void __init sme_populate_pgd(struct sme_populate_pgd_data *ppd) pud_t *pud; pmd_t *pmd; pte_t *pte; + pmd_t pmdval; pud = sme_prepare_pgd(ppd); if (!pud) return; pmd = pmd_offset(pud, ppd->vaddr); - if (pmd_none(*pmd)) { +retry: + pmdval = READ_ONCE(*pmd); + if (pmd_none(pmdval)) { pte = ppd->pgtable_area; memset(pte, 0, sizeof(*pte) * PTRS_PER_PTE); ppd->pgtable_area += sizeof(*pte) * PTRS_PER_PTE; set_pmd(pmd, __pmd(PMD_FLAGS | __pa(pte))); } - if (pmd_large(*pmd)) + if (pmd_large(pmdval)) return; pte = pte_offset_map(pmd, ppd->vaddr); + if (!pte) + goto retry; if (pte_none(*pte)) set_pte(pte, __pte(ppd->paddr | ppd->pte_flags)); - pte_unmap(pte); + __pte_unmap(pte); } static void __init __sme_map_range_pmd(struct sme_populate_pgd_data *ppd) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index aa0c47cb0d16..c83fc73f29c0 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -309,6 +309,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx, * This is to deal with the instability (as in * pmd_trans_unstable) of the pmd. */ +retry: _pmd = READ_ONCE(*pmd); if (pmd_none(_pmd)) goto out; @@ -324,10 +325,13 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx, } /* - * the pmd is stable (as in !pmd_trans_unstable) so we can re-read it - * and use the standard pte_offset_map() instead of parsing _pmd. + * After we tryget successfully, the pmd is stable (as in + * !pmd_trans_unstable) so we can re-read it and use the standard + * pte_offset_map() instead of parsing _pmd. */ - pte = pte_offset_map(pmd, address); + pte = pte_tryget_map(mm, pmd, address); + if (!pte) + goto retry; /* * Lockless access: we're in a wait_event so it's ok if it * changes under us. diff --git a/include/linux/mm.h b/include/linux/mm.h index 04f7a6c36dc7..cc8fb009bab7 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2284,7 +2284,7 @@ static inline void pgtable_pte_page_dtor(struct page *page) #define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd)) #define pte_alloc_map(mm, pmd, address) \ - (pte_alloc(mm, pmd) ? NULL : pte_offset_map(pmd, address)) + (pte_alloc(mm, pmd) ? 
NULL : pte_tryget_map(mm, pmd, address)) #define pte_alloc_map_lock(mm, pmd, address, ptlp) \ (pte_alloc(mm, pmd) ? \ diff --git a/include/linux/swapops.h b/include/linux/swapops.h index d356ab4047f7..b671ecd6b5e7 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -214,7 +214,7 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset) extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep, spinlock_t *ptl); -extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd, +extern bool migration_entry_wait(struct mm_struct *mm, pmd_t *pmd, unsigned long address); extern void migration_entry_wait_huge(struct vm_area_struct *vma, struct mm_struct *mm, pte_t *pte); @@ -236,7 +236,7 @@ static inline int is_migration_entry(swp_entry_t swp) static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep, spinlock_t *ptl) { } -static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd, +static inline bool migration_entry_wait(struct mm_struct *mm, pmd_t *pmd, unsigned long address) { } static inline void migration_entry_wait_huge(struct vm_area_struct *vma, struct mm_struct *mm, pte_t *pte) { } diff --git a/kernel/events/core.c b/kernel/events/core.c index 23bb19716ad3..443b0af075e6 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -7215,6 +7215,7 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr) return pud_leaf_size(pud); pmdp = pmd_offset_lockless(pudp, pud, addr); +retry: pmd = READ_ONCE(*pmdp); if (!pmd_present(pmd)) return 0; @@ -7222,7 +7223,9 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr) if (pmd_leaf(pmd)) return pmd_leaf_size(pmd); - ptep = pte_offset_map(&pmd, addr); + ptep = pte_tryget_map(mm, &pmd, addr); + if (!ptep) + goto retry; pte = ptep_get_lockless(ptep); if (pte_present(pte)) size = pte_leaf_size(pte); diff --git a/mm/gup.c b/mm/gup.c index d2c24181fb04..114a7e7f871b 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -470,7 +470,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, if (!is_migration_entry(entry)) goto no_page; pte_unmap_unlock(ptep, ptl); - migration_entry_wait(mm, pmd, address); + if (!migration_entry_wait(mm, pmd, address)) + return no_page_table(vma, flags); goto retry; } if ((flags & FOLL_NUMA) && pte_protnone(pte)) @@ -805,6 +806,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address, pmd_t *pmd; pte_t *pte; int ret = -EFAULT; + pmd_t pmdval; /* user gate pages are read-only */ if (gup_flags & FOLL_WRITE) @@ -822,10 +824,14 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address, if (pud_none(*pud)) return -EFAULT; pmd = pmd_offset(pud, address); - if (!pmd_present(*pmd)) +retry: + pmdval = READ_ONCE(*pmd); + if (!pmd_present(pmdval)) return -EFAULT; - VM_BUG_ON(pmd_trans_huge(*pmd)); - pte = pte_offset_map(pmd, address); + VM_BUG_ON(pmd_trans_huge(pmdval)); + pte = pte_tryget_map(mm, pmd, address); + if (!pte) + goto retry; if (pte_none(*pte)) goto unmap; *vma = get_gate_vma(mm); @@ -2223,7 +2229,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, pte_unmap: if (pgmap) put_dev_pagemap(pgmap); - pte_unmap(ptem); + __pte_unmap(ptem); return ret; } #else diff --git a/mm/hmm.c b/mm/hmm.c index af71aac3140e..0cf45092efca 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -279,7 +279,8 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, if (is_migration_entry(entry)) { pte_unmap(ptep); hmm_vma_walk->last = addr; - 
migration_entry_wait(walk->mm, pmdp, addr); + if (!migration_entry_wait(walk->mm, pmdp, addr)) + return -EAGAIN; return -EBUSY; } @@ -384,12 +385,16 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR); } - ptep = pte_offset_map(pmdp, addr); + ptep = pte_tryget_map(walk->mm, pmdp, addr); + if (!ptep) + goto again; for (; addr < end; addr += PAGE_SIZE, ptep++, hmm_pfns++) { int r; r = hmm_vma_handle_pte(walk, addr, end, pmdp, ptep, hmm_pfns); if (r) { + if (r == -EAGAIN) + goto again; /* hmm_vma_handle_pte() did pte_unmap() */ return r; } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index c468fee595ff..73ac2e9c9193 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1932,7 +1932,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, pte = pte_offset_map(&_pmd, haddr); VM_BUG_ON(!pte_none(*pte)); set_pte_at(mm, haddr, pte, entry); - pte_unmap(pte); + __pte_unmap(pte); } smp_wmb(); /* make pte visible before pmd */ pmd_populate(mm, pmd, pgtable); @@ -2086,7 +2086,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, set_pte_at(mm, addr, pte, entry); if (!pmd_migration) atomic_inc(&page[i]._mapcount); - pte_unmap(pte); + __pte_unmap(pte); } if (!pmd_migration) { diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 3776cc315294..f540d7983b2d 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1003,7 +1003,9 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm, .pmd = pmd, }; - vmf.pte = pte_offset_map(pmd, address); + vmf.pte = pte_tryget_map(mm, pmd, address); + if (!vmf.pte) + return false; vmf.orig_pte = *vmf.pte; if (!is_swap_pte(vmf.orig_pte)) { pte_unmap(vmf.pte); @@ -1145,7 +1147,7 @@ static void collapse_huge_page(struct mm_struct *mm, spin_unlock(pte_ptl); if (unlikely(!isolated)) { - pte_unmap(pte); + __pte_unmap(pte); spin_lock(pmd_ptl); BUG_ON(!pmd_none(*pmd)); /* @@ -1168,7 +1170,7 @@ static void collapse_huge_page(struct mm_struct *mm, __collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl, &compound_pagelist); - pte_unmap(pte); + __pte_unmap(pte); /* * spin_lock() below is not the equivalent of smp_wmb(), but * the smp_wmb() inside __SetPageUptodate() can be reused to diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 5247932df3fa..2a840ddfc34e 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -304,6 +304,7 @@ static unsigned long dev_pagemap_mapping_shift(struct page *page, pud_t *pud; pmd_t *pmd; pte_t *pte; + pmd_t pmdval; VM_BUG_ON_VMA(address == -EFAULT, vma); pgd = pgd_offset(vma->vm_mm, address); @@ -318,11 +319,15 @@ static unsigned long dev_pagemap_mapping_shift(struct page *page, if (pud_devmap(*pud)) return PUD_SHIFT; pmd = pmd_offset(pud, address); - if (!pmd_present(*pmd)) +retry: + pmdval = READ_ONCE(*pmd); + if (!pmd_present(pmdval)) return 0; - if (pmd_devmap(*pmd)) + if (pmd_devmap(pmdval)) return PMD_SHIFT; - pte = pte_offset_map(pmd, address); + pte = pte_tryget_map(vma->vm_mm, pmd, address); + if (!pte) + goto retry; if (pte_present(*pte) && pte_devmap(*pte)) ret = PAGE_SHIFT; pte_unmap(pte); diff --git a/mm/memory.c b/mm/memory.c index ca03006b32cb..aa2bac561d5e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1091,7 +1091,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, arch_leave_lazy_mmu_mode(); spin_unlock(src_ptl); - pte_unmap(orig_src_pte); + __pte_unmap(orig_src_pte); add_mm_rss_vec(dst_mm, rss); pte_unmap_unlock(orig_dst_pte, dst_ptl); cond_resched(); @@ -3566,8 +3566,9 @@ 
vm_fault_t do_swap_page(struct vm_fault *vmf) entry = pte_to_swp_entry(vmf->orig_pte); if (unlikely(non_swap_entry(entry))) { if (is_migration_entry(entry)) { - migration_entry_wait(vma->vm_mm, vmf->pmd, - vmf->address); + if (!migration_entry_wait(vma->vm_mm, vmf->pmd, + vmf->address)) + ret = VM_FAULT_RETRY; } else if (is_device_exclusive_entry(entry)) { vmf->page = pfn_swap_entry_to_page(entry); ret = remove_device_exclusive_entry(vmf); @@ -4507,7 +4508,9 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) flags |= TNF_MIGRATED; } else { flags |= TNF_MIGRATE_FAIL; - vmf->pte = pte_offset_map(vmf->pmd, vmf->address); + vmf->pte = pte_tryget_map(vma->vm_mm, vmf->pmd, vmf->address); + if (!vmf->pte) + return VM_FAULT_RETRY; spin_lock(vmf->ptl); if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) { pte_unmap_unlock(vmf->pte, vmf->ptl); @@ -4617,7 +4620,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) { pte_t entry; - if (unlikely(pmd_none(*vmf->pmd))) { +retry: + if (unlikely(pmd_none(READ_ONCE(*vmf->pmd)))) { /* * Leave __pte_alloc() until later: because vm_ops->fault may * want to allocate huge page, and if we expose page table @@ -4646,7 +4650,10 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) * mmap_lock read mode and khugepaged takes it in write mode. * So now it's safe to run pte_offset_map(). */ - vmf->pte = pte_offset_map(vmf->pmd, vmf->address); + vmf->pte = pte_tryget_map(vmf->vma->vm_mm, vmf->pmd, + vmf->address); + if (!vmf->pte) + goto retry; vmf->orig_pte = *vmf->pte; /* diff --git a/mm/migrate.c b/mm/migrate.c index 6c31ee1e1c9b..125fbe300df2 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -301,12 +301,16 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep, pte_unmap_unlock(ptep, ptl); } -void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd, +bool migration_entry_wait(struct mm_struct *mm, pmd_t *pmd, unsigned long address) { spinlock_t *ptl = pte_lockptr(mm, pmd); - pte_t *ptep = pte_offset_map(pmd, address); + pte_t *ptep = pte_tryget_map(mm, pmd, address); + if (!ptep) + return false; __migration_entry_wait(mm, ptep, ptl); + + return true; } void migration_entry_wait_huge(struct vm_area_struct *vma, diff --git a/mm/mremap.c b/mm/mremap.c index d5ea5ce8a22a..71022d42f577 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -170,7 +170,9 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, old_pte = pte_tryget_map_lock(mm, old_pmd, old_addr, &old_ptl); if (!old_pte) goto drop_lock; - new_pte = pte_offset_map(new_pmd, new_addr); + new_pte = pte_tryget_map(mm, new_pmd, new_addr); + if (!new_pte) + goto unmap_drop_lock; new_ptl = pte_lockptr(mm, new_pmd); if (new_ptl != old_ptl) spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING); @@ -207,6 +209,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, if (new_ptl != old_ptl) spin_unlock(new_ptl); pte_unmap(new_pte - 1); +unmap_drop_lock: pte_unmap_unlock(old_pte - 1, old_ptl); drop_lock: if (need_rmap_locks) diff --git a/mm/page_table_check.c b/mm/page_table_check.c index 2458281bff89..185e84f22c6c 100644 --- a/mm/page_table_check.c +++ b/mm/page_table_check.c @@ -251,7 +251,7 @@ void __page_table_check_pte_clear_range(struct mm_struct *mm, pte_t *ptep = pte_offset_map(&pmd, addr); unsigned long i; - pte_unmap(ptep); + __pte_unmap(ptep); for (i = 0; i < PTRS_PER_PTE; i++) { __page_table_check_pte_clear(mm, addr, *ptep); addr += PAGE_SIZE; diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index 14a5cda73dee..8ecf8fd7cf5e 100644 --- a/mm/page_vma_mapped.c 
+++ b/mm/page_vma_mapped.c @@ -15,7 +15,9 @@ static inline bool not_found(struct page_vma_mapped_walk *pvmw) static bool map_pte(struct page_vma_mapped_walk *pvmw) { - pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address); + pvmw->pte = pte_tryget_map(pvmw->vma->vm_mm, pvmw->pmd, pvmw->address); + if (!pvmw->pte) + return false; if (!(pvmw->flags & PVMW_SYNC)) { if (pvmw->flags & PVMW_MIGRATION) { if (!is_swap_pte(*pvmw->pte)) @@ -203,6 +205,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) } pvmw->pmd = pmd_offset(pud, pvmw->address); +retry: /* * Make sure the pmd value isn't cached in a register by the * compiler and used as a stale value after we've observed a @@ -251,8 +254,12 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) step_forward(pvmw, PMD_SIZE); continue; } - if (!map_pte(pvmw)) - goto next_pte; + if (!map_pte(pvmw)) { + if (!pvmw->pte) + goto retry; + else + goto next_pte; + } this_pte: if (check_pte(pvmw)) return true; diff --git a/mm/pagewalk.c b/mm/pagewalk.c index 264b717e24ef..adb5dacbd537 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -48,7 +48,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, if (walk->no_vma) { pte = pte_offset_map(pmd, addr); err = walk_pte_range_inner(pte, addr, end, walk); - pte_unmap(pte); + __pte_unmap(pte); } else { pte = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl); if (!pte) diff --git a/mm/swap_state.c b/mm/swap_state.c index 013856004825..5b70c2c815ef 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -743,7 +743,7 @@ static void swap_ra_info(struct vm_fault *vmf, SWAP_RA_VAL(faddr, win, 0)); if (win == 1) { - pte_unmap(orig_pte); + __pte_unmap(orig_pte); return; } @@ -768,7 +768,7 @@ static void swap_ra_info(struct vm_fault *vmf, for (pfn = start; pfn != end; pfn++) *tpte++ = *pte++; #endif - pte_unmap(orig_pte); + __pte_unmap(orig_pte); } /** diff --git a/mm/swapfile.c b/mm/swapfile.c index 710fbeec9e58..f1c64fc15e24 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1845,7 +1845,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd, continue; offset = swp_offset(entry); - pte_unmap(pte); + __pte_unmap(pte); swap_map = &si->swap_map[offset]; page = lookup_swap_cache(entry, vma, addr); if (!page) { @@ -1880,7 +1880,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd, try_next: pte = pte_offset_map(pmd, addr); } while (pte++, addr += PAGE_SIZE, addr != end); - pte_unmap(pte - 1); + __pte_unmap(pte - 1); ret = 0; out: @@ -1901,8 +1901,11 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud, next = pmd_addr_end(addr, end); if (pmd_none_or_trans_huge_or_clear_bad(pmd)) continue; + if (!pte_tryget(vma->vm_mm, pmd, addr)) + continue; ret = unuse_pte_range(vma, pmd, addr, next, type); - if (ret && ret != -EAGAIN) + __pte_put(pmd_pgtable(*pmd)); + if (ret) return ret; } while (pmd++, addr = next, addr != end); return 0; diff --git a/mm/vmalloc.c b/mm/vmalloc.c index e163372d3967..080aa78bdaff 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -694,7 +694,7 @@ struct page *vmalloc_to_page(const void *vmalloc_addr) pte = *ptep; if (pte_present(pte)) page = pte_page(pte); - pte_unmap(ptep); + __pte_unmap(ptep); return page; } From patchwork Fri Apr 29 13:35:47 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qi Zheng X-Patchwork-Id: 12832018 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org 
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 340C0C433F5 for ; Fri, 29 Apr 2022 13:37:34 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id CB5B26B0081; Fri, 29 Apr 2022 09:37:33 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id C3E416B0082; Fri, 29 Apr 2022 09:37:33 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A91F06B0083; Fri, 29 Apr 2022 09:37:33 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (relay.hostedemail.com [64.99.140.25]) by kanga.kvack.org (Postfix) with ESMTP id 966126B0081 for ; Fri, 29 Apr 2022 09:37:33 -0400 (EDT) Received: from smtpin26.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay12.hostedemail.com (Postfix) with ESMTP id 73A9A1205F0 for ; Fri, 29 Apr 2022 13:37:33 +0000 (UTC) X-FDA: 79410018786.26.A3E6DD9 Received: from mail-pf1-f180.google.com (mail-pf1-f180.google.com [209.85.210.180]) by imf12.hostedemail.com (Postfix) with ESMTP id C93BE4004A for ; Fri, 29 Apr 2022 13:37:21 +0000 (UTC) Received: by mail-pf1-f180.google.com with SMTP id h1so6923381pfv.12 for ; Fri, 29 Apr 2022 06:37:32 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance-com.20210112.gappssmtp.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=23JDuOeXS3qapCnfe9+MGlUBuO57K6IKT3rd6Zq9kR4=; b=MHSRX303E6gAt323/IxcNAaiiuJtGUdH5cveE5vIqfJsvePabyp0uPeIkm0nDQ8x+z mCEeA/j/KM/zmasCI3NTuE8qHxZdXOOfIfcn8do+VN9VYhVXhI+g+9WAniNW4BEPE+Hr J9OlM7BM2PT4gmQ+NP52DmvU2IhXT/3tZBir8W+BpSrREGb9XJc337kADy/ZkI4Obkb1 fYgKvKZsYGfcnMZSjdbXdUQsBbzcAXrLxtXnir6+P37YT8MteDij0NXzVlxDMbO/OmKK fvdicx7MXbzcUO6F6lxiLRamPoU1KEA+G/pfvWZB5UwwmBvNGsfLlRTWzI28/+l7upAy HoeQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=23JDuOeXS3qapCnfe9+MGlUBuO57K6IKT3rd6Zq9kR4=; b=udeEzxoxKQFbuMN6SRCLXbHaJi2jQQNh+zRz7GCqDYQxJ3pDOVuOiHUao6l3+G5XRk DISYlBKQ6yglxbhHYCvvaNBauLMLIcfxBPc/t9Ad3p+fHIcs44+QVEbTLu2Ql/RBVayG peDuj4fqb875BZ/+Vn+K9xABsR1TnTLtChDg2kPlNuCpQRYOS13mrKqQ3SXXm+FjcMiK Bj2K9+jRoztANAkajErfTNpu7SluMUY16+22vq5wDy/+OSrMi5PqH3TXgNuz3TSNhvjW kZ1K0vX7iODjYQk5AWWAbGua3YH8bAkhD4tPr6L0wdsZHon2QrEPRGaULKTjLmShs2yv tVmw== X-Gm-Message-State: AOAM533HhQJAabdcNYLYg30Sg92YMNqP640U60LEIFlIQPp+5pLywRdt QAHJWTvbH5OMb0RR/VTKUQdo7g== X-Google-Smtp-Source: ABdhPJzyfarjjWMZT/Urjo6R6jR7MFp2VAkxiYz4i0cDzXZXP0EBlBsAWmKDRDq8ix5Fv+EZW5xpSg== X-Received: by 2002:a63:5847:0:b0:399:3452:ffe4 with SMTP id i7-20020a635847000000b003993452ffe4mr32556539pgm.406.1651239451972; Fri, 29 Apr 2022 06:37:31 -0700 (PDT) Received: from localhost.localdomain ([139.177.225.240]) by smtp.gmail.com with ESMTPSA id m8-20020a17090a414800b001d81a30c437sm10681977pjg.50.2022.04.29.06.37.26 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 29 Apr 2022 06:37:31 -0700 (PDT) From: Qi Zheng To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org, dennis@kernel.org, ming.lei@redhat.com Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng Subject: [RFC PATCH 13/18] mm: add try_to_free_user_pte() helper Date: Fri, 29 Apr 
2022 21:35:47 +0800 Message-Id: <20220429133552.33768-14-zhengqi.arch@bytedance.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com> References: <20220429133552.33768-1-zhengqi.arch@bytedance.com> MIME-Version: 1.0 X-Rspamd-Queue-Id: C93BE4004A X-Stat-Signature: ng579ynzro6rjze5zxo6ae79mqoo531c X-Rspam-User: Authentication-Results: imf12.hostedemail.com; dkim=pass header.d=bytedance-com.20210112.gappssmtp.com header.s=20210112 header.b=MHSRX303; spf=pass (imf12.hostedemail.com: domain of zhengqi.arch@bytedance.com designates 209.85.210.180 as permitted sender) smtp.mailfrom=zhengqi.arch@bytedance.com; dmarc=pass (policy=none) header.from=bytedance.com X-Rspamd-Server: rspam09 X-HE-Tag: 1651239441-202136 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Normally, the percpu_ref of the user PTE page table page is in percpu mode. This patch add try_to_free_user_pte() to switch the percpu_ref to atomic mode and check if it is 0. If the percpu_ref is 0, which means that no one is using the user PTE page table page, then we can safely reclaim it. Signed-off-by: Qi Zheng --- include/linux/pte_ref.h | 7 +++ mm/pte_ref.c | 99 ++++++++++++++++++++++++++++++++++++++++- 2 files changed, 104 insertions(+), 2 deletions(-) diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h index bfe620038699..379c3b45a6ab 100644 --- a/include/linux/pte_ref.h +++ b/include/linux/pte_ref.h @@ -16,6 +16,8 @@ void free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr); bool pte_tryget(struct mm_struct *mm, pmd_t *pmd, unsigned long addr); void __pte_put(pgtable_t page); void pte_put(pte_t *ptep); +void try_to_free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr, + bool switch_back); #else /* !CONFIG_FREE_USER_PTE */ @@ -47,6 +49,11 @@ static inline void pte_put(pte_t *ptep) { } +static inline void try_to_free_user_pte(struct mm_struct *mm, pmd_t *pmd, + unsigned long addr, bool switch_back) +{ +} + #endif /* CONFIG_FREE_USER_PTE */ #endif /* _LINUX_PTE_REF_H */ diff --git a/mm/pte_ref.c b/mm/pte_ref.c index 5b382445561e..bf9629272c71 100644 --- a/mm/pte_ref.c +++ b/mm/pte_ref.c @@ -8,6 +8,9 @@ #include #include #include +#include +#include +#include #ifdef CONFIG_FREE_USER_PTE @@ -44,8 +47,6 @@ void pte_ref_free(pgtable_t pte) kfree(ref); } -void free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) {} - /* * pte_tryget - try to get the pte_ref of the user PTE page table page * @mm: pointer the target address space @@ -102,4 +103,98 @@ void pte_put(pte_t *ptep) } EXPORT_SYMBOL(pte_put); +#ifdef CONFIG_DEBUG_VM +void pte_free_debug(pmd_t pmd) +{ + pte_t *ptep = (pte_t *)pmd_page_vaddr(pmd); + int i = 0; + + for (i = 0; i < PTRS_PER_PTE; i++) + BUG_ON(!pte_none(*ptep++)); +} +#else +static inline void pte_free_debug(pmd_t pmd) +{ +} +#endif + +static inline void pte_free_rcu(struct rcu_head *rcu) +{ + struct page *page = container_of(rcu, struct page, rcu_head); + + pgtable_pte_page_dtor(page); + __free_page(page); +} + +/* + * free_user_pte - free the user PTE page table page + * @mm: pointer the target address space + * @pmd: pointer to a PMD + * @addr: start address of the tlb range to be flushed + * + * Context: The pmd range has been unmapped and TLB purged. And the user PTE + * page table page will be freed by rcu handler. 
+ */ +void free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) +{ + struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0); + spinlock_t *ptl; + pmd_t pmdval; + + ptl = pmd_lock(mm, pmd); + pmdval = *pmd; + if (pmd_none(pmdval) || pmd_leaf(pmdval)) { + spin_unlock(ptl); + return; + } + pmd_clear(pmd); + flush_tlb_range(&vma, addr, addr + PMD_SIZE); + spin_unlock(ptl); + + pte_free_debug(pmdval); + mm_dec_nr_ptes(mm); + call_rcu(&pmd_pgtable(pmdval)->rcu_head, pte_free_rcu); +} + +/* + * try_to_free_user_pte - try to free the user PTE page table page + * @mm: pointer the target address space + * @pmd: pointer to a PMD + * @addr: virtual address associated with pmd + * @switch_back: indicates if switching back to percpu mode is required + */ +void try_to_free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr, + bool switch_back) +{ + pgtable_t pte; + + if (&init_mm == mm) + return; + + if (!pte_tryget(mm, pmd, addr)) + return; + pte = pmd_pgtable(*pmd); + percpu_ref_switch_to_atomic_sync(pte->pte_ref); + rcu_read_lock(); + /* + * Here we can safely put the pte_ref because we already hold the rcu + * lock, which guarantees that the user PTE page table page will not + * be released. + */ + __pte_put(pte); + if (percpu_ref_is_zero(pte->pte_ref)) { + rcu_read_unlock(); + free_user_pte(mm, pmd, addr & PMD_MASK); + return; + } + rcu_read_unlock(); + + if (switch_back) { + if (pte_tryget(mm, pmd, addr)) { + percpu_ref_switch_to_percpu(pte->pte_ref); + __pte_put(pte); + } + } +} + #endif /* CONFIG_FREE_USER_PTE */ From patchwork Fri Apr 29 13:35:48 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qi Zheng X-Patchwork-Id: 12832019 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id E1AF9C433EF for ; Fri, 29 Apr 2022 13:37:40 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 7855C6B0072; Fri, 29 Apr 2022 09:37:40 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 70E586B007B; Fri, 29 Apr 2022 09:37:40 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 588696B007D; Fri, 29 Apr 2022 09:37:40 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (relay.hostedemail.com [64.99.140.25]) by kanga.kvack.org (Postfix) with ESMTP id 4667E6B0072 for ; Fri, 29 Apr 2022 09:37:40 -0400 (EDT) Received: from smtpin22.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id 1A27326B62 for ; Fri, 29 Apr 2022 13:37:40 +0000 (UTC) X-FDA: 79410019080.22.A3DA6C1 Received: from mail-pf1-f179.google.com (mail-pf1-f179.google.com [209.85.210.179]) by imf07.hostedemail.com (Postfix) with ESMTP id CB6DC4003E for ; Fri, 29 Apr 2022 13:37:35 +0000 (UTC) Received: by mail-pf1-f179.google.com with SMTP id p12so6991420pfn.0 for ; Fri, 29 Apr 2022 06:37:38 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance-com.20210112.gappssmtp.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=rYeLNg+oS2y3akdD0dEeA3yQLeG8PhzEVgB+CG0F5x8=; b=r0oSktHd6lJh1xiGDkOseB1CwEKbnU5ADwHmNxjoz0rF9tVkTla0cHmYCIjX9kad48 HXeS+/YDOH1RsqM0QaLq3bkmUHA4i6Jbi5iXnJXClUtQW4MWr3dTM217XBF44U478R6b 
Shvq73K4KVjlhzE/n7vZ2+kDth8ZTt+DfZSh4jXPMr/dgaiOTUv7UitxA5wBpQeUxLT3 /SrfjXz8WTr03+JamLKfd/O6NmDHtxE1rvUHfP30jXpZwF3OrnSi1H2YikZmmdZh3Lk1 XhF0EkwMAPk7F1PJU3e9O2p9K6frj/vvPrSN47R2YW4XK31joU4z9lay8VQ6QPegDiSH W1uA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=rYeLNg+oS2y3akdD0dEeA3yQLeG8PhzEVgB+CG0F5x8=; b=1/Os1HNyXHJt9yskqOyhhocjuW8bcaUo3sOq5OM45NjInWnWroa21KdYIPcIBYNTl7 lne6ZmzbEa4BN/2pXVzqBKXbqducNsxmT7Oa7Bf0ZcEx46xtWBLbNJYvUFLTtc+GY95T 24vH2+Lx/jz9aGHO56A3/vzLw6HpwgOnGEZMNajN91WkVFvSiwkNcioQ22P3tVMrYZPO DVWh9+yDyPWOVPCSQcAkLQuT/a/uDn0r8jBjZ0n8+/4sjvPHZlVmSpRO0m6uGn0BqskU NsBaNF0wjBpXN0KVPxDsses2U7GJESMewHcu1i3UP/kS0V6pUId+fVMeuQB0HoV/wELy GkIQ== X-Gm-Message-State: AOAM530Kuuo3CMmXSOUFNMpuVoY5Av0zUY4FczvfsGlHClMh/hztrrTy IiQwIxC88RO8TGxsYLN16x5CMw== X-Google-Smtp-Source: ABdhPJwqz/7MfX5i1uHVza5qIKuW50STzQue9UefcrOGDN8WPmjUeneAZB8Gavo9bb4peYnwNe/Qfg== X-Received: by 2002:a05:6a00:238f:b0:4f7:78b1:2f6b with SMTP id f15-20020a056a00238f00b004f778b12f6bmr40178261pfc.17.1651239457603; Fri, 29 Apr 2022 06:37:37 -0700 (PDT) Received: from localhost.localdomain ([139.177.225.240]) by smtp.gmail.com with ESMTPSA id m8-20020a17090a414800b001d81a30c437sm10681977pjg.50.2022.04.29.06.37.32 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 29 Apr 2022 06:37:37 -0700 (PDT) From: Qi Zheng To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org, dennis@kernel.org, ming.lei@redhat.com Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng Subject: [RFC PATCH 14/18] mm: use try_to_free_user_pte() in MADV_DONTNEED case Date: Fri, 29 Apr 2022 21:35:48 +0800 Message-Id: <20220429133552.33768-15-zhengqi.arch@bytedance.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com> References: <20220429133552.33768-1-zhengqi.arch@bytedance.com> MIME-Version: 1.0 X-Rspamd-Server: rspam03 X-Rspamd-Queue-Id: CB6DC4003E X-Stat-Signature: anj7cmdnufrtkxir1wyo6p3cir8jzdc3 Authentication-Results: imf07.hostedemail.com; dkim=pass header.d=bytedance-com.20210112.gappssmtp.com header.s=20210112 header.b=r0oSktHd; dmarc=pass (policy=none) header.from=bytedance.com; spf=pass (imf07.hostedemail.com: domain of zhengqi.arch@bytedance.com designates 209.85.210.179 as permitted sender) smtp.mailfrom=zhengqi.arch@bytedance.com X-Rspam-User: X-HE-Tag: 1651239455-936822 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Immediately after a successful MADV_DONTNEED operation, the physical page is unmapped from the PTE page table entry. This is a good time to call try_to_free_user_pte() to try to free the PTE page table page. 
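As a hypothetical userspace illustration (not part of the patch), the sequence below is the kind of workload that exercises this path: touching an anonymous mapping allocates a PTE page table page, and a subsequent MADV_DONTNEED over the same range empties all of its entries, which is exactly the moment where the new try_to_free_user_pte() call can reclaim the PTE page itself. Alignment is ignored for brevity, so the region may in practice span two PTE pages.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <string.h>

int main(void)
{
	size_t len = 2UL << 20;	/* roughly one pmd's worth of PTEs on x86-64 */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	memset(p, 1, len);		/* populate the PTE entries */
	madvise(p, len, MADV_DONTNEED);	/* zap; the PTE page becomes empty */
	return 0;
}
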
Signed-off-by: Qi Zheng --- mm/internal.h | 3 ++- mm/memory.c | 43 +++++++++++++++++++++++++++++-------------- mm/oom_kill.c | 3 ++- 3 files changed, 33 insertions(+), 16 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index cf16280ce132..f93a9170d2e3 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -77,7 +77,8 @@ struct zap_details; void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long addr, unsigned long end, - struct zap_details *details); + struct zap_details *details, + bool free_pte); void page_cache_ra_order(struct readahead_control *, struct file_ra_state *, unsigned int order); diff --git a/mm/memory.c b/mm/memory.c index aa2bac561d5e..75a0e16a095a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1339,7 +1339,8 @@ static inline bool should_zap_page(struct zap_details *details, struct page *pag static unsigned long zap_pte_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, - struct zap_details *details) + struct zap_details *details, + bool free_pte) { struct mm_struct *mm = tlb->mm; int force_flush = 0; @@ -1348,6 +1349,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, pte_t *start_pte; pte_t *pte; swp_entry_t entry; + unsigned long start = addr; tlb_change_page_size(tlb, PAGE_SIZE); again: @@ -1455,13 +1457,17 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, goto again; } + if (free_pte) + try_to_free_user_pte(mm, pmd, start, true); + return addr; } static inline unsigned long zap_pmd_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pud_t *pud, unsigned long addr, unsigned long end, - struct zap_details *details) + struct zap_details *details, + bool free_pte) { pmd_t *pmd; unsigned long next; @@ -1496,7 +1502,8 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb, */ if (pmd_none_or_trans_huge_or_clear_bad(pmd)) goto next; - next = zap_pte_range(tlb, vma, pmd, addr, next, details); + next = zap_pte_range(tlb, vma, pmd, addr, next, details, + free_pte); next: cond_resched(); } while (pmd++, addr = next, addr != end); @@ -1507,7 +1514,8 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb, static inline unsigned long zap_pud_range(struct mmu_gather *tlb, struct vm_area_struct *vma, p4d_t *p4d, unsigned long addr, unsigned long end, - struct zap_details *details) + struct zap_details *details, + bool free_pte) { pud_t *pud; unsigned long next; @@ -1525,7 +1533,8 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb, } if (pud_none_or_clear_bad(pud)) continue; - next = zap_pmd_range(tlb, vma, pud, addr, next, details); + next = zap_pmd_range(tlb, vma, pud, addr, next, details, + free_pte); next: cond_resched(); } while (pud++, addr = next, addr != end); @@ -1536,7 +1545,8 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb, static inline unsigned long zap_p4d_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pgd_t *pgd, unsigned long addr, unsigned long end, - struct zap_details *details) + struct zap_details *details, + bool free_pte) { p4d_t *p4d; unsigned long next; @@ -1546,7 +1556,8 @@ static inline unsigned long zap_p4d_range(struct mmu_gather *tlb, next = p4d_addr_end(addr, end); if (p4d_none_or_clear_bad(p4d)) continue; - next = zap_pud_range(tlb, vma, p4d, addr, next, details); + next = zap_pud_range(tlb, vma, p4d, addr, next, details, + free_pte); } while (p4d++, addr = next, addr != end); return addr; @@ -1555,7 +1566,8 @@ static inline unsigned long zap_p4d_range(struct 
mmu_gather *tlb, void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long addr, unsigned long end, - struct zap_details *details) + struct zap_details *details, + bool free_pte) { pgd_t *pgd; unsigned long next; @@ -1567,7 +1579,8 @@ void unmap_page_range(struct mmu_gather *tlb, next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(pgd)) continue; - next = zap_p4d_range(tlb, vma, pgd, addr, next, details); + next = zap_p4d_range(tlb, vma, pgd, addr, next, details, + free_pte); } while (pgd++, addr = next, addr != end); tlb_end_vma(tlb, vma); } @@ -1576,7 +1589,8 @@ void unmap_page_range(struct mmu_gather *tlb, static void unmap_single_vma(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long start_addr, unsigned long end_addr, - struct zap_details *details) + struct zap_details *details, + bool free_pte) { unsigned long start = max(vma->vm_start, start_addr); unsigned long end; @@ -1612,7 +1626,8 @@ static void unmap_single_vma(struct mmu_gather *tlb, i_mmap_unlock_write(vma->vm_file->f_mapping); } } else - unmap_page_range(tlb, vma, start, end, details); + unmap_page_range(tlb, vma, start, end, details, + free_pte); } } @@ -1644,7 +1659,7 @@ void unmap_vmas(struct mmu_gather *tlb, start_addr, end_addr); mmu_notifier_invalidate_range_start(&range); for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) - unmap_single_vma(tlb, vma, start_addr, end_addr, NULL); + unmap_single_vma(tlb, vma, start_addr, end_addr, NULL, false); mmu_notifier_invalidate_range_end(&range); } @@ -1669,7 +1684,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start, update_hiwater_rss(vma->vm_mm); mmu_notifier_invalidate_range_start(&range); for ( ; vma && vma->vm_start < range.end; vma = vma->vm_next) - unmap_single_vma(&tlb, vma, start, range.end, NULL); + unmap_single_vma(&tlb, vma, start, range.end, NULL, true); mmu_notifier_invalidate_range_end(&range); tlb_finish_mmu(&tlb); } @@ -1695,7 +1710,7 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr tlb_gather_mmu(&tlb, vma->vm_mm); update_hiwater_rss(vma->vm_mm); mmu_notifier_invalidate_range_start(&range); - unmap_single_vma(&tlb, vma, address, range.end, details); + unmap_single_vma(&tlb, vma, address, range.end, details, true); mmu_notifier_invalidate_range_end(&range); tlb_finish_mmu(&tlb); } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 7ec38194f8e1..c4c25a7add7b 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -549,7 +549,8 @@ bool __oom_reap_task_mm(struct mm_struct *mm) ret = false; continue; } - unmap_page_range(&tlb, vma, range.start, range.end, NULL); + unmap_page_range(&tlb, vma, range.start, range.end, + NULL, false); mmu_notifier_invalidate_range_end(&range); tlb_finish_mmu(&tlb); } From patchwork Fri Apr 29 13:35:49 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qi Zheng X-Patchwork-Id: 12832020 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id AAEF3C43217 for ; Fri, 29 Apr 2022 13:37:45 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 49F626B0078; Fri, 29 Apr 2022 09:37:45 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 428CD6B007B; Fri, 29 Apr 2022 09:37:45 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 
63042) id 253B46B007D; Fri, 29 Apr 2022 09:37:45 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (relay.hostedemail.com [64.99.140.25]) by kanga.kvack.org (Postfix) with ESMTP id 1282E6B0078 for ; Fri, 29 Apr 2022 09:37:45 -0400 (EDT) Received: from smtpin21.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id E4C5320293 for ; Fri, 29 Apr 2022 13:37:44 +0000 (UTC) X-FDA: 79410019248.21.EB10AAB Received: from mail-pf1-f173.google.com (mail-pf1-f173.google.com [209.85.210.173]) by imf21.hostedemail.com (Postfix) with ESMTP id 4965F1C0068 for ; Fri, 29 Apr 2022 13:37:40 +0000 (UTC) Received: by mail-pf1-f173.google.com with SMTP id bo5so6954886pfb.4 for ; Fri, 29 Apr 2022 06:37:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance-com.20210112.gappssmtp.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=O5cSbfL+Kws2WK1XWtshx/2qRi6Wn3xrF3ZBZvoPEF4=; b=k82gmWwUxLeytVIW73YEj81XzJTEAi/VIKgG1khPQ5s9ntJrhtKUiHv2z65IM6x1pP cHyGFHCju/c6KUsgM1fEQ0gtlsNMOaUvhzKlUmUgO8zN6rDeRYSGJyNAofMSN3w21Nek gSVYkIKFvK/gSV3nvq9ZsEQmDBGd8P7fkTMiDlBwEy6tOiIFycE6VmPf+oth9zRFTpRq xQmyrVyejNKrcVf5wn9fIwvLO8i/YUdZETKNoql2OyZvMa0w5HO+dTEIc3OZxueEv5n+ C3QqYodUydvPO2/5EMdhE3bSPnrK80wuPiHAcYuSAuhfVECNh/kNOJxshL9h1CVpnKlE HNhg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=O5cSbfL+Kws2WK1XWtshx/2qRi6Wn3xrF3ZBZvoPEF4=; b=eMKrbn/MLi7JoBrLqwBpMmlAcXMzdErmJwlQuo5DpO9lh+g4o5+sK/L+WxBrFWInh1 kn/s2YpUG85UBqiErMPKIptNRxpC2o2nHYJPX52w3kTPJ98KnlDdjeLt2oXhO0Siuv3I ExVHycjpqmr2s5UEUwVaY2mdRT+ZJR7hiBL8HXDpLySz0kklarOXMPuTyTBKhlTOJM5c 7cNPk+y2y5ivbKy/1oIMMjpeYqgbaReYAfQ9/O2ci1WBTe5JCqI/USKEvfEhVHzy/jMW K4BExg231if2dcorX64cJI6nGjZ+F9QS0JwIQOaU2azJE2aUifshNHKV/ot5fS5zdI3r mJlg== X-Gm-Message-State: AOAM532EJpdPr7Jp9MQQU/QS2sI8GUQARysObzX9bHm+dfVXZj2zYodo F9o85TZMz20EZ3rCNlkCFW723Q== X-Google-Smtp-Source: ABdhPJw5CiRd+JTZxOeaHYQjHa7B3chtYZCpnIESroFu7Xy+Zl7X7Cm8G2ZfLUtwF1UYd9U60cOnNg== X-Received: by 2002:a63:f749:0:b0:3aa:361c:8827 with SMTP id f9-20020a63f749000000b003aa361c8827mr32583907pgk.361.1651239463216; Fri, 29 Apr 2022 06:37:43 -0700 (PDT) Received: from localhost.localdomain ([139.177.225.240]) by smtp.gmail.com with ESMTPSA id m8-20020a17090a414800b001d81a30c437sm10681977pjg.50.2022.04.29.06.37.37 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 29 Apr 2022 06:37:42 -0700 (PDT) From: Qi Zheng To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org, dennis@kernel.org, ming.lei@redhat.com Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng Subject: [RFC PATCH 15/18] mm: use try_to_free_user_pte() in MADV_FREE case Date: Fri, 29 Apr 2022 21:35:49 +0800 Message-Id: <20220429133552.33768-16-zhengqi.arch@bytedance.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com> References: <20220429133552.33768-1-zhengqi.arch@bytedance.com> MIME-Version: 1.0 X-Rspam-User: X-Rspamd-Server: rspam11 X-Rspamd-Queue-Id: 4965F1C0068 X-Stat-Signature: bniynnekcde13ehsmidq88i8nnkkst58 Authentication-Results: imf21.hostedemail.com; dkim=pass 
header.d=bytedance-com.20210112.gappssmtp.com header.s=20210112 header.b=k82gmWwU; spf=pass (imf21.hostedemail.com: domain of zhengqi.arch@bytedance.com designates 209.85.210.173 as permitted sender) smtp.mailfrom=zhengqi.arch@bytedance.com; dmarc=pass (policy=none) header.from=bytedance.com X-HE-Tag: 1651239460-336691 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Different from MADV_DONTNEED case, MADV_FREE just marks the physical page as lazyfree instead of unmapping it immediately, and the physical page will not be unmapped until the system memory is tight. So we convert the percpu_ref of the related user PTE page table page to atomic mode in madvise_free_pte_range(), and then check if it is 0 in try_to_unmap_one(). If it is 0, we can safely reclaim the PTE page table page at this time. Signed-off-by: Qi Zheng --- include/linux/rmap.h | 2 ++ mm/madvise.c | 7 ++++++- mm/page_vma_mapped.c | 46 ++++++++++++++++++++++++++++++++++++++++++-- mm/rmap.c | 9 +++++++++ 4 files changed, 61 insertions(+), 3 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 17230c458341..a3174d3bf118 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -204,6 +204,8 @@ int make_device_exclusive_range(struct mm_struct *mm, unsigned long start, #define PVMW_SYNC (1 << 0) /* Look for migration entries rather than present PTEs */ #define PVMW_MIGRATION (1 << 1) +/* Used for MADV_FREE page */ +#define PVMW_MADV_FREE (1 << 2) struct page_vma_mapped_walk { unsigned long pfn; diff --git a/mm/madvise.c b/mm/madvise.c index 8123397f14c8..bd4bcaad5a9f 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -598,7 +598,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, pte_t *orig_pte, *pte, ptent; struct page *page; int nr_swap = 0; + bool have_lazyfree = false; unsigned long next; + unsigned long start = addr; next = pmd_addr_end(addr, end); if (pmd_trans_huge(*pmd)) @@ -709,6 +711,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, tlb_remove_tlb_entry(tlb, pte, addr); } mark_page_lazyfree(page); + have_lazyfree = true; } out: if (nr_swap) { @@ -718,8 +721,10 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, add_mm_counter(mm, MM_SWAPENTS, nr_swap); } arch_leave_lazy_mmu_mode(); - if (orig_pte) + if (orig_pte) { pte_unmap_unlock(orig_pte, ptl); + try_to_free_user_pte(mm, pmd, start, !have_lazyfree); + } cond_resched(); next: return 0; diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index 8ecf8fd7cf5e..00bc09f57f48 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -266,8 +266,30 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) next_pte: do { pvmw->address += PAGE_SIZE; - if (pvmw->address >= end) - return not_found(pvmw); + if (pvmw->address >= end) { + not_found(pvmw); + + if (pvmw->flags & PVMW_MADV_FREE) { + pgtable_t pte; + pmd_t pmdval; + + pvmw->flags &= ~PVMW_MADV_FREE; + rcu_read_lock(); + pmdval = READ_ONCE(*pvmw->pmd); + if (pmd_none(pmdval) || pmd_leaf(pmdval)) { + rcu_read_unlock(); + return false; + } + pte = pmd_pgtable(pmdval); + if (percpu_ref_is_zero(pte->pte_ref)) { + rcu_read_unlock(); + free_user_pte(mm, pvmw->pmd, pvmw->address); + } else { + rcu_read_unlock(); + } + } + return false; + } /* Did we cross page table boundary? 
*/ if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) { if (pvmw->ptl) { @@ -275,6 +297,26 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) pvmw->ptl = NULL; } pte_unmap(pvmw->pte); + if (pvmw->flags & PVMW_MADV_FREE) { + pgtable_t pte; + pmd_t pmdval; + + pvmw->flags &= ~PVMW_MADV_FREE; + rcu_read_lock(); + pmdval = READ_ONCE(*pvmw->pmd); + if (pmd_none(pmdval) || pmd_leaf(pmdval)) { + rcu_read_unlock(); + pvmw->pte = NULL; + goto restart; + } + pte = pmd_pgtable(pmdval); + if (percpu_ref_is_zero(pte->pte_ref)) { + rcu_read_unlock(); + free_user_pte(mm, pvmw->pmd, pvmw->address); + } else { + rcu_read_unlock(); + } + } pvmw->pte = NULL; goto restart; } diff --git a/mm/rmap.c b/mm/rmap.c index fedb82371efe..f978d324d4f9 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1616,6 +1616,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE); dec_mm_counter(mm, MM_ANONPAGES); + if (IS_ENABLED(CONFIG_FREE_USER_PTE)) + pvmw.flags |= PVMW_MADV_FREE; goto discard; } @@ -1627,6 +1629,13 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, folio_set_swapbacked(folio); ret = false; page_vma_mapped_walk_done(&pvmw); + if (IS_ENABLED(CONFIG_FREE_USER_PTE) && + pte_tryget(mm, pvmw.pmd, address)) { + pgtable_t pte_page = pmd_pgtable(*pvmw.pmd); + + percpu_ref_switch_to_percpu(pte_page->pte_ref); + __pte_put(pte_page); + } break; } From patchwork Fri Apr 29 13:35:50 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qi Zheng X-Patchwork-Id: 12832021 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1F041C433F5 for ; Fri, 29 Apr 2022 13:37:51 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id B7C966B007B; Fri, 29 Apr 2022 09:37:50 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id B05E06B007D; Fri, 29 Apr 2022 09:37:50 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 9588C6B0082; Fri, 29 Apr 2022 09:37:50 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (relay.hostedemail.com [64.99.140.27]) by kanga.kvack.org (Postfix) with ESMTP id 83F8F6B007B for ; Fri, 29 Apr 2022 09:37:50 -0400 (EDT) Received: from smtpin26.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay13.hostedemail.com (Postfix) with ESMTP id 5CB8D6050F for ; Fri, 29 Apr 2022 13:37:50 +0000 (UTC) X-FDA: 79410019500.26.71B88D7 Received: from mail-pj1-f48.google.com (mail-pj1-f48.google.com [209.85.216.48]) by imf14.hostedemail.com (Postfix) with ESMTP id BE5D610002F for ; Fri, 29 Apr 2022 13:37:48 +0000 (UTC) Received: by mail-pj1-f48.google.com with SMTP id z5-20020a17090a468500b001d2bc2743c4so7324987pjf.0 for ; Fri, 29 Apr 2022 06:37:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance-com.20210112.gappssmtp.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=VvydQG5zjUSumdtIipaqFpls71i8yJoag+sGGmgO3rc=; b=7NbcGgSkwHrJY1tx3toqeV+Xk5yX2uYskO+z5MFj6UbbTYBe0q1WO97WvSjnChdIC7 i1KaBtCoZvUzFuHqvg/yX7Sermx9dZh9Iowvqovf+yqAqOU9nR2rvNz8vX4p8YMW9Pr/ lcrLnleXLlHf98P6UtW9P6LVIs3t/qYmRfgfSYvLdrou9KwvgQiL7q66BHZj/JDMNoUC 
D900OhH7LCn0E2x6kfYvQ9+i0goWdK5HbTs5N9gnfI5iJNzW/g4nC1cnBFMZbnmGJu/U 3ZC4/cu0R2dvf0oaLfZZa5zTEhGLX+Dy1LhFT0O/TkQEjvp4dUTJFE4jWAq0xK4ZGERA XgYw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=VvydQG5zjUSumdtIipaqFpls71i8yJoag+sGGmgO3rc=; b=DjwIEDCYMZxxEPUMcyPdXyaMAWY9F/RBwTASd4UXEq1gU100GVEYe/oiYLTCIRTCeV wp+eIYEyxmUMWnl75iGltosLv6eYd39quBFjsGW1NNrBmNI9QlOPbaF8w+JLyQqLjxkB b2TyZyZUojvu76nunWhh6GvSH8xwsoT/Wz9iF5aWB2Me25WMezN///tiL7GXfaF5xg2G 527CsBx1nF6mBiZkDC0s3oa9mcmT3p/PD4APRsOM/LmStKN5S+eGs7MVPPZkAxfq0fcP azmrACRbGE2D0ZpZcCNxKpqjXBp4Z+B6vIHeBnA2JXICpSKxBR8Q7bvhcGk2htY1kq9j 2UwA== X-Gm-Message-State: AOAM533hi+qj4NRAyJTkB9wpe7uDU6cg6dtNK2Bh6okgX9TNmwbl0Jxh pRPd8qx79JF2o7bE8Cj/DebEKS7Gjoopag== X-Google-Smtp-Source: ABdhPJzNnW/PfivTTKMAdGz8oCU9UxMQX9vEZMqhPr9ijXGFPd76Q1h9afEdGPE6Q5AkyqAGhbs2Cw== X-Received: by 2002:a17:90a:784b:b0:1db:dfe6:5d54 with SMTP id y11-20020a17090a784b00b001dbdfe65d54mr3938283pjl.112.1651239468914; Fri, 29 Apr 2022 06:37:48 -0700 (PDT) Received: from localhost.localdomain ([139.177.225.240]) by smtp.gmail.com with ESMTPSA id m8-20020a17090a414800b001d81a30c437sm10681977pjg.50.2022.04.29.06.37.43 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 29 Apr 2022 06:37:48 -0700 (PDT) From: Qi Zheng To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org, dennis@kernel.org, ming.lei@redhat.com Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng Subject: [RFC PATCH 16/18] pte_ref: add track_pte_{set, clear}() helper Date: Fri, 29 Apr 2022 21:35:50 +0800 Message-Id: <20220429133552.33768-17-zhengqi.arch@bytedance.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com> References: <20220429133552.33768-1-zhengqi.arch@bytedance.com> MIME-Version: 1.0 X-Rspamd-Server: rspam02 X-Rspamd-Queue-Id: BE5D610002F X-Stat-Signature: b7gopze8enr5bzia4fun9e175k9x1e7x X-Rspam-User: Authentication-Results: imf14.hostedemail.com; dkim=pass header.d=bytedance-com.20210112.gappssmtp.com header.s=20210112 header.b=7NbcGgSk; spf=pass (imf14.hostedemail.com: domain of zhengqi.arch@bytedance.com designates 209.85.216.48 as permitted sender) smtp.mailfrom=zhengqi.arch@bytedance.com; dmarc=pass (policy=none) header.from=bytedance.com X-HE-Tag: 1651239468-164442 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: The track_pte_set() is used to track the setting of the PTE page table entry, and the percpu_ref of the PTE page table page will be incremented when the entry changes from pte_none() to !pte_none(). The track_pte_clear() is used to track the clearing of the PTE page table entry, and the percpu_ref of the PTE page table page will be decremented when the entry changes from !pte_none() to pte_none(). In this way, the usage of the PTE page table page can be tracked by its percpu_ref. 
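In other words, the percpu_ref of a PTE page table page simply counts its !pte_none() entries (plus any temporary visitors). A minimal sketch of just that rule follows, assuming the pte_ref field on struct page introduced earlier in this series; the mm/pte_ref.c hunk below is the authoritative implementation:

	/* Illustrative sketch only, not part of the patch. */
	static inline void sketch_track_pte_set(struct page *pte_page, pte_t old_pte, pte_t new_pte)
	{
		/* A PTE slot goes from empty to used: take one reference. */
		if (pte_none(old_pte) && !pte_none(new_pte))
			percpu_ref_get(pte_page->pte_ref);
	}

	static inline void sketch_track_pte_clear(struct page *pte_page, pte_t old_pte)
	{
		/* A used PTE slot is cleared: drop one reference. */
		if (!pte_none(old_pte))
			percpu_ref_put(pte_page->pte_ref);
	}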
Signed-off-by: Qi Zheng --- include/linux/pte_ref.h | 14 ++++++++++++++ mm/pte_ref.c | 30 ++++++++++++++++++++++++++++++ 2 files changed, 44 insertions(+) diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h index 379c3b45a6ab..6ab740e1b989 100644 --- a/include/linux/pte_ref.h +++ b/include/linux/pte_ref.h @@ -18,6 +18,10 @@ void __pte_put(pgtable_t page); void pte_put(pte_t *ptep); void try_to_free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr, bool switch_back); +void track_pte_set(struct mm_struct *mm, unsigned long addr, pte_t *ptep, + pte_t pte); +void track_pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep, + pte_t pte); #else /* !CONFIG_FREE_USER_PTE */ @@ -54,6 +58,16 @@ static inline void try_to_free_user_pte(struct mm_struct *mm, pmd_t *pmd, { } +static inline void track_pte_set(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, pte_t pte) +{ +} + +static inline void track_pte_clear(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, pte_t pte) +{ +} + #endif /* CONFIG_FREE_USER_PTE */ #endif /* _LINUX_PTE_REF_H */ diff --git a/mm/pte_ref.c b/mm/pte_ref.c index bf9629272c71..e92510deda0b 100644 --- a/mm/pte_ref.c +++ b/mm/pte_ref.c @@ -197,4 +197,34 @@ void try_to_free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr, } } +void track_pte_set(struct mm_struct *mm, unsigned long addr, pte_t *ptep, + pte_t pte) +{ + pgtable_t page; + + if (&init_mm == mm || pte_huge(pte)) + return; + + page = pte_to_page(ptep); + BUG_ON(percpu_ref_is_zero(page->pte_ref)); + if (pte_none(*ptep) && !pte_none(pte)) + percpu_ref_get(page->pte_ref); +} +EXPORT_SYMBOL(track_pte_set); + +void track_pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep, + pte_t pte) +{ + pgtable_t page; + + if (&init_mm == mm || pte_huge(pte)) + return; + + page = pte_to_page(ptep); + BUG_ON(percpu_ref_is_zero(page->pte_ref)); + if (!pte_none(pte)) + percpu_ref_put(page->pte_ref); +} +EXPORT_SYMBOL(track_pte_clear); + #endif /* CONFIG_FREE_USER_PTE */ From patchwork Fri Apr 29 13:35:51 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qi Zheng X-Patchwork-Id: 12832022 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0B24EC433F5 for ; Fri, 29 Apr 2022 13:37:57 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 95B1F6B0073; Fri, 29 Apr 2022 09:37:56 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 8E2706B0074; Fri, 29 Apr 2022 09:37:56 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 735306B0075; Fri, 29 Apr 2022 09:37:56 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (relay.hostedemail.com [64.99.140.28]) by kanga.kvack.org (Postfix) with ESMTP id 62F1E6B0073 for ; Fri, 29 Apr 2022 09:37:56 -0400 (EDT) Received: from smtpin21.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id 35EBB26AD7 for ; Fri, 29 Apr 2022 13:37:56 +0000 (UTC) X-FDA: 79410019752.21.B51C077 Received: from mail-pl1-f182.google.com (mail-pl1-f182.google.com [209.85.214.182]) by imf27.hostedemail.com (Postfix) with ESMTP id 492994002F for ; Fri, 29 Apr 2022 13:37:54 +0000 (UTC) Received: by mail-pl1-f182.google.com with SMTP id s14so7160323plk.8 for ; Fri, 29 
Apr 2022 06:37:55 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance-com.20210112.gappssmtp.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=PDb352oR+7/T20U1mOURtFBi5DeN+iddATN1VUBhapY=; b=M4jKPxmu6lHLcuN2g3BJoI5FkuKhJPRPks8ERBsNxVTHX3vhgtEMdYENdfW7JcqHwT yIH289VoRZXsV1Haj1Czsaw7RvMgfpNlIK48fcm+b7NNl/oeI7592UtbcrVGAto3j8qC VAobCLwT2x5h91yk+I7lMpecP96Ie33InH4JS0EJJehK9pIzs5ACN6H54/tw0aIyHkt/ wfJfNrXbInpX6Uxy8tB2TVaoW38mmPPuQ4fg2KMeUpipyE6IjHJ7eHpciSWTRoh6yjhe peT1zi4IDbHb6FzrmQ0mld9hn0zaB46duAeGKxPzmtH5lzBQefASzBL34t5JOt4uPvRq 1WyQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=PDb352oR+7/T20U1mOURtFBi5DeN+iddATN1VUBhapY=; b=wA6OkBwjYoPEoR2mpBRr6OpkYquQIJHrRorYds4+2ydea9QYls56bsOX2QgFtcaw8l UXVR3t6Bdwj85wBXrCI/UeGdH+vqapGvYuIDgwYQ5HXloKRe28Rleo//McDKBh/MsEdW uVz+aYkPDv1LQuITydUZox+mnnyCwKtXx4r4iKu2BpAwiuSmBqCuOTqh1A3fdhryZ798 YosMhM6bVS35cNfU3du8u5OOR1O54yfM9jzEluELZKP6r9ZAM3VpWv3GtEeAqAw6GSnc uzLx0kzupg3FVzND4eFvh3e5ZJYtjafifVvMePABagS19WBtx3eJEjXSlDmVcDegEiFX ZTJA== X-Gm-Message-State: AOAM533H+BqikfIKgW5THxVHON4LEgcWt4si5Uyv0uR1ZwYEqYzByiJk Hdr7ZVVCxxBgLBQAsVU7fX5c5g== X-Google-Smtp-Source: ABdhPJxuA4t7O9ORj9b40GN2GlWOmbbs1FpQC2BMOlij9ue4QgL40/Ximfv1xZWE9b9PeECFx/EvXA== X-Received: by 2002:a17:90a:e510:b0:1d8:39b3:280b with SMTP id t16-20020a17090ae51000b001d839b3280bmr4062679pjy.142.1651239474777; Fri, 29 Apr 2022 06:37:54 -0700 (PDT) Received: from localhost.localdomain ([139.177.225.240]) by smtp.gmail.com with ESMTPSA id m8-20020a17090a414800b001d81a30c437sm10681977pjg.50.2022.04.29.06.37.49 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 29 Apr 2022 06:37:54 -0700 (PDT) From: Qi Zheng To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org, dennis@kernel.org, ming.lei@redhat.com Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng Subject: [RFC PATCH 17/18] x86/mm: add x86_64 support for pte_ref Date: Fri, 29 Apr 2022 21:35:51 +0800 Message-Id: <20220429133552.33768-18-zhengqi.arch@bytedance.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com> References: <20220429133552.33768-1-zhengqi.arch@bytedance.com> MIME-Version: 1.0 X-Stat-Signature: p4cgwjnq8b9zte7iuhydjrwzy7a4ffyz X-Rspamd-Server: rspam07 X-Rspamd-Queue-Id: 492994002F X-Rspam-User: Authentication-Results: imf27.hostedemail.com; dkim=pass header.d=bytedance-com.20210112.gappssmtp.com header.s=20210112 header.b=M4jKPxmu; dmarc=pass (policy=none) header.from=bytedance.com; spf=pass (imf27.hostedemail.com: domain of zhengqi.arch@bytedance.com designates 209.85.214.182 as permitted sender) smtp.mailfrom=zhengqi.arch@bytedance.com X-HE-Tag: 1651239474-725544 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Add pte_ref hooks into routines that modify user PTE page tables, and select ARCH_SUPPORTS_FREE_USER_PTE, so that the pte_ref code can be compiled and worked on this architecture. 
Signed-off-by: Qi Zheng --- arch/x86/Kconfig | 1 + arch/x86/include/asm/pgtable.h | 7 ++++++- 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index b0142e01002e..c1046fc15882 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -34,6 +34,7 @@ config X86_64 select SWIOTLB select ARCH_HAS_ELFCORE_COMPAT select ZONE_DMA32 + select ARCH_SUPPORTS_FREE_USER_PTE config FORCE_DYNAMIC_FTRACE def_bool y diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 62ab07e24aef..08d0aa5ce8d4 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -23,6 +23,7 @@ #include #include #include +#include extern pgd_t early_top_pgt[PTRS_PER_PGD]; bool __init __early_make_pgtable(unsigned long address, pmdval_t pmd); @@ -1010,6 +1011,7 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte) { page_table_check_pte_set(mm, addr, ptep, pte); + track_pte_set(mm, addr, ptep, pte); set_pte(ptep, pte); } @@ -1055,6 +1057,7 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr, { pte_t pte = native_ptep_get_and_clear(ptep); page_table_check_pte_clear(mm, addr, pte); + track_pte_clear(mm, addr, ptep, pte); return pte; } @@ -1071,6 +1074,7 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm, */ pte = native_local_ptep_get_and_clear(ptep); page_table_check_pte_clear(mm, addr, pte); + track_pte_clear(mm, addr, ptep, pte); } else { pte = ptep_get_and_clear(mm, addr, ptep); } @@ -1081,7 +1085,8 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm, static inline void ptep_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep) { - if (IS_ENABLED(CONFIG_PAGE_TABLE_CHECK)) + if (IS_ENABLED(CONFIG_PAGE_TABLE_CHECK) + || IS_ENABLED(CONFIG_FREE_USER_PTE)) ptep_get_and_clear(mm, addr, ptep); else pte_clear(mm, addr, ptep); From patchwork Fri Apr 29 13:35:52 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qi Zheng X-Patchwork-Id: 12832023 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 76F44C433EF for ; Fri, 29 Apr 2022 13:38:03 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 16A536B0074; Fri, 29 Apr 2022 09:38:03 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 0F4166B0075; Fri, 29 Apr 2022 09:38:03 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id EB0EA6B007D; Fri, 29 Apr 2022 09:38:02 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (relay.hostedemail.com [64.99.140.28]) by kanga.kvack.org (Postfix) with ESMTP id DA8226B0074 for ; Fri, 29 Apr 2022 09:38:02 -0400 (EDT) Received: from smtpin16.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id BBBE326BFB for ; Fri, 29 Apr 2022 13:38:02 +0000 (UTC) X-FDA: 79410020004.16.2ED6AC4 Received: from mail-pl1-f181.google.com (mail-pl1-f181.google.com [209.85.214.181]) by imf06.hostedemail.com (Postfix) with ESMTP id 6AB2518006B for ; Fri, 29 Apr 2022 13:38:00 +0000 (UTC) Received: by mail-pl1-f181.google.com with SMTP id j8so7146229pll.11 for ; Fri, 29 Apr 2022 06:38:01 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; 
d=bytedance-com.20210112.gappssmtp.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=eRy8MpvNC58F/UyXGBAubwgn5yhnXW5DArHTiriXqOA=; b=TkkXGSkcHeQ2HN+4kQyis5zmihRY3+P5XKbLE56DY4YYoqqTSnBAZQSkt8HShcDL1Y pna6lXKQks7nCKFwAb8lPv2ELLP1EHP7BC0L+tEblOUQ2tjXbPofhtBCEL4RHSKUiM+T SPXWW+3xIBJJPl60oVBE1kbw0YfRp5hBgkURC7WcxvTmgKuysqThrwaHfS2f/ZPJckX2 iRS4JQjouD9CEKnyfCkh5QKVkKqvMDxXrYZFXA7bjAukG7CzEOo/V1CQpcoLZehvz0Ll t8R5QN2CL7ruHzsj5cLWt/UWgvclVrA8+xFk2tgGWIk8gd2+/YwSVDmRPNjq4+Ef8xwN DLww== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=eRy8MpvNC58F/UyXGBAubwgn5yhnXW5DArHTiriXqOA=; b=bu6nNT8Tpk/PwWwNBFjemh2wIOIp1FRgn4SsWZxdVoOLYtyYREUNvjtxvsKyQ6xKzA 7UsQuiFGtqcT+cN1FCWG2x1Wr/Ri703ej4TjLy02YqoKXyBf/2fntJYNWTVWA5/gkPk3 uaEuXqtSyeY4vIM67xt3TPWEs36TDwN1GtwKGuSaQ51lta9NIfKIiEc6I0/CY4IjT0KV yrMdsRhGvxWiftx7dsS6IVfx1ldyaX7HoXHzx9/ZUGTUxbvotklDpKKWM/0iosFy1f81 YbtuZZkQ2lxCKP2ymJWQnNiR5xyXL3wYv5KwQO7IV52U9KEIybqyyjtVXNmZL6b+20gy 6LXA== X-Gm-Message-State: AOAM532fhZdBLCcLmmS5xiDT0VsIMpgbPDxW9X5R8HRpZXMUAoGBkY33 65CjPfAN7KECS9oAV06AQ+KJWg== X-Google-Smtp-Source: ABdhPJxSWULOjvO6LZb8fH5rIPVk36LG3U/StG5RIQqrXaps+4e7g0yd+bWTzUdJKU9VKCIdUNmIXg== X-Received: by 2002:a17:902:cec3:b0:15d:242c:477d with SMTP id d3-20020a170902cec300b0015d242c477dmr22380034plg.54.1651239480766; Fri, 29 Apr 2022 06:38:00 -0700 (PDT) Received: from localhost.localdomain ([139.177.225.240]) by smtp.gmail.com with ESMTPSA id m8-20020a17090a414800b001d81a30c437sm10681977pjg.50.2022.04.29.06.37.55 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 29 Apr 2022 06:38:00 -0700 (PDT) From: Qi Zheng To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com, tj@kernel.org, dennis@kernel.org, ming.lei@redhat.com Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng Subject: [RFC PATCH 18/18] Documentation: add document for pte_ref Date: Fri, 29 Apr 2022 21:35:52 +0800 Message-Id: <20220429133552.33768-19-zhengqi.arch@bytedance.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com> References: <20220429133552.33768-1-zhengqi.arch@bytedance.com> MIME-Version: 1.0 X-Rspamd-Queue-Id: 6AB2518006B X-Stat-Signature: 6pxe3ya1rk8q4kq4k5ocqzjo38yhujfr Authentication-Results: imf06.hostedemail.com; dkim=pass header.d=bytedance-com.20210112.gappssmtp.com header.s=20210112 header.b=TkkXGSkc; dmarc=pass (policy=none) header.from=bytedance.com; spf=pass (imf06.hostedemail.com: domain of zhengqi.arch@bytedance.com designates 209.85.214.181 as permitted sender) smtp.mailfrom=zhengqi.arch@bytedance.com X-Rspam-User: X-Rspamd-Server: rspam08 X-HE-Tag: 1651239480-790440 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: This commit adds document for pte_ref under `Documentation/vm/`. 
Signed-off-by: Qi Zheng --- Documentation/vm/index.rst | 1 + Documentation/vm/pte_ref.rst | 210 +++++++++++++++++++++++++++++++++++ 2 files changed, 211 insertions(+) create mode 100644 Documentation/vm/pte_ref.rst diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst index 44365c4574a3..ee71baccc2e7 100644 --- a/Documentation/vm/index.rst +++ b/Documentation/vm/index.rst @@ -31,6 +31,7 @@ algorithms. If you are looking for advice on simply allocating memory, see the page_frags page_owner page_table_check + pte_ref remap_file_pages slub split_page_table_lock diff --git a/Documentation/vm/pte_ref.rst b/Documentation/vm/pte_ref.rst new file mode 100644 index 000000000000..0ac1e5a408d7 --- /dev/null +++ b/Documentation/vm/pte_ref.rst @@ -0,0 +1,210 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================================================ +pte_ref: Tracking the number of references to each user PTE page table page +============================================================================ + +Preface +======= + +In order to pursue high performance, applications mostly use some +high-performance user-mode memory allocators, such as jemalloc or tcmalloc. +These memory allocators use madvise(MADV_DONTNEED or MADV_FREE) to release +physical memory for the following reasons:: + + First of all, we should hold as few write locks of mmap_lock as possible, + since the mmap_lock semaphore has long been a contention point in the + memory management subsystem. The mmap()/munmap() paths hold the write lock, + and the madvise(MADV_DONTNEED or MADV_FREE) path holds the read lock, so + using madvise() instead of munmap() to release physical memory reduces + contention on the mmap_lock. + + Secondly, after using madvise() to release physical memory, there is no + need to build a vma and allocate page tables again when the same virtual + address is accessed again, which also saves some time. + +The following table shows the largest amount of user PTE page table memory that +can be allocated by a single user process on a 32-bit and a 64-bit system. + ++---------------------------+--------+---------+ +| | 32-bit | 64-bit | ++===========================+========+=========+ +| user PTE page table pages | 3 MiB | 512 GiB | ++---------------------------+--------+---------+ +| user PMD page table pages | 3 KiB | 1 GiB | ++---------------------------+--------+---------+ + +(for 32-bit, take a 3G user address space and 4K page size as an example; + for 64-bit, take a 48-bit address width and 4K page size as an example.) + +After using madvise(), everything looks good, but as can be seen from the +above table, a single process can create a large number of PTE page tables +on a 64-bit system, since neither MADV_DONTNEED nor MADV_FREE will +release page table memory. And before the process exits or calls munmap(), +the kernel cannot reclaim these pages even if these PTE page tables do not +map anything. + +To fix this situation, we introduce a reference count for each user PTE page +table page. Then we can track whether a user PTE page table page is in use +and reclaim the user PTE page table pages that do not map anything at +the right time. + +Introduction +============ + +The ``pte_ref``, which is the reference count of a user PTE page table page, +is of ``percpu_ref`` type. It is used to track the usage of each user PTE +page table page. + +Who will hold the pte_ref? 
+-------------------------- + +The following will hold a pte_ref:: + + Any !pte_none() entry, such as a regular page table entry that maps a + physical page, a swap entry, a migration entry, etc. + + Any visitor to the PTE page table entries, such as a page table walker. + +Any ``!pte_none()`` entry and any visitor can be regarded as a user of the PTE +page table page. When the pte_ref drops to 0, it means that no one is +using the PTE page table page, and this free PTE page table page can then be +reclaimed. + +About mode switching +-------------------- + +When a user PTE page table page is allocated, its ``pte_ref`` will be initialized +to percpu mode, which brings essentially no performance overhead. When we +want to reclaim the PTE page, it will be switched to atomic mode. Then we can +check whether the ``pte_ref`` is zero:: + + - If it is zero, we can safely reclaim it immediately; + - If it is not zero but we expect that the PTE page can be reclaimed + automatically when no one is using it, we can keep its ``pte_ref`` in + atomic mode (e.g. the MADV_FREE case); + - If it is not zero and we will try again at the next opportunity, + then we can choose to switch back to percpu mode (e.g. the MADV_DONTNEED case). + +Competitive relationship +------------------------ + +Currently, the user page table will only be released by calling ``free_pgtables()`` +when the process exits or ``unmap_region()`` is called (e.g. the ``munmap()`` path). +So other threads only need to ensure mutual exclusion with these paths to guarantee +that the page table is not released. For example:: + + thread A thread B + page table walker munmap + ================= ====== + + mmap_read_lock() + if (!pte_none() && pte_present() && !pmd_trans_unstable()) { + pte_offset_map_lock() + *walk page table* + pte_unmap_unlock() + } + mmap_read_unlock() + + mmap_write_lock_killable() + detach_vmas_to_be_unmapped() + unmap_region() + --> free_pgtables() + +But after we introduce the ``pte_ref`` for the user PTE page table page, this +existing balance is broken. The page can be released at any time once its +``pte_ref`` drops to 0. Therefore, the following case may happen:: + + thread A thread B thread C + page table walker madvise(MADV_DONTNEED) page fault + ================= ====================== ========== + + mmap_read_lock() + if (!pte_none() && pte_present() && !pmd_trans_unstable()) { + + mmap_read_lock() + unmap_page_range() + --> zap_pte_range() + /* the pte_ref is reduced to 0 */ + --> free PTE page table page + + mmap_read_lock() + /* may allocate + * a new huge + * pmd or a new + * PTE page + */ + + /* broken!! */ + pte_offset_map_lock() + +As we can see, threads A, B and C all hold the read lock of mmap_lock, so +they can execute concurrently. When thread B releases the PTE page table page, +the value in the corresponding pmd entry becomes unstable, which may be +none or a huge pmd, or map a new PTE page table page again. This will cause system +chaos and even a panic. 
+ +So as described in the section "Who will hold the pte_ref?", the page table +walker (visitor) also needs to try to take a ``pte_ref`` on the user PTE page +table page before walking the page table (the helper ``pte_tryget_map{_lock}()`` +can do this for us); then the system becomes orderly again:: + + thread A thread B + page table walker madvise(MADV_DONTNEED) + ================= ====================== + + mmap_read_lock() + if (!pte_none() && pte_present() && !pmd_trans_unstable()) { + pte_tryget() + --> percpu_ref_tryget + *if successful, then:* + + mmap_read_lock() + unmap_page_range() + --> zap_pte_range() + /* the pte_refcount is reduced to 1 */ + + pte_offset_map_lock() + *walk page table* + pte_unmap_unlock() + +There is also a lock-less scenario (such as fast GUP). Fortunately, we don't need +to do any additional operations to ensure that the system stays in order. Take fast +GUP as an example:: + + thread A thread B + fast GUP madvise(MADV_DONTNEED) + ======== ====================== + + get_user_pages_fast_only() + --> local_irq_save(); + call_rcu(pte_free_rcu) + gup_pgd_range(); + local_irq_restore(); + /* do pte_free_rcu() */ + +Helpers +======= + ++----------------------+------------------------------------------------+ +| pte_ref_init | Initialize the pte_ref | ++----------------------+------------------------------------------------+ +| pte_ref_free | Free the pte_ref | ++----------------------+------------------------------------------------+ +| pte_tryget | Try to hold a pte_ref | ++----------------------+------------------------------------------------+ +| pte_put | Decrement a pte_ref | ++----------------------+------------------------------------------------+ +| pte_tryget_map | Do pte_tryget and pte_offset_map | ++----------------------+------------------------------------------------+ +| pte_tryget_map_lock | Do pte_tryget and pte_offset_map_lock | ++----------------------+------------------------------------------------+ +| free_user_pte | Free the user PTE page table page | ++----------------------+------------------------------------------------+ +| try_to_free_user_pte | Try to free the user PTE page table page | ++----------------------+------------------------------------------------+ +| track_pte_set | Track the setting of user PTE page table page | ++----------------------+------------------------------------------------+ +| track_pte_clear | Track the clearing of user PTE page table page | ++----------------------+------------------------------------------------+ +
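For orientation, here is a hedged sketch of how a page table walker might use these helpers; pte_put() taking a pte_t * matches the pte_ref.h hunk earlier in this series, while the exact pte_tryget_map_lock() signature is assumed here purely for illustration:

	/* Hypothetical usage sketch, not part of the patch. */
	static void walk_one_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
	{
		spinlock_t *ptl;
		pte_t *pte;

		/* Takes a pte_ref and maps the PTE page; fails if the page is already gone. */
		pte = pte_tryget_map_lock(mm, pmd, addr, &ptl);
		if (!pte)
			return;

		/* ... read or modify *pte here ... */

		pte_unmap_unlock(pte, ptl);
		pte_put(pte);	/* drop the reference taken by pte_tryget_map_lock() */
	}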