From patchwork Thu Sep 26 06:46:14 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 13812857
From: Qi Zheng
To: david@redhat.com, hughd@google.com, willy@infradead.org,
	muchun.song@linux.dev, vbabka@kernel.org, akpm@linux-foundation.org,
	rppt@kernel.org, vishal.moola@gmail.com, peterx@redhat.com,
	ryan.roberts@arm.com, christophe.leroy2@cs-soprasteria.com
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	linux-arm-kernel@lists.infradead.org, linuxppc-dev@lists.ozlabs.org,
	Qi Zheng
Subject: [PATCH v5 01/13] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
Date: Thu, 26 Sep 2024 14:46:14 +0800
Message-Id: <5aeecfa131600a454b1f3a038a1a54282ca3b856.1727332572.git.zhengqi.arch@bytedance.com>
X-Mailer: git-send-email 2.24.3 (Apple Git-128)

Currently, the usage of pte_offset_map_nolock() can be divided into the
following two cases:

1) After acquiring PTL, only read-only
   operations are performed on the PTE page. In this case, the RCU lock in
   pte_offset_map_nolock() will ensure that the PTE page will not be freed,
   and there is no need to worry about whether the pmd entry is modified.

2) After acquiring PTL, the pte or pmd entries may be modified. At this
   time, we need to ensure that the pmd entry has not been modified
   concurrently.

To more clearly distinguish between these two cases, this commit
introduces two new helper functions to replace pte_offset_map_nolock().
For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
to changing the name to pte_offset_map_rw_nolock(), it also outputs the
pmdval when successful. It is applicable for may-write cases where
modifications to the page table may happen once the corresponding
spinlock is held. But the users should make sure the page table is
stable, for example by checking pte_same() or by checking pmd_same()
against the output pmdval, before performing the write operations.

Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
will be read-only/read-write protected.

Subsequent commits will convert pte_offset_map_nolock() into the above two
functions one by one, and finally completely delete it.
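As a rough illustration of the may-write pattern described above, the
following is a userspace model only, not kernel code. Every model_* name,
the integer encoding of the pmd value, the flag standing in for the PTL and
the 'race' hook are assumptions made for this sketch. It shows a caller
taking the pmd snapshot from the nolock helper, acquiring the lock, and
rechecking the pmd against the snapshot before writing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_TABLES		2
#define MODEL_PTRS_PER_PTE	4

/* Userspace stand-ins; none of these are the kernel's real types. */
typedef struct { unsigned long val; } model_pmd_t;	/* 0 means "no PTE table" */
typedef struct { unsigned long val; } model_pte_t;
typedef struct { bool held; } model_lock_t;		/* stands in for the PTL */

struct model_pte_table {
	model_lock_t ptl;
	model_pte_t pte[MODEL_PTRS_PER_PTE];
};

static struct model_pte_table model_tables[MODEL_NR_TABLES];

static bool model_pmd_same(model_pmd_t a, model_pmd_t b)
{
	return a.val == b.val;
}

/* Assumed encoding: pmd value N (N >= 1) refers to model_tables[N - 1]. */
static struct model_pte_table *model_pmd_table(model_pmd_t pmdval)
{
	if (pmdval.val == 0 || pmdval.val > MODEL_NR_TABLES)
		return NULL;
	return &model_tables[pmdval.val - 1];
}

static void model_lock(model_lock_t *l)
{
	assert(!l->held);
	l->held = true;
}

static void model_unlock(model_lock_t *l)
{
	assert(l->held);
	l->held = false;
}

/*
 * Modelled on pte_offset_map_rw_nolock(): read the pmd once, then derive
 * both the returned pte pointer and the lock pointer from that snapshot,
 * and hand the snapshot back to the caller through pmdvalp.
 */
static model_pte_t *model_map_rw_nolock(model_pmd_t *pmd, unsigned long idx,
					model_pmd_t *pmdvalp,
					model_lock_t **ptlp)
{
	struct model_pte_table *table;

	assert(pmdvalp != NULL);	/* models VM_WARN_ON_ONCE(!pmdvalp) */
	*pmdvalp = *pmd;
	table = model_pmd_table(*pmdvalp);
	if (!table)
		return NULL;
	*ptlp = &table->ptl;
	return &table->pte[idx % MODEL_PTRS_PER_PTE];
}

/*
 * May-write caller: map without locking, take the lock, then recheck the
 * pmd against the snapshot before writing. 'race' is a hypothetical hook
 * that lets a test change *pmd in the window between mapping and locking.
 * Returns true only if the entry was written.
 */
static bool model_set_pte(model_pmd_t *pmd, unsigned long idx,
			  unsigned long newval,
			  void (*race)(model_pmd_t *pmd))
{
	model_pmd_t pmdval;
	model_lock_t *ptl;
	model_pte_t *pte;
	bool written = false;

	pte = model_map_rw_nolock(pmd, idx, &pmdval, &ptl);
	if (!pte)
		return false;
	if (race)
		race(pmd);
	model_lock(ptl);
	if (model_pmd_same(pmdval, *pmd)) {
		pte->val = newval;
		written = true;
	}
	model_unlock(ptl);
	return written;
}

/* Hypothetical concurrent change: point the pmd at the second table. */
static void model_retarget_pmd(model_pmd_t *pmd)
{
	pmd->val = 2;
}
```

Calling model_set_pte() with a NULL hook writes the entry. Passing
model_retarget_pmd as the hook changes the pmd between the map and the
lock, so the recheck fails and the write is skipped. What a real caller
does in that case (retry or bail out) is up to that caller and is not
modelled here.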
Signed-off-by: Qi Zheng
Reviewed-by: Muchun Song
Acked-by: David Hildenbrand
---
 Documentation/mm/split_page_table_lock.rst |  7 ++++
 include/linux/mm.h                         |  5 +++
 mm/pgtable-generic.c                       | 48 ++++++++++++++++++++++
 3 files changed, 60 insertions(+)

diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
index e4f6972eb6c04..08d0e706a32db 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -19,6 +19,13 @@ There are helpers to lock/unlock a table and other accessor functions:
  - pte_offset_map_nolock()
 	maps PTE, returns pointer to PTE with pointer to its PTE table
 	lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map_ro_nolock()
+	maps PTE, returns pointer to PTE with pointer to its PTE table
+	lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map_rw_nolock()
+	maps PTE, returns pointer to PTE with pointer to its PTE table
+	lock (not taken) and the value of its pmd entry, or returns NULL
+	if no PTE table;
  - pte_offset_map()
 	maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
  - pte_unmap()
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e9077ab169723..46828b9a74f2c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3017,6 +3017,11 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
 
 pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
 			unsigned long addr, spinlock_t **ptlp);
+pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long addr, spinlock_t **ptlp);
+pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long addr, pmd_t *pmdvalp,
+				spinlock_t **ptlp);
 
 #define pte_unmap_unlock(pte, ptl)	do {		\
 	spin_unlock(ptl);				\
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index a78a4adf711ac..daa08b91ab6b2 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -317,6 +317,31 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
 	return pte;
 }
 
+pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long addr, spinlock_t **ptlp)
+{
+	pmd_t pmdval;
+	pte_t *pte;
+
+	pte = __pte_offset_map(pmd, addr, &pmdval);
+	if (likely(pte))
+		*ptlp = pte_lockptr(mm, &pmdval);
+	return pte;
+}
+
+pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long addr, pmd_t *pmdvalp,
+				spinlock_t **ptlp)
+{
+	pte_t *pte;
+
+	VM_WARN_ON_ONCE(!pmdvalp);
+	pte = __pte_offset_map(pmd, addr, pmdvalp);
+	if (likely(pte))
+		*ptlp = pte_lockptr(mm, pmdvalp);
+	return pte;
+}
+
 /*
  * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
  * __pte_offset_map_lock() below, is usually called with the pmd pointer for
@@ -356,6 +381,29 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
  * recheck *pmd once the lock is taken; in practice, no callsite needs that -
  * either the mmap_lock for write, or pte_same() check on contents, is enough.
  *
+ * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
+ * but when successful, it also outputs a pointer to the spinlock in ptlp - as
+ * pte_offset_map_lock() does, but in this case without locking it. This helps
+ * the caller to avoid a later pte_lockptr(mm, *pmd), which might by that time
+ * act on a changed *pmd: pte_offset_map_ro_nolock() provides the correct spinlock
+ * pointer for the page table that it returns. Even after grabbing the spinlock,
+ * we might be looking either at a page table that is still mapped or one that
+ * was unmapped and is about to get freed. But for R/O access this is sufficient.
+ * So it is only applicable for read-only cases where any modification operations
+ * to the page table are not allowed even if the corresponding spinlock is held
+ * afterwards.
+ *
+ * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
+ * pte_offset_map_ro_nolock(); but when successful, it also outputs the pmdval.
+ * It is applicable for may-write cases where any modification operations to the
+ * page table may happen after the corresponding spinlock is held afterwards.
+ * But the users should make sure the page table is stable like checking pte_same()
+ * or checking pmd_same() by using the output pmdval before performing the write
+ * operations.
+ *
+ * Note: "RO" / "RW" expresses the intended semantics, not that the *kmap* will
+ * be read-only/read-write protected.
+ *
  * Note that free_pgtables(), used after unmapping detached vmas, or when
  * exiting the whole mm, does not take page table lock before freeing a page
  * table, and may not use RCU at all: "outsiders" like khugepaged should avoid