From: Peter Xu <peterx@redhat.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Muchun Song, John Hubbard, Andrea Arcangeli, James Houghton, Jann Horn, Rik van Riel, Miaohe Lin, Andrew Morton, Mike Kravetz, peterx@redhat.com, David Hildenbrand, Nadav Amit
Subject: [PATCH v2 03/10] mm/hugetlb: Document huge_pte_offset usage
Date: Wed, 7 Dec 2022 15:30:27 -0500
Message-Id: <20221207203034.650899-4-peterx@redhat.com>
In-Reply-To: <20221207203034.650899-1-peterx@redhat.com>
References: <20221207203034.650899-1-peterx@redhat.com>
X-Patchwork-Submitter: Peter Xu
X-Patchwork-Id: 13067580

huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a
hugetlb address.

Normally, it's always safe to walk a generic pgtable as long as the mmap
lock is held for either read or write, because that guarantees the pgtable
pages will stay valid during the walk.

But that's not true for hugetlbfs, especially for shared mappings:
hugetlbfs can have its pgtable freed by pmd unsharing. This means that even
with the mmap lock held for the current mm, the PMD pgtable page can still
go away from under us if pmd unsharing is possible during the walk.

So we have two ways to make it safe even for a shared mapping:

  (1) If the hugetlb vma lock is held for either read or write, it's okay
      because pmd unshare cannot happen at all.

  (2) If the i_mmap_rwsem lock is held for either read or write, it's okay
      because even if pmd unshare can happen, the pgtable page cannot be
      freed from under us.

Document it.

Signed-off-by: Peter Xu
Reviewed-by: John Hubbard
Reviewed-by: David Hildenbrand
---
 include/linux/hugetlb.h | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 551834cd5299..81efd9b9baa2 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -192,6 +192,38 @@ extern struct list_head huge_boot_pages;
 
 pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long addr, unsigned long sz);
+/*
+ * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
+ * Returns the pte_t* if found, or NULL if the address is not mapped.
+ *
+ * Since this function will walk all the pgtable pages (including not only
+ * high-level pgtable page, but also PUD entry that can be unshared
+ * concurrently for VM_SHARED), the caller of this function should be
+ * responsible of its thread safety.  One can follow this rule:
+ *
+ *  (1) For private mappings: pmd unsharing is not possible, so it'll
+ *      always be safe if we're with the mmap sem for either read or write.
+ *      This is normally always the case, IOW we don't need to do anything
+ *      special.
+ *
+ *  (2) For shared mappings: pmd unsharing is possible (so the PUD-ranged
+ *      pgtable page can go away from under us!  It can be done by a pmd
+ *      unshare with a follow up munmap() on the other process), then we
+ *      need either:
+ *
+ *     (2.1) hugetlb vma lock read or write held, to make sure pmd unshare
+ *           won't happen upon the range (it also makes sure the pte_t we
+ *           read is the right and stable one), or,
+ *
+ *     (2.2) hugetlb mapping i_mmap_rwsem lock held read or write, to make
+ *           sure even if unshare happened the racy unmap() will wait until
+ *           i_mmap_rwsem is released.
+ *
+ * Option (2.1) is the safest, which guarantees pte stability from pmd
+ * sharing pov, until the vma lock released.  Option (2.2) doesn't protect
+ * a concurrent pmd unshare, but it makes sure the pgtable page is safe to
+ * access.
+ */
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz);
 unsigned long hugetlb_mask_last_page(struct hstate *h);
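
For readers who want the rule in the comment above as a decision table, here is a small userspace sketch. It is not kernel code and not part of the patch: struct held_locks, enum walk_safety, classify_walk() and walk_safety_name() are hypothetical names made up for illustration, and the model only encodes what the comment states. In particular, it assumes a private mapping walked without the mmap lock is simply "not covered" and reports it as unsafe.

```c
/* Userspace model of the documented locking rule; NOT kernel code. */
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: locks the caller holds, in either read or write mode. */
struct held_locks {
	bool mmap_lock;		/* mmap lock of the mm being walked */
	bool vma_lock;		/* hugetlb vma lock */
	bool i_mmap_rwsem;	/* i_mmap_rwsem of the hugetlb mapping */
};

/* Hypothetical: what the caller may assume while walking. */
enum walk_safety {
	WALK_UNSAFE,		/* pgtable page may be freed under the walker */
	WALK_PGTABLE_SAFE,	/* page stays valid, pmd unshare may still race */
	WALK_NO_UNSHARE,	/* pmd unshare cannot happen during the walk */
};

/* Encode rules (1), (2.1) and (2.2) from the comment. */
enum walk_safety classify_walk(bool shared, struct held_locks l)
{
	if (!shared)	/* rule (1): no pmd sharing on private mappings */
		return l.mmap_lock ? WALK_NO_UNSHARE : WALK_UNSAFE;
	if (l.vma_lock)	/* rule (2.1): unshare is excluded entirely */
		return WALK_NO_UNSHARE;
	if (l.i_mmap_rwsem)	/* rule (2.2): page cannot be freed */
		return WALK_PGTABLE_SAFE;
	return WALK_UNSAFE;	/* mmap lock alone is not enough when shared */
}

/* Human-readable label for a classification result. */
const char *walk_safety_name(enum walk_safety s)
{
	switch (s) {
	case WALK_UNSAFE:
		return "unsafe";
	case WALK_PGTABLE_SAFE:
		return "pgtable page safe, unshare may race";
	case WALK_NO_UNSHARE:
		return "safe, no concurrent unshare";
	}
	assert(0 && "unknown walk_safety value");
	return "";
}
```

The case the patch is warning about is classify_walk(true, ...) with only mmap_lock set: the model returns WALK_UNSAFE there, while adding i_mmap_rwsem gives WALK_PGTABLE_SAFE and adding the vma lock gives WALK_NO_UNSHARE, matching "Option (2.1) is the safest".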