From patchwork Mon Aug 21 19:51:20 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Hugh Dickins
X-Patchwork-Id: 13359770
Date: Mon, 21 Aug 2023 12:51:20 -0700 (PDT)
From: Hugh Dickins
To: Andrew Morton
cc: Jann Horn, Mike Kravetz, Mike Rapoport, "Kirill A. Shutemov",
    Matthew Wilcox, David Hildenbrand, Suren Baghdasaryan, Qi Zheng,
    Yang Shi, Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
    Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
    SeongJae Park, Lorenzo Stoakes, Huang Ying, Naoya Horiguchi,
    Christophe Leroy, Zack Rusin, Jason Gunthorpe, Axel Rasmussen,
    Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim,
    Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King,
    "David S. Miller", Michael Ellerman, "Aneesh Kumar K.V",
    Heiko Carstens, Christian Borntraeger, Claudio Imbrenda,
    Alexander Gordeev, Gerald Schaefer, Vasily Gorbik, Vishal Moola,
    Vlastimil Babka, Zi Yan, Zach O'Keefe, Linux ARM,
    sparclinux@vger.kernel.org, linuxppc-dev, linux-s390, kernel list,
    Linux-MM
Subject: [PATCH mm-unstable] mm/khugepaged: fix collapse_pte_mapped_thp()
    versus uffd
Message-ID: <4d31abf5-56c0-9f3d-d12f-c9317936691@google.com>
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp()
thought it had emptied: page lock on the huge page is enough to protect
against WP faults (which find the PTE has been cleared), but not enough
to protect against userfaultfd. "BUG: Bad rss-counter state" followed.

retract_page_tables() protects against this by checking !vma->anon_vma;
but we know that MADV_COLLAPSE needs to be able to work on private shmem
mappings, even those with an anon_vma prepared for another part of the
mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
mappings which are userfaultfd_armed(). Whether it needs to work on
private shmem mappings which are userfaultfd_armed(), I'm not so sure:
but assume that it does.

Just for this case, take the pmd_lock() two steps earlier: not because
it gives any protection against this case itself, but because ptlock
nests inside it, and it's the dropping of ptlock which let the bug in.
In other cases, continue to minimize the pmd_lock() hold time.

Reported-by: Jann Horn
Closes: https://lore.kernel.org/linux-mm/CAG48ez0FxiRC4d3VTu_a9h=rg5FW-kYD5Rg5xo_RDBM0LTTqZQ@mail.gmail.com/
Fixes: 1043173eb5eb ("mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()")
Signed-off-by: Hugh Dickins
Acked-by: Peter Xu
---
 mm/khugepaged.c | 38 +++++++++++++++++++++++++++++---------
 1 file changed, 29 insertions(+), 9 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 40d43eccdee8..d5650541083a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1476,7 +1476,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	struct page *hpage;
 	pte_t *start_pte, *pte;
 	pmd_t *pmd, pgt_pmd;
-	spinlock_t *pml, *ptl;
+	spinlock_t *pml = NULL, *ptl;
 	int nr_ptes = 0, result = SCAN_FAIL;
 	int i;
 
@@ -1572,9 +1572,25 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 				haddr, haddr + HPAGE_PMD_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
 	notified = true;
-	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
+
+	/*
+	 * pmd_lock covers a wider range than ptl, and (if split from mm's
+	 * page_table_lock) ptl nests inside pml. The less time we hold pml,
+	 * the better; but userfaultfd's mfill_atomic_pte() on a private VMA
+	 * inserts a valid as-if-COWed PTE without even looking up page cache.
+	 * So page lock of hpage does not protect from it, so we must not drop
+	 * ptl before pgt_pmd is removed, so uffd private needs pml taken now.
+	 */
+	if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
+		pml = pmd_lock(mm, pmd);
+
+	start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
 	if (!start_pte)		/* mmap_lock + page lock should prevent this */
 		goto abort;
+	if (!pml)
+		spin_lock(ptl);
+	else if (ptl != pml)
+		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
 
 	/* step 2: clear page table and adjust rmap */
 	for (i = 0, addr = haddr, pte = start_pte;
@@ -1608,7 +1624,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		nr_ptes++;
 	}
 
-	pte_unmap_unlock(start_pte, ptl);
+	pte_unmap(start_pte);
+	if (!pml)
+		spin_unlock(ptl);
 
 	/* step 3: set proper refcount and mm_counters. */
 	if (nr_ptes) {
@@ -1616,12 +1634,12 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
 	}
 
-	/* step 4: remove page table */
-
-	/* Huge page lock is still held, so page table must remain empty */
-	pml = pmd_lock(mm, pmd);
-	if (ptl != pml)
-		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+	/* step 4: remove empty page table */
+	if (!pml) {
+		pml = pmd_lock(mm, pmd);
+		if (ptl != pml)
+			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+	}
 	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
 	pmdp_get_lockless_sync();
 	if (ptl != pml)
@@ -1648,6 +1666,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	}
 	if (start_pte)
 		pte_unmap_unlock(start_pte, ptl);
+	if (pml && pml != ptl)
+		spin_unlock(pml);
 	if (notified)
 		mmu_notifier_invalidate_range_end(&range);
 drop_hpage: