From patchwork Fri Dec 25 09:25:28 2020
X-Patchwork-Submitter: Nadav Amit
X-Patchwork-Id: 11989953
From: Nadav Amit
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Nadav Amit, Andrea Arcangeli, Yu Zhao,
    Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport,
    Minchan Kim, Will Deacon, Peter Zijlstra
Subject: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
Date: Fri, 25 Dec 2020 01:25:28 -0800
Message-Id: <20201225092529.3228466-2-namit@vmware.com>
In-Reply-To: <20201225092529.3228466-1-namit@vmware.com>
References: <20201225092529.3228466-1-namit@vmware.com>

From: Nadav Amit

Userfaultfd self-test fails occasionally, indicating a memory corruption.

Analyzing this problem indicates that there is a real bug: mmap_lock is only
taken for read in mwriteprotect_range(), which defers its TLB flushes, and
wp_page_copy() does not sufficiently account for such concurrent deferred TLB
flushes. Although the PTE is flushed from the TLBs in wp_page_copy(), this
flush takes place after the copy has already been performed, so the page can
still be modified between the time of the copy and the time at which the PTE
is flushed.

To make matters worse, memory-unprotection using userfaultfd also poses a
problem. Although memory unprotection is logically a promotion of PTE
permissions, and therefore should not require a TLB flush, the current
userfaultfd code might actually cause a demotion of the architectural PTE
permission: when userfaultfd_writeprotect() unprotects a memory region, it
unintentionally *clears* the RW-bit if it was already set. Note that
unprotecting a PTE that is not write-protected is a valid use-case: the
userfaultfd monitor might ask to unprotect a region that holds both
write-protected and write-unprotected PTEs.

The scenario that happens in selftests/vm/userfaultfd is as follows:

	cpu0				cpu1			cpu2
	----				----			----
							[ Writable PTE
							  cached in TLB ]
	userfaultfd_writeprotect()
	[ write-*unprotect* ]
	mwriteprotect_range()
	mmap_read_lock()
	change_protection()
	change_protection_range()
	...
	change_pte_range()
	[ *clear* “write”-bit ]
	[ defer TLB flushes ]
				[ page-fault ]
				...
				wp_page_copy()
				 cow_user_page()
				  [ copy page ]
							[ write to old
							  page ]
				...
				 set_pte_at_notify()
A similar scenario can happen:

	cpu0			cpu1			cpu2			cpu3
	----			----			----			----
						[ Writable PTE
						  cached in TLB ]
	userfaultfd_writeprotect()
	[ write-protect ]
	[ deferred TLB flush ]
				userfaultfd_writeprotect()
				[ write-unprotect ]
				[ deferred TLB flush ]
									[ page-fault ]
									wp_page_copy()
									 cow_user_page()
									  [ copy page ]
									...
						[ write to page ]
									 set_pte_at_notify()

As Yu Zhao pointed out, these races became more apparent since commit
09854ba94c6a ("mm: do_wp_page() simplification") which made wp_page_copy()
more likely to take place, specifically if page_count(page) > 1.

Note that one might consider additional potentially dangerous scenarios that
are not directly related to the deferred TLB flushes. A memory corruption
might in theory occur if, after the page is copied by cow_user_page() and
before the PTE is set, the PTE is write-unprotected (by a concurrent
page-fault handler) and then protected again (by subsequent calls to
userfaultfd_writeprotect() to protect and unprotect the page). In practice,
it seems that such scenarios cannot happen.

To resolve the aforementioned races, acquire mmap_lock for write when
write-protecting a userfaultfd region through the ioctl. Keep acquiring
mmap_lock for read when unprotecting memory, but do not clear the write-bit
of PTEs that already have it set when performing userfaultfd
write-unprotection. This solution can introduce a performance regression for
userfaultfd write-protection.
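For reference, a minimal userspace sketch of the write-protect/unprotect
sequence whose kernel side this patch changes. It assumes a userfaultfd file
descriptor that was already created and registered for the range with
UFFDIO_REGISTER_MODE_WP, and kernel headers that provide the uffd-wp API
(v5.7+); illustrative only, not part of the fix:

	#include <linux/userfaultfd.h>
	#include <sys/ioctl.h>

	/* wp != 0: write-protect the range; wp == 0: write-unprotect it */
	static int uffd_wp_range(int uffd, void *addr, unsigned long len, int wp)
	{
		struct uffdio_writeprotect wp_args = {
			.range = {
				.start = (unsigned long)addr,
				.len = len,
			},
			.mode = wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0,
		};

		return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp_args);
	}

With this patch, the write-protect case ends up taking mmap_lock for write in
mwriteprotect_range(), while the write-unprotect case stays under the read
lock and preserves an already-set write-bit.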
Cc: Andrea Arcangeli
Cc: Yu Zhao
Cc: Andy Lutomirski
Cc: Peter Xu
Cc: Pavel Emelyanov
Cc: Mike Kravetz
Cc: Mike Rapoport
Cc: Minchan Kim
Cc: Will Deacon
Cc: Peter Zijlstra
Fixes: 292924b26024 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit")
Signed-off-by: Nadav Amit
---
 mm/mprotect.c    |  3 ++-
 mm/userfaultfd.c | 15 +++++++++++++--
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index ab709023e9aa..c08c4055b051 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		oldpte = *pte;
 		if (pte_present(oldpte)) {
 			pte_t ptent;
-			bool preserve_write = prot_numa && pte_write(oldpte);
+			bool preserve_write = (prot_numa || uffd_wp_resolve) &&
+					      pte_write(oldpte);
 
 			/*
 			 * Avoid trapping faults against the zero or KSM
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9a3d451402d7..7423808640ef 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -652,7 +652,15 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 	/* Does the address range wrap, or is the span zero-sized? */
 	BUG_ON(start + len <= start);
 
-	mmap_read_lock(dst_mm);
+	/*
+	 * Although we do not change the VMA, we have to ensure deferred TLB
+	 * flushes are performed before page-faults can be handled. Otherwise
+	 * we can get inconsistent TLB state.
+	 */
+	if (enable_wp)
+		mmap_write_lock(dst_mm);
+	else
+		mmap_read_lock(dst_mm);
 
 	/*
 	 * If memory mappings are changing because of non-cooperative
@@ -686,6 +694,9 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 
 	err = 0;
 out_unlock:
-	mmap_read_unlock(dst_mm);
+	if (enable_wp)
+		mmap_write_unlock(dst_mm);
+	else
+		mmap_read_unlock(dst_mm);
 	return err;
 }

From patchwork Fri Dec 25 09:25:29 2020
X-Patchwork-Submitter: Nadav Amit
X-Patchwork-Id: 11989955
From: Nadav Amit
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Nadav Amit, Andrea Arcangeli, Yu Zhao,
    Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport,
    Minchan Kim, Will Deacon, Peter Zijlstra
Subject: [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup
Date: Fri, 25 Dec 2020 01:25:29 -0800
Message-Id: <20201225092529.3228466-3-namit@vmware.com>
In-Reply-To: <20201225092529.3228466-1-namit@vmware.com>
References: <20201225092529.3228466-1-namit@vmware.com>

From: Nadav Amit

Clearing soft-dirty through /proc/[pid]/clear_refs can cause memory
corruption, as it clears the dirty-bit without acquiring mmap_lock for write
and defers the TLB flushes.

As a result of this behavior, it is possible that one of the CPUs would have
the stale PTE cached in its TLB and keep updating the page while another
thread triggers a page-fault, and the page-fault handler copies the old page
into a new one. Since the copying is performed without holding the page-table
lock, it is possible that after the copying, and before the PTE is actually
flushed, the CPU that cached the stale PTE in its TLB keeps changing the
page. These changes would be lost, and memory corruption would occur.

As Yu Zhao pointed out, this race became more apparent since commit
09854ba94c6a ("mm: do_wp_page() simplification") which made wp_page_copy()
more likely to take place, specifically if page_count(page) > 1.

The following test reproduces the failure quite reliably on 5.10 on my
machine. Note that the test is tailored to the behavior of recent kernels, in
which wp_page_copy() is called when page_count(page) != 1, but the fact that
the test does not fail on older kernels does not mean they are not affected.
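For context, the test relies on the documented soft-dirty ABI
(Documentation/admin-guide/mm/soft-dirty.rst): writing "4" to
/proc/[pid]/clear_refs clears the soft-dirty bits, and bit 55 of each
/proc/[pid]/pagemap entry then reports whether the page has been written
since. A minimal sketch of that interface (the helper names are illustrative
and error handling is omitted):

	#include <fcntl.h>
	#include <stdint.h>
	#include <unistd.h>

	static void clear_soft_dirty(void)
	{
		int fd = open("/proc/self/clear_refs", O_WRONLY);

		write(fd, "4", 1);	/* "4" clears the soft-dirty bits */
		close(fd);
	}

	static int page_is_soft_dirty(void *addr, long page_size)
	{
		uint64_t ent = 0;
		int fd = open("/proc/self/pagemap", O_RDONLY);

		/* one 64-bit pagemap entry per virtual page */
		pread(fd, &ent, sizeof(ent),
		      ((uintptr_t)addr / page_size) * sizeof(ent));
		close(fd);
		return (ent >> 55) & 1;	/* bit 55: PTE is soft-dirty */
	}

The test below hammers this clear_refs path from the main thread while other
threads keep writing to the page through a potentially stale TLB entry.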
#define _GNU_SOURCE

#include <assert.h>
#include <fcntl.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <threads.h>
#include <unistd.h>

#define PAGE_SIZE	(4096)
#define TLB_SIZE	(2000)
#define N_PAGES		(300000)
#define ITERATIONS	(2000)
#define N_THREADS	(2)

static int stop;
static char *m;

static int writer(void *argp)
{
	unsigned long t_idx = (unsigned long)argp;
	int i, cnt = 0;

	while (!atomic_load(&stop)) {
		cnt++;
		atomic_fetch_add((atomic_int *)m, 1);

		/*
		 * First thread only accesses the page to have it cached in
		 * the TLB.
		 */
		if (t_idx == 0)
			continue;

		/*
		 * Other threads access enough entries to cause eviction from
		 * the TLB and trigger #PF upon the next access (before the
		 * TLB flush of clear_ref actually takes place).
		 */
		for (i = 1; i < TLB_SIZE; i++) {
			if (atomic_load((atomic_int *)(m + PAGE_SIZE * i))) {
				fprintf(stderr, "unexpected error\n");
				exit(1);
			}
		}
	}
	return cnt;
}

/*
 * Runs mlock/munlock in the background to raise the page-count of the page
 * and force copying instead of reusing the page. Raising the page-count is
 * possible in better ways, e.g., registering io_uring buffers.
 */
static int do_mlock(void *argp)
{
	while (!atomic_load(&stop)) {
		if (mlock(m, PAGE_SIZE) || munlock(m, PAGE_SIZE)) {
			perror("mlock/munlock");
			exit(1);
		}
	}
	return 0;
}

int main(void)
{
	int r, cnt, fd, total = 0;
	long i;
	thrd_t thr[N_THREADS];
	thrd_t mlock_thr;

	fd = open("/proc/self/clear_refs", O_WRONLY, 0666);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	/*
	 * Have large memory for clear_ref, so there would be some time
	 * between the unmap and the actual deferred flush.
	 */
	m = mmap(NULL, PAGE_SIZE * N_PAGES, PROT_READ|PROT_WRITE,
		 MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE, -1, 0);
	if (m == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}

	for (i = 0; i < N_THREADS; i++) {
		r = thrd_create(&thr[i], writer, (void *)i);
		assert(r == thrd_success);
	}

	r = thrd_create(&mlock_thr, do_mlock, (void *)i);
	assert(r == thrd_success);

	for (i = 0; i < ITERATIONS; i++) {
		r = pwrite(fd, "4", 1, 0);
		if (r < 0) {
			perror("pwrite");
			exit(1);
		}
	}

	atomic_store(&stop, 1);

	r = thrd_join(mlock_thr, NULL);
	assert(r == thrd_success);

	for (i = 0; i < N_THREADS; i++) {
		r = thrd_join(thr[i], &cnt);
		assert(r == thrd_success);
		total += cnt;
	}

	r = atomic_load((atomic_int *)(m));
	if (r != total) {
		fprintf(stderr, "failed: expected=%d actual=%d\n", total, r);
		exit(-1);
	}

	fprintf(stderr, "ok\n");
	return 0;
}
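As an aside, the mlock/munlock loop above is just one way to raise the
page-count; the alternative the comment mentions, registering the page as a
fixed io_uring buffer, would look roughly like this with liburing (a sketch
under that assumption, not part of the test):

	#include <liburing.h>
	#include <sys/uio.h>

	/* Pin a page by registering it as a fixed io_uring buffer. */
	static int pin_page_with_io_uring(void *addr, size_t len)
	{
		struct io_uring ring;
		struct iovec iov = { .iov_base = addr, .iov_len = len };

		if (io_uring_queue_init(8, &ring, 0) < 0)
			return -1;
		/*
		 * The registered buffer holds extra references on the page
		 * for as long as the ring's file descriptor stays open.
		 */
		return io_uring_register_buffers(&ring, &iov, 1);
	}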
Fix it by taking mmap_lock for write when clearing soft-dirty.

Note that the test keeps failing without the pending fix of the missing TLB
flushes in clear_refs_write() [1].

[1] https://lore.kernel.org/patchwork/patch/1351776/

Cc: Andrea Arcangeli
Cc: Yu Zhao
Cc: Andy Lutomirski
Cc: Peter Xu
Cc: Pavel Emelyanov
Cc: Mike Kravetz
Cc: Mike Rapoport
Cc: Minchan Kim
Cc: Will Deacon
Cc: Peter Zijlstra
Fixes: 0f8975ec4db2 ("mm: soft-dirty bits for user memory changes tracking")
Signed-off-by: Nadav Amit
Acked-by: Will Deacon
---
 fs/proc/task_mmu.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 217aa2705d5d..39b2bd27af79 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1189,6 +1189,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;
 	enum clear_refs_types type;
+	bool write_lock = false;
 	struct mmu_gather tlb;
 	int itype;
 	int rv;
@@ -1236,21 +1237,16 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		}
 		tlb_gather_mmu(&tlb, mm, 0, -1);
 		if (type == CLEAR_REFS_SOFT_DIRTY) {
+			mmap_read_unlock(mm);
+			if (mmap_write_lock_killable(mm)) {
+				count = -EINTR;
+				goto out_mm;
+			}
 			for (vma = mm->mmap; vma; vma = vma->vm_next) {
-				if (!(vma->vm_flags & VM_SOFTDIRTY))
-					continue;
-				mmap_read_unlock(mm);
-				if (mmap_write_lock_killable(mm)) {
-					count = -EINTR;
-					goto out_mm;
-				}
-				for (vma = mm->mmap; vma; vma = vma->vm_next) {
-					vma->vm_flags &= ~VM_SOFTDIRTY;
-					vma_set_page_prot(vma);
-				}
-				mmap_write_downgrade(mm);
-				break;
+				vma->vm_flags &= ~VM_SOFTDIRTY;
+				vma_set_page_prot(vma);
 			}
+			write_lock = true;
 
 			mmu_notifier_range_init(&range, MMU_NOTIFY_SOFT_DIRTY,
 						0, NULL, mm, 0, -1UL);
@@ -1261,7 +1257,10 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 	if (type == CLEAR_REFS_SOFT_DIRTY)
 		mmu_notifier_invalidate_range_end(&range);
 	tlb_finish_mmu(&tlb, 0, -1);
-	mmap_read_unlock(mm);
+	if (write_lock)
+		mmap_write_unlock(mm);
+	else
+		mmap_read_unlock(mm);
 out_mm:
 	mmput(mm);
 }