From: Yang Shi <yang.shi@linux.alibaba.com>
To: gregkh@linuxfoundation.org, akpm@linux-foundation.org,
    aneesh.kumar@linux.ibm.com, jstancek@redhat.com, mgorman@suse.de,
    minchan@kernel.org, namit@vmware.com, npiggin@gmail.com,
    peterz@infradead.org, will.deacon@arm.com
Cc: yang.shi@linux.alibaba.com, stable@vger.kernel.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org
Subject: [5.1-stable PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush
Date: Tue, 18 Jun 2019 04:46:30 +0800
Message-Id: <1560804390-28494-1-git-send-email-yang.shi@linux.alibaba.com>

A few new fields were added to mmu_gather to make TLB flushing smarter for
huge pages, by recording which level of the page tables has been changed.
__tlb_reset_range() resets all of this page-table state to "unchanged"; the
TLB flush code calls it when parallel mapping changes may be happening on
the same range under a non-exclusive lock (i.e. mmap_sem held for read).

Before commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in
munmap"), the syscalls that may update PTEs in parallel (e.g. MADV_DONTNEED,
MADV_FREE) did not remove page tables. But the aforementioned commit allows
munmap() to run under read mmap_sem and free page tables. This may result
in a program hang on aarch64, reported by Jan Stancek.
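For context, the state that __tlb_reset_range() wipes looks roughly like
this (a sketch paraphrased from the v5.1 include/asm-generic/tlb.h, not a
verbatim quote):

static inline void __tlb_reset_range(struct mmu_gather *tlb)
{
	if (tlb->fullmm) {
		tlb->start = tlb->end = ~0;
	} else {
		tlb->start = TASK_SIZE;
		tlb->end = 0;
	}
	tlb->freed_tables = 0;	/* forgets that page tables were freed */
	tlb->cleared_ptes = 0;	/* forgets which levels were touched */
	tlb->cleared_pmds = 0;
	tlb->cleared_puds = 0;
	tlb->cleared_p4ds = 0;
}

Losing freed_tables here is exactly what goes wrong in the scenario below.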
The problem can be reproduced with his test program, slightly modified
below. (The includes and a DISTANT_MMAP_SIZE value have been filled in so
the snippet builds standalone; the value is illustrative, as the original
posting elided both.)

---8<---
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/mman.h>

/* Illustrative value; not part of the original posting. */
#define DISTANT_MMAP_SIZE (64 * 1024 * 1024)

static int map_size = 4096;
static int num_iter = 500;
static void *distant_area;

void *map_write_unmap(void *ptr)
{
	unsigned char *map_address;
	int i, j;

	for (i = 0; i < num_iter; i++) {
		map_address = mmap(distant_area, (size_t)map_size,
				   PROT_WRITE | PROT_READ,
				   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
		if (map_address == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}

		for (j = 0; j < map_size; j++)
			map_address[j] = 'b';

		if (munmap(map_address, map_size) == -1) {
			perror("munmap");
			exit(1);
		}
	}

	return NULL;
}

void *dummy(void *ptr)
{
	return NULL;
}

int main(void)
{
	pthread_t thid[2];

	/* hint for mmap in map_write_unmap() */
	distant_area = mmap(0, DISTANT_MMAP_SIZE, PROT_WRITE | PROT_READ,
			    MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	munmap(distant_area, (size_t)DISTANT_MMAP_SIZE);
	distant_area += DISTANT_MMAP_SIZE / 2;

	while (1) {
		pthread_create(&thid[0], NULL, map_write_unmap, NULL);
		pthread_create(&thid[1], NULL, dummy, NULL);

		pthread_join(thid[0], NULL);
		pthread_join(thid[1], NULL);
	}
}
---8<---

The program may produce parallel execution like this:

	t1                                  t2
	munmap(map_address)
	  downgrade_write(&mm->mmap_sem);
	  unmap_region()
	    tlb_gather_mmu()
	      inc_tlb_flush_pending(tlb->mm);
	    free_pgtables()
	      tlb->freed_tables = 1
	      tlb->cleared_pmds = 1

	                                    pthread_exit()
	                                    madvise(thread_stack, 8M, MADV_DONTNEED)
	                                      zap_page_range()
	                                        tlb_gather_mmu()
	                                          inc_tlb_flush_pending(tlb->mm);

	    tlb_finish_mmu()
	      if (mm_tlb_flush_nested(tlb->mm))
	        __tlb_reset_range()

__tlb_reset_range() resets the freed_tables and cleared_* bits, but this is
inconsistent for munmap(), which really did free page tables. As a result,
architectures that can flush the TLB per page-table level, e.g. aarch64,
may not flush completely as expected, and stale TLB entries are left
behind.
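To see why aarch64 in particular is hit, here is its mmu_gather flush hook,
paraphrased from the v5.1 arch/arm64/include/asm/tlb.h (treat this as a
sketch from memory, not a verbatim quote):

static inline void tlb_flush(struct mmu_gather *tlb)
{
	struct vm_area_struct vma = TLB_FLUSH_VMA(tlb->mm, 0);
	/*
	 * Without the freed_tables hint, only last-level (leaf) entries
	 * are invalidated; cached intermediate walk entries survive.
	 */
	bool last_level = !tlb->freed_tables;
	unsigned long stride = tlb_get_unmap_size(tlb);

	/*
	 * If we're tearing down the address space then we only care
	 * about invalidating the walk-cache, since the ASID allocator
	 * won't reallocate our ASID without invalidating the entire TLB.
	 */
	if (tlb->fullmm) {
		if (!last_level)
			flush_tlb_mm(tlb->mm);
		return;
	}

	__flush_tlb_range(&vma, tlb->start, tlb->end, stride, last_level);
}

Once __tlb_reset_range() has cleared freed_tables, last_level stays true
and the walk-cache entries for the freed page tables are never invalidated.
That is also why the fix below sets both fullmm and freed_tables: together
they steer the forced flush into flush_tlb_mm().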
Use a fullmm flush, since it yields much better performance on aarch64,
and non-fullmm does not yield a significant difference on x86.

The originally proposed fix came from Jan Stancek, who did most of the
debugging of this issue; I just wrapped everything up together.

Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
Reported-by: Jan Stancek
Tested-by: Jan Stancek
Suggested-by: Will Deacon
Cc: Peter Zijlstra
Cc: Nick Piggin
Cc: "Aneesh Kumar K.V"
Cc: Nadav Amit
Cc: Minchan Kim
Cc: Mel Gorman
Cc: stable@vger.kernel.org 4.20+
Signed-off-by: Yang Shi
Signed-off-by: Jan Stancek
---
 mm/mmu_gather.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index f2f03c6..3543b82 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -92,9 +92,30 @@ void arch_tlb_finish_mmu(struct mmu_gather *tlb,
 {
 	struct mmu_gather_batch *batch, *next;
 
+	/*
+	 * If parallel threads are doing PTE changes on the same range
+	 * under a non-exclusive lock (e.g., mmap_sem read-side) but
+	 * defer the TLB flush by batching, one thread may end up seeing
+	 * inconsistent PTEs and stale TLB entries.  So flush the TLB
+	 * forcefully if we detect parallel PTE batching threads.
+	 *
+	 * However, some syscalls, e.g. munmap(), may free page tables;
+	 * this needs to force flush everything in the given range.
+	 * Otherwise stale TLB entries may remain on architectures,
+	 * e.g. aarch64, that can flush only a specific page-table level.
+	 */
 	if (force) {
+		/*
+		 * aarch64 yields better performance with fullmm by
+		 * avoiding multiple CPUs spamming TLBI messages at the
+		 * same time.
+		 *
+		 * On x86, non-fullmm doesn't yield a significant
+		 * difference compared with fullmm.
+		 */
+		tlb->fullmm = 1;
 		__tlb_reset_range(tlb);
-		__tlb_adjust_range(tlb, start, end - start);
+		tlb->freed_tables = 1;
 	}
 
 	tlb_flush_mmu(tlb);
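For reference, the force flag in this hunk comes from the generic code's
nested-flush detection. The caller in v5.1 looks roughly like this
(paraphrased from mm/mmu_gather.c, not a verbatim quote):

void tlb_finish_mmu(struct mmu_gather *tlb,
		unsigned long start, unsigned long end)
{
	/*
	 * mm_tlb_flush_nested() is true when another thread holds a
	 * concurrent mmu_gather on the same mm (it reads the
	 * tlb_flush_pending counter bumped by inc_tlb_flush_pending()
	 * in the timeline above), in which case the flush is forced.
	 */
	bool force = mm_tlb_flush_nested(tlb->mm);

	arch_tlb_finish_mmu(tlb, start, end, force);
}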