From patchwork Fri Apr 14 14:23:25 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chih-En Lin X-Patchwork-Id: 13211627 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 79A6BC77B72 for ; Fri, 14 Apr 2023 14:24:56 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230229AbjDNOYs (ORCPT ); Fri, 14 Apr 2023 10:24:48 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48364 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230431AbjDNOYn (ORCPT ); Fri, 14 Apr 2023 10:24:43 -0400 Received: from mail-pj1-x102c.google.com (mail-pj1-x102c.google.com [IPv6:2607:f8b0:4864:20::102c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1F144C152; Fri, 14 Apr 2023 07:24:31 -0700 (PDT) Received: by mail-pj1-x102c.google.com with SMTP id f2so9911822pjs.3; Fri, 14 Apr 2023 07:24:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1681482270; x=1684074270; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=uu2tvK/dyE/2iioL4m/q+1zLO0tq6AJoZ0N1fRXdBLI=; b=VZpG6YhRv9cCG3dhHA4pW1dfvwCMu8AdyhCK9POKf5RMu8MA/hMLmX9JH6IlIvozl1 1fVKpFWNxMc57Q90dZ7CxY5g8ZjQa98VKAWwZZYhMZCFgRuJLaCo0ene2tZmu1p3/eXY vgGdzYayeuafSTKmDbbuUID2tYVhNc9bsT3u2L34Qz3Sk4G3BHCgygYiRqeJFkwIHjs2 Nr2TTPHguT7oEswc2lYK0IYXrvHQR6EAjkurHZk2S7EXlB2uNgw4hengUxVBmEPCr65L 0Jsxm3ALANv9sKddMqxMVD0nh2jBBNUv96/3Ut35UTZ1DS0w9BP0/MQAFy73Zj+IzSAG mDuA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1681482270; x=1684074270; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=uu2tvK/dyE/2iioL4m/q+1zLO0tq6AJoZ0N1fRXdBLI=; b=A1ixAIeOTQNaUx4ylEcA1Fn9UbDYq6LFOnlnJPwuQS5fGU4vv+QbuIh5ypSFTqej7M IzY2uyxSbTte/Z+rlz+bpH2/0tCnrtQ8e4MwnjieVjMaqJ1vXlYFKEmxTJTkXDNu/B8u UWmMoFZmy46mKARpxeMWadb1UjIM0Ovg8j+eLHQ0OruNx4TH834Z94PC+MiYhzxENkJ3 AZIc6MUdpB20DXaUY3GvUQGgRtS70JK5vUHhfmZIIaVnsdNCOgy9QpHuuRm+bC+mLX3V axOaFMxyruswCAwY1B051W/s07G1nLfOXbTHa0oqSOvkz8saYq6rlSap18Wm5+etLL4D SlYw== X-Gm-Message-State: AAQBX9d2laPtzEyhRItEj0sN/O15AE8IEnzY/k5DmXPLFQC65BRfuzA2 ulLrx5di0vezqRwiXGh2qqI= X-Google-Smtp-Source: AKy350YJJQnFLpjJ1GWwTuBA1s0UTSEAKxeCsIUIoR1dqPUv21yoi97HToCfZrLndDVV2aGKaYjHpw== X-Received: by 2002:a17:90a:4a17:b0:247:2d48:76f7 with SMTP id e23-20020a17090a4a1700b002472d4876f7mr4010602pjh.44.1681482270305; Fri, 14 Apr 2023 07:24:30 -0700 (PDT) Received: from strix-laptop.. (2001-b011-20e0-1499-8303-7502-d3d7-e13b.dynamic-ip6.hinet.net. [2001:b011:20e0:1499:8303:7502:d3d7:e13b]) by smtp.googlemail.com with ESMTPSA id h7-20020a17090ac38700b0022335f1dae2sm2952386pjt.22.2023.04.14.07.24.21 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 14 Apr 2023 07:24:29 -0700 (PDT) From: Chih-En Lin To: Andrew Morton , Qi Zheng , David Hildenbrand , "Matthew Wilcox (Oracle)" , Christophe Leroy , John Hubbard , Nadav Amit , Barry Song , Pasha Tatashin Cc: Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , "H. 
Peter Anvin" , Steven Rostedt , Masami Hiramatsu , Peter Zijlstra , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Ian Rogers , Adrian Hunter , Yu Zhao , Steven Barrett , Juergen Gross , Peter Xu , Kefeng Wang , Tong Tiangen , Christoph Hellwig , "Liam R. Howlett" , Yang Shi , Vlastimil Babka , Alex Sierra , Vincent Whitchurch , Anshuman Khandual , Li kunyu , Liu Shixin , Hugh Dickins , Minchan Kim , Joey Gouly , Chih-En Lin , Michal Hocko , Suren Baghdasaryan , "Zach O'Keefe" , Gautam Menghani , Catalin Marinas , Mark Brown , "Eric W. Biederman" , Andrei Vagin , Shakeel Butt , Daniel Bristot de Oliveira , "Jason A. Donenfeld" , Greg Kroah-Hartman , Alexey Gladkov , x86@kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Dinglan Peng , Pedro Fonseca , Jim Huang , Huichun Feng Subject: [PATCH v5 01/17] mm: Split out the present cases from zap_pte_range() Date: Fri, 14 Apr 2023 22:23:25 +0800 Message-Id: <20230414142341.354556-2-shiyn.lin@gmail.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230414142341.354556-1-shiyn.lin@gmail.com> References: <20230414142341.354556-1-shiyn.lin@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-trace-kernel@vger.kernel.org As the complexity of zap_pte_range() has increased, The readability and maintainability are becoming more difficult. To simplfy and improve the expandability of zap PTE part, split the present and non-present cases from zap_pte_range() and replace the individual flag variable by the single flag with bitwise operations. Signed-off-by: Chih-En Lin --- mm/memory.c | 217 +++++++++++++++++++++++++++++++--------------------- 1 file changed, 129 insertions(+), 88 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 01a23ad48a04..0476cf22ea33 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1351,29 +1351,147 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma, pte_install_uffd_wp_if_needed(vma, addr, pte, pteval); } +#define ZAP_PTE_INIT 0x0000 +#define ZAP_PTE_FORCE_FLUSH 0x0001 + +struct zap_pte_details { + pte_t **pte; + unsigned long *addr; + unsigned int flags; + int rss[NR_MM_COUNTERS]; +}; + +/* Return 0 to continue, 1 to break. 
*/ +static inline int +zap_present_pte(struct mmu_gather *tlb, struct vm_area_struct *vma, + struct zap_details *details, + struct zap_pte_details *pte_details) +{ + struct mm_struct *mm = tlb->mm; + struct page *page; + unsigned int delay_rmap; + unsigned long addr = *pte_details->addr; + pte_t *pte = *pte_details->pte; + pte_t ptent = *pte; + + page = vm_normal_page(vma, addr, ptent); + if (unlikely(!should_zap_page(details, page))) + return 0; + + ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); + tlb_remove_tlb_entry(tlb, pte, addr); + zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent); + if (unlikely(!page)) + return 0; + + delay_rmap = 0; + if (!PageAnon(page)) { + if (pte_dirty(ptent)) { + set_page_dirty(page); + if (tlb_delay_rmap(tlb)) { + delay_rmap = 1; + pte_details->flags |= ZAP_PTE_FORCE_FLUSH; + } + } + if (pte_young(ptent) && likely(vma_has_recency(vma))) + mark_page_accessed(page); + + } + pte_details->rss[mm_counter(page)]--; + if (!delay_rmap) { + page_remove_rmap(page, vma, false); + if (unlikely(page_mapcount(page) < 0)) + print_bad_pte(vma, addr, ptent, page); + } + if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) { + *pte_details->addr += PAGE_SIZE; + pte_details->flags |= ZAP_PTE_FORCE_FLUSH; + return 1; + } + + return 0; +} + +static inline void +zap_nopresent_pte(struct mmu_gather *tlb, struct vm_area_struct *vma, + struct zap_details *details, + struct zap_pte_details *pte_details) +{ + struct mm_struct *mm = tlb->mm; + struct page *page; + unsigned long addr = *pte_details->addr; + pte_t *pte = *pte_details->pte; + pte_t ptent = *pte; + swp_entry_t entry = pte_to_swp_entry(ptent); + + if (is_device_private_entry(entry) || + is_device_exclusive_entry(entry)) { + page = pfn_swap_entry_to_page(entry); + if (unlikely(!should_zap_page(details, page))) + return; + /* + * Both device private/exclusive mappings should only + * work with anonymous page so far, so we don't need to + * consider uffd-wp bit when zap. For more information, + * see zap_install_uffd_wp_if_needed(). 
+ */ + WARN_ON_ONCE(!vma_is_anonymous(vma)); + pte_details->rss[mm_counter(page)]--; + if (is_device_private_entry(entry)) + page_remove_rmap(page, vma, false); + put_page(page); + } else if (!non_swap_entry(entry)) { + /* Genuine swap entry, hence a private anon page */ + if (!should_zap_cows(details)) + return; + pte_details->rss[MM_SWAPENTS]--; + if (unlikely(!free_swap_and_cache(entry))) + print_bad_pte(vma, addr, ptent, NULL); + } else if (is_migration_entry(entry)) { + page = pfn_swap_entry_to_page(entry); + if (!should_zap_page(details, page)) + return; + pte_details->rss[mm_counter(page)]--; + } else if (pte_marker_entry_uffd_wp(entry)) { + /* Only drop the uffd-wp marker if explicitly requested */ + if (!zap_drop_file_uffd_wp(details)) + return; + } else if (is_hwpoison_entry(entry) || + is_swapin_error_entry(entry)) { + if (!should_zap_cows(details)) + return; + } else { + /* We should have covered all the swap entry types */ + WARN_ON_ONCE(1); + } + pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); + zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent); +} + static unsigned long zap_pte_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, struct zap_details *details) { struct mm_struct *mm = tlb->mm; - int force_flush = 0; - int rss[NR_MM_COUNTERS]; spinlock_t *ptl; pte_t *start_pte; pte_t *pte; - swp_entry_t entry; + struct zap_pte_details pte_details = { + .addr = &addr, + .flags = ZAP_PTE_INIT, + .pte = &pte, + }; tlb_change_page_size(tlb, PAGE_SIZE); again: - init_rss_vec(rss); + init_rss_vec(pte_details.rss); start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl); pte = start_pte; flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); do { pte_t ptent = *pte; - struct page *page; if (pte_none(ptent)) continue; @@ -1382,95 +1500,18 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, break; if (pte_present(ptent)) { - unsigned int delay_rmap; - - page = vm_normal_page(vma, addr, ptent); - if (unlikely(!should_zap_page(details, page))) - continue; - ptent = ptep_get_and_clear_full(mm, addr, pte, - tlb->fullmm); - tlb_remove_tlb_entry(tlb, pte, addr); - zap_install_uffd_wp_if_needed(vma, addr, pte, details, - ptent); - if (unlikely(!page)) - continue; - - delay_rmap = 0; - if (!PageAnon(page)) { - if (pte_dirty(ptent)) { - set_page_dirty(page); - if (tlb_delay_rmap(tlb)) { - delay_rmap = 1; - force_flush = 1; - } - } - if (pte_young(ptent) && likely(vma_has_recency(vma))) - mark_page_accessed(page); - } - rss[mm_counter(page)]--; - if (!delay_rmap) { - page_remove_rmap(page, vma, false); - if (unlikely(page_mapcount(page) < 0)) - print_bad_pte(vma, addr, ptent, page); - } - if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) { - force_flush = 1; - addr += PAGE_SIZE; + if (zap_present_pte(tlb, vma, details, &pte_details)) break; - } continue; } - - entry = pte_to_swp_entry(ptent); - if (is_device_private_entry(entry) || - is_device_exclusive_entry(entry)) { - page = pfn_swap_entry_to_page(entry); - if (unlikely(!should_zap_page(details, page))) - continue; - /* - * Both device private/exclusive mappings should only - * work with anonymous page so far, so we don't need to - * consider uffd-wp bit when zap. For more information, - * see zap_install_uffd_wp_if_needed(). 
- */ - WARN_ON_ONCE(!vma_is_anonymous(vma)); - rss[mm_counter(page)]--; - if (is_device_private_entry(entry)) - page_remove_rmap(page, vma, false); - put_page(page); - } else if (!non_swap_entry(entry)) { - /* Genuine swap entry, hence a private anon page */ - if (!should_zap_cows(details)) - continue; - rss[MM_SWAPENTS]--; - if (unlikely(!free_swap_and_cache(entry))) - print_bad_pte(vma, addr, ptent, NULL); - } else if (is_migration_entry(entry)) { - page = pfn_swap_entry_to_page(entry); - if (!should_zap_page(details, page)) - continue; - rss[mm_counter(page)]--; - } else if (pte_marker_entry_uffd_wp(entry)) { - /* Only drop the uffd-wp marker if explicitly requested */ - if (!zap_drop_file_uffd_wp(details)) - continue; - } else if (is_hwpoison_entry(entry) || - is_swapin_error_entry(entry)) { - if (!should_zap_cows(details)) - continue; - } else { - /* We should have covered all the swap entry types */ - WARN_ON_ONCE(1); - } - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); - zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent); + zap_nopresent_pte(tlb, vma, details, &pte_details); } while (pte++, addr += PAGE_SIZE, addr != end); - add_mm_rss_vec(mm, rss); + add_mm_rss_vec(mm, pte_details.rss); arch_leave_lazy_mmu_mode(); /* Do the actual TLB flush before dropping ptl */ - if (force_flush) { + if (pte_details.flags & ZAP_PTE_FORCE_FLUSH) { tlb_flush_mmu_tlbonly(tlb); tlb_flush_rmaps(tlb, vma); } @@ -1482,8 +1523,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, * entries before releasing the ptl), free the batched * memory too. Restart if we didn't do everything. */ - if (force_flush) { - force_flush = 0; + if (pte_details.flags & ZAP_PTE_FORCE_FLUSH) { + pte_details.flags &= ~ZAP_PTE_FORCE_FLUSH; tlb_flush_mmu(tlb); } From patchwork Fri Apr 14 14:23:26 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chih-En Lin X-Patchwork-Id: 13211628 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9DCAFC77B77 for ; Fri, 14 Apr 2023 14:24:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230246AbjDNOY5 (ORCPT ); Fri, 14 Apr 2023 10:24:57 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48520 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230493AbjDNOYy (ORCPT ); Fri, 14 Apr 2023 10:24:54 -0400 Received: from mail-pj1-x102b.google.com (mail-pj1-x102b.google.com [IPv6:2607:f8b0:4864:20::102b]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3E02BC140; Fri, 14 Apr 2023 07:24:40 -0700 (PDT) Received: by mail-pj1-x102b.google.com with SMTP id cm18-20020a17090afa1200b0024713adf69dso6267251pjb.3; Fri, 14 Apr 2023 07:24:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1681482280; x=1684074280; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=EI/xgA3T+jD6+X1P1tFB/fdrO9EhnI4pOgRfKhyKXi8=; b=dp9B4As7NRowFOhmFgHptzwRqDpv7/qxF2zrakWreLkouXNX+Ey0loxIIQdV0mAyjL wbB5XAF4XVWn7QWMCRTQsmGD65y5w+/3d5YZz+ZuQURVTP5mQmbfpw6uP6zYos/mTpxi Z2hMRosDak5Clhncc6D/belgtQPtCjSdu0z42/aBMHdBa3D3dMT8E2ppfvQuohvqykwf DKHGG/m/ZpQ6kgNHGpm5DTfDKIRJF+FpahmqHY2B6kez9gLckPsly4+UQSub9VJUk4mA 
UmGR+EpFifjGDLsruJ//xx4xJj9JDgYl1bK+1deT7vfi23+wJW8oTTmCtvanwFXVxaJf ifng== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1681482280; x=1684074280; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=EI/xgA3T+jD6+X1P1tFB/fdrO9EhnI4pOgRfKhyKXi8=; b=bglvtuH5/hADmO8XMJyTq/8ZPtQvG71S7GEwWrJKkcfMC/OHlGTh4JYFhdyUL4kM0r aKqsSjVYz9N9xv2jiYkt8lsst/cyKknEYcDUOoG125Xnm0uONVg0C0KQjtx3xEtxI3rt QTNfijMOLW4RBbtT1Gxu5NsOcOLXPXLzfIv2yBfbSRiykJuux8iWKtpoyKx4wFUSfxWK at+tvCDk+yjX/HtpzXC4g6tdqUO2QioaklQgfyrv5tkZGDO4hrR/N7+hX1B1If7PPrTd cHLS/KSifXpJxg7fIfFWUKBQ/fKTYpz8CFwU4eEQGwsBXLErigeIupjmlgrDGKVyY2r6 ZHqg== X-Gm-Message-State: AAQBX9cXYvRdfHJOcPC/gJgw5J4jICfVgR/yo1JrvvVeyb7oeyFaFMgD vkJKW0vmRsSfjsvuw94vdvs= X-Google-Smtp-Source: AKy350bVGay/l+iKywICpX6DkfyRQb5ITLNSdlflzo3Pxj01rUABeJIlD5c4oSXbLyfahy3pI5Uy8Q== X-Received: by 2002:a17:90a:fc6:b0:237:97a3:1479 with SMTP id 64-20020a17090a0fc600b0023797a31479mr6207174pjz.28.1681482279586; Fri, 14 Apr 2023 07:24:39 -0700 (PDT) Received: from strix-laptop.. (2001-b011-20e0-1499-8303-7502-d3d7-e13b.dynamic-ip6.hinet.net. [2001:b011:20e0:1499:8303:7502:d3d7:e13b]) by smtp.googlemail.com with ESMTPSA id h7-20020a17090ac38700b0022335f1dae2sm2952386pjt.22.2023.04.14.07.24.30 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 14 Apr 2023 07:24:39 -0700 (PDT) From: Chih-En Lin To: Andrew Morton , Qi Zheng , David Hildenbrand , "Matthew Wilcox (Oracle)" , Christophe Leroy , John Hubbard , Nadav Amit , Barry Song , Pasha Tatashin Cc: Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , "H. Peter Anvin" , Steven Rostedt , Masami Hiramatsu , Peter Zijlstra , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Ian Rogers , Adrian Hunter , Yu Zhao , Steven Barrett , Juergen Gross , Peter Xu , Kefeng Wang , Tong Tiangen , Christoph Hellwig , "Liam R. Howlett" , Yang Shi , Vlastimil Babka , Alex Sierra , Vincent Whitchurch , Anshuman Khandual , Li kunyu , Liu Shixin , Hugh Dickins , Minchan Kim , Joey Gouly , Chih-En Lin , Michal Hocko , Suren Baghdasaryan , "Zach O'Keefe" , Gautam Menghani , Catalin Marinas , Mark Brown , "Eric W. Biederman" , Andrei Vagin , Shakeel Butt , Daniel Bristot de Oliveira , "Jason A. Donenfeld" , Greg Kroah-Hartman , Alexey Gladkov , x86@kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Dinglan Peng , Pedro Fonseca , Jim Huang , Huichun Feng Subject: [PATCH v5 02/17] mm: Allow user to control COW PTE via prctl Date: Fri, 14 Apr 2023 22:23:26 +0800 Message-Id: <20230414142341.354556-3-shiyn.lin@gmail.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230414142341.354556-1-shiyn.lin@gmail.com> References: <20230414142341.354556-1-shiyn.lin@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-trace-kernel@vger.kernel.org Add a new prctl, PR_SET_COW_PTE, to allow the user to enable COW PTE. Since it has a time gap between using the prctl to enable the COW PTE and doing the fork, we use two states (MMF_COW_PTE_READY and MMF_COW_PTE) to determine the task that wants to do COW PTE or already doing it. The MMF_COW_PTE_READY flag marks the task to do COW PTE in the next time of fork(). During fork(), if MMF_COW_PTE_READY set, fork() will unset the flag and set the MMF_COW_PTE flag. 
After that, fork() might shares PTEs instead of duplicates it. Signed-off-by: Chih-En Lin --- include/linux/sched/coredump.h | 13 ++++++++++++- include/uapi/linux/prctl.h | 6 ++++++ kernel/sys.c | 11 +++++++++++ 3 files changed, 29 insertions(+), 1 deletion(-) diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h index 0e17ae7fbfd3..dff4b0938c39 100644 --- a/include/linux/sched/coredump.h +++ b/include/linux/sched/coredump.h @@ -87,7 +87,18 @@ static inline int get_dumpable(struct mm_struct *mm) #define MMF_DISABLE_THP_MASK (1 << MMF_DISABLE_THP) +/* + * MMF_COW_PTE_READY: Marking the task to do COW PTE in the next time of + * fork(). During fork(), if MMF_COW_PTE_READY set, fork() will unset the + * flag and set the MMF_COW_PTE flag. After that, fork() might shares PTEs + * rather than duplicates it. + */ +#define MMF_COW_PTE_READY 29 /* Share PTE tables in next time of fork() */ +#define MMF_COW_PTE 30 /* PTE tables are shared between processes */ +#define MMF_COW_PTE_MASK (1 << MMF_COW_PTE) + #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\ - MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK) + MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\ + MMF_COW_PTE_MASK) #endif /* _LINUX_SCHED_COREDUMP_H */ diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 1312a137f7fb..8fc82ced80b5 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -290,4 +290,10 @@ struct prctl_mm_map { #define PR_SET_VMA 0x53564d41 # define PR_SET_VMA_ANON_NAME 0 +/* + * Set the prepare flag, MMF_COW_PTE_READY, to do the share (copy-on-write) + * page table in the next time of fork. + */ +#define PR_SET_COW_PTE 65 + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/sys.c b/kernel/sys.c index 495cd87d9bf4..eb1c38c4bad2 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2377,6 +2377,14 @@ static inline int prctl_get_mdwe(unsigned long arg2, unsigned long arg3, PR_MDWE_REFUSE_EXEC_GAIN : 0; } +static int prctl_set_cow_pte(struct mm_struct *mm) +{ + if (test_bit(MMF_COW_PTE, &mm->flags)) + return -EINVAL; + set_bit(MMF_COW_PTE_READY, &mm->flags); + return 0; +} + SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, unsigned long, arg4, unsigned long, arg5) { @@ -2661,6 +2669,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, case PR_SET_VMA: error = prctl_set_vma(arg2, arg3, arg4, arg5); break; + case PR_SET_COW_PTE: + error = prctl_set_cow_pte(me->mm); + break; default: error = -EINVAL; break; From patchwork Fri Apr 14 14:23:27 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chih-En Lin X-Patchwork-Id: 13211629 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id BCEE8C77B72 for ; Fri, 14 Apr 2023 14:25:20 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230478AbjDNOZT (ORCPT ); Fri, 14 Apr 2023 10:25:19 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49002 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230437AbjDNOZN (ORCPT ); Fri, 14 Apr 2023 10:25:13 -0400 Received: from mail-pj1-x102d.google.com (mail-pj1-x102d.google.com [IPv6:2607:f8b0:4864:20::102d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9FF78B772; Fri, 14 Apr 2023 07:24:49 -0700 (PDT) 
Received: by mail-pj1-x102d.google.com with SMTP id cm18-20020a17090afa1200b0024713adf69dso6267802pjb.3; Fri, 14 Apr 2023 07:24:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1681482289; x=1684074289; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=3qO9MOf1iXn2D7MFz9Fy1893SAyzuoqWYER46kEd3Ck=; b=TBVjEIetFZ+/MLviN3UdPDL0mRMevi7pHpgDVLpWTmwvlKiqWj2T68F0+mgDIxiLng JVKu12nP5F7UE9GeKmODeuzoYpXdSOeCQMZlNw+H8mXW+uxg+SYhfbS5xx8AJJW1ZUhN A7FzBSUjWidIgUYjNrdThiBBKMfvcxRYJsBU9/ylg0sheFH0U4CwB8V7oZERsqGmEZwt sqo+GhY6XZ2Ijmrm50LwZpwk7hjn2OEFqMx5O/U5lp57FkZhCXfmmZX3RsGflFndSqsl hF3MiuJeYGc4Odp6Muq1yMzWQFT1PCB1d7IaDV2/UzjN2tvwCp7j4cSV2nMc/rA1tDeD /ASg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1681482289; x=1684074289; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=3qO9MOf1iXn2D7MFz9Fy1893SAyzuoqWYER46kEd3Ck=; b=cRjSkLdNAs2NRxIiUPwRGeV3iGJeeS7Y1EoT5mzP0IDEesHxhMJpEJXGR1MLM9cic0 dxxqAa9pFxOaqpaEUWo8x4NdobVTxH87HkWLtyRGylzI53N34rgsKxftUC5CrA6wqWS0 y3o9O2jI1KyfXjd3HryU21sSDBt9e9rGAM56+urvuX6u47W5CMUNNAoAk+T1UbLx/eKt ngcwM5pOGyN7dHmZ2flQiN1DmaHeLtNP8jn6QGEEZG/y0dCvm4X0iPUeaUpvHzAPYjlR 6VkUJl7/nMiunzygWmC2NdV5DBAAnMPNX1hFmGXWqdl0vAb5/Buzm16y+WBNlCk+zZkN lU9Q== X-Gm-Message-State: AAQBX9f4juXOsIHWr6NA0zgdnEuWnAPGMAfAAmGyH55i/sUMHwSPCOvp T+1N3pDdPR23dzJwOG+ZK2A= X-Google-Smtp-Source: AKy350b8HM6v+i7vq7tXjX/k/iame5prsYNpvD+3rHdZJR6torx2mq+Xn9RsLCpFxg0qlbEyKbmTPQ== X-Received: by 2002:a17:90a:d18d:b0:247:c85:21f5 with SMTP id fu13-20020a17090ad18d00b002470c8521f5mr8521546pjb.19.1681482288762; Fri, 14 Apr 2023 07:24:48 -0700 (PDT) Received: from strix-laptop.. (2001-b011-20e0-1499-8303-7502-d3d7-e13b.dynamic-ip6.hinet.net. [2001:b011:20e0:1499:8303:7502:d3d7:e13b]) by smtp.googlemail.com with ESMTPSA id h7-20020a17090ac38700b0022335f1dae2sm2952386pjt.22.2023.04.14.07.24.39 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 14 Apr 2023 07:24:48 -0700 (PDT) From: Chih-En Lin To: Andrew Morton , Qi Zheng , David Hildenbrand , "Matthew Wilcox (Oracle)" , Christophe Leroy , John Hubbard , Nadav Amit , Barry Song , Pasha Tatashin Cc: Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , "H. Peter Anvin" , Steven Rostedt , Masami Hiramatsu , Peter Zijlstra , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Ian Rogers , Adrian Hunter , Yu Zhao , Steven Barrett , Juergen Gross , Peter Xu , Kefeng Wang , Tong Tiangen , Christoph Hellwig , "Liam R. Howlett" , Yang Shi , Vlastimil Babka , Alex Sierra , Vincent Whitchurch , Anshuman Khandual , Li kunyu , Liu Shixin , Hugh Dickins , Minchan Kim , Joey Gouly , Chih-En Lin , Michal Hocko , Suren Baghdasaryan , "Zach O'Keefe" , Gautam Menghani , Catalin Marinas , Mark Brown , "Eric W. Biederman" , Andrei Vagin , Shakeel Butt , Daniel Bristot de Oliveira , "Jason A. 
Donenfeld" , Greg Kroah-Hartman , Alexey Gladkov , x86@kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Dinglan Peng , Pedro Fonseca , Jim Huang , Huichun Feng Subject: [PATCH v5 03/17] mm: Add Copy-On-Write PTE to fork() Date: Fri, 14 Apr 2023 22:23:27 +0800 Message-Id: <20230414142341.354556-4-shiyn.lin@gmail.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230414142341.354556-1-shiyn.lin@gmail.com> References: <20230414142341.354556-1-shiyn.lin@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-trace-kernel@vger.kernel.org Add copy_cow_pte_range() and recover_pte_range() for copy-on-write (COW) PTE in fork system call. During COW PTE fork, when processing the shared PTE, we traverse all the entries to determine current mapped page is available to share between processes. If PTE can be shared, account those mapped pages and then share the PTE. However, once we find out the mapped page is unavailable, e.g., pinned page, we have to copy it via copy_present_page(), which means that we will fall back to default path, page table copying (copy_pte_range()). And, since we may have already processed some COW-ed PTE entries, before starting the default path, we have to recover those entries. All the COW PTE behaviors are protected by the pte lock. The logic of how we handle nonpresent/present pte entries and error in copy_cow_pte_range() is same as copy_pte_range(). But to keep the codes clean (e.g., avoiding condition lock), we introduce new functions instead of modifying copy_pte_range(). To track the lifetime of COW-ed PTE, introduce the refcount of PTE. We reuse the _refcount in struct page for the page table to maintain the number of process references to COW-ed PTE table. Doing the fork with COW PTE will increase the refcount. And, when someone writes to the COW-ed PTE, it will cause the write fault to break COW PTE. If the refcount of COW-ed PTE is one, the process that triggers the fault will reuse the COW-ed PTE. Otherwise, the process will decrease the refcount and duplicate it. Since we share the PTE between the parent and child, the state of the parent's pte entries is different between COW PTE and the normal fork. COW PTE handles all the pte entries on the child side which means it will clear the dirty and access bit of the parent's pte entry. And, since some of the architectures, e.g., s390 and powerpc32, don't support the PMD entry and PTE table operations, add a new Kconfig, COW_PTE. COW_PTE config depends on the (HAVE_ARCH_TRANSPARENT_HUGEPAGE && !PREEMPT_RT) condition, it is same as the TRANSPARENT_HUGEPAGE config since most of the operations in COW PTE are depend on it. 
Signed-off-by: Chih-En Lin --- include/linux/mm.h | 20 +++ mm/Kconfig | 9 ++ mm/memory.c | 303 +++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 332 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 1f79667824eb..828f8a1b1e32 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2636,6 +2636,23 @@ static inline bool ptlock_init(struct page *page) { return true; } static inline void ptlock_free(struct page *page) {} #endif /* USE_SPLIT_PTE_PTLOCKS */ +#ifdef CONFIG_COW_PTE +static inline int pmd_get_pte(pmd_t *pmd) +{ + return page_ref_inc_return(pmd_page(*pmd)); +} + +static inline bool pmd_put_pte(pmd_t *pmd) +{ + return page_ref_add_unless(pmd_page(*pmd), -1, 1); +} + +static inline int cow_pte_count(pmd_t *pmd) +{ + return page_count(pmd_page(*pmd)); +} +#endif + static inline void pgtable_init(void) { ptlock_cache_init(); @@ -2648,6 +2665,9 @@ static inline bool pgtable_pte_page_ctor(struct page *page) return false; __SetPageTable(page); inc_lruvec_page_state(page, NR_PAGETABLE); +#ifdef CONFIG_COW_PTE + set_page_count(page, 1); +#endif return true; } diff --git a/mm/Kconfig b/mm/Kconfig index 4751031f3f05..0eac8851601b 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -841,6 +841,15 @@ config READ_ONLY_THP_FOR_FS endif # TRANSPARENT_HUGEPAGE +menuconfig COW_PTE + bool "Copy-on-write PTE table" + depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && !PREEMPT_RT + help + Extend the copy-on-write (COW) mechanism to the PTE table + (the bottom level of the page-table hierarchy). To enable this + feature, a process must set prctl(PR_SET_COW_PTE) before the + fork system call. + # # UP and nommu archs use km based percpu allocator # diff --git a/mm/memory.c b/mm/memory.c index 0476cf22ea33..3b1c4a7e632c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -749,11 +749,17 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, unsigned long addr, int *rss) { + /* With COW PTE, dst_vma is src_vma. */ unsigned long vm_flags = dst_vma->vm_flags; pte_t pte = *src_pte; struct page *page; swp_entry_t entry = pte_to_swp_entry(pte); + /* + * If it's COW PTE, parent shares PTE with child. Which means the + * following modifications of child will also affect parent. + */ + if (likely(!non_swap_entry(entry))) { if (swap_duplicate(entry) < 0) return -EIO; @@ -896,6 +902,8 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma /* * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page * is required to copy this pte. + * However, if prealloc is NULL, it is COW PTE. We should return and fall back + * to copy the PTE table. */ static inline int copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, @@ -922,6 +930,14 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) { /* Page may be pinned, we have to copy. */ folio_put(folio); + /* + * If prealloc is NULL, we are processing share page + * table (COW PTE, in copy_cow_pte_range()). We cannot + * call copy_present_page() right now, instead, we + * should fall back to copy_pte_range(). 
+ */ + if (!prealloc) + return -EAGAIN; return copy_present_page(dst_vma, src_vma, dst_pte, src_pte, addr, rss, prealloc, page); } @@ -942,6 +958,11 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, } VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page)); + /* + * If it's COW PTE, parent shares PTE with child. + * Which means the following will also affect parent. + */ + /* * If it's a shared mapping, mark it clean in * the child @@ -950,6 +971,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, pte = pte_mkclean(pte); pte = pte_mkold(pte); + /* For COW PTE, dst_vma is still src_vma. */ if (!userfaultfd_wp(dst_vma)) pte = pte_clear_uffd_wp(pte); @@ -975,6 +997,8 @@ static inline struct folio *page_copy_prealloc(struct mm_struct *src_mm, return new_folio; } + +/* copy_pte_range() will immediately allocate new page table. */ static int copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, @@ -1099,6 +1123,227 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, return ret; } +#ifdef CONFIG_COW_PTE +/* + * copy_cow_pte_range() will try to share the page table with child. + * The logic of non-present, present and error handling is same as + * copy_pte_range() but dst_vma and dst_pte are src_vma and src_pte. + * + * We cannot preserve soft-dirty information, because PTE will share + * between multiple processes. + */ +static int +copy_cow_pte_range(struct vm_area_struct *dst_vma, + struct vm_area_struct *src_vma, + pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, + unsigned long end, unsigned long *recover_end) +{ + struct mm_struct *dst_mm = dst_vma->vm_mm; + struct mm_struct *src_mm = src_vma->vm_mm; + struct vma_iterator vmi; + struct vm_area_struct *curr = src_vma; + pte_t *src_pte, *orig_src_pte; + spinlock_t *src_ptl; + int ret = 0; + int rss[NR_MM_COUNTERS]; + swp_entry_t entry = (swp_entry_t){0}; + unsigned long vm_end, orig_addr = addr; + pgtable_t pte_table = pmd_pgtable(*src_pmd); + + end = (addr + PMD_SIZE) & PMD_MASK; + addr = addr & PMD_MASK; + + /* + * Increase the refcount to prevent the parent's PTE + * dropped/reused. Only increace the refcount at first + * time attached. + */ + src_ptl = pte_lockptr(src_mm, src_pmd); + spin_lock(src_ptl); + pmd_get_pte(src_pmd); + pmd_install(dst_mm, dst_pmd, &pte_table); + spin_unlock(src_ptl); + + /* + * We should handle all of the entries in this PTE at this traversal, + * since we cannot promise that the next vma will not do the lazy fork. + * The lazy fork will skip the copying, which may cause the incomplete + * state of COW-ed PTE. + */ + vma_iter_init(&vmi, src_mm, addr); + for_each_vma_range(vmi, curr, end) { + vm_end = min(end, curr->vm_end); + addr = max(addr, curr->vm_start); + + /* We don't share the PTE with VM_DONTCOPY. */ + if (curr->vm_flags & VM_DONTCOPY) { + *recover_end = addr; + return -EAGAIN; + } +again: + init_rss_vec(rss); + src_pte = pte_offset_map(src_pmd, addr); + src_ptl = pte_lockptr(src_mm, src_pmd); + orig_src_pte = src_pte; + spin_lock(src_ptl); + arch_enter_lazy_mmu_mode(); + + do { + if (pte_none(*src_pte)) + continue; + if (unlikely(!pte_present(*src_pte))) { + /* + * Although, parent's PTE is COW-ed, we should + * still need to handle all the swap stuffs. 
+ */ + ret = copy_nonpresent_pte(dst_mm, src_mm, + src_pte, src_pte, + curr, curr, + addr, rss); + if (ret == -EIO) { + entry = pte_to_swp_entry(*src_pte); + break; + } else if (ret == -EBUSY) { + break; + } else if (!ret) + continue; + /* + * Device exclusive entry restored, continue by + * copying the now present pte. + */ + WARN_ON_ONCE(ret != -ENOENT); + } + /* + * copy_present_pte() will determine the mapped page + * should be COW mapping or not. + */ + ret = copy_present_pte(curr, curr, src_pte, src_pte, + addr, rss, NULL); + /* + * If we need a pre-allocated page for this pte, + * drop the lock, recover all the entries, fall + * back to copy_pte_range(), and try again. + */ + if (unlikely(ret == -EAGAIN)) + break; + } while (src_pte++, addr += PAGE_SIZE, addr != vm_end); + + arch_leave_lazy_mmu_mode(); + add_mm_rss_vec(dst_mm, rss); + spin_unlock(src_ptl); + pte_unmap(orig_src_pte); + cond_resched(); + + if (ret == -EIO) { + VM_WARN_ON_ONCE(!entry.val); + if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) { + ret = -ENOMEM; + goto out; + } + entry.val = 0; + } else if (ret == -EBUSY) { + goto out; + } else if (ret == -EAGAIN) { + /* + * We've to allocate the page immediately but first we + * should recover the processed entries and fall back + * to copy_pte_range(). + */ + *recover_end = addr; + return -EAGAIN; + } else if (ret) { + VM_WARN_ON_ONCE(1); + } + + /* We've captured and resolved the error. Reset, try again. */ + ret = 0; + if (addr != vm_end) + goto again; + } + +out: + /* + * All the pte entries are available to COW mapping. + * Now, we can share with child (COW PTE). + */ + pmdp_set_wrprotect(src_mm, orig_addr, src_pmd); + set_pmd_at(dst_mm, orig_addr, dst_pmd, pmd_wrprotect(*src_pmd)); + + return ret; +} + +/* When recovering the pte entries, we should hold the locks entirely. */ +static int +recover_pte_range(struct vm_area_struct *dst_vma, + struct vm_area_struct *src_vma, + pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long end) +{ + struct mm_struct *dst_mm = dst_vma->vm_mm; + struct mm_struct *src_mm = src_vma->vm_mm; + struct vma_iterator vmi; + struct vm_area_struct *curr = src_vma; + pte_t *orig_src_pte, *orig_dst_pte; + pte_t *src_pte, *dst_pte; + spinlock_t *src_ptl, *dst_ptl; + unsigned long vm_end, addr = end & PMD_MASK; + int ret = 0; + + /* Before we allocate the new PTE, clear the entry. */ + mm_dec_nr_ptes(dst_mm); + pmd_clear(dst_pmd); + if (pte_alloc(dst_mm, dst_pmd)) + return -ENOMEM; + + /* + * Traverse all the vmas that cover this PTE table until + * the end of recover address (unshareable page). + */ + vma_iter_init(&vmi, src_mm, addr); + for_each_vma_range(vmi, curr, end) { + vm_end = min(end, curr->vm_end); + addr = max(addr, curr->vm_start); + + orig_dst_pte = dst_pte = pte_offset_map(dst_pmd, addr); + dst_ptl = pte_lockptr(dst_mm, dst_pmd); + spin_lock(dst_ptl); + + orig_src_pte = src_pte = pte_offset_map(src_pmd, addr); + src_ptl = pte_lockptr(src_mm, src_pmd); + spin_lock(src_ptl); + arch_enter_lazy_mmu_mode(); + + do { + if (pte_none(*src_pte)) + continue; + /* + * COW mapping stuffs (e.g., PageAnonExclusive) + * should already handled by copy_cow_pte_range(). + * We can simply set the entry to the child. + */ + set_pte_at(dst_mm, addr, dst_pte, *src_pte); + } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end); + + arch_leave_lazy_mmu_mode(); + spin_unlock(src_ptl); + pte_unmap(orig_src_pte); + + spin_unlock(dst_ptl); + pte_unmap(orig_dst_pte); + } + /* + * After recovering the entries, release the holding from child. 
+ * Parent may still share with others, so don't make it writeable. + */ + spin_lock(src_ptl); + pmd_put_pte(src_pmd); + spin_unlock(src_ptl); + + cond_resched(); + + return ret; +} +#endif /* CONFIG_COW_PTE */ + static inline int copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, pud_t *dst_pud, pud_t *src_pud, unsigned long addr, @@ -1127,6 +1372,64 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, continue; /* fall through */ } + +#ifdef CONFIG_COW_PTE + /* + * If MMF_COW_PTE set, copy_pte_range() will try to share + * the PTE page table first. In other words, it attempts to + * do COW on PTE (and mapped pages). However, if there has + * any unshareable page (e.g., pinned page, device private + * page), it will fall back to the default path, which will + * copy the page table immediately. + * In such a case, it stores the address of first unshareable + * page to recover_end then goes back to the beginning of PTE + * and recovers the COW-ed PTE entries until it meets the same + * unshareable page again. During the recovering, because of + * COW-ed PTE entries are logical same as COW mapping, so it + * only needs to allocate the new PTE and sets COW-ed PTE + * entries to new PTE (which will be same as COW mapping). + */ + if (test_bit(MMF_COW_PTE, &src_mm->flags)) { + unsigned long recover_end = 0; + int ret; + + /* + * Setting wrprotect with normal PTE to pmd entry + * will trigger pmd_bad(). Skip bad checking here. + */ + if (pmd_none(*src_pmd)) + continue; + /* Skip if the PTE already did COW PTE this time. */ + if (!pmd_none(*dst_pmd) && !pmd_write(*dst_pmd)) + continue; + + ret = copy_cow_pte_range(dst_vma, src_vma, + dst_pmd, src_pmd, + addr, next, &recover_end); + if (!ret) { + /* COW PTE succeeded. */ + continue; + } else if (ret == -EAGAIN) { + /* fall back to normal copy method. */ + if (recover_pte_range(dst_vma, src_vma, + dst_pmd, src_pmd, + recover_end)) + return -ENOMEM; + /* + * Since we processed all the entries of PTE + * table, recover_end may not in the src_vma. + * If we already handled the src_vma, skip it. 
+ */ + if (!range_in_vma(src_vma, recover_end, + recover_end + PAGE_SIZE)) + continue; + else + addr = recover_end; + /* fall through */ + } else if (ret) + return -ENOMEM; + } +#endif /* CONFIG_COW_PTE */ if (pmd_none_or_clear_bad(src_pmd)) continue; if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd, From patchwork Fri Apr 14 14:23:28 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chih-En Lin X-Patchwork-Id: 13211630 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 92F96C77B6E for ; Fri, 14 Apr 2023 14:25:38 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230473AbjDNOZh (ORCPT ); Fri, 14 Apr 2023 10:25:37 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48994 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230431AbjDNOZP (ORCPT ); Fri, 14 Apr 2023 10:25:15 -0400 Received: from mail-pj1-x1035.google.com (mail-pj1-x1035.google.com [IPv6:2607:f8b0:4864:20::1035]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D5A8EC151; Fri, 14 Apr 2023 07:24:58 -0700 (PDT) Received: by mail-pj1-x1035.google.com with SMTP id my14-20020a17090b4c8e00b0024708e8e2ddso7875796pjb.4; Fri, 14 Apr 2023 07:24:58 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1681482298; x=1684074298; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=iaw2VxgI5AOS+jKxtvax6AkU7/SfikwF+oX34AlhCVI=; b=rJFjdOiiKNNXrMgf0f7hQQpzgbUfjp/YzEEFdDEchHIG1R7cyEuRP4vZ4X9QAYdQpO 6ajhP80GcIbDgZosE/SHJrXcdEIWgu1h59eQB9X95h+n/Q64EqRJ9GBz/j5e2RopGOw8 2TZGBt+XhJegu0jCPROYu4kJFKMzLj1plc2+7zCv4uJc4IB6J+PwDFXwfIcVXUKYTd9j e1sihKOvAQYVMlwMH2YoKpSFwmUJogkEfcn9vi94ebsv/i3E/NbE9mmv0GyDf0/lbGNh Y22w1vLhGvDkCwXa1vqYP4rXIRfndQd7a3n9wiQYvVBRivfsTUyvOh0y+wdRkTC+0zMs olIA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1681482298; x=1684074298; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=iaw2VxgI5AOS+jKxtvax6AkU7/SfikwF+oX34AlhCVI=; b=Wfs1av5IdqJwsFVmy+KnnSUgbQZVe8XMILUw6bzXir/VN/QujS9fPhuTqmd9cfFz+B x4l2tql6LsRvqRl5GsULViH2Ykj/vPNk46adrAydriHXFGGcq5IjI3Ge0X/sk5VlwpAJ Lg2YXVTAOFqcfNMt9Hkg0ZAE/UWB8orLKFs/qmnsCRx4oB6mOu4wUsQEpYUGpdE9HWKS nc3glP2sV2FE4nh7NNu7gpjMG5Sutp6Ri/UpKtrvdJhjTHpfH3/5X5Yrw3lgre9ucalu Z9XnHls2/O6sW5mfszEJTl5zZh8BIn/oF5A+uJdJ/ixPpPzns8OaBRPZ0cLkzMHL3q2V tSqQ== X-Gm-Message-State: AAQBX9eqKdtyVUoBl1/bexJWK4y36aGPemHZk9BN6wEwAj/qXCRu5J9/ Sj0UT8ao9/pkQuDHzDHAklk= X-Google-Smtp-Source: AKy350a6UNo1RjXOl4dyrR1EDggOcNns0y0a21eWdH/247G7ARNprz1vznZe7RgwqXKLT+0DQYVdyA== X-Received: by 2002:a17:90b:4d8d:b0:240:67d5:aea1 with SMTP id oj13-20020a17090b4d8d00b0024067d5aea1mr6379154pjb.14.1681482298053; Fri, 14 Apr 2023 07:24:58 -0700 (PDT) Received: from strix-laptop.. (2001-b011-20e0-1499-8303-7502-d3d7-e13b.dynamic-ip6.hinet.net. 
[2001:b011:20e0:1499:8303:7502:d3d7:e13b]) by smtp.googlemail.com with ESMTPSA id h7-20020a17090ac38700b0022335f1dae2sm2952386pjt.22.2023.04.14.07.24.49 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 14 Apr 2023 07:24:57 -0700 (PDT) From: Chih-En Lin To: Andrew Morton , Qi Zheng , David Hildenbrand , "Matthew Wilcox (Oracle)" , Christophe Leroy , John Hubbard , Nadav Amit , Barry Song , Pasha Tatashin Cc: Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , "H. Peter Anvin" , Steven Rostedt , Masami Hiramatsu , Peter Zijlstra , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Ian Rogers , Adrian Hunter , Yu Zhao , Steven Barrett , Juergen Gross , Peter Xu , Kefeng Wang , Tong Tiangen , Christoph Hellwig , "Liam R. Howlett" , Yang Shi , Vlastimil Babka , Alex Sierra , Vincent Whitchurch , Anshuman Khandual , Li kunyu , Liu Shixin , Hugh Dickins , Minchan Kim , Joey Gouly , Chih-En Lin , Michal Hocko , Suren Baghdasaryan , "Zach O'Keefe" , Gautam Menghani , Catalin Marinas , Mark Brown , "Eric W. Biederman" , Andrei Vagin , Shakeel Butt , Daniel Bristot de Oliveira , "Jason A. Donenfeld" , Greg Kroah-Hartman , Alexey Gladkov , x86@kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Dinglan Peng , Pedro Fonseca , Jim Huang , Huichun Feng Subject: [PATCH v5 04/17] mm: Add break COW PTE fault and helper functions Date: Fri, 14 Apr 2023 22:23:28 +0800 Message-Id: <20230414142341.354556-5-shiyn.lin@gmail.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230414142341.354556-1-shiyn.lin@gmail.com> References: <20230414142341.354556-1-shiyn.lin@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-trace-kernel@vger.kernel.org Add the function handle_cow_pte_fault() to break (unshare) a COW-ed PTE table on a page fault that will modify the PTE table or a mapped page residing in the COW-ed PTE table (i.e., a write, unshare, or file read fault). When breaking COW PTE, it first checks the refcount of the COW-ed PTE table and tries to reuse it. If the COW-ed PTE table cannot be reused, it allocates a new PTE table and duplicates all the pte entries of the COW-ed one. Moreover, flush the TLB when we change the write protection of the PTE. In addition, provide the helper functions break_cow_pte{,_range}() so that other features (remap, THP, migration, swapfile, etc.) can use them.
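Condensed, the fault path described above behaves roughly as follows (a simplified paraphrase of handle_cow_pte_fault() introduced below; locking, mmu notifiers, seqcount handling and error paths are omitted, so this is a reading aid rather than the actual implementation):

/* Sketch only: see handle_cow_pte_fault() in the patch below for the real code. */
static int break_cow_pte_sketch(struct vm_area_struct *vma, pmd_t *pmd,
				unsigned long addr)
{
	struct mm_struct *mm = vma->vm_mm;
	unsigned long start = addr & PMD_MASK;
	pgtable_t new_table;

	if (cow_pte_count(pmd) == 1) {
		/* We are the last user: make the COW-ed PTE table writable again. */
		set_pmd_at(mm, start, pmd, pmd_mkwrite(*pmd));
	} else {
		/* Still shared: allocate a private table and copy the entries over. */
		new_table = pte_alloc_one(mm);
		if (!new_table)
			return -ENOMEM;
		/* ... duplicate every present entry from the COW-ed table ... */
		pmd_put_pte(pmd);			/* drop our reference */
		pmd_populate(mm, pmd, new_table);
	}
	/* The write protection of the pmd entry changed, so flush the TLB. */
	flush_tlb_range(vma, start, start + PMD_SIZE);
	return 0;
}
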
Signed-off-by: Chih-En Lin --- include/linux/mm.h | 17 +++ include/linux/pgtable.h | 6 + mm/memory.c | 318 +++++++++++++++++++++++++++++++++++++++- mm/mmap.c | 4 + mm/mremap.c | 2 + mm/swapfile.c | 2 + 6 files changed, 348 insertions(+), 1 deletion(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 828f8a1b1e32..b4c9658ccd28 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2179,6 +2179,23 @@ void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to); void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end); int generic_error_remove_page(struct address_space *mapping, struct page *page); +#ifdef CONFIG_COW_PTE +int break_cow_pte(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr); +int break_cow_pte_range(struct vm_area_struct *vma, unsigned long start, + unsigned long end); +#else +static inline int break_cow_pte(struct vm_area_struct *vma, + pmd_t *pmd, unsigned long addr) +{ + return 0; +} +static inline int break_cow_pte_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end) +{ + return 0; +} +#endif + #ifdef CONFIG_MMU extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, unsigned int flags, diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index c63cd44777ec..f177a9d48b70 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1378,6 +1378,12 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd) if (pmd_none(pmdval) || pmd_trans_huge(pmdval) || (IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION) && !pmd_present(pmdval))) return 1; + /* + * COW-ed PTE has write protection which can trigger pmd_bad(). + * To avoid this, return here if entry is write protection. + */ + if (!pmd_write(pmdval)) + return 0; if (unlikely(pmd_bad(pmdval))) { pmd_clear_bad(pmd); return 1; diff --git a/mm/memory.c b/mm/memory.c index 3b1c4a7e632c..f8a87a0fc382 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2166,6 +2166,8 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr, if (retval) goto out; retval = -ENOMEM; + if (break_cow_pte(vma, NULL, addr)) + goto out; pte = get_locked_pte(vma->vm_mm, addr, &ptl); if (!pte) goto out; @@ -2425,6 +2427,9 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr, pte_t *pte, entry; spinlock_t *ptl; + if (break_cow_pte(vma, NULL, addr)) + return VM_FAULT_OOM; + pte = get_locked_pte(mm, addr, &ptl); if (!pte) return VM_FAULT_OOM; @@ -2802,6 +2807,10 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr, BUG_ON(addr >= end); pfn -= addr >> PAGE_SHIFT; pgd = pgd_offset(mm, addr); + + if (break_cow_pte_range(vma, addr, end)) + return -ENOMEM; + flush_cache_range(vma, addr, end); do { next = pgd_addr_end(addr, end); @@ -5192,6 +5201,285 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud) return VM_FAULT_FALLBACK; } +#ifdef CONFIG_COW_PTE +/* + * Break (unshare) COW PTE + * + * Since the pte lock is held during all operations on the COW-ed PTE + * table, it should be safe to modify it's pmd entry as well, provided + * it has been ensured that the pmd entry points to a COW-ed PTE table + * rather than a huge page or default PTE. Otherwise, we should also + * consider holding the pmd lock as we do for the huge page. 
+ */ +static vm_fault_t handle_cow_pte_fault(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; + struct mm_struct *mm = vma->vm_mm; + pmd_t *pmd = vmf->pmd; + unsigned long start, end, addr = vmf->address; + struct mmu_notifier_range range; + pmd_t new_entry, cowed_entry; + pte_t *orig_dst_pte, *orig_src_pte; + pte_t *dst_pte, *src_pte; + pgtable_t new_pte_table = NULL; + spinlock_t *src_ptl; + int ret = 0; + + /* Do nothing with the fault that doesn't have PTE yet. */ + if (pmd_none(*pmd) || pmd_write(*pmd)) + return 0; + /* COW PTE doesn't handle huge page. */ + if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) + return 0; + + start = addr & PMD_MASK; + end = (addr + PMD_SIZE) & PMD_MASK; + addr = start; + + mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE, + 0, vma, mm, start, end); + /* + * Because of the address range is PTE not only for the faulted + * vma, it might have some unmatch situations since mmu notifier + * will only reigster the faulted vma. + * Do we really need to care about this kind of unmatch? + */ + mmu_notifier_invalidate_range_start(&range); + raw_write_seqcount_begin(&mm->write_protect_seq); + + /* + * Fast path, check if we are the only one faulted task + * references to this COW-ed PTE, reuse it. + */ + src_pte = pte_offset_map(pmd, addr); + src_ptl = pte_lockptr(mm, pmd); + spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING); + if (cow_pte_count(pmd) == 1) { + pmd_t new = pmd_mkwrite(*pmd); + set_pmd_at(mm, addr, pmd, new); + pte_unmap_unlock(src_pte, src_ptl); + goto flush_tlb; + } + /* We don't hold the lock when allocating the new PTE. */ + pte_unmap_unlock(src_pte, src_ptl); + + /* + * Slow path. Since we already did the accounting and still + * sharing the mapped pages, we can just clone PTE. + */ + + /* + * Before acquiring the lock, we allocate the memory we may + * possibly need. + */ + new_pte_table = pte_alloc_one(mm); + if (unlikely(!new_pte_table)) { + ret = -ENOMEM; + goto out; + } + + /* + * To protect the pte table from the rmap and page table walk, + * we should hold the lock of COW-ed PTE until all the operations + * have been done including setting pmd entry, duplicating, and + * decrease refcount. + */ + orig_src_pte = src_pte = pte_offset_map(pmd, addr); + src_ptl = pte_lockptr(mm, pmd); + spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING); + + /* Before pouplate the new pte table, we store the cowed (old) one. */ + cowed_entry = READ_ONCE(*pmd); + + /* + * Someone may also break COW PTE when we allocating the pte table. + * So, let check refcount again. + */ + if (cow_pte_count(&cowed_entry) == 1) { + pmd_t new = pmd_mkwrite(*pmd); + set_pmd_at(mm, addr, pmd, new); + pte_unmap_unlock(src_pte, src_ptl); + goto flush_tlb; + } + + /* + * We will only set the new pte table to the pmd entry after finish + * all the duplicating. + * We first store the new table in another pmd entry even though we + * have held the COW-ed PTE's lock. This is because, if we clear the + * pmd entry assigned to the COW-ed PTe table, other places (e.g., + * another page fault) may allocate an empty PTe table, leading to + * potential issues. + */ + pmd_clear(&new_entry); + pmd_populate(mm, &new_entry, new_pte_table); + /* + * No one else excluding us can access to this new table, so we don't + * have to hold the second pte lock. + */ + orig_dst_pte = dst_pte = pte_offset_map(&new_entry, addr); + + arch_enter_lazy_mmu_mode(); + + /* + * All the mapped pages in COW-ed PTE are COW mapping. 
We can + * set the entries and leave other stuff to handle_pte_fault(). + */ + do { + if (pte_none(*src_pte)) + continue; + set_pte_at(mm, addr, dst_pte, *src_pte); + } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end); + + arch_leave_lazy_mmu_mode(); + + pte_unmap(orig_dst_pte); + + /* + * Decrease the refcount of COW-ed PTE. + * In this path, we assume that someone is still using COW-ed PTE. + * So, if the refcount is 1 before we decrease it, this might be + * wrong. + */ + VM_WARN_ON(!pmd_put_pte(&cowed_entry)); + VM_WARN_ON(!pmd_same(*pmd, cowed_entry)); + + /* Now, we can finally install the new PTE table to the pmd entry. */ + set_pmd_at(mm, start, pmd, new_entry); + /* + * We installed the new table, let cleanup the new_pte_table + * variable to prevent pte_free() free it in the following. + */ + new_pte_table = NULL; + pte_unmap_unlock(orig_src_pte, src_ptl); + +flush_tlb: + /* + * If we change the protection, flush TLB. + * flush_tlb_range() will only use vma to get mm, we don't need + * to consider the unmatch address range with vma problem here. + * + * Should we flush TLB when holding the pte lock? + */ + flush_tlb_range(vma, start, end); +out: + raw_write_seqcount_end(&mm->write_protect_seq); + mmu_notifier_invalidate_range_end(&range); + + if (new_pte_table) + pte_free(mm, new_pte_table); + + return ret; +} + +static inline int __break_cow_pte(struct vm_area_struct *vma, pmd_t *pmd, + unsigned long addr) +{ + struct vm_fault vmf = { + .vma = vma, + .address = addr & PAGE_MASK, + .pmd = pmd, + }; + + return handle_cow_pte_fault(&vmf); +} + +/** + * break_cow_pte - duplicate/reuse shared, wprotected (COW-ed) PTE + * @vma: target vma want to break COW + * @pmd: pmd index that maps to the shared PTE + * @addr: the address trigger break COW PTE + * + * Return: zero on success, < 0 otherwise. + * + * The address needs to be in the range of shared and write portected + * PTE that the pmd index mapped. If pmd is NULL, it will get the pmd + * from vma. Duplicate COW-ed PTE when some still mapping to it. + * Otherwise, reuse COW-ed PTE. + * If the first attempt fails, it will wait for some time and try + * again. If it fails again, then the OOM killer will be called. + */ +int break_cow_pte(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr) +{ + struct mm_struct *mm; + pgd_t *pgd; + p4d_t *p4d; + pud_t *pud; + int ret = 0; + + if (!vma) + return -EINVAL; + mm = vma->vm_mm; + + if (!test_bit(MMF_COW_PTE, &mm->flags)) + return 0; + + if (!pmd) { + pgd = pgd_offset(mm, addr); + if (pgd_none_or_clear_bad(pgd)) + return 0; + p4d = p4d_offset(pgd, addr); + if (p4d_none_or_clear_bad(p4d)) + return 0; + pud = pud_offset(p4d, addr); + if (pud_none_or_clear_bad(pud)) + return 0; + pmd = pmd_offset(pud, addr); + } + + /* We will check the type of pmd entry later. */ + + ret = __break_cow_pte(vma, pmd, addr); + + if (unlikely(ret == -ENOMEM)) { + unsigned int cow_pte_alloc_sleep_millisecs = 60000; + + schedule_timeout(msecs_to_jiffies( + cow_pte_alloc_sleep_millisecs)); + + ret = __break_cow_pte(vma, pmd, addr); + if (unlikely(ret == -ENOMEM)) { + struct oom_control oc = { + .gfp_mask = GFP_PGTABLE_USER, + }; + + mutex_lock(&oom_lock); + out_of_memory(&oc); + mutex_unlock(&oom_lock); + } + } + + return ret; +} + +/** + * break_cow_pte_range - duplicate/reuse COW-ed PTE in a given range + * @vma: target vma want to break COW + * @start: the address of start breaking + * @end: the address of end breaking + * + * Return: zero on success, the number of failed otherwise. 
+ */ +int break_cow_pte_range(struct vm_area_struct *vma, unsigned long start, + unsigned long end) +{ + unsigned long addr, next; + int nr_failed = 0; + + if (!range_in_vma(vma, start, end)) + return -EINVAL; + + addr = start; + do { + next = pmd_addr_end(addr, end); + if (break_cow_pte(vma, NULL, addr)) + nr_failed++; + } while (addr = next, addr != end); + + return nr_failed; +} +#endif /* CONFIG_COW_PTE */ + /* * These routines also need to handle stuff like marking pages dirty * and/or accessed for architectures that don't do it in hardware (most @@ -5267,8 +5555,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) return do_fault(vmf); } - if (!pte_present(vmf->orig_pte)) + if (!pte_present(vmf->orig_pte)) { +#ifdef CONFIG_COW_PTE + if (test_bit(MMF_COW_PTE, &vmf->vma->vm_mm->flags)) + handle_cow_pte_fault(vmf); +#endif return do_swap_page(vmf); + } if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) return do_numa_page(vmf); @@ -5404,8 +5697,31 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, return 0; } } +#ifdef CONFIG_COW_PTE + /* + * Duplicate COW-ed PTE when page fault will change the + * mapped pages (write or unshared fault) or COW-ed PTE + * (file mapped read fault, see do_read_fault()). + */ + if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE) || + vma->vm_ops) && test_bit(MMF_COW_PTE, &mm->flags)) { + ret = handle_cow_pte_fault(&vmf); + if (unlikely(ret == -ENOMEM)) + return VM_FAULT_OOM; + } +#endif } +#ifdef CONFIG_COW_PTE + /* + * It's definitely will break the kernel when refcount of PTE + * is higher than 1 and it is writeable in PMD entry. But we + * want to see more information so just warning here. + */ + if (likely(!pmd_none(*vmf.pmd))) + VM_WARN_ON(cow_pte_count(vmf.pmd) > 1 && pmd_write(*vmf.pmd)); +#endif + return handle_pte_fault(&vmf); } diff --git a/mm/mmap.c b/mm/mmap.c index ff68a67a2a7c..ac1002e85d88 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2169,6 +2169,10 @@ int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, return err; } + err = break_cow_pte(vma, NULL, addr); + if (err) + return err; + new = vm_area_dup(vma); if (!new) return -ENOMEM; diff --git a/mm/mremap.c b/mm/mremap.c index 411a85682b58..0668e9ead65a 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -534,6 +534,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma, old_pmd = get_old_pmd(vma->vm_mm, old_addr); if (!old_pmd) continue; + /* TLB flush twice time here? 
*/ + break_cow_pte(vma, old_pmd, old_addr); new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr); if (!new_pmd) break; diff --git a/mm/swapfile.c b/mm/swapfile.c index 2c718f45745f..b7aa880957fd 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1919,6 +1919,8 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud, next = pmd_addr_end(addr, end); if (pmd_none_or_trans_huge_or_clear_bad(pmd)) continue; + if (break_cow_pte(vma, pmd, addr)) + return -ENOMEM; ret = unuse_pte_range(vma, pmd, addr, next, type); if (ret) return ret; From patchwork Fri Apr 14 14:23:29 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chih-En Lin X-Patchwork-Id: 13211631 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6C7B3C77B72 for ; Fri, 14 Apr 2023 14:25:53 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230498AbjDNOZw (ORCPT ); Fri, 14 Apr 2023 10:25:52 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49434 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230499AbjDNOZf (ORCPT ); Fri, 14 Apr 2023 10:25:35 -0400 Received: from mail-pj1-x1035.google.com (mail-pj1-x1035.google.com [IPv6:2607:f8b0:4864:20::1035]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3A945C159; Fri, 14 Apr 2023 07:25:08 -0700 (PDT) Received: by mail-pj1-x1035.google.com with SMTP id z11-20020a17090abd8b00b0024721c47ceaso4803623pjr.3; Fri, 14 Apr 2023 07:25:08 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1681482307; x=1684074307; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=dV/mcZq3W0H+ZVjv4/WnTWEPlbKKMR+weOFNfLWMeuE=; b=BQzfiDtPhE/M68lDOGSaU8+eH4GXZloBn+mcpK/aB1N6LYOdAjEHlkaXhXyCfvnuvb CEpHvh+6L7eTtlX4yjqbuSmzGdE8fDpHFFYGnZbcZ6MNJT9ylYo2GZtipXJ+6+UzOHTP D9tDXMnJaFlxK7VzjklsrLI+Av/83wf+aOZbs0pt3RSvdic9m6nNNQPQNVFeZpQ+CdQC 995Zkm9AnPgmLkqehhkRv+/kwi2qGfOnf/V+BnSYKMp7WXLrh660WuQiac7blN5+7+zM OxDurW2yAjeOkZJcY9VilooGc5d9qjlhXDXtnE3QoGRPpuKLJN+X+vr5d2OKSx411qOk mH+A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1681482307; x=1684074307; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=dV/mcZq3W0H+ZVjv4/WnTWEPlbKKMR+weOFNfLWMeuE=; b=PncXfUVSv4SOYf3jdUX0J4/cn3EljQIX9CeIJfVVkBnOldqJcpOoqmRqK+GFyUxA2/ uGuGxRxk0rb2AAg14iqE3Laa/F15zI0hQpFWxh2wKbqelbZF5+c3r5XDcoNh/Gc8xbqX nZcFa7mAlSekRCI5ckUlK1rBbI/hn/9vnrJsQcORP22xXD834IT++R32bgYtfNn9STgJ Qd6Ozlp87AZfXUTzEMJfUvnFTBHUXIuhgoN2CSITq2cZ7OEC8503x7ckP3LRVY8o0d0H 2bR7i9oOw1BhGD/39gmunyZDqslab/v1FlfEM3HK6lRAzYbGUSU2QcHMZpsItktuKAIq Qg2g== X-Gm-Message-State: AAQBX9fILVNpyWWN90rMbvcZoeligUZojvtgYqb5EvDnPOD0asfqCa68 hz7AJ9CMmTfF/tCFM6t+Img= X-Google-Smtp-Source: AKy350aLkOiAuG932gYmw+BNYfvRwYLr6hT7jeWvEtEaIuQt7YrDBI1fGEqKpu+mP4tyXM4ARraeiA== X-Received: by 2002:a17:902:ce89:b0:19f:2dff:21a4 with SMTP id f9-20020a170902ce8900b0019f2dff21a4mr3285427plg.16.1681482307277; Fri, 14 Apr 2023 07:25:07 -0700 (PDT) Received: from strix-laptop.. (2001-b011-20e0-1499-8303-7502-d3d7-e13b.dynamic-ip6.hinet.net. 
) by smtp.googlemail.com with ESMTPSA; Fri, 14 Apr 2023 07:25:06 -0700 (PDT)
From: Chih-En Lin
Subject: [PATCH v5 05/17] mm: Handle COW-ed PTE during zapping
Date: Fri, 14 Apr 2023 22:23:29 +0800
Message-Id: <20230414142341.354556-6-shiyn.lin@gmail.com>
In-Reply-To: <20230414142341.354556-1-shiyn.lin@gmail.com>
References: <20230414142341.354556-1-shiyn.lin@gmail.com>

To support zapping of COW-ed PTE tables, we need to zap the entire PTE
table at once instead of partially zapping pages. So, if the zap range
covers the whole PTE table, we can handle the de-accounting, rmap
removal, and so on. However, we must not modify the entries while other
processes still reference the COW-ed PTE table. If the zapping process
is the only one referencing it, we simply reuse the table and do the
normal zapping.
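As a rough illustration of that decision, here is a small stand-alone
userspace sketch (not kernel code; struct pte_table, cow_refcount and
zap_cow_pte_table() are stand-ins invented for the example, not symbols
from this series):

    /*
     * Illustrative model of the zap decision described above: reuse the
     * table when we are its only user, otherwise only drop our reference
     * and leave the shared entries untouched.
     */
    #include <stdio.h>

    struct pte_table {
        int cow_refcount;   /* how many PMD entries share this table */
        int nr_mapped;      /* pages still mapped by this table */
    };

    static void zap_cow_pte_table(struct pte_table *t)
    {
        if (t->cow_refcount == 1) {
            /* Only the zapping process uses it: reuse and zap normally. */
            t->nr_mapped = 0;
            printf("reused table, zapped normally\n");
        } else {
            /* Others still share it: detach, do not touch their entries. */
            t->cow_refcount--;
            printf("detached, %d owner(s) remain\n", t->cow_refcount);
        }
    }

    int main(void)
    {
        struct pte_table shared    = { .cow_refcount = 2, .nr_mapped = 8 };
        struct pte_table exclusive = { .cow_refcount = 1, .nr_mapped = 8 };

        zap_cow_pte_table(&shared);     /* detached, 1 owner(s) remain */
        zap_cow_pte_table(&exclusive);  /* reused table, zapped normally */
        return 0;
    }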
Signed-off-by: Chih-En Lin --- mm/memory.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 87 insertions(+), 5 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index f8a87a0fc382..7908e20f802a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -192,6 +192,12 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud, pmd = pmd_offset(pud, addr); do { next = pmd_addr_end(addr, end); +#ifdef CONFIG_COW_PTE + if (test_bit(MMF_COW_PTE, &tlb->mm->flags)) { + if (!pmd_none(*pmd) && !pmd_write(*pmd)) + VM_WARN_ON(cow_pte_count(pmd) != 1); + } +#endif if (pmd_none_or_clear_bad(pmd)) continue; free_pte_range(tlb, pmd, addr); @@ -1656,6 +1662,7 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma, #define ZAP_PTE_INIT 0x0000 #define ZAP_PTE_FORCE_FLUSH 0x0001 +#define ZAP_PTE_IS_SHARED 0x0002 struct zap_pte_details { pte_t **pte; @@ -1681,9 +1688,13 @@ zap_present_pte(struct mmu_gather *tlb, struct vm_area_struct *vma, if (unlikely(!should_zap_page(details, page))) return 0; - ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); + if (pte_details->flags & ZAP_PTE_IS_SHARED) + ptent = ptep_get(pte); + else + ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); tlb_remove_tlb_entry(tlb, pte, addr); - zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent); + if (!(pte_details->flags & ZAP_PTE_IS_SHARED)) + zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent); if (unlikely(!page)) return 0; @@ -1767,8 +1778,10 @@ zap_nopresent_pte(struct mmu_gather *tlb, struct vm_area_struct *vma, /* We should have covered all the swap entry types */ WARN_ON_ONCE(1); } - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); - zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent); + if (!(pte_details->flags & ZAP_PTE_IS_SHARED)) { + pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); + zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent); + } } static unsigned long zap_pte_range(struct mmu_gather *tlb, @@ -1785,6 +1798,36 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, .flags = ZAP_PTE_INIT, .pte = &pte, }; +#ifdef CONFIG_COW_PTE + unsigned long orig_addr = addr; + + if (test_bit(MMF_COW_PTE, &mm->flags) && !pmd_write(*pmd)) { + if (!range_in_vma(vma, addr & PMD_MASK, + (addr + PMD_SIZE) & PMD_MASK)) { + /* + * We cannot promise this COW-ed PTE will also be zap + * with the rest of VMAs. So, break COW PTE here. + */ + break_cow_pte(vma, pmd, addr); + } else { + /* + * We free the batched memory before we handle + * COW-ed PTE. + */ + tlb_flush_mmu(tlb); + end = (addr + PMD_SIZE) & PMD_MASK; + addr = addr & PMD_MASK; + start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl); + if (cow_pte_count(pmd) == 1) { + /* Reuse COW-ed PTE */ + pmd_t new = pmd_mkwrite(*pmd); + set_pmd_at(tlb->mm, addr, pmd, new); + } else + pte_details.flags |= ZAP_PTE_IS_SHARED; + pte_unmap_unlock(start_pte, ptl); + } + } +#endif tlb_change_page_size(tlb, PAGE_SIZE); again: @@ -1828,7 +1871,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, */ if (pte_details.flags & ZAP_PTE_FORCE_FLUSH) { pte_details.flags &= ~ZAP_PTE_FORCE_FLUSH; - tlb_flush_mmu(tlb); + /* + * With COW-ed PTE, we defer freeing the batched memory until + * after we have actually cleared the COW-ed PTE's pmd entry. + * Since, if we are the only ones still referencing the COW-ed + * PTe table after we have freed the batched memory, the page + * table check will report a bug with anon_map_count != 0 in + * page_table_check_zero(). 
+ */ + if (!(pte_details.flags & ZAP_PTE_IS_SHARED)) + tlb_flush_mmu(tlb); } if (addr != end) { @@ -1836,6 +1888,36 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, goto again; } +#ifdef CONFIG_COW_PTE + if (pte_details.flags & ZAP_PTE_IS_SHARED) { + start_pte = pte_offset_map_lock(mm, pmd, orig_addr, &ptl); + if (!pmd_put_pte(pmd)) { + pmd_t new = pmd_mkwrite(*pmd); + set_pmd_at(tlb->mm, addr, pmd, new); + /* + * We are the only ones who still referencing this. + * Clear the page table check before we free the + * batched memory. + */ + page_table_check_pte_clear_range(mm, orig_addr, *pmd); + pte_unmap_unlock(start_pte, ptl); + /* free the batched memory and flush the TLB. */ + tlb_flush_mmu(tlb); + free_pte_range(tlb, pmd, addr); + } else { + pmd_clear(pmd); + pte_unmap_unlock(start_pte, ptl); + mm_dec_nr_ptes(tlb->mm); + /* + * Someone still referencing to the table, + * we just flush TLB here. + */ + flush_tlb_range(vma, addr & PMD_MASK, + (addr + PMD_SIZE) & PMD_MASK); + } + } +#endif + return addr; } From patchwork Fri Apr 14 14:23:30 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chih-En Lin X-Patchwork-Id: 13211632 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 16111C77B72 for ; Fri, 14 Apr 2023 14:26:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230431AbjDNO0F (ORCPT ); Fri, 14 Apr 2023 10:26:05 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49556 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230491AbjDNOZl (ORCPT ); Fri, 14 Apr 2023 10:25:41 -0400 Received: from mail-pl1-x636.google.com (mail-pl1-x636.google.com [IPv6:2607:f8b0:4864:20::636]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 40F0BC174; Fri, 14 Apr 2023 07:25:17 -0700 (PDT) Received: by mail-pl1-x636.google.com with SMTP id n17so2474695pln.8; Fri, 14 Apr 2023 07:25:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1681482316; x=1684074316; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=NclmCECex/CFHpfM/6wkOB23PF4WtO4cEeKh/pps1Cs=; b=V/AYfRJNaMRNYL9IFUoiq8cp7WDMTq2GdsZvRBqNgIQMjdcljye7UNo4NkMi4droC4 g5yT7eVZsB6DEJwhMaSh3Br5JzCqVkrbV1+D6X5yYOdeaazRSZc17CpFg5tcLGYOjwQT 97ybU/LG3FFdXqYCQI8mH9xuTMVWg1lxOxMNJFuV2OvfA9O6XB/Fg78pyV5CD0wc+3v4 BUCcN4TLNpn2PMDRaTahfhVrMZNCmzTbW/+ugSX+IsLMaYMGzhOgUaEZ+bYBSMuxem1z mpC8ySmb9LnKke2HRfRgo/xQlUYMkO3YtXm14ThPJznVqyVcaH20JK5Gwte/rlNF4GH/ 3Irw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1681482316; x=1684074316; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=NclmCECex/CFHpfM/6wkOB23PF4WtO4cEeKh/pps1Cs=; b=hFRxCaokBTIcSxQPcIA8ssdBIleLZVURsfkAo3+eg5koiv6I7tsdJgF/FxlV3hAIEF 926WfCaRz3x2Rgsna1SaG8pP0Q24lG9+AMc1e8tPprb8ZKmArLzXaSc8uGOjdLCa4X23 H0BgOxcW+IszZaByY+uAR4NEbshXCJPXASzPRSaH/1x6HZosBQEHiHYxbhVixdoX3xd/ 8b0ZiOj62d2ocl2FoWQszLZHSD1hGtG4Il3HY5NlX8R2Mpl3uUAKy8K9PD0vQU5W2KwV M2DeFO5wsityzxZqVrotTRZ+u/Xb09uS0o9GusUMqbcjyFlJCf1fOj6azOwcUs1SpZmf rwvw== X-Gm-Message-State: 
From: Chih-En Lin
Subject: [PATCH v5 06/17] mm/rmap: Break COW PTE in rmap walking
Date: Fri, 14 Apr 2023 22:23:30 +0800
Message-Id: <20230414142341.354556-7-shiyn.lin@gmail.com>
In-Reply-To: <20230414142341.354556-1-shiyn.lin@gmail.com>
References: <20230414142341.354556-1-shiyn.lin@gmail.com>

Some operations (unmap, migrate, device exclusive, mkclean, etc.) may
modify a PTE entry via rmap. Add a new page vma mapped walk flag,
PVMW_BREAK_COW_PTE, to tell the rmap walk to break the COW-ed PTE
table first.
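A minimal stand-alone sketch of how such a walk flag gates the
break-COW step (the flag values and helpers below are illustrative
only, not the kernel's):

    /* Model of a flag-gated walk: break COW before touching entries. */
    #include <stdio.h>
    #include <stdbool.h>

    #define WALK_SYNC       (1 << 0)
    #define WALK_MIGRATION  (1 << 1)
    #define WALK_BREAK_COW  (1 << 2)

    static bool break_cow_table(void)
    {
        /* Pretend the shared table was successfully made private. */
        printf("COW PTE broken before modifying entries\n");
        return true;
    }

    static bool walk_one(unsigned int flags)
    {
        if (flags & WALK_BREAK_COW) {
            if (!break_cow_table())
                return false;   /* caller treats this as "not found" */
        }
        printf("walking PTEs (flags=0x%x)\n", flags);
        return true;
    }

    int main(void)
    {
        walk_one(WALK_SYNC | WALK_BREAK_COW);   /* unmap/migrate-style walk */
        walk_one(WALK_SYNC);                    /* read-only style walk */
        return 0;
    }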
Signed-off-by: Chih-En Lin --- include/linux/rmap.h | 2 ++ mm/migrate.c | 3 ++- mm/page_vma_mapped.c | 4 ++++ mm/rmap.c | 9 +++++---- mm/vmscan.c | 3 ++- 5 files changed, 15 insertions(+), 6 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index b87d01660412..57e9b72dc63a 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -377,6 +377,8 @@ int make_device_exclusive_range(struct mm_struct *mm, unsigned long start, #define PVMW_SYNC (1 << 0) /* Look for migration entries rather than present PTEs */ #define PVMW_MIGRATION (1 << 1) +/* Break COW-ed PTE during walking */ +#define PVMW_BREAK_COW_PTE (1 << 2) struct page_vma_mapped_walk { unsigned long pfn; diff --git a/mm/migrate.c b/mm/migrate.c index db3f154446af..38933993af14 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -184,7 +184,8 @@ void putback_movable_pages(struct list_head *l) static bool remove_migration_pte(struct folio *folio, struct vm_area_struct *vma, unsigned long addr, void *old) { - DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC | PVMW_MIGRATION); + DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, + PVMW_SYNC | PVMW_MIGRATION | PVMW_BREAK_COW_PTE); while (page_vma_mapped_walk(&pvmw)) { rmap_t rmap_flags = RMAP_NONE; diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index 4e448cfbc6ef..1750b3460828 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -254,6 +254,10 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) step_forward(pvmw, PMD_SIZE); continue; } + if (pvmw->flags & PVMW_BREAK_COW_PTE) { + if (break_cow_pte(vma, pvmw->pmd, pvmw->address)) + return not_found(pvmw); + } if (!map_pte(pvmw)) goto next_pte; this_pte: diff --git a/mm/rmap.c b/mm/rmap.c index 8632e02661ac..5582da6d72fc 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1006,7 +1006,8 @@ static int page_vma_mkclean_one(struct page_vma_mapped_walk *pvmw) static bool page_mkclean_one(struct folio *folio, struct vm_area_struct *vma, unsigned long address, void *arg) { - DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, PVMW_SYNC); + DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, + PVMW_SYNC | PVMW_BREAK_COW_PTE); int *cleaned = arg; *cleaned += page_vma_mkclean_one(&pvmw); @@ -1450,7 +1451,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, unsigned long address, void *arg) { struct mm_struct *mm = vma->vm_mm; - DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0); + DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, PVMW_BREAK_COW_PTE); pte_t pteval; struct page *subpage; bool anon_exclusive, ret = true; @@ -1810,7 +1811,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, unsigned long address, void *arg) { struct mm_struct *mm = vma->vm_mm; - DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0); + DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, PVMW_BREAK_COW_PTE); pte_t pteval; struct page *subpage; bool anon_exclusive, ret = true; @@ -2177,7 +2178,7 @@ static bool page_make_device_exclusive_one(struct folio *folio, struct vm_area_struct *vma, unsigned long address, void *priv) { struct mm_struct *mm = vma->vm_mm; - DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0); + DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, PVMW_BREAK_COW_PTE); struct make_exclusive_args *args = priv; pte_t pteval; struct page *subpage; diff --git a/mm/vmscan.c b/mm/vmscan.c index 9c1c5e8b24b8..4abbd036f927 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1892,7 +1892,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, /* * The folio is 
mapped into the page tables of one or more - * processes. Try to unmap it here. + * processes. Try to unmap it here. Also, since it will write + * to the page tables, break COW PTE if they are. */ if (folio_mapped(folio)) { enum ttu_flags flags = TTU_BATCH_FLUSH; From patchwork Fri Apr 14 14:23:31 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chih-En Lin X-Patchwork-Id: 13211633 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B33A1C77B72 for ; Fri, 14 Apr 2023 14:26:20 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230458AbjDNO0T (ORCPT ); Fri, 14 Apr 2023 10:26:19 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49830 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230516AbjDNOZ4 (ORCPT ); Fri, 14 Apr 2023 10:25:56 -0400 Received: from mail-pj1-x1034.google.com (mail-pj1-x1034.google.com [IPv6:2607:f8b0:4864:20::1034]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 74E11CC09; Fri, 14 Apr 2023 07:25:26 -0700 (PDT) Received: by mail-pj1-x1034.google.com with SMTP id h24-20020a17090a9c1800b002404be7920aso18803455pjp.5; Fri, 14 Apr 2023 07:25:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1681482326; x=1684074326; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=TCxBEbp3Vh2LecugBLRL5yp+Vpy2XMoG5rUe/amD8/I=; b=ZUkBFFgDZMzq0ei7OkZmTW87uX2xk5y2kJKyAYLp3lpS89qPRHDKJgz7tCSi6O1V6k GrFbOBQFX/XypGH/c1yklfosrEHoIqzQ10mKCvEBfVArBFpSNqifnArvfV57Wofxfked tQSdXPx2lkO9j+I9wB/qaUE15663ceimdY4la8QuwN52VLu4kI7OSpxGktPP3xTt05op 8/JPrXs0WD8ikj7mOCQTPzq3BQuATZD6Bj6rr4p7YEsHpV8Rg7fuN426Vq3Yjlwn5Htj 98xjWssljzirK0y11ZY7ToIsvcWt1GPlCV+66dUlZIWt1/ldVbpqyb0Vt9r3pd3avAS7 t9Zw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1681482326; x=1684074326; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=TCxBEbp3Vh2LecugBLRL5yp+Vpy2XMoG5rUe/amD8/I=; b=YVs5vOLVgAe5DagTU3OXVUWA9HSV7pUhn7AC7Bxe1tYjhaSh6KeNiqlmtIGuk3gH08 XHMbtyTFDybcJqWs6UedMQxC9d+cHsbZ3y83WJcKsToWGzF9WPQ2ew4nooBdUCMVhCAF pLxFG74WYPXuiq9wuPsPkR0AKinxwLUL9bW9b0Bv/EGFtZnl84ZOHytt1zWb4AZBUqdA /NHYlFHg1qoe4sNmQaEu6HiZSQ94q3RQ1gEAu8RvE1GvruKw4uRL6dAsj9fb3STfDaf0 VEnFmtqN9Eawfnrg+Q86HrDERlyzRiU+4zAuqXxzRrQmxoprNNirm7cg4h0LTi8tLxqa vDUA== X-Gm-Message-State: AAQBX9dz4iMwNe2k2V6b5RpQXrEO8jzZ7I+Zl/BW1+x53G1jkkLAwFZy 1p4J16lNeVGcYg0jp+sAfZc= X-Google-Smtp-Source: AKy350apXLIP0kt5VCJQRQireZ4xb8EqK6nTeakZbNZ/7tzb6r5mGG/H+ghRbHSVVMAlJe/5ZJjFlQ== X-Received: by 2002:a17:90b:788:b0:246:896a:408d with SMTP id l8-20020a17090b078800b00246896a408dmr5981639pjz.14.1681482325813; Fri, 14 Apr 2023 07:25:25 -0700 (PDT) Received: from strix-laptop.. (2001-b011-20e0-1499-8303-7502-d3d7-e13b.dynamic-ip6.hinet.net. 
) by smtp.googlemail.com with ESMTPSA; Fri, 14 Apr 2023 07:25:25 -0700 (PDT)
From: Chih-En Lin
Subject: [PATCH v5 07/17] mm/khugepaged: Break COW PTE before scanning pte
Date: Fri, 14 Apr 2023 22:23:31 +0800
Message-Id: <20230414142341.354556-8-shiyn.lin@gmail.com>
In-Reply-To: <20230414142341.354556-1-shiyn.lin@gmail.com>
References: <20230414142341.354556-1-shiyn.lin@gmail.com>

We should not allow THP to collapse a COW-ed PTE table. So, break COW
PTE before collapse_pte_mapped_thp() collapses it into a THP. Also,
break COW PTE before khugepaged_scan_pmd() scans the PTE table.
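The intended ordering, as a stand-alone sketch (constants and helpers
are stand-ins, not kernel symbols; the -1 return plays the role of the
new SCAN_COW_PTE result):

    /* Make the PTE table exclusive before the collapse scan touches it. */
    #include <stdio.h>

    #define PTRS_PER_TABLE 512

    struct pte_table {
        int cow_refcount;
    };

    static int break_cow(struct pte_table *t)
    {
        if (t->cow_refcount > 1) {
            t->cow_refcount = 1;    /* pretend we duplicated the table */
            printf("duplicated shared table before scan\n");
        }
        return 0;
    }

    static int collapse_scan(struct pte_table *t)
    {
        if (break_cow(t))
            return -1;  /* analogue of SCAN_COW_PTE in the patch above */
        /* Now it is safe to scan and later rewrite the entries. */
        printf("scanning %d entries for collapse\n", PTRS_PER_TABLE);
        return 0;
    }

    int main(void)
    {
        struct pte_table t = { .cow_refcount = 3 };
        return collapse_scan(&t);
    }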
Signed-off-by: Chih-En Lin --- include/trace/events/huge_memory.h | 1 + mm/khugepaged.c | 35 +++++++++++++++++++++++++++++- 2 files changed, 35 insertions(+), 1 deletion(-) diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h index 3e6fb05852f9..5f2c39f61521 100644 --- a/include/trace/events/huge_memory.h +++ b/include/trace/events/huge_memory.h @@ -13,6 +13,7 @@ EM( SCAN_PMD_NULL, "pmd_null") \ EM( SCAN_PMD_NONE, "pmd_none") \ EM( SCAN_PMD_MAPPED, "page_pmd_mapped") \ + EM( SCAN_COW_PTE, "cowed_pte") \ EM( SCAN_EXCEED_NONE_PTE, "exceed_none_pte") \ EM( SCAN_EXCEED_SWAP_PTE, "exceed_swap_pte") \ EM( SCAN_EXCEED_SHARED_PTE, "exceed_shared_pte") \ diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 92e6f56a932d..3020fcb53691 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -31,6 +31,7 @@ enum scan_result { SCAN_PMD_NULL, SCAN_PMD_NONE, SCAN_PMD_MAPPED, + SCAN_COW_PTE, SCAN_EXCEED_NONE_PTE, SCAN_EXCEED_SWAP_PTE, SCAN_EXCEED_SHARED_PTE, @@ -886,7 +887,7 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm, return SCAN_PMD_MAPPED; if (pmd_devmap(pmde)) return SCAN_PMD_NULL; - if (pmd_bad(pmde)) + if (pmd_write(pmde) && pmd_bad(pmde)) return SCAN_PMD_NULL; return SCAN_SUCCEED; } @@ -937,6 +938,8 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm, pte_unmap(vmf.pte); continue; } + if (break_cow_pte(vma, pmd, address)) + return SCAN_COW_PTE; ret = do_swap_page(&vmf); /* @@ -1049,6 +1052,9 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, if (result != SCAN_SUCCEED) goto out_up_write; + /* We should already handled COW-ed PTE. */ + VM_WARN_ON(test_bit(MMF_COW_PTE, &mm->flags) && !pmd_write(*pmd)); + anon_vma_lock_write(vma->anon_vma); mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address, @@ -1159,6 +1165,13 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm, memset(cc->node_load, 0, sizeof(cc->node_load)); nodes_clear(cc->alloc_nmask); + + /* Break COW PTE before we collapse the pages. */ + if (break_cow_pte(vma, pmd, address)) { + result = SCAN_COW_PTE; + goto out; + } + pte = pte_offset_map_lock(mm, pmd, address, &ptl); for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR; _pte++, _address += PAGE_SIZE) { @@ -1217,6 +1230,10 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm, goto out_unmap; } + /* + * If we only trigger the break COW PTE, the page usually + * still in COW mapping, which it still be shared. + */ if (page_mapcount(page) > 1) { ++shared; if (cc->is_khugepaged && @@ -1512,6 +1529,11 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, goto drop_hpage; } + /* We shouldn't let COW-ed PTE collapse. 
*/ + if (break_cow_pte(vma, pmd, haddr)) + goto drop_hpage; + VM_WARN_ON(test_bit(MMF_COW_PTE, &mm->flags) && !pmd_write(*pmd)); + /* * We need to lock the mapping so that from here on, only GUP-fast and * hardware page walks can access the parts of the page tables that @@ -1717,6 +1739,11 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff, result = SCAN_PTE_UFFD_WP; goto unlock_next; } + if (test_bit(MMF_COW_PTE, &mm->flags) && + !pmd_write(*pmd)) { + result = SCAN_COW_PTE; + goto unlock_next; + } collapse_and_free_pmd(mm, vma, addr, pmd); if (!cc->is_khugepaged && is_target) result = set_huge_pmd(vma, addr, pmd, hpage); @@ -2154,6 +2181,11 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr, swap = 0; memset(cc->node_load, 0, sizeof(cc->node_load)); nodes_clear(cc->alloc_nmask); + if (break_cow_pte(find_vma(mm, addr), NULL, addr)) { + result = SCAN_COW_PTE; + goto out; + } + rcu_read_lock(); xas_for_each(&xas, page, start + HPAGE_PMD_NR - 1) { if (xas_retry(&xas, page)) @@ -2224,6 +2256,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr, } rcu_read_unlock(); +out: if (result == SCAN_SUCCEED) { if (cc->is_khugepaged && present < HPAGE_PMD_NR - khugepaged_max_ptes_none) { From patchwork Fri Apr 14 14:23:32 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chih-En Lin X-Patchwork-Id: 13211634 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E6C4AC77B72 for ; Fri, 14 Apr 2023 14:26:30 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230432AbjDNO0a (ORCPT ); Fri, 14 Apr 2023 10:26:30 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49078 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230040AbjDNO0E (ORCPT ); Fri, 14 Apr 2023 10:26:04 -0400 Received: from mail-pj1-x1036.google.com (mail-pj1-x1036.google.com [IPv6:2607:f8b0:4864:20::1036]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CF04EC160; Fri, 14 Apr 2023 07:25:35 -0700 (PDT) Received: by mail-pj1-x1036.google.com with SMTP id hg12so3951157pjb.2; Fri, 14 Apr 2023 07:25:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1681482335; x=1684074335; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=oPtVqejd8xN45YdR2Q6v6XG0kYjXUF7jvXEtA29Ek40=; b=DO2Q+d81fMe4kXw95FsyYMfTz/qUSAi6t2W34rh9XXJe8g3xYfBr+ymjaQ+uYh9eTI gHotO5OUPv5kWIbSchNOMjer5oobGqrHJG+prcCHTVFXEpI6y9M0+rge1hgjyuYxoB0u wX1SvUVvU2sIJxeGvrr7rS1EZIQVvYq2oa4eABUzfPnn7z8r+qL+fynWlo9OAhrZaHzR 3jOzOMCHKM1OzWW9i5YbxNsYH1Uk4ufsEhHfEwnQp8mwyPkSVTACUM3lCwWK6IlQbd5Q G7EY5aNAMKUNsQAQl+8b7ZKPp2am6ArpZNUTK0p0bV3mimWd5Tg5a3ckqpM3MRvKpkwh wEcA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1681482335; x=1684074335; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=oPtVqejd8xN45YdR2Q6v6XG0kYjXUF7jvXEtA29Ek40=; b=IVrzADRA6p9Bd2wxW9UcrFJcS6/Q9j2mUhI6BW+SvNXO5bxNeNHYdG9f9nE0hAAD88 RFoBLXh5mxi9hWAnAMYX5kb1zPcI5FXUSjYvZXai5PsYrXXjmYy6T5kPgvFpyPWQ+a0e 
From: Chih-En Lin
Subject: [PATCH v5 08/17] mm/ksm: Break COW PTE before modify shared PTE
Date: Fri, 14 Apr 2023 22:23:32 +0800
Message-Id: <20230414142341.354556-9-shiyn.lin@gmail.com>
In-Reply-To: <20230414142341.354556-1-shiyn.lin@gmail.com>
References: <20230414142341.354556-1-shiyn.lin@gmail.com>

Break COW PTE before merging a page that resides in a COW-ed PTE table.
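For background, the page-level analogue of what this series does one
level up is ordinary per-page COW after fork(). The demo below only
shows that familiar page-level behaviour, using nothing beyond standard
POSIX calls; it is background illustration, not part of the patch:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        pid_t pid;

        if (buf == MAP_FAILED)
            return 1;
        strcpy(buf, "parent data");

        pid = fork();
        if (pid == 0) {             /* child: the write triggers the copy */
            strcpy(buf, "child data");
            _exit(0);
        }
        waitpid(pid, NULL, 0);
        printf("parent still sees: %s\n", buf);  /* "parent data" */
        return 0;
    }

COW PTE applies the same copy-on-write idea to the page-table page
itself, which is why writers such as KSM must break COW on the table
before rewriting any of its entries.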
Signed-off-by: Chih-En Lin --- mm/ksm.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/mm/ksm.c b/mm/ksm.c index 2b8d30068cbb..963ef4d0085d 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1052,7 +1052,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, pte_t *orig_pte) { struct mm_struct *mm = vma->vm_mm; - DEFINE_PAGE_VMA_WALK(pvmw, page, vma, 0, 0); + DEFINE_PAGE_VMA_WALK(pvmw, page, vma, 0, PVMW_BREAK_COW_PTE); int swapped; int err = -EFAULT; struct mmu_notifier_range range; @@ -1169,6 +1169,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, barrier(); if (!pmd_present(pmde) || pmd_trans_huge(pmde)) goto out; + if (break_cow_pte(vma, pmd, addr)) + goto out; mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr, addr + PAGE_SIZE); From patchwork Fri Apr 14 14:23:33 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chih-En Lin X-Patchwork-Id: 13211635 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E1FCBC77B6E for ; Fri, 14 Apr 2023 14:26:44 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231133AbjDNO0n (ORCPT ); Fri, 14 Apr 2023 10:26:43 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50266 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231140AbjDNO0Q (ORCPT ); Fri, 14 Apr 2023 10:26:16 -0400 Received: from mail-pj1-x102a.google.com (mail-pj1-x102a.google.com [IPv6:2607:f8b0:4864:20::102a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3E0B4CC21; Fri, 14 Apr 2023 07:25:45 -0700 (PDT) Received: by mail-pj1-x102a.google.com with SMTP id w11so18965495pjh.5; Fri, 14 Apr 2023 07:25:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1681482344; x=1684074344; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=kpWPxfuvaIk0mWPLrjy7A4XPzxhA7fLlfGdW1mVCUr4=; b=hh51/J2GfHz8TwBXTTAD8ByIt7vqu+CJYMOv8Gn6VP6zDZ8jsx1AQK/YylhhDx4Yy+ 0CRLCxANjdDWRbaxrpOrHOgnYRD+lcAf3QSkDS3YbWO/xcoE3Gbv43mriuPlxclwVTP9 phx5iCBpOFU3ljdbu0G2CPDJwoQgv/dk9OnKHsSHbvxVBZS7ab+FJCrdG2Pr8gDTkNbA cLPqeTL9RK1Gw4vF6qbn/cWyXpQXTI1KYeztWpb3aEC0oy5a7FRBFGm+8ZwfNEXjkHbL Rw88PFqPGh3pJ+S1C9QijWHos/UCC72XeH/rxFcOom0qMDyPSeMXOfp7PnqjaxjFryYV 7CUg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1681482344; x=1684074344; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=kpWPxfuvaIk0mWPLrjy7A4XPzxhA7fLlfGdW1mVCUr4=; b=TY1NK6FePlkw/ygUqZeuSi5bhpMlTOFr12TsPVicX2pxQHse6S2kys73OhatSfhYg7 YOkXrTpgk1o2OaK3aIBkMQH81ZDFXl7AABV80MWAFarAcTNtUuZpso98Hxh4SfJIdj2W Y+Uwvqk1JhQRgn9MozLR/jBw0tgNtye0wobDSMLKnfb944MsvL/do19scY8/j+Te1ufl nIHstgVc+bZh07ivfho9Eub+42zLtpxX31HWSgCJZ5KJEtc0mTJH5qJ1UJrvm6svgWxJ +9gYS2bZm7UTXuGzjdmv6vSu3mDv7Y+q5JjnyUJAABXymbphTDWvlpgZWR91DZs1McUq X3WA== X-Gm-Message-State: AAQBX9ePj4M6d7TEvndTtoGdrSAzGgThUcsztPCiutpWHNpOjr9inxWs SlDwcUrM8+8iB+EhcsaJQ0Q= X-Google-Smtp-Source: AKy350aBdG9vJ3XmlVaPR+W/QhjaJcPhPOUy0mPoGFQqm07k6eNoRRTZfMByKXsSMfstD+2cplrm3g== X-Received: by 2002:a17:90a:4925:b0:23f:7ff6:eba with 
SMTP id c34-20020a17090a492500b0023f7ff60ebamr5879955pjh.0.1681482344318; Fri, 14 Apr 2023 07:25:44 -0700 (PDT) Received: from strix-laptop.. (2001-b011-20e0-1499-8303-7502-d3d7-e13b.dynamic-ip6.hinet.net. [2001:b011:20e0:1499:8303:7502:d3d7:e13b]) by smtp.googlemail.com with ESMTPSA id h7-20020a17090ac38700b0022335f1dae2sm2952386pjt.22.2023.04.14.07.25.35 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 14 Apr 2023 07:25:43 -0700 (PDT) From: Chih-En Lin To: Andrew Morton , Qi Zheng , David Hildenbrand , "Matthew Wilcox (Oracle)" , Christophe Leroy , John Hubbard , Nadav Amit , Barry Song , Pasha Tatashin Cc: Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , "H. Peter Anvin" , Steven Rostedt , Masami Hiramatsu , Peter Zijlstra , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Ian Rogers , Adrian Hunter , Yu Zhao , Steven Barrett , Juergen Gross , Peter Xu , Kefeng Wang , Tong Tiangen , Christoph Hellwig , "Liam R. Howlett" , Yang Shi , Vlastimil Babka , Alex Sierra , Vincent Whitchurch , Anshuman Khandual , Li kunyu , Liu Shixin , Hugh Dickins , Minchan Kim , Joey Gouly , Chih-En Lin , Michal Hocko , Suren Baghdasaryan , "Zach O'Keefe" , Gautam Menghani , Catalin Marinas , Mark Brown , "Eric W. Biederman" , Andrei Vagin , Shakeel Butt , Daniel Bristot de Oliveira , "Jason A. Donenfeld" , Greg Kroah-Hartman , Alexey Gladkov , x86@kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Dinglan Peng , Pedro Fonseca , Jim Huang , Huichun Feng Subject: [PATCH v5 09/17] mm/madvise: Handle COW-ed PTE with madvise() Date: Fri, 14 Apr 2023 22:23:33 +0800 Message-Id: <20230414142341.354556-10-shiyn.lin@gmail.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230414142341.354556-1-shiyn.lin@gmail.com> References: <20230414142341.354556-1-shiyn.lin@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-trace-kernel@vger.kernel.org Break COW PTE if madvise() modify the pte entry of COW-ed PTE. Following are the list of flags which need to break COW PTE. However, like MADV_HUGEPAGE and MADV_MERGEABLE, we should handle it respectively. - MADV_DONTNEED: It calls to zap_page_range() which already be handled. - MADV_FREE: It uses walk_page_range() with madvise_free_pte_range() to free the page by itself, so add break_cow_pte(). - MADV_REMOVE: Same as MADV_FREE, it remove the page by itself, so add break_cow_pte_range(). - MADV_COLD: Similar to MAD_FREE, break COW PTE before pageout. - MADV_POPULATE: Let GUP deal with it. Signed-off-by: Chih-En Lin --- mm/madvise.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/mm/madvise.c b/mm/madvise.c index 340125d08c03..71176edb751e 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -425,6 +425,9 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, if (pmd_trans_unstable(pmd)) return 0; #endif + if (break_cow_pte(vma, pmd, addr)) + return 0; + tlb_change_page_size(tlb, PAGE_SIZE); orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); flush_tlb_batched_pending(mm); @@ -625,6 +628,10 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, if (pmd_trans_unstable(pmd)) return 0; + /* We should only allocate PTE. 
*/ + if (break_cow_pte(vma, pmd, addr)) + goto next; + tlb_change_page_size(tlb, PAGE_SIZE); orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl); flush_tlb_batched_pending(mm); @@ -984,6 +991,12 @@ static long madvise_remove(struct vm_area_struct *vma, if ((vma->vm_flags & (VM_SHARED|VM_WRITE)) != (VM_SHARED|VM_WRITE)) return -EACCES; + error = break_cow_pte_range(vma, start, end); + if (error < 0) + return error; + else if (error > 0) + return -ENOMEM; + offset = (loff_t)(start - vma->vm_start) + ((loff_t)vma->vm_pgoff << PAGE_SHIFT); From patchwork Fri Apr 14 14:23:34 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chih-En Lin X-Patchwork-Id: 13211636 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 90293C77B6E for ; Fri, 14 Apr 2023 14:26:55 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230403AbjDNO0y (ORCPT ); Fri, 14 Apr 2023 10:26:54 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50470 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231182AbjDNO01 (ORCPT ); Fri, 14 Apr 2023 10:26:27 -0400 Received: from mail-pl1-x62d.google.com (mail-pl1-x62d.google.com [IPv6:2607:f8b0:4864:20::62d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 45E88C144; Fri, 14 Apr 2023 07:25:54 -0700 (PDT) Received: by mail-pl1-x62d.google.com with SMTP id kh6so16856994plb.0; Fri, 14 Apr 2023 07:25:54 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1681482353; x=1684074353; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=/tVFETdIXCOfqp++FbGXI8ZTRFNtRU4kz7mKWpQ0TKQ=; b=V/L9mGM5vRCilsgFuug2qDo+tf+mz8X3NMe4ExWTkIzgp5PoriAleb9ABHiqBVuzoB llo+kuyF6EvctKAWOk7/AgctLIF5nzoIZG8o22MemV1Trqr4GMU4SWQjs7lk4AFfxWZs fEFh+J5+1k4h11tgMkVGCwcojSc3VhOk+jlEQkXDF+LG7ADIDJS5DzduuXdI5BA3XzGA o5xxBjrwCkRjOkZx09CVQJsh7Y3tKhqCo7zRr9YQVNfXI9o2wXUwtf+xHKgwJyn/eUFx DbBL8+08eqWZGvXOjPhYZPEEOFvmlvTmXp2I+hQAV+uqiFqt7yhXiTpblR0eG92ZfVyg zMBg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1681482353; x=1684074353; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=/tVFETdIXCOfqp++FbGXI8ZTRFNtRU4kz7mKWpQ0TKQ=; b=eAPdAV+Kwf65yINl2J7xlqvEhtlEmrmfgl7zcXjoRNq2ut85AxNxMm/B/qpkDvURqm n70T0uvgHehhjQlbGgd2nc6usPdIBPLDOV9eDSPQEcHCxzYPMpeSyb462HDPLjw55tiB VfBz7Bpl29QVEqOFgKivBwIXpIDP2rP1ZrZ6ref+UTVbZfXu6c2r026gBa8mXYDn4FSM farVAKNMoiyvdclGlusE/lb2fi9HxisBp85pcMWutEb/R3ST/3yoLfISEeuLm3GifoPy iaW4XTT6bMm/5F+N+ZEl/i8KUkHDD/pCEyrqUVBZ1YFJeHOWGRo272+Pa8Y8q7IvMCoq +Ceg== X-Gm-Message-State: AAQBX9c+2cqRwF/sZoOKMFhddEReArpFrc2jGHH1sqGiSi5VW4vwSqu+ AMB0DmU8oSQYNN259RbfwXA= X-Google-Smtp-Source: AKy350bY4APGkMEn5jnT3hrPcgD0IFpWnxPux4TPQ74b4zdNFtZk3bMgcaFkmd5oasXe2ggPZ2CL0Q== X-Received: by 2002:a17:903:22ce:b0:1a5:f:a7c7 with SMTP id y14-20020a17090322ce00b001a5000fa7c7mr3342685plg.0.1681482353510; Fri, 14 Apr 2023 07:25:53 -0700 (PDT) Received: from strix-laptop.. (2001-b011-20e0-1499-8303-7502-d3d7-e13b.dynamic-ip6.hinet.net. 
[2001:b011:20e0:1499:8303:7502:d3d7:e13b]) by smtp.googlemail.com with ESMTPSA id h7-20020a17090ac38700b0022335f1dae2sm2952386pjt.22.2023.04.14.07.25.44 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 14 Apr 2023 07:25:53 -0700 (PDT) From: Chih-En Lin To: Andrew Morton , Qi Zheng , David Hildenbrand , "Matthew Wilcox (Oracle)" , Christophe Leroy , John Hubbard , Nadav Amit , Barry Song , Pasha Tatashin Cc: Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , "H. Peter Anvin" , Steven Rostedt , Masami Hiramatsu , Peter Zijlstra , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Ian Rogers , Adrian Hunter , Yu Zhao , Steven Barrett , Juergen Gross , Peter Xu , Kefeng Wang , Tong Tiangen , Christoph Hellwig , "Liam R. Howlett" , Yang Shi , Vlastimil Babka , Alex Sierra , Vincent Whitchurch , Anshuman Khandual , Li kunyu , Liu Shixin , Hugh Dickins , Minchan Kim , Joey Gouly , Chih-En Lin , Michal Hocko , Suren Baghdasaryan , "Zach O'Keefe" , Gautam Menghani , Catalin Marinas , Mark Brown , "Eric W. Biederman" , Andrei Vagin , Shakeel Butt , Daniel Bristot de Oliveira , "Jason A. Donenfeld" , Greg Kroah-Hartman , Alexey Gladkov , x86@kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Dinglan Peng , Pedro Fonseca , Jim Huang , Huichun Feng Subject: [PATCH v5 10/17] mm/gup: Trigger break COW PTE before calling follow_pfn_pte() Date: Fri, 14 Apr 2023 22:23:34 +0800 Message-Id: <20230414142341.354556-11-shiyn.lin@gmail.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230414142341.354556-1-shiyn.lin@gmail.com> References: <20230414142341.354556-1-shiyn.lin@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-trace-kernel@vger.kernel.org In most of cases, GUP will not modify the page table, excluding follow_pfn_pte(). To deal with COW PTE, Trigger the break COW PTE fault before calling follow_pfn_pte(). Signed-off-by: Chih-En Lin --- mm/gup.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/mm/gup.c b/mm/gup.c index eab18ba045db..325424c02ca6 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -544,7 +544,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) == (FOLL_PIN | FOLL_GET))) return ERR_PTR(-EINVAL); - if (unlikely(pmd_bad(*pmd))) + /* COW-ed PTE has write protection which can trigger pmd_bad(). 
*/ + if (unlikely(pmd_write(*pmd) && pmd_bad(*pmd))) return no_page_table(vma, flags); ptep = pte_offset_map_lock(mm, pmd, address, &ptl); @@ -587,6 +588,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, if (is_zero_pfn(pte_pfn(pte))) { page = pte_page(pte); } else { + if (test_bit(MMF_COW_PTE, &mm->flags) && + !pmd_write(*pmd)) { + page = ERR_PTR(-EMLINK); + goto out; + } ret = follow_pfn_pte(vma, address, ptep, flags); page = ERR_PTR(ret); goto out; From patchwork Fri Apr 14 14:23:35 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chih-En Lin X-Patchwork-Id: 13211637 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6E3CEC77B6E for ; Fri, 14 Apr 2023 14:27:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231231AbjDNO1M (ORCPT ); Fri, 14 Apr 2023 10:27:12 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50652 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231209AbjDNO0j (ORCPT ); Fri, 14 Apr 2023 10:26:39 -0400 Received: from mail-pj1-x102e.google.com (mail-pj1-x102e.google.com [IPv6:2607:f8b0:4864:20::102e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 29C79A5DE; Fri, 14 Apr 2023 07:26:04 -0700 (PDT) Received: by mail-pj1-x102e.google.com with SMTP id v9so23781319pjk.0; Fri, 14 Apr 2023 07:26:04 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1681482363; x=1684074363; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=JhJhY1i1X+rQfu4Bn4P5O3hneE5Ahqf44H1V3MlMZSE=; b=WVSBGvQEWeeZCzJHUSLw6TdU7X7htMEhqLVUOYjuOm5exiNC5ijdRkufnNFt84Bcvp nBa+HXyOE083VwztPKaRnxXoBufXzhDqGmx5EU3nQdOfFARZyba4elQgihG66ZdIP+us AWQNBgqB3ZgRwKcjr/anLW7Yt0XHKvMOhoHHSJ+HOxL3sObVLefIP9AIAVNCeVEtEv1h XIqwvMreBHtnZXG4BLKofcPieosXJ8aJJOZrrwvezai6c/gmMsNouFT0iBOD95fZd4ZF KiC837Ko9b7VgCfyYEEiAshz0fl07KLeKm1Og9HKhxYYplg/Ram9QSDXV5nmw7PrXhcj UxEw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1681482363; x=1684074363; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=JhJhY1i1X+rQfu4Bn4P5O3hneE5Ahqf44H1V3MlMZSE=; b=I3yWmk9/SnuWH4NgGe/+okpq54cUPsfNpqIcFmHifKA1FfwZZ7KAwF2ybTXSpjvd/N GG3BtDGjhDKwGiXFS3bVKAdycnUK/yYnBDp0xB9JXlpJVWwgwuKgrfvBTMYsetvlFpEV gIlTsNa9mnDW8YgjN4e3zKcuEtWxEpsrI9b6o9x+sW2vlqrJVf2Ew0LFW6IhvKoCF+d4 F/a3vGHUWNbh3XDRaGBYBVEFBpHiMoSVOc8/LvKIhb7JOP0ZLVrI1aeBZadllmjN8w2X IRTEfCLAEEIYdGSSrybFatLVqsdCMO+yU+oiz5NZsRs//BVklMxfFrgf1ztxR3k6YXPC bjjA== X-Gm-Message-State: AAQBX9esOUpz2Nj3PX+L3kYvJCBkcBpLxU8H133ucw/GKr8+KndtpioT +dwIOBxcWSWSRl0qlDwHN2s= X-Google-Smtp-Source: AKy350Y3WXBAfHkfEIZhG0ei3Tp0PNc3fK1Ni3ksxBjZThrmtWAJdx1hd2dMChF2jmP20depYO7fIQ== X-Received: by 2002:a17:90a:5b12:b0:244:9385:807f with SMTP id o18-20020a17090a5b1200b002449385807fmr5616908pji.44.1681482362745; Fri, 14 Apr 2023 07:26:02 -0700 (PDT) Received: from strix-laptop.. (2001-b011-20e0-1499-8303-7502-d3d7-e13b.dynamic-ip6.hinet.net. 
[2001:b011:20e0:1499:8303:7502:d3d7:e13b]) by smtp.googlemail.com with ESMTPSA id h7-20020a17090ac38700b0022335f1dae2sm2952386pjt.22.2023.04.14.07.25.53 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 14 Apr 2023 07:26:02 -0700 (PDT) From: Chih-En Lin To: Andrew Morton , Qi Zheng , David Hildenbrand , "Matthew Wilcox (Oracle)" , Christophe Leroy , John Hubbard , Nadav Amit , Barry Song , Pasha Tatashin Cc: Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , "H. Peter Anvin" , Steven Rostedt , Masami Hiramatsu , Peter Zijlstra , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Ian Rogers , Adrian Hunter , Yu Zhao , Steven Barrett , Juergen Gross , Peter Xu , Kefeng Wang , Tong Tiangen , Christoph Hellwig , "Liam R. Howlett" , Yang Shi , Vlastimil Babka , Alex Sierra , Vincent Whitchurch , Anshuman Khandual , Li kunyu , Liu Shixin , Hugh Dickins , Minchan Kim , Joey Gouly , Chih-En Lin , Michal Hocko , Suren Baghdasaryan , "Zach O'Keefe" , Gautam Menghani , Catalin Marinas , Mark Brown , "Eric W. Biederman" , Andrei Vagin , Shakeel Butt , Daniel Bristot de Oliveira , "Jason A. Donenfeld" , Greg Kroah-Hartman , Alexey Gladkov , x86@kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Dinglan Peng , Pedro Fonseca , Jim Huang , Huichun Feng Subject: [PATCH v5 11/17] mm/mprotect: Break COW PTE before changing protection Date: Fri, 14 Apr 2023 22:23:35 +0800 Message-Id: <20230414142341.354556-12-shiyn.lin@gmail.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230414142341.354556-1-shiyn.lin@gmail.com> References: <20230414142341.354556-1-shiyn.lin@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-trace-kernel@vger.kernel.org If the PTE table is COW-ed, break it before changing the protection. Signed-off-by: Chih-En Lin --- mm/mprotect.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/mm/mprotect.c b/mm/mprotect.c index 13e84d8c0797..a33f23a73fa5 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -103,6 +103,9 @@ static long change_pte_range(struct mmu_gather *tlb, if (pmd_trans_unstable(pmd)) return 0; + if (break_cow_pte(vma, pmd, addr)) + return 0; + /* * The pmd points to a regular pte so the pmd can't change * from under us even if the mmap_lock is only hold for @@ -312,6 +315,12 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd) return 1; if (pmd_trans_huge(pmdval)) return 0; + /* + * If the entry point to COW-ed PTE, it's write protection bit + * will cause pmd_bad(). 
+ */ + if (!pmd_write(pmdval)) + return 0; if (unlikely(pmd_bad(pmdval))) { pmd_clear_bad(pmd); return 1; From patchwork Fri Apr 14 14:23:36 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chih-En Lin X-Patchwork-Id: 13211638 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8F635C77B6E for ; Fri, 14 Apr 2023 14:27:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229650AbjDNO1Z (ORCPT ); Fri, 14 Apr 2023 10:27:25 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50432 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231232AbjDNO0w (ORCPT ); Fri, 14 Apr 2023 10:26:52 -0400 Received: from mail-pj1-x102e.google.com (mail-pj1-x102e.google.com [IPv6:2607:f8b0:4864:20::102e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9931FC654; Fri, 14 Apr 2023 07:26:19 -0700 (PDT) Received: by mail-pj1-x102e.google.com with SMTP id c10-20020a17090abf0a00b0023d1bbd9f9eso21823683pjs.0; Fri, 14 Apr 2023 07:26:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1681482372; x=1684074372; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=Cn0v+6+8REt3te6+W7jwtjVuTBBAk2uf1jUodunRRdQ=; b=qi9wgiyeY0KQmKmVHmk3pZf/Hi4Vp8dgW2DPgrhsj8Xwtx2FZitjvpQoSol6ANk7d2 1nwO8VT1YJy8uPXgkX9/FutEPVh31MHIZQlLhkDNLiI0vwdWENXTb4dgDqyvIrfKzMOX JI0CKwQjE52y7/H+kjVfplDq/+2Hs/lKXVITRsXKPy/05LnN2D+n76SDHQsW1vaTbWXK 90b5glPmafEggiq/JPPsEci/7YhvLEKLpSuc5jH4zplAtjPos9DlUXXyGmXSv706vknt tp2hKzs3UcLFWOqwvzyG5LStOAqXtz7Uhjar9GOjeCsp/OqcZ426jrQ99KPeGXiVL2Px Rr0A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1681482372; x=1684074372; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=Cn0v+6+8REt3te6+W7jwtjVuTBBAk2uf1jUodunRRdQ=; b=a5r9lvniXCC+LJw1c5qkHrnC1TXiRzFIkSJxTBEUMELzyHNl2TQPjTbArM6piz6tDQ K9w/IEzS2M47lwUGxsP9OGAPLP8GzH/fvqJIEuX7g5roWIIj2ZZ70KLNwOQjmgIvYmNX 0XUU9pyYmejUZjfn+a4C/KoRKg/EsGuxuT4MxiJp3RayiGJiVYTWOaHS867keY4K8FjI 1LTTN5mmsMdba8lNzzPix9ZKNHwzseup8/e6bt+nr0vj+FRlHfUzADitgWyCADI3AmM3 aCl5rIRPDNdJZDLosyQt4PH9iCQAZm2v9otYViiZT4/Y3DwxqHY5MtO0BB/J58NY4Hc6 /2ZQ== X-Gm-Message-State: AAQBX9dpn5UcRyYJo6HNuf4QY3BsLPmnhEF/1IbkjAWMBSF7nk/2Lle9 h11EK/sf2vLz16v+0FJyzCQ= X-Google-Smtp-Source: AKy350YNu6dj+98meByV/+r5D+K99Oh4Y9jClRwMRsz3Jwt8ozybPOvGDsiGKENsPm95AhWUU5y+EQ== X-Received: by 2002:a17:90a:a392:b0:246:9517:30b6 with SMTP id x18-20020a17090aa39200b00246951730b6mr11192974pjp.4.1681482372086; Fri, 14 Apr 2023 07:26:12 -0700 (PDT) Received: from strix-laptop.. (2001-b011-20e0-1499-8303-7502-d3d7-e13b.dynamic-ip6.hinet.net. 
) by smtp.googlemail.com with ESMTPSA; Fri, 14 Apr 2023 07:26:11 -0700 (PDT)
From: Chih-En Lin
Subject: [PATCH v5 12/17] mm/userfaultfd: Support COW PTE
Date: Fri, 14 Apr 2023 22:23:36 +0800
Message-Id: <20230414142341.354556-13-shiyn.lin@gmail.com>
In-Reply-To: <20230414142341.354556-1-shiyn.lin@gmail.com>
References: <20230414142341.354556-1-shiyn.lin@gmail.com>

If uffd fills a zero page into, or installs a PTE into, a COW-ed PTE
table, break COW on the table first.
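The ordering matters because breaking COW may replace the table the PMD
entry points at, so the break has to happen before the PTE is looked up
and locked. A stand-alone sketch of that order (all names below are
stand-ins, not kernel symbols):

    #include <stdio.h>

    struct table { int id; };

    static struct table shared       = { .id = 1 };
    static struct table private_copy = { .id = 2 };
    static struct table *pmd = &shared;

    static int break_cow(void)
    {
        pmd = &private_copy;   /* the table is swapped under the PMD entry */
        return 0;
    }

    int main(void)
    {
        if (break_cow())
            return 1;          /* -ENOMEM in the real mfill paths */
        /* Only now is it safe to "lock" and write into the table. */
        printf("installing PTE into table %d\n", pmd->id);
        return 0;
    }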
Signed-off-by: Chih-En Lin
---
 mm/userfaultfd.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 53c3d916ff66..f5e0a97d6a3d 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -70,6 +70,9 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	struct inode *inode;
 	pgoff_t offset, max_off;
 
+	if (break_cow_pte(dst_vma, dst_pmd, dst_addr))
+		return -ENOMEM;
+
 	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
 	_dst_pte = pte_mkdirty(_dst_pte);
 	if (page_in_cache && !vm_shared)
@@ -215,6 +218,9 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm,
 	pgoff_t offset, max_off;
 	struct inode *inode;
 
+	if (break_cow_pte(dst_vma, dst_pmd, dst_addr))
+		return -ENOMEM;
+
 	_dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
 					 dst_vma->vm_page_prot));
 	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);

From patchwork Fri Apr 14 14:23:37 2023
X-Patchwork-Submitter: Chih-En Lin
X-Patchwork-Id: 13211639
From: Chih-En Lin
Subject: [PATCH v5 13/17] mm/migrate_device: Support COW PTE
Date: Fri, 14 Apr 2023 22:23:37 +0800
Message-Id: <20230414142341.354556-14-shiyn.lin@gmail.com>
In-Reply-To: <20230414142341.354556-1-shiyn.lin@gmail.com>
References: <20230414142341.354556-1-shiyn.lin@gmail.com>

Break COW PTE before collecting the pages of a COW-ed PTE table.
Signed-off-by: Chih-En Lin
---
 mm/migrate_device.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index d30c9de60b0d..340a8c39ee3b 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -106,6 +106,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		}
 	}
 
+	if (break_cow_pte_range(vma, start, end))
+		return migrate_vma_collect_skip(start, end, walk);
 	if (unlikely(pmd_bad(*pmdp)))
 		return migrate_vma_collect_skip(start, end, walk);

From patchwork Fri Apr 14 14:23:38 2023
X-Patchwork-Submitter: Chih-En Lin
X-Patchwork-Id: 13211640
From: Chih-En Lin
Subject: [PATCH v5 14/17] fs/proc: Support COW PTE with clear_refs_write
Date: Fri, 14 Apr 2023 22:23:38 +0800
Message-Id: <20230414142341.354556-15-shiyn.lin@gmail.com>
In-Reply-To: <20230414142341.354556-1-shiyn.lin@gmail.com>
References: <20230414142341.354556-1-shiyn.lin@gmail.com>

Break COW PTE before clearing an entry in a COW-ed PTE table.

Signed-off-by: Chih-En Lin
---
 fs/proc/task_mmu.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6a96e1713fd5..c76b74029dfd 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1195,6 +1195,11 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 	if (pmd_trans_unstable(pmd))
 		return 0;
 
+	/* Only break COW when we modify the soft-dirty bit. */
+	if (cp->type == CLEAR_REFS_SOFT_DIRTY &&
+	    break_cow_pte(vma, pmd, addr))
+		return 0;
+
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;

From patchwork Fri Apr 14 14:23:39 2023
X-Patchwork-Submitter: Chih-En Lin
X-Patchwork-Id: 13211641
From: Chih-En Lin
Subject: [PATCH v5 15/17] events/uprobes: Break COW PTE before replacing page
Date: Fri, 14 Apr 2023 22:23:39 +0800
Message-Id: <20230414142341.354556-16-shiyn.lin@gmail.com>
In-Reply-To: <20230414142341.354556-1-shiyn.lin@gmail.com>
References: <20230414142341.354556-1-shiyn.lin@gmail.com>

Break COW PTE if we want to replace a page that resides in a COW-ed
PTE table.
Signed-off-by: Chih-En Lin
---
 kernel/events/uprobes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 59887c69d54c..db6bfaab928d 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -156,7 +156,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	struct folio *old_folio = page_folio(old_page);
 	struct folio *new_folio;
 	struct mm_struct *mm = vma->vm_mm;
-	DEFINE_FOLIO_VMA_WALK(pvmw, old_folio, vma, addr, 0);
+	DEFINE_FOLIO_VMA_WALK(pvmw, old_folio, vma, addr, PVMW_BREAK_COW_PTE);
 	int err;
 	struct mmu_notifier_range range;

From patchwork Fri Apr 14 14:23:40 2023
X-Patchwork-Submitter: Chih-En Lin
X-Patchwork-Id: 13211642
From: Chih-En Lin
Subject: [PATCH v5 16/17] mm: fork: Enable COW PTE to fork system call
Date: Fri, 14 Apr 2023 22:23:40 +0800
Message-Id: <20230414142341.354556-17-shiyn.lin@gmail.com>
In-Reply-To: <20230414142341.354556-1-shiyn.lin@gmail.com>
References: <20230414142341.354556-1-shiyn.lin@gmail.com>

This patch enables the copy-on-write (COW) mechanism for PTE tables in
the fork system call. To let a process use COW PTE fork, call
prctl(PR_SET_COW_PTE); this sets the MMF_COW_PTE_READY flag on the
process so that COW PTE is enabled at the next fork. The MMF_COW_PTE
flag is then used to distinguish a normal page table from a COW one.
Moreover, since it is difficult to tell when all the page tables have
left the COW state, the MMF_COW_PTE flag is never cleared once set.
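An illustration-only sketch of the intended usage from userspace, assuming the
program is built against UAPI headers from a kernel that carries this series
(so that PR_SET_COW_PTE is defined); error handling is minimal:

	#include <stdio.h>
	#include <sys/prctl.h>
	#include <sys/types.h>
	#include <sys/wait.h>
	#include <unistd.h>

	int main(void)
	{
		pid_t pid;

		/* Mark this mm as MMF_COW_PTE_READY for the next fork(). */
		if (prctl(PR_SET_COW_PTE, 0, 0, 0, 0))
			perror("prctl(PR_SET_COW_PTE)");

		/*
		 * This fork() observes MMF_COW_PTE_READY, switches the mm to
		 * MMF_COW_PTE, and shares PTE tables copy-on-write instead of
		 * copying them eagerly.
		 */
		pid = fork();
		if (pid == 0)
			_exit(0);
		waitpid(pid, NULL, 0);
		return 0;
	}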
Signed-off-by: Chih-En Lin
---
 kernel/fork.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/fork.c b/kernel/fork.c
index 0c92f224c68c..8452d5c4eb5e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2679,6 +2679,13 @@ pid_t kernel_clone(struct kernel_clone_args *args)
 		trace = 0;
 	}
 
+#ifdef CONFIG_COW_PTE
+	if (current->mm && test_bit(MMF_COW_PTE_READY, &current->mm->flags)) {
+		clear_bit(MMF_COW_PTE_READY, &current->mm->flags);
+		set_bit(MMF_COW_PTE, &current->mm->flags);
+	}
+#endif
+
 	p = copy_process(NULL, trace, NUMA_NO_NODE, args);
 	add_latent_entropy();

From patchwork Fri Apr 14 14:23:41 2023
X-Patchwork-Submitter: Chih-En Lin
X-Patchwork-Id: 13211643
From: Chih-En Lin
Subject: [PATCH v5 17/17] mm: Check the unexpected modification of COW-ed PTE
Date: Fri, 14 Apr 2023 22:23:41 +0800
Message-Id: <20230414142341.354556-18-shiyn.lin@gmail.com>
In-Reply-To: <20230414142341.354556-1-shiyn.lin@gmail.com>
References: <20230414142341.354556-1-shiyn.lin@gmail.com>

In most cases we do not expect any write access to a COW-ed PTE table.
To catch such writes, add a new modification check to the page table
check. However, there are still valid reasons to modify a COW-ed PTE
table, so also add enable/disable helpers for the check.
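To make the intended calling convention explicit, the condensed pattern below
restates what the mm/memory.c hunk in this patch does; it is not an additional
API. A path that legitimately writes through a COW-ed PTE table brackets the
modification with the new helpers so that the check in set_pte_at() stays
quiet. As with the rest of the page table check, nothing fires unless
CONFIG_PAGE_TABLE_CHECK is enabled and the check is active (for example via
the page_table_check=on boot parameter).

	/* Inside copy_cow_pte_range(), around a legitimate write: */
	check_cowed_pte_table_disable(src_pte);
	ret = copy_present_pte(curr, curr, src_pte, src_pte,
			       addr, rss, NULL);
	check_cowed_pte_table_enable(src_pte);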
Signed-off-by: Chih-En Lin
---
 arch/x86/include/asm/pgtable.h   |  1 +
 include/linux/page_table_check.h | 62 ++++++++++++++++++++++++++++++++
 mm/memory.c                      |  4 +++
 mm/page_table_check.c            | 58 ++++++++++++++++++++++++++++++
 4 files changed, 125 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 7425f32e5293..6b323c672e36 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1022,6 +1022,7 @@ static inline pud_t native_local_pudp_get_and_clear(pud_t *pudp)
 static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 			      pte_t *ptep, pte_t pte)
 {
+	cowed_pte_table_check_modify(mm, addr, ptep, pte);
 	page_table_check_pte_set(mm, addr, ptep, pte);
 	set_pte(ptep, pte);
 }
diff --git a/include/linux/page_table_check.h b/include/linux/page_table_check.h
index 01e16c7696ec..4a54dc454281 100644
--- a/include/linux/page_table_check.h
+++ b/include/linux/page_table_check.h
@@ -113,6 +113,54 @@ static inline void page_table_check_pte_clear_range(struct mm_struct *mm,
 	__page_table_check_pte_clear_range(mm, addr, pmd);
 }
 
+#ifdef CONFIG_COW_PTE
+void __check_cowed_pte_table_enable(pte_t *ptep);
+void __check_cowed_pte_table_disable(pte_t *ptep);
+void __cowed_pte_table_check_modify(struct mm_struct *mm, unsigned long addr,
+				    pte_t *ptep, pte_t pte);
+
+static inline void check_cowed_pte_table_enable(pte_t *ptep)
+{
+	if (static_branch_likely(&page_table_check_disabled))
+		return;
+
+	__check_cowed_pte_table_enable(ptep);
+}
+
+static inline void check_cowed_pte_table_disable(pte_t *ptep)
+{
+	if (static_branch_likely(&page_table_check_disabled))
+		return;
+
+	__check_cowed_pte_table_disable(ptep);
+}
+
+static inline void cowed_pte_table_check_modify(struct mm_struct *mm,
+						unsigned long addr,
+						pte_t *ptep, pte_t pte)
+{
+	if (static_branch_likely(&page_table_check_disabled))
+		return;
+
+	__cowed_pte_table_check_modify(mm, addr, ptep, pte);
+}
+#else
+static inline void check_cowed_pte_table_enable(pte_t *ptep)
+{
+}
+
+static inline void check_cowed_pte_table_disable(pte_t *ptep)
+{
+}
+
+static inline void cowed_pte_table_check_modify(struct mm_struct *mm,
+						unsigned long addr,
+						pte_t *ptep, pte_t pte)
+{
+}
+#endif /* CONFIG_COW_PTE */
+
+
 #else
 static inline void page_table_check_alloc(struct page *page,
 					  unsigned int order)
@@ -162,5 +210,19 @@ static inline void page_table_check_pte_clear_range(struct mm_struct *mm,
 {
 }
 
+static inline void check_cowed_pte_table_enable(pte_t *ptep)
+{
+}
+
+static inline void check_cowed_pte_table_disable(pte_t *ptep)
+{
+}
+
+static inline void cowed_pte_table_check_modify(struct mm_struct *mm,
+						unsigned long addr,
+						pte_t *ptep, pte_t pte)
+{
+}
+
 #endif /* CONFIG_PAGE_TABLE_CHECK */
 #endif /* __LINUX_PAGE_TABLE_CHECK_H */
diff --git a/mm/memory.c b/mm/memory.c
index 7908e20f802a..e62487413038 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1202,10 +1202,12 @@ copy_cow_pte_range(struct vm_area_struct *dst_vma,
 			 * Although, parent's PTE is COW-ed, we should
 			 * still need to handle all the swap stuffs.
 			 */
+			check_cowed_pte_table_disable(src_pte);
 			ret = copy_nonpresent_pte(dst_mm, src_mm,
 						  src_pte, src_pte,
 						  curr, curr, addr,
 						  rss);
+			check_cowed_pte_table_enable(src_pte);
 			if (ret == -EIO) {
 				entry = pte_to_swp_entry(*src_pte);
 				break;
@@ -1223,8 +1225,10 @@ copy_cow_pte_range(struct vm_area_struct *dst_vma,
 			 * copy_present_pte() will determine the mapped page
 			 * should be COW mapping or not.
 			 */
+			check_cowed_pte_table_disable(src_pte);
 			ret = copy_present_pte(curr, curr, src_pte, src_pte,
 					       addr, rss, NULL);
+			check_cowed_pte_table_enable(src_pte);
 			/*
 			 * If we need a pre-allocated page for this pte,
 			 * drop the lock, recover all the entries, fall
diff --git a/mm/page_table_check.c b/mm/page_table_check.c
index 25d8610c0042..5175c7476508 100644
--- a/mm/page_table_check.c
+++ b/mm/page_table_check.c
@@ -14,6 +14,9 @@ struct page_table_check {
 	atomic_t anon_map_count;
 	atomic_t file_map_count;
+#ifdef CONFIG_COW_PTE
+	atomic_t check_cowed_pte;
+#endif
 };
 
 static bool __page_table_check_enabled __initdata =
@@ -248,3 +251,58 @@ void __page_table_check_pte_clear_range(struct mm_struct *mm,
 		pte_unmap(ptep - PTRS_PER_PTE);
 	}
 }
+
+#ifdef CONFIG_COW_PTE
+void __check_cowed_pte_table_enable(pte_t *ptep)
+{
+	struct page *page = pte_page(*ptep);
+	struct page_ext *page_ext = page_ext_get(page);
+	struct page_table_check *ptc = get_page_table_check(page_ext);
+
+	atomic_set(&ptc->check_cowed_pte, 1);
+	page_ext_put(page_ext);
+}
+
+void __check_cowed_pte_table_disable(pte_t *ptep)
+{
+	struct page *page = pte_page(*ptep);
+	struct page_ext *page_ext = page_ext_get(page);
+	struct page_table_check *ptc = get_page_table_check(page_ext);
+
+	atomic_set(&ptc->check_cowed_pte, 0);
+	page_ext_put(page_ext);
+}
+
+static int check_cowed_pte_table(pte_t *ptep)
+{
+	struct page *page = pte_page(*ptep);
+	struct page_ext *page_ext = page_ext_get(page);
+	struct page_table_check *ptc = get_page_table_check(page_ext);
+	int check = 0;
+
+	check = atomic_read(&ptc->check_cowed_pte);
+	page_ext_put(page_ext);
+
+	return check;
+}
+
+void __cowed_pte_table_check_modify(struct mm_struct *mm, unsigned long addr,
+				    pte_t *ptep, pte_t pte)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	if (!test_bit(MMF_COW_PTE, &mm->flags) || !check_cowed_pte_table(ptep))
+		return;
+
+	pgd = pgd_offset(mm, addr);
+	p4d = p4d_offset(pgd, addr);
+	pud = pud_offset(p4d, addr);
+	pmd = pmd_offset(pud, addr);
+
+	if (!pmd_none(*pmd) && !pmd_write(*pmd) && cow_pte_count(pmd) > 1)
+		BUG_ON(!pte_same(*ptep, pte));
+}
+#endif