From patchwork Fri May 13 08:09:52 2016
X-Patchwork-Submitter: Borislav Petkov
X-Patchwork-Id: 9089391
Date: Fri, 13 May 2016 10:09:52 +0200
From: Borislav Petkov
To: Marc Haber
Cc: "Kirill A. Shutemov", "Dr. David Alan Gilbert", Paolo Bonzini,
    linux-kernel@vger.kernel.org, kvm ML, Andrea Arcangeli, Andrew Morton
Subject: Re: transparent huge pages breaks KVM on AMD.
Message-ID: <20160513080952.GB17715@pd.tnic>
References: <20160421200433.GL21755@torres.zugschlus.de> <20160423160429.GL8531@pd.tnic>
    <20160423184341.GA21755@torres.zugschlus.de> <20160423185246.GC8376@gallifrey>
    <20160512202009.GZ9143@torres.zugschlus.de> <20160512202402.GA22506@node.shutemov.name>
    <20160512203457.GC9143@torres.zugschlus.de> <20160512204216.GA22695@node.shutemov.name>
    <20160513052334.GE9143@torres.zugschlus.de> <20160513080745.GA17715@pd.tnic>
In-Reply-To: <20160513080745.GA17715@pd.tnic>
User-Agent: Mutt/1.5.24 (2015-08-30)
X-Mailing-List: kvm@vger.kernel.org

Try this one instead - it fixes an uninitialized var.

From linux-kernel-owner@vger.kernel.org Thu May 12 18:32:37 2016
From: Andrea Arcangeli
To: Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    linux-rdma@vger.kernel.org
Cc: Alex Williamson, "Kirill A. Shutemov", "Luick, Dean", "Marciniszyn, Mike"
Shutemov" , "Luick, Dean" , "Marciniszyn, Mike" Subject: [PATCH 1/1] mm: thp: calculate the mapcount correctly for THP pages during WP faults Date: Thu, 12 May 2016 18:32:22 +0200 Message-Id: <1463070742-18401-1-git-send-email-aarcange@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset=utf-8 Status: RO This will provide fully accuracy to the mapcount calculation in the write protect faults, so page pinning will not get broken by false positive copy-on-writes. total_mapcount() isn't the right calculation needed in reuse_swap_page(), so this introduces a page_trans_huge_mapcount() that is effectively the full accurate return value for page_mapcount() if dealing with Transparent Hugepages, however we only use the page_trans_huge_mapcount() during COW faults where it strictly needed, due to its higher runtime cost. This also provide at practical zero cost the total_mapcount information which is needed to know if we can still relocate the page anon_vma to the local vma. If page_trans_huge_mapcount() returns 1 we can reuse the page no matter if it's a pte or a pmd_trans_huge triggering the fault, but we can only relocate the page anon_vma to the local vma->anon_vma if we're sure it's only this "vma" mapping the whole THP physical range. Kirill A. Shutemov discovered the problem with moving the page anon_vma to the local vma->anon_vma in a previous version of this patch and another problem in the way page_move_anon_rmap() was called. Andrew Morton discovered that CONFIG_SWAP=n wouldn't build in a previous version, because reuse_swap_page must be a macro to call page_trans_huge_mapcount from swap.h, so this uses a macro again instead of an inline function. With this change at least it's a less dangerous usage than it was before, because "page" is used only once now, while with the previous code reuse_swap_page(page++) would have called page_mapcount on page+1 and it would have increased page twice instead of just once. Dean Luick noticed an uninitialized variable that could result in a rmap inefficiency for the non-THP case in a previous version. Reviewed-by: "Kirill A. Shutemov" Signed-off-by: Andrea Arcangeli Cc: "Kirill A. Shutemov" Cc: "Kirill A. 
Shutemov" Cc: "Luick Cc: "Marciniszyn Cc: Al Viro Cc: Alex Williamson Cc: Andrea Arcangeli Cc: Andrew Morton Cc: Chris Wilson Cc: Dan Williams Cc: Dave Hansen Cc: David Hildenbrand Cc: Dean" Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Ebru Akagunduz Cc: Geliang Tang Cc: Gerald Schaefer Cc: Hugh Dickins Cc: Ingo Molnar Cc: Jan Kara Cc: Jerome Marchand Cc: Johannes Weiner Cc: Joonsoo Kim Cc: Matthew Wilcox Cc: Michal Hocko Cc: Mike" Cc: Minchan Kim Cc: Vladimir Davydov Cc: Vlastimil Babka Cc: Xie XiuQi Cc: linux-mm Cc: linux-rdma@vger.kernel.org Cc: lkml Link: http://lkml.kernel.org/r/1463070742-18401-1-git-send-email-aarcange@redhat.com --- include/linux/mm.h | 9 +++++++ include/linux/swap.h | 6 ++--- mm/huge_memory.c | 71 +++++++++++++++++++++++++++++++++++++++++++++------- mm/memory.c | 22 ++++++++++------ mm/swapfile.c | 13 +++++----- 5 files changed, 95 insertions(+), 26 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 864d722..8f468e0 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -500,11 +500,20 @@ static inline int page_mapcount(struct page *page) #ifdef CONFIG_TRANSPARENT_HUGEPAGE int total_mapcount(struct page *page); +int page_trans_huge_mapcount(struct page *page, int *total_mapcount); #else static inline int total_mapcount(struct page *page) { return page_mapcount(page); } +static inline int page_trans_huge_mapcount(struct page *page, + int *total_mapcount) +{ + int mapcount = page_mapcount(page); + if (total_mapcount) + *total_mapcount = mapcount; + return mapcount; +} #endif static inline struct page *virt_to_head_page(const void *x) diff --git a/include/linux/swap.h b/include/linux/swap.h index 0a4cd47..ad22035 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -418,7 +418,7 @@ extern sector_t swapdev_block(int, pgoff_t); extern int page_swapcount(struct page *); extern int swp_swapcount(swp_entry_t entry); extern struct swap_info_struct *page_swap_info(struct page *); -extern int reuse_swap_page(struct page *); +extern bool reuse_swap_page(struct page *, int *); extern int try_to_free_swap(struct page *); struct backing_dev_info; @@ -513,8 +513,8 @@ static inline int swp_swapcount(swp_entry_t entry) return 0; } -#define reuse_swap_page(page) \ - (!PageTransCompound(page) && page_mapcount(page) == 1) +#define reuse_swap_page(page, total_mapcount) \ + (page_trans_huge_mapcount(page, total_mapcount) == 1) static inline int try_to_free_swap(struct page *page) { diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f7daa7d..b49ee12 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1298,15 +1298,9 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page); /* * We can only reuse the page if nobody else maps the huge page or it's - * part. We can do it by checking page_mapcount() on each sub-page, but - * it's expensive. - * The cheaper way is to check page_count() to be equal 1: every - * mapcount takes page reference reference, so this way we can - * guarantee, that the PMD is the only mapping. - * This can give false negative if somebody pinned the page, but that's - * fine. + * part. 
*/ - if (page_mapcount(page) == 1 && page_count(page) == 1) { + if (page_trans_huge_mapcount(page, NULL) == 1) { pmd_t entry; entry = pmd_mkyoung(orig_pmd); entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); @@ -2079,7 +2073,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, if (pte_write(pteval)) { writable = true; } else { - if (PageSwapCache(page) && !reuse_swap_page(page)) { + if (PageSwapCache(page) && + !reuse_swap_page(page, NULL)) { unlock_page(page); result = SCAN_SWAP_CACHE_PAGE; goto out; @@ -3223,6 +3218,64 @@ int total_mapcount(struct page *page) } /* + * This calculates accurately how many mappings a transparent hugepage + * has (unlike page_mapcount() which isn't fully accurate). This full + * accuracy is primarily needed to know if copy-on-write faults can + * reuse the page and change the mapping to read-write instead of + * copying them. At the same time this returns the total_mapcount too. + * + * The function returns the highest mapcount any one of the subpages + * has. If the return value is one, even if different processes are + * mapping different subpages of the transparent hugepage, they can + * all reuse it, because each process is reusing a different subpage. + * + * The total_mapcount is instead counting all virtual mappings of the + * subpages. If the total_mapcount is equal to "one", it tells the + * caller all mappings belong to the same "mm" and in turn the + * anon_vma of the transparent hugepage can become the vma->anon_vma + * local one as no other process may be mapping any of the subpages. + * + * It would be more accurate to replace page_mapcount() with + * page_trans_huge_mapcount(), however we only use + * page_trans_huge_mapcount() in the copy-on-write faults where we + * need full accuracy to avoid breaking page pinning, because + * page_trans_huge_mapcount() is slower than page_mapcount(). + */ +int page_trans_huge_mapcount(struct page *page, int *total_mapcount) +{ + int i, ret, _total_mapcount, mapcount; + + /* hugetlbfs shouldn't call it */ + VM_BUG_ON_PAGE(PageHuge(page), page); + + if (likely(!PageTransCompound(page))) { + mapcount = atomic_read(&page->_mapcount) + 1; + if (total_mapcount) + *total_mapcount = mapcount; + return mapcount; + } + + page = compound_head(page); + + _total_mapcount = ret = 0; + for (i = 0; i < HPAGE_PMD_NR; i++) { + mapcount = atomic_read(&page[i]._mapcount) + 1; + ret = max(ret, mapcount); + _total_mapcount += mapcount; + } + if (PageDoubleMap(page)) { + ret -= 1; + _total_mapcount -= HPAGE_PMD_NR; + } + mapcount = compound_mapcount(page); + ret += mapcount; + _total_mapcount += mapcount; + if (total_mapcount) + *total_mapcount = _total_mapcount; + return ret; +} + +/* * This function splits huge page into normal pages. @page can point to any * subpage of huge page to split. Split doesn't change the position of @page. * diff --git a/mm/memory.c b/mm/memory.c index 52c218e..07493e3 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2373,6 +2373,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, * not dirty accountable. */ if (PageAnon(old_page) && !PageKsm(old_page)) { + int total_mapcount; if (!trylock_page(old_page)) { get_page(old_page); pte_unmap_unlock(page_table, ptl); @@ -2387,13 +2388,18 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, } put_page(old_page); } - if (reuse_swap_page(old_page)) { - /* - * The page is all ours. Move it to our anon_vma so - * the rmap code will not search our parent or siblings. 
- * Protected against the rmap code by the page lock. - */ - page_move_anon_rmap(old_page, vma, address); + if (reuse_swap_page(old_page, &total_mapcount)) { + if (total_mapcount == 1) { + /* + * The page is all ours. Move it to + * our anon_vma so the rmap code will + * not search our parent or siblings. + * Protected against the rmap code by + * the page lock. + */ + page_move_anon_rmap(compound_head(old_page), + vma, address); + } unlock_page(old_page); return wp_page_reuse(mm, vma, address, page_table, ptl, orig_pte, old_page, 0, 0); @@ -2617,7 +2623,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, inc_mm_counter_fast(mm, MM_ANONPAGES); dec_mm_counter_fast(mm, MM_SWAPENTS); pte = mk_pte(page, vma->vm_page_prot); - if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) { + if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) { pte = maybe_mkwrite(pte_mkdirty(pte), vma); flags &= ~FAULT_FLAG_WRITE; ret |= VM_FAULT_WRITE; diff --git a/mm/swapfile.c b/mm/swapfile.c index 83874ec..031713ab 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -922,18 +922,19 @@ out: * to it. And as a side-effect, free up its swap: because the old content * on disk will never be read, and seeking back there to write new content * later would only waste time away from clustering. + * + * NOTE: total_mapcount should not be relied upon by the caller if + * reuse_swap_page() returns false, but it may be always overwritten + * (see the other implementation for CONFIG_SWAP=n). */ -int reuse_swap_page(struct page *page) +bool reuse_swap_page(struct page *page, int *total_mapcount) { int count; VM_BUG_ON_PAGE(!PageLocked(page), page); if (unlikely(PageKsm(page))) - return 0; - /* The page is part of THP and cannot be reused */ - if (PageTransCompound(page)) - return 0; - count = page_mapcount(page); + return false; + count = page_trans_huge_mapcount(page, total_mapcount); if (count <= 1 && PageSwapCache(page)) { count += page_swapcount(page); if (count == 1 && !PageWriteback(page)) {
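
[Editor's note] To make the reuse semantics easier to follow outside the kernel
sources, here is a minimal userspace sketch of the counting logic that
page_trans_huge_mapcount() implements: take the highest per-subpage mapcount,
undo the PageDoubleMap() double accounting, add the compound (PMD) mapcount,
and report the sum through the optional total_mapcount pointer. Everything
here (struct toy_thp, HPAGE_NR, the sample counts in main()) is invented for
illustration; this is not kernel code and it only mirrors the arithmetic of
the patch.

/* Standalone illustration of the mapcount arithmetic used by the patch.
 * All names here (toy_thp, HPAGE_NR, ...) are made up for the demo; in the
 * kernel the counts live in struct page and are read atomically. */
#include <stdbool.h>
#include <stdio.h>

#define HPAGE_NR 8                      /* pretend a THP has 8 subpages */

struct toy_thp {
        int subpage_mapcount[HPAGE_NR]; /* per-subpage (PTE) mapcounts */
        int compound_mapcount;          /* whole-THP (PMD) mappings */
        bool double_map;                /* mapped both by PMD and by PTEs */
};

/* Mirrors page_trans_huge_mapcount(): returns the highest mapcount of any
 * subpage; stores the sum of all mappings in *total_mapcount if non-NULL. */
static int toy_trans_huge_mapcount(const struct toy_thp *p, int *total_mapcount)
{
        int i, ret = 0, total = 0;

        for (i = 0; i < HPAGE_NR; i++) {
                int mapcount = p->subpage_mapcount[i];

                if (mapcount > ret)
                        ret = mapcount;
                total += mapcount;
        }
        if (p->double_map) {
                /* the PMD mapping is also accounted in every subpage count,
                 * so remove that double accounting */
                ret -= 1;
                total -= HPAGE_NR;
        }
        ret += p->compound_mapcount;
        total += p->compound_mapcount;

        if (total_mapcount)
                *total_mapcount = total;
        return ret;
}

int main(void)
{
        /* Counts set by hand for the demo: one whole-THP (PMD) mapping plus
         * one extra PTE mapping of subpage 3. */
        struct toy_thp thp = { .compound_mapcount = 1 };
        int total;

        thp.subpage_mapcount[3] = 1;

        int ret = toy_trans_huge_mapcount(&thp, &total);

        /* ret == 2: subpage 3 is mapped twice, so a COW fault must copy.
         * total == 2: two mappings overall, so the anon_vma cannot be made
         * local to a single vma either. */
        printf("max per-subpage mapcount = %d, total = %d\n", ret, total);
        printf("reuse without copy: %s, can localize anon_vma: %s\n",
               ret == 1 ? "yes" : "no", total == 1 ? "yes" : "no");
        return 0;
}

The split between the two values matches the callers in the patch:
do_huge_pmd_wp_page() and do_swap_page() only need the return value to decide
whether the copy can be avoided, while do_wp_page() additionally checks
total_mapcount == 1 before moving the anon_vma with page_move_anon_rmap().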