From patchwork Mon Jul 6 20:26:14 2020
X-Patchwork-Submitter: Mike Kravetz
X-Patchwork-Id: 11646849
From: Mike Kravetz
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Michal Hocko, Hugh Dickins, Naoya Horiguchi, "Aneesh Kumar K . V",
    Andrea Arcangeli, "Kirill A . Shutemov", Davidlohr Bueso,
    Prakash Sangappa, Andrew Morton, Linus Torvalds, Mike Kravetz,
    kernel test robot
Subject: [RFC PATCH 2/3] hugetlbfs: Only take i_mmap_rwsem when sharing is possible
Date: Mon, 6 Jul 2020 13:26:14 -0700
Message-Id: <20200706202615.32111-3-mike.kravetz@oracle.com>
X-Mailer: git-send-email 2.25.4
In-Reply-To: <20200706202615.32111-1-mike.kravetz@oracle.com>
References: <20200622005551.GK5535@shao2-debian>
 <20200706202615.32111-1-mike.kravetz@oracle.com>

Commit c0d0381ade79 added code to take i_mmap_rwsem in read mode during
fault processing.  However, this was observed to increase fault processing
time by approximately 33%.  Technically, i_mmap_rwsem only needs to be held
when pmd sharing is possible, and pmd sharing depends on the mapping's
flags, alignment and size.  The routine vma_shareable() already checks
these conditions.  Therefore, use vma_shareable() to determine whether
sharing is possible and, with it, whether taking i_mmap_rwsem is necessary.
This is done during fault processing and vma copying.

The code in memory-failure, page migration and userfaultfd continues to
take i_mmap_rwsem unconditionally.  Those paths are not as performance
sensitive.
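The locking pattern used in copy_hugetlb_page_range() is to track whether
the semaphore is currently held and to acquire or drop it as the loop moves
between addresses where pmd sharing is and is not possible.  As a rough
userspace illustration of that pattern (not kernel code), the sketch below
mirrors it with a pthread read-write lock; range_shareable(), walk_ranges()
and RANGE_SIZE are hypothetical stand-ins, not kernel interfaces.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define RANGE_SIZE 0x200000UL   /* stand-in for a PUD_SIZE-like granule */

/* Hypothetical predicate: could this address use a shared block? */
static bool range_shareable(unsigned long addr, unsigned long start,
                            unsigned long end)
{
        unsigned long base = addr & ~(RANGE_SIZE - 1);

        /* the aligned block must lie fully inside [start, end) */
        return base >= start && base + RANGE_SIZE <= end;
}

/*
 * Mirror of the copy-loop pattern: hold the lock in read mode only while
 * iterating over addresses where sharing is possible, and drop it again
 * as soon as it is not.
 */
static void walk_ranges(pthread_rwlock_t *lock, unsigned long start,
                        unsigned long end)
{
        bool lock_held = false;
        unsigned long addr;

        for (addr = start; addr < end; addr += RANGE_SIZE) {
                if (!lock_held && range_shareable(addr, start, end)) {
                        pthread_rwlock_rdlock(lock);
                        lock_held = true;
                } else if (lock_held && !range_shareable(addr, start, end)) {
                        pthread_rwlock_unlock(lock);
                        lock_held = false;
                }
                /* per-address work that needs the lock when shareable */
        }

        /* like the tail of copy_hugetlb_page_range(): unlock only if held */
        if (lock_held)
                pthread_rwlock_unlock(lock);
}

int main(void)
{
        pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;

        walk_ranges(&lock, 0x100000UL, 0x900000UL);
        printf("walk complete\n");
        return 0;
}

The key detail is the final unlock: because the loop may end while the lock
is still held, the exit path checks the held flag rather than the cow flag,
which matches the change made to the tail of copy_hugetlb_page_range()
below.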
Reported-by: kernel test robot
Signed-off-by: Mike Kravetz
---
 mm/hugetlb.c | 96 ++++++++++++++++++++++++++++++++++------------------
 1 file changed, 63 insertions(+), 33 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5349beda3658..6e9085464e78 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3656,6 +3656,21 @@ static int hugetlb_acct_memory(struct hstate *h, long delta)
         return ret;
 }
 
+#ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
+static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
+{
+        unsigned long base = addr & PUD_MASK;
+        unsigned long end = base + PUD_SIZE;
+
+        /*
+         * check on proper vm_flags and page table alignment
+         */
+        if (vma->vm_flags & VM_MAYSHARE && range_in_vma(vma, base, end))
+                return true;
+        return false;
+}
+#endif
+
 static void hugetlb_vm_op_open(struct vm_area_struct *vma)
 {
         struct resv_map *resv = vma_resv_map(vma);
@@ -3807,6 +3822,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
         unsigned long sz = huge_page_size(h);
         struct address_space *mapping = vma->vm_file->f_mapping;
         struct mmu_notifier_range range;
+        bool i_mmap_rwsem_held = false;
         int ret = 0;
 
         cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
@@ -3816,14 +3832,6 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
                                         vma->vm_start,
                                         vma->vm_end);
                 mmu_notifier_invalidate_range_start(&range);
-        } else {
-                /*
-                 * For shared mappings i_mmap_rwsem must be held to call
-                 * huge_pte_alloc, otherwise the returned ptep could go
-                 * away if part of a shared pmd and another thread calls
-                 * huge_pmd_unshare.
-                 */
-                i_mmap_lock_read(mapping);
         }
 
         for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
@@ -3831,6 +3839,28 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
                 src_pte = huge_pte_offset(src, addr, sz);
                 if (!src_pte)
                         continue;
+
+                /*
+                 * For shared mappings (non-cow) i_mmap_rwsem must be held to
+                 * call huge_pte_alloc, otherwise the returned ptep could go
+                 * away if part of a shared pmd and another thread calls
+                 * huge_pmd_unshare.  This is only necessary if the specific
+                 * pmd can be shared.  Acquire/drop semaphore as necessary.
+                 */
+                if (!cow) {
+                        if (!i_mmap_rwsem_held) {
+                                if (vma_shareable(vma, addr)) {
+                                        i_mmap_lock_read(mapping);
+                                        i_mmap_rwsem_held = true;
+                                }
+                        } else {
+                                if (!vma_shareable(vma, addr)) {
+                                        i_mmap_unlock_read(mapping);
+                                        i_mmap_rwsem_held = false;
+                                }
+                        }
+                }
+
                 dst_pte = huge_pte_alloc(dst, addr, sz);
                 if (!dst_pte) {
                         ret = -ENOMEM;
@@ -3901,7 +3931,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 
         if (cow)
                 mmu_notifier_invalidate_range_end(&range);
-        else
+        if (i_mmap_rwsem_held)
                 i_mmap_unlock_read(mapping);
 
         return ret;
@@ -4357,9 +4387,11 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
                          */
                        hash = hugetlb_fault_mutex_hash(mapping, idx);
                        mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-                       i_mmap_unlock_read(mapping);
+                       if (vma_shareable(vma, haddr))
+                               i_mmap_unlock_read(mapping);
                        ret = handle_userfault(&vmf, VM_UFFD_MISSING);
-                       i_mmap_lock_read(mapping);
+                       if (vma_shareable(vma, haddr))
+                               i_mmap_lock_read(mapping);
                        mutex_lock(&hugetlb_fault_mutex_table[hash]);
                        goto out;
                }
@@ -4543,19 +4575,22 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
         }
 
         /*
-         * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
-         * until finished with ptep.  This prevents huge_pmd_unshare from
-         * being called elsewhere and making the ptep no longer valid.
+         * If PMD sharing is possible, acquire i_mmap_rwsem before calling
+         * huge_pte_alloc and hold until finished with ptep.  This prevents
+         * huge_pmd_unshare from being called elsewhere and making the ptep
+         * no longer valid.
          *
         * ptep could have already be assigned via huge_pte_offset.  That
         * is OK, as huge_pte_alloc will return the same value unless
         * something has changed.
         */
         mapping = vma->vm_file->f_mapping;
-        i_mmap_lock_read(mapping);
+        if (vma_shareable(vma, haddr))
+                i_mmap_lock_read(mapping);
         ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
         if (!ptep) {
-                i_mmap_unlock_read(mapping);
+                if (vma_shareable(vma, haddr))
+                        i_mmap_unlock_read(mapping);
                 return VM_FAULT_OOM;
         }
 
@@ -4652,7 +4687,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
         }
 out_mutex:
         mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-        i_mmap_unlock_read(mapping);
+        if (vma_shareable(vma, haddr))
+                i_mmap_unlock_read(mapping);
         /*
          * Generally it's safe to hold refcount during waiting page lock. But
          * here we just wait to defer the next page fault to avoid busy loop and
@@ -5287,19 +5323,6 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
         return saddr;
 }
 
-static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
-{
-        unsigned long base = addr & PUD_MASK;
-        unsigned long end = base + PUD_SIZE;
-
-        /*
-         * check on proper vm_flags and page table alignment
-         */
-        if (vma->vm_flags & VM_MAYSHARE && range_in_vma(vma, base, end))
-                return true;
-        return false;
-}
-
 /*
  * Determine if start,end range within vma could be mapped by shared pmd.
  * If yes, adjust start and end to cover range associated with possible
@@ -5335,9 +5358,12 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
  * !shared pmd case because we can allocate the pmd later as well, it makes the
  * code much cleaner.
  *
- * This routine must be called with i_mmap_rwsem held in at least read mode.
- * For hugetlbfs, this prevents removal of any page table entries associated
- * with the address space.  This is important as we are setting up sharing
+ * FIXME - If sharing is possible, this routine must be called with
+ * i_mmap_rwsem held in at least read mode.  Leaving it up to the caller
+ * to determine if sharing is possible is asking for trouble.  Right now, all
+ * calling code is correct.  But, this needs to be cleaner.  Holding
+ * i_mmap_rwsem prevents removal of any page table entries associated with
+ * the address space.  This is important as we are setting up sharing
  * based on existing page table entries (mappings).
  */
 pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
@@ -5355,6 +5381,10 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
         if (!vma_shareable(vma, addr))
                 return (pte_t *)pmd_alloc(mm, pud, addr);
 
+        /*
+         * If we get here, caller should have acquired i_mmap_rwsem in
+         * at least read mode.
+         */
         vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
                 if (svma == vma)
                         continue;
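For reference, vma_shareable() as moved to the top of the file returns true
only for a VM_MAYSHARE mapping where the PUD_SIZE-aligned block containing
the address lies entirely inside the vma.  The userspace sketch below works
through that alignment/containment check with hypothetical stand-ins
(PUD_SIZE_DEMO, struct demo_vma); it only demonstrates which addresses of a
mapping could participate in pmd sharing and is not kernel code.

#include <stdbool.h>
#include <stdio.h>

#define PUD_SIZE_DEMO   (1UL << 30)             /* 1GB, x86-64 PUD coverage */
#define PUD_MASK_DEMO   (~(PUD_SIZE_DEMO - 1))

struct demo_vma {               /* hypothetical stand-in for vm_area_struct */
        unsigned long vm_start;
        unsigned long vm_end;
        bool may_share;         /* stands in for VM_MAYSHARE */
};

/* Same shape as vma_shareable(): flag check plus alignment/containment. */
static bool demo_shareable(const struct demo_vma *vma, unsigned long addr)
{
        unsigned long base = addr & PUD_MASK_DEMO;
        unsigned long end = base + PUD_SIZE_DEMO;

        return vma->may_share && base >= vma->vm_start && end <= vma->vm_end;
}

int main(void)
{
        /* a shared mapping from 4GB to 6.5GB */
        struct demo_vma vma = {
                .vm_start  = 4UL << 30,
                .vm_end    = (6UL << 30) + (512UL << 20),
                .may_share = true,
        };
        unsigned long addrs[] = {
                4UL << 30,              /* block [4GB,5GB) inside vma: shareable */
                (5UL << 30) + 123,      /* block [5GB,6GB) inside vma: shareable */
                (6UL << 30) + 4096,     /* block [6GB,7GB) past vm_end: not shareable */
        };

        for (unsigned int i = 0; i < sizeof(addrs) / sizeof(addrs[0]); i++)
                printf("addr %#lx -> %s\n", addrs[i],
                       demo_shareable(&vma, addrs[i]) ? "shareable" : "not shareable");
        return 0;
}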