From patchwork Mon Jul 6 20:26:13 2020
From: Mike Kravetz
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Michal Hocko, Hugh Dickins, Naoya Horiguchi, Aneesh Kumar K.V,
    Andrea Arcangeli, Kirill A. Shutemov, Davidlohr Bueso, Prakash Sangappa,
    Andrew Morton, Linus Torvalds, Mike Kravetz, kernel test robot
Subject: [RFC PATCH 1/3] Revert: "hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race"
Date: Mon, 6 Jul 2020 13:26:13 -0700
Message-Id: <20200706202615.32111-2-mike.kravetz@oracle.com>
In-Reply-To: <20200706202615.32111-1-mike.kravetz@oracle.com>
References: <20200622005551.GK5535@shao2-debian> <20200706202615.32111-1-mike.kravetz@oracle.com>

This reverts commit 87bf91d39bb52b688fb411d668fbe7df278b29ae.

Commit 87bf91d39bb5 depends on i_mmap_rwsem being taken during hugetlb fault
processing.  Commit c0d0381ade79 added code to take i_mmap_rwsem in read mode
during fault processing.  However, this was observed to increase fault
processing time by approximately 33%.  To address this, i_mmap_rwsem will only
be taken during fault processing when necessary.  As a result, i_mmap_rwsem
can no longer be used to synchronize faults with truncation.  A subsequent
commit adds code to detect this race and back out the operations.
Reported-by: kernel test robot
Signed-off-by: Mike Kravetz
---
 fs/hugetlbfs/inode.c | 28 ++++++++--------------------
 mm/hugetlb.c         | 23 ++++++++++++-----------
 2 files changed, 20 insertions(+), 31 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ef5313f9c78f..b4bb82815dd4 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -444,9 +444,10 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
 *	In this case, we first scan the range and release found pages.
 *	After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
 *	maps and global counts.  Page faults can not race with truncation
- *	in this routine.  hugetlb_no_page() holds i_mmap_rwsem and prevents
- *	page faults in the truncated range by checking i_size.  i_size is
- *	modified while holding i_mmap_rwsem.
+ *	in this routine.  hugetlb_no_page() prevents page faults in the
+ *	truncated range.  It checks i_size before allocation, and again after
+ *	with the page table lock for the page held.  The same lock must be
+ *	acquired to unmap a page.
 * hole punch is indicated if end is not LLONG_MAX
 *	In the hole punch case we scan the range and release found pages.
 *	Only when releasing a page is the associated region/reserv map
@@ -486,15 +487,7 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			index = page->index;
 			hash = hugetlb_fault_mutex_hash(mapping, index);
-			if (!truncate_op) {
-				/*
-				 * Only need to hold the fault mutex in the
-				 * hole punch case.  This prevents races with
-				 * page faults.  Races are not possible in the
-				 * case of truncation.
-				 */
-				mutex_lock(&hugetlb_fault_mutex_table[hash]);
-			}
+			mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 			/*
 			 * If page is mapped, it was faulted in after being
@@ -537,8 +530,7 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			}
 
 			unlock_page(page);
-			if (!truncate_op)
-				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		}
 		huge_pagevec_release(&pvec);
 		cond_resched();
@@ -576,8 +568,8 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	BUG_ON(offset & ~huge_page_mask(h));
 	pgoff = offset >> PAGE_SHIFT;
 
-	i_mmap_lock_write(mapping);
 	i_size_write(inode, offset);
+	i_mmap_lock_write(mapping);
 	if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
 		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
 	i_mmap_unlock_write(mapping);
@@ -699,11 +691,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		/* addr is the offset within the file (zero based) */
 		addr = index * hpage_size;
 
-		/*
-		 * fault mutex taken here, protects against fault path
-		 * and hole punch.  inode_lock previously taken protects
-		 * against truncation.
-		 */
+		/* mutex taken here, fault path and hole punch */
 		hash = hugetlb_fault_mutex_hash(mapping, index);
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 57ece74e3aae..5349beda3658 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4322,17 +4322,16 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	}
 
 	/*
-	 * We can not race with truncation due to holding i_mmap_rwsem.
-	 * i_size is modified when holding i_mmap_rwsem, so check here
-	 * once for faults beyond end of file.
+	 * Use page lock to guard against racing truncation
+	 * before we get page_table_lock.
 	 */
-	size = i_size_read(mapping->host) >> huge_page_shift(h);
-	if (idx >= size)
-		goto out;
-
 retry:
 	page = find_lock_page(mapping, idx);
 	if (!page) {
+		size = i_size_read(mapping->host) >> huge_page_shift(h);
+		if (idx >= size)
+			goto out;
+
 		/*
 		 * Check for page in userfault range
 		 */
@@ -4438,6 +4437,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	}
 
 	ptl = huge_pte_lock(h, mm, ptep);
+	size = i_size_read(mapping->host) >> huge_page_shift(h);
+	if (idx >= size)
+		goto backout;
+
 	ret = 0;
 	if (!huge_pte_none(huge_ptep_get(ptep)))
 		goto backout;
@@ -4541,10 +4544,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	/*
 	 * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
-	 * until finished with ptep.  This serves two purposes:
-	 * 1) It prevents huge_pmd_unshare from being called elsewhere
-	 *    and making the ptep no longer valid.
-	 * 2) It synchronizes us with i_size modifications during truncation.
+	 * until finished with ptep.  This prevents huge_pmd_unshare from
+	 * being called elsewhere and making the ptep no longer valid.
 	 *
 	 * ptep could have already be assigned via huge_pte_offset.  That
 	 * is OK, as huge_pte_alloc will return the same value unless
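To make the window concrete, the sketch below is a minimal userspace
illustration (not the kernel test robot's actual reproducer) of the kind of
fault/truncate interleaving this revert leaves to the i_size re-check under
the page table lock instead of i_mmap_rwsem.  The hugetlbfs mount point, file
name and 2MB huge page size are assumptions, and the SIGBUS handling a real
reproducer would need is omitted for brevity.

#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SZ	(2UL * 1024 * 1024)	/* assumed 2MB huge pages */
#define NR_PAGES	8UL

static int fd;

/* Repeatedly map the file and touch every huge page to drive hugetlb_no_page(). */
static void *faulter(void *unused)
{
	(void)unused;
	for (;;) {
		char *p = mmap(NULL, NR_PAGES * HPAGE_SZ,
			       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			continue;
		for (unsigned long i = 0; i < NR_PAGES; i++)
			p[i * HPAGE_SZ] = 1;
		munmap(p, NR_PAGES * HPAGE_SZ);
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	/* assumes a hugetlbfs mount at /dev/hugepages */
	fd = open("/dev/hugepages/fault-truncate-race", O_CREAT | O_RDWR, 0600);
	if (fd < 0)
		return 1;
	pthread_create(&t, NULL, faulter, NULL);

	/* Grow and shrink the file so truncation keeps racing with the faults. */
	for (;;) {
		ftruncate(fd, NR_PAGES * HPAGE_SZ);
		ftruncate(fd, 0);
	}
}

With this revert applied, a fault that loses such a race is expected to notice
idx >= i_size either before allocating the page or under the page table lock
and back out, rather than relying on i_mmap_rwsem serialization.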
From patchwork Mon Jul 6 20:26:14 2020
From: Mike Kravetz
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Michal Hocko, Hugh Dickins, Naoya Horiguchi, Aneesh Kumar K.V,
    Andrea Arcangeli, Kirill A. Shutemov, Davidlohr Bueso, Prakash Sangappa,
    Andrew Morton, Linus Torvalds, Mike Kravetz, kernel test robot
Subject: [RFC PATCH 2/3] hugetlbfs: Only take i_mmap_rwsem when sharing is possible
Date: Mon, 6 Jul 2020 13:26:14 -0700
Message-Id: <20200706202615.32111-3-mike.kravetz@oracle.com>
In-Reply-To: <20200706202615.32111-1-mike.kravetz@oracle.com>
References: <20200622005551.GK5535@shao2-debian> <20200706202615.32111-1-mike.kravetz@oracle.com>

Commit c0d0381ade79 added code to take i_mmap_rwsem in read mode during fault
processing.  However, this was observed to increase fault processing time by
approximately 33%.

Technically, i_mmap_rwsem only needs to be held when pmd sharing is possible.
pmd sharing depends on mapping flags, alignment and size.  The routine
vma_shareable() already checks these conditions.  Therefore, use
vma_shareable() to determine if sharing is possible and whether taking
i_mmap_rwsem is necessary.  This is done during fault processing and vma
copying.  Code in memory-failure, page migration and userfaultfd continues to
always take i_mmap_rwsem; those paths are not as performance sensitive.
Reported-by: kernel test robot
Signed-off-by: Mike Kravetz
---
 mm/hugetlb.c | 96 ++++++++++++++++++++++++++++++++++------------------
 1 file changed, 63 insertions(+), 33 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5349beda3658..6e9085464e78 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3656,6 +3656,21 @@ static int hugetlb_acct_memory(struct hstate *h, long delta)
 	return ret;
 }
 
+#ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
+static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
+{
+	unsigned long base = addr & PUD_MASK;
+	unsigned long end = base + PUD_SIZE;
+
+	/*
+	 * check on proper vm_flags and page table alignment
+	 */
+	if (vma->vm_flags & VM_MAYSHARE && range_in_vma(vma, base, end))
+		return true;
+	return false;
+}
+#endif
+
 static void hugetlb_vm_op_open(struct vm_area_struct *vma)
 {
 	struct resv_map *resv = vma_resv_map(vma);
@@ -3807,6 +3822,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	unsigned long sz = huge_page_size(h);
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct mmu_notifier_range range;
+	bool i_mmap_rwsem_held = false;
 	int ret = 0;
 
 	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
@@ -3816,14 +3832,6 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 					vma->vm_start,
 					vma->vm_end);
 		mmu_notifier_invalidate_range_start(&range);
-	} else {
-		/*
-		 * For shared mappings i_mmap_rwsem must be held to call
-		 * huge_pte_alloc, otherwise the returned ptep could go
-		 * away if part of a shared pmd and another thread calls
-		 * huge_pmd_unshare.
-		 */
-		i_mmap_lock_read(mapping);
 	}
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
@@ -3831,6 +3839,28 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		src_pte = huge_pte_offset(src, addr, sz);
 		if (!src_pte)
 			continue;
+
+		/*
+		 * For shared mappings(non-cow) i_mmap_rwsem must be held to
+		 * call huge_pte_alloc, otherwise the returned ptep could go
+		 * away if part of a shared pmd and another thread calls
+		 * huge_pmd_unshare.  This is only necessary if the specific
+		 * pmd can be shared.  Acquire/drop semaphore as necessary.
+		 */
+		if (!cow) {
+			if (!i_mmap_rwsem_held) {
+				if (vma_shareable(vma, addr)) {
+					i_mmap_lock_read(mapping);
+					i_mmap_rwsem_held = true;
+				}
+			} else {
+				if (!vma_shareable(vma, addr)) {
+					i_mmap_unlock_read(mapping);
+					i_mmap_rwsem_held = false;
+				}
+			}
+		}
+
 		dst_pte = huge_pte_alloc(dst, addr, sz);
 		if (!dst_pte) {
 			ret = -ENOMEM;
@@ -3901,7 +3931,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 
 	if (cow)
 		mmu_notifier_invalidate_range_end(&range);
-	else
+	if (i_mmap_rwsem_held)
 		i_mmap_unlock_read(mapping);
 
 	return ret;
@@ -4357,9 +4387,11 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			 */
 			hash = hugetlb_fault_mutex_hash(mapping, idx);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-			i_mmap_unlock_read(mapping);
+			if (vma_shareable(vma, haddr))
+				i_mmap_unlock_read(mapping);
 			ret = handle_userfault(&vmf, VM_UFFD_MISSING);
-			i_mmap_lock_read(mapping);
+			if (vma_shareable(vma, haddr))
+				i_mmap_lock_read(mapping);
 			mutex_lock(&hugetlb_fault_mutex_table[hash]);
 			goto out;
 		}
@@ -4543,19 +4575,22 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	/*
-	 * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
-	 * until finished with ptep.  This prevents huge_pmd_unshare from
-	 * being called elsewhere and making the ptep no longer valid.
+	 * If PMD sharing is possible, acquire i_mmap_rwsem before calling
+	 * huge_pte_alloc and hold until finished with ptep.  This prevents
+	 * huge_pmd_unshare from being called elsewhere and making the ptep
+	 * no longer valid.
 	 *
 	 * ptep could have already be assigned via huge_pte_offset.  That
 	 * is OK, as huge_pte_alloc will return the same value unless
 	 * something has changed.
 	 */
 	mapping = vma->vm_file->f_mapping;
-	i_mmap_lock_read(mapping);
+	if (vma_shareable(vma, haddr))
+		i_mmap_lock_read(mapping);
 
 	ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
 	if (!ptep) {
-		i_mmap_unlock_read(mapping);
+		if (vma_shareable(vma, haddr))
+			i_mmap_unlock_read(mapping);
 		return VM_FAULT_OOM;
 	}
@@ -4652,7 +4687,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 out_mutex:
 	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-	i_mmap_unlock_read(mapping);
+	if (vma_shareable(vma, haddr))
+		i_mmap_unlock_read(mapping);
 	/*
 	 * Generally it's safe to hold refcount during waiting page lock. But
 	 * here we just wait to defer the next page fault to avoid busy loop and
@@ -5287,19 +5323,6 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
 	return saddr;
 }
 
-static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
-{
-	unsigned long base = addr & PUD_MASK;
-	unsigned long end = base + PUD_SIZE;
-
-	/*
-	 * check on proper vm_flags and page table alignment
-	 */
-	if (vma->vm_flags & VM_MAYSHARE && range_in_vma(vma, base, end))
-		return true;
-	return false;
-}
-
 /*
  * Determine if start,end range within vma could be mapped by shared pmd.
  * If yes, adjust start and end to cover range associated with possible
@@ -5335,9 +5358,12 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
 * !shared pmd case because we can allocate the pmd later as well, it makes the
 * code much cleaner.
 *
- * This routine must be called with i_mmap_rwsem held in at least read mode.
- * For hugetlbfs, this prevents removal of any page table entries associated
- * with the address space.  This is important as we are setting up sharing
+ * FIXME - If sharing is possible, this routine must be called with
+ * i_mmap_rwsem held in at least read mode.  Leaving it up to the caller
+ * to determine if sharing is possible is asking for trouble.  Right now, all
+ * calling code is correct.  But, this needs to be cleaner.  Holding
+ * i_mmap_rwsem prevents removal of any page table entries associated with
+ * the address space.  This is important as we are setting up sharing
 * based on existing page table entries (mappings).
 */
 pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
@@ -5355,6 +5381,10 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	if (!vma_shareable(vma, addr))
 		return (pte_t *)pmd_alloc(mm, pud, addr);
 
+	/*
+	 * If we get here, caller should have acquired i_mmap_rwsem in
+	 * at least read mode.
+	 */
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
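As a rough illustration of the two conditions vma_shareable() tests, here is
what they look like from userspace on x86_64, where PUD_SIZE is 1GB.  The
mount path and sizes below are assumptions for the example, and actual
sharing in huge_pmd_share() additionally requires another mapping of the same
file range to be found in the i_mmap interval tree.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define PUD_SZ	(1UL << 30)	/* 1GB on x86_64 */

int main(void)
{
	int fd = open("/dev/hugepages/shared-file", O_CREAT | O_RDWR, 0600);

	if (fd < 0)
		return 1;

	/*
	 * Candidate for PMD sharing: MAP_SHARED gives VM_MAYSHARE, and a 2GB
	 * mapping always contains at least one PUD-aligned 1GB range, so
	 * faults in that range satisfy range_in_vma(vma, base, end).
	 */
	void *shared = mmap(NULL, 2 * PUD_SZ, PROT_READ | PROT_WRITE,
			    MAP_SHARED, fd, 0);

	/*
	 * Never a candidate: MAP_PRIVATE lacks VM_MAYSHARE, so with this
	 * patch applied such faults skip i_mmap_rwsem entirely.
	 */
	void *priv = mmap(NULL, PUD_SZ / 2, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE, fd, 0);

	printf("shared candidate at %p, private mapping at %p\n", shared, priv);
	return 0;
}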
From patchwork Mon Jul 6 20:26:15 2020
From: Mike Kravetz
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Michal Hocko, Hugh Dickins, Naoya Horiguchi, Aneesh Kumar K.V,
    Andrea Arcangeli, Kirill A. Shutemov, Davidlohr Bueso, Prakash Sangappa,
    Andrew Morton, Linus Torvalds, Mike Kravetz, kernel test robot
Subject: [RFC PATCH 3/3] hugetlbfs: handle page fault/truncate races
Date: Mon, 6 Jul 2020 13:26:15 -0700
Message-Id: <20200706202615.32111-4-mike.kravetz@oracle.com>
In-Reply-To: <20200706202615.32111-1-mike.kravetz@oracle.com>
References: <20200622005551.GK5535@shao2-debian> <20200706202615.32111-1-mike.kravetz@oracle.com>

A hugetlb page fault can race with page truncation.  Make the code that
identifies and handles these races more robust.

Page fault handling needs to back out pages added to the page cache beyond
the file size (i_size).  When backing out such a page, take care to restore
reserve map entries and counts as necessary.

File truncation (remove_inode_hugepages) needs to handle page mapping changes
before locking the page.  This could happen if the page was added to the page
cache and later backed out in fault processing.
Reported-by: kernel test robot
Signed-off-by: Mike Kravetz
---
 fs/hugetlbfs/inode.c | 41 ++++++++++++++++++++++-------------------
 mm/hugetlb.c         | 37 +++++++++++++++++++++++++++++++++++--
 2 files changed, 57 insertions(+), 21 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index b4bb82815dd4..eeddd43b8809 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -494,13 +494,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			 * unmapped in caller.  Unmap (again) now after taking
 			 * the fault mutex.  The mutex will prevent faults
 			 * until we finish removing the page.
-			 *
-			 * This race can only happen in the hole punch case.
-			 * Getting here in a truncate operation is a bug.
 			 */
 			if (unlikely(page_mapped(page))) {
-				BUG_ON(truncate_op);
-
 				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 				i_mmap_lock_write(mapping);
 				mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -512,23 +507,31 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 
 			lock_page(page);
 			/*
-			 * We must free the huge page and remove from page
-			 * cache (remove_huge_page) BEFORE removing the
-			 * region/reserve map (hugetlb_unreserve_pages).  In
-			 * rare out of memory conditions, removal of the
-			 * region/reserve map could fail.  Correspondingly,
-			 * the subpool and global reserve usage count can need
-			 * to be adjusted.
+			 * After locking page, make sure mapping is the same.
+			 * We could have raced with page fault populate and
+			 * backout code.
 			 */
-			VM_BUG_ON(PagePrivate(page));
-			remove_huge_page(page);
-			freed++;
-			if (!truncate_op) {
-				if (unlikely(hugetlb_unreserve_pages(inode,
+			if (page_mapping(page) == mapping) {
+				/*
+				 * We must free the huge page and remove from
+				 * page cache (remove_huge_page) BEFORE
+				 * removing the region/reserve map.  In rare
+				 * out of memory conditions, removal of the
+				 * region/reserve map could fail.
+				 * Correspondingly, the subpool and global
+				 * reserve usage count can need to be adjusted.
+				 */
+				VM_BUG_ON(PagePrivate(page));
+				remove_huge_page(page);
+				freed++;
+				if (!truncate_op) {
+					if (unlikely(
+						hugetlb_unreserve_pages(inode,
 							index, index + 1, 1)))
-					hugetlb_fix_reserve_counts(inode);
+						hugetlb_fix_reserve_counts(
+									inode);
+				}
 			}
-
 			unlock_page(page);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		}

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6e9085464e78..68785cc80523 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4339,6 +4339,9 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	spinlock_t *ptl;
 	unsigned long haddr = address & huge_page_mask(h);
 	bool new_page = false;
+	bool page_cache = false;
+	bool reserve_alloc = false;
+	bool beyond_i_size = false;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -4423,6 +4426,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		clear_huge_page(page, address, pages_per_huge_page(h));
 		__SetPageUptodate(page);
 		new_page = true;
+		if (PagePrivate(page))
+			reserve_alloc = true;
 
 		if (vma->vm_flags & VM_MAYSHARE) {
 			int err = huge_add_to_page_cache(page, mapping, idx);
@@ -4432,6 +4437,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 					goto retry;
 				goto out;
 			}
+			page_cache = true;
 		} else {
 			lock_page(page);
 			if (unlikely(anon_vma_prepare(vma))) {
@@ -4470,8 +4476,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 
 	ptl = huge_pte_lock(h, mm, ptep);
 	size = i_size_read(mapping->host) >> huge_page_shift(h);
-	if (idx >= size)
+	if (idx >= size) {
+		beyond_i_size = true;
 		goto backout;
+	}
 
 	ret = 0;
 	if (!huge_pte_none(huge_ptep_get(ptep)))
@@ -4509,8 +4517,33 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 backout:
 	spin_unlock(ptl);
 backout_unlocked:
+	if (new_page) {
+		/*
+		 * Back out pages added to page cache beyond i_size.  Otherwise,
+		 * they will 'sit' there until the file is removed.
+		 */
+		if (page_cache && beyond_i_size) {
+			/* FIXME - following lines are remove_huge_page() */
+			ClearPageDirty(page);
+			ClearPageUptodate(page);
+			delete_from_page_cache(page);
+		}
+
+		/*
+		 * If reserve was consumed, set PagePrivate so that it will
+		 * be restored in free_huge_page().
+		 */
+		if (reserve_alloc)
+			SetPagePrivate(page);
+
+		/*
+		 * Do not restore reserve map entries beyond i_size.  Otherwise,
+		 * there will be leaks when the file is removed.
+		 */
+		if (!beyond_i_size)
+			restore_reserve_on_error(h, vma, haddr, page);
+	}
 	unlock_page(page);
-	restore_reserve_on_error(h, vma, haddr, page);
 	put_page(page);
 	goto out;
 }
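Finally, a small observability aid: the reserve map leaks this patch worries
about would show up as hugetlb counters in /proc/meminfo that never return to
their baseline after a fault/truncate stress run (for example the sketch after
patch 1) once the file is deleted and all mappings are gone.  The counter
names are real /proc/meminfo fields; treating a stuck HugePages_Rsvd as proof
of this particular bug is, of course, only a heuristic.

#include <stdio.h>
#include <string.h>

/* Read one named counter (e.g. "HugePages_Rsvd") out of /proc/meminfo. */
static long read_meminfo(const char *name)
{
	char line[128];
	long val = -1;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, name, strlen(name))) {
			sscanf(line + strlen(name), ": %ld", &val);
			break;
		}
	}
	fclose(f);
	return val;
}

int main(void)
{
	/*
	 * With no hugetlbfs files or mappings left, both values should match
	 * their pre-test baseline; a HugePages_Rsvd value that stays elevated
	 * suggests a leaked reservation.
	 */
	printf("HugePages_Free: %ld\n", read_meminfo("HugePages_Free"));
	printf("HugePages_Rsvd: %ld\n", read_meminfo("HugePages_Rsvd"));
	return 0;
}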