From: Joao Martins
To: linux-mm@kvack.org
Cc: Muchun Song, Mike Kravetz, Andrew Morton, Joao Martins
Subject: [PATCH v3] mm/hugetlb_vmemmap: remap head page to newly allocated page
Date: Wed, 9 Nov 2022 20:06:23 +0000
Message-Id: <20221109200623.96867-1-joao.m.martins@oracle.com>

Today, with `hugetlb_free_vmemmap=on`, the struct page memory is freed
back to the page allocator as follows: for a 2M hugetlb page, the first
4K vmemmap page is reused to remap the remaining 7 vmemmap pages, and
for a 1G hugetlb page it is reused to remap the remaining 4095 vmemmap
pages. Essentially, that means it breaks off the first 4K of a
potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or
16M (for 1G hugetlb pages). For this reason, the memory that is freed
back to the page allocator cannot be used by hugetlb to allocate huge
pages of the same size, but only of a smaller huge page size:

Trying to assign a 64G node to hugetlb (on a 128G 2node guest, each node
having 64G):

* Before allocation:
Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
...
Node    0, zone   Normal, type      Movable    340    100     32     15      1      2      0      0      0      1  15558

$ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
31987

* After:

Node    0, zone   Normal, type      Movable  30893  32006  31515      7      0      0      0      0      0      0      0

Notice how the memory freed back is put into the 4K / 8K / 16K page
pools, and that a total of 31987 pages (63974M) is allocated.

To fix this behaviour, rather than remapping the second vmemmap page
(thus breaking the contiguous block of memory backing the struct pages),
repopulate the first vmemmap page with a new one. We allocate a new page
and copy into it from the currently mapped vmemmap page, and then remap
it later on. The same algorithm works if there's a pre-initialized
walk::reuse_page: the head page doesn't need to be skipped, and instead
we remap it when the @addr being changed is the @reuse_addr.

The new head page is allocated in vmemmap_remap_free(), given that on
restore there's no need for a functional change.
Note that because one hugepage is remapped at a time right now, only one
free 4K page at a time is needed to remap the head page. Should it fail
to allocate said new page, it reuses the one that's already mapped, just
like before. As a result, for every 64G of contiguous hugepages it can
give back 1G more of contiguous memory, while needing in total 128M of
new 4K pages (for 2M hugetlb) or 256K (for 1G hugetlb).

After the changes, try to assign a 64G node to hugetlb (on a 128G 2node
guest, each node with 64G):

* Before allocation:
Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
...
Node    0, zone   Normal, type      Movable      1      1      1      0      0      1      0      0      1      1  15564

$ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
32394

* After:

Node    0, zone   Normal, type      Movable      0     50     97    108     96     81     70     46     18      0      0

In the example above, 407 more hugetlb 2M pages are allocated, i.e. 814M
out of the 32394 (64788M) allocated. So the memory freed back is indeed
being reused by hugetlb, and there is no massive accumulation of unused
order-0..order-2 pages.

Signed-off-by: Joao Martins
Reviewed-by: Muchun Song
---
Changes since v2:
Comments from Muchun:
* Delete the comment above the tlb flush
* Move the head vmemmap page copy into vmemmap_remap_free()
* Add and del the new head page to the vmemmap_pages (to be freed
  in case of error)
* Move the remap of the head like the tail pages in vmemmap_remap_pte()
  but special casing only when addr == reuse_addr
* Remove the PAGE_SIZE alignment check, as the code assumes that
  start/end are page-aligned (and VM_BUG_ON otherwise).
* Adjust the commit message to take the above changes into account.
---
 mm/hugetlb_vmemmap.c | 34 +++++++++++++++++++++++++++-------
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 7898c2c75e35..f562b3f46410 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -203,12 +203,7 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
 			return ret;
 	} while (pgd++, addr = next, addr != end);
 
-	/*
-	 * We only change the mapping of the vmemmap virtual address range
-	 * [@start + PAGE_SIZE, end), so we only need to flush the TLB which
-	 * belongs to the range.
-	 */
-	flush_tlb_kernel_range(start + PAGE_SIZE, end);
+	flush_tlb_kernel_range(start, end);
 
 	return 0;
 }
@@ -244,9 +239,16 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
 	 * to the tail pages.
 	 */
 	pgprot_t pgprot = PAGE_KERNEL_RO;
-	pte_t entry = mk_pte(walk->reuse_page, pgprot);
 	struct page *page = pte_page(*pte);
+	pte_t entry;
 
+	/* Remapping the head page requires r/w */
+	if (unlikely(addr == walk->reuse_addr)) {
+		pgprot = PAGE_KERNEL;
+		list_del(&walk->reuse_page->lru);
+	}
+
+	entry = mk_pte(walk->reuse_page, pgprot);
 	list_add_tail(&page->lru, walk->vmemmap_pages);
 	set_pte_at(&init_mm, addr, pte, entry);
 }
@@ -315,6 +317,24 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
 		.reuse_addr	= reuse,
 		.vmemmap_pages	= &vmemmap_pages,
 	};
+	int nid = page_to_nid((struct page *)start);
+	gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY |
+			__GFP_NOWARN;
+
+	/*
+	 * Allocate a new head vmemmap page to avoid breaking a contiguous
+	 * block of struct page memory when freeing it back to page allocator
+	 * in free_vmemmap_page_list(). This will allow the likely contiguous
+	 * struct page backing memory to be kept contiguous and allowing for
+	 * more allocations of hugepages. Fallback to the currently
+	 * mapped head page in case should it fail to allocate.
+	 */
+	walk.reuse_page = alloc_pages_node(nid, gfp_mask, 0);
+	if (walk.reuse_page) {
+		copy_page(page_to_virt(walk.reuse_page),
+			  (void *)walk.reuse_addr);
+		list_add(&walk.reuse_page->lru, &vmemmap_pages);
+	}
 
 	/*
 	 * In order to make remapping routine most efficient for the huge pages,