From patchwork Mon Jan 9 07:22:28 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yin Fengwei
X-Patchwork-Id: 13093092
From: Yin Fengwei
To: linux-mm@kvack.org, akpm@linux-foundation.org, jack@suse.cz,
 hughd@google.com, kirill.shutemov@linux.intel.com, mhocko@suse.com,
 ak@linux.intel.com, aarcange@redhat.com, npiggin@gmail.com,
 mgorman@techsingularity.net, willy@infradead.org, rppt@kernel.org,
 dave.hansen@intel.com, ying.huang@intel.com, tim.c.chen@intel.com
Cc: fengwei.yin@intel.com
Subject: [RFC PATCH 0/4] Multiple consecutive page for anonymous mapping
Date: Mon, 9 Jan 2023 15:22:28 +0800
Message-Id: <20230109072232.2398464-1-fengwei.yin@intel.com>
X-Mailer: git-send-email 2.30.2
Sender: owner-linux-mm@kvack.org
Precedence: bulk

In a nutshell: 4k is too small and 2M is too big.
We started asking ourselves whether there was something in the middle
that we could do. This series shows what that middle ground might look
like. It provides some of the benefits of THP while eliminating some of
the downsides.

This series uses "multiple consecutive pages" (mcpages) of between 8K
and 2M of base pages for anonymous user space mappings. This leads to
less internal fragmentation than 2M mappings, and thus to less memory
consumption and less CPU time wasted zeroing memory that will never be
used.

In the implementation, we allocate a high-order page with the order of
the mcpage (e.g., order 2 for a 16KB mcpage). This ensures that
physically contiguous memory is used, which benefits sequential memory
access latency. The high-order page is then split, so each sub-page of
an mcpage is just a normal 4K page. The current kernel page management
applies to mcpages without any changes. mcpage also allows page faults
to be batched, which reduces the number of page faults.

There are costs with mcpage. Besides lacking the TLB benefit that THP
brings, it increases memory consumption and page allocation latency
compared to 4K base pages.

This series is the first step for mcpage. Future work could enable
mcpage for more components such as the page cache, swapping, etc.
Eventually, most pages in the system would be allocated, freed and
reclaimed at mcpage order.

The series is constructed as follows:
Patch 1 adds the mcpage size-related definitions and the Kconfig entry.
Patch 2 is the main change. It hooks into the anonymous page fault
        handler and applies mcpage to anonymous mappings.
Patch 3 adds some mcpage statistics.
Patch 4 is specific to x86_64 and aligns the mmap start address to the
        mcpage size.

The overall code change is quite straightforward. What I would most
like to hear is whether this is the right direction to pursue further.

This series does not leverage compound pages. This means that normal
kernel code that encounters an 'mcpage' region does not need to do
anything special.
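
A minimal sketch of the address arithmetic that the size/mask/shift definitions and the mmap alignment imply, assuming 4K base pages and an order-2 (16K) mcpage. Every macro and function name below is hypothetical and is not taken from the patches; this is plain user-space C for illustration, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed configuration: 4K base pages, order-2 (16K) mcpage.
 * All identifiers are illustrative, not the ones used by the series. */
#define BASE_PAGE_SHIFT 12u
#define MCPAGE_ORDER    2u
#define MCPAGE_SHIFT    (BASE_PAGE_SHIFT + MCPAGE_ORDER)
#define MCPAGE_SIZE     ((uintptr_t)1 << MCPAGE_SHIFT)
#define MCPAGE_MASK     (~(MCPAGE_SIZE - 1))
#define MCPAGE_NR_PAGES (1u << MCPAGE_ORDER)

/* Round a faulting address down to the start of its mcpage. */
static inline uintptr_t mcpage_base(uintptr_t addr)
{
    return addr & MCPAGE_MASK;
}

/* Round an address up to the next mcpage boundary, as an mmap start
 * address would need to be for whole mcpages to fit the mapping. */
static inline uintptr_t mcpage_align_up(uintptr_t addr)
{
    return (addr + MCPAGE_SIZE - 1) & MCPAGE_MASK;
}

/* Index of the 4K sub-page containing addr within its mcpage. */
static inline unsigned int mcpage_subpage_index(uintptr_t addr)
{
    unsigned int idx =
        (unsigned int)((addr & ~MCPAGE_MASK) >> BASE_PAGE_SHIFT);

    assert(idx < MCPAGE_NR_PAGES);
    return idx;
}
```

Under these assumptions, a fault at 0x12345 belongs to the mcpage starting at 0x10000 and falls in its sub-page 2; a single fault could then map all four sub-pages of that mcpage.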
This series also does not leverage folios, although trying to leverage
folios is something that we would like to explore. We would welcome
input on how that might happen.

Some performance data was collected with a 16K mcpage size and is shown
in patches 2/4 and 4/4. If you have other workloads and would like to
know the impact, just let me know. I can set up the environment and run
the tests.

Yin Fengwei (4):
  mcpage: add size/mask/shift definition for multiple consecutive page
  mcpage: anon page: Use mcpage for anonymous mapping
  mcpage: add vmstat counters for mcpages
  mcpage: get_unmapped_area return mcpage size aligned addr

 arch/x86/kernel/sys_x86_64.c  |   8 ++
 include/linux/gfp.h           |   5 ++
 include/linux/mcpage_mm.h     |  35 +++++++++
 include/linux/mm_types.h      |  11 +++
 include/linux/vm_event_item.h |  10 +++
 mm/Kconfig                    |  19 +++++
 mm/Makefile                   |   1 +
 mm/mcpage_memory.c            | 140 ++++++++++++++++++++++++++++++++++
 mm/memory.c                   |  12 +++
 mm/mempolicy.c                |  51 +++++++++++++
 mm/vmstat.c                   |   7 ++
 11 files changed, 299 insertions(+)
 create mode 100644 include/linux/mcpage_mm.h
 create mode 100644 mm/mcpage_memory.c

base-commit: b7bfaa761d760e72a969d116517eaa12e404c262
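
As a rough illustration of the trade-off this cover letter describes (fewer page faults than 4K, less wasted memory than 2M), the following sketch counts faults and unused bytes for a given mapping granule. It is an idealized model, not a measurement: it assumes the touched region is contiguous and starts on a granule boundary, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of page faults needed to populate `bytes` of a mapping when
 * each fault maps `granule` bytes. Idealized: the region is assumed
 * contiguous and granule-aligned. */
static inline size_t faults_needed(size_t bytes, size_t granule)
{
    assert(granule != 0);
    return (bytes + granule - 1) / granule;
}

/* Bytes allocated and zeroed but never used when only `bytes` are
 * touched: the internal fragmentation for one granule size. */
static inline size_t wasted_bytes(size_t bytes, size_t granule)
{
    return faults_needed(bytes, granule) * granule - bytes;
}
```

For a 40K region in this model, 4K pages take 10 faults and waste nothing, a 16K mcpage takes 3 faults and wastes 8K, and a single 2M mapping takes 1 fault and leaves 2008K unused.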