@@ -265,6 +265,8 @@ struct page *alloc_pages(gfp_t gfp, unsigned int order);
struct folio *folio_alloc(gfp_t gfp, unsigned order);
struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
unsigned long addr, bool hugepage);
+struct page *alloc_mcpages(gfp_t gfp, int order, struct vm_area_struct *vma,
+ unsigned long addr);
#else
static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
{
@@ -276,7 +278,10 @@ static inline struct folio *folio_alloc(gfp_t gfp, unsigned int order)
}
#define vma_alloc_folio(gfp, order, vma, addr, hugepage) \
folio_alloc(gfp, order)
+#define alloc_mcpages(gfp, order, vma, addr) \
+ alloc_pages(gfp, order)
#endif
+
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
static inline struct page *alloc_page_vma(gfp_t gfp,
struct vm_area_struct *vma, unsigned long addr)
new file mode 100644
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MCPAGE_MM_H
+#define _LINUX_MCPAGE_MM_H
+
+#include <linux/mm_types.h>
+
+#ifdef CONFIG_MCPAGE_ORDER
+
+static inline bool allow_mcpage(struct vm_area_struct *vma,
+ unsigned long addr, unsigned int order)
+{
+ unsigned int mcpage_size = 1 << (order + PAGE_SHIFT);
+ unsigned long haddr = ALIGN_DOWN(addr, mcpage_size);
+
+ return range_in_vma(vma, haddr, haddr + mcpage_size); /* whole aligned mcpage must fit inside the VMA */
+}
+
+extern vm_fault_t do_anonymous_mcpages(struct vm_fault *vmf,
+ unsigned int order);
+
+#else
+static inline bool allow_mcpage(struct vm_area_struct *vma,
+ unsigned long addr, unsigned int order)
+{
+ return false; /* mcpage support compiled out: never use mcpages */
+}
+
+static inline vm_fault_t do_anonymous_mcpages(struct vm_fault *vmf,
+ unsigned int order)
+{
+ return VM_FAULT_FALLBACK; /* let the caller take the normal 4K fault path */
+}
+#endif /* CONFIG_MCPAGE_ORDER */
+
+#endif /* _LINUX_MCPAGE_MM_H */
@@ -96,6 +96,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_NUMA) += memory-tiers.o
obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
+obj-$(CONFIG_MCPAGE) += mcpage_memory.o
obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
ifdef CONFIG_SWAP
new file mode 100644
@@ -0,0 +1,134 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright(c) 2022 Intel Corporation. All rights reserved.
+ */
+
+#include <linux/gfp.h>
+#include <linux/page_owner.h>
+#include <linux/pgtable.h>
+#include <linux/memcontrol.h>
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/rmap.h>
+#include <linux/oom.h>
+#include <linux/vm_event_item.h>
+#include <linux/userfaultfd_k.h>
+
+#include "internal.h"
+
+#ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
+static inline struct page *
+alloc_zeroed_mcpages(int order, struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ struct page *page = alloc_mcpages(GFP_HIGHUSER_MOVABLE, order,
+ vma, addr);
+
+ if (page) {
+ int i;
+ struct page *it = page;
+
+ for (i = 0; i < (1 << order); i++, it++) {
+ clear_user_highpage(it, addr + i * PAGE_SIZE); /* vaddr of sub-page i; matters on aliasing (VIVT) caches */
+ cond_resched();
+ }
+ }
+
+ return page;
+}
+#else
+static inline struct page *
+alloc_zeroed_mcpages(int order, struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ return alloc_mcpages(GFP_HIGHUSER_MOVABLE | __GFP_ZERO,
+ order, vma, addr);
+}
+#endif
+
+static vm_fault_t do_anonymous_mcpage(struct vm_fault *vmf,
+ struct page *page, unsigned long addr)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ vm_fault_t ret = 0;
+ pte_t entry;
+
+ if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL)) { /* charge this sub-page to the memcg before mapping */
+ ret = VM_FAULT_OOM;
+ goto oom;
+ }
+
+ cgroup_throttle_swaprate(page, GFP_KERNEL);
+ __SetPageUptodate(page); /* page content (zeroes) is valid before it becomes visible */
+
+ entry = mk_pte(page, vma->vm_page_prot);
+ entry = pte_sw_mkyoung(entry);
+ if (vma->vm_flags & VM_WRITE)
+ entry = pte_mkwrite(pte_mkdirty(entry));
+
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
+
+ if (!pte_none(*vmf->pte)) { /* raced with another fault: PTE already populated, caller frees the page */
+ ret = VM_FAULT_FALLBACK;
+ update_mmu_cache(vma, addr, vmf->pte); /* NOTE(review): do_anonymous_page() uses update_mmu_tlb() here -- confirm */
+ goto release;
+ }
+
+ ret = check_stable_address_space(vma->vm_mm);
+ if (ret) {
+ ret = VM_FAULT_FALLBACK;
+ goto release;
+ }
+
+ if (userfaultfd_missing(vma)) { /* defer missing-page handling to userspace */
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return handle_userfault(vmf, VM_UFFD_MISSING);
+ }
+
+ inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
+ page_add_new_anon_rmap(page, vma, addr);
+ lru_cache_add_inactive_or_unevictable(page, vma);
+ set_pte_at(vma->vm_mm, addr, vmf->pte, entry); /* publish the PTE last, under the ptl */
+ update_mmu_cache(vma, addr, vmf->pte);
+release:
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+oom:
+ return ret;
+}
+
+vm_fault_t do_anonymous_mcpages(struct vm_fault *vmf, unsigned int order)
+{
+ int i, nr = 1 << order;
+ unsigned int mcpage_size = nr * PAGE_SIZE;
+ vm_fault_t ret = 0, real_ret = 0;
+ bool handled = false;
+ struct page *page;
+ unsigned long haddr = ALIGN_DOWN(vmf->address, mcpage_size); /* mcpage-aligned base of the faulting range */
+
+ page = alloc_zeroed_mcpages(order, vmf->vma, haddr);
+ if (!page)
+ return VM_FAULT_FALLBACK;
+
+ split_page(page, order); /* split up front so each 4K sub-page has its own refcount/mapcount */
+ for (i = 0; i < nr; i++, haddr += PAGE_SIZE) {
+ ret = do_anonymous_mcpage(vmf, &page[i], haddr);
+ if (haddr == PAGE_ALIGN_DOWN(vmf->address)) { /* this iteration maps the faulting address itself */
+ real_ret = ret;
+ handled = true;
+ }
+ if (ret)
+ break;
+ }
+
+ while (i < nr)
+ put_page(&page[i++]); /* free sub-pages left unmapped after an early break */
+
+ /*
+ * If the fault address is not handled, fallback to handle
+ * fault address with normal page.
+ */
+ if (!handled)
+ return VM_FAULT_FALLBACK;
+ else
+ return real_ret;
+}
@@ -77,6 +77,7 @@
#include <linux/ptrace.h>
#include <linux/vmalloc.h>
#include <linux/sched/sysctl.h>
+#include <linux/mcpage_mm.h>
#include <trace/events/kmem.h>
@@ -4071,6 +4072,16 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
/* Allocate our own private page. */
if (unlikely(anon_vma_prepare(vma)))
goto oom;
+
+ if (allow_mcpage(vma, vmf->address, MCPAGE_ORDER)) {
+ ret = do_anonymous_mcpages(vmf, MCPAGE_ORDER);
+
+ if (!(ret & VM_FAULT_FALLBACK))
+ return ret;
+
+ ret = 0;
+ }
+
page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
if (!page)
goto oom;
@@ -2251,6 +2251,57 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
}
EXPORT_SYMBOL(vma_alloc_folio);
+/**
+ * alloc_mcpages - Allocate a mcpage for a VMA.
+ * @gfp: GFP flags.
+ * @order: Order of the mcpage.
+ * @vma: Pointer to VMA or NULL if not available.
+ * @addr: Virtual address of the allocation. Must be inside @vma.
+ *
+ * Allocate a mcpage for a specific address in @vma, using the
+ * appropriate NUMA policy. When @vma is not NULL the caller must hold the
+ * mmap_lock of the mm_struct of the VMA to prevent it from going away.
+ * Should be used for all allocations for pages that will be mapped into
+ * user space.
+ *
+ * Return: The page on success or NULL if allocation fails.
+ */
+struct page *alloc_mcpages(gfp_t gfp, int order, struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ struct mempolicy *pol;
+ int node = numa_node_id();
+ struct page *page;
+ int preferred_nid;
+ nodemask_t *nmask;
+
+ pol = get_vma_policy(vma, addr);
+
+ if (pol->mode == MPOL_INTERLEAVE) {
+ unsigned int nid;
+
+ nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order); /* interleave at whole-mcpage granularity */
+ mpol_cond_put(pol);
+ page = alloc_page_interleave(gfp, order, nid);
+ goto out;
+ }
+
+ if (pol->mode == MPOL_PREFERRED_MANY) {
+ node = policy_node(gfp, pol, node);
+ page = alloc_pages_preferred_many(gfp, order, node, pol);
+ mpol_cond_put(pol);
+ goto out;
+ }
+
+ nmask = policy_nodemask(gfp, pol); /* remaining policies: preferred node + optional nodemask */
+ preferred_nid = policy_node(gfp, pol, node);
+ page = __alloc_pages(gfp, order, preferred_nid, nmask);
+ mpol_cond_put(pol);
+out:
+ return page;
+}
+EXPORT_SYMBOL(alloc_mcpages);
+
/**
* alloc_pages - Allocate pages.
* @gfp: GFP flags.
If mcpage is in the range of the VMA, try to allocate a mcpage and set it up for anonymous mapping. Try our best to populate all the surrounding page table entries. The benefit is that the page fault number will be reduced. Split the mcpage to allow each sub-page to be managed as a normal 4K page. Doing the split before setting up page table entries avoids the complicated page lock, mapcount and refcount handling. It's expected that the change will impact the memory consumption, page fault number, zone lock and lru lock directly. The memory consumption and system performance impact are evaluated as follows. Some system performance data were collected with 16K mcpage size: =============================================================================== v6.1-rc4-no-thp v6.1-rc4-thp mcpage will-it-scale/malloc1 (higher is better) 100% 2% 17% will-it-scale/page_fault1 (higher is better) 100% 238% 115% redis.set_avg_throughput (higher is better) 100% 99% 102% redis.get_avg_throughput (higher is better) 100% 99% 100% kernel build (lower is better) 100% 98% 97% * v6.1-rc4-no-thp: 6.1-rc4 with THP disabled in Kconfig * v6.1-rc4-thp: 6.1-rc4 with THP enabled as always in Kconfig * mcpage: 6.1-rc4 + 16KB mcpage The test results are normalized to config "v6.1-rc4-no-thp" The perf data between v6.1-rc4-no-thp and mcpage are collected: For kernel build, perf showed a 56% minor_page_fault drop and a 1.3% clear_page increase: v6.1-rc4-no-thp mcpage 5.939e+08 -56.0% 2.61e+08 kbuild.time.minor_page_faults 0.00 +2.2 2.20 perf-profile.calltrace.cycles-pp.clear_page_erms.get_page_from_freelist.__alloc_pages.alloc_mcpages.do_anonymous_mcpages 0.72 -0.7 0.00 perf-profile.calltrace.cycles-pp.clear_page_erms.get_page_from_freelist.__alloc_pages.vma_alloc_folio.do_anonymous_page For redis, perf showed a 74.6% minor_page_fault drop and a 0.11% zone lock drop. 
v6.1-rc4-no-thp mcpage 401414 -74.6% 102134 redis.time.minor_page_faults 0.00 +0.1 0.11 perf-profile.calltrace.cycles-pp.rmqueue.get_page_from_freelist.__alloc_pages.alloc_mcpages.do_anonymous_mcpages 0.22 -0.2 0.00 perf-profile.calltrace.cycles-pp.rmqueue_bulk.rmqueue.get_page_from_freelist.__alloc_pages.vma_alloc_folio For will-it-scale/page_fault1, perf showed a 12.8% minor_page_fault drop, a 15.97% zone lock drop and a 27% lru lock increase. v6.1-rc4-no-thp mcpage 7239 -12.8% 6312 will-it-scale.time.minor_page_faults 52.15 -34.4 17.75 perf-profile.calltrace.cycles-pp._raw_spin_lock.rmqueue_bulk.rmqueue.get_page_from_freelist.__alloc_pages 3.29 +27.0 30.29 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.tlb_batch_pages_flush 4.14 -4.1 0.00 perf-profile.calltrace.cycles-pp.clear_page_erms.get_page_from_freelist.__alloc_pages.vma_alloc_folio.do_anonymous_page 0.00 +13.2 13.20 perf-profile.calltrace.cycles-pp.clear_page_erms.get_page_from_freelist.__alloc_pages.alloc_mcpages.do_anonymous_mcpages 0.00 +18.4 18.43 perf-profile.calltrace.cycles-pp.rmqueue_bulk.rmqueue.get_page_from_freelist.__alloc_pages.alloc_mcpages For will-it-scale/malloc1, the test result is surprising. The regression is much bigger than expected. perf showed a 12.3% minor_page_fault drop and a 43.6% zone lock increase: v6.1-rc4-no-thp mcpage 2978027 -82.2% 530847 will-it-scale.128.processes 7249 -12.3% 6360 will-it-scale.time.minor_page_faults 0.00 +43.6 43.62 perf-profile.calltrace.cycles-pp.rmqueue.get_page_from_freelist.__alloc_pages.pte_alloc_one.__pte_alloc 0.00 +45.4 45.39 perf-profile.calltrace.cycles-pp._raw_spin_lock.free_pcppages_bulk.free_unref_page_list.release_pages.tlb_batch_pages_flush It turned out the mcpage allocation/free pattern hit a corner case (high zone lock contention is triggered, which impacts pte_alloc) which the current pcp list bulk free can't handle very well. 
Will address the pcp list bulk free issue separately. After fixing the pcp list bulk-free corner case, the result of will-it-scale/malloc1 is restored to 56% of v6.1-rc4-no-thp. =============================================================================== For tail latency of page allocation, use the following testing setup: - alloc_page() with order 0, 2 and 9 are called 2097152, 2097152 and 32768 times in kernel - non-fragmented and fragmented entire memory - w/o __GFP_ZERO flag to identify pure compaction latency and user visible latency And the result is as follows: no page zeroing: 4K page: none fragment: fragment: Number of test: 2097152 Number of test: 2097152 max latency: 26us max latency: 27us 90% tail latency: 1us (1887436th) 90% tail latency: 1us (1887436th) 95% tail latency: 1us (1992294th) 95% tail latency: 1us (1992294th) 99% tail latency: 2us (2076180th) 99% tail latency: 3us (2076180th) 16K mcpage none fragment: fragment: Number of test: 2097152 Number of test: 2097152 max latency: 26us max latency: 9862us 90% tail latency: 1us (1887436th) 90% tail latency: 1us (1887436th) 95% tail latency: 1us (1992294th) 95% tail latency: 1us (1992294th) 99% tail latency: 1us (2076180th) 99% tail latency: 3us (2076180th) 2M THP: none fragment: fragment: Number of test: 32768 Number of test: 32768 max latency: 40us max latency: 12149us 90% tail latency: 8us (29491th) 90% tail latency: 864us (29491th) 95% tail latency: 10us (31129th) 95% tail latency: 943us (31129th) 99% tail latency: 13us (32440th) 99% tail latency: 1067us (32440th) page zeroing: 4K page: none fragment: fragment: Number of test: 2097152 Number of test: 2097152 max latency: 18us max latency: 46us 90% tail latency: 1us (1887436th) 90% tail latency: 1us (1887436th) 95% tail latency: 1us (1992294th) 95% tail latency: 1us (1992294th) 99% tail latency: 2us (2076180th) 99% tail latency: 4us (2076180th) 16K mcpage none fragment: fragment: Number of test: 2097152 Number of test: 2097152 max latency: 31us max latency: 
5740us 90% tail latency: 3us (1887436th) 90% tail latency: 3us (1887436th) 95% tail latency: 3us (1992294th) 95% tail latency: 4us (1992294th) 99% tail latency: 4us (2076180th) 99% tail latency: 5us (2076180th) 2M THP: none fragment: fragment: Number of test: 32768 Number of test: 32768 max latency: 530us max latency: 10494us 90% tail latency: 366us (29491th) 90% tail latency: 1114us (29491th) 95% tail latency: 373us (31129th) 95% tail latency: 1263us (31129th) 99% tail latency: 391us (32440th) 99% tail latency: 1808us (32440th) With 16K mcpage, the tail latency for page allocation is good while 2M THP has a much worse result in the memory-fragmented case. =============================================================================== For the performance of NUMA interleaving on base page, mcpage and THP, memory latency from https://github.com/torvalds/test-tlb is used. On a Cascade Lake box with 96 cores + 258G memory with two NUMA nodes: node distances: node 0 1 0: 10 20 1: 20 10 With memory policy set to MPOL_INTERLEAVE and 1G memory mapping with 128 bytes (2X cache line) stride, the memory access latency (less is better): random access with 4K page: 142.32 ns random access with 16K mcpage: 141.21 ns (+0.8%) random access with 2M THP: 116.56 ns (+18.2%) sequential access with 4K page: 21.28 ns sequential access with 16K mcpage: 20.52 ns (+3.6%) sequential access with 2M THP: 20.36 ns (+4.3%) mcpage brings a minor memory access latency improvement compared to 4K page, but less than the improvement brought by 2M THP. =============================================================================== The memory consumption is checked by using firefox to access the "www.lwn.net" website and collecting the RSS of firefox with 16K mcpage size: 6.1-rc7: RSS of firefox is 285300 KB 6.1-rc7 + 16K mcpage: RSS of firefox is 295536 KB 3.59% more memory consumption with 16K mcpage. 
=============================================================================== In this RFC patch, the non-batched update to page table entries is used to show the idea. Batch mode will be chosen when making this an official patch in the future. Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> --- include/linux/gfp.h | 5 ++ include/linux/mcpage_mm.h | 35 ++++++++++ mm/Makefile | 1 + mm/mcpage_memory.c | 134 ++++++++++++++++++++++++++++++++++++++ mm/memory.c | 11 ++++ mm/mempolicy.c | 51 +++++++++++++++ 6 files changed, 237 insertions(+) create mode 100644 include/linux/mcpage_mm.h create mode 100644 mm/mcpage_memory.c