From patchwork Wed Jul 13 08:39:50 2022
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 12916278
From: Huang Ying
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying,
 Johannes Weiner, Michal Hocko, Rik van Riel, Mel Gorman,
 Peter Zijlstra, Dave Hansen, Yang Shi, Zi Yan, Wei Xu, osalvador,
 Shakeel Butt, Baolin Wang, Zhong Jiang
Subject: [PATCH -V4 RESEND 0/3] memory tiering: hot page selection
Date: Wed, 13 Jul 2022 16:39:50 +0800
Message-Id: <20220713083954.34196-1-ying.huang@intel.com>
To optimize page placement in a memory tiering system with NUMA
balancing, the hot pages in the slow memory nodes need to be
identified.  Essentially, the original NUMA balancing implementation
selects the most recently accessed (MRU) pages to promote.  But this
is not a reliable way to identify hot pages, because even pages with
quite low access frequency will be accessed eventually, given that
the NUMA balancing page table scanning period can be quite long
(e.g. 60 seconds).  So in this patchset, we implement a new hot page
identification algorithm based on the latency between NUMA balancing
page table scanning and the hint page fault, which is a kind of most
frequently accessed (MFU) algorithm; see the first sketch below.

In NUMA balancing memory tiering mode, if there are hot pages in the
slow memory nodes and cold pages in the fast memory nodes, hot and
cold pages need to be promoted and demoted between the fast and slow
memory nodes.  One choice is to promote/demote as fast as possible,
but the CPU cycles and memory bandwidth consumed by a high
promotion/demotion throughput will hurt the latency of some
workloads, because of memory access latency inflation and contention
on the slow memory bandwidth.  A way to resolve this issue is to
restrict the maximum promotion/demotion throughput: the
promotion/demotion will take longer to finish, but the workload
latency will be better.  This is implemented in this patchset as the
page promotion rate limit mechanism.

The promotion hot threshold is workload and system configuration
dependent, so this patchset also implements a method to adjust the
hot threshold automatically.  The basic idea is to control the number
of candidate promotion pages so that it matches the promotion rate
limit; see the second sketch below.
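To make the scan-to-fault latency criterion concrete, here is a
minimal user-space C sketch of the idea.  It is illustrative only,
not the kernel code; all names (record_scan_time(),
is_hot_on_hint_fault()) and the 1000 ms default are hypothetical:

  #include <stdbool.h>
  #include <stdint.h>

  struct page_info {
          uint64_t scan_time_ms;  /* when the scanner last unmapped the page */
  };

  /* Hot threshold: assumed default; tunable and auto-adjusted below. */
  static uint64_t hot_threshold_ms = 1000;

  /* Called from the page table scanner when it makes the page
   * inaccessible to arm the hint page fault. */
  void record_scan_time(struct page_info *pi, uint64_t now_ms)
  {
          pi->scan_time_ms = now_ms;
  }

  /* Called from the hint page fault handler.  A short scan-to-fault
   * latency means the page was touched soon after scanning, i.e. it
   * is accessed frequently and is worth promoting. */
  bool is_hot_on_hint_fault(const struct page_info *pi, uint64_t now_ms)
  {
          return now_ms - pi->scan_time_ms <= hot_threshold_ms;
  }

Unlike an MRU criterion, which with a long scanning period eventually
selects almost any page, this check only passes pages that are
accessed shortly after each scan.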
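Similarly, here is a minimal sketch of the promotion rate limit and
the automatic threshold adjustment.  Again, everything here is an
assumption for illustration: the one-second window, the 10%
adjustment step, and combining the two mechanisms in one function are
not the actual kernel policy:

  #include <stdbool.h>
  #include <stdint.h>

  static uint64_t hot_threshold_ms = 1000;  /* as in the first sketch */
  static uint64_t rate_limit_pages = 32768; /* assumed: ~128 MB/s of 4 KB pages */

  struct promo_window {
          uint64_t start_ms;    /* start of the current 1 s window */
          uint64_t candidates;  /* pages that passed the hot threshold */
          uint64_t promoted;    /* pages actually promoted */
  };

  /* Called for each page that passed the hot threshold; returns true
   * if promoting it now stays within the per-second budget. */
  bool promotion_allowed(struct promo_window *w, uint64_t now_ms)
  {
          if (now_ms - w->start_ms >= 1000) {
                  /* New window: first nudge the threshold so that the
                   * candidate count converges toward the rate limit. */
                  if (w->candidates > rate_limit_pages)
                          hot_threshold_ms -= hot_threshold_ms / 10;  /* stricter */
                  else if (w->candidates < rate_limit_pages)
                          hot_threshold_ms += hot_threshold_ms / 10;  /* looser */
                  w->start_ms = now_ms;
                  w->candidates = 0;
                  w->promoted = 0;
          }
          w->candidates++;
          if (w->promoted >= rate_limit_pages)
                  return false;  /* budget exhausted; skip for now */
          w->promoted++;
          return true;
  }

Capping w->promoted limits the instantaneous promotion bandwidth,
while steering hot_threshold_ms from the candidate count keeps the
hot page detector itself producing roughly rate_limit_pages
candidates per second, whatever the workload.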
We tested the patchset with the pmbench memory accessing benchmark on
a 2-socket server system with DRAM and PMEM installed.  The test
results are as follows:

                     pmbench score      promote rate
                      (accesses/s)            (MB/s)
                     -------------      ------------
  base                 146887704.1             725.6
  hot selection        165695601.2             544.0
  rate limit           162814569.8             165.2
  auto adjustment      170495294.0             136.9

From the results above, with the hot page selection patch [1/3], the
pmbench score increases by about 12.8%, and the promote rate
(overhead) decreases by about 25.0%, compared with the base kernel.
With the rate limit patch [2/3], the pmbench score decreases by about
1.7%, and the promote rate decreases by about 69.6%, compared with
the hot page selection patch.  With the threshold auto adjustment
patch [3/3], the pmbench score increases by about 4.7%, and the
promote rate decreases by about 17.1%, compared with the rate limit
patch.

Baolin helped to test the patchset with MySQL on a machine which
contains 1 DRAM node (30G) and 1 PMEM node (126G):

  sysbench /usr/share/sysbench/oltp_read_write.lua \
    ......
    --tables=200 \
    --table-size=1000000 \
    --report-interval=10 \
    --threads=16 \
    --time=120

The TPS can be improved by about 5%.

Changelogs:

v4:

- Rebased on v5.19-rc3.

- Collected reviewed-by and tested-by tags.

v3:

- Rebased on v5.19-rc1.

- Renamed newly-added fields in struct pglist_data.

v2:

- Added ABI document for the promote rate limit per Andrew's
  comments.  Thanks!

- Added function comments where necessary per Andrew's comments.

- Addressed other comments from Andrew Morton.

Best Regards,
Huang, Ying