From patchwork Sun Jul 7 09:49:53 2024
X-Patchwork-Submitter: Yafang Shao
X-Patchwork-Id: 13725989
From: Yafang Shao <laoar.shao@gmail.com>
To: akpm@linux-foundation.org
Cc: ying.huang@intel.com, mgorman@techsingularity.net, linux-mm@kvack.org,
    Yafang Shao <laoar.shao@gmail.com>
Subject: [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
Date: Sun, 7 Jul 2024 17:49:53 +0800
Message-Id: <20240707094956.94654-1-laoar.shao@gmail.com>
X-Mailer: git-send-email 2.30.1 (Apple Git-130)

Background
==========

In our containerized environment, we have a specific type of container
that runs 18 processes, each consuming approximately 6GB of RSS. These
are organized as separate processes rather than threads because the
Python Global Interpreter Lock (GIL) becomes a bottleneck in a
multi-threaded setup.

Upon the exit of these containers, other containers hosted on the same
machine experience significant latency spikes.

Investigation
=============

My investigation with perf tracing revealed that the root cause of these
spikes is the simultaneous execution of exit_mmap() by each of the
exiting processes. The resulting concurrent access to zone->lock turns
the lock into a contention hotspot and degrades performance. The perf
results clearly show this contention as the primary contributor to the
observed latency:

  +   77.02%     0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
  -   76.98%     0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
     - 76.97% exit_mmap
        - 58.58% unmap_vmas
           - 58.55% unmap_single_vma
              - unmap_page_range
                 - 58.32% zap_pte_range
                    - 42.88% tlb_flush_mmu
                       - 42.76% free_pages_and_swap_cache
                          - 41.22% release_pages
                             - 33.29% free_unref_page_list
                                - 32.37% free_unref_page_commit
                                   - 31.64% free_pcppages_bulk
                                      + 28.65% _raw_spin_lock
                                        1.28% __list_del_entry_valid
                             + 3.25% folio_lruvec_lock_irqsave
                             + 0.75% __mem_cgroup_uncharge_list
                               0.60% __mod_lruvec_state
                            1.07% free_swap_cache
                    + 11.69% page_remove_rmap
                      0.64% __mod_lruvec_page_state
        - 17.34% remove_vma
           - 17.25% vm_area_free
              - 17.23% kmem_cache_free
                 - 17.15% __slab_free
                    - 14.56% discard_slab
                         free_slab
                         __free_slab
                         __free_pages
                       - free_unref_page
                          - 13.50% free_unref_page_commit
                             - free_pcppages_bulk
                                + 13.44% _raw_spin_lock

By enabling the mm_page_pcpu_drain tracepoint, we can identify the pages
being drained; the majority of them are regular order-0 user pages.

  <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
  <...>-1540432 [224] d..3. 618048.023887:
   => free_pcppages_bulk
   => free_unref_page_commit
   => free_unref_page_list
   => release_pages
   => free_pages_and_swap_cache
   => tlb_flush_mmu
   => zap_pte_range
   => unmap_page_range
   => unmap_single_vma
   => unmap_vmas
   => exit_mmap
   => mmput
   => do_exit
   => do_group_exit
   => get_signal
   => arch_do_signal_or_restart
   => exit_to_user_mode_prepare
   => syscall_exit_to_user_mode
   => do_syscall_64
   => entry_SYSCALL_64_after_hwframe

The servers experiencing these issues have substantial hardware: 256
CPUs and 1TB of memory, all within a single NUMA node.
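
For reference, the mm_page_pcpu_drain events above come from the
kernel's tracepoint interface. Below is a minimal, illustrative sketch
of enabling the event and streaming the trace via tracefs; it assumes
tracefs is mounted at /sys/kernel/tracing and uses the global stacktrace
option to obtain call stacks like the one shown above. It is an example
only, not part of this series:

/*
 * Illustrative only: enable the kmem:mm_page_pcpu_drain tracepoint and
 * stream the resulting events, assuming tracefs is mounted at
 * /sys/kernel/tracing.
 */
#include <stdio.h>
#include <stdlib.h>

static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(EXIT_FAILURE);
	}
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	char line[4096];
	FILE *trace;

	/* Enable the per-page drain event and per-event call stacks. */
	write_str("/sys/kernel/tracing/events/kmem/mm_page_pcpu_drain/enable", "1");
	write_str("/sys/kernel/tracing/options/stacktrace", "1");

	/* Stream the trace; each event line corresponds to one drained page. */
	trace = fopen("/sys/kernel/tracing/trace_pipe", "r");
	if (!trace) {
		perror("trace_pipe");
		return EXIT_FAILURE;
	}
	while (fgets(line, sizeof(line), trace))
		fputs(line, stdout);
	fclose(trace);
	return 0;
}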

The zoneinfo is as follows:

Node 0, zone   Normal
  pages free     144465775
        boost    0
        min      1309270
        low      1636587
        high     1963904
        spanned  564133888
        present  296747008
        managed  291974346
        cma      0
        protection: (0, 0, 0, 0)
  ...
  pagesets
    cpu: 0
              count: 2217
              high:  6392
              batch: 63
  vm stats threshold: 125
    cpu: 1
              count: 4510
              high:  6392
              batch: 63
  vm stats threshold: 125
    cpu: 2
              count: 3059
              high:  6392
              batch: 63
  ...

The pcp high (6392) is around 100 times the batch size (63).

I also traced the latency of the free_pcppages_bulk() function during
the container exit:

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 148      |*****************                       |
       512 -> 1023       : 334      |****************************************|
      1024 -> 2047       : 33       |***                                     |
      2048 -> 4095       : 5        |                                        |
      4096 -> 8191       : 7        |                                        |
      8192 -> 16383      : 12       |*                                       |
     16384 -> 32767      : 30       |***                                     |
     32768 -> 65535      : 21       |**                                      |
     65536 -> 131071     : 15       |*                                       |
    131072 -> 262143     : 27       |***                                     |
    262144 -> 524287     : 84       |**********                              |
    524288 -> 1048575    : 203      |************************                |
   1048576 -> 2097151    : 284      |**********************************      |
   2097152 -> 4194303    : 327      |***************************************|
   4194304 -> 8388607    : 215      |*************************               |
   8388608 -> 16777215   : 116      |*************                           |
  16777216 -> 33554431   : 47       |*****                                   |
  33554432 -> 67108863   : 8        |                                        |
  67108864 -> 134217727  : 3        |                                        |

The latency can reach tens of milliseconds.

Experimenting
=============

vm.percpu_pagelist_high_fraction
--------------------------------

The kernel currently deployed in our production environment is stable
6.1.y, so my initial strategy was to tune the
vm.percpu_pagelist_high_fraction parameter. Increasing
vm.percpu_pagelist_high_fraction reduces the batch size used when
draining pages, which in turn substantially reduces the latency. After
setting the sysctl to 0x7fffffff, I observed a notable improvement:

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 120      |                                        |
       256 -> 511        : 365      |*                                       |
       512 -> 1023       : 201      |                                        |
      1024 -> 2047       : 103      |                                        |
      2048 -> 4095       : 84       |                                        |
      4096 -> 8191       : 87       |                                        |
      8192 -> 16383      : 4777     |**************                          |
     16384 -> 32767      : 10572    |*******************************         |
     32768 -> 65535      : 13544    |****************************************|
     65536 -> 131071     : 12723    |*************************************   |
    131072 -> 262143     : 8604     |*************************               |
    262144 -> 524287     : 3659     |**********                              |
    524288 -> 1048575    : 921      |**                                      |
   1048576 -> 2097151    : 122      |                                        |
   2097152 -> 4194303    : 5        |                                        |

However, raising vm.percpu_pagelist_high_fraction also shrinks the pcp
high watermark, down to a minimum of four times the batch size. While
this could in theory affect throughput, as highlighted by Ying[0], we
have yet to observe any significant throughput difference in our
production environment after making this change.

Backporting the series "mm: PCP high auto-tuning"
-------------------------------------------------

My second attempt was to backport the series "mm: PCP high
auto-tuning"[1], which comprises nine patches, onto our 6.1.y stable
kernel. After deploying it in our production environment, I observed a
pronounced reduction in latency.

The observed results are shown below:

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 2        |                                        |
      2048 -> 4095       : 11       |                                        |
      4096 -> 8191       : 3        |                                        |
      8192 -> 16383      : 1        |                                        |
     16384 -> 32767      : 2        |                                        |
     32768 -> 65535      : 7        |                                        |
     65536 -> 131071     : 198      |*********                               |
    131072 -> 262143     : 530      |************************                |
    262144 -> 524287     : 824      |**************************************  |
    524288 -> 1048575    : 852      |****************************************|
   1048576 -> 2097151    : 714      |*********************************       |
   2097152 -> 4194303    : 389      |******************                      |
   4194304 -> 8388607    : 143      |******                                  |
   8388608 -> 16777215   : 29       |*                                       |
  16777216 -> 33554431   : 1        |                                        |

Compared to the previous data, the maximum latency has been reduced to
less than 30ms.

Adjusting CONFIG_PCP_BATCH_SCALE_MAX
------------------------------------

As Ying suggested, lowering CONFIG_PCP_BATCH_SCALE_MAX can reduce the
PCP batch size without shrinking the PCP high watermark, which should
mitigate the latency spikes without hurting throughput. My third attempt
therefore focused on this configuration. To make it easier to adjust, I
replaced CONFIG_PCP_BATCH_SCALE_MAX with a new sysctl knob named
vm.pcp_batch_scale_max. Tuning vm.pcp_batch_scale_max down from its
default value of 5 to 0 further reduced the maximum latency, to below
2ms:

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 36       |                                        |
      2048 -> 4095       : 5063     |*****                                   |
      4096 -> 8191       : 31226    |********************************        |
      8192 -> 16383      : 37606    |*************************************** |
     16384 -> 32767      : 38359    |****************************************|
     32768 -> 65535      : 30652    |*******************************         |
     65536 -> 131071     : 18714    |*******************                     |
    131072 -> 262143     : 7968     |********                                |
    262144 -> 524287     : 1996     |**                                      |
    524288 -> 1048575    : 302      |                                        |
   1048576 -> 2097151    : 19       |                                        |

Across multiple trials, I observed no significant differences between
runs. (A rough illustration of the per-lock-hold batch sizes behind
these numbers is given just before the link list below.)

The Proposal
============

This series contains two minor refinements to the PCP high watermark
auto-tuning mechanism, along with a new sysctl knob that serves as a
more practical alternative to the previous Kconfig-based method.

Future improvement to zone->lock
================================

Several suggestions have been made for ultimately mitigating the
zone->lock contention itself. One approach is to divide large zones into
multiple smaller zones, as suggested by Matthew[2]; another is to split
zone->lock using a mechanism similar to memory arenas and to stop
relying solely on zone_id to identify the range of free lists a
particular page belongs to[3]. However, implementing these solutions
will likely require a considerably longer development effort.
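
As a rough illustration of why lowering vm.pcp_batch_scale_max helps (a
back-of-the-envelope view based on my reading of the current batching
code, not an exact description of it): the number of pages that can be
freed while zone->lock is held in a single pass scales with the PCP
batch shifted left by the scale factor. With the batch of 63 shown in
the zoneinfo above:

    63 << 5 = 2016 pages per zone->lock hold at the default scale of 5
    63 << 0 =   63 pages per zone->lock hold at a scale of 0

which is consistent with the large reductions in maximum latency
measured above.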

Link: https://lore.kernel.org/linux-mm/874j98noth.fsf@yhuang6-desk2.ccr.corp.intel.com/ [0]
Link: https://lore.kernel.org/all/20231016053002.756205-1-ying.huang@intel.com/ [1]
Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@casper.infradead.org/ [2]
Link: https://lore.kernel.org/linux-mm/20240705130943.htsyhhhzbcptnkcu@techsingularity.net/ [3]

Changes:
- mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the
  minimum pagelist
  https://lore.kernel.org/linux-mm/20240701142046.6050-1-laoar.shao@gmail.com/

Yafang Shao (3):
  mm/page_alloc: A minor fix to the calculation of pcp->free_count
  mm/page_alloc: Avoid changing pcp->high decaying when adjusting
    CONFIG_PCP_BATCH_SCALE_MAX
  mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max

 Documentation/admin-guide/sysctl/vm.rst | 15 ++++++++++
 include/linux/sysctl.h                  |  1 +
 kernel/sysctl.c                         |  2 +-
 mm/Kconfig                              | 11 -------
 mm/page_alloc.c                         | 38 ++++++++++++++++++-------
 5 files changed, 45 insertions(+), 22 deletions(-)
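
For completeness, here is an illustrative sketch of driving the proposed
knob from user space, assuming the conventional procfs mapping of
sysctls (/proc/sys/vm/pcp_batch_scale_max, equivalent to running
"sysctl -w vm.pcp_batch_scale_max=N"); it is an example only, not part
of the series:

/*
 * Illustrative only: read the proposed vm.pcp_batch_scale_max knob and
 * optionally set a new value, assuming the usual /proc/sys/vm/ mapping.
 */
#include <stdio.h>
#include <stdlib.h>

#define KNOB "/proc/sys/vm/pcp_batch_scale_max"

int main(int argc, char **argv)
{
	int val;
	FILE *f = fopen(KNOB, "r");

	if (!f) {
		perror(KNOB);
		return EXIT_FAILURE;
	}
	if (fscanf(f, "%d", &val) == 1)
		printf("pcp_batch_scale_max = %d\n", val);	/* default: 5 */
	fclose(f);

	if (argc > 1) {		/* e.g. "0" to minimize the PCP free batch */
		f = fopen(KNOB, "w");
		if (!f) {
			perror(KNOB);
			return EXIT_FAILURE;
		}
		fprintf(f, "%s\n", argv[1]);
		fclose(f);
	}
	return 0;
}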