From patchwork Wed Sep 25 13:47:32 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Adrian Huang
X-Patchwork-Id: 13812028
From: Adrian Huang
To: Andrey Ryabinin, Alexander Potapenko, Andrey Konovalov,
 Dmitry Vyukov, Vincenzo Frascino, Andrew Morton, Uladzislau Rezki
Cc: kasan-dev@googlegroups.com, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Adrian Huang
Subject: [PATCH 1/1] kasan, vmalloc: avoid lock contention when depopulating vmalloc
Date: Wed, 25 Sep 2024 21:47:32 +0800
Message-Id: <20240925134732.24431-1-ahuang12@lenovo.com>
X-Mailer: git-send-email 2.25.1

From: Adrian Huang

When running the test_vmalloc stress test on a 448-core server, the
following soft/hard lockups were observed and the OS eventually
panicked.

1) Kernel config
   CONFIG_KASAN=y
   CONFIG_KASAN_VMALLOC=y

2) Reproduction command
   # modprobe test_vmalloc nr_threads=448 run_test_mask=0x1 nr_pages=8

3) OS log (details in [1]):
   watchdog: BUG: soft lockup - CPU#258 stuck for 26s!
   RIP: 0010:native_queued_spin_lock_slowpath+0x504/0x940
   Call Trace:
    do_raw_spin_lock+0x1e7/0x270
    _raw_spin_lock+0x63/0x80
    kasan_depopulate_vmalloc_pte+0x3c/0x70
    apply_to_pte_range+0x127/0x4e0
    apply_to_pmd_range+0x19e/0x5c0
    apply_to_pud_range+0x167/0x510
    __apply_to_page_range+0x2b4/0x7c0
    kasan_release_vmalloc+0xc8/0xd0
    purge_vmap_node+0x190/0x980
    __purge_vmap_area_lazy+0x640/0xa60
    drain_vmap_area_work+0x23/0x30
    process_one_work+0x84a/0x1760
    worker_thread+0x54d/0xc60
    kthread+0x2a8/0x380
    ret_from_fork+0x2d/0x70
    ret_from_fork_asm+0x1a/0x30
   ...
   watchdog: Watchdog detected hard LOCKUP on cpu 8
   watchdog: Watchdog detected hard LOCKUP on cpu 42
   watchdog: Watchdog detected hard LOCKUP on cpu 10
   ...
   Shutting down cpus with NMI
   Kernel Offset: disabled
   pstore: backend (erst) writing error (-28)
   ---[ end Kernel panic - not syncing: Hard LOCKUP ]---

The issue can also be reproduced on a 192-core server and on a
256-core server.

[Root Cause]
The tight loop in kasan_release_vmalloc_node() iteratively calls
kasan_release_vmalloc() to clear the corresponding PTE, and each call
acquires and releases "init_mm.page_table_lock" in
kasan_depopulate_vmalloc_pte().

lock_stat shows that "init_mm.page_table_lock" is the most contended
lock. The lock_stat output below was collected with the following
command (a lighter load, chosen so that the OS does not panic); the
max wait time is 600ms:

  # modprobe test_vmalloc nr_threads=150 run_test_mask=0x1 nr_pages=8

  ------------------------------------------------------------------
  class name                con-bounces  contentions  waittime-min  waittime-max ...
  ------------------------------------------------------------------
  init_mm.page_table_lock:     87859653     93020601          0.27     600304.90 ...
    -----------------------
    init_mm.page_table_lock    54332301  [<000000008ce229be>] kasan_populate_vmalloc_pte.part.0.isra.0+0x99/0x120
    init_mm.page_table_lock     6680902  [<000000009c0800ad>] __pte_alloc_kernel+0x9b/0x370
    init_mm.page_table_lock    31991077  [<00000000180bc35d>] kasan_depopulate_vmalloc_pte+0x3c/0x70
    init_mm.page_table_lock       16321  [<000000003ef0e79b>] __pmd_alloc+0x1d5/0x720
    -----------------------
    init_mm.page_table_lock    50278552  [<000000008ce229be>] kasan_populate_vmalloc_pte.part.0.isra.0+0x99/0x120
    init_mm.page_table_lock     5725380  [<000000009c0800ad>] __pte_alloc_kernel+0x9b/0x370
    init_mm.page_table_lock    36992410  [<00000000180bc35d>] kasan_depopulate_vmalloc_pte+0x3c/0x70
    init_mm.page_table_lock       24259  [<000000003ef0e79b>] __pmd_alloc+0x1d5/0x720
  ...

[Solution]
After revisiting the code paths that set the kasan ptep (pte pointer),
it is unlikely that a kasan ptep is set and cleared simultaneously by
different CPUs. So, use ptep_get_and_clear() to read and clear the
entry in one step, which removes the spinlock from the depopulate
path.

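The idea behind the change can be illustrated outside the kernel. The
following is a minimal user-space sketch, not kernel code: fake_pte_t,
fake_ptep_get_and_clear(), fake_depopulate() and
demo_double_depopulate() are hypothetical names, and the sketch
assumes that get-and-clear behaves as a single atomic exchange, which
this patch does not state for every architecture's
ptep_get_and_clear(). Under that assumption, only the caller that
receives the non-empty old value frees the page, so no lock is needed
to avoid a double free.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a PTE: 0 plays the role of pte_none(). */
typedef _Atomic uintptr_t fake_pte_t;

/*
 * Hypothetical analogue of ptep_get_and_clear(): fetch the old entry and
 * store 0 in one atomic step, so at most one caller sees a given value.
 */
static uintptr_t fake_ptep_get_and_clear(fake_pte_t *ptep)
{
	return atomic_exchange(ptep, (uintptr_t)0);
}

/*
 * Analogue of the patched depopulate callback. Returns 1 if this call
 * freed the backing page, 0 if the entry was already empty.
 */
static int fake_depopulate(fake_pte_t *ptep)
{
	uintptr_t orig = fake_ptep_get_and_clear(ptep);

	if (orig != 0) {
		free((void *)orig);
		return 1;
	}
	return 0;
}

/*
 * Populate one entry, depopulate it twice, and report how many of the
 * two calls actually performed a free.
 */
static int demo_double_depopulate(void)
{
	fake_pte_t pte;
	void *page = malloc(4096);

	assert(page != NULL);
	atomic_init(&pte, (uintptr_t)page);

	return fake_depopulate(&pte) + fake_depopulate(&pte);
}
```

Calling demo_double_depopulate() depopulates the same entry twice; the
second call observes an empty entry and skips the free.
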
With this change, the max wait time drops to 13ms with the following
command (all 448 cores fully stressed):

  # modprobe test_vmalloc nr_threads=448 run_test_mask=0x1 nr_pages=8

  ------------------------------------------------------------------
  class name                con-bounces  contentions  waittime-min  waittime-max ...
  ------------------------------------------------------------------
  init_mm.page_table_lock:    109999304    110008477          0.27      13534.76
    -----------------------
    init_mm.page_table_lock   109369156  [<000000001a135943>] kasan_populate_vmalloc_pte.part.0.isra.0+0x99/0x120
    init_mm.page_table_lock      637661  [<0000000051481d84>] __pte_alloc_kernel+0x9b/0x370
    init_mm.page_table_lock        1660  [<00000000a492cdc5>] __pmd_alloc+0x1d5/0x720
    -----------------------
    init_mm.page_table_lock   109410237  [<000000001a135943>] kasan_populate_vmalloc_pte.part.0.isra.0+0x99/0x120
    init_mm.page_table_lock      595016  [<0000000051481d84>] __pte_alloc_kernel+0x9b/0x370
    init_mm.page_table_lock        3224  [<00000000a492cdc5>] __pmd_alloc+0x1d5/0x720

[More verifications on a 448-core server: Passed]
1) test_vmalloc module
   * Each test is run sequentially.

2) stress-ng
   * fork() and exit()
     # stress-ng --fork 448 --timeout 180
   * pthread
     # stress-ng --pthread 448 --timeout 180
   * fork()/exit() and pthread
     # stress-ng --pthread 448 --fork 448 --timeout 180

The above verifications were run repeatedly for more than 24 hours.

[1] https://gist.github.com/AdrianHuang/99d12986a465cc33a38c7a7ceeb6f507

Signed-off-by: Adrian Huang
---
 mm/kasan/shadow.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index 88d1c9dcb507..985356811aee 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -397,17 +397,13 @@ int kasan_populate_vmalloc(unsigned long addr, unsigned long size)
 static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
 					void *unused)
 {
+	pte_t orig_pte = ptep_get_and_clear(&init_mm, addr, ptep);
 	unsigned long page;
 
-	page = (unsigned long)__va(pte_pfn(ptep_get(ptep)) << PAGE_SHIFT);
-
-	spin_lock(&init_mm.page_table_lock);
-
-	if (likely(!pte_none(ptep_get(ptep)))) {
-		pte_clear(&init_mm, addr, ptep);
+	if (likely(!pte_none(orig_pte))) {
+		page = (unsigned long)__va(pte_pfn(orig_pte) << PAGE_SHIFT);
 		free_page(page);
 	}
-	spin_unlock(&init_mm.page_table_lock);
 
 	return 0;
 }