From patchwork Sun Oct 27 12:07:46 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lance Yang
X-Patchwork-Id: 13852509
From: Lance Yang
To: akpm@linux-foundation.org
Cc: dj456119@gmail.com, cunhuang@tencent.com, leonylgao@tencent.com,
 j.granados@samsung.com, jsiddle@redhat.com, kent.overstreet@linux.dev,
 21cnbao@gmail.com, ryan.roberts@arm.com, david@redhat.com, ziy@nvidia.com,
 libang.li@antgroup.com, baolin.wang@linux.alibaba.com,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, joel.granados@kernel.org,
 linux@weissschuh.net, Lance Yang, Mingzhe Yang
Subject: [PATCH v2 1/2] hung_task: add detect count for hung tasks
Date: Sun, 27 Oct 2024 20:07:46 +0800
Message-ID: <20241027120747.42833-2-ioworker0@gmail.com>
X-Mailer: git-send-email 2.45.2
In-Reply-To: <20241027120747.42833-1-ioworker0@gmail.com>
References: <20241027120747.42833-1-ioworker0@gmail.com>

This commit adds a counter, hung_task_detect_count, to track the number of
times hung tasks are detected.

IMHO, hung tasks are a critical metric.
Currently, we detect them by periodically parsing dmesg. However, this
method isn't as user-friendly as using a counter.

Sometimes, a short-lived issue with a NIC or hard drive can quickly drive
hung_task_warnings down to zero. Once the warnings are exhausted, we must
log in to the node directly to confirm that there are no more hung tasks
and that the system has recovered. After all, load average alone cannot
provide a clear picture.

Once this counter is in place, we plan to set hung_task_timeout_secs to a
lower value in high-density deployments to improve stability, even though
this might result in false positives. We can then set a time-based
threshold: if hung tasks persist beyond this duration, we will
automatically migrate containers to other nodes. Based on past experience,
this approach could help avoid many production disruptions.

Moreover, other important events such as OOM already have counters, so
having a dedicated counter for hung tasks makes sense.

Signed-off-by: Mingzhe Yang
Signed-off-by: Lance Yang
---
 kernel/hung_task.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 959d99583d1c..229ff3d4e501 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -30,6 +30,11 @@
  */
 static int __read_mostly sysctl_hung_task_check_count = PID_MAX_LIMIT;
 
+/*
+ * Total number of tasks detected as hung since boot:
+ */
+static unsigned long __read_mostly sysctl_hung_task_detect_count;
+
 /*
  * Limit number of tasks checked in a batch.
  *
@@ -115,6 +120,12 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
 	if (time_is_after_jiffies(t->last_switch_time + timeout * HZ))
 		return;
 
+	/*
+	 * This counter tracks the total number of tasks detected as hung
+	 * since boot.
+	 */
+	sysctl_hung_task_detect_count++;
+
 	trace_sched_process_hang(t);
 
 	if (sysctl_hung_task_panic) {
@@ -314,6 +325,13 @@ static struct ctl_table hung_task_sysctls[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_NEG_ONE,
 	},
+	{
+		.procname	= "hung_task_detect_count",
+		.data		= &sysctl_hung_task_detect_count,
+		.maxlen		= sizeof(unsigned long),
+		.mode		= 0444,
+		.proc_handler	= proc_doulongvec_minmax,
+	},
 };
 
 static void __init hung_task_sysctl_init(void)
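
The monitoring plan described in the commit message (poll the counter, then act once hung tasks have persisted beyond a time threshold) could look roughly like the userspace sketch below. It is only an illustration and is not part of this patch. The path follows from the sysctl entry's procname, while struct hung_monitor, the helper names, and the threshold handling are hypothetical. It also assumes the counter keeps growing on each khungtaskd pass while a task stays hung.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Location of the read-only sysctl added by this patch. */
#define DETECT_COUNT_PATH "/proc/sys/kernel/hung_task_detect_count"

/* Hypothetical monitor state; these names do not exist in the kernel. */
struct hung_monitor {
	unsigned long last_count;	/* counter value at the previous sample */
	long hung_since;		/* start of the current streak, -1 if none */
};

/* Parse the decimal text returned by the sysctl file; 0 on success. */
int parse_detect_count(const char *text, unsigned long *out)
{
	char *end;
	unsigned long v;

	errno = 0;
	v = strtoul(text, &end, 10);
	if (errno || end == text)
		return -1;
	*out = v;
	return 0;
}

/* Read the live counter; -1 if the file is missing or unreadable. */
int read_detect_count(unsigned long *out)
{
	char buf[32];
	int ret = -1;
	FILE *f = fopen(DETECT_COUNT_PATH, "r");

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		ret = parse_detect_count(buf, out);
	fclose(f);
	return ret;
}

/*
 * Feed one sample taken at time 'now' (seconds). Returns true once the
 * counter has grown on every consecutive sample for at least
 * 'threshold' seconds. A sample without growth ends the streak.
 */
bool hung_monitor_sample(struct hung_monitor *m, unsigned long count,
			 long now, long threshold)
{
	bool grew = count != m->last_count;

	m->last_count = count;
	if (!grew) {
		m->hung_since = -1;
		return false;
	}
	if (m->hung_since < 0)
		m->hung_since = now;
	return now - m->hung_since >= threshold;
}
```

A caller would seed last_count from a first read_detect_count() call, set hung_since to -1, and then sample at an interval no shorter than the hung task check interval, so that a quiet sample really means no task was reported as hung in between.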