From patchwork Thu Dec 14 12:50:28 2023
X-Patchwork-Submitter: Yafang Shao
X-Patchwork-Id: 13493003
From: Yafang Shao
To: akpm@linux-foundation.org, paul@paul-moore.com, jmorris@namei.org,
    serge@hallyn.com, omosnace@redhat.com, casey@schaufler-ca.com,
    kpsingh@kernel.org, mhocko@suse.com, ying.huang@intel.com
Cc: linux-mm@kvack.org, linux-security-module@vger.kernel.org,
    bpf@vger.kernel.org, ligang.bdlg@bytedance.com, Yafang Shao
Subject: [PATCH v5 bpf-next 0/5] mm, security, bpf: Fine-grained control
    over memory policy adjustments with lsm bpf
Date: Thu, 14 Dec 2023 12:50:28 +0000
Message-Id: <20231214125033.4158-1-laoar.shao@gmail.com>

Background
==========

In our containerized environment, we've identified unexpected OOM events
where the OOM-killer terminates tasks
despite the system having ample free memory. We traced this anomaly to
tasks within a container using mbind(2) to bind memory to a specific NUMA
node. When the memory available on that node is exhausted, the OOM-killer,
which prioritizes tasks based on oom_score, kills tasks indiscriminately.

The Challenge
=============

In a containerized environment, independent memory binding by one user can
lead to unexpected system issues or disrupt tasks being run by other users
on the same server. If a user genuinely requires memory binding, we will
allocate dedicated servers to them by leveraging kubelet deployment.

Currently, users can bind their memory to specific nodes on their own,
without explicit agreement or authorization from our end. We need a way
to prevent this behavior.

Proposed Solution
=================

- Capability
  Currently, any task can perform MPOL_BIND without specific capabilities.
  Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could be an option, but this
  may have unintended consequences. Capabilities, being broad, might grant
  unnecessary privileges. We should explore alternatives to prevent
  unexpected side effects.

- LSM
  Introduce LSM hooks for syscalls such as mbind(2) and set_mempolicy(2)
  to disable MPOL_BIND. This approach is more flexible and allows for
  fine-grained control without unintended consequences. A sample LSM BPF
  program is included, demonstrating practical implementation in a
  production environment.

- seccomp
  seccomp is relatively heavyweight, making it less suitable for enabling
  in our production environment:
  - Both kubelet and containers need adaptation to support it.
  - Dynamically altering security policies for individual containers
    without interrupting their operations isn't straightforward.

Future Considerations
=====================

In addition, there's room for enhancement in the OOM-killer for cases
involving CONSTRAINT_MEMORY_POLICY.
It would be more beneficial to prioritize selecting a victim that has
allocated memory on the same NUMA node. While searching lore, I found a
proposal[0] related to this matter, although it does not appear to have
reached consensus yet. Nevertheless, this topic is beyond the scope of
the current patchset.

[0]. https://lore.kernel.org/lkml/20220512044634.63586-1-ligang.bdlg@bytedance.com/

Changes:
- v4 -> v5:
  - Revise the commit log in patch #5. (KP)
- v3 -> v4: https://lwn.net/Articles/954126/
  - Drop the changes around security_task_movememory (Serge)
- RFC v2 -> v3: https://lwn.net/Articles/953526/
  - Add MPOL_F_NUMA_BALANCING man-page (Ying)
  - Fix bpf selftests error reported by bot+bpf-ci
- RFC v1 -> RFC v2: https://lwn.net/Articles/952339/
  - Refine the commit log to avoid being misleading
  - Use one common lsm hook instead and add comment for it
  - Add selinux implementation
  - Other improvements in mempolicy
- RFC v1: https://lwn.net/Articles/951188/

Yafang Shao (5):
  mm, doc: Add doc for MPOL_F_NUMA_BALANCING
  mm: mempolicy: Revise comment regarding mempolicy mode flags
  mm, security: Add lsm hook for memory policy adjustment
  security: selinux: Implement set_mempolicy hook
  selftests/bpf: Add selftests for set_mempolicy with a lsm prog

 .../admin-guide/mm/numa_memory_policy.rst    | 27 +++++++
 include/linux/lsm_hook_defs.h                |  3 +
 include/linux/security.h                     |  9 +++
 include/uapi/linux/mempolicy.h               |  2 +-
 mm/mempolicy.c                               |  8 +++
 security/security.c                          | 13 ++++
 security/selinux/hooks.c                     |  8 +++
 security/selinux/include/classmap.h          |  2 +-
 .../selftests/bpf/prog_tests/set_mempolicy.c | 84 ++++++++++++++++++++++
 .../selftests/bpf/progs/test_set_mempolicy.c | 28 ++++++++
 10 files changed, 182 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/set_mempolicy.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_set_mempolicy.c
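
As an illustration of the LSM approach described under "Proposed
Solution", the decision such a policy would make can be modelled in plain
userspace C. This is only a sketch under stated assumptions, not the
selftest from patch #5: the SKETCH_-prefixed names and both functions are
hypothetical, the constants are redefined locally to mirror the values in
include/uapi/linux/mempolicy.h, and the signature of the new hook is not
shown in this cover letter, so none is assumed here.

```c
#include <assert.h>
#include <errno.h>

/*
 * Assumption: these values mirror include/uapi/linux/mempolicy.h. They are
 * redefined under SKETCH_ names so the example builds without kernel headers.
 */
enum {
	SKETCH_MPOL_DEFAULT    = 0,
	SKETCH_MPOL_PREFERRED  = 1,
	SKETCH_MPOL_BIND       = 2,
	SKETCH_MPOL_INTERLEAVE = 3,
};

#define SKETCH_MPOL_F_STATIC_NODES	(1 << 15)
#define SKETCH_MPOL_F_RELATIVE_NODES	(1 << 14)
#define SKETCH_MPOL_F_NUMA_BALANCING	(1 << 13)
#define SKETCH_MPOL_MODE_FLAGS \
	(SKETCH_MPOL_F_STATIC_NODES | SKETCH_MPOL_F_RELATIVE_NODES | \
	 SKETCH_MPOL_F_NUMA_BALANCING)

/*
 * Split the raw mode word supplied by userspace into the policy mode and
 * the optional mode flags that are ORed into it.
 */
void sketch_split_mode(int raw, int *mode, int *mode_flags)
{
	assert(mode && mode_flags);
	*mode_flags = raw & SKETCH_MPOL_MODE_FLAGS;
	*mode = raw & ~SKETCH_MPOL_MODE_FLAGS;
}

/*
 * Hypothetical policy decision: return 0 to allow the request, or -EPERM
 * to reject it when the requested mode is MPOL_BIND.
 */
int sketch_check_mempolicy(int raw)
{
	int mode, mode_flags;

	sketch_split_mode(raw, &mode, &mode_flags);
	(void)mode_flags;	/* this policy ignores the mode flags */

	return mode == SKETCH_MPOL_BIND ? -EPERM : 0;
}
```

With this model, sketch_check_mempolicy(SKETCH_MPOL_BIND |
SKETCH_MPOL_F_STATIC_NODES) is rejected with -EPERM, while
SKETCH_MPOL_INTERLEAVE is allowed. Rejecting only MPOL_BIND leaves every
other policy untouched, which is the kind of fine-grained control a broad
capability check cannot express.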