From patchwork Sat Nov 20 04:50:06 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Mina Almasry
X-Patchwork-Id: 12630057
Date: Fri, 19 Nov 2021 20:50:06 -0800
Message-Id: <20211120045011.3074840-1-almasrymina@google.com>
X-Mailer: git-send-email 2.34.0.rc2.393.gf8c9666880-goog
Subject: [PATCH v4 0/4] Deterministic charging of shared memory
From: Mina Almasry
Cc: Mina Almasry, Jonathan Corbet, Alexander Viro, Andrew Morton,
 Johannes Weiner, Michal Hocko, Vladimir Davydov, Hugh Dickins,
 Shuah Khan, Shakeel Butt, Greg Thelen, Dave Chinner, Matthew Wilcox,
 Roman Gushchin, "Theodore Ts'o", linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Problem:
Currently shared memory is charged to the memcg of the allocating
process. This makes memory usage of processes accessing shared memory
a bit unpredictable since whichever process accesses the memory first
will get charged. We have a number of use cases where our userspace
would like deterministic charging of shared memory:

1. System services allocating memory for client jobs:
We have services (namely a network access service[1]) that provide
functionality for clients running on the machine and allocate memory
to carry out these services. The memory usage of these services
depends on the number of jobs running on the machine and the nature of
the requests made to the service, which makes the memory usage of
these services hard to predict and thus hard to limit via memory.max.
These system services would like a way to allocate memory and instruct
the kernel to charge this memory to the client’s memcg.

2. Shared filesystem between subtasks of a large job
Our infrastructure has large meta jobs such as Kubernetes which spawn
multiple subtasks which share a tmpfs mount. These jobs and their
subtasks use that tmpfs mount for various purposes such as sharing
data or persisting data across subtask restarts. In Kubernetes
terminology, the meta job is similar to pods and subtasks are
containers under pods.
We want the shared memory to be deterministically charged to the
Kubernetes pod, independent of the lifetime of the containers under
the pod.

3. Shared libraries and language runtimes shared between independent jobs.
We’d like to optimize memory usage on the machine by sharing libraries
and language runtimes among the many processes running on our machines
in separate memcgs. This produces a side effect that one job may be
unlucky enough to be the first to access many of the libraries and may
get oom killed as all the cached files get charged to it.

Design:
My rough proposal to solve this problem is to simply add a
‘memcg=/path/to/memcg’ mount option for filesystems, directing all
the memory of the file system to be ‘remote charged’ to the cgroup
provided by that memcg= option.

Caveats:

1. One complication to address is the behavior when the target memcg
hits its memory.max limit because of remote charging. In this case the
oom-killer will be invoked, but the oom-killer may not find anything
to kill in the target memcg being charged. There are a number of
considerations in this case:

    1. It's not great to kill the allocating process since the
    allocating process is not running in the memcg under oom, and
    killing it will not free memory in the memcg under oom.
    2. Pagefaults may hit the memcg limit, and we need to handle the
    pagefault somehow. If not, the process will forever loop the
    pagefault in the upstream kernel.

In this case, I propose simply failing the remote charge and returning
an ENOSPC to the caller. This will cause the process executing the
remote charge to get an ENOSPC in non-pagefault paths, and get a
SIGBUS on the pagefault path. This will be documented behavior of
remote charging, and this feature is opt-in. Users can:
- Not opt into the feature if they want.
- Opt into the feature and accept the risk of receiving ENOSPC or
  SIGBUS and abort if they desire.
- Gracefully handle any resulting ENOSPC or SIGBUS errors and continue
  their operation without executing the remote charge if possible.

2. Only processes allowed to enter the cgroup at mount time can mount
a tmpfs with the memcg= option. This is to prevent intentional DoS of
random cgroups on the machine. However, once a filesystem is mounted
with memcg=, any process with write access to this mount point will be
able to charge memory to the cgroup named by that option. This is
largely a non-issue because in configurations where there is untrusted
code running on the machine, mount point access needs to be restricted
to the intended users only, regardless of whether the mount point
memory is deterministically charged or not.

[1] https://research.google/pubs/pub48630

Cc: Jonathan Corbet
Cc: Alexander Viro
Cc: Andrew Morton
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Vladimir Davydov
Cc: Hugh Dickins
Cc: Shuah Khan
Cc: Shakeel Butt
Cc: Greg Thelen
Cc: Dave Chinner
Cc: Matthew Wilcox
Cc: Roman Gushchin
Cc: Theodore Ts'o
Cc: linux-kernel@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org

Mina Almasry (4):
  mm: support deterministic memory charging of filesystems
  mm/oom: handle remote ooms
  mm, shmem: add filesystem memcg= option documentation
  mm, shmem, selftests: add tmpfs memcg= mount option tests

 Documentation/filesystems/tmpfs.rst       |  28 ++++
 fs/fs_context.c                           |  27 ++++
 fs/proc_namespace.c                       |   4 +
 fs/super.c                                |   9 ++
 include/linux/fs.h                        |   5 +
 include/linux/fs_context.h                |   2 +
 include/linux/memcontrol.h                |  38 +++++
 mm/filemap.c                              |   2 +-
 mm/khugepaged.c                           |   3 +-
 mm/memcontrol.c                           | 171 ++++++++++++++++++++++
 mm/oom_kill.c                             |   9 ++
 mm/shmem.c                                |   3 +-
 tools/testing/selftests/vm/.gitignore     |   1 +
 tools/testing/selftests/vm/mmap_write.c   | 103 +++++++++++++
 tools/testing/selftests/vm/tmpfs-memcg.sh | 116 +++++++++++++++
 15 files changed, 518 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/vm/mmap_write.c
 create mode 100755 tools/testing/selftests/vm/tmpfs-memcg.sh

---
2.34.0.rc2.393.gf8c9666880-goog
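
For illustration, the "gracefully handle ENOSPC" option from caveat 1
could look roughly like the userspace sketch below on the non-pagefault
path. This is not code from the series: write_all(), write_file() and
write_with_fallback() are hypothetical helper names, and both paths are
chosen by the caller. It assumes, as proposed in the cover letter, that
a failed remote charge surfaces as ENOSPC from write(2). The SIGBUS
(pagefault) path is not covered here.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all of buf to fd. Returns 0 on success, -errno on failure. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -errno;
		}
		if (n == 0)
			return -EIO;		/* no progress, give up */
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/* Open path for writing and write buf to it. Returns 0 or -errno. */
static int write_file(const char *path, const char *buf, size_t len)
{
	int fd, ret;

	fd = open(path, O_WRONLY | O_CREAT, 0600);
	if (fd < 0)
		return -errno;
	ret = write_all(fd, buf, len);
	close(fd);
	return ret;
}

/*
 * Hypothetical helper: try a file on the memcg= mount first. If that
 * write fails with ENOSPC (assumed to mean the remote charge failed),
 * continue without the remote charge by writing to a fallback path.
 *
 * Returns 0 if the primary write succeeded, 1 if the fallback was
 * used, or -errno if neither worked.
 */
static int write_with_fallback(const char *primary, const char *fallback,
			       const char *buf, size_t len)
{
	int ret = write_file(primary, buf, len);

	if (ret == 0)
		return 0;
	if (ret != -ENOSPC)
		return ret;	/* unrelated error, do not mask it */

	ret = write_file(fallback, buf, len);
	return ret == 0 ? 1 : ret;
}
```

Errors other than ENOSPC are returned unchanged so that the fallback
only triggers for the failure mode described above. On Linux the
behavior can be exercised without the feature by passing /dev/full,
whose writes always fail with ENOSPC, as the primary path.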