From patchwork Fri Sep  2 02:29:50 2022
X-Patchwork-Submitter: Yafang Shao <laoar.shao@gmail.com>
X-Patchwork-Id: 12963560
From: Yafang Shao <laoar.shao@gmail.com>
To: ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org, kafai@fb.com,
    songliubraving@fb.com, yhs@fb.com, john.fastabend@gmail.com,
    kpsingh@kernel.org, sdf@google.com, haoluo@google.com, jolsa@kernel.org,
    hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev,
    shakeelb@google.com, songmuchun@bytedance.com, akpm@linux-foundation.org,
    tj@kernel.org, lizefan.x@bytedance.com
Cc: cgroups@vger.kernel.org, netdev@vger.kernel.org, bpf@vger.kernel.org,
    linux-mm@kvack.org, Yafang Shao <laoar.shao@gmail.com>
Subject: [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map
Date: Fri, 2 Sep 2022 02:29:50 +0000
Message-Id: <20220902023003.47124-1-laoar.shao@gmail.com>
X-Mailer: git-send-email 2.31.1

In our production environment, we load, run and pin bpf programs and maps
inside containers. For example, some of our networking bpf programs and maps
are loaded and pinned by a process running in a container in our k8s
environment. Other user applications also run in this container: they watch
the networking configurations from remote servers and apply them on the local
host, log error events, monitor the traffic, and perform other tasks.
Sometimes we need to update these user applications to a new release; in that
update process we destroy the old container and then start a new generation.
So that the bpf programs are not interrupted by the update, we pin the bpf
programs and maps in bpffs. That is the background and use case in our
production environment.

After switching to memcg-based bpf memory accounting to limit the bpf memory,
some unexpected issues jumped out at us:

1. The memory usage is not consistent between the first generation and
   new generations.
2. After the first generation is destroyed, the bpf memory can't be limited
   if the bpf maps are not preallocated, because they will be reparented.
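For context, a simplified sketch of how the current memcg-based accounting
charges a map allocation follows. The memcg saved in the map at creation time
(the creator's memcg) is made active around the allocation, so the charge
lands in the creating container's memcg rather than the allocating task's,
even after that container has been destroyed. This is modeled loosely on
bpf_map_kzalloc(); memcg reference counting and error handling are omitted.

/* Simplified sketch (not the exact kernel code): charge a map allocation
 * to the memcg that was saved in the map when it was created.
 */
static void *map_charged_kzalloc_sketch(struct bpf_map *map, size_t size,
					gfp_t flags)
{
	struct mem_cgroup *old_memcg;
	void *ptr;

	/* Make the map's saved memcg the active one for this allocation. */
	old_memcg = set_active_memcg(map->memcg);
	/* __GFP_ACCOUNT charges the allocation to the active memcg. */
	ptr = kzalloc(size, flags | __GFP_ACCOUNT);
	set_active_memcg(old_memcg);

	return ptr;
}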
Besides, there is another issue: the bpf-map's memcg breaks the memcg
hierarchy, because a bpf-map has its own memcg. A bpf map can be written by
tasks running in other memcgs; once a writer in another memcg writes to a
shared bpf map, the memory allocated by that write won't be charged to the
writer's memcg but to the bpf-map's own memcg instead. IOW, the bpf-map is
improperly treated as a task, while it is actually a shared resource. This
patchset doesn't resolve that issue; I will post another RFC once I find a
workable solution for it.

This patchset tries to resolve the above two issues by introducing a
selectable memcg to limit the bpf memory. Currently we only allow selecting
an ancestor memcg, to avoid breaking the memcg hierarchy further. Possible
use cases of the selectable memcg are as follows:

- Select the root memcg as bpf-map's memcg
  Then bpf-map's memory won't be throttled by the current memcg's limit.

- Put the current memcg under a fixed memcg dir and select the fixed memcg
  as bpf-map's memcg
  The hierarchy is as follows,

      Parent-memcg  (A fixed dir, e.g. /sys/fs/cgroup/memory/bpf)
           \
            Current-memcg  (Container dir, e.g. /sys/fs/cgroup/memory/bpf/foo)

  At map creation time, the bpf-map's memory will be charged to the parent
  directly without being charged to the current memcg, and thus the current
  memcg's usage will be consistent among different generations. To limit
  bpf-map's memory usage, we can set the limit in the parent memcg.
  (A hedged usage sketch for this case is included after the diffstat below.)

Currently this only supports bpf maps; we can extend it to bpf progs as well.
Observability can also be supported in a next step, for example showing the
bpf map's memcg via 'bpftool map show', or even showing which maps are
charged to a specific memcg via 'bpftool cgroup show'. Furthermore, we may
also show an accurate memory size of a bpf map, instead of an estimated one,
in 'bpftool map show' in the future.

v2->v3:
- use css_tryget() instead of css_tryget_online() (Shakeel)
- add comment for get_obj_cgroup_from_cgroup() (Shakeel)
- add new memcg helper task_under_memcg_hierarchy()
- add restriction to allow selecting an ancestor only, to avoid breaking the
  memcg hierarchy further, per discussion with Tejun

v1->v2:
- cgroup1 is also supported after commit f3a2aebdd6fb ("cgroup: enable
  cgroup_get_from_file() on cgroup1"), so update the commit log
- remove incorrect fix to mem_cgroup_put (Shakeel, Roman, Muchun)
- use cgroup_put() in bpf_map_save_memcg() (Shakeel)
- add detailed commit log for get_obj_cgroup_from_cgroup (Shakeel)

RFC->v1:
- get rid of bpf_map container wrapper (Alexei)
- add the new field into the end of struct (Alexei)
- get rid of BPF_F_SELECTABLE_MEMCG (Alexei)
- save memcg in bpf_map_init_from_attr
- introduce bpf_ringbuf_pages_{alloc,free} and keep them inside
  kernel/bpf/ringbuf.c (Andrii)

Yafang Shao (13):
  cgroup: Update the comment on cgroup_get_from_fd
  bpf: Introduce new helper bpf_map_put_memcg()
  bpf: Define bpf_map_{get,put}_memcg for !CONFIG_MEMCG_KMEM
  bpf: Call bpf_map_init_from_attr() immediately after map creation
  bpf: Save memcg in bpf_map_init_from_attr()
  bpf: Use scoped-based charge in bpf_map_area_alloc
  bpf: Introduce new helpers bpf_ringbuf_pages_{alloc,free}
  bpf: Use bpf_map_kzalloc in arraymap
  bpf: Use bpf_map_kvcalloc in bpf_local_storage
  mm, memcg: Add new helper get_obj_cgroup_from_cgroup
  mm, memcg: Add new helper task_under_memcg_hierarchy
  bpf: Add return value for bpf_map_init_from_attr
  bpf: Introduce selectable memcg for bpf map

 include/linux/bpf.h            |  40 +++++++++++-
 include/linux/memcontrol.h     |  25 ++++++++
 include/uapi/linux/bpf.h       |   1 +
 kernel/bpf/arraymap.c          |  34 +++++-----
 kernel/bpf/bloom_filter.c      |  11 +++-
 kernel/bpf/bpf_local_storage.c |  17 +++--
 kernel/bpf/bpf_struct_ops.c    |  19 +++---
 kernel/bpf/cpumap.c            |  17 +++--
 kernel/bpf/devmap.c            |  30 +++++----
 kernel/bpf/hashtab.c           |  26 +++++---
 kernel/bpf/local_storage.c     |  11 +++-
 kernel/bpf/lpm_trie.c          |  12 +++-
 kernel/bpf/offload.c           |  12 ++--
 kernel/bpf/queue_stack_maps.c  |  11 +++-
 kernel/bpf/reuseport_array.c   |  11 +++-
 kernel/bpf/ringbuf.c           | 104 ++++++++++++++++++++----------
 kernel/bpf/stackmap.c          |  13 ++--
 kernel/bpf/syscall.c           | 140 ++++++++++++++++++++++++++++-------------
 kernel/cgroup/cgroup.c         |   2 +-
 mm/memcontrol.c                |  48 ++++++++++++++
 net/core/sock_map.c            |  30 +++++----
 net/xdp/xskmap.c               |  12 +++-
 tools/include/uapi/linux/bpf.h |   1 +
 tools/lib/bpf/bpf.c            |   3 +-
 tools/lib/bpf/bpf.h            |   3 +-
 tools/lib/bpf/gen_loader.c     |   2 +-
 tools/lib/bpf/libbpf.c         |   2 +
 tools/lib/bpf/skel_internal.h  |   2 +-
 28 files changed, 462 insertions(+), 177 deletions(-)
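As an illustration of the second use case above (the fixed parent dir), here
is a hedged userspace sketch. It assumes the series lets the map creator pass
an fd of an ancestor memcg directory through a new bpf_attr field at
BPF_MAP_CREATE time; the field name memcg_fd below is purely illustrative, so
this only builds against uapi headers carrying this series (or whatever field
it actually adds).

/* Hedged sketch: create a hash map whose memory is charged to a selected
 * ancestor memcg. The memcg_fd field is an assumed name for illustration;
 * it will not build against unpatched uapi headers.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

int main(void)
{
	union bpf_attr attr;
	int cgfd, mapfd;

	/* The fixed parent dir from the hierarchy example above. */
	cgfd = open("/sys/fs/cgroup/memory/bpf", O_RDONLY);
	if (cgfd < 0) {
		perror("open cgroup dir");
		return 1;
	}

	memset(&attr, 0, sizeof(attr));
	attr.map_type = BPF_MAP_TYPE_HASH;
	attr.key_size = sizeof(int);
	attr.value_size = sizeof(long);
	attr.max_entries = 10240;
	attr.memcg_fd = cgfd;	/* assumed field added by this series */

	mapfd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
	if (mapfd < 0)
		perror("BPF_MAP_CREATE");

	close(cgfd);
	return mapfd < 0 ? 1 : 0;
}

To make the parent memcg actually bound the bpf memory, set its limit (e.g.
memory.limit_in_bytes on cgroup1) on /sys/fs/cgroup/memory/bpf before
creating the maps.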