From patchwork Fri Feb 25 23:43:31 2022
X-Patchwork-Submitter: Hao Luo
X-Patchwork-Id: 12761094
Date: Fri, 25 Feb 2022 15:43:31 -0800
In-Reply-To: <20220225234339.2386398-1-haoluo@google.com>
Message-Id: <20220225234339.2386398-2-haoluo@google.com>
Subject: [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
From: Hao Luo
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Shakeel Butt, Joe Burton, Tejun Heo, joshdon@google.com, sdf@google.com, bpf@vger.kernel.org, linux-kernel@vger.kernel.org

This patch allows a bpf_syscall prog to perform some basic filesystem operations: creating and removing directories and unlinking files. Three bpf helpers are added for this purpose. When combined with the following patches, which allow pinning and getting bpf objects from a bpf prog, this feature can be used to create a directory hierarchy in bpffs that helps manage bpf objects purely with bpf progs.
The added helpers are subject to the same permission checks as their syscall versions. For example, one cannot write to a read-only filesystem; the identity of the current process is checked to see whether it has sufficient permission to perform the operations. Only directories and files in bpffs can be created or removed by these helpers, but it would not be hard to allow them to operate on files in other filesystems, if we want.

Signed-off-by: Hao Luo
---
 include/linux/bpf.h            |   1 +
 include/uapi/linux/bpf.h       |  26 +++++
 kernel/bpf/inode.c             |   9 +-
 kernel/bpf/syscall.c           | 177 +++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  26 +++++
 5 files changed, 236 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f19abc59b6cd..fce5e26179f5 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1584,6 +1584,7 @@ int bpf_link_new_fd(struct bpf_link *link);
 struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
 struct bpf_link *bpf_link_get_from_fd(u32 ufd);
 
+bool bpf_path_is_bpf_dir(const struct path *path);
 int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
 int bpf_obj_get_user(const char __user *pathname, int flags);
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index afe3d0d7f5f2..a5dbc794403d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5086,6 +5086,29 @@ union bpf_attr {
 *	Return
 *		0 on success, or a negative error in case of failure. On error
 *		*dst* buffer is zeroed out.
+ *
+ * long bpf_mkdir(const char *pathname, int pathname_sz, u32 mode)
+ *	Description
+ *		Attempts to create a directory named *pathname*. The argument
+ *		*pathname_sz* specifies the length of the string *pathname*.
+ *		The argument *mode* specifies the mode for the new directory. It
+ *		is modified by the process's umask. It has the same semantics as
+ *		the syscall mkdir(2).
+ *	Return
+ *		0 on success, or a negative error in case of failure.
+ *
+ * long bpf_rmdir(const char *pathname, int pathname_sz)
+ *	Description
+ *		Deletes a directory, which must be empty.
+ *	Return
+ *		0 on success, or a negative error in case of failure.
+ *
+ * long bpf_unlink(const char *pathname, int pathname_sz)
+ *	Description
+ *		Deletes a name and possibly the file it refers to. It has the
+ *		same semantics as the syscall unlink(2).
+ *	Return
+ *		0 on success, or a negative error in case of failure.
 */
 #define __BPF_FUNC_MAPPER(FN)		\
	FN(unspec),			\
@@ -5280,6 +5303,9 @@ union bpf_attr {
	FN(xdp_load_bytes),		\
	FN(xdp_store_bytes),		\
	FN(copy_from_user_task),	\
+	FN(mkdir),			\
+	FN(rmdir),			\
+	FN(unlink),			\
	/* */

/* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 4f841e16779e..3aca00e9e950 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -414,6 +414,11 @@ static const struct inode_operations bpf_dir_iops = {
	.unlink		= simple_unlink,
 };
 
+bool bpf_path_is_bpf_dir(const struct path *path)
+{
+	return d_inode(path->dentry)->i_op == &bpf_dir_iops;
+}
+
 /* pin iterator link into bpffs */
 static int bpf_iter_link_pin_kernel(struct dentry *parent,
				    const char *name, struct bpf_link *link)
@@ -439,7 +444,6 @@ static int bpf_obj_do_pin(const char __user *pathname, void *raw,
			  enum bpf_type type)
 {
	struct dentry *dentry;
-	struct inode *dir;
	struct path path;
	umode_t mode;
	int ret;
@@ -454,8 +458,7 @@ static int bpf_obj_do_pin(const char __user *pathname, void *raw,
	if (ret)
		goto out;
 
-	dir = d_inode(path.dentry);
-	if (dir->i_op != &bpf_dir_iops) {
+	if (!bpf_path_is_bpf_dir(&path)) {
		ret = -EPERM;
		goto out;
	}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index db402ebc5570..07683b791733 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -12,6 +12,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -4867,6 +4868,176 @@ const struct bpf_func_proto bpf_kallsyms_lookup_name_proto = {
	.arg4_type	= ARG_PTR_TO_LONG,
 };
 
+BPF_CALL_3(bpf_mkdir, const char *, pathname, int, pathname_sz, u32, raw_mode)
+{
+	struct user_namespace *mnt_userns;
+	struct dentry *dentry;
+	struct path path;
+	umode_t mode;
+	int err;
+
+	if (pathname_sz <= 1 || pathname[pathname_sz - 1])
+		return -EINVAL;
+
+	dentry = kern_path_create(AT_FDCWD, pathname, &path, LOOKUP_DIRECTORY);
+	if (IS_ERR(dentry))
+		return PTR_ERR(dentry);
+
+	if (!bpf_path_is_bpf_dir(&path)) {
+		err = -EPERM;
+		goto err_exit;
+	}
+
+	mode = raw_mode;
+	if (!IS_POSIXACL(path.dentry->d_inode))
+		mode &= ~current_umask();
+	err = security_path_mkdir(&path, dentry, mode);
+	if (err)
+		goto err_exit;
+
+	mnt_userns = mnt_user_ns(path.mnt);
+	err = vfs_mkdir(mnt_userns, d_inode(path.dentry), dentry, mode);
+
+err_exit:
+	done_path_create(&path, dentry);
+	return err;
+}
+
+const struct bpf_func_proto bpf_mkdir_proto = {
+	.func		= bpf_mkdir,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_MEM | MEM_RDONLY,
+	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_2(bpf_rmdir, const char *, pathname, int, pathname_sz)
+{
+	struct user_namespace *mnt_userns;
+	struct path parent;
+	struct dentry *dentry;
+	int err;
+
+	if (pathname_sz <= 1 || pathname[pathname_sz - 1])
+		return -EINVAL;
+
+	err = kern_path(pathname, 0, &parent);
+	if (err)
+		return err;
+
+	if (!bpf_path_is_bpf_dir(&parent)) {
+		err = -EPERM;
+		goto exit1;
+	}
+
+	err = mnt_want_write(parent.mnt);
+	if (err)
+		goto exit1;
+
+	dentry = kern_path_locked(pathname, &parent);
+	if (IS_ERR(dentry)) {
+		err = PTR_ERR(dentry);
+		goto exit2;
+	}
+
+	if (d_really_is_negative(dentry)) {
+		err = -ENOENT;
+		goto exit3;
+	}
+
+	err = security_path_rmdir(&parent, dentry);
+	if (err)
+		goto exit3;
+
+	mnt_userns = mnt_user_ns(parent.mnt);
+	err = vfs_rmdir(mnt_userns, d_inode(parent.dentry), dentry);
+exit3:
+	dput(dentry);
+	inode_unlock(d_inode(parent.dentry));
+exit2:
+	mnt_drop_write(parent.mnt);
+exit1:
+	path_put(&parent);
+	return err;
+}
+
+const struct bpf_func_proto bpf_rmdir_proto = {
+	.func		= bpf_rmdir,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_MEM | MEM_RDONLY,
+	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
+};
+
+BPF_CALL_2(bpf_unlink, const char *, pathname, int, pathname_sz)
+{
+	struct user_namespace *mnt_userns;
+	struct path parent;
+	struct dentry *dentry;
+	struct inode *inode = NULL;
+	int err;
+
+	if (pathname_sz <= 1 || pathname[pathname_sz - 1])
+		return -EINVAL;
+
+	err = kern_path(pathname, 0, &parent);
+	if (err)
+		return err;
+
+	err = mnt_want_write(parent.mnt);
+	if (err)
+		goto exit1;
+
+	dentry = kern_path_locked(pathname, &parent);
+	if (IS_ERR(dentry)) {
+		err = PTR_ERR(dentry);
+		goto exit2;
+	}
+
+	if (!bpf_path_is_bpf_dir(&parent)) {
+		err = -EPERM;
+		goto exit3;
+	}
+
+	if (d_is_negative(dentry)) {
+		err = -ENOENT;
+		goto exit3;
+	}
+
+	if (d_is_dir(dentry)) {
+		err = -EISDIR;
+		goto exit3;
+	}
+
+	inode = dentry->d_inode;
+	ihold(inode);
+	err = security_path_unlink(&parent, dentry);
+	if (err)
+		goto exit3;
+
+	mnt_userns = mnt_user_ns(parent.mnt);
+	err = vfs_unlink(mnt_userns, d_inode(parent.dentry), dentry, NULL);
+exit3:
+	dput(dentry);
+	inode_unlock(d_inode(parent.dentry));
+	if (inode)
+		iput(inode);
+exit2:
+	mnt_drop_write(parent.mnt);
+exit1:
+	path_put(&parent);
+	return err;
+}
+
+const struct bpf_func_proto bpf_unlink_proto = {
+	.func		= bpf_unlink,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_MEM | MEM_RDONLY,
+	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
+};
+
 static const struct bpf_func_proto *
 syscall_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -4879,6 +5050,12 @@ syscall_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
		return &bpf_sys_close_proto;
	case BPF_FUNC_kallsyms_lookup_name:
		return &bpf_kallsyms_lookup_name_proto;
+	case BPF_FUNC_mkdir:
+		return &bpf_mkdir_proto;
+	case BPF_FUNC_rmdir:
+		return &bpf_rmdir_proto;
+	case BPF_FUNC_unlink:
+		return &bpf_unlink_proto;
	default:
		return tracing_prog_func_proto(func_id, prog);
	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index afe3d0d7f5f2..a5dbc794403d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5086,6 +5086,29 @@ union bpf_attr {
 *	Return
 *		0 on success, or a negative error in case of failure. On error
 *		*dst* buffer is zeroed out.
+ *
+ * long bpf_mkdir(const char *pathname, int pathname_sz, u32 mode)
+ *	Description
+ *		Attempts to create a directory named *pathname*. The argument
+ *		*pathname_sz* specifies the length of the string *pathname*.
+ *		The argument *mode* specifies the mode for the new directory. It
+ *		is modified by the process's umask. It has the same semantics as
+ *		the syscall mkdir(2).
+ *	Return
+ *		0 on success, or a negative error in case of failure.
+ *
+ * long bpf_rmdir(const char *pathname, int pathname_sz)
+ *	Description
+ *		Deletes a directory, which must be empty.
+ *	Return
+ *		0 on success, or a negative error in case of failure.
+ *
+ * long bpf_unlink(const char *pathname, int pathname_sz)
+ *	Description
+ *		Deletes a name and possibly the file it refers to. It has the
+ *		same semantics as the syscall unlink(2).
+ *	Return
+ *		0 on success, or a negative error in case of failure.
 */
 #define __BPF_FUNC_MAPPER(FN)		\
	FN(unspec),			\
@@ -5280,6 +5303,9 @@ union bpf_attr {
	FN(xdp_load_bytes),		\
	FN(xdp_store_bytes),		\
	FN(copy_from_user_task),	\
+	FN(mkdir),			\
+	FN(rmdir),			\
+	FN(unlink),			\
	/* */

/* integer value in 'imm' field of BPF_CALL instruction selects which helper

From patchwork Fri Feb 25 23:43:32 2022
X-Patchwork-Submitter: Hao Luo
X-Patchwork-Id: 12761096
Date: Fri, 25 Feb 2022 15:43:32 -0800
In-Reply-To: <20220225234339.2386398-1-haoluo@google.com>
Message-Id: <20220225234339.2386398-3-haoluo@google.com>
Subject: [PATCH bpf-next v1 2/9] bpf: Add BPF_OBJ_PIN and BPF_OBJ_GET in the bpf_sys_bpf helper
From: Hao Luo
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Shakeel Butt, Joe Burton, Tejun Heo, joshdon@google.com, sdf@google.com, bpf@vger.kernel.org, linux-kernel@vger.kernel.org

Now a bpf syscall prog is able to pin bpf objects into bpffs and to get a pinned object from the filesystem.
Combined with the previous patch, which introduced the helpers for creating and deleting directories in bpffs, a syscall prog can now persist bpf objects and organize them in a directory hierarchy.

Signed-off-by: Hao Luo
---
 include/linux/bpf.h  |  4 ++--
 kernel/bpf/inode.c   | 24 ++++++++++++++++++------
 kernel/bpf/syscall.c | 21 ++++++++++++++-------
 3 files changed, 34 insertions(+), 15 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index fce5e26179f5..c36eeced3838 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1585,8 +1585,8 @@ struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
 struct bpf_link *bpf_link_get_from_fd(u32 ufd);
 bool bpf_path_is_bpf_dir(const struct path *path);
 
-int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
-int bpf_obj_get_user(const char __user *pathname, int flags);
+int bpf_obj_pin_path(u32 ufd, bpfptr_t pathname);
+int bpf_obj_get_path(bpfptr_t pathname, int flags);
 
 #define BPF_ITER_FUNC_PREFIX "bpf_iter_"
 #define DEFINE_BPF_ITER_FUNC(target, args...)			\
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 3aca00e9e950..6c2db54a2ff9 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -440,7 +440,7 @@ static int bpf_iter_link_pin_kernel(struct dentry *parent,
	return ret;
 }
 
-static int bpf_obj_do_pin(const char __user *pathname, void *raw,
+static int bpf_obj_do_pin(bpfptr_t pathname, void *raw,
			  enum bpf_type type)
 {
	struct dentry *dentry;
@@ -448,7 +448,13 @@ static int bpf_obj_do_pin(const char __user *pathname, void *raw,
	umode_t mode;
	int ret;
 
-	dentry = user_path_create(AT_FDCWD, pathname, &path, 0);
+	if (bpfptr_is_null(pathname))
+		return -EINVAL;
+
+	if (bpfptr_is_kernel(pathname))
+		dentry = kern_path_create(AT_FDCWD, pathname.kernel, &path, 0);
+	else
+		dentry = user_path_create(AT_FDCWD, pathname.user, &path, 0);
	if (IS_ERR(dentry))
		return PTR_ERR(dentry);
@@ -481,7 +487,7 @@ static int bpf_obj_do_pin(const char __user *pathname, void *raw,
	return ret;
 }
 
-int bpf_obj_pin_user(u32 ufd, const char __user *pathname)
+int bpf_obj_pin_path(u32 ufd, bpfptr_t pathname)
 {
	enum bpf_type type;
	void *raw;
@@ -498,7 +504,7 @@ int bpf_obj_pin_user(u32 ufd, const char __user *pathname)
	return ret;
 }
 
-static void *bpf_obj_do_get(const char __user *pathname,
+static void *bpf_obj_do_get(bpfptr_t pathname,
			    enum bpf_type *type, int flags)
 {
	struct inode *inode;
@@ -506,7 +512,13 @@ static void *bpf_obj_do_get(const char __user *pathname,
	void *raw;
	int ret;
 
-	ret = user_path_at(AT_FDCWD, pathname, LOOKUP_FOLLOW, &path);
+	if (bpfptr_is_null(pathname))
+		return ERR_PTR(-EINVAL);
+
+	if (bpfptr_is_kernel(pathname))
+		ret = kern_path(pathname.kernel, LOOKUP_FOLLOW, &path);
+	else
+		ret = user_path_at(AT_FDCWD, pathname.user, LOOKUP_FOLLOW, &path);
	if (ret)
		return ERR_PTR(ret);
@@ -530,7 +542,7 @@ static void *bpf_obj_do_get(const char __user *pathname,
	return ERR_PTR(ret);
 }
 
-int bpf_obj_get_user(const char __user *pathname, int flags)
+int bpf_obj_get_path(bpfptr_t pathname, int flags)
 {
	enum bpf_type type = BPF_TYPE_UNSPEC;
	int f_flags;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 07683b791733..9e6d8d0c8af5 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2402,22 +2402,27 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 
 #define BPF_OBJ_LAST_FIELD file_flags
 
-static int bpf_obj_pin(const union bpf_attr *attr)
+static int bpf_obj_pin(const union bpf_attr *attr, bpfptr_t uattr)
 {
+	bpfptr_t pathname;
+
	if (CHECK_ATTR(BPF_OBJ) || attr->file_flags != 0)
		return -EINVAL;
 
-	return bpf_obj_pin_user(attr->bpf_fd, u64_to_user_ptr(attr->pathname));
+	pathname = make_bpfptr(attr->pathname, bpfptr_is_kernel(uattr));
+	return bpf_obj_pin_path(attr->bpf_fd, pathname);
 }
 
-static int bpf_obj_get(const union bpf_attr *attr)
+static int bpf_obj_get(const union bpf_attr *attr, bpfptr_t uattr)
 {
+	bpfptr_t pathname;
+
	if (CHECK_ATTR(BPF_OBJ) || attr->bpf_fd != 0 ||
	    attr->file_flags & ~BPF_OBJ_FLAG_MASK)
		return -EINVAL;
 
-	return bpf_obj_get_user(u64_to_user_ptr(attr->pathname),
-				attr->file_flags);
+	pathname = make_bpfptr(attr->pathname, bpfptr_is_kernel(uattr));
+	return bpf_obj_get_path(pathname, attr->file_flags);
 }
 
 void bpf_link_init(struct bpf_link *link, enum bpf_link_type type,
@@ -4648,10 +4653,10 @@ static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
		err = bpf_prog_load(&attr, uattr);
		break;
	case BPF_OBJ_PIN:
-		err = bpf_obj_pin(&attr);
+		err = bpf_obj_pin(&attr, uattr);
		break;
	case BPF_OBJ_GET:
-		err = bpf_obj_get(&attr);
+		err = bpf_obj_get(&attr, uattr);
		break;
	case BPF_PROG_ATTACH:
		err = bpf_prog_attach(&attr);
@@ -4776,6 +4781,8 @@ BPF_CALL_3(bpf_sys_bpf, int, cmd, union bpf_attr *, attr, u32, attr_size)
	case BPF_BTF_LOAD:
	case BPF_LINK_CREATE:
	case BPF_RAW_TRACEPOINT_OPEN:
+	case BPF_OBJ_PIN:
+	case BPF_OBJ_GET:
		break;
#ifdef CONFIG_BPF_JIT /* __bpf_prog_enter_sleepable used by trampoline and JIT */
	case BPF_PROG_TEST_RUN:

From patchwork Fri Feb 25 23:43:33 2022
X-Patchwork-Submitter: Hao Luo
X-Patchwork-Id: 12761095
Date: Fri, 25 Feb 2022 15:43:33 -0800
In-Reply-To: <20220225234339.2386398-1-haoluo@google.com>
Message-Id: <20220225234339.2386398-4-haoluo@google.com>
Subject: [PATCH bpf-next v1 3/9] selftests/bpf: tests mkdir, rmdir, unlink and pin in syscall
From: Hao Luo
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Shakeel Butt, Joe Burton, Tejun Heo, joshdon@google.com, sdf@google.com, bpf@vger.kernel.org, linux-kernel@vger.kernel.org

Add a subtest in syscall to test bpf_mkdir(), bpf_rmdir(), bpf_unlink() and object pinning in a syscall prog.
Signed-off-by: Hao Luo --- .../selftests/bpf/prog_tests/syscall.c | 67 +++++++++++++++++- .../testing/selftests/bpf/progs/syscall_fs.c | 69 +++++++++++++++++++ 2 files changed, 135 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/bpf/progs/syscall_fs.c diff --git a/tools/testing/selftests/bpf/prog_tests/syscall.c b/tools/testing/selftests/bpf/prog_tests/syscall.c index f4d40001155a..782b5fe73096 100644 --- a/tools/testing/selftests/bpf/prog_tests/syscall.c +++ b/tools/testing/selftests/bpf/prog_tests/syscall.c @@ -1,7 +1,9 @@ // SPDX-License-Identifier: GPL-2.0 /* Copyright (c) 2021 Facebook */ +#include #include #include "syscall.skel.h" +#include "syscall_fs.skel.h" struct args { __u64 log_buf; @@ -12,7 +14,7 @@ struct args { int btf_fd; }; -void test_syscall(void) +static void test_syscall_basic(void) { static char verifier_log[8192]; struct args ctx = { @@ -53,3 +55,66 @@ void test_syscall(void) if (ctx.btf_fd > 0) close(ctx.btf_fd); } + +static void test_syscall_fs(void) +{ + char tmpl[] = "/sys/fs/bpf/syscall_XXXXXX"; + struct stat statbuf = {}; + static char verifier_log[8192]; + struct args ctx = { + .log_buf = (uintptr_t) verifier_log, + .log_size = sizeof(verifier_log), + .prog_fd = 0, + }; + LIBBPF_OPTS(bpf_test_run_opts, tattr, + .ctx_in = &ctx, + .ctx_size_in = sizeof(ctx), + ); + struct syscall_fs *skel = NULL; + int err, mkdir_fd, rmdir_fd; + char *root, *dir, *path; + + /* prepares test directories */ + system("mount -t bpf bpffs /sys/fs/bpf"); + root = mkdtemp(tmpl); + chmod(root, 0755); + + /* loads prog */ + skel = syscall_fs__open_and_load(); + if (!ASSERT_OK_PTR(skel, "skel_load")) + goto cleanup; + + dir = skel->bss->dirname; + snprintf(dir, sizeof(skel->bss->dirname), "%s/test", root); + path = skel->bss->pathname; + snprintf(path, sizeof(skel->bss->pathname), "%s/prog", dir); + + /* tests mkdir */ + mkdir_fd = bpf_program__fd(skel->progs.mkdir_prog); + err = bpf_prog_test_run_opts(mkdir_fd, &tattr); + ASSERT_EQ(err, 0, 
"mkdir_err"); + ASSERT_EQ(tattr.retval, 0, "mkdir_retval"); + ASSERT_OK(stat(dir, &statbuf), "mkdir_success"); + ASSERT_OK(stat(path, &statbuf), "pin_success"); + + /* tests rmdir */ + rmdir_fd = bpf_program__fd(skel->progs.rmdir_prog); + err = bpf_prog_test_run_opts(rmdir_fd, &tattr); + ASSERT_EQ(err, 0, "rmdir_err"); + ASSERT_EQ(tattr.retval, 0, "rmdir_retval"); + ASSERT_ERR(stat(path, &statbuf), "unlink_success"); + ASSERT_ERR(stat(dir, &statbuf), "rmdir_success"); + +cleanup: + syscall_fs__destroy(skel); + if (ctx.prog_fd > 0) + close(ctx.prog_fd); + rmdir(root); +} + +void test_syscall(void) { + if (test__start_subtest("basic")) + test_syscall_basic(); + if (test__start_subtest("filesystem")) + test_syscall_fs(); +} diff --git a/tools/testing/selftests/bpf/progs/syscall_fs.c b/tools/testing/selftests/bpf/progs/syscall_fs.c new file mode 100644 index 000000000000..9418d1364c09 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/syscall_fs.c @@ -0,0 +1,69 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2022 Google */ +#include +#include +#include <../../../tools/include/linux/filter.h> + +char _license[] SEC("license") = "GPL"; + +struct args { + __u64 log_buf; + __u32 log_size; + int max_entries; + int map_fd; + int prog_fd; + int btf_fd; +}; + +char dirname[64]; +char pathname[64]; + +SEC("syscall") +int mkdir_prog(struct args *ctx) +{ + static char license[] = "GPL"; + static struct bpf_insn insns[] = { + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }; + static union bpf_attr load_attr = { + .prog_type = BPF_PROG_TYPE_XDP, + .insn_cnt = sizeof(insns) / sizeof(insns[0]), + }; + static union bpf_attr pin_attr = { + .file_flags = 0, + }; + int ret; + + ret = bpf_mkdir(dirname, sizeof(dirname), 0644); + if (ret) + return ret; + + load_attr.license = (long) license; + load_attr.insns = (long) insns; + load_attr.log_buf = ctx->log_buf; + load_attr.log_size = ctx->log_size; + load_attr.log_level = 1; + ret = bpf_sys_bpf(BPF_PROG_LOAD, &load_attr, 
sizeof(load_attr)); + if (ret < 0) + return ret; + else if (ret == 0) + return -1; + ctx->prog_fd = ret; + + pin_attr.pathname = (__u64)pathname; + pin_attr.bpf_fd = ret; + return bpf_sys_bpf(BPF_OBJ_PIN, &pin_attr, sizeof(pin_attr)); +} + +SEC("syscall") +int rmdir_prog(struct args *ctx) +{ + int ret; + + ret = bpf_unlink(pathname, sizeof(pathname)); + if (ret) + return ret; + + return bpf_rmdir(dirname, sizeof(dirname)); +} From patchwork Fri Feb 25 23:43:34 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Hao Luo X-Patchwork-Id: 12761099 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id EEBCCC433EF for ; Fri, 25 Feb 2022 23:44:03 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239446AbiBYXod (ORCPT ); Fri, 25 Feb 2022 18:44:33 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43188 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239543AbiBYXo2 (ORCPT ); Fri, 25 Feb 2022 18:44:28 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 07C52194144 for ; Fri, 25 Feb 2022 15:43:54 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-2d74a0ff060so46201947b3.6 for ; Fri, 25 Feb 2022 15:43:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=93Dy3pXp1EtGBKvAGMRWdloXVx8eO4YCYmEbzoeaeYk=; b=pnXI1KrV9MNxcdc11kWQSQhoY7dWYBJtcK64UyHbc02BWWucUIXvShTv3FROxsL5BW 8vhoSDADnCQlSzakUBv1jjkdIoXtTZJKwyKp4vIZumKf4lK8elweUo9MrpFKaMqNsP1C 
UljZGZKErOTVs8LZhCK/mR6FK+aAZobtlX80IqAAEBRl7Qe0B+CSCUHaDKs3wiaxWx5v Z62FCKndn+O4YNFQgUuOl24uOevsHtzutVCrKav7zJn6AGUf0ysKVgFesRF5a4fbE0rz WuajkK0JLUtcXqS6vpwYJ6i7eCHQ8hEC2mdov2uSU4z88wcDUhFlB9dzaJI8HK/2cM// Cefg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=93Dy3pXp1EtGBKvAGMRWdloXVx8eO4YCYmEbzoeaeYk=; b=PJMla5bU4AEdG7NdvjAUOvoBGmSqr9Ga2vy6VEYBfZvG2VEvYu2VxPlA/tLuT0ZVD7 HboO+ef7reOl1Dm6u/SMYJe2FB450RErhLMx/SLTQj3df27nmXTp1wK4n+Zi0EnThoGl XohUn8PaGhGHYrbHI27KBQjYMiNDswhY9paCelt/907lWr/d2RNB+B/ki6kyJDMV6lxB EKBMIc0BVq2NaMjkAlL7IhLM3hYfTMKjOHcGKdvS/728yjZaJMLkNa/5GPU7SmwV9yMO iANSgW2zui14YtsWp+AxxfID8kCgUtJka1lQNq0DhI9qGMwhcu8cVBHVrNoQj9tzY9Aa r3oA== X-Gm-Message-State: AOAM5300FJNSGCPjqsAajhA6qJUquEzfr2BTpR+eKc89zV8Uu5MTYJUk c2XHf9ameFZOGkcxszZctEcxmsrXLMQ= X-Google-Smtp-Source: ABdhPJwaleac35eNfnnuZ5ZlgBLZqZSBlHD/27YuV/VOvM8XhXeNJ4XcCUmMceiEyeGNtaHSURlTJU5++EE= X-Received: from haoluo.svl.corp.google.com ([2620:15c:2cd:202:378d:645d:49ad:4f8b]) (user=haoluo job=sendgmr) by 2002:a81:f0c:0:b0:2d6:83ab:7605 with SMTP id 12-20020a810f0c000000b002d683ab7605mr9979774ywp.150.1645832633202; Fri, 25 Feb 2022 15:43:53 -0800 (PST) Date: Fri, 25 Feb 2022 15:43:34 -0800 In-Reply-To: <20220225234339.2386398-1-haoluo@google.com> Message-Id: <20220225234339.2386398-5-haoluo@google.com> Mime-Version: 1.0 References: <20220225234339.2386398-1-haoluo@google.com> X-Mailer: git-send-email 2.35.1.574.g5d30c73bfb-goog Subject: [PATCH bpf-next v1 4/9] bpf: Introduce sleepable tracepoints From: Hao Luo To: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann Cc: Martin KaFai Lau , Song Liu , Yonghong Song , KP Singh , Shakeel Butt , Joe Burton , Tejun Heo , joshdon@google.com, sdf@google.com, bpf@vger.kernel.org, linux-kernel@vger.kernel.org, Hao Luo Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net 
Add a new type of bpf tracepoints: sleepable tracepoints, which allow the handler to make calls that may sleep. With sleepable tracepoints, a set of syscall helpers (which may sleep) can also be called from the handlers. In the following patches, we will whitelist some tracepoints as sleepable.

Signed-off-by: Hao Luo
--- include/linux/bpf.h | 10 +++++++- include/linux/tracepoint-defs.h | 1 + include/trace/bpf_probe.h | 22 ++++++++++++++---- kernel/bpf/syscall.c | 41 +++++++++++++++++++++++---------- kernel/trace/bpf_trace.c | 5 ++++ 5 files changed, 61 insertions(+), 18 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index c36eeced3838..759ade7b24b3 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1810,6 +1810,9 @@ struct bpf_prog *bpf_prog_by_id(u32 id); struct bpf_link *bpf_link_by_id(u32 id); const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id func_id); +const struct bpf_func_proto * +tracing_prog_syscall_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog); + void bpf_task_storage_free(struct task_struct *task); bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog); const struct btf_func_model * @@ -1822,7 +1825,6 @@ struct bpf_core_ctx { int bpf_core_apply(struct bpf_core_ctx *ctx, const struct bpf_core_relo *relo, int relo_idx, void *insn); - #else /* !CONFIG_BPF_SYSCALL */ static inline struct bpf_prog *bpf_prog_get(u32 ufd) { @@ -2011,6 +2013,12 @@ bpf_base_func_proto(enum bpf_func_id func_id) return NULL; } +static inline struct bpf_func_proto * +tracing_prog_syscall_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +{ + return NULL; +} + static inline void bpf_task_storage_free(struct task_struct *task) { } diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h index e7c2276be33e..c73c7ab3680e 100644 --- a/include/linux/tracepoint-defs.h +++ b/include/linux/tracepoint-defs.h @@ -51,6 +51,7 @@ struct bpf_raw_event_map { void *bpf_func; u32
num_args; u32 writable_size; + u32 sleepable; } __aligned(32); /* diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h index 7660a7846586..4edfc6df2f52 100644 --- a/include/trace/bpf_probe.h +++ b/include/trace/bpf_probe.h @@ -88,7 +88,7 @@ __bpf_trace_##call(void *__data, proto) \ * to make sure that if the tracepoint handling changes, the * bpf probe will fail to compile unless it too is updated. */ -#define __DEFINE_EVENT(template, call, proto, args, size) \ +#define __DEFINE_EVENT(template, call, proto, args, size, sleep) \ static inline void bpf_test_probe_##call(void) \ { \ check_trace_callback_type_##call(__bpf_trace_##template); \ @@ -104,6 +104,7 @@ __section("__bpf_raw_tp_map") = { \ .bpf_func = __bpf_trace_##template, \ .num_args = COUNT_ARGS(args), \ .writable_size = size, \ + .sleepable = sleep, \ }, \ }; @@ -123,11 +124,15 @@ static inline void bpf_test_buffer_##call(void) \ #undef DEFINE_EVENT_WRITABLE #define DEFINE_EVENT_WRITABLE(template, call, proto, args, size) \ __CHECK_WRITABLE_BUF_SIZE(call, PARAMS(proto), PARAMS(args), size) \ - __DEFINE_EVENT(template, call, PARAMS(proto), PARAMS(args), size) + __DEFINE_EVENT(template, call, PARAMS(proto), PARAMS(args), size, 0) + +#undef DEFINE_EVENT_SLEEPABLE +#define DEFINE_EVENT_SLEEPABLE(template, call, proto, args) \ + __DEFINE_EVENT(template, call, PARAMS(proto), PARAMS(args), 0, 1) #undef DEFINE_EVENT #define DEFINE_EVENT(template, call, proto, args) \ - __DEFINE_EVENT(template, call, PARAMS(proto), PARAMS(args), 0) + __DEFINE_EVENT(template, call, PARAMS(proto), PARAMS(args), 0, 0) #undef DEFINE_EVENT_PRINT #define DEFINE_EVENT_PRINT(template, name, proto, args, print) \ @@ -136,19 +141,26 @@ static inline void bpf_test_buffer_##call(void) \ #undef DECLARE_TRACE #define DECLARE_TRACE(call, proto, args) \ __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) \ - __DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), 0) + __DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), 0, 0) 
#undef DECLARE_TRACE_WRITABLE #define DECLARE_TRACE_WRITABLE(call, proto, args, size) \ __CHECK_WRITABLE_BUF_SIZE(call, PARAMS(proto), PARAMS(args), size) \ __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) \ - __DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), size) + __DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), size, 0) + +#undef DECLARE_TRACE_SLEEPABLE +#define DECLARE_TRACE_SLEEPABLE(call, proto, args) \ + __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) \ + __DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), 0, 1) #include TRACE_INCLUDE(TRACE_INCLUDE_FILE) #undef DECLARE_TRACE_WRITABLE #undef DEFINE_EVENT_WRITABLE #undef __CHECK_WRITABLE_BUF_SIZE +#undef DECLARE_TRACE_SLEEPABLE +#undef DEFINE_EVENT_SLEEPABLE #undef __DEFINE_EVENT #undef FIRST diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 9e6d8d0c8af5..0a12f52fe8a9 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -4827,12 +4827,6 @@ static const struct bpf_func_proto bpf_sys_bpf_proto = { .arg3_type = ARG_CONST_SIZE, }; -const struct bpf_func_proto * __weak -tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) -{ - return bpf_base_func_proto(func_id); -} - BPF_CALL_1(bpf_sys_close, u32, fd) { /* When bpf program calls this helper there should not be @@ -5045,24 +5039,47 @@ const struct bpf_func_proto bpf_unlink_proto = { .arg2_type = ARG_CONST_SIZE_OR_ZERO, }; -static const struct bpf_func_proto * -syscall_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +/* Syscall helpers that are also allowed in sleepable tracing prog. 
*/ +const struct bpf_func_proto * +tracing_prog_syscall_func_proto(enum bpf_func_id func_id, + const struct bpf_prog *prog) { switch (func_id) { case BPF_FUNC_sys_bpf: return &bpf_sys_bpf_proto; - case BPF_FUNC_btf_find_by_name_kind: - return &bpf_btf_find_by_name_kind_proto; case BPF_FUNC_sys_close: return &bpf_sys_close_proto; - case BPF_FUNC_kallsyms_lookup_name: - return &bpf_kallsyms_lookup_name_proto; case BPF_FUNC_mkdir: return &bpf_mkdir_proto; case BPF_FUNC_rmdir: return &bpf_rmdir_proto; case BPF_FUNC_unlink: return &bpf_unlink_proto; + default: + return NULL; + } +} + +const struct bpf_func_proto * __weak +tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +{ + const struct bpf_func_proto *fn; + + fn = tracing_prog_syscall_func_proto(func_id, prog); + if (fn) + return fn; + + return bpf_base_func_proto(func_id); +} + +static const struct bpf_func_proto * +syscall_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +{ + switch (func_id) { + case BPF_FUNC_btf_find_by_name_kind: + return &bpf_btf_find_by_name_kind_proto; + case BPF_FUNC_kallsyms_lookup_name: + return &bpf_kallsyms_lookup_name_proto; default: return tracing_prog_func_proto(func_id, prog); } diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index a2024ba32a20..c816e0e0d4a0 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -1691,6 +1691,8 @@ tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) fn = raw_tp_prog_func_proto(func_id, prog); if (!fn && prog->expected_attach_type == BPF_TRACE_ITER) fn = bpf_iter_get_func_proto(func_id, prog); + if (!fn && prog->aux->sleepable) + fn = tracing_prog_syscall_func_proto(func_id, prog); return fn; } } @@ -2053,6 +2055,9 @@ static int __bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog * if (prog->aux->max_tp_access > btp->writable_size) return -EINVAL; + if (prog->aux->sleepable && !btp->sleepable) + return -EPERM; + return 
tracepoint_probe_register_may_exist(tp, (void *)btp->bpf_func, prog); }

From patchwork Fri Feb 25 23:43:35 2022
Date: Fri, 25 Feb 2022 15:43:35 -0800
In-Reply-To: <20220225234339.2386398-1-haoluo@google.com>
Message-Id: <20220225234339.2386398-6-haoluo@google.com>
References: <20220225234339.2386398-1-haoluo@google.com>
Subject: [PATCH bpf-next v1 5/9] cgroup: Sleepable cgroup tracepoints.
From: Hao Luo
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Shakeel Butt, Joe Burton, Tejun Heo, joshdon@google.com, sdf@google.com, bpf@vger.kernel.org, linux-kernel@vger.kernel.org, Hao Luo

Add two new sleepable tracepoints in cgroup: cgroup_mkdir_s and cgroup_rmdir_s. The suffix _s means they run in a sleepable context. These two tracepoints don't need full cgroup paths, so they don't have to live in atomic context. They are also called without holding cgroup_mutex.
They can be used for bpf to monitor cgroup creation and deletion. Sleepable bpf programs can attach to these two tracepoints and create corresponding directories in bpffs. The created directories don't need the cgroup paths; the cgroup id is sufficient to identify the cgroup. Once the bpffs directories have been created, the bpf prog can further pin bpf objects inside the directories and allow users to read the pinned objects. This serves as a way to extend the fixed cgroup interface.

Cc: Tejun Heo
Signed-off-by: Hao Luo
--- include/trace/events/cgroup.h | 45 +++++++++++++++++++++++++++++++++ kernel/cgroup/cgroup.c | 5 ++++ 2 files changed, 50 insertions(+) diff --git a/include/trace/events/cgroup.h b/include/trace/events/cgroup.h index dd7d7c9efecd..4483a7d6c43a 100644 --- a/include/trace/events/cgroup.h +++ b/include/trace/events/cgroup.h @@ -204,6 +204,51 @@ DEFINE_EVENT(cgroup_event, cgroup_notify_frozen, TP_ARGS(cgrp, path, val) ); +/* + * The following tracepoints are supposed to be called in a sleepable context.
+ */ +DECLARE_EVENT_CLASS(cgroup_sleepable_tp, + + TP_PROTO(struct cgroup *cgrp), + + TP_ARGS(cgrp), + + TP_STRUCT__entry( + __field( int, root ) + __field( int, level ) + __field( u64, id ) + ), + + TP_fast_assign( + __entry->root = cgrp->root->hierarchy_id; + __entry->id = cgroup_id(cgrp); + __entry->level = cgrp->level; + ), + + TP_printk("root=%d id=%llu level=%d", + __entry->root, __entry->id, __entry->level) +); + +#ifdef DEFINE_EVENT_SLEEPABLE +#undef DEFINE_EVENT +#define DEFINE_EVENT(template, call, proto, args) \ + DEFINE_EVENT_SLEEPABLE(template, call, PARAMS(proto), PARAMS(args)) +#endif + +DEFINE_EVENT(cgroup_sleepable_tp, cgroup_mkdir_s, + + TP_PROTO(struct cgroup *cgrp), + + TP_ARGS(cgrp) +); + +DEFINE_EVENT(cgroup_sleepable_tp, cgroup_rmdir_s, + + TP_PROTO(struct cgroup *cgrp), + + TP_ARGS(cgrp) +); + #endif /* _TRACE_CGROUP_H */ /* This part must be outside protection */ diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 9d05c3ca2d5e..f14ab00d9ef5 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -5535,6 +5535,8 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode) cgroup_destroy_locked(cgrp); out_unlock: cgroup_kn_unlock(parent_kn); + if (!ret) + trace_cgroup_mkdir_s(cgrp); return ret; } @@ -5725,6 +5727,9 @@ int cgroup_rmdir(struct kernfs_node *kn) TRACE_CGROUP_PATH(rmdir, cgrp); cgroup_kn_unlock(kn); + + if (!ret) + trace_cgroup_rmdir_s(cgrp); return ret; }

From patchwork Fri Feb 25 23:43:36 2022
Date: Fri, 25 Feb 2022 15:43:36 -0800
In-Reply-To: <20220225234339.2386398-1-haoluo@google.com>
Message-Id: <20220225234339.2386398-7-haoluo@google.com>
References: <20220225234339.2386398-1-haoluo@google.com>
Subject: [PATCH bpf-next v1 6/9] libbpf: Add sleepable tp_btf
From: Hao Luo
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Shakeel Butt, Joe Burton, Tejun Heo, joshdon@google.com, sdf@google.com, bpf@vger.kernel.org, linux-kernel@vger.kernel.org, Hao Luo

In the previous patches, we introduced sleepable tracepoints in the kernel and listed a couple of cgroup tracepoints as sleepable. This patch introduces a sleepable version of tp_btf. Sleepable tp_btf progs can only attach to sleepable tracepoints.
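For illustration only (not part of the patch), a sleepable tp_btf program using the new "tp_btf.s/" section could look like the sketch below. It assumes the cgroup_mkdir_s tracepoint from the previous patch in this series and a BTF-generated vmlinux.h; the program name is made up.

```c
// SPDX-License-Identifier: GPL-2.0
/* Sketch: a sleepable tp_btf prog attached to cgroup_mkdir_s.
 * Assumes the tracepoint from patch 5/9; names are illustrative. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("tp_btf.s/cgroup_mkdir_s")
int BPF_PROG(on_cgroup_mkdir, struct cgroup *cgrp)
{
	/* Sleepable context: the syscall helpers whitelisted in
	 * tracing_prog_syscall_func_proto() (e.g. bpf_mkdir) may be
	 * called from here, e.g. to create a bpffs directory. */
	bpf_printk("cgroup created, level=%d", cgrp->level);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```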
Signed-off-by: Hao Luo
--- tools/lib/bpf/libbpf.c | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 776b8e034d62..910682357390 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -8619,6 +8619,7 @@ static const struct bpf_sec_def section_defs[] = { SEC_DEF("fentry/", TRACING, BPF_TRACE_FENTRY, SEC_ATTACH_BTF, attach_trace), SEC_DEF("fmod_ret/", TRACING, BPF_MODIFY_RETURN, SEC_ATTACH_BTF, attach_trace), SEC_DEF("fexit/", TRACING, BPF_TRACE_FEXIT, SEC_ATTACH_BTF, attach_trace), + SEC_DEF("tp_btf.s/", TRACING, BPF_TRACE_RAW_TP, SEC_ATTACH_BTF | SEC_SLEEPABLE, attach_trace), SEC_DEF("fentry.s/", TRACING, BPF_TRACE_FENTRY, SEC_ATTACH_BTF | SEC_SLEEPABLE, attach_trace), SEC_DEF("fmod_ret.s/", TRACING, BPF_MODIFY_RETURN, SEC_ATTACH_BTF | SEC_SLEEPABLE, attach_trace), SEC_DEF("fexit.s/", TRACING, BPF_TRACE_FEXIT, SEC_ATTACH_BTF | SEC_SLEEPABLE, attach_trace),

From patchwork Fri Feb 25 23:43:37 2022
Date: Fri, 25 Feb 2022 15:43:37 -0800
In-Reply-To: <20220225234339.2386398-1-haoluo@google.com>
Message-Id: <20220225234339.2386398-8-haoluo@google.com>
References: <20220225234339.2386398-1-haoluo@google.com>
Subject: [PATCH bpf-next v1 7/9] bpf: Lift permission check in __sys_bpf when called from kernel.
From: Hao Luo
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Shakeel Butt, Joe Burton, Tejun Heo, joshdon@google.com, sdf@google.com, bpf@vger.kernel.org, linux-kernel@vger.kernel.org, Hao Luo

After introducing sleepable tracing programs, we now have an interesting problem: there are three execution paths that can reach bpf_sys_bpf:

1. called from the bpf syscall.
2. called from kernel context (e.g. kernel modules).
3. called from bpf programs.

The capability check in bpf_sys_bpf is necessary for the first two scenarios, but it may not be necessary for the third case. The use case of sleepable tracepoints is to allow the root user to deploy bpf progs which run when certain kernel tracepoints are triggered. An example use case is to monitor cgroup creation and perform bpf operations whenever a cgroup is created. These operations include pinning an iter to export the cgroup's state. Using sleepable tracing is preferred because it eliminates the need for a userspace daemon to monitor cgroup changes. However, in this use case, the current task that triggers the tracepoint may be unprivileged, and the permission check in __sys_bpf would thus prevent it from making bpf syscalls. Therefore the tracing progs deployed by root could not be used by non-root users. A solution to this problem is to lift the permission check when the caller of bpf_sys_bpf comes from either kernel context or bpf programs. An alternative to lifting this permission check would be introducing an 'unpriv' version of bpf_sys_bpf, which doesn't check the current task's capability.
If the owner of the tracing prog wants it to be used exclusively by root users, they can use the 'priv' version of bpf_sys_bpf; if the owner wants it to be usable by non-root users, they can use the 'unpriv' version.

Signed-off-by: Hao Luo
--- kernel/bpf/syscall.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 0a12f52fe8a9..3bf88002ee56 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -4613,7 +4613,7 @@ static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size) union bpf_attr attr; int err; - if (sysctl_unprivileged_bpf_disabled && !bpf_capable()) + if (sysctl_unprivileged_bpf_disabled && !bpf_capable() && !uattr.is_kernel) return -EPERM; err = bpf_check_uarg_tail_zero(uattr, sizeof(attr), size);

From patchwork Fri Feb 25 23:43:38 2022
Date: Fri, 25 Feb 2022 15:43:38 -0800
In-Reply-To: <20220225234339.2386398-1-haoluo@google.com>
Message-Id: <20220225234339.2386398-9-haoluo@google.com>
References: <20220225234339.2386398-1-haoluo@google.com>
Subject: [PATCH bpf-next v1 8/9] bpf: Introduce cgroup iter
From: Hao Luo
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Shakeel Butt, Joe Burton, Tejun Heo, joshdon@google.com, sdf@google.com, bpf@vger.kernel.org, linux-kernel@vger.kernel.org, Hao Luo

Introduce a new type of iter prog: cgroup. Unlike other bpf_iters, this iter doesn't iterate over a set of kernel objects. Instead, it is parameterized by a cgroup id and prints only that cgroup, so one needs to specify a target cgroup id when attaching this iter. The target cgroup's state can then be read out via a link of this iter. Typically, we can monitor cgroup creation and deletion using sleepable tracing, use that to create corresponding directories in bpffs, and pin a cgroup-id-parameterized link in the directory. Then we can read the auto-pinned iter link to get the cgroup's state. The output of the iter link is determined by the program. See the selftest test_cgroup_stats.c for an example.
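For reference, attaching this iter from userspace might look like the sketch below, using libbpf's link-create API with the new union bpf_iter_link_info member from this patch. prog_fd and cgroup_id are assumed to come from elsewhere; this is an illustration, not part of the patch.

```c
/* Sketch: create a bpf link for a cgroup iter, parameterized by a
 * target cgroup id. prog_fd/cgroup_id are assumptions for illustration. */
#include <bpf/bpf.h>

static int attach_cgroup_iter(int prog_fd, __u64 cgroup_id)
{
	union bpf_iter_link_info linfo = {};
	DECLARE_LIBBPF_OPTS(bpf_link_create_opts, opts);

	linfo.cgroup.cgroup_id = cgroup_id;
	opts.iter_info = &linfo;
	opts.iter_info_len = sizeof(linfo);

	/* Returns a link fd on success; the link can then be pinned in
	 * bpffs, or passed to bpf_iter_create() to get a readable fd. */
	return bpf_link_create(prog_fd, 0, BPF_TRACE_ITER, &opts);
}
```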
Signed-off-by: Hao Luo Reported-by: kernel test robot Reported-by: kernel test robot Reported-by: kernel test robot --- include/linux/bpf.h | 1 + include/uapi/linux/bpf.h | 6 ++ kernel/bpf/Makefile | 2 +- kernel/bpf/cgroup_iter.c | 141 +++++++++++++++++++++++++++++++++ tools/include/uapi/linux/bpf.h | 6 ++ 5 files changed, 155 insertions(+), 1 deletion(-) create mode 100644 kernel/bpf/cgroup_iter.c diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 759ade7b24b3..3ce9b0b7ed89 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1595,6 +1595,7 @@ int bpf_obj_get_path(bpfptr_t pathname, int flags); struct bpf_iter_aux_info { struct bpf_map *map; + u64 cgroup_id; }; typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog, diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index a5dbc794403d..855ad80d9983 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -91,6 +91,9 @@ union bpf_iter_link_info { struct { __u32 map_fd; } map; + struct { + __u64 cgroup_id; + } cgroup; }; /* BPF syscall commands, see bpf(2) man-page for more details. 
 */
@@ -5887,6 +5890,9 @@ struct bpf_link_info {
				struct {
					__u32 map_id;
				} map;
+				struct {
+					__u64 cgroup_id;
+				} cgroup;
			};
		} iter;
		struct {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index c1a9be6a4b9f..52a0e4c6e96e 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -8,7 +8,7 @@ CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
-obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
+obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o cgroup_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
 obj-${CONFIG_BPF_LSM} += bpf_inode_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c
new file mode 100644
index 000000000000..011d9dcd1d51
--- /dev/null
+++ b/kernel/bpf/cgroup_iter.c
@@ -0,0 +1,141 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2022 Google */
+#include
+#include
+#include
+#include
+#include
+
+struct bpf_iter__cgroup {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct cgroup *, cgroup);
+};
+
+static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct cgroup *cgroup;
+	u64 cgroup_id;
+
+	/* Only one session is supported. */
+	if (*pos > 0)
+		return NULL;
+
+	cgroup_id = *(u64 *)seq->private;
+	cgroup = cgroup_get_from_id(cgroup_id);
+	if (!cgroup)
+		return NULL;
+
+	if (*pos == 0)
+		++*pos;
+
+	return cgroup;
+}
+
+static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	++*pos;
+	return NULL;
+}
+
+static int cgroup_iter_seq_show(struct seq_file *seq, void *v)
+{
+	struct bpf_iter__cgroup ctx;
+	struct bpf_iter_meta meta;
+	struct bpf_prog *prog;
+	int ret = 0;
+
+	ctx.meta = &meta;
+	ctx.cgroup = v;
+	meta.seq = seq;
+	prog = bpf_iter_get_info(&meta, false);
+	if (prog)
+		ret = bpf_iter_run_prog(prog, &ctx);
+
+	return ret;
+}
+
+static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
+{
+	if (v)
+		cgroup_put(v);
+}
+
+static const struct seq_operations cgroup_iter_seq_ops = {
+	.start = cgroup_iter_seq_start,
+	.next = cgroup_iter_seq_next,
+	.stop = cgroup_iter_seq_stop,
+	.show = cgroup_iter_seq_show,
+};
+
+BTF_ID_LIST_SINGLE(bpf_cgroup_btf_id, struct, cgroup)
+
+static int cgroup_iter_seq_init(void *priv_data, struct bpf_iter_aux_info *aux)
+{
+	*(u64 *)priv_data = aux->cgroup_id;
+	return 0;
+}
+
+static void cgroup_iter_seq_fini(void *priv_data)
+{
+}
+
+static const struct bpf_iter_seq_info cgroup_iter_seq_info = {
+	.seq_ops = &cgroup_iter_seq_ops,
+	.init_seq_private = cgroup_iter_seq_init,
+	.fini_seq_private = cgroup_iter_seq_fini,
+	.seq_priv_size = sizeof(u64),
+};
+
+static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
+				  union bpf_iter_link_info *linfo,
+				  struct bpf_iter_aux_info *aux)
+{
+	aux->cgroup_id = linfo->cgroup.cgroup_id;
+	return 0;
+}
+
+static void bpf_iter_detach_cgroup(struct bpf_iter_aux_info *aux)
+{
+}
+
+void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
+				 struct seq_file *seq)
+{
+	char buf[64] = {0};
+
+	cgroup_path_from_kernfs_id(aux->cgroup_id, buf, sizeof(buf));
+	seq_printf(seq, "cgroup_id:\t%llu\n", aux->cgroup_id);
+	seq_printf(seq, "cgroup_path:\t%s\n", buf);
+}
+
+int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
+				   struct bpf_link_info *info)
+{
+	info->iter.cgroup.cgroup_id = aux->cgroup_id;
+	return 0;
+}
+
+DEFINE_BPF_ITER_FUNC(cgroup, struct bpf_iter_meta *meta,
+		     struct cgroup *cgroup)
+
+static struct bpf_iter_reg bpf_cgroup_reg_info = {
+	.target = "cgroup",
+	.attach_target = bpf_iter_attach_cgroup,
+	.detach_target = bpf_iter_detach_cgroup,
+	.show_fdinfo = bpf_iter_cgroup_show_fdinfo,
+	.fill_link_info = bpf_iter_cgroup_fill_link_info,
+	.ctx_arg_info_size = 1,
+	.ctx_arg_info = {
+		{ offsetof(struct bpf_iter__cgroup, cgroup),
+		  PTR_TO_BTF_ID },
+	},
+	.seq_info = &cgroup_iter_seq_info,
+};
+
+static int __init bpf_cgroup_iter_init(void)
+{
+	bpf_cgroup_reg_info.ctx_arg_info[0].btf_id = bpf_cgroup_btf_id[0];
+	return bpf_iter_reg_target(&bpf_cgroup_reg_info);
+}
+
+late_initcall(bpf_cgroup_iter_init);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a5dbc794403d..855ad80d9983 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@ union bpf_iter_link_info {
	struct {
		__u32 map_fd;
	} map;
+	struct {
+		__u64 cgroup_id;
+	} cgroup;
 };

 /* BPF syscall commands, see bpf(2) man-page for more details. */
@@ -5887,6 +5890,9 @@ struct bpf_link_info {
				struct {
					__u32 map_id;
				} map;
+				struct {
+					__u64 cgroup_id;
+				} cgroup;
			};
		} iter;
		struct {

From patchwork Fri Feb 25 23:43:39 2022
X-Patchwork-Submitter: Hao Luo
X-Patchwork-Id: 12761102
X-Patchwork-Delegate: bpf@iogearbox.net
Date: Fri, 25 Feb 2022 15:43:39 -0800
In-Reply-To: <20220225234339.2386398-1-haoluo@google.com>
Message-Id: <20220225234339.2386398-10-haoluo@google.com>
References: <20220225234339.2386398-1-haoluo@google.com>
Subject: [PATCH bpf-next v1 9/9] selftests/bpf: Tests using sleepable tracepoints to monitor cgroup events
From: Hao Luo
To: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann
Cc: Martin KaFai Lau , Song Liu , Yonghong Song , KP Singh , Shakeel Butt , Joe Burton , Tejun Heo , joshdon@google.com, sdf@google.com, bpf@vger.kernel.org, linux-kernel@vger.kernel.org, Hao Luo
X-Mailing-List: bpf@vger.kernel.org

Tests the functionality of the sleepable tracing prog, the sleepable
tracepoints (i.e. cgroup_mkdir_s and cgroup_rmdir_s) and the cgroup
iter prog all together.
The added selftest resembles a real-world application, where bpf is used
to export cgroup-level performance stats. There are two sets of progs in
the test: cgroup_monitor and cgroup_sched_lat.

- Cgroup_monitor monitors cgroup creation and deletion using sleepable
  tracing. For each cgroup created, it creates a directory in bpffs,
  creates a cgroup iter link and pins the link in that directory.

- Cgroup_sched_lat collects cgroups' scheduling latencies and stores
  them in a hash map. It also implements a cgroup iter prog, which reads
  the stats from the map and seq_prints them. This iter prog is the one
  pinned by cgroup_monitor in each bpffs directory.

The cgroup_sched_lat in this test can be adapted for exporting similar
cgroup-level performance stats.

Signed-off-by: Hao Luo
---
 .../bpf/prog_tests/test_cgroup_stats.c        | 187 ++++++++++++++
 tools/testing/selftests/bpf/progs/bpf_iter.h  |   7 +
 .../selftests/bpf/progs/cgroup_monitor.c      |  78 ++++++
 .../selftests/bpf/progs/cgroup_sched_lat.c    | 232 ++++++++++++++++++
 4 files changed, 504 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cgroup_stats.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_monitor.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_sched_lat.c

diff --git a/tools/testing/selftests/bpf/prog_tests/test_cgroup_stats.c b/tools/testing/selftests/bpf/prog_tests/test_cgroup_stats.c
new file mode 100644
index 000000000000..b6607ac074bc
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_cgroup_stats.c
@@ -0,0 +1,187 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Google */
+#define _GNU_SOURCE
+#include /* mkdir */
+#include /* name_to_handle_at */
+#include
+#include "cgroup_monitor.skel.h"
+#include "cgroup_sched_lat.skel.h"
+
+static char mkdir_prog_path[64];
+static char rmdir_prog_path[64];
+static char dump_prog_path[64];
+
+/* Get cgroup id from a full path to cgroup */
+static int get_cgroup_id(const char *cgroup)
+{
+	int mount_id = 0;
+	struct {
+		struct file_handle fh;
+		__u64 cgid;
+	} fh = {};
+
+	fh.fh.handle_bytes = sizeof(fh.cgid);
+	if (name_to_handle_at(AT_FDCWD, cgroup, &fh.fh, &mount_id, 0))
+		return -1;
+
+	return fh.cgid;
+}
+
+static void spin_on_cpu(int seconds)
+{
+	time_t start, now;
+
+	start = time(NULL);
+	do {
+		now = time(NULL);
+	} while (now - start < seconds);
+}
+
+static void do_work(const char *cgroup)
+{
+	int i, cpu = 0, pid;
+	char cmd[128];
+
+	/* make cgroup threaded */
+	snprintf(cmd, 128, "echo threaded > %s/cgroup.type", cgroup);
+	system(cmd);
+
+	/* try to enable cpu controller. this may fail if cpu controller is not
+	 * available in cgroup.controllers or there is a cgroup v1 already
+	 * mounted in the system.
+	 */
+	snprintf(cmd, 128, "echo \"+cpu\" > %s/cgroup.subtree_control", cgroup);
+	system(cmd);
+
+	/* launch two children, both running in child cgroup */
+	for (i = 0; i < 2; ++i) {
+		pid = fork();
+		if (pid == 0) {
+			/* attach to cgroup */
+			snprintf(cmd, 128, "echo %d > %s/cgroup.procs", getpid(), cgroup);
+			system(cmd);
+
+			/* pin process to target cpu */
+			snprintf(cmd, 128, "taskset -pc %d %d", cpu, getpid());
+			system(cmd);
+
+			spin_on_cpu(3); /* spin on cpu for 3 seconds */
+			exit(0);
+		}
+	}
+
+	/* pin parent process to target cpu */
+	snprintf(cmd, 128, "taskset -pc %d %d", cpu, getpid());
+	system(cmd);
+
+	spin_on_cpu(3); /* spin on cpu for 3 seconds */
+	wait(NULL);
+}
+
+/* Check reading cgroup stats from auto pinned objects
+ * @root: root directory in bpffs set up for this test
+ * @cgroup: cgroup path
+ */
+static void check_cgroup_stats(const char *root, const char *cgroup)
+{
+	unsigned long queue_self, queue_other;
+	char buf[64], path[64];
+	int id, cgroup_id;
+	FILE *file;
+
+	id = get_cgroup_id(cgroup);
+	if (!ASSERT_GE(id, 0, "get_cgroup_id"))
+		return;
+
+	snprintf(path, sizeof(path), "%s/%d/stats", root, id);
+	file = fopen(path, "r");
+	if (!ASSERT_OK_PTR(file, "open"))
+		return;
+
+	ASSERT_OK_PTR(fgets(buf, sizeof(buf), file), "cat");
+	ASSERT_EQ(sscanf(buf, "cgroup_id: %8d", &cgroup_id), 1, "output");
+	ASSERT_EQ(id, cgroup_id, "cgroup_id");
+
+	ASSERT_OK_PTR(fgets(buf, sizeof(buf), file), "cat");
+	ASSERT_EQ(sscanf(buf, "queue_self: %8lu", &queue_self), 1, "output");
+
+	ASSERT_OK_PTR(fgets(buf, sizeof(buf), file), "cat");
+	ASSERT_EQ(sscanf(buf, "queue_other: %8lu", &queue_other), 1, "output");
+	fclose(file);
+}
+
+/* Set up bpf progs for monitoring cgroup activities. */
+static void setup_cgroup_monitor(const char *root)
+{
+	struct cgroup_monitor *skel = NULL;
+
+	skel = cgroup_monitor__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "cgroup_monitor_skel_load"))
+		return;
+
+	cgroup_monitor__attach(skel);
+
+	snprintf(skel->bss->root, sizeof(skel->bss->root), "%s", root);
+
+	snprintf(mkdir_prog_path, 64, "%s/mkdir_prog", root);
+	bpf_obj_pin(bpf_link__fd(skel->links.mkdir_prog), mkdir_prog_path);
+
+	snprintf(rmdir_prog_path, 64, "%s/rmdir_prog", root);
+	bpf_obj_pin(bpf_link__fd(skel->links.rmdir_prog), rmdir_prog_path);
+
+	cgroup_monitor__destroy(skel);
+}
+
+void test_cgroup_stats(void)
+{
+	char bpf_tmpl[] = "/sys/fs/bpf/XXXXXX";
+	char cgrp_tmpl[] = "/sys/fs/cgroup/XXXXXX";
+	struct cgroup_sched_lat *skel = NULL;
+	char *root, *cgroup;
+
+	/* prepare test directories */
+	system("mount -t cgroup2 none /sys/fs/cgroup");
+	system("mount -t bpf bpffs /sys/fs/bpf");
+	root = mkdtemp(bpf_tmpl);
+	chmod(root, 0777);
+
+	/* set up progs for monitoring cgroup events */
+	setup_cgroup_monitor(root);
+
+	/* set up progs for profiling cgroup stats */
+	skel = cgroup_sched_lat__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "cgroup_sched_lat_skel_load"))
+		goto cleanup_root;
+
+	snprintf(dump_prog_path, 64, "%s/prog", root);
+	bpf_obj_pin(bpf_program__fd(skel->progs.dump_cgroup), dump_prog_path);
+	chmod(dump_prog_path, 0644);
+
+	cgroup_sched_lat__attach(skel);
+
+	/* thanks to cgroup monitoring progs, a directory corresponding to the
+	 * cgroup is created in bpffs.
+	 */
+	cgroup = mkdtemp(cgrp_tmpl);
+
+	/* collect some cgroup-level stats and check reading them from bpffs */
+	do_work(cgroup);
+	check_cgroup_stats(root, cgroup);
+
+	/* thanks to cgroup monitoring progs, removing cgroups also removes
+	 * the created directory in bpffs.
+	 */
+	rmdir(cgroup);
+
+	/* clean up cgroup monitoring progs */
+	cgroup_sched_lat__detach(skel);
+	cgroup_sched_lat__destroy(skel);
+	unlink(dump_prog_path);
+cleanup_root:
+	/* remove test directories in bpffs */
+	unlink(mkdir_prog_path);
+	unlink(rmdir_prog_path);
+	rmdir(root);
+	return;
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter.h b/tools/testing/selftests/bpf/progs/bpf_iter.h
index 8cfaeba1ddbf..0d1bf954e831 100644
--- a/tools/testing/selftests/bpf/progs/bpf_iter.h
+++ b/tools/testing/selftests/bpf/progs/bpf_iter.h
@@ -16,6 +16,7 @@
 #define bpf_iter__bpf_map_elem bpf_iter__bpf_map_elem___not_used
 #define bpf_iter__bpf_sk_storage_map bpf_iter__bpf_sk_storage_map___not_used
 #define bpf_iter__sockmap bpf_iter__sockmap___not_used
+#define bpf_iter__cgroup bpf_iter__cgroup__not_used
 #define btf_ptr btf_ptr___not_used
 #define BTF_F_COMPACT BTF_F_COMPACT___not_used
 #define BTF_F_NONAME BTF_F_NONAME___not_used
@@ -37,6 +38,7 @@
 #undef bpf_iter__bpf_map_elem
 #undef bpf_iter__bpf_sk_storage_map
 #undef bpf_iter__sockmap
+#undef bpf_iter__cgroup
 #undef btf_ptr
 #undef BTF_F_COMPACT
 #undef BTF_F_NONAME
@@ -132,6 +134,11 @@ struct bpf_iter__sockmap {
	struct sock *sk;
 };

+struct bpf_iter__cgroup {
+	struct bpf_iter_meta *meta;
+	struct cgroup *cgroup;
+} __attribute__((preserve_access_index));
+
 struct btf_ptr {
	void *ptr;
	__u32 type_id;
diff --git a/tools/testing/selftests/bpf/progs/cgroup_monitor.c b/tools/testing/selftests/bpf/progs/cgroup_monitor.c
new file mode 100644
index 000000000000..fa5debe1e15a
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cgroup_monitor.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Google */
+#include "vmlinux.h"
+#include
+#include
+
+char _license[] SEC("license") = "GPL";
+
+/* root is the directory path. */
+char root[64];
+
+SEC("tp_btf.s/cgroup_mkdir_s")
+int BPF_PROG(mkdir_prog, struct cgroup *cgrp)
+{
+	static char dirname[64];
+	static char prog_path[64];
+	static char iter_path[64];
+	static union bpf_iter_link_info info;
+	static union bpf_attr get_attr;
+	static union bpf_attr link_attr;
+	static union bpf_attr pin_attr;
+	int link_fd, prog_fd, ret;
+	__u64 id;
+
+	/* create directory in bpffs named by cgroup's id. */
+	id = cgrp->kn->id;
+	BPF_SNPRINTF(dirname, sizeof(dirname), "%s/%lu", root, id);
+	ret = bpf_mkdir(dirname, sizeof(dirname), 0755);
+	if (ret)
+		return ret;
+
+	/* get cgroup iter prog pinned by test progs. */
+	BPF_SNPRINTF(prog_path, sizeof(prog_path), "%s/prog", root);
+	get_attr.bpf_fd = 0;
+	get_attr.pathname = (__u64)prog_path;
+	get_attr.file_flags = BPF_F_RDONLY;
+	prog_fd = bpf_sys_bpf(BPF_OBJ_GET, &get_attr, sizeof(get_attr));
+	if (prog_fd < 0)
+		return prog_fd;
+
+	/* create a link, parameterized by cgroup id. */
+	info.cgroup.cgroup_id = id;
+	link_attr.link_create.prog_fd = prog_fd;
+	link_attr.link_create.attach_type = BPF_TRACE_ITER;
+	link_attr.link_create.target_fd = 0;
+	link_attr.link_create.flags = 0;
+	link_attr.link_create.iter_info = (__u64)&info;
+	link_attr.link_create.iter_info_len = sizeof(info);
+	ret = bpf_sys_bpf(BPF_LINK_CREATE, &link_attr, sizeof(link_attr));
+	if (ret < 0) {
+		bpf_sys_close(prog_fd);
+		return ret;
+	}
+	link_fd = ret;
+
+	/* pin the link in the created directory */
+	BPF_SNPRINTF(iter_path, sizeof(iter_path), "%s/stats", dirname);
+	pin_attr.pathname = (__u64)iter_path;
+	pin_attr.bpf_fd = link_fd;
+	pin_attr.file_flags = 0;
+	ret = bpf_sys_bpf(BPF_OBJ_PIN, &pin_attr, sizeof(pin_attr));
+
+	bpf_sys_close(prog_fd);
+	bpf_sys_close(link_fd);
+	return ret;
+}
+
+SEC("tp_btf.s/cgroup_rmdir_s")
+int BPF_PROG(rmdir_prog, struct cgroup *cgrp)
+{
+	static char dirname[64];
+	static char path[64];
+
+	BPF_SNPRINTF(dirname, sizeof(dirname), "%s/%lu", root, cgrp->kn->id);
+	BPF_SNPRINTF(path, sizeof(path), "%s/stats", dirname);
+	bpf_unlink(path, sizeof(path));
+	return bpf_rmdir(dirname, sizeof(dirname));
+}
diff --git a/tools/testing/selftests/bpf/progs/cgroup_sched_lat.c b/tools/testing/selftests/bpf/progs/cgroup_sched_lat.c
new file mode 100644
index 000000000000..90fe709377e1
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cgroup_sched_lat.c
@@ -0,0 +1,232 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Google */
+#include "bpf_iter.h"
+#include
+#include
+#include
+
+char _license[] SEC("license") = "GPL";
+
+#define TASK_RUNNING 0
+#define BPF_F_CURRENT_CPU 0xffffffffULL
+
+extern void fair_sched_class __ksym;
+extern bool CONFIG_FAIR_GROUP_SCHED __kconfig;
+extern bool CONFIG_CGROUP_SCHED __kconfig;
+
+struct wait_lat {
+	/* Queue_self stands for the latency a task experiences while waiting
+	 * behind the tasks that are from the same cgroup.
+	 *
+	 * Queue_other stands for the latency a task experiences while waiting
+	 * behind the tasks that are from other cgroups.
+	 *
+	 * For example, suppose there are three tasks: A, B and C. A and B
+	 * are in the same cgroup, C is in another cgroup, and we see A has
+	 * a queueing latency of X milliseconds. Let's say during the X
+	 * milliseconds, B has run for Y milliseconds. We can break down X
+	 * into two parts: the time when B is on cpu, that is Y; and the time
+	 * when C is on cpu, that is X - Y.
+	 *
+	 * Queue_self is the former (Y) while queue_other is the latter (X - Y).
+	 *
+	 * A large value in queue_self is an indication of contention within a
+	 * cgroup, while a large value in queue_other is an indication of
+	 * contention from multiple cgroups.
+	 */
+	u64 queue_self;
+	u64 queue_other;
+};
+
+struct timestamp {
+	/* timestamp when last queued */
+	u64 tsp;
+
+	/* cgroup exec_clock when last queued */
+	u64 exec_clock;
+};
+
+/* Map to store per-cgroup wait latency */
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__type(key, u64);
+	__type(value, struct wait_lat);
+	__uint(max_entries, 65532);
+} cgroup_lat SEC(".maps");
+
+/* Map to store per-task queue timestamp */
+struct {
+	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, struct timestamp);
+} start SEC(".maps");
+
+/* adapted from task_cfs_rq in kernel/sched/sched.h */
+__always_inline
+struct cfs_rq *task_cfs_rq(struct task_struct *t)
+{
+	if (!CONFIG_FAIR_GROUP_SCHED)
+		return NULL;
+
+	return BPF_CORE_READ(&t->se, cfs_rq);
+}
+
+/* record enqueue timestamp */
+__always_inline
+static int trace_enqueue(struct task_struct *t)
+{
+	u32 pid = t->pid;
+	struct timestamp *ptr;
+	struct cfs_rq *cfs_rq;
+
+	if (!pid)
+		return 0;
+
+	/* only measure for CFS tasks */
+	if (t->sched_class != &fair_sched_class)
+		return 0;
+
+	ptr = bpf_task_storage_get(&start, t, 0,
+				   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!ptr)
+		return 0;
+
+	/* CONFIG_FAIR_GROUP_SCHED may not be enabled */
+	cfs_rq = task_cfs_rq(t);
+	if (!cfs_rq)
+		return 0;
+
+	ptr->tsp = bpf_ktime_get_ns();
+	ptr->exec_clock = BPF_CORE_READ(cfs_rq, exec_clock);
+	return 0;
+}
+
+SEC("tp_btf/sched_wakeup")
+int handle__sched_wakeup(u64 *ctx)
+{
+	/* TP_PROTO(struct task_struct *p) */
+	struct task_struct *p = (void *)ctx[0];
+
+	return trace_enqueue(p);
+}
+
+SEC("tp_btf/sched_wakeup_new")
+int handle__sched_wakeup_new(u64 *ctx)
+{
+	/* TP_PROTO(struct task_struct *p) */
+	struct task_struct *p = (void *)ctx[0];
+
+	return trace_enqueue(p);
+}
+
+/* task_group() from kernel/sched/sched.h */
+__always_inline
+struct task_group *task_group(struct task_struct *p)
+{
+	if (!CONFIG_CGROUP_SCHED)
+		return NULL;
+
+	return BPF_CORE_READ(p, sched_task_group);
+}
+
+__always_inline
+struct cgroup *task_cgroup(struct task_struct *p)
+{
+	struct task_group *tg;
+
+	tg = task_group(p);
+	if (!tg)
+		return NULL;
+
+	return BPF_CORE_READ(tg, css).cgroup;
+}
+
+__always_inline
+u64 max(u64 x, u64 y)
+{
+	return x > y ? x : y;
+}
+
+SEC("tp_btf/sched_switch")
+int handle__sched_switch(u64 *ctx)
+{
+	/* TP_PROTO(bool preempt, struct task_struct *prev,
+	 *	    struct task_struct *next)
+	 */
+	struct task_struct *prev = (struct task_struct *)ctx[1];
+	struct task_struct *next = (struct task_struct *)ctx[2];
+	u64 delta, delta_self, delta_other, id;
+	struct cfs_rq *cfs_rq;
+	struct timestamp *tsp;
+	struct wait_lat *lat;
+	struct cgroup *cgroup;
+
+	/* ivcsw: treat like an enqueue event and store timestamp */
+	if (prev->__state == TASK_RUNNING)
+		trace_enqueue(prev);
+
+	/* only measure for CFS tasks */
+	if (next->sched_class != &fair_sched_class)
+		return 0;
+
+	/* fetch timestamp and calculate delta */
+	tsp = bpf_task_storage_get(&start, next, 0, 0);
+	if (!tsp)
+		return 0;	/* missed enqueue */
+
+	/* CONFIG_FAIR_GROUP_SCHED may not be enabled */
+	cfs_rq = task_cfs_rq(next);
+	if (!cfs_rq)
+		return 0;
+
+	/* cpu controller may not be enabled */
+	cgroup = task_cgroup(next);
+	if (!cgroup)
+		return 0;
+
+	/* calculate self delay and other delay */
+	delta = bpf_ktime_get_ns() - tsp->tsp;
+	delta_self = BPF_CORE_READ(cfs_rq, exec_clock) - tsp->exec_clock;
+	if (delta_self > delta)
+		delta_self = delta;
+	delta_other = delta - delta_self;
+
+	/* insert into cgroup_lat map */
+	id = BPF_CORE_READ(cgroup, kn, id);
+	lat = bpf_map_lookup_elem(&cgroup_lat, &id);
+	if (!lat) {
+		struct wait_lat w = {
+			.queue_self = delta_self,
+			.queue_other = delta_other,
+		};
+
+		bpf_map_update_elem(&cgroup_lat, &id, &w, BPF_ANY);
+	} else {
+		lat->queue_self = max(delta_self, lat->queue_self);
+		lat->queue_other = max(delta_other, lat->queue_other);
+	}
+
+	bpf_task_storage_delete(&start, next);
+	return 0;
+}
+
+SEC("iter/cgroup")
+int dump_cgroup(struct bpf_iter__cgroup *ctx)
+{
+	struct seq_file *seq = ctx->meta->seq;
+	struct cgroup *cgroup = ctx->cgroup;
+	struct wait_lat *lat;
+	u64 id = cgroup->kn->id;
+
+	BPF_SEQ_PRINTF(seq, "cgroup_id: %8lu\n", id);
+	lat = bpf_map_lookup_elem(&cgroup_lat, &id);
+	if (lat) {
+		BPF_SEQ_PRINTF(seq, "queue_self: %8lu\n", lat->queue_self);
+		BPF_SEQ_PRINTF(seq, "queue_other: %8lu\n", lat->queue_other);
+	} else {
+		/* print anyway for universal parsing logic in userspace. */
+		BPF_SEQ_PRINTF(seq, "queue_self: %8d\n", 0);
+		BPF_SEQ_PRINTF(seq, "queue_other: %8d\n", 0);
+	}
+	return 0;
+}