From patchwork Wed Jan 17 21:56:17 2024
X-Patchwork-Submitter: Amery Hung
X-Patchwork-Id: 13522185
X-Patchwork-Delegate: bpf@iogearbox.net
X-Patchwork-State: RFC
From: Amery Hung
To: netdev@vger.kernel.org
Cc: bpf@vger.kernel.org, yangpeihao@sjtu.edu.cn, toke@redhat.com,
    jhs@mojatatu.com, jiri@resnulli.us, sdf@google.com,
    xiyou.wangcong@gmail.com, yepeilin.cs@gmail.com
Subject: [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc
Date: Wed, 17 Jan 2024 21:56:17 +0000
Message-Id: <232881645a5c4c05a35df4ff1f08a19ef9a02662.1705432850.git.amery.hung@bytedance.com>
X-Mailing-List: bpf@vger.kernel.org

From: Cong Wang

Introduce a new Qdisc that is completely managed by an eBPF program of
type BPF_PROG_TYPE_QDISC. It accepts two eBPF programs of the same
type: one for enqueue and the other for dequeue. It interacts with the
Qdisc layer in two ways: 1) it relies on the Qdisc watchdog to handle
throttling; 2) it can pass the skb enqueue/dequeue down to child
classes.

The context is used differently for enqueue and dequeue, as shown
below:

 ┌──────────┬───────────────┬──────────────────────────────────┐
 │ prog     │ input         │ output                           │
 ├──────────┼───────────────┼──────────────────────────────────┤
 │          │ ctx->skb      │ SCH_BPF_THROTTLE: ctx->expire    │
 │          │               │                   ctx->delta_ns  │
 │          │ ctx->classid  │                                  │
 │          │               │ SCH_BPF_QUEUED:   None           │
 │          │               │                                  │
 │          │               │ SCH_BPF_BYPASS:   None           │
 │ enqueue  │               │                                  │
 │          │               │ SCH_BPF_STOLEN:   None           │
 │          │               │                                  │
 │          │               │ SCH_BPF_DROP:     None           │
 │          │               │                                  │
 │          │               │ SCH_BPF_CN:       None           │
 │          │               │                                  │
 │          │               │ SCH_BPF_PASS:     ctx->classid   │
 ├──────────┼───────────────┼──────────────────────────────────┤
 │          │ ctx->classid  │ SCH_BPF_THROTTLE: ctx->expire    │
 │          │               │                   ctx->delta_ns  │
 │          │               │                                  │
 │ dequeue  │               │ SCH_BPF_DEQUEUED: None           │
 │          │               │                                  │
 │          │               │ SCH_BPF_DROP:     None           │
 │          │               │                                  │
 │          │               │ SCH_BPF_PASS:     ctx->classid   │
 └──────────┴───────────────┴──────────────────────────────────┘

Signed-off-by: Cong Wang
Co-developed-by: Amery Hung
Signed-off-by: Amery Hung
---
 include/linux/bpf_types.h      |   4 +
 include/uapi/linux/bpf.h       |  21 ++
 include/uapi/linux/pkt_sched.h |  16 +
 kernel/bpf/btf.c               |   5 +
 kernel/bpf/helpers.c           |   1 +
 kernel/bpf/syscall.c           |   8 +
 net/core/filter.c              |  96 ++++++
 net/sched/Kconfig              |  15 +
 net/sched/Makefile             |   1 +
 net/sched/sch_bpf.c            | 537 +++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  21 ++
 11 files changed, 725 insertions(+)
 create mode 100644 net/sched/sch_bpf.c

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index a4247377e951..3e35033a9126 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -83,6 +83,10 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_SYSCALL, bpf_syscall,
 BPF_PROG_TYPE(BPF_PROG_TYPE_NETFILTER, netfilter,
	      struct bpf_nf_ctx, struct bpf_nf_ctx)
 #endif
+#ifdef CONFIG_NET
+BPF_PROG_TYPE(BPF_PROG_TYPE_QDISC, tc_qdisc,
+	      struct bpf_qdisc_ctx, struct bpf_qdisc_ctx)
+#endif

 BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0bb92414c036..df280bbb7c0d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -997,6 +997,7 @@ enum bpf_prog_type {
	BPF_PROG_TYPE_SK_LOOKUP,
	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
	BPF_PROG_TYPE_NETFILTER,
+	BPF_PROG_TYPE_QDISC,
 };

 enum bpf_attach_type {
@@ -1056,6 +1057,8 @@ enum bpf_attach_type {
	BPF_CGROUP_UNIX_GETSOCKNAME,
	BPF_NETKIT_PRIMARY,
	BPF_NETKIT_PEER,
+	BPF_QDISC_ENQUEUE,
+	BPF_QDISC_DEQUEUE,
	__MAX_BPF_ATTACH_TYPE
 };

@@ -7357,4 +7360,22 @@ struct bpf_iter_num {
	__u64 __opaque[1];
 } __attribute__((aligned(8)));

+struct bpf_qdisc_ctx {
+	__bpf_md_ptr(struct sk_buff *, skb);
+	__u32 classid;
+	__u64 expire;
+	__u64 delta_ns;
+};
+
+enum {
+	SCH_BPF_QUEUED,
+	SCH_BPF_DEQUEUED = SCH_BPF_QUEUED,
+	SCH_BPF_DROP,
+	SCH_BPF_CN,
+	SCH_BPF_THROTTLE,
+	SCH_BPF_PASS,
+	SCH_BPF_BYPASS,
+	SCH_BPF_STOLEN,
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index f762a10bfb78..d05462309f5a 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1317,4 +1317,20 @@ enum {

 #define TCA_ETS_MAX (__TCA_ETS_MAX - 1)

+#define TCA_SCH_BPF_FLAG_DIRECT _BITUL(0)
+enum {
+	TCA_SCH_BPF_UNSPEC,
+	TCA_SCH_BPF_ENQUEUE_PROG_NAME,	/* string */
+	TCA_SCH_BPF_ENQUEUE_PROG_FD,	/* u32 */
+	TCA_SCH_BPF_ENQUEUE_PROG_ID,	/* u32 */
+	TCA_SCH_BPF_ENQUEUE_PROG_TAG,	/* data */
+	TCA_SCH_BPF_DEQUEUE_PROG_NAME,	/* string */
+	TCA_SCH_BPF_DEQUEUE_PROG_FD,	/* u32 */
+	TCA_SCH_BPF_DEQUEUE_PROG_ID,	/* u32 */
+	TCA_SCH_BPF_DEQUEUE_PROG_TAG,	/* data */
+	__TCA_SCH_BPF_MAX,
+};
+
+#define TCA_SCH_BPF_MAX (__TCA_SCH_BPF_MAX - 1)
+
 #endif
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 15d71d2986d3..ee8d6c127b04 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -217,6 +217,7 @@ enum btf_kfunc_hook {
	BTF_KFUNC_HOOK_SOCKET_FILTER,
	BTF_KFUNC_HOOK_LWT,
	BTF_KFUNC_HOOK_NETFILTER,
+	BTF_KFUNC_HOOK_QDISC,
	BTF_KFUNC_HOOK_MAX,
 };

@@ -5928,6 +5929,8 @@ static bool prog_args_trusted(const struct bpf_prog *prog)
		return bpf_lsm_is_trusted(prog);
	case BPF_PROG_TYPE_STRUCT_OPS:
		return true;
+	case BPF_PROG_TYPE_QDISC:
+		return true;
	default:
		return false;
	}
@@ -7865,6 +7868,8 @@ static int bpf_prog_type_to_kfunc_hook(enum bpf_prog_type prog_type)
		return BTF_KFUNC_HOOK_LWT;
	case BPF_PROG_TYPE_NETFILTER:
		return BTF_KFUNC_HOOK_NETFILTER;
+	case BPF_PROG_TYPE_QDISC:
+		return BTF_KFUNC_HOOK_QDISC;
	default:
		return BTF_KFUNC_HOOK_MAX;
	}
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 56b0c1f678ee..d5e581ccd9a0 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -2610,6 +2610,7 @@ static int __init kfunc_init(void)

	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &generic_kfunc_set);
	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &generic_kfunc_set);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_QDISC, &generic_kfunc_set);
	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &generic_kfunc_set);
	ret = ret ?: register_btf_id_dtor_kfuncs(generic_dtors,
						 ARRAY_SIZE(generic_dtors),
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 13eb50446e7a..1838bddd8526 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2502,6 +2502,14 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
		if (expected_attach_type == BPF_NETFILTER)
			return 0;
		return -EINVAL;
+	case BPF_PROG_TYPE_QDISC:
+		switch (expected_attach_type) {
+		case BPF_QDISC_ENQUEUE:
+		case BPF_QDISC_DEQUEUE:
+			return 0;
+		default:
+			return -EINVAL;
+		}
	case BPF_PROG_TYPE_SYSCALL:
	case BPF_PROG_TYPE_EXT:
		if (expected_attach_type)
diff --git a/net/core/filter.c b/net/core/filter.c
index 383f96b0a1c7..f25a0b6b5d56 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8889,6 +8889,90 @@ static int tc_cls_act_btf_struct_access(struct bpf_verifier_log *log,
	return ret;
 }

+static int tc_qdisc_prologue(struct bpf_insn *insn_buf, bool direct_write,
+			     const struct bpf_prog *prog)
+{
+	return bpf_unclone_prologue(insn_buf, direct_write, prog,
+				    SCH_BPF_DROP);
+}
+
+BTF_ID_LIST_SINGLE(tc_qdisc_ctx_access_btf_ids, struct, sk_buff)
+
+static bool tc_qdisc_is_valid_access(int off, int size,
+				     enum bpf_access_type type,
+				     const struct bpf_prog *prog,
+				     struct bpf_insn_access_aux *info)
+{
+	struct btf *btf;
+
+	if (off < 0 || off >= sizeof(struct bpf_qdisc_ctx))
+		return false;
+
+	switch (off) {
+	case offsetof(struct bpf_qdisc_ctx, skb):
+		if (type == BPF_WRITE ||
+		    size != sizeof_field(struct bpf_qdisc_ctx, skb))
+			return false;
+
+		if (prog->expected_attach_type != BPF_QDISC_ENQUEUE)
+			return false;
+
+		btf = bpf_get_btf_vmlinux();
+		if (IS_ERR_OR_NULL(btf))
+			return false;
+
+		info->btf = btf;
+		info->btf_id = tc_qdisc_ctx_access_btf_ids[0];
+		info->reg_type = PTR_TO_BTF_ID | PTR_TRUSTED;
+		return true;
+	case bpf_ctx_range(struct bpf_qdisc_ctx, classid):
+		return size == sizeof_field(struct bpf_qdisc_ctx, classid);
+	case bpf_ctx_range(struct bpf_qdisc_ctx, expire):
+		return size == sizeof_field(struct bpf_qdisc_ctx, expire);
+	case bpf_ctx_range(struct bpf_qdisc_ctx, delta_ns):
+		return size == sizeof_field(struct bpf_qdisc_ctx, delta_ns);
+	default:
+		return false;
+	}
+
+	return false;
+}
+
+static int tc_qdisc_btf_struct_access(struct bpf_verifier_log *log,
+				      const struct bpf_reg_state *reg,
+				      int off, int size)
+{
+	const struct btf_type *skbt, *t;
+	size_t end;
+
+	skbt = btf_type_by_id(reg->btf, tc_qdisc_ctx_access_btf_ids[0]);
+	t = btf_type_by_id(reg->btf, reg->btf_id);
+	if (t != skbt)
+		return -EACCES;
+
+	switch (off) {
+	case offsetof(struct sk_buff, cb) ...
+	     offsetofend(struct sk_buff, cb) - 1:
+		end = offsetofend(struct sk_buff, cb);
+		break;
+	case offsetof(struct sk_buff, tstamp):
+		end = offsetofend(struct sk_buff, tstamp);
+		break;
+	default:
+		bpf_log(log, "no write support to skb at off %d\n", off);
+		return -EACCES;
+	}
+
+	if (off + size > end) {
+		bpf_log(log,
+			"write access at off %d with size %d beyond the member of sk_buff ended at %zu\n",
+			off, size, end);
+		return -EACCES;
+	}
+
+	return 0;
+}
+
 static bool __is_valid_xdp_access(int off, int size)
 {
	if (off < 0 || off >= sizeof(struct xdp_md))
@@ -10890,6 +10974,18 @@ const struct bpf_prog_ops tc_cls_act_prog_ops = {
	.test_run		= bpf_prog_test_run_skb,
 };

+const struct bpf_verifier_ops tc_qdisc_verifier_ops = {
+	.get_func_proto		= tc_cls_act_func_proto,
+	.is_valid_access	= tc_qdisc_is_valid_access,
+	.gen_prologue		= tc_qdisc_prologue,
+	.gen_ld_abs		= bpf_gen_ld_abs,
+	.btf_struct_access	= tc_qdisc_btf_struct_access,
+};
+
+const struct bpf_prog_ops tc_qdisc_prog_ops = {
+	.test_run		= bpf_prog_test_run_skb,
+};
+
 const struct bpf_verifier_ops xdp_verifier_ops = {
	.get_func_proto		= xdp_func_proto,
	.is_valid_access	= xdp_is_valid_access,
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 470c70deffe2..e4ece091af4d 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -403,6 +403,21 @@ config NET_SCH_ETS

	  If unsure, say N.

+config NET_SCH_BPF
+	tristate "eBPF based programmable queue discipline"
+	help
+	  This eBPF based queue discipline offers a way to program your
+	  own packet scheduling algorithm. This is a classful qdisc which
+	  also allows you to decide the hierarchy.
+
+	  Say Y here if you want to use the eBPF based programmable queue
+	  discipline.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called sch_bpf.
+
+	  If unsure, say N.
+
 menuconfig NET_SCH_DEFAULT
	bool "Allow override default queue discipline"
	help
diff --git a/net/sched/Makefile b/net/sched/Makefile
index b5fd49641d91..4e24c6c79cb8 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -63,6 +63,7 @@ obj-$(CONFIG_NET_SCH_FQ_PIE)	+= sch_fq_pie.o
 obj-$(CONFIG_NET_SCH_CBS)	+= sch_cbs.o
 obj-$(CONFIG_NET_SCH_ETF)	+= sch_etf.o
 obj-$(CONFIG_NET_SCH_TAPRIO)	+= sch_taprio.o
+obj-$(CONFIG_NET_SCH_BPF)	+= sch_bpf.o
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
diff --git a/net/sched/sch_bpf.c b/net/sched/sch_bpf.c
new file mode 100644
index 000000000000..56f3ab9c6059
--- /dev/null
+++ b/net/sched/sch_bpf.c
@@ -0,0 +1,537 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Programmable Qdisc with eBPF
+ *
+ * Copyright (C) 2022, ByteDance, Cong Wang
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#define ACT_BPF_NAME_LEN	256
+
+struct sch_bpf_prog {
+	struct bpf_prog *prog;
+	const char *name;
+};
+
+struct sch_bpf_class {
+	struct Qdisc_class_common common;
+	struct Qdisc *qdisc;
+
+	unsigned int drops;
+	unsigned int overlimits;
+	struct gnet_stats_basic_sync bstats;
+};
+
+struct bpf_sched_data {
+	struct tcf_proto __rcu *filter_list; /* optional external classifier */
+	struct tcf_block *block;
+	struct Qdisc_class_hash clhash;
+	struct sch_bpf_prog __rcu enqueue_prog;
+	struct sch_bpf_prog __rcu dequeue_prog;
+
+	struct qdisc_watchdog watchdog;
+};
+
+static int sch_bpf_dump_prog(const struct sch_bpf_prog *prog, struct sk_buff *skb,
+			     int name, int id, int tag)
+{
+	struct nlattr *nla;
+
+	if (prog->name &&
+	    nla_put_string(skb, name, prog->name))
+		return -EMSGSIZE;
+
+	if (nla_put_u32(skb, id, prog->prog->aux->id))
+		return -EMSGSIZE;
+
+	nla = nla_reserve(skb, tag, sizeof(prog->prog->tag));
+	if (!nla)
+		return -EMSGSIZE;
+
+	memcpy(nla_data(nla), prog->prog->tag, nla_len(nla));
+	return 0;
+}
+
+static int sch_bpf_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+	struct nlattr *opts;
+
+	opts = nla_nest_start_noflag(skb, TCA_OPTIONS);
+	if (!opts)
+		goto nla_put_failure;
+
+	if (sch_bpf_dump_prog(&q->enqueue_prog, skb, TCA_SCH_BPF_ENQUEUE_PROG_NAME,
+			      TCA_SCH_BPF_ENQUEUE_PROG_ID, TCA_SCH_BPF_ENQUEUE_PROG_TAG))
+		goto nla_put_failure;
+	if (sch_bpf_dump_prog(&q->dequeue_prog, skb, TCA_SCH_BPF_DEQUEUE_PROG_NAME,
+			      TCA_SCH_BPF_DEQUEUE_PROG_ID, TCA_SCH_BPF_DEQUEUE_PROG_TAG))
+		goto nla_put_failure;
+
+	return nla_nest_end(skb, opts);
+
+nla_put_failure:
+	return -1;
+}
+
+static int sch_bpf_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+	return 0;
+}
+
+static struct sch_bpf_class *sch_bpf_find(struct Qdisc *sch, u32 classid)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+	struct Qdisc_class_common *clc;
+
+	clc = qdisc_class_find(&q->clhash, classid);
+	if (!clc)
+		return NULL;
+	return container_of(clc, struct sch_bpf_class, common);
+}
+
+static int sch_bpf_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			   struct sk_buff **to_free)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+	unsigned int len = qdisc_pkt_len(skb);
+	struct bpf_qdisc_ctx ctx = {};
+	int res = NET_XMIT_SUCCESS;
+	struct sch_bpf_class *cl;
+	struct bpf_prog *enqueue;
+
+	enqueue = rcu_dereference(q->enqueue_prog.prog);
+	if (!enqueue)
+		return NET_XMIT_DROP;
+
+	ctx.skb = skb;
+	ctx.classid = sch->handle;
+	res = bpf_prog_run(enqueue, &ctx);
+	switch (res) {
+	case SCH_BPF_THROTTLE:
+		qdisc_watchdog_schedule_range_ns(&q->watchdog, ctx.expire, ctx.delta_ns);
+		qdisc_qstats_overlimit(sch);
+		fallthrough;
+	case SCH_BPF_QUEUED:
+		qdisc_qstats_backlog_inc(sch, skb);
+		return NET_XMIT_SUCCESS;
+	case SCH_BPF_BYPASS:
+		qdisc_qstats_drop(sch);
+		__qdisc_drop(skb, to_free);
+		return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+	case SCH_BPF_STOLEN:
+		__qdisc_drop(skb, to_free);
+		return NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
+	case SCH_BPF_CN:
+		return NET_XMIT_CN;
+	case SCH_BPF_PASS:
+		break;
+	default:
+		return qdisc_drop(skb, sch, to_free);
+	}
+
+	cl = sch_bpf_find(sch, ctx.classid);
+	if (!cl || !cl->qdisc)
+		return qdisc_drop(skb, sch, to_free);
+
+	res = qdisc_enqueue(skb, cl->qdisc, to_free);
+	if (res != NET_XMIT_SUCCESS) {
+		if (net_xmit_drop_count(res)) {
+			qdisc_qstats_drop(sch);
+			cl->drops++;
+		}
+		return res;
+	}
+
+	sch->qstats.backlog += len;
+	sch->q.qlen++;
+	return res;
+}
+
+DEFINE_PER_CPU(struct sk_buff*, bpf_skb_dequeue);
+
+static struct sk_buff *sch_bpf_dequeue(struct Qdisc *sch)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+	struct bpf_qdisc_ctx ctx = {};
+	struct sk_buff *skb = NULL;
+	struct bpf_prog *dequeue;
+	struct sch_bpf_class *cl;
+	int res;
+
+	dequeue = rcu_dereference(q->dequeue_prog.prog);
+	if (!dequeue)
+		return NULL;
+
+	__this_cpu_write(bpf_skb_dequeue, NULL);
+	ctx.classid = sch->handle;
+	res = bpf_prog_run(dequeue, &ctx);
+	switch (res) {
+	case SCH_BPF_DEQUEUED:
+		skb = __this_cpu_read(bpf_skb_dequeue);
+		qdisc_bstats_update(sch, skb);
+		qdisc_qstats_backlog_dec(sch, skb);
+		break;
+	case SCH_BPF_THROTTLE:
+		qdisc_watchdog_schedule_range_ns(&q->watchdog, ctx.expire, ctx.delta_ns);
+		qdisc_qstats_overlimit(sch);
+		cl = sch_bpf_find(sch, ctx.classid);
+		if (cl)
+			cl->overlimits++;
+		return NULL;
+	case SCH_BPF_PASS:
+		cl = sch_bpf_find(sch, ctx.classid);
+		if (!cl || !cl->qdisc)
+			return NULL;
+		skb = qdisc_dequeue_peeked(cl->qdisc);
+		if (skb) {
+			bstats_update(&cl->bstats, skb);
+			qdisc_bstats_update(sch, skb);
+			qdisc_qstats_backlog_dec(sch, skb);
+			sch->q.qlen--;
+		}
+		break;
+	}
+
+	return skb;
+}
+
+static struct Qdisc *sch_bpf_leaf(struct Qdisc *sch, unsigned long arg)
+{
+	struct sch_bpf_class *cl = (struct sch_bpf_class *)arg;
+
+	return cl->qdisc;
+}
+
+static int sch_bpf_graft(struct Qdisc *sch, unsigned long arg, struct Qdisc *new,
+			 struct Qdisc **old, struct netlink_ext_ack *extack)
+{
+	struct sch_bpf_class *cl = (struct sch_bpf_class *)arg;
+
+	if (new)
+		*old = qdisc_replace(sch, new, &cl->qdisc);
+	return 0;
+}
+
+static unsigned long sch_bpf_bind(struct Qdisc *sch, unsigned long parent,
+				  u32 classid)
+{
+	return 0;
+}
+
+static void sch_bpf_unbind(struct Qdisc *q, unsigned long cl)
+{
+}
+
+static unsigned long sch_bpf_search(struct Qdisc *sch, u32 handle)
+{
+	return (unsigned long)sch_bpf_find(sch, handle);
+}
+
+static struct tcf_block *sch_bpf_tcf_block(struct Qdisc *sch, unsigned long cl,
+					   struct netlink_ext_ack *extack)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+
+	if (cl)
+		return NULL;
+	return q->block;
+}
+
+static const struct nla_policy sch_bpf_policy[TCA_SCH_BPF_MAX + 1] = {
+	[TCA_SCH_BPF_ENQUEUE_PROG_FD]	= { .type = NLA_U32 },
+	[TCA_SCH_BPF_ENQUEUE_PROG_NAME]	= { .type = NLA_NUL_STRING,
+					    .len = ACT_BPF_NAME_LEN },
+	[TCA_SCH_BPF_DEQUEUE_PROG_FD]	= { .type = NLA_U32 },
+	[TCA_SCH_BPF_DEQUEUE_PROG_NAME]	= { .type = NLA_NUL_STRING,
+					    .len = ACT_BPF_NAME_LEN },
+};
+
+static int bpf_init_prog(struct nlattr *fd, struct nlattr *name, struct sch_bpf_prog *prog)
+{
+	struct bpf_prog *fp, *old_fp;
+	char *prog_name = NULL;
+	u32 bpf_fd;
+
+	if (!fd)
+		return -EINVAL;
+	bpf_fd = nla_get_u32(fd);
+
+	fp = bpf_prog_get_type(bpf_fd, BPF_PROG_TYPE_QDISC);
+	if (IS_ERR(fp))
+		return PTR_ERR(fp);
+
+	if (name) {
+		prog_name = nla_memdup(name, GFP_KERNEL);
+		if (!prog_name) {
+			bpf_prog_put(fp);
+			return -ENOMEM;
+		}
+	}
+
+	prog->name = prog_name;
+
+	/* updates to prog->prog are prevented since the caller holds
+	 * sch_tree_lock
+	 */
+	old_fp = rcu_replace_pointer(prog->prog, fp, 1);
+	if (old_fp)
+		bpf_prog_put(old_fp);
+
+	return 0;
+}
+
+static void bpf_cleanup_prog(struct sch_bpf_prog *prog)
+{
+	struct bpf_prog *old_fp = NULL;
+
+	/* updates to prog->prog are prevented since the caller holds
+	 * sch_tree_lock
+	 */
+	old_fp = rcu_replace_pointer(prog->prog, old_fp, 1);
+	if (old_fp)
+		bpf_prog_put(old_fp);
+
+	kfree(prog->name);
+}
+
+static int sch_bpf_change(struct Qdisc *sch, struct nlattr *opt,
+			  struct netlink_ext_ack *extack)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+	struct nlattr *tb[TCA_SCH_BPF_MAX + 1];
+	int err;
+
+	if (!opt)
+		return -EINVAL;
+
+	err = nla_parse_nested_deprecated(tb, TCA_SCH_BPF_MAX, opt,
+					  sch_bpf_policy, NULL);
+	if (err < 0)
+		return err;
+
+	sch_tree_lock(sch);
+
+	err = bpf_init_prog(tb[TCA_SCH_BPF_ENQUEUE_PROG_FD],
+			    tb[TCA_SCH_BPF_ENQUEUE_PROG_NAME], &q->enqueue_prog);
+	if (err)
+		goto failure;
+	err = bpf_init_prog(tb[TCA_SCH_BPF_DEQUEUE_PROG_FD],
+			    tb[TCA_SCH_BPF_DEQUEUE_PROG_NAME], &q->dequeue_prog);
+failure:
+	sch_tree_unlock(sch);
+	return err;
+}
+
+static int sch_bpf_init(struct Qdisc *sch, struct nlattr *opt,
+			struct netlink_ext_ack *extack)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+	int err;
+
+	qdisc_watchdog_init(&q->watchdog, sch);
+	if (opt) {
+		err = sch_bpf_change(sch, opt, extack);
+		if (err)
+			return err;
+	}
+
+	err = tcf_block_get(&q->block, &q->filter_list, sch, extack);
+	if (err)
+		return err;
+
+	return qdisc_class_hash_init(&q->clhash);
+}
+
+static void sch_bpf_reset(struct Qdisc *sch)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+	struct sch_bpf_class *cl;
+	unsigned int i;
+
+	for (i = 0; i < q->clhash.hashsize; i++) {
+		hlist_for_each_entry(cl, &q->clhash.hash[i], common.hnode) {
+			if (cl->qdisc)
+				qdisc_reset(cl->qdisc);
+		}
+	}
+
+	qdisc_watchdog_cancel(&q->watchdog);
+}
+
+static void sch_bpf_destroy_class(struct Qdisc *sch, struct sch_bpf_class *cl)
+{
+	qdisc_put(cl->qdisc);
+	kfree(cl);
+}
+
+static void sch_bpf_destroy(struct Qdisc *sch)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+	struct sch_bpf_class *cl;
+	unsigned int i;
+
+	qdisc_watchdog_cancel(&q->watchdog);
+	tcf_block_put(q->block);
+	for (i = 0; i < q->clhash.hashsize; i++) {
+		hlist_for_each_entry(cl, &q->clhash.hash[i], common.hnode) {
+			sch_bpf_destroy_class(sch, cl);
+		}
+	}
+
+	qdisc_class_hash_destroy(&q->clhash);
+
+	sch_tree_lock(sch);
+	bpf_cleanup_prog(&q->enqueue_prog);
+	bpf_cleanup_prog(&q->dequeue_prog);
+	sch_tree_unlock(sch);
+}
+
+static int sch_bpf_change_class(struct Qdisc *sch, u32 classid,
+				u32 parentid, struct nlattr **tca,
+				unsigned long *arg,
+				struct netlink_ext_ack *extack)
+{
+	struct sch_bpf_class *cl = (struct sch_bpf_class *)*arg;
+	struct bpf_sched_data *q = qdisc_priv(sch);
+
+	if (!cl) {
+		if (classid == 0 || TC_H_MAJ(classid ^ sch->handle) != 0 ||
+		    sch_bpf_find(sch, classid))
+			return -EINVAL;
+
+		cl = kzalloc(sizeof(*cl), GFP_KERNEL);
+		if (!cl)
+			return -ENOBUFS;
+
+		cl->common.classid = classid;
+		gnet_stats_basic_sync_init(&cl->bstats);
+		qdisc_class_hash_insert(&q->clhash, &cl->common);
+	}
+
+	qdisc_class_hash_grow(sch, &q->clhash);
+	*arg = (unsigned long)cl;
+	return 0;
+}
+
+static int sch_bpf_delete(struct Qdisc *sch, unsigned long arg,
+			  struct netlink_ext_ack *extack)
+{
+	struct sch_bpf_class *cl = (struct sch_bpf_class *)arg;
+	struct bpf_sched_data *q = qdisc_priv(sch);
+
+	qdisc_class_hash_remove(&q->clhash, &cl->common);
+	if (cl->qdisc)
+		qdisc_put(cl->qdisc);
+	return 0;
+}
+
+static int sch_bpf_dump_class(struct Qdisc *sch, unsigned long arg,
+			      struct sk_buff *skb, struct tcmsg *tcm)
+{
+	return 0;
+}
+
+static int
+sch_bpf_dump_class_stats(struct Qdisc *sch, unsigned long arg, struct gnet_dump *d)
+{
+	struct sch_bpf_class *cl = (struct sch_bpf_class *)arg;
+	struct gnet_stats_queue qs = {
+		.drops = cl->drops,
+		.overlimits = cl->overlimits,
+	};
+	__u32 qlen = 0;
+
+	if (cl->qdisc)
+		qdisc_qstats_qlen_backlog(cl->qdisc, &qlen, &qs.backlog);
+	else
+		qlen = 0;
+
+	if (gnet_stats_copy_basic(d, NULL, &cl->bstats, true) < 0 ||
+	    gnet_stats_copy_queue(d, NULL, &qs, qlen) < 0)
+		return -1;
+	return 0;
+}
+
+static void sch_bpf_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+	struct sch_bpf_class *cl;
+	unsigned int i;
+
+	if (arg->stop)
+		return;
+
+	for (i = 0; i < q->clhash.hashsize; i++) {
+		hlist_for_each_entry(cl, &q->clhash.hash[i], common.hnode) {
+			if (arg->count < arg->skip) {
+				arg->count++;
+				continue;
+			}
+			if (arg->fn(sch, (unsigned long)cl, arg) < 0) {
+				arg->stop = 1;
+				return;
+			}
+			arg->count++;
+		}
+	}
+}
+
+static const struct Qdisc_class_ops sch_bpf_class_ops = {
+	.graft		= sch_bpf_graft,
+	.leaf		= sch_bpf_leaf,
+	.find		= sch_bpf_search,
+	.change		= sch_bpf_change_class,
+	.delete		= sch_bpf_delete,
+	.tcf_block	= sch_bpf_tcf_block,
+	.bind_tcf	= sch_bpf_bind,
+	.unbind_tcf	= sch_bpf_unbind,
+	.dump		= sch_bpf_dump_class,
+	.dump_stats	= sch_bpf_dump_class_stats,
+	.walk		= sch_bpf_walk,
+};
+
+static struct Qdisc_ops sch_bpf_qdisc_ops __read_mostly = {
+	.cl_ops		= &sch_bpf_class_ops,
+	.id		= "bpf",
+	.priv_size	= sizeof(struct bpf_sched_data),
+	.enqueue	= sch_bpf_enqueue,
+	.dequeue	= sch_bpf_dequeue,
+	.peek		= qdisc_peek_dequeued,
+	.init		= sch_bpf_init,
+	.reset		= sch_bpf_reset,
+	.destroy	= sch_bpf_destroy,
+	.change		= sch_bpf_change,
+	.dump		= sch_bpf_dump,
+	.dump_stats	= sch_bpf_dump_stats,
+	.owner		= THIS_MODULE,
+};
+
+static int __init sch_bpf_mod_init(void)
+{
+	return register_qdisc(&sch_bpf_qdisc_ops);
+}
+
+static void __exit sch_bpf_mod_exit(void)
+{
+	unregister_qdisc(&sch_bpf_qdisc_ops);
+}
+
+module_init(sch_bpf_mod_init)
+module_exit(sch_bpf_mod_exit)
+MODULE_AUTHOR("Cong Wang");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("eBPF queue discipline");
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 0bb92414c036..df280bbb7c0d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -997,6 +997,7 @@ enum bpf_prog_type {
	BPF_PROG_TYPE_SK_LOOKUP,
	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
	BPF_PROG_TYPE_NETFILTER,
+	BPF_PROG_TYPE_QDISC,
 };

 enum bpf_attach_type {
@@ -1056,6 +1057,8 @@ enum bpf_attach_type {
	BPF_CGROUP_UNIX_GETSOCKNAME,
	BPF_NETKIT_PRIMARY,
	BPF_NETKIT_PEER,
+	BPF_QDISC_ENQUEUE,
+	BPF_QDISC_DEQUEUE,
	__MAX_BPF_ATTACH_TYPE
 };

@@ -7357,4 +7360,22 @@ struct bpf_iter_num {
	__u64 __opaque[1];
 } __attribute__((aligned(8)));

+struct bpf_qdisc_ctx {
+	__bpf_md_ptr(struct sk_buff *, skb);
+	__u32 classid;
+	__u64 expire;
+	__u64 delta_ns;
+};
+
+enum {
+	SCH_BPF_QUEUED,
+	SCH_BPF_DEQUEUED = SCH_BPF_QUEUED,
+	SCH_BPF_DROP,
+	SCH_BPF_CN,
+	SCH_BPF_THROTTLE,
+	SCH_BPF_PASS,
+	SCH_BPF_BYPASS,
+	SCH_BPF_STOLEN,
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
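To make the enqueue/dequeue contract in the table above concrete, here
is a minimal sketch of a program pair that steers every packet to a
single child class and lets the Qdisc layer do the actual queueing.
The classid value and the section names are illustrative assumptions,
not something this patch defines; only struct bpf_qdisc_ctx and the
SCH_BPF_* codes come from the UAPI header added here.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("qdisc/enqueue")
int enqueue_prog(struct bpf_qdisc_ctx *ctx)
{
	ctx->classid = 0x00010001;	/* hand the skb to class 1:1 (hypothetical) */
	return SCH_BPF_PASS;		/* kernel enqueues to that child class */
}

SEC("qdisc/dequeue")
int dequeue_prog(struct bpf_qdisc_ctx *ctx)
{
	ctx->classid = 0x00010001;	/* pull the next packet from the same class */
	return SCH_BPF_PASS;
}

char _license[] SEC("license") = "GPL";

Returning SCH_BPF_THROTTLE with ctx->expire/ctx->delta_ns set would
instead arm the Qdisc watchdog, matching sch_bpf_enqueue() above.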
From patchwork Wed Jan 17 21:56:18 2024
X-Patchwork-Submitter: Amery Hung
X-Patchwork-Id: 13522184
X-Patchwork-Delegate: kuba@kernel.org
X-Patchwork-State: RFC
From: Amery Hung
To: netdev@vger.kernel.org
Cc: bpf@vger.kernel.org, yangpeihao@sjtu.edu.cn, toke@redhat.com,
    jhs@mojatatu.com, jiri@resnulli.us, sdf@google.com,
    xiyou.wangcong@gmail.com, yepeilin.cs@gmail.com
Subject: [RFC PATCH v7 2/8] net_sched: Add kfuncs for working with skb
Date: Wed, 17 Jan 2024 21:56:18 +0000
Message-Id: <2d31261b245828d09d2f80e0953e911a9c38573a.1705432850.git.amery.hung@bytedance.com>
X-Mailing-List: bpf@vger.kernel.org

From: Cong Wang

This patch introduces four kfuncs available to developers:

struct sk_buff *bpf_skb_acquire(struct sk_buff *skb);
void bpf_skb_release(struct sk_buff *skb);
void bpf_qdisc_set_skb_dequeue(struct sk_buff *skb);
u32 bpf_skb_get_hash(struct sk_buff *skb);

kptr is used to ensure the validity of skbs throughout their lifetime
in eBPF qdiscs. First, in the enqueue program, bpf_skb_acquire() can
be used to acquire a referenced kptr to an skb from ctx->skb. Then, it
can be stored in bpf maps or allocated objects serving as queues.
Otherwise, the program should call bpf_skb_release() to release the
reference. Finally, in the dequeue program, an skb kptr can be
exchanged out of queues and passed to bpf_qdisc_set_skb_dequeue() to
set the skb to be dequeued. The kfunc will also release the reference.

Since an skb kptr is incompatible with helpers taking __sk_buff,
bpf_skb_get_hash() is added for now for the ease of implementing
flow-based queueing algorithms.

Signed-off-by: Cong Wang
Co-developed-by: Amery Hung
Signed-off-by: Amery Hung
---
 net/sched/sch_bpf.c | 83 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 82 insertions(+), 1 deletion(-)
diff --git a/net/sched/sch_bpf.c b/net/sched/sch_bpf.c
index 56f3ab9c6059..b0e7c3a19c30 100644
--- a/net/sched/sch_bpf.c
+++ b/net/sched/sch_bpf.c
@@ -15,6 +15,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -520,9 +521,89 @@ static struct Qdisc_ops sch_bpf_qdisc_ops __read_mostly = {
	.owner		= THIS_MODULE,
 };

+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+		  "Global functions as their definitions will be in vmlinux BTF");
+
+/* bpf_skb_acquire - Acquire a reference to an skb. An skb acquired by this
+ * kfunc which is not stored in a map as a kptr, must be released by calling
+ * bpf_skb_release().
+ * @skb: The skb on which a reference is being acquired.
+ */
+__bpf_kfunc struct sk_buff *bpf_skb_acquire(struct sk_buff *skb)
+{
+	return skb_get(skb);
+}
+
+/* bpf_skb_release - Release the reference acquired on an skb.
+ * @skb: The skb on which a reference is being released.
+ */
+__bpf_kfunc void bpf_skb_release(struct sk_buff *skb)
+{
+	skb_unref(skb);
+}
+
+/* bpf_skb_destroy - Release an skb reference acquired and exchanged into
+ * an allocated object or a map.
+ * @skb: The skb on which a reference is being released.
+ */
+__bpf_kfunc void bpf_skb_destroy(struct sk_buff *skb)
+{
+	skb_unref(skb);
+	consume_skb(skb);
+}
+
+/* bpf_skb_get_hash - Get the flow hash of an skb.
+ * @skb: The skb to get the flow hash from.
+ */
+__bpf_kfunc u32 bpf_skb_get_hash(struct sk_buff *skb)
+{
+	return skb_get_hash(skb);
+}
+
+/* bpf_qdisc_set_skb_dequeue - Set the skb to be dequeued. This will also
+ * release the reference to the skb.
+ * @skb: The skb to be dequeued by the qdisc.
+ */
+__bpf_kfunc void bpf_qdisc_set_skb_dequeue(struct sk_buff *skb)
+{
+	consume_skb(skb);
+	__this_cpu_write(bpf_skb_dequeue, skb);
+}
+
+__diag_pop();
+
+BTF_SET8_START(skb_kfunc_btf_ids)
+BTF_ID_FLAGS(func, bpf_skb_acquire, KF_ACQUIRE)
+BTF_ID_FLAGS(func, bpf_skb_release, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_skb_get_hash)
+BTF_ID_FLAGS(func, bpf_qdisc_set_skb_dequeue, KF_RELEASE)
+BTF_SET8_END(skb_kfunc_btf_ids)
+
+static const struct btf_kfunc_id_set skb_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &skb_kfunc_btf_ids,
+};
+
+BTF_ID_LIST(skb_kfunc_dtor_ids)
+BTF_ID(struct, sk_buff)
+BTF_ID_FLAGS(func, bpf_skb_destroy, KF_RELEASE)
+
 static int __init sch_bpf_mod_init(void)
 {
-	return register_qdisc(&sch_bpf_qdisc_ops);
+	int ret;
+	const struct btf_id_dtor_kfunc skb_kfunc_dtors[] = {
+		{
+			.btf_id	      = skb_kfunc_dtor_ids[0],
+			.kfunc_btf_id = skb_kfunc_dtor_ids[1]
+		},
+	};
+
+	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_QDISC, &skb_kfunc_set);
+	ret = ret ?: register_btf_id_dtor_kfuncs(skb_kfunc_dtors,
+						 ARRAY_SIZE(skb_kfunc_dtors),
+						 THIS_MODULE);
+	return ret ?: register_qdisc(&sch_bpf_qdisc_ops);
 }

 static void __exit sch_bpf_mod_exit(void)
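To illustrate how these kfuncs compose, below is a minimal sketch of a
one-slot queue built on a referenced kptr stored in an array map. The
map layout, section names, and collision handling are illustrative
assumptions on top of this patch; a real qdisc would use a larger
structure (e.g. a bpf list or rbtree) and keep qlen/backlog accounting
consistent.

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct skb_slot {
	struct sk_buff __kptr *skb;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, struct skb_slot);
} queue SEC(".maps");

extern struct sk_buff *bpf_skb_acquire(struct sk_buff *skb) __ksym;
extern void bpf_skb_release(struct sk_buff *skb) __ksym;
extern void bpf_qdisc_set_skb_dequeue(struct sk_buff *skb) __ksym;

SEC("qdisc/enqueue")
int enqueue_prog(struct bpf_qdisc_ctx *ctx)
{
	struct sk_buff *skb, *old;
	struct skb_slot *slot;
	__u32 key = 0;

	slot = bpf_map_lookup_elem(&queue, &key);
	if (!slot)
		return SCH_BPF_DROP;

	skb = bpf_skb_acquire(ctx->skb);	/* take a referenced kptr */
	old = bpf_kptr_xchg(&slot->skb, skb);	/* store it in the map */
	if (old)
		bpf_skb_release(old);		/* toy behavior: displace old skb */
	return SCH_BPF_QUEUED;
}

SEC("qdisc/dequeue")
int dequeue_prog(struct bpf_qdisc_ctx *ctx)
{
	struct skb_slot *slot;
	struct sk_buff *skb;
	__u32 key = 0;

	slot = bpf_map_lookup_elem(&queue, &key);
	if (!slot)
		return SCH_BPF_DROP;

	skb = bpf_kptr_xchg(&slot->skb, NULL);	/* take the kptr back out */
	if (!skb)
		return SCH_BPF_DROP;

	bpf_qdisc_set_skb_dequeue(skb);		/* hand to kernel, drops our ref */
	return SCH_BPF_DEQUEUED;
}

char _license[] SEC("license") = "GPL";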
From patchwork Wed Jan 17 21:56:19 2024
X-Patchwork-Submitter: Amery Hung
X-Patchwork-Id: 13522186
X-Patchwork-Delegate: kuba@kernel.org
X-Patchwork-State: RFC
From: Amery Hung
To: netdev@vger.kernel.org
Cc: bpf@vger.kernel.org, yangpeihao@sjtu.edu.cn, toke@redhat.com,
    jhs@mojatatu.com, jiri@resnulli.us, sdf@google.com,
    xiyou.wangcong@gmail.com, yepeilin.cs@gmail.com
Subject: [RFC PATCH v7 3/8] net_sched: Introduce kfunc bpf_skb_tc_classify()
Date: Wed, 17 Jan 2024 21:56:19 +0000
X-Mailing-List: bpf@vger.kernel.org

From: Cong Wang

Introduce a kfunc, bpf_skb_tc_classify(), to reuse existing TC filters
on *any* Qdisc to classify the skb.
Signed-off-by: Cong Wang
Co-developed-by: Amery Hung
Signed-off-by: Amery Hung
---
 net/sched/sch_bpf.c | 68 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/net/sched/sch_bpf.c b/net/sched/sch_bpf.c
index b0e7c3a19c30..1910a58a3352 100644
--- a/net/sched/sch_bpf.c
+++ b/net/sched/sch_bpf.c
@@ -571,6 +571,73 @@ __bpf_kfunc void bpf_qdisc_set_skb_dequeue(struct sk_buff *skb)
	__this_cpu_write(bpf_skb_dequeue, skb);
 }

+/* bpf_skb_tc_classify - Classify an skb using an existing filter referred
+ * to by the specified handle on the net device of index ifindex.
+ * @skb: The skb to be classified.
+ * @handle: The handle of the filter to be referenced.
+ * @ifindex: The ifindex of the net device where the filter is attached.
+ *
+ * Returns a 64-bit integer containing the tc action verdict and the classid,
+ * created as classid << 32 | action.
+ */
+__bpf_kfunc u64 bpf_skb_tc_classify(struct sk_buff *skb, int ifindex,
+				    u32 handle)
+{
+	struct net *net = dev_net(skb->dev);
+	const struct Qdisc_class_ops *cops;
+	struct tcf_result res = {};
+	struct tcf_block *block;
+	struct tcf_chain *chain;
+	struct net_device *dev;
+	int result = TC_ACT_OK;
+	unsigned long cl = 0;
+	struct Qdisc *q;
+
+	rcu_read_lock();
+	dev = dev_get_by_index_rcu(net, ifindex);
+	if (!dev)
+		goto out;
+	q = qdisc_lookup_rcu(dev, handle);
+	if (!q)
+		goto out;
+
+	cops = q->ops->cl_ops;
+	if (!cops)
+		goto out;
+	if (!cops->tcf_block)
+		goto out;
+	if (TC_H_MIN(handle)) {
+		cl = cops->find(q, handle);
+		if (cl == 0)
+			goto out;
+	}
+	block = cops->tcf_block(q, cl, NULL);
+	if (!block)
+		goto out;
+
+	for (chain = tcf_get_next_chain(block, NULL);
+	     chain;
+	     chain = tcf_get_next_chain(block, chain)) {
+		struct tcf_proto *tp = rcu_dereference(chain->filter_chain);
+
+		result = tcf_classify(skb, NULL, tp, &res, false);
+		if (result >= 0) {
+			switch (result) {
+			case TC_ACT_QUEUED:
+			case TC_ACT_STOLEN:
+			case TC_ACT_TRAP:
+				fallthrough;
+			case TC_ACT_SHOT:
+				rcu_read_unlock();
+				return result;
+			}
+		}
+	}
+out:
+	rcu_read_unlock();
+	return ((u64)res.class << 32 | result);
+}
+
 __diag_pop();

 BTF_SET8_START(skb_kfunc_btf_ids)
@@ -578,6 +645,7 @@ BTF_ID_FLAGS(func, bpf_skb_acquire, KF_ACQUIRE)
 BTF_ID_FLAGS(func, bpf_skb_release, KF_RELEASE)
 BTF_ID_FLAGS(func, bpf_skb_get_hash)
 BTF_ID_FLAGS(func, bpf_qdisc_set_skb_dequeue, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_skb_tc_classify)
 BTF_SET8_END(skb_kfunc_btf_ids)

 static const struct btf_kfunc_id_set skb_kfunc_set = {
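A sketch of how an enqueue program might lean on an existing filter
hierarchy through this kfunc. The ifindex, the 1:0 handle, and the
section name are illustrative assumptions; only the kfunc signature
and the "classid << 32 | action" encoding come from the patch.

extern __u64 bpf_skb_tc_classify(struct sk_buff *skb, int ifindex,
				 __u32 handle) __ksym;

SEC("qdisc/enqueue")
int classify_enqueue(struct bpf_qdisc_ctx *ctx)
{
	/* ask the filters attached to qdisc 1: on ifindex 2 (hypothetical) */
	__u64 verdict = bpf_skb_tc_classify(ctx->skb, 2, 0x00010000);
	__u32 classid = verdict >> 32;	/* upper half carries the classid */

	if (!classid)
		return SCH_BPF_DROP;
	ctx->classid = classid;
	return SCH_BPF_PASS;		/* let the chosen child class queue it */
}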
From patchwork Wed Jan 17 21:56:20 2024
X-Patchwork-Submitter: Amery Hung
X-Patchwork-Id: 13522187
X-Patchwork-Delegate: bpf@iogearbox.net
X-Patchwork-State: RFC
From: Amery Hung
To: netdev@vger.kernel.org
Cc: bpf@vger.kernel.org, yangpeihao@sjtu.edu.cn, toke@redhat.com,
    jhs@mojatatu.com, jiri@resnulli.us, sdf@google.com,
    xiyou.wangcong@gmail.com, yepeilin.cs@gmail.com
Subject: [RFC PATCH v7 4/8] net_sched: Add reset program
Date: Wed, 17 Jan 2024 21:56:20 +0000
X-Mailing-List: bpf@vger.kernel.org
Allow developers to implement customized reset logic through an
optional reset program. The program also takes bpf_qdisc_ctx as
context, but currently cannot access any field.

To release skbs, the program can release all references to the bpf
list or rbtree serving as skb queues. The destructor kfunc
bpf_skb_destroy() will be called by bpf_map_free_deferred(). This
prevents the qdisc from holding the sch_tree_lock for too long when
there are many packets in the qdisc.

Signed-off-by: Amery Hung
---
 include/uapi/linux/bpf.h       |  1 +
 include/uapi/linux/pkt_sched.h |  4 ++++
 kernel/bpf/syscall.c           |  1 +
 net/core/filter.c              |  3 +++
 net/sched/sch_bpf.c            | 30 ++++++++++++++++++++++++++----
 tools/include/uapi/linux/bpf.h |  1 +
 6 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index df280bbb7c0d..84669886a493 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1059,6 +1059,7 @@ enum bpf_attach_type {
	BPF_NETKIT_PEER,
	BPF_QDISC_ENQUEUE,
	BPF_QDISC_DEQUEUE,
+	BPF_QDISC_RESET,
	__MAX_BPF_ATTACH_TYPE
 };

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index d05462309f5a..e9e1a83c22f7 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1328,6 +1328,10 @@ enum {
	TCA_SCH_BPF_DEQUEUE_PROG_FD,	/* u32 */
	TCA_SCH_BPF_DEQUEUE_PROG_ID,	/* u32 */
	TCA_SCH_BPF_DEQUEUE_PROG_TAG,	/* data */
+	TCA_SCH_BPF_RESET_PROG_NAME,	/* string */
+	TCA_SCH_BPF_RESET_PROG_FD,	/* u32 */
+	TCA_SCH_BPF_RESET_PROG_ID,	/* u32 */
+	TCA_SCH_BPF_RESET_PROG_TAG,	/* data */
	__TCA_SCH_BPF_MAX,
 };

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 1838bddd8526..9af6fa542f2e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2506,6 +2506,7 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
		switch (expected_attach_type) {
		case BPF_QDISC_ENQUEUE:
		case BPF_QDISC_DEQUEUE:
+		case BPF_QDISC_RESET:
			return 0;
		default:
			return -EINVAL;
diff --git a/net/core/filter.c b/net/core/filter.c
index f25a0b6b5d56..f8e17465377f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8905,6 +8905,9 @@ static bool tc_qdisc_is_valid_access(int off, int size,
 {
	struct btf *btf;

+	if (prog->expected_attach_type == BPF_QDISC_RESET)
+		return false;
+
	if (off < 0 || off >= sizeof(struct bpf_qdisc_ctx))
		return false;

diff --git a/net/sched/sch_bpf.c b/net/sched/sch_bpf.c
index 1910a58a3352..3f0f809dced6 100644
--- a/net/sched/sch_bpf.c
+++ b/net/sched/sch_bpf.c
@@ -42,6 +42,7 @@ struct bpf_sched_data {
	struct Qdisc_class_hash clhash;
	struct sch_bpf_prog __rcu enqueue_prog;
	struct sch_bpf_prog __rcu dequeue_prog;
+	struct sch_bpf_prog __rcu reset_prog;

	struct qdisc_watchdog watchdog;
 };
@@ -51,6 +52,9 @@ static int sch_bpf_dump_prog(const struct sch_bpf_prog *prog, struct sk_buff *skb,
 {
	struct nlattr *nla;

+	if (!prog->prog)
+		return 0;
+
	if (prog->name &&
	    nla_put_string(skb, name, prog->name))
		return -EMSGSIZE;
@@ -81,6 +85,9 @@ static int sch_bpf_dump(struct Qdisc *sch, struct sk_buff *skb)
	if (sch_bpf_dump_prog(&q->dequeue_prog, skb, TCA_SCH_BPF_DEQUEUE_PROG_NAME,
			      TCA_SCH_BPF_DEQUEUE_PROG_ID, TCA_SCH_BPF_DEQUEUE_PROG_TAG))
		goto nla_put_failure;
+	if (sch_bpf_dump_prog(&q->reset_prog, skb, TCA_SCH_BPF_RESET_PROG_NAME,
+			      TCA_SCH_BPF_RESET_PROG_ID, TCA_SCH_BPF_RESET_PROG_TAG))
+		goto nla_put_failure;

	return nla_nest_end(skb, opts);

@@ -259,16 +266,21 @@ static const struct nla_policy sch_bpf_policy[TCA_SCH_BPF_MAX + 1] = {
	[TCA_SCH_BPF_DEQUEUE_PROG_FD]	= { .type = NLA_U32 },
	[TCA_SCH_BPF_DEQUEUE_PROG_NAME]	= { .type = NLA_NUL_STRING,
					    .len = ACT_BPF_NAME_LEN },
+	[TCA_SCH_BPF_RESET_PROG_FD]	= { .type = NLA_U32 },
+	[TCA_SCH_BPF_RESET_PROG_NAME]	= { .type = NLA_NUL_STRING,
+					    .len = ACT_BPF_NAME_LEN },
 };

-static int bpf_init_prog(struct nlattr *fd, struct nlattr *name, struct sch_bpf_prog *prog)
+static int bpf_init_prog(struct nlattr *fd, struct nlattr *name,
+			 struct sch_bpf_prog *prog, bool optional)
 {
	struct bpf_prog *fp, *old_fp;
	char *prog_name = NULL;
	u32 bpf_fd;

	if (!fd)
-		return -EINVAL;
+		return optional ? 0 : -EINVAL;
+
	bpf_fd = nla_get_u32(fd);

	fp = bpf_prog_get_type(bpf_fd, BPF_PROG_TYPE_QDISC);
@@ -327,11 +339,15 @@ static int sch_bpf_change(struct Qdisc *sch, struct nlattr *opt,
	sch_tree_lock(sch);

	err = bpf_init_prog(tb[TCA_SCH_BPF_ENQUEUE_PROG_FD],
-			    tb[TCA_SCH_BPF_ENQUEUE_PROG_NAME], &q->enqueue_prog);
+			    tb[TCA_SCH_BPF_ENQUEUE_PROG_NAME], &q->enqueue_prog, false);
	if (err)
		goto failure;
	err = bpf_init_prog(tb[TCA_SCH_BPF_DEQUEUE_PROG_FD],
-			    tb[TCA_SCH_BPF_DEQUEUE_PROG_NAME], &q->dequeue_prog);
+			    tb[TCA_SCH_BPF_DEQUEUE_PROG_NAME], &q->dequeue_prog, false);
+	if (err)
+		goto failure;
+	err = bpf_init_prog(tb[TCA_SCH_BPF_RESET_PROG_FD],
+			    tb[TCA_SCH_BPF_RESET_PROG_NAME], &q->reset_prog, true);
 failure:
	sch_tree_unlock(sch);
	return err;
@@ -360,7 +376,9 @@ static int sch_bpf_init(struct Qdisc *sch, struct nlattr *opt,
 static void sch_bpf_reset(struct Qdisc *sch)
 {
	struct bpf_sched_data *q = qdisc_priv(sch);
+	struct bpf_qdisc_ctx ctx = {};
	struct sch_bpf_class *cl;
+	struct bpf_prog *reset;
	unsigned int i;

	for (i = 0; i < q->clhash.hashsize; i++) {
@@ -371,6 +389,9 @@ static void sch_bpf_reset(struct Qdisc *sch)
	}

	qdisc_watchdog_cancel(&q->watchdog);
+	reset = rcu_dereference(q->reset_prog.prog);
+	if (reset)
+		bpf_prog_run(reset, &ctx);
 }

 static void sch_bpf_destroy_class(struct Qdisc *sch, struct sch_bpf_class *cl)
@@ -398,6 +419,7 @@ static void sch_bpf_destroy(struct Qdisc *sch)
	sch_tree_lock(sch);
	bpf_cleanup_prog(&q->enqueue_prog);
	bpf_cleanup_prog(&q->dequeue_prog);
+	bpf_cleanup_prog(&q->reset_prog);
	sch_tree_unlock(sch);
 }

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index df280bbb7c0d..84669886a493 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1059,6 +1059,7 @@ enum bpf_attach_type {
	BPF_NETKIT_PEER,
	BPF_QDISC_ENQUEUE,
	BPF_QDISC_DEQUEUE,
+	BPF_QDISC_RESET,
	__MAX_BPF_ATTACH_TYPE
 };
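A reset program pairing with the one-slot queue sketched after patch 2
might look like the following. The map and section name are
illustrative assumptions; the only contract from this patch is that a
BPF_QDISC_RESET program runs with an opaque bpf_qdisc_ctx and should
drop any skb references the qdisc still holds.

/* redeclared from the earlier sketch so this snippet stands alone */
struct skb_slot {
	struct sk_buff __kptr *skb;
};

extern void bpf_skb_release(struct sk_buff *skb) __ksym;

SEC("qdisc/reset")
int reset_prog(struct bpf_qdisc_ctx *ctx)
{
	struct skb_slot *slot;
	struct sk_buff *skb;
	__u32 key = 0;

	slot = bpf_map_lookup_elem(&queue, &key);
	if (!slot)
		return 0;

	skb = bpf_kptr_xchg(&slot->skb, NULL);	/* empty the queue */
	if (skb)
		bpf_skb_release(skb);		/* drop the ref taken at enqueue */
	return 0;
}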
bh=1mboDMZbXhjjGyMIZ10jie1E1Pz+G7KhgTRa+lPl1Ok=; h=Received:DKIM-Signature:X-Google-DKIM-Signature: X-Gm-Message-State:X-Google-Smtp-Source:X-Received:Received:From: X-Google-Original-From:To:Cc:Subject:Date:Message-Id:X-Mailer: In-Reply-To:References:MIME-Version:Content-Transfer-Encoding; b=SpkBZ6rHFFpsZnGBc8XJJXa9KZSv0ks6jj7Ym5mgErTlElGGoz8Z9/ObQPiHQFdG+jXvwki0g0CA2r1+YImRnxFtuHVA6g4HaFT28QGb1YU/M+nUuT5w9SK4nn6FPe+IylJoHdu6zmFVf4lTxUrJoSxgALzDV5uOmV+4/h2rx8A= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=Oy5iLNhq; arc=none smtp.client-ip=209.85.160.182 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="Oy5iLNhq" Received: by mail-qt1-f182.google.com with SMTP id d75a77b69052e-429f53f0b0bso15746611cf.2; Wed, 17 Jan 2024 13:56:30 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1705528590; x=1706133390; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=FmSbpF5HJNRokADpjS/NeLmhwvKGKlpdziz8GFiM6W8=; b=Oy5iLNhqGoltgBhYZHJkEHNMJWjpsCY6QNGMA/+WfMOs2LhHfPFu+AS7x7Z49B/7Qe h09Ki/Z+AI2PihGxsKqxqphnKNuBOVJwgz3AUPcpMtgE0ekaRQlv43ZAwxstJmrxcWre 3Vi3fmelp9m6Ct2OdMjjtR9CP5KePSrRyohyiYXevrCKi2jlHQ6iUFIKLEBalA1Rxv2A 1jtKnT2Qc//1oBJCEIeEQJJe0a1crNWXtrieUkNmv5P3lNJ0KG24+p0DZzlPTgYPvcNL G64z/1SeHxWKblxZUP6S7WIlLoeN3YClgRDHPrS42mSIiWgfRJArzLSzupairUKydkWO BUiA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1705528590; x=1706133390; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=FmSbpF5HJNRokADpjS/NeLmhwvKGKlpdziz8GFiM6W8=; b=ONBJyrv/9SO8Rf6S1j0RZzQb5+zATdaj77PqPmK2BUOqNMs4J3UAu20TWDUuyXOQCt lPkt7ynak6c6g2KMeC5Y8lkkkj81Wabwya1SwAg+JqWiPfCPuHesG26/7UOHMJNJeXv1 STsIO+RKH9YUjNL/9N67VlDtaYszJ4lu0rWJAh1214ncUuPhGaN+cjRaLF6OmvgrPtqn uFCsC54airPTK3pmZn178D50Iu1TqlDqO/d511OcZXIt2+7RQHhg47C8U0DyVk7s/++2 1Fg+hvHNEAomullaItbAqd9YUnR1ycVpn/aC2Z4ekA/n1CFVhWcsGXXU8Km052JfUTI8 sg6A== X-Gm-Message-State: AOJu0YxkIy2UVkgOw0ZlW06KnFfiB2L4lTkfUT3D2f6Ha+3k23ham/8h n/067ERhYHRmW5ZNmzzX6NwoTiIEOF0= X-Google-Smtp-Source: AGHT+IEgfeoXyRC3XW1dp7hql3R63zkkvDoF83z3VsGpjmGg3G8d6h29e5GL4dVhujHvkH2dskuwgg== X-Received: by 2002:a05:622a:189:b0:429:c957:708c with SMTP id s9-20020a05622a018900b00429c957708cmr12220811qtw.130.1705528590029; Wed, 17 Jan 2024 13:56:30 -0800 (PST) Received: from n36-183-057.byted.org ([147.160.184.91]) by smtp.gmail.com with ESMTPSA id hj11-20020a05622a620b00b00428346b88bfsm6105263qtb.65.2024.01.17.13.56.29 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 17 Jan 2024 13:56:29 -0800 (PST) From: Amery Hung X-Google-Original-From: Amery Hung To: netdev@vger.kernel.org Cc: bpf@vger.kernel.org, yangpeihao@sjtu.edu.cn, toke@redhat.com, jhs@mojatatu.com, jiri@resnulli.us, sdf@google.com, xiyou.wangcong@gmail.com, yepeilin.cs@gmail.com Subject: [RFC PATCH v7 5/8] net_sched: Add init program Date: Wed, 17 Jan 2024 21:56:21 +0000 Message-Id: X-Mailer: 
git-send-email 2.20.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC
This patch adds another optional program to be called during the creation of a qdisc for initializing data in the bpf world. The program takes bpf_qdisc_ctx as context, but cannot access any field.
Signed-off-by: Amery Hung --- include/uapi/linux/bpf.h | 1 + include/uapi/linux/pkt_sched.h | 4 ++++ kernel/bpf/syscall.c | 1 + net/core/filter.c | 3 ++- net/sched/sch_bpf.c | 23 ++++++++++++++++++++++- tools/include/uapi/linux/bpf.h | 1 + 6 files changed, 31 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 84669886a493..cad0788bef99 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1060,6 +1060,7 @@ enum bpf_attach_type { BPF_QDISC_ENQUEUE, BPF_QDISC_DEQUEUE, BPF_QDISC_RESET, + BPF_QDISC_INIT, __MAX_BPF_ATTACH_TYPE }; diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h index e9e1a83c22f7..61f0cf4a088c 100644 --- a/include/uapi/linux/pkt_sched.h +++ b/include/uapi/linux/pkt_sched.h @@ -1332,6 +1332,10 @@ enum { TCA_SCH_BPF_RESET_PROG_FD, /* u32 */ TCA_SCH_BPF_RESET_PROG_ID, /* u32 */ TCA_SCH_BPF_RESET_PROG_TAG, /* data */ + TCA_SCH_BPF_INIT_PROG_NAME, /* string */ + TCA_SCH_BPF_INIT_PROG_FD, /* u32 */ + TCA_SCH_BPF_INIT_PROG_ID, /* u32 */ + TCA_SCH_BPF_INIT_PROG_TAG, /* data */ __TCA_SCH_BPF_MAX, }; diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 9af6fa542f2e..0959905044b9 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2507,6 +2507,7 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type, case BPF_QDISC_ENQUEUE: case BPF_QDISC_DEQUEUE: case BPF_QDISC_RESET: + case BPF_QDISC_INIT: return 0; default: return -EINVAL; diff --git a/net/core/filter.c b/net/core/filter.c index f8e17465377f..5619a12c0d06 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -8905,7 +8905,8 @@ static bool tc_qdisc_is_valid_access(int off, int size, { struct btf *btf; - if (prog->expected_attach_type == BPF_QDISC_RESET) + if (prog->expected_attach_type == BPF_QDISC_RESET || + prog->expected_attach_type == BPF_QDISC_INIT) return false; if (off < 0 || off >= sizeof(struct bpf_qdisc_ctx)) diff --git a/net/sched/sch_bpf.c b/net/sched/sch_bpf.c index 3f0f809dced6..925a131016f0 100644 --- a/net/sched/sch_bpf.c +++ b/net/sched/sch_bpf.c @@ -43,6 +43,7 @@ struct bpf_sched_data { struct sch_bpf_prog __rcu enqueue_prog; struct sch_bpf_prog __rcu dequeue_prog; struct sch_bpf_prog __rcu reset_prog; + struct sch_bpf_prog __rcu init_prog; struct qdisc_watchdog watchdog; }; @@ -88,6 +89,9 @@ static int sch_bpf_dump(struct Qdisc *sch, struct sk_buff *skb) if (sch_bpf_dump_prog(&q->reset_prog, skb, TCA_SCH_BPF_RESET_PROG_NAME, TCA_SCH_BPF_RESET_PROG_ID, TCA_SCH_BPF_RESET_PROG_TAG)) goto nla_put_failure; + if (sch_bpf_dump_prog(&q->init_prog, skb, TCA_SCH_BPF_INIT_PROG_NAME, + TCA_SCH_BPF_INIT_PROG_ID, TCA_SCH_BPF_INIT_PROG_TAG)) + goto nla_put_failure; return nla_nest_end(skb, opts); @@ -269,6 +273,9 @@ static const struct nla_policy sch_bpf_policy[TCA_SCH_BPF_MAX + 1] = { [TCA_SCH_BPF_RESET_PROG_FD] = { .type = NLA_U32 }, [TCA_SCH_BPF_RESET_PROG_NAME] = { .type = NLA_NUL_STRING, .len = ACT_BPF_NAME_LEN }, + [TCA_SCH_BPF_INIT_PROG_FD] = { .type = NLA_U32 }, + [TCA_SCH_BPF_INIT_PROG_NAME] = { .type = NLA_NUL_STRING, + .len = ACT_BPF_NAME_LEN }, }; static int
bpf_init_prog(struct nlattr *fd, struct nlattr *name, @@ -348,6 +355,10 @@ static int sch_bpf_change(struct Qdisc *sch, struct nlattr *opt, goto failure; err = bpf_init_prog(tb[TCA_SCH_BPF_RESET_PROG_FD], tb[TCA_SCH_BPF_RESET_PROG_NAME], &q->reset_prog, true); + if (err) + goto failure; + err = bpf_init_prog(tb[TCA_SCH_BPF_INIT_PROG_FD], + tb[TCA_SCH_BPF_INIT_PROG_NAME], &q->init_prog, true); failure: sch_tree_unlock(sch); return err; @@ -357,6 +368,8 @@ static int sch_bpf_init(struct Qdisc *sch, struct nlattr *opt, struct netlink_ext_ack *extack) { struct bpf_sched_data *q = qdisc_priv(sch); + struct bpf_qdisc_ctx ctx = {}; + struct bpf_prog *init; int err; qdisc_watchdog_init(&q->watchdog, sch); @@ -370,7 +383,14 @@ static int sch_bpf_init(struct Qdisc *sch, struct nlattr *opt, if (err) return err; - return qdisc_class_hash_init(&q->clhash); + err = qdisc_class_hash_init(&q->clhash); + if (err < 0) + return err; + + init = rcu_dereference(q->init_prog.prog); + if (init) + bpf_prog_run(init, &ctx); + return 0; } static void sch_bpf_reset(struct Qdisc *sch) @@ -420,6 +440,7 @@ static void sch_bpf_destroy(struct Qdisc *sch) bpf_cleanup_prog(&q->enqueue_prog); bpf_cleanup_prog(&q->dequeue_prog); bpf_cleanup_prog(&q->reset_prog); + bpf_cleanup_prog(&q->init_prog); sch_tree_unlock(sch); } diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 84669886a493..cad0788bef99 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -1060,6 +1060,7 @@ enum bpf_attach_type { BPF_QDISC_ENQUEUE, BPF_QDISC_DEQUEUE, BPF_QDISC_RESET, + BPF_QDISC_INIT, __MAX_BPF_ATTACH_TYPE };
From patchwork Wed Jan 17 21:56:22 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Amery Hung X-Patchwork-Id: 13522189 X-Patchwork-Delegate: bpf@iogearbox.net From: Amery Hung X-Google-Original-From: Amery Hung To: netdev@vger.kernel.org Cc: bpf@vger.kernel.org, yangpeihao@sjtu.edu.cn, toke@redhat.com, jhs@mojatatu.com, jiri@resnulli.us, sdf@google.com, xiyou.wangcong@gmail.com, yepeilin.cs@gmail.com Subject: [RFC PATCH v7 6/8] tools/libbpf: Add support for BPF_PROG_TYPE_QDISC Date: Wed, 17 Jan 2024 21:56:22 +0000 Message-Id: <813b2de18b94389f4df53f21b8a328e1c2fdda13.1705432850.git.amery.hung@bytedance.com> X-Mailer: git-send-email 2.20.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC
While the eBPF qdisc uses netlink for attachment, expected_attach_type is still required at load time to verify context access for the different programs. This patch adds the corresponding section definitions.
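With these section definitions in place, a qdisc BPF object only needs to name its programs accordingly and libbpf fills in the matching expected_attach_type at load time. A minimal sketch (it assumes the bpf_qdisc_ctx and SCH_BPF_* return codes introduced earlier in this series; a real qdisc would of course queue packets rather than drop them):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* "qdisc/enqueue" -> BPF_QDISC_ENQUEUE and "qdisc/dequeue" -> BPF_QDISC_DEQUEUE,
 * so the verifier can apply the per-program context access rules.
 */
SEC("qdisc/enqueue")
int enqueue_prog(struct bpf_qdisc_ctx *ctx)
{
	/* A real qdisc would stash ctx->skb in a bpf list or rbtree here. */
	return SCH_BPF_DROP;
}

SEC("qdisc/dequeue")
int dequeue_prog(struct bpf_qdisc_ctx *ctx)
{
	/* Nothing is ever queued in this sketch, so there is nothing to hand back. */
	return SCH_BPF_DROP;
}

char _license[] SEC("license") = "GPL";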
Signed-off-by: Amery Hung --- tools/lib/bpf/libbpf.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index e067be95da3c..0541f85b4ce6 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -8991,6 +8991,10 @@ static const struct bpf_sec_def section_defs[] = { SEC_DEF("struct_ops.s+", STRUCT_OPS, 0, SEC_SLEEPABLE), SEC_DEF("sk_lookup", SK_LOOKUP, BPF_SK_LOOKUP, SEC_ATTACHABLE), SEC_DEF("netfilter", NETFILTER, BPF_NETFILTER, SEC_NONE), + SEC_DEF("qdisc/enqueue", QDISC, BPF_QDISC_ENQUEUE, SEC_ATTACHABLE_OPT), + SEC_DEF("qdisc/dequeue", QDISC, BPF_QDISC_DEQUEUE, SEC_ATTACHABLE_OPT), + SEC_DEF("qdisc/reset", QDISC, BPF_QDISC_RESET, SEC_ATTACHABLE_OPT), + SEC_DEF("qdisc/init", QDISC, BPF_QDISC_INIT, SEC_ATTACHABLE_OPT), }; int libbpf_register_prog_handler(const char *sec,
From patchwork Wed Jan 17 21:56:23 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Amery Hung X-Patchwork-Id: 13522190 X-Patchwork-Delegate: bpf@iogearbox.net From: Amery Hung X-Google-Original-From: Amery Hung To: netdev@vger.kernel.org Cc: bpf@vger.kernel.org, yangpeihao@sjtu.edu.cn, toke@redhat.com, jhs@mojatatu.com, jiri@resnulli.us, sdf@google.com, xiyou.wangcong@gmail.com, yepeilin.cs@gmail.com Subject: [RFC PATCH v7 7/8] samples/bpf: Add an example of bpf fq qdisc Date: Wed, 17 Jan 2024 21:56:23 +0000 Message-Id: <52a0e08033292a88865aab37b0b3bd294b93e13c.1705432850.git.amery.hung@bytedance.com> X-Mailer: git-send-email 2.20.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC
tc_sch_fq.bpf.c A simple bpf fair queueing (fq) qdisc that gives each flow an equal chance to transmit data. The qdisc respects the timestamp in a skb set by a clsact EDT rate limiter. It can also inform the rate limiter about packet drops, when enabled, so the limiter can adjust timestamps. The implementation does not prevent hash collisions between flows, nor does it recycle flows.
tc_sch_fq.c A user space program to load and attach the eBPF-based fq qdisc. By default it adds the bpf fq to the loopback device, but it can also be added to another device and parent class with the '-d' and '-p' options.
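The "equal chance" here is deficit round robin, the same credit/quantum scheme the in-kernel fq scheduler uses: each flow carries a byte credit that dequeued packets consume, and a flow whose credit is exhausted is refilled with a quantum and rotated to the back of the old-flows list. A self-contained illustration of just that accounting (hypothetical drr_flow type for illustration, not code from this patch):

/* Deficit round robin in miniature: a flow may send while it has
 * credit; an exhausted flow is refilled and the caller rotates it.
 */
struct drr_flow {
	int credit; /* bytes this flow may still send in this round */
};

#define DRR_QUANTUM (2 * 1514) /* two Ethernet frames here; the sample uses q_quantum = 2 * PSCHED_MTU */

/* Returns 1 if the flow may transmit pkt_len now, 0 if it must rotate. */
static int drr_try_send(struct drr_flow *flow, int pkt_len)
{
	if (flow->credit <= 0) {
		flow->credit += DRR_QUANTUM; /* refill; caller moves flow to the back */
		return 0;
	}
	flow->credit -= pkt_len; /* may go negative; the deficit carries over */
	return 1;
}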
To test the bpf fq qdisc with the EDT rate limiter: $ tc qdisc add dev lo clsact $ tc filter add dev lo egress bpf obj tc_clsact_edt.bpf.o sec classifier $ ./tc_sch_fq -s Signed-off-by: Amery Hung --- samples/bpf/Makefile | 8 +- samples/bpf/bpf_experimental.h | 134 +++++++ samples/bpf/tc_clsact_edt.bpf.c | 103 +++++ samples/bpf/tc_sch_fq.bpf.c | 666 ++++++++++++++++++++++++++++++++ samples/bpf/tc_sch_fq.c | 321 +++++++++++++++ 5 files changed, 1231 insertions(+), 1 deletion(-) create mode 100644 samples/bpf/bpf_experimental.h create mode 100644 samples/bpf/tc_clsact_edt.bpf.c create mode 100644 samples/bpf/tc_sch_fq.bpf.c create mode 100644 samples/bpf/tc_sch_fq.c diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 933f6c3fe6b0..ea516a00352d 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -46,6 +46,7 @@ tprogs-y += xdp_fwd tprogs-y += task_fd_query tprogs-y += ibumad tprogs-y += hbm +tprogs-y += tc_sch_fq # Libbpf dependencies LIBBPF_SRC = $(TOOLS_PATH)/lib/bpf @@ -98,6 +99,7 @@ ibumad-objs := ibumad_user.o hbm-objs := hbm.o $(CGROUP_HELPERS) xdp_router_ipv4-objs := xdp_router_ipv4_user.o $(XDP_SAMPLE) +tc_sch_fq-objs := tc_sch_fq.o # Tell kbuild to always build the programs always-y := $(tprogs-y) @@ -149,6 +151,7 @@ always-y += task_fd_query_kern.o always-y += ibumad_kern.o always-y += hbm_out_kern.o always-y += hbm_edt_kern.o +always-y += tc_sch_fq.bpf.o TPROGS_CFLAGS = $(TPROGS_USER_CFLAGS) TPROGS_LDFLAGS = $(TPROGS_USER_LDFLAGS) @@ -195,6 +198,7 @@ TPROGLDLIBS_tracex4 += -lrt TPROGLDLIBS_trace_output += -lrt TPROGLDLIBS_map_perf_test += -lrt TPROGLDLIBS_test_overhead += -lrt +TPROGLDLIBS_tc_sch_fq += -lmnl # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline: # make M=samples/bpf LLC=~/git/llvm-project/llvm/build/bin/llc CLANG=~/git/llvm-project/llvm/build/bin/clang @@ -306,6 +310,7 @@ $(obj)/$(TRACE_HELPERS) $(obj)/$(CGROUP_HELPERS) $(obj)/$(XDP_SAMPLE): | libbpf_ .PHONY: libbpf_hdrs $(obj)/xdp_router_ipv4_user.o: $(obj)/xdp_router_ipv4.skel.h +$(obj)/tc_sch_fq.o: $(obj)/tc_sch_fq.skel.h $(obj)/tracex5.bpf.o: $(obj)/syscall_nrs.h $(obj)/hbm_out_kern.o: $(src)/hbm.h $(src)/hbm_kern.h @@ -370,10 +375,11 @@ $(obj)/%.bpf.o: $(src)/%.bpf.c $(obj)/vmlinux.h $(src)/xdp_sample.bpf.h $(src)/x -I$(LIBBPF_INCLUDE) $(CLANG_SYS_INCLUDES) \ -c $(filter %.bpf.c,$^) -o $@ -LINKED_SKELS := xdp_router_ipv4.skel.h +LINKED_SKELS := xdp_router_ipv4.skel.h tc_sch_fq.skel.h clean-files += $(LINKED_SKELS) xdp_router_ipv4.skel.h-deps := xdp_router_ipv4.bpf.o xdp_sample.bpf.o +tc_sch_fq.skel.h-deps := tc_sch_fq.bpf.o LINKED_BPF_SRCS := $(patsubst %.bpf.o,%.bpf.c,$(foreach skel,$(LINKED_SKELS),$($(skel)-deps))) diff --git a/samples/bpf/bpf_experimental.h b/samples/bpf/bpf_experimental.h new file mode 100644 index 000000000000..fc39063e0322 --- /dev/null +++ b/samples/bpf/bpf_experimental.h @@ -0,0 +1,134 @@ +#ifndef __BPF_EXPERIMENTAL__ +#define __BPF_EXPERIMENTAL__ + +#include "vmlinux.h" +#include +#include +#include + +#define __contains(name, node) __attribute__((btf_decl_tag("contains:" #name ":" #node))) + +/* Description + * Allocates an object of the type represented by 'local_type_id' in + * program BTF. User may use the bpf_core_type_id_local macro to pass the + * type ID of a struct in program BTF. + * + * The 'local_type_id' parameter must be a known constant. + * The 'meta' parameter is rewritten by the verifier, no need for BPF + * program to set it. 
+ * Returns + * A pointer to an object of the type corresponding to the passed in + * 'local_type_id', or NULL on failure. + */ +extern void *bpf_obj_new_impl(__u64 local_type_id, void *meta) __ksym; + +/* Convenience macro to wrap over bpf_obj_new_impl */ +#define bpf_obj_new(type) ((type *)bpf_obj_new_impl(bpf_core_type_id_local(type), NULL)) + +/* Description + * Free an allocated object. All fields of the object that require + * destruction will be destructed before the storage is freed. + * + * The 'meta' parameter is rewritten by the verifier, no need for BPF + * program to set it. + * Returns + * Void. + */ +extern void bpf_obj_drop_impl(void *kptr, void *meta) __ksym; + +/* Convenience macro to wrap over bpf_obj_drop_impl */ +#define bpf_obj_drop(kptr) bpf_obj_drop_impl(kptr, NULL) + +/* Description + * Increment the refcount on a refcounted local kptr, turning the + * non-owning reference input into an owning reference in the process. + * + * The 'meta' parameter is rewritten by the verifier, no need for BPF + * program to set it. + * Returns + * An owning reference to the object pointed to by 'kptr' + */ +extern void *bpf_refcount_acquire_impl(void *kptr, void *meta) __ksym; + +/* Convenience macro to wrap over bpf_refcount_acquire_impl */ +#define bpf_refcount_acquire(kptr) bpf_refcount_acquire_impl(kptr, NULL) + +/* Description + * Add a new entry to the beginning of the BPF linked list. + * + * The 'meta' and 'off' parameters are rewritten by the verifier, no need + * for BPF programs to set them + * Returns + * 0 if the node was successfully added + * -EINVAL if the node wasn't added because it's already in a list + */ +extern int bpf_list_push_front_impl(struct bpf_list_head *head, + struct bpf_list_node *node, + void *meta, __u64 off) __ksym; + +/* Convenience macro to wrap over bpf_list_push_front_impl */ +#define bpf_list_push_front(head, node) bpf_list_push_front_impl(head, node, NULL, 0) + +/* Description + * Add a new entry to the end of the BPF linked list. + * + * The 'meta' and 'off' parameters are rewritten by the verifier, no need + * for BPF programs to set them + * Returns + * 0 if the node was successfully added + * -EINVAL if the node wasn't added because it's already in a list + */ +extern int bpf_list_push_back_impl(struct bpf_list_head *head, + struct bpf_list_node *node, + void *meta, __u64 off) __ksym; + +/* Convenience macro to wrap over bpf_list_push_back_impl */ +#define bpf_list_push_back(head, node) bpf_list_push_back_impl(head, node, NULL, 0) + +/* Description + * Remove the entry at the beginning of the BPF linked list. + * Returns + * Pointer to bpf_list_node of deleted entry, or NULL if list is empty. + */ +extern struct bpf_list_node *bpf_list_pop_front(struct bpf_list_head *head) __ksym; + +/* Description + * Remove the entry at the end of the BPF linked list. + * Returns + * Pointer to bpf_list_node of deleted entry, or NULL if list is empty. 
+ */ +extern struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head) __ksym; + +/* Description + * Remove 'node' from rbtree with root 'root' + * Returns + * Pointer to the removed node, or NULL if 'root' didn't contain 'node' + */ +extern struct bpf_rb_node *bpf_rbtree_remove(struct bpf_rb_root *root, + struct bpf_rb_node *node) __ksym; + +/* Description + * Add 'node' to rbtree with root 'root' using comparator 'less' + * + * The 'meta' and 'off' parameters are rewritten by the verifier, no need + * for BPF programs to set them + * Returns + * 0 if the node was successfully added + * -EINVAL if the node wasn't added because it's already in a tree + */ +extern int bpf_rbtree_add_impl(struct bpf_rb_root *root, struct bpf_rb_node *node, + bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b), + void *meta, __u64 off) __ksym; + +/* Convenience macro to wrap over bpf_rbtree_add_impl */ +#define bpf_rbtree_add(head, node, less) bpf_rbtree_add_impl(head, node, less, NULL, 0) + +/* Description + * Return the first (leftmost) node in input tree + * Returns + * Pointer to the node, which is _not_ removed from the tree. If the tree + * contains no nodes, returns NULL. + */ +extern struct bpf_rb_node *bpf_rbtree_first(struct bpf_rb_root *root) __ksym; + +#endif diff --git a/samples/bpf/tc_clsact_edt.bpf.c b/samples/bpf/tc_clsact_edt.bpf.c new file mode 100644 index 000000000000..f0b2ea84028d --- /dev/null +++ b/samples/bpf/tc_clsact_edt.bpf.c @@ -0,0 +1,103 @@ +#include "vmlinux.h" +#include +#include + +#define ETH_P_IP 0x0800 +#define TC_ACT_OK 0 +#define NS_PER_SEC 1000000000ULL + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __type(key, __u32); + __type(value, __u64); + __uint(pinning, LIBBPF_PIN_BY_NAME); + __uint(max_entries, 16); +} rate_map SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, 16); + __type(key, u32); + __type(value, u64); + __uint(pinning, LIBBPF_PIN_BY_NAME); +} tstamp_map SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, 16); + __type(key, u32); + __type(value, u64); + __uint(pinning, LIBBPF_PIN_BY_NAME); +} comp_map SEC(".maps"); + +u64 last_ktime = 0; + +SEC("classifier") +int prog(struct __sk_buff *skb) +{ + void *data_end = (void *)(unsigned long long)skb->data_end; + u64 *rate, *tstamp, delay_ns, tstamp_comp, tstamp_new, *comp, comp_ns, now, init_rate = 12500000; /* 100 Mbits/sec */ + void *data = (void *)(unsigned long long)skb->data; + struct iphdr *ip = data + sizeof(struct ethhdr); + struct ethhdr *eth = data; + u64 len = skb->len; + long ret; + u64 zero = 0; + + now = bpf_ktime_get_ns(); + + if (data + sizeof(struct ethhdr) > data_end) + return TC_ACT_OK; + if (skb->protocol != bpf_htons(ETH_P_IP)) + return TC_ACT_OK; + if (data + sizeof(struct ethhdr) + sizeof(struct iphdr) > data_end) + return TC_ACT_OK; + + rate = bpf_map_lookup_elem(&rate_map, &ip->daddr); + if (!rate) { + bpf_map_update_elem(&rate_map, &ip->daddr, &init_rate, BPF_ANY); + bpf_map_update_elem(&tstamp_map, &ip->daddr, &now, BPF_ANY); + bpf_map_update_elem(&comp_map, &ip->daddr, &zero, BPF_ANY); + return TC_ACT_OK; + } + + delay_ns = skb->len * NS_PER_SEC / (*rate); + + tstamp = bpf_map_lookup_elem(&tstamp_map, &ip->daddr); + if (!tstamp) /* unlikely */ + return TC_ACT_OK; + + comp = bpf_map_lookup_elem(&comp_map, &ip->daddr); + if (!comp) /* unlikely */ + return TC_ACT_OK; + + // Reset comp and tstamp when idle + if (now - last_ktime > 1000000000) { + __sync_lock_test_and_set(comp, 0); + 
__sync_lock_test_and_set(tstamp, now); + } + last_ktime = now; + + comp_ns = __sync_lock_test_and_set(comp, 0); + tstamp_comp = *tstamp - comp_ns; + if (tstamp_comp < now) { + tstamp_new = tstamp_comp + delay_ns; + if (tstamp_new < now) { + __sync_fetch_and_add(comp, now - tstamp_new); + __sync_lock_test_and_set(tstamp, now); + } else { + __sync_fetch_and_sub(tstamp, comp_ns); + __sync_fetch_and_add(tstamp, delay_ns); + } + skb->tstamp = now; + return TC_ACT_OK; + } + + __sync_fetch_and_sub(tstamp, comp_ns); + skb->tstamp = *tstamp; + __sync_fetch_and_add(tstamp, delay_ns); + + return TC_ACT_OK; +} + +char _license[] SEC("license") = "GPL"; diff --git a/samples/bpf/tc_sch_fq.bpf.c b/samples/bpf/tc_sch_fq.bpf.c new file mode 100644 index 000000000000..b2287ea3e2b2 --- /dev/null +++ b/samples/bpf/tc_sch_fq.bpf.c @@ -0,0 +1,666 @@ +#include "vmlinux.h" +#include "bpf_experimental.h" +#include + +#define TC_PRIO_CONTROL 7 +#define TC_PRIO_MAX 15 + +#define NS_PER_SEC 1000000000 +#define PSCHED_MTU (64 * 1024 + 14) + +#define NUM_QUEUE_LOG 10 +#define NUM_QUEUE (1 << NUM_QUEUE_LOG) +#define PRIO_QUEUE (NUM_QUEUE + 1) +#define COMP_DROP_PKT_DELAY 1 +#define THROTTLED 0xffffffffffffffff + +/* fq configuration */ +__u64 q_flow_refill_delay = 40 * 10000; //40us +__u64 q_horizon = NS_PER_SEC * 10ULL; +__u32 q_initial_quantum = 10 * PSCHED_MTU; +__u32 q_quantum = 2 * PSCHED_MTU; +__u32 q_orphan_mask = 1023; +__u32 q_flow_plimit = 100; +__u32 q_plimit = 10000; +bool q_horizon_drop = true; + +bool q_compensate_tstamp; +bool q_random_drop; + +unsigned long time_next_delayed_flow = ~0ULL; +unsigned long unthrottle_latency_ns = 0ULL; +unsigned long ktime_cache = 0; +unsigned long dequeue_now; +unsigned int fq_qlen = 0; + +struct fq_cb { + u32 plen; +}; + +struct skb_node { + u64 tstamp; + struct sk_buff __kptr *skb; + struct bpf_rb_node node; +}; + +struct fq_flow_node { + u32 hash; + int credit; + u32 qlen; + u32 socket_hash; + u64 age; + u64 time_next_packet; + struct bpf_list_node list_node; + struct bpf_rb_node rb_node; + struct bpf_rb_root queue __contains(skb_node, node); + struct bpf_spin_lock lock; + struct bpf_refcount refcount; +}; + +struct dequeue_nonprio_ctx { + bool dequeued; + u64 expire; +}; + +struct fq_stashed_flow { + struct fq_flow_node __kptr *flow; +}; + +/* [NUM_QUEUE] for TC_PRIO_CONTROL + * [0, NUM_QUEUE - 1] for other flows + */ +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __type(key, __u32); + __type(value, struct fq_stashed_flow); + __uint(max_entries, NUM_QUEUE + 1); +} fq_stashed_flows SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __type(key, __u32); + __type(value, __u64); + __uint(pinning, LIBBPF_PIN_BY_NAME); + __uint(max_entries, 16); +} rate_map SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __type(key, __u32); + __type(value, __u64); + __uint(pinning, LIBBPF_PIN_BY_NAME); + __uint(max_entries, 16); +} comp_map SEC(".maps"); + +#define private(name) SEC(".data." 
#name) __hidden __attribute__((aligned(8))) + +private(A) struct bpf_spin_lock fq_delayed_lock; +private(A) struct bpf_rb_root fq_delayed __contains(fq_flow_node, rb_node); + +private(B) struct bpf_spin_lock fq_new_flows_lock; +private(B) struct bpf_list_head fq_new_flows __contains(fq_flow_node, list_node); + +private(C) struct bpf_spin_lock fq_old_flows_lock; +private(C) struct bpf_list_head fq_old_flows __contains(fq_flow_node, list_node); + +struct sk_buff *bpf_skb_acquire(struct sk_buff *p) __ksym; +void bpf_skb_release(struct sk_buff *p) __ksym; +u32 bpf_skb_get_hash(struct sk_buff *p) __ksym; +void bpf_qdisc_set_skb_dequeue(struct sk_buff *p) __ksym; + +static __always_inline bool bpf_kptr_xchg_back(void *map_val, void *ptr) +{ + void *ret; + + ret = bpf_kptr_xchg(map_val, ptr); + if (ret) { //unexpected + bpf_obj_drop(ret); + return false; + } + return true; +} + +static __always_inline int hash64(u64 val, int bits) +{ + return val * 0x61C8864680B583EBull >> (64 - bits); +} + +static bool skbn_tstamp_less(struct bpf_rb_node *a, const struct bpf_rb_node *b) +{ + struct skb_node *skb_a; + struct skb_node *skb_b; + + skb_a = container_of(a, struct skb_node, node); + skb_b = container_of(b, struct skb_node, node); + + return skb_a->tstamp < skb_b->tstamp; +} + +static bool fn_time_next_packet_less(struct bpf_rb_node *a, const struct bpf_rb_node *b) +{ + struct fq_flow_node *flow_a; + struct fq_flow_node *flow_b; + + flow_a = container_of(a, struct fq_flow_node, rb_node); + flow_b = container_of(b, struct fq_flow_node, rb_node); + + return flow_a->time_next_packet < flow_b->time_next_packet; +} + +static __always_inline void +fq_flows_add_head(struct bpf_list_head *head, struct bpf_spin_lock *lock, + struct fq_flow_node *flow) +{ + bpf_spin_lock(lock); + bpf_list_push_front(head, &flow->list_node); + bpf_spin_unlock(lock); +} + +static __always_inline void +fq_flows_add_tail(struct bpf_list_head *head, struct bpf_spin_lock *lock, + struct fq_flow_node *flow) +{ + bpf_spin_lock(lock); + bpf_list_push_back(head, &flow->list_node); + bpf_spin_unlock(lock); +} + +static __always_inline bool +fq_flows_is_empty(struct bpf_list_head *head, struct bpf_spin_lock *lock) +{ + struct bpf_list_node *node; + + bpf_spin_lock(lock); + node = bpf_list_pop_front(head); + if (node) { + bpf_list_push_front(head, node); + bpf_spin_unlock(lock); + return false; + } + bpf_spin_unlock(lock); + + return true; +} + +static __always_inline void fq_flow_set_detached(struct fq_flow_node *flow) +{ + flow->age = bpf_jiffies64(); + bpf_obj_drop(flow); +} + +static __always_inline bool fq_flow_is_detached(struct fq_flow_node *flow) +{ + return flow->age != 0 && flow->age != THROTTLED; +} + +static __always_inline bool fq_flow_is_throttled(struct fq_flow_node *flow) +{ + return flow->age == THROTTLED; +} + +static __always_inline bool sk_listener(struct sock *sk) +{ + return (1 << sk->__sk_common.skc_state) & (TCPF_LISTEN | TCPF_NEW_SYN_RECV); +} + +static __always_inline int +fq_classify(struct sk_buff *skb, u32 *hash, struct fq_stashed_flow **sflow, + bool *connected, u32 *sk_hash) +{ + struct fq_flow_node *flow; + struct sock *sk = skb->sk; + + *connected = false; + + if ((skb->priority & TC_PRIO_MAX) == TC_PRIO_CONTROL) { + *hash = PRIO_QUEUE; + } else { + if (!sk || sk_listener(sk)) { + *sk_hash = bpf_skb_get_hash(skb) & q_orphan_mask; + *sk_hash = (*sk_hash << 1 | 1); + } else if (sk->__sk_common.skc_state == TCP_CLOSE) { + *sk_hash = bpf_skb_get_hash(skb) & q_orphan_mask; + *sk_hash = (*sk_hash << 1 | 1); + }
else { + *sk_hash = sk->__sk_common.skc_hash; + *connected = true; + } + *hash = hash64(*sk_hash, NUM_QUEUE_LOG); + } + + *sflow = bpf_map_lookup_elem(&fq_stashed_flows, hash); + if (!*sflow) + return -1; //unexpected + + if ((*sflow)->flow) + return 0; + + flow = bpf_obj_new(typeof(*flow)); + if (!flow) + return -1; + + flow->hash = *hash; + flow->credit = q_initial_quantum; + flow->qlen = 0; + flow->age = 1UL; + flow->time_next_packet = 0; + + bpf_kptr_xchg_back(&(*sflow)->flow, flow); + + return 0; +} + +static __always_inline bool fq_packet_beyond_horizon(struct sk_buff *skb) +{ + return (s64)skb->tstamp > (s64)(ktime_cache + q_horizon); +} + +SEC("qdisc/enqueue") +int enqueue_prog(struct bpf_qdisc_ctx *ctx) +{ + struct iphdr *iph = (void *)(long)ctx->skb->data + sizeof(struct ethhdr); + u64 time_to_send, jiffies, delay_ns, *comp_ns, *rate; + struct fq_flow_node *flow = NULL, *flow_copy; + struct sk_buff *skb = ctx->skb; + u32 hash, plen, daddr, sk_hash; + struct fq_stashed_flow *sflow; + struct bpf_rb_node *node; + struct skb_node *skbn; + void *flow_queue; + bool connected; + + if (q_random_drop & (bpf_get_prandom_u32() > ~0U * 0.90)) + goto drop; + + if (fq_qlen >= q_plimit) + goto drop; + + skb = bpf_skb_acquire(ctx->skb); + if (!skb->tstamp) { + time_to_send = ktime_cache = bpf_ktime_get_ns(); + } else { + if (fq_packet_beyond_horizon(skb)) { + ktime_cache = bpf_ktime_get_ns(); + if (fq_packet_beyond_horizon(skb)) { + if (q_horizon_drop) + goto rel_skb_and_drop; + + skb->tstamp = ktime_cache + q_horizon; + } + } + time_to_send = skb->tstamp; + } + + if (fq_classify(skb, &hash, &sflow, &connected, &sk_hash) < 0) + goto rel_skb_and_drop; + + flow = bpf_kptr_xchg(&sflow->flow, flow); + if (!flow) + goto rel_skb_and_drop; //unexpected + + if (hash != PRIO_QUEUE) { + if (connected && flow->socket_hash != sk_hash) { + flow->credit = q_initial_quantum; + flow->socket_hash = sk_hash; + if (fq_flow_is_throttled(flow)) { + /* mark the flow as undetached. The reference to the + * throttled flow in fq_delayed will be removed later. 
+ */ + flow_copy = bpf_refcount_acquire(flow); + flow_copy->age = 0; + fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow_copy); + } + flow->time_next_packet = 0ULL; + } + + if (flow->qlen >= q_flow_plimit) { + bpf_kptr_xchg_back(&sflow->flow, flow); + goto rel_skb_and_drop; + } + + if (fq_flow_is_detached(flow)) { + if (connected) + flow->socket_hash = sk_hash; + + flow_copy = bpf_refcount_acquire(flow); + + jiffies = bpf_jiffies64(); + if ((s64)(jiffies - (flow_copy->age + q_flow_refill_delay)) > 0) { + if (flow_copy->credit < q_quantum) + flow_copy->credit = q_quantum; + } + flow_copy->age = 0; + fq_flows_add_tail(&fq_new_flows, &fq_new_flows_lock, flow_copy); + } + } + + skbn = bpf_obj_new(typeof(*skbn)); + if (!skbn) { + bpf_kptr_xchg_back(&sflow->flow, flow); + goto rel_skb_and_drop; + } + + skbn->tstamp = time_to_send; + skb = bpf_kptr_xchg(&skbn->skb, skb); + if (skb) + bpf_skb_release(skb); + + bpf_spin_lock(&flow->lock); + bpf_rbtree_add(&flow->queue, &skbn->node, skbn_tstamp_less); + bpf_spin_unlock(&flow->lock); + + flow->qlen++; + bpf_kptr_xchg_back(&sflow->flow, flow); + + fq_qlen++; + return SCH_BPF_QUEUED; + +rel_skb_and_drop: + bpf_skb_release(skb); +drop: + if (q_compensate_tstamp) { + bpf_probe_read_kernel(&plen, sizeof(plen), (void *)(ctx->skb->cb)); + bpf_probe_read_kernel(&daddr, sizeof(daddr), &iph->daddr); + rate = bpf_map_lookup_elem(&rate_map, &daddr); + comp_ns = bpf_map_lookup_elem(&comp_map, &daddr); + if (rate && comp_ns) { + delay_ns = (u64)plen * NS_PER_SEC / (*rate); + __sync_fetch_and_add(comp_ns, delay_ns); + } + } + return SCH_BPF_DROP; +} + +static int fq_unset_throttled_flows(u32 index, bool *unset_all) +{ + struct bpf_rb_node *node = NULL; + struct fq_flow_node *flow; + u32 hash; + + bpf_spin_lock(&fq_delayed_lock); + + node = bpf_rbtree_first(&fq_delayed); + if (!node) { + bpf_spin_unlock(&fq_delayed_lock); + return 1; + } + + flow = container_of(node, struct fq_flow_node, rb_node); + if (!*unset_all && flow->time_next_packet > dequeue_now) { + time_next_delayed_flow = flow->time_next_packet; + bpf_spin_unlock(&fq_delayed_lock); + return 1; + } + + node = bpf_rbtree_remove(&fq_delayed, &flow->rb_node); + + bpf_spin_unlock(&fq_delayed_lock); + + if (!node) + return 1; //unexpected + + flow = container_of(node, struct fq_flow_node, rb_node); + + /* the flow was recycled during enqueue() */ + if (flow->age != THROTTLED) { + bpf_obj_drop(flow); + return 0; + } + + flow->age = 0; + fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow); + + return 0; +} + +static __always_inline void fq_flow_set_throttled(struct fq_flow_node *flow) +{ + flow->age = THROTTLED; + + if (time_next_delayed_flow > flow->time_next_packet) + time_next_delayed_flow = flow->time_next_packet; + + bpf_spin_lock(&fq_delayed_lock); + bpf_rbtree_add(&fq_delayed, &flow->rb_node, fn_time_next_packet_less); + bpf_spin_unlock(&fq_delayed_lock); +} + +static __always_inline void fq_check_throttled(void) +{ + bool unset_all = false; + unsigned long sample; + + if (time_next_delayed_flow > dequeue_now) + return; + + sample = (unsigned long)(dequeue_now - time_next_delayed_flow); + unthrottle_latency_ns -= unthrottle_latency_ns >> 3; + unthrottle_latency_ns += sample >> 3; + + time_next_delayed_flow = ~0ULL; + bpf_loop(NUM_QUEUE, fq_unset_throttled_flows, &unset_all, 0); +} + +static int +fq_dequeue_nonprio_flows(u32 index, struct dequeue_nonprio_ctx *ctx) +{ + u64 time_next_packet, time_to_send; + struct skb_node *skbn, *skbn_tbd; + struct bpf_rb_node *rb_node; + struct sk_buff *skb 
= NULL; + struct bpf_list_head *head; + struct bpf_list_node *node; + struct bpf_spin_lock *lock; + struct fq_flow_node *flow; + u32 plen, key = 0; + struct fq_cb cb; + bool is_empty; + + head = &fq_new_flows; + lock = &fq_new_flows_lock; + bpf_spin_lock(&fq_new_flows_lock); + node = bpf_list_pop_front(&fq_new_flows); + bpf_spin_unlock(&fq_new_flows_lock); + if (!node) { + head = &fq_old_flows; + lock = &fq_old_flows_lock; + bpf_spin_lock(&fq_old_flows_lock); + node = bpf_list_pop_front(&fq_old_flows); + bpf_spin_unlock(&fq_old_flows_lock); + if (!node) { + if (time_next_delayed_flow != ~0ULL) + ctx->expire = time_next_delayed_flow; + return 1; + } + } + + flow = container_of(node, struct fq_flow_node, list_node); + if (flow->credit <= 0) { + flow->credit += q_quantum; + fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow); + return 0; + } + + bpf_spin_lock(&flow->lock); + rb_node = bpf_rbtree_first(&flow->queue); + if (!rb_node) { + bpf_spin_unlock(&flow->lock); + is_empty = fq_flows_is_empty(&fq_old_flows, &fq_old_flows_lock); + if (head == &fq_new_flows && !is_empty) + fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow); + else + fq_flow_set_detached(flow); + + return 0; + } + + skbn = container_of(rb_node, struct skb_node, node); + time_to_send = skbn->tstamp; + + time_next_packet = (time_to_send > flow->time_next_packet) ? + time_to_send : flow->time_next_packet; + if (dequeue_now < time_next_packet) { + bpf_spin_unlock(&flow->lock); + flow->time_next_packet = time_next_packet; + fq_flow_set_throttled(flow); + return 0; + } + + rb_node = bpf_rbtree_remove(&flow->queue, &skbn->node); + bpf_spin_unlock(&flow->lock); + + if (!rb_node) { + fq_flows_add_tail(head, lock, flow); + return 0; //unexpected + } + + skbn = container_of(rb_node, struct skb_node, node); + skb = bpf_kptr_xchg(&skbn->skb, skb); + if (!skb) + goto out; + + bpf_probe_read_kernel(&cb, sizeof(cb), skb->cb); + plen = cb.plen; + + flow->credit -= plen; + flow->qlen--; + fq_qlen--; + + ctx->dequeued = true; + bpf_qdisc_set_skb_dequeue(skb); +out: + bpf_obj_drop(skbn); + fq_flows_add_head(head, lock, flow); + + return 1; +} + +static __always_inline struct sk_buff *fq_dequeue_prio(void) +{ + struct fq_flow_node *flow = NULL; + struct fq_stashed_flow *sflow; + struct sk_buff *skb = NULL; + struct bpf_rb_node *node; + struct skb_node *skbn; + u32 hash = NUM_QUEUE; + + sflow = bpf_map_lookup_elem(&fq_stashed_flows, &hash); + if (!sflow) + return NULL; //unexpected + + flow = bpf_kptr_xchg(&sflow->flow, flow); + if (!flow) + return NULL; + + bpf_spin_lock(&flow->lock); + node = bpf_rbtree_first(&flow->queue); + if (!node) { + bpf_spin_unlock(&flow->lock); + goto xchg_flow_back; + } + + skbn = container_of(node, struct skb_node, node); + node = bpf_rbtree_remove(&flow->queue, &skbn->node); + bpf_spin_unlock(&flow->lock); + + if (!node) + goto xchg_flow_back; + + skbn = container_of(node, struct skb_node, node); + skb = bpf_kptr_xchg(&skbn->skb, skb); + bpf_obj_drop(skbn); + fq_qlen--; + +xchg_flow_back: + bpf_kptr_xchg_back(&sflow->flow, flow); + + return skb; +} + +SEC("qdisc/dequeue") +int dequeue_prog(struct bpf_qdisc_ctx *ctx) +{ + struct dequeue_nonprio_ctx cb_ctx = {}; + struct skb_node *skbn = NULL; + struct bpf_rb_node *rb_node; + struct sk_buff *skb = NULL; + + skb = fq_dequeue_prio(); + if (skb) { + bpf_qdisc_set_skb_dequeue(skb); + return SCH_BPF_DEQUEUED; + } + + ktime_cache = dequeue_now = bpf_ktime_get_ns(); + fq_check_throttled(); + bpf_loop(q_plimit, fq_dequeue_nonprio_flows, &cb_ctx, 0); + + if 
(cb_ctx.dequeued) + return SCH_BPF_DEQUEUED; + + if (cb_ctx.expire) { + ctx->expire = cb_ctx.expire; + return SCH_BPF_THROTTLE; + } + + return SCH_BPF_DROP; +} + +static int +fq_reset_flows(u32 index, void *ctx) +{ + struct bpf_list_head *head; + struct bpf_list_node *node; + struct bpf_spin_lock *lock; + struct fq_flow_node *flow; + + head = &fq_new_flows; + lock = &fq_new_flows_lock; + bpf_spin_lock(&fq_new_flows_lock); + node = bpf_list_pop_front(&fq_new_flows); + bpf_spin_unlock(&fq_new_flows_lock); + if (!node) { + head = &fq_old_flows; + lock = &fq_old_flows_lock; + bpf_spin_lock(&fq_old_flows_lock); + node = bpf_list_pop_front(&fq_old_flows); + bpf_spin_unlock(&fq_old_flows_lock); + if (!node) + return 1; + } + + flow = container_of(node, struct fq_flow_node, list_node); + bpf_obj_drop(flow); + + return 0; +} + +static int +fq_reset_stashed_flows(u32 index, void *ctx) +{ + struct fq_flow_node *flow = NULL; + struct fq_stashed_flow *sflow; + + sflow = bpf_map_lookup_elem(&fq_stashed_flows, &index); + if (!sflow) + return 0; + + flow = bpf_kptr_xchg(&sflow->flow, flow); + if (flow) + bpf_obj_drop(flow); + + return 0; +} + +SEC("qdisc/reset") +void reset_prog(struct bpf_qdisc_ctx *ctx) +{ + bool unset_all = true; + fq_qlen = 0; + bpf_loop(NUM_QUEUE + 1, fq_reset_stashed_flows, NULL, 0); + bpf_loop(NUM_QUEUE, fq_reset_flows, NULL, 0); + bpf_loop(NUM_QUEUE, fq_unset_throttled_flows, &unset_all, 0); + return; +} + +char _license[] SEC("license") = "GPL"; diff --git a/samples/bpf/tc_sch_fq.c b/samples/bpf/tc_sch_fq.c new file mode 100644 index 000000000000..0e55d377ea33 --- /dev/null +++ b/samples/bpf/tc_sch_fq.c @@ -0,0 +1,321 @@ +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include + +#include "tc_sch_fq.skel.h" + +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, + va_list args) +{ + return vprintf(format, args); +} + +#define TCA_BUF_MAX (64 * 1024) +#define FILTER_NAMESZ 16 + +bool cleanup; +unsigned int ifindex; +unsigned int handle = 0x8000000; +unsigned int parent = TC_H_ROOT; +struct mnl_socket *nl; + +static void usage(const char *cmd) +{ + printf("Attach an fq eBPF qdisc and optionally an EDT rate limiter.\n"); + printf("Usage: %s [...]\n", cmd); + printf(" -d Device\n"); + printf(" -h Qdisc handle\n"); + printf(" -p Parent Qdisc handle\n"); + printf(" -s Share packet drop info with the clsact EDT rate limiter\n"); + printf(" -c Delete the qdisc before quit\n"); + printf(" -v Verbose\n"); +} + +static int get_tc_classid(__u32 *h, const char *str) +{ + unsigned long maj, min; + char *p; + + maj = TC_H_ROOT; + if (strcmp(str, "root") == 0) + goto ok; + maj = TC_H_UNSPEC; + if (strcmp(str, "none") == 0) + goto ok; + maj = strtoul(str, &p, 16); + if (p == str) { + maj = 0; + if (*p != ':') + return -1; + } + if (*p == ':') { + if (maj >= (1<<16)) + return -1; + maj <<= 16; + str = p+1; + min = strtoul(str, &p, 16); + if (*p != 0) + return -1; + if (min >= (1<<16)) + return -1; + maj |= min; + } else if (*p != 0) + return -1; + +ok: + *h = maj; + return 0; +} + +static int get_qdisc_handle(__u32 *h, const char *str) +{ + __u32 maj; + char *p; + + maj = TC_H_UNSPEC; + if (strcmp(str, "none") == 0) + goto ok; + maj = strtoul(str, &p, 16); + if (p == str || maj >= (1 << 16)) + return -1; + maj <<= 16; + if (*p != ':' && *p != 0) + return -1; +ok: + *h = maj; + return 0; +} + +static void sigdown(int signo) +{ + struct { + struct nlmsghdr n; + struct tcmsg t; + char buf[TCA_BUF_MAX]; + } req = 
{ + .n.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg)), + .n.nlmsg_flags = NLM_F_REQUEST, + .n.nlmsg_type = RTM_DELQDISC, + .t.tcm_family = AF_UNSPEC, + }; + + if (!cleanup) + exit(0); + + req.n.nlmsg_seq = time(NULL); + req.t.tcm_ifindex = ifindex; + req.t.tcm_parent = TC_H_ROOT; + req.t.tcm_handle = handle; + + if (mnl_socket_sendto(nl, &req.n, req.n.nlmsg_len) < 0) + exit(1); + + exit(0); +} + +static int qdisc_add_tc_sch_fq(struct tc_sch_fq *skel) +{ + char qdisc_type[FILTER_NAMESZ] = "bpf"; + char buf[MNL_SOCKET_BUFFER_SIZE]; + struct rtattr *option_attr; + const char *qdisc_name; + char prog_name[256]; + int ret; + unsigned int seq, portid; + struct { + struct nlmsghdr n; + struct tcmsg t; + char buf[TCA_BUF_MAX]; + } req = { + .n.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg)), + .n.nlmsg_flags = NLM_F_REQUEST | NLM_F_EXCL | NLM_F_CREATE, + .n.nlmsg_type = RTM_NEWQDISC, + .t.tcm_family = AF_UNSPEC, + }; + + seq = time(NULL); + portid = mnl_socket_get_portid(nl); + + qdisc_name = bpf_object__name(skel->obj); + + req.t.tcm_ifindex = ifindex; + req.t.tcm_parent = parent; + req.t.tcm_handle = handle; + mnl_attr_put_str(&req.n, TCA_KIND, qdisc_type); + + // eBPF Qdisc specific attributes + option_attr = (struct rtattr *)mnl_nlmsg_get_payload_tail(&req.n); + mnl_attr_put(&req.n, TCA_OPTIONS, 0, NULL); + mnl_attr_put_u32(&req.n, TCA_SCH_BPF_ENQUEUE_PROG_FD, + bpf_program__fd(skel->progs.enqueue_prog)); + snprintf(prog_name, sizeof(prog_name), "%s_enqueue", qdisc_name); + mnl_attr_put(&req.n, TCA_SCH_BPF_ENQUEUE_PROG_NAME, strlen(prog_name) + 1, prog_name); + + mnl_attr_put_u32(&req.n, TCA_SCH_BPF_DEQUEUE_PROG_FD, + bpf_program__fd(skel->progs.dequeue_prog)); + snprintf(prog_name, sizeof(prog_name), "%s_dequeue", qdisc_name); + mnl_attr_put(&req.n, TCA_SCH_BPF_DEQUEUE_PROG_NAME, strlen(prog_name) + 1, prog_name); + + mnl_attr_put_u32(&req.n, TCA_SCH_BPF_RESET_PROG_FD, + bpf_program__fd(skel->progs.reset_prog)); + snprintf(prog_name, sizeof(prog_name), "%s_reset", qdisc_name); + mnl_attr_put(&req.n, TCA_SCH_BPF_RESET_PROG_NAME, strlen(prog_name) + 1, prog_name); + + option_attr->rta_len = (void *)mnl_nlmsg_get_payload_tail(&req.n) - + (void *)option_attr; + + if (mnl_socket_sendto(nl, &req.n, req.n.nlmsg_len) < 0) { + perror("mnl_socket_sendto"); + return -1; + } + + for (;;) { + ret = mnl_socket_recvfrom(nl, buf, sizeof(buf)); + if (ret == -1) { + if (errno == ENOBUFS || errno == EINTR) + continue; + + if (errno == EAGAIN) { + errno = 0; + ret = 0; + break; + } + + perror("mnl_socket_recvfrom"); + return -1; + } + + ret = mnl_cb_run(buf, ret, seq, portid, NULL, NULL); + if (ret < 0) { + perror("mnl_cb_run"); + return -1; + } + } + + return 0; +} + +int main(int argc, char **argv) +{ + LIBBPF_OPTS(bpf_object_open_opts, opts, .kernel_log_level = 2); + bool verbose = false, share = false; + struct tc_sch_fq *skel = NULL; + struct stat stat_buf = {}; + char d[IFNAMSIZ] = "lo"; + int opt, ret = 1; + struct sigaction sa = { + .sa_handler = sigdown, + }; + + while ((opt = getopt(argc, argv, "d:h:p:csv")) != -1) { + switch (opt) { + /* General args */ + case 'd': + strncpy(d, optarg, sizeof(d)-1); + break; + case 'h': + ret = get_qdisc_handle(&handle, optarg); + if (ret) { + printf("Invalid qdisc handle\n"); + return 1; + } + break; + case 'p': + ret = get_tc_classid(&parent, optarg); + if (ret) { + printf("Invalid parent qdisc handle\n"); + return 1; + } + break; + case 'c': + cleanup = true; + break; + case 's': + share = true; + break; + case 'v': + verbose = true; + break; + default: + 
usage(argv[0]); + return 1; + } + } + + nl = mnl_socket_open(NETLINK_ROUTE); + if (!nl) { + perror("mnl_socket_open"); + return 1; + } + + ret = mnl_socket_bind(nl, 0, MNL_SOCKET_AUTOPID); + if (ret < 0) { + perror("mnl_socket_bind"); + ret = 1; + goto out; + } + + ifindex = if_nametoindex(d); + if (errno == ENODEV) { + fprintf(stderr, "No such device: %s\n", d); + goto out; + } + + if (sigaction(SIGINT, &sa, NULL) || sigaction(SIGTERM, &sa, NULL)) + goto out; + + if (verbose) + libbpf_set_print(libbpf_print_fn); + + skel = tc_sch_fq__open_opts(&opts); + if (!skel) { + perror("Failed to open tc_sch_fq"); + goto out; + } + + if (share) { + if (stat("/sys/fs/bpf/tc", &stat_buf) == -1) + mkdir("/sys/fs/bpf/tc", 0700); + + mkdir("/sys/fs/bpf/tc/globals", 0700); + + bpf_map__set_pin_path(skel->maps.rate_map, "/sys/fs/bpf/tc/globals/rate_map"); + bpf_map__set_pin_path(skel->maps.comp_map, "/sys/fs/bpf/tc/globals/comp_map"); + + skel->bss->q_compensate_tstamp = true; + skel->bss->q_random_drop = true; + } + + ret = tc_sch_fq__load(skel); + if (ret) { + perror("Failed to load tc_sch_fq"); + ret = 1; + goto out_destroy; + } + + ret = qdisc_add_tc_sch_fq(skel); + if (ret < 0) { + perror("Failed to create qdisc"); + ret = 1; + goto out_destroy; + } + + for (;;) + pause(); + +out_destroy: + tc_sch_fq__destroy(skel); +out: + mnl_socket_close(nl); + return ret; +}
From patchwork Wed Jan 17 21:56:24 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Amery Hung X-Patchwork-Id: 13522191 X-Patchwork-Delegate: bpf@iogearbox.net From: Amery Hung X-Google-Original-From: Amery Hung To: netdev@vger.kernel.org Cc: bpf@vger.kernel.org, yangpeihao@sjtu.edu.cn, toke@redhat.com, jhs@mojatatu.com, jiri@resnulli.us, sdf@google.com, xiyou.wangcong@gmail.com, yepeilin.cs@gmail.com Subject: [RFC PATCH v7 8/8] samples/bpf: Add an example of bpf netem qdisc Date: Wed, 17 Jan 2024 21:56:24 +0000 Message-Id: <5d01cc9a45d3f537a5bf5eb197567d5bcd6b936e.1705432850.git.amery.hung@bytedance.com> X-Mailer: git-send-email 2.20.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC
tc_sch_netem.bpf.c A simple bpf network emulator (netem) qdisc that simulates packet drops, loss, and delay. The qdisc shares the state machine of the Gilbert-Elliott loss model via an eBPF map when it is added to multiple tx queues.
tc_sch_netem.c A user space program to load and attach the eBPF-based netem qdisc. By default it adds the bpf netem to the loopback device, but it can also be added to another device and parent class with the '-d' and '-p' options.
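The Gilbert-Elliott model mentioned above is the classic two-state Markov loss model. A self-contained sketch of the model in its textbook form (illustrative only; the mapping onto this patch's clg_state parameters may differ):

#include <stdbool.h>
#include <stdlib.h>

/* Two states: in GOOD, packets are lost with probability 1-k and the
 * chain moves to BAD with probability p; in BAD, packets are lost with
 * probability 1-h and the chain moves back to GOOD with probability r.
 */
enum ge_state { GE_GOOD, GE_BAD };

struct ge_model {
	enum ge_state state;
	double p, r; /* transition probabilities */
	double k, h; /* per-state delivery probabilities */
};

static double ge_rand(void)
{
	return (double)rand() / RAND_MAX;
}

static bool ge_lose_packet(struct ge_model *m)
{
	if (m->state == GE_GOOD) {
		if (ge_rand() < m->p)
			m->state = GE_BAD;
		return ge_rand() > m->k; /* lose with probability 1-k */
	}
	if (ge_rand() < m->r)
		m->state = GE_GOOD;
	return ge_rand() > m->h; /* lose with probability 1-h */
}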
To test mq + netem with a shared state machine:

$ tc qdisc add dev ens5 root handle 1: mq
$ ./tc_sch_netem -d ens5 -p 1:1 -h 801 -s
$ ./tc_sch_netem -d ens5 -p 1:2 -h 802 -s
$ ./tc_sch_netem -d ens5 -p 1:3 -h 803 -s
$ ./tc_sch_netem -d ens5 -p 1:4 -h 804 -s

Signed-off-by: Amery Hung
---
 samples/bpf/Makefile           |   8 +-
 samples/bpf/tc_sch_netem.bpf.c | 256 ++++++++++++++++++++++++
 samples/bpf/tc_sch_netem.c     | 347 +++++++++++++++++++++++++++++++++
 3 files changed, 610 insertions(+), 1 deletion(-)
 create mode 100644 samples/bpf/tc_sch_netem.bpf.c
 create mode 100644 samples/bpf/tc_sch_netem.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index ea516a00352d..880f15ae4bed 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -47,6 +47,7 @@ tprogs-y += task_fd_query
 tprogs-y += ibumad
 tprogs-y += hbm
 tprogs-y += tc_sch_fq
+tprogs-y += tc_sch_netem
 
 # Libbpf dependencies
 LIBBPF_SRC = $(TOOLS_PATH)/lib/bpf
@@ -100,6 +101,7 @@ hbm-objs := hbm.o $(CGROUP_HELPERS)
 
 xdp_router_ipv4-objs := xdp_router_ipv4_user.o $(XDP_SAMPLE)
 tc_sch_fq-objs := tc_sch_fq.o
+tc_sch_netem-objs := tc_sch_netem.o
 
 # Tell kbuild to always build the programs
 always-y := $(tprogs-y)
@@ -152,6 +154,7 @@ always-y += ibumad_kern.o
 always-y += hbm_out_kern.o
 always-y += hbm_edt_kern.o
 always-y += tc_sch_fq.bpf.o
+always-y += tc_sch_netem.bpf.o
 
 TPROGS_CFLAGS = $(TPROGS_USER_CFLAGS)
 TPROGS_LDFLAGS = $(TPROGS_USER_LDFLAGS)
@@ -199,6 +202,7 @@ TPROGLDLIBS_trace_output	+= -lrt
 TPROGLDLIBS_map_perf_test	+= -lrt
 TPROGLDLIBS_test_overhead	+= -lrt
 TPROGLDLIBS_tc_sch_fq		+= -lmnl
+TPROGLDLIBS_tc_sch_netem	+= -lmnl
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 # make M=samples/bpf LLC=~/git/llvm-project/llvm/build/bin/llc CLANG=~/git/llvm-project/llvm/build/bin/clang
@@ -311,6 +315,7 @@ $(obj)/$(TRACE_HELPERS) $(obj)/$(CGROUP_HELPERS) $(obj)/$(XDP_SAMPLE): | libbpf_
 
 $(obj)/xdp_router_ipv4_user.o: $(obj)/xdp_router_ipv4.skel.h
 $(obj)/tc_sch_fq.o: $(obj)/tc_sch_fq.skel.h
+$(obj)/tc_sch_netem.o: $(obj)/tc_sch_netem.skel.h
 
 $(obj)/tracex5.bpf.o: $(obj)/syscall_nrs.h
 $(obj)/hbm_out_kern.o: $(src)/hbm.h $(src)/hbm_kern.h
@@ -375,11 +380,12 @@ $(obj)/%.bpf.o: $(src)/%.bpf.c $(obj)/vmlinux.h $(src)/xdp_sample.bpf.h $(src)/x
 		-I$(LIBBPF_INCLUDE) $(CLANG_SYS_INCLUDES) \
 		-c $(filter %.bpf.c,$^) -o $@
 
-LINKED_SKELS := xdp_router_ipv4.skel.h tc_sch_fq.skel.h
+LINKED_SKELS := xdp_router_ipv4.skel.h tc_sch_fq.skel.h tc_sch_netem.skel.h
 clean-files += $(LINKED_SKELS)
 
 xdp_router_ipv4.skel.h-deps := xdp_router_ipv4.bpf.o xdp_sample.bpf.o
 tc_sch_fq.skel.h-deps := tc_sch_fq.bpf.o
+tc_sch_netem.skel.h-deps := tc_sch_netem.bpf.o
 
 LINKED_BPF_SRCS := $(patsubst %.bpf.o,%.bpf.c,$(foreach skel,$(LINKED_SKELS),$($(skel)-deps)))
diff --git a/samples/bpf/tc_sch_netem.bpf.c b/samples/bpf/tc_sch_netem.bpf.c
new file mode 100644
index 000000000000..b4db382f2c58
--- /dev/null
+++ b/samples/bpf/tc_sch_netem.bpf.c
@@ -0,0 +1,256 @@
+#include "vmlinux.h"
+#include "bpf_experimental.h"
+#include <bpf/bpf_helpers.h>
+
+#define NETEM_DIST_SCALE 8192
+
+#define NS_PER_SEC 1000000000
+
+int q_loss_model = CLG_GILB_ELL;
+unsigned int q_limit = 1000;
+signed long q_latency = 0;
+signed long q_jitter = 0;
+unsigned int q_loss = 1;
+unsigned int q_qlen = 0;
+
+struct crndstate q_loss_cor = {.last = 0, .rho = 0,};
+struct crndstate q_delay_cor = {.last = 0, .rho = 0,};
+
+struct skb_node {
+	u64 tstamp;
+	struct sk_buff __kptr *skb;
+	struct bpf_rb_node node;
+};
+
+struct clg_state {
+	u64 state;
+	u32 a1;
+	u32 a2;
+	u32 a3;
+	u32 a4;
+	u32 a5;
+};
+
+static bool skbn_tstamp_less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+	struct skb_node *skb_a;
+	struct skb_node *skb_b;
+
+	skb_a = container_of(a, struct skb_node, node);
+	skb_b = container_of(b, struct skb_node, node);
+
+	return skb_a->tstamp < skb_b->tstamp;
+}
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__type(key, __u32);
+	__type(value, struct clg_state);
+	__uint(max_entries, 1);
+} g_clg_state SEC(".maps");
+
+#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+
+private(A) struct bpf_spin_lock t_root_lock;
+private(A) struct bpf_rb_root t_root __contains(skb_node, node);
+
+struct sk_buff *bpf_skb_acquire(struct sk_buff *p) __ksym;
+void bpf_skb_release(struct sk_buff *p) __ksym;
+u32 bpf_skb_get_hash(struct sk_buff *p) __ksym;
+void bpf_qdisc_set_skb_dequeue(struct sk_buff *p) __ksym;
+
+static __always_inline u32 get_crandom(struct crndstate *state)
+{
+	u64 value, rho;
+	unsigned long answer;
+
+	if (!state || state->rho == 0) /* no correlation */
+		return bpf_get_prandom_u32();
+
+	value = bpf_get_prandom_u32();
+	rho = (u64)state->rho + 1;
+	answer = (value * ((1ull<<32) - rho) + state->last * rho) >> 32;
+	state->last = answer;
+	return answer;
+}
+
+static __always_inline s64 tabledist(s64 mu, s32 sigma, struct crndstate *state)
+{
+	u32 rnd;
+
+	if (sigma == 0)
+		return mu;
+
+	rnd = get_crandom(state);
+
+	/* default uniform distribution */
+	return ((rnd % (2 * (u32)sigma)) + mu) - sigma;
+}
+
+static __always_inline bool loss_gilb_ell(void)
+{
+	struct clg_state *clg;
+	u32 r1, r2, key = 0;
+	bool ret = false;
+
+	clg = bpf_map_lookup_elem(&g_clg_state, &key);
+	if (!clg)
+		return false;
+
+	r1 = bpf_get_prandom_u32();
+	r2 = bpf_get_prandom_u32();
+
+	switch (clg->state) {
+	case GOOD_STATE:
+		if (r1 < clg->a1)
+			__sync_val_compare_and_swap(&clg->state,
+						    GOOD_STATE, BAD_STATE);
+		if (r2 < clg->a4)
+			ret = true;
+		break;
+	case BAD_STATE:
+		if (r1 < clg->a2)
+			__sync_val_compare_and_swap(&clg->state,
+						    BAD_STATE, GOOD_STATE);
+		if (r2 > clg->a3)
+			ret = true;
+		break;
+	}
+
+	return ret;
+}
+
+static __always_inline bool loss_event(void)
+{
+	switch (q_loss_model) {
+	case CLG_RANDOM:
+		return q_loss && q_loss >= get_crandom(&q_loss_cor);
+	case CLG_GILB_ELL:
+		return loss_gilb_ell();
+	}
+
+	return false;
+}
+
+static __always_inline void tfifo_enqueue(struct skb_node *skbn)
+{
+	bpf_spin_lock(&t_root_lock);
+	bpf_rbtree_add(&t_root, &skbn->node, skbn_tstamp_less);
+	bpf_spin_unlock(&t_root_lock);
+}
+
+SEC("qdisc/enqueue")
+int enqueue_prog(struct bpf_qdisc_ctx *ctx)
+{
+	struct sk_buff *old, *skb = ctx->skb;
+	struct skb_node *skbn;
+	int count = 1;
+	s64 delay = 0;
+	u64 now;
+
+	if (loss_event())
+		--count;
+
+	if (count == 0)
+		return SCH_BPF_BYPASS;
+
+	/* Check the limit before accounting the packet so that drops do
+	 * not inflate q_qlen permanently; q_qlen is only decremented on
+	 * a successful dequeue.
+	 */
+	if (q_qlen >= q_limit)
+		return SCH_BPF_DROP;
+
+	skb = bpf_skb_acquire(ctx->skb);
+	skbn = bpf_obj_new(typeof(*skbn));
+	if (!skbn) {
+		bpf_skb_release(skb);
+		return SCH_BPF_DROP;
+	}
+
+	delay = tabledist(q_latency, q_jitter, &q_delay_cor);
+
+	now = bpf_ktime_get_ns();
+
+	skbn->tstamp = now + delay;
+	old = bpf_kptr_xchg(&skbn->skb, skb);
+	if (old)
+		bpf_skb_release(old);
+
+	tfifo_enqueue(skbn);
+	q_qlen++;
+	return SCH_BPF_QUEUED;
+}
+
+SEC("qdisc/dequeue")
+int dequeue_prog(struct bpf_qdisc_ctx *ctx)
+{
+	struct bpf_rb_node *node = NULL;
+	struct sk_buff *skb = NULL;
+	struct skb_node *skbn;
+	u64 now;
+
+	now = bpf_ktime_get_ns();
+
+	bpf_spin_lock(&t_root_lock);
+	node = bpf_rbtree_first(&t_root);
+	if (!node) {
+		bpf_spin_unlock(&t_root_lock);
+		return SCH_BPF_DROP;
+	}
+
+	skbn = container_of(node, struct skb_node, node);
+	if (skbn->tstamp <= now) {
+		node = bpf_rbtree_remove(&t_root, &skbn->node);
+		bpf_spin_unlock(&t_root_lock);
+
+		if (!node)
+			return SCH_BPF_DROP;
+
+		skbn = container_of(node, struct skb_node, node);
+		skb = bpf_kptr_xchg(&skbn->skb, skb);
+		if (!skb) {
+			bpf_obj_drop(skbn);
+			return SCH_BPF_DROP;
+		}
+
+		bpf_qdisc_set_skb_dequeue(skb);
+		bpf_obj_drop(skbn);
+
+		q_qlen--;
+		return SCH_BPF_DEQUEUED;
+	}
+
+	ctx->expire = skbn->tstamp;
+	bpf_spin_unlock(&t_root_lock);
+	return SCH_BPF_THROTTLE;
+}
+
+static int reset_queue(u32 index, void *ctx)
+{
+	struct bpf_rb_node *node = NULL;
+	struct skb_node *skbn;
+
+	bpf_spin_lock(&t_root_lock);
+	node = bpf_rbtree_first(&t_root);
+	if (!node) {
+		bpf_spin_unlock(&t_root_lock);
+		return 1;
+	}
+
+	skbn = container_of(node, struct skb_node, node);
+	node = bpf_rbtree_remove(&t_root, &skbn->node);
+	bpf_spin_unlock(&t_root_lock);
+
+	if (!node)
+		return 1;
+
+	skbn = container_of(node, struct skb_node, node);
+	bpf_obj_drop(skbn);
+	return 0;
+}
+
+SEC("qdisc/reset")
+void reset_prog(struct bpf_qdisc_ctx *ctx)
+{
+	bpf_loop(q_limit, reset_queue, NULL, 0);
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/tc_sch_netem.c b/samples/bpf/tc_sch_netem.c
new file mode 100644
index 000000000000..918d626909d3
--- /dev/null
+++ b/samples/bpf/tc_sch_netem.c
@@ -0,0 +1,347 @@
+#include <errno.h>
+#include <limits.h>
+#include <signal.h>
+#include <stdarg.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <time.h>
+#include <unistd.h>
+#include <net/if.h>
+#include <sys/stat.h>
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+#include <libmnl/libmnl.h>
+#include <linux/pkt_sched.h>
+#include <linux/rtnetlink.h>
+
+struct crndstate {
+	__u32 last;
+	__u32 rho;
+};
+
+struct clg_state {
+	__u64 state;
+	__u32 a1;
+	__u32 a2;
+	__u32 a3;
+	__u32 a4;
+	__u32 a5;
+};
+
+#include "tc_sch_netem.skel.h"
+
+static int libbpf_print_fn(enum libbpf_print_level level, const char *format,
+			   va_list args)
+{
+	return vprintf(format, args);
+}
+
+#define TCA_BUF_MAX (64 * 1024)
+#define FILTER_NAMESZ 16
+
+bool cleanup;
+unsigned int ifindex;
+unsigned int handle = 0x8000000;
+unsigned int parent = TC_H_ROOT;
+struct mnl_socket *nl;
+
+static void usage(const char *cmd)
+{
+	printf("Attach an eBPF-based netem qdisc.\n");
+	printf("Usage: %s [...]\n", cmd);
+	printf("	-d	Device\n");
+	printf("	-h	Qdisc handle\n");
+	printf("	-p	Parent Qdisc handle\n");
+	printf("	-s	Share a global Gilbert-Elliott state machine\n");
+	printf("	-c	Delete the qdisc before quitting\n");
+	printf("	-v	Verbose\n");
+}
+
+static int get_tc_classid(__u32 *h, const char *str)
+{
+	unsigned long maj, min;
+	char *p;
+
+	maj = TC_H_ROOT;
+	if (strcmp(str, "root") == 0)
+		goto ok;
+	maj = TC_H_UNSPEC;
+	if (strcmp(str, "none") == 0)
+		goto ok;
+	maj = strtoul(str, &p, 16);
+	if (p == str) {
+		maj = 0;
+		if (*p != ':')
+			return -1;
+	}
+	if (*p == ':') {
+		if (maj >= (1<<16))
+			return -1;
+		maj <<= 16;
+		str = p+1;
+		min = strtoul(str, &p, 16);
+		if (*p != 0)
+			return -1;
+		if (min >= (1<<16))
+			return -1;
+		maj |= min;
+	} else if (*p != 0)
+		return -1;
+
+ok:
+	*h = maj;
+	return 0;
+}
+
+static int get_qdisc_handle(__u32 *h, const char *str)
+{
+	__u32 maj;
+	char *p;
+
+	maj = TC_H_UNSPEC;
+	if (strcmp(str, "none") == 0)
+		goto ok;
+	maj = strtoul(str, &p, 16);
+	if (p == str || maj >= (1 << 16))
+		return -1;
+	maj <<= 16;
+	if (*p != ':' && *p != 0)
+		return -1;
+ok:
+	*h = maj;
+	return 0;
+}
+
+static void sigdown(int signo)
+{
+	struct {
+		struct nlmsghdr n;
+		struct tcmsg t;
+		char buf[TCA_BUF_MAX];
+	} req = {
+		.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg)),
+		.n.nlmsg_flags = NLM_F_REQUEST,
+		.n.nlmsg_type = RTM_DELQDISC,
+		.t.tcm_family = AF_UNSPEC,
+	};
+
+	if (!cleanup)
+		exit(0);
+
+	req.n.nlmsg_seq = time(NULL);
+	req.t.tcm_ifindex = ifindex;
+	req.t.tcm_parent = parent;
+	req.t.tcm_handle = handle;
+
+	if (mnl_socket_sendto(nl, &req.n, req.n.nlmsg_len) < 0)
+		exit(1);
+
+	exit(0);
+}
+
+static int qdisc_add_tc_sch_netem(struct tc_sch_netem *skel)
+{
+	char qdisc_type[FILTER_NAMESZ] = "bpf";
+	char buf[MNL_SOCKET_BUFFER_SIZE];
+	struct rtattr *option_attr;
+	const char *qdisc_name;
+	char prog_name[256];
+	int ret;
+	unsigned int seq, portid;
+	struct {
+		struct nlmsghdr n;
+		struct tcmsg t;
+		char buf[TCA_BUF_MAX];
+	} req = {
+		.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg)),
+		.n.nlmsg_flags = NLM_F_REQUEST | NLM_F_EXCL | NLM_F_CREATE,
+		.n.nlmsg_type = RTM_NEWQDISC,
+		.t.tcm_family = AF_UNSPEC,
+	};
+
+	seq = time(NULL);
+	portid = mnl_socket_get_portid(nl);
+
+	qdisc_name = bpf_object__name(skel->obj);
+
+	/* set the sequence number on the request so replies can be
+	 * matched in mnl_cb_run() below
+	 */
+	req.n.nlmsg_seq = seq;
+	req.t.tcm_ifindex = ifindex;
+	req.t.tcm_parent = parent;
+	req.t.tcm_handle = handle;
+	mnl_attr_put_str(&req.n, TCA_KIND, qdisc_type);
+
+	/* eBPF Qdisc specific attributes */
+	option_attr = (struct rtattr *)mnl_nlmsg_get_payload_tail(&req.n);
+	mnl_attr_put(&req.n, TCA_OPTIONS, 0, NULL);
+	mnl_attr_put_u32(&req.n, TCA_SCH_BPF_ENQUEUE_PROG_FD,
+			 bpf_program__fd(skel->progs.enqueue_prog));
+	snprintf(prog_name, sizeof(prog_name), "%s_enqueue", qdisc_name);
+	mnl_attr_put(&req.n, TCA_SCH_BPF_ENQUEUE_PROG_NAME, strlen(prog_name) + 1, prog_name);
+
+	mnl_attr_put_u32(&req.n, TCA_SCH_BPF_DEQUEUE_PROG_FD,
+			 bpf_program__fd(skel->progs.dequeue_prog));
+	snprintf(prog_name, sizeof(prog_name), "%s_dequeue", qdisc_name);
+	mnl_attr_put(&req.n, TCA_SCH_BPF_DEQUEUE_PROG_NAME, strlen(prog_name) + 1, prog_name);
+
+	mnl_attr_put_u32(&req.n, TCA_SCH_BPF_RESET_PROG_FD,
+			 bpf_program__fd(skel->progs.reset_prog));
+	snprintf(prog_name, sizeof(prog_name), "%s_reset", qdisc_name);
+	mnl_attr_put(&req.n, TCA_SCH_BPF_RESET_PROG_NAME, strlen(prog_name) + 1, prog_name);
+
+	option_attr->rta_len = (void *)mnl_nlmsg_get_payload_tail(&req.n) -
+			       (void *)option_attr;
+
+	if (mnl_socket_sendto(nl, &req.n, req.n.nlmsg_len) < 0) {
+		perror("mnl_socket_sendto");
+		return -1;
+	}
+
+	for (;;) {
+		ret = mnl_socket_recvfrom(nl, buf, sizeof(buf));
+		if (ret == -1) {
+			if (errno == ENOBUFS || errno == EINTR)
+				continue;
+
+			if (errno == EAGAIN) {
+				errno = 0;
+				ret = 0;
+				break;
+			}
+
+			perror("mnl_socket_recvfrom");
+			return -1;
+		}
+
+		ret = mnl_cb_run(buf, ret, seq, portid, NULL, NULL);
+		if (ret < 0) {
+			perror("mnl_cb_run");
+			return -1;
+		}
+	}
+
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	LIBBPF_OPTS(bpf_object_open_opts, opts, .kernel_log_level = 2);
+	bool verbose = false, share = false, state_init = false;
+	struct tc_sch_netem *skel = NULL;
+	struct clg_state state = {};
+	struct stat stat_buf = {};
+	int opt, ret = 1, key = 0;
+	char d[IFNAMSIZ] = "lo";
+	struct sigaction sa = {
+		.sa_handler = sigdown,
+	};
+
+	while ((opt = getopt(argc, argv, "d:h:p:csv")) != -1) {
+		switch (opt) {
+		/* General args */
+		case 'd':
+			strncpy(d, optarg, sizeof(d)-1);
+			break;
+		case 'h':
+			ret = get_qdisc_handle(&handle, optarg);
+			if (ret) {
+				printf("Invalid qdisc handle\n");
+				return 1;
+			}
+			break;
+		case 'p':
+			ret = get_tc_classid(&parent, optarg);
+			if (ret) {
+				printf("Invalid parent qdisc handle\n");
+				return 1;
+			}
+			break;
+		case 'c':
+			cleanup = true;
+			break;
+		case 's':
+			share = true;
+			break;
+		case 'v':
+			verbose = true;
+			break;
+		default:
+			usage(argv[0]);
+			return 1;
+		}
+	}
+
+	nl = mnl_socket_open(NETLINK_ROUTE);
+	if (!nl) {
+		perror("mnl_socket_open");
+		return 1;
+	}
+
+	ret = mnl_socket_bind(nl, 0, MNL_SOCKET_AUTOPID);
+	if (ret < 0) {
+		perror("mnl_socket_bind");
+		ret = 1;
+		goto out;
+	}
+
+	/* if_nametoindex() returns 0 on failure; errno is only set on error,
+	 * so testing errno alone can misfire on stale values.
+	 */
+	ifindex = if_nametoindex(d);
+	if (!ifindex) {
+		fprintf(stderr, "No such device: %s\n", d);
+		ret = 1;
+		goto out;
+	}
+
+	if (sigaction(SIGINT, &sa, NULL) || sigaction(SIGTERM, &sa, NULL)) {
+		ret = 1;
+		goto out;
+	}
+
+	if (verbose)
+		libbpf_set_print(libbpf_print_fn);
+
+	skel = tc_sch_netem__open_opts(&opts);
+	if (!skel) {
+		perror("Failed to open tc_sch_netem");
+		ret = 1;
+		goto out;
+	}
+
+	if (share) {
+		if (stat("/sys/fs/bpf/tc", &stat_buf) == -1)
+			mkdir("/sys/fs/bpf/tc", 0700);
+
+		mkdir("/sys/fs/bpf/tc/globals", 0700);
+
+		bpf_map__set_pin_path(skel->maps.g_clg_state, "/sys/fs/bpf/tc/globals/g_clg_state");
+	}
+
+	ret = tc_sch_netem__load(skel);
+	if (ret) {
+		perror("Failed to load tc_sch_netem");
+		ret = 1;
+		goto out_destroy;
+	}
+
+	if (!state_init) {
+		state.state = 1;
+		state.a1 = (double)UINT_MAX * 0.05;
+		state.a2 = (double)UINT_MAX * 0.95;
+		state.a3 = (double)UINT_MAX * 0.30;
+		state.a4 = (double)UINT_MAX * 0.001;
+
+		bpf_map__update_elem(skel->maps.g_clg_state, &key, sizeof(key), &state,
+				     sizeof(state), 0);
+
+		state_init = true;
+	}
+
+	ret = qdisc_add_tc_sch_netem(skel);
+	if (ret < 0) {
+		perror("Failed to create qdisc");
+		ret = 1;
+		goto out_destroy;
+	}
+
+	for (;;)
+		pause();
+
+out_destroy:
+	tc_sch_netem__destroy(skel);
+out:
+	mnl_socket_close(nl);
+	return ret;
+}
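A rough sanity check of the defaults above, assuming the state machine
is stepped once per enqueued packet as loss_gilb_ell() does: the
stationary probability of the bad state is a1 / (a1 + a2) =
0.05 / (0.05 + 0.95) = 5%. A packet is lost with probability a4 = 0.1%
while in the good state and 1 - a3 = 70% while in the bad state (the
r2 > a3 test), so the long-run loss rate is roughly
0.95 * 0.001 + 0.05 * 0.70, or about 3.6%.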