[RFC,v7,1/8] net_sched: Introduce eBPF based Qdisc

Message ID	232881645a5c4c05a35df4ff1f08a19ef9a02662.1705432850.git.amery.hung@bytedance.com (mailing list archive)
State	RFC
Delegated to:	BPF
Headers	show Received: from mail-qt1-f171.google.com (mail-qt1-f171.google.com [209.85.160.171]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DC10C2557E; Wed, 17 Jan 2024 21:56:27 +0000 (UTC) From: Amery Hung <ameryhung@gmail.com> To: netdev@vger.kernel.org Cc: bpf@vger.kernel.org, yangpeihao@sjtu.edu.cn, toke@redhat.com, jhs@mojatatu.com, jiri@resnulli.us, sdf@google.com, xiyou.wangcong@gmail.com, yepeilin.cs@gmail.com Subject: [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc Date: Wed, 17 Jan 2024 21:56:17 +0000 Message-Id: <232881645a5c4c05a35df4ff1f08a19ef9a02662.1705432850.git.amery.hung@bytedance.com> In-Reply-To: <cover.1705432850.git.amery.hung@bytedance.com> References: <cover.1705432850.git.amery.hung@bytedance.com> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit
Series	net_sched: Introduce eBPF based Qdisc \| expand [RFC,v7,0/8] net_sched: Introduce eBPF based Qdisc [RFC,v7,1/8] net_sched: Introduce eBPF based Qdisc [RFC,v7,2/8] net_sched: Add kfuncs for working with skb [RFC,v7,3/8] net_sched: Introduce kfunc bpf_skb_tc_classify() [RFC,v7,4/8] net_sched: Add reset program [RFC,v7,5/8] net_sched: Add init program [RFC,v7,6/8] tools/libbpf: Add support for BPF_PROG_TYPE_QDISC [RFC,v7,7/8] samples/bpf: Add an example of bpf fq qdisc [RFC,v7,8/8] samples/bpf: Add an example of bpf netem qdisc

Context	Check	Description
netdev/tree_selection	success	Guessing tree name failed - patch did not apply, async
bpf/vmtest-bpf-PR	fail	merge-conflict
bpf/vmtest-bpf-VM_Test-12	success	Logs for s390x-gcc / build-release
bpf/vmtest-bpf-VM_Test-13	pending	Logs for s390x-gcc / test (test_maps, false, 360) / test_maps on s390x with gcc
bpf/vmtest-bpf-VM_Test-5	success	Logs for aarch64-gcc / build-release
bpf/vmtest-bpf-VM_Test-1	success	Logs for ShellCheck
bpf/vmtest-bpf-VM_Test-4	success	Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-VM_Test-2	success	Logs for Unittests
bpf/vmtest-bpf-VM_Test-3	success	Logs for Validate matrix.py
bpf/vmtest-bpf-VM_Test-8	success	Logs for aarch64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-14	success	Logs for s390x-gcc / test (test_progs, false, 360) / test_progs on s390x with gcc
bpf/vmtest-bpf-VM_Test-11	success	Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-VM_Test-10	success	Logs for aarch64-gcc / veristat
bpf/vmtest-bpf-VM_Test-9	success	Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-0	success	Logs for Lint
bpf/vmtest-bpf-VM_Test-15	success	Logs for s390x-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-VM_Test-7	success	Logs for aarch64-gcc / test (test_progs, false, 360) / test_progs on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-6	success	Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-16	success	Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-VM_Test-17	success	Logs for s390x-gcc / veristat
bpf/vmtest-bpf-VM_Test-18	success	Logs for set-matrix
bpf/vmtest-bpf-VM_Test-19	success	Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-VM_Test-20	success	Logs for x86_64-gcc / build-release
bpf/vmtest-bpf-VM_Test-21	success	Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-22	success	Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-23	success	Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-24	success	Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-25	success	Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-26	success	Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-27	success	Logs for x86_64-gcc / veristat / veristat on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-28	success	Logs for x86_64-llvm-17 / build / build for x86_64 with llvm-17
bpf/vmtest-bpf-VM_Test-29	success	Logs for x86_64-llvm-17 / build-release / build for x86_64 with llvm-17 and -O2 optimization
bpf/vmtest-bpf-VM_Test-30	success	Logs for x86_64-llvm-17 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-17
bpf/vmtest-bpf-VM_Test-31	success	Logs for x86_64-llvm-17 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-17
bpf/vmtest-bpf-VM_Test-32	success	Logs for x86_64-llvm-17 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-17
bpf/vmtest-bpf-VM_Test-33	success	Logs for x86_64-llvm-17 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-17
bpf/vmtest-bpf-VM_Test-34	success	Logs for x86_64-llvm-17 / veristat
bpf/vmtest-bpf-VM_Test-35	success	Logs for x86_64-llvm-18 / build / build for x86_64 with llvm-18
bpf/vmtest-bpf-VM_Test-36	success	Logs for x86_64-llvm-18 / build-release / build for x86_64 with llvm-18 and -O2 optimization
bpf/vmtest-bpf-VM_Test-37	success	Logs for x86_64-llvm-18 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-18
bpf/vmtest-bpf-VM_Test-38	success	Logs for x86_64-llvm-18 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-18
bpf/vmtest-bpf-VM_Test-39	success	Logs for x86_64-llvm-18 / test (test_progs_cpuv4, false, 360) / test_progs_cpuv4 on x86_64 with llvm-18
bpf/vmtest-bpf-VM_Test-40	success	Logs for x86_64-llvm-18 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-18
bpf/vmtest-bpf-VM_Test-41	success	Logs for x86_64-llvm-18 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-18
bpf/vmtest-bpf-VM_Test-42	success	Logs for x86_64-llvm-18 / veristat

Amery Hung Jan. 17, 2024, 9:56 p.m. UTC

From: Cong Wang <xiyou.wangcong@gmail.com>

Introduce a new Qdisc which is completely managed by eBPF program
of type BPF_PROG_TYPE_QDISC. It accepts two eBPF programs of
the same type, but one for enqueue and the other for dequeue.

It interacts with Qdisc layer in two ways:
1) It relies on Qdisc watchdog to handle throttling;
2) It could pass the skb enqueue/dequeue down to child classes

The context is used differently for enqueue and dequeue, as shown below:

┌──────────┬───────────────┬──────────────────────────────────┐
│ prog     │     input     │              output              │
├──────────┼───────────────┼──────────────────────────────────┤
│          │ ctx->skb      │ SCH_BPF_THROTTLE: ctx->expire    │
│          │               │                   ctx->delta_ns  │
│          │ ctx->classid  │                                  │
│          │               │ SCH_BPF_QUEUED: None             │
│          │               │                                  │
│          │               │ SCH_BPF_BYPASS: None             │
│ enqueue  │               │                                  │
│          │               │ SCH_BPF_STOLEN: None             │
│          │               │                                  │
│          │               │ SCH_BPF_DROP: None               │
│          │               │                                  │
│          │               │ SCH_BPF_CN: None                 │
│          │               │                                  │
│          │               │ SCH_BPF_PASS: ctx->classid       │
├──────────┼───────────────┼──────────────────────────────────┤
│          │ ctx->classid  │ SCH_BPF_THROTTLE: ctx->expire    │
│          │               │                   ctx->delta_ns  │
│          │               │                                  │
│ dequeue  │               │ SCH_BPF_DEQUEUED: None           │
│          │               │                                  │
│          │               │ SCH_BPF_DROP: None               │
│          │               │                                  │
│          │               │ SCH_BPF_PASS: ctx->classid       │
└──────────┴───────────────┴──────────────────────────────────┘

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Co-developed-by: Amery Hung <amery.hung@bytedance.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 include/linux/bpf_types.h      |   4 +
 include/uapi/linux/bpf.h       |  21 ++
 include/uapi/linux/pkt_sched.h |  16 +
 kernel/bpf/btf.c               |   5 +
 kernel/bpf/helpers.c           |   1 +
 kernel/bpf/syscall.c           |   8 +
 net/core/filter.c              |  96 ++++++
 net/sched/Kconfig              |  15 +
 net/sched/Makefile             |   1 +
 net/sched/sch_bpf.c            | 537 +++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  21 ++
 11 files changed, 725 insertions(+)
 create mode 100644 net/sched/sch_bpf.c

Martin KaFai Lau Jan. 23, 2024, 11:51 p.m. UTC | #1

On 1/17/24 1:56 PM, Amery Hung wrote:
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 0bb92414c036..df280bbb7c0d 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -997,6 +997,7 @@ enum bpf_prog_type {
>   	BPF_PROG_TYPE_SK_LOOKUP,
>   	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
>   	BPF_PROG_TYPE_NETFILTER,
> +	BPF_PROG_TYPE_QDISC,
>   };
>   
>   enum bpf_attach_type {
> @@ -1056,6 +1057,8 @@ enum bpf_attach_type {
>   	BPF_CGROUP_UNIX_GETSOCKNAME,
>   	BPF_NETKIT_PRIMARY,
>   	BPF_NETKIT_PEER,
> +	BPF_QDISC_ENQUEUE,
> +	BPF_QDISC_DEQUEUE,
>   	__MAX_BPF_ATTACH_TYPE
>   };
>   
> @@ -7357,4 +7360,22 @@ struct bpf_iter_num {
>   	__u64 __opaque[1];
>   } __attribute__((aligned(8)));
>   
> +struct bpf_qdisc_ctx {
> +	__bpf_md_ptr(struct sk_buff *, skb);
> +	__u32 classid;
> +	__u64 expire;
> +	__u64 delta_ns;
> +};
> +
> +enum {
> +	SCH_BPF_QUEUED,
> +	SCH_BPF_DEQUEUED = SCH_BPF_QUEUED,
> +	SCH_BPF_DROP,
> +	SCH_BPF_CN,
> +	SCH_BPF_THROTTLE,
> +	SCH_BPF_PASS,
> +	SCH_BPF_BYPASS,
> +	SCH_BPF_STOLEN,
> +};
> +
>   #endif /* _UAPI__LINUX_BPF_H__ */

[ ... ]

> +static bool tc_qdisc_is_valid_access(int off, int size,
> +				     enum bpf_access_type type,
> +				     const struct bpf_prog *prog,
> +				     struct bpf_insn_access_aux *info)
> +{
> +	struct btf *btf;
> +
> +	if (off < 0 || off >= sizeof(struct bpf_qdisc_ctx))
> +		return false;
> +
> +	switch (off) {
> +	case offsetof(struct bpf_qdisc_ctx, skb):
> +		if (type == BPF_WRITE ||
> +		    size != sizeof_field(struct bpf_qdisc_ctx, skb))
> +			return false;
> +
> +		if (prog->expected_attach_type != BPF_QDISC_ENQUEUE)
> +			return false;
> +
> +		btf = bpf_get_btf_vmlinux();
> +		if (IS_ERR_OR_NULL(btf))
> +			return false;
> +
> +		info->btf = btf;
> +		info->btf_id = tc_qdisc_ctx_access_btf_ids[0];
> +		info->reg_type = PTR_TO_BTF_ID | PTR_TRUSTED;
> +		return true;
> +	case bpf_ctx_range(struct bpf_qdisc_ctx, classid):
> +		return size == sizeof_field(struct bpf_qdisc_ctx, classid);
> +	case bpf_ctx_range(struct bpf_qdisc_ctx, expire):
> +		return size == sizeof_field(struct bpf_qdisc_ctx, expire);
> +	case bpf_ctx_range(struct bpf_qdisc_ctx, delta_ns):
> +		return size == sizeof_field(struct bpf_qdisc_ctx, delta_ns);
> +	default:
> +		return false;
> +	}
> +
> +	return false;
> +}
> +

[ ... ]

> +static int sch_bpf_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> +			   struct sk_buff **to_free)
> +{
> +	struct bpf_sched_data *q = qdisc_priv(sch);
> +	unsigned int len = qdisc_pkt_len(skb);
> +	struct bpf_qdisc_ctx ctx = {};
> +	int res = NET_XMIT_SUCCESS;
> +	struct sch_bpf_class *cl;
> +	struct bpf_prog *enqueue;
> +
> +	enqueue = rcu_dereference(q->enqueue_prog.prog);
> +	if (!enqueue)
> +		return NET_XMIT_DROP;
> +
> +	ctx.skb = skb;
> +	ctx.classid = sch->handle;
> +	res = bpf_prog_run(enqueue, &ctx);
> +	switch (res) {
> +	case SCH_BPF_THROTTLE:
> +		qdisc_watchdog_schedule_range_ns(&q->watchdog, ctx.expire, ctx.delta_ns);
> +		qdisc_qstats_overlimit(sch);
> +		fallthrough;
> +	case SCH_BPF_QUEUED:
> +		qdisc_qstats_backlog_inc(sch, skb);
> +		return NET_XMIT_SUCCESS;
> +	case SCH_BPF_BYPASS:
> +		qdisc_qstats_drop(sch);
> +		__qdisc_drop(skb, to_free);
> +		return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
> +	case SCH_BPF_STOLEN:
> +		__qdisc_drop(skb, to_free);
> +		return NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
> +	case SCH_BPF_CN:
> +		return NET_XMIT_CN;
> +	case SCH_BPF_PASS:
> +		break;
> +	default:
> +		return qdisc_drop(skb, sch, to_free);
> +	}
> +
> +	cl = sch_bpf_find(sch, ctx.classid);
> +	if (!cl || !cl->qdisc)
> +		return qdisc_drop(skb, sch, to_free);
> +
> +	res = qdisc_enqueue(skb, cl->qdisc, to_free);
> +	if (res != NET_XMIT_SUCCESS) {
> +		if (net_xmit_drop_count(res)) {
> +			qdisc_qstats_drop(sch);
> +			cl->drops++;
> +		}
> +		return res;
> +	}
> +
> +	sch->qstats.backlog += len;
> +	sch->q.qlen++;
> +	return res;
> +}
> +
> +DEFINE_PER_CPU(struct sk_buff*, bpf_skb_dequeue);
> +
> +static struct sk_buff *sch_bpf_dequeue(struct Qdisc *sch)
> +{
> +	struct bpf_sched_data *q = qdisc_priv(sch);
> +	struct bpf_qdisc_ctx ctx = {};
> +	struct sk_buff *skb = NULL;
> +	struct bpf_prog *dequeue;
> +	struct sch_bpf_class *cl;
> +	int res;
> +
> +	dequeue = rcu_dereference(q->dequeue_prog.prog);
> +	if (!dequeue)
> +		return NULL;
> +
> +	__this_cpu_write(bpf_skb_dequeue, NULL);
> +	ctx.classid = sch->handle;
> +	res = bpf_prog_run(dequeue, &ctx);
> +	switch (res) {
> +	case SCH_BPF_DEQUEUED:
> +		skb = __this_cpu_read(bpf_skb_dequeue);
> +		qdisc_bstats_update(sch, skb);
> +		qdisc_qstats_backlog_dec(sch, skb);
> +		break;
> +	case SCH_BPF_THROTTLE:
> +		qdisc_watchdog_schedule_range_ns(&q->watchdog, ctx.expire, ctx.delta_ns);
> +		qdisc_qstats_overlimit(sch);
> +		cl = sch_bpf_find(sch, ctx.classid);
> +		if (cl)
> +			cl->overlimits++;
> +		return NULL;
> +	case SCH_BPF_PASS:
> +		cl = sch_bpf_find(sch, ctx.classid);
> +		if (!cl || !cl->qdisc)
> +			return NULL;
> +		skb = qdisc_dequeue_peeked(cl->qdisc);
> +		if (skb) {
> +			bstats_update(&cl->bstats, skb);
> +			qdisc_bstats_update(sch, skb);
> +			qdisc_qstats_backlog_dec(sch, skb);
> +			sch->q.qlen--;
> +		}
> +		break;
> +	}
> +
> +	return skb;
> +}

[ ... ]

> +static int sch_bpf_init(struct Qdisc *sch, struct nlattr *opt,
> +			struct netlink_ext_ack *extack)
> +{
> +	struct bpf_sched_data *q = qdisc_priv(sch);
> +	int err;
> +
> +	qdisc_watchdog_init(&q->watchdog, sch);
> +	if (opt) {
> +		err = sch_bpf_change(sch, opt, extack);
> +		if (err)
> +			return err;
> +	}
> +
> +	err = tcf_block_get(&q->block, &q->filter_list, sch, extack);
> +	if (err)
> +		return err;
> +
> +	return qdisc_class_hash_init(&q->clhash);
> +}
> +
> +static void sch_bpf_reset(struct Qdisc *sch)
> +{
> +	struct bpf_sched_data *q = qdisc_priv(sch);
> +	struct sch_bpf_class *cl;
> +	unsigned int i;
> +
> +	for (i = 0; i < q->clhash.hashsize; i++) {
> +		hlist_for_each_entry(cl, &q->clhash.hash[i], common.hnode) {
> +			if (cl->qdisc)
> +				qdisc_reset(cl->qdisc);
> +		}
> +	}
> +
> +	qdisc_watchdog_cancel(&q->watchdog);
> +}
> +

[ ... ]

> +static const struct Qdisc_class_ops sch_bpf_class_ops = {
> +	.graft		=	sch_bpf_graft,
> +	.leaf		=	sch_bpf_leaf,
> +	.find		=	sch_bpf_search,
> +	.change		=	sch_bpf_change_class,
> +	.delete		=	sch_bpf_delete,
> +	.tcf_block	=	sch_bpf_tcf_block,
> +	.bind_tcf	=	sch_bpf_bind,
> +	.unbind_tcf	=	sch_bpf_unbind,
> +	.dump		=	sch_bpf_dump_class,
> +	.dump_stats	=	sch_bpf_dump_class_stats,
> +	.walk		=	sch_bpf_walk,
> +};
> +
> +static struct Qdisc_ops sch_bpf_qdisc_ops __read_mostly = {
> +	.cl_ops		=	&sch_bpf_class_ops,
> +	.id		=	"bpf",
> +	.priv_size	=	sizeof(struct bpf_sched_data),
> +	.enqueue	=	sch_bpf_enqueue,
> +	.dequeue	=	sch_bpf_dequeue,

I looked at the high level of the patchset. The major ops that it wants to be 
programmable in bpf is the ".enqueue" and ".dequeue" (+ ".init" and ".reset" in 
patch 4 and patch 5).

This patch adds a new prog type BPF_PROG_TYPE_QDISC, four attach types (each for 
".enqueue", ".dequeue", ".init", and ".reset"), and a new "bpf_qdisc_ctx" in the 
uapi. It is no long an acceptable way to add new bpf extension.

Can the ".enqueue", ".dequeue", ".init", and ".reset" be completely implemented 
in bpf (with the help of new kfuncs if needed)? Then a struct_ops for Qdisc_ops 
can be created. The bpf Qdisc_ops can be loaded through the existing struct_ops api.

If other ops (like ".dump", ".dump_stats"...) do not have good use case to be 
programmable in bpf, it can stay with the kernel implementation for now and only 
allows the userspace to load the a bpf Qdisc_ops with .equeue/dequeue/init/reset 
implemented.

You mentioned in the cover letter that:
"Current struct_ops attachment model does not seem to support replacing only 
functions of a specific instance of a module, but I might be wrong."

I assumed you meant allow bpf to replace only "some" ops of the Qdisc_ops? Yes, 
it can be done through the struct_ops's ".init_member". Take a look at 
bpf_tcp_ca_init_member. The kernel can assign the kernel implementation for 
".dump" (for example) when loading the bpf Qdisc_ops.

> +	.peek		=	qdisc_peek_dequeued,
> +	.init		=	sch_bpf_init,
> +	.reset		=	sch_bpf_reset, > +	.destroy	=	sch_bpf_destroy,
> +	.change		=	sch_bpf_change,
> +	.dump		=	sch_bpf_dump,
> +	.dump_stats	=	sch_bpf_dump_stats,
> +	.owner		=	THIS_MODULE,
> +};
> +
> +static int __init sch_bpf_mod_init(void)
> +{
> +	return register_qdisc(&sch_bpf_qdisc_ops);
> +}
> +
> +static void __exit sch_bpf_mod_exit(void)
> +{
> +	unregister_qdisc(&sch_bpf_qdisc_ops);
> +}

Amery Hung Jan. 24, 2024, 5:22 a.m. UTC | #2

On Tue, Jan 23, 2024 at 3:51 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 1/17/24 1:56 PM, Amery Hung wrote:
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 0bb92414c036..df280bbb7c0d 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -997,6 +997,7 @@ enum bpf_prog_type {
> >       BPF_PROG_TYPE_SK_LOOKUP,
> >       BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
> >       BPF_PROG_TYPE_NETFILTER,
> > +     BPF_PROG_TYPE_QDISC,
> >   };
> >
> >   enum bpf_attach_type {
> > @@ -1056,6 +1057,8 @@ enum bpf_attach_type {
> >       BPF_CGROUP_UNIX_GETSOCKNAME,
> >       BPF_NETKIT_PRIMARY,
> >       BPF_NETKIT_PEER,
> > +     BPF_QDISC_ENQUEUE,
> > +     BPF_QDISC_DEQUEUE,
> >       __MAX_BPF_ATTACH_TYPE
> >   };
> >
> > @@ -7357,4 +7360,22 @@ struct bpf_iter_num {
> >       __u64 __opaque[1];
> >   } __attribute__((aligned(8)));
> >
> > +struct bpf_qdisc_ctx {
> > +     __bpf_md_ptr(struct sk_buff *, skb);
> > +     __u32 classid;
> > +     __u64 expire;
> > +     __u64 delta_ns;
> > +};
> > +
> > +enum {
> > +     SCH_BPF_QUEUED,
> > +     SCH_BPF_DEQUEUED = SCH_BPF_QUEUED,
> > +     SCH_BPF_DROP,
> > +     SCH_BPF_CN,
> > +     SCH_BPF_THROTTLE,
> > +     SCH_BPF_PASS,
> > +     SCH_BPF_BYPASS,
> > +     SCH_BPF_STOLEN,
> > +};
> > +
> >   #endif /* _UAPI__LINUX_BPF_H__ */
>
> [ ... ]
>
> > +static bool tc_qdisc_is_valid_access(int off, int size,
> > +                                  enum bpf_access_type type,
> > +                                  const struct bpf_prog *prog,
> > +                                  struct bpf_insn_access_aux *info)
> > +{
> > +     struct btf *btf;
> > +
> > +     if (off < 0 || off >= sizeof(struct bpf_qdisc_ctx))
> > +             return false;
> > +
> > +     switch (off) {
> > +     case offsetof(struct bpf_qdisc_ctx, skb):
> > +             if (type == BPF_WRITE ||
> > +                 size != sizeof_field(struct bpf_qdisc_ctx, skb))
> > +                     return false;
> > +
> > +             if (prog->expected_attach_type != BPF_QDISC_ENQUEUE)
> > +                     return false;
> > +
> > +             btf = bpf_get_btf_vmlinux();
> > +             if (IS_ERR_OR_NULL(btf))
> > +                     return false;
> > +
> > +             info->btf = btf;
> > +             info->btf_id = tc_qdisc_ctx_access_btf_ids[0];
> > +             info->reg_type = PTR_TO_BTF_ID | PTR_TRUSTED;
> > +             return true;
> > +     case bpf_ctx_range(struct bpf_qdisc_ctx, classid):
> > +             return size == sizeof_field(struct bpf_qdisc_ctx, classid);
> > +     case bpf_ctx_range(struct bpf_qdisc_ctx, expire):
> > +             return size == sizeof_field(struct bpf_qdisc_ctx, expire);
> > +     case bpf_ctx_range(struct bpf_qdisc_ctx, delta_ns):
> > +             return size == sizeof_field(struct bpf_qdisc_ctx, delta_ns);
> > +     default:
> > +             return false;
> > +     }
> > +
> > +     return false;
> > +}
> > +
>
> [ ... ]
>
> > +static int sch_bpf_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> > +                        struct sk_buff **to_free)
> > +{
> > +     struct bpf_sched_data *q = qdisc_priv(sch);
> > +     unsigned int len = qdisc_pkt_len(skb);
> > +     struct bpf_qdisc_ctx ctx = {};
> > +     int res = NET_XMIT_SUCCESS;
> > +     struct sch_bpf_class *cl;
> > +     struct bpf_prog *enqueue;
> > +
> > +     enqueue = rcu_dereference(q->enqueue_prog.prog);
> > +     if (!enqueue)
> > +             return NET_XMIT_DROP;
> > +
> > +     ctx.skb = skb;
> > +     ctx.classid = sch->handle;
> > +     res = bpf_prog_run(enqueue, &ctx);
> > +     switch (res) {
> > +     case SCH_BPF_THROTTLE:
> > +             qdisc_watchdog_schedule_range_ns(&q->watchdog, ctx.expire, ctx.delta_ns);
> > +             qdisc_qstats_overlimit(sch);
> > +             fallthrough;
> > +     case SCH_BPF_QUEUED:
> > +             qdisc_qstats_backlog_inc(sch, skb);
> > +             return NET_XMIT_SUCCESS;
> > +     case SCH_BPF_BYPASS:
> > +             qdisc_qstats_drop(sch);
> > +             __qdisc_drop(skb, to_free);
> > +             return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
> > +     case SCH_BPF_STOLEN:
> > +             __qdisc_drop(skb, to_free);
> > +             return NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
> > +     case SCH_BPF_CN:
> > +             return NET_XMIT_CN;
> > +     case SCH_BPF_PASS:
> > +             break;
> > +     default:
> > +             return qdisc_drop(skb, sch, to_free);
> > +     }
> > +
> > +     cl = sch_bpf_find(sch, ctx.classid);
> > +     if (!cl || !cl->qdisc)
> > +             return qdisc_drop(skb, sch, to_free);
> > +
> > +     res = qdisc_enqueue(skb, cl->qdisc, to_free);
> > +     if (res != NET_XMIT_SUCCESS) {
> > +             if (net_xmit_drop_count(res)) {
> > +                     qdisc_qstats_drop(sch);
> > +                     cl->drops++;
> > +             }
> > +             return res;
> > +     }
> > +
> > +     sch->qstats.backlog += len;
> > +     sch->q.qlen++;
> > +     return res;
> > +}
> > +
> > +DEFINE_PER_CPU(struct sk_buff*, bpf_skb_dequeue);
> > +
> > +static struct sk_buff *sch_bpf_dequeue(struct Qdisc *sch)
> > +{
> > +     struct bpf_sched_data *q = qdisc_priv(sch);
> > +     struct bpf_qdisc_ctx ctx = {};
> > +     struct sk_buff *skb = NULL;
> > +     struct bpf_prog *dequeue;
> > +     struct sch_bpf_class *cl;
> > +     int res;
> > +
> > +     dequeue = rcu_dereference(q->dequeue_prog.prog);
> > +     if (!dequeue)
> > +             return NULL;
> > +
> > +     __this_cpu_write(bpf_skb_dequeue, NULL);
> > +     ctx.classid = sch->handle;
> > +     res = bpf_prog_run(dequeue, &ctx);
> > +     switch (res) {
> > +     case SCH_BPF_DEQUEUED:
> > +             skb = __this_cpu_read(bpf_skb_dequeue);
> > +             qdisc_bstats_update(sch, skb);
> > +             qdisc_qstats_backlog_dec(sch, skb);
> > +             break;
> > +     case SCH_BPF_THROTTLE:
> > +             qdisc_watchdog_schedule_range_ns(&q->watchdog, ctx.expire, ctx.delta_ns);
> > +             qdisc_qstats_overlimit(sch);
> > +             cl = sch_bpf_find(sch, ctx.classid);
> > +             if (cl)
> > +                     cl->overlimits++;
> > +             return NULL;
> > +     case SCH_BPF_PASS:
> > +             cl = sch_bpf_find(sch, ctx.classid);
> > +             if (!cl || !cl->qdisc)
> > +                     return NULL;
> > +             skb = qdisc_dequeue_peeked(cl->qdisc);
> > +             if (skb) {
> > +                     bstats_update(&cl->bstats, skb);
> > +                     qdisc_bstats_update(sch, skb);
> > +                     qdisc_qstats_backlog_dec(sch, skb);
> > +                     sch->q.qlen--;
> > +             }
> > +             break;
> > +     }
> > +
> > +     return skb;
> > +}
>
> [ ... ]
>
> > +static int sch_bpf_init(struct Qdisc *sch, struct nlattr *opt,
> > +                     struct netlink_ext_ack *extack)
> > +{
> > +     struct bpf_sched_data *q = qdisc_priv(sch);
> > +     int err;
> > +
> > +     qdisc_watchdog_init(&q->watchdog, sch);
> > +     if (opt) {
> > +             err = sch_bpf_change(sch, opt, extack);
> > +             if (err)
> > +                     return err;
> > +     }
> > +
> > +     err = tcf_block_get(&q->block, &q->filter_list, sch, extack);
> > +     if (err)
> > +             return err;
> > +
> > +     return qdisc_class_hash_init(&q->clhash);
> > +}
> > +
> > +static void sch_bpf_reset(struct Qdisc *sch)
> > +{
> > +     struct bpf_sched_data *q = qdisc_priv(sch);
> > +     struct sch_bpf_class *cl;
> > +     unsigned int i;
> > +
> > +     for (i = 0; i < q->clhash.hashsize; i++) {
> > +             hlist_for_each_entry(cl, &q->clhash.hash[i], common.hnode) {
> > +                     if (cl->qdisc)
> > +                             qdisc_reset(cl->qdisc);
> > +             }
> > +     }
> > +
> > +     qdisc_watchdog_cancel(&q->watchdog);
> > +}
> > +
>
> [ ... ]
>
> > +static const struct Qdisc_class_ops sch_bpf_class_ops = {
> > +     .graft          =       sch_bpf_graft,
> > +     .leaf           =       sch_bpf_leaf,
> > +     .find           =       sch_bpf_search,
> > +     .change         =       sch_bpf_change_class,
> > +     .delete         =       sch_bpf_delete,
> > +     .tcf_block      =       sch_bpf_tcf_block,
> > +     .bind_tcf       =       sch_bpf_bind,
> > +     .unbind_tcf     =       sch_bpf_unbind,
> > +     .dump           =       sch_bpf_dump_class,
> > +     .dump_stats     =       sch_bpf_dump_class_stats,
> > +     .walk           =       sch_bpf_walk,
> > +};
> > +
> > +static struct Qdisc_ops sch_bpf_qdisc_ops __read_mostly = {
> > +     .cl_ops         =       &sch_bpf_class_ops,
> > +     .id             =       "bpf",
> > +     .priv_size      =       sizeof(struct bpf_sched_data),
> > +     .enqueue        =       sch_bpf_enqueue,
> > +     .dequeue        =       sch_bpf_dequeue,
>
> I looked at the high level of the patchset. The major ops that it wants to be
> programmable in bpf is the ".enqueue" and ".dequeue" (+ ".init" and ".reset" in
> patch 4 and patch 5).
>
> This patch adds a new prog type BPF_PROG_TYPE_QDISC, four attach types (each for
> ".enqueue", ".dequeue", ".init", and ".reset"), and a new "bpf_qdisc_ctx" in the
> uapi. It is no long an acceptable way to add new bpf extension.
>
> Can the ".enqueue", ".dequeue", ".init", and ".reset" be completely implemented
> in bpf (with the help of new kfuncs if needed)? Then a struct_ops for Qdisc_ops
> can be created. The bpf Qdisc_ops can be loaded through the existing struct_ops api.
>

Partially. If using struct_ops, I think we'll need another structure
like the following in bpf qdisc to be implemented with struct_ops bpf:

struct bpf_qdisc_ops {
    int (*enqueue) (struct sk_buff *skb)
    void (*dequeue) (void)
    void (*init) (void)
    void (*reset) (void)
};

Then, Qdisc_ops will wrap around them to handle things that cannot be
implemented with bpf (e.g., sch_tree_lock, returning a skb ptr).

> If other ops (like ".dump", ".dump_stats"...) do not have good use case to be
> programmable in bpf, it can stay with the kernel implementation for now and only
> allows the userspace to load the a bpf Qdisc_ops with .equeue/dequeue/init/reset
> implemented.
>
> You mentioned in the cover letter that:
> "Current struct_ops attachment model does not seem to support replacing only
> functions of a specific instance of a module, but I might be wrong."
>
> I assumed you meant allow bpf to replace only "some" ops of the Qdisc_ops? Yes,
> it can be done through the struct_ops's ".init_member". Take a look at
> bpf_tcp_ca_init_member. The kernel can assign the kernel implementation for
> ".dump" (for example) when loading the bpf Qdisc_ops.
>

I have no problem with partially replacing a struct, which like you
mentioned has been demonstrated by congestion control or sched_ext.
What I am not sure about is the ability to create multiple bpf qdiscs,
where each has different ".enqueue", ".dequeue", and so on. I like the
struct_ops approach and would love to try it if struct_ops support
this.

Thanks,
Amery



> > +     .peek           =       qdisc_peek_dequeued,
> > +     .init           =       sch_bpf_init,
> > +     .reset          =       sch_bpf_reset, > +      .destroy        =       sch_bpf_destroy,
> > +     .change         =       sch_bpf_change,
> > +     .dump           =       sch_bpf_dump,
> > +     .dump_stats     =       sch_bpf_dump_stats,
> > +     .owner          =       THIS_MODULE,
> > +};
> > +
> > +static int __init sch_bpf_mod_init(void)
> > +{
> > +     return register_qdisc(&sch_bpf_qdisc_ops);
> > +}
> > +
> > +static void __exit sch_bpf_mod_exit(void)
> > +{
> > +     unregister_qdisc(&sch_bpf_qdisc_ops);
> > +}
>

Martin KaFai Lau Jan. 26, 2024, 2:22 a.m. UTC | #3

On 1/23/24 9:22 PM, Amery Hung wrote:
>> I looked at the high level of the patchset. The major ops that it wants to be
>> programmable in bpf is the ".enqueue" and ".dequeue" (+ ".init" and ".reset" in
>> patch 4 and patch 5).
>>
>> This patch adds a new prog type BPF_PROG_TYPE_QDISC, four attach types (each for
>> ".enqueue", ".dequeue", ".init", and ".reset"), and a new "bpf_qdisc_ctx" in the
>> uapi. It is no long an acceptable way to add new bpf extension.
>>
>> Can the ".enqueue", ".dequeue", ".init", and ".reset" be completely implemented
>> in bpf (with the help of new kfuncs if needed)? Then a struct_ops for Qdisc_ops
>> can be created. The bpf Qdisc_ops can be loaded through the existing struct_ops api.
>>
> Partially. If using struct_ops, I think we'll need another structure
> like the following in bpf qdisc to be implemented with struct_ops bpf:
> 
> struct bpf_qdisc_ops {
>      int (*enqueue) (struct sk_buff *skb)
>      void (*dequeue) (void)
>      void (*init) (void)
>      void (*reset) (void)
> };
> 
> Then, Qdisc_ops will wrap around them to handle things that cannot be
> implemented with bpf (e.g., sch_tree_lock, returning a skb ptr).

We can see how those limitations (calling sch_tree_lock() and returning a ptr) 
can be addressed in bpf. This will also help other similar use cases.

Other than sch_tree_lock and returning a ptr from a bpf prog. What else do you 
see that blocks directly implementing the enqueue/dequeue/init/reset in the 
struct Qdisc_ops?

Have you thought above ".priv_size"? It is now fixed to sizeof(struct 
bpf_sched_data). It should be useful to allow the bpf prog to store its own data 
there?

> 
>> If other ops (like ".dump", ".dump_stats"...) do not have good use case to be
>> programmable in bpf, it can stay with the kernel implementation for now and only
>> allows the userspace to load the a bpf Qdisc_ops with .equeue/dequeue/init/reset
>> implemented.
>>
>> You mentioned in the cover letter that:
>> "Current struct_ops attachment model does not seem to support replacing only
>> functions of a specific instance of a module, but I might be wrong."
>>
>> I assumed you meant allow bpf to replace only "some" ops of the Qdisc_ops? Yes,
>> it can be done through the struct_ops's ".init_member". Take a look at
>> bpf_tcp_ca_init_member. The kernel can assign the kernel implementation for
>> ".dump" (for example) when loading the bpf Qdisc_ops.
>>
> I have no problem with partially replacing a struct, which like you
> mentioned has been demonstrated by congestion control or sched_ext.
> What I am not sure about is the ability to create multiple bpf qdiscs,
> where each has different ".enqueue", ".dequeue", and so on. I like the
> struct_ops approach and would love to try it if struct_ops support
> this.

The need for allowing different ".enqueue/.dequeue/..." bpf 
(BPF_PROG_TYPE_QDISC) programs loaded into different qdisc instances is because 
there is only one ".id == bpf" Qdisc_ops native kernel implementation which is 
then because of the limitation you mentioned above?

Am I understanding your reason correctly on why it requires to load different 
bpf prog for different qdisc instances?

If the ".enqueue/.dequeue/..." in the "struct Qdisc_ops" can be directly 
implemented in bpf prog itself, it can just load another bpf struct_ops which 
has a different ".enqueue/.dequeue/..." implementation:

#> bpftool struct_ops register bpf_simple_fq_v1.bpf.o
#> bpftool struct_ops register bpf_simple_fq_v2.bpf.o
#> bpftool struct_ops register bpf_simple_fq_xyz.bpf.o

 From reading the test bpf prog, I think the set is on a good footing. Instead 
of working around the limitation by wrapping the bpf prog in a predefined 
"struct Qdisc_ops sch_bpf_qdisc_ops", lets first understand what is missing in 
bpf and see how we could address them.

Amery Hung Jan. 27, 2024, 1:17 a.m. UTC | #4

On Thu, Jan 25, 2024 at 6:22 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 1/23/24 9:22 PM, Amery Hung wrote:
> >> I looked at the high level of the patchset. The major ops that it wants to be
> >> programmable in bpf is the ".enqueue" and ".dequeue" (+ ".init" and ".reset" in
> >> patch 4 and patch 5).
> >>
> >> This patch adds a new prog type BPF_PROG_TYPE_QDISC, four attach types (each for
> >> ".enqueue", ".dequeue", ".init", and ".reset"), and a new "bpf_qdisc_ctx" in the
> >> uapi. It is no long an acceptable way to add new bpf extension.
> >>
> >> Can the ".enqueue", ".dequeue", ".init", and ".reset" be completely implemented
> >> in bpf (with the help of new kfuncs if needed)? Then a struct_ops for Qdisc_ops
> >> can be created. The bpf Qdisc_ops can be loaded through the existing struct_ops api.
> >>
> > Partially. If using struct_ops, I think we'll need another structure
> > like the following in bpf qdisc to be implemented with struct_ops bpf:
> >
> > struct bpf_qdisc_ops {
> >      int (*enqueue) (struct sk_buff *skb)
> >      void (*dequeue) (void)
> >      void (*init) (void)
> >      void (*reset) (void)
> > };
> >
> > Then, Qdisc_ops will wrap around them to handle things that cannot be
> > implemented with bpf (e.g., sch_tree_lock, returning a skb ptr).
>
> We can see how those limitations (calling sch_tree_lock() and returning a ptr)
> can be addressed in bpf. This will also help other similar use cases.
>

For kptr, I wonder if we can support the following semantics in bpf if
they make sense:
1. Passing a referenced kptr into a bpf program, which will also need
to be released, or exchanged into maps or allocated objects.
2. Returning a kptr from a program and treating it as releasing the reference.

> Other than sch_tree_lock and returning a ptr from a bpf prog. What else do you
> see that blocks directly implementing the enqueue/dequeue/init/reset in the
> struct Qdisc_ops?
>

Not much! We can deal with sch_tree_lock later since
enqueue/dequeue/init/reset are unlikely to use it.

> Have you thought above ".priv_size"? It is now fixed to sizeof(struct
> bpf_sched_data). It should be useful to allow the bpf prog to store its own data
> there?
>

Maybe we can let bpf qdiscs store statistics here and make it work
with netlink. I haven't explored much in how bpf qdiscs record and
share statistics with user space.

> >
> >> If other ops (like ".dump", ".dump_stats"...) do not have good use case to be
> >> programmable in bpf, it can stay with the kernel implementation for now and only
> >> allows the userspace to load the a bpf Qdisc_ops with .equeue/dequeue/init/reset
> >> implemented.
> >>
> >> You mentioned in the cover letter that:
> >> "Current struct_ops attachment model does not seem to support replacing only
> >> functions of a specific instance of a module, but I might be wrong."
> >>
> >> I assumed you meant allow bpf to replace only "some" ops of the Qdisc_ops? Yes,
> >> it can be done through the struct_ops's ".init_member". Take a look at
> >> bpf_tcp_ca_init_member. The kernel can assign the kernel implementation for
> >> ".dump" (for example) when loading the bpf Qdisc_ops.
> >>
> > I have no problem with partially replacing a struct, which like you
> > mentioned has been demonstrated by congestion control or sched_ext.
> > What I am not sure about is the ability to create multiple bpf qdiscs,
> > where each has different ".enqueue", ".dequeue", and so on. I like the
> > struct_ops approach and would love to try it if struct_ops support
> > this.
>
> The need for allowing different ".enqueue/.dequeue/..." bpf
> (BPF_PROG_TYPE_QDISC) programs loaded into different qdisc instances is because
> there is only one ".id == bpf" Qdisc_ops native kernel implementation which is
> then because of the limitation you mentioned above?
>
> Am I understanding your reason correctly on why it requires to load different
> bpf prog for different qdisc instances?
>
> If the ".enqueue/.dequeue/..." in the "struct Qdisc_ops" can be directly
> implemented in bpf prog itself, it can just load another bpf struct_ops which
> has a different ".enqueue/.dequeue/..." implementation:
>
> #> bpftool struct_ops register bpf_simple_fq_v1.bpf.o
> #> bpftool struct_ops register bpf_simple_fq_v2.bpf.o
> #> bpftool struct_ops register bpf_simple_fq_xyz.bpf.o
>
>  From reading the test bpf prog, I think the set is on a good footing. Instead
> of working around the limitation by wrapping the bpf prog in a predefined
> "struct Qdisc_ops sch_bpf_qdisc_ops", lets first understand what is missing in
> bpf and see how we could address them.
>

Thank you so much for the clarification. I had the wrong impression
since I was thinking about using a structure in the bpf qdisc for
struct_ops. It makes sense to try making "struct Qdisc_ops" work with
struct_ops. I will send the next patch set with struct_ops.

Thanks,
Amery

>

Martin KaFai Lau Jan. 30, 2024, 6:39 a.m. UTC | #5

On 1/26/24 5:17 PM, Amery Hung wrote:
> On Thu, Jan 25, 2024 at 6:22 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 1/23/24 9:22 PM, Amery Hung wrote:
>>>> I looked at the high level of the patchset. The major ops that it wants to be
>>>> programmable in bpf is the ".enqueue" and ".dequeue" (+ ".init" and ".reset" in
>>>> patch 4 and patch 5).
>>>>
>>>> This patch adds a new prog type BPF_PROG_TYPE_QDISC, four attach types (each for
>>>> ".enqueue", ".dequeue", ".init", and ".reset"), and a new "bpf_qdisc_ctx" in the
>>>> uapi. It is no long an acceptable way to add new bpf extension.
>>>>
>>>> Can the ".enqueue", ".dequeue", ".init", and ".reset" be completely implemented
>>>> in bpf (with the help of new kfuncs if needed)? Then a struct_ops for Qdisc_ops
>>>> can be created. The bpf Qdisc_ops can be loaded through the existing struct_ops api.
>>>>
>>> Partially. If using struct_ops, I think we'll need another structure
>>> like the following in bpf qdisc to be implemented with struct_ops bpf:
>>>
>>> struct bpf_qdisc_ops {
>>>       int (*enqueue) (struct sk_buff *skb)
>>>       void (*dequeue) (void)
>>>       void (*init) (void)
>>>       void (*reset) (void)
>>> };
>>>
>>> Then, Qdisc_ops will wrap around them to handle things that cannot be
>>> implemented with bpf (e.g., sch_tree_lock, returning a skb ptr).
>>
>> We can see how those limitations (calling sch_tree_lock() and returning a ptr)
>> can be addressed in bpf. This will also help other similar use cases.
>>
> 
> For kptr, I wonder if we can support the following semantics in bpf if
> they make sense:

I think they are useful but they are not fully supported now.

Some thoughts below.

> 1. Passing a referenced kptr into a bpf program, which will also need
> to be released, or exchanged into maps or allocated objects.

"enqueue" should be the one considering here:

struct Qdisc_ops {
	/* ... */
	int                     (*enqueue)(struct sk_buff *skb,
					   struct Qdisc *sch,
					   struct sk_buff **to_free);

};

The verifier only marks the skb as a trusted kptr but does not mark its 
reg->ref_obj_id. Take a look at btf_ctx_access(). In particular:

	if (prog_args_trusted(prog))
		info->reg_type |= PTR_TRUSTED;

The verifier does not know the skb ownership is passed into the ".enqueue" ops 
and does not know the bpf prog needs to release it or store it in a map.

The verifier tracks the reference state when a KF_ACQUIRE kfunc is called (just 
an example, not saying we need to use KF_ACQUIRE kfunc). Take a look at 
acquire_reference_state() which is the useful one here.

Whenever the verifier is loading the ".enqueue" bpf_prog, the verifier can 
always acquire_reference_state() for the "struct sk_buff *skb" argument.

Take a look at a recent RFC: 
https://lore.kernel.org/bpf/20240122212217.1391878-1-thinker.li@gmail.com/
which is tagging the argument of an ops (e.g. ".enqueue" here). That RFC patch 
is tagging the argument could be NULL by appending "__nullable" to the argument 
name. The verifier will enforce that the bpf prog must check for NULL first.

The similar idea can be used here but with a different tagging (for example, 
"__must_release", admittedly not a good name). While the RFC patch is 
in-progress, for now, may be hardcode for the ".enqueue" ops in 
check_struct_ops_btf_id() and always acquire_reference_state() for the skb. This 
part can be adjusted later once the RFC patch will be in shape.

Then one more thing is to track when the struct_ops bpf prog is actually reading 
the value of the skb pointer. One thing is worth to mention here, e.g. a 
struct_ops prog for enqueue:

SEC("struct_ops")
int BPF_PROG(bpf_dropall_enqueue, struct sk_buff *skb, struct Qdisc *sch,
	     struct sk_buff **to_free)
{
	return bpf_qdisc_drop(skb, sch, to_free);
}

Take a look at the BPF_PROG macro, the bpf prog is getting a pointer to an array 
of __u64 as the only argument. The skb is actually in ctx[0], sch is in 
ctx[1]...etc. When ctx[0] is read to get the skb pointer (e.g. r1 = ctx[0]), 
btf_ctx_access() marks the reg_type to PTR_TRUSTED. It needs to also initialize 
the reg->ref_obj_id by the id obtained earlier from acquire_reference_state() 
during check_struct_ops_btf_id() somehow.

> 2. Returning a kptr from a program and treating it as releasing the reference.

e.g. for dequeue:

struct Qdisc_ops {
	/* ... */
	struct sk_buff *        (*dequeue)(struct Qdisc *);
};

Right now the verifier should complain on check_reference_leak() if the 
struct_ops bpf prog is returning a referenced kptr.

Unlike an argument, the return type of a function does not have a name to tag. 
It is the first case that a struct_ops bpf_prog returning a pointer. One idea is 
to assume it must be a trusted pointer (PTR_TRUSTED) and the verifier should 
check it is indeed with PTR_TRUSTED flag.

May be release_reference_state() can be called to assume the kernel will release 
it as long as the return pointer type is PTR_TRUSTED and the type matches the 
return type of the ops. Take a look at check_return_code().

Kui-Feng Lee Jan. 30, 2024, 5:49 p.m. UTC | #6

On 1/29/24 22:39, Martin KaFai Lau wrote:
> On 1/26/24 5:17 PM, Amery Hung wrote:
>> On Thu, Jan 25, 2024 at 6:22 PM Martin KaFai Lau 
>> <martin.lau@linux.dev> wrote:
>>>
>>> On 1/23/24 9:22 PM, Amery Hung wrote:
>>>>> I looked at the high level of the patchset. The major ops that it 
>>>>> wants to be
>>>>> programmable in bpf is the ".enqueue" and ".dequeue" (+ ".init" and 
>>>>> ".reset" in
>>>>> patch 4 and patch 5).
>>>>>
>>>>> This patch adds a new prog type BPF_PROG_TYPE_QDISC, four attach 
>>>>> types (each for
>>>>> ".enqueue", ".dequeue", ".init", and ".reset"), and a new 
>>>>> "bpf_qdisc_ctx" in the
>>>>> uapi. It is no long an acceptable way to add new bpf extension.
>>>>>
>>>>> Can the ".enqueue", ".dequeue", ".init", and ".reset" be completely 
>>>>> implemented
>>>>> in bpf (with the help of new kfuncs if needed)? Then a struct_ops 
>>>>> for Qdisc_ops
>>>>> can be created. The bpf Qdisc_ops can be loaded through the 
>>>>> existing struct_ops api.
>>>>>
>>>> Partially. If using struct_ops, I think we'll need another structure
>>>> like the following in bpf qdisc to be implemented with struct_ops bpf:
>>>>
>>>> struct bpf_qdisc_ops {
>>>>       int (*enqueue) (struct sk_buff *skb)
>>>>       void (*dequeue) (void)
>>>>       void (*init) (void)
>>>>       void (*reset) (void)
>>>> };
>>>>
>>>> Then, Qdisc_ops will wrap around them to handle things that cannot be
>>>> implemented with bpf (e.g., sch_tree_lock, returning a skb ptr).
>>>
>>> We can see how those limitations (calling sch_tree_lock() and 
>>> returning a ptr)
>>> can be addressed in bpf. This will also help other similar use cases.
>>>
>>
>> For kptr, I wonder if we can support the following semantics in bpf if
>> they make sense:
> 
> I think they are useful but they are not fully supported now.
> 
> Some thoughts below.
> 
>> 1. Passing a referenced kptr into a bpf program, which will also need
>> to be released, or exchanged into maps or allocated objects.
> 
> "enqueue" should be the one considering here:
> 
> struct Qdisc_ops {
>      /* ... */
>      int                     (*enqueue)(struct sk_buff *skb,
>                         struct Qdisc *sch,
>                         struct sk_buff **to_free);
> 
> };
> 
> The verifier only marks the skb as a trusted kptr but does not mark its 
> reg->ref_obj_id. Take a look at btf_ctx_access(). In particular:
> 
>      if (prog_args_trusted(prog))
>          info->reg_type |= PTR_TRUSTED;
> 
> The verifier does not know the skb ownership is passed into the 
> ".enqueue" ops and does not know the bpf prog needs to release it or 
> store it in a map.
> 
> The verifier tracks the reference state when a KF_ACQUIRE kfunc is 
> called (just an example, not saying we need to use KF_ACQUIRE kfunc). 
> Take a look at acquire_reference_state() which is the useful one here.
> 
> Whenever the verifier is loading the ".enqueue" bpf_prog, the verifier 
> can always acquire_reference_state() for the "struct sk_buff *skb" 
> argument.
> 
> Take a look at a recent RFC: 
> https://lore.kernel.org/bpf/20240122212217.1391878-1-thinker.li@gmail.com/
> which is tagging the argument of an ops (e.g. ".enqueue" here). That RFC 
> patch is tagging the argument could be NULL by appending "__nullable" to 
> the argument name. The verifier will enforce that the bpf prog must 
> check for NULL first.
> 
> The similar idea can be used here but with a different tagging (for 
> example, "__must_release", admittedly not a good name). While the RFC 
> patch is in-progress, for now, may be hardcode for the ".enqueue" ops in 
> check_struct_ops_btf_id() and always acquire_reference_state() for the 
> skb. This part can be adjusted later once the RFC patch will be in shape.
> 
> 
> Then one more thing is to track when the struct_ops bpf prog is actually 
> reading the value of the skb pointer. One thing is worth to mention 
> here, e.g. a struct_ops prog for enqueue:
> 
> SEC("struct_ops")
> int BPF_PROG(bpf_dropall_enqueue, struct sk_buff *skb, struct Qdisc *sch,
>           struct sk_buff **to_free)
> {
>      return bpf_qdisc_drop(skb, sch, to_free);
> }
> 
> Take a look at the BPF_PROG macro, the bpf prog is getting a pointer to 
> an array of __u64 as the only argument. The skb is actually in ctx[0], 
> sch is in ctx[1]...etc. When ctx[0] is read to get the skb pointer (e.g. 
> r1 = ctx[0]), btf_ctx_access() marks the reg_type to PTR_TRUSTED. It 
> needs to also initialize the reg->ref_obj_id by the id obtained earlier 
> from acquire_reference_state() during check_struct_ops_btf_id() somehow.
> 
> 
>> 2. Returning a kptr from a program and treating it as releasing the 
>> reference.
> 
> e.g. for dequeue:
> 
> struct Qdisc_ops {
>      /* ... */
>      struct sk_buff *        (*dequeue)(struct Qdisc *);
> };
> 
> 
> Right now the verifier should complain on check_reference_leak() if the 
> struct_ops bpf prog is returning a referenced kptr.
> 
> Unlike an argument, the return type of a function does not have a name 
> to tag. It is the first case that a struct_ops bpf_prog returning a 

We may tag the stub functions instead, right?
Is the purpose here to return a referenced pointer from a struct_ops
operator without verifier complaining?

> pointer. One idea is to assume it must be a trusted pointer 
> (PTR_TRUSTED) and the verifier should check it is indeed with 
> PTR_TRUSTED flag.
> 
> May be release_reference_state() can be called to assume the kernel will 
> release it as long as the return pointer type is PTR_TRUSTED and the 
> type matches the return type of the ops. Take a look at 
> check_return_code().
>

Martin KaFai Lau Jan. 31, 2024, 1:01 a.m. UTC | #7

On 1/30/24 9:49 AM, Kui-Feng Lee wrote:
>>> 2. Returning a kptr from a program and treating it as releasing the reference.
>>
>> e.g. for dequeue:
>>
>> struct Qdisc_ops {
>>      /* ... */
>>      struct sk_buff *        (*dequeue)(struct Qdisc *);
>> };
>>
>>
>> Right now the verifier should complain on check_reference_leak() if the 
>> struct_ops bpf prog is returning a referenced kptr.
>>
>> Unlike an argument, the return type of a function does not have a name to tag. 
>> It is the first case that a struct_ops bpf_prog returning a 
> 
> We may tag the stub functions instead, right?

What is the suggestion on how to tag the return type?

I was suggesting it doesn't need to tag and it should by default require a 
trusted ptr for the pointer returned by struct_ops. The pointer argument and the 
return pointer of a struct_ops should be a trusted ptr.

> Is the purpose here to return a referenced pointer from a struct_ops
> operator without verifier complaining?

Yes, basically need to teach the verifier the kernel will do the reference release.

> 
>> pointer. One idea is to assume it must be a trusted pointer (PTR_TRUSTED) and 
>> the verifier should check it is indeed with PTR_TRUSTED flag.
>>
>> May be release_reference_state() can be called to assume the kernel will 
>> release it as long as the return pointer type is PTR_TRUSTED and the type 
>> matches the return type of the ops. Take a look at check_return_code().

Amery Hung Jan. 31, 2024, 4:23 p.m. UTC | #8

On Mon, Jan 29, 2024 at 10:39 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >> We can see how those limitations (calling sch_tree_lock() and returning a ptr)
> >> can be addressed in bpf. This will also help other similar use cases.
> >>
> >
> > For kptr, I wonder if we can support the following semantics in bpf if
> > they make sense:
>
> I think they are useful but they are not fully supported now.
>
> Some thoughts below.
>
> > 1. Passing a referenced kptr into a bpf program, which will also need
> > to be released, or exchanged into maps or allocated objects.
>
> "enqueue" should be the one considering here:
>
> struct Qdisc_ops {
>         /* ... */
>         int                     (*enqueue)(struct sk_buff *skb,
>                                            struct Qdisc *sch,
>                                            struct sk_buff **to_free);
>
> };
>
> The verifier only marks the skb as a trusted kptr but does not mark its
> reg->ref_obj_id. Take a look at btf_ctx_access(). In particular:
>
>         if (prog_args_trusted(prog))
>                 info->reg_type |= PTR_TRUSTED;
>
> The verifier does not know the skb ownership is passed into the ".enqueue" ops
> and does not know the bpf prog needs to release it or store it in a map.
>
> The verifier tracks the reference state when a KF_ACQUIRE kfunc is called (just
> an example, not saying we need to use KF_ACQUIRE kfunc). Take a look at
> acquire_reference_state() which is the useful one here.
>
> Whenever the verifier is loading the ".enqueue" bpf_prog, the verifier can
> always acquire_reference_state() for the "struct sk_buff *skb" argument.
>
> Take a look at a recent RFC:
> https://lore.kernel.org/bpf/20240122212217.1391878-1-thinker.li@gmail.com/
> which is tagging the argument of an ops (e.g. ".enqueue" here). That RFC patch
> is tagging the argument could be NULL by appending "__nullable" to the argument
> name. The verifier will enforce that the bpf prog must check for NULL first.
>
> The similar idea can be used here but with a different tagging (for example,
> "__must_release", admittedly not a good name). While the RFC patch is
> in-progress, for now, may be hardcode for the ".enqueue" ops in
> check_struct_ops_btf_id() and always acquire_reference_state() for the skb. This
> part can be adjusted later once the RFC patch will be in shape.
>

Make sense. One more thing to consider here is that .enqueue is
actually a reference acquiring and releasing function at the same
time. Assuming ctx written to by a struct_ops program can be seen by
the kernel, another new tag for the "to_free" argument will still be
needed so that the verifier can recognize when writing skb to
"to_free".

>
> Then one more thing is to track when the struct_ops bpf prog is actually reading
> the value of the skb pointer. One thing is worth to mention here, e.g. a
> struct_ops prog for enqueue:
>
> SEC("struct_ops")
> int BPF_PROG(bpf_dropall_enqueue, struct sk_buff *skb, struct Qdisc *sch,
>              struct sk_buff **to_free)
> {
>         return bpf_qdisc_drop(skb, sch, to_free);
> }
>
> Take a look at the BPF_PROG macro, the bpf prog is getting a pointer to an array
> of __u64 as the only argument. The skb is actually in ctx[0], sch is in
> ctx[1]...etc. When ctx[0] is read to get the skb pointer (e.g. r1 = ctx[0]),
> btf_ctx_access() marks the reg_type to PTR_TRUSTED. It needs to also initialize
> the reg->ref_obj_id by the id obtained earlier from acquire_reference_state()
> during check_struct_ops_btf_id() somehow.
>
>
> > 2. Returning a kptr from a program and treating it as releasing the reference.
>
> e.g. for dequeue:
>
> struct Qdisc_ops {
>         /* ... */
>         struct sk_buff *        (*dequeue)(struct Qdisc *);
> };
>
>
> Right now the verifier should complain on check_reference_leak() if the
> struct_ops bpf prog is returning a referenced kptr.
>
> Unlike an argument, the return type of a function does not have a name to tag.
> It is the first case that a struct_ops bpf_prog returning a pointer. One idea is
> to assume it must be a trusted pointer (PTR_TRUSTED) and the verifier should
> check it is indeed with PTR_TRUSTED flag.
>
> May be release_reference_state() can be called to assume the kernel will release
> it as long as the return pointer type is PTR_TRUSTED and the type matches the
> return type of the ops. Take a look at check_return_code().

Kui-Feng Lee Jan. 31, 2024, 4:49 p.m. UTC | #9

On 1/30/24 17:01, Martin KaFai Lau wrote:
> On 1/30/24 9:49 AM, Kui-Feng Lee wrote:
>>>> 2. Returning a kptr from a program and treating it as releasing the 
>>>> reference.
>>>
>>> e.g. for dequeue:
>>>
>>> struct Qdisc_ops {
>>>      /* ... */
>>>      struct sk_buff *        (*dequeue)(struct Qdisc *);
>>> };
>>>
>>>
>>> Right now the verifier should complain on check_reference_leak() if 
>>> the struct_ops bpf prog is returning a referenced kptr.
>>>
>>> Unlike an argument, the return type of a function does not have a 
>>> name to tag. It is the first case that a struct_ops bpf_prog returning a 
>>
>> We may tag the stub functions instead, right?
> 
> What is the suggestion on how to tag the return type?
> 
> I was suggesting it doesn't need to tag and it should by default require 
> a trusted ptr for the pointer returned by struct_ops. The pointer 
> argument and the return pointer of a struct_ops should be a trusted ptr.


That make sense to me. Should we also allow operators to return a null
pointer?

> 
>> Is the purpose here to return a referenced pointer from a struct_ops
>> operator without verifier complaining?
> 
> Yes, basically need to teach the verifier the kernel will do the 
> reference release.
> 
>>
>>> pointer. One idea is to assume it must be a trusted pointer 
>>> (PTR_TRUSTED) and the verifier should check it is indeed with 
>>> PTR_TRUSTED flag.
>>>
>>> May be release_reference_state() can be called to assume the kernel 
>>> will release it as long as the return pointer type is PTR_TRUSTED and 
>>> the type matches the return type of the ops. Take a look at 
>>> check_return_code(). 
>

Amery Hung Jan. 31, 2024, 4:59 p.m. UTC | #10

On Wed, Jan 31, 2024 at 8:49 AM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>
>
>
> On 1/30/24 17:01, Martin KaFai Lau wrote:
> > On 1/30/24 9:49 AM, Kui-Feng Lee wrote:
> >>>> 2. Returning a kptr from a program and treating it as releasing the
> >>>> reference.
> >>>
> >>> e.g. for dequeue:
> >>>
> >>> struct Qdisc_ops {
> >>>      /* ... */
> >>>      struct sk_buff *        (*dequeue)(struct Qdisc *);
> >>> };
> >>>
> >>>
> >>> Right now the verifier should complain on check_reference_leak() if
> >>> the struct_ops bpf prog is returning a referenced kptr.
> >>>
> >>> Unlike an argument, the return type of a function does not have a
> >>> name to tag. It is the first case that a struct_ops bpf_prog returning a
> >>
> >> We may tag the stub functions instead, right?
> >
> > What is the suggestion on how to tag the return type?
> >
> > I was suggesting it doesn't need to tag and it should by default require
> > a trusted ptr for the pointer returned by struct_ops. The pointer
> > argument and the return pointer of a struct_ops should be a trusted ptr.
>
>
> That make sense to me. Should we also allow operators to return a null
> pointer?
>

.dequeue in Qdisc_ops can return a null pointer when there is no skb
to be dequeued so I think that should be allowed.

> >
> >> Is the purpose here to return a referenced pointer from a struct_ops
> >> operator without verifier complaining?
> >
> > Yes, basically need to teach the verifier the kernel will do the
> > reference release.
> >
> >>
> >>> pointer. One idea is to assume it must be a trusted pointer
> >>> (PTR_TRUSTED) and the verifier should check it is indeed with
> >>> PTR_TRUSTED flag.
> >>>
> >>> May be release_reference_state() can be called to assume the kernel
> >>> will release it as long as the return pointer type is PTR_TRUSTED and
> >>> the type matches the return type of the ops. Take a look at
> >>> check_return_code().
> >

Martin KaFai Lau Feb. 2, 2024, 1:47 a.m. UTC | #11

On 1/31/24 8:23 AM, Amery Hung wrote:
>>> 1. Passing a referenced kptr into a bpf program, which will also need
>>> to be released, or exchanged into maps or allocated objects.
>> "enqueue" should be the one considering here:
>>
>> struct Qdisc_ops {
>>          /* ... */
>>          int                     (*enqueue)(struct sk_buff *skb,
>>                                             struct Qdisc *sch,
>>                                             struct sk_buff **to_free);
>>
>> };
>>
>> The verifier only marks the skb as a trusted kptr but does not mark its
>> reg->ref_obj_id. Take a look at btf_ctx_access(). In particular:
>>
>>          if (prog_args_trusted(prog))
>>                  info->reg_type |= PTR_TRUSTED;
>>
>> The verifier does not know the skb ownership is passed into the ".enqueue" ops
>> and does not know the bpf prog needs to release it or store it in a map.
>>
>> The verifier tracks the reference state when a KF_ACQUIRE kfunc is called (just
>> an example, not saying we need to use KF_ACQUIRE kfunc). Take a look at
>> acquire_reference_state() which is the useful one here.
>>
>> Whenever the verifier is loading the ".enqueue" bpf_prog, the verifier can
>> always acquire_reference_state() for the "struct sk_buff *skb" argument.
>>
>> Take a look at a recent RFC:
>> https://lore.kernel.org/bpf/20240122212217.1391878-1-thinker.li@gmail.com/
>> which is tagging the argument of an ops (e.g. ".enqueue" here). That RFC patch
>> is tagging the argument could be NULL by appending "__nullable" to the argument
>> name. The verifier will enforce that the bpf prog must check for NULL first.
>>
>> The similar idea can be used here but with a different tagging (for example,
>> "__must_release", admittedly not a good name). While the RFC patch is
>> in-progress, for now, may be hardcode for the ".enqueue" ops in
>> check_struct_ops_btf_id() and always acquire_reference_state() for the skb. This
>> part can be adjusted later once the RFC patch will be in shape.
>>
> Make sense. One more thing to consider here is that .enqueue is
> actually a reference acquiring and releasing function at the same
> time. Assuming ctx written to by a struct_ops program can be seen by
> the kernel, another new tag for the "to_free" argument will still be
> needed so that the verifier can recognize when writing skb to
> "to_free".

I don't think "to_free" needs special tagging. I was thinking the 
"bpf_qdisc_drop" kfunc could be a KF_RELEASE. Ideally, it should be like

__bpf_kfunc int bpf_qdisc_drop(struct sk_buff *skb, struct Qdisc *sch,
	                       struct sk_buff **to_free)
{
	return qdisc_drop(skb, sch, to_free);
}

However, I don't think the verifier supports pointer to pointer now. Meaning
"struct sk_buff **to_free" does not work.

If the ptr indirection spinning in my head is sound, one possible solution to 
unblock the qdisc work is to introduce:

struct bpf_sk_buff_ptr {
	struct sk_buff *skb;
};

and the bpf_qdisc_drop kfunc:

__bpf_kfunc int bpf_qdisc_drop(struct sk_buff *skb, struct Qdisc *sch,
                                struct bpf_sk_buff_ptr *to_free_list)

and the enqueue prog:

SEC("struct_ops/enqueue")
int BPF_PROG(test_enqueue, struct sk_buff *skb,
              struct Qdisc *sch,
              struct bpf_sk_buff_ptr *to_free_list)
{
	return bpf_qdisc_drop(skb, sch, to_free_list);
}

and the ".is_valid_access" needs to change the btf_type from "struct sk_buff **" 
to "struct bpf_sk_buff_ptr *" which is sort of similar to the bpf_tcp_ca.c that 
is changing the "struct sock *" type to the "struct tcp_sock *" type.

I have the compiler-tested idea here: 
https://git.kernel.org/pub/scm/linux/kernel/git/martin.lau/bpf-next.git/log/?h=qdisc-ideas


> 
>> Then one more thing is to track when the struct_ops bpf prog is actually reading
>> the value of the skb pointer. One thing is worth to mention here, e.g. a
>> struct_ops prog for enqueue:
>>
>> SEC("struct_ops")
>> int BPF_PROG(bpf_dropall_enqueue, struct sk_buff *skb, struct Qdisc *sch,
>>               struct sk_buff **to_free)
>> {
>>          return bpf_qdisc_drop(skb, sch, to_free);
>> }
>>
>> Take a look at the BPF_PROG macro, the bpf prog is getting a pointer to an array
>> of __u64 as the only argument. The skb is actually in ctx[0], sch is in
>> ctx[1]...etc. When ctx[0] is read to get the skb pointer (e.g. r1 = ctx[0]),
>> btf_ctx_access() marks the reg_type to PTR_TRUSTED. It needs to also initialize
>> the reg->ref_obj_id by the id obtained earlier from acquire_reference_state()
>> during check_struct_ops_btf_id() somehow.

Amery Hung Feb. 9, 2024, 8:14 p.m. UTC | #12

On Thu, Feb 1, 2024 at 5:47 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 1/31/24 8:23 AM, Amery Hung wrote:
> >>> 1. Passing a referenced kptr into a bpf program, which will also need
> >>> to be released, or exchanged into maps or allocated objects.
> >> "enqueue" should be the one considering here:
> >>
> >> struct Qdisc_ops {
> >>          /* ... */
> >>          int                     (*enqueue)(struct sk_buff *skb,
> >>                                             struct Qdisc *sch,
> >>                                             struct sk_buff **to_free);
> >>
> >> };
> >>
> >> The verifier only marks the skb as a trusted kptr but does not mark its
> >> reg->ref_obj_id. Take a look at btf_ctx_access(). In particular:
> >>
> >>          if (prog_args_trusted(prog))
> >>                  info->reg_type |= PTR_TRUSTED;
> >>
> >> The verifier does not know the skb ownership is passed into the ".enqueue" ops
> >> and does not know the bpf prog needs to release it or store it in a map.
> >>
> >> The verifier tracks the reference state when a KF_ACQUIRE kfunc is called (just
> >> an example, not saying we need to use KF_ACQUIRE kfunc). Take a look at
> >> acquire_reference_state() which is the useful one here.
> >>
> >> Whenever the verifier is loading the ".enqueue" bpf_prog, the verifier can
> >> always acquire_reference_state() for the "struct sk_buff *skb" argument.
> >>
> >> Take a look at a recent RFC:
> >> https://lore.kernel.org/bpf/20240122212217.1391878-1-thinker.li@gmail.com/
> >> which is tagging the argument of an ops (e.g. ".enqueue" here). That RFC patch
> >> is tagging the argument could be NULL by appending "__nullable" to the argument
> >> name. The verifier will enforce that the bpf prog must check for NULL first.
> >>
> >> The similar idea can be used here but with a different tagging (for example,
> >> "__must_release", admittedly not a good name). While the RFC patch is
> >> in-progress, for now, may be hardcode for the ".enqueue" ops in
> >> check_struct_ops_btf_id() and always acquire_reference_state() for the skb. This
> >> part can be adjusted later once the RFC patch will be in shape.
> >>
> > Make sense. One more thing to consider here is that .enqueue is
> > actually a reference acquiring and releasing function at the same
> > time. Assuming ctx written to by a struct_ops program can be seen by
> > the kernel, another new tag for the "to_free" argument will still be
> > needed so that the verifier can recognize when writing skb to
> > "to_free".
>
> I don't think "to_free" needs special tagging. I was thinking the
> "bpf_qdisc_drop" kfunc could be a KF_RELEASE. Ideally, it should be like
>
> __bpf_kfunc int bpf_qdisc_drop(struct sk_buff *skb, struct Qdisc *sch,
>                                struct sk_buff **to_free)
> {
>         return qdisc_drop(skb, sch, to_free);
> }
>
> However, I don't think the verifier supports pointer to pointer now. Meaning
> "struct sk_buff **to_free" does not work.
>
> If the ptr indirection spinning in my head is sound, one possible solution to
> unblock the qdisc work is to introduce:
>
> struct bpf_sk_buff_ptr {
>         struct sk_buff *skb;
> };
>
> and the bpf_qdisc_drop kfunc:
>
> __bpf_kfunc int bpf_qdisc_drop(struct sk_buff *skb, struct Qdisc *sch,
>                                 struct bpf_sk_buff_ptr *to_free_list)
>
> and the enqueue prog:
>
> SEC("struct_ops/enqueue")
> int BPF_PROG(test_enqueue, struct sk_buff *skb,
>               struct Qdisc *sch,
>               struct bpf_sk_buff_ptr *to_free_list)
> {
>         return bpf_qdisc_drop(skb, sch, to_free_list);
> }
>
> and the ".is_valid_access" needs to change the btf_type from "struct sk_buff **"
> to "struct bpf_sk_buff_ptr *" which is sort of similar to the bpf_tcp_ca.c that
> is changing the "struct sock *" type to the "struct tcp_sock *" type.
>
> I have the compiler-tested idea here:
> https://git.kernel.org/pub/scm/linux/kernel/git/martin.lau/bpf-next.git/log/?h=qdisc-ideas
>
>
> >
> >> Then one more thing is to track when the struct_ops bpf prog is actually reading
> >> the value of the skb pointer. One thing is worth to mention here, e.g. a
> >> struct_ops prog for enqueue:
> >>
> >> SEC("struct_ops")
> >> int BPF_PROG(bpf_dropall_enqueue, struct sk_buff *skb, struct Qdisc *sch,
> >>               struct sk_buff **to_free)
> >> {
> >>          return bpf_qdisc_drop(skb, sch, to_free);
> >> }
> >>
> >> Take a look at the BPF_PROG macro, the bpf prog is getting a pointer to an array
> >> of __u64 as the only argument. The skb is actually in ctx[0], sch is in
> >> ctx[1]...etc. When ctx[0] is read to get the skb pointer (e.g. r1 = ctx[0]),
> >> btf_ctx_access() marks the reg_type to PTR_TRUSTED. It needs to also initialize
> >> the reg->ref_obj_id by the id obtained earlier from acquire_reference_state()
> >> during check_struct_ops_btf_id() somehow.
>

I appreciate the idea. The pointer redirection works without problems.
I now have a working fifo bpf qdisc using struct_ops. I will explore
how other parts of qdisc work with struct_ops.

Thanks,
Amery

[RFC,v7,1/8] net_sched: Introduce eBPF based Qdisc

Checks

Commit Message

Comments

Patch