Message ID | 20211208145459.9590-3-xiangxia.m.yue@gmail.com (mailing list archive)
---|---
State | Rejected |
Delegated to: | BPF |
Series | fix bpf_redirect to ifb netdev
Context | Check | Description
---|---|---
netdev/tree_selection | success | Clearly marked for net |
netdev/fixes_present | success | Fixes tag present in non-next series |
netdev/subject_prefix | success | Link |
netdev/cover_letter | success | Series has a cover letter |
netdev/patch_count | success | Link |
netdev/header_inline | success | No static functions without inline keyword in header files |
netdev/build_32bit | success | Errors and warnings before: 7 this patch: 7 |
netdev/cc_maintainers | warning | 1 maintainers not CCed: bpf@vger.kernel.org |
netdev/build_clang | success | Errors and warnings before: 20 this patch: 20 |
netdev/module_param | success | Was 0 now: 0 |
netdev/verify_signedoff | success | Signed-off-by tag matches author and committer |
netdev/verify_fixes | success | No Fixes tag |
netdev/build_allmodconfig_warn | success | Errors and warnings before: 9 this patch: 9 |
netdev/checkpatch | warning | WARNING: Possible repeated word: 'in' |
netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
netdev/source_inline | success | Was 0 now: 0 |
bpf/vmtest-bpf-next | success | VM_Test |
bpf/vmtest-bpf-next-PR | success | PR summary |
xiangxia.m.yue@ wrote:
> From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
>
> Try to resolve the issues as below:
> * We look up and then check the tc_skip_classify flag in the net
>   sched layer, even though the skb doesn't want to be classified.
>   That case may consume a lot of CPU cycles. This patch
>   is useful when there are a lot of filters with different
>   prios. There are ~5 prios in production, ~1% improvement.
>
> Rules as below:
> $ for id in $(seq 1 5); do
> $ tc filter add ... egress prio $id ... action mirred egress redirect dev ifb0
> $ done
>
> * bpf_redirect may be invoked in the egress path. If we don't
>   check the flags and then return immediately, the packets
>   will loop back.

This would be the naive case, right? Meaning the BPF program is
doing a redirect without any logic, or is buggy?

Can you map out how this happens for me? I'm not fully sure I
understand the exact concern. Is it possible for BPF programs
that used to see packets to no longer see the packet as expected?

Is this the path you are talking about?

  rx ethx ->
    execute BPF program on ethx with bpf_redirect(ifb0) ->
    __skb_dequeue @ifb tc_skip_classify = 1 ->
    dev_queue_xmit() ->
    sch_handle_egress() ->
    execute BPF program again

I can't see why you want to skip that second tc BPF program,
or for that matter any tc filter there. In general, how do you
know that is the correct/expected behavior? Before the above
change it would have been called; what if it's doing useful
work?

Also, it's not clear how your ifb setup is built or used. That
might help me understand your use case. I would just remove the
ifb altogether, and then the above discussion is moot.

Thanks,
John
John Fastabend wrote:
> xiangxia.m.yue@ wrote:
> > From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> [...]
>
> Also, it's not clear how your ifb setup is built or used. That
> might help me understand your use case. I would just remove the
> ifb altogether, and then the above discussion is moot.
>
> Thanks,
> John

After a bit further thought (and coffee) I think this will
break some programs that exist today. Consider the case
where I pop a header off and resubmit to the same device
intentionally, to reprocess the pkt without the header.
I've used this pattern in BPF a few times.
On Sat, Dec 11, 2021 at 12:43 AM John Fastabend <john.fastabend@gmail.com> wrote:
> xiangxia.m.yue@ wrote:
> > From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> [...]
>
> This would be the naive case, right? Meaning the BPF program is
> doing a redirect without any logic, or is buggy?
>
> Can you map out how this happens for me? I'm not fully sure I
> understand the exact concern. Is it possible for BPF programs
> that used to see packets to no longer see the packet as expected?
>
> Is this the path you are talking about?

Hi John

Tx ethx -> __dev_queue_xmit -> sch_handle_egress ->
  execute BPF program on ethx with bpf_redirect(ifb0) ->
  ifb_xmit -> ifb_ri_tasklet -> dev_queue_xmit -> __dev_queue_xmit

The packets loop back; that means bpf_redirect doesn't work with an ifb
netdev, right? So in sch_handle_egress, I add the check
skb_skip_tc_classify().

> rx ethx ->
>   execute BPF program on ethx with bpf_redirect(ifb0) ->
>   __skb_dequeue @ifb tc_skip_classify = 1 ->
>   dev_queue_xmit() ->
>   sch_handle_egress() ->
>   execute BPF program again
>
> I can't see why you want to skip that second tc BPF program,
> or for that matter any tc filter there. In general, how do you
> know that is the correct/expected behavior? Before the above
> change it would have been called; what if it's doing useful
> work?

bpf_redirect works fine on ingress with ifb:

__netif_receive_skb_core -> sch_handle_ingress -> bpf_redirect(ifb0) ->
  ifb_xmit -> netif_receive_skb -> __netif_receive_skb_core

but __netif_receive_skb_core -> skb_skip_tc_classify() (so the packets
will execute the BPF program again)

> Also, it's not clear how your ifb setup is built or used. That
> might help me understand your use case. I would just remove the
> ifb altogether, and then the above discussion is moot.

tc filter add dev veth1 egress bpf direct-action obj \
    test_bpf_redirect_ifb.o sec redirect_ifb

The test_bpf_redirect_ifb BPF program:

+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 DiDi Global */
+
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+
+SEC("redirect_ifb")
+int redirect(struct __sk_buff *skb)
+{
+	return bpf_redirect(skb->ifindex + 1 /* ifbX */, 0);
+}
+
+char __license[] SEC("license") = "GPL";

The 3/3 is a selftest:
https://patchwork.kernel.org/project/netdevbpf/patch/20211208145459.9590-4-xiangxia.m.yue@gmail.com/

> Thanks,
> John
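The egress loop described above can be sketched in plain C. This is a toy userspace model, not kernel code; the function and field names are borrowed from the kernel path for readability, and the recursion plus hop counter merely stand in for the real requeue behavior. It shows that an unconditional bpf_redirect(ifb0) at the egress hook keeps re-entering the hook, and that consuming tc_skip_classify in sch_handle_egress terminates the cycle after exactly one pass through ifb:

```c
/* Hypothetical userspace model of the egress redirect loop. */
#include <assert.h>
#include <stdbool.h>

struct skb {
    bool tc_skip_classify;  /* set by ifb before re-injecting */
    int hops;               /* traversals of the egress hook */
};

void dev_queue_xmit_model(struct skb *skb, bool check_skip);

/* Model of ifb_ri_tasklet for the !from_ingress case: ifb marks the
 * skb and calls dev_queue_xmit() again. */
void ifb_xmit_model(struct skb *skb, bool check_skip)
{
    skb->tc_skip_classify = true;
    dev_queue_xmit_model(skb, check_skip);
}

void dev_queue_xmit_model(struct skb *skb, bool check_skip)
{
    /* sch_handle_egress() */
    if (skb->hops >= 16)  /* guard so the model terminates; the real
                           * path would keep looping */
        return;
    skb->hops++;
    if (check_skip && skb->tc_skip_classify) {
        skb->tc_skip_classify = false;
        return;           /* skip classification, transmit for real */
    }
    /* BPF program runs: bpf_redirect(ifb0), as in
     * test_bpf_redirect_ifb.o above */
    ifb_xmit_model(skb, check_skip);
}
```

With the check, a packet visits the egress hook twice (once on ethx, once when ifb re-injects it); without it, the model only stops at the artificial guard.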
On Sat, Dec 11, 2021 at 12:52 AM John Fastabend <john.fastabend@gmail.com> wrote:
> John Fastabend wrote:
> > xiangxia.m.yue@ wrote:
> [...]
>
> After a bit further thought (and coffee) I think this will
> break some programs that exist today. Consider the case
> where I pop a header off and resubmit to the same device
> intentionally, to reprocess the pkt without the header. I've
> used this pattern in BPF a few times.

No, the ifb netdev sets skb->tc_skip_classify = 1, which means we should
not process the skb again, no matter whether on egress or ingress. If
the BPF programs don't set skb->tc_skip_classify = 1, then we can
process them again.
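The flag semantics argued here are consume-once: the kernel's skb_skip_tc_classify() helper tests the bit and clears it, so only the first hook after ifb is skipped. A minimal userspace restatement (the struct is a hypothetical stand-in for sk_buff; the logic mirrors the kernel helper):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct sk_buff; only the one bit we need. */
struct skb {
    bool tc_skip_classify;
};

/* Mirrors skb_skip_tc_classify(): returns true exactly once after ifb
 * set the flag, clearing it so later hooks classify the packet
 * normally again. */
bool skb_skip_tc_classify_model(struct skb *skb)
{
    if (skb->tc_skip_classify) {
        skb->tc_skip_classify = false;
        return true;
    }
    return false;
}
```

This is why a program that pops a header and resubmits is unaffected unless something (like ifb) explicitly set the flag on that skb.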
On Sat, Dec 11, 2021 at 1:37 AM Tonghao Zhang <xiangxia.m.yue@gmail.com> wrote:
> On Sat, Dec 11, 2021 at 12:43 AM John Fastabend
> <john.fastabend@gmail.com> wrote:
> [...]
> bpf_redirect works fine on ingress with ifb:
> __netif_receive_skb_core -> sch_handle_ingress -> bpf_redirect(ifb0) ->
>   ifb_xmit -> netif_receive_skb -> __netif_receive_skb_core
> but
> __netif_receive_skb_core -> skb_skip_tc_classify() (so the packets will
> execute the BPF program again)

so the packets will NOT execute the BPF program again

> [...]
> --
> Best regards, Tonghao
On Sat, Dec 11, 2021 at 1:46 AM Tonghao Zhang <xiangxia.m.yue@gmail.com> wrote:
> On Sat, Dec 11, 2021 at 1:37 AM Tonghao Zhang <xiangxia.m.yue@gmail.com> wrote:
> [...]
> > Thanks,
> > John

Hi John

Did I answer your question? I hope I explained things clearly.

> --
> Best regards, Tonghao
On 12/10/21 8:54 PM, Tonghao Zhang wrote:
> On Sat, Dec 11, 2021 at 1:46 AM Tonghao Zhang <xiangxia.m.yue@gmail.com> wrote:
>> [...]
>>> Hi John
>>> Tx ethx -> __dev_queue_xmit -> sch_handle_egress
>>> -> execute BPF program on ethx with bpf_redirect(ifb0) ->
>>> -> ifb_xmit -> ifb_ri_tasklet -> dev_queue_xmit -> __dev_queue_xmit
>>> the packets loop back, that means bpf_redirect doesn't work with ifb
>>> netdev, right?
>>> so in sch_handle_egress, I add the check skb_skip_tc_classify().

But why would you do that? Usage like this is just broken by design..
If you need to loop anything back to RX, just use bpf_redirect() with
BPF_F_INGRESS? What is the concrete/actual rationale for ifb here?
On Sat, Dec 11, 2021 at 4:11 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 12/10/21 8:54 PM, Tonghao Zhang wrote:
> [...]
>
> But why would you do that? Usage like this is just broken by design..

As I understand it, we can redirect packets to a target device either at
ingress or at *egress*.

Commit 3896d655f4d491c67d669a15f275a39f713410f8:

    Allow eBPF programs attached to classifier/actions to call
    bpf_clone_redirect(skb, ifindex, flags) helper which will mirror or
    redirect the packet by dynamic ifindex selection from within the
    program to a target device either at ingress or at egress. Can be
    used for various scenarios, for example, to load balance skbs into
    veths, split parts of the traffic to local taps, etc.

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=3896d655f4d491c67d669a15f275a39f713410f8
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=27b29f63058d26c6c1742f1993338280d5a41dc6

But at egress, bpf_redirect doesn't work with ifb.

> If you need to loop anything back to RX, just use bpf_redirect() with

We don't use it to loop packets back; the flags argument of bpf_redirect
is 0. For example:

    tc filter add dev veth1 \
        egress bpf direct-action obj test_bpf_redirect_ifb.o sec redirect_ifb

https://patchwork.kernel.org/project/netdevbpf/patch/20211208145459.9590-4-xiangxia.m.yue@gmail.com/

> BPF_F_INGRESS? What is the concrete/actual rationale for ifb here?

We load balance the packets to different ifb netdevices at egress. On
the ifb devices, we install filters and rate-limit (police) actions.
On Sat, Dec 11, 2021 at 4:11 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 12/10/21 8:54 PM, Tonghao Zhang wrote:
> [...]
>
> But why would you do that? Usage like this is just broken by design..
> If you need to loop anything back to RX, just use bpf_redirect() with
> BPF_F_INGRESS? What is the concrete/actual rationale for ifb here?

Hi

Note that ifb_ri_tasklet can send out the packets or receive the skb:

    ifb_ri_tasklet:
        if (!skb->from_ingress) {
            dev_queue_xmit(skb);  /* bpf_redirect to ifb: ifb invokes
                                   * dev_queue_xmit in our case */
        } else {
            skb_pull_rcsum(skb, skb->mac_len);
            netif_receive_skb(skb);
        }
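The branch quoted above can be restated as a tiny userspace model; a hedged sketch (enum and function names invented for the model, only the from_ingress test is taken from the source) of how ifb dispatches a dequeued skb depending on which direction the redirect came from:

```c
#include <assert.h>
#include <stdbool.h>

enum ifb_out { OUT_XMIT, OUT_RECEIVE };

struct skb {
    bool from_ingress;  /* set when the redirect came from an ingress hook */
};

/* Model of the dispatch in ifb_ri_tasklet(): packets redirected from
 * egress are re-sent via dev_queue_xmit(), packets redirected from
 * ingress are looped back up via netif_receive_skb(). */
enum ifb_out ifb_dispatch(const struct skb *skb)
{
    if (!skb->from_ingress)
        return OUT_XMIT;     /* dev_queue_xmit(skb) -> egress path again */
    return OUT_RECEIVE;      /* netif_receive_skb(skb) */
}
```

The egress case is the one under discussion: OUT_XMIT re-enters __dev_queue_xmit, which is why the tc_skip_classify check at sch_handle_egress matters there.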
On Sun, Dec 12, 2021 at 5:40 PM Tonghao Zhang <xiangxia.m.yue@gmail.com> wrote:
> On Sat, Dec 11, 2021 at 4:11 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> [...]
> Hi
> Note that ifb_ri_tasklet can send out the packets or receive the skb
> [...]

Hi

In this thread, I have tried to explain this patch and answer the
questions. What should I do next? v1-v4 are "Changes Requested" in
patchwork, but v5 is "Rejected". Should I add more to the commit message
in this patch and send a v6? Patches 1/3 and 3/3 still need to be
reviewed.

> --
> Best regards, Tonghao
On 12/11/21 1:37 AM, Tonghao Zhang wrote:
> On Sat, Dec 11, 2021 at 4:11 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
[...]
>> BPF_F_INGRESS? What is the concrete/actual rationale for ifb here?
> We load balance the packets to different ifb netdevices at egress. On
> ifb, we install filters, rate limit police,

I guess this part here is what I don't quite follow. Could you walk me
through the packet flow in this case? So you go from bpf@tc-egress@phys-dev
to do the redirect to bpf@tc-egress@ifb, and then again to
bpf@tc-egress@phys-dev (same dev or a different one, I presume)? Why not
do the load-balancing, apply the policy, and do the rate-limiting (e.g.
EDT with sch_fq) directly at the initial bpf@tc-egress@phys-dev location,
given BPF is perfectly capable of doing all of it there w/o the extra
detour & overhead through ifb?

The issue I see here is adding extra overhead to support such a narrow
case that nobody else is using and that can be achieved already with
existing infra as I understood it; the justification right now to add
the extra checks to the critical fast path is very thin..

Thanks,
Daniel
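Daniel's alternative, EDT-style pacing done directly at bpf@tc-egress@phys-dev, boils down to stamping each packet with an earliest departure time and letting sch_fq enforce it. A hedged userspace sketch of just the timestamp arithmetic (state struct, names, and the rate are illustrative, not from the patch; a real tc BPF program would write the result into skb->tstamp):

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Per-flow pacing state: the next allowed departure time in ns. */
struct edt_state {
    uint64_t next_tstamp_ns;
};

/* Compute the earliest-departure timestamp for a packet of wire_len
 * bytes at rate_bps: never earlier than now, and spaced from the
 * previous packet by the packet's serialization time at the rate. */
uint64_t edt_stamp(struct edt_state *st, uint64_t now_ns,
                   uint64_t wire_len, uint64_t rate_bps)
{
    uint64_t delay_ns = wire_len * 8 * NSEC_PER_SEC / rate_bps;
    uint64_t tstamp = st->next_tstamp_ns > now_ns ? st->next_tstamp_ns
                                                  : now_ns;

    st->next_tstamp_ns = tstamp + delay_ns;
    return tstamp;
}
```

With this approach the rate-limiting happens in one pass at the physical device's egress hook, avoiding the ifb detour the thread is debating.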
On Thu, Dec 16, 2021 at 8:37 PM Daniel Borkmann <daniel@iogearbox.net> wrote: > > On 12/11/21 1:37 AM, Tonghao Zhang wrote: > > On Sat, Dec 11, 2021 at 4:11 AM Daniel Borkmann <daniel@iogearbox.net> wrote: > >> On 12/10/21 8:54 PM, Tonghao Zhang wrote: > >>> On Sat, Dec 11, 2021 at 1:46 AM Tonghao Zhang <xiangxia.m.yue@gmail.com> wrote: > >>>> On Sat, Dec 11, 2021 at 1:37 AM Tonghao Zhang <xiangxia.m.yue@gmail.com> wrote: > >>>>> On Sat, Dec 11, 2021 at 12:43 AM John Fastabend > >>>>> <john.fastabend@gmail.com> wrote: > >>>>>> xiangxia.m.yue@ wrote: > [...] > >>>>> Hi John > >>>>> Tx ethx -> __dev_queue_xmit -> sch_handle_egress > >>>>> -> execute BPF program on ethx with bpf_redirect(ifb0) -> > >>>>> -> ifb_xmit -> ifb_ri_tasklet -> dev_queue_xmit -> __dev_queue_xmit > >>>>> the packets loopbacks, that means bpf_redirect doesn't work with ifb > >>>>> netdev, right ? > >>>>> so in sch_handle_egress, I add the check skb_skip_tc_classify(). > >> > >> But why would you do that? Usage like this is just broken by design.. > > As I understand, we can redirect packets to a target device either at > > ingress or at *egress > > > > The commit ID: 3896d655f4d491c67d669a15f275a39f713410f8 > > Allow eBPF programs attached to classifier/actions to call > > bpf_clone_redirect(skb, ifindex, flags) helper which will mirror or > > redirect the packet by dynamic ifindex selection from within the > > program to a target device either at ingress or at egress. Can be used > > for various scenarios, for example, to load balance skbs into veths, > > split parts of the traffic to local taps, etc. > > > > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=3896d655f4d491c67d669a15f275a39f713410f8 > > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=27b29f63058d26c6c1742f1993338280d5a41dc6 > > > > But at egress the bpf_redirect doesn't work with ifb. 
> >> If you need to loop anything back to RX, just use bpf_redirect() with
> > Not use it to loop packets back. the flags of bpf_redirect is 0. for example:
> >
> > tc filter add dev veth1 \
> > egress bpf direct-action obj test_bpf_redirect_ifb.o sec redirect_ifb
> > https://patchwork.kernel.org/project/netdevbpf/patch/20211208145459.9590-4-xiangxia.m.yue@gmail.com/
> >> BPF_F_INGRESS? What is the concrete/actual rationale for ifb here?
> > We load balance the packets to different ifb netdevices at egress. On
> > ifb, we install filters, rate limit police,
Hi Daniel, I'll try to answer your questions. Thanks.
> I guess this part here is what I don't quite follow. Could you walk me through
> the packet flow in this case? So you go from bpf@tc-egress@phys-dev to do the
> redirect to bpf@tc-egress@ifb, and then again to bpf@tc-egress@phys-dev (same
ifb doesn't have a bpf prog installed; ifb redirects packets back to the
netdevice they came from:
ifb_ri_tasklet
1. skb->dev = dev_get_by_index_rcu(dev_net(txp->dev), skb->skb_iif);
   // skb_iif is the phys-dev
2. dev_queue_xmit
> dev or different one I presume)? Why not doing the load-balancing, applying the
> policy, and doing the rate-limiting (e.g. EDT with sch_fq) directly at the initial
> bpf@tc-egress@phys-dev location given bpf is perfectly capable to do all of it
> there w/o the extra detour & overhead through ifb? The issue I see here is adding
This is not easy to explain. We have two solutions in our production
environment: one is this, the other is my other patchset:
https://patchwork.kernel.org/project/netdevbpf/list/?series=593409&archive=both&state=*

For why I need those solutions, please see my patch 1/2 commit message:
https://patchwork.kernel.org/project/netdevbpf/patch/20211210023626.20905-2-xiangxia.m.yue@gmail.com/

The difference between the two solutions is that this one uses ifb to do the
rate-limiting/load-balancing for applications whose key metric is high
throughput.
Then for applications whose key metric is the low latency of data access, we
can use the mq or fifo qdisc on the phys-dev. This is very useful in a k8s
environment. Anyway, my other patchset is better than this solution, but this
solution is easy to use because we can use the tc cmd (not bpf) to install
filters.

This patch tries to fix the bug in bpf.
> extra overhead to support such a narrow case that nobody else is using and that
I don't agree with that; I think it's very useful in k8s. If the pods of k8s
are sharing the network resource, is this really a narrow case? Part of the
pods want low latency, and another part want high throughput, so this
solution is selectable.
> can be achieved already with existing infra as I understood it; the justification
> right now to add the extra checks to the critical fast path is very thin..
Note that:
For the tc cmd, there is already the check "skb_skip_tc_classify" in
tcf_action_exec. This patch only wants to move that check to
sch_handle_egress to avoid the unnecessary tc filter lookup.
For bpf_redirect, we don't run tcf_action_exec, so the packets loop back if
there is no check. This patch tries to fix a bug (bpf) and improve
performance (tc).

I used netperf/iperf to measure performance; it's not easy to detect any
performance drop.

> Thanks,
> Daniel
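[The loopback discussed in the thread above can be modeled in plain C. This is a user-space sketch with stand-in types, not kernel code: the always-redirecting program, the hop counter, and the loop guard are assumptions for illustration only.]

```c
#include <stdbool.h>

/* User-space sketch of the ifb loopback: the names mirror the kernel
 * functions, but everything here is a simplified stand-in. */

struct sk_buff {
	bool tc_skip_classify;
	int hops;
};

static bool delivered;

static void dev_queue_xmit(struct sk_buff *skb);

/* ifb re-transmits on the originating device (via skb->skb_iif in the
 * real kernel) after setting tc_skip_classify. */
static void ifb_xmit(struct sk_buff *skb)
{
	skb->tc_skip_classify = true;
	dev_queue_xmit(skb);
}

/* Stand-in for an egress BPF program that always redirects to ifb0. */
static bool bpf_prog_redirects(struct sk_buff *skb)
{
	(void)skb;
	return true;
}

static void dev_queue_xmit(struct sk_buff *skb)
{
	if (++skb->hops > 16)			/* loop guard, demo only */
		return;
	/* The patch's fix: honor tc_skip_classify before classification. */
	if (skb->tc_skip_classify) {
		skb->tc_skip_classify = false;	/* one-shot, as in the kernel */
	} else if (bpf_prog_redirects(skb)) {
		ifb_xmit(skb);			/* redirected to ifb */
		return;
	}
	delivered = true;			/* reached the real driver */
}

/* Returns the number of passes through dev_queue_xmit, or -1 on loop. */
static int simulate(void)
{
	struct sk_buff skb = { 0 };

	delivered = false;
	dev_queue_xmit(&skb);
	return delivered ? skb.hops : -1;
}
```

With the tc_skip_classify branch in place the packet takes two passes (egress redirect, then ifb re-transmit) and is delivered; remove that branch and the model spins until the loop guard, which is the loopback being reported.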
On Fri, Dec 17, 2021 at 11:21 AM Tonghao Zhang <xiangxia.m.yue@gmail.com> wrote:
>
> On Thu, Dec 16, 2021 at 8:37 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >
> > On 12/11/21 1:37 AM, Tonghao Zhang wrote:
> > > On Sat, Dec 11, 2021 at 4:11 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > >> On 12/10/21 8:54 PM, Tonghao Zhang wrote:
> > >>> On Sat, Dec 11, 2021 at 1:46 AM Tonghao Zhang <xiangxia.m.yue@gmail.com> wrote:
> > >>>> On Sat, Dec 11, 2021 at 1:37 AM Tonghao Zhang <xiangxia.m.yue@gmail.com> wrote:
> > >>>>> On Sat, Dec 11, 2021 at 12:43 AM John Fastabend
> > >>>>> <john.fastabend@gmail.com> wrote:
> > >>>>>> xiangxia.m.yue@ wrote:
> > [...]
> > >>>>> Hi John
> > >>>>> Tx ethx -> __dev_queue_xmit -> sch_handle_egress
> > >>>>> -> execute BPF program on ethx with bpf_redirect(ifb0) ->
> > >>>>> -> ifb_xmit -> ifb_ri_tasklet -> dev_queue_xmit -> __dev_queue_xmit
> > >>>>> the packets loopbacks, that means bpf_redirect doesn't work with ifb
> > >>>>> netdev, right ?
> > >>>>> so in sch_handle_egress, I add the check skb_skip_tc_classify().
> > >>
> > >> But why would you do that? Usage like this is just broken by design..
> > > As I understand, we can redirect packets to a target device either at
> > > ingress or at *egress
> > >
> > > The commit ID: 3896d655f4d491c67d669a15f275a39f713410f8
> > > Allow eBPF programs attached to classifier/actions to call
> > > bpf_clone_redirect(skb, ifindex, flags) helper which will mirror or
> > > redirect the packet by dynamic ifindex selection from within the
> > > program to a target device either at ingress or at egress. Can be used
> > > for various scenarios, for example, to load balance skbs into veths,
> > > split parts of the traffic to local taps, etc.
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=3896d655f4d491c67d669a15f275a39f713410f8
> > > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=27b29f63058d26c6c1742f1993338280d5a41dc6
> > >
> > > But at egress the bpf_redirect doesn't work with ifb.
> > >> If you need to loop anything back to RX, just use bpf_redirect() with
> > > Not use it to loop packets back. the flags of bpf_redirect is 0. for example:
> > >
> > > tc filter add dev veth1 \
> > > egress bpf direct-action obj test_bpf_redirect_ifb.o sec redirect_ifb
> > > https://patchwork.kernel.org/project/netdevbpf/patch/20211208145459.9590-4-xiangxia.m.yue@gmail.com/
> > >> BPF_F_INGRESS? What is the concrete/actual rationale for ifb here?
> > > We load balance the packets to different ifb netdevices at egress. On
> > > ifb, we install filters, rate limit police,
> Hi Daniel, I try to answer your question. thanks
> > I guess this part here is what I don't quite follow. Could you walk me through
> > the packet flow in this case? So you go from bpf@tc-egress@phys-dev to do the
> > redirect to bpf@tc-egress@ifb, and then again to bpf@tc-egress@phys-dev (same
> ifb doesn't install bpf prog, ifb will redirect packets to netdevice
> which packets from.
> ifb_ri_tasklet
> 1. skb->dev = dev_get_by_index_rcu(dev_net(txp->dev), skb->skb_iif);
> // skb_iif is phys-dev
> 2. dev_queue_xmit
>
> > dev or different one I presume)? Why not doing the load-balancing, applying the
> > policy, and doing the rate-limiting (e.g. EDT with sch_fq) directly at the initial
> > bpf@tc-egress@phys-dev location given bpf is perfectly capable to do all of it
> > there w/o the extra detour & overhead through ifb? The issue I see here is adding
> This is not easy to explain.
> We have two solution for our production,
> one is this, other one is my
> other patchset:
> https://patchwork.kernel.org/project/netdevbpf/list/?series=593409&archive=both&state=*
>
> Why I need those solutions, please see my patch 1/2 commit message.
> https://patchwork.kernel.org/project/netdevbpf/patch/20211210023626.20905-2-xiangxia.m.yue@gmail.com/
>
> The different of two solution is that this one use ifb to do the
> rate-limiting/load-blance for applications which
> key is high throughput. Then for applications which key is the low
> latency of data access, can use the
> mq or fifo qdisc in the phys-dev. This is very useful in k8s
> environment. Anyway my other patchset is better
> for this one sulotion. but this one solution is easy to used because
> we can use the tc cmd(not bpf) to install filters.
>
> This patch try to fix the bug in bpf.
> > extra overhead to support such a narrow case that nobody else is using and that
> I don't agree that, I think it's very useful in k8s. If the pods of
> k8s are sharing the network resource, this is really a narrow case.
> But part of pods want low latency, and other part of pods want high
> throughput, this solution is selectable.
> > can be achieved already with existing infra as I understood it; the justification
> > right now to add the extra checks to the critical fast path is very thin..
> note that:
> For tc cmd, there is the check "skb_skip_tc_classify" in
> tcf_action_exec, This patch only want to move check to
> sch_handle_egress to avoid the
> unessential tc filter lookup.
> For bpf_redirect, we don't run tcf_action_exec, the packets loopback
> if there is no check. This patch try to fix bug(bpf), and improve
> performance(tc)
>
> I use the netperf/iperf to measure performance, It's not easy to
> figure out performance drop.

Hi Daniel, any progress? Did I answer your questions? If you think we should
not add this check in the fast path (it only moves skb_skip_tc_classify from
tcf_action_exec to sch_handle_egress),
I'm OK to drop this patch, but then we wait for other users to report this
issue in the future, and if so, we can fix it with this patch. Can you help
me review the other patches 1/3 and 3/3?

> > Thanks,
> > Daniel
>
> --
> Best regards, Tonghao
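[For reference, the EDT ("earliest departure time") approach Daniel suggests amounts to the egress BPF program stamping each skb with a departure time and letting sch_fq pace on it. The timestamp arithmetic can be sketched as below; this is illustrative C, not the kernel or a real BPF program, and the single-flow state and rate parameter are assumptions. In a real tc BPF program, `next_tstamp` would live in a BPF map and the result would be written to skb->tstamp.]

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Per-flow pacing state; in a real tc BPF program this would live in a
 * BPF map keyed by flow. */
static uint64_t next_tstamp;

/* Return the earliest departure time for a packet of pkt_len bytes on a
 * flow limited to rate bytes/sec; sch_fq would hold the skb until then. */
static uint64_t edt_stamp(uint64_t now_ns, uint32_t pkt_len, uint64_t rate)
{
	/* Time this packet occupies the wire at the target rate. */
	uint64_t delay_ns = (uint64_t)pkt_len * NSEC_PER_SEC / rate;
	uint64_t tstamp;

	if (next_tstamp < now_ns)
		next_tstamp = now_ns;	/* flow was idle: depart now */
	tstamp = next_tstamp;		/* this skb's departure time */
	next_tstamp += delay_ns;	/* reserve wire time for its bytes */
	return tstamp;
}
```

This does the rate-limiting directly at bpf@tc-egress@phys-dev, which is the "no ifb detour" alternative argued for in the thread.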
diff --git a/net/core/dev.c b/net/core/dev.c
index a64297a4cc89..81ad415b78f9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3823,6 +3823,9 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 	if (!miniq)
 		return skb;
 
+	if (skb_skip_tc_classify(skb))
+		return skb;
+
 	/* qdisc_skb_cb(skb)->pkt_len was already set by the caller. */
 	qdisc_skb_cb(skb)->mru = 0;
 	qdisc_skb_cb(skb)->post_ct = false;
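[Context for the hunk above: skb_skip_tc_classify() is a test-and-clear helper, so the flag set by ifb suppresses exactly one pass through egress classification. A simplified model, ignoring the CONFIG_NET_CLS_ACT guard and with an illustrative struct layout:]

```c
#include <stdbool.h>

/* Simplified model of the helper used in the hunk above: the
 * tc_skip_classify bit is tested and cleared in one step, so it only
 * skips a single classification pass. */
struct sk_buff {
	unsigned int tc_skip_classify:1;
};

static bool skb_skip_tc_classify(struct sk_buff *skb)
{
	if (skb->tc_skip_classify) {
		skb->tc_skip_classify = 0;	/* one-shot */
		return true;
	}
	return false;
}
```

Because the clear happens on the first check, moving the call from tcf_action_exec into sch_handle_egress changes where the one skipped pass occurs, not how many passes are skipped.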