Message ID | 20220104082821.22487-1-paulb@nvidia.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net,1/1] net: openvswitch: Fix ct_state nat flags for conns arriving from tc | expand |
On Tue, 4 Jan 2022 10:28:21 +0200 Paul Blakey wrote: > Netfilter conntrack maintains NAT flags per connection indicating > whether NAT was configured for the connection. Openvswitch maintains > NAT flags on the per packet flow key ct_state field, indicating > whether NAT was actually executed on the packet. > > When a packet misses from tc to ovs the conntrack NAT flags are set. > However, NAT was not necessarily executed on the packet because the > connection's state might still be in NEW state. As such, openvswitch wrongly > assumes that NAT was executed and sets an incorrect flow key NAT flags. > > Fix this, by flagging to openvswitch which NAT was actually done in > act_ct via tc_skb_ext and tc_skb_cb to the openvswitch module, so > the packet flow key NAT flags will be correctly set. Fixes ? > Signed-off-by: Paul Blakey <paulb@nvidia.com> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > index 4507d77d6941..bab45a009310 100644 > --- a/include/linux/skbuff.h > +++ b/include/linux/skbuff.h > @@ -287,7 +287,9 @@ struct tc_skb_ext { > __u32 chain; > __u16 mru; > __u16 zone; > - bool post_ct; > + bool post_ct:1; > + bool post_ct_snat:1; > + bool post_ct_dnat:1; single bit bool variables seem weird, use a unsigned int type, like u8. > }; > #endif > > diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h > index 9e71691c491b..a171dfa91910 100644 > --- a/include/net/pkt_sched.h > +++ b/include/net/pkt_sched.h > @@ -197,7 +197,9 @@ struct tc_skb_cb { > struct qdisc_skb_cb qdisc_cb; > > u16 mru; > - bool post_ct; > + bool post_ct: 1; extra space > + bool post_ct_snat:1; > + bool post_ct_dnat:1; > u16 zone; /* Only valid if post_ct = true */ > };
On Tue, 4 Jan 2022, Jakub Kicinski wrote: > On Tue, 4 Jan 2022 10:28:21 +0200 Paul Blakey wrote: > > Netfilter conntrack maintains NAT flags per connection indicating > > whether NAT was configured for the connection. Openvswitch maintains > > NAT flags on the per packet flow key ct_state field, indicating > > whether NAT was actually executed on the packet. > > > > When a packet misses from tc to ovs the conntrack NAT flags are set. > > However, NAT was not necessarily executed on the packet because the > > connection's state might still be in NEW state. As such, openvswitch wrongly > > assumes that NAT was executed and sets an incorrect flow key NAT flags. > > > > Fix this, by flagging to openvswitch which NAT was actually done in > > act_ct via tc_skb_ext and tc_skb_cb to the openvswitch module, so > > the packet flow key NAT flags will be correctly set. > > Fixes ? I wasn't sure which patches to blame, I guess the bug was there from the introduction of action ct in tc, so I'll blame that. > > > Signed-off-by: Paul Blakey <paulb@nvidia.com> > > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > > index 4507d77d6941..bab45a009310 100644 > > --- a/include/linux/skbuff.h > > +++ b/include/linux/skbuff.h > > @@ -287,7 +287,9 @@ struct tc_skb_ext { > > __u32 chain; > > __u16 mru; > > __u16 zone; > > - bool post_ct; > > + bool post_ct:1; > > + bool post_ct_snat:1; > > + bool post_ct_dnat:1; > > single bit bool variables seem weird, use a unsigned int type, like u8. > > > }; > > #endif > > > > diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h > > index 9e71691c491b..a171dfa91910 100644 > > --- a/include/net/pkt_sched.h > > +++ b/include/net/pkt_sched.h > > @@ -197,7 +197,9 @@ struct tc_skb_cb { > > struct qdisc_skb_cb qdisc_cb; > > > > u16 mru; > > - bool post_ct; > > + bool post_ct: 1; > > extra space Will remove, and send v2. > > > + bool post_ct_snat:1; > > + bool post_ct_dnat:1; > > u16 zone; /* Only valid if post_ct = true */ > > }; >
On 2022-01-04 03:28, Paul Blakey wrote: [..] > --- a/include/linux/skbuff.h > +++ b/include/linux/skbuff.h > @@ -287,7 +287,9 @@ struct tc_skb_ext { > __u32 chain; > __u16 mru; > __u16 zone; > - bool post_ct; > + bool post_ct:1; > + bool post_ct_snat:1; > + bool post_ct_dnat:1; > }; is skb_ext intended only for ovs? If yes, why does it belong in the core code? Ex: Looking at tcf_classify() which is such a core function in the fast path any packet going via tc, it is now encumbered with with checking presence of skb_ext. I know passing around metadata is a paramount requirement for programmability but this is getting messier with speacial use cases for ovs and/or offload... cheers, jamal
On 1/5/22 3:57 PM, Jamal Hadi Salim wrote: > On 2022-01-04 03:28, Paul Blakey wrote: > [..] >> --- a/include/linux/skbuff.h >> +++ b/include/linux/skbuff.h >> @@ -287,7 +287,9 @@ struct tc_skb_ext { >> __u32 chain; >> __u16 mru; >> __u16 zone; >> - bool post_ct; >> + bool post_ct:1; >> + bool post_ct_snat:1; >> + bool post_ct_dnat:1; >> }; > > is skb_ext intended only for ovs? If yes, why does it belong > in the core code? Ex: Looking at tcf_classify() which is such > a core function in the fast path any packet going via tc, it > is now encumbered with with checking presence of skb_ext. > I know passing around metadata is a paramount requirement > for programmability but this is getting messier with speacial > use cases for ovs and/or offload... Full ack on the bloat for corner cases like ovs offload, especially given distros just enable most stuff anyway and therefore no light fast path as with !CONFIG_NET_TC_SKB_EXT. :( Could this somehow be hidden behind static key or such if offloads are not used, so we can shrink it back to just calling into plain __tcf_classify() for sw-only use cases (like BPF)?
On Wed, 5 Jan 2022, Daniel Borkmann wrote: > On 1/5/22 3:57 PM, Jamal Hadi Salim wrote: > > On 2022-01-04 03:28, Paul Blakey wrote: > > [..] > >> --- a/include/linux/skbuff.h > >> +++ b/include/linux/skbuff.h > >> @@ -287,7 +287,9 @@ struct tc_skb_ext { > >> __u32 chain; > >> __u16 mru; > >> __u16 zone; > >> - bool post_ct; > >> + bool post_ct:1; > >> + bool post_ct_snat:1; > >> + bool post_ct_dnat:1; > >> }; > > > > is skb_ext intended only for ovs? If yes, why does it belong > > in the core code? Ex: Looking at tcf_classify() which is such > > a core function in the fast path any packet going via tc, it > > is now encumbered with with checking presence of skb_ext. > > I know passing around metadata is a paramount requirement > > for programmability but this is getting messier with speacial > > use cases for ovs and/or offload... > > Full ack on the bloat for corner cases like ovs offload, especially > given distros just enable most stuff anyway and therefore no light > fast path as with !CONFIG_NET_TC_SKB_EXT. :( > > Could this somehow be hidden behind static key or such if offloads > are not used, so we can shrink it back to just calling into plain > __tcf_classify() for sw-only use cases (like BPF)? > > It is used for both tc -> ovs and driver -> tc path. I think I can do what you suggest adn will work on something like that, but this specific patch doesn't really change the ext allocation/derefences count (and probably not the size as well). So can we take this (not yet posted v2 after fixing what already mentioned) and I'll do a patch of what you suggest in net-next?
On 2022-01-05 11:18, Paul Blakey wrote: > > > On Wed, 5 Jan 2022, Daniel Borkmann wrote: > [..] >> Full ack on the bloat for corner cases like ovs offload, especially >> given distros just enable most stuff anyway and therefore no light >> fast path as with !CONFIG_NET_TC_SKB_EXT. :( >> >> Could this somehow be hidden behind static key or such if offloads >> are not used, so we can shrink it back to just calling into plain >> __tcf_classify() for sw-only use cases (like BPF)? >> >> > > It is used for both tc -> ovs and driver -> tc path. > > I think I can do what you suggest adn will work on something like > that, but this specific patch doesn't really change the ext > allocation/derefences count (and probably not the size as well). > So can we take this (not yet posted v2 after fixing what already > mentioned) and I'll do a patch of what you suggest in net-next? > Sounds reasonable. The main outstanding challenge is still going to be all these bit declarations (and i am sure more to come) that are specific for specific components in TC (act_ct in this case); if every component started planting their flags(pun intended) then both backwards and forwards compatibility are going to be needed for maintainance. The _cb spaces are supposed to be opaque and meaning to whats in that space is only sensible to the component that is storing and retrieving. Something like chain id is ok in that namespaces because it has global meaning to TC, the others are not. My suggestion is going forward try to not add more component specific variables. For a proper solution: I think we need some sort of "metadata bus" to resolve this in the long run. Something which is a bigger surgery... cheers, jamal cheers, jamal
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 4507d77d6941..bab45a009310 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -287,7 +287,9 @@ struct tc_skb_ext { __u32 chain; __u16 mru; __u16 zone; - bool post_ct; + bool post_ct:1; + bool post_ct_snat:1; + bool post_ct_dnat:1; }; #endif diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h index 9e71691c491b..a171dfa91910 100644 --- a/include/net/pkt_sched.h +++ b/include/net/pkt_sched.h @@ -197,7 +197,9 @@ struct tc_skb_cb { struct qdisc_skb_cb qdisc_cb; u16 mru; - bool post_ct; + bool post_ct: 1; + bool post_ct_snat:1; + bool post_ct_dnat:1; u16 zone; /* Only valid if post_ct = true */ }; diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c index 6d262d9aa10e..02096f2ec678 100644 --- a/net/openvswitch/flow.c +++ b/net/openvswitch/flow.c @@ -859,7 +859,7 @@ int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info, #if IS_ENABLED(CONFIG_NET_TC_SKB_EXT) struct tc_skb_ext *tc_ext; #endif - bool post_ct = false; + bool post_ct = false, post_ct_snat = false, post_ct_dnat = false; int res, err; u16 zone = 0; @@ -900,6 +900,8 @@ int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info, key->recirc_id = tc_ext ? tc_ext->chain : 0; OVS_CB(skb)->mru = tc_ext ? tc_ext->mru : 0; post_ct = tc_ext ? tc_ext->post_ct : false; + post_ct_snat = post_ct ? tc_ext->post_ct_snat : false; + post_ct_dnat = post_ct ? tc_ext->post_ct_dnat : false; zone = post_ct ? tc_ext->zone : 0; } else { key->recirc_id = 0; @@ -911,8 +913,16 @@ int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info, err = key_extract(skb, key); if (!err) { ovs_ct_fill_key(skb, key, post_ct); /* Must be after key_extract(). */ - if (post_ct && !skb_get_nfct(skb)) - key->ct_zone = zone; + if (post_ct) { + if (!skb_get_nfct(skb)) { + key->ct_zone = zone; + } else { + if (!post_ct_dnat) + key->ct_state &= ~OVS_CS_F_DST_NAT; + if (!post_ct_snat) + key->ct_state &= ~OVS_CS_F_SRC_NAT; + } + } } return err; } diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c index ab3591408419..2a17eb77c904 100644 --- a/net/sched/act_ct.c +++ b/net/sched/act_ct.c @@ -839,6 +839,12 @@ static int ct_nat_execute(struct sk_buff *skb, struct nf_conn *ct, } err = nf_nat_packet(ct, ctinfo, hooknum, skb); + if (err == NF_ACCEPT) { + if (maniptype == NF_NAT_MANIP_SRC) + tc_skb_cb(skb)->post_ct_snat = 1; + if (maniptype == NF_NAT_MANIP_DST) + tc_skb_cb(skb)->post_ct_dnat = 1; + } out: return err; } diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c index 35c74bdde848..cc9409aa755e 100644 --- a/net/sched/cls_api.c +++ b/net/sched/cls_api.c @@ -1625,6 +1625,8 @@ int tcf_classify(struct sk_buff *skb, ext->chain = last_executed_chain; ext->mru = cb->mru; ext->post_ct = cb->post_ct; + ext->post_ct_snat = cb->post_ct_snat; + ext->post_ct_dnat = cb->post_ct_dnat; ext->zone = cb->zone; }
Netfilter conntrack maintains NAT flags per connection indicating whether NAT was configured for the connection. Openvswitch maintains NAT flags on the per packet flow key ct_state field, indicating whether NAT was actually executed on the packet. When a packet misses from tc to ovs the conntrack NAT flags are set. However, NAT was not necessarily executed on the packet because the connection's state might still be in NEW state. As such, openvswitch wrongly assumes that NAT was executed and sets an incorrect flow key NAT flags. Fix this, by flagging to openvswitch which NAT was actually done in act_ct via tc_skb_ext and tc_skb_cb to the openvswitch module, so the packet flow key NAT flags will be correctly set. Signed-off-by: Paul Blakey <paulb@nvidia.com> --- include/linux/skbuff.h | 4 +++- include/net/pkt_sched.h | 4 +++- net/openvswitch/flow.c | 16 +++++++++++++--- net/sched/act_ct.c | 6 ++++++ net/sched/cls_api.c | 2 ++ 5 files changed, 27 insertions(+), 5 deletions(-)