[net,1/1] net: openvswitch: Fix ct_state nat flags for conns arriving from tc

Message ID 20220104082821.22487-1-paulb@nvidia.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers

Checks

Context                        Check    Description
netdev/tree_selection          success  Clearly marked for net
netdev/fixes_present           fail     Series targets non-next tree, but doesn't contain any Fixes tags
netdev/subject_prefix          success  Link
netdev/cover_letter            success  Single patches do not need cover letters
netdev/patch_count             success  Link
netdev/header_inline           success  No static functions without inline keyword in header files
netdev/build_32bit             success  Errors and warnings before: 6019 this patch: 6019
netdev/cc_maintainers          warning  2 maintainers not CCed: jiri@resnulli.us jonathan.lemon@gmail.com
netdev/build_clang             success  Errors and warnings before: 1028 this patch: 1028
netdev/module_param            success  Was 0 now: 0
netdev/verify_signedoff        success  Signed-off-by tag matches author and committer
netdev/verify_fixes            success  No Fixes tag
netdev/build_allmodconfig_warn success  Errors and warnings before: 6169 this patch: 6169
netdev/checkpatch              success  total: 0 errors, 0 warnings, 0 checks, 74 lines checked
netdev/kdoc                    success  Errors and warnings before: 0 this patch: 0
netdev/source_inline           success  Was 0 now: 0

Commit Message

Paul Blakey Jan. 4, 2022, 8:28 a.m. UTC
Netfilter conntrack maintains NAT flags per connection indicating
whether NAT was configured for the connection. Openvswitch maintains
NAT flags on the per packet flow key ct_state field, indicating
whether NAT was actually executed on the packet.

When a packet misses from tc to ovs, the conntrack NAT flags are set.
However, NAT was not necessarily executed on the packet, because the
connection might still be in the NEW state. As such, openvswitch wrongly
assumes that NAT was executed and sets incorrect flow key NAT flags.

Fix this by having act_ct flag which NAT was actually done and passing
that to the openvswitch module via tc_skb_cb and tc_skb_ext, so the
packet flow key NAT flags are set correctly.

Signed-off-by: Paul Blakey <paulb@nvidia.com>
---
 include/linux/skbuff.h  |  4 +++-
 include/net/pkt_sched.h |  4 +++-
 net/openvswitch/flow.c  | 16 +++++++++++++---
 net/sched/act_ct.c      |  6 ++++++
 net/sched/cls_api.c     |  2 ++
 5 files changed, 27 insertions(+), 5 deletions(-)
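
The fix described above can be modelled in userspace as a small sketch. Everything prefixed with sketch_ or SKETCH_ is hypothetical and exists only for illustration; the flag values are assumed stand-ins for OVS_CS_F_SRC_NAT and OVS_CS_F_DST_NAT, and has_nfct stands in for the skb_get_nfct() check in the patch. It is not the kernel code itself, only the masking logic as read from the diff.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for OVS_CS_F_SRC_NAT / OVS_CS_F_DST_NAT. */
#define SKETCH_CS_F_SRC_NAT 0x40u
#define SKETCH_CS_F_DST_NAT 0x80u

/* Hypothetical model of the metadata carried from tc to ovs. */
struct sketch_tc_meta {
	bool post_ct;       /* packet went through act_ct */
	bool post_ct_snat;  /* source NAT was actually executed */
	bool post_ct_dnat;  /* destination NAT was actually executed */
};

/*
 * Clear the per-packet NAT bits that conntrack reports as configured
 * but that act_ct did not actually execute on this packet.
 */
static uint32_t sketch_fixup_ct_state(uint32_t ct_state, bool has_nfct,
				      const struct sketch_tc_meta *meta)
{
	/* Only packets that came from act_ct with a conntrack entry are fixed up. */
	if (!meta->post_ct || !has_nfct)
		return ct_state;

	if (!meta->post_ct_dnat)
		ct_state &= ~SKETCH_CS_F_DST_NAT;
	if (!meta->post_ct_snat)
		ct_state &= ~SKETCH_CS_F_SRC_NAT;

	return ct_state;
}
```

For a connection with both NAT types configured but only DNAT executed, the sketch keeps the DST_NAT bit and clears the SRC_NAT bit, which is the per-packet meaning the commit message attributes to ct_state.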

Comments

Jakub Kicinski Jan. 4, 2022, 6:08 p.m. UTC | #1
On Tue, 4 Jan 2022 10:28:21 +0200 Paul Blakey wrote:
> Netfilter conntrack maintains NAT flags per connection indicating
> whether NAT was configured for the connection. Openvswitch maintains
> NAT flags on the per packet flow key ct_state field, indicating
> whether NAT was actually executed on the packet.
> 
> When a packet misses from tc to ovs, the conntrack NAT flags are set.
> However, NAT was not necessarily executed on the packet, because the
> connection might still be in the NEW state. As such, openvswitch wrongly
> assumes that NAT was executed and sets incorrect flow key NAT flags.
> 
> Fix this by having act_ct flag which NAT was actually done and passing
> that to the openvswitch module via tc_skb_cb and tc_skb_ext, so the
> packet flow key NAT flags are set correctly.

Fixes ?

> Signed-off-by: Paul Blakey <paulb@nvidia.com>

> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 4507d77d6941..bab45a009310 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -287,7 +287,9 @@ struct tc_skb_ext {
>  	__u32 chain;
>  	__u16 mru;
>  	__u16 zone;
> -	bool post_ct;
> +	bool post_ct:1;
> +	bool post_ct_snat:1;
> +	bool post_ct_dnat:1;

Single-bit bool variables seem weird; use an unsigned integer type, like u8.

>  };
>  #endif
>  
> diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
> index 9e71691c491b..a171dfa91910 100644
> --- a/include/net/pkt_sched.h
> +++ b/include/net/pkt_sched.h
> @@ -197,7 +197,9 @@ struct tc_skb_cb {
>  	struct qdisc_skb_cb qdisc_cb;
>  
>  	u16 mru;
> -	bool post_ct;
> +	bool post_ct: 1;

extra space

> +	bool post_ct_snat:1;
> +	bool post_ct_dnat:1;
>  	u16 zone; /* Only valid if post_ct = true */
>  };
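
One possible reading of the bit-field comment above, as a userspace sketch: the struct and helper names are hypothetical, and uint8_t stands in for the kernel's u8. Bit-fields of type uint8_t are an implementation-defined extension in standard C, accepted by GCC and Clang. The layout only mirrors the fields visible in the quoted hunk.

```c
#include <stdint.h>

/* Hypothetical layout using unsigned single-bit fields instead of bool. */
struct sketch_tc_flags {
	uint16_t mru;
	uint8_t post_ct:1;
	uint8_t post_ct_snat:1;
	uint8_t post_ct_dnat:1;
	uint16_t zone; /* only meaningful when post_ct is set */
};

/*
 * Unlike a bool bit-field, an unsigned 1-bit field truncates on
 * assignment, so normalise any non-zero value to 1 with "!!" first.
 */
static void sketch_set_snat(struct sketch_tc_flags *f, int nat_done)
{
	f->post_ct_snat = !!nat_done;
}
```

The "!!" matters because assigning 2 directly to an unsigned 1-bit field would store 0, whereas a bool field would store 1.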
Paul Blakey Jan. 5, 2022, 8:29 a.m. UTC | #2
On Tue, 4 Jan 2022, Jakub Kicinski wrote:

> On Tue, 4 Jan 2022 10:28:21 +0200 Paul Blakey wrote:
> > Netfilter conntrack maintains NAT flags per connection indicating
> > whether NAT was configured for the connection. Openvswitch maintains
> > NAT flags on the per packet flow key ct_state field, indicating
> > whether NAT was actually executed on the packet.
> > 
> > When a packet misses from tc to ovs, the conntrack NAT flags are set.
> > However, NAT was not necessarily executed on the packet, because the
> > connection might still be in the NEW state. As such, openvswitch wrongly
> > assumes that NAT was executed and sets incorrect flow key NAT flags.
> > 
> > Fix this by having act_ct flag which NAT was actually done and passing
> > that to the openvswitch module via tc_skb_cb and tc_skb_ext, so the
> > packet flow key NAT flags are set correctly.
> 
> Fixes ?

I wasn't sure which patches to blame. I guess the bug has been there since the
introduction of action ct in tc, so I'll blame that.

> 
> > Signed-off-by: Paul Blakey <paulb@nvidia.com>
> 
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 4507d77d6941..bab45a009310 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -287,7 +287,9 @@ struct tc_skb_ext {
> >  	__u32 chain;
> >  	__u16 mru;
> >  	__u16 zone;
> > -	bool post_ct;
> > +	bool post_ct:1;
> > +	bool post_ct_snat:1;
> > +	bool post_ct_dnat:1;
> 
> Single-bit bool variables seem weird; use an unsigned integer type, like u8.
> 
> >  };
> >  #endif
> >  
> > diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
> > index 9e71691c491b..a171dfa91910 100644
> > --- a/include/net/pkt_sched.h
> > +++ b/include/net/pkt_sched.h
> > @@ -197,7 +197,9 @@ struct tc_skb_cb {
> >  	struct qdisc_skb_cb qdisc_cb;
> >  
> >  	u16 mru;
> > -	bool post_ct;
> > +	bool post_ct: 1;
> 
> extra space

Will remove, and send v2.

> 
> > +	bool post_ct_snat:1;
> > +	bool post_ct_dnat:1;
> >  	u16 zone; /* Only valid if post_ct = true */
> >  };
>
Jamal Hadi Salim Jan. 5, 2022, 2:57 p.m. UTC | #3
On 2022-01-04 03:28, Paul Blakey wrote:
[..]
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -287,7 +287,9 @@ struct tc_skb_ext {
>   	__u32 chain;
>   	__u16 mru;
>   	__u16 zone;
> -	bool post_ct;
> +	bool post_ct:1;
> +	bool post_ct_snat:1;
> +	bool post_ct_dnat:1;
>   };


Is skb_ext intended only for ovs? If yes, why does it belong
in the core code? Ex: looking at tcf_classify(), which is such
a core function in the fast path for any packet going via tc, it
is now encumbered with checking for the presence of skb_ext.
I know passing around metadata is a paramount requirement
for programmability, but this is getting messier with special
use cases for ovs and/or offload...

cheers,
jamal
Daniel Borkmann Jan. 5, 2022, 3:30 p.m. UTC | #4
On 1/5/22 3:57 PM, Jamal Hadi Salim wrote:
> On 2022-01-04 03:28, Paul Blakey wrote:
> [..]
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -287,7 +287,9 @@ struct tc_skb_ext {
>>       __u32 chain;
>>       __u16 mru;
>>       __u16 zone;
>> -    bool post_ct;
>> +    bool post_ct:1;
>> +    bool post_ct_snat:1;
>> +    bool post_ct_dnat:1;
>>   };
> 
> Is skb_ext intended only for ovs? If yes, why does it belong
> in the core code? Ex: looking at tcf_classify(), which is such
> a core function in the fast path for any packet going via tc, it
> is now encumbered with checking for the presence of skb_ext.
> I know passing around metadata is a paramount requirement
> for programmability, but this is getting messier with special
> use cases for ovs and/or offload...

Full ack on the bloat for corner cases like ovs offload, especially
given distros just enable most stuff anyway and therefore no light
fast path as with !CONFIG_NET_TC_SKB_EXT. :(

Could this somehow be hidden behind static key or such if offloads
are not used, so we can shrink it back to just calling into plain
__tcf_classify() for sw-only use cases (like BPF)?
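
The gating idea in the question above might look roughly like the following userspace model. All sketch_ names are hypothetical: the plain bool stands in for a kernel static key (which would presumably be declared with DEFINE_STATIC_KEY_FALSE() and tested with static_branch_unlikely()), and sketch_classify_plain() stands in for __tcf_classify(). It illustrates the control flow only and makes no claim about how the kernel ended up implementing it.

```c
#include <stdbool.h>

/* Stand-in for a static key: true only while some user needs the ext. */
static bool sketch_ext_in_use;
static int sketch_ext_users;

/* Called when a user of the extension (e.g. an offload path) appears. */
static void sketch_ext_get(void)
{
	if (sketch_ext_users++ == 0)
		sketch_ext_in_use = true;
}

/* Called when that user goes away again. */
static void sketch_ext_put(void)
{
	if (--sketch_ext_users == 0)
		sketch_ext_in_use = false;
}

/* Stand-in for the plain classifier; here it just echoes its input. */
static int sketch_classify_plain(int pkt)
{
	return pkt;
}

/* Wrapper: touches the extension only when the key is enabled. */
static int sketch_classify(int pkt, bool *ext_checked)
{
	*ext_checked = false;
	if (!sketch_ext_in_use)
		return sketch_classify_plain(pkt); /* light sw-only path */

	*ext_checked = true; /* extension lookup/fill would happen here */
	return sketch_classify_plain(pkt);
}
```

With no users registered, every call takes the light path; the extension handling is only reached between a sketch_ext_get() and the matching sketch_ext_put().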
Paul Blakey Jan. 5, 2022, 4:18 p.m. UTC | #5
On Wed, 5 Jan 2022, Daniel Borkmann wrote:

> On 1/5/22 3:57 PM, Jamal Hadi Salim wrote:
> > On 2022-01-04 03:28, Paul Blakey wrote:
> > [..]
> >> --- a/include/linux/skbuff.h
> >> +++ b/include/linux/skbuff.h
> >> @@ -287,7 +287,9 @@ struct tc_skb_ext {
> >>       __u32 chain;
> >>       __u16 mru;
> >>       __u16 zone;
> >> -    bool post_ct;
> >> +    bool post_ct:1;
> >> +    bool post_ct_snat:1;
> >> +    bool post_ct_dnat:1;
> >>   };
> > 
> > Is skb_ext intended only for ovs? If yes, why does it belong
> > in the core code? Ex: looking at tcf_classify(), which is such
> > a core function in the fast path for any packet going via tc, it
> > is now encumbered with checking for the presence of skb_ext.
> > I know passing around metadata is a paramount requirement
> > for programmability, but this is getting messier with special
> > use cases for ovs and/or offload...
> 
> Full ack on the bloat for corner cases like ovs offload, especially
> given distros just enable most stuff anyway and therefore no light
> fast path as with !CONFIG_NET_TC_SKB_EXT. :(
> 
> Could this somehow be hidden behind static key or such if offloads
> are not used, so we can shrink it back to just calling into plain
> __tcf_classify() for sw-only use cases (like BPF)?
> 
> 

It is used for both the tc -> ovs and the driver -> tc paths.

I think I can do what you suggest and will work on something like
that, but this specific patch doesn't really change the ext
allocation/dereference count (and probably not the size either).
So can we take this (the not-yet-posted v2, after fixing what was already
mentioned), and I'll do a patch for what you suggest in net-next?
Jamal Hadi Salim Jan. 6, 2022, 12:54 p.m. UTC | #6
On 2022-01-05 11:18, Paul Blakey wrote:
> 
> 
> On Wed, 5 Jan 2022, Daniel Borkmann wrote:
> 

[..]

>> Full ack on the bloat for corner cases like ovs offload, especially
>> given distros just enable most stuff anyway and therefore no light
>> fast path as with !CONFIG_NET_TC_SKB_EXT. :(
>>
>> Could this somehow be hidden behind static key or such if offloads
>> are not used, so we can shrink it back to just calling into plain
>> __tcf_classify() for sw-only use cases (like BPF)?
>>
>>
> 
> It is used for both the tc -> ovs and the driver -> tc paths.
> 
> I think I can do what you suggest and will work on something like
> that, but this specific patch doesn't really change the ext
> allocation/dereference count (and probably not the size either).
> So can we take this (the not-yet-posted v2, after fixing what was already
> mentioned), and I'll do a patch for what you suggest in net-next?
> 

Sounds reasonable.

The main outstanding challenge is still going to be all these bit
declarations (and I am sure there are more to come) that are specific to
particular components in TC (act_ct in this case); if every component
started planting their flags (pun intended), then both backwards and
forwards compatibility would be needed for maintenance.

The _cb spaces are supposed to be opaque: the meaning of what is in
that space is only sensible to the component that stores and
retrieves it. Something like chain id is ok in that namespace
because it has global meaning to TC; the others do not. My
suggestion is, going forward, to try not to add more component-specific
variables.
For a proper solution:
I think we need some sort of "metadata bus" to resolve this in the
long run. Something which is a bigger surgery...

cheers,
jamal
Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4507d77d6941..bab45a009310 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -287,7 +287,9 @@  struct tc_skb_ext {
 	__u32 chain;
 	__u16 mru;
 	__u16 zone;
-	bool post_ct;
+	bool post_ct:1;
+	bool post_ct_snat:1;
+	bool post_ct_dnat:1;
 };
 #endif
 
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 9e71691c491b..a171dfa91910 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -197,7 +197,9 @@  struct tc_skb_cb {
 	struct qdisc_skb_cb qdisc_cb;
 
 	u16 mru;
-	bool post_ct;
+	bool post_ct: 1;
+	bool post_ct_snat:1;
+	bool post_ct_dnat:1;
 	u16 zone; /* Only valid if post_ct = true */
 };
 
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 6d262d9aa10e..02096f2ec678 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -859,7 +859,7 @@  int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
 #if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
 	struct tc_skb_ext *tc_ext;
 #endif
-	bool post_ct = false;
+	bool post_ct = false, post_ct_snat = false, post_ct_dnat = false;
 	int res, err;
 	u16 zone = 0;
 
@@ -900,6 +900,8 @@  int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
 		key->recirc_id = tc_ext ? tc_ext->chain : 0;
 		OVS_CB(skb)->mru = tc_ext ? tc_ext->mru : 0;
 		post_ct = tc_ext ? tc_ext->post_ct : false;
+		post_ct_snat = post_ct ? tc_ext->post_ct_snat : false;
+		post_ct_dnat = post_ct ? tc_ext->post_ct_dnat : false;
 		zone = post_ct ? tc_ext->zone : 0;
 	} else {
 		key->recirc_id = 0;
@@ -911,8 +913,16 @@  int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
 	err = key_extract(skb, key);
 	if (!err) {
 		ovs_ct_fill_key(skb, key, post_ct);   /* Must be after key_extract(). */
-		if (post_ct && !skb_get_nfct(skb))
-			key->ct_zone = zone;
+		if (post_ct) {
+			if (!skb_get_nfct(skb)) {
+				key->ct_zone = zone;
+			} else {
+				if (!post_ct_dnat)
+					key->ct_state &= ~OVS_CS_F_DST_NAT;
+				if (!post_ct_snat)
+					key->ct_state &= ~OVS_CS_F_SRC_NAT;
+			}
+		}
 	}
 	return err;
 }
diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
index ab3591408419..2a17eb77c904 100644
--- a/net/sched/act_ct.c
+++ b/net/sched/act_ct.c
@@ -839,6 +839,12 @@  static int ct_nat_execute(struct sk_buff *skb, struct nf_conn *ct,
 	}
 
 	err = nf_nat_packet(ct, ctinfo, hooknum, skb);
+	if (err == NF_ACCEPT) {
+		if (maniptype == NF_NAT_MANIP_SRC)
+			tc_skb_cb(skb)->post_ct_snat = 1;
+		if (maniptype == NF_NAT_MANIP_DST)
+			tc_skb_cb(skb)->post_ct_dnat = 1;
+	}
 out:
 	return err;
 }
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 35c74bdde848..cc9409aa755e 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -1625,6 +1625,8 @@  int tcf_classify(struct sk_buff *skb,
 		ext->chain = last_executed_chain;
 		ext->mru = cb->mru;
 		ext->post_ct = cb->post_ct;
+		ext->post_ct_snat = cb->post_ct_snat;
+		ext->post_ct_dnat = cb->post_ct_dnat;
 		ext->zone = cb->zone;
 	}