[net-next,v8,0/7] net/sched: cls_api: Support hardware miss to tc action

Message ID 20230205154934.22040-1-paulb@nvidia.com (mailing list archive)

Message

Paul Blakey Feb. 5, 2023, 3:49 p.m. UTC
Hi,

This series adds support for hardware miss to instruct tc to continue execution
in a specific tc action instance on a filter's action list. The mlx5 driver patch
(besides the refactors) shows its usage, instead of relying on chain restore alone.

Currently, a filter's action list must be executed in its entirety or not at all,
as drivers are only able to tell tc to continue executing from a specific tc chain,
and not from a specific filter/action.

This is troublesome with regard to action CT, where new connections should
be sent to software (via tc chain restore), while established connections can
be handled in hardware.

Checking for new connections is done when executing the ct action in hardware
(by checking the packet's tuple against the known established tuples).
But if there is a packet modification (pedit) action before action CT and the
checked tuple is a new connection, hardware will need to revert the previous
packet modifications before sending the packet back to software, so it can
re-match the same tc filter in software and re-execute its CT action.

The following is an example configuration of stateless NAT
on the mlx5 driver that isn't supported before this patchset:

 #Setup corresponding mlx5 VFs in namespaces
 $ ip netns add ns0
 $ ip netns add ns1
 $ ip link set dev enp8s0f0v0 netns ns0
 $ ip netns exec ns0 ifconfig enp8s0f0v0 1.1.1.1/24 up
 $ ip link set dev enp8s0f0v1 netns ns1
 $ ip netns exec ns1 ifconfig enp8s0f0v1 1.1.1.2/24 up

 #Setup tc arp and ct rules on mlx5 VF representors
 $ tc qdisc add dev enp8s0f0_0 ingress
 $ tc qdisc add dev enp8s0f0_1 ingress
 $ ifconfig enp8s0f0_0 up
 $ ifconfig enp8s0f0_1 up

 #Original side
 $ tc filter add dev enp8s0f0_0 ingress chain 0 proto ip flower \
    ct_state -trk ip_proto tcp dst_port 8888 \
      action pedit ex munge tcp dport set 5001 pipe \
      action csum ip tcp pipe \
      action ct pipe \
      action goto chain 1
 $ tc filter add dev enp8s0f0_0 ingress chain 1 proto ip flower \
    ct_state +trk+est \
      action mirred egress redirect dev enp8s0f0_1
 $ tc filter add dev enp8s0f0_0 ingress chain 1 proto ip flower \
    ct_state +trk+new \
      action ct commit pipe \
      action mirred egress redirect dev enp8s0f0_1
 $ tc filter add dev enp8s0f0_0 ingress chain 0 proto arp flower \
      action mirred egress redirect dev enp8s0f0_1

 #Reply side
 $ tc filter add dev enp8s0f0_1 ingress chain 0 proto arp flower \
      action mirred egress redirect dev enp8s0f0_0
 $ tc filter add dev enp8s0f0_1 ingress chain 0 proto ip flower \
    ct_state -trk ip_proto tcp \
      action ct pipe \
      action pedit ex munge tcp sport set 8888 pipe \
      action csum ip tcp pipe \
      action mirred egress redirect dev enp8s0f0_0

 #Run traffic
 $ ip netns exec ns1 iperf -s -p 5001&
 $ sleep 2 #wait for iperf to fully open
 $ ip netns exec ns0 iperf -c 1.1.1.2 -p 8888

 #dump tc filter stats on enp8s0f0_0 chain 0 rule and see hardware packets:
 $ tc -s filter show dev enp8s0f0_0 ingress chain 0 proto ip | grep "hardware.*pkt"
        Sent hardware 9310116832 bytes 6149672 pkt
        Sent hardware 9310116832 bytes 6149672 pkt
        Sent hardware 9310116832 bytes 6149672 pkt

A new connection executing the first filter in hardware will first have its
dst port rewritten to the new port, and then the ct action is executed.
Because this is a new connection, hardware will need to send the packet back
to software, on chain 0, to execute the first filter again in software.
The dst port rewrite needs to be reverted, otherwise the packet won't re-match
the old dst port in the first filter. Because of that, the mlx5 driver currently
rejects offloading the above action ct rule.

This series adds support for partial offload of a filter's action list,
letting tc software continue processing from the specific action instance
where hardware left off (in the above case, after the "action pedit ex munge tcp
dport ..." action of the first filter), allowing scenarios such as the above.
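
As a rough way to verify the software/hardware split with the above setup
(a sketch; the exact "Sent software"/"Sent hardware" counter lines and the
conntrack output depend on the installed iproute2/conntrack-tools versions),
compare the filter's software and hardware counters and check that the
connection was committed:

 #Only the first packet(s) of each connection should hit the software counters,
 #the rest should be handled fully in hardware:
 $ tc -s filter show dev enp8s0f0_0 ingress chain 0 proto ip | grep -E "Sent (software|hardware)"

 #The committed connection (with the rewritten dport 5001) should appear in conntrack:
 $ conntrack -L -p tcp 2>/dev/null | grep 5001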

Changelog:
	v1->v2:
	Fixed compilation without CONFIG_NET_CLS
	Cover letter re-write

	v2->v3:
	Unlock spin_lock on error in cls flower filter handle refactor
	Cover letter

	v3->v4:
	Silence warning by clang

	v4->v5:
	Cover letter example
	Removed ifdef as much as possible by using inline stubs

	v5->v6:
	Removed new inlines in cls_api.c (bot complained in patchwork)
	Added reviewed-by/ack - Thanks!

	v6->v7:
	Removed WARN_ON from pkt path (leon)
	Removed unnecessary return in void func

	v7->v8:
	Removed #if IS_ENABLED on skb ext adding Kconfig changes
	Complex variable init in separate lines
	if,else if, else if ---> switch case

Paul Blakey (7):
  net/sched: cls_api: Support hardware miss to tc action
  net/sched: flower: Move filter handle initialization earlier
  net/sched: flower: Support hardware miss to tc action
  net/mlx5: Kconfig: Make tc offload depend on tc skb extension
  net/mlx5: Refactor tc miss handling to a single function
  net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
  net/mlx5e: TC, Set CT miss to the specific ct action instance

 .../net/ethernet/mellanox/mlx5/core/Kconfig   |   2 +-
 .../ethernet/mellanox/mlx5/core/en/rep/tc.c   | 225 ++------------
 .../mellanox/mlx5/core/en/tc/sample.c         |   2 +-
 .../ethernet/mellanox/mlx5/core/en/tc_ct.c    |  32 +-
 .../ethernet/mellanox/mlx5/core/en/tc_ct.h    |   2 +
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |   4 +-
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   | 280 ++++++++++++++++--
 .../net/ethernet/mellanox/mlx5/core/en_tc.h   |  21 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |   2 +
 .../mellanox/mlx5/core/lib/fs_chains.c        |  14 +-
 include/linux/skbuff.h                        |   6 +-
 include/net/flow_offload.h                    |   1 +
 include/net/pkt_cls.h                         |  34 ++-
 include/net/sch_generic.h                     |   2 +
 net/openvswitch/flow.c                        |   3 +-
 net/sched/act_api.c                           |   2 +-
 net/sched/cls_api.c                           | 213 ++++++++++++-
 net/sched/cls_flower.c                        |  73 +++--
 18 files changed, 601 insertions(+), 317 deletions(-)

Comments

Ilya Maximets Feb. 6, 2023, 12:34 p.m. UTC | #1
On 2/5/23 16:49, Paul Blakey wrote:
> [...]


Hi, Paul.  Not sure if this was discussed before, but don't we also need
a new TCA_CLS_FLAGS_IN_HW_PARTIAL flag or something like this?

Currently the in_hw/not_in_hw flags are reported per filter, i.e. these
flags are not per-action.  This may cause confusion among users, if flows
are reported as in_hw, while they are actually partially or even mostly
processed in SW.
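
For reference, today the flag only shows up per filter in the dump, e.g.
(illustrative output; exact formatting depends on the iproute2 version):

 $ tc -s filter show dev enp8s0f0_0 ingress chain 0 proto ip | grep in_hw
   in_hw in_hw_count 1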

What do you think?

Best regards, Ilya Maximets.
Paul Blakey Feb. 6, 2023, 5:14 p.m. UTC | #2
On 06/02/2023 14:34, Ilya Maximets wrote:
> On 2/5/23 16:49, Paul Blakey wrote:
>> [...]
> 
> 
> Hi, Paul.  Not sure if this was discussed before, but don't we also need
> a new TCA_CLS_FLAGS_IN_HW_PARTIAL flag or something like this?
> 
> Currently the in_hw/not_in_hw flags are reported per filter, i.e. these
> flags are not per-action.  This may cause confusion among users, if flows
> are reported as in_hw, while they are actually partially or even mostly
> processed in SW.
> 
> What do you think?
> 
> Best regards, Ilya Maximets.

I think it's a good idea, and I'm fine with proposing something like this
in a different series, as this isn't a new problem introduced by this series
and existed before it, at least with CT rules.

So how about I propose it in a different series and we continue with
this one first?

Thanks,
Paul.
Ilya Maximets Feb. 7, 2023, 12:20 a.m. UTC | #3
On 2/6/23 18:14, Paul Blakey wrote:
> 
> 
> On 06/02/2023 14:34, Ilya Maximets wrote:
>> On 2/5/23 16:49, Paul Blakey wrote:
>>> [...]
>>
>>
>> Hi, Paul.  Not sure if this was discussed before, but don't we also need
>> a new TCA_CLS_FLAGS_IN_HW_PARTIAL flag or something like this?
>>
>> Currently the in_hw/not_in_hw flags are reported per filter, i.e. these
>> flags are not per-action.  This may cause confusion among users, if flows
>> are reported as in_hw, while they are actually partially or even mostly
>> processed in SW.
>>
>> What do you think?
>>
>> Best regards, Ilya Maximets.
> 
> I think its a good idea, and I'm fine with proposing something like this in a
> different series, as this isn't a new problem from this series and existed before
> it, at least with CT rules.

Hmm, I didn't realize the issue already exists.

> 
> So how about I'll propose it in a different series and we continue with this first?

Sounds fine to me.  Thanks!

Best regards, Ilya Maximets.
Marcelo Ricardo Leitner Feb. 7, 2023, 5:03 a.m. UTC | #4
On Tue, Feb 07, 2023 at 01:20:55AM +0100, Ilya Maximets wrote:
> On 2/6/23 18:14, Paul Blakey wrote:
> >
> >
> > On 06/02/2023 14:34, Ilya Maximets wrote:
> >> On 2/5/23 16:49, Paul Blakey wrote:
> >>> [...]
> >>
> >>
> >> Hi, Paul.  Not sure if this was discussed before, but don't we also need
> >> a new TCA_CLS_FLAGS_IN_HW_PARTIAL flag or something like this?
> >>
> >> Currently the in_hw/not_in_hw flags are reported per filter, i.e. these
> >> flags are not per-action.  This may cause confusion among users, if flows
> >> are reported as in_hw, while they are actually partially or even mostly
> >> processed in SW.
> >>
> >> What do you think?
> >>
> >> Best regards, Ilya Maximets.
> >
> > I think its a good idea, and I'm fine with proposing something like this in a
> > different series, as this isn't a new problem from this series and existed before
> > it, at least with CT rules.
>
> Hmm, I didn't realize the issue already exists.

Maintainers: please give me up to Friday to review this patchset.

Disclaimer: I had missed this patchset, and I didn't even read it yet.

I don't follow. Can someone please rephrase the issue?
AFAICT, it is not that the NIC is offloading half of the action list
and never executing a part of it. Instead, for established connections
the rule will work fully offloaded, while for misses in the CT action
it will simply trigger a miss, like it already does today.

>
> >
> > So how about I'll propose it in a different series and we continue with this first?

So I'm not sure either on what's the idea here.

Thanks,
Marcelo

>
> Sounds fine to me.  Thanks!
>
> Best regards, Ilya Maximets.
>
Jakub Kicinski Feb. 7, 2023, 5:20 a.m. UTC | #5
On Mon, 6 Feb 2023 21:03:15 -0800 Marcelo Leitner wrote:
> On Tue, Feb 07, 2023 at 01:20:55AM +0100, Ilya Maximets wrote:
> > On 2/6/23 18:14, Paul Blakey wrote:  
> > > I think its a good idea, and I'm fine with proposing something like this in a
> > > different series, as this isn't a new problem from this series and existed before
> > > it, at least with CT rules.  
> >
> > Hmm, I didn't realize the issue already exists.  
> 
> Maintainers: please give me up to Friday to review this patchset.

No problem, there's a v9 FWIW:

https://lore.kernel.org/all/20230206174403.32733-1-paulb@nvidia.com/
Paul Blakey Feb. 8, 2023, 8:41 a.m. UTC | #6
On 07/02/2023 07:03, Marcelo Leitner wrote:
> On Tue, Feb 07, 2023 at 01:20:55AM +0100, Ilya Maximets wrote:
>> On 2/6/23 18:14, Paul Blakey wrote:
>>>
>>>
>>> On 06/02/2023 14:34, Ilya Maximets wrote:
>>>> On 2/5/23 16:49, Paul Blakey wrote:
>>>>> [...]
>>>>
>>>>
>>>> Hi, Paul.  Not sure if this was discussed before, but don't we also need
>>>> a new TCA_CLS_FLAGS_IN_HW_PARTIAL flag or something like this?
>>>>
>>>> Currently the in_hw/not_in_hw flags are reported per filter, i.e. these
>>>> flags are not per-action.  This may cause confusion among users, if flows
>>>> are reported as in_hw, while they are actually partially or even mostly
>>>> processed in SW.
>>>>
>>>> What do you think?
>>>>
>>>> Best regards, Ilya Maximets.
>>>
>>> I think its a good idea, and I'm fine with proposing something like this in a
>>> different series, as this isn't a new problem from this series and existed before
>>> it, at least with CT rules.
>>
>> Hmm, I didn't realize the issue already exists.
> 
> Maintainers: please give me up to Friday to review this patchset.
> 
> Disclaimer: I had missed this patchset, and I didn't even read it yet.
> 
> I don't follow. Can someone please rephase the issue please?
> AFAICT, it is not that the NIC is offloading half of the action list
> and never executing a part of it. Instead, for established connections
> the rule will work fully offloaded. While for misses in the CT action,
> it will simply trigger a miss, like it already does today.

You got it right, and like you said it was like this before, so it's not
strictly related to this series and could be handled in a different patchset.
I thought that the (extra) flag would mean that the rule can miss, compared to
other rule/action combinations that will never miss because they
don't need sw support.

> 
>>
>>>
>>> So how about I'll propose it in a different series and we continue with this first?
> 
> So I'm not sure either on what's the idea here.
> 
> Thanks,
> Marcelo
> 
>>
>> Sounds fine to me.  Thanks!
>>
>> Best regards, Ilya Maximets.
>>
>
Marcelo Ricardo Leitner Feb. 8, 2023, 6:01 p.m. UTC | #7
On Wed, Feb 08, 2023 at 10:41:39AM +0200, Paul Blakey wrote:
>
>
> On 07/02/2023 07:03, Marcelo Leitner wrote:
> > On Tue, Feb 07, 2023 at 01:20:55AM +0100, Ilya Maximets wrote:
> > > On 2/6/23 18:14, Paul Blakey wrote:
> > > >
> > > >
> > > > On 06/02/2023 14:34, Ilya Maximets wrote:
> > > > > On 2/5/23 16:49, Paul Blakey wrote:
> > > > > > [...]
> > > > >
> > > > >
> > > > > Hi, Paul.  Not sure if this was discussed before, but don't we also need
> > > > > a new TCA_CLS_FLAGS_IN_HW_PARTIAL flag or something like this?
> > > > >
> > > > > Currently the in_hw/not_in_hw flags are reported per filter, i.e. these
> > > > > flags are not per-action.  This may cause confusion among users, if flows
> > > > > are reported as in_hw, while they are actually partially or even mostly
> > > > > processed in SW.
> > > > >
> > > > > What do you think?
> > > > >
> > > > > Best regards, Ilya Maximets.
> > > >
> > > > I think its a good idea, and I'm fine with proposing something like this in a
> > > > different series, as this isn't a new problem from this series and existed before
> > > > it, at least with CT rules.
> > >
> > > Hmm, I didn't realize the issue already exists.
> >
> > Maintainers: please give me up to Friday to review this patchset.
> >
> > Disclaimer: I had missed this patchset, and I didn't even read it yet.
> >
> > I don't follow. Can someone please rephase the issue please?
> > AFAICT, it is not that the NIC is offloading half of the action list
> > and never executing a part of it. Instead, for established connections
> > the rule will work fully offloaded. While for misses in the CT action,
> > it will simply trigger a miss, like it already does today.
>
> You got it right, and like you said it was like this before so its not
> strictly related by this series and could be in a different patchset. And I
> thought that (extra) flag would mean that it can miss, compared to other
> rules/actions combination that will never miss because they
> don't need sw support.

This is different from what I understood from Ilya's comment. Maybe I
got his comment wrong, but I have the impression that he meant it in
the sense of having some actions offloaded and some not, which I think
is not the goal here.

But anyway, flows can have some packets matching in sw while also
being in hw. That's expected. For example, in more complex flow sets,
if a packet hits a flow with a ct action and triggers a miss, all
subsequent flows will handle this packet in sw. Or if we have queued
packets in the rx ring already and ovs just updated the datapath, these
will match in tc sw instead of going to upcall. The latter will have
only a few hits, yes, but the former will be increasing over time.
I'm not sure how a new flag, which is probably more informative than
an actual state indication, would help here.

>
> >
> > >
> > > >
> > > > So how about I'll propose it in a different series and we continue with this first?
> >
> > So I'm not sure either on what's the idea here.
> >
> > Thanks,
> > Marcelo
> >
> > >
> > > Sounds fine to me.  Thanks!
> > >
> > > Best regards, Ilya Maximets.
> > >
> >
>
Ilya Maximets Feb. 9, 2023, 12:09 a.m. UTC | #8
On 2/8/23 19:01, Marcelo Leitner wrote:
> On Wed, Feb 08, 2023 at 10:41:39AM +0200, Paul Blakey wrote:
>>
>>
>> On 07/02/2023 07:03, Marcelo Leitner wrote:
>>> On Tue, Feb 07, 2023 at 01:20:55AM +0100, Ilya Maximets wrote:
>>>> On 2/6/23 18:14, Paul Blakey wrote:
>>>>>
>>>>>
>>>>> On 06/02/2023 14:34, Ilya Maximets wrote:
>>>>>> On 2/5/23 16:49, Paul Blakey wrote:
>>>>>>> [...]
>>>>>>
>>>>>>
>>>>>> Hi, Paul.  Not sure if this was discussed before, but don't we also need
>>>>>> a new TCA_CLS_FLAGS_IN_HW_PARTIAL flag or something like this?
>>>>>>
>>>>>> Currently the in_hw/not_in_hw flags are reported per filter, i.e. these
>>>>>> flags are not per-action.  This may cause confusion among users, if flows
>>>>>> are reported as in_hw, while they are actually partially or even mostly
>>>>>> processed in SW.
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> Best regards, Ilya Maximets.
>>>>>
>>>>> I think its a good idea, and I'm fine with proposing something like this in a
>>>>> different series, as this isn't a new problem from this series and existed before
>>>>> it, at least with CT rules.
>>>>
>>>> Hmm, I didn't realize the issue already exists.
>>>
>>> Maintainers: please give me up to Friday to review this patchset.
>>>
>>> Disclaimer: I had missed this patchset, and I didn't even read it yet.
>>>
>>> I don't follow. Can someone please rephase the issue please?
>>> AFAICT, it is not that the NIC is offloading half of the action list
>>> and never executing a part of it. Instead, for established connections
>>> the rule will work fully offloaded. While for misses in the CT action,
>>> it will simply trigger a miss, like it already does today.
>>
>> You got it right, and like you said it was like this before so its not
>> strictly related by this series and could be in a different patchset. And I
>> thought that (extra) flag would mean that it can miss, compared to other
>> rules/actions combination that will never miss because they
>> don't need sw support.
> 
> This is different from what I understood from Ilya's comment. Maybe I
> got his comment wrong, but I have the impression that he meant it in
> the sense of having some actions offloaded and some not.
> > Which I think it is not the goal here.

I don't really know the code around this patch set well enough, so my
thoughts might be a bit irrelevant.  But after reading the cover letter
and commit messages in this patch set I imagined that if we have some
kind of miss on the N-th action in a list in HW, we could go to software
tc, find that action and continue execution from it.  In this case some
actions are executed in HW and some are in SW.

From the user's perspective, if such a tc filter reports an 'in_hw' flag,
that would be a bit misleading, IMO.

If that is not what is happening here, then please ignore my comments,
as I'm not sure what this code is about then. :)

> 
> But anyway, flows can have some packets matching in sw while also
> being in hw. That's expected. For example, in more complex flow sets,
> if a packet hit a flow with ct action and triggered a miss, all
> subsequent flows will handle this packet in sw. Or if we have queued
> packets in rx ring already and ovs just updated the datapath, these
> will match in tc sw instead of going to upcall. The latter will have
> only a few hits, yes, but the former will be increasing over time.
> I'm not sure how a new flag, which is probably more informative than
> an actual state indication, would help here.

These cases are related to just one or a very few packets, so for them
it's generally fine to report 'in_hw', I think.  The vast majority of
traffic will be handled in HW.

My thoughts were about a case where we have a lot of traffic handled
partially in HW and in SW.  Let's say we have N actions and HW doesn't
support action M.  In this case, driver may offload actions [0, M - 1]
inserting some kind of forced "HW miss" at the end, so actions [M, N]
can be executed in TC software.

But now I'm not sure if that is possible with the current implementation.
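
Just to illustrate what I mean with a purely hypothetical example (assuming a
NIC that could offload the pedit below but not the police action), the idea
would be that the driver offloads the pedit, forces a "HW miss" before the
police action, and tc software continues from the police action onwards:

 $ tc filter add dev enp8s0f0_0 ingress chain 0 proto ip flower \
     ip_proto tcp dst_port 80 \
       action pedit ex munge tcp dport set 8080 pipe \
       action police rate 10mbit burst 100k conform-exceed drop/pipe \
       action mirred egress redirect dev enp8s0f0_1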

> 
>>
>>>
>>>>
>>>>>
>>>>> So how about I'll propose it in a different series and we continue with this first?
>>>
>>> So I'm not sure either on what's the idea here.
>>>
>>> Thanks,
>>> Marcelo
>>>
>>>>
>>>> Sounds fine to me.  Thanks!
>>>>
>>>> Best regards, Ilya Maximets.
>>>>
>>>
>>
>
Marcelo Ricardo Leitner Feb. 9, 2023, 1:09 a.m. UTC | #9
On Thu, Feb 09, 2023 at 01:09:21AM +0100, Ilya Maximets wrote:
> On 2/8/23 19:01, Marcelo Leitner wrote:
> > On Wed, Feb 08, 2023 at 10:41:39AM +0200, Paul Blakey wrote:
> >>
> >>
> >> On 07/02/2023 07:03, Marcelo Leitner wrote:
> >>> On Tue, Feb 07, 2023 at 01:20:55AM +0100, Ilya Maximets wrote:
> >>>> On 2/6/23 18:14, Paul Blakey wrote:
> >>>>>
> >>>>>
> >>>>> On 06/02/2023 14:34, Ilya Maximets wrote:
> >>>>>> On 2/5/23 16:49, Paul Blakey wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> This series adds support for hardware miss to instruct tc to continue execution
> >>>>>>> in a specific tc action instance on a filter's action list. The mlx5 driver patch
> >>>>>>> (besides the refactors) shows its usage instead of using just chain restore.
> >>>>>>>
> >>>>>>> Currently a filter's action list must be executed all together or
> >>>>>>> not at all as driver are only able to tell tc to continue executing from a
> >>>>>>> specific tc chain, and not a specific filter/action.
> >>>>>>>
> >>>>>>> This is troublesome with regards to action CT, where new connections should
> >>>>>>> be sent to software (via tc chain restore), and established connections can
> >>>>>>> be handled in hardware.
> >>>>>>>
> >>>>>>> Checking for new connections is done when executing the ct action in hardware
> >>>>>>> (by checking the packet's tuple against known established tuples).
> >>>>>>> But if there is a packet modification (pedit) action before action CT and the
> >>>>>>> checked tuple is a new connection, hardware will need to revert the previous
> >>>>>>> packet modifications before sending it back to software so it can
> >>>>>>> re-match the same tc filter in software and re-execute its CT action.
> >>>>>>>
> >>>>>>> The following is an example configuration of stateless nat
> >>>>>>> on mlx5 driver that isn't supported before this patchet:
> >>>>>>>
> >>>>>>>    #Setup corrosponding mlx5 VFs in namespaces
> >>>>>>>    $ ip netns add ns0
> >>>>>>>    $ ip netns add ns1
> >>>>>>>    $ ip link set dev enp8s0f0v0 netns ns0
> >>>>>>>    $ ip netns exec ns0 ifconfig enp8s0f0v0 1.1.1.1/24 up
> >>>>>>>    $ ip link set dev enp8s0f0v1 netns ns1
> >>>>>>>    $ ip netns exec ns1 ifconfig enp8s0f0v1 1.1.1.2/24 up
> >>>>>>>
> >>>>>>>    #Setup tc arp and ct rules on mxl5 VF representors
> >>>>>>>    $ tc qdisc add dev enp8s0f0_0 ingress
> >>>>>>>    $ tc qdisc add dev enp8s0f0_1 ingress
> >>>>>>>    $ ifconfig enp8s0f0_0 up
> >>>>>>>    $ ifconfig enp8s0f0_1 up
> >>>>>>>
> >>>>>>>    #Original side
> >>>>>>>    $ tc filter add dev enp8s0f0_0 ingress chain 0 proto ip flower \
> >>>>>>>       ct_state -trk ip_proto tcp dst_port 8888 \
> >>>>>>>         action pedit ex munge tcp dport set 5001 pipe \
> >>>>>>>         action csum ip tcp pipe \
> >>>>>>>         action ct pipe \
> >>>>>>>         action goto chain 1
> >>>>>>>    $ tc filter add dev enp8s0f0_0 ingress chain 1 proto ip flower \
> >>>>>>>       ct_state +trk+est \
> >>>>>>>         action mirred egress redirect dev enp8s0f0_1
> >>>>>>>    $ tc filter add dev enp8s0f0_0 ingress chain 1 proto ip flower \
> >>>>>>>       ct_state +trk+new \
> >>>>>>>         action ct commit pipe \
> >>>>>>>         action mirred egress redirect dev enp8s0f0_1
> >>>>>>>    $ tc filter add dev enp8s0f0_0 ingress chain 0 proto arp flower \
> >>>>>>>         action mirred egress redirect dev enp8s0f0_1
> >>>>>>>
> >>>>>>>    #Reply side
> >>>>>>>    $ tc filter add dev enp8s0f0_1 ingress chain 0 proto arp flower \
> >>>>>>>         action mirred egress redirect dev enp8s0f0_0
> >>>>>>>    $ tc filter add dev enp8s0f0_1 ingress chain 0 proto ip flower \
> >>>>>>>       ct_state -trk ip_proto tcp \
> >>>>>>>         action ct pipe \
> >>>>>>>         action pedit ex munge tcp sport set 8888 pipe \
> >>>>>>>         action csum ip tcp pipe \
> >>>>>>>         action mirred egress redirect dev enp8s0f0_0
> >>>>>>>
> >>>>>>>    #Run traffic
> >>>>>>>    $ ip netns exec ns1 iperf -s -p 5001&
> >>>>>>>    $ sleep 2 #wait for iperf to fully open
> >>>>>>>    $ ip netns exec ns0 iperf -c 1.1.1.2 -p 8888
> >>>>>>>
> >>>>>>>    #dump tc filter stats on enp8s0f0_0 chain 0 rule and see hardware packets:
> >>>>>>>    $ tc -s filter show dev enp8s0f0_0 ingress chain 0 proto ip | grep "hardware.*pkt"
> >>>>>>>           Sent hardware 9310116832 bytes 6149672 pkt
> >>>>>>>           Sent hardware 9310116832 bytes 6149672 pkt
> >>>>>>>           Sent hardware 9310116832 bytes 6149672 pkt
> >>>>>>>
> >>>>>>> A new connection executing the first filter in hardware will first rewrite
> >>>>>>> the dst port to the new port, and then the ct action is executed.
> >>>>>>> Because this is a new connection, hardware will need to send this back
> >>>>>>> to software, on chain 0, to execute the first filter again in software.
> >>>>>>> The dst port needs to be reverted, otherwise it won't re-match the old
> >>>>>>> dst port in the first filter. Because of that, the mlx5 driver currently
> >>>>>>> rejects offloading the above action ct rule.
> >>>>>>>
> >>>>>>> This series adds support for partial offload of a filter's action list,
> >>>>>>> letting tc software continue processing in the specific action instance
> >>>>>>> where hardware left off (in the above case, after the "action pedit ex munge tcp
> >>>>>>> dport..." of the first rule), allowing support for scenarios such as the above.
> >>>>>>
> >>>>>>
> >>>>>> Hi, Paul.  Not sure if this was discussed before, but don't we also need
> >>>>>> a new TCA_CLS_FLAGS_IN_HW_PARTIAL flag or something like this?
> >>>>>>
> >>>>>> Currently the in_hw/not_in_hw flags are reported per filter, i.e. these
> >>>>>> flags are not per-action.  This may cause confusion among users, if flows
> >>>>>> are reported as in_hw, while they are actually partially or even mostly
> >>>>>> processed in SW.
> >>>>>>
> >>>>>> What do you think?
> >>>>>>
> >>>>>> Best regards, Ilya Maximets.
> >>>>>
> >>>>> I think it's a good idea, and I'm fine with proposing something like this in a
> >>>>> different series, as this isn't a new problem introduced by this series and existed before
> >>>>> it, at least with CT rules.
> >>>>
> >>>> Hmm, I didn't realize the issue already exists.
> >>>
> >>> Maintainers: please give me up to Friday to review this patchset.
> >>>
> >>> Disclaimer: I had missed this patchset and haven't even read it yet.
> >>>
> >>> I don't follow. Can someone please rephrase the issue?
> >>> AFAICT, it is not that the NIC is offloading half of the action list
> >>> and never executing a part of it. Instead, for established connections
> >>> the rule will work fully offloaded. While for misses in the CT action,
> >>> it will simply trigger a miss, like it already does today.
> >>
> >> You got it right, and like you said it was like this before, so it's not
> >> strictly related to this series and could be in a different patchset. And I
> >> thought that (extra) flag would mean that it can miss, compared to other
> >> rule/action combinations that will never miss because they
> >> don't need sw support.
> >
> > This is different from what I understood from Ilya's comment. Maybe I
> > got his comment wrong, but I have the impression that he meant it in
> > the sense of having some actions offloaded and some not.
> > Which I think is not the goal here.
>
> I don't really know the code around this patch set well enough, so my
> thoughts might be a bit irrelevant.  But after reading the cover letter
> and commit messages in this patch set I imagined that if we have some
> kind of miss on the N-th action in a list in HW, we could go to software
> tc, find that action and continue execution from it.  In this case some
> actions are executed in HW and some are in SW.

Precisely. :)

>
> From the user's perspective, if such a tc filter reports an 'in_hw' flag,
> that would be a bit misleading, IMO.

I may be tainted or perhaps even biased here, but I don't see how it
can be misleading. Since we came up with skip_hw/sw I think it is
expected that packets can be handled in both datapaths. The flag is
just saying that hw has this flow. (btw, in_sw is simplified, as sw
always accepts the flow if skip_sw is not used)
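
For reference, a minimal sketch of how this looks from userspace. The
device name below is a placeholder for an offload-capable representor and
the output is abridged; skip_sw asks for hw-only, skip_hw for sw-only,
and with neither flag the filter lives in both datapaths:

 $ tc filter add dev REP0 ingress proto ip flower skip_sw ip_proto tcp \
     action drop
 $ tc -s filter show dev REP0 ingress | grep -E "skip_sw|in_hw"
   skip_sw
   in_hw in_hw_count 1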

>
> If that is not what is happening here, then please ignore my comments,
> as I'm not sure what this code is about then. :)
>
> >
> > But anyway, flows can have some packets matching in sw while also
> > being in hw. That's expected. For example, in more complex flow sets,
> > if a packet hit a flow with ct action and triggered a miss, all
> > subsequent flows will handle this packet in sw. Or if we have queued
> > packets in rx ring already and ovs just updated the datapath, these
> > will match in tc sw instead of going to upcall. The latter will have
> > only a few hits, yes, but the former will be increasing over time.
> > I'm not sure how a new flag, which is probably more informative than
> > an actual state indication, would help here.
>
> These cases are related to just one or a very few packets, so for them
> it's generally fine to report 'in_hw', I think.  The vast majority of
> traffic will be handled in HW.
>
> My thoughts were about a case where we have a lot of traffic handled
> partially in HW and in SW.  Let's say we have N actions and HW doesn't
> support action M.  In this case, the driver may offload actions [0, M - 1],
> inserting some kind of forced "HW miss" at the end, so actions [M, N]
> can be executed in TC software.

Right. Please, let's consider this other scenario then. Consider that we
have these flows:
chain 0,ip,match ip X actions=ct,goto chain 1
chain 1,proto_Y_specific_match actions=ct(nat),goto chain 2
chain 2 actions=output:3

The idea here is that on chain 1, the HW doesn't support that particular
match on proto Y. That flow will never be in_hw, and that's okay. But
the flow on chain 2, though, will be tagged as in_hw, and packets
following this specific sequence will get handled in sw on
chain 2.

But if we have another flow there:
chain 1,proto tcp actions=ct(nat),set_ttl,goto chain 2
which is supported by the hw, such packets would be handled by hw in
chain 2.

The flow on chain 2 has no idea of what was done before it. It can't
be tagged with _PARTIAL, as the actions in there are not expected to
trigger misses; yet, with this flow set, it is expected to handle
packets in both datapaths, despite being 'in_hw'.

I guess what I'm trying to say is that a flow being tagged with in_hw
does not, by itself, mean that sw processing is unexpected.
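
To make that concrete, a rough tc sketch of such a flow set (REP0/REP1
are placeholder devices, and sctp below merely stands in for the proto Y
match that the hw hypothetically cannot offload):

 $ tc filter add dev REP0 ingress chain 0 proto ip flower dst_ip 1.1.1.2 \
     action ct pipe action goto chain 1
 #proto Y filter: not offloadable by the hw, so it stays not_in_hw
 $ tc filter add dev REP0 ingress chain 1 proto ip flower ip_proto sctp \
     action ct nat pipe action goto chain 2
 #tcp filter: offloadable, handled in hw
 $ tc filter add dev REP0 ingress chain 1 proto ip flower ip_proto tcp \
     action ct nat pipe \
     action pedit ex munge ip ttl set 63 pipe action csum ip pipe \
     action goto chain 2
 #in_hw, yet it sees sw traffic via the sctp path and hw traffic via tcp
 $ tc filter add dev REP0 ingress chain 2 proto ip flower \
     action mirred egress redirect dev REP1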

Hopefully this makes sense?

>
> But now I'm not sure if that is possible with the current implementation.

AFAICT you got it all right. It is me who had misunderstood you. :)

>
> >
> >>
> >>>
> >>>>
> >>>>>
> >>>>> So how about I'll propose it in a different series and we continue with this first?
> >>>
> >>> So I'm not sure either on what's the idea here.
> >>>
> >>> Thanks,
> >>> Marcelo
> >>>
> >>>>
> >>>> Sounds fine to me.  Thanks!
> >>>>
> >>>> Best regards, Ilya Maximets.
> >>>>
> >>>
> >>
> >
>
Ilya Maximets Feb. 9, 2023, 12:07 p.m. UTC | #10
On 2/9/23 02:09, Marcelo Leitner wrote:
> On Thu, Feb 09, 2023 at 01:09:21AM +0100, Ilya Maximets wrote:
>>
>> My thoughts were about a case where we have a lot of traffic handled
>> partially in HW and in SW.  Let's say we have N actions and HW doesn't
>> support action M.  In this case, the driver may offload actions [0, M - 1],
>> inserting some kind of forced "HW miss" at the end, so actions [M, N]
>> can be executed in TC software.
> 
> Right. Please, let's consider this other scenario then. Consider that we
> have these flows:
> chain 0,ip,match ip X actions=ct,goto chain 1
> chain 1,proto_Y_specific_match actions=ct(nat),goto chain 2
> chain 2 actions=output:3
> 
> The idea here is that on chain 1, the HW doesn't support that particular
> match on proto Y. That flow will never be in_hw, and that's okay. But
> the flow on chain 2, though, will be tagged as in_hw, and packets
> following this specific sequence will get handled in sw on
> chain 2.
> 
> But if we have another flow there:
> chain 1,proto tcp actions=ct(nat),set_ttl,goto chain 2
> which is supported by the hw, such packets would be handled by hw in
> chain 2.
> 
> The flow on chain 2 has no idea of what was done before it. It can't
> be tagged with _PARTIAL, as the actions in there are not expected to
> trigger misses; yet, with this flow set, it is expected to handle
> packets in both datapaths, despite being 'in_hw'.
> 
> I guess what I'm trying to say is that a flow being tagged with in_hw
> does not, by itself, mean that sw processing is unexpected.
> 
> Hopefully this makes sense?

Yep.  I see your point.  In this case I agree that we can't really tell
if the traffic will be handled in HW or SW, and chain 2 will always be
handled in both.  So, the fact that it is 'in_hw' only means that
the chain is actually in HW, as the HW actually has it.

Summarizing: having something doesn't mean using it. :)  So, thinking
that in_hw flows are actually fully processed in HW is a user's fault. :/

However, going back to my example where HW supports half of the actions
in the chain ([0, M - 1]) and doesn't support the other half ([M, N])...
If the actions M to N are actually not installed into HW, marking the
chain as in_hw is a bit misleading, because unlike your example, not all
the actions are actually in HW and the driver knows that.  For that case,
something like a _PARTIAL suffix might still be useful.
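
As a purely hypothetical illustration of that case with tc (REP0/REP1 are
placeholder devices): take a filter whose action list is only partially
offloadable, say because the hw could handle the pedit/csum/ct prefix but
not sample:

 $ tc filter add dev REP0 ingress proto ip flower ip_proto tcp \
     action pedit ex munge tcp dport set 5001 pipe \
     action csum ip tcp pipe \
     action ct pipe \
     action sample rate 100 group 1 pipe \
     action mirred egress redirect dev REP1

If a driver ever installed only the first three actions and forced a miss
before sample, the filter would still just read in_hw today; that is the
case a _PARTIAL-style flag would make visible.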

Paul Blakey Feb. 9, 2023, 12:40 p.m. UTC | #11
On 09/02/2023 14:07, Ilya Maximets wrote:
> 
> Yep.  I see your point.  In this case I agree that we can't really tell
> if the traffic will be handled in HW or SW, and chain 2 will always be
> handled in both.  So, the fact that it is 'in_hw' only means that
> the chain is actually in HW, as the HW actually has it.
> 
> Summarizing: having something doesn't mean using it. :)  So, thinking
> that in_hw flows are actually fully processed in HW is a user's fault. :/
> 
> However, going back to my example where HW supports half of the actions
> in the chain ([0, M - 1]) and doesn't support the other half ([M, N])...
> If the actions M to N are actually not installed into HW, marking the
> chain as in_hw is a bit misleading, because unlike your example, not all
> the actions are actually in HW and the driver knows that.  For that case,
> something like a _PARTIAL suffix might still be useful.


Right, and for now at least, no one does that :)
