[net-next,0/4] net/sched: cls_api: Support hardware miss to tc action

Message ID	20230112105905.1738-1-paulb@nvidia.com (mailing list archive)
Headers	show Return-Path: <netdev-owner@vger.kernel.org> Received-SPF: Pass (protection.outlook.com: domain of nvidia.com designates 216.228.117.160 as permitted sender) receiver=protection.outlook.com; client-ip=216.228.117.160; helo=mail.nvidia.com; pr=C From: Paul Blakey <paulb@nvidia.com> To: Paul Blakey <paulb@nvidia.com>, <netdev@vger.kernel.org>, Saeed Mahameed <saeedm@nvidia.com>, Paolo Abeni <pabeni@redhat.com>, Jakub Kicinski <kuba@kernel.org>, Eric Dumazet <edumazet@google.com>, Jamal Hadi Salim <jhs@mojatatu.com>, Cong Wang <xiyou.wangcong@gmail.com>, "David S. Miller" <davem@davemloft.net> CC: Oz Shlomo <ozsh@nvidia.com>, Jiri Pirko <jiri@nvidia.com>, Roi Dayan <roid@nvidia.com>, Vlad Buslov <vladbu@nvidia.com> Subject: [PATCH net-next 0/4] net/sched: cls_api: Support hardware miss to tc action Date: Thu, 12 Jan 2023 12:58:59 +0200 Message-ID: <20230112105905.1738-1-paulb@nvidia.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain Precedence: bulk
Series	net/sched: cls_api: Support hardware miss to tc action \| expand [net-next,0/4] net/sched: cls_api: Support hardware miss to tc action [net-next,1/6] net/sched: cls_api: Support hardware miss to tc action [net-next,2/6] net/sched: flower: Move filter handle initialization earlier [net-next,3/6] net/sched: flower: Support hardware miss to tc action [net-next,4/6] net/mlx5: Refactor tc miss handling to a single function [net-next,5/6] net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG [net-next,6/6] net/mlx5: TC, Set CT miss to the specific ct action instance

Paul Blakey Jan. 12, 2023, 10:58 a.m. UTC

Hi,

This series adds support for hardware miss to a specific tc action
instance on a filter's action list. The mlx5 driver patch (besides
the refactors) shows its usage instead of using just chain restore.

This miss to action supports partial offload of a filter's action list,
and and let software continue processing where hardware left off.

Example is the CT action, where new connections need to be handled in
software. And if there is a packet modifying action before the CT action,
then restoring only the chain on misses might cause the rule to not
re-execute the relevant filter in software.

Consider the following scenario:

$ tc filter add dev dev1 ingress chain 0 proto ip flower \
  ct_state -trk dst_mac fe:50:56:26:13:7d \
  action pedit ex munge eth dst aa:bb:cc:dd:ee:01 \
  action ct \
  action goto chain 1
$ tc filter add dev dev1 ingress chain 1 proto ip flower \
  ct_state +trk+est \
  action mirred egress redirect dev ...
$ tc filter add dev dev1 ingress chain 1 proto ip flower \
  ct_state +trk+new \
  action ct commit \
  action mirred egress redirect dev dev2

$ tc filter add dev dev2 ingress chain 0 proto ip flower \
  action ct \
  action mirred egress redirect dev dev1

A packet doing the pedit in hardware (setting dst_mac to aa:bb:cc:dd:ee:01),
missing in the ct action, and restarting in chain 0 in software will fail
matching the original dst_mac in the flower filter on chain 0.

The above scenario is supported in mlx5 driver by reordering the actions
so ct will be done in hardware before the pedit action, but some packet
modifications can't be reordered in regards to the ct action. An example
of that is a modification to the tuple fields (e.g action pedit ex munge ip
dst 1.1.1.1) since it affects the ct action's result (as it does lookup based
on ip).

Paul Blakey (6):
  net/sched: cls_api: Support hardware miss to tc action
  net/sched: flower: Move filter handle initialization earlier
  net/sched: flower: Support hardware miss to tc action
  net/mlx5: Refactor tc miss handling to a single function
  net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
  net/mlx5: TC, Set CT miss to the specific ct action instance

 .../ethernet/mellanox/mlx5/core/en/rep/tc.c   | 225 ++------------
 .../mellanox/mlx5/core/en/tc/sample.c         |   2 +-
 .../ethernet/mellanox/mlx5/core/en/tc_ct.c    |  32 +-
 .../ethernet/mellanox/mlx5/core/en/tc_ct.h    |   2 +
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |   4 +-
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   | 276 ++++++++++++++++--
 .../net/ethernet/mellanox/mlx5/core/en_tc.h   |  21 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |   2 +
 .../mellanox/mlx5/core/lib/fs_chains.c        |  14 +-
 include/linux/skbuff.h                        |   6 +-
 include/net/flow_offload.h                    |   1 +
 include/net/pkt_cls.h                         |  20 +-
 include/net/sch_generic.h                     |   2 +
 net/openvswitch/flow.c                        |   2 +-
 net/sched/act_api.c                           |   2 +-
 net/sched/cls_api.c                           | 208 ++++++++++++-
 net/sched/cls_flower.c                        |  75 +++--
 17 files changed, 580 insertions(+), 314 deletions(-)

Jamal Hadi Salim Jan. 12, 2023, 12:24 p.m. UTC | #1

On Thu, Jan 12, 2023 at 5:59 AM Paul Blakey <paulb@nvidia.com> wrote:
>
> Hi,
>
> This series adds support for hardware miss to a specific tc action
> instance on a filter's action list. The mlx5 driver patch (besides
> the refactors) shows its usage instead of using just chain restore.
>
> This miss to action supports partial offload of a filter's action list,
> and and let software continue processing where hardware left off.
>
> Example is the CT action, where new connections need to be handled in
> software. And if there is a packet modifying action before the CT action,
> then restoring only the chain on misses might cause the rule to not
> re-execute the relevant filter in software.
>
> Consider the following scenario:
>
> $ tc filter add dev dev1 ingress chain 0 proto ip flower \
>   ct_state -trk dst_mac fe:50:56:26:13:7d \
>   action pedit ex munge eth dst aa:bb:cc:dd:ee:01 \
>   action ct \
>   action goto chain 1
> $ tc filter add dev dev1 ingress chain 1 proto ip flower \
>   ct_state +trk+est \
>   action mirred egress redirect dev ...
> $ tc filter add dev dev1 ingress chain 1 proto ip flower \
>   ct_state +trk+new \
>   action ct commit \
>   action mirred egress redirect dev dev2
>
> $ tc filter add dev dev2 ingress chain 0 proto ip flower \
>   action ct \
>   action mirred egress redirect dev dev1
>
> A packet doing the pedit in hardware (setting dst_mac to aa:bb:cc:dd:ee:01),
> missing in the ct action, and restarting in chain 0 in software will fail
> matching the original dst_mac in the flower filter on chain 0.


I had to read that a couple of times and didnt get it until i went and
started looking at the patches and i am not 100% sure still.. IIUC, you have
something like:
match X action A action B action C
You do action A in HW and then you want to continue with B and C in SW.
If that is correct then the cover letter would be an easier read if it
is laid out as
follows:

-"consider the following scenario" (with examples you had above)....
Actually those example were not clear, i think the reader is meant to
assume lack of skip_sw means they are available both in hw and sw.

-Explain your goals and why it wont work (I think you did that well above)

-"and now with the changes in this patchset, the issue can be resolved
with rearranging the policy setup as follows"...

Having said that: Would this have been equally achieved with using skbmark?
The advantage is removing the need to change the cls sw datapath.
The HW doesnt have to support skb mark setting, just driver transform.
i.e the driver can keep track of the action mapping and set the mark when
it receives the packet from hw.
You rearrange your rules to use cls fw and have action B and C  match
an skb mark.

> The above scenario is supported in mlx5 driver by reordering the actions
> so ct will be done in hardware before the pedit action, but some packet
> modifications can't be reordered in regards to the ct action. An example
> of that is a modification to the tuple fields (e.g action pedit ex munge ip
> dst 1.1.1.1) since it affects the ct action's result (as it does lookup based
> on ip).

Also above text wasnt clear. It sounded like the driver would magically
know that it has to continue action B and C in s/w because it knows the
hardware wont be able to exec them?

cheers,
jamal

Paul Blakey Jan. 15, 2023, 12:04 p.m. UTC | #2

On 12/01/2023 14:24, Jamal Hadi Salim wrote:
> On Thu, Jan 12, 2023 at 5:59 AM Paul Blakey <paulb@nvidia.com> wrote:
>>
>> Hi,
>>
>> This series adds support for hardware miss to a specific tc action
>> instance on a filter's action list. The mlx5 driver patch (besides
>> the refactors) shows its usage instead of using just chain restore.
>>
>> This miss to action supports partial offload of a filter's action list,
>> and and let software continue processing where hardware left off.
>>
>> Example is the CT action, where new connections need to be handled in
>> software. And if there is a packet modifying action before the CT action,
>> then restoring only the chain on misses might cause the rule to not
>> re-execute the relevant filter in software.
>>
>> Consider the following scenario:
>>
>> $ tc filter add dev dev1 ingress chain 0 proto ip flower \
>>    ct_state -trk dst_mac fe:50:56:26:13:7d \
>>    action pedit ex munge eth dst aa:bb:cc:dd:ee:01 \
>>    action ct \
>>    action goto chain 1
>> $ tc filter add dev dev1 ingress chain 1 proto ip flower \
>>    ct_state +trk+est \
>>    action mirred egress redirect dev ...
>> $ tc filter add dev dev1 ingress chain 1 proto ip flower \
>>    ct_state +trk+new \
>>    action ct commit \
>>    action mirred egress redirect dev dev2
>>
>> $ tc filter add dev dev2 ingress chain 0 proto ip flower \
>>    action ct \
>>    action mirred egress redirect dev dev1
>>
>> A packet doing the pedit in hardware (setting dst_mac to aa:bb:cc:dd:ee:01),
>> missing in the ct action, and restarting in chain 0 in software will fail
>> matching the original dst_mac in the flower filter on chain 0.
> 
> 
> I had to read that a couple of times and didnt get it until i went and
> started looking at the patches and i am not 100% sure still.. IIUC, you have
> something like:
> match X action A action B action C
> You do action A in HW and then you want to continue with B and C in SW.
> If that is correct then the cover letter would be an easier read if it
> is laid out as
> follows:
> 
> -"consider the following scenario" (with examples you had above)....
> Actually those example were not clear, i think the reader is meant to
> assume lack of skip_sw means they are available both in hw and sw.
> 
> -Explain your goals and why it wont work (I think you did that well above)
> 
> -"and now with the changes in this patchset, the issue can be resolved
> with rearranging the policy setup as follows"...

Sure I can work on that for v2.

> 
> Having said that: Would this have been equally achieved with using skbmark?
> The advantage is removing the need to change the cls sw datapath.
> The HW doesnt have to support skb mark setting, just driver transform.
> i.e the driver can keep track of the action mapping and set the mark when
> it receives the packet from hw.
> You rearrange your rules to use cls fw and have action B and C  match
> an skb mark.
> 

The user would then need to be familiar with how it works for this 
specific vendor, of which actions are supported, and do so for all the 
rules that have such actions. He will also need to add a
skb mark action after each such successful execution that the driver 
should ignore.

Also with this patchset we actually support tc action continue, where 
with cls_fw it wont' be possible.



>> The above scenario is supported in mlx5 driver by reordering the actions
>> so ct will be done in hardware before the pedit action, but some packet
>> modifications can't be reordered in regards to the ct action. An example
>> of that is a modification to the tuple fields (e.g action pedit ex munge ip
>> dst 1.1.1.1) since it affects the ct action's result (as it does lookup based
>> on ip).
> 
> Also above text wasnt clear. It sounded like the driver would magically
> know that it has to continue action B and C in s/w because it knows the
> hardware wont be able to exec them?

It means that if MLX5 driver gets "action A, action CT, action B",
where action CT is only possible in hardware for offloaded established 
connections (offloaded via flow table).

We reorder the actions and parse it as if the action list was: "action 
CT, action A, action B" and then if we miss in action CT we didn't do 
any modifications (actions A/B, and counters) in hardware.

We can only do this if the actions are independent, and so to support
dependent actions (pedit for src ip followed by action CT), we have this 
patchset.



> 
> cheers,
> jamal

Jamal Hadi Salim Jan. 17, 2023, 1:40 p.m. UTC | #3

On Sun, Jan 15, 2023 at 7:04 AM Paul Blakey <paulb@nvidia.com> wrote:
>
>
>
> On 12/01/2023 14:24, Jamal Hadi Salim wrote:
> > On Thu, Jan 12, 2023 at 5:59 AM Paul Blakey <paulb@nvidia.com> wrote:

[..]

> >
> > Having said that: Would this have been equally achieved with using skbmark?
> > The advantage is removing the need to change the cls sw datapath.
> > The HW doesnt have to support skb mark setting, just driver transform.
> > i.e the driver can keep track of the action mapping and set the mark when
> > it receives the packet from hw.
> > You rearrange your rules to use cls fw and have action B and C  match
> > an skb mark.
> >
>
> The user would then need to be familiar with how it works for this
> specific vendor, of which actions are supported, and do so for all the
> rules that have such actions. He will also need to add a
> skb mark action after each such successful execution that the driver
> should ignore.
>

I agree that with your patch it will be operationally simpler. I hope other
vendors will be able to use this feature (and the only reason i am saying
this is because you are making core tc changes).

Question: How does someone adding these rules tell whether some of
the actions are offloaded and some are not? If i am debugging this because
something was wrong I would like to know.

> Also with this patchset we actually support tc action continue, where
> with cls_fw it wont' be possible.

It will be an action continue for a scenario where (on ingress) you have
action A from A,B,C being offloaded and B,C is in s/w - the fw filter
will have the
B,C and flower can have A offloaded.
Yes, someone/thing programming these will have to know that only A can
be offloaded
in that graph.

[..]

> > Also above text wasnt clear. It sounded like the driver would magically
> > know that it has to continue action B and C in s/w because it knows the
> > hardware wont be able to exec them?
>
> It means that if MLX5 driver gets "action A, action CT, action B",
> where action CT is only possible in hardware for offloaded established
> connections (offloaded via flow table).
>
> We reorder the actions and parse it as if the action list was: "action
> CT, action A, action B" and then if we miss in action CT we didn't do
> any modifications (actions A/B, and counters) in hardware.
>
> We can only do this if the actions are independent, and so to support
> dependent actions (pedit for src ip followed by action CT), we have this
> patchset.

Ok, so would this work for the scenario I described above? i.e A,B, C where
A is offloaded but not B, C?

I also think the above text needs to go in the cover letter.

cheers,
jamal

Paul Blakey Jan. 17, 2023, 2:48 p.m. UTC | #4

On 17/01/2023 15:40, Jamal Hadi Salim wrote:
> On Sun, Jan 15, 2023 at 7:04 AM Paul Blakey <paulb@nvidia.com> wrote:
>>
>>
>>
>> On 12/01/2023 14:24, Jamal Hadi Salim wrote:
>>> On Thu, Jan 12, 2023 at 5:59 AM Paul Blakey <paulb@nvidia.com> wrote:
> 
> [..]
> 
>>>
>>> Having said that: Would this have been equally achieved with using skbmark?
>>> The advantage is removing the need to change the cls sw datapath.
>>> The HW doesnt have to support skb mark setting, just driver transform.
>>> i.e the driver can keep track of the action mapping and set the mark when
>>> it receives the packet from hw.
>>> You rearrange your rules to use cls fw and have action B and C  match
>>> an skb mark.
>>>
>>
>> The user would then need to be familiar with how it works for this
>> specific vendor, of which actions are supported, and do so for all the
>> rules that have such actions. He will also need to add a
>> skb mark action after each such successful execution that the driver
>> should ignore.
>>
> 
> I agree that with your patch it will be operationally simpler. I hope other
> vendors will be able to use this feature (and the only reason i am saying
> this is because you are making core tc changes).
> 
> Question: How does someone adding these rules tell whether some of
> the actions are offloaded and some are not? If i am debugging this because
> something was wrong I would like to know.

Currently by looking at the per action hw stats, and if they are 
advancing. This is the same now with filters and the CT action for
new connections (driver reports offload, but it means that only for 
established connections).

> 
>> Also with this patchset we actually support tc action continue, where
>> with cls_fw it wont' be possible.
> 
> It will be an action continue for a scenario where (on ingress) you have
> action A from A,B,C being offloaded and B,C is in s/w - the fw filter
> will have the
> B,C and flower can have A offloaded.
> Yes, someone/thing programming these will have to know that only A can
> be offloaded
> in that graph.

I meant using continue to go to next tc priority "as in "action A action 
continue" but I'm not sure about the actual details of fully supporting 
this as its not the purpose of this patch, but maybe will lead there.



> 
> [..]
> 
>>> Also above text wasnt clear. It sounded like the driver would magically
>>> know that it has to continue action B and C in s/w because it knows the
>>> hardware wont be able to exec them?
>>
>> It means that if MLX5 driver gets "action A, action CT, action B",
>> where action CT is only possible in hardware for offloaded established
>> connections (offloaded via flow table).
>>
>> We reorder the actions and parse it as if the action list was: "action
>> CT, action A, action B" and then if we miss in action CT we didn't do
>> any modifications (actions A/B, and counters) in hardware.
>>
>> We can only do this if the actions are independent, and so to support
>> dependent actions (pedit for src ip followed by action CT), we have this
>> patchset.
> 
> Ok, so would this work for the scenario I described above? i.e A,B, C where
> A is offloaded but not B, C?

You mean the reorder? we reorder the CT action first if all other 
actions are supported, so we do all or nothing.


> 
> I also think the above text needs to go in the cover letter.
> 
> cheers,
> jamal

Jamal Hadi Salim Jan. 18, 2023, 12:54 p.m. UTC | #5

On Tue, Jan 17, 2023 at 9:48 AM Paul Blakey <paulb@nvidia.com> wrote:
>
> On 17/01/2023 15:40, Jamal Hadi Salim wrote:

[..]

> > Question: How does someone adding these rules tell whether some of
> > the actions are offloaded and some are not? If i am debugging this because
> > something was wrong I would like to know.
>
> Currently by looking at the per action hw stats, and if they are
> advancing. This is the same now with filters and the CT action for
> new connections (driver reports offload, but it means that only for
> established connections).

I think that may be sufficient given we use the same technique for
filter offload.
Can you maybe post an example of such a working example in your commit message
with stats?
You showed a candidate scenario that could be sorted but not a running example.

> > It will be an action continue for a scenario where (on ingress) you have
> > action A from A,B,C being offloaded and B,C is in s/w - the fw filter
> > will have the
> > B,C and flower can have A offloaded.
> > Yes, someone/thing programming these will have to know that only A can
> > be offloaded
> > in that graph.
>
> I meant using continue to go to next tc priority "as in "action A action
> continue" but I'm not sure about the actual details of fully supporting
> this as its not the purpose of this patch, but maybe will lead there.

Yeah, that was initially confusing when i read the commit log. It sounded
like action continue == action pipe (because it continues to the next action
in the action graph).
Maybe fix the commit to be clearer.

> > Ok, so would this work for the scenario I described above? i.e A,B, C where
> > A is offloaded but not B, C?
>
> You mean the reorder? we reorder the CT action first if all other
> actions are supported, so we do all or nothing.

Let me give a longer explanation.
The key i believe is understanding the action dependency. In my mind
there are 3 levels of
complexity for assumed ordering of actions A, B, C:

1) The simplest thing is to assume all-or-nothing (which is what we
have done so far in tc);
IOW if not all of A, B, C can be offloaded then we dont offload.

2) next level of complexity is assuming that A MUST occur before B
which MUST occur before C.
Therefore on ingress you can offload part of that graph depending on
your hardware capability.
Example: On ingress A, B offloaded and then "continue" to C in s/w if
your hardware supports
only offloading A and B but not C. You do the reverse of that graph
for egress offload.

3) And your case is even more complex because you have a lot more
knowledge that infact
there is no action dependency and you can offload something in the
middle like B.
So i believe you are solving a harder problem than #2 which is what
was referring to in
my earlier email.

The way these things are typically solved is to have a "dependency"
graph that can be
programmed depending on h/w offload capability and then you can make a decision
whether (even in s/w) to allow A,B,C vs C,A,B for example.

Note: I am not asking for the change - but would be nice to have and I
think over time
generalize. I am not sure how other vendors would implement this today.

Note: if i have something smart in user space - which is what i was
referring to earlier
(with mention of skbmark) I can achieve these goals without any kernel
code changes
but like i said i understand the operational simplicity by putting
things in the kernel.

cheers,
jamal

Paul Blakey Jan. 18, 2023, 9:27 p.m. UTC | #6

On 18/01/2023 14:54, Jamal Hadi Salim wrote:
> On Tue, Jan 17, 2023 at 9:48 AM Paul Blakey <paulb@nvidia.com> wrote:
>>
>> On 17/01/2023 15:40, Jamal Hadi Salim wrote:
> 
> [..]
> 
>>> Question: How does someone adding these rules tell whether some of
>>> the actions are offloaded and some are not? If i am debugging this because
>>> something was wrong I would like to know.
>>
>> Currently by looking at the per action hw stats, and if they are
>> advancing. This is the same now with filters and the CT action for
>> new connections (driver reports offload, but it means that only for
>> established connections).
> 
> I think that may be sufficient given we use the same technique for
> filter offload.
> Can you maybe post an example of such a working example in your commit message
> with stats?
> You showed a candidate scenario that could be sorted but not a running example.
> 

Sure Ill give it as full example in v3.

>>> It will be an action continue for a scenario where (on ingress) you have
>>> action A from A,B,C being offloaded and B,C is in s/w - the fw filter
>>> will have the
>>> B,C and flower can have A offloaded.
>>> Yes, someone/thing programming these will have to know that only A can
>>> be offloaded
>>> in that graph.
>>
>> I meant using continue to go to next tc priority "as in "action A action
>> continue" but I'm not sure about the actual details of fully supporting
>> this as its not the purpose of this patch, but maybe will lead there.
> 
> Yeah, that was initially confusing when i read the commit log. It sounded
> like action continue == action pipe (because it continues to the next action
> in the action graph).
> Maybe fix the commit to be clearer.

I don't think I mentioned it in the cover letter/commits, or did I miss 
it ?

> 
>>> Ok, so would this work for the scenario I described above? i.e A,B, C where
>>> A is offloaded but not B, C?
>>
>> You mean the reorder? we reorder the CT action first if all other
>> actions are supported, so we do all or nothing.
> 
> Let me give a longer explanation.
> The key i believe is understanding the action dependency. In my mind
> there are 3 levels of
> complexity for assumed ordering of actions A, B, C:
> 
> 1) The simplest thing is to assume all-or-nothing (which is what we
> have done so far in tc);
> IOW if not all of A, B, C can be offloaded then we dont offload. >
> 2) next level of complexity is assuming that A MUST occur before B
> which MUST occur before C.
> Therefore on ingress you can offload part of that graph depending on
> your hardware capability.
> Example: On ingress A, B offloaded and then "continue" to C in s/w if
> your hardware supports
> only offloading A and B but not C. You do the reverse of that graph
> for egress offload.

This is actually the case we want support in this patchset.
Assuming a tc filter has action A , action CT, action B.
If hardware finds that it can't do CT action in hardware (for example 
for a new connection), but we already did action A, we want to continue
executing to "action CT, action B" in sw.

We can use it for partial offload of the action list, but for now it 
will be used for supporting tuple rewrite action followed by action ct 
such as in the example in the cover letter.

> 
> 3) And your case is even more complex because you have a lot more
> knowledge that infact
> there is no action dependency and you can offload something in the
> middle like B.
> So i believe you are solving a harder problem than #2 which is what
> was referring to in
> my earlier email.
> 


This is something we currently do but is "transparent" to the user.
If we have action A, action CT, action B, where A/B != CT and A doesn't 
affect CT result (no dependency), we reorder it internally as action CT, 
action A, action B. Then if we can't do CT in hardware, we didn't do A, 
and can continue in sw to re-execute the filter and actions.

If there is no action dependency then let driver take care of the 
details as we currently do we mlx5.


> The way these things are typically solved is to have a "dependency"
> graph that can be
> programmed depending on h/w offload capability and then you can make a decision
> whether (even in s/w) to allow A,B,C vs C,A,B for example.
> 
> Note: I am not asking for the change - but would be nice to have and I
> think over time
> generalize. I am not sure how other vendors would implement this today.
> 
> Note: if i have something smart in user space - which is what i was
> referring to earlier
> (with mention of skbmark) I can achieve these goals without any kernel
> code changes
> but like i said i understand the operational simplicity by putting
> things in the kernel.

Yeah that would work.


Thanks for the comments,
Paul.

>  > cheers,
> jamal

Edward Cree Jan. 20, 2023, 11:54 a.m. UTC | #7

On 17/01/2023 13:40, Jamal Hadi Salim wrote:
> I agree that with your patch it will be operationally simpler. I hope other
> vendors will be able to use this feature (and the only reason i am saying
> this is because you are making core tc changes).
FTR at AMD/sfc we wouldn't use this as our HW has all action execution after
 all match stages in the pipeline (excepting actions that only control match
 behaviour, i.e. ct lookup), so users on ef100 HW (and I'd imagine probably
 some other vendors' products too) would still need to rewrite their rules
 with skbmark.
I mention this because this feature / patch series disconcerts me.  I wasn't
 even really happy about the 'miss to chain' feature, but even more so 'miss
 to action' feels like it makes the TC-driver offload interface more complex
 than it really ought to be.
Especially because the behaviour in some cases is already weird even with a
 fully offloadable rule; consider a match-all filter with 'action vlan push'
 and no further actions (specifically no redirect).  AIUI the HW will push
 the vlan, then deliver to the default destination (e.g. repr if the packet
 came from a VF), at which point TC SW will apply the same rule and perform
 the vlan push again, leading (incorrectly) to a double-tagged packet.
So it's not really about 'miss', there's a more fundamental issue with how
 HW offload and the SW path interact.  And I don't think it's possible to
 guaranteed-remove that issue without a performance cost (skb ext is
 expensive and we don't want it on *every* RX packet), so users will always
 need to consider this sort of thing when writing their TC rules.

TBH doing an IP address pedit before a CT lookup seems like a fairly
 contrived use case to me.  Is there a pressing real-world need for it, or
 are mlx just adding this support because they can?

-ed

Paul Blakey Jan. 22, 2023, 9:33 a.m. UTC | #8

On 20/01/2023 13:54, Edward Cree wrote:
> On 17/01/2023 13:40, Jamal Hadi Salim wrote:
>> I agree that with your patch it will be operationally simpler. I hope other
>> vendors will be able to use this feature (and the only reason i am saying
>> this is because you are making core tc changes).
> FTR at AMD/sfc we wouldn't use this as our HW has all action execution after
>   all match stages in the pipeline (excepting actions that only control match
>   behaviour, i.e. ct lookup), so users on ef100 HW (and I'd imagine probably
>   some other vendors' products too) would still need to rewrite their rules
>   with skbmark.
> I mention this because this feature / patch series disconcerts me.  I wasn't
>   even really happy about the 'miss to chain' feature, but even more so 'miss
>   to action' feels like it makes the TC-driver offload interface more complex
>   than it really ought to be.
> Especially because the behaviour in some cases is already weird even with a
>   fully offloadable rule; consider a match-all filter with 'action vlan push'
>   and no further actions (specifically no redirect).  AIUI the HW will push
>   the vlan, then deliver to the default destination (e.g. repr if the packet
>   came from a VF), at which point TC SW will apply the same rule and perform
>   the vlan push again, leading (incorrectly) to a double-tagged packet.

If I understand correctly, a "action vlan push" which in SW would return 
OK to end classification or CONTINUE to continue to next prio (not sure 
what the default for this action is), then we don't offload such rules, 
as we require a endpoint (goto/drop/mirred), but if we did, I guess we 
would have used this series to point to end of the last run action list 
continuing as software would (returning OK or CONTINUE in the right 
context) or not do the vlan push and jump right before it saving having 
to match again. This might need a small change to allow jumping to the 
end of the action list.

> So it's not really about 'miss', there's a more fundamental issue with how
>   HW offload and the SW path interact.  And I don't think it's possible to
>   guaranteed-remove that issue without a performance cost (skb ext is
>   expensive and we don't want it on *every* RX packet), so users will always
>   need to consider this sort of thing when writing their TC rules.

We have the skb ext only when hardware already did some stuff (offloaded 
something), so what ever it did hopefully pays for using the skb ext, 
and I think this is the case. And of course we strive for full offload 
so this case would be an exception.

> 
> TBH doing an IP address pedit before a CT lookup seems like a fairly
>   contrived use case to me.  Is there a pressing real-world need for it, or
>   are mlx just adding this support because they can?
>  > -ed

Of course, we see when users want to set up stateless NAT outside of 
action CT.

Paul.

Jamal Hadi Salim Jan. 23, 2023, 8 p.m. UTC | #9

On Wed, Jan 18, 2023 at 4:27 PM Paul Blakey <paulb@nvidia.com> wrote:
>
>
>
> On 18/01/2023 14:54, Jamal Hadi Salim wrote:
> > On Tue, Jan 17, 2023 at 9:48 AM Paul Blakey <paulb@nvidia.com> wrote:
> >>
> >> On 17/01/2023 15:40, Jamal Hadi Salim wrote:
> >
> > [..]


> > Can you maybe post an example of such a working example in your commit message
> > with stats?
> > You showed a candidate scenario that could be sorted but not a running example.
> >
>
> Sure Ill give it as full example in v3.

I saw your latest - I meant an example with tc -s as well to show
stats so we can see
hw vs sw stats.



> > Yeah, that was initially confusing when i read the commit log. It sounded
> > like action continue == action pipe (because it continues to the next action
> > in the action graph).
> > Maybe fix the commit to be clearer.
>
> I don't think I mentioned it in the cover letter/commits, or did I miss
> it ?
>

It's clearer in the latest commit.


> > Let me give a longer explanation.
> > The key i believe is understanding the action dependency. In my mind
> > there are 3 levels of
> > complexity for assumed ordering of actions A, B, C:
> >
> > 1) The simplest thing is to assume all-or-nothing (which is what we
> > have done so far in tc);
> > IOW if not all of A, B, C can be offloaded then we dont offload. >
> > 2) next level of complexity is assuming that A MUST occur before B
> > which MUST occur before C.
> > Therefore on ingress you can offload part of that graph depending on
> > your hardware capability.
> > Example: On ingress A, B offloaded and then "continue" to C in s/w if
> > your hardware supports
> > only offloading A and B but not C. You do the reverse of that graph
> > for egress offload.
>
> This is actually the case we want support in this patchset.
> Assuming a tc filter has action A , action CT, action B.
> If hardware finds that it can't do CT action in hardware (for example
> for a new connection), but we already did action A, we want to continue
> executing to "action CT, action B" in sw.
>
> We can use it for partial offload of the action list, but for now it
> will be used for supporting tuple rewrite action followed by action ct
> such as in the example in the cover letter.
>

Ok. Part of the challenge here is also you are assuming that the rule is
both skip_sw and skip_hw which may simplify things but also adds to the
workaround complexity.
If i was a "smart user" i would split this into two: one that has
skip_sw and the
other that has skip_hw.


> > 3) And your case is even more complex because you have a lot more
> > knowledge that infact
> > there is no action dependency and you can offload something in the
> > middle like B.
> > So i believe you are solving a harder problem than #2 which is what
> > was referring to in
> > my earlier email.
> >
>
>
> This is something we currently do but is "transparent" to the user.
> If we have action A, action CT, action B, where A/B != CT and A doesn't
> affect CT result (no dependency), we reorder it internally as action CT,
> action A, action B. Then if we can't do CT in hardware, we didn't do A,
> and can continue in sw to re-execute the filter and actions.
>
> If there is no action dependency then let driver take care of the
> details as we currently do we mlx5.

I think that is a fitting abstraction. The driver with enough knowledge of the
hw can do the necessary transforms.

cheers,
jamal

[net-next,0/4] net/sched: cls_api: Support hardware miss to tc action

Message

Comments