Message ID | 20230112105905.1738-1-paulb@nvidia.com (mailing list archive) |
---|---|
Headers | show |
Series | net/sched: cls_api: Support hardware miss to tc action | expand |
On Thu, Jan 12, 2023 at 5:59 AM Paul Blakey <paulb@nvidia.com> wrote: > > Hi, > > This series adds support for hardware miss to a specific tc action > instance on a filter's action list. The mlx5 driver patch (besides > the refactors) shows its usage instead of using just chain restore. > > This miss to action supports partial offload of a filter's action list, > and and let software continue processing where hardware left off. > > Example is the CT action, where new connections need to be handled in > software. And if there is a packet modifying action before the CT action, > then restoring only the chain on misses might cause the rule to not > re-execute the relevant filter in software. > > Consider the following scenario: > > $ tc filter add dev dev1 ingress chain 0 proto ip flower \ > ct_state -trk dst_mac fe:50:56:26:13:7d \ > action pedit ex munge eth dst aa:bb:cc:dd:ee:01 \ > action ct \ > action goto chain 1 > $ tc filter add dev dev1 ingress chain 1 proto ip flower \ > ct_state +trk+est \ > action mirred egress redirect dev ... > $ tc filter add dev dev1 ingress chain 1 proto ip flower \ > ct_state +trk+new \ > action ct commit \ > action mirred egress redirect dev dev2 > > $ tc filter add dev dev2 ingress chain 0 proto ip flower \ > action ct \ > action mirred egress redirect dev dev1 > > A packet doing the pedit in hardware (setting dst_mac to aa:bb:cc:dd:ee:01), > missing in the ct action, and restarting in chain 0 in software will fail > matching the original dst_mac in the flower filter on chain 0. I had to read that a couple of times and didnt get it until i went and started looking at the patches and i am not 100% sure still.. IIUC, you have something like: match X action A action B action C You do action A in HW and then you want to continue with B and C in SW. If that is correct then the cover letter would be an easier read if it is laid out as follows: -"consider the following scenario" (with examples you had above).... Actually those example were not clear, i think the reader is meant to assume lack of skip_sw means they are available both in hw and sw. -Explain your goals and why it wont work (I think you did that well above) -"and now with the changes in this patchset, the issue can be resolved with rearranging the policy setup as follows"... Having said that: Would this have been equally achieved with using skbmark? The advantage is removing the need to change the cls sw datapath. The HW doesnt have to support skb mark setting, just driver transform. i.e the driver can keep track of the action mapping and set the mark when it receives the packet from hw. You rearrange your rules to use cls fw and have action B and C match an skb mark. > The above scenario is supported in mlx5 driver by reordering the actions > so ct will be done in hardware before the pedit action, but some packet > modifications can't be reordered in regards to the ct action. An example > of that is a modification to the tuple fields (e.g action pedit ex munge ip > dst 1.1.1.1) since it affects the ct action's result (as it does lookup based > on ip). Also above text wasnt clear. It sounded like the driver would magically know that it has to continue action B and C in s/w because it knows the hardware wont be able to exec them? cheers, jamal
On 12/01/2023 14:24, Jamal Hadi Salim wrote: > On Thu, Jan 12, 2023 at 5:59 AM Paul Blakey <paulb@nvidia.com> wrote: >> >> Hi, >> >> This series adds support for hardware miss to a specific tc action >> instance on a filter's action list. The mlx5 driver patch (besides >> the refactors) shows its usage instead of using just chain restore. >> >> This miss to action supports partial offload of a filter's action list, >> and and let software continue processing where hardware left off. >> >> Example is the CT action, where new connections need to be handled in >> software. And if there is a packet modifying action before the CT action, >> then restoring only the chain on misses might cause the rule to not >> re-execute the relevant filter in software. >> >> Consider the following scenario: >> >> $ tc filter add dev dev1 ingress chain 0 proto ip flower \ >> ct_state -trk dst_mac fe:50:56:26:13:7d \ >> action pedit ex munge eth dst aa:bb:cc:dd:ee:01 \ >> action ct \ >> action goto chain 1 >> $ tc filter add dev dev1 ingress chain 1 proto ip flower \ >> ct_state +trk+est \ >> action mirred egress redirect dev ... >> $ tc filter add dev dev1 ingress chain 1 proto ip flower \ >> ct_state +trk+new \ >> action ct commit \ >> action mirred egress redirect dev dev2 >> >> $ tc filter add dev dev2 ingress chain 0 proto ip flower \ >> action ct \ >> action mirred egress redirect dev dev1 >> >> A packet doing the pedit in hardware (setting dst_mac to aa:bb:cc:dd:ee:01), >> missing in the ct action, and restarting in chain 0 in software will fail >> matching the original dst_mac in the flower filter on chain 0. > > > I had to read that a couple of times and didnt get it until i went and > started looking at the patches and i am not 100% sure still.. IIUC, you have > something like: > match X action A action B action C > You do action A in HW and then you want to continue with B and C in SW. > If that is correct then the cover letter would be an easier read if it > is laid out as > follows: > > -"consider the following scenario" (with examples you had above).... > Actually those example were not clear, i think the reader is meant to > assume lack of skip_sw means they are available both in hw and sw. > > -Explain your goals and why it wont work (I think you did that well above) > > -"and now with the changes in this patchset, the issue can be resolved > with rearranging the policy setup as follows"... Sure I can work on that for v2. > > Having said that: Would this have been equally achieved with using skbmark? > The advantage is removing the need to change the cls sw datapath. > The HW doesnt have to support skb mark setting, just driver transform. > i.e the driver can keep track of the action mapping and set the mark when > it receives the packet from hw. > You rearrange your rules to use cls fw and have action B and C match > an skb mark. > The user would then need to be familiar with how it works for this specific vendor, of which actions are supported, and do so for all the rules that have such actions. He will also need to add a skb mark action after each such successful execution that the driver should ignore. Also with this patchset we actually support tc action continue, where with cls_fw it wont' be possible. >> The above scenario is supported in mlx5 driver by reordering the actions >> so ct will be done in hardware before the pedit action, but some packet >> modifications can't be reordered in regards to the ct action. An example >> of that is a modification to the tuple fields (e.g action pedit ex munge ip >> dst 1.1.1.1) since it affects the ct action's result (as it does lookup based >> on ip). > > Also above text wasnt clear. It sounded like the driver would magically > know that it has to continue action B and C in s/w because it knows the > hardware wont be able to exec them? It means that if MLX5 driver gets "action A, action CT, action B", where action CT is only possible in hardware for offloaded established connections (offloaded via flow table). We reorder the actions and parse it as if the action list was: "action CT, action A, action B" and then if we miss in action CT we didn't do any modifications (actions A/B, and counters) in hardware. We can only do this if the actions are independent, and so to support dependent actions (pedit for src ip followed by action CT), we have this patchset. > > cheers, > jamal
On Sun, Jan 15, 2023 at 7:04 AM Paul Blakey <paulb@nvidia.com> wrote: > > > > On 12/01/2023 14:24, Jamal Hadi Salim wrote: > > On Thu, Jan 12, 2023 at 5:59 AM Paul Blakey <paulb@nvidia.com> wrote: [..] > > > > Having said that: Would this have been equally achieved with using skbmark? > > The advantage is removing the need to change the cls sw datapath. > > The HW doesnt have to support skb mark setting, just driver transform. > > i.e the driver can keep track of the action mapping and set the mark when > > it receives the packet from hw. > > You rearrange your rules to use cls fw and have action B and C match > > an skb mark. > > > > The user would then need to be familiar with how it works for this > specific vendor, of which actions are supported, and do so for all the > rules that have such actions. He will also need to add a > skb mark action after each such successful execution that the driver > should ignore. > I agree that with your patch it will be operationally simpler. I hope other vendors will be able to use this feature (and the only reason i am saying this is because you are making core tc changes). Question: How does someone adding these rules tell whether some of the actions are offloaded and some are not? If i am debugging this because something was wrong I would like to know. > Also with this patchset we actually support tc action continue, where > with cls_fw it wont' be possible. It will be an action continue for a scenario where (on ingress) you have action A from A,B,C being offloaded and B,C is in s/w - the fw filter will have the B,C and flower can have A offloaded. Yes, someone/thing programming these will have to know that only A can be offloaded in that graph. [..] > > Also above text wasnt clear. It sounded like the driver would magically > > know that it has to continue action B and C in s/w because it knows the > > hardware wont be able to exec them? > > It means that if MLX5 driver gets "action A, action CT, action B", > where action CT is only possible in hardware for offloaded established > connections (offloaded via flow table). > > We reorder the actions and parse it as if the action list was: "action > CT, action A, action B" and then if we miss in action CT we didn't do > any modifications (actions A/B, and counters) in hardware. > > We can only do this if the actions are independent, and so to support > dependent actions (pedit for src ip followed by action CT), we have this > patchset. Ok, so would this work for the scenario I described above? i.e A,B, C where A is offloaded but not B, C? I also think the above text needs to go in the cover letter. cheers, jamal
On 17/01/2023 15:40, Jamal Hadi Salim wrote: > On Sun, Jan 15, 2023 at 7:04 AM Paul Blakey <paulb@nvidia.com> wrote: >> >> >> >> On 12/01/2023 14:24, Jamal Hadi Salim wrote: >>> On Thu, Jan 12, 2023 at 5:59 AM Paul Blakey <paulb@nvidia.com> wrote: > > [..] > >>> >>> Having said that: Would this have been equally achieved with using skbmark? >>> The advantage is removing the need to change the cls sw datapath. >>> The HW doesnt have to support skb mark setting, just driver transform. >>> i.e the driver can keep track of the action mapping and set the mark when >>> it receives the packet from hw. >>> You rearrange your rules to use cls fw and have action B and C match >>> an skb mark. >>> >> >> The user would then need to be familiar with how it works for this >> specific vendor, of which actions are supported, and do so for all the >> rules that have such actions. He will also need to add a >> skb mark action after each such successful execution that the driver >> should ignore. >> > > I agree that with your patch it will be operationally simpler. I hope other > vendors will be able to use this feature (and the only reason i am saying > this is because you are making core tc changes). > > Question: How does someone adding these rules tell whether some of > the actions are offloaded and some are not? If i am debugging this because > something was wrong I would like to know. Currently by looking at the per action hw stats, and if they are advancing. This is the same now with filters and the CT action for new connections (driver reports offload, but it means that only for established connections). > >> Also with this patchset we actually support tc action continue, where >> with cls_fw it wont' be possible. > > It will be an action continue for a scenario where (on ingress) you have > action A from A,B,C being offloaded and B,C is in s/w - the fw filter > will have the > B,C and flower can have A offloaded. > Yes, someone/thing programming these will have to know that only A can > be offloaded > in that graph. I meant using continue to go to next tc priority "as in "action A action continue" but I'm not sure about the actual details of fully supporting this as its not the purpose of this patch, but maybe will lead there. > > [..] > >>> Also above text wasnt clear. It sounded like the driver would magically >>> know that it has to continue action B and C in s/w because it knows the >>> hardware wont be able to exec them? >> >> It means that if MLX5 driver gets "action A, action CT, action B", >> where action CT is only possible in hardware for offloaded established >> connections (offloaded via flow table). >> >> We reorder the actions and parse it as if the action list was: "action >> CT, action A, action B" and then if we miss in action CT we didn't do >> any modifications (actions A/B, and counters) in hardware. >> >> We can only do this if the actions are independent, and so to support >> dependent actions (pedit for src ip followed by action CT), we have this >> patchset. > > Ok, so would this work for the scenario I described above? i.e A,B, C where > A is offloaded but not B, C? You mean the reorder? we reorder the CT action first if all other actions are supported, so we do all or nothing. > > I also think the above text needs to go in the cover letter. > > cheers, > jamal
On Tue, Jan 17, 2023 at 9:48 AM Paul Blakey <paulb@nvidia.com> wrote: > > On 17/01/2023 15:40, Jamal Hadi Salim wrote: [..] > > Question: How does someone adding these rules tell whether some of > > the actions are offloaded and some are not? If i am debugging this because > > something was wrong I would like to know. > > Currently by looking at the per action hw stats, and if they are > advancing. This is the same now with filters and the CT action for > new connections (driver reports offload, but it means that only for > established connections). I think that may be sufficient given we use the same technique for filter offload. Can you maybe post an example of such a working example in your commit message with stats? You showed a candidate scenario that could be sorted but not a running example. > > It will be an action continue for a scenario where (on ingress) you have > > action A from A,B,C being offloaded and B,C is in s/w - the fw filter > > will have the > > B,C and flower can have A offloaded. > > Yes, someone/thing programming these will have to know that only A can > > be offloaded > > in that graph. > > I meant using continue to go to next tc priority "as in "action A action > continue" but I'm not sure about the actual details of fully supporting > this as its not the purpose of this patch, but maybe will lead there. Yeah, that was initially confusing when i read the commit log. It sounded like action continue == action pipe (because it continues to the next action in the action graph). Maybe fix the commit to be clearer. > > Ok, so would this work for the scenario I described above? i.e A,B, C where > > A is offloaded but not B, C? > > You mean the reorder? we reorder the CT action first if all other > actions are supported, so we do all or nothing. Let me give a longer explanation. The key i believe is understanding the action dependency. In my mind there are 3 levels of complexity for assumed ordering of actions A, B, C: 1) The simplest thing is to assume all-or-nothing (which is what we have done so far in tc); IOW if not all of A, B, C can be offloaded then we dont offload. 2) next level of complexity is assuming that A MUST occur before B which MUST occur before C. Therefore on ingress you can offload part of that graph depending on your hardware capability. Example: On ingress A, B offloaded and then "continue" to C in s/w if your hardware supports only offloading A and B but not C. You do the reverse of that graph for egress offload. 3) And your case is even more complex because you have a lot more knowledge that infact there is no action dependency and you can offload something in the middle like B. So i believe you are solving a harder problem than #2 which is what was referring to in my earlier email. The way these things are typically solved is to have a "dependency" graph that can be programmed depending on h/w offload capability and then you can make a decision whether (even in s/w) to allow A,B,C vs C,A,B for example. Note: I am not asking for the change - but would be nice to have and I think over time generalize. I am not sure how other vendors would implement this today. Note: if i have something smart in user space - which is what i was referring to earlier (with mention of skbmark) I can achieve these goals without any kernel code changes but like i said i understand the operational simplicity by putting things in the kernel. cheers, jamal
On 18/01/2023 14:54, Jamal Hadi Salim wrote: > On Tue, Jan 17, 2023 at 9:48 AM Paul Blakey <paulb@nvidia.com> wrote: >> >> On 17/01/2023 15:40, Jamal Hadi Salim wrote: > > [..] > >>> Question: How does someone adding these rules tell whether some of >>> the actions are offloaded and some are not? If i am debugging this because >>> something was wrong I would like to know. >> >> Currently by looking at the per action hw stats, and if they are >> advancing. This is the same now with filters and the CT action for >> new connections (driver reports offload, but it means that only for >> established connections). > > I think that may be sufficient given we use the same technique for > filter offload. > Can you maybe post an example of such a working example in your commit message > with stats? > You showed a candidate scenario that could be sorted but not a running example. > Sure Ill give it as full example in v3. >>> It will be an action continue for a scenario where (on ingress) you have >>> action A from A,B,C being offloaded and B,C is in s/w - the fw filter >>> will have the >>> B,C and flower can have A offloaded. >>> Yes, someone/thing programming these will have to know that only A can >>> be offloaded >>> in that graph. >> >> I meant using continue to go to next tc priority "as in "action A action >> continue" but I'm not sure about the actual details of fully supporting >> this as its not the purpose of this patch, but maybe will lead there. > > Yeah, that was initially confusing when i read the commit log. It sounded > like action continue == action pipe (because it continues to the next action > in the action graph). > Maybe fix the commit to be clearer. I don't think I mentioned it in the cover letter/commits, or did I miss it ? > >>> Ok, so would this work for the scenario I described above? i.e A,B, C where >>> A is offloaded but not B, C? >> >> You mean the reorder? we reorder the CT action first if all other >> actions are supported, so we do all or nothing. > > Let me give a longer explanation. > The key i believe is understanding the action dependency. In my mind > there are 3 levels of > complexity for assumed ordering of actions A, B, C: > > 1) The simplest thing is to assume all-or-nothing (which is what we > have done so far in tc); > IOW if not all of A, B, C can be offloaded then we dont offload. > > 2) next level of complexity is assuming that A MUST occur before B > which MUST occur before C. > Therefore on ingress you can offload part of that graph depending on > your hardware capability. > Example: On ingress A, B offloaded and then "continue" to C in s/w if > your hardware supports > only offloading A and B but not C. You do the reverse of that graph > for egress offload. This is actually the case we want support in this patchset. Assuming a tc filter has action A , action CT, action B. If hardware finds that it can't do CT action in hardware (for example for a new connection), but we already did action A, we want to continue executing to "action CT, action B" in sw. We can use it for partial offload of the action list, but for now it will be used for supporting tuple rewrite action followed by action ct such as in the example in the cover letter. > > 3) And your case is even more complex because you have a lot more > knowledge that infact > there is no action dependency and you can offload something in the > middle like B. > So i believe you are solving a harder problem than #2 which is what > was referring to in > my earlier email. > This is something we currently do but is "transparent" to the user. If we have action A, action CT, action B, where A/B != CT and A doesn't affect CT result (no dependency), we reorder it internally as action CT, action A, action B. Then if we can't do CT in hardware, we didn't do A, and can continue in sw to re-execute the filter and actions. If there is no action dependency then let driver take care of the details as we currently do we mlx5. > The way these things are typically solved is to have a "dependency" > graph that can be > programmed depending on h/w offload capability and then you can make a decision > whether (even in s/w) to allow A,B,C vs C,A,B for example. > > Note: I am not asking for the change - but would be nice to have and I > think over time > generalize. I am not sure how other vendors would implement this today. > > Note: if i have something smart in user space - which is what i was > referring to earlier > (with mention of skbmark) I can achieve these goals without any kernel > code changes > but like i said i understand the operational simplicity by putting > things in the kernel. Yeah that would work. Thanks for the comments, Paul. > > cheers, > jamal
On 17/01/2023 13:40, Jamal Hadi Salim wrote: > I agree that with your patch it will be operationally simpler. I hope other > vendors will be able to use this feature (and the only reason i am saying > this is because you are making core tc changes). FTR at AMD/sfc we wouldn't use this as our HW has all action execution after all match stages in the pipeline (excepting actions that only control match behaviour, i.e. ct lookup), so users on ef100 HW (and I'd imagine probably some other vendors' products too) would still need to rewrite their rules with skbmark. I mention this because this feature / patch series disconcerts me. I wasn't even really happy about the 'miss to chain' feature, but even more so 'miss to action' feels like it makes the TC-driver offload interface more complex than it really ought to be. Especially because the behaviour in some cases is already weird even with a fully offloadable rule; consider a match-all filter with 'action vlan push' and no further actions (specifically no redirect). AIUI the HW will push the vlan, then deliver to the default destination (e.g. repr if the packet came from a VF), at which point TC SW will apply the same rule and perform the vlan push again, leading (incorrectly) to a double-tagged packet. So it's not really about 'miss', there's a more fundamental issue with how HW offload and the SW path interact. And I don't think it's possible to guaranteed-remove that issue without a performance cost (skb ext is expensive and we don't want it on *every* RX packet), so users will always need to consider this sort of thing when writing their TC rules. TBH doing an IP address pedit before a CT lookup seems like a fairly contrived use case to me. Is there a pressing real-world need for it, or are mlx just adding this support because they can? -ed
On 20/01/2023 13:54, Edward Cree wrote: > On 17/01/2023 13:40, Jamal Hadi Salim wrote: >> I agree that with your patch it will be operationally simpler. I hope other >> vendors will be able to use this feature (and the only reason i am saying >> this is because you are making core tc changes). > FTR at AMD/sfc we wouldn't use this as our HW has all action execution after > all match stages in the pipeline (excepting actions that only control match > behaviour, i.e. ct lookup), so users on ef100 HW (and I'd imagine probably > some other vendors' products too) would still need to rewrite their rules > with skbmark. > I mention this because this feature / patch series disconcerts me. I wasn't > even really happy about the 'miss to chain' feature, but even more so 'miss > to action' feels like it makes the TC-driver offload interface more complex > than it really ought to be. > Especially because the behaviour in some cases is already weird even with a > fully offloadable rule; consider a match-all filter with 'action vlan push' > and no further actions (specifically no redirect). AIUI the HW will push > the vlan, then deliver to the default destination (e.g. repr if the packet > came from a VF), at which point TC SW will apply the same rule and perform > the vlan push again, leading (incorrectly) to a double-tagged packet. If I understand correctly, a "action vlan push" which in SW would return OK to end classification or CONTINUE to continue to next prio (not sure what the default for this action is), then we don't offload such rules, as we require a endpoint (goto/drop/mirred), but if we did, I guess we would have used this series to point to end of the last run action list continuing as software would (returning OK or CONTINUE in the right context) or not do the vlan push and jump right before it saving having to match again. This might need a small change to allow jumping to the end of the action list. > So it's not really about 'miss', there's a more fundamental issue with how > HW offload and the SW path interact. And I don't think it's possible to > guaranteed-remove that issue without a performance cost (skb ext is > expensive and we don't want it on *every* RX packet), so users will always > need to consider this sort of thing when writing their TC rules. We have the skb ext only when hardware already did some stuff (offloaded something), so what ever it did hopefully pays for using the skb ext, and I think this is the case. And of course we strive for full offload so this case would be an exception. > > TBH doing an IP address pedit before a CT lookup seems like a fairly > contrived use case to me. Is there a pressing real-world need for it, or > are mlx just adding this support because they can? > > -ed Of course, we see when users want to set up stateless NAT outside of action CT. Paul.
On Wed, Jan 18, 2023 at 4:27 PM Paul Blakey <paulb@nvidia.com> wrote: > > > > On 18/01/2023 14:54, Jamal Hadi Salim wrote: > > On Tue, Jan 17, 2023 at 9:48 AM Paul Blakey <paulb@nvidia.com> wrote: > >> > >> On 17/01/2023 15:40, Jamal Hadi Salim wrote: > > > > [..] > > Can you maybe post an example of such a working example in your commit message > > with stats? > > You showed a candidate scenario that could be sorted but not a running example. > > > > Sure Ill give it as full example in v3. I saw your latest - I meant an example with tc -s as well to show stats so we can see hw vs sw stats. > > Yeah, that was initially confusing when i read the commit log. It sounded > > like action continue == action pipe (because it continues to the next action > > in the action graph). > > Maybe fix the commit to be clearer. > > I don't think I mentioned it in the cover letter/commits, or did I miss > it ? > It's clearer in the latest commit. > > Let me give a longer explanation. > > The key i believe is understanding the action dependency. In my mind > > there are 3 levels of > > complexity for assumed ordering of actions A, B, C: > > > > 1) The simplest thing is to assume all-or-nothing (which is what we > > have done so far in tc); > > IOW if not all of A, B, C can be offloaded then we dont offload. > > > 2) next level of complexity is assuming that A MUST occur before B > > which MUST occur before C. > > Therefore on ingress you can offload part of that graph depending on > > your hardware capability. > > Example: On ingress A, B offloaded and then "continue" to C in s/w if > > your hardware supports > > only offloading A and B but not C. You do the reverse of that graph > > for egress offload. > > This is actually the case we want support in this patchset. > Assuming a tc filter has action A , action CT, action B. > If hardware finds that it can't do CT action in hardware (for example > for a new connection), but we already did action A, we want to continue > executing to "action CT, action B" in sw. > > We can use it for partial offload of the action list, but for now it > will be used for supporting tuple rewrite action followed by action ct > such as in the example in the cover letter. > Ok. Part of the challenge here is also you are assuming that the rule is both skip_sw and skip_hw which may simplify things but also adds to the workaround complexity. If i was a "smart user" i would split this into two: one that has skip_sw and the other that has skip_hw. > > 3) And your case is even more complex because you have a lot more > > knowledge that infact > > there is no action dependency and you can offload something in the > > middle like B. > > So i believe you are solving a harder problem than #2 which is what > > was referring to in > > my earlier email. > > > > > This is something we currently do but is "transparent" to the user. > If we have action A, action CT, action B, where A/B != CT and A doesn't > affect CT result (no dependency), we reorder it internally as action CT, > action A, action B. Then if we can't do CT in hardware, we didn't do A, > and can continue in sw to re-execute the filter and actions. > > If there is no action dependency then let driver take care of the > details as we currently do we mlx5. I think that is a fitting abstraction. The driver with enough knowledge of the hw can do the necessary transforms. cheers, jamal