
[net-next,v3,4/7] netfilter: flowtable: allow updating offloaded rules asynchronously

Message ID 20230119195104.3371966-5-vladbu@nvidia.com
State Superseded
Delegated to: Netdev Maintainers
Series Allow offloading of UDP NEW connections via act_ct

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 61 this patch: 61
netdev/cc_maintainers warning 4 maintainers not CCed: fw@strlen.de edumazet@google.com kadlec@netfilter.org coreteam@netfilter.org
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 61 this patch: 61
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 76 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Vlad Buslov Jan. 19, 2023, 7:51 p.m. UTC
Following patches in the series need to update the flowtable rule several
times during its lifetime in order to synchronize hardware offload with the
actual ct status. However, reusing the existing 'refresh' logic in act_ct
would cause the data path to potentially schedule a significant amount of
spurious tasks in the 'add' workqueue, since it is executed per packet.
Instead, introduce a new flow 'update' flag and use it to schedule an async
flow refresh in the flowtable gc, which will only be executed once per gc
iteration.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
---
 include/net/netfilter/nf_flow_table.h |  3 ++-
 net/netfilter/nf_flow_table_core.c    | 20 +++++++++++++++-----
 net/netfilter/nf_flow_table_offload.c |  5 +++--
 3 files changed, 20 insertions(+), 8 deletions(-)
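
For context, a minimal sketch of how a datapath caller could request an async
update under this scheme (hypothetical illustration only; the actual act_ct
changes come in later patches of this series):

/* Hypothetical sketch, not part of this patch: instead of queueing
 * 'add' work from the per-packet path, mark the flow and let the
 * flowtable gc schedule a single async refresh per gc iteration.
 */
static void flow_offload_request_update(struct flow_offload *flow)
{
	/* Atomic bit set, safe from the packet path. The gc clears the
	 * bit in nf_flow_offload_gc_step() before refreshing, and
	 * re-sets it if scheduling the work fails.
	 */
	set_bit(NF_FLOW_HW_UPDATE, &flow->flags);
}

The gc-side handling of this bit is in the nf_flow_table_core.c hunk below.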

Comments

Pablo Neira Ayuso Jan. 20, 2023, 11:41 a.m. UTC | #1
Hi Vlad,

On Thu, Jan 19, 2023 at 08:51:01PM +0100, Vlad Buslov wrote:
> Following patches in the series need to update the flowtable rule several
> times during its lifetime in order to synchronize hardware offload with the
> actual ct status. However, reusing the existing 'refresh' logic in act_ct
> would cause the data path to potentially schedule a significant amount of
> spurious tasks in the 'add' workqueue, since it is executed per packet.
> Instead, introduce a new flow 'update' flag and use it to schedule an async
> flow refresh in the flowtable gc, which will only be executed once per gc
> iteration.

So the idea is to use an NF_FLOW_HW_UPDATE flag which triggers the
update from the garbage collector. I understand the motivation here is
to avoid adding more work to the workqueue, by simply letting the gc
thread pick up the update.

Last year I already proposed alternative approaches to improve the
workqueue logic, including cancellation of useless work. For example,
cancel an in-flight "add" work item if a "delete" arrives while the
work is still sitting in the queue. The same approach could be used for
this update logic, i.e. cancel a unidirectional UDP "add", or upgrade
it to bidirectional, if the work is still sitting in the queue by the
time we see traffic in both directions.

I am sorry to say, but it seems to me this approach based on flags is
pushing the existing design to the limit. The flag semantics are
already so overloaded that this just makes the state machine behind the
flag logic more complicated. I really think we should explore better
strategies for processing the offload work.
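
Something along these lines (untested sketch: the flow entry does not
currently keep a pointer to its pending work item, so the flow->hw_work
field and the helper name below are made up):

/* Hypothetical sketch of the cancellation idea: if a "delete" arrives
 * while the "add" work item is still sitting in the queue, cancel it
 * instead of offloading a flow that is already gone.
 */
static void nf_flow_offload_del_or_cancel(struct nf_flowtable *flowtable,
					  struct flow_offload *flow)
{
	struct flow_offload_work *offload = READ_ONCE(flow->hw_work);

	/* cancel_work_sync() returns true if the work was still queued
	 * and never ran, i.e. the "add" never reached the hardware.
	 */
	if (offload && offload->cmd == FLOW_CLS_REPLACE &&
	    cancel_work_sync(&offload->work)) {
		clear_bit(NF_FLOW_HW_PENDING, &flow->flags);
		kfree(offload);
		return;
	}

	/* The "add" already ran (or was never queued), so queue a
	 * regular "delete" to remove the rule from hardware.
	 */
	nf_flow_offload_del(flowtable, flow);
}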

Thanks.
Vlad Buslov Jan. 24, 2023, 7:06 a.m. UTC | #2
On Fri 20 Jan 2023 at 12:41, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> Hi Vlad,
>
> On Thu, Jan 19, 2023 at 08:51:01PM +0100, Vlad Buslov wrote:
>> Following patches in the series need to update the flowtable rule several
>> times during its lifetime in order to synchronize hardware offload with the
>> actual ct status. However, reusing the existing 'refresh' logic in act_ct
>> would cause the data path to potentially schedule a significant amount of
>> spurious tasks in the 'add' workqueue, since it is executed per packet.
>> Instead, introduce a new flow 'update' flag and use it to schedule an async
>> flow refresh in the flowtable gc, which will only be executed once per gc
>> iteration.
>
> So the idea is to use an NF_FLOW_HW_UPDATE flag which triggers the
> update from the garbage collector. I understand the motivation here is
> to avoid adding more work to the workqueue, by simply letting the gc
> thread pick up the update.
>
> Last year I already proposed alternative approaches to improve the
> workqueue logic, including cancellation of useless work. For example,
> cancel an in-flight "add" work item if a "delete" arrives while the
> work is still sitting in the queue. The same approach could be used for
> this update logic, i.e. cancel a unidirectional UDP "add", or upgrade
> it to bidirectional, if the work is still sitting in the queue by the
> time we see traffic in both directions.

Thanks for the suggestion. I'll try to make this work over regular
workqueues without further extending the flow flags and/or putting more
stuff into gc.

>
> I am sorry to say, but it seems to me this approach based on flags is
> pushing the existing design to the limit. The flag semantics are
> already so overloaded that this just makes the state machine behind the
> flag logic more complicated. I really think we should explore better
> strategies for processing the offload work.

Got it.
Pablo Neira Ayuso Jan. 24, 2023, 8:41 a.m. UTC | #3
Hi Vlad,

On Tue, Jan 24, 2023 at 09:06:13AM +0200, Vlad Buslov wrote:
> 
> On Fri 20 Jan 2023 at 12:41, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > Hi Vlad,
> >
> > On Thu, Jan 19, 2023 at 08:51:01PM +0100, Vlad Buslov wrote:
> >> Following patches in the series need to update the flowtable rule several
> >> times during its lifetime in order to synchronize hardware offload with the
> >> actual ct status. However, reusing the existing 'refresh' logic in act_ct
> >> would cause the data path to potentially schedule a significant amount of
> >> spurious tasks in the 'add' workqueue, since it is executed per packet.
> >> Instead, introduce a new flow 'update' flag and use it to schedule an async
> >> flow refresh in the flowtable gc, which will only be executed once per gc
> >> iteration.
> >
> > So the idea is to use an NF_FLOW_HW_UPDATE flag which triggers the
> > update from the garbage collector. I understand the motivation here is
> > to avoid adding more work to the workqueue, by simply letting the gc
> > thread pick up the update.
> >
> > Last year I already proposed alternative approaches to improve the
> > workqueue logic, including cancellation of useless work. For example,
> > cancel an in-flight "add" work item if a "delete" arrives while the
> > work is still sitting in the queue. The same approach could be used for
> > this update logic, i.e. cancel a unidirectional UDP "add", or upgrade
> > it to bidirectional, if the work is still sitting in the queue by the
> > time we see traffic in both directions.
> 
> Thanks for the suggestion. I'll try to make this work over regular
> workqueues without further extending the flow flags and/or putting more
> stuff into gc.

Let me make a second pass to sort out my thoughts on this.

Either we use regular workqueues (without new flags) or we explore
fully consolidating this hardware offload workqueue infrastructure
around flags, i.e. use flags not only for update events, but also for
new and delete.

This would go more in the direction of your _UPDATE flag idea:

- Update the hardware offload workqueue to iterate over the
  flowtable. The hardware offload workqueue would be "scanning" for
  entries in the flowtable that require some sort of update in the
  hardware. The flags would tell what kind of action is needed.

- Add these flags:

NF_FLOW_HW_NEW
NF_FLOW_HW_UPDATE
NF_FLOW_HW_DELETE

and remove the work object (flow_offload_work) and the existing list.
If the workqueue finds an entry with:

NEW|DELETE, this means this is a short-lived flow, not worth wasting
cycles to offload it.
NEW|UPDATE, this means this is a UDP flow that is bidirectional.

Then there will be no more work allocation and no more in-flight work
objects headed to the hardware offload workqueue. Instead, the hardware
offload workqueue will be iterating.

This approach would need _DONE flags to annotate whether the offload
updates have been applied to hardware already (similar to the
conntrack _DONE flags).

(Oh well, this proposal is adding even more flags. But I think flags
are not the issue; the issue is mixing the existing flow_offload_work
approach with this new _UPDATE flag and the gc changes.)

If flow_offload_work is removed, we would also need to add a:

 struct nf_flowtable *flowtable;

field to the flow_offload entry; this field is currently passed via
flow_offload_work. So it is one extra field for each flow_offload
entry.
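
A rough sketch of what that scanning step could look like (hypothetical:
the nf_flow_offload_hw_*() helpers do not exist, and the _DONE
bookkeeping is omitted):

/* Hypothetical sketch of the flag-driven scan: the offload workqueue
 * walks the flowtable entries and acts on the state flags instead of
 * consuming per-event flow_offload_work objects.
 */
static void nf_flow_offload_scan_step(struct nf_flowtable *flow_table,
				      struct flow_offload *flow, void *data)
{
	if (test_bit(NF_FLOW_HW_NEW, &flow->flags) &&
	    test_bit(NF_FLOW_HW_DELETE, &flow->flags)) {
		/* NEW|DELETE: short-lived flow, never offload it. */
		clear_bit(NF_FLOW_HW_NEW, &flow->flags);
		clear_bit(NF_FLOW_HW_DELETE, &flow->flags);
		return;
	}

	if (test_and_clear_bit(NF_FLOW_HW_NEW, &flow->flags))
		nf_flow_offload_hw_add(flow_table, flow);
	else if (test_and_clear_bit(NF_FLOW_HW_UPDATE, &flow->flags))
		nf_flow_offload_hw_update(flow_table, flow);
	else if (test_and_clear_bit(NF_FLOW_HW_DELETE, &flow->flags))
		nf_flow_offload_hw_del(flow_table, flow);
}

The workqueue would then run this over every entry, e.g. via something
like the existing nf_flow_table_iterate() walker, instead of draining a
list of work objects.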

The other alternative is to use the existing nf_flow_offload_add_wq
with an UPDATE command, which might in turn result in more in-flight
work objects. I think this is what you are trying to avoid with the
_UPDATE flag approach.

Thanks.
Vlad Buslov Jan. 24, 2023, 9:19 a.m. UTC | #4
On Tue 24 Jan 2023 at 09:41, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> Hi Vlad,
>
> On Tue, Jan 24, 2023 at 09:06:13AM +0200, Vlad Buslov wrote:
>> 
>> On Fri 20 Jan 2023 at 12:41, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>> > Hi Vlad,
>> >
>> > On Thu, Jan 19, 2023 at 08:51:01PM +0100, Vlad Buslov wrote:
>> >> Following patches in the series need to update the flowtable rule several
>> >> times during its lifetime in order to synchronize hardware offload with the
>> >> actual ct status. However, reusing the existing 'refresh' logic in act_ct
>> >> would cause the data path to potentially schedule a significant amount of
>> >> spurious tasks in the 'add' workqueue, since it is executed per packet.
>> >> Instead, introduce a new flow 'update' flag and use it to schedule an async
>> >> flow refresh in the flowtable gc, which will only be executed once per gc
>> >> iteration.
>> >
>> > So the idea is to use an NF_FLOW_HW_UPDATE flag which triggers the
>> > update from the garbage collector. I understand the motivation here is
>> > to avoid adding more work to the workqueue, by simply letting the gc
>> > thread pick up the update.
>> >
>> > Last year I already proposed alternative approaches to improve the
>> > workqueue logic, including cancellation of useless work. For example,
>> > cancel an in-flight "add" work item if a "delete" arrives while the
>> > work is still sitting in the queue. The same approach could be used for
>> > this update logic, i.e. cancel a unidirectional UDP "add", or upgrade
>> > it to bidirectional, if the work is still sitting in the queue by the
>> > time we see traffic in both directions.
>> 
>> Thanks for the suggestion. I'll try to make this work over regular
>> workqueues without further extending the flow flags and/or putting more
>> stuff into gc.
>
> Let me make a second pass to sort out my thoughts on this.
>
> Either we use regular workqueues (without new flags) or we explore
> fully consolidating this hardware offload workqueue infrastructure
> around flags, i.e. use flags not only for update events, but also for
> new and delete.
>
> This would go more in the direction of your _UPDATE flag idea:
>
> - Update the hardware offload workqueue to iterate over the
>   flowtable. The hardware offload workqueue would be "scanning" for
>   entries in the flowtable that require some sort of update in the
>   hardware. The flags would tell what kind of action is needed.
>
> - Add these flags:
>
> NF_FLOW_HW_NEW
> NF_FLOW_HW_UPDATE
> NF_FLOW_HW_DELETE
>
> and remove the work object (flow_offload_work) and the existing list.
> If the workqueue finds an entry with:
>
> NEW|DELETE, this means this is a short-lived flow, not worth wasting
> cycles to offload it.
> NEW|UPDATE, this means this is a UDP flow that is bidirectional.
>
> Then there will be no more work allocation and no more in-flight work
> objects headed to the hardware offload workqueue. Instead, the hardware
> offload workqueue will be iterating.
>
> This approach would need _DONE flags to annotate whether the offload
> updates have been applied to hardware already (similar to the
> conntrack _DONE flags).
>
> (Oh well, this proposal is adding even more flags. But I think flags
> are not the issue; the issue is mixing the existing flow_offload_work
> approach with this new _UPDATE flag and the gc changes.)
>
> If flow_offload_work is removed, we would also need to add a:
>
>  struct nf_flowtable *flowtable;
>
> field to the flow_offload entry; this field is currently passed via
> flow_offload_work. So it is one extra field for each flow_offload
> entry.
>
> The other alternative is to use the existing nf_flow_offload_add_wq
> with an UPDATE command, which might in turn result in more in-flight
> work objects. I think this is what you are trying to avoid with the
> _UPDATE flag approach.

This looks interesting, but it is very ambitious and will probably be a
bigger change than this whole series. I have an idea of how we can
leverage the existing 'refresh' mechanism for updating flow state that
doesn't involve large-scale refactoring of the existing offload
infrastructure, which I would prefer to try first. WDYT?
Vlad Buslov Jan. 24, 2023, 1:57 p.m. UTC | #5
On Tue 24 Jan 2023 at 11:19, Vlad Buslov <vladbu@nvidia.com> wrote:
> On Tue 24 Jan 2023 at 09:41, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>> Hi Vlad,
>>
>> On Tue, Jan 24, 2023 at 09:06:13AM +0200, Vlad Buslov wrote:
>>> 
>>> On Fri 20 Jan 2023 at 12:41, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>>> > Hi Vlad,
>>> >
>>> > On Thu, Jan 19, 2023 at 08:51:01PM +0100, Vlad Buslov wrote:
>>> >> Following patches in the series need to update the flowtable rule several
>>> >> times during its lifetime in order to synchronize hardware offload with the
>>> >> actual ct status. However, reusing the existing 'refresh' logic in act_ct
>>> >> would cause the data path to potentially schedule a significant amount of
>>> >> spurious tasks in the 'add' workqueue, since it is executed per packet.
>>> >> Instead, introduce a new flow 'update' flag and use it to schedule an async
>>> >> flow refresh in the flowtable gc, which will only be executed once per gc
>>> >> iteration.
>>> >
>>> > So the idea is to use an NF_FLOW_HW_UPDATE flag which triggers the
>>> > update from the garbage collector. I understand the motivation here is
>>> > to avoid adding more work to the workqueue, by simply letting the gc
>>> > thread pick up the update.
>>> >
>>> > Last year I already proposed alternative approaches to improve the
>>> > workqueue logic, including cancellation of useless work. For example,
>>> > cancel an in-flight "add" work item if a "delete" arrives while the
>>> > work is still sitting in the queue. The same approach could be used for
>>> > this update logic, i.e. cancel a unidirectional UDP "add", or upgrade
>>> > it to bidirectional, if the work is still sitting in the queue by the
>>> > time we see traffic in both directions.
>>> 
>>> Thanks for the suggestion. I'll try to make this work over regular
>>> workqueues without further extending the flow flags and/or putting more
>>> stuff into gc.
>>
>> Let me make a second pass to sort out my thoughts on this.
>>
>> Either we use regular workqueues (without new flags) or we explore
>> fully consolidating this hardware offload workqueue infrastructure
>> around flags, i.e. use flags not only for update events, but also for
>> new and delete.
>>
>> This would go more in the direction of your _UPDATE flag idea:
>>
>> - Update the hardware offload workqueue to iterate over the
>>   flowtable. The hardware offload workqueue would be "scanning" for
>>   entries in the flowtable that require some sort of update in the
>>   hardware. The flags would tell what kind of action is needed.
>>
>> - Add these flags:
>>
>> NF_FLOW_HW_NEW
>> NF_FLOW_HW_UPDATE
>> NF_FLOW_HW_DELETE
>>
>> and remove the work object (flow_offload_work) and the existing list.
>> If the workqueue finds an entry with:
>>
>> NEW|DELETE, this means this is a short-lived flow, not worth wasting
>> cycles to offload it.
>> NEW|UPDATE, this means this is a UDP flow that is bidirectional.
>>
>> Then there will be no more work allocation and no more in-flight work
>> objects headed to the hardware offload workqueue. Instead, the hardware
>> offload workqueue will be iterating.
>>
>> This approach would need _DONE flags to annotate whether the offload
>> updates have been applied to hardware already (similar to the
>> conntrack _DONE flags).
>>
>> (Oh well, this proposal is adding even more flags. But I think flags
>> are not the issue; the issue is mixing the existing flow_offload_work
>> approach with this new _UPDATE flag and the gc changes.)
>>
>> If flow_offload_work is removed, we would also need to add a:
>>
>>  struct nf_flowtable *flowtable;
>>
>> field to the flow_offload entry; this field is currently passed via
>> flow_offload_work. So it is one extra field for each flow_offload
>> entry.
>>
>> The other alternative is to use the existing nf_flow_offload_add_wq
>> with an UPDATE command, which might in turn result in more in-flight
>> work objects. I think this is what you are trying to avoid with the
>> _UPDATE flag approach.
>
> This looks interesting, but it is very ambitious and will probably be
> a bigger change than this whole series. I have an idea of how we can
> leverage the existing 'refresh' mechanism for updating flow state that
> doesn't involve large-scale refactoring of the existing offload
> infrastructure, which I would prefer to try first. WDYT?

Update: to illustrate this point I prepared a V4 that uses the regular
refresh path to update the flow, and that also tries to prevent
excessive wq spam and avoid updating the flow offload to a state that
is already outdated.
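
(V4 itself is not shown on this page. As a rough illustration of the
stated idea, the refresh path could compare a snapshot of the state last
pushed to hardware against the current ct state and only queue work when
they differ; the flow->hw_bidirectional field and the helper name below
are made up:)

/* Hypothetical illustration of the V4 idea: only schedule 'add' work
 * when the offloaded state no longer matches the current ct state, so
 * the per-packet refresh cannot spam the workqueue.
 */
static void flow_offload_refresh_state(struct nf_flowtable *flow_table,
				       struct flow_offload *flow,
				       bool bidirectional)
{
	if (READ_ONCE(flow->hw_bidirectional) == bidirectional)
		return;	/* hardware already matches the ct state */

	/* nf_flow_offload_add() already refuses to queue a second work
	 * item while NF_FLOW_HW_PENDING is set, which bounds the number
	 * of in-flight work objects per flow to one.
	 */
	nf_flow_offload_add(flow_table, flow);
}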

Patch

diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
index 88ab98ab41d9..e396424e2e68 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -165,6 +165,7 @@  enum nf_flow_flags {
 	NF_FLOW_HW_DEAD,
 	NF_FLOW_HW_PENDING,
 	NF_FLOW_HW_BIDIRECTIONAL,
+	NF_FLOW_HW_UPDATE,
 };
 
 enum flow_offload_type {
@@ -300,7 +301,7 @@  unsigned int nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
 #define MODULE_ALIAS_NF_FLOWTABLE(family)	\
 	MODULE_ALIAS("nf-flowtable-" __stringify(family))
 
-void nf_flow_offload_add(struct nf_flowtable *flowtable,
+bool nf_flow_offload_add(struct nf_flowtable *flowtable,
 			 struct flow_offload *flow);
 void nf_flow_offload_del(struct nf_flowtable *flowtable,
 			 struct flow_offload *flow);
diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index 04bd0ed4d2ae..5b495e768655 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -316,21 +316,28 @@  int flow_offload_add(struct nf_flowtable *flow_table, struct flow_offload *flow)
 }
 EXPORT_SYMBOL_GPL(flow_offload_add);
 
+static bool __flow_offload_refresh(struct nf_flowtable *flow_table,
+				   struct flow_offload *flow)
+{
+	if (likely(!nf_flowtable_hw_offload(flow_table)))
+		return true;
+
+	return nf_flow_offload_add(flow_table, flow);
+}
+
 void flow_offload_refresh(struct nf_flowtable *flow_table,
 			  struct flow_offload *flow)
 {
 	u32 timeout;
 
 	timeout = nf_flowtable_time_stamp + flow_offload_get_timeout(flow);
-	if (timeout - READ_ONCE(flow->timeout) > HZ)
+	if (timeout - READ_ONCE(flow->timeout) > HZ &&
+	    !test_bit(NF_FLOW_HW_UPDATE, &flow->flags))
 		WRITE_ONCE(flow->timeout, timeout);
 	else
 		return;
 
-	if (likely(!nf_flowtable_hw_offload(flow_table)))
-		return;
-
-	nf_flow_offload_add(flow_table, flow);
+	__flow_offload_refresh(flow_table, flow);
 }
 EXPORT_SYMBOL_GPL(flow_offload_refresh);
 
@@ -435,6 +442,9 @@  static void nf_flow_offload_gc_step(struct nf_flowtable *flow_table,
 		} else {
 			flow_offload_del(flow_table, flow);
 		}
+	} else if (test_and_clear_bit(NF_FLOW_HW_UPDATE, &flow->flags)) {
+		if (!__flow_offload_refresh(flow_table, flow))
+			set_bit(NF_FLOW_HW_UPDATE, &flow->flags);
 	} else if (test_bit(NF_FLOW_HW, &flow->flags)) {
 		nf_flow_offload_stats(flow_table, flow);
 	}
diff --git a/net/netfilter/nf_flow_table_offload.c b/net/netfilter/nf_flow_table_offload.c
index 8b852f10fab4..103b2ca8d123 100644
--- a/net/netfilter/nf_flow_table_offload.c
+++ b/net/netfilter/nf_flow_table_offload.c
@@ -1036,16 +1036,17 @@  nf_flow_offload_work_alloc(struct nf_flowtable *flowtable,
 }
 
 
-void nf_flow_offload_add(struct nf_flowtable *flowtable,
+bool nf_flow_offload_add(struct nf_flowtable *flowtable,
 			 struct flow_offload *flow)
 {
 	struct flow_offload_work *offload;
 
 	offload = nf_flow_offload_work_alloc(flowtable, flow, FLOW_CLS_REPLACE);
 	if (!offload)
-		return;
+		return false;
 
 	flow_offload_queue_work(offload);
+	return true;
 }
 
 void nf_flow_offload_del(struct nf_flowtable *flowtable,