Message ID | 1635931896-27539-1-git-send-email-volodymyr.mytnyk@plvision.eu (mailing list archive) |
---|---|
State | Awaiting Upstream |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net,v2,RESEND] netfilter: fix conntrack flows stuck issue on cleanup. | expand |
Hi, On Wed, Nov 03, 2021 at 11:31:36AM +0200, Volodymyr Mytnyk wrote: > From: Volodymyr Mytnyk <vmytnyk@marvell.com> > > On busy system with big number (few thousands) of HW offloaded flows, it > is possible to hit the situation, where some of the conntack flows are > stuck in conntrack table (as offloaded) and cannot be removed by user. > > This behaviour happens if user has configured conntack using tc sub-system, > offloaded those flows for HW and then deleted tc configuration from Linux > system by deleting the tc qdiscs. > > When qdiscs are removed, the nf_flow_table_free() is called to do the > cleanup of HW offloaded flows in conntrack table. > > ... > process_one_work > tcf_ct_flow_table_cleanup_work() > nf_flow_table_free() > > The nf_flow_table_free() does the following things: > > 1. cancels gc workqueue > 2. marks all flows as teardown > 3. executes nf_flow_offload_gc_step() once for each flow to > trigger correct teardown flow procedure (e.g., allocate > work to delete the HW flow and marks the flow as "dying"). > 4. waits for all scheduled flow offload works to be finished. > 5. executes nf_flow_offload_gc_step() once for each flow to > trigger the deleting of flows. > > Root cause: > > In step 3, nf_flow_offload_gc_step() expects to move flow to "dying" > state by using nf_flow_offload_del() and deletes the flow in next > nf_flow_offload_gc_step() iteration. But, if flow is in "pending" state > for some reason (e.g., reading HW stats), it will not be moved to > "dying" state as expected by nf_flow_offload_gc_step() and will not > be marked as "dead" for delition. > > In step 5, nf_flow_offload_gc_step() assumes that all flows marked > as "dead" and will be deleted by this call, but this is not true since > the state was not set diring previous nf_flow_offload_gc_step() > call. > > It issue causes some of the flows to get stuck in connection tracking > system or not release properly. > > To fix this problem, add nf_flow_table_offload_flush() call between 2 & 3 > step, to make sure no other flow offload works will be in "pending" state > during step 3. Thanks for the detailed report. I'm attaching two patches, the first one is a preparation patch. The second patch flushes the pending work, then it sets the teardown flag to all flows in the flowtable and it forces a garbage collector run to queue work to remove the flows from hardware, then it flushes this new pending work and (finally) it forces another garbage collector run to remove the entry from the software flowtable. Compile-tested only.
> Hi, > > On Wed, Nov 03, 2021 at 11:31:36AM +0200, Volodymyr Mytnyk wrote: > > From: Volodymyr Mytnyk <vmytnyk@marvell.com> > > > > On busy system with big number (few thousands) of HW offloaded flows, it > > is possible to hit the situation, where some of the conntack flows are > > stuck in conntrack table (as offloaded) and cannot be removed by user. > > > > This behaviour happens if user has configured conntack using tc sub-system, > > offloaded those flows for HW and then deleted tc configuration from Linux > > system by deleting the tc qdiscs. > > > > When qdiscs are removed, the nf_flow_table_free() is called to do the > > cleanup of HW offloaded flows in conntrack table. > > > > ... > > process_one_work > > tcf_ct_flow_table_cleanup_work() > > nf_flow_table_free() > > > > The nf_flow_table_free() does the following things: > > > > 1. cancels gc workqueue > > 2. marks all flows as teardown > > 3. executes nf_flow_offload_gc_step() once for each flow to > > trigger correct teardown flow procedure (e.g., allocate > > work to delete the HW flow and marks the flow as "dying"). > > 4. waits for all scheduled flow offload works to be finished. > > 5. executes nf_flow_offload_gc_step() once for each flow to > > trigger the deleting of flows. > > > > Root cause: > > > > In step 3, nf_flow_offload_gc_step() expects to move flow to "dying" > > state by using nf_flow_offload_del() and deletes the flow in next > > nf_flow_offload_gc_step() iteration. But, if flow is in "pending" state > > for some reason (e.g., reading HW stats), it will not be moved to > > "dying" state as expected by nf_flow_offload_gc_step() and will not > > be marked as "dead" for delition. > > > > In step 5, nf_flow_offload_gc_step() assumes that all flows marked > > as "dead" and will be deleted by this call, but this is not true since > > the state was not set diring previous nf_flow_offload_gc_step() > > call. > > > > It issue causes some of the flows to get stuck in connection tracking > > system or not release properly. > > > > To fix this problem, add nf_flow_table_offload_flush() call between 2 & 3 > > step, to make sure no other flow offload works will be in "pending" state > > during step 3. > > Thanks for the detailed report. > > I'm attaching two patches, the first one is a preparation patch. The > second patch flushes the pending work, then it sets the teardown flag > to all flows in the flowtable and it forces a garbage collector run to > queue work to remove the flows from hardware, then it flushes this new > pending work and (finally) it forces another garbage collector run to > remove the entry from the software flowtable. Compile-tested only. Hi Pablo, Thanks for reviewing the changes and problem investigation. I will check the provided patches and will back to you. Regards, Volodymyr
diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c index 87a7388b6c89..17593f49413b 100644 --- a/net/netfilter/nf_flow_table_core.c +++ b/net/netfilter/nf_flow_table_core.c @@ -637,6 +637,8 @@ void nf_flow_table_free(struct nf_flowtable *flow_table) cancel_delayed_work_sync(&flow_table->gc_work); nf_flow_table_iterate(flow_table, nf_flow_table_do_cleanup, NULL); + /* wait for all 'pending' flows to be finished */ + nf_flow_table_offload_flush(flow_table); nf_flow_table_iterate(flow_table, nf_flow_offload_gc_step, flow_table); nf_flow_table_offload_flush(flow_table); if (nf_flowtable_hw_offload(flow_table))