[nf,v3] netfilter: conntrack: fix crash due to confirmed bit load reordering

Message ID 20220706145004.22355-1-fw@strlen.de (mailing list archive)
State Awaiting Upstream
Delegated to: Netdev Maintainers

Checks

Context Check Description
netdev/tree_selection success Guessed tree name to be net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix warning Target tree name not specified in the subject
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 6 this patch: 6
netdev/cc_maintainers fail 1 blamed authors not CCed: pablo@netfilter.org; 7 maintainers not CCed: pablo@netfilter.org edumazet@google.com pabeni@redhat.com kadlec@netfilter.org davem@davemloft.net kuba@kernel.org coreteam@netfilter.org
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 6 this patch: 6
netdev/checkpatch warning WARNING: Non-standard signature: Diagnosed-by: WARNING: line length of 87 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Florian Westphal July 6, 2022, 2:50 p.m. UTC
Kajetan Puchalski reports a crash on ARM, with a backtrace of:

__nf_ct_delete_from_lists
nf_ct_delete
early_drop
__nf_conntrack_alloc

Unlike atomic_inc_not_zero, refcount_inc_not_zero is not a full barrier.
conntrack uses SLAB_TYPESAFE_BY_RCU, i.e. it is possible that a 'newly'
allocated object is still in use on another CPU:

CPU1						CPU2
						encounter 'ct' during hlist walk
 delete_from_lists
 refcount drops to 0
 kmem_cache_free(ct);
 __nf_conntrack_alloc() // returns same object
						refcount_inc_not_zero(ct); /* might fail */

						/* If set, ct is public/in the hash table */
						test_bit(IPS_CONFIRMED_BIT, &ct->status);

If CPU1 has already set the refcount back to 1, refcount_inc_not_zero()
will succeed.
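
To make the window concrete, here is a simplified sketch of the reader
side as it stood before this fix (condensed from the lookup/eviction
paths; use() is an illustrative placeholder, not a kernel function):

	rcu_read_lock();
	/* 'ct' was found during an hlist walk; under SLAB_TYPESAFE_BY_RCU
	 * it may already have been freed and recycled by
	 * __nf_conntrack_alloc() on another CPU.
	 */
	if (refcount_inc_not_zero(&ct->ct_general.use)) {
		/* Without an acquire barrier, this load can be satisfied
		 * before the refcount check above and observe a stale
		 * CONFIRMED bit from the object's previous life.
		 */
		if (test_bit(IPS_CONFIRMED_BIT, &ct->status))
			use(ct);	/* may be CPU1's private, half-initialised object */
		else
			nf_ct_put(ct);
	}
	rcu_read_unlock();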

The expected possibilities for a CPU that obtained the object 'ct'
(but no reference so far) are:

1. refcount_inc_not_zero() fails.  CPU2 ignores the object and moves to
   the next entry in the list.  This happens for objects that are about
   to be free'd, that have been free'd, or that have been reallocated
   by __nf_conntrack_alloc(), but where the refcount has not been
   increased back to 1 yet.

2. refcount_inc_not_zero() succeeds. CPU2 checks the CONFIRMED bit
   in ct->status.  If set, the object is public/in the table.

   If not, the object must be skipped; CPU2 calls nf_ct_put() to
   undo the refcount increment and moves to the next object.

Parallel deletion from the hlists is prevented by a
'test_and_set_bit(IPS_DYING_BIT, &ct->status);' check, i.e. only one
CPU will do the unlink; the other one will only drop its reference count.
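
A rough sketch of that exclusion, following the usual
test_and_set_bit() idiom (simplified, not the verbatim kernel code):

	/* Whichever CPU sets IPS_DYING_BIT first owns the unlink; every
	 * other CPU merely drops the reference it holds.
	 */
	if (test_and_set_bit(IPS_DYING_BIT, &ct->status))
		return false;		/* another CPU handles the unlink */

	__nf_ct_delete_from_lists(ct);	/* we won the race: unlink from hash */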

Because refcount_inc_not_zero is not a full barrier, CPU2 may try to
delete an object that is not on any list:

1. refcount_inc_not_zero() succeeds (the refcount was re-initialised
   to 1 on the other CPU)
2. the CONFIRMED test also succeeds (the load was reordered, or the
   zeroing of ct->status is not yet visible)
3. delete_from_lists unlinks an entry that is not on the hlist, because
   IPS_DYING_BIT is 0 (already cleared).

Step 2 is already wrong: CPU2 will handle a partially initialised
object that is supposed to be private to CPU1.

Add the needed barriers at the points where refcount_inc_not_zero()
succeeds.
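
This is the pattern the diff below applies at each of those sites,
shown here in condensed form:

	if (!refcount_inc_not_zero(&ct->ct_general.use))
		continue;	/* object is being freed or re-initialised */

	/* The branch on the increment's outcome forms a control
	 * dependency; upgrade it to an acquire so that the later loads
	 * of ->status (and ->ct_net) cannot be reordered before the
	 * refcount check.
	 */
	smp_acquire__after_ctrl_dep();

	if (!test_bit(IPS_CONFIRMED_BIT, &ct->status)) {
		nf_ct_put(ct);	/* private/half-initialised object: skip it */
		continue;
	}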

Also insert an smp_wmb() before the refcount is set to 1 during
allocation.

Because another CPU might still 'see' the object, refcount_set(1)
"resurrects" it, so we need to make sure that other CPUs will also
observe the right contents.  In particular, the CONFIRMED bit test
must only pass once the object is fully initialised and either in the
hash table or about to be inserted (with locks held to delay a possible
unlink by early_drop or the gc worker).
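
On the writer (allocation) side, the pairing barrier is the smp_wmb()
added to init_conntrack() in the diff below:

	/* All stores that initialise the object (the zeroed ->status,
	 * the correct ->ct_net pointer, the tuples, ...) must be
	 * ordered before the store that makes the refcount non-zero.
	 */
	smp_wmb();

	/* From this point on, a concurrent refcount_inc_not_zero() can
	 * succeed, and the reader's acquire barrier ensures it observes
	 * the stores ordered before the smp_wmb() above.
	 */
	refcount_set(&ct->ct_general.use, 1);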

I did not change flow_offload_alloc(): as far as I can see, it should
call refcount_inc(), not refcount_inc_not_zero(), because the ct object
is attached to the skb, so its refcount should be >= 1 in all cases.

v2: prefer smp_acquire__after_ctrl_dep to smp_rmb (Will Deacon).
v3: keep smp_acquire__after_ctrl_dep close to refcount_inc_not_zero call
    add comment in nf_conntrack_netlink, no control dependency there
    due to locks.

Cc: Peter Zijlstra <peterz@infradead.org>
Reported-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
Diagnosed-by: Will Deacon <will@kernel.org>
Fixes: 719774377622 ("netfilter: conntrack: convert to refcount_t api")
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_conntrack_core.c       | 22 ++++++++++++++++++++++
 net/netfilter/nf_conntrack_netlink.c    |  1 +
 net/netfilter/nf_conntrack_standalone.c |  3 +++
 3 files changed, 26 insertions(+)

Comments

Will Deacon July 7, 2022, 8:19 a.m. UTC | #1
On Wed, Jul 06, 2022 at 04:50:04PM +0200, Florian Westphal wrote:
> Kajetan Puchalski reports a crash on ARM, with a backtrace of:
> 
> [...]
> 
>  net/netfilter/nf_conntrack_core.c       | 22 ++++++++++++++++++++++
>  net/netfilter/nf_conntrack_netlink.c    |  1 +
>  net/netfilter/nf_conntrack_standalone.c |  3 +++
>  3 files changed, 26 insertions(+)

Acked-by: Will Deacon <will@kernel.org>

Will
Thorsten Leemhuis July 7, 2022, 10:17 a.m. UTC | #2
On 06.07.22 16:50, Florian Westphal wrote:
> Kajetan Puchalski reports a crash on ARM, with a backtrace of:
> 
> [...]
> Cc: Peter Zijlstra <peterz@infradead.org>
> Reported-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
> Diagnosed-by: Will Deacon <will@kernel.org>
> Fixes: 719774377622 ("netfilter: conntrack: convert to refcount_t api")
> Signed-off-by: Florian Westphal <fw@strlen.de>

If you need to respin this patch for one reason or another, could you do
me a favor and add proper 'Link:' tags pointing to all reports about
this issue? e.g. like this:

 Link: https://lore.kernel.org/all/Yr7WTfd6AVTQkLjI@e126311.manchester.arm.com/

These tags are considered important by Linus[1] and others, as they
allow anyone to look into the backstory weeks or years from now. That is
why they should be placed in cases like this, as
Documentation/process/submitting-patches.rst and
Documentation/process/5.Posting.rst explain in more detail. I care
personally, because these tags make my regression tracking efforts a
whole lot easier, as they allow my tracking bot 'regzbot' to
automatically connect reports with patches posted or committed to fix
tracked regressions.

[1] see for example:
https://lore.kernel.org/all/CAHk-=wjMmSZzMJ3Xnskdg4+GGz=5p5p+GSYyFBTh0f-DgvdBWg@mail.gmail.com/
https://lore.kernel.org/all/CAHk-=wgs38ZrfPvy=nOwVkVzjpM3VFU1zobP37Fwd_h9iAD5JQ@mail.gmail.com/
https://lore.kernel.org/all/CAHk-=wjxzafG-=J8oT30s7upn4RhBs6TX-uVFZ5rME+L5_DoJA@mail.gmail.com/

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.
Florian Westphal July 7, 2022, 6:59 p.m. UTC | #3
Will Deacon <will@kernel.org> wrote:
> On Wed, Jul 06, 2022 at 04:50:04PM +0200, Florian Westphal wrote:
> >  net/netfilter/nf_conntrack_core.c       | 22 ++++++++++++++++++++++
> >  net/netfilter/nf_conntrack_netlink.c    |  1 +
> >  net/netfilter/nf_conntrack_standalone.c |  3 +++
> >  3 files changed, 26 insertions(+)
> 
> Acked-by: Will Deacon <will@kernel.org>

Thanks, I pushed this patch to nf.git.
Kajetan Puchalski July 11, 2022, 4:34 p.m. UTC | #4
> v3: keep smp_acquire__after_ctrl_dep close to refcount_inc_not_zero call
>     add comment in nf_conntrack_netlink, no control dependency there
>     due to locks.

Just to follow up on that, I tested v3 for 24 hours with the workload
in question and found no issues, so it looks like the fix is stable.

In case someone is interested in the performance differences: since I
was running benchmarks regardless, I thought I might share the numbers
on how using refcount vs. atomic here seems to affect networking
workloads.

The results were collected using mmtests, the means containing asterisks
are the results that the framework considered statistically significant.

netperf-udp
                                     atomic                     v3
Hmean     send-64         189.36 (   0.00%)      227.14 *  19.95%*
Hmean     send-128        378.77 (   0.00%)      387.94 (   2.42%)
Hmean     send-256        925.96 (   0.00%)      922.77 (  -0.34%)
Hmean     send-1024      3550.03 (   0.00%)     3528.63 (  -0.60%)
Hmean     send-2048      6545.45 (   0.00%)     6655.64 *   1.68%*
Hmean     send-3312     10282.12 (   0.00%)    10388.78 *   1.04%*
Hmean     send-4096     11902.15 (   0.00%)    12052.30 *   1.26%*
Hmean     send-8192     19369.15 (   0.00%)    20363.82 *   5.14%*
Hmean     send-16384    32610.44 (   0.00%)    33080.30 (   1.44%)
Hmean     recv-64         189.36 (   0.00%)      226.34 *  19.53%*
Hmean     recv-128        378.77 (   0.00%)      386.81 (   2.12%)
Hmean     recv-256        925.95 (   0.00%)      922.77 (  -0.34%)
Hmean     recv-1024      3549.90 (   0.00%)     3528.51 (  -0.60%)
Hmean     recv-2048      6542.82 (   0.00%)     6653.05 *   1.68%*
Hmean     recv-3312     10278.46 (   0.00%)    10385.45 *   1.04%*
Hmean     recv-4096     11892.86 (   0.00%)    12041.68 *   1.25%*
Hmean     recv-8192     19345.14 (   0.00%)    20343.76 *   5.16%*
Hmean     recv-16384    32574.38 (   0.00%)    33030.53 (   1.40%)

netperf-tcp
                                atomic                     v3
Hmean     64        1324.25 (   0.00%)     1328.90 *   0.35%*
Hmean     128       2576.89 (   0.00%)     2579.71 (   0.11%)
Hmean     256       4882.34 (   0.00%)     4889.49 (   0.15%)
Hmean     1024     14560.89 (   0.00%)    14423.39 *  -0.94%*
Hmean     2048     20995.91 (   0.00%)    20818.49 *  -0.85%*
Hmean     3312     25440.20 (   0.00%)    25318.16 *  -0.48%*
Hmean     4096     27309.32 (   0.00%)    27282.26 (  -0.10%)
Hmean     8192     31204.34 (   0.00%)    31326.23 *   0.39%*
Hmean     16384    34370.49 (   0.00%)    34298.25 (  -0.21%)

Additionally, the reason I bumped into this issue in the first place
was running benchmarks on different CPUIdle governors, so below are the
results for what happens if, in addition to changing from atomic to the
v3 refcount, I also switch the idle governor from menu to TEO.

netperf-udp
                                     atomic                     v3
                                      menu                     teo
Hmean     send-64         189.36 (   0.00%)      248.79 *  31.38%*
Hmean     send-128        378.77 (   0.00%)      439.06 (  15.92%)
Hmean     send-256        925.96 (   0.00%)     1101.20 *  18.93%*
Hmean     send-1024      3550.03 (   0.00%)     3298.19 (  -7.09%)
Hmean     send-2048      6545.45 (   0.00%)     7714.21 *  17.86%*
Hmean     send-3312     10282.12 (   0.00%)    12090.56 *  17.59%*
Hmean     send-4096     11902.15 (   0.00%)    13766.56 *  15.66%*
Hmean     send-8192     19369.15 (   0.00%)    22943.77 *  18.46%*
Hmean     send-16384    32610.44 (   0.00%)    37370.44 *  14.60%*
Hmean     recv-64         189.36 (   0.00%)      248.79 *  31.38%*
Hmean     recv-128        378.77 (   0.00%)      439.06 (  15.92%)
Hmean     recv-256        925.95 (   0.00%)     1101.19 *  18.92%*
Hmean     recv-1024      3549.90 (   0.00%)     3298.16 (  -7.09%)
Hmean     recv-2048      6542.82 (   0.00%)     7711.59 *  17.86%*
Hmean     recv-3312     10278.46 (   0.00%)    12087.81 *  17.60%*
Hmean     recv-4096     11892.86 (   0.00%)    13755.48 *  15.66%*
Hmean     recv-8192     19345.14 (   0.00%)    22933.98 *  18.55%*
Hmean     recv-16384    32574.38 (   0.00%)    37332.10 *  14.61%*

netperf-tcp
                                atomic                     v3
                                 menu                     teo
Hmean     64        1324.25 (   0.00%)     1351.86 *   2.08%*
Hmean     128       2576.89 (   0.00%)     2629.08 *   2.03%*
Hmean     256       4882.34 (   0.00%)     5003.19 *   2.48%*
Hmean     1024     14560.89 (   0.00%)    15237.15 *   4.64%*
Hmean     2048     20995.91 (   0.00%)    22804.40 *   8.61%*
Hmean     3312     25440.20 (   0.00%)    27815.23 *   9.34%*
Hmean     4096     27309.32 (   0.00%)    30171.81 *  10.48%*
Hmean     8192     31204.34 (   0.00%)    37112.55 *  18.93%*
Hmean     16384    34370.49 (   0.00%)    42952.01 *  24.97%*

The absolute values might be skewed by the characteristics of the
machine in question but I thought the comparative differences between
different patches were interesting enough to share.

> [...]

Patch

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 082a2fd8d85b..369aeabb94fe 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -729,6 +729,9 @@  static void nf_ct_gc_expired(struct nf_conn *ct)
 	if (!refcount_inc_not_zero(&ct->ct_general.use))
 		return;
 
+	/* load ->status after refcount increase */
+	smp_acquire__after_ctrl_dep();
+
 	if (nf_ct_should_gc(ct))
 		nf_ct_kill(ct);
 
@@ -795,6 +798,9 @@  __nf_conntrack_find_get(struct net *net, const struct nf_conntrack_zone *zone,
 		 */
 		ct = nf_ct_tuplehash_to_ctrack(h);
 		if (likely(refcount_inc_not_zero(&ct->ct_general.use))) {
+			/* re-check key after refcount */
+			smp_acquire__after_ctrl_dep();
+
 			if (likely(nf_ct_key_equal(h, tuple, zone, net)))
 				goto found;
 
@@ -1387,6 +1393,9 @@  static unsigned int early_drop_list(struct net *net,
 		if (!refcount_inc_not_zero(&tmp->ct_general.use))
 			continue;
 
+		/* load ->ct_net and ->status after refcount increase */
+		smp_acquire__after_ctrl_dep();
+
 		/* kill only if still in same netns -- might have moved due to
 		 * SLAB_TYPESAFE_BY_RCU rules.
 		 *
@@ -1536,6 +1545,9 @@  static void gc_worker(struct work_struct *work)
 			if (!refcount_inc_not_zero(&tmp->ct_general.use))
 				continue;
 
+			/* load ->status after refcount increase */
+			smp_acquire__after_ctrl_dep();
+
 			if (gc_worker_skip_ct(tmp)) {
 				nf_ct_put(tmp);
 				continue;
@@ -1775,6 +1787,16 @@  init_conntrack(struct net *net, struct nf_conn *tmpl,
 	if (!exp)
 		__nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);
 
+	/* Other CPU might have obtained a pointer to this object before it was
+	 * released.  Because refcount is 0, refcount_inc_not_zero() will fail.
+	 *
+	 * After refcount_set(1) it will succeed; ensure that zeroing of
+	 * ct->status and the correct ct->net pointer are visible; else other
+	 * core might observe CONFIRMED bit which means the entry is valid and
+	 * in the hash table, but it's not (anymore).
+	 */
+	smp_wmb();
+
 	/* Now it is going to be associated with an sk_buff, set refcount to 1. */
 	refcount_set(&ct->ct_general.use, 1);
 
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 722af5e309ba..f5905b5201a7 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -1203,6 +1203,7 @@  ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
 					   hnnode) {
 			ct = nf_ct_tuplehash_to_ctrack(h);
 			if (nf_ct_is_expired(ct)) {
+				/* need to defer nf_ct_kill() until lock is released */
 				if (i < ARRAY_SIZE(nf_ct_evict) &&
 				    refcount_inc_not_zero(&ct->ct_general.use))
 					nf_ct_evict[i++] = ct;
diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
index 6ad7bbc90d38..05895878610c 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -306,6 +306,9 @@  static int ct_seq_show(struct seq_file *s, void *v)
 	if (unlikely(!refcount_inc_not_zero(&ct->ct_general.use)))
 		return 0;
 
+	/* load ->status after refcount increase */
+	smp_acquire__after_ctrl_dep();
+
 	if (nf_ct_should_gc(ct)) {
 		nf_ct_kill(ct);
 		goto release;