[net] inet: inet_defrag: prevent sk release while still in use

Message ID: 20240319122310.27474-1-fw@strlen.de
State: Superseded
Delegated to: Netdev Maintainers
Series: [net] inet: inet_defrag: prevent sk release while still in use

Checks

Context                         Check    Description
netdev/series_format            success  Single patches do not need cover letters
netdev/tree_selection           success  Clearly marked for net
netdev/ynl                      success  Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present            fail     Series targets non-next tree, but doesn't contain any Fixes tags
netdev/header_inline            success  No static functions without inline keyword in header files
netdev/build_32bit              success  Errors and warnings before: 5778 this patch: 5778
netdev/build_tools              success  Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers           warning  3 maintainers not CCed: coreteam@netfilter.org pablo@netfilter.org kadlec@netfilter.org
netdev/build_clang              success  Errors and warnings before: 2061 this patch: 2061
netdev/verify_signedoff         success  Signed-off-by tag matches author and committer
netdev/deprecated_api           success  None detected
netdev/check_selftest           success  No net selftest shell script
netdev/verify_fixes             success  No Fixes tag
netdev/build_allmodconfig_warn  success  Errors and warnings before: 6068 this patch: 6068
netdev/checkpatch               warning  WARNING: Non-standard signature: Diagnosed-by:; WARNING: externs should be avoided in .c files; WARNING: function definition argument 'struct sk_buff *' should also have an identifier name
netdev/build_clang_rust         success  No Rust files in patch. Skipping build
netdev/kdoc                     success  Errors and warnings before: 0 this patch: 0
netdev/source_inline            success  Was 0 now: 0
netdev/contest                  success  net-next-2024-03-20--12-00 (tests: 910)

Commit Message

Florian Westphal March 19, 2024, 12:23 p.m. UTC
ip_local_out() and other functions can pass skb->sk as function argument.

If the skb is a fragment and reassembly happens before such function call
returns, the sk must not be released.

This affects skb fragments reassembled via netfilter or similar
modules, e.g. openvswitch or ct_act.c, when run as part of tx pipeline.

Eric Dumazet made an initial analysis of this bug.  Quoting Eric:
  Calling ip_defrag() in output path is also implying skb_orphan(),
  which is buggy because output path relies on sk not disappearing.

  A relevant old patch about the issue was :
  8282f27449bf ("inet: frag: Always orphan skbs inside ip_defrag()")

  [..]

  net/ipv4/ip_output.c depends on skb->sk being set, and probably to an
  inet socket, not an arbitrary one.

  If we orphan the packet in ipvlan, then downstream things like FQ
  packet scheduler will not work properly.

  We need to change ip_defrag() to only use skb_orphan() when really
  needed, ie whenever frag_list is going to be used.

Eric suggested stashing sk in the fragment queue and made an initial patch.
However there is a problem with this:

If the skb is refragmented again right after, ip_do_fragment() will copy
head->sk to the new fragments and set their destructor to sock_wfree.
IOW, we have no choice but to fix up sk_wmem accounting to reflect the
fully reassembled skb, else wmem will underflow.
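
The underflow can be illustrated with a small userspace model. This is only a
sketch under stated assumptions: toy_sock, toy_wfree() and
toy_refrag_balance() are hypothetical names, a signed long stands in for
sk->sk_wmem_alloc (which the patch updates with refcount_add()), and each
refragmented piece is assumed to keep the truesize of one original fragment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: a signed counter stands in for sk->sk_wmem_alloc. */
struct toy_sock {
	long wmem_alloc;
};

/* Models the effect of a sock_wfree-style destructor on the counter. */
static void toy_wfree(struct toy_sock *sk, unsigned int truesize)
{
	sk->wmem_alloc -= (long)truesize;
}

/*
 * Charge only the skb that completed the queue, then release every
 * refragmented piece against the same socket. Returns the final counter;
 * a negative value means more was uncharged than was ever charged.
 */
long toy_refrag_balance(const unsigned int *truesize, size_t n)
{
	struct toy_sock sk = { 0 };
	size_t i;

	assert(n > 0);

	/* Only the last-arrived fragment was charged to the socket. */
	sk.wmem_alloc += (long)truesize[n - 1];

	/* After refragmentation every piece carries sk and is uncharged. */
	for (i = 0; i < n; i++)
		toy_wfree(&sk, truesize[i]);

	return sk.wmem_alloc;
}
```

With truesizes 768, 768 and 512, the model charges 512 and uncharges 2048,
so the counter ends at -1536. With a single fragment it ends at 0.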

This change moves the orphan down into the core, to the last possible moment.
As ip_defrag_offset is aliased with the sk_buff->sk member, we must move the
offset into the FRAG_CB, else skb->sk gets clobbered.
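
The aliasing problem can be reproduced in userspace. The sketch below is an
illustration, not kernel code: toy_sock, toy_skb_old and toy_skb_new are
hypothetical stand-ins, and the control-buffer store uses memcpy() where the
kernel uses the FRAG_CB() cast, to keep the userspace C well-defined.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; not the kernel structures. */
struct toy_sock {
	int id;
};

static struct toy_sock toy_owner = { .id = 1 };

/* Old layout: the fragment offset shares storage with the socket pointer. */
struct toy_skb_old {
	union {
		struct toy_sock *sk;
		int ip_defrag_offset;
	};
};

/* New layout: sk has its own storage, the offset lives in the cb area. */
struct toy_skb_new {
	struct toy_sock *sk;
	unsigned char cb[48];
};

/* Returns 1 if skb.sk still points at toy_owner after storing the offset. */
int toy_old_sk_survives(int offset)
{
	struct toy_sock *expected = &toy_owner;
	struct toy_skb_old skb = { .sk = &toy_owner };

	skb.ip_defrag_offset = offset;	/* overwrites bytes of skb.sk */
	return memcmp(&skb.sk, &expected, sizeof(expected)) == 0;
}

/* Same check for the new layout, where the offset is kept in cb. */
int toy_new_sk_survives(int offset)
{
	struct toy_skb_new skb = { .sk = &toy_owner };
	int stored;

	memcpy(skb.cb, &offset, sizeof(offset));
	memcpy(&stored, skb.cb, sizeof(stored));
	assert(stored == offset);	/* the offset round-trips through cb */

	return skb.sk == &toy_owner;
}
```

Two different offsets write different bytes over the same pointer storage, so
the old layout cannot preserve sk for both of them. The new layout preserves
sk for any offset.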

This allows delaying the orphaning long enough to learn whether the skb has
to be queued or whether it completes the reasm queue.

In the former case, things work as before, skb is orphaned.  This is
safe because skb gets queued/stolen and won't continue past reasm engine.

In the latter case, we will steal the skb->sk reference, reattach it to
the head skb, and fix up wmem accounting when inet_frag inflates truesize.
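
The two accounting steps can be sketched as a userspace model. The names
toy_sock, toy_skb, toy_reasm_prepare(), toy_reasm_finish() and
toy_reassembled_charge() are hypothetical, and a signed long replaces the
refcount that the patch adjusts with refcount_add(). The model only tracks
the arithmetic: the socket starts out charged with the truesize of the skb
that held the reference, and should end up charged with the truesize of the
reassembled head.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model types; not the kernel structures. */
struct toy_sock {
	long wmem_alloc;
};

struct toy_skb {
	struct toy_sock *sk;
	unsigned int truesize;
};

/* Step 1: reattach sk to the head and charge the truesize difference. */
static void toy_reasm_prepare(struct toy_skb *head, struct toy_sock *sk,
			      unsigned int orig_truesize)
{
	head->sk = sk;
	sk->wmem_alloc += (long)head->truesize - (long)orig_truesize;
}

/* Step 2: the head absorbs the other fragments; charge the growth. */
static void toy_reasm_finish(struct toy_skb *head,
			     const unsigned int *others, size_t n)
{
	const unsigned int head_truesize = head->truesize;
	unsigned int sum_truesize = head->truesize;
	size_t i;

	for (i = 0; i < n; i++)
		sum_truesize += others[i];
	head->truesize = sum_truesize;

	if (head->sk)
		head->sk->wmem_alloc += (long)(sum_truesize - head_truesize);
}

/* Runs both steps and returns the final charge on the socket. */
long toy_reassembled_charge(unsigned int orig_truesize,
			    unsigned int head_truesize,
			    const unsigned int *others, size_t n)
{
	struct toy_sock sk = { .wmem_alloc = (long)orig_truesize };
	struct toy_skb head = { .sk = NULL, .truesize = head_truesize };

	toy_reasm_prepare(&head, &sk, orig_truesize);
	toy_reasm_finish(&head, others, n);

	/* The socket is now charged for exactly the reassembled skb. */
	assert(sk.wmem_alloc == (long)head.truesize);
	return sk.wmem_alloc;
}
```

For an original charge of 512, a head of 768 and two further fragments of
768 and 512, the first step brings the charge to 768 and the second to 2048,
which matches the reassembled truesize.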

Diagnosed-by: Eric Dumazet <edumazet@google.com>
Reported-by: xingwei lee <xrivendell7@gmail.com>
Reported-by: yue sun <samsun1006219@gmail.com>
Reported-by: syzbot+e5167d7144a62715044c@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/skbuff.h                  |  7 +--
 net/ipv4/inet_fragment.c                | 71 ++++++++++++++++++++-----
 net/ipv4/ip_fragment.c                  |  2 +-
 net/ipv6/netfilter/nf_conntrack_reasm.c |  2 +-
 4 files changed, 61 insertions(+), 21 deletions(-)

Comments

Eric Dumazet March 20, 2024, 2:14 p.m. UTC | #1
On Tue, Mar 19, 2024 at 12:36 PM Florian Westphal <fw@strlen.de> wrote:
> +void tcp_wfree(struct sk_buff *skb);

Thanks a lot Florian for looking at this !

Since you added #include "../core/sock_destructor.h", perhaps the line
can be removed, because that header includes <net/tcp.h>.

Paolo Abeni March 21, 2024, 11:55 a.m. UTC | #2
Hi,

On Wed, 2024-03-20 at 15:14 +0100, Eric Dumazet wrote:
> On Tue, Mar 19, 2024 at 12:36 PM Florian Westphal <fw@strlen.de> wrote:
> > Diagnosed-by: Eric Dumazet <edumazet@google.com>
> > Reported-by: xingwei lee <xrivendell7@gmail.com>
> > Reported-by: yue sun <samsun1006219@gmail.com>
> > Reported-by: syzbot+e5167d7144a62715044c@syzkaller.appspotmail.com

Possibly:

Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.")

it's not very accurate but should be a reasonable oldest affected
version.

> > Signed-off-by: Florian Westphal <fw@strlen.de>
> > +void tcp_wfree(struct sk_buff *skb);
> 
> Thanks a lot Florian for looking at this !
> 
> Since you had : #include "../core/sock_destructor.h", perhaps the line
> can be removed,
> because it includes <net/tcp.h>

I think Florian will not be able to reply for a few days.

Since the issue looks ancient and we are early in the cycle, I guess
there are no problems with that.

Cheers,

Paolo
Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 7d56ce195120..6d08ff8a9357 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -753,8 +753,6 @@  typedef unsigned char *sk_buff_data_t;
  *	@list: queue head
  *	@ll_node: anchor in an llist (eg socket defer_list)
  *	@sk: Socket we are owned by
- *	@ip_defrag_offset: (aka @sk) alternate use of @sk, used in
- *		fragmentation management
  *	@dev: Device we arrived on/are leaving by
  *	@dev_scratch: (aka @dev) alternate use of @dev when @dev would be %NULL
  *	@cb: Control buffer. Free for use by every layer. Put private vars here
@@ -875,10 +873,7 @@  struct sk_buff {
 		struct llist_node	ll_node;
 	};
 
-	union {
-		struct sock		*sk;
-		int			ip_defrag_offset;
-	};
+	struct sock		*sk;
 
 	union {
 		ktime_t		tstamp;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 7072fc0783ef..7254b640ba06 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -24,6 +24,8 @@ 
 #include <net/ip.h>
 #include <net/ipv6.h>
 
+#include "../core/sock_destructor.h"
+
 /* Use skb->cb to track consecutive/adjacent fragments coming at
  * the end of the queue. Nodes in the rb-tree queue will
  * contain "runs" of one or more adjacent fragments.
@@ -39,6 +41,7 @@  struct ipfrag_skb_cb {
 	};
 	struct sk_buff		*next_frag;
 	int			frag_run_len;
+	int			ip_defrag_offset;
 };
 
 #define FRAG_CB(skb)		((struct ipfrag_skb_cb *)((skb)->cb))
@@ -396,12 +399,12 @@  int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
 	 */
 	if (!last)
 		fragrun_create(q, skb);  /* First fragment. */
-	else if (last->ip_defrag_offset + last->len < end) {
+	else if (FRAG_CB(last)->ip_defrag_offset + last->len < end) {
 		/* This is the common case: skb goes to the end. */
 		/* Detect and discard overlaps. */
-		if (offset < last->ip_defrag_offset + last->len)
+		if (offset < FRAG_CB(last)->ip_defrag_offset + last->len)
 			return IPFRAG_OVERLAP;
-		if (offset == last->ip_defrag_offset + last->len)
+		if (offset == FRAG_CB(last)->ip_defrag_offset + last->len)
 			fragrun_append_to_last(q, skb);
 		else
 			fragrun_create(q, skb);
@@ -418,13 +421,13 @@  int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
 
 			parent = *rbn;
 			curr = rb_to_skb(parent);
-			curr_run_end = curr->ip_defrag_offset +
+			curr_run_end = FRAG_CB(curr)->ip_defrag_offset +
 					FRAG_CB(curr)->frag_run_len;
-			if (end <= curr->ip_defrag_offset)
+			if (end <= FRAG_CB(curr)->ip_defrag_offset)
 				rbn = &parent->rb_left;
 			else if (offset >= curr_run_end)
 				rbn = &parent->rb_right;
-			else if (offset >= curr->ip_defrag_offset &&
+			else if (offset >= FRAG_CB(curr)->ip_defrag_offset &&
 				 end <= curr_run_end)
 				return IPFRAG_DUP;
 			else
@@ -438,23 +441,39 @@  int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
 		rb_insert_color(&skb->rbnode, &q->rb_fragments);
 	}
 
-	skb->ip_defrag_offset = offset;
+	FRAG_CB(skb)->ip_defrag_offset = offset;
 
 	return IPFRAG_OK;
 }
 EXPORT_SYMBOL(inet_frag_queue_insert);
 
+void tcp_wfree(struct sk_buff *skb);
 void *inet_frag_reasm_prepare(struct inet_frag_queue *q, struct sk_buff *skb,
 			      struct sk_buff *parent)
 {
 	struct sk_buff *fp, *head = skb_rb_first(&q->rb_fragments);
-	struct sk_buff **nextp;
+	void (*destructor)(struct sk_buff *);
+	unsigned int orig_truesize = 0;
+	struct sk_buff **nextp = NULL;
+	struct sock *sk = skb->sk;
 	int delta;
 
+	if (sk && is_skb_wmem(skb)) {
+		/* TX: skb->sk might have been passed as argument to
+		 * dst->output and must remain valid until tx completes.
+		 *
+		 * Move sk to reassembled skb and fix up wmem accounting.
+		 */
+		orig_truesize = skb->truesize;
+		destructor = skb->destructor;
+	}
+
 	if (head != skb) {
 		fp = skb_clone(skb, GFP_ATOMIC);
-		if (!fp)
-			return NULL;
+		if (!fp) {
+			head = skb;
+			goto out_restore_sk;
+		}
 		FRAG_CB(fp)->next_frag = FRAG_CB(skb)->next_frag;
 		if (RB_EMPTY_NODE(&skb->rbnode))
 			FRAG_CB(parent)->next_frag = fp;
@@ -463,6 +482,12 @@  void *inet_frag_reasm_prepare(struct inet_frag_queue *q, struct sk_buff *skb,
 					&q->rb_fragments);
 		if (q->fragments_tail == skb)
 			q->fragments_tail = fp;
+
+		if (orig_truesize) {
+			/* prevent skb_morph from releasing sk */
+			skb->sk = NULL;
+			skb->destructor = NULL;
+		}
 		skb_morph(skb, head);
 		FRAG_CB(skb)->next_frag = FRAG_CB(head)->next_frag;
 		rb_replace_node(&head->rbnode, &skb->rbnode,
@@ -470,13 +495,13 @@  void *inet_frag_reasm_prepare(struct inet_frag_queue *q, struct sk_buff *skb,
 		consume_skb(head);
 		head = skb;
 	}
-	WARN_ON(head->ip_defrag_offset != 0);
+	WARN_ON(FRAG_CB(head)->ip_defrag_offset != 0);
 
 	delta = -head->truesize;
 
 	/* Head of list must not be cloned. */
 	if (skb_unclone(head, GFP_ATOMIC))
-		return NULL;
+		goto out_restore_sk;
 
 	delta += head->truesize;
 	if (delta)
@@ -492,7 +517,7 @@  void *inet_frag_reasm_prepare(struct inet_frag_queue *q, struct sk_buff *skb,
 
 		clone = alloc_skb(0, GFP_ATOMIC);
 		if (!clone)
-			return NULL;
+			goto out_restore_sk;
 		skb_shinfo(clone)->frag_list = skb_shinfo(head)->frag_list;
 		skb_frag_list_init(head);
 		for (i = 0; i < skb_shinfo(head)->nr_frags; i++)
@@ -509,6 +534,21 @@  void *inet_frag_reasm_prepare(struct inet_frag_queue *q, struct sk_buff *skb,
 		nextp = &skb_shinfo(head)->frag_list;
 	}
 
+out_restore_sk:
+	if (orig_truesize) {
+		int ts_delta = head->truesize - orig_truesize;
+
+		/* if this reassembled skb is fragmented later,
+		 * fraglist skbs will get skb->sk assigned from head->sk,
+		 * and each frag skb will be released via sock_wfree.
+		 *
+		 * Update sk_wmem_alloc.
+		 */
+		head->sk = sk;
+		head->destructor = destructor;
+		refcount_add(ts_delta, &sk->sk_wmem_alloc);
+	}
+
 	return nextp;
 }
 EXPORT_SYMBOL(inet_frag_reasm_prepare);
@@ -516,6 +556,8 @@  EXPORT_SYMBOL(inet_frag_reasm_prepare);
 void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
 			    void *reasm_data, bool try_coalesce)
 {
+	struct sock *sk = is_skb_wmem(head) ? head->sk : NULL;
+	const unsigned int head_truesize = head->truesize;
 	struct sk_buff **nextp = reasm_data;
 	struct rb_node *rbn;
 	struct sk_buff *fp;
@@ -579,6 +621,9 @@  void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
 	head->prev = NULL;
 	head->tstamp = q->stamp;
 	head->mono_delivery_time = q->mono_delivery_time;
+
+	if (sk)
+		refcount_add(sum_truesize - head_truesize, &sk->sk_wmem_alloc);
 }
 EXPORT_SYMBOL(inet_frag_reasm_finish);
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index a4941f53b523..fb947d1613fe 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -384,6 +384,7 @@  static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	}
 
 	skb_dst_drop(skb);
+	skb_orphan(skb);
 	return -EINPROGRESS;
 
 insert_error:
@@ -487,7 +488,6 @@  int ip_defrag(struct net *net, struct sk_buff *skb, u32 user)
 	struct ipq *qp;
 
 	__IP_INC_STATS(net, IPSTATS_MIB_REASMREQDS);
-	skb_orphan(skb);
 
 	/* Lookup (or create) queue header */
 	qp = ip_find(net, ip_hdr(skb), user, vif);
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 1a51a44571c3..d0dcbaca1994 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -294,6 +294,7 @@  static int nf_ct_frag6_queue(struct frag_queue *fq, struct sk_buff *skb,
 	}
 
 	skb_dst_drop(skb);
+	skb_orphan(skb);
 	return -EINPROGRESS;
 
 insert_error:
@@ -469,7 +470,6 @@  int nf_ct_frag6_gather(struct net *net, struct sk_buff *skb, u32 user)
 	hdr = ipv6_hdr(skb);
 	fhdr = (struct frag_hdr *)skb_transport_header(skb);
 
-	skb_orphan(skb);
 	fq = fq_find(net, fhdr->identification, user, hdr,
 		     skb->dev ? skb->dev->ifindex : 0);
 	if (fq == NULL) {