
[net-next,v2] net/core: add optional threading for backlog processing

Message ID 20230328195925.94495-1-nbd@nbd.name (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series [net-next,v2] net/core: add optional threading for backlog processing

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 4306 this patch: 4306
netdev/cc_maintainers warning 2 maintainers not CCed: stephen@networkplumber.org bagasdotme@gmail.com
netdev/build_clang success Errors and warnings before: 963 this patch: 963
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 4508 this patch: 4508
netdev/checkpatch warning CHECK: Alignment should match open parenthesis
    CHECK: Blank lines aren't necessary before a close brace '}'
    WARNING: line length of 85 exceeds 80 columns
    WARNING: msleep < 20ms can sleep for up to 20ms; see Documentation/timers/timers-howto.rst
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Felix Fietkau March 28, 2023, 7:59 p.m. UTC
When dealing with few flows or an imbalance on CPU utilization, static RPS
CPU assignment can be too inflexible. Add support for enabling threaded NAPI
for backlog processing in order to allow the scheduler to better balance
processing. This helps better spread the load across idle CPUs.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
---
v2:
 - initialize sd->backlog.poll_list in order to fix switching backlogs to
   threaded mode when they have not been scheduled before
PATCH:
 - add missing process_queue_empty initialization
 - fix kthread leak
 - add documentation
RFC v3:
 - make patch more generic, applies to backlog processing in general
 - fix process queue access on flush
RFC v2:
 - fix rebase error in rps locking
 Documentation/admin-guide/sysctl/net.rst |  9 +++
 Documentation/networking/scaling.rst     | 20 ++++++
 include/linux/netdevice.h                |  2 +
 net/core/dev.c                           | 83 ++++++++++++++++++++++--
 net/core/sysctl_net_core.c               | 27 ++++++++
 5 files changed, 136 insertions(+), 5 deletions(-)

Comments

Eric Dumazet March 28, 2023, 10:30 p.m. UTC | #1
On Tue, Mar 28, 2023 at 9:59 PM Felix Fietkau <nbd@nbd.name> wrote:
>
> When dealing with few flows or an imbalance on CPU utilization, static RPS
> CPU assignment can be too inflexible. Add support for enabling threaded NAPI
> for backlog processing in order to allow the scheduler to better balance
> processing. This helps better spread the load across idle CPUs.
>
> Signed-off-by: Felix Fietkau <nbd@nbd.name>
> ---
> v2:
>  - initialize sd->backlog.poll_list in order to fix switching backlogs to
>    threaded mode when they have not been scheduled before
> PATCH:
>  - add missing process_queue_empty initialization
>  - fix kthread leak
>  - add documentation
> RFC v3:
>  - make patch more generic, applies to backlog processing in general
>  - fix process queue access on flush
> RFC v2:
>  - fix rebase error in rps locking
>  Documentation/admin-guide/sysctl/net.rst |  9 +++
>  Documentation/networking/scaling.rst     | 20 ++++++
>  include/linux/netdevice.h                |  2 +
>  net/core/dev.c                           | 83 ++++++++++++++++++++++--
>  net/core/sysctl_net_core.c               | 27 ++++++++
>  5 files changed, 136 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
> index 466c560b0c30..6d037633a52f 100644
> --- a/Documentation/admin-guide/sysctl/net.rst
> +++ b/Documentation/admin-guide/sysctl/net.rst
> @@ -47,6 +47,15 @@ Table : Subdirectories in /proc/sys/net
>  1. /proc/sys/net/core - Network core options
>  ============================================
>
> +backlog_threaded
> +----------------
> +
> +This offloads processing of backlog (input packets steered by RPS, or
> +queued because the kernel is receiving more than it can handle on the
> +incoming CPU) to threads (one for each CPU) instead of processing them
> +in softirq context. This can improve load balancing by allowing the
> +scheduler to better spread the load across idle CPUs.
> +
>  bpf_jit_enable
>  --------------
>
> diff --git a/Documentation/networking/scaling.rst b/Documentation/networking/scaling.rst
> index 3d435caa3ef2..ded6fc713304 100644
> --- a/Documentation/networking/scaling.rst
> +++ b/Documentation/networking/scaling.rst
> @@ -244,6 +244,26 @@ Setting net.core.netdev_max_backlog to either 1000 or 10000
>  performed well in experiments.
>
>
> +Threaded Backlog
> +~~~~~~~~~~~~~~~~
> +
> +When dealing with few flows or an imbalance on CPU utilization, static
> +RPS CPU assignment can be too inflexible. Making backlog processing
> +threaded can improve load balancing by allowing the scheduler to spread
> +the load across idle CPUs.
> +
> +
> +Suggested Configuration
> +~~~~~~~~~~~~~~~~~~~~~~~
> +
> +If you have CPUs fully utilized with network processing, you can enable
> +threaded backlog processing by setting /proc/sys/net/core/backlog_threaded
> +to 1. Afterwards, RPS CPU configuration bits no longer refer to CPU
> +numbers, but to backlog threads named napi/backlog-<n>.
> +If necessary, you can change the CPU affinity of these threads to limit
> +them to specific CPU cores.
> +
> +
>  RFS: Receive Flow Steering
>  ==========================
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 18a5be6ddd0f..953876cb0e92 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -527,6 +527,7 @@ static inline bool napi_complete(struct napi_struct *n)
>  }
>
>  int dev_set_threaded(struct net_device *dev, bool threaded);
> +int backlog_set_threaded(bool threaded);
>
>  /**
>   *     napi_disable - prevent NAPI from scheduling
> @@ -3217,6 +3218,7 @@ struct softnet_data {
>         unsigned int            cpu;
>         unsigned int            input_queue_tail;
>  #endif
> +       unsigned int            process_queue_empty;

Hmmm... probably better placed close to input_queue_head, to share the
dedicated cache line already dirtied by input_queue_head_incr().

I also think we could avoid adding this new field.

Use input_queue_head instead, latching its value in flush_backlog()
and adding the sd->process_queue length?

Then wait for (s32)(input_queue_head - latch) >= 0?
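
Something along these lines (untested sketch, reusing the rps_lock helpers
this patch already uses; glossing over the CONFIG_RPS guard around
input_queue_head):

	unsigned int latch;
	bool wait;

	rps_lock_irq_disable(sd);
	/* ... existing input_pkt_queue purge stays as-is ... */
	wait = test_bit(NAPI_STATE_THREADED, &sd->backlog.state) &&
	       !skb_queue_empty(&sd->process_queue);
	latch = sd->input_queue_head + skb_queue_len(&sd->process_queue);
	rps_unlock_irq_enable(sd);

	while (wait) {
		msleep(1);
		rps_lock_irq_disable(sd);
		/* done once the backlog thread consumed the latched packets */
		wait = (s32)(sd->input_queue_head - latch) < 0;
		rps_unlock_irq_enable(sd);
	}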


>                         /*
>                          * Inline a custom version of __napi_complete().
> -                        * only current cpu owns and manipulates this napi,
> -                        * and NAPI_STATE_SCHED is the only possible flag set
> -                        * on backlog.
> +                        * only current cpu owns and manipulates this napi.
>                          * We can use a plain write instead of clear_bit(),
>                          * and we dont need an smp_mb() memory barrier.
>                          */
> -                       napi->state = 0;
> +                       napi->state &= ~(NAPIF_STATE_SCHED |
> +                                        NAPIF_STATE_SCHED_THREADED);
>                         again = false;
>                 } else {
>                         skb_queue_splice_tail_init(&sd->input_pkt_queue,
> @@ -6350,6 +6370,55 @@ int dev_set_threaded(struct net_device *dev, bool threaded)
>  }
>  EXPORT_SYMBOL(dev_set_threaded);
>
> +int backlog_set_threaded(bool threaded)
> +{
> +       static bool backlog_threaded;
> +       int err = 0;
> +       int i;
> +
> +       if (backlog_threaded == threaded)
> +               return 0;
> +
> +       for_each_possible_cpu(i) {
> +               struct softnet_data *sd = &per_cpu(softnet_data, i);
> +               struct napi_struct *n = &sd->backlog;
> +
> +               if (n->thread)
> +                       continue;
> +               n->thread = kthread_run(napi_threaded_poll, n, "napi/backlog-%d", i);
> +               if (IS_ERR(n->thread)) {
> +                       err = PTR_ERR(n->thread);
> +                       pr_err("kthread_run failed with err %d\n", err);
> +                       n->thread = NULL;
> +                       threaded = false;
> +                       break;
> +               }
> +
> +       }
> +
> +       backlog_threaded = threaded;
> +
> +       /* Make sure kthread is created before THREADED bit
> +        * is set.
> +        */
> +       smp_mb__before_atomic();
> +
> +       for_each_possible_cpu(i) {
> +               struct softnet_data *sd = &per_cpu(softnet_data, i);
> +               struct napi_struct *n = &sd->backlog;
> +               unsigned long flags;
> +
> +               rps_lock_irqsave(sd, &flags);
> +               if (threaded)
> +                       n->state |= NAPIF_STATE_THREADED;
> +               else
> +                       n->state &= ~NAPIF_STATE_THREADED;
> +               rps_unlock_irq_restore(sd, &flags);
> +       }
> +
> +       return err;
> +}
> +
>  void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
>                            int (*poll)(struct napi_struct *, int), int weight)
>  {
> @@ -11108,6 +11177,9 @@ static int dev_cpu_dead(unsigned int oldcpu)
>         raise_softirq_irqoff(NET_TX_SOFTIRQ);
>         local_irq_enable();
>
> +       if (test_bit(NAPI_STATE_THREADED, &oldsd->backlog.state))
> +               return 0;
> +
>  #ifdef CONFIG_RPS
>         remsd = oldsd->rps_ipi_list;
>         oldsd->rps_ipi_list = NULL;
> @@ -11411,6 +11483,7 @@ static int __init net_dev_init(void)
>                 INIT_CSD(&sd->defer_csd, trigger_rx_softirq, sd);
>                 spin_lock_init(&sd->defer_lock);
>
> +               INIT_LIST_HEAD(&sd->backlog.poll_list);
>                 init_gro_hash(&sd->backlog);
>                 sd->backlog.poll = process_backlog;
>                 sd->backlog.weight = weight_p;
> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
> index 74842b453407..77114cd0b021 100644
> --- a/net/core/sysctl_net_core.c
> +++ b/net/core/sysctl_net_core.c
> @@ -30,6 +30,7 @@ static int int_3600 = 3600;
>  static int min_sndbuf = SOCK_MIN_SNDBUF;
>  static int min_rcvbuf = SOCK_MIN_RCVBUF;
>  static int max_skb_frags = MAX_SKB_FRAGS;
> +static int backlog_threaded;
>
>  static int net_msg_warn;       /* Unused, but still a sysctl */
>
> @@ -188,6 +189,23 @@ static int rps_sock_flow_sysctl(struct ctl_table *table, int write,
>  }
>  #endif /* CONFIG_RPS */
>
> +static int backlog_threaded_sysctl(struct ctl_table *table, int write,
> +                              void *buffer, size_t *lenp, loff_t *ppos)
> +{
> +       static DEFINE_MUTEX(backlog_threaded_mutex);
> +       int ret;
> +
> +       mutex_lock(&backlog_threaded_mutex);
> +
> +       ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> +       if (write && !ret)
> +               ret = backlog_set_threaded(backlog_threaded);
> +
> +       mutex_unlock(&backlog_threaded_mutex);
> +
> +       return ret;
> +}
> +
>  #ifdef CONFIG_NET_FLOW_LIMIT
>  static DEFINE_MUTEX(flow_limit_update_mutex);
>
> @@ -532,6 +550,15 @@ static struct ctl_table net_core_table[] = {
>                 .proc_handler   = rps_sock_flow_sysctl
>         },
>  #endif
> +       {
> +               .procname       = "backlog_threaded",
> +               .data           = &backlog_threaded,
> +               .maxlen         = sizeof(unsigned int),
> +               .mode           = 0644,
> +               .proc_handler   = backlog_threaded_sysctl,
> +               .extra1         = SYSCTL_ZERO,
> +               .extra2         = SYSCTL_ONE
> +       },
>  #ifdef CONFIG_NET_FLOW_LIMIT
>         {
>                 .procname       = "flow_limit_cpu_bitmap",
> --
> 2.39.0
>
Jakub Kicinski March 28, 2023, 11:16 p.m. UTC | #2
On Tue, 28 Mar 2023 21:59:25 +0200 Felix Fietkau wrote:
> When dealing with few flows or an imbalance on CPU utilization, static RPS
> CPU assignment can be too inflexible. Add support for enabling threaded NAPI
> for backlog processing in order to allow the scheduler to better balance
> processing. This helps better spread the load across idle CPUs.

Can you share some numbers vs a system where RPS only spreads to 
the cores which are not running NAPI?

IMHO you're putting a lot of faith in the scheduler and you need 
to show that it actually does what you say it will do.
Paolo Abeni March 30, 2023, 11:01 a.m. UTC | #3
On Tue, 2023-03-28 at 16:16 -0700, Jakub Kicinski wrote:
> On Tue, 28 Mar 2023 21:59:25 +0200 Felix Fietkau wrote:
> > When dealing with few flows or an imbalance on CPU utilization, static RPS
> > CPU assignment can be too inflexible. Add support for enabling threaded NAPI
> > for backlog processing in order to allow the scheduler to better balance
> > processing. This helps better spread the load across idle CPUs.
> 
> Can you share some numbers vs a system where RPS only spreads to 
> the cores which are not running NAPI?
> 
> IMHO you're putting a lot of faith in the scheduler and you need 
> to show that it actually does what you say it will do.

I have the same feeling. From your description I think some gain is
possible if there are no other processes running except
ksoftirq/rps/threaded napi. 

I guess the above is the expected average state for a small s/w router,
but if/when a routing daemon/igmp proxy/local web server kicks in, you
should notice measurably higher latency (compared to plain RPS in the
same scenario)?

Cheers,

Paolo
Felix Fietkau March 30, 2023, 11:11 a.m. UTC | #4
On 30.03.23 13:01, Paolo Abeni wrote:
> On Tue, 2023-03-28 at 16:16 -0700, Jakub Kicinski wrote:
>> On Tue, 28 Mar 2023 21:59:25 +0200 Felix Fietkau wrote:
>> > When dealing with few flows or an imbalance on CPU utilization, static RPS
>> > CPU assignment can be too inflexible. Add support for enabling threaded NAPI
>> > for backlog processing in order to allow the scheduler to better balance
>> > processing. This helps better spread the load across idle CPUs.
>> 
>> Can you share some numbers vs a system where RPS only spreads to 
>> the cores which are not running NAPI?
>> 
>> IMHO you're putting a lot of faith in the scheduler and you need 
>> to show that it actually does what you say it will do.
I will run some more tests as soon as I have time for it.

> I have the same feeling. From your description I think some gain is
> possible if there are no other processes running except
> ksoftirq/rps/threaded napi.
> 
> I guess the above is the expected average state for a small s/w router,
> but if/when a routing daemon/igmp proxy/local web server kicks in, you
> should notice measurably higher latency (compared to plain RPS in the
> same scenario)?
Depends on the process priority, I guess.

The main thing I'm trying to fix is that RPS, as implemented right now, is
too static for devices routing traffic at the limit of their CPU capacity.
Even if you manage to tune it properly for simple Ethernet NAT, adding
WLAN to the mix can easily throw a wrench into the works, because it's
hard to cover shifting usage patterns with a simple static assignment.

- Felix

Patch

diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
index 466c560b0c30..6d037633a52f 100644
--- a/Documentation/admin-guide/sysctl/net.rst
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -47,6 +47,15 @@  Table : Subdirectories in /proc/sys/net
 1. /proc/sys/net/core - Network core options
 ============================================
 
+backlog_threaded
+----------------
+
+This offloads processing of backlog (input packets steered by RPS, or
+queued because the kernel is receiving more than it can handle on the
+incoming CPU) to threads (one for each CPU) instead of processing them
+in softirq context. This can improve load balancing by allowing the
+scheduler to better spread the load across idle CPUs.
+
 bpf_jit_enable
 --------------
 
diff --git a/Documentation/networking/scaling.rst b/Documentation/networking/scaling.rst
index 3d435caa3ef2..ded6fc713304 100644
--- a/Documentation/networking/scaling.rst
+++ b/Documentation/networking/scaling.rst
@@ -244,6 +244,26 @@  Setting net.core.netdev_max_backlog to either 1000 or 10000
 performed well in experiments.
 
 
+Threaded Backlog
+~~~~~~~~~~~~~~~~
+
+When dealing with few flows or an imbalance on CPU utilization, static
+RPS CPU assignment can be too inflexible. Making backlog processing
+threaded can improve load balancing by allowing the scheduler to spread
+the load across idle CPUs.
+
+
+Suggested Configuration
+~~~~~~~~~~~~~~~~~~~~~~~
+
+If you have CPUs fully utilized with network processing, you can enable
+threaded backlog processing by setting /proc/sys/net/core/backlog_threaded
+to 1. Afterwards, RPS CPU configuration bits no longer refer to CPU
+numbers, but to backlog threads named napi/backlog-<n>.
+If necessary, you can change the CPU affinity of these threads to limit
+them to specific CPU cores.
+
+
 RFS: Receive Flow Steering
 ==========================
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 18a5be6ddd0f..953876cb0e92 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -527,6 +527,7 @@  static inline bool napi_complete(struct napi_struct *n)
 }
 
 int dev_set_threaded(struct net_device *dev, bool threaded);
+int backlog_set_threaded(bool threaded);
 
 /**
  *	napi_disable - prevent NAPI from scheduling
@@ -3217,6 +3218,7 @@  struct softnet_data {
 	unsigned int		cpu;
 	unsigned int		input_queue_tail;
 #endif
+	unsigned int		process_queue_empty;
 	unsigned int		received_rps;
 	unsigned int		dropped;
 	struct sk_buff_head	input_pkt_queue;
diff --git a/net/core/dev.c b/net/core/dev.c
index 7172334a418f..58360aee53e9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4591,7 +4591,7 @@  static int napi_schedule_rps(struct softnet_data *sd)
 	struct softnet_data *mysd = this_cpu_ptr(&softnet_data);
 
 #ifdef CONFIG_RPS
-	if (sd != mysd) {
+	if (sd != mysd && !test_bit(NAPI_STATE_THREADED, &sd->backlog.state)) {
 		sd->rps_ipi_next = mysd->rps_ipi_list;
 		mysd->rps_ipi_list = sd;
 
@@ -5772,6 +5772,8 @@  static DEFINE_PER_CPU(struct work_struct, flush_works);
 /* Network device is going away, flush any packets still pending */
 static void flush_backlog(struct work_struct *work)
 {
+	unsigned int process_queue_empty;
+	bool threaded, flush_processq;
 	struct sk_buff *skb, *tmp;
 	struct softnet_data *sd;
 
@@ -5786,8 +5788,17 @@  static void flush_backlog(struct work_struct *work)
 			input_queue_head_incr(sd);
 		}
 	}
+
+	threaded = test_bit(NAPI_STATE_THREADED, &sd->backlog.state);
+	flush_processq = threaded &&
+			 !skb_queue_empty_lockless(&sd->process_queue);
+	if (flush_processq)
+		process_queue_empty = sd->process_queue_empty;
 	rps_unlock_irq_enable(sd);
 
+	if (threaded)
+		goto out;
+
 	skb_queue_walk_safe(&sd->process_queue, skb, tmp) {
 		if (skb->dev->reg_state == NETREG_UNREGISTERING) {
 			__skb_unlink(skb, &sd->process_queue);
@@ -5795,7 +5806,16 @@  static void flush_backlog(struct work_struct *work)
 			input_queue_head_incr(sd);
 		}
 	}
+
+out:
 	local_bh_enable();
+
+	while (flush_processq) {
+		msleep(1);
+		rps_lock_irq_disable(sd);
+		flush_processq = process_queue_empty == sd->process_queue_empty;
+		rps_unlock_irq_enable(sd);
+	}
 }
 
 static bool flush_required(int cpu)
@@ -5927,16 +5947,16 @@  static int process_backlog(struct napi_struct *napi, int quota)
 		}
 
 		rps_lock_irq_disable(sd);
+		sd->process_queue_empty++;
 		if (skb_queue_empty(&sd->input_pkt_queue)) {
 			/*
 			 * Inline a custom version of __napi_complete().
-			 * only current cpu owns and manipulates this napi,
-			 * and NAPI_STATE_SCHED is the only possible flag set
-			 * on backlog.
+			 * only current cpu owns and manipulates this napi.
 			 * We can use a plain write instead of clear_bit(),
 			 * and we dont need an smp_mb() memory barrier.
 			 */
-			napi->state = 0;
+			napi->state &= ~(NAPIF_STATE_SCHED |
+					 NAPIF_STATE_SCHED_THREADED);
 			again = false;
 		} else {
 			skb_queue_splice_tail_init(&sd->input_pkt_queue,
@@ -6350,6 +6370,55 @@  int dev_set_threaded(struct net_device *dev, bool threaded)
 }
 EXPORT_SYMBOL(dev_set_threaded);
 
+int backlog_set_threaded(bool threaded)
+{
+	static bool backlog_threaded;
+	int err = 0;
+	int i;
+
+	if (backlog_threaded == threaded)
+		return 0;
+
+	for_each_possible_cpu(i) {
+		struct softnet_data *sd = &per_cpu(softnet_data, i);
+		struct napi_struct *n = &sd->backlog;
+
+		if (n->thread)
+			continue;
+		n->thread = kthread_run(napi_threaded_poll, n, "napi/backlog-%d", i);
+		if (IS_ERR(n->thread)) {
+			err = PTR_ERR(n->thread);
+			pr_err("kthread_run failed with err %d\n", err);
+			n->thread = NULL;
+			threaded = false;
+			break;
+		}
+
+	}
+
+	backlog_threaded = threaded;
+
+	/* Make sure kthread is created before THREADED bit
+	 * is set.
+	 */
+	smp_mb__before_atomic();
+
+	for_each_possible_cpu(i) {
+		struct softnet_data *sd = &per_cpu(softnet_data, i);
+		struct napi_struct *n = &sd->backlog;
+		unsigned long flags;
+
+		rps_lock_irqsave(sd, &flags);
+		if (threaded)
+			n->state |= NAPIF_STATE_THREADED;
+		else
+			n->state &= ~NAPIF_STATE_THREADED;
+		rps_unlock_irq_restore(sd, &flags);
+	}
+
+	return err;
+}
+
 void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
 			   int (*poll)(struct napi_struct *, int), int weight)
 {
@@ -11108,6 +11177,9 @@  static int dev_cpu_dead(unsigned int oldcpu)
 	raise_softirq_irqoff(NET_TX_SOFTIRQ);
 	local_irq_enable();
 
+	if (test_bit(NAPI_STATE_THREADED, &oldsd->backlog.state))
+		return 0;
+
 #ifdef CONFIG_RPS
 	remsd = oldsd->rps_ipi_list;
 	oldsd->rps_ipi_list = NULL;
@@ -11411,6 +11483,7 @@  static int __init net_dev_init(void)
 		INIT_CSD(&sd->defer_csd, trigger_rx_softirq, sd);
 		spin_lock_init(&sd->defer_lock);
 
+		INIT_LIST_HEAD(&sd->backlog.poll_list);
 		init_gro_hash(&sd->backlog);
 		sd->backlog.poll = process_backlog;
 		sd->backlog.weight = weight_p;
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 74842b453407..77114cd0b021 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -30,6 +30,7 @@  static int int_3600 = 3600;
 static int min_sndbuf = SOCK_MIN_SNDBUF;
 static int min_rcvbuf = SOCK_MIN_RCVBUF;
 static int max_skb_frags = MAX_SKB_FRAGS;
+static int backlog_threaded;
 
 static int net_msg_warn;	/* Unused, but still a sysctl */
 
@@ -188,6 +189,23 @@  static int rps_sock_flow_sysctl(struct ctl_table *table, int write,
 }
 #endif /* CONFIG_RPS */
 
+static int backlog_threaded_sysctl(struct ctl_table *table, int write,
+			       void *buffer, size_t *lenp, loff_t *ppos)
+{
+	static DEFINE_MUTEX(backlog_threaded_mutex);
+	int ret;
+
+	mutex_lock(&backlog_threaded_mutex);
+
+	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+	if (write && !ret)
+		ret = backlog_set_threaded(backlog_threaded);
+
+	mutex_unlock(&backlog_threaded_mutex);
+
+	return ret;
+}
+
 #ifdef CONFIG_NET_FLOW_LIMIT
 static DEFINE_MUTEX(flow_limit_update_mutex);
 
@@ -532,6 +550,15 @@  static struct ctl_table net_core_table[] = {
 		.proc_handler	= rps_sock_flow_sysctl
 	},
 #endif
+	{
+		.procname	= "backlog_threaded",
+		.data		= &backlog_threaded,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= backlog_threaded_sysctl,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE
+	},
 #ifdef CONFIG_NET_FLOW_LIMIT
 	{
 		.procname	= "flow_limit_cpu_bitmap",