| Message ID | 20240513125346.764076-2-haakon.bugge@oracle.com (mailing list archive) |
|---|---|
| State | Superseded |
| Series | rds: rdma: Add ability to force GFP_NOIO |
Hello,

On Mon, May 13, 2024 at 02:53:41PM +0200, Håkon Bugge wrote:
> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> index 158784dd189ab..09ecc692ffcae 100644
> --- a/include/linux/workqueue.h
> +++ b/include/linux/workqueue.h
> @@ -398,6 +398,8 @@ enum wq_flags {
> 	__WQ_DRAINING		= 1 << 16, /* internal: workqueue is draining */
> 	__WQ_ORDERED		= 1 << 17, /* internal: workqueue is ordered */
> 	__WQ_LEGACY		= 1 << 18, /* internal: create*_workqueue() */
> +	__WQ_NOIO		= 1 << 19, /* internal: execute work with NOIO */
> +	__WQ_NOFS		= 1 << 20, /* internal: execute work with NOFS */

I don't quite understand how this is supposed to be used. The flags are
marked internal but nothing actually sets them. Looking at later patches, I
don't see any usages either. What am I missing?

> @@ -3184,6 +3189,12 @@ __acquires(&pool->lock)
>
> 	lockdep_copy_map(&lockdep_map, &work->lockdep_map);
> #endif
> +	/* Set inherited alloc flags */
> +	if (use_noio_allocs)
> +		noio_flags = memalloc_noio_save();
> +	if (use_nofs_allocs)
> +		nofs_flags = memalloc_nofs_save();
> +
> 	/* ensure we're on the correct CPU */
> 	WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
> 		     raw_smp_processor_id() != pool->cpu);
> @@ -3320,6 +3331,12 @@ __acquires(&pool->lock)
>
> 	/* must be the last step, see the function comment */
> 	pwq_dec_nr_in_flight(pwq, work_data);
> +
> +	/* Restore alloc flags */
> +	if (use_nofs_allocs)
> +		memalloc_nofs_restore(nofs_flags);
> +	if (use_noio_allocs)
> +		memalloc_noio_restore(noio_flags);

Also, this looks like something that the work function can do on entry and
before exit, no?

Thanks.
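For illustration, a minimal sketch of the alternative suggested above, i.e. doing the save/restore inside the work function itself rather than in process_one_work(). The worker name is made up; this is not code from the series:

static void rds_example_worker(struct work_struct *work)
{
	/* Enter a GFP_NOIO allocation scope for the duration of this work item */
	unsigned int noio_flags = memalloc_noio_save();

	/* ... actual work body; allocations in this scope implicitly drop __GFP_IO ... */

	/* Leave the scope before returning to the workqueue core */
	memalloc_noio_restore(noio_flags);
}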
> Hello,
>
> On Mon, May 13, 2024 at 02:53:41PM +0200, Håkon Bugge wrote:
> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> index 158784dd189ab..09ecc692ffcae 100644
> --- a/include/linux/workqueue.h
> +++ b/include/linux/workqueue.h
> @@ -398,6 +398,8 @@ enum wq_flags {
> 	__WQ_DRAINING		= 1 << 16, /* internal: workqueue is draining */
> 	__WQ_ORDERED		= 1 << 17, /* internal: workqueue is ordered */
> 	__WQ_LEGACY		= 1 << 18, /* internal: create*_workqueue() */
> +	__WQ_NOIO		= 1 << 19, /* internal: execute work with NOIO */
> +	__WQ_NOFS		= 1 << 20, /* internal: execute work with NOFS */
>
> I don't quite understand how this is supposed to be used. The flags are
> marked internal but nothing actually sets them. Looking at later patches, I
> don't see any usages either. What am I missing?

Hi Tejun,

Thank you for so quickly looking into this. You are quite right. During
re-basing, I missed a hunk from alloc_workqueue(), where I sample
current->flags for the PF_MEMALLOC_{NOIO,NOFS} bits and set the
corresponding bits in wq->flags. Fixed in v2.

> @@ -3184,6 +3189,12 @@ __acquires(&pool->lock)
>
> 	lockdep_copy_map(&lockdep_map, &work->lockdep_map);
> #endif
> +	/* Set inherited alloc flags */
> +	if (use_noio_allocs)
> +		noio_flags = memalloc_noio_save();
> +	if (use_nofs_allocs)
> +		nofs_flags = memalloc_nofs_save();
> +
> 	/* ensure we're on the correct CPU */
> 	WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
> 		     raw_smp_processor_id() != pool->cpu);
> @@ -3320,6 +3331,12 @@ __acquires(&pool->lock)
>
> 	/* must be the last step, see the function comment */
> 	pwq_dec_nr_in_flight(pwq, work_data);
> +
> +	/* Restore alloc flags */
> +	if (use_nofs_allocs)
> +		memalloc_nofs_restore(nofs_flags);
> +	if (use_noio_allocs)
> +		memalloc_noio_restore(noio_flags);
>
> Also, this looks like something that the work function can do on entry and
> before exit, no?

It _can_ be done in the work functions, but that would be code sprawl. In
RDS alone, we have the following worker functions:

rds_ib_odp_mr_worker();
rds_ib_mr_pool_flush_worker()
rds_ib_odp_mr_worker()
rds_tcp_accept_worker()
rds_connect_worker()
rds_send_worker()
rds_recv_worker()
rds_shutdown_worker()

Adding the ones from ib_cm, rdma_cm, mlx5_ib, and mlx5_core, I strongly
prefer to have it in one place.

Thxs, Håkon
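For illustration, a rough sketch of the kind of alloc_workqueue() hunk being described here. The exact v2 change may differ; the placement inside alloc_workqueue() and the use of its local flags variable are assumptions:

	/* Hypothetical hunk inside alloc_workqueue(): inherit the creator's
	 * memalloc scope into the new workqueue's flags.
	 */
	if (current->flags & PF_MEMALLOC_NOIO)
		flags |= __WQ_NOIO;
	if (current->flags & PF_MEMALLOC_NOFS)
		flags |= __WQ_NOFS;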
Hello,

On Tue, May 14, 2024 at 01:48:24PM +0000, Haakon Bugge wrote:
> > Also, this looks like something that the work function can do on entry and
> > before exit, no?
>
> It _can_ be done in the work functions, but that would be code sprawl. In
> RDS alone, we have the following worker functions:
>
> rds_ib_odp_mr_worker();
> rds_ib_mr_pool_flush_worker()
> rds_ib_odp_mr_worker()
> rds_tcp_accept_worker()
> rds_connect_worker()
> rds_send_worker()
> rds_recv_worker()
> rds_shutdown_worker()
>
> Adding the ones from ib_cm, rdma_cm, mlx5_ib, and mlx5_core, I strongly
> prefer to have it in one place.

I haven't seen the code yet, so can't tell for sure, but if you're
automatically inheriting these flags from the scheduling site, I don't think
that's going to work. Note that getting a different, more permissive
allocation context is one of the reasons why one might want to use
workqueues, so it'd have to be explicit whether it's done in the workqueue
or in its users.

Thanks.
Hi Tejun,

> On 14 May 2024, at 18:49, Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, May 14, 2024 at 01:48:24PM +0000, Haakon Bugge wrote:
>
> I haven't seen the code yet, so can't tell for sure, but if you're
> automatically inheriting these flags from the scheduling site, I don't think
> that's going to work. Note that getting a different, more permissive
> allocation context is one of the reasons why one might want to use
> workqueues, so it'd have to be explicit whether it's done in the workqueue
> or in its users.

I wanted to hold off sending the v2 in case more comments came in, but I
have sent it now.

Thxs, Håkon
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 158784dd189ab..09ecc692ffcae 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -398,6 +398,8 @@ enum wq_flags {
 	__WQ_DRAINING		= 1 << 16, /* internal: workqueue is draining */
 	__WQ_ORDERED		= 1 << 17, /* internal: workqueue is ordered */
 	__WQ_LEGACY		= 1 << 18, /* internal: create*_workqueue() */
+	__WQ_NOIO		= 1 << 19, /* internal: execute work with NOIO */
+	__WQ_NOFS		= 1 << 20, /* internal: execute work with NOFS */

 	/* BH wq only allows the following flags */
 	__WQ_BH_ALLOWS		= WQ_BH | WQ_HIGHPRI,
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index d2dbe099286b9..a1d166a7c0f85 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -51,6 +51,7 @@
 #include <linux/uaccess.h>
 #include <linux/sched/isolation.h>
 #include <linux/sched/debug.h>
+#include <linux/sched/mm.h>
 #include <linux/nmi.h>
 #include <linux/kvm_para.h>
 #include <linux/delay.h>
@@ -3172,6 +3173,10 @@ __acquires(&pool->lock)
 	unsigned long work_data;
 	int lockdep_start_depth, rcu_start_depth;
 	bool bh_draining = pool->flags & POOL_BH_DRAINING;
+	bool use_noio_allocs = pwq->wq->flags & __WQ_NOIO;
+	bool use_nofs_allocs = pwq->wq->flags & __WQ_NOFS;
+	unsigned long noio_flags;
+	unsigned long nofs_flags;
 #ifdef CONFIG_LOCKDEP
 	/*
 	 * It is permissible to free the struct work_struct from
@@ -3184,6 +3189,12 @@ __acquires(&pool->lock)

 	lockdep_copy_map(&lockdep_map, &work->lockdep_map);
 #endif
+	/* Set inherited alloc flags */
+	if (use_noio_allocs)
+		noio_flags = memalloc_noio_save();
+	if (use_nofs_allocs)
+		nofs_flags = memalloc_nofs_save();
+
 	/* ensure we're on the correct CPU */
 	WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
 		     raw_smp_processor_id() != pool->cpu);
@@ -3320,6 +3331,12 @@ __acquires(&pool->lock)

 	/* must be the last step, see the function comment */
 	pwq_dec_nr_in_flight(pwq, work_data);
+
+	/* Restore alloc flags */
+	if (use_nofs_allocs)
+		memalloc_nofs_restore(nofs_flags);
+	if (use_noio_allocs)
+		memalloc_noio_restore(noio_flags);
 }

 /**
For drivers/modules running inside a memalloc_{noio,nofs}_{save,restore}
region, if a work-queue is created, we make sure work executed on the
work-queue inherits the same flag(s), in order to conditionally enable
drivers to work aligned with block I/O devices.

This commit makes sure that any work queued later on work-queues created
during module initialization, when current->flags has
PF_MEMALLOC_{NOIO,NOFS} set, will inherit the same flags. We do this in
order to enable drivers to be used as a network block I/O device, and thus
to support XFS or other file-systems on top of a raw block device which
uses said drivers as the network transport layer.

Under intense memory pressure, we get memory reclaims. Assume the
file-system reclaims memory, goes to the raw block device, which calls into
said drivers. Now, if regular GFP_KERNEL allocations in the drivers require
reclaims to be fulfilled, we end up in a circular dependency.

We break this circular dependency by:

1. Forcing all allocations in the drivers to use GFP_NOIO, by means of a
   parenthetic use of memalloc_noio_{save,restore} on all relevant entry
   points.

2. Making sure work-queues inherit current->flags wrt.
   PF_MEMALLOC_{NOIO,NOFS}, such that work executed on the work-queue
   inherits the same flag(s). That is what this commit contributes.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
---
 include/linux/workqueue.h |  2 ++
 kernel/workqueue.c        | 17 +++++++++++++++++
 2 files changed, 19 insertions(+)
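For illustration, a hypothetical driver-side sketch of how the mechanism above would be used, assuming the v2 alloc_workqueue() change that samples current->flags. The module and workqueue names are made up:

#include <linux/module.h>
#include <linux/sched/mm.h>
#include <linux/workqueue.h>

static struct workqueue_struct *example_wq;

static int __init example_init(void)
{
	/* Create the workqueue inside a NOIO scope so that, with this series,
	 * it is flagged __WQ_NOIO and work items later executed on it run
	 * with GFP_NOIO semantics.
	 */
	unsigned int noio_flags = memalloc_noio_save();

	example_wq = alloc_workqueue("example_wq", WQ_MEM_RECLAIM, 0);

	memalloc_noio_restore(noio_flags);
	return example_wq ? 0 : -ENOMEM;
}
module_init(example_init);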