| Message ID | 20240513125346.764076-2-haakon.bugge@oracle.com (mailing list archive) |
|---|---|
| State | Superseded |
| Series | rds: rdma: Add ability to force GFP_NOIO |
Hello,

On Mon, May 13, 2024 at 02:53:41PM +0200, Håkon Bugge wrote:
> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> index 158784dd189ab..09ecc692ffcae 100644
> --- a/include/linux/workqueue.h
> +++ b/include/linux/workqueue.h
> @@ -398,6 +398,8 @@ enum wq_flags {
> 	__WQ_DRAINING		= 1 << 16, /* internal: workqueue is draining */
> 	__WQ_ORDERED		= 1 << 17, /* internal: workqueue is ordered */
> 	__WQ_LEGACY		= 1 << 18, /* internal: create*_workqueue() */
> +	__WQ_NOIO		= 1 << 19, /* internal: execute work with NOIO */
> +	__WQ_NOFS		= 1 << 20, /* internal: execute work with NOFS */

I don't quite understand how this is supposed to be used. The flags are
marked internal but nothing actually sets them. Looking at later patches, I
don't see any usages either. What am I missing?

> @@ -3184,6 +3189,12 @@ __acquires(&pool->lock)
>
> 	lockdep_copy_map(&lockdep_map, &work->lockdep_map);
> #endif
> +	/* Set inherited alloc flags */
> +	if (use_noio_allocs)
> +		noio_flags = memalloc_noio_save();
> +	if (use_nofs_allocs)
> +		nofs_flags = memalloc_nofs_save();
> +
> 	/* ensure we're on the correct CPU */
> 	WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
> 		     raw_smp_processor_id() != pool->cpu);
> @@ -3320,6 +3331,12 @@ __acquires(&pool->lock)
>
> 	/* must be the last step, see the function comment */
> 	pwq_dec_nr_in_flight(pwq, work_data);
> +
> +	/* Restore alloc flags */
> +	if (use_nofs_allocs)
> +		memalloc_nofs_restore(nofs_flags);
> +	if (use_noio_allocs)
> +		memalloc_noio_restore(noio_flags);

Also, this looks like something that the work function can do on entry and
before exit, no?

Thanks.
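For illustration, a minimal sketch of the alternative suggested above, i.e. doing the save/restore inside the work function itself rather than in process_one_work(). The worker name is made up; this is not code from the series:

static void rds_example_worker(struct work_struct *work)
{
	/* Enter a GFP_NOIO allocation scope for the duration of this work item */
	unsigned int noio_flags = memalloc_noio_save();

	/* ... actual work body; allocations in this scope implicitly drop __GFP_IO ... */

	/* Leave the scope before returning to the workqueue core */
	memalloc_noio_restore(noio_flags);
}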
> Hello,
>
> On Mon, May 13, 2024 at 02:53:41PM +0200, Håkon Bugge wrote:
> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> index 158784dd189ab..09ecc692ffcae 100644
> --- a/include/linux/workqueue.h
> +++ b/include/linux/workqueue.h
> @@ -398,6 +398,8 @@ enum wq_flags {
> 	__WQ_DRAINING		= 1 << 16, /* internal: workqueue is draining */
> 	__WQ_ORDERED		= 1 << 17, /* internal: workqueue is ordered */
> 	__WQ_LEGACY		= 1 << 18, /* internal: create*_workqueue() */
> +	__WQ_NOIO		= 1 << 19, /* internal: execute work with NOIO */
> +	__WQ_NOFS		= 1 << 20, /* internal: execute work with NOFS */
>
> I don't quite understand how this is supposed to be used. The flags are
> marked internal but nothing actually sets them. Looking at later patches, I
> don't see any usages either. What am I missing?

Hi Tejun,

Thank you for so quickly looking into this. You are quite right. During
re-basing, I missed a hunk from alloc_workqueue(), where I sample
current->flags for the PF_MEMALLOC_{NOIO,NOFS} bits and set the
corresponding bits in wq->flags. Fixed in v2.

> @@ -3184,6 +3189,12 @@ __acquires(&pool->lock)
>
> 	lockdep_copy_map(&lockdep_map, &work->lockdep_map);
> #endif
> +	/* Set inherited alloc flags */
> +	if (use_noio_allocs)
> +		noio_flags = memalloc_noio_save();
> +	if (use_nofs_allocs)
> +		nofs_flags = memalloc_nofs_save();
> +
> 	/* ensure we're on the correct CPU */
> 	WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
> 		     raw_smp_processor_id() != pool->cpu);
> @@ -3320,6 +3331,12 @@ __acquires(&pool->lock)
>
> 	/* must be the last step, see the function comment */
> 	pwq_dec_nr_in_flight(pwq, work_data);
> +
> +	/* Restore alloc flags */
> +	if (use_nofs_allocs)
> +		memalloc_nofs_restore(nofs_flags);
> +	if (use_noio_allocs)
> +		memalloc_noio_restore(noio_flags);
>
> Also, this looks like something that the work function can do on entry and
> before exit, no?

It _can_ be done in the work functions, but that would be code sprawl. In
RDS alone, we have the following worker functions:

rds_ib_odp_mr_worker();
rds_ib_mr_pool_flush_worker()
rds_ib_odp_mr_worker()
rds_tcp_accept_worker()
rds_connect_worker()
rds_send_worker()
rds_recv_worker()
rds_shutdown_worker()

Adding the ones from ib_cm, rdma_cm, mlx5_ib, and mlx5_core, I strongly
prefer to have it in one place.

Thxs, Håkon
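For illustration, a rough sketch of the kind of alloc_workqueue() hunk being described here. The exact v2 change may differ; the placement inside alloc_workqueue() and the use of its local flags variable are assumptions:

	/* Hypothetical hunk inside alloc_workqueue(): inherit the creator's
	 * memalloc scope into the new workqueue's flags.
	 */
	if (current->flags & PF_MEMALLOC_NOIO)
		flags |= __WQ_NOIO;
	if (current->flags & PF_MEMALLOC_NOFS)
		flags |= __WQ_NOFS;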
Hello,

On Tue, May 14, 2024 at 01:48:24PM +0000, Haakon Bugge wrote:
> > Also, this looks like something that the work function can do on entry and
> > before exit, no?
>
> It _can_ be done in the work functions, but that would be code sprawl. In
> RDS alone, we have the following worker functions:
>
> rds_ib_odp_mr_worker();
> rds_ib_mr_pool_flush_worker()
> rds_ib_odp_mr_worker()
> rds_tcp_accept_worker()
> rds_connect_worker()
> rds_send_worker()
> rds_recv_worker()
> rds_shutdown_worker()
>
> Adding the ones from ib_cm, rdma_cm, mlx5_ib, and mlx5_core, I strongly
> prefer to have it in one place.

I haven't seen the code yet, so can't tell for sure, but if you're
automatically inheriting these flags from the scheduling site, I don't think
that's going to work. Note that getting a different, more permissive
allocation context is one of the reasons why one might want to use
workqueues, so it'd have to be explicit whether it's done in the workqueue
or in its users.

Thanks.
Hi Tejun,

> On 14 May 2024, at 18:49, Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, May 14, 2024 at 01:48:24PM +0000, Haakon Bugge wrote:
>
> I haven't seen the code yet, so can't tell for sure, but if you're
> automatically inheriting these flags from the scheduling site, I don't think
> that's going to work. Note that getting a different, more permissive
> allocation context is one of the reasons why one might want to use
> workqueues, so it'd have to be explicit whether it's done in the workqueue
> or in its users.

I wanted to hold off sending the v2 in case more comments came in, but I
have sent it now.

Thxs, Håkon
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 158784dd189ab..09ecc692ffcae 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -398,6 +398,8 @@ enum wq_flags {
 	__WQ_DRAINING		= 1 << 16, /* internal: workqueue is draining */
 	__WQ_ORDERED		= 1 << 17, /* internal: workqueue is ordered */
 	__WQ_LEGACY		= 1 << 18, /* internal: create*_workqueue() */
+	__WQ_NOIO		= 1 << 19, /* internal: execute work with NOIO */
+	__WQ_NOFS		= 1 << 20, /* internal: execute work with NOFS */

 	/* BH wq only allows the following flags */
 	__WQ_BH_ALLOWS		= WQ_BH | WQ_HIGHPRI,
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index d2dbe099286b9..a1d166a7c0f85 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -51,6 +51,7 @@
 #include <linux/uaccess.h>
 #include <linux/sched/isolation.h>
 #include <linux/sched/debug.h>
+#include <linux/sched/mm.h>
 #include <linux/nmi.h>
 #include <linux/kvm_para.h>
 #include <linux/delay.h>
@@ -3172,6 +3173,10 @@ __acquires(&pool->lock)
 	unsigned long work_data;
 	int lockdep_start_depth, rcu_start_depth;
 	bool bh_draining = pool->flags & POOL_BH_DRAINING;
+	bool use_noio_allocs = pwq->wq->flags & __WQ_NOIO;
+	bool use_nofs_allocs = pwq->wq->flags & __WQ_NOFS;
+	unsigned long noio_flags;
+	unsigned long nofs_flags;
 #ifdef CONFIG_LOCKDEP
 	/*
 	 * It is permissible to free the struct work_struct from
@@ -3184,6 +3189,12 @@ __acquires(&pool->lock)

 	lockdep_copy_map(&lockdep_map, &work->lockdep_map);
 #endif
+	/* Set inherited alloc flags */
+	if (use_noio_allocs)
+		noio_flags = memalloc_noio_save();
+	if (use_nofs_allocs)
+		nofs_flags = memalloc_nofs_save();
+
 	/* ensure we're on the correct CPU */
 	WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
 		     raw_smp_processor_id() != pool->cpu);
@@ -3320,6 +3331,12 @@ __acquires(&pool->lock)

 	/* must be the last step, see the function comment */
 	pwq_dec_nr_in_flight(pwq, work_data);
+
+	/* Restore alloc flags */
+	if (use_nofs_allocs)
+		memalloc_nofs_restore(nofs_flags);
+	if (use_noio_allocs)
+		memalloc_noio_restore(noio_flags);
 }

 /**
For drivers/modules running inside a memalloc_{noio,nofs}_{save,restore}
region, if a work-queue is created, we make sure work executed on the
work-queue inherits the same flag(s), in order to conditionally enable
drivers to work aligned with block I/O devices.

This commit makes sure that any work queued later on work-queues created
during module initialization, when current->flags has
PF_MEMALLOC_{NOIO,NOFS} set, will inherit the same flags. We do this in
order to enable drivers to be used as a network block I/O device, and thus
to support XFS or other file-systems on top of a raw block device which
uses said drivers as the network transport layer.

Under intense memory pressure, we get memory reclaims. Assume the
file-system reclaims memory, goes to the raw block device, which calls into
said drivers. Now, if regular GFP_KERNEL allocations in the drivers require
reclaims to be fulfilled, we end up in a circular dependency.

We break this circular dependency by:

1. Forcing all allocations in the drivers to use GFP_NOIO, by means of a
   parenthetic use of memalloc_noio_{save,restore} on all relevant entry
   points.

2. Making sure work-queues inherit current->flags wrt.
   PF_MEMALLOC_{NOIO,NOFS}, such that work executed on the work-queue
   inherits the same flag(s). That is what this commit contributes.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
---
 include/linux/workqueue.h |  2 ++
 kernel/workqueue.c        | 17 +++++++++++++++++
 2 files changed, 19 insertions(+)
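For illustration, a hypothetical driver-side sketch of how the mechanism above would be used, assuming the v2 alloc_workqueue() change that samples current->flags. The module and workqueue names are made up:

#include <linux/module.h>
#include <linux/sched/mm.h>
#include <linux/workqueue.h>

static struct workqueue_struct *example_wq;

static int __init example_init(void)
{
	/* Create the workqueue inside a NOIO scope so that, with this series,
	 * it is flagged __WQ_NOIO and work items later executed on it run
	 * with GFP_NOIO semantics.
	 */
	unsigned int noio_flags = memalloc_noio_save();

	example_wq = alloc_workqueue("example_wq", WQ_MEM_RECLAIM, 0);

	memalloc_noio_restore(noio_flags);
	return example_wq ? 0 : -ENOMEM;
}
module_init(example_init);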