Message ID | 20220315162052.570677-1-trondmy@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Series | NFS: Fix memory allocation in rpc_malloc() |
> On Mar 15, 2022, at 12:20 PM, trondmy@kernel.org wrote:
>
> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>
> When allocating memory, it should be safe to always use GFP_KERNEL,
> since both swap tasks and asynchronous tasks will regulate the
> allocation mode through the struct task flags.

Is a similar change necessary in xprt_rdma_allocate()?

> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> ---
>  net/sunrpc/sched.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> index 7c8f87ebdbc0..c62fcacf7366 100644
> --- a/net/sunrpc/sched.c
> +++ b/net/sunrpc/sched.c
> @@ -1030,16 +1030,12 @@ int rpc_malloc(struct rpc_task *task)
>  	struct rpc_rqst *rqst = task->tk_rqstp;
>  	size_t size = rqst->rq_callsize + rqst->rq_rcvsize;
>  	struct rpc_buffer *buf;
> -	gfp_t gfp = GFP_KERNEL;
> -
> -	if (RPC_IS_ASYNC(task))
> -		gfp = GFP_NOWAIT | __GFP_NOWARN;
>
>  	size += sizeof(struct rpc_buffer);
>  	if (size <= RPC_BUFFER_MAXSIZE)
> -		buf = mempool_alloc(rpc_buffer_mempool, gfp);
> +		buf = mempool_alloc(rpc_buffer_mempool, GFP_KERNEL);
>  	else
> -		buf = kmalloc(size, gfp);
> +		buf = kmalloc(size, GFP_KERNEL);
>
>  	if (!buf)
>  		return -ENOMEM;
> --
> 2.35.1

--
Chuck Lever
On Tue, 2022-03-15 at 16:50 +0000, Chuck Lever III wrote:
> > On Mar 15, 2022, at 12:20 PM, trondmy@kernel.org wrote:
> >
> > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> >
> > When allocating memory, it should be safe to always use GFP_KERNEL,
> > since both swap tasks and asynchronous tasks will regulate the
> > allocation mode through the struct task flags.
>
> Is a similar change necessary in xprt_rdma_allocate()?

It looks to me as if that would be a safe change, and we should
probably also change the definition of RPCRDMA_DEF_GFP to match
GFP_KERNEL.
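To make the xprtrdma side of this exchange concrete, here is a hedged sketch of the gfp selection being discussed. The helper name is invented for illustration; the equivalent logic is open-coded inside xprt_rdma_allocate() in net/sunrpc/xprtrdma/transport.c, and the exact surrounding code may differ from what is shown.

```c
#include <linux/gfp.h>
#include <linux/sunrpc/sched.h>

/*
 * Hypothetical helper, for illustration only: the gfp choice that
 * xprt_rdma_allocate() makes before registering its send/receive
 * buffers, pulled out so the proposed change is easy to see.
 */
static gfp_t rpcrdma_alloc_gfp(const struct rpc_task *task)
{
	/*
	 * Old behaviour, analogous to the rpc_malloc() hunk above:
	 *
	 *	if (RPC_IS_ASYNC(task))
	 *		return GFP_NOWAIT | __GFP_NOWARN;
	 *	return RPCRDMA_DEF_GFP;
	 *
	 * Proposed behaviour: always allow a sleeping allocation and let
	 * the task's memalloc_nofs scope bound what reclaim may do,
	 * together with a matching redefinition of RPCRDMA_DEF_GFP.
	 */
	return GFP_KERNEL;
}
```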
On Wed, 16 Mar 2022, trondmy@kernel.org wrote:
> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>
> When allocating memory, it should be safe to always use GFP_KERNEL,
> since both swap tasks and asynchronous tasks will regulate the
> allocation mode through the struct task flags.

'struct task_struct' flags can only enforce NOFS, NOIO, or MEMALLOC.
They cannot prevent waiting altogether.

We need rpciod tasks to not block waiting for memory. If they all do,
then there will be no thread free to handle the replies to WRITE that
would allow swapped-out memory to be freed.

At the very least the rescuer thread mustn't block, so the use of
GFP_NOWAIT could depend on current_is_workqueue_rescuer().

Was there some problem you saw caused by the use of GFP_NOWAIT?

Thanks,
NeilBrown

> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> ---
>  net/sunrpc/sched.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> index 7c8f87ebdbc0..c62fcacf7366 100644
> --- a/net/sunrpc/sched.c
> +++ b/net/sunrpc/sched.c
> @@ -1030,16 +1030,12 @@ int rpc_malloc(struct rpc_task *task)
>  	struct rpc_rqst *rqst = task->tk_rqstp;
>  	size_t size = rqst->rq_callsize + rqst->rq_rcvsize;
>  	struct rpc_buffer *buf;
> -	gfp_t gfp = GFP_KERNEL;
> -
> -	if (RPC_IS_ASYNC(task))
> -		gfp = GFP_NOWAIT | __GFP_NOWARN;
>
>  	size += sizeof(struct rpc_buffer);
>  	if (size <= RPC_BUFFER_MAXSIZE)
> -		buf = mempool_alloc(rpc_buffer_mempool, gfp);
> +		buf = mempool_alloc(rpc_buffer_mempool, GFP_KERNEL);
>  	else
> -		buf = kmalloc(size, gfp);
> +		buf = kmalloc(size, GFP_KERNEL);
>
>  	if (!buf)
>  		return -ENOMEM;
> --
> 2.35.1
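A minimal sketch of the narrower alternative NeilBrown floats here, assuming the gfp choice were centralised in a single helper. The helper name is invented; current_is_workqueue_rescuer() is the real workqueue API, and only the rescuer thread would drop to the non-waiting mode.

```c
#include <linux/gfp.h>
#include <linux/workqueue.h>

/*
 * Sketch only: restrict the non-waiting allocation mode to the
 * workqueue rescuer thread, rather than applying it to every async
 * task, as suggested above.
 */
static gfp_t rpc_buf_gfp_mask(void)
{
	if (current_is_workqueue_rescuer())
		return GFP_NOWAIT | __GFP_NOWARN;
	return GFP_KERNEL;
}
```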
On Wed, 2022-03-16 at 12:53 +1100, NeilBrown wrote:
> On Wed, 16 Mar 2022, trondmy@kernel.org wrote:
> > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> >
> > When allocating memory, it should be safe to always use GFP_KERNEL,
> > since both swap tasks and asynchronous tasks will regulate the
> > allocation mode through the struct task flags.
>
> 'struct task_struct' flags can only enforce NOFS, NOIO, or MEMALLOC.
> They cannot prevent waiting altogether.
>
> We need rpciod tasks to not block waiting for memory. If they all do,
> then there will be no thread free to handle the replies to WRITE that
> would allow swapped-out memory to be freed.
>
> At the very least the rescuer thread mustn't block, so the use of
> GFP_NOWAIT could depend on current_is_workqueue_rescuer().
>
> Was there some problem you saw caused by the use of GFP_NOWAIT?

There is no point in trying to give rpciod stronger protection than
what is given to the standard memory reclaim processes. The VM does not
require that writepages() and its ilk be non-waiting and non-blocking,
nor does it require that the threads that help service those writebacks
be non-blocking.

We wouldn't be able to do things like create a new socket to reconnect
to the server if we couldn't perform blocking allocations from rpciod.
None of the socket creation APIs allow you to pass in a gfp flag mask.

What we do require in this situation is that we must not recurse back
into NFS. This is why we have the memalloc_nofs_save() /
memalloc_nofs_restore() protection in various places, including
rpc_async_schedule() and rpc_execute(). That still allows the VM to
actively perform direct reclaim, and to start I/O against block devices
when in a low memory situation. Why would we not want it to do that?

The alternative with GFP_NOWAIT is to stall the RPC I/O altogether when
in a low memory situation, and to cross our fingers that something else
causes memory to be freed up.
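For context, this is the scoping pattern Trond refers to, shown in a simplified stand-in form. The function below is not the real code; the actual memalloc_nofs_save()/memalloc_nofs_restore() calls sit in rpc_execute() and rpc_async_schedule() in net/sunrpc/sched.c. Inside the saved scope, GFP_KERNEL allocations are implicitly treated as GFP_NOFS, so direct reclaim may run but cannot re-enter the filesystem.

```c
#include <linux/sched/mm.h>
#include <linux/sunrpc/sched.h>

/*
 * Simplified stand-in for the pattern used in rpc_execute() and
 * rpc_async_schedule(): scope the whole RPC state machine so that any
 * GFP_KERNEL allocation inside it behaves as GFP_NOFS.
 */
static void rpc_run_task_nofs(struct rpc_task *task)
{
	unsigned int pflags = memalloc_nofs_save();

	/* ... drive the RPC state machine; rpc_malloc() etc. run here ... */

	memalloc_nofs_restore(pflags);
}
```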
On Thu, 17 Mar 2022, Trond Myklebust wrote:
> On Wed, 2022-03-16 at 12:53 +1100, NeilBrown wrote:
> > On Wed, 16 Mar 2022, trondmy@kernel.org wrote:
> > > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > >
> > > When allocating memory, it should be safe to always use GFP_KERNEL,
> > > since both swap tasks and asynchronous tasks will regulate the
> > > allocation mode through the struct task flags.
> >
> > 'struct task_struct' flags can only enforce NOFS, NOIO, or MEMALLOC.
> > They cannot prevent waiting altogether.
> >
> > We need rpciod tasks to not block waiting for memory. If they all do,
> > then there will be no thread free to handle the replies to WRITE that
> > would allow swapped-out memory to be freed.
> >
> > At the very least the rescuer thread mustn't block, so the use of
> > GFP_NOWAIT could depend on current_is_workqueue_rescuer().
> >
> > Was there some problem you saw caused by the use of GFP_NOWAIT?
>
> There is no point in trying to give rpciod stronger protection than
> what is given to the standard memory reclaim processes. The VM does not
> require that writepages() and its ilk be non-waiting and non-blocking,
> nor does it require that the threads that help service those writebacks
> be non-blocking.

My understanding is a little more nuanced. It is certainly true that
writepages() etc. can block, but only for events that are certain to
happen in some finite time. This is why the block layer uses mempools,
and uses a different mempool at each layer going down the stack. At
each level it is OK to wait for events at some lower level, but not for
events at the same or a higher level. That ensures there are no loops
in the waiting dependency.

This is where rpciod falls down without the commit that you want to
revert. rpciod threads both wait on the rpc_buffer_mempool and return
buffers to it with mempool_free(). This means that all the available
threads could be blocked waiting on the mempool, so none would be
available to free buffers for which a reply has been received.

If there were separate workqueues (with separate rescuer threads) for
sending requests and for receiving replies, then there wouldn't be a
problem here. I haven't given that possibility a lot of thought, but I
think it would be messy to separate requests from replies like that.
Without the separation, we need to ensure requests don't block waiting
for some other reply, or else we can deadlock.

> We wouldn't be able to do things like create a new socket to reconnect
> to the server if we couldn't perform blocking allocations from rpciod.
> None of the socket creation APIs allow you to pass in a gfp flag mask.

For cases where mempools are not convenient, there is PF_MEMALLOC. This
allows access to the generic reserves. It is dangerous to use this for
regular allocations, as the reserve can still be exhausted and
deadlocks can still result. But it is generally safe for rare
allocations, as you are unlikely to have lots of them at once resulting
in exhaustion.

Allocating a new socket is exactly the sort of thing that PF_MEMALLOC
is good for. It happens rarely (hours or more apart), in contrast with
rpc_malloc(), which happens frequently (multiple times a second). I'm
not certain that socket allocation happens under PF_MEMALLOC (I should
check), but that is the correct solution for that need.

> What we do require in this situation is that we must not recurse back
> into NFS. This is why we have the memalloc_nofs_save() /
> memalloc_nofs_restore() protection in various places, including
> rpc_async_schedule() and rpc_execute(). That still allows the VM to
> actively perform direct reclaim, and to start I/O against block devices
> when in a low memory situation. Why would we not want it to do that?

Certainly it is important to avoid recursion, and the improvements to
properly use memalloc_nofs_save() etc. so that we don't need GFP_NOFS
are good. Consequently there is no problem with starting reclaim.

I would not object to __GFP_DIRECT_RECLAIM being passed to kmalloc(),
provided __GFP_NORETRY were passed as well. But passing
__GFP_DIRECT_RECLAIM to mempool_alloc() causes it to wait indefinitely
for success, and that triggers the deadlock, because rpciod is waiting
for something that only rpciod can do.

> The alternative with GFP_NOWAIT is to stall the RPC I/O altogether when
> in a low memory situation, and to cross our fingers that something else
> causes memory to be freed up.

I don't think crossing our fingers is generally a good strategy. I
don't think "something else" freeing up memory is likely, though maybe
the OOM killer might help.

PF_MEMALLOC effectively is a "cross your fingers" strategy too, but
there the hope is not "something will free up memory" but rather "we'll
never need all of the reserve at the same time". That is somewhat more
defensible.

Thanks,
NeilBrown

> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
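To make the two alternatives in this reply concrete, here is a hedged sketch. None of this is code from the thread: the function names are invented, and the assumption that a reconnect socket could simply be created inside a PF_MEMALLOC scope is mine. The first function shows the "rare allocation" case run under PF_MEMALLOC via memalloc_noreclaim_save(); the second shows the "frequent allocation" compromise, where kmalloc() may enter direct reclaim but fails fast with __GFP_NORETRY, which is precisely the behaviour mempool_alloc() with __GFP_DIRECT_RECLAIM does not have.

```c
#include <linux/gfp.h>
#include <linux/in.h>
#include <linux/net.h>
#include <linux/sched/mm.h>
#include <linux/slab.h>
#include <net/net_namespace.h>
#include <net/sock.h>

/*
 * Sketch of the "rare allocation" case: create a kernel socket under
 * PF_MEMALLOC so it may dip into the emergency reserves. Purely
 * illustrative; the real reconnect path in xprtsock differs.
 */
static int rpc_sock_create_memalloc(struct socket **sockp)
{
	unsigned int pflags = memalloc_noreclaim_save();	/* sets PF_MEMALLOC */
	int err;

	err = sock_create_kern(&init_net, AF_INET, SOCK_STREAM,
			       IPPROTO_TCP, sockp);

	memalloc_noreclaim_restore(pflags);
	return err;
}

/*
 * Sketch of the "frequent allocation" compromise: allow direct reclaim
 * for kmalloc() but fail fast with __GFP_NORETRY instead of retrying
 * forever. The same flags would not help mempool_alloc(), which sleeps
 * until an element is returned to the pool.
 */
static void *rpc_buf_alloc_sketch(size_t size)
{
	return kmalloc(size, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
}
```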
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 7c8f87ebdbc0..c62fcacf7366 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -1030,16 +1030,12 @@ int rpc_malloc(struct rpc_task *task)
 	struct rpc_rqst *rqst = task->tk_rqstp;
 	size_t size = rqst->rq_callsize + rqst->rq_rcvsize;
 	struct rpc_buffer *buf;
-	gfp_t gfp = GFP_KERNEL;
-
-	if (RPC_IS_ASYNC(task))
-		gfp = GFP_NOWAIT | __GFP_NOWARN;
 
 	size += sizeof(struct rpc_buffer);
 	if (size <= RPC_BUFFER_MAXSIZE)
-		buf = mempool_alloc(rpc_buffer_mempool, gfp);
+		buf = mempool_alloc(rpc_buffer_mempool, GFP_KERNEL);
 	else
-		buf = kmalloc(size, gfp);
+		buf = kmalloc(size, GFP_KERNEL);
 
 	if (!buf)
 		return -ENOMEM;