Message ID | 20220319001635.4097742-1-khazhy@google.com (mailing list archive) |
---|---|
State | New, archived |
Series | [RFC] nfsd: avoid recursive locking through fsnotify |
On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote: > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may > result > in recursing back into nfsd, resulting in deadlock. See below stack. > > nfsd D 0 1591536 2 0x80004080 > Call Trace: > __schedule+0x497/0x630 > schedule+0x67/0x90 > schedule_preempt_disabled+0xe/0x10 > __mutex_lock+0x347/0x4b0 > fsnotify_destroy_mark+0x22/0xa0 > nfsd_file_free+0x79/0xd0 [nfsd] > nfsd_file_put_noref+0x7c/0x90 [nfsd] > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd] > nfsd_file_lru_scan+0x57/0x80 [nfsd] > do_shrink_slab+0x1f2/0x330 > shrink_slab+0x244/0x2f0 > shrink_node+0xd7/0x490 > do_try_to_free_pages+0x12f/0x3b0 > try_to_free_pages+0x43f/0x540 > __alloc_pages_slowpath+0x6ab/0x11c0 > __alloc_pages_nodemask+0x274/0x2c0 > alloc_slab_page+0x32/0x2e0 > new_slab+0xa6/0x8b0 > ___slab_alloc+0x34b/0x520 > kmem_cache_alloc+0x1c4/0x250 > fsnotify_add_mark_locked+0x18d/0x4c0 > fsnotify_add_mark+0x48/0x70 > nfsd_file_acquire+0x570/0x6f0 [nfsd] > nfsd_read+0xa7/0x1c0 [nfsd] > nfsd3_proc_read+0xc1/0x110 [nfsd] > nfsd_dispatch+0xf7/0x240 [nfsd] > svc_process_common+0x2f4/0x610 [sunrpc] > svc_process+0xf9/0x110 [sunrpc] > nfsd+0x10e/0x180 [nfsd] > kthread+0x130/0x140 > ret_from_fork+0x35/0x40 > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com> > --- > fs/nfsd/filecache.c | 4 ++++ > 1 file changed, 4 insertions(+) > > Marking this RFC since I haven't actually had a chance to test this, > we > we're seeing this deadlock for some customers. > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > index fdf89fcf1a0c..a14760f9b486 100644 > --- a/fs/nfsd/filecache.c > +++ b/fs/nfsd/filecache.c > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file > *nf) > struct fsnotify_mark *mark; > struct nfsd_file_mark *nfm = NULL, *new; > struct inode *inode = nf->nf_inode; > + unsigned int pflags; > > do { > mutex_lock(&nfsd_file_fsnotify_group->mark_mutex); > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file > *nf) > new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF; > refcount_set(&new->nfm_ref, 1); > > + /* fsnotify allocates, avoid recursion back into nfsd > */ > + pflags = memalloc_nofs_save(); > err = fsnotify_add_inode_mark(&new->nfm_mark, inode, > 0); > + memalloc_nofs_restore(pflags); > > /* > * If the add was successful, then return the object. Isn't that stack trace showing a slab direct reclaim, and not a filesystem writeback situation? Does memalloc_nofs_save()/restore() really fix this problem? It seems to me that it cannot, particularly since knfsd is not a filesystem, and so does not ever handle writeback of dirty pages. Cc: the linux-mm mailing list in search of answers to the above 2 questions.
On Fri, Mar 18, 2022 at 5:36 PM Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> Isn't that stack trace showing a slab direct reclaim, and not a
> filesystem writeback situation?
>
> Does memalloc_nofs_save()/restore() really fix this problem? It seems
> to me that it cannot, particularly since knfsd is not a filesystem, and
> so does not ever handle writeback of dirty pages.

Ah, you're right. An alternative would be to delay the destroy_mark. I notice
now that on mainline the shrinker calls nfsd_file_dispose_list_delayed, which
is missing from the 5.4 tree I was looking at. To my reading, 9542e6a643fc6
("nfsd: Containerise filecache laundrette") should have the effect of fixing
this deadlock (and its commit message explicitly calls out notifier
deadlocks) - maybe something to send towards stable?

>
> Cc: the linux-mm mailing list in search of answers to the above 2
> questions.
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
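A rough sketch of the "delay the disposal" approach referenced above (the idea
behind nfsd_file_dispose_list_delayed in later kernels): the shrinker only
moves entries onto a private list and schedules work, so fsnotify_destroy_mark()
runs from a workqueue rather than from direct reclaim. The list, lock, work
item, and the exact put helper used here are assumptions for illustration, not
the actual mainline code:

static LIST_HEAD(nfsd_file_dispose_list);
static DEFINE_SPINLOCK(nfsd_file_dispose_lock);
static void nfsd_file_delayed_dispose(struct work_struct *work);
static DECLARE_WORK(nfsd_file_dispose_work, nfsd_file_delayed_dispose);

/* Called from the LRU shrinker instead of freeing the entry inline. */
static void nfsd_file_schedule_dispose(struct nfsd_file *nf)
{
	spin_lock(&nfsd_file_dispose_lock);
	list_add(&nf->nf_lru, &nfsd_file_dispose_list);
	spin_unlock(&nfsd_file_dispose_lock);
	schedule_work(&nfsd_file_dispose_work);	/* mainline uses a delayed, per-net laundrette */
}

static void nfsd_file_delayed_dispose(struct work_struct *work)
{
	struct nfsd_file *nf, *next;
	LIST_HEAD(dispose);

	spin_lock(&nfsd_file_dispose_lock);
	list_splice_init(&nfsd_file_dispose_list, &dispose);
	spin_unlock(&nfsd_file_dispose_lock);

	list_for_each_entry_safe(nf, next, &dispose, nf_lru) {
		list_del_init(&nf->nf_lru);
		/* Dropping the reference here may call fsnotify_destroy_mark(),
		 * which is safe now that we are no longer inside reclaim. */
		nfsd_file_put_noref(nf);
	}
}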
On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust <trondmy@hammerspace.com> wrote: > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote: > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may > > result > > in recursing back into nfsd, resulting in deadlock. See below stack. > > > > nfsd D 0 1591536 2 0x80004080 > > Call Trace: > > __schedule+0x497/0x630 > > schedule+0x67/0x90 > > schedule_preempt_disabled+0xe/0x10 > > __mutex_lock+0x347/0x4b0 > > fsnotify_destroy_mark+0x22/0xa0 > > nfsd_file_free+0x79/0xd0 [nfsd] > > nfsd_file_put_noref+0x7c/0x90 [nfsd] > > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd] > > nfsd_file_lru_scan+0x57/0x80 [nfsd] > > do_shrink_slab+0x1f2/0x330 > > shrink_slab+0x244/0x2f0 > > shrink_node+0xd7/0x490 > > do_try_to_free_pages+0x12f/0x3b0 > > try_to_free_pages+0x43f/0x540 > > __alloc_pages_slowpath+0x6ab/0x11c0 > > __alloc_pages_nodemask+0x274/0x2c0 > > alloc_slab_page+0x32/0x2e0 > > new_slab+0xa6/0x8b0 > > ___slab_alloc+0x34b/0x520 > > kmem_cache_alloc+0x1c4/0x250 > > fsnotify_add_mark_locked+0x18d/0x4c0 > > fsnotify_add_mark+0x48/0x70 > > nfsd_file_acquire+0x570/0x6f0 [nfsd] > > nfsd_read+0xa7/0x1c0 [nfsd] > > nfsd3_proc_read+0xc1/0x110 [nfsd] > > nfsd_dispatch+0xf7/0x240 [nfsd] > > svc_process_common+0x2f4/0x610 [sunrpc] > > svc_process+0xf9/0x110 [sunrpc] > > nfsd+0x10e/0x180 [nfsd] > > kthread+0x130/0x140 > > ret_from_fork+0x35/0x40 > > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com> > > --- > > fs/nfsd/filecache.c | 4 ++++ > > 1 file changed, 4 insertions(+) > > > > Marking this RFC since I haven't actually had a chance to test this, > > we > > we're seeing this deadlock for some customers. > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > index fdf89fcf1a0c..a14760f9b486 100644 > > --- a/fs/nfsd/filecache.c > > +++ b/fs/nfsd/filecache.c > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > *nf) > > struct fsnotify_mark *mark; > > struct nfsd_file_mark *nfm = NULL, *new; > > struct inode *inode = nf->nf_inode; > > + unsigned int pflags; > > > > do { > > mutex_lock(&nfsd_file_fsnotify_group->mark_mutex); > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > *nf) > > new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF; > > refcount_set(&new->nfm_ref, 1); > > > > + /* fsnotify allocates, avoid recursion back into nfsd > > */ > > + pflags = memalloc_nofs_save(); > > err = fsnotify_add_inode_mark(&new->nfm_mark, inode, > > 0); > > + memalloc_nofs_restore(pflags); > > > > /* > > * If the add was successful, then return the object. > > Isn't that stack trace showing a slab direct reclaim, and not a > filesystem writeback situation? > > Does memalloc_nofs_save()/restore() really fix this problem? It seems > to me that it cannot, particularly since knfsd is not a filesystem, and > so does not ever handle writeback of dirty pages. > Maybe NOFS throttles direct reclaims to the point that the problem is harder to hit? This report came in at good timing for me. It demonstrates an issue I did not predict for "volatile"' fanotify marks [1]. As far as I can tell, nfsd filecache is currently the only fsnotify backend that frees fsnotify marks in memory shrinker. "volatile" fanotify marks would also be evictable in that way, so they would expose fanotify to this deadlock. For the short term, maybe nfsd filecache can avoid the problem by checking mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort the shrinker. I wonder if there is a place for a helper mutex_is_locked_by_me()? 
Jan,

A relatively simple fix would be to allocate the fsnotify_mark_connector in
fsnotify_add_mark() and free it if a connector already exists for the object.
I don't think there is a good reason to optimize away this allocation for the
case of a non-first group setting a mark on an object.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/20220307155741.1352405-1-amir73il@gmail.com/
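A minimal sketch of the short-term workaround suggested above: bail out of the
nfsd filecache shrinker while the fsnotify group's mark_mutex is held, so a
shrinker-triggered fsnotify_destroy_mark() cannot deadlock against an
in-progress fsnotify_add_inode_mark() on the same task. Note that
mutex_is_locked() only says that someone holds the mutex, not that it is this
task (hence the question about a mutex_is_locked_by_me() helper); the
surrounding function body and the delegated helper are assumptions for
illustration:

static unsigned long
nfsd_file_lru_scan(struct shrinker *s, struct shrink_control *sc)
{
	/* We may be the holder ourselves, via fsnotify_add_inode_mark() ->
	 * kmem_cache_alloc() -> direct reclaim; skip this scan instead. */
	if (mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex))
		return SHRINK_STOP;

	return nfsd_file_lru_walk_and_dispose(sc);	/* assumed helper */
}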
On Sat 19-03-22 11:36:13, Amir Goldstein wrote: > On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust <trondmy@hammerspace.com> wrote: > > > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote: > > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may > > > result > > > in recursing back into nfsd, resulting in deadlock. See below stack. > > > > > > nfsd D 0 1591536 2 0x80004080 > > > Call Trace: > > > __schedule+0x497/0x630 > > > schedule+0x67/0x90 > > > schedule_preempt_disabled+0xe/0x10 > > > __mutex_lock+0x347/0x4b0 > > > fsnotify_destroy_mark+0x22/0xa0 > > > nfsd_file_free+0x79/0xd0 [nfsd] > > > nfsd_file_put_noref+0x7c/0x90 [nfsd] > > > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd] > > > nfsd_file_lru_scan+0x57/0x80 [nfsd] > > > do_shrink_slab+0x1f2/0x330 > > > shrink_slab+0x244/0x2f0 > > > shrink_node+0xd7/0x490 > > > do_try_to_free_pages+0x12f/0x3b0 > > > try_to_free_pages+0x43f/0x540 > > > __alloc_pages_slowpath+0x6ab/0x11c0 > > > __alloc_pages_nodemask+0x274/0x2c0 > > > alloc_slab_page+0x32/0x2e0 > > > new_slab+0xa6/0x8b0 > > > ___slab_alloc+0x34b/0x520 > > > kmem_cache_alloc+0x1c4/0x250 > > > fsnotify_add_mark_locked+0x18d/0x4c0 > > > fsnotify_add_mark+0x48/0x70 > > > nfsd_file_acquire+0x570/0x6f0 [nfsd] > > > nfsd_read+0xa7/0x1c0 [nfsd] > > > nfsd3_proc_read+0xc1/0x110 [nfsd] > > > nfsd_dispatch+0xf7/0x240 [nfsd] > > > svc_process_common+0x2f4/0x610 [sunrpc] > > > svc_process+0xf9/0x110 [sunrpc] > > > nfsd+0x10e/0x180 [nfsd] > > > kthread+0x130/0x140 > > > ret_from_fork+0x35/0x40 > > > > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com> > > > --- > > > fs/nfsd/filecache.c | 4 ++++ > > > 1 file changed, 4 insertions(+) > > > > > > Marking this RFC since I haven't actually had a chance to test this, > > > we > > > we're seeing this deadlock for some customers. > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > index fdf89fcf1a0c..a14760f9b486 100644 > > > --- a/fs/nfsd/filecache.c > > > +++ b/fs/nfsd/filecache.c > > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > > *nf) > > > struct fsnotify_mark *mark; > > > struct nfsd_file_mark *nfm = NULL, *new; > > > struct inode *inode = nf->nf_inode; > > > + unsigned int pflags; > > > > > > do { > > > mutex_lock(&nfsd_file_fsnotify_group->mark_mutex); > > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > > *nf) > > > new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF; > > > refcount_set(&new->nfm_ref, 1); > > > > > > + /* fsnotify allocates, avoid recursion back into nfsd > > > */ > > > + pflags = memalloc_nofs_save(); > > > err = fsnotify_add_inode_mark(&new->nfm_mark, inode, > > > 0); > > > + memalloc_nofs_restore(pflags); > > > > > > /* > > > * If the add was successful, then return the object. > > > > Isn't that stack trace showing a slab direct reclaim, and not a > > filesystem writeback situation? > > > > Does memalloc_nofs_save()/restore() really fix this problem? It seems > > to me that it cannot, particularly since knfsd is not a filesystem, and > > so does not ever handle writeback of dirty pages. > > > > Maybe NOFS throttles direct reclaims to the point that the problem is > harder to hit? > > This report came in at good timing for me. > > It demonstrates an issue I did not predict for "volatile"' fanotify marks [1]. > As far as I can tell, nfsd filecache is currently the only fsnotify backend that > frees fsnotify marks in memory shrinker. 
"volatile" fanotify marks would also > be evictable in that way, so they would expose fanotify to this deadlock. > > For the short term, maybe nfsd filecache can avoid the problem by checking > mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort the > shrinker. I wonder if there is a place for a helper mutex_is_locked_by_me()? > > Jan, > > A relatively simple fix would be to allocate fsnotify_mark_connector in > fsnotify_add_mark() and free it, if a connector already exists for the object. > I don't think there is a good reason to optimize away this allocation > for the case of a non-first group to set a mark on an object? Indeed, nasty. Volatile marks will add group->mark_mutex into a set of locks grabbed during inode slab reclaim. So any allocation under group->mark_mutex has to be GFP_NOFS now. This is not just about connector allocations but also mark allocations for fanotify. Moving allocations from under mark_mutex is also possible solution but passing preallocated memory around is kind of ugly as well. So the cleanest solution I currently see is to come up with helpers like "fsnotify_lock_group() & fsnotify_unlock_group()" which will lock/unlock mark_mutex and also do memalloc_nofs_save / restore magic. Honza
On Mon, Mar 21, 2022 at 1:23 PM Jan Kara <jack@suse.cz> wrote: > > On Sat 19-03-22 11:36:13, Amir Goldstein wrote: > > On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust <trondmy@hammerspace.com> wrote: > > > > > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote: > > > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may > > > > result > > > > in recursing back into nfsd, resulting in deadlock. See below stack. > > > > > > > > nfsd D 0 1591536 2 0x80004080 > > > > Call Trace: > > > > __schedule+0x497/0x630 > > > > schedule+0x67/0x90 > > > > schedule_preempt_disabled+0xe/0x10 > > > > __mutex_lock+0x347/0x4b0 > > > > fsnotify_destroy_mark+0x22/0xa0 > > > > nfsd_file_free+0x79/0xd0 [nfsd] > > > > nfsd_file_put_noref+0x7c/0x90 [nfsd] > > > > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd] > > > > nfsd_file_lru_scan+0x57/0x80 [nfsd] > > > > do_shrink_slab+0x1f2/0x330 > > > > shrink_slab+0x244/0x2f0 > > > > shrink_node+0xd7/0x490 > > > > do_try_to_free_pages+0x12f/0x3b0 > > > > try_to_free_pages+0x43f/0x540 > > > > __alloc_pages_slowpath+0x6ab/0x11c0 > > > > __alloc_pages_nodemask+0x274/0x2c0 > > > > alloc_slab_page+0x32/0x2e0 > > > > new_slab+0xa6/0x8b0 > > > > ___slab_alloc+0x34b/0x520 > > > > kmem_cache_alloc+0x1c4/0x250 > > > > fsnotify_add_mark_locked+0x18d/0x4c0 > > > > fsnotify_add_mark+0x48/0x70 > > > > nfsd_file_acquire+0x570/0x6f0 [nfsd] > > > > nfsd_read+0xa7/0x1c0 [nfsd] > > > > nfsd3_proc_read+0xc1/0x110 [nfsd] > > > > nfsd_dispatch+0xf7/0x240 [nfsd] > > > > svc_process_common+0x2f4/0x610 [sunrpc] > > > > svc_process+0xf9/0x110 [sunrpc] > > > > nfsd+0x10e/0x180 [nfsd] > > > > kthread+0x130/0x140 > > > > ret_from_fork+0x35/0x40 > > > > > > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com> > > > > --- > > > > fs/nfsd/filecache.c | 4 ++++ > > > > 1 file changed, 4 insertions(+) > > > > > > > > Marking this RFC since I haven't actually had a chance to test this, > > > > we > > > > we're seeing this deadlock for some customers. > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > index fdf89fcf1a0c..a14760f9b486 100644 > > > > --- a/fs/nfsd/filecache.c > > > > +++ b/fs/nfsd/filecache.c > > > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > > > *nf) > > > > struct fsnotify_mark *mark; > > > > struct nfsd_file_mark *nfm = NULL, *new; > > > > struct inode *inode = nf->nf_inode; > > > > + unsigned int pflags; > > > > > > > > do { > > > > mutex_lock(&nfsd_file_fsnotify_group->mark_mutex); > > > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > > > *nf) > > > > new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF; > > > > refcount_set(&new->nfm_ref, 1); > > > > > > > > + /* fsnotify allocates, avoid recursion back into nfsd > > > > */ > > > > + pflags = memalloc_nofs_save(); > > > > err = fsnotify_add_inode_mark(&new->nfm_mark, inode, > > > > 0); > > > > + memalloc_nofs_restore(pflags); > > > > > > > > /* > > > > * If the add was successful, then return the object. > > > > > > Isn't that stack trace showing a slab direct reclaim, and not a > > > filesystem writeback situation? > > > > > > Does memalloc_nofs_save()/restore() really fix this problem? It seems > > > to me that it cannot, particularly since knfsd is not a filesystem, and > > > so does not ever handle writeback of dirty pages. > > > > > > > Maybe NOFS throttles direct reclaims to the point that the problem is > > harder to hit? > > > > This report came in at good timing for me. 
> > > > It demonstrates an issue I did not predict for "volatile"' fanotify marks [1]. > > As far as I can tell, nfsd filecache is currently the only fsnotify backend that > > frees fsnotify marks in memory shrinker. "volatile" fanotify marks would also > > be evictable in that way, so they would expose fanotify to this deadlock. > > > > For the short term, maybe nfsd filecache can avoid the problem by checking > > mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort the > > shrinker. I wonder if there is a place for a helper mutex_is_locked_by_me()? > > > > Jan, > > > > A relatively simple fix would be to allocate fsnotify_mark_connector in > > fsnotify_add_mark() and free it, if a connector already exists for the object. > > I don't think there is a good reason to optimize away this allocation > > for the case of a non-first group to set a mark on an object? > > Indeed, nasty. Volatile marks will add group->mark_mutex into a set of > locks grabbed during inode slab reclaim. So any allocation under > group->mark_mutex has to be GFP_NOFS now. This is not just about connector > allocations but also mark allocations for fanotify. Moving allocations from > under mark_mutex is also possible solution but passing preallocated memory > around is kind of ugly as well. Yes, kind of, here is how it looks: https://github.com/amir73il/linux/commit/643bb6b9f664f70f68ea0393a06338673c4966b3 https://github.com/amir73il/linux/commit/66f27fc99e46b12f1078e8e2915793040ce50ee7 > So the cleanest solution I currently see is > to come up with helpers like "fsnotify_lock_group() & > fsnotify_unlock_group()" which will lock/unlock mark_mutex and also do > memalloc_nofs_save / restore magic. > Sounds good. Won't this cause a regression - more failures to setup new mark under memory pressure? Should we maintain a flag in the group FSNOTIFY_GROUP_SHRINKABLE? and set NOFS state only in that case, so at least we don't cause regression for existing applications? Thanks, Amir.
On Mon 21-03-22 13:56:47, Amir Goldstein wrote: > On Mon, Mar 21, 2022 at 1:23 PM Jan Kara <jack@suse.cz> wrote: > > > > On Sat 19-03-22 11:36:13, Amir Goldstein wrote: > > > On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust <trondmy@hammerspace.com> wrote: > > > > > > > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote: > > > > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may > > > > > result > > > > > in recursing back into nfsd, resulting in deadlock. See below stack. > > > > > > > > > > nfsd D 0 1591536 2 0x80004080 > > > > > Call Trace: > > > > > __schedule+0x497/0x630 > > > > > schedule+0x67/0x90 > > > > > schedule_preempt_disabled+0xe/0x10 > > > > > __mutex_lock+0x347/0x4b0 > > > > > fsnotify_destroy_mark+0x22/0xa0 > > > > > nfsd_file_free+0x79/0xd0 [nfsd] > > > > > nfsd_file_put_noref+0x7c/0x90 [nfsd] > > > > > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd] > > > > > nfsd_file_lru_scan+0x57/0x80 [nfsd] > > > > > do_shrink_slab+0x1f2/0x330 > > > > > shrink_slab+0x244/0x2f0 > > > > > shrink_node+0xd7/0x490 > > > > > do_try_to_free_pages+0x12f/0x3b0 > > > > > try_to_free_pages+0x43f/0x540 > > > > > __alloc_pages_slowpath+0x6ab/0x11c0 > > > > > __alloc_pages_nodemask+0x274/0x2c0 > > > > > alloc_slab_page+0x32/0x2e0 > > > > > new_slab+0xa6/0x8b0 > > > > > ___slab_alloc+0x34b/0x520 > > > > > kmem_cache_alloc+0x1c4/0x250 > > > > > fsnotify_add_mark_locked+0x18d/0x4c0 > > > > > fsnotify_add_mark+0x48/0x70 > > > > > nfsd_file_acquire+0x570/0x6f0 [nfsd] > > > > > nfsd_read+0xa7/0x1c0 [nfsd] > > > > > nfsd3_proc_read+0xc1/0x110 [nfsd] > > > > > nfsd_dispatch+0xf7/0x240 [nfsd] > > > > > svc_process_common+0x2f4/0x610 [sunrpc] > > > > > svc_process+0xf9/0x110 [sunrpc] > > > > > nfsd+0x10e/0x180 [nfsd] > > > > > kthread+0x130/0x140 > > > > > ret_from_fork+0x35/0x40 > > > > > > > > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com> > > > > > --- > > > > > fs/nfsd/filecache.c | 4 ++++ > > > > > 1 file changed, 4 insertions(+) > > > > > > > > > > Marking this RFC since I haven't actually had a chance to test this, > > > > > we > > > > > we're seeing this deadlock for some customers. > > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > index fdf89fcf1a0c..a14760f9b486 100644 > > > > > --- a/fs/nfsd/filecache.c > > > > > +++ b/fs/nfsd/filecache.c > > > > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > > > > *nf) > > > > > struct fsnotify_mark *mark; > > > > > struct nfsd_file_mark *nfm = NULL, *new; > > > > > struct inode *inode = nf->nf_inode; > > > > > + unsigned int pflags; > > > > > > > > > > do { > > > > > mutex_lock(&nfsd_file_fsnotify_group->mark_mutex); > > > > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > > > > *nf) > > > > > new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF; > > > > > refcount_set(&new->nfm_ref, 1); > > > > > > > > > > + /* fsnotify allocates, avoid recursion back into nfsd > > > > > */ > > > > > + pflags = memalloc_nofs_save(); > > > > > err = fsnotify_add_inode_mark(&new->nfm_mark, inode, > > > > > 0); > > > > > + memalloc_nofs_restore(pflags); > > > > > > > > > > /* > > > > > * If the add was successful, then return the object. > > > > > > > > Isn't that stack trace showing a slab direct reclaim, and not a > > > > filesystem writeback situation? > > > > > > > > Does memalloc_nofs_save()/restore() really fix this problem? It seems > > > > to me that it cannot, particularly since knfsd is not a filesystem, and > > > > so does not ever handle writeback of dirty pages. 
> > > > > > > > > > Maybe NOFS throttles direct reclaims to the point that the problem is > > > harder to hit? > > > > > > This report came in at good timing for me. > > > > > > It demonstrates an issue I did not predict for "volatile"' fanotify marks [1]. > > > As far as I can tell, nfsd filecache is currently the only fsnotify backend that > > > frees fsnotify marks in memory shrinker. "volatile" fanotify marks would also > > > be evictable in that way, so they would expose fanotify to this deadlock. > > > > > > For the short term, maybe nfsd filecache can avoid the problem by checking > > > mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort the > > > shrinker. I wonder if there is a place for a helper mutex_is_locked_by_me()? > > > > > > Jan, > > > > > > A relatively simple fix would be to allocate fsnotify_mark_connector in > > > fsnotify_add_mark() and free it, if a connector already exists for the object. > > > I don't think there is a good reason to optimize away this allocation > > > for the case of a non-first group to set a mark on an object? > > > > Indeed, nasty. Volatile marks will add group->mark_mutex into a set of > > locks grabbed during inode slab reclaim. So any allocation under > > group->mark_mutex has to be GFP_NOFS now. This is not just about connector > > allocations but also mark allocations for fanotify. Moving allocations from > > under mark_mutex is also possible solution but passing preallocated memory > > around is kind of ugly as well. > > Yes, kind of, here is how it looks: > https://github.com/amir73il/linux/commit/643bb6b9f664f70f68ea0393a06338673c4966b3 > https://github.com/amir73il/linux/commit/66f27fc99e46b12f1078e8e2915793040ce50ee7 Yup, not an act of beauty but bearable in the worst case :). > > So the cleanest solution I currently see is > > to come up with helpers like "fsnotify_lock_group() & > > fsnotify_unlock_group()" which will lock/unlock mark_mutex and also do > > memalloc_nofs_save / restore magic. > > > > Sounds good. Won't this cause a regression - more failures to setup new mark > under memory pressure? Well, yes, the chances of hitting ENOMEM under heavy memory pressure are higher. But I don't think that much memory is consumed by connectors or marks that the reduced chances for direct reclaim would really substantially matter for the system as a whole. > Should we maintain a flag in the group FSNOTIFY_GROUP_SHRINKABLE? > and set NOFS state only in that case, so at least we don't cause regression > for existing applications? So that's a possibility I've left in my sleeve ;). We could do it but then we'd also have to tell lockdep that there are two kinds of mark_mutex locks so that it does not complain about possible reclaim deadlocks. Doable but at this point I didn't consider it worth it unless someone comes with a bug report from a real user scenario. Honza
On Sat, Mar 19, 2022 at 2:36 AM Amir Goldstein <amir73il@gmail.com> wrote: > > On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust <trondmy@hammerspace.com> wrote: > > > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote: > > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may > > > result > > > in recursing back into nfsd, resulting in deadlock. See below stack. > > > > > > nfsd D 0 1591536 2 0x80004080 > > > Call Trace: > > > __schedule+0x497/0x630 > > > schedule+0x67/0x90 > > > schedule_preempt_disabled+0xe/0x10 > > > __mutex_lock+0x347/0x4b0 > > > fsnotify_destroy_mark+0x22/0xa0 > > > nfsd_file_free+0x79/0xd0 [nfsd] > > > nfsd_file_put_noref+0x7c/0x90 [nfsd] > > > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd] > > > nfsd_file_lru_scan+0x57/0x80 [nfsd] > > > do_shrink_slab+0x1f2/0x330 > > > shrink_slab+0x244/0x2f0 > > > shrink_node+0xd7/0x490 > > > do_try_to_free_pages+0x12f/0x3b0 > > > try_to_free_pages+0x43f/0x540 > > > __alloc_pages_slowpath+0x6ab/0x11c0 > > > __alloc_pages_nodemask+0x274/0x2c0 > > > alloc_slab_page+0x32/0x2e0 > > > new_slab+0xa6/0x8b0 > > > ___slab_alloc+0x34b/0x520 > > > kmem_cache_alloc+0x1c4/0x250 > > > fsnotify_add_mark_locked+0x18d/0x4c0 > > > fsnotify_add_mark+0x48/0x70 > > > nfsd_file_acquire+0x570/0x6f0 [nfsd] > > > nfsd_read+0xa7/0x1c0 [nfsd] > > > nfsd3_proc_read+0xc1/0x110 [nfsd] > > > nfsd_dispatch+0xf7/0x240 [nfsd] > > > svc_process_common+0x2f4/0x610 [sunrpc] > > > svc_process+0xf9/0x110 [sunrpc] > > > nfsd+0x10e/0x180 [nfsd] > > > kthread+0x130/0x140 > > > ret_from_fork+0x35/0x40 > > > > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com> > > > --- > > > fs/nfsd/filecache.c | 4 ++++ > > > 1 file changed, 4 insertions(+) > > > > > > Marking this RFC since I haven't actually had a chance to test this, > > > we > > > we're seeing this deadlock for some customers. > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > index fdf89fcf1a0c..a14760f9b486 100644 > > > --- a/fs/nfsd/filecache.c > > > +++ b/fs/nfsd/filecache.c > > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > > *nf) > > > struct fsnotify_mark *mark; > > > struct nfsd_file_mark *nfm = NULL, *new; > > > struct inode *inode = nf->nf_inode; > > > + unsigned int pflags; > > > > > > do { > > > mutex_lock(&nfsd_file_fsnotify_group->mark_mutex); > > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file > > > *nf) > > > new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF; > > > refcount_set(&new->nfm_ref, 1); > > > > > > + /* fsnotify allocates, avoid recursion back into nfsd > > > */ > > > + pflags = memalloc_nofs_save(); > > > err = fsnotify_add_inode_mark(&new->nfm_mark, inode, > > > 0); > > > + memalloc_nofs_restore(pflags); > > > > > > /* > > > * If the add was successful, then return the object. > > > > Isn't that stack trace showing a slab direct reclaim, and not a > > filesystem writeback situation? > > > > Does memalloc_nofs_save()/restore() really fix this problem? It seems > > to me that it cannot, particularly since knfsd is not a filesystem, and > > so does not ever handle writeback of dirty pages. > > > > Maybe NOFS throttles direct reclaims to the point that the problem is > harder to hit? (I think I simply got confused - I don't see reason that NOFS would help with direct reclaim, though it does look like the gfp flags are passed via a shrink_control struct so one *could* react to them in the shrinker - again not an area i'm super familiar with) > > This report came in at good timing for me. 
> > It demonstrates an issue I did not predict for "volatile"' fanotify marks [1]. > As far as I can tell, nfsd filecache is currently the only fsnotify backend that > frees fsnotify marks in memory shrinker. "volatile" fanotify marks would also > be evictable in that way, so they would expose fanotify to this deadlock. > > For the short term, maybe nfsd filecache can avoid the problem by checking > mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort the > shrinker. I wonder if there is a place for a helper mutex_is_locked_by_me()? fwiw, it does look like ~5.5 nfsd did stop freeing fanotify marks during reclaim, in the commit "nfsd: Containerise filecache laundrette" (I had sent an earlier email about this, not sure where that's getting caught up, but I do see it on lore...) > > Jan, > > A relatively simple fix would be to allocate fsnotify_mark_connector in > fsnotify_add_mark() and free it, if a connector already exists for the object. > I don't think there is a good reason to optimize away this allocation > for the case of a non-first group to set a mark on an object? > > Thanks, > Amir. > > > > [1] https://lore.kernel.org/linux-fsdevel/20220307155741.1352405-1-amir73il@gmail.com/
On Mon, 2022-03-21 at 12:23 +0100, Jan Kara wrote: > On Sat 19-03-22 11:36:13, Amir Goldstein wrote: > > On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust > > <trondmy@hammerspace.com> wrote: > > > > > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote: > > > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may > > > > result > > > > in recursing back into nfsd, resulting in deadlock. See below > > > > stack. > > > > > > > > nfsd D 0 1591536 2 0x80004080 > > > > Call Trace: > > > > __schedule+0x497/0x630 > > > > schedule+0x67/0x90 > > > > schedule_preempt_disabled+0xe/0x10 > > > > __mutex_lock+0x347/0x4b0 > > > > fsnotify_destroy_mark+0x22/0xa0 > > > > nfsd_file_free+0x79/0xd0 [nfsd] > > > > nfsd_file_put_noref+0x7c/0x90 [nfsd] > > > > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd] > > > > nfsd_file_lru_scan+0x57/0x80 [nfsd] > > > > do_shrink_slab+0x1f2/0x330 > > > > shrink_slab+0x244/0x2f0 > > > > shrink_node+0xd7/0x490 > > > > do_try_to_free_pages+0x12f/0x3b0 > > > > try_to_free_pages+0x43f/0x540 > > > > __alloc_pages_slowpath+0x6ab/0x11c0 > > > > __alloc_pages_nodemask+0x274/0x2c0 > > > > alloc_slab_page+0x32/0x2e0 > > > > new_slab+0xa6/0x8b0 > > > > ___slab_alloc+0x34b/0x520 > > > > kmem_cache_alloc+0x1c4/0x250 > > > > fsnotify_add_mark_locked+0x18d/0x4c0 > > > > fsnotify_add_mark+0x48/0x70 > > > > nfsd_file_acquire+0x570/0x6f0 [nfsd] > > > > nfsd_read+0xa7/0x1c0 [nfsd] > > > > nfsd3_proc_read+0xc1/0x110 [nfsd] > > > > nfsd_dispatch+0xf7/0x240 [nfsd] > > > > svc_process_common+0x2f4/0x610 [sunrpc] > > > > svc_process+0xf9/0x110 [sunrpc] > > > > nfsd+0x10e/0x180 [nfsd] > > > > kthread+0x130/0x140 > > > > ret_from_fork+0x35/0x40 > > > > > > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com> > > > > --- > > > > fs/nfsd/filecache.c | 4 ++++ > > > > 1 file changed, 4 insertions(+) > > > > > > > > Marking this RFC since I haven't actually had a chance to test > > > > this, > > > > we > > > > we're seeing this deadlock for some customers. > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > index fdf89fcf1a0c..a14760f9b486 100644 > > > > --- a/fs/nfsd/filecache.c > > > > +++ b/fs/nfsd/filecache.c > > > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct > > > > nfsd_file > > > > *nf) > > > > struct fsnotify_mark *mark; > > > > struct nfsd_file_mark *nfm = NULL, *new; > > > > struct inode *inode = nf->nf_inode; > > > > + unsigned int pflags; > > > > > > > > do { > > > > mutex_lock(&nfsd_file_fsnotify_group- > > > > >mark_mutex); > > > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct > > > > nfsd_file > > > > *nf) > > > > new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF; > > > > refcount_set(&new->nfm_ref, 1); > > > > > > > > + /* fsnotify allocates, avoid recursion back > > > > into nfsd > > > > */ > > > > + pflags = memalloc_nofs_save(); > > > > err = fsnotify_add_inode_mark(&new->nfm_mark, > > > > inode, > > > > 0); > > > > + memalloc_nofs_restore(pflags); > > > > > > > > /* > > > > * If the add was successful, then return the > > > > object. > > > > > > Isn't that stack trace showing a slab direct reclaim, and not a > > > filesystem writeback situation? > > > > > > Does memalloc_nofs_save()/restore() really fix this problem? It > > > seems > > > to me that it cannot, particularly since knfsd is not a > > > filesystem, and > > > so does not ever handle writeback of dirty pages. > > > > > > > Maybe NOFS throttles direct reclaims to the point that the problem > > is > > harder to hit? 
> > > > This report came in at good timing for me. > > > > It demonstrates an issue I did not predict for "volatile"' fanotify > > marks [1]. > > As far as I can tell, nfsd filecache is currently the only fsnotify > > backend that > > frees fsnotify marks in memory shrinker. "volatile" fanotify marks > > would also > > be evictable in that way, so they would expose fanotify to this > > deadlock. > > > > For the short term, maybe nfsd filecache can avoid the problem by > > checking > > mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort > > the > > shrinker. I wonder if there is a place for a helper > > mutex_is_locked_by_me()? > > > > Jan, > > > > A relatively simple fix would be to allocate > > fsnotify_mark_connector in > > fsnotify_add_mark() and free it, if a connector already exists for > > the object. > > I don't think there is a good reason to optimize away this > > allocation > > for the case of a non-first group to set a mark on an object? > > Indeed, nasty. Volatile marks will add group->mark_mutex into a set > of > locks grabbed during inode slab reclaim. So any allocation under > group->mark_mutex has to be GFP_NOFS now. This is not just about > connector > allocations but also mark allocations for fanotify. Moving > allocations from > under mark_mutex is also possible solution but passing preallocated > memory > around is kind of ugly as well. So the cleanest solution I currently > see is > to come up with helpers like "fsnotify_lock_group() & > fsnotify_unlock_group()" which will lock/unlock mark_mutex and also > do > memalloc_nofs_save / restore magic. As has already been reported, the problem was fixed in Linux 5.5 by the garbage collector rewrite, and so this is no longer an issue. In addition, please note that memalloc_nofs_save/restore and the use of GFP_NOFS was never a solution, because it does not prevent the kind of direct reclaim that was happening here. You'd have to enforce GFP_NOWAIT allocations, afaics.
On Mon, Mar 21, 2022 at 3:50 PM Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> As has already been reported, the problem was fixed in Linux 5.5 by the
> garbage collector rewrite, and so this is no longer an issue.
>
9542e6a643fc ("nfsd: Containerise filecache laundrette") and 36ebbdb96b69
("nfsd: cleanup nfsd_file_lru_dispose()") apply cleanly to 5.4.y for me,
which is still LTS. Since this should fix a real deadlock, would it be
appropriate to include them in 5.4 stable?
On Mon, 2022-03-21 at 16:36 -0700, Khazhy Kumykov wrote:
> On Mon, Mar 21, 2022 at 3:50 PM Trond Myklebust
> <trondmy@hammerspace.com> wrote:
> >
> > As has already been reported, the problem was fixed in Linux 5.5 by
> > the garbage collector rewrite, and so this is no longer an issue.
> >
> 9542e6a643fc ("nfsd: Containerise filecache laundrette"),
> 36ebbdb96b69 ("nfsd: cleanup nfsd_file_lru_dispose()") apply cleanly
> to 5.4.y for me, which is still LTS. Since this should fix a real
> deadlock, would it be appropriate to include them for the 5.4 stable?

That would be OK with me. I'm not aware of any side-effects that might
be a problem for 5.4.
On Mon 21-03-22 22:50:11, Trond Myklebust wrote: > On Mon, 2022-03-21 at 12:23 +0100, Jan Kara wrote: > > On Sat 19-03-22 11:36:13, Amir Goldstein wrote: > > > On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust > > > <trondmy@hammerspace.com> wrote: > > > > > > > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote: > > > > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may > > > > > result > > > > > in recursing back into nfsd, resulting in deadlock. See below > > > > > stack. > > > > > > > > > > nfsd D 0 1591536 2 0x80004080 > > > > > Call Trace: > > > > > __schedule+0x497/0x630 > > > > > schedule+0x67/0x90 > > > > > schedule_preempt_disabled+0xe/0x10 > > > > > __mutex_lock+0x347/0x4b0 > > > > > fsnotify_destroy_mark+0x22/0xa0 > > > > > nfsd_file_free+0x79/0xd0 [nfsd] > > > > > nfsd_file_put_noref+0x7c/0x90 [nfsd] > > > > > nfsd_file_lru_dispose+0x6d/0xa0 [nfsd] > > > > > nfsd_file_lru_scan+0x57/0x80 [nfsd] > > > > > do_shrink_slab+0x1f2/0x330 > > > > > shrink_slab+0x244/0x2f0 > > > > > shrink_node+0xd7/0x490 > > > > > do_try_to_free_pages+0x12f/0x3b0 > > > > > try_to_free_pages+0x43f/0x540 > > > > > __alloc_pages_slowpath+0x6ab/0x11c0 > > > > > __alloc_pages_nodemask+0x274/0x2c0 > > > > > alloc_slab_page+0x32/0x2e0 > > > > > new_slab+0xa6/0x8b0 > > > > > ___slab_alloc+0x34b/0x520 > > > > > kmem_cache_alloc+0x1c4/0x250 > > > > > fsnotify_add_mark_locked+0x18d/0x4c0 > > > > > fsnotify_add_mark+0x48/0x70 > > > > > nfsd_file_acquire+0x570/0x6f0 [nfsd] > > > > > nfsd_read+0xa7/0x1c0 [nfsd] > > > > > nfsd3_proc_read+0xc1/0x110 [nfsd] > > > > > nfsd_dispatch+0xf7/0x240 [nfsd] > > > > > svc_process_common+0x2f4/0x610 [sunrpc] > > > > > svc_process+0xf9/0x110 [sunrpc] > > > > > nfsd+0x10e/0x180 [nfsd] > > > > > kthread+0x130/0x140 > > > > > ret_from_fork+0x35/0x40 > > > > > > > > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com> > > > > > --- > > > > > fs/nfsd/filecache.c | 4 ++++ > > > > > 1 file changed, 4 insertions(+) > > > > > > > > > > Marking this RFC since I haven't actually had a chance to test > > > > > this, > > > > > we > > > > > we're seeing this deadlock for some customers. > > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > index fdf89fcf1a0c..a14760f9b486 100644 > > > > > --- a/fs/nfsd/filecache.c > > > > > +++ b/fs/nfsd/filecache.c > > > > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct > > > > > nfsd_file > > > > > *nf) > > > > > struct fsnotify_mark *mark; > > > > > struct nfsd_file_mark *nfm = NULL, *new; > > > > > struct inode *inode = nf->nf_inode; > > > > > + unsigned int pflags; > > > > > > > > > > do { > > > > > mutex_lock(&nfsd_file_fsnotify_group- > > > > > >mark_mutex); > > > > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct > > > > > nfsd_file > > > > > *nf) > > > > > new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF; > > > > > refcount_set(&new->nfm_ref, 1); > > > > > > > > > > + /* fsnotify allocates, avoid recursion back > > > > > into nfsd > > > > > */ > > > > > + pflags = memalloc_nofs_save(); > > > > > err = fsnotify_add_inode_mark(&new->nfm_mark, > > > > > inode, > > > > > 0); > > > > > + memalloc_nofs_restore(pflags); > > > > > > > > > > /* > > > > > * If the add was successful, then return the > > > > > object. > > > > > > > > Isn't that stack trace showing a slab direct reclaim, and not a > > > > filesystem writeback situation? > > > > > > > > Does memalloc_nofs_save()/restore() really fix this problem? 
It > > > > seems > > > > to me that it cannot, particularly since knfsd is not a > > > > filesystem, and > > > > so does not ever handle writeback of dirty pages. > > > > > > > > > > Maybe NOFS throttles direct reclaims to the point that the problem > > > is > > > harder to hit? > > > > > > This report came in at good timing for me. > > > > > > It demonstrates an issue I did not predict for "volatile"' fanotify > > > marks [1]. > > > As far as I can tell, nfsd filecache is currently the only fsnotify > > > backend that > > > frees fsnotify marks in memory shrinker. "volatile" fanotify marks > > > would also > > > be evictable in that way, so they would expose fanotify to this > > > deadlock. > > > > > > For the short term, maybe nfsd filecache can avoid the problem by > > > checking > > > mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort > > > the > > > shrinker. I wonder if there is a place for a helper > > > mutex_is_locked_by_me()? > > > > > > Jan, > > > > > > A relatively simple fix would be to allocate > > > fsnotify_mark_connector in > > > fsnotify_add_mark() and free it, if a connector already exists for > > > the object. > > > I don't think there is a good reason to optimize away this > > > allocation > > > for the case of a non-first group to set a mark on an object? > > > > Indeed, nasty. Volatile marks will add group->mark_mutex into a set > > of > > locks grabbed during inode slab reclaim. So any allocation under > > group->mark_mutex has to be GFP_NOFS now. This is not just about > > connector > > allocations but also mark allocations for fanotify. Moving > > allocations from > > under mark_mutex is also possible solution but passing preallocated > > memory > > around is kind of ugly as well. So the cleanest solution I currently > > see is > > to come up with helpers like "fsnotify_lock_group() & > > fsnotify_unlock_group()" which will lock/unlock mark_mutex and also > > do > > memalloc_nofs_save / restore magic. > > As has already been reported, the problem was fixed in Linux 5.5 by the > garbage collector rewrite, and so this is no longer an issue. Sorry, I was not clear enough I guess. NFS is not a problem since 5.5 as you say. But Amir has changes in the works after which any filesystem inode reclaim could end up in exactly the same path (calling fsnotify_destroy_mark() from clear_inode()). So these changes would introduce the same deadlock NFS was prone to before 5.5. > In addition, please note that memalloc_nofs_save/restore and the use of > GFP_NOFS was never a solution, because it does not prevent the kind of > direct reclaim that was happening here. You'd have to enforce > GFP_NOWAIT allocations, afaics. GFP_NOFS should solve the above problem because with GFP_NOFS we cannot enter inode reclaim from the memory allocation and thus end up freeing marks. Honza
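For context on why GFP_NOFS (or a memalloc_nofs scope) prevents recursing into
inode reclaim: the superblock shrinker, which is what ends up evicting inodes
and calling clear_inode(), bails out when __GFP_FS is not set in the reclaim
gfp mask. Paraphrased from fs/super.c (elisions marked; not the full function):

static unsigned long super_cache_scan(struct shrinker *shrink,
				      struct shrink_control *sc)
{
	/*
	 * Deadlock avoidance: we may hold various FS locks, so don't recurse
	 * into the FS that called us in clear_inode() and friends.
	 */
	if (!(sc->gfp_mask & __GFP_FS))
		return SHRINK_STOP;
	/* ... prune dentries and inodes ... */
}

The pre-5.5 nfsd filecache shrinker in the trace above performs no such check,
which is consistent with Trond's point that NOFS alone would not have stopped
it from running.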
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index fdf89fcf1a0c..a14760f9b486 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file *nf)
 	struct fsnotify_mark *mark;
 	struct nfsd_file_mark *nfm = NULL, *new;
 	struct inode *inode = nf->nf_inode;
+	unsigned int pflags;
 
 	do {
 		mutex_lock(&nfsd_file_fsnotify_group->mark_mutex);
@@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file *nf)
 		new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF;
 		refcount_set(&new->nfm_ref, 1);
 
+		/* fsnotify allocates, avoid recursion back into nfsd */
+		pflags = memalloc_nofs_save();
 		err = fsnotify_add_inode_mark(&new->nfm_mark, inode, 0);
+		memalloc_nofs_restore(pflags);
 
 		/*
 		 * If the add was successful, then return the object.
fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may result
in recursing back into nfsd, resulting in deadlock. See the stack below.

nfsd            D    0 1591536      2 0x80004080
Call Trace:
 __schedule+0x497/0x630
 schedule+0x67/0x90
 schedule_preempt_disabled+0xe/0x10
 __mutex_lock+0x347/0x4b0
 fsnotify_destroy_mark+0x22/0xa0
 nfsd_file_free+0x79/0xd0 [nfsd]
 nfsd_file_put_noref+0x7c/0x90 [nfsd]
 nfsd_file_lru_dispose+0x6d/0xa0 [nfsd]
 nfsd_file_lru_scan+0x57/0x80 [nfsd]
 do_shrink_slab+0x1f2/0x330
 shrink_slab+0x244/0x2f0
 shrink_node+0xd7/0x490
 do_try_to_free_pages+0x12f/0x3b0
 try_to_free_pages+0x43f/0x540
 __alloc_pages_slowpath+0x6ab/0x11c0
 __alloc_pages_nodemask+0x274/0x2c0
 alloc_slab_page+0x32/0x2e0
 new_slab+0xa6/0x8b0
 ___slab_alloc+0x34b/0x520
 kmem_cache_alloc+0x1c4/0x250
 fsnotify_add_mark_locked+0x18d/0x4c0
 fsnotify_add_mark+0x48/0x70
 nfsd_file_acquire+0x570/0x6f0 [nfsd]
 nfsd_read+0xa7/0x1c0 [nfsd]
 nfsd3_proc_read+0xc1/0x110 [nfsd]
 nfsd_dispatch+0xf7/0x240 [nfsd]
 svc_process_common+0x2f4/0x610 [sunrpc]
 svc_process+0xf9/0x110 [sunrpc]
 nfsd+0x10e/0x180 [nfsd]
 kthread+0x130/0x140
 ret_from_fork+0x35/0x40

Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
---
 fs/nfsd/filecache.c | 4 ++++
 1 file changed, 4 insertions(+)

Marking this RFC since I haven't actually had a chance to test this;
we're seeing this deadlock for some customers.