Message ID | CADpNCvYGqA3a51OH=AcqmKyAmnx3yoZjYPo7US+qk-OMX789vA@mail.gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Leaked nfsd_file due to race condition and early unhash (fs/nfsd/filecache.c) | expand |
On Wed, Jul 03, 2024 at 10:12:33AM -0400, Youzhong Yang wrote: > Hello, > > I'd like to report a nfsd_file leaking issue and propose a fix for it. > > When I tested Linux kernel 6.8 and 6.6, I noticed nfsd_file leaks > which led to undestroyable file systems (zfs), Thanks for the report. Some initial comments: - Do you have a specific reproducer? In other words, what is the simplest program that can run on an NFS client that will trigger this leak, and can you post it? - "zfs" is an out-of-tree file system, so it's not directly supported for NFSD. - The guidelines for patch submission require us to fix issues in upstream Linux first (currently that's v6.10-rc6). Then that fix can be backported to older stable kernels like 6.6. Can you reproduce the leak with one of the in-kernel filesystems (either xfs or btrfs would be great) and with NFSD in 6.10-rc6? One more comment below. > here are some examples: > > crash> struct nfsd_file -x ffff88e160db0460 > struct nfsd_file { > nf_rlist = { > rhead = { > next = 0xffff8921fa2392f1 > }, > next = 0x0 > }, > nf_inode = 0xffff8882bc312ef8, > nf_file = 0xffff88e2015b1500, > nf_cred = 0xffff88e3ab0e7800, > nf_net = 0xffffffff83d41600 <init_net>, > nf_flags = 0x8, > nf_ref = { > refs = { > counter = 0xc0000000 > } > }, > nf_may = 0x4, > nf_mark = 0xffff88e1bddfb320, > nf_lru = { > next = 0xffff88e160db04a8, > prev = 0xffff88e160db04a8 > }, > nf_rcu = { > next = 0x10000000000, > func = 0x0 > }, > nf_birthtime = 0x73d22fc1728 > } > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > ffff88839a53d850 > nf_flags = 0x8, > nf_ref.refs.counter = 0x0 > nf_lru = { > next = 0xffff88839a53d898, > prev = 0xffff88839a53d898 > }, > nf_file = 0xffff88810ede8700, > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > ffff88c32b11e850 > nf_flags = 0x8, > nf_ref.refs.counter = 0x0 > nf_lru = { > next = 0xffff88c32b11e898, > prev = 0xffff88c32b11e898 > }, > nf_file = 0xffff88c20a701c00, > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > ffff88e372709700 > nf_flags = 0xc, > nf_ref.refs.counter = 0x0 > nf_lru = { > next = 0xffff88e372709748, > prev = 0xffff88e372709748 > }, > nf_file = 0xffff88e0725e6400, > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > ffff8982864944d0 > nf_flags = 0xc, > nf_ref.refs.counter = 0x0 > nf_lru = { > next = 0xffff898286494518, > prev = 0xffff898286494518 > }, > nf_file = 0xffff89803c0ff700, > > The leak occurs when nfsd_file_put() races with nfsd_file_cond_queue() > or nfsd_file_lru_cb(). With the following patch, I haven't observed > any leak after a few days heavy nfs load: Our patch submission guidelines require a Signed-off-by: line at the end of the patch description. See the "Sign your work - the Developer's Certificate of Origin" section of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.10-rc6 (Needed here in case your fix is acceptable). > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > index 1a6d5d000b85..2323829f7208 100644 > --- a/fs/nfsd/filecache.c > +++ b/fs/nfsd/filecache.c > @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf) > if (!nfsd_file_lru_remove(nf)) > return; > } > + /* > + * Racing with nfsd_file_cond_queue() or nfsd_file_lru_cb(), > + * it's unhashed but then removed from the dispose list, > + * so we need to free it. > + */ > + if (refcount_read(&nf->nf_ref) == 0 && > + !test_bit(NFSD_FILE_HASHED, &nf->nf_flags) && > + list_empty(&nf->nf_lru)) { > + nfsd_file_free(nf); > + return; > + } > } > if (refcount_dec_and_test(&nf->nf_ref)) > nfsd_file_free(nf); > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > list_head *dispose) > int decrement = 1; > > /* If we raced with someone else unhashing, ignore it */ > - if (!nfsd_file_unhash(nf)) > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > return; > > /* If we can't get a reference, ignore it */ > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > list_head *dispose) > /* If refcount goes to 0, then put on the dispose list */ > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > list_add(&nf->nf_lru, dispose); > + nfsd_file_unhash(nf); > trace_nfsd_file_closing(nf); > } > } > > Please kindly review the patch and let me know if it makes sense. > > Thanks, > > -Youzhong >
Thank you Chuck. Here are my quick answers to your comments: - I don't have a quick reproducer. I reproduced it by using hundreds of nfs clients generating +600K ops under our workload in the testing environment. Theoretically it should be possible to simplify the reproduction but I am still working on it. - I understand zfs is an out-of-tree file system. That's fine. But this leaking can happen to any file system, and leaking is not a good thing no matter what file system it is. - I will try to come up with a reproducer using xfs or btrfs if possible. Now back to the problem itself, here are my findings: - nfsd_file_put() in one thread can race with another thread doing garbage collection (running nfsd_file_gc() -> list_lru_walk() -> nfsd_file_lru_cb()): * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add(). * nfsd_file_lru_add() returns true (thus NFSD_FILE_REFERENCED bit set for nf->nf_flags) * garbage collector kicks in, nfsd_file_lru_cb() clears REFERENCED bit and returns LRU_ROTATE. * garbage collector kicks in again, nfsd_file_lru_cb() now decrements nf->nf_ref to 0, runs nfsd_file_unhash(), removes it from the LRU and adds to the dispose list [list_lru_isolate_move(lru, &nf->nf_lru, head);] * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it tries to remove the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))] * The 'nf' has been added to the 'dispose' list by nfsd_file_lru_cb(), so nfsd_file_lru_remove(nf) simply treats it as part of the LRU and removes it, which leads it to be removed from the 'dispose' list. * At this moment, nf->nf_ref is 0, it's unhashed, and not on the LRU. nfsd_file_put() continues its execution [if (refcount_dec_and_test(&nf->nf_ref))], as nf->nf_ref is already 0, now bad thing happens: nf->nf_ref is set to REFCOUNT_SATURATED, and the 'nf' is leaked. To make this happen, the right timing is crucial. It can be reproduced by adding artifical delays in filecache.c, or hammering the nfsd with tons of ops. - Let's see how nfsd_file_put() can race with nfsd_file_cond_queue(): * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add(). * nfsd_file_lru_add() sets REFERENCED bit and returns true. * 'exportfs -f' or something like that triggers __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). * In nfsd_file_cond_queue(), it runs [if (!nfsd_file_unhash(nf))], unhash is done successfully. * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], now nf->nf_ref goes to 2. * nfsd_file_cond_queue() runs [if (nfsd_file_lru_remove(nf))], it succeeds. * nfsd_file_cond_queue() runs [if (refcount_sub_and_test(decrement, &nf->nf_ref))] (with "decrement" being 2), so the nf->nf_ref goes to 0, the 'nf' is added to the dispost list [list_add(&nf->nf_lru, dispose)] * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it tries to remove the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))], although the 'nf' is not in the LRU, but it is linked in the 'dispose' list, nfsd_file_lru_remove() simply treats it as part of the LRU and removes it. This leads to its removal from the 'dispose' list! * Now nf->ref is 0, unhashed. nfsd_file_put() continues its execution and sets nf->nf_ref to REFCOUNT_SATURATED. The purpose of nf->nf_lru is problematic. As you can see, it is used for the LRU list, and also the 'dispose' list. Adding another 'struct list_head' specifically for the 'dispose' list seems to be a better way of fixing this race condition. Either way works for me. Would you agree my above analysis makes sense? Thanks. Here is my patch with signed-off-by: From: Youzhong Yang <youzhong@gmail.com> Date: Mon, 1 Jul 2024 06:45:22 -0400 Subject: [PATCH] nfsd: fix nfsd_file leaking due to race condition and early unhash Signed-off-by: Youzhong Yang <youzhong@gmail.com> --- fs/nfsd/filecache.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c index 1a6d5d000b85..2323829f7208 100644 --- a/fs/nfsd/filecache.c +++ b/fs/nfsd/filecache.c @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf) if (!nfsd_file_lru_remove(nf)) return; } + /* + * Racing with nfsd_file_cond_queue() or nfsd_file_lru_cb(), + * it's unhashed but then removed from the dispose list, + * so we need to free it. + */ + if (refcount_read(&nf->nf_ref) == 0 && + !test_bit(NFSD_FILE_HASHED, &nf->nf_flags) && + list_empty(&nf->nf_lru)) { + nfsd_file_free(nf); + return; + } } if (refcount_dec_and_test(&nf->nf_ref)) nfsd_file_free(nf); @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct list_head *dispose) int decrement = 1; /* If we raced with someone else unhashing, ignore it */ - if (!nfsd_file_unhash(nf)) + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) return; /* If we can't get a reference, ignore it */ @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct list_head *dispose) /* If refcount goes to 0, then put on the dispose list */ if (refcount_sub_and_test(decrement, &nf->nf_ref)) { list_add(&nf->nf_lru, dispose); + nfsd_file_unhash(nf); trace_nfsd_file_closing(nf); } } -- 2.43.0 On Wed, Jul 3, 2024 at 2:21 PM Chuck Lever <chuck.lever@oracle.com> wrote: > > On Wed, Jul 03, 2024 at 10:12:33AM -0400, Youzhong Yang wrote: > > Hello, > > > > I'd like to report a nfsd_file leaking issue and propose a fix for it. > > > > When I tested Linux kernel 6.8 and 6.6, I noticed nfsd_file leaks > > which led to undestroyable file systems (zfs), > > Thanks for the report. Some initial comments: > > - Do you have a specific reproducer? In other words, what is the > simplest program that can run on an NFS client that will trigger > this leak, and can you post it? > > - "zfs" is an out-of-tree file system, so it's not directly > supported for NFSD. > > - The guidelines for patch submission require us to fix issues in > upstream Linux first (currently that's v6.10-rc6). Then that fix > can be backported to older stable kernels like 6.6. > > Can you reproduce the leak with one of the in-kernel filesystems > (either xfs or btrfs would be great) and with NFSD in 6.10-rc6? > > One more comment below. > > > > here are some examples: > > > > crash> struct nfsd_file -x ffff88e160db0460 > > struct nfsd_file { > > nf_rlist = { > > rhead = { > > next = 0xffff8921fa2392f1 > > }, > > next = 0x0 > > }, > > nf_inode = 0xffff8882bc312ef8, > > nf_file = 0xffff88e2015b1500, > > nf_cred = 0xffff88e3ab0e7800, > > nf_net = 0xffffffff83d41600 <init_net>, > > nf_flags = 0x8, > > nf_ref = { > > refs = { > > counter = 0xc0000000 > > } > > }, > > nf_may = 0x4, > > nf_mark = 0xffff88e1bddfb320, > > nf_lru = { > > next = 0xffff88e160db04a8, > > prev = 0xffff88e160db04a8 > > }, > > nf_rcu = { > > next = 0x10000000000, > > func = 0x0 > > }, > > nf_birthtime = 0x73d22fc1728 > > } > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > ffff88839a53d850 > > nf_flags = 0x8, > > nf_ref.refs.counter = 0x0 > > nf_lru = { > > next = 0xffff88839a53d898, > > prev = 0xffff88839a53d898 > > }, > > nf_file = 0xffff88810ede8700, > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > ffff88c32b11e850 > > nf_flags = 0x8, > > nf_ref.refs.counter = 0x0 > > nf_lru = { > > next = 0xffff88c32b11e898, > > prev = 0xffff88c32b11e898 > > }, > > nf_file = 0xffff88c20a701c00, > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > ffff88e372709700 > > nf_flags = 0xc, > > nf_ref.refs.counter = 0x0 > > nf_lru = { > > next = 0xffff88e372709748, > > prev = 0xffff88e372709748 > > }, > > nf_file = 0xffff88e0725e6400, > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > ffff8982864944d0 > > nf_flags = 0xc, > > nf_ref.refs.counter = 0x0 > > nf_lru = { > > next = 0xffff898286494518, > > prev = 0xffff898286494518 > > }, > > nf_file = 0xffff89803c0ff700, > > > > The leak occurs when nfsd_file_put() races with nfsd_file_cond_queue() > > or nfsd_file_lru_cb(). With the following patch, I haven't observed > > any leak after a few days heavy nfs load: > > Our patch submission guidelines require a Signed-off-by: > line at the end of the patch description. See the "Sign your work - > the Developer's Certificate of Origin" section of > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.10-rc6 > > (Needed here in case your fix is acceptable). > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > index 1a6d5d000b85..2323829f7208 100644 > > --- a/fs/nfsd/filecache.c > > +++ b/fs/nfsd/filecache.c > > @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf) > > if (!nfsd_file_lru_remove(nf)) > > return; > > } > > + /* > > + * Racing with nfsd_file_cond_queue() or nfsd_file_lru_cb(), > > + * it's unhashed but then removed from the dispose list, > > + * so we need to free it. > > + */ > > + if (refcount_read(&nf->nf_ref) == 0 && > > + !test_bit(NFSD_FILE_HASHED, &nf->nf_flags) && > > + list_empty(&nf->nf_lru)) { > > + nfsd_file_free(nf); > > + return; > > + } > > } > > if (refcount_dec_and_test(&nf->nf_ref)) > > nfsd_file_free(nf); > > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > > list_head *dispose) > > int decrement = 1; > > > > /* If we raced with someone else unhashing, ignore it */ > > - if (!nfsd_file_unhash(nf)) > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > return; > > > > /* If we can't get a reference, ignore it */ > > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > > list_head *dispose) > > /* If refcount goes to 0, then put on the dispose list */ > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > list_add(&nf->nf_lru, dispose); > > + nfsd_file_unhash(nf); > > trace_nfsd_file_closing(nf); > > } > > } > > > > Please kindly review the patch and let me know if it makes sense. > > > > Thanks, > > > > -Youzhong > > > > -- > Chuck Lever
> On Jul 3, 2024, at 4:46 PM, Youzhong Yang <youzhong@gmail.com> wrote: > > Thank you Chuck. Here are my quick answers to your comments: > > - I don't have a quick reproducer. I reproduced it by using hundreds > of nfs clients generating +600K ops under our workload in the testing > environment. Theoretically it should be possible to simplify the > reproduction but I am still working on it. > > - I understand zfs is an out-of-tree file system. That's fine. But > this leaking can happen to any file system, and leaking is not a good > thing no matter what file system it is. No doubt, but we need to establish that this is not an issue in the Linux port of zfs. > - I will try to come up with a reproducer using xfs or btrfs if possible. Thanks. A reliable reproducer will demonstrate very clearly that the bug is addressed when the (final) fix has been applied. > Now back to the problem itself, here are my findings: > > - nfsd_file_put() in one thread can race with another thread doing > garbage collection (running nfsd_file_gc() -> list_lru_walk() -> > nfsd_file_lru_cb()): > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add(). > * nfsd_file_lru_add() returns true (thus NFSD_FILE_REFERENCED bit > set for nf->nf_flags) > * garbage collector kicks in, nfsd_file_lru_cb() clears REFERENCED > bit and returns LRU_ROTATE. > * garbage collector kicks in again, nfsd_file_lru_cb() now > decrements nf->nf_ref to 0, runs nfsd_file_unhash(), removes it from > the LRU and adds to the dispose list [list_lru_isolate_move(lru, > &nf->nf_lru, head);] > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it > tries to remove the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))] > * The 'nf' has been added to the 'dispose' list by > nfsd_file_lru_cb(), so nfsd_file_lru_remove(nf) simply treats it as > part of the LRU and removes it, which leads it to be removed from the > 'dispose' list. > * At this moment, nf->nf_ref is 0, it's unhashed, and not on the > LRU. nfsd_file_put() continues its execution [if > (refcount_dec_and_test(&nf->nf_ref))], as nf->nf_ref is already 0, now > bad thing happens: nf->nf_ref is set to REFCOUNT_SATURATED, and the > 'nf' is leaked. > > To make this happen, the right timing is crucial. It can be reproduced > by adding artifical delays in filecache.c, or hammering the nfsd with > tons of ops. > > - Let's see how nfsd_file_put() can race with nfsd_file_cond_queue(): > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add(). > * nfsd_file_lru_add() sets REFERENCED bit and returns true. > * 'exportfs -f' or something like that triggers > __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). > * In nfsd_file_cond_queue(), it runs [if (!nfsd_file_unhash(nf))], > unhash is done successfully. > * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], now > nf->nf_ref goes to 2. > * nfsd_file_cond_queue() runs [if (nfsd_file_lru_remove(nf))], it succeeds. > * nfsd_file_cond_queue() runs [if (refcount_sub_and_test(decrement, > &nf->nf_ref))] (with "decrement" being 2), so the nf->nf_ref goes to > 0, the 'nf' is added to the dispost list [list_add(&nf->nf_lru, > dispose)] > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it > tries to remove the 'nf' from the LRU [if > (!nfsd_file_lru_remove(nf))], although the 'nf' is not in the LRU, but > it is linked in the 'dispose' list, nfsd_file_lru_remove() simply > treats it as part of the LRU and removes it. This leads to its removal > from the 'dispose' list! > * Now nf->ref is 0, unhashed. nfsd_file_put() continues its > execution and sets nf->nf_ref to REFCOUNT_SATURATED. > > The purpose of nf->nf_lru is problematic. As you can see, it is used > for the LRU list, and also the 'dispose' list. Adding another 'struct > list_head' specifically for the 'dispose' list seems to be a better > way of fixing this race condition. Either way works for me. Generally this double-use of nf_lru has been determined to be safe. If there is a bug in the list manipulation logic, it will still be there if there are two list_head fields; it will simply be masked. > Would you agree my above analysis makes sense? Thanks. Jeff is the expert in this area. I will also study it. > Here is my patch with signed-off-by: > > From: Youzhong Yang <youzhong@gmail.com> > Date: Mon, 1 Jul 2024 06:45:22 -0400 > Subject: [PATCH] nfsd: fix nfsd_file leaking due to race condition and early > unhash Much better, but you also need a patch description. Two paragraphs that explain the bug and the proposed solution should be adequate. > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > --- > fs/nfsd/filecache.c | 14 +++++++++++++- > 1 file changed, 13 insertions(+), 1 deletion(-) > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > index 1a6d5d000b85..2323829f7208 100644 > --- a/fs/nfsd/filecache.c > +++ b/fs/nfsd/filecache.c > @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf) > if (!nfsd_file_lru_remove(nf)) > return; > } > + /* > + * Racing with nfsd_file_cond_queue() or nfsd_file_lru_cb(), > + * it's unhashed but then removed from the dispose list, > + * so we need to free it. > + */ > + if (refcount_read(&nf->nf_ref) == 0 && > + !test_bit(NFSD_FILE_HASHED, &nf->nf_flags) && > + list_empty(&nf->nf_lru)) { > + nfsd_file_free(nf); > + return; > + } > } > if (refcount_dec_and_test(&nf->nf_ref)) > nfsd_file_free(nf); > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > list_head *dispose) > int decrement = 1; > > /* If we raced with someone else unhashing, ignore it */ > - if (!nfsd_file_unhash(nf)) > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > return; > > /* If we can't get a reference, ignore it */ > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > list_head *dispose) > /* If refcount goes to 0, then put on the dispose list */ > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > list_add(&nf->nf_lru, dispose); > + nfsd_file_unhash(nf); > trace_nfsd_file_closing(nf); > } > } > -- > 2.43.0 > > On Wed, Jul 3, 2024 at 2:21 PM Chuck Lever <chuck.lever@oracle.com> wrote: >> >> On Wed, Jul 03, 2024 at 10:12:33AM -0400, Youzhong Yang wrote: >>> Hello, >>> >>> I'd like to report a nfsd_file leaking issue and propose a fix for it. >>> >>> When I tested Linux kernel 6.8 and 6.6, I noticed nfsd_file leaks >>> which led to undestroyable file systems (zfs), >> >> Thanks for the report. Some initial comments: >> >> - Do you have a specific reproducer? In other words, what is the >> simplest program that can run on an NFS client that will trigger >> this leak, and can you post it? >> >> - "zfs" is an out-of-tree file system, so it's not directly >> supported for NFSD. >> >> - The guidelines for patch submission require us to fix issues in >> upstream Linux first (currently that's v6.10-rc6). Then that fix >> can be backported to older stable kernels like 6.6. >> >> Can you reproduce the leak with one of the in-kernel filesystems >> (either xfs or btrfs would be great) and with NFSD in 6.10-rc6? >> >> One more comment below. >> >> >>> here are some examples: >>> >>> crash> struct nfsd_file -x ffff88e160db0460 >>> struct nfsd_file { >>> nf_rlist = { >>> rhead = { >>> next = 0xffff8921fa2392f1 >>> }, >>> next = 0x0 >>> }, >>> nf_inode = 0xffff8882bc312ef8, >>> nf_file = 0xffff88e2015b1500, >>> nf_cred = 0xffff88e3ab0e7800, >>> nf_net = 0xffffffff83d41600 <init_net>, >>> nf_flags = 0x8, >>> nf_ref = { >>> refs = { >>> counter = 0xc0000000 >>> } >>> }, >>> nf_may = 0x4, >>> nf_mark = 0xffff88e1bddfb320, >>> nf_lru = { >>> next = 0xffff88e160db04a8, >>> prev = 0xffff88e160db04a8 >>> }, >>> nf_rcu = { >>> next = 0x10000000000, >>> func = 0x0 >>> }, >>> nf_birthtime = 0x73d22fc1728 >>> } >>> >>> crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x >>> ffff88839a53d850 >>> nf_flags = 0x8, >>> nf_ref.refs.counter = 0x0 >>> nf_lru = { >>> next = 0xffff88839a53d898, >>> prev = 0xffff88839a53d898 >>> }, >>> nf_file = 0xffff88810ede8700, >>> >>> crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x >>> ffff88c32b11e850 >>> nf_flags = 0x8, >>> nf_ref.refs.counter = 0x0 >>> nf_lru = { >>> next = 0xffff88c32b11e898, >>> prev = 0xffff88c32b11e898 >>> }, >>> nf_file = 0xffff88c20a701c00, >>> >>> crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x >>> ffff88e372709700 >>> nf_flags = 0xc, >>> nf_ref.refs.counter = 0x0 >>> nf_lru = { >>> next = 0xffff88e372709748, >>> prev = 0xffff88e372709748 >>> }, >>> nf_file = 0xffff88e0725e6400, >>> >>> crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x >>> ffff8982864944d0 >>> nf_flags = 0xc, >>> nf_ref.refs.counter = 0x0 >>> nf_lru = { >>> next = 0xffff898286494518, >>> prev = 0xffff898286494518 >>> }, >>> nf_file = 0xffff89803c0ff700, >>> >>> The leak occurs when nfsd_file_put() races with nfsd_file_cond_queue() >>> or nfsd_file_lru_cb(). With the following patch, I haven't observed >>> any leak after a few days heavy nfs load: >> >> Our patch submission guidelines require a Signed-off-by: >> line at the end of the patch description. See the "Sign your work - >> the Developer's Certificate of Origin" section of >> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.10-rc6 >> >> (Needed here in case your fix is acceptable). >> >> >>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c >>> index 1a6d5d000b85..2323829f7208 100644 >>> --- a/fs/nfsd/filecache.c >>> +++ b/fs/nfsd/filecache.c >>> @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf) >>> if (!nfsd_file_lru_remove(nf)) >>> return; >>> } >>> + /* >>> + * Racing with nfsd_file_cond_queue() or nfsd_file_lru_cb(), >>> + * it's unhashed but then removed from the dispose list, >>> + * so we need to free it. >>> + */ >>> + if (refcount_read(&nf->nf_ref) == 0 && >>> + !test_bit(NFSD_FILE_HASHED, &nf->nf_flags) && >>> + list_empty(&nf->nf_lru)) { >>> + nfsd_file_free(nf); >>> + return; >>> + } >>> } >>> if (refcount_dec_and_test(&nf->nf_ref)) >>> nfsd_file_free(nf); >>> @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct >>> list_head *dispose) >>> int decrement = 1; >>> >>> /* If we raced with someone else unhashing, ignore it */ >>> - if (!nfsd_file_unhash(nf)) >>> + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) >>> return; >>> >>> /* If we can't get a reference, ignore it */ >>> @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct >>> list_head *dispose) >>> /* If refcount goes to 0, then put on the dispose list */ >>> if (refcount_sub_and_test(decrement, &nf->nf_ref)) { >>> list_add(&nf->nf_lru, dispose); >>> + nfsd_file_unhash(nf); >>> trace_nfsd_file_closing(nf); >>> } >>> } >>> >>> Please kindly review the patch and let me know if it makes sense. >>> >>> Thanks, >>> >>> -Youzhong >>> >> >> -- >> Chuck Lever -- Chuck Lever
On Wed, 2024-07-03 at 16:46 -0400, Youzhong Yang wrote: > Thank you Chuck. Here are my quick answers to your comments: > > - I don't have a quick reproducer. I reproduced it by using hundreds > of nfs clients generating +600K ops under our workload in the testing > environment. Theoretically it should be possible to simplify the > reproduction but I am still working on it. > > - I understand zfs is an out-of-tree file system. That's fine. But > this leaking can happen to any file system, and leaking is not a good > thing no matter what file system it is. > > - I will try to come up with a reproducer using xfs or btrfs if possible. > > Now back to the problem itself, here are my findings: > > - nfsd_file_put() in one thread can race with another thread doing > garbage collection (running nfsd_file_gc() -> list_lru_walk() -> > nfsd_file_lru_cb()): > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add(). > * nfsd_file_lru_add() returns true (thus NFSD_FILE_REFERENCED bit > set for nf->nf_flags) > * garbage collector kicks in, nfsd_file_lru_cb() clears REFERENCED > bit and returns LRU_ROTATE. > * garbage collector kicks in again, nfsd_file_lru_cb() now > decrements nf->nf_ref to 0, runs nfsd_file_unhash(), removes it from > the LRU and adds to the dispose list [list_lru_isolate_move(lru, > &nf->nf_lru, head);] > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it > tries to remove the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))] > * The 'nf' has been added to the 'dispose' list by > nfsd_file_lru_cb(), so nfsd_file_lru_remove(nf) simply treats it as > part of the LRU and removes it, which leads it to be removed from the > 'dispose' list. > * At this moment, nf->nf_ref is 0, it's unhashed, and not on the > LRU. nfsd_file_put() continues its execution [if > (refcount_dec_and_test(&nf->nf_ref))], as nf->nf_ref is already 0, now > bad thing happens: nf->nf_ref is set to REFCOUNT_SATURATED, and the > 'nf' is leaked. > > To make this happen, the right timing is crucial. It can be reproduced > by adding artifical delays in filecache.c, or hammering the nfsd with > tons of ops. > > - Let's see how nfsd_file_put() can race with nfsd_file_cond_queue(): > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add(). > * nfsd_file_lru_add() sets REFERENCED bit and returns true. > * 'exportfs -f' or something like that triggers > __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). > * In nfsd_file_cond_queue(), it runs [if (!nfsd_file_unhash(nf))], > unhash is done successfully. > * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], now > nf->nf_ref goes to 2. > * nfsd_file_cond_queue() runs [if (nfsd_file_lru_remove(nf))], it succeeds. > * nfsd_file_cond_queue() runs [if (refcount_sub_and_test(decrement, > &nf->nf_ref))] (with "decrement" being 2), so the nf->nf_ref goes to > 0, the 'nf' is added to the dispost list [list_add(&nf->nf_lru, > dispose)] > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it > tries to remove the 'nf' from the LRU [if > (!nfsd_file_lru_remove(nf))], although the 'nf' is not in the LRU, but > it is linked in the 'dispose' list, nfsd_file_lru_remove() simply > treats it as part of the LRU and removes it. This leads to its removal > from the 'dispose' list! > * Now nf->ref is 0, unhashed. nfsd_file_put() continues its > execution and sets nf->nf_ref to REFCOUNT_SATURATED. > > The purpose of nf->nf_lru is problematic. As you can see, it is used > for the LRU list, and also the 'dispose' list. Adding another 'struct > list_head' specifically for the 'dispose' list seems to be a better > way of fixing this race condition. Either way works for me. > > Would you agree my above analysis makes sense? Thanks. > I think so. It's been a while since I've done much work in this code, but it does sound like there is a race in the LRU handling. Like Chuck said, the nf->nf_lru list should be safe to use for multiple purposes, but that's only the case if we're not using that list as an indicator. The list_lru code does check this: if (!list_empty(item)) { ...so if we ever check this while it's sitting on the dispose list, it will handle it incorrectly. It sounds like that's the root cause of the problem you're seeing? If so, then maybe a separate list_head for disposal would be better. > Here is my patch with signed-off-by: > > From: Youzhong Yang <youzhong@gmail.com> > Date: Mon, 1 Jul 2024 06:45:22 -0400 > Subject: [PATCH] nfsd: fix nfsd_file leaking due to race condition and early > unhash > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > --- > fs/nfsd/filecache.c | 14 +++++++++++++- > 1 file changed, 13 insertions(+), 1 deletion(-) > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > index 1a6d5d000b85..2323829f7208 100644 > --- a/fs/nfsd/filecache.c > +++ b/fs/nfsd/filecache.c > @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf) > if (!nfsd_file_lru_remove(nf)) > return; > } > + /* > + * Racing with nfsd_file_cond_queue() or nfsd_file_lru_cb(), > + * it's unhashed but then removed from the dispose list, > + * so we need to free it. > + */ > + if (refcount_read(&nf->nf_ref) == 0 && A refcount_read in this path is a red flag to me. Anytime you're just looking at the refcount without changing anything just screams out "race condition". In this case, what guarantee is there that this won't run afoul of the timing? We could check this and find out it's 1 just before it goes to 0 and you check the other conditions. Does anything prevent that? > + !test_bit(NFSD_FILE_HASHED, &nf->nf_flags) && > + list_empty(&nf->nf_lru)) { > + nfsd_file_free(nf); > + return; > + } > } > if (refcount_dec_and_test(&nf->nf_ref)) > nfsd_file_free(nf); > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > list_head *dispose) > int decrement = 1; > > /* If we raced with someone else unhashing, ignore it */ > - if (!nfsd_file_unhash(nf)) > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > return; Same here: you're just testing for the HASHED bit, but could this also race with someone who is setting it just after you get here. Why is that not a problem? > > /* If we can't get a reference, ignore it */ > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > list_head *dispose) > /* If refcount goes to 0, then put on the dispose list */ > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > list_add(&nf->nf_lru, dispose); > + nfsd_file_unhash(nf); > trace_nfsd_file_closing(nf); > } > } > -- > 2.43.0 > > On Wed, Jul 3, 2024 at 2:21 PM Chuck Lever <chuck.lever@oracle.com> wrote: > > > > On Wed, Jul 03, 2024 at 10:12:33AM -0400, Youzhong Yang wrote: > > > Hello, > > > > > > I'd like to report a nfsd_file leaking issue and propose a fix for it. > > > > > > When I tested Linux kernel 6.8 and 6.6, I noticed nfsd_file leaks > > > which led to undestroyable file systems (zfs), > > > > Thanks for the report. Some initial comments: > > > > - Do you have a specific reproducer? In other words, what is the > > simplest program that can run on an NFS client that will trigger > > this leak, and can you post it? > > > > - "zfs" is an out-of-tree file system, so it's not directly > > supported for NFSD. > > > > - The guidelines for patch submission require us to fix issues in > > upstream Linux first (currently that's v6.10-rc6). Then that fix > > can be backported to older stable kernels like 6.6. > > > > Can you reproduce the leak with one of the in-kernel filesystems > > (either xfs or btrfs would be great) and with NFSD in 6.10-rc6? > > > > One more comment below. > > > > > > > here are some examples: > > > > > > crash> struct nfsd_file -x ffff88e160db0460 > > > struct nfsd_file { > > > nf_rlist = { > > > rhead = { > > > next = 0xffff8921fa2392f1 > > > }, > > > next = 0x0 > > > }, > > > nf_inode = 0xffff8882bc312ef8, > > > nf_file = 0xffff88e2015b1500, > > > nf_cred = 0xffff88e3ab0e7800, > > > nf_net = 0xffffffff83d41600 <init_net>, > > > nf_flags = 0x8, > > > nf_ref = { > > > refs = { > > > counter = 0xc0000000 > > > } > > > }, > > > nf_may = 0x4, > > > nf_mark = 0xffff88e1bddfb320, > > > nf_lru = { > > > next = 0xffff88e160db04a8, > > > prev = 0xffff88e160db04a8 > > > }, > > > nf_rcu = { > > > next = 0x10000000000, > > > func = 0x0 > > > }, > > > nf_birthtime = 0x73d22fc1728 > > > } > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > > ffff88839a53d850 > > > nf_flags = 0x8, > > > nf_ref.refs.counter = 0x0 > > > nf_lru = { > > > next = 0xffff88839a53d898, > > > prev = 0xffff88839a53d898 > > > }, > > > nf_file = 0xffff88810ede8700, > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > > ffff88c32b11e850 > > > nf_flags = 0x8, > > > nf_ref.refs.counter = 0x0 > > > nf_lru = { > > > next = 0xffff88c32b11e898, > > > prev = 0xffff88c32b11e898 > > > }, > > > nf_file = 0xffff88c20a701c00, > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > > ffff88e372709700 > > > nf_flags = 0xc, > > > nf_ref.refs.counter = 0x0 > > > nf_lru = { > > > next = 0xffff88e372709748, > > > prev = 0xffff88e372709748 > > > }, > > > nf_file = 0xffff88e0725e6400, > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > > ffff8982864944d0 > > > nf_flags = 0xc, > > > nf_ref.refs.counter = 0x0 > > > nf_lru = { > > > next = 0xffff898286494518, > > > prev = 0xffff898286494518 > > > }, > > > nf_file = 0xffff89803c0ff700, > > > > > > The leak occurs when nfsd_file_put() races with nfsd_file_cond_queue() > > > or nfsd_file_lru_cb(). With the following patch, I haven't observed > > > any leak after a few days heavy nfs load: > > > > Our patch submission guidelines require a Signed-off-by: > > line at the end of the patch description. See the "Sign your work - > > the Developer's Certificate of Origin" section of > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.10-rc6 > > > > (Needed here in case your fix is acceptable). > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > index 1a6d5d000b85..2323829f7208 100644 > > > --- a/fs/nfsd/filecache.c > > > +++ b/fs/nfsd/filecache.c > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf) > > > if (!nfsd_file_lru_remove(nf)) > > > return; > > > } > > > + /* > > > + * Racing with nfsd_file_cond_queue() or nfsd_file_lru_cb(), > > > + * it's unhashed but then removed from the dispose list, > > > + * so we need to free it. > > > + */ > > > + if (refcount_read(&nf->nf_ref) == 0 && > > > + !test_bit(NFSD_FILE_HASHED, &nf->nf_flags) && > > > + list_empty(&nf->nf_lru)) { > > > + nfsd_file_free(nf); > > > + return; > > > + } > > > } > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > nfsd_file_free(nf); > > > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > > > list_head *dispose) > > > int decrement = 1; > > > > > > /* If we raced with someone else unhashing, ignore it */ > > > - if (!nfsd_file_unhash(nf)) > > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > > return; > > > > > > /* If we can't get a reference, ignore it */ > > > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > > > list_head *dispose) > > > /* If refcount goes to 0, then put on the dispose list */ > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > list_add(&nf->nf_lru, dispose); > > > + nfsd_file_unhash(nf); > > > trace_nfsd_file_closing(nf); > > > } > > > } > > > > > > Please kindly review the patch and let me know if it makes sense. > > > > > > Thanks, > > > > > > -Youzhong > > > > > > > -- > > Chuck Lever >
Thank you Jeff for your invaluable insights. I was leaning towards adding a new list_head too, and tested this approach on kernel 6.6 by continuously hammering the server with heavy nfs load for the last few days, not a single leak. Here goes the patch (based on Linux kernel master branch), please review: From: Youzhong Yang <youzhong@gmail.com> Date: Thu, 4 Jul 2024 11:25:40 -0400 Subject: [PATCH] nfsd: fix nfsd_file leaking due to mixed use of nf->nf_lru nfsd_file_put() in one thread can race with another thread doing garbage collection (running nfsd_file_gc() -> list_lru_walk() -> nfsd_file_lru_cb()): * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add(). * nfsd_file_lru_add() returns true (with NFSD_FILE_REFERENCED bit set) * garbage collector kicks in, nfsd_file_lru_cb() clears REFERENCED bit and returns LRU_ROTATE. * garbage collector kicks in again, nfsd_file_lru_cb() now decrements nf->nf_ref to 0, runs nfsd_file_unhash(), removes it from the LRU and adds to the dispose list [list_lru_isolate_move(lru, &nf->nf_lru, head)] * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it tries to remove the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))]. The 'nf' has been added to the 'dispose' list by nfsd_file_lru_cb(), so nfsd_file_lru_remove(nf) simply treats it as part of the LRU and removes it, which leads to its removal from the 'dispose' list. * At this moment, 'nf' is unhashed with its nf_ref being 0, and not on the LRU. nfsd_file_put() continues its execution [if (refcount_dec_and_test(&nf->nf_ref))], as nf->nf_ref is already 0, nf->nf_ref is set to REFCOUNT_SATURATED, and the 'nf' gets no chance of being freed. nfsd_file_put() can also race with nfsd_file_cond_queue(): * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add(). * nfsd_file_lru_add() sets REFERENCED bit and returns true. * Some userland application runs 'exportfs -f' or something like that, which triggers __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). * In nfsd_file_cond_queue(), it runs [if (!nfsd_file_unhash(nf))], unhash is done successfully. * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], now nf->nf_ref goes to 2. * nfsd_file_cond_queue() runs [if (nfsd_file_lru_remove(nf))], it succeeds. * nfsd_file_cond_queue() runs [if (refcount_sub_and_test(decrement, &nf->nf_ref))] (with "decrement" being 2), so the nf->nf_ref goes to 0, the 'nf' is added to the dispose list [list_add(&nf->nf_lru, dispose)] * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it tries to remove the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))], although the 'nf' is not in the LRU, but it is linked in the 'dispose' list, nfsd_file_lru_remove() simply treats it as part of the LRU and removes it. This leads to its removal from the 'dispose' list! * Now nf->ref is 0, unhashed. nfsd_file_put() continues its execution and set nf->nf_ref to REFCOUNT_SATURATED. As shown in the above analysis, using nf_lru for both the LRU list and dispose list can cause the leaks. This patch adds a new list_head nf_gc in struct nfsd_file, and uses it for the dispose list. It's not expected to have a nfsd_file unhashed but it's not added to the dispose list, so in nfsd_file_cond_queue() and nfsd_file_lru_cb() nfsd_file is unhashed after being added to the dispose list. Signed-off-by: Youzhong Yang <youzhong@gmail.com> --- fs/nfsd/filecache.c | 23 ++++++++++++++--------- fs/nfsd/filecache.h | 1 + 2 files changed, 15 insertions(+), 9 deletions(-) diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c index ad9083ca144b..3aef2ddfce94 100644 --- a/fs/nfsd/filecache.c +++ b/fs/nfsd/filecache.c @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need, return NULL; INIT_LIST_HEAD(&nf->nf_lru); + INIT_LIST_HEAD(&nf->nf_gc); nf->nf_birthtime = ktime_get(); nf->nf_file = NULL; nf->nf_cred = get_current_cred(); @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct list_head *dispose) struct nfsd_file *nf; while (!list_empty(dispose)) { - nf = list_first_entry(dispose, struct nfsd_file, nf_lru); - list_del_init(&nf->nf_lru); + nf = list_first_entry(dispose, struct nfsd_file, nf_gc); + list_del_init(&nf->nf_gc); nfsd_file_free(nf); } } @@ -411,12 +412,12 @@ nfsd_file_dispose_list_delayed(struct list_head *dispose) { while(!list_empty(dispose)) { struct nfsd_file *nf = list_first_entry(dispose, - struct nfsd_file, nf_lru); + struct nfsd_file, nf_gc); struct nfsd_net *nn = net_generic(nf->nf_net, nfsd_net_id); struct nfsd_fcache_disposal *l = nn->fcache_disposal; spin_lock(&l->lock); - list_move_tail(&nf->nf_lru, &l->freeme); + list_move_tail(&nf->nf_gc, &l->freeme); spin_unlock(&l->lock); svc_wake_up(nn->nfsd_serv); } @@ -502,8 +503,10 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru, } /* Refcount went to zero. Unhash it and queue it to the dispose list */ + list_lru_isolate(lru, &nf->nf_lru); + list_add(&nf->nf_gc, head); + /* Unhash after removing from LRU and adding to dispose list */ nfsd_file_unhash(nf); - list_lru_isolate_move(lru, &nf->nf_lru, head); this_cpu_inc(nfsd_file_evictions); trace_nfsd_file_gc_disposed(nf); return LRU_REMOVED; @@ -565,7 +568,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct list_head *dispose) int decrement = 1; /* If we raced with someone else unhashing, ignore it */ - if (!nfsd_file_unhash(nf)) + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) return; /* If we can't get a reference, ignore it */ @@ -578,7 +581,9 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct list_head *dispose) /* If refcount goes to 0, then put on the dispose list */ if (refcount_sub_and_test(decrement, &nf->nf_ref)) { - list_add(&nf->nf_lru, dispose); + list_add(&nf->nf_gc, dispose); + /* Unhash after adding to dispose list */ + nfsd_file_unhash(nf); trace_nfsd_file_closing(nf); } } @@ -654,8 +659,8 @@ nfsd_file_close_inode_sync(struct inode *inode) nfsd_file_queue_for_close(inode, &dispose); while (!list_empty(&dispose)) { - nf = list_first_entry(&dispose, struct nfsd_file, nf_lru); - list_del_init(&nf->nf_lru); + nf = list_first_entry(&dispose, struct nfsd_file, nf_gc); + list_del_init(&nf->nf_gc); nfsd_file_free(nf); } } diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h index c61884def906..3fbec24eea6c 100644 --- a/fs/nfsd/filecache.h +++ b/fs/nfsd/filecache.h @@ -44,6 +44,7 @@ struct nfsd_file { struct nfsd_file_mark *nf_mark; struct list_head nf_lru; + struct list_head nf_gc; struct rcu_head nf_rcu; ktime_t nf_birthtime; }; -- 2.34.1 On Thu, Jul 4, 2024 at 7:14 AM Jeff Layton <jlayton@poochiereds.net> wrote: > > On Wed, 2024-07-03 at 16:46 -0400, Youzhong Yang wrote: > > Thank you Chuck. Here are my quick answers to your comments: > > > > - I don't have a quick reproducer. I reproduced it by using hundreds > > of nfs clients generating +600K ops under our workload in the testing > > environment. Theoretically it should be possible to simplify the > > reproduction but I am still working on it. > > > > - I understand zfs is an out-of-tree file system. That's fine. But > > this leaking can happen to any file system, and leaking is not a good > > thing no matter what file system it is. > > > > - I will try to come up with a reproducer using xfs or btrfs if possible. > > > > Now back to the problem itself, here are my findings: > > > > - nfsd_file_put() in one thread can race with another thread doing > > garbage collection (running nfsd_file_gc() -> list_lru_walk() -> > > nfsd_file_lru_cb()): > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add(). > > * nfsd_file_lru_add() returns true (thus NFSD_FILE_REFERENCED bit > > set for nf->nf_flags) > > * garbage collector kicks in, nfsd_file_lru_cb() clears REFERENCED > > bit and returns LRU_ROTATE. > > * garbage collector kicks in again, nfsd_file_lru_cb() now > > decrements nf->nf_ref to 0, runs nfsd_file_unhash(), removes it from > > the LRU and adds to the dispose list [list_lru_isolate_move(lru, > > &nf->nf_lru, head);] > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it > > tries to remove the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))] > > * The 'nf' has been added to the 'dispose' list by > > nfsd_file_lru_cb(), so nfsd_file_lru_remove(nf) simply treats it as > > part of the LRU and removes it, which leads it to be removed from the > > 'dispose' list. > > * At this moment, nf->nf_ref is 0, it's unhashed, and not on the > > LRU. nfsd_file_put() continues its execution [if > > (refcount_dec_and_test(&nf->nf_ref))], as nf->nf_ref is already 0, now > > bad thing happens: nf->nf_ref is set to REFCOUNT_SATURATED, and the > > 'nf' is leaked. > > > > To make this happen, the right timing is crucial. It can be reproduced > > by adding artifical delays in filecache.c, or hammering the nfsd with > > tons of ops. > > > > - Let's see how nfsd_file_put() can race with nfsd_file_cond_queue(): > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add(). > > * nfsd_file_lru_add() sets REFERENCED bit and returns true. > > * 'exportfs -f' or something like that triggers > > __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). > > * In nfsd_file_cond_queue(), it runs [if (!nfsd_file_unhash(nf))], > > unhash is done successfully. > > * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], now > > nf->nf_ref goes to 2. > > * nfsd_file_cond_queue() runs [if (nfsd_file_lru_remove(nf))], it succeeds. > > * nfsd_file_cond_queue() runs [if (refcount_sub_and_test(decrement, > > &nf->nf_ref))] (with "decrement" being 2), so the nf->nf_ref goes to > > 0, the 'nf' is added to the dispost list [list_add(&nf->nf_lru, > > dispose)] > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it > > tries to remove the 'nf' from the LRU [if > > (!nfsd_file_lru_remove(nf))], although the 'nf' is not in the LRU, but > > it is linked in the 'dispose' list, nfsd_file_lru_remove() simply > > treats it as part of the LRU and removes it. This leads to its removal > > from the 'dispose' list! > > * Now nf->ref is 0, unhashed. nfsd_file_put() continues its > > execution and sets nf->nf_ref to REFCOUNT_SATURATED. > > > > The purpose of nf->nf_lru is problematic. As you can see, it is used > > for the LRU list, and also the 'dispose' list. Adding another 'struct > > list_head' specifically for the 'dispose' list seems to be a better > > way of fixing this race condition. Either way works for me. > > > > Would you agree my above analysis makes sense? Thanks. > > > > I think so. It's been a while since I've done much work in this code, > but it does sound like there is a race in the LRU handling. > > > Like Chuck said, the nf->nf_lru list should be safe to use for multiple > purposes, but that's only the case if we're not using that list as an > indicator. > > The list_lru code does check this: > > if (!list_empty(item)) { > > ...so if we ever check this while it's sitting on the dispose list, it > will handle it incorrectly. It sounds like that's the root cause of the > problem you're seeing? > > If so, then maybe a separate list_head for disposal would be better. > > > Here is my patch with signed-off-by: > > > > From: Youzhong Yang <youzhong@gmail.com> > > Date: Mon, 1 Jul 2024 06:45:22 -0400 > > Subject: [PATCH] nfsd: fix nfsd_file leaking due to race condition and early > > unhash > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > --- > > fs/nfsd/filecache.c | 14 +++++++++++++- > > 1 file changed, 13 insertions(+), 1 deletion(-) > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > index 1a6d5d000b85..2323829f7208 100644 > > --- a/fs/nfsd/filecache.c > > +++ b/fs/nfsd/filecache.c > > @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf) > > if (!nfsd_file_lru_remove(nf)) > > return; > > } > > + /* > > + * Racing with nfsd_file_cond_queue() or nfsd_file_lru_cb(), > > + * it's unhashed but then removed from the dispose list, > > + * so we need to free it. > > + */ > > + if (refcount_read(&nf->nf_ref) == 0 && > > A refcount_read in this path is a red flag to me. Anytime you're just > looking at the refcount without changing anything just screams out > "race condition". > > In this case, what guarantee is there that this won't run afoul of the > timing? We could check this and find out it's 1 just before it goes to > 0 and you check the other conditions. > > Does anything prevent that? > > > + !test_bit(NFSD_FILE_HASHED, &nf->nf_flags) && > > + list_empty(&nf->nf_lru)) { > > + nfsd_file_free(nf); > > + return; > > + } > > } > > if (refcount_dec_and_test(&nf->nf_ref)) > > nfsd_file_free(nf); > > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > > list_head *dispose) > > int decrement = 1; > > > > /* If we raced with someone else unhashing, ignore it */ > > - if (!nfsd_file_unhash(nf)) > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > return; > > Same here: you're just testing for the HASHED bit, but could this also > race with someone who is setting it just after you get here. Why is > that not a problem? > > > > > /* If we can't get a reference, ignore it */ > > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > > list_head *dispose) > > /* If refcount goes to 0, then put on the dispose list */ > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > list_add(&nf->nf_lru, dispose); > > + nfsd_file_unhash(nf); > > trace_nfsd_file_closing(nf); > > } > > } > > -- > > 2.43.0 > > > > On Wed, Jul 3, 2024 at 2:21 PM Chuck Lever <chuck.lever@oracle.com> wrote: > > > > > > On Wed, Jul 03, 2024 at 10:12:33AM -0400, Youzhong Yang wrote: > > > > Hello, > > > > > > > > I'd like to report a nfsd_file leaking issue and propose a fix for it. > > > > > > > > When I tested Linux kernel 6.8 and 6.6, I noticed nfsd_file leaks > > > > which led to undestroyable file systems (zfs), > > > > > > Thanks for the report. Some initial comments: > > > > > > - Do you have a specific reproducer? In other words, what is the > > > simplest program that can run on an NFS client that will trigger > > > this leak, and can you post it? > > > > > > - "zfs" is an out-of-tree file system, so it's not directly > > > supported for NFSD. > > > > > > - The guidelines for patch submission require us to fix issues in > > > upstream Linux first (currently that's v6.10-rc6). Then that fix > > > can be backported to older stable kernels like 6.6. > > > > > > Can you reproduce the leak with one of the in-kernel filesystems > > > (either xfs or btrfs would be great) and with NFSD in 6.10-rc6? > > > > > > One more comment below. > > > > > > > > > > here are some examples: > > > > > > > > crash> struct nfsd_file -x ffff88e160db0460 > > > > struct nfsd_file { > > > > nf_rlist = { > > > > rhead = { > > > > next = 0xffff8921fa2392f1 > > > > }, > > > > next = 0x0 > > > > }, > > > > nf_inode = 0xffff8882bc312ef8, > > > > nf_file = 0xffff88e2015b1500, > > > > nf_cred = 0xffff88e3ab0e7800, > > > > nf_net = 0xffffffff83d41600 <init_net>, > > > > nf_flags = 0x8, > > > > nf_ref = { > > > > refs = { > > > > counter = 0xc0000000 > > > > } > > > > }, > > > > nf_may = 0x4, > > > > nf_mark = 0xffff88e1bddfb320, > > > > nf_lru = { > > > > next = 0xffff88e160db04a8, > > > > prev = 0xffff88e160db04a8 > > > > }, > > > > nf_rcu = { > > > > next = 0x10000000000, > > > > func = 0x0 > > > > }, > > > > nf_birthtime = 0x73d22fc1728 > > > > } > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > > > ffff88839a53d850 > > > > nf_flags = 0x8, > > > > nf_ref.refs.counter = 0x0 > > > > nf_lru = { > > > > next = 0xffff88839a53d898, > > > > prev = 0xffff88839a53d898 > > > > }, > > > > nf_file = 0xffff88810ede8700, > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > > > ffff88c32b11e850 > > > > nf_flags = 0x8, > > > > nf_ref.refs.counter = 0x0 > > > > nf_lru = { > > > > next = 0xffff88c32b11e898, > > > > prev = 0xffff88c32b11e898 > > > > }, > > > > nf_file = 0xffff88c20a701c00, > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > > > ffff88e372709700 > > > > nf_flags = 0xc, > > > > nf_ref.refs.counter = 0x0 > > > > nf_lru = { > > > > next = 0xffff88e372709748, > > > > prev = 0xffff88e372709748 > > > > }, > > > > nf_file = 0xffff88e0725e6400, > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > > > ffff8982864944d0 > > > > nf_flags = 0xc, > > > > nf_ref.refs.counter = 0x0 > > > > nf_lru = { > > > > next = 0xffff898286494518, > > > > prev = 0xffff898286494518 > > > > }, > > > > nf_file = 0xffff89803c0ff700, > > > > > > > > The leak occurs when nfsd_file_put() races with nfsd_file_cond_queue() > > > > or nfsd_file_lru_cb(). With the following patch, I haven't observed > > > > any leak after a few days heavy nfs load: > > > > > > Our patch submission guidelines require a Signed-off-by: > > > line at the end of the patch description. See the "Sign your work - > > > the Developer's Certificate of Origin" section of > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.10-rc6 > > > > > > (Needed here in case your fix is acceptable). > > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > --- a/fs/nfsd/filecache.c > > > > +++ b/fs/nfsd/filecache.c > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf) > > > > if (!nfsd_file_lru_remove(nf)) > > > > return; > > > > } > > > > + /* > > > > + * Racing with nfsd_file_cond_queue() or nfsd_file_lru_cb(), > > > > + * it's unhashed but then removed from the dispose list, > > > > + * so we need to free it. > > > > + */ > > > > + if (refcount_read(&nf->nf_ref) == 0 && > > > > + !test_bit(NFSD_FILE_HASHED, &nf->nf_flags) && > > > > + list_empty(&nf->nf_lru)) { > > > > + nfsd_file_free(nf); > > > > + return; > > > > + } > > > > } > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > nfsd_file_free(nf); > > > > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > > > > list_head *dispose) > > > > int decrement = 1; > > > > > > > > /* If we raced with someone else unhashing, ignore it */ > > > > - if (!nfsd_file_unhash(nf)) > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > > > return; > > > > > > > > /* If we can't get a reference, ignore it */ > > > > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > > > > list_head *dispose) > > > > /* If refcount goes to 0, then put on the dispose list */ > > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > > list_add(&nf->nf_lru, dispose); > > > > + nfsd_file_unhash(nf); > > > > trace_nfsd_file_closing(nf); > > > > } > > > > } > > > > > > > > Please kindly review the patch and let me know if it makes sense. > > > > > > > > Thanks, > > > > > > > > -Youzhong > > > > > > > > > > -- > > > Chuck Lever > > > > -- > Jeff Layton <jlayton@poochiereds.net>
On Mon, 2024-07-08 at 08:58 -0400, Youzhong Yang wrote: > Thank you Jeff for your invaluable insights. I was leaning towards > adding a new list_head too, and tested this approach on kernel 6.6 by > continuously hammering the server with heavy nfs load for the last few > days, not a single leak. > > Here goes the patch (based on Linux kernel master branch), please review: > > From: Youzhong Yang <youzhong@gmail.com> > Date: Thu, 4 Jul 2024 11:25:40 -0400 > Subject: [PATCH] nfsd: fix nfsd_file leaking due to mixed use of nf->nf_lru > > nfsd_file_put() in one thread can race with another thread doing > garbage collection (running nfsd_file_gc() -> list_lru_walk() -> > nfsd_file_lru_cb()): > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add(). > * nfsd_file_lru_add() returns true (with NFSD_FILE_REFERENCED bit set) > * garbage collector kicks in, nfsd_file_lru_cb() clears REFERENCED bit and > returns LRU_ROTATE. > * garbage collector kicks in again, nfsd_file_lru_cb() now > decrements nf->nf_ref > to 0, runs nfsd_file_unhash(), removes it from the LRU and adds to > the dispose > list [list_lru_isolate_move(lru, &nf->nf_lru, head)] > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it > tries to remove > the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))]. The 'nf' > has been added > to the 'dispose' list by nfsd_file_lru_cb(), so > nfsd_file_lru_remove(nf) simply > treats it as part of the LRU and removes it, which leads to its removal from > the 'dispose' list. > * At this moment, 'nf' is unhashed with its nf_ref being 0, and not > on the LRU. > nfsd_file_put() continues its execution [if > (refcount_dec_and_test(&nf->nf_ref))], > as nf->nf_ref is already 0, nf->nf_ref is set to > REFCOUNT_SATURATED, and the 'nf' > gets no chance of being freed. > > nfsd_file_put() can also race with nfsd_file_cond_queue(): > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add(). > * nfsd_file_lru_add() sets REFERENCED bit and returns true. > * Some userland application runs 'exportfs -f' or something like > that, which triggers > __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). > * In nfsd_file_cond_queue(), it runs [if (!nfsd_file_unhash(nf))], > unhash is done > successfully. > * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], now > nf->nf_ref goes to 2. > * nfsd_file_cond_queue() runs [if (nfsd_file_lru_remove(nf))], it succeeds. > * nfsd_file_cond_queue() runs [if (refcount_sub_and_test(decrement, > &nf->nf_ref))] > (with "decrement" being 2), so the nf->nf_ref goes to 0, the 'nf' > is added to the > dispose list [list_add(&nf->nf_lru, dispose)] > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it > tries to remove > the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))], although > the 'nf' is not > in the LRU, but it is linked in the 'dispose' list, > nfsd_file_lru_remove() simply > treats it as part of the LRU and removes it. This leads to its removal from > the 'dispose' list! > * Now nf->ref is 0, unhashed. nfsd_file_put() continues its execution and set > nf->nf_ref to REFCOUNT_SATURATED. > > As shown in the above analysis, using nf_lru for both the LRU list and > dispose list > can cause the leaks. This patch adds a new list_head nf_gc in struct > nfsd_file, and uses > it for the dispose list. It's not expected to have a nfsd_file > unhashed but it's not > added to the dispose list, so in nfsd_file_cond_queue() and > nfsd_file_lru_cb() nfsd_file > is unhashed after being added to the dispose list. > I don't see where we require the object to be either hashed or on the dispose list. I think you probably just want to do a patch that changes the dispose list to use a dedicated list_head without reordering when the these things are unhashed. > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > --- > fs/nfsd/filecache.c | 23 ++++++++++++++--------- > fs/nfsd/filecache.h | 1 + > 2 files changed, 15 insertions(+), 9 deletions(-) > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > index ad9083ca144b..3aef2ddfce94 100644 > --- a/fs/nfsd/filecache.c > +++ b/fs/nfsd/filecache.c > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, struct inode > *inode, unsigned char need, > return NULL; > > INIT_LIST_HEAD(&nf->nf_lru); > + INIT_LIST_HEAD(&nf->nf_gc); > nf->nf_birthtime = ktime_get(); > nf->nf_file = NULL; > nf->nf_cred = get_current_cred(); > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct list_head *dispose) > struct nfsd_file *nf; > > while (!list_empty(dispose)) { > - nf = list_first_entry(dispose, struct nfsd_file, nf_lru); > - list_del_init(&nf->nf_lru); > + nf = list_first_entry(dispose, struct nfsd_file, nf_gc); > + list_del_init(&nf->nf_gc); > nfsd_file_free(nf); > } > } > @@ -411,12 +412,12 @@ nfsd_file_dispose_list_delayed(struct list_head *dispose) > { > while(!list_empty(dispose)) { > struct nfsd_file *nf = list_first_entry(dispose, > - struct nfsd_file, nf_lru); > + struct nfsd_file, nf_gc); > struct nfsd_net *nn = net_generic(nf->nf_net, nfsd_net_id); > struct nfsd_fcache_disposal *l = nn->fcache_disposal; > > spin_lock(&l->lock); > - list_move_tail(&nf->nf_lru, &l->freeme); > + list_move_tail(&nf->nf_gc, &l->freeme); > spin_unlock(&l->lock); > svc_wake_up(nn->nfsd_serv); > } > @@ -502,8 +503,10 @@ nfsd_file_lru_cb(struct list_head *item, struct > list_lru_one *lru, > } > > /* Refcount went to zero. Unhash it and queue it to the dispose list */ > + list_lru_isolate(lru, &nf->nf_lru); > + list_add(&nf->nf_gc, head); > + /* Unhash after removing from LRU and adding to dispose list */ > nfsd_file_unhash(nf); > - list_lru_isolate_move(lru, &nf->nf_lru, head); I don't see the point in reordering these operations. Hashing is all about making the thing findable by nfsd operations. The _last_ thing we want to do is put it on the dispose list while the thing can still be found by nfsd threads doing operations. > this_cpu_inc(nfsd_file_evictions); > trace_nfsd_file_gc_disposed(nf); > return LRU_REMOVED; > @@ -565,7 +568,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > list_head *dispose) > int decrement = 1; > > /* If we raced with someone else unhashing, ignore it */ > - if (!nfsd_file_unhash(nf)) > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) The above change looks wrong. I don't think we need to change this. > return; > > /* If we can't get a reference, ignore it */ > @@ -578,7 +581,9 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > list_head *dispose) > > /* If refcount goes to 0, then put on the dispose list */ > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > - list_add(&nf->nf_lru, dispose); > + list_add(&nf->nf_gc, dispose); > + /* Unhash after adding to dispose list */ > + nfsd_file_unhash(nf); This too looks wrong? Maybe I'm unclear on the race you're trying to fix with this? What's the harm in unhashing it early? > trace_nfsd_file_closing(nf); > } > } > @@ -654,8 +659,8 @@ nfsd_file_close_inode_sync(struct inode *inode) > > nfsd_file_queue_for_close(inode, &dispose); > while (!list_empty(&dispose)) { > - nf = list_first_entry(&dispose, struct nfsd_file, nf_lru); > - list_del_init(&nf->nf_lru); > + nf = list_first_entry(&dispose, struct nfsd_file, nf_gc); > + list_del_init(&nf->nf_gc); > nfsd_file_free(nf); > } > } > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h > index c61884def906..3fbec24eea6c 100644 > --- a/fs/nfsd/filecache.h > +++ b/fs/nfsd/filecache.h > @@ -44,6 +44,7 @@ struct nfsd_file { > > struct nfsd_file_mark *nf_mark; > struct list_head nf_lru; > + struct list_head nf_gc; > struct rcu_head nf_rcu; > ktime_t nf_birthtime; > }; > -- > 2.34.1 > > On Thu, Jul 4, 2024 at 7:14 AM Jeff Layton <jlayton@poochiereds.net> wrote: > > > > On Wed, 2024-07-03 at 16:46 -0400, Youzhong Yang wrote: > > > Thank you Chuck. Here are my quick answers to your comments: > > > > > > - I don't have a quick reproducer. I reproduced it by using hundreds > > > of nfs clients generating +600K ops under our workload in the testing > > > environment. Theoretically it should be possible to simplify the > > > reproduction but I am still working on it. > > > > > > - I understand zfs is an out-of-tree file system. That's fine. But > > > this leaking can happen to any file system, and leaking is not a good > > > thing no matter what file system it is. > > > > > > - I will try to come up with a reproducer using xfs or btrfs if possible. > > > > > > Now back to the problem itself, here are my findings: > > > > > > - nfsd_file_put() in one thread can race with another thread doing > > > garbage collection (running nfsd_file_gc() -> list_lru_walk() -> > > > nfsd_file_lru_cb()): > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add(). > > > * nfsd_file_lru_add() returns true (thus NFSD_FILE_REFERENCED bit > > > set for nf->nf_flags) > > > * garbage collector kicks in, nfsd_file_lru_cb() clears REFERENCED > > > bit and returns LRU_ROTATE. > > > * garbage collector kicks in again, nfsd_file_lru_cb() now > > > decrements nf->nf_ref to 0, runs nfsd_file_unhash(), removes it from > > > the LRU and adds to the dispose list [list_lru_isolate_move(lru, > > > &nf->nf_lru, head);] > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it > > > tries to remove the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))] > > > * The 'nf' has been added to the 'dispose' list by > > > nfsd_file_lru_cb(), so nfsd_file_lru_remove(nf) simply treats it as > > > part of the LRU and removes it, which leads it to be removed from the > > > 'dispose' list. > > > * At this moment, nf->nf_ref is 0, it's unhashed, and not on the > > > LRU. nfsd_file_put() continues its execution [if > > > (refcount_dec_and_test(&nf->nf_ref))], as nf->nf_ref is already 0, now > > > bad thing happens: nf->nf_ref is set to REFCOUNT_SATURATED, and the > > > 'nf' is leaked. > > > > > > To make this happen, the right timing is crucial. It can be reproduced > > > by adding artifical delays in filecache.c, or hammering the nfsd with > > > tons of ops. > > > > > > - Let's see how nfsd_file_put() can race with nfsd_file_cond_queue(): > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add(). > > > * nfsd_file_lru_add() sets REFERENCED bit and returns true. > > > * 'exportfs -f' or something like that triggers > > > __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). > > > * In nfsd_file_cond_queue(), it runs [if (!nfsd_file_unhash(nf))], > > > unhash is done successfully. > > > * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], now > > > nf->nf_ref goes to 2. > > > * nfsd_file_cond_queue() runs [if (nfsd_file_lru_remove(nf))], it succeeds. > > > * nfsd_file_cond_queue() runs [if (refcount_sub_and_test(decrement, > > > &nf->nf_ref))] (with "decrement" being 2), so the nf->nf_ref goes to > > > 0, the 'nf' is added to the dispost list [list_add(&nf->nf_lru, > > > dispose)] > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it > > > tries to remove the 'nf' from the LRU [if > > > (!nfsd_file_lru_remove(nf))], although the 'nf' is not in the LRU, but > > > it is linked in the 'dispose' list, nfsd_file_lru_remove() simply > > > treats it as part of the LRU and removes it. This leads to its removal > > > from the 'dispose' list! > > > * Now nf->ref is 0, unhashed. nfsd_file_put() continues its > > > execution and sets nf->nf_ref to REFCOUNT_SATURATED. > > > > > > The purpose of nf->nf_lru is problematic. As you can see, it is used > > > for the LRU list, and also the 'dispose' list. Adding another 'struct > > > list_head' specifically for the 'dispose' list seems to be a better > > > way of fixing this race condition. Either way works for me. > > > > > > Would you agree my above analysis makes sense? Thanks. > > > > > > > I think so. It's been a while since I've done much work in this code, > > but it does sound like there is a race in the LRU handling. > > > > > > Like Chuck said, the nf->nf_lru list should be safe to use for multiple > > purposes, but that's only the case if we're not using that list as an > > indicator. > > > > The list_lru code does check this: > > > > if (!list_empty(item)) { > > > > ...so if we ever check this while it's sitting on the dispose list, it > > will handle it incorrectly. It sounds like that's the root cause of the > > problem you're seeing? > > > > If so, then maybe a separate list_head for disposal would be better. > > > > > Here is my patch with signed-off-by: > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > Date: Mon, 1 Jul 2024 06:45:22 -0400 > > > Subject: [PATCH] nfsd: fix nfsd_file leaking due to race condition and early > > > unhash > > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > --- > > > fs/nfsd/filecache.c | 14 +++++++++++++- > > > 1 file changed, 13 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > index 1a6d5d000b85..2323829f7208 100644 > > > --- a/fs/nfsd/filecache.c > > > +++ b/fs/nfsd/filecache.c > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf) > > > if (!nfsd_file_lru_remove(nf)) > > > return; > > > } > > > + /* > > > + * Racing with nfsd_file_cond_queue() or nfsd_file_lru_cb(), > > > + * it's unhashed but then removed from the dispose list, > > > + * so we need to free it. > > > + */ > > > + if (refcount_read(&nf->nf_ref) == 0 && > > > > A refcount_read in this path is a red flag to me. Anytime you're just > > looking at the refcount without changing anything just screams out > > "race condition". > > > > In this case, what guarantee is there that this won't run afoul of the > > timing? We could check this and find out it's 1 just before it goes to > > 0 and you check the other conditions. > > > > Does anything prevent that? > > > > > + !test_bit(NFSD_FILE_HASHED, &nf->nf_flags) && > > > + list_empty(&nf->nf_lru)) { > > > + nfsd_file_free(nf); > > > + return; > > > + } > > > } > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > nfsd_file_free(nf); > > > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > > > list_head *dispose) > > > int decrement = 1; > > > > > > /* If we raced with someone else unhashing, ignore it */ > > > - if (!nfsd_file_unhash(nf)) > > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > > return; > > > > Same here: you're just testing for the HASHED bit, but could this also > > race with someone who is setting it just after you get here. Why is > > that not a problem? > > > > > > > > /* If we can't get a reference, ignore it */ > > > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > > > list_head *dispose) > > > /* If refcount goes to 0, then put on the dispose list */ > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > list_add(&nf->nf_lru, dispose); > > > + nfsd_file_unhash(nf); > > > trace_nfsd_file_closing(nf); > > > } > > > } > > > -- > > > 2.43.0 > > > > > > On Wed, Jul 3, 2024 at 2:21 PM Chuck Lever <chuck.lever@oracle.com> wrote: > > > > > > > > On Wed, Jul 03, 2024 at 10:12:33AM -0400, Youzhong Yang wrote: > > > > > Hello, > > > > > > > > > > I'd like to report a nfsd_file leaking issue and propose a fix for it. > > > > > > > > > > When I tested Linux kernel 6.8 and 6.6, I noticed nfsd_file leaks > > > > > which led to undestroyable file systems (zfs), > > > > > > > > Thanks for the report. Some initial comments: > > > > > > > > - Do you have a specific reproducer? In other words, what is the > > > > simplest program that can run on an NFS client that will trigger > > > > this leak, and can you post it? > > > > > > > > - "zfs" is an out-of-tree file system, so it's not directly > > > > supported for NFSD. > > > > > > > > - The guidelines for patch submission require us to fix issues in > > > > upstream Linux first (currently that's v6.10-rc6). Then that fix > > > > can be backported to older stable kernels like 6.6. > > > > > > > > Can you reproduce the leak with one of the in-kernel filesystems > > > > (either xfs or btrfs would be great) and with NFSD in 6.10-rc6? > > > > > > > > One more comment below. > > > > > > > > > > > > > here are some examples: > > > > > > > > > > crash> struct nfsd_file -x ffff88e160db0460 > > > > > struct nfsd_file { > > > > > nf_rlist = { > > > > > rhead = { > > > > > next = 0xffff8921fa2392f1 > > > > > }, > > > > > next = 0x0 > > > > > }, > > > > > nf_inode = 0xffff8882bc312ef8, > > > > > nf_file = 0xffff88e2015b1500, > > > > > nf_cred = 0xffff88e3ab0e7800, > > > > > nf_net = 0xffffffff83d41600 <init_net>, > > > > > nf_flags = 0x8, > > > > > nf_ref = { > > > > > refs = { > > > > > counter = 0xc0000000 > > > > > } > > > > > }, > > > > > nf_may = 0x4, > > > > > nf_mark = 0xffff88e1bddfb320, > > > > > nf_lru = { > > > > > next = 0xffff88e160db04a8, > > > > > prev = 0xffff88e160db04a8 > > > > > }, > > > > > nf_rcu = { > > > > > next = 0x10000000000, > > > > > func = 0x0 > > > > > }, > > > > > nf_birthtime = 0x73d22fc1728 > > > > > } > > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > > > > ffff88839a53d850 > > > > > nf_flags = 0x8, > > > > > nf_ref.refs.counter = 0x0 > > > > > nf_lru = { > > > > > next = 0xffff88839a53d898, > > > > > prev = 0xffff88839a53d898 > > > > > }, > > > > > nf_file = 0xffff88810ede8700, > > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > > > > ffff88c32b11e850 > > > > > nf_flags = 0x8, > > > > > nf_ref.refs.counter = 0x0 > > > > > nf_lru = { > > > > > next = 0xffff88c32b11e898, > > > > > prev = 0xffff88c32b11e898 > > > > > }, > > > > > nf_file = 0xffff88c20a701c00, > > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > > > > ffff88e372709700 > > > > > nf_flags = 0xc, > > > > > nf_ref.refs.counter = 0x0 > > > > > nf_lru = { > > > > > next = 0xffff88e372709748, > > > > > prev = 0xffff88e372709748 > > > > > }, > > > > > nf_file = 0xffff88e0725e6400, > > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > > > > ffff8982864944d0 > > > > > nf_flags = 0xc, > > > > > nf_ref.refs.counter = 0x0 > > > > > nf_lru = { > > > > > next = 0xffff898286494518, > > > > > prev = 0xffff898286494518 > > > > > }, > > > > > nf_file = 0xffff89803c0ff700, > > > > > > > > > > The leak occurs when nfsd_file_put() races with nfsd_file_cond_queue() > > > > > or nfsd_file_lru_cb(). With the following patch, I haven't observed > > > > > any leak after a few days heavy nfs load: > > > > > > > > Our patch submission guidelines require a Signed-off-by: > > > > line at the end of the patch description. See the "Sign your work - > > > > the Developer's Certificate of Origin" section of > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.10-rc6 > > > > > > > > (Needed here in case your fix is acceptable). > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > > --- a/fs/nfsd/filecache.c > > > > > +++ b/fs/nfsd/filecache.c > > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf) > > > > > if (!nfsd_file_lru_remove(nf)) > > > > > return; > > > > > } > > > > > + /* > > > > > + * Racing with nfsd_file_cond_queue() or nfsd_file_lru_cb(), > > > > > + * it's unhashed but then removed from the dispose list, > > > > > + * so we need to free it. > > > > > + */ > > > > > + if (refcount_read(&nf->nf_ref) == 0 && > > > > > + !test_bit(NFSD_FILE_HASHED, &nf->nf_flags) && > > > > > + list_empty(&nf->nf_lru)) { > > > > > + nfsd_file_free(nf); > > > > > + return; > > > > > + } > > > > > } > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > nfsd_file_free(nf); > > > > > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > > > > > list_head *dispose) > > > > > int decrement = 1; > > > > > > > > > > /* If we raced with someone else unhashing, ignore it */ > > > > > - if (!nfsd_file_unhash(nf)) > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > > > > return; > > > > > > > > > > /* If we can't get a reference, ignore it */ > > > > > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > > > > > list_head *dispose) > > > > > /* If refcount goes to 0, then put on the dispose list */ > > > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > > > list_add(&nf->nf_lru, dispose); > > > > > + nfsd_file_unhash(nf); > > > > > trace_nfsd_file_closing(nf); > > > > > } > > > > > } > > > > > > > > > > Please kindly review the patch and let me know if it makes sense. > > > > > > > > > > Thanks, > > > > > > > > > > -Youzhong > > > > > > > > > > > > > -- > > > > Chuck Lever > > > > > > > -- > > Jeff Layton <jlayton@poochiereds.net>
Thanks Jeff. I am ok with reverting the unhash/dispose list reordering in nfsd_file_lru_cb(), as it doesn't make much of a difference, but for nfsd_file_cond_queue(), imagining this: - A nfsd_file is hashed - In nfsd_file_cond_queue(), [if (!nfsd_file_unhash(nf))] will get it unhashed, doesn't it? - It continues to get a reference by nfsd_file_get() - It continues to remove itself from LRU by nfsd_file_lru_remove() if it is on the LRU. - Now it runs refcount_sub_and_test(), what happens if the refcnt does not go to 0? How can this nfsd_file be found again? Through the hash table? Through the LRU walk? how? Thanks again. -Youzhong On Mon, Jul 8, 2024 at 9:35 AM Jeff Layton <jlayton@kernel.org> wrote: > > On Mon, 2024-07-08 at 08:58 -0400, Youzhong Yang wrote: > > Thank you Jeff for your invaluable insights. I was leaning towards > > adding a new list_head too, and tested this approach on kernel 6.6 by > > continuously hammering the server with heavy nfs load for the last few > > days, not a single leak. > > > > Here goes the patch (based on Linux kernel master branch), please review: > > > > From: Youzhong Yang <youzhong@gmail.com> > > Date: Thu, 4 Jul 2024 11:25:40 -0400 > > Subject: [PATCH] nfsd: fix nfsd_file leaking due to mixed use of nf->nf_lru > > > > nfsd_file_put() in one thread can race with another thread doing > > garbage collection (running nfsd_file_gc() -> list_lru_walk() -> > > nfsd_file_lru_cb()): > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add(). > > * nfsd_file_lru_add() returns true (with NFSD_FILE_REFERENCED bit set) > > * garbage collector kicks in, nfsd_file_lru_cb() clears REFERENCED bit and > > returns LRU_ROTATE. > > * garbage collector kicks in again, nfsd_file_lru_cb() now > > decrements nf->nf_ref > > to 0, runs nfsd_file_unhash(), removes it from the LRU and adds to > > the dispose > > list [list_lru_isolate_move(lru, &nf->nf_lru, head)] > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it > > tries to remove > > the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))]. The 'nf' > > has been added > > to the 'dispose' list by nfsd_file_lru_cb(), so > > nfsd_file_lru_remove(nf) simply > > treats it as part of the LRU and removes it, which leads to its removal from > > the 'dispose' list. > > * At this moment, 'nf' is unhashed with its nf_ref being 0, and not > > on the LRU. > > nfsd_file_put() continues its execution [if > > (refcount_dec_and_test(&nf->nf_ref))], > > as nf->nf_ref is already 0, nf->nf_ref is set to > > REFCOUNT_SATURATED, and the 'nf' > > gets no chance of being freed. > > > > nfsd_file_put() can also race with nfsd_file_cond_queue(): > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add(). > > * nfsd_file_lru_add() sets REFERENCED bit and returns true. > > * Some userland application runs 'exportfs -f' or something like > > that, which triggers > > __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). > > * In nfsd_file_cond_queue(), it runs [if (!nfsd_file_unhash(nf))], > > unhash is done > > successfully. > > * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], now > > nf->nf_ref goes to 2. > > * nfsd_file_cond_queue() runs [if (nfsd_file_lru_remove(nf))], it succeeds. > > * nfsd_file_cond_queue() runs [if (refcount_sub_and_test(decrement, > > &nf->nf_ref))] > > (with "decrement" being 2), so the nf->nf_ref goes to 0, the 'nf' > > is added to the > > dispose list [list_add(&nf->nf_lru, dispose)] > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it > > tries to remove > > the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))], although > > the 'nf' is not > > in the LRU, but it is linked in the 'dispose' list, > > nfsd_file_lru_remove() simply > > treats it as part of the LRU and removes it. This leads to its removal from > > the 'dispose' list! > > * Now nf->ref is 0, unhashed. nfsd_file_put() continues its execution and set > > nf->nf_ref to REFCOUNT_SATURATED. > > > > As shown in the above analysis, using nf_lru for both the LRU list and > > dispose list > > can cause the leaks. This patch adds a new list_head nf_gc in struct > > nfsd_file, and uses > > it for the dispose list. It's not expected to have a nfsd_file > > unhashed but it's not > > added to the dispose list, so in nfsd_file_cond_queue() and > > nfsd_file_lru_cb() nfsd_file > > is unhashed after being added to the dispose list. > > > > I don't see where we require the object to be either hashed or on the > dispose list. I think you probably just want to do a patch that > changes the dispose list to use a dedicated list_head without > reordering when the these things are unhashed. > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > --- > > fs/nfsd/filecache.c | 23 ++++++++++++++--------- > > fs/nfsd/filecache.h | 1 + > > 2 files changed, 15 insertions(+), 9 deletions(-) > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > index ad9083ca144b..3aef2ddfce94 100644 > > --- a/fs/nfsd/filecache.c > > +++ b/fs/nfsd/filecache.c > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, struct inode > > *inode, unsigned char need, > > return NULL; > > > > INIT_LIST_HEAD(&nf->nf_lru); > > + INIT_LIST_HEAD(&nf->nf_gc); > > nf->nf_birthtime = ktime_get(); > > nf->nf_file = NULL; > > nf->nf_cred = get_current_cred(); > > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct list_head *dispose) > > struct nfsd_file *nf; > > > > while (!list_empty(dispose)) { > > - nf = list_first_entry(dispose, struct nfsd_file, nf_lru); > > - list_del_init(&nf->nf_lru); > > + nf = list_first_entry(dispose, struct nfsd_file, nf_gc); > > + list_del_init(&nf->nf_gc); > > nfsd_file_free(nf); > > } > > } > > @@ -411,12 +412,12 @@ nfsd_file_dispose_list_delayed(struct list_head *dispose) > > { > > while(!list_empty(dispose)) { > > struct nfsd_file *nf = list_first_entry(dispose, > > - struct nfsd_file, nf_lru); > > + struct nfsd_file, nf_gc); > > struct nfsd_net *nn = net_generic(nf->nf_net, nfsd_net_id); > > struct nfsd_fcache_disposal *l = nn->fcache_disposal; > > > > spin_lock(&l->lock); > > - list_move_tail(&nf->nf_lru, &l->freeme); > > + list_move_tail(&nf->nf_gc, &l->freeme); > > spin_unlock(&l->lock); > > svc_wake_up(nn->nfsd_serv); > > } > > @@ -502,8 +503,10 @@ nfsd_file_lru_cb(struct list_head *item, struct > > list_lru_one *lru, > > } > > > > /* Refcount went to zero. Unhash it and queue it to the dispose list */ > > + list_lru_isolate(lru, &nf->nf_lru); > > + list_add(&nf->nf_gc, head); > > + /* Unhash after removing from LRU and adding to dispose list */ > > nfsd_file_unhash(nf); > > - list_lru_isolate_move(lru, &nf->nf_lru, head); > > I don't see the point in reordering these operations. Hashing is all > about making the thing findable by nfsd operations. The _last_ thing we > want to do is put it on the dispose list while the thing can still be > found by nfsd threads doing operations. > > > this_cpu_inc(nfsd_file_evictions); > > trace_nfsd_file_gc_disposed(nf); > > return LRU_REMOVED; > > @@ -565,7 +568,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > > list_head *dispose) > > int decrement = 1; > > > > /* If we raced with someone else unhashing, ignore it */ > > - if (!nfsd_file_unhash(nf)) > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > The above change looks wrong. I don't think we need to change this. > > > return; > > > > /* If we can't get a reference, ignore it */ > > @@ -578,7 +581,9 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > > list_head *dispose) > > > > /* If refcount goes to 0, then put on the dispose list */ > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > - list_add(&nf->nf_lru, dispose); > > + list_add(&nf->nf_gc, dispose); > > + /* Unhash after adding to dispose list */ > > + nfsd_file_unhash(nf); > > This too looks wrong? Maybe I'm unclear on the race you're trying to > fix with this? What's the harm in unhashing it early? > > > trace_nfsd_file_closing(nf); > > } > > } > > @@ -654,8 +659,8 @@ nfsd_file_close_inode_sync(struct inode *inode) > > > > nfsd_file_queue_for_close(inode, &dispose); > > while (!list_empty(&dispose)) { > > - nf = list_first_entry(&dispose, struct nfsd_file, nf_lru); > > - list_del_init(&nf->nf_lru); > > + nf = list_first_entry(&dispose, struct nfsd_file, nf_gc); > > + list_del_init(&nf->nf_gc); > > nfsd_file_free(nf); > > } > > } > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h > > index c61884def906..3fbec24eea6c 100644 > > --- a/fs/nfsd/filecache.h > > +++ b/fs/nfsd/filecache.h > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > struct nfsd_file_mark *nf_mark; > > struct list_head nf_lru; > > + struct list_head nf_gc; > > struct rcu_head nf_rcu; > > ktime_t nf_birthtime; > > }; > > -- > > 2.34.1 > > > > On Thu, Jul 4, 2024 at 7:14 AM Jeff Layton <jlayton@poochiereds.net> wrote: > > > > > > On Wed, 2024-07-03 at 16:46 -0400, Youzhong Yang wrote: > > > > Thank you Chuck. Here are my quick answers to your comments: > > > > > > > > - I don't have a quick reproducer. I reproduced it by using hundreds > > > > of nfs clients generating +600K ops under our workload in the testing > > > > environment. Theoretically it should be possible to simplify the > > > > reproduction but I am still working on it. > > > > > > > > - I understand zfs is an out-of-tree file system. That's fine. But > > > > this leaking can happen to any file system, and leaking is not a good > > > > thing no matter what file system it is. > > > > > > > > - I will try to come up with a reproducer using xfs or btrfs if possible. > > > > > > > > Now back to the problem itself, here are my findings: > > > > > > > > - nfsd_file_put() in one thread can race with another thread doing > > > > garbage collection (running nfsd_file_gc() -> list_lru_walk() -> > > > > nfsd_file_lru_cb()): > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add(). > > > > * nfsd_file_lru_add() returns true (thus NFSD_FILE_REFERENCED bit > > > > set for nf->nf_flags) > > > > * garbage collector kicks in, nfsd_file_lru_cb() clears REFERENCED > > > > bit and returns LRU_ROTATE. > > > > * garbage collector kicks in again, nfsd_file_lru_cb() now > > > > decrements nf->nf_ref to 0, runs nfsd_file_unhash(), removes it from > > > > the LRU and adds to the dispose list [list_lru_isolate_move(lru, > > > > &nf->nf_lru, head);] > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it > > > > tries to remove the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))] > > > > * The 'nf' has been added to the 'dispose' list by > > > > nfsd_file_lru_cb(), so nfsd_file_lru_remove(nf) simply treats it as > > > > part of the LRU and removes it, which leads it to be removed from the > > > > 'dispose' list. > > > > * At this moment, nf->nf_ref is 0, it's unhashed, and not on the > > > > LRU. nfsd_file_put() continues its execution [if > > > > (refcount_dec_and_test(&nf->nf_ref))], as nf->nf_ref is already 0, now > > > > bad thing happens: nf->nf_ref is set to REFCOUNT_SATURATED, and the > > > > 'nf' is leaked. > > > > > > > > To make this happen, the right timing is crucial. It can be reproduced > > > > by adding artifical delays in filecache.c, or hammering the nfsd with > > > > tons of ops. > > > > > > > > - Let's see how nfsd_file_put() can race with nfsd_file_cond_queue(): > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add(). > > > > * nfsd_file_lru_add() sets REFERENCED bit and returns true. > > > > * 'exportfs -f' or something like that triggers > > > > __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). > > > > * In nfsd_file_cond_queue(), it runs [if (!nfsd_file_unhash(nf))], > > > > unhash is done successfully. > > > > * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], now > > > > nf->nf_ref goes to 2. > > > > * nfsd_file_cond_queue() runs [if (nfsd_file_lru_remove(nf))], it succeeds. > > > > * nfsd_file_cond_queue() runs [if (refcount_sub_and_test(decrement, > > > > &nf->nf_ref))] (with "decrement" being 2), so the nf->nf_ref goes to > > > > 0, the 'nf' is added to the dispost list [list_add(&nf->nf_lru, > > > > dispose)] > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it > > > > tries to remove the 'nf' from the LRU [if > > > > (!nfsd_file_lru_remove(nf))], although the 'nf' is not in the LRU, but > > > > it is linked in the 'dispose' list, nfsd_file_lru_remove() simply > > > > treats it as part of the LRU and removes it. This leads to its removal > > > > from the 'dispose' list! > > > > * Now nf->ref is 0, unhashed. nfsd_file_put() continues its > > > > execution and sets nf->nf_ref to REFCOUNT_SATURATED. > > > > > > > > The purpose of nf->nf_lru is problematic. As you can see, it is used > > > > for the LRU list, and also the 'dispose' list. Adding another 'struct > > > > list_head' specifically for the 'dispose' list seems to be a better > > > > way of fixing this race condition. Either way works for me. > > > > > > > > Would you agree my above analysis makes sense? Thanks. > > > > > > > > > > I think so. It's been a while since I've done much work in this code, > > > but it does sound like there is a race in the LRU handling. > > > > > > > > > Like Chuck said, the nf->nf_lru list should be safe to use for multiple > > > purposes, but that's only the case if we're not using that list as an > > > indicator. > > > > > > The list_lru code does check this: > > > > > > if (!list_empty(item)) { > > > > > > ...so if we ever check this while it's sitting on the dispose list, it > > > will handle it incorrectly. It sounds like that's the root cause of the > > > problem you're seeing? > > > > > > If so, then maybe a separate list_head for disposal would be better. > > > > > > > Here is my patch with signed-off-by: > > > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > > Date: Mon, 1 Jul 2024 06:45:22 -0400 > > > > Subject: [PATCH] nfsd: fix nfsd_file leaking due to race condition and early > > > > unhash > > > > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > > --- > > > > fs/nfsd/filecache.c | 14 +++++++++++++- > > > > 1 file changed, 13 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > --- a/fs/nfsd/filecache.c > > > > +++ b/fs/nfsd/filecache.c > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf) > > > > if (!nfsd_file_lru_remove(nf)) > > > > return; > > > > } > > > > + /* > > > > + * Racing with nfsd_file_cond_queue() or nfsd_file_lru_cb(), > > > > + * it's unhashed but then removed from the dispose list, > > > > + * so we need to free it. > > > > + */ > > > > + if (refcount_read(&nf->nf_ref) == 0 && > > > > > > A refcount_read in this path is a red flag to me. Anytime you're just > > > looking at the refcount without changing anything just screams out > > > "race condition". > > > > > > In this case, what guarantee is there that this won't run afoul of the > > > timing? We could check this and find out it's 1 just before it goes to > > > 0 and you check the other conditions. > > > > > > Does anything prevent that? > > > > > > > + !test_bit(NFSD_FILE_HASHED, &nf->nf_flags) && > > > > + list_empty(&nf->nf_lru)) { > > > > + nfsd_file_free(nf); > > > > + return; > > > > + } > > > > } > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > nfsd_file_free(nf); > > > > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > > > > list_head *dispose) > > > > int decrement = 1; > > > > > > > > /* If we raced with someone else unhashing, ignore it */ > > > > - if (!nfsd_file_unhash(nf)) > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > > > return; > > > > > > Same here: you're just testing for the HASHED bit, but could this also > > > race with someone who is setting it just after you get here. Why is > > > that not a problem? > > > > > > > > > > > /* If we can't get a reference, ignore it */ > > > > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > > > > list_head *dispose) > > > > /* If refcount goes to 0, then put on the dispose list */ > > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > > list_add(&nf->nf_lru, dispose); > > > > + nfsd_file_unhash(nf); > > > > trace_nfsd_file_closing(nf); > > > > } > > > > } > > > > -- > > > > 2.43.0 > > > > > > > > On Wed, Jul 3, 2024 at 2:21 PM Chuck Lever <chuck.lever@oracle.com> wrote: > > > > > > > > > > On Wed, Jul 03, 2024 at 10:12:33AM -0400, Youzhong Yang wrote: > > > > > > Hello, > > > > > > > > > > > > I'd like to report a nfsd_file leaking issue and propose a fix for it. > > > > > > > > > > > > When I tested Linux kernel 6.8 and 6.6, I noticed nfsd_file leaks > > > > > > which led to undestroyable file systems (zfs), > > > > > > > > > > Thanks for the report. Some initial comments: > > > > > > > > > > - Do you have a specific reproducer? In other words, what is the > > > > > simplest program that can run on an NFS client that will trigger > > > > > this leak, and can you post it? > > > > > > > > > > - "zfs" is an out-of-tree file system, so it's not directly > > > > > supported for NFSD. > > > > > > > > > > - The guidelines for patch submission require us to fix issues in > > > > > upstream Linux first (currently that's v6.10-rc6). Then that fix > > > > > can be backported to older stable kernels like 6.6. > > > > > > > > > > Can you reproduce the leak with one of the in-kernel filesystems > > > > > (either xfs or btrfs would be great) and with NFSD in 6.10-rc6? > > > > > > > > > > One more comment below. > > > > > > > > > > > > > > > > here are some examples: > > > > > > > > > > > > crash> struct nfsd_file -x ffff88e160db0460 > > > > > > struct nfsd_file { > > > > > > nf_rlist = { > > > > > > rhead = { > > > > > > next = 0xffff8921fa2392f1 > > > > > > }, > > > > > > next = 0x0 > > > > > > }, > > > > > > nf_inode = 0xffff8882bc312ef8, > > > > > > nf_file = 0xffff88e2015b1500, > > > > > > nf_cred = 0xffff88e3ab0e7800, > > > > > > nf_net = 0xffffffff83d41600 <init_net>, > > > > > > nf_flags = 0x8, > > > > > > nf_ref = { > > > > > > refs = { > > > > > > counter = 0xc0000000 > > > > > > } > > > > > > }, > > > > > > nf_may = 0x4, > > > > > > nf_mark = 0xffff88e1bddfb320, > > > > > > nf_lru = { > > > > > > next = 0xffff88e160db04a8, > > > > > > prev = 0xffff88e160db04a8 > > > > > > }, > > > > > > nf_rcu = { > > > > > > next = 0x10000000000, > > > > > > func = 0x0 > > > > > > }, > > > > > > nf_birthtime = 0x73d22fc1728 > > > > > > } > > > > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > > > > > ffff88839a53d850 > > > > > > nf_flags = 0x8, > > > > > > nf_ref.refs.counter = 0x0 > > > > > > nf_lru = { > > > > > > next = 0xffff88839a53d898, > > > > > > prev = 0xffff88839a53d898 > > > > > > }, > > > > > > nf_file = 0xffff88810ede8700, > > > > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > > > > > ffff88c32b11e850 > > > > > > nf_flags = 0x8, > > > > > > nf_ref.refs.counter = 0x0 > > > > > > nf_lru = { > > > > > > next = 0xffff88c32b11e898, > > > > > > prev = 0xffff88c32b11e898 > > > > > > }, > > > > > > nf_file = 0xffff88c20a701c00, > > > > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > > > > > ffff88e372709700 > > > > > > nf_flags = 0xc, > > > > > > nf_ref.refs.counter = 0x0 > > > > > > nf_lru = { > > > > > > next = 0xffff88e372709748, > > > > > > prev = 0xffff88e372709748 > > > > > > }, > > > > > > nf_file = 0xffff88e0725e6400, > > > > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > > > > > ffff8982864944d0 > > > > > > nf_flags = 0xc, > > > > > > nf_ref.refs.counter = 0x0 > > > > > > nf_lru = { > > > > > > next = 0xffff898286494518, > > > > > > prev = 0xffff898286494518 > > > > > > }, > > > > > > nf_file = 0xffff89803c0ff700, > > > > > > > > > > > > The leak occurs when nfsd_file_put() races with nfsd_file_cond_queue() > > > > > > or nfsd_file_lru_cb(). With the following patch, I haven't observed > > > > > > any leak after a few days heavy nfs load: > > > > > > > > > > Our patch submission guidelines require a Signed-off-by: > > > > > line at the end of the patch description. See the "Sign your work - > > > > > the Developer's Certificate of Origin" section of > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.10-rc6 > > > > > > > > > > (Needed here in case your fix is acceptable). > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > > > --- a/fs/nfsd/filecache.c > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf) > > > > > > if (!nfsd_file_lru_remove(nf)) > > > > > > return; > > > > > > } > > > > > > + /* > > > > > > + * Racing with nfsd_file_cond_queue() or nfsd_file_lru_cb(), > > > > > > + * it's unhashed but then removed from the dispose list, > > > > > > + * so we need to free it. > > > > > > + */ > > > > > > + if (refcount_read(&nf->nf_ref) == 0 && > > > > > > + !test_bit(NFSD_FILE_HASHED, &nf->nf_flags) && > > > > > > + list_empty(&nf->nf_lru)) { > > > > > > + nfsd_file_free(nf); > > > > > > + return; > > > > > > + } > > > > > > } > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > nfsd_file_free(nf); > > > > > > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > > > > > > list_head *dispose) > > > > > > int decrement = 1; > > > > > > > > > > > > /* If we raced with someone else unhashing, ignore it */ > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > > > > > return; > > > > > > > > > > > > /* If we can't get a reference, ignore it */ > > > > > > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > > > > > > list_head *dispose) > > > > > > /* If refcount goes to 0, then put on the dispose list */ > > > > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > > > > list_add(&nf->nf_lru, dispose); > > > > > > + nfsd_file_unhash(nf); > > > > > > trace_nfsd_file_closing(nf); > > > > > > } > > > > > > } > > > > > > > > > > > > Please kindly review the patch and let me know if it makes sense. > > > > > > > > > > > > Thanks, > > > > > > > > > > > > -Youzhong > > > > > > > > > > > > > > > > -- > > > > > Chuck Lever > > > > > > > > > > -- > > > Jeff Layton <jlayton@poochiereds.net> > > -- > Jeff Layton <jlayton@poochiereds.net> > > -- > Jeff Layton <jlayton@kernel.org>
On Mon, 2024-07-08 at 10:23 -0400, Youzhong Yang wrote: > Thanks Jeff. > > I am ok with reverting the unhash/dispose list reordering in > nfsd_file_lru_cb(), as it doesn't make much of a difference, > but for nfsd_file_cond_queue(), imagining this: > > - A nfsd_file is hashed > - In nfsd_file_cond_queue(), [if (!nfsd_file_unhash(nf))] will get it > unhashed, doesn't it? > - It continues to get a reference by nfsd_file_get() > - It continues to remove itself from LRU by nfsd_file_lru_remove() if > it is on the LRU. > - Now it runs refcount_sub_and_test(), what happens if the refcnt > does > not go to 0? How can this nfsd_file be found again? Through the hash > table? Through the LRU walk? how? > > Thanks again. > > -Youzhong > It won't need to be found again. The holders of the extra references will put those references when they are finished. Since the object is no longer HASHED, nfsd_file_put just does this: if (refcount_dec_and_test(&nf->nf_ref)) nfsd_file_free(nf); So that should be fine. > On Mon, Jul 8, 2024 at 9:35 AM Jeff Layton <jlayton@kernel.org> > wrote: > > > > On Mon, 2024-07-08 at 08:58 -0400, Youzhong Yang wrote: > > > Thank you Jeff for your invaluable insights. I was leaning > > > towards > > > adding a new list_head too, and tested this approach on kernel > > > 6.6 by > > > continuously hammering the server with heavy nfs load for the > > > last few > > > days, not a single leak. > > > > > > Here goes the patch (based on Linux kernel master branch), please > > > review: > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > Date: Thu, 4 Jul 2024 11:25:40 -0400 > > > Subject: [PATCH] nfsd: fix nfsd_file leaking due to mixed use of > > > nf->nf_lru > > > > > > nfsd_file_put() in one thread can race with another thread doing > > > garbage collection (running nfsd_file_gc() -> list_lru_walk() -> > > > nfsd_file_lru_cb()): > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do > > > nfsd_file_lru_add(). > > > * nfsd_file_lru_add() returns true (with NFSD_FILE_REFERENCED > > > bit set) > > > * garbage collector kicks in, nfsd_file_lru_cb() clears > > > REFERENCED bit and > > > returns LRU_ROTATE. > > > * garbage collector kicks in again, nfsd_file_lru_cb() now > > > decrements nf->nf_ref > > > to 0, runs nfsd_file_unhash(), removes it from the LRU and > > > adds to > > > the dispose > > > list [list_lru_isolate_move(lru, &nf->nf_lru, head)] > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so > > > it > > > tries to remove > > > the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))]. The > > > 'nf' > > > has been added > > > to the 'dispose' list by nfsd_file_lru_cb(), so > > > nfsd_file_lru_remove(nf) simply > > > treats it as part of the LRU and removes it, which leads to > > > its removal from > > > the 'dispose' list. > > > * At this moment, 'nf' is unhashed with its nf_ref being 0, and > > > not > > > on the LRU. > > > nfsd_file_put() continues its execution [if > > > (refcount_dec_and_test(&nf->nf_ref))], > > > as nf->nf_ref is already 0, nf->nf_ref is set to > > > REFCOUNT_SATURATED, and the 'nf' > > > gets no chance of being freed. > > > > > > nfsd_file_put() can also race with nfsd_file_cond_queue(): > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do > > > nfsd_file_lru_add(). > > > * nfsd_file_lru_add() sets REFERENCED bit and returns true. > > > * Some userland application runs 'exportfs -f' or something > > > like > > > that, which triggers > > > __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). > > > * In nfsd_file_cond_queue(), it runs [if > > > (!nfsd_file_unhash(nf))], > > > unhash is done > > > successfully. > > > * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], now > > > nf->nf_ref goes to 2. > > > * nfsd_file_cond_queue() runs [if (nfsd_file_lru_remove(nf))], > > > it succeeds. > > > * nfsd_file_cond_queue() runs [if > > > (refcount_sub_and_test(decrement, > > > &nf->nf_ref))] > > > (with "decrement" being 2), so the nf->nf_ref goes to 0, the > > > 'nf' > > > is added to the > > > dispose list [list_add(&nf->nf_lru, dispose)] > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so > > > it > > > tries to remove > > > the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))], > > > although > > > the 'nf' is not > > > in the LRU, but it is linked in the 'dispose' list, > > > nfsd_file_lru_remove() simply > > > treats it as part of the LRU and removes it. This leads to > > > its removal from > > > the 'dispose' list! > > > * Now nf->ref is 0, unhashed. nfsd_file_put() continues its > > > execution and set > > > nf->nf_ref to REFCOUNT_SATURATED. > > > > > > As shown in the above analysis, using nf_lru for both the LRU > > > list and > > > dispose list > > > can cause the leaks. This patch adds a new list_head nf_gc in > > > struct > > > nfsd_file, and uses > > > it for the dispose list. It's not expected to have a nfsd_file > > > unhashed but it's not > > > added to the dispose list, so in nfsd_file_cond_queue() and > > > nfsd_file_lru_cb() nfsd_file > > > is unhashed after being added to the dispose list. > > > > > > > I don't see where we require the object to be either hashed or on > > the > > dispose list. I think you probably just want to do a patch that > > changes the dispose list to use a dedicated list_head without > > reordering when the these things are unhashed. > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > --- > > > fs/nfsd/filecache.c | 23 ++++++++++++++--------- > > > fs/nfsd/filecache.h | 1 + > > > 2 files changed, 15 insertions(+), 9 deletions(-) > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > index ad9083ca144b..3aef2ddfce94 100644 > > > --- a/fs/nfsd/filecache.c > > > +++ b/fs/nfsd/filecache.c > > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, struct inode > > > *inode, unsigned char need, > > > return NULL; > > > > > > INIT_LIST_HEAD(&nf->nf_lru); > > > + INIT_LIST_HEAD(&nf->nf_gc); > > > nf->nf_birthtime = ktime_get(); > > > nf->nf_file = NULL; > > > nf->nf_cred = get_current_cred(); > > > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct list_head > > > *dispose) > > > struct nfsd_file *nf; > > > > > > while (!list_empty(dispose)) { > > > - nf = list_first_entry(dispose, struct nfsd_file, > > > nf_lru); > > > - list_del_init(&nf->nf_lru); > > > + nf = list_first_entry(dispose, struct nfsd_file, > > > nf_gc); > > > + list_del_init(&nf->nf_gc); > > > nfsd_file_free(nf); > > > } > > > } > > > @@ -411,12 +412,12 @@ nfsd_file_dispose_list_delayed(struct > > > list_head *dispose) > > > { > > > while(!list_empty(dispose)) { > > > struct nfsd_file *nf = list_first_entry(dispose, > > > - struct nfsd_file, > > > nf_lru); > > > + struct nfsd_file, > > > nf_gc); > > > struct nfsd_net *nn = net_generic(nf->nf_net, > > > nfsd_net_id); > > > struct nfsd_fcache_disposal *l = nn- > > > >fcache_disposal; > > > > > > spin_lock(&l->lock); > > > - list_move_tail(&nf->nf_lru, &l->freeme); > > > + list_move_tail(&nf->nf_gc, &l->freeme); > > > spin_unlock(&l->lock); > > > svc_wake_up(nn->nfsd_serv); > > > } > > > @@ -502,8 +503,10 @@ nfsd_file_lru_cb(struct list_head *item, > > > struct > > > list_lru_one *lru, > > > } > > > > > > /* Refcount went to zero. Unhash it and queue it to the > > > dispose list */ > > > + list_lru_isolate(lru, &nf->nf_lru); > > > + list_add(&nf->nf_gc, head); > > > + /* Unhash after removing from LRU and adding to dispose > > > list */ > > > nfsd_file_unhash(nf); > > > - list_lru_isolate_move(lru, &nf->nf_lru, head); > > > > I don't see the point in reordering these operations. Hashing is > > all > > about making the thing findable by nfsd operations. The _last_ > > thing we > > want to do is put it on the dispose list while the thing can still > > be > > found by nfsd threads doing operations. > > > > > this_cpu_inc(nfsd_file_evictions); > > > trace_nfsd_file_gc_disposed(nf); > > > return LRU_REMOVED; > > > @@ -565,7 +568,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, > > > struct > > > list_head *dispose) > > > int decrement = 1; > > > > > > /* If we raced with someone else unhashing, ignore it */ > > > - if (!nfsd_file_unhash(nf)) > > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > > > The above change looks wrong. I don't think we need to change this. > > > > > return; > > > > > > /* If we can't get a reference, ignore it */ > > > @@ -578,7 +581,9 @@ nfsd_file_cond_queue(struct nfsd_file *nf, > > > struct > > > list_head *dispose) > > > > > > /* If refcount goes to 0, then put on the dispose list */ > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > - list_add(&nf->nf_lru, dispose); > > > + list_add(&nf->nf_gc, dispose); > > > + /* Unhash after adding to dispose list */ > > > + nfsd_file_unhash(nf); > > > > This too looks wrong? Maybe I'm unclear on the race you're trying > > to > > fix with this? What's the harm in unhashing it early? > > > > > trace_nfsd_file_closing(nf); > > > } > > > } > > > @@ -654,8 +659,8 @@ nfsd_file_close_inode_sync(struct inode > > > *inode) > > > > > > nfsd_file_queue_for_close(inode, &dispose); > > > while (!list_empty(&dispose)) { > > > - nf = list_first_entry(&dispose, struct nfsd_file, > > > nf_lru); > > > - list_del_init(&nf->nf_lru); > > > + nf = list_first_entry(&dispose, struct nfsd_file, > > > nf_gc); > > > + list_del_init(&nf->nf_gc); > > > nfsd_file_free(nf); > > > } > > > } > > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h > > > index c61884def906..3fbec24eea6c 100644 > > > --- a/fs/nfsd/filecache.h > > > +++ b/fs/nfsd/filecache.h > > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > > > struct nfsd_file_mark *nf_mark; > > > struct list_head nf_lru; > > > + struct list_head nf_gc; > > > struct rcu_head nf_rcu; > > > ktime_t nf_birthtime; > > > }; > > > -- > > > 2.34.1 > > > > > > On Thu, Jul 4, 2024 at 7:14 AM Jeff Layton > > > <jlayton@poochiereds.net> wrote: > > > > > > > > On Wed, 2024-07-03 at 16:46 -0400, Youzhong Yang wrote: > > > > > Thank you Chuck. Here are my quick answers to your comments: > > > > > > > > > > - I don't have a quick reproducer. I reproduced it by using > > > > > hundreds > > > > > of nfs clients generating +600K ops under our workload in the > > > > > testing > > > > > environment. Theoretically it should be possible to simplify > > > > > the > > > > > reproduction but I am still working on it. > > > > > > > > > > - I understand zfs is an out-of-tree file system. That's > > > > > fine. But > > > > > this leaking can happen to any file system, and leaking is > > > > > not a good > > > > > thing no matter what file system it is. > > > > > > > > > > - I will try to come up with a reproducer using xfs or btrfs > > > > > if possible. > > > > > > > > > > Now back to the problem itself, here are my findings: > > > > > > > > > > - nfsd_file_put() in one thread can race with another thread > > > > > doing > > > > > garbage collection (running nfsd_file_gc() -> list_lru_walk() > > > > > -> > > > > > nfsd_file_lru_cb()): > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do > > > > > nfsd_file_lru_add(). > > > > > * nfsd_file_lru_add() returns true (thus > > > > > NFSD_FILE_REFERENCED bit > > > > > set for nf->nf_flags) > > > > > * garbage collector kicks in, nfsd_file_lru_cb() clears > > > > > REFERENCED > > > > > bit and returns LRU_ROTATE. > > > > > * garbage collector kicks in again, nfsd_file_lru_cb() now > > > > > decrements nf->nf_ref to 0, runs nfsd_file_unhash(), removes > > > > > it from > > > > > the LRU and adds to the dispose list > > > > > [list_lru_isolate_move(lru, > > > > > &nf->nf_lru, head);] > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, > > > > > so it > > > > > tries to remove the 'nf' from the LRU [if > > > > > (!nfsd_file_lru_remove(nf))] > > > > > * The 'nf' has been added to the 'dispose' list by > > > > > nfsd_file_lru_cb(), so nfsd_file_lru_remove(nf) simply treats > > > > > it as > > > > > part of the LRU and removes it, which leads it to be removed > > > > > from the > > > > > 'dispose' list. > > > > > * At this moment, nf->nf_ref is 0, it's unhashed, and not > > > > > on the > > > > > LRU. nfsd_file_put() continues its execution [if > > > > > (refcount_dec_and_test(&nf->nf_ref))], as nf->nf_ref is > > > > > already 0, now > > > > > bad thing happens: nf->nf_ref is set to REFCOUNT_SATURATED, > > > > > and the > > > > > 'nf' is leaked. > > > > > > > > > > To make this happen, the right timing is crucial. It can be > > > > > reproduced > > > > > by adding artifical delays in filecache.c, or hammering the > > > > > nfsd with > > > > > tons of ops. > > > > > > > > > > - Let's see how nfsd_file_put() can race with > > > > > nfsd_file_cond_queue(): > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do > > > > > nfsd_file_lru_add(). > > > > > * nfsd_file_lru_add() sets REFERENCED bit and returns true. > > > > > * 'exportfs -f' or something like that triggers > > > > > __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). > > > > > * In nfsd_file_cond_queue(), it runs [if > > > > > (!nfsd_file_unhash(nf))], > > > > > unhash is done successfully. > > > > > * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], > > > > > now > > > > > nf->nf_ref goes to 2. > > > > > * nfsd_file_cond_queue() runs [if > > > > > (nfsd_file_lru_remove(nf))], it succeeds. > > > > > * nfsd_file_cond_queue() runs [if > > > > > (refcount_sub_and_test(decrement, > > > > > &nf->nf_ref))] (with "decrement" being 2), so the nf->nf_ref > > > > > goes to > > > > > 0, the 'nf' is added to the dispost list [list_add(&nf- > > > > > >nf_lru, > > > > > dispose)] > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, > > > > > so it > > > > > tries to remove the 'nf' from the LRU [if > > > > > (!nfsd_file_lru_remove(nf))], although the 'nf' is not in the > > > > > LRU, but > > > > > it is linked in the 'dispose' list, nfsd_file_lru_remove() > > > > > simply > > > > > treats it as part of the LRU and removes it. This leads to > > > > > its removal > > > > > from the 'dispose' list! > > > > > * Now nf->ref is 0, unhashed. nfsd_file_put() continues its > > > > > execution and sets nf->nf_ref to REFCOUNT_SATURATED. > > > > > > > > > > The purpose of nf->nf_lru is problematic. As you can see, it > > > > > is used > > > > > for the LRU list, and also the 'dispose' list. Adding another > > > > > 'struct > > > > > list_head' specifically for the 'dispose' list seems to be a > > > > > better > > > > > way of fixing this race condition. Either way works for me. > > > > > > > > > > Would you agree my above analysis makes sense? Thanks. > > > > > > > > > > > > > I think so. It's been a while since I've done much work in this > > > > code, > > > > but it does sound like there is a race in the LRU handling. > > > > > > > > > > > > Like Chuck said, the nf->nf_lru list should be safe to use for > > > > multiple > > > > purposes, but that's only the case if we're not using that list > > > > as an > > > > indicator. > > > > > > > > The list_lru code does check this: > > > > > > > > if (!list_empty(item)) { > > > > > > > > ...so if we ever check this while it's sitting on the dispose > > > > list, it > > > > will handle it incorrectly. It sounds like that's the root > > > > cause of the > > > > problem you're seeing? > > > > > > > > If so, then maybe a separate list_head for disposal would be > > > > better. > > > > > > > > > Here is my patch with signed-off-by: > > > > > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > > > Date: Mon, 1 Jul 2024 06:45:22 -0400 > > > > > Subject: [PATCH] nfsd: fix nfsd_file leaking due to race > > > > > condition and early > > > > > unhash > > > > > > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > > > --- > > > > > fs/nfsd/filecache.c | 14 +++++++++++++- > > > > > 1 file changed, 13 insertions(+), 1 deletion(-) > > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > > --- a/fs/nfsd/filecache.c > > > > > +++ b/fs/nfsd/filecache.c > > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf) > > > > > if (!nfsd_file_lru_remove(nf)) > > > > > return; > > > > > } > > > > > + /* > > > > > + * Racing with nfsd_file_cond_queue() or > > > > > nfsd_file_lru_cb(), > > > > > + * it's unhashed but then removed from the > > > > > dispose list, > > > > > + * so we need to free it. > > > > > + */ > > > > > + if (refcount_read(&nf->nf_ref) == 0 && > > > > > > > > A refcount_read in this path is a red flag to me. Anytime > > > > you're just > > > > looking at the refcount without changing anything just screams > > > > out > > > > "race condition". > > > > > > > > In this case, what guarantee is there that this won't run afoul > > > > of the > > > > timing? We could check this and find out it's 1 just before it > > > > goes to > > > > 0 and you check the other conditions. > > > > > > > > Does anything prevent that? > > > > > > > > > + !test_bit(NFSD_FILE_HASHED, &nf- > > > > > >nf_flags) && > > > > > + list_empty(&nf->nf_lru)) { > > > > > + nfsd_file_free(nf); > > > > > + return; > > > > > + } > > > > > } > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > nfsd_file_free(nf); > > > > > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > *nf, struct > > > > > list_head *dispose) > > > > > int decrement = 1; > > > > > > > > > > /* If we raced with someone else unhashing, ignore it > > > > > */ > > > > > - if (!nfsd_file_unhash(nf)) > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > > > > return; > > > > > > > > Same here: you're just testing for the HASHED bit, but could > > > > this also > > > > race with someone who is setting it just after you get here. > > > > Why is > > > > that not a problem? > > > > > > > > > > > > > > /* If we can't get a reference, ignore it */ > > > > > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > *nf, struct > > > > > list_head *dispose) > > > > > /* If refcount goes to 0, then put on the dispose > > > > > list */ > > > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > > > list_add(&nf->nf_lru, dispose); > > > > > + nfsd_file_unhash(nf); > > > > > trace_nfsd_file_closing(nf); > > > > > } > > > > > } > > > > > -- > > > > > 2.43.0 > > > > > > > > > > On Wed, Jul 3, 2024 at 2:21 PM Chuck Lever > > > > > <chuck.lever@oracle.com> wrote: > > > > > > > > > > > > On Wed, Jul 03, 2024 at 10:12:33AM -0400, Youzhong Yang > > > > > > wrote: > > > > > > > Hello, > > > > > > > > > > > > > > I'd like to report a nfsd_file leaking issue and propose > > > > > > > a fix for it. > > > > > > > > > > > > > > When I tested Linux kernel 6.8 and 6.6, I noticed > > > > > > > nfsd_file leaks > > > > > > > which led to undestroyable file systems (zfs), > > > > > > > > > > > > Thanks for the report. Some initial comments: > > > > > > > > > > > > - Do you have a specific reproducer? In other words, what > > > > > > is the > > > > > > simplest program that can run on an NFS client that will > > > > > > trigger > > > > > > this leak, and can you post it? > > > > > > > > > > > > - "zfs" is an out-of-tree file system, so it's not directly > > > > > > supported for NFSD. > > > > > > > > > > > > - The guidelines for patch submission require us to fix > > > > > > issues in > > > > > > upstream Linux first (currently that's v6.10-rc6). Then > > > > > > that fix > > > > > > can be backported to older stable kernels like 6.6. > > > > > > > > > > > > Can you reproduce the leak with one of the in-kernel > > > > > > filesystems > > > > > > (either xfs or btrfs would be great) and with NFSD in 6.10- > > > > > > rc6? > > > > > > > > > > > > One more comment below. > > > > > > > > > > > > > > > > > > > here are some examples: > > > > > > > > > > > > > > crash> struct nfsd_file -x ffff88e160db0460 > > > > > > > struct nfsd_file { > > > > > > > nf_rlist = { > > > > > > > rhead = { > > > > > > > next = 0xffff8921fa2392f1 > > > > > > > }, > > > > > > > next = 0x0 > > > > > > > }, > > > > > > > nf_inode = 0xffff8882bc312ef8, > > > > > > > nf_file = 0xffff88e2015b1500, > > > > > > > nf_cred = 0xffff88e3ab0e7800, > > > > > > > nf_net = 0xffffffff83d41600 <init_net>, > > > > > > > nf_flags = 0x8, > > > > > > > nf_ref = { > > > > > > > refs = { > > > > > > > counter = 0xc0000000 > > > > > > > } > > > > > > > }, > > > > > > > nf_may = 0x4, > > > > > > > nf_mark = 0xffff88e1bddfb320, > > > > > > > nf_lru = { > > > > > > > next = 0xffff88e160db04a8, > > > > > > > prev = 0xffff88e160db04a8 > > > > > > > }, > > > > > > > nf_rcu = { > > > > > > > next = 0x10000000000, > > > > > > > func = 0x0 > > > > > > > }, > > > > > > > nf_birthtime = 0x73d22fc1728 > > > > > > > } > > > > > > > > > > > > > > crash> struct > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > > > > > > ffff88839a53d850 > > > > > > > nf_flags = 0x8, > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > nf_lru = { > > > > > > > next = 0xffff88839a53d898, > > > > > > > prev = 0xffff88839a53d898 > > > > > > > }, > > > > > > > nf_file = 0xffff88810ede8700, > > > > > > > > > > > > > > crash> struct > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > > > > > > ffff88c32b11e850 > > > > > > > nf_flags = 0x8, > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > nf_lru = { > > > > > > > next = 0xffff88c32b11e898, > > > > > > > prev = 0xffff88c32b11e898 > > > > > > > }, > > > > > > > nf_file = 0xffff88c20a701c00, > > > > > > > > > > > > > > crash> struct > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > > > > > > ffff88e372709700 > > > > > > > nf_flags = 0xc, > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > nf_lru = { > > > > > > > next = 0xffff88e372709748, > > > > > > > prev = 0xffff88e372709748 > > > > > > > }, > > > > > > > nf_file = 0xffff88e0725e6400, > > > > > > > > > > > > > > crash> struct > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > > > > > > ffff8982864944d0 > > > > > > > nf_flags = 0xc, > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > nf_lru = { > > > > > > > next = 0xffff898286494518, > > > > > > > prev = 0xffff898286494518 > > > > > > > }, > > > > > > > nf_file = 0xffff89803c0ff700, > > > > > > > > > > > > > > The leak occurs when nfsd_file_put() races with > > > > > > > nfsd_file_cond_queue() > > > > > > > or nfsd_file_lru_cb(). With the following patch, I > > > > > > > haven't observed > > > > > > > any leak after a few days heavy nfs load: > > > > > > > > > > > > Our patch submission guidelines require a Signed-off-by: > > > > > > line at the end of the patch description. See the "Sign > > > > > > your work - > > > > > > the Developer's Certificate of Origin" section of > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.10-rc6 > > > > > > > > > > > > (Needed here in case your fix is acceptable). > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf) > > > > > > > if (!nfsd_file_lru_remove(nf)) > > > > > > > return; > > > > > > > } > > > > > > > + /* > > > > > > > + * Racing with nfsd_file_cond_queue() or > > > > > > > nfsd_file_lru_cb(), > > > > > > > + * it's unhashed but then removed from the dispose list, > > > > > > > + * so we need to free it. > > > > > > > + */ > > > > > > > + if (refcount_read(&nf->nf_ref) == 0 && > > > > > > > + !test_bit(NFSD_FILE_HASHED, &nf->nf_flags) && > > > > > > > + list_empty(&nf->nf_lru)) { > > > > > > > + nfsd_file_free(nf); > > > > > > > + return; > > > > > > > + } > > > > > > > } > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > > nfsd_file_free(nf); > > > > > > > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > > > *nf, struct > > > > > > > list_head *dispose) > > > > > > > int decrement = 1; > > > > > > > > > > > > > > /* If we raced with someone else unhashing, ignore it > > > > > > > */ > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > > > > > > return; > > > > > > > > > > > > > > /* If we can't get a reference, ignore it */ > > > > > > > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > > > *nf, struct > > > > > > > list_head *dispose) > > > > > > > /* If refcount goes to 0, then put on the dispose list > > > > > > > */ > > > > > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > > > > > list_add(&nf->nf_lru, dispose); > > > > > > > + nfsd_file_unhash(nf); > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > } > > > > > > > } > > > > > > > > > > > > > > Please kindly review the patch and let me know if it > > > > > > > makes sense. > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > -Youzhong > > > > > > > > > > > > > > > > > > > -- > > > > > > Chuck Lever > > > > > > > > > > > > > -- > > > > Jeff Layton <jlayton@poochiereds.net> > > > > -- > > Jeff Layton <jlayton@poochiereds.net> > > > > -- > > Jeff Layton <jlayton@kernel.org>
Thanks Jeff. Unfortunately the early unhash easily leads to leaks: crash> kmem -S nfsd_file | grep '\[ffff' | sed -e 's|\[||' -e 's|\]||' | xargs -i echo struct nfsd_file.nf_flags,nf_ref.refs.counter '{}' > /var/tmp/nfsd_files crash> !wc -l /var/tmp/nfsd_files 19 /var/tmp/nfsd_files crash> < /var/tmp/nfsd_files crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88865c778900 nf_flags = 8, nf_ref.refs.counter = 2 crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88865c778cc0 nf_flags = 8, nf_ref.refs.counter = 3 crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff8885d5f35e00 nf_flags = 8, nf_ref.refs.counter = 1 crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88817443e780 nf_flags = 8, nf_ref.refs.counter = 3 crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88818b3f0600 nf_flags = 8, nf_ref.refs.counter = 2 crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88a4490f8300 nf_flags = 8, nf_ref.refs.counter = 1 crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88a0dab183c0 nf_flags = 8, nf_ref.refs.counter = 40 ... crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff89209535f200 nf_flags = 8, nf_ref.refs.counter = 2 crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff8980e15138c0 nf_flags = 8, nf_ref.refs.counter = 7 crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff898107b95800 nf_flags = 8, nf_ref.refs.counter = 3 crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff898107b95980 nf_flags = 12, nf_ref.refs.counter = 1 nfsd_file_do_acquire() -> nfsd_file_lookup_locked() relies on the hash table to find the nfsd_file, but I am still scratching my head why and how this happens. FYI, here is the patch I applied for testing: diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c index ad9083ca144b..22ebd7fb8639 100644 --- a/fs/nfsd/filecache.c +++ b/fs/nfsd/filecache.c @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need, return NULL; INIT_LIST_HEAD(&nf->nf_lru); + INIT_LIST_HEAD(&nf->nf_gc); nf->nf_birthtime = ktime_get(); nf->nf_file = NULL; nf->nf_cred = get_current_cred(); @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct list_head *dispose) struct nfsd_file *nf; while (!list_empty(dispose)) { - nf = list_first_entry(dispose, struct nfsd_file, nf_lru); - list_del_init(&nf->nf_lru); + nf = list_first_entry(dispose, struct nfsd_file, nf_gc); + list_del_init(&nf->nf_gc); nfsd_file_free(nf); } } @@ -411,12 +412,12 @@ nfsd_file_dispose_list_delayed(struct list_head *dispose) { while(!list_empty(dispose)) { struct nfsd_file *nf = list_first_entry(dispose, - struct nfsd_file, nf_lru); + struct nfsd_file, nf_gc); struct nfsd_net *nn = net_generic(nf->nf_net, nfsd_net_id); struct nfsd_fcache_disposal *l = nn->fcache_disposal; spin_lock(&l->lock); - list_move_tail(&nf->nf_lru, &l->freeme); + list_move_tail(&nf->nf_gc, &l->freeme); spin_unlock(&l->lock); svc_wake_up(nn->nfsd_serv); } @@ -503,7 +504,8 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru, /* Refcount went to zero. Unhash it and queue it to the dispose list */ nfsd_file_unhash(nf); - list_lru_isolate_move(lru, &nf->nf_lru, head); + list_lru_isolate(lru, &nf->nf_lru); + list_add(&nf->nf_gc, head); this_cpu_inc(nfsd_file_evictions); trace_nfsd_file_gc_disposed(nf); return LRU_REMOVED; @@ -578,7 +580,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct list_head *dispose) /* If refcount goes to 0, then put on the dispose list */ if (refcount_sub_and_test(decrement, &nf->nf_ref)) { - list_add(&nf->nf_lru, dispose); + list_add(&nf->nf_gc, dispose); trace_nfsd_file_closing(nf); } } @@ -654,8 +656,8 @@ nfsd_file_close_inode_sync(struct inode *inode) nfsd_file_queue_for_close(inode, &dispose); while (!list_empty(&dispose)) { - nf = list_first_entry(&dispose, struct nfsd_file, nf_lru); - list_del_init(&nf->nf_lru); + nf = list_first_entry(&dispose, struct nfsd_file, nf_gc); + list_del_init(&nf->nf_gc); nfsd_file_free(nf); } } diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h index c61884def906..3fbec24eea6c 100644 --- a/fs/nfsd/filecache.h +++ b/fs/nfsd/filecache.h @@ -44,6 +44,7 @@ struct nfsd_file { struct nfsd_file_mark *nf_mark; struct list_head nf_lru; + struct list_head nf_gc; struct rcu_head nf_rcu; ktime_t nf_birthtime; }; On Mon, Jul 8, 2024 at 10:55 AM Jeff Layton <jlayton@kernel.org> wrote: > > On Mon, 2024-07-08 at 10:23 -0400, Youzhong Yang wrote: > > Thanks Jeff. > > > > I am ok with reverting the unhash/dispose list reordering in > > nfsd_file_lru_cb(), as it doesn't make much of a difference, > > but for nfsd_file_cond_queue(), imagining this: > > > > - A nfsd_file is hashed > > - In nfsd_file_cond_queue(), [if (!nfsd_file_unhash(nf))] will get it > > unhashed, doesn't it? > > - It continues to get a reference by nfsd_file_get() > > - It continues to remove itself from LRU by nfsd_file_lru_remove() if > > it is on the LRU. > > - Now it runs refcount_sub_and_test(), what happens if the refcnt > > does > > not go to 0? How can this nfsd_file be found again? Through the hash > > table? Through the LRU walk? how? > > > > Thanks again. > > > > -Youzhong > > > > It won't need to be found again. The holders of the extra references > will put those references when they are finished. Since the object is > no longer HASHED, nfsd_file_put just does this: > > if (refcount_dec_and_test(&nf->nf_ref)) > nfsd_file_free(nf); > > So that should be fine. > > > On Mon, Jul 8, 2024 at 9:35 AM Jeff Layton <jlayton@kernel.org> > > wrote: > > > > > > On Mon, 2024-07-08 at 08:58 -0400, Youzhong Yang wrote: > > > > Thank you Jeff for your invaluable insights. I was leaning > > > > towards > > > > adding a new list_head too, and tested this approach on kernel > > > > 6.6 by > > > > continuously hammering the server with heavy nfs load for the > > > > last few > > > > days, not a single leak. > > > > > > > > Here goes the patch (based on Linux kernel master branch), please > > > > review: > > > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > > Date: Thu, 4 Jul 2024 11:25:40 -0400 > > > > Subject: [PATCH] nfsd: fix nfsd_file leaking due to mixed use of > > > > nf->nf_lru > > > > > > > > nfsd_file_put() in one thread can race with another thread doing > > > > garbage collection (running nfsd_file_gc() -> list_lru_walk() -> > > > > nfsd_file_lru_cb()): > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do > > > > nfsd_file_lru_add(). > > > > * nfsd_file_lru_add() returns true (with NFSD_FILE_REFERENCED > > > > bit set) > > > > * garbage collector kicks in, nfsd_file_lru_cb() clears > > > > REFERENCED bit and > > > > returns LRU_ROTATE. > > > > * garbage collector kicks in again, nfsd_file_lru_cb() now > > > > decrements nf->nf_ref > > > > to 0, runs nfsd_file_unhash(), removes it from the LRU and > > > > adds to > > > > the dispose > > > > list [list_lru_isolate_move(lru, &nf->nf_lru, head)] > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so > > > > it > > > > tries to remove > > > > the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))]. The > > > > 'nf' > > > > has been added > > > > to the 'dispose' list by nfsd_file_lru_cb(), so > > > > nfsd_file_lru_remove(nf) simply > > > > treats it as part of the LRU and removes it, which leads to > > > > its removal from > > > > the 'dispose' list. > > > > * At this moment, 'nf' is unhashed with its nf_ref being 0, and > > > > not > > > > on the LRU. > > > > nfsd_file_put() continues its execution [if > > > > (refcount_dec_and_test(&nf->nf_ref))], > > > > as nf->nf_ref is already 0, nf->nf_ref is set to > > > > REFCOUNT_SATURATED, and the 'nf' > > > > gets no chance of being freed. > > > > > > > > nfsd_file_put() can also race with nfsd_file_cond_queue(): > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do > > > > nfsd_file_lru_add(). > > > > * nfsd_file_lru_add() sets REFERENCED bit and returns true. > > > > * Some userland application runs 'exportfs -f' or something > > > > like > > > > that, which triggers > > > > __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). > > > > * In nfsd_file_cond_queue(), it runs [if > > > > (!nfsd_file_unhash(nf))], > > > > unhash is done > > > > successfully. > > > > * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], now > > > > nf->nf_ref goes to 2. > > > > * nfsd_file_cond_queue() runs [if (nfsd_file_lru_remove(nf))], > > > > it succeeds. > > > > * nfsd_file_cond_queue() runs [if > > > > (refcount_sub_and_test(decrement, > > > > &nf->nf_ref))] > > > > (with "decrement" being 2), so the nf->nf_ref goes to 0, the > > > > 'nf' > > > > is added to the > > > > dispose list [list_add(&nf->nf_lru, dispose)] > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so > > > > it > > > > tries to remove > > > > the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))], > > > > although > > > > the 'nf' is not > > > > in the LRU, but it is linked in the 'dispose' list, > > > > nfsd_file_lru_remove() simply > > > > treats it as part of the LRU and removes it. This leads to > > > > its removal from > > > > the 'dispose' list! > > > > * Now nf->ref is 0, unhashed. nfsd_file_put() continues its > > > > execution and set > > > > nf->nf_ref to REFCOUNT_SATURATED. > > > > > > > > As shown in the above analysis, using nf_lru for both the LRU > > > > list and > > > > dispose list > > > > can cause the leaks. This patch adds a new list_head nf_gc in > > > > struct > > > > nfsd_file, and uses > > > > it for the dispose list. It's not expected to have a nfsd_file > > > > unhashed but it's not > > > > added to the dispose list, so in nfsd_file_cond_queue() and > > > > nfsd_file_lru_cb() nfsd_file > > > > is unhashed after being added to the dispose list. > > > > > > > > > > I don't see where we require the object to be either hashed or on > > > the > > > dispose list. I think you probably just want to do a patch that > > > changes the dispose list to use a dedicated list_head without > > > reordering when the these things are unhashed. > > > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > > --- > > > > fs/nfsd/filecache.c | 23 ++++++++++++++--------- > > > > fs/nfsd/filecache.h | 1 + > > > > 2 files changed, 15 insertions(+), 9 deletions(-) > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > index ad9083ca144b..3aef2ddfce94 100644 > > > > --- a/fs/nfsd/filecache.c > > > > +++ b/fs/nfsd/filecache.c > > > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, struct inode > > > > *inode, unsigned char need, > > > > return NULL; > > > > > > > > INIT_LIST_HEAD(&nf->nf_lru); > > > > + INIT_LIST_HEAD(&nf->nf_gc); > > > > nf->nf_birthtime = ktime_get(); > > > > nf->nf_file = NULL; > > > > nf->nf_cred = get_current_cred(); > > > > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct list_head > > > > *dispose) > > > > struct nfsd_file *nf; > > > > > > > > while (!list_empty(dispose)) { > > > > - nf = list_first_entry(dispose, struct nfsd_file, > > > > nf_lru); > > > > - list_del_init(&nf->nf_lru); > > > > + nf = list_first_entry(dispose, struct nfsd_file, > > > > nf_gc); > > > > + list_del_init(&nf->nf_gc); > > > > nfsd_file_free(nf); > > > > } > > > > } > > > > @@ -411,12 +412,12 @@ nfsd_file_dispose_list_delayed(struct > > > > list_head *dispose) > > > > { > > > > while(!list_empty(dispose)) { > > > > struct nfsd_file *nf = list_first_entry(dispose, > > > > - struct nfsd_file, > > > > nf_lru); > > > > + struct nfsd_file, > > > > nf_gc); > > > > struct nfsd_net *nn = net_generic(nf->nf_net, > > > > nfsd_net_id); > > > > struct nfsd_fcache_disposal *l = nn- > > > > >fcache_disposal; > > > > > > > > spin_lock(&l->lock); > > > > - list_move_tail(&nf->nf_lru, &l->freeme); > > > > + list_move_tail(&nf->nf_gc, &l->freeme); > > > > spin_unlock(&l->lock); > > > > svc_wake_up(nn->nfsd_serv); > > > > } > > > > @@ -502,8 +503,10 @@ nfsd_file_lru_cb(struct list_head *item, > > > > struct > > > > list_lru_one *lru, > > > > } > > > > > > > > /* Refcount went to zero. Unhash it and queue it to the > > > > dispose list */ > > > > + list_lru_isolate(lru, &nf->nf_lru); > > > > + list_add(&nf->nf_gc, head); > > > > + /* Unhash after removing from LRU and adding to dispose > > > > list */ > > > > nfsd_file_unhash(nf); > > > > - list_lru_isolate_move(lru, &nf->nf_lru, head); > > > > > > I don't see the point in reordering these operations. Hashing is > > > all > > > about making the thing findable by nfsd operations. The _last_ > > > thing we > > > want to do is put it on the dispose list while the thing can still > > > be > > > found by nfsd threads doing operations. > > > > > > > this_cpu_inc(nfsd_file_evictions); > > > > trace_nfsd_file_gc_disposed(nf); > > > > return LRU_REMOVED; > > > > @@ -565,7 +568,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, > > > > struct > > > > list_head *dispose) > > > > int decrement = 1; > > > > > > > > /* If we raced with someone else unhashing, ignore it */ > > > > - if (!nfsd_file_unhash(nf)) > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > > > > > The above change looks wrong. I don't think we need to change this. > > > > > > > return; > > > > > > > > /* If we can't get a reference, ignore it */ > > > > @@ -578,7 +581,9 @@ nfsd_file_cond_queue(struct nfsd_file *nf, > > > > struct > > > > list_head *dispose) > > > > > > > > /* If refcount goes to 0, then put on the dispose list */ > > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > > - list_add(&nf->nf_lru, dispose); > > > > + list_add(&nf->nf_gc, dispose); > > > > + /* Unhash after adding to dispose list */ > > > > + nfsd_file_unhash(nf); > > > > > > This too looks wrong? Maybe I'm unclear on the race you're trying > > > to > > > fix with this? What's the harm in unhashing it early? > > > > > > > trace_nfsd_file_closing(nf); > > > > } > > > > } > > > > @@ -654,8 +659,8 @@ nfsd_file_close_inode_sync(struct inode > > > > *inode) > > > > > > > > nfsd_file_queue_for_close(inode, &dispose); > > > > while (!list_empty(&dispose)) { > > > > - nf = list_first_entry(&dispose, struct nfsd_file, > > > > nf_lru); > > > > - list_del_init(&nf->nf_lru); > > > > + nf = list_first_entry(&dispose, struct nfsd_file, > > > > nf_gc); > > > > + list_del_init(&nf->nf_gc); > > > > nfsd_file_free(nf); > > > > } > > > > } > > > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h > > > > index c61884def906..3fbec24eea6c 100644 > > > > --- a/fs/nfsd/filecache.h > > > > +++ b/fs/nfsd/filecache.h > > > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > > > > > struct nfsd_file_mark *nf_mark; > > > > struct list_head nf_lru; > > > > + struct list_head nf_gc; > > > > struct rcu_head nf_rcu; > > > > ktime_t nf_birthtime; > > > > }; > > > > -- > > > > 2.34.1 > > > > > > > > On Thu, Jul 4, 2024 at 7:14 AM Jeff Layton > > > > <jlayton@poochiereds.net> wrote: > > > > > > > > > > On Wed, 2024-07-03 at 16:46 -0400, Youzhong Yang wrote: > > > > > > Thank you Chuck. Here are my quick answers to your comments: > > > > > > > > > > > > - I don't have a quick reproducer. I reproduced it by using > > > > > > hundreds > > > > > > of nfs clients generating +600K ops under our workload in the > > > > > > testing > > > > > > environment. Theoretically it should be possible to simplify > > > > > > the > > > > > > reproduction but I am still working on it. > > > > > > > > > > > > - I understand zfs is an out-of-tree file system. That's > > > > > > fine. But > > > > > > this leaking can happen to any file system, and leaking is > > > > > > not a good > > > > > > thing no matter what file system it is. > > > > > > > > > > > > - I will try to come up with a reproducer using xfs or btrfs > > > > > > if possible. > > > > > > > > > > > > Now back to the problem itself, here are my findings: > > > > > > > > > > > > - nfsd_file_put() in one thread can race with another thread > > > > > > doing > > > > > > garbage collection (running nfsd_file_gc() -> list_lru_walk() > > > > > > -> > > > > > > nfsd_file_lru_cb()): > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do > > > > > > nfsd_file_lru_add(). > > > > > > * nfsd_file_lru_add() returns true (thus > > > > > > NFSD_FILE_REFERENCED bit > > > > > > set for nf->nf_flags) > > > > > > * garbage collector kicks in, nfsd_file_lru_cb() clears > > > > > > REFERENCED > > > > > > bit and returns LRU_ROTATE. > > > > > > * garbage collector kicks in again, nfsd_file_lru_cb() now > > > > > > decrements nf->nf_ref to 0, runs nfsd_file_unhash(), removes > > > > > > it from > > > > > > the LRU and adds to the dispose list > > > > > > [list_lru_isolate_move(lru, > > > > > > &nf->nf_lru, head);] > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, > > > > > > so it > > > > > > tries to remove the 'nf' from the LRU [if > > > > > > (!nfsd_file_lru_remove(nf))] > > > > > > * The 'nf' has been added to the 'dispose' list by > > > > > > nfsd_file_lru_cb(), so nfsd_file_lru_remove(nf) simply treats > > > > > > it as > > > > > > part of the LRU and removes it, which leads it to be removed > > > > > > from the > > > > > > 'dispose' list. > > > > > > * At this moment, nf->nf_ref is 0, it's unhashed, and not > > > > > > on the > > > > > > LRU. nfsd_file_put() continues its execution [if > > > > > > (refcount_dec_and_test(&nf->nf_ref))], as nf->nf_ref is > > > > > > already 0, now > > > > > > bad thing happens: nf->nf_ref is set to REFCOUNT_SATURATED, > > > > > > and the > > > > > > 'nf' is leaked. > > > > > > > > > > > > To make this happen, the right timing is crucial. It can be > > > > > > reproduced > > > > > > by adding artifical delays in filecache.c, or hammering the > > > > > > nfsd with > > > > > > tons of ops. > > > > > > > > > > > > - Let's see how nfsd_file_put() can race with > > > > > > nfsd_file_cond_queue(): > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do > > > > > > nfsd_file_lru_add(). > > > > > > * nfsd_file_lru_add() sets REFERENCED bit and returns true. > > > > > > * 'exportfs -f' or something like that triggers > > > > > > __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). > > > > > > * In nfsd_file_cond_queue(), it runs [if > > > > > > (!nfsd_file_unhash(nf))], > > > > > > unhash is done successfully. > > > > > > * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], > > > > > > now > > > > > > nf->nf_ref goes to 2. > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > (nfsd_file_lru_remove(nf))], it succeeds. > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > (refcount_sub_and_test(decrement, > > > > > > &nf->nf_ref))] (with "decrement" being 2), so the nf->nf_ref > > > > > > goes to > > > > > > 0, the 'nf' is added to the dispost list [list_add(&nf- > > > > > > >nf_lru, > > > > > > dispose)] > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, > > > > > > so it > > > > > > tries to remove the 'nf' from the LRU [if > > > > > > (!nfsd_file_lru_remove(nf))], although the 'nf' is not in the > > > > > > LRU, but > > > > > > it is linked in the 'dispose' list, nfsd_file_lru_remove() > > > > > > simply > > > > > > treats it as part of the LRU and removes it. This leads to > > > > > > its removal > > > > > > from the 'dispose' list! > > > > > > * Now nf->ref is 0, unhashed. nfsd_file_put() continues its > > > > > > execution and sets nf->nf_ref to REFCOUNT_SATURATED. > > > > > > > > > > > > The purpose of nf->nf_lru is problematic. As you can see, it > > > > > > is used > > > > > > for the LRU list, and also the 'dispose' list. Adding another > > > > > > 'struct > > > > > > list_head' specifically for the 'dispose' list seems to be a > > > > > > better > > > > > > way of fixing this race condition. Either way works for me. > > > > > > > > > > > > Would you agree my above analysis makes sense? Thanks. > > > > > > > > > > > > > > > > I think so. It's been a while since I've done much work in this > > > > > code, > > > > > but it does sound like there is a race in the LRU handling. > > > > > > > > > > > > > > > Like Chuck said, the nf->nf_lru list should be safe to use for > > > > > multiple > > > > > purposes, but that's only the case if we're not using that list > > > > > as an > > > > > indicator. > > > > > > > > > > The list_lru code does check this: > > > > > > > > > > if (!list_empty(item)) { > > > > > > > > > > ...so if we ever check this while it's sitting on the dispose > > > > > list, it > > > > > will handle it incorrectly. It sounds like that's the root > > > > > cause of the > > > > > problem you're seeing? > > > > > > > > > > If so, then maybe a separate list_head for disposal would be > > > > > better. > > > > > > > > > > > Here is my patch with signed-off-by: > > > > > > > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > > > > Date: Mon, 1 Jul 2024 06:45:22 -0400 > > > > > > Subject: [PATCH] nfsd: fix nfsd_file leaking due to race > > > > > > condition and early > > > > > > unhash > > > > > > > > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > > > > --- > > > > > > fs/nfsd/filecache.c | 14 +++++++++++++- > > > > > > 1 file changed, 13 insertions(+), 1 deletion(-) > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > > > --- a/fs/nfsd/filecache.c > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf) > > > > > > if (!nfsd_file_lru_remove(nf)) > > > > > > return; > > > > > > } > > > > > > + /* > > > > > > + * Racing with nfsd_file_cond_queue() or > > > > > > nfsd_file_lru_cb(), > > > > > > + * it's unhashed but then removed from the > > > > > > dispose list, > > > > > > + * so we need to free it. > > > > > > + */ > > > > > > + if (refcount_read(&nf->nf_ref) == 0 && > > > > > > > > > > A refcount_read in this path is a red flag to me. Anytime > > > > > you're just > > > > > looking at the refcount without changing anything just screams > > > > > out > > > > > "race condition". > > > > > > > > > > In this case, what guarantee is there that this won't run afoul > > > > > of the > > > > > timing? We could check this and find out it's 1 just before it > > > > > goes to > > > > > 0 and you check the other conditions. > > > > > > > > > > Does anything prevent that? > > > > > > > > > > > + !test_bit(NFSD_FILE_HASHED, &nf- > > > > > > >nf_flags) && > > > > > > + list_empty(&nf->nf_lru)) { > > > > > > + nfsd_file_free(nf); > > > > > > + return; > > > > > > + } > > > > > > } > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > nfsd_file_free(nf); > > > > > > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > > *nf, struct > > > > > > list_head *dispose) > > > > > > int decrement = 1; > > > > > > > > > > > > /* If we raced with someone else unhashing, ignore it > > > > > > */ > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > > > > > return; > > > > > > > > > > Same here: you're just testing for the HASHED bit, but could > > > > > this also > > > > > race with someone who is setting it just after you get here. > > > > > Why is > > > > > that not a problem? > > > > > > > > > > > > > > > > > /* If we can't get a reference, ignore it */ > > > > > > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > > *nf, struct > > > > > > list_head *dispose) > > > > > > /* If refcount goes to 0, then put on the dispose > > > > > > list */ > > > > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > > > > list_add(&nf->nf_lru, dispose); > > > > > > + nfsd_file_unhash(nf); > > > > > > trace_nfsd_file_closing(nf); > > > > > > } > > > > > > } > > > > > > -- > > > > > > 2.43.0 > > > > > > > > > > > > On Wed, Jul 3, 2024 at 2:21 PM Chuck Lever > > > > > > <chuck.lever@oracle.com> wrote: > > > > > > > > > > > > > > On Wed, Jul 03, 2024 at 10:12:33AM -0400, Youzhong Yang > > > > > > > wrote: > > > > > > > > Hello, > > > > > > > > > > > > > > > > I'd like to report a nfsd_file leaking issue and propose > > > > > > > > a fix for it. > > > > > > > > > > > > > > > > When I tested Linux kernel 6.8 and 6.6, I noticed > > > > > > > > nfsd_file leaks > > > > > > > > which led to undestroyable file systems (zfs), > > > > > > > > > > > > > > Thanks for the report. Some initial comments: > > > > > > > > > > > > > > - Do you have a specific reproducer? In other words, what > > > > > > > is the > > > > > > > simplest program that can run on an NFS client that will > > > > > > > trigger > > > > > > > this leak, and can you post it? > > > > > > > > > > > > > > - "zfs" is an out-of-tree file system, so it's not directly > > > > > > > supported for NFSD. > > > > > > > > > > > > > > - The guidelines for patch submission require us to fix > > > > > > > issues in > > > > > > > upstream Linux first (currently that's v6.10-rc6). Then > > > > > > > that fix > > > > > > > can be backported to older stable kernels like 6.6. > > > > > > > > > > > > > > Can you reproduce the leak with one of the in-kernel > > > > > > > filesystems > > > > > > > (either xfs or btrfs would be great) and with NFSD in 6.10- > > > > > > > rc6? > > > > > > > > > > > > > > One more comment below. > > > > > > > > > > > > > > > > > > > > > > here are some examples: > > > > > > > > > > > > > > > > crash> struct nfsd_file -x ffff88e160db0460 > > > > > > > > struct nfsd_file { > > > > > > > > nf_rlist = { > > > > > > > > rhead = { > > > > > > > > next = 0xffff8921fa2392f1 > > > > > > > > }, > > > > > > > > next = 0x0 > > > > > > > > }, > > > > > > > > nf_inode = 0xffff8882bc312ef8, > > > > > > > > nf_file = 0xffff88e2015b1500, > > > > > > > > nf_cred = 0xffff88e3ab0e7800, > > > > > > > > nf_net = 0xffffffff83d41600 <init_net>, > > > > > > > > nf_flags = 0x8, > > > > > > > > nf_ref = { > > > > > > > > refs = { > > > > > > > > counter = 0xc0000000 > > > > > > > > } > > > > > > > > }, > > > > > > > > nf_may = 0x4, > > > > > > > > nf_mark = 0xffff88e1bddfb320, > > > > > > > > nf_lru = { > > > > > > > > next = 0xffff88e160db04a8, > > > > > > > > prev = 0xffff88e160db04a8 > > > > > > > > }, > > > > > > > > nf_rcu = { > > > > > > > > next = 0x10000000000, > > > > > > > > func = 0x0 > > > > > > > > }, > > > > > > > > nf_birthtime = 0x73d22fc1728 > > > > > > > > } > > > > > > > > > > > > > > > > crash> struct > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > > > > > > > ffff88839a53d850 > > > > > > > > nf_flags = 0x8, > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > nf_lru = { > > > > > > > > next = 0xffff88839a53d898, > > > > > > > > prev = 0xffff88839a53d898 > > > > > > > > }, > > > > > > > > nf_file = 0xffff88810ede8700, > > > > > > > > > > > > > > > > crash> struct > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > > > > > > > ffff88c32b11e850 > > > > > > > > nf_flags = 0x8, > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > nf_lru = { > > > > > > > > next = 0xffff88c32b11e898, > > > > > > > > prev = 0xffff88c32b11e898 > > > > > > > > }, > > > > > > > > nf_file = 0xffff88c20a701c00, > > > > > > > > > > > > > > > > crash> struct > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > > > > > > > ffff88e372709700 > > > > > > > > nf_flags = 0xc, > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > nf_lru = { > > > > > > > > next = 0xffff88e372709748, > > > > > > > > prev = 0xffff88e372709748 > > > > > > > > }, > > > > > > > > nf_file = 0xffff88e0725e6400, > > > > > > > > > > > > > > > > crash> struct > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x > > > > > > > > ffff8982864944d0 > > > > > > > > nf_flags = 0xc, > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > nf_lru = { > > > > > > > > next = 0xffff898286494518, > > > > > > > > prev = 0xffff898286494518 > > > > > > > > }, > > > > > > > > nf_file = 0xffff89803c0ff700, > > > > > > > > > > > > > > > > The leak occurs when nfsd_file_put() races with > > > > > > > > nfsd_file_cond_queue() > > > > > > > > or nfsd_file_lru_cb(). With the following patch, I > > > > > > > > haven't observed > > > > > > > > any leak after a few days heavy nfs load: > > > > > > > > > > > > > > Our patch submission guidelines require a Signed-off-by: > > > > > > > line at the end of the patch description. See the "Sign > > > > > > > your work - > > > > > > > the Developer's Certificate of Origin" section of > > > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.10-rc6 > > > > > > > > > > > > > > (Needed here in case your fix is acceptable). > > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf) > > > > > > > > if (!nfsd_file_lru_remove(nf)) > > > > > > > > return; > > > > > > > > } > > > > > > > > + /* > > > > > > > > + * Racing with nfsd_file_cond_queue() or > > > > > > > > nfsd_file_lru_cb(), > > > > > > > > + * it's unhashed but then removed from the dispose list, > > > > > > > > + * so we need to free it. > > > > > > > > + */ > > > > > > > > + if (refcount_read(&nf->nf_ref) == 0 && > > > > > > > > + !test_bit(NFSD_FILE_HASHED, &nf->nf_flags) && > > > > > > > > + list_empty(&nf->nf_lru)) { > > > > > > > > + nfsd_file_free(nf); > > > > > > > > + return; > > > > > > > > + } > > > > > > > > } > > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > > > nfsd_file_free(nf); > > > > > > > > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > > > > *nf, struct > > > > > > > > list_head *dispose) > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > /* If we raced with someone else unhashing, ignore it > > > > > > > > */ > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > > > > > > > return; > > > > > > > > > > > > > > > > /* If we can't get a reference, ignore it */ > > > > > > > > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > > > > *nf, struct > > > > > > > > list_head *dispose) > > > > > > > > /* If refcount goes to 0, then put on the dispose list > > > > > > > > */ > > > > > > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > > > > > > list_add(&nf->nf_lru, dispose); > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > } > > > > > > > > } > > > > > > > > > > > > > > > > Please kindly review the patch and let me know if it > > > > > > > > makes sense. > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > -Youzhong > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > Chuck Lever > > > > > > > > > > > > > > > > -- > > > > > Jeff Layton <jlayton@poochiereds.net> > > > > > > -- > > > Jeff Layton <jlayton@poochiereds.net> > > > > > > -- > > > Jeff Layton <jlayton@kernel.org> > > -- > Jeff Layton <jlayton@kernel.org>
On Tue, 2024-07-09 at 14:37 -0400, Youzhong Yang wrote: > Thanks Jeff. > > Unfortunately the early unhash easily leads to leaks: > > crash> kmem -S nfsd_file | grep '\[ffff' | sed -e 's|\[||' -e > 's|\]||' > > xargs -i echo struct nfsd_file.nf_flags,nf_ref.refs.counter '{}' > > /var/tmp/nfsd_files > crash> !wc -l /var/tmp/nfsd_files > 19 /var/tmp/nfsd_files > crash> < /var/tmp/nfsd_files > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88865c778900 > nf_flags = 8, > nf_ref.refs.counter = 2 > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88865c778cc0 > nf_flags = 8, > nf_ref.refs.counter = 3 > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff8885d5f35e00 > nf_flags = 8, > nf_ref.refs.counter = 1 > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88817443e780 > nf_flags = 8, > nf_ref.refs.counter = 3 > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88818b3f0600 > nf_flags = 8, > nf_ref.refs.counter = 2 > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88a4490f8300 > nf_flags = 8, > nf_ref.refs.counter = 1 > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88a0dab183c0 > nf_flags = 8, > nf_ref.refs.counter = 40 That's a lot of references! Might be interesting to look more closely at that one, but the refcounts are all over the place, so it really does look like we just have a refcount leak somewhere. > ... > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff89209535f200 > nf_flags = 8, > nf_ref.refs.counter = 2 > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff8980e15138c0 > nf_flags = 8, > nf_ref.refs.counter = 7 > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff898107b95800 > nf_flags = 8, > nf_ref.refs.counter = 3 > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff898107b95980 > nf_flags = 12, The others are NFSD_FILE_GC. This one is also NFSD_FILE_REFERENCED. Are these objects on the LRU? > nf_ref.refs.counter = 1 > > nfsd_file_do_acquire() -> nfsd_file_lookup_locked() relies on the > hash > table to find the nfsd_file, > but I am still scratching my head why and how this happens. > > FYI, here is the patch I applied for testing: > The above suggests to me that there is a garden-variety refcount leak somewhere. Whenever some bit of the code takes a reference to an object (like a nfsd_file), it's implicitly required to eventually put that reference. Typically that means that the code needs to maintain a pointer to that object, as it can be unhashed at any time and it can't rely on finding the same object later. Do you have a reproducer for this? > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > index ad9083ca144b..22ebd7fb8639 100644 > --- a/fs/nfsd/filecache.c > +++ b/fs/nfsd/filecache.c > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, struct inode > *inode, unsigned char need, > return NULL; > > INIT_LIST_HEAD(&nf->nf_lru); > + INIT_LIST_HEAD(&nf->nf_gc); > nf->nf_birthtime = ktime_get(); > nf->nf_file = NULL; > nf->nf_cred = get_current_cred(); > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct list_head *dispose) > struct nfsd_file *nf; > > while (!list_empty(dispose)) { > - nf = list_first_entry(dispose, struct nfsd_file, > nf_lru); > - list_del_init(&nf->nf_lru); > + nf = list_first_entry(dispose, struct nfsd_file, > nf_gc); > + list_del_init(&nf->nf_gc); > nfsd_file_free(nf); > } > } > @@ -411,12 +412,12 @@ nfsd_file_dispose_list_delayed(struct list_head > *dispose) > { > while(!list_empty(dispose)) { > struct nfsd_file *nf = list_first_entry(dispose, > - struct nfsd_file, > nf_lru); > + struct nfsd_file, > nf_gc); > struct nfsd_net *nn = net_generic(nf->nf_net, > nfsd_net_id); > struct nfsd_fcache_disposal *l = nn->fcache_disposal; > > spin_lock(&l->lock); > - list_move_tail(&nf->nf_lru, &l->freeme); > + list_move_tail(&nf->nf_gc, &l->freeme); > spin_unlock(&l->lock); > svc_wake_up(nn->nfsd_serv); > } > @@ -503,7 +504,8 @@ nfsd_file_lru_cb(struct list_head *item, struct > list_lru_one *lru, > > /* Refcount went to zero. Unhash it and queue it to the > dispose list */ > nfsd_file_unhash(nf); > - list_lru_isolate_move(lru, &nf->nf_lru, head); > + list_lru_isolate(lru, &nf->nf_lru); > + list_add(&nf->nf_gc, head); > this_cpu_inc(nfsd_file_evictions); > trace_nfsd_file_gc_disposed(nf); > return LRU_REMOVED; > @@ -578,7 +580,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > list_head *dispose) > > /* If refcount goes to 0, then put on the dispose list */ > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > - list_add(&nf->nf_lru, dispose); > + list_add(&nf->nf_gc, dispose); > trace_nfsd_file_closing(nf); > } > } > @@ -654,8 +656,8 @@ nfsd_file_close_inode_sync(struct inode *inode) > > nfsd_file_queue_for_close(inode, &dispose); > while (!list_empty(&dispose)) { > - nf = list_first_entry(&dispose, struct nfsd_file, > nf_lru); > - list_del_init(&nf->nf_lru); > + nf = list_first_entry(&dispose, struct nfsd_file, > nf_gc); > + list_del_init(&nf->nf_gc); > nfsd_file_free(nf); > } > } > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h > index c61884def906..3fbec24eea6c 100644 > --- a/fs/nfsd/filecache.h > +++ b/fs/nfsd/filecache.h > @@ -44,6 +44,7 @@ struct nfsd_file { > > struct nfsd_file_mark *nf_mark; > struct list_head nf_lru; > + struct list_head nf_gc; > struct rcu_head nf_rcu; > ktime_t nf_birthtime; > }; > > On Mon, Jul 8, 2024 at 10:55 AM Jeff Layton <jlayton@kernel.org> > wrote: > > > > On Mon, 2024-07-08 at 10:23 -0400, Youzhong Yang wrote: > > > Thanks Jeff. > > > > > > I am ok with reverting the unhash/dispose list reordering in > > > nfsd_file_lru_cb(), as it doesn't make much of a difference, > > > but for nfsd_file_cond_queue(), imagining this: > > > > > > - A nfsd_file is hashed > > > - In nfsd_file_cond_queue(), [if (!nfsd_file_unhash(nf))] will > > > get it > > > unhashed, doesn't it? > > > - It continues to get a reference by nfsd_file_get() > > > - It continues to remove itself from LRU by > > > nfsd_file_lru_remove() if > > > it is on the LRU. > > > - Now it runs refcount_sub_and_test(), what happens if the refcnt > > > does > > > not go to 0? How can this nfsd_file be found again? Through the > > > hash > > > table? Through the LRU walk? how? > > > > > > Thanks again. > > > > > > -Youzhong > > > > > > > It won't need to be found again. The holders of the extra > > references > > will put those references when they are finished. Since the object > > is > > no longer HASHED, nfsd_file_put just does this: > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > nfsd_file_free(nf); > > > > So that should be fine. > > > > > On Mon, Jul 8, 2024 at 9:35 AM Jeff Layton <jlayton@kernel.org> > > > wrote: > > > > > > > > On Mon, 2024-07-08 at 08:58 -0400, Youzhong Yang wrote: > > > > > Thank you Jeff for your invaluable insights. I was leaning > > > > > towards > > > > > adding a new list_head too, and tested this approach on > > > > > kernel > > > > > 6.6 by > > > > > continuously hammering the server with heavy nfs load for the > > > > > last few > > > > > days, not a single leak. > > > > > > > > > > Here goes the patch (based on Linux kernel master branch), > > > > > please > > > > > review: > > > > > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > > > Date: Thu, 4 Jul 2024 11:25:40 -0400 > > > > > Subject: [PATCH] nfsd: fix nfsd_file leaking due to mixed use > > > > > of > > > > > nf->nf_lru > > > > > > > > > > nfsd_file_put() in one thread can race with another thread > > > > > doing > > > > > garbage collection (running nfsd_file_gc() -> list_lru_walk() > > > > > -> > > > > > nfsd_file_lru_cb()): > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do > > > > > nfsd_file_lru_add(). > > > > > * nfsd_file_lru_add() returns true (with > > > > > NFSD_FILE_REFERENCED > > > > > bit set) > > > > > * garbage collector kicks in, nfsd_file_lru_cb() clears > > > > > REFERENCED bit and > > > > > returns LRU_ROTATE. > > > > > * garbage collector kicks in again, nfsd_file_lru_cb() now > > > > > decrements nf->nf_ref > > > > > to 0, runs nfsd_file_unhash(), removes it from the LRU > > > > > and > > > > > adds to > > > > > the dispose > > > > > list [list_lru_isolate_move(lru, &nf->nf_lru, head)] > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, > > > > > so > > > > > it > > > > > tries to remove > > > > > the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))]. > > > > > The > > > > > 'nf' > > > > > has been added > > > > > to the 'dispose' list by nfsd_file_lru_cb(), so > > > > > nfsd_file_lru_remove(nf) simply > > > > > treats it as part of the LRU and removes it, which leads > > > > > to > > > > > its removal from > > > > > the 'dispose' list. > > > > > * At this moment, 'nf' is unhashed with its nf_ref being 0, > > > > > and > > > > > not > > > > > on the LRU. > > > > > nfsd_file_put() continues its execution [if > > > > > (refcount_dec_and_test(&nf->nf_ref))], > > > > > as nf->nf_ref is already 0, nf->nf_ref is set to > > > > > REFCOUNT_SATURATED, and the 'nf' > > > > > gets no chance of being freed. > > > > > > > > > > nfsd_file_put() can also race with nfsd_file_cond_queue(): > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do > > > > > nfsd_file_lru_add(). > > > > > * nfsd_file_lru_add() sets REFERENCED bit and returns true. > > > > > * Some userland application runs 'exportfs -f' or something > > > > > like > > > > > that, which triggers > > > > > __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). > > > > > * In nfsd_file_cond_queue(), it runs [if > > > > > (!nfsd_file_unhash(nf))], > > > > > unhash is done > > > > > successfully. > > > > > * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], > > > > > now > > > > > nf->nf_ref goes to 2. > > > > > * nfsd_file_cond_queue() runs [if > > > > > (nfsd_file_lru_remove(nf))], > > > > > it succeeds. > > > > > * nfsd_file_cond_queue() runs [if > > > > > (refcount_sub_and_test(decrement, > > > > > &nf->nf_ref))] > > > > > (with "decrement" being 2), so the nf->nf_ref goes to 0, > > > > > the > > > > > 'nf' > > > > > is added to the > > > > > dispose list [list_add(&nf->nf_lru, dispose)] > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, > > > > > so > > > > > it > > > > > tries to remove > > > > > the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))], > > > > > although > > > > > the 'nf' is not > > > > > in the LRU, but it is linked in the 'dispose' list, > > > > > nfsd_file_lru_remove() simply > > > > > treats it as part of the LRU and removes it. This leads > > > > > to > > > > > its removal from > > > > > the 'dispose' list! > > > > > * Now nf->ref is 0, unhashed. nfsd_file_put() continues its > > > > > execution and set > > > > > nf->nf_ref to REFCOUNT_SATURATED. > > > > > > > > > > As shown in the above analysis, using nf_lru for both the LRU > > > > > list and > > > > > dispose list > > > > > can cause the leaks. This patch adds a new list_head nf_gc in > > > > > struct > > > > > nfsd_file, and uses > > > > > it for the dispose list. It's not expected to have a > > > > > nfsd_file > > > > > unhashed but it's not > > > > > added to the dispose list, so in nfsd_file_cond_queue() and > > > > > nfsd_file_lru_cb() nfsd_file > > > > > is unhashed after being added to the dispose list. > > > > > > > > > > > > > I don't see where we require the object to be either hashed or > > > > on > > > > the > > > > dispose list. I think you probably just want to do a patch > > > > that > > > > changes the dispose list to use a dedicated list_head without > > > > reordering when the these things are unhashed. > > > > > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > > > --- > > > > > fs/nfsd/filecache.c | 23 ++++++++++++++--------- > > > > > fs/nfsd/filecache.h | 1 + > > > > > 2 files changed, 15 insertions(+), 9 deletions(-) > > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > index ad9083ca144b..3aef2ddfce94 100644 > > > > > --- a/fs/nfsd/filecache.c > > > > > +++ b/fs/nfsd/filecache.c > > > > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, struct > > > > > inode > > > > > *inode, unsigned char need, > > > > > return NULL; > > > > > > > > > > INIT_LIST_HEAD(&nf->nf_lru); > > > > > + INIT_LIST_HEAD(&nf->nf_gc); > > > > > nf->nf_birthtime = ktime_get(); > > > > > nf->nf_file = NULL; > > > > > nf->nf_cred = get_current_cred(); > > > > > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct list_head > > > > > *dispose) > > > > > struct nfsd_file *nf; > > > > > > > > > > while (!list_empty(dispose)) { > > > > > - nf = list_first_entry(dispose, struct > > > > > nfsd_file, > > > > > nf_lru); > > > > > - list_del_init(&nf->nf_lru); > > > > > + nf = list_first_entry(dispose, struct > > > > > nfsd_file, > > > > > nf_gc); > > > > > + list_del_init(&nf->nf_gc); > > > > > nfsd_file_free(nf); > > > > > } > > > > > } > > > > > @@ -411,12 +412,12 @@ nfsd_file_dispose_list_delayed(struct > > > > > list_head *dispose) > > > > > { > > > > > while(!list_empty(dispose)) { > > > > > struct nfsd_file *nf = > > > > > list_first_entry(dispose, > > > > > - struct > > > > > nfsd_file, > > > > > nf_lru); > > > > > + struct > > > > > nfsd_file, > > > > > nf_gc); > > > > > struct nfsd_net *nn = net_generic(nf->nf_net, > > > > > nfsd_net_id); > > > > > struct nfsd_fcache_disposal *l = nn- > > > > > > fcache_disposal; > > > > > > > > > > spin_lock(&l->lock); > > > > > - list_move_tail(&nf->nf_lru, &l->freeme); > > > > > + list_move_tail(&nf->nf_gc, &l->freeme); > > > > > spin_unlock(&l->lock); > > > > > svc_wake_up(nn->nfsd_serv); > > > > > } > > > > > @@ -502,8 +503,10 @@ nfsd_file_lru_cb(struct list_head *item, > > > > > struct > > > > > list_lru_one *lru, > > > > > } > > > > > > > > > > /* Refcount went to zero. Unhash it and queue it to > > > > > the > > > > > dispose list */ > > > > > + list_lru_isolate(lru, &nf->nf_lru); > > > > > + list_add(&nf->nf_gc, head); > > > > > + /* Unhash after removing from LRU and adding to > > > > > dispose > > > > > list */ > > > > > nfsd_file_unhash(nf); > > > > > - list_lru_isolate_move(lru, &nf->nf_lru, head); > > > > > > > > I don't see the point in reordering these operations. Hashing > > > > is > > > > all > > > > about making the thing findable by nfsd operations. The _last_ > > > > thing we > > > > want to do is put it on the dispose list while the thing can > > > > still > > > > be > > > > found by nfsd threads doing operations. > > > > > > > > > this_cpu_inc(nfsd_file_evictions); > > > > > trace_nfsd_file_gc_disposed(nf); > > > > > return LRU_REMOVED; > > > > > @@ -565,7 +568,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > *nf, > > > > > struct > > > > > list_head *dispose) > > > > > int decrement = 1; > > > > > > > > > > /* If we raced with someone else unhashing, ignore it > > > > > */ > > > > > - if (!nfsd_file_unhash(nf)) > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > > > > > > > The above change looks wrong. I don't think we need to change > > > > this. > > > > > > > > > return; > > > > > > > > > > /* If we can't get a reference, ignore it */ > > > > > @@ -578,7 +581,9 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > *nf, > > > > > struct > > > > > list_head *dispose) > > > > > > > > > > /* If refcount goes to 0, then put on the dispose > > > > > list */ > > > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > > > - list_add(&nf->nf_lru, dispose); > > > > > + list_add(&nf->nf_gc, dispose); > > > > > + /* Unhash after adding to dispose list */ > > > > > + nfsd_file_unhash(nf); > > > > > > > > This too looks wrong? Maybe I'm unclear on the race you're > > > > trying > > > > to > > > > fix with this? What's the harm in unhashing it early? > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > } > > > > > } > > > > > @@ -654,8 +659,8 @@ nfsd_file_close_inode_sync(struct inode > > > > > *inode) > > > > > > > > > > nfsd_file_queue_for_close(inode, &dispose); > > > > > while (!list_empty(&dispose)) { > > > > > - nf = list_first_entry(&dispose, struct > > > > > nfsd_file, > > > > > nf_lru); > > > > > - list_del_init(&nf->nf_lru); > > > > > + nf = list_first_entry(&dispose, struct > > > > > nfsd_file, > > > > > nf_gc); > > > > > + list_del_init(&nf->nf_gc); > > > > > nfsd_file_free(nf); > > > > > } > > > > > } > > > > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h > > > > > index c61884def906..3fbec24eea6c 100644 > > > > > --- a/fs/nfsd/filecache.h > > > > > +++ b/fs/nfsd/filecache.h > > > > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > > > > > > > struct nfsd_file_mark *nf_mark; > > > > > struct list_head nf_lru; > > > > > + struct list_head nf_gc; > > > > > struct rcu_head nf_rcu; > > > > > ktime_t nf_birthtime; > > > > > }; > > > > > -- > > > > > 2.34.1 > > > > > > > > > > On Thu, Jul 4, 2024 at 7:14 AM Jeff Layton > > > > > <jlayton@poochiereds.net> wrote: > > > > > > > > > > > > On Wed, 2024-07-03 at 16:46 -0400, Youzhong Yang wrote: > > > > > > > Thank you Chuck. Here are my quick answers to your > > > > > > > comments: > > > > > > > > > > > > > > - I don't have a quick reproducer. I reproduced it by > > > > > > > using > > > > > > > hundreds > > > > > > > of nfs clients generating +600K ops under our workload in > > > > > > > the > > > > > > > testing > > > > > > > environment. Theoretically it should be possible to > > > > > > > simplify > > > > > > > the > > > > > > > reproduction but I am still working on it. > > > > > > > > > > > > > > - I understand zfs is an out-of-tree file system. That's > > > > > > > fine. But > > > > > > > this leaking can happen to any file system, and leaking > > > > > > > is > > > > > > > not a good > > > > > > > thing no matter what file system it is. > > > > > > > > > > > > > > - I will try to come up with a reproducer using xfs or > > > > > > > btrfs > > > > > > > if possible. > > > > > > > > > > > > > > Now back to the problem itself, here are my findings: > > > > > > > > > > > > > > - nfsd_file_put() in one thread can race with another > > > > > > > thread > > > > > > > doing > > > > > > > garbage collection (running nfsd_file_gc() -> > > > > > > > list_lru_walk() > > > > > > > -> > > > > > > > nfsd_file_lru_cb()): > > > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to > > > > > > > do > > > > > > > nfsd_file_lru_add(). > > > > > > > * nfsd_file_lru_add() returns true (thus > > > > > > > NFSD_FILE_REFERENCED bit > > > > > > > set for nf->nf_flags) > > > > > > > * garbage collector kicks in, nfsd_file_lru_cb() clears > > > > > > > REFERENCED > > > > > > > bit and returns LRU_ROTATE. > > > > > > > * garbage collector kicks in again, nfsd_file_lru_cb() > > > > > > > now > > > > > > > decrements nf->nf_ref to 0, runs nfsd_file_unhash(), > > > > > > > removes > > > > > > > it from > > > > > > > the LRU and adds to the dispose list > > > > > > > [list_lru_isolate_move(lru, > > > > > > > &nf->nf_lru, head);] > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is > > > > > > > cleared, > > > > > > > so it > > > > > > > tries to remove the 'nf' from the LRU [if > > > > > > > (!nfsd_file_lru_remove(nf))] > > > > > > > * The 'nf' has been added to the 'dispose' list by > > > > > > > nfsd_file_lru_cb(), so nfsd_file_lru_remove(nf) simply > > > > > > > treats > > > > > > > it as > > > > > > > part of the LRU and removes it, which leads it to be > > > > > > > removed > > > > > > > from the > > > > > > > 'dispose' list. > > > > > > > * At this moment, nf->nf_ref is 0, it's unhashed, and > > > > > > > not > > > > > > > on the > > > > > > > LRU. nfsd_file_put() continues its execution [if > > > > > > > (refcount_dec_and_test(&nf->nf_ref))], as nf->nf_ref is > > > > > > > already 0, now > > > > > > > bad thing happens: nf->nf_ref is set to > > > > > > > REFCOUNT_SATURATED, > > > > > > > and the > > > > > > > 'nf' is leaked. > > > > > > > > > > > > > > To make this happen, the right timing is crucial. It can > > > > > > > be > > > > > > > reproduced > > > > > > > by adding artifical delays in filecache.c, or hammering > > > > > > > the > > > > > > > nfsd with > > > > > > > tons of ops. > > > > > > > > > > > > > > - Let's see how nfsd_file_put() can race with > > > > > > > nfsd_file_cond_queue(): > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to > > > > > > > do > > > > > > > nfsd_file_lru_add(). > > > > > > > * nfsd_file_lru_add() sets REFERENCED bit and returns > > > > > > > true. > > > > > > > * 'exportfs -f' or something like that triggers > > > > > > > __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). > > > > > > > * In nfsd_file_cond_queue(), it runs [if > > > > > > > (!nfsd_file_unhash(nf))], > > > > > > > unhash is done successfully. > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > (!nfsd_file_get(nf))], > > > > > > > now > > > > > > > nf->nf_ref goes to 2. > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > (nfsd_file_lru_remove(nf))], it succeeds. > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > (refcount_sub_and_test(decrement, > > > > > > > &nf->nf_ref))] (with "decrement" being 2), so the nf- > > > > > > > >nf_ref > > > > > > > goes to > > > > > > > 0, the 'nf' is added to the dispost list [list_add(&nf- > > > > > > > > nf_lru, > > > > > > > dispose)] > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is > > > > > > > cleared, > > > > > > > so it > > > > > > > tries to remove the 'nf' from the LRU [if > > > > > > > (!nfsd_file_lru_remove(nf))], although the 'nf' is not in > > > > > > > the > > > > > > > LRU, but > > > > > > > it is linked in the 'dispose' list, > > > > > > > nfsd_file_lru_remove() > > > > > > > simply > > > > > > > treats it as part of the LRU and removes it. This leads > > > > > > > to > > > > > > > its removal > > > > > > > from the 'dispose' list! > > > > > > > * Now nf->ref is 0, unhashed. nfsd_file_put() continues > > > > > > > its > > > > > > > execution and sets nf->nf_ref to REFCOUNT_SATURATED. > > > > > > > > > > > > > > The purpose of nf->nf_lru is problematic. As you can see, > > > > > > > it > > > > > > > is used > > > > > > > for the LRU list, and also the 'dispose' list. Adding > > > > > > > another > > > > > > > 'struct > > > > > > > list_head' specifically for the 'dispose' list seems to > > > > > > > be a > > > > > > > better > > > > > > > way of fixing this race condition. Either way works for > > > > > > > me. > > > > > > > > > > > > > > Would you agree my above analysis makes sense? Thanks. > > > > > > > > > > > > > > > > > > > I think so. It's been a while since I've done much work in > > > > > > this > > > > > > code, > > > > > > but it does sound like there is a race in the LRU handling. > > > > > > > > > > > > > > > > > > Like Chuck said, the nf->nf_lru list should be safe to use > > > > > > for > > > > > > multiple > > > > > > purposes, but that's only the case if we're not using that > > > > > > list > > > > > > as an > > > > > > indicator. > > > > > > > > > > > > The list_lru code does check this: > > > > > > > > > > > > if (!list_empty(item)) { > > > > > > > > > > > > ...so if we ever check this while it's sitting on the > > > > > > dispose > > > > > > list, it > > > > > > will handle it incorrectly. It sounds like that's the root > > > > > > cause of the > > > > > > problem you're seeing? > > > > > > > > > > > > If so, then maybe a separate list_head for disposal would > > > > > > be > > > > > > better. > > > > > > > > > > > > > Here is my patch with signed-off-by: > > > > > > > > > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > > > > > Date: Mon, 1 Jul 2024 06:45:22 -0400 > > > > > > > Subject: [PATCH] nfsd: fix nfsd_file leaking due to race > > > > > > > condition and early > > > > > > > unhash > > > > > > > > > > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > > > > > --- > > > > > > > fs/nfsd/filecache.c | 14 +++++++++++++- > > > > > > > 1 file changed, 13 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf) > > > > > > > if (!nfsd_file_lru_remove(nf)) > > > > > > > return; > > > > > > > } > > > > > > > + /* > > > > > > > + * Racing with nfsd_file_cond_queue() or > > > > > > > nfsd_file_lru_cb(), > > > > > > > + * it's unhashed but then removed from > > > > > > > the > > > > > > > dispose list, > > > > > > > + * so we need to free it. > > > > > > > + */ > > > > > > > + if (refcount_read(&nf->nf_ref) == 0 && > > > > > > > > > > > > A refcount_read in this path is a red flag to me. Anytime > > > > > > you're just > > > > > > looking at the refcount without changing anything just > > > > > > screams > > > > > > out > > > > > > "race condition". > > > > > > > > > > > > In this case, what guarantee is there that this won't run > > > > > > afoul > > > > > > of the > > > > > > timing? We could check this and find out it's 1 just before > > > > > > it > > > > > > goes to > > > > > > 0 and you check the other conditions. > > > > > > > > > > > > Does anything prevent that? > > > > > > > > > > > > > + !test_bit(NFSD_FILE_HASHED, &nf- > > > > > > > > nf_flags) && > > > > > > > + list_empty(&nf->nf_lru)) { > > > > > > > + nfsd_file_free(nf); > > > > > > > + return; > > > > > > > + } > > > > > > > } > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > > nfsd_file_free(nf); > > > > > > > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > > > *nf, struct > > > > > > > list_head *dispose) > > > > > > > int decrement = 1; > > > > > > > > > > > > > > /* If we raced with someone else unhashing, > > > > > > > ignore it > > > > > > > */ > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > > > > > > return; > > > > > > > > > > > > Same here: you're just testing for the HASHED bit, but > > > > > > could > > > > > > this also > > > > > > race with someone who is setting it just after you get > > > > > > here. > > > > > > Why is > > > > > > that not a problem? > > > > > > > > > > > > > > > > > > > > /* If we can't get a reference, ignore it */ > > > > > > > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > > > *nf, struct > > > > > > > list_head *dispose) > > > > > > > /* If refcount goes to 0, then put on the dispose > > > > > > > list */ > > > > > > > if (refcount_sub_and_test(decrement, &nf- > > > > > > > >nf_ref)) { > > > > > > > list_add(&nf->nf_lru, dispose); > > > > > > > + nfsd_file_unhash(nf); > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > } > > > > > > > } > > > > > > > -- > > > > > > > 2.43.0 > > > > > > > > > > > > > > On Wed, Jul 3, 2024 at 2:21 PM Chuck Lever > > > > > > > <chuck.lever@oracle.com> wrote: > > > > > > > > > > > > > > > > On Wed, Jul 03, 2024 at 10:12:33AM -0400, Youzhong Yang > > > > > > > > wrote: > > > > > > > > > Hello, > > > > > > > > > > > > > > > > > > I'd like to report a nfsd_file leaking issue and > > > > > > > > > propose > > > > > > > > > a fix for it. > > > > > > > > > > > > > > > > > > When I tested Linux kernel 6.8 and 6.6, I noticed > > > > > > > > > nfsd_file leaks > > > > > > > > > which led to undestroyable file systems (zfs), > > > > > > > > > > > > > > > > Thanks for the report. Some initial comments: > > > > > > > > > > > > > > > > - Do you have a specific reproducer? In other words, > > > > > > > > what > > > > > > > > is the > > > > > > > > simplest program that can run on an NFS client that > > > > > > > > will > > > > > > > > trigger > > > > > > > > this leak, and can you post it? > > > > > > > > > > > > > > > > - "zfs" is an out-of-tree file system, so it's not > > > > > > > > directly > > > > > > > > supported for NFSD. > > > > > > > > > > > > > > > > - The guidelines for patch submission require us to fix > > > > > > > > issues in > > > > > > > > upstream Linux first (currently that's v6.10-rc6). > > > > > > > > Then > > > > > > > > that fix > > > > > > > > can be backported to older stable kernels like 6.6. > > > > > > > > > > > > > > > > Can you reproduce the leak with one of the in-kernel > > > > > > > > filesystems > > > > > > > > (either xfs or btrfs would be great) and with NFSD in > > > > > > > > 6.10- > > > > > > > > rc6? > > > > > > > > > > > > > > > > One more comment below. > > > > > > > > > > > > > > > > > > > > > > > > > here are some examples: > > > > > > > > > > > > > > > > > > crash> struct nfsd_file -x ffff88e160db0460 > > > > > > > > > struct nfsd_file { > > > > > > > > > nf_rlist = { > > > > > > > > > rhead = { > > > > > > > > > next = 0xffff8921fa2392f1 > > > > > > > > > }, > > > > > > > > > next = 0x0 > > > > > > > > > }, > > > > > > > > > nf_inode = 0xffff8882bc312ef8, > > > > > > > > > nf_file = 0xffff88e2015b1500, > > > > > > > > > nf_cred = 0xffff88e3ab0e7800, > > > > > > > > > nf_net = 0xffffffff83d41600 <init_net>, > > > > > > > > > nf_flags = 0x8, > > > > > > > > > nf_ref = { > > > > > > > > > refs = { > > > > > > > > > counter = 0xc0000000 > > > > > > > > > } > > > > > > > > > }, > > > > > > > > > nf_may = 0x4, > > > > > > > > > nf_mark = 0xffff88e1bddfb320, > > > > > > > > > nf_lru = { > > > > > > > > > next = 0xffff88e160db04a8, > > > > > > > > > prev = 0xffff88e160db04a8 > > > > > > > > > }, > > > > > > > > > nf_rcu = { > > > > > > > > > next = 0x10000000000, > > > > > > > > > func = 0x0 > > > > > > > > > }, > > > > > > > > > nf_birthtime = 0x73d22fc1728 > > > > > > > > > } > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file > > > > > > > > > -x > > > > > > > > > ffff88839a53d850 > > > > > > > > > nf_flags = 0x8, > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > nf_lru = { > > > > > > > > > next = 0xffff88839a53d898, > > > > > > > > > prev = 0xffff88839a53d898 > > > > > > > > > }, > > > > > > > > > nf_file = 0xffff88810ede8700, > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file > > > > > > > > > -x > > > > > > > > > ffff88c32b11e850 > > > > > > > > > nf_flags = 0x8, > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > nf_lru = { > > > > > > > > > next = 0xffff88c32b11e898, > > > > > > > > > prev = 0xffff88c32b11e898 > > > > > > > > > }, > > > > > > > > > nf_file = 0xffff88c20a701c00, > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file > > > > > > > > > -x > > > > > > > > > ffff88e372709700 > > > > > > > > > nf_flags = 0xc, > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > nf_lru = { > > > > > > > > > next = 0xffff88e372709748, > > > > > > > > > prev = 0xffff88e372709748 > > > > > > > > > }, > > > > > > > > > nf_file = 0xffff88e0725e6400, > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file > > > > > > > > > -x > > > > > > > > > ffff8982864944d0 > > > > > > > > > nf_flags = 0xc, > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > nf_lru = { > > > > > > > > > next = 0xffff898286494518, > > > > > > > > > prev = 0xffff898286494518 > > > > > > > > > }, > > > > > > > > > nf_file = 0xffff89803c0ff700, > > > > > > > > > > > > > > > > > > The leak occurs when nfsd_file_put() races with > > > > > > > > > nfsd_file_cond_queue() > > > > > > > > > or nfsd_file_lru_cb(). With the following patch, I > > > > > > > > > haven't observed > > > > > > > > > any leak after a few days heavy nfs load: > > > > > > > > > > > > > > > > Our patch submission guidelines require a Signed-off- > > > > > > > > by: > > > > > > > > line at the end of the patch description. See the "Sign > > > > > > > > your work - > > > > > > > > the Developer's Certificate of Origin" section of > > > > > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.10-rc6 > > > > > > > > > > > > > > > > (Needed here in case your fix is acceptable). > > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c > > > > > > > > > b/fs/nfsd/filecache.c > > > > > > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file > > > > > > > > > *nf) > > > > > > > > > if (!nfsd_file_lru_remove(nf)) > > > > > > > > > return; > > > > > > > > > } > > > > > > > > > + /* > > > > > > > > > + * Racing with nfsd_file_cond_queue() or > > > > > > > > > nfsd_file_lru_cb(), > > > > > > > > > + * it's unhashed but then removed from the dispose > > > > > > > > > list, > > > > > > > > > + * so we need to free it. > > > > > > > > > + */ > > > > > > > > > + if (refcount_read(&nf->nf_ref) == 0 && > > > > > > > > > + !test_bit(NFSD_FILE_HASHED, &nf->nf_flags) && > > > > > > > > > + list_empty(&nf->nf_lru)) { > > > > > > > > > + nfsd_file_free(nf); > > > > > > > > > + return; > > > > > > > > > + } > > > > > > > > > } > > > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct > > > > > > > > > nfsd_file > > > > > > > > > *nf, struct > > > > > > > > > list_head *dispose) > > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > > > /* If we raced with someone else unhashing, ignore > > > > > > > > > it > > > > > > > > > */ > > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > > > > > > > > return; > > > > > > > > > > > > > > > > > > /* If we can't get a reference, ignore it */ > > > > > > > > > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct > > > > > > > > > nfsd_file > > > > > > > > > *nf, struct > > > > > > > > > list_head *dispose) > > > > > > > > > /* If refcount goes to 0, then put on the dispose > > > > > > > > > list > > > > > > > > > */ > > > > > > > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) > > > > > > > > > { > > > > > > > > > list_add(&nf->nf_lru, dispose); > > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > } > > > > > > > > > } > > > > > > > > > > > > > > > > > > Please kindly review the patch and let me know if it > > > > > > > > > makes sense. > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > -Youzhong > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > Chuck Lever > > > > > > > > > > > > > > > > > > > -- > > > > > > Jeff Layton <jlayton@poochiereds.net> > > > > > > > > -- > > > > Jeff Layton <jlayton@poochiereds.net> > > > > > > > > -- > > > > Jeff Layton <jlayton@kernel.org> > > > > -- > > Jeff Layton <jlayton@kernel.org> >
It's not on the LRU: crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_gc ffff898107b95980 nf_flags = 12, nf_ref.refs.counter = 1 nf_lru = { next = 0xffff898107b959c8, prev = 0xffff898107b959c8 }, nf_gc = { next = 0xffff898107b959d8, prev = 0xffff898107b959d8 }, I don't have an easy reproducer now as I am having difficulty generating tons of NFS read/write ops using a limited number of nfs clients. I use our in-house testing farm to test the patch. Thanks. On Tue, Jul 9, 2024 at 3:05 PM Jeff Layton <jlayton@kernel.org> wrote: > > On Tue, 2024-07-09 at 14:37 -0400, Youzhong Yang wrote: > > Thanks Jeff. > > > > Unfortunately the early unhash easily leads to leaks: > > > > crash> kmem -S nfsd_file | grep '\[ffff' | sed -e 's|\[||' -e > > 's|\]||' > > > xargs -i echo struct nfsd_file.nf_flags,nf_ref.refs.counter '{}' > > > /var/tmp/nfsd_files > > crash> !wc -l /var/tmp/nfsd_files > > 19 /var/tmp/nfsd_files > > crash> < /var/tmp/nfsd_files > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88865c778900 > > nf_flags = 8, > > nf_ref.refs.counter = 2 > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88865c778cc0 > > nf_flags = 8, > > nf_ref.refs.counter = 3 > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff8885d5f35e00 > > nf_flags = 8, > > nf_ref.refs.counter = 1 > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88817443e780 > > nf_flags = 8, > > nf_ref.refs.counter = 3 > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88818b3f0600 > > nf_flags = 8, > > nf_ref.refs.counter = 2 > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88a4490f8300 > > nf_flags = 8, > > nf_ref.refs.counter = 1 > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88a0dab183c0 > > nf_flags = 8, > > nf_ref.refs.counter = 40 > > That's a lot of references! Might be interesting to look more closely > at that one, but the refcounts are all over the place, so it really > does look like we just have a refcount leak somewhere. > > > ... > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff89209535f200 > > nf_flags = 8, > > nf_ref.refs.counter = 2 > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff8980e15138c0 > > nf_flags = 8, > > nf_ref.refs.counter = 7 > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff898107b95800 > > nf_flags = 8, > > nf_ref.refs.counter = 3 > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff898107b95980 > > nf_flags = 12, > > The others are NFSD_FILE_GC. This one is also NFSD_FILE_REFERENCED. Are > these objects on the LRU? > > > nf_ref.refs.counter = 1 > > > > nfsd_file_do_acquire() -> nfsd_file_lookup_locked() relies on the > > hash > > table to find the nfsd_file, > > but I am still scratching my head why and how this happens. > > > > FYI, here is the patch I applied for testing: > > > > The above suggests to me that there is a garden-variety refcount leak > somewhere. Whenever some bit of the code takes a reference to an object > (like a nfsd_file), it's implicitly required to eventually put that > reference. Typically that means that the code needs to maintain a > pointer to that object, as it can be unhashed at any time and it can't > rely on finding the same object later. > > Do you have a reproducer for this? > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > index ad9083ca144b..22ebd7fb8639 100644 > > --- a/fs/nfsd/filecache.c > > +++ b/fs/nfsd/filecache.c > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, struct inode > > *inode, unsigned char need, > > return NULL; > > > > INIT_LIST_HEAD(&nf->nf_lru); > > + INIT_LIST_HEAD(&nf->nf_gc); > > nf->nf_birthtime = ktime_get(); > > nf->nf_file = NULL; > > nf->nf_cred = get_current_cred(); > > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct list_head *dispose) > > struct nfsd_file *nf; > > > > while (!list_empty(dispose)) { > > - nf = list_first_entry(dispose, struct nfsd_file, > > nf_lru); > > - list_del_init(&nf->nf_lru); > > + nf = list_first_entry(dispose, struct nfsd_file, > > nf_gc); > > + list_del_init(&nf->nf_gc); > > nfsd_file_free(nf); > > } > > } > > @@ -411,12 +412,12 @@ nfsd_file_dispose_list_delayed(struct list_head > > *dispose) > > { > > while(!list_empty(dispose)) { > > struct nfsd_file *nf = list_first_entry(dispose, > > - struct nfsd_file, > > nf_lru); > > + struct nfsd_file, > > nf_gc); > > struct nfsd_net *nn = net_generic(nf->nf_net, > > nfsd_net_id); > > struct nfsd_fcache_disposal *l = nn->fcache_disposal; > > > > spin_lock(&l->lock); > > - list_move_tail(&nf->nf_lru, &l->freeme); > > + list_move_tail(&nf->nf_gc, &l->freeme); > > spin_unlock(&l->lock); > > svc_wake_up(nn->nfsd_serv); > > } > > @@ -503,7 +504,8 @@ nfsd_file_lru_cb(struct list_head *item, struct > > list_lru_one *lru, > > > > /* Refcount went to zero. Unhash it and queue it to the > > dispose list */ > > nfsd_file_unhash(nf); > > - list_lru_isolate_move(lru, &nf->nf_lru, head); > > + list_lru_isolate(lru, &nf->nf_lru); > > + list_add(&nf->nf_gc, head); > > this_cpu_inc(nfsd_file_evictions); > > trace_nfsd_file_gc_disposed(nf); > > return LRU_REMOVED; > > @@ -578,7 +580,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > > list_head *dispose) > > > > /* If refcount goes to 0, then put on the dispose list */ > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > - list_add(&nf->nf_lru, dispose); > > + list_add(&nf->nf_gc, dispose); > > trace_nfsd_file_closing(nf); > > } > > } > > @@ -654,8 +656,8 @@ nfsd_file_close_inode_sync(struct inode *inode) > > > > nfsd_file_queue_for_close(inode, &dispose); > > while (!list_empty(&dispose)) { > > - nf = list_first_entry(&dispose, struct nfsd_file, > > nf_lru); > > - list_del_init(&nf->nf_lru); > > + nf = list_first_entry(&dispose, struct nfsd_file, > > nf_gc); > > + list_del_init(&nf->nf_gc); > > nfsd_file_free(nf); > > } > > } > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h > > index c61884def906..3fbec24eea6c 100644 > > --- a/fs/nfsd/filecache.h > > +++ b/fs/nfsd/filecache.h > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > struct nfsd_file_mark *nf_mark; > > struct list_head nf_lru; > > + struct list_head nf_gc; > > struct rcu_head nf_rcu; > > ktime_t nf_birthtime; > > }; > > > > On Mon, Jul 8, 2024 at 10:55 AM Jeff Layton <jlayton@kernel.org> > > wrote: > > > > > > On Mon, 2024-07-08 at 10:23 -0400, Youzhong Yang wrote: > > > > Thanks Jeff. > > > > > > > > I am ok with reverting the unhash/dispose list reordering in > > > > nfsd_file_lru_cb(), as it doesn't make much of a difference, > > > > but for nfsd_file_cond_queue(), imagining this: > > > > > > > > - A nfsd_file is hashed > > > > - In nfsd_file_cond_queue(), [if (!nfsd_file_unhash(nf))] will > > > > get it > > > > unhashed, doesn't it? > > > > - It continues to get a reference by nfsd_file_get() > > > > - It continues to remove itself from LRU by > > > > nfsd_file_lru_remove() if > > > > it is on the LRU. > > > > - Now it runs refcount_sub_and_test(), what happens if the refcnt > > > > does > > > > not go to 0? How can this nfsd_file be found again? Through the > > > > hash > > > > table? Through the LRU walk? how? > > > > > > > > Thanks again. > > > > > > > > -Youzhong > > > > > > > > > > It won't need to be found again. The holders of the extra > > > references > > > will put those references when they are finished. Since the object > > > is > > > no longer HASHED, nfsd_file_put just does this: > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > nfsd_file_free(nf); > > > > > > So that should be fine. > > > > > > > On Mon, Jul 8, 2024 at 9:35 AM Jeff Layton <jlayton@kernel.org> > > > > wrote: > > > > > > > > > > On Mon, 2024-07-08 at 08:58 -0400, Youzhong Yang wrote: > > > > > > Thank you Jeff for your invaluable insights. I was leaning > > > > > > towards > > > > > > adding a new list_head too, and tested this approach on > > > > > > kernel > > > > > > 6.6 by > > > > > > continuously hammering the server with heavy nfs load for the > > > > > > last few > > > > > > days, not a single leak. > > > > > > > > > > > > Here goes the patch (based on Linux kernel master branch), > > > > > > please > > > > > > review: > > > > > > > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > > > > Date: Thu, 4 Jul 2024 11:25:40 -0400 > > > > > > Subject: [PATCH] nfsd: fix nfsd_file leaking due to mixed use > > > > > > of > > > > > > nf->nf_lru > > > > > > > > > > > > nfsd_file_put() in one thread can race with another thread > > > > > > doing > > > > > > garbage collection (running nfsd_file_gc() -> list_lru_walk() > > > > > > -> > > > > > > nfsd_file_lru_cb()): > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do > > > > > > nfsd_file_lru_add(). > > > > > > * nfsd_file_lru_add() returns true (with > > > > > > NFSD_FILE_REFERENCED > > > > > > bit set) > > > > > > * garbage collector kicks in, nfsd_file_lru_cb() clears > > > > > > REFERENCED bit and > > > > > > returns LRU_ROTATE. > > > > > > * garbage collector kicks in again, nfsd_file_lru_cb() now > > > > > > decrements nf->nf_ref > > > > > > to 0, runs nfsd_file_unhash(), removes it from the LRU > > > > > > and > > > > > > adds to > > > > > > the dispose > > > > > > list [list_lru_isolate_move(lru, &nf->nf_lru, head)] > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, > > > > > > so > > > > > > it > > > > > > tries to remove > > > > > > the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))]. > > > > > > The > > > > > > 'nf' > > > > > > has been added > > > > > > to the 'dispose' list by nfsd_file_lru_cb(), so > > > > > > nfsd_file_lru_remove(nf) simply > > > > > > treats it as part of the LRU and removes it, which leads > > > > > > to > > > > > > its removal from > > > > > > the 'dispose' list. > > > > > > * At this moment, 'nf' is unhashed with its nf_ref being 0, > > > > > > and > > > > > > not > > > > > > on the LRU. > > > > > > nfsd_file_put() continues its execution [if > > > > > > (refcount_dec_and_test(&nf->nf_ref))], > > > > > > as nf->nf_ref is already 0, nf->nf_ref is set to > > > > > > REFCOUNT_SATURATED, and the 'nf' > > > > > > gets no chance of being freed. > > > > > > > > > > > > nfsd_file_put() can also race with nfsd_file_cond_queue(): > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do > > > > > > nfsd_file_lru_add(). > > > > > > * nfsd_file_lru_add() sets REFERENCED bit and returns true. > > > > > > * Some userland application runs 'exportfs -f' or something > > > > > > like > > > > > > that, which triggers > > > > > > __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). > > > > > > * In nfsd_file_cond_queue(), it runs [if > > > > > > (!nfsd_file_unhash(nf))], > > > > > > unhash is done > > > > > > successfully. > > > > > > * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], > > > > > > now > > > > > > nf->nf_ref goes to 2. > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > (nfsd_file_lru_remove(nf))], > > > > > > it succeeds. > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > (refcount_sub_and_test(decrement, > > > > > > &nf->nf_ref))] > > > > > > (with "decrement" being 2), so the nf->nf_ref goes to 0, > > > > > > the > > > > > > 'nf' > > > > > > is added to the > > > > > > dispose list [list_add(&nf->nf_lru, dispose)] > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, > > > > > > so > > > > > > it > > > > > > tries to remove > > > > > > the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))], > > > > > > although > > > > > > the 'nf' is not > > > > > > in the LRU, but it is linked in the 'dispose' list, > > > > > > nfsd_file_lru_remove() simply > > > > > > treats it as part of the LRU and removes it. This leads > > > > > > to > > > > > > its removal from > > > > > > the 'dispose' list! > > > > > > * Now nf->ref is 0, unhashed. nfsd_file_put() continues its > > > > > > execution and set > > > > > > nf->nf_ref to REFCOUNT_SATURATED. > > > > > > > > > > > > As shown in the above analysis, using nf_lru for both the LRU > > > > > > list and > > > > > > dispose list > > > > > > can cause the leaks. This patch adds a new list_head nf_gc in > > > > > > struct > > > > > > nfsd_file, and uses > > > > > > it for the dispose list. It's not expected to have a > > > > > > nfsd_file > > > > > > unhashed but it's not > > > > > > added to the dispose list, so in nfsd_file_cond_queue() and > > > > > > nfsd_file_lru_cb() nfsd_file > > > > > > is unhashed after being added to the dispose list. > > > > > > > > > > > > > > > > I don't see where we require the object to be either hashed or > > > > > on > > > > > the > > > > > dispose list. I think you probably just want to do a patch > > > > > that > > > > > changes the dispose list to use a dedicated list_head without > > > > > reordering when the these things are unhashed. > > > > > > > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > > > > --- > > > > > > fs/nfsd/filecache.c | 23 ++++++++++++++--------- > > > > > > fs/nfsd/filecache.h | 1 + > > > > > > 2 files changed, 15 insertions(+), 9 deletions(-) > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > > index ad9083ca144b..3aef2ddfce94 100644 > > > > > > --- a/fs/nfsd/filecache.c > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, struct > > > > > > inode > > > > > > *inode, unsigned char need, > > > > > > return NULL; > > > > > > > > > > > > INIT_LIST_HEAD(&nf->nf_lru); > > > > > > + INIT_LIST_HEAD(&nf->nf_gc); > > > > > > nf->nf_birthtime = ktime_get(); > > > > > > nf->nf_file = NULL; > > > > > > nf->nf_cred = get_current_cred(); > > > > > > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct list_head > > > > > > *dispose) > > > > > > struct nfsd_file *nf; > > > > > > > > > > > > while (!list_empty(dispose)) { > > > > > > - nf = list_first_entry(dispose, struct > > > > > > nfsd_file, > > > > > > nf_lru); > > > > > > - list_del_init(&nf->nf_lru); > > > > > > + nf = list_first_entry(dispose, struct > > > > > > nfsd_file, > > > > > > nf_gc); > > > > > > + list_del_init(&nf->nf_gc); > > > > > > nfsd_file_free(nf); > > > > > > } > > > > > > } > > > > > > @@ -411,12 +412,12 @@ nfsd_file_dispose_list_delayed(struct > > > > > > list_head *dispose) > > > > > > { > > > > > > while(!list_empty(dispose)) { > > > > > > struct nfsd_file *nf = > > > > > > list_first_entry(dispose, > > > > > > - struct > > > > > > nfsd_file, > > > > > > nf_lru); > > > > > > + struct > > > > > > nfsd_file, > > > > > > nf_gc); > > > > > > struct nfsd_net *nn = net_generic(nf->nf_net, > > > > > > nfsd_net_id); > > > > > > struct nfsd_fcache_disposal *l = nn- > > > > > > > fcache_disposal; > > > > > > > > > > > > spin_lock(&l->lock); > > > > > > - list_move_tail(&nf->nf_lru, &l->freeme); > > > > > > + list_move_tail(&nf->nf_gc, &l->freeme); > > > > > > spin_unlock(&l->lock); > > > > > > svc_wake_up(nn->nfsd_serv); > > > > > > } > > > > > > @@ -502,8 +503,10 @@ nfsd_file_lru_cb(struct list_head *item, > > > > > > struct > > > > > > list_lru_one *lru, > > > > > > } > > > > > > > > > > > > /* Refcount went to zero. Unhash it and queue it to > > > > > > the > > > > > > dispose list */ > > > > > > + list_lru_isolate(lru, &nf->nf_lru); > > > > > > + list_add(&nf->nf_gc, head); > > > > > > + /* Unhash after removing from LRU and adding to > > > > > > dispose > > > > > > list */ > > > > > > nfsd_file_unhash(nf); > > > > > > - list_lru_isolate_move(lru, &nf->nf_lru, head); > > > > > > > > > > I don't see the point in reordering these operations. Hashing > > > > > is > > > > > all > > > > > about making the thing findable by nfsd operations. The _last_ > > > > > thing we > > > > > want to do is put it on the dispose list while the thing can > > > > > still > > > > > be > > > > > found by nfsd threads doing operations. > > > > > > > > > > > this_cpu_inc(nfsd_file_evictions); > > > > > > trace_nfsd_file_gc_disposed(nf); > > > > > > return LRU_REMOVED; > > > > > > @@ -565,7 +568,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > > *nf, > > > > > > struct > > > > > > list_head *dispose) > > > > > > int decrement = 1; > > > > > > > > > > > > /* If we raced with someone else unhashing, ignore it > > > > > > */ > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > > > > > > > > > The above change looks wrong. I don't think we need to change > > > > > this. > > > > > > > > > > > return; > > > > > > > > > > > > /* If we can't get a reference, ignore it */ > > > > > > @@ -578,7 +581,9 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > > *nf, > > > > > > struct > > > > > > list_head *dispose) > > > > > > > > > > > > /* If refcount goes to 0, then put on the dispose > > > > > > list */ > > > > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > > > > - list_add(&nf->nf_lru, dispose); > > > > > > + list_add(&nf->nf_gc, dispose); > > > > > > + /* Unhash after adding to dispose list */ > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > > This too looks wrong? Maybe I'm unclear on the race you're > > > > > trying > > > > > to > > > > > fix with this? What's the harm in unhashing it early? > > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > } > > > > > > } > > > > > > @@ -654,8 +659,8 @@ nfsd_file_close_inode_sync(struct inode > > > > > > *inode) > > > > > > > > > > > > nfsd_file_queue_for_close(inode, &dispose); > > > > > > while (!list_empty(&dispose)) { > > > > > > - nf = list_first_entry(&dispose, struct > > > > > > nfsd_file, > > > > > > nf_lru); > > > > > > - list_del_init(&nf->nf_lru); > > > > > > + nf = list_first_entry(&dispose, struct > > > > > > nfsd_file, > > > > > > nf_gc); > > > > > > + list_del_init(&nf->nf_gc); > > > > > > nfsd_file_free(nf); > > > > > > } > > > > > > } > > > > > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h > > > > > > index c61884def906..3fbec24eea6c 100644 > > > > > > --- a/fs/nfsd/filecache.h > > > > > > +++ b/fs/nfsd/filecache.h > > > > > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > > > > > > > > > struct nfsd_file_mark *nf_mark; > > > > > > struct list_head nf_lru; > > > > > > + struct list_head nf_gc; > > > > > > struct rcu_head nf_rcu; > > > > > > ktime_t nf_birthtime; > > > > > > }; > > > > > > -- > > > > > > 2.34.1 > > > > > > > > > > > > On Thu, Jul 4, 2024 at 7:14 AM Jeff Layton > > > > > > <jlayton@poochiereds.net> wrote: > > > > > > > > > > > > > > On Wed, 2024-07-03 at 16:46 -0400, Youzhong Yang wrote: > > > > > > > > Thank you Chuck. Here are my quick answers to your > > > > > > > > comments: > > > > > > > > > > > > > > > > - I don't have a quick reproducer. I reproduced it by > > > > > > > > using > > > > > > > > hundreds > > > > > > > > of nfs clients generating +600K ops under our workload in > > > > > > > > the > > > > > > > > testing > > > > > > > > environment. Theoretically it should be possible to > > > > > > > > simplify > > > > > > > > the > > > > > > > > reproduction but I am still working on it. > > > > > > > > > > > > > > > > - I understand zfs is an out-of-tree file system. That's > > > > > > > > fine. But > > > > > > > > this leaking can happen to any file system, and leaking > > > > > > > > is > > > > > > > > not a good > > > > > > > > thing no matter what file system it is. > > > > > > > > > > > > > > > > - I will try to come up with a reproducer using xfs or > > > > > > > > btrfs > > > > > > > > if possible. > > > > > > > > > > > > > > > > Now back to the problem itself, here are my findings: > > > > > > > > > > > > > > > > - nfsd_file_put() in one thread can race with another > > > > > > > > thread > > > > > > > > doing > > > > > > > > garbage collection (running nfsd_file_gc() -> > > > > > > > > list_lru_walk() > > > > > > > > -> > > > > > > > > nfsd_file_lru_cb()): > > > > > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to > > > > > > > > do > > > > > > > > nfsd_file_lru_add(). > > > > > > > > * nfsd_file_lru_add() returns true (thus > > > > > > > > NFSD_FILE_REFERENCED bit > > > > > > > > set for nf->nf_flags) > > > > > > > > * garbage collector kicks in, nfsd_file_lru_cb() clears > > > > > > > > REFERENCED > > > > > > > > bit and returns LRU_ROTATE. > > > > > > > > * garbage collector kicks in again, nfsd_file_lru_cb() > > > > > > > > now > > > > > > > > decrements nf->nf_ref to 0, runs nfsd_file_unhash(), > > > > > > > > removes > > > > > > > > it from > > > > > > > > the LRU and adds to the dispose list > > > > > > > > [list_lru_isolate_move(lru, > > > > > > > > &nf->nf_lru, head);] > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is > > > > > > > > cleared, > > > > > > > > so it > > > > > > > > tries to remove the 'nf' from the LRU [if > > > > > > > > (!nfsd_file_lru_remove(nf))] > > > > > > > > * The 'nf' has been added to the 'dispose' list by > > > > > > > > nfsd_file_lru_cb(), so nfsd_file_lru_remove(nf) simply > > > > > > > > treats > > > > > > > > it as > > > > > > > > part of the LRU and removes it, which leads it to be > > > > > > > > removed > > > > > > > > from the > > > > > > > > 'dispose' list. > > > > > > > > * At this moment, nf->nf_ref is 0, it's unhashed, and > > > > > > > > not > > > > > > > > on the > > > > > > > > LRU. nfsd_file_put() continues its execution [if > > > > > > > > (refcount_dec_and_test(&nf->nf_ref))], as nf->nf_ref is > > > > > > > > already 0, now > > > > > > > > bad thing happens: nf->nf_ref is set to > > > > > > > > REFCOUNT_SATURATED, > > > > > > > > and the > > > > > > > > 'nf' is leaked. > > > > > > > > > > > > > > > > To make this happen, the right timing is crucial. It can > > > > > > > > be > > > > > > > > reproduced > > > > > > > > by adding artifical delays in filecache.c, or hammering > > > > > > > > the > > > > > > > > nfsd with > > > > > > > > tons of ops. > > > > > > > > > > > > > > > > - Let's see how nfsd_file_put() can race with > > > > > > > > nfsd_file_cond_queue(): > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to > > > > > > > > do > > > > > > > > nfsd_file_lru_add(). > > > > > > > > * nfsd_file_lru_add() sets REFERENCED bit and returns > > > > > > > > true. > > > > > > > > * 'exportfs -f' or something like that triggers > > > > > > > > __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). > > > > > > > > * In nfsd_file_cond_queue(), it runs [if > > > > > > > > (!nfsd_file_unhash(nf))], > > > > > > > > unhash is done successfully. > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > (!nfsd_file_get(nf))], > > > > > > > > now > > > > > > > > nf->nf_ref goes to 2. > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > (nfsd_file_lru_remove(nf))], it succeeds. > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > (refcount_sub_and_test(decrement, > > > > > > > > &nf->nf_ref))] (with "decrement" being 2), so the nf- > > > > > > > > >nf_ref > > > > > > > > goes to > > > > > > > > 0, the 'nf' is added to the dispost list [list_add(&nf- > > > > > > > > > nf_lru, > > > > > > > > dispose)] > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is > > > > > > > > cleared, > > > > > > > > so it > > > > > > > > tries to remove the 'nf' from the LRU [if > > > > > > > > (!nfsd_file_lru_remove(nf))], although the 'nf' is not in > > > > > > > > the > > > > > > > > LRU, but > > > > > > > > it is linked in the 'dispose' list, > > > > > > > > nfsd_file_lru_remove() > > > > > > > > simply > > > > > > > > treats it as part of the LRU and removes it. This leads > > > > > > > > to > > > > > > > > its removal > > > > > > > > from the 'dispose' list! > > > > > > > > * Now nf->ref is 0, unhashed. nfsd_file_put() continues > > > > > > > > its > > > > > > > > execution and sets nf->nf_ref to REFCOUNT_SATURATED. > > > > > > > > > > > > > > > > The purpose of nf->nf_lru is problematic. As you can see, > > > > > > > > it > > > > > > > > is used > > > > > > > > for the LRU list, and also the 'dispose' list. Adding > > > > > > > > another > > > > > > > > 'struct > > > > > > > > list_head' specifically for the 'dispose' list seems to > > > > > > > > be a > > > > > > > > better > > > > > > > > way of fixing this race condition. Either way works for > > > > > > > > me. > > > > > > > > > > > > > > > > Would you agree my above analysis makes sense? Thanks. > > > > > > > > > > > > > > > > > > > > > > I think so. It's been a while since I've done much work in > > > > > > > this > > > > > > > code, > > > > > > > but it does sound like there is a race in the LRU handling. > > > > > > > > > > > > > > > > > > > > > Like Chuck said, the nf->nf_lru list should be safe to use > > > > > > > for > > > > > > > multiple > > > > > > > purposes, but that's only the case if we're not using that > > > > > > > list > > > > > > > as an > > > > > > > indicator. > > > > > > > > > > > > > > The list_lru code does check this: > > > > > > > > > > > > > > if (!list_empty(item)) { > > > > > > > > > > > > > > ...so if we ever check this while it's sitting on the > > > > > > > dispose > > > > > > > list, it > > > > > > > will handle it incorrectly. It sounds like that's the root > > > > > > > cause of the > > > > > > > problem you're seeing? > > > > > > > > > > > > > > If so, then maybe a separate list_head for disposal would > > > > > > > be > > > > > > > better. > > > > > > > > > > > > > > > Here is my patch with signed-off-by: > > > > > > > > > > > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > > > > > > Date: Mon, 1 Jul 2024 06:45:22 -0400 > > > > > > > > Subject: [PATCH] nfsd: fix nfsd_file leaking due to race > > > > > > > > condition and early > > > > > > > > unhash > > > > > > > > > > > > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > > > > > > --- > > > > > > > > fs/nfsd/filecache.c | 14 +++++++++++++- > > > > > > > > 1 file changed, 13 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf) > > > > > > > > if (!nfsd_file_lru_remove(nf)) > > > > > > > > return; > > > > > > > > } > > > > > > > > + /* > > > > > > > > + * Racing with nfsd_file_cond_queue() or > > > > > > > > nfsd_file_lru_cb(), > > > > > > > > + * it's unhashed but then removed from > > > > > > > > the > > > > > > > > dispose list, > > > > > > > > + * so we need to free it. > > > > > > > > + */ > > > > > > > > + if (refcount_read(&nf->nf_ref) == 0 && > > > > > > > > > > > > > > A refcount_read in this path is a red flag to me. Anytime > > > > > > > you're just > > > > > > > looking at the refcount without changing anything just > > > > > > > screams > > > > > > > out > > > > > > > "race condition". > > > > > > > > > > > > > > In this case, what guarantee is there that this won't run > > > > > > > afoul > > > > > > > of the > > > > > > > timing? We could check this and find out it's 1 just before > > > > > > > it > > > > > > > goes to > > > > > > > 0 and you check the other conditions. > > > > > > > > > > > > > > Does anything prevent that? > > > > > > > > > > > > > > > + !test_bit(NFSD_FILE_HASHED, &nf- > > > > > > > > > nf_flags) && > > > > > > > > + list_empty(&nf->nf_lru)) { > > > > > > > > + nfsd_file_free(nf); > > > > > > > > + return; > > > > > > > > + } > > > > > > > > } > > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > > > nfsd_file_free(nf); > > > > > > > > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > > > > *nf, struct > > > > > > > > list_head *dispose) > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > /* If we raced with someone else unhashing, > > > > > > > > ignore it > > > > > > > > */ > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > > > > > > > return; > > > > > > > > > > > > > > Same here: you're just testing for the HASHED bit, but > > > > > > > could > > > > > > > this also > > > > > > > race with someone who is setting it just after you get > > > > > > > here. > > > > > > > Why is > > > > > > > that not a problem? > > > > > > > > > > > > > > > > > > > > > > > /* If we can't get a reference, ignore it */ > > > > > > > > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > > > > *nf, struct > > > > > > > > list_head *dispose) > > > > > > > > /* If refcount goes to 0, then put on the dispose > > > > > > > > list */ > > > > > > > > if (refcount_sub_and_test(decrement, &nf- > > > > > > > > >nf_ref)) { > > > > > > > > list_add(&nf->nf_lru, dispose); > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > } > > > > > > > > } > > > > > > > > -- > > > > > > > > 2.43.0 > > > > > > > > > > > > > > > > On Wed, Jul 3, 2024 at 2:21 PM Chuck Lever > > > > > > > > <chuck.lever@oracle.com> wrote: > > > > > > > > > > > > > > > > > > On Wed, Jul 03, 2024 at 10:12:33AM -0400, Youzhong Yang > > > > > > > > > wrote: > > > > > > > > > > Hello, > > > > > > > > > > > > > > > > > > > > I'd like to report a nfsd_file leaking issue and > > > > > > > > > > propose > > > > > > > > > > a fix for it. > > > > > > > > > > > > > > > > > > > > When I tested Linux kernel 6.8 and 6.6, I noticed > > > > > > > > > > nfsd_file leaks > > > > > > > > > > which led to undestroyable file systems (zfs), > > > > > > > > > > > > > > > > > > Thanks for the report. Some initial comments: > > > > > > > > > > > > > > > > > > - Do you have a specific reproducer? In other words, > > > > > > > > > what > > > > > > > > > is the > > > > > > > > > simplest program that can run on an NFS client that > > > > > > > > > will > > > > > > > > > trigger > > > > > > > > > this leak, and can you post it? > > > > > > > > > > > > > > > > > > - "zfs" is an out-of-tree file system, so it's not > > > > > > > > > directly > > > > > > > > > supported for NFSD. > > > > > > > > > > > > > > > > > > - The guidelines for patch submission require us to fix > > > > > > > > > issues in > > > > > > > > > upstream Linux first (currently that's v6.10-rc6). > > > > > > > > > Then > > > > > > > > > that fix > > > > > > > > > can be backported to older stable kernels like 6.6. > > > > > > > > > > > > > > > > > > Can you reproduce the leak with one of the in-kernel > > > > > > > > > filesystems > > > > > > > > > (either xfs or btrfs would be great) and with NFSD in > > > > > > > > > 6.10- > > > > > > > > > rc6? > > > > > > > > > > > > > > > > > > One more comment below. > > > > > > > > > > > > > > > > > > > > > > > > > > > > here are some examples: > > > > > > > > > > > > > > > > > > > > crash> struct nfsd_file -x ffff88e160db0460 > > > > > > > > > > struct nfsd_file { > > > > > > > > > > nf_rlist = { > > > > > > > > > > rhead = { > > > > > > > > > > next = 0xffff8921fa2392f1 > > > > > > > > > > }, > > > > > > > > > > next = 0x0 > > > > > > > > > > }, > > > > > > > > > > nf_inode = 0xffff8882bc312ef8, > > > > > > > > > > nf_file = 0xffff88e2015b1500, > > > > > > > > > > nf_cred = 0xffff88e3ab0e7800, > > > > > > > > > > nf_net = 0xffffffff83d41600 <init_net>, > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > nf_ref = { > > > > > > > > > > refs = { > > > > > > > > > > counter = 0xc0000000 > > > > > > > > > > } > > > > > > > > > > }, > > > > > > > > > > nf_may = 0x4, > > > > > > > > > > nf_mark = 0xffff88e1bddfb320, > > > > > > > > > > nf_lru = { > > > > > > > > > > next = 0xffff88e160db04a8, > > > > > > > > > > prev = 0xffff88e160db04a8 > > > > > > > > > > }, > > > > > > > > > > nf_rcu = { > > > > > > > > > > next = 0x10000000000, > > > > > > > > > > func = 0x0 > > > > > > > > > > }, > > > > > > > > > > nf_birthtime = 0x73d22fc1728 > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file > > > > > > > > > > -x > > > > > > > > > > ffff88839a53d850 > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > nf_lru = { > > > > > > > > > > next = 0xffff88839a53d898, > > > > > > > > > > prev = 0xffff88839a53d898 > > > > > > > > > > }, > > > > > > > > > > nf_file = 0xffff88810ede8700, > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file > > > > > > > > > > -x > > > > > > > > > > ffff88c32b11e850 > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > nf_lru = { > > > > > > > > > > next = 0xffff88c32b11e898, > > > > > > > > > > prev = 0xffff88c32b11e898 > > > > > > > > > > }, > > > > > > > > > > nf_file = 0xffff88c20a701c00, > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file > > > > > > > > > > -x > > > > > > > > > > ffff88e372709700 > > > > > > > > > > nf_flags = 0xc, > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > nf_lru = { > > > > > > > > > > next = 0xffff88e372709748, > > > > > > > > > > prev = 0xffff88e372709748 > > > > > > > > > > }, > > > > > > > > > > nf_file = 0xffff88e0725e6400, > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file > > > > > > > > > > -x > > > > > > > > > > ffff8982864944d0 > > > > > > > > > > nf_flags = 0xc, > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > nf_lru = { > > > > > > > > > > next = 0xffff898286494518, > > > > > > > > > > prev = 0xffff898286494518 > > > > > > > > > > }, > > > > > > > > > > nf_file = 0xffff89803c0ff700, > > > > > > > > > > > > > > > > > > > > The leak occurs when nfsd_file_put() races with > > > > > > > > > > nfsd_file_cond_queue() > > > > > > > > > > or nfsd_file_lru_cb(). With the following patch, I > > > > > > > > > > haven't observed > > > > > > > > > > any leak after a few days heavy nfs load: > > > > > > > > > > > > > > > > > > Our patch submission guidelines require a Signed-off- > > > > > > > > > by: > > > > > > > > > line at the end of the patch description. See the "Sign > > > > > > > > > your work - > > > > > > > > > the Developer's Certificate of Origin" section of > > > > > > > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.10-rc6 > > > > > > > > > > > > > > > > > > (Needed here in case your fix is acceptable). > > > > > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c > > > > > > > > > > b/fs/nfsd/filecache.c > > > > > > > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file > > > > > > > > > > *nf) > > > > > > > > > > if (!nfsd_file_lru_remove(nf)) > > > > > > > > > > return; > > > > > > > > > > } > > > > > > > > > > + /* > > > > > > > > > > + * Racing with nfsd_file_cond_queue() or > > > > > > > > > > nfsd_file_lru_cb(), > > > > > > > > > > + * it's unhashed but then removed from the dispose > > > > > > > > > > list, > > > > > > > > > > + * so we need to free it. > > > > > > > > > > + */ > > > > > > > > > > + if (refcount_read(&nf->nf_ref) == 0 && > > > > > > > > > > + !test_bit(NFSD_FILE_HASHED, &nf->nf_flags) && > > > > > > > > > > + list_empty(&nf->nf_lru)) { > > > > > > > > > > + nfsd_file_free(nf); > > > > > > > > > > + return; > > > > > > > > > > + } > > > > > > > > > > } > > > > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct > > > > > > > > > > nfsd_file > > > > > > > > > > *nf, struct > > > > > > > > > > list_head *dispose) > > > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > > > > > /* If we raced with someone else unhashing, ignore > > > > > > > > > > it > > > > > > > > > > */ > > > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > > > /* If we can't get a reference, ignore it */ > > > > > > > > > > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct > > > > > > > > > > nfsd_file > > > > > > > > > > *nf, struct > > > > > > > > > > list_head *dispose) > > > > > > > > > > /* If refcount goes to 0, then put on the dispose > > > > > > > > > > list > > > > > > > > > > */ > > > > > > > > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) > > > > > > > > > > { > > > > > > > > > > list_add(&nf->nf_lru, dispose); > > > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > > } > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > Please kindly review the patch and let me know if it > > > > > > > > > > makes sense. > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > > > -Youzhong > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > Chuck Lever > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > Jeff Layton <jlayton@poochiereds.net> > > > > > > > > > > -- > > > > > Jeff Layton <jlayton@poochiereds.net> > > > > > > > > > > -- > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > -- > > > Jeff Layton <jlayton@kernel.org> > > > > -- > Jeff Layton <jlayton@kernel.org>
Ok, thanks. I went crawling over the filecache code, and found a couple of potential nfsd_file leaks. I'm testing patches now and will likely send them along later today. I think we do still want the patch to add nf_gc list_head. I'm not certain there is a race there, but it's safer to just use a separate list_head and it doesn't represent a lot of memory. Could you post that one individually so that Chuck can pick it up? Thanks, Jeff On Tue, 2024-07-09 at 15:13 -0400, Youzhong Yang wrote: > It's not on the LRU: > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_gc > ffff898107b95980 > nf_flags = 12, > nf_ref.refs.counter = 1 > nf_lru = { > next = 0xffff898107b959c8, > prev = 0xffff898107b959c8 > }, > nf_gc = { > next = 0xffff898107b959d8, > prev = 0xffff898107b959d8 > }, > > I don't have an easy reproducer now as I am having difficulty > generating tons of NFS read/write ops using a limited number of nfs > clients. I use our in-house testing farm to test the patch. > > Thanks. > > On Tue, Jul 9, 2024 at 3:05 PM Jeff Layton <jlayton@kernel.org> wrote: > > > > On Tue, 2024-07-09 at 14:37 -0400, Youzhong Yang wrote: > > > Thanks Jeff. > > > > > > Unfortunately the early unhash easily leads to leaks: > > > > > > crash> kmem -S nfsd_file | grep '\[ffff' | sed -e 's|\[||' -e > > > 's|\]||' > > > > xargs -i echo struct nfsd_file.nf_flags,nf_ref.refs.counter '{}' > > > > /var/tmp/nfsd_files > > > crash> !wc -l /var/tmp/nfsd_files > > > 19 /var/tmp/nfsd_files > > > crash> < /var/tmp/nfsd_files > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88865c778900 > > > nf_flags = 8, > > > nf_ref.refs.counter = 2 > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88865c778cc0 > > > nf_flags = 8, > > > nf_ref.refs.counter = 3 > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff8885d5f35e00 > > > nf_flags = 8, > > > nf_ref.refs.counter = 1 > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88817443e780 > > > nf_flags = 8, > > > nf_ref.refs.counter = 3 > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88818b3f0600 > > > nf_flags = 8, > > > nf_ref.refs.counter = 2 > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88a4490f8300 > > > nf_flags = 8, > > > nf_ref.refs.counter = 1 > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88a0dab183c0 > > > nf_flags = 8, > > > nf_ref.refs.counter = 40 > > > > That's a lot of references! Might be interesting to look more closely > > at that one, but the refcounts are all over the place, so it really > > does look like we just have a refcount leak somewhere. > > > > > ... > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff89209535f200 > > > nf_flags = 8, > > > nf_ref.refs.counter = 2 > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff8980e15138c0 > > > nf_flags = 8, > > > nf_ref.refs.counter = 7 > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff898107b95800 > > > nf_flags = 8, > > > nf_ref.refs.counter = 3 > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff898107b95980 > > > nf_flags = 12, > > > > The others are NFSD_FILE_GC. This one is also NFSD_FILE_REFERENCED. Are > > these objects on the LRU? > > > > > nf_ref.refs.counter = 1 > > > > > > nfsd_file_do_acquire() -> nfsd_file_lookup_locked() relies on the > > > hash > > > table to find the nfsd_file, > > > but I am still scratching my head why and how this happens. > > > > > > FYI, here is the patch I applied for testing: > > > > > > > The above suggests to me that there is a garden-variety refcount leak > > somewhere. Whenever some bit of the code takes a reference to an object > > (like a nfsd_file), it's implicitly required to eventually put that > > reference. Typically that means that the code needs to maintain a > > pointer to that object, as it can be unhashed at any time and it can't > > rely on finding the same object later. > > > > Do you have a reproducer for this? > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > index ad9083ca144b..22ebd7fb8639 100644 > > > --- a/fs/nfsd/filecache.c > > > +++ b/fs/nfsd/filecache.c > > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, struct inode > > > *inode, unsigned char need, > > > return NULL; > > > > > > INIT_LIST_HEAD(&nf->nf_lru); > > > + INIT_LIST_HEAD(&nf->nf_gc); > > > nf->nf_birthtime = ktime_get(); > > > nf->nf_file = NULL; > > > nf->nf_cred = get_current_cred(); > > > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct list_head *dispose) > > > struct nfsd_file *nf; > > > > > > while (!list_empty(dispose)) { > > > - nf = list_first_entry(dispose, struct nfsd_file, > > > nf_lru); > > > - list_del_init(&nf->nf_lru); > > > + nf = list_first_entry(dispose, struct nfsd_file, > > > nf_gc); > > > + list_del_init(&nf->nf_gc); > > > nfsd_file_free(nf); > > > } > > > } > > > @@ -411,12 +412,12 @@ nfsd_file_dispose_list_delayed(struct list_head > > > *dispose) > > > { > > > while(!list_empty(dispose)) { > > > struct nfsd_file *nf = list_first_entry(dispose, > > > - struct nfsd_file, > > > nf_lru); > > > + struct nfsd_file, > > > nf_gc); > > > struct nfsd_net *nn = net_generic(nf->nf_net, > > > nfsd_net_id); > > > struct nfsd_fcache_disposal *l = nn->fcache_disposal; > > > > > > spin_lock(&l->lock); > > > - list_move_tail(&nf->nf_lru, &l->freeme); > > > + list_move_tail(&nf->nf_gc, &l->freeme); > > > spin_unlock(&l->lock); > > > svc_wake_up(nn->nfsd_serv); > > > } > > > @@ -503,7 +504,8 @@ nfsd_file_lru_cb(struct list_head *item, struct > > > list_lru_one *lru, > > > > > > /* Refcount went to zero. Unhash it and queue it to the > > > dispose list */ > > > nfsd_file_unhash(nf); > > > - list_lru_isolate_move(lru, &nf->nf_lru, head); > > > + list_lru_isolate(lru, &nf->nf_lru); > > > + list_add(&nf->nf_gc, head); > > > this_cpu_inc(nfsd_file_evictions); > > > trace_nfsd_file_gc_disposed(nf); > > > return LRU_REMOVED; > > > @@ -578,7 +580,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > > > list_head *dispose) > > > > > > /* If refcount goes to 0, then put on the dispose list */ > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > - list_add(&nf->nf_lru, dispose); > > > + list_add(&nf->nf_gc, dispose); > > > trace_nfsd_file_closing(nf); > > > } > > > } > > > @@ -654,8 +656,8 @@ nfsd_file_close_inode_sync(struct inode *inode) > > > > > > nfsd_file_queue_for_close(inode, &dispose); > > > while (!list_empty(&dispose)) { > > > - nf = list_first_entry(&dispose, struct nfsd_file, > > > nf_lru); > > > - list_del_init(&nf->nf_lru); > > > + nf = list_first_entry(&dispose, struct nfsd_file, > > > nf_gc); > > > + list_del_init(&nf->nf_gc); > > > nfsd_file_free(nf); > > > } > > > } > > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h > > > index c61884def906..3fbec24eea6c 100644 > > > --- a/fs/nfsd/filecache.h > > > +++ b/fs/nfsd/filecache.h > > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > > > struct nfsd_file_mark *nf_mark; > > > struct list_head nf_lru; > > > + struct list_head nf_gc; > > > struct rcu_head nf_rcu; > > > ktime_t nf_birthtime; > > > }; > > > > > > On Mon, Jul 8, 2024 at 10:55 AM Jeff Layton <jlayton@kernel.org> > > > wrote: > > > > > > > > On Mon, 2024-07-08 at 10:23 -0400, Youzhong Yang wrote: > > > > > Thanks Jeff. > > > > > > > > > > I am ok with reverting the unhash/dispose list reordering in > > > > > nfsd_file_lru_cb(), as it doesn't make much of a difference, > > > > > but for nfsd_file_cond_queue(), imagining this: > > > > > > > > > > - A nfsd_file is hashed > > > > > - In nfsd_file_cond_queue(), [if (!nfsd_file_unhash(nf))] will > > > > > get it > > > > > unhashed, doesn't it? > > > > > - It continues to get a reference by nfsd_file_get() > > > > > - It continues to remove itself from LRU by > > > > > nfsd_file_lru_remove() if > > > > > it is on the LRU. > > > > > - Now it runs refcount_sub_and_test(), what happens if the refcnt > > > > > does > > > > > not go to 0? How can this nfsd_file be found again? Through the > > > > > hash > > > > > table? Through the LRU walk? how? > > > > > > > > > > Thanks again. > > > > > > > > > > -Youzhong > > > > > > > > > > > > > It won't need to be found again. The holders of the extra > > > > references > > > > will put those references when they are finished. Since the object > > > > is > > > > no longer HASHED, nfsd_file_put just does this: > > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > nfsd_file_free(nf); > > > > > > > > So that should be fine. > > > > > > > > > On Mon, Jul 8, 2024 at 9:35 AM Jeff Layton <jlayton@kernel.org> > > > > > wrote: > > > > > > > > > > > > On Mon, 2024-07-08 at 08:58 -0400, Youzhong Yang wrote: > > > > > > > Thank you Jeff for your invaluable insights. I was leaning > > > > > > > towards > > > > > > > adding a new list_head too, and tested this approach on > > > > > > > kernel > > > > > > > 6.6 by > > > > > > > continuously hammering the server with heavy nfs load for the > > > > > > > last few > > > > > > > days, not a single leak. > > > > > > > > > > > > > > Here goes the patch (based on Linux kernel master branch), > > > > > > > please > > > > > > > review: > > > > > > > > > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > > > > > Date: Thu, 4 Jul 2024 11:25:40 -0400 > > > > > > > Subject: [PATCH] nfsd: fix nfsd_file leaking due to mixed use > > > > > > > of > > > > > > > nf->nf_lru > > > > > > > > > > > > > > nfsd_file_put() in one thread can race with another thread > > > > > > > doing > > > > > > > garbage collection (running nfsd_file_gc() -> list_lru_walk() > > > > > > > -> > > > > > > > nfsd_file_lru_cb()): > > > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do > > > > > > > nfsd_file_lru_add(). > > > > > > > * nfsd_file_lru_add() returns true (with > > > > > > > NFSD_FILE_REFERENCED > > > > > > > bit set) > > > > > > > * garbage collector kicks in, nfsd_file_lru_cb() clears > > > > > > > REFERENCED bit and > > > > > > > returns LRU_ROTATE. > > > > > > > * garbage collector kicks in again, nfsd_file_lru_cb() now > > > > > > > decrements nf->nf_ref > > > > > > > to 0, runs nfsd_file_unhash(), removes it from the LRU > > > > > > > and > > > > > > > adds to > > > > > > > the dispose > > > > > > > list [list_lru_isolate_move(lru, &nf->nf_lru, head)] > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, > > > > > > > so > > > > > > > it > > > > > > > tries to remove > > > > > > > the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))]. > > > > > > > The > > > > > > > 'nf' > > > > > > > has been added > > > > > > > to the 'dispose' list by nfsd_file_lru_cb(), so > > > > > > > nfsd_file_lru_remove(nf) simply > > > > > > > treats it as part of the LRU and removes it, which leads > > > > > > > to > > > > > > > its removal from > > > > > > > the 'dispose' list. > > > > > > > * At this moment, 'nf' is unhashed with its nf_ref being 0, > > > > > > > and > > > > > > > not > > > > > > > on the LRU. > > > > > > > nfsd_file_put() continues its execution [if > > > > > > > (refcount_dec_and_test(&nf->nf_ref))], > > > > > > > as nf->nf_ref is already 0, nf->nf_ref is set to > > > > > > > REFCOUNT_SATURATED, and the 'nf' > > > > > > > gets no chance of being freed. > > > > > > > > > > > > > > nfsd_file_put() can also race with nfsd_file_cond_queue(): > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do > > > > > > > nfsd_file_lru_add(). > > > > > > > * nfsd_file_lru_add() sets REFERENCED bit and returns true. > > > > > > > * Some userland application runs 'exportfs -f' or something > > > > > > > like > > > > > > > that, which triggers > > > > > > > __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). > > > > > > > * In nfsd_file_cond_queue(), it runs [if > > > > > > > (!nfsd_file_unhash(nf))], > > > > > > > unhash is done > > > > > > > successfully. > > > > > > > * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], > > > > > > > now > > > > > > > nf->nf_ref goes to 2. > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > (nfsd_file_lru_remove(nf))], > > > > > > > it succeeds. > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > (refcount_sub_and_test(decrement, > > > > > > > &nf->nf_ref))] > > > > > > > (with "decrement" being 2), so the nf->nf_ref goes to 0, > > > > > > > the > > > > > > > 'nf' > > > > > > > is added to the > > > > > > > dispose list [list_add(&nf->nf_lru, dispose)] > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, > > > > > > > so > > > > > > > it > > > > > > > tries to remove > > > > > > > the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))], > > > > > > > although > > > > > > > the 'nf' is not > > > > > > > in the LRU, but it is linked in the 'dispose' list, > > > > > > > nfsd_file_lru_remove() simply > > > > > > > treats it as part of the LRU and removes it. This leads > > > > > > > to > > > > > > > its removal from > > > > > > > the 'dispose' list! > > > > > > > * Now nf->ref is 0, unhashed. nfsd_file_put() continues its > > > > > > > execution and set > > > > > > > nf->nf_ref to REFCOUNT_SATURATED. > > > > > > > > > > > > > > As shown in the above analysis, using nf_lru for both the LRU > > > > > > > list and > > > > > > > dispose list > > > > > > > can cause the leaks. This patch adds a new list_head nf_gc in > > > > > > > struct > > > > > > > nfsd_file, and uses > > > > > > > it for the dispose list. It's not expected to have a > > > > > > > nfsd_file > > > > > > > unhashed but it's not > > > > > > > added to the dispose list, so in nfsd_file_cond_queue() and > > > > > > > nfsd_file_lru_cb() nfsd_file > > > > > > > is unhashed after being added to the dispose list. > > > > > > > > > > > > > > > > > > > I don't see where we require the object to be either hashed or > > > > > > on > > > > > > the > > > > > > dispose list. I think you probably just want to do a patch > > > > > > that > > > > > > changes the dispose list to use a dedicated list_head without > > > > > > reordering when the these things are unhashed. > > > > > > > > > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > > > > > --- > > > > > > > fs/nfsd/filecache.c | 23 ++++++++++++++--------- > > > > > > > fs/nfsd/filecache.h | 1 + > > > > > > > 2 files changed, 15 insertions(+), 9 deletions(-) > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > > > index ad9083ca144b..3aef2ddfce94 100644 > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, struct > > > > > > > inode > > > > > > > *inode, unsigned char need, > > > > > > > return NULL; > > > > > > > > > > > > > > INIT_LIST_HEAD(&nf->nf_lru); > > > > > > > + INIT_LIST_HEAD(&nf->nf_gc); > > > > > > > nf->nf_birthtime = ktime_get(); > > > > > > > nf->nf_file = NULL; > > > > > > > nf->nf_cred = get_current_cred(); > > > > > > > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct list_head > > > > > > > *dispose) > > > > > > > struct nfsd_file *nf; > > > > > > > > > > > > > > while (!list_empty(dispose)) { > > > > > > > - nf = list_first_entry(dispose, struct > > > > > > > nfsd_file, > > > > > > > nf_lru); > > > > > > > - list_del_init(&nf->nf_lru); > > > > > > > + nf = list_first_entry(dispose, struct > > > > > > > nfsd_file, > > > > > > > nf_gc); > > > > > > > + list_del_init(&nf->nf_gc); > > > > > > > nfsd_file_free(nf); > > > > > > > } > > > > > > > } > > > > > > > @@ -411,12 +412,12 @@ nfsd_file_dispose_list_delayed(struct > > > > > > > list_head *dispose) > > > > > > > { > > > > > > > while(!list_empty(dispose)) { > > > > > > > struct nfsd_file *nf = > > > > > > > list_first_entry(dispose, > > > > > > > - struct > > > > > > > nfsd_file, > > > > > > > nf_lru); > > > > > > > + struct > > > > > > > nfsd_file, > > > > > > > nf_gc); > > > > > > > struct nfsd_net *nn = net_generic(nf->nf_net, > > > > > > > nfsd_net_id); > > > > > > > struct nfsd_fcache_disposal *l = nn- > > > > > > > > fcache_disposal; > > > > > > > > > > > > > > spin_lock(&l->lock); > > > > > > > - list_move_tail(&nf->nf_lru, &l->freeme); > > > > > > > + list_move_tail(&nf->nf_gc, &l->freeme); > > > > > > > spin_unlock(&l->lock); > > > > > > > svc_wake_up(nn->nfsd_serv); > > > > > > > } > > > > > > > @@ -502,8 +503,10 @@ nfsd_file_lru_cb(struct list_head *item, > > > > > > > struct > > > > > > > list_lru_one *lru, > > > > > > > } > > > > > > > > > > > > > > /* Refcount went to zero. Unhash it and queue it to > > > > > > > the > > > > > > > dispose list */ > > > > > > > + list_lru_isolate(lru, &nf->nf_lru); > > > > > > > + list_add(&nf->nf_gc, head); > > > > > > > + /* Unhash after removing from LRU and adding to > > > > > > > dispose > > > > > > > list */ > > > > > > > nfsd_file_unhash(nf); > > > > > > > - list_lru_isolate_move(lru, &nf->nf_lru, head); > > > > > > > > > > > > I don't see the point in reordering these operations. Hashing > > > > > > is > > > > > > all > > > > > > about making the thing findable by nfsd operations. The _last_ > > > > > > thing we > > > > > > want to do is put it on the dispose list while the thing can > > > > > > still > > > > > > be > > > > > > found by nfsd threads doing operations. > > > > > > > > > > > > > this_cpu_inc(nfsd_file_evictions); > > > > > > > trace_nfsd_file_gc_disposed(nf); > > > > > > > return LRU_REMOVED; > > > > > > > @@ -565,7 +568,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > > > *nf, > > > > > > > struct > > > > > > > list_head *dispose) > > > > > > > int decrement = 1; > > > > > > > > > > > > > > /* If we raced with someone else unhashing, ignore it > > > > > > > */ > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > > > > > > > > > > > The above change looks wrong. I don't think we need to change > > > > > > this. > > > > > > > > > > > > > return; > > > > > > > > > > > > > > /* If we can't get a reference, ignore it */ > > > > > > > @@ -578,7 +581,9 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > > > *nf, > > > > > > > struct > > > > > > > list_head *dispose) > > > > > > > > > > > > > > /* If refcount goes to 0, then put on the dispose > > > > > > > list */ > > > > > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > > > > > - list_add(&nf->nf_lru, dispose); > > > > > > > + list_add(&nf->nf_gc, dispose); > > > > > > > + /* Unhash after adding to dispose list */ > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > > > > This too looks wrong? Maybe I'm unclear on the race you're > > > > > > trying > > > > > > to > > > > > > fix with this? What's the harm in unhashing it early? > > > > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > } > > > > > > > } > > > > > > > @@ -654,8 +659,8 @@ nfsd_file_close_inode_sync(struct inode > > > > > > > *inode) > > > > > > > > > > > > > > nfsd_file_queue_for_close(inode, &dispose); > > > > > > > while (!list_empty(&dispose)) { > > > > > > > - nf = list_first_entry(&dispose, struct > > > > > > > nfsd_file, > > > > > > > nf_lru); > > > > > > > - list_del_init(&nf->nf_lru); > > > > > > > + nf = list_first_entry(&dispose, struct > > > > > > > nfsd_file, > > > > > > > nf_gc); > > > > > > > + list_del_init(&nf->nf_gc); > > > > > > > nfsd_file_free(nf); > > > > > > > } > > > > > > > } > > > > > > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h > > > > > > > index c61884def906..3fbec24eea6c 100644 > > > > > > > --- a/fs/nfsd/filecache.h > > > > > > > +++ b/fs/nfsd/filecache.h > > > > > > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > > > > > > > > > > > struct nfsd_file_mark *nf_mark; > > > > > > > struct list_head nf_lru; > > > > > > > + struct list_head nf_gc; > > > > > > > struct rcu_head nf_rcu; > > > > > > > ktime_t nf_birthtime; > > > > > > > }; > > > > > > > -- > > > > > > > 2.34.1 > > > > > > > > > > > > > > On Thu, Jul 4, 2024 at 7:14 AM Jeff Layton > > > > > > > <jlayton@poochiereds.net> wrote: > > > > > > > > > > > > > > > > On Wed, 2024-07-03 at 16:46 -0400, Youzhong Yang wrote: > > > > > > > > > Thank you Chuck. Here are my quick answers to your > > > > > > > > > comments: > > > > > > > > > > > > > > > > > > - I don't have a quick reproducer. I reproduced it by > > > > > > > > > using > > > > > > > > > hundreds > > > > > > > > > of nfs clients generating +600K ops under our workload in > > > > > > > > > the > > > > > > > > > testing > > > > > > > > > environment. Theoretically it should be possible to > > > > > > > > > simplify > > > > > > > > > the > > > > > > > > > reproduction but I am still working on it. > > > > > > > > > > > > > > > > > > - I understand zfs is an out-of-tree file system. That's > > > > > > > > > fine. But > > > > > > > > > this leaking can happen to any file system, and leaking > > > > > > > > > is > > > > > > > > > not a good > > > > > > > > > thing no matter what file system it is. > > > > > > > > > > > > > > > > > > - I will try to come up with a reproducer using xfs or > > > > > > > > > btrfs > > > > > > > > > if possible. > > > > > > > > > > > > > > > > > > Now back to the problem itself, here are my findings: > > > > > > > > > > > > > > > > > > - nfsd_file_put() in one thread can race with another > > > > > > > > > thread > > > > > > > > > doing > > > > > > > > > garbage collection (running nfsd_file_gc() -> > > > > > > > > > list_lru_walk() > > > > > > > > > -> > > > > > > > > > nfsd_file_lru_cb()): > > > > > > > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to > > > > > > > > > do > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > * nfsd_file_lru_add() returns true (thus > > > > > > > > > NFSD_FILE_REFERENCED bit > > > > > > > > > set for nf->nf_flags) > > > > > > > > > * garbage collector kicks in, nfsd_file_lru_cb() clears > > > > > > > > > REFERENCED > > > > > > > > > bit and returns LRU_ROTATE. > > > > > > > > > * garbage collector kicks in again, nfsd_file_lru_cb() > > > > > > > > > now > > > > > > > > > decrements nf->nf_ref to 0, runs nfsd_file_unhash(), > > > > > > > > > removes > > > > > > > > > it from > > > > > > > > > the LRU and adds to the dispose list > > > > > > > > > [list_lru_isolate_move(lru, > > > > > > > > > &nf->nf_lru, head);] > > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is > > > > > > > > > cleared, > > > > > > > > > so it > > > > > > > > > tries to remove the 'nf' from the LRU [if > > > > > > > > > (!nfsd_file_lru_remove(nf))] > > > > > > > > > * The 'nf' has been added to the 'dispose' list by > > > > > > > > > nfsd_file_lru_cb(), so nfsd_file_lru_remove(nf) simply > > > > > > > > > treats > > > > > > > > > it as > > > > > > > > > part of the LRU and removes it, which leads it to be > > > > > > > > > removed > > > > > > > > > from the > > > > > > > > > 'dispose' list. > > > > > > > > > * At this moment, nf->nf_ref is 0, it's unhashed, and > > > > > > > > > not > > > > > > > > > on the > > > > > > > > > LRU. nfsd_file_put() continues its execution [if > > > > > > > > > (refcount_dec_and_test(&nf->nf_ref))], as nf->nf_ref is > > > > > > > > > already 0, now > > > > > > > > > bad thing happens: nf->nf_ref is set to > > > > > > > > > REFCOUNT_SATURATED, > > > > > > > > > and the > > > > > > > > > 'nf' is leaked. > > > > > > > > > > > > > > > > > > To make this happen, the right timing is crucial. It can > > > > > > > > > be > > > > > > > > > reproduced > > > > > > > > > by adding artifical delays in filecache.c, or hammering > > > > > > > > > the > > > > > > > > > nfsd with > > > > > > > > > tons of ops. > > > > > > > > > > > > > > > > > > - Let's see how nfsd_file_put() can race with > > > > > > > > > nfsd_file_cond_queue(): > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to > > > > > > > > > do > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > * nfsd_file_lru_add() sets REFERENCED bit and returns > > > > > > > > > true. > > > > > > > > > * 'exportfs -f' or something like that triggers > > > > > > > > > __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). > > > > > > > > > * In nfsd_file_cond_queue(), it runs [if > > > > > > > > > (!nfsd_file_unhash(nf))], > > > > > > > > > unhash is done successfully. > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > (!nfsd_file_get(nf))], > > > > > > > > > now > > > > > > > > > nf->nf_ref goes to 2. > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > (nfsd_file_lru_remove(nf))], it succeeds. > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > (refcount_sub_and_test(decrement, > > > > > > > > > &nf->nf_ref))] (with "decrement" being 2), so the nf- > > > > > > > > > > nf_ref > > > > > > > > > goes to > > > > > > > > > 0, the 'nf' is added to the dispost list [list_add(&nf- > > > > > > > > > > nf_lru, > > > > > > > > > dispose)] > > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is > > > > > > > > > cleared, > > > > > > > > > so it > > > > > > > > > tries to remove the 'nf' from the LRU [if > > > > > > > > > (!nfsd_file_lru_remove(nf))], although the 'nf' is not in > > > > > > > > > the > > > > > > > > > LRU, but > > > > > > > > > it is linked in the 'dispose' list, > > > > > > > > > nfsd_file_lru_remove() > > > > > > > > > simply > > > > > > > > > treats it as part of the LRU and removes it. This leads > > > > > > > > > to > > > > > > > > > its removal > > > > > > > > > from the 'dispose' list! > > > > > > > > > * Now nf->ref is 0, unhashed. nfsd_file_put() continues > > > > > > > > > its > > > > > > > > > execution and sets nf->nf_ref to REFCOUNT_SATURATED. > > > > > > > > > > > > > > > > > > The purpose of nf->nf_lru is problematic. As you can see, > > > > > > > > > it > > > > > > > > > is used > > > > > > > > > for the LRU list, and also the 'dispose' list. Adding > > > > > > > > > another > > > > > > > > > 'struct > > > > > > > > > list_head' specifically for the 'dispose' list seems to > > > > > > > > > be a > > > > > > > > > better > > > > > > > > > way of fixing this race condition. Either way works for > > > > > > > > > me. > > > > > > > > > > > > > > > > > > Would you agree my above analysis makes sense? Thanks. > > > > > > > > > > > > > > > > > > > > > > > > > I think so. It's been a while since I've done much work in > > > > > > > > this > > > > > > > > code, > > > > > > > > but it does sound like there is a race in the LRU handling. > > > > > > > > > > > > > > > > > > > > > > > > Like Chuck said, the nf->nf_lru list should be safe to use > > > > > > > > for > > > > > > > > multiple > > > > > > > > purposes, but that's only the case if we're not using that > > > > > > > > list > > > > > > > > as an > > > > > > > > indicator. > > > > > > > > > > > > > > > > The list_lru code does check this: > > > > > > > > > > > > > > > > if (!list_empty(item)) { > > > > > > > > > > > > > > > > ...so if we ever check this while it's sitting on the > > > > > > > > dispose > > > > > > > > list, it > > > > > > > > will handle it incorrectly. It sounds like that's the root > > > > > > > > cause of the > > > > > > > > problem you're seeing? > > > > > > > > > > > > > > > > If so, then maybe a separate list_head for disposal would > > > > > > > > be > > > > > > > > better. > > > > > > > > > > > > > > > > > Here is my patch with signed-off-by: > > > > > > > > > > > > > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > Date: Mon, 1 Jul 2024 06:45:22 -0400 > > > > > > > > > Subject: [PATCH] nfsd: fix nfsd_file leaking due to race > > > > > > > > > condition and early > > > > > > > > > unhash > > > > > > > > > > > > > > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > --- > > > > > > > > > fs/nfsd/filecache.c | 14 +++++++++++++- > > > > > > > > > 1 file changed, 13 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf) > > > > > > > > > if (!nfsd_file_lru_remove(nf)) > > > > > > > > > return; > > > > > > > > > } > > > > > > > > > + /* > > > > > > > > > + * Racing with nfsd_file_cond_queue() or > > > > > > > > > nfsd_file_lru_cb(), > > > > > > > > > + * it's unhashed but then removed from > > > > > > > > > the > > > > > > > > > dispose list, > > > > > > > > > + * so we need to free it. > > > > > > > > > + */ > > > > > > > > > + if (refcount_read(&nf->nf_ref) == 0 && > > > > > > > > > > > > > > > > A refcount_read in this path is a red flag to me. Anytime > > > > > > > > you're just > > > > > > > > looking at the refcount without changing anything just > > > > > > > > screams > > > > > > > > out > > > > > > > > "race condition". > > > > > > > > > > > > > > > > In this case, what guarantee is there that this won't run > > > > > > > > afoul > > > > > > > > of the > > > > > > > > timing? We could check this and find out it's 1 just before > > > > > > > > it > > > > > > > > goes to > > > > > > > > 0 and you check the other conditions. > > > > > > > > > > > > > > > > Does anything prevent that? > > > > > > > > > > > > > > > > > + !test_bit(NFSD_FILE_HASHED, &nf- > > > > > > > > > > nf_flags) && > > > > > > > > > + list_empty(&nf->nf_lru)) { > > > > > > > > > + nfsd_file_free(nf); > > > > > > > > > + return; > > > > > > > > > + } > > > > > > > > > } > > > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > > > > > *nf, struct > > > > > > > > > list_head *dispose) > > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > > > /* If we raced with someone else unhashing, > > > > > > > > > ignore it > > > > > > > > > */ > > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > > > > > > > > return; > > > > > > > > > > > > > > > > Same here: you're just testing for the HASHED bit, but > > > > > > > > could > > > > > > > > this also > > > > > > > > race with someone who is setting it just after you get > > > > > > > > here. > > > > > > > > Why is > > > > > > > > that not a problem? > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we can't get a reference, ignore it */ > > > > > > > > > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > > > > > *nf, struct > > > > > > > > > list_head *dispose) > > > > > > > > > /* If refcount goes to 0, then put on the dispose > > > > > > > > > list */ > > > > > > > > > if (refcount_sub_and_test(decrement, &nf- > > > > > > > > > > nf_ref)) { > > > > > > > > > list_add(&nf->nf_lru, dispose); > > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > } > > > > > > > > > } > > > > > > > > > -- > > > > > > > > > 2.43.0 > > > > > > > > > > > > > > > > > > On Wed, Jul 3, 2024 at 2:21 PM Chuck Lever > > > > > > > > > <chuck.lever@oracle.com> wrote: > > > > > > > > > > > > > > > > > > > > On Wed, Jul 03, 2024 at 10:12:33AM -0400, Youzhong Yang > > > > > > > > > > wrote: > > > > > > > > > > > Hello, > > > > > > > > > > > > > > > > > > > > > > I'd like to report a nfsd_file leaking issue and > > > > > > > > > > > propose > > > > > > > > > > > a fix for it. > > > > > > > > > > > > > > > > > > > > > > When I tested Linux kernel 6.8 and 6.6, I noticed > > > > > > > > > > > nfsd_file leaks > > > > > > > > > > > which led to undestroyable file systems (zfs), > > > > > > > > > > > > > > > > > > > > Thanks for the report. Some initial comments: > > > > > > > > > > > > > > > > > > > > - Do you have a specific reproducer? In other words, > > > > > > > > > > what > > > > > > > > > > is the > > > > > > > > > > simplest program that can run on an NFS client that > > > > > > > > > > will > > > > > > > > > > trigger > > > > > > > > > > this leak, and can you post it? > > > > > > > > > > > > > > > > > > > > - "zfs" is an out-of-tree file system, so it's not > > > > > > > > > > directly > > > > > > > > > > supported for NFSD. > > > > > > > > > > > > > > > > > > > > - The guidelines for patch submission require us to fix > > > > > > > > > > issues in > > > > > > > > > > upstream Linux first (currently that's v6.10-rc6). > > > > > > > > > > Then > > > > > > > > > > that fix > > > > > > > > > > can be backported to older stable kernels like 6.6. > > > > > > > > > > > > > > > > > > > > Can you reproduce the leak with one of the in-kernel > > > > > > > > > > filesystems > > > > > > > > > > (either xfs or btrfs would be great) and with NFSD in > > > > > > > > > > 6.10- > > > > > > > > > > rc6? > > > > > > > > > > > > > > > > > > > > One more comment below. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > here are some examples: > > > > > > > > > > > > > > > > > > > > > > crash> struct nfsd_file -x ffff88e160db0460 > > > > > > > > > > > struct nfsd_file { > > > > > > > > > > > nf_rlist = { > > > > > > > > > > > rhead = { > > > > > > > > > > > next = 0xffff8921fa2392f1 > > > > > > > > > > > }, > > > > > > > > > > > next = 0x0 > > > > > > > > > > > }, > > > > > > > > > > > nf_inode = 0xffff8882bc312ef8, > > > > > > > > > > > nf_file = 0xffff88e2015b1500, > > > > > > > > > > > nf_cred = 0xffff88e3ab0e7800, > > > > > > > > > > > nf_net = 0xffffffff83d41600 <init_net>, > > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > > nf_ref = { > > > > > > > > > > > refs = { > > > > > > > > > > > counter = 0xc0000000 > > > > > > > > > > > } > > > > > > > > > > > }, > > > > > > > > > > > nf_may = 0x4, > > > > > > > > > > > nf_mark = 0xffff88e1bddfb320, > > > > > > > > > > > nf_lru = { > > > > > > > > > > > next = 0xffff88e160db04a8, > > > > > > > > > > > prev = 0xffff88e160db04a8 > > > > > > > > > > > }, > > > > > > > > > > > nf_rcu = { > > > > > > > > > > > next = 0x10000000000, > > > > > > > > > > > func = 0x0 > > > > > > > > > > > }, > > > > > > > > > > > nf_birthtime = 0x73d22fc1728 > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file > > > > > > > > > > > -x > > > > > > > > > > > ffff88839a53d850 > > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > nf_lru = { > > > > > > > > > > > next = 0xffff88839a53d898, > > > > > > > > > > > prev = 0xffff88839a53d898 > > > > > > > > > > > }, > > > > > > > > > > > nf_file = 0xffff88810ede8700, > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file > > > > > > > > > > > -x > > > > > > > > > > > ffff88c32b11e850 > > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > nf_lru = { > > > > > > > > > > > next = 0xffff88c32b11e898, > > > > > > > > > > > prev = 0xffff88c32b11e898 > > > > > > > > > > > }, > > > > > > > > > > > nf_file = 0xffff88c20a701c00, > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file > > > > > > > > > > > -x > > > > > > > > > > > ffff88e372709700 > > > > > > > > > > > nf_flags = 0xc, > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > nf_lru = { > > > > > > > > > > > next = 0xffff88e372709748, > > > > > > > > > > > prev = 0xffff88e372709748 > > > > > > > > > > > }, > > > > > > > > > > > nf_file = 0xffff88e0725e6400, > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file > > > > > > > > > > > -x > > > > > > > > > > > ffff8982864944d0 > > > > > > > > > > > nf_flags = 0xc, > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > nf_lru = { > > > > > > > > > > > next = 0xffff898286494518, > > > > > > > > > > > prev = 0xffff898286494518 > > > > > > > > > > > }, > > > > > > > > > > > nf_file = 0xffff89803c0ff700, > > > > > > > > > > > > > > > > > > > > > > The leak occurs when nfsd_file_put() races with > > > > > > > > > > > nfsd_file_cond_queue() > > > > > > > > > > > or nfsd_file_lru_cb(). With the following patch, I > > > > > > > > > > > haven't observed > > > > > > > > > > > any leak after a few days heavy nfs load: > > > > > > > > > > > > > > > > > > > > Our patch submission guidelines require a Signed-off- > > > > > > > > > > by: > > > > > > > > > > line at the end of the patch description. See the "Sign > > > > > > > > > > your work - > > > > > > > > > > the Developer's Certificate of Origin" section of > > > > > > > > > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.10-rc6 > > > > > > > > > > > > > > > > > > > > (Needed here in case your fix is acceptable). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c > > > > > > > > > > > b/fs/nfsd/filecache.c > > > > > > > > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file > > > > > > > > > > > *nf) > > > > > > > > > > > if (!nfsd_file_lru_remove(nf)) > > > > > > > > > > > return; > > > > > > > > > > > } > > > > > > > > > > > + /* > > > > > > > > > > > + * Racing with nfsd_file_cond_queue() or > > > > > > > > > > > nfsd_file_lru_cb(), > > > > > > > > > > > + * it's unhashed but then removed from the dispose > > > > > > > > > > > list, > > > > > > > > > > > + * so we need to free it. > > > > > > > > > > > + */ > > > > > > > > > > > + if (refcount_read(&nf->nf_ref) == 0 && > > > > > > > > > > > + !test_bit(NFSD_FILE_HASHED, &nf->nf_flags) && > > > > > > > > > > > + list_empty(&nf->nf_lru)) { > > > > > > > > > > > + nfsd_file_free(nf); > > > > > > > > > > > + return; > > > > > > > > > > > + } > > > > > > > > > > > } > > > > > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct > > > > > > > > > > > nfsd_file > > > > > > > > > > > *nf, struct > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > > > > > > > /* If we raced with someone else unhashing, ignore > > > > > > > > > > > it > > > > > > > > > > > */ > > > > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > > > > > /* If we can't get a reference, ignore it */ > > > > > > > > > > > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct > > > > > > > > > > > nfsd_file > > > > > > > > > > > *nf, struct > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > /* If refcount goes to 0, then put on the dispose > > > > > > > > > > > list > > > > > > > > > > > */ > > > > > > > > > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) > > > > > > > > > > > { > > > > > > > > > > > list_add(&nf->nf_lru, dispose); > > > > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > > > } > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > Please kindly review the patch and let me know if it > > > > > > > > > > > makes sense. > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > > > > > -Youzhong > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > Chuck Lever > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > Jeff Layton <jlayton@poochiereds.net> > > > > > > > > > > > > -- > > > > > > Jeff Layton <jlayton@poochiereds.net> > > > > > > > > > > > > -- > > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > > > -- > > > > Jeff Layton <jlayton@kernel.org> > > > > > > > -- > > Jeff Layton <jlayton@kernel.org>
Thanks. Here it is again: commit c6e69cebc052cb82d91fc81a98aee20749fe8d47 Author: Youzhong Yang <youzhong@gmail.com> Date: Thu Jul 4 11:25:40 2024 -0400 nfsd: fix nfsd_file leaking due to mixed use of nf->nf_lru nfsd_file_put() in one thread can race with another thread doing garbage collection (running nfsd_file_gc() -> list_lru_walk() -> nfsd_file_lru_cb()): * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add(). * nfsd_file_lru_add() returns true (with NFSD_FILE_REFERENCED bit set) * garbage collector kicks in, nfsd_file_lru_cb() clears REFERENCED bit and returns LRU_ROTATE. * garbage collector kicks in again, nfsd_file_lru_cb() now decrements nf->nf_ref to 0, runs nfsd_file_unhash(), removes it from the LRU and adds to the dispose list [list_lru_isolate_move(lru, &nf->nf_lru, head)] * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it tries to remove the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))]. The 'nf' has been added to the 'dispose' list by nfsd_file_lru_cb(), so nfsd_file_lru_remove(nf) simply treats it as part of the LRU and removes it, which leads to its removal from the 'dispose' list. * At this moment, 'nf' is unhashed with its nf_ref being 0, and not on the LRU. nfsd_file_put() continues its execution [if (refcount_dec_and_test(&nf->nf_ref))], as nf->nf_ref is already 0, nf->nf_ref is set to REFCOUNT_SATURATED, and the 'nf' gets no chance of being freed. nfsd_file_put() can also race with nfsd_file_cond_queue(): * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add(). * nfsd_file_lru_add() sets REFERENCED bit and returns true. * Some userland application runs 'exportfs -f' or something like that, which triggers __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). * In nfsd_file_cond_queue(), it runs [if (!nfsd_file_unhash(nf))], unhash is done successfully. * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], now nf->nf_ref goes to 2. * nfsd_file_cond_queue() runs [if (nfsd_file_lru_remove(nf))], it succeeds. * nfsd_file_cond_queue() runs [if (refcount_sub_and_test(decrement, &nf->nf_ref))] (with "decrement" being 2), so the nf->nf_ref goes to 0, the 'nf' is added to the dispose list [list_add(&nf->nf_lru, dispose)] * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it tries to remove the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))], although the 'nf' is not in the LRU, but it is linked in the 'dispose' list, nfsd_file_lru_remove() simply treats it as part of the LRU and removes it. This leads to its removal from the 'dispose' list! * Now nf->ref is 0, unhashed. nfsd_file_put() continues its execution and set nf->nf_ref to REFCOUNT_SATURATED. As shown in the above analysis, using nf_lru for both the LRU list and dispose list can cause the leaks. This patch adds a new list_head nf_gc in struct nfsd_file, and uses it for the dispose list. Signed-off-by: Youzhong Yang <youzhong@gmail.com> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c index ad9083ca144b..22ebd7fb8639 100644 --- a/fs/nfsd/filecache.c +++ b/fs/nfsd/filecache.c @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need, return NULL; INIT_LIST_HEAD(&nf->nf_lru); + INIT_LIST_HEAD(&nf->nf_gc); nf->nf_birthtime = ktime_get(); nf->nf_file = NULL; nf->nf_cred = get_current_cred(); @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct list_head *dispose) struct nfsd_file *nf; while (!list_empty(dispose)) { - nf = list_first_entry(dispose, struct nfsd_file, nf_lru); - list_del_init(&nf->nf_lru); + nf = list_first_entry(dispose, struct nfsd_file, nf_gc); + list_del_init(&nf->nf_gc); nfsd_file_free(nf); } } @@ -411,12 +412,12 @@ nfsd_file_dispose_list_delayed(struct list_head *dispose) { while(!list_empty(dispose)) { struct nfsd_file *nf = list_first_entry(dispose, - struct nfsd_file, nf_lru); + struct nfsd_file, nf_gc); struct nfsd_net *nn = net_generic(nf->nf_net, nfsd_net_id); struct nfsd_fcache_disposal *l = nn->fcache_disposal; spin_lock(&l->lock); - list_move_tail(&nf->nf_lru, &l->freeme); + list_move_tail(&nf->nf_gc, &l->freeme); spin_unlock(&l->lock); svc_wake_up(nn->nfsd_serv); } @@ -503,7 +504,8 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru, /* Refcount went to zero. Unhash it and queue it to the dispose list */ nfsd_file_unhash(nf); - list_lru_isolate_move(lru, &nf->nf_lru, head); + list_lru_isolate(lru, &nf->nf_lru); + list_add(&nf->nf_gc, head); this_cpu_inc(nfsd_file_evictions); trace_nfsd_file_gc_disposed(nf); return LRU_REMOVED; @@ -578,7 +580,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct list_head *dispose) /* If refcount goes to 0, then put on the dispose list */ if (refcount_sub_and_test(decrement, &nf->nf_ref)) { - list_add(&nf->nf_lru, dispose); + list_add(&nf->nf_gc, dispose); trace_nfsd_file_closing(nf); } } @@ -654,8 +656,8 @@ nfsd_file_close_inode_sync(struct inode *inode) nfsd_file_queue_for_close(inode, &dispose); while (!list_empty(&dispose)) { - nf = list_first_entry(&dispose, struct nfsd_file, nf_lru); - list_del_init(&nf->nf_lru); + nf = list_first_entry(&dispose, struct nfsd_file, nf_gc); + list_del_init(&nf->nf_gc); nfsd_file_free(nf); } } diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h index c61884def906..3fbec24eea6c 100644 --- a/fs/nfsd/filecache.h +++ b/fs/nfsd/filecache.h @@ -44,6 +44,7 @@ struct nfsd_file { struct nfsd_file_mark *nf_mark; struct list_head nf_lru; + struct list_head nf_gc; struct rcu_head nf_rcu; ktime_t nf_birthtime; }; On Wed, Jul 10, 2024 at 8:49 AM Jeff Layton <jlayton@kernel.org> wrote: > > Ok, thanks. I went crawling over the filecache code, and found a couple > of potential nfsd_file leaks. I'm testing patches now and will likely > send them along later today. > > I think we do still want the patch to add nf_gc list_head. I'm not > certain there is a race there, but it's safer to just use a separate > list_head and it doesn't represent a lot of memory. > > Could you post that one individually so that Chuck can pick it up? > > Thanks, > Jeff > > On Tue, 2024-07-09 at 15:13 -0400, Youzhong Yang wrote: > > It's not on the LRU: > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_gc > > ffff898107b95980 > > nf_flags = 12, > > nf_ref.refs.counter = 1 > > nf_lru = { > > next = 0xffff898107b959c8, > > prev = 0xffff898107b959c8 > > }, > > nf_gc = { > > next = 0xffff898107b959d8, > > prev = 0xffff898107b959d8 > > }, > > > > I don't have an easy reproducer now as I am having difficulty > > generating tons of NFS read/write ops using a limited number of nfs > > clients. I use our in-house testing farm to test the patch. > > > > Thanks. > > > > On Tue, Jul 9, 2024 at 3:05 PM Jeff Layton <jlayton@kernel.org> wrote: > > > > > > On Tue, 2024-07-09 at 14:37 -0400, Youzhong Yang wrote: > > > > Thanks Jeff. > > > > > > > > Unfortunately the early unhash easily leads to leaks: > > > > > > > > crash> kmem -S nfsd_file | grep '\[ffff' | sed -e 's|\[||' -e > > > > 's|\]||' > > > > > xargs -i echo struct nfsd_file.nf_flags,nf_ref.refs.counter '{}' > > > > > /var/tmp/nfsd_files > > > > crash> !wc -l /var/tmp/nfsd_files > > > > 19 /var/tmp/nfsd_files > > > > crash> < /var/tmp/nfsd_files > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88865c778900 > > > > nf_flags = 8, > > > > nf_ref.refs.counter = 2 > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88865c778cc0 > > > > nf_flags = 8, > > > > nf_ref.refs.counter = 3 > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff8885d5f35e00 > > > > nf_flags = 8, > > > > nf_ref.refs.counter = 1 > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88817443e780 > > > > nf_flags = 8, > > > > nf_ref.refs.counter = 3 > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88818b3f0600 > > > > nf_flags = 8, > > > > nf_ref.refs.counter = 2 > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88a4490f8300 > > > > nf_flags = 8, > > > > nf_ref.refs.counter = 1 > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff88a0dab183c0 > > > > nf_flags = 8, > > > > nf_ref.refs.counter = 40 > > > > > > That's a lot of references! Might be interesting to look more closely > > > at that one, but the refcounts are all over the place, so it really > > > does look like we just have a refcount leak somewhere. > > > > > > > ... > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff89209535f200 > > > > nf_flags = 8, > > > > nf_ref.refs.counter = 2 > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff8980e15138c0 > > > > nf_flags = 8, > > > > nf_ref.refs.counter = 7 > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff898107b95800 > > > > nf_flags = 8, > > > > nf_ref.refs.counter = 3 > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter ffff898107b95980 > > > > nf_flags = 12, > > > > > > The others are NFSD_FILE_GC. This one is also NFSD_FILE_REFERENCED. Are > > > these objects on the LRU? > > > > > > > nf_ref.refs.counter = 1 > > > > > > > > nfsd_file_do_acquire() -> nfsd_file_lookup_locked() relies on the > > > > hash > > > > table to find the nfsd_file, > > > > but I am still scratching my head why and how this happens. > > > > > > > > FYI, here is the patch I applied for testing: > > > > > > > > > > The above suggests to me that there is a garden-variety refcount leak > > > somewhere. Whenever some bit of the code takes a reference to an object > > > (like a nfsd_file), it's implicitly required to eventually put that > > > reference. Typically that means that the code needs to maintain a > > > pointer to that object, as it can be unhashed at any time and it can't > > > rely on finding the same object later. > > > > > > Do you have a reproducer for this? > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > index ad9083ca144b..22ebd7fb8639 100644 > > > > --- a/fs/nfsd/filecache.c > > > > +++ b/fs/nfsd/filecache.c > > > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, struct inode > > > > *inode, unsigned char need, > > > > return NULL; > > > > > > > > INIT_LIST_HEAD(&nf->nf_lru); > > > > + INIT_LIST_HEAD(&nf->nf_gc); > > > > nf->nf_birthtime = ktime_get(); > > > > nf->nf_file = NULL; > > > > nf->nf_cred = get_current_cred(); > > > > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct list_head *dispose) > > > > struct nfsd_file *nf; > > > > > > > > while (!list_empty(dispose)) { > > > > - nf = list_first_entry(dispose, struct nfsd_file, > > > > nf_lru); > > > > - list_del_init(&nf->nf_lru); > > > > + nf = list_first_entry(dispose, struct nfsd_file, > > > > nf_gc); > > > > + list_del_init(&nf->nf_gc); > > > > nfsd_file_free(nf); > > > > } > > > > } > > > > @@ -411,12 +412,12 @@ nfsd_file_dispose_list_delayed(struct list_head > > > > *dispose) > > > > { > > > > while(!list_empty(dispose)) { > > > > struct nfsd_file *nf = list_first_entry(dispose, > > > > - struct nfsd_file, > > > > nf_lru); > > > > + struct nfsd_file, > > > > nf_gc); > > > > struct nfsd_net *nn = net_generic(nf->nf_net, > > > > nfsd_net_id); > > > > struct nfsd_fcache_disposal *l = nn->fcache_disposal; > > > > > > > > spin_lock(&l->lock); > > > > - list_move_tail(&nf->nf_lru, &l->freeme); > > > > + list_move_tail(&nf->nf_gc, &l->freeme); > > > > spin_unlock(&l->lock); > > > > svc_wake_up(nn->nfsd_serv); > > > > } > > > > @@ -503,7 +504,8 @@ nfsd_file_lru_cb(struct list_head *item, struct > > > > list_lru_one *lru, > > > > > > > > /* Refcount went to zero. Unhash it and queue it to the > > > > dispose list */ > > > > nfsd_file_unhash(nf); > > > > - list_lru_isolate_move(lru, &nf->nf_lru, head); > > > > + list_lru_isolate(lru, &nf->nf_lru); > > > > + list_add(&nf->nf_gc, head); > > > > this_cpu_inc(nfsd_file_evictions); > > > > trace_nfsd_file_gc_disposed(nf); > > > > return LRU_REMOVED; > > > > @@ -578,7 +580,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > > > > list_head *dispose) > > > > > > > > /* If refcount goes to 0, then put on the dispose list */ > > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > > - list_add(&nf->nf_lru, dispose); > > > > + list_add(&nf->nf_gc, dispose); > > > > trace_nfsd_file_closing(nf); > > > > } > > > > } > > > > @@ -654,8 +656,8 @@ nfsd_file_close_inode_sync(struct inode *inode) > > > > > > > > nfsd_file_queue_for_close(inode, &dispose); > > > > while (!list_empty(&dispose)) { > > > > - nf = list_first_entry(&dispose, struct nfsd_file, > > > > nf_lru); > > > > - list_del_init(&nf->nf_lru); > > > > + nf = list_first_entry(&dispose, struct nfsd_file, > > > > nf_gc); > > > > + list_del_init(&nf->nf_gc); > > > > nfsd_file_free(nf); > > > > } > > > > } > > > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h > > > > index c61884def906..3fbec24eea6c 100644 > > > > --- a/fs/nfsd/filecache.h > > > > +++ b/fs/nfsd/filecache.h > > > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > > > > > struct nfsd_file_mark *nf_mark; > > > > struct list_head nf_lru; > > > > + struct list_head nf_gc; > > > > struct rcu_head nf_rcu; > > > > ktime_t nf_birthtime; > > > > }; > > > > > > > > On Mon, Jul 8, 2024 at 10:55 AM Jeff Layton <jlayton@kernel.org> > > > > wrote: > > > > > > > > > > On Mon, 2024-07-08 at 10:23 -0400, Youzhong Yang wrote: > > > > > > Thanks Jeff. > > > > > > > > > > > > I am ok with reverting the unhash/dispose list reordering in > > > > > > nfsd_file_lru_cb(), as it doesn't make much of a difference, > > > > > > but for nfsd_file_cond_queue(), imagining this: > > > > > > > > > > > > - A nfsd_file is hashed > > > > > > - In nfsd_file_cond_queue(), [if (!nfsd_file_unhash(nf))] will > > > > > > get it > > > > > > unhashed, doesn't it? > > > > > > - It continues to get a reference by nfsd_file_get() > > > > > > - It continues to remove itself from LRU by > > > > > > nfsd_file_lru_remove() if > > > > > > it is on the LRU. > > > > > > - Now it runs refcount_sub_and_test(), what happens if the refcnt > > > > > > does > > > > > > not go to 0? How can this nfsd_file be found again? Through the > > > > > > hash > > > > > > table? Through the LRU walk? how? > > > > > > > > > > > > Thanks again. > > > > > > > > > > > > -Youzhong > > > > > > > > > > > > > > > > It won't need to be found again. The holders of the extra > > > > > references > > > > > will put those references when they are finished. Since the object > > > > > is > > > > > no longer HASHED, nfsd_file_put just does this: > > > > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > nfsd_file_free(nf); > > > > > > > > > > So that should be fine. > > > > > > > > > > > On Mon, Jul 8, 2024 at 9:35 AM Jeff Layton <jlayton@kernel.org> > > > > > > wrote: > > > > > > > > > > > > > > On Mon, 2024-07-08 at 08:58 -0400, Youzhong Yang wrote: > > > > > > > > Thank you Jeff for your invaluable insights. I was leaning > > > > > > > > towards > > > > > > > > adding a new list_head too, and tested this approach on > > > > > > > > kernel > > > > > > > > 6.6 by > > > > > > > > continuously hammering the server with heavy nfs load for the > > > > > > > > last few > > > > > > > > days, not a single leak. > > > > > > > > > > > > > > > > Here goes the patch (based on Linux kernel master branch), > > > > > > > > please > > > > > > > > review: > > > > > > > > > > > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > > > > > > Date: Thu, 4 Jul 2024 11:25:40 -0400 > > > > > > > > Subject: [PATCH] nfsd: fix nfsd_file leaking due to mixed use > > > > > > > > of > > > > > > > > nf->nf_lru > > > > > > > > > > > > > > > > nfsd_file_put() in one thread can race with another thread > > > > > > > > doing > > > > > > > > garbage collection (running nfsd_file_gc() -> list_lru_walk() > > > > > > > > -> > > > > > > > > nfsd_file_lru_cb()): > > > > > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do > > > > > > > > nfsd_file_lru_add(). > > > > > > > > * nfsd_file_lru_add() returns true (with > > > > > > > > NFSD_FILE_REFERENCED > > > > > > > > bit set) > > > > > > > > * garbage collector kicks in, nfsd_file_lru_cb() clears > > > > > > > > REFERENCED bit and > > > > > > > > returns LRU_ROTATE. > > > > > > > > * garbage collector kicks in again, nfsd_file_lru_cb() now > > > > > > > > decrements nf->nf_ref > > > > > > > > to 0, runs nfsd_file_unhash(), removes it from the LRU > > > > > > > > and > > > > > > > > adds to > > > > > > > > the dispose > > > > > > > > list [list_lru_isolate_move(lru, &nf->nf_lru, head)] > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, > > > > > > > > so > > > > > > > > it > > > > > > > > tries to remove > > > > > > > > the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))]. > > > > > > > > The > > > > > > > > 'nf' > > > > > > > > has been added > > > > > > > > to the 'dispose' list by nfsd_file_lru_cb(), so > > > > > > > > nfsd_file_lru_remove(nf) simply > > > > > > > > treats it as part of the LRU and removes it, which leads > > > > > > > > to > > > > > > > > its removal from > > > > > > > > the 'dispose' list. > > > > > > > > * At this moment, 'nf' is unhashed with its nf_ref being 0, > > > > > > > > and > > > > > > > > not > > > > > > > > on the LRU. > > > > > > > > nfsd_file_put() continues its execution [if > > > > > > > > (refcount_dec_and_test(&nf->nf_ref))], > > > > > > > > as nf->nf_ref is already 0, nf->nf_ref is set to > > > > > > > > REFCOUNT_SATURATED, and the 'nf' > > > > > > > > gets no chance of being freed. > > > > > > > > > > > > > > > > nfsd_file_put() can also race with nfsd_file_cond_queue(): > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do > > > > > > > > nfsd_file_lru_add(). > > > > > > > > * nfsd_file_lru_add() sets REFERENCED bit and returns true. > > > > > > > > * Some userland application runs 'exportfs -f' or something > > > > > > > > like > > > > > > > > that, which triggers > > > > > > > > __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). > > > > > > > > * In nfsd_file_cond_queue(), it runs [if > > > > > > > > (!nfsd_file_unhash(nf))], > > > > > > > > unhash is done > > > > > > > > successfully. > > > > > > > > * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], > > > > > > > > now > > > > > > > > nf->nf_ref goes to 2. > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > (nfsd_file_lru_remove(nf))], > > > > > > > > it succeeds. > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > (refcount_sub_and_test(decrement, > > > > > > > > &nf->nf_ref))] > > > > > > > > (with "decrement" being 2), so the nf->nf_ref goes to 0, > > > > > > > > the > > > > > > > > 'nf' > > > > > > > > is added to the > > > > > > > > dispose list [list_add(&nf->nf_lru, dispose)] > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, > > > > > > > > so > > > > > > > > it > > > > > > > > tries to remove > > > > > > > > the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))], > > > > > > > > although > > > > > > > > the 'nf' is not > > > > > > > > in the LRU, but it is linked in the 'dispose' list, > > > > > > > > nfsd_file_lru_remove() simply > > > > > > > > treats it as part of the LRU and removes it. This leads > > > > > > > > to > > > > > > > > its removal from > > > > > > > > the 'dispose' list! > > > > > > > > * Now nf->ref is 0, unhashed. nfsd_file_put() continues its > > > > > > > > execution and set > > > > > > > > nf->nf_ref to REFCOUNT_SATURATED. > > > > > > > > > > > > > > > > As shown in the above analysis, using nf_lru for both the LRU > > > > > > > > list and > > > > > > > > dispose list > > > > > > > > can cause the leaks. This patch adds a new list_head nf_gc in > > > > > > > > struct > > > > > > > > nfsd_file, and uses > > > > > > > > it for the dispose list. It's not expected to have a > > > > > > > > nfsd_file > > > > > > > > unhashed but it's not > > > > > > > > added to the dispose list, so in nfsd_file_cond_queue() and > > > > > > > > nfsd_file_lru_cb() nfsd_file > > > > > > > > is unhashed after being added to the dispose list. > > > > > > > > > > > > > > > > > > > > > > I don't see where we require the object to be either hashed or > > > > > > > on > > > > > > > the > > > > > > > dispose list. I think you probably just want to do a patch > > > > > > > that > > > > > > > changes the dispose list to use a dedicated list_head without > > > > > > > reordering when the these things are unhashed. > > > > > > > > > > > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > > > > > > --- > > > > > > > > fs/nfsd/filecache.c | 23 ++++++++++++++--------- > > > > > > > > fs/nfsd/filecache.h | 1 + > > > > > > > > 2 files changed, 15 insertions(+), 9 deletions(-) > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > > > > index ad9083ca144b..3aef2ddfce94 100644 > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, struct > > > > > > > > inode > > > > > > > > *inode, unsigned char need, > > > > > > > > return NULL; > > > > > > > > > > > > > > > > INIT_LIST_HEAD(&nf->nf_lru); > > > > > > > > + INIT_LIST_HEAD(&nf->nf_gc); > > > > > > > > nf->nf_birthtime = ktime_get(); > > > > > > > > nf->nf_file = NULL; > > > > > > > > nf->nf_cred = get_current_cred(); > > > > > > > > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct list_head > > > > > > > > *dispose) > > > > > > > > struct nfsd_file *nf; > > > > > > > > > > > > > > > > while (!list_empty(dispose)) { > > > > > > > > - nf = list_first_entry(dispose, struct > > > > > > > > nfsd_file, > > > > > > > > nf_lru); > > > > > > > > - list_del_init(&nf->nf_lru); > > > > > > > > + nf = list_first_entry(dispose, struct > > > > > > > > nfsd_file, > > > > > > > > nf_gc); > > > > > > > > + list_del_init(&nf->nf_gc); > > > > > > > > nfsd_file_free(nf); > > > > > > > > } > > > > > > > > } > > > > > > > > @@ -411,12 +412,12 @@ nfsd_file_dispose_list_delayed(struct > > > > > > > > list_head *dispose) > > > > > > > > { > > > > > > > > while(!list_empty(dispose)) { > > > > > > > > struct nfsd_file *nf = > > > > > > > > list_first_entry(dispose, > > > > > > > > - struct > > > > > > > > nfsd_file, > > > > > > > > nf_lru); > > > > > > > > + struct > > > > > > > > nfsd_file, > > > > > > > > nf_gc); > > > > > > > > struct nfsd_net *nn = net_generic(nf->nf_net, > > > > > > > > nfsd_net_id); > > > > > > > > struct nfsd_fcache_disposal *l = nn- > > > > > > > > > fcache_disposal; > > > > > > > > > > > > > > > > spin_lock(&l->lock); > > > > > > > > - list_move_tail(&nf->nf_lru, &l->freeme); > > > > > > > > + list_move_tail(&nf->nf_gc, &l->freeme); > > > > > > > > spin_unlock(&l->lock); > > > > > > > > svc_wake_up(nn->nfsd_serv); > > > > > > > > } > > > > > > > > @@ -502,8 +503,10 @@ nfsd_file_lru_cb(struct list_head *item, > > > > > > > > struct > > > > > > > > list_lru_one *lru, > > > > > > > > } > > > > > > > > > > > > > > > > /* Refcount went to zero. Unhash it and queue it to > > > > > > > > the > > > > > > > > dispose list */ > > > > > > > > + list_lru_isolate(lru, &nf->nf_lru); > > > > > > > > + list_add(&nf->nf_gc, head); > > > > > > > > + /* Unhash after removing from LRU and adding to > > > > > > > > dispose > > > > > > > > list */ > > > > > > > > nfsd_file_unhash(nf); > > > > > > > > - list_lru_isolate_move(lru, &nf->nf_lru, head); > > > > > > > > > > > > > > I don't see the point in reordering these operations. Hashing > > > > > > > is > > > > > > > all > > > > > > > about making the thing findable by nfsd operations. The _last_ > > > > > > > thing we > > > > > > > want to do is put it on the dispose list while the thing can > > > > > > > still > > > > > > > be > > > > > > > found by nfsd threads doing operations. > > > > > > > > > > > > > > > this_cpu_inc(nfsd_file_evictions); > > > > > > > > trace_nfsd_file_gc_disposed(nf); > > > > > > > > return LRU_REMOVED; > > > > > > > > @@ -565,7 +568,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > > > > *nf, > > > > > > > > struct > > > > > > > > list_head *dispose) > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > /* If we raced with someone else unhashing, ignore it > > > > > > > > */ > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > > > > > > > > > > > > > The above change looks wrong. I don't think we need to change > > > > > > > this. > > > > > > > > > > > > > > > return; > > > > > > > > > > > > > > > > /* If we can't get a reference, ignore it */ > > > > > > > > @@ -578,7 +581,9 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > > > > *nf, > > > > > > > > struct > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > > > /* If refcount goes to 0, then put on the dispose > > > > > > > > list */ > > > > > > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > > > > > > - list_add(&nf->nf_lru, dispose); > > > > > > > > + list_add(&nf->nf_gc, dispose); > > > > > > > > + /* Unhash after adding to dispose list */ > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > > > > > > This too looks wrong? Maybe I'm unclear on the race you're > > > > > > > trying > > > > > > > to > > > > > > > fix with this? What's the harm in unhashing it early? > > > > > > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > } > > > > > > > > } > > > > > > > > @@ -654,8 +659,8 @@ nfsd_file_close_inode_sync(struct inode > > > > > > > > *inode) > > > > > > > > > > > > > > > > nfsd_file_queue_for_close(inode, &dispose); > > > > > > > > while (!list_empty(&dispose)) { > > > > > > > > - nf = list_first_entry(&dispose, struct > > > > > > > > nfsd_file, > > > > > > > > nf_lru); > > > > > > > > - list_del_init(&nf->nf_lru); > > > > > > > > + nf = list_first_entry(&dispose, struct > > > > > > > > nfsd_file, > > > > > > > > nf_gc); > > > > > > > > + list_del_init(&nf->nf_gc); > > > > > > > > nfsd_file_free(nf); > > > > > > > > } > > > > > > > > } > > > > > > > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h > > > > > > > > index c61884def906..3fbec24eea6c 100644 > > > > > > > > --- a/fs/nfsd/filecache.h > > > > > > > > +++ b/fs/nfsd/filecache.h > > > > > > > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > > > > > > > > > > > > > struct nfsd_file_mark *nf_mark; > > > > > > > > struct list_head nf_lru; > > > > > > > > + struct list_head nf_gc; > > > > > > > > struct rcu_head nf_rcu; > > > > > > > > ktime_t nf_birthtime; > > > > > > > > }; > > > > > > > > -- > > > > > > > > 2.34.1 > > > > > > > > > > > > > > > > On Thu, Jul 4, 2024 at 7:14 AM Jeff Layton > > > > > > > > <jlayton@poochiereds.net> wrote: > > > > > > > > > > > > > > > > > > On Wed, 2024-07-03 at 16:46 -0400, Youzhong Yang wrote: > > > > > > > > > > Thank you Chuck. Here are my quick answers to your > > > > > > > > > > comments: > > > > > > > > > > > > > > > > > > > > - I don't have a quick reproducer. I reproduced it by > > > > > > > > > > using > > > > > > > > > > hundreds > > > > > > > > > > of nfs clients generating +600K ops under our workload in > > > > > > > > > > the > > > > > > > > > > testing > > > > > > > > > > environment. Theoretically it should be possible to > > > > > > > > > > simplify > > > > > > > > > > the > > > > > > > > > > reproduction but I am still working on it. > > > > > > > > > > > > > > > > > > > > - I understand zfs is an out-of-tree file system. That's > > > > > > > > > > fine. But > > > > > > > > > > this leaking can happen to any file system, and leaking > > > > > > > > > > is > > > > > > > > > > not a good > > > > > > > > > > thing no matter what file system it is. > > > > > > > > > > > > > > > > > > > > - I will try to come up with a reproducer using xfs or > > > > > > > > > > btrfs > > > > > > > > > > if possible. > > > > > > > > > > > > > > > > > > > > Now back to the problem itself, here are my findings: > > > > > > > > > > > > > > > > > > > > - nfsd_file_put() in one thread can race with another > > > > > > > > > > thread > > > > > > > > > > doing > > > > > > > > > > garbage collection (running nfsd_file_gc() -> > > > > > > > > > > list_lru_walk() > > > > > > > > > > -> > > > > > > > > > > nfsd_file_lru_cb()): > > > > > > > > > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to > > > > > > > > > > do > > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > > * nfsd_file_lru_add() returns true (thus > > > > > > > > > > NFSD_FILE_REFERENCED bit > > > > > > > > > > set for nf->nf_flags) > > > > > > > > > > * garbage collector kicks in, nfsd_file_lru_cb() clears > > > > > > > > > > REFERENCED > > > > > > > > > > bit and returns LRU_ROTATE. > > > > > > > > > > * garbage collector kicks in again, nfsd_file_lru_cb() > > > > > > > > > > now > > > > > > > > > > decrements nf->nf_ref to 0, runs nfsd_file_unhash(), > > > > > > > > > > removes > > > > > > > > > > it from > > > > > > > > > > the LRU and adds to the dispose list > > > > > > > > > > [list_lru_isolate_move(lru, > > > > > > > > > > &nf->nf_lru, head);] > > > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is > > > > > > > > > > cleared, > > > > > > > > > > so it > > > > > > > > > > tries to remove the 'nf' from the LRU [if > > > > > > > > > > (!nfsd_file_lru_remove(nf))] > > > > > > > > > > * The 'nf' has been added to the 'dispose' list by > > > > > > > > > > nfsd_file_lru_cb(), so nfsd_file_lru_remove(nf) simply > > > > > > > > > > treats > > > > > > > > > > it as > > > > > > > > > > part of the LRU and removes it, which leads it to be > > > > > > > > > > removed > > > > > > > > > > from the > > > > > > > > > > 'dispose' list. > > > > > > > > > > * At this moment, nf->nf_ref is 0, it's unhashed, and > > > > > > > > > > not > > > > > > > > > > on the > > > > > > > > > > LRU. nfsd_file_put() continues its execution [if > > > > > > > > > > (refcount_dec_and_test(&nf->nf_ref))], as nf->nf_ref is > > > > > > > > > > already 0, now > > > > > > > > > > bad thing happens: nf->nf_ref is set to > > > > > > > > > > REFCOUNT_SATURATED, > > > > > > > > > > and the > > > > > > > > > > 'nf' is leaked. > > > > > > > > > > > > > > > > > > > > To make this happen, the right timing is crucial. It can > > > > > > > > > > be > > > > > > > > > > reproduced > > > > > > > > > > by adding artifical delays in filecache.c, or hammering > > > > > > > > > > the > > > > > > > > > > nfsd with > > > > > > > > > > tons of ops. > > > > > > > > > > > > > > > > > > > > - Let's see how nfsd_file_put() can race with > > > > > > > > > > nfsd_file_cond_queue(): > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to > > > > > > > > > > do > > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > > * nfsd_file_lru_add() sets REFERENCED bit and returns > > > > > > > > > > true. > > > > > > > > > > * 'exportfs -f' or something like that triggers > > > > > > > > > > __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). > > > > > > > > > > * In nfsd_file_cond_queue(), it runs [if > > > > > > > > > > (!nfsd_file_unhash(nf))], > > > > > > > > > > unhash is done successfully. > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > (!nfsd_file_get(nf))], > > > > > > > > > > now > > > > > > > > > > nf->nf_ref goes to 2. > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > (nfsd_file_lru_remove(nf))], it succeeds. > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > (refcount_sub_and_test(decrement, > > > > > > > > > > &nf->nf_ref))] (with "decrement" being 2), so the nf- > > > > > > > > > > > nf_ref > > > > > > > > > > goes to > > > > > > > > > > 0, the 'nf' is added to the dispost list [list_add(&nf- > > > > > > > > > > > nf_lru, > > > > > > > > > > dispose)] > > > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is > > > > > > > > > > cleared, > > > > > > > > > > so it > > > > > > > > > > tries to remove the 'nf' from the LRU [if > > > > > > > > > > (!nfsd_file_lru_remove(nf))], although the 'nf' is not in > > > > > > > > > > the > > > > > > > > > > LRU, but > > > > > > > > > > it is linked in the 'dispose' list, > > > > > > > > > > nfsd_file_lru_remove() > > > > > > > > > > simply > > > > > > > > > > treats it as part of the LRU and removes it. This leads > > > > > > > > > > to > > > > > > > > > > its removal > > > > > > > > > > from the 'dispose' list! > > > > > > > > > > * Now nf->ref is 0, unhashed. nfsd_file_put() continues > > > > > > > > > > its > > > > > > > > > > execution and sets nf->nf_ref to REFCOUNT_SATURATED. > > > > > > > > > > > > > > > > > > > > The purpose of nf->nf_lru is problematic. As you can see, > > > > > > > > > > it > > > > > > > > > > is used > > > > > > > > > > for the LRU list, and also the 'dispose' list. Adding > > > > > > > > > > another > > > > > > > > > > 'struct > > > > > > > > > > list_head' specifically for the 'dispose' list seems to > > > > > > > > > > be a > > > > > > > > > > better > > > > > > > > > > way of fixing this race condition. Either way works for > > > > > > > > > > me. > > > > > > > > > > > > > > > > > > > > Would you agree my above analysis makes sense? Thanks. > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think so. It's been a while since I've done much work in > > > > > > > > > this > > > > > > > > > code, > > > > > > > > > but it does sound like there is a race in the LRU handling. > > > > > > > > > > > > > > > > > > > > > > > > > > > Like Chuck said, the nf->nf_lru list should be safe to use > > > > > > > > > for > > > > > > > > > multiple > > > > > > > > > purposes, but that's only the case if we're not using that > > > > > > > > > list > > > > > > > > > as an > > > > > > > > > indicator. > > > > > > > > > > > > > > > > > > The list_lru code does check this: > > > > > > > > > > > > > > > > > > if (!list_empty(item)) { > > > > > > > > > > > > > > > > > > ...so if we ever check this while it's sitting on the > > > > > > > > > dispose > > > > > > > > > list, it > > > > > > > > > will handle it incorrectly. It sounds like that's the root > > > > > > > > > cause of the > > > > > > > > > problem you're seeing? > > > > > > > > > > > > > > > > > > If so, then maybe a separate list_head for disposal would > > > > > > > > > be > > > > > > > > > better. > > > > > > > > > > > > > > > > > > > Here is my patch with signed-off-by: > > > > > > > > > > > > > > > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > > Date: Mon, 1 Jul 2024 06:45:22 -0400 > > > > > > > > > > Subject: [PATCH] nfsd: fix nfsd_file leaking due to race > > > > > > > > > > condition and early > > > > > > > > > > unhash > > > > > > > > > > > > > > > > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > > --- > > > > > > > > > > fs/nfsd/filecache.c | 14 +++++++++++++- > > > > > > > > > > 1 file changed, 13 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > > > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf) > > > > > > > > > > if (!nfsd_file_lru_remove(nf)) > > > > > > > > > > return; > > > > > > > > > > } > > > > > > > > > > + /* > > > > > > > > > > + * Racing with nfsd_file_cond_queue() or > > > > > > > > > > nfsd_file_lru_cb(), > > > > > > > > > > + * it's unhashed but then removed from > > > > > > > > > > the > > > > > > > > > > dispose list, > > > > > > > > > > + * so we need to free it. > > > > > > > > > > + */ > > > > > > > > > > + if (refcount_read(&nf->nf_ref) == 0 && > > > > > > > > > > > > > > > > > > A refcount_read in this path is a red flag to me. Anytime > > > > > > > > > you're just > > > > > > > > > looking at the refcount without changing anything just > > > > > > > > > screams > > > > > > > > > out > > > > > > > > > "race condition". > > > > > > > > > > > > > > > > > > In this case, what guarantee is there that this won't run > > > > > > > > > afoul > > > > > > > > > of the > > > > > > > > > timing? We could check this and find out it's 1 just before > > > > > > > > > it > > > > > > > > > goes to > > > > > > > > > 0 and you check the other conditions. > > > > > > > > > > > > > > > > > > Does anything prevent that? > > > > > > > > > > > > > > > > > > > + !test_bit(NFSD_FILE_HASHED, &nf- > > > > > > > > > > > nf_flags) && > > > > > > > > > > + list_empty(&nf->nf_lru)) { > > > > > > > > > > + nfsd_file_free(nf); > > > > > > > > > > + return; > > > > > > > > > > + } > > > > > > > > > > } > > > > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > > > > > > *nf, struct > > > > > > > > > > list_head *dispose) > > > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > > > > > /* If we raced with someone else unhashing, > > > > > > > > > > ignore it > > > > > > > > > > */ > > > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > Same here: you're just testing for the HASHED bit, but > > > > > > > > > could > > > > > > > > > this also > > > > > > > > > race with someone who is setting it just after you get > > > > > > > > > here. > > > > > > > > > Why is > > > > > > > > > that not a problem? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we can't get a reference, ignore it */ > > > > > > > > > > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > > > > > > *nf, struct > > > > > > > > > > list_head *dispose) > > > > > > > > > > /* If refcount goes to 0, then put on the dispose > > > > > > > > > > list */ > > > > > > > > > > if (refcount_sub_and_test(decrement, &nf- > > > > > > > > > > > nf_ref)) { > > > > > > > > > > list_add(&nf->nf_lru, dispose); > > > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > > } > > > > > > > > > > } > > > > > > > > > > -- > > > > > > > > > > 2.43.0 > > > > > > > > > > > > > > > > > > > > On Wed, Jul 3, 2024 at 2:21 PM Chuck Lever > > > > > > > > > > <chuck.lever@oracle.com> wrote: > > > > > > > > > > > > > > > > > > > > > > On Wed, Jul 03, 2024 at 10:12:33AM -0400, Youzhong Yang > > > > > > > > > > > wrote: > > > > > > > > > > > > Hello, > > > > > > > > > > > > > > > > > > > > > > > > I'd like to report a nfsd_file leaking issue and > > > > > > > > > > > > propose > > > > > > > > > > > > a fix for it. > > > > > > > > > > > > > > > > > > > > > > > > When I tested Linux kernel 6.8 and 6.6, I noticed > > > > > > > > > > > > nfsd_file leaks > > > > > > > > > > > > which led to undestroyable file systems (zfs), > > > > > > > > > > > > > > > > > > > > > > Thanks for the report. Some initial comments: > > > > > > > > > > > > > > > > > > > > > > - Do you have a specific reproducer? In other words, > > > > > > > > > > > what > > > > > > > > > > > is the > > > > > > > > > > > simplest program that can run on an NFS client that > > > > > > > > > > > will > > > > > > > > > > > trigger > > > > > > > > > > > this leak, and can you post it? > > > > > > > > > > > > > > > > > > > > > > - "zfs" is an out-of-tree file system, so it's not > > > > > > > > > > > directly > > > > > > > > > > > supported for NFSD. > > > > > > > > > > > > > > > > > > > > > > - The guidelines for patch submission require us to fix > > > > > > > > > > > issues in > > > > > > > > > > > upstream Linux first (currently that's v6.10-rc6). > > > > > > > > > > > Then > > > > > > > > > > > that fix > > > > > > > > > > > can be backported to older stable kernels like 6.6. > > > > > > > > > > > > > > > > > > > > > > Can you reproduce the leak with one of the in-kernel > > > > > > > > > > > filesystems > > > > > > > > > > > (either xfs or btrfs would be great) and with NFSD in > > > > > > > > > > > 6.10- > > > > > > > > > > > rc6? > > > > > > > > > > > > > > > > > > > > > > One more comment below. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > here are some examples: > > > > > > > > > > > > > > > > > > > > > > > > crash> struct nfsd_file -x ffff88e160db0460 > > > > > > > > > > > > struct nfsd_file { > > > > > > > > > > > > nf_rlist = { > > > > > > > > > > > > rhead = { > > > > > > > > > > > > next = 0xffff8921fa2392f1 > > > > > > > > > > > > }, > > > > > > > > > > > > next = 0x0 > > > > > > > > > > > > }, > > > > > > > > > > > > nf_inode = 0xffff8882bc312ef8, > > > > > > > > > > > > nf_file = 0xffff88e2015b1500, > > > > > > > > > > > > nf_cred = 0xffff88e3ab0e7800, > > > > > > > > > > > > nf_net = 0xffffffff83d41600 <init_net>, > > > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > > > nf_ref = { > > > > > > > > > > > > refs = { > > > > > > > > > > > > counter = 0xc0000000 > > > > > > > > > > > > } > > > > > > > > > > > > }, > > > > > > > > > > > > nf_may = 0x4, > > > > > > > > > > > > nf_mark = 0xffff88e1bddfb320, > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > next = 0xffff88e160db04a8, > > > > > > > > > > > > prev = 0xffff88e160db04a8 > > > > > > > > > > > > }, > > > > > > > > > > > > nf_rcu = { > > > > > > > > > > > > next = 0x10000000000, > > > > > > > > > > > > func = 0x0 > > > > > > > > > > > > }, > > > > > > > > > > > > nf_birthtime = 0x73d22fc1728 > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file > > > > > > > > > > > > -x > > > > > > > > > > > > ffff88839a53d850 > > > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > next = 0xffff88839a53d898, > > > > > > > > > > > > prev = 0xffff88839a53d898 > > > > > > > > > > > > }, > > > > > > > > > > > > nf_file = 0xffff88810ede8700, > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file > > > > > > > > > > > > -x > > > > > > > > > > > > ffff88c32b11e850 > > > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > next = 0xffff88c32b11e898, > > > > > > > > > > > > prev = 0xffff88c32b11e898 > > > > > > > > > > > > }, > > > > > > > > > > > > nf_file = 0xffff88c20a701c00, > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file > > > > > > > > > > > > -x > > > > > > > > > > > > ffff88e372709700 > > > > > > > > > > > > nf_flags = 0xc, > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > next = 0xffff88e372709748, > > > > > > > > > > > > prev = 0xffff88e372709748 > > > > > > > > > > > > }, > > > > > > > > > > > > nf_file = 0xffff88e0725e6400, > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file > > > > > > > > > > > > -x > > > > > > > > > > > > ffff8982864944d0 > > > > > > > > > > > > nf_flags = 0xc, > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > next = 0xffff898286494518, > > > > > > > > > > > > prev = 0xffff898286494518 > > > > > > > > > > > > }, > > > > > > > > > > > > nf_file = 0xffff89803c0ff700, > > > > > > > > > > > > > > > > > > > > > > > > The leak occurs when nfsd_file_put() races with > > > > > > > > > > > > nfsd_file_cond_queue() > > > > > > > > > > > > or nfsd_file_lru_cb(). With the following patch, I > > > > > > > > > > > > haven't observed > > > > > > > > > > > > any leak after a few days heavy nfs load: > > > > > > > > > > > > > > > > > > > > > > Our patch submission guidelines require a Signed-off- > > > > > > > > > > > by: > > > > > > > > > > > line at the end of the patch description. See the "Sign > > > > > > > > > > > your work - > > > > > > > > > > > the Developer's Certificate of Origin" section of > > > > > > > > > > > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.10-rc6 > > > > > > > > > > > > > > > > > > > > > > (Needed here in case your fix is acceptable). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c > > > > > > > > > > > > b/fs/nfsd/filecache.c > > > > > > > > > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file > > > > > > > > > > > > *nf) > > > > > > > > > > > > if (!nfsd_file_lru_remove(nf)) > > > > > > > > > > > > return; > > > > > > > > > > > > } > > > > > > > > > > > > + /* > > > > > > > > > > > > + * Racing with nfsd_file_cond_queue() or > > > > > > > > > > > > nfsd_file_lru_cb(), > > > > > > > > > > > > + * it's unhashed but then removed from the dispose > > > > > > > > > > > > list, > > > > > > > > > > > > + * so we need to free it. > > > > > > > > > > > > + */ > > > > > > > > > > > > + if (refcount_read(&nf->nf_ref) == 0 && > > > > > > > > > > > > + !test_bit(NFSD_FILE_HASHED, &nf->nf_flags) && > > > > > > > > > > > > + list_empty(&nf->nf_lru)) { > > > > > > > > > > > > + nfsd_file_free(nf); > > > > > > > > > > > > + return; > > > > > > > > > > > > + } > > > > > > > > > > > > } > > > > > > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > *nf, struct > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > > > > > > > > > /* If we raced with someone else unhashing, ignore > > > > > > > > > > > > it > > > > > > > > > > > > */ > > > > > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) > > > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > > > > > > > /* If we can't get a reference, ignore it */ > > > > > > > > > > > > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > *nf, struct > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > /* If refcount goes to 0, then put on the dispose > > > > > > > > > > > > list > > > > > > > > > > > > */ > > > > > > > > > > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) > > > > > > > > > > > > { > > > > > > > > > > > > list_add(&nf->nf_lru, dispose); > > > > > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > > > > } > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > Please kindly review the patch and let me know if it > > > > > > > > > > > > makes sense. > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > > > > > > > -Youzhong > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > Chuck Lever > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > Jeff Layton <jlayton@poochiereds.net> > > > > > > > > > > > > > > -- > > > > > > > Jeff Layton <jlayton@poochiereds.net> > > > > > > > > > > > > > > -- > > > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > > > > > -- > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > > > > > -- > > > Jeff Layton <jlayton@kernel.org> > > -- > Jeff Layton <jlayton@kernel.org>
Can you send it as a separate [PATCH] email? I think your MUA is mangling the patches as I couldn't apply it earlier. Either use git- send-email or have a look over this page: https://www.kernel.org/doc/html/latest/process/email-clients.html#email-clients Thanks! Jeff On Wed, 2024-07-10 at 09:33 -0400, Youzhong Yang wrote: > Thanks. Here it is again: > > commit c6e69cebc052cb82d91fc81a98aee20749fe8d47 > Author: Youzhong Yang <youzhong@gmail.com> > Date: Thu Jul 4 11:25:40 2024 -0400 > > nfsd: fix nfsd_file leaking due to mixed use of nf->nf_lru > > nfsd_file_put() in one thread can race with another thread doing > garbage collection (running nfsd_file_gc() -> list_lru_walk() -> > nfsd_file_lru_cb()): > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do > nfsd_file_lru_add(). > * nfsd_file_lru_add() returns true (with NFSD_FILE_REFERENCED > bit set) > * garbage collector kicks in, nfsd_file_lru_cb() clears > REFERENCED bit and > returns LRU_ROTATE. > * garbage collector kicks in again, nfsd_file_lru_cb() now > decrements nf->nf_ref > to 0, runs nfsd_file_unhash(), removes it from the LRU and > adds to the dispose > list [list_lru_isolate_move(lru, &nf->nf_lru, head)] > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so > it > tries to remove > the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))]. The > 'nf' has been added > to the 'dispose' list by nfsd_file_lru_cb(), so > nfsd_file_lru_remove(nf) simply > treats it as part of the LRU and removes it, which leads to > its removal from > the 'dispose' list. > * At this moment, 'nf' is unhashed with its nf_ref being 0, and > not on the LRU. > nfsd_file_put() continues its execution [if > (refcount_dec_and_test(&nf->nf_ref))], > as nf->nf_ref is already 0, nf->nf_ref is set to > REFCOUNT_SATURATED, and the 'nf' > gets no chance of being freed. > > nfsd_file_put() can also race with nfsd_file_cond_queue(): > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do > nfsd_file_lru_add(). > * nfsd_file_lru_add() sets REFERENCED bit and returns true. > * Some userland application runs 'exportfs -f' or something > like > that, which triggers > __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). > * In nfsd_file_cond_queue(), it runs [if > (!nfsd_file_unhash(nf))], unhash is done > successfully. > * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], now > nf->nf_ref goes to 2. > * nfsd_file_cond_queue() runs [if (nfsd_file_lru_remove(nf))], > it succeeds. > * nfsd_file_cond_queue() runs [if > (refcount_sub_and_test(decrement, &nf->nf_ref))] > (with "decrement" being 2), so the nf->nf_ref goes to 0, the > 'nf' is added to the > dispose list [list_add(&nf->nf_lru, dispose)] > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so > it > tries to remove > the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))], > although the 'nf' is not > in the LRU, but it is linked in the 'dispose' list, > nfsd_file_lru_remove() simply > treats it as part of the LRU and removes it. This leads to > its > removal from > the 'dispose' list! > * Now nf->ref is 0, unhashed. nfsd_file_put() continues its > execution and set > nf->nf_ref to REFCOUNT_SATURATED. > > As shown in the above analysis, using nf_lru for both the LRU > list > and dispose list > can cause the leaks. This patch adds a new list_head nf_gc in > struct nfsd_file, and uses > it for the dispose list. > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > index ad9083ca144b..22ebd7fb8639 100644 > --- a/fs/nfsd/filecache.c > +++ b/fs/nfsd/filecache.c > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, struct inode > *inode, unsigned char need, > return NULL; > > INIT_LIST_HEAD(&nf->nf_lru); > + INIT_LIST_HEAD(&nf->nf_gc); > nf->nf_birthtime = ktime_get(); > nf->nf_file = NULL; > nf->nf_cred = get_current_cred(); > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct list_head *dispose) > struct nfsd_file *nf; > > while (!list_empty(dispose)) { > - nf = list_first_entry(dispose, struct nfsd_file, > nf_lru); > - list_del_init(&nf->nf_lru); > + nf = list_first_entry(dispose, struct nfsd_file, > nf_gc); > + list_del_init(&nf->nf_gc); > nfsd_file_free(nf); > } > } > @@ -411,12 +412,12 @@ nfsd_file_dispose_list_delayed(struct list_head > *dispose) > { > while(!list_empty(dispose)) { > struct nfsd_file *nf = list_first_entry(dispose, > - struct nfsd_file, > nf_lru); > + struct nfsd_file, > nf_gc); > struct nfsd_net *nn = net_generic(nf->nf_net, > nfsd_net_id); > struct nfsd_fcache_disposal *l = nn->fcache_disposal; > > spin_lock(&l->lock); > - list_move_tail(&nf->nf_lru, &l->freeme); > + list_move_tail(&nf->nf_gc, &l->freeme); > spin_unlock(&l->lock); > svc_wake_up(nn->nfsd_serv); > } > @@ -503,7 +504,8 @@ nfsd_file_lru_cb(struct list_head *item, struct > list_lru_one *lru, > > /* Refcount went to zero. Unhash it and queue it to the > dispose list */ > nfsd_file_unhash(nf); > - list_lru_isolate_move(lru, &nf->nf_lru, head); > + list_lru_isolate(lru, &nf->nf_lru); > + list_add(&nf->nf_gc, head); > this_cpu_inc(nfsd_file_evictions); > trace_nfsd_file_gc_disposed(nf); > return LRU_REMOVED; > @@ -578,7 +580,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > list_head *dispose) > > /* If refcount goes to 0, then put on the dispose list */ > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > - list_add(&nf->nf_lru, dispose); > + list_add(&nf->nf_gc, dispose); > trace_nfsd_file_closing(nf); > } > } > @@ -654,8 +656,8 @@ nfsd_file_close_inode_sync(struct inode *inode) > > nfsd_file_queue_for_close(inode, &dispose); > while (!list_empty(&dispose)) { > - nf = list_first_entry(&dispose, struct nfsd_file, > nf_lru); > - list_del_init(&nf->nf_lru); > + nf = list_first_entry(&dispose, struct nfsd_file, > nf_gc); > + list_del_init(&nf->nf_gc); > nfsd_file_free(nf); > } > } > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h > index c61884def906..3fbec24eea6c 100644 > --- a/fs/nfsd/filecache.h > +++ b/fs/nfsd/filecache.h > @@ -44,6 +44,7 @@ struct nfsd_file { > > struct nfsd_file_mark *nf_mark; > struct list_head nf_lru; > + struct list_head nf_gc; > struct rcu_head nf_rcu; > ktime_t nf_birthtime; > }; > > On Wed, Jul 10, 2024 at 8:49 AM Jeff Layton <jlayton@kernel.org> > wrote: > > > > Ok, thanks. I went crawling over the filecache code, and found a > > couple > > of potential nfsd_file leaks. I'm testing patches now and will > > likely > > send them along later today. > > > > I think we do still want the patch to add nf_gc list_head. I'm not > > certain there is a race there, but it's safer to just use a > > separate > > list_head and it doesn't represent a lot of memory. > > > > Could you post that one individually so that Chuck can pick it up? > > > > Thanks, > > Jeff > > > > On Tue, 2024-07-09 at 15:13 -0400, Youzhong Yang wrote: > > > It's not on the LRU: > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_gc > > > ffff898107b95980 > > > nf_flags = 12, > > > nf_ref.refs.counter = 1 > > > nf_lru = { > > > next = 0xffff898107b959c8, > > > prev = 0xffff898107b959c8 > > > }, > > > nf_gc = { > > > next = 0xffff898107b959d8, > > > prev = 0xffff898107b959d8 > > > }, > > > > > > I don't have an easy reproducer now as I am having difficulty > > > generating tons of NFS read/write ops using a limited number of > > > nfs > > > clients. I use our in-house testing farm to test the patch. > > > > > > Thanks. > > > > > > On Tue, Jul 9, 2024 at 3:05 PM Jeff Layton <jlayton@kernel.org> > > > wrote: > > > > > > > > On Tue, 2024-07-09 at 14:37 -0400, Youzhong Yang wrote: > > > > > Thanks Jeff. > > > > > > > > > > Unfortunately the early unhash easily leads to leaks: > > > > > > > > > > crash> kmem -S nfsd_file | grep '\[ffff' | sed -e 's|\[||' -e > > > > > 's|\]||' > > > > > > xargs -i echo struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > '{}' > > > > > > /var/tmp/nfsd_files > > > > > crash> !wc -l /var/tmp/nfsd_files > > > > > 19 /var/tmp/nfsd_files > > > > > crash> < /var/tmp/nfsd_files > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > ffff88865c778900 > > > > > nf_flags = 8, > > > > > nf_ref.refs.counter = 2 > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > ffff88865c778cc0 > > > > > nf_flags = 8, > > > > > nf_ref.refs.counter = 3 > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > ffff8885d5f35e00 > > > > > nf_flags = 8, > > > > > nf_ref.refs.counter = 1 > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > ffff88817443e780 > > > > > nf_flags = 8, > > > > > nf_ref.refs.counter = 3 > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > ffff88818b3f0600 > > > > > nf_flags = 8, > > > > > nf_ref.refs.counter = 2 > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > ffff88a4490f8300 > > > > > nf_flags = 8, > > > > > nf_ref.refs.counter = 1 > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > ffff88a0dab183c0 > > > > > nf_flags = 8, > > > > > nf_ref.refs.counter = 40 > > > > > > > > That's a lot of references! Might be interesting to look more > > > > closely > > > > at that one, but the refcounts are all over the place, so it > > > > really > > > > does look like we just have a refcount leak somewhere. > > > > > > > > > ... > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > ffff89209535f200 > > > > > nf_flags = 8, > > > > > nf_ref.refs.counter = 2 > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > ffff8980e15138c0 > > > > > nf_flags = 8, > > > > > nf_ref.refs.counter = 7 > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > ffff898107b95800 > > > > > nf_flags = 8, > > > > > nf_ref.refs.counter = 3 > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > ffff898107b95980 > > > > > nf_flags = 12, > > > > > > > > The others are NFSD_FILE_GC. This one is also > > > > NFSD_FILE_REFERENCED. Are > > > > these objects on the LRU? > > > > > > > > > nf_ref.refs.counter = 1 > > > > > > > > > > nfsd_file_do_acquire() -> nfsd_file_lookup_locked() relies on > > > > > the > > > > > hash > > > > > table to find the nfsd_file, > > > > > but I am still scratching my head why and how this happens. > > > > > > > > > > FYI, here is the patch I applied for testing: > > > > > > > > > > > > > The above suggests to me that there is a garden-variety > > > > refcount leak > > > > somewhere. Whenever some bit of the code takes a reference to > > > > an object > > > > (like a nfsd_file), it's implicitly required to eventually put > > > > that > > > > reference. Typically that means that the code needs to maintain > > > > a > > > > pointer to that object, as it can be unhashed at any time and > > > > it can't > > > > rely on finding the same object later. > > > > > > > > Do you have a reproducer for this? > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > index ad9083ca144b..22ebd7fb8639 100644 > > > > > --- a/fs/nfsd/filecache.c > > > > > +++ b/fs/nfsd/filecache.c > > > > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, struct > > > > > inode > > > > > *inode, unsigned char need, > > > > > return NULL; > > > > > > > > > > INIT_LIST_HEAD(&nf->nf_lru); > > > > > + INIT_LIST_HEAD(&nf->nf_gc); > > > > > nf->nf_birthtime = ktime_get(); > > > > > nf->nf_file = NULL; > > > > > nf->nf_cred = get_current_cred(); > > > > > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct list_head > > > > > *dispose) > > > > > struct nfsd_file *nf; > > > > > > > > > > while (!list_empty(dispose)) { > > > > > - nf = list_first_entry(dispose, struct > > > > > nfsd_file, > > > > > nf_lru); > > > > > - list_del_init(&nf->nf_lru); > > > > > + nf = list_first_entry(dispose, struct > > > > > nfsd_file, > > > > > nf_gc); > > > > > + list_del_init(&nf->nf_gc); > > > > > nfsd_file_free(nf); > > > > > } > > > > > } > > > > > @@ -411,12 +412,12 @@ nfsd_file_dispose_list_delayed(struct > > > > > list_head > > > > > *dispose) > > > > > { > > > > > while(!list_empty(dispose)) { > > > > > struct nfsd_file *nf = > > > > > list_first_entry(dispose, > > > > > - struct > > > > > nfsd_file, > > > > > nf_lru); > > > > > + struct > > > > > nfsd_file, > > > > > nf_gc); > > > > > struct nfsd_net *nn = net_generic(nf->nf_net, > > > > > nfsd_net_id); > > > > > struct nfsd_fcache_disposal *l = nn- > > > > > >fcache_disposal; > > > > > > > > > > spin_lock(&l->lock); > > > > > - list_move_tail(&nf->nf_lru, &l->freeme); > > > > > + list_move_tail(&nf->nf_gc, &l->freeme); > > > > > spin_unlock(&l->lock); > > > > > svc_wake_up(nn->nfsd_serv); > > > > > } > > > > > @@ -503,7 +504,8 @@ nfsd_file_lru_cb(struct list_head *item, > > > > > struct > > > > > list_lru_one *lru, > > > > > > > > > > /* Refcount went to zero. Unhash it and queue it to > > > > > the > > > > > dispose list */ > > > > > nfsd_file_unhash(nf); > > > > > - list_lru_isolate_move(lru, &nf->nf_lru, head); > > > > > + list_lru_isolate(lru, &nf->nf_lru); > > > > > + list_add(&nf->nf_gc, head); > > > > > this_cpu_inc(nfsd_file_evictions); > > > > > trace_nfsd_file_gc_disposed(nf); > > > > > return LRU_REMOVED; > > > > > @@ -578,7 +580,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > *nf, struct > > > > > list_head *dispose) > > > > > > > > > > /* If refcount goes to 0, then put on the dispose > > > > > list */ > > > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > > > - list_add(&nf->nf_lru, dispose); > > > > > + list_add(&nf->nf_gc, dispose); > > > > > trace_nfsd_file_closing(nf); > > > > > } > > > > > } > > > > > @@ -654,8 +656,8 @@ nfsd_file_close_inode_sync(struct inode > > > > > *inode) > > > > > > > > > > nfsd_file_queue_for_close(inode, &dispose); > > > > > while (!list_empty(&dispose)) { > > > > > - nf = list_first_entry(&dispose, struct > > > > > nfsd_file, > > > > > nf_lru); > > > > > - list_del_init(&nf->nf_lru); > > > > > + nf = list_first_entry(&dispose, struct > > > > > nfsd_file, > > > > > nf_gc); > > > > > + list_del_init(&nf->nf_gc); > > > > > nfsd_file_free(nf); > > > > > } > > > > > } > > > > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h > > > > > index c61884def906..3fbec24eea6c 100644 > > > > > --- a/fs/nfsd/filecache.h > > > > > +++ b/fs/nfsd/filecache.h > > > > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > > > > > > > struct nfsd_file_mark *nf_mark; > > > > > struct list_head nf_lru; > > > > > + struct list_head nf_gc; > > > > > struct rcu_head nf_rcu; > > > > > ktime_t nf_birthtime; > > > > > }; > > > > > > > > > > On Mon, Jul 8, 2024 at 10:55 AM Jeff Layton > > > > > <jlayton@kernel.org> > > > > > wrote: > > > > > > > > > > > > On Mon, 2024-07-08 at 10:23 -0400, Youzhong Yang wrote: > > > > > > > Thanks Jeff. > > > > > > > > > > > > > > I am ok with reverting the unhash/dispose list reordering > > > > > > > in > > > > > > > nfsd_file_lru_cb(), as it doesn't make much of a > > > > > > > difference, > > > > > > > but for nfsd_file_cond_queue(), imagining this: > > > > > > > > > > > > > > - A nfsd_file is hashed > > > > > > > - In nfsd_file_cond_queue(), [if (!nfsd_file_unhash(nf))] > > > > > > > will > > > > > > > get it > > > > > > > unhashed, doesn't it? > > > > > > > - It continues to get a reference by nfsd_file_get() > > > > > > > - It continues to remove itself from LRU by > > > > > > > nfsd_file_lru_remove() if > > > > > > > it is on the LRU. > > > > > > > - Now it runs refcount_sub_and_test(), what happens if > > > > > > > the refcnt > > > > > > > does > > > > > > > not go to 0? How can this nfsd_file be found again? > > > > > > > Through the > > > > > > > hash > > > > > > > table? Through the LRU walk? how? > > > > > > > > > > > > > > Thanks again. > > > > > > > > > > > > > > -Youzhong > > > > > > > > > > > > > > > > > > > It won't need to be found again. The holders of the extra > > > > > > references > > > > > > will put those references when they are finished. Since the > > > > > > object > > > > > > is > > > > > > no longer HASHED, nfsd_file_put just does this: > > > > > > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > So that should be fine. > > > > > > > > > > > > > On Mon, Jul 8, 2024 at 9:35 AM Jeff Layton > > > > > > > <jlayton@kernel.org> > > > > > > > wrote: > > > > > > > > > > > > > > > > On Mon, 2024-07-08 at 08:58 -0400, Youzhong Yang wrote: > > > > > > > > > Thank you Jeff for your invaluable insights. I was > > > > > > > > > leaning > > > > > > > > > towards > > > > > > > > > adding a new list_head too, and tested this approach > > > > > > > > > on > > > > > > > > > kernel > > > > > > > > > 6.6 by > > > > > > > > > continuously hammering the server with heavy nfs load > > > > > > > > > for the > > > > > > > > > last few > > > > > > > > > days, not a single leak. > > > > > > > > > > > > > > > > > > Here goes the patch (based on Linux kernel master > > > > > > > > > branch), > > > > > > > > > please > > > > > > > > > review: > > > > > > > > > > > > > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > Date: Thu, 4 Jul 2024 11:25:40 -0400 > > > > > > > > > Subject: [PATCH] nfsd: fix nfsd_file leaking due to > > > > > > > > > mixed use > > > > > > > > > of > > > > > > > > > nf->nf_lru > > > > > > > > > > > > > > > > > > nfsd_file_put() in one thread can race with another > > > > > > > > > thread > > > > > > > > > doing > > > > > > > > > garbage collection (running nfsd_file_gc() -> > > > > > > > > > list_lru_walk() > > > > > > > > > -> > > > > > > > > > nfsd_file_lru_cb()): > > > > > > > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries > > > > > > > > > to do > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > * nfsd_file_lru_add() returns true (with > > > > > > > > > NFSD_FILE_REFERENCED > > > > > > > > > bit set) > > > > > > > > > * garbage collector kicks in, nfsd_file_lru_cb() > > > > > > > > > clears > > > > > > > > > REFERENCED bit and > > > > > > > > > returns LRU_ROTATE. > > > > > > > > > * garbage collector kicks in again, > > > > > > > > > nfsd_file_lru_cb() now > > > > > > > > > decrements nf->nf_ref > > > > > > > > > to 0, runs nfsd_file_unhash(), removes it from > > > > > > > > > the LRU > > > > > > > > > and > > > > > > > > > adds to > > > > > > > > > the dispose > > > > > > > > > list [list_lru_isolate_move(lru, &nf->nf_lru, > > > > > > > > > head)] > > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is > > > > > > > > > cleared, > > > > > > > > > so > > > > > > > > > it > > > > > > > > > tries to remove > > > > > > > > > the 'nf' from the LRU [if > > > > > > > > > (!nfsd_file_lru_remove(nf))]. > > > > > > > > > The > > > > > > > > > 'nf' > > > > > > > > > has been added > > > > > > > > > to the 'dispose' list by nfsd_file_lru_cb(), so > > > > > > > > > nfsd_file_lru_remove(nf) simply > > > > > > > > > treats it as part of the LRU and removes it, > > > > > > > > > which leads > > > > > > > > > to > > > > > > > > > its removal from > > > > > > > > > the 'dispose' list. > > > > > > > > > * At this moment, 'nf' is unhashed with its nf_ref > > > > > > > > > being 0, > > > > > > > > > and > > > > > > > > > not > > > > > > > > > on the LRU. > > > > > > > > > nfsd_file_put() continues its execution [if > > > > > > > > > (refcount_dec_and_test(&nf->nf_ref))], > > > > > > > > > as nf->nf_ref is already 0, nf->nf_ref is set to > > > > > > > > > REFCOUNT_SATURATED, and the 'nf' > > > > > > > > > gets no chance of being freed. > > > > > > > > > > > > > > > > > > nfsd_file_put() can also race with > > > > > > > > > nfsd_file_cond_queue(): > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries > > > > > > > > > to do > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > * nfsd_file_lru_add() sets REFERENCED bit and > > > > > > > > > returns true. > > > > > > > > > * Some userland application runs 'exportfs -f' or > > > > > > > > > something > > > > > > > > > like > > > > > > > > > that, which triggers > > > > > > > > > __nfsd_file_cache_purge() -> > > > > > > > > > nfsd_file_cond_queue(). > > > > > > > > > * In nfsd_file_cond_queue(), it runs [if > > > > > > > > > (!nfsd_file_unhash(nf))], > > > > > > > > > unhash is done > > > > > > > > > successfully. > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > (!nfsd_file_get(nf))], > > > > > > > > > now > > > > > > > > > nf->nf_ref goes to 2. > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > (nfsd_file_lru_remove(nf))], > > > > > > > > > it succeeds. > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > (refcount_sub_and_test(decrement, > > > > > > > > > &nf->nf_ref))] > > > > > > > > > (with "decrement" being 2), so the nf->nf_ref > > > > > > > > > goes to 0, > > > > > > > > > the > > > > > > > > > 'nf' > > > > > > > > > is added to the > > > > > > > > > dispose list [list_add(&nf->nf_lru, dispose)] > > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is > > > > > > > > > cleared, > > > > > > > > > so > > > > > > > > > it > > > > > > > > > tries to remove > > > > > > > > > the 'nf' from the LRU [if > > > > > > > > > (!nfsd_file_lru_remove(nf))], > > > > > > > > > although > > > > > > > > > the 'nf' is not > > > > > > > > > in the LRU, but it is linked in the 'dispose' > > > > > > > > > list, > > > > > > > > > nfsd_file_lru_remove() simply > > > > > > > > > treats it as part of the LRU and removes it. This > > > > > > > > > leads > > > > > > > > > to > > > > > > > > > its removal from > > > > > > > > > the 'dispose' list! > > > > > > > > > * Now nf->ref is 0, unhashed. nfsd_file_put() > > > > > > > > > continues its > > > > > > > > > execution and set > > > > > > > > > nf->nf_ref to REFCOUNT_SATURATED. > > > > > > > > > > > > > > > > > > As shown in the above analysis, using nf_lru for both > > > > > > > > > the LRU > > > > > > > > > list and > > > > > > > > > dispose list > > > > > > > > > can cause the leaks. This patch adds a new list_head > > > > > > > > > nf_gc in > > > > > > > > > struct > > > > > > > > > nfsd_file, and uses > > > > > > > > > it for the dispose list. It's not expected to have a > > > > > > > > > nfsd_file > > > > > > > > > unhashed but it's not > > > > > > > > > added to the dispose list, so in > > > > > > > > > nfsd_file_cond_queue() and > > > > > > > > > nfsd_file_lru_cb() nfsd_file > > > > > > > > > is unhashed after being added to the dispose list. > > > > > > > > > > > > > > > > > > > > > > > > > I don't see where we require the object to be either > > > > > > > > hashed or > > > > > > > > on > > > > > > > > the > > > > > > > > dispose list. I think you probably just want to do a > > > > > > > > patch > > > > > > > > that > > > > > > > > changes the dispose list to use a dedicated list_head > > > > > > > > without > > > > > > > > reordering when the these things are unhashed. > > > > > > > > > > > > > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > --- > > > > > > > > > fs/nfsd/filecache.c | 23 ++++++++++++++--------- > > > > > > > > > fs/nfsd/filecache.h | 1 + > > > > > > > > > 2 files changed, 15 insertions(+), 9 deletions(-) > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c > > > > > > > > > b/fs/nfsd/filecache.c > > > > > > > > > index ad9083ca144b..3aef2ddfce94 100644 > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, > > > > > > > > > struct > > > > > > > > > inode > > > > > > > > > *inode, unsigned char need, > > > > > > > > > return NULL; > > > > > > > > > > > > > > > > > > INIT_LIST_HEAD(&nf->nf_lru); > > > > > > > > > + INIT_LIST_HEAD(&nf->nf_gc); > > > > > > > > > nf->nf_birthtime = ktime_get(); > > > > > > > > > nf->nf_file = NULL; > > > > > > > > > nf->nf_cred = get_current_cred(); > > > > > > > > > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct > > > > > > > > > list_head > > > > > > > > > *dispose) > > > > > > > > > struct nfsd_file *nf; > > > > > > > > > > > > > > > > > > while (!list_empty(dispose)) { > > > > > > > > > - nf = list_first_entry(dispose, struct > > > > > > > > > nfsd_file, > > > > > > > > > nf_lru); > > > > > > > > > - list_del_init(&nf->nf_lru); > > > > > > > > > + nf = list_first_entry(dispose, struct > > > > > > > > > nfsd_file, > > > > > > > > > nf_gc); > > > > > > > > > + list_del_init(&nf->nf_gc); > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > } > > > > > > > > > } > > > > > > > > > @@ -411,12 +412,12 @@ > > > > > > > > > nfsd_file_dispose_list_delayed(struct > > > > > > > > > list_head *dispose) > > > > > > > > > { > > > > > > > > > while(!list_empty(dispose)) { > > > > > > > > > struct nfsd_file *nf = > > > > > > > > > list_first_entry(dispose, > > > > > > > > > - > > > > > > > > > struct > > > > > > > > > nfsd_file, > > > > > > > > > nf_lru); > > > > > > > > > + > > > > > > > > > struct > > > > > > > > > nfsd_file, > > > > > > > > > nf_gc); > > > > > > > > > struct nfsd_net *nn = net_generic(nf- > > > > > > > > > >nf_net, > > > > > > > > > nfsd_net_id); > > > > > > > > > struct nfsd_fcache_disposal *l = nn- > > > > > > > > > > fcache_disposal; > > > > > > > > > > > > > > > > > > spin_lock(&l->lock); > > > > > > > > > - list_move_tail(&nf->nf_lru, &l- > > > > > > > > > >freeme); > > > > > > > > > + list_move_tail(&nf->nf_gc, &l- > > > > > > > > > >freeme); > > > > > > > > > spin_unlock(&l->lock); > > > > > > > > > svc_wake_up(nn->nfsd_serv); > > > > > > > > > } > > > > > > > > > @@ -502,8 +503,10 @@ nfsd_file_lru_cb(struct > > > > > > > > > list_head *item, > > > > > > > > > struct > > > > > > > > > list_lru_one *lru, > > > > > > > > > } > > > > > > > > > > > > > > > > > > /* Refcount went to zero. Unhash it and queue > > > > > > > > > it to > > > > > > > > > the > > > > > > > > > dispose list */ > > > > > > > > > + list_lru_isolate(lru, &nf->nf_lru); > > > > > > > > > + list_add(&nf->nf_gc, head); > > > > > > > > > + /* Unhash after removing from LRU and adding > > > > > > > > > to > > > > > > > > > dispose > > > > > > > > > list */ > > > > > > > > > nfsd_file_unhash(nf); > > > > > > > > > - list_lru_isolate_move(lru, &nf->nf_lru, > > > > > > > > > head); > > > > > > > > > > > > > > > > I don't see the point in reordering these operations. > > > > > > > > Hashing > > > > > > > > is > > > > > > > > all > > > > > > > > about making the thing findable by nfsd operations. The > > > > > > > > _last_ > > > > > > > > thing we > > > > > > > > want to do is put it on the dispose list while the > > > > > > > > thing can > > > > > > > > still > > > > > > > > be > > > > > > > > found by nfsd threads doing operations. > > > > > > > > > > > > > > > > > this_cpu_inc(nfsd_file_evictions); > > > > > > > > > trace_nfsd_file_gc_disposed(nf); > > > > > > > > > return LRU_REMOVED; > > > > > > > > > @@ -565,7 +568,7 @@ nfsd_file_cond_queue(struct > > > > > > > > > nfsd_file > > > > > > > > > *nf, > > > > > > > > > struct > > > > > > > > > list_head *dispose) > > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > > > /* If we raced with someone else unhashing, > > > > > > > > > ignore it > > > > > > > > > */ > > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf- > > > > > > > > > >nf_flags)) > > > > > > > > > > > > > > > > The above change looks wrong. I don't think we need to > > > > > > > > change > > > > > > > > this. > > > > > > > > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > /* If we can't get a reference, ignore it */ > > > > > > > > > @@ -578,7 +581,9 @@ nfsd_file_cond_queue(struct > > > > > > > > > nfsd_file > > > > > > > > > *nf, > > > > > > > > > struct > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > > > > > /* If refcount goes to 0, then put on the > > > > > > > > > dispose > > > > > > > > > list */ > > > > > > > > > if (refcount_sub_and_test(decrement, &nf- > > > > > > > > > >nf_ref)) { > > > > > > > > > - list_add(&nf->nf_lru, dispose); > > > > > > > > > + list_add(&nf->nf_gc, dispose); > > > > > > > > > + /* Unhash after adding to dispose > > > > > > > > > list */ > > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > > > > > > > > This too looks wrong? Maybe I'm unclear on the race > > > > > > > > you're > > > > > > > > trying > > > > > > > > to > > > > > > > > fix with this? What's the harm in unhashing it early? > > > > > > > > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > } > > > > > > > > > } > > > > > > > > > @@ -654,8 +659,8 @@ nfsd_file_close_inode_sync(struct > > > > > > > > > inode > > > > > > > > > *inode) > > > > > > > > > > > > > > > > > > nfsd_file_queue_for_close(inode, &dispose); > > > > > > > > > while (!list_empty(&dispose)) { > > > > > > > > > - nf = list_first_entry(&dispose, > > > > > > > > > struct > > > > > > > > > nfsd_file, > > > > > > > > > nf_lru); > > > > > > > > > - list_del_init(&nf->nf_lru); > > > > > > > > > + nf = list_first_entry(&dispose, > > > > > > > > > struct > > > > > > > > > nfsd_file, > > > > > > > > > nf_gc); > > > > > > > > > + list_del_init(&nf->nf_gc); > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > } > > > > > > > > > } > > > > > > > > > diff --git a/fs/nfsd/filecache.h > > > > > > > > > b/fs/nfsd/filecache.h > > > > > > > > > index c61884def906..3fbec24eea6c 100644 > > > > > > > > > --- a/fs/nfsd/filecache.h > > > > > > > > > +++ b/fs/nfsd/filecache.h > > > > > > > > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > > > > > > > > > > > > > > > struct nfsd_file_mark *nf_mark; > > > > > > > > > struct list_head nf_lru; > > > > > > > > > + struct list_head nf_gc; > > > > > > > > > struct rcu_head nf_rcu; > > > > > > > > > ktime_t nf_birthtime; > > > > > > > > > }; > > > > > > > > > -- > > > > > > > > > 2.34.1 > > > > > > > > > > > > > > > > > > On Thu, Jul 4, 2024 at 7:14 AM Jeff Layton > > > > > > > > > <jlayton@poochiereds.net> wrote: > > > > > > > > > > > > > > > > > > > > On Wed, 2024-07-03 at 16:46 -0400, Youzhong Yang > > > > > > > > > > wrote: > > > > > > > > > > > Thank you Chuck. Here are my quick answers to > > > > > > > > > > > your > > > > > > > > > > > comments: > > > > > > > > > > > > > > > > > > > > > > - I don't have a quick reproducer. I reproduced > > > > > > > > > > > it by > > > > > > > > > > > using > > > > > > > > > > > hundreds > > > > > > > > > > > of nfs clients generating +600K ops under our > > > > > > > > > > > workload in > > > > > > > > > > > the > > > > > > > > > > > testing > > > > > > > > > > > environment. Theoretically it should be possible > > > > > > > > > > > to > > > > > > > > > > > simplify > > > > > > > > > > > the > > > > > > > > > > > reproduction but I am still working on it. > > > > > > > > > > > > > > > > > > > > > > - I understand zfs is an out-of-tree file > > > > > > > > > > > system. That's > > > > > > > > > > > fine. But > > > > > > > > > > > this leaking can happen to any file system, and > > > > > > > > > > > leaking > > > > > > > > > > > is > > > > > > > > > > > not a good > > > > > > > > > > > thing no matter what file system it is. > > > > > > > > > > > > > > > > > > > > > > - I will try to come up with a reproducer using > > > > > > > > > > > xfs or > > > > > > > > > > > btrfs > > > > > > > > > > > if possible. > > > > > > > > > > > > > > > > > > > > > > Now back to the problem itself, here are my > > > > > > > > > > > findings: > > > > > > > > > > > > > > > > > > > > > > - nfsd_file_put() in one thread can race with > > > > > > > > > > > another > > > > > > > > > > > thread > > > > > > > > > > > doing > > > > > > > > > > > garbage collection (running nfsd_file_gc() -> > > > > > > > > > > > list_lru_walk() > > > > > > > > > > > -> > > > > > > > > > > > nfsd_file_lru_cb()): > > > > > > > > > > > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it > > > > > > > > > > > tries to > > > > > > > > > > > do > > > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > > > * nfsd_file_lru_add() returns true (thus > > > > > > > > > > > NFSD_FILE_REFERENCED bit > > > > > > > > > > > set for nf->nf_flags) > > > > > > > > > > > * garbage collector kicks in, > > > > > > > > > > > nfsd_file_lru_cb() clears > > > > > > > > > > > REFERENCED > > > > > > > > > > > bit and returns LRU_ROTATE. > > > > > > > > > > > * garbage collector kicks in again, > > > > > > > > > > > nfsd_file_lru_cb() > > > > > > > > > > > now > > > > > > > > > > > decrements nf->nf_ref to 0, runs > > > > > > > > > > > nfsd_file_unhash(), > > > > > > > > > > > removes > > > > > > > > > > > it from > > > > > > > > > > > the LRU and adds to the dispose list > > > > > > > > > > > [list_lru_isolate_move(lru, > > > > > > > > > > > &nf->nf_lru, head);] > > > > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit > > > > > > > > > > > is > > > > > > > > > > > cleared, > > > > > > > > > > > so it > > > > > > > > > > > tries to remove the 'nf' from the LRU [if > > > > > > > > > > > (!nfsd_file_lru_remove(nf))] > > > > > > > > > > > * The 'nf' has been added to the 'dispose' list > > > > > > > > > > > by > > > > > > > > > > > nfsd_file_lru_cb(), so nfsd_file_lru_remove(nf) > > > > > > > > > > > simply > > > > > > > > > > > treats > > > > > > > > > > > it as > > > > > > > > > > > part of the LRU and removes it, which leads it to > > > > > > > > > > > be > > > > > > > > > > > removed > > > > > > > > > > > from the > > > > > > > > > > > 'dispose' list. > > > > > > > > > > > * At this moment, nf->nf_ref is 0, it's > > > > > > > > > > > unhashed, and > > > > > > > > > > > not > > > > > > > > > > > on the > > > > > > > > > > > LRU. nfsd_file_put() continues its execution [if > > > > > > > > > > > (refcount_dec_and_test(&nf->nf_ref))], as nf- > > > > > > > > > > > >nf_ref is > > > > > > > > > > > already 0, now > > > > > > > > > > > bad thing happens: nf->nf_ref is set to > > > > > > > > > > > REFCOUNT_SATURATED, > > > > > > > > > > > and the > > > > > > > > > > > 'nf' is leaked. > > > > > > > > > > > > > > > > > > > > > > To make this happen, the right timing is crucial. > > > > > > > > > > > It can > > > > > > > > > > > be > > > > > > > > > > > reproduced > > > > > > > > > > > by adding artifical delays in filecache.c, or > > > > > > > > > > > hammering > > > > > > > > > > > the > > > > > > > > > > > nfsd with > > > > > > > > > > > tons of ops. > > > > > > > > > > > > > > > > > > > > > > - Let's see how nfsd_file_put() can race with > > > > > > > > > > > nfsd_file_cond_queue(): > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it > > > > > > > > > > > tries to > > > > > > > > > > > do > > > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > > > * nfsd_file_lru_add() sets REFERENCED bit and > > > > > > > > > > > returns > > > > > > > > > > > true. > > > > > > > > > > > * 'exportfs -f' or something like that triggers > > > > > > > > > > > __nfsd_file_cache_purge() -> > > > > > > > > > > > nfsd_file_cond_queue(). > > > > > > > > > > > * In nfsd_file_cond_queue(), it runs [if > > > > > > > > > > > (!nfsd_file_unhash(nf))], > > > > > > > > > > > unhash is done successfully. > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > (!nfsd_file_get(nf))], > > > > > > > > > > > now > > > > > > > > > > > nf->nf_ref goes to 2. > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > (nfsd_file_lru_remove(nf))], it succeeds. > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > (refcount_sub_and_test(decrement, > > > > > > > > > > > &nf->nf_ref))] (with "decrement" being 2), so the > > > > > > > > > > > nf- > > > > > > > > > > > > nf_ref > > > > > > > > > > > goes to > > > > > > > > > > > 0, the 'nf' is added to the dispost list > > > > > > > > > > > [list_add(&nf- > > > > > > > > > > > > nf_lru, > > > > > > > > > > > dispose)] > > > > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit > > > > > > > > > > > is > > > > > > > > > > > cleared, > > > > > > > > > > > so it > > > > > > > > > > > tries to remove the 'nf' from the LRU [if > > > > > > > > > > > (!nfsd_file_lru_remove(nf))], although the 'nf' > > > > > > > > > > > is not in > > > > > > > > > > > the > > > > > > > > > > > LRU, but > > > > > > > > > > > it is linked in the 'dispose' list, > > > > > > > > > > > nfsd_file_lru_remove() > > > > > > > > > > > simply > > > > > > > > > > > treats it as part of the LRU and removes it. This > > > > > > > > > > > leads > > > > > > > > > > > to > > > > > > > > > > > its removal > > > > > > > > > > > from the 'dispose' list! > > > > > > > > > > > * Now nf->ref is 0, unhashed. nfsd_file_put() > > > > > > > > > > > continues > > > > > > > > > > > its > > > > > > > > > > > execution and sets nf->nf_ref to > > > > > > > > > > > REFCOUNT_SATURATED. > > > > > > > > > > > > > > > > > > > > > > The purpose of nf->nf_lru is problematic. As you > > > > > > > > > > > can see, > > > > > > > > > > > it > > > > > > > > > > > is used > > > > > > > > > > > for the LRU list, and also the 'dispose' list. > > > > > > > > > > > Adding > > > > > > > > > > > another > > > > > > > > > > > 'struct > > > > > > > > > > > list_head' specifically for the 'dispose' list > > > > > > > > > > > seems to > > > > > > > > > > > be a > > > > > > > > > > > better > > > > > > > > > > > way of fixing this race condition. Either way > > > > > > > > > > > works for > > > > > > > > > > > me. > > > > > > > > > > > > > > > > > > > > > > Would you agree my above analysis makes sense? > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think so. It's been a while since I've done much > > > > > > > > > > work in > > > > > > > > > > this > > > > > > > > > > code, > > > > > > > > > > but it does sound like there is a race in the LRU > > > > > > > > > > handling. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Like Chuck said, the nf->nf_lru list should be safe > > > > > > > > > > to use > > > > > > > > > > for > > > > > > > > > > multiple > > > > > > > > > > purposes, but that's only the case if we're not > > > > > > > > > > using that > > > > > > > > > > list > > > > > > > > > > as an > > > > > > > > > > indicator. > > > > > > > > > > > > > > > > > > > > The list_lru code does check this: > > > > > > > > > > > > > > > > > > > > if (!list_empty(item)) { > > > > > > > > > > > > > > > > > > > > ...so if we ever check this while it's sitting on > > > > > > > > > > the > > > > > > > > > > dispose > > > > > > > > > > list, it > > > > > > > > > > will handle it incorrectly. It sounds like that's > > > > > > > > > > the root > > > > > > > > > > cause of the > > > > > > > > > > problem you're seeing? > > > > > > > > > > > > > > > > > > > > If so, then maybe a separate list_head for disposal > > > > > > > > > > would > > > > > > > > > > be > > > > > > > > > > better. > > > > > > > > > > > > > > > > > > > > > Here is my patch with signed-off-by: > > > > > > > > > > > > > > > > > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > > > Date: Mon, 1 Jul 2024 06:45:22 -0400 > > > > > > > > > > > Subject: [PATCH] nfsd: fix nfsd_file leaking due > > > > > > > > > > > to race > > > > > > > > > > > condition and early > > > > > > > > > > > unhash > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > > > --- > > > > > > > > > > > fs/nfsd/filecache.c | 14 +++++++++++++- > > > > > > > > > > > 1 file changed, 13 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c > > > > > > > > > > > b/fs/nfsd/filecache.c > > > > > > > > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct > > > > > > > > > > > nfsd_file *nf) > > > > > > > > > > > if > > > > > > > > > > > (!nfsd_file_lru_remove(nf)) > > > > > > > > > > > return; > > > > > > > > > > > } > > > > > > > > > > > + /* > > > > > > > > > > > + * Racing with > > > > > > > > > > > nfsd_file_cond_queue() or > > > > > > > > > > > nfsd_file_lru_cb(), > > > > > > > > > > > + * it's unhashed but then removed > > > > > > > > > > > from > > > > > > > > > > > the > > > > > > > > > > > dispose list, > > > > > > > > > > > + * so we need to free it. > > > > > > > > > > > + */ > > > > > > > > > > > + if (refcount_read(&nf->nf_ref) == > > > > > > > > > > > 0 && > > > > > > > > > > > > > > > > > > > > A refcount_read in this path is a red flag to me. > > > > > > > > > > Anytime > > > > > > > > > > you're just > > > > > > > > > > looking at the refcount without changing anything > > > > > > > > > > just > > > > > > > > > > screams > > > > > > > > > > out > > > > > > > > > > "race condition". > > > > > > > > > > > > > > > > > > > > In this case, what guarantee is there that this > > > > > > > > > > won't run > > > > > > > > > > afoul > > > > > > > > > > of the > > > > > > > > > > timing? We could check this and find out it's 1 > > > > > > > > > > just before > > > > > > > > > > it > > > > > > > > > > goes to > > > > > > > > > > 0 and you check the other conditions. > > > > > > > > > > > > > > > > > > > > Does anything prevent that? > > > > > > > > > > > > > > > > > > > > > + !test_bit(NFSD_FILE_HASHED, > > > > > > > > > > > &nf- > > > > > > > > > > > > nf_flags) && > > > > > > > > > > > + list_empty(&nf->nf_lru)) { > > > > > > > > > > > + nfsd_file_free(nf); > > > > > > > > > > > + return; > > > > > > > > > > > + } > > > > > > > > > > > } > > > > > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct > > > > > > > > > > > nfsd_file > > > > > > > > > > > *nf, struct > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > > > > > > > /* If we raced with someone else > > > > > > > > > > > unhashing, > > > > > > > > > > > ignore it > > > > > > > > > > > */ > > > > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf- > > > > > > > > > > > >nf_flags)) > > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > > > Same here: you're just testing for the HASHED bit, > > > > > > > > > > but > > > > > > > > > > could > > > > > > > > > > this also > > > > > > > > > > race with someone who is setting it just after you > > > > > > > > > > get > > > > > > > > > > here. > > > > > > > > > > Why is > > > > > > > > > > that not a problem? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we can't get a reference, ignore it > > > > > > > > > > > */ > > > > > > > > > > > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct > > > > > > > > > > > nfsd_file > > > > > > > > > > > *nf, struct > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > /* If refcount goes to 0, then put on the > > > > > > > > > > > dispose > > > > > > > > > > > list */ > > > > > > > > > > > if (refcount_sub_and_test(decrement, &nf- > > > > > > > > > > > > nf_ref)) { > > > > > > > > > > > list_add(&nf->nf_lru, dispose); > > > > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > > > } > > > > > > > > > > > } > > > > > > > > > > > -- > > > > > > > > > > > 2.43.0 > > > > > > > > > > > > > > > > > > > > > > On Wed, Jul 3, 2024 at 2:21 PM Chuck Lever > > > > > > > > > > > <chuck.lever@oracle.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Jul 03, 2024 at 10:12:33AM -0400, > > > > > > > > > > > > Youzhong Yang > > > > > > > > > > > > wrote: > > > > > > > > > > > > > Hello, > > > > > > > > > > > > > > > > > > > > > > > > > > I'd like to report a nfsd_file leaking issue > > > > > > > > > > > > > and > > > > > > > > > > > > > propose > > > > > > > > > > > > > a fix for it. > > > > > > > > > > > > > > > > > > > > > > > > > > When I tested Linux kernel 6.8 and 6.6, I > > > > > > > > > > > > > noticed > > > > > > > > > > > > > nfsd_file leaks > > > > > > > > > > > > > which led to undestroyable file systems > > > > > > > > > > > > > (zfs), > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the report. Some initial comments: > > > > > > > > > > > > > > > > > > > > > > > > - Do you have a specific reproducer? In other > > > > > > > > > > > > words, > > > > > > > > > > > > what > > > > > > > > > > > > is the > > > > > > > > > > > > simplest program that can run on an NFS > > > > > > > > > > > > client that > > > > > > > > > > > > will > > > > > > > > > > > > trigger > > > > > > > > > > > > this leak, and can you post it? > > > > > > > > > > > > > > > > > > > > > > > > - "zfs" is an out-of-tree file system, so it's > > > > > > > > > > > > not > > > > > > > > > > > > directly > > > > > > > > > > > > supported for NFSD. > > > > > > > > > > > > > > > > > > > > > > > > - The guidelines for patch submission require > > > > > > > > > > > > us to fix > > > > > > > > > > > > issues in > > > > > > > > > > > > upstream Linux first (currently that's v6.10- > > > > > > > > > > > > rc6). > > > > > > > > > > > > Then > > > > > > > > > > > > that fix > > > > > > > > > > > > can be backported to older stable kernels > > > > > > > > > > > > like 6.6. > > > > > > > > > > > > > > > > > > > > > > > > Can you reproduce the leak with one of the in- > > > > > > > > > > > > kernel > > > > > > > > > > > > filesystems > > > > > > > > > > > > (either xfs or btrfs would be great) and with > > > > > > > > > > > > NFSD in > > > > > > > > > > > > 6.10- > > > > > > > > > > > > rc6? > > > > > > > > > > > > > > > > > > > > > > > > One more comment below. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > here are some examples: > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct nfsd_file -x ffff88e160db0460 > > > > > > > > > > > > > struct nfsd_file { > > > > > > > > > > > > > nf_rlist = { > > > > > > > > > > > > > rhead = { > > > > > > > > > > > > > next = 0xffff8921fa2392f1 > > > > > > > > > > > > > }, > > > > > > > > > > > > > next = 0x0 > > > > > > > > > > > > > }, > > > > > > > > > > > > > nf_inode = 0xffff8882bc312ef8, > > > > > > > > > > > > > nf_file = 0xffff88e2015b1500, > > > > > > > > > > > > > nf_cred = 0xffff88e3ab0e7800, > > > > > > > > > > > > > nf_net = 0xffffffff83d41600 <init_net>, > > > > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > > > > nf_ref = { > > > > > > > > > > > > > refs = { > > > > > > > > > > > > > counter = 0xc0000000 > > > > > > > > > > > > > } > > > > > > > > > > > > > }, > > > > > > > > > > > > > nf_may = 0x4, > > > > > > > > > > > > > nf_mark = 0xffff88e1bddfb320, > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > next = 0xffff88e160db04a8, > > > > > > > > > > > > > prev = 0xffff88e160db04a8 > > > > > > > > > > > > > }, > > > > > > > > > > > > > nf_rcu = { > > > > > > > > > > > > > next = 0x10000000000, > > > > > > > > > > > > > func = 0x0 > > > > > > > > > > > > > }, > > > > > > > > > > > > > nf_birthtime = 0x73d22fc1728 > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru > > > > > > > > > > > > > ,nf_file > > > > > > > > > > > > > -x > > > > > > > > > > > > > ffff88839a53d850 > > > > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > next = 0xffff88839a53d898, > > > > > > > > > > > > > prev = 0xffff88839a53d898 > > > > > > > > > > > > > }, > > > > > > > > > > > > > nf_file = 0xffff88810ede8700, > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru > > > > > > > > > > > > > ,nf_file > > > > > > > > > > > > > -x > > > > > > > > > > > > > ffff88c32b11e850 > > > > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > next = 0xffff88c32b11e898, > > > > > > > > > > > > > prev = 0xffff88c32b11e898 > > > > > > > > > > > > > }, > > > > > > > > > > > > > nf_file = 0xffff88c20a701c00, > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru > > > > > > > > > > > > > ,nf_file > > > > > > > > > > > > > -x > > > > > > > > > > > > > ffff88e372709700 > > > > > > > > > > > > > nf_flags = 0xc, > > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > next = 0xffff88e372709748, > > > > > > > > > > > > > prev = 0xffff88e372709748 > > > > > > > > > > > > > }, > > > > > > > > > > > > > nf_file = 0xffff88e0725e6400, > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru > > > > > > > > > > > > > ,nf_file > > > > > > > > > > > > > -x > > > > > > > > > > > > > ffff8982864944d0 > > > > > > > > > > > > > nf_flags = 0xc, > > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > next = 0xffff898286494518, > > > > > > > > > > > > > prev = 0xffff898286494518 > > > > > > > > > > > > > }, > > > > > > > > > > > > > nf_file = 0xffff89803c0ff700, > > > > > > > > > > > > > > > > > > > > > > > > > > The leak occurs when nfsd_file_put() races > > > > > > > > > > > > > with > > > > > > > > > > > > > nfsd_file_cond_queue() > > > > > > > > > > > > > or nfsd_file_lru_cb(). With the following > > > > > > > > > > > > > patch, I > > > > > > > > > > > > > haven't observed > > > > > > > > > > > > > any leak after a few days heavy nfs load: > > > > > > > > > > > > > > > > > > > > > > > > Our patch submission guidelines require a > > > > > > > > > > > > Signed-off- > > > > > > > > > > > > by: > > > > > > > > > > > > line at the end of the patch description. See > > > > > > > > > > > > the "Sign > > > > > > > > > > > > your work - > > > > > > > > > > > > the Developer's Certificate of Origin" section > > > > > > > > > > > > of > > > > > > > > > > > > > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.10-rc6 > > > > > > > > > > > > > > > > > > > > > > > > (Needed here in case your fix is acceptable). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c > > > > > > > > > > > > > b/fs/nfsd/filecache.c > > > > > > > > > > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > *nf) > > > > > > > > > > > > > if (!nfsd_file_lru_remove(nf)) > > > > > > > > > > > > > return; > > > > > > > > > > > > > } > > > > > > > > > > > > > + /* > > > > > > > > > > > > > + * Racing with nfsd_file_cond_queue() or > > > > > > > > > > > > > nfsd_file_lru_cb(), > > > > > > > > > > > > > + * it's unhashed but then removed from the > > > > > > > > > > > > > dispose > > > > > > > > > > > > > list, > > > > > > > > > > > > > + * so we need to free it. > > > > > > > > > > > > > + */ > > > > > > > > > > > > > + if (refcount_read(&nf->nf_ref) == 0 && > > > > > > > > > > > > > + !test_bit(NFSD_FILE_HASHED, &nf- > > > > > > > > > > > > > >nf_flags) && > > > > > > > > > > > > > + list_empty(&nf->nf_lru)) { > > > > > > > > > > > > > + nfsd_file_free(nf); > > > > > > > > > > > > > + return; > > > > > > > > > > > > > + } > > > > > > > > > > > > > } > > > > > > > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > > @@ -576,7 +587,7 @@ > > > > > > > > > > > > > nfsd_file_cond_queue(struct > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > *nf, struct > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we raced with someone else unhashing, > > > > > > > > > > > > > ignore > > > > > > > > > > > > > it > > > > > > > > > > > > > */ > > > > > > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf- > > > > > > > > > > > > > >nf_flags)) > > > > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we can't get a reference, ignore it > > > > > > > > > > > > > */ > > > > > > > > > > > > > @@ -590,6 +601,7 @@ > > > > > > > > > > > > > nfsd_file_cond_queue(struct > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > *nf, struct > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > /* If refcount goes to 0, then put on the > > > > > > > > > > > > > dispose > > > > > > > > > > > > > list > > > > > > > > > > > > > */ > > > > > > > > > > > > > if (refcount_sub_and_test(decrement, &nf- > > > > > > > > > > > > > >nf_ref)) > > > > > > > > > > > > > { > > > > > > > > > > > > > list_add(&nf->nf_lru, dispose); > > > > > > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > > > > > } > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > Please kindly review the patch and let me > > > > > > > > > > > > > know if it > > > > > > > > > > > > > makes sense. > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > > > > > > > > > -Youzhong > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > Chuck Lever > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > Jeff Layton <jlayton@poochiereds.net> > > > > > > > > > > > > > > > > -- > > > > > > > > Jeff Layton <jlayton@poochiereds.net> > > > > > > > > > > > > > > > > -- > > > > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > > > > > > > -- > > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > > > > > > > > -- > > > > Jeff Layton <jlayton@kernel.org> > > > > -- > > Jeff Layton <jlayton@kernel.org>
Thanks Jeff. Just sent out a separate e-mail. I have a simple reproducer now, it's a C program, I will upload it to a github repository together with a readme file and then post the link (if allowed) here. On Wed, Jul 10, 2024 at 9:40 AM Jeff Layton <jlayton@kernel.org> wrote: > > Can you send it as a separate [PATCH] email? I think your MUA is > mangling the patches as I couldn't apply it earlier. Either use git- > send-email or have a look over this page: > > > https://www.kernel.org/doc/html/latest/process/email-clients.html#email-clients > > Thanks! > Jeff > > On Wed, 2024-07-10 at 09:33 -0400, Youzhong Yang wrote: > > Thanks. Here it is again: > > > > commit c6e69cebc052cb82d91fc81a98aee20749fe8d47 > > Author: Youzhong Yang <youzhong@gmail.com> > > Date: Thu Jul 4 11:25:40 2024 -0400 > > > > nfsd: fix nfsd_file leaking due to mixed use of nf->nf_lru > > > > nfsd_file_put() in one thread can race with another thread doing > > garbage collection (running nfsd_file_gc() -> list_lru_walk() -> > > nfsd_file_lru_cb()): > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do > > nfsd_file_lru_add(). > > * nfsd_file_lru_add() returns true (with NFSD_FILE_REFERENCED > > bit set) > > * garbage collector kicks in, nfsd_file_lru_cb() clears > > REFERENCED bit and > > returns LRU_ROTATE. > > * garbage collector kicks in again, nfsd_file_lru_cb() now > > decrements nf->nf_ref > > to 0, runs nfsd_file_unhash(), removes it from the LRU and > > adds to the dispose > > list [list_lru_isolate_move(lru, &nf->nf_lru, head)] > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so > > it > > tries to remove > > the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))]. The > > 'nf' has been added > > to the 'dispose' list by nfsd_file_lru_cb(), so > > nfsd_file_lru_remove(nf) simply > > treats it as part of the LRU and removes it, which leads to > > its removal from > > the 'dispose' list. > > * At this moment, 'nf' is unhashed with its nf_ref being 0, and > > not on the LRU. > > nfsd_file_put() continues its execution [if > > (refcount_dec_and_test(&nf->nf_ref))], > > as nf->nf_ref is already 0, nf->nf_ref is set to > > REFCOUNT_SATURATED, and the 'nf' > > gets no chance of being freed. > > > > nfsd_file_put() can also race with nfsd_file_cond_queue(): > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do > > nfsd_file_lru_add(). > > * nfsd_file_lru_add() sets REFERENCED bit and returns true. > > * Some userland application runs 'exportfs -f' or something > > like > > that, which triggers > > __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). > > * In nfsd_file_cond_queue(), it runs [if > > (!nfsd_file_unhash(nf))], unhash is done > > successfully. > > * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], now > > nf->nf_ref goes to 2. > > * nfsd_file_cond_queue() runs [if (nfsd_file_lru_remove(nf))], > > it succeeds. > > * nfsd_file_cond_queue() runs [if > > (refcount_sub_and_test(decrement, &nf->nf_ref))] > > (with "decrement" being 2), so the nf->nf_ref goes to 0, the > > 'nf' is added to the > > dispose list [list_add(&nf->nf_lru, dispose)] > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so > > it > > tries to remove > > the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))], > > although the 'nf' is not > > in the LRU, but it is linked in the 'dispose' list, > > nfsd_file_lru_remove() simply > > treats it as part of the LRU and removes it. This leads to > > its > > removal from > > the 'dispose' list! > > * Now nf->ref is 0, unhashed. nfsd_file_put() continues its > > execution and set > > nf->nf_ref to REFCOUNT_SATURATED. > > > > As shown in the above analysis, using nf_lru for both the LRU > > list > > and dispose list > > can cause the leaks. This patch adds a new list_head nf_gc in > > struct nfsd_file, and uses > > it for the dispose list. > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > index ad9083ca144b..22ebd7fb8639 100644 > > --- a/fs/nfsd/filecache.c > > +++ b/fs/nfsd/filecache.c > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, struct inode > > *inode, unsigned char need, > > return NULL; > > > > INIT_LIST_HEAD(&nf->nf_lru); > > + INIT_LIST_HEAD(&nf->nf_gc); > > nf->nf_birthtime = ktime_get(); > > nf->nf_file = NULL; > > nf->nf_cred = get_current_cred(); > > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct list_head *dispose) > > struct nfsd_file *nf; > > > > while (!list_empty(dispose)) { > > - nf = list_first_entry(dispose, struct nfsd_file, > > nf_lru); > > - list_del_init(&nf->nf_lru); > > + nf = list_first_entry(dispose, struct nfsd_file, > > nf_gc); > > + list_del_init(&nf->nf_gc); > > nfsd_file_free(nf); > > } > > } > > @@ -411,12 +412,12 @@ nfsd_file_dispose_list_delayed(struct list_head > > *dispose) > > { > > while(!list_empty(dispose)) { > > struct nfsd_file *nf = list_first_entry(dispose, > > - struct nfsd_file, > > nf_lru); > > + struct nfsd_file, > > nf_gc); > > struct nfsd_net *nn = net_generic(nf->nf_net, > > nfsd_net_id); > > struct nfsd_fcache_disposal *l = nn->fcache_disposal; > > > > spin_lock(&l->lock); > > - list_move_tail(&nf->nf_lru, &l->freeme); > > + list_move_tail(&nf->nf_gc, &l->freeme); > > spin_unlock(&l->lock); > > svc_wake_up(nn->nfsd_serv); > > } > > @@ -503,7 +504,8 @@ nfsd_file_lru_cb(struct list_head *item, struct > > list_lru_one *lru, > > > > /* Refcount went to zero. Unhash it and queue it to the > > dispose list */ > > nfsd_file_unhash(nf); > > - list_lru_isolate_move(lru, &nf->nf_lru, head); > > + list_lru_isolate(lru, &nf->nf_lru); > > + list_add(&nf->nf_gc, head); > > this_cpu_inc(nfsd_file_evictions); > > trace_nfsd_file_gc_disposed(nf); > > return LRU_REMOVED; > > @@ -578,7 +580,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > > list_head *dispose) > > > > /* If refcount goes to 0, then put on the dispose list */ > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > - list_add(&nf->nf_lru, dispose); > > + list_add(&nf->nf_gc, dispose); > > trace_nfsd_file_closing(nf); > > } > > } > > @@ -654,8 +656,8 @@ nfsd_file_close_inode_sync(struct inode *inode) > > > > nfsd_file_queue_for_close(inode, &dispose); > > while (!list_empty(&dispose)) { > > - nf = list_first_entry(&dispose, struct nfsd_file, > > nf_lru); > > - list_del_init(&nf->nf_lru); > > + nf = list_first_entry(&dispose, struct nfsd_file, > > nf_gc); > > + list_del_init(&nf->nf_gc); > > nfsd_file_free(nf); > > } > > } > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h > > index c61884def906..3fbec24eea6c 100644 > > --- a/fs/nfsd/filecache.h > > +++ b/fs/nfsd/filecache.h > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > struct nfsd_file_mark *nf_mark; > > struct list_head nf_lru; > > + struct list_head nf_gc; > > struct rcu_head nf_rcu; > > ktime_t nf_birthtime; > > }; > > > > On Wed, Jul 10, 2024 at 8:49 AM Jeff Layton <jlayton@kernel.org> > > wrote: > > > > > > Ok, thanks. I went crawling over the filecache code, and found a > > > couple > > > of potential nfsd_file leaks. I'm testing patches now and will > > > likely > > > send them along later today. > > > > > > I think we do still want the patch to add nf_gc list_head. I'm not > > > certain there is a race there, but it's safer to just use a > > > separate > > > list_head and it doesn't represent a lot of memory. > > > > > > Could you post that one individually so that Chuck can pick it up? > > > > > > Thanks, > > > Jeff > > > > > > On Tue, 2024-07-09 at 15:13 -0400, Youzhong Yang wrote: > > > > It's not on the LRU: > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_gc > > > > ffff898107b95980 > > > > nf_flags = 12, > > > > nf_ref.refs.counter = 1 > > > > nf_lru = { > > > > next = 0xffff898107b959c8, > > > > prev = 0xffff898107b959c8 > > > > }, > > > > nf_gc = { > > > > next = 0xffff898107b959d8, > > > > prev = 0xffff898107b959d8 > > > > }, > > > > > > > > I don't have an easy reproducer now as I am having difficulty > > > > generating tons of NFS read/write ops using a limited number of > > > > nfs > > > > clients. I use our in-house testing farm to test the patch. > > > > > > > > Thanks. > > > > > > > > On Tue, Jul 9, 2024 at 3:05 PM Jeff Layton <jlayton@kernel.org> > > > > wrote: > > > > > > > > > > On Tue, 2024-07-09 at 14:37 -0400, Youzhong Yang wrote: > > > > > > Thanks Jeff. > > > > > > > > > > > > Unfortunately the early unhash easily leads to leaks: > > > > > > > > > > > > crash> kmem -S nfsd_file | grep '\[ffff' | sed -e 's|\[||' -e > > > > > > 's|\]||' > > > > > > > xargs -i echo struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > '{}' > > > > > > > /var/tmp/nfsd_files > > > > > > crash> !wc -l /var/tmp/nfsd_files > > > > > > 19 /var/tmp/nfsd_files > > > > > > crash> < /var/tmp/nfsd_files > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > ffff88865c778900 > > > > > > nf_flags = 8, > > > > > > nf_ref.refs.counter = 2 > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > ffff88865c778cc0 > > > > > > nf_flags = 8, > > > > > > nf_ref.refs.counter = 3 > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > ffff8885d5f35e00 > > > > > > nf_flags = 8, > > > > > > nf_ref.refs.counter = 1 > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > ffff88817443e780 > > > > > > nf_flags = 8, > > > > > > nf_ref.refs.counter = 3 > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > ffff88818b3f0600 > > > > > > nf_flags = 8, > > > > > > nf_ref.refs.counter = 2 > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > ffff88a4490f8300 > > > > > > nf_flags = 8, > > > > > > nf_ref.refs.counter = 1 > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > ffff88a0dab183c0 > > > > > > nf_flags = 8, > > > > > > nf_ref.refs.counter = 40 > > > > > > > > > > That's a lot of references! Might be interesting to look more > > > > > closely > > > > > at that one, but the refcounts are all over the place, so it > > > > > really > > > > > does look like we just have a refcount leak somewhere. > > > > > > > > > > > ... > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > ffff89209535f200 > > > > > > nf_flags = 8, > > > > > > nf_ref.refs.counter = 2 > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > ffff8980e15138c0 > > > > > > nf_flags = 8, > > > > > > nf_ref.refs.counter = 7 > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > ffff898107b95800 > > > > > > nf_flags = 8, > > > > > > nf_ref.refs.counter = 3 > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > ffff898107b95980 > > > > > > nf_flags = 12, > > > > > > > > > > The others are NFSD_FILE_GC. This one is also > > > > > NFSD_FILE_REFERENCED. Are > > > > > these objects on the LRU? > > > > > > > > > > > nf_ref.refs.counter = 1 > > > > > > > > > > > > nfsd_file_do_acquire() -> nfsd_file_lookup_locked() relies on > > > > > > the > > > > > > hash > > > > > > table to find the nfsd_file, > > > > > > but I am still scratching my head why and how this happens. > > > > > > > > > > > > FYI, here is the patch I applied for testing: > > > > > > > > > > > > > > > > The above suggests to me that there is a garden-variety > > > > > refcount leak > > > > > somewhere. Whenever some bit of the code takes a reference to > > > > > an object > > > > > (like a nfsd_file), it's implicitly required to eventually put > > > > > that > > > > > reference. Typically that means that the code needs to maintain > > > > > a > > > > > pointer to that object, as it can be unhashed at any time and > > > > > it can't > > > > > rely on finding the same object later. > > > > > > > > > > Do you have a reproducer for this? > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > > index ad9083ca144b..22ebd7fb8639 100644 > > > > > > --- a/fs/nfsd/filecache.c > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, struct > > > > > > inode > > > > > > *inode, unsigned char need, > > > > > > return NULL; > > > > > > > > > > > > INIT_LIST_HEAD(&nf->nf_lru); > > > > > > + INIT_LIST_HEAD(&nf->nf_gc); > > > > > > nf->nf_birthtime = ktime_get(); > > > > > > nf->nf_file = NULL; > > > > > > nf->nf_cred = get_current_cred(); > > > > > > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct list_head > > > > > > *dispose) > > > > > > struct nfsd_file *nf; > > > > > > > > > > > > while (!list_empty(dispose)) { > > > > > > - nf = list_first_entry(dispose, struct > > > > > > nfsd_file, > > > > > > nf_lru); > > > > > > - list_del_init(&nf->nf_lru); > > > > > > + nf = list_first_entry(dispose, struct > > > > > > nfsd_file, > > > > > > nf_gc); > > > > > > + list_del_init(&nf->nf_gc); > > > > > > nfsd_file_free(nf); > > > > > > } > > > > > > } > > > > > > @@ -411,12 +412,12 @@ nfsd_file_dispose_list_delayed(struct > > > > > > list_head > > > > > > *dispose) > > > > > > { > > > > > > while(!list_empty(dispose)) { > > > > > > struct nfsd_file *nf = > > > > > > list_first_entry(dispose, > > > > > > - struct > > > > > > nfsd_file, > > > > > > nf_lru); > > > > > > + struct > > > > > > nfsd_file, > > > > > > nf_gc); > > > > > > struct nfsd_net *nn = net_generic(nf->nf_net, > > > > > > nfsd_net_id); > > > > > > struct nfsd_fcache_disposal *l = nn- > > > > > > >fcache_disposal; > > > > > > > > > > > > spin_lock(&l->lock); > > > > > > - list_move_tail(&nf->nf_lru, &l->freeme); > > > > > > + list_move_tail(&nf->nf_gc, &l->freeme); > > > > > > spin_unlock(&l->lock); > > > > > > svc_wake_up(nn->nfsd_serv); > > > > > > } > > > > > > @@ -503,7 +504,8 @@ nfsd_file_lru_cb(struct list_head *item, > > > > > > struct > > > > > > list_lru_one *lru, > > > > > > > > > > > > /* Refcount went to zero. Unhash it and queue it to > > > > > > the > > > > > > dispose list */ > > > > > > nfsd_file_unhash(nf); > > > > > > - list_lru_isolate_move(lru, &nf->nf_lru, head); > > > > > > + list_lru_isolate(lru, &nf->nf_lru); > > > > > > + list_add(&nf->nf_gc, head); > > > > > > this_cpu_inc(nfsd_file_evictions); > > > > > > trace_nfsd_file_gc_disposed(nf); > > > > > > return LRU_REMOVED; > > > > > > @@ -578,7 +580,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > > *nf, struct > > > > > > list_head *dispose) > > > > > > > > > > > > /* If refcount goes to 0, then put on the dispose > > > > > > list */ > > > > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > > > > - list_add(&nf->nf_lru, dispose); > > > > > > + list_add(&nf->nf_gc, dispose); > > > > > > trace_nfsd_file_closing(nf); > > > > > > } > > > > > > } > > > > > > @@ -654,8 +656,8 @@ nfsd_file_close_inode_sync(struct inode > > > > > > *inode) > > > > > > > > > > > > nfsd_file_queue_for_close(inode, &dispose); > > > > > > while (!list_empty(&dispose)) { > > > > > > - nf = list_first_entry(&dispose, struct > > > > > > nfsd_file, > > > > > > nf_lru); > > > > > > - list_del_init(&nf->nf_lru); > > > > > > + nf = list_first_entry(&dispose, struct > > > > > > nfsd_file, > > > > > > nf_gc); > > > > > > + list_del_init(&nf->nf_gc); > > > > > > nfsd_file_free(nf); > > > > > > } > > > > > > } > > > > > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h > > > > > > index c61884def906..3fbec24eea6c 100644 > > > > > > --- a/fs/nfsd/filecache.h > > > > > > +++ b/fs/nfsd/filecache.h > > > > > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > > > > > > > > > struct nfsd_file_mark *nf_mark; > > > > > > struct list_head nf_lru; > > > > > > + struct list_head nf_gc; > > > > > > struct rcu_head nf_rcu; > > > > > > ktime_t nf_birthtime; > > > > > > }; > > > > > > > > > > > > On Mon, Jul 8, 2024 at 10:55 AM Jeff Layton > > > > > > <jlayton@kernel.org> > > > > > > wrote: > > > > > > > > > > > > > > On Mon, 2024-07-08 at 10:23 -0400, Youzhong Yang wrote: > > > > > > > > Thanks Jeff. > > > > > > > > > > > > > > > > I am ok with reverting the unhash/dispose list reordering > > > > > > > > in > > > > > > > > nfsd_file_lru_cb(), as it doesn't make much of a > > > > > > > > difference, > > > > > > > > but for nfsd_file_cond_queue(), imagining this: > > > > > > > > > > > > > > > > - A nfsd_file is hashed > > > > > > > > - In nfsd_file_cond_queue(), [if (!nfsd_file_unhash(nf))] > > > > > > > > will > > > > > > > > get it > > > > > > > > unhashed, doesn't it? > > > > > > > > - It continues to get a reference by nfsd_file_get() > > > > > > > > - It continues to remove itself from LRU by > > > > > > > > nfsd_file_lru_remove() if > > > > > > > > it is on the LRU. > > > > > > > > - Now it runs refcount_sub_and_test(), what happens if > > > > > > > > the refcnt > > > > > > > > does > > > > > > > > not go to 0? How can this nfsd_file be found again? > > > > > > > > Through the > > > > > > > > hash > > > > > > > > table? Through the LRU walk? how? > > > > > > > > > > > > > > > > Thanks again. > > > > > > > > > > > > > > > > -Youzhong > > > > > > > > > > > > > > > > > > > > > > It won't need to be found again. The holders of the extra > > > > > > > references > > > > > > > will put those references when they are finished. Since the > > > > > > > object > > > > > > > is > > > > > > > no longer HASHED, nfsd_file_put just does this: > > > > > > > > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > > > So that should be fine. > > > > > > > > > > > > > > > On Mon, Jul 8, 2024 at 9:35 AM Jeff Layton > > > > > > > > <jlayton@kernel.org> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > On Mon, 2024-07-08 at 08:58 -0400, Youzhong Yang wrote: > > > > > > > > > > Thank you Jeff for your invaluable insights. I was > > > > > > > > > > leaning > > > > > > > > > > towards > > > > > > > > > > adding a new list_head too, and tested this approach > > > > > > > > > > on > > > > > > > > > > kernel > > > > > > > > > > 6.6 by > > > > > > > > > > continuously hammering the server with heavy nfs load > > > > > > > > > > for the > > > > > > > > > > last few > > > > > > > > > > days, not a single leak. > > > > > > > > > > > > > > > > > > > > Here goes the patch (based on Linux kernel master > > > > > > > > > > branch), > > > > > > > > > > please > > > > > > > > > > review: > > > > > > > > > > > > > > > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > > Date: Thu, 4 Jul 2024 11:25:40 -0400 > > > > > > > > > > Subject: [PATCH] nfsd: fix nfsd_file leaking due to > > > > > > > > > > mixed use > > > > > > > > > > of > > > > > > > > > > nf->nf_lru > > > > > > > > > > > > > > > > > > > > nfsd_file_put() in one thread can race with another > > > > > > > > > > thread > > > > > > > > > > doing > > > > > > > > > > garbage collection (running nfsd_file_gc() -> > > > > > > > > > > list_lru_walk() > > > > > > > > > > -> > > > > > > > > > > nfsd_file_lru_cb()): > > > > > > > > > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries > > > > > > > > > > to do > > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > > * nfsd_file_lru_add() returns true (with > > > > > > > > > > NFSD_FILE_REFERENCED > > > > > > > > > > bit set) > > > > > > > > > > * garbage collector kicks in, nfsd_file_lru_cb() > > > > > > > > > > clears > > > > > > > > > > REFERENCED bit and > > > > > > > > > > returns LRU_ROTATE. > > > > > > > > > > * garbage collector kicks in again, > > > > > > > > > > nfsd_file_lru_cb() now > > > > > > > > > > decrements nf->nf_ref > > > > > > > > > > to 0, runs nfsd_file_unhash(), removes it from > > > > > > > > > > the LRU > > > > > > > > > > and > > > > > > > > > > adds to > > > > > > > > > > the dispose > > > > > > > > > > list [list_lru_isolate_move(lru, &nf->nf_lru, > > > > > > > > > > head)] > > > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is > > > > > > > > > > cleared, > > > > > > > > > > so > > > > > > > > > > it > > > > > > > > > > tries to remove > > > > > > > > > > the 'nf' from the LRU [if > > > > > > > > > > (!nfsd_file_lru_remove(nf))]. > > > > > > > > > > The > > > > > > > > > > 'nf' > > > > > > > > > > has been added > > > > > > > > > > to the 'dispose' list by nfsd_file_lru_cb(), so > > > > > > > > > > nfsd_file_lru_remove(nf) simply > > > > > > > > > > treats it as part of the LRU and removes it, > > > > > > > > > > which leads > > > > > > > > > > to > > > > > > > > > > its removal from > > > > > > > > > > the 'dispose' list. > > > > > > > > > > * At this moment, 'nf' is unhashed with its nf_ref > > > > > > > > > > being 0, > > > > > > > > > > and > > > > > > > > > > not > > > > > > > > > > on the LRU. > > > > > > > > > > nfsd_file_put() continues its execution [if > > > > > > > > > > (refcount_dec_and_test(&nf->nf_ref))], > > > > > > > > > > as nf->nf_ref is already 0, nf->nf_ref is set to > > > > > > > > > > REFCOUNT_SATURATED, and the 'nf' > > > > > > > > > > gets no chance of being freed. > > > > > > > > > > > > > > > > > > > > nfsd_file_put() can also race with > > > > > > > > > > nfsd_file_cond_queue(): > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries > > > > > > > > > > to do > > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > > * nfsd_file_lru_add() sets REFERENCED bit and > > > > > > > > > > returns true. > > > > > > > > > > * Some userland application runs 'exportfs -f' or > > > > > > > > > > something > > > > > > > > > > like > > > > > > > > > > that, which triggers > > > > > > > > > > __nfsd_file_cache_purge() -> > > > > > > > > > > nfsd_file_cond_queue(). > > > > > > > > > > * In nfsd_file_cond_queue(), it runs [if > > > > > > > > > > (!nfsd_file_unhash(nf))], > > > > > > > > > > unhash is done > > > > > > > > > > successfully. > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > (!nfsd_file_get(nf))], > > > > > > > > > > now > > > > > > > > > > nf->nf_ref goes to 2. > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > (nfsd_file_lru_remove(nf))], > > > > > > > > > > it succeeds. > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > (refcount_sub_and_test(decrement, > > > > > > > > > > &nf->nf_ref))] > > > > > > > > > > (with "decrement" being 2), so the nf->nf_ref > > > > > > > > > > goes to 0, > > > > > > > > > > the > > > > > > > > > > 'nf' > > > > > > > > > > is added to the > > > > > > > > > > dispose list [list_add(&nf->nf_lru, dispose)] > > > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is > > > > > > > > > > cleared, > > > > > > > > > > so > > > > > > > > > > it > > > > > > > > > > tries to remove > > > > > > > > > > the 'nf' from the LRU [if > > > > > > > > > > (!nfsd_file_lru_remove(nf))], > > > > > > > > > > although > > > > > > > > > > the 'nf' is not > > > > > > > > > > in the LRU, but it is linked in the 'dispose' > > > > > > > > > > list, > > > > > > > > > > nfsd_file_lru_remove() simply > > > > > > > > > > treats it as part of the LRU and removes it. This > > > > > > > > > > leads > > > > > > > > > > to > > > > > > > > > > its removal from > > > > > > > > > > the 'dispose' list! > > > > > > > > > > * Now nf->ref is 0, unhashed. nfsd_file_put() > > > > > > > > > > continues its > > > > > > > > > > execution and set > > > > > > > > > > nf->nf_ref to REFCOUNT_SATURATED. > > > > > > > > > > > > > > > > > > > > As shown in the above analysis, using nf_lru for both > > > > > > > > > > the LRU > > > > > > > > > > list and > > > > > > > > > > dispose list > > > > > > > > > > can cause the leaks. This patch adds a new list_head > > > > > > > > > > nf_gc in > > > > > > > > > > struct > > > > > > > > > > nfsd_file, and uses > > > > > > > > > > it for the dispose list. It's not expected to have a > > > > > > > > > > nfsd_file > > > > > > > > > > unhashed but it's not > > > > > > > > > > added to the dispose list, so in > > > > > > > > > > nfsd_file_cond_queue() and > > > > > > > > > > nfsd_file_lru_cb() nfsd_file > > > > > > > > > > is unhashed after being added to the dispose list. > > > > > > > > > > > > > > > > > > > > > > > > > > > > I don't see where we require the object to be either > > > > > > > > > hashed or > > > > > > > > > on > > > > > > > > > the > > > > > > > > > dispose list. I think you probably just want to do a > > > > > > > > > patch > > > > > > > > > that > > > > > > > > > changes the dispose list to use a dedicated list_head > > > > > > > > > without > > > > > > > > > reordering when the these things are unhashed. > > > > > > > > > > > > > > > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > > --- > > > > > > > > > > fs/nfsd/filecache.c | 23 ++++++++++++++--------- > > > > > > > > > > fs/nfsd/filecache.h | 1 + > > > > > > > > > > 2 files changed, 15 insertions(+), 9 deletions(-) > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c > > > > > > > > > > b/fs/nfsd/filecache.c > > > > > > > > > > index ad9083ca144b..3aef2ddfce94 100644 > > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, > > > > > > > > > > struct > > > > > > > > > > inode > > > > > > > > > > *inode, unsigned char need, > > > > > > > > > > return NULL; > > > > > > > > > > > > > > > > > > > > INIT_LIST_HEAD(&nf->nf_lru); > > > > > > > > > > + INIT_LIST_HEAD(&nf->nf_gc); > > > > > > > > > > nf->nf_birthtime = ktime_get(); > > > > > > > > > > nf->nf_file = NULL; > > > > > > > > > > nf->nf_cred = get_current_cred(); > > > > > > > > > > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct > > > > > > > > > > list_head > > > > > > > > > > *dispose) > > > > > > > > > > struct nfsd_file *nf; > > > > > > > > > > > > > > > > > > > > while (!list_empty(dispose)) { > > > > > > > > > > - nf = list_first_entry(dispose, struct > > > > > > > > > > nfsd_file, > > > > > > > > > > nf_lru); > > > > > > > > > > - list_del_init(&nf->nf_lru); > > > > > > > > > > + nf = list_first_entry(dispose, struct > > > > > > > > > > nfsd_file, > > > > > > > > > > nf_gc); > > > > > > > > > > + list_del_init(&nf->nf_gc); > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > } > > > > > > > > > > } > > > > > > > > > > @@ -411,12 +412,12 @@ > > > > > > > > > > nfsd_file_dispose_list_delayed(struct > > > > > > > > > > list_head *dispose) > > > > > > > > > > { > > > > > > > > > > while(!list_empty(dispose)) { > > > > > > > > > > struct nfsd_file *nf = > > > > > > > > > > list_first_entry(dispose, > > > > > > > > > > - > > > > > > > > > > struct > > > > > > > > > > nfsd_file, > > > > > > > > > > nf_lru); > > > > > > > > > > + > > > > > > > > > > struct > > > > > > > > > > nfsd_file, > > > > > > > > > > nf_gc); > > > > > > > > > > struct nfsd_net *nn = net_generic(nf- > > > > > > > > > > >nf_net, > > > > > > > > > > nfsd_net_id); > > > > > > > > > > struct nfsd_fcache_disposal *l = nn- > > > > > > > > > > > fcache_disposal; > > > > > > > > > > > > > > > > > > > > spin_lock(&l->lock); > > > > > > > > > > - list_move_tail(&nf->nf_lru, &l- > > > > > > > > > > >freeme); > > > > > > > > > > + list_move_tail(&nf->nf_gc, &l- > > > > > > > > > > >freeme); > > > > > > > > > > spin_unlock(&l->lock); > > > > > > > > > > svc_wake_up(nn->nfsd_serv); > > > > > > > > > > } > > > > > > > > > > @@ -502,8 +503,10 @@ nfsd_file_lru_cb(struct > > > > > > > > > > list_head *item, > > > > > > > > > > struct > > > > > > > > > > list_lru_one *lru, > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > /* Refcount went to zero. Unhash it and queue > > > > > > > > > > it to > > > > > > > > > > the > > > > > > > > > > dispose list */ > > > > > > > > > > + list_lru_isolate(lru, &nf->nf_lru); > > > > > > > > > > + list_add(&nf->nf_gc, head); > > > > > > > > > > + /* Unhash after removing from LRU and adding > > > > > > > > > > to > > > > > > > > > > dispose > > > > > > > > > > list */ > > > > > > > > > > nfsd_file_unhash(nf); > > > > > > > > > > - list_lru_isolate_move(lru, &nf->nf_lru, > > > > > > > > > > head); > > > > > > > > > > > > > > > > > > I don't see the point in reordering these operations. > > > > > > > > > Hashing > > > > > > > > > is > > > > > > > > > all > > > > > > > > > about making the thing findable by nfsd operations. The > > > > > > > > > _last_ > > > > > > > > > thing we > > > > > > > > > want to do is put it on the dispose list while the > > > > > > > > > thing can > > > > > > > > > still > > > > > > > > > be > > > > > > > > > found by nfsd threads doing operations. > > > > > > > > > > > > > > > > > > > this_cpu_inc(nfsd_file_evictions); > > > > > > > > > > trace_nfsd_file_gc_disposed(nf); > > > > > > > > > > return LRU_REMOVED; > > > > > > > > > > @@ -565,7 +568,7 @@ nfsd_file_cond_queue(struct > > > > > > > > > > nfsd_file > > > > > > > > > > *nf, > > > > > > > > > > struct > > > > > > > > > > list_head *dispose) > > > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > > > > > /* If we raced with someone else unhashing, > > > > > > > > > > ignore it > > > > > > > > > > */ > > > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf- > > > > > > > > > > >nf_flags)) > > > > > > > > > > > > > > > > > > The above change looks wrong. I don't think we need to > > > > > > > > > change > > > > > > > > > this. > > > > > > > > > > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > > > /* If we can't get a reference, ignore it */ > > > > > > > > > > @@ -578,7 +581,9 @@ nfsd_file_cond_queue(struct > > > > > > > > > > nfsd_file > > > > > > > > > > *nf, > > > > > > > > > > struct > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > > > > > > > /* If refcount goes to 0, then put on the > > > > > > > > > > dispose > > > > > > > > > > list */ > > > > > > > > > > if (refcount_sub_and_test(decrement, &nf- > > > > > > > > > > >nf_ref)) { > > > > > > > > > > - list_add(&nf->nf_lru, dispose); > > > > > > > > > > + list_add(&nf->nf_gc, dispose); > > > > > > > > > > + /* Unhash after adding to dispose > > > > > > > > > > list */ > > > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > > > > > > > > > > This too looks wrong? Maybe I'm unclear on the race > > > > > > > > > you're > > > > > > > > > trying > > > > > > > > > to > > > > > > > > > fix with this? What's the harm in unhashing it early? > > > > > > > > > > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > > } > > > > > > > > > > } > > > > > > > > > > @@ -654,8 +659,8 @@ nfsd_file_close_inode_sync(struct > > > > > > > > > > inode > > > > > > > > > > *inode) > > > > > > > > > > > > > > > > > > > > nfsd_file_queue_for_close(inode, &dispose); > > > > > > > > > > while (!list_empty(&dispose)) { > > > > > > > > > > - nf = list_first_entry(&dispose, > > > > > > > > > > struct > > > > > > > > > > nfsd_file, > > > > > > > > > > nf_lru); > > > > > > > > > > - list_del_init(&nf->nf_lru); > > > > > > > > > > + nf = list_first_entry(&dispose, > > > > > > > > > > struct > > > > > > > > > > nfsd_file, > > > > > > > > > > nf_gc); > > > > > > > > > > + list_del_init(&nf->nf_gc); > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > } > > > > > > > > > > } > > > > > > > > > > diff --git a/fs/nfsd/filecache.h > > > > > > > > > > b/fs/nfsd/filecache.h > > > > > > > > > > index c61884def906..3fbec24eea6c 100644 > > > > > > > > > > --- a/fs/nfsd/filecache.h > > > > > > > > > > +++ b/fs/nfsd/filecache.h > > > > > > > > > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > > > > > > > > > > > > > > > > > struct nfsd_file_mark *nf_mark; > > > > > > > > > > struct list_head nf_lru; > > > > > > > > > > + struct list_head nf_gc; > > > > > > > > > > struct rcu_head nf_rcu; > > > > > > > > > > ktime_t nf_birthtime; > > > > > > > > > > }; > > > > > > > > > > -- > > > > > > > > > > 2.34.1 > > > > > > > > > > > > > > > > > > > > On Thu, Jul 4, 2024 at 7:14 AM Jeff Layton > > > > > > > > > > <jlayton@poochiereds.net> wrote: > > > > > > > > > > > > > > > > > > > > > > On Wed, 2024-07-03 at 16:46 -0400, Youzhong Yang > > > > > > > > > > > wrote: > > > > > > > > > > > > Thank you Chuck. Here are my quick answers to > > > > > > > > > > > > your > > > > > > > > > > > > comments: > > > > > > > > > > > > > > > > > > > > > > > > - I don't have a quick reproducer. I reproduced > > > > > > > > > > > > it by > > > > > > > > > > > > using > > > > > > > > > > > > hundreds > > > > > > > > > > > > of nfs clients generating +600K ops under our > > > > > > > > > > > > workload in > > > > > > > > > > > > the > > > > > > > > > > > > testing > > > > > > > > > > > > environment. Theoretically it should be possible > > > > > > > > > > > > to > > > > > > > > > > > > simplify > > > > > > > > > > > > the > > > > > > > > > > > > reproduction but I am still working on it. > > > > > > > > > > > > > > > > > > > > > > > > - I understand zfs is an out-of-tree file > > > > > > > > > > > > system. That's > > > > > > > > > > > > fine. But > > > > > > > > > > > > this leaking can happen to any file system, and > > > > > > > > > > > > leaking > > > > > > > > > > > > is > > > > > > > > > > > > not a good > > > > > > > > > > > > thing no matter what file system it is. > > > > > > > > > > > > > > > > > > > > > > > > - I will try to come up with a reproducer using > > > > > > > > > > > > xfs or > > > > > > > > > > > > btrfs > > > > > > > > > > > > if possible. > > > > > > > > > > > > > > > > > > > > > > > > Now back to the problem itself, here are my > > > > > > > > > > > > findings: > > > > > > > > > > > > > > > > > > > > > > > > - nfsd_file_put() in one thread can race with > > > > > > > > > > > > another > > > > > > > > > > > > thread > > > > > > > > > > > > doing > > > > > > > > > > > > garbage collection (running nfsd_file_gc() -> > > > > > > > > > > > > list_lru_walk() > > > > > > > > > > > > -> > > > > > > > > > > > > nfsd_file_lru_cb()): > > > > > > > > > > > > > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it > > > > > > > > > > > > tries to > > > > > > > > > > > > do > > > > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > > > > * nfsd_file_lru_add() returns true (thus > > > > > > > > > > > > NFSD_FILE_REFERENCED bit > > > > > > > > > > > > set for nf->nf_flags) > > > > > > > > > > > > * garbage collector kicks in, > > > > > > > > > > > > nfsd_file_lru_cb() clears > > > > > > > > > > > > REFERENCED > > > > > > > > > > > > bit and returns LRU_ROTATE. > > > > > > > > > > > > * garbage collector kicks in again, > > > > > > > > > > > > nfsd_file_lru_cb() > > > > > > > > > > > > now > > > > > > > > > > > > decrements nf->nf_ref to 0, runs > > > > > > > > > > > > nfsd_file_unhash(), > > > > > > > > > > > > removes > > > > > > > > > > > > it from > > > > > > > > > > > > the LRU and adds to the dispose list > > > > > > > > > > > > [list_lru_isolate_move(lru, > > > > > > > > > > > > &nf->nf_lru, head);] > > > > > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit > > > > > > > > > > > > is > > > > > > > > > > > > cleared, > > > > > > > > > > > > so it > > > > > > > > > > > > tries to remove the 'nf' from the LRU [if > > > > > > > > > > > > (!nfsd_file_lru_remove(nf))] > > > > > > > > > > > > * The 'nf' has been added to the 'dispose' list > > > > > > > > > > > > by > > > > > > > > > > > > nfsd_file_lru_cb(), so nfsd_file_lru_remove(nf) > > > > > > > > > > > > simply > > > > > > > > > > > > treats > > > > > > > > > > > > it as > > > > > > > > > > > > part of the LRU and removes it, which leads it to > > > > > > > > > > > > be > > > > > > > > > > > > removed > > > > > > > > > > > > from the > > > > > > > > > > > > 'dispose' list. > > > > > > > > > > > > * At this moment, nf->nf_ref is 0, it's > > > > > > > > > > > > unhashed, and > > > > > > > > > > > > not > > > > > > > > > > > > on the > > > > > > > > > > > > LRU. nfsd_file_put() continues its execution [if > > > > > > > > > > > > (refcount_dec_and_test(&nf->nf_ref))], as nf- > > > > > > > > > > > > >nf_ref is > > > > > > > > > > > > already 0, now > > > > > > > > > > > > bad thing happens: nf->nf_ref is set to > > > > > > > > > > > > REFCOUNT_SATURATED, > > > > > > > > > > > > and the > > > > > > > > > > > > 'nf' is leaked. > > > > > > > > > > > > > > > > > > > > > > > > To make this happen, the right timing is crucial. > > > > > > > > > > > > It can > > > > > > > > > > > > be > > > > > > > > > > > > reproduced > > > > > > > > > > > > by adding artifical delays in filecache.c, or > > > > > > > > > > > > hammering > > > > > > > > > > > > the > > > > > > > > > > > > nfsd with > > > > > > > > > > > > tons of ops. > > > > > > > > > > > > > > > > > > > > > > > > - Let's see how nfsd_file_put() can race with > > > > > > > > > > > > nfsd_file_cond_queue(): > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it > > > > > > > > > > > > tries to > > > > > > > > > > > > do > > > > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > > > > * nfsd_file_lru_add() sets REFERENCED bit and > > > > > > > > > > > > returns > > > > > > > > > > > > true. > > > > > > > > > > > > * 'exportfs -f' or something like that triggers > > > > > > > > > > > > __nfsd_file_cache_purge() -> > > > > > > > > > > > > nfsd_file_cond_queue(). > > > > > > > > > > > > * In nfsd_file_cond_queue(), it runs [if > > > > > > > > > > > > (!nfsd_file_unhash(nf))], > > > > > > > > > > > > unhash is done successfully. > > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > > (!nfsd_file_get(nf))], > > > > > > > > > > > > now > > > > > > > > > > > > nf->nf_ref goes to 2. > > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > > (nfsd_file_lru_remove(nf))], it succeeds. > > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > > (refcount_sub_and_test(decrement, > > > > > > > > > > > > &nf->nf_ref))] (with "decrement" being 2), so the > > > > > > > > > > > > nf- > > > > > > > > > > > > > nf_ref > > > > > > > > > > > > goes to > > > > > > > > > > > > 0, the 'nf' is added to the dispost list > > > > > > > > > > > > [list_add(&nf- > > > > > > > > > > > > > nf_lru, > > > > > > > > > > > > dispose)] > > > > > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit > > > > > > > > > > > > is > > > > > > > > > > > > cleared, > > > > > > > > > > > > so it > > > > > > > > > > > > tries to remove the 'nf' from the LRU [if > > > > > > > > > > > > (!nfsd_file_lru_remove(nf))], although the 'nf' > > > > > > > > > > > > is not in > > > > > > > > > > > > the > > > > > > > > > > > > LRU, but > > > > > > > > > > > > it is linked in the 'dispose' list, > > > > > > > > > > > > nfsd_file_lru_remove() > > > > > > > > > > > > simply > > > > > > > > > > > > treats it as part of the LRU and removes it. This > > > > > > > > > > > > leads > > > > > > > > > > > > to > > > > > > > > > > > > its removal > > > > > > > > > > > > from the 'dispose' list! > > > > > > > > > > > > * Now nf->ref is 0, unhashed. nfsd_file_put() > > > > > > > > > > > > continues > > > > > > > > > > > > its > > > > > > > > > > > > execution and sets nf->nf_ref to > > > > > > > > > > > > REFCOUNT_SATURATED. > > > > > > > > > > > > > > > > > > > > > > > > The purpose of nf->nf_lru is problematic. As you > > > > > > > > > > > > can see, > > > > > > > > > > > > it > > > > > > > > > > > > is used > > > > > > > > > > > > for the LRU list, and also the 'dispose' list. > > > > > > > > > > > > Adding > > > > > > > > > > > > another > > > > > > > > > > > > 'struct > > > > > > > > > > > > list_head' specifically for the 'dispose' list > > > > > > > > > > > > seems to > > > > > > > > > > > > be a > > > > > > > > > > > > better > > > > > > > > > > > > way of fixing this race condition. Either way > > > > > > > > > > > > works for > > > > > > > > > > > > me. > > > > > > > > > > > > > > > > > > > > > > > > Would you agree my above analysis makes sense? > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think so. It's been a while since I've done much > > > > > > > > > > > work in > > > > > > > > > > > this > > > > > > > > > > > code, > > > > > > > > > > > but it does sound like there is a race in the LRU > > > > > > > > > > > handling. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Like Chuck said, the nf->nf_lru list should be safe > > > > > > > > > > > to use > > > > > > > > > > > for > > > > > > > > > > > multiple > > > > > > > > > > > purposes, but that's only the case if we're not > > > > > > > > > > > using that > > > > > > > > > > > list > > > > > > > > > > > as an > > > > > > > > > > > indicator. > > > > > > > > > > > > > > > > > > > > > > The list_lru code does check this: > > > > > > > > > > > > > > > > > > > > > > if (!list_empty(item)) { > > > > > > > > > > > > > > > > > > > > > > ...so if we ever check this while it's sitting on > > > > > > > > > > > the > > > > > > > > > > > dispose > > > > > > > > > > > list, it > > > > > > > > > > > will handle it incorrectly. It sounds like that's > > > > > > > > > > > the root > > > > > > > > > > > cause of the > > > > > > > > > > > problem you're seeing? > > > > > > > > > > > > > > > > > > > > > > If so, then maybe a separate list_head for disposal > > > > > > > > > > > would > > > > > > > > > > > be > > > > > > > > > > > better. > > > > > > > > > > > > > > > > > > > > > > > Here is my patch with signed-off-by: > > > > > > > > > > > > > > > > > > > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > > > > Date: Mon, 1 Jul 2024 06:45:22 -0400 > > > > > > > > > > > > Subject: [PATCH] nfsd: fix nfsd_file leaking due > > > > > > > > > > > > to race > > > > > > > > > > > > condition and early > > > > > > > > > > > > unhash > > > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > > > > --- > > > > > > > > > > > > fs/nfsd/filecache.c | 14 +++++++++++++- > > > > > > > > > > > > 1 file changed, 13 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c > > > > > > > > > > > > b/fs/nfsd/filecache.c > > > > > > > > > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct > > > > > > > > > > > > nfsd_file *nf) > > > > > > > > > > > > if > > > > > > > > > > > > (!nfsd_file_lru_remove(nf)) > > > > > > > > > > > > return; > > > > > > > > > > > > } > > > > > > > > > > > > + /* > > > > > > > > > > > > + * Racing with > > > > > > > > > > > > nfsd_file_cond_queue() or > > > > > > > > > > > > nfsd_file_lru_cb(), > > > > > > > > > > > > + * it's unhashed but then removed > > > > > > > > > > > > from > > > > > > > > > > > > the > > > > > > > > > > > > dispose list, > > > > > > > > > > > > + * so we need to free it. > > > > > > > > > > > > + */ > > > > > > > > > > > > + if (refcount_read(&nf->nf_ref) == > > > > > > > > > > > > 0 && > > > > > > > > > > > > > > > > > > > > > > A refcount_read in this path is a red flag to me. > > > > > > > > > > > Anytime > > > > > > > > > > > you're just > > > > > > > > > > > looking at the refcount without changing anything > > > > > > > > > > > just > > > > > > > > > > > screams > > > > > > > > > > > out > > > > > > > > > > > "race condition". > > > > > > > > > > > > > > > > > > > > > > In this case, what guarantee is there that this > > > > > > > > > > > won't run > > > > > > > > > > > afoul > > > > > > > > > > > of the > > > > > > > > > > > timing? We could check this and find out it's 1 > > > > > > > > > > > just before > > > > > > > > > > > it > > > > > > > > > > > goes to > > > > > > > > > > > 0 and you check the other conditions. > > > > > > > > > > > > > > > > > > > > > > Does anything prevent that? > > > > > > > > > > > > > > > > > > > > > > > + !test_bit(NFSD_FILE_HASHED, > > > > > > > > > > > > &nf- > > > > > > > > > > > > > nf_flags) && > > > > > > > > > > > > + list_empty(&nf->nf_lru)) { > > > > > > > > > > > > + nfsd_file_free(nf); > > > > > > > > > > > > + return; > > > > > > > > > > > > + } > > > > > > > > > > > > } > > > > > > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > *nf, struct > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > > > > > > > > > /* If we raced with someone else > > > > > > > > > > > > unhashing, > > > > > > > > > > > > ignore it > > > > > > > > > > > > */ > > > > > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf- > > > > > > > > > > > > >nf_flags)) > > > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > > > > > Same here: you're just testing for the HASHED bit, > > > > > > > > > > > but > > > > > > > > > > > could > > > > > > > > > > > this also > > > > > > > > > > > race with someone who is setting it just after you > > > > > > > > > > > get > > > > > > > > > > > here. > > > > > > > > > > > Why is > > > > > > > > > > > that not a problem? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we can't get a reference, ignore it > > > > > > > > > > > > */ > > > > > > > > > > > > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > *nf, struct > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > /* If refcount goes to 0, then put on the > > > > > > > > > > > > dispose > > > > > > > > > > > > list */ > > > > > > > > > > > > if (refcount_sub_and_test(decrement, &nf- > > > > > > > > > > > > > nf_ref)) { > > > > > > > > > > > > list_add(&nf->nf_lru, dispose); > > > > > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > > > > } > > > > > > > > > > > > } > > > > > > > > > > > > -- > > > > > > > > > > > > 2.43.0 > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Jul 3, 2024 at 2:21 PM Chuck Lever > > > > > > > > > > > > <chuck.lever@oracle.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Jul 03, 2024 at 10:12:33AM -0400, > > > > > > > > > > > > > Youzhong Yang > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > Hello, > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'd like to report a nfsd_file leaking issue > > > > > > > > > > > > > > and > > > > > > > > > > > > > > propose > > > > > > > > > > > > > > a fix for it. > > > > > > > > > > > > > > > > > > > > > > > > > > > > When I tested Linux kernel 6.8 and 6.6, I > > > > > > > > > > > > > > noticed > > > > > > > > > > > > > > nfsd_file leaks > > > > > > > > > > > > > > which led to undestroyable file systems > > > > > > > > > > > > > > (zfs), > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the report. Some initial comments: > > > > > > > > > > > > > > > > > > > > > > > > > > - Do you have a specific reproducer? In other > > > > > > > > > > > > > words, > > > > > > > > > > > > > what > > > > > > > > > > > > > is the > > > > > > > > > > > > > simplest program that can run on an NFS > > > > > > > > > > > > > client that > > > > > > > > > > > > > will > > > > > > > > > > > > > trigger > > > > > > > > > > > > > this leak, and can you post it? > > > > > > > > > > > > > > > > > > > > > > > > > > - "zfs" is an out-of-tree file system, so it's > > > > > > > > > > > > > not > > > > > > > > > > > > > directly > > > > > > > > > > > > > supported for NFSD. > > > > > > > > > > > > > > > > > > > > > > > > > > - The guidelines for patch submission require > > > > > > > > > > > > > us to fix > > > > > > > > > > > > > issues in > > > > > > > > > > > > > upstream Linux first (currently that's v6.10- > > > > > > > > > > > > > rc6). > > > > > > > > > > > > > Then > > > > > > > > > > > > > that fix > > > > > > > > > > > > > can be backported to older stable kernels > > > > > > > > > > > > > like 6.6. > > > > > > > > > > > > > > > > > > > > > > > > > > Can you reproduce the leak with one of the in- > > > > > > > > > > > > > kernel > > > > > > > > > > > > > filesystems > > > > > > > > > > > > > (either xfs or btrfs would be great) and with > > > > > > > > > > > > > NFSD in > > > > > > > > > > > > > 6.10- > > > > > > > > > > > > > rc6? > > > > > > > > > > > > > > > > > > > > > > > > > > One more comment below. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > here are some examples: > > > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct nfsd_file -x ffff88e160db0460 > > > > > > > > > > > > > > struct nfsd_file { > > > > > > > > > > > > > > nf_rlist = { > > > > > > > > > > > > > > rhead = { > > > > > > > > > > > > > > next = 0xffff8921fa2392f1 > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > next = 0x0 > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > nf_inode = 0xffff8882bc312ef8, > > > > > > > > > > > > > > nf_file = 0xffff88e2015b1500, > > > > > > > > > > > > > > nf_cred = 0xffff88e3ab0e7800, > > > > > > > > > > > > > > nf_net = 0xffffffff83d41600 <init_net>, > > > > > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > > > > > nf_ref = { > > > > > > > > > > > > > > refs = { > > > > > > > > > > > > > > counter = 0xc0000000 > > > > > > > > > > > > > > } > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > nf_may = 0x4, > > > > > > > > > > > > > > nf_mark = 0xffff88e1bddfb320, > > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > > next = 0xffff88e160db04a8, > > > > > > > > > > > > > > prev = 0xffff88e160db04a8 > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > nf_rcu = { > > > > > > > > > > > > > > next = 0x10000000000, > > > > > > > > > > > > > > func = 0x0 > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > nf_birthtime = 0x73d22fc1728 > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru > > > > > > > > > > > > > > ,nf_file > > > > > > > > > > > > > > -x > > > > > > > > > > > > > > ffff88839a53d850 > > > > > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > > next = 0xffff88839a53d898, > > > > > > > > > > > > > > prev = 0xffff88839a53d898 > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > nf_file = 0xffff88810ede8700, > > > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru > > > > > > > > > > > > > > ,nf_file > > > > > > > > > > > > > > -x > > > > > > > > > > > > > > ffff88c32b11e850 > > > > > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > > next = 0xffff88c32b11e898, > > > > > > > > > > > > > > prev = 0xffff88c32b11e898 > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > nf_file = 0xffff88c20a701c00, > > > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru > > > > > > > > > > > > > > ,nf_file > > > > > > > > > > > > > > -x > > > > > > > > > > > > > > ffff88e372709700 > > > > > > > > > > > > > > nf_flags = 0xc, > > > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > > next = 0xffff88e372709748, > > > > > > > > > > > > > > prev = 0xffff88e372709748 > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > nf_file = 0xffff88e0725e6400, > > > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru > > > > > > > > > > > > > > ,nf_file > > > > > > > > > > > > > > -x > > > > > > > > > > > > > > ffff8982864944d0 > > > > > > > > > > > > > > nf_flags = 0xc, > > > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > > next = 0xffff898286494518, > > > > > > > > > > > > > > prev = 0xffff898286494518 > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > nf_file = 0xffff89803c0ff700, > > > > > > > > > > > > > > > > > > > > > > > > > > > > The leak occurs when nfsd_file_put() races > > > > > > > > > > > > > > with > > > > > > > > > > > > > > nfsd_file_cond_queue() > > > > > > > > > > > > > > or nfsd_file_lru_cb(). With the following > > > > > > > > > > > > > > patch, I > > > > > > > > > > > > > > haven't observed > > > > > > > > > > > > > > any leak after a few days heavy nfs load: > > > > > > > > > > > > > > > > > > > > > > > > > > Our patch submission guidelines require a > > > > > > > > > > > > > Signed-off- > > > > > > > > > > > > > by: > > > > > > > > > > > > > line at the end of the patch description. See > > > > > > > > > > > > > the "Sign > > > > > > > > > > > > > your work - > > > > > > > > > > > > > the Developer's Certificate of Origin" section > > > > > > > > > > > > > of > > > > > > > > > > > > > > > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.10-rc6 > > > > > > > > > > > > > > > > > > > > > > > > > > (Needed here in case your fix is acceptable). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c > > > > > > > > > > > > > > b/fs/nfsd/filecache.c > > > > > > > > > > > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > > > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > > > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct > > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > > *nf) > > > > > > > > > > > > > > if (!nfsd_file_lru_remove(nf)) > > > > > > > > > > > > > > return; > > > > > > > > > > > > > > } > > > > > > > > > > > > > > + /* > > > > > > > > > > > > > > + * Racing with nfsd_file_cond_queue() or > > > > > > > > > > > > > > nfsd_file_lru_cb(), > > > > > > > > > > > > > > + * it's unhashed but then removed from the > > > > > > > > > > > > > > dispose > > > > > > > > > > > > > > list, > > > > > > > > > > > > > > + * so we need to free it. > > > > > > > > > > > > > > + */ > > > > > > > > > > > > > > + if (refcount_read(&nf->nf_ref) == 0 && > > > > > > > > > > > > > > + !test_bit(NFSD_FILE_HASHED, &nf- > > > > > > > > > > > > > > >nf_flags) && > > > > > > > > > > > > > > + list_empty(&nf->nf_lru)) { > > > > > > > > > > > > > > + nfsd_file_free(nf); > > > > > > > > > > > > > > + return; > > > > > > > > > > > > > > + } > > > > > > > > > > > > > > } > > > > > > > > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > > > @@ -576,7 +587,7 @@ > > > > > > > > > > > > > > nfsd_file_cond_queue(struct > > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > > *nf, struct > > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we raced with someone else unhashing, > > > > > > > > > > > > > > ignore > > > > > > > > > > > > > > it > > > > > > > > > > > > > > */ > > > > > > > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf- > > > > > > > > > > > > > > >nf_flags)) > > > > > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we can't get a reference, ignore it > > > > > > > > > > > > > > */ > > > > > > > > > > > > > > @@ -590,6 +601,7 @@ > > > > > > > > > > > > > > nfsd_file_cond_queue(struct > > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > > *nf, struct > > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > /* If refcount goes to 0, then put on the > > > > > > > > > > > > > > dispose > > > > > > > > > > > > > > list > > > > > > > > > > > > > > */ > > > > > > > > > > > > > > if (refcount_sub_and_test(decrement, &nf- > > > > > > > > > > > > > > >nf_ref)) > > > > > > > > > > > > > > { > > > > > > > > > > > > > > list_add(&nf->nf_lru, dispose); > > > > > > > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > > > > > > } > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > > > Please kindly review the patch and let me > > > > > > > > > > > > > > know if it > > > > > > > > > > > > > > makes sense. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > > > > > > > > > > > -Youzhong > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > Chuck Lever > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > Jeff Layton <jlayton@poochiereds.net> > > > > > > > > > > > > > > > > > > -- > > > > > > > > > Jeff Layton <jlayton@poochiereds.net> > > > > > > > > > > > > > > > > > > -- > > > > > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > > > > > > > > > -- > > > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > > > > > > > > > > > -- > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > -- > > > Jeff Layton <jlayton@kernel.org> > > -- > Jeff Layton <jlayton@kernel.org>
Great! If you're able to test the patches I sent earlier today, please let us know if they help with your reproducer. On Wed, 2024-07-10 at 10:49 -0400, Youzhong Yang wrote: > Thanks Jeff. Just sent out a separate e-mail. > > I have a simple reproducer now, it's a C program, I will upload it to > a github repository together with a readme file and then post the link > (if allowed) here. > > On Wed, Jul 10, 2024 at 9:40 AM Jeff Layton <jlayton@kernel.org> wrote: > > > > Can you send it as a separate [PATCH] email? I think your MUA is > > mangling the patches as I couldn't apply it earlier. Either use git- > > send-email or have a look over this page: > > > > > > https://www.kernel.org/doc/html/latest/process/email-clients.html#email-clients > > > > Thanks! > > Jeff > > > > On Wed, 2024-07-10 at 09:33 -0400, Youzhong Yang wrote: > > > Thanks. Here it is again: > > > > > > commit c6e69cebc052cb82d91fc81a98aee20749fe8d47 > > > Author: Youzhong Yang <youzhong@gmail.com> > > > Date: Thu Jul 4 11:25:40 2024 -0400 > > > > > > nfsd: fix nfsd_file leaking due to mixed use of nf->nf_lru > > > > > > nfsd_file_put() in one thread can race with another thread doing > > > garbage collection (running nfsd_file_gc() -> list_lru_walk() -> > > > nfsd_file_lru_cb()): > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do > > > nfsd_file_lru_add(). > > > * nfsd_file_lru_add() returns true (with NFSD_FILE_REFERENCED > > > bit set) > > > * garbage collector kicks in, nfsd_file_lru_cb() clears > > > REFERENCED bit and > > > returns LRU_ROTATE. > > > * garbage collector kicks in again, nfsd_file_lru_cb() now > > > decrements nf->nf_ref > > > to 0, runs nfsd_file_unhash(), removes it from the LRU and > > > adds to the dispose > > > list [list_lru_isolate_move(lru, &nf->nf_lru, head)] > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so > > > it > > > tries to remove > > > the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))]. The > > > 'nf' has been added > > > to the 'dispose' list by nfsd_file_lru_cb(), so > > > nfsd_file_lru_remove(nf) simply > > > treats it as part of the LRU and removes it, which leads to > > > its removal from > > > the 'dispose' list. > > > * At this moment, 'nf' is unhashed with its nf_ref being 0, and > > > not on the LRU. > > > nfsd_file_put() continues its execution [if > > > (refcount_dec_and_test(&nf->nf_ref))], > > > as nf->nf_ref is already 0, nf->nf_ref is set to > > > REFCOUNT_SATURATED, and the 'nf' > > > gets no chance of being freed. > > > > > > nfsd_file_put() can also race with nfsd_file_cond_queue(): > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do > > > nfsd_file_lru_add(). > > > * nfsd_file_lru_add() sets REFERENCED bit and returns true. > > > * Some userland application runs 'exportfs -f' or something > > > like > > > that, which triggers > > > __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). > > > * In nfsd_file_cond_queue(), it runs [if > > > (!nfsd_file_unhash(nf))], unhash is done > > > successfully. > > > * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], now > > > nf->nf_ref goes to 2. > > > * nfsd_file_cond_queue() runs [if (nfsd_file_lru_remove(nf))], > > > it succeeds. > > > * nfsd_file_cond_queue() runs [if > > > (refcount_sub_and_test(decrement, &nf->nf_ref))] > > > (with "decrement" being 2), so the nf->nf_ref goes to 0, the > > > 'nf' is added to the > > > dispose list [list_add(&nf->nf_lru, dispose)] > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so > > > it > > > tries to remove > > > the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))], > > > although the 'nf' is not > > > in the LRU, but it is linked in the 'dispose' list, > > > nfsd_file_lru_remove() simply > > > treats it as part of the LRU and removes it. This leads to > > > its > > > removal from > > > the 'dispose' list! > > > * Now nf->ref is 0, unhashed. nfsd_file_put() continues its > > > execution and set > > > nf->nf_ref to REFCOUNT_SATURATED. > > > > > > As shown in the above analysis, using nf_lru for both the LRU > > > list > > > and dispose list > > > can cause the leaks. This patch adds a new list_head nf_gc in > > > struct nfsd_file, and uses > > > it for the dispose list. > > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > index ad9083ca144b..22ebd7fb8639 100644 > > > --- a/fs/nfsd/filecache.c > > > +++ b/fs/nfsd/filecache.c > > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, struct inode > > > *inode, unsigned char need, > > > return NULL; > > > > > > INIT_LIST_HEAD(&nf->nf_lru); > > > + INIT_LIST_HEAD(&nf->nf_gc); > > > nf->nf_birthtime = ktime_get(); > > > nf->nf_file = NULL; > > > nf->nf_cred = get_current_cred(); > > > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct list_head *dispose) > > > struct nfsd_file *nf; > > > > > > while (!list_empty(dispose)) { > > > - nf = list_first_entry(dispose, struct nfsd_file, > > > nf_lru); > > > - list_del_init(&nf->nf_lru); > > > + nf = list_first_entry(dispose, struct nfsd_file, > > > nf_gc); > > > + list_del_init(&nf->nf_gc); > > > nfsd_file_free(nf); > > > } > > > } > > > @@ -411,12 +412,12 @@ nfsd_file_dispose_list_delayed(struct list_head > > > *dispose) > > > { > > > while(!list_empty(dispose)) { > > > struct nfsd_file *nf = list_first_entry(dispose, > > > - struct nfsd_file, > > > nf_lru); > > > + struct nfsd_file, > > > nf_gc); > > > struct nfsd_net *nn = net_generic(nf->nf_net, > > > nfsd_net_id); > > > struct nfsd_fcache_disposal *l = nn->fcache_disposal; > > > > > > spin_lock(&l->lock); > > > - list_move_tail(&nf->nf_lru, &l->freeme); > > > + list_move_tail(&nf->nf_gc, &l->freeme); > > > spin_unlock(&l->lock); > > > svc_wake_up(nn->nfsd_serv); > > > } > > > @@ -503,7 +504,8 @@ nfsd_file_lru_cb(struct list_head *item, struct > > > list_lru_one *lru, > > > > > > /* Refcount went to zero. Unhash it and queue it to the > > > dispose list */ > > > nfsd_file_unhash(nf); > > > - list_lru_isolate_move(lru, &nf->nf_lru, head); > > > + list_lru_isolate(lru, &nf->nf_lru); > > > + list_add(&nf->nf_gc, head); > > > this_cpu_inc(nfsd_file_evictions); > > > trace_nfsd_file_gc_disposed(nf); > > > return LRU_REMOVED; > > > @@ -578,7 +580,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > > > list_head *dispose) > > > > > > /* If refcount goes to 0, then put on the dispose list */ > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > - list_add(&nf->nf_lru, dispose); > > > + list_add(&nf->nf_gc, dispose); > > > trace_nfsd_file_closing(nf); > > > } > > > } > > > @@ -654,8 +656,8 @@ nfsd_file_close_inode_sync(struct inode *inode) > > > > > > nfsd_file_queue_for_close(inode, &dispose); > > > while (!list_empty(&dispose)) { > > > - nf = list_first_entry(&dispose, struct nfsd_file, > > > nf_lru); > > > - list_del_init(&nf->nf_lru); > > > + nf = list_first_entry(&dispose, struct nfsd_file, > > > nf_gc); > > > + list_del_init(&nf->nf_gc); > > > nfsd_file_free(nf); > > > } > > > } > > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h > > > index c61884def906..3fbec24eea6c 100644 > > > --- a/fs/nfsd/filecache.h > > > +++ b/fs/nfsd/filecache.h > > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > > > struct nfsd_file_mark *nf_mark; > > > struct list_head nf_lru; > > > + struct list_head nf_gc; > > > struct rcu_head nf_rcu; > > > ktime_t nf_birthtime; > > > }; > > > > > > On Wed, Jul 10, 2024 at 8:49 AM Jeff Layton <jlayton@kernel.org> > > > wrote: > > > > > > > > Ok, thanks. I went crawling over the filecache code, and found a > > > > couple > > > > of potential nfsd_file leaks. I'm testing patches now and will > > > > likely > > > > send them along later today. > > > > > > > > I think we do still want the patch to add nf_gc list_head. I'm not > > > > certain there is a race there, but it's safer to just use a > > > > separate > > > > list_head and it doesn't represent a lot of memory. > > > > > > > > Could you post that one individually so that Chuck can pick it up? > > > > > > > > Thanks, > > > > Jeff > > > > > > > > On Tue, 2024-07-09 at 15:13 -0400, Youzhong Yang wrote: > > > > > It's not on the LRU: > > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_gc > > > > > ffff898107b95980 > > > > > nf_flags = 12, > > > > > nf_ref.refs.counter = 1 > > > > > nf_lru = { > > > > > next = 0xffff898107b959c8, > > > > > prev = 0xffff898107b959c8 > > > > > }, > > > > > nf_gc = { > > > > > next = 0xffff898107b959d8, > > > > > prev = 0xffff898107b959d8 > > > > > }, > > > > > > > > > > I don't have an easy reproducer now as I am having difficulty > > > > > generating tons of NFS read/write ops using a limited number of > > > > > nfs > > > > > clients. I use our in-house testing farm to test the patch. > > > > > > > > > > Thanks. > > > > > > > > > > On Tue, Jul 9, 2024 at 3:05 PM Jeff Layton <jlayton@kernel.org> > > > > > wrote: > > > > > > > > > > > > On Tue, 2024-07-09 at 14:37 -0400, Youzhong Yang wrote: > > > > > > > Thanks Jeff. > > > > > > > > > > > > > > Unfortunately the early unhash easily leads to leaks: > > > > > > > > > > > > > > crash> kmem -S nfsd_file | grep '\[ffff' | sed -e 's|\[||' -e > > > > > > > 's|\]||' > > > > > > > > xargs -i echo struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > '{}' > > > > > > > > /var/tmp/nfsd_files > > > > > > > crash> !wc -l /var/tmp/nfsd_files > > > > > > > 19 /var/tmp/nfsd_files > > > > > > > crash> < /var/tmp/nfsd_files > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > ffff88865c778900 > > > > > > > nf_flags = 8, > > > > > > > nf_ref.refs.counter = 2 > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > ffff88865c778cc0 > > > > > > > nf_flags = 8, > > > > > > > nf_ref.refs.counter = 3 > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > ffff8885d5f35e00 > > > > > > > nf_flags = 8, > > > > > > > nf_ref.refs.counter = 1 > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > ffff88817443e780 > > > > > > > nf_flags = 8, > > > > > > > nf_ref.refs.counter = 3 > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > ffff88818b3f0600 > > > > > > > nf_flags = 8, > > > > > > > nf_ref.refs.counter = 2 > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > ffff88a4490f8300 > > > > > > > nf_flags = 8, > > > > > > > nf_ref.refs.counter = 1 > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > ffff88a0dab183c0 > > > > > > > nf_flags = 8, > > > > > > > nf_ref.refs.counter = 40 > > > > > > > > > > > > That's a lot of references! Might be interesting to look more > > > > > > closely > > > > > > at that one, but the refcounts are all over the place, so it > > > > > > really > > > > > > does look like we just have a refcount leak somewhere. > > > > > > > > > > > > > ... > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > ffff89209535f200 > > > > > > > nf_flags = 8, > > > > > > > nf_ref.refs.counter = 2 > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > ffff8980e15138c0 > > > > > > > nf_flags = 8, > > > > > > > nf_ref.refs.counter = 7 > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > ffff898107b95800 > > > > > > > nf_flags = 8, > > > > > > > nf_ref.refs.counter = 3 > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > ffff898107b95980 > > > > > > > nf_flags = 12, > > > > > > > > > > > > The others are NFSD_FILE_GC. This one is also > > > > > > NFSD_FILE_REFERENCED. Are > > > > > > these objects on the LRU? > > > > > > > > > > > > > nf_ref.refs.counter = 1 > > > > > > > > > > > > > > nfsd_file_do_acquire() -> nfsd_file_lookup_locked() relies on > > > > > > > the > > > > > > > hash > > > > > > > table to find the nfsd_file, > > > > > > > but I am still scratching my head why and how this happens. > > > > > > > > > > > > > > FYI, here is the patch I applied for testing: > > > > > > > > > > > > > > > > > > > The above suggests to me that there is a garden-variety > > > > > > refcount leak > > > > > > somewhere. Whenever some bit of the code takes a reference to > > > > > > an object > > > > > > (like a nfsd_file), it's implicitly required to eventually put > > > > > > that > > > > > > reference. Typically that means that the code needs to maintain > > > > > > a > > > > > > pointer to that object, as it can be unhashed at any time and > > > > > > it can't > > > > > > rely on finding the same object later. > > > > > > > > > > > > Do you have a reproducer for this? > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > > > index ad9083ca144b..22ebd7fb8639 100644 > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, struct > > > > > > > inode > > > > > > > *inode, unsigned char need, > > > > > > > return NULL; > > > > > > > > > > > > > > INIT_LIST_HEAD(&nf->nf_lru); > > > > > > > + INIT_LIST_HEAD(&nf->nf_gc); > > > > > > > nf->nf_birthtime = ktime_get(); > > > > > > > nf->nf_file = NULL; > > > > > > > nf->nf_cred = get_current_cred(); > > > > > > > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct list_head > > > > > > > *dispose) > > > > > > > struct nfsd_file *nf; > > > > > > > > > > > > > > while (!list_empty(dispose)) { > > > > > > > - nf = list_first_entry(dispose, struct > > > > > > > nfsd_file, > > > > > > > nf_lru); > > > > > > > - list_del_init(&nf->nf_lru); > > > > > > > + nf = list_first_entry(dispose, struct > > > > > > > nfsd_file, > > > > > > > nf_gc); > > > > > > > + list_del_init(&nf->nf_gc); > > > > > > > nfsd_file_free(nf); > > > > > > > } > > > > > > > } > > > > > > > @@ -411,12 +412,12 @@ nfsd_file_dispose_list_delayed(struct > > > > > > > list_head > > > > > > > *dispose) > > > > > > > { > > > > > > > while(!list_empty(dispose)) { > > > > > > > struct nfsd_file *nf = > > > > > > > list_first_entry(dispose, > > > > > > > - struct > > > > > > > nfsd_file, > > > > > > > nf_lru); > > > > > > > + struct > > > > > > > nfsd_file, > > > > > > > nf_gc); > > > > > > > struct nfsd_net *nn = net_generic(nf->nf_net, > > > > > > > nfsd_net_id); > > > > > > > struct nfsd_fcache_disposal *l = nn- > > > > > > > > fcache_disposal; > > > > > > > > > > > > > > spin_lock(&l->lock); > > > > > > > - list_move_tail(&nf->nf_lru, &l->freeme); > > > > > > > + list_move_tail(&nf->nf_gc, &l->freeme); > > > > > > > spin_unlock(&l->lock); > > > > > > > svc_wake_up(nn->nfsd_serv); > > > > > > > } > > > > > > > @@ -503,7 +504,8 @@ nfsd_file_lru_cb(struct list_head *item, > > > > > > > struct > > > > > > > list_lru_one *lru, > > > > > > > > > > > > > > /* Refcount went to zero. Unhash it and queue it to > > > > > > > the > > > > > > > dispose list */ > > > > > > > nfsd_file_unhash(nf); > > > > > > > - list_lru_isolate_move(lru, &nf->nf_lru, head); > > > > > > > + list_lru_isolate(lru, &nf->nf_lru); > > > > > > > + list_add(&nf->nf_gc, head); > > > > > > > this_cpu_inc(nfsd_file_evictions); > > > > > > > trace_nfsd_file_gc_disposed(nf); > > > > > > > return LRU_REMOVED; > > > > > > > @@ -578,7 +580,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > > > *nf, struct > > > > > > > list_head *dispose) > > > > > > > > > > > > > > /* If refcount goes to 0, then put on the dispose > > > > > > > list */ > > > > > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > > > > > - list_add(&nf->nf_lru, dispose); > > > > > > > + list_add(&nf->nf_gc, dispose); > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > } > > > > > > > } > > > > > > > @@ -654,8 +656,8 @@ nfsd_file_close_inode_sync(struct inode > > > > > > > *inode) > > > > > > > > > > > > > > nfsd_file_queue_for_close(inode, &dispose); > > > > > > > while (!list_empty(&dispose)) { > > > > > > > - nf = list_first_entry(&dispose, struct > > > > > > > nfsd_file, > > > > > > > nf_lru); > > > > > > > - list_del_init(&nf->nf_lru); > > > > > > > + nf = list_first_entry(&dispose, struct > > > > > > > nfsd_file, > > > > > > > nf_gc); > > > > > > > + list_del_init(&nf->nf_gc); > > > > > > > nfsd_file_free(nf); > > > > > > > } > > > > > > > } > > > > > > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h > > > > > > > index c61884def906..3fbec24eea6c 100644 > > > > > > > --- a/fs/nfsd/filecache.h > > > > > > > +++ b/fs/nfsd/filecache.h > > > > > > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > > > > > > > > > > > struct nfsd_file_mark *nf_mark; > > > > > > > struct list_head nf_lru; > > > > > > > + struct list_head nf_gc; > > > > > > > struct rcu_head nf_rcu; > > > > > > > ktime_t nf_birthtime; > > > > > > > }; > > > > > > > > > > > > > > On Mon, Jul 8, 2024 at 10:55 AM Jeff Layton > > > > > > > <jlayton@kernel.org> > > > > > > > wrote: > > > > > > > > > > > > > > > > On Mon, 2024-07-08 at 10:23 -0400, Youzhong Yang wrote: > > > > > > > > > Thanks Jeff. > > > > > > > > > > > > > > > > > > I am ok with reverting the unhash/dispose list reordering > > > > > > > > > in > > > > > > > > > nfsd_file_lru_cb(), as it doesn't make much of a > > > > > > > > > difference, > > > > > > > > > but for nfsd_file_cond_queue(), imagining this: > > > > > > > > > > > > > > > > > > - A nfsd_file is hashed > > > > > > > > > - In nfsd_file_cond_queue(), [if (!nfsd_file_unhash(nf))] > > > > > > > > > will > > > > > > > > > get it > > > > > > > > > unhashed, doesn't it? > > > > > > > > > - It continues to get a reference by nfsd_file_get() > > > > > > > > > - It continues to remove itself from LRU by > > > > > > > > > nfsd_file_lru_remove() if > > > > > > > > > it is on the LRU. > > > > > > > > > - Now it runs refcount_sub_and_test(), what happens if > > > > > > > > > the refcnt > > > > > > > > > does > > > > > > > > > not go to 0? How can this nfsd_file be found again? > > > > > > > > > Through the > > > > > > > > > hash > > > > > > > > > table? Through the LRU walk? how? > > > > > > > > > > > > > > > > > > Thanks again. > > > > > > > > > > > > > > > > > > -Youzhong > > > > > > > > > > > > > > > > > > > > > > > > > It won't need to be found again. The holders of the extra > > > > > > > > references > > > > > > > > will put those references when they are finished. Since the > > > > > > > > object > > > > > > > > is > > > > > > > > no longer HASHED, nfsd_file_put just does this: > > > > > > > > > > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > > > > > So that should be fine. > > > > > > > > > > > > > > > > > On Mon, Jul 8, 2024 at 9:35 AM Jeff Layton > > > > > > > > > <jlayton@kernel.org> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > On Mon, 2024-07-08 at 08:58 -0400, Youzhong Yang wrote: > > > > > > > > > > > Thank you Jeff for your invaluable insights. I was > > > > > > > > > > > leaning > > > > > > > > > > > towards > > > > > > > > > > > adding a new list_head too, and tested this approach > > > > > > > > > > > on > > > > > > > > > > > kernel > > > > > > > > > > > 6.6 by > > > > > > > > > > > continuously hammering the server with heavy nfs load > > > > > > > > > > > for the > > > > > > > > > > > last few > > > > > > > > > > > days, not a single leak. > > > > > > > > > > > > > > > > > > > > > > Here goes the patch (based on Linux kernel master > > > > > > > > > > > branch), > > > > > > > > > > > please > > > > > > > > > > > review: > > > > > > > > > > > > > > > > > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > > > Date: Thu, 4 Jul 2024 11:25:40 -0400 > > > > > > > > > > > Subject: [PATCH] nfsd: fix nfsd_file leaking due to > > > > > > > > > > > mixed use > > > > > > > > > > > of > > > > > > > > > > > nf->nf_lru > > > > > > > > > > > > > > > > > > > > > > nfsd_file_put() in one thread can race with another > > > > > > > > > > > thread > > > > > > > > > > > doing > > > > > > > > > > > garbage collection (running nfsd_file_gc() -> > > > > > > > > > > > list_lru_walk() > > > > > > > > > > > -> > > > > > > > > > > > nfsd_file_lru_cb()): > > > > > > > > > > > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries > > > > > > > > > > > to do > > > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > > > * nfsd_file_lru_add() returns true (with > > > > > > > > > > > NFSD_FILE_REFERENCED > > > > > > > > > > > bit set) > > > > > > > > > > > * garbage collector kicks in, nfsd_file_lru_cb() > > > > > > > > > > > clears > > > > > > > > > > > REFERENCED bit and > > > > > > > > > > > returns LRU_ROTATE. > > > > > > > > > > > * garbage collector kicks in again, > > > > > > > > > > > nfsd_file_lru_cb() now > > > > > > > > > > > decrements nf->nf_ref > > > > > > > > > > > to 0, runs nfsd_file_unhash(), removes it from > > > > > > > > > > > the LRU > > > > > > > > > > > and > > > > > > > > > > > adds to > > > > > > > > > > > the dispose > > > > > > > > > > > list [list_lru_isolate_move(lru, &nf->nf_lru, > > > > > > > > > > > head)] > > > > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is > > > > > > > > > > > cleared, > > > > > > > > > > > so > > > > > > > > > > > it > > > > > > > > > > > tries to remove > > > > > > > > > > > the 'nf' from the LRU [if > > > > > > > > > > > (!nfsd_file_lru_remove(nf))]. > > > > > > > > > > > The > > > > > > > > > > > 'nf' > > > > > > > > > > > has been added > > > > > > > > > > > to the 'dispose' list by nfsd_file_lru_cb(), so > > > > > > > > > > > nfsd_file_lru_remove(nf) simply > > > > > > > > > > > treats it as part of the LRU and removes it, > > > > > > > > > > > which leads > > > > > > > > > > > to > > > > > > > > > > > its removal from > > > > > > > > > > > the 'dispose' list. > > > > > > > > > > > * At this moment, 'nf' is unhashed with its nf_ref > > > > > > > > > > > being 0, > > > > > > > > > > > and > > > > > > > > > > > not > > > > > > > > > > > on the LRU. > > > > > > > > > > > nfsd_file_put() continues its execution [if > > > > > > > > > > > (refcount_dec_and_test(&nf->nf_ref))], > > > > > > > > > > > as nf->nf_ref is already 0, nf->nf_ref is set to > > > > > > > > > > > REFCOUNT_SATURATED, and the 'nf' > > > > > > > > > > > gets no chance of being freed. > > > > > > > > > > > > > > > > > > > > > > nfsd_file_put() can also race with > > > > > > > > > > > nfsd_file_cond_queue(): > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries > > > > > > > > > > > to do > > > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > > > * nfsd_file_lru_add() sets REFERENCED bit and > > > > > > > > > > > returns true. > > > > > > > > > > > * Some userland application runs 'exportfs -f' or > > > > > > > > > > > something > > > > > > > > > > > like > > > > > > > > > > > that, which triggers > > > > > > > > > > > __nfsd_file_cache_purge() -> > > > > > > > > > > > nfsd_file_cond_queue(). > > > > > > > > > > > * In nfsd_file_cond_queue(), it runs [if > > > > > > > > > > > (!nfsd_file_unhash(nf))], > > > > > > > > > > > unhash is done > > > > > > > > > > > successfully. > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > (!nfsd_file_get(nf))], > > > > > > > > > > > now > > > > > > > > > > > nf->nf_ref goes to 2. > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > (nfsd_file_lru_remove(nf))], > > > > > > > > > > > it succeeds. > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > (refcount_sub_and_test(decrement, > > > > > > > > > > > &nf->nf_ref))] > > > > > > > > > > > (with "decrement" being 2), so the nf->nf_ref > > > > > > > > > > > goes to 0, > > > > > > > > > > > the > > > > > > > > > > > 'nf' > > > > > > > > > > > is added to the > > > > > > > > > > > dispose list [list_add(&nf->nf_lru, dispose)] > > > > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is > > > > > > > > > > > cleared, > > > > > > > > > > > so > > > > > > > > > > > it > > > > > > > > > > > tries to remove > > > > > > > > > > > the 'nf' from the LRU [if > > > > > > > > > > > (!nfsd_file_lru_remove(nf))], > > > > > > > > > > > although > > > > > > > > > > > the 'nf' is not > > > > > > > > > > > in the LRU, but it is linked in the 'dispose' > > > > > > > > > > > list, > > > > > > > > > > > nfsd_file_lru_remove() simply > > > > > > > > > > > treats it as part of the LRU and removes it. This > > > > > > > > > > > leads > > > > > > > > > > > to > > > > > > > > > > > its removal from > > > > > > > > > > > the 'dispose' list! > > > > > > > > > > > * Now nf->ref is 0, unhashed. nfsd_file_put() > > > > > > > > > > > continues its > > > > > > > > > > > execution and set > > > > > > > > > > > nf->nf_ref to REFCOUNT_SATURATED. > > > > > > > > > > > > > > > > > > > > > > As shown in the above analysis, using nf_lru for both > > > > > > > > > > > the LRU > > > > > > > > > > > list and > > > > > > > > > > > dispose list > > > > > > > > > > > can cause the leaks. This patch adds a new list_head > > > > > > > > > > > nf_gc in > > > > > > > > > > > struct > > > > > > > > > > > nfsd_file, and uses > > > > > > > > > > > it for the dispose list. It's not expected to have a > > > > > > > > > > > nfsd_file > > > > > > > > > > > unhashed but it's not > > > > > > > > > > > added to the dispose list, so in > > > > > > > > > > > nfsd_file_cond_queue() and > > > > > > > > > > > nfsd_file_lru_cb() nfsd_file > > > > > > > > > > > is unhashed after being added to the dispose list. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I don't see where we require the object to be either > > > > > > > > > > hashed or > > > > > > > > > > on > > > > > > > > > > the > > > > > > > > > > dispose list. I think you probably just want to do a > > > > > > > > > > patch > > > > > > > > > > that > > > > > > > > > > changes the dispose list to use a dedicated list_head > > > > > > > > > > without > > > > > > > > > > reordering when the these things are unhashed. > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > > > --- > > > > > > > > > > > fs/nfsd/filecache.c | 23 ++++++++++++++--------- > > > > > > > > > > > fs/nfsd/filecache.h | 1 + > > > > > > > > > > > 2 files changed, 15 insertions(+), 9 deletions(-) > > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c > > > > > > > > > > > b/fs/nfsd/filecache.c > > > > > > > > > > > index ad9083ca144b..3aef2ddfce94 100644 > > > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, > > > > > > > > > > > struct > > > > > > > > > > > inode > > > > > > > > > > > *inode, unsigned char need, > > > > > > > > > > > return NULL; > > > > > > > > > > > > > > > > > > > > > > INIT_LIST_HEAD(&nf->nf_lru); > > > > > > > > > > > + INIT_LIST_HEAD(&nf->nf_gc); > > > > > > > > > > > nf->nf_birthtime = ktime_get(); > > > > > > > > > > > nf->nf_file = NULL; > > > > > > > > > > > nf->nf_cred = get_current_cred(); > > > > > > > > > > > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct > > > > > > > > > > > list_head > > > > > > > > > > > *dispose) > > > > > > > > > > > struct nfsd_file *nf; > > > > > > > > > > > > > > > > > > > > > > while (!list_empty(dispose)) { > > > > > > > > > > > - nf = list_first_entry(dispose, struct > > > > > > > > > > > nfsd_file, > > > > > > > > > > > nf_lru); > > > > > > > > > > > - list_del_init(&nf->nf_lru); > > > > > > > > > > > + nf = list_first_entry(dispose, struct > > > > > > > > > > > nfsd_file, > > > > > > > > > > > nf_gc); > > > > > > > > > > > + list_del_init(&nf->nf_gc); > > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > } > > > > > > > > > > > } > > > > > > > > > > > @@ -411,12 +412,12 @@ > > > > > > > > > > > nfsd_file_dispose_list_delayed(struct > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > { > > > > > > > > > > > while(!list_empty(dispose)) { > > > > > > > > > > > struct nfsd_file *nf = > > > > > > > > > > > list_first_entry(dispose, > > > > > > > > > > > - > > > > > > > > > > > struct > > > > > > > > > > > nfsd_file, > > > > > > > > > > > nf_lru); > > > > > > > > > > > + > > > > > > > > > > > struct > > > > > > > > > > > nfsd_file, > > > > > > > > > > > nf_gc); > > > > > > > > > > > struct nfsd_net *nn = net_generic(nf- > > > > > > > > > > > > nf_net, > > > > > > > > > > > nfsd_net_id); > > > > > > > > > > > struct nfsd_fcache_disposal *l = nn- > > > > > > > > > > > > fcache_disposal; > > > > > > > > > > > > > > > > > > > > > > spin_lock(&l->lock); > > > > > > > > > > > - list_move_tail(&nf->nf_lru, &l- > > > > > > > > > > > > freeme); > > > > > > > > > > > + list_move_tail(&nf->nf_gc, &l- > > > > > > > > > > > > freeme); > > > > > > > > > > > spin_unlock(&l->lock); > > > > > > > > > > > svc_wake_up(nn->nfsd_serv); > > > > > > > > > > > } > > > > > > > > > > > @@ -502,8 +503,10 @@ nfsd_file_lru_cb(struct > > > > > > > > > > > list_head *item, > > > > > > > > > > > struct > > > > > > > > > > > list_lru_one *lru, > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > /* Refcount went to zero. Unhash it and queue > > > > > > > > > > > it to > > > > > > > > > > > the > > > > > > > > > > > dispose list */ > > > > > > > > > > > + list_lru_isolate(lru, &nf->nf_lru); > > > > > > > > > > > + list_add(&nf->nf_gc, head); > > > > > > > > > > > + /* Unhash after removing from LRU and adding > > > > > > > > > > > to > > > > > > > > > > > dispose > > > > > > > > > > > list */ > > > > > > > > > > > nfsd_file_unhash(nf); > > > > > > > > > > > - list_lru_isolate_move(lru, &nf->nf_lru, > > > > > > > > > > > head); > > > > > > > > > > > > > > > > > > > > I don't see the point in reordering these operations. > > > > > > > > > > Hashing > > > > > > > > > > is > > > > > > > > > > all > > > > > > > > > > about making the thing findable by nfsd operations. The > > > > > > > > > > _last_ > > > > > > > > > > thing we > > > > > > > > > > want to do is put it on the dispose list while the > > > > > > > > > > thing can > > > > > > > > > > still > > > > > > > > > > be > > > > > > > > > > found by nfsd threads doing operations. > > > > > > > > > > > > > > > > > > > > > this_cpu_inc(nfsd_file_evictions); > > > > > > > > > > > trace_nfsd_file_gc_disposed(nf); > > > > > > > > > > > return LRU_REMOVED; > > > > > > > > > > > @@ -565,7 +568,7 @@ nfsd_file_cond_queue(struct > > > > > > > > > > > nfsd_file > > > > > > > > > > > *nf, > > > > > > > > > > > struct > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > > > > > > > /* If we raced with someone else unhashing, > > > > > > > > > > > ignore it > > > > > > > > > > > */ > > > > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf- > > > > > > > > > > > > nf_flags)) > > > > > > > > > > > > > > > > > > > > The above change looks wrong. I don't think we need to > > > > > > > > > > change > > > > > > > > > > this. > > > > > > > > > > > > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > > > > > /* If we can't get a reference, ignore it */ > > > > > > > > > > > @@ -578,7 +581,9 @@ nfsd_file_cond_queue(struct > > > > > > > > > > > nfsd_file > > > > > > > > > > > *nf, > > > > > > > > > > > struct > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > > > > > > > > > /* If refcount goes to 0, then put on the > > > > > > > > > > > dispose > > > > > > > > > > > list */ > > > > > > > > > > > if (refcount_sub_and_test(decrement, &nf- > > > > > > > > > > > > nf_ref)) { > > > > > > > > > > > - list_add(&nf->nf_lru, dispose); > > > > > > > > > > > + list_add(&nf->nf_gc, dispose); > > > > > > > > > > > + /* Unhash after adding to dispose > > > > > > > > > > > list */ > > > > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > > > > > > > > > > > > This too looks wrong? Maybe I'm unclear on the race > > > > > > > > > > you're > > > > > > > > > > trying > > > > > > > > > > to > > > > > > > > > > fix with this? What's the harm in unhashing it early? > > > > > > > > > > > > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > > > } > > > > > > > > > > > } > > > > > > > > > > > @@ -654,8 +659,8 @@ nfsd_file_close_inode_sync(struct > > > > > > > > > > > inode > > > > > > > > > > > *inode) > > > > > > > > > > > > > > > > > > > > > > nfsd_file_queue_for_close(inode, &dispose); > > > > > > > > > > > while (!list_empty(&dispose)) { > > > > > > > > > > > - nf = list_first_entry(&dispose, > > > > > > > > > > > struct > > > > > > > > > > > nfsd_file, > > > > > > > > > > > nf_lru); > > > > > > > > > > > - list_del_init(&nf->nf_lru); > > > > > > > > > > > + nf = list_first_entry(&dispose, > > > > > > > > > > > struct > > > > > > > > > > > nfsd_file, > > > > > > > > > > > nf_gc); > > > > > > > > > > > + list_del_init(&nf->nf_gc); > > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > } > > > > > > > > > > > } > > > > > > > > > > > diff --git a/fs/nfsd/filecache.h > > > > > > > > > > > b/fs/nfsd/filecache.h > > > > > > > > > > > index c61884def906..3fbec24eea6c 100644 > > > > > > > > > > > --- a/fs/nfsd/filecache.h > > > > > > > > > > > +++ b/fs/nfsd/filecache.h > > > > > > > > > > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > > > > > > > > > > > > > > > > > > > struct nfsd_file_mark *nf_mark; > > > > > > > > > > > struct list_head nf_lru; > > > > > > > > > > > + struct list_head nf_gc; > > > > > > > > > > > struct rcu_head nf_rcu; > > > > > > > > > > > ktime_t nf_birthtime; > > > > > > > > > > > }; > > > > > > > > > > > -- > > > > > > > > > > > 2.34.1 > > > > > > > > > > > > > > > > > > > > > > On Thu, Jul 4, 2024 at 7:14 AM Jeff Layton > > > > > > > > > > > <jlayton@poochiereds.net> wrote: > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 2024-07-03 at 16:46 -0400, Youzhong Yang > > > > > > > > > > > > wrote: > > > > > > > > > > > > > Thank you Chuck. Here are my quick answers to > > > > > > > > > > > > > your > > > > > > > > > > > > > comments: > > > > > > > > > > > > > > > > > > > > > > > > > > - I don't have a quick reproducer. I reproduced > > > > > > > > > > > > > it by > > > > > > > > > > > > > using > > > > > > > > > > > > > hundreds > > > > > > > > > > > > > of nfs clients generating +600K ops under our > > > > > > > > > > > > > workload in > > > > > > > > > > > > > the > > > > > > > > > > > > > testing > > > > > > > > > > > > > environment. Theoretically it should be possible > > > > > > > > > > > > > to > > > > > > > > > > > > > simplify > > > > > > > > > > > > > the > > > > > > > > > > > > > reproduction but I am still working on it. > > > > > > > > > > > > > > > > > > > > > > > > > > - I understand zfs is an out-of-tree file > > > > > > > > > > > > > system. That's > > > > > > > > > > > > > fine. But > > > > > > > > > > > > > this leaking can happen to any file system, and > > > > > > > > > > > > > leaking > > > > > > > > > > > > > is > > > > > > > > > > > > > not a good > > > > > > > > > > > > > thing no matter what file system it is. > > > > > > > > > > > > > > > > > > > > > > > > > > - I will try to come up with a reproducer using > > > > > > > > > > > > > xfs or > > > > > > > > > > > > > btrfs > > > > > > > > > > > > > if possible. > > > > > > > > > > > > > > > > > > > > > > > > > > Now back to the problem itself, here are my > > > > > > > > > > > > > findings: > > > > > > > > > > > > > > > > > > > > > > > > > > - nfsd_file_put() in one thread can race with > > > > > > > > > > > > > another > > > > > > > > > > > > > thread > > > > > > > > > > > > > doing > > > > > > > > > > > > > garbage collection (running nfsd_file_gc() -> > > > > > > > > > > > > > list_lru_walk() > > > > > > > > > > > > > -> > > > > > > > > > > > > > nfsd_file_lru_cb()): > > > > > > > > > > > > > > > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it > > > > > > > > > > > > > tries to > > > > > > > > > > > > > do > > > > > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > > > > > * nfsd_file_lru_add() returns true (thus > > > > > > > > > > > > > NFSD_FILE_REFERENCED bit > > > > > > > > > > > > > set for nf->nf_flags) > > > > > > > > > > > > > * garbage collector kicks in, > > > > > > > > > > > > > nfsd_file_lru_cb() clears > > > > > > > > > > > > > REFERENCED > > > > > > > > > > > > > bit and returns LRU_ROTATE. > > > > > > > > > > > > > * garbage collector kicks in again, > > > > > > > > > > > > > nfsd_file_lru_cb() > > > > > > > > > > > > > now > > > > > > > > > > > > > decrements nf->nf_ref to 0, runs > > > > > > > > > > > > > nfsd_file_unhash(), > > > > > > > > > > > > > removes > > > > > > > > > > > > > it from > > > > > > > > > > > > > the LRU and adds to the dispose list > > > > > > > > > > > > > [list_lru_isolate_move(lru, > > > > > > > > > > > > > &nf->nf_lru, head);] > > > > > > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit > > > > > > > > > > > > > is > > > > > > > > > > > > > cleared, > > > > > > > > > > > > > so it > > > > > > > > > > > > > tries to remove the 'nf' from the LRU [if > > > > > > > > > > > > > (!nfsd_file_lru_remove(nf))] > > > > > > > > > > > > > * The 'nf' has been added to the 'dispose' list > > > > > > > > > > > > > by > > > > > > > > > > > > > nfsd_file_lru_cb(), so nfsd_file_lru_remove(nf) > > > > > > > > > > > > > simply > > > > > > > > > > > > > treats > > > > > > > > > > > > > it as > > > > > > > > > > > > > part of the LRU and removes it, which leads it to > > > > > > > > > > > > > be > > > > > > > > > > > > > removed > > > > > > > > > > > > > from the > > > > > > > > > > > > > 'dispose' list. > > > > > > > > > > > > > * At this moment, nf->nf_ref is 0, it's > > > > > > > > > > > > > unhashed, and > > > > > > > > > > > > > not > > > > > > > > > > > > > on the > > > > > > > > > > > > > LRU. nfsd_file_put() continues its execution [if > > > > > > > > > > > > > (refcount_dec_and_test(&nf->nf_ref))], as nf- > > > > > > > > > > > > > > nf_ref is > > > > > > > > > > > > > already 0, now > > > > > > > > > > > > > bad thing happens: nf->nf_ref is set to > > > > > > > > > > > > > REFCOUNT_SATURATED, > > > > > > > > > > > > > and the > > > > > > > > > > > > > 'nf' is leaked. > > > > > > > > > > > > > > > > > > > > > > > > > > To make this happen, the right timing is crucial. > > > > > > > > > > > > > It can > > > > > > > > > > > > > be > > > > > > > > > > > > > reproduced > > > > > > > > > > > > > by adding artifical delays in filecache.c, or > > > > > > > > > > > > > hammering > > > > > > > > > > > > > the > > > > > > > > > > > > > nfsd with > > > > > > > > > > > > > tons of ops. > > > > > > > > > > > > > > > > > > > > > > > > > > - Let's see how nfsd_file_put() can race with > > > > > > > > > > > > > nfsd_file_cond_queue(): > > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it > > > > > > > > > > > > > tries to > > > > > > > > > > > > > do > > > > > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > > > > > * nfsd_file_lru_add() sets REFERENCED bit and > > > > > > > > > > > > > returns > > > > > > > > > > > > > true. > > > > > > > > > > > > > * 'exportfs -f' or something like that triggers > > > > > > > > > > > > > __nfsd_file_cache_purge() -> > > > > > > > > > > > > > nfsd_file_cond_queue(). > > > > > > > > > > > > > * In nfsd_file_cond_queue(), it runs [if > > > > > > > > > > > > > (!nfsd_file_unhash(nf))], > > > > > > > > > > > > > unhash is done successfully. > > > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > > > (!nfsd_file_get(nf))], > > > > > > > > > > > > > now > > > > > > > > > > > > > nf->nf_ref goes to 2. > > > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > > > (nfsd_file_lru_remove(nf))], it succeeds. > > > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > > > (refcount_sub_and_test(decrement, > > > > > > > > > > > > > &nf->nf_ref))] (with "decrement" being 2), so the > > > > > > > > > > > > > nf- > > > > > > > > > > > > > > nf_ref > > > > > > > > > > > > > goes to > > > > > > > > > > > > > 0, the 'nf' is added to the dispost list > > > > > > > > > > > > > [list_add(&nf- > > > > > > > > > > > > > > nf_lru, > > > > > > > > > > > > > dispose)] > > > > > > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit > > > > > > > > > > > > > is > > > > > > > > > > > > > cleared, > > > > > > > > > > > > > so it > > > > > > > > > > > > > tries to remove the 'nf' from the LRU [if > > > > > > > > > > > > > (!nfsd_file_lru_remove(nf))], although the 'nf' > > > > > > > > > > > > > is not in > > > > > > > > > > > > > the > > > > > > > > > > > > > LRU, but > > > > > > > > > > > > > it is linked in the 'dispose' list, > > > > > > > > > > > > > nfsd_file_lru_remove() > > > > > > > > > > > > > simply > > > > > > > > > > > > > treats it as part of the LRU and removes it. This > > > > > > > > > > > > > leads > > > > > > > > > > > > > to > > > > > > > > > > > > > its removal > > > > > > > > > > > > > from the 'dispose' list! > > > > > > > > > > > > > * Now nf->ref is 0, unhashed. nfsd_file_put() > > > > > > > > > > > > > continues > > > > > > > > > > > > > its > > > > > > > > > > > > > execution and sets nf->nf_ref to > > > > > > > > > > > > > REFCOUNT_SATURATED. > > > > > > > > > > > > > > > > > > > > > > > > > > The purpose of nf->nf_lru is problematic. As you > > > > > > > > > > > > > can see, > > > > > > > > > > > > > it > > > > > > > > > > > > > is used > > > > > > > > > > > > > for the LRU list, and also the 'dispose' list. > > > > > > > > > > > > > Adding > > > > > > > > > > > > > another > > > > > > > > > > > > > 'struct > > > > > > > > > > > > > list_head' specifically for the 'dispose' list > > > > > > > > > > > > > seems to > > > > > > > > > > > > > be a > > > > > > > > > > > > > better > > > > > > > > > > > > > way of fixing this race condition. Either way > > > > > > > > > > > > > works for > > > > > > > > > > > > > me. > > > > > > > > > > > > > > > > > > > > > > > > > > Would you agree my above analysis makes sense? > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think so. It's been a while since I've done much > > > > > > > > > > > > work in > > > > > > > > > > > > this > > > > > > > > > > > > code, > > > > > > > > > > > > but it does sound like there is a race in the LRU > > > > > > > > > > > > handling. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Like Chuck said, the nf->nf_lru list should be safe > > > > > > > > > > > > to use > > > > > > > > > > > > for > > > > > > > > > > > > multiple > > > > > > > > > > > > purposes, but that's only the case if we're not > > > > > > > > > > > > using that > > > > > > > > > > > > list > > > > > > > > > > > > as an > > > > > > > > > > > > indicator. > > > > > > > > > > > > > > > > > > > > > > > > The list_lru code does check this: > > > > > > > > > > > > > > > > > > > > > > > > if (!list_empty(item)) { > > > > > > > > > > > > > > > > > > > > > > > > ...so if we ever check this while it's sitting on > > > > > > > > > > > > the > > > > > > > > > > > > dispose > > > > > > > > > > > > list, it > > > > > > > > > > > > will handle it incorrectly. It sounds like that's > > > > > > > > > > > > the root > > > > > > > > > > > > cause of the > > > > > > > > > > > > problem you're seeing? > > > > > > > > > > > > > > > > > > > > > > > > If so, then maybe a separate list_head for disposal > > > > > > > > > > > > would > > > > > > > > > > > > be > > > > > > > > > > > > better. > > > > > > > > > > > > > > > > > > > > > > > > > Here is my patch with signed-off-by: > > > > > > > > > > > > > > > > > > > > > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > > > > > Date: Mon, 1 Jul 2024 06:45:22 -0400 > > > > > > > > > > > > > Subject: [PATCH] nfsd: fix nfsd_file leaking due > > > > > > > > > > > > > to race > > > > > > > > > > > > > condition and early > > > > > > > > > > > > > unhash > > > > > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > > > > > --- > > > > > > > > > > > > > fs/nfsd/filecache.c | 14 +++++++++++++- > > > > > > > > > > > > > 1 file changed, 13 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c > > > > > > > > > > > > > b/fs/nfsd/filecache.c > > > > > > > > > > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct > > > > > > > > > > > > > nfsd_file *nf) > > > > > > > > > > > > > if > > > > > > > > > > > > > (!nfsd_file_lru_remove(nf)) > > > > > > > > > > > > > return; > > > > > > > > > > > > > } > > > > > > > > > > > > > + /* > > > > > > > > > > > > > + * Racing with > > > > > > > > > > > > > nfsd_file_cond_queue() or > > > > > > > > > > > > > nfsd_file_lru_cb(), > > > > > > > > > > > > > + * it's unhashed but then removed > > > > > > > > > > > > > from > > > > > > > > > > > > > the > > > > > > > > > > > > > dispose list, > > > > > > > > > > > > > + * so we need to free it. > > > > > > > > > > > > > + */ > > > > > > > > > > > > > + if (refcount_read(&nf->nf_ref) == > > > > > > > > > > > > > 0 && > > > > > > > > > > > > > > > > > > > > > > > > A refcount_read in this path is a red flag to me. > > > > > > > > > > > > Anytime > > > > > > > > > > > > you're just > > > > > > > > > > > > looking at the refcount without changing anything > > > > > > > > > > > > just > > > > > > > > > > > > screams > > > > > > > > > > > > out > > > > > > > > > > > > "race condition". > > > > > > > > > > > > > > > > > > > > > > > > In this case, what guarantee is there that this > > > > > > > > > > > > won't run > > > > > > > > > > > > afoul > > > > > > > > > > > > of the > > > > > > > > > > > > timing? We could check this and find out it's 1 > > > > > > > > > > > > just before > > > > > > > > > > > > it > > > > > > > > > > > > goes to > > > > > > > > > > > > 0 and you check the other conditions. > > > > > > > > > > > > > > > > > > > > > > > > Does anything prevent that? > > > > > > > > > > > > > > > > > > > > > > > > > + !test_bit(NFSD_FILE_HASHED, > > > > > > > > > > > > > &nf- > > > > > > > > > > > > > > nf_flags) && > > > > > > > > > > > > > + list_empty(&nf->nf_lru)) { > > > > > > > > > > > > > + nfsd_file_free(nf); > > > > > > > > > > > > > + return; > > > > > > > > > > > > > + } > > > > > > > > > > > > > } > > > > > > > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > *nf, struct > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we raced with someone else > > > > > > > > > > > > > unhashing, > > > > > > > > > > > > > ignore it > > > > > > > > > > > > > */ > > > > > > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf- > > > > > > > > > > > > > > nf_flags)) > > > > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > > > > > > > Same here: you're just testing for the HASHED bit, > > > > > > > > > > > > but > > > > > > > > > > > > could > > > > > > > > > > > > this also > > > > > > > > > > > > race with someone who is setting it just after you > > > > > > > > > > > > get > > > > > > > > > > > > here. > > > > > > > > > > > > Why is > > > > > > > > > > > > that not a problem? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we can't get a reference, ignore it > > > > > > > > > > > > > */ > > > > > > > > > > > > > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > *nf, struct > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > /* If refcount goes to 0, then put on the > > > > > > > > > > > > > dispose > > > > > > > > > > > > > list */ > > > > > > > > > > > > > if (refcount_sub_and_test(decrement, &nf- > > > > > > > > > > > > > > nf_ref)) { > > > > > > > > > > > > > list_add(&nf->nf_lru, dispose); > > > > > > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > > > > > } > > > > > > > > > > > > > } > > > > > > > > > > > > > -- > > > > > > > > > > > > > 2.43.0 > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Jul 3, 2024 at 2:21 PM Chuck Lever > > > > > > > > > > > > > <chuck.lever@oracle.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Jul 03, 2024 at 10:12:33AM -0400, > > > > > > > > > > > > > > Youzhong Yang > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > Hello, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'd like to report a nfsd_file leaking issue > > > > > > > > > > > > > > > and > > > > > > > > > > > > > > > propose > > > > > > > > > > > > > > > a fix for it. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > When I tested Linux kernel 6.8 and 6.6, I > > > > > > > > > > > > > > > noticed > > > > > > > > > > > > > > > nfsd_file leaks > > > > > > > > > > > > > > > which led to undestroyable file systems > > > > > > > > > > > > > > > (zfs), > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the report. Some initial comments: > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Do you have a specific reproducer? In other > > > > > > > > > > > > > > words, > > > > > > > > > > > > > > what > > > > > > > > > > > > > > is the > > > > > > > > > > > > > > simplest program that can run on an NFS > > > > > > > > > > > > > > client that > > > > > > > > > > > > > > will > > > > > > > > > > > > > > trigger > > > > > > > > > > > > > > this leak, and can you post it? > > > > > > > > > > > > > > > > > > > > > > > > > > > > - "zfs" is an out-of-tree file system, so it's > > > > > > > > > > > > > > not > > > > > > > > > > > > > > directly > > > > > > > > > > > > > > supported for NFSD. > > > > > > > > > > > > > > > > > > > > > > > > > > > > - The guidelines for patch submission require > > > > > > > > > > > > > > us to fix > > > > > > > > > > > > > > issues in > > > > > > > > > > > > > > upstream Linux first (currently that's v6.10- > > > > > > > > > > > > > > rc6). > > > > > > > > > > > > > > Then > > > > > > > > > > > > > > that fix > > > > > > > > > > > > > > can be backported to older stable kernels > > > > > > > > > > > > > > like 6.6. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Can you reproduce the leak with one of the in- > > > > > > > > > > > > > > kernel > > > > > > > > > > > > > > filesystems > > > > > > > > > > > > > > (either xfs or btrfs would be great) and with > > > > > > > > > > > > > > NFSD in > > > > > > > > > > > > > > 6.10- > > > > > > > > > > > > > > rc6? > > > > > > > > > > > > > > > > > > > > > > > > > > > > One more comment below. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > here are some examples: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct nfsd_file -x ffff88e160db0460 > > > > > > > > > > > > > > > struct nfsd_file { > > > > > > > > > > > > > > > nf_rlist = { > > > > > > > > > > > > > > > rhead = { > > > > > > > > > > > > > > > next = 0xffff8921fa2392f1 > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > next = 0x0 > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > nf_inode = 0xffff8882bc312ef8, > > > > > > > > > > > > > > > nf_file = 0xffff88e2015b1500, > > > > > > > > > > > > > > > nf_cred = 0xffff88e3ab0e7800, > > > > > > > > > > > > > > > nf_net = 0xffffffff83d41600 <init_net>, > > > > > > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > > > > > > nf_ref = { > > > > > > > > > > > > > > > refs = { > > > > > > > > > > > > > > > counter = 0xc0000000 > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > nf_may = 0x4, > > > > > > > > > > > > > > > nf_mark = 0xffff88e1bddfb320, > > > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > > > next = 0xffff88e160db04a8, > > > > > > > > > > > > > > > prev = 0xffff88e160db04a8 > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > nf_rcu = { > > > > > > > > > > > > > > > next = 0x10000000000, > > > > > > > > > > > > > > > func = 0x0 > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > nf_birthtime = 0x73d22fc1728 > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru > > > > > > > > > > > > > > > ,nf_file > > > > > > > > > > > > > > > -x > > > > > > > > > > > > > > > ffff88839a53d850 > > > > > > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > > > next = 0xffff88839a53d898, > > > > > > > > > > > > > > > prev = 0xffff88839a53d898 > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > nf_file = 0xffff88810ede8700, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru > > > > > > > > > > > > > > > ,nf_file > > > > > > > > > > > > > > > -x > > > > > > > > > > > > > > > ffff88c32b11e850 > > > > > > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > > > next = 0xffff88c32b11e898, > > > > > > > > > > > > > > > prev = 0xffff88c32b11e898 > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > nf_file = 0xffff88c20a701c00, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru > > > > > > > > > > > > > > > ,nf_file > > > > > > > > > > > > > > > -x > > > > > > > > > > > > > > > ffff88e372709700 > > > > > > > > > > > > > > > nf_flags = 0xc, > > > > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > > > next = 0xffff88e372709748, > > > > > > > > > > > > > > > prev = 0xffff88e372709748 > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > nf_file = 0xffff88e0725e6400, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru > > > > > > > > > > > > > > > ,nf_file > > > > > > > > > > > > > > > -x > > > > > > > > > > > > > > > ffff8982864944d0 > > > > > > > > > > > > > > > nf_flags = 0xc, > > > > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > > > next = 0xffff898286494518, > > > > > > > > > > > > > > > prev = 0xffff898286494518 > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > nf_file = 0xffff89803c0ff700, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The leak occurs when nfsd_file_put() races > > > > > > > > > > > > > > > with > > > > > > > > > > > > > > > nfsd_file_cond_queue() > > > > > > > > > > > > > > > or nfsd_file_lru_cb(). With the following > > > > > > > > > > > > > > > patch, I > > > > > > > > > > > > > > > haven't observed > > > > > > > > > > > > > > > any leak after a few days heavy nfs load: > > > > > > > > > > > > > > > > > > > > > > > > > > > > Our patch submission guidelines require a > > > > > > > > > > > > > > Signed-off- > > > > > > > > > > > > > > by: > > > > > > > > > > > > > > line at the end of the patch description. See > > > > > > > > > > > > > > the "Sign > > > > > > > > > > > > > > your work - > > > > > > > > > > > > > > the Developer's Certificate of Origin" section > > > > > > > > > > > > > > of > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.10-rc6 > > > > > > > > > > > > > > > > > > > > > > > > > > > > (Needed here in case your fix is acceptable). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c > > > > > > > > > > > > > > > b/fs/nfsd/filecache.c > > > > > > > > > > > > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > > > > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > > > > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct > > > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > > > *nf) > > > > > > > > > > > > > > > if (!nfsd_file_lru_remove(nf)) > > > > > > > > > > > > > > > return; > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > + /* > > > > > > > > > > > > > > > + * Racing with nfsd_file_cond_queue() or > > > > > > > > > > > > > > > nfsd_file_lru_cb(), > > > > > > > > > > > > > > > + * it's unhashed but then removed from the > > > > > > > > > > > > > > > dispose > > > > > > > > > > > > > > > list, > > > > > > > > > > > > > > > + * so we need to free it. > > > > > > > > > > > > > > > + */ > > > > > > > > > > > > > > > + if (refcount_read(&nf->nf_ref) == 0 && > > > > > > > > > > > > > > > + !test_bit(NFSD_FILE_HASHED, &nf- > > > > > > > > > > > > > > > > nf_flags) && > > > > > > > > > > > > > > > + list_empty(&nf->nf_lru)) { > > > > > > > > > > > > > > > + nfsd_file_free(nf); > > > > > > > > > > > > > > > + return; > > > > > > > > > > > > > > > + } > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > > > > @@ -576,7 +587,7 @@ > > > > > > > > > > > > > > > nfsd_file_cond_queue(struct > > > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > > > *nf, struct > > > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we raced with someone else unhashing, > > > > > > > > > > > > > > > ignore > > > > > > > > > > > > > > > it > > > > > > > > > > > > > > > */ > > > > > > > > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf- > > > > > > > > > > > > > > > > nf_flags)) > > > > > > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we can't get a reference, ignore it > > > > > > > > > > > > > > > */ > > > > > > > > > > > > > > > @@ -590,6 +601,7 @@ > > > > > > > > > > > > > > > nfsd_file_cond_queue(struct > > > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > > > *nf, struct > > > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > > /* If refcount goes to 0, then put on the > > > > > > > > > > > > > > > dispose > > > > > > > > > > > > > > > list > > > > > > > > > > > > > > > */ > > > > > > > > > > > > > > > if (refcount_sub_and_test(decrement, &nf- > > > > > > > > > > > > > > > > nf_ref)) > > > > > > > > > > > > > > > { > > > > > > > > > > > > > > > list_add(&nf->nf_lru, dispose); > > > > > > > > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Please kindly review the patch and let me > > > > > > > > > > > > > > > know if it > > > > > > > > > > > > > > > makes sense. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -Youzhong > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > Chuck Lever > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > Jeff Layton <jlayton@poochiereds.net> > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > Jeff Layton <jlayton@poochiereds.net> > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > > > > > > > > > > > -- > > > > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > > > > > > > > > > > > > > -- > > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > > > -- > > > > Jeff Layton <jlayton@kernel.org> > > > > -- > > Jeff Layton <jlayton@kernel.org> >
I'd like to report that your patches + the "list_head nf_gc" patch together have been tested under heavy nfs load generated by machines in our testing farm, no leak has been observed. Surely my reproducer can no longer reproduce the issue once the patches are applied. FYI here is the reproducer: https://github.com/youzhongyang/nfsd-file-leaks On Wed, Jul 10, 2024 at 4:22 PM Jeff Layton <jlayton@kernel.org> wrote: > > Great! If you're able to test the patches I sent earlier today, please > let us know if they help with your reproducer. > > > On Wed, 2024-07-10 at 10:49 -0400, Youzhong Yang wrote: > > Thanks Jeff. Just sent out a separate e-mail. > > > > I have a simple reproducer now, it's a C program, I will upload it to > > a github repository together with a readme file and then post the link > > (if allowed) here. > > > > On Wed, Jul 10, 2024 at 9:40 AM Jeff Layton <jlayton@kernel.org> wrote: > > > > > > Can you send it as a separate [PATCH] email? I think your MUA is > > > mangling the patches as I couldn't apply it earlier. Either use git- > > > send-email or have a look over this page: > > > > > > > > > https://www.kernel.org/doc/html/latest/process/email-clients.html#email-clients > > > > > > Thanks! > > > Jeff > > > > > > On Wed, 2024-07-10 at 09:33 -0400, Youzhong Yang wrote: > > > > Thanks. Here it is again: > > > > > > > > commit c6e69cebc052cb82d91fc81a98aee20749fe8d47 > > > > Author: Youzhong Yang <youzhong@gmail.com> > > > > Date: Thu Jul 4 11:25:40 2024 -0400 > > > > > > > > nfsd: fix nfsd_file leaking due to mixed use of nf->nf_lru > > > > > > > > nfsd_file_put() in one thread can race with another thread doing > > > > garbage collection (running nfsd_file_gc() -> list_lru_walk() -> > > > > nfsd_file_lru_cb()): > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do > > > > nfsd_file_lru_add(). > > > > * nfsd_file_lru_add() returns true (with NFSD_FILE_REFERENCED > > > > bit set) > > > > * garbage collector kicks in, nfsd_file_lru_cb() clears > > > > REFERENCED bit and > > > > returns LRU_ROTATE. > > > > * garbage collector kicks in again, nfsd_file_lru_cb() now > > > > decrements nf->nf_ref > > > > to 0, runs nfsd_file_unhash(), removes it from the LRU and > > > > adds to the dispose > > > > list [list_lru_isolate_move(lru, &nf->nf_lru, head)] > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so > > > > it > > > > tries to remove > > > > the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))]. The > > > > 'nf' has been added > > > > to the 'dispose' list by nfsd_file_lru_cb(), so > > > > nfsd_file_lru_remove(nf) simply > > > > treats it as part of the LRU and removes it, which leads to > > > > its removal from > > > > the 'dispose' list. > > > > * At this moment, 'nf' is unhashed with its nf_ref being 0, and > > > > not on the LRU. > > > > nfsd_file_put() continues its execution [if > > > > (refcount_dec_and_test(&nf->nf_ref))], > > > > as nf->nf_ref is already 0, nf->nf_ref is set to > > > > REFCOUNT_SATURATED, and the 'nf' > > > > gets no chance of being freed. > > > > > > > > nfsd_file_put() can also race with nfsd_file_cond_queue(): > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do > > > > nfsd_file_lru_add(). > > > > * nfsd_file_lru_add() sets REFERENCED bit and returns true. > > > > * Some userland application runs 'exportfs -f' or something > > > > like > > > > that, which triggers > > > > __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). > > > > * In nfsd_file_cond_queue(), it runs [if > > > > (!nfsd_file_unhash(nf))], unhash is done > > > > successfully. > > > > * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], now > > > > nf->nf_ref goes to 2. > > > > * nfsd_file_cond_queue() runs [if (nfsd_file_lru_remove(nf))], > > > > it succeeds. > > > > * nfsd_file_cond_queue() runs [if > > > > (refcount_sub_and_test(decrement, &nf->nf_ref))] > > > > (with "decrement" being 2), so the nf->nf_ref goes to 0, the > > > > 'nf' is added to the > > > > dispose list [list_add(&nf->nf_lru, dispose)] > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so > > > > it > > > > tries to remove > > > > the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))], > > > > although the 'nf' is not > > > > in the LRU, but it is linked in the 'dispose' list, > > > > nfsd_file_lru_remove() simply > > > > treats it as part of the LRU and removes it. This leads to > > > > its > > > > removal from > > > > the 'dispose' list! > > > > * Now nf->ref is 0, unhashed. nfsd_file_put() continues its > > > > execution and set > > > > nf->nf_ref to REFCOUNT_SATURATED. > > > > > > > > As shown in the above analysis, using nf_lru for both the LRU > > > > list > > > > and dispose list > > > > can cause the leaks. This patch adds a new list_head nf_gc in > > > > struct nfsd_file, and uses > > > > it for the dispose list. > > > > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > index ad9083ca144b..22ebd7fb8639 100644 > > > > --- a/fs/nfsd/filecache.c > > > > +++ b/fs/nfsd/filecache.c > > > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, struct inode > > > > *inode, unsigned char need, > > > > return NULL; > > > > > > > > INIT_LIST_HEAD(&nf->nf_lru); > > > > + INIT_LIST_HEAD(&nf->nf_gc); > > > > nf->nf_birthtime = ktime_get(); > > > > nf->nf_file = NULL; > > > > nf->nf_cred = get_current_cred(); > > > > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct list_head *dispose) > > > > struct nfsd_file *nf; > > > > > > > > while (!list_empty(dispose)) { > > > > - nf = list_first_entry(dispose, struct nfsd_file, > > > > nf_lru); > > > > - list_del_init(&nf->nf_lru); > > > > + nf = list_first_entry(dispose, struct nfsd_file, > > > > nf_gc); > > > > + list_del_init(&nf->nf_gc); > > > > nfsd_file_free(nf); > > > > } > > > > } > > > > @@ -411,12 +412,12 @@ nfsd_file_dispose_list_delayed(struct list_head > > > > *dispose) > > > > { > > > > while(!list_empty(dispose)) { > > > > struct nfsd_file *nf = list_first_entry(dispose, > > > > - struct nfsd_file, > > > > nf_lru); > > > > + struct nfsd_file, > > > > nf_gc); > > > > struct nfsd_net *nn = net_generic(nf->nf_net, > > > > nfsd_net_id); > > > > struct nfsd_fcache_disposal *l = nn->fcache_disposal; > > > > > > > > spin_lock(&l->lock); > > > > - list_move_tail(&nf->nf_lru, &l->freeme); > > > > + list_move_tail(&nf->nf_gc, &l->freeme); > > > > spin_unlock(&l->lock); > > > > svc_wake_up(nn->nfsd_serv); > > > > } > > > > @@ -503,7 +504,8 @@ nfsd_file_lru_cb(struct list_head *item, struct > > > > list_lru_one *lru, > > > > > > > > /* Refcount went to zero. Unhash it and queue it to the > > > > dispose list */ > > > > nfsd_file_unhash(nf); > > > > - list_lru_isolate_move(lru, &nf->nf_lru, head); > > > > + list_lru_isolate(lru, &nf->nf_lru); > > > > + list_add(&nf->nf_gc, head); > > > > this_cpu_inc(nfsd_file_evictions); > > > > trace_nfsd_file_gc_disposed(nf); > > > > return LRU_REMOVED; > > > > @@ -578,7 +580,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct > > > > list_head *dispose) > > > > > > > > /* If refcount goes to 0, then put on the dispose list */ > > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > > - list_add(&nf->nf_lru, dispose); > > > > + list_add(&nf->nf_gc, dispose); > > > > trace_nfsd_file_closing(nf); > > > > } > > > > } > > > > @@ -654,8 +656,8 @@ nfsd_file_close_inode_sync(struct inode *inode) > > > > > > > > nfsd_file_queue_for_close(inode, &dispose); > > > > while (!list_empty(&dispose)) { > > > > - nf = list_first_entry(&dispose, struct nfsd_file, > > > > nf_lru); > > > > - list_del_init(&nf->nf_lru); > > > > + nf = list_first_entry(&dispose, struct nfsd_file, > > > > nf_gc); > > > > + list_del_init(&nf->nf_gc); > > > > nfsd_file_free(nf); > > > > } > > > > } > > > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h > > > > index c61884def906..3fbec24eea6c 100644 > > > > --- a/fs/nfsd/filecache.h > > > > +++ b/fs/nfsd/filecache.h > > > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > > > > > struct nfsd_file_mark *nf_mark; > > > > struct list_head nf_lru; > > > > + struct list_head nf_gc; > > > > struct rcu_head nf_rcu; > > > > ktime_t nf_birthtime; > > > > }; > > > > > > > > On Wed, Jul 10, 2024 at 8:49 AM Jeff Layton <jlayton@kernel.org> > > > > wrote: > > > > > > > > > > Ok, thanks. I went crawling over the filecache code, and found a > > > > > couple > > > > > of potential nfsd_file leaks. I'm testing patches now and will > > > > > likely > > > > > send them along later today. > > > > > > > > > > I think we do still want the patch to add nf_gc list_head. I'm not > > > > > certain there is a race there, but it's safer to just use a > > > > > separate > > > > > list_head and it doesn't represent a lot of memory. > > > > > > > > > > Could you post that one individually so that Chuck can pick it up? > > > > > > > > > > Thanks, > > > > > Jeff > > > > > > > > > > On Tue, 2024-07-09 at 15:13 -0400, Youzhong Yang wrote: > > > > > > It's not on the LRU: > > > > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_gc > > > > > > ffff898107b95980 > > > > > > nf_flags = 12, > > > > > > nf_ref.refs.counter = 1 > > > > > > nf_lru = { > > > > > > next = 0xffff898107b959c8, > > > > > > prev = 0xffff898107b959c8 > > > > > > }, > > > > > > nf_gc = { > > > > > > next = 0xffff898107b959d8, > > > > > > prev = 0xffff898107b959d8 > > > > > > }, > > > > > > > > > > > > I don't have an easy reproducer now as I am having difficulty > > > > > > generating tons of NFS read/write ops using a limited number of > > > > > > nfs > > > > > > clients. I use our in-house testing farm to test the patch. > > > > > > > > > > > > Thanks. > > > > > > > > > > > > On Tue, Jul 9, 2024 at 3:05 PM Jeff Layton <jlayton@kernel.org> > > > > > > wrote: > > > > > > > > > > > > > > On Tue, 2024-07-09 at 14:37 -0400, Youzhong Yang wrote: > > > > > > > > Thanks Jeff. > > > > > > > > > > > > > > > > Unfortunately the early unhash easily leads to leaks: > > > > > > > > > > > > > > > > crash> kmem -S nfsd_file | grep '\[ffff' | sed -e 's|\[||' -e > > > > > > > > 's|\]||' > > > > > > > > > xargs -i echo struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > > '{}' > > > > > > > > > /var/tmp/nfsd_files > > > > > > > > crash> !wc -l /var/tmp/nfsd_files > > > > > > > > 19 /var/tmp/nfsd_files > > > > > > > > crash> < /var/tmp/nfsd_files > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > ffff88865c778900 > > > > > > > > nf_flags = 8, > > > > > > > > nf_ref.refs.counter = 2 > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > ffff88865c778cc0 > > > > > > > > nf_flags = 8, > > > > > > > > nf_ref.refs.counter = 3 > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > ffff8885d5f35e00 > > > > > > > > nf_flags = 8, > > > > > > > > nf_ref.refs.counter = 1 > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > ffff88817443e780 > > > > > > > > nf_flags = 8, > > > > > > > > nf_ref.refs.counter = 3 > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > ffff88818b3f0600 > > > > > > > > nf_flags = 8, > > > > > > > > nf_ref.refs.counter = 2 > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > ffff88a4490f8300 > > > > > > > > nf_flags = 8, > > > > > > > > nf_ref.refs.counter = 1 > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > ffff88a0dab183c0 > > > > > > > > nf_flags = 8, > > > > > > > > nf_ref.refs.counter = 40 > > > > > > > > > > > > > > That's a lot of references! Might be interesting to look more > > > > > > > closely > > > > > > > at that one, but the refcounts are all over the place, so it > > > > > > > really > > > > > > > does look like we just have a refcount leak somewhere. > > > > > > > > > > > > > > > ... > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > ffff89209535f200 > > > > > > > > nf_flags = 8, > > > > > > > > nf_ref.refs.counter = 2 > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > ffff8980e15138c0 > > > > > > > > nf_flags = 8, > > > > > > > > nf_ref.refs.counter = 7 > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > ffff898107b95800 > > > > > > > > nf_flags = 8, > > > > > > > > nf_ref.refs.counter = 3 > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > ffff898107b95980 > > > > > > > > nf_flags = 12, > > > > > > > > > > > > > > The others are NFSD_FILE_GC. This one is also > > > > > > > NFSD_FILE_REFERENCED. Are > > > > > > > these objects on the LRU? > > > > > > > > > > > > > > > nf_ref.refs.counter = 1 > > > > > > > > > > > > > > > > nfsd_file_do_acquire() -> nfsd_file_lookup_locked() relies on > > > > > > > > the > > > > > > > > hash > > > > > > > > table to find the nfsd_file, > > > > > > > > but I am still scratching my head why and how this happens. > > > > > > > > > > > > > > > > FYI, here is the patch I applied for testing: > > > > > > > > > > > > > > > > > > > > > > The above suggests to me that there is a garden-variety > > > > > > > refcount leak > > > > > > > somewhere. Whenever some bit of the code takes a reference to > > > > > > > an object > > > > > > > (like a nfsd_file), it's implicitly required to eventually put > > > > > > > that > > > > > > > reference. Typically that means that the code needs to maintain > > > > > > > a > > > > > > > pointer to that object, as it can be unhashed at any time and > > > > > > > it can't > > > > > > > rely on finding the same object later. > > > > > > > > > > > > > > Do you have a reproducer for this? > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > > > > index ad9083ca144b..22ebd7fb8639 100644 > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, struct > > > > > > > > inode > > > > > > > > *inode, unsigned char need, > > > > > > > > return NULL; > > > > > > > > > > > > > > > > INIT_LIST_HEAD(&nf->nf_lru); > > > > > > > > + INIT_LIST_HEAD(&nf->nf_gc); > > > > > > > > nf->nf_birthtime = ktime_get(); > > > > > > > > nf->nf_file = NULL; > > > > > > > > nf->nf_cred = get_current_cred(); > > > > > > > > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct list_head > > > > > > > > *dispose) > > > > > > > > struct nfsd_file *nf; > > > > > > > > > > > > > > > > while (!list_empty(dispose)) { > > > > > > > > - nf = list_first_entry(dispose, struct > > > > > > > > nfsd_file, > > > > > > > > nf_lru); > > > > > > > > - list_del_init(&nf->nf_lru); > > > > > > > > + nf = list_first_entry(dispose, struct > > > > > > > > nfsd_file, > > > > > > > > nf_gc); > > > > > > > > + list_del_init(&nf->nf_gc); > > > > > > > > nfsd_file_free(nf); > > > > > > > > } > > > > > > > > } > > > > > > > > @@ -411,12 +412,12 @@ nfsd_file_dispose_list_delayed(struct > > > > > > > > list_head > > > > > > > > *dispose) > > > > > > > > { > > > > > > > > while(!list_empty(dispose)) { > > > > > > > > struct nfsd_file *nf = > > > > > > > > list_first_entry(dispose, > > > > > > > > - struct > > > > > > > > nfsd_file, > > > > > > > > nf_lru); > > > > > > > > + struct > > > > > > > > nfsd_file, > > > > > > > > nf_gc); > > > > > > > > struct nfsd_net *nn = net_generic(nf->nf_net, > > > > > > > > nfsd_net_id); > > > > > > > > struct nfsd_fcache_disposal *l = nn- > > > > > > > > > fcache_disposal; > > > > > > > > > > > > > > > > spin_lock(&l->lock); > > > > > > > > - list_move_tail(&nf->nf_lru, &l->freeme); > > > > > > > > + list_move_tail(&nf->nf_gc, &l->freeme); > > > > > > > > spin_unlock(&l->lock); > > > > > > > > svc_wake_up(nn->nfsd_serv); > > > > > > > > } > > > > > > > > @@ -503,7 +504,8 @@ nfsd_file_lru_cb(struct list_head *item, > > > > > > > > struct > > > > > > > > list_lru_one *lru, > > > > > > > > > > > > > > > > /* Refcount went to zero. Unhash it and queue it to > > > > > > > > the > > > > > > > > dispose list */ > > > > > > > > nfsd_file_unhash(nf); > > > > > > > > - list_lru_isolate_move(lru, &nf->nf_lru, head); > > > > > > > > + list_lru_isolate(lru, &nf->nf_lru); > > > > > > > > + list_add(&nf->nf_gc, head); > > > > > > > > this_cpu_inc(nfsd_file_evictions); > > > > > > > > trace_nfsd_file_gc_disposed(nf); > > > > > > > > return LRU_REMOVED; > > > > > > > > @@ -578,7 +580,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > > > > *nf, struct > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > > > /* If refcount goes to 0, then put on the dispose > > > > > > > > list */ > > > > > > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > > > > > > - list_add(&nf->nf_lru, dispose); > > > > > > > > + list_add(&nf->nf_gc, dispose); > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > } > > > > > > > > } > > > > > > > > @@ -654,8 +656,8 @@ nfsd_file_close_inode_sync(struct inode > > > > > > > > *inode) > > > > > > > > > > > > > > > > nfsd_file_queue_for_close(inode, &dispose); > > > > > > > > while (!list_empty(&dispose)) { > > > > > > > > - nf = list_first_entry(&dispose, struct > > > > > > > > nfsd_file, > > > > > > > > nf_lru); > > > > > > > > - list_del_init(&nf->nf_lru); > > > > > > > > + nf = list_first_entry(&dispose, struct > > > > > > > > nfsd_file, > > > > > > > > nf_gc); > > > > > > > > + list_del_init(&nf->nf_gc); > > > > > > > > nfsd_file_free(nf); > > > > > > > > } > > > > > > > > } > > > > > > > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h > > > > > > > > index c61884def906..3fbec24eea6c 100644 > > > > > > > > --- a/fs/nfsd/filecache.h > > > > > > > > +++ b/fs/nfsd/filecache.h > > > > > > > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > > > > > > > > > > > > > struct nfsd_file_mark *nf_mark; > > > > > > > > struct list_head nf_lru; > > > > > > > > + struct list_head nf_gc; > > > > > > > > struct rcu_head nf_rcu; > > > > > > > > ktime_t nf_birthtime; > > > > > > > > }; > > > > > > > > > > > > > > > > On Mon, Jul 8, 2024 at 10:55 AM Jeff Layton > > > > > > > > <jlayton@kernel.org> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > On Mon, 2024-07-08 at 10:23 -0400, Youzhong Yang wrote: > > > > > > > > > > Thanks Jeff. > > > > > > > > > > > > > > > > > > > > I am ok with reverting the unhash/dispose list reordering > > > > > > > > > > in > > > > > > > > > > nfsd_file_lru_cb(), as it doesn't make much of a > > > > > > > > > > difference, > > > > > > > > > > but for nfsd_file_cond_queue(), imagining this: > > > > > > > > > > > > > > > > > > > > - A nfsd_file is hashed > > > > > > > > > > - In nfsd_file_cond_queue(), [if (!nfsd_file_unhash(nf))] > > > > > > > > > > will > > > > > > > > > > get it > > > > > > > > > > unhashed, doesn't it? > > > > > > > > > > - It continues to get a reference by nfsd_file_get() > > > > > > > > > > - It continues to remove itself from LRU by > > > > > > > > > > nfsd_file_lru_remove() if > > > > > > > > > > it is on the LRU. > > > > > > > > > > - Now it runs refcount_sub_and_test(), what happens if > > > > > > > > > > the refcnt > > > > > > > > > > does > > > > > > > > > > not go to 0? How can this nfsd_file be found again? > > > > > > > > > > Through the > > > > > > > > > > hash > > > > > > > > > > table? Through the LRU walk? how? > > > > > > > > > > > > > > > > > > > > Thanks again. > > > > > > > > > > > > > > > > > > > > -Youzhong > > > > > > > > > > > > > > > > > > > > > > > > > > > > It won't need to be found again. The holders of the extra > > > > > > > > > references > > > > > > > > > will put those references when they are finished. Since the > > > > > > > > > object > > > > > > > > > is > > > > > > > > > no longer HASHED, nfsd_file_put just does this: > > > > > > > > > > > > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > > > > > > > So that should be fine. > > > > > > > > > > > > > > > > > > > On Mon, Jul 8, 2024 at 9:35 AM Jeff Layton > > > > > > > > > > <jlayton@kernel.org> > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > On Mon, 2024-07-08 at 08:58 -0400, Youzhong Yang wrote: > > > > > > > > > > > > Thank you Jeff for your invaluable insights. I was > > > > > > > > > > > > leaning > > > > > > > > > > > > towards > > > > > > > > > > > > adding a new list_head too, and tested this approach > > > > > > > > > > > > on > > > > > > > > > > > > kernel > > > > > > > > > > > > 6.6 by > > > > > > > > > > > > continuously hammering the server with heavy nfs load > > > > > > > > > > > > for the > > > > > > > > > > > > last few > > > > > > > > > > > > days, not a single leak. > > > > > > > > > > > > > > > > > > > > > > > > Here goes the patch (based on Linux kernel master > > > > > > > > > > > > branch), > > > > > > > > > > > > please > > > > > > > > > > > > review: > > > > > > > > > > > > > > > > > > > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > > > > Date: Thu, 4 Jul 2024 11:25:40 -0400 > > > > > > > > > > > > Subject: [PATCH] nfsd: fix nfsd_file leaking due to > > > > > > > > > > > > mixed use > > > > > > > > > > > > of > > > > > > > > > > > > nf->nf_lru > > > > > > > > > > > > > > > > > > > > > > > > nfsd_file_put() in one thread can race with another > > > > > > > > > > > > thread > > > > > > > > > > > > doing > > > > > > > > > > > > garbage collection (running nfsd_file_gc() -> > > > > > > > > > > > > list_lru_walk() > > > > > > > > > > > > -> > > > > > > > > > > > > nfsd_file_lru_cb()): > > > > > > > > > > > > > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries > > > > > > > > > > > > to do > > > > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > > > > * nfsd_file_lru_add() returns true (with > > > > > > > > > > > > NFSD_FILE_REFERENCED > > > > > > > > > > > > bit set) > > > > > > > > > > > > * garbage collector kicks in, nfsd_file_lru_cb() > > > > > > > > > > > > clears > > > > > > > > > > > > REFERENCED bit and > > > > > > > > > > > > returns LRU_ROTATE. > > > > > > > > > > > > * garbage collector kicks in again, > > > > > > > > > > > > nfsd_file_lru_cb() now > > > > > > > > > > > > decrements nf->nf_ref > > > > > > > > > > > > to 0, runs nfsd_file_unhash(), removes it from > > > > > > > > > > > > the LRU > > > > > > > > > > > > and > > > > > > > > > > > > adds to > > > > > > > > > > > > the dispose > > > > > > > > > > > > list [list_lru_isolate_move(lru, &nf->nf_lru, > > > > > > > > > > > > head)] > > > > > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is > > > > > > > > > > > > cleared, > > > > > > > > > > > > so > > > > > > > > > > > > it > > > > > > > > > > > > tries to remove > > > > > > > > > > > > the 'nf' from the LRU [if > > > > > > > > > > > > (!nfsd_file_lru_remove(nf))]. > > > > > > > > > > > > The > > > > > > > > > > > > 'nf' > > > > > > > > > > > > has been added > > > > > > > > > > > > to the 'dispose' list by nfsd_file_lru_cb(), so > > > > > > > > > > > > nfsd_file_lru_remove(nf) simply > > > > > > > > > > > > treats it as part of the LRU and removes it, > > > > > > > > > > > > which leads > > > > > > > > > > > > to > > > > > > > > > > > > its removal from > > > > > > > > > > > > the 'dispose' list. > > > > > > > > > > > > * At this moment, 'nf' is unhashed with its nf_ref > > > > > > > > > > > > being 0, > > > > > > > > > > > > and > > > > > > > > > > > > not > > > > > > > > > > > > on the LRU. > > > > > > > > > > > > nfsd_file_put() continues its execution [if > > > > > > > > > > > > (refcount_dec_and_test(&nf->nf_ref))], > > > > > > > > > > > > as nf->nf_ref is already 0, nf->nf_ref is set to > > > > > > > > > > > > REFCOUNT_SATURATED, and the 'nf' > > > > > > > > > > > > gets no chance of being freed. > > > > > > > > > > > > > > > > > > > > > > > > nfsd_file_put() can also race with > > > > > > > > > > > > nfsd_file_cond_queue(): > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries > > > > > > > > > > > > to do > > > > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > > > > * nfsd_file_lru_add() sets REFERENCED bit and > > > > > > > > > > > > returns true. > > > > > > > > > > > > * Some userland application runs 'exportfs -f' or > > > > > > > > > > > > something > > > > > > > > > > > > like > > > > > > > > > > > > that, which triggers > > > > > > > > > > > > __nfsd_file_cache_purge() -> > > > > > > > > > > > > nfsd_file_cond_queue(). > > > > > > > > > > > > * In nfsd_file_cond_queue(), it runs [if > > > > > > > > > > > > (!nfsd_file_unhash(nf))], > > > > > > > > > > > > unhash is done > > > > > > > > > > > > successfully. > > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > > (!nfsd_file_get(nf))], > > > > > > > > > > > > now > > > > > > > > > > > > nf->nf_ref goes to 2. > > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > > (nfsd_file_lru_remove(nf))], > > > > > > > > > > > > it succeeds. > > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > > (refcount_sub_and_test(decrement, > > > > > > > > > > > > &nf->nf_ref))] > > > > > > > > > > > > (with "decrement" being 2), so the nf->nf_ref > > > > > > > > > > > > goes to 0, > > > > > > > > > > > > the > > > > > > > > > > > > 'nf' > > > > > > > > > > > > is added to the > > > > > > > > > > > > dispose list [list_add(&nf->nf_lru, dispose)] > > > > > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is > > > > > > > > > > > > cleared, > > > > > > > > > > > > so > > > > > > > > > > > > it > > > > > > > > > > > > tries to remove > > > > > > > > > > > > the 'nf' from the LRU [if > > > > > > > > > > > > (!nfsd_file_lru_remove(nf))], > > > > > > > > > > > > although > > > > > > > > > > > > the 'nf' is not > > > > > > > > > > > > in the LRU, but it is linked in the 'dispose' > > > > > > > > > > > > list, > > > > > > > > > > > > nfsd_file_lru_remove() simply > > > > > > > > > > > > treats it as part of the LRU and removes it. This > > > > > > > > > > > > leads > > > > > > > > > > > > to > > > > > > > > > > > > its removal from > > > > > > > > > > > > the 'dispose' list! > > > > > > > > > > > > * Now nf->ref is 0, unhashed. nfsd_file_put() > > > > > > > > > > > > continues its > > > > > > > > > > > > execution and set > > > > > > > > > > > > nf->nf_ref to REFCOUNT_SATURATED. > > > > > > > > > > > > > > > > > > > > > > > > As shown in the above analysis, using nf_lru for both > > > > > > > > > > > > the LRU > > > > > > > > > > > > list and > > > > > > > > > > > > dispose list > > > > > > > > > > > > can cause the leaks. This patch adds a new list_head > > > > > > > > > > > > nf_gc in > > > > > > > > > > > > struct > > > > > > > > > > > > nfsd_file, and uses > > > > > > > > > > > > it for the dispose list. It's not expected to have a > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > unhashed but it's not > > > > > > > > > > > > added to the dispose list, so in > > > > > > > > > > > > nfsd_file_cond_queue() and > > > > > > > > > > > > nfsd_file_lru_cb() nfsd_file > > > > > > > > > > > > is unhashed after being added to the dispose list. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I don't see where we require the object to be either > > > > > > > > > > > hashed or > > > > > > > > > > > on > > > > > > > > > > > the > > > > > > > > > > > dispose list. I think you probably just want to do a > > > > > > > > > > > patch > > > > > > > > > > > that > > > > > > > > > > > changes the dispose list to use a dedicated list_head > > > > > > > > > > > without > > > > > > > > > > > reordering when the these things are unhashed. > > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > > > > --- > > > > > > > > > > > > fs/nfsd/filecache.c | 23 ++++++++++++++--------- > > > > > > > > > > > > fs/nfsd/filecache.h | 1 + > > > > > > > > > > > > 2 files changed, 15 insertions(+), 9 deletions(-) > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c > > > > > > > > > > > > b/fs/nfsd/filecache.c > > > > > > > > > > > > index ad9083ca144b..3aef2ddfce94 100644 > > > > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > > > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, > > > > > > > > > > > > struct > > > > > > > > > > > > inode > > > > > > > > > > > > *inode, unsigned char need, > > > > > > > > > > > > return NULL; > > > > > > > > > > > > > > > > > > > > > > > > INIT_LIST_HEAD(&nf->nf_lru); > > > > > > > > > > > > + INIT_LIST_HEAD(&nf->nf_gc); > > > > > > > > > > > > nf->nf_birthtime = ktime_get(); > > > > > > > > > > > > nf->nf_file = NULL; > > > > > > > > > > > > nf->nf_cred = get_current_cred(); > > > > > > > > > > > > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct > > > > > > > > > > > > list_head > > > > > > > > > > > > *dispose) > > > > > > > > > > > > struct nfsd_file *nf; > > > > > > > > > > > > > > > > > > > > > > > > while (!list_empty(dispose)) { > > > > > > > > > > > > - nf = list_first_entry(dispose, struct > > > > > > > > > > > > nfsd_file, > > > > > > > > > > > > nf_lru); > > > > > > > > > > > > - list_del_init(&nf->nf_lru); > > > > > > > > > > > > + nf = list_first_entry(dispose, struct > > > > > > > > > > > > nfsd_file, > > > > > > > > > > > > nf_gc); > > > > > > > > > > > > + list_del_init(&nf->nf_gc); > > > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > } > > > > > > > > > > > > } > > > > > > > > > > > > @@ -411,12 +412,12 @@ > > > > > > > > > > > > nfsd_file_dispose_list_delayed(struct > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > { > > > > > > > > > > > > while(!list_empty(dispose)) { > > > > > > > > > > > > struct nfsd_file *nf = > > > > > > > > > > > > list_first_entry(dispose, > > > > > > > > > > > > - > > > > > > > > > > > > struct > > > > > > > > > > > > nfsd_file, > > > > > > > > > > > > nf_lru); > > > > > > > > > > > > + > > > > > > > > > > > > struct > > > > > > > > > > > > nfsd_file, > > > > > > > > > > > > nf_gc); > > > > > > > > > > > > struct nfsd_net *nn = net_generic(nf- > > > > > > > > > > > > > nf_net, > > > > > > > > > > > > nfsd_net_id); > > > > > > > > > > > > struct nfsd_fcache_disposal *l = nn- > > > > > > > > > > > > > fcache_disposal; > > > > > > > > > > > > > > > > > > > > > > > > spin_lock(&l->lock); > > > > > > > > > > > > - list_move_tail(&nf->nf_lru, &l- > > > > > > > > > > > > > freeme); > > > > > > > > > > > > + list_move_tail(&nf->nf_gc, &l- > > > > > > > > > > > > > freeme); > > > > > > > > > > > > spin_unlock(&l->lock); > > > > > > > > > > > > svc_wake_up(nn->nfsd_serv); > > > > > > > > > > > > } > > > > > > > > > > > > @@ -502,8 +503,10 @@ nfsd_file_lru_cb(struct > > > > > > > > > > > > list_head *item, > > > > > > > > > > > > struct > > > > > > > > > > > > list_lru_one *lru, > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > /* Refcount went to zero. Unhash it and queue > > > > > > > > > > > > it to > > > > > > > > > > > > the > > > > > > > > > > > > dispose list */ > > > > > > > > > > > > + list_lru_isolate(lru, &nf->nf_lru); > > > > > > > > > > > > + list_add(&nf->nf_gc, head); > > > > > > > > > > > > + /* Unhash after removing from LRU and adding > > > > > > > > > > > > to > > > > > > > > > > > > dispose > > > > > > > > > > > > list */ > > > > > > > > > > > > nfsd_file_unhash(nf); > > > > > > > > > > > > - list_lru_isolate_move(lru, &nf->nf_lru, > > > > > > > > > > > > head); > > > > > > > > > > > > > > > > > > > > > > I don't see the point in reordering these operations. > > > > > > > > > > > Hashing > > > > > > > > > > > is > > > > > > > > > > > all > > > > > > > > > > > about making the thing findable by nfsd operations. The > > > > > > > > > > > _last_ > > > > > > > > > > > thing we > > > > > > > > > > > want to do is put it on the dispose list while the > > > > > > > > > > > thing can > > > > > > > > > > > still > > > > > > > > > > > be > > > > > > > > > > > found by nfsd threads doing operations. > > > > > > > > > > > > > > > > > > > > > > > this_cpu_inc(nfsd_file_evictions); > > > > > > > > > > > > trace_nfsd_file_gc_disposed(nf); > > > > > > > > > > > > return LRU_REMOVED; > > > > > > > > > > > > @@ -565,7 +568,7 @@ nfsd_file_cond_queue(struct > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > *nf, > > > > > > > > > > > > struct > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > > > > > > > > > /* If we raced with someone else unhashing, > > > > > > > > > > > > ignore it > > > > > > > > > > > > */ > > > > > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf- > > > > > > > > > > > > > nf_flags)) > > > > > > > > > > > > > > > > > > > > > > The above change looks wrong. I don't think we need to > > > > > > > > > > > change > > > > > > > > > > > this. > > > > > > > > > > > > > > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > > > > > > > /* If we can't get a reference, ignore it */ > > > > > > > > > > > > @@ -578,7 +581,9 @@ nfsd_file_cond_queue(struct > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > *nf, > > > > > > > > > > > > struct > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > > > > > > > > > > > /* If refcount goes to 0, then put on the > > > > > > > > > > > > dispose > > > > > > > > > > > > list */ > > > > > > > > > > > > if (refcount_sub_and_test(decrement, &nf- > > > > > > > > > > > > > nf_ref)) { > > > > > > > > > > > > - list_add(&nf->nf_lru, dispose); > > > > > > > > > > > > + list_add(&nf->nf_gc, dispose); > > > > > > > > > > > > + /* Unhash after adding to dispose > > > > > > > > > > > > list */ > > > > > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > > > > > > > > > > > > > > This too looks wrong? Maybe I'm unclear on the race > > > > > > > > > > > you're > > > > > > > > > > > trying > > > > > > > > > > > to > > > > > > > > > > > fix with this? What's the harm in unhashing it early? > > > > > > > > > > > > > > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > > > > } > > > > > > > > > > > > } > > > > > > > > > > > > @@ -654,8 +659,8 @@ nfsd_file_close_inode_sync(struct > > > > > > > > > > > > inode > > > > > > > > > > > > *inode) > > > > > > > > > > > > > > > > > > > > > > > > nfsd_file_queue_for_close(inode, &dispose); > > > > > > > > > > > > while (!list_empty(&dispose)) { > > > > > > > > > > > > - nf = list_first_entry(&dispose, > > > > > > > > > > > > struct > > > > > > > > > > > > nfsd_file, > > > > > > > > > > > > nf_lru); > > > > > > > > > > > > - list_del_init(&nf->nf_lru); > > > > > > > > > > > > + nf = list_first_entry(&dispose, > > > > > > > > > > > > struct > > > > > > > > > > > > nfsd_file, > > > > > > > > > > > > nf_gc); > > > > > > > > > > > > + list_del_init(&nf->nf_gc); > > > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > } > > > > > > > > > > > > } > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.h > > > > > > > > > > > > b/fs/nfsd/filecache.h > > > > > > > > > > > > index c61884def906..3fbec24eea6c 100644 > > > > > > > > > > > > --- a/fs/nfsd/filecache.h > > > > > > > > > > > > +++ b/fs/nfsd/filecache.h > > > > > > > > > > > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > > > > > > > > > > > > > > > > > > > > > struct nfsd_file_mark *nf_mark; > > > > > > > > > > > > struct list_head nf_lru; > > > > > > > > > > > > + struct list_head nf_gc; > > > > > > > > > > > > struct rcu_head nf_rcu; > > > > > > > > > > > > ktime_t nf_birthtime; > > > > > > > > > > > > }; > > > > > > > > > > > > -- > > > > > > > > > > > > 2.34.1 > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Jul 4, 2024 at 7:14 AM Jeff Layton > > > > > > > > > > > > <jlayton@poochiereds.net> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 2024-07-03 at 16:46 -0400, Youzhong Yang > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > Thank you Chuck. Here are my quick answers to > > > > > > > > > > > > > > your > > > > > > > > > > > > > > comments: > > > > > > > > > > > > > > > > > > > > > > > > > > > > - I don't have a quick reproducer. I reproduced > > > > > > > > > > > > > > it by > > > > > > > > > > > > > > using > > > > > > > > > > > > > > hundreds > > > > > > > > > > > > > > of nfs clients generating +600K ops under our > > > > > > > > > > > > > > workload in > > > > > > > > > > > > > > the > > > > > > > > > > > > > > testing > > > > > > > > > > > > > > environment. Theoretically it should be possible > > > > > > > > > > > > > > to > > > > > > > > > > > > > > simplify > > > > > > > > > > > > > > the > > > > > > > > > > > > > > reproduction but I am still working on it. > > > > > > > > > > > > > > > > > > > > > > > > > > > > - I understand zfs is an out-of-tree file > > > > > > > > > > > > > > system. That's > > > > > > > > > > > > > > fine. But > > > > > > > > > > > > > > this leaking can happen to any file system, and > > > > > > > > > > > > > > leaking > > > > > > > > > > > > > > is > > > > > > > > > > > > > > not a good > > > > > > > > > > > > > > thing no matter what file system it is. > > > > > > > > > > > > > > > > > > > > > > > > > > > > - I will try to come up with a reproducer using > > > > > > > > > > > > > > xfs or > > > > > > > > > > > > > > btrfs > > > > > > > > > > > > > > if possible. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Now back to the problem itself, here are my > > > > > > > > > > > > > > findings: > > > > > > > > > > > > > > > > > > > > > > > > > > > > - nfsd_file_put() in one thread can race with > > > > > > > > > > > > > > another > > > > > > > > > > > > > > thread > > > > > > > > > > > > > > doing > > > > > > > > > > > > > > garbage collection (running nfsd_file_gc() -> > > > > > > > > > > > > > > list_lru_walk() > > > > > > > > > > > > > > -> > > > > > > > > > > > > > > nfsd_file_lru_cb()): > > > > > > > > > > > > > > > > > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it > > > > > > > > > > > > > > tries to > > > > > > > > > > > > > > do > > > > > > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > > > > > > * nfsd_file_lru_add() returns true (thus > > > > > > > > > > > > > > NFSD_FILE_REFERENCED bit > > > > > > > > > > > > > > set for nf->nf_flags) > > > > > > > > > > > > > > * garbage collector kicks in, > > > > > > > > > > > > > > nfsd_file_lru_cb() clears > > > > > > > > > > > > > > REFERENCED > > > > > > > > > > > > > > bit and returns LRU_ROTATE. > > > > > > > > > > > > > > * garbage collector kicks in again, > > > > > > > > > > > > > > nfsd_file_lru_cb() > > > > > > > > > > > > > > now > > > > > > > > > > > > > > decrements nf->nf_ref to 0, runs > > > > > > > > > > > > > > nfsd_file_unhash(), > > > > > > > > > > > > > > removes > > > > > > > > > > > > > > it from > > > > > > > > > > > > > > the LRU and adds to the dispose list > > > > > > > > > > > > > > [list_lru_isolate_move(lru, > > > > > > > > > > > > > > &nf->nf_lru, head);] > > > > > > > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit > > > > > > > > > > > > > > is > > > > > > > > > > > > > > cleared, > > > > > > > > > > > > > > so it > > > > > > > > > > > > > > tries to remove the 'nf' from the LRU [if > > > > > > > > > > > > > > (!nfsd_file_lru_remove(nf))] > > > > > > > > > > > > > > * The 'nf' has been added to the 'dispose' list > > > > > > > > > > > > > > by > > > > > > > > > > > > > > nfsd_file_lru_cb(), so nfsd_file_lru_remove(nf) > > > > > > > > > > > > > > simply > > > > > > > > > > > > > > treats > > > > > > > > > > > > > > it as > > > > > > > > > > > > > > part of the LRU and removes it, which leads it to > > > > > > > > > > > > > > be > > > > > > > > > > > > > > removed > > > > > > > > > > > > > > from the > > > > > > > > > > > > > > 'dispose' list. > > > > > > > > > > > > > > * At this moment, nf->nf_ref is 0, it's > > > > > > > > > > > > > > unhashed, and > > > > > > > > > > > > > > not > > > > > > > > > > > > > > on the > > > > > > > > > > > > > > LRU. nfsd_file_put() continues its execution [if > > > > > > > > > > > > > > (refcount_dec_and_test(&nf->nf_ref))], as nf- > > > > > > > > > > > > > > > nf_ref is > > > > > > > > > > > > > > already 0, now > > > > > > > > > > > > > > bad thing happens: nf->nf_ref is set to > > > > > > > > > > > > > > REFCOUNT_SATURATED, > > > > > > > > > > > > > > and the > > > > > > > > > > > > > > 'nf' is leaked. > > > > > > > > > > > > > > > > > > > > > > > > > > > > To make this happen, the right timing is crucial. > > > > > > > > > > > > > > It can > > > > > > > > > > > > > > be > > > > > > > > > > > > > > reproduced > > > > > > > > > > > > > > by adding artifical delays in filecache.c, or > > > > > > > > > > > > > > hammering > > > > > > > > > > > > > > the > > > > > > > > > > > > > > nfsd with > > > > > > > > > > > > > > tons of ops. > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Let's see how nfsd_file_put() can race with > > > > > > > > > > > > > > nfsd_file_cond_queue(): > > > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it > > > > > > > > > > > > > > tries to > > > > > > > > > > > > > > do > > > > > > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > > > > > > * nfsd_file_lru_add() sets REFERENCED bit and > > > > > > > > > > > > > > returns > > > > > > > > > > > > > > true. > > > > > > > > > > > > > > * 'exportfs -f' or something like that triggers > > > > > > > > > > > > > > __nfsd_file_cache_purge() -> > > > > > > > > > > > > > > nfsd_file_cond_queue(). > > > > > > > > > > > > > > * In nfsd_file_cond_queue(), it runs [if > > > > > > > > > > > > > > (!nfsd_file_unhash(nf))], > > > > > > > > > > > > > > unhash is done successfully. > > > > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > > > > (!nfsd_file_get(nf))], > > > > > > > > > > > > > > now > > > > > > > > > > > > > > nf->nf_ref goes to 2. > > > > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > > > > (nfsd_file_lru_remove(nf))], it succeeds. > > > > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > > > > (refcount_sub_and_test(decrement, > > > > > > > > > > > > > > &nf->nf_ref))] (with "decrement" being 2), so the > > > > > > > > > > > > > > nf- > > > > > > > > > > > > > > > nf_ref > > > > > > > > > > > > > > goes to > > > > > > > > > > > > > > 0, the 'nf' is added to the dispost list > > > > > > > > > > > > > > [list_add(&nf- > > > > > > > > > > > > > > > nf_lru, > > > > > > > > > > > > > > dispose)] > > > > > > > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit > > > > > > > > > > > > > > is > > > > > > > > > > > > > > cleared, > > > > > > > > > > > > > > so it > > > > > > > > > > > > > > tries to remove the 'nf' from the LRU [if > > > > > > > > > > > > > > (!nfsd_file_lru_remove(nf))], although the 'nf' > > > > > > > > > > > > > > is not in > > > > > > > > > > > > > > the > > > > > > > > > > > > > > LRU, but > > > > > > > > > > > > > > it is linked in the 'dispose' list, > > > > > > > > > > > > > > nfsd_file_lru_remove() > > > > > > > > > > > > > > simply > > > > > > > > > > > > > > treats it as part of the LRU and removes it. This > > > > > > > > > > > > > > leads > > > > > > > > > > > > > > to > > > > > > > > > > > > > > its removal > > > > > > > > > > > > > > from the 'dispose' list! > > > > > > > > > > > > > > * Now nf->ref is 0, unhashed. nfsd_file_put() > > > > > > > > > > > > > > continues > > > > > > > > > > > > > > its > > > > > > > > > > > > > > execution and sets nf->nf_ref to > > > > > > > > > > > > > > REFCOUNT_SATURATED. > > > > > > > > > > > > > > > > > > > > > > > > > > > > The purpose of nf->nf_lru is problematic. As you > > > > > > > > > > > > > > can see, > > > > > > > > > > > > > > it > > > > > > > > > > > > > > is used > > > > > > > > > > > > > > for the LRU list, and also the 'dispose' list. > > > > > > > > > > > > > > Adding > > > > > > > > > > > > > > another > > > > > > > > > > > > > > 'struct > > > > > > > > > > > > > > list_head' specifically for the 'dispose' list > > > > > > > > > > > > > > seems to > > > > > > > > > > > > > > be a > > > > > > > > > > > > > > better > > > > > > > > > > > > > > way of fixing this race condition. Either way > > > > > > > > > > > > > > works for > > > > > > > > > > > > > > me. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Would you agree my above analysis makes sense? > > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think so. It's been a while since I've done much > > > > > > > > > > > > > work in > > > > > > > > > > > > > this > > > > > > > > > > > > > code, > > > > > > > > > > > > > but it does sound like there is a race in the LRU > > > > > > > > > > > > > handling. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Like Chuck said, the nf->nf_lru list should be safe > > > > > > > > > > > > > to use > > > > > > > > > > > > > for > > > > > > > > > > > > > multiple > > > > > > > > > > > > > purposes, but that's only the case if we're not > > > > > > > > > > > > > using that > > > > > > > > > > > > > list > > > > > > > > > > > > > as an > > > > > > > > > > > > > indicator. > > > > > > > > > > > > > > > > > > > > > > > > > > The list_lru code does check this: > > > > > > > > > > > > > > > > > > > > > > > > > > if (!list_empty(item)) { > > > > > > > > > > > > > > > > > > > > > > > > > > ...so if we ever check this while it's sitting on > > > > > > > > > > > > > the > > > > > > > > > > > > > dispose > > > > > > > > > > > > > list, it > > > > > > > > > > > > > will handle it incorrectly. It sounds like that's > > > > > > > > > > > > > the root > > > > > > > > > > > > > cause of the > > > > > > > > > > > > > problem you're seeing? > > > > > > > > > > > > > > > > > > > > > > > > > > If so, then maybe a separate list_head for disposal > > > > > > > > > > > > > would > > > > > > > > > > > > > be > > > > > > > > > > > > > better. > > > > > > > > > > > > > > > > > > > > > > > > > > > Here is my patch with signed-off-by: > > > > > > > > > > > > > > > > > > > > > > > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > > > > > > Date: Mon, 1 Jul 2024 06:45:22 -0400 > > > > > > > > > > > > > > Subject: [PATCH] nfsd: fix nfsd_file leaking due > > > > > > > > > > > > > > to race > > > > > > > > > > > > > > condition and early > > > > > > > > > > > > > > unhash > > > > > > > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > > > > > > --- > > > > > > > > > > > > > > fs/nfsd/filecache.c | 14 +++++++++++++- > > > > > > > > > > > > > > 1 file changed, 13 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c > > > > > > > > > > > > > > b/fs/nfsd/filecache.c > > > > > > > > > > > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > > > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > > > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct > > > > > > > > > > > > > > nfsd_file *nf) > > > > > > > > > > > > > > if > > > > > > > > > > > > > > (!nfsd_file_lru_remove(nf)) > > > > > > > > > > > > > > return; > > > > > > > > > > > > > > } > > > > > > > > > > > > > > + /* > > > > > > > > > > > > > > + * Racing with > > > > > > > > > > > > > > nfsd_file_cond_queue() or > > > > > > > > > > > > > > nfsd_file_lru_cb(), > > > > > > > > > > > > > > + * it's unhashed but then removed > > > > > > > > > > > > > > from > > > > > > > > > > > > > > the > > > > > > > > > > > > > > dispose list, > > > > > > > > > > > > > > + * so we need to free it. > > > > > > > > > > > > > > + */ > > > > > > > > > > > > > > + if (refcount_read(&nf->nf_ref) == > > > > > > > > > > > > > > 0 && > > > > > > > > > > > > > > > > > > > > > > > > > > A refcount_read in this path is a red flag to me. > > > > > > > > > > > > > Anytime > > > > > > > > > > > > > you're just > > > > > > > > > > > > > looking at the refcount without changing anything > > > > > > > > > > > > > just > > > > > > > > > > > > > screams > > > > > > > > > > > > > out > > > > > > > > > > > > > "race condition". > > > > > > > > > > > > > > > > > > > > > > > > > > In this case, what guarantee is there that this > > > > > > > > > > > > > won't run > > > > > > > > > > > > > afoul > > > > > > > > > > > > > of the > > > > > > > > > > > > > timing? We could check this and find out it's 1 > > > > > > > > > > > > > just before > > > > > > > > > > > > > it > > > > > > > > > > > > > goes to > > > > > > > > > > > > > 0 and you check the other conditions. > > > > > > > > > > > > > > > > > > > > > > > > > > Does anything prevent that? > > > > > > > > > > > > > > > > > > > > > > > > > > > + !test_bit(NFSD_FILE_HASHED, > > > > > > > > > > > > > > &nf- > > > > > > > > > > > > > > > nf_flags) && > > > > > > > > > > > > > > + list_empty(&nf->nf_lru)) { > > > > > > > > > > > > > > + nfsd_file_free(nf); > > > > > > > > > > > > > > + return; > > > > > > > > > > > > > > + } > > > > > > > > > > > > > > } > > > > > > > > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > > > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct > > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > > *nf, struct > > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we raced with someone else > > > > > > > > > > > > > > unhashing, > > > > > > > > > > > > > > ignore it > > > > > > > > > > > > > > */ > > > > > > > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf- > > > > > > > > > > > > > > > nf_flags)) > > > > > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > > > > > > > > > Same here: you're just testing for the HASHED bit, > > > > > > > > > > > > > but > > > > > > > > > > > > > could > > > > > > > > > > > > > this also > > > > > > > > > > > > > race with someone who is setting it just after you > > > > > > > > > > > > > get > > > > > > > > > > > > > here. > > > > > > > > > > > > > Why is > > > > > > > > > > > > > that not a problem? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we can't get a reference, ignore it > > > > > > > > > > > > > > */ > > > > > > > > > > > > > > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct > > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > > *nf, struct > > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > /* If refcount goes to 0, then put on the > > > > > > > > > > > > > > dispose > > > > > > > > > > > > > > list */ > > > > > > > > > > > > > > if (refcount_sub_and_test(decrement, &nf- > > > > > > > > > > > > > > > nf_ref)) { > > > > > > > > > > > > > > list_add(&nf->nf_lru, dispose); > > > > > > > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > > > > > > } > > > > > > > > > > > > > > } > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > 2.43.0 > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Jul 3, 2024 at 2:21 PM Chuck Lever > > > > > > > > > > > > > > <chuck.lever@oracle.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Jul 03, 2024 at 10:12:33AM -0400, > > > > > > > > > > > > > > > Youzhong Yang > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > Hello, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'd like to report a nfsd_file leaking issue > > > > > > > > > > > > > > > > and > > > > > > > > > > > > > > > > propose > > > > > > > > > > > > > > > > a fix for it. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > When I tested Linux kernel 6.8 and 6.6, I > > > > > > > > > > > > > > > > noticed > > > > > > > > > > > > > > > > nfsd_file leaks > > > > > > > > > > > > > > > > which led to undestroyable file systems > > > > > > > > > > > > > > > > (zfs), > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the report. Some initial comments: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Do you have a specific reproducer? In other > > > > > > > > > > > > > > > words, > > > > > > > > > > > > > > > what > > > > > > > > > > > > > > > is the > > > > > > > > > > > > > > > simplest program that can run on an NFS > > > > > > > > > > > > > > > client that > > > > > > > > > > > > > > > will > > > > > > > > > > > > > > > trigger > > > > > > > > > > > > > > > this leak, and can you post it? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - "zfs" is an out-of-tree file system, so it's > > > > > > > > > > > > > > > not > > > > > > > > > > > > > > > directly > > > > > > > > > > > > > > > supported for NFSD. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - The guidelines for patch submission require > > > > > > > > > > > > > > > us to fix > > > > > > > > > > > > > > > issues in > > > > > > > > > > > > > > > upstream Linux first (currently that's v6.10- > > > > > > > > > > > > > > > rc6). > > > > > > > > > > > > > > > Then > > > > > > > > > > > > > > > that fix > > > > > > > > > > > > > > > can be backported to older stable kernels > > > > > > > > > > > > > > > like 6.6. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Can you reproduce the leak with one of the in- > > > > > > > > > > > > > > > kernel > > > > > > > > > > > > > > > filesystems > > > > > > > > > > > > > > > (either xfs or btrfs would be great) and with > > > > > > > > > > > > > > > NFSD in > > > > > > > > > > > > > > > 6.10- > > > > > > > > > > > > > > > rc6? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > One more comment below. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > here are some examples: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct nfsd_file -x ffff88e160db0460 > > > > > > > > > > > > > > > > struct nfsd_file { > > > > > > > > > > > > > > > > nf_rlist = { > > > > > > > > > > > > > > > > rhead = { > > > > > > > > > > > > > > > > next = 0xffff8921fa2392f1 > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > next = 0x0 > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > nf_inode = 0xffff8882bc312ef8, > > > > > > > > > > > > > > > > nf_file = 0xffff88e2015b1500, > > > > > > > > > > > > > > > > nf_cred = 0xffff88e3ab0e7800, > > > > > > > > > > > > > > > > nf_net = 0xffffffff83d41600 <init_net>, > > > > > > > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > > > > > > > nf_ref = { > > > > > > > > > > > > > > > > refs = { > > > > > > > > > > > > > > > > counter = 0xc0000000 > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > nf_may = 0x4, > > > > > > > > > > > > > > > > nf_mark = 0xffff88e1bddfb320, > > > > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > > > > next = 0xffff88e160db04a8, > > > > > > > > > > > > > > > > prev = 0xffff88e160db04a8 > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > nf_rcu = { > > > > > > > > > > > > > > > > next = 0x10000000000, > > > > > > > > > > > > > > > > func = 0x0 > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > nf_birthtime = 0x73d22fc1728 > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru > > > > > > > > > > > > > > > > ,nf_file > > > > > > > > > > > > > > > > -x > > > > > > > > > > > > > > > > ffff88839a53d850 > > > > > > > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > > > > next = 0xffff88839a53d898, > > > > > > > > > > > > > > > > prev = 0xffff88839a53d898 > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > nf_file = 0xffff88810ede8700, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru > > > > > > > > > > > > > > > > ,nf_file > > > > > > > > > > > > > > > > -x > > > > > > > > > > > > > > > > ffff88c32b11e850 > > > > > > > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > > > > next = 0xffff88c32b11e898, > > > > > > > > > > > > > > > > prev = 0xffff88c32b11e898 > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > nf_file = 0xffff88c20a701c00, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru > > > > > > > > > > > > > > > > ,nf_file > > > > > > > > > > > > > > > > -x > > > > > > > > > > > > > > > > ffff88e372709700 > > > > > > > > > > > > > > > > nf_flags = 0xc, > > > > > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > > > > next = 0xffff88e372709748, > > > > > > > > > > > > > > > > prev = 0xffff88e372709748 > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > nf_file = 0xffff88e0725e6400, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru > > > > > > > > > > > > > > > > ,nf_file > > > > > > > > > > > > > > > > -x > > > > > > > > > > > > > > > > ffff8982864944d0 > > > > > > > > > > > > > > > > nf_flags = 0xc, > > > > > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > > > > next = 0xffff898286494518, > > > > > > > > > > > > > > > > prev = 0xffff898286494518 > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > nf_file = 0xffff89803c0ff700, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The leak occurs when nfsd_file_put() races > > > > > > > > > > > > > > > > with > > > > > > > > > > > > > > > > nfsd_file_cond_queue() > > > > > > > > > > > > > > > > or nfsd_file_lru_cb(). With the following > > > > > > > > > > > > > > > > patch, I > > > > > > > > > > > > > > > > haven't observed > > > > > > > > > > > > > > > > any leak after a few days heavy nfs load: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Our patch submission guidelines require a > > > > > > > > > > > > > > > Signed-off- > > > > > > > > > > > > > > > by: > > > > > > > > > > > > > > > line at the end of the patch description. See > > > > > > > > > > > > > > > the "Sign > > > > > > > > > > > > > > > your work - > > > > > > > > > > > > > > > the Developer's Certificate of Origin" section > > > > > > > > > > > > > > > of > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.10-rc6 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > (Needed here in case your fix is acceptable). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c > > > > > > > > > > > > > > > > b/fs/nfsd/filecache.c > > > > > > > > > > > > > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > > > > > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > > > > > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct > > > > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > > > > *nf) > > > > > > > > > > > > > > > > if (!nfsd_file_lru_remove(nf)) > > > > > > > > > > > > > > > > return; > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > + /* > > > > > > > > > > > > > > > > + * Racing with nfsd_file_cond_queue() or > > > > > > > > > > > > > > > > nfsd_file_lru_cb(), > > > > > > > > > > > > > > > > + * it's unhashed but then removed from the > > > > > > > > > > > > > > > > dispose > > > > > > > > > > > > > > > > list, > > > > > > > > > > > > > > > > + * so we need to free it. > > > > > > > > > > > > > > > > + */ > > > > > > > > > > > > > > > > + if (refcount_read(&nf->nf_ref) == 0 && > > > > > > > > > > > > > > > > + !test_bit(NFSD_FILE_HASHED, &nf- > > > > > > > > > > > > > > > > > nf_flags) && > > > > > > > > > > > > > > > > + list_empty(&nf->nf_lru)) { > > > > > > > > > > > > > > > > + nfsd_file_free(nf); > > > > > > > > > > > > > > > > + return; > > > > > > > > > > > > > > > > + } > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > > > > > @@ -576,7 +587,7 @@ > > > > > > > > > > > > > > > > nfsd_file_cond_queue(struct > > > > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > > > > *nf, struct > > > > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we raced with someone else unhashing, > > > > > > > > > > > > > > > > ignore > > > > > > > > > > > > > > > > it > > > > > > > > > > > > > > > > */ > > > > > > > > > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf- > > > > > > > > > > > > > > > > > nf_flags)) > > > > > > > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we can't get a reference, ignore it > > > > > > > > > > > > > > > > */ > > > > > > > > > > > > > > > > @@ -590,6 +601,7 @@ > > > > > > > > > > > > > > > > nfsd_file_cond_queue(struct > > > > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > > > > *nf, struct > > > > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > > > /* If refcount goes to 0, then put on the > > > > > > > > > > > > > > > > dispose > > > > > > > > > > > > > > > > list > > > > > > > > > > > > > > > > */ > > > > > > > > > > > > > > > > if (refcount_sub_and_test(decrement, &nf- > > > > > > > > > > > > > > > > > nf_ref)) > > > > > > > > > > > > > > > > { > > > > > > > > > > > > > > > > list_add(&nf->nf_lru, dispose); > > > > > > > > > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Please kindly review the patch and let me > > > > > > > > > > > > > > > > know if it > > > > > > > > > > > > > > > > makes sense. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -Youzhong > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > Chuck Lever > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > Jeff Layton <jlayton@poochiereds.net> > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > Jeff Layton <jlayton@poochiereds.net> > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > > > > > > > > > > > > > -- > > > > > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > > > > > -- > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > -- > > > Jeff Layton <jlayton@kernel.org> > > > > -- > Jeff Layton <jlayton@kernel.org>
Great, thanks for testing them! Can we add a Tested-by: for you to the patches? Consider posting your testcase as well. Maybe we can add it to fstests or another testsuite. Thanks, Jeff On Thu, 2024-07-11 at 10:54 -0400, Youzhong Yang wrote: > I'd like to report that your patches + the "list_head nf_gc" patch > together have been tested under heavy nfs load generated by machines > in our testing farm, no leak has been observed. > > Surely my reproducer can no longer reproduce the issue once the > patches are applied. > > FYI here is the reproducer: > https://github.com/youzhongyang/nfsd-file-leaks > > On Wed, Jul 10, 2024 at 4:22 PM Jeff Layton <jlayton@kernel.org> > wrote: > > > > Great! If you're able to test the patches I sent earlier today, > > please > > let us know if they help with your reproducer. > > > > > > On Wed, 2024-07-10 at 10:49 -0400, Youzhong Yang wrote: > > > Thanks Jeff. Just sent out a separate e-mail. > > > > > > I have a simple reproducer now, it's a C program, I will upload > > > it to > > > a github repository together with a readme file and then post the > > > link > > > (if allowed) here. > > > > > > On Wed, Jul 10, 2024 at 9:40 AM Jeff Layton <jlayton@kernel.org> > > > wrote: > > > > > > > > Can you send it as a separate [PATCH] email? I think your MUA > > > > is > > > > mangling the patches as I couldn't apply it earlier. Either use > > > > git- > > > > send-email or have a look over this page: > > > > > > > > > > > > > > > > https://www.kernel.org/doc/html/latest/process/email-clients.html#email-clients > > > > > > > > Thanks! > > > > Jeff > > > > > > > > On Wed, 2024-07-10 at 09:33 -0400, Youzhong Yang wrote: > > > > > Thanks. Here it is again: > > > > > > > > > > commit c6e69cebc052cb82d91fc81a98aee20749fe8d47 > > > > > Author: Youzhong Yang <youzhong@gmail.com> > > > > > Date: Thu Jul 4 11:25:40 2024 -0400 > > > > > > > > > > nfsd: fix nfsd_file leaking due to mixed use of nf- > > > > > >nf_lru > > > > > > > > > > nfsd_file_put() in one thread can race with another > > > > > thread doing > > > > > garbage collection (running nfsd_file_gc() -> > > > > > list_lru_walk() -> > > > > > nfsd_file_lru_cb()): > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to > > > > > do > > > > > nfsd_file_lru_add(). > > > > > * nfsd_file_lru_add() returns true (with > > > > > NFSD_FILE_REFERENCED > > > > > bit set) > > > > > * garbage collector kicks in, nfsd_file_lru_cb() clears > > > > > REFERENCED bit and > > > > > returns LRU_ROTATE. > > > > > * garbage collector kicks in again, nfsd_file_lru_cb() > > > > > now > > > > > decrements nf->nf_ref > > > > > to 0, runs nfsd_file_unhash(), removes it from the > > > > > LRU and > > > > > adds to the dispose > > > > > list [list_lru_isolate_move(lru, &nf->nf_lru, head)] > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is > > > > > cleared, so > > > > > it > > > > > tries to remove > > > > > the 'nf' from the LRU [if > > > > > (!nfsd_file_lru_remove(nf))]. The > > > > > 'nf' has been added > > > > > to the 'dispose' list by nfsd_file_lru_cb(), so > > > > > nfsd_file_lru_remove(nf) simply > > > > > treats it as part of the LRU and removes it, which > > > > > leads to > > > > > its removal from > > > > > the 'dispose' list. > > > > > * At this moment, 'nf' is unhashed with its nf_ref > > > > > being 0, and > > > > > not on the LRU. > > > > > nfsd_file_put() continues its execution [if > > > > > (refcount_dec_and_test(&nf->nf_ref))], > > > > > as nf->nf_ref is already 0, nf->nf_ref is set to > > > > > REFCOUNT_SATURATED, and the 'nf' > > > > > gets no chance of being freed. > > > > > > > > > > nfsd_file_put() can also race with > > > > > nfsd_file_cond_queue(): > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to > > > > > do > > > > > nfsd_file_lru_add(). > > > > > * nfsd_file_lru_add() sets REFERENCED bit and returns > > > > > true. > > > > > * Some userland application runs 'exportfs -f' or > > > > > something > > > > > like > > > > > that, which triggers > > > > > __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). > > > > > * In nfsd_file_cond_queue(), it runs [if > > > > > (!nfsd_file_unhash(nf))], unhash is done > > > > > successfully. > > > > > * nfsd_file_cond_queue() runs [if > > > > > (!nfsd_file_get(nf))], now > > > > > nf->nf_ref goes to 2. > > > > > * nfsd_file_cond_queue() runs [if > > > > > (nfsd_file_lru_remove(nf))], > > > > > it succeeds. > > > > > * nfsd_file_cond_queue() runs [if > > > > > (refcount_sub_and_test(decrement, &nf->nf_ref))] > > > > > (with "decrement" being 2), so the nf->nf_ref goes to > > > > > 0, the > > > > > 'nf' is added to the > > > > > dispose list [list_add(&nf->nf_lru, dispose)] > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is > > > > > cleared, so > > > > > it > > > > > tries to remove > > > > > the 'nf' from the LRU [if > > > > > (!nfsd_file_lru_remove(nf))], > > > > > although the 'nf' is not > > > > > in the LRU, but it is linked in the 'dispose' list, > > > > > nfsd_file_lru_remove() simply > > > > > treats it as part of the LRU and removes it. This > > > > > leads to > > > > > its > > > > > removal from > > > > > the 'dispose' list! > > > > > * Now nf->ref is 0, unhashed. nfsd_file_put() continues > > > > > its > > > > > execution and set > > > > > nf->nf_ref to REFCOUNT_SATURATED. > > > > > > > > > > As shown in the above analysis, using nf_lru for both the > > > > > LRU > > > > > list > > > > > and dispose list > > > > > can cause the leaks. This patch adds a new list_head > > > > > nf_gc in > > > > > struct nfsd_file, and uses > > > > > it for the dispose list. > > > > > > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > index ad9083ca144b..22ebd7fb8639 100644 > > > > > --- a/fs/nfsd/filecache.c > > > > > +++ b/fs/nfsd/filecache.c > > > > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, struct > > > > > inode > > > > > *inode, unsigned char need, > > > > > return NULL; > > > > > > > > > > INIT_LIST_HEAD(&nf->nf_lru); > > > > > + INIT_LIST_HEAD(&nf->nf_gc); > > > > > nf->nf_birthtime = ktime_get(); > > > > > nf->nf_file = NULL; > > > > > nf->nf_cred = get_current_cred(); > > > > > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct list_head > > > > > *dispose) > > > > > struct nfsd_file *nf; > > > > > > > > > > while (!list_empty(dispose)) { > > > > > - nf = list_first_entry(dispose, struct > > > > > nfsd_file, > > > > > nf_lru); > > > > > - list_del_init(&nf->nf_lru); > > > > > + nf = list_first_entry(dispose, struct > > > > > nfsd_file, > > > > > nf_gc); > > > > > + list_del_init(&nf->nf_gc); > > > > > nfsd_file_free(nf); > > > > > } > > > > > } > > > > > @@ -411,12 +412,12 @@ nfsd_file_dispose_list_delayed(struct > > > > > list_head > > > > > *dispose) > > > > > { > > > > > while(!list_empty(dispose)) { > > > > > struct nfsd_file *nf = > > > > > list_first_entry(dispose, > > > > > - struct > > > > > nfsd_file, > > > > > nf_lru); > > > > > + struct > > > > > nfsd_file, > > > > > nf_gc); > > > > > struct nfsd_net *nn = net_generic(nf->nf_net, > > > > > nfsd_net_id); > > > > > struct nfsd_fcache_disposal *l = nn- > > > > > >fcache_disposal; > > > > > > > > > > spin_lock(&l->lock); > > > > > - list_move_tail(&nf->nf_lru, &l->freeme); > > > > > + list_move_tail(&nf->nf_gc, &l->freeme); > > > > > spin_unlock(&l->lock); > > > > > svc_wake_up(nn->nfsd_serv); > > > > > } > > > > > @@ -503,7 +504,8 @@ nfsd_file_lru_cb(struct list_head *item, > > > > > struct > > > > > list_lru_one *lru, > > > > > > > > > > /* Refcount went to zero. Unhash it and queue it to > > > > > the > > > > > dispose list */ > > > > > nfsd_file_unhash(nf); > > > > > - list_lru_isolate_move(lru, &nf->nf_lru, head); > > > > > + list_lru_isolate(lru, &nf->nf_lru); > > > > > + list_add(&nf->nf_gc, head); > > > > > this_cpu_inc(nfsd_file_evictions); > > > > > trace_nfsd_file_gc_disposed(nf); > > > > > return LRU_REMOVED; > > > > > @@ -578,7 +580,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > *nf, struct > > > > > list_head *dispose) > > > > > > > > > > /* If refcount goes to 0, then put on the dispose > > > > > list */ > > > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > > > - list_add(&nf->nf_lru, dispose); > > > > > + list_add(&nf->nf_gc, dispose); > > > > > trace_nfsd_file_closing(nf); > > > > > } > > > > > } > > > > > @@ -654,8 +656,8 @@ nfsd_file_close_inode_sync(struct inode > > > > > *inode) > > > > > > > > > > nfsd_file_queue_for_close(inode, &dispose); > > > > > while (!list_empty(&dispose)) { > > > > > - nf = list_first_entry(&dispose, struct > > > > > nfsd_file, > > > > > nf_lru); > > > > > - list_del_init(&nf->nf_lru); > > > > > + nf = list_first_entry(&dispose, struct > > > > > nfsd_file, > > > > > nf_gc); > > > > > + list_del_init(&nf->nf_gc); > > > > > nfsd_file_free(nf); > > > > > } > > > > > } > > > > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h > > > > > index c61884def906..3fbec24eea6c 100644 > > > > > --- a/fs/nfsd/filecache.h > > > > > +++ b/fs/nfsd/filecache.h > > > > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > > > > > > > struct nfsd_file_mark *nf_mark; > > > > > struct list_head nf_lru; > > > > > + struct list_head nf_gc; > > > > > struct rcu_head nf_rcu; > > > > > ktime_t nf_birthtime; > > > > > }; > > > > > > > > > > On Wed, Jul 10, 2024 at 8:49 AM Jeff Layton > > > > > <jlayton@kernel.org> > > > > > wrote: > > > > > > > > > > > > Ok, thanks. I went crawling over the filecache code, and > > > > > > found a > > > > > > couple > > > > > > of potential nfsd_file leaks. I'm testing patches now and > > > > > > will > > > > > > likely > > > > > > send them along later today. > > > > > > > > > > > > I think we do still want the patch to add nf_gc list_head. > > > > > > I'm not > > > > > > certain there is a race there, but it's safer to just use a > > > > > > separate > > > > > > list_head and it doesn't represent a lot of memory. > > > > > > > > > > > > Could you post that one individually so that Chuck can pick > > > > > > it up? > > > > > > > > > > > > Thanks, > > > > > > Jeff > > > > > > > > > > > > On Tue, 2024-07-09 at 15:13 -0400, Youzhong Yang wrote: > > > > > > > It's not on the LRU: > > > > > > > > > > > > > > crash> struct > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_gc > > > > > > > ffff898107b95980 > > > > > > > nf_flags = 12, > > > > > > > nf_ref.refs.counter = 1 > > > > > > > nf_lru = { > > > > > > > next = 0xffff898107b959c8, > > > > > > > prev = 0xffff898107b959c8 > > > > > > > }, > > > > > > > nf_gc = { > > > > > > > next = 0xffff898107b959d8, > > > > > > > prev = 0xffff898107b959d8 > > > > > > > }, > > > > > > > > > > > > > > I don't have an easy reproducer now as I am having > > > > > > > difficulty > > > > > > > generating tons of NFS read/write ops using a limited > > > > > > > number of > > > > > > > nfs > > > > > > > clients. I use our in-house testing farm to test the > > > > > > > patch. > > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > On Tue, Jul 9, 2024 at 3:05 PM Jeff Layton > > > > > > > <jlayton@kernel.org> > > > > > > > wrote: > > > > > > > > > > > > > > > > On Tue, 2024-07-09 at 14:37 -0400, Youzhong Yang wrote: > > > > > > > > > Thanks Jeff. > > > > > > > > > > > > > > > > > > Unfortunately the early unhash easily leads to leaks: > > > > > > > > > > > > > > > > > > crash> kmem -S nfsd_file | grep '\[ffff' | sed -e > > > > > > > > > 's|\[||' -e > > > > > > > > > 's|\]||' > > > > > > > > > > xargs -i echo struct > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > > > '{}' > > > > > > > > > > /var/tmp/nfsd_files > > > > > > > > > crash> !wc -l /var/tmp/nfsd_files > > > > > > > > > 19 /var/tmp/nfsd_files > > > > > > > > > crash> < /var/tmp/nfsd_files > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > > ffff88865c778900 > > > > > > > > > nf_flags = 8, > > > > > > > > > nf_ref.refs.counter = 2 > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > > ffff88865c778cc0 > > > > > > > > > nf_flags = 8, > > > > > > > > > nf_ref.refs.counter = 3 > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > > ffff8885d5f35e00 > > > > > > > > > nf_flags = 8, > > > > > > > > > nf_ref.refs.counter = 1 > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > > ffff88817443e780 > > > > > > > > > nf_flags = 8, > > > > > > > > > nf_ref.refs.counter = 3 > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > > ffff88818b3f0600 > > > > > > > > > nf_flags = 8, > > > > > > > > > nf_ref.refs.counter = 2 > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > > ffff88a4490f8300 > > > > > > > > > nf_flags = 8, > > > > > > > > > nf_ref.refs.counter = 1 > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > > ffff88a0dab183c0 > > > > > > > > > nf_flags = 8, > > > > > > > > > nf_ref.refs.counter = 40 > > > > > > > > > > > > > > > > That's a lot of references! Might be interesting to > > > > > > > > look more > > > > > > > > closely > > > > > > > > at that one, but the refcounts are all over the place, > > > > > > > > so it > > > > > > > > really > > > > > > > > does look like we just have a refcount leak somewhere. > > > > > > > > > > > > > > > > > ... > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > > ffff89209535f200 > > > > > > > > > nf_flags = 8, > > > > > > > > > nf_ref.refs.counter = 2 > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > > ffff8980e15138c0 > > > > > > > > > nf_flags = 8, > > > > > > > > > nf_ref.refs.counter = 7 > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > > ffff898107b95800 > > > > > > > > > nf_flags = 8, > > > > > > > > > nf_ref.refs.counter = 3 > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > > ffff898107b95980 > > > > > > > > > nf_flags = 12, > > > > > > > > > > > > > > > > The others are NFSD_FILE_GC. This one is also > > > > > > > > NFSD_FILE_REFERENCED. Are > > > > > > > > these objects on the LRU? > > > > > > > > > > > > > > > > > nf_ref.refs.counter = 1 > > > > > > > > > > > > > > > > > > nfsd_file_do_acquire() -> nfsd_file_lookup_locked() > > > > > > > > > relies on > > > > > > > > > the > > > > > > > > > hash > > > > > > > > > table to find the nfsd_file, > > > > > > > > > but I am still scratching my head why and how this > > > > > > > > > happens. > > > > > > > > > > > > > > > > > > FYI, here is the patch I applied for testing: > > > > > > > > > > > > > > > > > > > > > > > > > The above suggests to me that there is a garden-variety > > > > > > > > refcount leak > > > > > > > > somewhere. Whenever some bit of the code takes a > > > > > > > > reference to > > > > > > > > an object > > > > > > > > (like a nfsd_file), it's implicitly required to > > > > > > > > eventually put > > > > > > > > that > > > > > > > > reference. Typically that means that the code needs to > > > > > > > > maintain > > > > > > > > a > > > > > > > > pointer to that object, as it can be unhashed at any > > > > > > > > time and > > > > > > > > it can't > > > > > > > > rely on finding the same object later. > > > > > > > > > > > > > > > > Do you have a reproducer for this? > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c > > > > > > > > > b/fs/nfsd/filecache.c > > > > > > > > > index ad9083ca144b..22ebd7fb8639 100644 > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, > > > > > > > > > struct > > > > > > > > > inode > > > > > > > > > *inode, unsigned char need, > > > > > > > > > return NULL; > > > > > > > > > > > > > > > > > > INIT_LIST_HEAD(&nf->nf_lru); > > > > > > > > > + INIT_LIST_HEAD(&nf->nf_gc); > > > > > > > > > nf->nf_birthtime = ktime_get(); > > > > > > > > > nf->nf_file = NULL; > > > > > > > > > nf->nf_cred = get_current_cred(); > > > > > > > > > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct > > > > > > > > > list_head > > > > > > > > > *dispose) > > > > > > > > > struct nfsd_file *nf; > > > > > > > > > > > > > > > > > > while (!list_empty(dispose)) { > > > > > > > > > - nf = list_first_entry(dispose, struct > > > > > > > > > nfsd_file, > > > > > > > > > nf_lru); > > > > > > > > > - list_del_init(&nf->nf_lru); > > > > > > > > > + nf = list_first_entry(dispose, struct > > > > > > > > > nfsd_file, > > > > > > > > > nf_gc); > > > > > > > > > + list_del_init(&nf->nf_gc); > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > } > > > > > > > > > } > > > > > > > > > @@ -411,12 +412,12 @@ > > > > > > > > > nfsd_file_dispose_list_delayed(struct > > > > > > > > > list_head > > > > > > > > > *dispose) > > > > > > > > > { > > > > > > > > > while(!list_empty(dispose)) { > > > > > > > > > struct nfsd_file *nf = > > > > > > > > > list_first_entry(dispose, > > > > > > > > > - > > > > > > > > > struct > > > > > > > > > nfsd_file, > > > > > > > > > nf_lru); > > > > > > > > > + > > > > > > > > > struct > > > > > > > > > nfsd_file, > > > > > > > > > nf_gc); > > > > > > > > > struct nfsd_net *nn = net_generic(nf- > > > > > > > > > >nf_net, > > > > > > > > > nfsd_net_id); > > > > > > > > > struct nfsd_fcache_disposal *l = nn- > > > > > > > > > > fcache_disposal; > > > > > > > > > > > > > > > > > > spin_lock(&l->lock); > > > > > > > > > - list_move_tail(&nf->nf_lru, &l- > > > > > > > > > >freeme); > > > > > > > > > + list_move_tail(&nf->nf_gc, &l- > > > > > > > > > >freeme); > > > > > > > > > spin_unlock(&l->lock); > > > > > > > > > svc_wake_up(nn->nfsd_serv); > > > > > > > > > } > > > > > > > > > @@ -503,7 +504,8 @@ nfsd_file_lru_cb(struct list_head > > > > > > > > > *item, > > > > > > > > > struct > > > > > > > > > list_lru_one *lru, > > > > > > > > > > > > > > > > > > /* Refcount went to zero. Unhash it and queue > > > > > > > > > it to > > > > > > > > > the > > > > > > > > > dispose list */ > > > > > > > > > nfsd_file_unhash(nf); > > > > > > > > > - list_lru_isolate_move(lru, &nf->nf_lru, > > > > > > > > > head); > > > > > > > > > + list_lru_isolate(lru, &nf->nf_lru); > > > > > > > > > + list_add(&nf->nf_gc, head); > > > > > > > > > this_cpu_inc(nfsd_file_evictions); > > > > > > > > > trace_nfsd_file_gc_disposed(nf); > > > > > > > > > return LRU_REMOVED; > > > > > > > > > @@ -578,7 +580,7 @@ nfsd_file_cond_queue(struct > > > > > > > > > nfsd_file > > > > > > > > > *nf, struct > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > > > > > /* If refcount goes to 0, then put on the > > > > > > > > > dispose > > > > > > > > > list */ > > > > > > > > > if (refcount_sub_and_test(decrement, &nf- > > > > > > > > > >nf_ref)) { > > > > > > > > > - list_add(&nf->nf_lru, dispose); > > > > > > > > > + list_add(&nf->nf_gc, dispose); > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > } > > > > > > > > > } > > > > > > > > > @@ -654,8 +656,8 @@ nfsd_file_close_inode_sync(struct > > > > > > > > > inode > > > > > > > > > *inode) > > > > > > > > > > > > > > > > > > nfsd_file_queue_for_close(inode, &dispose); > > > > > > > > > while (!list_empty(&dispose)) { > > > > > > > > > - nf = list_first_entry(&dispose, > > > > > > > > > struct > > > > > > > > > nfsd_file, > > > > > > > > > nf_lru); > > > > > > > > > - list_del_init(&nf->nf_lru); > > > > > > > > > + nf = list_first_entry(&dispose, > > > > > > > > > struct > > > > > > > > > nfsd_file, > > > > > > > > > nf_gc); > > > > > > > > > + list_del_init(&nf->nf_gc); > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > } > > > > > > > > > } > > > > > > > > > diff --git a/fs/nfsd/filecache.h > > > > > > > > > b/fs/nfsd/filecache.h > > > > > > > > > index c61884def906..3fbec24eea6c 100644 > > > > > > > > > --- a/fs/nfsd/filecache.h > > > > > > > > > +++ b/fs/nfsd/filecache.h > > > > > > > > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > > > > > > > > > > > > > > > struct nfsd_file_mark *nf_mark; > > > > > > > > > struct list_head nf_lru; > > > > > > > > > + struct list_head nf_gc; > > > > > > > > > struct rcu_head nf_rcu; > > > > > > > > > ktime_t nf_birthtime; > > > > > > > > > }; > > > > > > > > > > > > > > > > > > On Mon, Jul 8, 2024 at 10:55 AM Jeff Layton > > > > > > > > > <jlayton@kernel.org> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > On Mon, 2024-07-08 at 10:23 -0400, Youzhong Yang > > > > > > > > > > wrote: > > > > > > > > > > > Thanks Jeff. > > > > > > > > > > > > > > > > > > > > > > I am ok with reverting the unhash/dispose list > > > > > > > > > > > reordering > > > > > > > > > > > in > > > > > > > > > > > nfsd_file_lru_cb(), as it doesn't make much of a > > > > > > > > > > > difference, > > > > > > > > > > > but for nfsd_file_cond_queue(), imagining this: > > > > > > > > > > > > > > > > > > > > > > - A nfsd_file is hashed > > > > > > > > > > > - In nfsd_file_cond_queue(), [if > > > > > > > > > > > (!nfsd_file_unhash(nf))] > > > > > > > > > > > will > > > > > > > > > > > get it > > > > > > > > > > > unhashed, doesn't it? > > > > > > > > > > > - It continues to get a reference by > > > > > > > > > > > nfsd_file_get() > > > > > > > > > > > - It continues to remove itself from LRU by > > > > > > > > > > > nfsd_file_lru_remove() if > > > > > > > > > > > it is on the LRU. > > > > > > > > > > > - Now it runs refcount_sub_and_test(), what > > > > > > > > > > > happens if > > > > > > > > > > > the refcnt > > > > > > > > > > > does > > > > > > > > > > > not go to 0? How can this nfsd_file be found > > > > > > > > > > > again? > > > > > > > > > > > Through the > > > > > > > > > > > hash > > > > > > > > > > > table? Through the LRU walk? how? > > > > > > > > > > > > > > > > > > > > > > Thanks again. > > > > > > > > > > > > > > > > > > > > > > -Youzhong > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > It won't need to be found again. The holders of the > > > > > > > > > > extra > > > > > > > > > > references > > > > > > > > > > will put those references when they are finished. > > > > > > > > > > Since the > > > > > > > > > > object > > > > > > > > > > is > > > > > > > > > > no longer HASHED, nfsd_file_put just does this: > > > > > > > > > > > > > > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > > > > > > > > > So that should be fine. > > > > > > > > > > > > > > > > > > > > > On Mon, Jul 8, 2024 at 9:35 AM Jeff Layton > > > > > > > > > > > <jlayton@kernel.org> > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > On Mon, 2024-07-08 at 08:58 -0400, Youzhong > > > > > > > > > > > > Yang wrote: > > > > > > > > > > > > > Thank you Jeff for your invaluable insights. > > > > > > > > > > > > > I was > > > > > > > > > > > > > leaning > > > > > > > > > > > > > towards > > > > > > > > > > > > > adding a new list_head too, and tested this > > > > > > > > > > > > > approach > > > > > > > > > > > > > on > > > > > > > > > > > > > kernel > > > > > > > > > > > > > 6.6 by > > > > > > > > > > > > > continuously hammering the server with heavy > > > > > > > > > > > > > nfs load > > > > > > > > > > > > > for the > > > > > > > > > > > > > last few > > > > > > > > > > > > > days, not a single leak. > > > > > > > > > > > > > > > > > > > > > > > > > > Here goes the patch (based on Linux kernel > > > > > > > > > > > > > master > > > > > > > > > > > > > branch), > > > > > > > > > > > > > please > > > > > > > > > > > > > review: > > > > > > > > > > > > > > > > > > > > > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > > > > > Date: Thu, 4 Jul 2024 11:25:40 -0400 > > > > > > > > > > > > > Subject: [PATCH] nfsd: fix nfsd_file leaking > > > > > > > > > > > > > due to > > > > > > > > > > > > > mixed use > > > > > > > > > > > > > of > > > > > > > > > > > > > nf->nf_lru > > > > > > > > > > > > > > > > > > > > > > > > > > nfsd_file_put() in one thread can race with > > > > > > > > > > > > > another > > > > > > > > > > > > > thread > > > > > > > > > > > > > doing > > > > > > > > > > > > > garbage collection (running nfsd_file_gc() -> > > > > > > > > > > > > > list_lru_walk() > > > > > > > > > > > > > -> > > > > > > > > > > > > > nfsd_file_lru_cb()): > > > > > > > > > > > > > > > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so > > > > > > > > > > > > > it tries > > > > > > > > > > > > > to do > > > > > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > > > > > * nfsd_file_lru_add() returns true (with > > > > > > > > > > > > > NFSD_FILE_REFERENCED > > > > > > > > > > > > > bit set) > > > > > > > > > > > > > * garbage collector kicks in, > > > > > > > > > > > > > nfsd_file_lru_cb() > > > > > > > > > > > > > clears > > > > > > > > > > > > > REFERENCED bit and > > > > > > > > > > > > > returns LRU_ROTATE. > > > > > > > > > > > > > * garbage collector kicks in again, > > > > > > > > > > > > > nfsd_file_lru_cb() now > > > > > > > > > > > > > decrements nf->nf_ref > > > > > > > > > > > > > to 0, runs nfsd_file_unhash(), removes it > > > > > > > > > > > > > from > > > > > > > > > > > > > the LRU > > > > > > > > > > > > > and > > > > > > > > > > > > > adds to > > > > > > > > > > > > > the dispose > > > > > > > > > > > > > list [list_lru_isolate_move(lru, &nf- > > > > > > > > > > > > > >nf_lru, > > > > > > > > > > > > > head)] > > > > > > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED > > > > > > > > > > > > > bit is > > > > > > > > > > > > > cleared, > > > > > > > > > > > > > so > > > > > > > > > > > > > it > > > > > > > > > > > > > tries to remove > > > > > > > > > > > > > the 'nf' from the LRU [if > > > > > > > > > > > > > (!nfsd_file_lru_remove(nf))]. > > > > > > > > > > > > > The > > > > > > > > > > > > > 'nf' > > > > > > > > > > > > > has been added > > > > > > > > > > > > > to the 'dispose' list by > > > > > > > > > > > > > nfsd_file_lru_cb(), so > > > > > > > > > > > > > nfsd_file_lru_remove(nf) simply > > > > > > > > > > > > > treats it as part of the LRU and removes > > > > > > > > > > > > > it, > > > > > > > > > > > > > which leads > > > > > > > > > > > > > to > > > > > > > > > > > > > its removal from > > > > > > > > > > > > > the 'dispose' list. > > > > > > > > > > > > > * At this moment, 'nf' is unhashed with its > > > > > > > > > > > > > nf_ref > > > > > > > > > > > > > being 0, > > > > > > > > > > > > > and > > > > > > > > > > > > > not > > > > > > > > > > > > > on the LRU. > > > > > > > > > > > > > nfsd_file_put() continues its execution > > > > > > > > > > > > > [if > > > > > > > > > > > > > (refcount_dec_and_test(&nf->nf_ref))], > > > > > > > > > > > > > as nf->nf_ref is already 0, nf->nf_ref is > > > > > > > > > > > > > set to > > > > > > > > > > > > > REFCOUNT_SATURATED, and the 'nf' > > > > > > > > > > > > > gets no chance of being freed. > > > > > > > > > > > > > > > > > > > > > > > > > > nfsd_file_put() can also race with > > > > > > > > > > > > > nfsd_file_cond_queue(): > > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so > > > > > > > > > > > > > it tries > > > > > > > > > > > > > to do > > > > > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > > > > > * nfsd_file_lru_add() sets REFERENCED bit > > > > > > > > > > > > > and > > > > > > > > > > > > > returns true. > > > > > > > > > > > > > * Some userland application runs 'exportfs > > > > > > > > > > > > > -f' or > > > > > > > > > > > > > something > > > > > > > > > > > > > like > > > > > > > > > > > > > that, which triggers > > > > > > > > > > > > > __nfsd_file_cache_purge() -> > > > > > > > > > > > > > nfsd_file_cond_queue(). > > > > > > > > > > > > > * In nfsd_file_cond_queue(), it runs [if > > > > > > > > > > > > > (!nfsd_file_unhash(nf))], > > > > > > > > > > > > > unhash is done > > > > > > > > > > > > > successfully. > > > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > > > (!nfsd_file_get(nf))], > > > > > > > > > > > > > now > > > > > > > > > > > > > nf->nf_ref goes to 2. > > > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > > > (nfsd_file_lru_remove(nf))], > > > > > > > > > > > > > it succeeds. > > > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > > > (refcount_sub_and_test(decrement, > > > > > > > > > > > > > &nf->nf_ref))] > > > > > > > > > > > > > (with "decrement" being 2), so the nf- > > > > > > > > > > > > > >nf_ref > > > > > > > > > > > > > goes to 0, > > > > > > > > > > > > > the > > > > > > > > > > > > > 'nf' > > > > > > > > > > > > > is added to the > > > > > > > > > > > > > dispose list [list_add(&nf->nf_lru, > > > > > > > > > > > > > dispose)] > > > > > > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED > > > > > > > > > > > > > bit is > > > > > > > > > > > > > cleared, > > > > > > > > > > > > > so > > > > > > > > > > > > > it > > > > > > > > > > > > > tries to remove > > > > > > > > > > > > > the 'nf' from the LRU [if > > > > > > > > > > > > > (!nfsd_file_lru_remove(nf))], > > > > > > > > > > > > > although > > > > > > > > > > > > > the 'nf' is not > > > > > > > > > > > > > in the LRU, but it is linked in the > > > > > > > > > > > > > 'dispose' > > > > > > > > > > > > > list, > > > > > > > > > > > > > nfsd_file_lru_remove() simply > > > > > > > > > > > > > treats it as part of the LRU and removes > > > > > > > > > > > > > it. This > > > > > > > > > > > > > leads > > > > > > > > > > > > > to > > > > > > > > > > > > > its removal from > > > > > > > > > > > > > the 'dispose' list! > > > > > > > > > > > > > * Now nf->ref is 0, unhashed. > > > > > > > > > > > > > nfsd_file_put() > > > > > > > > > > > > > continues its > > > > > > > > > > > > > execution and set > > > > > > > > > > > > > nf->nf_ref to REFCOUNT_SATURATED. > > > > > > > > > > > > > > > > > > > > > > > > > > As shown in the above analysis, using nf_lru > > > > > > > > > > > > > for both > > > > > > > > > > > > > the LRU > > > > > > > > > > > > > list and > > > > > > > > > > > > > dispose list > > > > > > > > > > > > > can cause the leaks. This patch adds a new > > > > > > > > > > > > > list_head > > > > > > > > > > > > > nf_gc in > > > > > > > > > > > > > struct > > > > > > > > > > > > > nfsd_file, and uses > > > > > > > > > > > > > it for the dispose list. It's not expected to > > > > > > > > > > > > > have a > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > unhashed but it's not > > > > > > > > > > > > > added to the dispose list, so in > > > > > > > > > > > > > nfsd_file_cond_queue() and > > > > > > > > > > > > > nfsd_file_lru_cb() nfsd_file > > > > > > > > > > > > > is unhashed after being added to the dispose > > > > > > > > > > > > > list. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I don't see where we require the object to be > > > > > > > > > > > > either > > > > > > > > > > > > hashed or > > > > > > > > > > > > on > > > > > > > > > > > > the > > > > > > > > > > > > dispose list. I think you probably just want > > > > > > > > > > > > to do a > > > > > > > > > > > > patch > > > > > > > > > > > > that > > > > > > > > > > > > changes the dispose list to use a dedicated > > > > > > > > > > > > list_head > > > > > > > > > > > > without > > > > > > > > > > > > reordering when the these things are unhashed. > > > > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Youzhong Yang > > > > > > > > > > > > > <youzhong@gmail.com> > > > > > > > > > > > > > --- > > > > > > > > > > > > > fs/nfsd/filecache.c | 23 ++++++++++++++----- > > > > > > > > > > > > > ---- > > > > > > > > > > > > > fs/nfsd/filecache.h | 1 + > > > > > > > > > > > > > 2 files changed, 15 insertions(+), 9 > > > > > > > > > > > > > deletions(-) > > > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c > > > > > > > > > > > > > b/fs/nfsd/filecache.c > > > > > > > > > > > > > index ad9083ca144b..3aef2ddfce94 100644 > > > > > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > > > > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct > > > > > > > > > > > > > net *net, > > > > > > > > > > > > > struct > > > > > > > > > > > > > inode > > > > > > > > > > > > > *inode, unsigned char need, > > > > > > > > > > > > > return NULL; > > > > > > > > > > > > > > > > > > > > > > > > > > INIT_LIST_HEAD(&nf->nf_lru); > > > > > > > > > > > > > + INIT_LIST_HEAD(&nf->nf_gc); > > > > > > > > > > > > > nf->nf_birthtime = ktime_get(); > > > > > > > > > > > > > nf->nf_file = NULL; > > > > > > > > > > > > > nf->nf_cred = get_current_cred(); > > > > > > > > > > > > > @@ -393,8 +394,8 @@ > > > > > > > > > > > > > nfsd_file_dispose_list(struct > > > > > > > > > > > > > list_head > > > > > > > > > > > > > *dispose) > > > > > > > > > > > > > struct nfsd_file *nf; > > > > > > > > > > > > > > > > > > > > > > > > > > while (!list_empty(dispose)) { > > > > > > > > > > > > > - nf = > > > > > > > > > > > > > list_first_entry(dispose, struct > > > > > > > > > > > > > nfsd_file, > > > > > > > > > > > > > nf_lru); > > > > > > > > > > > > > - list_del_init(&nf->nf_lru); > > > > > > > > > > > > > + nf = > > > > > > > > > > > > > list_first_entry(dispose, struct > > > > > > > > > > > > > nfsd_file, > > > > > > > > > > > > > nf_gc); > > > > > > > > > > > > > + list_del_init(&nf->nf_gc); > > > > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > > } > > > > > > > > > > > > > } > > > > > > > > > > > > > @@ -411,12 +412,12 @@ > > > > > > > > > > > > > nfsd_file_dispose_list_delayed(struct > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > { > > > > > > > > > > > > > while(!list_empty(dispose)) { > > > > > > > > > > > > > struct nfsd_file *nf = > > > > > > > > > > > > > list_first_entry(dispose, > > > > > > > > > > > > > - > > > > > > > > > > > > > struct > > > > > > > > > > > > > nfsd_file, > > > > > > > > > > > > > nf_lru); > > > > > > > > > > > > > + > > > > > > > > > > > > > struct > > > > > > > > > > > > > nfsd_file, > > > > > > > > > > > > > nf_gc); > > > > > > > > > > > > > struct nfsd_net *nn = > > > > > > > > > > > > > net_generic(nf- > > > > > > > > > > > > > > nf_net, > > > > > > > > > > > > > nfsd_net_id); > > > > > > > > > > > > > struct nfsd_fcache_disposal > > > > > > > > > > > > > *l = nn- > > > > > > > > > > > > > > fcache_disposal; > > > > > > > > > > > > > > > > > > > > > > > > > > spin_lock(&l->lock); > > > > > > > > > > > > > - list_move_tail(&nf->nf_lru, > > > > > > > > > > > > > &l- > > > > > > > > > > > > > > freeme); > > > > > > > > > > > > > + list_move_tail(&nf->nf_gc, > > > > > > > > > > > > > &l- > > > > > > > > > > > > > > freeme); > > > > > > > > > > > > > spin_unlock(&l->lock); > > > > > > > > > > > > > svc_wake_up(nn->nfsd_serv); > > > > > > > > > > > > > } > > > > > > > > > > > > > @@ -502,8 +503,10 @@ nfsd_file_lru_cb(struct > > > > > > > > > > > > > list_head *item, > > > > > > > > > > > > > struct > > > > > > > > > > > > > list_lru_one *lru, > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > /* Refcount went to zero. Unhash it > > > > > > > > > > > > > and queue > > > > > > > > > > > > > it to > > > > > > > > > > > > > the > > > > > > > > > > > > > dispose list */ > > > > > > > > > > > > > + list_lru_isolate(lru, &nf->nf_lru); > > > > > > > > > > > > > + list_add(&nf->nf_gc, head); > > > > > > > > > > > > > + /* Unhash after removing from LRU and > > > > > > > > > > > > > adding > > > > > > > > > > > > > to > > > > > > > > > > > > > dispose > > > > > > > > > > > > > list */ > > > > > > > > > > > > > nfsd_file_unhash(nf); > > > > > > > > > > > > > - list_lru_isolate_move(lru, &nf- > > > > > > > > > > > > > >nf_lru, > > > > > > > > > > > > > head); > > > > > > > > > > > > > > > > > > > > > > > > I don't see the point in reordering these > > > > > > > > > > > > operations. > > > > > > > > > > > > Hashing > > > > > > > > > > > > is > > > > > > > > > > > > all > > > > > > > > > > > > about making the thing findable by nfsd > > > > > > > > > > > > operations. The > > > > > > > > > > > > _last_ > > > > > > > > > > > > thing we > > > > > > > > > > > > want to do is put it on the dispose list while > > > > > > > > > > > > the > > > > > > > > > > > > thing can > > > > > > > > > > > > still > > > > > > > > > > > > be > > > > > > > > > > > > found by nfsd threads doing operations. > > > > > > > > > > > > > > > > > > > > > > > > > this_cpu_inc(nfsd_file_evictions); > > > > > > > > > > > > > trace_nfsd_file_gc_disposed(nf); > > > > > > > > > > > > > return LRU_REMOVED; > > > > > > > > > > > > > @@ -565,7 +568,7 @@ > > > > > > > > > > > > > nfsd_file_cond_queue(struct > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > *nf, > > > > > > > > > > > > > struct > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we raced with someone else > > > > > > > > > > > > > unhashing, > > > > > > > > > > > > > ignore it > > > > > > > > > > > > > */ > > > > > > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf- > > > > > > > > > > > > > > nf_flags)) > > > > > > > > > > > > > > > > > > > > > > > > The above change looks wrong. I don't think we > > > > > > > > > > > > need to > > > > > > > > > > > > change > > > > > > > > > > > > this. > > > > > > > > > > > > > > > > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we can't get a reference, > > > > > > > > > > > > > ignore it */ > > > > > > > > > > > > > @@ -578,7 +581,9 @@ > > > > > > > > > > > > > nfsd_file_cond_queue(struct > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > *nf, > > > > > > > > > > > > > struct > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > > > > > > > > > > > > > /* If refcount goes to 0, then put on > > > > > > > > > > > > > the > > > > > > > > > > > > > dispose > > > > > > > > > > > > > list */ > > > > > > > > > > > > > if (refcount_sub_and_test(decrement, > > > > > > > > > > > > > &nf- > > > > > > > > > > > > > > nf_ref)) { > > > > > > > > > > > > > - list_add(&nf->nf_lru, > > > > > > > > > > > > > dispose); > > > > > > > > > > > > > + list_add(&nf->nf_gc, > > > > > > > > > > > > > dispose); > > > > > > > > > > > > > + /* Unhash after adding to > > > > > > > > > > > > > dispose > > > > > > > > > > > > > list */ > > > > > > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > > > > > > > > > > > > > > > > This too looks wrong? Maybe I'm unclear on the > > > > > > > > > > > > race > > > > > > > > > > > > you're > > > > > > > > > > > > trying > > > > > > > > > > > > to > > > > > > > > > > > > fix with this? What's the harm in unhashing it > > > > > > > > > > > > early? > > > > > > > > > > > > > > > > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > > > > > } > > > > > > > > > > > > > } > > > > > > > > > > > > > @@ -654,8 +659,8 @@ > > > > > > > > > > > > > nfsd_file_close_inode_sync(struct > > > > > > > > > > > > > inode > > > > > > > > > > > > > *inode) > > > > > > > > > > > > > > > > > > > > > > > > > > nfsd_file_queue_for_close(inode, > > > > > > > > > > > > > &dispose); > > > > > > > > > > > > > while (!list_empty(&dispose)) { > > > > > > > > > > > > > - nf = > > > > > > > > > > > > > list_first_entry(&dispose, > > > > > > > > > > > > > struct > > > > > > > > > > > > > nfsd_file, > > > > > > > > > > > > > nf_lru); > > > > > > > > > > > > > - list_del_init(&nf->nf_lru); > > > > > > > > > > > > > + nf = > > > > > > > > > > > > > list_first_entry(&dispose, > > > > > > > > > > > > > struct > > > > > > > > > > > > > nfsd_file, > > > > > > > > > > > > > nf_gc); > > > > > > > > > > > > > + list_del_init(&nf->nf_gc); > > > > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > > } > > > > > > > > > > > > > } > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.h > > > > > > > > > > > > > b/fs/nfsd/filecache.h > > > > > > > > > > > > > index c61884def906..3fbec24eea6c 100644 > > > > > > > > > > > > > --- a/fs/nfsd/filecache.h > > > > > > > > > > > > > +++ b/fs/nfsd/filecache.h > > > > > > > > > > > > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > > > > > > > > > > > > > > > > > > > > > > > struct nfsd_file_mark *nf_mark; > > > > > > > > > > > > > struct list_head nf_lru; > > > > > > > > > > > > > + struct list_head nf_gc; > > > > > > > > > > > > > struct rcu_head nf_rcu; > > > > > > > > > > > > > ktime_t nf_birthtime; > > > > > > > > > > > > > }; > > > > > > > > > > > > > -- > > > > > > > > > > > > > 2.34.1 > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Jul 4, 2024 at 7:14 AM Jeff Layton > > > > > > > > > > > > > <jlayton@poochiereds.net> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 2024-07-03 at 16:46 -0400, Youzhong > > > > > > > > > > > > > > Yang > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > Thank you Chuck. Here are my quick > > > > > > > > > > > > > > > answers to > > > > > > > > > > > > > > > your > > > > > > > > > > > > > > > comments: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - I don't have a quick reproducer. I > > > > > > > > > > > > > > > reproduced > > > > > > > > > > > > > > > it by > > > > > > > > > > > > > > > using > > > > > > > > > > > > > > > hundreds > > > > > > > > > > > > > > > of nfs clients generating +600K ops under > > > > > > > > > > > > > > > our > > > > > > > > > > > > > > > workload in > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > testing > > > > > > > > > > > > > > > environment. Theoretically it should be > > > > > > > > > > > > > > > possible > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > simplify > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > reproduction but I am still working on > > > > > > > > > > > > > > > it. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - I understand zfs is an out-of-tree > > > > > > > > > > > > > > > file > > > > > > > > > > > > > > > system. That's > > > > > > > > > > > > > > > fine. But > > > > > > > > > > > > > > > this leaking can happen to any file > > > > > > > > > > > > > > > system, and > > > > > > > > > > > > > > > leaking > > > > > > > > > > > > > > > is > > > > > > > > > > > > > > > not a good > > > > > > > > > > > > > > > thing no matter what file system it is. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - I will try to come up with a > > > > > > > > > > > > > > > reproducer using > > > > > > > > > > > > > > > xfs or > > > > > > > > > > > > > > > btrfs > > > > > > > > > > > > > > > if possible. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Now back to the problem itself, here are > > > > > > > > > > > > > > > my > > > > > > > > > > > > > > > findings: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - nfsd_file_put() in one thread can race > > > > > > > > > > > > > > > with > > > > > > > > > > > > > > > another > > > > > > > > > > > > > > > thread > > > > > > > > > > > > > > > doing > > > > > > > > > > > > > > > garbage collection (running > > > > > > > > > > > > > > > nfsd_file_gc() -> > > > > > > > > > > > > > > > list_lru_walk() > > > > > > > > > > > > > > > -> > > > > > > > > > > > > > > > nfsd_file_lru_cb()): > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, > > > > > > > > > > > > > > > so it > > > > > > > > > > > > > > > tries to > > > > > > > > > > > > > > > do > > > > > > > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > > > > > > > * nfsd_file_lru_add() returns true > > > > > > > > > > > > > > > (thus > > > > > > > > > > > > > > > NFSD_FILE_REFERENCED bit > > > > > > > > > > > > > > > set for nf->nf_flags) > > > > > > > > > > > > > > > * garbage collector kicks in, > > > > > > > > > > > > > > > nfsd_file_lru_cb() clears > > > > > > > > > > > > > > > REFERENCED > > > > > > > > > > > > > > > bit and returns LRU_ROTATE. > > > > > > > > > > > > > > > * garbage collector kicks in again, > > > > > > > > > > > > > > > nfsd_file_lru_cb() > > > > > > > > > > > > > > > now > > > > > > > > > > > > > > > decrements nf->nf_ref to 0, runs > > > > > > > > > > > > > > > nfsd_file_unhash(), > > > > > > > > > > > > > > > removes > > > > > > > > > > > > > > > it from > > > > > > > > > > > > > > > the LRU and adds to the dispose list > > > > > > > > > > > > > > > [list_lru_isolate_move(lru, > > > > > > > > > > > > > > > &nf->nf_lru, head);] > > > > > > > > > > > > > > > * nfsd_file_put() detects > > > > > > > > > > > > > > > NFSD_FILE_HASHED bit > > > > > > > > > > > > > > > is > > > > > > > > > > > > > > > cleared, > > > > > > > > > > > > > > > so it > > > > > > > > > > > > > > > tries to remove the 'nf' from the LRU [if > > > > > > > > > > > > > > > (!nfsd_file_lru_remove(nf))] > > > > > > > > > > > > > > > * The 'nf' has been added to the > > > > > > > > > > > > > > > 'dispose' list > > > > > > > > > > > > > > > by > > > > > > > > > > > > > > > nfsd_file_lru_cb(), so > > > > > > > > > > > > > > > nfsd_file_lru_remove(nf) > > > > > > > > > > > > > > > simply > > > > > > > > > > > > > > > treats > > > > > > > > > > > > > > > it as > > > > > > > > > > > > > > > part of the LRU and removes it, which > > > > > > > > > > > > > > > leads it to > > > > > > > > > > > > > > > be > > > > > > > > > > > > > > > removed > > > > > > > > > > > > > > > from the > > > > > > > > > > > > > > > 'dispose' list. > > > > > > > > > > > > > > > * At this moment, nf->nf_ref is 0, it's > > > > > > > > > > > > > > > unhashed, and > > > > > > > > > > > > > > > not > > > > > > > > > > > > > > > on the > > > > > > > > > > > > > > > LRU. nfsd_file_put() continues its > > > > > > > > > > > > > > > execution [if > > > > > > > > > > > > > > > (refcount_dec_and_test(&nf->nf_ref))], as > > > > > > > > > > > > > > > nf- > > > > > > > > > > > > > > > > nf_ref is > > > > > > > > > > > > > > > already 0, now > > > > > > > > > > > > > > > bad thing happens: nf->nf_ref is set to > > > > > > > > > > > > > > > REFCOUNT_SATURATED, > > > > > > > > > > > > > > > and the > > > > > > > > > > > > > > > 'nf' is leaked. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > To make this happen, the right timing is > > > > > > > > > > > > > > > crucial. > > > > > > > > > > > > > > > It can > > > > > > > > > > > > > > > be > > > > > > > > > > > > > > > reproduced > > > > > > > > > > > > > > > by adding artifical delays in > > > > > > > > > > > > > > > filecache.c, or > > > > > > > > > > > > > > > hammering > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > nfsd with > > > > > > > > > > > > > > > tons of ops. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Let's see how nfsd_file_put() can race > > > > > > > > > > > > > > > with > > > > > > > > > > > > > > > nfsd_file_cond_queue(): > > > > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, > > > > > > > > > > > > > > > so it > > > > > > > > > > > > > > > tries to > > > > > > > > > > > > > > > do > > > > > > > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > > > > > > > * nfsd_file_lru_add() sets REFERENCED > > > > > > > > > > > > > > > bit and > > > > > > > > > > > > > > > returns > > > > > > > > > > > > > > > true. > > > > > > > > > > > > > > > * 'exportfs -f' or something like that > > > > > > > > > > > > > > > triggers > > > > > > > > > > > > > > > __nfsd_file_cache_purge() -> > > > > > > > > > > > > > > > nfsd_file_cond_queue(). > > > > > > > > > > > > > > > * In nfsd_file_cond_queue(), it runs > > > > > > > > > > > > > > > [if > > > > > > > > > > > > > > > (!nfsd_file_unhash(nf))], > > > > > > > > > > > > > > > unhash is done successfully. > > > > > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > > > > > (!nfsd_file_get(nf))], > > > > > > > > > > > > > > > now > > > > > > > > > > > > > > > nf->nf_ref goes to 2. > > > > > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > > > > > (nfsd_file_lru_remove(nf))], it succeeds. > > > > > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > > > > > (refcount_sub_and_test(decrement, > > > > > > > > > > > > > > > &nf->nf_ref))] (with "decrement" being > > > > > > > > > > > > > > > 2), so the > > > > > > > > > > > > > > > nf- > > > > > > > > > > > > > > > > nf_ref > > > > > > > > > > > > > > > goes to > > > > > > > > > > > > > > > 0, the 'nf' is added to the dispost list > > > > > > > > > > > > > > > [list_add(&nf- > > > > > > > > > > > > > > > > nf_lru, > > > > > > > > > > > > > > > dispose)] > > > > > > > > > > > > > > > * nfsd_file_put() detects > > > > > > > > > > > > > > > NFSD_FILE_HASHED bit > > > > > > > > > > > > > > > is > > > > > > > > > > > > > > > cleared, > > > > > > > > > > > > > > > so it > > > > > > > > > > > > > > > tries to remove the 'nf' from the LRU [if > > > > > > > > > > > > > > > (!nfsd_file_lru_remove(nf))], although > > > > > > > > > > > > > > > the 'nf' > > > > > > > > > > > > > > > is not in > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > LRU, but > > > > > > > > > > > > > > > it is linked in the 'dispose' list, > > > > > > > > > > > > > > > nfsd_file_lru_remove() > > > > > > > > > > > > > > > simply > > > > > > > > > > > > > > > treats it as part of the LRU and removes > > > > > > > > > > > > > > > it. This > > > > > > > > > > > > > > > leads > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > its removal > > > > > > > > > > > > > > > from the 'dispose' list! > > > > > > > > > > > > > > > * Now nf->ref is 0, unhashed. > > > > > > > > > > > > > > > nfsd_file_put() > > > > > > > > > > > > > > > continues > > > > > > > > > > > > > > > its > > > > > > > > > > > > > > > execution and sets nf->nf_ref to > > > > > > > > > > > > > > > REFCOUNT_SATURATED. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The purpose of nf->nf_lru is problematic. > > > > > > > > > > > > > > > As you > > > > > > > > > > > > > > > can see, > > > > > > > > > > > > > > > it > > > > > > > > > > > > > > > is used > > > > > > > > > > > > > > > for the LRU list, and also the 'dispose' > > > > > > > > > > > > > > > list. > > > > > > > > > > > > > > > Adding > > > > > > > > > > > > > > > another > > > > > > > > > > > > > > > 'struct > > > > > > > > > > > > > > > list_head' specifically for the 'dispose' > > > > > > > > > > > > > > > list > > > > > > > > > > > > > > > seems to > > > > > > > > > > > > > > > be a > > > > > > > > > > > > > > > better > > > > > > > > > > > > > > > way of fixing this race condition. Either > > > > > > > > > > > > > > > way > > > > > > > > > > > > > > > works for > > > > > > > > > > > > > > > me. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Would you agree my above analysis makes > > > > > > > > > > > > > > > sense? > > > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think so. It's been a while since I've > > > > > > > > > > > > > > done much > > > > > > > > > > > > > > work in > > > > > > > > > > > > > > this > > > > > > > > > > > > > > code, > > > > > > > > > > > > > > but it does sound like there is a race in > > > > > > > > > > > > > > the LRU > > > > > > > > > > > > > > handling. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Like Chuck said, the nf->nf_lru list should > > > > > > > > > > > > > > be safe > > > > > > > > > > > > > > to use > > > > > > > > > > > > > > for > > > > > > > > > > > > > > multiple > > > > > > > > > > > > > > purposes, but that's only the case if we're > > > > > > > > > > > > > > not > > > > > > > > > > > > > > using that > > > > > > > > > > > > > > list > > > > > > > > > > > > > > as an > > > > > > > > > > > > > > indicator. > > > > > > > > > > > > > > > > > > > > > > > > > > > > The list_lru code does check this: > > > > > > > > > > > > > > > > > > > > > > > > > > > > if (!list_empty(item)) { > > > > > > > > > > > > > > > > > > > > > > > > > > > > ...so if we ever check this while it's > > > > > > > > > > > > > > sitting on > > > > > > > > > > > > > > the > > > > > > > > > > > > > > dispose > > > > > > > > > > > > > > list, it > > > > > > > > > > > > > > will handle it incorrectly. It sounds like > > > > > > > > > > > > > > that's > > > > > > > > > > > > > > the root > > > > > > > > > > > > > > cause of the > > > > > > > > > > > > > > problem you're seeing? > > > > > > > > > > > > > > > > > > > > > > > > > > > > If so, then maybe a separate list_head for > > > > > > > > > > > > > > disposal > > > > > > > > > > > > > > would > > > > > > > > > > > > > > be > > > > > > > > > > > > > > better. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Here is my patch with signed-off-by: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > > > > > > > Date: Mon, 1 Jul 2024 06:45:22 -0400 > > > > > > > > > > > > > > > Subject: [PATCH] nfsd: fix nfsd_file > > > > > > > > > > > > > > > leaking due > > > > > > > > > > > > > > > to race > > > > > > > > > > > > > > > condition and early > > > > > > > > > > > > > > > unhash > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Youzhong Yang > > > > > > > > > > > > > > > <youzhong@gmail.com> > > > > > > > > > > > > > > > --- > > > > > > > > > > > > > > > fs/nfsd/filecache.c | 14 +++++++++++++- > > > > > > > > > > > > > > > 1 file changed, 13 insertions(+), 1 > > > > > > > > > > > > > > > deletion(-) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c > > > > > > > > > > > > > > > b/fs/nfsd/filecache.c > > > > > > > > > > > > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > > > > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > > > > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct > > > > > > > > > > > > > > > nfsd_file *nf) > > > > > > > > > > > > > > > if > > > > > > > > > > > > > > > (!nfsd_file_lru_remove(nf)) > > > > > > > > > > > > > > > return; > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > + /* > > > > > > > > > > > > > > > + * Racing with > > > > > > > > > > > > > > > nfsd_file_cond_queue() or > > > > > > > > > > > > > > > nfsd_file_lru_cb(), > > > > > > > > > > > > > > > + * it's unhashed but then > > > > > > > > > > > > > > > removed > > > > > > > > > > > > > > > from > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > dispose list, > > > > > > > > > > > > > > > + * so we need to free it. > > > > > > > > > > > > > > > + */ > > > > > > > > > > > > > > > + if (refcount_read(&nf- > > > > > > > > > > > > > > > >nf_ref) == > > > > > > > > > > > > > > > 0 && > > > > > > > > > > > > > > > > > > > > > > > > > > > > A refcount_read in this path is a red flag > > > > > > > > > > > > > > to me. > > > > > > > > > > > > > > Anytime > > > > > > > > > > > > > > you're just > > > > > > > > > > > > > > looking at the refcount without changing > > > > > > > > > > > > > > anything > > > > > > > > > > > > > > just > > > > > > > > > > > > > > screams > > > > > > > > > > > > > > out > > > > > > > > > > > > > > "race condition". > > > > > > > > > > > > > > > > > > > > > > > > > > > > In this case, what guarantee is there that > > > > > > > > > > > > > > this > > > > > > > > > > > > > > won't run > > > > > > > > > > > > > > afoul > > > > > > > > > > > > > > of the > > > > > > > > > > > > > > timing? We could check this and find out > > > > > > > > > > > > > > it's 1 > > > > > > > > > > > > > > just before > > > > > > > > > > > > > > it > > > > > > > > > > > > > > goes to > > > > > > > > > > > > > > 0 and you check the other conditions. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Does anything prevent that? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > + > > > > > > > > > > > > > > > !test_bit(NFSD_FILE_HASHED, > > > > > > > > > > > > > > > &nf- > > > > > > > > > > > > > > > > nf_flags) && > > > > > > > > > > > > > > > + list_empty(&nf- > > > > > > > > > > > > > > > >nf_lru)) { > > > > > > > > > > > > > > > + > > > > > > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > > > > + return; > > > > > > > > > > > > > > > + } > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > if (refcount_dec_and_test(&nf- > > > > > > > > > > > > > > > >nf_ref)) > > > > > > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > > > > @@ -576,7 +587,7 @@ > > > > > > > > > > > > > > > nfsd_file_cond_queue(struct > > > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > > > *nf, struct > > > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we raced with someone else > > > > > > > > > > > > > > > unhashing, > > > > > > > > > > > > > > > ignore it > > > > > > > > > > > > > > > */ > > > > > > > > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, > > > > > > > > > > > > > > > &nf- > > > > > > > > > > > > > > > > nf_flags)) > > > > > > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > > > > > > > > > > > Same here: you're just testing for the > > > > > > > > > > > > > > HASHED bit, > > > > > > > > > > > > > > but > > > > > > > > > > > > > > could > > > > > > > > > > > > > > this also > > > > > > > > > > > > > > race with someone who is setting it just > > > > > > > > > > > > > > after you > > > > > > > > > > > > > > get > > > > > > > > > > > > > > here. > > > > > > > > > > > > > > Why is > > > > > > > > > > > > > > that not a problem? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we can't get a reference, > > > > > > > > > > > > > > > ignore it > > > > > > > > > > > > > > > */ > > > > > > > > > > > > > > > @@ -590,6 +601,7 @@ > > > > > > > > > > > > > > > nfsd_file_cond_queue(struct > > > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > > > *nf, struct > > > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > > /* If refcount goes to 0, then > > > > > > > > > > > > > > > put on the > > > > > > > > > > > > > > > dispose > > > > > > > > > > > > > > > list */ > > > > > > > > > > > > > > > if > > > > > > > > > > > > > > > (refcount_sub_and_test(decrement, &nf- > > > > > > > > > > > > > > > > nf_ref)) { > > > > > > > > > > > > > > > list_add(&nf->nf_lru, > > > > > > > > > > > > > > > dispose); > > > > > > > > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > 2.43.0 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Jul 3, 2024 at 2:21 PM Chuck > > > > > > > > > > > > > > > Lever > > > > > > > > > > > > > > > <chuck.lever@oracle.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Jul 03, 2024 at 10:12:33AM - > > > > > > > > > > > > > > > > 0400, > > > > > > > > > > > > > > > > Youzhong Yang > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Hello, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'd like to report a nfsd_file > > > > > > > > > > > > > > > > > leaking issue > > > > > > > > > > > > > > > > > and > > > > > > > > > > > > > > > > > propose > > > > > > > > > > > > > > > > > a fix for it. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > When I tested Linux kernel 6.8 and > > > > > > > > > > > > > > > > > 6.6, I > > > > > > > > > > > > > > > > > noticed > > > > > > > > > > > > > > > > > nfsd_file leaks > > > > > > > > > > > > > > > > > which led to undestroyable file > > > > > > > > > > > > > > > > > systems > > > > > > > > > > > > > > > > > (zfs), > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the report. Some initial > > > > > > > > > > > > > > > > comments: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Do you have a specific reproducer? In > > > > > > > > > > > > > > > > other > > > > > > > > > > > > > > > > words, > > > > > > > > > > > > > > > > what > > > > > > > > > > > > > > > > is the > > > > > > > > > > > > > > > > simplest program that can run on an > > > > > > > > > > > > > > > > NFS > > > > > > > > > > > > > > > > client that > > > > > > > > > > > > > > > > will > > > > > > > > > > > > > > > > trigger > > > > > > > > > > > > > > > > this leak, and can you post it? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - "zfs" is an out-of-tree file system, > > > > > > > > > > > > > > > > so it's > > > > > > > > > > > > > > > > not > > > > > > > > > > > > > > > > directly > > > > > > > > > > > > > > > > supported for NFSD. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - The guidelines for patch submission > > > > > > > > > > > > > > > > require > > > > > > > > > > > > > > > > us to fix > > > > > > > > > > > > > > > > issues in > > > > > > > > > > > > > > > > upstream Linux first (currently > > > > > > > > > > > > > > > > that's v6.10- > > > > > > > > > > > > > > > > rc6). > > > > > > > > > > > > > > > > Then > > > > > > > > > > > > > > > > that fix > > > > > > > > > > > > > > > > can be backported to older stable > > > > > > > > > > > > > > > > kernels > > > > > > > > > > > > > > > > like 6.6. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Can you reproduce the leak with one of > > > > > > > > > > > > > > > > the in- > > > > > > > > > > > > > > > > kernel > > > > > > > > > > > > > > > > filesystems > > > > > > > > > > > > > > > > (either xfs or btrfs would be great) > > > > > > > > > > > > > > > > and with > > > > > > > > > > > > > > > > NFSD in > > > > > > > > > > > > > > > > 6.10- > > > > > > > > > > > > > > > > rc6? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > One more comment below. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > here are some examples: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct nfsd_file -x > > > > > > > > > > > > > > > > > ffff88e160db0460 > > > > > > > > > > > > > > > > > struct nfsd_file { > > > > > > > > > > > > > > > > > nf_rlist = { > > > > > > > > > > > > > > > > > rhead = { > > > > > > > > > > > > > > > > > next = 0xffff8921fa2392f1 > > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > > next = 0x0 > > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > > nf_inode = 0xffff8882bc312ef8, > > > > > > > > > > > > > > > > > nf_file = 0xffff88e2015b1500, > > > > > > > > > > > > > > > > > nf_cred = 0xffff88e3ab0e7800, > > > > > > > > > > > > > > > > > nf_net = 0xffffffff83d41600 > > > > > > > > > > > > > > > > > <init_net>, > > > > > > > > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > > > > > > > > nf_ref = { > > > > > > > > > > > > > > > > > refs = { > > > > > > > > > > > > > > > > > counter = 0xc0000000 > > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > > nf_may = 0x4, > > > > > > > > > > > > > > > > > nf_mark = 0xffff88e1bddfb320, > > > > > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > > > > > next = 0xffff88e160db04a8, > > > > > > > > > > > > > > > > > prev = 0xffff88e160db04a8 > > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > > nf_rcu = { > > > > > > > > > > > > > > > > > next = 0x10000000000, > > > > > > > > > > > > > > > > > func = 0x0 > > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > > nf_birthtime = 0x73d22fc1728 > > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counte > > > > > > > > > > > > > > > > > r,nf_lru > > > > > > > > > > > > > > > > > ,nf_file > > > > > > > > > > > > > > > > > -x > > > > > > > > > > > > > > > > > ffff88839a53d850 > > > > > > > > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > > > > > next = 0xffff88839a53d898, > > > > > > > > > > > > > > > > > prev = 0xffff88839a53d898 > > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > > nf_file = 0xffff88810ede8700, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counte > > > > > > > > > > > > > > > > > r,nf_lru > > > > > > > > > > > > > > > > > ,nf_file > > > > > > > > > > > > > > > > > -x > > > > > > > > > > > > > > > > > ffff88c32b11e850 > > > > > > > > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > > > > > next = 0xffff88c32b11e898, > > > > > > > > > > > > > > > > > prev = 0xffff88c32b11e898 > > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > > nf_file = 0xffff88c20a701c00, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counte > > > > > > > > > > > > > > > > > r,nf_lru > > > > > > > > > > > > > > > > > ,nf_file > > > > > > > > > > > > > > > > > -x > > > > > > > > > > > > > > > > > ffff88e372709700 > > > > > > > > > > > > > > > > > nf_flags = 0xc, > > > > > > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > > > > > next = 0xffff88e372709748, > > > > > > > > > > > > > > > > > prev = 0xffff88e372709748 > > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > > nf_file = 0xffff88e0725e6400, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counte > > > > > > > > > > > > > > > > > r,nf_lru > > > > > > > > > > > > > > > > > ,nf_file > > > > > > > > > > > > > > > > > -x > > > > > > > > > > > > > > > > > ffff8982864944d0 > > > > > > > > > > > > > > > > > nf_flags = 0xc, > > > > > > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > > > > > next = 0xffff898286494518, > > > > > > > > > > > > > > > > > prev = 0xffff898286494518 > > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > > nf_file = 0xffff89803c0ff700, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The leak occurs when nfsd_file_put() > > > > > > > > > > > > > > > > > races > > > > > > > > > > > > > > > > > with > > > > > > > > > > > > > > > > > nfsd_file_cond_queue() > > > > > > > > > > > > > > > > > or nfsd_file_lru_cb(). With the > > > > > > > > > > > > > > > > > following > > > > > > > > > > > > > > > > > patch, I > > > > > > > > > > > > > > > > > haven't observed > > > > > > > > > > > > > > > > > any leak after a few days heavy nfs > > > > > > > > > > > > > > > > > load: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Our patch submission guidelines require > > > > > > > > > > > > > > > > a > > > > > > > > > > > > > > > > Signed-off- > > > > > > > > > > > > > > > > by: > > > > > > > > > > > > > > > > line at the end of the patch > > > > > > > > > > > > > > > > description. See > > > > > > > > > > > > > > > > the "Sign > > > > > > > > > > > > > > > > your work - > > > > > > > > > > > > > > > > the Developer's Certificate of Origin" > > > > > > > > > > > > > > > > section > > > > > > > > > > > > > > > > of > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.10-rc6 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > (Needed here in case your fix is > > > > > > > > > > > > > > > > acceptable). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c > > > > > > > > > > > > > > > > > b/fs/nfsd/filecache.c > > > > > > > > > > > > > > > > > index 1a6d5d000b85..2323829f7208 > > > > > > > > > > > > > > > > > 100644 > > > > > > > > > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > > > > > > > > > @@ -389,6 +389,17 @@ > > > > > > > > > > > > > > > > > nfsd_file_put(struct > > > > > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > > > > > *nf) > > > > > > > > > > > > > > > > > if (!nfsd_file_lru_remove(nf)) > > > > > > > > > > > > > > > > > return; > > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > + /* > > > > > > > > > > > > > > > > > + * Racing with > > > > > > > > > > > > > > > > > nfsd_file_cond_queue() or > > > > > > > > > > > > > > > > > nfsd_file_lru_cb(), > > > > > > > > > > > > > > > > > + * it's unhashed but then removed > > > > > > > > > > > > > > > > > from the > > > > > > > > > > > > > > > > > dispose > > > > > > > > > > > > > > > > > list, > > > > > > > > > > > > > > > > > + * so we need to free it. > > > > > > > > > > > > > > > > > + */ > > > > > > > > > > > > > > > > > + if (refcount_read(&nf->nf_ref) == 0 > > > > > > > > > > > > > > > > > && > > > > > > > > > > > > > > > > > + !test_bit(NFSD_FILE_HASHED, > > > > > > > > > > > > > > > > > &nf- > > > > > > > > > > > > > > > > > > nf_flags) && > > > > > > > > > > > > > > > > > + list_empty(&nf->nf_lru)) { > > > > > > > > > > > > > > > > > + nfsd_file_free(nf); > > > > > > > > > > > > > > > > > + return; > > > > > > > > > > > > > > > > > + } > > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > if (refcount_dec_and_test(&nf- > > > > > > > > > > > > > > > > > >nf_ref)) > > > > > > > > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > > > > > > @@ -576,7 +587,7 @@ > > > > > > > > > > > > > > > > > nfsd_file_cond_queue(struct > > > > > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > > > > > *nf, struct > > > > > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we raced with someone else > > > > > > > > > > > > > > > > > unhashing, > > > > > > > > > > > > > > > > > ignore > > > > > > > > > > > > > > > > > it > > > > > > > > > > > > > > > > > */ > > > > > > > > > > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > > > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, > > > > > > > > > > > > > > > > > &nf- > > > > > > > > > > > > > > > > > > nf_flags)) > > > > > > > > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we can't get a reference, > > > > > > > > > > > > > > > > > ignore it > > > > > > > > > > > > > > > > > */ > > > > > > > > > > > > > > > > > @@ -590,6 +601,7 @@ > > > > > > > > > > > > > > > > > nfsd_file_cond_queue(struct > > > > > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > > > > > *nf, struct > > > > > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > > > > /* If refcount goes to 0, then put > > > > > > > > > > > > > > > > > on the > > > > > > > > > > > > > > > > > dispose > > > > > > > > > > > > > > > > > list > > > > > > > > > > > > > > > > > */ > > > > > > > > > > > > > > > > > if > > > > > > > > > > > > > > > > > (refcount_sub_and_test(decrement, > > > > > > > > > > > > > > > > > &nf- > > > > > > > > > > > > > > > > > > nf_ref)) > > > > > > > > > > > > > > > > > { > > > > > > > > > > > > > > > > > list_add(&nf->nf_lru, dispose); > > > > > > > > > > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Please kindly review the patch and > > > > > > > > > > > > > > > > > let me > > > > > > > > > > > > > > > > > know if it > > > > > > > > > > > > > > > > > makes sense. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -Youzhong > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > Chuck Lever > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > Jeff Layton <jlayton@poochiereds.net> > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > Jeff Layton <jlayton@poochiereds.net> > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > > > > > > > -- > > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > > > -- > > > > Jeff Layton <jlayton@kernel.org> > > > > > > > -- > > Jeff Layton <jlayton@kernel.org>
No problem. Thanks. On Thu, Jul 11, 2024 at 10:59 AM Jeff Layton <jlayton@kernel.org> wrote: > > Great, thanks for testing them! Can we add a Tested-by: for you to the > patches? Consider posting your testcase as well. Maybe we can add it to > fstests or another testsuite. > > Thanks, > Jeff > > On Thu, 2024-07-11 at 10:54 -0400, Youzhong Yang wrote: > > I'd like to report that your patches + the "list_head nf_gc" patch > > together have been tested under heavy nfs load generated by machines > > in our testing farm, no leak has been observed. > > > > Surely my reproducer can no longer reproduce the issue once the > > patches are applied. > > > > FYI here is the reproducer: > > https://github.com/youzhongyang/nfsd-file-leaks > > > > On Wed, Jul 10, 2024 at 4:22 PM Jeff Layton <jlayton@kernel.org> > > wrote: > > > > > > Great! If you're able to test the patches I sent earlier today, > > > please > > > let us know if they help with your reproducer. > > > > > > > > > On Wed, 2024-07-10 at 10:49 -0400, Youzhong Yang wrote: > > > > Thanks Jeff. Just sent out a separate e-mail. > > > > > > > > I have a simple reproducer now, it's a C program, I will upload > > > > it to > > > > a github repository together with a readme file and then post the > > > > link > > > > (if allowed) here. > > > > > > > > On Wed, Jul 10, 2024 at 9:40 AM Jeff Layton <jlayton@kernel.org> > > > > wrote: > > > > > > > > > > Can you send it as a separate [PATCH] email? I think your MUA > > > > > is > > > > > mangling the patches as I couldn't apply it earlier. Either use > > > > > git- > > > > > send-email or have a look over this page: > > > > > > > > > > > > > > > > > > > > https://www.kernel.org/doc/html/latest/process/email-clients.html#email-clients > > > > > > > > > > Thanks! > > > > > Jeff > > > > > > > > > > On Wed, 2024-07-10 at 09:33 -0400, Youzhong Yang wrote: > > > > > > Thanks. Here it is again: > > > > > > > > > > > > commit c6e69cebc052cb82d91fc81a98aee20749fe8d47 > > > > > > Author: Youzhong Yang <youzhong@gmail.com> > > > > > > Date: Thu Jul 4 11:25:40 2024 -0400 > > > > > > > > > > > > nfsd: fix nfsd_file leaking due to mixed use of nf- > > > > > > >nf_lru > > > > > > > > > > > > nfsd_file_put() in one thread can race with another > > > > > > thread doing > > > > > > garbage collection (running nfsd_file_gc() -> > > > > > > list_lru_walk() -> > > > > > > nfsd_file_lru_cb()): > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to > > > > > > do > > > > > > nfsd_file_lru_add(). > > > > > > * nfsd_file_lru_add() returns true (with > > > > > > NFSD_FILE_REFERENCED > > > > > > bit set) > > > > > > * garbage collector kicks in, nfsd_file_lru_cb() clears > > > > > > REFERENCED bit and > > > > > > returns LRU_ROTATE. > > > > > > * garbage collector kicks in again, nfsd_file_lru_cb() > > > > > > now > > > > > > decrements nf->nf_ref > > > > > > to 0, runs nfsd_file_unhash(), removes it from the > > > > > > LRU and > > > > > > adds to the dispose > > > > > > list [list_lru_isolate_move(lru, &nf->nf_lru, head)] > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is > > > > > > cleared, so > > > > > > it > > > > > > tries to remove > > > > > > the 'nf' from the LRU [if > > > > > > (!nfsd_file_lru_remove(nf))]. The > > > > > > 'nf' has been added > > > > > > to the 'dispose' list by nfsd_file_lru_cb(), so > > > > > > nfsd_file_lru_remove(nf) simply > > > > > > treats it as part of the LRU and removes it, which > > > > > > leads to > > > > > > its removal from > > > > > > the 'dispose' list. > > > > > > * At this moment, 'nf' is unhashed with its nf_ref > > > > > > being 0, and > > > > > > not on the LRU. > > > > > > nfsd_file_put() continues its execution [if > > > > > > (refcount_dec_and_test(&nf->nf_ref))], > > > > > > as nf->nf_ref is already 0, nf->nf_ref is set to > > > > > > REFCOUNT_SATURATED, and the 'nf' > > > > > > gets no chance of being freed. > > > > > > > > > > > > nfsd_file_put() can also race with > > > > > > nfsd_file_cond_queue(): > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so it tries to > > > > > > do > > > > > > nfsd_file_lru_add(). > > > > > > * nfsd_file_lru_add() sets REFERENCED bit and returns > > > > > > true. > > > > > > * Some userland application runs 'exportfs -f' or > > > > > > something > > > > > > like > > > > > > that, which triggers > > > > > > __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). > > > > > > * In nfsd_file_cond_queue(), it runs [if > > > > > > (!nfsd_file_unhash(nf))], unhash is done > > > > > > successfully. > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > (!nfsd_file_get(nf))], now > > > > > > nf->nf_ref goes to 2. > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > (nfsd_file_lru_remove(nf))], > > > > > > it succeeds. > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > (refcount_sub_and_test(decrement, &nf->nf_ref))] > > > > > > (with "decrement" being 2), so the nf->nf_ref goes to > > > > > > 0, the > > > > > > 'nf' is added to the > > > > > > dispose list [list_add(&nf->nf_lru, dispose)] > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED bit is > > > > > > cleared, so > > > > > > it > > > > > > tries to remove > > > > > > the 'nf' from the LRU [if > > > > > > (!nfsd_file_lru_remove(nf))], > > > > > > although the 'nf' is not > > > > > > in the LRU, but it is linked in the 'dispose' list, > > > > > > nfsd_file_lru_remove() simply > > > > > > treats it as part of the LRU and removes it. This > > > > > > leads to > > > > > > its > > > > > > removal from > > > > > > the 'dispose' list! > > > > > > * Now nf->ref is 0, unhashed. nfsd_file_put() continues > > > > > > its > > > > > > execution and set > > > > > > nf->nf_ref to REFCOUNT_SATURATED. > > > > > > > > > > > > As shown in the above analysis, using nf_lru for both the > > > > > > LRU > > > > > > list > > > > > > and dispose list > > > > > > can cause the leaks. This patch adds a new list_head > > > > > > nf_gc in > > > > > > struct nfsd_file, and uses > > > > > > it for the dispose list. > > > > > > > > > > > > Signed-off-by: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > > > > index ad9083ca144b..22ebd7fb8639 100644 > > > > > > --- a/fs/nfsd/filecache.c > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, struct > > > > > > inode > > > > > > *inode, unsigned char need, > > > > > > return NULL; > > > > > > > > > > > > INIT_LIST_HEAD(&nf->nf_lru); > > > > > > + INIT_LIST_HEAD(&nf->nf_gc); > > > > > > nf->nf_birthtime = ktime_get(); > > > > > > nf->nf_file = NULL; > > > > > > nf->nf_cred = get_current_cred(); > > > > > > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct list_head > > > > > > *dispose) > > > > > > struct nfsd_file *nf; > > > > > > > > > > > > while (!list_empty(dispose)) { > > > > > > - nf = list_first_entry(dispose, struct > > > > > > nfsd_file, > > > > > > nf_lru); > > > > > > - list_del_init(&nf->nf_lru); > > > > > > + nf = list_first_entry(dispose, struct > > > > > > nfsd_file, > > > > > > nf_gc); > > > > > > + list_del_init(&nf->nf_gc); > > > > > > nfsd_file_free(nf); > > > > > > } > > > > > > } > > > > > > @@ -411,12 +412,12 @@ nfsd_file_dispose_list_delayed(struct > > > > > > list_head > > > > > > *dispose) > > > > > > { > > > > > > while(!list_empty(dispose)) { > > > > > > struct nfsd_file *nf = > > > > > > list_first_entry(dispose, > > > > > > - struct > > > > > > nfsd_file, > > > > > > nf_lru); > > > > > > + struct > > > > > > nfsd_file, > > > > > > nf_gc); > > > > > > struct nfsd_net *nn = net_generic(nf->nf_net, > > > > > > nfsd_net_id); > > > > > > struct nfsd_fcache_disposal *l = nn- > > > > > > >fcache_disposal; > > > > > > > > > > > > spin_lock(&l->lock); > > > > > > - list_move_tail(&nf->nf_lru, &l->freeme); > > > > > > + list_move_tail(&nf->nf_gc, &l->freeme); > > > > > > spin_unlock(&l->lock); > > > > > > svc_wake_up(nn->nfsd_serv); > > > > > > } > > > > > > @@ -503,7 +504,8 @@ nfsd_file_lru_cb(struct list_head *item, > > > > > > struct > > > > > > list_lru_one *lru, > > > > > > > > > > > > /* Refcount went to zero. Unhash it and queue it to > > > > > > the > > > > > > dispose list */ > > > > > > nfsd_file_unhash(nf); > > > > > > - list_lru_isolate_move(lru, &nf->nf_lru, head); > > > > > > + list_lru_isolate(lru, &nf->nf_lru); > > > > > > + list_add(&nf->nf_gc, head); > > > > > > this_cpu_inc(nfsd_file_evictions); > > > > > > trace_nfsd_file_gc_disposed(nf); > > > > > > return LRU_REMOVED; > > > > > > @@ -578,7 +580,7 @@ nfsd_file_cond_queue(struct nfsd_file > > > > > > *nf, struct > > > > > > list_head *dispose) > > > > > > > > > > > > /* If refcount goes to 0, then put on the dispose > > > > > > list */ > > > > > > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > > > > > > - list_add(&nf->nf_lru, dispose); > > > > > > + list_add(&nf->nf_gc, dispose); > > > > > > trace_nfsd_file_closing(nf); > > > > > > } > > > > > > } > > > > > > @@ -654,8 +656,8 @@ nfsd_file_close_inode_sync(struct inode > > > > > > *inode) > > > > > > > > > > > > nfsd_file_queue_for_close(inode, &dispose); > > > > > > while (!list_empty(&dispose)) { > > > > > > - nf = list_first_entry(&dispose, struct > > > > > > nfsd_file, > > > > > > nf_lru); > > > > > > - list_del_init(&nf->nf_lru); > > > > > > + nf = list_first_entry(&dispose, struct > > > > > > nfsd_file, > > > > > > nf_gc); > > > > > > + list_del_init(&nf->nf_gc); > > > > > > nfsd_file_free(nf); > > > > > > } > > > > > > } > > > > > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h > > > > > > index c61884def906..3fbec24eea6c 100644 > > > > > > --- a/fs/nfsd/filecache.h > > > > > > +++ b/fs/nfsd/filecache.h > > > > > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > > > > > > > > > struct nfsd_file_mark *nf_mark; > > > > > > struct list_head nf_lru; > > > > > > + struct list_head nf_gc; > > > > > > struct rcu_head nf_rcu; > > > > > > ktime_t nf_birthtime; > > > > > > }; > > > > > > > > > > > > On Wed, Jul 10, 2024 at 8:49 AM Jeff Layton > > > > > > <jlayton@kernel.org> > > > > > > wrote: > > > > > > > > > > > > > > Ok, thanks. I went crawling over the filecache code, and > > > > > > > found a > > > > > > > couple > > > > > > > of potential nfsd_file leaks. I'm testing patches now and > > > > > > > will > > > > > > > likely > > > > > > > send them along later today. > > > > > > > > > > > > > > I think we do still want the patch to add nf_gc list_head. > > > > > > > I'm not > > > > > > > certain there is a race there, but it's safer to just use a > > > > > > > separate > > > > > > > list_head and it doesn't represent a lot of memory. > > > > > > > > > > > > > > Could you post that one individually so that Chuck can pick > > > > > > > it up? > > > > > > > > > > > > > > Thanks, > > > > > > > Jeff > > > > > > > > > > > > > > On Tue, 2024-07-09 at 15:13 -0400, Youzhong Yang wrote: > > > > > > > > It's not on the LRU: > > > > > > > > > > > > > > > > crash> struct > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_gc > > > > > > > > ffff898107b95980 > > > > > > > > nf_flags = 12, > > > > > > > > nf_ref.refs.counter = 1 > > > > > > > > nf_lru = { > > > > > > > > next = 0xffff898107b959c8, > > > > > > > > prev = 0xffff898107b959c8 > > > > > > > > }, > > > > > > > > nf_gc = { > > > > > > > > next = 0xffff898107b959d8, > > > > > > > > prev = 0xffff898107b959d8 > > > > > > > > }, > > > > > > > > > > > > > > > > I don't have an easy reproducer now as I am having > > > > > > > > difficulty > > > > > > > > generating tons of NFS read/write ops using a limited > > > > > > > > number of > > > > > > > > nfs > > > > > > > > clients. I use our in-house testing farm to test the > > > > > > > > patch. > > > > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > > > On Tue, Jul 9, 2024 at 3:05 PM Jeff Layton > > > > > > > > <jlayton@kernel.org> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > On Tue, 2024-07-09 at 14:37 -0400, Youzhong Yang wrote: > > > > > > > > > > Thanks Jeff. > > > > > > > > > > > > > > > > > > > > Unfortunately the early unhash easily leads to leaks: > > > > > > > > > > > > > > > > > > > > crash> kmem -S nfsd_file | grep '\[ffff' | sed -e > > > > > > > > > > 's|\[||' -e > > > > > > > > > > 's|\]||' > > > > > > > > > > > xargs -i echo struct > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > > > > '{}' > > > > > > > > > > > /var/tmp/nfsd_files > > > > > > > > > > crash> !wc -l /var/tmp/nfsd_files > > > > > > > > > > 19 /var/tmp/nfsd_files > > > > > > > > > > crash> < /var/tmp/nfsd_files > > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > > > ffff88865c778900 > > > > > > > > > > nf_flags = 8, > > > > > > > > > > nf_ref.refs.counter = 2 > > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > > > ffff88865c778cc0 > > > > > > > > > > nf_flags = 8, > > > > > > > > > > nf_ref.refs.counter = 3 > > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > > > ffff8885d5f35e00 > > > > > > > > > > nf_flags = 8, > > > > > > > > > > nf_ref.refs.counter = 1 > > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > > > ffff88817443e780 > > > > > > > > > > nf_flags = 8, > > > > > > > > > > nf_ref.refs.counter = 3 > > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > > > ffff88818b3f0600 > > > > > > > > > > nf_flags = 8, > > > > > > > > > > nf_ref.refs.counter = 2 > > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > > > ffff88a4490f8300 > > > > > > > > > > nf_flags = 8, > > > > > > > > > > nf_ref.refs.counter = 1 > > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > > > ffff88a0dab183c0 > > > > > > > > > > nf_flags = 8, > > > > > > > > > > nf_ref.refs.counter = 40 > > > > > > > > > > > > > > > > > > That's a lot of references! Might be interesting to > > > > > > > > > look more > > > > > > > > > closely > > > > > > > > > at that one, but the refcounts are all over the place, > > > > > > > > > so it > > > > > > > > > really > > > > > > > > > does look like we just have a refcount leak somewhere. > > > > > > > > > > > > > > > > > > > ... > > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > > > ffff89209535f200 > > > > > > > > > > nf_flags = 8, > > > > > > > > > > nf_ref.refs.counter = 2 > > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > > > ffff8980e15138c0 > > > > > > > > > > nf_flags = 8, > > > > > > > > > > nf_ref.refs.counter = 7 > > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > > > ffff898107b95800 > > > > > > > > > > nf_flags = 8, > > > > > > > > > > nf_ref.refs.counter = 3 > > > > > > > > > > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter > > > > > > > > > > ffff898107b95980 > > > > > > > > > > nf_flags = 12, > > > > > > > > > > > > > > > > > > The others are NFSD_FILE_GC. This one is also > > > > > > > > > NFSD_FILE_REFERENCED. Are > > > > > > > > > these objects on the LRU? > > > > > > > > > > > > > > > > > > > nf_ref.refs.counter = 1 > > > > > > > > > > > > > > > > > > > > nfsd_file_do_acquire() -> nfsd_file_lookup_locked() > > > > > > > > > > relies on > > > > > > > > > > the > > > > > > > > > > hash > > > > > > > > > > table to find the nfsd_file, > > > > > > > > > > but I am still scratching my head why and how this > > > > > > > > > > happens. > > > > > > > > > > > > > > > > > > > > FYI, here is the patch I applied for testing: > > > > > > > > > > > > > > > > > > > > > > > > > > > > The above suggests to me that there is a garden-variety > > > > > > > > > refcount leak > > > > > > > > > somewhere. Whenever some bit of the code takes a > > > > > > > > > reference to > > > > > > > > > an object > > > > > > > > > (like a nfsd_file), it's implicitly required to > > > > > > > > > eventually put > > > > > > > > > that > > > > > > > > > reference. Typically that means that the code needs to > > > > > > > > > maintain > > > > > > > > > a > > > > > > > > > pointer to that object, as it can be unhashed at any > > > > > > > > > time and > > > > > > > > > it can't > > > > > > > > > rely on finding the same object later. > > > > > > > > > > > > > > > > > > Do you have a reproducer for this? > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c > > > > > > > > > > b/fs/nfsd/filecache.c > > > > > > > > > > index ad9083ca144b..22ebd7fb8639 100644 > > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct net *net, > > > > > > > > > > struct > > > > > > > > > > inode > > > > > > > > > > *inode, unsigned char need, > > > > > > > > > > return NULL; > > > > > > > > > > > > > > > > > > > > INIT_LIST_HEAD(&nf->nf_lru); > > > > > > > > > > + INIT_LIST_HEAD(&nf->nf_gc); > > > > > > > > > > nf->nf_birthtime = ktime_get(); > > > > > > > > > > nf->nf_file = NULL; > > > > > > > > > > nf->nf_cred = get_current_cred(); > > > > > > > > > > @@ -393,8 +394,8 @@ nfsd_file_dispose_list(struct > > > > > > > > > > list_head > > > > > > > > > > *dispose) > > > > > > > > > > struct nfsd_file *nf; > > > > > > > > > > > > > > > > > > > > while (!list_empty(dispose)) { > > > > > > > > > > - nf = list_first_entry(dispose, struct > > > > > > > > > > nfsd_file, > > > > > > > > > > nf_lru); > > > > > > > > > > - list_del_init(&nf->nf_lru); > > > > > > > > > > + nf = list_first_entry(dispose, struct > > > > > > > > > > nfsd_file, > > > > > > > > > > nf_gc); > > > > > > > > > > + list_del_init(&nf->nf_gc); > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > } > > > > > > > > > > } > > > > > > > > > > @@ -411,12 +412,12 @@ > > > > > > > > > > nfsd_file_dispose_list_delayed(struct > > > > > > > > > > list_head > > > > > > > > > > *dispose) > > > > > > > > > > { > > > > > > > > > > while(!list_empty(dispose)) { > > > > > > > > > > struct nfsd_file *nf = > > > > > > > > > > list_first_entry(dispose, > > > > > > > > > > - > > > > > > > > > > struct > > > > > > > > > > nfsd_file, > > > > > > > > > > nf_lru); > > > > > > > > > > + > > > > > > > > > > struct > > > > > > > > > > nfsd_file, > > > > > > > > > > nf_gc); > > > > > > > > > > struct nfsd_net *nn = net_generic(nf- > > > > > > > > > > >nf_net, > > > > > > > > > > nfsd_net_id); > > > > > > > > > > struct nfsd_fcache_disposal *l = nn- > > > > > > > > > > > fcache_disposal; > > > > > > > > > > > > > > > > > > > > spin_lock(&l->lock); > > > > > > > > > > - list_move_tail(&nf->nf_lru, &l- > > > > > > > > > > >freeme); > > > > > > > > > > + list_move_tail(&nf->nf_gc, &l- > > > > > > > > > > >freeme); > > > > > > > > > > spin_unlock(&l->lock); > > > > > > > > > > svc_wake_up(nn->nfsd_serv); > > > > > > > > > > } > > > > > > > > > > @@ -503,7 +504,8 @@ nfsd_file_lru_cb(struct list_head > > > > > > > > > > *item, > > > > > > > > > > struct > > > > > > > > > > list_lru_one *lru, > > > > > > > > > > > > > > > > > > > > /* Refcount went to zero. Unhash it and queue > > > > > > > > > > it to > > > > > > > > > > the > > > > > > > > > > dispose list */ > > > > > > > > > > nfsd_file_unhash(nf); > > > > > > > > > > - list_lru_isolate_move(lru, &nf->nf_lru, > > > > > > > > > > head); > > > > > > > > > > + list_lru_isolate(lru, &nf->nf_lru); > > > > > > > > > > + list_add(&nf->nf_gc, head); > > > > > > > > > > this_cpu_inc(nfsd_file_evictions); > > > > > > > > > > trace_nfsd_file_gc_disposed(nf); > > > > > > > > > > return LRU_REMOVED; > > > > > > > > > > @@ -578,7 +580,7 @@ nfsd_file_cond_queue(struct > > > > > > > > > > nfsd_file > > > > > > > > > > *nf, struct > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > > > > > > > /* If refcount goes to 0, then put on the > > > > > > > > > > dispose > > > > > > > > > > list */ > > > > > > > > > > if (refcount_sub_and_test(decrement, &nf- > > > > > > > > > > >nf_ref)) { > > > > > > > > > > - list_add(&nf->nf_lru, dispose); > > > > > > > > > > + list_add(&nf->nf_gc, dispose); > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > > } > > > > > > > > > > } > > > > > > > > > > @@ -654,8 +656,8 @@ nfsd_file_close_inode_sync(struct > > > > > > > > > > inode > > > > > > > > > > *inode) > > > > > > > > > > > > > > > > > > > > nfsd_file_queue_for_close(inode, &dispose); > > > > > > > > > > while (!list_empty(&dispose)) { > > > > > > > > > > - nf = list_first_entry(&dispose, > > > > > > > > > > struct > > > > > > > > > > nfsd_file, > > > > > > > > > > nf_lru); > > > > > > > > > > - list_del_init(&nf->nf_lru); > > > > > > > > > > + nf = list_first_entry(&dispose, > > > > > > > > > > struct > > > > > > > > > > nfsd_file, > > > > > > > > > > nf_gc); > > > > > > > > > > + list_del_init(&nf->nf_gc); > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > } > > > > > > > > > > } > > > > > > > > > > diff --git a/fs/nfsd/filecache.h > > > > > > > > > > b/fs/nfsd/filecache.h > > > > > > > > > > index c61884def906..3fbec24eea6c 100644 > > > > > > > > > > --- a/fs/nfsd/filecache.h > > > > > > > > > > +++ b/fs/nfsd/filecache.h > > > > > > > > > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > > > > > > > > > > > > > > > > > struct nfsd_file_mark *nf_mark; > > > > > > > > > > struct list_head nf_lru; > > > > > > > > > > + struct list_head nf_gc; > > > > > > > > > > struct rcu_head nf_rcu; > > > > > > > > > > ktime_t nf_birthtime; > > > > > > > > > > }; > > > > > > > > > > > > > > > > > > > > On Mon, Jul 8, 2024 at 10:55 AM Jeff Layton > > > > > > > > > > <jlayton@kernel.org> > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > On Mon, 2024-07-08 at 10:23 -0400, Youzhong Yang > > > > > > > > > > > wrote: > > > > > > > > > > > > Thanks Jeff. > > > > > > > > > > > > > > > > > > > > > > > > I am ok with reverting the unhash/dispose list > > > > > > > > > > > > reordering > > > > > > > > > > > > in > > > > > > > > > > > > nfsd_file_lru_cb(), as it doesn't make much of a > > > > > > > > > > > > difference, > > > > > > > > > > > > but for nfsd_file_cond_queue(), imagining this: > > > > > > > > > > > > > > > > > > > > > > > > - A nfsd_file is hashed > > > > > > > > > > > > - In nfsd_file_cond_queue(), [if > > > > > > > > > > > > (!nfsd_file_unhash(nf))] > > > > > > > > > > > > will > > > > > > > > > > > > get it > > > > > > > > > > > > unhashed, doesn't it? > > > > > > > > > > > > - It continues to get a reference by > > > > > > > > > > > > nfsd_file_get() > > > > > > > > > > > > - It continues to remove itself from LRU by > > > > > > > > > > > > nfsd_file_lru_remove() if > > > > > > > > > > > > it is on the LRU. > > > > > > > > > > > > - Now it runs refcount_sub_and_test(), what > > > > > > > > > > > > happens if > > > > > > > > > > > > the refcnt > > > > > > > > > > > > does > > > > > > > > > > > > not go to 0? How can this nfsd_file be found > > > > > > > > > > > > again? > > > > > > > > > > > > Through the > > > > > > > > > > > > hash > > > > > > > > > > > > table? Through the LRU walk? how? > > > > > > > > > > > > > > > > > > > > > > > > Thanks again. > > > > > > > > > > > > > > > > > > > > > > > > -Youzhong > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > It won't need to be found again. The holders of the > > > > > > > > > > > extra > > > > > > > > > > > references > > > > > > > > > > > will put those references when they are finished. > > > > > > > > > > > Since the > > > > > > > > > > > object > > > > > > > > > > > is > > > > > > > > > > > no longer HASHED, nfsd_file_put just does this: > > > > > > > > > > > > > > > > > > > > > > if (refcount_dec_and_test(&nf->nf_ref)) > > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > > > > > > > > > > > So that should be fine. > > > > > > > > > > > > > > > > > > > > > > > On Mon, Jul 8, 2024 at 9:35 AM Jeff Layton > > > > > > > > > > > > <jlayton@kernel.org> > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > On Mon, 2024-07-08 at 08:58 -0400, Youzhong > > > > > > > > > > > > > Yang wrote: > > > > > > > > > > > > > > Thank you Jeff for your invaluable insights. > > > > > > > > > > > > > > I was > > > > > > > > > > > > > > leaning > > > > > > > > > > > > > > towards > > > > > > > > > > > > > > adding a new list_head too, and tested this > > > > > > > > > > > > > > approach > > > > > > > > > > > > > > on > > > > > > > > > > > > > > kernel > > > > > > > > > > > > > > 6.6 by > > > > > > > > > > > > > > continuously hammering the server with heavy > > > > > > > > > > > > > > nfs load > > > > > > > > > > > > > > for the > > > > > > > > > > > > > > last few > > > > > > > > > > > > > > days, not a single leak. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Here goes the patch (based on Linux kernel > > > > > > > > > > > > > > master > > > > > > > > > > > > > > branch), > > > > > > > > > > > > > > please > > > > > > > > > > > > > > review: > > > > > > > > > > > > > > > > > > > > > > > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > > > > > > Date: Thu, 4 Jul 2024 11:25:40 -0400 > > > > > > > > > > > > > > Subject: [PATCH] nfsd: fix nfsd_file leaking > > > > > > > > > > > > > > due to > > > > > > > > > > > > > > mixed use > > > > > > > > > > > > > > of > > > > > > > > > > > > > > nf->nf_lru > > > > > > > > > > > > > > > > > > > > > > > > > > > > nfsd_file_put() in one thread can race with > > > > > > > > > > > > > > another > > > > > > > > > > > > > > thread > > > > > > > > > > > > > > doing > > > > > > > > > > > > > > garbage collection (running nfsd_file_gc() -> > > > > > > > > > > > > > > list_lru_walk() > > > > > > > > > > > > > > -> > > > > > > > > > > > > > > nfsd_file_lru_cb()): > > > > > > > > > > > > > > > > > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so > > > > > > > > > > > > > > it tries > > > > > > > > > > > > > > to do > > > > > > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > > > > > > * nfsd_file_lru_add() returns true (with > > > > > > > > > > > > > > NFSD_FILE_REFERENCED > > > > > > > > > > > > > > bit set) > > > > > > > > > > > > > > * garbage collector kicks in, > > > > > > > > > > > > > > nfsd_file_lru_cb() > > > > > > > > > > > > > > clears > > > > > > > > > > > > > > REFERENCED bit and > > > > > > > > > > > > > > returns LRU_ROTATE. > > > > > > > > > > > > > > * garbage collector kicks in again, > > > > > > > > > > > > > > nfsd_file_lru_cb() now > > > > > > > > > > > > > > decrements nf->nf_ref > > > > > > > > > > > > > > to 0, runs nfsd_file_unhash(), removes it > > > > > > > > > > > > > > from > > > > > > > > > > > > > > the LRU > > > > > > > > > > > > > > and > > > > > > > > > > > > > > adds to > > > > > > > > > > > > > > the dispose > > > > > > > > > > > > > > list [list_lru_isolate_move(lru, &nf- > > > > > > > > > > > > > > >nf_lru, > > > > > > > > > > > > > > head)] > > > > > > > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED > > > > > > > > > > > > > > bit is > > > > > > > > > > > > > > cleared, > > > > > > > > > > > > > > so > > > > > > > > > > > > > > it > > > > > > > > > > > > > > tries to remove > > > > > > > > > > > > > > the 'nf' from the LRU [if > > > > > > > > > > > > > > (!nfsd_file_lru_remove(nf))]. > > > > > > > > > > > > > > The > > > > > > > > > > > > > > 'nf' > > > > > > > > > > > > > > has been added > > > > > > > > > > > > > > to the 'dispose' list by > > > > > > > > > > > > > > nfsd_file_lru_cb(), so > > > > > > > > > > > > > > nfsd_file_lru_remove(nf) simply > > > > > > > > > > > > > > treats it as part of the LRU and removes > > > > > > > > > > > > > > it, > > > > > > > > > > > > > > which leads > > > > > > > > > > > > > > to > > > > > > > > > > > > > > its removal from > > > > > > > > > > > > > > the 'dispose' list. > > > > > > > > > > > > > > * At this moment, 'nf' is unhashed with its > > > > > > > > > > > > > > nf_ref > > > > > > > > > > > > > > being 0, > > > > > > > > > > > > > > and > > > > > > > > > > > > > > not > > > > > > > > > > > > > > on the LRU. > > > > > > > > > > > > > > nfsd_file_put() continues its execution > > > > > > > > > > > > > > [if > > > > > > > > > > > > > > (refcount_dec_and_test(&nf->nf_ref))], > > > > > > > > > > > > > > as nf->nf_ref is already 0, nf->nf_ref is > > > > > > > > > > > > > > set to > > > > > > > > > > > > > > REFCOUNT_SATURATED, and the 'nf' > > > > > > > > > > > > > > gets no chance of being freed. > > > > > > > > > > > > > > > > > > > > > > > > > > > > nfsd_file_put() can also race with > > > > > > > > > > > > > > nfsd_file_cond_queue(): > > > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, so > > > > > > > > > > > > > > it tries > > > > > > > > > > > > > > to do > > > > > > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > > > > > > * nfsd_file_lru_add() sets REFERENCED bit > > > > > > > > > > > > > > and > > > > > > > > > > > > > > returns true. > > > > > > > > > > > > > > * Some userland application runs 'exportfs > > > > > > > > > > > > > > -f' or > > > > > > > > > > > > > > something > > > > > > > > > > > > > > like > > > > > > > > > > > > > > that, which triggers > > > > > > > > > > > > > > __nfsd_file_cache_purge() -> > > > > > > > > > > > > > > nfsd_file_cond_queue(). > > > > > > > > > > > > > > * In nfsd_file_cond_queue(), it runs [if > > > > > > > > > > > > > > (!nfsd_file_unhash(nf))], > > > > > > > > > > > > > > unhash is done > > > > > > > > > > > > > > successfully. > > > > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > > > > (!nfsd_file_get(nf))], > > > > > > > > > > > > > > now > > > > > > > > > > > > > > nf->nf_ref goes to 2. > > > > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > > > > (nfsd_file_lru_remove(nf))], > > > > > > > > > > > > > > it succeeds. > > > > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > > > > (refcount_sub_and_test(decrement, > > > > > > > > > > > > > > &nf->nf_ref))] > > > > > > > > > > > > > > (with "decrement" being 2), so the nf- > > > > > > > > > > > > > > >nf_ref > > > > > > > > > > > > > > goes to 0, > > > > > > > > > > > > > > the > > > > > > > > > > > > > > 'nf' > > > > > > > > > > > > > > is added to the > > > > > > > > > > > > > > dispose list [list_add(&nf->nf_lru, > > > > > > > > > > > > > > dispose)] > > > > > > > > > > > > > > * nfsd_file_put() detects NFSD_FILE_HASHED > > > > > > > > > > > > > > bit is > > > > > > > > > > > > > > cleared, > > > > > > > > > > > > > > so > > > > > > > > > > > > > > it > > > > > > > > > > > > > > tries to remove > > > > > > > > > > > > > > the 'nf' from the LRU [if > > > > > > > > > > > > > > (!nfsd_file_lru_remove(nf))], > > > > > > > > > > > > > > although > > > > > > > > > > > > > > the 'nf' is not > > > > > > > > > > > > > > in the LRU, but it is linked in the > > > > > > > > > > > > > > 'dispose' > > > > > > > > > > > > > > list, > > > > > > > > > > > > > > nfsd_file_lru_remove() simply > > > > > > > > > > > > > > treats it as part of the LRU and removes > > > > > > > > > > > > > > it. This > > > > > > > > > > > > > > leads > > > > > > > > > > > > > > to > > > > > > > > > > > > > > its removal from > > > > > > > > > > > > > > the 'dispose' list! > > > > > > > > > > > > > > * Now nf->ref is 0, unhashed. > > > > > > > > > > > > > > nfsd_file_put() > > > > > > > > > > > > > > continues its > > > > > > > > > > > > > > execution and set > > > > > > > > > > > > > > nf->nf_ref to REFCOUNT_SATURATED. > > > > > > > > > > > > > > > > > > > > > > > > > > > > As shown in the above analysis, using nf_lru > > > > > > > > > > > > > > for both > > > > > > > > > > > > > > the LRU > > > > > > > > > > > > > > list and > > > > > > > > > > > > > > dispose list > > > > > > > > > > > > > > can cause the leaks. This patch adds a new > > > > > > > > > > > > > > list_head > > > > > > > > > > > > > > nf_gc in > > > > > > > > > > > > > > struct > > > > > > > > > > > > > > nfsd_file, and uses > > > > > > > > > > > > > > it for the dispose list. It's not expected to > > > > > > > > > > > > > > have a > > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > > unhashed but it's not > > > > > > > > > > > > > > added to the dispose list, so in > > > > > > > > > > > > > > nfsd_file_cond_queue() and > > > > > > > > > > > > > > nfsd_file_lru_cb() nfsd_file > > > > > > > > > > > > > > is unhashed after being added to the dispose > > > > > > > > > > > > > > list. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I don't see where we require the object to be > > > > > > > > > > > > > either > > > > > > > > > > > > > hashed or > > > > > > > > > > > > > on > > > > > > > > > > > > > the > > > > > > > > > > > > > dispose list. I think you probably just want > > > > > > > > > > > > > to do a > > > > > > > > > > > > > patch > > > > > > > > > > > > > that > > > > > > > > > > > > > changes the dispose list to use a dedicated > > > > > > > > > > > > > list_head > > > > > > > > > > > > > without > > > > > > > > > > > > > reordering when the these things are unhashed. > > > > > > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Youzhong Yang > > > > > > > > > > > > > > <youzhong@gmail.com> > > > > > > > > > > > > > > --- > > > > > > > > > > > > > > fs/nfsd/filecache.c | 23 ++++++++++++++----- > > > > > > > > > > > > > > ---- > > > > > > > > > > > > > > fs/nfsd/filecache.h | 1 + > > > > > > > > > > > > > > 2 files changed, 15 insertions(+), 9 > > > > > > > > > > > > > > deletions(-) > > > > > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c > > > > > > > > > > > > > > b/fs/nfsd/filecache.c > > > > > > > > > > > > > > index ad9083ca144b..3aef2ddfce94 100644 > > > > > > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > > > > > > @@ -216,6 +216,7 @@ nfsd_file_alloc(struct > > > > > > > > > > > > > > net *net, > > > > > > > > > > > > > > struct > > > > > > > > > > > > > > inode > > > > > > > > > > > > > > *inode, unsigned char need, > > > > > > > > > > > > > > return NULL; > > > > > > > > > > > > > > > > > > > > > > > > > > > > INIT_LIST_HEAD(&nf->nf_lru); > > > > > > > > > > > > > > + INIT_LIST_HEAD(&nf->nf_gc); > > > > > > > > > > > > > > nf->nf_birthtime = ktime_get(); > > > > > > > > > > > > > > nf->nf_file = NULL; > > > > > > > > > > > > > > nf->nf_cred = get_current_cred(); > > > > > > > > > > > > > > @@ -393,8 +394,8 @@ > > > > > > > > > > > > > > nfsd_file_dispose_list(struct > > > > > > > > > > > > > > list_head > > > > > > > > > > > > > > *dispose) > > > > > > > > > > > > > > struct nfsd_file *nf; > > > > > > > > > > > > > > > > > > > > > > > > > > > > while (!list_empty(dispose)) { > > > > > > > > > > > > > > - nf = > > > > > > > > > > > > > > list_first_entry(dispose, struct > > > > > > > > > > > > > > nfsd_file, > > > > > > > > > > > > > > nf_lru); > > > > > > > > > > > > > > - list_del_init(&nf->nf_lru); > > > > > > > > > > > > > > + nf = > > > > > > > > > > > > > > list_first_entry(dispose, struct > > > > > > > > > > > > > > nfsd_file, > > > > > > > > > > > > > > nf_gc); > > > > > > > > > > > > > > + list_del_init(&nf->nf_gc); > > > > > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > > > } > > > > > > > > > > > > > > } > > > > > > > > > > > > > > @@ -411,12 +412,12 @@ > > > > > > > > > > > > > > nfsd_file_dispose_list_delayed(struct > > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > { > > > > > > > > > > > > > > while(!list_empty(dispose)) { > > > > > > > > > > > > > > struct nfsd_file *nf = > > > > > > > > > > > > > > list_first_entry(dispose, > > > > > > > > > > > > > > - > > > > > > > > > > > > > > struct > > > > > > > > > > > > > > nfsd_file, > > > > > > > > > > > > > > nf_lru); > > > > > > > > > > > > > > + > > > > > > > > > > > > > > struct > > > > > > > > > > > > > > nfsd_file, > > > > > > > > > > > > > > nf_gc); > > > > > > > > > > > > > > struct nfsd_net *nn = > > > > > > > > > > > > > > net_generic(nf- > > > > > > > > > > > > > > > nf_net, > > > > > > > > > > > > > > nfsd_net_id); > > > > > > > > > > > > > > struct nfsd_fcache_disposal > > > > > > > > > > > > > > *l = nn- > > > > > > > > > > > > > > > fcache_disposal; > > > > > > > > > > > > > > > > > > > > > > > > > > > > spin_lock(&l->lock); > > > > > > > > > > > > > > - list_move_tail(&nf->nf_lru, > > > > > > > > > > > > > > &l- > > > > > > > > > > > > > > > freeme); > > > > > > > > > > > > > > + list_move_tail(&nf->nf_gc, > > > > > > > > > > > > > > &l- > > > > > > > > > > > > > > > freeme); > > > > > > > > > > > > > > spin_unlock(&l->lock); > > > > > > > > > > > > > > svc_wake_up(nn->nfsd_serv); > > > > > > > > > > > > > > } > > > > > > > > > > > > > > @@ -502,8 +503,10 @@ nfsd_file_lru_cb(struct > > > > > > > > > > > > > > list_head *item, > > > > > > > > > > > > > > struct > > > > > > > > > > > > > > list_lru_one *lru, > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > > > /* Refcount went to zero. Unhash it > > > > > > > > > > > > > > and queue > > > > > > > > > > > > > > it to > > > > > > > > > > > > > > the > > > > > > > > > > > > > > dispose list */ > > > > > > > > > > > > > > + list_lru_isolate(lru, &nf->nf_lru); > > > > > > > > > > > > > > + list_add(&nf->nf_gc, head); > > > > > > > > > > > > > > + /* Unhash after removing from LRU and > > > > > > > > > > > > > > adding > > > > > > > > > > > > > > to > > > > > > > > > > > > > > dispose > > > > > > > > > > > > > > list */ > > > > > > > > > > > > > > nfsd_file_unhash(nf); > > > > > > > > > > > > > > - list_lru_isolate_move(lru, &nf- > > > > > > > > > > > > > > >nf_lru, > > > > > > > > > > > > > > head); > > > > > > > > > > > > > > > > > > > > > > > > > > I don't see the point in reordering these > > > > > > > > > > > > > operations. > > > > > > > > > > > > > Hashing > > > > > > > > > > > > > is > > > > > > > > > > > > > all > > > > > > > > > > > > > about making the thing findable by nfsd > > > > > > > > > > > > > operations. The > > > > > > > > > > > > > _last_ > > > > > > > > > > > > > thing we > > > > > > > > > > > > > want to do is put it on the dispose list while > > > > > > > > > > > > > the > > > > > > > > > > > > > thing can > > > > > > > > > > > > > still > > > > > > > > > > > > > be > > > > > > > > > > > > > found by nfsd threads doing operations. > > > > > > > > > > > > > > > > > > > > > > > > > > > this_cpu_inc(nfsd_file_evictions); > > > > > > > > > > > > > > trace_nfsd_file_gc_disposed(nf); > > > > > > > > > > > > > > return LRU_REMOVED; > > > > > > > > > > > > > > @@ -565,7 +568,7 @@ > > > > > > > > > > > > > > nfsd_file_cond_queue(struct > > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > > *nf, > > > > > > > > > > > > > > struct > > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we raced with someone else > > > > > > > > > > > > > > unhashing, > > > > > > > > > > > > > > ignore it > > > > > > > > > > > > > > */ > > > > > > > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, &nf- > > > > > > > > > > > > > > > nf_flags)) > > > > > > > > > > > > > > > > > > > > > > > > > > The above change looks wrong. I don't think we > > > > > > > > > > > > > need to > > > > > > > > > > > > > change > > > > > > > > > > > > > this. > > > > > > > > > > > > > > > > > > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we can't get a reference, > > > > > > > > > > > > > > ignore it */ > > > > > > > > > > > > > > @@ -578,7 +581,9 @@ > > > > > > > > > > > > > > nfsd_file_cond_queue(struct > > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > > *nf, > > > > > > > > > > > > > > struct > > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > > > > > > > > > > > > > > > /* If refcount goes to 0, then put on > > > > > > > > > > > > > > the > > > > > > > > > > > > > > dispose > > > > > > > > > > > > > > list */ > > > > > > > > > > > > > > if (refcount_sub_and_test(decrement, > > > > > > > > > > > > > > &nf- > > > > > > > > > > > > > > > nf_ref)) { > > > > > > > > > > > > > > - list_add(&nf->nf_lru, > > > > > > > > > > > > > > dispose); > > > > > > > > > > > > > > + list_add(&nf->nf_gc, > > > > > > > > > > > > > > dispose); > > > > > > > > > > > > > > + /* Unhash after adding to > > > > > > > > > > > > > > dispose > > > > > > > > > > > > > > list */ > > > > > > > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > > > > > > > > > > > > > > > > > > This too looks wrong? Maybe I'm unclear on the > > > > > > > > > > > > > race > > > > > > > > > > > > > you're > > > > > > > > > > > > > trying > > > > > > > > > > > > > to > > > > > > > > > > > > > fix with this? What's the harm in unhashing it > > > > > > > > > > > > > early? > > > > > > > > > > > > > > > > > > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > > > > > > } > > > > > > > > > > > > > > } > > > > > > > > > > > > > > @@ -654,8 +659,8 @@ > > > > > > > > > > > > > > nfsd_file_close_inode_sync(struct > > > > > > > > > > > > > > inode > > > > > > > > > > > > > > *inode) > > > > > > > > > > > > > > > > > > > > > > > > > > > > nfsd_file_queue_for_close(inode, > > > > > > > > > > > > > > &dispose); > > > > > > > > > > > > > > while (!list_empty(&dispose)) { > > > > > > > > > > > > > > - nf = > > > > > > > > > > > > > > list_first_entry(&dispose, > > > > > > > > > > > > > > struct > > > > > > > > > > > > > > nfsd_file, > > > > > > > > > > > > > > nf_lru); > > > > > > > > > > > > > > - list_del_init(&nf->nf_lru); > > > > > > > > > > > > > > + nf = > > > > > > > > > > > > > > list_first_entry(&dispose, > > > > > > > > > > > > > > struct > > > > > > > > > > > > > > nfsd_file, > > > > > > > > > > > > > > nf_gc); > > > > > > > > > > > > > > + list_del_init(&nf->nf_gc); > > > > > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > > > } > > > > > > > > > > > > > > } > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.h > > > > > > > > > > > > > > b/fs/nfsd/filecache.h > > > > > > > > > > > > > > index c61884def906..3fbec24eea6c 100644 > > > > > > > > > > > > > > --- a/fs/nfsd/filecache.h > > > > > > > > > > > > > > +++ b/fs/nfsd/filecache.h > > > > > > > > > > > > > > @@ -44,6 +44,7 @@ struct nfsd_file { > > > > > > > > > > > > > > > > > > > > > > > > > > > > struct nfsd_file_mark *nf_mark; > > > > > > > > > > > > > > struct list_head nf_lru; > > > > > > > > > > > > > > + struct list_head nf_gc; > > > > > > > > > > > > > > struct rcu_head nf_rcu; > > > > > > > > > > > > > > ktime_t nf_birthtime; > > > > > > > > > > > > > > }; > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > 2.34.1 > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Jul 4, 2024 at 7:14 AM Jeff Layton > > > > > > > > > > > > > > <jlayton@poochiereds.net> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 2024-07-03 at 16:46 -0400, Youzhong > > > > > > > > > > > > > > > Yang > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > Thank you Chuck. Here are my quick > > > > > > > > > > > > > > > > answers to > > > > > > > > > > > > > > > > your > > > > > > > > > > > > > > > > comments: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - I don't have a quick reproducer. I > > > > > > > > > > > > > > > > reproduced > > > > > > > > > > > > > > > > it by > > > > > > > > > > > > > > > > using > > > > > > > > > > > > > > > > hundreds > > > > > > > > > > > > > > > > of nfs clients generating +600K ops under > > > > > > > > > > > > > > > > our > > > > > > > > > > > > > > > > workload in > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > testing > > > > > > > > > > > > > > > > environment. Theoretically it should be > > > > > > > > > > > > > > > > possible > > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > > simplify > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > reproduction but I am still working on > > > > > > > > > > > > > > > > it. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - I understand zfs is an out-of-tree > > > > > > > > > > > > > > > > file > > > > > > > > > > > > > > > > system. That's > > > > > > > > > > > > > > > > fine. But > > > > > > > > > > > > > > > > this leaking can happen to any file > > > > > > > > > > > > > > > > system, and > > > > > > > > > > > > > > > > leaking > > > > > > > > > > > > > > > > is > > > > > > > > > > > > > > > > not a good > > > > > > > > > > > > > > > > thing no matter what file system it is. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - I will try to come up with a > > > > > > > > > > > > > > > > reproducer using > > > > > > > > > > > > > > > > xfs or > > > > > > > > > > > > > > > > btrfs > > > > > > > > > > > > > > > > if possible. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Now back to the problem itself, here are > > > > > > > > > > > > > > > > my > > > > > > > > > > > > > > > > findings: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - nfsd_file_put() in one thread can race > > > > > > > > > > > > > > > > with > > > > > > > > > > > > > > > > another > > > > > > > > > > > > > > > > thread > > > > > > > > > > > > > > > > doing > > > > > > > > > > > > > > > > garbage collection (running > > > > > > > > > > > > > > > > nfsd_file_gc() -> > > > > > > > > > > > > > > > > list_lru_walk() > > > > > > > > > > > > > > > > -> > > > > > > > > > > > > > > > > nfsd_file_lru_cb()): > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, > > > > > > > > > > > > > > > > so it > > > > > > > > > > > > > > > > tries to > > > > > > > > > > > > > > > > do > > > > > > > > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > > > > > > > > * nfsd_file_lru_add() returns true > > > > > > > > > > > > > > > > (thus > > > > > > > > > > > > > > > > NFSD_FILE_REFERENCED bit > > > > > > > > > > > > > > > > set for nf->nf_flags) > > > > > > > > > > > > > > > > * garbage collector kicks in, > > > > > > > > > > > > > > > > nfsd_file_lru_cb() clears > > > > > > > > > > > > > > > > REFERENCED > > > > > > > > > > > > > > > > bit and returns LRU_ROTATE. > > > > > > > > > > > > > > > > * garbage collector kicks in again, > > > > > > > > > > > > > > > > nfsd_file_lru_cb() > > > > > > > > > > > > > > > > now > > > > > > > > > > > > > > > > decrements nf->nf_ref to 0, runs > > > > > > > > > > > > > > > > nfsd_file_unhash(), > > > > > > > > > > > > > > > > removes > > > > > > > > > > > > > > > > it from > > > > > > > > > > > > > > > > the LRU and adds to the dispose list > > > > > > > > > > > > > > > > [list_lru_isolate_move(lru, > > > > > > > > > > > > > > > > &nf->nf_lru, head);] > > > > > > > > > > > > > > > > * nfsd_file_put() detects > > > > > > > > > > > > > > > > NFSD_FILE_HASHED bit > > > > > > > > > > > > > > > > is > > > > > > > > > > > > > > > > cleared, > > > > > > > > > > > > > > > > so it > > > > > > > > > > > > > > > > tries to remove the 'nf' from the LRU [if > > > > > > > > > > > > > > > > (!nfsd_file_lru_remove(nf))] > > > > > > > > > > > > > > > > * The 'nf' has been added to the > > > > > > > > > > > > > > > > 'dispose' list > > > > > > > > > > > > > > > > by > > > > > > > > > > > > > > > > nfsd_file_lru_cb(), so > > > > > > > > > > > > > > > > nfsd_file_lru_remove(nf) > > > > > > > > > > > > > > > > simply > > > > > > > > > > > > > > > > treats > > > > > > > > > > > > > > > > it as > > > > > > > > > > > > > > > > part of the LRU and removes it, which > > > > > > > > > > > > > > > > leads it to > > > > > > > > > > > > > > > > be > > > > > > > > > > > > > > > > removed > > > > > > > > > > > > > > > > from the > > > > > > > > > > > > > > > > 'dispose' list. > > > > > > > > > > > > > > > > * At this moment, nf->nf_ref is 0, it's > > > > > > > > > > > > > > > > unhashed, and > > > > > > > > > > > > > > > > not > > > > > > > > > > > > > > > > on the > > > > > > > > > > > > > > > > LRU. nfsd_file_put() continues its > > > > > > > > > > > > > > > > execution [if > > > > > > > > > > > > > > > > (refcount_dec_and_test(&nf->nf_ref))], as > > > > > > > > > > > > > > > > nf- > > > > > > > > > > > > > > > > > nf_ref is > > > > > > > > > > > > > > > > already 0, now > > > > > > > > > > > > > > > > bad thing happens: nf->nf_ref is set to > > > > > > > > > > > > > > > > REFCOUNT_SATURATED, > > > > > > > > > > > > > > > > and the > > > > > > > > > > > > > > > > 'nf' is leaked. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > To make this happen, the right timing is > > > > > > > > > > > > > > > > crucial. > > > > > > > > > > > > > > > > It can > > > > > > > > > > > > > > > > be > > > > > > > > > > > > > > > > reproduced > > > > > > > > > > > > > > > > by adding artifical delays in > > > > > > > > > > > > > > > > filecache.c, or > > > > > > > > > > > > > > > > hammering > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > nfsd with > > > > > > > > > > > > > > > > tons of ops. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Let's see how nfsd_file_put() can race > > > > > > > > > > > > > > > > with > > > > > > > > > > > > > > > > nfsd_file_cond_queue(): > > > > > > > > > > > > > > > > * In nfsd_file_put(), nf->nf_ref is 1, > > > > > > > > > > > > > > > > so it > > > > > > > > > > > > > > > > tries to > > > > > > > > > > > > > > > > do > > > > > > > > > > > > > > > > nfsd_file_lru_add(). > > > > > > > > > > > > > > > > * nfsd_file_lru_add() sets REFERENCED > > > > > > > > > > > > > > > > bit and > > > > > > > > > > > > > > > > returns > > > > > > > > > > > > > > > > true. > > > > > > > > > > > > > > > > * 'exportfs -f' or something like that > > > > > > > > > > > > > > > > triggers > > > > > > > > > > > > > > > > __nfsd_file_cache_purge() -> > > > > > > > > > > > > > > > > nfsd_file_cond_queue(). > > > > > > > > > > > > > > > > * In nfsd_file_cond_queue(), it runs > > > > > > > > > > > > > > > > [if > > > > > > > > > > > > > > > > (!nfsd_file_unhash(nf))], > > > > > > > > > > > > > > > > unhash is done successfully. > > > > > > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > > > > > > (!nfsd_file_get(nf))], > > > > > > > > > > > > > > > > now > > > > > > > > > > > > > > > > nf->nf_ref goes to 2. > > > > > > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > > > > > > (nfsd_file_lru_remove(nf))], it succeeds. > > > > > > > > > > > > > > > > * nfsd_file_cond_queue() runs [if > > > > > > > > > > > > > > > > (refcount_sub_and_test(decrement, > > > > > > > > > > > > > > > > &nf->nf_ref))] (with "decrement" being > > > > > > > > > > > > > > > > 2), so the > > > > > > > > > > > > > > > > nf- > > > > > > > > > > > > > > > > > nf_ref > > > > > > > > > > > > > > > > goes to > > > > > > > > > > > > > > > > 0, the 'nf' is added to the dispost list > > > > > > > > > > > > > > > > [list_add(&nf- > > > > > > > > > > > > > > > > > nf_lru, > > > > > > > > > > > > > > > > dispose)] > > > > > > > > > > > > > > > > * nfsd_file_put() detects > > > > > > > > > > > > > > > > NFSD_FILE_HASHED bit > > > > > > > > > > > > > > > > is > > > > > > > > > > > > > > > > cleared, > > > > > > > > > > > > > > > > so it > > > > > > > > > > > > > > > > tries to remove the 'nf' from the LRU [if > > > > > > > > > > > > > > > > (!nfsd_file_lru_remove(nf))], although > > > > > > > > > > > > > > > > the 'nf' > > > > > > > > > > > > > > > > is not in > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > LRU, but > > > > > > > > > > > > > > > > it is linked in the 'dispose' list, > > > > > > > > > > > > > > > > nfsd_file_lru_remove() > > > > > > > > > > > > > > > > simply > > > > > > > > > > > > > > > > treats it as part of the LRU and removes > > > > > > > > > > > > > > > > it. This > > > > > > > > > > > > > > > > leads > > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > > its removal > > > > > > > > > > > > > > > > from the 'dispose' list! > > > > > > > > > > > > > > > > * Now nf->ref is 0, unhashed. > > > > > > > > > > > > > > > > nfsd_file_put() > > > > > > > > > > > > > > > > continues > > > > > > > > > > > > > > > > its > > > > > > > > > > > > > > > > execution and sets nf->nf_ref to > > > > > > > > > > > > > > > > REFCOUNT_SATURATED. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The purpose of nf->nf_lru is problematic. > > > > > > > > > > > > > > > > As you > > > > > > > > > > > > > > > > can see, > > > > > > > > > > > > > > > > it > > > > > > > > > > > > > > > > is used > > > > > > > > > > > > > > > > for the LRU list, and also the 'dispose' > > > > > > > > > > > > > > > > list. > > > > > > > > > > > > > > > > Adding > > > > > > > > > > > > > > > > another > > > > > > > > > > > > > > > > 'struct > > > > > > > > > > > > > > > > list_head' specifically for the 'dispose' > > > > > > > > > > > > > > > > list > > > > > > > > > > > > > > > > seems to > > > > > > > > > > > > > > > > be a > > > > > > > > > > > > > > > > better > > > > > > > > > > > > > > > > way of fixing this race condition. Either > > > > > > > > > > > > > > > > way > > > > > > > > > > > > > > > > works for > > > > > > > > > > > > > > > > me. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Would you agree my above analysis makes > > > > > > > > > > > > > > > > sense? > > > > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think so. It's been a while since I've > > > > > > > > > > > > > > > done much > > > > > > > > > > > > > > > work in > > > > > > > > > > > > > > > this > > > > > > > > > > > > > > > code, > > > > > > > > > > > > > > > but it does sound like there is a race in > > > > > > > > > > > > > > > the LRU > > > > > > > > > > > > > > > handling. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Like Chuck said, the nf->nf_lru list should > > > > > > > > > > > > > > > be safe > > > > > > > > > > > > > > > to use > > > > > > > > > > > > > > > for > > > > > > > > > > > > > > > multiple > > > > > > > > > > > > > > > purposes, but that's only the case if we're > > > > > > > > > > > > > > > not > > > > > > > > > > > > > > > using that > > > > > > > > > > > > > > > list > > > > > > > > > > > > > > > as an > > > > > > > > > > > > > > > indicator. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The list_lru code does check this: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > if (!list_empty(item)) { > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ...so if we ever check this while it's > > > > > > > > > > > > > > > sitting on > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > dispose > > > > > > > > > > > > > > > list, it > > > > > > > > > > > > > > > will handle it incorrectly. It sounds like > > > > > > > > > > > > > > > that's > > > > > > > > > > > > > > > the root > > > > > > > > > > > > > > > cause of the > > > > > > > > > > > > > > > problem you're seeing? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > If so, then maybe a separate list_head for > > > > > > > > > > > > > > > disposal > > > > > > > > > > > > > > > would > > > > > > > > > > > > > > > be > > > > > > > > > > > > > > > better. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Here is my patch with signed-off-by: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > From: Youzhong Yang <youzhong@gmail.com> > > > > > > > > > > > > > > > > Date: Mon, 1 Jul 2024 06:45:22 -0400 > > > > > > > > > > > > > > > > Subject: [PATCH] nfsd: fix nfsd_file > > > > > > > > > > > > > > > > leaking due > > > > > > > > > > > > > > > > to race > > > > > > > > > > > > > > > > condition and early > > > > > > > > > > > > > > > > unhash > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Youzhong Yang > > > > > > > > > > > > > > > > <youzhong@gmail.com> > > > > > > > > > > > > > > > > --- > > > > > > > > > > > > > > > > fs/nfsd/filecache.c | 14 +++++++++++++- > > > > > > > > > > > > > > > > 1 file changed, 13 insertions(+), 1 > > > > > > > > > > > > > > > > deletion(-) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c > > > > > > > > > > > > > > > > b/fs/nfsd/filecache.c > > > > > > > > > > > > > > > > index 1a6d5d000b85..2323829f7208 100644 > > > > > > > > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > > > > > > > > @@ -389,6 +389,17 @@ nfsd_file_put(struct > > > > > > > > > > > > > > > > nfsd_file *nf) > > > > > > > > > > > > > > > > if > > > > > > > > > > > > > > > > (!nfsd_file_lru_remove(nf)) > > > > > > > > > > > > > > > > return; > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > + /* > > > > > > > > > > > > > > > > + * Racing with > > > > > > > > > > > > > > > > nfsd_file_cond_queue() or > > > > > > > > > > > > > > > > nfsd_file_lru_cb(), > > > > > > > > > > > > > > > > + * it's unhashed but then > > > > > > > > > > > > > > > > removed > > > > > > > > > > > > > > > > from > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > dispose list, > > > > > > > > > > > > > > > > + * so we need to free it. > > > > > > > > > > > > > > > > + */ > > > > > > > > > > > > > > > > + if (refcount_read(&nf- > > > > > > > > > > > > > > > > >nf_ref) == > > > > > > > > > > > > > > > > 0 && > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > A refcount_read in this path is a red flag > > > > > > > > > > > > > > > to me. > > > > > > > > > > > > > > > Anytime > > > > > > > > > > > > > > > you're just > > > > > > > > > > > > > > > looking at the refcount without changing > > > > > > > > > > > > > > > anything > > > > > > > > > > > > > > > just > > > > > > > > > > > > > > > screams > > > > > > > > > > > > > > > out > > > > > > > > > > > > > > > "race condition". > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > In this case, what guarantee is there that > > > > > > > > > > > > > > > this > > > > > > > > > > > > > > > won't run > > > > > > > > > > > > > > > afoul > > > > > > > > > > > > > > > of the > > > > > > > > > > > > > > > timing? We could check this and find out > > > > > > > > > > > > > > > it's 1 > > > > > > > > > > > > > > > just before > > > > > > > > > > > > > > > it > > > > > > > > > > > > > > > goes to > > > > > > > > > > > > > > > 0 and you check the other conditions. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Does anything prevent that? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > + > > > > > > > > > > > > > > > > !test_bit(NFSD_FILE_HASHED, > > > > > > > > > > > > > > > > &nf- > > > > > > > > > > > > > > > > > nf_flags) && > > > > > > > > > > > > > > > > + list_empty(&nf- > > > > > > > > > > > > > > > > >nf_lru)) { > > > > > > > > > > > > > > > > + > > > > > > > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > > > > > + return; > > > > > > > > > > > > > > > > + } > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > if (refcount_dec_and_test(&nf- > > > > > > > > > > > > > > > > >nf_ref)) > > > > > > > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > > > > > @@ -576,7 +587,7 @@ > > > > > > > > > > > > > > > > nfsd_file_cond_queue(struct > > > > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > > > > *nf, struct > > > > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we raced with someone else > > > > > > > > > > > > > > > > unhashing, > > > > > > > > > > > > > > > > ignore it > > > > > > > > > > > > > > > > */ > > > > > > > > > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, > > > > > > > > > > > > > > > > &nf- > > > > > > > > > > > > > > > > > nf_flags)) > > > > > > > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Same here: you're just testing for the > > > > > > > > > > > > > > > HASHED bit, > > > > > > > > > > > > > > > but > > > > > > > > > > > > > > > could > > > > > > > > > > > > > > > this also > > > > > > > > > > > > > > > race with someone who is setting it just > > > > > > > > > > > > > > > after you > > > > > > > > > > > > > > > get > > > > > > > > > > > > > > > here. > > > > > > > > > > > > > > > Why is > > > > > > > > > > > > > > > that not a problem? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we can't get a reference, > > > > > > > > > > > > > > > > ignore it > > > > > > > > > > > > > > > > */ > > > > > > > > > > > > > > > > @@ -590,6 +601,7 @@ > > > > > > > > > > > > > > > > nfsd_file_cond_queue(struct > > > > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > > > > *nf, struct > > > > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > > > /* If refcount goes to 0, then > > > > > > > > > > > > > > > > put on the > > > > > > > > > > > > > > > > dispose > > > > > > > > > > > > > > > > list */ > > > > > > > > > > > > > > > > if > > > > > > > > > > > > > > > > (refcount_sub_and_test(decrement, &nf- > > > > > > > > > > > > > > > > > nf_ref)) { > > > > > > > > > > > > > > > > list_add(&nf->nf_lru, > > > > > > > > > > > > > > > > dispose); > > > > > > > > > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > 2.43.0 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Jul 3, 2024 at 2:21 PM Chuck > > > > > > > > > > > > > > > > Lever > > > > > > > > > > > > > > > > <chuck.lever@oracle.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Jul 03, 2024 at 10:12:33AM - > > > > > > > > > > > > > > > > > 0400, > > > > > > > > > > > > > > > > > Youzhong Yang > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > Hello, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'd like to report a nfsd_file > > > > > > > > > > > > > > > > > > leaking issue > > > > > > > > > > > > > > > > > > and > > > > > > > > > > > > > > > > > > propose > > > > > > > > > > > > > > > > > > a fix for it. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > When I tested Linux kernel 6.8 and > > > > > > > > > > > > > > > > > > 6.6, I > > > > > > > > > > > > > > > > > > noticed > > > > > > > > > > > > > > > > > > nfsd_file leaks > > > > > > > > > > > > > > > > > > which led to undestroyable file > > > > > > > > > > > > > > > > > > systems > > > > > > > > > > > > > > > > > > (zfs), > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the report. Some initial > > > > > > > > > > > > > > > > > comments: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Do you have a specific reproducer? In > > > > > > > > > > > > > > > > > other > > > > > > > > > > > > > > > > > words, > > > > > > > > > > > > > > > > > what > > > > > > > > > > > > > > > > > is the > > > > > > > > > > > > > > > > > simplest program that can run on an > > > > > > > > > > > > > > > > > NFS > > > > > > > > > > > > > > > > > client that > > > > > > > > > > > > > > > > > will > > > > > > > > > > > > > > > > > trigger > > > > > > > > > > > > > > > > > this leak, and can you post it? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - "zfs" is an out-of-tree file system, > > > > > > > > > > > > > > > > > so it's > > > > > > > > > > > > > > > > > not > > > > > > > > > > > > > > > > > directly > > > > > > > > > > > > > > > > > supported for NFSD. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - The guidelines for patch submission > > > > > > > > > > > > > > > > > require > > > > > > > > > > > > > > > > > us to fix > > > > > > > > > > > > > > > > > issues in > > > > > > > > > > > > > > > > > upstream Linux first (currently > > > > > > > > > > > > > > > > > that's v6.10- > > > > > > > > > > > > > > > > > rc6). > > > > > > > > > > > > > > > > > Then > > > > > > > > > > > > > > > > > that fix > > > > > > > > > > > > > > > > > can be backported to older stable > > > > > > > > > > > > > > > > > kernels > > > > > > > > > > > > > > > > > like 6.6. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Can you reproduce the leak with one of > > > > > > > > > > > > > > > > > the in- > > > > > > > > > > > > > > > > > kernel > > > > > > > > > > > > > > > > > filesystems > > > > > > > > > > > > > > > > > (either xfs or btrfs would be great) > > > > > > > > > > > > > > > > > and with > > > > > > > > > > > > > > > > > NFSD in > > > > > > > > > > > > > > > > > 6.10- > > > > > > > > > > > > > > > > > rc6? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > One more comment below. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > here are some examples: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct nfsd_file -x > > > > > > > > > > > > > > > > > > ffff88e160db0460 > > > > > > > > > > > > > > > > > > struct nfsd_file { > > > > > > > > > > > > > > > > > > nf_rlist = { > > > > > > > > > > > > > > > > > > rhead = { > > > > > > > > > > > > > > > > > > next = 0xffff8921fa2392f1 > > > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > > > next = 0x0 > > > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > > > nf_inode = 0xffff8882bc312ef8, > > > > > > > > > > > > > > > > > > nf_file = 0xffff88e2015b1500, > > > > > > > > > > > > > > > > > > nf_cred = 0xffff88e3ab0e7800, > > > > > > > > > > > > > > > > > > nf_net = 0xffffffff83d41600 > > > > > > > > > > > > > > > > > > <init_net>, > > > > > > > > > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > > > > > > > > > nf_ref = { > > > > > > > > > > > > > > > > > > refs = { > > > > > > > > > > > > > > > > > > counter = 0xc0000000 > > > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > > > nf_may = 0x4, > > > > > > > > > > > > > > > > > > nf_mark = 0xffff88e1bddfb320, > > > > > > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > > > > > > next = 0xffff88e160db04a8, > > > > > > > > > > > > > > > > > > prev = 0xffff88e160db04a8 > > > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > > > nf_rcu = { > > > > > > > > > > > > > > > > > > next = 0x10000000000, > > > > > > > > > > > > > > > > > > func = 0x0 > > > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > > > nf_birthtime = 0x73d22fc1728 > > > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counte > > > > > > > > > > > > > > > > > > r,nf_lru > > > > > > > > > > > > > > > > > > ,nf_file > > > > > > > > > > > > > > > > > > -x > > > > > > > > > > > > > > > > > > ffff88839a53d850 > > > > > > > > > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > > > > > > next = 0xffff88839a53d898, > > > > > > > > > > > > > > > > > > prev = 0xffff88839a53d898 > > > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > > > nf_file = 0xffff88810ede8700, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counte > > > > > > > > > > > > > > > > > > r,nf_lru > > > > > > > > > > > > > > > > > > ,nf_file > > > > > > > > > > > > > > > > > > -x > > > > > > > > > > > > > > > > > > ffff88c32b11e850 > > > > > > > > > > > > > > > > > > nf_flags = 0x8, > > > > > > > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > > > > > > next = 0xffff88c32b11e898, > > > > > > > > > > > > > > > > > > prev = 0xffff88c32b11e898 > > > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > > > nf_file = 0xffff88c20a701c00, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counte > > > > > > > > > > > > > > > > > > r,nf_lru > > > > > > > > > > > > > > > > > > ,nf_file > > > > > > > > > > > > > > > > > > -x > > > > > > > > > > > > > > > > > > ffff88e372709700 > > > > > > > > > > > > > > > > > > nf_flags = 0xc, > > > > > > > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > > > > > > next = 0xffff88e372709748, > > > > > > > > > > > > > > > > > > prev = 0xffff88e372709748 > > > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > > > nf_file = 0xffff88e0725e6400, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > crash> struct > > > > > > > > > > > > > > > > > > nfsd_file.nf_flags,nf_ref.refs.counte > > > > > > > > > > > > > > > > > > r,nf_lru > > > > > > > > > > > > > > > > > > ,nf_file > > > > > > > > > > > > > > > > > > -x > > > > > > > > > > > > > > > > > > ffff8982864944d0 > > > > > > > > > > > > > > > > > > nf_flags = 0xc, > > > > > > > > > > > > > > > > > > nf_ref.refs.counter = 0x0 > > > > > > > > > > > > > > > > > > nf_lru = { > > > > > > > > > > > > > > > > > > next = 0xffff898286494518, > > > > > > > > > > > > > > > > > > prev = 0xffff898286494518 > > > > > > > > > > > > > > > > > > }, > > > > > > > > > > > > > > > > > > nf_file = 0xffff89803c0ff700, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The leak occurs when nfsd_file_put() > > > > > > > > > > > > > > > > > > races > > > > > > > > > > > > > > > > > > with > > > > > > > > > > > > > > > > > > nfsd_file_cond_queue() > > > > > > > > > > > > > > > > > > or nfsd_file_lru_cb(). With the > > > > > > > > > > > > > > > > > > following > > > > > > > > > > > > > > > > > > patch, I > > > > > > > > > > > > > > > > > > haven't observed > > > > > > > > > > > > > > > > > > any leak after a few days heavy nfs > > > > > > > > > > > > > > > > > > load: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Our patch submission guidelines require > > > > > > > > > > > > > > > > > a > > > > > > > > > > > > > > > > > Signed-off- > > > > > > > > > > > > > > > > > by: > > > > > > > > > > > > > > > > > line at the end of the patch > > > > > > > > > > > > > > > > > description. See > > > > > > > > > > > > > > > > > the "Sign > > > > > > > > > > > > > > > > > your work - > > > > > > > > > > > > > > > > > the Developer's Certificate of Origin" > > > > > > > > > > > > > > > > > section > > > > > > > > > > > > > > > > > of > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.10-rc6 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > (Needed here in case your fix is > > > > > > > > > > > > > > > > > acceptable). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/filecache.c > > > > > > > > > > > > > > > > > > b/fs/nfsd/filecache.c > > > > > > > > > > > > > > > > > > index 1a6d5d000b85..2323829f7208 > > > > > > > > > > > > > > > > > > 100644 > > > > > > > > > > > > > > > > > > --- a/fs/nfsd/filecache.c > > > > > > > > > > > > > > > > > > +++ b/fs/nfsd/filecache.c > > > > > > > > > > > > > > > > > > @@ -389,6 +389,17 @@ > > > > > > > > > > > > > > > > > > nfsd_file_put(struct > > > > > > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > > > > > > *nf) > > > > > > > > > > > > > > > > > > if (!nfsd_file_lru_remove(nf)) > > > > > > > > > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > + /* > > > > > > > > > > > > > > > > > > + * Racing with > > > > > > > > > > > > > > > > > > nfsd_file_cond_queue() or > > > > > > > > > > > > > > > > > > nfsd_file_lru_cb(), > > > > > > > > > > > > > > > > > > + * it's unhashed but then removed > > > > > > > > > > > > > > > > > > from the > > > > > > > > > > > > > > > > > > dispose > > > > > > > > > > > > > > > > > > list, > > > > > > > > > > > > > > > > > > + * so we need to free it. > > > > > > > > > > > > > > > > > > + */ > > > > > > > > > > > > > > > > > > + if (refcount_read(&nf->nf_ref) == 0 > > > > > > > > > > > > > > > > > > && > > > > > > > > > > > > > > > > > > + !test_bit(NFSD_FILE_HASHED, > > > > > > > > > > > > > > > > > > &nf- > > > > > > > > > > > > > > > > > > > nf_flags) && > > > > > > > > > > > > > > > > > > + list_empty(&nf->nf_lru)) { > > > > > > > > > > > > > > > > > > + nfsd_file_free(nf); > > > > > > > > > > > > > > > > > > + return; > > > > > > > > > > > > > > > > > > + } > > > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > if (refcount_dec_and_test(&nf- > > > > > > > > > > > > > > > > > > >nf_ref)) > > > > > > > > > > > > > > > > > > nfsd_file_free(nf); > > > > > > > > > > > > > > > > > > @@ -576,7 +587,7 @@ > > > > > > > > > > > > > > > > > > nfsd_file_cond_queue(struct > > > > > > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > > > > > > *nf, struct > > > > > > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > > > > > int decrement = 1; > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we raced with someone else > > > > > > > > > > > > > > > > > > unhashing, > > > > > > > > > > > > > > > > > > ignore > > > > > > > > > > > > > > > > > > it > > > > > > > > > > > > > > > > > > */ > > > > > > > > > > > > > > > > > > - if (!nfsd_file_unhash(nf)) > > > > > > > > > > > > > > > > > > + if (!test_bit(NFSD_FILE_HASHED, > > > > > > > > > > > > > > > > > > &nf- > > > > > > > > > > > > > > > > > > > nf_flags)) > > > > > > > > > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > /* If we can't get a reference, > > > > > > > > > > > > > > > > > > ignore it > > > > > > > > > > > > > > > > > > */ > > > > > > > > > > > > > > > > > > @@ -590,6 +601,7 @@ > > > > > > > > > > > > > > > > > > nfsd_file_cond_queue(struct > > > > > > > > > > > > > > > > > > nfsd_file > > > > > > > > > > > > > > > > > > *nf, struct > > > > > > > > > > > > > > > > > > list_head *dispose) > > > > > > > > > > > > > > > > > > /* If refcount goes to 0, then put > > > > > > > > > > > > > > > > > > on the > > > > > > > > > > > > > > > > > > dispose > > > > > > > > > > > > > > > > > > list > > > > > > > > > > > > > > > > > > */ > > > > > > > > > > > > > > > > > > if > > > > > > > > > > > > > > > > > > (refcount_sub_and_test(decrement, > > > > > > > > > > > > > > > > > > &nf- > > > > > > > > > > > > > > > > > > > nf_ref)) > > > > > > > > > > > > > > > > > > { > > > > > > > > > > > > > > > > > > list_add(&nf->nf_lru, dispose); > > > > > > > > > > > > > > > > > > + nfsd_file_unhash(nf); > > > > > > > > > > > > > > > > > > trace_nfsd_file_closing(nf); > > > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Please kindly review the patch and > > > > > > > > > > > > > > > > > > let me > > > > > > > > > > > > > > > > > > know if it > > > > > > > > > > > > > > > > > > makes sense. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -Youzhong > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > > Chuck Lever > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > Jeff Layton <jlayton@poochiereds.net> > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > Jeff Layton <jlayton@poochiereds.net> > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > > > > > > > > > -- > > > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > > > > > -- > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > > > > > -- > > > Jeff Layton <jlayton@kernel.org> > > -- > Jeff Layton <jlayton@kernel.org>
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c index 1a6d5d000b85..2323829f7208 100644 --- a/fs/nfsd/filecache.c +++ b/fs/nfsd/filecache.c @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf) if (!nfsd_file_lru_remove(nf)) return; } + /* + * Racing with nfsd_file_cond_queue() or nfsd_file_lru_cb(), + * it's unhashed but then removed from the dispose list, + * so we need to free it. + */ + if (refcount_read(&nf->nf_ref) == 0 && + !test_bit(NFSD_FILE_HASHED, &nf->nf_flags) && + list_empty(&nf->nf_lru)) { + nfsd_file_free(nf); + return; + } } if (refcount_dec_and_test(&nf->nf_ref)) nfsd_file_free(nf); @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct list_head *dispose) int decrement = 1; /* If we raced with someone else unhashing, ignore it */ - if (!nfsd_file_unhash(nf)) + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) return;