[0/2] sunrpc: Fix issues with cache_detail nextcheck updates

Message ID 20250301064836.3285906-1-leo.lilong@huawei.com (mailing list archive)

Long Li March 1, 2025, 6:48 a.m. UTC
During memory fault injection testing with nfsd restart, I encountered an
issue where NFS client threads would hang for around 1800 seconds. Analysis
showed that the nfsd threads were blocked for about the same length of time,
with the following stack trace:

  PID: 3941444  TASK: ffff0000cf170040  CPU: 0    COMMAND: "nfsd"
   #0 [ffff80008d387120] __switch_to at ffffc4ef3c7a6af0
   #1 [ffff80008d387170] __schedule at ffffc4ef3c7a73a4
   #2 [ffff80008d3872c0] schedule at ffffc4ef3c7a8074
   #3 [ffff80008d387300] schedule_timeout at ffffc4ef3c7b7b60
   #4 [ffff80008d387470] wait_for_common at ffffc4ef3c7a944c
   #5 [ffff80008d387560] wait_for_completion_interruptible_timeout at ffffc4ef3c7a9630
   #6 [ffff80008d387570] cache_wait_req at ffffc4ef3c6804dc
   #7 [ffff80008d3876f0] cache_check at ffffc4ef3c680740
   #8 [ffff80008d3877d0] exp_find_key at ffffc4ef3b6e293c
   #9 [ffff80008d387910] exp_find at ffffc4ef3b6e2ccc
  #10 [ffff80008d387980] rqst_exp_find at ffffc4ef3b6e445c
  #11 [ffff80008d3879e0] exp_pseudoroot at ffffc4ef3b6e4984
  #12 [ffff80008d387a90] nfsd4_putrootfh at ffffc4ef3b6f8720
  #13 [ffff80008d387ab0] nfsd4_proc_compound at ffffc4ef3b6fe4cc
  #14 [ffff80008d387b70] nfsd_dispatch at ffffc4ef3b6cf428
  #15 [ffff80008d387c30] svc_process_common at ffffc4ef3c66235c
  #16 [ffff80008d387d20] svc_process at ffffc4ef3c6652f8
  #17 [ffff80008d387d90] svc_recv at ffffc4ef3c68c5d0
  #18 [ffff80008d387e10] nfsd at ffffc4ef3b6cb968
  #19 [ffff80008d387e60] kthread at ffffc4ef3ad4aca4
  
An nfsd thread sent an upcall and set the cache to CACHE_PENDING state,
waiting for the downcall to complete. However, due to memory fault
injection, this downcall failed and the userspace daemon did not retry.
The nfsd thread could only wait for cache cleanup to clear the
CACHE_PENDING state and resend the upcall.

Under certain edge cases, the cache_detail scanning interval could be set
to a large value like 1800 seconds, causing cache cleanup to be delayed
well beyond the cache's expiry time. This behavior seems unreasonable.

This patch series fixes two issues related to cache_detail nextcheck
time updates in the sunrpc subsystem. The first patch ensures the nextcheck
time is properly updated when adding new cache entries to a cache_detail.
The second patch fixes a race condition between cache cleanup and entry
removal that can result in stale nextcheck times.
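
To illustrate the first issue, here is a minimal userspace sketch. It is
not the kernel code and not the code from these patches: every toy_* name
is hypothetical, and it assumes the scan interval falls back to 1800
seconds whenever no live entry asks for an earlier scan. It only shows
why an entry added after a scan can outlive its expiry time unless
nextcheck is pulled forward at insertion.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not the kernel's structures. */
#define TOY_DEFAULT_INTERVAL 1800   /* assumed fallback scan interval (s) */
#define TOY_MAX_ENTRIES 8

struct toy_detail {
	long nextcheck;                 /* earliest time of the next scan */
	long expiry[TOY_MAX_ENTRIES];   /* expiry time of each live entry */
	size_t nr;                      /* number of live entries */
};

/* Scan at time `now`: drop expired entries and recompute nextcheck. */
static void toy_scan(struct toy_detail *d, long now)
{
	size_t i, kept = 0;

	assert(d != NULL);
	d->nextcheck = now + TOY_DEFAULT_INTERVAL;
	for (i = 0; i < d->nr; i++) {
		if (d->expiry[i] <= now)
			continue;       /* expired: dropped from the table */
		if (d->nextcheck > d->expiry[i])
			d->nextcheck = d->expiry[i] + 1;
		d->expiry[kept++] = d->expiry[i];
	}
	d->nr = kept;
}

/* Add an entry; `bump` selects whether nextcheck is pulled forward. */
static int toy_insert(struct toy_detail *d, long expiry, int bump)
{
	assert(d != NULL);
	if (d->nr >= TOY_MAX_ENTRIES)
		return -1;
	d->expiry[d->nr++] = expiry;
	if (bump && d->nextcheck > expiry)
		d->nextcheck = expiry + 1;
	return 0;
}

/* Seconds an entry expiring at `expiry` waits past expiry for cleanup. */
static long toy_cleanup_lag(const struct toy_detail *d, long expiry)
{
	assert(d != NULL);
	return d->nextcheck > expiry ? d->nextcheck - expiry : 0;
}
```

In this model, scanning an empty table at t=100 sets nextcheck to 1900.
An entry inserted afterwards with expiry 220 and bump=0 is not revisited
until 1900, whereas bump=1 moves nextcheck to 221.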

Long Li (2):
  sunrpc: update nextcheck time when adding new cache entries
  sunrpc: fix race in cache cleanup causing stale nextcheck time

 net/sunrpc/cache.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)