Message ID | CANZk6aR+CqZaqmMWrC2eRRPY12qAZnDZLwLnHZbNi=xXMB401g@mail.gmail.com (mailing list archive) |
---|---|
State | Accepted |
Commit | 38744b82913e3067667ef30bff70d8211f71e478 |
Series | SRCU: kworker hung in synchronize_srcu |
Hello, Firstly, kudos to the detailed report and analysis. Rare failures are hard and your usage crash/kdump is awesome to dig deeper into the issue.. On Thu, Sep 28, 2023 at 3:59 AM zhuangel570 <zhuangel570@gmail.com> wrote: > > Hi, > > We encounter SRCU hung issue in stable tree 5.4.203, we are running VM create > and destroy concurrent test, the issue happens after several weeks. Now we > didn't have a way to reproduce this issue, the issue happens randomly, this > is the second time we found it in this year. > > We did some investigation with crash tools, hope to get some useful clues > about the issue, try to find root cause of the issue (further find a way to > fix or workaround it), here is some of our investigation, hope to get > feedback from SRCU experts. > > [3144162.608392] INFO: task kworker/136:12:252684 blocked for more > than 122 seconds. > [3144162.615986] Tainted: G O K 5.4.203-1-tlinux4-0011.1 #1 > [3144162.623053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > [3144162.631162] kworker/136:12 D 0 252684 2 0x90004000 > [3144162.631189] Workqueue: kvm-irqfd-cleanup irqfd_shutdown [kvm] > [3144162.631192] Call Trace: > [3144162.631202] __schedule+0x2ee/0x660 > [3144162.631206] schedule+0x33/0xa0 > [3144162.631209] schedule_timeout+0x1c4/0x340 > [3144162.631214] ? update_load_avg+0x82/0x660 > [3144162.631217] ? raw_spin_rq_lock_nested+0x1f/0x30 > [3144162.631218] wait_for_completion+0x119/0x180 > [3144162.631220] ? wake_up_q+0x80/0x80 > [3144162.631224] __synchronize_srcu.part.19+0x81/0xb0 > [3144162.631226] ? __bpf_trace_rcu_utilization+0x10/0x10 > [3144162.631227] synchronize_srcu+0x5f/0xc0 > [3144162.631236] irqfd_shutdown+0x3c/0xb0 [kvm] > [3144162.631239] ? __schedule+0x2f6/0x660 > [3144162.631243] process_one_work+0x19a/0x3a0 > [3144162.631244] worker_thread+0x37/0x3a0 > [3144162.631247] kthread+0x117/0x140 > [3144162.631247] ? process_one_work+0x3a0/0x3a0 > [3144162.631248] ? __kthread_cancel_work+0x40/0x40 > [3144162.631250] ret_from_fork+0x1f/0x30 > > Test workload: > - Repeated concurrency testing, include Create VM, service probe, destroy VM. > - VM using virito device, so kvm->irq_srcu is using frequently. > - x86_64 Machine with 191 core. > > Crash analysis clues: > - The hung task is run on CPU 136, waiting synchronize_srcu to be completed. > - The current GP of SRCU struct is 304 (srcu_gp_seq and srcu_gp_seq_needed). > - The SNP (node) of SRCU struct has 2 levels. > - The SNP (node) of SRCU struct show all SNP now in GP 304 (srcu_have_cbs). > - The SNP (node) of SRCU struct show CPU 136 belongs to group [128,143]. > - The SDP (per-cpu) of SRCU struct show lock and unlock matches. > - The SDP (per-cpu) of CPU 136 segcblist has one entry in pending list. > - The pending entry is just the calling entry of hung kwoker task. > - The SDP (per-cpu) of CPU 136 is waiting GP 288, gp_seq of segcblist is 296. > - The SDP (per-cpu) of CPU 136 segcblist has no entry in DONE list. > - The work-queue of SRCU and SDP of CPU 136 is idle. > - More details could be found via reference at the end of this mail. > > crash> struct srcu_struct.srcu_idx,srcu_gp_seq,srcu_gp_seq_needed,srcu_gp_seq_needed_exp,sda > 0xffffc900f100e878 > srcu_idx = 76 > srcu_gp_seq = 304 > srcu_gp_seq_needed = 304 > srcu_gp_seq_needed_exp = 304 > sda = 0x5fff9ac17800 > > crash> struct srcu_data 0x5fff9ac17800:136 > [136]: ffffe8bee9817800 > ... 
> srcu_cblist = { > head = 0xffffc900e8413dd0, > tails = {0xffffe8bee9817848, 0xffffc900e8413dd0, > 0xffffc900e8413dd0, 0xffffc900e8413dd0}, > gp_seq = {0, 296, 296, 0}, I am actually a bit puzzled why the DONE segment count is 0. I would have expected it to be >= 296 in this situation. > len = 1, > len_lazy = 0, > enabled = 1 '\001', > offloaded = 0 '\000' > }, > srcu_gp_seq_needed = 288, > srcu_gp_seq_needed_exp = 0, > srcu_cblist_invoking = false, > ... > > Root cause (we can't reproduce the issue, so the result is base on crash > analysis and code analysis): > - The leaked entry on segcblist has incorrect gp_seq (SDP is waiting 288). > - The leaked entry on segcblist seems has been "accelerated". > - The incorrect accelerate maybe caused by srcu_invoke_callbacks, because > srcu_invoke_callbacks may be using incorrect GP seq doing accelerate. > - Here are two emulated case may cause the "accelerate issue": > > > CASE1: entry of CPU 136 belongs to GP 288 was accelerated to GP 292 > - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks. > - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry > to SDP. > - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run > faster than CPU 136, start GP, set rcu_seq_start. > - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need > to start GP. > - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks, > "accelerate" the seq of new added entry to 292 (it should be 288). But srcu_gp_seq is at 304 right now. How does it matter that the CB is marked for 292? It should be ready to execute anyway even at 292. Note the meaning of "acceleration", the idea is to start conservatively and move the callbacks forward as more accurate information is available. Considering this, 292 initially should be OK IMHO (that's just more conservative than 288).. > CASE2: entry of CPU 136 belongs to GP 288 was accelerated to GP 296 > - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks. > - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry > to SDP. > - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run > faster than CPU 136, start GP. > - [CPU 041] [GP 284-288] finished, workload run faster, finish GP. > - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need > to start GP. > - [CPU 017] [GP 288-292] starting, new synchronize_srcu request, start GP, > set rcu_seq_start. > - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks, > "accelerate" new added entry to 296 (it should be 288). I had the same analysis here. > /* > * CASE1 > * - entry of CPU 136 expecting GP 288 was accelerated to GP 292 > * - GP0: 280-284, GP1: 284-288. > */ > > /* [GP0][CPU-136] */ > process_srcu { > srcu_gp_end > } > /* [GP1][CPU-136] */ > synchronize_srcu { > __call_srcu { > rcu_segcblist_enqueue > /* [GP1][CPU-041] */ > synchronize_srcu { > __call_srcu { > srcu_funnel_gp_start > srcu_gp_start > rcu_seq_start > } > } > /* [GP1][CPU-136] */ > srcu_funnel_gp_start > } > } > /* [GP0][CPU-136] */ > srcu_invoke_callbacks { > rcu_segcblist_accelerate > } > > > /* > * CASE2 > * - entry of CPU 136 belongs to GP 288 was accelerated to GP 296. > * - GP0: 280-284, GP1: 284-288, GP2: 288-292. 
> */ > > /* [GP0][CPU-136] */ > process_srcu { > srcu_gp_end > } > > /* [GP1][CPU-136] */ > synchronize_srcu { > __call_srcu { > rcu_segcblist_enqueue > /* [GP1][CPU-041] */ > synchronize_srcu { > __call_srcu { > srcu_funnel_gp_start > srcu_gp_start > } > } > process_srcu { > srcu_gp_end > rcu_seq_end > } > /* [GP1][CPU-136] */ > srcu_funnel_gp_start > } > } > /* [GP0][CPU-136] */ > srcu_invoke_callbacks { If srcu_invoke_callbacks() was really called for the rdp, I would have expected rcu_segcblist_advance() to advance all those pending callbacks to 304. I posit that probably srcu_invoke_callbacks() is not even being called in the first place, otherwise why is the DONE segment count still 0? DONE should not be 0, otherwise callbacks will not run. The segcblist of the rdp seems out of date perhaps due to rcu_invoke_callbacks() not being called due to some other bug (or a workqueue/timer issue causing the srcu_invoke_callbacks() to not run?). I wonder if you would consider trying to reproduce it with the rcutorture module's SRCU testing in the chance that it occurs with it on your setup. thanks, - Joel > /* [GP2][CPU-017] */ > synchronize_srcu { > __call_srcu { > srcu_funnel_gp_start > srcu_gp_start > rcu_seq_start > } > } > /* [GP0][CPU-136] */ > rcu_segcblist_accelerate > } > > > Potential fix: > We shouldn't accelerate the segcblist in srcu_invoke_callbacks base snapshot > of current gp seq of SRCU struct, how about accelerate the sda base > sdp->srcu_gp_seq_needed, maybe like this: > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > index 21115ffb6c44..ff8579abc924 100644 > --- a/kernel/rcu/srcutree.c > +++ b/kernel/rcu/srcutree.c > @@ -1197,7 +1197,7 @@ static void srcu_invoke_callbacks(struct > work_struct *work) > spin_lock_irq_rcu_node(sdp); > rcu_segcblist_insert_count(&sdp->srcu_cblist, &ready_cbs); > (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, > - rcu_seq_snap(&ssp->srcu_gp_seq)); > + rcu_seq_snap(&sdp->srcu_gp_seq_needed)); > sdp->srcu_cblist_invoking = false; > more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist); > spin_unlock_irq_rcu_node(sdp); I think "rcu_seq_snap(&ssp->srcu_gp_seq))" is correct IMHO, rcu_seq_snap() says which what's the gp_seq at which a full GP has passed from the moment rcu_seq_snap() is called. That's precisely when it is safe to call all those non-accelerated callbacks. Basically, if any of the bits are set in the state portion of the sequence of the gp_seq number, then it returns (COUNTER_PORTION + 2) << RCU_SEQ_CTR_SHIFT. Otherwise, it returns (COUNTER_PORTION + 1) << RCU_SEQ_CTR_SHIFT. > > > Thanks for reading! > > > Reference detail log of crash dump information: > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/bt.f.252684.log > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.partial.log > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.raw.log > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.lock.log > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.raw.log
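For reference on the gp_seq encoding Joel describes above: with RCU_SEQ_CTR_SHIFT == 2 the grace-period counter lives in the upper bits and the low two bits hold the GP state, which is why the GP numbers in this thread advance in steps of 4 and why a snapshot taken while a GP is in flight lands one full GP later. A small stand-alone model of the sequence helpers (paraphrased from kernel/rcu/rcu.h, not copied from the 5.4 tree; READ_ONCE()/smp_mb() dropped) reproduces the numbers seen in CASE1 and CASE2:

#include <stdio.h>

#define RCU_SEQ_CTR_SHIFT	2
#define RCU_SEQ_STATE_MASK	((1UL << RCU_SEQ_CTR_SHIFT) - 1)

/* Next gp_seq value at which a full grace period is guaranteed to have elapsed. */
static unsigned long rcu_seq_snap(unsigned long gp_seq)
{
	return (gp_seq + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK;
}

int main(void)
{
	printf("%lu\n", rcu_seq_snap(284));	/* gp_seq 284, idle: 288            */
	printf("%lu\n", rcu_seq_snap(285));	/* GP 284-288 running: 292 (CASE1)  */
	printf("%lu\n", rcu_seq_snap(289));	/* GP 288-292 running: 296 (CASE2)  */
	return 0;
}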
On Thu, Sep 28, 2023 at 5:39 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > Hello, > Firstly, kudos to the detailed report and analysis. Rare failures are > hard and your usage crash/kdump is awesome to dig deeper into the > issue.. > > On Thu, Sep 28, 2023 at 3:59 AM zhuangel570 <zhuangel570@gmail.com> wrote: > > > > Hi, > > > > We encounter SRCU hung issue in stable tree 5.4.203, we are running VM create > > and destroy concurrent test, the issue happens after several weeks. Now we > > didn't have a way to reproduce this issue, the issue happens randomly, this > > is the second time we found it in this year. > > > > We did some investigation with crash tools, hope to get some useful clues > > about the issue, try to find root cause of the issue (further find a way to > > fix or workaround it), here is some of our investigation, hope to get > > feedback from SRCU experts. > > > > [3144162.608392] INFO: task kworker/136:12:252684 blocked for more > > than 122 seconds. > > [3144162.615986] Tainted: G O K 5.4.203-1-tlinux4-0011.1 #1 > > [3144162.623053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > > disables this message. > > [3144162.631162] kworker/136:12 D 0 252684 2 0x90004000 > > [3144162.631189] Workqueue: kvm-irqfd-cleanup irqfd_shutdown [kvm] > > [3144162.631192] Call Trace: > > [3144162.631202] __schedule+0x2ee/0x660 > > [3144162.631206] schedule+0x33/0xa0 > > [3144162.631209] schedule_timeout+0x1c4/0x340 > > [3144162.631214] ? update_load_avg+0x82/0x660 > > [3144162.631217] ? raw_spin_rq_lock_nested+0x1f/0x30 > > [3144162.631218] wait_for_completion+0x119/0x180 > > [3144162.631220] ? wake_up_q+0x80/0x80 > > [3144162.631224] __synchronize_srcu.part.19+0x81/0xb0 > > [3144162.631226] ? __bpf_trace_rcu_utilization+0x10/0x10 > > [3144162.631227] synchronize_srcu+0x5f/0xc0 > > [3144162.631236] irqfd_shutdown+0x3c/0xb0 [kvm] > > [3144162.631239] ? __schedule+0x2f6/0x660 > > [3144162.631243] process_one_work+0x19a/0x3a0 > > [3144162.631244] worker_thread+0x37/0x3a0 > > [3144162.631247] kthread+0x117/0x140 > > [3144162.631247] ? process_one_work+0x3a0/0x3a0 > > [3144162.631248] ? __kthread_cancel_work+0x40/0x40 > > [3144162.631250] ret_from_fork+0x1f/0x30 > > > > Test workload: > > - Repeated concurrency testing, include Create VM, service probe, destroy VM. > > - VM using virito device, so kvm->irq_srcu is using frequently. > > - x86_64 Machine with 191 core. > > > > Crash analysis clues: > > - The hung task is run on CPU 136, waiting synchronize_srcu to be completed. > > - The current GP of SRCU struct is 304 (srcu_gp_seq and srcu_gp_seq_needed). > > - The SNP (node) of SRCU struct has 2 levels. > > - The SNP (node) of SRCU struct show all SNP now in GP 304 (srcu_have_cbs). > > - The SNP (node) of SRCU struct show CPU 136 belongs to group [128,143]. > > - The SDP (per-cpu) of SRCU struct show lock and unlock matches. > > - The SDP (per-cpu) of CPU 136 segcblist has one entry in pending list. > > - The pending entry is just the calling entry of hung kwoker task. > > - The SDP (per-cpu) of CPU 136 is waiting GP 288, gp_seq of segcblist is 296. > > - The SDP (per-cpu) of CPU 136 segcblist has no entry in DONE list. > > - The work-queue of SRCU and SDP of CPU 136 is idle. > > - More details could be found via reference at the end of this mail. 
> > > > crash> struct srcu_struct.srcu_idx,srcu_gp_seq,srcu_gp_seq_needed,srcu_gp_seq_needed_exp,sda > > 0xffffc900f100e878 > > srcu_idx = 76 > > srcu_gp_seq = 304 > > srcu_gp_seq_needed = 304 > > srcu_gp_seq_needed_exp = 304 > > sda = 0x5fff9ac17800 > > > > crash> struct srcu_data 0x5fff9ac17800:136 > > [136]: ffffe8bee9817800 > > ... > > srcu_cblist = { > > head = 0xffffc900e8413dd0, > > tails = {0xffffe8bee9817848, 0xffffc900e8413dd0, > > 0xffffc900e8413dd0, 0xffffc900e8413dd0}, > > gp_seq = {0, 296, 296, 0}, > > I am actually a bit puzzled why the DONE segment count is 0. I would > have expected it to be >= 296 in this situation. > > > len = 1, > > len_lazy = 0, > > enabled = 1 '\001', > > offloaded = 0 '\000' > > }, > > srcu_gp_seq_needed = 288, > > srcu_gp_seq_needed_exp = 0, > > srcu_cblist_invoking = false, > > ... > > > > Root cause (we can't reproduce the issue, so the result is base on crash > > analysis and code analysis): > > - The leaked entry on segcblist has incorrect gp_seq (SDP is waiting 288). > > - The leaked entry on segcblist seems has been "accelerated". > > - The incorrect accelerate maybe caused by srcu_invoke_callbacks, because > > srcu_invoke_callbacks may be using incorrect GP seq doing accelerate. > > - Here are two emulated case may cause the "accelerate issue": > > > > > > CASE1: entry of CPU 136 belongs to GP 288 was accelerated to GP 292 > > - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks. > > - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry > > to SDP. > > - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run > > faster than CPU 136, start GP, set rcu_seq_start. > > - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need > > to start GP. > > - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks, > > "accelerate" the seq of new added entry to 292 (it should be 288). > > But srcu_gp_seq is at 304 right now. How does it matter that the CB is > marked for 292? It should be ready to execute anyway even at 292. Note > the meaning of "acceleration", the idea is to start conservatively and > move the callbacks forward as more accurate information is available. > Considering this, 292 initially should be OK IMHO (that's just more > conservative than 288).. > > > CASE2: entry of CPU 136 belongs to GP 288 was accelerated to GP 296 > > - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks. > > - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry > > to SDP. > > - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run > > faster than CPU 136, start GP. > > - [CPU 041] [GP 284-288] finished, workload run faster, finish GP. > > - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need > > to start GP. > > - [CPU 017] [GP 288-292] starting, new synchronize_srcu request, start GP, > > set rcu_seq_start. > > - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks, > > "accelerate" new added entry to 296 (it should be 288). > > I had the same analysis here. > > > /* > > * CASE1 > > * - entry of CPU 136 expecting GP 288 was accelerated to GP 292 > > * - GP0: 280-284, GP1: 284-288. 
> > */ > > > > /* [GP0][CPU-136] */ > > process_srcu { > > srcu_gp_end > > } > > /* [GP1][CPU-136] */ > > synchronize_srcu { > > __call_srcu { > > rcu_segcblist_enqueue > > /* [GP1][CPU-041] */ > > synchronize_srcu { > > __call_srcu { > > srcu_funnel_gp_start > > srcu_gp_start > > rcu_seq_start > > } > > } > > /* [GP1][CPU-136] */ > > srcu_funnel_gp_start > > } > > } > > /* [GP0][CPU-136] */ > > srcu_invoke_callbacks { > > rcu_segcblist_accelerate > > } > > > > > > /* > > * CASE2 > > * - entry of CPU 136 belongs to GP 288 was accelerated to GP 296. > > * - GP0: 280-284, GP1: 284-288, GP2: 288-292. > > */ > > > > /* [GP0][CPU-136] */ > > process_srcu { > > srcu_gp_end > > } > > > > /* [GP1][CPU-136] */ > > synchronize_srcu { > > __call_srcu { > > rcu_segcblist_enqueue > > /* [GP1][CPU-041] */ > > synchronize_srcu { > > __call_srcu { > > srcu_funnel_gp_start > > srcu_gp_start > > } > > } > > process_srcu { > > srcu_gp_end > > rcu_seq_end > > } > > /* [GP1][CPU-136] */ > > srcu_funnel_gp_start > > } > > } > > /* [GP0][CPU-136] */ > > srcu_invoke_callbacks { > > If srcu_invoke_callbacks() was really called for the rdp, I would have > expected rcu_segcblist_advance() to advance all those pending > callbacks to 304. > > I posit that probably srcu_invoke_callbacks() is not even being called > in the first place, otherwise why is the DONE segment count still 0? > DONE should not be 0, otherwise callbacks will not run. The segcblist > of the rdp seems out of date perhaps due to rcu_invoke_callbacks() not > being called due to some other bug (or a workqueue/timer issue causing > the srcu_invoke_callbacks() to not run?). > > I wonder if you would consider trying to reproduce it with the > rcutorture module's SRCU testing in the chance that it occurs with it > on your setup. > > thanks, > > - Joel As a note, I had additional comments below but I accidentally signed my name a bit earlier - fyi. Thanks. -Joel > > > > /* [GP2][CPU-017] */ > > synchronize_srcu { > > __call_srcu { > > srcu_funnel_gp_start > > srcu_gp_start > > rcu_seq_start > > } > > } > > /* [GP0][CPU-136] */ > > rcu_segcblist_accelerate > > } > > > > > > Potential fix: > > We shouldn't accelerate the segcblist in srcu_invoke_callbacks base snapshot > > of current gp seq of SRCU struct, how about accelerate the sda base > > sdp->srcu_gp_seq_needed, maybe like this: > > > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > index 21115ffb6c44..ff8579abc924 100644 > > --- a/kernel/rcu/srcutree.c > > +++ b/kernel/rcu/srcutree.c > > @@ -1197,7 +1197,7 @@ static void srcu_invoke_callbacks(struct > > work_struct *work) > > spin_lock_irq_rcu_node(sdp); > > rcu_segcblist_insert_count(&sdp->srcu_cblist, &ready_cbs); > > (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, > > - rcu_seq_snap(&ssp->srcu_gp_seq)); > > + rcu_seq_snap(&sdp->srcu_gp_seq_needed)); > > sdp->srcu_cblist_invoking = false; > > more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist); > > spin_unlock_irq_rcu_node(sdp); > > I think "rcu_seq_snap(&ssp->srcu_gp_seq))" is correct IMHO, > rcu_seq_snap() says which what's the gp_seq at which a full GP has > passed from the moment rcu_seq_snap() is called. That's precisely when > it is safe to call all those non-accelerated callbacks. > > Basically, if any of the bits are set in the state portion of the > sequence of the gp_seq number, then it returns (COUNTER_PORTION + 2) > << RCU_SEQ_CTR_SHIFT. > Otherwise, it returns (COUNTER_PORTION + 1) << RCU_SEQ_CTR_SHIFT. > > > > > > > > > > > Thanks for reading! 
> > > > > > Reference detail log of crash dump information: > > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/bt.f.252684.log > > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.partial.log > > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.raw.log > > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.lock.log > > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.raw.log
On Fri, Sep 29, 2023 at 6:05 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > Hello, > Firstly, kudos to the detailed report and analysis. Rare failures are > hard and your usage crash/kdump is awesome to dig deeper into the > issue.. > > On Thu, Sep 28, 2023 at 3:59 AM zhuangel570 <zhuangel570@gmail.com> wrote: > > > > Hi, > > > > We encounter SRCU hung issue in stable tree 5.4.203, we are running VM create > > and destroy concurrent test, the issue happens after several weeks. Now we > > didn't have a way to reproduce this issue, the issue happens randomly, this > > is the second time we found it in this year. > > > > We did some investigation with crash tools, hope to get some useful clues > > about the issue, try to find root cause of the issue (further find a way to > > fix or workaround it), here is some of our investigation, hope to get > > feedback from SRCU experts. > > > > [3144162.608392] INFO: task kworker/136:12:252684 blocked for more > > than 122 seconds. > > [3144162.615986] Tainted: G O K 5.4.203-1-tlinux4-0011.1 #1 > > [3144162.623053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > > disables this message. > > [3144162.631162] kworker/136:12 D 0 252684 2 0x90004000 > > [3144162.631189] Workqueue: kvm-irqfd-cleanup irqfd_shutdown [kvm] > > [3144162.631192] Call Trace: > > [3144162.631202] __schedule+0x2ee/0x660 > > [3144162.631206] schedule+0x33/0xa0 > > [3144162.631209] schedule_timeout+0x1c4/0x340 > > [3144162.631214] ? update_load_avg+0x82/0x660 > > [3144162.631217] ? raw_spin_rq_lock_nested+0x1f/0x30 > > [3144162.631218] wait_for_completion+0x119/0x180 > > [3144162.631220] ? wake_up_q+0x80/0x80 > > [3144162.631224] __synchronize_srcu.part.19+0x81/0xb0 > > [3144162.631226] ? __bpf_trace_rcu_utilization+0x10/0x10 > > [3144162.631227] synchronize_srcu+0x5f/0xc0 > > [3144162.631236] irqfd_shutdown+0x3c/0xb0 [kvm] > > [3144162.631239] ? __schedule+0x2f6/0x660 > > [3144162.631243] process_one_work+0x19a/0x3a0 > > [3144162.631244] worker_thread+0x37/0x3a0 > > [3144162.631247] kthread+0x117/0x140 > > [3144162.631247] ? process_one_work+0x3a0/0x3a0 > > [3144162.631248] ? __kthread_cancel_work+0x40/0x40 > > [3144162.631250] ret_from_fork+0x1f/0x30 > > > > Test workload: > > - Repeated concurrency testing, include Create VM, service probe, destroy VM. > > - VM using virito device, so kvm->irq_srcu is using frequently. > > - x86_64 Machine with 191 core. > > > > Crash analysis clues: > > - The hung task is run on CPU 136, waiting synchronize_srcu to be completed. > > - The current GP of SRCU struct is 304 (srcu_gp_seq and srcu_gp_seq_needed). > > - The SNP (node) of SRCU struct has 2 levels. > > - The SNP (node) of SRCU struct show all SNP now in GP 304 (srcu_have_cbs). > > - The SNP (node) of SRCU struct show CPU 136 belongs to group [128,143]. > > - The SDP (per-cpu) of SRCU struct show lock and unlock matches. > > - The SDP (per-cpu) of CPU 136 segcblist has one entry in pending list. > > - The pending entry is just the calling entry of hung kwoker task. > > - The SDP (per-cpu) of CPU 136 is waiting GP 288, gp_seq of segcblist is 296. > > - The SDP (per-cpu) of CPU 136 segcblist has no entry in DONE list. > > - The work-queue of SRCU and SDP of CPU 136 is idle. > > - More details could be found via reference at the end of this mail. 
> > > > crash> struct srcu_struct.srcu_idx,srcu_gp_seq,srcu_gp_seq_needed,srcu_gp_seq_needed_exp,sda > > 0xffffc900f100e878 > > srcu_idx = 76 > > srcu_gp_seq = 304 > > srcu_gp_seq_needed = 304 > > srcu_gp_seq_needed_exp = 304 > > sda = 0x5fff9ac17800 > > > > crash> struct srcu_data 0x5fff9ac17800:136 > > [136]: ffffe8bee9817800 > > ... > > srcu_cblist = { > > head = 0xffffc900e8413dd0, > > tails = {0xffffe8bee9817848, 0xffffc900e8413dd0, > > 0xffffc900e8413dd0, 0xffffc900e8413dd0}, > > gp_seq = {0, 296, 296, 0}, > > I am actually a bit puzzled why the DONE segment count is 0. I would > have expected it to be >= 296 in this situation. > > > len = 1, > > len_lazy = 0, > > enabled = 1 '\001', > > offloaded = 0 '\000' > > }, > > srcu_gp_seq_needed = 288, > > srcu_gp_seq_needed_exp = 0, > > srcu_cblist_invoking = false, > > ... > > > > Root cause (we can't reproduce the issue, so the result is base on crash > > analysis and code analysis): > > - The leaked entry on segcblist has incorrect gp_seq (SDP is waiting 288). > > - The leaked entry on segcblist seems has been "accelerated". > > - The incorrect accelerate maybe caused by srcu_invoke_callbacks, because > > srcu_invoke_callbacks may be using incorrect GP seq doing accelerate. > > - Here are two emulated case may cause the "accelerate issue": > > > > > > CASE1: entry of CPU 136 belongs to GP 288 was accelerated to GP 292 > > - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks. > > - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry > > to SDP. > > - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run > > faster than CPU 136, start GP, set rcu_seq_start. > > - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need > > to start GP. > > - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks, > > "accelerate" the seq of new added entry to 292 (it should be 288). > > But srcu_gp_seq is at 304 right now. How does it matter that the CB is > marked for 292? It should be ready to execute anyway even at 292. Note > the meaning of "acceleration", the idea is to start conservatively and > move the callbacks forward as more accurate information is available. > Considering this, 292 initially should be OK IMHO (that's just more > conservative than 288).. > > > CASE2: entry of CPU 136 belongs to GP 288 was accelerated to GP 296 > > - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks. > > - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry > > to SDP. > > - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run > > faster than CPU 136, start GP. > > - [CPU 041] [GP 284-288] finished, workload run faster, finish GP. > > - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need > > to start GP. > > - [CPU 017] [GP 288-292] starting, new synchronize_srcu request, start GP, > > set rcu_seq_start. > > - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks, > > "accelerate" new added entry to 296 (it should be 288). > > I had the same analysis here. > > > /* > > * CASE1 > > * - entry of CPU 136 expecting GP 288 was accelerated to GP 292 > > * - GP0: 280-284, GP1: 284-288. 
> > */ > > > > /* [GP0][CPU-136] */ > > process_srcu { > > srcu_gp_end > > } > > /* [GP1][CPU-136] */ > > synchronize_srcu { > > __call_srcu { > > rcu_segcblist_enqueue > > /* [GP1][CPU-041] */ > > synchronize_srcu { > > __call_srcu { > > srcu_funnel_gp_start > > srcu_gp_start > > rcu_seq_start > > } > > } > > /* [GP1][CPU-136] */ > > srcu_funnel_gp_start > > } > > } > > /* [GP0][CPU-136] */ > > srcu_invoke_callbacks { > > rcu_segcblist_accelerate > > } > > > > > > /* > > * CASE2 > > * - entry of CPU 136 belongs to GP 288 was accelerated to GP 296. > > * - GP0: 280-284, GP1: 284-288, GP2: 288-292. > > */ > > > > /* [GP0][CPU-136] */ > > process_srcu { > > srcu_gp_end > > } > > > > /* [GP1][CPU-136] */ > > synchronize_srcu { > > __call_srcu { > > rcu_segcblist_enqueue > > /* [GP1][CPU-041] */ > > synchronize_srcu { > > __call_srcu { > > srcu_funnel_gp_start > > srcu_gp_start > > } > > } > > process_srcu { > > srcu_gp_end > > rcu_seq_end > > } > > /* [GP1][CPU-136] */ > > srcu_funnel_gp_start > > } > > } > > /* [GP0][CPU-136] */ > > srcu_invoke_callbacks { > > If srcu_invoke_callbacks() was really called for the rdp, I would have > expected rcu_segcblist_advance() to advance all those pending > callbacks to 304. > > I posit that probably srcu_invoke_callbacks() is not even being called > in the first place, otherwise why is the DONE segment count still 0? > DONE should not be 0, otherwise callbacks will not run. The segcblist > of the rdp seems out of date perhaps due to rcu_invoke_callbacks() not > being called due to some other bug (or a workqueue/timer issue causing > the srcu_invoke_callbacks() to not run?). > > I wonder if you would consider trying to reproduce it with the > rcutorture module's SRCU testing in the chance that it occurs with it > on your setup. As an enthusiast, I have an idea, could we use the ftrace mechanism to log ftrace just before the stall for more detailed analysis. Recently I used the cyclictest --tracemark option to analyze the ftrace of the system just before the cyclictest's -b latency is exceeded. [1] Thanks Zhouyi [1] https://man.archlinux.org/man/cyclictest.8.en > > thanks, > > - Joel > > > > /* [GP2][CPU-017] */ > > synchronize_srcu { > > __call_srcu { > > srcu_funnel_gp_start > > srcu_gp_start > > rcu_seq_start > > } > > } > > /* [GP0][CPU-136] */ > > rcu_segcblist_accelerate > > } > > > > > > Potential fix: > > We shouldn't accelerate the segcblist in srcu_invoke_callbacks base snapshot > > of current gp seq of SRCU struct, how about accelerate the sda base > > sdp->srcu_gp_seq_needed, maybe like this: > > > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > index 21115ffb6c44..ff8579abc924 100644 > > --- a/kernel/rcu/srcutree.c > > +++ b/kernel/rcu/srcutree.c > > @@ -1197,7 +1197,7 @@ static void srcu_invoke_callbacks(struct > > work_struct *work) > > spin_lock_irq_rcu_node(sdp); > > rcu_segcblist_insert_count(&sdp->srcu_cblist, &ready_cbs); > > (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, > > - rcu_seq_snap(&ssp->srcu_gp_seq)); > > + rcu_seq_snap(&sdp->srcu_gp_seq_needed)); > > sdp->srcu_cblist_invoking = false; > > more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist); > > spin_unlock_irq_rcu_node(sdp); > > I think "rcu_seq_snap(&ssp->srcu_gp_seq))" is correct IMHO, > rcu_seq_snap() says which what's the gp_seq at which a full GP has > passed from the moment rcu_seq_snap() is called. That's precisely when > it is safe to call all those non-accelerated callbacks. 
> > Basically, if any of the bits are set in the state portion of the > sequence of the gp_seq number, then it returns (COUNTER_PORTION + 2) > << RCU_SEQ_CTR_SHIFT. > Otherwise, it returns (COUNTER_PORTION + 1) << RCU_SEQ_CTR_SHIFT. > > > > > > > > > > > Thanks for reading! > > > > > > Reference detail log of crash dump information: > > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/bt.f.252684.log > > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.partial.log > > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.raw.log > > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.lock.log > > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.raw.log
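A minimal sketch of the trace_marker technique Zhouyi mentions (what cyclictest --tracemark does), assuming the standard tracefs paths under /sys/kernel/debug/tracing; the trigger around it is hypothetical and would live in the VM create/destroy test harness:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Tag the ftrace ring buffer and stop tracing, so the window around the
 * suspected stall stays in the buffer for later inspection via "trace". */
static void tracemark_and_freeze(const char *msg)
{
	int fd;

	fd = open("/sys/kernel/debug/tracing/trace_marker", O_WRONLY);
	if (fd >= 0) {
		write(fd, msg, strlen(msg));
		close(fd);
	}

	fd = open("/sys/kernel/debug/tracing/tracing_on", O_WRONLY);
	if (fd >= 0) {
		write(fd, "0", 1);
		close(fd);
	}
}

int main(void)
{
	/* Hypothetical trigger: fire when a synchronize_srcu() caller
	 * (e.g. a VM destroy iteration) exceeds its expected latency. */
	tracemark_and_freeze("srcu-hang-suspected\n");
	return 0;
}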
Le Thu, Sep 28, 2023 at 05:39:17PM -0400, Joel Fernandes a écrit : > If srcu_invoke_callbacks() was really called for the rdp, I would have > expected rcu_segcblist_advance() to advance all those pending > callbacks to 304. > > I posit that probably srcu_invoke_callbacks() is not even being called > in the first place, otherwise why is the DONE segment count still 0? > DONE should not be 0, otherwise callbacks will not run. The segcblist > of the rdp seems out of date perhaps due to rcu_invoke_callbacks() not > being called due to some other bug (or a workqueue/timer issue causing > the srcu_invoke_callbacks() to not run?). Speaking of, I'm looking at srcu_gp_end() and the comment after releasing srcu_gp_mutex says: "/* A new grace period can start at this point. But only one. */" But what ensures that? Because if we get preempted here and more than 3 grace periods elapse, then ->srcu_have_cbs[the_same_idx] could get overridden and then all sorts of races may lead to srcu_invoke_callbacks() not being invoked. I'm not asking too loudly though because I'm sure I'm missing something obvious, I see a lot of mirages these days...
On Sat, Sep 30, 2023 at 4:15 AM Frederic Weisbecker <frederic@kernel.org> wrote: > > Le Thu, Sep 28, 2023 at 05:39:17PM -0400, Joel Fernandes a écrit : > > If srcu_invoke_callbacks() was really called for the rdp, I would have > > expected rcu_segcblist_advance() to advance all those pending > > callbacks to 304. > > > > I posit that probably srcu_invoke_callbacks() is not even being called > > in the first place, otherwise why is the DONE segment count still 0? > > DONE should not be 0, otherwise callbacks will not run. The segcblist > > of the rdp seems out of date perhaps due to rcu_invoke_callbacks() not > > being called due to some other bug (or a workqueue/timer issue causing > > the srcu_invoke_callbacks() to not run?). > > Speaking of, I'm looking at srcu_gp_end() and the comment after releasing > srcu_gp_mutex says: > > "/* A new grace period can start at this point. But only one. */" > > But what ensures that? > I think ->srcu_cb_mutex ensures that. /* Prevent more than one additional grace period. */ mutex_lock(&sup->srcu_cb_mutex); Thanks Neeraj > Because if we get preempted here and more than 3 grace periods elapse, then > ->srcu_have_cbs[the_same_idx] could get overriden and then all sorts of race > may lead to srcu_invoke_callbacks() to not be invoked. > > I'm not asking too loudly though because I'm sure I'm missing something obvious, > I see a lot of mirages these days...
On Sat, Sep 30, 2023 at 08:15:06AM +0530, Neeraj upadhyay wrote: > On Sat, Sep 30, 2023 at 4:15 AM Frederic Weisbecker <frederic@kernel.org> wrote: > > > > Le Thu, Sep 28, 2023 at 05:39:17PM -0400, Joel Fernandes a écrit : > > > If srcu_invoke_callbacks() was really called for the rdp, I would have > > > expected rcu_segcblist_advance() to advance all those pending > > > callbacks to 304. > > > > > > I posit that probably srcu_invoke_callbacks() is not even being called > > > in the first place, otherwise why is the DONE segment count still 0? > > > DONE should not be 0, otherwise callbacks will not run. The segcblist > > > of the rdp seems out of date perhaps due to rcu_invoke_callbacks() not > > > being called due to some other bug (or a workqueue/timer issue causing > > > the srcu_invoke_callbacks() to not run?). > > > > Speaking of, I'm looking at srcu_gp_end() and the comment after releasing > > srcu_gp_mutex says: > > > > "/* A new grace period can start at this point. But only one. */" > > > > But what ensures that? > > > > I think ->srcu_cb_mutex ensures that. > > /* Prevent more than one additional grace period. */ > mutex_lock(&sup->srcu_cb_mutex); Yes but right after it's released (before callbacks are scheduled for execution)?
Le Sat, Sep 30, 2023 at 08:15:06AM +0530, Neeraj upadhyay a écrit : > On Sat, Sep 30, 2023 at 4:15 AM Frederic Weisbecker <frederic@kernel.org> wrote: > > > > Le Thu, Sep 28, 2023 at 05:39:17PM -0400, Joel Fernandes a écrit : > > > If srcu_invoke_callbacks() was really called for the rdp, I would have > > > expected rcu_segcblist_advance() to advance all those pending > > > callbacks to 304. > > > > > > I posit that probably srcu_invoke_callbacks() is not even being called > > > in the first place, otherwise why is the DONE segment count still 0? > > > DONE should not be 0, otherwise callbacks will not run. The segcblist > > > of the rdp seems out of date perhaps due to rcu_invoke_callbacks() not > > > being called due to some other bug (or a workqueue/timer issue causing > > > the srcu_invoke_callbacks() to not run?). > > > > Speaking of, I'm looking at srcu_gp_end() and the comment after releasing > > srcu_gp_mutex says: > > > > "/* A new grace period can start at this point. But only one. */" > > > > But what ensures that? > > > > I think ->srcu_cb_mutex ensures that. > > /* Prevent more than one additional grace period. */ > mutex_lock(&sup->srcu_cb_mutex); Oh I confused ->srcu_cb_mutex with ->srcu_gp_mutex, didn't I? ;-)
On Sat, Sep 30, 2023 at 2:40 PM Frederic Weisbecker <frederic@kernel.org> wrote: > > Le Sat, Sep 30, 2023 at 08:15:06AM +0530, Neeraj upadhyay a écrit : > > On Sat, Sep 30, 2023 at 4:15 AM Frederic Weisbecker <frederic@kernel.org> wrote: > > > > > > Le Thu, Sep 28, 2023 at 05:39:17PM -0400, Joel Fernandes a écrit : > > > > If srcu_invoke_callbacks() was really called for the rdp, I would have > > > > expected rcu_segcblist_advance() to advance all those pending > > > > callbacks to 304. > > > > > > > > I posit that probably srcu_invoke_callbacks() is not even being called > > > > in the first place, otherwise why is the DONE segment count still 0? > > > > DONE should not be 0, otherwise callbacks will not run. The segcblist > > > > of the rdp seems out of date perhaps due to rcu_invoke_callbacks() not > > > > being called due to some other bug (or a workqueue/timer issue causing > > > > the srcu_invoke_callbacks() to not run?). > > > > > > Speaking of, I'm looking at srcu_gp_end() and the comment after releasing > > > srcu_gp_mutex says: > > > > > > "/* A new grace period can start at this point. But only one. */" > > > > > > But what ensures that? > > > > > > > I think ->srcu_cb_mutex ensures that. > > > > /* Prevent more than one additional grace period. */ > > mutex_lock(&sup->srcu_cb_mutex); > > Oh I confused ->srcu_cb_mutex with ->srcu_gp_mutex, didn't I? ;-) Maybe :) From your reply, I thought you meant srcu_invoke_callbacks() for one GP might execute after multiple subsequent GPs have completed, which makes sense. Thanks Neeraj
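For readers following the ->srcu_cb_mutex / ->srcu_gp_mutex exchange above, the shape of srcu_gp_end() is roughly the following (a heavily elided paraphrase of kernel/rcu/srcutree.c with 5.4-era field names, not the verbatim source; ->srcu_gp_mutex is already held on entry):

static void srcu_gp_end(struct srcu_struct *ssp)
{
	/* Prevent more than one additional grace period. */
	mutex_lock(&ssp->srcu_cb_mutex);

	/* End the current grace period. */
	spin_lock_irq_rcu_node(ssp);
	rcu_seq_end(&ssp->srcu_gp_seq);
	/* ... bookkeeping elided ... */
	spin_unlock_irq_rcu_node(ssp);
	mutex_unlock(&ssp->srcu_gp_mutex);
	/* A new grace period can start at this point. But only one. */

	/* Walk the srcu_node tree and schedule srcu_invoke_callbacks()
	 * on each srcu_data that reported callbacks for this GP index. */

	/* Callback initiation done, allow grace periods after next. */
	mutex_unlock(&ssp->srcu_cb_mutex);

	/* Start another grace period if one is still needed. */
}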
On Fri, Sep 29, 2023 at 3:35 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > Hello, > Firstly, kudos to the detailed report and analysis. Rare failures are > hard and your usage crash/kdump is awesome to dig deeper into the > issue.. > > On Thu, Sep 28, 2023 at 3:59 AM zhuangel570 <zhuangel570@gmail.com> wrote: > > > > Hi, > > > > We encounter SRCU hung issue in stable tree 5.4.203, we are running VM create > > and destroy concurrent test, the issue happens after several weeks. Now we > > didn't have a way to reproduce this issue, the issue happens randomly, this > > is the second time we found it in this year. > > > > > > > > CASE1: entry of CPU 136 belongs to GP 288 was accelerated to GP 292 > > - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks. > > - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry > > to SDP. > > - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run > > faster than CPU 136, start GP, set rcu_seq_start. > > - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need > > to start GP. > > - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks, > > "accelerate" the seq of new added entry to 292 (it should be 288). > > But srcu_gp_seq is at 304 right now. How does it matter that the CB is > marked for 292? It should be ready to execute anyway even at 292. Note > the meaning of "acceleration", the idea is to start conservatively and > move the callbacks forward as more accurate information is available. > Considering this, 292 initially should be OK IMHO (that's just more > conservative than 288).. > Maybe it matters, as for a CPU, the callbacks will only be scheduled in srcu_gp_end() for the GPs, for which it has updated ->srcu_data_have_cbs[idx] and ->srcu_have_cbs[idx] > > > > /* > > * CASE2 > > * - entry of CPU 136 belongs to GP 288 was accelerated to GP 296. > > * - GP0: 280-284, GP1: 284-288, GP2: 288-292. > > */ > > > > /* [GP0][CPU-136] */ > > process_srcu { > > srcu_gp_end > > } > > > > /* [GP1][CPU-136] */ > > synchronize_srcu { > > __call_srcu { > > rcu_segcblist_enqueue > > /* [GP1][CPU-041] */ > > synchronize_srcu { > > __call_srcu { > > srcu_funnel_gp_start > > srcu_gp_start > > } > > } > > process_srcu { > > srcu_gp_end > > rcu_seq_end > > } > > /* [GP1][CPU-136] */ > > srcu_funnel_gp_start > > } > > } > > /* [GP0][CPU-136] */ > > srcu_invoke_callbacks { > > If srcu_invoke_callbacks() was really called for the rdp, I would have > expected rcu_segcblist_advance() to advance all those pending > callbacks to 304. If the callback is in NEXT_TAIL and not assigned GP num, rcu_segcblist_advance() won't move it and next accelerate in srcu_invoke_callbacks() will assign it the next gp sequence num. Thanks Neeraj > > I posit that probably srcu_invoke_callbacks() is not even being called > in the first place, otherwise why is the DONE segment count still 0? > DONE should not be 0, otherwise callbacks will not run. The segcblist > of the rdp seems out of date perhaps due to rcu_invoke_callbacks() not > being called due to some other bug (or a workqueue/timer issue causing > the srcu_invoke_callbacks() to not run?). > > I wonder if you would consider trying to reproduce it with the > rcutorture module's SRCU testing in the chance that it occurs with it > on your setup. 
> > thanks, > > - Joel > > > > /* [GP2][CPU-017] */ > > synchronize_srcu { > > __call_srcu { > > srcu_funnel_gp_start > > srcu_gp_start > > rcu_seq_start > > } > > } > > /* [GP0][CPU-136] */ > > rcu_segcblist_accelerate > > } > > > > > > Potential fix: > > We shouldn't accelerate the segcblist in srcu_invoke_callbacks base snapshot > > of current gp seq of SRCU struct, how about accelerate the sda base > > sdp->srcu_gp_seq_needed, maybe like this: > > > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > index 21115ffb6c44..ff8579abc924 100644 > > --- a/kernel/rcu/srcutree.c > > +++ b/kernel/rcu/srcutree.c > > @@ -1197,7 +1197,7 @@ static void srcu_invoke_callbacks(struct > > work_struct *work) > > spin_lock_irq_rcu_node(sdp); > > rcu_segcblist_insert_count(&sdp->srcu_cblist, &ready_cbs); > > (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, > > - rcu_seq_snap(&ssp->srcu_gp_seq)); > > + rcu_seq_snap(&sdp->srcu_gp_seq_needed)); > > sdp->srcu_cblist_invoking = false; > > more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist); > > spin_unlock_irq_rcu_node(sdp); > > I think "rcu_seq_snap(&ssp->srcu_gp_seq))" is correct IMHO, > rcu_seq_snap() says which what's the gp_seq at which a full GP has > passed from the moment rcu_seq_snap() is called. That's precisely when > it is safe to call all those non-accelerated callbacks. > > Basically, if any of the bits are set in the state portion of the > sequence of the gp_seq number, then it returns (COUNTER_PORTION + 2) > << RCU_SEQ_CTR_SHIFT. > Otherwise, it returns (COUNTER_PORTION + 1) << RCU_SEQ_CTR_SHIFT. > > > > > > > > > > > Thanks for reading! > > > > > > Reference detail log of crash dump information: > > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/bt.f.252684.log > > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.partial.log > > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.raw.log > > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.lock.log > > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.raw.log
On Fri, Sep 29, 2023 at 3:04 AM zhuangel570 <zhuangel570@gmail.com> wrote: > > Hi, > > We encounter SRCU hung issue in stable tree 5.4.203, we are running VM create > and destroy concurrent test, the issue happens after several weeks. Now we > didn't have a way to reproduce this issue, the issue happens randomly, this > is the second time we found it in this year. > > We did some investigation with crash tools, hope to get some useful clues > about the issue, try to find root cause of the issue (further find a way to > fix or workaround it), here is some of our investigation, hope to get > feedback from SRCU experts. > > [3144162.608392] INFO: task kworker/136:12:252684 blocked for more > than 122 seconds. > [3144162.615986] Tainted: G O K 5.4.203-1-tlinux4-0011.1 #1 > [3144162.623053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > [3144162.631162] kworker/136:12 D 0 252684 2 0x90004000 > [3144162.631189] Workqueue: kvm-irqfd-cleanup irqfd_shutdown [kvm] > [3144162.631192] Call Trace: > [3144162.631202] __schedule+0x2ee/0x660 > [3144162.631206] schedule+0x33/0xa0 > [3144162.631209] schedule_timeout+0x1c4/0x340 > [3144162.631214] ? update_load_avg+0x82/0x660 > [3144162.631217] ? raw_spin_rq_lock_nested+0x1f/0x30 > [3144162.631218] wait_for_completion+0x119/0x180 > [3144162.631220] ? wake_up_q+0x80/0x80 > [3144162.631224] __synchronize_srcu.part.19+0x81/0xb0 > [3144162.631226] ? __bpf_trace_rcu_utilization+0x10/0x10 > [3144162.631227] synchronize_srcu+0x5f/0xc0 > [3144162.631236] irqfd_shutdown+0x3c/0xb0 [kvm] > [3144162.631239] ? __schedule+0x2f6/0x660 > [3144162.631243] process_one_work+0x19a/0x3a0 > [3144162.631244] worker_thread+0x37/0x3a0 > [3144162.631247] kthread+0x117/0x140 > [3144162.631247] ? process_one_work+0x3a0/0x3a0 > [3144162.631248] ? __kthread_cancel_work+0x40/0x40 > [3144162.631250] ret_from_fork+0x1f/0x30 > > Test workload: > - Repeated concurrency testing, include Create VM, service probe, destroy VM. > - VM using virito device, so kvm->irq_srcu is using frequently. > - x86_64 Machine with 191 core. > > Crash analysis clues: > - The hung task is run on CPU 136, waiting synchronize_srcu to be completed. > - The current GP of SRCU struct is 304 (srcu_gp_seq and srcu_gp_seq_needed). > - The SNP (node) of SRCU struct has 2 levels. > - The SNP (node) of SRCU struct show all SNP now in GP 304 (srcu_have_cbs). > - The SNP (node) of SRCU struct show CPU 136 belongs to group [128,143]. > - The SDP (per-cpu) of SRCU struct show lock and unlock matches. > - The SDP (per-cpu) of CPU 136 segcblist has one entry in pending list. > - The pending entry is just the calling entry of hung kwoker task. > - The SDP (per-cpu) of CPU 136 is waiting GP 288, gp_seq of segcblist is 296. > - The SDP (per-cpu) of CPU 136 segcblist has no entry in DONE list. > - The work-queue of SRCU and SDP of CPU 136 is idle. > - More details could be found via reference at the end of this mail. > > crash> struct srcu_struct.srcu_idx,srcu_gp_seq,srcu_gp_seq_needed,srcu_gp_seq_needed_exp,sda > 0xffffc900f100e878 > srcu_idx = 76 > srcu_gp_seq = 304 > srcu_gp_seq_needed = 304 > srcu_gp_seq_needed_exp = 304 > sda = 0x5fff9ac17800 > > crash> struct srcu_data 0x5fff9ac17800:136 > [136]: ffffe8bee9817800 > ... 
> srcu_cblist = { > head = 0xffffc900e8413dd0, > tails = {0xffffe8bee9817848, 0xffffc900e8413dd0, > 0xffffc900e8413dd0, 0xffffc900e8413dd0}, > gp_seq = {0, 296, 296, 0}, > len = 1, > len_lazy = 0, > enabled = 1 '\001', > offloaded = 0 '\000' > }, > srcu_gp_seq_needed = 288, > srcu_gp_seq_needed_exp = 0, > srcu_cblist_invoking = false, > ... > > Root cause (we can't reproduce the issue, so the result is base on crash > analysis and code analysis): > - The leaked entry on segcblist has incorrect gp_seq (SDP is waiting 288). > - The leaked entry on segcblist seems has been "accelerated". > - The incorrect accelerate maybe caused by srcu_invoke_callbacks, because > srcu_invoke_callbacks may be using incorrect GP seq doing accelerate. > - Here are two emulated case may cause the "accelerate issue": > > > CASE1: entry of CPU 136 belongs to GP 288 was accelerated to GP 292 > - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks. > - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry > to SDP. > - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run > faster than CPU 136, start GP, set rcu_seq_start. > - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need > to start GP. > - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks, > "accelerate" the seq of new added entry to 292 (it should be 288). > > > CASE2: entry of CPU 136 belongs to GP 288 was accelerated to GP 296 > - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks. > - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry > to SDP. > - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run > faster than CPU 136, start GP. > - [CPU 041] [GP 284-288] finished, workload run faster, finish GP. > - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need > to start GP. > - [CPU 017] [GP 288-292] starting, new synchronize_srcu request, start GP, > set rcu_seq_start. > - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks, > "accelerate" new added entry to 296 (it should be 288). > > > /* > * CASE1 > * - entry of CPU 136 expecting GP 288 was accelerated to GP 292 > * - GP0: 280-284, GP1: 284-288. > */ > > /* [GP0][CPU-136] */ > process_srcu { > srcu_gp_end > } > /* [GP1][CPU-136] */ > synchronize_srcu { > __call_srcu { > rcu_segcblist_enqueue > /* [GP1][CPU-041] */ > synchronize_srcu { > __call_srcu { > srcu_funnel_gp_start > srcu_gp_start > rcu_seq_start > } > } > /* [GP1][CPU-136] */ > srcu_funnel_gp_start > } > } > /* [GP0][CPU-136] */ > srcu_invoke_callbacks { > rcu_segcblist_accelerate > } > > > /* > * CASE2 > * - entry of CPU 136 belongs to GP 288 was accelerated to GP 296. > * - GP0: 280-284, GP1: 284-288, GP2: 288-292. 
> */ > > /* [GP0][CPU-136] */ > process_srcu { > srcu_gp_end > } > > /* [GP1][CPU-136] */ > synchronize_srcu { > __call_srcu { > rcu_segcblist_enqueue > /* [GP1][CPU-041] */ > synchronize_srcu { > __call_srcu { > srcu_funnel_gp_start > srcu_gp_start > } > } > process_srcu { > srcu_gp_end > rcu_seq_end > } > /* [GP1][CPU-136] */ > srcu_funnel_gp_start > } > } > /* [GP0][CPU-136] */ > srcu_invoke_callbacks { > /* [GP2][CPU-017] */ > synchronize_srcu { > __call_srcu { > srcu_funnel_gp_start > srcu_gp_start > rcu_seq_start > } > } > /* [GP0][CPU-136] */ > rcu_segcblist_accelerate > } > > > Potential fix: > We shouldn't accelerate the segcblist in srcu_invoke_callbacks base snapshot > of current gp seq of SRCU struct, how about accelerate the sda base > sdp->srcu_gp_seq_needed, maybe like this: > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > index 21115ffb6c44..ff8579abc924 100644 > --- a/kernel/rcu/srcutree.c > +++ b/kernel/rcu/srcutree.c > @@ -1197,7 +1197,7 @@ static void srcu_invoke_callbacks(struct > work_struct *work) > spin_lock_irq_rcu_node(sdp); > rcu_segcblist_insert_count(&sdp->srcu_cblist, &ready_cbs); > (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, > - rcu_seq_snap(&ssp->srcu_gp_seq)); > + rcu_seq_snap(&sdp->srcu_gp_seq_needed)); Can you please explain why rcu_seq_snap() is needed here? Maybe you also need to advance cbs, to ensure that following call to rcu_segcblist_ready_cbs() picks these callbacks, in case srcu_invoke_callbacks() is called after multiple GPs have elapsed. rcu_segcblist_advance(&sdp->srcu_cblist, rcu_seq_current(&ssp->srcu_gp_seq)); ... more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist); Thanks Neeraj > sdp->srcu_cblist_invoking = false; > more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist); > spin_unlock_irq_rcu_node(sdp); > > > Thanks for reading! > > > Reference detail log of crash dump information: > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/bt.f.252684.log > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.partial.log > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.raw.log > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.lock.log > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.raw.log
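As background for the tails[]/gp_seq[] arrays in the crash dump and for the advance/accelerate discussion that follows, these are the segment indexes (from include/linux/rcu_segcblist.h), with a rough description of how the two helpers use them:

#define RCU_DONE_TAIL		0	/* GP already elapsed; only these are
					 * handed out for invocation.          */
#define RCU_WAIT_TAIL		1	/* Waiting for the GP in gp_seq[1].    */
#define RCU_NEXT_READY_TAIL	2	/* Waiting for the GP in gp_seq[2].    */
#define RCU_NEXT_TAIL		3	/* Not yet assigned any GP number.     */

/*
 * Roughly: rcu_segcblist_advance(rsclp, seq) merges into DONE every segment
 * whose recorded gp_seq is already covered by "seq", while
 * rcu_segcblist_accelerate(rsclp, seq) tags the not-yet-assigned callbacks
 * with "seq" where it can.  In the dump above, the lone pending callback is
 * marked for GP 296 (gp_seq[1] == gp_seq[2] == 296) and never reaches DONE,
 * while the sdp itself only ever requested GP 288 (srcu_gp_seq_needed).
 */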
On Sat, Sep 30, 2023 at 6:01 AM Neeraj upadhyay <neeraj.iitr10@gmail.com> wrote: > > On Fri, Sep 29, 2023 at 3:35 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > Hello, > > Firstly, kudos to the detailed report and analysis. Rare failures are > > hard and your usage crash/kdump is awesome to dig deeper into the > > issue.. > > > > On Thu, Sep 28, 2023 at 3:59 AM zhuangel570 <zhuangel570@gmail.com> wrote: > > > > > > Hi, > > > > > > We encounter SRCU hung issue in stable tree 5.4.203, we are running VM create > > > and destroy concurrent test, the issue happens after several weeks. Now we > > > didn't have a way to reproduce this issue, the issue happens randomly, this > > > is the second time we found it in this year. > > > > > > > > > > > > CASE1: entry of CPU 136 belongs to GP 288 was accelerated to GP 292 > > > - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks. > > > - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry > > > to SDP. > > > - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run > > > faster than CPU 136, start GP, set rcu_seq_start. > > > - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need > > > to start GP. > > > - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks, > > > "accelerate" the seq of new added entry to 292 (it should be 288). > > > > But srcu_gp_seq is at 304 right now. How does it matter that the CB is > > marked for 292? It should be ready to execute anyway even at 292. Note > > the meaning of "acceleration", the idea is to start conservatively and > > move the callbacks forward as more accurate information is available. > > Considering this, 292 initially should be OK IMHO (that's just more > > conservative than 288).. > > > > Maybe it matters, as for a CPU, the callbacks will only be scheduled > in srcu_gp_end() for the GPs, for which it has updated ->srcu_data_have_cbs[idx] > and ->srcu_have_cbs[idx] Right but if I am looking at the code correctly, nothing guarantees that srcu_invoke_callbacks is called before srcu_gp_seq can advance. So all callbacks that were previously queued for older grace periods should be run whenever srcu_invoke_callbacks() eventually runs. That's why I was thinking that part looked normal to me (segments having older GP numbers). > > > > > > /* > > > * CASE2 > > > * - entry of CPU 136 belongs to GP 288 was accelerated to GP 296. > > > * - GP0: 280-284, GP1: 284-288, GP2: 288-292. > > > */ > > > > > > /* [GP0][CPU-136] */ > > > process_srcu { > > > srcu_gp_end > > > } > > > > > > /* [GP1][CPU-136] */ > > > synchronize_srcu { > > > __call_srcu { > > > rcu_segcblist_enqueue > > > /* [GP1][CPU-041] */ > > > synchronize_srcu { > > > __call_srcu { > > > srcu_funnel_gp_start > > > srcu_gp_start > > > } > > > } > > > process_srcu { > > > srcu_gp_end > > > rcu_seq_end > > > } > > > /* [GP1][CPU-136] */ > > > srcu_funnel_gp_start > > > } > > > } > > > /* [GP0][CPU-136] */ > > > srcu_invoke_callbacks { > > > > If srcu_invoke_callbacks() was really called for the rdp, I would have > > expected rcu_segcblist_advance() to advance all those pending > > callbacks to 304. > > If the callback is in NEXT_TAIL and not assigned GP num, > rcu_segcblist_advance() won't move it and next accelerate in > srcu_invoke_callbacks() will > assign it the next gp sequence num. Sure, and after that again it will call srcu_schedule_cbs_sdp() so that should be fine and the next workqueue invocation srcu_invoke_callbacks() can advance at that time. 
Right?

	if (more) {
		srcu_schedule_cbs_sdp(sdp, 0);
	}

thanks,

 - Joel
On Sun, Oct 1, 2023 at 5:49 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > On Sat, Sep 30, 2023 at 6:01 AM Neeraj upadhyay <neeraj.iitr10@gmail.com> wrote: > > > > On Fri, Sep 29, 2023 at 3:35 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > Hello, > > > Firstly, kudos to the detailed report and analysis. Rare failures are > > > hard and your usage crash/kdump is awesome to dig deeper into the > > > issue.. > > > > > > On Thu, Sep 28, 2023 at 3:59 AM zhuangel570 <zhuangel570@gmail.com> wrote: > > > > > > > > Hi, > > > > > > > > We encounter SRCU hung issue in stable tree 5.4.203, we are running VM create > > > > and destroy concurrent test, the issue happens after several weeks. Now we > > > > didn't have a way to reproduce this issue, the issue happens randomly, this > > > > is the second time we found it in this year. > > > > > > > > > > > > > > > > CASE1: entry of CPU 136 belongs to GP 288 was accelerated to GP 292 > > > > - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks. > > > > - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry > > > > to SDP. > > > > - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run > > > > faster than CPU 136, start GP, set rcu_seq_start. > > > > - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need > > > > to start GP. > > > > - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks, > > > > "accelerate" the seq of new added entry to 292 (it should be 288). > > > > > > But srcu_gp_seq is at 304 right now. How does it matter that the CB is > > > marked for 292? It should be ready to execute anyway even at 292. Note > > > the meaning of "acceleration", the idea is to start conservatively and > > > move the callbacks forward as more accurate information is available. > > > Considering this, 292 initially should be OK IMHO (that's just more > > > conservative than 288).. > > > > > > > Maybe it matters, as for a CPU, the callbacks will only be scheduled > > in srcu_gp_end() for the GPs, for which it has updated ->srcu_data_have_cbs[idx] > > and ->srcu_have_cbs[idx] > > Right but if I am looking at the code correctly, nothing guarantees > that srcu_invoke_callbacks is called before srcu_gp_seq can advance. > So all callbacks that were previously queued for older grace periods > should be run whenever srcu_invoke_callbacks() eventually runs. That's > why I was thinking that part looked normal to me (segments having > older GP numbers). > > > > > > > > > /* > > > > * CASE2 > > > > * - entry of CPU 136 belongs to GP 288 was accelerated to GP 296. > > > > * - GP0: 280-284, GP1: 284-288, GP2: 288-292. > > > > */ > > > > > > > > /* [GP0][CPU-136] */ > > > > process_srcu { > > > > srcu_gp_end > > > > } > > > > > > > > /* [GP1][CPU-136] */ > > > > synchronize_srcu { > > > > __call_srcu { > > > > rcu_segcblist_enqueue > > > > /* [GP1][CPU-041] */ > > > > synchronize_srcu { > > > > __call_srcu { > > > > srcu_funnel_gp_start > > > > srcu_gp_start > > > > } > > > > } > > > > process_srcu { > > > > srcu_gp_end > > > > rcu_seq_end > > > > } > > > > /* [GP1][CPU-136] */ > > > > srcu_funnel_gp_start > > > > } > > > > } > > > > /* [GP0][CPU-136] */ > > > > srcu_invoke_callbacks { > > > > > > If srcu_invoke_callbacks() was really called for the rdp, I would have > > > expected rcu_segcblist_advance() to advance all those pending > > > callbacks to 304. 
> > > > If the callback is in NEXT_TAIL and not assigned GP num, > > rcu_segcblist_advance() won't move it and next accelerate in > > srcu_invoke_callbacks() will > > assign it the next gp sequence num. > > Sure, and after that again it will call srcu_schedule_cbs_sdp() so > that should be fine and the next workqueue invocation > srcu_invoke_callbacks() can advance at that time. Right? > But "more" only checks for CBs in DONE tail. The callbacks which have been just accelerated are not advanced to DONE tail. Having said above, I am still trying to figure out, which callbacks are actually being pointed by NEXT tail. Given that __call_srcu() already does a advance and accelerate, all enqueued callbacks would be in either WAIT tail (the callbacks for current active GP) or NEXT_READY tail (the callbacks for next GP after current one completes). So, they should already have GP num assigned and srcu_invoke_callbacks() won't accelerate them. Only case I can think of is, if current GP completes after we sample rcu_seq_current(&ssp->srcu_gp_seq) for rcu_segcblist_advance() (so, WAIT tail cbs are not moved to DONE tail) and a new GP is started before we take snapshot ('s') of next GP for rcu_segcblist_accelerate(), then the gp num 's' > gp num of NEXT_READY_TAIL and will be put in NEXT tail. Not sure if my understanding is correct here. Thoughts? __call_srcu() rcu_segcblist_advance(&sdp->srcu_cblist, rcu_seq_current(&ssp->srcu_gp_seq)); s = rcu_seq_snap(&ssp->srcu_gp_seq); (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s); Thanks Neeraj > if (more) { > srcu_schedule_cbs_sdp(sdp, 0); > } > > thanks, > > - Joel
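To make the window described above concrete, here is a stand-alone user-space sketch of the sequence arithmetic. The two helpers mirror rcu_seq_current() and rcu_seq_snap() from kernel/rcu/rcu.h (minus READ_ONCE() and the full memory barrier); the sampled values and the scenario in main() are made up to match the GP numbering used in this thread, so treat it as an illustration rather than kernel code.

#include <stdio.h>

#define RCU_SEQ_CTR_SHIFT	2
#define RCU_SEQ_STATE_MASK	((1UL << RCU_SEQ_CTR_SHIFT) - 1)

static unsigned long rcu_seq_current(const unsigned long *sp)
{
	return *sp;
}

/* Earliest seq value at which a full grace period will have elapsed. */
static unsigned long rcu_seq_snap(const unsigned long *sp)
{
	return (*sp + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK;
}

int main(void)
{
	unsigned long srcu_gp_seq = 285;	/* GP 284-288 in progress */
	unsigned long cur, s;

	/* __call_srcu() samples the current seq for the advance... */
	cur = rcu_seq_current(&srcu_gp_seq);

	/* ...then GP 284-288 ends and GP 288-292 starts in the window... */
	srcu_gp_seq = 289;

	/* ...before the snapshot for the accelerate is taken. */
	s = rcu_seq_snap(&srcu_gp_seq);

	printf("advance saw %lu, accelerate targets %lu (not %lu)\n",
	       cur, s, rcu_seq_snap(&cur));
	return 0;
}

This prints "advance saw 285, accelerate targets 296 (not 292)": the advance only made room for what seq 285 implied, while WAIT and NEXT_READY may still hold 288 and 292, so per the reasoning above the acceleration finds no free slot and the callback stays in the NEXT tail.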
Le Sun, Oct 01, 2023 at 07:57:14AM +0530, Neeraj upadhyay a écrit : > > But "more" only checks for CBs in DONE tail. The callbacks which have been just > accelerated are not advanced to DONE tail. > > Having said above, I am still trying to figure out, which callbacks > are actually being pointed > by NEXT tail. Given that __call_srcu() already does a advance and > accelerate, all enqueued > callbacks would be in either WAIT tail (the callbacks for current > active GP) or NEXT_READY > tail (the callbacks for next GP after current one completes). So, they > should already have > GP num assigned and srcu_invoke_callbacks() won't accelerate them. > Only case I can > think of is, if current GP completes after we sample > rcu_seq_current(&ssp->srcu_gp_seq) for > rcu_segcblist_advance() (so, WAIT tail cbs are not moved to DONE tail) > and a new GP is started > before we take snapshot ('s') of next GP for > rcu_segcblist_accelerate(), then the gp num 's' > > gp num of NEXT_READY_TAIL and will be put in NEXT tail. Not sure > if my understanding is correct here. Thoughts? > > __call_srcu() > rcu_segcblist_advance(&sdp->srcu_cblist, > rcu_seq_current(&ssp->srcu_gp_seq)); > s = rcu_seq_snap(&ssp->srcu_gp_seq); > (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s); Good point! This looks plausible. Does the (buggy) acceleration in srcu_invoke_callbacks() exists solely to handle that case? Because then the acceleration could be moved before the advance on callbacks handling, something like: diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index 20d7a238d675..af9d8af1d321 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -1245,6 +1245,11 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp, rcu_segcblist_advance(&sdp->srcu_cblist, rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq); + /* + * Acceleration might fail if the preceding call to + * rcu_segcblist_advance() also failed due to a prior incomplete grace + * period. This should be later fixed in srcu_invoke_callbacks(). + */ (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s); if (ULONG_CMP_LT(sdp->srcu_gp_seq_needed, s)) { sdp->srcu_gp_seq_needed = s; @@ -1692,6 +1697,13 @@ static void srcu_invoke_callbacks(struct work_struct *work) ssp = sdp->ssp; rcu_cblist_init(&ready_cbs); spin_lock_irq_rcu_node(sdp); + /* + * Acceleration might have failed in srcu_gp_start_if_needed() if + * the preceding call to rcu_segcblist_advance() also failed due to + * a prior incomplete grace period. + */ + (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, + sdp->srcu_gp_seq_needed); rcu_segcblist_advance(&sdp->srcu_cblist, rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); if (sdp->srcu_cblist_invoking || @@ -1720,8 +1732,6 @@ static void srcu_invoke_callbacks(struct work_struct *work) */ spin_lock_irq_rcu_node(sdp); rcu_segcblist_add_len(&sdp->srcu_cblist, -len); - (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, - rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq)); sdp->srcu_cblist_invoking = false; more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist); spin_unlock_irq_rcu_node(sdp); > > Thanks > Neeraj > > > > if (more) { > > srcu_schedule_cbs_sdp(sdp, 0); > > } > > > > thanks, > > > > - Joel
Le Mon, Oct 02, 2023 at 12:32:41AM +0200, Frederic Weisbecker a écrit : > Le Sun, Oct 01, 2023 at 07:57:14AM +0530, Neeraj upadhyay a écrit : > > > > But "more" only checks for CBs in DONE tail. The callbacks which have been just > > accelerated are not advanced to DONE tail. > > > > Having said above, I am still trying to figure out, which callbacks > > are actually being pointed > > by NEXT tail. Given that __call_srcu() already does a advance and > > accelerate, all enqueued > > callbacks would be in either WAIT tail (the callbacks for current > > active GP) or NEXT_READY > > tail (the callbacks for next GP after current one completes). So, they > > should already have > > GP num assigned and srcu_invoke_callbacks() won't accelerate them. > > Only case I can > > think of is, if current GP completes after we sample > > rcu_seq_current(&ssp->srcu_gp_seq) for > > rcu_segcblist_advance() (so, WAIT tail cbs are not moved to DONE tail) > > and a new GP is started > > before we take snapshot ('s') of next GP for > > rcu_segcblist_accelerate(), then the gp num 's' > > > gp num of NEXT_READY_TAIL and will be put in NEXT tail. Not sure > > if my understanding is correct here. Thoughts? > > > > __call_srcu() > > rcu_segcblist_advance(&sdp->srcu_cblist, > > rcu_seq_current(&ssp->srcu_gp_seq)); > > s = rcu_seq_snap(&ssp->srcu_gp_seq); > > (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s); > > Good point! This looks plausible. > > Does the (buggy) acceleration in srcu_invoke_callbacks() exists solely > to handle that case? Because then the acceleration could be moved before > the advance on callbacks handling, something like: > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > index 20d7a238d675..af9d8af1d321 100644 > --- a/kernel/rcu/srcutree.c > +++ b/kernel/rcu/srcutree.c > @@ -1245,6 +1245,11 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp, > rcu_segcblist_advance(&sdp->srcu_cblist, > rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); > s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq); > + /* > + * Acceleration might fail if the preceding call to > + * rcu_segcblist_advance() also failed due to a prior incomplete grace > + * period. This should be later fixed in srcu_invoke_callbacks(). > + */ > (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s); > if (ULONG_CMP_LT(sdp->srcu_gp_seq_needed, s)) { > sdp->srcu_gp_seq_needed = s; > @@ -1692,6 +1697,13 @@ static void srcu_invoke_callbacks(struct work_struct *work) > ssp = sdp->ssp; > rcu_cblist_init(&ready_cbs); > spin_lock_irq_rcu_node(sdp); > + /* > + * Acceleration might have failed in srcu_gp_start_if_needed() if > + * the preceding call to rcu_segcblist_advance() also failed due to > + * a prior incomplete grace period. > + */ > + (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, > + sdp->srcu_gp_seq_needed); > rcu_segcblist_advance(&sdp->srcu_cblist, > rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); > if (sdp->srcu_cblist_invoking || > @@ -1720,8 +1732,6 @@ static void srcu_invoke_callbacks(struct work_struct *work) > */ > spin_lock_irq_rcu_node(sdp); > rcu_segcblist_add_len(&sdp->srcu_cblist, -len); > - (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, > - rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq)); > sdp->srcu_cblist_invoking = false; > more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist); > spin_unlock_irq_rcu_node(sdp); And if this works, can we then remove srcu_invoke_callbacks() self-requeue? If queued several times before it actually fires, it will catch the latest grace period's end. 
And if queued while the callback runs, it will re-run. Also why do we have sdp->srcu_invoke_callbacks ? Is that workqueue re-entrant? Thanks.
On Mon, Oct 2, 2023 at 4:02 AM Frederic Weisbecker <frederic@kernel.org> wrote: > > Le Sun, Oct 01, 2023 at 07:57:14AM +0530, Neeraj upadhyay a écrit : > > > > But "more" only checks for CBs in DONE tail. The callbacks which have been just > > accelerated are not advanced to DONE tail. > > > > Having said above, I am still trying to figure out, which callbacks > > are actually being pointed > > by NEXT tail. Given that __call_srcu() already does a advance and > > accelerate, all enqueued > > callbacks would be in either WAIT tail (the callbacks for current > > active GP) or NEXT_READY > > tail (the callbacks for next GP after current one completes). So, they > > should already have > > GP num assigned and srcu_invoke_callbacks() won't accelerate them. > > Only case I can > > think of is, if current GP completes after we sample > > rcu_seq_current(&ssp->srcu_gp_seq) for > > rcu_segcblist_advance() (so, WAIT tail cbs are not moved to DONE tail) > > and a new GP is started > > before we take snapshot ('s') of next GP for > > rcu_segcblist_accelerate(), then the gp num 's' > > > gp num of NEXT_READY_TAIL and will be put in NEXT tail. Not sure > > if my understanding is correct here. Thoughts? > > > > __call_srcu() > > rcu_segcblist_advance(&sdp->srcu_cblist, > > rcu_seq_current(&ssp->srcu_gp_seq)); > > s = rcu_seq_snap(&ssp->srcu_gp_seq); > > (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s); > > Good point! This looks plausible. > > Does the (buggy) acceleration in srcu_invoke_callbacks() exists solely > to handle that case? Because then the acceleration could be moved before > the advance on callbacks handling, something like: > I think we might need to accelerate after advance, as the tail pointers (WAIT, NEXT_READY) can be non-empty and we will not be able to accelerate (and assign GP num) until we advance WAIT tail to DONE tail? Thanks Neeraj > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > index 20d7a238d675..af9d8af1d321 100644 > --- a/kernel/rcu/srcutree.c > +++ b/kernel/rcu/srcutree.c > @@ -1245,6 +1245,11 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp, > rcu_segcblist_advance(&sdp->srcu_cblist, > rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); > s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq); > + /* > + * Acceleration might fail if the preceding call to > + * rcu_segcblist_advance() also failed due to a prior incomplete grace > + * period. This should be later fixed in srcu_invoke_callbacks(). > + */ > (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s); > if (ULONG_CMP_LT(sdp->srcu_gp_seq_needed, s)) { > sdp->srcu_gp_seq_needed = s; > @@ -1692,6 +1697,13 @@ static void srcu_invoke_callbacks(struct work_struct *work) > ssp = sdp->ssp; > rcu_cblist_init(&ready_cbs); > spin_lock_irq_rcu_node(sdp); > + /* > + * Acceleration might have failed in srcu_gp_start_if_needed() if > + * the preceding call to rcu_segcblist_advance() also failed due to > + * a prior incomplete grace period. 
> + */ > + (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, > + sdp->srcu_gp_seq_needed); > rcu_segcblist_advance(&sdp->srcu_cblist, > rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); > if (sdp->srcu_cblist_invoking || > @@ -1720,8 +1732,6 @@ static void srcu_invoke_callbacks(struct work_struct *work) > */ > spin_lock_irq_rcu_node(sdp); > rcu_segcblist_add_len(&sdp->srcu_cblist, -len); > - (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, > - rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq)); > sdp->srcu_cblist_invoking = false; > more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist); > spin_unlock_irq_rcu_node(sdp); > > > > > > Thanks > > Neeraj > > > > > > > if (more) { > > > srcu_schedule_cbs_sdp(sdp, 0); > > > } > > > > > > thanks, > > > > > > - Joel
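For reference while following the tail-pointer discussion, this is roughly the layout in question, abridged from the 5.4-era include/linux/rcu_segcblist.h (the typedef and forward declaration are only added here so the excerpt stands alone; the field names match the crash dump earlier in the thread):

typedef unsigned char u8;
struct rcu_head;

#define RCU_DONE_TAIL		0	/* GP already done, ready to invoke */
#define RCU_WAIT_TAIL		1	/* waiting for the current GP */
#define RCU_NEXT_READY_TAIL	2	/* waiting for the next GP */
#define RCU_NEXT_TAIL		3	/* not yet associated with any GP */
#define RCU_CBLIST_NSEGS	4

struct rcu_segcblist {
	struct rcu_head *head;				/* oldest callback */
	struct rcu_head **tails[RCU_CBLIST_NSEGS];	/* end of each segment */
	unsigned long gp_seq[RCU_CBLIST_NSEGS];		/* GP each segment waits for;
							 * only WAIT and NEXT_READY are
							 * maintained, hence the
							 * {0, 296, 296, 0} in the dump */
	long len;
	long len_lazy;
	u8 enabled;
	u8 offloaded;
};

Acceleration can only hand a GP number to callbacks sitting past the NEXT_READY tail, which is Neeraj's point above: it needs the advance to have freed a slot first.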
On Mon, Oct 2, 2023 at 4:10 AM Frederic Weisbecker <frederic@kernel.org> wrote: > > Le Mon, Oct 02, 2023 at 12:32:41AM +0200, Frederic Weisbecker a écrit : > > Le Sun, Oct 01, 2023 at 07:57:14AM +0530, Neeraj upadhyay a écrit : > > > > > > But "more" only checks for CBs in DONE tail. The callbacks which have been just > > > accelerated are not advanced to DONE tail. > > > > > > Having said above, I am still trying to figure out, which callbacks > > > are actually being pointed > > > by NEXT tail. Given that __call_srcu() already does a advance and > > > accelerate, all enqueued > > > callbacks would be in either WAIT tail (the callbacks for current > > > active GP) or NEXT_READY > > > tail (the callbacks for next GP after current one completes). So, they > > > should already have > > > GP num assigned and srcu_invoke_callbacks() won't accelerate them. > > > Only case I can > > > think of is, if current GP completes after we sample > > > rcu_seq_current(&ssp->srcu_gp_seq) for > > > rcu_segcblist_advance() (so, WAIT tail cbs are not moved to DONE tail) > > > and a new GP is started > > > before we take snapshot ('s') of next GP for > > > rcu_segcblist_accelerate(), then the gp num 's' > > > > gp num of NEXT_READY_TAIL and will be put in NEXT tail. Not sure > > > if my understanding is correct here. Thoughts? > > > > > > __call_srcu() > > > rcu_segcblist_advance(&sdp->srcu_cblist, > > > rcu_seq_current(&ssp->srcu_gp_seq)); > > > s = rcu_seq_snap(&ssp->srcu_gp_seq); > > > (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s); > > > > Good point! This looks plausible. > > > > Does the (buggy) acceleration in srcu_invoke_callbacks() exists solely > > to handle that case? Because then the acceleration could be moved before > > the advance on callbacks handling, something like: > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > index 20d7a238d675..af9d8af1d321 100644 > > --- a/kernel/rcu/srcutree.c > > +++ b/kernel/rcu/srcutree.c > > @@ -1245,6 +1245,11 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp, > > rcu_segcblist_advance(&sdp->srcu_cblist, > > rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); > > s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq); > > + /* > > + * Acceleration might fail if the preceding call to > > + * rcu_segcblist_advance() also failed due to a prior incomplete grace > > + * period. This should be later fixed in srcu_invoke_callbacks(). > > + */ > > (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s); > > if (ULONG_CMP_LT(sdp->srcu_gp_seq_needed, s)) { > > sdp->srcu_gp_seq_needed = s; > > @@ -1692,6 +1697,13 @@ static void srcu_invoke_callbacks(struct work_struct *work) > > ssp = sdp->ssp; > > rcu_cblist_init(&ready_cbs); > > spin_lock_irq_rcu_node(sdp); > > + /* > > + * Acceleration might have failed in srcu_gp_start_if_needed() if > > + * the preceding call to rcu_segcblist_advance() also failed due to > > + * a prior incomplete grace period. 
> > + */ > > + (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, > > + sdp->srcu_gp_seq_needed); > > rcu_segcblist_advance(&sdp->srcu_cblist, > > rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); > > if (sdp->srcu_cblist_invoking || > > @@ -1720,8 +1732,6 @@ static void srcu_invoke_callbacks(struct work_struct *work) > > */ > > spin_lock_irq_rcu_node(sdp); > > rcu_segcblist_add_len(&sdp->srcu_cblist, -len); > > - (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, > > - rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq)); > > sdp->srcu_cblist_invoking = false; > > more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist); > > spin_unlock_irq_rcu_node(sdp); > > And if this works, can we then remove srcu_invoke_callbacks() self-requeue? > If queued several times before it actually fires, it will catch the latest > grace period's end. And if queued while the callback runs, it will re-run. > This makes sense, but not sure for non-wq context which link [1] mentions, whether it needs it. > Also why do we have sdp->srcu_invoke_callbacks ? Is that workqueue re-entrant? > I think you mean sdp->srcu_cblist_invoking ? There was a prior discussion on this [1], where Paul mentions about non-wq context. Thanks Neeraj [1] https://lkml.org/lkml/2020/11/19/1065 > Thanks.
On Mon, Oct 02, 2023 at 07:47:55AM +0530, Neeraj upadhyay wrote: > On Mon, Oct 2, 2023 at 4:02 AM Frederic Weisbecker <frederic@kernel.org> wrote: > > > > Le Sun, Oct 01, 2023 at 07:57:14AM +0530, Neeraj upadhyay a écrit : > > > > > > But "more" only checks for CBs in DONE tail. The callbacks which have been just > > > accelerated are not advanced to DONE tail. > > > > > > Having said above, I am still trying to figure out, which callbacks > > > are actually being pointed > > > by NEXT tail. Given that __call_srcu() already does a advance and > > > accelerate, all enqueued > > > callbacks would be in either WAIT tail (the callbacks for current > > > active GP) or NEXT_READY > > > tail (the callbacks for next GP after current one completes). So, they > > > should already have > > > GP num assigned and srcu_invoke_callbacks() won't accelerate them. > > > Only case I can > > > think of is, if current GP completes after we sample > > > rcu_seq_current(&ssp->srcu_gp_seq) for > > > rcu_segcblist_advance() (so, WAIT tail cbs are not moved to DONE tail) > > > and a new GP is started > > > before we take snapshot ('s') of next GP for > > > rcu_segcblist_accelerate(), then the gp num 's' > > > > gp num of NEXT_READY_TAIL and will be put in NEXT tail. Not sure > > > if my understanding is correct here. Thoughts? > > > > > > __call_srcu() > > > rcu_segcblist_advance(&sdp->srcu_cblist, > > > rcu_seq_current(&ssp->srcu_gp_seq)); > > > s = rcu_seq_snap(&ssp->srcu_gp_seq); > > > (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s); > > > > Good point! This looks plausible. > > > > Does the (buggy) acceleration in srcu_invoke_callbacks() exists solely > > to handle that case? Because then the acceleration could be moved before > > the advance on callbacks handling, something like: > > > > I think we might need to accelerate after advance, as the tail pointers > (WAIT, NEXT_READY) can be non-empty and we will not be able to > accelerate (and assign GP num) until we advance WAIT tail to DONE tail? Right indeed! How about the following patch then, assuming that in SRCU: 1 enqueue == 1 accelerate and therefore it only makes sense to accelerate on enqueue time and any other accelerate call is error prone. I can't find a scenario where the second call below to accelerate (and thus also the second call to advance) would fail. Please tell me if I'm missing something. Also the role of the remaining advance in srcu_gp_start() is unclear to me... diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index 20d7a238d675..5ac81ca10ec8 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -782,8 +782,6 @@ static void srcu_gp_start(struct srcu_struct *ssp) spin_lock_rcu_node(sdp); /* Interrupts already disabled. */ rcu_segcblist_advance(&sdp->srcu_cblist, rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); - (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, - rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq)); spin_unlock_rcu_node(sdp); /* Interrupts remain disabled. */ WRITE_ONCE(ssp->srcu_sup->srcu_gp_start, jiffies); WRITE_ONCE(ssp->srcu_sup->srcu_n_exp_nodelay, 0); @@ -1245,7 +1243,18 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp, rcu_segcblist_advance(&sdp->srcu_cblist, rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq); - (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s); + /* + * Acceleration might fail if the preceding call to + * rcu_segcblist_advance() also failed due to a prior grace + * period seen incomplete before rcu_seq_snap(). 
If so then a new + * call to advance will see the completed grace period and fix + * the situation. + */ + if (!rcu_segcblist_accelerate(&sdp->srcu_cblist, s)) { + rcu_segcblist_advance(&sdp->srcu_cblist, + rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); + WARN_ON_ONCE(!rcu_segcblist_accelerate(&sdp->srcu_cblist, s)); + } if (ULONG_CMP_LT(sdp->srcu_gp_seq_needed, s)) { sdp->srcu_gp_seq_needed = s; needgp = true; @@ -1692,6 +1701,7 @@ static void srcu_invoke_callbacks(struct work_struct *work) ssp = sdp->ssp; rcu_cblist_init(&ready_cbs); spin_lock_irq_rcu_node(sdp); + WARN_ON_ONCE(!rcu_segcblist_segempty(&sdp->srcu_cblist, RCU_NEXT_TAIL)); rcu_segcblist_advance(&sdp->srcu_cblist, rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); if (sdp->srcu_cblist_invoking || @@ -1720,8 +1730,6 @@ static void srcu_invoke_callbacks(struct work_struct *work) */ spin_lock_irq_rcu_node(sdp); rcu_segcblist_add_len(&sdp->srcu_cblist, -len); - (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, - rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq)); sdp->srcu_cblist_invoking = false; more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist); spin_unlock_irq_rcu_node(sdp);
On Mon, Oct 02, 2023 at 07:51:10AM +0530, Neeraj upadhyay wrote: > On Mon, Oct 2, 2023 at 4:10 AM Frederic Weisbecker <frederic@kernel.org> wrote: > > And if this works, can we then remove srcu_invoke_callbacks() self-requeue? > > If queued several times before it actually fires, it will catch the latest > > grace period's end. And if queued while the callback runs, it will re-run. > > > > This makes sense, but not sure for non-wq context which link [1] mentions, > whether it needs it. I don't see a problem with it, see below. > > > Also why do we have sdp->srcu_invoke_callbacks ? Is that workqueue re-entrant? > > > I think you mean sdp->srcu_cblist_invoking ? Right! > > There was a prior discussion on this [1], where Paul mentions about > non-wq context. I don't see that function called anywhere other than from a workqueue. So in early boot, work items queued must wait for workqueue_init() (and kthread creation and running), which is quite late in the boot process. So it looks like SRCU isn't really usable synchronously in early boot. Besides, this workqueue is already re-entrant. Several srcu_invoke_callbacks() calls can already execute concurrently, which is fine thanks to the snp locking. So I don't see an obvious need for this re-entrancy protection (coming along with the requeue thing). Thanks. > > > > Thanks > Neeraj > > [1] https://lkml.org/lkml/2020/11/19/1065 > > > Thanks.
On Mon, Oct 2, 2023 at 4:11 PM Frederic Weisbecker <frederic@kernel.org> wrote: > > On Mon, Oct 02, 2023 at 07:47:55AM +0530, Neeraj upadhyay wrote: > > On Mon, Oct 2, 2023 at 4:02 AM Frederic Weisbecker <frederic@kernel.org> wrote: > > > > > > Le Sun, Oct 01, 2023 at 07:57:14AM +0530, Neeraj upadhyay a écrit : > > > > > > > > But "more" only checks for CBs in DONE tail. The callbacks which have been just > > > > accelerated are not advanced to DONE tail. > > > > > > > > Having said above, I am still trying to figure out, which callbacks > > > > are actually being pointed > > > > by NEXT tail. Given that __call_srcu() already does a advance and > > > > accelerate, all enqueued > > > > callbacks would be in either WAIT tail (the callbacks for current > > > > active GP) or NEXT_READY > > > > tail (the callbacks for next GP after current one completes). So, they > > > > should already have > > > > GP num assigned and srcu_invoke_callbacks() won't accelerate them. > > > > Only case I can > > > > think of is, if current GP completes after we sample > > > > rcu_seq_current(&ssp->srcu_gp_seq) for > > > > rcu_segcblist_advance() (so, WAIT tail cbs are not moved to DONE tail) > > > > and a new GP is started > > > > before we take snapshot ('s') of next GP for > > > > rcu_segcblist_accelerate(), then the gp num 's' > > > > > gp num of NEXT_READY_TAIL and will be put in NEXT tail. Not sure > > > > if my understanding is correct here. Thoughts? > > > > > > > > __call_srcu() > > > > rcu_segcblist_advance(&sdp->srcu_cblist, > > > > rcu_seq_current(&ssp->srcu_gp_seq)); > > > > s = rcu_seq_snap(&ssp->srcu_gp_seq); > > > > (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s); > > > > > > Good point! This looks plausible. > > > > > > Does the (buggy) acceleration in srcu_invoke_callbacks() exists solely > > > to handle that case? Because then the acceleration could be moved before > > > the advance on callbacks handling, something like: > > > > > > > I think we might need to accelerate after advance, as the tail pointers > > (WAIT, NEXT_READY) can be non-empty and we will not be able to > > accelerate (and assign GP num) until we advance WAIT tail to DONE tail? > > Right indeed! How about the following patch then, assuming that in SRCU: > 1 enqueue == 1 accelerate and therefore it only makes sense > to accelerate on enqueue time and any other accelerate call is error prone. > Agree. > I can't find a scenario where the second call below to accelerate (and thus also > the second call to advance) would fail. Please tell me if I'm missing something. Looks good to me. Few minor comments. > Also the role of the remaining advance in srcu_gp_start() is unclear to me... > As far as I understand, the advance call before accelerate is to make space in one of WAIT or NEXT_READY tail for acceleration. So, it can be removed. > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > index 20d7a238d675..5ac81ca10ec8 100644 > --- a/kernel/rcu/srcutree.c > +++ b/kernel/rcu/srcutree.c > @@ -782,8 +782,6 @@ static void srcu_gp_start(struct srcu_struct *ssp) > spin_lock_rcu_node(sdp); /* Interrupts already disabled. */ > rcu_segcblist_advance(&sdp->srcu_cblist, > rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); > - (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, > - rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq)); Deletion is ok; alternatively, we could have used WARN_ON_ONCE(rcu_segcblist_accelerate(...)) in all places other than enqueue time for few cycles to be on safer side. 
> spin_unlock_rcu_node(sdp); /* Interrupts remain disabled. */ > WRITE_ONCE(ssp->srcu_sup->srcu_gp_start, jiffies); > WRITE_ONCE(ssp->srcu_sup->srcu_n_exp_nodelay, 0); > @@ -1245,7 +1243,18 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp, > rcu_segcblist_advance(&sdp->srcu_cblist, > rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); > s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq); > - (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s); > + /* > + * Acceleration might fail if the preceding call to > + * rcu_segcblist_advance() also failed due to a prior grace > + * period seen incomplete before rcu_seq_snap(). If so then a new > + * call to advance will see the completed grace period and fix > + * the situation. > + */ > + if (!rcu_segcblist_accelerate(&sdp->srcu_cblist, s)) { We can add below also? Here old and new are rcu_seq_current() values used in the 2 calls to rcu_segcblist_advance(). WARN_ON_ONCE(!(rcu_seq_completed_gp(old, new) && rcu_seq_new_gp(old, new))); Thanks Neeraj > + rcu_segcblist_advance(&sdp->srcu_cblist, > + rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); > + WARN_ON_ONCE(!rcu_segcblist_accelerate(&sdp->srcu_cblist, s)); > + } > if (ULONG_CMP_LT(sdp->srcu_gp_seq_needed, s)) { > sdp->srcu_gp_seq_needed = s; > needgp = true; > @@ -1692,6 +1701,7 @@ static void srcu_invoke_callbacks(struct work_struct *work) > ssp = sdp->ssp; > rcu_cblist_init(&ready_cbs); > spin_lock_irq_rcu_node(sdp); > + WARN_ON_ONCE(!rcu_segcblist_segempty(&sdp->srcu_cblist, RCU_NEXT_TAIL)); > rcu_segcblist_advance(&sdp->srcu_cblist, > rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); > if (sdp->srcu_cblist_invoking || > @@ -1720,8 +1730,6 @@ static void srcu_invoke_callbacks(struct work_struct *work) > */ > spin_lock_irq_rcu_node(sdp); > rcu_segcblist_add_len(&sdp->srcu_cblist, -len); > - (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, > - rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq)); > sdp->srcu_cblist_invoking = false; > more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist); > spin_unlock_irq_rcu_node(sdp);
Le Mon, Oct 02, 2023 at 06:52:27PM +0530, Neeraj upadhyay a écrit : > On Mon, Oct 2, 2023 at 4:11 PM Frederic Weisbecker <frederic@kernel.org> wrote: > > Also the role of the remaining advance in srcu_gp_start() is unclear to me... > > > > As far as I understand, the advance call before accelerate is to make > space in one of WAIT > or NEXT_READY tail for acceleration. So, it can be removed. Sounds good. Will remove that. > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > index 20d7a238d675..5ac81ca10ec8 100644 > > --- a/kernel/rcu/srcutree.c > > +++ b/kernel/rcu/srcutree.c > > @@ -782,8 +782,6 @@ static void srcu_gp_start(struct srcu_struct *ssp) > > spin_lock_rcu_node(sdp); /* Interrupts already disabled. */ > > rcu_segcblist_advance(&sdp->srcu_cblist, > > rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); > > - (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, > > - rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq)); > > Deletion is ok; alternatively, we could have used > WARN_ON_ONCE(rcu_segcblist_accelerate(...)) > in all places other than enqueue time for few cycles to be on safer side. How about WARN_ON_ONCE(!rcu_segcblist_segempty(&sdp->srcu_cblist, RCU_NEXT_TAIL)) ? > > > spin_unlock_rcu_node(sdp); /* Interrupts remain disabled. */ > > WRITE_ONCE(ssp->srcu_sup->srcu_gp_start, jiffies); > > WRITE_ONCE(ssp->srcu_sup->srcu_n_exp_nodelay, 0); > > @@ -1245,7 +1243,18 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp, > > rcu_segcblist_advance(&sdp->srcu_cblist, > > rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); > > s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq); > > - (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s); > > + /* > > + * Acceleration might fail if the preceding call to > > + * rcu_segcblist_advance() also failed due to a prior grace > > + * period seen incomplete before rcu_seq_snap(). If so then a new > > + * call to advance will see the completed grace period and fix > > + * the situation. > > + */ > > + if (!rcu_segcblist_accelerate(&sdp->srcu_cblist, s)) { > > We can add below also? Here old and new are rcu_seq_current() values used in > the 2 calls to rcu_segcblist_advance(). > > WARN_ON_ONCE(!(rcu_seq_completed_gp(old, new) && rcu_seq_new_gp(old, new))); Very good point! "new" should be exactly one and a half grace period away from "old", will add that. Cooking proper patches now. Thanks. > > > Thanks > Neeraj > > > + rcu_segcblist_advance(&sdp->srcu_cblist, > > + rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); > > + WARN_ON_ONCE(!rcu_segcblist_accelerate(&sdp->srcu_cblist, s)); > > + } > > if (ULONG_CMP_LT(sdp->srcu_gp_seq_needed, s)) { > > sdp->srcu_gp_seq_needed = s; > > needgp = true; > > @@ -1692,6 +1701,7 @@ static void srcu_invoke_callbacks(struct work_struct *work) > > ssp = sdp->ssp; > > rcu_cblist_init(&ready_cbs); > > spin_lock_irq_rcu_node(sdp); > > + WARN_ON_ONCE(!rcu_segcblist_segempty(&sdp->srcu_cblist, RCU_NEXT_TAIL)); > > rcu_segcblist_advance(&sdp->srcu_cblist, > > rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); > > if (sdp->srcu_cblist_invoking || > > @@ -1720,8 +1730,6 @@ static void srcu_invoke_callbacks(struct work_struct *work) > > */ > > spin_lock_irq_rcu_node(sdp); > > rcu_segcblist_add_len(&sdp->srcu_cblist, -len); > > - (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, > > - rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq)); > > sdp->srcu_cblist_invoking = false; > > more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist); > > spin_unlock_irq_rcu_node(sdp);
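A quick stand-alone check of that condition, with the two helpers mirroring rcu_seq_completed_gp() and rcu_seq_new_gp() from kernel/rcu/rcu.h, and made-up sample values (285 = GP 284-288 seen in progress):

#include <stdio.h>
#include <stdbool.h>
#include <limits.h>

#define RCU_SEQ_STATE_MASK	3UL
#define ULONG_CMP_LT(a, b)	(ULONG_MAX / 2 < (a) - (b))

/* Has a grace period completed since "old" was sampled? */
static bool rcu_seq_completed_gp(unsigned long old, unsigned long new)
{
	return ULONG_CMP_LT(old, new & ~RCU_SEQ_STATE_MASK);
}

/* Has a new grace period started since "old" was sampled? */
static bool rcu_seq_new_gp(unsigned long old, unsigned long new)
{
	return ULONG_CMP_LT((old + RCU_SEQ_STATE_MASK) & ~RCU_SEQ_STATE_MASK,
			    new);
}

int main(void)
{
	unsigned long old = 285;

	/* 288: the old GP merely completed -> only completed_gp is true. */
	printf("new=288: completed=%d new_gp=%d\n",
	       rcu_seq_completed_gp(old, 288), rcu_seq_new_gp(old, 288));

	/* 289: old GP completed and a newer one started -> both are true. */
	printf("new=289: completed=%d new_gp=%d\n",
	       rcu_seq_completed_gp(old, 289), rcu_seq_new_gp(old, 289));
	return 0;
}

Only the second case, one and a half grace periods past "old", satisfies both predicates, which is what the suggested WARN_ON_ONCE() would assert in the fallback path.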
Le Mon, Oct 02, 2023 at 07:51:10AM +0530, Neeraj upadhyay a écrit : > > And if this works, can we then remove srcu_invoke_callbacks() self-requeue? > > If queued several times before it actually fires, it will catch the latest > > grace period's end. And if queued while the callback runs, it will re-run. > > > > This makes sense, but not sure for non-wq context which link [1] mentions, > whether it needs it. > > > Also why do we have sdp->srcu_invoke_callbacks ? Is that workqueue re-entrant? > > > I think you mean sdp->srcu_cblist_invoking ? > > There was a prior discussion on this [1], where Paul mentions about > non-wq context. Actually I realize that if we remove that, then two srcu_invoke_callbacks() can race in such a way that an srcu barrier callback may execute before the callbacks it is supposed to follow. And if we must keep sdp->srcu_cblist_invoking then we must also keep self-requeuing. I'll add a comment about that instead of removing. Thanks. > > > > Thanks > Neeraj > > [1] https://lkml.org/lkml/2020/11/19/1065 > > > Thanks.
On Tue, 3 Oct, 2023, 2:39 am Frederic Weisbecker, <frederic@kernel.org> wrote: > > Le Mon, Oct 02, 2023 at 06:52:27PM +0530, Neeraj upadhyay a écrit : > > On Mon, Oct 2, 2023 at 4:11 PM Frederic Weisbecker <frederic@kernel.org> wrote: > > > Also the role of the remaining advance in srcu_gp_start() is unclear to me... > > > > > > > As far as I understand, the advance call before accelerate is to make > > space in one of WAIT > > or NEXT_READY tail for acceleration. So, it can be removed. > > Sounds good. Will remove that. > > > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > > index 20d7a238d675..5ac81ca10ec8 100644 > > > --- a/kernel/rcu/srcutree.c > > > +++ b/kernel/rcu/srcutree.c > > > @@ -782,8 +782,6 @@ static void srcu_gp_start(struct srcu_struct *ssp) > > > spin_lock_rcu_node(sdp); /* Interrupts already disabled. */ > > > rcu_segcblist_advance(&sdp->srcu_cblist, > > > rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); > > > - (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, > > > - rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq)); > > > > Deletion is ok; alternatively, we could have used > > WARN_ON_ONCE(rcu_segcblist_accelerate(...)) > > in all places other than enqueue time for few cycles to be on safer side. > > How about WARN_ON_ONCE(!rcu_segcblist_segempty(&sdp->srcu_cblist, RCU_NEXT_TAIL)) ? > Sounds good. My initial thought to put WARN_ON_ONCE(rcu_segcblist_accelerate(...)) was to catch any unexpected accelerations for WAIT and NEXT_READY cbs. However, as all rcu_seq_snap() snapshots, used for acceleration, are ordered, I think we do not need to check for those tails. > > > > > spin_unlock_rcu_node(sdp); /* Interrupts remain disabled. */ > > > WRITE_ONCE(ssp->srcu_sup->srcu_gp_start, jiffies); > > > WRITE_ONCE(ssp->srcu_sup->srcu_n_exp_nodelay, 0); > > > @@ -1245,7 +1243,18 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp, > > > rcu_segcblist_advance(&sdp->srcu_cblist, > > > rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); > > > s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq); > > > - (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s); > > > + /* > > > + * Acceleration might fail if the preceding call to > > > + * rcu_segcblist_advance() also failed due to a prior grace > > > + * period seen incomplete before rcu_seq_snap(). If so then a new > > > + * call to advance will see the completed grace period and fix > > > + * the situation. > > > + */ > > > + if (!rcu_segcblist_accelerate(&sdp->srcu_cblist, s)) { > > > > We can add below also? Here old and new are rcu_seq_current() values used in > > the 2 calls to rcu_segcblist_advance(). > > > > WARN_ON_ONCE(!(rcu_seq_completed_gp(old, new) && rcu_seq_new_gp(old, new))); > > Very good point! "new" should be exactly one and a half grace period away from > "old", will add that. > > Cooking proper patches now. > Cool! Thanks Neeraj > Thanks. 
> > > > > > > Thanks > > Neeraj > > > > > + rcu_segcblist_advance(&sdp->srcu_cblist, > > > + rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); > > > + WARN_ON_ONCE(!rcu_segcblist_accelerate(&sdp->srcu_cblist, s)); > > > + } > > > if (ULONG_CMP_LT(sdp->srcu_gp_seq_needed, s)) { > > > sdp->srcu_gp_seq_needed = s; > > > needgp = true; > > > @@ -1692,6 +1701,7 @@ static void srcu_invoke_callbacks(struct work_struct *work) > > > ssp = sdp->ssp; > > > rcu_cblist_init(&ready_cbs); > > > spin_lock_irq_rcu_node(sdp); > > > + WARN_ON_ONCE(!rcu_segcblist_segempty(&sdp->srcu_cblist, RCU_NEXT_TAIL)); > > > rcu_segcblist_advance(&sdp->srcu_cblist, > > > rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); > > > if (sdp->srcu_cblist_invoking || > > > @@ -1720,8 +1730,6 @@ static void srcu_invoke_callbacks(struct work_struct *work) > > > */ > > > spin_lock_irq_rcu_node(sdp); > > > rcu_segcblist_add_len(&sdp->srcu_cblist, -len); > > > - (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, > > > - rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq)); > > > sdp->srcu_cblist_invoking = false; > > > more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist); > > > spin_unlock_irq_rcu_node(sdp);
On Tue, Oct 3, 2023 at 4:16 AM Frederic Weisbecker <frederic@kernel.org> wrote: > > Le Mon, Oct 02, 2023 at 07:51:10AM +0530, Neeraj upadhyay a écrit : > > > And if this works, can we then remove srcu_invoke_callbacks() self-requeue? > > > If queued several times before it actually fires, it will catch the latest > > > grace period's end. And if queued while the callback runs, it will re-run. > > > > > > > This makes sense, but not sure for non-wq context which link [1] mentions, > > whether it needs it. > > > > > Also why do we have sdp->srcu_invoke_callbacks ? Is that workqueue re-entrant? > > > > > I think you mean sdp->srcu_cblist_invoking ? > > > > There was a prior discussion on this [1], where Paul mentions about > > non-wq context. > > Actually I realize that if we remove that, then two srcu_invoke_callbacks() > can race in such a way that an srcu barrier callback may execute before the I wasn't aware that two srcu_invoke_callbacks() for same sdp can execute in parallel. My understanding was, any new callback will be handled by the same kworker, once it completes the current executing one [1]. Maybe I am missing your point here? [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/workqueue.c?h=v6.6-rc4#n1081 > callbacks it is supposed to follow. And if we must keep > sdp->srcu_cblist_invoking then we must also keep self-requeuing. > > I'll add a comment about that instead of removing. > > Thanks. > > > > > > > > > > Thanks > > Neeraj > > > > [1] https://lkml.org/lkml/2020/11/19/1065 > > > > > Thanks.
On Mon, Oct 02, 2023 at 11:09:39PM +0200, Frederic Weisbecker wrote: > > > spin_unlock_rcu_node(sdp); /* Interrupts remain disabled. */ > > > WRITE_ONCE(ssp->srcu_sup->srcu_gp_start, jiffies); > > > WRITE_ONCE(ssp->srcu_sup->srcu_n_exp_nodelay, 0); > > > @@ -1245,7 +1243,18 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp, > > > rcu_segcblist_advance(&sdp->srcu_cblist, > > > rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); > > > s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq); > > > - (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s); > > > + /* > > > + * Acceleration might fail if the preceding call to > > > + * rcu_segcblist_advance() also failed due to a prior grace > > > + * period seen incomplete before rcu_seq_snap(). If so then a new > > > + * call to advance will see the completed grace period and fix > > > + * the situation. > > > + */ > > > + if (!rcu_segcblist_accelerate(&sdp->srcu_cblist, s)) { > > > > We can add below also? Here old and new are rcu_seq_current() values used in > > the 2 calls to rcu_segcblist_advance(). > > > > WARN_ON_ONCE(!(rcu_seq_completed_gp(old, new) && rcu_seq_new_gp(old, new))); > > Very good point! "new" should be exactly one and a half grace period away from > "old", will add that. > > Cooking proper patches now. Actually this more simple fix below. rcu_seq_snap() can be called before rcu_segcblist_advance() after all. The only side effect is that callbacks advancing is then _after_ the full barrier in rcu_seq_snap(). I don't see an obvious problem with that as that barrier only cares about: 1) Ordering accesses of the update side before call_srcu() so they don't bleed 2) See all the accesses prior to the grace period of the current gp_num The only things callbacks advancing need to be ordered against are carried by snp locking. I still remove the accelerations elsewhere and advancing in srcu_gp_start() in further patches. I'll also avoid advancing and acceleration in srcu_gp_start_if_needed if there is no callback to queue. The point is also that this simple fix alone can be easily backported and the rest can come as cleanups. diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index 5602042856b1..8b09fb37dbf3 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -1244,10 +1244,10 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp, spin_lock_irqsave_sdp_contention(sdp, &flags); if (rhp) rcu_segcblist_enqueue(&sdp->srcu_cblist, rhp); + s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq); rcu_segcblist_advance(&sdp->srcu_cblist, rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); - s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq); - (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s); + WARN_ON_ONCE(!rcu_segcblist_accelerate(&sdp->srcu_cblist, s) && rhp); if (ULONG_CMP_LT(sdp->srcu_gp_seq_needed, s)) { sdp->srcu_gp_seq_needed = s; needgp = true;
On Tue, Oct 3, 2023 at 5:52 PM Frederic Weisbecker <frederic@kernel.org> wrote: > > On Mon, Oct 02, 2023 at 11:09:39PM +0200, Frederic Weisbecker wrote: > > > > spin_unlock_rcu_node(sdp); /* Interrupts remain disabled. */ > > > > WRITE_ONCE(ssp->srcu_sup->srcu_gp_start, jiffies); > > > > WRITE_ONCE(ssp->srcu_sup->srcu_n_exp_nodelay, 0); > > > > @@ -1245,7 +1243,18 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp, > > > > rcu_segcblist_advance(&sdp->srcu_cblist, > > > > rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); > > > > s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq); > > > > - (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s); > > > > + /* > > > > + * Acceleration might fail if the preceding call to > > > > + * rcu_segcblist_advance() also failed due to a prior grace > > > > + * period seen incomplete before rcu_seq_snap(). If so then a new > > > > + * call to advance will see the completed grace period and fix > > > > + * the situation. > > > > + */ > > > > + if (!rcu_segcblist_accelerate(&sdp->srcu_cblist, s)) { > > > > > > We can add below also? Here old and new are rcu_seq_current() values used in > > > the 2 calls to rcu_segcblist_advance(). > > > > > > WARN_ON_ONCE(!(rcu_seq_completed_gp(old, new) && rcu_seq_new_gp(old, new))); > > > > Very good point! "new" should be exactly one and a half grace period away from > > "old", will add that. > > > > Cooking proper patches now. > > Actually this more simple fix below. rcu_seq_snap() can be called before > rcu_segcblist_advance() after all. The only side effect is that callbacks > advancing is then _after_ the full barrier in rcu_seq_snap(). I don't see > an obvious problem with that as that barrier only cares about: > > 1) Ordering accesses of the update side before call_srcu() so they don't bleed > 2) See all the accesses prior to the grace period of the current gp_num > > The only things callbacks advancing need to be ordered against are carried by > snp locking. > Nice! Your analysis looks good to me! > I still remove the accelerations elsewhere and advancing in srcu_gp_start() in > further patches. I'll also avoid advancing and acceleration in > srcu_gp_start_if_needed if there is no callback to queue. > > The point is also that this simple fix alone can be easily backported and > the rest can come as cleanups. > Sounds good! > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > index 5602042856b1..8b09fb37dbf3 100644 > --- a/kernel/rcu/srcutree.c > +++ b/kernel/rcu/srcutree.c > @@ -1244,10 +1244,10 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp, > spin_lock_irqsave_sdp_contention(sdp, &flags); > if (rhp) > rcu_segcblist_enqueue(&sdp->srcu_cblist, rhp); > + s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq); We might want to add a comment here, why the specific ordering of the two srcu_gp_seq reads is required here. Thanks Neeraj > rcu_segcblist_advance(&sdp->srcu_cblist, > rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq)); > - s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq); > - (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s); > + WARN_ON_ONCE(!rcu_segcblist_accelerate(&sdp->srcu_cblist, s) && rhp); > if (ULONG_CMP_LT(sdp->srcu_gp_seq_needed, s)) { > sdp->srcu_gp_seq_needed = s; > needgp = true;
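To spell out why the ordering of the two srcu_gp_seq reads matters (e.g. for the comment suggested above), here is the same made-up numeric scenario as in the earlier sketch, run through both orderings; snap() mirrors rcu_seq_snap() from kernel/rcu/rcu.h, the rest is illustration only:

#include <stdio.h>

#define RCU_SEQ_STATE_MASK	3UL

static unsigned long snap(unsigned long seq)
{
	return (seq + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK;
}

int main(void)
{
	unsigned long before = 285;	/* seq when srcu_gp_start_if_needed() starts */
	unsigned long after = 289;	/* seq once a GP ended and a new one started */

	/* Old ordering: the advance sees "before", the snapshot is taken from "after". */
	printf("old ordering: advance at %lu, accelerate to %lu\n",
	       before, snap(after));

	/* Fixed ordering: the snapshot is taken first and can never outrun the advance. */
	printf("new ordering: accelerate to %lu, advance at %lu\n",
	       snap(before), after);
	return 0;
}

In the first case the acceleration targets 296 while the advance only made room for what seq 285 implied; in the second it targets 292, never more than one GP beyond whatever the advance will see, so per the analysis above the acceleration always has a slot (hence the WARN_ON_ONCE() in the patch).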
On Fri, Sep 29, 2023 at 5:39 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > Hello, > Firstly, kudos to the detailed report and analysis. Rare failures are > hard and your usage crash/kdump is awesome to dig deeper into the > issue.. > > On Thu, Sep 28, 2023 at 3:59 AM zhuangel570 <zhuangel570@gmail.com> wrote: > > > > Hi, > > > > We encounter SRCU hung issue in stable tree 5.4.203, we are running VM create > > and destroy concurrent test, the issue happens after several weeks. Now we > > didn't have a way to reproduce this issue, the issue happens randomly, this > > is the second time we found it in this year. > > > > We did some investigation with crash tools, hope to get some useful clues > > about the issue, try to find root cause of the issue (further find a way to > > fix or workaround it), here is some of our investigation, hope to get > > feedback from SRCU experts. > > > > [3144162.608392] INFO: task kworker/136:12:252684 blocked for more > > than 122 seconds. > > [3144162.615986] Tainted: G O K 5.4.203-1-tlinux4-0011.1 #1 > > [3144162.623053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > > disables this message. > > [3144162.631162] kworker/136:12 D 0 252684 2 0x90004000 > > [3144162.631189] Workqueue: kvm-irqfd-cleanup irqfd_shutdown [kvm] > > [3144162.631192] Call Trace: > > [3144162.631202] __schedule+0x2ee/0x660 > > [3144162.631206] schedule+0x33/0xa0 > > [3144162.631209] schedule_timeout+0x1c4/0x340 > > [3144162.631214] ? update_load_avg+0x82/0x660 > > [3144162.631217] ? raw_spin_rq_lock_nested+0x1f/0x30 > > [3144162.631218] wait_for_completion+0x119/0x180 > > [3144162.631220] ? wake_up_q+0x80/0x80 > > [3144162.631224] __synchronize_srcu.part.19+0x81/0xb0 > > [3144162.631226] ? __bpf_trace_rcu_utilization+0x10/0x10 > > [3144162.631227] synchronize_srcu+0x5f/0xc0 > > [3144162.631236] irqfd_shutdown+0x3c/0xb0 [kvm] > > [3144162.631239] ? __schedule+0x2f6/0x660 > > [3144162.631243] process_one_work+0x19a/0x3a0 > > [3144162.631244] worker_thread+0x37/0x3a0 > > [3144162.631247] kthread+0x117/0x140 > > [3144162.631247] ? process_one_work+0x3a0/0x3a0 > > [3144162.631248] ? __kthread_cancel_work+0x40/0x40 > > [3144162.631250] ret_from_fork+0x1f/0x30 > > > > Test workload: > > - Repeated concurrency testing, include Create VM, service probe, destroy VM. > > - VM using virito device, so kvm->irq_srcu is using frequently. > > - x86_64 Machine with 191 core. > > > > Crash analysis clues: > > - The hung task is run on CPU 136, waiting synchronize_srcu to be completed. > > - The current GP of SRCU struct is 304 (srcu_gp_seq and srcu_gp_seq_needed). > > - The SNP (node) of SRCU struct has 2 levels. > > - The SNP (node) of SRCU struct show all SNP now in GP 304 (srcu_have_cbs). > > - The SNP (node) of SRCU struct show CPU 136 belongs to group [128,143]. > > - The SDP (per-cpu) of SRCU struct show lock and unlock matches. > > - The SDP (per-cpu) of CPU 136 segcblist has one entry in pending list. > > - The pending entry is just the calling entry of hung kwoker task. > > - The SDP (per-cpu) of CPU 136 is waiting GP 288, gp_seq of segcblist is 296. > > - The SDP (per-cpu) of CPU 136 segcblist has no entry in DONE list. > > - The work-queue of SRCU and SDP of CPU 136 is idle. > > - More details could be found via reference at the end of this mail. 
> > > > crash> struct srcu_struct.srcu_idx,srcu_gp_seq,srcu_gp_seq_needed,srcu_gp_seq_needed_exp,sda > > 0xffffc900f100e878 > > srcu_idx = 76 > > srcu_gp_seq = 304 > > srcu_gp_seq_needed = 304 > > srcu_gp_seq_needed_exp = 304 > > sda = 0x5fff9ac17800 > > > > crash> struct srcu_data 0x5fff9ac17800:136 > > [136]: ffffe8bee9817800 > > ... > > srcu_cblist = { > > head = 0xffffc900e8413dd0, > > tails = {0xffffe8bee9817848, 0xffffc900e8413dd0, > > 0xffffc900e8413dd0, 0xffffc900e8413dd0}, > > gp_seq = {0, 296, 296, 0}, > > I am actually a bit puzzled why the DONE segment count is 0. I would > have expected it to be >= 296 in this situation. The gp_seq of segcblist will only be updated in rcu_segcblist_accelerate and rcu_segcblist_advance, and the update logic does not include DONE and NEXT tail, so I think the seq of DONE and NEXT should always be zero. > > > len = 1, > > len_lazy = 0, > > enabled = 1 '\001', > > offloaded = 0 '\000' > > }, > > srcu_gp_seq_needed = 288, > > srcu_gp_seq_needed_exp = 0, > > srcu_cblist_invoking = false, > > ... > > > > Root cause (we can't reproduce the issue, so the result is base on crash > > analysis and code analysis): > > - The leaked entry on segcblist has incorrect gp_seq (SDP is waiting 288). > > - The leaked entry on segcblist seems has been "accelerated". > > - The incorrect accelerate maybe caused by srcu_invoke_callbacks, because > > srcu_invoke_callbacks may be using incorrect GP seq doing accelerate. > > - Here are two emulated case may cause the "accelerate issue": > > > > > > CASE1: entry of CPU 136 belongs to GP 288 was accelerated to GP 292 > > - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks. > > - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry > > to SDP. > > - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run > > faster than CPU 136, start GP, set rcu_seq_start. > > - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need > > to start GP. > > - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks, > > "accelerate" the seq of new added entry to 292 (it should be 288). > > But srcu_gp_seq is at 304 right now. How does it matter that the CB is > marked for 292? It should be ready to execute anyway even at 292. Note > the meaning of "acceleration", the idea is to start conservatively and > move the callbacks forward as more accurate information is available. > Considering this, 292 initially should be OK IMHO (that's just more > conservative than 288).. As Neeraj explained, the callback is manged under SDP (per-CPU), but CBs are managed under SNP (group of CPU). > > > CASE2: entry of CPU 136 belongs to GP 288 was accelerated to GP 296 > > - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks. > > - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry > > to SDP. > > - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run > > faster than CPU 136, start GP. > > - [CPU 041] [GP 284-288] finished, workload run faster, finish GP. > > - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need > > to start GP. > > - [CPU 017] [GP 288-292] starting, new synchronize_srcu request, start GP, > > set rcu_seq_start. > > - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks, > > "accelerate" new added entry to 296 (it should be 288). > > I had the same analysis here. 
> > > /* > > * CASE1 > > * - entry of CPU 136 expecting GP 288 was accelerated to GP 292 > > * - GP0: 280-284, GP1: 284-288. > > */ > > > > /* [GP0][CPU-136] */ > > process_srcu { > > srcu_gp_end > > } > > /* [GP1][CPU-136] */ > > synchronize_srcu { > > __call_srcu { > > rcu_segcblist_enqueue > > /* [GP1][CPU-041] */ > > synchronize_srcu { > > __call_srcu { > > srcu_funnel_gp_start > > srcu_gp_start > > rcu_seq_start > > } > > } > > /* [GP1][CPU-136] */ > > srcu_funnel_gp_start > > } > > } > > /* [GP0][CPU-136] */ > > srcu_invoke_callbacks { > > rcu_segcblist_accelerate > > } > > > > > > /* > > * CASE2 > > * - entry of CPU 136 belongs to GP 288 was accelerated to GP 296. > > * - GP0: 280-284, GP1: 284-288, GP2: 288-292. > > */ > > > > /* [GP0][CPU-136] */ > > process_srcu { > > srcu_gp_end > > } > > > > /* [GP1][CPU-136] */ > > synchronize_srcu { > > __call_srcu { > > rcu_segcblist_enqueue > > /* [GP1][CPU-041] */ > > synchronize_srcu { > > __call_srcu { > > srcu_funnel_gp_start > > srcu_gp_start > > } > > } > > process_srcu { > > srcu_gp_end > > rcu_seq_end > > } > > /* [GP1][CPU-136] */ > > srcu_funnel_gp_start > > } > > } > > /* [GP0][CPU-136] */ > > srcu_invoke_callbacks { > > If srcu_invoke_callbacks() was really called for the rdp, I would have > expected rcu_segcblist_advance() to advance all those pending > callbacks to 304. > > I posit that probably srcu_invoke_callbacks() is not even being called > in the first place, otherwise why is the DONE segment count still 0? > DONE should not be 0, otherwise callbacks will not run. The segcblist > of the rdp seems out of date perhaps due to rcu_invoke_callbacks() not > being called due to some other bug (or a workqueue/timer issue causing > the srcu_invoke_callbacks() to not run?). There are no other system error messages and the system is running fine except for KVM related operations such as destroying the VM (this is because the SRCU hang is blocking the KVM global work queue). So I'd rather think there's a bug in SRCU and the analysis shows there may be a race issue. > > I wonder if you would consider trying to reproduce it with the > rcutorture module's SRCU testing in the chance that it occurs with it > on your setup. > > thanks, > > - Joel > > > > /* [GP2][CPU-017] */ > > synchronize_srcu { > > __call_srcu { > > srcu_funnel_gp_start > > srcu_gp_start > > rcu_seq_start > > } > > } > > /* [GP0][CPU-136] */ > > rcu_segcblist_accelerate > > } > > > > > > Potential fix: > > We shouldn't accelerate the segcblist in srcu_invoke_callbacks base snapshot > > of current gp seq of SRCU struct, how about accelerate the sda base > > sdp->srcu_gp_seq_needed, maybe like this: > > > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > index 21115ffb6c44..ff8579abc924 100644 > > --- a/kernel/rcu/srcutree.c > > +++ b/kernel/rcu/srcutree.c > > @@ -1197,7 +1197,7 @@ static void srcu_invoke_callbacks(struct > > work_struct *work) > > spin_lock_irq_rcu_node(sdp); > > rcu_segcblist_insert_count(&sdp->srcu_cblist, &ready_cbs); > > (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, > > - rcu_seq_snap(&ssp->srcu_gp_seq)); > > + rcu_seq_snap(&sdp->srcu_gp_seq_needed)); > > sdp->srcu_cblist_invoking = false; > > more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist); > > spin_unlock_irq_rcu_node(sdp); > > I think "rcu_seq_snap(&ssp->srcu_gp_seq))" is correct IMHO, > rcu_seq_snap() says which what's the gp_seq at which a full GP has > passed from the moment rcu_seq_snap() is called. 
> That's precisely when it is safe to call all those non-accelerated callbacks. > > Basically, if any of the bits are set in the state portion of the > sequence of the gp_seq number, then it returns (COUNTER_PORTION + 2) > << RCU_SEQ_CTR_SHIFT. > Otherwise, it returns (COUNTER_PORTION + 1) << RCU_SEQ_CTR_SHIFT. > > > > > > > > > > > Thanks for reading! > > > > > > Reference detail log of crash dump information: > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/bt.f.252684.log > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.partial.log > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.raw.log > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.lock.log > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.raw.log
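The same rcu_seq_snap() mirror used in the sketches above (without the memory barrier) shows the two cases Joel describes, with RCU_SEQ_CTR_SHIFT == 2 and arbitrary input values:

#include <stdio.h>

#define RCU_SEQ_CTR_SHIFT	2
#define RCU_SEQ_STATE_MASK	((1UL << RCU_SEQ_CTR_SHIFT) - 1)

static unsigned long rcu_seq_snap(unsigned long seq)
{
	return (seq + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK;
}

int main(void)
{
	/* No state bits set (idle at 288): (ctr + 1) << shift == 292. */
	printf("snap(288) = %lu\n", rcu_seq_snap(288));

	/* State bits set (GP in flight at 289): (ctr + 2) << shift == 296. */
	printf("snap(289) = %lu\n", rcu_seq_snap(289));
	return 0;
}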
On Sun, Oct 1, 2023 at 10:27 AM Neeraj upadhyay <neeraj.iitr10@gmail.com> wrote: > > On Sun, Oct 1, 2023 at 5:49 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > On Sat, Sep 30, 2023 at 6:01 AM Neeraj upadhyay <neeraj.iitr10@gmail.com> wrote: > > > > > > On Fri, Sep 29, 2023 at 3:35 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > Hello, > > > > Firstly, kudos to the detailed report and analysis. Rare failures are > > > > hard and your usage crash/kdump is awesome to dig deeper into the > > > > issue.. > > > > > > > > On Thu, Sep 28, 2023 at 3:59 AM zhuangel570 <zhuangel570@gmail.com> wrote: > > > > > > > > > > Hi, > > > > > > > > > > We encounter SRCU hung issue in stable tree 5.4.203, we are running VM create > > > > > and destroy concurrent test, the issue happens after several weeks. Now we > > > > > didn't have a way to reproduce this issue, the issue happens randomly, this > > > > > is the second time we found it in this year. > > > > > > > > > > > > > > > > > > > > CASE1: entry of CPU 136 belongs to GP 288 was accelerated to GP 292 > > > > > - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks. > > > > > - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry > > > > > to SDP. > > > > > - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run > > > > > faster than CPU 136, start GP, set rcu_seq_start. > > > > > - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need > > > > > to start GP. > > > > > - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks, > > > > > "accelerate" the seq of new added entry to 292 (it should be 288). > > > > > > > > But srcu_gp_seq is at 304 right now. How does it matter that the CB is > > > > marked for 292? It should be ready to execute anyway even at 292. Note > > > > the meaning of "acceleration", the idea is to start conservatively and > > > > move the callbacks forward as more accurate information is available. > > > > Considering this, 292 initially should be OK IMHO (that's just more > > > > conservative than 288).. > > > > > > > > > > Maybe it matters, as for a CPU, the callbacks will only be scheduled > > > in srcu_gp_end() for the GPs, for which it has updated ->srcu_data_have_cbs[idx] > > > and ->srcu_have_cbs[idx] > > > > Right but if I am looking at the code correctly, nothing guarantees > > that srcu_invoke_callbacks is called before srcu_gp_seq can advance. > > So all callbacks that were previously queued for older grace periods > > should be run whenever srcu_invoke_callbacks() eventually runs. That's > > why I was thinking that part looked normal to me (segments having > > older GP numbers). > > > > > > > > > > > > /* > > > > > * CASE2 > > > > > * - entry of CPU 136 belongs to GP 288 was accelerated to GP 296. > > > > > * - GP0: 280-284, GP1: 284-288, GP2: 288-292. 
> > > > > */ > > > > > > > > > > /* [GP0][CPU-136] */ > > > > > process_srcu { > > > > > srcu_gp_end > > > > > } > > > > > > > > > > /* [GP1][CPU-136] */ > > > > > synchronize_srcu { > > > > > __call_srcu { > > > > > rcu_segcblist_enqueue > > > > > /* [GP1][CPU-041] */ > > > > > synchronize_srcu { > > > > > __call_srcu { > > > > > srcu_funnel_gp_start > > > > > srcu_gp_start > > > > > } > > > > > } > > > > > process_srcu { > > > > > srcu_gp_end > > > > > rcu_seq_end > > > > > } > > > > > /* [GP1][CPU-136] */ > > > > > srcu_funnel_gp_start > > > > > } > > > > > } > > > > > /* [GP0][CPU-136] */ > > > > > srcu_invoke_callbacks { > > > > > > > > If srcu_invoke_callbacks() was really called for the rdp, I would have > > > > expected rcu_segcblist_advance() to advance all those pending > > > > callbacks to 304. > > > > > > If the callback is in NEXT_TAIL and not assigned GP num, > > > rcu_segcblist_advance() won't move it and next accelerate in > > > srcu_invoke_callbacks() will > > > assign it the next gp sequence num. > > > > Sure, and after that again it will call srcu_schedule_cbs_sdp() so > > that should be fine and the next workqueue invocation > > srcu_invoke_callbacks() can advance at that time. Right? > > > > But "more" only checks for CBs in DONE tail. The callbacks which have been just > accelerated are not advanced to DONE tail. > > Having said above, I am still trying to figure out, which callbacks > are actually being pointed > by NEXT tail. Given that __call_srcu() already does a advance and > accelerate, all enqueued > callbacks would be in either WAIT tail (the callbacks for current > active GP) or NEXT_READY > tail (the callbacks for next GP after current one completes). So, they > should already have > GP num assigned and srcu_invoke_callbacks() won't accelerate them. > Only case I can > think of is, if current GP completes after we sample > rcu_seq_current(&ssp->srcu_gp_seq) for > rcu_segcblist_advance() (so, WAIT tail cbs are not moved to DONE tail) > and a new GP is started > before we take snapshot ('s') of next GP for > rcu_segcblist_accelerate(), then the gp num 's' > > gp num of NEXT_READY_TAIL and will be put in NEXT tail. Not sure > if my understanding is correct here. Thoughts? > > __call_srcu() > rcu_segcblist_advance(&sdp->srcu_cblist, > rcu_seq_current(&ssp->srcu_gp_seq)); > s = rcu_seq_snap(&ssp->srcu_gp_seq); > (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s); Thanks for helping to confirm this! What I want to explain in my CASE (maybe is not totally correct, I "created" it which base the hung context), is rcu_segcblist_advance and rcu_segcblist_accelerate in srcu_invoke_callbacks maybe using not-matched GP seq. And this makes the callbacks "accelerate" to wrong GP, which has no CBS event on that SNP (CPU groups), then callback leaked in segcglist, cause __synchronize_srcu hang. > > Thanks > Neeraj > > > > if (more) { > > srcu_schedule_cbs_sdp(sdp, 0); > > } > > > > thanks, > > > > - Joel
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 21115ffb6c44..ff8579abc924 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -1197,7 +1197,7 @@ static void srcu_invoke_callbacks(struct work_struct *work)
 	spin_lock_irq_rcu_node(sdp);
 	rcu_segcblist_insert_count(&sdp->srcu_cblist, &ready_cbs);
 	(void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
-				       rcu_seq_snap(&ssp->srcu_gp_seq));
+				       rcu_seq_snap(&sdp->srcu_gp_seq_needed));
 	sdp->srcu_cblist_invoking = false;
 	more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist);