Message ID | CA+=Fv5R9NG+1SHU9QV9hjmavycHKpnNyerQ=Ei90G98ukRcRJA@mail.gmail.com (mailing list archive) |
---|---|
State | New |
Series | Kernel Oops on alpha with kernel version >=6.9.x |
On Sat, Nov 30, 2024 at 11:22:45PM +0100, Magnus Lindholm wrote:
> Hi,
>
> First some background:
> I've been trying to boot recent kernels on my alpha machines. Anything
> after linux-6.8.12 gives me trouble. After doing a kernel bisect, I
> found that commit 9187210eee7d87eea37b45ea93454a88681894a4
> (net-next-6.9) is where my troubles begin. The problem consists in
> that the boot process gets stuck when trying to set parameters for
> network interfaces. The bad commit does make a lot of updates to the
> network code.
>
> When booting the system with kernel 6.12.0 I'm able to boot into
> single-user mode, but when starting system services one by one I
> trigger a kernel Oops when the network interface is renamed (see stack
> dump below). Looking at the changes made by the bad commit, it seems
> to (among other things) be replacing the locking mechanism (RCU
> instead of rtnl_lock). The stack dump from the kernel Oops suggests
> that something is happening in the RCU locking code. I'm no expert on
> RCU-stuff but I read somewhere that it is done by volatile access on
> all systems other than DEC Alpha, where a memory barrier instruction
> is required. This indicates that the change could affect Alpha
> architecture differently? Inspecting the changes to networking code in
> the bad commit, particularly the changes made to net/core/dev.c, I put
> together the patch below. This patch reverts one of the lines changed
> in the "bad commit" for net/core/dev.c. After reverting the change on
> just this line, I'm able to boot kernel 6.12.0 on my Alpha ES-40 to
> full multi-user again. I've tested this on an Alpha ES40 and an
> UP2000+ and the problem is 100% reproducible on both machines.
>
> The patch might not be a real solution to the problem but could be a good
> place to start looking when figuring out what's really going on. The feedback
> I've gotten so far (forums and the netdev mailing list) is that the
> RCU implementation on alpha is probably where things go wrong.

Does booting with the "rcupdate.rcu_normal=1" kernel boot parameter
also suppress the problem?

That "pc =" down below is the program counter? If so, I am at a loss
as to what RCU could do to make it be zero.

							Thanx, Paul

> ------------------------------------
> Patch to "fix" the problem:
> -----------------------------------
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 13d00fc10f55..26fda14367e5 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1261,7 +1261,7 @@ int dev_change_name(struct net_device *dev, const char *newname)
>
>  	netdev_name_node_del(dev->name_node);
>
> -	synchronize_net();
> +	synchronize_rcu();
>
>  	netdev_name_node_add(net, dev->name_node);
>
>
> --------------------------
> dmesg/kernel log:
> -------------------------
>
> [ 93.431592] tulip 0000:01:02.0 enp1s2: renamed from eth0
>
> [ 93.436475] Unable to handle kernel paging request at virtual address 0000000000000000
> [ 93.436475] CPU 1
> [ 93.436475] rcu_exp_gp_kthr(17): Oops -1
> [ 93.436475] pc = [<0000000000000000>]  ra = [<0000000000000000>]  ps = 0000  Not tainted
> [ 93.436475] pc is at 0x0
> [ 93.436475] ra is at 0x0
> [ 93.436475] v0 = 0000000000000007  t0 = fffffc0000e62440  t1 = 0000000000000001
> [ 93.436475] t2 = 0000000000000000  t3 = 0000000000000001  t4 = 0000000000000001
> [ 93.436475] t5 = 0000000000000001  t6 = 0000000000000001  t7 = fffffc0003138000
> [ 93.436475] s0 = fffffc0000e62440  s1 = fffffc0000ec3a10  s2 = fffffc0000ec3a10
> [ 93.436475] s3 = fffffc0000ec3a10  s4 = fffffc00003a90f0  s5 = fffffc0000e62440
> [ 93.436475] s6 = 0000000000000000
> [ 93.436475] a0 = 0000000000000000  a1 = 0000000000000000  a2 = 0000000000000000
> [ 93.436475] a3 = 0000000000000000  a4 = 0000000000000001  a5 = fffffc0000517744
> [ 93.436475] t8 = 0000000000000001  t9 = 0000000000000001  t10= fffffc0000e3d320
> [ 93.436475] t11= fffffc0000220240  pv = fffffc0000b73210  at = 0000000000000000
> [ 93.436475] gp = fffffc0000eb3a10  sp = 00000000ea2ea184
> [ 93.436475] Disabling lock debugging due to kernel taint
> [ 93.436475] Trace:
> [ 93.436475] [<fffffc00003aee60>] wait_rcu_exp_gp+0x30/0xa0
> [ 93.436475] [<fffffc0000b6c200>] __cond_resched+0x30/0x90
> [ 93.436475] [<fffffc00003569b8>] kthread_worker_fn+0xc8/0x1f0
> [ 93.436475] [<fffffc000035863c>] kthread+0x17c/0x1c0
> [ 93.436475] [<fffffc00003568f0>] kthread_worker_fn+0x0/0x1f0
> [ 93.436475] [<fffffc0000311128>] ret_from_kernel_thread+0x18/0x20
>
> [ 93.436475] Code:
> [ 93.436475] 00000000
> [ 93.436475] 00000000
> [ 93.436475] 00063301
> [ 93.436475] 0000077c
> [ 93.436475] 00001111
> [ 93.436475] 000022a2
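
On the Alpha memory-ordering point raised in the report: the RCU read side depends on a reader seeing the initialized contents of a structure once it sees the published pointer. Below is a minimal user-space sketch of that publish/read pattern using C11 atomics. The names `struct cfg`, `publish_cfg`, and `read_mtu` are hypothetical, and the acquire load is only an approximation of `rcu_dereference()`; as far as I know, the kernel relies on address-dependency ordering on most architectures and adds an explicit barrier on Alpha. Nothing in this thread establishes that this ordering is what actually fails here.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical payload that a writer initializes before publishing. */
struct cfg {
	int mtu;
};

/* Shared pointer; NULL until something has been published. */
static _Atomic(struct cfg *) current_cfg;

/*
 * Writer side, loosely modelling rcu_assign_pointer(): the release store
 * orders the initialization of *c before the pointer becomes visible.
 */
static void publish_cfg(struct cfg *c)
{
	atomic_store_explicit(&current_cfg, c, memory_order_release);
}

/*
 * Reader side, loosely modelling rcu_dereference(): the acquire load
 * orders the pointer load before the dependent read of c->mtu.
 * Returns -1 if nothing has been published yet.
 */
static int read_mtu(void)
{
	struct cfg *c = atomic_load_explicit(&current_cfg, memory_order_acquire);

	return c ? c->mtu : -1;
}
```

The sketch uses acquire rather than `memory_order_consume` because current compilers generally treat consume as acquire anyway; it is stronger than what a dependency-ordered load strictly requires.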
On Sun, Dec 1, 2024 at 5:31 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> Does booting with the "rcupdate.rcu_normal=1" kernel boot parameter
> also suppress the problem?

Setting rcupdate.rcu_normal=1 also suppresses the problem. I guess this
makes the RCU code do synchronize_rcu_normal() instead of the full
synchronize_rcu_expedited(), which is where I get the kernel Oops.

> That "pc =" down below is the program counter? If so, I am at a loss
> as to what RCU could do to make it be zero.

Not sure why this happens. Could the RCU code be passing around pointers
to a worker function, one of which somehow ends up being a null pointer
on the Alpha?

/Magnus
diff --git a/net/core/dev.c b/net/core/dev.c
index 13d00fc10f55..26fda14367e5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1261,7 +1261,7 @@ int dev_change_name(struct net_device *dev, const char *newname)
 
 	netdev_name_node_del(dev->name_node);
 
-	synchronize_net();
+	synchronize_rcu();
 
 	netdev_name_node_add(net, dev->name_node);
 
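
One possible reading of why both the one-line patch and rcupdate.rcu_normal=1 avoid the Oops is that both steer the rename path away from the expedited grace-period machinery seen in the trace (rcu_exp_gp_kthr, wait_rcu_exp_gp). The following is a user-space model of that decision, not kernel source. It assumes that synchronize_net() picks the expedited variant when RTNL is held, that the rename path runs with RTNL held, and that synchronize_rcu_expedited() falls back to a normal grace period when rcu_normal is set. All `model_*` names and `struct model_state` are hypothetical, and the rcupdate.rcu_expedited parameter is ignored for simplicity.

```c
#include <stdbool.h>

/* Which grace-period flavor a call ends up waiting for. */
enum gp_kind { GP_NORMAL, GP_EXPEDITED };

/* Hypothetical stand-ins for the kernel state that drives the choice. */
struct model_state {
	bool rtnl_locked;	/* models rtnl_is_locked() */
	bool rcu_normal;	/* models booting with rcupdate.rcu_normal=1 */
};

/* Models synchronize_rcu(); the rcu_expedited knob is left out. */
static enum gp_kind model_synchronize_rcu(const struct model_state *s)
{
	(void)s;
	return GP_NORMAL;
}

/* Models synchronize_rcu_expedited(): rcu_normal forces the normal path. */
static enum gp_kind model_synchronize_rcu_expedited(const struct model_state *s)
{
	return s->rcu_normal ? GP_NORMAL : GP_EXPEDITED;
}

/* Models synchronize_net(): expedited when RTNL is held, normal otherwise. */
static enum gp_kind model_synchronize_net(const struct model_state *s)
{
	if (s->rtnl_locked)
		return model_synchronize_rcu_expedited(s);
	return model_synchronize_rcu(s);
}
```

Under these assumptions, a rename with RTNL held and no boot parameter reaches the expedited path through model_synchronize_net(), while either calling model_synchronize_rcu() directly (the patch) or setting rcu_normal (the boot parameter) yields a normal grace period. That would match both observations in the thread without explaining why the expedited path itself faults on Alpha.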