Message ID | 20180731134603.20089-1-jthumshirn@suse.de (mailing list archive)
---|---
Series | scsi: fcoe: memleak fixes
Ard,

Did you have time to test these patches?

Byte,
Johannes
Hi Johannes,

On Mon, Aug 06, 2018 at 11:25:03AM +0200, Johannes Thumshirn wrote:
> Ard,
>
> Did you have time to test these patches?

Ahh, yeah, sorry. I did test them. I can't say I see a regression. I also can't say I see progress, although it feels like there is a tad less leaking. For that I need to do some statistics.

But for now I ack that it does not regress further.
On Mon, Aug 06, 2018 at 03:22:34PM +0200, ard wrote:
> Ahh, yeah, sorry.
> I did test them.
> I can't say I see a regression.
> I also can't say I see progress, although it feels like there is
> a tad less leaking.
> For that I need to do some statistics.
>
> But for now I ack that it does not regress further.

From what I've seen in your kmemleak reports, it wasn't only libfc/fcoe that was leaking memory.

So kmemleak reports are welcome (especially if they confirm libfc/fcoe is clean now ;-)).

@Martin can you stage these patches for 4.19?
Hi Johannes,

On Mon, Aug 06, 2018 at 03:27:25PM +0200, Johannes Thumshirn wrote:
> On Mon, Aug 06, 2018 at 03:22:34PM +0200, ard wrote:
> > Ahh, yeah, sorry.
> > I did test them.
> > I can't say I see a regression.
> > I also can't say I see progress, although it feels like there is
> > a tad less leaking.
> > For that I need to do some statistics.
> >
> > But for now I ack that it does not regress further.
>
> From what I've seen in your kmemleak reports, it wasn't only
> libfc/fcoe that was leaking memory.
>
> So kmemleak reports are welcome (especially if they confirm libfc/fcoe
> is clean now ;-)).

I've updated the ticket:
https://github.com/hardkernel/linux/issues/360#issuecomment-410721997
dmesg (with full debugging):
https://github.com/hardkernel/linux/files/2262858/dmesg-2018-08-06.txt
and kmemleak:
https://github.com/hardkernel/linux/files/2262861/kmemleak-2018-08-06.txt

I'll have to take a good look at the what and why. I'm gonna git clean -x -d and rebuild the kernel, because this is weird.

This is the tree I build:
https://github.com/ardje/linux/tree/gb-4.14-fcoe-patch
So vanilla linux-4.14 with the patch set:
https://github.com/ardje/linux/commit/747bf04057a99be5b01f768654cfd61bc9f4fc6c
Meh, I should try to use git for patching.

I will spam some printk's later on, just to check which flows are followed (if the git clean doesn't reveal some problem, although I can't imagine it will).

Regards,
Ard
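For anyone following along: the kmemleak reports linked above come from the standard kmemleak debugfs interface. A minimal sketch of the capture workflow, assuming the kernel was built with CONFIG_DEBUG_KMEMLEAK=y:

    # mount debugfs if it is not mounted yet
    mount -t debugfs none /sys/kernel/debug 2>/dev/null
    # force an immediate scan instead of waiting for the periodic thread
    echo scan > /sys/kernel/debug/kmemleak
    # dump the current list of suspected leaks, backtraces included
    cat /sys/kernel/debug/kmemleak
    # clear the list so the next scan only shows newly appeared suspects
    echo clear > /sys/kernel/debug/kmemleak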
On Mon, Aug 06, 2018 at 04:24:14PM +0200, ard wrote:
> I've updated the ticket:
> https://github.com/hardkernel/linux/issues/360#issuecomment-410721997
> dmesg (with full debugging):
> https://github.com/hardkernel/linux/files/2262858/dmesg-2018-08-06.txt
> and kmemleak:
> https://github.com/hardkernel/linux/files/2262861/kmemleak-2018-08-06.txt
>
> I'll have to take a good look at the what and why. I'm gonna git clean
> -x -d and rebuild the kernel, because this is weird.
>
> This is the tree I build:
> https://github.com/ardje/linux/tree/gb-4.14-fcoe-patch
> So vanilla linux-4.14 with the patch set:
> https://github.com/ardje/linux/commit/747bf04057a99be5b01f768654cfd61bc9f4fc6c
> Meh, I should try to use git for patching.
>
> I will spam some printk's later on, just to check which flows are
> followed (if the git clean doesn't reveal some problem, although
> I can't imagine it will).

OK, now this is weird. Are you seeing this on the initiator or on the target side? Also on x86_64 or just the odroid? I could reproduce your reports in my virtualized environment [1][2] by issuing deletes from the initiator side.

[1] https://github.com/rapido-linux/rapido
[2] https://github.com/rapido-linux/rapido/blob/master/fcoe_local_autorun.sh

Byte,
Johannes
Hi,

On Tue, Aug 07, 2018 at 08:54:00AM +0200, Johannes Thumshirn wrote:
> OK, now this is weird. Are you seeing this on the initiator or on the
> target side? Also on x86_64 or just the odroid? I could reproduce your
> reports in my virtualized environment [1][2] by issuing deletes from
> the initiator side.

Yes it is weird, and it is even more weird when I looked at the collectd statistics: the memory leak was almost non-existent on my test odroid with the PC turned off. When I turn it back on, it rises to 150MB/day. So it seems you need at least some party.

The most important thing to realise: this is pure vn2vn chatter. There is no traffic going from or to the test odroid (to the test PC there is some). If I disable the FCoE vlan on the switch port, the chatter *and* the memory leak vanish.

Meh, this report needs a better place than just e-mail, I've got a few nice graphs to show. But here is an overview of my FCoE vlan (sorted by hand):

(GS724Tv4) #show mac-addr-table vlan 11
Address Entries Currently in Use............... 89

 MAC Address        Interface  Status
 -----------------  ---------  ------------
 00:1E:06:30:05:50  g4         odroid4  Xu4/exynos 5422/4.4.0-rc6   stable   (330 days up)
 0E:FD:00:00:05:50  g4         Learned
 00:1E:06:30:04:E0  g6         odroid6  Xu4/exynos 5422/4.9.28      stable   (330 days up)
 0E:FD:00:00:04:E0  g6         Learned
 00:1E:06:30:05:52  g7         odroid7  Xu4/exynos 5422/4.14.55     leaking  (150MB leak/day)
 0E:FD:00:00:05:52  g7         Learned
 00:0E:0C:B0:68:37  g14        storage  SS4000E/Xscale 80219/3.7.1  stable   (295 days up)
 0E:FD:00:00:68:37  g14        Learned
 00:14:FD:16:DD:50  g15        thecus1  n4200eco/D525/4.3.0         stable   (295 days up)
 0E:FD:00:00:DD:50  g15        Learned
 00:24:1D:7F:40:88  g17        antec    PC/i7-920/4.14.59           leaking
 0E:FD:00:00:40:88  g17        Learned

The systems on g14 and g15 are both long-time targets. g4, g6 and g7 (my production server is on g5 with FCoE and kmemleak, but with the FCoE vlan removed) are odroids doing nothing more with FCoE but being there. (Waiting for experiments with bcache on eMMC; I used to be able to crash the FCoE *target* using btrfs on bcache on eMMC and FCoE. The target was running 4.0.0 back then.)

Generic config (PC and odroid):

root@odroid6:~# cat /etc/network/interfaces.d/20-fcoe
auto fcoe
iface fcoe inet manual
	pre-up modprobe fcoe || true
	pre-up ip link add link eth0 name fcoe type vlan id 11
	pre-up sysctl -w net.ipv6.conf.fcoe.disable_ipv6=1
	up ip link set up dev fcoe
	up sh -c 'echo fcoe > /sys/module/libfcoe/parameters/create_vn2vn'
	#up /root/mountfcoe
	#pre-down /root/stop-bcaches
	pre-down sh -c 'echo fcoe > /sys/module/libfcoe/parameters/destroy'
	down ip link set down dev fcoe
	down ip link del fcoe

The targets are configured with some version of targetcli (so a big echo shell script). This is on the 4.14 systems:

root@antec:~# grep . /sys/class/fc_*/*/port_*
/sys/class/fc_host/host10/port_id:0x004088
/sys/class/fc_host/host10/port_name:0x200000241d7f4088
/sys/class/fc_host/host10/port_state:Online
/sys/class/fc_host/host10/port_type:NPort (fabric via point-to-point)
/sys/class/fc_remote_ports/rport-10:0-0/port_id:0x00dd50
/sys/class/fc_remote_ports/rport-10:0-0/port_name:0x20000014fd16dd50
/sys/class/fc_remote_ports/rport-10:0-0/port_state:Online
/sys/class/fc_remote_ports/rport-10:0-1/port_id:0x006837
/sys/class/fc_remote_ports/rport-10:0-1/port_name:0x2000000e0cb06837
/sys/class/fc_remote_ports/rport-10:0-1/port_state:Online
/sys/class/fc_remote_ports/rport-10:0-2/port_id:0x000550
/sys/class/fc_remote_ports/rport-10:0-2/port_name:0x2000001e06300550
/sys/class/fc_remote_ports/rport-10:0-2/port_state:Online
/sys/class/fc_remote_ports/rport-10:0-3/port_id:0x0004e0
/sys/class/fc_remote_ports/rport-10:0-3/port_name:0x2000001e063004e0
/sys/class/fc_remote_ports/rport-10:0-3/port_state:Online
/sys/class/fc_transport/target10:0:0/port_id:0x00dd50
/sys/class/fc_transport/target10:0:0/port_name:0x20000014fd16dd50

None of the other systems have an fc_transport, as they do not have targets assigned to them (currently). Notice that antec (PC) does not see odroid7, and the same is true vice versa. All other systems see both antec and odroid7. So they can all see each other, except for the 4.14 systems, which can't see each other.

Now when I noticed that it only happened when my PC starts, I wondered why it also happened when my PC is turned off, as I turn it on once every few months, and sometimes in the winter (its power usage is the same as the remaining systems combined). And my next thing is: why did my production server seem to die less fast since a few kernel upgrades (in the 4.14 line)?

I've got it figured out now: before the heatwave, I had odroid5 turned on, my steam machine (also with FCoE as an active initiator and a 4.14 kernel), and the PC turned off. So that still makes 6 FCoE ports on the network. When the summer came I needed to turn off the steam machine as much as possible. This resulted in my main production server only needing a reboot once every week instead of every 2 days. I attributed that to kernel fixes (as I knew there was a memory leak, just didn't know where yet).

Thinking about that some more: do I need 4.14 systems to trigger a bug in each other, or is it purely the number of FC hosts that needs to be bigger than 5 to trigger a bug in 4.14?

So, a conclusion of my rambling:
1) You either need 6 vn2vn hosts *or* you need more than one 4.14 kernel in a network to trigger it. One of the two; I need to think about this. The fact that the 4.14 systems can't see each other is an indicator. I can turn off FCoE on some other system to see if the memleak stops.
2) Kernels up to 4.9.28 do not have a memory leak. 4.14.28+ do have the memory leak.
3) I need a place for graphs, I will see if I can abuse the github ticket some more 8-D.
4) Just having FCoE enabled on an interface and *receiving*/interacting with FCoE vn2vn chatter triggers the bug. So that's only setting up the rports and maintaining ownership of your port id.
5) The memleak itself is architecture independent and NIC independent.
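The 150MB/day figure above comes from collectd graphs; for a setup without collectd, a rough stand-alone sketch for tracking the leak rate by sampling /proc/meminfo (the log path is illustrative):

    #!/bin/sh
    # Sample kernel memory counters once an hour; steadily growing Slab
    # on an otherwise idle box is a good hint the leak is still active.
    while sleep 3600; do
        date
        grep -E 'MemFree|MemAvailable|Slab' /proc/meminfo
    done >> /var/log/memleak-trend.log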
On Tue, Aug 07, 2018 at 11:26:19AM +0200, ard wrote:
> 3) I need a place for graphs, I will see if I can abuse the
> github ticket some more 8-D.

https://github.com/hardkernel/linux/issues/360#issuecomment-411001362

Regards,
Ard van Breemen
Hi,

On Tue, Aug 07, 2018 at 11:26:19AM +0200, ard wrote:
> So, a conclusion of my rambling:
> 1) You either need 6 vn2vn hosts *or* you need more than one 4.14
> kernel in a network to trigger it. One of the two; I need to think
> about this. The fact that the 4.14 systems can't see each other is
> an indicator. I can turn off FCoE on some other system to
> see if the memleak stops.

I've gone all out. Ok, deleted everything, long story short (for now(!)):

PC + steam machine with 4.14 (patched) and 4.16 (upstream, nodebug): no kmemleaks. Every device sees every device.

Now if I add two odroids running 4.14 (not patched):
- The odroids will see *every* device running < 4.14, and vice versa.
- The odroids will not see any device running >= 4.14, and vice versa.
- The PC starts to kmemleak a lot.

I've removed the two 4.14 odroids from the FCoE lan again, and upgraded them to a patched 4.14. I rebooted the PC with steam (4.16 upstream unpatched) and all the other systems in the lan I mentioned. To be clear: the steam machine has a dedicated nic for FCoE; all other systems use a vlan. I will let them talk with each other for now.

I fear there is a platform specific thing in the vn2vn chatter that has changed after 4.9: 4.14 can communicate with lower revisions, but the 4.14 systems can't see each other, and whatever they chat about, it induces memory leaks on the PC.

Oh, another difference is that steam and the PC have targets assigned. I might try that for the 4.14 systems too.

Anyway, I will do some cooking, let the current systems chat with each other and see if we get a kmemleak making my statement bogus.

Regards,
Ard
On Mon, Aug 06, 2018 at 11:25:03AM +0200, Johannes Thumshirn wrote:

Martin,

though it looks like these patches aren't fixing all of the errors Ard has, they fix all I could reproduce up to now, and Ard already stated they're not regressing.

So can we please stage them for 4.19?

Thanks,
Johannes
Johannes,

> Martin,
>
> though it looks like these patches aren't fixing all of the errors Ard
> has, they fix all I could reproduce up to now, and Ard already stated
> they're not regressing.
>
> So can we please stage them for 4.19?

I merged them last week.
On Thu, Aug 09, 2018 at 05:52:13AM -0400, Martin K. Petersen wrote:
> Johannes,
>
> > Martin,
> >
> > though it looks like these patches aren't fixing all of the errors Ard
> > has, they fix all I could reproduce up to now, and Ard already stated
> > they're not regressing.
> >
> > So can we please stage them for 4.19?
>
> I merged them last week.

Oh I'm sorry, missed that.
Hi Guys,

On Tue, Aug 07, 2018 at 06:04:52PM +0200, ard wrote:
> PC + steam machine with 4.14 (patched) and 4.16 (upstream,
> nodebug): no kmemleaks. Every device sees every device.

New day, new conflicting results. Yay \0/. As I did not trust the results, I redid the tests, and the same tests gave some different results.

Before giving the results, I've changed my stance on the bug: the bug is not a regression in a memory leak. As far as I can tell now, the memory leaks were already there. It's a regression in vn2vn enodes being able to PLOGI. Since I've seen the steam machine and the PC set up an rport, there must be some racy thing going on in how they accept or reject the PLOGI. Once it rejects, it will never succeed to accept, and the relogin happens ad infinitum. In this mode there are about 47 kmemleaks per 10 minutes.

I also notice that the kmemleaks take a while to be detected or to die out. So there are state timers involved that hold on to the memory and, after a timeout, do not free it.

And another thing I noticed: when the PC and the steam machine had a working rport, after a while the steam machine (4.16 unpatched) fc_timedout the rports to all nodes (so all nodes with kernel < 4.14 too), and all with different timeouts, except the one it has an fc_transport with. So its sole remaining rport was the "designated" target.

Currently I am compiling 4.9 with kmemleak to determine if that exhibits the same leaks when disconnecting and reconnecting the FCoE vlan. This is to determine if we have a single regression in just the login handling, or both.

I will add the dmesg's of a working rport and a failing rport later.

Regards,
Ard
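Enabling kmemleak on such a test kernel is just a config switch; a minimal sketch using the kernel's own scripts/config helper (the early-log value is illustrative):

    # run from the top of the 4.9 source tree
    ./scripts/config --enable DEBUG_KMEMLEAK
    # enlarge the early-boot log so early allocations are not dropped
    ./scripts/config --set-val DEBUG_KMEMLEAK_EARLY_LOG_SIZE 16000
    make olddefconfig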
Hi,

On Thu, Aug 09, 2018 at 12:01:30PM +0200, ard wrote:
> This is to determine if we have a single regression in just the
> login handling, or both.

As a matter of fact, I think this will not work on vn2vn:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/scsi/libfc?id=386b97b43c0c9e0d878eec7ea1db16af22b036ae
scsi: libfc: Rework PRLI handling

<theorising>
It clearly rejects the PRLI and eventually sends a LOGO when no store is associated. But as the SAN setup goes: it first attaches the network driver to the fcoe layer and hence is store-less. Then it attaches stores to export. But at that moment all possible initiators have already got a LOGO.
</theorising>

But looking back at the logs, it is not as bad as what I said. It's not erroring ad infinitum, it's:

root@antec:~/logs# grep "vn_add rport 00c76e\|kmemleak" 2018-08-08-kern.log | cut -d\  -f9- | uniq -c
      1 2.577320] kmemleak: Kernel memory leak detector initialized
      1 2.577350] kmemleak: Automatic memory scanning thread started
      1 136.452894] kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
      1 host10: fip: vn_add rport 00c76e new state 0
      1 host10: fip: vn_add rport 00c76e old state 0
      8 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     50 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 4 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
      2 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 47 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     50 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     50 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 47 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     52 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 50 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     50 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 47 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     50 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 55 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     52 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 46 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     50 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 46 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     36 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 50 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
      1 kmemleak: 36 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
      1 kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
      1 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

From the last few lines we can clearly see that the "vn_add rport ... old state 4" bursts coincide with a kmemleak report 10 minutes later, especially the 50 and 36. Now the exact dump for that follows.
Now to get back to what is the difference. This is a working login with a 4.9:

Aug 8 10:53:15 localhost kernel: [ 14.929451] host10: fip: vn_add rport 0004e0 old state 0
Aug 8 10:53:21 localhost kernel: [ 23.143274] host10: fip: beacon from rport 4e0
Aug 8 10:53:21 localhost kernel: [ 23.143275] host10: fip: beacon expired for rport 4e0
Aug 8 10:53:21 localhost kernel: [ 23.143276] host10: rport 0004e0: Login to port
Aug 8 10:53:21 localhost kernel: [ 23.143277] host10: rport 0004e0: Entered FLOGI state from Init state
Aug 8 10:53:21 localhost kernel: [ 23.143294] host10: fip: els_send op 7 d_id 4e0
Aug 8 10:53:21 localhost kernel: [ 23.143301] host10: fip: beacon from rport 4e0
Aug 8 10:53:21 localhost kernel: [ 23.143920] host10: rport 0004e0: Received a FLOGI accept
Aug 8 10:53:21 localhost kernel: [ 23.143921] host10: rport 0004e0: Port entered PLOGI state from FLOGI state
Aug 8 10:53:21 localhost kernel: [ 23.143956] host10: rport 0004e0: Received a PLOGI accept
Aug 8 10:53:21 localhost kernel: [ 23.143958] host10: rport 0004e0: Port entered PRLI state from PLOGI state
Aug 8 10:53:21 localhost kernel: [ 23.143982] host10: rport 0004e0: Received a PRLI accept
Aug 8 10:53:21 localhost kernel: [ 23.143984] host10: rport 0004e0: PRLI spp_flags = 0x0 spp_type 0x20
Aug 8 10:53:21 localhost kernel: [ 23.143985] host10: rport 0004e0: Error -6 in state PRLI, retrying
Aug 8 10:53:21 localhost kernel: [ 23.144434] host10: rport 0004e0: Received PRLI request while in state PRLI
Aug 8 10:53:21 localhost kernel: [ 23.144445] host10: rport 0004e0: PRLI rspp type 8 active 1 passive 0
Aug 8 10:53:21 localhost kernel: [ 23.749492] host10: rport 0004e0: Received RTV request
Aug 8 10:53:23 localhost kernel: [ 25.204559] host10: rport 0004e0: Port timeout, state PRLI
Aug 8 10:53:23 localhost kernel: [ 25.347400] host10: rport 0004e0: Port entered PRLI state from PRLI state
Aug 8 10:53:23 localhost kernel: [ 25.393058] host10: rport 0004e0: Received a PRLI accept
Aug 8 10:53:23 localhost kernel: [ 25.398412] host10: rport 0004e0: PRLI spp_flags = 0x21 spp_type 0x8
Aug 8 10:53:23 localhost kernel: [ 25.404853] host10: rport 0004e0: Port entered RTV state from PRLI state
Aug 8 10:53:23 localhost kernel: [ 25.450291] host10: rport 0004e0: Received a RTV reject
Aug 8 10:53:23 localhost kernel: [ 25.455620] host10: rport 0004e0: Port is Ready
Aug 8 10:53:23 localhost kernel: [ 25.460185] host10: rport 0004e0: work event 1
Aug 8 10:53:23 localhost kernel: [ 25.464672] host10: rport 0004e0: No rport!
Aug 8 10:53:23 localhost kernel: [ 25.469577] host10: rport 0004e0: callback ev 1
Aug 8 10:53:23 localhost kernel: [ 25.474152] host10: fip: vn_rport_callback 4e0 event 1
Aug 8 10:53:30 localhost kernel: [ 31.850891] host10: fip: beacon from rport 4e0

This is the error retry with a 4.16 unpatched:

Aug 8 11:33:50 localhost kernel: [ 2452.565524] host10: fip: vn_claim_notify: send reply to c76e
Aug 8 11:33:50 localhost kernel: [ 2452.571392] host10: fip: vn_add rport 00c76e new state 0
Aug 8 11:33:50 localhost kernel: [ 2452.576898] host10: fip: vn_claim_notify: send reply to c76e
Aug 8 11:33:50 localhost kernel: [ 2452.582605] host10: fip: vn_add rport 00c76e old state 0
Aug 8 11:33:57 localhost kernel: [ 2458.854081] host10: fip: beacon from rport c76e
Aug 8 11:33:57 localhost kernel: [ 2458.858622] host10: fip: beacon expired for rport c76e
Aug 8 11:33:57 localhost kernel: [ 2458.863760] host10: rport 00c76e: Login to port
Aug 8 11:33:57 localhost kernel: [ 2458.868455] host10: rport 00c76e: Entered FLOGI state from Init state
Aug 8 11:33:57 localhost kernel: [ 2458.875177] host10: fip: els_send op 7 d_id c76e
Aug 8 11:33:57 localhost kernel: [ 2458.886089] host10: fip: beacon from rport c76e
Aug 8 11:33:57 localhost kernel: [ 2458.896069] host10: rport 00c76e: Received a FLOGI accept
Aug 8 11:33:57 localhost kernel: [ 2458.901716] host10: rport 00c76e: Port entered PLOGI state from FLOGI state
Aug 8 11:33:57 localhost kernel: [ 2458.934161] host10: rport 00c76e: Received a PLOGI accept
Aug 8 11:33:57 localhost kernel: [ 2458.939651] host10: rport 00c76e: Port entered PRLI state from PLOGI state
Aug 8 11:33:57 localhost kernel: [ 2458.985427] host10: rport 00c76e: Received a PRLI reject
Aug 8 11:33:57 localhost kernel: [ 2458.990817] host10: rport 00c76e: PRLI ELS rejected, reason 5 expl 0
Aug 8 11:33:57 localhost kernel: [ 2458.997255] host10: rport 00c76e: Error 5 in state PRLI, retrying
Aug 8 11:33:59 localhost kernel: [ 2461.060167] host10: rport 00c76e: Port timeout, state PRLI
Aug 8 11:33:59 localhost kernel: [ 2461.065688] host10: rport 00c76e: Port entered PRLI state from PRLI state
Aug 8 11:33:59 localhost kernel: [ 2461.111048] host10: rport 00c76e: Received a PRLI reject
Aug 8 11:33:59 localhost kernel: [ 2461.116388] host10: rport 00c76e: PRLI ELS rejected, reason 9 expl 1e
Aug 8 11:33:59 localhost kernel: [ 2461.122927] host10: rport 00c76e: Error 5 in state PRLI, retrying
Aug 8 11:34:01 localhost kernel: [ 2463.172152] host10: rport 00c76e: Port timeout, state PRLI
Aug 8 11:34:01 localhost kernel: [ 2463.177696] host10: rport 00c76e: Port entered PRLI state from PRLI state
Aug 8 11:34:01 localhost kernel: [ 2463.224578] host10: rport 00c76e: Received a PRLI reject
Aug 8 11:34:01 localhost kernel: [ 2463.230023] host10: rport 00c76e: PRLI ELS rejected, reason 9 expl 1e
Aug 8 11:34:01 localhost kernel: [ 2463.236765] host10: rport 00c76e: Error 5 in state PRLI, retrying
Aug 8 11:34:03 localhost kernel: [ 2465.283937] host10: rport 00c76e: Port timeout, state PRLI
Aug 8 11:34:03 localhost kernel: [ 2465.289424] host10: rport 00c76e: Port entered PRLI state from PRLI state
Aug 8 11:34:03 localhost kernel: [ 2465.335150] host10: rport 00c76e: Received a PRLI reject
Aug 8 11:34:03 localhost kernel: [ 2465.340540] host10: rport 00c76e: PRLI ELS rejected, reason 9 expl 1e
Aug 8 11:34:03 localhost kernel: [ 2465.347068] host10: rport 00c76e: Error -5 in state PRLI, retries 3
Aug 8 11:34:03 localhost kernel: [ 2465.353411] host10: rport 00c76e: Port sending LOGO from PRLI state
Aug 8 11:34:03 localhost kernel: [ 2465.359760] host10: fip: els_send op 9 d_id c76e
Aug 8 11:34:05 localhost kernel: [ 2467.046011] host10: fip: beacon from rport c76e
Aug 8 11:34:05 localhost kernel: [ 2467.050549] host10: fip: beacon from rport c76e
Aug 8 11:34:09 localhost kernel: [ 2470.857393] host10: fip: vn_claim_notify: send reply to c76e
Aug 8 11:34:09 localhost kernel: [ 2470.863225] host10: fip: vn_add rport 00c76e old state 4
Aug 8 11:34:09 localhost kernel: [ 2470.868764] host10: fip: vn_claim_notify: send reply to c76e
Aug 8 11:34:09 localhost kernel: [ 2470.874463] host10: fip: vn_add rport 00c76e old state 4
Aug 8 11:34:13 localhost kernel: [ 2475.238121] host10: fip: beacon from rport c76e
Aug 8 11:34:13 localhost kernel: [ 2475.242660] host10: fip: beacon expired for rport c76e
Aug 8 11:34:13 localhost kernel: [ 2475.247804] host10: rport 00c76e: port already started

Repeat:

Aug 8 11:34:13 localhost kernel: [ 2475.253076] host10: fip: beacon from rport c76e
Aug 8 11:34:21 localhost kernel: [ 2483.429903] host10: fip: beacon from rport c76e
Aug 8 11:34:21 localhost kernel: [ 2483.434500] host10: fip: beacon from rport c76e
Aug 8 11:34:29 localhost kernel: [ 2491.621938] host10: fip: beacon from rport c76e
Aug 8 11:34:29 localhost kernel: [ 2491.626478] host10: fip: beacon from rport c76e

This part contains a memory leak after a timeout:

Aug 8 11:34:33 localhost kernel: [ 2495.433102] host10: fip: vn_claim_notify: send reply to c76e
Aug 8 11:34:33 localhost kernel: [ 2495.438842] host10: fip: vn_add rport 00c76e old state 4
Aug 8 11:34:33 localhost kernel: [ 2495.444266] host10: fip: vn_claim_notify: send reply to c76e
Aug 8 11:34:33 localhost kernel: [ 2495.450120] host10: fip: vn_add rport 00c76e old state 4
Aug 8 11:34:38 localhost kernel: [ 2499.813722] host10: fip: beacon from rport c76e
Aug 8 11:34:38 localhost kernel: [ 2499.818256] host10: fip: beacon expired for rport c76e
Aug 8 11:34:38 localhost kernel: [ 2499.823402] host10: rport 00c76e: port already started

You can find the raw kmemleak and kern.log here:
https://github.com/hardkernel/linux/issue_comments#issuecomment-411772096

And some filtered logs a bit further up.
The logs when it all went ok:
https://github.com/hardkernel/linux/issues/360#issuecomment-411335647

Especially:

Aug 7 17:54:30 localhost kernel: [ 15.540607] host10: fip: claim resp from from rport c76e - state VNMP_UP
Aug 7 17:54:30 localhost kernel: [ 15.548710] host10: fip: vn_add rport 00c76e new state 0
Aug 7 17:54:30 localhost kernel: [ 15.555463] host10: fip: claim resp from from rport c76e - state VNMP_UP
Aug 7 17:54:30 localhost kernel: [ 15.563584] host10: fip: vn_add rport 00c76e old state 0
Aug 7 17:54:30 localhost kernel: [ 15.696327] host10: fip: beacon expired for rport c76e
Aug 7 17:54:30 localhost kernel: [ 15.702975] host10: rport 00c76e: Login to port
Aug 7 17:54:30 localhost kernel: [ 15.707617] host10: rport 00c76e: Entered FLOGI state from Init state
Aug 7 17:54:30 localhost kernel: [ 15.714224] host10: fip: els_send op 7 d_id c76e
Aug 7 17:54:30 localhost kernel: [ 15.736741] host10: rport 00c76e: Received a FLOGI accept
Aug 7 17:54:30 localhost kernel: [ 15.742572] host10: rport 00c76e: Port entered PLOGI state from FLOGI state
Aug 7 17:54:30 localhost kernel: [ 15.749695] host10: rport 00c76e: Received a PLOGI accept
Aug 7 17:54:30 localhost kernel: [ 15.749696] host10: rport 00c76e: Port entered PRLI state from PLOGI state
Aug 7 17:54:30 localhost kernel: [ 15.750112] host10: rport 00c76e: Received a PRLI reject
Aug 7 17:54:30 localhost kernel: [ 15.750113] host10: rport 00c76e: PRLI ELS rejected, reason 9 expl 1e
Aug 7 17:54:30 localhost kernel: [ 15.750114] host10: rport 00c76e: Error 5 in state PRLI, retrying
Aug 7 17:54:30 localhost kernel: [ 15.750360] host10: rport 00c76e: Received PRLI request while in state PRLI
Aug 7 17:54:30 localhost kernel: [ 15.750371] host10: rport 00c76e: PRLI rspp type 8 active 1 passive 0
Aug 7 17:54:30 localhost kernel: [ 15.750570] host10: rport 00c76e: Received RTV request
Aug 7 17:54:30 localhost kernel: [ 17.758897] host10: rport 00c76e: Port timeout, state PRLI
Aug 7 17:54:30 localhost kernel: [ 17.764559] host10: rport 00c76e: Port entered PRLI state from PRLI state
Aug 7 17:54:30 localhost kernel: [ 17.812081] host10: rport 00c76e: Received a PRLI accept
Aug 7 17:54:31 localhost kernel: [ 17.818809] host10: rport 00c76e: PRLI spp_flags = 0x21 spp_type 0x8
Aug 7 17:54:31 localhost kernel: [ 17.826634] host10: rport 00c76e: Port entered RTV state from PRLI state
Aug 7 17:54:31 localhost kernel: [ 17.878821] host10: rport 00c76e: Received a RTV accept
Aug 7 17:54:31 localhost kernel: [ 17.885478] host10: rport 00c76e: Port is Ready
Aug 7 17:54:31 localhost kernel: [ 17.891513] host10: rport 00c76e: work event 1
Aug 7 17:54:31 localhost kernel: [ 17.897386] host10: rport 00c76e: No rport!
Aug 7 17:54:31 localhost kernel: [ 17.903698] host10: rport 00c76e: callback ev 1
Aug 7 17:54:31 localhost kernel: [ 17.909081] host10: fip: vn_rport_callback c76e event 1
Aug 7 17:54:34 localhost kernel: [ 20.992106] host10: rport 00c76e: Received ADISC request

So in the "working" case I see that the other node (4.16 unpatched) accepts our PRLI, if I interpret that right, and in the other case it doesn't. Same system, different boot order though, but I guess the boot order doesn't matter. When it doesn't work, it gets or sends a mutual LOGO and gets into a weird state.

Anyway: as far as I interpreted it, in the vn2vn situation all enodes should know all other rports to prevent a double rport assignment. I have no experience with FC itself though, so I don't know what the normal state is; but is it ok to send a LOGO due to a failure to set up? Or is the vn2vn rport maintenance on another layer?

The 4.9 kernel is still to be baked... I expect it to yield minor kmemleaks, until I start toggling the FCoE.

Regards,
Ard van Breemen

PS: I removed some cross spamming from the Cc:, I hope that's ok.
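A sketch of the "toggling the FCoE" experiment mentioned above, reusing the exact commands from the interfaces file earlier in the thread (interface name fcoe as in that config):

    # tear the vn2vn instance down and bring it back up, to see whether
    # the login/leak cycle restarts on reconnect
    echo fcoe > /sys/module/libfcoe/parameters/destroy
    ip link set down dev fcoe
    sleep 5
    ip link set up dev fcoe
    echo fcoe > /sys/module/libfcoe/parameters/create_vn2vn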
[ +Cc Hannes ]

On Thu, Aug 09, 2018 at 04:41:24PM +0200, ard wrote:
> As a matter of fact, I think this will not work on vn2vn:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/scsi/libfc?id=386b97b43c0c9e0d878eec7ea1db16af22b036ae
> scsi: libfc: Rework PRLI handling
>
> <theorising>
> It clearly rejects the PRLI and eventually sends a LOGO when no store
> is associated. But as the SAN setup goes: it first attaches the network
> driver to the fcoe layer and hence is store-less. Then it attaches
> stores to export. But at that moment all possible initiators have
> already got a LOGO.
> </theorising>
[...]

This looks like the PRLI rework from Hannes introduced the memleaks here.

Hannes, can you have a look at it?
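One quick way to test that theory would be a plain revert of the suspected commit on top of the tree under test; a sketch (it may not revert cleanly on every branch, and the install steps depend on the distro):

    # in a checkout of the kernel under test
    git revert 386b97b43c0c9e0d878eec7ea1db16af22b036ae
    make -j"$(nproc)"
    make modules_install install
    # then re-run the vn2vn setup and watch /sys/kernel/debug/kmemleak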
Hi,
On Fri, Aug 10, 2018 at 10:34:49AM +0200, Johannes Thumshirn wrote:
> Hannes, can you have a look at it?
As a side note, I am busy with other things for the next 3 weeks,
but I will be able to add some printk's and run it again in a few
weeks. Just tell me if you want me to put probes somewhere to
have a better look at the states :-).
Regards,
Ard
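In the meantime, libfc and libfcoe already expose a runtime debug_logging module parameter, which may cover part of what extra printk's would show; a small sketch (the value is a bitmask selecting message classes, see the libfc/libfcoe sources for the exact bits; 0xff is used here as "everything"):

    # enable verbose libfc/libfcoe debug messages at runtime
    echo 0xff > /sys/module/libfc/parameters/debug_logging
    echo 0xff > /sys/module/libfcoe/parameters/debug_logging
    # the messages then show up in dmesg, like the host10/rport traces above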