
[0/3] scsi: fcoe: memleak fixes

Message ID 20180731134603.20089-1-jthumshirn@suse.de

Message

Johannes Thumshirn July 31, 2018, 1:46 p.m. UTC
Ard reported a memory leak in FCoE at [1]. This patch set contains the fix
for that leak, a fix for a second leak I discovered while hunting the first
one, and a fix for a use-after-free in FCoE's debug logging.

[1] https://marc.info/?l=linux-scsi&m=153261165228133&w=2

Johannes Thumshirn (3):
  fcoe: fix use-after-free in fcoe_ctlr_els_send
  scsi: fcoe: drop frames in ELS LOGO error path
  scsi: fcoe: clear FC_RP_STARTED flags when receiving a LOGO

 drivers/scsi/fcoe/fcoe_ctlr.c | 6 +++---
 drivers/scsi/libfc/fc_rport.c | 1 +
 2 files changed, 4 insertions(+), 3 deletions(-)

Comments

Johannes Thumshirn Aug. 6, 2018, 9:25 a.m. UTC | #1
Ard,

Did you have time to test these patches?

Byte,
	Johannes
Ard van Breemen Aug. 6, 2018, 1:22 p.m. UTC | #2
Hi Johannes,

On Mon, Aug 06, 2018 at 11:25:03AM +0200, Johannes Thumshirn wrote:
> Ard,
> 
> Did you have time to test these patches?

Ahh, yeah, sorry.
I did test them.
I can't say I see a regression.
I also can't say I see progress, although it feels like there is
a tad less leaking.
For that I need to do some statistics.

But for now I ack that it does not regress further.
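
(A low-tech way to gather such statistics, sketched here only as a
suggestion: periodically log the slab counters from /proc/meminfo and
compare them over a day or so.)

# append a timestamped slab/free-memory sample every 10 minutes
while sleep 600; do
    echo "$(date +%s) $(grep -E '^(MemFree|Slab|SUnreclaim):' /proc/meminfo | tr -s '\n ' ' ')"
done >> meminfo.log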
Johannes Thumshirn Aug. 6, 2018, 1:27 p.m. UTC | #3
On Mon, Aug 06, 2018 at 03:22:34PM +0200, ard wrote:
> Ahh, yeah, sorry.
> I did test them.
> I can't say I see a regression.
> I also can't say I see progress, although it feels like there is
> a tad less leaking.
> For that I need to do some statistics.
> 
> But for now I ack that it does not regress further.

From what I've seen in your kmemleak reports, it wasn't only
libfc/fcoe that was leaking memory.

So kmemleak reports are welcome (especially if they confirm libfc/fcoe
is clean now ;-)).
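
A report like that can be gathered by hand through kmemleak's debugfs
interface (assuming CONFIG_DEBUG_KMEMLEAK is enabled and debugfs is
mounted under /sys/kernel/debug):

# trigger an immediate scan instead of waiting for the periodic scan thread
echo scan > /sys/kernel/debug/kmemleak
# dump all currently suspected leaks with their allocation backtraces
cat /sys/kernel/debug/kmemleak > kmemleak-$(date +%F).txt
# clear the list so the next scan only reports new leaks
echo clear > /sys/kernel/debug/kmemleak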

@Martin can you stage these patches for 4.19?
Ard van Breemen Aug. 6, 2018, 2:24 p.m. UTC | #4
Hi Johannes,

On Mon, Aug 06, 2018 at 03:27:25PM +0200, Johannes Thumshirn wrote:
> On Mon, Aug 06, 2018 at 03:22:34PM +0200, ard wrote:
> > Ahh, yeah, sorry.
> > I did test them.
> > I can't say I see a regression.
> > I also can't say I see progress, although it feels like there is
> > a tad less leaking.
> > For that I need to do some statistics.
> > 
> > But for now I ack that it does not regress further.
> 
> From what I've seen in your kmemleak reports, it wasn't only
> libfc/fcoe that was leaking memory.
> 
> So kmemleak reports are welcome (especially if they confirm libfc/fcoe
> is clean now ;-)).
I've updated the ticket:
https://github.com/hardkernel/linux/issues/360#issuecomment-410721997
dmesg (with full debugging):
https://github.com/hardkernel/linux/files/2262858/dmesg-2018-08-06.txt
and kmemleak:
https://github.com/hardkernel/linux/files/2262861/kmemleak-2018-08-06.txt

I'll have to take a good look at the what and why.
I'm gonna git clean -x -d and rebuild the kernel, because this is
weird.

This is the tree I built:
https://github.com/ardje/linux/tree/gb-4.14-fcoe-patch
So vanilla linux-4.14 with the patch set:
https://github.com/ardje/linux/commit/747bf04057a99be5b01f768654cfd61bc9f4fc6c
Meh, I should try to use git for patching.

I will spam some printk's later on, just to check which flows are
followed (if the git clean doesn't reveal some problem, although I
can't imagine it will).

Regards,
Ard
Johannes Thumshirn Aug. 7, 2018, 6:54 a.m. UTC | #5
On Mon, Aug 06, 2018 at 04:24:14PM +0200, ard wrote:
> I've updated the ticket:
> https://github.com/hardkernel/linux/issues/360#issuecomment-410721997
> dmesg (with full debugging):
> https://github.com/hardkernel/linux/files/2262858/dmesg-2018-08-06.txt
> and kmemleak:
> https://github.com/hardkernel/linux/files/2262861/kmemleak-2018-08-06.txt
> 
> I'll have to take a good look at the what and why.
> I'm gonna git clean -x -d and rebuild the kernel, because this is
> weird.
> 
> This is the tree I build:
> https://github.com/ardje/linux/tree/gb-4.14-fcoe-patch
> So vanilla linux-4.14 with the patch set:
> https://github.com/ardje/linux/commit/747bf04057a99be5b01f768654cfd61bc9f4fc6c
> Meh, I should try to use git for patching.
> 
> I will spam some printk's later on, just to check which flows are
> followed (if the git clean doesn't reveal some problem, although
> I can't imagine).

OK, now this is weird. Are you seeing this on the initiator or on the
target side? Also on x86_64 or just on the odroid? I could reproduce your
reports in my virtualized environment [1][2] by issuing deletes from the
initiator side.

[1] https://github.com/rapido-linux/rapido
[2] https://github.com/rapido-linux/rapido/blob/master/fcoe_local_autorun.sh

Byte,
	Johannes
Ard van Breemen Aug. 7, 2018, 9:26 a.m. UTC | #6
Hi,

On Tue, Aug 07, 2018 at 08:54:00AM +0200, Johannes Thumshirn wrote:
> OK, now this is wired. Are you seeing this on the initiator or on the
> target side? Also on x86_64 or just the odroid? I could reproduce your
> reports in my Virtualized Environment [1][2] by issuing deletes from the
> initiator side..
Yes it is weird, and it got even weirder when I looked at the
collectd statistics:
The memory leak was almost non-existent on my test odroid with
the PC turned off. When I turn the PC back on, the leak rises to 150 MB/day.
So it seems you need at least some other party on the network.
The most important thing to realise: this is pure vn2vn chatter.
There is no traffic going from or to the test odroid (to the test
PC there is some).
If I disable the FCoE vlan on the switch port, the chatter *and*
the memory leak vanish.
Meh, this report needs a better place than just e-mail; I've got a
few nice graphs to show.

But here is an overview of my FCoE vlan:
(Sorted by hand)
(GS724Tv4) #show mac-addr-table vlan 11

Address Entries Currently in Use............... 89

   MAC Address     Interface     Status
-----------------  ---------  ------------
00:1E:06:30:05:50  g4         odroid4 Xu4/exynos 5422/4.4.0-rc6    stable (330 days up)
0E:FD:00:00:05:50  g4         Learned
00:1E:06:30:04:E0  g6         odroid6 Xu4/exynos 5422/4.9.28       stable (330 days up)
0E:FD:00:00:04:E0  g6         Learned
00:1E:06:30:05:52  g7         odroid7 Xu4/exynos 5422/4.14.55      leaking (150MB leak/day)
0E:FD:00:00:05:52  g7         Learned
00:0E:0C:B0:68:37  g14        storage SS4000E/Xscale 80219/3.7.1   stable (295 days up)
0E:FD:00:00:68:37  g14        Learned
00:14:FD:16:DD:50  g15        thecus1 n4200eco/D525/4.3.0          stable (295 days up)
0E:FD:00:00:DD:50  g15        Learned
00:24:1D:7F:40:88  g17        antec   PC/i7-920/4.14.59            leaking
0E:FD:00:00:40:88  g17        Learned


The systems on g14 and g15 are both long-time targets.
g4, g6 and g7 (my production server is on port 5, with FCoE and kmemleak, but
with the FCoE vlan removed) are odroids doing nothing more with FCoE than
being there. (They are waiting for bcache-on-eMMC experiments; I used to be
able to crash the FCoE *target* using btrfs on bcache on eMMC over FCoE.
The target was running 4.0.0 back then.)
Generic config (PC and odroid):
root@odroid6:~# cat /etc/network/interfaces.d/20-fcoe 
auto fcoe
iface fcoe inet manual
        pre-up modprobe fcoe || true
        pre-up ip link add link eth0 name fcoe type vlan id 11
        pre-up sysctl -w net.ipv6.conf.fcoe.disable_ipv6=1
        up ip link set up dev fcoe
        up sh -c 'echo fcoe > /sys/module/libfcoe/parameters/create_vn2vn'
        #up /root/mountfcoe
        #pre-down /root/stop-bcaches
        pre-down sh -c 'echo fcoe > /sys/module/libfcoe/parameters/destroy'
        down ip link set down dev fcoe
        down ip link del fcoe           

The targets are configured with some version of targetcli (so a
big echo shell script).

This is on the 4.14 systems:
root@antec:~# grep .  /sys/class/fc_*/*/port_*
/sys/class/fc_host/host10/port_id:0x004088
/sys/class/fc_host/host10/port_name:0x200000241d7f4088
/sys/class/fc_host/host10/port_state:Online
/sys/class/fc_host/host10/port_type:NPort (fabric via point-to-point)
/sys/class/fc_remote_ports/rport-10:0-0/port_id:0x00dd50
/sys/class/fc_remote_ports/rport-10:0-0/port_name:0x20000014fd16dd50
/sys/class/fc_remote_ports/rport-10:0-0/port_state:Online
/sys/class/fc_remote_ports/rport-10:0-1/port_id:0x006837
/sys/class/fc_remote_ports/rport-10:0-1/port_name:0x2000000e0cb06837
/sys/class/fc_remote_ports/rport-10:0-1/port_state:Online
/sys/class/fc_remote_ports/rport-10:0-2/port_id:0x000550
/sys/class/fc_remote_ports/rport-10:0-2/port_name:0x2000001e06300550
/sys/class/fc_remote_ports/rport-10:0-2/port_state:Online
/sys/class/fc_remote_ports/rport-10:0-3/port_id:0x0004e0
/sys/class/fc_remote_ports/rport-10:0-3/port_name:0x2000001e063004e0
/sys/class/fc_remote_ports/rport-10:0-3/port_state:Online
/sys/class/fc_transport/target10:0:0/port_id:0x00dd50
/sys/class/fc_transport/target10:0:0/port_name:0x20000014fd16dd50

None of the other systems have an fc_transport, as they do not
have targets assigned to them (currently).
Notice that antec (PC) does not see odroid7.
The same is true vice versa.
All other systems see both antec and odroid7.

So they can all see each other, except for the 4.14 systems, which
can't see each other.

Now, when I noticed that it only happened when my PC is on, I
wondered why it also happened when my PC is turned off. I turn
it on only once every few months, and sometimes in the winter, since
its power usage is the same as the remaining systems combined.

And my next thing is: why did my production server seem to die
less quickly after a few kernel upgrades (in the 4.14 line)?
I've got it figured out now:
Before the heatwave, I had odroid5 turned on, my steam machine
(also with FCoE as an active initiator and a 4.14 kernel) turned on, and
the PC turned off. So that still makes 6 FCoE ports on the network.
When the summer came I needed to turn off the steam machine as
much as possible. This resulted in my main production server only
needing a reboot once every week instead of every 2 days. I
attributed that to kernel fixes (as I knew there was a memory
leak, I just didn't know where yet).

Thinking about that some more: do I need 4.14 systems to trigger
a bug in each other, or is it purely the number of FC hosts,
which needs to be bigger than 5, that triggers a bug in 4.14?

So a conclusion of my rambling:
1) You either need 6 vn2vn hosts *or* you need more than one 4.14
kernel in the network to trigger it. One of the two; I need to think
about this. The fact that the 4.14 systems can't see each other is
an indicator. I can turn off FCoE on some other system to
see if the memleak stops.
2) Kernels up to 4.9.28 do not have a memory leak; 4.14.28+ do
have the memory leak (a bisect sketch for narrowing this down follows
after this list).
3) I need a place for graphs, I will see if I can abuse the
github ticket some more 8-D.
4) Just having FCoE enabled on an interface and
*receiving*/interacting with FCoE vn2vn chatter triggers the bug.
So that's only setting up the rports, maintaining ownership of
your port id.
5) The memleak itself is architecture independent and NIC
independent.
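
(If the regression window really is somewhere between 4.9 and 4.14, as
point 2 suggests, a path-limited bisect is one way to narrow it down;
the good/bad tags below are just assumptions taken from the versions
reported above.)

# limit the bisect to the suspect drivers to cut down the number of builds
git bisect start -- drivers/scsi/fcoe drivers/scsi/libfc
git bisect bad v4.14     # first series observed leaking
git bisect good v4.9     # last series observed clean
# build and boot each suggested commit, watch kmemleak, then mark it with
#   git bisect good    (no leak)   or   git bisect bad    (leak)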
Ard van Breemen Aug. 7, 2018, 9:57 a.m. UTC | #7
On Tue, Aug 07, 2018 at 11:26:19AM +0200, ard wrote:
> 3) I need a place for graphs, I will see if I can abuse the
> github ticket some more 8-D.
https://github.com/hardkernel/linux/issues/360#issuecomment-411001362

Regards,
Ard van Breemen
Ard van Breemen Aug. 7, 2018, 4:04 p.m. UTC | #8
Hi,

On Tue, Aug 07, 2018 at 11:26:19AM +0200, ard wrote:
> So a conclusion of my rambling:
> 1) you either need 6 vn2vn hosts *or* you need more than one 4.14
> kernel in a network to trigger. One of the two. I need to think
> about this. The fact that the 4.14 systems can't see eachother is
> an indicator.  I can turn off the FCoE on some other system to
> see if the memleak stops.

I've gone all out:

Ok, deleted everything, long story short:
(for now(!))

PC+steam machine with 4.14 (patched) and 4.16 (upstream,
nodebug): no kmemleaks
Every device sees every device.

Now if I add two odroids running 4.14 (not patched):
The odroids will see *every* device running < 4.14, and vice versa.
The odroids will not see any device running >= 4.14, and vice versa.
The PC starts to kmemleak a lot.

I've removed the two 4.14 odroids from the FCoE lan again, and
upgraded them to a patched 4.14.
I rebooted the PC, the steam machine (4.16 upstream, unpatched) and all
other systems in the lan I mentioned.
To be clear: the steam machine has a dedicated NIC for FCoE. All
other systems use a vlan.
I will let them talk with each other for now.

I fear there is a platform-specific thing in the vn2vn chatter
that has changed after 4.9.
4.14 can communicate with lower revisions, but the 4.14 systems can't see
each other, and whatever they chat about, it induces memory leaks
on the PC.
Oh, another difference is that the steam machine and the PC have targets
assigned. I might try that for the 4.14 systems too.

Anyway, I will do some cooking, let the current systems chat
with each other, and see if we get a kmemleak that makes my statement
bogus.

Regards,
Ard
Johannes Thumshirn Aug. 9, 2018, 8:05 a.m. UTC | #9
On Mon, Aug 06, 2018 at 11:25:03AM +0200, Johannes Thumshirn wrote:
Martin,

though it looks like these patches aren't fixing all of the errors Ard
has, they fix all I could reproduce up to now and Ard already stated
they're not regressing.

So can we please stage them for 4.19?

Thanks,
	Johannes
Martin K. Petersen Aug. 9, 2018, 9:52 a.m. UTC | #10
Johannes,

> Martin,
>
> though it looks like these patches aren't fixing all of the errors Ard
> has, they fix all I could reproduce up to now and Ard already stated
> they're not regressing.
>
> So can we please stage them for 4.19?

I merged them last week.
Johannes Thumshirn Aug. 9, 2018, 9:56 a.m. UTC | #11
On Thu, Aug 09, 2018 at 05:52:13AM -0400, Martin K. Petersen wrote:
> 
> Johannes,
> 
> > Martin,
> >
> > though it looks like these patches aren't fixing all of the errors Ard
> > has, they fix all I could reproduce up to now and Ard already stated
> > they're not regressing.
> >
> > So can we please stage them for 4.19?
> 
> I merged them last week.

Oh I'm sorry, missed that.
Ard van Breemen Aug. 9, 2018, 10:01 a.m. UTC | #12
Hi Guys,

On Tue, Aug 07, 2018 at 06:04:52PM +0200, ard wrote:
> PC+steam machine with 4.14 (patched) and 4.16 (upstream,
> nodebug): no kmemleaks
> Every device sees every device.

New day, new conflicting results.
Yay \0/.

As I did not trust the results, I redid the tests, and the same
tests gave some different results.
Before giving the results I've changed my stance on the bug:
The bug is not a regression in the memory leak. As far as I can tell
now, the memory leaks were already there.
It's a regression in vn2vn enodes being able to PLOGI.
Since I've seen the steam machine and the PC set up an rport,
there must be some racy thing going on in how they accept or reject
the PLOGI.
Once it rejects, it will never succeed in accepting, and the
relogin happens ad infinitum.
In this mode there are about 47 kmemleaks per 10 minutes.
I also notice that the kmemleaks take a while to be detected or
to die out. So there are state timers involved that hold on to
the memory and do not free it after timing out.
And another thing I noticed: when the PC and the steam machine
had a working rport, after a while the steam machine (4.16
unpatched) timed out the rports to all nodes (so to all nodes with
kernels < 4.14 too), and all with different timeouts, except for the
one it has an fc_transport with.
So its sole remaining rport was the "designated" target.
Currently I am compiling 4.9 with kmemleak to determine if it
exhibits the same leaks when disconnecting and reconnecting the
FCoE vlan.
This is to determine if we have a single regression in just the
login handling, or in both.
I will add the dmesgs of a working rport and a failing rport
later.

Regards,
Ard
Ard van Breemen Aug. 9, 2018, 2:41 p.m. UTC | #13
Hi,

On Thu, Aug 09, 2018 at 12:01:30PM +0200, ard wrote:
> This to determine if we have a single regression in just the
> login handling or both.
As a matter of fact, I think this will not work on vn2vn:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/scsi/libfc?id=386b97b43c0c9e0d878eec7ea1db16af22b036ae
scsi: libfc: Rework PRLI handling
<theorising>
As it clearly rejects the PRLI and eventually sends a LOGO when
no store is associated.
But this is how SAN setup goes: it first attaches the network driver to
the fcoe layer and hence is storeless. Then it attaches the stores
to export. But at that moment all possible initiators have already got
a LOGO.

But looking back at the logs, it is not as bad as what I
said. It's not erroring ad infinitum; it's:

root@antec:~/logs# grep "vn_add rport 00c76e\|kmemleak" 2018-08-08-kern.log|cut -d\  -f9-|uniq -c
      1   2.577320] kmemleak: Kernel memory leak detector initialized
      1   2.577350] kmemleak: Automatic memory scanning thread started
      1 136.452894] kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
      1 host10: fip: vn_add rport 00c76e new state 0
      1 host10: fip: vn_add rport 00c76e old state 0
      8 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     50 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 4 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
      2 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 47 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     50 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     50 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 47 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     52 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 50 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     50 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 47 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     50 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 55 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     52 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 46 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     50 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 46 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     36 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 50 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
      1 kmemleak: 36 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
      1 kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
      1 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

From the last few lines we can clearly see that the "vn_add rport ... old state 4"
coincides with a kmemleak 10 minutes later, especially the 50 and 36. The exact
dump for that follows.

Now to get back to what the difference is:
This is a working login with a 4.9:

Aug  8 10:53:15 localhost kernel: [   14.929451] host10: fip: vn_add rport 0004e0 old state 0
Aug  8 10:53:21 localhost kernel: [   23.143274] host10: fip: beacon from rport 4e0
Aug  8 10:53:21 localhost kernel: [   23.143275] host10: fip: beacon expired for rport 4e0
Aug  8 10:53:21 localhost kernel: [   23.143276] host10: rport 0004e0: Login to port
Aug  8 10:53:21 localhost kernel: [   23.143277] host10: rport 0004e0: Entered FLOGI state from Init state
Aug  8 10:53:21 localhost kernel: [   23.143294] host10: fip: els_send op 7 d_id 4e0
Aug  8 10:53:21 localhost kernel: [   23.143301] host10: fip: beacon from rport 4e0
Aug  8 10:53:21 localhost kernel: [   23.143920] host10: rport 0004e0: Received a FLOGI accept
Aug  8 10:53:21 localhost kernel: [   23.143921] host10: rport 0004e0: Port entered PLOGI state from FLOGI state
Aug  8 10:53:21 localhost kernel: [   23.143956] host10: rport 0004e0: Received a PLOGI accept
Aug  8 10:53:21 localhost kernel: [   23.143958] host10: rport 0004e0: Port entered PRLI state from PLOGI state
Aug  8 10:53:21 localhost kernel: [   23.143982] host10: rport 0004e0: Received a PRLI accept
Aug  8 10:53:21 localhost kernel: [   23.143984] host10: rport 0004e0: PRLI spp_flags = 0x0 spp_type 0x20
Aug  8 10:53:21 localhost kernel: [   23.143985] host10: rport 0004e0: Error -6 in state PRLI, retrying
Aug  8 10:53:21 localhost kernel: [   23.144434] host10: rport 0004e0: Received PRLI request while in state PRLI
Aug  8 10:53:21 localhost kernel: [   23.144445] host10: rport 0004e0: PRLI rspp type 8 active 1 passive 0
Aug  8 10:53:21 localhost kernel: [   23.749492] host10: rport 0004e0: Received RTV request
Aug  8 10:53:23 localhost kernel: [   25.204559] host10: rport 0004e0: Port timeout, state PRLI
Aug  8 10:53:23 localhost kernel: [   25.347400] host10: rport 0004e0: Port entered PRLI state from PRLI state
Aug  8 10:53:23 localhost kernel: [   25.393058] host10: rport 0004e0: Received a PRLI accept
Aug  8 10:53:23 localhost kernel: [   25.398412] host10: rport 0004e0: PRLI spp_flags = 0x21 spp_type 0x8
Aug  8 10:53:23 localhost kernel: [   25.404853] host10: rport 0004e0: Port entered RTV state from PRLI state
Aug  8 10:53:23 localhost kernel: [   25.450291] host10: rport 0004e0: Received a RTV reject
Aug  8 10:53:23 localhost kernel: [   25.455620] host10: rport 0004e0: Port is Ready
Aug  8 10:53:23 localhost kernel: [   25.460185] host10: rport 0004e0: work event 1
Aug  8 10:53:23 localhost kernel: [   25.464672] host10: rport 0004e0: No rport!
Aug  8 10:53:23 localhost kernel: [   25.469577] host10: rport 0004e0: callback ev 1
Aug  8 10:53:23 localhost kernel: [   25.474152] host10: fip: vn_rport_callback 4e0 event 1
Aug  8 10:53:30 localhost kernel: [   31.850891] host10: fip: beacon from rport 4e0

This is the error retry with a 4.16 unpatched:

Aug  8 11:33:50 localhost kernel: [ 2452.565524] host10: fip: vn_claim_notify: send reply to c76e
Aug  8 11:33:50 localhost kernel: [ 2452.571392] host10: fip: vn_add rport 00c76e new state 0
Aug  8 11:33:50 localhost kernel: [ 2452.576898] host10: fip: vn_claim_notify: send reply to c76e
Aug  8 11:33:50 localhost kernel: [ 2452.582605] host10: fip: vn_add rport 00c76e old state 0
Aug  8 11:33:57 localhost kernel: [ 2458.854081] host10: fip: beacon from rport c76e
Aug  8 11:33:57 localhost kernel: [ 2458.858622] host10: fip: beacon expired for rport c76e
Aug  8 11:33:57 localhost kernel: [ 2458.863760] host10: rport 00c76e: Login to port
Aug  8 11:33:57 localhost kernel: [ 2458.868455] host10: rport 00c76e: Entered FLOGI state from Init state
Aug  8 11:33:57 localhost kernel: [ 2458.875177] host10: fip: els_send op 7 d_id c76e
Aug  8 11:33:57 localhost kernel: [ 2458.886089] host10: fip: beacon from rport c76e
Aug  8 11:33:57 localhost kernel: [ 2458.896069] host10: rport 00c76e: Received a FLOGI accept
Aug  8 11:33:57 localhost kernel: [ 2458.901716] host10: rport 00c76e: Port entered PLOGI state from FLOGI state
Aug  8 11:33:57 localhost kernel: [ 2458.934161] host10: rport 00c76e: Received a PLOGI accept
Aug  8 11:33:57 localhost kernel: [ 2458.939651] host10: rport 00c76e: Port entered PRLI state from PLOGI state
Aug  8 11:33:57 localhost kernel: [ 2458.985427] host10: rport 00c76e: Received a PRLI reject
Aug  8 11:33:57 localhost kernel: [ 2458.990817] host10: rport 00c76e: PRLI ELS rejected, reason 5 expl 0
Aug  8 11:33:57 localhost kernel: [ 2458.997255] host10: rport 00c76e: Error 5 in state PRLI, retrying
Aug  8 11:33:59 localhost kernel: [ 2461.060167] host10: rport 00c76e: Port timeout, state PRLI
Aug  8 11:33:59 localhost kernel: [ 2461.065688] host10: rport 00c76e: Port entered PRLI state from PRLI state
Aug  8 11:33:59 localhost kernel: [ 2461.111048] host10: rport 00c76e: Received a PRLI reject
Aug  8 11:33:59 localhost kernel: [ 2461.116388] host10: rport 00c76e: PRLI ELS rejected, reason 9 expl 1e
Aug  8 11:33:59 localhost kernel: [ 2461.122927] host10: rport 00c76e: Error 5 in state PRLI, retrying
Aug  8 11:34:01 localhost kernel: [ 2463.172152] host10: rport 00c76e: Port timeout, state PRLI
Aug  8 11:34:01 localhost kernel: [ 2463.177696] host10: rport 00c76e: Port entered PRLI state from PRLI state
Aug  8 11:34:01 localhost kernel: [ 2463.224578] host10: rport 00c76e: Received a PRLI reject
Aug  8 11:34:01 localhost kernel: [ 2463.230023] host10: rport 00c76e: PRLI ELS rejected, reason 9 expl 1e
Aug  8 11:34:01 localhost kernel: [ 2463.236765] host10: rport 00c76e: Error 5 in state PRLI, retrying
Aug  8 11:34:03 localhost kernel: [ 2465.283937] host10: rport 00c76e: Port timeout, state PRLI
Aug  8 11:34:03 localhost kernel: [ 2465.289424] host10: rport 00c76e: Port entered PRLI state from PRLI state
Aug  8 11:34:03 localhost kernel: [ 2465.335150] host10: rport 00c76e: Received a PRLI reject
Aug  8 11:34:03 localhost kernel: [ 2465.340540] host10: rport 00c76e: PRLI ELS rejected, reason 9 expl 1e
Aug  8 11:34:03 localhost kernel: [ 2465.347068] host10: rport 00c76e: Error -5 in state PRLI, retries 3
Aug  8 11:34:03 localhost kernel: [ 2465.353411] host10: rport 00c76e: Port sending LOGO from PRLI state
Aug  8 11:34:03 localhost kernel: [ 2465.359760] host10: fip: els_send op 9 d_id c76e
Aug  8 11:34:05 localhost kernel: [ 2467.046011] host10: fip: beacon from rport c76e
Aug  8 11:34:05 localhost kernel: [ 2467.050549] host10: fip: beacon from rport c76e
Aug  8 11:34:09 localhost kernel: [ 2470.857393] host10: fip: vn_claim_notify: send reply to c76e
Aug  8 11:34:09 localhost kernel: [ 2470.863225] host10: fip: vn_add rport 00c76e old state 4
Aug  8 11:34:09 localhost kernel: [ 2470.868764] host10: fip: vn_claim_notify: send reply to c76e
Aug  8 11:34:09 localhost kernel: [ 2470.874463] host10: fip: vn_add rport 00c76e old state 4
Aug  8 11:34:13 localhost kernel: [ 2475.238121] host10: fip: beacon from rport c76e
Aug  8 11:34:13 localhost kernel: [ 2475.242660] host10: fip: beacon expired for rport c76e
Aug  8 11:34:13 localhost kernel: [ 2475.247804] host10: rport 00c76e: port already started

Repeat:

Aug  8 11:34:13 localhost kernel: [ 2475.253076] host10: fip: beacon from rport c76e
Aug  8 11:34:21 localhost kernel: [ 2483.429903] host10: fip: beacon from rport c76e
Aug  8 11:34:21 localhost kernel: [ 2483.434500] host10: fip: beacon from rport c76e
Aug  8 11:34:29 localhost kernel: [ 2491.621938] host10: fip: beacon from rport c76e
Aug  8 11:34:29 localhost kernel: [ 2491.626478] host10: fip: beacon from rport c76e

This part contains a memory leak after a timeout:
Aug  8 11:34:33 localhost kernel: [ 2495.433102] host10: fip: vn_claim_notify: send reply to c76e
Aug  8 11:34:33 localhost kernel: [ 2495.438842] host10: fip: vn_add rport 00c76e old state 4
Aug  8 11:34:33 localhost kernel: [ 2495.444266] host10: fip: vn_claim_notify: send reply to c76e
Aug  8 11:34:33 localhost kernel: [ 2495.450120] host10: fip: vn_add rport 00c76e old state 4
Aug  8 11:34:38 localhost kernel: [ 2499.813722] host10: fip: beacon from rport c76e
Aug  8 11:34:38 localhost kernel: [ 2499.818256] host10: fip: beacon expired for rport c76e
Aug  8 11:34:38 localhost kernel: [ 2499.823402] host10: rport 00c76e: port already started

You can find the raw kmemleak and kern.log here:
https://github.com/hardkernel/linux/issue_comments#issuecomment-411772096

And some filtered logs are a bit further up.

The logs when it all went ok:
https://github.com/hardkernel/linux/issues/360#issuecomment-411335647
Especially:
Aug  7 17:54:30 localhost kernel: [   15.540607] host10: fip: claim resp from from rport c76e - state VNMP_UP
Aug  7 17:54:30 localhost kernel: [   15.548710] host10: fip: vn_add rport 00c76e new state 0
Aug  7 17:54:30 localhost kernel: [   15.555463] host10: fip: claim resp from from rport c76e - state VNMP_UP
Aug  7 17:54:30 localhost kernel: [   15.563584] host10: fip: vn_add rport 00c76e old state 0
Aug  7 17:54:30 localhost kernel: [   15.696327] host10: fip: beacon expired for rport c76e
Aug  7 17:54:30 localhost kernel: [   15.702975] host10: rport 00c76e: Login to port
Aug  7 17:54:30 localhost kernel: [   15.707617] host10: rport 00c76e: Entered FLOGI state from Init state
Aug  7 17:54:30 localhost kernel: [   15.714224] host10: fip: els_send op 7 d_id c76e
Aug  7 17:54:30 localhost kernel: [   15.736741] host10: rport 00c76e: Received a FLOGI accept
Aug  7 17:54:30 localhost kernel: [   15.742572] host10: rport 00c76e: Port entered PLOGI state from FLOGI state
Aug  7 17:54:30 localhost kernel: [   15.749695] host10: rport 00c76e: Received a PLOGI accept
Aug  7 17:54:30 localhost kernel: [   15.749696] host10: rport 00c76e: Port entered PRLI state from PLOGI state
Aug  7 17:54:30 localhost kernel: [   15.750112] host10: rport 00c76e: Received a PRLI reject
Aug  7 17:54:30 localhost kernel: [   15.750113] host10: rport 00c76e: PRLI ELS rejected, reason 9 expl 1e
Aug  7 17:54:30 localhost kernel: [   15.750114] host10: rport 00c76e: Error 5 in state PRLI, retrying
Aug  7 17:54:30 localhost kernel: [   15.750360] host10: rport 00c76e: Received PRLI request while in state PRLI
Aug  7 17:54:30 localhost kernel: [   15.750371] host10: rport 00c76e: PRLI rspp type 8 active 1 passive 0
Aug  7 17:54:30 localhost kernel: [   15.750570] host10: rport 00c76e: Received RTV request
Aug  7 17:54:30 localhost kernel: [   17.758897] host10: rport 00c76e: Port timeout, state PRLI
Aug  7 17:54:30 localhost kernel: [   17.764559] host10: rport 00c76e: Port entered PRLI state from PRLI state
Aug  7 17:54:30 localhost kernel: [   17.812081] host10: rport 00c76e: Received a PRLI accept
Aug  7 17:54:31 localhost kernel: [   17.818809] host10: rport 00c76e: PRLI spp_flags = 0x21 spp_type 0x8
Aug  7 17:54:31 localhost kernel: [   17.826634] host10: rport 00c76e: Port entered RTV state from PRLI state
Aug  7 17:54:31 localhost kernel: [   17.878821] host10: rport 00c76e: Received a RTV accept
Aug  7 17:54:31 localhost kernel: [   17.885478] host10: rport 00c76e: Port is Ready
Aug  7 17:54:31 localhost kernel: [   17.891513] host10: rport 00c76e: work event 1
Aug  7 17:54:31 localhost kernel: [   17.897386] host10: rport 00c76e: No rport!
Aug  7 17:54:31 localhost kernel: [   17.903698] host10: rport 00c76e: callback ev 1
Aug  7 17:54:31 localhost kernel: [   17.909081] host10: fip: vn_rport_callback c76e event 1
Aug  7 17:54:34 localhost kernel: [   20.992106] host10: rport 00c76e: Received ADISC request

So in the "working" case I see that the other node (4.16 unpatched) accepts our
PRLI, if I interpret that right, and in the other case it
doesn't. Same system, different boot order though, but I guess
the boot order doesn't matter. 
When it doesn't work, it gets or sends a mutual LOGO and gets into a weird state.

Anyway: as far as I interpreted it, in the vn2vn situation all
enodes should know all other rports to prevent a double rport
assignment. I have no experience with FC itself though, so I
don't know what the normal state is, but is sending a LOGO due to a
failure to set up OK?
Or is the vn2vn rport maintenance on another layer?

The 4.9 kernel is still to be baked...
I expect it to yield minor kmemleaks, until I start toggling the
FCoE.

Regards,
Ard van Breemen

PS: I removed some cross spamming from the Cc:, I hope that's ok.
Johannes Thumshirn Aug. 10, 2018, 8:34 a.m. UTC | #14
[ +Cc Hannes ]

On Thu, Aug 09, 2018 at 04:41:24PM +0200, ard wrote:
> Hi,
> 
> On Thu, Aug 09, 2018 at 12:01:30PM +0200, ard wrote:
> > This to determine if we have a single regression in just the
> > login handling or both.
> As a matter of fact, I think this will not work on vn2vn:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/scsi/libfc?id=386b97b43c0c9e0d878eec7ea1db16af22b036ae
> scsi: libfc: Rework PRLI handling
> <theorising>
> As it clearly rejects the PRLI and eventually sends a LOGO when
> no store is associated.
> But as san setup goes: it first attaches the network driver to
> the fcoe layer and hence is store less. Then it attaches stores
> to export. But at that moment all possible initiators already got
> a LOGO
> </theorising>
> 
> But looking back at the logs, it is not all that bad as what I
> said.  It's not erroring ad infinitum, it's: 
> 
> root@antec:~/logs# grep "vn_add rport 00c76e\|kmemleak" 2018-08-08-kern.log|cut -d\  -f9-|uniq -c
>       1   2.577320] kmemleak: Kernel memory leak detector initialized
>       1   2.577350] kmemleak: Automatic memory scanning thread started
>       1 136.452894] kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>       1 host10: fip: vn_add rport 00c76e new state 0
>       1 host10: fip: vn_add rport 00c76e old state 0
>       8 host10: fip: vn_add rport 00c76e old state 4
>       1 kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>      50 host10: fip: vn_add rport 00c76e old state 4
>       1 kmemleak: 4 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>       2 host10: fip: vn_add rport 00c76e old state 4
>       1 kmemleak: 47 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>      50 host10: fip: vn_add rport 00c76e old state 4
>       1 kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>      50 host10: fip: vn_add rport 00c76e old state 4
>       1 kmemleak: 47 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>      52 host10: fip: vn_add rport 00c76e old state 4
>       1 kmemleak: 50 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>      50 host10: fip: vn_add rport 00c76e old state 4
>       1 kmemleak: 47 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>      50 host10: fip: vn_add rport 00c76e old state 4
>       1 kmemleak: 55 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>      52 host10: fip: vn_add rport 00c76e old state 4
>       1 kmemleak: 46 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>      50 host10: fip: vn_add rport 00c76e old state 4
>       1 kmemleak: 46 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>      36 host10: fip: vn_add rport 00c76e old state 4
>       1 kmemleak: 50 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>       1 kmemleak: 36 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>       1 kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>       1 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
> 
> From the last few lines we can clearly see that de vn_add rport ...  old state 4
> coincides with a kmemleak 10 minutes later. especially 50 and 36. Now the exact
> dump follows for that.
> 
> Now to get back to what is the difference:
> This is a working login with a 4.9:
> 
> Aug  8 10:53:15 localhost kernel: [   14.929451] host10: fip: vn_add rport 0004e0 old state 0
> Aug  8 10:53:21 localhost kernel: [   23.143274] host10: fip: beacon from rport 4e0
> Aug  8 10:53:21 localhost kernel: [   23.143275] host10: fip: beacon expired for rport 4e0
> Aug  8 10:53:21 localhost kernel: [   23.143276] host10: rport 0004e0: Login to port
> Aug  8 10:53:21 localhost kernel: [   23.143277] host10: rport 0004e0: Entered FLOGI state from Init state
> Aug  8 10:53:21 localhost kernel: [   23.143294] host10: fip: els_send op 7 d_id 4e0
> Aug  8 10:53:21 localhost kernel: [   23.143301] host10: fip: beacon from rport 4e0
> Aug  8 10:53:21 localhost kernel: [   23.143920] host10: rport 0004e0: Received a FLOGI accept
> Aug  8 10:53:21 localhost kernel: [   23.143921] host10: rport 0004e0: Port entered PLOGI state from FLOGI state
> Aug  8 10:53:21 localhost kernel: [   23.143956] host10: rport 0004e0: Received a PLOGI accept
> Aug  8 10:53:21 localhost kernel: [   23.143958] host10: rport 0004e0: Port entered PRLI state from PLOGI state
> Aug  8 10:53:21 localhost kernel: [   23.143982] host10: rport 0004e0: Received a PRLI accept
> Aug  8 10:53:21 localhost kernel: [   23.143984] host10: rport 0004e0: PRLI spp_flags = 0x0 spp_type 0x20
> Aug  8 10:53:21 localhost kernel: [   23.143985] host10: rport 0004e0: Error -6 in state PRLI, retrying
> Aug  8 10:53:21 localhost kernel: [   23.144434] host10: rport 0004e0: Received PRLI request while in state PRLI
> Aug  8 10:53:21 localhost kernel: [   23.144445] host10: rport 0004e0: PRLI rspp type 8 active 1 passive 0
> Aug  8 10:53:21 localhost kernel: [   23.749492] host10: rport 0004e0: Received RTV request
> Aug  8 10:53:23 localhost kernel: [   25.204559] host10: rport 0004e0: Port timeout, state PRLI
> Aug  8 10:53:23 localhost kernel: [   25.347400] host10: rport 0004e0: Port entered PRLI state from PRLI state
> Aug  8 10:53:23 localhost kernel: [   25.393058] host10: rport 0004e0: Received a PRLI accept
> Aug  8 10:53:23 localhost kernel: [   25.398412] host10: rport 0004e0: PRLI spp_flags = 0x21 spp_type 0x8
> Aug  8 10:53:23 localhost kernel: [   25.404853] host10: rport 0004e0: Port entered RTV state from PRLI state
> Aug  8 10:53:23 localhost kernel: [   25.450291] host10: rport 0004e0: Received a RTV reject
> Aug  8 10:53:23 localhost kernel: [   25.455620] host10: rport 0004e0: Port is Ready
> Aug  8 10:53:23 localhost kernel: [   25.460185] host10: rport 0004e0: work event 1
> Aug  8 10:53:23 localhost kernel: [   25.464672] host10: rport 0004e0: No rport!
> Aug  8 10:53:23 localhost kernel: [   25.469577] host10: rport 0004e0: callback ev 1
> Aug  8 10:53:23 localhost kernel: [   25.474152] host10: fip: vn_rport_callback 4e0 event 1
> Aug  8 10:53:30 localhost kernel: [   31.850891] host10: fip: beacon from rport 4e0
> 
> This is the error retry with a 4.16 unpatched:
> 
> Aug  8 11:33:50 localhost kernel: [ 2452.565524] host10: fip: vn_claim_notify: send reply to c76e
> Aug  8 11:33:50 localhost kernel: [ 2452.571392] host10: fip: vn_add rport 00c76e new state 0
> Aug  8 11:33:50 localhost kernel: [ 2452.576898] host10: fip: vn_claim_notify: send reply to c76e
> Aug  8 11:33:50 localhost kernel: [ 2452.582605] host10: fip: vn_add rport 00c76e old state 0
> Aug  8 11:33:57 localhost kernel: [ 2458.854081] host10: fip: beacon from rport c76e
> Aug  8 11:33:57 localhost kernel: [ 2458.858622] host10: fip: beacon expired for rport c76e
> Aug  8 11:33:57 localhost kernel: [ 2458.863760] host10: rport 00c76e: Login to port
> Aug  8 11:33:57 localhost kernel: [ 2458.868455] host10: rport 00c76e: Entered FLOGI state from Init state
> Aug  8 11:33:57 localhost kernel: [ 2458.875177] host10: fip: els_send op 7 d_id c76e
> Aug  8 11:33:57 localhost kernel: [ 2458.886089] host10: fip: beacon from rport c76e
> Aug  8 11:33:57 localhost kernel: [ 2458.896069] host10: rport 00c76e: Received a FLOGI accept
> Aug  8 11:33:57 localhost kernel: [ 2458.901716] host10: rport 00c76e: Port entered PLOGI state from FLOGI state
> Aug  8 11:33:57 localhost kernel: [ 2458.934161] host10: rport 00c76e: Received a PLOGI accept
> Aug  8 11:33:57 localhost kernel: [ 2458.939651] host10: rport 00c76e: Port entered PRLI state from PLOGI state
> Aug  8 11:33:57 localhost kernel: [ 2458.985427] host10: rport 00c76e: Received a PRLI reject
> Aug  8 11:33:57 localhost kernel: [ 2458.990817] host10: rport 00c76e: PRLI ELS rejected, reason 5 expl 0
> Aug  8 11:33:57 localhost kernel: [ 2458.997255] host10: rport 00c76e: Error 5 in state PRLI, retrying
> Aug  8 11:33:59 localhost kernel: [ 2461.060167] host10: rport 00c76e: Port timeout, state PRLI
> Aug  8 11:33:59 localhost kernel: [ 2461.065688] host10: rport 00c76e: Port entered PRLI state from PRLI state
> Aug  8 11:33:59 localhost kernel: [ 2461.111048] host10: rport 00c76e: Received a PRLI reject
> Aug  8 11:33:59 localhost kernel: [ 2461.116388] host10: rport 00c76e: PRLI ELS rejected, reason 9 expl 1e
> Aug  8 11:33:59 localhost kernel: [ 2461.122927] host10: rport 00c76e: Error 5 in state PRLI, retrying
> Aug  8 11:34:01 localhost kernel: [ 2463.172152] host10: rport 00c76e: Port timeout, state PRLI
> Aug  8 11:34:01 localhost kernel: [ 2463.177696] host10: rport 00c76e: Port entered PRLI state from PRLI state
> Aug  8 11:34:01 localhost kernel: [ 2463.224578] host10: rport 00c76e: Received a PRLI reject
> Aug  8 11:34:01 localhost kernel: [ 2463.230023] host10: rport 00c76e: PRLI ELS rejected, reason 9 expl 1e
> Aug  8 11:34:01 localhost kernel: [ 2463.236765] host10: rport 00c76e: Error 5 in state PRLI, retrying
> Aug  8 11:34:03 localhost kernel: [ 2465.283937] host10: rport 00c76e: Port timeout, state PRLI
> Aug  8 11:34:03 localhost kernel: [ 2465.289424] host10: rport 00c76e: Port entered PRLI state from PRLI state
> Aug  8 11:34:03 localhost kernel: [ 2465.335150] host10: rport 00c76e: Received a PRLI reject
> Aug  8 11:34:03 localhost kernel: [ 2465.340540] host10: rport 00c76e: PRLI ELS rejected, reason 9 expl 1e
> Aug  8 11:34:03 localhost kernel: [ 2465.347068] host10: rport 00c76e: Error -5 in state PRLI, retries 3
> Aug  8 11:34:03 localhost kernel: [ 2465.353411] host10: rport 00c76e: Port sending LOGO from PRLI state
> Aug  8 11:34:03 localhost kernel: [ 2465.359760] host10: fip: els_send op 9 d_id c76e
> Aug  8 11:34:05 localhost kernel: [ 2467.046011] host10: fip: beacon from rport c76e
> Aug  8 11:34:05 localhost kernel: [ 2467.050549] host10: fip: beacon from rport c76e
> Aug  8 11:34:09 localhost kernel: [ 2470.857393] host10: fip: vn_claim_notify: send reply to c76e
> Aug  8 11:34:09 localhost kernel: [ 2470.863225] host10: fip: vn_add rport 00c76e old state 4
> Aug  8 11:34:09 localhost kernel: [ 2470.868764] host10: fip: vn_claim_notify: send reply to c76e
> Aug  8 11:34:09 localhost kernel: [ 2470.874463] host10: fip: vn_add rport 00c76e old state 4
> Aug  8 11:34:13 localhost kernel: [ 2475.238121] host10: fip: beacon from rport c76e
> Aug  8 11:34:13 localhost kernel: [ 2475.242660] host10: fip: beacon expired for rport c76e
> Aug  8 11:34:13 localhost kernel: [ 2475.247804] host10: rport 00c76e: port already started
> 
> Repeat:
> 
> Aug  8 11:34:13 localhost kernel: [ 2475.253076] host10: fip: beacon from rport c76e
> Aug  8 11:34:21 localhost kernel: [ 2483.429903] host10: fip: beacon from rport c76e
> Aug  8 11:34:21 localhost kernel: [ 2483.434500] host10: fip: beacon from rport c76e
> Aug  8 11:34:29 localhost kernel: [ 2491.621938] host10: fip: beacon from rport c76e
> Aug  8 11:34:29 localhost kernel: [ 2491.626478] host10: fip: beacon from rport c76e
> 
> This part contains a memory leak after a timeout:
> Aug  8 11:34:33 localhost kernel: [ 2495.433102] host10: fip: vn_claim_notify: send reply to c76e
> Aug  8 11:34:33 localhost kernel: [ 2495.438842] host10: fip: vn_add rport 00c76e old state 4
> Aug  8 11:34:33 localhost kernel: [ 2495.444266] host10: fip: vn_claim_notify: send reply to c76e
> Aug  8 11:34:33 localhost kernel: [ 2495.450120] host10: fip: vn_add rport 00c76e old state 4
> Aug  8 11:34:38 localhost kernel: [ 2499.813722] host10: fip: beacon from rport c76e
> Aug  8 11:34:38 localhost kernel: [ 2499.818256] host10: fip: beacon expired for rport c76e
> Aug  8 11:34:38 localhost kernel: [ 2499.823402] host10: rport 00c76e: port already started
> 
> You can find the raw kmemleak and kern.log here:
> https://github.com/hardkernel/linux/issue_comments#issuecomment-411772096
> 
> And the some filtered logs a bit more up.
> 
> The logs when it all went ok:
> https://github.com/hardkernel/linux/issues/360#issuecomment-411335647
> Especially:
> Aug  7 17:54:30 localhost kernel: [   15.540607] host10: fip: claim resp from from rport c76e - state VNMP_UP
> Aug  7 17:54:30 localhost kernel: [   15.548710] host10: fip: vn_add rport 00c76e new state 0
> Aug  7 17:54:30 localhost kernel: [   15.555463] host10: fip: claim resp from from rport c76e - state VNMP_UP
> Aug  7 17:54:30 localhost kernel: [   15.563584] host10: fip: vn_add rport 00c76e old state 0
> Aug  7 17:54:30 localhost kernel: [   15.696327] host10: fip: beacon expired for rport c76e
> Aug  7 17:54:30 localhost kernel: [   15.702975] host10: rport 00c76e: Login to port
> Aug  7 17:54:30 localhost kernel: [   15.707617] host10: rport 00c76e: Entered FLOGI state from Init state
> Aug  7 17:54:30 localhost kernel: [   15.714224] host10: fip: els_send op 7 d_id c76e
> Aug  7 17:54:30 localhost kernel: [   15.736741] host10: rport 00c76e: Received a FLOGI accept
> Aug  7 17:54:30 localhost kernel: [   15.742572] host10: rport 00c76e: Port entered PLOGI state from FLOGI state
> Aug  7 17:54:30 localhost kernel: [   15.749695] host10: rport 00c76e: Received a PLOGI accept
> Aug  7 17:54:30 localhost kernel: [   15.749696] host10: rport 00c76e: Port entered PRLI state from PLOGI state
> Aug  7 17:54:30 localhost kernel: [   15.750112] host10: rport 00c76e: Received a PRLI reject
> Aug  7 17:54:30 localhost kernel: [   15.750113] host10: rport 00c76e: PRLI ELS rejected, reason 9 expl 1e
> Aug  7 17:54:30 localhost kernel: [   15.750114] host10: rport 00c76e: Error 5 in state PRLI, retrying
> Aug  7 17:54:30 localhost kernel: [   15.750360] host10: rport 00c76e: Received PRLI request while in state PRLI
> Aug  7 17:54:30 localhost kernel: [   15.750371] host10: rport 00c76e: PRLI rspp type 8 active 1 passive 0
> Aug  7 17:54:30 localhost kernel: [   15.750570] host10: rport 00c76e: Received RTV request
> Aug  7 17:54:30 localhost kernel: [   17.758897] host10: rport 00c76e: Port timeout, state PRLI
> Aug  7 17:54:30 localhost kernel: [   17.764559] host10: rport 00c76e: Port entered PRLI state from PRLI state
> Aug  7 17:54:30 localhost kernel: [   17.812081] host10: rport 00c76e: Received a PRLI accept
> Aug  7 17:54:31 localhost kernel: [   17.818809] host10: rport 00c76e: PRLI spp_flags = 0x21 spp_type 0x8
> Aug  7 17:54:31 localhost kernel: [   17.826634] host10: rport 00c76e: Port entered RTV state from PRLI state
> Aug  7 17:54:31 localhost kernel: [   17.878821] host10: rport 00c76e: Received a RTV accept
> Aug  7 17:54:31 localhost kernel: [   17.885478] host10: rport 00c76e: Port is Ready
> Aug  7 17:54:31 localhost kernel: [   17.891513] host10: rport 00c76e: work event 1
> Aug  7 17:54:31 localhost kernel: [   17.897386] host10: rport 00c76e: No rport!
> Aug  7 17:54:31 localhost kernel: [   17.903698] host10: rport 00c76e: callback ev 1
> Aug  7 17:54:31 localhost kernel: [   17.909081] host10: fip: vn_rport_callback c76e event 1
> Aug  7 17:54:34 localhost kernel: [   20.992106] host10: rport 00c76e: Received ADISC request
> 
> So in the "working" case I see that the other node (4.16 unpatched) accepts our
> PRLI, if I interpret that right, and in the other case it
> doesn't. Same system, different boot order though, but I guess
> the boot order doesn't matter. 
> When it doesn't work, it gets or sends a mutual LOGO and gets into a weird state.
> 
> Anyway: as far as I interpreted it, in the vn2vn situation alle
> enodes should know all other rports to prevent a double rport
> assignment. I have no experience with FC itself though, so I
> don't know what the normal state is but sending a LOGO due to a
> failure to set up, is that ok?
> Or is the vn2vn rport maintenance on another layer?
> 
> The 4.9 kernel is still to be baked...
> I expect it to yield minor kmemleaks, until I start toggling the
> FCoE.
> 
> Regards,
> Ard van Breemen
> 
> PS: I removed some cross spamming from the Cc:, I hope that's ok.

Full quote for reference. This looks like the PRLI rework from Hannes
introduced the memleaks here.

Hannes, can you have a look at it?
Ard van Breemen Aug. 15, 2018, 2:09 p.m. UTC | #15
Hi,

On Fri, Aug 10, 2018 at 10:34:49AM +0200, Johannes Thumshirn wrote:
> Hannes can you have a look at it?

As a side note, I am busy with other things for the next 3 weeks, but
after that I will be able to add some printk's and run it again.
Let me know if you want me to put probes somewhere to have a better look
at the states :-).

Regards,
Ard
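
(Instead of ad-hoc printk's, kprobe events via tracefs could log the
interesting rport transitions without rebuilding the kernel. A rough
sketch; the symbol names below are assumptions and only work if those
functions show up in /proc/kallsyms on the running kernel.)

cd /sys/kernel/debug/tracing
# probe two interesting spots in the rport state machine
echo 'p:fcrport_err fc_rport_error_retry'  > kprobe_events
echo 'p:fcrport_prli fc_rport_enter_prli' >> kprobe_events
echo 1 > events/kprobes/enable
# stream the hits as they happen
cat trace_pipe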