rtl8821ae keep alive not set, connection lost

Message ID	59e28611-9840-8873-2f15-1263e4e93d1c@lwfinger.net (mailing list archive)
State	RFC
Delegated to:	Kalle Valo
Headers	show Return-Path: <linux-wireless-owner@kernel.org> Subject: Re: rtl8821ae keep alive not set, connection lost To: James Cameron <quozl@laptop.org>, linux-wireless@vger.kernel.org Cc: Ping-Ke Shih <pkshih@realtek.com>, Kalle Valo <kvalo@codeaurora.org> References: <20170912220916.GB32211@us.netrek.org> From: Larry Finger <Larry.Finger@lwfinger.net> Message-ID: <59e28611-9840-8873-2f15-1263e4e93d1c@lwfinger.net> Date: Wed, 13 Sep 2017 10:01:37 -0500 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.3.0 MIME-Version: 1.0 In-Reply-To: <20170912220916.GB32211@us.netrek.org> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-wireless-owner@vger.kernel.org Precedence: bulk

Larry Finger Sept. 13, 2017, 3:01 p.m. UTC

On 09/12/2017 05:09 PM, James Cameron wrote:
> Summary: 40b368af4b75 ("rtlwifi: Fix alignment issues") breaks
> rtl8821ae keep alive, causing "Connection to AP lost" and deauth, but
> why?
> 
> Wireless connection is lost after a few seconds or minutes, on every
> OLPC NL3 laptop with rtl8821ae, with any stable kernel after 4.10.1,
> and any kernel with 40b368af4b75.
> 
> dmesg contains
> 
>    wlp2s0: Connection to AP 2c:b0:5d:a6:86:eb lost
> 
> iw event shows
> 
>    wlp2s0: del station 2c:b0:5d:a6:86:eb
>    wlp2s0 (phy #0): deauth 74:c6:3b:09:b5:0d -> 2c:b0:5d:a6:86:eb reason 4: Disassociated due to inactivity
>    wlp2s0 (phy #0): disconnected (local request)
> 
> Workaround is to bounce the link, then reconnect;
> 
>    ip link set wlp2s0 down
>    ip link set wlp2s0 up
>    iw dev wlp2s0 connect qz
> 
> A nearby monitor host captures a deauthentication packet sent by the
> device.
> 
> Bisection showed cause is 40b368af4b75 ("rtlwifi: Fix alignment
> issues") which changes the width of DBI register read.
> 
> On the face of it, 40b368af4b75 looks correct, especially compared
> against same function in rtl8723be.
> 
> I've no idea why reverting fixes the problem.  I'm hoping someone here
> might speculate and suggest ways to test.
> 
> As keep alive is set through this path, my guess is that keep alive is
> not being set in the device.  Or perhaps reading 16-bits perturbs
> another register.  Is there a way to test?
> 
> http://dev.laptop.org/~quozl/z/1drtGD.txt dmesg of 4.13
> 
> http://dev.laptop.org/~quozl/z/1drt7c.txt dmesg with 4.13 and revert
> of 40b368af4b75

James,

Thank you very much for making the effort to bisect this problem. I know that 
several people have reported the problem, which we cannot duplicate; however, 
most of them just say it drops the connection and do nothing more. In fact, we 
are lucky to have them even report which kernel version they are running!

As we do not see the problem, we will be relying on you to help diagnose the 
issue. Merely changing the read from 8 to 16 bits should not cause any change.

As _rtl8821ae_dbi_read() is only called from _rtl8821ae_enable_aspm_back_door(), 
we want to test turning off ASPM. The following patch will accomplish this. 
Unfortunately, the patch is white-space damaged, thus you will need to apply it 
manually. Please try it to see if it helps your connection loss. Note that ASPM 
settings are preserved through a module unload/reload sequence. Thus you will 
need to reboot after rebuilding the driver.

Thanks,

Larry

James Cameron Sept. 13, 2017, 9:46 p.m. UTC | #1

On Wed, Sep 13, 2017 at 10:01:37AM -0500, Larry Finger wrote:
> Thank you very much for making the effort to bisect this problem. I
> know that several people have reported the problem, which we cannot
> duplicate; however, most of them just say it drops the connection
> and do nothing more.  In fact, we are lucky to have them even report
> which kernel version they are running!

Yes, in the reported bugs that style is common; almost animistic, very
mystical, and based on heuristics rather than analysis.  ;-)

> As we do not see the problem, we will be relying on you to help
> diagnose the issue. Merely changing the read from 8 to 16 bits
> should not cause any change.

Agreed.

> As _rtl8821ae_dbi_read() is only called from
> _rtl8821ae_enable_aspm_back_door(), we want to test turning off
> ASPM. The following patch will accomplish this. Unfortunately, the
> patch is white-space damaged, thus you will need to apply it
> manually. Please try it to see if it helps your connection
> loss. Note that ASPM settings are preserved through a module
> unload/reload sequence. Thus you will need to reboot after
> rebuilding the driver.

Went back to 4.13, added your test patch, and built kernel.

http://dev.laptop.org/~quozl/z/1dsFOW.txt is dmesg.

New symptom occurs; after 23 seconds since last transmission, the
device becomes unresponsive to ping from another host, but begins to
respond if the device transmits.  Flurry of responses then it settles
down to regular ping.

64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=39 ttl=64 time=1.71 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=40 ttl=64 time=1.93 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=41 ttl=64 time=1.71 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=42 ttl=64 time=1.66 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=43 ttl=64 time=1.70 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=44 ttl=64 time=1.69 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=45 ttl=64 time=37.7 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=46 ttl=64 time=383 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=47 ttl=64 time=11464 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=48 ttl=64 time=10465 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=49 ttl=64 time=9465 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=50 ttl=64 time=8466 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=51 ttl=64 time=7466 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=52 ttl=64 time=6466 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=53 ttl=64 time=5466 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=54 ttl=64 time=4467 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=55 ttl=64 time=3467 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=56 ttl=64 time=2468 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=57 ttl=64 time=1468 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=58 ttl=64 time=469 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=59 ttl=64 time=1.79 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=60 ttl=64 time=1.75 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=61 ttl=64 time=1.72 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=62 ttl=64 time=1.68 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=63 ttl=64 time=1.68 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=64 ttl=64 time=1.95 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=65 ttl=64 time=1.68 ms

I'll give it some more testing and let you know, but it seems as
capable of keeping a connection as 4.13 plus my earlier revert.

Larry Finger Sept. 14, 2017, 12:39 a.m. UTC | #2

On 09/13/2017 04:46 PM, James Cameron wrote:
> 
> I'll give it some more testing and let you know, but it seems as
> capable of keeping a connection as 4.13 plus my earlier revert.
> 

The change I sent earlier should be as good as reverting the change to 
write_byte in your reversion.

There has been a report (in Russian unfortunately) at 
https://www.linux.org.ru/forum/desktop/12620193 of delays in ARP handling. 
According to Google translate is as follows:

============================================================
Periodically, Wi-Fi networker rtl8821ae ceases to respond to ARP, which causes 
the Internet to end. Wireshark looks quite interesting: ARP replays can be sent 
by one large packet a few seconds after receiving the requests, ie. they seem to 
be buffered somewhere.

arping, launched under strace, also hints at certain problems:
sendto(3, 
"\0\1\10\0\6\4\0\1\334\205\336\3572\343\300\250\0h\377\377\377\377\377\377\300\250\0\1", 
28, 0, {sa_family=AF_PACKET, proto=0x806, if3, pkttype=PACKET_HOST, addr(6)={1, 
ffffffffffff}, 20) = -1 ENOBUFS (No buffer space available)
alarm(1)                                = 0
rt_sigreturn()                          = 45
recvfrom(3, 0x7fffc1646030, 4096, 0, 0x7fffc1645fb0, 0x7fffc1645eac) = ? 
ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
sendto(3, 
"\0\1\10\0\6\4\0\1\334\205\336\3572\343\300\250\0h\377\377\377\377\377\377\300\250\0\1", 
28, 0, {sa_family=AF_PACKET, proto=0x806, if3, pkttype=PACKET_HOST, addr(6)={1, 
ffffffffffff}, 20) = -1 ENOBUFS (No buffer space available)
alarm(1)                                = 0
rt_sigreturn()                          = 45
recvfrom(3, 0x7fffc1646030, 4096, 0, 0x7fffc1645fb0, 0x7fffc1645eac) = ? 
ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
recvfrom(3, 0x7fffc1646030, 4096, 0, 0x7fffc1645fb0, 0x7fffc1645eac) = ? 
ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
sendto(3, 
"\0\1\10\0\6\4\0\1\334\205\336\3572\343\300\250\0h\377\377\377\377\377\377\300\250\0\1", 
28, 0, {sa_family=AF_PACKET, proto=0x806, if3, pkttype=PACKET_HOST, addr(6)={1, 
ffffffffffff}, 20) = -1 ENOBUFS (No buffer space available)
============================================================

I need to explore that ENOBUFS return code.

Your case where the device is unresponsive to pings from another NIC until the 
device transmits may also be an ARP problem.

For completeness, are you using the 2.4 of 5 GHz band? What is the make/model 
your AP? If possible for you to determine, what firmware is it running?

Thanks,

Larry

James Cameron Sept. 14, 2017, 9:27 a.m. UTC | #3

On Wed, Sep 13, 2017 at 07:39:35PM -0500, Larry Finger wrote:
> On 09/13/2017 04:46 PM, James Cameron wrote:
> >
> >I'll give it some more testing and let you know, but it seems as
> >capable of keeping a connection as 4.13 plus my earlier revert.
> >

Testing went well; removing the call to enable ASPM was as good as
changing the DBI read back to 16-bit width.

> The change I sent earlier should be as good as reverting the change
> to write_byte in your reversion.

Yes, that would be the hope.

But with the 16-bit DBI read, the register REG_DBI_CTRL+0 is being
read as well, in the first read in _rtl8821ae_enable_aspm_back_door,
so perhaps reading that register has an unexpected side-effect.

Is there any documentation for that register?  I see other code writes
to REG_DBI_CTRL+3, in _rtl8821ae_check_pcie_dma_hang

Evidence of read from REG_DBI_CTRL was captured with an instrumented
kernel; git diff http://dev.laptop.org/~quozl/y/1dsQ6B.txt yielding
these dmesg lines;

[    6.010255] rtl_pci: _rtl_pci_update_default_setting const_amdpci_aspm=03
[    6.010338] rtl_pci: rtl_pci_enable_aspm
[    6.034295] ieee80211 phy0: Selected rate control algorithm 'rtl_rc'
[    6.034806] rtlwifi: rtlwifi: wireless switch is on
[    6.196958] rtl8821ae 0000:02:00.0 wlp2s0: renamed from wlan0
[    7.979186] rtl_pci: rtl_pci_disable_aspm
[    7.979306] rtl8821ae: _rtl8821ae_check_pcie_dma_hang
[    8.295360] rtl8821ae: _rtl8821ae_enable_aspm_back_door
[    8.295437] rtl8821ae: _rtl8821ae_dbi_read  070f -> ffff (@034f)
[    8.295449] rtl8821ae: _rtl8821ae_dbi_write 070f <- ff (@870c)
[    8.295462] rtl8821ae: _rtl8821ae_dbi_read  0719 -> 0200 (@034d)
[    8.295474] rtl8821ae: _rtl8821ae_dbi_write 0719 <- 18 (@2718)
[    8.295477] rtl_pci: rtl_pci_enable_aspm
[    8.469734] rtl_pci: rtl_pci_disable_aspm
[    8.469857] rtl8821ae: _rtl8821ae_check_pcie_dma_hang
[    8.686955] rtl8821ae: _rtl8821ae_enable_aspm_back_door
[    8.687013] rtl8821ae: _rtl8821ae_dbi_read  070f -> ffff (@034f)
[    8.687025] rtl8821ae: _rtl8821ae_dbi_write 070f <- ff (@870c)
[    8.687038] rtl8821ae: _rtl8821ae_dbi_read  0719 -> 0218 (@034d)
[    8.687050] rtl8821ae: _rtl8821ae_dbi_write 0719 <- 18 (@2718)
[    8.687053] rtl_pci: rtl_pci_enable_aspm

Observe how the windowed read of DBI register 0x70f causes a read of
16-bits at 0x34f, which includes first 8-bits of 0x350 REG_DBI_CTRL.

By the way, the cold boot value of DBI register 0x719 is 0x00, and
the warm boot value is 0x18, so I'm confident there isn't a
comprehensive register reset.  It means that BIOS has relevance; and
this BIOS is outside my control.  BIOS variation may explain
difficulty reproducing.

> There has been a report (in Russian unfortunately) at
> https://www.linux.org.ru/forum/desktop/12620193 of delays in ARP
> handling.

Thanks.  I've considered and excluded ARP handling delay.  Though ARP
renewal is typical reason for device sleep to end.

With the call to enable ASPM disabled, instead of changing the DBI
read to 16-bit width, what happens is that the device stops accepting
data from the access point, packets are buffered there, and are
transmitted as soon as the device makes the next transmission.

http://dev.laptop.org/~quozl/z/1dsQBf.txt has the ping and IP tcpdump
to confirm this.

I've a monitor mode tcpdump I can send by private mail if required.
In that the burst of packets shows ICMP echo requests were buffered by
the access point.

> According to Google translate is as follows:
> 
> ============================================================
> Periodically, Wi-Fi networker rtl8821ae ceases to respond to ARP,
> which causes the Internet to end. Wireshark looks quite interesting:
> ARP replays can be sent by one large packet a few seconds after
> receiving the requests, ie. they seem to be buffered somewhere.

Yes, buffering at access point.

> I need to explore that ENOBUFS return code.

I've seen ENOBUFS up at the application level with ping too, when the
original problem happens with v4.10 plus stable.

> Your case where the device is unresponsive to pings from another NIC
> until the device transmits may also be an ARP problem.
> 
> For completeness, are you using the 2.4 of 5 GHz band? What is the
> make/model your AP? If possible for you to determine, what firmware
> is it running?

2.4 GHz and 5 GHz reproduces the problem.

Open or WPA reproduces the problem.

Netgear WNDR3800 OpenWrt 12.09-beta, r33312.

Several other access points reproduce the problem, including a
customer's TP-Link TL-WR1042ND with unknown firmware version.

No access point as yet does not reproduce the problem.

Hope that helps, thanks for your ideas.

James Cameron Sept. 19, 2017, 9:42 a.m. UTC | #4

On Thu, Sep 14, 2017 at 07:27:39PM +1000, James Cameron wrote:
> On Wed, Sep 13, 2017 at 07:39:35PM -0500, Larry Finger wrote:
> > On 09/13/2017 04:46 PM, James Cameron wrote:
> > >
> > >I'll give it some more testing and let you know, but it seems as
> > >capable of keeping a connection as 4.13 plus my earlier revert.
> > >
> 
> Testing went well; removing the call to enable ASPM was as good as
> changing the DBI read back to 16-bit width.
> 
> > The change I sent earlier should be as good as reverting the change
> > to write_byte in your reversion.
> 
> Yes, that would be the hope.
> 
> But with the 16-bit DBI read, the register REG_DBI_CTRL+0 is being
> read as well, in the first read in _rtl8821ae_enable_aspm_back_door,
> so perhaps reading that register has an unexpected side-effect.
> 

I've ruled that out after testing for several days different kernels
based on v4.13;

- add an rtl_read_byte of REG_DBI_CTRL+0 in rtl8821ae_hw_init just
  after the call to enable_aspm; does not solve problem,

- add an rtl_read_byte of REG_DBI_CTRL+0 at the start of
  _rtl8821ae_check_pcie_dma_hang; does not solve problem,

Only way to solve the problem at the moment is either;

- reverting 40b368af4b75 ("rtlwifi: Fix alignment issues"), which
  means using rtl_read_word in _rtl8821ae_dbi_read,

or

- removing the two lines that enable ASPM, as you asked me to try.

> Is there any documentation for that register?  I see other code writes
> to REG_DBI_CTRL+3, in _rtl8821ae_check_pcie_dma_hang

I'll repeat and expand on this.  Is there any documentation for this
register, or the other REG_DBI_* registers?

I see that DBI windowed access in rtl8192de is different and yet very
similar.

In rtl8821ae, rtl8723be, and rtl8192de the method seems straightforward;
there are bits for address, bits for write enable by byte, and flag
bits for starting the transfer and completing.

> Evidence of read from REG_DBI_CTRL was captured with an instrumented
> kernel; git diff http://dev.laptop.org/~quozl/y/1dsQ6B.txt yielding
> these dmesg lines;
> 
> [    6.010255] rtl_pci: _rtl_pci_update_default_setting const_amdpci_aspm=03
> [    6.010338] rtl_pci: rtl_pci_enable_aspm
> [    6.034295] ieee80211 phy0: Selected rate control algorithm 'rtl_rc'
> [    6.034806] rtlwifi: rtlwifi: wireless switch is on
> [    6.196958] rtl8821ae 0000:02:00.0 wlp2s0: renamed from wlan0
> [    7.979186] rtl_pci: rtl_pci_disable_aspm
> [    7.979306] rtl8821ae: _rtl8821ae_check_pcie_dma_hang
> [    8.295360] rtl8821ae: _rtl8821ae_enable_aspm_back_door
> [    8.295437] rtl8821ae: _rtl8821ae_dbi_read  070f -> ffff (@034f)
> [    8.295449] rtl8821ae: _rtl8821ae_dbi_write 070f <- ff (@870c)
> [    8.295462] rtl8821ae: _rtl8821ae_dbi_read  0719 -> 0200 (@034d)
> [    8.295474] rtl8821ae: _rtl8821ae_dbi_write 0719 <- 18 (@2718)
> [    8.295477] rtl_pci: rtl_pci_enable_aspm
> [    8.469734] rtl_pci: rtl_pci_disable_aspm
> [    8.469857] rtl8821ae: _rtl8821ae_check_pcie_dma_hang
> [    8.686955] rtl8821ae: _rtl8821ae_enable_aspm_back_door
> [    8.687013] rtl8821ae: _rtl8821ae_dbi_read  070f -> ffff (@034f)
> [    8.687025] rtl8821ae: _rtl8821ae_dbi_write 070f <- ff (@870c)
> [    8.687038] rtl8821ae: _rtl8821ae_dbi_read  0719 -> 0218 (@034d)
> [    8.687050] rtl8821ae: _rtl8821ae_dbi_write 0719 <- 18 (@2718)
> [    8.687053] rtl_pci: rtl_pci_enable_aspm
> 
> Observe how the windowed read of DBI register 0x70f causes a read of
> 16-bits at 0x34f, which includes first 8-bits of 0x350 REG_DBI_CTRL.
> 
> By the way, the cold boot value of DBI register 0x719 is 0x00, and
> the warm boot value is 0x18, so I'm confident there isn't a
> comprehensive register reset.  It means that BIOS has relevance; and
> this BIOS is outside my control.  BIOS variation may explain
> difficulty reproducing.

Is there a register for device reset that I can try?  It would help
to exclude BIOS.

> 
> > There has been a report (in Russian unfortunately) at
> > https://www.linux.org.ru/forum/desktop/12620193 of delays in ARP
> > handling.
> 
> Thanks.  I've considered and excluded ARP handling delay.  Though ARP
> renewal is typical reason for device sleep to end.
> 
> With the call to enable ASPM disabled, instead of changing the DBI
> read to 16-bit width, what happens is that the device stops accepting
> data from the access point, packets are buffered there, and are
> transmitted as soon as the device makes the next transmission.
> 
> http://dev.laptop.org/~quozl/z/1dsQBf.txt has the ping and IP tcpdump
> to confirm this.
> 
> I've a monitor mode tcpdump I can send by private mail if required.
> In that the burst of packets shows ICMP echo requests were buffered by
> the access point.
> 
> > According to Google translate is as follows:
> > 
> > ============================================================
> > Periodically, Wi-Fi networker rtl8821ae ceases to respond to ARP,
> > which causes the Internet to end. Wireshark looks quite interesting:
> > ARP replays can be sent by one large packet a few seconds after
> > receiving the requests, ie. they seem to be buffered somewhere.
> 
> Yes, buffering at access point.
> 
> > I need to explore that ENOBUFS return code.
> 
> I've seen ENOBUFS up at the application level with ping too, when the
> original problem happens with v4.10 plus stable.
> 
> > Your case where the device is unresponsive to pings from another NIC
> > until the device transmits may also be an ARP problem.
> > 
> > For completeness, are you using the 2.4 of 5 GHz band? What is the
> > make/model your AP? If possible for you to determine, what firmware
> > is it running?
> 
> 2.4 GHz and 5 GHz reproduces the problem.
> 
> Open or WPA reproduces the problem.
> 
> Netgear WNDR3800 OpenWrt 12.09-beta, r33312.
> 
> Several other access points reproduce the problem, including a
> customer's TP-Link TL-WR1042ND with unknown firmware version.
> 
> No access point as yet does not reproduce the problem.
> 
> Hope that helps, thanks for your ideas.
> 
> -- 
> James Cameron
> http://quozl.netrek.org/

James Cameron Sept. 20, 2017, 9:36 a.m. UTC | #5

On Tue, Sep 19, 2017 at 07:42:04PM +1000, James Cameron wrote:
> On Thu, Sep 14, 2017 at 07:27:39PM +1000, James Cameron wrote:
> > On Wed, Sep 13, 2017 at 07:39:35PM -0500, Larry Finger wrote:
> > > On 09/13/2017 04:46 PM, James Cameron wrote:
> > > >
> > > >I'll give it some more testing and let you know, but it seems as
> > > >capable of keeping a connection as 4.13 plus my earlier revert.
> > > >
> > 
> > Testing went well; removing the call to enable ASPM was as good as
> > changing the DBI read back to 16-bit width.
> > 
> > > The change I sent earlier should be as good as reverting the change
> > > to write_byte in your reversion.
> > 
> > Yes, that would be the hope.
> > 
> > But with the 16-bit DBI read, the register REG_DBI_CTRL+0 is being
> > read as well, in the first read in _rtl8821ae_enable_aspm_back_door,
> > so perhaps reading that register has an unexpected side-effect.
> > 
> 
> I've ruled that out after testing for several days different kernels
> based on v4.13;
> 
> - add an rtl_read_byte of REG_DBI_CTRL+0 in rtl8821ae_hw_init just
>   after the call to enable_aspm; does not solve problem,
> 
> - add an rtl_read_byte of REG_DBI_CTRL+0 at the start of
>   _rtl8821ae_check_pcie_dma_hang; does not solve problem,

When the problem occurs, register 0x350 bit 25 is set, for which a
comment in _rtl8821ae_check_pcie_dma_hang says means there is an RX
hang.

So perhaps driver should call _rtl8821ae_check_pcie_dma_hang
and _rtl8821ae_reset_pcie_interface_dma.

Any ideas where to do this?

> [...]

Larry Finger Sept. 20, 2017, 9:48 p.m. UTC | #6

On 09/20/2017 04:36 AM, James Cameron wrote:
> When the problem occurs, register 0x350 bit 25 is set, for which a
> comment in _rtl8821ae_check_pcie_dma_hang says means there is an RX
> hang.
> 
> So perhaps driver should call _rtl8821ae_check_pcie_dma_hang
> and _rtl8821ae_reset_pcie_interface_dma.
> 
> Any ideas where to do this?

Thanks for the extended debugging.

I was able to repeat your findings. With the 8-bit read of REG_DBI_RDATA, I got 
poor connection stability. Reverting that part made it stable again. For that 
reason, I pushed the partial reversion of commit 40b368af4b75 ("rtlwifi: Fix 
alignment issues").

Where did you detect that bit 25 of register 0x350 was set?

Larry

James Cameron Sept. 20, 2017, 11:22 p.m. UTC | #7

On Wed, Sep 20, 2017 at 04:48:23PM -0500, Larry Finger wrote:
> On 09/20/2017 04:36 AM, James Cameron wrote:
> >When the problem occurs, register 0x350 bit 25 is set, for which a
> >comment in _rtl8821ae_check_pcie_dma_hang says means there is an RX
> >hang.
> >
> >So perhaps driver should call _rtl8821ae_check_pcie_dma_hang
> >and _rtl8821ae_reset_pcie_interface_dma.
> >
> >Any ideas where to do this?
> 
> Thanks for the extended debugging.
> 
> I was able to repeat your findings. With the 8-bit read of
> REG_DBI_RDATA, I got poor connection stability. Reverting that part
> made it stable again. For that reason, I pushed the partial
> reversion of commit 40b368af4b75 ("rtlwifi: Fix alignment issues").

That's great you were able to reproduce, thanks!

> Where did you detect that bit 25 of register 0x350 was set?

In _rtl8821ae_check_pcie_dma_hang on link up.

REG_DBI_FLAG (0x350 bits 16-31) is observed as;

- 0x0000 on entry to function after warm boot,

- 0x0400 on exit from function; debug bit 23 is set by the function,

- 0x0400 on entry to function on link up when the problem has not
  happened,

- 0x0600 on entry to function on link up when the problem has
  happened.

But I don't know if 0x0600 is useful to detect earlier, or if it is
only a symptom of link down while device active.  Either way, if it
truly does signal an RX hang or firmware RX queue full, it's useful.

My "-q9" and "-qa" test kernels dump REG_DBI_CTRL and REG_DBI_FLAG.

"-q9" is with 8-bit read of REG_DBI_RDATA.

"-qa" is with 16-bit read of REG_DBI_DATA.

My "-qa" test kernel;
http://dev.laptop.org/~quozl/y/1dunwN.txt (git diff v4.13)
http://dev.laptop.org/~quozl/z/1dubX7.txt (dmesg)

REG_DBI_CTRL+3 used by _rtl8821ae_check_pcie_dma_hang is effectively
REG_DBI_FLAG+1 (0x353).

REG_DBI_CTRL is REG_DBI_ADDR; a duplicate register definition.

I'm still pondering a few more theories;

- change write_readback, it is true now, and the while()/udelay in
  _rtl8821ae_dbi_read seems a waste, it never executes,

- clearing REG_DBI_CTRL write enable bits at the end of
  _rtl8821ae_dbi_write,

- switching to 32-bit access as used by rtl8192de.

And a giggle from reviewing the code,
_rtl8821ae_wowlan_initialize_adapter says "Patch Pcie Rx DMA hang
after S3/S4 several times.  The root cause has not be found."
... I've learned that root causes that aren't found tend to cause
further problems later.  ;-)

Given this, my gut feel is firmware or silicon problem; RX DMA ceases,
the driver does not detect it, the connection is lost.

James Cameron Sept. 21, 2017, 8:07 a.m. UTC | #8

On Thu, Sep 21, 2017 at 09:22:28AM +1000, James Cameron wrote:
> On Wed, Sep 20, 2017 at 04:48:23PM -0500, Larry Finger wrote:
> > On 09/20/2017 04:36 AM, James Cameron wrote:
> > >When the problem occurs, register 0x350 bit 25 is set, for which a
> > >comment in _rtl8821ae_check_pcie_dma_hang says means there is an RX
> > >hang.
> > >
> > >So perhaps driver should call _rtl8821ae_check_pcie_dma_hang
> > >and _rtl8821ae_reset_pcie_interface_dma.
> > >
> > >Any ideas where to do this?
> > 
> > Thanks for the extended debugging.
> > 
> > I was able to repeat your findings. With the 8-bit read of
> > REG_DBI_RDATA, I got poor connection stability. Reverting that part
> > made it stable again. For that reason, I pushed the partial
> > reversion of commit 40b368af4b75 ("rtlwifi: Fix alignment issues").
> 
> That's great you were able to reproduce, thanks!
> [...]
> I'm still pondering a few more theories;
> 
> - change write_readback, it is true now, and the while()/udelay in
>   _rtl8821ae_dbi_read seems a waste, it never executes,

My test kernel "-qb" was write_readback = false in sw.c, with 8-bit
read of REG_DBI_RDATA, and has been stable for four hours.  I'll focus
on some more testing of this one.  It is a surprise.

http://dev.laptop.org/~quozl/z/1dutXk.txt (dmesg)

Observe how REG_DBI_FLAG+0 is briefly seen as 1, which doesn't happen
with write_readback = true.

> - clearing REG_DBI_CTRL write enable bits at the end of
>   _rtl8821ae_dbi_write,

My test kernel "-qc" had reset of REG_DBI_ADDR as last step in both
_rtl8821ae_dbi_read and _rtl8821ae_dbi_write, and was very unstable,
not able to connect.

http://dev.laptop.org/~quozl/y/1dutbX.txt (git diff v4.13)
http://dev.laptop.org/~quozl/z/1dutuM.txt (dmesg)

My test kernel "-qd" had reset of REG_DBI_ADDR as last step in only
_rtl8821ae_dbi_write, and had poor connection stability.

http://dev.laptop.org/~quozl/y/1dutr3.txt (git diff v4.13)
http://dev.laptop.org/~quozl/z/1duuDc.txt (dmesg connection lost)

Based on the above two kernels, clearing REG_DBI_ADDR after a read is
a bad idea, and suggests there is some underlying asynchronicity about
the DBI access.  Almost as if some other condition should signal
completion rather than zero in REG_DBI_FLAG+0.

> - switching to 32-bit access as used by rtl8192de.

My test kernel "-qe" changed RED_DBI_RDATA read to 32-bit, then used a
union hack to pull out the desired byte, and had poor connection
stability.

http://dev.laptop.org/~quozl/y/1duvIC.txt (git diff v4.13)
http://dev.laptop.org/~quozl/z/1duwI1.txt (dmesg connection lost)

Larry Finger Sept. 21, 2017, 2:40 p.m. UTC | #9

On 09/21/2017 03:07 AM, James Cameron wrote:
> My test kernel "-qb" was write_readback = false in sw.c, with 8-bit
> read of REG_DBI_RDATA, and has been stable for four hours.  I'll focus
> on some more testing of this one.  It is a surprise.
> 
> http://dev.laptop.org/~quozl/z/1dutXk.txt  (dmesg)
> 
> Observe how REG_DBI_FLAG+0 is briefly seen as 1, which doesn't happen
> with write_readback = true.

Again, thanks for your efforts.

At this point, my system has been up over 17 hours without a single drop. As a 
result, I will leave the reversion of commit 40b368af4b75 in place. It seems 
safer than turning off write_readback. After we get more testing, that could 
still be an option.

Larry

James Cameron Sept. 22, 2017, 5:35 a.m. UTC | #10

On Thu, Sep 21, 2017 at 09:40:14AM -0500, Larry Finger wrote:
> On 09/21/2017 03:07 AM, James Cameron wrote:
> >My test kernel "-qb" was write_readback = false in sw.c, with 8-bit
> >read of REG_DBI_RDATA, and has been stable for four hours.  I'll
> >focus on some more testing of this one.  It is a surprise.
> >
> >http://dev.laptop.org/~quozl/z/1dutXk.txt  (dmesg)
> >
> >Observe how REG_DBI_FLAG+0 is briefly seen as 1, which doesn't
> >happen with write_readback = true.
> 
> Again, thanks for your efforts.
> 
> At this point, my system has been up over 17 hours without a single
> drop. As a result, I will leave the reversion of commit 40b368af4b75
> in place. It seems safer than turning off write_readback. After we
> get more testing, that could still be an option.

Thanks for the reversion commit, I'll point others to it.

My apologies for sloppy work, the test kernel features got swapped!
"-qb" above was with write_readback off, and 16-bit read of
REG_DBI_RDATA, not 8-bit.  Verified with objdump.  It has run for 24
hours without a drop.

So at conclusion;

- the 16-bit read is good with or without write_readback.

- the 8-bit read is bad with or without write_readback, and tends to
  lose connection much quicker without write_readback.

Been a pleasure working with you.  Back to lurk mode.

rtl8821ae keep alive not set, connection lost

Commit Message

Comments

Patch