Message ID | 59e28611-9840-8873-2f15-1263e4e93d1c@lwfinger.net (mailing list archive) |
---|---|
State | RFC |
Delegated to: | Kalle Valo |
Headers | show |
On Wed, Sep 13, 2017 at 10:01:37AM -0500, Larry Finger wrote: > Thank you very much for making the effort to bisect this problem. I > know that several people have reported the problem, which we cannot > duplicate; however, most of them just say it drops the connection > and do nothing more. In fact, we are lucky to have them even report > which kernel version they are running! Yes, in the reported bugs that style is common; almost animistic, very mystical, and based on heuristics rather than analysis. ;-) > As we do not see the problem, we will be relying on you to help > diagnose the issue. Merely changing the read from 8 to 16 bits > should not cause any change. Agreed. > As _rtl8821ae_dbi_read() is only called from > _rtl8821ae_enable_aspm_back_door(), we want to test turning off > ASPM. The following patch will accomplish this. Unfortunately, the > patch is white-space damaged, thus you will need to apply it > manually. Please try it to see if it helps your connection > loss. Note that ASPM settings are preserved through a module > unload/reload sequence. Thus you will need to reboot after > rebuilding the driver. Went back to 4.13, added your test patch, and built kernel. http://dev.laptop.org/~quozl/z/1dsFOW.txt is dmesg. New symptom occurs; after 23 seconds since last transmission, the device becomes unresponsive to ping from another host, but begins to respond if the device transmits. Flurry of responses then it settles down to regular ping. 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=39 ttl=64 time=1.71 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=40 ttl=64 time=1.93 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=41 ttl=64 time=1.71 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=42 ttl=64 time=1.66 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=43 ttl=64 time=1.70 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=44 ttl=64 time=1.69 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=45 ttl=64 time=37.7 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=46 ttl=64 time=383 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=47 ttl=64 time=11464 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=48 ttl=64 time=10465 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=49 ttl=64 time=9465 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=50 ttl=64 time=8466 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=51 ttl=64 time=7466 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=52 ttl=64 time=6466 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=53 ttl=64 time=5466 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=54 ttl=64 time=4467 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=55 ttl=64 time=3467 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=56 ttl=64 time=2468 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=57 ttl=64 time=1468 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=58 ttl=64 time=469 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=59 ttl=64 time=1.79 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=60 ttl=64 time=1.75 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=61 ttl=64 time=1.72 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=62 ttl=64 time=1.68 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=63 ttl=64 time=1.68 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=64 ttl=64 time=1.95 ms 64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=65 ttl=64 time=1.68 ms I'll give it some more testing and let you know, but it seems as capable of keeping a connection as 4.13 plus my earlier revert.
On 09/13/2017 04:46 PM, James Cameron wrote: > > I'll give it some more testing and let you know, but it seems as > capable of keeping a connection as 4.13 plus my earlier revert. > The change I sent earlier should be as good as reverting the change to write_byte in your reversion. There has been a report (in Russian unfortunately) at https://www.linux.org.ru/forum/desktop/12620193 of delays in ARP handling. According to Google translate is as follows: ============================================================ Periodically, Wi-Fi networker rtl8821ae ceases to respond to ARP, which causes the Internet to end. Wireshark looks quite interesting: ARP replays can be sent by one large packet a few seconds after receiving the requests, ie. they seem to be buffered somewhere. arping, launched under strace, also hints at certain problems: sendto(3, "\0\1\10\0\6\4\0\1\334\205\336\3572\343\300\250\0h\377\377\377\377\377\377\300\250\0\1", 28, 0, {sa_family=AF_PACKET, proto=0x806, if3, pkttype=PACKET_HOST, addr(6)={1, ffffffffffff}, 20) = -1 ENOBUFS (No buffer space available) alarm(1) = 0 rt_sigreturn() = 45 recvfrom(3, 0x7fffc1646030, 4096, 0, 0x7fffc1645fb0, 0x7fffc1645eac) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} --- sendto(3, "\0\1\10\0\6\4\0\1\334\205\336\3572\343\300\250\0h\377\377\377\377\377\377\300\250\0\1", 28, 0, {sa_family=AF_PACKET, proto=0x806, if3, pkttype=PACKET_HOST, addr(6)={1, ffffffffffff}, 20) = -1 ENOBUFS (No buffer space available) alarm(1) = 0 rt_sigreturn() = 45 recvfrom(3, 0x7fffc1646030, 4096, 0, 0x7fffc1645fb0, 0x7fffc1645eac) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} --- recvfrom(3, 0x7fffc1646030, 4096, 0, 0x7fffc1645fb0, 0x7fffc1645eac) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} --- sendto(3, "\0\1\10\0\6\4\0\1\334\205\336\3572\343\300\250\0h\377\377\377\377\377\377\300\250\0\1", 28, 0, {sa_family=AF_PACKET, proto=0x806, if3, pkttype=PACKET_HOST, addr(6)={1, ffffffffffff}, 20) = -1 ENOBUFS (No buffer space available) ============================================================ I need to explore that ENOBUFS return code. Your case where the device is unresponsive to pings from another NIC until the device transmits may also be an ARP problem. For completeness, are you using the 2.4 of 5 GHz band? What is the make/model your AP? If possible for you to determine, what firmware is it running? Thanks, Larry
On Wed, Sep 13, 2017 at 07:39:35PM -0500, Larry Finger wrote: > On 09/13/2017 04:46 PM, James Cameron wrote: > > > >I'll give it some more testing and let you know, but it seems as > >capable of keeping a connection as 4.13 plus my earlier revert. > > Testing went well; removing the call to enable ASPM was as good as changing the DBI read back to 16-bit width. > The change I sent earlier should be as good as reverting the change > to write_byte in your reversion. Yes, that would be the hope. But with the 16-bit DBI read, the register REG_DBI_CTRL+0 is being read as well, in the first read in _rtl8821ae_enable_aspm_back_door, so perhaps reading that register has an unexpected side-effect. Is there any documentation for that register? I see other code writes to REG_DBI_CTRL+3, in _rtl8821ae_check_pcie_dma_hang Evidence of read from REG_DBI_CTRL was captured with an instrumented kernel; git diff http://dev.laptop.org/~quozl/y/1dsQ6B.txt yielding these dmesg lines; [ 6.010255] rtl_pci: _rtl_pci_update_default_setting const_amdpci_aspm=03 [ 6.010338] rtl_pci: rtl_pci_enable_aspm [ 6.034295] ieee80211 phy0: Selected rate control algorithm 'rtl_rc' [ 6.034806] rtlwifi: rtlwifi: wireless switch is on [ 6.196958] rtl8821ae 0000:02:00.0 wlp2s0: renamed from wlan0 [ 7.979186] rtl_pci: rtl_pci_disable_aspm [ 7.979306] rtl8821ae: _rtl8821ae_check_pcie_dma_hang [ 8.295360] rtl8821ae: _rtl8821ae_enable_aspm_back_door [ 8.295437] rtl8821ae: _rtl8821ae_dbi_read 070f -> ffff (@034f) [ 8.295449] rtl8821ae: _rtl8821ae_dbi_write 070f <- ff (@870c) [ 8.295462] rtl8821ae: _rtl8821ae_dbi_read 0719 -> 0200 (@034d) [ 8.295474] rtl8821ae: _rtl8821ae_dbi_write 0719 <- 18 (@2718) [ 8.295477] rtl_pci: rtl_pci_enable_aspm [ 8.469734] rtl_pci: rtl_pci_disable_aspm [ 8.469857] rtl8821ae: _rtl8821ae_check_pcie_dma_hang [ 8.686955] rtl8821ae: _rtl8821ae_enable_aspm_back_door [ 8.687013] rtl8821ae: _rtl8821ae_dbi_read 070f -> ffff (@034f) [ 8.687025] rtl8821ae: _rtl8821ae_dbi_write 070f <- ff (@870c) [ 8.687038] rtl8821ae: _rtl8821ae_dbi_read 0719 -> 0218 (@034d) [ 8.687050] rtl8821ae: _rtl8821ae_dbi_write 0719 <- 18 (@2718) [ 8.687053] rtl_pci: rtl_pci_enable_aspm Observe how the windowed read of DBI register 0x70f causes a read of 16-bits at 0x34f, which includes first 8-bits of 0x350 REG_DBI_CTRL. By the way, the cold boot value of DBI register 0x719 is 0x00, and the warm boot value is 0x18, so I'm confident there isn't a comprehensive register reset. It means that BIOS has relevance; and this BIOS is outside my control. BIOS variation may explain difficulty reproducing. > There has been a report (in Russian unfortunately) at > https://www.linux.org.ru/forum/desktop/12620193 of delays in ARP > handling. Thanks. I've considered and excluded ARP handling delay. Though ARP renewal is typical reason for device sleep to end. With the call to enable ASPM disabled, instead of changing the DBI read to 16-bit width, what happens is that the device stops accepting data from the access point, packets are buffered there, and are transmitted as soon as the device makes the next transmission. http://dev.laptop.org/~quozl/z/1dsQBf.txt has the ping and IP tcpdump to confirm this. I've a monitor mode tcpdump I can send by private mail if required. In that the burst of packets shows ICMP echo requests were buffered by the access point. > According to Google translate is as follows: > > ============================================================ > Periodically, Wi-Fi networker rtl8821ae ceases to respond to ARP, > which causes the Internet to end. Wireshark looks quite interesting: > ARP replays can be sent by one large packet a few seconds after > receiving the requests, ie. they seem to be buffered somewhere. Yes, buffering at access point. > I need to explore that ENOBUFS return code. I've seen ENOBUFS up at the application level with ping too, when the original problem happens with v4.10 plus stable. > Your case where the device is unresponsive to pings from another NIC > until the device transmits may also be an ARP problem. > > For completeness, are you using the 2.4 of 5 GHz band? What is the > make/model your AP? If possible for you to determine, what firmware > is it running? 2.4 GHz and 5 GHz reproduces the problem. Open or WPA reproduces the problem. Netgear WNDR3800 OpenWrt 12.09-beta, r33312. Several other access points reproduce the problem, including a customer's TP-Link TL-WR1042ND with unknown firmware version. No access point as yet does not reproduce the problem. Hope that helps, thanks for your ideas.
On Thu, Sep 14, 2017 at 07:27:39PM +1000, James Cameron wrote: > On Wed, Sep 13, 2017 at 07:39:35PM -0500, Larry Finger wrote: > > On 09/13/2017 04:46 PM, James Cameron wrote: > > > > > >I'll give it some more testing and let you know, but it seems as > > >capable of keeping a connection as 4.13 plus my earlier revert. > > > > > Testing went well; removing the call to enable ASPM was as good as > changing the DBI read back to 16-bit width. > > > The change I sent earlier should be as good as reverting the change > > to write_byte in your reversion. > > Yes, that would be the hope. > > But with the 16-bit DBI read, the register REG_DBI_CTRL+0 is being > read as well, in the first read in _rtl8821ae_enable_aspm_back_door, > so perhaps reading that register has an unexpected side-effect. > I've ruled that out after testing for several days different kernels based on v4.13; - add an rtl_read_byte of REG_DBI_CTRL+0 in rtl8821ae_hw_init just after the call to enable_aspm; does not solve problem, - add an rtl_read_byte of REG_DBI_CTRL+0 at the start of _rtl8821ae_check_pcie_dma_hang; does not solve problem, Only way to solve the problem at the moment is either; - reverting 40b368af4b75 ("rtlwifi: Fix alignment issues"), which means using rtl_read_word in _rtl8821ae_dbi_read, or - removing the two lines that enable ASPM, as you asked me to try. > Is there any documentation for that register? I see other code writes > to REG_DBI_CTRL+3, in _rtl8821ae_check_pcie_dma_hang I'll repeat and expand on this. Is there any documentation for this register, or the other REG_DBI_* registers? I see that DBI windowed access in rtl8192de is different and yet very similar. In rtl8821ae, rtl8723be, and rtl8192de the method seems straightforward; there are bits for address, bits for write enable by byte, and flag bits for starting the transfer and completing. > Evidence of read from REG_DBI_CTRL was captured with an instrumented > kernel; git diff http://dev.laptop.org/~quozl/y/1dsQ6B.txt yielding > these dmesg lines; > > [ 6.010255] rtl_pci: _rtl_pci_update_default_setting const_amdpci_aspm=03 > [ 6.010338] rtl_pci: rtl_pci_enable_aspm > [ 6.034295] ieee80211 phy0: Selected rate control algorithm 'rtl_rc' > [ 6.034806] rtlwifi: rtlwifi: wireless switch is on > [ 6.196958] rtl8821ae 0000:02:00.0 wlp2s0: renamed from wlan0 > [ 7.979186] rtl_pci: rtl_pci_disable_aspm > [ 7.979306] rtl8821ae: _rtl8821ae_check_pcie_dma_hang > [ 8.295360] rtl8821ae: _rtl8821ae_enable_aspm_back_door > [ 8.295437] rtl8821ae: _rtl8821ae_dbi_read 070f -> ffff (@034f) > [ 8.295449] rtl8821ae: _rtl8821ae_dbi_write 070f <- ff (@870c) > [ 8.295462] rtl8821ae: _rtl8821ae_dbi_read 0719 -> 0200 (@034d) > [ 8.295474] rtl8821ae: _rtl8821ae_dbi_write 0719 <- 18 (@2718) > [ 8.295477] rtl_pci: rtl_pci_enable_aspm > [ 8.469734] rtl_pci: rtl_pci_disable_aspm > [ 8.469857] rtl8821ae: _rtl8821ae_check_pcie_dma_hang > [ 8.686955] rtl8821ae: _rtl8821ae_enable_aspm_back_door > [ 8.687013] rtl8821ae: _rtl8821ae_dbi_read 070f -> ffff (@034f) > [ 8.687025] rtl8821ae: _rtl8821ae_dbi_write 070f <- ff (@870c) > [ 8.687038] rtl8821ae: _rtl8821ae_dbi_read 0719 -> 0218 (@034d) > [ 8.687050] rtl8821ae: _rtl8821ae_dbi_write 0719 <- 18 (@2718) > [ 8.687053] rtl_pci: rtl_pci_enable_aspm > > Observe how the windowed read of DBI register 0x70f causes a read of > 16-bits at 0x34f, which includes first 8-bits of 0x350 REG_DBI_CTRL. > > By the way, the cold boot value of DBI register 0x719 is 0x00, and > the warm boot value is 0x18, so I'm confident there isn't a > comprehensive register reset. It means that BIOS has relevance; and > this BIOS is outside my control. BIOS variation may explain > difficulty reproducing. Is there a register for device reset that I can try? It would help to exclude BIOS. > > > There has been a report (in Russian unfortunately) at > > https://www.linux.org.ru/forum/desktop/12620193 of delays in ARP > > handling. > > Thanks. I've considered and excluded ARP handling delay. Though ARP > renewal is typical reason for device sleep to end. > > With the call to enable ASPM disabled, instead of changing the DBI > read to 16-bit width, what happens is that the device stops accepting > data from the access point, packets are buffered there, and are > transmitted as soon as the device makes the next transmission. > > http://dev.laptop.org/~quozl/z/1dsQBf.txt has the ping and IP tcpdump > to confirm this. > > I've a monitor mode tcpdump I can send by private mail if required. > In that the burst of packets shows ICMP echo requests were buffered by > the access point. > > > According to Google translate is as follows: > > > > ============================================================ > > Periodically, Wi-Fi networker rtl8821ae ceases to respond to ARP, > > which causes the Internet to end. Wireshark looks quite interesting: > > ARP replays can be sent by one large packet a few seconds after > > receiving the requests, ie. they seem to be buffered somewhere. > > Yes, buffering at access point. > > > I need to explore that ENOBUFS return code. > > I've seen ENOBUFS up at the application level with ping too, when the > original problem happens with v4.10 plus stable. > > > Your case where the device is unresponsive to pings from another NIC > > until the device transmits may also be an ARP problem. > > > > For completeness, are you using the 2.4 of 5 GHz band? What is the > > make/model your AP? If possible for you to determine, what firmware > > is it running? > > 2.4 GHz and 5 GHz reproduces the problem. > > Open or WPA reproduces the problem. > > Netgear WNDR3800 OpenWrt 12.09-beta, r33312. > > Several other access points reproduce the problem, including a > customer's TP-Link TL-WR1042ND with unknown firmware version. > > No access point as yet does not reproduce the problem. > > Hope that helps, thanks for your ideas. > > -- > James Cameron > http://quozl.netrek.org/
On Tue, Sep 19, 2017 at 07:42:04PM +1000, James Cameron wrote: > On Thu, Sep 14, 2017 at 07:27:39PM +1000, James Cameron wrote: > > On Wed, Sep 13, 2017 at 07:39:35PM -0500, Larry Finger wrote: > > > On 09/13/2017 04:46 PM, James Cameron wrote: > > > > > > > >I'll give it some more testing and let you know, but it seems as > > > >capable of keeping a connection as 4.13 plus my earlier revert. > > > > > > > > Testing went well; removing the call to enable ASPM was as good as > > changing the DBI read back to 16-bit width. > > > > > The change I sent earlier should be as good as reverting the change > > > to write_byte in your reversion. > > > > Yes, that would be the hope. > > > > But with the 16-bit DBI read, the register REG_DBI_CTRL+0 is being > > read as well, in the first read in _rtl8821ae_enable_aspm_back_door, > > so perhaps reading that register has an unexpected side-effect. > > > > I've ruled that out after testing for several days different kernels > based on v4.13; > > - add an rtl_read_byte of REG_DBI_CTRL+0 in rtl8821ae_hw_init just > after the call to enable_aspm; does not solve problem, > > - add an rtl_read_byte of REG_DBI_CTRL+0 at the start of > _rtl8821ae_check_pcie_dma_hang; does not solve problem, When the problem occurs, register 0x350 bit 25 is set, for which a comment in _rtl8821ae_check_pcie_dma_hang says means there is an RX hang. So perhaps driver should call _rtl8821ae_check_pcie_dma_hang and _rtl8821ae_reset_pcie_interface_dma. Any ideas where to do this? > [...]
On 09/20/2017 04:36 AM, James Cameron wrote: > When the problem occurs, register 0x350 bit 25 is set, for which a > comment in _rtl8821ae_check_pcie_dma_hang says means there is an RX > hang. > > So perhaps driver should call _rtl8821ae_check_pcie_dma_hang > and _rtl8821ae_reset_pcie_interface_dma. > > Any ideas where to do this? Thanks for the extended debugging. I was able to repeat your findings. With the 8-bit read of REG_DBI_RDATA, I got poor connection stability. Reverting that part made it stable again. For that reason, I pushed the partial reversion of commit 40b368af4b75 ("rtlwifi: Fix alignment issues"). Where did you detect that bit 25 of register 0x350 was set? Larry
On Wed, Sep 20, 2017 at 04:48:23PM -0500, Larry Finger wrote: > On 09/20/2017 04:36 AM, James Cameron wrote: > >When the problem occurs, register 0x350 bit 25 is set, for which a > >comment in _rtl8821ae_check_pcie_dma_hang says means there is an RX > >hang. > > > >So perhaps driver should call _rtl8821ae_check_pcie_dma_hang > >and _rtl8821ae_reset_pcie_interface_dma. > > > >Any ideas where to do this? > > Thanks for the extended debugging. > > I was able to repeat your findings. With the 8-bit read of > REG_DBI_RDATA, I got poor connection stability. Reverting that part > made it stable again. For that reason, I pushed the partial > reversion of commit 40b368af4b75 ("rtlwifi: Fix alignment issues"). That's great you were able to reproduce, thanks! > Where did you detect that bit 25 of register 0x350 was set? In _rtl8821ae_check_pcie_dma_hang on link up. REG_DBI_FLAG (0x350 bits 16-31) is observed as; - 0x0000 on entry to function after warm boot, - 0x0400 on exit from function; debug bit 23 is set by the function, - 0x0400 on entry to function on link up when the problem has not happened, - 0x0600 on entry to function on link up when the problem has happened. But I don't know if 0x0600 is useful to detect earlier, or if it is only a symptom of link down while device active. Either way, if it truly does signal an RX hang or firmware RX queue full, it's useful. My "-q9" and "-qa" test kernels dump REG_DBI_CTRL and REG_DBI_FLAG. "-q9" is with 8-bit read of REG_DBI_RDATA. "-qa" is with 16-bit read of REG_DBI_DATA. My "-qa" test kernel; http://dev.laptop.org/~quozl/y/1dunwN.txt (git diff v4.13) http://dev.laptop.org/~quozl/z/1dubX7.txt (dmesg) REG_DBI_CTRL+3 used by _rtl8821ae_check_pcie_dma_hang is effectively REG_DBI_FLAG+1 (0x353). REG_DBI_CTRL is REG_DBI_ADDR; a duplicate register definition. I'm still pondering a few more theories; - change write_readback, it is true now, and the while()/udelay in _rtl8821ae_dbi_read seems a waste, it never executes, - clearing REG_DBI_CTRL write enable bits at the end of _rtl8821ae_dbi_write, - switching to 32-bit access as used by rtl8192de. And a giggle from reviewing the code, _rtl8821ae_wowlan_initialize_adapter says "Patch Pcie Rx DMA hang after S3/S4 several times. The root cause has not be found." ... I've learned that root causes that aren't found tend to cause further problems later. ;-) Given this, my gut feel is firmware or silicon problem; RX DMA ceases, the driver does not detect it, the connection is lost.
On Thu, Sep 21, 2017 at 09:22:28AM +1000, James Cameron wrote: > On Wed, Sep 20, 2017 at 04:48:23PM -0500, Larry Finger wrote: > > On 09/20/2017 04:36 AM, James Cameron wrote: > > >When the problem occurs, register 0x350 bit 25 is set, for which a > > >comment in _rtl8821ae_check_pcie_dma_hang says means there is an RX > > >hang. > > > > > >So perhaps driver should call _rtl8821ae_check_pcie_dma_hang > > >and _rtl8821ae_reset_pcie_interface_dma. > > > > > >Any ideas where to do this? > > > > Thanks for the extended debugging. > > > > I was able to repeat your findings. With the 8-bit read of > > REG_DBI_RDATA, I got poor connection stability. Reverting that part > > made it stable again. For that reason, I pushed the partial > > reversion of commit 40b368af4b75 ("rtlwifi: Fix alignment issues"). > > That's great you were able to reproduce, thanks! > [...] > I'm still pondering a few more theories; > > - change write_readback, it is true now, and the while()/udelay in > _rtl8821ae_dbi_read seems a waste, it never executes, My test kernel "-qb" was write_readback = false in sw.c, with 8-bit read of REG_DBI_RDATA, and has been stable for four hours. I'll focus on some more testing of this one. It is a surprise. http://dev.laptop.org/~quozl/z/1dutXk.txt (dmesg) Observe how REG_DBI_FLAG+0 is briefly seen as 1, which doesn't happen with write_readback = true. > - clearing REG_DBI_CTRL write enable bits at the end of > _rtl8821ae_dbi_write, My test kernel "-qc" had reset of REG_DBI_ADDR as last step in both _rtl8821ae_dbi_read and _rtl8821ae_dbi_write, and was very unstable, not able to connect. http://dev.laptop.org/~quozl/y/1dutbX.txt (git diff v4.13) http://dev.laptop.org/~quozl/z/1dutuM.txt (dmesg) My test kernel "-qd" had reset of REG_DBI_ADDR as last step in only _rtl8821ae_dbi_write, and had poor connection stability. http://dev.laptop.org/~quozl/y/1dutr3.txt (git diff v4.13) http://dev.laptop.org/~quozl/z/1duuDc.txt (dmesg connection lost) Based on the above two kernels, clearing REG_DBI_ADDR after a read is a bad idea, and suggests there is some underlying asynchronicity about the DBI access. Almost as if some other condition should signal completion rather than zero in REG_DBI_FLAG+0. > - switching to 32-bit access as used by rtl8192de. My test kernel "-qe" changed RED_DBI_RDATA read to 32-bit, then used a union hack to pull out the desired byte, and had poor connection stability. http://dev.laptop.org/~quozl/y/1duvIC.txt (git diff v4.13) http://dev.laptop.org/~quozl/z/1duwI1.txt (dmesg connection lost)
On 09/21/2017 03:07 AM, James Cameron wrote: > My test kernel "-qb" was write_readback = false in sw.c, with 8-bit > read of REG_DBI_RDATA, and has been stable for four hours. I'll focus > on some more testing of this one. It is a surprise. > > http://dev.laptop.org/~quozl/z/1dutXk.txt (dmesg) > > Observe how REG_DBI_FLAG+0 is briefly seen as 1, which doesn't happen > with write_readback = true. Again, thanks for your efforts. At this point, my system has been up over 17 hours without a single drop. As a result, I will leave the reversion of commit 40b368af4b75 in place. It seems safer than turning off write_readback. After we get more testing, that could still be an option. Larry
On Thu, Sep 21, 2017 at 09:40:14AM -0500, Larry Finger wrote: > On 09/21/2017 03:07 AM, James Cameron wrote: > >My test kernel "-qb" was write_readback = false in sw.c, with 8-bit > >read of REG_DBI_RDATA, and has been stable for four hours. I'll > >focus on some more testing of this one. It is a surprise. > > > >http://dev.laptop.org/~quozl/z/1dutXk.txt (dmesg) > > > >Observe how REG_DBI_FLAG+0 is briefly seen as 1, which doesn't > >happen with write_readback = true. > > Again, thanks for your efforts. > > At this point, my system has been up over 17 hours without a single > drop. As a result, I will leave the reversion of commit 40b368af4b75 > in place. It seems safer than turning off write_readback. After we > get more testing, that could still be an option. Thanks for the reversion commit, I'll point others to it. My apologies for sloppy work, the test kernel features got swapped! "-qb" above was with write_readback off, and 16-bit read of REG_DBI_RDATA, not 8-bit. Verified with objdump. It has run for 24 hours without a drop. So at conclusion; - the 16-bit read is good with or without write_readback. - the 8-bit read is bad with or without write_readback, and tends to lose connection much quicker without write_readback. Been a pleasure working with you. Back to lurk mode.
diff --git a/rtl8821ae/hw.c b/rtl8821ae/hw.c index 305b3abbf..755d3704b 100644 --- a/rtl8821ae/hw.c +++ b/rtl8821ae/hw.c @@ -1982,8 +1982,8 @@ int rtl8821ae_hw_init(struct ieee80211_hw *hw) ppsc->rfpwr_state = ERFON; rtlpriv->cfg->ops->set_hw_reg(hw, HW_VAR_ETHER_ADDR, mac->mac_addr); - _rtl8821ae_enable_aspm_back_door(hw); - rtlpriv->intf_ops->enable_aspm(hw); + //_rtl8821ae_enable_aspm_back_door(hw); + //rtlpriv->intf_ops->enable_aspm(hw); if (rtlhal->hw_type == HARDWARE_TYPE_RTL8812AE && (rtlhal->rfe_type == 1 || rtlhal->rfe_type == 5))