Message ID | 20160724190237.GP2356@ZenIV.linux.org.uk (mailing list archive) |
---|---|
State | Not Applicable |
Delegated to: | Kalle Valo |
Headers | show |
Al Viro wrote: > On Sun, Jul 24, 2016 at 07:45:13PM +0200, Christian Lamparter wrote: > > > > The symptom is that downloaded files (http, ftp, and probably other > > > protocols) have small corrupted segments (about 1-2 kilobytes long) in > > > random locations. Only downloads that sustain a high speed for at least a > > > few seconds are corrupted. Anything small enough to be received in less > > > than about 5 seconds is not affected. > > Can that sucker be reproduced with netcat? That would eliminate all issues > with multi-iovec recvmsg(2), narrowing the things down quite bit. netcat seems to be immune. Comparing strace results, I didn't see any recvmsg() calls in the other programs that have had the problem, but there is an interesting difference: netcat calls select() to wait for the socket to be ready for reading, where my other test programs just call read() and let it block until ready. So I wrote a small test program to isolate that difference. It downloads a file using only read() and write() and a hardcoded HTTP request. It has a select mode (main loop alternates read() and select() on the TCP socket) and a noselect mode (main loop just read()s the TCP socket). The program is included at the bottom of this message. I ran it several times in both modes and got corruption if and only if the noselect mode was used. > > Another thing (and if that works, it's *NOT* a proper fix - it would be > papering over the problem, but at least it would show where to look for > it) - try (on top of mainline) the following delta: > > diff --git a/net/core/datagram.c b/net/core/datagram.c Will try that patch soon. Meanwhile, here's my test: /* Demonstration program "dlbug". Usage: dlbug select > outfile or dlbug noselect > outfile outfile will contain the full HTTP response. Edit out the HTTP headers and what's left should be a valid gzip if the download worked. */ #include <stdio.h> #include <string.h> #include <unistd.h> #include <sys/types.h> #include <sys/socket.h> #include <netinet/in.h> #include <netdb.h> #include <sys/select.h> int main(int argc, char **argv) { const char *request = "GET /debian/dists/stable/main/Contents-amd64.gz HTTP/1.0\r\n" "Host: ftp.us.debian.org\r\n" "\r\n"; ssize_t request_len = strlen(request), w, r, copied; struct addrinfo hints, *host; int sock, err, doselect; char buf[10240]; if(argc!=2 || (!strcmp(argv[1], "select") && !strcmp(argv[1], "noselect"))) { fprintf(stderr, "Usage: %s {select|noselect}\n", argv[0]); return 1; } doselect = !strcmp(argv[1], "select"); memset(&hints, 0, sizeof hints); hints.ai_family = AF_INET; hints.ai_socktype = SOCK_STREAM; err = getaddrinfo("ftp.us.debian.org", 0, &hints, &host); if(err) { fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(err)); return 1; } sock = socket(host->ai_family, host->ai_socktype, host->ai_protocol); if(sock < 0) { perror("socket"); return 1; } ((struct sockaddr_in *)host->ai_addr)->sin_port = htons(80); if(connect(sock, host->ai_addr, host->ai_addrlen) < 0) { perror("connect"); return 1; } while(request_len) { w = write(sock, request, request_len); if(w < 0) { perror("write to socket"); return 1; } request += w; request_len -= w; } while((r = read(sock, buf, sizeof buf))) { if(r < 0) { perror("read from socket"); return 1; } copied = 0; while(copied < r) { w = write(1, buf+copied, r-copied); if(w < 0) { perror("write to stdout"); return 1; } copied += w; } if(doselect) { fd_set rfds; FD_ZERO(&rfds); FD_SET(sock, &rfds); select(sock+1, &rfds, 0, 0, 0); } } return 0; }
On Tuesday, July 26, 2016 4:57:03 AM CEST Alan Curry wrote: > Al Viro wrote: > > On Sun, Jul 24, 2016 at 07:45:13PM +0200, Christian Lamparter wrote: > > > > > > The symptom is that downloaded files (http, ftp, and probably other > > > > protocols) have small corrupted segments (about 1-2 kilobytes long) in > > > > random locations. Only downloads that sustain a high speed for at least a > > > > few seconds are corrupted. Anything small enough to be received in less > > > > than about 5 seconds is not affected. > > > > Can that sucker be reproduced with netcat? That would eliminate all issues > > with multi-iovec recvmsg(2), narrowing the things down quite bit. > > netcat seems to be immune. Comparing strace results, I didn't see any > recvmsg() calls in the other programs that have had the problem, but there > is an interesting difference: netcat calls select() to wait for the socket > to be ready for reading, where my other test programs just call read() and > let it block until ready. > > So I wrote a small test program to isolate that difference. It downloads > a file using only read() and write() and a hardcoded HTTP request. It has > a select mode (main loop alternates read() and select() on the TCP socket) > and a noselect mode (main loop just read()s the TCP socket). > > The program is included at the bottom of this message. > > I ran it several times in both modes and got corruption if and only if the > noselect mode was used. > > > > > Another thing (and if that works, it's *NOT* a proper fix - it would be > > papering over the problem, but at least it would show where to look for > > it) - try (on top of mainline) the following delta: > > > > diff --git a/net/core/datagram.c b/net/core/datagram.c > > Will try that patch soon. Meanwhile, here's my test: > > /* Demonstration program "dlbug". > Usage: dlbug select > outfile > or > dlbug noselect > outfile > outfile will contain the full HTTP response. Edit out the HTTP headers > and what's left should be a valid gzip if the download worked. */ > [...] Thanks, I gave the program a try with my WNDA3100 and a WN821N v2 devices. I did not see any corruptions in any of the tests though. Can you tell me something about your wireless network too? I would like to know what router and firmware are you using? Also important: what's your wireless configuration? (WPA?, CCMP or TKIP? HT40, HT20 or Legacy rates? ...) Probably the quickest and easiest way to get that information is by running the following commands as root, when you are connected to your wifi network and post the results: # iw dev wlan0 link # iw dev wlan0 scan dump (You can of course remove your device's MACs, but please do it consistently). Regards, Christian -- To unsubscribe from this list: send the line "unsubscribe linux-wireless" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2016-07-26 09:59, Christian Lamparter wrote: > Thanks, I gave the program a try with my WNDA3100 and a WN821N v2 > devices. > I did not see any corruptions in any of the tests though. Can you tell > me > something about your wireless network too? I would like to know what > router > and firmware are you using? Also important: what's your wireless > configuration? > (WPA?, CCMP or TKIP? HT40, HT20 or Legacy rates? ...) > > Probably the quickest and easiest way to get that information is by > running > the following commands as root, when you are connected to your wifi > network > and post the results: > # iw dev wlan0 link > # iw dev wlan0 scan dump > > (You can of course remove your device's MACs, but please do it > consistently). > > Regards, > Christian I wonder if this is something that is only affecting certain NIC's? For example, it see this issue on sun4u boxes with tigon3 and hme interfaces. But on my sun4v boxes that have e1000 interfaces everything works fine. -- To unsubscribe from this list: send the line "unsubscribe linux-wireless" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Christian Lamparter wrote: > Thanks, I gave the program a try with my WNDA3100 and a WN821N v2 devices. > I did not see any corruptions in any of the tests though. Can you tell me > something about your wireless network too? I would like to know what router > and firmware are you using? Also important: what's your wireless configuration? The router/access-point is a Comcast-issued Technicolor cable modem, model TC8305C. The only thing I can find on it that looks like it might identify a firmware version is this: System Software Version eMTA & DOCSIS Software Version: 01.E6.01.22.25 Packet Cable: 2.0 I assume Comcast pushes firmware updates whenever they feel like it. There is possibly another clue. I get this message from the kernel sometimes: ieee80211 phy0: invalid plcp cck rate (0). I had this message appearing long before the data corruption bug started. It never correlated with any actual problems, so I turned down the priority level of the message to get it off the console, and forgot about it. I was unable to discover what a "plcp" or "cck" is so the message means nothing to me. > (WPA?, CCMP or TKIP? HT40, HT20 or Legacy rates? ...) > > Probably the quickest and easiest way to get that information is by running > the following commands as root, when you are connected to your wifi network > and post the results: > # iw dev wlan0 link > # iw dev wlan0 scan dump Connected to cc:03:fa:bf:e9:ea (on wlan0) SSID: HOME-E9EA freq: 2462 RX: 20726719 bytes (106483 packets) TX: 5902478 bytes (44707 packets) signal: -43 dBm tx bitrate: 54.0 MBit/s bss flags: short-slot-time dtim period: 1 beacon int: 100 BSS cc:03:fa:bf:e9:ea(on wlan0) -- associated TSF: 236407205748 usec (2d, 17:40:07) freq: 2462 beacon interval: 100 TUs capability: ESS Privacy ShortSlotTime (0x0411) signal: -33.00 dBm last seen: 634452 ms ago Information elements from Probe Response frame: SSID: HOME-E9EA Supported rates: 1.0* 2.0* 5.5* 11.0* 18.0 24.0* 36.0 54.0 DS Parameter set: channel 11 ERP: <no flags> ERP D4.0: <no flags> RSN: * Version: 1 * Group cipher: TKIP * Pairwise ciphers: CCMP TKIP * Authentication suites: PSK * Capabilities: 16-PTKSA-RC 1-GTKSA-RC (0x000c) Extended supported rates: 6.0* 9.0 12.0* 48.0 HT capabilities: Capabilities: 0x18bd RX LDPC HT20 SM Power Save disabled RX Greenfield RX HT20 SGI TX STBC No RX STBC Max AMSDU length: 7935 bytes DSSS/CCK HT40 Maximum RX AMPDU length 65535 bytes (exponent: 0x003) Minimum RX AMPDU time spacing: 8 usec (0x06) HT RX MCS rate indexes supported: 0-23 HT TX MCS rate indexes are undefined HT operation: * primary channel: 11 * secondary channel offset: no secondary * STA channel width: 20 MHz * RIFS: 1 * HT protection: nonmember * non-GF present: 1 * OBSS non-GF present: 1 * dual beacon: 0 * dual CTS protection: 0 * STBC beacon: 0 * L-SIG TXOP Prot: 0 * PCO active: 0 * PCO phase: 0 WPS: * Version: 1.0 * Wi-Fi Protected Setup State: 2 (Configured) * Response Type: 3 (AP) * UUID: 6d1b1911-14a9-391c-cdee-89850a5aa1ef * Manufacturer: Technicolor * Model: Technicolor * Model Number: 123456 * Serial Number: 0000001 * Primary Device Type: 6-0050f204-1 * Device name: TechnicolorAP * Config methods: Display * RF Bands: 0x1 * Unknown TLV (0x1049, 6 bytes): 00 37 2a 00 01 20 WPA: * Version: 1 * Group cipher: TKIP * Pairwise ciphers: CCMP TKIP * Authentication suites: PSK * Capabilities: 16-PTKSA-RC 1-GTKSA-RC (0x000c) WMM: * Parameter version 1 * u-APSD * BE: CW 15-1023, AIFSN 3 * BK: CW 15-1023, AIFSN 7 * VI: CW 7-15, AIFSN 2, TXOP 3008 usec * VO: CW 3-7, AIFSN 2, TXOP 1504 usec
alexmcwhirter@triadic.us writes: > On 2016-07-26 09:59, Christian Lamparter wrote: >> Thanks, I gave the program a try with my WNDA3100 and a WN821N v2 >> devices. >> I did not see any corruptions in any of the tests though. Can you >> tell me >> something about your wireless network too? I would like to know what >> router >> and firmware are you using? Also important: what's your wireless >> configuration? >> (WPA?, CCMP or TKIP? HT40, HT20 or Legacy rates? ...) >> >> Probably the quickest and easiest way to get that information is by >> running >> the following commands as root, when you are connected to your wifi >> network >> and post the results: >> # iw dev wlan0 link >> # iw dev wlan0 scan dump >> >> (You can of course remove your device's MACs, but please do it >> consistently). >> >> Regards, >> Christian > > I wonder if this is something that is only affecting certain NIC's? > For example, it see this issue on sun4u boxes with tigon3 and hme > interfaces. But on my sun4v boxes that have e1000 interfaces > everything works fine. Kernel configuration might also make a difference. Trying to find an Kconfig option which affects this could be useful.
Al Viro wrote: > > Another thing (and if that works, it's *NOT* a proper fix - it would be > papering over the problem, but at least it would show where to look for > it) - try (on top of mainline) the following delta: I tried it on top of v4.6.4 and on top of the very recent v4.7-2509-g59ebc44 from Linus, and still got corruption. > > diff --git a/net/core/datagram.c b/net/core/datagram.c > index b7de71f..0ee5995 100644 > --- a/net/core/datagram.c > +++ b/net/core/datagram.c > @@ -734,7 +734,7 @@ int skb_copy_and_csum_datagram_msg(struct sk_buff *skb, > if (!chunk) > return 0; > > - if (msg_data_left(msg) < chunk) { > + if (iov_iter_single_seg_count(&msg->msg_iter) < chunk) { > if (__skb_checksum_complete(skb)) > goto csum_error; > if (skb_copy_datagram_msg(skb, hlen, msg, chunk)) >
Just to add some more information to this, the corruption seems to effect ssh as well. Using a sun hme interface, occasionally upon an ssh connection it will refuse to authenticate a client with either password or cert authentication. Using wireshark to capture and decrypt the packets between the two machines, the data coming from the server seems good, but the data received by the server from the client is essentially garbage. Note that the client is sending valid data, but the server is corrupting it upon receipt. Closing the connection and starting a new one will remedy the login issue, but you do occasionally see corruption on the server side sporadically. So far this only seems to occur on incoming data, outgoing data seems fine. -- To unsubscribe from this list: send the line "unsubscribe linux-wireless" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2016-07-27 14:04, alexmcwhirter@triadic.us wrote: > Just to add some more information to this, the corruption seems to > effect ssh as well. > > Using a sun hme interface, occasionally upon an ssh connection it will > refuse to authenticate a client with either password or cert > authentication. Using wireshark to capture and decrypt the packets > between the two machines, the data coming from the server seems good, > but the data received by the server from the client is essentially > garbage. Note that the client is sending valid data, but the server is > corrupting it upon receipt. Closing the connection and starting a new > one will remedy the login issue, but you do occasionally see > corruption on the server side sporadically. > > So far this only seems to occur on incoming data, outgoing data seems > fine. Also, there is another patch the references this commit on sparc64 at least. https://patchwork.kernel.org/patch/9221895/ I highly expect both my issue and OP's issue to revolve not around commit e5a4b0bb803b specifically, but around other code that no longer behaves as expected because of it. -- To unsubscribe from this list: send the line "unsubscribe linux-wireless" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: alexmcwhirter@triadic.us Date: Wed, 27 Jul 2016 19:02:40 -0400 > On 2016-07-27 14:04, alexmcwhirter@triadic.us wrote: >> Just to add some more information to this, the corruption seems to >> effect ssh as well. >> Using a sun hme interface, occasionally upon an ssh connection it will >> refuse to authenticate a client with either password or cert >> authentication. Using wireshark to capture and decrypt the packets >> between the two machines, the data coming from the server seems good, >> but the data received by the server from the client is essentially >> garbage. Note that the client is sending valid data, but the server is >> corrupting it upon receipt. Closing the connection and starting a new >> one will remedy the login issue, but you do occasionally see >> corruption on the server side sporadically. >> So far this only seems to occur on incoming data, outgoing data seems >> fine. > > Also, there is another patch the references this commit on sparc64 at > least. > > https://patchwork.kernel.org/patch/9221895/ > > I highly expect both my issue and OP's issue to revolve not around > commit e5a4b0bb803b specifically, but around other code that no longer > behaves as expected because of it. Indeed, and that fault address rounding bug occurs two other times in arch/sparc/lib/user_fixup.c The mentioned patchwork patch should fix the bug and I'll get that into my sparc tree, merged, and queued up for -stable ASAP. -- To unsubscribe from this list: send the line "unsubscribe linux-wireless" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2016-07-27 20:31, Al Viro wrote: > On Wed, Jul 27, 2016 at 04:45:43PM -0700, David Miller wrote: > >> > I highly expect both my issue and OP's issue to revolve not around >> > commit e5a4b0bb803b specifically, but around other code that no longer >> > behaves as expected because of it. >> >> Indeed, and that fault address rounding bug occurs two other times >> in arch/sparc/lib/user_fixup.c >> >> The mentioned patchwork patch should fix the bug and I'll get that >> into my sparc tree, merged, and queued up for -stable ASAP. > > Plausible for sparc, but I don't see similar __copy_to_user_inatomic() > bugs in case of x86_64... I'm going to go ahead and say this is where my issue and the op's issue begin to branch apart from one another. He's seeing this on all incoming data, whereas i am only seeing it on ssl data and not on sun4v. At this point i would say data from my issue is only going to cloud this issue as they seem to be two completely different issues revolving around the same commit. If i come across any relevant data for x86_64 ill be sure to post it if this isn't resolved by then, but for now i'm going to refrain from submitting anything sparc related. -- To unsubscribe from this list: send the line "unsubscribe linux-wireless" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jul 27, 2016 at 04:45:43PM -0700, David Miller wrote: > > I highly expect both my issue and OP's issue to revolve not around > > commit e5a4b0bb803b specifically, but around other code that no longer > > behaves as expected because of it. > > Indeed, and that fault address rounding bug occurs two other times > in arch/sparc/lib/user_fixup.c > > The mentioned patchwork patch should fix the bug and I'll get that > into my sparc tree, merged, and queued up for -stable ASAP. Plausible for sparc, but I don't see similar __copy_to_user_inatomic() bugs in case of x86_64... -- To unsubscribe from this list: send the line "unsubscribe linux-wireless" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jul 27, 2016 at 08:26:48PM -0400, alexmcwhirter@triadic.us wrote: > I'm going to go ahead and say this is where my issue and the op's issue > begin to branch apart from one another. He's seeing this on all incoming > data, whereas i am only seeing it on ssl data and not on sun4v. > > At this point i would say data from my issue is only going to cloud this > issue as they seem to be two completely different issues revolving around > the same commit. If i come across any relevant data for x86_64 ill be sure > to post it if this isn't resolved by then, but for now i'm going to refrain > from submitting anything sparc related. Which just might mean that we have *three* issues here - (1) buggered __copy_to_user_inatomic() (and friends) on some sparcs (2) your ssl-only corruption (3) Alan's x86_64 corruption on plain TCP read - no ssl *or* sparc anywhere, and no multi-segment recvmsg(). Which would strongly argue in favour of some kind of copy_page_to_iter() breakage triggered when handling a fragmented skb, as in (1). Except that I don't see anything similar in x86_64 uaccess primitives... -- To unsubscribe from this list: send the line "unsubscribe linux-wireless" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/core/datagram.c b/net/core/datagram.c index b7de71f..0ee5995 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -734,7 +734,7 @@ int skb_copy_and_csum_datagram_msg(struct sk_buff *skb, if (!chunk) return 0; - if (msg_data_left(msg) < chunk) { + if (iov_iter_single_seg_count(&msg->msg_iter) < chunk) { if (__skb_checksum_complete(skb)) goto csum_error; if (skb_copy_datagram_msg(skb, hlen, msg, chunk))